Parquet: simplify dictionary-page ordering / remove double buffering in the page-spilling writer

## Background

[#10020](https://github.com/apache/arrow-rs/pull/10020) adds a pluggable `PageStore` so the `ArrowWriter` can spill completed column-chunk pages out of the heap while a row group is buffered. During review @alamb noted that the writer now has *two* levels of buffering and that the dictionary page is special-cased in a way that's hard to follow. This issue tracks the deferred cleanup so #10020 can land without it.

## The root cause

A dictionary page is finalized **last** (it isn't known until `close()` — dictionary fallback can occur mid-column) but must be written **first** in the file. So production order is `data₁…dataₙ, dict` and file order is `dict, data₁…dataₙ`. Resolving that inversion is what creates the apparent double-buffering:

- `GenericColumnWriter::data_pages: VecDeque<CompressedPage>` (`column/writer/mod.rs`) — the pre-existing buffer that holds dictionary-column data pages until `close()` on the column-at-a-time `SerializedFileWriter` path (bytes commit live there, so there's no other way to get dict-first).
- On the `ArrowWriter` path #10020 instead streams data pages straight to the `PageStore`, holds the dict page in memory (`ArrowColumnChunkData.dictionary: Vec<Bytes>`), and rewrites offsets to dict-first at splice (`ColumnCloseResult::update_dictionary_location`).

The seam between these is the back-channel `PageWriter::defers_dictionary_ordering()` flag, which makes the column writer change its buffering based on a property of its sink. (As cheap insurance, that method and `buffered_memory_size()` were marked `#[doc(hidden)]` / unstable in #10020 so this cleanup won't be a breaking change — see f4e3518581.)

## Proposed cleanup (two separable steps)

**B — route the dictionary page through the store too.** Replace the `dictionary: Vec<Bytes>` carve-out with a tracked `dictionary_key: Option<PageKey>`; at splice, take that key first, then the data keys. This makes the store uniform (it no longer treats the dict page specially) and lets dictionary pages spill as well — relevant for misconfigured writes with a large dict page. Entirely private to `ArrowColumnChunkData`. Trade-off to weigh: today the dict page is kept resident deliberately to avoid round-tripping ~1 bounded page through the backend.

**C — move page-ordering ownership into the page-writer layer.** Have `GenericColumnWriter` always stream pages in production order and let the page writer own final layout: `SerializedPageWriter` buffers data pages until it sees the dict, `ArrowPageWriter` buffers-and-reorders. This removes the `defers_dictionary_ordering()` branch and the `data_pages` field from the shared column writer entirely. Note it relocates rather than eliminates the column-at-a-time buffering (you still can't seek back in a plain `Write` sink), but it gives one clean seam instead of a flag-driven split. Touches the column writer used by **both** write paths, so it wants its own benchmarking.

B and C compose; B+C is the full "defer all buffering to the store; the store needn't know page kinds" that was suggested in review.

## Constraints / non-goals

- **Not a breaking change to the externally-implemented surface.** `PageStore` (`put`/`take`/`PageKey`), `PageStoreFactory`, and `PageStoreArgs` stay as-is; B reuses the same `put`/`take`. The only public-API churn is to the already-`#[doc(hidden)]` `PageWriter` methods.
- Keep the **key-based** `PageStore` API (not a page iterator): the dict-first reordering needs random access, which a FIFO iterator can't express.
- No change to on-disk output beyond what #10020 already does (the `page_encoding_stats` ordering difference for dictionary columns, which is spec-valid).

## Acceptance criteria

- [ ] `defers_dictionary_ordering()` removed (or its role absorbed) and `GenericColumnWriter` has a single buffering path.
- [ ] Dictionary pages are no longer special-cased outside the store (B), or this is explicitly decided against with rationale.
- [ ] Existing round-trip / spill memory tests still pass; column-at-a-time write memory and throughput benchmarked for no regression.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parquet: simplify dictionary-page ordering / remove double buffering in the page-spilling writer #10062

Background

The root cause

Proposed cleanup (two separable steps)

Constraints / non-goals

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Parquet: simplify dictionary-page ordering / remove double buffering in the page-spilling writer #10062

Description

Background

The root cause

Proposed cleanup (two separable steps)

Constraints / non-goals

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions