Skip to content

Parquet: simplify dictionary-page ordering / remove double buffering in the page-spilling writer #10062

@adriangb

Description

@adriangb

Background

#10020 adds a pluggable PageStore so the ArrowWriter can spill completed column-chunk pages out of the heap while a row group is buffered. During review @alamb noted that the writer now has two levels of buffering and that the dictionary page is special-cased in a way that's hard to follow. This issue tracks the deferred cleanup so #10020 can land without it.

The root cause

A dictionary page is finalized last (it isn't known until close() — dictionary fallback can occur mid-column) but must be written first in the file. So production order is data₁…dataₙ, dict and file order is dict, data₁…dataₙ. Resolving that inversion is what creates the apparent double-buffering:

  • GenericColumnWriter::data_pages: VecDeque<CompressedPage> (column/writer/mod.rs) — the pre-existing buffer that holds dictionary-column data pages until close() on the column-at-a-time SerializedFileWriter path (bytes commit live there, so there's no other way to get dict-first).
  • On the ArrowWriter path Pluggable page spilling API for the Parquet ArrowWriter (PageStore) #10020 instead streams data pages straight to the PageStore, holds the dict page in memory (ArrowColumnChunkData.dictionary: Vec<Bytes>), and rewrites offsets to dict-first at splice (ColumnCloseResult::update_dictionary_location).

The seam between these is the back-channel PageWriter::defers_dictionary_ordering() flag, which makes the column writer change its buffering based on a property of its sink. (As cheap insurance, that method and buffered_memory_size() were marked #[doc(hidden)] / unstable in #10020 so this cleanup won't be a breaking change — see f4e3518.)

Proposed cleanup (two separable steps)

B — route the dictionary page through the store too. Replace the dictionary: Vec<Bytes> carve-out with a tracked dictionary_key: Option<PageKey>; at splice, take that key first, then the data keys. This makes the store uniform (it no longer treats the dict page specially) and lets dictionary pages spill as well — relevant for misconfigured writes with a large dict page. Entirely private to ArrowColumnChunkData. Trade-off to weigh: today the dict page is kept resident deliberately to avoid round-tripping ~1 bounded page through the backend.

C — move page-ordering ownership into the page-writer layer. Have GenericColumnWriter always stream pages in production order and let the page writer own final layout: SerializedPageWriter buffers data pages until it sees the dict, ArrowPageWriter buffers-and-reorders. This removes the defers_dictionary_ordering() branch and the data_pages field from the shared column writer entirely. Note it relocates rather than eliminates the column-at-a-time buffering (you still can't seek back in a plain Write sink), but it gives one clean seam instead of a flag-driven split. Touches the column writer used by both write paths, so it wants its own benchmarking.

B and C compose; B+C is the full "defer all buffering to the store; the store needn't know page kinds" that was suggested in review.

Constraints / non-goals

  • Not a breaking change to the externally-implemented surface. PageStore (put/take/PageKey), PageStoreFactory, and PageStoreArgs stay as-is; B reuses the same put/take. The only public-API churn is to the already-#[doc(hidden)] PageWriter methods.
  • Keep the key-based PageStore API (not a page iterator): the dict-first reordering needs random access, which a FIFO iterator can't express.
  • No change to on-disk output beyond what Pluggable page spilling API for the Parquet ArrowWriter (PageStore) #10020 already does (the page_encoding_stats ordering difference for dictionary columns, which is spec-valid).

Acceptance criteria

  • defers_dictionary_ordering() removed (or its role absorbed) and GenericColumnWriter has a single buffering path.
  • Dictionary pages are no longer special-cased outside the store (B), or this is explicitly decided against with rationale.
  • Existing round-trip / spill memory tests still pass; column-at-a-time write memory and throughput benchmarked for no regression.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions