You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
#10020 adds a pluggable PageStore so the ArrowWriter can spill completed column-chunk pages out of the heap while a row group is buffered. During review @alamb noted that the writer now has two levels of buffering and that the dictionary page is special-cased in a way that's hard to follow. This issue tracks the deferred cleanup so #10020 can land without it.
The root cause
A dictionary page is finalized last (it isn't known until close() — dictionary fallback can occur mid-column) but must be written first in the file. So production order is data₁…dataₙ, dict and file order is dict, data₁…dataₙ. Resolving that inversion is what creates the apparent double-buffering:
GenericColumnWriter::data_pages: VecDeque<CompressedPage> (column/writer/mod.rs) — the pre-existing buffer that holds dictionary-column data pages until close() on the column-at-a-time SerializedFileWriter path (bytes commit live there, so there's no other way to get dict-first).
On the ArrowWriter path Pluggable page spilling API for the Parquet ArrowWriter (PageStore) #10020 instead streams data pages straight to the PageStore, holds the dict page in memory (ArrowColumnChunkData.dictionary: Vec<Bytes>), and rewrites offsets to dict-first at splice (ColumnCloseResult::update_dictionary_location).
The seam between these is the back-channel PageWriter::defers_dictionary_ordering() flag, which makes the column writer change its buffering based on a property of its sink. (As cheap insurance, that method and buffered_memory_size() were marked #[doc(hidden)] / unstable in #10020 so this cleanup won't be a breaking change — see f4e3518.)
Proposed cleanup (two separable steps)
B — route the dictionary page through the store too. Replace the dictionary: Vec<Bytes> carve-out with a tracked dictionary_key: Option<PageKey>; at splice, take that key first, then the data keys. This makes the store uniform (it no longer treats the dict page specially) and lets dictionary pages spill as well — relevant for misconfigured writes with a large dict page. Entirely private to ArrowColumnChunkData. Trade-off to weigh: today the dict page is kept resident deliberately to avoid round-tripping ~1 bounded page through the backend.
C — move page-ordering ownership into the page-writer layer. Have GenericColumnWriter always stream pages in production order and let the page writer own final layout: SerializedPageWriter buffers data pages until it sees the dict, ArrowPageWriter buffers-and-reorders. This removes the defers_dictionary_ordering() branch and the data_pages field from the shared column writer entirely. Note it relocates rather than eliminates the column-at-a-time buffering (you still can't seek back in a plain Write sink), but it gives one clean seam instead of a flag-driven split. Touches the column writer used by both write paths, so it wants its own benchmarking.
B and C compose; B+C is the full "defer all buffering to the store; the store needn't know page kinds" that was suggested in review.
Constraints / non-goals
Not a breaking change to the externally-implemented surface.PageStore (put/take/PageKey), PageStoreFactory, and PageStoreArgs stay as-is; B reuses the same put/take. The only public-API churn is to the already-#[doc(hidden)]PageWriter methods.
Keep the key-basedPageStore API (not a page iterator): the dict-first reordering needs random access, which a FIFO iterator can't express.
Background
#10020 adds a pluggable
PageStoreso theArrowWritercan spill completed column-chunk pages out of the heap while a row group is buffered. During review @alamb noted that the writer now has two levels of buffering and that the dictionary page is special-cased in a way that's hard to follow. This issue tracks the deferred cleanup so #10020 can land without it.The root cause
A dictionary page is finalized last (it isn't known until
close()— dictionary fallback can occur mid-column) but must be written first in the file. So production order isdata₁…dataₙ, dictand file order isdict, data₁…dataₙ. Resolving that inversion is what creates the apparent double-buffering:GenericColumnWriter::data_pages: VecDeque<CompressedPage>(column/writer/mod.rs) — the pre-existing buffer that holds dictionary-column data pages untilclose()on the column-at-a-timeSerializedFileWriterpath (bytes commit live there, so there's no other way to get dict-first).ArrowWriterpath Pluggable page spilling API for the Parquet ArrowWriter (PageStore) #10020 instead streams data pages straight to thePageStore, holds the dict page in memory (ArrowColumnChunkData.dictionary: Vec<Bytes>), and rewrites offsets to dict-first at splice (ColumnCloseResult::update_dictionary_location).The seam between these is the back-channel
PageWriter::defers_dictionary_ordering()flag, which makes the column writer change its buffering based on a property of its sink. (As cheap insurance, that method andbuffered_memory_size()were marked#[doc(hidden)]/ unstable in #10020 so this cleanup won't be a breaking change — see f4e3518.)Proposed cleanup (two separable steps)
B — route the dictionary page through the store too. Replace the
dictionary: Vec<Bytes>carve-out with a trackeddictionary_key: Option<PageKey>; at splice, take that key first, then the data keys. This makes the store uniform (it no longer treats the dict page specially) and lets dictionary pages spill as well — relevant for misconfigured writes with a large dict page. Entirely private toArrowColumnChunkData. Trade-off to weigh: today the dict page is kept resident deliberately to avoid round-tripping ~1 bounded page through the backend.C — move page-ordering ownership into the page-writer layer. Have
GenericColumnWriteralways stream pages in production order and let the page writer own final layout:SerializedPageWriterbuffers data pages until it sees the dict,ArrowPageWriterbuffers-and-reorders. This removes thedefers_dictionary_ordering()branch and thedata_pagesfield from the shared column writer entirely. Note it relocates rather than eliminates the column-at-a-time buffering (you still can't seek back in a plainWritesink), but it gives one clean seam instead of a flag-driven split. Touches the column writer used by both write paths, so it wants its own benchmarking.B and C compose; B+C is the full "defer all buffering to the store; the store needn't know page kinds" that was suggested in review.
Constraints / non-goals
PageStore(put/take/PageKey),PageStoreFactory, andPageStoreArgsstay as-is; B reuses the sameput/take. The only public-API churn is to the already-#[doc(hidden)]PageWritermethods.PageStoreAPI (not a page iterator): the dict-first reordering needs random access, which a FIFO iterator can't express.page_encoding_statsordering difference for dictionary columns, which is spec-valid).Acceptance criteria
defers_dictionary_ordering()removed (or its role absorbed) andGenericColumnWriterhas a single buffering path.