[python] Fix DataBlobWriter KeyError for partial writes with blob columns #7850
Merged
…umns DataBlobWriter._split_data selected full-table normal and blob-file column names, while TableWrite.with_write_type only supplies narrowed batches, so pa.RecordBatch.select raised KeyError on missing names. Pass write_cols from FileStoreWrite into DataBlobWriter; narrow normal and blob-file column lists and BlobWriter initialization accordingly. Add blob_table_test coverage for partial writes (normal+blob, normal-only, single blob of two).
Purpose
Problem: For tables with BLOB columns, pypaimon uses `DataBlobWriter`, which splits each `pyarrow.RecordBatch` into “normal” columns (written to Parquet/ORC/…) and blob-file columns (written via `BlobWriter`). `_split_data` used the full-table lists of normal and blob-file column names when calling `RecordBatch.select(...)`.

Regression: When `TableWrite.with_write_type(...)` narrows the write to a partial column list, validation ensures incoming batches contain only those columns. `_split_data` still tried to select columns not present in the batch (e.g. a normal column omitted from the partial write), which caused PyArrow to raise `KeyError`.

Fix:
- Pass `write_cols` from `FileStoreWrite` into `DataBlobWriter` (same as for `AppendOnlyDataWriter`), so the blob writer sees the narrowed column set from `with_write_type`.
- In `DataBlobWriter.__init__`, derive `normal_column_names` and `blob_file_column_names` from that subset when `write_cols` is set: only blob-file columns that appear in `write_cols`, and normal columns = `write_cols` minus blob-file columns (order preserved from `write_cols`). Only instantiate `BlobWriter` for blob-file columns in that narrowed set.
- Behavior is unchanged when `write_cols` is None (full schema write).
- This keeps `_split_data` consistent with the actual batch schema and matches the intent of partial / data-evolution writes.

Close #7849.
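The narrowing rule described above can be sketched in plain Python. This is a minimal illustration, not the pypaimon code: `narrow_columns` and `select` are hypothetical helpers, a dict of lists stands in for `pyarrow.RecordBatch`, and the `id` / `name` / `blob_data` schema is borrowed from the tests for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def narrow_columns(
    all_columns: Sequence[str],
    blob_file_columns: Sequence[str],
    write_cols: Optional[Sequence[str]],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (normal_column_names, blob_file_column_names)."""
    blob_set = set(blob_file_columns)
    if write_cols is None:
        # Full schema write: table-level lists are used unchanged.
        normal = [c for c in all_columns if c not in blob_set]
        return normal, list(blob_file_columns)
    # Partial write: both lists are derived from write_cols, keeping its order.
    normal = [c for c in write_cols if c not in blob_set]
    blob = [c for c in write_cols if c in blob_set]
    return normal, blob


def select(batch: Dict[str, list], names: Sequence[str]) -> Dict[str, list]:
    """Dict stand-in for RecordBatch.select: a missing name raises KeyError."""
    return {name: batch[name] for name in names}


schema = ["id", "name", "blob_data"]  # assumed table schema
blob_cols = ["blob_data"]             # assumed blob-file columns
# Batch already narrowed to the partial write ['id', 'blob_data'].
batch = {"id": [1, 2], "blob_data": [b"a", b"b"]}

# Before the fix: full-table lists are applied to the narrowed batch.
full_normal, full_blob = narrow_columns(schema, blob_cols, None)
try:
    select(batch, full_normal)  # asks for 'name', which the batch lacks
    failed = False
except KeyError:
    failed = True

# After the fix: lists are narrowed by write_cols, so both selects succeed.
normal, blob = narrow_columns(schema, blob_cols, ["id", "blob_data"])
normal_part = select(batch, normal)
blob_part = select(batch, blob)
```

In this sketch the split never asks the batch for a column outside `write_cols`, which is the property the fix is meant to restore; whether the real writer also skips the normal-file writer when no normal columns remain is not modeled here.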
Tests
Added in `paimon-python/pypaimon/tests/blob_table_test.py` (`DataBlobWriterTest`):
- Normal + blob: `with_write_type(['id', 'blob_data'])` with a batch that only has those columns; asserts one Parquet file (`write_cols == ['id']`), blob file(s) (`write_cols == ['blob_data']`), row counts, and commit + read-back (the unwritten `name` is null).
- Normal only: `with_write_type(['id', 'name'])` without blob columns in the batch; asserts no `.blob` files, Parquet `write_cols == ['id', 'name']`, and read-back with the blob column null.
- Single blob of two: `with_write_type(['id', 'blob1'])`; asserts exactly one blob file and `write_cols == ['blob1']`.