
[python] Fix DataBlobWriter KeyError for partial writes with blob columns #7850

Merged
JingsongLi merged 1 commit into apache:master from SteNicholas:PAIMON-7849
May 14, 2026

Member

@SteNicholas SteNicholas commented May 14, 2026

Purpose

Problem: For tables with BLOB columns, pypaimon uses DataBlobWriter, which splits each pyarrow.RecordBatch into “normal” columns (written to Parquet/ORC/…) and blob-file columns (written via BlobWriter). _split_data used the full table lists of normal and blob-file column names when calling RecordBatch.select(...).

Regression: When TableWrite.with_write_type(...) narrows the write to a partial column list, validation ensures incoming batches only contain those columns. _split_data still tried to select columns not present in the batch (e.g. a normal column omitted from the partial write), which caused PyArrow to raise KeyError.

Fix:

  • Pass write_cols from FileStoreWrite into DataBlobWriter (same as for AppendOnlyDataWriter), so the blob writer sees the narrowed column set from with_write_type.
  • In DataBlobWriter.__init__, derive normal_column_names and blob_file_column_names from that subset when write_cols is set: only blob-file columns that appear in write_cols, and normal columns = write_cols minus blob-file columns (order preserved from write_cols). Only instantiate BlobWriter for blob-file columns in that narrowed set.
  • Keep full-table behavior when write_cols is None (full schema write).
  • This keeps _split_data consistent with the actual batch schema and matches the intent of partial / data-evolution writes.
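The narrowing rule in the bullets above can be sketched in plain Python. This is an illustration under assumptions, not the code merged in this PR: the function name narrow_columns and its signature are hypothetical, and the ordering of the returned blob-file columns (following the full-table blob list) is an assumption, since the description only fixes the order of the normal columns.

```python
from typing import List, Optional, Tuple


def narrow_columns(
    all_columns: List[str],
    blob_file_columns: List[str],
    write_cols: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (normal_column_names, blob_file_column_names) for a write.

    Hypothetical helper: mirrors the rule described in the PR, not its code.
    """
    blob_set = set(blob_file_columns)

    if write_cols is None:
        # Full schema write: keep the full-table behavior.
        normal = [c for c in all_columns if c not in blob_set]
        return normal, list(blob_file_columns)

    write_set = set(write_cols)
    # Normal columns = write_cols minus blob-file columns, order from write_cols.
    normal = [c for c in write_cols if c not in blob_set]
    # Only blob-file columns that appear in write_cols (ordering is an assumption).
    blobs = [c for c in blob_file_columns if c in write_set]
    return normal, blobs


schema = ["id", "name", "blob1", "blob2"]
blob_cols = ["blob1", "blob2"]

partial = narrow_columns(schema, blob_cols, ["id", "blob1"])
normal_only = narrow_columns(schema, blob_cols, ["id", "name"])
full = narrow_columns(schema, blob_cols, None)
```

With lists derived this way, every name passed to RecordBatch.select is present in the incoming batch, and a blob writer is only needed for the names in the second list.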

Close #7849.

Tests

Added in paimon-python/pypaimon/tests/blob_table_test.py (DataBlobWriterTest):

  1. Partial normal + one blob — with_write_type(['id', 'blob_data']) with a batch that only has those columns; asserts one Parquet file (write_cols == ['id']), blob file(s) (write_cols == ['blob_data']), row counts, commit + read-back (unwritten name is null).
  2. Partial normal only — with_write_type(['id', 'name']) without blob columns in the batch; asserts no .blob files, Parquet write_cols == ['id', 'name'], read-back with blob column null.
  3. Two blob columns, write one — schema with blob1 and blob2, with_write_type(['id', 'blob1']); asserts exactly one blob file and write_cols == ['blob1'].

DataBlobWriter._split_data selected full-table normal and blob-file column names, while TableWrite.with_write_type only supplies narrowed batches, so pa.RecordBatch.select raised KeyError on missing names.

Pass write_cols from FileStoreWrite into DataBlobWriter; narrow normal and blob-file column lists and BlobWriter initialization accordingly. Add blob_table_test coverage for partial writes (normal+blob, normal-only, single blob of two).
Contributor

@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit db86e31 into apache:master May 14, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] DataBlobWriter._split_data raise KeyError when a partial write included blob columns

2 participants