[python] Fix DataBlobWriter KeyError for partial writes with blob columns by SteNicholas · Pull Request #7850 · apache/paimon

SteNicholas · 2026-05-14T05:09:22Z

Purpose

Problem: For tables with BLOB columns, pypaimon uses DataBlobWriter, which splits each pyarrow.RecordBatch into “normal” columns (written to Parquet/ORC/…) and blob-file columns (written via BlobWriter). _split_data used the full table lists of normal and blob-file column names when calling RecordBatch.select(...).

Regression: When TableWrite.with_write_type(...) narrows the write to a partial column list, validation ensures incoming batches only contain those columns. _split_data still tried to select columns not present in the batch (e.g. a normal column omitted from the partial write), which caused PyArrow to raise KeyError.

Fix:

Pass write_cols from FileStoreWrite into DataBlobWriter (same as for AppendOnlyDataWriter), so the blob writer sees the narrowed column set from with_write_type.
In DataBlobWriter.__init__, derive normal_column_names and blob_file_column_names from that subset when write_cols is set: only blob-file columns that appear in write_cols, and normal columns = write_cols minus blob-file columns (order preserved from write_cols). Only instantiate BlobWriter for blob-file columns in that narrowed set.
Keep full-table behavior when write_cols is None (full schema write).
This keeps _split_data consistent with the actual batch schema and matches the intent of partial / data-evolution writes.
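
The narrowing rule described above can be sketched in plain Python. This is an illustration only, not the pypaimon implementation: the helper name `narrow_columns` and its signature are hypothetical, and the ordering of the blob-file columns (taken here from the table's blob list) is an assumption, since the description only states the ordering for normal columns.

```python
from typing import List, Optional, Sequence, Tuple


def narrow_columns(
    all_columns: Sequence[str],
    blob_file_columns: Sequence[str],
    write_cols: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (normal_column_names, blob_file_column_names)."""
    blob_set = set(blob_file_columns)

    if write_cols is None:
        # Full schema write: keep the full-table column lists.
        normal = [c for c in all_columns if c not in blob_set]
        return normal, list(blob_file_columns)

    write_set = set(write_cols)
    # Normal columns = write_cols minus blob-file columns, in write_cols order.
    normal = [c for c in write_cols if c not in blob_set]
    # Only blob-file columns that appear in write_cols
    # (order taken from the table's blob list; an assumption).
    blobs = [c for c in blob_file_columns if c in write_set]
    return normal, blobs


# Partial write of one normal column and one blob column.
normal, blobs = narrow_columns(
    ["id", "name", "blob_data"], ["blob_data"], write_cols=["id", "blob_data"]
)
print(normal, blobs)
```

With lists narrowed this way, a split step would only ever select names that are present in the incoming batch, so the omitted `name` column is never requested.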

Close #7849.

Tests

Added in paimon-python/pypaimon/tests/blob_table_test.py (DataBlobWriterTest):

Partial normal + one blob — with_write_type(['id', 'blob_data']) with a batch that only has those columns; asserts one Parquet file (write_cols == ['id']), blob file(s) (write_cols == ['blob_data']), row counts, commit + read-back (unwritten name is null).
Partial normal only — with_write_type(['id', 'name']) without blob columns in the batch; asserts no .blob files, Parquet write_cols == ['id', 'name'], read-back with blob column null.
Two blob columns, write one — schema with blob1 and blob2, with_write_type(['id', 'blob1']); asserts exactly one blob file and write_cols == ['blob1'].

…umns

DataBlobWriter._split_data selected full-table normal and blob-file column names, while TableWrite.with_write_type only supplies narrowed batches, so pa.RecordBatch.select raised KeyError on missing names.

Pass write_cols from FileStoreWrite into DataBlobWriter; narrow normal and blob-file column lists and BlobWriter initialization accordingly.

Add blob_table_test coverage for partial writes (normal+blob, normal-only, single blob of two).

JingsongLi

+1

JingsongLi approved these changes May 14, 2026

JingsongLi merged commit db86e31 into apache:master May 14, 2026
6 checks passed

JingsongLi merged 1 commit into apache:master from SteNicholas:PAIMON-7849
