[python] Support BlobView fields by leaves12138 · Pull Request #7837 · apache/paimon

leaves12138 · 2026-05-13T06:20:18Z

Summary

Sync Python BLOB support with the BlobView work reviewed in [core] Support BlobView for shared blob data #7602, [codex] Support catalog-aware blob view function #7786, and [core] Allow blob inline fields without blob-field #7827.
Add Python BlobViewStruct / BlobView wire-format support and the blob-view-field option.
Store descriptor/view BLOB fields inline, validate bad inline field configuration and payloads, and avoid writing new .blob files for view fields.
Resolve blob-view fields during reads through catalog-aware lookup, returning bytes by default or upstream BlobDescriptor bytes when blob-as-descriptor=true.

Codex

This PR was prepared and submitted by Codex.

Validation

PYTHONPATH=paimon-python pytest -q paimon-python/pypaimon/tests/blob_test.py::BlobTest::test_blob_view_struct_roundtrip
PYTHONPATH=paimon-python pytest -q paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_view_fields_resolve_upstream_blob paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_view_fields_rejects_non_view_input paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_inline_fields_reject_overlap_and_unknown_fields paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_descriptor_fields_mixed_mode paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_to_arrow_batch_reader paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_descriptor_fields_rejects_non_descriptor_input
PYTHONPATH=paimon-python python3 -m flake8 --config=paimon-python/dev/cfg.ini paimon-python/pypaimon/table/row/blob.py paimon-python/pypaimon/common/options/core_options.py paimon-python/pypaimon/schema/schema.py paimon-python/pypaimon/write/writer/data_blob_writer.py paimon-python/pypaimon/read/reader/data_file_batch_reader.py paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py paimon-python/pypaimon/utils/blob_view_lookup.py paimon-python/pypaimon/tests/blob_test.py paimon-python/pypaimon/tests/blob_table_test.py
python3 -m compileall -q paimon-python/pypaimon/table/row/blob.py paimon-python/pypaimon/common/options/core_options.py paimon-python/pypaimon/schema/schema.py paimon-python/pypaimon/write/writer/data_blob_writer.py paimon-python/pypaimon/read/reader/data_file_batch_reader.py paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py paimon-python/pypaimon/utils/blob_view_lookup.py
git diff --check

leaves12138 · 2026-05-14T04:18:02Z

Thanks for the PR. The overall direction looks right to me: the Python BlobViewStruct wire format matches the Java side, and the write/read path for inline blob-view-field plus blob-as-descriptor=true is mostly aligned with the BlobView design.

I found one blocker around the data-evolution read path though.

DataFileBatchReader already resolves inline blob-view-field values to the actual blob payload when blob-as-descriptor=false. In the data-evolution path, DataEvolutionSplitRead then wraps the merged reader with BlobDescriptorConvertReader, and that reader scans the already-resolved bytes again with BlobViewStruct.is_blob_view_struct() / deserialize(). This means a perfectly valid blob payload whose first bytes happen to match the BlobView magic header can be parsed as a BlobViewStruct and fail, or even be incorrectly resolved as another view.

The problematic chain is:

DataFileBatchReader._convert_inline_blob_columns resolves view fields before returning file batches.
DataEvolutionSplitRead.create_reader wraps the result with BlobDescriptorConvertReader whenever blob descriptor/view fields exist.
BlobDescriptorConvertReader._convert_batch attempts to interpret the final bytes as BlobViewStruct again.

I reproduced this locally by writing a source blob payload starting with the BlobView version + magic bytes and then reading it through a downstream blob-view-field; the default read failed with ValueError: Invalid BlobViewStruct data: too short even though the source payload is just normal blob data.

I think we should ensure inline blob values are converted exactly once. One possible fix is to keep conversion in DataFileBatchReader and not wrap data-evolution readers with BlobDescriptorConvertReader for fields that have already been converted; alternatively, disable the file-reader-side conversion in the data-evolution path and do conversion only after merge/filter.

One more compatibility point: Java validation requires fields listed in blob-descriptor-field / blob-view-field to also be listed in blob-field. The Python schema validation currently only checks that those names are BLOB fields and do not overlap. If this relaxation is intentional because Python treats all large_binary columns as BLOBs, please document it; otherwise it would be better to align the option validation with Java to avoid creating schemas that Java/Flink would reject.

Local validation I ran:

python3.8 -m pytest -q paimon-python/pypaimon/tests/blob_table_test.py paimon-python/pypaimon/tests/blob_test.py -> 87 passed, 1 skipped
targeted BlobView tests -> passed
flake8 on changed Python files -> passed
compileall on changed Python files -> passed
git diff --check -> passed

JingsongLi

Thanks for the contribution -- BlobView support for Python is a useful addition. I reviewed the diff in detail and have the following observations:

1. Transitive BlobViewStruct resolution lacks cycle detection

In BlobViewLookup._to_descriptor():

if BlobViewStruct.is_blob_view_struct(value):
    return self.resolve_descriptor(BlobViewStruct.deserialize(value))

If table A has a view pointing to table B and table B has a view pointing back to table A (or any longer cycle), this recursion will loop until stack exhaustion. Consider adding a visited set or a maximum recursion depth to resolve_descriptor / _to_descriptor.

2. Weakened concurrency test assertion

In table_update_test.py, test_concurrent_updates_overlapping_rows_last_writer_wins was changed from asserting the winner's values match the last committed writer to simply asserting the values match any writer:

# Before: winner = specs[completion_order[-1]]['ages']; assertEqual(winner, ages[:3])
# After:  assertIn(ages[:3], [spec['ages'] for spec in specs])

This weakens the semantic guarantee the test was verifying (last-writer-wins). If this change is a flaky-test workaround, it should probably be in a separate PR with a note explaining why the stronger assertion is unreliable, or the underlying race should be fixed.

3. Redundant deserialization in `_blob_view_cell_to_data`

Each cell goes through: _deserialize_blob_view_or_none (which calls _normalize_blob_cell, is_blob_view_struct, and deserialize), then if non-None, resolve_data calls resolve_descriptor which calls preload([view_struct]). But _preload_blob_views was already called on the entire column before the per-cell loop. So preload([view_struct]) inside resolve_descriptor should be a no-op for cache-hit rows but still re-iterates the cache check per row. This is minor but could be avoided by splitting resolve_descriptor into a lookup-only path when preloading has already been done.

4. Schema blob-field detection is fragile

In schema.py:

blob_field_names = {
    field.name for field in fields if 'blob' in str(field.type).lower()
}

This relies on the string representation of the data type containing "blob". If the type system adds new types or changes __str__, this will silently break. A more robust approach would be checking against the specific blob type class (e.g., isinstance(field.type, BlobType) or a dedicated is_blob_type() method).

5. `_normalize_blob_cell` duplicated in two classes

DataFileBatchReader._normalize_blob_cell and BlobDescriptorConvertReader._normalize_blob_cell are identical static methods. Consider extracting this into a shared utility function (e.g., in blob.py or a blob_utils.py module) to avoid drift.

6. `BlobViewLookup._filesystem_table_path` assumes fixed warehouse layout

The path derivation:

warehouse = os.path.dirname(os.path.dirname(current_table_path))
return "{}/{}.db/{}".format(warehouse, identifier.get_database_name(), identifier.get_table_name())

This assumes a warehouse/<db>.db/<table> directory structure. If any deployment uses a custom catalog path or nested database layout, this will resolve to the wrong location. The fallback path (when catalog_loader is None) should at minimum log a warning or document this assumption explicitly.

7. Minor: `row_ids` variable shadowed

In _load_field_descriptors:

row_ids = list(row_ids)  # parameter shadowed
...
row_ids = result.column(SpecialFields.ROW_ID.name).to_pylist()  # now from result

The parameter row_ids (the requested IDs) is overwritten with the returned IDs. This makes it impossible to detect missing rows in the response. Consider renaming the second one to result_row_ids for clarity.

Overall the implementation is solid and well-tested. The double-conversion issue in the data-evolution path was already flagged in prior review and is the primary blocker. Items 1 (cycle detection) and 2 (weakened test) are worth addressing; the rest are suggestions for robustness and maintainability.

leaves12138 added 2 commits May 13, 2026 14:19

[python] Support blob view fields

5c5ecf5

[python] Refine blob view lookup

ee2ecc8

leaves12138 marked this pull request as ready for review May 13, 2026 06:38

[python] Stabilize concurrent update test

1ddc37f

leaves12138 marked this pull request as draft May 14, 2026 05:46

JingsongLi reviewed May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python] Support BlobView fields#7837

[python] Support BlobView fields#7837
leaves12138 wants to merge 3 commits into
apache:masterfrom
leaves12138:codex/python-blob-view

leaves12138 commented May 13, 2026

Uh oh!

leaves12138 commented May 14, 2026

Uh oh!

JingsongLi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

leaves12138 commented May 13, 2026

Summary

Codex

Validation

Uh oh!

leaves12138 commented May 14, 2026

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

1. Transitive BlobViewStruct resolution lacks cycle detection

2. Weakened concurrency test assertion

3. Redundant deserialization in _blob_view_cell_to_data

4. Schema blob-field detection is fragile

5. _normalize_blob_cell duplicated in two classes

6. BlobViewLookup._filesystem_table_path assumes fixed warehouse layout

7. Minor: row_ids variable shadowed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

3. Redundant deserialization in `_blob_view_cell_to_data`

5. `_normalize_blob_cell` duplicated in two classes

6. `BlobViewLookup._filesystem_table_path` assumes fixed warehouse layout

7. Minor: `row_ids` variable shadowed