Skip to content

[python] Support BlobView fields#7837

Draft
leaves12138 wants to merge 3 commits into
apache:masterfrom
leaves12138:codex/python-blob-view
Draft

[python] Support BlobView fields#7837
leaves12138 wants to merge 3 commits into
apache:masterfrom
leaves12138:codex/python-blob-view

Conversation

@leaves12138
Copy link
Copy Markdown
Contributor

Summary

Codex

This PR was prepared and submitted by Codex.

Validation

  • PYTHONPATH=paimon-python pytest -q paimon-python/pypaimon/tests/blob_test.py::BlobTest::test_blob_view_struct_roundtrip
  • PYTHONPATH=paimon-python pytest -q paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_view_fields_resolve_upstream_blob paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_view_fields_rejects_non_view_input paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_inline_fields_reject_overlap_and_unknown_fields paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_descriptor_fields_mixed_mode paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_to_arrow_batch_reader paimon-python/pypaimon/tests/blob_table_test.py::DataBlobWriterTest::test_blob_descriptor_fields_rejects_non_descriptor_input
  • PYTHONPATH=paimon-python python3 -m flake8 --config=paimon-python/dev/cfg.ini paimon-python/pypaimon/table/row/blob.py paimon-python/pypaimon/common/options/core_options.py paimon-python/pypaimon/schema/schema.py paimon-python/pypaimon/write/writer/data_blob_writer.py paimon-python/pypaimon/read/reader/data_file_batch_reader.py paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py paimon-python/pypaimon/utils/blob_view_lookup.py paimon-python/pypaimon/tests/blob_test.py paimon-python/pypaimon/tests/blob_table_test.py
  • python3 -m compileall -q paimon-python/pypaimon/table/row/blob.py paimon-python/pypaimon/common/options/core_options.py paimon-python/pypaimon/schema/schema.py paimon-python/pypaimon/write/writer/data_blob_writer.py paimon-python/pypaimon/read/reader/data_file_batch_reader.py paimon-python/pypaimon/read/reader/blob_descriptor_convert_reader.py paimon-python/pypaimon/utils/blob_view_lookup.py
  • git diff --check

@leaves12138 leaves12138 marked this pull request as ready for review May 13, 2026 06:38
@leaves12138
Copy link
Copy Markdown
Contributor Author

Thanks for the PR. The overall direction looks right to me: the Python BlobViewStruct wire format matches the Java side, and the write/read path for inline blob-view-field plus blob-as-descriptor=true is mostly aligned with the BlobView design.

I found one blocker around the data-evolution read path though.

DataFileBatchReader already resolves inline blob-view-field values to the actual blob payload when blob-as-descriptor=false. In the data-evolution path, DataEvolutionSplitRead then wraps the merged reader with BlobDescriptorConvertReader, and that reader scans the already-resolved bytes again with BlobViewStruct.is_blob_view_struct() / deserialize(). This means a perfectly valid blob payload whose first bytes happen to match the BlobView magic header can be parsed as a BlobViewStruct and fail, or even be incorrectly resolved as another view.

The problematic chain is:

  • DataFileBatchReader._convert_inline_blob_columns resolves view fields before returning file batches.
  • DataEvolutionSplitRead.create_reader wraps the result with BlobDescriptorConvertReader whenever blob descriptor/view fields exist.
  • BlobDescriptorConvertReader._convert_batch attempts to interpret the final bytes as BlobViewStruct again.

I reproduced this locally by writing a source blob payload starting with the BlobView version + magic bytes and then reading it through a downstream blob-view-field; the default read failed with ValueError: Invalid BlobViewStruct data: too short even though the source payload is just normal blob data.

I think we should ensure inline blob values are converted exactly once. One possible fix is to keep conversion in DataFileBatchReader and not wrap data-evolution readers with BlobDescriptorConvertReader for fields that have already been converted; alternatively, disable the file-reader-side conversion in the data-evolution path and do conversion only after merge/filter.

One more compatibility point: Java validation requires fields listed in blob-descriptor-field / blob-view-field to also be listed in blob-field. The Python schema validation currently only checks that those names are BLOB fields and do not overlap. If this relaxation is intentional because Python treats all large_binary columns as BLOBs, please document it; otherwise it would be better to align the option validation with Java to avoid creating schemas that Java/Flink would reject.

Local validation I ran:

  • python3.8 -m pytest -q paimon-python/pypaimon/tests/blob_table_test.py paimon-python/pypaimon/tests/blob_test.py -> 87 passed, 1 skipped
  • targeted BlobView tests -> passed
  • flake8 on changed Python files -> passed
  • compileall on changed Python files -> passed
  • git diff --check -> passed

@leaves12138 leaves12138 marked this pull request as draft May 14, 2026 05:46
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution -- BlobView support for Python is a useful addition. I reviewed the diff in detail and have the following observations:


1. Transitive BlobViewStruct resolution lacks cycle detection

In BlobViewLookup._to_descriptor():

if BlobViewStruct.is_blob_view_struct(value):
    return self.resolve_descriptor(BlobViewStruct.deserialize(value))

If table A has a view pointing to table B and table B has a view pointing back to table A (or any longer cycle), this recursion will loop until stack exhaustion. Consider adding a visited set or a maximum recursion depth to resolve_descriptor / _to_descriptor.


2. Weakened concurrency test assertion

In table_update_test.py, test_concurrent_updates_overlapping_rows_last_writer_wins was changed from asserting the winner's values match the last committed writer to simply asserting the values match any writer:

# Before: winner = specs[completion_order[-1]]['ages']; assertEqual(winner, ages[:3])
# After:  assertIn(ages[:3], [spec['ages'] for spec in specs])

This weakens the semantic guarantee the test was verifying (last-writer-wins). If this change is a flaky-test workaround, it should probably be in a separate PR with a note explaining why the stronger assertion is unreliable, or the underlying race should be fixed.


3. Redundant deserialization in _blob_view_cell_to_data

Each cell goes through: _deserialize_blob_view_or_none (which calls _normalize_blob_cell, is_blob_view_struct, and deserialize), then if non-None, resolve_data calls resolve_descriptor which calls preload([view_struct]). But _preload_blob_views was already called on the entire column before the per-cell loop. So preload([view_struct]) inside resolve_descriptor should be a no-op for cache-hit rows but still re-iterates the cache check per row. This is minor but could be avoided by splitting resolve_descriptor into a lookup-only path when preloading has already been done.


4. Schema blob-field detection is fragile

In schema.py:

blob_field_names = {
    field.name for field in fields if 'blob' in str(field.type).lower()
}

This relies on the string representation of the data type containing "blob". If the type system adds new types or changes __str__, this will silently break. A more robust approach would be checking against the specific blob type class (e.g., isinstance(field.type, BlobType) or a dedicated is_blob_type() method).


5. _normalize_blob_cell duplicated in two classes

DataFileBatchReader._normalize_blob_cell and BlobDescriptorConvertReader._normalize_blob_cell are identical static methods. Consider extracting this into a shared utility function (e.g., in blob.py or a blob_utils.py module) to avoid drift.


6. BlobViewLookup._filesystem_table_path assumes fixed warehouse layout

The path derivation:

warehouse = os.path.dirname(os.path.dirname(current_table_path))
return "{}/{}.db/{}".format(warehouse, identifier.get_database_name(), identifier.get_table_name())

This assumes a warehouse/<db>.db/<table> directory structure. If any deployment uses a custom catalog path or nested database layout, this will resolve to the wrong location. The fallback path (when catalog_loader is None) should at minimum log a warning or document this assumption explicitly.


7. Minor: row_ids variable shadowed

In _load_field_descriptors:

row_ids = list(row_ids)  # parameter shadowed
...
row_ids = result.column(SpecialFields.ROW_ID.name).to_pylist()  # now from result

The parameter row_ids (the requested IDs) is overwritten with the returned IDs. This makes it impossible to detect missing rows in the response. Consider renaming the second one to result_row_ids for clarity.


Overall the implementation is solid and well-tested. The double-conversion issue in the data-evolution path was already flagged in prior review and is the primary blocker. Items 1 (cycle detection) and 2 (weakened test) are worth addressing; the rest are suggestions for robustness and maintainability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants