[python] Support BlobView fields#7837
Conversation
|
Thanks for the PR. The overall direction looks right to me: the Python I found one blocker around the data-evolution read path though.
The problematic chain is:
I reproduced this locally by writing a source blob payload starting with the BlobView version + magic bytes and then reading it through a downstream I think we should ensure inline blob values are converted exactly once. One possible fix is to keep conversion in One more compatibility point: Java validation requires fields listed in Local validation I ran:
|
JingsongLi
left a comment
There was a problem hiding this comment.
Thanks for the contribution -- BlobView support for Python is a useful addition. I reviewed the diff in detail and have the following observations:
1. Transitive BlobViewStruct resolution lacks cycle detection
In BlobViewLookup._to_descriptor():
if BlobViewStruct.is_blob_view_struct(value):
return self.resolve_descriptor(BlobViewStruct.deserialize(value))If table A has a view pointing to table B and table B has a view pointing back to table A (or any longer cycle), this recursion will loop until stack exhaustion. Consider adding a visited set or a maximum recursion depth to resolve_descriptor / _to_descriptor.
2. Weakened concurrency test assertion
In table_update_test.py, test_concurrent_updates_overlapping_rows_last_writer_wins was changed from asserting the winner's values match the last committed writer to simply asserting the values match any writer:
# Before: winner = specs[completion_order[-1]]['ages']; assertEqual(winner, ages[:3])
# After: assertIn(ages[:3], [spec['ages'] for spec in specs])This weakens the semantic guarantee the test was verifying (last-writer-wins). If this change is a flaky-test workaround, it should probably be in a separate PR with a note explaining why the stronger assertion is unreliable, or the underlying race should be fixed.
3. Redundant deserialization in _blob_view_cell_to_data
Each cell goes through: _deserialize_blob_view_or_none (which calls _normalize_blob_cell, is_blob_view_struct, and deserialize), then if non-None, resolve_data calls resolve_descriptor which calls preload([view_struct]). But _preload_blob_views was already called on the entire column before the per-cell loop. So preload([view_struct]) inside resolve_descriptor should be a no-op for cache-hit rows but still re-iterates the cache check per row. This is minor but could be avoided by splitting resolve_descriptor into a lookup-only path when preloading has already been done.
4. Schema blob-field detection is fragile
In schema.py:
blob_field_names = {
field.name for field in fields if 'blob' in str(field.type).lower()
}This relies on the string representation of the data type containing "blob". If the type system adds new types or changes __str__, this will silently break. A more robust approach would be checking against the specific blob type class (e.g., isinstance(field.type, BlobType) or a dedicated is_blob_type() method).
5. _normalize_blob_cell duplicated in two classes
DataFileBatchReader._normalize_blob_cell and BlobDescriptorConvertReader._normalize_blob_cell are identical static methods. Consider extracting this into a shared utility function (e.g., in blob.py or a blob_utils.py module) to avoid drift.
6. BlobViewLookup._filesystem_table_path assumes fixed warehouse layout
The path derivation:
warehouse = os.path.dirname(os.path.dirname(current_table_path))
return "{}/{}.db/{}".format(warehouse, identifier.get_database_name(), identifier.get_table_name())This assumes a warehouse/<db>.db/<table> directory structure. If any deployment uses a custom catalog path or nested database layout, this will resolve to the wrong location. The fallback path (when catalog_loader is None) should at minimum log a warning or document this assumption explicitly.
7. Minor: row_ids variable shadowed
In _load_field_descriptors:
row_ids = list(row_ids) # parameter shadowed
...
row_ids = result.column(SpecialFields.ROW_ID.name).to_pylist() # now from resultThe parameter row_ids (the requested IDs) is overwritten with the returned IDs. This makes it impossible to detect missing rows in the response. Consider renaming the second one to result_row_ids for clarity.
Overall the implementation is solid and well-tested. The double-conversion issue in the data-evolution path was already flagged in prior review and is the primary blocker. Items 1 (cycle detection) and 2 (weakened test) are worth addressing; the rest are suggestions for robustness and maintainability.
Summary
Codex
This PR was prepared and submitted by Codex.
Validation