[python] Add table repair verification logic for pypaimon (Part 1/3)#7943
[python] Add table repair verification logic for pypaimon (Part 1/3)#7943JunRuiLee wants to merge 2 commits into
Conversation
JingsongLi
left a comment
There was a problem hiding this comment.
Thanks for the well-structured PR. The metadata chain verification logic looks solid overall. A few comments:
-
Performance concern with
check_data_files: When verifying all snapshots, each snapshot's manifest entries are checked independently. For tables with many snapshots sharing the same base manifest files, the same data files will be checked repeatedly. Consider deduplicating the set of (manifest_file, data_file) pairs across snapshots, or at minimum adding achecked_filesset to skip already-verified paths. -
Exception handling in
_verify_manifest_file: The blanketexcept Exceptionthat swallows partition deserialization failures (line ~396 in repair.py) means ifGenericRowDeserializer.from_bytesfails, we silently fall through to construct a wrong data file path (without partition dirs). This could produce false-positive "data file missing" errors. Consider logging a warning or reporting a specific issue category when partition deserialization fails. -
_list_snapshot_ids—hasattrcheck:name = status.base_name if hasattr(status, 'base_name') else ""
If
FileIO.list_statushas a well-defined return type (e.g.,FileStatus), thishasattrcheck shouldn't be needed. If there are multiple implementations with different return types, it would be clearer to define an interface. -
Test coverage: Good integration tests. Consider adding a test for a table with multiple snapshots sharing manifest files to verify no double-counting in the report stats.
Overall the design is clean — looking forward to seeing Parts 2/3.
Implement read-only metadata consistency verification for Paimon tables. This verifies the chain: LATEST -> snapshot -> manifest list -> manifest files -> data files, and reports any broken links or corrupted files. Key components: - RepairIssue/RepairReport: data classes for structured issue reporting - TableRepair.verify(): walks the metadata chain and detects issues - Support for branch-qualified tables and partitioned data file paths - Respects custom partition.default-name configuration - Progress logging every 1000 data files when check_data_files=True - Documented time complexity: O(total_data_files)
- Deduplicate data file checks across snapshots sharing manifest files - Report warning when partition deserialization fails instead of silently ignoring - Remove unnecessary hasattr check in _list_snapshot_ids - Add test for multiple snapshots sharing manifest files
86f3647 to
86f5fc9
Compare
JunRuiLee
left a comment
There was a problem hiding this comment.
@JingsongLi Thanks for the detailed review! All 4 points have been addressed:
-
Performance: Added a
checked_filesset that deduplicates data file existence checks across snapshots. Files already verified are skipped. -
Exception handling: Partition deserialization failures now report a warning issue (category:
"partition") instead of being silently swallowed, so users can see when data file paths may be incorrect. -
hasattrcheck: Removed — bothLocalFileStatusand PyArrowFileInfodefinebase_name. -
Test coverage: Added
test_check_data_files_shared_manifests_no_double_count— two snapshots referencing the same manifest file, assertingdata_files_checked == 1.
Summary
partition.default-nameContext
Split from #7940 following @JingsongLi's review comment.
TableRepair.verify())Please merge in order: Part 1 → Part 2 → Part 3.
Tests added
TestRepairReport: 5 cases covering RepairReport/RepairIssue data classesTestTableRepairVerify: 11 cases covering verify() — healthy table, missing manifest list, dangling LATEST, corrupted snapshot, data file checks (missing/existing/DELETE entries/partitioned/custom default-name), branch support