
[python] Add table repair verification logic for pypaimon (Part 1/3)#7943

Open
JunRuiLee wants to merge 2 commits into apache:master from JunRuiLee:pypaimon-repair-verify

Conversation

@JunRuiLee
Contributor

Summary

  • Implement read-only metadata consistency verification for Paimon tables
  • Verifies the chain: LATEST → snapshot → manifest list → manifest files → data files
  • Reports broken links, corrupted files, and dangling references
  • Supports branch-qualified tables and custom partition.default-name
  • Progress logging every 1000 data files; time complexity: O(total_data_files)
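For readers unfamiliar with the chain, the walk can be sketched over an in-memory stand-in for the file system. This is only an illustration: the paths, the JSON layout, the issue tuples, and the `verify_chain` function are all hypothetical and do not reflect the actual pypaimon code or the on-disk Paimon format.

```python
import json


def verify_chain(files):
    """Walk LATEST -> snapshot -> manifest list -> manifest files -> data files.

    `files` maps a path to its content (hypothetical layout).
    Returns a list of (category, message) tuples; empty means healthy.
    """
    issues = []
    latest = files.get("snapshot/LATEST")
    if latest is None:
        return [("latest", "LATEST hint file is missing")]

    snapshot_path = "snapshot/snapshot-" + latest.strip()
    if snapshot_path not in files:
        return [("latest", "dangling LATEST: " + snapshot_path)]

    try:
        snapshot = json.loads(files[snapshot_path])
        list_path = "manifest/" + snapshot["manifestList"]
    except (ValueError, KeyError, TypeError):
        return [("snapshot", "corrupted snapshot: " + snapshot_path)]

    if list_path not in files:
        return [("manifest_list", "missing manifest list: " + list_path)]

    # Each manifest list names manifest files; each manifest names data files.
    for manifest_name in json.loads(files[list_path]):
        manifest_path = "manifest/" + manifest_name
        if manifest_path not in files:
            issues.append(("manifest", "missing manifest file: " + manifest_path))
            continue
        for data_file in json.loads(files[manifest_path]):
            if data_file not in files:
                issues.append(("data_file", "missing data file: " + data_file))
    return issues


healthy = {
    "snapshot/LATEST": "1",
    "snapshot/snapshot-1": json.dumps({"manifestList": "manifest-list-1"}),
    "manifest/manifest-list-1": json.dumps(["manifest-a"]),
    "manifest/manifest-a": json.dumps(["bucket-0/data-0.parquet"]),
    "bucket-0/data-0.parquet": "",
}
```

Breaks near the top of the chain return immediately because nothing further down can be resolved, while missing manifests and data files are collected so a single run reports all of them.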

Context

Split from #7940 following @JingsongLi's review comment.

  • Part 1 (this PR): Read-only verification logic (TableRepair.verify())
  • Part 2: Fix mode + catalog integration (depends on Part 1)
  • Part 3: CLI command (depends on Part 2)

Please merge in order: Part 1 → Part 2 → Part 3.

Tests added

  • TestRepairReport: 5 cases covering RepairReport/RepairIssue data classes
  • TestTableRepairVerify: 11 cases covering verify() — healthy table, missing manifest list, dangling LATEST, corrupted snapshot, data file checks (missing/existing/DELETE entries/partitioned/custom default-name), branch support
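As a rough picture of what the data-class tests exercise, the report types could look something like the sketch below. The field names, the severity values, and the `healthy` property are assumptions made for illustration; only the class names and the `data_files_checked` counter appear in this PR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepairIssue:
    severity: str  # assumed values: "error" or "warning"
    category: str  # e.g. "partition", "data_file"
    message: str


@dataclass
class RepairReport:
    issues: List[RepairIssue] = field(default_factory=list)
    data_files_checked: int = 0

    def add(self, severity, category, message):
        """Record one issue found during verification."""
        self.issues.append(RepairIssue(severity, category, message))

    @property
    def healthy(self):
        # Assumption: warnings alone do not make a table unhealthy.
        return not any(issue.severity == "error" for issue in self.issues)
```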

Contributor

@JingsongLi JingsongLi left a comment


Thanks for the well-structured PR. The metadata chain verification logic looks solid overall. A few comments:

  1. Performance concern with check_data_files: When verifying all snapshots, each snapshot's manifest entries are checked independently. For tables with many snapshots sharing the same base manifest files, the same data files will be checked repeatedly. Consider deduplicating the set of (manifest_file, data_file) pairs across snapshots, or at minimum adding a checked_files set to skip already-verified paths.

  2. Exception handling in _verify_manifest_file: The blanket except Exception that swallows partition deserialization failures (line ~396 in repair.py) means if GenericRowDeserializer.from_bytes fails, we silently fall through to construct a wrong data file path (without partition dirs). This could produce false-positive "data file missing" errors. Consider logging a warning or reporting a specific issue category when partition deserialization fails.

  3. _list_snapshot_ids hasattr check:

    name = status.base_name if hasattr(status, 'base_name') else ""

    If FileIO.list_status has a well-defined return type (e.g., FileStatus), this hasattr check shouldn't be needed. If there are multiple implementations with different return types, it would be clearer to define an interface.

  4. Test coverage: Good integration tests. Consider adding a test for a table with multiple snapshots sharing manifest files to verify no double-counting in the report stats.

Overall the design is clean — looking forward to seeing Parts 2/3.
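To make point 2 concrete, one possible shape for the handling is sketched below. It is hedged throughout: `decode_partition` is a hypothetical stand-in for `GenericRowDeserializer.from_bytes`, the path layout is invented, and returning `None` so the caller skips the existence check is just one option for avoiding the false positive.

```python
def decode_partition(raw):
    """Hypothetical decoder: b"dt=2024-01-01" -> "dt=2024-01-01"."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
    if "=" not in text:
        raise ValueError("not a key=value partition")
    return text


def data_file_path(raw_partition, file_name, issues):
    """Build the data file path, or report a warning and return None.

    Returning None lets the caller skip the existence check instead of
    probing a path that is known to be built without partition dirs.
    """
    try:
        partition_dir = decode_partition(raw_partition)
    except (UnicodeDecodeError, ValueError) as exc:
        issues.append(
            ("warning", "partition",
             f"cannot decode partition for {file_name}: {exc}")
        )
        return None
    return f"{partition_dir}/bucket-0/{file_name}"
```

Catching only the decoder's expected exception types, rather than a blanket `except Exception`, keeps unrelated bugs visible.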

JunRuiLee added 2 commits May 24, 2026 20:59
Implement read-only metadata consistency verification for Paimon tables.
This verifies the chain: LATEST -> snapshot -> manifest list -> manifest
files -> data files, and reports any broken links or corrupted files.

Key components:
- RepairIssue/RepairReport: data classes for structured issue reporting
- TableRepair.verify(): walks the metadata chain and detects issues
- Support for branch-qualified tables and partitioned data file paths
- Respects custom partition.default-name configuration
- Progress logging every 1000 data files when check_data_files=True
- Documented time complexity: O(total_data_files)
- Deduplicate data file checks across snapshots sharing manifest files
- Report warning when partition deserialization fails instead of silently ignoring
- Remove unnecessary hasattr check in _list_snapshot_ids
- Add test for multiple snapshots sharing manifest files
@JunRuiLee JunRuiLee force-pushed the pypaimon-repair-verify branch from 86f3647 to 86f5fc9 May 24, 2026 13:05
Contributor Author

@JunRuiLee JunRuiLee left a comment


@JingsongLi Thanks for the detailed review! All 4 points have been addressed:

  1. Performance: Added a checked_files set that deduplicates data file existence checks across snapshots. Files already verified are skipped.

  2. Exception handling: Partition deserialization failures now report a warning issue (category: "partition") instead of being silently swallowed, so users can see when data file paths may be incorrect.

  3. hasattr check: Removed — both LocalFileStatus and PyArrow FileInfo define base_name.

  4. Test coverage: Added test_check_data_files_shared_manifests_no_double_count — two snapshots referencing the same manifest file, asserting data_files_checked == 1.
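The deduplication in point 1 and the assertion in point 4 can be illustrated with a small standalone sketch. The function signature and the dict-based inputs are hypothetical simplifications, not the code in repair.py; only the `checked_files` name comes from the discussion above.

```python
def check_data_files(snapshots, manifests, existing_files):
    """Check each referenced data file at most once across all snapshots.

    snapshots:      snapshot id -> list of manifest file names
    manifests:      manifest file name -> list of data file paths
    existing_files: set of paths that exist on storage
    Returns (number of distinct files checked, list of missing paths).
    """
    checked_files = set()
    missing = []
    for snapshot_id in sorted(snapshots):
        for manifest_name in snapshots[snapshot_id]:
            for path in manifests[manifest_name]:
                if path in checked_files:
                    continue  # already verified via an earlier snapshot
                checked_files.add(path)
                if path not in existing_files:
                    missing.append(path)
    return len(checked_files), missing


# Two snapshots that share one manifest file, as in the new test case.
snapshots = {1: ["manifest-a"], 2: ["manifest-a"]}
manifests = {"manifest-a": ["bucket-0/data-0.parquet"]}
checked, missing = check_data_files(
    snapshots, manifests, {"bucket-0/data-0.parquet"}
)
```

With the set in place, the number of existence checks is bounded by the number of distinct data files rather than by the number of snapshot-to-file references.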
