Skip to content

[python] Add table repair fix mode and catalog integration (Part 2/3)#7944

Draft
JunRuiLee wants to merge 2 commits into
apache:masterfrom
JunRuiLee:pypaimon-repair-fix
Draft

[python] Add table repair fix mode and catalog integration (Part 2/3)#7944
JunRuiLee wants to merge 2 commits into
apache:masterfrom
JunRuiLee:pypaimon-repair-fix

Conversation

@JunRuiLee
Copy link
Copy Markdown
Contributor

Summary

  • Add TableRepair.repair(dry_run=False) to fix metadata inconsistencies
  • Currently fixes: LATEST pointing to missing snapshot → rewritten to newest valid snapshot
  • Add repair_table/repair_database/repair_catalog catalog API with return type annotations
  • FileSystemCatalog implementation delegating to repair module
  • Per-table error isolation in repair_database (continues on individual table failures)
  • Fix operation is idempotent: single atomic write, safe to re-run after interruption

Context

Split from #7940 following @JingsongLi's review comment.

Depends on Part 1: #7943

  • Part 1: Read-only verification logic ✅
  • Part 2 (this PR): Fix mode + catalog integration
  • Part 3: CLI command (depends on Part 2)

Please merge in order: Part 1 → Part 2 → Part 3.

Tests added

  • test_repair_fix_dangling_latest: fix mode rewrites LATEST to newest valid snapshot
  • test_dry_run_does_not_modify: dry_run=True does not change any files
  • test_repair_database_level: repairs all tables in a database
  • test_repair_catalog_level: repairs all tables across all databases
  • test_fix_latest_respects_check_data_files: fix skips snapshots with missing data files
  • test_repair_is_idempotent: running repair twice converges — second run is a no-op

JunRuiLee added 2 commits May 24, 2026 11:16
Implement read-only metadata consistency verification for Paimon tables.
This verifies the chain: LATEST -> snapshot -> manifest list -> manifest
files -> data files, and reports any broken links or corrupted files.

Key components:
- RepairIssue/RepairReport: data classes for structured issue reporting
- TableRepair.verify(): walks the metadata chain and detects issues
- Support for branch-qualified tables and partitioned data file paths
- Respects custom partition.default-name configuration
- Progress logging every 1000 data files when check_data_files=True
- Documented time complexity: O(total_data_files)
Add the ability to fix metadata inconsistencies found during verification.
Currently supports fixing the LATEST hint file to point to the newest
valid snapshot when it references a missing one.

Key additions:
- TableRepair.repair(dry_run=False): applies fixes after verification
- repair_table/repair_database/repair_catalog module-level entry points
- Catalog.repair_table/repair_database/repair_catalog API with type annotations
- FileSystemCatalog implementation delegating to repair module
- Fix mode selects newest snapshot with intact manifest chain
- check_data_files is respected when choosing which snapshot to fix to
- Per-table error isolation in repair_database (continues on failure)
- Idempotent fix operations (safe to re-run after interruption)
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good follow-up. The fix mode and catalog integration look reasonable. Some feedback:

  1. _fix_latest_file atomicity: The fallback path (delete + overwrite when try_to_write_atomic returns False) has a small window where LATEST doesn't exist. In the Java Paimon implementation, SnapshotManager uses an overwrite-in-place strategy for hint files since they're best-effort. Consider just doing overwrite_file_utf8 directly without the delete step — the atomic write is a nice-to-have for LATEST but not strictly required.

  2. repair_database / repair_catalog filtering: The directory listing uses name.endswith(".db") to find databases. This could accidentally pick up non-database directories that happen to end with .db. The Java catalog uses a metadata-based approach (listing from catalog metadata). Since this is filesystem-only, it's acceptable, but worth a comment noting the limitation.

  3. except DatabaseNotExistException: raise in filesystem_catalog.py — this try/except that just re-raises is a no-op. You can simplify to just call self.get_database(database_name) without the try block.

  4. Test duplication across Part 1 and Part 2: The test file in this PR (940 lines) includes all tests from Part 1 plus the new fix-mode tests. Since Part 2 depends on Part 1, the test file should only contain the new TestTableRepairFixMode class, with Part 1's tests already merged. Otherwise there will be conflicts when merging.

  5. Missing check_data_files in CLI: The CLI cmd_table_repair doesn't expose --check-data-files flag. Should it? Users who want to do a deep check would need to use the Python API directly.

@JunRuiLee JunRuiLee marked this pull request as draft May 24, 2026 12:57
Copy link
Copy Markdown
Contributor Author

@JunRuiLee JunRuiLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Marking this as Draft since it depends on Part 1 (#7943). Once #7943 is merged, I'll rebase onto master so the diff only contains the Part 2 incremental changes (fix mode + catalog integration). Will address your feedback points at that time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants