feat(table): add RewriteDataFiles for compaction execution#892

Merged: zeroshade merged 1 commit into apache:main from laskoviymishka:feat/rewrite-data-files on Apr 14, 2026
@laskoviymishka
Contributor

Add Transaction.RewriteDataFiles() that reads data with deletes applied, writes new consolidated files, and atomically replaces old files via ReplaceFiles. Position delete files matched to rewritten data files are removed in the same commit.

  • CompactionTaskGroup bridges compaction planner and executor (avoids circular import between table/ and table/compaction/)
  • collectSafePositionDeletes only removes pos deletes with explicit content type check; equality deletes preserved (may apply outside scope)
  • Context cancellation checked between groups
  • Empty groups skipped gracefully
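
As a rough illustration of the content type check described above, here is a minimal Go sketch. It is not the code from this PR: `Content`, `DeleteFile`, and `safePositionDeletes` are hypothetical, simplified stand-ins, and only the numeric content values (0 data, 1 position deletes, 2 equality deletes) follow the Iceberg spec.

```go
package main

import "fmt"

// Content mirrors the content kinds defined by the Iceberg spec:
// 0 = data, 1 = position deletes, 2 = equality deletes.
type Content int

const (
	ContentData Content = iota
	ContentPosDeletes
	ContentEqDeletes
)

// DeleteFile is a hypothetical, simplified stand-in for a delete file
// that the scan matched to a data file being rewritten.
type DeleteFile struct {
	Path    string
	Content Content
}

// safePositionDeletes keeps only entries that are explicitly position
// deletes. Everything else, equality deletes included, is left alone
// because it may still apply to data files outside the rewrite.
func safePositionDeletes(candidates []DeleteFile) []DeleteFile {
	var out []DeleteFile
	for _, f := range candidates {
		if f.Content == ContentPosDeletes {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	matched := []DeleteFile{
		{Path: "pos-delete-1.parquet", Content: ContentPosDeletes},
		{Path: "eq-delete-1.parquet", Content: ContentEqDeletes},
	}
	for _, f := range safePositionDeletes(matched) {
		fmt.Println("remove in same commit:", f.Path)
	}
}
```

The filter is written as an allow-list on purpose: an entry with an unexpected content value is kept rather than removed, which is the conservative choice when a wrong removal would resurrect deleted rows.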

E2e tests with real Parquet files:

  • Small file compaction (5→1 file, row count preserved)
  • Position delete cleanup (delete applied, delete file removed)
  • Empty plan / empty group handling
  • Partial progress mode
  • Context cancellation
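
The empty-group and cancellation cases listed above can be pictured against a loop shaped roughly like the following. This is a sketch under assumptions, not the PR's implementation: `Group`, `rewriteGroups`, and `runDemo` are hypothetical names, and the actual read, write, and ReplaceFiles steps are reduced to a callback.

```go
package main

import (
	"context"
	"fmt"
)

// Group is a hypothetical stand-in for one compaction task group.
type Group struct {
	Files []string
}

// rewriteGroups walks the groups in order and returns how many were
// rewritten. Cancellation is checked between groups, never mid-group,
// and empty groups are skipped without being treated as an error.
func rewriteGroups(ctx context.Context, groups []Group, rewrite func(Group) error) (int, error) {
	done := 0
	for _, g := range groups {
		if err := ctx.Err(); err != nil {
			return done, err
		}
		if len(g.Files) == 0 {
			continue
		}
		if err := rewrite(g); err != nil {
			return done, err
		}
		done++
	}
	return done, nil
}

// runDemo cancels the context while the first group is being rewritten,
// so the loop stops before it reaches any later group.
func runDemo() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	groups := []Group{
		{Files: []string{"a.parquet", "b.parquet"}},
		{}, // empty group
		{Files: []string{"c.parquet"}},
	}
	return rewriteGroups(ctx, groups, func(Group) error {
		cancel() // simulate a caller cancelling mid-run
		return nil
	})
}

func main() {
	n, err := runDemo()
	fmt.Println("groups rewritten:", n, "err:", err)
}
```

Checking the context only at group boundaries means a group is either fully rewritten or not started, which keeps the set of files to replace consistent.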

Part of #832 (table compaction).

@laskoviymishka laskoviymishka marked this pull request as ready for review April 14, 2026 08:12
Member

@zeroshade zeroshade left a comment

Should we have a test that confirms that equality delete files get left? Otherwise looks good to me!

@laskoviymishka
Contributor Author

good call, will add a test: write data files, commit equality deletes via RowDelta, compact, then verify equality delete files are still present in the post-compaction snapshot and still apply on scan.

@laskoviymishka laskoviymishka force-pushed the feat/rewrite-data-files branch from bc6c772 to 02e8989 Compare April 14, 2026 20:26
@zeroshade zeroshade merged commit 285632c into apache:main Apr 14, 2026
13 checks passed