
perf(table): Add optional skip duplicate check on AddDataFiles #901

Merged
zeroshade merged 1 commit into apache:main from abhirathod95:add-data-files-ignore-duplicate-search
Apr 15, 2026

Conversation

@abhirathod95
Contributor

AddDataFiles unconditionally scans every manifest in the current snapshot to check whether any file being added already exists in the table. Each manifest requires a storage read (e.g. an S3 GET). For tables with many commits, and therefore many manifests, this means hundreds of sequential reads, and the time spent on the duplicate check can easily dominate the whole operation.

The AddFiles method already has an ignoreDuplicates parameter that skips this scan, but AddDataFiles has no equivalent.

WithoutDuplicateCheck() fills this gap. Callers who can guarantee the files being added are new (e.g. freshly written by a compaction job or an ingestion pipeline with unique naming) can opt out of the scan and avoid the I/O cost entirely.

@abhirathod95 abhirathod95 requested a review from zeroshade as a code owner April 14, 2026 22:47
@zeroshade
Member

Looks great, thanks!

@zeroshade zeroshade merged commit 758b24a into apache:main Apr 15, 2026
13 checks passed

2 participants