Skip to content

Conversation

@amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Oct 18, 2025

Which issue does this PR close?

What changes are included in this PR?

This changes scan planning of equality deletes so that:

  1. we only match against eq deletes which are strictly older than the data file. Right now it looks like we incorrectly over-apply based on the seq number for global equality deletes.

  2. Partition scoped deletes (both equality and position) are compared correctly by also factoring in the spec ID. It's not quite enough to compare just off the tuple, we should also compare based off the spec ID as well.

Are these changes tested?

Added unit tests which are scoped to testing delete index matching logic.

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
-->

@amogh-jahagirdar amogh-jahagirdar changed the title Fix global eq delete matching so that only deletes strictly older than seq number are applied fix: global eq delete matching should only match against data files which are strictly older Oct 18, 2025
self.global_deletes
.iter()
// filter that returns true if the provided delete file's sequence number is **greater than or equal to** `seq_num`
// filter that returns true if the provided delete file's sequence number is **greater than** `seq_num`
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could just be missing something in the implementation, but right now seems like we're over applying in a way that would technically be incorrect. If a snapshot has some added data files and an equality delete, the equality deletes should only be applied to older seq numbers. We do this correctly for partition scoped eq deletes below, but for some reason the global equality delete matching is different; the seq number criteria for eq deletes applies regardless of scope of the delete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this when I was working on some optimizations for equality delete matching (eliminating eq deletes to apply based on bounds)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. I think "greater than" is correct here. The global_deletes variable is used for gathering all the eq delete files for an unpartitioned table, but the same logic should still apply.

Looking at past commits, i did find this was previously discussed here

Also from https://iceberg.apache.org/spec/#scan-planning
Screenshot 2025-10-18 at 5 57 03 PM
This implies that we should not apply eq delete files that have equivalent sequence number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for finding that github thread, looks like the confusion was what global_deletes meant, but really it's global equality deletes. Regardless any equality delete must be applied for files strictly older. I also fixed some of the partition scoped delete matching logic because it's technically possible to have tuples match but the spec IDs don't match; currently we only go off the tuple which isn't quite correct for matching as well.

kevinjqliu
kevinjqliu previously approved these changes Oct 19, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

global_deletes (eq delete for unpartitioned table) should only apply if sequence number is strictly greater than the data files' sequence number

@amogh-jahagirdar amogh-jahagirdar force-pushed the fix-equality-delete-matching branch from eca75bf to 3df3b6c Compare October 19, 2025 18:20
@amogh-jahagirdar amogh-jahagirdar changed the title fix: global eq delete matching should only match against data files which are strictly older fix: global eq delete matching should apply to only strictly older files, and fix partition scoped matching to consider spec id Oct 19, 2025
kevinjqliu
kevinjqliu previously approved these changes Oct 19, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thanks for adding the tests

seq_num
.map(|seq_num| delete.manifest_entry.sequence_number() >= Some(seq_num))
.unwrap_or_else(|| true)
&& data_file.partition_spec_id == delete.partition_spec_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

👍
partition value match is handled by self.pos_deletes_by_partition.get(data_file.partition())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. We could also define a key which is tupled off the <spec ID, partition value> pair in the map (this is what the java implementation does).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets see if we can refactor this in a follow up PR and keep the changes here straightforward :)

@kevinjqliu kevinjqliu requested a review from Copilot October 20, 2025 16:24
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes two bugs in delete file index matching logic during scan planning:

  1. Global equality deletes now only match against strictly older data files (changed from >= to > comparison)
  2. Partition-scoped deletes now correctly validate that both the partition value and spec ID match

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

seq_num
.map(|seq_num| delete.manifest_entry.sequence_number() >= Some(seq_num))
.unwrap_or_else(|| true)
&& data_file.partition_spec_id == delete.partition_spec_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets see if we can refactor this in a follow up PR and keep the changes here straightforward :)

@kevinjqliu kevinjqliu merged commit d33f3bb into apache:main Oct 20, 2025
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Equality Delete Matching over applies global equality deletes

2 participants