Core: Add EntryStatus.MODIFIED and TrackingBuilder status derivation #16689

Open
anoopj wants to merge 4 commits into apache:main from anoopj:v4-modified-status

Conversation

@anoopj
Member

@anoopj anoopj commented Jun 5, 2026

  • Add MODIFIED to EntryStatus for entries whose file was modified by
    a column update or DV change.
  • Update TrackingBuilder so that it derives the correct status.
  • Add tests.

@github-actions github-actions Bot added the core label Jun 5, 2026
@anoopj anoopj moved this to In review in V4: metadata tree Jun 5, 2026
@anoopj anoopj mentioned this pull request Jun 5, 2026
3 tasks
Contributor

@gaborkaszab gaborkaszab left a comment

Thanks for the addition, @anoopj!
I went through this; it seems good in general, I just left some nits.

Comment thread core/src/main/java/org/apache/iceberg/EntryStatus.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackingStruct.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackingStruct.java Outdated
return EntryStatus.ADDED;
}

boolean sameSnapshot = source.snapshotId() != null && source.snapshotId() == newSnapshotId;
Contributor

Reading this I have the impression that this PR contains 2 different functionalities:

  1. the "sameSnapshot" case for ADDED status
  2. Introduction of MODIFIED and all the transitions to/from it

No strong opinion, but I usually prefer keeping a single purpose for my PRs. LMK WDYT.

Contributor

Just for my benefit: What is the use case for the "sameSnapshot" case? I figure that Tracking is created whenever we create a new TrackedFile. I'm not sure I see where, within the same snapshot, we would want to change Tracking again (or recreate the TrackedFile with different Tracking). I'm probably missing how this integrates into the big picture.

Member Author

The added/same snapshot case is not a separate scope. This is for the use case we discussed in #16408, where a writer constructs an ADDED tracking and pipelines a DV attach in the same commit.

Contributor

Do you mean this conversation? If yes, it was about whether to return Tracking or TrackingBuilder from added(long). The argument was to return the latter, because we might want to add some DVs on top using the same builder.
However, here it seems slightly different: We first build a Tracking object with ADDED status, then use it as a source to add DVs on top using another builder. For this, we could use Tracking as a return value from added(long)
(I'll think this through tomorrow again)
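
For readers following the thread, the same-snapshot case can be sketched as below. This is a minimal illustration of the check quoted above, not the PR's code: `SnapStatus`, `SnapTracking`, and `statusAfterDvAttach` are hypothetical names, and both the rejection of non-live sources and the fallback to MODIFIED are assumptions made for the sketch.

```java
// Hypothetical stand-ins for illustration; these are not the Iceberg classes.
enum SnapStatus { EXISTING, ADDED, DELETED, REPLACED, MODIFIED }

final class SnapTracking {
  final SnapStatus status;
  final Long snapshotId; // boxed because the ID may be absent

  SnapTracking(SnapStatus status, Long snapshotId) {
    this.status = status;
    this.snapshotId = snapshotId;
  }
}

final class SameSnapshotSketch {
  private SameSnapshotSketch() {}

  // Status of the entry produced by attaching a DV to `source` while committing newSnapshotId.
  static SnapStatus statusAfterDvAttach(SnapTracking source, long newSnapshotId) {
    if (source.status == SnapStatus.DELETED || source.status == SnapStatus.REPLACED) {
      // Assumption: entries that are no longer live cannot be mutated.
      throw new IllegalStateException("Cannot attach a DV to " + source.status + " entry");
    }

    // Same null-safe comparison as in the diff: the Long is unboxed only after the null check.
    boolean sameSnapshot = source.snapshotId != null && source.snapshotId == newSnapshotId;
    if (source.status == SnapStatus.ADDED && sameSnapshot) {
      // The file was added by this very commit, so the DV is part of the add.
      return SnapStatus.ADDED;
    }

    // Assumption: any other live source counts as modified by the new snapshot.
    return SnapStatus.MODIFIED;
  }
}
```

Under these assumptions, `statusAfterDvAttach(new SnapTracking(SnapStatus.ADDED, 7L), 7L)` keeps ADDED, while the same source evaluated against snapshot 8 yields MODIFIED.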

@anoopj anoopj requested a review from gaborkaszab June 5, 2026 14:31
REPLACED(3);
/**
* Non-live entry recording that a prior file version was superseded by another live entry. Added
* in v4.
Contributor

We need a term that is better than "non-live". It's probably implied by "superseded by another live entry", although I think we can still improve on that.

What about this?

The starting (replaced) state of an entry that is modified.

Contributor

Is old better than starting in this context?

The old (replaced) state of an entry that is modified. Paired with MODIFIED. Added in v4.

Member Author

Done. Went with the old/new pairing Steven proposed above.

"snapshot_id",
Types.LongType.get(),
"Snapshot ID where the file was added or deleted");
"Snapshot ID where the file was added, deleted, replaced, or modified");
Contributor

@rdblue rdblue Jun 5, 2026

Have we agreed to modify the snapshot ID for a replaced entry? I thought that we were not going to change replaced entries.

We change the snapshot ID for deleted entries, but not for existing entries so there's precedent both ways. If you're scanning for changes, the snapshot ID is useful for filtering out changes that are left-over from older snapshots. For instance, I may rewrite a manifest and delete a file in it. If I'm later scanning that file for changes, I would be able to check whether the delete entry is for the snapshot ID I'm getting changes for.

The counter-argument is that the manifest would probably only be scanned for changes if you're looking for changes that would match. In order to scan that manifest, you'd first check its snapshot ID (when it was added) and not scan otherwise.

Overall, I think the right thing is to update the snapshot ID as you have here. That way if any implementation reads files it doesn't need to, it has enough information to filter out the entries.

Good to note in the spec @stevenzwu and @amogh-jahagirdar.

Contributor

TrackingBuilder.terminal() writes newSnapshotId for both DELETED and REPLACED today, so the doc matches the current code.

To me, deleted and replaced entries should behave the same. replaced is just a special flavor of deleted. Since v1-v3 already modify the snapshot ID for a deleted entry, it seems reasonable to maintain the same behavior.

Also for change detection, it is probably better to update the snapshot id in tracking for deleted or replaced entries. If the tracking snapshot_id doesn't match the current snapshot id, the entries should be ignored for change detection. This is important because the current spec is silent on lifetime of deleted entries.
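
The change-detection argument above can be sketched as a simple filter. This is an illustration only: `ScanStatus`, `ScanEntry`, and `isChangeFor` are hypothetical names, and the sketch assumes that EXISTING entries keep the snapshot ID of the commit that originally added the file, as described in the thread.

```java
// Hypothetical types for illustration; not the Iceberg API.
enum ScanStatus { EXISTING, ADDED, DELETED, REPLACED, MODIFIED }

final class ScanEntry {
  final ScanStatus status;
  final Long snapshotId; // the tracking snapshot_id; may be absent

  ScanEntry(ScanStatus status, Long snapshotId) {
    this.status = status;
    this.snapshotId = snapshotId;
  }
}

final class ChangeScanSketch {
  private ChangeScanSketch() {}

  // True if the entry records a change made by targetSnapshotId.
  static boolean isChangeFor(ScanEntry entry, long targetSnapshotId) {
    if (entry.status == ScanStatus.EXISTING) {
      // Carried over unchanged, so it is never a change for the snapshot being scanned.
      return false;
    }

    // A DELETED or REPLACED entry left over from an older snapshot carries that
    // older snapshot's ID and is filtered out here.
    return entry.snapshotId != null && entry.snapshotId == targetSnapshotId;
  }
}
```

Because the snapshot ID is rewritten for DELETED and REPLACED entries, a reader that scans a manifest it did not strictly need to read still has enough information to drop stale entries.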

Comment thread core/src/main/java/org/apache/iceberg/EntryStatus.java Outdated

class TrackingBuilder {
private final EntryStatus status;
private final Long snapshotId;
// null for the fresh-added path; non-null for the source-based path.
Contributor

Let's remove this. I'm not sure what it means.

Member Author

Done. Removed with the revert of the refactor.

TrackingBuilder deletedPositions(ByteBuffer positions) {
Preconditions.checkState(
status == EntryStatus.EXISTING, "Cannot set deleted positions on %s entry", status);
Preconditions.checkState(source != null, "Cannot set deleted positions on ADDED entry");
Contributor

The error message should agree with the check. In this case, I think the check is better if you use status == ADDED.

Member Author

Done. TrackingBuilder now holds status and snapshotId as fields instead of source.

}

private TrackingBuilder(long newSnapshotId) {
this.status = EntryStatus.ADDED;
Contributor

I think this should still use status rather than mutated and source. There's no need to keep the source around instead of storing its snapshot ID individually. And we don't want to make inferences like state is added when source is null.

Member Author

Fixed. (same as above)

}

/** Derives the output status from the source, the snapshot, and any mutations. */
private EntryStatus deriveStatus() {
Contributor

I think that this should go back to updating status in the builder config methods rather than this. The impulse to co-locate logic for entry status is good, but I think it is more readable not to do this refactor. When modifications are made, the status can be validated and updated inline. That's simpler and more clear, in my opinion, rather than trying to detect what happened in the configuration phase and produce the correct status.

Member Author

Done. Reverted the refactor.

tracking.set(STATUS_ORDINAL, EntryStatus.EXISTING.id());
assertThat(tracking.isLive()).isTrue();

tracking.set(STATUS_ORDINAL, EntryStatus.MODIFIED.id());
Contributor

Why is this setting status through an ordinal?

Member Author

Switched to use builder method.

Comment thread core/src/main/java/org/apache/iceberg/TrackingBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackingStruct.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackingStruct.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackingStruct.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackingStruct.java Outdated
anoopj and others added 2 commits June 6, 2026 07:00
@anoopj anoopj requested review from rdblue and stevenzwu June 6, 2026 03:31
Contributor

@stevenzwu stevenzwu left a comment

LGTM. Just a minor comment for Javadoc

Comment on lines +148 to +153
private void promoteToModified() {
if (status == EntryStatus.EXISTING) {
this.status = EntryStatus.MODIFIED;
this.snapshotId = newSnapshotId;
}
}
Contributor

The new state-machine reads cleanly, but the rule that fresh-add builders stay ADDED through dvUpdated() is implicit — a reader has to notice that this method only flips EXISTING, then trace back to the no-source constructor to see that ADDED is the starting state. A one-line javadoc captures the intent in one place.

Suggested change
private void promoteToModified() {
if (status == EntryStatus.EXISTING) {
this.status = EntryStatus.MODIFIED;
this.snapshotId = newSnapshotId;
}
}
/**
* Promotes an EXISTING entry to MODIFIED on mutation. Fresh-add builders (status = ADDED) are
* preserved, which covers the same-commit append + DV attach case without a special branch.
*/
private void promoteToModified() {
if (status == EntryStatus.EXISTING) {
this.status = EntryStatus.MODIFIED;
this.snapshotId = newSnapshotId;
}
}
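
To see how the inline status update behaves end to end, here is a small self-contained sketch built around the `promoteToModified()` logic quoted above. It is a simplified stand-in, not the PR's `TrackingBuilder`: `TrackingBuilderSketch`, `SketchEntryStatus`, and the `existing(...)` factory are hypothetical, while `added(long)` and `dvUpdated()` borrow names mentioned in the thread without claiming to match their real signatures.

```java
// Hypothetical, simplified stand-in for the builder; only the status handling is modeled.
enum SketchEntryStatus { EXISTING, ADDED, DELETED, REPLACED, MODIFIED }

final class TrackingBuilderSketch {
  private final long newSnapshotId;
  private SketchEntryStatus status;
  private Long snapshotId;

  private TrackingBuilderSketch(SketchEntryStatus status, Long snapshotId, long newSnapshotId) {
    this.status = status;
    this.snapshotId = snapshotId;
    this.newSnapshotId = newSnapshotId;
  }

  // Fresh-add path: the entry starts as ADDED in the snapshot being committed.
  static TrackingBuilderSketch added(long newSnapshotId) {
    return new TrackingBuilderSketch(SketchEntryStatus.ADDED, newSnapshotId, newSnapshotId);
  }

  // Hypothetical source-based path: the entry is carried over as EXISTING with its original ID.
  static TrackingBuilderSketch existing(long sourceSnapshotId, long newSnapshotId) {
    return new TrackingBuilderSketch(SketchEntryStatus.EXISTING, sourceSnapshotId, newSnapshotId);
  }

  // Config method: records a DV change and updates the status inline.
  TrackingBuilderSketch dvUpdated() {
    promoteToModified();
    return this;
  }

  // Only EXISTING is promoted; an ADDED entry stays ADDED.
  private void promoteToModified() {
    if (status == SketchEntryStatus.EXISTING) {
      this.status = SketchEntryStatus.MODIFIED;
      this.snapshotId = newSnapshotId;
    }
  }

  SketchEntryStatus status() {
    return status;
  }

  Long snapshotId() {
    return snapshotId;
  }
}
```

In this sketch, `added(5L).dvUpdated()` stays ADDED, whereas `existing(3L, 5L).dvUpdated()` becomes MODIFIED and takes snapshot ID 5. An untouched `existing(3L, 5L)` keeps status EXISTING and snapshot ID 3.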

4 participants