Core: Add EntryStatus.MODIFIED and TrackingBuilder status derivation #16689
anoopj wants to merge 4 commits into
gaborkaszab left a comment
Thanks for the addition, @anoopj!
I went through this and it looks good in general; I just left some nits.
| return EntryStatus.ADDED; | ||
| } | ||
|
|
||
| boolean sameSnapshot = source.snapshotId() != null && source.snapshotId() == newSnapshotId; |
Reading this I have the impression that this PR contains 2 different functionalities:
- the "sameSnapshot" case for ADDED status
- Introduction of MODIFIED and all the transitions to/from
No strong opinion, but I usually prefer keeping a single purpose for my PRs. LMK WDYT
Just for my benefit: what is the use case for the "sameSnapshot" case? I figure that Tracking is created whenever we create a new TrackedFile. I'm not sure I see where, within the same snapshot, we would want to change Tracking again (or recreate the TrackedFile with different Tracking). I'm probably missing how this integrates into the big picture.
The added/same-snapshot case is not a separate scope. This is for the use case we discussed in #16408, where a writer constructs an ADDED tracking and pipelines a DV attach in the same commit.
Do you mean this conversation? If yes, it was about whether to return Tracking or TrackingBuilder from added(long). The argument was to return the latter, because we might want to add some DVs on top using the same builder.
However, here it seems slightly different: we first build a Tracking object with ADDED status, then use it as a source to add DVs on top using another builder. For this, we could use Tracking as the return value of added(long).
(I'll think this through tomorrow again)
| REPLACED(3); | ||
| /** | ||
| * Non-live entry recording that a prior file version was superseded by another live entry. Added | ||
| * in v4. |
We need a term that is better than "non-live". It's probably implied by "superseded by another live entry", although I think we can still improve on that.
What about this?
The starting (replaced) state of an entry that is modified.
Is "old" better than "starting" in this context?
The old (replaced) state of an entry that is modified. Paired with MODIFIED. Added in v4.
Done. Went with the old/new pairing Steven proposed above.
| "snapshot_id", | ||
| Types.LongType.get(), | ||
| "Snapshot ID where the file was added or deleted"); | ||
| "Snapshot ID where the file was added, deleted, replaced, or modified"); |
Have we agreed to modify the snapshot ID for a replaced entry? I thought that we were not going to change replaced entries.
We change the snapshot ID for deleted entries, but not for existing entries so there's precedent both ways. If you're scanning for changes, the snapshot ID is useful for filtering out changes that are left-over from older snapshots. For instance, I may rewrite a manifest and delete a file in it. If I'm later scanning that file for changes, I would be able to check whether the delete entry is for the snapshot ID I'm getting changes for.
The counter-argument is that the manifest would probably only be scanned for changes if you're looking for changes that would match. In order to scan that manifest, you'd first check its snapshot ID (when it was added) and not scan otherwise.
Overall, I think the right thing is to update the snapshot ID as you have here. That way if any implementation reads files it doesn't need to, it has enough information to filter out the entries.
Good to note in the spec @stevenzwu and @amogh-jahagirdar.
TrackingBuilder.terminal() writes newSnapshotId for both DELETED and REPLACED today, so the doc matches the current code.
To me, deleted and replaced entries should behave the same: replaced is just a special flavor of deleted. Since v1-v3 already modify the snapshot ID for a deleted entry, it seems reasonable to maintain the same behavior.
Also for change detection, it is probably better to update the snapshot id in tracking for deleted or replaced entries. If the tracking snapshot_id doesn't match the current snapshot id, the entries should be ignored for change detection. This is important because the current spec is silent on lifetime of deleted entries.
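A minimal sketch of the filtering described above, assuming a simplified entry type. `Entry`, `changesFor`, and the file names are hypothetical stand-ins for illustration, not the Iceberg classes; the only idea taken from the thread is that entries whose tracking snapshot ID differs from the snapshot being scanned are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChangeFilterSketch {
  // Status names follow the thread; numeric IDs are deliberately omitted.
  enum EntryStatus { EXISTING, ADDED, DELETED, REPLACED, MODIFIED }

  // Hypothetical stand-in for a manifest entry's tracking info.
  static final class Entry {
    final EntryStatus status;
    final Long snapshotId; // nullable, as in the snippets under review
    final String path;

    Entry(EntryStatus status, Long snapshotId, String path) {
      this.status = status;
      this.snapshotId = snapshotId;
      this.path = path;
    }
  }

  // Keeps only non-EXISTING entries whose tracking snapshot ID matches
  // the snapshot we are collecting changes for.
  static List<Entry> changesFor(List<Entry> entries, long snapshotId) {
    List<Entry> changes = new ArrayList<>();
    for (Entry e : entries) {
      boolean sameSnapshot = e.snapshotId != null && e.snapshotId == snapshotId;
      if (sameSnapshot && e.status != EntryStatus.EXISTING) {
        changes.add(e);
      }
    }
    return changes;
  }

  public static void main(String[] args) {
    List<Entry> manifest = Arrays.asList(
        new Entry(EntryStatus.DELETED, 5L, "a.parquet"),   // left over from an older snapshot
        new Entry(EntryStatus.REPLACED, 9L, "b.parquet"),  // old state, written in snapshot 9
        new Entry(EntryStatus.MODIFIED, 9L, "b.parquet"),  // new state, written in snapshot 9
        new Entry(EntryStatus.EXISTING, 4L, "c.parquet")); // untouched
    for (Entry e : changesFor(manifest, 9L)) {
      System.out.println(e.status + " " + e.path);
    }
  }
}
```

Because REPLACED and DELETED entries carry the snapshot ID of the commit that wrote them, a reader that opens a manifest it did not need to scan can still drop the stale DELETED entry from snapshot 5.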
| class TrackingBuilder { | ||
| private final EntryStatus status; | ||
| private final Long snapshotId; | ||
| // null for the fresh-added path; non-null for the source-based path. |
Let's remove this. I'm not sure what it means.
Done. Removed with the revert of the refactor.
| TrackingBuilder deletedPositions(ByteBuffer positions) { | ||
| Preconditions.checkState( | ||
| status == EntryStatus.EXISTING, "Cannot set deleted positions on %s entry", status); | ||
| Preconditions.checkState(source != null, "Cannot set deleted positions on ADDED entry"); |
The error message should agree with the check. In this case, I think the check is better if you use status == ADDED.
Done. TrackingBuilder now holds status and snapshotId as fields instead of source.
| } | ||
|
|
||
| private TrackingBuilder(long newSnapshotId) { | ||
| this.status = EntryStatus.ADDED; |
I think this should still use status rather than mutated and source. There's no need to keep the source around instead of storing its snapshot ID individually. And we don't want to make inferences such as "the status is ADDED when source is null".
| } | ||
|
|
||
| /** Derives the output status from the source, the snapshot, and any mutations. */ | ||
| private EntryStatus deriveStatus() { |
I think this should go back to updating status in the builder config methods rather than deriving it here. The impulse to co-locate the entry status logic is good, but I think the code is more readable without this refactor. When modifications are made, the status can be validated and updated inline. In my opinion, that is simpler and clearer than trying to detect what happened in the configuration phase and then produce the correct status.
Done. Reverted the refactor.
| tracking.set(STATUS_ORDINAL, EntryStatus.EXISTING.id()); | ||
| assertThat(tracking.isLive()).isTrue(); | ||
|
|
||
| tracking.set(STATUS_ORDINAL, EntryStatus.MODIFIED.id()); |
Why is this setting status through an ordinal?
Switched to using the builder method.
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
stevenzwu left a comment
LGTM. Just a minor comment on the Javadoc.
| private void promoteToModified() { | ||
| if (status == EntryStatus.EXISTING) { | ||
| this.status = EntryStatus.MODIFIED; | ||
| this.snapshotId = newSnapshotId; | ||
| } | ||
| } |
The new state-machine reads cleanly, but the rule that fresh-add builders stay ADDED through dvUpdated() is implicit — a reader has to notice that this method only flips EXISTING, then trace back to the no-source constructor to see that ADDED is the starting state. A one-line javadoc captures the intent in one place.
| /** | |
| * Promotes an EXISTING entry to MODIFIED on mutation. Fresh-add builders (status = ADDED) are | |
| * preserved — covers the same-commit append + DV attach case without a special branch. | |
| */ | |
| private void promoteToModified() { | |
| if (status == EntryStatus.EXISTING) { | |
| this.status = EntryStatus.MODIFIED; | |
| this.snapshotId = newSnapshotId; | |
| } | |
| } |
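To make the rule in this suggestion concrete, here is a small self-contained sketch of the state transitions, written under assumptions. `promoteToModified()`, `dvUpdated()`, and `added(long)` are names from this thread; `fromExisting`, the accessors, and the validation in `dvUpdated()` are hypothetical and only illustrate updating the status inline in a config method.

```java
public class TrackingStatusSketch {
  enum EntryStatus { EXISTING, ADDED, DELETED, REPLACED, MODIFIED }

  // Simplified, hypothetical builder; not the actual TrackingBuilder.
  static final class Builder {
    private EntryStatus status;
    private Long snapshotId;
    private final long newSnapshotId;

    private Builder(EntryStatus status, Long snapshotId, long newSnapshotId) {
      this.status = status;
      this.snapshotId = snapshotId;
      this.newSnapshotId = newSnapshotId;
    }

    // Fresh-add path: the entry starts as ADDED in the new snapshot.
    static Builder added(long newSnapshotId) {
      return new Builder(EntryStatus.ADDED, newSnapshotId, newSnapshotId);
    }

    // Source-based path (hypothetical name): carries over the source's snapshot ID.
    static Builder fromExisting(long sourceSnapshotId, long newSnapshotId) {
      return new Builder(EntryStatus.EXISTING, sourceSnapshotId, newSnapshotId);
    }

    // Config method: validates the current status, then updates it inline.
    Builder dvUpdated() {
      if (status == EntryStatus.DELETED || status == EntryStatus.REPLACED) {
        throw new IllegalStateException("Cannot update DV on " + status + " entry");
      }
      promoteToModified();
      return this;
    }

    // Only EXISTING is promoted; ADDED (and an already MODIFIED entry) is left as-is.
    private void promoteToModified() {
      if (status == EntryStatus.EXISTING) {
        this.status = EntryStatus.MODIFIED;
        this.snapshotId = newSnapshotId;
      }
    }

    EntryStatus status() {
      return status;
    }

    Long snapshotId() {
      return snapshotId;
    }
  }

  public static void main(String[] args) {
    Builder fresh = Builder.added(7L).dvUpdated();
    System.out.println(fresh.status() + " @ " + fresh.snapshotId());

    Builder carried = Builder.fromExisting(3L, 7L).dvUpdated();
    System.out.println(carried.status() + " @ " + carried.snapshotId());
  }
}
```

In this sketch the same-commit append plus DV attach stays ADDED, while a carried-over entry becomes MODIFIED and picks up the new snapshot ID, with no separate status-derivation step.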
a column update or DV change.