Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API, Core: Make RewriteFiles flexible #7501

Merged
merged 1 commit into from
May 8, 2023

Conversation

aokolnychyi
Copy link
Contributor

This PR makes RewriteFiles more flexible and aligned with other APIs like OverwriteFiles. One valid use case is discussed in #7389, where we want to replace one set of delete files with another one. Instead of adding another overloaded method, we should make the interface flexible enough to support various combinations.

@@ -34,13 +34,57 @@
* will throw a {@link ValidationException}.
*/
public interface RewriteFiles extends SnapshotUpdate<RewriteFiles> {
/**
* Delete a data file whose content was rewritten.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the word delete is a bit questionable in this context but that's what the Javadoc says. It is also similar to the naming used in OverwriteFiles and other APIs. I can see this being called rewriteFile or something too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our terminology always throw me a bit off here. We struggle with "delete" = physically remove and "delete" mark as logically removed. Both are used in various places and sometimes we also use "remove".

Here I think "originalFile" maybe is more of the logical description, "original" => "rewritten" instead of "add" and "delete" which are really the implementation details.

That said i'm fine with just "add" and "delete" since we use them in many other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is unfortunate we overlooked the distinction between physical delete and logical remove. Let me double check other APIs to see whether there are also any good examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone else have any preferences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 @rdblue, any thoughts?

* @param sequenceNumber a data sequence number
* @return this for method chaining
*/
RewriteFiles dataSequenceNumber(long sequenceNumber);
Copy link
Contributor Author

@aokolnychyi aokolnychyi May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only set a data sequence number for data files. Whenever we rewrite delete files, we have to rely on sequence numbers of source files. I can see this being called newDataFilesDataSequenceNumber but that's too long.

@@ -86,7 +86,7 @@ public void testEmptyTable() {
.rewriteFiles(
ImmutableSet.of(),
ImmutableSet.of(FILE_A_DELETES),
ImmutableSet.of(FILE_A),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it ever a valid use case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be removing a delete file and adding a data file and a new delete file? Seems impossible in rewrite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, this is testing an error message so it was, probably, overlooked.

@aokolnychyi
Copy link
Contributor Author

}

/**
* Delete a delete file whose content was rewritten.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: will this be not entirely accurate for RewritePositionDelete? We may end up not rewriting invalid deletes. Or is 'rewritten' more nuanced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean cases when we rewrite a position delete file but discard no longer valid deletes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the way I read it, it may indicate that all the content will be rewritten. But its likely in my case that it is just dropped without being rewritten. Not sure if its just worth calling out here 'whose content was rewritten'? Maybe we can do the other way around, and add the javadoc on addFile(): 'Add a DeleteFile rewriting content of a deleted DeleteFile'.

* @param deleteFile a new delete file
* @return this for method chaining
*/
default RewriteFiles addFile(DeleteFile deleteFile) {
Copy link
Member

@RussellSpitzer RussellSpitzer May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to specify that adding the same delete file/data file twice will have no effect? We required set
s before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I'll add. That said, neither DataFile nor DeleteFile implementations provide proper equality at the moment so Set was kind of useless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, Sounds like we can drop it, although may want to fix that in the future since we don't actually support multiple instances of the same file within a table.

@@ -26,6 +26,9 @@
class BaseRewriteFiles extends MergingSnapshotProducer<RewriteFiles> implements RewriteFiles {
private final Set<DataFile> replacedDataFiles = Sets.newHashSet();
private Long startingSnapshotId = null;
private boolean replacesDeleteFiles = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than fields, can we check whether "repalcesDataFiles" or other data structures have contents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean checking the underlying ManifestFilterManager, List<DataFile> and the map of delete files? I thought about this but was initially reluctant to touch ManifestFilterManager (4 different ways to delete files).

That said, these boolean flags are a bit odd. Now I am thinking it makes sense to spend a bit more time and check the underlying structures. @RussellSpitzer, is this what you meant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sorry if I was unclear. It's just weird to keep this as state flags when they are just proxy variables for the actual state of containers we have access to.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes more sense as an api than continuing to expand the single method rewrite(lots of args)

@aokolnychyi
Copy link
Contributor Author

CI failed because of the API checks. I'll fix them together with addressing the feedback.

@aokolnychyi
Copy link
Contributor Author

I double checked a few places that use this API as well as the underlying implementation methods. We do use delete in most places, so I'll go ahead and merge this PR. If there are any comments later, we can still change this before the release.

@aokolnychyi aokolnychyi merged commit 174c4ca into apache:master May 8, 2023
jimjag pushed a commit to jimjag/iceberg that referenced this pull request May 8, 2023
jimjag pushed a commit to jimjag/iceberg that referenced this pull request May 8, 2023
jimjag pushed a commit to jimjag/iceberg that referenced this pull request May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants