-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API, Core: Make RewriteFiles flexible #7501
Conversation
@@ -34,13 +34,57 @@ | |||
* will throw a {@link ValidationException}. | |||
*/ | |||
public interface RewriteFiles extends SnapshotUpdate<RewriteFiles> { | |||
/** | |||
* Delete a data file whose content was rewritten. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the word delete
is a bit questionable in this context but that's what the Javadoc says. It is also similar to the naming used in OverwriteFiles
and other APIs. I can see this being called rewriteFile
or something too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our terminology always throw me a bit off here. We struggle with "delete" = physically remove and "delete" mark as logically removed. Both are used in various places and sometimes we also use "remove".
Here I think "originalFile" maybe is more of the logical description, "original" => "rewritten" instead of "add" and "delete" which are really the implementation details.
That said i'm fine with just "add" and "delete" since we use them in many other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is unfortunate we overlooked the distinction between physical delete
and logical remove
. Let me double check other APIs to see whether there are also any good examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone else have any preferences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995 @rdblue, any thoughts?
* @param sequenceNumber a data sequence number | ||
* @return this for method chaining | ||
*/ | ||
RewriteFiles dataSequenceNumber(long sequenceNumber); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can only set a data sequence number for data files. Whenever we rewrite delete files, we have to rely on sequence numbers of source files. I can see this being called newDataFilesDataSequenceNumber
but that's too long.
@@ -86,7 +86,7 @@ public void testEmptyTable() { | |||
.rewriteFiles( | |||
ImmutableSet.of(), | |||
ImmutableSet.of(FILE_A_DELETES), | |||
ImmutableSet.of(FILE_A), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it ever a valid use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be removing a delete file and adding a data file and a new delete file? Seems impossible in rewrite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, this is testing an error message so it was, probably, overlooked.
e9d4b8f
to
ca2d545
Compare
ca2d545
to
7d16caa
Compare
} | ||
|
||
/** | ||
* Delete a delete file whose content was rewritten. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: will this be not entirely accurate for RewritePositionDelete? We may end up not rewriting invalid deletes. Or is 'rewritten' more nuanced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean cases when we rewrite a position delete file but discard no longer valid deletes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the way I read it, it may indicate that all the content will be rewritten. But its likely in my case that it is just dropped without being rewritten. Not sure if its just worth calling out here 'whose content was rewritten'? Maybe we can do the other way around, and add the javadoc on addFile(): 'Add a DeleteFile rewriting content of a deleted DeleteFile'.
* @param deleteFile a new delete file | ||
* @return this for method chaining | ||
*/ | ||
default RewriteFiles addFile(DeleteFile deleteFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to specify that adding the same delete file/data file twice will have no effect? We required set
s before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, I'll add. That said, neither DataFile
nor DeleteFile
implementations provide proper equality at the moment so Set
was kind of useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, Sounds like we can drop it, although may want to fix that in the future since we don't actually support multiple instances of the same file within a table.
@@ -26,6 +26,9 @@ | |||
class BaseRewriteFiles extends MergingSnapshotProducer<RewriteFiles> implements RewriteFiles { | |||
private final Set<DataFile> replacedDataFiles = Sets.newHashSet(); | |||
private Long startingSnapshotId = null; | |||
private boolean replacesDeleteFiles = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than fields, can we check whether "repalcesDataFiles" or other data structures have contents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean checking the underlying ManifestFilterManager
, List<DataFile>
and the map of delete files? I thought about this but was initially reluctant to touch ManifestFilterManager
(4 different ways to delete files).
That said, these boolean flags are a bit odd. Now I am thinking it makes sense to spend a bit more time and check the underlying structures. @RussellSpitzer, is this what you meant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, sorry if I was unclear. It's just weird to keep this as state flags when they are just proxy variables for the actual state of containers we have access to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes more sense as an api than continuing to expand the single method rewrite(lots of args)
CI failed because of the API checks. I'll fix them together with addressing the feedback. |
7d16caa
to
cc7d2d5
Compare
cc7d2d5
to
1a5d11d
Compare
I double checked a few places that use this API as well as the underlying implementation methods. We do use |
This PR makes
RewriteFiles
more flexible and aligned with other APIs likeOverwriteFiles
. One valid use case is discussed in #7389, where we want to replace one set of delete files with another one. Instead of adding another overloaded method, we should make the interface flexible enough to support various combinations.