[WIP] [HUDI-1072] Use replace metadata file to filter excluded files in views #1859
satishkotha wants to merge 1 commit into apache:master
ccd2611 to fd9291b
The name getCommitsReplaceAndCompactionTimeline irks me. We should introduce another hierarchy to group our actions: {commit, delta, compaction, replace} introduce new file groups, {rollback, restore, clean} remove file groups, and so on. Need to think more.
+1 need a name to capture this more nicely.
Does this mean that we can never go back to querying the older file groups once they have been replaced? Can you still do time travel for insert-overwrite use cases?
Yes. For time travel, consider this scenario:
t0 -> insert
t1 -> insert overwrite1
t2 -> insert overwrite2
If we set the high watermark to t1 for time travel, visibleCommitTimeline would not have t2.commit or t2.replace, so the file groups written in t1 would still show as active file groups.
When we move the high watermark to t2, visibleCommitTimeline will have t2.commit and t2.replace, so the file groups written in t1 will no longer show as active.
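The scenario above can be sketched in a few lines of Java. This is only an illustration of the filtering idea, not the code in this PR: the class TimeTravelSketch, its methods, the file group ids fg0/fg1/fg2, and the choice to compare instants as plain strings are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not Hudi classes.
final class TimeTravelSketch {

    // createdAt:  file group id -> instant that created it.
    // replacedAt: replace instant -> file group ids that instant replaced.
    // Instants are compared as strings, which is an assumption of this sketch.
    static List<String> activeFileGroups(Map<String, String> createdAt,
                                         Map<String, List<String>> replacedAt,
                                         String highWatermark) {
        // Only replace instants at or before the high watermark exclude anything.
        Set<String> excluded = new HashSet<>();
        for (Map.Entry<String, List<String>> e : replacedAt.entrySet()) {
            if (e.getKey().compareTo(highWatermark) <= 0) {
                excluded.addAll(e.getValue());
            }
        }
        // A file group is active if it is visible and not excluded.
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, String> e : createdAt.entrySet()) {
            boolean visible = e.getValue().compareTo(highWatermark) <= 0;
            if (visible && !excluded.contains(e.getKey())) {
                active.add(e.getKey());
            }
        }
        return active;
    }

    // The t0/t1/t2 scenario: each insert overwrite replaces the previous file group.
    static List<String> scenario(String highWatermark) {
        Map<String, String> createdAt = new LinkedHashMap<>();
        createdAt.put("fg0", "t0");
        createdAt.put("fg1", "t1");
        createdAt.put("fg2", "t2");
        Map<String, List<String>> replacedAt = new LinkedHashMap<>();
        replacedAt.put("t1", Arrays.asList("fg0"));
        replacedAt.put("t2", Arrays.asList("fg1"));
        return activeFileGroups(createdAt, replacedAt, highWatermark);
    }

    public static void main(String[] args) {
        System.out.println(scenario("t1")); // prints [fg1]
        System.out.println(scenario("t2")); // prints [fg2]
    }
}
```

With the high watermark at t1, the t2 replace entry is never consulted, so fg1 stays active; moving the watermark to t2 is what excludes it.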
+1 we should ideally have a test for this
The test here simulates rollback of a replace instant.
I can add another one that filters the timeline to move the high watermark.
n3nash left a comment
Reviewed 50%. At a high level, I feel the excludeFileGroups change is being forced into many of the TableFileSystem implementations. We need to think more about whether the right abstractions can avoid adding excludeFileGroups everywhere.
Yes, the intent is to get early feedback, and I appreciate any suggestions. I added excludeFileGroups in all views because in some cases this list may be huge, so having a configurable spillable view (or RocksDB view) can be useful. It is also possible to encapsulate all of this in AbstractFileView and hide it from subclasses. Let me know if you think that is a better solution.
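One way to picture the encapsulation option is a base class that owns the filtering while subclasses only decide where the data lives. The sketch below is hypothetical and much simpler than the real view classes: AbstractViewSketch, InMemoryViewSketch, and every method name are invented for illustration, and file groups are reduced to string ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical base class: the exclusion filter is applied here and nowhere else.
abstract class AbstractViewSketch {
    // Storage hooks: an in-memory, spillable, or RocksDB-backed subclass
    // would implement these two methods differently.
    protected abstract List<String> fetchAllFileGroupIds(String partition);
    protected abstract boolean isExcluded(String partition, String fileGroupId);

    // Final, so subclasses cannot bypass or re-implement the filtering.
    public final List<String> getActiveFileGroupIds(String partition) {
        List<String> active = new ArrayList<>();
        for (String id : fetchAllFileGroupIds(partition)) {
            if (!isExcluded(partition, id)) {
                active.add(id);
            }
        }
        return active;
    }
}

// Hypothetical in-memory subclass; it only stores and looks up data.
final class InMemoryViewSketch extends AbstractViewSketch {
    private final Map<String, List<String>> groups = new HashMap<>();
    private final Map<String, Set<String>> excluded = new HashMap<>();

    void addFileGroup(String partition, String id) {
        groups.computeIfAbsent(partition, k -> new ArrayList<>()).add(id);
    }

    void markReplaced(String partition, String id) {
        excluded.computeIfAbsent(partition, k -> new HashSet<>()).add(id);
    }

    @Override
    protected List<String> fetchAllFileGroupIds(String partition) {
        return groups.getOrDefault(partition, new ArrayList<>());
    }

    @Override
    protected boolean isExcluded(String partition, String id) {
        return excluded.getOrDefault(partition, new HashSet<>()).contains(id);
    }
}
```

In this shape the excluded list can still be backed by a spillable or RocksDB store, since only the two storage hooks vary per subclass.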
fd9291b to 873854b
873854b to 16650d4
vinothchandar left a comment
High level approach LGTM. Can do a more thorough review as a follow up.
Can you clarify what the state transitions are for REPLACE? Would it be like compaction:
t1.replace.requested, t1.replace.inflight, t1.commit
or
t1.replace.requested, t1.replace.inflight, t1.replace?
"type": "record",
"name": "HoodieReplaceMetadata",
"fields": [
  {"name": "totalFilesReplaced", "type": "int"},
rename: totalFileSlicesReplaced
  {"name": "partitionMetadata", "type": {
    "type": "map", "values": {
      "type": "array",
      "items": "string"
I was expecting this to contain the actual file slices being replaced. It seems like we just want to have the partitions here?
So, in the approach I implemented, we will have both t1.replace and t1.commit files, i.e., t1.replace.requested, t1.replace.inflight, t1.replace, t1.commit. There are a few reasons for doing this:
In short, 't1.replace' and 't1.commit' together define the changes done during the t1 instant. After consolidated metadata lands, I think this can be simplified quite a bit. I discussed this with a few others offline and implemented this approach, but let me know if you think there is a better way to do this. It's still early stages and I'm happy to implement a cleaner approach if there is one.
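As a small illustration of the sequence described in this reply, the sketch below lists the four timeline files for an instant and treats a replace as complete only when both t1.replace and t1.commit are present. The class ReplaceInstantSketch and both methods are hypothetical, and the completeness rule is an assumption read from "together define", not a statement of how the PR checks it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the file sequence described in the comment above.
final class ReplaceInstantSketch {

    // Timeline files for one replace instant, in the order listed in the comment.
    static List<String> timelineFiles(String instantTime) {
        return Arrays.asList(
            instantTime + ".replace.requested",
            instantTime + ".replace.inflight",
            instantTime + ".replace",
            instantTime + ".commit");
    }

    // Assumption: the instant's changes are fully described only when both
    // the .replace and the .commit files exist.
    static boolean isComplete(String instantTime, List<String> presentFiles) {
        return presentFiles.contains(instantTime + ".replace")
            && presentFiles.contains(instantTime + ".commit");
    }

    public static void main(String[] args) {
        List<String> files = timelineFiles("t1");
        System.out.println(files);
        System.out.println(isComplete("t1", files)); // prints true
    }
}
```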
… replaced files as part of archival
16650d4 to 1217882
Moved to #2048
What is the purpose of the pull request
Follow up on #1853
Use metadata and filter excluded files from views.
Changed the base views. If the general approach looks good, I can update the RocksDB and spillable view implementations.
Brief change log
Add new methods in Abstract view to filter files excluded by replace commits
Verify this pull request
Added unit tests
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.