[HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset #1009
bvaradar merged 1 commit into apache:master from
@vinothchandar @n3nash : Ready for review.
@bvaradar left some comments. In general, I couldn't understand how existing tables will move from VERSION_0 metadata to VERSION_1. Is the new version only supported for new tables? If yes, what is the plan for the existing tables? If not, what is the migration strategy for existing tables?
vinothchandar left a comment
High-level approach looks fine. The fact that this did not need boiling the ocean is a testament that our code is actually in good shape :)
But I left a bunch of comments. Will do a closer pass of the rollback path in that context.
Won't this provide both inflight and requested?
Yes, that was the intention. One of the places where we use it is rolling back pending commits.
Don't we have similar logic in the timeline class itself? Can we consolidate there?
Refactored a bit so it can be used here.
@vinothchandar @n3nash : Redid the migration handling and addressed your comments. High-Level Changes since the previous review.
vinothchandar left a comment
Mostly final cosmetic changes. One clarification: existing tables have to explicitly opt in for this, right?
You can merge once you do the final round and push again.
Even writers use HoodieTableMetaClient, right? Can you clarify this comment?
Agree, this is misleading. What I meant was MetaClient readers (use cases that just list the .hoodie folder) as opposed to MetaClient writers (performing action transitions in the .hoodie folder). Will remove this comment.
It seems applyLayoutVersionFilters is set selectively depending on which HoodieActiveTimeline constructor is invoked? Would this be fragile? Thinking out loud: applying filters on V0 has no effect since there is nothing to get rid of. The only thing that could go wrong is not filtering V1... hmmm
The key case here is archival, where we need all instants without filtering. Maybe introduce a couple of factory methods which instantiate HoodieActiveTimeline w/o filtering? HUDI-414
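To make the filtering question above concrete, here is a minimal sketch of what a layout-version filter could look like, with two factory-style entry points (one filtered, one unfiltered for archival). Everything in it is an assumption for illustration: `LayoutFilterSketch`, `Instant`, `State`, `forReaders` and `forArchival` are hypothetical names, not Hudi's actual classes or methods, and the model ignores details such as compaction instants changing action on completion.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of timeline instants; not Hudi's real API.
final class LayoutFilterSketch {
    // Declaration order doubles as progression order: REQUESTED < INFLIGHT < COMPLETED.
    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static final class Instant {
        final String timestamp;
        final String action;
        final State state;

        Instant(String timestamp, String action, State state) {
            this.timestamp = timestamp;
            this.action = action;
            this.state = state;
        }

        @Override
        public String toString() {
            return timestamp + "." + action + "." + state;
        }
    }

    // With write-once state files, one instant can appear in several states.
    // Keep only the most advanced state per (timestamp, action).
    static List<Instant> latestStatePerInstant(List<Instant> all) {
        Map<String, Instant> latest = new LinkedHashMap<>();
        for (Instant candidate : all) {
            String key = candidate.timestamp + "|" + candidate.action;
            latest.merge(key, candidate,
                (kept, next) -> kept.state.compareTo(next.state) >= 0 ? kept : next);
        }
        return new ArrayList<>(latest.values());
    }

    // Factory-style entry point for regular readers and writers: filtered view.
    static List<Instant> forReaders(List<Instant> all) {
        return latestStatePerInstant(all);
    }

    // Factory-style entry point for archival: every state file, unfiltered.
    static List<Instant> forArchival(List<Instant> all) {
        return new ArrayList<>(all);
    }
}
```

On a V0-style listing, where each instant appears in exactly one state, `forReaders` returns the input unchanged, which matches the observation that filtering V0 has no effect.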
@bvaradar Feel free to merge when you feel this is ready
[HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset
With this PR, Hudi Timeline management no longer uses rename to mark state transitions. As renames can be non-atomic in some cloud stores, this PR addresses this issue in a clean way.
Related Changes:
- Introduce a new metadata layout version in Hudi table properties and use it to determine whether renames should be used while writing. Any existing table created prior to 0.5.1 will preserve the old semantics. Newer tables created after 0.5.1 will automatically avoid renames. Hudi query engine integration should be able to handle both cases. We expect deployments to upgrade query engines first, before upgrading writers.
- As the new format enforces write-once semantics, there is no longer any need to write compaction and cleaner plans in both places (.hoodie and .hoodie/.aux). The code changes handle this.
- Commits/DeltaCommits also follow requested -> inflight -> completed state transitions. Rollback for the "requested" state (failure during index lookup) is trivial, as no side effects have happened.
- Commit archiving handles the case of intermediate state files also being present.
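As an illustration of the write-once idea described above, the following is a minimal sketch in which each state transition creates a new marker file instead of renaming an existing one. It is not the code in this PR: the class `WriteOnceTimelineSketch`, its methods, and the `<instant>.commit.<state>` file-name scheme are assumptions made for the example, and it runs against the local filesystem via `java.nio.file` rather than Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of rename-free state tracking; names are illustrative only.
final class WriteOnceTimelineSketch {
    // Progression order of states for a single instant.
    static final String[] STATES = {"requested", "inflight", "completed"};

    // Assumed naming scheme for the example: <instantTime>.commit.<state>
    static Path stateFile(Path metaDir, String instantTime, String state) {
        return metaDir.resolve(instantTime + ".commit." + state);
    }

    // Record a transition by creating a new file; earlier state files stay untouched.
    // Files.createFile fails if the file already exists, so each state is written once.
    static Path markState(Path metaDir, String instantTime, String state) {
        try {
            return Files.createFile(stateFile(metaDir, instantTime, state));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The current state is the most advanced state file that exists, or null if none.
    static String currentState(Path metaDir, String instantTime) {
        for (int i = STATES.length - 1; i >= 0; i--) {
            if (Files.exists(stateFile(metaDir, instantTime, STATES[i]))) {
                return STATES[i];
            }
        }
        return null;
    }

    // Rolling back an instant that never left "requested" only removes its marker,
    // since no data was written yet. Returns false if the instant is in another state.
    static boolean rollbackIfOnlyRequested(Path metaDir, String instantTime) {
        if (!"requested".equals(currentState(metaDir, instantTime))) {
            return false;
        }
        try {
            return Files.deleteIfExists(stateFile(metaDir, instantTime, "requested"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a full requested -> inflight -> completed sequence, all three marker files remain in the directory, which is why archival in this model has to expect intermediate state files alongside the completed one.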