-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-759] Integrate checkpoint provider with delta streamer #1486
Conversation
@pratyakshsharma @vinothchandar |
Codecov Report
@@ Coverage Diff @@
## master #1486 +/- ##
============================================
- Coverage 72.23% 72.15% -0.08%
- Complexity 289 294 +5
============================================
Files 338 373 +35
Lines 15947 16282 +335
Branches 1624 1638 +14
============================================
+ Hits 11519 11748 +229
- Misses 3700 3798 +98
- Partials 728 736 +8 Continue to review full report at Codecov.
|
LGTM |
Test added. Thanks for the review |
12957f3
to
0304288
Compare
Add #1493 into this PR. |
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few clarifications
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
if (cfg.initialCheckpointProvider != null && cfg.bootstrapFromPath != null && cfg.checkpoint == null) { | ||
InitialCheckPointProvider checkPointProvider = | ||
UtilHelpers.createInitialCheckpointProvider(cfg.initialCheckpointProvider, new Path(cfg.bootstrapFromPath), fs); | ||
cfg.checkpoint = checkPointProvider.getCheckpoint(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC setting cfg.checkpoint
will force use of that timestamp instead of what we normally do - read from the last commit?
Should we do this only when creating the dataset for the first time..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this depends on how we design the migration flow for the user.
What I did myself is I use Spark datasource to do a bulkInsert to convert all the plain parquet files to Hudi format, then the second job I'd like to use delta streamer to read from Kafka. So this initialCheckpointProvider should be the first delta streamer job when switching sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.. do you think if we made it such that even if someone runs delta streamer few times after initial bootstrap, the initial checkpoint provider is used just once? otherwise, you need to scramble to stop the delta streamer after the first run or manually run it by hand once before scheduling it using airflow or deploying in --continuous mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the initial checkpoint provider should be just used once when the user wants to switch from one source to another. After that, the delta streamer should be able to get the checkpoint from the previous commit. We can improve this once the bootstrap is ready. At this point, I am not sure how to put everything together if we want one step to handling everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay.. lets revisit once we have bootstrap support.. cc @bvaradar as fyi
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some suggestions.. lmk what you think
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Outdated
Show resolved
Hide resolved
...ilities/src/main/java/org/apache/hudi/utilities/checkpointing/InitialCheckPointProvider.java
Outdated
Show resolved
Hide resolved
hmm... Looks like the checkstyle auto-fix something...Let me see what's going... |
I added the save action and checkstyle as documented, not sure which one triggered all those |
Addressed some comments, summary:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert the non-essential fixes from the PR (final, this..) and so on., so its easier to review.. Those need more discussion if we are changing code style and applied uniformly..
hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style check tool automatically adds final and this to match the stylecheck.xml.
Please revert this change..
92cf0e7
to
94bb6ff
Compare
a660f04
to
51740fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through this @garyli1019 .. Hopefully just one more cycle.. and we are home.
...ilities/src/main/java/org/apache/hudi/utilities/checkpointing/InitialCheckPointProvider.java
Outdated
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
Show resolved
Hide resolved
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
Show resolved
Hide resolved
if (cfg.initialCheckpointProvider != null && cfg.bootstrapFromPath != null && cfg.checkpoint == null) { | ||
InitialCheckPointProvider checkPointProvider = | ||
UtilHelpers.createInitialCheckpointProvider(cfg.initialCheckpointProvider, new Path(cfg.bootstrapFromPath), fs); | ||
cfg.checkpoint = checkPointProvider.getCheckpoint(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.. do you think if we made it such that even if someone runs delta streamer few times after initial bootstrap, the initial checkpoint provider is used just once? otherwise, you need to scramble to stop the delta streamer after the first run or manually run it by hand once before scheduling it using airflow or deploying in --continuous mode?
@vinothchandar Thanks for all the feedback! Very helpful! |
@Parameter(names = {"--initial-checkpoint-provider"}, description = "Generate check point for delta streamer " | ||
+ "for the first run. This field will override the checkpoint of last commit using the checkpoint field. " | ||
+ "Use this field only when switch source, for example, from DFS source to Kafka Source. Check the class " | ||
+ "org.apache.hudi.utilities.checkpointing for details") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InitialCheckPointProvider
did you intend to write the name of the class here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should add InitialCheckPointProvider
here or we should remove this description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the description to match with --schemaprovider-class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.. 1 minor comment. once you respond/repush.. can merge
What is the purpose of the pull request
Integrate the initial checkpoint provider with delta streamer
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.