[WIP] HUDI-644 Implement checkpoint generator helper tool #1362
Conversation
Codecov Report
| | master | #1362 | +/- |
|---|---|---|---|
| Coverage | ? | 67.09% | |
| Complexity | ? | 224 | |
| Files | ? | 333 | |
| Lines | ? | 16217 | |
| Branches | ? | 1659 | |
| Hits | ? | 10880 | |
| Misses | ? | 4600 | |
| Partials | ? | 737 | |
Continue to review full report at Codecov.
@garyli1019 If I understand it correctly, you are talking of a use case where you are using HoodieDeltaStreamer along with the Spark data source as a backup. Why do you want to have two different pipelines writing to the same destination path? If you really want a backup to prevent any data loss, you can write to a separate path using the Spark data source and continue using DeltaStreamer to write to the Hudi dataset. In case of any issues, you can always use CHECKPOINT_RESET_KEY to ingest the data from your backup path into your Hudi dataset path. We have support for Kafka as well as a DFS source for this purpose. Also, what is the source for your homebrew Spark job? If it is also consuming from Kafka, then I do not see any case where using DeltaStreamer can result in data loss. Can you please explain why you want to use two pipelines writing to the same destination path?
@pratyakshsharma Thanks for reviewing this PR.
So right now, if I switch to the delta streamer directly ingesting from Kafka, it will start from the latest offset by default, and I will lose the messages that arrived before the switch.
I think running the parallel jobs once sounds a little bit hacky. The best way should be to generate the checkpoint string and pass it to the delta streamer on the first run. For that, I would need to write a checkpoint generator that scans all the files produced by Kafka Connect. This is definitely doable but needs some effort.
@garyli1019 I still feel all these challenges are arising because you are trying to ingest data into the same dataset using two different Spark jobs.
I am a bit skeptical of trying to use two pipelines to write to the same destination path. Additionally, we have options available for taking a backup of your Hudi dataset or for migrating an existing dataset to Hudi. Anyway, if you strongly feel the need to write this checkPointGenerator, let us hear the opinion of @leesf and @vinothchandar on this before proceeding.
@pratyakshsharma So let's forget about my homebrew Spark data source reader. Let's assume I am using the delta streamer consuming a DFS source, and now I'd like to switch to the delta streamer consuming a Kafka source. Data arrives at DFS and Kafka asynchronously; the DFS source has a 30 minute delay relative to Kafka.
Currently, I couldn't find a clean way to make that switch.
Let me put forward my viewpoint on this. When I was in the phase of adopting Hudi, I kept my already-running pipeline writing to some path and started DeltaStreamer writing to some other path. Then I did validation every day for some period of time, to gain enough confidence in the framework before completely switching to Hudi.

Coming to your point of switching from Kafka -> HDFS raw parquet -> Hudi table to Kafka -> Hudi table: I was thinking of a similar use case some time back, and the simplest thing I could think of was to support per-source checkpoints for a Hudi dataset. Currently we store the checkpoint under "deltastreamer.checkpoint.key" in the .commit file, and this variable stores the checkpoint in a source-specific format, which creates problems when you try to switch sources for the same dataset. So if we simply introduce more variables like this, each storing the checkpoint for its corresponding source, this use case can be solved with minimal effort. And yes, this needs a development cycle, since what I am proposing is not supported as of now. WDYT?

Currently, to handle such scenarios, we have "deltastreamer.checkpoint.reset_key" configurable for every DeltaStreamer run, and you can hack around these two variables ("deltastreamer.checkpoint.key" and "deltastreamer.checkpoint.reset_key") to get your use case solved, but a clean solution would be what I proposed above. The proposed solution also works well in cases where you want to switch sources quite frequently. Also would like to hear from @leesf and @vinothchandar on this.
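To make the proposal concrete, here is a tiny sketch of the idea. The `.kafka`/`.dfs` key suffixes and the example checkpoint values are hypothetical (only `deltastreamer.checkpoint.key` exists today):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the per-source checkpoint proposal above. The suffixed key names
// are invented for illustration; only "deltastreamer.checkpoint.key" exists today.
public class PerSourceCheckpoints {
  public static Map<String, String> exampleCommitExtraMetadata() {
    Map<String, String> extraMetadata = new HashMap<>();
    // Kafka source: per-partition offsets in the "topic,partition:offset,..." format.
    extraMetadata.put("deltastreamer.checkpoint.key.kafka", "my-topic,0:1052,1:978");
    // DFS source: a timestamp-style checkpoint (illustrative value).
    extraMetadata.put("deltastreamer.checkpoint.key.dfs", "1583850600000");
    // Switching sources would read the entry for the new source instead of
    // clobbering a single shared key.
    return extraMetadata;
  }
}
```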
Yeah, I definitely agree that there is some work to do to improve the migration process to the delta streamer.
With this flexibility, I believe users will be able to use the delta streamer in a more programmatic way.
Let me catch up on this discussion and circle back.. :) Just one high-level question (apologies if it's already answered above): why can't we use the checkpoint reset flag, if one-time manual restarts are needed for the delta streamer? Is it because it's hard to compute that?
Right. I need a robust way to generate the checkpoint from kafka-connect-hdfs managed files, and kafka-connect itself sometimes has trouble retrieving the checkpoint when the number of Kafka partitions is large. The mechanism is to scan every single file and get the latest offset for each Kafka partition.
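A minimal sketch of that scan, assuming kafka-connect-hdfs's default committed-file naming (`<topic>+<partition>+<startOffset>+<endOffset>.<ext>`), a flat directory for brevity, the `topic,partition:offset,...` checkpoint string format the DeltaStreamer Kafka source expects, and that the resume offset is the last written offset plus one; the class name is invented:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of the proposed generator: derive a DeltaStreamer Kafka checkpoint
// string from kafka-connect-hdfs output files.
public class ConnectHdfsCheckpointGenerator {

  // Default committed-file name: <topic>+<partition>+<startOffset>+<endOffset>.<ext>
  private static final Pattern DATA_FILE = Pattern.compile("(.+)\\+(\\d+)\\+(\\d+)\\+(\\d+)\\.[^.]+");

  public static String generate(String topic, File dir) {
    File[] files = dir.listFiles();
    if (files == null) {
      throw new IllegalArgumentException(dir + " is not a readable directory");
    }
    Map<Integer, Long> nextOffsets = new HashMap<>();
    for (File f : files) {
      Matcher m = DATA_FILE.matcher(f.getName());
      if (m.matches() && m.group(1).equals(topic)) {
        int partition = Integer.parseInt(m.group(2));
        long next = Long.parseLong(m.group(4)) + 1; // end offset is inclusive
        nextOffsets.merge(partition, next, Math::max);
      }
    }
    // Assumed checkpoint format: "topic,partition:offset,partition:offset,..."
    return topic + "," + nextOffsets.entrySet().stream()
        .map(e -> e.getKey() + ":" + e.getValue())
        .collect(Collectors.joining(","));
  }
}
```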
Okay, caught up now.. Firstly, writing in parallel using two jobs is a dangerous thing, as Hudi does not support such multi-writer access. I would advise against it (although you could hack it to work if you tried hard enough).. @garyli1019 we can definitely add tooling to generate checkpoints in the format that DeltaStreamer expects.. But I would like to decouple that from the delta streamer itself.. I favor keeping it simple, with just a single knob for the user wanting to override the checkpoint.. There is already an option to override the checkpoint, I believe..
Would like to understand this more in general.. For DFS sources, all you need is a timestamp, right? And for Kafka, you need to call the consumer API that maps a timestamp to per-partition offsets?
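The lookup being alluded to here is presumably `KafkaConsumer#offsetsForTimes`. A minimal sketch, with broker address, group id, and topic name as placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

// Sketch: resolve a wall-clock timestamp to per-partition offsets using the
// standard Kafka consumer API.
public class TimestampToOffsets {
  public static Map<TopicPartition, OffsetAndTimestamp> offsetsAt(String topic, long timestampMs) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder
    props.put("group.id", "checkpoint-probe");        // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      Map<TopicPartition, Long> query = new HashMap<>();
      for (PartitionInfo p : consumer.partitionsFor(topic)) {
        query.put(new TopicPartition(topic, p.partition()), timestampMs);
      }
      // A partition with no message at/after the timestamp maps to null here --
      // this is exactly the "stuck partition" edge case raised below.
      return consumer.offsetsForTimes(query);
    }
  }
}
```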
@vinothchandar @pratyakshsharma Agree that running a non-delta-streamer commit to fix the data gap sounds a bit hacky. I think I can get the checkpoint from a previous commit myself and pass it to the delta streamer as the checkpoint override. The ideal migration process for my use case would be to generate the checkpoint from the existing files first and pass it to the delta streamer on its first run.
Thanks for the hint on the Kafka API :) One edge case that would stop me from using it: if a kafka-connector is stuck on one partition while the other partitions are fine, and I pick a timestamp not earlier than the stuck partition, then I might lose some messages there.
@garyli1019 I understand what you are getting at.. We had a similar issue cutting over pipelines, and we handled it by having the ability to force a checkpoint for a single run of the delta streamer.. So my guess is, we will explore a way to generate checkpoints from other mechanisms like connect-hdfs?
@vinothchandar right. Step 1: implement the tool. Step 2: find a way to integrate it with the initial bulk insert or the HDFS importer. That way we can provide a delta streamer migration guide to users.
Given that, do we still need the ability to search for the checkpoints in reverse time order? tbh I don't see value in it, since there cannot be multiple writers to a Hudi table anyway.
Maybe not anymore? If I have a tool to tell me where the checkpoint is, I can use the existing checkpoint override option directly.
No, not at the moment.. We can close this PR out if you agree.
OK, I will make a separate PR for the tool. Thanks everyone who participated in this long discussion...
No, thank you.. This kind of stuff gives me energy to keep pushing more :)
What is the purpose of the pull request
This PR is to resolve the following problem:
The user is using a homebrew Spark data source to read new data and write to a Hudi table.
The user would like to migrate to the Delta Streamer.
But the Delta Streamer only checks the last commit's metadata; if there is no checkpoint info there, the Delta Streamer falls back to the default, which for a Kafka source is LATEST.
The user would like to run the homebrew Spark data source reader and the Delta Streamer in parallel to prevent data loss, but the Spark data source writer makes commits without checkpoint info, which resets the Delta Streamer's checkpoint.
So an option that allows the user to retrieve the checkpoint from previous commits, instead of only the latest commit, would be helpful for the migration (see the sketch below).
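A minimal sketch of that reverse lookup, assuming the 0.5.x-era timeline APIs (class and method names may differ from the actual patch):

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.HoodieTimeline;
import org.apache.hudi.common.table.timeline.HoodieInstant;

// Sketch: walk completed commits newest-first and return the first checkpoint
// found, skipping commits written by jobs (e.g. a plain Spark datasource
// writer) that did not record one.
public class ReverseCheckpointLookup {
  private static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

  public static String findLatestCheckpoint(org.apache.hadoop.conf.Configuration conf, String basePath)
      throws IOException {
    HoodieTableMetaClient metaClient = new HoodieTableMetaClient(conf, basePath);
    HoodieTimeline commits = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
    List<HoodieInstant> instants = commits.getInstants().collect(Collectors.toList());
    for (int i = instants.size() - 1; i >= 0; i--) {
      HoodieCommitMetadata meta = HoodieCommitMetadata.fromBytes(
          commits.getInstantDetails(instants.get(i)).get(), HoodieCommitMetadata.class);
      String checkpoint = meta.getMetadata(CHECKPOINT_KEY);
      if (checkpoint != null && !checkpoint.isEmpty()) {
        return checkpoint;
      }
    }
    return null; // no DeltaStreamer-written commit found
  }
}
```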
Brief change log
Verify this pull request
This pull request is a trivial rework / code cleanup without any test coverage.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.