[HUDI-6005] Auto generate client id for Flink multi writer #8323
Merged
Flink currently lacks a good mechanism for sending notifications from the JobManager (JM) to TaskManager (TM) tasks. To work around this, we introduced two kinds of file-system-based messages written on the JM:

1. the view_storage_conf file, which keeps the remote file system view storage properties;
2. the ckp_metadata, which is used to fetch pending instants in a lightweight way.

Both messages work well in the single-writer scenario, where we simply use a fixed file path for these files and directories. In the multi-writer scenario, however, several writers write to the same message file; the resulting conflicts can corrupt the file and affect the writers that are already running.

HUDI-5673 introduced the option 'write.client.id' for manual conflict resolution: the user has to configure a client id for each of the writers. This works but is very inconvenient in production.

This patch tries to infer the client id automatically when the job is submitted from the client machine. On startup, each job first competes for a client id. Once the job is running, the coordinator on the JM sends a heartbeat message for its client id at a fixed interval (1 minute by default). The heartbeat is mainly used to decide whether the job holding a client id is still alive; if it is not, the client id can be recycled and reused.
XuQianJin-Stars approved these changes on Mar 31, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request on Apr 5, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request on Apr 20, 2023
KnightChess pushed a commit to KnightChess/hudi that referenced this pull request on Jan 2, 2024
Change Logs
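Flink currently lacks a good mechanism for sending notifications from the JobManager (JM) to TaskManager (TM) tasks. To work around this, we introduced two kinds of file-system-based messages written on the JM:

1. the view_storage_conf file, which keeps the remote file system view storage properties;
2. the ckp_metadata, which is used to fetch pending instants in a lightweight way.
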
Both messages work well in the single-writer scenario, where we simply use a fixed file path for these files and directories. In the multi-writer scenario, however, several writers write to the same message file; the resulting conflicts can corrupt the file and affect the writers that are already running.
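
To make the conflict concrete, here is a minimal sketch of how a client id could keep the message paths of different writers apart. The helper `messagePath`, the `_<clientId>` suffix, and the `.aux` directory are assumptions made for illustration; they are not taken from the patch.

```java
/** Hypothetical helper: derives a per-writer message path from a client id. */
final class MessagePathSketch {

  /**
   * Returns auxDir/baseName for an empty client id (the fixed single-writer path)
   * and auxDir/baseName_clientId otherwise, so that two writers never share a file.
   * The suffix scheme is an assumption, not necessarily what Hudi uses.
   */
  static String messagePath(String auxDir, String baseName, String clientId) {
    String suffix = (clientId == null || clientId.isEmpty()) ? "" : "_" + clientId;
    return auxDir + "/" + baseName + suffix;
  }

  public static void main(String[] args) {
    String auxDir = "/tmp/hudi_table/.aux"; // assumed location, for illustration only
    System.out.println(messagePath(auxDir, "ckp_metadata", ""));
    System.out.println(messagePath(auxDir, "ckp_metadata", "1"));
    System.out.println(messagePath(auxDir, "view_storage_conf", "2"));
  }
}
```

With an empty client id every writer resolves to the same path, which is the situation that leads to corruption; distinct ids give each writer its own file.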
HUDI-5673 introduced the option 'write.client.id' for manual conflict resolution: the user has to configure a client id for each of the writers. This works but is very inconvenient in production.
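
As a rough illustration of the manual approach, the sketch below builds the options for two writers of the same table and checks that their client ids differ. Only the option key 'write.client.id' comes from the description above; the class, the helper methods, the table path, and the id values are hypothetical, and a plain `Map` stands in for the real Flink configuration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of per-writer options with a manually assigned client id. */
final class ManualClientIdSketch {

  /** Builds the options for one writer; every writer needs its own client id. */
  static Map<String, String> writerOptions(String tablePath, String clientId) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("path", tablePath);
    options.put("write.client.id", clientId);
    return options;
  }

  /** Returns true only if no two writers were given the same client id. */
  static boolean hasDistinctClientIds(List<Map<String, String>> writers) {
    Set<String> seen = new HashSet<>();
    for (Map<String, String> writer : writers) {
      if (!seen.add(writer.getOrDefault("write.client.id", ""))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    String table = "/tmp/hudi_table"; // assumed path, for illustration only
    List<Map<String, String>> writers =
        Arrays.asList(writerOptions(table, "1"), writerOptions(table, "2"));
    System.out.println(hasDistinctClientIds(writers));
  }
}
```

The inconvenience is visible even here: someone has to hand out the ids and make sure that no two jobs are ever started with the same one.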
This patch tries to infer the client id automatically when the job is submitted from the client machine. On startup, each job first competes for a client id. Once the job is running, the coordinator on the JM sends a heartbeat message for its client id at a fixed interval (1 minute by default).
The heartbeat is mainly used to decide whether the job holding a client id is still alive; if it is not, the client id can be recycled and reused.
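
The allocation and heartbeat scheme described above can be sketched as follows. This is not the code from the patch: the class `ClientIdLeaseSketch`, the one-file-per-id layout, the `client_<id>` file name, and the expiry timeout are assumptions, and a local directory stands in for the table's file system. The sketch also leaves open the race in which two jobs recycle the same expired id at the same moment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch: one heartbeat file per client id in a shared directory. */
final class ClientIdLeaseSketch {
  private final Path heartbeatDir;
  private final Duration timeout; // assumed: how long an id survives without a heartbeat

  ClientIdLeaseSketch(Path heartbeatDir, Duration timeout) {
    this.heartbeatDir = heartbeatDir;
    this.timeout = timeout;
  }

  /** Claims the smallest id that is unused or whose last heartbeat has expired. */
  int acquire(Instant now) {
    try {
      Files.createDirectories(heartbeatDir);
      for (int id = 0; ; id++) {
        Path file = heartbeatFile(id);
        try {
          Files.createFile(file); // fails if another job already created this id
        } catch (FileAlreadyExistsException e) {
          if (!isExpired(file, now)) {
            continue; // held by a live job, try the next id
          }
          // Heartbeat expired: the previous owner is presumed dead, recycle the id.
        }
        Files.setLastModifiedTime(file, FileTime.from(now));
        return id;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Refreshes the heartbeat of an id; meant to be called at a fixed interval. */
  void heartbeat(int id, Instant now) {
    try {
      Files.setLastModifiedTime(heartbeatFile(id), FileTime.from(now));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private boolean isExpired(Path file, Instant now) throws IOException {
    Instant last = Files.getLastModifiedTime(file).toInstant();
    return last.plus(timeout).isBefore(now);
  }

  private Path heartbeatFile(int id) {
    return heartbeatDir.resolve("client_" + id);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("client-ids");
    ClientIdLeaseSketch lease = new ClientIdLeaseSketch(dir, Duration.ofMinutes(5));
    Instant start = Instant.now();
    int id = lease.acquire(start);
    lease.heartbeat(id, start.plus(Duration.ofMinutes(1))); // 1-minute interval
    System.out.println("acquired client id " + id);
  }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real coordinator would read a clock and schedule `heartbeat` on a timer.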
Impact
no impact
Risk level (write none, low medium or high below)
none