-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-33402] Hybrid Source Concurrency Race Condition Fixes and Related Bugs Results in Data Loss #23687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@flinkbot run azure |
1 similar comment
|
@flinkbot run azure |
| } | ||
| Integer lastIndex = null; | ||
| for (Integer sourceIndex : readerSourceIndex.values()) { | ||
| if (lastIndex != null && lastIndex != sourceIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to merge this fix as a separate change since it had been reported in the past and is unrelated to the race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct; Created a separate PR for that here
|
@varun1729DD let's start with the enumerator. I'm probably missing something. From the logs that you had posted in JIRA, it appears that all events are processed by the same thread |
Hmm... can you point me to what is indicating that Enumerator is single threaded? |
|
@varun1729DD the thread name of the coordinator in the logs is I looked at |
|
@tweise I am coming back to this one; got busy in the middle. |
|
I tested the following synchronization combinations: |
|
@varun1729DD thanks for getting back to this. I would be very interested to see where the concurrency issue occurs. If you tested with enumerator only and that did not fix the issue, then that seems to confirm that there isn't any issue on the enumerator side as we expect that execution is single threaded (always same source coordinator thread). If you add synchronization to the reader only and the error goes away, the we can dig into that. |
|
Hi @tweise |
|
Continued here: #24055 |
What is the purpose of the change
(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)
Brief change log
(for example:)
Verifying this change
Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
(Please pick either of the following options)
This change is a trivial rework / code cleanup without any test coverage.
(or)
This change is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Does this pull request potentially affect one of the following parts:
@Public(Evolving): NoDocumentation