[FLINK-33402] Hybrid Source Concurrency Race Condition Fixes and Related Bugs Results in Data Loss #23687

varun1729DD · 2023-11-09T00:10:39Z

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

The TaskInfo is stored in the blob store on job creation time as a persistent artifact
Deployments RPC transmits only the blob storage reference
TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (100MB)
Extended integration test for recovery after master (JobManager) failure
Added test that validates that TaskInfo is transferred only once across recoveries
Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): No
The public API, i.e., is any changed class annotated with @Public(Evolving): No
The serializers: No
The runtime per-record code paths (performance sensitive): No
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: No
The S3 file system connector: No

Documentation

Does this pull request introduce a new feature? No
If yes, how is the feature documented? Not applicable

flinkbot · 2023-11-09T00:17:16Z

CI report:

2e8f95b Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

varun1729DD · 2023-11-09T00:34:43Z

@flinkbot run azure

varun1729DD · 2023-11-09T21:00:55Z

@flinkbot run azure

tweise · 2023-11-13T16:04:10Z

...src/main/java/org/apache/flink/connector/base/source/hybrid/HybridSourceSplitEnumerator.java

            }
            Integer lastIndex = null;
            for (Integer sourceIndex : readerSourceIndex.values()) {
-                if (lastIndex != null && lastIndex != sourceIndex) {


We may want to merge this fix as a separate change since it had been reported in the past and is unrelated to the race condition.

Correct; Created a separate PR for that here
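For readers following the thread: the removed line compares two boxed `Integer` values with `!=`, which in Java compares object references rather than numeric values. The replacement line is not shown in this excerpt, so the following is only a sketch of one common way to write such a check by value, not necessarily what the separate PR does. The class and method names (`SourceIndexCheck`, `allOnSameSource`) are hypothetical; only `readerSourceIndex`, `lastIndex`, and `sourceIndex` come from the diff above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class SourceIndexCheck {

    /**
     * Returns true when every reader reports the same source index.
     * Uses Objects.equals so boxed Integers are compared by value,
     * not by reference (hypothetical helper, not Flink code).
     */
    public static boolean allOnSameSource(Map<Integer, Integer> readerSourceIndex) {
        Integer lastIndex = null;
        for (Integer sourceIndex : readerSourceIndex.values()) {
            if (lastIndex != null && !Objects.equals(lastIndex, sourceIndex)) {
                return false;
            }
            lastIndex = sourceIndex;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> readers = new LinkedHashMap<>();
        // Values outside the default Integer cache range are typically
        // distinct objects, so a reference comparison could report a mismatch.
        readers.put(0, Integer.valueOf(1000));
        readers.put(1, Integer.valueOf(1000));
        System.out.println(allOnSameSource(readers)); // prints "true"
    }
}
```

With a value comparison, the result no longer depends on whether two equal indices happen to be the same cached object.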

tweise · 2023-11-13T16:07:55Z

@varun1729DD let's start with the enumerator. I'm probably missing something. From the logs that you had posted in JIRA, it appears that all events are processed by the same thread SourceCoordinator-Source: parquet-source - if that is so, then why do we need locking? The locking code is intrusive and ideally best avoided as the mailbox model was designed to solve this.

varun1729DD · 2023-11-14T01:29:37Z

> @varun1729DD let's start with the enumerator. I'm probably missing something. From the logs that you had posted in JIRA, it appears that all events are processed by the same thread SourceCoordinator-Source: parquet-source - if that is so, then why do we need locking? The locking code is intrusive and ideally best avoided as the mailbox model was designed to solve this.

Hmm... can you point me to what is indicating that Enumerator is single threaded? SourceCoordinator-Source: parquet-source doesn't show any thread info. I could be missing something. Also, can you share a code-pointer to the higher-level mailbox logic?

tweise · 2023-11-14T22:04:24Z

@varun1729DD the thread name of the coordinator in the logs is [SourceCoordinator-Source: parquet-source] - can you please check if there is any other thread executing enumerator code?

I looked at HybridSourceITCase and all enumerator actions are performed in the coordinator thread [SourceCoordinator-Source: hybrid-source]
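The argument here is that state touched only from one thread needs no locks. The sketch below illustrates that idea in plain Java and shows one way to check which threads run a piece of code by recording `Thread.currentThread().getName()`. It is not Flink's coordinator or mailbox implementation; the class name, the method `runEvents`, and the thread name `SourceCoordinator-demo` are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleThreadedCoordinatorSketch {

    /**
     * Submits eventCount events to a single named thread and returns the
     * names of all threads that touched the (unsynchronized) state.
     */
    public static Set<String> runEvents(int eventCount) {
        ExecutorService coordinator =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "SourceCoordinator-demo"));
        // Plain, non-thread-safe collections: only the coordinator thread mutates them.
        List<Integer> handled = new ArrayList<>();
        Set<String> threadNames = new HashSet<>();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < eventCount; i++) {
                final int event = i;
                futures.add(coordinator.submit(() -> {
                    handled.add(event);
                    threadNames.add(Thread.currentThread().getName());
                }));
            }
            // Waiting on each future makes the coordinator's writes visible here.
            for (Future<?> f : futures) {
                f.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            coordinator.shutdown();
        }
        return threadNames;
    }

    public static void main(String[] args) {
        System.out.println(runEvents(100)); // prints "[SourceCoordinator-demo]"
    }
}
```

If a second thread name ever showed up in such a set for the enumerator code, that would be the evidence needed to justify locking there.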

varun1729DD · 2023-12-08T19:54:14Z

@tweise I am coming back to this one; got busy in the middle.
I wonder if synchronization can be restricted to just the readers

varun1729DD · 2023-12-08T19:54:55Z

I tested the following synchronization combinations:
Enumerator only -> did not work
Enumerator and reader -> works
Reader only -> not tested yet

tweise · 2023-12-08T22:27:19Z

@varun1729DD thanks for getting back to this. I would be very interested to see where the concurrency issue occurs. If you tested with enumerator only and that did not fix the issue, then that seems to confirm that there isn't any issue on the enumerator side as we expect that execution is single threaded (always same source coordinator thread).

If you add synchronization to the reader only and the error goes away, then we can dig into that.

varun1729DD · 2024-01-09T09:29:57Z

Hi @tweise,
I am able to reproduce this issue with changes to just the reader; the error does not go away.
The PR shows as closed because I reset my fork.
I am trying to put together a test case.

varun1729DD · 2024-01-10T01:49:43Z

Continued here: #24055

tweise self-requested a review November 13, 2023 15:49

tweise reviewed Nov 13, 2023

varun1729DD closed this Jan 9, 2024

varun1729DD force-pushed the master branch from 2e8f95b to e07545e Compare January 9, 2024 08:28

varun1729DD mentioned this pull request Jan 10, 2024

[FLINK-33402] 2 CONTINUED #24055

Closed
