@varun1729DD
Contributor

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store on job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): No
  • The public API, i.e., is any changed class annotated with @Public(Evolving): No
  • The serializers: No
  • The runtime per-record code paths (performance sensitive): No
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: No
  • The S3 file system connector: No

Documentation

  • Does this pull request introduce a new feature? No
  • If yes, how is the feature documented? Not applicable

@flinkbot
Collaborator

flinkbot commented Nov 9, 2023

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

@varun1729DD
Contributor Author

@flinkbot run azure

@tweise self-requested a review November 13, 2023 15:49
}
Integer lastIndex = null;
for (Integer sourceIndex : readerSourceIndex.values()) {
    if (lastIndex != null && lastIndex != sourceIndex) {
Contributor

We may want to merge this fix as a separate change since it had been reported in the past and is unrelated to the race condition.
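
The thread does not spell out what the fix changes. Independent of that, the quoted condition applies `!=` to two `Integer` operands, which in Java compares object references rather than values. A minimal sketch of the difference, assuming nothing about the actual patch (the class and method names are hypothetical):

```java
import java.util.Objects;

class BoxedIndexComparison {
    // Mirrors the quoted condition: != on Integer operands compares object identity.
    static boolean differsByReference(Integer lastIndex, Integer sourceIndex) {
        return lastIndex != null && lastIndex != sourceIndex;
    }

    // Null-safe comparison of the boxed values themselves.
    static boolean differsByValue(Integer lastIndex, Integer sourceIndex) {
        return lastIndex != null && !Objects.equals(lastIndex, sourceIndex);
    }

    public static void main(String[] args) {
        // Only values in -128..127 are guaranteed to be cached, so these two
        // are usually distinct objects even though they hold the same value.
        Integer a = Integer.valueOf(1000);
        Integer b = Integer.valueOf(1000);
        System.out.println("by reference: " + differsByReference(a, b));
        System.out.println("by value:     " + differsByValue(a, b));
    }
}
```

With small indices the two variants agree because of the boxing cache, which is why a reference comparison of this kind can go unnoticed for a long time.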

Contributor Author

Correct; I created a separate PR for that.

@tweise
Contributor

tweise commented Nov 13, 2023

@varun1729DD let's start with the enumerator. I'm probably missing something. From the logs that you had posted in JIRA, it appears that all events are processed by the same thread SourceCoordinator-Source: parquet-source - if that is so, then why do we need locking? The locking code is intrusive and ideally best avoided as the mailbox model was designed to solve this.
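
The argument above can be illustrated without any Flink classes. The following is a rough sketch, not the coordinator's actual implementation: a single-threaded executor stands in for the mailbox, and the class name, thread name, and counter are all hypothetical. Several producer threads enqueue events, but only the one mailbox thread ever touches the state, so a plain field needs no lock.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SingleThreadMailboxSketch {
    // Hypothetical stand-in for the coordinator thread; not a Flink API.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor(
            r -> new Thread(r, "SourceCoordinator-Source: demo-source"));

    // Plain, unsynchronized state: only the mailbox thread reads or writes it.
    private int handledEvents = 0;

    // Callable from any thread; the mutation itself runs on the mailbox thread.
    void handleEvent() {
        mailbox.execute(() -> handledEvents++);
    }

    // Reads the counter on the mailbox thread, after all queued events.
    int snapshot() {
        try {
            return mailbox.submit(() -> handledEvents).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    static int run(int producers, int eventsPerProducer) {
        SingleThreadMailboxSketch sketch = new SingleThreadMailboxSketch();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                for (int i = 0; i < eventsPerProducer; i++) {
                    sketch.handleEvent();
                }
            });
            threads[p].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
            return sketch.snapshot();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            sketch.mailbox.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // prints 40000
    }
}
```

No increment is lost even though four threads submit concurrently. The guarantee only holds if every access to the state goes through the mailbox; a single direct call from another thread breaks it.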

@varun1729DD
Contributor Author

> @varun1729DD let's start with the enumerator. I'm probably missing something. From the logs that you had posted in JIRA, it appears that all events are processed by the same thread SourceCoordinator-Source: parquet-source - if that is so, then why do we need locking? The locking code is intrusive and ideally best avoided as the mailbox model was designed to solve this.

Hmm... can you point me to what indicates that the enumerator is single-threaded? SourceCoordinator-Source: parquet-source doesn't show any thread info as far as I can see, though I could be missing something. Also, can you share a code pointer to the higher-level mailbox logic?

@tweise
Contributor

tweise commented Nov 14, 2023

@varun1729DD the thread name of the coordinator in the logs is [SourceCoordinator-Source: parquet-source] - can you please check if there is any other thread executing enumerator code?

I looked at HybridSourceITCase and all enumerator actions are performed in the coordinator thread [SourceCoordinator-Source: hybrid-source]
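
One way to run the check requested above is to record the name of every thread that enters the enumerator code and inspect the resulting set. The sketch below is illustrative only: `CallerThreadAudit`, `enumeratorAction`, and the thread names are hypothetical and not taken from Flink or from HybridSourceITCase.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CallerThreadAudit {
    // Thread-safe set, since the point is to detect unexpected callers.
    private final Set<String> seenThreads = ConcurrentHashMap.newKeySet();

    // Hypothetical enumerator method instrumented to record its caller.
    void enumeratorAction() {
        seenThreads.add(Thread.currentThread().getName());
    }

    // Returns the names of all threads that executed enumeratorAction.
    static Set<String> audit(boolean withStrayCaller) {
        CallerThreadAudit recorder = new CallerThreadAudit();
        ExecutorService coordinator = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "SourceCoordinator-Source: demo-source"));
        try {
            for (int i = 0; i < 100; i++) {
                coordinator.execute(recorder::enumeratorAction);
            }
            if (withStrayCaller) {
                // Simulates a second thread calling into the enumerator directly.
                Thread stray = new Thread(recorder::enumeratorAction, "stray-caller");
                stray.start();
                stray.join();
            }
            coordinator.shutdown();
            coordinator.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            coordinator.shutdown();
        }
        return recorder.seenThreads;
    }

    public static void main(String[] args) {
        System.out.println(audit(false));
        System.out.println(audit(true));
    }
}
```

If the set ever holds more than one name, some code path reaches the enumerator outside the coordinator thread, and that path is where to look for the race.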

@varun1729DD
Contributor Author

@tweise I am coming back to this one; I got busy in the meantime.
I wonder if synchronization can be restricted to just the readers.

@varun1729DD
Contributor Author

I tested the following synchronization combinations:

  • Enumerator only -> Did not work
  • Enumerator and Reader -> Works
  • Reader only -> Not tested yet

@tweise
Contributor

tweise commented Dec 8, 2023

@varun1729DD thanks for getting back to this. I would be very interested to see where the concurrency issue occurs. If you tested with enumerator only and that did not fix the issue, then that seems to confirm that there isn't any issue on the enumerator side, as we expect that execution is single-threaded (always the same source coordinator thread).

If you add synchronization to the reader only and the error goes away, then we can dig into that.

@varun1729DD
Contributor Author

Hi @tweise,
I am able to reproduce this issue with changes to just the reader; the error does not go away.
The PR shows as closed because I reset my fork.
I am trying to put together a test case.

@varun1729DD
Contributor Author

Continued here: #24055
