[FLINK-35024][Runtime/State] Implement the record buffer of AsyncExecutionController #24633

Merged (1 commit) on Apr 15, 2024

Conversation

Contributor

@fredia fredia commented Apr 8, 2024

What is the purpose of the change

As part of FLIP-425, this PR implements the record buffer of AsyncExecutionController.

Brief change log

  • Introduce activeBuffer, blockingBuffer, and inFlightRecordNum into AsyncExecutionController.
  • Control the number of in-flight records in AsyncExecutionController.

Verifying this change

  • Add AsyncExecutionControllerTest#testInFlightRecordControl

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (yes)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

Collaborator

flinkbot commented Apr 8, 2024

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Contributor

@Zakelly Zakelly left a comment

Thanks for the PR! I'm posting some concerns in advance.


/** Migrate the blocking requests to the active buffer. */
@VisibleForTesting
void migrateBlockingToActive() {
Contributor

Personally, I'd suggest a queueing mechanism, meaning that each blocked request queues under its key. When the reference count of the active request/context reaches zero, it pops one request from the queue for that key and puts it into the active buffer. I think this would be a high-performance implementation.

Contributor Author

Thanks for the suggestion. I added a new class, StateRequestsBuffer, which groups the state requests in the blocking buffer by key. Migrating one state request from the blocking buffer to the active buffer is triggered in RecordContext#disposer.
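For readers following the thread, the per-key queueing idea discussed above could be sketched roughly as follows. This is only an illustration and not the StateRequestsBuffer from this PR: the class name KeyedRequestBufferSketch, its method names (other than popActive, which appears in the diff), and the generic request type are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a buffer with per-key blocking queues; not the PR's class. */
final class KeyedRequestBufferSketch<K, R> {
    // Requests that are ready to be executed.
    private final Deque<R> active = new ArrayDeque<>();
    // Requests waiting behind an in-flight request, grouped by key in FIFO order.
    private final Map<K, Deque<R>> blocking = new HashMap<>();
    private int blockingSize = 0;

    void enqueueActive(R request) {
        active.addLast(request);
    }

    void enqueueBlocking(K key, R request) {
        blocking.computeIfAbsent(key, k -> new ArrayDeque<>()).addLast(request);
        blockingSize++;
    }

    /**
     * Intended to be called when the in-flight work for a key is released
     * (assumed trigger): moves at most one blocked request of that key to active.
     */
    Optional<R> tryActivateOne(K key) {
        Deque<R> queue = blocking.get(key);
        if (queue == null) {
            return Optional.empty();
        }
        R next = queue.pollFirst(); // non-null: empty queues are removed below
        if (queue.isEmpty()) {
            blocking.remove(key);
        }
        blockingSize--;
        active.addLast(next);
        return Optional.of(next);
    }

    /** Pops up to n requests from the active buffer, oldest first. */
    List<R> popActive(int n) {
        List<R> out = new ArrayList<>(Math.max(0, Math.min(n, active.size())));
        while (out.size() < n && !active.isEmpty()) {
            out.add(active.pollFirst());
        }
        return out;
    }

    int activeSize() {
        return active.size();
    }

    int blockingSize() {
        return blockingSize;
    }
}
```

With this shape, releasing a key costs one map lookup plus one queue pop, instead of a scan over every blocked request.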

Contributor

@Zakelly Zakelly left a comment

Thanks for your update! I left some further comments. PTAL.

// 3. Ensure the currentContext is restored.
setCurrentContext(storedContext);
inFlightRecordNum.incrementAndGet();
} catch (InterruptedException e) {
Contributor

IIUC, the InterruptedException should be ignored; otherwise, won't it produce a fatal error during a normal TM exit (required by the JM)?

Contributor Author

👍 Good catch. Thread.sleep(50); has been deleted, so the InterruptedException can no longer be thrown here.
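For context, the PR resolved this by removing the sleep altogether. Where a blocking wait has to stay, a common Java idiom is to treat the interrupt as a stop signal and restore the interrupt flag instead of escalating it as a failure. The sketch below is generic and not taken from the PR; the class and method names are hypothetical.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper showing interrupt-aware waiting; not code from this PR. */
final class InterruptAwareWaitSketch {

    /**
     * Polls until the condition holds or the thread is interrupted.
     *
     * @return true if the condition was met, false if the wait was interrupted
     */
    static boolean awaitOrStop(BooleanSupplier done, long pollMillis) {
        while (!done.getAsBoolean()) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the flag so callers can still observe the interrupt,
                // and return quietly instead of reporting a fatal error.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A caller can then decide whether a false return means an orderly shutdown or something that deserves logging.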

// be less than the max in-flight record number.
// Note: the currentContext may be updated by {@code StateFutureFactory#build}.
try {
while (inFlightRecordNum.get() > maxInFlightRecordNum) {
Contributor

I think we could clarify the definition of inFlightRecordNum. IIUC, currently inFlightRecordNum == keyAccountingUnit.occupiedCount. I'm not sure which size/count we should control: the records in the AEC or the running records?

Contributor Author

inFlightRecordNum is the number of records in the AEC, including the records in both the active buffer and the blocking buffer.
Also, inFlightRecordNum == keyAccountingUnit.occupiedCount may not always be true.

I rewrote the description and added some asserts in testRecordsRunInOrder and testBasicRun.
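To make the accounting above concrete (one counter covering every record held by the controller, capped by a maximum), here is a minimal Java sketch. The class InFlightLimiterSketch, its methods, and the use of Runnable as a stand-in for a buffered record are all hypothetical; the comparison used in the PR's loop and the way it drains requests may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of in-flight record control; not the AsyncExecutionController. */
final class InFlightLimiterSketch {
    private final int maxInFlightRecordNum;
    // Counts every record currently held, whether runnable or still waiting.
    private final AtomicInteger inFlightRecordNum = new AtomicInteger(0);
    // Stand-in for the controller's buffers (assumption: a single FIFO queue).
    private final Deque<Runnable> buffered = new ArrayDeque<>();

    InFlightLimiterSketch(int maxInFlightRecordNum) {
        if (maxInFlightRecordNum < 1) {
            throw new IllegalArgumentException("maxInFlightRecordNum must be >= 1");
        }
        this.maxInFlightRecordNum = maxInFlightRecordNum;
    }

    /** Admits a record, first finishing older records while the limit is reached. */
    void admit(Runnable record) {
        while (inFlightRecordNum.get() >= maxInFlightRecordNum) {
            finishOne();
        }
        buffered.addLast(record);
        inFlightRecordNum.incrementAndGet();
    }

    /** Runs the oldest buffered record, if any, and releases its slot. */
    boolean finishOne() {
        Runnable next = buffered.pollFirst();
        if (next == null) {
            return false;
        }
        next.run();
        inFlightRecordNum.decrementAndGet();
        return true;
    }

    void drainAll() {
        while (finishOne()) {
            // keep draining until the buffer is empty
        }
    }

    int inFlight() {
        return inFlightRecordNum.get();
    }
}
```

In this sketch the counter tracks buffered records rather than occupied keys, which is why it need not equal a per-key occupancy count when several records share a key.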

Contributor Author

fredia commented Apr 11, 2024

@Zakelly Thanks for the suggestions. I have updated and rebased this PR; please take another look.

@fredia fredia requested review from curcur and masteryhx April 11, 2024 03:16
Contributor

@Zakelly Zakelly left a comment

Thanks for your update!

* @param <R> the type of the record
* @param <K> the type of the key
*/
@NotThreadSafe
Contributor

It would be better to add a description stating that this class should only be manipulated within the task thread.

Contributor

@Zakelly Zakelly left a comment

Thanks for the update. LGTM, with only one minor thing.

* @param N the number of state requests to pop.
* @return A list of state requests.
*/
List<StateRequest<?, ?, ?>> popActive(int N) {
Contributor

Suggested change
- List<StateRequest<?, ?, ?>> popActive(int N) {
+ List<StateRequest<?, ?, ?>> popActive(int n) {

Contributor Author

fredia commented Apr 12, 2024

@Zakelly Thanks for the review. Updated and squashed.

Contributor

@masteryhx masteryhx left a comment

Thanks for the PR.
I just left some minor suggestions. PTAL.

Contributor

@yunfengzhou-hub yunfengzhou-hub left a comment

Thanks for the PR! I checked the changes in production code and it LGTM.

Contributor

@masteryhx masteryhx left a comment

Thanks for the update. LGTM.

@fredia fredia merged commit 6d139a1 into apache:master Apr 15, 2024
@fredia fredia deleted the FLINK-35024 branch April 15, 2024 06:28
hanyuzheng7 pushed a commit to hanyuzheng7/flink that referenced this pull request May 6, 2024
5 participants