Skip to content

Conversation

@jiangxin369
Copy link
Contributor

@jiangxin369 jiangxin369 commented Dec 19, 2023

What is the purpose of the change

This PR is to avoid the potential hang of Hybrid Shuffle during redistribution. The details about how a hang happens, please see issue.

Brief change log

  • Adds an interface ensureCapacity to TieredStorageMemoryManager to help reserve enough buffers.
  • Adds a variable definitelyRecycled to indicate that the buffers of a tier can be definitely recycled even if there are no readers.
  • Reserves buffers every time a non-definitelyRecycled tier receives a buffer.

Verifying this change

This change added tests and can be verified as follows:

  • Added test that validates that the ResultPartition would not hang even if most buffers are occupied by the memory tier and redistribution happens.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (yes)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

flinkbot commented Dec 19, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@jiangxin369 jiangxin369 force-pushed the FLINK-33879 branch 2 times, most recently from d0553ee to faaf085 Compare December 20, 2023 11:05
@jiangxin369 jiangxin369 changed the title [WIP][FLINK-33879] Avoids the potential hang of Hybrid Shuffle during redistribution [FLINK-33879] Avoids the potential hang of Hybrid Shuffle during redistribution Dec 20, 2023
@jiangxin369
Copy link
Contributor Author

@TanYuxin-tyx Could you help review this PR?

Copy link
Contributor

@TanYuxin-tyx TanYuxin-tyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangxin369 Thanks for the contribution. I have some comments on the PR, PTAL.

@jiangxin369
Copy link
Contributor Author

@xintongsong Could you take a look at this PR?

Copy link
Contributor

@TanYuxin-tyx TanYuxin-tyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangxin369 Thanks for the update, I have no more comments on the change.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, I left some comments.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, I have left some comments, please take a look.

@jiangxin369
Copy link
Contributor Author

@reswqa Thanks for the review, I've updated the PR, please take a look.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangxin369 Thanks for the quick update, this PR LGTM now.

@jiangxin369
Copy link
Contributor Author

@reswqa Seems that the CI failed but the cause is related to python rather than this PR. Could you help merge it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants