[FLINK-35550][runtime] Move rescaling functionality into dedicated class RescaleManager #24909

Merged
merged 1 commit into from
Jun 27, 2024

Conversation

XComp
Contributor

@XComp XComp commented Jun 7, 2024

PR Chain

What is the purpose of the change

This PR reorganizes the responsibilities for rescaling.

Brief change log

Class diagrams

  • Introduction of new interfaces RescaleManager, RescaleManager.Context and RescaleManager.Factory
  • Responsibilities:
    • AdaptiveScheduler only provides the available parallelism (through the SlotManager). The rescalingControllers are moved into RescaleManager
    • Executing is only in charge of state transitions and savepoints
    • RescaleManager handles the rescale decisions
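
The split of responsibilities listed above can be illustrated with a minimal sketch. The interface names RescaleManager, RescaleManager.Context and RescaleManager.Factory come from this PR; every method name, the ImmediateRescaleManager policy and the FakeExecuting stand-in are assumptions made up for illustration and are not taken from the actual Flink code.

```java
public class RescaleSketch {

    /** Makes the rescale decision. Method names here are hypothetical. */
    public interface RescaleManager {
        /** Signals that available resources or job requirements changed. */
        void onChange();

        /** What the manager needs from its owner (the Executing state in this PR). */
        interface Context {
            boolean hasSufficientResources();

            void rescale();
        }

        /** Lets the scheduler pick an implementation without the state knowing it. */
        interface Factory {
            RescaleManager create(Context context);
        }
    }

    /** Hypothetical policy: rescale as soon as the resources suffice. */
    public static final class ImmediateRescaleManager implements RescaleManager {
        private final Context context;

        public ImmediateRescaleManager(Context context) {
            this.context = context;
        }

        @Override
        public void onChange() {
            if (context.hasSufficientResources()) {
                context.rescale();
            }
        }
    }

    /** Stand-in for the Executing state: performs the transition, makes no decision. */
    public static final class FakeExecuting implements RescaleManager.Context {
        public boolean sufficientResources;
        public int rescaleCount;

        @Override
        public boolean hasSufficientResources() {
            return sufficientResources;
        }

        @Override
        public void rescale() {
            rescaleCount++;
        }
    }

    public static void main(String[] args) {
        // The factory would be instantiated by the scheduler and handed to the state.
        RescaleManager.Factory factory = ImmediateRescaleManager::new;
        FakeExecuting executing = new FakeExecuting();
        RescaleManager manager = factory.create(executing);

        manager.onChange(); // resources not sufficient yet, nothing happens
        executing.sufficientResources = true;
        manager.onChange(); // the manager decides, the state executes
        System.out.println("rescales triggered: " + executing.rescaleCount);
    }
}
```

Because the decision logic only talks to the Context interface, it can be unit-tested against a fake like the one above without instantiating the scheduler or the state machine.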

Verifying this change

The tests were reorganized accordingly. Some additional unit tests were added to improve coverage.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Jun 7, 2024

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@XComp
Contributor Author

XComp commented Jun 20, 2024

@1996fanrui: This PR and the linked PRs are ready to be reviewed.

Keep in mind that the follow-up PRs also include the commits of their base PRs: I chained them, but GitHub doesn't support PR chaining in forks that well.

Member

@1996fanrui 1996fanrui left a comment

Thanks @XComp for the contribution! IIUC, this PR is only a refactoring that introduces a series of new interfaces and doesn't change any logic, right?

I left some minor comments, please take a look in your free time.

Contributor

@ztison ztison left a comment

I went through it again to see the recent changes and it looks good.

@XComp
Contributor Author

XComp commented Jun 24, 2024

Thanks for your reviews, @ztison and @1996fanrui. I addressed your comments and squashed the commits in a separate force-push (see diff).

@1996fanrui Yes, this PR is solely about collecting the rescale-related code in the Executing state to prepare for the follow-up PRs.

@XComp
Contributor Author

XComp commented Jun 24, 2024

Argh, I forgot the commit that moves the factory instantiation into the AdaptiveScheduler constructor (1fca0a1). The subsequent force-push squashes the changes once more (diff).

@XComp
Contributor Author

XComp commented Jun 24, 2024

Final force-push to rebase onto the most recent master. Can you approve this PR?

Member

@1996fanrui 1996fanrui left a comment

Thanks @XComp for the update!

LGTM

… the rescaling logic to improve code testability and extensibility

Rescaling is a state-specific functionality. Moving all the logic into Executing state allows
us to align the resource controlling in Executing state and WaitingForResources state in a future effort.

@XComp
Contributor Author

XComp commented Jun 26, 2024

Removed unused members.

@XComp
Contributor Author

XComp commented Jun 27, 2024

CI failure due to FLINK-30719

@XComp
Contributor Author

XComp commented Jun 27, 2024

@flinkbot run azure

@XComp XComp merged commit 7f13995 into apache:master Jun 27, 2024