Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-21450][runtime] Support LocalRecovery by AdaptiveScheduler #21981

Merged
merged 4 commits into from
Mar 2, 2023

Conversation

rkhachatryan
Copy link
Contributor

@rkhachatryan rkhachatryan commented Feb 21, 2023

What is the purpose of the change

Adjust slot assignment by Adaptive Scheduler
to try to re-use previous allocations
so that TMs can use Local Recovery.

Contributed mostly by @dmvk.
The main defferences from the original contribution:

  1. Previous ExecutionGraph is passed from the previous state explicitly (currently, WaitingForResources stage, which triggers the computation, doesn't have the graph)
  2. In SlotAssigner, the split into two methods is removed mostly for consistency (two methods mostly duplicated each other). That results in higher asymptotical complexity of StateLocalitySlotAssigner (O(mnlog*mnlog) vs O(mnlog)
  3. DoP is computed according to FLINK-30895

Brief change log

  • Support LocalRecovery by AdaptiveScheduler
  • Add previous ExecutionGraph to WaitingForResources AdaptiveScheduler state
  • Make LocalRecoveryITCase fail when allocations don't match

Verifying this change

  • Adjusted LocalRecoveryITCase
  • Added SlotSharingSlotAllocatorTest.testStickyAllocation
  • Added StateLocalitySlotAssignerTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented Feb 21, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of passing around execution graphs, and would rather see a dedicated structure for our purposes that lives in the AdaptiveScheduler.

This would avoid some edge-cases, like local recovery breaking down unnecessarily when CreatingWithExecutionGraph failed.

@rkhachatryan
Copy link
Contributor Author

Thanks for the feedback, @zentol and @dmvk .
I've updated the PR, would you mind taking another look?

I've significantly restructured the code after the offline discussions.
Probably a good place to start are the final versions of SlotAllocator and SlotAssigner interfaces.

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @rkhachatryan, this starts looking really good! I've left a few questions, PTAL

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @rkhachatryan; this looks great! I've added a few more minor comments, PTAL.

My biggest concern is whether the integration test correctly stresses the AdaptiveScheduler code path.

Please prepare the PR for the merging (fixing the commit history + moving the StateSizeEstimests out of the PR as discussed offline).

…match

Currently, wrong allocation fails the task causing a restart,
which eventually allows to fix the allocation by picking the right TM.
This prevents the test from failure and hides the wrong allocation.
@rkhachatryan
Copy link
Contributor Author

Thanks a lot for the thorough review @dmvk!

I've cleaned up the commit history and I think all concerns are now resolved, PTAL.

@rkhachatryan rkhachatryan requested a review from dmvk March 1, 2023 08:28
Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the PR @rkhachatryan. Great stuff!

🎉 💪

rkhachatryan and others added 3 commits March 2, 2023 14:30
…rces AdaptiveScheduler state

Previous ExecutionGraph will be used in a subsequent commit to allocate
workloads more optimally by taking previous allocations into account.
…uler and SlotAssigner

Slot assignments are computed and consumed by SlotAllocator.
This is expressed implicitly by extending VertexParallelism.

This change tries to make that clear, while still allowing to assign
slots to something other than Slot Sharing Groups.

It does so by:
1. Introduce JobSchedulingPlan, computed and consumed by SlotAllocator. It couples VertexParallelism with slot assignments
2. Introduce determineParallelismAndCalculateAssignment method in addition to determineParallelism, specifically for assignments
3. Push the polymorphism of state assignments from VertexParallelism into the JobSchedulingPlan (slot assignment target)
@rkhachatryan rkhachatryan deleted the f21450 branch March 2, 2023 21:39
@rkhachatryan
Copy link
Contributor Author

Merged as d718342..e38a670.
Thanks a lot for the initial prototype and for the thorough reviews @dmvk and @zentol !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants