[FLINK-9884] [runtime] fix slot request may not be removed when it has already be assigned in slot manager #6360

Closed
wants to merge 2 commits into from

Conversation

shuai-xu
Contributor

What is the purpose of the change

(This pull request fixes a bug where a slot request may not be removed from pendingSlotRequests in the slot manager after it has already been assigned.)

Verifying this change

This change added tests and can be verified as follows:

  • Added a test in SlotManagerTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@tisonkun
Member

@shuai-xu
It makes sense.
The message that the TM has successfully allocated the slot might get lost in transit.
When the slot manager then receives a slot status report saying that a slot carries an allocation ID unrelated to the current offer, the slot is already allocated to another slot request.
It looks like this PR protects the runtime from a potential resource leak, doesn't it?

@tisonkun
Member

When a task executor reports slotA with allocationId1, the slot manager may have recorded slotA as assigned to allocationId2, with the slot request for allocationId1 still unassigned. The slot manager then updates its own state so that slotA is assigned to allocationId1, but it does not clear the slot request for allocationId1.

For example:
# There is one free slot in the slot manager.
# Two slot requests arrive, with allocationId1 and allocationId2.
# The slot is assigned to allocationId1, but the requestSlot call times out.
# The SlotManager assigns the slot to allocationId2 and re-inserts the slot request with allocationId1 as pending.
# The second requestSlot call to the task executor returns a SlotOccupiedException.
# The SlotManager updates the slot to allocationId1, but the slot request is left behind.
Picked from the associated JIRA issue for further discussion.
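
The six steps above can be replayed with a small toy model. This is only a sketch of the bookkeeping problem, not the code changed in this PR: `ToySlotManager`, its methods, and the `removeFulfilledRequest` flag are hypothetical names made up for illustration, and Flink's real SlotManager is considerably more involved. The only names taken from the discussion are `pendingSlotRequests`, `slotA`, `allocationId1`, and `allocationId2`.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the bookkeeping in the scenario; NOT Flink's actual SlotManager. */
class ToySlotManager {
    // slotId -> allocationId the manager believes owns the slot (null means free)
    private final Map<String, String> slotOwner = new HashMap<>();
    // allocationId -> slotId the request is tentatively assigned to (null means unassigned)
    private final Map<String, String> pendingSlotRequests = new LinkedHashMap<>();
    // Hypothetical switch: true models the cleanup this PR is described as adding.
    private final boolean removeFulfilledRequest;

    ToySlotManager(boolean removeFulfilledRequest) {
        this.removeFulfilledRequest = removeFulfilledRequest;
    }

    void registerFreeSlot(String slotId) { slotOwner.put(slotId, null); }

    void addRequest(String allocationId) { pendingSlotRequests.put(allocationId, null); }

    /** The manager assigns the slot and (conceptually) sends requestSlot to the task executor. */
    void assign(String slotId, String allocationId) {
        slotOwner.put(slotId, allocationId);
        pendingSlotRequests.put(allocationId, slotId);
    }

    /** requestSlot timed out: the manager frees the slot in its view and keeps the request pending. */
    void onRequestTimeout(String slotId, String allocationId) {
        slotOwner.put(slotId, null);
        pendingSlotRequests.put(allocationId, null);
    }

    /** The task executor answered with a SlotOccupiedException naming the real occupant. */
    void onSlotOccupied(String slotId, String rejectedId, String occupyingId) {
        pendingSlotRequests.put(rejectedId, null); // rejected request waits for another slot
        slotOwner.put(slotId, occupyingId);        // adopt the task executor's view
        if (removeFulfilledRequest) {
            pendingSlotRequests.remove(occupyingId); // the occupant's request is fulfilled
        }
    }

    boolean isPending(String allocationId) { return pendingSlotRequests.containsKey(allocationId); }

    String ownerOf(String slotId) { return slotOwner.get(slotId); }

    /** Replays steps 1-6 of the example. */
    static ToySlotManager runScenario(boolean removeFulfilledRequest) {
        ToySlotManager m = new ToySlotManager(removeFulfilledRequest);
        m.registerFreeSlot("slotA");                                // step 1
        m.addRequest("allocationId1");                              // step 2
        m.addRequest("allocationId2");
        m.assign("slotA", "allocationId1");                         // step 3
        m.onRequestTimeout("slotA", "allocationId1");
        m.assign("slotA", "allocationId2");                         // step 4
        m.onSlotOccupied("slotA", "allocationId2", "allocationId1"); // steps 5 and 6
        return m;
    }

    public static void main(String[] args) {
        System.out.println("without cleanup, allocationId1 still pending: "
                + runScenario(false).isPending("allocationId1"));
        System.out.println("with cleanup, allocationId1 still pending: "
                + runScenario(true).isPending("allocationId1"));
    }
}
```

In the run without cleanup, slotA is owned by allocationId1 while the request for allocationId1 is still listed as pending, which is the stale entry the PR description refers to. With the cleanup enabled, only the rejected request for allocationId2 remains pending.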

@tisonkun
Member

tisonkun commented Sep 1, 2018

cc @tillrohrmann @GJL

Contributor

@tillrohrmann tillrohrmann left a comment

Really good catch @shuai-xu. The changes look good to me. Merging this PR :-)

@asfgit asfgit closed this in 48e724f Sep 18, 2018
kl0u pushed a commit to kl0u/flink that referenced this pull request Sep 20, 2018
Clarkkkkk pushed a commit to Clarkkkkk/flink that referenced this pull request Mar 7, 2019
…s already be assigned in slot manager

This closes apache#6360.

(cherry picked from commit 48e724f)