Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions around executor reuse #120

Merged
merged 5 commits into from
Jun 21, 2021
Merged

Fix race conditions around executor reuse #120

merged 5 commits into from
Jun 21, 2021

Conversation

Shillaker
Copy link
Collaborator

@Shillaker Shillaker commented Jun 18, 2021

Once an executor has finished execution it needs to do three things:

  • Set the function result - this notifies anything waiting on that task's completion.
  • Vacate a slot in the scheduler - the scheduler keeps track of how many threads or processes are running concurrently via these slots.
  • Release its claim - the scheduler sets a boolean flag called a claim on each executor to keep track of whether it's available for reuse.

The previous flow of events was:

  1. Set the function result
  2. Vacate the scheduler slot
  3. Release the executor claim

If something is waiting on the task result, it can receive the result and submit another instance of the same task between steps 1 and 2 or steps 2 and 3.

Both cause problems, if it's before 2 then the scheduler will wrongly assume one of its slots is still occupied. If it's before 3,
the scheduler knows the right number of slots, but still can't reuse the executor.

In fact, the sequence of events should be reversed:

  1. Release the executor claim
  2. Vacate the scheduler slot
  3. Set the function result

At the time of writing the initial race condition is reproduced fairly reliably in our Github Actions build; the test case for doing so is also included in this PR.

@Shillaker Shillaker self-assigned this Jun 18, 2021
@@ -14,16 +14,11 @@ class DistTestsFixture
, public ConfTestFixture
, public SnapshotTestFixture
{
protected:
std::set<std::string> otherHosts;
Copy link
Collaborator Author

@Shillaker Shillaker Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated, but I noticed otherHosts isn't actually used anywhere so I removed it as part of this PR. Happy to move elsewhere but it's just 5 lines in this one file.

@Shillaker Shillaker changed the title Remove scheduler race condition related to slots and claims Fix race conditions around executor reuse Jun 18, 2021
@Shillaker Shillaker merged commit 1f3b249 into master Jun 21, 2021
@Shillaker Shillaker deleted the test-reuse branch June 21, 2021 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants