Fix deadlock in P2P restarts #8091

Merged: 2 commits merged on Aug 10, 2023

Conversation

hendrikmakait
Member

Closes #8088

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions
Contributor

github-actions bot commented Aug 9, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    20 files  +1          20 suites  +1        11h 22m 11s ⏱️ +18m 44s
 3 764 tests  -1       3 656 ✔️ +7            106 💤 -2       2 failed -6
36 412 runs   +3 628  34 658 ✔️ +3 534      1 751 💤 +101     3 failed -7

For more details on these failures, see this check.

Results for commit 65fdcc4. ± Comparison against base commit 9469b91.

This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---nanny]
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---no-nanny]
distributed.shuffle.tests.test_shuffle ‑ test_restarting_does_not_deadlock

♻️ This comment has been updated with latest results.


return {}, {}, {}
recommendations: Recs = {}
self._propagate_released(ts, recommendations)
Member Author

Writing a dedicated test for this is harder than anticipated. I think we might catch most (all?) non-P2P edge cases in other places. FWIW, removing this call from transition_queued_released likewise only fails P2P tests.

Member

@fjetter fjetter Aug 10, 2023

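Well, this kind of test triggers this transition:

import asyncio

from distributed.utils_test import gen_cluster, inc


@gen_cluster(nthreads=[("", 1)] * 2, client=True)
async def test_no_worker_released(c, s, a, b):
    f1 = c.submit(inc, 1, workers=[a.address], allow_other_workers=True, key="f1")
    # No worker provides resource "C", so f2 cannot be scheduled anywhere
    f2 = c.submit(inc, f1, resources={"C": 1}, key="f2")

    # Wait until the scheduler has put f2 into the "no-worker" state
    while f2.key not in s.tasks or s.tasks[f2.key].state != "no-worker":
        await asyncio.sleep(0.01)

    assert f1.key in a.data and f1.key not in b.data
    # Closing the only worker that holds f1 leads to f2 being released
    await a.close()
    ...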

In this test, f2 is in no-worker and is then transitioned to released. This hasn't caused any issues so far because, in this example, the transition chain is no-worker->released->waiting.

The scheduler extension, however, expects two separate transitions:

  • whatever->released
  • released->waiting->processing

I.e., the scheduler extension issues a transition to released and relies on the recommendation system to trigger the appropriate follow-ups. All other ordinary transitions instead go to waiting immediately.

So, another fix could be to make the shuffle extension smarter and more aware of the intended target state. However, I believe that would defeat the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)
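
To make the point about recommendations concrete, here is a minimal toy sketch of a recommendation-driven transition loop. It is not the distributed scheduler's code: `Task`, `ToyEngine`, `propagate_on_release`, and `final_state` are hypothetical names invented for illustration, and the `wanted` flag is an assumed stand-in for "something still needs this task". The sketch only shows why a caller that requests a bare transition to released depends on the released handler emitting follow-up recommendations.

```python
from dataclasses import dataclass


@dataclass
class Task:
    key: str
    state: str = "no-worker"
    wanted: bool = True  # assumption: a client or dependent still needs the result


class ToyEngine:
    """Hypothetical stand-in for a transition engine, for illustration only."""

    def __init__(self, propagate_on_release: bool):
        self.propagate_on_release = propagate_on_release
        self.log = []  # (key, start_state, finish_state) for each transition

    def _to_released(self, task):
        task.state = "released"
        # With propagation, the engine itself recommends the follow-up.
        if self.propagate_on_release and task.wanted:
            return {task.key: "waiting"}
        return {}

    def _to_waiting(self, task):
        task.state = "waiting"
        return {task.key: "processing"}

    def _to_processing(self, task):
        task.state = "processing"
        return {}

    def transitions(self, tasks, recommendations):
        # Apply recommendations until none are left.
        recs = dict(recommendations)
        while recs:
            key, finish = recs.popitem()
            task = tasks[key]
            self.log.append((key, task.state, finish))
            handler = getattr(self, f"_to_{finish}")
            recs.update(handler(task))


def final_state(propagate: bool) -> str:
    """Request only 'released' and report where the task ends up."""
    tasks = {"f2": Task("f2")}
    ToyEngine(propagate).transitions(tasks, {"f2": "released"})
    return tasks["f2"].state
```

In this toy model, `final_state(False)` leaves the task parked in released because nothing recommends a follow-up, which is the shape of the stall discussed above, while `final_state(True)` lets the engine carry the task through waiting to processing without the caller knowing the intended target state.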

Member Author

In this test, f2 is in no-worker and is then transitioned to released. This hasn't caused any issues so far because, in this example, the transition chain is no-worker->released->waiting.

That's what I meant, sorry. It's hard to write a dedicated test that fails on main but isn't using P2P for the reason you described above.

So, another fix could be to make the shuffle extension smarter and more aware of the intended target state. However, I believe that would defeat the purpose of the transition engine, and this fix is fine. (This ambiguity was/is also a common problem in the worker.)

+1, I think it's better to have the reconciliation logic within the state machine instead of forcing all users of the state machine to handle those edge cases.

@hendrikmakait hendrikmakait marked this pull request as ready for review August 10, 2023 09:29

@fjetter fjetter merged commit 1f8a11c into dask:main Aug 10, 2023
23 of 27 checks passed
Development

Successfully merging this pull request may close these issues.

P2P shuffle restart may deadlock if no running workers exist
2 participants