New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent race condition in P2P shuffle run manager #8262
Conversation
) -> ShuffleRun: | ||
# FIXME: This should never be ToPickle[ShuffleRunSpec] | ||
result: ShuffleRunSpec | ToPickle[ShuffleRunSpec] | ||
if spec is None: | ||
result = await self._plugin.worker.scheduler.shuffle_get( | ||
id=shuffle_id, | ||
worker=self._plugin.worker.address, | ||
) | ||
else: | ||
result = await self._plugin.worker.scheduler.shuffle_get_or_create( | ||
spec=ToPickle(spec), | ||
key=key, | ||
worker=self._plugin.worker.address, | ||
) | ||
if isinstance(result, ToPickle): | ||
result = result.data | ||
result = await self._fetch(shuffle_id=shuffle_id, spec=spec, key=key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has been moved into _fetch
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 21 files + 1 21 suites +1 10h 44m 28s ⏱️ + 45m 47s For more details on these failures, see this check. Results for commit 19cca42. ± Comparison against base commit cb3da26. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor notes
key=key, | ||
worker=self._plugin.worker.address, | ||
) | ||
if isinstance(result, ToPickle): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of scope: I'm not sure I understand why this may or may not be wrapped in ToPickle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's something in the comms, haven't dug too deep into it.
Co-authored-by: crusaderky <crusaderky@gmail.com>
Co-authored-by: crusaderky <crusaderky@gmail.com>
23d233c
to
ba0d7da
Compare
There is a potential race condition in P2P shuffling between the stream-based
shuffle_fail
and the RPC-basedget
andget_or_create
. This race can cause a worker to cache a stale shuffle run locally and tasks to perform their work on a stale run. This PR prevents this.pre-commit run --all-files