Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cosmetic review of story() #6442

Merged
merged 5 commits into from
Jun 15, 2022
Merged

Cosmetic review of story() #6442

merged 5 commits into from
Jun 15, 2022

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented May 25, 2022

Perform cosmetic tweaks to various methods related to worker and scheduler stories.

This PR has been rescoped since it was opened.

Original description - obsolete

There was discussion on #6342 about using Worker.story() on non-key arbitrary tags, e.g. "gather-dep".

I find this functionality invaluable.
Example usage:

story = b.story("busy-gather")
# 1 busy response straight away, followed by 1 retry every 150ms for 500ms.
# The requests for b and g are clustered together in single messages.
assert 3 <= len(story) <= 7

This tests that the task was received from worker rw instead of another worker:

assert_story(a.story("receive-dep"), [("receive-dep", rw.address, {"f"})])

This PR doesn't change functionality; it merely clarifies in the documentation that this is a legit use.

@crusaderky crusaderky self-assigned this May 25, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request May 25, 2022
@fjetter
Copy link
Member

fjetter commented May 25, 2022

The reason why I am opposed to the usage of this is because Worker.log is supposed to be a transition log. The tags you are requesting here are not transitions but rather would classify as an event or instruction or similar. We introduced a dedicated log for this and should not encourage "improper" usage of the existing log.

@crusaderky
Copy link
Collaborator Author

crusaderky commented May 25, 2022

it's not an event; it's something way more granular that happens in the state machine in response to an event.

For example, it is true that {"op": "remove-replicas"} will always trigger a "remove-replicas" entry in the log, which makes it somewhat redundant with the event log (although it helps readability a lot, since events are logged in a different log and one would have to match timestamps by hand!). However, then you have two different entries, "remove-replica-confirmed" and "remove-replica-rejected", which are neither transitions, nor events, nor instructions, but carry valuable information as they tell you why there are - or there aren't - transitions immediately afterwards on those keys.

self.log.append(("remove-replicas", keys, stimulus_id, time()))
recommendations: Recs = {}
rejected = []
for key in keys:
ts = self.tasks.get(key)
if ts is None or ts.state != "memory":
continue
if not ts.is_protected():
self.log.append(
(ts.key, "remove-replica-confirmed", stimulus_id, time())
)
recommendations[ts] = "released"
else:
rejected.append(key)
if rejected:
self.log.append(("remove-replica-rejected", rejected, stimulus_id, time()))
smsg = AddKeysMsg(keys=rejected, stimulus_id=stimulus_id)
self._handle_instructions([smsg])
self.transitions(recommendations, stimulus_id=stimulus_id)

@github-actions
Copy link
Contributor

github-actions bot commented May 25, 2022

Unit Test Results

       15 files         15 suites   7h 7m 7s ⏱️
  2 828 tests   2 713 ✔️   81 💤 30  4 🔥
20 960 runs  19 942 ✔️ 948 💤 64  6 🔥

For more details on these failures and errors, see this check.

Results for commit 8d80592.

♻️ This comment has been updated with latest results.

@pentschev
Copy link
Member

rerun tests

Note: rerunning gpuCI tests since those errors should be fixed by #6434

@fjetter
Copy link
Member

fjetter commented Jun 1, 2022

Here more details to explain my reasoning. These logs actually inspired the refactoring we're working on right now, at least partially.

I took all existing logs we currently have into this table (I skipped the actual transition log) and annotated them with the event I think it should be replaced with that carries identical if not more thorough information. There are a couple I consider redundant by now, I marked with 🔥 and added an explanation to it. I took most names from (#6410) if they don't exist on main, yet.

Tag StateMachineEvent Notes
request-dep GatherDep
receive-dep GatherDepDone success
receive-from-scatter TBD should probably be also an event
free-keys FreeKeysEvent
remove-replicas RemoveReplicasEvent
remove-replica-confirmed TBD
remove-replica-rejected TBD
ensure-task-exists 🔥 This carries some useful information but is very verbose and can be inferred by matching stimulus_ids of a ComputeTaskEvent and the log since transition logs always have a start state
compute-task ComputeTaskEvent
gather-dependencies GatherDep identical to request-dep since we removed pre-fetch filtering
put-in-memory 🔥 This is a degenerate event mixed between execute, scatter, fetch, etc. it is mostly redundant given the respective success events
receive-dep-failed GatherDepDone failure
missing-who-has GatherDepDone should be part of the event
busy-gather RetryBusyWorkerEvent
missing-dep FindMissingEvent
release-key 🔥 I think this is entirely redundant by now, isn't it?
cancel-compute CancelComputeEvent

This basically leaves remove-replica-confirmed and remove-replica-rejected as the last entries that are not a transition log. These two logs basically explain why certain decisions where made. We're not doing this anywhere else and I would argue we should remove them as well.
A RemoveReplicasEvent with a stimulus_id in combination with the transition log associated to this stimulus_id, this carries the same information, doesn't it?

I don't think we should log any more granular information here. The change is obviously trivial but I would like to have alignment about how this log should be used in the future.

Down the road, I hope we'll be having a utility function that matches Events and logs, e.g. by stimulus_id which will likely have a different format than the current story.

@crusaderky crusaderky changed the title Clarify that Worker.story() can request arbitrary log tags Cosmetic review of story() Jun 9, 2022
@crusaderky crusaderky requested review from fjetter and removed request for fjetter June 9, 2022 16:47
@crusaderky
Copy link
Collaborator Author

I've salvaged this PR by removing all references to arbitrary tags.
Ready for re-review and merge.

@github-actions
Copy link
Contributor

github-actions bot commented Jun 9, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 29m 16s ⏱️ - 20m 22s
  2 865 tests ±0    2 783 ✔️ +32    81 💤 ±0  1  - 29 
21 224 runs  ±0  20 284 ✔️ +35  939 💤 ±0  1  - 32 

For more details on these failures, see this check.

Results for commit 008d9dd. ± Comparison against base commit 344868a.

♻️ This comment has been updated with latest results.

@crusaderky
Copy link
Collaborator Author

Please do not merge until after #6566

@crusaderky
Copy link
Collaborator Author

Ready for review and merge again

@fjetter fjetter merged commit 2778cf5 into dask:main Jun 15, 2022
@crusaderky crusaderky deleted the story_tags branch June 16, 2022 09:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants