Cosmetic review of story() #6442

crusaderky · 2022-05-25T10:38:05Z

Perform cosmetic tweaks to various methods related to worker and scheduler stories.

This PR has been rescoped since it was opened.

Original description - obsolete

There was discussion on #6342 about using Worker.story() on non-key arbitrary tags, e.g. "gather-dep".

I find this functionality invaluable.
Example usage:

distributed/distributed/tests/test_worker.py

Lines 1896 to 1899 in ba39915

    
           story = b.story("busy-gather") 
        
           # 1 busy response straight away, followed by 1 retry every 150ms for 500ms. 
        
           # The requests for b and g are clustered together in single messages. 
        
           assert 3 <= len(story) <= 7

This tests that the task was received from worker rw instead of another worker:

distributed/distributed/tests/test_worker.py

Line 1952 in ba39915

assert_story(a.story("receive-dep"), [("receive-dep", rw.address, {"f"})])

This PR doesn't change functionality; it merely clarifies in the documentation that this is a legit use.

fjetter · 2022-05-25T11:28:27Z

The reason why I am opposed to the usage of this is because Worker.log is supposed to be a transition log. The tags you are requesting here are not transitions but rather would classify as an event or instruction or similar. We introduced a dedicated log for this and should not encourage "improper" usage of the existing log.

crusaderky · 2022-05-25T11:42:00Z

it's not an event; it's something way more granular that happens in the state machine in response to an event.

For example, it is true that {"op": "remove-replicas"} will always trigger a "remove-replicas" entry in the log, which makes it somewhat redundant with the event log (although it helps readability a lot, since events are logged in a different log and one would have to match timestamps by hand!). However, then you have two different entries, "remove-replica-confirmed" and "remove-replica-rejected", which are neither transitions, nor events, nor instructions, but carry valuable information as they tell you why there are - or there aren't - transitions immediately afterwards on those keys.

distributed/distributed/worker.py

Lines 1814 to 1835 in ba39915

    
           self.log.append(("remove-replicas", keys, stimulus_id, time())) 
        
           recommendations: Recs = {} 
        
           rejected = [] 
        
           for key in keys: 
        
               ts = self.tasks.get(key) 
        
               if ts is None or ts.state != "memory": 
        
                   continue 
        
               if not ts.is_protected(): 
        
                   self.log.append( 
        
                       (ts.key, "remove-replica-confirmed", stimulus_id, time()) 
        
                   ) 
        
                   recommendations[ts] = "released" 
        
               else: 
        
                   rejected.append(key) 
        
           if rejected: 
        
               self.log.append(("remove-replica-rejected", rejected, stimulus_id, time())) 
        
               smsg = AddKeysMsg(keys=rejected, stimulus_id=stimulus_id) 
        
               self._handle_instructions([smsg]) 
        
           self.transitions(recommendations, stimulus_id=stimulus_id)

github-actions · 2022-05-25T13:00:32Z

Unit Test Results

      15 files       15 suites 7h 7m 7s ⏱️
  2 828 tests   2 713 ✔️   81 💤 30 ❌ 4 🔥
20 960 runs 19 942 ✔️ 948 💤 64 ❌ 6 🔥

For more details on these failures and errors, see this check.

Results for commit 8d80592.

♻️ This comment has been updated with latest results.

pentschev · 2022-05-26T15:56:29Z

rerun tests

Note: rerunning gpuCI tests since those errors should be fixed by #6434

Moved out of dask#6342

fjetter · 2022-06-01T16:42:40Z

Here more details to explain my reasoning. These logs actually inspired the refactoring we're working on right now, at least partially.

I took all existing logs we currently have into this table (I skipped the actual transition log) and annotated them with the event I think it should be replaced with that carries identical if not more thorough information. There are a couple I consider redundant by now, I marked with 🔥 and added an explanation to it. I took most names from (#6410) if they don't exist on main, yet.

Tag	StateMachineEvent	Notes
request-dep	`GatherDep`
receive-dep	`GatherDepDone`	success
receive-from-scatter	TBD	should probably be also an event
free-keys	`FreeKeysEvent`
remove-replicas	`RemoveReplicasEvent`
remove-replica-confirmed	TBD
remove-replica-rejected	TBD
ensure-task-exists	🔥	This carries some useful information but is very verbose and can be inferred by matching `stimulus_id`s of a `ComputeTaskEvent` and the log since transition logs always have a start state
compute-task	`ComputeTaskEvent`
gather-dependencies	`GatherDep`	identical to `request-dep` since we removed pre-fetch filtering
put-in-memory	🔥	This is a degenerate event mixed between execute, scatter, fetch, etc. it is mostly redundant given the respective success events
receive-dep-failed	`GatherDepDone`	failure
missing-who-has	`GatherDepDone`	should be part of the event
busy-gather	`RetryBusyWorkerEvent`
missing-dep	`FindMissingEvent`
release-key	🔥	I think this is entirely redundant by now, isn't it?
cancel-compute	`CancelComputeEvent`

This basically leaves remove-replica-confirmed and remove-replica-rejected as the last entries that are not a transition log. These two logs basically explain why certain decisions where made. We're not doing this anywhere else and I would argue we should remove them as well.
A RemoveReplicasEvent with a stimulus_id in combination with the transition log associated to this stimulus_id, this carries the same information, doesn't it?

I don't think we should log any more granular information here. The change is obviously trivial but I would like to have alignment about how this log should be used in the future.

Down the road, I hope we'll be having a utility function that matches Events and logs, e.g. by stimulus_id which will likely have a different format than the current story.

crusaderky · 2022-06-09T16:47:52Z

I've salvaged this PR by removing all references to arbitrary tags.
Ready for re-review and merge.

github-actions · 2022-06-09T18:25:47Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±0       15 suites ±0 6h 29m 16s ⏱️ - 20m 22s
  2 865 tests ±0   2 783 ✔️ +32   81 💤 ±0 1 ❌ - 29
21 224 runs ±0 20 284 ✔️ +35 939 💤 ±0 1 ❌ - 32

For more details on these failures, see this check.

Results for commit 008d9dd. ± Comparison against base commit 344868a.

♻️ This comment has been updated with latest results.

crusaderky · 2022-06-13T14:36:52Z

Please do not merge until after #6566

crusaderky · 2022-06-14T10:58:48Z

Ready for review and merge again

crusaderky self-assigned this May 25, 2022

crusaderky added a commit to crusaderky/distributed that referenced this pull request May 25, 2022

Moved to dask#6442

4b3adf8

crusaderky mentioned this pull request May 26, 2022

Remove EnsureCommunicatingAfterTransitions #6462

Merged

crusaderky force-pushed the story_tags branch from 76c95ec to 038fb72 Compare June 1, 2022 12:10

fjetter mentioned this pull request Jun 1, 2022

Migrate ensure_communicating transitions to new WorkerState event mechanism #5896

Closed

8 tasks

Clarify that Worker.story() can request arbitrary log tags

8d80592

Moved out of dask#6342

crusaderky force-pushed the story_tags branch from 038fb72 to 8d80592 Compare June 1, 2022 14:31

crusaderky added 2 commits June 9, 2022 17:40

Merge branch 'main' into story_tags

27fea8b

Rescope to exclude arbitrary tags

318760d

crusaderky changed the title ~~Clarify that Worker.story() can request arbitrary log tags~~ Cosmetic review of story() Jun 9, 2022

crusaderky requested review from fjetter and removed request for fjetter June 9, 2022 16:47

fjetter mentioned this pull request Jun 13, 2022

Yank state machine out of Worker class #6566

Merged

crusaderky added 2 commits June 14, 2022 11:56

Merge branch 'main' into story_tags

b050d28

fix

008d9dd

fjetter approved these changes Jun 15, 2022

View reviewed changes

fjetter merged commit 2778cf5 into dask:main Jun 15, 2022

crusaderky deleted the story_tags branch June 16, 2022 09:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cosmetic review of story() #6442

Cosmetic review of story() #6442

crusaderky commented May 25, 2022 •

edited

Loading

fjetter commented May 25, 2022

crusaderky commented May 25, 2022 •

edited

Loading

github-actions bot commented May 25, 2022 •

edited

Loading

pentschev commented May 26, 2022

fjetter commented Jun 1, 2022 •

edited

Loading

crusaderky commented Jun 9, 2022

github-actions bot commented Jun 9, 2022 •

edited

Loading

crusaderky commented Jun 13, 2022

crusaderky commented Jun 14, 2022

	story = b.story("busy-gather")
	# 1 busy response straight away, followed by 1 retry every 150ms for 500ms.
	# The requests for b and g are clustered together in single messages.
	assert 3 <= len(story) <= 7

Cosmetic review of story() #6442

Cosmetic review of story() #6442

Conversation

crusaderky commented May 25, 2022 • edited Loading

Original description - obsolete

fjetter commented May 25, 2022

crusaderky commented May 25, 2022 • edited Loading

github-actions bot commented May 25, 2022 • edited Loading

Unit Test Results

pentschev commented May 26, 2022

fjetter commented Jun 1, 2022 • edited Loading

crusaderky commented Jun 9, 2022

github-actions bot commented Jun 9, 2022 • edited Loading

Unit Test Results

crusaderky commented Jun 13, 2022

crusaderky commented Jun 14, 2022

crusaderky commented May 25, 2022 •

edited

Loading

crusaderky commented May 25, 2022 •

edited

Loading

github-actions bot commented May 25, 2022 •

edited

Loading

fjetter commented Jun 1, 2022 •

edited

Loading

github-actions bot commented Jun 9, 2022 •

edited

Loading