
Mass key replication may sit idly for many seconds #6446

Closed · crusaderky opened this issue May 25, 2022 · 1 comment · Fixed by #6594

crusaderky (Collaborator) commented on May 25, 2022:

This is a follow-up to #6342.

  1. A worker, a, owns task x in memory.
  2. The scheduler issues {op: compute-task, key: y, who_has: {x: [a]}} to workers b, c, and possibly more. Alternatively, an AMM policy suggests creating 2+ extra replicas of x: {op: acquire-replicas, who_has: {x: [a]}}.
  3. Worker b is already transferring a very large piece of data from a, which will take many seconds to finish. When the compute-task/acquire-replicas request arrives, x is parked in fetch state because a is in flight.
  4. This is not the case for worker c, which quickly acquires x and informs the scheduler.
  5. Shortly afterwards, b receives {op: compute-task, key: z, who_has: {x: [a, c]}}. Alternatively, 2 seconds pass and the AMM reiterates that b should acquire a replica of x with a new command: {op: acquire-replicas, who_has: {x: [a, c]}}.
  6. Expected result: x is immediately transitioned from fetch to flight and acquired from c.
     Actual result: x sits idly in fetch state until something else triggers ensure_communicating. In the worst-case scenario, that may be worker a coming out of flight 10-15 seconds later (see the sketch below).
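
To make the race concrete, here is a minimal, self-contained sketch of steps 3-6. MiniWorker, handle_acquire_replicas, and the state strings are simplified stand-ins, not the real distributed.worker API; the point is that every who_has update must re-run ensure_communicating, or a task parked in fetch stays there:

    from dataclasses import dataclass, field


    @dataclass
    class TaskState:
        key: str
        state: str = "fetch"                 # fetch | flight | memory
        who_has: set = field(default_factory=set)


    class MiniWorker:
        def __init__(self) -> None:
            self.tasks: dict[str, TaskState] = {}
            self.in_flight_workers: set[str] = set()

        def handle_acquire_replicas(self, key: str, who_has: set[str]) -> None:
            ts = self.tasks.setdefault(key, TaskState(key))
            ts.who_has |= who_has
            # The crux of this issue: if this call is skipped, the updated
            # who_has is recorded but nothing wakes the worker up, and ts
            # sits in fetch until an unrelated event runs ensure_communicating.
            self.ensure_communicating()

        def ensure_communicating(self) -> None:
            for ts in self.tasks.values():
                if ts.state != "fetch":
                    continue
                idle = ts.who_has - self.in_flight_workers
                if idle:
                    ts.state = "flight"      # start gathering from an idle holder
                    print(f"fetch {ts.key} from {sorted(idle)[0]}")


    w = MiniWorker()
    w.in_flight_workers.add("a")                # step 3: a is busy with a large transfer
    w.handle_acquire_replicas("x", {"a"})       # x parks in fetch: its only holder is in flight
    w.handle_acquire_replicas("x", {"a", "c"})  # step 5: c now holds x -> flight at once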

#6342 introduces a new test for this scenario, test_new_replica_while_all_workers_in_flight, which is currently marked as skipped.

It was initially proposed to enable a fetch->fetch transition as follows:

--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2685,9 +2685,6 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore
 
-        if ts.state == finish:
-            return {}, []
-
         start = ts.state
         func = self._TRANSITIONS_TABLE.get((start, cast(TaskStateState, finish)))
 

The above doesn't work: it sends the key into missing state, which is then "rescued" by find_missing one second later.
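
A hedged reading of why, in sketch form: the transitions table has no direct (fetch, fetch) entry, and for unknown transitions the machinery falls back to hopping through released; assuming that releasing a task forgets its who_has, re-entering fetch with no known replicas degrades to missing. The toy table and rules below are illustrative assumptions, not the real Worker._transition:

    # Illustrative only: a toy transitions table mimicking the
    # fallback-through-released behaviour (names and rules are assumptions).
    TRANSITIONS = {
        ("fetch", "flight"),
        ("fetch", "released"),
        ("released", "fetch"),
        ("released", "missing"),
    }


    def transition(start: str, finish: str, who_has: set[str]) -> str:
        if (start, finish) in TRANSITIONS:
            if finish == "fetch" and not who_has:
                return "missing"    # no known replicas -> park until find_missing
            return finish
        if "released" not in (start, finish):
            # No direct entry: go start -> released -> finish, as the real
            # machinery does for unknown transitions.
            who_has.clear()         # assumption: releasing forgets who_has
            return transition("released", finish, who_has)
        raise RuntimeError(f"impossible transition: {start} -> {finish}")


    # With the start == finish guard removed, a fetch -> fetch recommendation
    # takes the released detour and lands in missing:
    print(transition("fetch", "fetch", {"a", "c"}))  # missing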

CC @fjetter

crusaderky (Collaborator, Author) commented:

This is fixed for free by #6497.
