
Mass key replication may sit idly for many seconds #6446

Closed · crusaderky opened this issue May 25, 2022 · 1 comment · Fixed by #6594

crusaderky (Collaborator) commented on May 25, 2022:

This is a follow-up to #6342.

  1. A worker, a, owns task x in memory.
  2. The scheduler issues {op: compute-task, key: y, who_has: {x: [a]}} to workers b, c, and possibly more. Alternatively, an AMM policy suggests creating 2+ extra replicas of x: {op: acquire-replicas, who_has: {x: [a]}}.
  3. Worker b is already transferring a very large piece of data from a, which will take many seconds to finish. When the compute-task/acquire-replicas request arrives, x is parked in fetch state because a is in flight.
  4. This is not the case for worker c, which quickly acquires x and informs the scheduler.
  5. Shortly afterwards, b receives {op: compute-task, key: z, who_has: {x: [a, c]}}. Alternatively, 2 seconds pass and the AMM reiterates that b should acquire a replica of x with a new command: {op: acquire-replicas, who_has: {x: [a, c]}}.
  6. Expected result: x is immediately transitioned from fetch to flight and acquired from c.
     Actual result: x sits idly in fetch state until something else triggers ensure_communicating. In the worst-case scenario, that may be worker a coming out of flight 10-15 seconds later (see the sketch below).
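
To make the race concrete, here is a minimal, self-contained sketch of steps 3-6. MiniWorker, handle_acquire_replicas, and the state strings are simplified stand-ins, not the real distributed.worker API; the point is that every who_has update must re-run ensure_communicating, or a task parked in fetch stays there:

    from dataclasses import dataclass, field


    @dataclass
    class TaskState:
        key: str
        state: str = "fetch"                 # fetch | flight | memory
        who_has: set = field(default_factory=set)


    class MiniWorker:
        def __init__(self) -> None:
            self.tasks: dict[str, TaskState] = {}
            self.in_flight_workers: set[str] = set()

        def handle_acquire_replicas(self, key: str, who_has: set[str]) -> None:
            ts = self.tasks.setdefault(key, TaskState(key))
            ts.who_has |= who_has
            # The crux of this issue: if this call is skipped, the updated
            # who_has is recorded but nothing wakes the worker up, and ts
            # sits in fetch until an unrelated event runs ensure_communicating.
            self.ensure_communicating()

        def ensure_communicating(self) -> None:
            for ts in self.tasks.values():
                if ts.state != "fetch":
                    continue
                idle = ts.who_has - self.in_flight_workers
                if idle:
                    ts.state = "flight"      # start gathering from an idle holder
                    print(f"fetch {ts.key} from {sorted(idle)[0]}")


    w = MiniWorker()
    w.in_flight_workers.add("a")                # step 3: a is busy with a large transfer
    w.handle_acquire_replicas("x", {"a"})       # x parks in fetch: its only holder is in flight
    w.handle_acquire_replicas("x", {"a", "c"})  # step 5: c now holds x -> flight at once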

#6342 introduces a new test for this scenario, test_new_replica_while_all_workers_in_flight, which is currently marked as skipped.

It was initially proposed to enable a fetch->fetch transition as follows:

--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2685,9 +2685,6 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore
 
-        if ts.state == finish:
-            return {}, []
-
         start = ts.state
         func = self._TRANSITIONS_TABLE.get((start, cast(TaskStateState, finish)))
 

The above doesn't work: it sends the key into missing state, which is then "rescued" by find_missing one second later.
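
A hedged reading of why, in sketch form: the transitions table has no direct (fetch, fetch) entry, and for unknown transitions the machinery falls back to hopping through released; assuming that releasing a task forgets its who_has, re-entering fetch with no known replicas degrades to missing. The toy table and rules below are illustrative assumptions, not the real Worker._transition:

    # Illustrative only: a toy transitions table mimicking the
    # fallback-through-released behaviour (names and rules are assumptions).
    TRANSITIONS = {
        ("fetch", "flight"),
        ("fetch", "released"),
        ("released", "fetch"),
        ("released", "missing"),
    }


    def transition(start: str, finish: str, who_has: set[str]) -> str:
        if (start, finish) in TRANSITIONS:
            if finish == "fetch" and not who_has:
                return "missing"    # no known replicas -> park until find_missing
            return finish
        if "released" not in (start, finish):
            # No direct entry: go start -> released -> finish, as the real
            # machinery does for unknown transitions.
            who_has.clear()         # assumption: releasing forgets who_has
            return transition("released", finish, who_has)
        raise RuntimeError(f"impossible transition: {start} -> {finish}")


    # With the start == finish guard removed, a fetch -> fetch recommendation
    # takes the released detour and lands in missing:
    print(transition("fetch", "fetch", {"a", "c"}))  # missing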

CC @fjetter

crusaderky (Collaborator, Author) commented:

This is fixed for free by #6497.
