You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
compute-task issues {op: compute-task, key: y, who_has: {x: [a]}} to workers b, c, and maybe more. Alternatively, an AMM policy suggests creating 2+ more replicas of x: {op: acquire-replicas, who_has: {x: [a]}}
Worker b is already transferring a very large piece of data from a; which will take many seconds to finish. When the compute-task/acquire-replica request arrives, x is parked in fetch state because a is in flight.
This is not the case for worker c, which quickly acquires x and informs the scheduler.
Shortly afterwards, b receives {op: compute-task, key: z, who_has: {x: [a, c]}}. Alternatively, 2 seconds have passed and AMM reiterates that b should acquire a replica of x with a new command {op: acquire-replicas, who_has: {x: [a, c]}}
Expected result: x is transitioned immediately from fetch to flight and acquired from c Actual result: x sits idly in fetch state until something else triggers ensure_communicating. In the worst case scenario, it may be worker a coming out of flight 10-15 seconds later.
#6342 introduces a new test, test_new_replica_while_all_workers_in_flight, currently marked as @skip, to test this.
It was initially proposed to enable a fetch->fetch transition as follows:
This is a follow-up from #6342
{op: compute-task, key: y, who_has: {x: [a]}}
to workers b, c, and maybe more. Alternatively, an AMM policy suggests creating 2+ more replicas of x:{op: acquire-replicas, who_has: {x: [a]}}
{op: compute-task, key: z, who_has: {x: [a, c]}}
. Alternatively, 2 seconds have passed and AMM reiterates that b should acquire a replica of x with a new command{op: acquire-replicas, who_has: {x: [a, c]}}
Actual result: x sits idly in fetch state until something else triggers ensure_communicating. In the worst case scenario, it may be worker a coming out of flight 10-15 seconds later.
#6342 introduces a new test,
test_new_replica_while_all_workers_in_flight
, currently marked as@skip
, to test this.It was initially proposed to enable a fetch->fetch transition as follows:
The above doesn't work, as it sends the key into missing state which is then "rescued" by find_missing 1 second later.
CC @fjetter
The text was updated successfully, but these errors were encountered: