Reuse of keys in blockwise fusion can cause spurious KeyErrors on distributed cluster #9888
Comments
If I understood correctly: I don't think this should be in dask/dask. This is (yet another) fragility of the cancelled state, and the dask/dask graph shouldn't work around it. It should be OK to submit to the scheduler a new, different task with the same key as a released future.
Actually, I don't think you need to change the graph. You can get this error if you resubmit the exact same tasks.

This requires x to go through its whole lifecycle between client, scheduler, and worker b before a has had the chance of doing even a single round of the event loop. The easiest way to achieve this is to hamstring a's event loop by spilling many GiB worth of data.
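For what it's worth, a minimal sketch (setup and names assumed, not the exact code from this thread) of resubmitting the exact same task right after releasing it:

```python
from distributed import Client

def inc(x):
    return x + 1

client = Client()
fut = client.submit(inc, 1, key="inc-1")  # first submission, explicit key
del fut                                    # release the future; the scheduler
                                           # only learns of this asynchronously
fut = client.submit(inc, 1, key="inc-1")  # immediate resubmission, same key
print(fut.result())
```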
TLDR: The current system assumes strongly, in many places, that a key uniquely identifies a task, not just its output. Reusing a key for a different task violates this assumption, and distributed can fail on multiple levels. The scheduler has no way to recognize that the tasks are new and different. The problem has nothing to do with the Worker; the error is merely surfacing there. As stated above, if validation is on, this raises in various places on the scheduler (I found three different assertions that could be triggered depending on queuing/no-queuing and some timing; here, here, and I forgot where the third one was). If you inspect closely what happens when executing the above script (the del is not necessary; it is merely there to be explicit): the scheduler already knows a task A that doesn't have any dependencies (first persist, with task fusion); a subsequent submission then reuses the same key for a task that does have dependencies. It may be true that the worker could handle this better, but the scheduler is already corrupted as well.
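To make "a different task with the same key" concrete, here is an illustrative pair of hand-rolled raw graphs (not produced by blockwise fusion) in which key "a" first has no dependencies and then acquires one:

```python
from operator import add, neg
from distributed import Client

client = Client()
# First graph: "a" has no dependencies (like a fused blockwise task).
f1 = client.get({"a": (add, 1, 2)}, "a", sync=False)
# Second graph: the same key "a" now depends on "b" (the unfused shape).
f2 = client.get({"b": (add, 1, 2), "a": (neg, "b")}, "a", sync=False)
```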
I believe this is where the first fallacy is hidden. In the code above, what actually happens is roughly: the client submits the second graph first, and only afterwards do the release messages for the previously deleted futures arrive at the scheduler; i.e. the scheduler doesn't even know that the futures were released. This timing is unavoidable due to how refcounting works. Even if you disregard this release chain, the same issue can very easily be constructed with a second client and different graphs.
@crusaderky I can't truly follow your example here, but it is a different issue. I'm talking about a guarantee that the task has to be uniquely identifiable by its key, which is violated here and causes all sorts of failures in distributed. If you found something else, please open a new ticket. I believe this issue is caused by this line reusing the key (Line 1589 in a8327a3).
@rjzamora @jrbourbeau Can either one of you explain why blockwise fusion layers cannot generate a new key? @jrbourbeau, you mentioned this recently as a UX problem that could be solved by a "better version of HLGs". I feel like I'm missing some context.
Agree that this is a significant problem with HLGs. The reason is the simple fact that there is no (formal/robust) mechanism to "replay" the construction of a graph after the names/properties for a subset of its tasks have been changed. That is, you can't really change the name of a Blockwise layer without also regenerating every layer that depends on it.
So this is a limitation because there might've been an "accidental" materialization before the fusing?
Yes and no. The limitation is that there is no API to replace a layer in the HLG and update the layers that depend on it.
xref #8635. Like @rjzamora said above.
I'm still trying to wrap my head around it, but I'm getting there. IIUC this is mostly about layer dependencies? Best case, fixing layer dependencies is simply exchanging a single key; worst case, the layer depending on that Blockwise layer is a MaterializedLayer, in which case we'd need to walk the entire graph to rename the key. Is this roughly correct?
Yes. The problem is that you would need to regenerate all dependent layers, and there is no clear way to do this.

Why there is no clear way to do this: there is no formal mechanism to change the dependencies of an existing Layer. The required Layer API is very limited and does not provide a way to set the dependencies. Therefore, different logic would need to be used to change dependencies for different layer types.

Given that the logic here would be "ugly" with the existing Layer API, the ideal solution is probably to update the Layer API to include a clear mechanism for tracking dependencies. However, we have been waiting to clean up serialization first.
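As a concrete illustration of the "walk the entire graph" cost for a materialized layer, here is a hypothetical helper (not dask API; names are illustrative) that renames a key in a flat `{key: task}` graph — every task must be visited to rewrite references:

```python
from operator import add

def rename_key(dsk, old, new):
    """Rename `old` to `new` in a materialized graph.

    Naive sketch: treats any matching value as a key reference, and must
    recurse through every task tuple/list in the whole graph.
    """
    def sub(task):
        if task == old:                  # a reference to the renamed key
            return new
        if isinstance(task, tuple):      # task tuples: recurse into arguments
            return tuple(sub(t) for t in task)
        if isinstance(task, list):
            return [sub(t) for t in task]
        return task
    return {(new if k == old else k): sub(v) for k, v in dsk.items()}

dsk = {"a": 1, "b": (add, "a", 10)}
print(rename_key(dsk, "a", "a2"))  # {'a2': 1, 'b': (add, 'a2', 10)}
```

A Blockwise layer, by contrast, stores its inputs symbolically, so a rename could in principle be cheap there if the Layer API exposed a mechanism for it.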
Thanks, that clears things up for me.
Found a nice textbook example in the distributed tests https://github.com/dask/distributed/pull/8185/files#r1479864806 |
(repost from dask/distributed#8185 (comment))

```python
import dask
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"a": range(8), "b": range(8)})  # placeholder; original df not shown

ddf = dd.from_pandas(df, npartitions=4)
with dask.config.set({"dataframe.shuffle.method": "p2p"}):
    ddf = ddf.set_index("a").sort_values("b")
result = ddf.compute()
```

This code submits some keys twice: the keys are the same, but the run_spec changes. This occasionally triggers a race condition where the scheduler didn't have the time to forget those keys yet. This is (fortuitously) not causing failures, because the key is always close to its released->forgotten transition when this happens (although in other cases it may be in memory; I'm not sure).
They are currently triggering dask warnings related to a bug, dask/dask#9888. Retry including them after it's fixed.
Subsequent Blockwise layers are currently fused into a single layer. This reduces the number of tasks and the scheduling overhead, and is generally a good thing to do. However, the fused output does not generate unique key names. This is a problem from a UX perspective, but it can also cause severe failure cases when executed on the distributed scheduler, since distributed assumes that a task key is a unique identifier for the entire task. While it is true that the data output of the fused key and the non-fused key is identical, the run_spec and the local topology are intentionally very different. For example, a fused task may not have any dependencies while the non-fused task does.
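A minimal sketch of the behavior described above (API details from the HLG era of dask; exact signatures and fusion decisions may vary by version):

```python
import dask.array as da
from dask.blockwise import optimize_blockwise

x = da.ones(10, chunks=5)
y = (x + 1) * 2  # stacks Blockwise layers on top of the array-creation layer

hlg = y.__dask_graph__()
out_keys = [(y.name, i) for i in range(y.numblocks[0])]
fused = optimize_blockwise(hlg, keys=out_keys)

print(len(hlg.layers), "->", len(fused.layers))  # fewer layers after fusion
# Both graphs still produce the keys in `out_keys`, yet the fused tasks have
# a different run_spec and different dependencies than the unfused ones.
```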
An example where this matters is the following (the async code is not strictly necessary, but the condition is a bit difficult to trigger and this helps; paste this code in a Jupyter notebook and run it a couple of times).
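The original reproducer snippet was not preserved in this copy of the thread; the following is only a rough sketch of the shape described below (persist a shuffled frame, then compute a slightly different version of the same graph while the persisted futures are released), with all specifics assumed:

```python
import dask.dataframe as dd
import pandas as pd
from distributed import Client

async def main():
    async with Client(asynchronous=True, processes=False) as client:
        df = pd.DataFrame({"a": range(100), "b": range(100)})
        ddf = dd.from_pandas(df, npartitions=10)
        shuffled = ddf.shuffle("a")     # graph containing shuffle layers
        persisted = shuffled.persist()  # first submission; keys may be fused
        # Compute a slightly different graph that shares keys with the
        # persisted one, while the persisted futures are being released:
        result = await client.compute(shuffled.sum())
        del persisted

# In a notebook: `await main()`, repeated a couple of times.
```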
Note how the initial shuffle is persisted and a slightly different version of this graph is computed again below. From what I can gather, the initial persist fuses the keys while the latter compute does not (maybe it's the other way round, I'm not sure; either way, the issue is the same).
This specific reproducer actually triggers (though not every time) a `KeyError` in a worker's data buffer while trying to read data. This is caused by the dependency relations between tasks no longer being accurate on the scheduler: it considers a task "ready", i.e. all dependencies in memory, too soon, causing a failure on the worker.

When `validate` is activated, the scheduler catches these cases earlier and raises appropriate AssertionErrors. This is not checked at runtime for performance reasons, and it is typically not necessary since we rely on the assumption that keys identify a task uniquely.

Apart from this artificial example, we do have internal reports about such a spurious KeyError in combination with an xgboost workload.
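For reference, a hedged sketch of turning scheduler validation on via configuration (the `distributed.scheduler.validate` key exists in distributed's config schema, though availability and behavior may vary by version):

```python
import dask
from distributed import Client

# Run the scheduler with internal consistency checks enabled (slow; debug only).
with dask.config.set({"distributed.scheduler.validate": True}):
    client = Client(processes=False)
```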