test_stress_creation_and_deletion: ValueError: Could not find dependent #5172

Open

crusaderky opened this issue Aug 5, 2021 · 5 comments
Labels: flaky test (Intermittent failures on CI)

@crusaderky (Collaborator)

This test frequently fails at random on CI with the exception

E               ValueError: Could not find dependent ('sum-99d8cecd7d8e742da266a502529a4475', 2, 1).  Check worker logs

e.g. https://github.com/dask/distributed/pull/5168/checks?check_run_id=3249880498

The more I look at this test, the more convinced I am that the test itself is fine and that it is genuinely reporting that the one thing it is supposed to verify - that a computation must be resilient to nanny restarts - is not OK.

@fjetter (Member) commented Aug 5, 2021

There is a suspicious counter on the worker: if it cannot find a dependency after 5 attempts, it will raise this message. Looking at how the test is written, there is nothing protecting us from breaching this threshold, so this failure is anticipated in X% of runs.
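In rough pseudocode the pattern is something like the sketch below (the names here are made up for illustration; the actual logic lives in distributed/worker.py):

# Simplified sketch of the bounded-retry pattern described above; not the
# actual worker code.
MAX_ATTEMPTS = 5  # assumed threshold

def find_dependent(key, locate, max_attempts=MAX_ATTEMPTS):
    """Ask locate(key) repeatedly; give up once the attempt budget is spent."""
    for _ in range(max_attempts):
        workers = locate(key)
        if workers:
            return workers
        # Nothing in the stress test stops a dependency from staying missing
        # for this many attempts while workers are killed and restarted.
    raise ValueError(f"Could not find dependent {key!r}.  Check worker logs")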

@woodcockr

This looks a lot like an error I have been seeing since updating to 2021.07.0; in 2021.07.1 & .2 it is persistent and prevents the workflow from completing.
I'm trying to work up an appropriate reproducer but having a little trouble given the randomness involved. It appears to occur more often when workers are under memory stress, even more so if they are spilling to disk, and when there are 10-20 workers (with 8-16 threads each). In 2021.07.1/2 it will fail the vast majority of the time and often the worker logs are completely empty. 2021.07.0 shows similar behaviour, but I do get logs with "Could not find dependent".
The workflow will often simply stop at this point with no errors on the client side and tasks still remaining - presumably deadlocked somewhere.

The workflow is not complicated - data load, per-element math, a reduction (mean) and a rechunk operation.
Generally the failure occurs when the final result is being pulled together - so out of 100,000 tasks, a few hundred remain at the point of lock-up.
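For reference, the rough shape is something like the sketch below (the data source, array sizes, chunking and scheduler address are made-up placeholders, not my actual code):

import dask.array as da
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address; the real cluster has 10-20 workers

data = da.random.random((200_000, 1_000), chunks=(2_000, 1_000))  # stand-in for the data load
scaled = data * 2.5 - 1.0         # per-element math
reduced = scaled.mean(axis=1)     # reduction (mean)
result = reduced.rechunk(20_000)  # rechunk operation
print(result.compute())           # the failure tends to show up while this last stage is gathered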

I'm not deep in the dask code, but I might take a look at this test to see if it can assist in creating a reproducer.

@fjetter (Member) commented Aug 6, 2021

@woodcockr If you have a reliable reproducer, that would be extremely helpful. I'll take another look at the test to see if I can find something.

@woodcockr

My apologies, I still can't get a reliable reproducer that isn't my entire code set. I'll keep chipping away, but 2021.07.0 succeeds more often than not, whereas 07.1 and 07.2 just fail to complete. When I have some clear air from current tasks I'll try again.

@chrisroat (Contributor)

I recently updated to 2021.9.1 and see this issue in a run:

ValueError: Could not find dependent ('from-zarr-d3ffd833f6e9a7a74704507ec969f469', 4, 0, 6, 2).  Check worker logs

I'm running on GKE and searched the logs for the task name; there are many, many hits of the form:

Can't find dependencies ... for key ...
Dependent not found ... Asking scheduler
Task ... does not know who has ...
No workers found for ...

There is also an interesting assertion failure where the code expects "OK" but gets back what looks like a get_data request message instead:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 502, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1461, in get_data
    assert response == "OK", response
AssertionError: {'op': 'get_data', 'keys': {"('from-zarr-d3ffd833f6e9a7a74704507ec969f469', 4, 1, 16, 22)"}, 'who': 'tls://10.17.242.4:39777', 'max_connections': None, 'reply': True}
