test_dask_collections.py::test_dataframe_set_index_sync passes locally, fails on CI #8561
Comments
Just getting started with debugging. I haven't had any luck reproducing this locally so far. It could be either a timing issue or tests interfering with each other. The logs already indicate something interesting:
2024-03-08 10:07:19,968 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 0)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 1)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 2)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 3)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 4)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 5)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 6)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 7)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 8)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 9)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('operation-8ebe45051147aef38027a1960f230fee', 10)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 0)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 1)
2024-03-08 10:07:19,969 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 2)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 3)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 4)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 5)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 6)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 7)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 8)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 9)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('chunk-28ced5066580337a26eef5edca3c5f01', 10)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('len-tree-b08e45d3088106d64c77136c41d6bd5b', 1, 0)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('len-tree-b08e45d3088106d64c77136c41d6bd5b', 1, 1)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('len-tree-b08e45d3088106d64c77136c41d6bd5b', 2, 0)
2024-03-08 10:07:19,970 - distributed.scheduler - INFO - User asked for computation on lost data, ('len-tree-b08e45d3088106d64c77136c41d6bd5b', 0)
This message comes from a check implemented in distributed/distributed/scheduler.py, lines 4466 to 4478 at commit 0438768. That would explain the CancelledError, but it is not yet clear how we end up in this situation.
|
This condition can be triggered if the computed dependencies and the actual graph drift apart. The "real" condition this should alert to is a previously scattered result having been lost. In this particular test there is no scattered data, so this could be interpreted as the cluster dropping a persisted result. An alternative theory is that the graph materialization, particularly around unpacking remote tasks, is somehow broken. |
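To make the "dependencies and graph drift" theory concrete, here is a minimal sketch of such a check. It is not the scheduler's actual code: `find_lost_keys`, the dict-of-lists graph layout, and the shortened key names are all assumptions made for illustration. A key counts as lost when it is neither held in memory nor defined in the submitted graph, and anything depending on a lost key is lost as well.

```python
def find_lost_keys(requested, graph, in_memory):
    """Return every visited key that cannot be computed (toy model).

    graph: dict mapping a key to the list of keys it depends on.
    in_memory: set of keys whose results are already available.
    """
    status = {}  # key -> True if computable, False if lost

    def computable(key):
        if key in status:
            return status[key]
        if key in in_memory:
            status[key] = True
        elif key not in graph:
            # Referenced as a dependency, but nobody knows how to build it.
            status[key] = False
        else:
            status[key] = True  # provisional value, protects against cycles
            # Visit every dependency (no short-circuit) so all lost keys are found.
            results = [computable(dep) for dep in graph[key]]
            status[key] = all(results)
        return status[key]

    for key in requested:
        computable(key)
    return {key for key, ok in status.items() if not ok}


# Hypothetical graph in which the 'operation-*' keys are referenced as
# dependencies but are missing from the graph itself.
graph = {
    "len-tree": ["chunk-0", "chunk-1"],
    "chunk-0": ["operation-0"],
    "chunk-1": ["operation-1"],
}

lost_when_drifted = find_lost_keys(["len-tree"], graph, in_memory=set())
lost_when_present = find_lost_keys(
    ["len-tree"], graph, in_memory={"operation-0", "operation-1"}
)
```

In this toy model a single missing layer at the bottom of the graph marks every key above it as lost, which matches the shape of the log output: all partitions of `operation-*`, `chunk-*` and `len-tree-*` are reported together.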
Looks like I can reproduce locally when running the entire test suite. That likely indicates some dirty state. Will bisect the test suite to narrow it down. |
This seems to reproduce it for me locally. When I change anything here, the test passes.
pytest --runslow \
distributed/cli/tests/test_dask_scheduler.py::test_idle_timeout \
distributed/cli/tests/test_dask_scheduler.py::test_restores_signal_handler \
distributed/tests/test_dask_collections.py::test_dataframe_set_index_sync
I have no idea what the connection between the CLI tests and set_index is, or why I have to run both CLI tests to trigger this. Unfortunately this still takes many seconds to run, so it's difficult to iterate quickly (the CLI tests are excruciatingly slow nowadays). |
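The narrowing-down step can be automated. The following is a sketch under stated assumptions, not the procedure actually used in this thread: `minimize` and `run_pytest` are hypothetical helpers, and `run_pytest` treats any non-zero pytest exit code as "the victim failed", which is only valid if the preceding tests themselves pass. The idea is to drop one preceding test at a time and keep the removal whenever the victim still fails.

```python
import subprocess
import sys


def minimize(candidates, still_fails):
    """Greedily drop tests that are not needed to reproduce the failure.

    still_fails(preceding) must return True if the victim test fails when
    run after the tests listed in 'preceding'.
    """
    kept = list(candidates)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if still_fails(trial):
            kept = trial  # this test was not needed, drop it for good
        else:
            i += 1  # this test is required, keep it and move on
    return kept


def run_pytest(preceding, victim):
    """Hypothetical predicate: run the tests in order, report whether pytest failed."""
    cmd = [sys.executable, "-m", "pytest", "--runslow", *preceding, victim]
    return subprocess.run(cmd).returncode != 0


# Stand-in predicate for demonstration: the failure needs both CLI tests.
def fake_still_fails(preceding):
    required = {"test_idle_timeout", "test_restores_signal_handler"}
    return required <= set(preceding)


polluters = minimize(
    ["test_a", "test_idle_timeout", "test_b", "test_restores_signal_handler", "test_c"],
    fake_still_fails,
)
```

With the real suite, `still_fails` would be something like `lambda preceding: run_pytest(preceding, victim)`. Each probe costs one full pytest invocation, so with slow CLI tests this is still expensive, but it runs unattended.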
I still don't understand what is happening but the two CLI tests are the only (nontrivial) tests in If I replace ClickRunner with a |
Also, the test This all points towards a caching issue. Maybe something in the vicinity of tokenization that is messing with the graphs but I'm just guessing at this point |
Reproduced. And if I add dataframe:
query-planning: false to my |
Yes, I could already trace this back to the singleton implementation. If I remove object deduplication, the issue goes away. This means that something is reusing names even though it mustn't. I have no idea how this connects to the ClickRunner, though. |
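As an illustration of why name reuse is dangerous when objects are deduplicated by name, here is a toy singleton cache. It is an assumption-laden sketch, not the implementation under discussion: the `Expr` class, its `_instances` registry, and the `payload` attribute are hypothetical.

```python
import weakref


class Expr:
    """Toy expression class that deduplicates instances by name."""

    # Registry of live instances; entries vanish once an object is garbage collected.
    _instances = weakref.WeakValueDictionary()

    def __new__(cls, name, payload):
        existing = cls._instances.get(name)
        if existing is not None:
            # Same name is assumed to mean same content: the new payload is ignored.
            return existing
        obj = super().__new__(cls)
        obj.name = name
        obj.payload = payload
        cls._instances[name] = obj
        return obj


first = Expr("operation-abc", payload=[1, 2, 3])
# A different payload that (wrongly) ends up with the same name.
second = Expr("operation-abc", payload=[4, 5, 6])

same_object = second is first
stale_payload = second.payload
```

If two logically different objects produce the same name, the second construction silently returns the first object, so later code works with a stale graph. That is the kind of failure that disappears once deduplication is switched off.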
Ok... so, the problem we're facing here is that the tokenization of |
Could you link the relevant code? |
Test passes fine locally but consistently fails on CI with CancelledError on lambda parameter.
https://github.com/milesgranger/distributed/actions/runs/8201545593/job/22430476616#step:19:3885
xref: #8560