Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error on cancelled tasks due to disconnect #8705

Merged
merged 11 commits into from
Jun 24, 2024

Conversation

hendrikmakait
Copy link
Member

Closes #8690

  • Tests added / passed
  • Passes pre-commit run --all-files

Copy link
Contributor

github-actions bot commented Jun 19, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files  ±    0      29 suites  ±0   11h 31m 28s ⏱️ + 1h 45m 39s
 4 063 tests  -     3   3 959 ✅ +    8     97 💤  -   9  7 ❌ +2 
55 967 runs  +7 625  53 795 ✅ +7 391  2 163 💤 +248  9 ❌ +3 

For more details on these failures, see this check.

Results for commit 192674b. ± Comparison against base commit 03035da.

This pull request removes 13 and adds 10 tests. Note that renamed tests count towards both.
distributed.protocol.tests.test_arrow
distributed.protocol.tests.test_collection
distributed.protocol.tests.test_highlevelgraph
distributed.protocol.tests.test_numpy
distributed.protocol.tests.test_pandas
distributed.shuffle.tests.test_graph
distributed.shuffle.tests.test_merge
distributed.shuffle.tests.test_merge_column_and_index
distributed.shuffle.tests.test_metrics
distributed.shuffle.tests.test_rechunk
…
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[report_args0]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[1]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[True]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[False]
distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[report_args0]
distributed.tests.test_client ‑ test_client_disconnect_exception_on_cancelled_futures
distributed.tests.test_client ‑ test_scheduler_restart_exception_on_cancelled_futures

♻️ This comment has been updated with latest results.

@@ -5445,7 +5456,7 @@ async def _wait(fs, timeout=None, return_when=ALL_COMPLETED):
{fu for fu in fs if fu.status != "pending"},
{fu for fu in fs if fu.status == "pending"},
)
cancelled = [f.key for f in done if f.status == "cancelled"]
cancelled = {f.key: f._state.exception for f in done if f.cancelled()}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might become very large. Ideally, we'd group these by the reason for cancellation.

bad = {f for f in futures if f.cancelled()}
if bad:
raise CancelledError(bad)
cancelled = {f.key: f._state.exception for f in futures if f.cancelled()}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might become very large. Ideally, we'd group these by the reason for cancellation.

if key not in self.futures or self.futures[key].status in failed:
for future in future_set:
key = future.key
if key not in self.futures or future.status in failed:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As opposed to futures_of and wait, we only raise a single exception here. We should probably align this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do this in a follow-up PR, this mixes cancellations with actual errors.

@hendrikmakait hendrikmakait marked this pull request as draft June 19, 2024 15:53
@hendrikmakait hendrikmakait marked this pull request as ready for review June 19, 2024 16:55
distributed/client.py Outdated Show resolved Hide resolved
@fjetter
Copy link
Member

fjetter commented Jun 21, 2024

test failure in test_scheduler_restart_exception_on_cancelled_futures might be related

@hendrikmakait
Copy link
Member Author

test failure in test_scheduler_restart_exception_on_cancelled_futures might be related

Should be fixed, it was one of those tests that worked locally but failed on CI

Comment on lines 8600 to 8601
# @pytest.mark.slow
@gen_cluster(client=True, Worker=Nanny, timeout=60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still for debugging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

@hendrikmakait hendrikmakait merged commit f925766 into dask:main Jun 24, 2024
24 of 34 checks passed
@hendrikmakait hendrikmakait deleted the 8690 branch June 24, 2024 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better exception if scheduler disconnects from client
2 participants