
Do not reset CUDA context after UCX tests #8201

Merged
merged 4 commits into dask:main from ucx-test-no-cuda-context-reset on Sep 25, 2023

Conversation

@pentschev (Member) commented Sep 21, 2023

Resetting CUDA contexts during a running process may have unintended consequences for third-party libraries -- e.g., CuPy -- that store state based on the context. Therefore, prevent destroying the CUDA context for now.
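
As a rough illustration of the failure mode (a minimal sketch only, assuming CuPy and Numba on a CUDA-capable machine; this is not the actual teardown code used by the UCX tests):

```python
# Minimal sketch (assumptions: CuPy and Numba are installed and a CUDA device
# is present). It illustrates why tearing down the CUDA context mid-process is
# risky for libraries that cache allocations and handles keyed on that context.
import cupy as cp
from numba import cuda

a = cp.arange(10)   # CuPy allocates device memory in the current CUDA context
print(a.sum())      # fine: the context backing the allocation is still alive

cuda.close()        # tear down Numba's CUDA context state for this thread

# CuPy may still hold pointers and handles tied to the released context, so a
# later operation can fail with a CUDA error (e.g. invalid context) or crash
# the process, depending on how the context was created and shared.
print(a.sum())
```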

Additionally, fix a cuDF test failure caused by a FutureWarning.
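
For illustration only (the actual cuDF-related change in this PR may simply move to the non-deprecated API; all names below are hypothetical), this is how a dependency's FutureWarning turns into a test failure once warnings are escalated to errors, and how a test can acknowledge the warning explicitly:

```python
# Hypothetical sketch: a dependency call that emits a FutureWarning fails the
# test when warnings are escalated to errors; pytest.warns() records the
# warning explicitly instead. This is not the actual change made in this PR.
import warnings
import pytest


def dependency_call():
    # Stand-in for a third-party API that warns about an upcoming change.
    warnings.warn("this behaviour will change", FutureWarning)
    return 42


@pytest.mark.filterwarnings("error::FutureWarning")
def test_escalated_warning_raises():
    # With warnings escalated, the FutureWarning surfaces as an exception.
    with pytest.raises(FutureWarning):
        dependency_call()


@pytest.mark.filterwarnings("error::FutureWarning")
def test_warning_acknowledged():
    # pytest.warns() captures the warning instead of letting it fail the test.
    with pytest.warns(FutureWarning):
        assert dependency_call() == 42
```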

Closes #8194

  • Tests added / passed
  • Passes pre-commit run --all-files

@pentschev (Member, Author)

rerun tests

@pentschev (Member, Author)

It seems gpuCI is failing to resolve github.com; I've raised an internal issue to report that.

@github-actions bot (Contributor) commented Sep 21, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0   ·   21 suites ±0   ·   10h 20m 37s ⏱️ (−9m 0s)
3 814 tests (+1):   3 702 passed (−1),   107 skipped (±0),   5 failed (+2)
36 865 runs (+22):  35 061 passed (+26),  1 799 skipped (−6),  5 failed (+2)

For more details on these failures, see this check.

Results for commit 119dc14. ± Comparison against base commit 2858930.

♻️ This comment has been updated with latest results.

@pentschev (Member, Author)

rerun tests

@pentschev (Member, Author)

> It seems gpuCI is failing to resolve github.com; I've raised an internal issue to report that.

This is now resolved.

@pentschev (Member, Author)

rerun tests

@pentschev (Member, Author)

rerun tests

1 similar comment
@pentschev (Member, Author)

rerun tests

@jrbourbeau (Member) left a comment

Thanks @pentschev! So looks like gpuCI completed successfully three times in a row, is that right?

@@ -40,6 +40,8 @@ gpuci_logger "Activate conda env"
 . /opt/conda/etc/profile.d/conda.sh
 conda activate dask
+
+mamba install -y 'aws-sdk-cpp<1.11'
Member

Is this due to some upstream issue? The actual change here looks totally fine, but it might be worth a comment with extra context.

@pentschev (Member, Author)

Indeed, this is due to aws/aws-sdk-cpp#2681, but the change here was just to test/confirm this. For now we will likely resolve the issue via rapidsai/cudf#14173 and revert this change here.

Sorry for not commenting here earlier; I was trying to get this in before bringing people in to review the changes here. 🙂

Member

No worries. I just noticed that rapidsai/cudf#14173 is merged -- does that mean we no longer need this here?

@pentschev (Member, Author)

It does now; we first had to fix some issues (mark old cuDF packages as broken and build new gpuCI Docker images). I reverted the aws-sdk-cpp changes now; we just need to wait and see if everything passes, and then we should be good to merge this. 🙂

@pentschev (Member, Author)

> Thanks @pentschev! So looks like gpuCI completed successfully three times in a row, is that right?

Yes, that's right. I was rerunning multiple times to confirm. 😄

@jrbourbeau (Member) left a comment

Thanks for fixing @pentschev! I left one non-blocking comment about possibly removing the aws-sdk-cpp pin, but it's not critical.

@pentschev (Member, Author)

rerun tests

1 similar comment
@pentschev (Member, Author)

rerun tests

@pentschev (Member, Author)

Alright, gpuCI tests now passed 3 times in a row. I think this should be good to go @jrbourbeau @charlesbluca @quasiben @wence- .

Given we're not changing anything that runs in the non-gpuCI tests, I don't think any of those failures are related to this PR. In case anybody sees a correlation I don't, the failing tests are summarized below:

FAILED tests/test_explicit_comms.py::test_dataframe_shuffle[ucx-cudf-3] - AssertionError: assert not 1
 +  where 1 = <SpawnProcess name='SpawnProcess-15' pid=8047 parent=808 stopped exitcode=1>.exitcode
FAILED distributed/shuffle/tests/test_shuffle.py::test_closed_worker_during_unpack - AssertionError: assert 'shuffle' not in 'shuffle-711...5ab91499-213'
  'shuffle' is contained here:
    shuffle-711a259b679cd132c6eb25b25ab91499-213
  ? +++++++
FAILED distributed/shuffle/tests/test_shuffle.py::test_crashed_worker_after_shuffle - UserWarning: Timed out waiting for sampling thread.
FAILED distributed/tests/test_variable.py::test_race - TimeoutError
FAILED distributed/tests/test_client.py::test_forward_logging - AssertionError: {<Nanny: None, threads: 2>: <Status.closed: 'closed'>, <Nanny: tcp://127.0.0.1:57566, threads: 1>: <Status.closing: 'closing'>}
assert False

@jrbourbeau merged commit e1387a0 into dask:main on Sep 25, 2023
23 of 28 checks passed
@pentschev (Member, Author)

Thanks @jrbourbeau for reporting/merging and @charlesbluca for the help in debugging and creating new gpuCI images for this. 🙂

@pentschev deleted the ucx-test-no-cuda-context-reset branch on September 26, 2023, at 13:08
Development

Successfully merging this pull request may close these issues:

  • gpu CI failing pretty consistently with segfault (#8194)

3 participants