
Add cuDF spilling statistics to RMM/GPU memory plot #8148

Merged · 26 commits merged into dask:main on Dec 18, 2023

Conversation

@charlesbluca (Member) commented Aug 31, 2023

Addresses rapidsai/dask-cuda#1226

Exposes cuDF's spilling statistics as an extra worker metric, and refactors the RMM/GPU memory plot to reflect this information in a similar way to the standard worker memory plot.

cc @pentschev @quasiben

  • Tests added / passed
  • Passes pre-commit run --all-files
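
For context, a rough sketch of how an extra per-worker metric like this can be wired up, guarded by the same configuration key used elsewhere in this PR. This is an illustration only, not the PR's actual implementation: DEFAULT_METRICS is assumed to be distributed's registry of per-worker metric callables, and the cuDF attribute access is an assumption.

    # Sketch only: register an extra worker metric for cuDF spill statistics.
    import dask.config
    from distributed.worker import DEFAULT_METRICS


    def cudf_spilling(worker):
        """Return cuDF spill statistics for this worker, or None if unavailable."""
        try:
            from cudf.core.buffer.spill_manager import get_global_manager
        except ImportError:
            return None
        manager = get_global_manager()
        if manager is None:
            return None
        # Illustrative only: expose whatever statistics object cuDF maintains.
        return getattr(manager, "statistics", None)


    if dask.config.get("distributed.diagnostics.cudf", default=False):
        DEFAULT_METRICS["cudf"] = cudf_spilling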

github-actions bot (Contributor) commented Aug 31, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

Files:  27 (±0)      Suites: 27 (±0)         Duration: 11h 48m 16s (-2m 58s)
Tests:  3 939 (+1)   Passed: 3 826 (±0)      Skipped: 111 (+1)      Failed: 2 (±0)
Runs:   49 563 (+27) Passed: 47 235 (±0)     Skipped: 2 317 (+27)   Failed: 11 (±0)

For more details on these failures, see this check.

Results for commit 936f0f6. Comparison against base commit 87576ae.

♻️ This comment has been updated with latest results.

@pentschev (Member) commented:

That's awesome @charlesbluca! When this is somewhat usable (if it isn't already), would you mind posting some screenshot(s) so we can see what it looks like?

@charlesbluca (Member, Author) commented:

Made some modifications to add this information to the individual RMM plot, since that seems to be the most sensible place for it; I've opted for something similar to the individual worker memory plots, where spilled memory is represented as grey:

[Screenshot: individual RMM plot with spilled memory shown in grey]

Note that the color scheme follows Dask defaults because I'm using MemoryColor to control the bars' coloring; if this is a problem, it shouldn't be too difficult to modify that class to allow overriding the default color scheme.

@charlesbluca charlesbluca marked this pull request as ready for review September 1, 2023 19:19
@charlesbluca charlesbluca changed the title Add individual dashboard page for cuDF spilling statistics Add cuDF spilling statistics to RMM/GPU memory plot Sep 1, 2023
@pentschev (Member) commented:

> Made some modifications to add this information to the individual RMM plot, since that seems to be the most sensible place for it; I've opted for something similar to the individual worker memory plots, where spilled memory is represented as grey:

It seems that we've now lost the opacity for GPU memory used/RMM pool size/used memory size, or is it simply the case that you didn't have a pool enabled and thus it didn't show?

> Note that the color scheme follows Dask defaults because I'm using MemoryColor to control the bars' coloring; if this is a problem, it shouldn't be too difficult to modify that class to allow overriding the default color scheme.

I think the green tones were @jacobtomlinson's idea; they provide a nice touch for GPU memory, and personally I think it looked nicer as well. In any case, I'd defer visual details to Jacob; I think he's back Monday. 🙂

@jacobtomlinson (Member) left a comment:

I'm in two minds about the colours. The defaults will feel familiar to users, but they may confuse regular memory plots with GPU memory plots. Using green more clearly highlights they are NVIDIA GPU memory plots. My previous comment on this subject was that there was an early suggestion to use a RAPIDS purple-based scheme, but my recommendation was to stick with green as users may use other GPU libraries with Dask.

@charlesbluca (Member, Author) commented:

> It seems that we've now lost the opacity for GPU memory used/RMM pool size/used memory size (#7718 (comment)), or is it simply the case that you didn't have a pool enabled and thus it didn't show?

Yup, apologies for the confusion; with a partially utilized RMM pool enabled, things should look something like this:

[Screenshot: individual RMM plot with a partially utilized RMM pool enabled]

Based on @jacobtomlinson's comment, I think it makes sense to keep the original green to avoid potential confusion here, so I will go ahead and push some changes to MemoryColor to allow the colors to be overridden.

Comment on lines 279 to 282:

    def __init__(self, neutral_color="blue", target_color="orange", terminated_color="red"):
        self.neutral_color = neutral_color
        self.target_color = target_color
        self.terminated_color = terminated_color
@charlesbluca (Member, Author) commented:

Worth noting that we could also configure the colors for the worker memory bars when they are approaching the memory limit (i.e. possibly spilling) or have been terminated; right now I've only overridden the neutral color for the GPU plot, but I'm not sure whether it makes sense to use different colors for these other states.
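
For illustration, a minimal sketch of how the GPU plot could keep the NVIDIA-green scheme once these keyword arguments are in place; the import path is an assumption about where MemoryColor lives, and only the neutral color is overridden here:

    # Sketch: override only the neutral color for the GPU/RMM plot, keeping the
    # default target/terminated colors (the open question discussed above).
    from distributed.dashboard.components.scheduler import MemoryColor  # path assumed

    gpu_colors = MemoryColor(neutral_color="green")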

Comment on lines 284 to 286:

    target = dask.config.get("distributed.worker.memory.target")
    spill = dask.config.get("distributed.worker.memory.spill")
    terminate = dask.config.get("distributed.worker.memory.terminate")
@charlesbluca (Member, Author) commented:

Assuming these variables don't come up when deciding GPU-to-CPU spilling, they can probably be ignored in that case, but I'm not sure what the best way to do this is. My first thought was to just use a context manager to override the config when initializing the MemoryColor, but I'm not sure if that will screw up AMM behavior down the line.
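
A rough illustration of the context-manager idea, using dask.config.set to temporarily disable the CPU memory thresholds while the color helper reads its configuration; whether this interferes with AMM behavior later is exactly the open question above:

    # Sketch: temporarily override the worker-memory thresholds so they don't
    # influence the GPU plot's coloring; the config reverts when the block exits.
    import dask

    from distributed.dashboard.components.scheduler import MemoryColor  # path assumed

    with dask.config.set(
        {
            "distributed.worker.memory.target": False,
            "distributed.worker.memory.spill": False,
            "distributed.worker.memory.terminate": False,
        }
    ):
        gpu_colors = MemoryColor(neutral_color="green")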

@quasiben (Member) commented Sep 7, 2023

This looks really good! Thanks @charlesbluca. I think what's missing now is a test. I think pulling inspiration from test_worker_http.py or the previous RMM tests in test_rmm_diagnostics.py would be good.
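
A hypothetical sketch of such a test; the config handling, metric key, and assertion are assumptions for illustration, not the test that was eventually added:

    # Sketch: with the diagnostics config enabled, each worker heartbeat should
    # report an extra cuDF metric to the scheduler's WorkerState.
    import pytest

    from distributed.utils_test import gen_cluster

    pytest.importorskip("cudf")


    @gen_cluster(client=True, config={"distributed.diagnostics.cudf": True})
    async def test_cudf_spill_metrics(c, s, a, b):
        for ws in s.workers.values():
            assert "cudf" in ws.metrics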

@quasiben (Member) commented Sep 7, 2023

In #8148 (comment), can you remind me what all the values are?

  • 17.88 GiB is?
  • 5.96 is?
  • 95.37 GiB is?
  • 256 GiB is the total amount of GPU memory (this is the RMM pool allocation across all 8 GPUs), correct?

@charlesbluca (Member, Author) commented:

> In #8148 (comment), can you remind me what all the values are?

Ah, I notice now that the numbers there look a little wonky; I had hardcoded some values to illustrate what all the colors/opacities should look like, but missed that the RMM utilization exceeds the pool size 😅 A more realistic representation of what things would look like is this:

[Screenshot: RMM/GPU memory plot with realistic values]

In the above screenshot:

  • 11.93 GiB represents the sum of RMM allocated bytes across all the workers
  • 23.84 GiB represents the sum of the RMM pool sizes across all workers
  • 47.68 GiB represents the sum of GPU memory utilization (as tracked by pynvml) across all workers
  • 256.00 GiB represents the sum of total GPU memory (as tracked by pynvml) across all workers

The worker hover tool shows these same values for each individual worker.
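
To make the aggregation explicit, here is a small illustrative example; the field names are hypothetical, not the dashboard's actual keys:

    # Illustrative only: the bar shows cluster-wide sums of per-worker values,
    # while the hover tool shows each worker's own values.
    per_worker = {
        "worker-0": {"rmm-used": 6.0, "rmm-pool": 12.0, "gpu-used": 24.0, "gpu-total": 128.0},
        "worker-1": {"rmm-used": 5.9, "rmm-pool": 11.8, "gpu-used": 23.7, "gpu-total": 128.0},
    }

    cluster_totals = {
        key: sum(values[key] for values in per_worker.values())
        for key in next(iter(per_worker.values()))
    }
    print(cluster_totals)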

> I think what's missing now is a test

Yup, I will go ahead and add a test once I resolve the merge conflicts.

@jacobtomlinson (Member) left a comment:

This is generally looking good to me from a GPU perspective, and I would be happy to merge it.

However, there seem to be a lot of CI failures, and I'm a little out of the loop on the state of the distributed CI. Perhaps @fjetter or @hendrikmakait could weigh in on whether these failures are expected?

@jacobtomlinson (Member) commented:

@fjetter @hendrikmakait @jrbourbeau another gentle nudge. Are the CI failures here of concern?

# enable monitoring of cuDF spilling
export CUDF_SPILL=on
export CUDF_SPILL_STATS=1
export DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1
@charlesbluca (Member, Author) commented:

I initially added this config variable to get test_no_unnecessary_imports_on_worker[scipy] and test_malloc_trim_threshold passing in GPU CI, but it seems like there isn't a trivial way to enable/disable cuDF spilling monitoring on a per-test basis.

Is there a way that we could somehow achieve this, or would it make sense to just not run these specific tests on GPU if we're expecting users to have cuDF installed on the workers?

A reviewer (Member) commented:

Would doing the same as this test work? If not, maybe our only option would be to launch the test in a separate process so that we can have full control of the environment variables before importing cudf.
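
A rough sketch of the separate-process idea: set the spilling-related environment variables and only import cudf in a fresh interpreter, so they take effect before any of cuDF's import-time configuration runs. The inline check and option name are assumptions:

    # Sketch: run the cuDF-dependent check in a child process whose environment
    # already has spilling enabled, so the parent process never imports cudf.
    import os
    import subprocess
    import sys

    CHECK = "import cudf; print(cudf.get_option('spill'))"  # option name assumed


    def run_with_spilling_enabled():
        env = dict(os.environ, CUDF_SPILL="on", CUDF_SPILL_STATS="1")
        return subprocess.run([sys.executable, "-c", CHECK], env=env, check=True)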

    @gen_cluster(
        client=True,
        nthreads=[("127.0.0.1", 1)],
        Worker=dask_cuda.CUDAWorker,
@charlesbluca (Member, Author) commented:

cc @pentschev, this was the impetus for rapidsai/dask-build-environment#73; using the default worker class here, we seem to be unable to initialize the spilling monitoring. This was unexpected to me, as I would think all we require is a spilling-enabled installation of cuDF on the workers.

A reviewer (Member) commented:

Would my comment above (#8148 (comment)) help with the initialization issues? Otherwise, let's leave it like this for now and discuss how to improve our testing strategy next week.

@charlesbluca (Member, Author) commented:

rerun tests

@hendrikmakait hendrikmakait self-requested a review October 25, 2023 15:40
@hendrikmakait (Member) left a comment:

Sorry for the long silence; the CI failures are unrelated.

    if dask.config.get("distributed.diagnostics.cudf"):
        try:
            import cudf as _cudf  # noqa: F401
        except Exception:
A reviewer (Member) commented:

nit:

Suggested change:

    - except Exception:
    + except ImportError:
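
Applied to the snippet above, the guarded import would read roughly as follows; the body of the except branch is an assumption, since it is not shown in the quoted lines:

    # Sketch of the guarded import after the suggested change.
    import dask.config

    if dask.config.get("distributed.diagnostics.cudf"):
        try:
            import cudf as _cudf  # noqa: F401
        except ImportError:
            pass  # cuDF unavailable; skip spill diagnostics (assumed fallback)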

@hendrikmakait hendrikmakait merged commit b44e661 into dask:main Dec 18, 2023
20 of 34 checks passed