
Add configuration to enable/disable NVML diagnostics #4893

Merged
merged 5 commits into dask:main on Jun 17, 2021

Conversation

pentschev (Member)

In some cases, users may want or need to explicitly disable NVML diagnostics; this PR adds support for that via the Dask config.
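For reference, a minimal sketch of what disabling this in a Dask config file (e.g. `~/.config/dask/distributed.yaml`) might look like, assuming the key lands as `distributed.diagnostics.nvml` as in the diff discussed in this review:

```yaml
# Hypothetical user config; the key name follows the
# dask.config.get("distributed.diagnostics.nvml") call added in this PR.
distributed:
  diagnostics:
    nvml: false  # disable NVML diagnostics
```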

@pentschev (Member Author)

Notice that I introduced a new section, diagnostics, for lack of a better place to add nvml, but feel free to suggest a more suitable section or name to replace diagnostics.

@quasiben (Member) commented Jun 9, 2021

What do you think about placing this under the Dashboard section instead of creating a new section?

cc @jacobtomlinson

@pentschev (Member Author)

> What do you think about placing this under the Dashboard section instead of creating a new section?

I thought about that, but I reached the conclusion it could be more confusing, since the NVML module lives under distributed/diagnostics, while the distributed.dashboard configs apply to the separate distributed/dashboard directory.

@quasiben (Member) commented Jun 9, 2021

That seems reasonable. @jrbourbeau, how do you feel about adding a new config section?

```diff
-    global nvmlInitialized, nvmlLibraryNotFound, nvmlOwnerPID
+    global nvmlEnabled, nvmlInitialized, nvmlLibraryNotFound, nvmlOwnerPID
+
+    nvmlEnabled = dask.config.get("distributed.diagnostics.nvml")
```

IIUC, this nvmlEnabled variable appears to be equivalent to the distributed.diagnostics.nvml config value. If that's the case, I'd prefer not to introduce a new variable and instead just use the config value.

@pentschev (Member Author)

This is a good point, addressed in fed9e97.

Comment on lines +198 to +199
```yaml
diagnostics:
  nvml: True
```

Since this is GPU specific, we might want to put this option in its own section, like rmm and ucx:

```yaml
rmm:
  pool-size: null
ucx:
  cuda_copy: False  # enable cuda-copy
  tcp: False  # enable tcp
  nvlink: False  # enable cuda_ipc
  infiniband: False  # enable Infiniband
  rdmacm: False  # enable RDMACM
  net-devices: null  # define what interface to use for UCX comm
  reuse-endpoints: null  # enable endpoint reuse
```

@pentschev (Member Author)

I don't mind moving it there if you prefer, but the general feeling seems to be that it was a mistake to keep those outside of the distributed schema. For example, #4904 is attempting to move ucx into distributed to comply with the general config schema.

@jacobtomlinson (Member)

I think putting this under diagnostics makes sense.
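For users following along: Dask config keys can generally also be set through environment variables, where the `DASK_` prefix and double underscores map onto the nested key names. Assuming the final key is `distributed.diagnostics.nvml`, disabling it from the shell might look like:

```shell
# Dask maps DASK_-prefixed env vars onto config keys,
# with "__" standing in for each nesting level.
export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False
echo "$DASK_DISTRIBUTED__DIAGNOSTICS__NVML"  # prints: False
```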

@jrbourbeau mentioned this pull request on Jun 14, 2021

@jrbourbeau (Member) left a comment:

Thanks @pentschev!

@jrbourbeau jrbourbeau merged commit fced981 into dask:main Jun 17, 2021
@pentschev pentschev deleted the config-nvml branch June 30, 2021 12:23
Successfully merging this pull request may close these issues.

Simple local cluster on laptop fails with deactivated GPU
4 participants