Add configuration to enable/disable NVML diagnostics #4893
Notice that I introduced a new section
What do you think about placing this under the Dashboard section instead of creating a new section?
I thought about that, but I reached the conclusion that it could be more confusing, since the NVML module is under
That seems reasonable. @jrbourbeau, how do you feel about adding a new config section?
distributed/diagnostics/nvml.py
global nvmlInitialized, nvmlLibraryNotFound, nvmlOwnerPID
global nvmlEnabled, nvmlInitialized, nvmlLibraryNotFound, nvmlOwnerPID

nvmlEnabled = dask.config.get("distributed.diagnostics.nvml")
IIUC this nvmlEnabled variable appears to be equivalent to the distributed.diagnostics.nvml config value. If that's the case, I'd prefer to not introduce a new variable and instead just use the config value.
This is a good point, addressed in fed9e97.
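The suggestion in this thread can be sketched as follows: rather than copying the setting into a module-level global, the check reads the config value each time it is needed. This is only a minimal sketch, not the code from fed9e97. `CONFIG`, `config_get`, and `nvml_enabled` are hypothetical stand-ins so the example runs without Dask installed; in the real module the lookup would be `dask.config.get("distributed.diagnostics.nvml")`.

```python
from functools import reduce

# Hypothetical stand-in for the Dask config tree (not the real dask.config).
CONFIG = {"distributed": {"diagnostics": {"nvml": True}}}


def config_get(key, config=CONFIG):
    """Look up a dotted key in a nested dict, mimicking dask.config.get
    for the simple case of an existing key."""
    return reduce(lambda node, part: node[part], key.split("."), config)


def nvml_enabled():
    # Read the setting at the point of use instead of caching it in a
    # separate global such as nvmlEnabled.
    return bool(config_get("distributed.diagnostics.nvml"))
```

With this shape, a change to the config is picked up by the next call, and there is no second variable that can drift out of sync with the config value.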
diagnostics:
  nvml: True
Since this is GPU specific, we might want to put this option in its own section like rmm and ucx
distributed/distributed/distributed.yaml
Lines 228 to 237 in 0c15432
rmm:
  pool-size: null
ucx:
  cuda_copy: False  # enable cuda-copy
  tcp: False  # enable tcp
  nvlink: False  # enable cuda_ipc
  infiniband: False  # enable Infiniband
  rdmacm: False  # enable RDMACM
  net-devices: null  # define what interface to use for UCX comm
  reuse-endpoints: null  # enable endpoint reuse
I don't mind moving it there if you prefer, but the general feeling seems to be that placing those outside of the distributed schema was a mistake. For example, #4904 is attempting to move ucx into distributed to comply with the general config schema.
I think putting this under diagnostics makes sense.
Thanks @pentschev!
In some cases, users may want or need to explicitly disable NVML diagnostics; this PR adds support for that via Dask config.
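Besides editing the YAML file, Dask config values can also be set through environment variables, which use a `DASK_` prefix and a double underscore for each nesting level. The sketch below illustrates that mapping for the new key. `env_var_for` and `parse_env` are hypothetical helpers written for this example, not Dask functions, and the parsing is simplified relative to what Dask itself does.

```python
import ast


def env_var_for(key):
    """Build the environment variable name for a dotted config key,
    following Dask's DASK_ prefix and double-underscore nesting convention."""
    return "DASK_" + key.upper().replace(".", "__")


def parse_env(name, value):
    """Map an environment variable back to (dotted key, Python value).
    Simplified: literal values are evaluated, anything else stays a string."""
    key = name[len("DASK_"):].lower().replace("__", ".")
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        parsed = value
    return key, parsed


name = env_var_for("distributed.diagnostics.nvml")
print(name)
print(parse_env(name, "False"))
```

Under that convention, exporting the printed variable name with the value `False` before starting a worker would be one way to turn NVML diagnostics off without touching the YAML file.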