
Add configuration to enable/disable NVML diagnostics #4893

Merged
merged 5 commits into dask:main on Jun 17, 2021

Conversation

pentschev (Member)

In some cases, users may want or need to explicitly disable NVML diagnostics; this PR adds support for that via the Dask config.
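For reference, a minimal sketch of what disabling this in a Dask config file (e.g. `~/.config/dask/distributed.yaml`) might look like, assuming the key lands as `distributed.diagnostics.nvml` as in the diff discussed in this review:

```yaml
# Hypothetical user config; the key name follows the
# dask.config.get("distributed.diagnostics.nvml") call added in this PR.
distributed:
  diagnostics:
    nvml: false  # disable NVML diagnostics
```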

@pentschev (Member Author)

Notice that I introduced a new section, diagnostics, for lack of a better place to add nvml, but feel free to suggest a more suitable section or name to replace diagnostics.

@quasiben (Member) commented Jun 9, 2021

What do you think about placing this under the Dashboard section instead of creating a new section?

cc @jacobtomlinson

@pentschev (Member Author)

> What do you think about placing this under the Dashboard section instead of creating a new section?

I thought about that, but I reached the conclusion it could be more confusing, since the NVML module lives under distributed/diagnostics, while the distributed.dashboard configs apply to the separate distributed/dashboard directory.

@quasiben (Member) commented Jun 9, 2021

That seems reasonable. @jrbourbeau, how do you feel about adding a new config section?

```diff
-    global nvmlInitialized, nvmlLibraryNotFound, nvmlOwnerPID
+    global nvmlEnabled, nvmlInitialized, nvmlLibraryNotFound, nvmlOwnerPID
+
+    nvmlEnabled = dask.config.get("distributed.diagnostics.nvml")
```

IIUC, this nvmlEnabled variable appears to be equivalent to the distributed.diagnostics.nvml config value. If that's the case, I'd prefer not to introduce a new variable and instead just use the config value.

@pentschev (Member Author)

This is a good point, addressed in fed9e97.

Comment on lines +198 to +199
```yaml
diagnostics:
  nvml: True
```

Since this is GPU specific, we might want to put this option in its own section, like rmm and ucx:

```yaml
rmm:
  pool-size: null
ucx:
  cuda_copy: False  # enable cuda-copy
  tcp: False  # enable tcp
  nvlink: False  # enable cuda_ipc
  infiniband: False  # enable Infiniband
  rdmacm: False  # enable RDMACM
  net-devices: null  # define what interface to use for UCX comm
  reuse-endpoints: null  # enable endpoint reuse
```

@pentschev (Member Author)

I don't mind moving it there if you prefer, but the general feeling seems to be that it was a mistake to keep those outside of the distributed schema. For example, #4904 is attempting to move ucx into distributed to comply with the general config schema.

@jacobtomlinson (Member)

I think putting this under diagnostics makes sense.
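For users following along: Dask config keys can generally also be set through environment variables, where the `DASK_` prefix and double underscores map onto the nested key names. Assuming the final key is `distributed.diagnostics.nvml`, disabling it from the shell might look like:

```shell
# Dask maps DASK_-prefixed env vars onto config keys,
# with "__" standing in for each nesting level.
export DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False
echo "$DASK_DISTRIBUTED__DIAGNOSTICS__NVML"  # prints: False
```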

@jrbourbeau mentioned this pull request on Jun 14, 2021

@jrbourbeau (Member) left a comment:

Thanks @pentschev!

@jrbourbeau jrbourbeau merged commit fced981 into dask:main Jun 17, 2021
@pentschev pentschev deleted the config-nvml branch June 30, 2021 12:23
Successfully merging this pull request may close these issues.

Simple local cluster on laptop fails with deactivated GPU
4 participants