[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

guanxingithub · 2022-02-08T23:59:06Z

Description

This PR is filed to solve the issue reported in #20467 as well as #20863. For fixing this issue, an initial solution was provided in #20499 by @khaotik. This PR, with a fix from @DickJC123, adds more compilation robustness across drivers and both settings of the NVML_NO_UNVERSIONED_FUNC_DEFS cmake flag.

In this PR, we would like to merge these two solutions and find a general solution that avoids compilation errors no matter which signature of the nvmlDeviceGetComputeRunningProcesses() function is enabled in the code.

Checklist

Essentials

Dick Carter finished coding on this PR
All changes are fully tested by Guanxin Li
Code is well-documented

Changes

The change reduce the possibly versioned variant of nvmlProcessInfo_t* expected
as the 3rd arg of nvmlDeviceGetComputeRunningProcesses().

Comments

This change is a backward incompatible change, without this change, mxnet can't be compiled in cuda11 450.x driver

mxnet-bot · 2022-02-08T23:59:09Z

Hey @guanxingithub , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, miscellaneous, unix-cpu, windows-gpu, unix-gpu, windows-cpu, edge, website, sanity, centos-gpu, centos-cpu]

Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

ptrendx

Moving my approval from #20866

DickJC123

Based on the signature of the nvmlDeviceGetComputeRunningProcesses() function (as dictated by the driver include file and the NVML_NO_UNVERSIONED_FUNC_DEFS cmake flag), we want the local variable infos vector to have an element type compatible with the 3rd arg of that function. This PR uses template meta-programming to deduce that. LGTM.

Resolve the conflict with PR#20499

cfa06f3

mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Feb 8, 2022

This was referenced Feb 9, 2022

[BUGFIX] Make compile/use of nvmlDeviceGetComputeRunningProcesses() adapt to n… #20866

Closed

Compilation issue of NVML usage across all possible drivers #20863

Closed

guanxingithub changed the title ~~Resolve the conflict with PR#20499~~ [BUGFIX] Make compile/use of nvmlDeviceGetComputeRunningProcesses()#20866 Feb 9, 2022

mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 9, 2022

guanxingithub marked this pull request as draft February 10, 2022 16:42

Merge remote-tracking branch 'mxnet/master' into pr_conflict

fd48d48

DickJC123 requested review from ptrendx and szha February 10, 2022 17:23

guanxingithub marked this pull request as ready for review February 10, 2022 17:41

mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 10, 2022

ptrendx approved these changes Feb 10, 2022

View reviewed changes

guanxingithub changed the title ~~[BUGFIX] Make compile/use of nvmlDeviceGetComputeRunningProcesses()#20866~~ [BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887 Feb 10, 2022

guanxingithub changed the title ~~[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887~~ [BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() Feb 10, 2022

mseth10 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 10, 2022

DickJC123 self-requested a review February 10, 2022 23:24

DickJC123 approved these changes Feb 10, 2022

View reviewed changes

szha merged commit f03fb23 into apache:master Feb 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

guanxingithub commented Feb 8, 2022 •

edited

mxnet-bot commented Feb 8, 2022

ptrendx left a comment

DickJC123 left a comment

[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

Conversation

guanxingithub commented Feb 8, 2022 • edited

Description

Checklist

Essentials

Changes

Comments

mxnet-bot commented Feb 8, 2022

ptrendx left a comment

Choose a reason for hiding this comment

DickJC123 left a comment

Choose a reason for hiding this comment

guanxingithub commented Feb 8, 2022 •

edited