Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

Merged
merged 2 commits into from Feb 11, 2022

Conversation

guanxingithub
Copy link
Contributor

@guanxingithub guanxingithub commented Feb 8, 2022

Description

This PR is filed to solve the issue reported in #20467 as well as #20863. For fixing this issue, an initial solution was provided in #20499 by @khaotik. This PR, with a fix from @DickJC123, adds more compilation robustness across drivers and both settings of the NVML_NO_UNVERSIONED_FUNC_DEFS cmake flag.

In this PR, we would like to merge these two solutions and find a general solution that avoids compilation errors no matter which signature of the nvmlDeviceGetComputeRunningProcesses() function is enabled in the code.

Checklist

Essentials

Dick Carter finished coding on this PR
All changes are fully tested by Guanxin Li
Code is well-documented

Changes

The change reduce the possibly versioned variant of nvmlProcessInfo_t* expected
as the 3rd arg of nvmlDeviceGetComputeRunningProcesses().

Comments

This change is a backward incompatible change, without this change, mxnet can't be compiled in cuda11 450.x driver

@mxnet-bot
Copy link

Hey @guanxingithub , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, miscellaneous, unix-cpu, windows-gpu, unix-gpu, windows-cpu, edge, website, sanity, centos-gpu, centos-cpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Feb 8, 2022
@guanxingithub guanxingithub changed the title Resolve the conflict with PR#20499 [BUGFIX] Make compile/use of nvmlDeviceGetComputeRunningProcesses()#20866 Feb 9, 2022
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 9, 2022
@guanxingithub guanxingithub marked this pull request as draft February 10, 2022 16:42
@guanxingithub guanxingithub marked this pull request as ready for review February 10, 2022 17:41
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 10, 2022
Copy link
Member

@ptrendx ptrendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving my approval from #20866

@guanxingithub guanxingithub changed the title [BUGFIX] Make compile/use of nvmlDeviceGetComputeRunningProcesses()#20866 [BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887 Feb 10, 2022
@guanxingithub guanxingithub changed the title [BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887 [BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() Feb 10, 2022
@mseth10 mseth10 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 10, 2022
@DickJC123 DickJC123 self-requested a review February 10, 2022 23:24
Copy link
Contributor

@DickJC123 DickJC123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the signature of the nvmlDeviceGetComputeRunningProcesses() function (as dictated by the driver include file and the NVML_NO_UNVERSIONED_FUNC_DEFS cmake flag), we want the local variable infos vector to have an element type compatible with the 3rd arg of that function. This PR uses template meta-programming to deduce that. LGTM.

@szha szha merged commit f03fb23 into apache:master Feb 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants