Compilation issue of NVML usage across all possible drivers #20863

guanxingithub · 2022-01-31T17:54:14Z

Description

We would like to report a compilation issue on the master branch, related to use of NVIDIA’s NVML library. The source lines involved are: https://github.com/apache/incubator-mxnet/blob/master/src/profiler/storage_profiler.cc#L103-L111

These are the same lines that caused issue #20145, which was fixed by @Zha0q1 in PR #20146. The problem is that these source lines are still sensitive to the driver version and to the cmake build flag NVML_NO_UNVERSIONED_FUNC_DEFS.

Error Message

We found this issue when compiling MXNet master against the CUDA 11 450.x driver, where the build fails with:

FAILED: CMakeFiles/mxnet.dir/src/profiler/storage_profiler.cc.o
../src/profiler/storage_profiler.cc:109:78: error: cannot convert ‘nvmlProcessInfo_st*’ to ‘nvmlProcessInfo_v1_t*’ {aka ‘nvmlProcessInfo_v1_st*’}
109 | nvmlDeviceGetComputeRunningProcesses(nvml_device, &info_count, infos.data());

In file included from ../src/profiler/storage_profiler.cc:22:
/usr/local/cuda/include/nvml.h:8403:127: note: initializing argument 3 of ‘nvmlReturn_t nvmlDeviceGetComputeRunningProcesses(nvmlDevice_t, unsigned int*, nvmlProcessInfo_v1_t*)’
8403 | nvmlReturn_t DECLDIR nvmlDeviceGetComputeRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);

Steps to reproduce

1. Find a machine with the CUDA 11 450.x driver.
2. Compile MXNet.

What have you tried to solve it?

This issue was found and fixed by Dick Carter.
@DickJC123 has developed a general solution that avoids compilation errors no matter which signature of the nvmlDeviceGetComputeRunningProcesses() function is enabled in the code. We will submit this fix as a PR shortly.
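
For readers curious how code can adapt to whichever signature is enabled, here is a minimal sketch of one possible approach: deduce the info struct type from the function's own third parameter instead of naming it. This is not necessarily what the PR does. The `Fake*` structs and `fakeGetComputeRunningProcesses()` are hypothetical stand-ins so the example compiles without nvml.h, and `ThirdArgPointee` and `countRunningProcesses()` are illustrative names only.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the two NVML process-info layouts (not the real nvml.h types).
struct FakeProcessInfoV1 { unsigned int pid; unsigned long long usedGpuMemory; };
struct FakeProcessInfoV2 { unsigned int pid; unsigned long long usedGpuMemory;
                           unsigned int gpuInstanceId; unsigned int computeInstanceId; };

// Hypothetical stand-in for nvmlDeviceGetComputeRunningProcesses().
// Here the header "selected" the v1 struct; the caller below never names it.
int fakeGetComputeRunningProcesses(int /*device*/, unsigned int* count, FakeProcessInfoV1* infos) {
  const unsigned int running = 2;          // pretend two processes are running
  if (*count < running) return 1;          // caller's buffer is too small
  for (unsigned int i = 0; i < running; ++i) infos[i] = FakeProcessInfoV1{100u + i, 0ull};
  *count = running;
  return 0;                                // success
}

// Trait that extracts T from a function pointer of the form R (*)(A1, A2, T*).
template <typename F> struct ThirdArgPointee;
template <typename R, typename A1, typename A2, typename T>
struct ThirdArgPointee<R (*)(A1, A2, T*)> { using type = T; };

// The struct type follows whatever the declaration in the header asks for.
using ProcessInfo = ThirdArgPointee<decltype(&fakeGetComputeRunningProcesses)>::type;

// Caller that compiles unchanged if the stand-in is switched to FakeProcessInfoV2.
unsigned int countRunningProcesses(int device) {
  unsigned int count = 8;                  // capacity of the buffer passed in
  std::vector<ProcessInfo> infos(count);
  int rc = fakeGetComputeRunningProcesses(device, &count, infos.data());
  assert(rc == 0 && count <= infos.size());
  return count;
}
```

Changing the third parameter of the stand-in to `FakeProcessInfoV2*` changes `ProcessInfo` with it, so the call site needs no edits. Whether the same trick works unchanged against the real header (for example, with its `DECLDIR` decoration) is an assumption that would need checking.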

github-actions · 2022-01-31T17:54:52Z

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly.
If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

TristonC · 2022-02-01T19:46:31Z

@ptrendx Please help to review.

guanxingithub · 2022-02-01T20:51:58Z

A PR was filed: #20866

guanxingithub · 2022-02-09T17:15:37Z

A PR merging the solutions of #20499 and #20866 was filed as #20887

guanxingithub · 2022-02-25T01:40:51Z

This issue was fixed by PR #20887, which has been merged.

guanxingithub added Bug needs triage labels Jan 31, 2022

guanxingithub mentioned this issue Feb 1, 2022

[BUGFIX] Make compile/use of nvmlDeviceGetComputeRunningProcesses() adapt to n… #20866

Closed

guanxingithub mentioned this issue Feb 9, 2022

[BUGFIX] Improve compile/use of nvmlDeviceGetComputeRunningProcesses() #20887

Merged

guanxingithub closed this as completed Feb 25, 2022
