
feat(core/libnvml): add compatibility layers for NVML Python bindings #30

Merged
merged 8 commits from the backward-compatibility branch on Oct 17, 2022

Conversation

XuehaiPan
Owner

@XuehaiPan XuehaiPan commented Jul 24, 2022

Issue Type

  • Improvement/feature implementation

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: GNOME Terminal 3.36.2
  • Python version: 3.9.13
  • NVML version (driver version): 470.129.06
  • nvitop version or commit: v0.7.1
  • python-ml-py version: 11.450.51
  • Locale: en_US.UTF-8

Description

Automatically patch the pynvml module when the first call to a versioned API fails. We now support a broader range of versions of the PyPI dependency nvidia-ml-py.
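
A minimal sketch of the fallback idea (a hypothetical helper for illustration only; the actual implementation patches the NVML function pointers inside pynvml rather than wrapping each call):

import pynvml

def get_compute_running_processes(handle):
    # Try the newest versioned binding first and fall back to older
    # revisions when the loaded driver does not export the symbol.
    for name in (
        'nvmlDeviceGetComputeRunningProcesses_v3',
        'nvmlDeviceGetComputeRunningProcesses_v2',
        'nvmlDeviceGetComputeRunningProcesses',
    ):
        func = getattr(pynvml, name, None)
        if func is None:
            continue  # this nvidia-ml-py version has no such Python wrapper
        try:
            return func(handle)
        except pynvml.NVMLError_FunctionNotFound:
            continue  # symbol missing from the driver; try an older revision
    raise pynvml.NVMLError_FunctionNotFound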

Motivation and Context

See #29 for more details.

Resolves #29
Closes #13

Testing

Using nvidia-ml-py == 11.515.48 with the NVIDIA R430 driver (CUDA 10.x):

$ pip3 install --ignore-installed .
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /home/panxuehai/Projects/nvitop
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting psutil>=5.6.6
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/62/1f/f14225bda76417ab9bd808ff21d5cd59d5435a9796ca09b34d4cb0edcd88/psutil-5.9.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 kB)
Collecting cachetools>=1.0.1
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/68/aa/5fc646cae6e997c3adf3b0a7e257cda75cff21fcba15354dffd67789b7bb/cachetools-5.2.0-py3-none-any.whl (9.3 kB)
Collecting nvidia-ml-py<11.516.0a0,>=11.450.51
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/7c/b6/738d9c68f8abcdedf8901c4abf00df74e8f281626de67b5185dcc443e693/nvidia_ml_py-11.515.48-py3-none-any.whl (28 kB)
Collecting termcolor>=1.0.0
  Using cached termcolor-1.1.0-py3-none-any.whl
Building wheels for collected packages: nvitop
  Building wheel for nvitop (pyproject.toml) ... done
  Created wheel for nvitop: filename=nvitop-0.7.1+6.g0feed99-py3-none-any.whl size=154871 sha256=da07a27d8579e1cc38a3bd3d537f0d885d592df0c3293ba585b831fa236f100e
  Stored in directory: /tmp/pip-ephem-wheel-cache-3qzopv_e/wheels/9a/17/84/86d7a108dc1c0d7a25e96628d476e19df73a27353725b35779
Successfully built nvitop
Installing collected packages: termcolor, nvidia-ml-py, psutil, cachetools, nvitop
Successfully installed cachetools-5.2.0 nvidia-ml-py-11.515.48 nvitop-0.7.1+6.g84f43f5 psutil-5.9.1 termcolor-1.1.0

Result:

The v3 API nvmlDeviceGetComputeRunningProcesses_v3 falls back to the v2 API nvmlDeviceGetComputeRunningProcesses_v2 (which cannot be found either), which in turn falls back to the v1 API nvmlDeviceGetComputeRunningProcesses.

$ LOGLEVEL=DEBUG ./nvitop.py -1
Patching NVML function pointer `nvmlDeviceGetComputeRunningProcesses_v3`
    Map NVML function `nvmlDeviceGetComputeRunningProcesses_v3` to `nvmlDeviceGetComputeRunningProcesses_v2`
    Map NVML function `nvmlDeviceGetGraphicsRunningProcesses_v3` to `nvmlDeviceGetGraphicsRunningProcesses_v2`
    Map NVML function `nvmlDeviceGetMPSComputeRunningProcesses_v3` to `nvmlDeviceGetMPSComputeRunningProcesses_v2`
    Patch NVML struct `c_nvmlProcessInfo_t` to `c_nvmlProcessInfo_v2_t`
Patching NVML function pointer `nvmlDeviceGetComputeRunningProcesses_v2`
    Map NVML function `nvmlDeviceGetComputeRunningProcesses_v2` to `nvmlDeviceGetComputeRunningProcesses`
    Map NVML function `nvmlDeviceGetGraphicsRunningProcesses_v2` to `nvmlDeviceGetGraphicsRunningProcesses`
    Map NVML function `nvmlDeviceGetMPSComputeRunningProcesses_v2` to `nvmlDeviceGetMPSComputeRunningProcesses`
    Map NVML function `nvmlDeviceGetComputeRunningProcesses_v3` to `nvmlDeviceGetComputeRunningProcesses`
    Map NVML function `nvmlDeviceGetGraphicsRunningProcesses_v3` to `nvmlDeviceGetGraphicsRunningProcesses`
    Map NVML function `nvmlDeviceGetMPSComputeRunningProcesses_v3` to `nvmlDeviceGetMPSComputeRunningProcesses`
    Patch NVML struct `c_nvmlProcessInfo_t` to `c_nvmlProcessInfo_v1_t`
Sun Jul 24 19:32:24 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════════════════════╕
│   0  TITAN Xp            Off  │ 00000000:05:00.0 Off │                  N/A │ MEM: ▏ 0.2%                                                                      │
│ 24%   43C    P8    19W / 250W │     19MiB / 12194MiB │      0%      Default │ UTL: ▏ 0%                                                                        │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│   1  TITAN Xp            Off  │ 00000000:06:00.0 Off │                  N/A │ MEM: ▏ 0.0%                                                                      │
│ 23%   36C    P8    10W / 250W │      2MiB / 12196MiB │      0%      Default │ UTL: ▏ 0%                                                                        │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│   2  ..orce GTX TITAN X  Off  │ 00000000:09:00.0 Off │                  N/A │ MEM: ▏ 0.0%                                                                      │
│ 22%   34C    P8    17W / 250W │      2MiB / 12213MiB │      0%      Default │ UTL: ▏ 0%                                                                        │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════════════════════╛
[ CPU: █████▉ 5.3%                                                                                                          ]  ( Load Average:  0.89  0.61  0.39 )
[ MEM: ███▋ 3.2%                                                                                                            ]  [ SWP: ▏ 0.0%                     ]

╒════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                                                    panxuehai@ubuntu │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM      TIME  COMMAND                                                                                              │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2122 G    root    17MiB   0   0.0   0.0  4.2 days  /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch │
╘════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
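
The `Map NVML function ... to ...` lines in the debug log above can be reproduced with a helper along these lines (a sketch that relies on pynvml internals, `_nvmlGetFunctionPointer` and its cache, so treat it as illustrative rather than the exact implementation):

import pynvml

def remap_nvml_function(dst: str, src: str) -> None:
    # Resolve the C symbol `src` from the driver library and register it
    # under the name `dst`, so that later lookups of `dst` succeed.
    pynvml._nvmlGetFunctionPointer_cache[dst] = pynvml._nvmlGetFunctionPointer(src)

# E.g., serve v3 requests with the v1 symbol on old R430 drivers:
remap_nvml_function('nvmlDeviceGetComputeRunningProcesses_v3',
                    'nvmlDeviceGetComputeRunningProcesses')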

@XuehaiPan XuehaiPan self-assigned this Jul 24, 2022
@XuehaiPan XuehaiPan added this to the v1.0.0 milestone Jul 24, 2022
@XuehaiPan XuehaiPan added enhancement New feature or request pynvml Something related to the `nvidia-ml-py` package labels Jul 24, 2022
@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 4 times, most recently from d513134 to 4f841fd Compare July 24, 2022 12:19
@XuehaiPan
Owner Author

XuehaiPan commented Jul 24, 2022

Not sure when the v2 version of nvmlDeviceGetMemoryInfo will become GA. It seems to be experimental only on the NVIDIA R510 driver.

In nvidia-ml-py == 11.515.48:

class c_nvmlMemory_t(_PrintableStructure):
    _fields_ = [
        ('total', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

class c_nvmlMemory_v2_t(_PrintableStructure):
    _fields_ = [
        ('version', c_uint),
        ('total', c_ulonglong),
        ('reserved', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

nvmlMemory_v2 = 0x02000028
def nvmlDeviceGetMemoryInfo(handle, version=None):
    if not version:
        c_memory = c_nvmlMemory_t()
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    else:
        c_memory = c_nvmlMemory_v2_t()
        c_memory.version = version
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    ret = fn(handle, byref(c_memory))
    _nvmlCheckReturn(ret)
    return c_memory

See also:

I got NVML_ERROR_UNKNOWN (not even NVML_ERROR_NOT_SUPPORTED or NVML_ERROR_INVALID_ARGUMENT) when calling the v2 memory info API inside WSL (the function symbol can be found with my 516.59 NVIDIA driver on Windows).

In [1]: import cupy as cp

In [2]: x = cp.zeros((1,))

In [3]: from nvitop import *

In [4]: d = Device(0)

In [5]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle))
Out[5]: 'c_nvmlMemory_t(total: 8589934592 B, free: 4304117760 B, used: 4285816832 B)'

In [6]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle, version=2))
Traceback (most recent call last):
  <ipython-input-6-c179ba6931d1>:1 in <cell line: 1>

  /home/PanXuehai/Projects/nvitop/venv/lib/python3.9/site-packages/pynvml.py:2301 in nvmlDeviceGetMemoryInfo

    2298 │   │   c_memory.version = version
    2299 │   │   fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    2300 │   ret = fn(handle, byref(c_memory))
  ❱ 2301 │   _nvmlCheckReturn(ret)
    2302 │   return c_memory
    2303
    2304 def nvmlDeviceGetBAR1MemoryInfo(handle):

  /home/PanXuehai/Projects/nvitop/venv/lib/python3.9/site-packages/pynvml.py:795 in _nvmlCheckReturn

     792
     793 def _nvmlCheckReturn(ret):
     794 │   if (ret != NVML_SUCCESS):
  ❱  795 │   │   raise NVMLError(ret)
     796 │   return ret
     797
     798 ## Function access ##
NVMLError_Unknown: Unknown Error

In [7]: Device.driver_version()
Out[7]: '516.59'

Same error on the Windows host:

In [1]: import cupy as cp

In [2]: x = cp.zeros((1,))

In [3]: from nvitop import *

In [4]: d = Device(0)

In [5]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle))
Out[5]: 'c_nvmlMemory_t(total: 8589934592 B, free: 4412493824 B, used: 4177440768 B)'

In [6]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle, version=2))
Traceback (most recent call last):
  <ipython-input-6-c179ba6931d1>:1 in <cell line: 1>

  C:\Tools\Python3\lib\site-packages\pynvml.py:2301 in nvmlDeviceGetMemoryInfo

    2298 │   │   c_memory.version = version
    2299 │   │   fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    2300 │   ret = fn(handle, byref(c_memory))
  > 2301 │   _nvmlCheckReturn(ret)
    2302 │   return c_memory
    2303
    2304 def nvmlDeviceGetBAR1MemoryInfo(handle):

  C:\Tools\Python3\lib\site-packages\pynvml.py:795 in _nvmlCheckReturn

     792
     793 def _nvmlCheckReturn(ret):
     794 │   if (ret != NVML_SUCCESS):
  >  795 │   │   raise NVMLError(ret)
     796 │   return ret
     797
     798 ## Function access ##
NVMLError_Unknown: Unknown Error

@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 5 times, most recently from d9d08d0 to 4be5f8e Compare July 25, 2022 02:24
@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 3 times, most recently from 1aa80e7 to 01133f8 Compare July 25, 2022 14:54
@XuehaiPan XuehaiPan marked this pull request as ready for review July 25, 2022 16:26
@XuehaiPan
Owner Author

Waiting for a new driver release with the v2 memory info API.

@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 5 times, most recently from 2155404 to 5ce55ec Compare August 9, 2022 15:26
@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 2 times, most recently from e1c2584 to ea5aa1d Compare August 23, 2022 11:51
@XuehaiPan XuehaiPan added the api Something related to the core APIs label Sep 7, 2022
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
@XuehaiPan
Owner Author

I upgraded the NVIDIA driver to 520.56.06 but still get Function Not Found errors for the nvmlDeviceGetMemoryInfo_v2 function.

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
@XuehaiPan
Copy link
Owner Author

Update: The correct API call for the nvmlDeviceGetMemoryInfo_v2 function is:

pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)

rather than

pynvml.nvmlDeviceGetMemoryInfo(handle, version=2)

where

pynvml.nvmlMemory_v2 = 0x02000028 = 33554472 = ctypes.sizeof(pynvml.c_nvmlMemory_v2_t) | (2 << 24)
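
A quick check of that encoding, plus a call that should succeed on drivers shipping the v2 symbol (R510 or later; the 40-byte struct size assumes a 64-bit platform):

import ctypes
import pynvml

# The low 24 bits encode the struct size; the upper bits encode the API version.
size = ctypes.sizeof(pynvml.c_nvmlMemory_v2_t)   # 40 on 64-bit platforms
assert pynvml.nvmlMemory_v2 == (2 << 24) | size  # 0x02000028 == 33554472

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
memory = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
print(memory.total, memory.reserved, memory.free, memory.used)  # in bytes
pynvml.nvmlShutdown()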

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
…e for memory info version 2 APIs

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>