
feat(core/libnvml): add compatibility layers for NVML Python bindings #30

Merged
merged 8 commits from the backward-compatibility branch on Oct 17, 2022

Conversation

XuehaiPan
Owner

@XuehaiPan XuehaiPan commented Jul 24, 2022

Issue Type

  • Improvement/feature implementation

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: GNOME Terminal 3.36.2
  • Python version: 3.9.13
  • NVML version (driver version): 470.129.06
  • nvitop version or commit: v0.7.1
  • python-ml-py version: 11.450.51
  • Locale: en_US.UTF-8

Description

Automatically patch the pynvml module when the first call to a versioned API fails. We now support a broader range of versions of the PyPI dependency nvidia-ml-py.
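
A minimal sketch of the fallback idea (a hypothetical helper for illustration only; the actual implementation patches the NVML function pointers inside pynvml rather than wrapping each call):

import pynvml

def get_compute_running_processes(handle):
    # Try the newest versioned binding first and fall back to older
    # revisions when the loaded driver does not export the symbol.
    for name in (
        'nvmlDeviceGetComputeRunningProcesses_v3',
        'nvmlDeviceGetComputeRunningProcesses_v2',
        'nvmlDeviceGetComputeRunningProcesses',
    ):
        func = getattr(pynvml, name, None)
        if func is None:
            continue  # this nvidia-ml-py version has no such Python wrapper
        try:
            return func(handle)
        except pynvml.NVMLError_FunctionNotFound:
            continue  # symbol missing from the driver; try an older revision
    raise pynvml.NVMLError_FunctionNotFound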

Motivation and Context

See #29 for more details.

Resolves #29
Closes #13

Testing

Using nvidia-ml-py == 11.515.48 with the NVIDIA R430 driver (CUDA 10.x):

$ pip3 install --ignore-installed .
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /home/panxuehai/Projects/nvitop
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting psutil>=5.6.6
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/62/1f/f14225bda76417ab9bd808ff21d5cd59d5435a9796ca09b34d4cb0edcd88/psutil-5.9.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (281 kB)
Collecting cachetools>=1.0.1
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/68/aa/5fc646cae6e997c3adf3b0a7e257cda75cff21fcba15354dffd67789b7bb/cachetools-5.2.0-py3-none-any.whl (9.3 kB)
Collecting nvidia-ml-py<11.516.0a0,>=11.450.51
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/7c/b6/738d9c68f8abcdedf8901c4abf00df74e8f281626de67b5185dcc443e693/nvidia_ml_py-11.515.48-py3-none-any.whl (28 kB)
Collecting termcolor>=1.0.0
  Using cached termcolor-1.1.0-py3-none-any.whl
Building wheels for collected packages: nvitop
  Building wheel for nvitop (pyproject.toml) ... done
  Created wheel for nvitop: filename=nvitop-0.7.1+6.g0feed99-py3-none-any.whl size=154871 sha256=da07a27d8579e1cc38a3bd3d537f0d885d592df0c3293ba585b831fa236f100e
  Stored in directory: /tmp/pip-ephem-wheel-cache-3qzopv_e/wheels/9a/17/84/86d7a108dc1c0d7a25e96628d476e19df73a27353725b35779
Successfully built nvitop
Installing collected packages: termcolor, nvidia-ml-py, psutil, cachetools, nvitop
Successfully installed cachetools-5.2.0 nvidia-ml-py-11.515.48 nvitop-0.7.1+6.g84f43f5 psutil-5.9.1 termcolor-1.1.0

Result:

The v3 API nvmlDeviceGetComputeRunningProcesses_v3 falls back to the v2 API nvmlDeviceGetComputeRunningProcesses_v2 (which cannot be found either), which in turn falls back to the v1 API nvmlDeviceGetComputeRunningProcesses.

$ LOGLEVEL=DEBUG ./nvitop.py -1
Patching NVML function pointer `nvmlDeviceGetComputeRunningProcesses_v3`
    Map NVML function `nvmlDeviceGetComputeRunningProcesses_v3` to `nvmlDeviceGetComputeRunningProcesses_v2`
    Map NVML function `nvmlDeviceGetGraphicsRunningProcesses_v3` to `nvmlDeviceGetGraphicsRunningProcesses_v2`
    Map NVML function `nvmlDeviceGetMPSComputeRunningProcesses_v3` to `nvmlDeviceGetMPSComputeRunningProcesses_v2`
    Patch NVML struct `c_nvmlProcessInfo_t` to `c_nvmlProcessInfo_v2_t`
Patching NVML function pointer `nvmlDeviceGetComputeRunningProcesses_v2`
    Map NVML function `nvmlDeviceGetComputeRunningProcesses_v2` to `nvmlDeviceGetComputeRunningProcesses`
    Map NVML function `nvmlDeviceGetGraphicsRunningProcesses_v2` to `nvmlDeviceGetGraphicsRunningProcesses`
    Map NVML function `nvmlDeviceGetMPSComputeRunningProcesses_v2` to `nvmlDeviceGetMPSComputeRunningProcesses`
    Map NVML function `nvmlDeviceGetComputeRunningProcesses_v3` to `nvmlDeviceGetComputeRunningProcesses`
    Map NVML function `nvmlDeviceGetGraphicsRunningProcesses_v3` to `nvmlDeviceGetGraphicsRunningProcesses`
    Map NVML function `nvmlDeviceGetMPSComputeRunningProcesses_v3` to `nvmlDeviceGetMPSComputeRunningProcesses`
    Patch NVML struct `c_nvmlProcessInfo_t` to `c_nvmlProcessInfo_v1_t`
Sun Jul 24 19:32:24 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════════════════════╕
│   0  TITAN Xp            Off  │ 00000000:05:00.0 Off │                  N/A │ MEM: ▏ 0.2%                                                                      │
│ 24%   43C    P8    19W / 250W │     19MiB / 12194MiB │      0%      Default │ UTL: ▏ 0%                                                                        │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│   1  TITAN Xp            Off  │ 00000000:06:00.0 Off │                  N/A │ MEM: ▏ 0.0%                                                                      │
│ 23%   36C    P8    10W / 250W │      2MiB / 12196MiB │      0%      Default │ UTL: ▏ 0%                                                                        │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│   2  ..orce GTX TITAN X  Off  │ 00000000:09:00.0 Off │                  N/A │ MEM: ▏ 0.0%                                                                      │
│ 22%   34C    P8    17W / 250W │      2MiB / 12213MiB │      0%      Default │ UTL: ▏ 0%                                                                        │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════════════════════╛
[ CPU: █████▉ 5.3%                                                                                                          ]  ( Load Average:  0.89  0.61  0.39 )
[ MEM: ███▋ 3.2%                                                                                                            ]  [ SWP: ▏ 0.0%                     ]

╒════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                                                    panxuehai@ubuntu │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM      TIME  COMMAND                                                                                              │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2122 G    root    17MiB   0   0.0   0.0  4.2 days  /usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch │
╘════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
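
The `Map NVML function ... to ...` lines in the debug log above can be reproduced with a helper along these lines (a sketch that relies on pynvml internals, `_nvmlGetFunctionPointer` and its cache, so treat it as illustrative rather than the exact implementation):

import pynvml

def remap_nvml_function(dst: str, src: str) -> None:
    # Resolve the C symbol `src` from the driver library and register it
    # under the name `dst`, so that later lookups of `dst` succeed.
    pynvml._nvmlGetFunctionPointer_cache[dst] = pynvml._nvmlGetFunctionPointer(src)

# E.g., serve v3 requests with the v1 symbol on old R430 drivers:
remap_nvml_function('nvmlDeviceGetComputeRunningProcesses_v3',
                    'nvmlDeviceGetComputeRunningProcesses')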

@XuehaiPan XuehaiPan self-assigned this Jul 24, 2022
@XuehaiPan XuehaiPan added this to the v1.0.0 milestone Jul 24, 2022
@XuehaiPan XuehaiPan added enhancement New feature or request pynvml Something related to the `nvidia-ml-py` package labels Jul 24, 2022
@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 4 times, most recently from d513134 to 4f841fd Compare July 24, 2022 12:19
@XuehaiPan
Owner Author

XuehaiPan commented Jul 24, 2022

Not sure when the v2 version of nvmlDeviceGetMemoryInfo will become GA. It seems to be experimental only on the NVIDIA R510 driver.

In nvidia-ml-py == 11.515.48:

class c_nvmlMemory_t(_PrintableStructure):
    _fields_ = [
        ('total', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

class c_nvmlMemory_v2_t(_PrintableStructure):
    _fields_ = [
        ('version', c_uint),
        ('total', c_ulonglong),
        ('reserved', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

nvmlMemory_v2 = 0x02000028
def nvmlDeviceGetMemoryInfo(handle, version=None):
    if not version:
        c_memory = c_nvmlMemory_t()
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    else:
        c_memory = c_nvmlMemory_v2_t()
        c_memory.version = version
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    ret = fn(handle, byref(c_memory))
    _nvmlCheckReturn(ret)
    return c_memory

See also:

I got NVML_ERROR_UNKNOWN (not even NVML_ERROR_NOT_SUPPORTED or NVML_ERROR_INVALID_ARGUMENT) when calling the v2 memory info API inside WSL (the function symbol can be found with my 516.59 NVIDIA driver on Windows).

In [1]: import cupy as cp

In [2]: x = cp.zeros((1,))

In [3]: from nvitop import *

In [4]: d = Device(0)

In [5]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle))
Out[5]: 'c_nvmlMemory_t(total: 8589934592 B, free: 4304117760 B, used: 4285816832 B)'

In [6]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle, version=2))
Traceback (most recent call last):
  <ipython-input-6-c179ba6931d1>:1 in <cell line: 1>

  /home/PanXuehai/Projects/nvitop/venv/lib/python3.9/site-packages/pynvml.py:2301 in nvmlDeviceGetMemoryInfo

    2298 │   │   c_memory.version = version
    2299 │   │   fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    2300 │   ret = fn(handle, byref(c_memory))
  ❱ 2301 │   _nvmlCheckReturn(ret)
    2302 │   return c_memory
    2303
    2304 def nvmlDeviceGetBAR1MemoryInfo(handle):

  /home/PanXuehai/Projects/nvitop/venv/lib/python3.9/site-packages/pynvml.py:795 in _nvmlCheckReturn

     792
     793 def _nvmlCheckReturn(ret):
     794 │   if (ret != NVML_SUCCESS):
  ❱  795 │   │   raise NVMLError(ret)
     796 │   return ret
     797
     798 ## Function access ##
NVMLError_Unknown: Unknown Error

In [7]: Device.driver_version()
Out[7]: '516.59'

Same error on the Windows host:

In [1]: import cupy as cp

In [2]: x = cp.zeros((1,))

In [3]: from nvitop import *

In [4]: d = Device(0)

In [5]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle))
Out[5]: 'c_nvmlMemory_t(total: 8589934592 B, free: 4412493824 B, used: 4177440768 B)'

In [6]: str(libnvml.nvmlDeviceGetMemoryInfo(d.handle, version=2))
Traceback (most recent call last):
  <ipython-input-6-c179ba6931d1>:1 in <cell line: 1>

  C:\Tools\Python3\lib\site-packages\pynvml.py:2301 in nvmlDeviceGetMemoryInfo

    2298 │   │   c_memory.version = version
    2299 │   │   fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    2300 │   ret = fn(handle, byref(c_memory))
  > 2301 │   _nvmlCheckReturn(ret)
    2302 │   return c_memory
    2303
    2304 def nvmlDeviceGetBAR1MemoryInfo(handle):

  C:\Tools\Python3\lib\site-packages\pynvml.py:795 in _nvmlCheckReturn

     792
     793 def _nvmlCheckReturn(ret):
     794 │   if (ret != NVML_SUCCESS):
  >  795 │   │   raise NVMLError(ret)
     796 │   return ret
     797
     798 ## Function access ##
NVMLError_Unknown: Unknown Error

@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 5 times, most recently from d9d08d0 to 4be5f8e Compare July 25, 2022 02:24
@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 3 times, most recently from 1aa80e7 to 01133f8 Compare July 25, 2022 14:54
@XuehaiPan XuehaiPan marked this pull request as ready for review July 25, 2022 16:26
@XuehaiPan
Owner Author

Waiting for a new driver release with the v2 memory info API.

@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 5 times, most recently from 2155404 to 5ce55ec Compare August 9, 2022 15:26
@XuehaiPan XuehaiPan force-pushed the backward-compatibility branch 2 times, most recently from e1c2584 to ea5aa1d Compare August 23, 2022 11:51
@XuehaiPan XuehaiPan added the api Something related to the core APIs label Sep 7, 2022
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
@XuehaiPan
Owner Author

I upgraded the NVIDIA driver to 520.56.06 but still get Function Not Found errors for the nvmlDeviceGetMemoryInfo_v2 function.

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
@XuehaiPan
Copy link
Owner Author

Update: The correct API call for the nvmlDeviceGetMemoryInfo_v2 function is:

pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)

rather than

pynvml.nvmlDeviceGetMemoryInfo(handle, version=2)

where

pynvml.nvmlMemory_v2 = 0x02000028 = 33554472 = ctypes.sizeof(pynvml.c_nvmlMemory_v2_t) | (2 << 24)
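
A quick check of that encoding, plus a call that should succeed on drivers shipping the v2 symbol (R510 or later; the 40-byte struct size assumes a 64-bit platform):

import ctypes
import pynvml

# The low 24 bits encode the struct size; the upper bits encode the API version.
size = ctypes.sizeof(pynvml.c_nvmlMemory_v2_t)   # 40 on 64-bit platforms
assert pynvml.nvmlMemory_v2 == (2 << 24) | size  # 0x02000028 == 33554472

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
memory = pynvml.nvmlDeviceGetMemoryInfo(handle, version=pynvml.nvmlMemory_v2)
print(memory.total, memory.reserved, memory.free, memory.used)  # in bytes
pynvml.nvmlShutdown()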

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
…e for memory info version 2 APIs

Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Signed-off-by: Xuehai Pan <XuehaiPan@pku.edu.cn>