[Enhancement] Skip error gpus and show normal infos automatically #45

Closed
jue-jue-zi opened this issue Oct 22, 2022 · 6 comments
Labels: bug, enhancement

@jue-jue-zi

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: SSH
  • Python version: 3.8.10
  • NVML version (driver version): 515.65.01
  • nvitop version or commit: 0.10.0
  • nvidia-ml-py version: 11.515.75

Current Behavior

There are four GPUs on our server. One of them overheated for some reason and can no longer be recognized. Running nvidia-smi without any arguments to query all GPUs prints the error "Unable to determine the device handle for GPU 0000:0C:00.0: Unknown Error" and shows nothing for the remaining healthy GPUs. If the command selects only the healthy GPUs (nvidia-smi -i 0,1,3), their information is shown as usual.

image

image

When I run the nvitop command to show the GPU information, nvidia-ml-py throws exceptions like the ones below:

image

image

Expected Behavior

I hope that nvitop can automatically skip all GPUs with errors and still show the information for the healthy GPUs. If possible, the failed GPUs could be listed as a note below the normal output, in red for emphasis.
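
The requested behavior can be sketched as a probe loop that tolerates per-device failures. This is a minimal sketch, not nvitop's actual fix: `probe_devices` and its `get_handle` parameter are hypothetical names introduced here, and the NVML wiring assumes that the `pynvml` module from nvidia-ml-py is installed and that the device count still includes the failed GPU.

```python
from typing import Any, Callable, List, Tuple


def probe_devices(
    count: int,
    get_handle: Callable[[int], Any],
) -> Tuple[List[Tuple[int, Any]], List[Tuple[int, str]]]:
    """Split device indices into healthy handles and per-device errors.

    `get_handle` is any callable that returns a handle for an index or
    raises on failure (hypothetical interface, for illustration only).
    """
    healthy: List[Tuple[int, Any]] = []
    failed: List[Tuple[int, str]] = []
    for index in range(count):
        try:
            healthy.append((index, get_handle(index)))
        except Exception as exc:  # one bad GPU must not abort the whole listing
            failed.append((index, str(exc)))
    return healthy, failed


def probe_with_nvml():
    """Wire the probe to NVML; assumes nvidia-ml-py (`pynvml`) is installed."""
    import pynvml

    pynvml.nvmlInit()  # the caller should call pynvml.nvmlShutdown() when done
    return probe_devices(
        pynvml.nvmlDeviceGetCount(),
        pynvml.nvmlDeviceGetHandleByIndex,
    )
```

With this split, the healthy list can be rendered as usual while the failed list is kept for a short note at the bottom of the output.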

@XuehaiPan
Owner

@jue-jue-zi Thanks for the feedback! I'll add a quick fix soon.

XuehaiPan self-assigned this on Oct 22, 2022
XuehaiPan added the bug and enhancement labels on Oct 22, 2022
@XuehaiPan
Owner

@jue-jue-zi I pushed a new commit to handle this. You can reinstall nvitop from GitHub by:

pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop

@jue-jue-zi
Author

jue-jue-zi commented Oct 22, 2022

> @jue-jue-zi I pushed a new commit to handle this. You can reinstall nvitop from GitHub by:
>
> pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop

Thanks for fixing it so quickly, but there still seems to be a problem:

Traceback (most recent call last):
  File "/usr/local/bin/nvitop", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvitop/cli.py", line 336, in main
    ui = UI(
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/ui.py", line 43, in __init__
    self.main_screen = MainScreen(
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/__init__.py", line 38, in __init__
    self.device_panel = DevicePanel(self.devices, compact, win=win, root=root)
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 61, in __init__
    self.snapshots = self.take_snapshots()
  File "/usr/local/lib/python3.8/dist-packages/cachetools/func.py", line 62, in wrapper
    v = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 129, in take_snapshots
    snapshots = [device.as_snapshot() for device in self.all_devices]
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 129, in <listcomp>
    snapshots = [device.as_snapshot() for device in self.all_devices]
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/library/device.py", line 70, in as_snapshot
    self._snapshot = super().as_snapshot()
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 1667, in as_snapshot
    **{key: getattr(self, key)() for key in self.SNAPSHOT_KEYS},
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 1667, in <dictcomp>
    **{key: getattr(self, key)() for key in self.SNAPSHOT_KEYS},
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 878, in memory_used
    return self.memory_info().used
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/utils.py", line 702, in wrapped
    ret = self._cache[method]  # pylint: disable=protected-access
TypeError: 'function' object is not subscriptable
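
The last frame fails on `self._cache[method]`, which means that whatever `_cache` resolved to at that point was a plain function rather than a mapping. The thread does not say how that happened inside nvitop, so the following is only a generic reproduction of this class of error. The `Device` class and `describe_lookup` helper are hypothetical and unrelated to nvitop's code.

```python
class Device:
    """Hypothetical class for illustration; not nvitop's implementation."""

    def __init__(self, cache):
        # The lookup below only works when `cache` is a mapping.
        self._cache = cache

    def lookup(self, key):
        return self._cache[key]


def describe_lookup(cache, key):
    """Return the looked-up value, or the TypeError message on failure."""
    try:
        return Device(cache).lookup(key)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A dict works; a function in the same slot fails like the traceback above.
print(describe_lookup({"memory_info": 42}, "memory_info"))
print(describe_lookup(lambda: None, "memory_info"))
```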

@XuehaiPan
Owner

> but it seems that there still exist some problems,

Fixed by the newest commit.

@jue-jue-zi
Author

It works right now! Thanks, it is a really great project.

image

@jue-jue-zi
Author

> It works right now! Thanks, it is a really great project.
>
> image

Maybe showing the errors in red would be better.
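
One way to get that emphasis in a plain terminal is an ANSI color escape. This is only a sketch under assumptions: `format_failed_devices` is a hypothetical helper, the line layout is made up for illustration, and a full-screen UI would likely use its own color handling instead.

```python
RED = "\033[31m"   # ANSI SGR code for red foreground
RESET = "\033[0m"  # ANSI SGR code that resets all attributes


def format_failed_devices(failed, use_color=True):
    """Render (index, message) pairs as one line per failed GPU."""
    lines = [f"GPU {index}: ERROR: {message}" for index, message in failed]
    if use_color:
        lines = [f"{RED}{line}{RESET}" for line in lines]
    return "\n".join(lines)


print(format_failed_devices([(2, "Unknown Error")]))
```

Passing `use_color=False` keeps the output readable when it is piped to a file or a terminal without color support.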
