Steps to reproduce
- Start a run on a Lambda GPU instance
- SSH into the container: `ssh <run-name>`
- Run `nvidia-smi`
Actual behaviour
Sometimes `nvidia-smi` fails with the error `Failed to initialize NVML: Unknown Error`. To reliably trigger the issue:
- SSH into the host: `ssh <run-name>-host`
- Run `sudo systemctl daemon-reload`
- Run `nvidia-smi` inside the container again (a quick programmatic check is sketched below)
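For debugging, here is a minimal sketch (illustrative only, not dstack code) of checking the container's GPU state programmatically; it just wraps `nvidia-smi` and looks for the NVML error string:

```python
# Illustrative health check, run inside the container.
import subprocess

def gpu_healthy() -> bool:
    """Return True if nvidia-smi succeeds, False on the NVML init failure."""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    output = result.stdout + result.stderr
    # In the broken state, nvidia-smi exits non-zero and prints
    # "Failed to initialize NVML: Unknown Error".
    return result.returncode == 0 and "Failed to initialize NVML" not in output

if __name__ == "__main__":
    print("GPU OK" if gpu_healthy() else "NVML init failed")
```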
Expected behaviour
No response
dstack version
0.19.7
Server logs
Additional information
This looks like NVIDIA/nvidia-container-toolkit#48: with the systemd cgroup driver, a `systemctl daemon-reload` can revoke the container's access to the NVIDIA device nodes, after which NVML fails to initialize. The Nebius backend already works around it by switching Docker to the cgroupfs cgroup driver:
From `dstack/src/dstack/_internal/core/backends/nebius/compute.py` (lines 44 to 61 in c5d1bd5):

```python
DOCKER_DAEMON_CONFIG = {
    "runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}},
    # Workaround for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
}

SETUP_COMMANDS = [
    "ufw allow ssh",
    "ufw allow from 10.0.0.0/8",
    "ufw allow from 172.16.0.0/12",
    "ufw allow from 192.168.0.0/16",
    "ufw default deny incoming",
    "ufw default allow outgoing",
    "ufw enable",
    'sed -i "s/.*AllowTcpForwarding.*/AllowTcpForwarding yes/g" /etc/ssh/sshd_config',
    "service ssh restart",
    f"echo {shlex.quote(json.dumps(DOCKER_DAEMON_CONFIG))} > /etc/docker/daemon.json",
    "service docker restart",
]
```
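The Lambda backend apparently doesn't apply this setting. As a stopgap, here is a minimal sketch, assuming host SSH access, of merging the same workaround into the host's `/etc/docker/daemon.json` by hand (the helper name is hypothetical, not dstack code):

```python
# Hypothetical one-off fix, run as root on the host (not part of dstack):
# merge the cgroupfs workaround into /etc/docker/daemon.json, keeping any
# existing settings intact.
import json
from pathlib import Path

DAEMON_JSON = Path("/etc/docker/daemon.json")

def apply_cgroupfs_workaround() -> None:
    config = json.loads(DAEMON_JSON.read_text()) if DAEMON_JSON.exists() else {}
    opts = set(config.get("exec-opts", []))
    # Same setting the Nebius backend uses for
    # https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
    opts.add("native.cgroupdriver=cgroupfs")
    config["exec-opts"] = sorted(opts)
    DAEMON_JSON.write_text(json.dumps(config, indent=2) + "\n")

if __name__ == "__main__":
    apply_cgroupfs_workaround()
```

Docker must be restarted afterwards (e.g. `sudo systemctl restart docker`) for the new cgroup driver to take effect; note this also restarts any running containers.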