
rootless nvidia runtime not working as expected #3659

Closed

hholst80 opened this issue Jul 28, 2019 · 22 comments
Labels: locked - please file new issue/PR, stale-issue

@hholst80:
(Disclaimer: I might be barking up the wrong tree here, I have no idea if this is even supposed to work yet.)

Problem

The nvidia runtime does not work without root (see the debug log below: /sys/fs/cgroup permission denied).

Expected result

Rainbows.

$ sudo ^Wpodman run --rm nvidia/cuda:10.1-base nvidia-smi
Error: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": OCI runtime error
$ sudo ^Wpodman run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi
Sun Jul 28 21:26:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro GP100        Off  | 00000000:01:00.0 Off |                  Off |
| 36%   48C    P0    28W / 235W |      0MiB / 16277MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Debug log

$ podman --log-level=debug run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi
INFO[0000] running as rootless                          
DEBU[0000] Initializing boltdb state at /home/hholst/.local/share/containers/storage/libpod/bolt_state.db 
DEBU[0000] Using graph driver vfs                       
DEBU[0000] Using graph root /home/hholst/.local/share/containers/storage 
DEBU[0000] Using run root /run/user/1001                
DEBU[0000] Using static dir /home/hholst/.local/share/containers/storage/libpod 
DEBU[0000] Using tmp dir /run/user/1001/libpod/tmp      
DEBU[0000] Using volume path /home/hholst/.local/share/containers/storage/volumes 
DEBU[0000] Set libpod namespace to ""                   
DEBU[0000] [graphdriver] trying provided driver "vfs"   
DEBU[0000] Initializing event backend file              
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]docker.io/nvidia/cuda:10.1-base" 
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] exporting opaque data as blob "sha256:d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] exporting opaque data as blob "sha256:d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] Got mounts: []                               
DEBU[0000] Got volumes: []                              
DEBU[0000] Using slirp4netns netmode                    
DEBU[0000] created OCI spec and options for new container 
DEBU[0000] Allocated lock 1 for container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde 
DEBU[0000] parsed reference into "[vfs@/home/hholst/.local/share/containers/storage+/run/user/1001]@d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] exporting opaque data as blob "sha256:d5ce0ddf6959429e592ec5a55f88b73ff9040d3f77899284822f70a31cfd86fd" 
DEBU[0000] created container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" 
DEBU[0000] container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" has work directory "/home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata" 
DEBU[0000] container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" has run directory "/run/user/1001/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata" 
DEBU[0000] New container created "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" 
DEBU[0000] container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" has CgroupParent "/libpod_parent/libpod-f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" 
DEBU[0000] Not attaching to stdin                       
DEBU[0000] mounted container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" at "/home/hholst/.local/share/containers/storage/vfs/dir/640c8ef93cb8b3e03ede3f34bb8c346869f4ffda98b234f9d63ae22411944d9d" 
DEBU[0000] Created root filesystem for container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde at /home/hholst/.local/share/containers/storage/vfs/dir/640c8ef93cb8b3e03ede3f34bb8c346869f4ffda98b234f9d63ae22411944d9d 
DEBU[0000] /etc/system-fips does not exist on host, not mounting FIPS mode secret 
DEBU[0000] Created OCI spec for container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde at /home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/config.json 
DEBU[0000] /usr/bin/conmon messages will be logged to syslog 
DEBU[0000] running conmon: /usr/bin/conmon               args="[-c f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde -u f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde -n quizzical_lewin -r /usr/bin/nvidia-container-runtime -b /home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata -p /run/user/1001/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/pidfile --exit-dir /run/user/1001/libpod/tmp/exits --conmon-pidfile /run/user/1001/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/hholst/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1001 --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /run/user/1001/libpod/tmp --exit-command-arg --runtime --exit-command-arg nvidia --exit-command-arg --storage-driver --exit-command-arg vfs --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde --socket-dir-path /run/user/1001/libpod/tmp/socket -l k8s-file:/home/hholst/.local/share/containers/storage/vfs-containers/f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde/userdata/ctr.log --log-level debug --syslog]"
WARN[0000] Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for cpu: mkdir /sys/fs/cgroup/cpu/libpod_parent: permission denied 
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

DEBU[0000] Received container pid: -1                   
DEBU[0000] Cleaning up container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde 
DEBU[0000] Network is already cleaned up, skipping...   
DEBU[0000] unmounted container "f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde" 
DEBU[0000] Cleaning up container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde 
DEBU[0000] Network is already cleaned up, skipping...   
DEBU[0000] Container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde storage is already unmounted, skipping... 
DEBU[0000] Container f707854cba7b0b191de88ce225033d8a58cd93939585e24fa3c600368b4b2dde storage is already unmounted, skipping... 
ERRO[0001] container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/sbin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 --pid=9588 /home/hholst/.local/share/containers/storage/vfs/dir/640c8ef93cb8b3e03ede3f34bb8c346869f4ffda98b234f9d63ae22411944d9d]\\\\nnvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/user.slice/devices.allow: permission denied\\\\n\\\"\""
: OCI runtime error 

Config

$ podman --version
podman version 1.4.4
$ uname -a
Linux puff.lan 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux
$ cat ~/.config/containers/libpod.conf 
volume_path = "/home/hholst/.local/share/containers/storage/volumes"
image_default_transport = "docker://"
runtime = "runc"
conmon_path = ["/usr/libexec/podman/conmon", "/usr/local/lib/podman/conmon", "/usr/bin/conmon", "/usr/sbin/conmon", "/usr/local/bin/conmon", "/usr/local/sbin/conmon"]
conmon_env_vars = ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"]
cgroup_manager = "cgroupfs"
init_path = "/usr/libexec/podman/catatonit"
static_dir = "/home/hholst/.local/share/containers/storage/libpod"
tmp_dir = "/run/user/1001/libpod/tmp"
max_log_size = -1
no_pivot_root = false
cni_config_dir = "/etc/cni/net.d/"
cni_plugin_dir = ["/usr/libexec/cni", "/usr/lib/cni", "/usr/local/lib/cni", "/opt/cni/bin"]
infra_image = "k8s.gcr.io/pause:3.1"
infra_command = "/pause"
enable_port_reservation = true
label = true
network_cmd_path = ""
num_locks = 2048
events_logger = "file"
EventsLogFilePath = ""
detach_keys = "ctrl-p,ctrl-q"

[runtimes]
  runc = ["/usr/bin/runc", "/usr/sbin/runc", "/usr/local/bin/runc", "/usr/local/sbin/runc", "/sbin/runc", "/bin/runc", "/usr/lib/cri-o-runc/sbin/runc"]
  nvidia = ["/usr/bin/nvidia-container-runtime"]
$ pacman -Q nvidia-container-runtime
nvidia-container-runtime 2.0.0+3.docker18.09.6-1
@mheon (Member) commented Jul 28, 2019

We had a bug about this before - I'll try and dig it up on Monday. However, I believe the conclusion was that the oci-nvidia-hook was doing things that require root, and as such failed when running rootless.

@mheon (Member) commented Jul 28, 2019

@baude

events_logger = "file"
EventsLogFilePath = ""

That doesn't look right...
(completely unrelated to this issue, but if that's what we're autogenerating for rootless configs, we need to fix it)

@baude (Member) commented Jul 29, 2019

@mheon which part do you think is wrong?

@mheon (Member) commented Jul 29, 2019

No path for the log file - that doesn't seem correct.

@nvjmayo commented Oct 18, 2019

I'm able to run with a few changes to a config file and a custom hook. Of course, since I'm non-root, the system hooks won't be used by default, so I have to add the --hooks-dir option as well.

Add/update these two sections of /etc/nvidia-container-runtime/config.toml:

[nvidia-container-cli]
no-cgroups = true

[nvidia-container-runtime]
debug = "/tmp/nvidia-container-runtime.log"

My quick-and-dirty hook, which I put in /usr/share/containers/oci/hooks.d/01-nvhook.json:

{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"],
    "env": ["NVIDIA_REQUIRE_CUDA=cuda>=10.1"]
  },
  "when": {
    "always": true
  },
  "stages": ["prestart"]
}

Once that is in place, and there are no mysterious bits in /run/user/1000/vfs-layers left over from previously using sudo podman, it works:

~/ podman run --rm --hooks-dir /usr/share/containers/oci/hooks.d nvcr.io/nvidia/cuda nvidia-smi
Fri Oct 18 19:48:57 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:65:00.0 N/A |                  N/A |
| 50%   37C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

The usual failures I see are when non-root tries to open files in /var/log/ for writing, plus the cgroups issue mentioned in the report at NVIDIA/nvidia-container-runtime#85.

The above is only a workaround. My goals for fully resolving this issue would be:

  • nvidia-container-runtime needs to install a hook somewhere, OR someone (me) needs to add a --gpus option to podman.
  • nvidia-container-runtime would be nicer if per-user configs were possible, to make redirecting logfiles easier. That would be a more production-ready solution for rootless usage.

@rhatdan (Member) commented Oct 18, 2019

Could the nvidia hook allow the use of a --logfile param to redirect, e.g.

--logfile $HOME/nvidia.log

or simply add

--syslog

so that the messages would end up going to journal/syslog rather than being written to a file?

What does the --gpus flag do?
Is podman supposed to do something with this?

@nvjmayo commented Oct 18, 2019

In Docker, we added a --gpus flag which does a few things to trigger the nvidia runtime hooks. The parameter lets the user select which GPU(s) to expose to the container, or all of them (ex: --gpus 0,2,3). Low-level detail: these settings are communicated as environment variables along the execution path to the hooks and in-container libraries; not elegant, but working OK in production.
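
For reference, a rough way to express the same device selection with podman today (a sketch, assuming the hook from my earlier comment is installed; NVIDIA_VISIBLE_DEVICES is the environment variable the toolkit already reads):

$ podman run --rm --hooks-dir /usr/share/containers/oci/hooks.d \
    -e NVIDIA_VISIBLE_DEVICES=0,2,3 nvcr.io/nvidia/cuda nvidia-smi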

I'm working out a prototype and a proposal to add a similar option to podman, because I feel that having similar command-line options on both podman and docker is easier on users. (Next week I will send a PR proposing changes to cmd/podman/common.go, completions/bash/podman, etc.)

I opened an issue for nvidia-container-runtime to improve their logging support. https://gitlab.com/nvidia/container-toolkit/nvidia-container-runtime/issues/5
The feature request will be considered at their next sprint meeting for assignment.

@rajatchopra commented Oct 18, 2019

Could nvidia hook allow the use of --logfile param? To redirect
--logfile $HOME/nvidia.log or simply add

--syslog

@rhatdan Would it be acceptable if we had a syslog=true option in config.toml? (In order to accommodate other runtimes that will not allow us to modify command-line args.)

@rhatdan (Member) commented Oct 20, 2019

Sure; the question then would be what the default is, since this is not something shipped/controlled by the distributions. Another option would be for your plugin to detect that it is not running as root and fall back to syslog logging.

@nvjmayo commented Oct 22, 2019

I would prefer to have everything controlled by the config file and command-line options rather than establish some drastically different behavior based on uid.

  1. I am OK with syslog being the default for all if syslog=true.
  2. I'm also OK with /var/run/$UID/nvidia-container-runtime/nvidia-container-runtime.log being the default if the logfile at the configured location is not writable.
  3. Something more exotic like syslog=false; syslog_user=true is also acceptable to me, where root would write to a file and users would go to syslog (see the sketch below). I don't know the use case, but it's close to what was originally proposed, and it meets my criteria of using the config file as the policy instead of coding up a policy.
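
A hypothetical config.toml sketch of option 3 (neither syslog key exists in the shipped nvidia-container-runtime; this only illustrates the proposed policy):

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"  # root keeps writing here
syslog = false                                   # hypothetical global syslog switch
syslog_user = true                               # hypothetical: non-root logs to syslog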

@rhatdan (Member) commented Oct 23, 2019

I am fine with either; my only goal is that rootless podman does not suddenly blow up because it cannot write to a system log file.

@github-actions

This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.

@dagrayvid commented Feb 12, 2020

Are there any updates on this? Is the recommended way to do this to set no-cgroups = true in /etc/nvidia-container-runtime/config.toml?

@grzegorzk commented:

Hi @dagrayvid - I successfully built a container with the nvidia drivers and CUDA inside it; this comment helped me: #3155 (comment)

Additionally, I spotted some weird behavior: I was using OpenCV from within the container, and simple code like the snippet below kept failing until I had first executed it on the host:

import cv2
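# returns the number of CUDA-capable devices visible to OpenCV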
cv2.cuda.getCudaEnabledDeviceCount()

@andrewssobral commented:

I have the same problem on my deep learning server; can someone help me?

docker run --runtime=nvidia --privileged nvidia/cuda nvidia-smi works fine, but
podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi crashes:

$ podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi
2020/04/03 13:34:52 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
Error: `/usr/bin/nvidia-container-runtime start e3ccb660bf27ce0858ee56476e58b53cd3dc900e8de80f08d10f3f844c0e9f9a` failed: exit status 1
$ podman --version
podman version 1.8.2
$ cat ~/.config/containers/libpod.conf
# libpod.conf is the default configuration file for all tools using libpod to
# manage containers

# Default transport method for pulling and pushing for images
image_default_transport = "docker://"

# Paths to look for the conmon container manager binary.
# If the paths are empty or no valid path was found, then the `$PATH`
# environment variable will be used as the fallback.
conmon_path = [
            "/usr/libexec/podman/conmon",
            "/usr/local/libexec/podman/conmon",
            "/usr/local/lib/podman/conmon",
            "/usr/bin/conmon",
            "/usr/sbin/conmon",
            "/usr/local/bin/conmon",
            "/usr/local/sbin/conmon",
            "/run/current-system/sw/bin/conmon",
]

# Environment variables to pass into conmon
conmon_env_vars = [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]

# CGroup Manager - valid values are "systemd" and "cgroupfs"
#cgroup_manager = "systemd"

# Container init binary
#init_path = "/usr/libexec/podman/catatonit"

# Directory for persistent libpod files (database, etc)
# By default, this will be configured relative to where containers/storage
# stores containers
# Uncomment to change location from this default
#static_dir = "/var/lib/containers/storage/libpod"

# Directory for temporary files. Must be tmpfs (wiped after reboot)
#tmp_dir = "/var/run/libpod"
tmp_dir = "/run/user/1000/libpod/tmp"

# Maximum size of log files (in bytes)
# -1 is unlimited
max_log_size = -1

# Whether to use chroot instead of pivot_root in the runtime
no_pivot_root = false

# Directory containing CNI plugin configuration files
cni_config_dir = "/etc/cni/net.d/"

# Directories where the CNI plugin binaries may be located
cni_plugin_dir = [
               "/usr/libexec/cni",
               "/usr/lib/cni",
               "/usr/local/lib/cni",
               "/opt/cni/bin"
]

# Default CNI network for libpod.
# If multiple CNI network configs are present, libpod will use the network with
# the name given here for containers unless explicitly overridden.
# The default here is set to the name we set in the
# 87-podman-bridge.conflist included in the repository.
# Not setting this, or setting it to the empty string, will use normal CNI
# precedence rules for selecting between multiple networks.
cni_default_network = "podman"

# Default libpod namespace
# If libpod is joined to a namespace, it will see only containers and pods
# that were created in the same namespace, and will create new containers and
# pods in that namespace.
# The default namespace is "", which corresponds to no namespace. When no
# namespace is set, all containers and pods are visible.
#namespace = ""

# Default infra (pause) image name for pod infra containers
infra_image = "k8s.gcr.io/pause:3.1"

# Default command to run the infra container
infra_command = "/pause"

# Determines whether libpod will reserve ports on the host when they are
# forwarded to containers. When enabled, when ports are forwarded to containers,
# they are held open by conmon as long as the container is running, ensuring that
# they cannot be reused by other programs on the host. However, this can cause
# significant memory usage if a container has many ports forwarded to it.
# Disabling this can save memory.
#enable_port_reservation = true

# Default libpod support for container labeling
# label=true

# The locking mechanism to use
lock_type = "shm"

# Number of locks available for containers and pods.
# If this is changed, a lock renumber must be performed (e.g. with the
# 'podman system renumber' command).
num_locks = 2048

# Directory for libpod named volumes.
# By default, this will be configured relative to where containers/storage
# stores containers.
# Uncomment to change location from this default.
#volume_path = "/var/lib/containers/storage/volumes"

# Selects which logging mechanism to use for Podman events.  Valid values
# are `journald` or `file`.
# events_logger = "journald"

# Specify the keys sequence used to detach a container.
# Format is a single character [a-Z] or a comma separated sequence of
# `ctrl-<value>`, where `<value>` is one of:
# `a-z`, `@`, `^`, `[`, `\`, `]`, `^` or `_`
#
# detach_keys = "ctrl-p,ctrl-q"

# Default OCI runtime
runtime = "runc"

# List of the OCI runtimes that support --format=json.  When json is supported
# libpod will use it for reporting nicer errors.
runtime_supports_json = ["crun", "runc"]

# List of all the OCI runtimes that support --cgroup-manager=disable to disable
# creation of CGroups for containers.
runtime_supports_nocgroups = ["crun"]

# Paths to look for a valid OCI runtime (runc, runv, etc)
# If the paths are empty or no valid path was found, then the `$PATH`
# environment variable will be used as the fallback.
[runtimes]
runc = [
            "/usr/bin/runc",
            "/usr/sbin/runc",
            "/usr/local/bin/runc",
            "/usr/local/sbin/runc",
            "/sbin/runc",
            "/bin/runc",
            "/usr/lib/cri-o-runc/sbin/runc",
            "/run/current-system/sw/bin/runc",
]

crun = [
                "/usr/bin/crun",
                "/usr/sbin/crun",
                "/usr/local/bin/crun",
                "/usr/local/sbin/crun",
                "/sbin/crun",
                "/bin/crun",
                "/run/current-system/sw/bin/crun",
]

nvidia = ["/usr/bin/nvidia-container-runtime"]

# Kata Containers is an OCI runtime, where containers are run inside lightweight
# Virtual Machines (VMs). Kata provides additional isolation towards the host,
# minimizing the host attack surface and mitigating the consequences of
# containers breakout.
# Please notes that Kata does not support rootless podman yet, but we can leave
# the paths below blank to let them be discovered by the $PATH environment
# variable.

# Kata Containers with the default configured VMM
kata-runtime = [
    "/usr/bin/kata-runtime",
]

# Kata Containers with the QEMU VMM
kata-qemu = [
    "/usr/bin/kata-qemu",
]

# Kata Containers with the Firecracker VMM
kata-fc = [
    "/usr/bin/kata-fc",
]

# The [runtimes] table MUST be the last thing in this file.
# (Unless another table is added)
# TOML does not provide a way to end a table other than a further table being
# defined, so every key hereafter will be part of [runtimes] and not the main
# config.
$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
debug = "/tmp/nvidia-container-runtime.log
$ cat /tmp/nvidia-container-runtime.log
2020/04/03 13:23:02 Running /usr/bin/nvidia-container-runtime
2020/04/03 13:23:02 Using bundle file: /home/andrews/.local/share/containers/storage/vfs-containers/614cb26f8f4719e3aba56be2e1a6dc29cd91ae760d9fe3bf83d6d1b24becc638/userdata/config.json
2020/04/03 13:23:02 prestart hook path: /usr/bin/nvidia-container-runtime-hook
2020/04/03 13:23:02 Prestart hook added, executing runc
2020/04/03 13:23:02 Looking for "docker-runc" binary
2020/04/03 13:23:02 "docker-runc" binary not found
2020/04/03 13:23:02 Looking for "runc" binary
2020/04/03 13:23:02 Runc path: /usr/bin/runc
2020/04/03 13:23:09 Running /usr/bin/nvidia-container-runtime
2020/04/03 13:23:09 Command is not "create", executing runc doing nothing
2020/04/03 13:23:09 Looking for "docker-runc" binary
2020/04/03 13:23:09 "docker-runc" binary not found
2020/04/03 13:23:09 Looking for "runc" binary
2020/04/03 13:23:09 ERROR: find runc path: exec: "runc": executable file not found in $PATH
2020/04/03 13:31:06 Running nvidia-container-runtime
2020/04/03 13:31:06 Command is not "create", executing runc doing nothing
2020/04/03 13:31:06 Looking for "docker-runc" binary
2020/04/03 13:31:06 "docker-runc" binary not found
2020/04/03 13:31:06 Looking for "runc" binary
2020/04/03 13:31:06 Runc path: /usr/bin/runc
$ nvidia-container-runtime --version
runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev
NVRM version:   440.64.00
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          GeForce RTX 2070
Brand:          GeForce
GPU UUID:       GPU-22dfd02e-a668-a6a6-a90a-39d6efe475ee
Bus Location:   00000000:01:00.0
Architecture:   7.5
$ whereis runc
runc: /usr/bin/runc
$ whereis docker-runc
docker-runc:
$ docker version
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b7f0
  Built:            Wed Mar 11 01:24:19 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

@rhatdan (Member) commented Apr 3, 2020

Please open new issues, do not just keep adding to existing closed issues.

If you run podman as root does it work?

@andrewssobral commented:

Hi @rhatdan, no, same error when I do:
sudo podman run --runtime=nvidia --privileged nvidia/cuda nvidia-smi

@rhatdan (Member) commented Apr 3, 2020

Is there a nvidia hook that is attempting to launch runc?

@rhatdan (Member) commented Apr 3, 2020

Actually, I don't think podman looks for the executable in $PATH.
If you run it with the full path, does it work?

sudo podman run --runtime=/usr/bin/nvidia --privileged nvidia/cuda nvidia-smi

@andrewssobral commented Apr 3, 2020

@rhatdan
[updated] This is my current output:
(same error with sudo)

$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ --runtime=nvidia --privileged nvidia/cuda nvidia-smi
2020/04/03 17:38:30 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
2020/04/03 17:38:31 ERROR: /usr/bin/nvidia-container-runtime: find runc path: exec: "runc": executable file not found in $PATH
Error: `/usr/bin/nvidia-container-runtime start 458d3eaa57ed972cc76fa5d9b99cc2a6db7f51e1a91befde1fd6ba17735f0b79` failed: exit status 1

@rhatdan (Member) commented Apr 3, 2020

I wonder if we exec the OCI runtime and drop the $PATH setting in the process.
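
A minimal way to probe that theory (a sketch, not a fix): run the wrapper by hand with the caller's environment intact, where it resolves runc fine (the log above shows "Runc path: /usr/bin/runc"), and compare with the failure when podman execs it:

$ command -v runc
/usr/bin/runc
$ /usr/bin/nvidia-container-runtime --version   # finds runc when $PATH is intact

If the direct invocation succeeds while the podman-launched one cannot find runc, the environment podman passes to the runtime is the likely culprit.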
