
run singularity in (unprivileged) k8s pod #5857

Closed
rptaylor opened this issue Mar 5, 2021 · 25 comments

rptaylor commented Mar 5, 2021

I am trying to figure out how to use Singularity inside a k8s pod. If the pod is privileged it works, but I want to make it more secure and use a non-privileged pod. My first attempt is based on using a setuid installation of Singularity.

I have done the following:

  • k8s PSP allows privilege escalation, required for setuid executables to work: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privilege-escalation
  • k8s PSP allows the CAP_SYS_ADMIN and CAP_SYS_CHROOT capabilities in the pod
  • a permissive seccomp profile is applied, allowing all syscalls
  • Singularity is installed in setuid mode in the container (the image I'm using for the k8s pod is git.computecanada.ca:4567/rptaylor/misc/atlas-grid-centos7-singbuild )
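
The list above corresponds roughly to this PSP sketch (a minimal fragment under the PSP API current at the time; the policy name is hypothetical, and seccomp on a PSP is set via the alpha annotation):

```yaml
# Hypothetical PSP allowing a setuid Singularity inside a pod.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: singularity-setuid            # hypothetical name
  annotations:
    # permissive seccomp: allow all syscalls
    seccomp.security.alpha.kubernetes.io/defaultProfileName: unconfined
spec:
  privileged: false
  allowPrivilegeEscalation: true      # required for setuid executables
  allowedCapabilities:
    - SYS_ADMIN                       # PSP capability names drop the CAP_ prefix
    - SYS_CHROOT
  # boilerplate rules required by the PSP schema
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```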

Nevertheless I am running into issues, likely because Singularity does not expect to already be running inside a container namespace.

Version of Singularity:

3.7.1

Expected behavior

Singularity should be able to start a (nested) container inside a k8s pod.

Actual behavior

$ singularity -d -vvv run /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7-base/ 
DEBUG   [U=10000,P=32]     persistentPreRun()            Singularity version: 3.7.1
DEBUG   [U=10000,P=32]     persistentPreRun()            Parsing configuration file /usr/local/etc/singularity/singularity.conf
DEBUG   [U=10000,P=32]     handleConfDir()               /home/localuser/.singularity already exists. Not creating.
DEBUG   [U=10000,P=32]     execStarter()                 Saving umask 0022 for propagation into container
DEBUG   [U=10000,P=32]     execStarter()                 Checking for encrypted system partition
DEBUG   [U=10000,P=32]     Init()                        Image format detection
DEBUG   [U=10000,P=32]     Init()                        Check for sandbox image format
DEBUG   [U=10000,P=32]     Init()                        sandbox image format detected
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding HEP_OSLIBS_VER environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding HOSTNAME environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_PORT environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_PORT_443_TCP_PORT environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding TERM environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding TMPDIR environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_SERVICE_PORT environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_SERVICE_HOST environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding UMD_REL_VER environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding GIT_COMMITTER_NAME environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding GIT_COMMITTER_EMAIL environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding PWD environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding SHLVL environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_PORT_443_TCP_PROTO environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_SERVICE_PORT_HTTPS environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_PORT_443_TCP_ADDR environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding KUBERNETES_PORT_443_TCP environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding _ environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding OLDPWD environment variable
DEBUG   [U=10000,P=32]     SetContainerEnv()             Forwarding USER_PATH environment variable
VERBOSE [U=10000,P=32]     SetContainerEnv()             Setting HOME=/home/localuser
VERBOSE [U=10000,P=32]     SetContainerEnv()             Setting PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
DEBUG   [U=10000,P=32]     init()                        Use starter binary /usr/local/libexec/singularity/bin/starter-suid
VERBOSE [U=0,P=32]         print()                       Set messagelevel to: 5
VERBOSE [U=0,P=32]         init()                        Starter initialization
DEBUG   [U=0,P=32]         load_overlay_module()         Trying to load overlay kernel module
DEBUG   [U=0,P=32]         load_overlay_module()         Overlay seems supported by the kernel
VERBOSE [U=0,P=32]         is_suid()                     Check if we are running as setuid
VERBOSE [U=0,P=32]         priv_drop()                   Drop root privileges
DEBUG   [U=10000,P=32]     read_engine_config()          Read engine configuration
DEBUG   [U=10000,P=32]     init()                        Wait completion of stage1
VERBOSE [U=10000,P=40]     priv_drop()                   Drop root privileges permanently
DEBUG   [U=10000,P=40]     set_parent_death_signal()     Set parent death signal to 9
VERBOSE [U=10000,P=40]     init()                        Spawn stage 1
DEBUG   [U=10000,P=40]     startup()                     singularity runtime engine selected
VERBOSE [U=10000,P=40]     startup()                     Execute stage 1
DEBUG   [U=10000,P=40]     StageOne()                    Entering stage 1
DEBUG   [U=10000,P=40]     prepareAutofs()               No autofs mount point found
DEBUG   [U=10000,P=40]     Init()                        Image format detection
DEBUG   [U=10000,P=40]     Init()                        Check for sandbox image format
DEBUG   [U=10000,P=40]     Init()                        sandbox image format detected
DEBUG   [U=10000,P=40]     setSessionLayer()             Overlay seems supported and allowed by kernel
DEBUG   [U=10000,P=40]     setSessionLayer()             Attempting to use overlayfs (enable overlay = try)
VERBOSE [U=10000,P=32]     wait_child()                  stage 1 exited with status 0
DEBUG   [U=10000,P=32]     init()                        Applying stage 1 working directory
DEBUG   [U=10000,P=32]     cleanup_fd()                  Close file descriptor 4
DEBUG   [U=10000,P=32]     cleanup_fd()                  Close file descriptor 5
DEBUG   [U=10000,P=32]     cleanup_fd()                  Close file descriptor 6
DEBUG   [U=10000,P=32]     init()                        Set child signal mask
DEBUG   [U=10000,P=32]     init()                        Create socketpair for master communication channel
DEBUG   [U=10000,P=32]     init()                        Create RPC socketpair for communication between stage 2 and RPC server
VERBOSE [U=10000,P=32]     priv_escalate()               Get root privileges
VERBOSE [U=0,P=32]         priv_escalate()               Change filesystem uid to 10000
VERBOSE [U=0,P=32]         init()                        Spawn master process
DEBUG   [U=0,P=46]         set_parent_death_signal()     Set parent death signal to 9
VERBOSE [U=0,P=46]         create_namespace()            Create mount namespace
VERBOSE [U=0,P=32]         enter_namespace()             Entering in mount namespace
DEBUG   [U=0,P=32]         enter_namespace()             Opening namespace file ns/mnt
DEBUG   [U=0,P=32]         enter_namespace()             FAILED to open namespace file. result -1 , error Permission denied
ERROR   [U=0,P=32]         init()                        Failed to enter in shared mount namespace: Permission denied

The last line is a debug message I added, which confirms the error is occurring here: https://github.com/hpcng/singularity/blob/master/cmd/starter/c/starter.c#L550

Perhaps when it tries to open "ns/mnt" it is not the right mount namespace?

Steps to reproduce this behavior

Run this container image on kubernetes (I can provide kubectl access to the pod if needed)

What OS/distro are you running

The kubelet node is CentOS8 and the container image is based on CentOS7.

How did you install Singularity

Built 3.7.1 for CentOS7 and installed into container image.


rptaylor commented Mar 5, 2021

Related to #5806 and possibly #2397?

If there is another way to do this that doesn't use the setuid approach, that would be even better.

Aha, in fact I found that now that the capabilities and seccomp profile are enabled, it works in user-namespace (-u) mode:
singularity -d -vvv run -u /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7-base/

I would have thought the setuid mode would be more likely to work than the user namespace mode.


dtrudg commented Mar 5, 2021

It'd be expected that the -u user namespace mode (or an unprivileged installation without setuid) works here with the appropriate capabilities / system calls permitted. I don't think we have any intention to support setuid mode nested in a container.


dtrudg commented Mar 5, 2021

Marking this as a 'wontfix' per the above, but pinging @cclerget just in case he has a different viewpoint on it.


DrDaveD commented Mar 5, 2021

Hi Ryan,

The remaining problem, which I have been trying to find a Kubernetes admin to help me test for over a year, is that the singularity -p option for an unprivileged PID namespace is disabled in docker and Kubernetes by default. The kubernetes option is documented as allowedProcMountTypes: Unmasked in the PodSecurityPolicy. I referred to this a couple of days ago in #5454.


rptaylor commented Mar 6, 2021

@DrDaveD I'm a kubernetes admin :)

I am very glad to finally have a working recipe for unprivileged Singularity in k8s pods. The part I had been missing, and still need to narrow down, is defining the seccomp profile needed for Singularity to work; that should be easy if it only needs the unshare system call, as you mentioned in the other issue.
On a CentOS8 kubelet, in a CentOS7 pod, I can start a singularity container:
/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs:/cvmfs /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 /bin/bash

What is the advantage of using -p? I don't think a separate PID namespace is essential for my use case. Trying that, I get

/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -p -B /cvmfs:/cvmfs /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 /bin/bash
FATAL:   container creation failed: mount proc->/proc error: can't mount proc filesystem to /proc: operation not permitted

Setting this in the PSP:

spec:
  allowedProcMountTypes:
    - Unmasked

and setting procMount: Unmasked in the container (not pod) securityContext, I still run into a PSP violation.
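
For reference, procMount is set per container rather than on the pod, e.g. (the container name here is hypothetical):

```yaml
spec:
  containers:
    - name: payload                 # hypothetical container name
      securityContext:
        procMount: Unmasked
```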

It took some digging, but I found
kubernetes/kubernetes#64283
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/proc-mount-type.md
The ProcMountType feature gate needs to be enabled on the API servers:
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
so it is an extra hoop to jump through.


rptaylor commented Mar 6, 2021

Regarding a seccomp profile that allows Singularity to work, I tried applying this file https://github.com/moby/moby/blob/master/profiles/seccomp/default.json#L384 with 'unshare' added on line 384 as unconditionally allowed, but ran into this:

/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity -d -v exec -B /cvmfs:/cvmfs /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 /bin/bash
VERBOSE [U=10000,P=35]     execStarter()                 starter-suid not found, using user namespace
VERBOSE [U=10000,P=35]     print()                       Set messagelevel to: 5
DEBUG   [U=10000,P=35]     init()                        PIPE_EXEC_FD value: 7
VERBOSE [U=10000,P=35]     init()                        Container runtime
VERBOSE [U=10000,P=35]     is_suid()                     Check if we are running as setuid
DEBUG   [U=10000,P=35]     init()                        Read json configuration from pipe
DEBUG   [U=10000,P=35]     init()                        Set child signal mask
DEBUG   [U=10000,P=35]     init()                        Wait completion of stage1
DEBUG   [U=10000,P=46]     set_parent_death_signal()     Set parent death signal to 9
VERBOSE [U=10000,P=46]     init()                        Spawn stage 1
VERBOSE [U=10000,P=46]     startup()                     Execute stage 1
DEBUG   [U=10000,P=46]     Stage()                       Entering stage 1
DEBUG   [U=10000,P=46]     prepareFd()                   Open file descriptor for /cvmfs
DEBUG   [U=10000,P=46]     Init()                        Entering image format intializer
DEBUG   [U=10000,P=46]     Init()                        Check for image format sif
DEBUG   [U=10000,P=46]     Init()                        sif format initializer returns: not a SIF file image
DEBUG   [U=10000,P=46]     Init()                        Check for image format sandbox
DEBUG   [U=10000,P=35]     init()                        Create socketpair for master communication channel
DEBUG   [U=10000,P=35]     cleanup_fd()                  Check file descriptor /proc/self/fd/3 pointing to /cvmfs
DEBUG   [U=10000,P=35]     cleanup_fd()                  Check file descriptor /proc/self/fd/4 pointing to anon_inode:[eventpoll]
DEBUG   [U=10000,P=35]     cleanup_fd()                  Closing /proc/self/fd/4
DEBUG   [U=10000,P=35]     cleanup_fd()                  Check file descriptor /proc/self/fd/5 pointing to /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/centos7-20201007170041-3738a6aaf38f06a42305140f14aeffc260f9fd61939e26a7135a0923d156db82
VERBOSE [U=10000,P=35]     user_namespace_init()         Create user namespace
DEBUG   [U=65534,P=35]     setup_userns_mappings()       Write deny to set group file
DEBUG   [U=65534,P=35]     setup_userns_mappings()       Write to GID map
DEBUG   [U=65534,P=35]     setup_userns_mappings()       Write to UID map
VERBOSE [U=10000,P=35]     create_namespace()            Create mount namespace
ERROR   [U=10000,P=35]     shared_mount_namespace_init() Failed to set mount propagation: Operation not permitted

so there must be a number of additional syscalls that are needed. I think I'm getting them logged to the audit logs using SCMP_ACT_LOG, but it could take a lot of digging to enumerate all the required ones.


DrDaveD commented Mar 6, 2021

The singularity -p option is essential for complete isolation between unrelated payloads under a pilot job; we always use singularity -cip for isolation. That kubernetes feature gate you found is what the admin of the OSG service kubernetes cluster is working on, and I'm waiting on him to get around to it. It would be great if you could try it in the meantime.

I was aware that unshare is only the first of the system calls singularity needs that are blocked by the default docker/kubernetes seccomp profile. @jthiltges made a complete profile, although I don't know where it is. In my opinion, it is much better to use allowPrivilegeEscalation: false, which I think uses the kernel's No New Privileges feature, the same feature singularity uses. In that case seccomp is unnecessary: since it is impossible to elevate privileges, there are no dangerous system calls. It's no different than running untrusted code in singularity, and we don't require seccomp for that. I recommend the equivalent options for unprivileged singularity in docker:

--security-opt seccomp=unconfined
--security-opt systempaths=unconfined
--security-opt no-new-privileges

@jthiltges

With Docker, adding clone, mount, setns, and unshare seemed sufficient to get Singularity running.

This gist has the seccomp changes I'd used for testing: https://gist.github.com/jthiltges/02f93509bd92f3fc9a276bbc2e966d35/revisions
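
That kind of edit can also be scripted instead of maintained by hand. A minimal sketch (not the gist itself, and the helper name is mine) that appends an unconditional allow rule for those four syscalls to a Docker-style seccomp profile:

```python
import json

def allow_syscalls(profile, names):
    """Return a deep copy of a Docker-style seccomp profile with an extra
    rule that unconditionally allows the given syscalls."""
    patched = json.loads(json.dumps(profile))  # cheap deep copy
    patched.setdefault("syscalls", []).append({
        "names": list(names),
        "action": "SCMP_ACT_ALLOW",
    })
    return patched

# Minimal stand-in for moby's default.json; in practice you would
# json.load() the real profile here.
base = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}
patched = allow_syscalls(base, ["clone", "mount", "setns", "unshare"])
print(json.dumps(patched, indent=2))
```

The resulting file can then be passed to docker with --security-opt seccomp=<file>, or referenced from a pod as a localhost seccomp profile.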


rptaylor commented Mar 8, 2021

@jthiltges thanks for the pointer, I'll take a look!
@DrDaveD Yes, I'm already using allowPrivilegeEscalation: false in k8s, which sets the no-new-privs flag:
https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privilege-escalation
Edit: I think I see what you mean; it should be possible to apply these settings at the pod level with a PSP that applies to particular jobs.
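
Expressed as a PSP, the k8s equivalent of Dave's docker options would look roughly like this (a sketch; the policy name is hypothetical, and seccomp on a PSP is controlled through the alpha annotation):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: singularity-userns            # hypothetical name
  annotations:
    # equivalent of --security-opt seccomp=unconfined
    seccomp.security.alpha.kubernetes.io/defaultProfileName: unconfined
spec:
  privileged: false
  allowPrivilegeEscalation: false     # sets the no-new-privs flag
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```

There is no direct PSP analogue of systempaths=unconfined; that is what the procMount: Unmasked discussion covers.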


DrDaveD commented Mar 8, 2021

One more thing which probably goes without saying, but for completeness: untrusted code needs to be started as an unprivileged user, not as a fake root user.


rptaylor commented Mar 9, 2021

Fake root (user namespace ID remapping) is not supported in kubernetes yet anyway.

I enabled the ProcMountType=true feature gate and applied the same YAML changes described above, this time with no PSP complaints. get pod -o yaml shows "procMount: Unmasked", but using the -p Singularity option still returns
FATAL: container creation failed: mount proc->/proc error: can't mount proc filesystem to /proc: operation not permitted

From my point of view the application is completely contained inside the pod. ATLAS' use of Singularity to protect different parts of the workload from each other inside the pod is a separate matter; I am not sure if there is a firm policy on that.


rptaylor commented Mar 9, 2021

genuinetools/img#212 suggests unmasked may only work with containerd?


cclerget commented Mar 9, 2021

Hi @rptaylor ,

DEBUG   [U=0,P=32]         enter_namespace()             FAILED to open namespace file. result -1 , error Permission denied
ERROR   [U=0,P=32]         init()                        Failed to enter in shared mount namespace: Permission denied

You also need to allow the CAP_SYS_PTRACE capability in the PSP to be able to open namespace descriptors.
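
In PSP terms (capability names are listed without the CAP_ prefix), that would be, as a sketch:

```yaml
spec:
  allowedCapabilities:
    - SYS_PTRACE    # needed to open another process's /proc/<pid>/ns/* files
```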


DrDaveD commented Mar 9, 2021

genuinetools/img#212 suggests unmasked may only work with containerd?

That's an interesting and potentially helpful thread you found. As I read it, though, it wasn't working with containerd either in the end, although it seemed to get further. Can you still see the /proc mask mounts it refers to inside your pod, in /proc/mounts? A year and a half has elapsed since that thread, so please list your software versions too, for the record. It's not clear to me whether the person testing was using a new enough version of kubernetes; it was apparently a new feature in kubernetes 1.13, according to the thread. The thread also notes that the PSP only allows the setting; it also has to be enabled in the container spec. Did you include that? The example in the img thread includes

containers:
  securityContext:
    procMount: "unmasked"

When I use docker-ce-20.10.3-3.el7, API version 1.41 (seen with "docker version"), I see those mounts under /proc until I add --security-opt systempaths=unconfined.

@rptaylor

Yes, my earlier comment showed that: #5857 (comment), and I confirmed the pod resource has procMount: Unmasked.

/proc/mounts is

overlay / overlay rw,seclabel,relatime,lowerdir=/var/lib/docker/overlay2/l/BLUQIPAMGC5VOAWWRUGZXTPZ3K:/var/lib/docker/overlay2/l/KVKOED7GK7MVHO6LZALYPBGGDL:/var/lib/docker/overlay2/l/AO2FDHKYM7QBMJGMNGXKGFK6MG:/var/lib/docker/overlay2/l/VXK3PVGL7U6FBNNDR5LAUBWT3H:/var/lib/docker/overlay2/l/4NVVCPKPS6EPLPRRMCFGWBNPAY:/var/lib/docker/overlay2/l/M64QBXBIMCDU4ASRODSHHFYEHK:/var/lib/docker/overlay2/l/6E32HUADMMGCEGSAXOJXDR5MT2:/var/lib/docker/overlay2/l/TW3BD45QP2RMBLZMA2YOTGNKKR:/var/lib/docker/overlay2/l/SP2RSBJ5I5WHTRCQNT6PCVBQUQ,upperdir=/var/lib/docker/overlay2/7b3f887e0740a56f50e726fc24c5fb4b85798a5bb9790b5647604f71f4782ba4/diff,workdir=/var/lib/docker/overlay2/7b3f887e0740a56f50e726fc24c5fb4b85798a5bb9790b5647604f71f4782ba4/work 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,seclabel,nosuid,size=65536k,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,seclabel,nosuid,nodev,noexec,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,seclabel,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,seclabel,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/rdma cgroup ro,seclabel,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,seclabel,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,seclabel,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,seclabel,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,seclabel,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,seclabel,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,seclabel,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,seclabel,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,seclabel,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,seclabel,nosuid,nodev,noexec,relatime,blkio 0 0
mqueue /dev/mqueue mqueue rw,seclabel,nosuid,nodev,noexec,relatime 0 0
/dev/vda1 /pilotdir xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/vda1 /dev/termination-log xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
cvmfs2 /cvmfs/atlas.cern.ch fuse ro,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
cvmfs2 /cvmfs/atlas-condb.cern.ch fuse ro,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
cvmfs2 /cvmfs/atlas-nightlies.cern.ch fuse ro,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
cvmfs2 /cvmfs/sft.cern.ch fuse ro,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
cvmfs2 /cvmfs/grid.cern.ch fuse ro,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
/dev/vda1 /etc/resolv.conf xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/vda1 /etc/hostname xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/vda1 /etc/hosts xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
shm /dev/shm tmpfs rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k 0 0
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs ro,seclabel,relatime 0 0
proc /proc/bus proc ro,relatime 0 0
proc /proc/fs proc ro,relatime 0 0
proc /proc/irq proc ro,relatime 0 0
proc /proc/sys proc ro,relatime 0 0
proc /proc/sysrq-trigger proc ro,relatime 0 0
tmpfs /proc/acpi tmpfs ro,seclabel,relatime 0 0
tmpfs /proc/kcore tmpfs rw,seclabel,nosuid,size=65536k,mode=755 0 0
tmpfs /proc/keys tmpfs rw,seclabel,nosuid,size=65536k,mode=755 0 0
tmpfs /proc/timer_list tmpfs rw,seclabel,nosuid,size=65536k,mode=755 0 0
tmpfs /proc/sched_debug tmpfs rw,seclabel,nosuid,size=65536k,mode=755 0 0
tmpfs /proc/scsi tmpfs ro,seclabel,relatime 0 0
tmpfs /sys/firmware tmpfs ro,seclabel,relatime 0 0

k8s v1.19.7
docker-ce 19.03.14-3.el8

Perhaps the kubernetes implementation of this alpha feature gate has not been updated since the new systempaths=unconfined Docker feature was added. In any case, dockershim is deprecated, so it will probably eventually work with containerd. Hopefully the k8s feature will reach beta too.

@bbockelm

@rptaylor - I'm having a bit of a difficult time following the whole thread. Were you successful in the end?


qafro1 commented Mar 16, 2021

What was the resolution, and what steps were taken?

@rptaylor

I think the only thing needed to make unprivileged Singularity work in unprivileged k8s pods is an unconfined seccomp profile, or otherwise allowing the various syscalls that would be blocked by the default docker seccomp profile (if your PSP applies seccomp). Also, the nodes are EL8, which may have something to do with it; it might also work on EL7 if you enable the max user namespaces sysctl.

However, full PID isolation (the procMount: Unmasked part) does not seem to work currently, at least not with Docker.

@bbockelm

@rptaylor - any luck with this? I was eventually able to replicate all the steps you did and hit the same problem.

It looks like dockershim expects to pass MaskedPaths and ReadonlyPaths in the HostConfig, not via the CLI flag Dave quotes above. It's not clear whether that is getting dropped by Docker or is unspecified by Kubernetes. Any thoughts on how to get Docker to log all its interactions?


DrDaveD commented Apr 21, 2021

I wonder if the problem is in this line of kubernetes code, and the fact that it appears to only set MaskedPaths if synthesized == nil. If I am reading it correctly, that only happens when DetermineEffectiveSecurityContext returns nil, which happens when pod.Spec.SecurityContext and container.SecurityContext are both nil. But that doesn't make sense: if synthesized is nil, the code is assuming that effectiveSc is not nil, because it is accessing effectiveSc.ProcMount. So I must be reading something incorrectly.

@rptaylor

I'm not sure, but it would make more sense to try this with containerd, since dockershim is deprecated.

@rptaylor

It works well enough for now; it would be interesting to try further with containerd in the future. In any case this is a useful Singularity-related discussion, but not a Singularity issue per se.


bbockelm commented Aug 2, 2022

Hi all --

Got it working! The recipe is:

  1. Enable the feature gate for unmasked proc in the kube apiserver. As of the time of writing (1.24) it appears to default to off.
  2. Apply the appropriate security context. This worked for me:
        securityContext:
          privileged: false
          procMount: Unmasked
          allowPrivilegeEscalation: false
          seccompProfile:
            type: Unconfined

(You can do as @jthiltges did and develop a more refined seccomp profile if you'd like.)

Verify that the created pod has procMount: Unmasked set. If it magically reverts back to Default, it means the feature gate was not successfully enabled in the API server.

  3. Ensure nothing else might be masking /proc.

Sadly, it appears that the NVIDIA container runtime violates item (3). Without a GPU assigned:

$ cat /proc/mounts | grep proc
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0

(and Singularity works with PID namespaces).

With a GPU assigned:

sh-4.4$ cat /proc/mounts | grep proc
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/driver/nvidia tmpfs rw,nosuid,nodev,noexec,relatime,mode=555 0 0
proc /proc/driver/nvidia/gpus/0000:17:00.0 proc ro,nosuid,nodev,noexec,relatime 0 0

(and the kernel effectively considers /proc masked, so no PID namespaces for you!)


DrDaveD commented Aug 2, 2022

That's great news, Brian! Can you give any more details for the record about how exactly to "Enable the feature gate for unmasked proc in the kube apiserver"?


bbockelm commented Aug 3, 2022

In order to enable the feature gate, I had to add a command line flag to the kube-apiserver pod. In my on-prem cluster (deployed via kubeadm), the flag looks like this:

# cat /etc/kubernetes/manifests/kube-apiserver.yaml  | grep fea
    - --feature-gates=ProcMountType=true
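
For clusters deployed with kubeadm, the same flag can also be expressed persistently through the ClusterConfiguration (a sketch, assuming the v1beta3 kubeadm config API):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "ProcMountType=true"
```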
