
--userns=auto for podman build is generating different uid/gid for each ephemeral container #18031

Closed
lukasmrtvy opened this issue Apr 3, 2023 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@lukasmrtvy

Issue Description

The UID/GID mapping should be the same for each ephemeral container when using --userns=auto. Otherwise, content shared across layers can become unavailable. Multi-stage builds are probably also affected, and keeping a single mapping would probably also improve build speed.

Steps to reproduce the issue

1. Create a Containerfile:

cat << EOF > Dockerfile
FROM alpine
RUN cat /proc/self/uid_map
RUN cat /proc/self/uid_map
RUN cat /proc/self/uid_map
EOF
2. Run podman build --userns=auto .

Describe the results you received

STEP 1/4: FROM alpine
Resolving "alpine" using unqualified-search registries (/var/home/actors/.config/containers/registries.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob f56be85fc22e skipped: already exists
Copying config 9ed4aefc74 done
Writing manifest to image destination
Storing signatures
STEP 2/4: RUN cat /proc/self/uid_map
         0          1       1024
--> 71680cc9527
STEP 3/4: RUN cat /proc/self/uid_map
         0       1025       1024
--> 5a42a8ead4e
STEP 4/4: RUN cat /proc/self/uid_map
         0       2049       1024
COMMIT
--> 9473bd7fae5
9473bd7fae5996e3428e6a3a296de0693db336829f546393ed422e3c55a9f369

Describe the results you expected

The same user namespace for all ephemeral containers / the whole build context

podman info output

host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc37.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 99.53
    systemPercent: 0.09
    userPercent: 0.38
  cpus: 32
  distribution:
    distribution: fedora
    variant: coreos
    version: "37"
  eventLogger: journald
  hostname: ip-10-1-85-134
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 100000
      size: 599900000
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 100000
      size: 599900000
  kernel: 6.1.18-200.fc37.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 78982516736
  memTotal: 132926525440
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.1-1.fc37.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.1
      commit: f8a096be060b22ccd3d5f3ebe44108517fbf6c30
      rundir: /run/user/1001/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1001/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-8.fc37.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 3h 7m 19.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/actors/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/actors/.local/share/containers/storage
  graphRootAllocated: 1098974756864
  graphRootUsed: 55107256320
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 38
  runRoot: /run/user/1001/containers
  transientStore: false
  volumePath: /var/home/actors/.local/share/containers/storage/volumes
version:
  APIVersion: 4.4.2
  Built: 1677669779
  BuiltTime: Wed Mar  1 11:22:59 2023
  GitCommit: ""
  GoVersion: go1.19.6
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.2

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details


Additional information


@lukasmrtvy lukasmrtvy added the kind/bug Categorizes issue or PR as related to a bug. label Apr 3, 2023
@rhatdan
Member

rhatdan commented Apr 3, 2023

@flouthoc @nalind @giuseppe PTAL

@giuseppe
Member

giuseppe commented Apr 4, 2023

I don't see why there is that expectation. If you use --userns=auto you can't predict what IDs will be used.

--userns=auto works at the container level; it is not applied to the entire build.

@rhatdan
Member

rhatdan commented Apr 4, 2023

I believe we should use the same userns for the duration of the build, not have it change for each RUN line.

@giuseppe
Member

giuseppe commented Apr 6, 2023

We'd need to create a dummy container to keep that user namespace busy and join it from the other containers, which is kind of messy for something that looks like a minor detail. To me it still feels like an implementation detail: once you specify --userns=auto you cannot make assumptions about the mappings used.

If you want to use the same mapping, you can force using the same container with --squash or force the mapping with --userns-uid-map=0:1:1024 --userns-gid-map=0:1:1024
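
Spelled out, those two workarounds would be invoked roughly like this (a sketch; 0:1:1024 is just an example mapping, not a recommendation):

$ podman build --userns=auto --squash .
$ podman build --userns-uid-map=0:1:1024 --userns-gid-map=0:1:1024 .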

@lukasmrtvy
Author

Thanks.
--squash will have a performance impact, while --userns-uid-map=0:1:1024 --userns-gid-map=0:1:1024 will not isolate builds from each other. Neither is a direct alternative to --userns=auto.

@giuseppe
Member

giuseppe commented Apr 6, 2023

I'd prefer not to change the semantics of --userns=auto to give such guarantees about what mappings you get, even within the same build. I am actually thinking of changing its implementation to randomly pick a range instead of the first available one, to increase security, similarly to what Kubernetes does.

Performance issues with --userns=auto will be taken care of with idmapped mounts, which will even work from a user namespace at some point (there is work happening upstream for that, so we are not too far from seeing it).
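
As a rough sketch of that direction (not from this thread: it assumes rootful podman, a recent kernel, and crun, since rootless idmapped mounts were still being worked on at the time), sharing a named volume between two --userns=auto containers via the idmap volume option could look like:

$ podman volume create shared
$ podman run --rm --userns=auto -v shared:/data:idmap alpine touch /data/file
$ podman run --rm --userns=auto -v shared:/data:idmap alpine ls -ln /data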

If you really want those semantics, you can still handle it manually:

$ ctr=$(podman run --userns=auto --rm -d fedora sleep infinity)
$ uidmap=$(podman unshare cat /proc/$(podman inspect $ctr --format='{{ .State.Pid }}')/uid_map | sed -z -e 's/^\s\+//g' -e 's/\s\+$//g' -e 's/\s\+/:/g')
$ gidmap=$(podman unshare cat /proc/$(podman inspect $ctr --format='{{ .State.Pid }}')/gid_map | sed -z -e 's/^\s\+//g' -e 's/\s\+$//g' -e 's/\s\+/:/g')
$ podman build --userns-uid-map=$uidmap --userns-gid-map=$gidmap .
STEP 1/4: FROM alpine
STEP 2/4: RUN cat /proc/self/uid_map
         0          1       1024
--> 8753a30b5eb
STEP 3/4: RUN cat /proc/self/uid_map
         0          1       1024
--> a01c76621bc
STEP 4/4: RUN cat /proc/self/uid_map
         0          1       1024
COMMIT
--> bd89241b319
bd89241b319dd0a34b544ab9a7f893ef564f8fb8b879d9dd77b86f59df2fd690
$ podman rm -t 0 -f $ctr

@rhatdan
Member

rhatdan commented Apr 6, 2023

Why not just bake those semantics into podman build?

I guess we can live with this as long as things like using content from one RUN command in a second RUN command still work.

@rhatdan
Member

rhatdan commented Apr 6, 2023

One other issue with this is that we would use up user namespaces more quickly. If I have a Containerfile with 10 RUN lines, will it allocate and retain 10 ranges until the build is complete? In rootless mode this could cause users to run out of UID mapping ranges.

@giuseppe
Member

giuseppe commented Apr 6, 2023

One other issue with this is that we would use up user namespaces more quickly. If I have a Containerfile with 10 RUN lines, will it allocate and retain 10 ranges until the build is complete? In rootless mode this could cause users to run out of UID mapping ranges.

That seems like a separate issue; I don't see why we wouldn't delete the container as soon as it is not needed anymore.
With Buildah at least I see a different output:

$ buildah bud --userns=auto . 
STEP 1/4: FROM alpine
STEP 2/4: RUN cat /proc/self/uid_map
         0          1       1024
STEP 3/4: RUN cat /proc/self/uid_map
         0          1       1024
STEP 4/4: RUN cat /proc/self/uid_map
         0          1       1024
COMMIT

What other problems wouldn't be solved by using idmapped mounts to share volumes?

A problem I see with having a single user namespace is that we will have less flexibility in choosing the right size for Containerfiles that use multiple FROM statements:

e.g. let's create an image like the following (podman build -t base .):

FROM alpine
RUN touch /foo && chown 1025:1025 /foo

Now the default size of 1024 IDs won't be enough:

$ podman run --userns=auto --rm alpine cat /proc/self/uid_map 
         0          1       1024
$ podman run --userns=auto --rm base cat /proc/self/uid_map 
         0          1       1026
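
For illustration (a sketch assuming the base image built above): the extra IDs are needed because /foo is owned by 1025:1025, which has to fit inside the mapping, so the following is expected to print 1025:1025:

$ podman run --userns=auto --rm base stat -c '%u:%g' /foo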

There are still corner cases that are not covered when we start mixing and picking files from different layers, but picking a single size would make this worse.

How would we pick the size? Should we scan the entire file and apply the heuristic to each FROM statement?

FROM alpine AS base
RUN cat /proc/self/uid_map
RUN touch /file

FROM base
RUN cat /proc/self/uid_map
COPY --from=base /file /file

@rhatdan
Member

rhatdan commented Apr 7, 2023

Ok, I guess I am thinking of this incorrectly? It is the FROM lines that create containers, as opposed to the RUN commands, so in general a different user namespace only comes into play for multi-FROM builds.

Then I go along with @giuseppe's idea, as long as tools can still copy content from a previous stage of the build to another.

@giuseppe
Member

giuseppe commented Apr 7, 2023

When we create a container we look at its image to decide how big the user namespace should be.

In the example above we create one container for the first RUN and look at the alpine image to pick its size; similarly, for the next RUN commands we look at the layer that was created earlier (even though the size cannot change, since it is not possible to create files owned by IDs that are not present in the previous user namespace).

The RUN that follows the second FROM though can use a different size since it is based on a different image.

If we want to move to the logic of using a single user namespace without making things worse than they are now, we will need to analyze the entire Containerfile first and pick the biggest size. This will complicate the logic, and we will also need to expose more APIs from c/storage, since it is not currently possible to run the heuristic that computes the userns size for an image (unless we create a bunch of temporary containers and check their size after they are created).

It complicates the logic a bit and I still don't see the benefit; sharing content is already possible with idmapped mounts.

@rhatdan
Member

rhatdan commented Apr 10, 2023

I think we should stick to one user namespace per image (FROM line).

@giuseppe
Member

I've found an issue while playing with --userns=auto: containers/storage#1564

With a file like:

FROM alpine AS base
RUN touch /tmp/foo
RUN cat /proc/self/uid_map

FROM busybox
COPY --from=base /tmp/foo /tmp/foo
RUN cat /proc/self/uid_map

I expect the user namespaces created to look like this (the sizes differ depending on the image):

[1/2] STEP 1/3: FROM alpine AS base
[1/2] STEP 2/3: RUN touch /tmp/foo
[1/2] STEP 3/3: RUN cat /proc/self/uid_map
         0     201024       1024
[2/2] STEP 1/3: FROM busybox
[2/2] STEP 2/3: COPY --from=base /tmp/foo /tmp/foo
[2/2] STEP 3/3: RUN cat /proc/self/uid_map
         0     202048      65535
[2/2] COMMIT
--> 80db9ad973e3
80db9ad973e3fe563f51ebd19cf1929d45db9a8769b35144d063834e2ea5d91b

to honor each FROM statement.

If we want to use one user namespace, then I think we need to pre-scan the Containerfile and analyze each FROM. I see more disadvantages than benefits with this model. The only remaining problem I see is that we keep containers around and may consume all the available IDs.

@flouthoc is there any reason why we keep containers around after the RUN was successful?

@rhatdan
Member

rhatdan commented Apr 12, 2023

This LGTM

@giuseppe
Member

That is the behavior we have today, and I think we should keep it without tying our hands to using only one user namespace.
We can improve our tools to handle cases like COPY --from=base /tmp/foo /tmp/foo, since it could in theory copy a file that is not mapped in the current user namespace, so we would need to increase its size. This wouldn't be possible (or would require preprocessing the entire Containerfile first) if we used the same user namespace.

@giuseppe
Member

We can probably close this issue: per the discussion above, I don't think we can stick to a single user namespace for the build. Please reopen if you disagree with the decision.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 25, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 25, 2023