
Investigate containerd user home directory has the wrong owner and group IDs #8226

Open

clarafu opened this issue Mar 31, 2022 · 3 comments
clarafu commented Mar 31, 2022

Summary

A user reported that, in a container image used to run integration tests, a user's home directory had the wrong owner and group IDs. The container image resource is pinned, so nothing has changed in the image itself.

More information can be found in the Slack thread in the hush-house channel, but it seems to have started after they upgraded their workers to the new 7.7 version.

Triaging info

  • Concourse version: 7.7.0
  • Browser (if applicable):
  • Did this used to work? Yes
@clarafu clarafu added the bug label Mar 31, 2022
@clarafu clarafu added this to the v7.8.0 milestone Mar 31, 2022

cfryanr commented Mar 31, 2022

Hi all, thanks for creating this issue for us.

I’ve created a simple pipeline which reproduces the issue. Here it is:

jobs:
  - name: build-and-run
    plan:
      - task: dockerfile
        tags: [ my-workers ]
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: debian
              tag: 11-slim
          outputs:
            - name: dockerfile
          run:
            path: bash
            args:
              - -ceux
              - |
                cat << EOF > dockerfile/Dockerfile
                FROM debian:11-slim
                RUN useradd --create-home testuser1 # create new UID, GID, and home dir
                RUN useradd --create-home testuser2 # create new UID, GID, and home dir
                RUN cat /etc/passwd | grep testuser # show new UIDs
                RUN cat /etc/group | grep testuser # show new GIDs
                RUN ls -lan /home/testuser1 # show ownership of new home dir
                RUN ls -lan /home/testuser2 # show ownership of new home dir
                EOF
      - task: build
        tags: [ my-workers ]
        privileged: true
        config:
          platform: linux
          image_resource:
            type: registry-image
            source:
              repository: concourse/oci-build-task
          inputs:
            - name: dockerfile
          outputs:
            - name: image
          run:
            path: build
          caches:
            - path: cache
          params:
            CONTEXT: dockerfile
            UNPACK_ROOTFS: true
      - task: run
        tags: [ my-workers ]
        image: image
        config:
          platform: linux
          run:
            path: bash
            args:
              - -ceux
              - |
                cat /etc/passwd | grep testuser # show UIDs
                cat /etc/group | grep testuser # show GIDs
                ls -lan /home/testuser1 # show file ownership
                ls -lan /home/testuser2 # show file ownership
                # Fail if the directory ownership does not match expected UID/GID
                if [[ "$(id testuser1 -u)" != "$(stat -c '%u' /home/testuser1)" ]]; then exit 1; fi
                if [[ "$(id testuser2 -u)" != "$(stat -c '%u' /home/testuser2)" ]]; then exit 1; fi
                if [[ "$(id testuser1 -g)" != "$(stat -c '%g' /home/testuser1)" ]]; then exit 1; fi
                if [[ "$(id testuser2 -g)" != "$(stat -c '%g' /home/testuser2)" ]]; then exit 1; fi

Note that these tasks use my team's external workers (via tags).
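Outside a pipeline, the ownership assertion at the end of the run task can be sketched locally like this (the temp directory stands in for a user's home directory and is purely illustrative):

```shell
# Minimal local version of the pipeline's ownership check.
# A freshly created temp directory is owned by the current user,
# so this passes; on an affected worker the equivalent check on
# /home/testuser1 would fail.
d=$(mktemp -d)
expected_uid=$(id -u)
actual_uid=$(stat -c '%u' "$d")
if [ "$expected_uid" = "$actual_uid" ]; then echo match; else echo mismatch; fi
rm -rf "$d"
```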

This job:

  • Always passes when the worker for the build task is randomly selected to be the same worker for the run task
  • Always fails when the worker for the build task is randomly selected to be a different worker for the run task

Example output from a failed job:

selected worker: my-workers-us-west1-b-28ba72c6
streaming volume for image from my-workers-us-west1-b-06a6d6b6
+ cat /etc/passwd
+ grep testuser
testuser1:x:1000:1000::/home/testuser1:/bin/sh
testuser2:x:1001:1001::/home/testuser2:/bin/sh
+ cat /etc/group
+ grep testuser
testuser1:x:1000:
testuser2:x:1001:
+ ls -lan /home/testuser1
total 12
drwxr-xr-x 1 1001 1000   54 Mar 31 19:02 .
drwxr-xr-x 1    0    0   36 Mar 31 19:02 ..
-rw-r--r-- 1 1001 1000  220 Aug  4  2021 .bash_logout
-rw-r--r-- 1 1001 1000 3526 Aug  4  2021 .bashrc
-rw-r--r-- 1 1001 1000  807 Aug  4  2021 .profile
+ ls -lan /home/testuser2
total 12
drwxr-xr-x 1 1000 1002   54 Mar 31 19:02 .
drwxr-xr-x 1    0    0   36 Mar 31 19:02 ..
-rw-r--r-- 1 1000 1002  220 Aug  4  2021 .bash_logout
-rw-r--r-- 1 1000 1002 3526 Aug  4  2021 .bashrc
-rw-r--r-- 1 1000 1002  807 Aug  4  2021 .profile
++ id testuser1 -u
++ stat -c %u /home/testuser1
+ [[ 1000 != \1\0\0\1 ]]
+ exit 1

Our workers used CONCOURSE_WORK_DIR=/mnt/disks/local-ssd, where that mount point was a local SSD device formatted using mkfs.btrfs, along with CONCOURSE_BAGGAGECLAIM_DRIVER=btrfs.

When I changed our workers to instead format that same drive using mkfs -t ext4, along with CONCOURSE_BAGGAGECLAIM_DRIVER=overlay, then the above job always passes, even when different workers are selected for the build and run tasks.

My suspicion is that Concourse v7.7.0 (or one of its dependencies) introduced some kind of bug related to streaming volumes between workers using btrfs.
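As an illustration of why streaming is a plausible place for ownership to drift (this is a generic tar sketch, not Concourse's actual streaming code): tar archives record each entry's numeric owner, but whether that owner is restored at extraction depends on the extracting process's privileges and flags.

```shell
# Generic sketch (not Concourse's code): tar records the numeric
# owner of each entry, but a non-root extraction ignores it and
# assigns the files to the extracting user instead.
src=$(mktemp -d); dst=$(mktemp -d)
touch "$src/file"
tar -C "$src" -cf "$src.tar" file
tar -C "$dst" -xf "$src.tar"
# As non-root, the extracted file belongs to the extracting user,
# regardless of the owner recorded in the archive:
stat -c '%u' "$dst/file"
rm -rf "$src" "$dst" "$src.tar"
```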

Note that I did not check to see if the version of btrfs that gets installed onto our workers changed recently. We were installing btrfs tools onto the workers using apt-get update --allow-releaseinfo-change && apt-get install btrfs-progs -y. The workers are GCP VMs with debian-10-buster-v20210316 boot disks. The worker VMs get deleted and recreated on a regular basis, almost nightly.

@xtremerui
Contributor

Tried to reproduce this in a local docker compose env with

CONCOURSE_BAGGAGECLAIM_DRIVER: btrfs
CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: fewest-build-containers  # forcing volume streaming

and two workers. Although the image was built on worker 1 and the run task ran on worker 2, the build passed.

Thus I wonder if it is related to the worker base image used in the original issue. Could you provide more details about that? For example, the gcloud image used to build the external worker VM.
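For reference, a docker-compose setup like the one described above could look roughly like this (a hypothetical fragment: the service names and the second worker are assumptions, and keys, ports, and TSA wiring are omitted for brevity; note that the placement strategy is a web-node setting, while the baggageclaim driver is a worker setting):

```yaml
web:
  image: concourse/concourse
  command: web
  environment:
    CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: fewest-build-containers  # forces streaming across workers
worker-1: &worker
  image: concourse/concourse
  command: worker
  privileged: true
  environment:
    CONCOURSE_BAGGAGECLAIM_DRIVER: btrfs
worker-2: *worker
```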


cfryanr commented Apr 5, 2022

Hi @xtremerui, thanks for looking at this.

Our workers were GCP c2-standard-8 VMs created with a debian-10-buster-v20210316 boot disk (this is a boot disk offered by Google). We also add a Local SSD Scratch Disk at VM creation time, and after booting the VM we formatted the scratch disk as btrfs using:

apt-get update --allow-releaseinfo-change && apt-get install btrfs-progs -y
mkfs.btrfs /dev/nvme0n1
mkdir -p /mnt/disks/local-ssd
mount /dev/nvme0n1 /mnt/disks/local-ssd

Then we used it as the CONCOURSE_WORK_DIR=/mnt/disks/local-ssd with CONCOURSE_BAGGAGECLAIM_DRIVER=btrfs.

To work around the issue, we changed this to instead be:

mkfs -t ext4 /dev/nvme0n1
mkdir -p /mnt/disks/local-ssd
mount /dev/nvme0n1 /mnt/disks/local-ssd

with CONCOURSE_WORK_DIR=/mnt/disks/local-ssd and CONCOURSE_BAGGAGECLAIM_DRIVER=overlay, which made the problem go away.

@xtremerui xtremerui removed this from the v7.8.0 milestone May 25, 2022