Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patch: Rebase to upstream 1.1.12 #2

Closed
wants to merge 2,343 commits into from
Closed

Conversation

aethernet
Copy link

kolyshkin and others added 30 commits May 27, 2022 12:17
Add checking of downloaded tarball checksum.

In case it doesn't match the hardcoded value, the error is like this:

	libseccomp-2.5.4.tar.gz: FAILED
	sha256sum: WARNING: 1 computed checksum did NOT match

In case the checksum for a particular version is not specified in the
script, the error will look like this:

	./script/seccomp.sh: line 29: SECCOMP_SHA256[${ver}]: unbound variable

In case the the hardcoded value in the file is of wrong format/length,
we'll get:

	sha256sum: 'standard input': no properly formatted SHA256 checksum lines found

In any of these cases, the script aborts (due to set -e).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 95f1e2e18872de54a17d64b2d808255463ee3d93)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Kir Kolyshkin (2):
  script/seccomp.sh: check tarball sha256
  Dockerfile,scripts/release: bump libseccomp to v2.5.4

LGTMs: AkihiroSuda cyphar
Closes opencontainers#3481
Remove upper bound in integer sanity check
to not restrict the number of socket-activated
sockets passed in.

Closes opencontainers#3488

Signed-off-by: Erik Sjölund <erik.sjolund@gmail.com>
(cherry picked from commit 03a210d)
Signed-off-by: Erik Sjölund <erik.sjolund@gmail.com>
…3489

[1.1] libcontainer: relax getenv_int sanity check
systemd emits very loud warnings when the path specified doesn't exist
(which can be the case for some of our default rules). We don't need the
ruleset we give systemd to be completely accurate (we discard some kinds
of wildcard rules anyway) so we can safely skip adding these.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
…s-nonexistent-files

[1.1] cgroups: systemd: skip adding device paths that don't exist
Perform some basic checks for CHANGELOG.md.

In particular, check for
 - missing periods;
 - extra spaces at EOL;
 - non-ASCII characters.

Fix the issues found.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
* Changelog for v1.1.3.
* Fixed 1.1.2 release date.
* Fixed the order of footnotes.

Note that backport (rather than original) PRs are listed as references,
since this makes it easier to cross-check against the git log.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Kir Kolyshkin (3):
  VERSION: back to development
  VERSION: release 1.1.3
  ci: add basic checks for CHANGELOG.md

LGTMs: thaJeztah cyphar
Closes opencontainers#3490
Due to a bug in commit 9c44407, when the user and mount namespaces
are used, and the bind mount is followed by the cgroup mount in the
spec, the cgroup is mounted using the bind mount's mount fd.

This can be reproduced with podman 4.1 (when configured to use runc):

$ podman run --uidmap 0:100:10000 quay.io/libpod/testimage:20210610 mount
Error: /home/kir/git/runc/runc: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted: OCI permission denied

or manually with the spec mounts containing something like this:

    {
      "destination": "/etc/resolv.conf",
      "type": "bind",
      "source": "/userdata/resolv.conf",
      "options": [
        "bind"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "rprivate",
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }

The issue was not found earlier since it requires using userns, and even then
mount fd is ignored by mountToRootfs, except for bind mounts, and all the bind
mounts have mountfd set, except for the case of cgroup v1's /sys/fs/cgroup
which is internally transformed into a bunch of bind mounts.

This is a minimal fix for the issue, suitable for backporting.

A test case is added which reproduces the issue without the fix applied.

Fixes: 9c44407 ("Open bind mount sources from the host userns")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit d370e3c)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Kir Kolyshkin (1):
  libct: fix mounting via wrong proc fd

LGTMs: AkihiroSuda cyphar
Closes opencontainers#3511
The following failure was observed in CI (on centos-stream-8 in
integration-cgroup suite):

	not ok 42 runc delete
	 (from function `fail' in file tests/integration/helpers.bash, line 338,
	  in test file tests/integration/delete.bats, line 30)
	   `[ "$output" = "" ] || fail "cgroup not cleaned up correctly: $output"' failed
	....
	cgroup not cleaned up correctly: /sys/fs/cgroup/pids/system.slice/tmp-bats\x2drun\x2d68012-runc.IPOypI-state-testbusyboxdelete-runc.zriC8C.mount
	/sys/fs/cgroup/cpu,cpuacct/system.slice/tmp-bats\x2drun\x2d68012-runc.IPOypI-state-testbusyboxdelete-runc.zriC8C.mount
	...

Apparently, this is a cgroup systemd creates for a mount unit which
appears then runc does internal /proc/self/exe bind-mount. The test
case should not take it into account.

The second problem with this test is it does not check that cgroup
actually exists when the container is running (so checking that it
was removed after makes less sense). For example, in rootless mode
the cgroup might not have been created.

Fix the find arguments to look for a specific cgroup name, and add
a check that these arguments are correct (i.e. the cgroup is found
when the container is running).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 728571c)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A couple of test cases in delete.bats check that a particular cgroup
exists (or doesn't exist) using find. This is now resulting in errors
like these:

        find: ‘/sys/fs/cgroup/blkio/azsec’: Permission denied
        find: ‘/sys/fs/cgroup/blkio/azsec_clamav’: Permission denied
        find: ‘/sys/fs/cgroup/cpu,cpuacct/azsec’: Permission denied
        find: ‘/sys/fs/cgroup/cpu,cpuacct/azsec_clamav’: Permission denied
        find: ‘/sys/fs/cgroup/memory/azsec’: Permission denied
        find: ‘/sys/fs/cgroup/memory/azsec_clamav’: Permission denied

leading to test case failures.

Apparently, GHA runs something else on a test box, so we get this.

To fix, ignore non-zero exit code from find, and redirect its stderr
to /dev/null.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a partial backport of commit 6e1d476 from main branch.

Instead of specifying path to criu binary, use whatever is found in
$PATH.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Older criu builds fail to work properly on CentOS Stream 9 due to
changes in glibc's rseq.

Skip criu tests if an older criu version is found.

Fixes: opencontainers#3532

Cherry picked from commit 4fd4af5.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Kir Kolyshkin (4):
  CI: workaround CentOS Stream 9 criu issue
  tests/int: don't use --criu
  [1.1] ci: fix delete.bats for GHA
  tests/int: runc delete: fix flake, enable for rootless

LGTMs: AkihiroSuda cyphar
Closes opencontainers#3538
Signed-off-by: guodong <guodong9211@gmail.com>
guodong (1):
  [1.1] libct/nsenter: switch to sane_kill()

LGTMs: AkihiroSuda cyphar
Closes opencontainers#3536
When starting a new container, and the very last step of executing of a
user process fails (last lines of (*linuxStandardInit).Init), it is too
late to print a proper error since both the log pipe and the init pipe
are closed.

This is partially mitigated by using exec.LookPath() which is supposed
to say whether we will be able to execute or not. Alas, it fails to do
so when the binary to be executed resides on a filesystem mounted with
noexec flag.

A workaround would be to use access(2) with X_OK flag. Alas, it is not
working when runc itself is a setuid (or setgid) binary. In this case,
faccessat2(2) with AT_EACCESS can be used, but it is only available
since Linux v5.8.

So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to
access(2) for non-setuid runc, and do nothing for setuid runc (as there
is nothing we can do). Note that this check if in addition to whatever
exec.LookPath does.

Fixes opencontainers#3520

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 957d97b)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
[1.1] Fix error from runc run on noexec fs
When golang 1.19 is used to build unit tests on 386, it fails like this:

 sudo -E PATH="$PATH" -- make GOARCH=386 CGO_ENABLED=1 localunittest
 <...>
 go test -timeout 3m -tags "seccomp"  -v ./...
 <...>
 # github.com/opencontainers/runc/libcontainer/capabilities.test
 runtime/cgo(.text): unknown symbol __stack_chk_fail_local in pcrel
 runtime/cgo(.text): unknown symbol __stack_chk_fail_local in pcrel
 runtime/cgo(.text): unknown symbol __stack_chk_fail_local in pcrel
 runtime/cgo(.text): unknown symbol __stack_chk_fail_local in pcrel
 runtime/cgo(.text): unknown symbol __stack_chk_fail_local in pcrel
 runtime/cgo(.text): relocation target __stack_chk_fail_local not defined
 runtime/cgo(.text): relocation target __stack_chk_fail_local not defined

The fix is to add CGO_CFLAGS=-fno-stack-protector.

See also:
 - docker-library/golang#426
 - https://go.dev/issue/52919
 - https://go.dev/issue/54313
 - https://go-review.googlesource.com/c/go/+/421935

Cherry picked from commit 589a9d5.

Conflict in .github/workflows/test.yml due to missing commit dafcacb.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Recently released codespell 2.2 adds some more false positives,
such as:

	./Makefile:78: ro ==> to, row, rob, rod, roe, rot
	./Makefile:88: ro ==> to, row, rob, rod, roe, rot
	./notify_socket.go:51: ro ==> to, row, rob, rod, roe, rot
	./LICENSE:128: complies ==> compiles
	./go.sum:59: BU ==> BY
	./types/features/features.go:17: ro ==> to, row, rob, rod, roe, rot
	./libcontainer/rootfs_linux.go:52: ro ==> to, row, rob, rod, roe, rot
	./libcontainer/rootfs_linux.go:166: ro ==> to, row, rob, rod, roe, rot
	....
	./tests/integration/cgroup_delegation.bats:38: inh ==> in
	...

To fix:
 - exclude go.sum;
 - add ro and complies to the list of ignored words;
 - s/inh/inherit in cgroup_delegation.bats.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit df9e32b)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A regression reported for runc v1.1.3 says that "runc exec -t" fails
after doing "systemctl daemon-reload":

> exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown

Apparently, with commit 7219387 we are no longer adding
"DeviceAllow=char-pts rwm" rule (because os.Stat("char-pts") returns
ENOENT).

The bug can only be seen after "systemctl daemon-reload" because runc
also applies the same rules manually (by writing to devices.allow for
cgroup v1), and apparently reloading systemd leads to re-applying the
rules that systemd has (thus removing the char-pts access).

The fix is to do os.Stat only for "/dev" paths.

Also, emit a warning that the path was skipped. Since the original idea
was to emit less warnings, demote the level to debug.

Note this also fixes the issue of not adding "m" permission for block-*
and char-* devices.

A test case is added, which reliably fails before the fix
on both cgroup v1 and v2.

This is a backport of commit 58b1374
to release-1.1 branch.

Fixes: opencontainers#3551
Fixes: 7219387
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
[1.1] Fix failed exec after systemctl daemon-reload (regression in 1.1.3)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Akihiro Suda (2):
  VERSION: back to development
  Release 1.1.4

LGTMs: kolyshkin thaJeztah cyphar
Closes opencontainers#3564
cyphar and others added 21 commits December 14, 2023 11:56
(This is a cherry-pick of 6fa8d06.)

Given we've had several bugs in this behaviour that have now been fixed,
add an integration test that makes sure that you can start a container
that joins all of the namespaces of a second container.

The only namespace we do not join is the mount namespace, because
joining a namespace that has been pivot_root'd leads to a bunch of
errors. In principle, removing everything from config.json that requires
a mount _should_ work, but the root.path configuration is mandatory and
we cannot just ignore setting up the rootfs in the namespace joining
scenario (if the user has configured a different rootfs, we need to use
it or error out, and there's no reasonable way of checking if if the
rootfs paths are the same that doesn't result in spaghetti logic).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a cherry-pick of ebcef3e.)

It turns out that the error added in commit 09822c3 ("configs:
disallow ambiguous userns and timens configurations") causes issues with
containerd and CRIO because they pass both userns mappings and a userns
path.

These configurations are broken, but to avoid the regression in this one
case, output a warning to tell the user that the configuration is
incorrect but we will continue to use it if and only if the configured
mappings are identical to the mappings of the provided namespace.

Fixes: 09822c3 ("configs: disallow ambiguous userns and timens configurations")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a cherry-pick of 482e563.)

Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:

  failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
  parsing id map failed: invalid format in line "         0          0 4294967295": integer overflow on token 4294967295

This issue was unearthed by commit 1912d59 ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.

In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
[1.1] *: fix several issues with userns path handling
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
lfbzhm (2):
  VERSION: back to development
  VERSION: release 1.1.11

LGTMs: AkihiroSuda cyphar
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Aleksa Sarai (2):
  keyring: update AkihiroSuda key expiry
  keyring: update cyphar@cyphar.com key expiry

LGTMs: AkihiroSuda lifubang
(This is a cherry-pick of 937ca10.)

Signed-off-by: hang.jiang <hang.jiang@daocloud.io>
Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If a file descriptor of a directory in the host's mount namespace is
leaked to runc init, a malicious config.json could use /proc/self/fd/...
as a working directory to allow for host filesystem access after the
container runs. This can also be exploited by a container process if it
knows that an administrator will use "runc exec --cwd" and the target
--cwd (the attacker can change that cwd to be a symlink pointing to
/proc/self/fd/... and wait for the process to exec and then snoop on
/proc/$pid/cwd to get access to the host). The former issue can lead to
a critical vulnerability in Docker and Kubernetes, while the latter is a
container breakout.

We can (ab)use the fact that getcwd(2) on Linux detects this exact case,
and getcwd(3) and Go's Getwd() return an error as a result. Thus, if we
just do os.Getwd() after chdir we can easily detect this case and error
out.

In runc 1.1, a /sys/fs/cgroup handle happens to be leaked to "runc
init", making this exploitable. On runc main it just so happens that the
leaked /sys/fs/cgroup gets clobbered and thus this is only consistently
exploitable for runc 1.1.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[refactored the implementation and added more comments]
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a partial backport of a minor change included in commit
dac4171.)

This mirrors the logic in standard_init_linux.go, and also ensures that
we do not call exec.LookPath in the final execve step.

While this is okay for regular binaries, it seems exec.LookPath calls
os.Getenv which tries to emit a log entry to the test harness when
running in "go test" mode. In a future patch (in order to fix
CVE-2024-21626), we will close all of the file descriptors immediately
before execve, which would mean the file descriptor for test harness
logging would be closed at execve time. So, moving exec.LookPath earlier
is necessary.

Ref: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If we leak a file descriptor referencing the host filesystem, an
attacker could use a /proc/self/fd magic-link as the source for execve
to execute a host binary in the container. This would allow the binary
itself (or a process inside the container in the 'runc exec' case) to
write to a host binary, leading to a container escape.

The simple solution is to make sure we close all file descriptors
immediately before the execve(2) step. Doing this earlier can lead to very
serious issues in Go (as file descriptors can be reused, any (*os.File)
reference could start silently operating on a different file) so we have
to do it as late as possible.

Unfortunately, there are some Go runtime file descriptors that we must
not close (otherwise the Go scheduler panics randomly). The only way of
being sure which file descriptors cannot be closed is to sneakily
go:linkname the runtime internal "internal/poll.IsPollDescriptor"
function. This is almost certainly not recommended but there isn't any
other way to be absolutely sure, while also closing any other possible
files.

In addition, we can keep the logrus forwarding logfd open because you
cannot execve a pipe and the contents of the pipe are so restricted
(JSON-encoded in a format we pick) that it seems unlikely you could even
construct shellcode. Closing the logfd causes issues if there is an
error returned from execve.

In mainline runc, runc-dmz protects us against this attack because the
intermediate execve(2) closes all of the O_CLOEXEC internal runc file
descriptors and thus runc-dmz cannot access them to attack the host.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We auto-close this file descriptor in the final exec step, but it's
probably a good idea to not possibly leak the file descriptor to "runc
init" (we've had issues like this in the past) especially since it is a
directory handle from the host mount namespace.

In practice, on runc 1.1 this does leak to "runc init" but on main the
handle has a low enough file descriptor that it gets clobbered by the
ForkExec of "runc init".

OPEN_TREE_CLONE would let us protect this handle even further, but the
performance impact of creating an anonymous mount namespace is probably
not worth it.

Also, switch to using an *os.File for the handle so if it goes out of
scope during setup (i.e. an error occurs during setup) it will get
cleaned up by the GC.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Given the core issue in GHSA-xr7r-f8xq-vfvv was that we were unknowingly
leaking file descriptors to "runc init", it seems prudent to make sure
we proactively prevent this in the future. The solution is to simply
mark all non-stdio file descriptors as O_CLOEXEC before we spawn "runc
init".

For libcontainer library users, this could result in unrelated files
being marked as O_CLOEXEC -- however (for the same reason we are doing
this for runc), for security reasons those files should've been marked
as O_CLOEXEC anyway.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We close the logfd before execve so there's no need to special case it.
In addition, it turns out that (*os.File).Fd() doesn't handle the case
where the file was closed and so it seems suspect to use that kind of
check.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This is a security fix for CVE-2024-21626. See the advisory[1] for more
details.

Aleksa Sarai (6):
  init: don't special-case logrus fds
  libcontainer: mark all non-stdio fds O_CLOEXEC before spawning init
  cgroup: plug leaks of /sys/fs/cgroup handle
  init: close internal fds before execve
  setns init: do explicit lookup of execve argument early
  init: verify after chdir that cwd is inside the container

Hang Jiang (1):
  Fix File to Close

[1]: GHSA-xr7r-f8xq-vfvv

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
LGTMs: cyphar AkihiroSuda kolyshkin lifubang
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Leandro Motta Barros <leandro@balena.io>
Change-type: patch
Signed-off-by: Leandro Motta Barros <leandro@balena.io>
Change-type: patch
@aethernet aethernet changed the title Rebase to upstream 1.1.12 patch: Rebase to upstream 1.1.12 Feb 2, 2024
panic(err)
}

logrus.SetLevel(logrus.Level(level))

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types High

Incorrect conversion of an integer with architecture-dependent bit size from
strconv.Atoi
to a lower bit size type uint32 without an upper bound check.
return nil, errors.New("invalid range: " + r)
}
for i := start; i <= end; i++ {
bits.SetBit(bits, int(i), 1)

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types High

Incorrect conversion of an unsigned 32-bit integer from
strconv.ParseUint
to a lower bit size type int without an upper bound check.
if err != nil {
return nil, err
}
bits.SetBit(bits, int(val), 1)

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types High

Incorrect conversion of an unsigned 32-bit integer from
strconv.ParseUint
to a lower bit size type int without an upper bound check.
@flowzone-app flowzone-app bot enabled auto-merge February 2, 2024 18:28
@aethernet
Copy link
Author

this is not the way to go

@aethernet aethernet closed this Feb 2, 2024
auto-merge was automatically disabled February 2, 2024 18:47

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet