top: do not depend on ps(1) in container
This ended up more complicated than expected. Let's start with the
problem to show why I am doing this:

Currently we simply execute ps(1) in the container. This has some
drawbacks. First, obviously you need to have ps(1) in the container
image. That is not always the case, especially in small images. Second,
even if you do, it will often be only busybox's ps, which supports far
fewer options.

Now we also have psgo, which is used by default, but that only supports a
small subset of ps(1) options. Implementing all options there is way too
much work.

Docker, on the other hand, executes ps(1) directly on the host and tries
to filter pids with `-q`, an option which is not supported by busybox's
ps and conflicts with other ps(1) arguments. That means they fall back
to full ps(1) on the host and then filter based on the pid in the
output. This is kinda ugly and falls short because users can modify the
ps output and it may not even include the pid in the output, which causes
an error.

So every solution has a different drawback, but what if we can combine
them somehow?! This commit tries exactly that.

We use ps(1) from the host and execute that in the container's pid
namespace.
There are some security concerns that must be addressed:
- mount the executable paths for ps and podman itself readonly to
  prevent the container from overwriting them via /proc/self/exe.
- set NO_NEW_PRIVS, SET_DUMPABLE and PDEATHSIG
- close all non std fds to prevent leaking files that the caller had
  open
- unset all environment variables to not leak any into the container
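
The prctl and fd-closing steps from the list above can be sketched in C roughly as follows. This is a minimal illustration and not the code from this commit: the helper names `close_fds_in_range` and `harden_process` are hypothetical, and clearing the dumpable flag (value 0) and using SIGKILL as the parent-death signal are assumptions about the intent.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Hypothetical helper: close every descriptor in [first, limit). */
void
close_fds_in_range (int first, int limit)
{
  /* stdin, stdout and stderr must stay open */
  assert (first >= 3);
  for (int fd = first; fd < limit; fd++)
    close (fd); /* EBADF for unused numbers is expected and ignored */
}

/* Hypothetical helper: apply the three prctl settings named in the list.
   Returns 0 on success and -1 on the first failure (errno is set). */
int
harden_process (void)
{
  /* never gain privileges through setuid binaries or file capabilities */
  if (prctl (PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
    return -1;
  /* assumption: clear the dumpable flag so the process is harder to inspect */
  if (prctl (PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
    return -1;
  /* assumption: deliver SIGKILL to this process when its parent dies */
  if (prctl (PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) < 0)
    return -1;
  return 0;
}
```

A caller would run both helpers before entering the container's pid namespace and executing ps, passing whatever upper bound on open descriptors is appropriate for the process.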

Technically this could be a breaking change if somebody does not
have ps on the host and only in the container, but I find that very
unlikely. We still have the exec in container fallback.

Because this can be insecure when the container has CAP_SYS_PTRACE, we
still only use the podman exec version in that case.

This updates the docs accordingly. Note that podman pod top never falls
back to executing ps in the container, as this makes no sense with
multiple containers, so I fixed the docs there as well.

Fixes #19001
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2215572

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Luap99 authored and ashley-cui committed Jul 20, 2023
1 parent 2551112 commit 574b782
Showing 10 changed files with 359 additions and 22 deletions.
4 changes: 3 additions & 1 deletion docs/source/markdown/podman-pod-top.1.md.in
@@ -7,7 +7,9 @@ podman\-pod\-top - Display the running processes of containers in a pod
**podman pod top** [*options*] *pod* [*format-descriptors*]

## DESCRIPTION
Display the running processes of containers in a pod. The *format-descriptors* are ps (1) compatible AIX format descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities of a given process. The descriptors can either be passed as separate arguments or as a single comma-separated argument. Note that if additional options of ps(1) are specified, Podman falls back to executing ps with the specified arguments and options in the container.
Display the running processes of containers in a pod. The *format-descriptors* are ps (1) compatible AIX format
descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities
of a given process. The descriptors can either be passed as separate arguments or as a single comma-separated argument.

## OPTIONS

11 changes: 9 additions & 2 deletions docs/source/markdown/podman-top.1.md.in
@@ -9,7 +9,14 @@ podman\-top - Display the running processes of a container
**podman container top** [*options*] *container* [*format-descriptors*]

## DESCRIPTION
Display the running processes of the container. The *format-descriptors* are ps (1) compatible AIX format descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities of a given process. The descriptors can either be passed as separated arguments or as a single comma-separated argument. Note that options and or flags of ps(1) can also be specified; in this case, Podman falls back to executing ps with the specified arguments and flags in the container. Please use the "h*" descriptors to extract host-related information. For instance, `podman top $name hpid huser` to display the PID and user of the processes in the host context.
Display the running processes of the container. The *format-descriptors* are ps (1) compatible AIX format
descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities
of a given process. The descriptors can either be passed as separated arguments or as a single comma-separated
argument. Note that options and or flags of ps(1) can also be specified; in this case, Podman falls back to
executing ps(1) from the host with the specified arguments and flags in the container namespace. If the container
has the `CAP_SYS_PTRACE` capability then we will execute ps(1) in the container so it must be installed there.
Please use the "h*" descriptors to extract host-related information. For instance, `podman top $name hpid huser`
to display the PID and user of the processes in the host context.

## OPTIONS

@@ -90,7 +97,7 @@ PID SECCOMP COMMAND %CPU
8 filter vi /etc/ 0.000
```

Podman falls back to executing ps(1) in the container if an unknown descriptor is specified.
Podman falls back to executing ps(1) from the host in the container namespace if an unknown descriptor is specified.

```
$ podman top -l -- aux
81 changes: 81 additions & 0 deletions libpod/container_top_linux.c
@@ -0,0 +1,81 @@
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* keep special_exit_code in sync with container_top_linux.go */
int special_exit_code = 255;
char **argv = NULL;

void
create_argv (int len)
{
  /* allocate one extra element because we need a final NULL in c */
  argv = malloc (sizeof (char *) * (len + 1));
  if (argv == NULL)
    {
      fprintf (stderr, "failed to allocate ps argv");
      exit (special_exit_code);
    }
  /* add final NULL */
  argv[len] = NULL;
}

void
set_argv (int pos, char *arg)
{
  argv[pos] = arg;
}

/*
  We use cgo code here so we can fork then exec separately,
  this is done so we can mount proc after the fork because the pid namespace is
  only active after spawning children.
*/
void
fork_exec_ps ()
{
  int r, status = 0;
  pid_t pid;

  if (argv == NULL)
    {
      fprintf (stderr, "argv not initialized");
      exit (special_exit_code);
    }

  pid = fork ();
  if (pid < 0)
    {
      fprintf (stderr, "fork: %m");
      exit (special_exit_code);
    }
  if (pid == 0)
    {
      r = mount ("proc", "/proc", "proc", 0, NULL);
      if (r < 0)
        {
          fprintf (stderr, "mount proc: %m");
          exit (special_exit_code);
        }
      /* use execve to unset all env vars, we do not want to leak anything into the container */
      execve (argv[0], argv, NULL);
      fprintf (stderr, "execve: %m");
      exit (special_exit_code);
    }

  r = waitpid (pid, &status, 0);
  if (r < 0)
    {
      fprintf (stderr, "waitpid: %m");
      exit (special_exit_code);
    }
  if (WIFEXITED (status))
    exit (WEXITSTATUS (status));
  if (WIFSIGNALED (status))
    exit (128 + WTERMSIG (status));
  exit (special_exit_code);
}
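
To make the fork, exec and exit-status handling above easier to try in isolation, here is a minimal standalone sketch. It is not part of the commit: `status_to_exit_code`, `run_and_wait` and `SPECIAL_EXIT_CODE` are hypothetical names, the proc mount is left out because it needs privileges, and the helper returns the exit code instead of calling exit so it can be called repeatedly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical constant mirroring special_exit_code from the file above. */
enum { SPECIAL_EXIT_CODE = 255 };

/* Translate a waitpid() status into a shell-style exit code. */
int
status_to_exit_code (int status)
{
  if (WIFEXITED (status))
    return WEXITSTATUS (status);
  if (WIFSIGNALED (status))
    return 128 + WTERMSIG (status);
  return SPECIAL_EXIT_CODE;
}

/* Hypothetical helper: fork, exec argv[0] with an empty environment and
   return the child's exit code. */
int
run_and_wait (char *const argv[])
{
  int status = 0;
  pid_t pid;
  char *const empty_env[] = { NULL };

  assert (argv != NULL && argv[0] != NULL);

  pid = fork ();
  if (pid < 0)
    return SPECIAL_EXIT_CODE;
  if (pid == 0)
    {
      /* an empty envp means no variable leaks into the new program */
      execve (argv[0], argv, empty_env);
      /* only reached when execve failed */
      _exit (SPECIAL_EXIT_CODE);
    }
  if (waitpid (pid, &status, 0) < 0)
    return SPECIAL_EXIT_CODE;
  return status_to_exit_code (status);
}
```

With this sketch, running `/bin/sh -c "exit 7"` yields 7, while a path that cannot be executed yields the special exit code 255, which is presumably why the Go side needs to know that value.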