Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow CRIU to be used as non-root with CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN #1155

Closed
wants to merge 222 commits into from

Conversation

adrianreber
Copy link
Member

With the almost finished discussion about CAP_CHECKPOINT_RESTORE I reworked my test code to be ready for review.

This contains the changes to run CRIU as non-root with either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. As Travis does not yet support CAP_CHECKPOINT_RESTORE I am using CAP_SYS_ADMIN during Travis testing, but locally it works just the same with CAP_CHECKPOINT_RESTORE.

This PR checks if the euid is non-zero and if it is non-zero it checks if CRIU has either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. If that is true CRIU remembers the euid and decides at different places in the code to not do certain things which do not work as non-root.

This is a minimal start of non-root CRIU and it can be used with zdtm to test with env00. As seen in the patch.

This is based on @nviennot and my test branches for the lkml CAP_CHECKPOINT_RESTORE discussions.

test/zdtm.py Outdated
@@ -2604,6 +2620,9 @@ def clean_stuff(opts):
help="Use splice or read mode of pre-dumping",
choices=['splice', 'read'],
default='splice')
rp.add_argument("--non-root",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have the --user option. How is this one different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all. I was not aware of --user. I have updated this PR to use --user instead of --non-root.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename --user to --non-root? user could mean anything. (criU, user namespaces). Not at all obvious of what it would do. --unprivileged is pretty excplicit too. I'd argue that Adrian might have seen the option if the naming of the option was unambiguous. Our users might miss it too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the history behind --user? I see now that is has been used in one of the jenkins jobs which I am not familiar with.

Is --user already testing exactly the same I wanted to test with this PR? Running CRIU with CAP_CHECKPOINT_RESTORE/CAP_SYS_ADMIN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the history behind --user?

It was Pavel's attempt to implement criu dump that can't be executed by non-root users to dump their processes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I suggest --rootless? It is an established term and the meaning is very clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first patch series was using --non-root, now I am currently using --user. --rootless is indeed a good idea, but I do not really care which name it is.

Should we re-use --user or give it a new name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of the --user option is to allow non-root users dumping processes even when criu doesn't have any capabilities. We want to save this functionality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed my changes to --rootless.

val = getenv("ZDTM_GROUPS");
if (val) {
char *tok = NULL;
unsigned int size = 0, groups[NGROUPS_MAX];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: NGROUPS_MAX is 65536
Isn't it will be better to allocate memory from heap?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just me moving code behind an if if running as non-root. I have not changed the actual code. If this should be changed it should happen in another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just me moving code behind an if if running as non-root. I have not changed the actual code. If this should be changed it should happen in another PR.

no problem. It's just comment.

torvalds added a commit to torvalds/linux that referenced this pull request Aug 4, 2020
…cm/linux/kernel/git/brauner/linux

Pull checkpoint-restore updates from Christian Brauner:
 "This enables unprivileged checkpoint/restore of processes.

  Given that this work has been going on for quite some time the first
  sentence in this summary is hopefully more exciting than the actual
  final code changes required. Unprivileged checkpoint/restore has seen
  a frequent increase in interest over the last two years and has thus
  been one of the main topics for the combined containers &
  checkpoint/restore microconference since at least 2018 (cf. [1]).

  Here are just the three most frequent use-cases that were brought forward:

   - The JVM developers are integrating checkpoint/restore into a Java
     VM to significantly decrease the startup time.

   - In high-performance computing environment a resource manager will
     typically be distributing jobs where users are always running as
     non-root. Long-running and "large" processes with significant
     startup times are supposed to be checkpointed and restored with
     CRIU.

   - Container migration as a non-root user.

  In all of these scenarios it is either desirable or required to run
  without CAP_SYS_ADMIN. The userspace implementation of
  checkpoint/restore CRIU already has the pull request for supporting
  unprivileged checkpoint/restore up (cf. [2]).

  To enable unprivileged checkpoint/restore a new dedicated capability
  CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
  discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
  "Update on Task Migration at Google Using CRIU") with Adrian and
  Nicolas providing the implementation now over the last months. In
  essence, this allows the CRIU binary to be installed with the
  CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
  unprivileged users to restore processes.

  To make this possible the following permissions are altered:

   - Selecting a specific PID via clone3() set_tid relaxed from userns
     CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
     from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Accessing /proc/pid/map_files relaxed from init userns
     CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

   - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
     CAP_CHECKPOINT_RESTORE.

  Of these four changes the /proc/self/exe change deserves a few words
  because the reasoning behind even restricting /proc/self/exe changes
  in the first place is just full of historical quirks and tracking this
  down was a questionable version of fun that I'd like to spare others.

  In short, it is trivial to change /proc/self/exe as an unprivileged
  user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
  or by simply intercepting the elf loader in userspace during exec.
  Nicolas was nice enough to even provide a POC for the latter (cf. [3])
  to illustrate this fact.

  The original patchset which introduced PR_SET_MM_MAP had no
  permissions around changing the exe link. They too argued that it is
  trivial to spoof the exe link already which is true. The argument
  brought up against this was that the Tomoyo LSM uses the exe link in
  tomoyo_manager() to detect whether the calling process is a policy
  manager. This caused changing the exe links to be guarded by userns
  CAP_SYS_ADMIN.

  All in all this rather seems like a "better guard it with something
  rather than nothing" argument which imho doesn't qualify as a great
  security policy. Again, because spoofing the exe link is possible for
  the calling process so even if this were security relevant it was
  broken back then and would be broken today. So technically, dropping
  all permissions around changing the exe link would probably be
  possible and would send a clearer message to any userspace that relies
  on /proc/self/exe for security reasons that they should stop doing
  this but for now we're only relaxing the exe link permissions from
  userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

  There's a final uapi change in here. Changing the exe link used to
  accidently return EINVAL when the caller lacked the necessary
  permissions instead of the more correct EPERM. This pr contains a
  commit fixing this. I assume that userspace won't notice or care and
  if they do I will revert this commit. But since we are changing the
  permissions anyway it seems like a good opportunity to try this fix.

  With these changes merged unprivileged checkpoint/restore will be
  possible and has already been tested by various users"

[1] LPC 2018
     1. "Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
     2. "Securely Migrating Untrusted Workloads with CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
     LPC 2019
     1. "CRIU and the PID dance"
         https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
     2. "Update on Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s

[2] checkpoint-restore/criu#1155

[3] https://github.com/nviennot/run_as_exe

* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  prctl: exe link permission error changed from -EINVAL to -EPERM
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  pid: use checkpoint_restore_ns_capable() for set_tid
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
Copy link
Member

@nviennot nviennot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We need solid testing without CAP_SYS_ADMIN, where we run with just with CAP_CHECKPOINT_RESTORE and CAP_PTRACE, as advertised. Otherwise, we can't develop properly. This may require us to use a custom kernel (could be compiled and put as a binary blob in some repo we own) that we run with qemu.

  2. CLI options: should we require users to manually disable features (like network), or automatically disable features when we see we don't have the right capability. Another way to look at this is to be tolerant as much as we can to EPERM errors.

  3. CRIU should not look at uid. Rather, it should look at its effective capabilities (and perhaps whether it is running in a user namespace, as it makes all the capable(CAP_*) tests fail in the kernel, which matters for things like SO_SNDBUFFORCE).

  4. At some point, we should motivate the docker/kubernetes people to add CAP_CHECKPOINT_RESTORE in the set of bounded capabilities (so users don't have to add it manually).

I can help

criu/cr-check.c Outdated
* CAP_CHECKPOINT_RESTORE and CAP_PTRACE.
*/

fd = open(criu, O_RDONLY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good idea to check for the caps of the CRIU binary. Even if the binary has some setcap magic bits on it, we might not have the effective capabilities we want, because the bounding capabilities may be restricted. (For example, in kubernetes, one must specify the capabilities that can be activated within the container in the yaml file, regardless of the setcap attrs of the binaries in the image)

We should look at /proc/self/status and see what effective capabilities we have. Using the capget() syscall is another option. Depending if we have CAP_SYS_ADMIN, CAP_CHECKPOINT_RESTORE, or CAP_NET_ADMIN, we may want to do a variety of things.
Parsing /proc/self/status should be fairly easy, as there is already one parse_pid_status(), which parses the capabilities.

Replacing opts.uid with opts.caps would be helpful: instead of doing tests like if (opts.uid), we could do tests like if (opts.caps & CAP_NET_ADMIN)

XXX We might want to test if we are in a user namespace, and not in the root namespace. If that's the case, then a bunch of tests won't work though. If we are in a user namespace, all the kernel checks that are if (capable(CAP_*)) will return false. I'm not sure how to test if we are in a user_ns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

criu already should have a helper to test whether or not it is running a user namespace. If not, what should work is sm like:

inline static bool in_userns(void)
{
	__do_fclose FILE *f = NULL;
	uid_t user, host, count;
	int ret;

	/* Now: are we in a user namespace? Because then we're also
	 * unprivileged.
	 */
	f = fopen("/proc/self/uid_map", "re");
	if (!f)
		return false;

	ret = fscanf(f, "%u %u %u", &user, &host, &count);
	if (ret != 3)
		return false;

	return user != 0 || host != 0 || count != UINT32_MAX;
}

criu/cr-check.c Outdated
out_close:
close(fd);
if (exit_code) {
pr_msg(" CRIU needs to be run as root or needs to have the CAP_SYS_ADMIN or\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One call instead of three, and pr_msg -> pr_error. Also I wouldn't go with the two space of indent.

so:

pr_error(
"CRIU...\n"
"2nd line...\n"
"3rd line...\n");

* capabilities available. Now we have to make the process dumpable
* so that /proc/self is not owned by root.
*/
if (pr_set_dumpable(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With capabilities checks, we could either do set_dumpable() in all cases. If we wanted to put it behind an if, we could check if we have CAP_DAC_OVERRIDE

criu/fdstore.c Outdated
@@ -46,8 +49,16 @@ int fdstore_init(void)
return -1;
}

if (setsockopt(sk, SOL_SOCKET, SO_SNDBUFFORCE, &buf[0], sizeof(buf[0])) < 0 ||
setsockopt(sk, SOL_SOCKET, SO_RCVBUFFORCE, &buf[1], sizeof(buf[1])) < 0) {
if (opts.uid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check CAP_NET_ADMIN for this one but I'm not sure how this work in practice. The kernel checks if we have CAP_NET_ADMIN in the root namespace. This test can also succeed if we are not in a user namespace. I'm not sure how to test that we are in a user name space

Maybe it's better to try to use SO_SNDBUFFORCE, and if we can't due to -EPERM, we fallback to SO_SNDBUF.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we need to check CAP_NET_ADMIN here. uid can be 0 but setsockopt(SO_RCVBUFFORCE) will fail without CAP_NET_ADMIN.

criu/files.c Outdated
@@ -1323,6 +1323,8 @@ int prepare_fds(struct pstree_item *me)

static int fchroot(int fd)
{
if (opts.uid)
Copy link
Member

@nviennot nviennot Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe test CAP_SYS_CHROOT (here the kernel tests CAP_SYS_CHROOT in the current user namespace).

@@ -187,7 +187,7 @@ static int lsm_set_label(char *label, char *type, int procfd)
}

static int restore_creds(struct thread_creds_args *args, int procfd,
int lsm_type)
int lsm_type, uid_t uid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of things going in there that can fail if we have limited creds: . sys_setresuid() and friends, sys_prctl(PR_CAPBSET_DROP, etc.) / sys_capset(). SELinux things.

I think we need testing without CAP_SYS_ADMIN, and just CAP_CHECKPOINT_RESTORE. Can we boot a kernel that has the new cap on travis ci via qemu? Otherwise, I'm afraid we are not going to be effective in writing this non-root functionality.

@adrianreber
Copy link
Member Author

1. We need solid testing without CAP_SYS_ADMIN, where we run with just with CAP_CHECKPOINT_RESTORE and CAP_PTRACE, as advertised. Otherwise, we can't develop properly. This may require us to use a custom kernel (could be compiled and put as a binary blob in some repo we own) that we run with qemu.

Thanks to @brauner the CAP_CHECKPOINT_RESTORE changes have been merged and should be available with RC1 (torvalds/linux@74858ab). As we are already using Vagrant to test the latest Fedora release we can easily add another Vagrant based test and install the latest RC1 kernel via https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories which provides all RC releases:

https://repos.fedorapeople.org/repos/thl/kernel-vanilla-mainline/fedora-32/x86_64/

2. CLI options: should we require users to manually disable features (like network), or automatically disable features when we see we don't have the right capability. Another way to look at this is to be tolerant as much as we can to EPERM errors.

Any reason to not automatically disable features if we already detect that it will not work?

3. CRIU should not look at uid. Rather, it should look at its effective capabilities (and perhaps whether it is running in a user namespace, as it makes all the `capable(CAP_*)` tests fail in the kernel, which matters for things like `SO_SNDBUFFORCE`).

Sounds good. Like you mentioned above looking at /proc/self/status sounds like a good approach.

4. At some point, we should motivate the docker/kubernetes people to add CAP_CHECKPOINT_RESTORE in the set of bounded capabilities (so users don't have to add it manually).

I can help

Thanks. In your review you mentioned a lot of different capabilities we should test and this sounds like a good approach. I would start with reading /proc/self/status and only focus on CAP_CHECKPOINT_RESTORE for now before looking at what other capabilities are needed. This could be done in follow up pull requests.

@nviennot
Copy link
Member

nviennot commented Aug 5, 2020

  1. Wonderful

Any reason to not automatically disable features if we already detect that it will not work?

Sometimes, failing can be preferable:

  1. Let say that one is using CRIU for some time with some settings that they like in a managed kubernetes environment.
  2. The hosting provider changes the kubernetes environment, changing the default capabilities that a container has
  3. CRIU will then skip certain things to checkpoint (or restore) as a result. However, the user will not know about the change of behavior. This can potentially lead to an silent application failure.

So I'd argue that we want a deterministic behavior given a set of CLI options.

I see two approaches:
a) We can skip individual features: So we can throw errors like "CAP_SYS_ADMIN is not detected. Please retry with --skip-lsm --skip-X -skip-Y" where X and Y are features that need CAP_SYS_ADMIN to work.
b) We can supply the list of capabilities that should not be used by CRIU via the CLI, e.g. "--no-cap-sys-admin". With the list of caps that we have and don't have, we can disables all non-compatible features on boot.

b) is a superset of a) and would be easy to use for users. Most importantly, the behavior of CRIU is deterministic given a list of CLI options.

criu/cr-check.c Outdated Show resolved Hide resolved
criu/cr-check.c Outdated
* CAP_CHECKPOINT_RESTORE and CAP_PTRACE.
*/

fd = open(criu, O_RDONLY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

criu already should have a helper to test whether or not it is running a user namespace. If not, what should work is sm like:

inline static bool in_userns(void)
{
	__do_fclose FILE *f = NULL;
	uid_t user, host, count;
	int ret;

	/* Now: are we in a user namespace? Because then we're also
	 * unprivileged.
	 */
	f = fopen("/proc/self/uid_map", "re");
	if (!f)
		return false;

	ret = fscanf(f, "%u %u %u", &user, &host, &count);
	if (ret != 3)
		return false;

	return user != 0 || host != 0 || count != UINT32_MAX;
}

@adrianreber
Copy link
Member Author

I created a vagrant based test using a 5.8.0 kernel for now. Once the Fedora vanilla kernel mainline repository picks up 5.9.0-rc1 the test should work. Right now I added the test more as a proof of concept. To see how it could look like.

https://travis-ci.org/github/checkpoint-restore/criu/jobs/715943265

criu/kerndat.c Outdated
@@ -829,6 +829,9 @@ static int kerndat_try_load_cache(void)
{
int fd, ret;

if (opts.uid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a comment why we can't load cache in this case.

criu/kerndat.c Show resolved Hide resolved
criu/namespaces.c Outdated Show resolved Hide resolved
@adrianreber
Copy link
Member Author

Test is now running on Fedora with 5.9.0-rc0 kernel and CAP_CHECKPOINT_RESTORE

@adrianreber
Copy link
Member Author

I rebased this and pushed a new version in which I tried to switch to using /proc/pid/status.

@adrianreber
Copy link
Member Author

Pushed again to fix a CI error with usernsd.

@adrianreber
Copy link
Member Author

@nviennot Any comments from your side to the current patches?

@@ -1334,6 +1334,9 @@ void rlimit_unlimit_nofile(void)
{
struct rlimit new;

if (opts.uid)
return;

Copy link
Member

@rst0git rst0git Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will cause pre-dump to fail for non-root users with the following error for processes with larger (e.g., 1GB) memory state.

Error (criu/page-pipe.c:120): page-pipe: Can't make pipe for page-pipe: Too many open files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adrianreber you marked it as resolved, but for me it is unclear how it was resolved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about pre-dump which is out of scope of this PR.

@rst0git
Copy link
Member

rst0git commented Sep 29, 2020

@adrianreber running pre-dump as non-root user fails with:

(00.023457) Error (criu/mem.c:44): BUG at criu/mem.c:44

@adrianreber
Copy link
Member Author

@rst0git What does happen if you run pre-dump as non-root before my patches? I am aware that my patches are only touching a small part of what needs to be done to have the complete code base ready for CAP_CHECKPOINT_RESTORE. But it at least provides a working checkpoint and restore implementation as non-root.

I would prefer an incremental approach on the criu-dev branch. If the patches in this PR are correct I would rather see this merged and then look at the pre-dump use case.

@rst0git
Copy link
Member

rst0git commented Sep 30, 2020

@rst0git What does happen if you run pre-dump as non-root before my patches?

@adrianreber I can confirm that the same error occurs before applying the patches in this PR and I agree with you that incremental approach on the criu-dev branch would be better.

So, here's the enhanced version of the first try.

Changes are:

1. The wrapper name is criu-ns instead of crns.py
2. The CLI is absolutely the same as for criu, since the script
   re-execl-s criu binary. E.g.
	   scripts/criu-ns dump -t 1234 ...
   just works
3. Caller doesn't need to care about substituting CLI options,
   instead, the scripts analyzes the command line and
   a) replaces -t|--tree argument with virtual pid __if__ the
      target task lives in another pidns
   b) keeps the current cwd (and root) __if__ switches to another
      mntns. A limitation applies here -- cwd path should be the
      same in target ns, no "smart path mapping" is performed. So
      this script is for now only useful for mntns clones (which
      is our main goal at the moment).

Signed-off-by: Pavel Emelyanov <xemul@virtuozzo.com>
Looks-good-to: Andrey Vagin <avagin@openvz.org>
This adds the function check_caps() which checks if CRIU is running
with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability
CRIU needs to do a minimal checkpoint and restore from it.

In addition helper functions are added to easily query for other
capability for enhanced checkpoint/restore support.

Signed-off-by: Adrian Reber <areber@redhat.com>
This commit enables checkpointing and restoring of applications as
non-root.

First goal was to enable checkpoint and restore of the env00 and
pthread00 test case.

This uses the information from opts.uid and opts.cap_eff to skip certain
code paths which do not work as non-root.

The kerndat file is saved in $HOME/.criu.kdat. As $HOME is usually not on
a tmpfs there is now a check for nodename and release to see if the system
has been rebooted since last writing the kerndat cache file and mark it
as stale.

Signed-off-by: Adrian Reber <areber@redhat.com>
This adds the non-root section and information about the parameter
--unprivileged to the man page.

Signed-off-by: Adrian Reber <areber@redhat.com>
This are the minimal changes to make zdtm.py successfully run the
env00 and pthread test case as non-root using the '--rootless' zdtm option.

Signed-off-by: Adrian Reber <areber@redhat.com>
Run env00 and pthread00 test as non-root as initial proof of concept.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber
Copy link
Member Author

Rebased to solve a merge conflict.

So what is still missing to merge this?

@@ -732,6 +732,9 @@ int dump_task_cgroup(struct pstree_item *item, u32 *cg_id, struct parasite_dump_
unsigned int n_ctls = 0;
struct cg_set *cs;

if (opts.uid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be "if (opts.unprivileged)" ?

@@ -989,6 +992,9 @@ int dump_cgroups(void)
CgroupEntry cg = CGROUP_ENTRY__INIT;
int ret = -1;

if (opts.uid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opts.unprivileged and we need to print a warning that cgroups are not dumped.

@@ -211,7 +211,8 @@ int prepare_inventory(InventoryEntry *he)
if (get_task_ids(&crt.i))
return -1;

he->has_root_cg_set = true;
if (!opts.uid)
he->has_root_cg_set = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very very suspicious. We set has_root_cg_set only if uid is zero, but we always call dump_task_cgroup. If here is any reasons for this, we need to write a comment here.

When I suggested to add the --unprivileged options, I mean that we will check opts.unprivileged instead of opts.uid.

* allow to write to KDAT_RUNDIR which usually is only writable by root.
* Let's write .criu.kdat file to the home directory for non-root cases.
*/
if (opts.uid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opts.unprivileged. opts.uid doesn't guarantee that criu has been executed with a full set of capabilities.

@@ -847,24 +880,48 @@ static int kerndat_try_load_cache(void)

close(fd);

/*
* For the non-root use case we use uname(2) to decide
* if a reboot happened.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can uname help up to detect a reboot? Do you mean a kernel update?

@adrianreber
Copy link
Member Author

Hello @avagin,

I think it would make more sense if you take over this PR. You have a much clearer understanding and vision about what needs to be done than myself. Just take what already exist and make the necessary changes. You can close this PR or modify it. Whatever works best for you.

@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@rst0git
Copy link
Member

rst0git commented Apr 12, 2022

The changes in this PR have been used to enable rootless checkpoint/restore support in OpenLiberty/open-liberty#20683

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet