Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable checkpoint/restore support (finally) #480

Merged
merged 3 commits into from Nov 9, 2020

Conversation

adrianreber
Copy link
Contributor

This pull requests adds the missing pieces of crun checkpoint/restore and enables it.

Missing pieces were, from my point of view, to make Podman's checkpoint/restore tests successful. With this Podman's checkpoint restore finishes successfully:

# ginkgo -v --tags exclude_graphdriver_btrfs --focus="checkpoint"  test/e2e/.
...
Ran 16 of 1076 Specs in 63.152 seconds
SUCCESS! -- 16 Passed | 0 Failed | 0 Pending | 1060 Skipped
PASS

The changes in this PR to enable this are:

  • Require at least CRIU 3.15, currently unreleased - using criu-dev branch for my tests
  • Correctly reconnect console socket after restore - this requires 3.15
  • Use cgroup freezer to stop all processes in the container instead of ptrace - this requires 3.15
  • Activate the checkpoint and restore command which were hidden and prefixed with _
    (as long as CRIU >= 3.15 is not detected during build-time, the subcommands will stay deactivated.)

Once CRIU 3.15 is released I plan to update crun's CI to make sure the checkpoint/restore tests are being run. Once crun with CRIU 3.15 has been built and once it hits Podman's CI the checkpoint/restore tests on crun based systems will be running automatically.

Just a draft to see what crun's CI thinks of this change.

CRIU can either freeze/pause all processes in the container by going
through a loop and using ptrace (default) or use the cgroup freezer to
pause all processes in a container. This changes crun to tell CRIU the
information about the used cgroup freezer.

This needs the not yet released CRIU 3.15 which should be released in
the next weeks.

Signed-off-by: Adrian Reber <areber@redhat.com>
With this change the Podman checkpoint/restore test suites passes on
crun based systems. With this CRIU can tell us the console socket of the
restored process which can then be passed to the upper layer (Podman).

Signed-off-by: Adrian Reber <areber@redhat.com>
Now that all functionality for the Podman checkpoint/restore test suite
has been implemented in crun, the 'checkpoint' and 'restore' commands
are no longer hidden (with the '_' prefix), but enabled to be used.

Checkpoint/restore support will not show up in crun unless CRIU 3.15 is
available (which has not yet been released).

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber
Copy link
Contributor Author

Not sure if the test oci-validation fails because of my changes.

@giuseppe
Copy link
Member

giuseppe commented Sep 7, 2020

Not sure if the test oci-validation fails because of my changes.

I've restarted them and they are green now. Thanks for working on it! The patch LGTM, if you are fine with it we can merge it.

@adrianreber adrianreber marked this pull request as ready for review September 10, 2020 07:35
@adrianreber
Copy link
Contributor Author

Removed draft status. This means that #71 could be closed. We could wait until CRIU 3.15 has been released and until I have made sure the corresponding CI tests are running to make sure we have test coverage.

@rhatdan
Copy link
Member

rhatdan commented Sep 29, 2020

Where are we on this one now?

@rhatdan
Copy link
Member

rhatdan commented Oct 25, 2020

@adrianreber @giuseppe Do we want to get this in?

@giuseppe
Copy link
Member

still waiting for CRIU to release 3.15.

Once this is in, I think we can release crun 1.0

static int
criu_notify (char *action, criu_notify_arg_t na)
{
if (strncmp (action, "orphan-pts-master", 17) == 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adrianreber do you think that inverting the logic of this check would make the function look better? For example:

if (strncmp (action, "orphan-pts-master", 17) != 0)
  return 0;

// function flow proceeds...

This way, the entire logic doesn't need to be nested. And it's similar to what is done within the if when checking console_socket, console_socket_fd, and ret.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I try to structure the code like you suggested. As the notify function potentially can handle different callback types it does not sound useful once there are multiple callbacks. Right now, however, I can changed the code flow. You are right. I will do that as I want to include additional tests for this PR anyway.

@adrianreber
Copy link
Contributor Author

adrianreber commented Nov 2, 2020

CRIU 3.15 has been prepared, the Makefile has been updated with the 3.15 version and I was told that the release will be tomorrow. I will make sure that the packages are available in Fedora and PPA as soon as possible to be able to update the CI to use 3.15.

@giuseppe
Copy link
Member

giuseppe commented Nov 9, 2020

now that ciu 3.15 is in testing, let's merge.

Thanks again for working on it!

@giuseppe giuseppe merged commit ce493f0 into containers:master Nov 9, 2020
@giuseppe giuseppe mentioned this pull request Nov 9, 2020
@adrianreber
Copy link
Contributor Author

I am just working on a PR to enable checkpoint/restore testing in Travis because that test needs to run as root and is currently skipped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants