docker checkpoint an experimental feature #718

ashu-mehra · 2019-06-08T05:18:16Z

Does anyone know why docker checkpoint is an experimental feature and is not recommended for production use, as per the official doc: https://docs.docker.com/engine/reference/commandline/checkpoint/
Does it have anything to do with the stability of CRIU, is there a security concern, or is it something else?
And what can be done to make CRIU usable in production environments?

rppt · 2019-06-08T07:20:46Z

AFAIK, the experimental part is the integration of CRIU in docker. CRIU is used in production with other container engines, see e.g.
https://www.linuxplumbersconf.org/event/2/contributions/69/

cc @kolyshkin

adrianreber · 2019-06-08T19:56:59Z

Just yesterday my work on supporting container migration in Podman was merged (containers/podman#2272). This will be part of the upcoming Podman 1.4.0 release. Maybe this is something you could try out.

I just added some documentation to the CRIU wiki concerning CRIU's Podman integration at: https://criu.org/Podman

As I am actively working on Podman's checkpoint/restore support and container migration support I would be interested to know if this is something you could use.

ashu-mehra · 2019-06-19T12:35:51Z

@adrianreber thanks for providing the links.
I understand live migration is one of the main use cases for CRIU, but we are actually looking at using it for startup improvements, especially for JVM-based applications, i.e. create a checkpoint once the application's startup phase is over, and then for each new instance of the application just restore it from the checkpoint.
Have you looked into that use case, or do you know if anyone is actively looking into it? Are there any (potential) problems (like security concerns, or usability/functional issues) that you think may come up in using CRIU like this?

adrianreber · 2019-06-19T13:32:32Z

@ashu-mehra If you look at https://criu.org/Podman there is a link to a recording: https://asciinema.org/a/249922

In that recording I am basically doing what you are describing.

I am starting Wildfly once. That takes about 8 seconds. Then I am checkpointing it and restoring it multiple times from the checkpoint. Starting Wildfly from the checkpoint only takes 4 seconds.

In that example I can reduce container start up time by 50%.
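
The checkpoint-once, restore-many workflow described above can be sketched as a small Python helper that only builds the Podman command lines. This is a minimal sketch, not the setup used in the recording: the container name "wildfly" and the archive path are hypothetical, and it assumes a Podman version with checkpoint/restore support (1.4.0 or later for --export/--import), usually run as root.

```python
import subprocess
from typing import List, Optional


def checkpoint_cmd(container: str, export_path: Optional[str] = None) -> List[str]:
    """Build the argv for checkpointing a running container."""
    cmd = ["podman", "container", "checkpoint"]
    if export_path is not None:
        # --export writes the checkpoint to a tar archive that can be copied elsewhere.
        cmd.append(f"--export={export_path}")
    cmd.append(container)
    return cmd


def restore_cmd(container: Optional[str] = None,
                import_path: Optional[str] = None) -> List[str]:
    """Build the argv for restoring, either in place or from an exported archive."""
    cmd = ["podman", "container", "restore"]
    if import_path is not None:
        # --import restores from an archive, so no container name is needed.
        cmd.append(f"--import={import_path}")
    elif container is not None:
        cmd.append(container)
    else:
        raise ValueError("either container or import_path is required")
    return cmd


def run(cmd: List[str]) -> None:
    """Execute a prepared command; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    print(" ".join(checkpoint_cmd("wildfly", "/tmp/wildfly-checkpoint.tar.gz")))
    print(" ".join(restore_cmd(import_path="/tmp/wildfly-checkpoint.tar.gz")))
```

Keeping command construction separate from execution makes it possible to review the exact invocation before running it on a host where CRIU is installed.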

There has been a FOSDEM talk about including CRIU into the JVM which also tries to reduce start up time: https://fosdem.org/2019/schedule/event/checkpoint_restore/

The biggest concern with the JVM is that the restored JVM somehow has to detect that it is running in a new environment on a new system and adapt itself to it (e.g. number of CPUs, number of GC threads, hostname).
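
The environment-detection problem can be illustrated outside the JVM with a short Python sketch: record a few host facts before the checkpoint and compare them after the restore. The function names are hypothetical, and how a process learns that it has just been restored is deliberately left out; this only shows the comparison step.

```python
import os
import socket
from typing import Dict, List


def environment_snapshot() -> Dict[str, object]:
    """Collect host facts that a long-running process may have cached at startup."""
    return {
        "cpu_count": os.cpu_count(),
        "hostname": socket.gethostname(),
    }


def changed_settings(before: Dict[str, object], after: Dict[str, object]) -> List[str]:
    """Return the names of the facts that differ between two snapshots."""
    return sorted(key for key in before if before[key] != after.get(key))


if __name__ == "__main__":
    before = environment_snapshot()   # taken before the checkpoint
    after = environment_snapshot()    # taken again after the restore
    for name in changed_settings(before, after):
        # A real application would resize thread pools, refresh cached names, etc.
        print(f"{name} changed: {before[name]!r} -> {after[name]!r}")
```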

chflood · 2019-06-19T19:07:09Z

@ashu-mehra

I have a prototype of a Java API to allow you to call CRIU from Java.
You can see it here: https://github.com/chflood/CRIUForJava/

Comments welcome

Christine

kirs · 2019-06-24T22:01:53Z

@adrianreber super excited to see your work on podman pull + podman container restore accepting a snapshot that may have been made on another host.

Is there anything similar planned for Docker? From what I understand, docker start --checkpoint only works on the same host as where the snapshot was collected. Is there a workaround?

adrianreber · 2019-06-25T04:37:22Z

@kirs I am not familiar with Docker's checkpoint/restore implementation, so I cannot say whether there are any plans to continue working on it.

Just for the record: podman pull is not really required before doing podman container restore. That will happen automatically.

ashu-mehra · 2019-07-05T05:51:48Z

@adrianreber @chflood
Thanks for your input on the startup use case for CRIU. We have had some internal discussions on using CRIU to improve the startup performance of applications running in containers, and some points were raised regarding the security of restored applications:

Since every instance of the application restored from the checkpoint would be running with the same address map, would that nullify the benefit of address space layout randomization, thereby making the application vulnerable?
The checkpoint can potentially contain application secrets and keys, depending on when the checkpointing is done, which is again a potential security issue.
How would random number generators behave in restored applications? Would restoring make them predictable?

It would be good to hear the view of the community on these issues, and whether something can or needs to be done.
Thanks again!

adrianreber · 2019-07-05T06:04:42Z

@ashu-mehra as you are explicitly tagging me in your question, I will answer, though probably not very helpfully. I guess that all your security concerns are valid. Depending on the control you have over your application you could make sure that secrets are removed from the memory before doing the checkpoint and you could reseed your random generation algorithms after the restore.
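
Both mitigations can be sketched in Python. This is a minimal illustration under stated assumptions: a restore is simulated by copying a generator's state with setstate, the helper names are hypothetical, and the seedable random module stands in for any userspace generator (it is not a cryptographic one). Zeroing a bytearray only clears that one buffer; other copies of the secret may still exist in the process memory.

```python
import os
import random


def wipe(secret: bytearray) -> None:
    """Overwrite a mutable secret in place before the checkpoint is taken."""
    for i in range(len(secret)):
        secret[i] = 0


def reseed(rng: random.Random) -> None:
    """Reseed a userspace generator from the kernel after a restore."""
    rng.seed(os.urandom(32))


def simulate_restore(state: tuple) -> random.Random:
    """Stand-in for a restore: a new generator carrying the checkpointed state."""
    rng = random.Random()
    rng.setstate(state)
    return rng


if __name__ == "__main__":
    checkpointed_state = random.Random(1234).getstate()

    first = simulate_restore(checkpointed_state)
    second = simulate_restore(checkpointed_state)
    # Without reseeding, both restored instances produce the same sequence.
    print(first.getrandbits(64) == second.getrandbits(64))

    reseed(first)
    reseed(second)
    # After reseeding, the sequences are independent of the checkpoint.
    print(first.getrandbits(64) == second.getrandbits(64))
```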

ashu-mehra · 2019-07-05T06:20:05Z

@adrianreber - thanks for quick response.

> Depending on the control you have over your application you could make sure that secrets are removed from the memory before doing the checkpoint and you could reseed your random generation algorithms after the restore.

Makes sense. We had similar thoughts on tackling these, which is why I mentioned "depending on when the checkpointing is done".
Can the loss of address space layout randomization be addressed in some way? One workaround could be to recreate the checkpoint periodically. Any other thoughts?

LeBovin · 2020-05-10T15:11:21Z

Do we have any details about how experimental it is?

adrianreber · 2020-05-10T17:01:06Z

> Do we have any details about how experimental it is?

I am not aware of anyone working on it actively. At least, as far as I know, no one has contacted the CRIU community about it. As I implemented the checkpoint/restore feature for Podman, I never really looked into the Docker code, so I do not know any details. Neither on the GitHub bug tracker nor on the CRIU mailing list has there been any communication concerning the Docker checkpoint integration, only users having problems with it. If you need checkpoint/restore, please try out Podman's checkpoint/restore support.

I am closing this ticket, as we have not been able to answer this question here for almost a year, so we will probably not have an answer any time soon. Podman's checkpoint/restore support is not marked experimental, and if there are any problems I am happy to help.

avagin added the docker label Jul 22, 2019

adrianreber closed this as completed May 10, 2020
