criu couldn't checkpoint program state #1223

Closed
ghost opened this issue Oct 6, 2020 · 18 comments

@ghost
ghost commented Oct 6, 2020

Hello adrianreber,
I am using CRIU 3.14 and Docker 19.03.13 to checkpoint and restore my programs. Checkpoint and restore complete without any errors, but the program state is not preserved: when I resume the container, the program always starts again from the first step. Could this be related to my kernel version? I did not note which version I was using before, but it is now 4.4.0-190-generic on Ubuntu 16.04. Do I need to downgrade or upgrade the kernel? Thank you very much.

@adrianreber
Member

As mentioned in the other ticket, you might be doing it wrong. You should really tell us what exactly you are trying to do. Please include all the commands you are running.

@ghost
Author

ghost commented Oct 6, 2020

I apologize for the misunderstanding. I am trying to checkpoint a program running inside one container and resume it in another container. The commands below are what I ran previously, and at that time everything worked. However, when I tested again today, the state was not restored and the program started from the beginning in the second container.

docker run -it --name test pollen5005/dcgan:latest
docker logs test
docker checkpoint create --checkpoint-dir=/tmp test checkpoint1

docker create -it --name test1 pollen5005/dcgan:latest
mv /tmp/checkpoint1 /var/lib/docker/containers/$(docker ps -aq --no-trunc --filter name=test1)/checkpoints/
docker start --checkpoint=checkpoint1 test1
docker logs test1
(The logs showed that it did not resume from the checkpoint I created.)
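
The `mv` step above depends on `docker ps -aq --no-trunc --filter name=test1` returning exactly one full container ID. The `name` filter matches substrings, so it can return several IDs or none. Below is a minimal sketch of a guard for that step. The function name `checkpoint_dir_for` is hypothetical, and the path assumes Docker's default data root `/var/lib/docker`, as in the commands above.

```shell
# Print the checkpoint directory for a container.
# Fails unless the argument is exactly one full (64 hex character) container ID.
# Assumption: Docker's default data root, /var/lib/docker.
checkpoint_dir_for() {
    id=$1
    case "$id" in
        ""|*[!0-9a-f]*)
            # Empty, or contains something other than lowercase hex
            # (including the newline between two IDs).
            echo "expected one full container ID, got: '$id'" >&2
            return 1
            ;;
    esac
    if [ "${#id}" -ne 64 ]; then
        echo "expected a 64-character ID, got ${#id} characters" >&2
        return 1
    fi
    printf '/var/lib/docker/containers/%s/checkpoints/\n' "$id"
}
```

It could be used as `dest=$(checkpoint_dir_for "$(docker ps -aq --no-trunc --filter name=test1)") && mv /tmp/checkpoint1 "$dest"`, which stops before moving anything if the filter matched zero or several containers.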

@ghost
Author

ghost commented Oct 6, 2020

So I checked with the common example.
(Checkpoint)
docker run -d --name looper --security-opt seccomp:unconfined busybox \
    /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'

docker logs looper
(For instance, it printed
1 2 3 4 5 6 7 8 9)

docker checkpoint create looper checkpoint1

(Restore)
docker start --checkpoint checkpoint1 looper

docker logs looper
(For instance, it printed
1 2 3 4 5 6 7 8 9 1 2 3 4 ...)
(It should have continued with
10 11 12 13 14 15 16 ...)

Previously there was no such problem. It only started today.

@adrianreber
Member

Looking at our latest CI run, I think I see a similar error: https://travis-ci.org/github/checkpoint-restore/criu/jobs/732998447

I am not sure, though. In the CI test a Docker container is checkpointed and restored, and after that we run 'ps axf'. The result is almost always the same, so it seems the container is not started from the checkpoint but from the beginning.

Looking at our Podman CI run, this seems to be still working as expected. Please try CentOS 8.2 with Podman as described in https://criu.org/Podman to see if it works better for you.

I, myself, have never used docker's checkpoint/restore support, but maybe @avagin and @rst0git can have a look at our docker CI run to see if my analysis is correct that it indeed does not seem to be working.

@ghost
Author

ghost commented Oct 6, 2020

Thank you for your suggestions.
With Podman it works well, and I tested this on Ubuntu 16.04 too.

@adrianreber
Member

> Thank you for your suggestions.
> With Podman it works well, and I tested this on Ubuntu 16.04 too.

Good to hear.

@ghost
Author

ghost commented Oct 6, 2020

Thanks a lot, @adrianreber.
If there is an option that does not require Podman (only CRIU with Docker), I would appreciate suggestions from @avagin and @rst0git.
I really appreciate all of your help.

@rst0git
Member

rst0git commented Oct 6, 2020

@upc-distribution What is the output of the following commands?

docker run -d --name looper busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create looper checkpoint1
docker start --checkpoint checkpoint1 looper
docker logs looper

@ghost
Author

ghost commented Oct 8, 2020

@rst0git Hello, and apologies for the late reply.
The output of those commands is 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,...
Do you have any suggestions?

@rst0git
Member

rst0git commented Oct 11, 2020

Hi @upc-distribution, this issue was caused by commit moby/moby@d4c6372, which was included in v19.03.13. However, there is another issue with upstream checkpoint/restore in moby (moby/moby#41531), and I hope both issues will be fixed in the next release.

As a workaround, you can downgrade to v19.03.12 or an earlier version.

@adrianreber
Member

@rst0git Can you confirm that our docker tests are broken because of those bugs? Do we need to change our docker tests to catch errors like this? Maybe include the output of docker logs?
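
A check based on the output of `docker logs` could look for the symptom reported earlier in this thread: the looper counter dropping back to a lower value after restore. The following is a minimal sketch of such a check, not the project's actual test. The function name `counter_was_reset` is hypothetical, and it assumes the log contains one integer per line, as the busybox looper prints.

```shell
# Exit status 0 if the integers on stdin ever decrease (counter was reset),
# 1 if they never do. Lines that are not plain integers are ignored.
counter_was_reset() {
    awk '
        /^[0-9]+$/ {
            value = $1 + 0
            if (seen && value < prev) reset = 1
            prev = value
            seen = 1
        }
        END { exit (reset ? 0 : 1) }
    '
}
```

A test could then run something like `if docker logs looper | counter_was_reset; then echo "restore restarted from the beginning" >&2; exit 1; fi` after the restore step.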

@rst0git
Member

rst0git commented Oct 14, 2020

> Can you confirm that our docker tests are broken because of those bugs?

Yes

> Do we need to change our docker tests to catch errors like this? Maybe include the output of docker logs?

I think it would be better to fix the integration test in moby: moby/moby#38963

@Heming-Zhong

> Hi @upc-distribution, this issue was caused by commit moby/moby@d4c6372, which was included in v19.03.13. However, there is another issue with upstream checkpoint/restore in moby (moby/moby#41531), and I hope both issues will be fixed in the next release.
>
> As a workaround, you can downgrade to v19.03.12 or an earlier version.

Hello, it seems that Docker v19.03.12 has the same problem.

@rst0git
Member

rst0git commented Oct 21, 2020

Hi @Heming-Zhong, could you please try the following?

wget https://download.docker.com/linux/static/stable/x86_64/docker-19.03.12.tgz
tar -xvf docker-19.03.12.tgz
sudo cp docker/* /usr/bin/

@Heming-Zhong

> Hi @Heming-Zhong, could you please try the following?
>
> wget https://download.docker.com/linux/static/stable/x86_64/docker-19.03.12.tgz
> tar -xvf docker-19.03.12.tgz
> sudo cp docker/* /usr/bin/

Hello. I tried the build you linked and it works well. It turns out that the Docker server I originally had installed is 19.03.12-ce from the Manjaro community repos. Does that mean the "ce" version can't handle checkpoint/restore properly?

@buck2202

@Heming-Zhong
If you want a working downgrade through OS repositories, I think you also need to drop containerd to 1.2.x as implied by moby/moby#41531. The static build you downloaded above included a binary for containerd 1.2.13.

For what it's worth, a quick check of the combination of docker-ce 19.03.13 and containerd 1.2.13 from the download.docker.com repository seems able to restore. I checked docker down to 19.03.10, and no version could restore with containerd 1.3.7.
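
The observations above can be summarised as a small lookup, sketched below. It encodes only what is reported in this comment (containerd 1.2.x restores, 1.3.7 does not) and is not an official compatibility table. The function name `containerd_restore_status` is hypothetical, and treating every 1.3.x release like 1.3.7 is an assumption.

```shell
# Classify a containerd version string using only the reports in this thread.
# Prints "reported-working", "reported-broken", or "unknown".
containerd_restore_status() {
    version=${1#v}    # tolerate a leading "v", e.g. "v1.3.7"
    case "$version" in
        1.2.*) echo "reported-working" ;;
        1.3.*) echo "reported-broken" ;;   # assumption: only 1.3.7 was tested
        *)     echo "unknown" ;;
    esac
}
```

For example, `containerd_restore_status 1.2.13` prints `reported-working`, while any version not mentioned here is reported as `unknown` rather than guessed.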

@github-actions

github-actions bot commented Jan 5, 2021

A friendly reminder that this issue had no activity for 30 days.

@rst0git
Member

rst0git commented Feb 4, 2021

The problem has been resolved with containerd v1.5.0-beta.0.

wget https://github.com/containerd/containerd/releases/download/v1.5.0-beta.0/containerd-1.5.0-beta.0-linux-amd64.tar.gz -O - | sudo tar -xz -C /usr/bin/
