Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable CRIU tests by not using overlayfs snapshotter #4708

Merged
merged 2 commits into from
Mar 19, 2021

Conversation

kzys
Copy link
Member

@kzys kzys commented Nov 9, 2020

While the issue hasn't been fixed in the kernel yet, we can workaround
the issue by not using overlayfs snapshotter.

Fixes #3930.

Signed-off-by: Kazuyoshi Kato katokazu@amazon.com

@k8s-ci-robot
Copy link

Hi @kzys. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kzys
Copy link
Member Author

kzys commented Nov 9, 2020

Hmm, let me check the kernel version on the GitHub Actions' Ubuntu.

@kzys
Copy link
Member Author

kzys commented Nov 9, 2020

Ubuntu 14.04 uses Linux fv-az54-676 5.4.0-1031-azure #32~18.04.1-Ubuntu SMP Tue Oct 6 10:03:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux.

@estesp
Copy link
Member

estesp commented Nov 9, 2020

I don't think this will work based on ongoing commentary in checkpoint-restore/criu#860 (see this comment). The upstream bug is confusing as it has been marked fixed at least once or twice, but is clearly still an issue.

The recommendation in checkpoint-restore/criu#1239 (comment) might be worth considering--should we run just the checkpoint/restore tests to make sure our code is still working by running a quick test with a different graphdriver?

@kzys
Copy link
Member Author

kzys commented Nov 10, 2020

Thanks @estesp. How about running the tests with multiple different snapshotters? criu may need devmappper, but I'm bit concerned to move containerd's tests from overlay to let's say devmapper entirely because of criu.

@estesp
Copy link
Member

estesp commented Nov 10, 2020

Yes, I think we can build a small action that only runs the criu tests with a native/devicemapper snapshotter and let the rest of the testsuite continue using overlay.

@kzys kzys force-pushed the enable-criu branch 6 times, most recently from 6bace7f to 4213735 Compare November 11, 2020 23:00
@kzys kzys changed the title Re-enable CRIU tests Re-enable CRIU tests by not using overlayfs snapshotter Nov 11, 2020
@kzys kzys force-pushed the enable-criu branch 2 times, most recently from a8a3388 to fd8bc31 Compare November 12, 2020 00:31
@kzys
Copy link
Member Author

kzys commented Nov 12, 2020

container_checkpoint_test.go:142: type with url : not found: unknown is coming from

v, err := typeurl.UnmarshalAny(r.Options)
.

@@ -49,7 +49,7 @@ func NewContainer(ctx context.Context, platform stdio.Platform, r *task.CreateTa
}

var opts options.Options
if r.Options != nil {
if r.Options != nil && r.Options.GetTypeUrl() != "" {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure this is the right place to fix this issue. Shouldn't it be handled by typeurl.UnmarshalAny()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crosbymichael can you provide any insight here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, i think this is fine. It should be handled by the caller because of the nil check. I'll have to think a bit on a better way to fix this in the lib but it shouldn't be common.

@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@containerd containerd deleted a comment from theopenlab-ci bot Nov 13, 2020
@kzys
Copy link
Member Author

kzys commented Nov 16, 2020

@estesp Could you take a look? Seems like CRIU has been broken.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 12, 2021

Build succeeded.

@estesp
Copy link
Member

estesp commented Mar 12, 2021

Sorry @kzys this needs a rebase due to the PR which moved *_test.go from the root being merged.

Also, thinking about the time cost of having yet another pass through make integration just to pick up the checkpoint tests:
What about renaming the last test in the checkpoint test file here to start with TestCheckpoint and then you could pass EXTRA_TESTFLAGS="-run TestCheckpoint" to your make integration and only have the handful of checkpoint tests run rather than adding 2-3 minutes to all of the runc+Linux integration combinations in CI?

@kzys
Copy link
Member Author

kzys commented Mar 12, 2021

@estesp Sounds good to me. Let me update the PR.

@kzys kzys force-pushed the enable-criu branch 2 times, most recently from 706a99c to 36afcb7 Compare March 12, 2021 18:09
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 12, 2021

Build succeeded.

@kzys
Copy link
Member Author

kzys commented Mar 12, 2021

Rebased! Not so sure why "CGroupsV2 and SELinux Integration" is failing here.

@mikebrow
Copy link
Member

Rebased! Not so sure why "CGroupsV2 and SELinux Integration" is failing here.

429 Too Many Requests. ... we're trying to work on this.. it's the new docker.io pull rate limits

@kzys
Copy link
Member Author

kzys commented Mar 16, 2021

@estesp Can you take a look?

- process.Init#io could be nil
- Make sure CreateTaskRequest#Options is not empty before unmarshaling

Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
While the issue hasn't been fixed in the kernel yet, we can workaround
the issue by not using overlayfs snapshotter.

The newly added step runs all tests that match /TestCheckpoint/.
So, TestCRWithImagePath has been renamed to match the regexp.

Fixes containerd#3930.

Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 16, 2021

Build succeeded.

Copy link
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; would love @crosbymichael to confirm the options typeurl change

@kzys
Copy link
Member Author

kzys commented Mar 17, 2021

/ok-to-test

@crosbymichael crosbymichael merged commit e0c94bb into containerd:master Mar 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reenable CRIU tests when kernel issue is resolved
5 participants