Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix container pid race condition. #3857

Merged
merged 1 commit into from
Dec 2, 2019

Conversation

Random-Liu
Copy link
Member

@Random-Liu Random-Liu commented Nov 28, 2019

If you try docker with containerd 1.3, and run docker run hello-world, it will stuck.

After #3366, we change pid after container/exec process created, this causes various issues:

  1. The pid we get from task.Start might be -1 if the container is a short running container, because Start and State are not called in a transaction: https://github.com/containerd/containerd/blob/master/services/tasks/local.go#L230.
  2. This is fine for Kubernetes integration, because we stopped relying on containerd event and pid matching. However, this will break Docker: https://github.com/moby/moby/blob/05469b5fa2b48cf20cd0137ca8c45645b63049ff/daemon/monitor.go#L50
  3. In all higher level apis, pid is an uint, returning -1 will eventually make it 4294967295, which is not very user friendly:
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 4294967295,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2019-11-27T22:21:24.999965567Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

This PR partially reverts #3366, and use exited to figure out whether the process has already exited. I tested the change with Docker, and docker run hello-world works now.

Signed-off-by: Lantao Liu lantaol@google.com

@Random-Liu Random-Liu added this to the 1.3.1 milestone Nov 28, 2019
@codecov-io
Copy link

codecov-io commented Nov 28, 2019

Codecov Report

Merging #3857 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3857   +/-   ##
=======================================
  Coverage   42.34%   42.34%           
=======================================
  Files         130      130           
  Lines       14683    14683           
=======================================
  Hits         6217     6217           
  Misses       7540     7540           
  Partials      926      926
Flag Coverage Δ
#linux 45.76% <ø> (ø) ⬆️
#windows 37.82% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a31ce2...a6b6097. Read the comment docs.

@Random-Liu Random-Liu removed this from the 1.3.1 milestone Nov 28, 2019
@theopenlab-ci
Copy link

theopenlab-ci bot commented Nov 28, 2019

Build succeeded.

@AkihiroSuda
Copy link
Member

Can we have a test?

@Random-Liu
Copy link
Member Author

Random-Liu commented Dec 2, 2019

Can we have a test?

Done.

To make it easy to reproduce the race condition, I added a time.Sleep(5 * time.Second) before https://github.com/containerd/containerd/blob/v1.3.0/services/tasks/local.go#L225 in my experiment.

Here is the test result with containerd v1.3.0:

$ sudo PATH=$PATH GOPATH=$GOPATH TESTFLAGS="--test.run=TestShortRunningTaskPid -v" make integration
+ integration
INFO[0000] running tests against containerd              revision=942a06f75080ea055514d301dba83a8d71acc099.m runtime= version=v1.3.0-1-g942a06f7.m
INFO[0000] start to pull seed image                     
=== RUN   TestShortRunningTaskPid
=== PAUSE TestShortRunningTaskPid
=== CONT  TestShortRunningTaskPid
--- FAIL: TestShortRunningTaskPid (5.44s)
    container_test.go:1625: Pid -1
    container_test.go:1627: Unexpected task pid -1

Here is the test result with this patch:

$ sudo PATH=$PATH GOPATH=$GOPATH TESTFLAGS="--test.run=TestShortRunningTaskPid -v" make integration
+ integration
INFO[0000] running tests against containerd              revision=0f68b5b1ca30210d15391db4f889b8970f45ca9d.m runtime= version=v1.3.0-3-g0f68b5b1.m
INFO[0000] start to pull seed image                     
=== RUN   TestShortRunningTaskPid
=== PAUSE TestShortRunningTaskPid
=== CONT  TestShortRunningTaskPid
--- PASS: TestShortRunningTaskPid (5.80s)
    container_test.go:1625: Pid 227421
PASS
ok  	github.com/containerd/containerd	7.806s

BTW, I tried to add retry in the test to help reproducing the race condition, but it is hard. However, it can be easily reproduced by running docker run hello-world with containerd 1.3.0 at least on COS.

Signed-off-by: Lantao Liu <lantaol@google.com>
@theopenlab-ci
Copy link

theopenlab-ci bot commented Dec 2, 2019

Build succeeded.

@crosbymichael
Copy link
Member

LGTM

@crosbymichael crosbymichael merged commit b0821c8 into containerd:master Dec 2, 2019
@Random-Liu Random-Liu deleted the fix-container-pid branch December 2, 2019 18:36
jterry75 added a commit to jterry75/containerd that referenced this pull request Dec 3, 2019
containerd 1.3.2

Welcome to the v1.3.2 release of containerd!

The second patch release for `containerd` 1.3 includes a fix for a race condition
related to the reported pid on exit when called from Docker.

### Runtime

* Fix containerd pid race condition [containerd#3857](containerd#3857)
* Use cached process state to reduce exec cost [containerd#3711](containerd#3711)

### CRI

* Added `insecure_skip_verify` option in the registry tls config to allow skipping registry certificate verification [containerd#3847](containerd#3847)

Please try out the release binaries and report any issues at
https://github.com/containerd/containerd/issues.

### Contributors

* Lantao Liu
* Derek McGowan
* Michael Crosby
* Phil Estes
* Maksym Pavlenko

### Changes

* [`ff48f57fc8`](containerd@ff48f57) Merge pull request  [containerd#3866](containerd#3866) from dmcgowan/prepare-1.3.2
* [`99005c2647`](containerd@99005c2) Add release notes for v1.3.2
* [`e987ea3cac`](containerd@e987ea3) Merge pull request  [containerd#3864](containerd#3864) from Random-Liu/update-cri-release-1.3
* [`8bbd71e560`](containerd@8bbd71e) Update cri to b1bef15fbeb6c6f0569b67322acfa74ca3597755.
* [`070c27205c`](containerd@070c272) Merge pull request  [containerd#3863](containerd#3863) from Random-Liu/cherrypick-#3857-release-1.3
* [`306d6d4b55`](containerd@306d6d4) Fix container pid.
* [`5028701f1a`](containerd@5028701) Merge pull request  [containerd#3753](containerd#3753) from thaJeztah/1.3_backport_avoid_unnecessary_runc_state
* [`e3d2b608cc`](containerd@e3d2b60) Merge pull request  [containerd#3855](containerd#3855) from fuweid/cp-3853
* [`04fbb97ad0`](containerd@04fbb97) Fix cleanup error on content client test
* [`be24f39454`](containerd@be24f39) Merge pull request  [containerd#3840](containerd#3840) from ameyag/1.3_backport_log_file_win
* [`333679343b`](containerd@3336793) Add `--log-file` flag for windows service.
* [`9abfc70043`](containerd@9abfc70) Use cached state instead of `runc state`.

### Changes from containerd/cri

* [`b1bef15f`](containerd/cri@b1bef15) Merge pull request  [containerd#1346](containerd/cri#1346) from Random-Liu/cherrypick-#1345-release-1.3
* [`27edfa06`](containerd/cri@27edfa0) Add insecure_skip_verify option.
* [`e5dd8053`](containerd/cri@e5dd805) Merge pull request  [containerd#1322](containerd/cri#1322) from Random-Liu/cherrypick-#1319-release-1.3
* [`c0dee957`](containerd/cri@c0dee95) Fix containerd build, use `libbtrfs-dev` when available.
* [`1fb415d2`](containerd/cri@1fb415d) Merge pull request  [containerd#1314](containerd/cri#1314) from Random-Liu/cherrypick-#1312-release-1.3
* [`0973459d`](containerd/cri@0973459) Update based on default xenial distro.
* [`a46f6baf`](containerd/cri@a46f6ba) Configure golangci-lint

### Dependency Changes

Previous release can be found at [v1.3.1](https://github.com/containerd/containerd/releases/tag/v1.3.1)

* **github.com/containerd/cri**  5d49e7e51b43e36a6b9c4386257c7d08c602237f -> b1bef15fbeb6c6f0569b67322acfa74ca3597755

# gpg: directory '/c/Users/juterry/.gnupg' created
# gpg: keybox '/c/Users/juterry/.gnupg/pubring.kbx' created
# gpg: Signature made Tue Dec  3 11:09:10 2019 PST
# gpg:                using RSA key 8C7A111C21105794B0E8A27BF58C5D0A4405ACDB
# gpg: Can't check signature: No public key
@zhsj zhsj mentioned this pull request Feb 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants