This repository has been archived by the owner on Oct 13, 2023. It is now read-only.

[18.09 backport] Delete stale containerd object on start failure #154

Conversation

thaJeztah
Member

backport of moby#38364 for 18.09
fixes moby#38346 for 18.09

containerd has two objects with regard to containers.
There is a "container" object, which is metadata, and a "task", which is
managing the actual runtime state.

When docker starts a container, it creates both the container metadata
and the task at the same time. So when a container exits, docker deletes
both of these objects as well.

This ensures that if, on start, creating the container metadata object
in containerd fails due to a name conflict, we go ahead and clean up the
stale object and try again.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 5ba30cd)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
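For illustration, here is a minimal sketch of that behaviour against the containerd Go client. This is not the actual moby/libcontainerd change; the package, helper name, and error handling are assumptions.

```go
package example

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/errdefs"
)

// createContainerRetryStale creates the containerd container (metadata)
// object for id. If the create fails because an object with that name
// already exists, the stale object is deleted and the create is retried.
func createContainerRetryStale(ctx context.Context, client *containerd.Client, id string, opts ...containerd.NewContainerOpts) (containerd.Container, error) {
	c, err := client.NewContainer(ctx, id, opts...)
	if err == nil || !errdefs.IsAlreadyExists(err) {
		return c, err
	}
	// The metadata object is left over from a previous run (its task is
	// already gone, so nothing is using it): remove it and try again.
	if stale, loadErr := client.LoadContainer(ctx, id); loadErr == nil {
		if delErr := stale.Delete(ctx); delErr != nil {
			return nil, delErr
		}
	}
	return client.NewContainer(ctx, id, opts...)
}
```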


@thaJeztah thaJeztah added this to the 18.09.3 milestone Feb 15, 2019
@thaJeztah
Member Author

ping @tonistiigi @cpuguy83 PTAL


@cpuguy83 cpuguy83 left a comment


LGTM

@thaJeztah
Member Author

thaJeztah commented Feb 18, 2019

Interesting; both Power and S390x are failing:
https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/13358/console
https://jenkins.dockerproject.org/job/Docker-PRs-s390x/13246/console

01:54:08 FAIL: docker_cli_swarm_test.go:340: DockerSwarmSuite.TestSwarmContainerEndpointOptions
01:54:08 
01:54:08 [d4244336afd63] waiting for daemon to start
01:54:08 [d4244336afd63] daemon started
01:54:08 
01:54:08 docker_cli_swarm_test.go:348:
01:54:08     c.Assert(err, checker.IsNil, check.Commentf("%s", out))
01:54:08 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4242ca480), Stderr:[]uint8(nil)} ("exit status 125")
01:54:08 ... jwpgahfrribmvdlkks63m881o
01:54:08 
01:54:08 
01:54:08 [d4244336afd63] exiting daemon

@thaJeztah
Member Author

From the daemon logs of that test:

time="2019-02-15T01:54:08.013683320Z" level=debug msg="DisableService ingress-sbox START"
time="2019-02-15T01:54:08.013772013Z" level=debug msg="DisableService ingress-sbox DONE"
time="2019-02-15T01:54:08.214509584Z" level=debug msg="Revoking external connectivity on endpoint gateway_ingress-sbox (84ee9ee8134f41eea984de47ec1580f8af63258aedec2d636e6cbfca2978a9d7)"
time="2019-02-15T01:54:08.215592405Z" level=debug msg="DeleteConntrackEntries purged ipv4:0, ipv6:0"
time="2019-02-15T01:54:08.321575898Z" level=debug msg="Releasing addresses for endpoint gateway_ingress-sbox's interface on network docker_gwbridge"
time="2019-02-15T01:54:08.321648833Z" level=debug msg="ReleaseAddress(LocalDefault/172.26.0.0/16, 172.26.0.2)"
time="2019-02-15T01:54:08.321714887Z" level=debug msg="Released address PoolID:LocalDefault/172.26.0.0/16, Address:172.26.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/172.26.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65532, Sequence: (0xe0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:3"
time="2019-02-15T01:54:08.322653261Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint yy1f8qf6tq4un5i0dauxfjdox 5e3f493590e12eb724bbd4da6947326074018ab87c7d239a4a8c192d11c5b472], retrying...."
time="2019-02-15T01:54:08.332116764Z" level=debug msg="Releasing addresses for endpoint ingress-endpoint's interface on network ingress"
time="2019-02-15T01:54:08.332146193Z" level=debug msg="ReleaseAddress(LocalDefault/10.255.0.0/16, 10.255.0.2)"
time="2019-02-15T01:54:08.332208481Z" level=debug msg="Released address PoolID:LocalDefault/10.255.0.0/16, Address:10.255.0.2 Sequence:App: ipam/default/data, ID: LocalDefault/10.255.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65532, Sequence: (0xe0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:0"
time="2019-02-15T01:54:08.345266665Z" level=debug msg="releasing IPv4 pools from network ingress (yy1f8qf6tq4un5i0dauxfjdox)"
time="2019-02-15T01:54:08.345293981Z" level=debug msg="ReleaseAddress(LocalDefault/10.255.0.0/16, 10.255.0.1)"
time="2019-02-15T01:54:08.345330922Z" level=debug msg="Released address PoolID:LocalDefault/10.255.0.0/16, Address:10.255.0.1 Sequence:App: ipam/default/data, ID: LocalDefault/10.255.0.0/16, DBIndex: 0x0, Bits: 65536, Unselected: 65533, Sequence: (0xc0000000, 1)->(0x0, 2046)->(0x1, 1)->end Curr:0"
time="2019-02-15T01:54:08.345367527Z" level=debug msg="ReleasePool(LocalDefault/10.255.0.0/16)"
time="2019-02-15T01:54:08.345611258Z" level=debug msg="cleanupServiceDiscovery for network:yy1f8qf6tq4un5i0dauxfjdox"
time="2019-02-15T01:54:08.350383843Z" level=debug msg="Unix socket /run/docker/libnetwork/74ea7309c6268e8e8671cef226d36e5534a22f3bb42428b98d0103fac1508edc.sock doesn't exist. cannot accept client connections"
time="2019-02-15T01:54:08.350490938Z" level=debug msg="Cleaning up old mountid : start."
time="2019-02-15T01:54:08.350896947Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
time="2019-02-15T01:54:08.355076867Z" level=debug msg="Cleaning up old mountid : done."
time="2019-02-15T01:54:08.359546495Z" level=debug msg="Clean shutdown succeeded"

docker.log

@thaJeztah
Member Author

I think the remaining failures are flaky tests

@andrewhsu andrewhsu merged commit ba8664c into docker-archive:18.09 Feb 22, 2019
@thaJeztah thaJeztah deleted the 18.09_backport_fix_stale_container_on_start branch February 22, 2019 22:07
algitbot pushed a commit to alpinelinux/aports that referenced this pull request Mar 12, 2019
https://github.com/docker/docker-ce/releases/tag/v18.09.3

The more important fixes in this version:
* When copying an existing folder, ignore xattr set errors when the target filesystem doesn't support xattr. docker-archive/engine#135
* Graphdriver: fix device mode not being detected if character-device bit is set. docker-archive/engine#160
* Fix nil pointer dereference on failure to connect to containerd. docker-archive/engine#162
* Delete stale containerd object on start failure. docker-archive/engine#154
liske pushed a commit to liske/aports that referenced this pull request Apr 7, 2019
https://github.com/docker/docker-ce/releases/tag/v18.09.3

The more important fixes in this version:
* When copying an existing folder, ignore xattr set errors when the target filesystem doesn't support xattr. docker-archive/engine#135
* Graphdriver: fix device mode not being detected if character-device bit is set. docker-archive/engine#160
* Fix nil pointer dereference on failure to connect to containerd. docker-archive/engine#162
* Delete stale containerd object on start failure. docker-archive/engine#154
seiferteric pushed a commit to project-arlo/sonic-buildimage that referenced this pull request Oct 14, 2019
…ll dockers are down except database

It's an issue in docker engine, which has been resolved in PR#154
docker-archive/engine#154

And in this commit, we will update the docker to 19.03.0 for this

Signed-off-by: Dante (Kuo-Jung) Su <dante.su@broadcom.com>
Change-Id: Iecfb7b312abfbcc7741cdcd8b506f9d6c19c4eef
@alexanderadam

alexanderadam commented Mar 2, 2020

Should this bug be fixed in 19.03.6?
Could it be the cause for Ansible issue 64492?

@cpuguy83

cpuguy83 commented Mar 2, 2020

Based on the error message, it seems like there is still a task running, which this PR does not handle; that sounds like a very different problem.
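For context, a rough, hypothetical sketch (not moby code; the package and helper name are made up) of the situation being described: the ID still has a task registered in containerd, which the "delete stale metadata and retry" path in this PR does not cover.

```go
package example

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
)

// checkLeftoverTask reports whether a task is still registered for id.
func checkLeftoverTask(ctx context.Context, client *containerd.Client, id string) error {
	c, err := client.LoadContainer(ctx, id)
	if err != nil {
		return err
	}
	task, err := c.Task(ctx, nil)
	if err != nil {
		// No task registered; only the metadata object remains,
		// which is the case handled by this PR.
		return err
	}
	status, err := task.Status(ctx)
	if err != nil {
		return err
	}
	// A leftover task (e.g. still RUNNING) is a different failure mode.
	fmt.Printf("container %s still has a task in state %s\n", id, status.Status)
	return nil
}
```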

@cpuguy83

cpuguy83 commented Mar 2, 2020

Could you open a new issue with the details? Any clues on how the system got into that state?

@alexanderadam

alexanderadam commented Mar 3, 2020

Done. Not really. I believe that it was caused by installing some updates (including containerd.io, docker-ce and docker-ce-cli) this time.
