2016.03.a (and above) AMIs fail to restart Docker: "Error starting daemon: error initializing graphdriver: Device is Busy" #389

ChrisRut · 2016-04-27T19:55:24Z

While testing the new 2016.03.a AMI (ami-67a3a90d) I noticed that Docker fails to restart cleanly with the following error on r3.xlarge instance types:

time="2016-04-26T21:15:53.189574773Z" level=fatal msg="Error starting daemon: error initializing graphdriver: Device is Busy"

This appears to be related to moby/moby#14088 but it is not clear to me from the context of that issue what the appropriate solution is. I've seen recommendations for running rm -Rf /var/lib/docker/devicemapper which I have tried before starting, and that results in a slightly different error message:

time="2016-04-27T17:11:00.726967151Z" level=fatal msg="Error starting daemon: error initializing graphdriver: Unable to take ownership of thin-pool (docker-docker--pool) that already has used data blocks"

This issue does not occur on the 2015.09.g AMI (ami-33b48a59), and it only seems to happen on r3.xlarge instance types. We need to restart Docker so that it sees our newly mounted ephemeral volume (see also: #384). We run the following as part of the instance's user-data at boot to mount our ephemeral volumes:

# Ensure ephemeral volume is formatted and mounted
if [ -e /dev/xvdx ] && ! mountpoint -q /media/ephemeral0; then
    mkfs.ext4 -q /dev/xvdx
    mount /dev/xvdx
    service docker stop
    # Remove potentially corrupted network kv.db
    # see: https://github.com/docker/docker/issues/18113
    # @TODO: remove this once ECS AMI starts using Docker 1.10+
    rm /var/lib/docker/network/files/local-kv.db
    service docker start
fi

You'll notice there is another "hack" in there to fix a Docker 1.9.1 issue related to: moby/moby#18113

Again, to be clear: we are only able to reproduce this on r3.xlarge instance types using the 2016.03.a AMI (ami-67a3a90d). We have not been able to reproduce it on any other instance type (though we haven't tried them all), and the 2015.09.g AMI (ami-33b48a59) works fine on r3.xlarge.

vpal · 2016-04-29T10:43:52Z

We experience the same with docker on t2.micro (ECS) with Docker version 1.9.1, build a34a1d5/1.9.1.
We need to restart docker so it recognizes the EFS volumes we mount with CloudFormation Init.

@ChrisRut were you able to work around this somehow?

ChrisRut · 2016-04-29T17:42:04Z

@vpal for now my "workaround" is to not use the latest (2016.03.a) AMI, and continue to use 2015.09.g AMI (ami-33b48a59).

greglboxer · 2016-04-29T20:37:37Z

I was able to work around this by running a docker command like docker ps or docker volume ls before restarting the service. Looking through the logs, it seems this allows the docker daemon to initialize correctly before the restart, but I haven't had the chance to investigate thoroughly, so I could be way off base.

As a side note, I was also seeing

Stopping docker: [  OK  ]
Starting docker:        [   92.892809] device-mapper: thin: Deletion of thin device 1 failed.
....[   97.484149] device-mapper: ioctl: remove_all left 3 open device(s)
......[FAILED]

in the system log

Edit: I am starting t2.small instances and running into this.

vpal · 2016-05-02T09:32:49Z

@greglboxer thanks for your response.
I have now moved the NFS initialization, configuration and docker restart part from AWS::CloudFormation::Init to UserData based on this howto: https://aws.amazon.com/blogs/compute/using-amazon-efs-to-persist-data-from-amazon-ecs-containers/

It seems that this has resolved the issue, although I only did some basic testing.

ChrisRut · 2016-05-09T20:35:37Z

I can confirm that @greglboxer's suggestion of running docker ps before stopping the service does seem to work around this issue, so my revised user-data is:

# Ensure ephemeral volume is formatted and mounted
if [ -e /dev/xvdx ] && ! mountpoint -q /media/ephemeral0; then
    mkfs.ext4 -q /dev/xvdx
    mount /dev/xvdx
    # @TODO: remove `docker ps` once the following bug is fixed:
    # - https://github.com/aws/amazon-ecs-agent/issues/389
    docker ps
    service docker stop
    # Remove potentially corrupted network kv.db
    # see: https://github.com/docker/docker/issues/18113
    # @TODO: remove this once ECS AMI starts using Docker 1.10+
    rm -f /var/lib/docker/network/files/local-kv.db
    service docker start
fi

Seems silly to have to do that though.

ChrisRut · 2016-05-09T20:38:08Z

I can also confirm this issue persists on the newer 2016.03.b (ami-a1fa1acc) AMI.

gugahoi · 2016-05-10T00:58:25Z

I can confirm that too! Same happens when trying to use s3fs.

samuelkarp · 2016-05-18T23:02:22Z

@ChrisRut Thanks for reporting this. It looks like Docker does not do well if it is stopped while initializing the graph (layer) storage for the first time. When we switched back to ext4 from xfs (due to issues with the combination of devicemapper thin pools and xfs when space is exhausted), initialization started taking longer and can bleed into the time when user-data scripts start running.

Waiting for docker ps to complete is a reasonable way to ensure that the daemon has completed initialization, but another approach would be to move your mount logic ahead of when the daemon starts by utilizing a Cloud Boothook.

Note that unlike user-data scripts, cloud boothooks execute on every boot, so you'll need to ensure that the boothook can either handle running multiple times or can skip running on subsequent boots (you can use the cloud-init-per helper here, you'll find an example in /etc/cloud/cloud.cfg.d/90_ecs.cfg on recent versions of the ECS-optimized AMI). If you want to use both user-data scripts and cloud boothooks, you can combine them using MIME-multipart. Documentation on this is here.
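For illustration, the boothook approach described above might look roughly like the sketch below. It reuses the device and mount point from the user-data earlier in this thread (/dev/xvdx and /media/ephemeral0). The prepare_ephemeral function name is hypothetical, and using blkid to decide whether to format is an assumption made here so the hook is safe to run on every boot; the cloud-init-per helper mentioned above is another way to get the same effect. The explicit mount point passed to mount is also an assumption, since the original script relies on an fstab entry.

```shell
#cloud-boothook
#!/bin/bash
# Sketch only: mount the ephemeral volume from a boothook, which runs
# ahead of the Docker daemon, so no Docker restart is needed afterwards.

# prepare_ephemeral DEVICE MOUNTPOINT (hypothetical helper name)
prepare_ephemeral() {
    local device="$1" mountpoint="$2"

    # Instance types without this device: nothing to do.
    [ -e "$device" ] || return 0

    # Boothooks run on every boot, so skip if the volume is already mounted.
    if mountpoint -q "$mountpoint"; then
        return 0
    fi

    # Assumption: format only when blkid finds no existing filesystem,
    # so a reboot does not wipe data that is still on the volume.
    blkid "$device" >/dev/null 2>&1 || mkfs.ext4 -q "$device"

    mkdir -p "$mountpoint"
    mount "$device" "$mountpoint"
}

prepare_ephemeral /dev/xvdx /media/ephemeral0
```

Because the function returns early when the device is missing or the mount point is already in use, the same user-data can be attached to every instance type without extra conditionals.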

samuelkarp · 2016-06-02T16:38:22Z

I'm going to close this for now. Please let us know if you continue to have any issues.

jbergknoff · 2016-08-17T15:07:16Z

We tried the docker ps workaround with the latest AMI (amzn-ami-2016.03.g-amazon-ecs-optimized) but it was ineffective. Our userdata updates /etc/sysconfig/docker and then runs docker ps before attempting service docker restart. A few instances came up without incident, but at least one did not: the docker ps command hung for 15+ minutes before we gave up on the instance, never reaching the service restart.
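For anyone whose user-data blocks on a hung docker ps as described above, one option is to bound the wait. The sketch below does not fix the underlying initialization problem; it only keeps the script from waiting forever so it can fail fast and log something. The wait_for_docker name and the time limits are hypothetical, and it assumes the coreutils timeout command is available on the instance.

```shell
#!/bin/bash
# Sketch only: wait for the Docker daemon to answer, with an overall deadline
# and a per-attempt limit so a hung `docker ps` cannot block user-data forever.

# wait_for_docker DEADLINE_SECONDS PER_TRY_SECONDS [PROBE_COMMAND...]
wait_for_docker() {
    local deadline="$1" per_try="$2" start now
    shift 2

    # Default probe is `docker ps`; any extra arguments replace it.
    if [ "$#" -eq 0 ]; then
        set -- docker ps
    fi

    start=$(date +%s)
    # `timeout` kills the probe if it runs longer than PER_TRY_SECONDS.
    until timeout "$per_try" "$@" >/dev/null 2>&1; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$deadline" ]; then
            return 1    # daemon never answered within the deadline
        fi
        sleep 1
    done
    return 0
}

# Example use in user-data (values are illustrative):
#   wait_for_docker 120 10 || { echo "docker did not become ready" >&2; exit 1; }
#   service docker restart
```

Failing fast this way at least lets an Auto Scaling health check or similar mechanism replace the instance instead of leaving it stuck.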

lox · 2016-11-14T03:50:16Z

We are having this issue too. Requiring docker ps seems like a horrible, undocumented hack @samuelkarp. Surely this is something that should have an open issue tracking it?

Suncatcher · 2017-02-01T12:28:21Z

On version 2016.09.e the bug still exists.
docker ps doesn't help, and I can't even run docker info. Is there any stable workaround?
The instance is a perfectly clean t2.micro, just after install.

adiii717 · 2019-04-01T11:47:00Z

I am facing this in the 2018 version. Is there any update or fix for this?
There was a permission issue with a custom AMI; I just set the proper permissions and it works like a charm.
Fixed. Thanks, Team

ChrisRut changed the title ~~2016.03.a AMI fails to restart Docker: "Error starting daemon: error initializing graphdriver: Device is Busy"~~ 2016.03.a (and above) AMIs fail to restart Docker: "Error starting daemon: error initializing graphdriver: Device is Busy" May 9, 2016

samuelkarp closed this as completed Jun 2, 2016

samuelkarp mentioned this issue Jun 10, 2016

ECS agent fails to launch with latest AMI/Docker #419

Closed

lox added a commit to lox/ecsy that referenced this issue Nov 14, 2016

Workaround for aws/amazon-ecs-agent#389

c895420

lox added a commit to lox/ecsy that referenced this issue Nov 14, 2016

Workaround for aws/amazon-ecs-agent#389

2b195e0
