2016.03.a (and above) AMIs fail to restart Docker: "Error starting daemon: error initializing graphdriver: Device is Busy" #389

Closed
ChrisRut opened this issue Apr 27, 2016 · 13 comments


@ChrisRut

ChrisRut commented Apr 27, 2016

While testing the new 2016.03.a AMI (ami-67a3a90d) I noticed that Docker fails to restart cleanly with the following error on r3.xlarge instance types:

time="2016-04-26T21:15:53.189574773Z" level=fatal msg="Error starting daemon: error initializing graphdriver: Device is Busy"

This appears to be related to moby/moby#14088, but it is not clear to me from that issue what the appropriate solution is. I've seen recommendations to run rm -Rf /var/lib/docker/devicemapper, which I tried before starting the daemon; that results in a slightly different error message:

time="2016-04-27T17:11:00.726967151Z" level=fatal msg="Error starting daemon: error initializing graphdriver: Unable to take ownership of thin-pool (docker-docker--pool) that already has used data blocks"

This issue does not occur on the 2015.09.g AMI (ami-33b48a59); more specifically, it only seems to happen on r3.xlarge instance types. We need to restart Docker so that it sees our newly mounted ephemeral volume (see also: #384). We run the following as part of the instance's user-data at boot to mount our ephemeral volumes:

# Ensure ephemeral volume is formatted and mounted
if [ -e /dev/xvdx ] && ! mountpoint -q /media/ephemeral0; then
    mkfs.ext4 -q /dev/xvdx
    mount /dev/xvdx
    service docker stop
    # Remove potentially corrupted network kv.db
    # see: https://github.com/docker/docker/issues/18113
    # @TODO: remove this once ECS AMI starts using Docker 1.10+
    rm /var/lib/docker/network/files/local-kv.db
    service docker start
fi

You'll notice there is another "hack" in there to fix a Docker 1.9.1 issue related to: moby/moby#18113

Again, to be clear: we are only able to reproduce this on r3.xlarge instance types using the 2016.03.a AMI (ami-67a3a90d). We have not been able to reproduce it on any other instance type (though we haven't tried them all), and the 2015.09.g AMI (ami-33b48a59) works fine on r3.xlarge.

@vpal

vpal commented Apr 29, 2016

We experience the same with Docker on t2.micro (ECS) with Docker version 1.9.1, build a34a1d5/1.9.1.
We need to restart Docker so it recognizes the EFS volumes we mount with CloudFormation Init.

@ChrisRut were you able to work around this somehow?

@ChrisRut
Author

ChrisRut commented Apr 29, 2016

@vpal for now my "workaround" is to not use the latest (2016.03.a) AMI, and continue to use 2015.09.g AMI (ami-33b48a59).

@greglboxer

greglboxer commented Apr 29, 2016

I was able to work around this by running a Docker command such as docker ps or docker volume ls before restarting the service. Judging by the logs, this seems to let the Docker daemon finish initializing before the restart, but I haven't had the chance to investigate thoroughly, so I could be way off base.

As a side note, I was also seeing

Stopping docker: [  OK  ]
Starting docker:        [   92.892809] device-mapper: thin: Deletion of thin device 1 failed.
....[   97.484149] device-mapper: ioctl: remove_all left 3 open device(s)
......[FAILED]

in the system log

Edit: I am starting t2.small instances and running into this.

@vpal

vpal commented May 2, 2016

@greglboxer thanks for your response.
I have now moved the NFS initialization, configuration, and Docker restart from AWS::CloudFormation::Init to UserData, based on this how-to: https://aws.amazon.com/blogs/compute/using-amazon-efs-to-persist-data-from-amazon-ecs-containers/

It seems that this has resolved the issue, although I only did some basic testing.

@ChrisRut
Author

ChrisRut commented May 9, 2016

I can confirm that @greglboxer's suggestion of running docker ps before stopping the service does seem to work around this issue, so my revised user-data is:

# Ensure ephemeral volume is formatted and mounted
if [ -e /dev/xvdx ] && ! mountpoint -q /media/ephemeral0; then
    mkfs.ext4 -q /dev/xvdx
    mount /dev/xvdx
    # @TODO: remove `docker ps` once the following bug is fixed:
    # - https://github.com/aws/amazon-ecs-agent/issues/389
    docker ps
    service docker stop
    # Remove potentially corrupted network kv.db
    # see: https://github.com/docker/docker/issues/18113
    # @TODO: remove this once ECS AMI starts using Docker 1.10+
    rm -f /var/lib/docker/network/files/local-kv.db
    service docker start
fi

Seems silly to have to do that though.
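
A single docker ps call may also fail outright if the daemon is not accepting connections yet, so one possible hardening of the workaround above is to poll until the command succeeds, with an upper bound so that user-data cannot loop forever. The following is only a sketch: wait_until is a hypothetical helper name, and the timeout and interval values in the usage note are arbitrary assumptions rather than values tested in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of the ECS AMI): poll a command until it
# succeeds or a deadline passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the deadline is reached.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    # SECONDS is a bash builtin counting seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    until "$@" >/dev/null 2>&1; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

In the user-data above, the bare docker ps line could then become something like wait_until 120 2 docker ps, with the script logging a warning and skipping the restart if it returns non-zero.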

@ChrisRut
Author

ChrisRut commented May 9, 2016

I can also confirm this issue persists on the newer 2016.03.b (ami-a1fa1acc) AMI.

ChrisRut changed the title from "2016.03.a AMI fails to restart Docker: "Error starting daemon: error initializing graphdriver: Device is Busy"" to "2016.03.a (and above) AMIs fail to restart Docker: "Error starting daemon: error initializing graphdriver: Device is Busy"" on May 9, 2016
@gugahoi

gugahoi commented May 10, 2016

I can confirm that too! Same happens when trying to use s3fs.

@samuelkarp
Contributor

@ChrisRut Thanks for reporting this. It looks like Docker does not do well if it is stopped while initializing the graph (layer) storage for the first time. When we switched back to ext4 from xfs (due to issues with the combination of devicemapper thin pools and xfs when space is exhausted), initialization started taking longer, and it can now bleed into the time when user-data scripts start running.

Waiting for docker ps to complete is a reasonable way to ensure that the daemon has completed initialization, but another approach would be to move your mount logic ahead of when the daemon starts by utilizing a Cloud Boothook.

Note that unlike user-data scripts, cloud boothooks execute on every boot, so you'll need to ensure that the boothook can either handle running multiple times or skip running on subsequent boots. You can use the cloud-init-per helper here; you'll find an example in /etc/cloud/cloud.cfg.d/90_ecs.cfg on recent versions of the ECS-optimized AMI. If you want to use both user-data scripts and cloud boothooks, you can combine them using MIME-multipart. Documentation on this is here.
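
As a rough illustration of the boothook approach described above, the mount logic from the original report might look something like the following when moved into a cloud boothook. This is an untested sketch under several assumptions: that /dev/xvdx is already visible when boothooks run, that an explicit mount point is needed because no fstab entry may exist that early in boot, and that format_ephemeral0 is just an arbitrary label chosen for cloud-init-per. Because the mount happens before the Docker daemon starts, no Docker restart is included.

```shell
#cloud-boothook
#!/bin/bash
# Sketch only: boothooks run on every boot, so each step is guarded.
if [ -e /dev/xvdx ] && ! mountpoint -q /media/ephemeral0; then
    # cloud-init-per <frequency> <name> <cmd...> runs <cmd> at most once per
    # frequency; "instance" means once per instance ID, so reboots do not
    # reformat the volume. "format_ephemeral0" is a made-up label.
    cloud-init-per instance format_ephemeral0 mkfs.ext4 -q /dev/xvdx
    # Assumption: pass the mount point explicitly instead of relying on fstab.
    mkdir -p /media/ephemeral0
    mount /dev/xvdx /media/ephemeral0
fi
```

Note that instance store contents do not survive a stop/start cycle, so a format that runs once per instance may not be sufficient in that case; treat the frequency choice as something to verify for your setup.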

@samuelkarp
Contributor

I'm going to close this for now. Please let us know if you continue to have any issues.

@jbergknoff

We tried the docker ps workaround with the latest AMI (amzn-ami-2016.03.g-amazon-ecs-optimized) but it was ineffective. Our userdata updates /etc/sysconfig/docker and then runs docker ps before attempting service docker restart. A few instances came up without incident, but at least one did not: the docker ps command hung for 15+ minutes before we gave up on the instance, never reaching the service restart.
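
Since docker ps itself can block, as in the hang described above, a user-data script may want to put a hard time limit on the probe. The sketch below does not fix the underlying hang; it only keeps the script from waiting indefinitely. It assumes GNU coreutils timeout is available on the AMI, and probe_with_limit is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: run a probe command with a hard time limit.
# Usage: probe_with_limit <seconds> <command> [args...]
# Returns 0 if the probe succeeded, 124 if GNU timeout killed it at the
# limit, or the probe's own exit status otherwise.
probe_with_limit() {
    local limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
}

# Example wiring (commented out so the sketch has no side effects):
# if probe_with_limit 60 docker ps; then
#     service docker restart
# else
#     echo "docker ps did not answer within 60s; skipping restart" >&2
# fi
```

What to do when the limit is hit (retry, terminate the instance, alert) is left open here; the 60-second value is an arbitrary placeholder.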

lox added a commit to lox/ecsy that referenced this issue Nov 14, 2016
@lox

lox commented Nov 14, 2016

We are having this issue too. Requiring docker ps seems like a horrible, undocumented hack @samuelkarp. Surely this is something that should have an open issue tracking it?

lox added a commit to lox/ecsy that referenced this issue Nov 14, 2016
@Suncatcher

Suncatcher commented Feb 1, 2017

On version 2016.09.e the bug still exists.
docker ps doesn't help; I can't even run docker info. Is there any stable workaround?
The instance is a perfectly clean t2.micro, just after install.

@adiii717

adiii717 commented Apr 1, 2019

I am facing this on a 2018 version. Is there any update or fix for it?
There was a permission issue with my custom AMI; I set the proper permissions and it works like a charm.
Fixed. Thanks, team.
