This repository has been archived by the owner on Jan 23, 2020. It is now read-only.

Service fails to start with Cloudstor EBS Volume attached #157

Open
ambrons opened this issue May 23, 2018 · 38 comments

@ambrons

ambrons commented May 23, 2018

Expected behavior

Service starts with the attached EBS volume.

Actual behavior

My assumption is that snapshotting and loading the EBS volume into the target availability zone is taking too long, so the mount times out.

Note: the EBS volumes are 200GB, however they're currently empty.

The initial error is this:

$ swarm service ps nvkmbnmc9nwyn8ojw3v1bkjzh --no-trunc
ID                          NAME                IMAGE                                                                                                                                                     NODE                                               DESIRED STATE       CURRENT STATE              ERROR                                                                                                                                                                       PORTS
zgkin90vm4iyt242jsjeb80h4   cassandra.1         xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm:latest@sha256:a7150a38203c44e332d05a37d8275a76a001be5b814c661d05fd73edab893437   ip-172-20-25-211.ap-southeast-1.compute.internal   Running             Preparing 13 seconds ago   "Post http://%2Frun%2Fdocker%2Fplugins%2F4125eb31d4a89cab3863d96d60403c0134f69a0d19937225a2f0839f737384e1%2Fcloudstor.sock/VolumeDriver.Mount: context deadline exceeded"   

After subsequent retries to start the service, I get this error:

$ swarm service ps nvkmbnmc9nwyn8ojw3v1bkjzh --no-trunc
ID                          NAME                IMAGE                                                                                                                                                     NODE                                               DESIRED STATE       CURRENT STATE              ERROR                                                                                                                                                                                                                                                                                                                                                                                                                   PORTS
ep598hubcgsmuox2wm1wfcn1f   cassandra.1         xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm:latest@sha256:a7150a38203c44e332d05a37d8275a76a001be5b814c661d05fd73edab893437   ip-172-20-9-77.ap-southeast-1.compute.internal     Running             Preparing 51 seconds ago                                                                                                                                                                                                                                                                                                                                                                                                                           
zgkin90vm4iyt242jsjeb80h4    \_ cassandra.1     xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm:latest@sha256:a7150a38203c44e332d05a37d8275a76a001be5b814c661d05fd73edab893437   ip-172-20-25-211.ap-southeast-1.compute.internal   Shutdown            Rejected 51 seconds ago    "create cassandra-1: found reference to volume 'cassandra-1' in driver 'cloudstor:aws', but got an error while checking the driver: error while checking if volume "cassandra-1" exists in driver "cloudstor:aws": Post http://%2Frun%2Fdocker%2Fplugins%2F4125eb31d4a89cab3863d96d60403c0134f69a0d19937225a2f0839f737384e1%2Fcloudstor.sock/VolumeDriver.Get: context deadline exceeded: volume name must be unique"   

The service never seems to start.
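
For what it's worth, a quick way to check whether the EBS attach itself is the slow part (a sketch only; assumes the AWS CLI is available with credentials for this account, and that the Name tag matches the ebs_tag_Name used below):

aws ec2 describe-volumes --region ap-southeast-1 \
  --filters Name=tag:Name,Values=cassandra-1 \
  --query 'Volumes[].{State:State,AZ:AvailabilityZone,Attachments:Attachments[].State}'
# an attachment stuck in "attaching" would point at the EBS attach/relocation rather than the plugin socket itself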

Information

Docker-diagnose: 1527092193-JGugtUgVNBmvU7S8tXn0mV4ryIhPF4zc

Volumes created:

swarm volume create -d "cloudstor:aws" --opt ebstype=io1 --opt size=200 --opt iops=1000 --opt backing=relocatable --opt ebs_tag_Name=cassandra-1 cassandra-1

AWS Region: ap-southeast-1

Service Creation Setup:

docker service create \
  --name cassandra \
  --network data \
  --update-delay 60s \
  --replicas 1 \
  --with-registry-auth \
  --env LOCAL_JMX=no \
  --env SERVICE_NAME=cassandra \
  --constraint 'node.role != manager' \
  --reserve-memory 3gb \
  --mount type=volume,target=/var/lib/cassandra,source={{.Service.Name}}-{{.Task.Slot}} \
  xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm
@ambrons

ambrons commented May 24, 2018

I did some testing in us-east-1 last week with 50GB io1 volumes at 1000 IOPS and it was working fine. However, now that I'm using ap-southeast-1 it appears to fail with the above error, around the 3-minute mark:

$ swarm service create \
>   --name cassandra \
>   --network data \
> --update-delay 300s \
>   --replicas 1 \
>   --with-registry-auth \
>   --env LOCAL_JMX=no \
>   --env SERVICE_NAME=cassandra \
>   --constraint 'node.role != manager' \
>   --reserve-memory 3gb \
>   --mount type=volume,volume-driver=cloudstor:aws,source=asports-prod-{{.Service.Name}}-{{.Task.Slot}},destination=/var/lib/cassandra,volume-opt=backing=relocatable,volume-opt=size=150,volume-opt=ebstype=io1,volume-opt=iops=1000,volume-opt=ebs_tag_Name=asports-prod-{{.Service.Name}}-{{.Task.Slot}} \
>   xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm
xdlfmtzojphbuoiqd5feuhdes
overall progress: 0 out of 1 tasks 
1/1: Post http://%2Frun%2Fdocker%2Fplugins%2F483cd69b6d69e5aa11bdf44a3b13345aed… 

Just to make sure it wasn't anything else, I removed the mount argument and the service runs fine with a standard container volume. It only fails when using the EBS-backed relocatable volumes.

I've also tried letting the service definition create the volume on start and that didn't seem to help either.

Here's the updated configuration:

docker service create \
  --name cassandra \
  --network data \
  --update-delay 300s \
  --replicas 1 \
  --with-registry-auth \
  --env LOCAL_JMX=no \
  --env SERVICE_NAME=cassandra \
  --constraint 'node.role != manager' \
  --reserve-memory 3gb \
  --mount type=volume,volume-driver=cloudstor:aws,source=asports-prod-{{.Service.Name}}-{{.Task.Slot}},destination=/var/lib/cassandra,volume-opt=backing=relocatable,volume-opt=size=150,volume-opt=ebstype=io1,volume-opt=iops=1000,volume-opt=ebs_tag_Name=asports-prod-{{.Service.Name}}-{{.Task.Slot}} \
  xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm

The plugin itself seems to be working fine:

$ docker plugin ls
ID                  NAME                DESCRIPTION                       ENABLED
e3c2802690d7        cloudstor:aws       cloud storage plugin for Docker   true

@ambrons

ambrons commented May 24, 2018

@ddebroy Do you have any thoughts? I'm dead in the water for our deployment as this doesn't appear to work as advertised.

@soar

soar commented Jun 3, 2018

Same problem - cloudstor:aws creates and deletes volumes successfully, but hangs when I try to start a container.

# docker volume create -d "cloudstor:aws" --opt ebstype=gp2 --opt size=10 mylocalvol1
mylocalvol1
# docker volume ls
DRIVER              VOLUME NAME
cloudstor:aws       mylocalvol1
# docker run -it -v mylocalvol1:/mnt debian bash
... nothing after 10 minutes ...
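
Where I'd look next (a sketch only; assumes a systemd-based host rather than the Docker for AWS moby/Alpine image, where the daemon log lives elsewhere):

journalctl -u docker --since "15 min ago" | grep -i cloudstor   # managed-plugin stdout/stderr ends up in the daemon log
docker volume inspect mylocalvol1                               # confirms the driver still answers VolumeDriver.Get for the volume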

@VictorLopess

Hello everyone.
I'm having a similar problem. I'm trying to create an EBS volume via Cloudstor, with the configuration below:

version: '3'
services:
  rabbitmq:
    image: rabbitmq:3.6-management-alpine
    networks:
      - my-network
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq_data_staging:/var/lib/rabbitmq
    logging:
      driver: "awslogs"
      options:
        awslogs-region: "us-east-1"
        awslogs-group: "queues"
        awslogs-stream: "rabbitmq-staging"
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.mylabel == mylabelvalue
      restart_policy:
        condition: on-failure
networks:
  my-network:
    external: true

volumes:
  rabbitmq_data_staging:
    driver: "cloudstor:aws"
    driver_opts:
      size: "5"
      ebstype: "gp2"
      backing: "relocatable"

Every time I run a deploy in swarm, it simply does not bring up the container, without giving any error. The deployment is issued from a manager and scheduled onto a worker.

When I run the command on the manager itself, it works normally.

When I take out the volume mount,

volumes:
      - rabbitmq_data_staging:/var/lib/rabbitmq

the container starts normally.

I tested other plugins like rexray and had the same problem, which makes me think there is some incompatibility between swarm and the plugin. Can anyone help, or tell me if I'm doing something wrong in my docker-compose?

Cloudstor itself creates the EBS volume without any problem.

The versions of the plugins are:

ID                  NAME                DESCRIPTION                       ENABLED
ed0c2ebcfc92        rexray/ebs:latest   REX-Ray for Amazon EBS            false
208c7b943f6d        cloudstor:aws       cloud storage plugin for Docker   true
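
One thing that may be worth ruling out (a sketch, not a confirmed fix): managed plugins are installed per node, so the worker that the task is scheduled onto needs cloudstor:aws installed and enabled as well, not just the manager. From the worker itself:

docker plugin ls   # cloudstor:aws should show ENABLED=true on every node that will mount the volume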

@ddebroy, can you help us? Thank you!

@lordvlad

I have the same issue with the docker4aws 18.03 (stable) and 18.04 (edge) CloudFormation templates. I didn't have the issue with docker4aws 17.12 (edge).

@nunofernandes

Any news on this?

@dodgemich

I ran into similar issues using it with ECS - I found that it worked with T2s and C4s, but would fail in this manner with C5/M5 instances. Might help debug the root issue.

@abashev

abashev commented Sep 11, 2018

@dodgemich you are my hero!! I spent two days trying to understand why rexray and cloudstor don't work on my new shiny t3 cluster. And I just have to migrate it to t2.

@Richard-Mathie

Maybe #148 is related. I get issues like the above, plus errors about the mount point /dev/xvdf already existing, when trying to mount cloudstor:aws volumes on current-generation AWS instances. Apparently this may have something to do with the NVMe drivers on that hardware.
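
For anyone who wants to confirm this on their own nodes, this is roughly what it looks like on a Nitro-based instance (a sketch; device numbering varies per instance):

lsblk            # EBS volumes show up as nvme0n1, nvme1n1, ... instead of xvda/xvdf
ls -l /dev/xvdf  # the legacy name the driver expects may not exist at all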

@lepetitpierdol

Having the exact same issue with rexray/efs... Did anyone manage to find a solution?

@abashev

abashev commented Sep 28, 2018

@lepetitpierdol you have to use instances from previous generations - t2, c4 and so on. It looks like the latest T3 and C5 instances have new disk controllers that don't work with rexray/convoy/cloudstor.

@gartz

gartz commented Oct 9, 2018

No luck for me, I'm having issues with T2 on docker4aws 18.06.1 (stable) and 18.01 (edge) when mounting volumes using cloudstor.

@aplex

aplex commented Oct 16, 2018

In my case, I got this error when one of the containers got stuck and could not be stopped. It was holding a reference to a volume, so a new container could not be started. I resolved this by rebooting the host VM.
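
A less drastic first step than rebooting, in case it helps someone (a sketch; replace cassandra-1 with whatever volume is stuck):

docker ps -a --filter volume=cassandra-1   # find the container still holding a reference to the volume
docker rm -f <container-id>                # may free the mount without a reboot, though a truly wedged task sometimes won't die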

@kinghuang

There have been some PRs in REX-Ray to handle the new NVMe device names (rexray/rexray#1233, rexray/rexray#1252). I've run the edge release successfully to create and mount EBS volumes on current-generation instances.

We need a similar change in Cloudstor. I really wish Docker would at least give some indication of whether they're even going to address this issue. Or, open source the code so that we can do something about it.
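
For reference, the device-name resolution those REX-Ray PRs add boils down to something like this (a sketch of the idea only - Cloudstor is closed source, so this is not its code; assumes nvme-cli is installed):

sudo nvme id-ctrl -v /dev/nvme1n1 | grep -i xvd       # EBS puts the requested device name (e.g. xvdf) in the NVMe vendor-specific data
ls -l /dev/disk/by-id/ | grep Elastic_Block_Store     # on distros with the usual udev rules, by-id symlinks carry the EBS volume ID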

@bandesz

bandesz commented Oct 23, 2018

@kinghuang you couldn't have said it better. I tried REX-Ray, but it's not enough for my use case. I'm using Cloudstor currently on Amazon ECS, but I'm forced to use the old instance types.

@brawong, @joeabbey sorry to mention you guys, but do you have any feedback on when the NVMe devices would be supported in Cloudstor, so we could use it on the new AWS EC2 generations (t3, m5, c5, etc.)?

@rafagsiqueira

I am trying to create volumes and I am running into the same problem. My cluster is based on T2 instances, so that does not seem to be the source of the problem. Docker version 18.06.1-ce, build e68fc7a.

@gartz

gartz commented Oct 24, 2018 via email

@Carlos4ndresh

I'm experiencing the same problem. I have 18.06.1-ce:

Status": {
"Timestamp": "2018-10-25T16:18:57.558536532Z",
"State": "preparing",
"Message": "preparing",
"Err": "create testapp_web: Post http://%2Frun%2Fdocker%2Fplugins%2Fcd8e305e0fc9d1030761f3dfc3a873f4923ec78a669489fb15d3123df6f1c10b%2Fcloudstor.sock/VolumeDriver.Create: context deadline exceeded",
"PortStatus": {}
},

I restarted the EC2 instances; after the reboot, the swarm ended up with the master and a couple of workers out of the swarm. Then all services went up (after recreating the stack), but now I have another problem, which I think is related, though I have no evidence.

@mateodelnorte

Anybody find a solution here? Just started seeing this issue.

@kinghuang

I haven't found any solutions for Cloudstor. I've started to use REX-Ray, but it has the downside that it doesn't copy EBS volumes between availability zones.

We really need Docker to provide an answer.
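
In case it helps anyone evaluating the switch, a minimal rexray/ebs setup looks roughly like this (a sketch only; assumes the rexray/ebs managed plugin and an instance profile or keys with EBS permissions; region and volume names are placeholders):

docker plugin install rexray/ebs EBS_REGION=us-east-1 --grant-all-permissions
docker volume create -d rexray/ebs --opt size=200 cassandra-1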

@mateodelnorte

thanks @kinghuang. any tips or pointers to documentation on REX-Ray, in case we need to go that route?

@gartz was rolling back your stack as easy as running the cloudformation template with version 18.03.0 specified?

@gartz

gartz commented Nov 2, 2018

@mateodelnorte yes, it rolled back, but I had to log in to the new manager and force-initialize it to get it working; after that the workers and other managers started working again.

I also edited my CloudFormation template to add EFS support in N. California (it's disabled in the original template, but N. California supports it).

@mateodelnorte

Currently attempting to update our CloudFormation template from 18.06.1 to 18.03.1. Our new manager came online but is clearly in an odd state:

ID                            HOSTNAME                        STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
24fr9wl3maq76rwcc4j6w28q0     ip-172-22-6-98.ec2.internal     Ready               Active              Reachable           18.06.1-ce
1jdanx00fjqily9ev6rtkz158     ip-172-22-7-254.ec2.internal    Ready               Active                                  18.06.1-ce
e3ez03wlf33aohmktfnfnwaym     ip-172-22-17-55.ec2.internal    Down                Active              Reachable           18.03.0-ce
fsvz7lhywdetcgunx005gndgq     ip-172-22-17-55.ec2.internal    Ready               Active              Unreachable         18.03.0-ce
xe4sd7jp9kbpln1ysfy1dojq3     ip-172-22-17-249.ec2.internal   Ready               Active                                  18.06.1-ce
blqwttvjaaxt8z1z79ohdb0le     ip-172-22-22-66.ec2.internal    Ready               Active              Leader              18.06.1-ce
x09n7onls3cutd4cu530o60i8     ip-172-22-34-115.ec2.internal   Ready               Active                                  18.06.1-ce
jadehezhrfgrazpjnr3i972gd *   ip-172-22-40-45.ec2.internal    Ready               Active              Reachable           18.06.1-ce
Every 2s: docker node ls                                                                                           2018-11-02 23:18:54

Notice ip-172-22-17-55.ec2.internal is listed twice. That's the new manager. It's registering as both Ready and Down.

docker info on the new manager yields:

docker info
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: fsvz7lhywdetcgunx005gndgq
 Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Is Manager: true
 Node Address: 172.22.17.55
 Manager Addresses:
  172.22.17.55:2377
  172.22.17.55:2377
  172.22.22.66:2377
  172.22.40.45:2377
  172.22.6.98:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.81-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-22-17-55.ec2.internal
ID: OBAN:2FHN:UX7C:BHOR:DIVY:27HI:SSAI:KSVU:NVXP:WJW7:VCX3:QWFF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 os=linux
 region=us-east-1
 availability_zone=us-east-1b
 instance_type=m4.large
 node_type=manager
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

~ $ docker service ls
Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I'm not confident a swarm init --force-new-cluster on this new node will succeed. I would think it doesn't have the service configuration, since it can't join and reach quorum.

@gartz is this the situation you were in when you forced a new cluster?
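
For the record, the command in question would be something like the following (only sensible on a node that still holds up-to-date raft data, since it rebuilds a single-manager cluster from whatever state that node already has locally):

docker swarm init --force-new-cluster --advertise-addr 172.22.17.55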

@gartz

gartz commented Nov 2, 2018 via email

@mateodelnorte

That was a typo on my part. The non-connecting manager, and the version we're attempting to move to, is 18.03.0-ce.

@mapcentia

mapcentia commented Nov 5, 2018

I also experience this issue on both Stable and Edge. I tried to downgrade Stable to 18.03.0-ce, but with no luck. I deploy using a docker-compose file with:

volumes:
    gc2core_var_www_geocloud2:
      driver: cloudstor:aws
      driver_opts:
        backing: shared

CURRENT STATE just keeps saying "Preparing [--] minutes ago".

EDIT 1:
Just got the service up and running on a clean install of T2 instances using Edge

EDIT 2:
When I tried to deploy with 2 replicas it took something like 19 minutes for one service to get running. One replica did throw context deadline exceeded: volume name must be unique, but all replicas got up and running eventually.

@gartz

gartz commented Nov 5, 2018 via email

@anasoler

Downgrading to 18.03.0-ce from 18.06.1-ce (where I was experiencing the same issue) worked for me too.

@dodgemich

In terms of NVMe support, is this getting addressed? (It seems like two different issues are being discussed in the comments.)

@FrenchBen handled #148 for the NVMe root volume - perhaps he has some insight on adding it to Cloudstor?

@kinghuang

kinghuang commented Nov 20, 2018

Yeah, I think the comments here are describing two different problems.

  1. Some users are having problems with Cloudstor on 18.06.1, regardless of whether NVMe volumes are being used. Downgrading to 18.03.0 appears to be the solution for these users.
  2. Others (myself included) want Cloudstor updated to handle NVMe volumes on current generation instances.

There's been zero communication from Docker about either problem, AFAIK.

@dodgemich

dodgemich commented Nov 21, 2018

Yeah, I think the comments here are describing two different problems.

1. Some users are having problems with Cloudstor on 18.06.1, regardless of whether NVMe volumes are being used. Downgrading to 18.03.0 appears to be the solution for these users.

2. Others (myself included) want Cloudstor updated to handle NVMe volumes on current generation instances.

There's been zero communication from Docker about either problem, AFAIK.

Agreed - my issue is (2). Not sure if it's worth cutting a new ticket to split them up, or how to get better info from Docker on when they'll address it. Without that, Cloudstor is basically on the path to retirement.

@kinghuang

That’s a good idea. I’ll create a separate issue for the second one (NVMe mounts on current-generation instances).

Isn’t Cloudstor also part of Docker EE on AWS (Docker Certified Infrastructure)?

@kinghuang

Created #184 for the second issue.

@daaru00

daaru00 commented Dec 19, 2018

Same error here: deploying a new stack raises a context deadline exceeded error, and updating the stack raises a similar error: context deadline exceeded: volume name must be unique.
The inability to create shareable volumes between instances creates enormous problems, and I can no longer use half of my services.

Any update on this issue?

@matthewmrichter

I'm hitting this now on t3 instances.

@a-marcel

I'm hitting this now on t3 instances.

Me too

@darkl0rd

darkl0rd commented Dec 29, 2018

All "Nitro" Based instances are affected, which make use of the new "/dev/nvme*" block devices.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#ec2-nitro-instances

Workarounds:

  • Do not use the new-generation instances; they use "/dev/nvme*" block device names, which currently don't seem to work with Cloudstor (e.g. use a T2 instead of a T3).
  • If you must use these instances, you can (temporarily) switch to the REX-Ray driver, which I can confirm already works with the new block device names.

If you are experiencing issues with an instance that does not use the new block device names, some people have reported that downgrading to the 18.03 driver alleviates the problem. I cannot personally confirm this, as I have only dealt with the former problem myself.
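
If you are not sure which camp an instance falls into, a quick check from the node itself (a sketch; the model string wording can vary by distro):

lsblk -d -o NAME,MODEL                                          # "Amazon Elastic Block Store" under an nvme* name means a Nitro/NVMe instance
curl -s http://169.254.169.254/latest/meta-data/instance-type   # t3, m5, c5, etc. are Nitro-based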

@matthewmrichter

matthewmrichter commented Dec 29, 2018

I was also one of the people who reported this issue when it popped up for REX-Ray as well - in case it helps with prompt resolution of the Cloudstor issue, here is the relevant issue in their GitHub: rexray/rexray#1252
