Service fails to start with Cloudstor EBS Volume attached #157
I did some testing in [...]. Just to make sure it wasn't anything else, I removed the [...], and I've also tried letting the service definition create the volume on start; that didn't seem to help either. Here's the updated configuration: [...]
The plugin itself seems to be working fine.
|
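The poster's configuration was not captured above. As a rough sketch only, a stack definition of the shape being described (service and volume names are hypothetical; the `driver_opts` follow the Cloudstor options documented for Docker for AWS: `backing`, `size`, `ebstype`) might look like:

```yaml
version: "3.3"
services:
  app:
    image: nginx:alpine        # placeholder image
    volumes:
      - appdata:/data          # placeholder mount path
    deploy:
      replicas: 1
volumes:
  appdata:
    driver: "cloudstor:aws"
    driver_opts:
      backing: relocatable     # EBS-backed volume
      size: "200"              # GiB; the report below mentions 200GB volumes
      ebstype: gp2
```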
@ddebroy Do you have any thoughts? I'm dead in the water for our deployment as this doesn't appear to work as advertised. |
Same problem -
|
Hello everyone.
Every time I run a deploy command in swarm, the container simply does not come up, without giving an error or anything. The command is issued from a manager and the task is scheduled on a worker; when the task runs on the manager itself, it works normally. When I take out the volume mount, the container comes up normally. I tested other plugins like this one and had the same problem. Cloudstor creates the EBS volume normally, without any problem. The versions of the plugins are: [...]
@ddebroy, can you help us? Thank you! |
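The version list itself did not survive above. For anyone comparing setups, the installed plugin versions can be listed on a node with `docker plugin ls`; the output below is illustrative only:

```sh
docker plugin ls
# ID             NAME            DESCRIPTION                       ENABLED
# 8ce84a9eebd5   cloudstor:aws   cloud storage plugin for Docker   true
```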
I have the same issue with docker4aws 18.03 (stable) and 18.04 (edge) CloudFormation templates. I didn't have the issue with docker4aws 17.12 (edge). |
Any news on this? |
I ran into similar issues using it with ECS. I found that it worked with T2s and C4s, but it would fail in this manner with C5/M5... might help debug the root issue. |
@dodgemich you are my hero!! I spent two days trying to understand why REX-Ray and Cloudstor don't work on my shiny new t3 cluster. And I just have to migrate it to t2. |
Maybe #148 is related. I get issues like the above, and also the mount point /dev/xvdf already existing when trying to mount. |
Having the exact same issue with rexray/efs... Did anyone manage to find a solution? |
@lepetitpierdol you have to use instances from previous generations: t2, c4, and so on. It looks like the latest t3 and c5 have new disk controllers that don't work with REX-Ray/Convoy/Cloudstor. |
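One way to act on that advice in a swarm, sketched under the assumption that the `instance_type` label visible in the `docker info` output later in this thread is set as an engine label on Docker for AWS nodes:

```sh
# Pin a cloudstor-backed service to a pre-Nitro instance type so the
# plugin sees the /dev/xvd* device names it expects. The service name,
# image, volume name, and instance type here are all placeholders.
docker service create \
  --name appdata-service \
  --constraint 'engine.labels.instance_type == t2.large' \
  --mount type=volume,source=appdata,target=/data,volume-driver=cloudstor:aws \
  nginx:alpine
```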
No luck for me, I'm having issues with T2 on docker4aws 18.06.1 (stable) and 18.01 (edge) when mounting volumes using cloudstor. |
In my case, I got this error when one of the containers got stuck and could not be stopped. It was holding a reference to a volume, so a new container could not be started. I resolved this by rebooting the host VM. |
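Before resorting to a reboot, it may be worth finding and force-removing the container that is pinning the volume; a sketch, with `appdata` as a placeholder volume name:

```sh
# List all containers, including stopped ones, that reference the volume
docker ps -a --filter volume=appdata

# Force-remove the stuck container, then let Swarm reschedule the task
docker rm -f <container-id>
```

(In the case described above only a reboot helped, so this is not guaranteed to work when the container is truly wedged.)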
There have been some PRs in REX-Ray to handle the new NVMe device names (rexray/rexray#1233, rexray/rexray#1252). I've run the edge release successfully to create and mount EBS volumes on current-generation instances. We need a similar change in Cloudstor. I really wish Docker would at least give some indication of whether they're even going to address this issue. Or open source the code so that we can do something about it. |
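For context on what such a change has to do: on Nitro instances, the EBS volume requested at, say, /dev/xvdf actually attaches as an NVMe device, and the plugin has to recover that mapping itself. One place the mapping is visible is the udev by-id symlinks (the volume ID and device name below are illustrative):

```sh
ls -l /dev/disk/by-id/
# nvme-Amazon_Elastic_Block_Store_vol0123456789abcdef0 -> ../../nvme1n1
```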
@kinghuang you couldn't have said it better. I tried REX-Ray, but it's not enough for my use case. I'm using Cloudstor on Amazon ECS at the moment, but I'm forced to use the old instance types. @brawong, @joeabbey sorry to mention you guys, but do you have any feedback on when NVMe devices will be supported in Cloudstor, so we can use it on the new AWS EC2 generations (t3, m5, c5, etc.)? |
I am trying to create volumes and I am running into the same problem. My cluster is based on T2 instances, so that does not seem to be the source of the problem. Docker version 18.06.1-ce, build e68fc7a. |
Version 18.03.0 works fine. I rolled back my stack and have no more issues with the cloudstor plugin.
—
Gabriel Reitz Giannattasio
|
I'm experiencing the same problem on 18.06.1-ce. I restarted the EC2 instances, and after the reboot the swarm ended up with the master and a couple of workers out of the swarm. Then all the services went up (after recreating the stack), but now I have another problem, which I think is related, though I have no evidence. |
Anybody find a solution here? Just started seeing this issue. |
I haven't found any solutions for Cloudstor. I've started to use REX-Ray, but it has the downside that it doesn't copy EBS volumes between availability zones. We really need Docker to provide an answer. |
thanks @kinghuang. any tips or pointers to documentation on REX-Ray, in case we need to go that route? @gartz was rolling back your stack as easy as running the cloudformation template with version 18.03.0 specified? |
@mateodelnorte yes, it rolled back, but I needed to log in to the new manager and force-initialize it to get it working; after that, the workers and other managers started working again. I also edited my CloudFormation template to add EFS support to N. California (it's disabled in the original, but N. California supports it). |
Currently attempting to update our CloudFormation template from 18.06.1 to 18.03.1. Our new manager came online but is clearly in an odd state:

```
ID                          HOSTNAME                       STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
24fr9wl3maq76rwcc4j6w28q0   ip-172-22-6-98.ec2.internal    Ready   Active        Reachable       18.06.1-ce
1jdanx00fjqily9ev6rtkz158   ip-172-22-7-254.ec2.internal   Ready   Active                        18.06.1-ce
e3ez03wlf33aohmktfnfnwaym   ip-172-22-17-55.ec2.internal   Down    Active        Reachable       18.03.0-ce
fsvz7lhywdetcgunx005gndgq   ip-172-22-17-55.ec2.internal   Ready   Active        Unreachable     18.03.0-ce
xe4sd7jp9kbpln1ysfy1dojq3   ip-172-22-17-249.ec2.internal  Ready   Active                        18.06.1-ce
blqwttvjaaxt8z1z79ohdb0le   ip-172-22-22-66.ec2.internal   Ready   Active        Leader          18.06.1-ce
x09n7onls3cutd4cu530o60i8   ip-172-22-34-115.ec2.internal  Ready   Active                        18.06.1-ce
jadehezhrfgrazpjnr3i972gd * ip-172-22-40-45.ec2.internal   Ready   Active        Reachable       18.06.1-ce
```

Notice ip-172-22-17-55.ec2.internal is listed twice. That's the new manager. It's registering as both Ready and Down.

`docker info` on the new manager yields:

```
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: fsvz7lhywdetcgunx005gndgq
 Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Is Manager: true
 Node Address: 172.22.17.55
 Manager Addresses:
  172.22.17.55:2377
  172.22.17.55:2377
  172.22.22.66:2377
  172.22.40.45:2377
  172.22.6.98:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.81-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-22-17-55.ec2.internal
ID: OBAN:2FHN:UX7C:BHOR:DIVY:27HI:SSAI:KSVU:NVXP:WJW7:VCX3:QWFF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 os=linux
 region=us-east-1
 availability_zone=us-east-1b
 instance_type=m4.large
 node_type=manager
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```

And `docker service ls` fails:

```
~ $ docker service ls
Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

I'm not confident a `swarm init --force-new-cluster` on this new node will result in success. I would think it doesn't have service configuration, since it can't join and make quorum. @gartz is this the situation you were in when you forced a new cluster?
|
No, you're in a new situation. The version I'm using that works is `18.03.0`, not `18.03.1`. If you get the context deadline exceeded error, DO NOT run `swarm init --force-new-cluster`; the problem seems to be in the cloudstor plugin, and you might lose your data by forcing a new cluster without being able to communicate with EFS correctly.
|
That was a typo on my part. The non-connecting manager, and the upgrade version we're attempting to move toward, is [...]. |
I also experience this issue on both Stable and Edge. I tried to downgrade Stable to [...], with the following volume definition:

```yaml
volumes:
  gc2core_var_www_geocloud2:
    driver: cloudstor:aws
    driver_opts:
      backing: shared
```

The task's CURRENT STATE just keeps saying "Preparing [--] minutes ago".
EDIT 1: [...]
EDIT 2: [...] |
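When a task sits in Preparing like that, the full task status, including any untruncated error from the volume plugin, can usually be pulled out with (the service name here is a placeholder):

```sh
docker service ps --no-trunc my_stack_my_service
```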
In my case, I could downgrade in one of the CloudFormation stacks and the problem was solved. In the other stack, the downgrade by itself didn't fix the problem, so I created a new CloudFormation stack using version 18.03.0-ce, then moved the data from the broken EFS to the new EFS by mounting both manually in a temporary EC2 instance. Finally, I started my Docker services in the new stack, and they detected the folders in the EFS and worked.

Don't forget that you need to use the CloudFormation file from `18.03.0-ce`; just changing the version in the current file won't change the AMI used to spawn instances.

I hope this information helps. It's a very frustrating problem, hard to detect and hard to fix.
|
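As a sketch of that downgrade path (the stack name is a placeholder, and the template URL must be the actual 18.03.0-ce Docker.tmpl location from the Docker for AWS release notes; the one below is deliberately not a real URL):

```sh
# Point the stack at the 18.03.0-ce template file itself, so the AMI
# mappings are downgraded along with the engine version.
aws cloudformation update-stack \
  --stack-name my-docker-swarm \
  --template-url "https://<editions-bucket>/aws/stable/Docker.tmpl" \
  --capabilities CAPABILITY_IAM
# Add --parameters ParameterKey=<name>,UsePreviousValue=true for each
# existing template parameter you want to carry over.
```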
Downgrading to 18.03.0-ce from 18.06.1-ce (where I was experiencing the same issue) worked for me too. |
In terms of NVMe support, is this getting addressed? (There seem to be two issues discussed in the comments.) @FrenchBen handled #148 for root NVMe; perhaps he has some insight on adding it to Cloudstor? |
Yeah, I think the comments here are describing two different problems: (1) services failing to start with Cloudstor volumes after upgrading past 18.03.0-ce, and (2) Cloudstor not handling the NVMe device names on current-generation (Nitro) instances.
There's been zero communication from Docker about either problem, AFAIK. |
Agreed. My issue is (2). Not sure if it's worth cutting a new ticket to split them up, or how to get better info from Docker on when they'll address it. Without that, Cloudstor is basically on the path to retirement. |
That’s a good idea. I’ll create an issue for the second issue (NVMe mounts on current generation instances). Isn’t Cloudstor also part of Docker EE on AWS (Docker Certified Infrastructure)? |
Created #184 for the second issue. |
Same error here; deploying a new stack raises an error. Any update on this issue? |
I'm hitting this now on t3 instances. |
Me too |
All "Nitro" Based instances are affected, which make use of the new "/dev/nvme*" block devices. Workarounds:
When you are experiencing issues with a current generation instance; e.g. one that does not use the new block device names yet - some have reported that downgrading to the 18.03 driver alleviates the problem. I can not personally confirm this, as I have only dealt with the former problem myself. |
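A quick way to tell which case a given node falls into is to look at its block device names:

```sh
# Xen-based instances (t2, c4, m4, ...) list EBS volumes as xvd* devices;
# Nitro instances (t3, c5, m5, ...) list them as nvme*n1 instead.
lsblk -o NAME,SIZE,TYPE
```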
I was also one of the people who reported this issue when it popped up for REX-Ray as well; in case it helps with prompt resolution of the Cloudstor issue, here is the relevant issue in their GitHub: rexray/rexray#1252 |
Expected behavior
Service starts with the EBS volume attached
Actual behavior
My assumption is that it's taking too long to snapshot and load the EBS volume for a specific availability zone, and it therefore times out.
Note: the EBS volumes are 200GB; however, they're currently empty.
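For reference, a volume of the kind described would have been created along these lines (the name is hypothetical; the options are the documented Cloudstor ones):

```sh
docker volume create \
  -d "cloudstor:aws" \
  --opt backing=relocatable \
  --opt size=200 \
  --opt ebstype=gp2 \
  appdata
```

With relocatable backing, moving a volume to another availability zone goes through a snapshot-and-recreate step, which is consistent with the timeout hypothesis above.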
The initial error is this:
After subsequent retries to start the service, I get this error:
The service never seems to start.
Information
Docker-diagnose: 1527092193-JGugtUgVNBmvU7S8tXn0mV4ryIhPF4zc
Volumes created:
AWS Region: ap-southeast-1
Service Creation Setup: