1.14.2 causing container instances to grind to a halt #833

Closed
tysonmote opened this Issue Jun 7, 2017 · 30 comments

tysonmote commented Jun 7, 2017

1.14.2, which was just released a few hours ago, appears to be causing major issues in one of our ECS clusters. We have a large number of instances running 1.14.1 and 1.14.2. Many of the instances running 1.14.2 are exhibiting an issue where the instance accepts as many tasks as possible but is never able to transition them from the "PENDING" state to the "RUNNING" state.

We've opened an AWS Support ticket as well but I wanted to post this here in case anyone else is encountering the same issue. We're digging through logs to see what's going on now.

tysonmote commented Jun 8, 2017

Our first clue is that we see "Pulling container ... (NONE->RUNNING) serially" log lines but not "Finished pulling container" log lines for all of the affected tasks. This means that we're being blocked somewhere in this method:

func (engine *DockerTaskEngine) serialPull(task *api.Task, container *api.Container) DockerContainerMetadata {
	seelog.Debugf("Attempting to obtain ImagePullDeleteLock to pull image - %s. Task: %v", container.Image, task)
	ImagePullDeleteLock.Lock()
	seelog.Debugf("Acquired ImagePullDeleteLock, start pulling image - %s. Task: %v", container.Image, task)
	defer seelog.Debugf("Released ImagePullDeleteLock after pulling image - %s. Task: %v", container.Image, task)
	defer ImagePullDeleteLock.Unlock()
	pullStart := time.Now()
	defer func(startTime time.Time) {
		seelog.Infof("Finished pulling container %v in %s. Task: %v", container.Image, time.Since(startTime).String(), task)
	}(pullStart)
	return engine.pullAndUpdateContainerReference(task, container)
}

Specifically, we're blocked on ImagePullDeleteLock.Lock(). To confirm this, we killed the ECS agent with the ABRT signal to get a full dump of all goroutines, which showed that we were blocked on that lock.

Fortunately, restarting the ECS agent appears to fix the issue (tasks go from PENDING to RUNNING successfully), but the issue will likely just crop up again because it's a race condition / deadlock.
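
For reference, the goroutine dump can be captured roughly like this (a sketch, assuming the agent runs in a container named ecs-agent, which may differ on CoreOS setups; the Go runtime prints all goroutine stacks when it receives SIGABRT):

docker kill --signal=ABRT ecs-agent    # agent exits and dumps every goroutine stack to stderr
docker logs --tail=1000 ecs-agent      # the dump shows up in the container's log output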

tejasmanohar commented Jun 8, 2017

(I'm investigating this issue with @tysonmote)

See the stack of one of our (many) ECS agents facing the issue here.

Contributor

nmeyerhans commented Jun 8, 2017

Thanks for sending us the report and stacktraces. It's interesting that the agent is reporting that it's pulling serially. What OS/AMI are you using? What Docker version?

If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send me the archive directly at nmeyerha at amazon.com. We'll also dig into your stacktrace, which may provide enough detail on its own.
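
(For anyone following along: the collector is typically run directly on the instance, roughly as below. The script location is an assumption on my part and may have moved, so check the awslabs/ecs-logs-collector repository.)

curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ./ecs-logs-collector.sh    # writes an archive of agent and Docker logs to the current directory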

tysonmote commented Jun 8, 2017

Here are some details about the machines we're running:

Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.10.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.308 GiB
...

We're using AMI CoreOS-stable-1068.10.0-hvm (ami-7d11c51d). I'll email you the logs from one of our instances that ran into the problem, as well.

Thanks!

tejasmanohar commented Jun 8, 2017

(Unfortunately, CoreOS is not supported by the log collector.)

Edit: my bad, there was previously an operating system check that would fail.

Contributor

nmeyerhans commented Jun 8, 2017

Thanks, I've received the logs.

tysonmote commented Jun 8, 2017

@tejasmanohar I was able to run it. I've sent that log archive as well.

tysonmote commented Jun 8, 2017

I just realized that I had a typo in the original issue message that said the problem was in 1.14.1, not 1.14.2. I fixed that.

Member

samuelkarp commented Jun 8, 2017

Hi @tysonmote and @tejasmanohar,

We've confirmed that a race condition exists in the ECS agent that can lead to a deadlock. We believe that the race condition is more likely in 1.14.2, but exists in 1.14.1 as well. We're actively working to fix this race condition and deadlock.

The specific race condition involves two locks that both relate to images: the ImagePullDeleteLock and imageManager.updateLock. The ImagePullDeleteLock is used for serializing image pulls (on older versions of Docker) and for ensuring that we don't delete an image at the same time we're pulling it (this is important for the task launch workflow, since we don't want to successfully pull the image and then fail to create the container because it has gone missing between when pull finished and container creation started). The imageManager.updateLock is used for synchronization of the internal data structure of the imageManager, which tracks images and selects images for deletion.

When we're pulling an image, the ImagePullDeleteLock is acquired first for the actual pull and then the imageManager.updateLock is acquired when recording the metadata about the image. When we're deleting an image, the reverse lock acquisition order occurs. This mismatch in lock acquisition order can create a deadlock, since one goroutine can hold ImagePullDeleteLock and block while trying to acquire imageManager.updateLock while the other goroutine can hold imageManager.updateLock and block while trying to acquire ImagePullDeleteLock.

This became significantly more likely to occur in 1.14.2 as the scope of imageManager.updateLock was modified to fix a different race condition involving concurrent writes to the imageManager's internal data structure (see #707 and #743).

We're working on addressing the lock acquisition order mismatch now and will keep this issue open until that fix is released.

Thanks,
Sam
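
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the inverted acquisition order described above (not the agent's actual code; pullLock and updateLock merely stand in for ImagePullDeleteLock and imageManager.updateLock). Running it ends with the Go runtime reporting "fatal error: all goroutines are asleep - deadlock!":

package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	pullLock   sync.Mutex // stand-in for ImagePullDeleteLock
	updateLock sync.Mutex // stand-in for imageManager.updateLock
)

// pullImage mimics the pull path: pull lock first, then update lock.
func pullImage() {
	pullLock.Lock()
	defer pullLock.Unlock()
	time.Sleep(10 * time.Millisecond) // "pulling" the image
	updateLock.Lock()                 // record image metadata
	defer updateLock.Unlock()
	fmt.Println("pulled")
}

// deleteImage mimics the cleanup path: update lock first, then pull lock.
func deleteImage() {
	updateLock.Lock()
	defer updateLock.Unlock()
	time.Sleep(10 * time.Millisecond) // select an image to delete
	pullLock.Lock()                   // ensure no pull is in flight
	defer pullLock.Unlock()
	fmt.Println("deleted")
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); pullImage() }()
	go func() { defer wg.Done(); deleteImage() }()
	wg.Wait() // each goroutine blocks on the lock the other one holds
}

The standard fix for this pattern is to make every code path acquire the two locks in the same order.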

Member

samuelkarp commented Jun 8, 2017

Correction: This race condition was introduced in 1.14.2. 1.14.1 does not have this deadlock, but does have a potential panic as described in #707.

tejasmanohar commented Jun 8, 2017

@samuelkarp Got it, thanks! We'll stay on 1.14.1 for now. FWIW, if this is not going to be fixed ASAP, I'd recommend reverting the code that introduced the deadlock and re-releasing, because a panic (from which the agent is likely restarted by a process manager) is much, much better than a mysterious deadlock. Also, #834 may be helpful.

Member

samuelkarp commented Jun 8, 2017

@tejasmanohar Yep, we're exploring both options right now.

samuelkarp added this to the 1.14.3 milestone Jun 8, 2017

Member

samuelkarp commented Jun 8, 2017

amazon/amazon-ecs-agent:latest on Docker Hub now points to 1.14.1 again. We have a fix forward in progress here: #836.

tejasmanohar commented Jun 8, 2017

Good to hear. Thanks for acting quickly, @samuelkarp & team!

sergiofigueras commented Jun 12, 2017

Hello @samuelkarp,

Any ETA for when that fix will be available and a new version released?

Thanks!

Member

vsiddharth commented Jun 12, 2017

@sergiofigueras We already have a fix merged on the dev branch.
The release process is under way, and the new version should be out as soon as our testing is complete.

sharmaanshul2102 commented Jun 13, 2017

I am still getting 1.14.2 while fetching latest:

[ec2-user@ip-172-31-91-192 ~]$ docker run amazon/amazon-ecs-agent:latest version
2017-06-12T23:59:05Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.2 (35acca3)
nehalrp commented Jun 13, 2017

@optimisticanshul You would need to re-pull it:

ec2-user@ip-172-30-83-142 ~ » docker pull amazon/amazon-ecs-agent:latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest
ec2-user@ip-172-30-83-142 ~ » docker run -it amazon/amazon-ecs-agent:latest                                                                                          
2017-06-13T00:01:46Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)
sharmaanshul2102 commented Jun 13, 2017

@nehalrp It is a new EC2 instance.

masutaka commented Jun 13, 2017

@optimisticanshul

I am still getting 1.14.2 while fetching latest

Me too. So I've added the following workaround to our user data:

ECS_AGENT_VERSION=v1.14.1
docker pull amazon/amazon-ecs-agent:$ECS_AGENT_VERSION
docker tag amazon/amazon-ecs-agent:$ECS_AGENT_VERSION amazon/amazon-ecs-agent:latest

This avoids automatically updating the ecs-agent.

Member

vsiddharth commented Jun 13, 2017

We have tagged v1.14.1 as latest on Docker Hub.

If you pull amazon/amazon-ecs-agent:latest, you should get v1.14.1:

docker pull amazon/amazon-ecs-agent                                                                                                                                                                                                    
Using default tag: latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest

docker run amazon/amazon-ecs-agent                                                                                                                                                                                                     
2017-06-13T15:57:32Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

I'm also still getting v1.14.2 when I autoscale using the designated newest ECS AMI (ami-62745007 in us-east-2). Our ECS clusters autoscale all the time so we're not pulling the agent container manually ourselves. When the AMI starts, in ecs/ecs-init.log:

[root@ip-10-0-86-11 ecs]# more ecs-init.log
2017-06-13T14:30:03Z [INFO] pre-start
2017-06-13T14:30:03Z [INFO] Downloading Amazon EC2 Container Service Agent
2017-06-13T14:30:03Z [DEBUG] Downloading published md5sum from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar.md5

I guess whatever the AMI does isn't just "pull latest."

Member

samuelkarp commented Jun 13, 2017

I guess whatever the AMI does isn't just "pull latest."

That's correct. Each of the ECS-optimized AMIs has a specific version of the agent that is pre-cached. Similarly, when you install ecs-init on a regular Amazon Linux instance, it'll grab a specific version of the agent rather than "latest".
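
If you need to confirm which agent build an instance is actually running, the agent's introspection endpoint reports it; a quick check from the instance (assuming introspection is enabled, which it is by default):

curl -s http://localhost:51678/v1/metadata    # the Version field shows e.g. "Amazon ECS Agent - v1.14.1 (467c3d7)"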

MiguelMoll added a commit to convox/rack that referenced this issue Jun 13, 2017

ernestm commented Jun 13, 2017

Right... But http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI_launch_latest.html and http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html say this AMI should be getting me 1.14.1, and it's not, so I'm not sure what I'm supposed to do as an ECS AMI consumer to downgrade agent versions. Though maybe some of these recent commits are for that?

MiguelMoll added a commit to convox/rack that referenced this issue Jun 13, 2017

Member

vsiddharth commented Jun 14, 2017

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock.

The new AMIs can be found here.

sharmaanshul2102 commented Jun 29, 2017

@vsiddharth @samuelkarp Even in 1.14.3, my tasks are stuck in the PENDING state for hours.

Member

vsiddharth commented Jun 29, 2017

@optimisticanshul The issue discovered with agent version v1.14.2 should be mitigated by v1.14.3. Would it be possible for you to share the logs for further investigation?

Please have a look here for the logs we require.

sharmaanshul2102 commented Jun 30, 2017

Definitely. Where do I send them?

sharmaanshul2102 commented Jun 30, 2017

Steps to reproduce:

  1. ECS cluster with 2 instances.
  2. Deploy a sample service.
  3. Redeploy.

In the ECS agent log:

./ecs-agent.log.2017-06-30-03:2017-06-30T03:55:43Z [INFO] Stopping container module="TaskEngine" task="prod-whatever:108 arn:aws:ecs:us-east-1:ACCOUNT-ID:task/1730060e-66e6-4795-ac20-551f4d09dd7a, Status: (RUNNING->STOPPED) Containers: [prod-whatever (RUNNING->STOPPED),]" container="prod-whatever(ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME) (RUNNING->STOPPED)"

But the output of docker ps shows:

274ba55bfca5        ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME   "/bin/sh -c 'bundl..."   25 minutes ago      Up 25 minutes       80/tcp, 0.0.0.0:10283->8080/tcp   ecs-prod-whatever-108-prod-whatever-feb2cdae8894f389d001

The new tasks are stuck in the PENDING state. If I manually run docker stop 274ba55bfca5, it deploys the new tasks.

I am not sure whether it's related to this issue or something else.

Member

vsiddharth commented Jul 5, 2017

@optimisticanshul You could send them over to sidvin at amazon.com.
