ecs-agent running but not connected after a while, generates AGENT errors trying to start tasks #316

Closed
workybee opened this Issue Feb 19, 2016 · 31 comments


Running 2015.09.f with ecs-agent version 1.8.0, we try to start about 100 tasks every 15 minutes.
This configuration seemed to run fine for a few hours after bringing up a fresh cluster of 4 m4.large instances. Then we start getting AGENT errors:
2016-02-19 09:11:46,172 [13767] ERROR ECS - Failed running task prod_new_collect_ads_-u_304671576282352 - reason [{u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/181f09d8-b9af-4062-9728-25a64c2288d0'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/716ac35d-b6c9-4610-b3f0-19c5385f272c'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/d27f9931-a69f-4a32-92d0-8a9889ad1412'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/fdae64b7-583a-4da3-baf1-9cf9e905e3c5'}]
Indeed, the console shows "Agent Connected" as false, but on the instances "docker ps" shows the agent still running (as the only thing running!).

At the suggestion of @samuelkarp in #313, ECS is configured to produce debug output.
Also configured with:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
to help manage disk space usage.
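For reference, on the ECS-optimized AMI these agent options live in /etc/ecs/ecs.config; a minimal sketch of our file (the cluster name here is a placeholder, not our real value):

```
# /etc/ecs/ecs.config -- environment for the ECS agent
ECS_CLUSTER=my-cluster
ECS_LOGLEVEL=debug
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
```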

The last several Error lines from one of the instances:

/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:57:23Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:1ac2aaabb6a98f201ef4489d0345a35a4370bfc757e3d6421d0ff91a6979a013}"
/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:57:23Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:59:32Z [DEBUG] Error getting stats module="stats" error="open /var/lib/docker/execdriver/native/1aee3569aa72bb5a09942b83c9aad54dd116e41101889555e6ebe3dfe5a0f604/state.json: no such file or directory" contianer="&{containerMetadata:0xc20817e460 ctx:0xc208320800 cancel:0x4fd120 statePath:/var/lib/docker/execdriver/native/1aee3569aa72bb5a09942b83c9aad54dd116e41101889555e6ebe3dfe5a0f604 statsQueue:0xc2083207c0 statsCollector:0xcb4ab0}"
/var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:07Z [DEBUG] Error getting stats module="stats" error="open /var/lib/docker/execdriver/native/4c3bf47e8edf4463f3abfed65505351e58c89acaec43b0782cb377fe6175cb41/state.json: no such file or directory" contianer="&{containerMetadata:0xc208203a30 ctx:0xc2083785c0 cancel:0x4fd120 statePath:/var/lib/docker/execdriver/native/4c3bf47e8edf4463f3abfed65505351e58c89acaec43b0782cb377fe6175cb41 statsQueue:0xc208378580 statsCollector:0xcb4ab0}"
/var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:20Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
 /var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:20Z [INFO] Error from acs; backing off module="acs handler" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-07:2016-02-19T07:23:05Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
 /var/log/ecs/ecs-agent.log.2016-02-19-07:2016-02-19T07:23:05Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-13:2016-02-19T13:28:48Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-13:2016-02-19T13:28:48Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
samuelkarp (Member) commented Feb 19, 2016

@workybee Are you just grepping for "error"? If so, you're excluding a bunch of context that would help us actually dig into what the Agent is doing.

Since it sounds like Docker is being responsive now (docker ps is working?), can you run docker kill -s USR1 ecs-agent and post the resulting stack dump here (you'll find it in the log file in /var/log/ecs)?


workybee commented Feb 19, 2016

Yes, I was just grepping for Error.
Sorry it took so long to get back to you, many fires today... ;-)

Attached is the result of "docker kill -s USR1 ecs-agent" on one of the instances:

ecs-agent_stacktrace_2016-02-19T2138.log.gz


samuelkarp (Member) commented Feb 19, 2016

@workybee This looks like the same thing I was seeing in @tschutte's log in #313. Can you let me know how many containers are listed in the data file? Something like this:

sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
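If jq isn't available on the instance, the same count can be sketched in Python against the same file (this assumes the `.Data.TaskEngine.Tasks[].Containers[]` layout that the jq filter above uses):

```python
# Count container entries in the agent's checkpoint file, mirroring the jq
# pipeline above (assumes the same .Data.TaskEngine.Tasks[].Containers[] layout).
def count_containers(data):
    tasks = data.get("Data", {}).get("TaskEngine", {}).get("Tasks", []) or []
    return sum(len(task.get("Containers", []) or []) for task in tasks)

# Usage on an instance:
#   import json
#   with open("/var/lib/ecs/data/ecs_agent_data.json") as f:
#       print(count_containers(json.load(f)))
```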
workybee commented Feb 19, 2016

Definite similarity, but my case might be more amenable to your lock-held theory. ;-)
After the yum install of jq mentioned above:

[ec2-user@ip-172-30-1-173 ~]$ sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
41

and certainly my attempts at "docker ps" or similar sometimes take "forever" just after we fire off multiple start tasks (total of about 100 every 15 minutes). Things seem to work for a while, then get wedged. (see #309 and #313 for more context)


andreausu commented Feb 22, 2016

We're having the same issue after upgrading to 2015.09.f.
We updated our nodes to the new AMI on Friday; today (after a couple of days) we're seeing issues deploying new code due to the disconnected agent (though the agent container is indeed up). The logs are showing these lines over and over:

2016-02-22T10:33:04Z [INFO] Creating poll dialer module="ws client" host="ecs-t-1.eu-west-1.amazonaws.com"
2016-02-22T10:33:05Z [WARN] Error creating a websocket client module="ws client" err="websocket: bad handshake"
2016-02-22T10:33:05Z [ERROR] Error connecting to TCS: {"AccessDeniedException":"Forbidden"}, websocket: bad handshake module="tcs handler"
2016-02-22T10:33:05Z [INFO] Error from tcs; backing off module="tcs handler" err="{"AccessDeniedException":"Forbidden"}, websocket: bad handshake"

We're running fewer than 10 containers on these nodes and we didn't start/stop tasks during those 2 days.

We've fixed it for now by doing a stop ecs and start ecs.


ejholmes (Contributor) commented Feb 23, 2016

We're seeing the same problem after upgrading to 1.8.0, though we also saw the same issue on 1.6.0 (just with more frequency now).


samuelkarp (Member) commented Feb 23, 2016

At this point, we believe this is the cause. We've just merged a fix to be included in our next release.


jpignata commented Feb 24, 2016

@samuelkarp Awesome, thank you. Is there a workaround here prior to the release of your next version? This is happening multiple times a day on our clusters and causing myriad problems. Is downgrading to the previous version the recommendation?


mmontagna commented Feb 24, 2016

@jpignata We downgraded to version 1.7 and haven't had any problems. For what it's worth, periodically killing the 1.8 containers also unsticks the service and didn't seem to cause any problems; we did that for a bit, but I'm not sure it's a good long-term idea.


ejholmes (Contributor) commented Feb 24, 2016

Also downgraded to 1.6 for the time being.


bilalaslamseattle commented Feb 25, 2016

@samuelkarp we deployed the dev branch to one of our clusters and it did not fix the problem. ECS is still reporting periodic AGENT disconnects :(


samuelkarp (Member) commented Feb 25, 2016

@bilalaslamseattle Are they the same symptoms as before? Can you provide logs at debug level (ECS_LOGLEVEL=debug) and the stack trace on a disconnected Agent (docker kill -s USR1 ecs-agent)?


bilalaslamseattle commented Feb 25, 2016

@samuelkarp roger that, I'll get these over to you.


jwerak commented Feb 25, 2016

@samuelkarp answering here for @bilalaslamseattle .

Is this the info you are looking for? https://gist.github.com/veverjak/3607a9dd1f141e31d5a8

Also, how does one set ECS_LOGLEVEL?
Are the env vars from /etc/init/ecs.conf passed to docker run? It doesn't seem so. Is there a simple way to pass an env var to docker from ecs-init, or do I have to stop the service and run the ecs-agent container manually using the docker run CLI?


samuelkarp (Member) commented Feb 25, 2016

@veverjak Yes, that's what I'm looking for. Is that from an Agent built with the dev branch (and the fix) or from the 1.8.0 version?

If you're using our ECS-optimized AMI, you can set ECS_LOGLEVEL in /etc/ecs/ecs.config, then run sudo stop ecs ; sudo start ecs to restart the Agent.


jwerak commented Feb 25, 2016

@samuelkarp yes, we are running the latest dev branch.
I am not seeing the errors now (they appear irregularly), so I will switch on debug mode and send you logs when the issue appears. Or are logs useful even without the error appearing?


samuelkarp (Member) commented Feb 25, 2016

There may not be lines that say "error", but logs and stack trace information if you run into the deadlock/state drift issue will be very useful. In our testing so far we haven't reproduced the deadlock when building from the dev branch.


jwerak commented Feb 25, 2016

@samuelkarp
Got it, here is the stacktrace from disconnected instance: https://gist.github.com/veverjak/73e05734c5c29a2257f4

and here are the debug logs: https://gist.github.com/veverjak/31327365a43115e71fb9


aaithal (Member) commented Feb 26, 2016

Hi @veverjak,

Apologies for asking you to confirm this again, but I looked up the information about the container instance on which you are facing this issue, and it seems to have a different agentHash than the one on the "dev" branch. It should have been computed as *305353a, corresponding to the latest commit, but I see that it is set to *fd7b600, which does not correspond to the latest commit.

Could you please confirm that the agent you are using does have the fix from the "dev" branch merged in?


jwerak commented Feb 26, 2016

Hi @aaithal,

No problem; we are actually building from this repo: https://github.com/appuri/amazon-ecs-agent/commits/build-agent


bilalaslamseattle commented Feb 26, 2016

@aaithal thanks for digging into this. This bug sort of makes ECS unusable ... deploys fail, etc. etc. Badness all around. Can the ECS team fast-track this please?


nicolai86 commented Feb 26, 2016

👍 we downgraded as well since this is breaking ECS for us very regularly.


juanrhenals (Contributor) commented Mar 3, 2016

The fix for this deadlock issue has been released as part of Agent v1.8.1. We also released a new AMI with the v1.8.1 agent bundled.

If you continue to run into issues or have new information, please do reopen this or a new issue.
For now, I don't think we have any additional action items for this issue on our end.


juanrhenals closed this Mar 3, 2016

workybee commented Mar 4, 2016

Refreshed our cluster with the new AMI (2015.09.g) and its ecs-agent v1.8.1, and we have been running smoothly ever since (about 24 hours), where we used to need to restart the ecs-agent every 30 minutes.
Thanks all for the fix.


jwerak commented Mar 7, 2016

@juanrhenals we are running ecs-agent v1.8.1, but not on the latest AMI; could that be the cause?


bilalaslamseattle commented Mar 8, 2016

@juanrhenals we are still seeing this issue. Like @veverjak, we are not on the latest AMI. Could this be the cause?


aaithal (Member) commented Mar 8, 2016

@veverjak Our tests prior to releasing the new version did not reproduce the agent deadlock with the fix. Could you please open a new issue to debug the issues that you are seeing with the agent? We were unable to detect the deadlock from the previous set of logs that you provided. We would like to know the following information from your instance to help us debug this better:

  1. ECS Agent logs at debug level (ECS_LOGLEVEL=debug) and the stack trace on a disconnected Agent (docker kill -s USR1 ecs-agent)
  2. ECS Agent Config, if you have any non-default customizations
  3. Output of curl 127.0.0.1:51678/v1/metadata on the EC2 Instance where you see the deadlock

Feel free to send those to aithal at amazon.com if you do not want to share the files on github.
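For step 3, the introspection endpoint can also be queried programmatically; a minimal Python sketch, assuming the agent's standard introspection server on localhost:51678:

```python
import json
import urllib.request

# Fetch the ECS agent's introspection metadata (the same data as
# `curl 127.0.0.1:51678/v1/metadata` in step 3 above).
def agent_metadata(host="127.0.0.1", port=51678, timeout=5):
    url = "http://{}:{}/v1/metadata".format(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# Usage on the instance where you see the deadlock:
#   meta = agent_metadata()
#   print(meta.get("Version"), meta.get("ContainerInstanceArn"))
```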

Also, are you still using https://github.com/appuri/amazon-ecs-agent/commits/build-agent to build the agent? I took a shot at building this image, but it seems like there are some unresolved merge conflicts, which would cause build failures.


aaithal (Member) commented Mar 8, 2016

@bilalaslamseattle Could you please confirm whether you are seeing these issues on ECS Agent v1.8.1? I'd also encourage you to create a new issue with more details about the failure, as mentioned in my previous post.


jwerak commented Mar 8, 2016

@aaithal thanks, I will gather that info once we experience a disconnect again and create a ticket for it.
Also, sorry about the merge conflicts in the agent repo; I've cleared them out and we are now on top of the 1.8.1 branch here: https://github.com/appuri/amazon-ecs-agent


jhovell commented Jun 2, 2016

@samuelkarp I think I am experiencing this issue with ECS Agent 1.9.0 / ECS AMI 2016.03.b. I have an instance that dumps the following to the log:

2016-06-02T11:08:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:13:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:13:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:13:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:18:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:27:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:27:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:27:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:28:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:38:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:45:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:45:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:45:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:48:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:55:41Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
2016-06-02T11:55:42Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:58:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING

The instance is still running a my-app container (as shown by the log), but the agent is disconnected from the cluster. I could enable debug logging and restart the agent (not sure if that will fix the problem) and post debug logs if that helps.

Anything else this could be?


samuelkarp (Member) commented Jun 24, 2016

@jhovell Yes, debug logs would definitely help. If you're still experiencing problems, please open a new issue.

