ECS Agent disconnected problem #103

Closed
rossf7 opened this Issue Jun 8, 2015 · 5 comments

rossf7 commented Jun 8, 2015

We're using ECS for force12.io, our demo of microscaling. We're seeing intermittent problems where one of our container instances stops responding for between 30 and 60 seconds. During this time the Agent Connected flag in the ECS web console is false, and we also see the two errors below in the agent logs.

2015-06-08T15:06:09Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:06:09Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"

This error is occurring 6 or 7 times an hour on each container instance. Here are the occurrences between 15:00 and 16:00 UTC today.

2015-06-08T15:06:09Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:06:09Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:14:49Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.18.202:443: use of closed network connection"
2015-06-08T15:14:49Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.18.202:443: use of closed network connection"
2015-06-08T15:23:31Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:23:31Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:32:08Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:32:08Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:39:20Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:39:20Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:48:08Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:48:08Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.19.128:443: use of closed network connection"
2015-06-08T15:55:56Z [ERROR] Error getting message from acs module="acs client" err="read tcp 54.239.20.16:443: use of closed network connection"
2015-06-08T15:55:56Z [INFO] Error from acs; backing off module="acs handler" err="read tcp 54.239.20.16:443: use of closed network connection"

I've uploaded the full agent log for this hour to S3.
Our instances are running amzn-ami-2015.03.b-amazon-ecs-optimized (ami-d0b9acb8).

Please let me know if you need any further information or additional logs.

Thanks

Ross

euank added a commit to euank/amazon-ecs-agent that referenced this issue Jun 9, 2015

ACS: Handle heartbeats vs idle correctly
Previously a heartbeat message was required to consider the channel
active. In reality, heartbeat messages were only sent when the channel
was otherwise inactive and no other messages were being sent.
This change avoids treating a lack of heartbeats as an idle channel and
closing it unless no other messages are arriving either.

In addition, this tweaks how backoffs are handled (time, resets, etc.) a
bit to be more forgiving of these sorts of issues (where the connection
is lost but can be re-established).

Relates to aws#103
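
A minimal Go sketch of the behaviour that commit message describes, assuming a stream of ACS messages and an idle timeout; the names and the five-minute value below are hypothetical and are not the agent's actual code:

```go
package main

import (
	"log"
	"time"
)

// idleTimeout is an assumed value for illustration, not the agent's real setting.
const idleTimeout = 5 * time.Minute

// readMessages drains msgs until the stream is closed or goes idle.
// Any message, heartbeat or otherwise, counts as activity and resets the
// idle timer; only a complete lack of traffic is treated as an idle channel.
// Returning signals the caller to back off and reconnect.
func readMessages(msgs <-chan string) {
	timer := time.NewTimer(idleTimeout)
	defer timer.Stop()

	for {
		select {
		case m, ok := <-msgs:
			if !ok {
				log.Println("connection closed; backing off before reconnecting")
				return
			}
			// Reset the idle timer on every message, not just heartbeats.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(idleTimeout)
			log.Println("received:", m)
		case <-timer.C:
			log.Println("no messages within the idle timeout; reconnecting")
			return
		}
	}
}

func main() {
	msgs := make(chan string, 2)
	msgs <- "HeartbeatMessage"
	msgs <- "PayloadMessage"
	close(msgs)
	readMessages(msgs)
}
```

The backoff tweak mentioned in the commit would live in the caller: after readMessages returns, wait an increasing delay before reconnecting, and reset that delay once a connection has stayed healthy for a while.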

euank added a commit to euank/amazon-ecs-agent that referenced this issue Jun 9, 2015

ACS: Handle heartbeats vs idle correctly

Contributor

euank commented Jun 9, 2015

Thanks for reporting the issue, Ross.
It manifests for me fairly reliably when I start tasks quickly (at least one a minute) for extended periods.
I think I've figured out what's happening and I'm working on a fix (see the commit above).

Again, thanks for reporting!
Best,
Euan

euank added the kind/bug label Jun 9, 2015


rossf7 commented Jun 9, 2015

Hi Euan,
Great, thanks for this. We're happy to test the fix in the dev branch once it's ready, if that would be helpful.
Cheers

Ross

euank added a commit to euank/amazon-ecs-agent that referenced this issue Jun 10, 2015

ACS: Handle heartbeats vs idle correctly

euank referenced this issue Jun 10, 2015

"Error from acs" #108 (Closed)

euank added a commit to euank/amazon-ecs-agent that referenced this issue Jul 9, 2015

ACS: Handle heartbeats vs idle correctly

Contributor

euank commented Jul 9, 2015

@rossf7 apologies for the delay in getting that merged. It's now in the current dev branch and should be in the next release. If you'd like to test it, that'd be awesome!

Thanks again for reporting the issue and bearing with me,
Euan


rossf7 commented Jul 13, 2015

Hi @euank,
Great, thanks for this! I've done some testing and it looks much better.

I've built the agent from the dev branch and deployed it to our staging rig. The 3 container instances are running CoreOS-beta-717.1.0-hvm (ami-7f9a6214).

I'm still seeing the warning message below, but I think this is expected?

2015-07-13T17:06:53Z [WARN] Could not submit a container state change module="api client" change="{TaskArn:arn:aws:ecs:us-east-1:187012023547:task/634649e8-47c1-4de3-b56e-915afd911166 ContainerName:priority1 Status:4 Reason: ExitCode:0xc2089fe068 PortBindings:[] SentStatus:RUNNING}" err="Post https://ecs.us-east-1.amazonaws.com/: read tcp 54.239.20.154:443: use of closed network connection"

We're no longer seeing the problem where the agents stop responding. The demo itself is now tracking our random demand metric most of the time. I'll continue to monitor over the next couple of days.

Thanks

Ross


Member

samuelkarp commented Aug 4, 2015

We've released v1.3.1, which we believe has addressed this problem. Please feel free to reopen or file a new issue if the problem persists.

Ross, that warning should be benign, as the Agent should open a new connection and retry the submission.

Thanks!
Sam
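
For context on why that warning is benign, here is a rough, hypothetical sketch of the submit-and-retry behaviour Sam describes; submitWithRetry and the simulated failure are illustrative only, not the agent's real API:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errClosedConn mirrors the error shown in the WARN line above.
var errClosedConn = errors.New("use of closed network connection")

// submitWithRetry retries a state-change submission a few times with a
// growing delay; in practice each attempt goes out on a fresh HTTP
// connection, so a stale "closed network connection" error is recoverable.
func submitWithRetry(submit func() error) error {
	backoff := time.Second
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if err = submit(); err == nil {
			return nil
		}
		fmt.Printf("[WARN] Could not submit a container state change: %v (attempt %d, retrying)\n", err, attempt)
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	calls := 0
	// The first attempt fails the way the log shows; the retry succeeds.
	err := submitWithRetry(func() error {
		calls++
		if calls == 1 {
			return errClosedConn
		}
		return nil
	})
	fmt.Println("final result:", err)
}
```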
