start ecs doesn't seem to start ecs-agent #75

Closed
davidvuong opened this issue Dec 29, 2016 · 6 comments

@davidvuong

davidvuong commented Dec 29, 2016

Here is everything in user-data (largely copied from http://docs.datadoghq.com/integrations/ecs/#create-an-ecs-task):

#!/bin/bash
yum install -y aws-cli jq
aws s3 cp s3://xxx/ecs.config /etc/ecs/ecs.config

ECS_CLUSTER=${aws_ecs_cluster.main.name}
DD_TASK_DEFINITON="dd-agent"

echo "before"
docker ps -a

echo ECS_CLUSTER=$ECS_CLUSTER >> /etc/ecs/ecs.config
start ecs

echo "after"
docker ps -a

# @see: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-introspection.html
EC2_AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
EC2_REGION="`echo \"$EC2_AZ\" | sed -e 's:\([0-9][0-9]*\)[a-z]*\$:\\1:'`"
CONTAINER_INSTANCE_ARN=$(curl -s http://localhost:51678/v1/metadata | jq -r '. | .ContainerInstanceArn' | awk -F/ '{print $NF}' )

echo "before curl"
curl -v http://localhost:51678/v1/metadata
echo "after curl"

# On EC2 boot, start an ECS task using the dd-agent task definition.
echo "
cluster=$ECS_CLUSTER
az=$EC2_AZ
region=$EC2_REGION
aws ecs start-task \
  --cluster $ECS_CLUSTER \
  --task-definition $DD_TASK_DEFINITON \
  --container-instances $CONTAINER_INSTANCE_ARN \
  --region $EC2_REGION
  " >> /etc/rc.local
EOF

This is the tail of cloud-init-output:

...
Dependency Installed:
  freetype.x86_64 0:2.3.11-15.14.amzn1
  jq-libs.x86_64 0:1.5-1.2.amzn1
  libjpeg-turbo.x86_64 0:1.2.90-5.14.amzn1
  mailcap.noarch 0:2.1.31-2.7.amzn1
  oniguruma.x86_64 0:5.9.1-3.1.2.amzn1
  python27-botocore.noarch 0:1.4.86-1.62.amzn1
  python27-colorama.noarch 0:0.2.5-1.7.amzn1
  python27-dateutil.noarch 0:2.1-1.3.amzn1
  python27-docutils.noarch 0:0.11-1.15.amzn1
  python27-futures.noarch 0:3.0.3-1.3.amzn1
  python27-imaging.x86_64 0:1.1.6-19.9.amzn1
  python27-jmespath.noarch 0:0.9.0-1.11.amzn1
  python27-ply.noarch 0:3.4-3.12.amzn1
  python27-pyasn1.noarch 0:0.1.7-2.9.amzn1
  python27-rsa.noarch 0:3.4.1-1.8.amzn1

Complete!
download: s3://xxx/ecs.config to etc/ecs/ecs.config
before
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
ecs start/running, process 3036
after
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
before curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 127.0.0.1...
* connect to 127.0.0.1 port 51678 failed: Connection refused
* Failed to connect to localhost port 51678: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 51678: Connection refused
after curl

Additional information:

  1. AMI: ami-6df8fe7a
  2. I'm using terraform which is why you see ECS_CLUSTER=${aws_ecs_cluster.main.name} but assume that ECS_CLUSTER=main
  3. When I try to run curl, I can't seem to connect
  4. I've tried sudo start ecs but I get the same results
  5. I can curl and get a response after SSHing into the EC2 instance
  6. ... and the ecs-agent is up too
[ec2-user@ip-172-20-2-81 ~]$ docker ps -a
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS              PORTS               NAMES
7a63f5ba0a8b        amazon/amazon-ecs-agent:latest   "/agent"            3 minutes ago       Up 3 minutes                            ecs-agent

... and the contents of my rc.local file look like this:

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local

cluster=main
az=us-east-1b
region=us-east-1
aws ecs start-task   --cluster main   --task-definition dd-agent   --container-instances    --region us-east-1

There isn't a value next to --container-instances because I couldn't fetch any metadata from the ecs-agent 😢

Any ideas?

@nmeyerhans
Contributor

Hi @davidvuong. ecs-init and ecs-agent both log to files in /var/log/ecs/. It is possible that these will provide some insight into what's happening. My guess is that there's a dependency that hasn't yet finished initializing by the time you call start ecs or curl, but that's just a guess.

@davidvuong
Author

davidvuong commented Dec 30, 2016

Hey @nmeyerhans,

Here are the logs from both files in /var/log/ecs/:

ecs-init.log:

2016-12-30T03:16:26Z [INFO] pre-start
2016-12-30T03:16:27Z [INFO] start
2016-12-30T03:16:27Z [INFO] No existing agent container to remove.
2016-12-30T03:16:27Z [INFO] Starting Amazon EC2 Container Service Agent

ecs-agent.log:

2016-12-30T03:16:29Z [INFO] Starting Agent: Amazon ECS Agent - v1.13.1 (efe53c6)
2016-12-30T03:16:29Z [INFO] Loading configuration
2016-12-30T03:16:29Z [INFO] Event stream ContainerChange start listening...
2016-12-30T03:16:29Z [INFO] Checkpointing is enabled. Attempting to load state
2016-12-30T03:16:29Z [INFO] Loading state! module="statemanager"
2016-12-30T03:16:29Z [INFO] Detected Docker versions [1.17 1.18 1.19 1.20 1.21 1.22 1.23]
2016-12-30T03:16:29Z [INFO] Registering Instance with ECS
2016-12-30T03:16:29Z [INFO] Registered! module="api client"
2016-12-30T03:16:29Z [INFO] Registration completed successfully. I am running as 'xxx' in cluster 'yyy'
2016-12-30T03:16:29Z [INFO] Saving state! module="statemanager"
2016-12-30T03:16:29Z [INFO] Beginning Polling for updates
2016-12-30T03:16:29Z [INFO] Initializing stats engine
2016-12-30T03:16:29Z [INFO] Event stream DeregisterContainerInstance start listening...
2016-12-30T03:16:29Z [INFO] Creating poll dialer, host: ecs-a-1.us-east-1.amazonaws.com
2016-12-30T03:16:29Z [INFO] Creating poll dialer, host: ecs-t-1.us-east-1.amazonaws.com
2016-12-30T03:16:39Z [INFO] Saving state! module="statemanager"

@davidvuong
Author

davidvuong commented Dec 30, 2016

This might also be useful. Cloud init logs before and after the start ecs call:

echo "before"
docker ps -a

cat /var/log/ecs/ecs-init.log.2016-12-30-03
cat /var/log/ecs/ecs-agent.log.2016-12-30-03

echo ECS_CLUSTER=$ECS_CLUSTER >> /etc/ecs/ecs.config
start ecs

echo "after"
docker ps -a

cat /var/log/ecs/ecs-init.log.2016-12-30-03
cat /var/log/ecs/ecs-agent.log.2016-12-30-03

and this is what I get:

before
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
cat: /var/log/ecs/ecs-init.log.2016-12-30-03: No such file or directory
cat: /var/log/ecs/ecs-agent.log.2016-12-30-03: No such file or directory
ecs start/running, process 3029
after
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
2016-12-30T03:29:11Z [INFO] pre-start
2016-12-30T03:29:12Z [INFO] start
2016-12-30T03:29:12Z [INFO] No existing agent container to remove.
2016-12-30T03:29:12Z [INFO] Starting Amazon EC2 Container Service Agent
cat: /var/log/ecs/ecs-agent.log.2016-12-30-03: No such file or directory

@davidvuong
Author

@nmeyerhans Yep you're right. It seems like there's a timing issue. I forced a sleep 10 before the curl and everything works as it should.

Is there another way I can fix this aside from forcing an arbitrary sleep time?

@nmeyerhans
Contributor

Aside from a fixed sleep, another option would be to write a polling loop that retries your curl until you get a meaningful response. I'd recommend a short sleep (probably 1 second is fine) between loop iterations and a fatal timeout of maybe one or two minutes, so you avoid looping forever in case the agent never starts for some reason.
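
A polling loop along these lines could look roughly like the sketch below. It is only an illustration, not something taken from ecs-init: the function name wait_for_ecs_agent, its arguments, and the default values are assumptions made for this example. It retries the introspection endpoint from the script above until curl returns a non-empty body, and gives up after a timeout.

```shell
#!/bin/bash
# Sketch only: wait_for_ecs_agent and its defaults are hypothetical,
# not part of ecs-init or the ECS agent.

wait_for_ecs_agent() {
  local url="${1:-http://localhost:51678/v1/metadata}"
  local timeout="${2:-120}"   # give up after roughly this many seconds
  local interval="${3:-1}"    # pause between attempts, in seconds
  local elapsed=0
  local body

  while [ "$elapsed" -lt "$timeout" ]; do
    # -s hides the progress meter, -f makes curl fail on HTTP errors,
    # --max-time bounds a single attempt.
    if body=$(curl -sf --max-time 5 "$url") && [ -n "$body" ]; then
      printf '%s\n' "$body"
      return 0
    fi
    sleep "$interval"
    # Counts sleep time only, so the real wait can be somewhat longer.
    elapsed=$((elapsed + interval))
  done

  echo "ECS agent did not respond within ${timeout}s" >&2
  return 1
}

# Possible usage in user-data, after `start ecs`:
#   metadata=$(wait_for_ecs_agent) || exit 1
#   CONTAINER_INSTANCE_ARN=$(echo "$metadata" | jq -r '.ContainerInstanceArn' | awk -F/ '{print $NF}')
```

Failing the script when the timeout is reached avoids writing an rc.local entry with an empty --container-instances value.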

In theory we could implement such a loop in ecs-init itself or in the upstart config we use to start it, so I'll leave this issue open as a feature request. It's possible that there are issues with this idea that I haven't considered yet, so it may not be the right thing to do.

@yumex93
Contributor

yumex93 commented Jan 17, 2020

Based on my understanding, the problem here is that "start ecs" only acts as an interface that tells the upstart init system to start the agent, and the agent itself starts asynchronously. Adding a loop in ecs-init or in the upstart config would not make the "start ecs" command wait until the agent has started successfully before returning. So the retry mentioned above is the way to handle your case for now.

Closing the issue since there are no action items on our side. Feel free to reopen it if you have any questions.

@yumex93 yumex93 closed this as completed Jan 17, 2020