
error occurred when starting amazon-ssm-agent: Failed to fetch region. Data from vault is empty #48

Closed
ratidzidziguri opened this issue Apr 28, 2017 · 12 comments

Comments

@ratidzidziguri

ratidzidziguri commented Apr 28, 2017

Recently I launched a new EC2 instance (Windows Server 2016) that has trouble starting the SSM agent. Whenever I try to start the SSM service it fails. I looked inside the logs and see the following errors:

2017-04-28 20:24:14 ERROR [Execute @ agent_windows.go.169] Failed to start agent. Failed to fetch region. Data from vault is empty. Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017-04-28 20:24:51 ERROR [NewCoreManager @ coremanager.go.63] error fetching the region, Failed to fetch region. Data from vault is empty. Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017-04-28 20:24:51 ERROR [start @ agent.go.61] error occured when starting core manager: Failed to fetch region. Data from vault is empty. Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2017-04-28 20:24:51 ERROR [Execute @ agent_windows.go.169] Failed to start agent. Failed to fetch region. Data from vault is empty. Get http://169.254.169.254/latest/dynamic/instance-identity/document: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

(the same NewCoreManager / start / Execute error sequence repeats at 20:25:28 and 20:26:05)

Instance information is not displayed on the desktop either.

I tried to reinstall the agent but got the same error, so there might be something wrong with it.
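For anyone debugging this, the failing call can be reproduced outside the agent. Below is a minimal sketch (the `fetch_region` helper is made up for illustration, not part of the agent) that requests the same instance-identity document shown in the log, with a short timeout:

```python
# Minimal reproduction of the request the agent times out on. Run on the
# affected instance: a healthy instance prints its region; a broken route to
# 169.254.169.254 raises a timeout, matching the agent's error above.
import json
import urllib.request

IDENTITY_URL = "http://169.254.169.254/latest/dynamic/instance-identity/document"

def fetch_region(timeout: float = 2.0) -> str:
    # The instance-identity document is JSON and includes a "region" field.
    with urllib.request.urlopen(IDENTITY_URL, timeout=timeout) as resp:
        return json.load(resp)["region"]

if __name__ == "__main__":
    print(fetch_region())
```

If this times out while `netstat -rn` shows a route for 169.254.169.254, the route's gateway is the next thing to check.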

@liath

liath commented May 4, 2017

I'm struggling with this too. The same AMI in a different VPC doesn't seem to be affected. I haven't the fuzziest idea why, as everything else in both VPCs is fine.

@ratidzidziguri
Author

What I found initially was that the image I created was in one VPC, but I started the new machine from that image in a different VPC.

@liath

liath commented May 5, 2017 via email

@nelsestu

No question about it, this is a bug with the SSM agent on Windows. I have a system that relies on user data for configuration details, and the application is unable to start without this config. When I create an AMI of an instance in, say, AZ us-west-2a, and attempt to use it in an Auto Scaling group where us-west-2b and us-west-2c are also possible AZs, only instances that scale out in us-west-2a will work. AZs b and c fail with the errors that started this thread. Splunking through the EC2Launch logs, I can see it binding 169.254.169.254 to the gateway of the subnet associated with AZ a. The binding reports success, but since that gateway doesn't exist in the other AZs' subnets, the address is never reachable.

@mmendonca3
Contributor

Thank you for posting here. Our team is investigating this issue and will provide you with a fix or ETA.

@liath

liath commented May 30, 2017

Just as a follow-up to my resolution above: we now run InitializeInstance.ps1 -Schedule before building images, which resolves the network issues that prevented us from talking to 169.254.169.254.

Another part of the puzzle for us was that, because we use autologon to get a user session on these instances, InitializeInstance.ps1 fails at line 125: Restart-Computer needs the -Force flag in order to work while a user session exists. Without the reboot, the instance is identical to the imaged instance and has the same route table, which was causing the network issue above. Adding the -Force flag in InitializeInstance.ps1 fixes the rest of our problems.

@yogeshdengle

yogeshdengle commented Jul 11, 2017

That's basically what happened with me. We built an image in our dev VPC and promoted it to our staging VPC, where its route table no longer made sense (dev is 10.4.x.x and staging is 10.40.x.x). Changing the route to the right subnet fixed everything.

I feel like this used to work on the pre-Server 2016 base AMIs. Perhaps this is part of the new EC2Launch configuration?

@liath Can you elaborate on what kind of route-table changes you needed to make? TIA

@liath

liath commented Jul 14, 2017

@yogeshdengle Copying the routing table from a working instance in the same VPC should fix things, but as I said in my last comment, running InitializeInstance.ps1 -Schedule before imaging an instance that will be moved between VPCs resolves the route-table issues.

@lasitha-petthawadu

I got the same issue and was able to solve it by checking the routing table within Windows using:

netstat -rn

I noticed that the gateway address in the persistent routes was incorrect and did not match the subnet in use.

So I updated the persistent route entry by executing the following command:

route -p add 169.254.169.254 mask 255.255.255.255 <correct gateway IP> metric 25 if 2

Finally, with the route-table change in place, I was able to access 169.254.169.254 via a web browser.
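If you need the `<correct gateway IP>` for that command, it can be derived from the subnet's CIDR: AWS reserves the first usable host address in every VPC subnet for its router. A small sketch (the helper name is made up for illustration):

```python
# Compute the VPC router ("gateway") address for a subnet, per the AWS
# convention that the first host address in each subnet is reserved for it.
import ipaddress

def aws_default_gateway(cidr: str) -> str:
    net = ipaddress.ip_network(cidr, strict=False)
    return str(net.network_address + 1)

# e.g. an instance in 10.0.12.0/24 should route 169.254.169.254 via 10.0.12.1
print(aws_default_gateway("10.0.12.0/24"))  # -> 10.0.12.1
```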

@NJITman

NJITman commented Dec 5, 2017

UPDATE - After doing some tests and viewing the logs on the instance, we can see that InitializeInstance.ps1 does run on boot, regardless of whether the instance comes from an Amazon AMI or one that you created.

InitializeInstance.ps1 already does all of the work in Correct-Routes.ps1, so there is no need to run both.

What is strange is that you can see the routes being updated to the wrong gateway in the log: when you access the instance for the first time (via RDP) and run netstat -rn, the routes point to the gateway for the subnet that the AMI was baked from (the source instance). Both InitializeInstance.ps1 (which ran first) and Correct-Routes.ps1 (which ran second) added the wrong routes to the instance's routing table.

Just some history:

  1. We created an AMI with all of the settings and applications needed for our web application. It was created in AZ #1, in subnet #1.
  2. We created a launch configuration using that AMI and launched 2 instances from an Auto Scaling group in AZ #2 (subnet #2) and AZ #3 (subnet #3).
  3. Our app gets the instance data from the AWS SDK and displays the unique host part of the IP (last 8 digits in decimal form) in the footer so that we can see that the app is truly running in multiple AZs via the ELB.
  4. The source instance would show the IP.
  5. The launched instances from the AMI would not (SDK was returning null).
  6. Upon investigation, we found that SSM was hung and that the routes for the 3 default IPs were pointing to the gateway for subnet #1 (the source), instead of subnet #2 or #3.
  7. Once the routes were updated to the proper gateway, SSM would restart and run. SDK would then properly show the IP address of the instance.

For example, the subnet of the source instance (AMI) is 10.0.20.0, but the subnet of the launched instance is 10.0.12.0. Here are the log entries:
2017/12/05 03:46:23Z: Successfully added the Route: 169.254.169.254/32, gateway: 10.0.20.1, NIC index: 3, Metric: 25
2017/12/05 03:46:23Z: Successfully added the Route: 169.254.169.250/32, gateway: 10.0.20.1, NIC index: 3, Metric: 25
2017/12/05 03:46:23Z: Successfully added the Route: 169.254.169.251/32, gateway: 10.0.20.1, NIC index: 3, Metric: 25

And here are the log entries after connecting to the instance via RDP and manually running Correct-Routes.ps1:
2017/12/06 00:05:53Z: Successfully added the Route: 169.254.169.254/32, gateway: 10.0.12.1, NIC index: 3, Metric: 25
2017/12/06 00:05:53Z: Successfully added the Route: 169.254.169.250/32, gateway: 10.0.12.1, NIC index: 3, Metric: 25
2017/12/06 00:05:53Z: Successfully added the Route: 169.254.169.251/32, gateway: 10.0.12.1, NIC index: 3, Metric: 25

Will continue testing and figuring this out and report back.
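The mismatch above can be spotted mechanically: any route in the EC2Launch log whose gateway falls outside the instance's own subnet is stale. A rough sketch (log format taken from the entries above; the `stale_routes` function is hypothetical):

```python
# Flag EC2Launch "Successfully added the Route" entries whose gateway does not
# belong to the instance's subnet - i.e. routes inherited from the source AMI.
import ipaddress
import re

ROUTE_RE = re.compile(r"Route: (\S+), gateway: (\S+),")

def stale_routes(log_lines, instance_subnet):
    net = ipaddress.ip_network(instance_subnet)
    bad = []
    for line in log_lines:
        m = ROUTE_RE.search(line)
        if m and ipaddress.ip_address(m.group(2)) not in net:
            bad.append((m.group(1), m.group(2)))
    return bad

log = [
    "2017/12/05 03:46:23Z: Successfully added the Route: 169.254.169.254/32, gateway: 10.0.20.1, NIC index: 3, Metric: 25",
]
# Instance actually lives in 10.0.12.0/24, so the 10.0.20.1 gateway is stale:
print(stale_routes(log, "10.0.12.0/24"))  # -> [('169.254.169.254/32', '10.0.20.1')]
```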

@mmendonca3
Contributor

The behavior described by NJITman is expected behavior with the current EC2Launch.

EC2Launch is executed at first launch and will not be executed on subsequent instance starts unless you explicitly schedule it. This means the explicit routes to the metadata and KMS servers are not updated between instance stop/start.

If you create an image from an instance without re-scheduling EC2Launch, EC2Launch will not be executed on instances launched from that image. This means those instances may not have correct routes to the metadata or KMS servers.

To prevent this, you should sysprep the instance, or re-schedule EC2Launch to execute at next launch by running '.\InitializeInstance.ps1 -Schedule' on the instance before creating an image.

See our public document for more information about EC2Launch:
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2launch.html#ec2launch-inittasks
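In script form, the pre-imaging step looks like this (script path per the EC2Launch documentation linked above; run from an elevated PowerShell prompt, and verify the path on your AMI):

```shell
# Re-arm EC2Launch before creating the AMI so new instances recompute their
# routes to 169.254.169.254 on first boot.
cd C:\ProgramData\Amazon\EC2-Windows\Launch\Scripts

# Either schedule initialization for the next boot...
.\InitializeInstance.ps1 -Schedule

# ...or sysprep the instance (also re-runs the launch tasks at next boot):
.\SysprepInstance.ps1
```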

@mmendonca3
Contributor

Please reopen this issue if you have any further questions.


7 participants