Reduce timeout for scrapping IMDS and give instruction when fail to scrape IMDS inside container by khanhntd · Pull Request #480 · aws/amazon-cloudwatch-agent

khanhntd · 2022-06-02T02:58:07Z

Description of the issue

Whenever customers turn on EC2 Tagger, two conditions must be met before the agent can scrape metadata from IMDS:

CloudWatch Agent needs to run on an EC2 instance
IMDSv2 and IMDSv1 are enabled

However, to scrape IMDSv2 from inside a Docker container, customers need to increase the hop limit to 2 in the general case, and to 3 in certain cases such as vanilla KOPS. Note that for most AWS resources that manage containers, such as EKS, the default hop limit is already 2. If customers enable IMDSv1, the AWS SDK for Go falls back to IMDSv1 by default and CloudWatchAgent does not need to worry about the hop limit. However, given security considerations such as "Protecting against open layer 3 firewalls and NATs", we should keep in mind that customers will stop using IMDSv1 in the future, and always relying on IMDSv1 is not best practice.

Therefore, we have considered these strategies for dealing with IMDSv2:

Fail faster by reducing the 4-minute timeout (4 retries at 1 retry per minute) to the recommended timeout strategy (same as Terraform), and show the corresponding course of action when the customer runs in a Docker container with only IMDSv2 enabled and a hop limit of 1.
Have CloudWatchAgent modify the hop limit under the hood. The API that modifies the hop limit requires an instance ID, which we can normally only get from IMDS. However, we can read the instance ID from certain files that are written when an EC2 instance is created, such as /var/lib/cloud/data/instance-id on Linux instances, and we would need a Docker mount between the container's volume and the host's volume.
Modify the TTL or hop limit in the token (IP packet) from IMDSv2; however, going against what IMDSv2 was designed for is not recommended for security reasons.

Therefore, after discussing with the team, we will go with the first strategy.

Description of changes

Reduce the timeout from 4 retries (1 retry per minute) to 2 retries (1 retry per second) to fail faster, and add AWSDebugLogging for EC2Tagger (both IMDS and retrieving tags and volumes)
Add an EC2 Metadata Provider, same as the EC2 Client, to separate the specific configuration for each service SDK.
Reduce API calls when scraping IMDS metadata and create a separate function dedicated to scraping IMDS metadata
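
To illustrate the "2 retries, 1 retry per second" policy, here is a minimal standard-library sketch of a bounded retry loop. It is not the PR's implementation (the agent presumably configures this through the AWS SDK); `retry` and `errUnreachable` are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnreachable is a hypothetical sentinel error standing in for a failed IMDS call.
var errUnreachable = errors.New("metadata endpoint unreachable")

// retry calls fn once and then up to maxRetries more times, sleeping for
// delay between attempts. It returns nil on the first success, or the last
// error wrapped with the retry count once all attempts are used up.
func retry(maxRetries int, delay time.Duration, fn func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxRetries {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	// 2 retries, 1 second apart: the call fails within about 2 seconds
	// instead of about 4 minutes.
	err := retry(2, time.Second, func() error {
		calls++
		return errUnreachable
	})
	fmt.Println(calls, err)
}
```

With these values the worst case is three attempts and two one-second pauses, which matches the "fail faster" goal described above.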

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

I have tested CloudWatchAgent on two edge cases:

LXC Containers.
Vanilla Kops.

For specific test on second edge case:

Step 1: Build KOPS EKS Cluster by following [this document](https://kops.sigs.k8s.io/getting_started/aws/)
Step 2: Build Docker Image by using make dockerized-build and publish your image through ECR
Step 3: Replace the ECR Image and the below config with this template

{
	"agent": {
		"metrics_collection_interval": 60,
		"run_as_user": "root"
	},
	"metrics": {
		"append_dimensions": {
			"AutoScalingGroupName": "${aws:AutoScalingGroupName}",
			"ImageId": "${aws:ImageId}",
			"InstanceId": "${aws:InstanceId}",
			"InstanceType": "${aws:InstanceType}"
		},
		"metrics_collected": {
			"disk": {
				"measurement": [
					"used_percent"
				],
				"metrics_collection_interval": 60,
				"resources": [
					"*"
				]
			}
		}
	}
}

Step 4: Check for the following expected results

2022-06-07T03:36:11Z I! 
2022-06-07T03:36:11Z E! [processors.ec2tagger] ec2tagger: Unable to retrieve Instance Metadata Tags. This plugin must only be used on an EC2 instance.
2022-06-07T03:36:11Z E! [processors.ec2tagger] ec2tagger: Please increase hop limit to 2 by following this document https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html#configuring-IMDS-existing-instances.
2022-06-07T03:36:11Z E! [telegraf] Error running agent: could not initialize processor ec2tagger: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: EC2MetadataError: failed to make EC2Metadata request
        status code: 401, request id: 
caused by:
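
The log above pairs a 401 from IMDS with a hop-limit instruction. A minimal sketch of how such a hint could be selected is shown below. The function `imdsFailureHint`, its parameters, and the rule "401 inside a container implies a hop-limit problem" are assumptions made for illustration, not the logic from this PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// hopLimitDoc is the documentation link that appears in the log output above.
const hopLimitDoc = "https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html#configuring-IMDS-existing-instances"

// imdsFailureHint returns extra guidance to log when an IMDS request fails.
// Assumption (hypothetical rule): a 401 seen from inside a container suggests
// that the IMDSv2 token response could not reach the container because the
// hop limit is 1. Any other combination yields no hint.
func imdsFailureHint(statusCode int, inContainer bool) string {
	if inContainer && statusCode == http.StatusUnauthorized {
		return "Please increase hop limit to 2 by following this document " + hopLimitDoc + "."
	}
	return ""
}

func main() {
	if hint := imdsFailureHint(401, true); hint != "" {
		fmt.Println("E! [processors.ec2tagger] ec2tagger: " + hint)
	}
}
```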

Requirements

Before committing the code, please do the following steps.

Run make fmt and make fmt-sh
Run make linter

codecov-commenter · 2022-06-02T16:30:14Z

Codecov Report

Merging #480 (9921a14) into master (5335531) will increase coverage by 0.04%.
The diff coverage is 61.19%.

@@            Coverage Diff             @@
##           master     #480      +/-   ##
==========================================
+ Coverage   56.81%   56.86%   +0.04%     
==========================================
  Files         374      363      -11     
  Lines       17711    16937     -774     
==========================================
- Hits        10063     9631     -432     
+ Misses       7057     6754     -303     
+ Partials      591      552      -39

Impacted Files	Coverage Δ
translator/util/sdkutil.go	`0.00% <ø> (ø)`
plugins/processors/ec2tagger/ec2tagger.go	`80.48% <61.19%> (-4.08%)`	⬇️
...md/amazon-cloudwatch-agent-config-wizard/wizard.go	`56.94% <0.00%> (-11.12%)`	⬇️
plugins/inputs/demo/demo.go	`50.00% <0.00%> (-7.15%)`	⬇️
...anslator/translate/metrics/util/measurementutil.go	`30.83% <0.00%> (-1.92%)`	⬇️
plugins/outputs/cloudwatch/cloudwatch.go	`73.37% <0.00%> (-0.70%)`	⬇️
...translate/metrics/metrics_collect/gpu/nvidiaSmi.go	`94.73% <0.00%> (-0.14%)`	⬇️
plugins/inputs/logfile/tail/tail_windows.go
translator/util/platform_windows.go
...gins/inputs/windows_event_log/windows_event_log.go
... and 8 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5335531...9921a14. Read the comment docs.

SaxyPandaBear · 2022-06-10T01:44:19Z

"This plugin must only be used on an EC2 instance"

Does the customer configure the ec2tagger? I thought that the CloudWatch agent does that under the hood. What is the point of calling this out if the customer doesn't have control over this?

Yes. Customers configure the EC2Tagger indirectly, so I see the point of not showing "This plugin must only be used on an EC2 instance." directly. An example is below:

{ "metrics": { "append_dimensions": { "AutoScalingGroupName": "${aws:AutoScalingGroupName}", "ImageId": "${aws:ImageId}", "InstanceId": "${aws:InstanceId}", "InstanceType": "${aws:InstanceType}" } } }

However, the log message comes from the old code, and I do not have enough context to remove it, so I kept it here.

We have the opportunity to address this to make the error message clearer. Can you explain your example? What does this configuration have to do with the ec2tagger specifically? If you can describe that, then we probably have enough data to reword the error message.

Yes. I'm not good at wording so thanks for the suggestion. For example,

{ "metrics": { "append_dimensions": { "ImageId": "${aws:ImageId}", "InstanceId": "${aws:InstanceId}", "InstanceType": "${aws:InstanceType}" } } }

If the customer specifies these three dimensions for their collected metrics, the CloudWatchAgent will scrape them from IMDS. In other words, the customer's JSON configuration uses the EC2 Tagger under the hood when appending the aforementioned dimensions, but this is not shown explicitly in the JSON configuration.
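
As a rough illustration of that indirect relationship, the sketch below parses a configuration like the one above and lists the dimensions whose values use the `${aws:...}` placeholder form, which are the ones that would need instance metadata. The type `agentConfig` and the function `awsPlaceholders` are hypothetical; the agent's real translator is not shown in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// agentConfig models only the part of the JSON configuration used here.
type agentConfig struct {
	Metrics struct {
		AppendDimensions map[string]string `json:"append_dimensions"`
	} `json:"metrics"`
}

// sampleConfig mirrors the three-dimension example from the comment above.
const sampleConfig = `{"metrics":{"append_dimensions":{"ImageId":"${aws:ImageId}","InstanceId":"${aws:InstanceId}","InstanceType":"${aws:InstanceType}"}}}`

// awsPlaceholders returns, in sorted order, the dimension names whose values
// have the ${aws:...} form and therefore depend on instance metadata.
func awsPlaceholders(raw []byte) ([]string, error) {
	var cfg agentConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	var dims []string
	for name, value := range cfg.Metrics.AppendDimensions {
		if strings.HasPrefix(value, "${aws:") && strings.HasSuffix(value, "}") {
			dims = append(dims, name)
		}
	}
	sort.Strings(dims)
	return dims, nil
}

func main() {
	dims, err := awsPlaceholders([]byte(sampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Println(dims)
}
```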

SaxyPandaBear · 2022-06-10T01:57:43Z

So the only two scenarios where we would return an actual error are if:

There is some error with creating a session

IMDS is unavailable

We fail silently (with logs) in a case where there is an error trying to fetch specific data from the metadata endpoint. Why is that? There should be more consistency with the behavior of this function. Does it make sense to return a value here? If so, what are the different intended values, and what causes them to be returned?

Expanding on this - does it make sense to not return an error if there is a non-nil error when trying to get the hostname from the metadata endpoint? Isn't that value required to be populated?

As you mentioned, we would return an actual error in both cases. However, does that mean the error will impact the whole process? My answer is no (the same as the existing behavior); the returned error value only serves to consolidate the message for customers. Based on my understanding, the existing CloudWatchAgent behavior already accounts for possible failures when getting the metadata (from the hostname or from the instance identity document). However, it does not return an error, because the value can be replaced by the placeholder function.

Long story short, CloudWatchAgent only expects failures from the aforementioned cases and will always return values (even if they differ from the intended values because of the placeholder replacement), so there is no reason to return an error when fetching one of these metadata values.

Note: This does not apply to EC2Tagger

For example, if scraping the hostname fails, a replacement value is used for the hostname.

hostname := provider().Hostname
if hostname == "" {
	hostname = localHostname
}

This applies to the metadata scraped from the instance identity document too, so these values are not required to be populated. I would therefore stand by not returning an error when there is a non-nil error, as long as we are only scraping the metadata.
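
The "log and fall back" behavior described in this reply can be sketched as follows. This is a simplified illustration, not the agent's code: `hostnameOrFallback`, its `fetch` parameter, and `errIMDSUnavailable` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errIMDSUnavailable is a hypothetical error standing in for a failed metadata call.
var errIMDSUnavailable = errors.New("IMDS unavailable")

// hostnameOrFallback asks fetch for the hostname. A fetch error is logged but
// not returned; an empty result is replaced with localHostname, so the caller
// always receives a usable value.
func hostnameOrFallback(fetch func() (string, error), localHostname string) string {
	hostname, err := fetch()
	if err != nil {
		log.Printf("W! could not fetch hostname from IMDS, using local hostname: %v", err)
	}
	if hostname == "" {
		hostname = localHostname
	}
	return hostname
}

func main() {
	failing := func() (string, error) { return "", errIMDSUnavailable }
	fmt.Println(hostnameOrFallback(failing, "local-host"))
}
```

The trade-off raised in the review still applies: callers cannot tell a real value from a placeholder unless they also inspect the logs.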

sethAmazon · 2022-06-10T15:54:11Z

This should be a random number imo. Think all systems send out the call at the same time. Then next round they all send at the same time. It needs to be random within an interval. This comment is probs out of scope for this pr but something to think about for best behaved agents. @straussb thoughts.
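
A minimal sketch of the randomized delay suggested here, using only the standard library. The function `jitteredDelay`, the helper `newRNG`, and the choice of the interval [base/2, base) are assumptions for illustration; the comment does not specify an interval.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newRNG returns a seeded random source. Passing it in explicitly keeps
// jitteredDelay deterministic under test.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// jitteredDelay returns a random duration in [base/2, base), so that agents
// that start together do not all retry at the same instant.
// A non-positive base yields zero.
func jitteredDelay(base time.Duration, rng *rand.Rand) time.Duration {
	if base <= 0 {
		return 0
	}
	half := base / 2
	return half + time.Duration(rng.Int63n(int64(base-half)))
}

func main() {
	rng := newRNG(time.Now().UnixNano())
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredDelay(time.Second, rng))
	}
}
```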

SaxyPandaBear · 2022-06-11T19:09:24Z

Why are we putting all of this information inside of another struct? I don't see any benefit for pulling these out into a struct.

It's the same reason as for ec2MetadataLookup:

Better consolidation, since it groups everything in EC2MetadataRespond

Fewer variables to be aware of when looking at the EC2Tagger interface

How is it less variables to be aware of if you are introducing another variable? I don't think the reasons you presented are convincing enough for this change. Are we going to be passing around this "response" struct outside of this Tagger? If not, it just makes it more confusing and harder to maintain.

Based on my understanding, no: neither EC2MetadataLookup nor EC2MetadataRespond needs to be passed outside the Tagger.
Long story short, these metadata values are parsed from IMDS, so why would it be more confusing and harder to maintain? This follows the definition of a struct type (it is used to group related data to form a single unit), so I'm not sure why we should not make this change.

instanceId   string
imageId      string // aka AMI
instanceType string
region       string

SaxyPandaBear · 2022-06-11T19:12:28Z

Moving all of the functions around in the file makes this a confusing diff so it'll take longer to review.

Agreed, it takes longer and is confusing. However, for a better structure (e.g. following the ecs_decorator), it's worth restructuring and is more considerate of future reviewers.

I disagree. We don't have clearly defined standards for the structure or order of functions. Moving functions around in this file just creates noise. If we want to reformat the file to follow some structure, that should not be part of a PR that updates functionality. Now the reviewer has to closely examine all of the changes.

If that's the case, I will revert it back 👍

SaxyPandaBear · 2022-06-11T19:16:34Z

Why is there inconsistency for the map keys? tagKey1 is created as a var/const at the top of the file but "AutoScalingGroupName" gets redefined in all of the tests.

Here is the reason why though:

## Note: This plugin renames the "aws:autoscaling:groupName" EC2 Instance Tag key to be spelled "AutoScalingGroupName".
## This aligns it with the AutoScaling dimension-name seen in AWS CloudWatch.
# ec2_instance_tag_keys = ["aws:autoscaling:groupName", "Name"]

Another explanation can be found:

// if the customer said 'AutoScalingGroupName' (the CW dimension), do what they mean not what they said
// and filter for the EC2 tag name called 'aws:autoscaling:groupName'

I don't understand what this means

Based on my understanding, whenever CWAgent retrieves EC2 instance tags (not volumes and not IMDS), the EC2 instance has the tag key aws:autoscaling:groupName for its Auto Scaling group. We scrape that tag, convert the key to AutoScalingGroupName (and back again when filtering), and append it as a dimension to the collected metrics (to match the AutoScalingGroupName key in the CWAgent JSON configuration).

Long story short, an EC2 instance gets the tag aws:autoscaling:groupName (and its value) when it is launched as part of an Auto Scaling group; we collect the aws:autoscaling:groupName tag, convert the key to AutoScalingGroupName, and append it to the metric as a dimension.

Let me know if that makes sense.
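
The key translation described in this reply can be sketched in a few lines. The two string values come from the plugin comments quoted above; the constant and function names (`ec2TagASG`, `cwDimensionASG`, `tagKeyToDimension`, `dimensionToTagKey`) are hypothetical and not taken from the ec2tagger source.

```go
package main

import "fmt"

const (
	// ec2TagASG is the tag key EC2 puts on instances in an Auto Scaling group.
	ec2TagASG = "aws:autoscaling:groupName"
	// cwDimensionASG is the dimension name used in CloudWatch and in the agent config.
	cwDimensionASG = "AutoScalingGroupName"
)

// dimensionToTagKey maps the name a customer writes in the configuration to
// the EC2 tag key that should be used when filtering tags.
func dimensionToTagKey(name string) string {
	if name == cwDimensionASG {
		return ec2TagASG
	}
	return name
}

// tagKeyToDimension maps an EC2 tag key to the dimension name appended to metrics.
func tagKeyToDimension(key string) string {
	if key == ec2TagASG {
		return cwDimensionASG
	}
	return key
}

func main() {
	fmt.Println(dimensionToTagKey("AutoScalingGroupName"))
	fmt.Println(tagKeyToDimension("aws:autoscaling:groupName"))
	fmt.Println(tagKeyToDimension("Name"))
}
```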

SaxyPandaBear

I see test output that shows that the new behavior errors out in the same second. Can you please also include logging without this change to illustrate what the existing behavior is?

…crape IMDS inside container

Fix Aggregrator Shut Down Behavior

66938d2

khanhntd changed the title ~~Imds v2~~ Always setting hops to 2 if CloudWatchAgent is deployed as container Jun 2, 2022

khanhntd force-pushed the imds_v2 branch 10 times, most recently from f05b1a8 to 32e5e4a Compare June 2, 2022 16:25

khanhntd force-pushed the imds_v2 branch 16 times, most recently from 08cdab4 to 6ae3cc8 Compare June 5, 2022 22:20

Always setting hops to 2 if CloudWatchAgent is deployed as container

7fe088f

khanhntd force-pushed the imds_v2 branch 7 times, most recently from 1cec9fb to 2bf9f76 Compare June 9, 2022 05:09

SaxyPandaBear reviewed Jun 10, 2022

khanhntd force-pushed the imds_v2 branch from 2bf9f76 to af30491 Compare June 10, 2022 02:14

SaxyPandaBear reviewed Jun 10, 2022

khanhntd force-pushed the imds_v2 branch 2 times, most recently from 380e2e0 to 024373d Compare June 10, 2022 04:07

sethAmazon reviewed Jun 10, 2022

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated

sethAmazon reviewed Jun 10, 2022

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated

sethAmazon reviewed Jun 10, 2022

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated

khanhntd force-pushed the imds_v2 branch from 024373d to 676b3cd Compare June 10, 2022 16:13

khanhntd added this to the 1.247354.0 milestone Jun 10, 2022

khanhntd force-pushed the imds_v2 branch 6 times, most recently from ba170b6 to 7dc9161 Compare June 11, 2022 10:01

SaxyPandaBear reviewed Jun 11, 2022

SaxyPandaBear reviewed Jun 12, 2022

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated

SaxyPandaBear reviewed Jun 21, 2022

Reduce timeout for scrapping IMDS and give instruction when fail to scrape IMDS inside container

9921a14

SaxyPandaBear approved these changes Jun 22, 2022
