Reduce timeout for scraping IMDS and give instructions when scraping IMDS fails inside a container #480

Merged
khanhntd merged 10 commits into aws:master from khanhntd:imds_v2 on Jun 22, 2022

Contributor

@khanhntd khanhntd commented Jun 2, 2022

Description of the issue

Whenever customers turn on the EC2 Tagger, they need to fulfill these two conditions before they can scrape metadata from IMDS:

However, to scrape IMDSv2 while running in a Docker container, customers need to increase their hop limit to 2 in the general case, and to 3 in certain cases such as vanilla kOps. One thing to note here is that for most AWS resources that manage containers, such as EKS, the default hop limit is 2. If customers enable IMDSv1, the AWS SDK for Go falls back to IMDSv1 by default and CloudWatchAgent does not need to worry about the hop limit. However, considering security concerns such as "Protecting against open layer 3 firewalls and NATs", it is best to assume that customers won't use IMDSv1 in the future, and it's not best practice to always rely on IMDSv1.

Therefore, we have considered these strategies for dealing with IMDSv2:

  • Fail faster by reducing the 4-minute timeout (4 retries at 1 retry per minute) to the recommended timeout strategy (same as Terraform), and show the corresponding course of action when the customer runs in a Docker container with only IMDSv2 enabled and a hop limit of 1.
  • Have CloudWatchAgent modify the hop limit under the hood. To modify the hop limit, the API requires an instance ID, and we can only get that from IMDS. However, we could read the instance ID from certain files that are created when EC2 instances are created, such as /var/lib/cloud/data/instance-id on Linux instances, and we would need a Docker mount between the container's volume and the host's volume.
  • Modify the TTL or hop limit of the token (IP packet) from IMDSv2; however, going against what IMDSv2 was designed for is not recommended for security reasons.

Therefore, after discussing with the team, we will go with the first strategy.

Description of changes

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

I have tested CloudWatchAgent on two edge cases:

  1. LXC Containers.
  2. Vanilla Kops.

Specific test steps for the second edge case:

  • Step 1: Build a kOps cluster by following this document (https://kops.sigs.k8s.io/getting_started/aws/)
  • Step 2: Build the Docker image by using make dockerized-build and publish your image through ECR
  • Step 3: Replace the ECR image and the config with the template below
{
	"agent": {
		"metrics_collection_interval": 60,
		"run_as_user": "root"
	},
	"metrics": {
		"append_dimensions": {
			"AutoScalingGroupName": "${aws:AutoScalingGroupName}",
			"ImageId": "${aws:ImageId}",
			"InstanceId": "${aws:InstanceId}",
			"InstanceType": "${aws:InstanceType}"
		},
		"metrics_collected": {
			"disk": {
				"measurement": [
					"used_percent"
				],
				"metrics_collection_interval": 60,
				"resources": [
					"*"
				]
			}
		}
	}
}
  • Step 4: Observe the following expected results
2022-06-07T03:36:11Z I! 
2022-06-07T03:36:11Z E! [processors.ec2tagger] ec2tagger: Unable to retrieve Instance Metadata Tags. This plugin must only be used on an EC2 instance.
2022-06-07T03:36:11Z E! [processors.ec2tagger] ec2tagger: Please increase hop limit to 2 by following this document https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html#configuring-IMDS-existing-instances.
2022-06-07T03:36:11Z E! [telegraf] Error running agent: could not initialize processor ec2tagger: EC2MetadataRequestError: failed to get EC2 instance identity document
caused by: EC2MetadataError: failed to make EC2Metadata request
        status code: 401, request id: 
caused by:

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make linter

@khanhntd khanhntd changed the title from "Imds v2" to "Always setting hops to 2 if CloudWatchAgent is deployed as container" Jun 2, 2022
@khanhntd khanhntd force-pushed the imds_v2 branch 10 times, most recently from f05b1a8 to 32e5e4a Compare June 2, 2022 16:25

codecov-commenter commented Jun 2, 2022

Codecov Report

Merging #480 (9921a14) into master (5335531) will increase coverage by 0.04%.
The diff coverage is 61.19%.

@@            Coverage Diff             @@
##           master     #480      +/-   ##
==========================================
+ Coverage   56.81%   56.86%   +0.04%     
==========================================
  Files         374      363      -11     
  Lines       17711    16937     -774     
==========================================
- Hits        10063     9631     -432     
+ Misses       7057     6754     -303     
+ Partials      591      552      -39     
Impacted Files                                           Coverage Δ
translator/util/sdkutil.go                               0.00% <ø> (ø)
plugins/processors/ec2tagger/ec2tagger.go                80.48% <61.19%> (-4.08%) ⬇️
...md/amazon-cloudwatch-agent-config-wizard/wizard.go    56.94% <0.00%> (-11.12%) ⬇️
plugins/inputs/demo/demo.go                              50.00% <0.00%> (-7.15%) ⬇️
...anslator/translate/metrics/util/measurementutil.go    30.83% <0.00%> (-1.92%) ⬇️
plugins/outputs/cloudwatch/cloudwatch.go                 73.37% <0.00%> (-0.70%) ⬇️
...translate/metrics/metrics_collect/gpu/nvidiaSmi.go    94.73% <0.00%> (-0.14%) ⬇️
plugins/inputs/logfile/tail/tail_windows.go
translator/util/platform_windows.go
...gins/inputs/windows_event_log/windows_event_log.go
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5335531...9921a14. Read the comment docs.

@khanhntd khanhntd force-pushed the imds_v2 branch 16 times, most recently from 08cdab4 to 6ae3cc8 Compare June 5, 2022 22:20
@khanhntd khanhntd force-pushed the imds_v2 branch 7 times, most recently from 1cec9fb to 2bf9f76 Compare June 9, 2022 05:09
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Contributor

"This plugin must only be used on an EC2 instance"

Does the customer configure the ec2tagger? I thought that the CloudWatch agent does that under the hood. What is the point of calling this out if the customer doesn't have control over this?

Contributor Author

@khanhntd khanhntd Jun 10, 2022

Yes. Customers configure the EC2Tagger indirectly; therefore, I see the point of not directly showing "This plugin must only be used on an EC2 instance." An example is below:

{
	"metrics": {
		"append_dimensions": {
			"AutoScalingGroupName": "${aws:AutoScalingGroupName}",
			"ImageId": "${aws:ImageId}",
			"InstanceId": "${aws:InstanceId}",
			"InstanceType": "${aws:InstanceType}"
		}
	}
}

However, the log message is carried over from the old code, and I do not have enough context to remove it, so I kept it here.

Contributor

We have the opportunity to address this to make the error message clearer. Can you explain your example? What does this configuration have to do with the ec2tagger specifically? If you can describe that, then we probably have enough data to reword the error message.

Contributor Author

@khanhntd khanhntd Jun 10, 2022

Yes. I'm not good at wording so thanks for the suggestion. For example,

{
	"metrics": {
		"append_dimensions": {
			"ImageId": "${aws:ImageId}",
			"InstanceId": "${aws:InstanceId}",
			"InstanceType": "${aws:InstanceType}"
		}
	}
}

If the customer specifies these three dimensions for their collected metrics, the CloudWatchAgent will scrape these dimensions from IMDS. In other words, the customer's JSON configuration uses the EC2 Tagger under the hood when appending the aforementioned dimensions, but this is not shown explicitly in the JSON configuration.
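As a rough illustration of that indirect relationship, the sketch below parses the append_dimensions section and reports which dimensions would need IMDS. The type and function names are hypothetical and this is not the agent's translator code; the set of placeholders is taken only from the three dimensions named in this comment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// agentConfig mirrors only the part of the agent JSON used in this thread.
type agentConfig struct {
	Metrics struct {
		AppendDimensions map[string]string `json:"append_dimensions"`
	} `json:"metrics"`
}

// imdsPlaceholders lists the placeholder values that, per the comment above,
// are resolved from instance metadata (an assumption for this sketch).
var imdsPlaceholders = map[string]bool{
	"${aws:ImageId}":      true,
	"${aws:InstanceId}":   true,
	"${aws:InstanceType}": true,
}

// dimensionsNeedingIMDS returns, in sorted order, the dimension names whose
// configured value is one of the IMDS-backed placeholders.
func dimensionsNeedingIMDS(raw []byte) ([]string, error) {
	var cfg agentConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	var dims []string
	for name, value := range cfg.Metrics.AppendDimensions {
		if imdsPlaceholders[value] {
			dims = append(dims, name)
		}
	}
	sort.Strings(dims)
	return dims, nil
}

func main() {
	raw := []byte(`{"metrics":{"append_dimensions":{
		"ImageId":"${aws:ImageId}",
		"InstanceId":"${aws:InstanceId}",
		"InstanceType":"${aws:InstanceType}"}}}`)
	dims, err := dimensionsNeedingIMDS(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(dims) // [ImageId InstanceId InstanceType]
}
```

A non-empty result is the case where the customer has, in effect, enabled the EC2 Tagger without naming it, which is why the error message should point at the configuration rather than at the plugin.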

Comment thread translator/util/ec2util/ec2util.go Outdated
Comment thread translator/util/ec2util/ec2util.go Outdated
Comment thread translator/util/ec2util/ec2util.go Outdated
Contributor

So the only two scenarios where we would return an actual error are if:

  1. There is some error with creating a session
  2. IMDS is unavailable

We fail silently (with logs) in a case where there is an error trying to fetch specific data from the metadata endpoint. Why is that? There should be more consistency with the behavior of this function. Does it make sense to return a value here? If so, what are the different intended values, and what causes them to be returned?

Contributor

Expanding on this - does it make sense to not return an error if there is a non-nil error when trying to get the hostname from the metadata endpoint? Isn't that value required to be populated?

Contributor Author

@khanhntd khanhntd Jun 10, 2022

As you mentioned, we would return an actual error in both cases. However, does that mean the error will impact the whole process? My answer is no (the same as the existing behavior); the returned error value only serves to consolidate the message for customers. Based on my understanding, the existing CloudWatchAgent behavior already accounts for the possibility that fetching the metadata (the hostname or the instance identity document) fails. However, it does not return an error because the value can be replaced by the placeholder function.

So, long story short, CloudWatchAgent only expects failures from the aforementioned cases and will always return values (even if they differ from the intended values because of the replacement by the placeholder function), so there is no reason to return an error when fetching from one of these metadata endpoints fails.

Note: This does not apply to EC2Tagger

Contributor Author

@khanhntd khanhntd Jun 10, 2022

For example, if scraping the hostname fails, there is a replacement value for the hostname.

	hostname := provider().Hostname
	if hostname == "" {
		hostname = localHostname
	}

This applies to the metadata scraped from the instance identity document too. Therefore, the value is not required to be populated. So I stand by not returning an error when there is a non-nil error, when considering only the scraping of metadata.

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread translator/util/ec2util/ec2util.go Outdated
Comment thread translator/util/ec2util/ec2util.go Outdated
@khanhntd khanhntd force-pushed the imds_v2 branch 2 times, most recently from 380e2e0 to 024373d Compare June 10, 2022 04:07
Contributor

This should be a random number imo. Think all systems send out the call at the same time. Then next round they all send at the same time. It needs to be random within an interval. This comment is probs out of scope for this pr but something to think about for best behaved agents. @straussb thoughts.

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
@khanhntd khanhntd added this to the 1.247354.0 milestone Jun 10, 2022
@khanhntd khanhntd force-pushed the imds_v2 branch 6 times, most recently from ba170b6 to 7dc9161 Compare June 11, 2022 10:01
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Contributor

Why are we putting all of this information inside of another struct? I don't see any benefit for pulling these out into a struct.

Contributor Author

It's the same reason as for ec2MetadataLookup:

  • Increased consolidation, since it consolidates all of the EC2MetadataRespond fields
  • Fewer variables to be aware of when looking at the EC2Tagger interface

Contributor

How is it less variables to be aware of if you are introducing another variable? I don't think the reasons you presented are convincing enough for this change. Are we going to be passing around this "response" struct outside of this Tagger? If not, it just makes it more confusing and harder to maintain.

Contributor Author

@khanhntd khanhntd Jun 12, 2022

Based on my understanding, no: neither EC2MetadataLookup nor EC2MetadataRespond needs to be passed outside the Tagger.
Long story short, this metadata is parsed from IMDS, so why is it more confusing and harder to maintain? This follows the definition of a struct type (it is used to group related data to form a single unit), so I'm not sure why we should not make the change.

	instanceId     string
	imageId        string // aka AMI
	instanceType   string
	region         string


Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Contributor

Moving all of the functions around in the file makes this a confusing diff so it'll take longer to review.

Contributor Author

I agree that it takes longer and is confusing. However, for a better structure (e.g. following the ecs_decorator), it's worth restructuring, and it is more considerate of future reviewers.

Contributor

I disagree. We don't have clearly defined standards for the structure or order of functions. Moving functions around in this file just creates noise. If we want to reformat the file to follow some structure, that should not be part of a PR that updates functionality. Now the reviewer has to closely examine all of the changes.

Contributor Author

If that's the case, I will revert it back 👍

Contributor

Why is there inconsistency for the map keys? tagKey1 is created as a var/const at the top of the file but "AutoScalingGroupName" gets redefined in all of the tests.

Contributor Author

@khanhntd khanhntd Jun 11, 2022

Here is the reason why though:

## Note: This plugin renames the "aws:autoscaling:groupName" EC2 Instance Tag key to be spelled "AutoScalingGroupName".
## This aligns it with the AutoScaling dimension-name seen in AWS CloudWatch.
# ec2_instance_tag_keys = ["aws:autoscaling:groupName", "Name"]

Another explanation can be found:

// if the customer said 'AutoScalingGroupName' (the CW dimension), do what they mean not what they said
// and filter for the EC2 tag name called 'aws:autoscaling:groupName'

Contributor

I don't understand what this means

Contributor Author

@khanhntd khanhntd Jun 12, 2022

Based on my understanding, whenever CWAgent retrieves EC2 instance tags (not volumes and not IMDS), the EC2 instance has the tag key aws:autoscaling:groupName for its Auto Scaling group; we then scrape it, convert the key to AutoScalingGroupName (back and forth), and append the Auto Scaling group as a dimension to the collected metrics (to match the AutoScalingGroupName key in the CWAgent JSON configuration).

So, long story short, the EC2 instance has the tag aws:autoscaling:groupName (and its value) whenever an Auto Scaling group is created; we collect the aws:autoscaling:groupName tag, convert the key to AutoScalingGroupName, and append it to the metric as a dimension.

Let me know if that makes sense.
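The back-and-forth renaming described above can be sketched like this. The two key strings come from the quoted plugin comments; the constant and function names are hypothetical and are not copied from ec2tagger.go.

```go
package main

import "fmt"

const (
	ec2TagKeyASG   = "aws:autoscaling:groupName" // tag key as stored on the EC2 instance
	cwDimensionASG = "AutoScalingGroupName"      // dimension name used in CloudWatch
)

// toEC2TagKey maps the configured dimension name to the EC2 tag key used
// when filtering instance tags. Other keys pass through unchanged.
func toEC2TagKey(configured string) string {
	if configured == cwDimensionASG {
		return ec2TagKeyASG
	}
	return configured
}

// toDimensionName is the reverse mapping, applied before the tag is
// appended to a metric as a dimension.
func toDimensionName(tagKey string) string {
	if tagKey == ec2TagKeyASG {
		return cwDimensionASG
	}
	return tagKey
}

// tagsToDimensions renames the keys of the retrieved tags.
func tagsToDimensions(tags map[string]string) map[string]string {
	dims := make(map[string]string, len(tags))
	for k, v := range tags {
		dims[toDimensionName(k)] = v
	}
	return dims
}

func main() {
	fmt.Println(toEC2TagKey("AutoScalingGroupName")) // aws:autoscaling:groupName
	dims := tagsToDimensions(map[string]string{
		"aws:autoscaling:groupName": "my-asg",
		"Name":                      "web-1",
	})
	fmt.Println(dims["AutoScalingGroupName"], dims["Name"]) // my-asg web-1
}
```

This is also why the tests spell out "AutoScalingGroupName" literally: it is the one key whose name differs between the EC2 tag and the resulting dimension.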

Comment thread plugins/processors/ec2tagger/ec2tagger.go Outdated
Contributor

@SaxyPandaBear SaxyPandaBear left a comment

I see test output that shows that the new behavior errors out in the same second. Can you please also include logging without this change to illustrate what the existing behavior is?

4 participants