Reduce timeout for scraping IMDS and give instructions when failing to scrape IMDS inside container #480
Conversation
Force-pushed f05b1a8 to 32e5e4a
Codecov Report
@@ Coverage Diff @@
## master #480 +/- ##
==========================================
+ Coverage 56.81% 56.86% +0.04%
==========================================
Files 374 363 -11
Lines 17711 16937 -774
==========================================
- Hits 10063 9631 -432
+ Misses 7057 6754 -303
+ Partials 591 552 -39
Continue to review full report at Codecov.
Force-pushed 08cdab4 to 6ae3cc8
Force-pushed 1cec9fb to 2bf9f76
"This plugin must only be used on an EC2 instance"
Does the customer configure the ec2tagger? I thought that the CloudWatch agent does that under the hood. What is the point of calling this out if the customer doesn't have control over this?
Yes, customers configure the EC2Tagger, but only indirectly; therefore, I see the point of not surfacing "This plugin must only be used on an EC2 instance." directly. An example would be:
{
"metrics": {
"append_dimensions": {
"AutoScalingGroupName": "${aws:AutoScalingGroupName}",
"ImageId": "${aws:ImageId}",
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}"
}
}
}
However, the log message comes from the old command, and I don't have enough context to remove it, so I'm keeping it here.
We have the opportunity to address this to make the error message clearer. Can you explain your example? What does this configuration have to do with the ec2tagger specifically? If you can describe that, then we probably have enough data to reword the error message.
Yes. I'm not great at wording, so thanks for the suggestion. For example:
{
"metrics": {
"append_dimensions": {
"ImageId": "${aws:ImageId}",
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}"
}
}
}
If the customer specifies these three dimensions for their collected metrics, the CloudWatch agent will scrape them from IMDS. In other words, the customer's JSON configuration uses the EC2 Tagger under the hood when appending the aforementioned dimensions, but that is not shown explicitly in the JSON configuration.
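To make the indirection concrete, the resolution described above can be sketched as a small lookup: each `${aws:...}` placeholder in `append_dimensions` maps to a value the tagger scraped from IMDS. The function name and the hard-coded lookup table below are illustrative assumptions, not the agent's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveDimension turns a "${aws:Key}" placeholder from append_dimensions
// into the value scraped from IMDS. If IMDS did not provide the key, the
// placeholder is left unresolved. (Illustrative sketch, not the agent's code.)
func resolveDimension(placeholder string, imds map[string]string) string {
	key := strings.TrimSuffix(strings.TrimPrefix(placeholder, "${aws:"), "}")
	if v, ok := imds[key]; ok {
		return v
	}
	return placeholder
}

func main() {
	// In the agent these values come from the instance identity document;
	// they are hard-coded here only for demonstration.
	imds := map[string]string{
		"InstanceId":   "i-0123456789abcdef0",
		"InstanceType": "t3.micro",
	}
	fmt.Println(resolveDimension("${aws:InstanceId}", imds)) // i-0123456789abcdef0
	fmt.Println(resolveDimension("${aws:ImageId}", imds))    // ${aws:ImageId}
}
```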
So the only two scenarios where we would return an actual error are if:
- There is some error with creating a session
- IMDS is unavailable
We fail silently (with logs) in a case where there is an error trying to fetch specific data from the metadata endpoint. Why is that? There should be more consistency with the behavior of this function. Does it make sense to return a value here? If so, what are the different intended values, and what causes them to be returned?
Expanding on this - does it make sense to not return an error if there is a non-nil error when trying to get the hostname from the metadata endpoint? Isn't that value required to be populated?
As you mentioned, we would return an actual error in both cases. However, does that mean the error should impact the whole process? My answer is no (matching the existing behavior); the returned error value exists only to consolidate the message for the customer. Based on my understanding, the existing CloudWatch agent behavior is aware that it can fail to get the metadata (the hostname or the instance identity document). However, it does not return an error, because the missing value can be replaced by the placeholder function.
Long story short, the CloudWatch agent only expects failures in the aforementioned cases and will always return values (even if they differ from the intended ones because of the placeholder replacement), so there is no reason to return an error when fetching any of these metadata endpoints.
Note: this does not apply to the EC2Tagger.
For example, if the hostname fails to be scraped, a replacement value is used:
hostname := provider().Hostname
if hostname == "" {
hostname = localHostname
}
This applies to the metadata scraped from the instance identity document too. Therefore, the value is not required to be populated, so I stand by not returning an error on a non-nil error when we are only scraping metadata.
Force-pushed 380e2e0 to 024373d
This should be a random number imo. Think about it: all systems send out the call at the same time, then next round they all send at the same time again. It needs to be random within an interval. This comment is probably out of scope for this PR, but it is something to think about for best-behaved agents. @straussb thoughts?
Force-pushed ba170b6 to 7dc9161
Why are we putting all of this information inside of another struct? I don't see any benefit for pulling these out into a struct.
It's the same reason as for ec2MetadataLookup:
- It increases consolidation, since it consolidates the whole EC2 metadata response
- There are fewer variables to be aware of when looking at the EC2Tagger interface
How are there fewer variables to be aware of if you are introducing another variable? I don't think the reasons you presented are convincing enough for this change. Are we going to be passing around this "response" struct outside of this Tagger? If not, it just makes it more confusing and harder to maintain.
Based on my understanding, no: neither EC2MetadataLookup nor EC2MetadataRespond needs to be passed outside the Tagger.
Long story short, this metadata is parsed from IMDS, so why would it be more confusing and harder to maintain? This follows the definition of a struct (it is used to group related data into a single unit), so I'm not sure why we wouldn't make this change.
instanceId   string
imageId      string // aka AMI
instanceType string
region       string
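The fields above would then travel as one unit. A minimal sketch of the grouping under discussion, assuming a struct name like the one in the thread (the agent's actual field and type names may differ):

```go
package main

import "fmt"

// ec2MetadataLookup groups the values parsed from the EC2 instance
// identity document so they move through the tagger as a single unit.
// (Names here mirror the review thread and are illustrative only.)
type ec2MetadataLookup struct {
	instanceID   string
	imageID      string // aka AMI
	instanceType string
	region       string
}

func main() {
	// One value to pass around instead of four loose variables.
	m := ec2MetadataLookup{
		instanceID:   "i-0123456789abcdef0",
		imageID:      "ami-0abcd1234",
		instanceType: "t3.micro",
		region:       "us-west-2",
	}
	fmt.Println(m.region) // us-west-2
}
```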
Moving all of the functions around in the file makes this a confusing diff so it'll take longer to review.
Agreed, it takes longer and is confusing. However, for a better structure (e.g., following the ecs_decorator), it's worth restructuring, and it is more considerate to future reviewers.
I disagree. We don't have clearly defined standards for the structure or order of functions. Moving functions around in this file just creates noise. If we want to reformat the file to follow some structure, that should not be part of a PR that updates functionality. Now the reviewer has to closely examine all of the changes.
If that's the case, I will revert it 👍
Why is there inconsistency for the map keys? tagKey1 is created as a var/const at the top of the file but "AutoScalingGroupName" gets redefined in all of the tests.
Here is the reason why though:
## Note: This plugin renames the "aws:autoscaling:groupName" EC2 Instance Tag key to be spelled "AutoScalingGroupName".
## This aligns it with the AutoScaling dimension-name seen in AWS CloudWatch.
# ec2_instance_tag_keys = ["aws:autoscaling:groupName", "Name"]
Another explanation can be found:
// if the customer said 'AutoScalingGroupName' (the CW dimension), do what they mean not what they said
// and filter for the EC2 tag name called 'aws:autoscaling:groupName'
I don't understand what this means
Based on my understanding, whenever the CloudWatch agent retrieves EC2 instance tags (not volume tags or IMDS values), the instance carries the tag key aws:autoscaling:groupName for its Auto Scaling group. We scrape that tag, convert the key to AutoScalingGroupName (and back again when filtering), and append the Auto Scaling group as a dimension on the collected metrics (to match the AutoScalingGroupName key in the agent's JSON configuration).
Long story short: the EC2 instance gets the aws:autoscaling:groupName tag (and its value) whenever an Auto Scaling group is created; we collect that tag, convert the key to AutoScalingGroupName, and append it to the metric as a dimension.
Let me know if that makes sense.
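The back-and-forth key conversion described above can be sketched as a pair of mappings: translate the configured CloudWatch dimension name into the EC2 tag key when filtering tags, and translate the scraped tag key back when emitting the dimension. The constant and function names are illustrative; only the two key strings come from the thread.

```go
package main

import "fmt"

// The customer writes the CloudWatch dimension name, but EC2 stores the
// Auto Scaling group under its own tag key. These two strings are quoted
// from the config comments in the thread; everything else is a sketch.
const (
	cwDimensionASG = "AutoScalingGroupName"
	ec2TagKeyASG   = "aws:autoscaling:groupName"
)

// toEC2TagKey translates a configured key into the EC2 tag key to filter
// for: "do what they mean, not what they said".
func toEC2TagKey(configured string) string {
	if configured == cwDimensionASG {
		return ec2TagKeyASG
	}
	return configured
}

// toDimensionName translates a scraped EC2 tag key back into the
// CloudWatch dimension name appended to the metric.
func toDimensionName(tagKey string) string {
	if tagKey == ec2TagKeyASG {
		return cwDimensionASG
	}
	return tagKey
}

func main() {
	fmt.Println(toEC2TagKey("AutoScalingGroupName"))      // aws:autoscaling:groupName
	fmt.Println(toDimensionName("aws:autoscaling:groupName")) // AutoScalingGroupName
	fmt.Println(toEC2TagKey("Name"))                      // Name (other keys pass through)
}
```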
SaxyPandaBear left a comment
I see test output that shows that the new behavior errors out in the same second. Can you please also include logging without this change to illustrate what the existing behavior is?
…crape IMDS inside container
Description of the issue
Whenever customers turn on the EC2 Tagger, they need to fulfill these two conditions before being able to scrape metadata from IMDS:
However, when scraping with IMDSv2 while running in a Docker container, customers need to increase their hop limit to 2 in the general case, and to 3 in certain cases such as KOPS Vanilla. One thing to note here is that for most AWS resources that manage containers, such as EKS, the default hop limit is already 2. If customers enable IMDSv1, the AWS SDK for Go falls back to IMDSv1 by default and the CloudWatch agent does not need to worry about the hop limit. However, considering security concerns such as "protecting against open layer 3 firewalls and NATs", it is best to assume customers won't use IMDSv1 in the future; always relying on IMDSv1 is not a best practice.
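To illustrate why the timeout matters here: when the hop limit is too low inside a container, the IMDSv2 token request simply never gets a response, so a long client timeout makes the agent hang before it can print an actionable message. The endpoint and headers below are the documented IMDSv2 ones; the 1-second timeout and function name are illustrative, not the agent's exact settings.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// imdsTokenURL is the documented IMDSv2 session-token endpoint.
const imdsTokenURL = "http://169.254.169.254/latest/api/token"

// newIMDSTokenRequest builds the PUT request that starts an IMDSv2
// session, asking for a 6-hour token TTL.
func newIMDSTokenRequest() (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, imdsTokenURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	return req, nil
}

func main() {
	// Fail fast: with a low hop limit the PUT above times out rather
	// than erroring, so keep the timeout short and tell the user what
	// to do instead of silently retrying.
	client := &http.Client{Timeout: 1 * time.Second}
	req, err := newIMDSTokenRequest()
	if err != nil {
		panic(err)
	}
	if _, err := client.Do(req); err != nil {
		fmt.Println("unable to reach IMDSv2; if running in a container, " +
			"increase the instance metadata hop limit (typically to 2)")
	}
}
```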
Therefore, we have considered these strategies for dealing with IMDSv2:
/var/lib/cloud/data/instance-id), and we would need to use a Docker mount between the container's volume and the host's volume. Therefore, after discussing with the team, we will go with the first strategy.
Description of changes
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
I have tested CloudWatchAgent on two edge cases:
For the specific test on the second edge case:
make dockerized-build and publish your image through ECR
Requirements
Before committing the code, please do the following steps:
make fmt and make fmt-sh
make linter