Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Env vars to configure IMDS retry and timeouts #625

Open
1 of 2 tasks
kevinpark1217 opened this issue Sep 30, 2022 · 12 comments
Open
1 of 2 tasks

Env vars to configure IMDS retry and timeouts #625

kevinpark1217 opened this issue Sep 30, 2022 · 12 comments
Labels
feature-request A feature should be added or improved. good first issue Good for newcomers p3 This is a minor priority issue

Comments

@kevinpark1217
Copy link
Contributor

kevinpark1217 commented Sep 30, 2022

Describe the feature

Other official AWS SDK libraries support specifying AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT environment variables to automatically retry IMDS requests.

This feature is currently missing in the aws-sdk-rust making it more difficult for applications to handle rare credential failures originating from IMDS requests.

Use Case

When applications are deployed in Kubernetes cluster with KIAM project intercepting and redirecting IMDS requests, it can be flaky.

Applications such as Vector will out right abort and throw-away the current operation when it encounters an IMDS credential error. It would be super beneficial to have the retry abilities built-in to the SDK itself.

Proposed Solution

Implement retry logic into the library with AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT environment variables support.

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

A note for the community

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue, please leave a comment
@kevinpark1217 kevinpark1217 added feature-request A feature should be added or improved. needs-triage This issue or PR still needs to be triaged. labels Sep 30, 2022
@unkempthenry
Copy link

Like @kevinpark1217 mentioned our calls to KIAM are proxied by KIAM, and while KIAM is in maintenance mode we're stuck on it for the time being.

It appears like the responsiveness of IMDS can depend on load, so supporting these env vars could help users increase reliability when something running the SDK is sharing an instance with a caller that is hitting IMDS hard.

I think it's possible to adjust the timeout in code, but supporting the env vars would make it easy for everyone to tweak these settings for all the downstream applications of the SDK.

@jdisanti
Copy link
Contributor

It looks like the Go SDK also has this feature request: aws/aws-sdk-go#3495

Seems like these env vars need to be standardized across the SDKs.

@jdisanti jdisanti removed the needs-triage This issue or PR still needs to be triaged. label Sep 30, 2022
@jdisanti jdisanti changed the title Support for IMDS Retry attempts and interval Env vars to configure IMDS retry and timeouts Sep 30, 2022
@jdisanti jdisanti added the good first issue Good for newcomers label Sep 30, 2022
@unkempthenry
Copy link

looks like the support across the different SDKs is tracked here https://docs.aws.amazon.com/sdkref/latest/guide/feature-ec2-instance-metadata.html

@Sigurthorb
Copy link

I would be interested in working on this, couple of clarifying questions to the team:

  1. ec2 metadata page linked above defines default max_retries should be 1 but current value is 4, should this be changed to 1 to be consistent cross SDKs?
  2. Does the AWS_METADATA_SERVICE_TIMEOUT represent both connection and read timeout? i.e. if set to 5, connection timeout would be set to 5 and read timeout would be set to 5?
  3. builder already exposes connection_timeout read_timeout functions but does not use those values anywhere, how do you see the single timeout environment configuration interacting with those two configurations?
    • I'm thinking I would take on fixing this in the PR as well.

@kevinpark1217
Copy link
Contributor Author

@Sigurthorb Oops, I didn't see your comment on taking on this issue. I took a stab at it with #626 PR.

should this be changed to 1 to be consistent cross SDKs?

I think I should probably change this to 1 for consistency. Let me know if anyone else has opinions on this.

My question for the reviewers, what's the standard way of handling parsing error in the AWS sdk?

  1. Fail outright and terminate
  2. Display warning, but use default values
  3. Use default values silently

@Sigurthorb
Copy link

You technically called first dibs with your checkmark to I may be able to implement this feature request and you reported it. Good luck!

@Velfi
Copy link
Contributor

Velfi commented Oct 4, 2022

@Sigurthorb Oops, I didn't see your comment on taking on this issue. I took a stab at it with #626 PR.

should this be changed to 1 to be consistent cross SDKs?

I think I should probably change this to 1 for consistency. Let me know if anyone else has opinions on this.

My question for the reviewers, what's the standard way of handling parsing error in the AWS sdk?

1. Fail outright and terminate

2. Display warning, but use default values

3. Use default values silently

When encountering parsing errors, we typically disregard the config source that produced them and move on to other config sources in the provider chain. If all configs are invalid, then we will either set a default or panic depending on the specific config variable.

@kevinpark1217
Copy link
Contributor Author

kevinpark1217 commented Oct 6, 2022

@Velfi Sorry for taking a while to follow up on this. I was preparing the slides for the AWS reInvent presentation.

I have opened a new PR in the smithy-rs repo. Based on your error handling suggestion, I simplified the code a little bit. For this initial PR, I left the EndpointSource struct untouched for code readability and initial review.

I think it would be good to implement this feature, since Python SDK already supports this. But also, the IMDS retry isn't well exposed for programmatically configuring by other parts of the Rust SDK.

smithy-lang/smithy-rs#1821

@kevinpark1217
Copy link
Contributor Author

kevinpark1217 commented Oct 6, 2022

Also, there appears to be a regression where
self.read_timeout and self.connect_timeout fields is now unused in the imds/client.rs

Checkout here. (takes a while to auto scroll to the correct position)

Edit: Opened an issue and a PR (smithy-lang/smithy-rs#1822)

@kevinpark1217
Copy link
Contributor Author

I have found another way to implement timeout and retry attempts through code rather than environment variables. Please check out smithy-lang/smithy-rs#1867

@Velfi
Copy link
Contributor

Velfi commented Oct 19, 2022

@kevinpark1217 sorry for the slowness, we're still working on getting the next version released.

@jdisanti
Copy link
Contributor

@kevinpark1217 - Your changes to override the IMDS client have been released with release-2022-10-26.

I'm going to leave this feature request open for now to track interest in this configuration as environment variables.

@jmklix jmklix added the p3 This is a minor priority issue label Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved. good first issue Good for newcomers p3 This is a minor priority issue
Projects
None yet
Development

No branches or pull requests

6 participants