Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InstanceProfileCredentialsProvider unable to refresh/recover from network problems at time of credential cache refresh #5247

Open
steveloughran opened this issue May 27, 2024 · 2 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@steveloughran
Copy link

Describe the bug

The full details and analysis are in HADOOP-19181. IAMCredentialsProvider throttle failures

  • the credentials retrieved by IAMCredentialsProvider are set to expire() only 1s before the expiry time returned by the Instance metadata service
  • this is not enough time for the CachedSupplier to recover from any failure as on a failure it sets a jitter on a retry to ComparableUtils.minimum(Duration.ofMillis(exponentialBackoffMillis), Duration.ofSeconds(10)). That is: up to 9 seconds after the credentials expire.

It's notable that ContainerCredentialsProvider has more resilience in

  • credentials are set to expire 15 minutes before the credentials become invalid
  • it also retries 5 times with no sleep on a request failure.

Expected Behavior

IAMCredentialsProvider to always issue valid credentials. This should include asynchronous refreshing of credentials when enabled, far enough in advance of their expiry that recovery attempts can be repeated.

Current Behavior

invalid credentials were returned when requesting credentials to sign a request.

Reproduction Steps

see the hadoop bug report. The deployment was a single EC2 node running many services, long enough for the EC2 credentials to expire and need refreshing.

Possible Solution

I really don't see what we can do in the hadoop codebase to recover from this. We could consider extending our own IAMInstanceCredentialsProvider to wrap the retry failures with our own sleep/retry, but as it'd take > 10 seconds, API calls awaiting signing (e.g. S3 Express CreateSession) will time out.

I'd propose copying ContainerCredentialsProvider, if not with the retries then at least the expiry time many minutes ahead of actual credential expiry

Additional Information/Context

No response

AWS Java SDK version used

2.24.6

JDK version used

an openjdk java 8 build

Operating System and version

linux

@steveloughran steveloughran added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 27, 2024
@debora-ito
Copy link
Member

Hi @steveloughran

we added a task to analyze and improve the credential experience. Thank you for the detailed bug report - HADOOP-19181.

@debora-ito debora-ito added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Jun 7, 2024
@steveloughran
Copy link
Author

Is there an approximate timeline for this? I don't want to have to reimplement our own refresh thread code -not because it is hard, but because maintenance and testing with fault injection are the pain points.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

2 participants