InstanceProfileCredentialsProvider unable to refresh/recover from network problems at time of credential cache refresh #5247

steveloughran · 2024-05-27T18:07:34Z

Describe the bug

The full details and analysis are in HADOOP-19181. IAMCredentialsProvider throttle failures

the credentials retrieved by IAMCredentialsProvider are set to expire() only 1s before the expiry time returned by the Instance metadata service
this is not enough time for the CachedSupplier to recover from any failure as on a failure it sets a jitter on a retry to ComparableUtils.minimum(Duration.ofMillis(exponentialBackoffMillis), Duration.ofSeconds(10)). That is: up to 9 seconds after the credentials expire.

It's notable that ContainerCredentialsProvider has more resilience in

credentials are set to expire 15 minutes before the credentials become invalid
it also retries 5 times with no sleep on a request failure.

Expected Behavior

IAMCredentialsProvider to always issue valid credentials. This should include asynchronous refreshing of credentials when enabled, far enough in advance of their expiry that recovery attempts can be repeated.

Current Behavior

invalid credentials were returned when requesting credentials to sign a request.

Reproduction Steps

see the hadoop bug report. The deployment was a single EC2 node running many services, long enough for the EC2 credentials to expire and need refreshing.

Possible Solution

I really don't see what we can do in the hadoop codebase to recover from this. We could consider extending our own IAMInstanceCredentialsProvider to wrap the retry failures with our own sleep/retry, but as it'd take > 10 seconds, API calls awaiting signing (e.g. S3 Express CreateSession) will time out.

I'd propose copying ContainerCredentialsProvider, if not with the retries then at least the expiry time many minutes ahead of actual credential expiry

Additional Information/Context

No response

AWS Java SDK version used

2.24.6

JDK version used

an openjdk java 8 build

Operating System and version

linux

The text was updated successfully, but these errors were encountered:

debora-ito · 2024-06-07T19:25:15Z

Hi @steveloughran

we added a task to analyze and improve the credential experience. Thank you for the detailed bug report - HADOOP-19181.

steveloughran · 2024-06-11T12:22:09Z

Is there an approximate timeline for this? I don't want to have to reimplement our own refresh thread code -not because it is hard, but because maintenance and testing with fault injection are the pain points.

steveloughran added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 27, 2024

debora-ito added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Jun 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

InstanceProfileCredentialsProvider unable to refresh/recover from network problems at time of credential cache refresh #5247

InstanceProfileCredentialsProvider unable to refresh/recover from network problems at time of credential cache refresh #5247

steveloughran commented May 27, 2024

debora-ito commented Jun 7, 2024

steveloughran commented Jun 11, 2024

InstanceProfileCredentialsProvider unable to refresh/recover from network problems at time of credential cache refresh #5247

InstanceProfileCredentialsProvider unable to refresh/recover from network problems at time of credential cache refresh #5247

Comments

steveloughran commented May 27, 2024

Describe the bug

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

Additional Information/Context

AWS Java SDK version used

JDK version used

Operating System and version

debora-ito commented Jun 7, 2024

steveloughran commented Jun 11, 2024