Configure backoff and retry options for credentials provider authentication refresh #59

Closed
olileach opened this issue Feb 20, 2022 · 5 comments


olileach commented Feb 20, 2022

We are running a large number of processes on EMR: 10 YARN jobs on a single EC2 instance, each spawning 8 processes using a Java Futures object. Our EMR cluster has several EC2 instances; some exhibit problems authenticating and some don't. After the EMR jobs have been running for a few hours, we see intermittent authentication failures when the aws-msk-iam-auth library tries to refresh the IAM token in order to continue processing messages from MSK. Here's the error message we receive:


ExtractionConsumer:116 - Extraction Kafka processor has failed: topic=develop_headstate_adjustment  
org.apache.kafka.common.errors.SaslAuthenticationException: An error: (java.security.PrivilegedActionException:   
javax.security.sasl.SaslException: Failed to find AWS IAM Credentials [Caused by com.amazonaws.SdkClientException: 
Unable to load AWS credentials from any provider in the chain [com.amazonaws.auth.AWSCredentialsProviderChain@523d0202: 
Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: 
Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID 
(or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), 
SystemPropertiesCredentialsProvider: 
Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), 
WebIdentityTokenCredentialsProvider: You must specify a value for roleArn and roleSessionName,
software.amazon.msk.auth.iam.internals.EnhancedProfileCredentialsProvider@1f17510: 
Profile file contained no credentials for profile 'default': ProfileFile(profiles=[]), 
com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@525bf518: null]]]) 
occurred when evaluating SASL token received from the Kafka Broker. 
Kafka Client will go to AUTHENTICATION_FAILED state.
Caused by: javax.security.sasl.SaslException: Failed to find AWS IAM Credentials

The credentials provider should be using the instance profile attached to the EC2 instance. Following the errors above, you can see that the process walks the chain described at https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html, finds no credentials from any provider, and then fails. The key point is that the issue is intermittent: most of the time, authentication works. However, when no provider in the default chain returns credentials, the YARN job fails and so do the EMR jobs.

I can see where the token refresh callback is:

protected void handleCallback(AWSCredentialsCallback callback) throws IOException {

It would be great to have some configuration that lets us set a backoff and retry for refreshing the IAM credentials, to handle situations where the metadata service may be throttling requests under particularly high load.

Similar to the backoff for the number of connections to MSK, we would like options to configure the backoff in ms (say 1000 or 2000) and the number of retry attempts.

So if the option is specified, sleep 1 or 2 seconds (or time based on the provided configuration) and retry 3 times?
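
To make the request concrete, here is a minimal sketch of the retry-with-fixed-backoff behaviour described above. It is not the library's implementation: the class name `CredentialRetry`, the method `fetchWithRetry`, and the parameters `maxRetries` and `backoffMs` are all hypothetical, and a plain `Supplier` stands in for the credentials lookup so the example needs no AWS dependencies. It assumes the lookup signals failure with an unchecked exception, as the AWS SDK for Java v1 does with `SdkClientException`.

```java
import java.util.function.Supplier;

public class CredentialRetry {

    /**
     * Calls fetch once, then retries up to maxRetries more times,
     * sleeping backoffMs between attempts. All names are hypothetical.
     */
    public static <T> T fetchWithRetry(Supplier<T> fetch, int maxRetries, long backoffMs) {
        if (maxRetries < 0 || backoffMs < 0) {
            throw new IllegalArgumentException("maxRetries and backoffMs must be non-negative");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                // Stand-in for the credentials lookup (assumption: it throws
                // an unchecked exception when no credentials are found).
                return fetch.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxRetries) {
                    sleepQuietly(backoffMs);
                }
            }
        }
        // Every attempt failed: surface the most recent failure.
        throw last;
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", ie);
        }
    }
}
```

With `maxRetries = 3` and `backoffMs = 1000`, a lookup that fails twice and then succeeds would return on the third call after roughly two seconds of sleeping. A fixed delay is used here because that is what the request describes; exponential backoff with jitter is a common alternative when many processes on one instance retry at the same time.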

Thanks in advance

@sayantacC
Contributor

@olileach Thanks for reporting the problem. I will look into the error handling in that code path. In the meantime, is there any chance of getting debug logs of such a failure, say in a test environment? They would be very helpful for finding out the exact problem the EC2ContainerCredentialsProviderWrapper runs into while fetching credentials.

@olileach
Author

@sayantacC - Thanks for such a quick response on this. We will report back and update the open support case with the relevant info.

@olileach
Author

olileach commented Mar 3, 2022

It looks like the new branch released to solve this issue has fixed our problem: we now see the retry after a failed auth, and the job continues rather than failing with the no default credentials error. Could this branch be merged to main? Then I'd be happy to close this issue. Thanks.

@sayantacC
Contributor

sayantacC commented Mar 3, 2022

I have merged the branch to main as 1.1.3.
I have not yet had a chance to release it to maven. I will try to get the release done in the next few days.

@sayantacC
Contributor

Version 1.1.3 has been released. It should show up in the maven repos in a day or two.


2 participants