EKS IRSA role assumption fails silently due to dependency issue #3555

Closed
d-t-w opened this issue Nov 18, 2022 · 8 comments
Labels
bug This issue is a bug. p3 This is a minor priority issue


d-t-w commented Nov 18, 2022

Describe the bug

Kpow for Apache Kafka is an enterprise toolkit for Apache Kafka that includes multiple AWS libraries including LicenseManager, MSK Iam Auth, and AWS Glue.

Including the AWS Glue dependency silently breaks IRSA, causing the pod to run under the NodeInstanceRole rather than the properly configured IRSA role.

This is likely to impact projects intending to use IRSA with MSK, as the MSK IAM library and AWS Glue libraries are very likely to be included in those projects. See: aws/aws-msk-iam-auth#55

Isolating this error required a full deploy into an IRSA enabled EKS environment with debug logging in place.

Expected Behavior

Kpow prior to the addition of the AWS Glue dependency implemented IRSA correctly.

We expect to add the AWS Glue dependency to Kpow without breaking IRSA.

Current Behavior

After adding the AWS Glue dependency Kpow reverted to operating under the EKS Node Instance role:

01:30:45.217 ERROR [main] instruct.system – [:instruct.system/init :kafka/primary-cluster] instruction failed
software.amazon.awssdk.services.licensemanager.model.AuthorizationException: User: arn:aws:sts::489728315157:assumed-role/eksctl-awsmp-kpow-example-nodegro-NodeInstanceRole-RF0DW6JPCQ07/i-0dd68413a10f85f5c is not authorized to perform: license-manager:CheckoutLicense because no identity-based policy allows the license-manager:CheckoutLicense action (Service: LicenseManager, Status Code: 400, Request ID: c2546bfe-6a8e-4d0f-a635-36d07ddacad2)

Turning on debug logging shows that the WebIdentityTokenCredentialsProvider is not executing due to an error resolving the HTTP implementation:

01:30:44.137 DEBUG [main] s.a.a.a.c.AwsCredentialsProviderChain – Unable to load credentials from WebIdentityTokenCredentialsProvider(): Multiple HTTP implementations were found on the classpath. To avoid non-deterministic loading implementations, please explicitly provide an HTTP client via the client builders, set the software.amazon.awssdk.http.service.impl system property with the FQCN of the HTTP service to use as the default, or remove all but one HTTP implementation from the classpath
software.amazon.awssdk.core.exception.SdkClientException: Multiple HTTP implementations were found on the classpath. To avoid non-deterministic loading implementations, please explicitly provide an HTTP client via the client builders, set the software.amazon.awssdk.http.service.impl system property with the FQCN of the HTTP service to use as the default, or remove all but one HTTP implementation from the classpath
        at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:102)
        at software.amazon.awssdk.core.internal.http.loader.ClasspathSdkHttpServiceProvider.loadService(ClasspathSdkHttpServiceProvider.java:62)

The error is caused by conflicting http implementations brought in by:

LicenseManager or MSK (Iam Auth): ApacheHttpClient
AWS Glue: URLConnection
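
For anyone wanting to confirm which implementations are actually on their classpath, here is a minimal sketch using only the JDK. It assumes the SDK registers its sync HTTP implementations through the standard ServiceLoader provider file named after the software.amazon.awssdk.http.SdkHttpService interface; that resource name is an assumption not confirmed in this issue, and the class name HttpImplScan is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

final class HttpImplScan {
    // Assumption: sync HTTP implementations are registered under this provider file.
    static final String SERVICE_FILE =
        "META-INF/services/software.amazon.awssdk.http.SdkHttpService";

    /** Returns the distinct provider class names declared in every matching provider file. */
    static List<String> findImplementations(ClassLoader loader) {
        List<String> found = new ArrayList<>();
        try {
            Enumeration<URL> resources = loader.getResources(SERVICE_FILE);
            while (resources.hasMoreElements()) {
                URL url = resources.nextElement();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Provider files may contain '#' comments and blank lines.
                        int hash = line.indexOf('#');
                        String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
                        if (!name.isEmpty() && !found.contains(name)) {
                            found.add(name);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> impls = findImplementations(ClassLoader.getSystemClassLoader());
        if (impls.size() > 1) {
            System.err.println("WARNING: multiple AWS SDK HTTP implementations found: " + impls);
        } else {
            System.out.println("AWS SDK HTTP implementations found: " + impls);
        }
    }
}
```

Running this at startup, or from a test, would surface the conflict without needing a full EKS deployment with debug logging.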

Reproduction Steps

See minimum viable reproducer here: https://github.com/factorhouse/aws-irsa-deps-reproducer

Possible Solution

Manually set the http client implementation:

(System/setProperty "software.amazon.awssdk.http.service.impl" "software.amazon.awssdk.http.apache.ApacheSdkHttpService")

It is not clear how this setting impacts any of the libraries, but Glue/LM/MSK appear to work with that setting and IRSA roles are resumed.
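
For Java applications, a rough equivalent of the Clojure call above might look like the sketch below. The property name and the Apache service class name are taken from this issue; the HttpImplDefault class and its applyDefault method are hypothetical helpers. It only sets the property when nothing has been supplied already, so a value passed on the command line with -Dsoftware.amazon.awssdk.http.service.impl=... still wins.

```java
final class HttpImplDefault {
    static final String PROPERTY = "software.amazon.awssdk.http.service.impl";
    static final String APACHE_SERVICE =
        "software.amazon.awssdk.http.apache.ApacheSdkHttpService";

    /**
     * Sets the default HTTP service only if none was supplied already
     * (for example via -D on the JVM command line) and returns the effective value.
     */
    static String applyDefault() {
        if (System.getProperty(PROPERTY) == null) {
            System.setProperty(PROPERTY, APACHE_SERVICE);
        }
        return System.getProperty(PROPERTY);
    }

    public static void main(String[] args) {
        // Call this before any AWS client or credentials provider is created.
        System.out.println(PROPERTY + "=" + applyDefault());
    }
}
```

This must run before the first credentials provider or service client is built, otherwise the default HTTP implementation may already have been resolved.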

Additional Information/Context

No response

AWS Java SDK version used

2.18.20

JDK version used

java --version openjdk 11.0.16 2022-07-19

Operating System and version

Mac OS

@d-t-w d-t-w added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 18, 2022
@debora-ito debora-ito self-assigned this Nov 21, 2022
@debora-ito
Member

@d-t-w thank you for the detailed report.

SdkClientException: Multiple HTTP implementations were found on the classpath.

This is expected. When you have multiple HTTP clients in your classpath, the SDK doesn't know which one to use. You need to tell the SDK which one to use by (1) setting the software.amazon.awssdk.http.service.impl system property; or (2) setting the http client programmatically in the service client builder:

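import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;
// Set the HTTP client explicitly instead of relying on classpath discovery.
S3Client s3Client = S3Client.builder()
    .httpClientBuilder(ApacheHttpClient.builder())
    .build();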

(System/setProperty "software.amazon.awssdk.http.service.impl" "software.amazon.awssdk.http.apache.ApacheSdkHttpService")

It is not clear how this setting impacts any of the libraries, but Glue/LM/MSK appear to work with that setting and IRSA roles are resumed.

The Java SDK will honor this setting across all the service clients vended by the SDK. But note that the software.amazon.glue/schema-registry-serde library is not maintained by us, the Java SDK team, so I can't say how setting the "software.amazon.awssdk.http.service.impl" system property to use the Apache client will impact the rest of the library, which expects to use the UrlConnection client.

Let me know if this makes sense.

@debora-ito debora-ito added closing-soon This issue will close in 4 days unless further comments are made. and removed needs-triage This issue or PR still needs to be triaged. labels Nov 22, 2022
@d-t-w
Author

d-t-w commented Nov 22, 2022

No worries @debora-ito, but I think this issue deserves a bit more thought.

This is expected. When you have multiple HTTP clients in your classpath, the SDK doesn't know which one to use.

As a user of the AWS SDK I do not expect that adding a library will cause IRSA to stop working.

You need to tell the SDK which one to use by (1) setting the software.amazon.awssdk.http.service.impl system property; or (2) setting the http client programmatically in the service client builder:

The software.amazon.glue/schema-registry-serde library sets the UrlConnectionHttpClient programmatically.

https://github.com/awslabs/aws-glue-schema-registry/blob/ffd7452f0ea3bf1bbf52879a19436fded769667d/common/src/main/java/com/amazonaws/services/schemaregistry/common/AWSSchemaRegistryClient.java#L97

The library does not provide any way for me to configure a different http client.

The software.amazon.awssdk/auth WebIdentityTokenCredentialsProvider class uses a serviceLoader for http impl.

Presumably setting software.amazon.awssdk.http.service.impl only applies as a default when using the serviceLoader technique? To be clear, I have no idea if that is true or if setting that variable overrides any programmatic settings.

I assume setting software.amazon.awssdk.http.service.impl impacts WebIdentityTokenCredentialsProvider behaviour.

Multiple factors lead this to being a broad issue:

[1] This issue is easily triggered

The WebIdentityTokenCredentialsProvider created by the DefaultCredentialsProvider will fail if there are multiple HTTP Client implementations on the classpath.

[2] This issue is likely to be triggered

Adding any library that puts multiple HTTP implementations on the classpath will trigger this issue. E.g. software.amazon.glue/schema-registry-serde, or (randomly picking one..) org.apache.iceberg.iceberg/iceberg-aws.

[3] This issue is consequential

This issue causes IRSA to stop working.

Our product runs on EKS and requires IRSA to call the AWS License Manager as a part of AWS Marketplace integration. Adding the schema-registry-serde library causes that integration to fail.

[4] This issue is hard to identify

Setting up EKS with IRSA is a multi-step process with lots of moving parts. When it doesn't work you tend to assume that you have set your cluster up incorrectly, missed some configuration, etc. The IRSA troubleshooting pages suggest confirming OIDC configuration, IAM role trust policies, etc.

The thought that a dependency conflict within your JAR file has caused the failure isn't one that springs to mind.
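
One way to separate a cluster misconfiguration from an application-side failure is to check from inside the pod whether the IRSA environment variables were injected at all. The sketch below assumes the standard AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE variables that EKS sets on pods using an annotated service account; the IrsaEnvCheck class is a hypothetical helper and is not part of the SDK.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

final class IrsaEnvCheck {
    // Assumption: these are the variables injected into pods that use IRSA.
    static final String ROLE_ARN = "AWS_ROLE_ARN";
    static final String TOKEN_FILE = "AWS_WEB_IDENTITY_TOKEN_FILE";

    /** True when both IRSA variables are present and non-empty in the given environment. */
    static boolean irsaEnvPresent(Map<String, String> env) {
        String roleArn = env.get(ROLE_ARN);
        String tokenFile = env.get(TOKEN_FILE);
        return roleArn != null && !roleArn.isEmpty()
            && tokenFile != null && !tokenFile.isEmpty();
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        if (irsaEnvPresent(env)) {
            boolean readable = Files.isReadable(Paths.get(env.get(TOKEN_FILE)));
            System.out.println("IRSA variables are set; token file readable: " + readable);
            System.out.println("If the node role is still used, suspect the application side.");
        } else {
            System.out.println("IRSA variables are missing; check the service account setup.");
        }
    }
}
```

If the variables are present and the token file is readable, but calls are still made under the NodeInstanceRole, the cluster setup is probably fine and the credentials provider chain inside the application is the more likely culprit.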

[5] This issue is hard to debug

The AwsCredentialsProviderChain class reports the failure only in a DEBUG log line, which has very low visibility.

Finding the root cause of this issue required debug monitoring of an application in a full EKS+IRSA environment.

[6] This issue is encountered

The AWS iam-auth and serdes libraries have these open tickets:

awslabs/aws-glue-schema-registry#151
awslabs/aws-glue-schema-registry#157 (specifically the comments at the bottom)
aws/aws-msk-iam-auth#55

[7] This issue is probably more broadly applicable

My examples are from Kafka related libraries, but basically anything that introduces a second http impl will break IRSA.

[8] The workaround is imprecise

How can I know what the impact of setting software.amazon.awssdk.http.service.impl is across the different libraries in my application? Should I use UrlConnection / ApacheClient / NettyClient, etc?


Suggestions

  1. Update the log line from DEBUG to WARN to raise the visibility of this issue.
  2. Note this behaviour here: https://aws.amazon.com/premiumsupport/knowledge-center/eks-troubleshoot-IRSA-errors/
  3. Possibly suggest any project relying on IRSA should set the system property to something appropriate?

Someone with a better knowledge of the libraries than me could consider the impact of:

  1. Change the WebIdentityTokenCredentialsProvider to have its HTTP client set programmatically.
  2. Identify areas where the DefaultStsClientBuilder will fail beyond WebIdentityTokenCredentialsProvider.

Thanks! Derek.

@github-actions github-actions bot removed the closing-soon This issue will close in 4 days unless further comments are made. label Nov 22, 2022
@debora-ito debora-ito added the p3 This is a minor priority issue label Mar 20, 2023
@maximethebault

Just spent way too much time investigating an issue related to this as well. Raising visibility of the log line, particularly in the context of IRSA, sounds like a MUST to me as well.

@debora-ito debora-ito removed their assignment Aug 16, 2023
@Min3953

Min3953 commented Oct 5, 2023

Any update about this issue?
I also spent too much time debugging because of this issue.

@debora-ito
Member

@d-t-w @maximethebault @Min3953

We changed the rule when multiple implementations are found in the classpath. Now, instead of throwing an error, the SDK will choose one http client based on priority. The priority order is defined in the ClasspathSdkHttpServiceProvider. The change was released in SDK version 2.22.0.

So in this case, WebIdentityTokenCredentialsProvider will no longer fail, provided that version 2.22.0 or later is being used. This solves the root cause of this issue. Let us know what you think.
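
To illustrate the difference between the two behaviours, here is a simplified sketch. It is not the SDK's code: the HttpImplSelection class is hypothetical, the priority list is supplied by the caller, and the order used in main is only an assumption for illustration. The real order is whatever ClasspathSdkHttpServiceProvider defines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class HttpImplSelection {
    /** Old behaviour as described in this issue: more than one candidate is an error. */
    static Optional<String> chooseStrict(List<String> discovered) {
        if (discovered.size() > 1) {
            throw new IllegalStateException(
                "Multiple HTTP implementations were found on the classpath: " + discovered);
        }
        return discovered.stream().findFirst();
    }

    /**
     * Priority-based behaviour: returns the first entry of the priority list that was
     * discovered, falling back to the first discovered candidate if none is listed.
     */
    static Optional<String> chooseByPriority(List<String> discovered, List<String> priority) {
        for (String preferred : priority) {
            if (discovered.contains(preferred)) {
                return Optional.of(preferred);
            }
        }
        return discovered.stream().findFirst();
    }

    public static void main(String[] args) {
        // Illustrative names and order only; see ClasspathSdkHttpServiceProvider for the real ones.
        String apache = "software.amazon.awssdk.http.apache.ApacheSdkHttpService";
        String urlConnection =
            "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService";
        List<String> discovered = Arrays.asList(urlConnection, apache);
        List<String> priority = Arrays.asList(apache, urlConnection);

        System.out.println("Priority-based choice: " + chooseByPriority(discovered, priority));
        try {
            chooseStrict(discovered);
        } catch (IllegalStateException e) {
            System.out.println("Strict choice failed: " + e.getMessage());
        }
    }
}
```

With the strict rule, the failure surfaces inside the credentials provider chain and the chain moves on to the next provider, which is how the pod ended up on the node role. With the priority rule, a client is always selected when at least one implementation is present.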

@d-t-w
Author

d-t-w commented Feb 20, 2024

Hi @debora-ito that sounds like a good solution to me, thanks very much!

@debora-ito
Member

@d-t-w Good to know, thank you for the follow-up. Resolving this.
