Skip to content

Conversation

@lauzadis
Copy link
Member

@lauzadis lauzadis commented Oct 5, 2022

Adds support for static stability for IMDS credentials.
Uses the expireCredentialsAfter time in CachedCredentialsProvider for credentials which have an expiration time set.

Issue #

#707
#718

Description of changes

#707:

This change is required to comply with AWS SDK standards. Specifically, this will support the use of expired IMDS credentials. This is useful because in the case of service degradation in IMDS which prevents credentials refresh, AWS will allow us to keep using the expired credentials.

When we use expired credentials, the CachedCredentialsProvider will continuously try to refresh them. To solve this, a new field called nextRefresh was added to the Credentials data class. When present, this will override any expiration times, and the providers will only try to refresh the credentials if the nextRefresh time has passed.

#718:

When receiving credentials with an expiration time, set the cached credentials' expiration time to the minimum of expireCredentialsAfter and the received expiration time. This will ensure we are refreshing credentials at the rate the user specifies.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lauzadis lauzadis requested a review from a team as a code owner October 5, 2022 21:41
}

// if we have previously served IMDS credentials and it's not time for a refresh, just return the previous credentials
if (previousCredentials != null &&
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a nicer way to do this null check? Previously I had:

previousCredentials?.run {
    this.nextRefresh?.run {
        if (clock.now() < this) {
            return previousCredentials as Credentials
        }
    }
}

and was torn between the two

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you could use the safe call operator and takeIf to clean this up slightly:

previousCredentials?.nextRefresh?.takeIf { Clock.now() < it }?.run {
    return previousCredentials as Credentials
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the as Credentials cast necessary?

Copy link
Member Author

@lauzadis lauzadis Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without as Credentials, I get this error: Smart cast to 'Credentials' is impossible, because 'previousCredentials' is a mutable property that could have been changed by this time

Another option that works is adding the non-null assertion !!

previousCredentials?.nextRefresh?.takeIf { clock.now() < it }?.run {
    return previousCredentials!!
}

I will change it to that.

ex is IOException ||
ex is EC2MetadataError && ex.statusCode == HttpStatusCode.InternalServerError.value -> {
previousCredentials = previousCredentials?.copy(nextRefresh = clock.now() + DEFAULT_CREDENTIALS_REFRESH_SECONDS.seconds)
return previousCredentials ?: throw CredentialsProviderException("failed to load instance profile", ex)
Copy link
Member Author

@lauzadis lauzadis Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a nice way to remove this duplicated throw ...? (consolidating the right-half of the elvis operator and the else block)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not given that the condition cannot easily be condensed into a when branch. You can either create a helper function for it or just leave the duplicated throw.

@github-actions
Copy link

github-actions bot commented Oct 5, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

ex is IOException ||
ex is EC2MetadataError && ex.statusCode == HttpStatusCode.InternalServerError.value -> {
previousCredentials = previousCredentials?.copy(nextRefresh = clock.now() + DEFAULT_CREDENTIALS_REFRESH_SECONDS.seconds)
return previousCredentials ?: throw CredentialsProviderException("failed to load instance profile", ex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not given that the condition cannot easily be condensed into a when branch. You can either create a helper function for it or just leave the duplicated throw.

private val profileOverride: String? = null,
private val client: Lazy<InstanceMetadataProvider> = lazy { ImdsClient() },
private val platformProvider: PlatformEnvironProvider = Platform,
private var previousCredentials: Credentials? = null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness: None of the handling around this field is thread-safe, leading to potentially inconsistent results when invoking on multiple threads simultaneously. I suggest you control access to this field with a mutex lock.

Comment on lines 66 to 72
// if we have previously served IMDS credentials and it's not time for a refresh, just return the previous credentials
if (previousCredentials != null &&
previousCredentials!!.nextRefresh != null &&
clock.now() < previousCredentials!!.nextRefresh!!
) {
return previousCredentials as Credentials
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is this the correct layer for this behavior? This means we'll effectively cache the credentials at this layer and not try fetching new ones until nextRefresh. That sounds correct for a caching layer but not what I'd necessarily expect from the direct IMDS provider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this seems to duplicate the cache behavior does it not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing this would violate the guidelines for waiting 5-10 minutes between requests when the user has expired credentials and is using a direct IMDS provider, right?

This was added for the case when a user specifies a custom CredentialsProvider (i.e one consisting only of ImdsCredentialsProvider). If we remove this, then we will reach out to IMDS to get credentials for every HTTP request, even if the returned credentials are expired.

If we get expired credentials, we are supposed to wait 5-10 minutes before reaching out to IMDS again. This internal caching logic only runs when we have expired credentials (nextRefresh != null). If everything is operating normally, nextRefresh will remain null and there will be no caching.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm a little conflicted here. All of our providers work this way and would refresh on every HTTP request without a caching provider wrapping it. We even note this in some of the provider documentation.

On one hand I would sort of like to keep that clean separation, on the other I can understand your point. 🤔
@ianbotsf thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic of using cached credentials from within the ImdsCredentialsProvider is correct if we're in stability mode (because that's an implementation detail of how to use IMDS). If we're not in stability mode, I'd expect the provider to fetch fresh credentials every time getCredentials is called (just like every other provider does) and relegate regular caching to the CachedCredentialsProvider.

How complicated would that be to implement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way to tell static stability is enabled with 100% certainty, so we can't really create any logic around that:

IMDS doesn’t provide a formal flag, so the SDKs must always assume Static Stability is enabled.

One solution is to remove the nextRefresh logic from CachedCredentialsProvider entirely and leave the expired-credential caching solely in ImdsCredentialsProvider. In this scenario, when the default CachedCredentialsProvider is used and we are in static stability mode, the IMDS credentials will always appear as expired. To fetch credentials, we walk the chain of CredentialsProviders, and when we reach ImdsCredentialsProvider, we will branch into this caching logic and return the credentials.

One downside is that we will have to check the entire chain, which could be costly. But the added overhead could be fixed with this TODO where we would cache the credential provider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we know we're in stability mode the first time we call IMDS and get either expired credentials or some type of service error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would be ok with that, it isolates all of this into the IMDS provider and if that is the only downside then we can fix the other TODO as you mentioned.

Which way are you leaning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ianbotsf We could but it would not be 100% certain. For example, there may have been a service error for some other reason (i.e there's an availability issue, but static stability is not enabled yet)

@aajtodd I am leaning towards the idea I proposed. One more possible downside is that we are adding a whole new field to Credentials (nextRefresh) just to be used in ImdsCredentialsProvider. Would it make sense to keep this as a member in ImdsCredentialsProvider instead of in the Credentials data class?

Copy link
Contributor

@aajtodd aajtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall. Couple questions and callouts (in addition to Ian's callout on thread safety)

import aws.smithy.kotlin.runtime.util.Platform
import aws.smithy.kotlin.runtime.util.PlatformEnvironProvider
import aws.smithy.kotlin.runtime.util.asyncLazy
import java.io.IOException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: JVM only (not sure why this even compiles)

Comment on lines 66 to 72
// if we have previously served IMDS credentials and it's not time for a refresh, just return the previous credentials
if (previousCredentials != null &&
previousCredentials!!.nextRefresh != null &&
clock.now() < previousCredentials!!.nextRefresh!!
) {
return previousCredentials as Credentials
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this seems to duplicate the cache behavior does it not?


if (creds.expiration!! < clock.now()) {
// let the user know we will be using expired credentials
logger.info { STATIC_STABILITY_LOG_MESSAGE.format(DEFAULT_CREDENTIALS_REFRESH_SECONDS / 60) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: It looks like String.format is JVM only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also probably log at warn level?

PROVIDER_NAME,
)

if (creds.expiration!! < clock.now()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is this condition sufficient or should it include that previousCredentials wasn't null as well? I suppose it would be odd that IMDS would be returning expired creds so maybe this is ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to receive expired credentials from IMDS when it's in static stability mode:

Test 1: SDK can use IMDS credential provider if first IMDS call returns expired credentials.

So you are proposing the conditional to be:

if (creds.expiration!! < clock.now() && previousCredentials != null)

I'm wondering what you think we should do if the incoming credentials are expired but previousCredentials is null? This can happen if we start an SDK client during an outage -- we will have no previousCredentials but the incoming credentials will be expired

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can happen if we start an SDK client during an outage

Hmm good point. That seems sufficient then

PROVIDER_NAME,
)
is JsonCredentialsResponse.SessionCredentials -> {
var creds = Credentials(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I think we can restructure this to remove the mutable creds variable and the copy, something like:

val nextRefresh = if (resp.expiration != null && resp.expiration < clock.now()) {
     logger.warn(...)
     clock.now() + DEFAULT_CREDENTIALS_REFRESH_SECONDS.seconds
}else {
    null
}

return Credentials(..., nextRefresh = nextRefresh)
      .also { previousCredentials = it }

import io.mockk.spyk
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.test.runTest
import java.net.SocketTimeoutException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: JVM only

val providerCreds = source.getCredentials()
if (providerCreds.expiration != null) {
ExpiringValue(providerCreds, providerCreds.expiration!!)
val expirationTime = providerCreds.nextRefresh ?: minOf(providerCreds.expiration!!, (clock.now() + expireCredentialsAfter))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed this in review: Missing test for this

when {
ex is IOException ||
ex is EC2MetadataError && ex.statusCode == HttpStatusCode.InternalServerError.value -> {
previousCredentials = previousCredentials?.copy(nextRefresh = clock.now() + DEFAULT_CREDENTIALS_REFRESH_SECONDS.seconds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another miss in my initial review.

Question: Should this be the DEFAULT_CREDENTIALS_REFRESH_SECONDS? I see our guidance is random between 5-10 minutes which would attempt to refresh a little more aggressively.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the guidance is a random time in the 5-10 minute interval, "unless the SDK has a more appropriate default already defined"

I figured this DEFAULT_CREDENTIALS_REFRESH_SECONDS would be good to use since it's already well-defined but I can make it align closer to the guidance if you think it should be. To be fair the guidance gives some conflicting messages -- is it a "MUST" or "SHOULD"?

If a refresh returns expired credentials, an SDK MUST wait between 5-10 minutes before attempting another refresh.

The delay until the next refresh SHOULD be a uniformly random time in the interval of 5-10 minutes, unless the SDK has a more appropriate default already defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either is fine I was mostly just asking if we think we should be more aggressive here or not

@github-actions
Copy link

github-actions bot commented Oct 6, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

github-actions bot commented Oct 6, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

github-actions bot commented Oct 6, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

github-actions bot commented Oct 6, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@lauzadis lauzadis marked this pull request as ready for review October 7, 2022 18:34
@github-actions
Copy link

github-actions bot commented Oct 7, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

github-actions bot commented Oct 7, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

github-actions bot commented Oct 7, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

1 similar comment
@github-actions
Copy link

github-actions bot commented Oct 7, 2022

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@github-actions
Copy link

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

@lauzadis lauzadis requested review from aajtodd and ianbotsf October 10, 2022 16:47
private const val CODE_ASSUME_ROLE_UNAUTHORIZED_ACCESS: String = "AssumeRoleUnauthorizedAccess"
private const val PROVIDER_NAME = "IMDSv2"

public expect class SdkIOException : Exception // FIXME move this to the proper place when we do the larger KMP Exception refactor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: This can be internal or private

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I switch this to internal or private, I get the following error in the JVM definition of SDKIOException:

Actual typealias 'SdkIOException' has no corresponding expected declaration The following declaration is incompatible because visibility is different:
internal final expect class SdkIOException : Exception /* = Exception */ defined in aws.sdk.kotlin.runtime.auth.credentials in file ImdsCredentialsProvider.kt

I played around with it for a while and this expect seems to be required to be public

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expect and actual visibility has to match

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were both matching in this case -- I tried all combinations and it does not compile when both are set to internal. This seems to be a common issue: https://youtrack.jetbrains.com/issue/KT-37316

It can be fixed by adding @Suppress("ACTUAL_WITHOUT_EXPECT") decorator to the JVM implementation. I will go that route unless anybody is against it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see, because the type aliased actual differs in visibility. The suppression suggested seems fine then

private val profileOverride: String? = null,
private val client: Lazy<InstanceMetadataProvider> = lazy { ImdsClient() },
private val platformProvider: PlatformEnvironProvider = Platform,
private var previousCredentials: Credentials? = null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Does this really need to be a constructor parameter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was nice for testing purposes, but not necessary. I will refactor it and update the test cases

)

// call getCredentials 3 times
for (i in 1..3) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: repeat(3) { ... }

} catch (ex: Exception) {
throw CredentialsProviderException("failed to load instance profile", ex)
when {
ex is SdkIOException ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 85-92 are nearly same as 99-107, we could probably consolidate

e.g.

val profileName = try {
    ... 
} catch(ex :Exception) {
    return useCachedCredentials(ex) ?: throw CredentialsProviderException("failed to load instance profile", ex)
}


val payload = try {
    ... 
} catch(ex :Exception) {
    return useCachedCredentials(ex) ?: throw CredentialsProviderException("failed to load credentials", ex)
}



private fun useCachedCredentials(ex: Exception): Credentials? = when(ex) {
    is SdkIOException || is EC2MetadataError && ex.statusCode == HttpStatusCode.InternalServerError.value -> {
        mu.withLock {
            previousCredentials?.run { nextRefresh = clock.now() + ... }
            previousCredentials        
         }
     }
     else -> null
}

private val clock: Clock = Clock.System,
) : CredentialsProvider, Closeable {
private val logger = Logger.getLogger<ImdsCredentialsProvider>()
private val mu = Mutex()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You might clarify what this mutex protects:

e.g.

// protects previousCredentials + nextRefresh
private val mu = Mutex()

@github-actions
Copy link

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

Copy link
Contributor

@ianbotsf ianbotsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me!

Comment on lines 149 to 150
previousCredentials?.run { nextRefresh = clock.now() + DEFAULT_CREDENTIALS_REFRESH_SECONDS.seconds }
return previousCredentials
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I believe this can be simplified to:

previousCredentials?.apply { nextRefresh = clock.now() + DEFAULT_CREDENTIALS_REFRESH_SECONDS.seconds }

@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot E 11 Security Hotspots
Code Smell A 5 Code Smells

No Coverage information No Coverage information
38.2% 38.2% Duplication

@github-actions
Copy link

A new generated diff is ready to view: __generated-main...__generated-feat-static-stability

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants