Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add handling for mfa_serial in ~/.aws/config #1359

Closed
wants to merge 7 commits into from

Conversation

jelford
Copy link

@jelford jelford commented May 2, 2022

Motivation and Context

Fixes: awslabs/aws-sdk-rust#527

Description

Adds trait and configuration machinery for supplying MFA tokens. The goal is to fit into the AssumeRole workflow in a similar way to what happens in boto3.

The default provider will just log/return an error if an MFA token is requested, but users can plug their own ProvideMfaToken implementations (e.g. to read from stdin). There are a few outstanding things I know about for the PR to be ready:

  • better testing
  • doc strings

But I'd also like to get some feedback before addressing those on the general shape of the PR (I think it's about in line with what was discussed in awslabs/aws-sdk-rust#527), and I wasn't sure on a few code structure things - in particular:

  • is profile::credentials::exec::AssumeRoleProvider::credentials() the right place to be calling out to users? One thing that's a little awkward about this is that by the time we get to this point in the implementation, we're already being wrapped in a (rather short) timeout for the cache... which might be a bit of a smell, if we're then pausing for user input?
  • is aws_config the right place for the new mfa_provider module? I notice that ProvideCredentials and friends come from aws_types - but I'm not sure that the MFA code is really "shared" between services in a way that would qualify it for the aws_types crate?
  • is aws_config::profile the right place? Maybe it should go beneath profile::credentials ? I'm not totally clear on the structure here.
  • general naming stuff - I've broadly tried to keep things similar to e.g. ProvideCredentials, is that about right?

One thing that I proposed in awslabs/aws-sdk-rust#527 was to add a default ProvideMfaToken implementation that would read the code from stdin, analogously to how boto uses getpass. @Velfi suggested over there that it would probably be worth an RFC for something like this, and having done a rough implementation for testing, I think they were right - there are a few choices to be made for implementation (I don't think Rust has a convenient getpass that we can reach for), so a more conservative approach seems appropriate. I would think it would be convenient at some point to have parity with other SDKs here, but this initial implementation at least gives us the ability to bring our own (and maybe gives the ecosystem space to play around and find the right / minimal way of doing it).

Here's a sample implementation of ProvideMfaToken that grabs the token from stdin using the atty and rpassword crates:

impl ProvideMfaToken for ProvideMfaTokenFromStdin {
    fn mfa_token(&self, serial: &str) -> ProvideMfaFuture {
        if !atty::is(atty::Stream::Stdin) {
            return ProvideMfaFuture::ready(
                Err(MfaTokenFetchError::provider_error(ProvideMfaTokenFromStdinError::NotATty)));
        }
        match rpassword::prompt_password(format!("Enter MFA code for {}: ", serial)) {
            Err(e) => {
                ProvideMfaFuture::ready(Err(MfaTokenFetchError::provider_error(e)))
            }
            Ok(t) => {
                ProvideMfaFuture::ready(Ok(MfaToken::from(t)))
            }
        }
    }
}

Testing

  • to add: automated testing
  • Here's a sample repository that's had these changes patched in via Cargo.toml overrides (assuming smithy-rs is checked out in a neighbouring repository): https://github.com/jelford/aws_mfa_token_provider. That's what I've been using for exploratory testing, and have verified that:
    • A bad MFA code (randomly typed letters) results in a sensible error percolating through from the API backend
    • A good MFA code successfully authenticates
  • When not specifying a custom ProvideMfaToken implementation, a reasonably friendly error comes back to the user.

Checklist

  • I have updated CHANGELOG.next.toml if I made changes to the smithy-rs codegen or runtime crates
  • I have updated CHANGELOG.next.toml if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- The default provider will simply log/return an error if an MFA token is requested, but users can plug their own ProvideMfaToken implementations (e.g. to read from stdin).
@jelford jelford requested a review from a team as a code owner May 2, 2022 23:25
…er and ClientConfiguration

The structs already have non-public fields so non_exhaustive is redundant
Copy link
Collaborator

@jdisanti jdisanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for contributing this! ❤️

The overall structure looks good to me. You raise a very good point about the lazy cache timeout potentially timing out the MFA user input use-case. We definitely need to think more about that, and maybe there is some structural change that needs to happen to ProvideCredentials to support this use-case (something the cache can be aware of and exclude from its timeout). I don't have any good suggestions there yet and need to mull on it for a bit.

is aws_config the right place for the new mfa_provider module?

This is a great question! The aws-types crate is meant to have as stable of an API as possible, while aws-config provides implementations for aws-types and is intended to be less stable. I think I would answer this question with: is the AssumeRoleProvider exposed in aws-types? If not, then the MFA provider should probably live in aws-config.

is aws_config::profile the right place?

Naively, it seems like it should go in the same place organizationally as the AssumeRoleProvider, but I'm not certain if there are other credential providers that have similar MFA features.

general naming stuff

This is always tough. In general, if it's related to the AssumeRoleProvider, we should take the names from STS's AssumeRole operation. However, those names are less meaningful in the context of the larger profile credentials provider, so we may want to name them differently there.

@jdisanti
Copy link
Collaborator

jdisanti commented May 3, 2022

Thinking about the cache timeout issue more, I'm wondering if ProvideMfaToken should instead by sync with the expectation that the application either asks for an MFA token ahead of time, or continuously refreshes a MFA token in a separate thread. The challenge with that, I think, is with the mfa_serial entry in the profile credentials file since anything external wouldn't have easy access to that.

It seems like an application requiring MFA is going to need to be more aware of the overall MFA input process. For example, what is the expected behavior if the MFA token expires for the TTY input use-case? Would the user be prompted for another token? Would that interrupt something else in the TTY?

Or thinking about a GUI application that wants to run an AWS command on the user's behalf. If a dialog prompt came up asking for MFA, and the user clicked cancel, would there need to be some way to communicate that cancellation action back so that the application can handle it accordingly rather than assuming some unrecoverable error occurred?

Anyway, that's my brain dump for now. Will keep thinking about this.

@jelford
Copy link
Author

jelford commented May 3, 2022

Just a quick note on the topic of MFA refreshes - I think what you're saying about managing refreshes sounds very reasonable on the face of it, but at the same time, I note that boto has chosen to punt on this one.

It's hard to see how that could turn out well in a long-lived application like an IDE, though, so the topic is worth a little more thought. I wonder whether the right approach to deal with caching might be to find a way to push it "down" a level, rather than introducing a kind of complex interaction where it knows that there are certain things it just shouldn't cache (feels like a layer / encapsulation violation). I'll confess I didn't look hard at the current caching setup (I just noticed it when I hit a timeout).
I think this is not right - see comment below

I'm not sure I understand this comment:

I'm wondering if ProvideMfaToken should instead by sync

I'm not wedded to it being async by any means, but I think maybe it's orthogonal to the rest of the issue? (I mean that the next part of the sentance, asking the user ahead of time, can work just as well if it stays async, I think?)

@tschuett
Copy link

tschuett commented May 3, 2022

GetSessionToken also supports MFA.

@jdisanti
Copy link
Collaborator

jdisanti commented May 3, 2022

I think we have a potential solution for the timeout issue. What we can do is modify the ProvideMfaToken trait to look something like this:

pub trait ProvideMfaToken: Send + Sync + Debug {
    fn mfa_token(&self, mfa_serial: &str) -> Result<MaybeMfaToken, MfaTokenFetchError>;
}

With an enum that allows for later, timeout free, resolution of the MFA token:

#[non_exhaustive]
pub enum MaybeMfaToken {
    // The names definitely need some thinking...
    // This variant states that we believe the token will become available before the timeout
    NearTerm(future::ProvideMfaToken),
    // And this variant says we know the token likely won't be available before the timeout
    NotReady(future::WhenReady),
}

When the mfa_token function returns NotReady, we can have the AssumeRoleProvider return a CredentialsError::ProviderError with a boxed cause that has NotReady's future inside it. Then the LazyCachingCredentialsProvider can downcast the cause when it gets a ProviderError to see if it's a case where something needs to get resolved first.

The nice part about this approach is that it:

  1. Allows us to break up the implementation so that a quicker solution can be implemented first for those that don't anticipate running into the cache timeout. By merely omitting the NotReady variant from the enum, we get half of the functionality without worrying about a breaking change later.
  2. It's non-committal on API changes. Everything is contained in aws-config, the crate with a less stable API than aws-types.
  3. We could potentially feature-gate it behind something like mfa-experimental to collect more feedback before deciding to stabilize it.

Downsides are:

  1. While NotReady isn't implemented, the API is going to be obtuse since implementers will need to wrap the return value in this one-variant enum.
  2. The downcast examination of cause in CredentialsError::ProviderError is not great. But if we need to change approach, we won't have a dead variant in CredentialsError (which is in aws-types) this way.

I'm definitely open to more ideas, but I think overall we need an RFC to examine this deeper (if for no other reason than to answer what went into our decision making). If that's something you're interested in tackling, I'm happy to help out!

@jelford
Copy link
Author

jelford commented May 7, 2022

Hi @jdisanti , first of all thanks for the thoughtful response on this caching / timeouts issue. I had hoped to have a bit more time on this during the week but it doesn't look like that's panned out. I'm happy in theory to look at an RFC, just with the caveat that my latency on discussions is going to be quite high over the next couple of weeks - I don't think that's a huge deal since this wasn't being worked on before (?), but if folks are keen to make progress here then I don't want to be a bottleneck.

I have two general thoughts on this topic before thinking harder about solutions though (I'm reaching for ways for it not to be a problem in the first place - I wonder whether we're over-complicating things?):

  1. In cases where the application knows that it's likely to experience long delays in getting an MFA code, it might be acceptable just to bump the default cache-load timeout, as I've done in my own testing. Here's how I set up my credentials:

    credentials::DefaultCredentialsChain::builder()
                .profile_name(&args.profile)
                .mfa_token(ProvideMfaTokenFromStdin())
                .load_timeout(Duration::from_secs(120))
                .build()
                .await;
    

    I'm not super keen on that workflow because it requires the application to have some idea that user interaction is going to be involved - but, since implementing ProvideMfaToken is already the application's responsibility (for now...), that might not be such an unreasonable ask. The line of thinking here is:

    • The application already has to have some way to interact with the user
    • The application already has to deal with the case that auth will fail for an indefinite period / need to be retried because the user is not attentive
    • ... therefore, the application will need its own way to handle this workflow in general (for commandline / batch-oriented apps, "handle" probably just means "crash"), and already needs to think about appropriate timeouts.

    So, the "do nothing" option is to rely on the application widening the credentials cache load timeout. Do you think there's any mileage there?

  2. I might be reading this wrong, but I think that in botocore, the way the cache wrapper works is that it doesn't enforce timeouts at the cached-load level at all. Instead, timeout decisions are delegated to whatever's going on in the underlying credentials loader (so, boto's AssumeRoleCredentialFetcher will have the normal set of timeouts applied when it actually sends a request out to sts, by virtue of going through the normal botocore.client._make_api_call flow, rather than because of anything enforced by the caching layer).

So, from the looks of it, the timeout in LazyCachingCredentialsProvider is an extra layer of timeout. If it was removed (or the application widens it), we would still have (network) timeout enforced by virtue of using .sts_client in AssumeRoleProvider.

Overall, I do think the botocore approach is probably about right actually - or at least, a very wide timeout set by the application at the cache level seems acceptable, given that there are timeouts at the "next level down" for anything that really needs them. That gives me a little confidence that the model suggested in 1 ("just widen the timeout on the cache") is, maybe, okay? Maybe one could go even further and allow the application to disable the cache's load timeouts entirely.

One missing piece in the "just widen the timeout on the cache" model is: the cache timeout is what's being configured with a call to DefaultCredentialsChain::load_timeout(...). I don't think there's currently a good user- accessible way to configure the timeout on the underlying sts_client, though maybe I am missing it.

@jelford
Copy link
Author

jelford commented May 7, 2022

So, setting aside the "just widen the timeout" idea in my previous for the time being, and engaging with the idea of having the ProvideMfaToken implementation provide some feedback to say "I need longer".

I think the basic idea of using a boxed trait to back-channel information back up to the LazyCachingCredentialsProvider is neat, and I like that this could all be kept local to aws-config, at least while API kinks are ironed out. I think MaybeMfaToken can work nicely, but there is one thing that troubles me about it: the whole reason it's needed is that the LazyCachingCredentialsProvider is imposing a timeout when it doesn't have the information to know whether a timeout is appropriate.

I wonder whether there's a way we can take your idea a step further; instead of:

  1. Call the provider
  2. Let it fail
  3. Inspect the error; some errors are special
  4. ... handle the special case

Can we say directly that "some providers are special", and skip the timeout dance entirely, for certain providers?

  1. Inspect the ProvideCredentials
  2. If the ProvideCredentials comes with its own timeout preference, use that
  3. Call the provider with whatever timeout logic was determined in step 2
  4. No change to error handling

This would involve a new trait, something like this:

/// `ProvideCredentialsWithoutTimeout` implementations are not subject to the normal timeout
/// rules imposed by the `LazyCachingCredentialsProvider`.
pub(crate) trait ProvideCredentialsTimeoutCustomizer {
    /// If the provider returns its own timeout, this is what will be used in `LazyCachingCredentialsProvider`
    fn provider_timeout_override(self) -> Option<Duration>;
}

And then in LazyCachingCredentialsProvider::provide_credentials, the timeout is determined by:

let load_timeout = if let Some(timeout_customizer) = (loader.as_ref() as &dyn Any).downcast_ref::<dyn ProvideCredentialsTimeoutCustomizer>() {
    // Defer to the provider
    timeout_customizer.provider_timeout_override().unwrap_or(self.load_timeout)
} else {
    // As before
    self.load_timeout
};

The tricky part is that there's some plumbing required to make LazyCachingCredentialsProvider's loader into an Any - I think it can be done without external API changes, but it's definitely a bigger lift than your path of using the error as a back-channel, since errors already have downcast support.

I guess one final possibility that doesn't require any downcasting hijinks is to add something like the provider_timeout_override() method in ProvideCredentials directly, with a default implementation that returns None. I suspect that's not the right thing, since it forces ProvideCredentials to be concerned with a specific use-case (a caller wants to supply their own time-out logic, but doesn't know what the time-out should be), and whilst it doesn't cause API breakage, it doesn't feel like the best-possible API, so it's probably not "ready" for aws-types (and removing it would be an API break).

If you let me know your thoughts, I'd be happy to take a stab at drafting an RFC some time over the next 2 weeks to bring all these three paths together into a coherent proposal. I find the angle that there's no timeout around this in boto quite persuasive, though presumably timeout was added to the cache layer for a good reason(!).

This path doesn't use the ProvideMfaToken trait, since parameters are expected to be provided to AssumeRoleProviderBuilder up-front (c.f. region is supplied directly, rather than as a ProvideRegion).
@jelford
Copy link
Author

jelford commented May 7, 2022

GetSessionToken also supports MFA.

@tschuett thanks for the note. I'm not sure whether you're suggesting a change here, or just pointing out that one can use this API as-is, since it accepts .token_code() and .serial_number() ?

@tschuett
Copy link

tschuett commented May 7, 2022

There was a discussion whether AssumeRole is the only provider using MFA.

The Rust people use it:
https://github.com/rust-lang/simpleinfra/blob/master/aws-creds.py

@tschuett
Copy link

tschuett commented May 7, 2022

If every implementation of ProvideCredentials is caching, then could hide the timeout.

@jelford
Copy link
Author

jelford commented May 8, 2022

There was a discussion whether AssumeRole is the only provider using MFA.

The Rust people use it: https://github.com/rust-lang/simpleinfra/blob/master/aws-creds.py

Ah, got it, thanks. As it happens, what I was trying to implement against the Rust SDK when I stumbled into this was extremely similar to that tool, so good to know!

@tschuett
Copy link

tschuett commented May 9, 2022

If you put a MFA protected bucket policy on your S3 bucket, then AssumeRole is insufficient.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html#example-bucket-policies-use-case-7

@jdisanti
Copy link
Collaborator

@jelford - You raise some very good points, and the example of boto separating the concerns of caching and timeout is particularly intriguing. I'm definitely open to that approach so long as we figure out how that plays into the default credential provider chain and its configuration.

Can we say directly that "some providers are special", and skip the timeout dance entirely, for certain providers?

That was something I considered before arriving at the boxed back-channel error approach, but the API change to the ProvideCredentials trait is a commitment I'm not so confident about in the long term. It looks like you might be getting around that by having the additional information in a separate trait and accessing it via downcasting though. If we find a way to successfully pull off that downcasting, then that seems like another workable approach.

The tricky part is that there's some plumbing required to make LazyCachingCredentialsProvider's loader into an Any - I think it can be done without external API changes, but it's definitely a bigger lift than your path of using the error as a back-channel, since errors already have downcast support.

Yeah, I see what you mean. I spent a bit of time playing around with it in the Rust Playground and wasn't able to get from an Arc<dyn SomeTrait> into a &dyn OtherTrait. It seems like it should be possible, but it's not immediately obvious how to do it.

presumably timeout was added to the cache layer for a good reason

I don't recall there being a good reason, but @rcoh might know of one.

If you let me know your thoughts, I'd be happy to take a stab at drafting an RFC some time over the next 2 weeks to bring all these three paths together into a coherent proposal. I find the angle that there's no timeout around this in boto quite persuasive, though presumably timeout was added to the cache layer for a good reason(!).

This sounds great! I think an RFC exploring the different solutions and their pros and cons will help a lot with deciding which we want to go with. I think the biggest snag with the separation of cache and timeout will be figuring out exactly how it works in the default chain. There are lots of questions that come to mind, such as: what are the defaults? Is the timeout configurable for each individual provider in the chain? Or does the timeout become an implementation detail of each individual provider? Are there API changes? And so on...

Thanks for the great discussion so far!

@jelford jelford closed this Aug 21, 2022
@jelford
Copy link
Author

jelford commented Aug 21, 2022

Closing this as I'm not currently actively working on it. I still want to pursue MFA token support, but as discussed, there should probably be an RFC for any changes around how timeouts are handled, and I notice that RFC-0014 has landed since I was first looking at this a few months ago, so I think it's a good idea to pause and digest what's gone on there before proceeding.

I would like to come back to this, but spare time to work on things outside of "work" is at a premium right now. I'd be happy to chat through if anyone is eager to make progress here and wants to pick it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add MFA support when profile assumes role
4 participants