Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh credentials for assume-role-with-web-identity calls #2641

Closed
danielvdao opened this issue Jan 28, 2022 · 14 comments
Closed

Refresh credentials for assume-role-with-web-identity calls #2641

danielvdao opened this issue Jan 28, 2022 · 14 comments
Labels
feature-request A feature should be added or improved.

Comments

@danielvdao
Copy link
Contributor

danielvdao commented Jan 28, 2022

Is your feature request related to a problem? Please describe.
When using a DynamoDB client, AssumeRoleWithWebIdentity calls happen on EKS pods to grab a AWS token using the STS client. In a Rails app, this solution can be subverted by initializing a singleton client instance on boot using the initializer. However there's an expiration time tied to these tokens and if a cluster has a lot of pods -- then those clients are going to spike in latency whenever it makes those requests to re-fetch the token since it has to make the STS client call again.

Describe the solution you'd like
In the Aws::DynamoDB::Client or any AWS client, there should be an easy way to specify refresh.

Describe alternatives you've considered
Currently I'm exploring if there's a way to implement a refresh mechanism myself using Threads and having it run in the same process that's serving web requests.

@danielvdao danielvdao added the feature-request A feature should be added or improved. label Jan 28, 2022
@danielvdao
Copy link
Contributor Author

As a reference, this issue has been surfaced in the Python lib: boto/boto3#2345 and also referenced here. It'd be nice to have an agreed upon way to do this.

@alextwoods
Copy link
Contributor

You can mange the refresh timing/scheduling yourself by calling the refresh! method on the credentials - because the credentials don't lock when they are not "near" ( < 5 minutes) from expiration, this won't cause other threads that are using those credentials to block.

client = get_dynamodb_client # get your client
credentials = client.config.credentials # this should be the AssumeRoleCredentials
credentials.refresh! if credentials.respond_to? :refresh! # in case these are not RefreshingCredentials

@danielvdao
Copy link
Contributor Author

@alextwoods let me give that a shot and get back to you :-) then for more advice.

@danielvdao
Copy link
Contributor Author

@alextwoods what would you say to doing something like:

class AwsCredentialRefresher
  def self.init_refresher!(credentials)
    # Refresh the client every X minutes in a background thread.
    wait_limit = 600

    Thread.new do
      loop do
        Thread.current.name = "AWS::DynamoDB::Client.config.credentials#refresh!"
        time_left = credentials.expiration - Time.current

        # Check if < 10 minutes left because SDK refreshes if < 5 minutes,
        if time_left < wait_limit
          credentials.refresh! if credentials.respond_to?(:refresh!)
        else
          sleep(wait_limit)
        end
      end
    end
  end
end

and in an initializer, do something like:

DYNAMO_DB_CLIENT = Aws::DynamoDB::Client.new

AwsCredentialRefresher.init_refresher!(DYNAMO_DB_CLIENT.config.credentials)

We're running on Rails 6 with a Puma webserver.

@alextwoods
Copy link
Contributor

That seems pretty reasonable overall - I'm not sure what your environment is exactly - but I'd potentially consider a check on the credentials before kicking off the thread: ie, check if its a RefreshingCredentials. You're currently checking the credentials.expiration which would fail for non-refreshing credentials.

@danielvdao
Copy link
Contributor Author

danielvdao commented Jan 28, 2022

@alextwoods would you be opposed to me trying 🙂 to add a method in RefreshingCredentials that checks to see if an options for auto_refresh: true is present and in the initialize method, call if it true?

Something like:

    def initialize(options = {})
      @mutex = Mutex.new
      refresh
      schedule_refresher! if options[:auto_refresh]
    end


    def schedule_refresher!
      wait_limit = 600 # maybe a constant

      Thread.new do
        loop do
          sleep(wait_limit)

          time_left = @credentials.expiration - Time.current
  
          if time_left < wait_limit
            @credentials.refresh!
          end
        end
      end
    end

I'm wondering if one can just pass the client with a parameter of Aws::DynamoDb::Client.new(auto_refresh: true). It seems like that's possible since the provider chain is initialized as part of the plugin architecture here and calls options.merge(config: @config). So it'd have to traverse the config hash to grab auto_refresh: true

@danielvdao
Copy link
Contributor Author

danielvdao commented Jan 28, 2022

Just to be clear (realize that I wasn't), by trying I meant opening a pull request 😅🙇🏽‍♂️

@alextwoods
Copy link
Contributor

I'm not opposed to providing some utility to help with the refreshing of credentials - however, I'm concerned about adding it as a parameter (even default disabled) that would get used as pat of the default provider chain. Spawning threads for credential refreshing could be unintended/surprising for users - especially those that are creating a large number of clients.

@alextwoods
Copy link
Contributor

Just another thought - what if the refresh when credentials were close to expiration (ie, 5 minutes before expiration) but still valid was async? That is - the current request would use the existing, non expired credentials, but kick off a thread that would refresh the credentials in the background?

@danielvdao
Copy link
Contributor Author

danielvdao commented Jan 31, 2022

what if the refresh when credentials were close to expiration (ie, 5 minutes before expiration) but still valid was async?

Do you mean something here where we do something like:

def expiration
  Thread.new { refresh_if_near_expiration }
  @expiration
end

def refresh_if_near_expiration
  # if near_expiration? <--- I don't think this check is needed twice
  # and actually is a bit confusing -- see my long accidental tangent below, haha.
  @mutex.synchronize do
    refresh if near_expiration?
  end
end

I can see the benefit to doing something like that then is that the process checking credentials kicks off a background thread to refresh credentials instead of actually calling the auth API in-line. I think it would help with preventing a lock occurring on the main thread that's executing the credentials code and relieve an external network call 🤔

EDIT: Actually scratch everything below, but now I realize... we may be checking #near_expiration? twice unnecessarily? 😂 sorry I didn't read the latter half of refresh if near_expiration? 🤦🏽‍♂️ , sometimes Ruby eludes me with its inline if checks.


Actually @alextwoods I had a question now that I am looking at the code (assuming you are more familiar here 😅) - are we 100% sure it's thread-safe? It looks like we are potentially invoking multiple calls to refresh the credential token and there's a potential check-then-act race condition.

  1. We acquire the mutex here to call #refresh on sub-classes.
  2. However, we do the checking here to see if it's #near_expiration?

Here's a timeline where I'm wondering (could be wrong) if there's a bug. Assuming t1 and t2 are two threads operating on with an AWS client instance.

  1. Thread 1 (t1) checks near_expiration? and it returns true
  2. Thread 2 (t2) checks near_expiration? and it returns true
  3. t2 acquires mutex and locks critical section to update the @expiration
  4. t2 releases mutex
  5. t1 acquires mutex and locks critical section to update @expiration. though unnecessarily.

Basically -- are we updating @expiration and making refresh calls unnecessarily? I'd imagine the mutex should be acquired before checking, otherwise we have a race condition 🤔

@alextwoods
Copy link
Contributor

alextwoods commented Feb 1, 2022

Yeah - I recently had to dig into this while looking at another, related issue, but I do believe the code as is is correct (ish *, I think there is another related threading weirdness).

The reason the near_expiration? is checked before calling the mutex is to avoid locking as much as possible, when we aren't close to expiration. However, once that is true and we lock, we check near_expiration? again: refresh if near_expiration?). The two calls to near_expiration? look a bit silly in code, but allow us to avoid the cost of locking when we aren't close to expiration and avoid calling refresh multiple times.

So that means in your step 5, t1 will acquire the mutex, but will NOT call refresh unnecessarily. EDIT: Saw you noticed this as well, but leaving the explanation for posterity.

However, there is the possibility of another weird ordering:

  1. Thread 1 checks near_expiration and it returns false. T1 does NOT acquire the mutex
  2. Thread 2 checks near_expiration and it returns true.
  3. T2 acquires mutex and calls refresh
  4. T1 returns @credentials. NOT locked, so its unclear whether this will be the old or the new credentials.
  5. T2 releases the mutex.

However, I don't believe the @credentials can be only partially updated at step 4 (they'll either be the valid old creds or also valid new creds) - its just unpredictable which t1 may end up with.

@danielvdao
Copy link
Contributor Author

danielvdao commented Feb 1, 2022

Ah yeah, the optimization for lock fetching makes sense. As for old/new value -- hard to say 😂 (my Ruby threading knowledge is still getting there) my understanding is that there's a potential of Ruby caching whereby one needs to acquire a mutex for the most updated value of a critical variable, but in this case I imagine it's fine since we refresh near expiration.

I think the visibility section of this article might explain why a stale value gets returned sometimes.

Regarding implementation I can try to open a PR soon (should be within a few days) to see if I can get the async / Thread idea up.

@danielvdao
Copy link
Contributor Author

PR here - #2642

@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved.
Projects
None yet
Development

No branches or pull requests

2 participants