ECSTaskCredentials refreshes too late #2498

Closed
cfbao opened this issue Dec 11, 2022 · 10 comments
Labels
bug (This issue is a bug.), credentials, p1 (This is a high priority issue), queued


cfbao commented Dec 11, 2022

Describe the bug

ECSTaskCredentials by default has PreemptExpiryTime set to zero (as defined in RefreshingAWSCredentials). This causes errors when one uses RDS/Aurora IAM authentication with RDSAuthTokenGenerator in an ECS task:

  • ECSTaskCredentials caches the credentials until the very end of their lifetime (because PreemptExpiryTime is zero)
  • near the end of the creds' lifetime (say <1 sec), RDSAuthTokenGenerator.GenerateAuthToken is called with fallback credentials, and returns an auth token with a nominal expiry time of 15 minutes.
  • 2 seconds later, we try to create a new DB connection with the generated auth token and get an authentication error from the DB, because the signing credentials have expired.
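
To make the timing in the list above concrete, here is a minimal sketch of a simplified model, not SDK code. It assumes, as the failure described here suggests, that a token is only usable while both its own 15-minute window and the signing credentials are valid. `EffectiveTokenExpiry` and the timestamps are hypothetical and exist only for illustration.

```csharp
using System;

// Simplified model (not SDK code): a presigned RDS auth token is usable only
// while BOTH its nominal window and the signing credentials are still valid.
static DateTime EffectiveTokenExpiry(DateTime signedAt, TimeSpan nominalLifetime, DateTime credentialExpiry)
{
    DateTime nominalExpiry = signedAt + nominalLifetime;
    return nominalExpiry < credentialExpiry ? nominalExpiry : credentialExpiry;
}

// Hypothetical timestamps: the token is signed 1 second before the credentials expire.
var credentialExpiry = new DateTime(2022, 12, 11, 12, 0, 0, DateTimeKind.Utc);
var signedAt = credentialExpiry - TimeSpan.FromSeconds(1);
var effectiveExpiry = EffectiveTokenExpiry(signedAt, TimeSpan.FromMinutes(15), credentialExpiry);

// Nominal lifetime is 15 minutes, but the usable window ends with the credentials.
Console.WriteLine($"Usable for: {(effectiveExpiry - signedAt).TotalSeconds} s");
```

Under this model, caching the token for anything longer than the credentials' remaining lifetime is unsafe, regardless of the nominal 15 minutes.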

Expected Behavior

  • ECSTaskCredentials refreshes its cached credentials much earlier than its actual expiry, ideally as soon as a new one is available at http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}
  • RDSAuthTokenGenerator.GenerateAuthToken can be used with fallback credentials and return auth tokens that can be safely cached up to their nominal expiry time (i.e. 15 minutes).

Current Behavior

  • ECSTaskCredentials doesn't refresh its cached credentials until the very end of their lifetime.
  • RDSAuthTokenGenerator.GenerateAuthToken, when used with fallback credentials, returns auth tokens that may expire at any moment, because we don't know when the signing IAM creds will expire.

This has caused intermittent connection errors in our application.

Reproduction Steps

Set up an RDS/Aurora PostgreSQL DB with IAM authentication, then run the following code in an ECS Fargate task.
You should see an authentication error in about 6 hours (the lifetime of IAM creds in Fargate).

using Amazon.RDS.Util; // namespace of RDSAuthTokenGenerator
using Npgsql;

var dataSourceBuilder = new NpgsqlDataSourceBuilder(
    // purposefully disabling pooling, because it can hide the issue sometimes
    "Host=<rds_host_name>;Port=5432;Database=<db_name>;Username=<user>;SSL Mode=require;Trust Server Certificate=true;Pooling=false"
);
dataSourceBuilder.UsePeriodicPasswordProvider(
    passwordProvider: (_, _) => ValueTask.FromResult(
        RDSAuthTokenGenerator.GenerateAuthToken("<rds_host_name>", 5432, "<user>")
    ),
    // a ~10-minute refresh interval should theoretically be safe, because generated tokens have nominal expiry of 15 minutes
    // using 9.6 here to avoid the interval coinciding with ECSTaskCredentials refreshes
    successRefreshInterval: TimeSpan.FromMinutes(9.6),
    failureRefreshInterval: TimeSpan.FromSeconds(5)
);

await using var dataSource = dataSourceBuilder.Build();

while(true) {
    try{
        await using var command = dataSource.CreateCommand("SELECT 1");
        await command.ExecuteNonQueryAsync();
        await Task.Delay(TimeSpan.FromMinutes(1));
    } catch(Exception ex) {
        // this will happen in ~6 hours, but it shouldn't
        Console.WriteLine($"Error talking to the DB: {ex}");
    }
}

Possible Solution

Set a non-zero PreemptExpiryTime for ECSTaskCredentials.

ECS Fargate seems to refresh the creds available at http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI} as early as 3 hours before the old one expires, so 3 hours may work? But for my purpose, I'd be happy with 1 hour or even just 15 minutes too.
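
As a rough illustration of what a non-zero PreemptExpiryTime changes, the sketch below models the refresh decision as "refresh once the current time enters a window that ends at expiry". This is an assumed simplification, not the SDK's actual implementation; `ShouldRefresh` and the timestamps are hypothetical.

```csharp
using System;

// Hypothetical helper, not the SDK's implementation: refresh once the current
// time falls inside the preempt window that ends at the credentials' expiry.
static bool ShouldRefresh(DateTime now, DateTime expiry, TimeSpan preemptExpiryTime)
    => now >= expiry - preemptExpiryTime;

var expiry = new DateTime(2022, 12, 11, 12, 0, 0, DateTimeKind.Utc);
var now = expiry - TimeSpan.FromMinutes(10);   // 10 minutes of credential lifetime left

bool withZero = ShouldRefresh(now, expiry, TimeSpan.Zero);               // false: cached creds are kept
bool withFifteen = ShouldRefresh(now, expiry, TimeSpan.FromMinutes(15)); // true: creds are refreshed early
Console.WriteLine($"zero: {withZero}, 15 min: {withFifteen}");
```

With a zero window the cached credentials are handed out right up to expiry, which matches the behavior described in this issue.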

Additional Information/Context

No response

AWS .NET SDK and/or Package version used

AWSSDK.RDS 3.7.105.5

Targeted .NET Platform

.NET 6

Operating System and version

Debian

cfbao added the bug (This issue is a bug.) and needs-triage (This issue or PR still needs to be triaged.) labels on Dec 11, 2022
@ashishdhingra
Contributor

This appears to be a valid concern, but I'm not sure how we should decide on the refresh interval. It needs discussion with the team.

@MariusVladu

Would it help to manually call FallbackCredentialsFactory.Reset(); before generating the new auth token?

I think I'm facing the same issue: ECS Fargate Spot task, RDS MySQL db.t4g.small, .NET 7 with Entity Framework.
I'm caching the entire connection string for 10 minutes and always setting it in a ConnectionOpen interceptor.
DB instance has a max of 19 connections at a time with an average of 5-9.
I'm using connection pooling (default settings).

Still, I'm getting rare database authentication failures: some after ~6 hours, some at slightly different times.

No idea how else to troubleshoot this. Finding this open issue gave me some hope though.

@peterrsongg
Contributor

@MariusVladu @cfbao We intend to release the fix for this tomorrow. I will comment here when it is officially released. Thank you.

@peterrsongg
Contributor

@cfbao The fix was released in version 3.7.506.0. Thank you for bringing this to our attention. If the issue persists, feel free to re-open this.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@Runaground

@peterrsongg

I'm experiencing the same problem.
We are using NpgsqlDataSourceBuilder with password rotation every 10 minutes. We verified that the password is indeed requested every 10 minutes via RDSAuthTokenGenerator.GenerateAuthToken.
Our AWSSDK.Core version is 3.7.107.108 (which is more recent than the 3.7.506.0 release).
Initially I opened an issue against Npgsql (npgsql/npgsql#5163), but now I think AWSSDK is the culprit.

@cfbao
Author

cfbao commented Jul 19, 2023

@peterrsongg

I don't think this issue is actually fixed.
PreemptExpiryTime is now set to 5 minutes:

PreemptExpiryTime = TimeSpan.FromMinutes(5);

which still isn't enough to cover the lifetime of an RDS auth token, which is 15 minutes.
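
The arithmetic behind this can be sketched as follows. It assumes a simplified model in which credentials are always refreshed successfully once they are within the preempt window of expiry, so a freshly signed token is backed by at least that much remaining credential lifetime. `GuaranteedValidity` is a hypothetical helper; the 9.6-minute interval is the one from the reproduction code.

```csharp
using System;

// Worst case under the simplified model: a freshly signed token is guaranteed
// at least min(tokenLifetime, preempt) of usable time, and no more than that.
static TimeSpan GuaranteedValidity(TimeSpan tokenLifetime, TimeSpan preempt)
    => preempt < tokenLifetime ? preempt : tokenLifetime;

var tokenLifetime = TimeSpan.FromMinutes(15);    // nominal RDS auth token lifetime
var cacheInterval = TimeSpan.FromMinutes(9.6);   // successRefreshInterval from the reproduction code

foreach (var minutes in new[] { 0.0, 5.0, 15.0 })
{
    var guaranteed = GuaranteedValidity(tokenLifetime, TimeSpan.FromMinutes(minutes));
    bool safe = guaranteed >= cacheInterval;
    Console.WriteLine($"preempt {minutes} min: guaranteed {guaranteed.TotalMinutes} min, safe to cache for 9.6 min: {safe}");
}
```

In this model, only a preempt window at least as long as the token lifetime lets callers cache a token up to its nominal expiry.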

@peterrsongg
Contributor

@Runaground @cfbao My understanding was that the credentials were being refreshed at the moment they were expiring, which is what was causing this error, but it seems that in both of your cases 5 minutes is not enough. I'll look into increasing this to 20 minutes.

@peterrsongg
Contributor

We decided to increase the PreemptExpiryTime of all of our credential providers to 15 minutes. This will go out in our next manual release. I'll ping here when that happens. Appreciate your patience.

@peterrsongg
Contributor

@cfbao @Runaground The fix has been released in Core version 3.7.202.


5 participants