
(rds): AWS::RDS::DBInstance should not have EngineVersion property set for Aurora clusters #21758

Closed
blimmer opened this issue Aug 25, 2022 · 11 comments · Fixed by #22185

Labels: @aws-cdk/aws-rds (Related to Amazon Relational Database), bug (This issue is a bug.), effort/small (Small work item – less than a day of effort), p1

blimmer (Contributor) commented Aug 25, 2022

Describe the bug

Currently, when upgrading an RDS Aurora cluster between versions, a runtime error occurs that leaves the CloudFormation stack in an unrecoverable UPDATE_ROLLBACK_FAILED state. This happens because CDK sets the EngineVersion property on AWS::RDS::DBInstance, which the CloudFormation documentation says you should not set when using an Aurora cluster:

Amazon Aurora

Not applicable. The version number of the database engine to be used by the DB instance is managed by the DB cluster.

(Screenshot of the AWS::RDS::DBInstance EngineVersion documentation, taken Aug 25, 2022.)

When upgrading between versions that require downtime, the update fails with this error:

The stack named BlimmerTestAuroraUpgradeStack failed to deploy: UPDATE_ROLLBACK_FAILED (The following resource(s) failed to update: [DatabaseCluster68FC2945, DatabaseClusterInstance1C566869D]. ): The specified DB Instance is a member of a cluster. Modify the DB engine version for the DB Cluster using the ModifyDbCluster API (Service: Rds, Status Code: 400, Request ID: 9998e162-bb47-4ff0-a6ad-91665e964ff2), DB cluster isn't available for modification with status upgrading. (Service: Rds, Status Code: 400, Request ID: 5b967f09-58a2-42dc-aa20-3a8bffbf705a)

Here's a test stack with the event log:

(Screenshot: CloudFormation event log from the test stack.)

The worst part about this bug is that the database stack is left in the unrecoverable UPDATE_ROLLBACK_FAILED state, which means that you either have to:

a) Attempt to complete the rollback. This is impossible, because the cluster actually upgrades even though the CloudFormation steps fail, and RDS does not allow rolling back from the new major version to the old one.

b) Delete the stack. This is obviously not ideal, because databases should not be deleted in most cases. Worse, you can't update the DeletionPolicy to try to retain the database while deleting the stack.

Expected Behavior

I expected to be able to update a DatabaseCluster between major supported versions via the engine property on DatabaseCluster.

Current Behavior

As mentioned above, if you try to upgrade an Aurora cluster between major versions, you'll encounter the error quoted in the bug description, which leaves the stack in an unrecoverable state.

Reproduction Steps

  1. Create a new stack with an older major version of Aurora Postgres:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';
import { Vpc } from 'aws-cdk-lib/aws-ec2';

export class CdkBugReportsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new DatabaseCluster(this, 'DatabaseCluster', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_10_18,
      }),
      instanceProps: {
        vpc: new Vpc(this, 'Vpc')
      }
    })
  }
}
  2. `cdk deploy` the stack above
  3. Update to a newer major version of Aurora Postgres. At the time of writing, 10.18 -> 13.4 is a valid upgrade target:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';
import { Vpc } from 'aws-cdk-lib/aws-ec2';

export class CdkBugReportsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new DatabaseCluster(this, 'DatabaseCluster', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_13_4,  // Changed
      }),
      instanceProps: {
        vpc: new Vpc(this, 'Vpc')
      }
    })
  }
}
  4. `cdk deploy` again

Observe: the stack update will fail with the aforementioned error, leaving the Database stack in an unrecoverable state.

Possible Solution

Short Term Workaround

As a short-term workaround, users can reach into the L1 construct and remove the EngineVersion property like this:

import { CfnDBInstance } from 'aws-cdk-lib/aws-rds';

// Strip EngineVersion from every instance in the cluster; for Aurora, the
// cluster itself manages the engine version.
const cfnInstances = cluster.node.children.filter((child) => child instanceof CfnDBInstance);
if (cfnInstances.length === 0) {
  throw new Error("Couldn't pull CfnDBInstances from the L1 constructs!");
}
cfnInstances.forEach((cfnInstance) => delete (cfnInstance as CfnDBInstance).engineVersion);
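
An equivalent approach, if you'd rather not use `delete`, is CDK's documented escape hatch `addPropertyDeletionOverride`, which removes the property from the synthesized template:

// Same effect as the delete above, via the L1 escape hatch.
cfnInstances.forEach((cfnInstance) =>
  (cfnInstance as CfnDBInstance).addPropertyDeletionOverride('EngineVersion'),
);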

I've tested removing the EngineVersion property from an existing CloudFormation stack/DBInstance. It shows the following diff:

Stack BlimmerTestAuroraUpgradeStack
Resources
[~] AWS::RDS::DBInstance DatabaseCluster/Instance1 DatabaseClusterInstance1C566869D
 └─ [-] EngineVersion
     └─ 10.18

And the change applied with (what appears to be) no effect on the actual instances:

BlimmerTestAuroraUpgradeStack: creating CloudFormation changeset...
BlimmerTestAuroraUpgradeStack | 0/3 | 11:31:22 AM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack User Initiated
BlimmerTestAuroraUpgradeStack | 0/3 | 11:31:29 AM | UPDATE_IN_PROGRESS   | AWS::RDS::DBInstance                        | DatabaseCluster/Instance1 (DatabaseClusterInstance1C566869D)
BlimmerTestAuroraUpgradeStack | 1/3 | 11:31:32 AM | UPDATE_COMPLETE      | AWS::RDS::DBInstance                        | DatabaseCluster/Instance1 (DatabaseClusterInstance1C566869D)
BlimmerTestAuroraUpgradeStack | 2/3 | 11:31:33 AM | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack
BlimmerTestAuroraUpgradeStack | 3/3 | 11:31:34 AM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack

I verified in the RDS console that the instance didn't shut down or have any other events during this cdk deploy.

So, I believe it should be safe for all Aurora DatabaseCluster users to use this workaround without downtime. However, I've only tested this on my test cluster, so your mileage may vary.

Note that this should only be done for Aurora clusters, per the AWS::RDS::DBInstance CFN documentation.

Longer Term Fix

CDK should detect when the engine passed is an Aurora engine and, in that case, not set the EngineVersion property on AWS::RDS::DBInstance. A rough sketch of the idea follows.
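
As an illustration only (the helper name and structure are hypothetical, not the actual aws-cdk-lib internals), the instance-creation code could compute the instance-level engine version like this:

import { IClusterEngine } from 'aws-cdk-lib/aws-rds';

// Hypothetical helper: only pass EngineVersion through for non-Aurora engines.
// All Aurora engine types ('aurora', 'aurora-mysql', 'aurora-postgresql')
// share the 'aurora' prefix.
function instanceEngineVersion(engine: IClusterEngine): string | undefined {
  const isAurora = engine.engineType.startsWith('aurora');
  // For Aurora, the DB cluster owns the engine version, so leave it unset.
  return isAurora ? undefined : engine.engineVersion?.fullVersion;
}

The CfnDBInstance props would then use engineVersion: instanceEngineVersion(props.engine) instead of always setting the version.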

Additional Information/Context

I confirmed this issue by testing in my AWS accounts. I also have an internal ticket (case #10630169951) opened with AWS Support about this issue. In addition to a CDK fix, it seems this could be handled more elegantly on the CloudFormation side.

CDK CLI Version

2.38.1 (build a5ced21)

Framework Version

No response

Node.js Version

16 LTS

OS

MacOS

Language

TypeScript

Language Version

No response

Other information

Because of the unrecoverable state in which the DatabaseCluster stack is left, I'd highly recommend treating this as a P1 bug.

@blimmer blimmer added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2022
@github-actions github-actions bot added the @aws-cdk/aws-rds Related to Amazon Relational Database label Aug 25, 2022
corymhall (Contributor) commented:
It seems like this issue impacts a significant number of customers, and I've tagged it as P1, which means it should be on our near-term roadmap.

We welcome community contributions! If you are able, we encourage you to contribute (https://github.com/aws/aws-cdk/blob/master/CONTRIBUTING.md) a bug fix or new feature to the CDK. If you decide to contribute, please start an engineering discussion in this issue to ensure there is a commonly understood design before submitting code. This will minimize the number of review cycles and get your code merged faster.

@corymhall corymhall added p1 effort/small Small work item – less than a day of effort and removed needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2022
@corymhall corymhall removed their assignment Aug 25, 2022
davenix-palmetto (Contributor) commented:
Thanks all for raising this issue. We are hitting this in production as well.

blimmer (Contributor, Author) commented Sep 1, 2022

IMPORTANT: please note that there was an issue in the workaround I previously posted. I've edited the description above so the workaround there is now correct.

Previously, I was only removing the EngineVersion from the first DbInstance (because I was only running one instance in my testing). However, we need to remove it from all instances. The workaround should look like this:

const cfnInstances = cluster.node.children.filter((child) => child instanceof CfnDBInstance);
if (cfnInstances.length === 0) {
  throw new Error("Couldn't pull CfnDBInstances from the L1 constructs!");
}
cfnInstances.forEach((cfnInstance) => delete (cfnInstance as CfnDBInstance).engineVersion);

Note that instead of a .find, I'm now using a .filter to grab all instances.

rittneje commented Sep 2, 2022

We are observing the same bug even when upgrading between minor versions (#21899).

I wonder if an Aspect would be a better fit for the workaround.
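
For reference, a minimal sketch of what such an Aspect might look like (untested; the class name is illustrative):

import { IAspect } from 'aws-cdk-lib';
import { IConstruct } from 'constructs';
import { CfnDBInstance } from 'aws-cdk-lib/aws-rds';

// Visits every construct in scope and strips EngineVersion from any
// DB instance that belongs to a cluster.
class StripInstanceEngineVersion implements IAspect {
  public visit(node: IConstruct): void {
    if (node instanceof CfnDBInstance && node.dbClusterIdentifier) {
      node.addPropertyDeletionOverride('EngineVersion');
    }
  }
}

// Usage: Aspects.of(stack).add(new StripInstanceEngineVersion());
// (Aspects is also exported from 'aws-cdk-lib'.)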

blimmer (Contributor, Author) commented Sep 2, 2022

Thanks @rittneje, that behavior you saw makes sense. I updated the ticket to remove the language about minor versions being OK.

rittneje commented Sep 2, 2022

@corymhall Not sure if a bug has been filed against CloudFormation, but it should have rejected templates where an AWS::RDS::DBInstance specifies both DBClusterIdentifier and EngineVersion, since RDS cannot support that combination.

joshlartz (Contributor) commented:
@corymhall I moved my PR over to this issue. Can you give the PR a look please?

mergify bot closed this as completed in #22185 on Sep 26, 2022
mergify bot pushed a commit that referenced this issue Sep 26, 2022
…s that were part of a DBCluster (#22185)

Engine version should not be set on instances that are part of a cluster. The cluster is responsible for this setting and throws an API error when an update is attempted on them. 

closes #21758 #22180


github-actions (bot) commented:
⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

hacker65536 pushed a commit to hacker65536/aws-cdk that referenced this issue Sep 30, 2022
arewa pushed a commit to arewa/aws-cdk that referenced this issue Oct 8, 2022
homakk pushed a commit to homakk/aws-cdk that referenced this issue Dec 1, 2022
ajhool commented Jan 24, 2023

> The worst part about this bug is that the database stack is left in the unrecoverable UPDATE_ROLLBACK_FAILED state, which means that you either have to:
>
> a) Attempt to complete the rollback. This is impossible, because the cluster actually upgrades even though the CloudFormation steps fail, and RDS does not allow rolling back from the new major version to the old one.
>
> b) Delete the stack. This is obviously not ideal, because databases should not be deleted in most cases. Worse, you can't update the DeletionPolicy to try to retain the database while deleting the stack.

Is it correct that the only way out of this state is to delete the stack and the database? Or can the "skip resources" technique be used to complete the rollback into an UPDATE_ROLLBACK_COMPLETE state?

rittneje commented:
I believe you can resolve this by completing the rollback and skipping resources in the UI. Then deploy a new template with the cluster version set to whatever it actually is in RDS right now, and EngineVersion removed from each DBInstance.

I highly recommend testing this against some test stack first.
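
Concretely, that recovery deployment might look like the following inside the stack constructor (a sketch under assumptions: the cluster actually landed on 13.4, and `vpc` is defined as in the reproduction above):

import { CfnDBInstance, DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';

// Pin the cluster to the version RDS actually ended up on...
const cluster = new DatabaseCluster(this, 'DatabaseCluster', {
  engine: DatabaseClusterEngine.auroraPostgres({
    version: AuroraPostgresEngineVersion.VER_13_4, // match the live cluster
  }),
  instanceProps: { vpc },
});

// ...and strip EngineVersion from each instance, as in the workaround above.
cluster.node.children
  .filter((child): child is CfnDBInstance => child instanceof CfnDBInstance)
  .forEach((instance) => instance.addPropertyDeletionOverride('EngineVersion'));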

ajhool commented Jan 24, 2023

Okay, thanks. Starting from the bad UPDATE_ROLLBACK_FAILED state caused by the problem detailed in this issue, I was able to recover. Some of the text in these steps might differ slightly, because I no longer have access to that state. The basic steps:

  1. Go to the AWS Console Cloudformation Service
  2. Click on the affected stack
  3. Select "Continue update rollback"
  4. Advanced Troubleshooting
  5. Select to skip the affected cluster
  6. (I tried updating to the latest CDK (2.61.1) and redeploying here, but it failed with the original instance version error)
  7. Add the patch code provided in this issue's workaround (#21758, comment above)
  8. Redeploy

That seems to have worked, and the CloudFormation stack is back in the correct state. I then tried removing the patch code and redeploying with CDK 2.61.1, but it failed with the original error, and the diff shows an attempt to add EngineVersion to both instances.
