
(rds): AWS::RDS::DBInstance should not have EngineVersion property set for Aurora clusters #21758

Closed
blimmer opened this issue Aug 25, 2022 · 11 comments · Fixed by #22185

Labels: @aws-cdk/aws-rds (Related to Amazon Relational Database), bug (This issue is a bug.), effort/small (Small work item – less than a day of effort), p1

blimmer (Contributor) commented Aug 25, 2022

Describe the bug

Currently, when upgrading an RDS Aurora cluster between versions, a runtime error occurs that leaves the CloudFormation stack in an unrecoverable UPDATE_ROLLBACK_FAILED state. This happens because CDK sets the EngineVersion property on AWS::RDS::DBInstance, which the CloudFormation documentation says you should not set when using an Aurora cluster:

Amazon Aurora

Not applicable. The version number of the database engine to be used by the DB instance is managed by the DB cluster.

(Screenshot of the AWS::RDS::DBInstance EngineVersion documentation, taken Aug 25, 2022.)

When upgrading between versions that require downtime, the update fails with this error:

The stack named BlimmerTestAuroraUpgradeStack failed to deploy: UPDATE_ROLLBACK_FAILED (The following resource(s) failed to update: [DatabaseCluster68FC2945, DatabaseClusterInstance1C566869D]. ): The specified DB Instance is a member of a cluster. Modify the DB engine version for the DB Cluster using the ModifyDbCluster API (Service: Rds, Status Code: 400, Request ID: 9998e162-bb47-4ff0-a6ad-91665e964ff2), DB cluster isn't available for modification with status upgrading. (Service: Rds, Status Code: 400, Request ID: 5b967f09-58a2-42dc-aa20-3a8bffbf705a)

Here's a test stack with the event log:

(Screenshot: CloudFormation event log from the test stack.)

The worst part about this bug is that the database stack is left in the unrecoverable UPDATE_ROLLBACK_FAILED state, which means that you either have to:

a) Attempt to complete the rollback. This is impossible, because the cluster actually upgrades even though the CloudFormation steps fail, and RDS does not allow rolling back from the new major version to the old one.

b) Delete the stack. This is obviously not ideal, because databases should not be deleted in most cases. Worse, you can't update the DeletionPolicy to try to retain the database while deleting the stack.

Expected Behavior

I expected to be able to update a DatabaseCluster between major supported versions via the engine property on DatabaseCluster.

Current Behavior

As mentioned above, if you try to upgrade an Aurora cluster between major versions, you'll encounter the error quoted in the bug description, which leaves the stack in an unrecoverable state.

Reproduction Steps

  1. Create a new stack with an older major version of Aurora Postgres:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';
import { Vpc } from 'aws-cdk-lib/aws-ec2';

export class CdkBugReportsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new DatabaseCluster(this, 'DatabaseCluster', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_10_18,
      }),
      instanceProps: {
        vpc: new Vpc(this, 'Vpc')
      }
    })
  }
}
  2. `cdk deploy` the stack above
  3. Update to a newer major version of Aurora Postgres. At the time of writing, 10.18 -> 13.4 is a valid upgrade target:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';
import { Vpc } from 'aws-cdk-lib/aws-ec2';

export class CdkBugReportsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new DatabaseCluster(this, 'DatabaseCluster', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_13_4,  // Changed
      }),
      instanceProps: {
        vpc: new Vpc(this, 'Vpc')
      }
    })
  }
}
  4. `cdk deploy` again

Observe: the stack update will fail with the aforementioned error, leaving the Database stack in an unrecoverable state.

Possible Solution

Short Term Workaround

As a short-term workaround, users can reach into the L1 construct and remove the EngineVersion property like this:

import { CfnDBInstance } from 'aws-cdk-lib/aws-rds';

// Strip EngineVersion from every instance in the cluster; for Aurora, the
// cluster itself manages the engine version.
const cfnInstances = cluster.node.children.filter((child) => child instanceof CfnDBInstance);
if (cfnInstances.length === 0) {
  throw new Error("Couldn't pull CfnDBInstances from the L1 constructs!");
}
cfnInstances.forEach((cfnInstance) => delete (cfnInstance as CfnDBInstance).engineVersion);
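
An equivalent approach, if you'd rather not use `delete`, is CDK's documented escape hatch `addPropertyDeletionOverride`, which removes the property from the synthesized template:

// Same effect as the delete above, via the L1 escape hatch.
cfnInstances.forEach((cfnInstance) =>
  (cfnInstance as CfnDBInstance).addPropertyDeletionOverride('EngineVersion'),
);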

I've tested removing the EngineVersion property from an existing CloudFormation stack/DBInstance. It shows the following diff:

Stack BlimmerTestAuroraUpgradeStack
Resources
[~] AWS::RDS::DBInstance DatabaseCluster/Instance1 DatabaseClusterInstance1C566869D
 └─ [-] EngineVersion
     └─ 10.18

And the change applied with (what appears to be) no effect on the actual instances:

BlimmerTestAuroraUpgradeStack: creating CloudFormation changeset...
BlimmerTestAuroraUpgradeStack | 0/3 | 11:31:22 AM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack User Initiated
BlimmerTestAuroraUpgradeStack | 0/3 | 11:31:29 AM | UPDATE_IN_PROGRESS   | AWS::RDS::DBInstance                        | DatabaseCluster/Instance1 (DatabaseClusterInstance1C566869D)
BlimmerTestAuroraUpgradeStack | 1/3 | 11:31:32 AM | UPDATE_COMPLETE      | AWS::RDS::DBInstance                        | DatabaseCluster/Instance1 (DatabaseClusterInstance1C566869D)
BlimmerTestAuroraUpgradeStack | 2/3 | 11:31:33 AM | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack
BlimmerTestAuroraUpgradeStack | 3/3 | 11:31:34 AM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack

I verified in the RDS console that the instance didn't shut down or have any other events during this cdk deploy.

So, I believe it should be safe for all Aurora DatabaseCluster users to use this workaround without downtime. However, I've only tested this on my test cluster, so your mileage may vary.

Note that this should only be done for Aurora clusters, per the AWS::RDS::DBInstance CFN documentation.

Longer Term Fix

CDK should detect when the engine passed is an Aurora engine and, in that case, not set the EngineVersion property on AWS::RDS::DBInstance. A rough sketch of the idea follows.
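
As an illustration only (the helper name and structure are hypothetical, not the actual aws-cdk-lib internals), the instance-creation code could compute the instance-level engine version like this:

import { IClusterEngine } from 'aws-cdk-lib/aws-rds';

// Hypothetical helper: only pass EngineVersion through for non-Aurora engines.
// All Aurora engine types ('aurora', 'aurora-mysql', 'aurora-postgresql')
// share the 'aurora' prefix.
function instanceEngineVersion(engine: IClusterEngine): string | undefined {
  const isAurora = engine.engineType.startsWith('aurora');
  // For Aurora, the DB cluster owns the engine version, so leave it unset.
  return isAurora ? undefined : engine.engineVersion?.fullVersion;
}

The CfnDBInstance props would then use engineVersion: instanceEngineVersion(props.engine) instead of always setting the version.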

Additional Information/Context

I confirmed this issue by testing in my AWS accounts. I also have an internal ticket (case #10630169951) opened with AWS Support about this issue. In addition to a CDK fix, it seems this could be handled more elegantly on the CloudFormation side.

CDK CLI Version

2.38.1 (build a5ced21)

Framework Version

No response

Node.js Version

16 LTS

OS

MacOS

Language

TypeScript

Language Version

No response

Other information

Because of the unrecoverable state in which the DatabaseCluster stack is left, I'd highly recommend treating this as a P1 bug.

@blimmer blimmer added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2022
@github-actions github-actions bot added the @aws-cdk/aws-rds Related to Amazon Relational Database label Aug 25, 2022
corymhall (Contributor) commented:
It seems like this issue impacts a significant number of customers, and I've tagged it as P1, which means it should be on our near-term roadmap.

We welcome community contributions! If you are able, we encourage you to contribute (https://github.com/aws/aws-cdk/blob/master/CONTRIBUTING.md) a bug fix or new feature to the CDK. If you decide to contribute, please start an engineering discussion in this issue to ensure there is a commonly understood design before submitting code. This will minimize the number of review cycles and get your code merged faster.

@corymhall corymhall added p1 effort/small Small work item – less than a day of effort and removed needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2022
@corymhall corymhall removed their assignment Aug 25, 2022
davenix-palmetto (Contributor) commented:
Thanks all for raising this issue. We are hitting this in production as well.

blimmer (Contributor, Author) commented Sep 1, 2022

IMPORTANT: please note that there was an issue in the workaround I previously posted. I've edited the description above so the workaround there is now correct.

Previously, I was only removing the EngineVersion from the first DbInstance (because I was only running one instance in my testing). However, we need to remove it from all instances. The workaround should look like this:

const cfnInstances = cluster.node.children.filter((child) => child instanceof CfnDBInstance);
if (cfnInstances.length === 0) {
  throw new Error("Couldn't pull CfnDBInstances from the L1 constructs!");
}
cfnInstances.forEach((cfnInstance) => delete (cfnInstance as CfnDBInstance).engineVersion);

Note that instead of a .find, I'm now using a .filter to grab all instances.

rittneje commented Sep 2, 2022

We are observing the same bug even when upgrading between minor versions (#21899).

I wonder if an Aspect would be a better fit for the workaround.
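
For reference, a minimal sketch of what such an Aspect might look like (untested; the class name is illustrative):

import { IAspect } from 'aws-cdk-lib';
import { IConstruct } from 'constructs';
import { CfnDBInstance } from 'aws-cdk-lib/aws-rds';

// Visits every construct in scope and strips EngineVersion from any
// DB instance that belongs to a cluster.
class StripInstanceEngineVersion implements IAspect {
  public visit(node: IConstruct): void {
    if (node instanceof CfnDBInstance && node.dbClusterIdentifier) {
      node.addPropertyDeletionOverride('EngineVersion');
    }
  }
}

// Usage: Aspects.of(stack).add(new StripInstanceEngineVersion());
// (Aspects is also exported from 'aws-cdk-lib'.)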

blimmer (Contributor, Author) commented Sep 2, 2022

Thanks @rittneje, that behavior you saw makes sense. I updated the ticket to remove the language about minor versions being OK.

rittneje commented Sep 2, 2022

@corymhall Not sure if a bug has been filed against CloudFormation, but it should have rejected templates where an AWS::RDS::DBInstance specifies both DBClusterIdentifier and EngineVersion, since RDS cannot support that combination.

joshlartz (Contributor) commented:
@corymhall I moved my PR over to this issue. Can you give the PR a look please?

mergify bot closed this as completed in #22185 on Sep 26, 2022
mergify bot pushed a commit that referenced this issue Sep 26, 2022
…s that were part of a DBCluster (#22185)

Engine version should not be set on instances that are part of a cluster. The cluster is responsible for this setting and throws an API error when an update is attempted on them. 

closes #21758 #22180


github-actions (bot) commented:
⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

hacker65536 pushed a commit to hacker65536/aws-cdk that referenced this issue Sep 30, 2022
arewa pushed a commit to arewa/aws-cdk that referenced this issue Oct 8, 2022
homakk pushed a commit to homakk/aws-cdk that referenced this issue Dec 1, 2022
ajhool commented Jan 24, 2023

> The worst part about this bug is that the database stack is left in the unrecoverable UPDATE_ROLLBACK_FAILED state, which means that you either have to:
>
> a) Attempt to complete the rollback. This is impossible, because the cluster actually upgrades even though the CloudFormation steps fail, and RDS does not allow rolling back from the new major version to the old one.
>
> b) Delete the stack. This is obviously not ideal, because databases should not be deleted in most cases. Worse, you can't update the DeletionPolicy to try to retain the database while deleting the stack.

Is it correct that the only way out of this state is to delete the stack and the database? Or can the "skip resources" technique be used to complete the rollback into an UPDATE_ROLLBACK_COMPLETE state?

rittneje commented:
I believe you can resolve this by completing the rollback and skipping resources in the UI. Then deploy a new template with the cluster version set to whatever it actually is in RDS right now, and EngineVersion removed from each DBInstance.

I highly recommend testing this against some test stack first.
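
Concretely, that recovery deployment might look like the following inside the stack constructor (a sketch under assumptions: the cluster actually landed on 13.4, and `vpc` is defined as in the reproduction above):

import { CfnDBInstance, DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';

// Pin the cluster to the version RDS actually ended up on...
const cluster = new DatabaseCluster(this, 'DatabaseCluster', {
  engine: DatabaseClusterEngine.auroraPostgres({
    version: AuroraPostgresEngineVersion.VER_13_4, // match the live cluster
  }),
  instanceProps: { vpc },
});

// ...and strip EngineVersion from each instance, as in the workaround above.
cluster.node.children
  .filter((child): child is CfnDBInstance => child instanceof CfnDBInstance)
  .forEach((instance) => instance.addPropertyDeletionOverride('EngineVersion'));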

ajhool commented Jan 24, 2023

Okay, thanks. Starting from the bad UPDATE_ROLLBACK_FAILED state caused by the problem detailed in this issue, I was able to recover. Some of the text in these steps might differ slightly, because I no longer have access to that state. The basic steps:

  1. Go to the AWS Console Cloudformation Service
  2. Click on the affected stack
  3. Select "Continue update rollback"
  4. Advanced Troubleshooting
  5. Select to skip the affected cluster
  6. (I tried updating to the latest CDK (2.61.1) and redeploying here, but it failed with the original instance version error)
  7. Add the patch code provided in this issue's workaround (#21758, comment above)
  8. Redeploy

That seems to have worked, and the CloudFormation stack is back in the correct state. I then tried removing the patch code and redeploying with CDK 2.61.1, but it failed with the original error, and the diff shows an attempt to add EngineVersion to both instances.
