(aws-logs): log retention update can race against the target lambda's log group creation resulting in OperationAbortedException #15709

Closed
steved opened this issue Jul 22, 2021 · 12 comments · Fixed by #16083
Labels
@aws-cdk/aws-logs Related to Amazon CloudWatch Logs bug This issue is a bug. effort/small Small work item – less than a day of effort p1

Comments

@steved

steved commented Jul 22, 2021

When a Lambda function runs, its log group and log stream are created if they do not already exist. When a Lambda function is created through CDK with log retention configured, the log retention update can race against this background log group creation, resulting in an OperationAbortedException error.

Reproduction Steps

Hard to reproduce reliably, as it relies on a race condition between the Lambda service creating the log group and the log retention lambda.
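
For reference, a setup along these lines can expose the race. This is only an illustrative sketch: the stack and construct names and the inline handler are invented, and it uses the CDK v1 TypeScript API rather than the Python app from the report.

import * as cdk from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as logs from '@aws-cdk/aws-logs';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'LogRetentionRaceRepro');

// Setting logRetention makes CDK deploy a second (LogRetention) function that
// calls CreateLogGroup/PutRetentionPolicy on /aws/lambda/<function name>.
// If this function is invoked while the stack is still deploying, the Lambda
// service may create the same log group concurrently, which is the race
// described in this issue.
new lambda.Function(stack, 'MyLambda', {
  runtime: lambda.Runtime.PYTHON_3_8,
  handler: 'index.handler',
  code: lambda.Code.fromInline('def handler(event, context):\n    return "ok"'),
  logRetention: logs.RetentionDays.ONE_WEEK,
});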

What did you expect to happen?

Log retention policy is properly applied.

What actually happened?

2021-07-22T02:52:24.985Z	a53fc34d-6d93-4590-94fd-8115d0b93e3b	INFO	OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-07-22T02:52:24.925Z,
  requestId: '[redacted]',
  statusCode: 400,
  retryable: false,
  retryDelay: 56.75647811158137
}

Environment

  • CDK CLI Version : 1.110.1
  • Framework Version: 1.110.1
  • Node.js Version: 10.24.0
  • OS : Buster (container)
  • Language (Version): all, Python (3.8)

Other

AWS log group creation cloudtrail event:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "[...]",
        "arn": "[...]",
        "accountId": "[...]",
        "accessKeyId": "[...]",
        "sessionContext": {
            "sessionIssuer": {
               ...
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2021-07-22T02:52:23Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2021-07-22T02:52:24Z",
    "eventSource": "logs.amazonaws.com",
    "eventName": "CreateLogGroup",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "[...]",
    "userAgent": "awslambda-worker/1.0  rusoto/0.42.0 rust/1.52.1 linux",
    "requestParameters": {
        "logGroupName": "/aws/lambda/mylambda"
    },
    "responseElements": null,
    "requestID": "[...]",
    "eventID": "[...]",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "apiVersion": "20140328",
    "managementEvent": true,
    "recipientAccountId": "...",
    "eventCategory": "Management"
}

Log retention API call:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "...",
        "arn": "...",
        "accountId": "...",
        "accessKeyId": "...",
        "sessionContext": {
            "sessionIssuer": {
                ...
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2021-07-22T02:52:23Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2021-07-22T02:52:24Z",
    "eventSource": "logs.amazonaws.com",
    "eventName": "CreateLogGroup",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "[...]",
    "userAgent": "aws-sdk-nodejs/2.880.0 linux/v12.22.1 exec-env/AWS_Lambda_nodejs12.x promise",
    "errorCode": "OperationAbortedException",
    "errorMessage": "A conflicting operation is currently in progress against this resource. Please try again.",
    "requestParameters": {
        "logGroupName": "/aws/lambda/mylambda"
    },
    "responseElements": null,
    "requestID": "[...]",
    "eventID": "[...]",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "apiVersion": "20140328",
    "managementEvent": true,
    "recipientAccountId": "[...]",
    "eventCategory": "Management"
}

#2237 seems to be a fix for the same potential error on the log retention lambda's own log group, but not on the target function's log group.

@steved steved added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 22, 2021
@github-actions github-actions bot added @aws-cdk/aws-lambda Related to AWS Lambda @aws-cdk/aws-logs Related to Amazon CloudWatch Logs labels Jul 22, 2021
@rix0rrr rix0rrr added effort/small Small work item – less than a day of effort p1 and removed needs-triage This issue or PR still needs to be triaged. labels Jul 22, 2021
@rix0rrr
Contributor

rix0rrr commented Jul 22, 2021

Easiest way to fix this is to update the Lambda to be resilient against this occurring.

@rix0rrr rix0rrr removed their assignment Jul 22, 2021
@steved
Author

steved commented Jul 22, 2021

How would you recommend approaching this? Would adding the error code to this exception check help? Or a global retry?

if (e.code !== 'ResourceAlreadyExistsException') {

@rix0rrr
Contributor

rix0rrr commented Jul 22, 2021

I would say retry on the specific error
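
A rough sketch of "retry on the specific error", against the aws-sdk v2 client shown in the stack trace above; the helper name, attempt count, and delay are invented for illustration and are not the change that eventually landed in #16083.

import * as AWS from 'aws-sdk';

// Treat ResourceAlreadyExistsException as success (the group exists, whoever
// created it) and retry OperationAbortedException, since the conflicting
// CreateLogGroup from the Lambda service should finish shortly.
async function createLogGroupWithRetry(
  logs: AWS.CloudWatchLogs,
  logGroupName: string,
  attempts: number = 5,
  delayMs: number = 500,
): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      await logs.createLogGroup({ logGroupName }).promise();
      return;
    } catch (e: any) {
      if (e.code === 'ResourceAlreadyExistsException') {
        return;
      }
      if (e.code === 'OperationAbortedException' && i < attempts - 1) {
        await new Promise(resolve => setTimeout(resolve, delayMs));
        continue;
      }
      throw e;
    }
  }
}

Retrying only this error code keeps genuine failures (e.g. missing permissions) visible instead of masking them behind a blanket retry.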

@ddl-denis-parnovskiy
Contributor

Could someone please review the proposed fix?

There are two bugs in packages/@aws-cdk/aws-logs/lib/log-retention-provider/index.ts:

  1. It handles the OperationAbortedException race for its own lambda's log group, but not for the lambda created elsewhere in CDK (essentially this issue).
  2. When it handles the OperationAbortedException race for its own lambda's log group, it skips the retention assignment altogether, whereas it should only ignore the log group creation error and still apply the retention policy (a sketch of this separation follows below).
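
To illustrate the second point, the creation step and the retention step can be kept separate, so a lost race on CreateLogGroup does not short-circuit PutRetentionPolicy. This sketch reuses the hypothetical createLogGroupWithRetry helper from the earlier sketch; it is an assumption about the shape of a fix, not the provider's actual code.

import * as AWS from 'aws-sdk';

async function applyRetention(logGroupName: string, retentionInDays: number): Promise<void> {
  const logs = new AWS.CloudWatchLogs();
  // Creation may be a no-op if the Lambda service (or anyone else) created the
  // group first; a conflict here must not abort the whole operation.
  await createLogGroupWithRetry(logs, logGroupName);
  // Apply retention unconditionally instead of returning early on a creation conflict.
  await logs.putRetentionPolicy({ logGroupName, retentionInDays }).promise();
}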

@mergify mergify bot closed this as completed in #16083 Sep 10, 2021
mergify bot pushed a commit that referenced this issue Sep 10, 2021
Fixes: #15709

When creating a lambda with log retention, CDK actually creates 2 lambda functions. The second lambda function alters log retention of the log group of the first lambda and the retention of its own log group.

Because log group creation is asynchronous, the log retention lambda tries to pre-create both log groups to guarantee it has an object to work on.

If a normal lambda execution also creates the related log group at the same time, an "OperationAbortedException:... Please retry" error is returned.

The existing code handles this situation for the log retention lambda's own log group but not for the first lambda's.

This fix adds the retry pattern to the general log group creation code.

The existing code also had a bug: if OperationAbortedException was hit, the error was hidden but the retention policy was skipped and never actually applied. This fix addresses that bug as well.


----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
steved added a commit to dominodatalab/cdk-cf-eks that referenced this issue Sep 21, 2021
steved added a commit to dominodatalab/cdk-cf-eks that referenced this issue Sep 22, 2021
* bump cdk to v1.124.0

Includes fix for aws/aws-cdk#15709

* update unit tests

* match aws-cdk version in CI
@slavman

slavman commented Sep 23, 2021

@steved we still have this error when adding logRetention. CDK version is 1.124.0.

@m90

m90 commented Nov 3, 2021

I am also experiencing this issue when using 1.127.0 throughout.

@horsmand
Contributor

horsmand commented Nov 5, 2021

My team is hitting this with version 1.129.0.

We have some bash scripting that allows us to deploy multiple CDK apps in parallel that share some base infrastructure and we are very consistently hitting this issue.

It actually seems worse with the new fix: before, it would fail silently and let the deployment continue, but now it fails the whole CDK deployment because the Lambda function throws errors instead of swallowing them.

I haven't captured logs from the Lambda yet, but I am seeing failures like this in the cdk deploy output:

SingletonLambda2e19e24316ee4d1aa3c918d35eddd446/LogRetention (SingletonLambda2e19e24316ee4d1aa3c918d35eddd446LogRetention970D13ED) Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again.

This leads me to believe the retry logic isn't being hit at all, since the code for that looks like it should throw a new error with the message Out of attempts to create a logGroup.

@mmeylan

mmeylan commented Nov 9, 2021

We're also getting this error with 1.128.0 and 1.131.0. It happens at deploy time when trying to access a function's log group in order to set up an alarm based on errors in the logs. It happens consistently.

Here's the stack trace we're getting:

	9:40:23 PM | CREATE_FAILED        | Custom::LogRetention                      | XXX...gRetention2CACE5F0
Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress
against this resource. Please try again. (RequestId: 6e534485-2033-4eaa-a4fe-2c0cde8dbcf3)

	new LogRetention (/local/node_modules/@aws-cdk/aws-logs/lib/log-retention.ts:53:22)
	\_ Function.get logGroup [as logGroup] (/local/node_modules/@aws-cdk/aws-lambda/lib/function.ts:550:28)
	\_ LambdaMonitoring.setupAlarm (/local/cdk/lib/monitoring/lambda-monitoring.ts:83:35)
	\_ new LambdaMonitoring (/local/cdk/lib/monitoring/lambda-monitoring.ts:27:24)
	\_ BackendStack.setupLambdaMonitoring (/local/cdk/lib/backend-stack.ts:1616:7)
	\_ BackendStack.setupCloudwatchDashboards (/local/cdk/lib/backend-stack.ts:1436:10)
	\_ BackendStack.setupCloudwatch (/local/cdk/lib/backend-stack.ts:1380:10)
	\_ new BackendStack (/local/cdk/lib/backend-stack.ts:78:12)
	\_ StackBuilder.buildStack (/local/cdk/lib/stack-builder.ts:220:28)
	\_ StackBuilder.build (/local/cdk/lib/stack-builder.ts:18:30)
	\_ processTicksAndRejections (internal/process/task_queues.js:95:5)

@oscarwest

oscarwest commented Nov 11, 2021

We've been hitting this as well with "@aws-cdk/core": "1.126.0".
There still seems to be a bug here somewhere: @ddl-denis-parnovskiy

EDIT: Same problem with

"@serverless-stack/cli": "0.51.1",
"@serverless-stack/resources": "0.51.1",
"@aws-cdk/core": "1.132.0"

Custom::LogRetention | LambdaGETprojectsprojectNameapikeysLogRetention247393C8 Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 62f8c1aa-d164-402b-9e75-d852f9092acb)

@horsmand
Contributor

I've opened a new ticket, since many people have been reporting that the issue still seems to be happening. I believe I've made some findings regarding why it might still be causing problems: #17546
