Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(logs): log retention fails with OperationAbortedException #16083

Merged
merged 4 commits into from Sep 10, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
64 changes: 42 additions & 22 deletions packages/@aws-cdk/aws-logs/lib/log-retention-provider/index.ts
Expand Up @@ -18,14 +18,39 @@ interface SdkRetryOptions {
* @param options CloudWatch API SDK options.
*/
async function createLogGroupSafe(logGroupName: string, region?: string, options?: SdkRetryOptions) {
try { // Try to create the log group
const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
} catch (e) {
if (e.code !== 'ResourceAlreadyExistsException') {
throw e;
// If we set the log retention for a lambda, then due to the async nature of
// Lambda logging there could be a race condition when the same log group is
// already being created by the lambda execution. This can sometime result in
// an error "OperationAbortedException: A conflicting operation is currently
// in progress...Please try again."
// To avoid an error, we do as requested and try again.
let retryCount = options?.maxRetries == undefined ? 10 : options.maxRetries;
const delay = options?.retryOptions?.base == undefined ? 10 : options.retryOptions.base;
do {
try {
const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
return;
} catch (error) {
if (error.code === 'ResourceAlreadyExistsException') {
// The log group is already created by the lambda execution
return;
}
if (error.code === 'OperationAbortedException') {
if (retryCount > 0) {
retryCount--;
await new Promise(resolve => setTimeout(resolve, delay));
continue;
} else {
// The log group is still being created by another execution but we are out of retries
throw new Error('Out of attempts to create a logGroup');
}
}
// Any other error
console.error(error);
throw error;
}
}
} while (true); // exit happens on retry count check
}

/**
Expand Down Expand Up @@ -64,21 +89,16 @@ export async function handler(event: AWSLambda.CloudFormationCustomResourceEvent
await setRetentionPolicy(logGroupName, logGroupRegion, retryOptions, parseInt(event.ResourceProperties.RetentionInDays, 10));

if (event.RequestType === 'Create') {
// Set a retention policy of 1 day on the logs of this function. The log
// group for this function should already exist at this stage because we
// already logged the event but due to the async nature of Lambda logging
// there could be a race condition. So we also try to create the log group
// of this function first. If multiple LogRetention constructs are present
// in the stack, they will try to act on this function's log group at the
// same time. This can sometime result in an OperationAbortedException. To
// avoid this and because this operation is not critical we catch all errors.
ddl-denis-parnovskiy marked this conversation as resolved.
Show resolved Hide resolved
try {
const region = process.env.AWS_REGION;
await createLogGroupSafe(`/aws/lambda/${context.functionName}`, region, retryOptions);
await setRetentionPolicy(`/aws/lambda/${context.functionName}`, region, retryOptions, 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See in the old code, if CreateLogGroupSafe throws an error, setRetentionPolicy is skipped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by this?
This behaviour is the same in your updated code as well, i.e., if createLogGroupSafe() throws an error setRetentionPolicy() is skipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old code:
CreateLogGroupSafe encounters OperationAbortedException, throws exception, setRetentionPolicy is skipped, the error is eaten by catch all.
Result: retention policy for this function is not set and no error. This is a bug in the old code I'm talking about.

New code:
CreateLogGroupSafe encounters OperationAbortedException, retries, if really fails after several attempts, then the error is floated up the stack.
Result: either log group is created with limited retention, or an exception is thrown.

} catch (e) {
console.log(e);
}
// Set a retention policy of 1 day on the logs of this very function.
// Due to the async nature of the log group creation, the log group for this function might
// still be not created yet at this point. Therefore we attempt to create it.
// In case it is being created, createLogGroupSafe will handle the conflic.
const region = process.env.AWS_REGION;
await createLogGroupSafe(`/aws/lambda/${context.functionName}`, region, retryOptions);
// If createLogGroupSafe fails, the log group is not created even after multiple attempts
// In this case we have nothing to set the retention policy on but an exception will skip
// the next line.
await setRetentionPolicy(`/aws/lambda/${context.functionName}`, region, retryOptions, 1);
}
}

Expand Down
98 changes: 96 additions & 2 deletions packages/@aws-cdk/aws-logs/test/test.log-retention-provider.ts
Expand Up @@ -27,6 +27,14 @@ function createRequest(type: string) {
.reply(200);
}

class MyError extends Error {
code: string;
constructor(message: string, code: string) {
super(message);
this.code = code;
}
}

export = {
'tearDown'(callback: any) {
AWS.restore();
Expand Down Expand Up @@ -231,10 +239,60 @@ export = {
test.done();
},

async 'does not fail when operations on provider log group fail'(test: Test) {
async 'does not if when operations on provider log group fails'(test: Test) {
let attempt = 2;
const createLogGroupFake = (params: AWSSDK.CloudWatchLogs.CreateLogGroupRequest) => {
if (params.logGroupName === '/aws/lambda/provider') {
return Promise.reject(new Error('OperationAbortedException'));
if (attempt > 0) {
attempt--;
return Promise.reject(new MyError(
'A conflicting operation is currently in progress against this resource. Please try again.',
'OperationAbortedException'));
} else {
return Promise.resolve({});
}
}
return Promise.resolve({});
};

const putRetentionPolicyFake = sinon.fake.resolves({});
const deleteRetentionPolicyFake = sinon.fake.resolves({});

AWS.mock('CloudWatchLogs', 'createLogGroup', createLogGroupFake);
AWS.mock('CloudWatchLogs', 'putRetentionPolicy', putRetentionPolicyFake);
AWS.mock('CloudWatchLogs', 'deleteRetentionPolicy', deleteRetentionPolicyFake);

const event = {
...eventCommon,
RequestType: 'Create',
ResourceProperties: {
ServiceToken: 'token',
RetentionInDays: '30',
LogGroupName: 'group',
},
};

const request = createRequest('SUCCESS');

await provider.handler(event as AWSLambda.CloudFormationCustomResourceCreateEvent, context);

test.equal(request.isDone(), true);

test.done();
},

async 'does not fail if operations on CDK lambda log group fails twice'(test: Test) {
let attempt = 2;
const createLogGroupFake = (params: AWSSDK.CloudWatchLogs.CreateLogGroupRequest) => {
if (params.logGroupName === 'group') {
if (attempt > 0) {
attempt--;
return Promise.reject(new MyError(
'A conflicting operation is currently in progress against this resource. Please try again.',
'OperationAbortedException'));
} else {
return Promise.resolve({});
}
}
return Promise.resolve({});
};
Expand Down Expand Up @@ -265,6 +323,42 @@ export = {
test.done();
},

async 'does fail if operations on CDK lambda log group fails indefinitely'(test: Test) {
const createLogGroupFake = (params: AWSSDK.CloudWatchLogs.CreateLogGroupRequest) => {
if (params.logGroupName === 'group') {
return Promise.reject(new MyError(
'A conflicting operation is currently in progress against this resource. Please try again.',
'OperationAbortedException'));
}
return Promise.resolve({});
};

const putRetentionPolicyFake = sinon.fake.resolves({});
const deleteRetentionPolicyFake = sinon.fake.resolves({});

AWS.mock('CloudWatchLogs', 'createLogGroup', createLogGroupFake);
AWS.mock('CloudWatchLogs', 'putRetentionPolicy', putRetentionPolicyFake);
AWS.mock('CloudWatchLogs', 'deleteRetentionPolicy', deleteRetentionPolicyFake);

const event = {
...eventCommon,
RequestType: 'Create',
ResourceProperties: {
ServiceToken: 'token',
RetentionInDays: '30',
LogGroupName: 'group',
},
};

const request = createRequest('FAILED');

await provider.handler(event as AWSLambda.CloudFormationCustomResourceCreateEvent, context);

test.equal(request.isDone(), true);

test.done();
},

async 'response data contains the log group name'(test: Test) {
AWS.mock('CloudWatchLogs', 'createLogGroup', sinon.fake.resolves({}));
AWS.mock('CloudWatchLogs', 'putRetentionPolicy', sinon.fake.resolves({}));
Expand Down