(stepfunctions): CDK generated stepfunction roles breaking inflight stepfunction executions with versioned lambdas #17515

Open
nsaman opened this issue Nov 15, 2021 · 11 comments
Labels: @aws-cdk/aws-stepfunctions (Related to AWS StepFunctions), bug (This issue is a bug.), effort/small (Small work item – less than a day of effort), p2

Comments

@nsaman

nsaman commented Nov 15, 2021

What is the problem?

We let CDK auto-generate the Step Functions state machine role, and the state machine invokes versioned Lambdas. On deployment, the role is updated to reference the new Lambda version. This causes lambda:InvokeFunction authorization failures in in-flight executions: their execution definition still references the previous Lambda version, but the role now only allows the newer version.

Is there a way to have the auto-generated Step Functions roles not include the Lambda version?

Reproduction Steps

Create a Step Functions state machine that invokes a Lambda version. The generated role will contain that Lambda version.

What did you expect to happen?

In-flight Step Functions executions should not fail during a deployment.

What actually happened?

Step Functions lambda:InvokeFunction errors on mismatched Lambda versions:
Error

Lambda.AWSLambdaException

Cause

User: arn:aws:sts::335321747591:assumed-role/TidewaterWorkflowsCreateJ-CreateJournalStateMachin-184QJ29APKE3O/VAqgLpXDrcGwUULKzfuDBGJmuwiKLfzI is not authorized to perform: lambda:InvokeFunction on resource: arn:aws:lambda:us-west-2:335321747591:function:LogResources:28 because no identity-based policy allows the lambda:InvokeFunction action (Service: AWSLambda; Status Code: 403; Error Code: AccessDeniedException; Request ID: 6ccb7c61-369f-4826-9fc6-113954ec38c8; Proxy: null)

CDK CLI Version

1.130.0 (build 9c094ae)

Framework Version

No response

Node.js Version

12

OS

macos 10.15.7

Language

Typescript

Language Version

No response

Other information

No response

@nsaman nsaman added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 15, 2021
@github-actions github-actions bot added the @aws-cdk/aws-stepfunctions Related to AWS StepFunctions label Nov 15, 2021
@peterwoodworth peterwoodworth added the needs-reproduction This issue needs reproduction. label Nov 15, 2021
@peterwoodworth
Contributor

Hey @nsaman, how exactly are you going about this in your code?

Are you making use of the LambdaInvoke construct?

@nsaman
Author

nsaman commented Nov 15, 2021

Yes, we are creating a LambdaInvoke on lambdaConstruct.currentVersion.functionArn

@peterwoodworth
Contributor

What exactly do you mean by this, @nsaman? A snippet of the relevant parts of your code would be helpful.

@kaizencc kaizencc added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 20, 2021
@github-actions

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Nov 20, 2021
@mohitpali

So LambdaInvoke generates policies by Lambda function ARN. The function ARN is versioned, and we want the specific versions to be executed. Suppose our step function is long running and points to Lambda version 1. The role has permission to invoke v1. A CDK deployment updates the Lambda version to 2, and the Step Function role is updated to allow invoking v2.
When the currently running Step Function execution then invokes Lambda v1, it fails because the permissions were updated to v2.

There could be multiple solutions. You could add a function addTaskPolicy to update the read-only property taskPolicy in LambdaInvoke.
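The failure mode described in this comment can be sketched without any AWS dependencies. This is only an illustrative model, not how IAM actually evaluates policies: `resourceMatches` and `canInvoke` are hypothetical helpers, the ARN uses a placeholder account ID, and only the `Resource` check with `*` wildcards is modelled.

```typescript
// Minimal IAM-style resource matcher: '*' matches any run of characters.
// Real IAM evaluation is far richer; this only models the Resource check.
function resourceMatches(pattern: string, arn: string): boolean {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`).test(arn);
}

// True if any resource entry in the (simplified) policy covers the target ARN.
function canInvoke(policyResources: string[], targetArn: string): boolean {
  return policyResources.some((pattern) => resourceMatches(pattern, targetArn));
}

// Hypothetical function ARN, for illustration only.
const fnArn = "arn:aws:lambda:us-west-2:111122223333:function:LogResources";

// Before the deployment, the generated role lists version 1.
const roleBeforeDeploy = [`${fnArn}:1`];
// After the deployment, the generated role lists only version 2.
const roleAfterDeploy = [`${fnArn}:2`];

// An execution that started before the deployment still targets version 1.
const inFlightTarget = `${fnArn}:1`;

const allowedBefore = canInvoke(roleBeforeDeploy, inFlightTarget);
const allowedAfter = canInvoke(roleAfterDeploy, inFlightTarget);

console.log({ allowedBefore, allowedAfter });
```

In this model the in-flight execution is allowed before the deployment and denied afterwards, which corresponds to the AccessDeniedException in the original report.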

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Nov 24, 2021
@kaizencc kaizencc changed the title CDK generated stepfunction roles breaking inflight stepfunction executions with versioned lambdas (stepfunctions): CDK generated stepfunction roles breaking inflight stepfunction executions with versioned lambdas Dec 17, 2021
@kaizencc kaizencc added effort/small Small work item – less than a day of effort and removed needs-reproduction This issue needs reproduction. needs-triage This issue or PR still needs to be triaged. labels Dec 17, 2021
@kaizencc kaizencc removed their assignment Dec 17, 2021
@rix0rrr
Contributor

rix0rrr commented Feb 24, 2022

The solution for this will be to generate a policy that looks like:

Resource: [
  'arn:aws:lambda:....:MyFunction',
  'arn:aws:lambda:....:MyFunction:*',
]

It will go well with a change Lambda is about to make where invocations that involve Qualifiers also need to have the qualified ARN in the policy.
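As a rough sketch of the policy shape proposed above, the resource list could be derived from either a qualified or an unqualified function ARN. The helper names `unqualifiedArn` and `invokeResources` are hypothetical and are not part of the CDK API; the ARN layout assumed here is the standard `arn:aws:lambda:region:account:function:name[:qualifier]` form.

```typescript
// Strip an optional version/alias qualifier from a Lambda function ARN.
// Unqualified: arn:aws:lambda:region:account:function:name            (7 parts)
// Qualified:   arn:aws:lambda:region:account:function:name:qualifier  (8 parts)
function unqualifiedArn(functionArn: string): string {
  const parts = functionArn.split(":");
  if (parts.length < 7 || parts.length > 8 || parts[5] !== "function") {
    throw new Error(`Not a Lambda function ARN: ${functionArn}`);
  }
  return parts.slice(0, 7).join(":");
}

// Build the Resource list from the proposal: the bare function ARN plus
// a wildcard entry that covers every version and alias.
function invokeResources(functionArn: string): string[] {
  const base = unqualifiedArn(functionArn);
  return [base, `${base}:*`];
}

// Hypothetical ARN with a placeholder account ID.
const resources = invokeResources(
  "arn:aws:lambda:us-west-2:111122223333:function:MyFunction:28",
);
console.log(resources);
```

Because the qualifier is dropped before the list is built, the resulting statement stays the same across deployments, so in-flight executions pinned to an older version would keep their permission. The trade-off is that the role can invoke every version and alias of the function.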

@mrgrain
Contributor

mrgrain commented Aug 8, 2022

Hi @nsaman and @mohitpali, apologies for the long delay on this. We've been looking at this recently and came to the conclusion that just providing additional permissions is not the right approach. When using versions in this scenario, there are quite a few things to consider, and none of them work automatically. The tl;dr is that when a new version is created, the previous version ceases to be managed by CDK/CFN. Deletion can be avoided by setting removal policies. But permissions would either have to be widely scoped (insecure) or maintained by hand (annoying). That said, the permissions part is currently not easy to do.

Now to my actual question: The idiomatic way to do this in AWS is to use an Alias. Permissions are granted to the StepFunction to invoke the alias, and when a new version is published, the Alias gets updated. The StepFunction will always run the latest version and have the correct permissions.

Are there any reasons Aliases would not work in your scenario?

PS: We are still considering addTaskPolicy and other options to open up the generated policies.

@mrgrain
Contributor

mrgrain commented Aug 9, 2022

Downgrading this to a p2. To provide access to all versions of a Lambda, one can do:

declare stepLambda: lambda.Function;
declare sfn: stepfunctions.StateMachine;

stepLambda.grantInvoke(sfn);

This is very idiomatic. For tightly scoped permissions, Lambda Alias should be used.

@mrgrain mrgrain added p2 and removed p1 labels Aug 9, 2022
@mrgrain mrgrain removed their assignment Aug 12, 2022
@hoegertn
Contributor

hoegertn commented Apr 1, 2023

As far as I can see, the idea behind this (and the reason aliases do not work here, IMHO) is to make sure that an SFN execution that started with a reference to Lambda v1 will always use that code version, and not a newer version of the Lambda function that might be used by executions started later.

The reasoning is that the code might have a breaking change that does not work with the inputs from a previous step function definition.

@gerritmaritz

gerritmaritz commented Jun 16, 2023

Agreed, Aliases seem like the right solution at first glance from a permissions perspective but play havoc in the long term with your Step Function executions. Using a Lambda alias, a Step Function that is repeatedly invoking a Lambda will be shifted from running version X to version Y, causing all kinds of issues with backwards compatibility. This is unfortunately not documented well in the Step Function documentation but a common occurrence.

The way to get around it is to always have Step Function definitions point to a specific version of a function so that they keep on invoking that until the Step Function is completed. The problem with this is that the generated policies hard code the version.
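The difference between the two approaches can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not real Lambda or Step Functions behaviour: `AliasModel`, `Target`, and `runWithDeployBetweenSteps` are hypothetical names, and a "deployment" is reduced to repointing an alias between two steps of one execution.

```typescript
type Version = number;

// Hypothetical in-memory model of a Lambda alias: a name pointing at one version.
class AliasModel {
  constructor(private target: Version) {}
  resolve(): Version {
    return this.target;
  }
  update(next: Version): void {
    this.target = next;
  }
}

// A task target is either a pinned version or an alias resolved at invoke time.
type Target =
  | { kind: "version"; version: Version }
  | { kind: "alias"; alias: AliasModel };

function resolveTarget(target: Target): Version {
  return target.kind === "version" ? target.version : target.alias.resolve();
}

// Run a two-step execution with a deployment landing between the steps.
// Returns the Lambda version that each step actually ran.
function runWithDeployBetweenSteps(
  target: Target,
  liveAlias: AliasModel,
  newVersion: Version,
): Version[] {
  const first = resolveTarget(target);
  liveAlias.update(newVersion); // the deployment repoints the alias
  const second = resolveTarget(target);
  return [first, second];
}

const aliasA = new AliasModel(1);
const viaAlias = runWithDeployBetweenSteps({ kind: "alias", alias: aliasA }, aliasA, 2);

const aliasB = new AliasModel(1);
const viaPinned = runWithDeployBetweenSteps({ kind: "version", version: 1 }, aliasB, 2);

console.log({ viaAlias, viaPinned });
```

In this model the alias-based execution runs version 1 for its first step and version 2 for its second, while the pinned execution stays on version 1 throughout. The pinned variant is the one that needs the role to keep allowing the older version.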

I feel like there should be a broader discussion around the idiomatic ways that Step Functions recommends Lambdas should be used. As I mentioned, the documentation is silent about this as a best practice or even that this is an issue that customers need to deal with. If there is agreement on 'the best way to invoke lambdas', maybe that can be documented and incorporated into CDK.

@everett1992
Contributor

I just discovered this issue because I tried to use a Step Functions alias's deployment preference to implement traffic shaping.

const lambdaFunction = createFunction()
const lambdaInvoke = new LambdaInvoke(this, "Invoke", { lambdaFunction: lambdaFunction.currentVersion })
const stateMachine = new StateMachine(this, "StateMachine", { definitionBody: DefinitionBody.fromChainable(lambdaInvoke) })

const stateMachineVersion = new CfnStateMachineVersion(scope, "StateMachineVersion", {
  stateMachineArn: stateMachine.stateMachineArn,
  stateMachineRevisionId: stateMachine.stateMachineRevisionId,
});

const alias = new CfnStateMachineAlias(scope, "StateMachineAlias", {
  name: "active",
  deploymentPreference: {
    stateMachineVersionArn: stateMachineVersion.attrArn,
    type: "CANARY",
    interval: Duration.hours(2).toMinutes(),
    percentage: 10,
  },
});

However, during deployments the state machine's policy is automatically updated to grant invoke for the Lambda's current version, while 90% of traffic still uses the previous state machine version, which invokes the previous Lambda version.

Maybe the ideal solution is for the invocation role and policy to be versioned along with the State Machine, but that goes beyond a CDK feature request.

In my case I'm happy to allow the state machine to invoke any version of the Lambdas. lambda-arn:* or lambda-arn should work. I think this could be made to work with a property on Function, Version, or LambdaInvoke that configures whether qualified or unqualified permission is granted.

10 participants