Skip to content

fix: clean up IAM roles on env stack failures#2107

Merged
mergify[bot] merged 12 commits intoaws:mainlinefrom
efekarakus:issues/fix-2100
Mar 31, 2021
Merged

fix: clean up IAM roles on env stack failures#2107
mergify[bot] merged 12 commits intoaws:mainlinefrom
efekarakus:issues/fix-2100

Conversation

@efekarakus
Copy link
Copy Markdown
Contributor

Fixes #2100

  1. We now wait for a failed stack to be deleted first before attempting to re-create it.
  2. Best-effort attempt to delete IAM roles on env delete this removes intermittent role deletion errors in the CLI.
  3. If a user deletes an environment, and then re-creates it, env init now ensures the previous dangling IAM roles are deleted.
  4. Similarly, if the env stack creation fails, we delete any retained IAM roles created part of the stack.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@efekarakus efekarakus requested a review from a team as a code owner March 25, 2021 23:54
@kohidave
Copy link
Copy Markdown
Contributor

Heroic.

Copy link
Copy Markdown
Contributor

@bvtujo bvtujo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What David said. This is amazing, and should help with a lot of our frustrating error cases. I combed through and couldn't find anything confusing; so I just have a naming nit.

Comment thread internal/pkg/cli/interfaces.go Outdated
serviceLinkedRoleCreator
}

type stackExister interface {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a way we can reframe this? The name is a little confusing, and while it's clear that it's not a stackCreator it's not immediately clear what the interface does from the exister verb.

Suggested change
type stackExister interface {
type stackExistChecker interface

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment on lines +190 to +191
// Exists returns true if the environment stack exists, false otherwise.
// If an error occurs for another reason than ErrStackNotFound, then returns the error.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

Suggested change
// Exists returns true if the environment stack exists, false otherwise.
// If an error occurs for another reason than ErrStackNotFound, then returns the error.
// Exists returns true if the CloudFormation stack exists, false otherwise.
// If an error occurs for another reason than ErrStackNotFound, then returns the error.

as aws.cloudformation pkg should be agnostic about Copilot terminology?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup sorry, I had this method originally in the describe pkg and while moving it here forgot to update the comment :D fixed!

if status.requiresCleanup() {
// If the stack exists, but failed to create, we'll clean it up and then re-create it.
if err := c.Delete(stack.Name); err != nil {
if err := c.DeleteAndWait(stack.Name); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

}
// The stack failed to create due to an unexpect reason.
// Delete the retained roles created part of the stack.
o.tryDeletingEnvRoles(o.appName, o.name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we delete the role before deleting the stack (since it failed to be deleted), will this make users not authorized to manage the env stack unless they manually create the env manager role again (if env upgrade happened before)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PH's comment makes me wonder whether it is guaranteed that if deployAndRenderEnvironment fails, then the env stack will be rolled back and delete 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point! I think most of the time it will just be deleted but sometimes the rollback could fail (see ROLLBACK_FAILED in here). It is just since we always try deleting the roles for a clean state when we create the environment so maybe it is ok to skip deleting the roles if the creation fails?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I am fully following, so I'll try my best to clarify but please let me know if I'm not answering properly:

  1. If we delete the role before deleting the stack (since it failed to be deleted), will this make users not authorized to manage the env stack unless they manually create the env manager role again

We don't pass a RoleARN for CloudFormation to assume when we initially create the stack. So on failure, it is safe to delete the roles that were retained since they were not used yet to mutate the cloudformation stack.

  1. whether it is guaranteed that if deployAndRenderEnvironment fails, then the env stack will be rolled back and delete

On failure, deployAndRenderEnvironment will indeed rollback back the stack but won't delete it. Instead, the second time that the customer runs copilot env init then on Create we will check if the stack requires deletion and if so first delete it and then re-create it:

func (c *CloudFormation) Create(stack *Stack) (changeSetID string, err error) {
descr, err := c.Describe(stack.Name)
if err != nil {
var stackNotFound *ErrStackNotFound
if !errors.As(err, &stackNotFound) {
return "", err
}
// If the stack does not exist, create it.
return c.create(stack)
}
status := StackStatus(aws.StringValue(descr.StackStatus))
if status.requiresCleanup() {
// If the stack exists, but failed to create, we'll clean it up and then re-create it.
if err := c.Delete(stack.Name); err != nil {
return "", fmt.Errorf("clean up previously failed stack %s: %w", stack.Name, err)
}
return c.create(stack)
}

  1. it is ok to skip deleting the roles if the creation fails

This would be true, if only the stack got deleted if the creation fails. Instead it goes into a ROLLBACK_COMPLETE state I believe. So if the user just re-runs copilot env init and we don't clean up the IAM roles on failed create, then we will:

  1. first check if the stack exists ✅, so skip
  2. The stack tries to re-create the IAM but fails because the CFNExecRole already exists ❌

Does this help?

Copy link
Copy Markdown
Contributor

@iamhopaul123 iamhopaul123 Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Sorry I was stupid I wasn't aware it is creating an environment so env upgrade can't happen before.

Im totally ok with the current implementation but

The stack tries to re-create the IAM but fails because the CFNExecRole already exists ❌

env init will try to delete the existing IAM roles before deploying the stack right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yesss thanks for the explanation! It makes sense. I have no problem with the current implementation either but same question as Penghao's just to make sure my understanding aligns with the implementation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

env init will try to delete the existing IAM roles before deploying the stack right?

Yes that's right! We delete the roles iff:

  1. There is no environment cloudformation stack
  2. But there are IAM roles

This scenario means that env delete previously failed to delete the EnvManagerRole and it should be cleaned up before trying to re-create it. I updated the comment in the cleanUpDanglingRoles method, hopefully that explains a bit better this scenario.

Copy link
Copy Markdown
Contributor

@Lou1415926 Lou1415926 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Nothing much to add except for a tiny nit on comment and a quesiton.

Comment thread internal/pkg/aws/cloudformation/cloudformation.go Outdated
}
// The stack failed to create due to an unexpect reason.
// Delete the retained roles created part of the stack.
o.tryDeletingEnvRoles(o.appName, o.name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PH's comment makes me wonder whether it is guaranteed that if deployAndRenderEnvironment fails, then the env stack will be rolled back and delete 🤔

@mergify mergify Bot merged commit f75d750 into aws:mainline Mar 31, 2021
thrau pushed a commit to localstack/copilot-cli-local that referenced this pull request Dec 9, 2022
Fixes aws#2100 

1. We now wait for a failed stack to be deleted first before attempting to re-create it.
2. Best-effort attempt to delete IAM roles on `env delete` this removes intermittent role deletion errors in the CLI.
3. If a user deletes an environment, and then re-creates it, `env init` now ensures the previous dangling IAM roles are deleted.
4. Similarly, if the env stack creation fails, we delete any retained IAM roles created part of the stack.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stack Cleanup for Environments fails

6 participants