fix: clean up IAM roles on env stack failures #2107
mergify[bot] merged 12 commits into aws:mainline
Heroic.
bvtujo
left a comment
What David said. This is amazing, and should help with a lot of our frustrating error cases. I combed through and couldn't find anything confusing, so I just have a naming nit.
    serviceLinkedRoleCreator
}

type stackExister interface {
nit: Is there a way we can reframe this? The name is a little confusing, and while it's clear that it's not a stackCreator it's not immediately clear what the interface does from the exister verb.
- type stackExister interface {
+ type stackExistChecker interface
// Exists returns true if the environment stack exists, false otherwise.
// If an error occurs for another reason than ErrStackNotFound, then returns the error.
Maybe
- // Exists returns true if the environment stack exists, false otherwise.
+ // Exists returns true if the CloudFormation stack exists, false otherwise.
  // If an error occurs for another reason than ErrStackNotFound, then returns the error.
as the aws.cloudformation pkg should be agnostic about Copilot terminology?
Yup sorry, I had this method originally in the describe pkg and while moving it here forgot to update the comment :D fixed!
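For readers following the thread, the contract described in that doc comment could be sketched roughly as below. This is a minimal illustration, not the code from this PR: the `stackDescriber` interface, the `fakeDescriber` type, the `errThrottled` value, and the shape of `ErrStackNotFound` are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStackNotFound is a hypothetical stand-in for the package's "not found" error.
type ErrStackNotFound struct{ name string }

func (e *ErrStackNotFound) Error() string {
	return fmt.Sprintf("stack named %s cannot be found", e.name)
}

// errThrottled is a hypothetical unrelated failure, used only for illustration.
var errThrottled = errors.New("throttled")

// stackDescriber is a hypothetical abstraction over the call that looks up a stack.
type stackDescriber interface {
	Describe(name string) error
}

// Exists returns true if the CloudFormation stack exists, false otherwise.
// If an error occurs for another reason than ErrStackNotFound, then returns the error.
func Exists(d stackDescriber, name string) (bool, error) {
	err := d.Describe(name)
	if err == nil {
		return true, nil
	}
	var notFound *ErrStackNotFound
	if errors.As(err, &notFound) {
		// A missing stack is an expected outcome, not a failure.
		return false, nil
	}
	return false, err
}

// fakeDescriber returns a canned error so the three outcomes can be exercised.
type fakeDescriber struct{ err error }

func (f fakeDescriber) Describe(string) error { return f.err }
```

The point of the design is that callers only ever see an error for unexpected failures, so a plain boolean check is enough to decide whether a stack is present.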
  if status.requiresCleanup() {
      // If the stack exists, but failed to create, we'll clean it up and then re-create it.
-     if err := c.Delete(stack.Name); err != nil {
+     if err := c.DeleteAndWait(stack.Name); err != nil {
}
// The stack failed to create due to an unexpected reason.
// Delete the retained roles created part of the stack.
o.tryDeletingEnvRoles(o.appName, o.name)
If we delete the role before deleting the stack (since it failed to be deleted), will this make users not authorized to manage the env stack unless they manually create the env manager role again (if env upgrade happened before)?
PH's comment makes me wonder whether it is guaranteed that if deployAndRenderEnvironment fails, then the env stack will be rolled back and deleted 🤔
Very good point! I think most of the time it will just be deleted, but sometimes the rollback could fail (see ROLLBACK_FAILED in here). Since we always try deleting the roles for a clean state when we create the environment, maybe it is OK to skip deleting the roles if the creation fails?
I don't think I am fully following, so I'll try my best to clarify but please let me know if I'm not answering properly:
- > If we delete the role before deleting the stack (since it failed to be deleted), will this make users not authorized to manage the env stack unless they manually create the env manager role again
  We don't pass a RoleARN for CloudFormation to assume when we initially create the stack. So on failure, it is safe to delete the roles that were retained since they were not used yet to mutate the CloudFormation stack.
- > whether it is guaranteed that if deployAndRenderEnvironment fails, then the env stack will be rolled back and deleted
  On failure, deployAndRenderEnvironment will indeed roll back the stack but won't delete it. Instead, the second time the customer runs copilot env init, on Create we check whether the stack requires deletion and, if so, first delete it and then re-create it:
  copilot-cli/internal/pkg/aws/cloudformation/cloudformation.go, lines 48 to 65 in 1284849
- > it is ok to skip deleting the roles if the creation fails
  This would be true only if the stack got deleted when the creation fails. Instead, I believe it goes into a ROLLBACK_COMPLETE state. So if the user just re-runs copilot env init and we don't clean up the IAM roles on failed create, then we will:
  - first check if the stack exists ✅, so skip deleting the roles
  - The stack tries to re-create the IAM roles but fails because the CFNExecRole already exists ❌
Does this help?
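The create path described in this comment (check the stack's status, delete and wait if a previous creation failed, then re-create) could be sketched along these lines. This is only an illustration under stated assumptions: the `stackClient` interface, `createStack`, `errAlreadyExists`, and `fakeClient` are hypothetical names, and treating ROLLBACK_COMPLETE and ROLLBACK_FAILED as the statuses that require cleanup is an assumption based on the states mentioned in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel for a healthy, pre-existing stack.
var errAlreadyExists = errors.New("stack already exists")

// requiresCleanup reports whether a stack is left over from a failed creation.
// Assumption: ROLLBACK_COMPLETE and ROLLBACK_FAILED are the statuses that need cleanup.
func requiresCleanup(status string) bool {
	return status == "ROLLBACK_COMPLETE" || status == "ROLLBACK_FAILED"
}

// stackClient is a hypothetical, minimal view of the CloudFormation wrapper.
type stackClient interface {
	Status(name string) (status string, exists bool, err error)
	DeleteAndWait(name string) error
	Create(name string) error
}

// createStack deletes a previously failed stack, waits for the deletion to
// finish, and only then re-creates it.
func createStack(c stackClient, name string) error {
	status, exists, err := c.Status(name)
	if err != nil {
		return fmt.Errorf("describe stack %s: %w", name, err)
	}
	if exists {
		if !requiresCleanup(status) {
			return errAlreadyExists
		}
		if err := c.DeleteAndWait(name); err != nil {
			return fmt.Errorf("clean up previously failed stack %s: %w", name, err)
		}
	}
	return c.Create(name)
}

// fakeClient records the mutating calls made against it, for illustration only.
type fakeClient struct {
	status string
	exists bool
	calls  []string
}

func (f *fakeClient) Status(string) (string, bool, error) {
	return f.status, f.exists, nil
}

func (f *fakeClient) DeleteAndWait(string) error {
	f.calls = append(f.calls, "delete")
	return nil
}

func (f *fakeClient) Create(string) error {
	f.calls = append(f.calls, "create")
	return nil
}
```

Waiting for the deletion to complete before calling create matters because a create request issued while the old stack is still being deleted would collide with it.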
Ah yes. Sorry, I was stupid: I wasn't aware that this is creating an environment, so env upgrade can't have happened before.
I'm totally OK with the current implementation, but
> The stack tries to re-create the IAM roles but fails because the CFNExecRole already exists ❌
env init will try to delete the existing IAM roles before deploying the stack, right?
Yesss thanks for the explanation! It makes sense. I have no problem with the current implementation either but same question as Penghao's just to make sure my understanding aligns with the implementation.
> env init will try to delete the existing IAM roles before deploying the stack right?
Yes that's right! We delete the roles iff:
- There is no environment cloudformation stack
- But there are IAM roles
This scenario means that env delete previously failed to delete the EnvManagerRole, and it should be cleaned up before trying to re-create it. I updated the comment in the cleanUpDanglingRoles method; hopefully that explains this scenario a bit better.
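The condition above (delete the roles only when there is no stack) could be sketched as follows. This is a rough illustration rather than the PR's actual code: the `roleClient` interface, the `fakeRoles` type, the function signatures, and the role naming scheme in `envRoleNames` are assumptions for the example. The "but there are IAM roles" half of the condition is covered implicitly here, since a best-effort delete of a role that does not exist simply fails and is ignored.

```go
package main

import "fmt"

// roleClient is a hypothetical, minimal view of an IAM wrapper.
type roleClient interface {
	DeleteRole(name string) error
}

// envRoleNames lists the roles retained by the environment stack.
// Assumption: this naming scheme is illustrative only.
func envRoleNames(app, env string) []string {
	return []string{
		fmt.Sprintf("%s-%s-CFNExecutionRole", app, env),
		fmt.Sprintf("%s-%s-EnvManagerRole", app, env),
	}
}

// tryDeletingEnvRoles deletes the roles on a best-effort basis: a failure on
// one role does not stop the loop, and errors are collected rather than fatal.
func tryDeletingEnvRoles(c roleClient, app, env string) []error {
	var errs []error
	for _, name := range envRoleNames(app, env) {
		if err := c.DeleteRole(name); err != nil {
			errs = append(errs, fmt.Errorf("delete role %s: %w", name, err))
		}
	}
	return errs
}

// cleanUpDanglingRoles removes leftover roles only when no stack exists,
// so roles that a live stack still depends on are never touched.
func cleanUpDanglingRoles(c roleClient, stackExists bool, app, env string) {
	if stackExists {
		return
	}
	_ = tryDeletingEnvRoles(c, app, env) // best effort: errors are ignored
}

// fakeRoles records deleted role names, for illustration only.
type fakeRoles struct{ deleted []string }

func (f *fakeRoles) DeleteRole(name string) error {
	f.deleted = append(f.deleted, name)
	return nil
}
```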
Lou1415926
left a comment
Awesome! Nothing much to add except for a tiny nit on a comment and a question.
Fixes #2100

1. We now wait for a failed stack to be deleted first before attempting to re-create it.
2. Best-effort attempt to delete IAM roles on `env delete`; this removes intermittent role deletion errors in the CLI.
3. If a user deletes an environment, and then re-creates it, `env init` now ensures the previous dangling IAM roles are deleted.
4. Similarly, if the env stack creation fails, we delete any retained IAM roles created as part of the stack.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.