Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Activate explicit LB/SG deletion again and improve code to reduce API calls #295

Merged
merged 4 commits into from
Mar 18, 2021

Conversation

rfranzke
Copy link
Member

@rfranzke rfranzke commented Mar 17, 2021

How to categorize this PR?

/area scalability robustness ops-productivity
/kind enhancement
/priority 3
/platform aws

What this PR does / why we need it:
With #290 we had disabled the explicit deletion for clusters >= 1.16. However, we see some occasional issues with leaked LB/SG, so let's reactivate it again and improve the code to reduce the number of API calls.

Special notes for your reviewer:
/assign @ialidzhikov
/invite @ialidzhikov
/squash

Earlier, we listed all LBs and executed a DescribeTags call for each load balancer. Now, we first list LB and execute only one DescribeTags call with all the found LB names. This should significantly improve the overall number of API calls (especially, when many clusters in the same AWS account are deleted in parallel).

Release note:

The load balancers and security groups are again explicitly deleted by the AWS provider extension (independent of the Kubernetes version used by the shoot cluster). The number of API calls have been reduced to the absolute minimum.

@rfranzke rfranzke requested review from a team as code owners March 17, 2021 12:10
@gardener-robot gardener-robot added area/ops-productivity Operator productivity related (how to improve operations) area/robustness Robustness, reliability, resilience related area/scalability Scalability related kind/enhancement Enhancement, improvement, extension merge/squash Should be merged via 'Squash and merge' needs/review Needs review platform/aws Amazon web services platform/infrastructure labels Mar 17, 2021
@gardener-robot
Copy link

@rfranzke Label priority/normal does not exist.

@gardener-robot gardener-robot added size/l Size of pull request is large (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Mar 17, 2021
@rfranzke
Copy link
Member Author

/test

@testmachinery
Copy link

testmachinery bot commented Mar 17, 2021

Testrun: e2e-j2xms
Workflow: e2e-j2xms-wf
Phase: Failed

+---------------------+---------------------+--------+----------+
|        NAME         |        STEP         | PHASE  | DURATION |
+---------------------+---------------------+--------+----------+
| infrastructure-test | infrastructure-test | Failed | 25m27s   |
+---------------------+---------------------+--------+----------+

@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 17, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Mar 17, 2021
@rfranzke
Copy link
Member Author

/test

@testmachinery
Copy link

testmachinery bot commented Mar 18, 2021

Testrun: e2e-5wpxs
Workflow: e2e-5wpxs-wf
Phase: Succeeded

+---------------------+---------------------+-----------+----------+
|        NAME         |        STEP         |   PHASE   | DURATION |
+---------------------+---------------------+-----------+----------+
| infrastructure-test | infrastructure-test | Succeeded | 15m7s    |
+---------------------+---------------------+-----------+----------+

@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 18, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 18, 2021
Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played with it locally and it looks good to me. Thanks once again for this PR!
Only one minor suggestion inline.

pkg/controller/infrastructure/actuator_delete.go Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added the priority/3 Priority (lower number equals higher priority) label Mar 18, 2021
@rfranzke
Copy link
Member Author

/test

@testmachinery
Copy link

testmachinery bot commented Mar 18, 2021

Testrun: e2e-b9d5g
Workflow: e2e-b9d5g-wf
Phase: Succeeded

+---------------------+---------------------+-----------+----------+
|        NAME         |        STEP         |   PHASE   | DURATION |
+---------------------+---------------------+-----------+----------+
| infrastructure-test | infrastructure-test | Succeeded | 13m59s   |
+---------------------+---------------------+-----------+----------+

@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 18, 2021
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 18, 2021
Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review needs/second-opinion Needs second review by someone else labels Mar 18, 2021
@ialidzhikov ialidzhikov merged commit 26ce3c2 into gardener:master Mar 18, 2021
@rfranzke rfranzke deleted the enh/explicit-lb-sg-deletion branch March 18, 2021 14:09
@vlerenc
Copy link
Member

vlerenc commented Apr 1, 2021

we see some occasional issues with leaked LB/SG

Where are those coming from?

In the end, it doesn't matter too much, I guess, because we need this code for the cluster deletion (w/o a control plane) anyway, but out of curiosity: what's happening there, that Kubernetes (?) leaks IaaS resources? Do we know?

@mvladev
Copy link

mvladev commented Apr 1, 2021

In the end, it doesn't matter too much, I guess, because we need this code for the cluster deletion (w/o a control plane) anyway, but out of curiosity: what's happening there, that Kubernetes (?) leaks IaaS resources? Do we know?

We delete all Services when cluster is in deletion state, but end-users have automation that re-creates them. So after we finish with the delete operations and mark the cluster as free of any resources, the end-users manage to sneak in a Service create call. cloud-controller-manager creates this LB and is scaled down. We then proceed to delete the infrastructure. The VPC / subnet deletion fails due to dependencies.

@ialidzhikov
Copy link
Member

In the end, it doesn't matter too much, I guess, because we need this code for the cluster deletion (w/o a control plane) anyway, but out of curiosity: what's happening there, that Kubernetes (?) leaks IaaS resources? Do we know?

We delete all Services when cluster is in deletion state, but end-users have automation that re-creates them. So after we finish with the delete operations and mark the cluster as free of any resources, the end-users manage to sneak in a Service create call. cloud-controller-manager creates this LB and is scaled down. We then proceed to delete the infrastructure. The VPC / subnet deletion fails due to dependencies.

@mvladev , please don't mix issues and cases. We were observing the leaked SGs and LBs mostly for our vpn-shoot LB and nginx-ingress LB. The issue you talk about is a different one.
@vlerenc , I had in my list to open a follow issue to have an investigation. I will do it later..

@ialidzhikov
Copy link
Member

ialidzhikov commented Apr 1, 2021

Ref #308

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ops-productivity Operator productivity related (how to improve operations) area/robustness Robustness, reliability, resilience related area/scalability Scalability related kind/enhancement Enhancement, improvement, extension kind/test Test merge/squash Should be merged via 'Squash and merge' needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) platform/aws Amazon web services platform/infrastructure priority/3 Priority (lower number equals higher priority) reviewed/lgtm Has approval for merging size/l Size of pull request is large (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants