Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly destroy kube-apiserver-sni resources before the kube-apiserver-service is destroyed. #4068

Conversation

ScheererJ
Copy link
Contributor

@ScheererJ ScheererJ commented May 17, 2021

In case the envoy filter, which is part of kube-apiserver-sni, is not destroyed before the kube-apiserver-service the envoy configuration might become inconsistent. An inconsistent envoy configuration might affect other clusters as no new configuration changes can be applied. The resolution is to destroy the sni resources earlier.

How to categorize this PR?

/area networking
/kind bug

What this PR does / why we need it:
Explicitly destroys the kube-apiserver-sni resources during the shoot cluster deletion to prevent other cluster configuration changes to be affected by a race during the deletion.

Which issue(s) this PR fixes:
None.

Special notes for your reviewer:

Release note:

Fixed a race in shoot cluster deletion, which could affect other clusters as the envoy filter (as part of the kube-apiserver-sni) was not deleted before the kube-apiserver-service. This is now explicitly ensured.

@ScheererJ ScheererJ requested a review from a team as a code owner May 17, 2021 12:21
@gardener-robot gardener-robot added area/networking Networking related kind/bug Bug labels May 17, 2021
@gardener-robot
Copy link

@ScheererJ Thank you for your contribution.

@gardener-robot gardener-robot added needs/review size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 17, 2021
@ialidzhikov
Copy link
Member

@ScheererJ , can you give more details about the issue and how to reproduce it?

/assign @mvladev

@mvladev
Copy link

mvladev commented May 17, 2021

@ScheererJ , can you give more details about the issue and how to reproduce it?

I think the problem is that the Service for the kube-apiserver and other components is removed before the EnvoyFilters are removed. The EnvoyFilters reference the service (cluster in envoy).

This causes problems for Istio / Envoy.

@ScheererJ
Copy link
Contributor Author

@ScheererJ , can you give more details about the issue and how to reproduce it?

/assign @mvladev

@ialidzhikov we had a cluster two weeks ago were deletion was stuck. The remaining envoy filter caused others cluster's envoy configurations to not be applied anymore as the envoy configuration was inconsistent. We did not try to reproduce it, though.

@ScheererJ ScheererJ force-pushed the bugfix/explicitly-destroy-kube-apiserver-sni-on-shoot-deletion branch from b6e43ca to f4e56f2 Compare May 17, 2021 13:18
@ialidzhikov
Copy link
Member

ialidzhikov commented May 17, 2021

@ialidzhikov we had a cluster two weeks ago were deletion was stuck. The remaining envoy filter caused others cluster's envoy configurations to not be applied anymore as the envoy configuration was inconsistent. We did not try to reproduce it, though.

Could you please add appropriate release note of type bugfix describing the issue this PR is fixing?

@ScheererJ
Copy link
Contributor Author

@ialidzhikov we had a cluster two weeks ago were deletion was stuck. The remaining envoy filter caused others cluster's envoy configurations to not be applied anymore as the envoy configuration was inconsistent. We did not try to reproduce it, though.

Could you please add appropriate release note of type bugfix describing the issue this PR is fixing?

@ialidzhikov I added a release note. Any feedback?

@rfranzke
Copy link
Member

@ScheererJ Still, why is the destruction task of the extensions.gardener.cloud/v1alpha1.ControlPlane resource depending on the SNI deletion?

@ScheererJ
Copy link
Contributor Author

@ScheererJ Still, why is the destruction task of the extensions.gardener.cloud/v1alpha1.ControlPlane resource depending on the SNI deletion?

@rfranzke In discussion with @mvladev, he suggested that kube-apiserver-sni should be deleted before kube-apiserver-service to reverse the creation order. From his perspective, kube-apiserver-service should be deleted as part of the controlplane deletion. This is why I added the dependency.

@ialidzhikov I am still trying to figure out where kube-apiserver-service is actually being deleted. It does not seem completely obvious.

@rfranzke
Copy link
Member

I don't think we explicitly delete the kube-apiserver service yet. You'd need to introduce this. However, the destruction task of the extensions.gardener.cloud/v1alpha1.ControlPlane resource should be completely independent/unrelated.

@ScheererJ
Copy link
Contributor Author

I don't think we explicitly delete the kube-apiserver service yet. You'd need to introduce this. However, the destruction task of the extensions.gardener.cloud/v1alpha1.ControlPlane resource should be completely independent/unrelated.

@mvladev How should we proceed then? Should I also introduce the explicit destruction of the kube-apiserver-service or is it sufficient just explicitly delete the kube-apiserver-sni late in the process, but without the dependency to the controlplane?
@rfranzke / @ialidzhikov Your feedback is appreciated as well.

@rfranzke
Copy link
Member

I don't know if deletion of the kube-apiserver service would be necessary, but it for sure doesn't harm. Hence, my recommendation: Introduce two new tasks which are both dependent on the deleteKubeAPIServer task: One for deleting the SNI resources, one for deleting the service.

@ScheererJ
Copy link
Contributor Author

Ok, but so that we are on the same page: The task for the kube-apiserver-service would depend on the kube-apiserver-sni task as we want to ensure that the envoy filter (included in the sni resources) is deleted before the service.

@rfranzke
Copy link
Member

OK, sure :) I guess @ialidzhikov were just confused how the other task (deletion of ControlPlane) is related to the SNI/service.

@ialidzhikov
Copy link
Member

ialidzhikov commented May 18, 2021

Hence, my recommendation: Introduce two new tasks which are both dependent on the deleteKubeAPIServer task: One for deleting the SNI resources, one for deleting the service.

This sounds good to me.

Going one step back, as we now know that the kube-apiserver Service is not deleted explicitly, the following statement from the release note does not sound good:

[...] as the envoy filter (as part of the kube-apiserver-sni) was not deleted before the kube-apiserver-service.

We know that both envoy filter and kube-apiserver are deleted as part of the Namespace deletion at the end of the Shoot deletion flow. So what is the real issue? Envoy filter is not deleted when the apiserver Deployment is deleted and this is causing issues? So we now need to delete envoy filter when we delete the apiserver Deployment?

@ScheererJ
Copy link
Contributor Author

The envoy filter is not located in the shoot namespace on the seed, but in the istio-ingress namespace. Therefore, there is a race.
I changed the order as discussed.

@ScheererJ ScheererJ force-pushed the bugfix/explicitly-destroy-kube-apiserver-sni-on-shoot-deletion branch from f4e56f2 to 910c8a8 Compare May 19, 2021 09:14
@ialidzhikov
Copy link
Member

The envoy filter is not located in the shoot namespace on the seed, but in the istio-ingress namespace. Therefore, there is a race.

You are right, I missed that the envoyfilter is in the istio-ingress ns. I see that the envoyfilter has ownerReference to the Shoot namespace in the Seed and that's how it is deleted.

Copy link
Member

@ialidzhikov ialidzhikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@rfranzke rfranzke merged commit 7c4388e into gardener:master May 20, 2021
krgostev pushed a commit to krgostev/gardener that referenced this pull request Apr 21, 2022
krgostev pushed a commit to krgostev/gardener that referenced this pull request Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking Networking related kind/bug Bug size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants