
Several bugs after migrating azure seed/shoot to v1.21 #328

Closed
prashanth26 opened this issue Jun 29, 2021 · 16 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug platform/azure Microsoft Azure platform/infrastructure priority/3 Priority (lower number equals higher priority)

Comments

@prashanth26

prashanth26 commented Jun 29, 2021

How to categorize this issue?

/area quality
/kind bug
/priority 3
/platform azure

What happened:
After the rollout of a seed with 100 shoots from k8s v1.20.6 to v1.21.1, I see the following issues:

  1. Nodes kept getting replaced after going into a filesystem (FS) read-only state.
  2. Nodes still kept switching between healthy/unhealthy.
  3. The kube-apiserver fails with the error below. We think this is due to the secret encryption not working as expected (a rough etcdctl check is sketched after this list). cc: @amshuman-kr
E0629 03:10:34.157664       1 cacher.go:419] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: unable to transform key "/registry/secrets/default/default-token-xxx": no matching key was found for the provided AES transformer; reinitializing...
I0629 03:10:34.181550       1 shared_informer.go:266] stop requested
I0629 03:10:34.181608       1 healthz.go:244] informer-sync check failed: readyz
  4. The volumeattachments resource state was mismatched: the VolumeAttachment (VA) said the attachment status to the node was true, yet a pod kept reporting that its PVC was still not attached. Once I deleted the VA, the pod started successfully after a new VA with the correct state was created.
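
A rough way to check which AES key name an already-stored secret references, assuming direct etcdctl access to the shoot's etcd (endpoints and certificate paths below are placeholders):

ETCDCTL_API=3 etcdctl \
  --endpoints=<etcd-endpoint> \
  --cacert=<ca.crt> --cert=<client.crt> --key=<client.key> \
  get /registry/secrets/default/default-token-xxx | head -c 200
# The stored value starts with a plaintext prefix like "k8s:enc:aescbc:v1:key<timestamp>:".
# That key name must match one of the keys in the kube-apiserver's EncryptionConfiguration,
# otherwise the apiserver fails with "no matching key was found for the provided AES transformer".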

What you expected to happen:
Seed rollout to v1.21.1 to occur without any problems.

How to reproduce it (as minimally and precisely as possible):

  1. Create a shoot on azure with:
  • k8s version v1.20.x
  • several pods with PVs
  • about 25 nodes with maxSurge set to 100% (to reproduce the scale issues)
  2. Perform a rolling update of the shoot to v1.21.1

Anything else we need to know?:
I eventually found workarounds for each of the above issues. However, proper fixes are still required.

For 1,

  • I saw several disks with multiple volumeAttachments resources for the same PV.
  • Once I deleted all the volumeAttachments so they would be re-created (see the sketch after this list), the FS issues decreased significantly.
  • However, they still keep coming up once in a while. @ialidzhikov reported this on v1.20 also.
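
A minimal sketch of that cleanup, assuming the affected VolumeAttachments have already been identified (names below are placeholders):

kubectl get volumeattachments
# look for multiple VolumeAttachments referencing the same PV, or ones reported as
# attached although the pod still waits for the volume
kubectl delete volumeattachment <stale-volumeattachment-name>
# the CSI attacher should then create a fresh VolumeAttachment with the correct state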

For 2,

  • It turned out that the CCM wasn't able to delete routes pointing to the old VMs on 1.20.
  • The CCM has a nomenclature change that caused this; it needs further investigation.
  • I manually deleted the old routes to fix it for now (a rough sketch of the cleanup follows this list).
  • However, I suspect a bug exists in the CCM where old routes for nodes are not deleted properly.
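
A rough sketch of the manual cleanup using the Azure CLI (resource group and route table names are placeholders):

az network route-table route list \
  --resource-group <shoot-resource-group> --route-table-name <route-table> -o table
# routes still pointing to deleted VMs/nodes can be removed so the CCM can recreate
# them for the new nodes
az network route-table route delete \
  --resource-group <shoot-resource-group> --route-table-name <route-table> \
  --name <stale-route-name>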

For 3,
I still don't have a solution/fix and am not sure why this occurs. Any hints on why this could happen would be appreciated.

Environment:

  • Gardener version (if relevant): v1.25.x
  • Extension version: 1.20.1
  • Kubernetes version (use kubectl version): v1.21.1
  • Cloud provider or hardware configuration: Azure with CSI enabled shoot/seed.
  • Others:
@prashanth26 prashanth26 added the kind/bug Bug label Jun 29, 2021
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related platform/azure Microsoft Azure platform/infrastructure priority/3 Priority (lower number equals higher priority) labels Jun 29, 2021
@prashanth26 prashanth26 changed the title Several bugs after migrating azure seed to v1.21 Several bugs after migrating azure seed/shoot to v1.21 Jun 29, 2021
@prashanth26
Author

For (3) I see a similar issue raised here - rancher/rancher#28400

@ialidzhikov
Member

Yep, this is something I can also confirm. I see integration test failures related to this:

time="2021-06-28T10:13:19Z" level=info msg="At 2021-06-28 09:50:24 +0000 UTC - event for shoot--it--tm1d3-s1p-worker-1-z1-66fd7-9hk2n: {route_controller } FailedToCreateRoute: Could not create route f2f39568-04ff-4bdf-95dc-fe82bbe14b50 100.96.0.0/24 for node shoot--it--tm1d3-s1p-worker-1-z1-66fd7-9hk2n after 30.164634768s: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {\r\n  \"error\": {\r\n    \"code\": \"RouteConflict\",\r\n    \"message\": \"Route: shoot--it--tm1d3-s1p-worker-1-z2-54b84-h4jpg____100961024 conflicts with Route: shoot--it--tm1d3-s1p-worker-1-z2-6474f-sf987. Routes cannot have the same AddressPrefix.\",\r\n    \"details\": []\r\n  }\r\n}"

We will definitely update our CCM once the upstream fix is also backported to the release-1.21 branch.

@vlerenc
Member

vlerenc commented Jul 14, 2021

3 out of 4 issues seem to be addressed and improve with a new patch version, but do we now know what caused (3) ("secret encryption not working as expected")? It's nothing MCM-related, so who would look into it? Do we have a time bomb with that version, or even with that functionality in general, that we are not yet aware of?

@rfranzke
Member

As we have been using secret encryption for two years and have never seen such issues so far, and as @prashanth26 reports issues with the file system/disk and with unready nodes, my first guess would be that the problem is not directly related to the encryption as such. Maybe the volume containing the encryption secret wasn't properly mounted to the kube-apiserver? To be sure, and to exclude the possibility that the issue is somehow related to v1.21, one would of course have to take a deeper look.

@rfranzke
Member

OK, I took a short look into the affected seed and shoot cluster, and I don't see any obvious issues with volumes or disks anymore. However, the kube-apiserver still cannot start due to the above error.
I noticed an inconsistency between the data in etcd and the encryption secret in use.

  • In etcd, I see this:
/registry/secrets/kube-system/metrics-server
�j�:enc:aescbc:v1:key1624637421:��𞋰&J��Ə"�*��A�H^j��...
  • In the encryption configuration, I see this:
...
  - aescbc:
      keys:
      - name: key1624642438
        secret: <some-secret>

Please note that key1624637421 and key1624642438 are different although g/g does not support any rotation of this secret. Hence, the key should actually never change during the lifetime of a shoot cluster. However, it seems the encryption secret was regenerated and no longer matches the state in etcd?!
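
A rough way to cross-check the two key names (the exact data key inside the etcd-encryption-secret may differ, so the secret is dumped as a whole here):

kubectl -n shoot--<project>--<shoot> get secret etcd-encryption-secret -o yaml
# base64-decode the entry holding the EncryptionConfiguration and note the name under
# providers[].aescbc.keys[].name, then compare it with the "enc:aescbc:v1:<key-name>"
# prefix of any secret value stored in etcd (as shown above)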

@plkokanov This reminds me of some issue with the CA certificate we observed some weeks ago. It was similarly regenerated and caused issues. We talked about this and IIRC then you wanted to take a look at the "secret manager" code in gardenlet (https://github.com/gardener/gardener/tree/master/pkg/operation/shootsecrets) to improve a few things. Could it be that there is still some bug/race condition that could lead to such situation? The entire secret handling was refactored/rewritten back then when support for the core.gardener.cloud/v1alpha1.ShootState resource (mainly used for CPM) was introduced.

It's not clear to me how it should be related to @prashanth26's tests. I'd rather think it'd be a general problem if this assumption holds true.

/cc @stoyanr

@vlerenc
Member

vlerenc commented Jul 16, 2021

Thanks @rfranzke for looking into it. @plkokanov, could you then please pick this up and investigate? If we have a problem, we would rather find out ourselves before we see even more such clusters. @prashanth26 still has the clusters around, e.g.
https://dashboard.garden.dev.k8s.ondemand.com/namespace/garden-perftest/shoots -> https://dashboard.garden.dev.k8s.ondemand.com/namespace/garden-perftest/shoots/cluster-83

@plkokanov
Contributor

I'll take a closer look. It really seems like there's a race condition happening, as that is the only way inconsistencies could arise between the secrets in the ShootState and those in the shoot's control plane.

Back when we observed a similar behaviour for an openstack managed seed cluster, I wrote a script to compare the secrets for all shoots with the data in their ShootStates and the only mismatch was that particular openstack cluster. I can run it again, but I guess it's something that happens very rarely.

@prashanth26
Author

prashanth26 commented Jul 19, 2021

> @plkokanov This reminds me of some issue with the CA certificate we observed some weeks ago. It was similarly regenerated and caused issues. We talked about this and IIRC then you wanted to take a look at the "secret manager" code in gardenlet (https://github.com/gardener/gardener/tree/master/pkg/operation/shootsecrets) to improve a few things. Could it be that there is still some bug/race condition that could lead to such situation? The entire secret handling was refactored/rewritten back then when support for the core.gardener.cloud/v1alpha1.ShootState resource (mainly used for CPM) was introduced.
>
> It's not clear to me how it should be related to @prashanth26's tests. I'd rather think it'd be a general problem if this assumption holds true.

Thanks, @rfranzke, for taking a look at this. I tried to figure out the reason for this over the past few days but wasn't successful. Yes, I would also think that it could be a general problem that was aggravated by the disk detachment issues.

> I'll take a closer look. It really seems like there's a race condition happening, as that is the only way inconsistencies could arise between the secrets in the ShootState and those in the shoot's control plane.
>
> Back when we observed a similar behaviour for an openstack managed seed cluster, I wrote a script to compare the secrets for all shoots with the data in their ShootStates and the only mismatch was that particular openstack cluster. I can run it again, but I guess it's something that happens very rarely.

Thanks, @plkokanov, for looking further into it. I saw it occur more frequently once the seed was migrated to v1.21.1 on azure. I had about 100 shoots running on the seed, and during each rollout (I updated the seed workers) I observed ~5 new clusters running into this state. I have cleaned up the other affected clusters by manually editing the secrets to allow shoot deletions; the above are a couple I have kept for investigation. As I said, this issue might have been aggravated by the bugs in azure k8s v1.21.1 or by the CSI issues. However, I think it makes sense to keep an eye out for these issues during future rollouts of seeds hosting CSI-enabled clusters on k8s 1.21.x.

Also, considering the several issues listed above, I think we should probably deprecate v1.21.1 and only support >=v1.21.2?

@plkokanov
Contributor

plkokanov commented Jul 19, 2021

@prashanth26 @stoyanr I took a closer look at the secrets but still can't see any possible race conditions. However, I checked the timestamps of the Shoot, the ShootState, and the encryption key creations (the name of the key is actually "key" + the Unix timestamp of when it was generated):
key1624637421 - generated June 25th 16:10:21 (this key, as @rfranzke pointed out, is currently used to encrypt the secrets in etcd)
key1624642438 - generated June 25th 17:33:58 (this is the key present in the ShootState and the etcd-encryption-secret in the shoot's control plane)
The Shoot has a creation timestamp of 2021-06-25T17:32:02Z
And the ShootState - 2021-06-25T17:32:04Z

For the cluster-96 shoot, the key used to encrypt is key1624558497 which is from Thursday, June 24, 2021 6:14:57, but the shoot creation timestamp is 2021-06-25T17:32:46Z

So according to this, the key that was used to encrypt the secrets was created before the Shoot.
Is it possible that this is still due to the volume issues, and that somehow stale data was loaded for the etcd?

Update:
I downloaded the oldest backup of the cluster-96 Shoot (from July 11th) and found the following:

C/registry/apiregistration.k8s.io/apiservices/v1beta1.metrics.k8s.io�á��flÆ� Õ�*÷�{"kind":"APIService","apiVersion":"apiregistration.k8s.io/v1beta1","metadata":{"name":"v1beta1.metrics.k8s.io","uid":"090ff039-61f1-404e-8dc3-a1a2e84d4331","creationTimestamp":"2021-06-24T18:17:17Z","labels":{"shoot.gardener.cloud/no-cleanup":"true"},"annotations":{"resources.gardener.cloud/description":"DO NOT EDIT - This resource is managed by gardener-resource-manager.\nAny modifications are discarded and the resource is returned to the original state.","resources.gardener.cloud/origin":"shoot--garden--az-perf-(id):shoot--perftest--cluster-4/shoot-core-metrics-server"}

Note that at the end it mentions shoot--perftest--cluster-4. Also, the creation timestamp 2021-06-24T18:17:17Z corresponds to the timestamp of the encryption key of cluster-96 - key1624558497 (June 24, 2021 6:14:57). This isn't the only entry with shoot--perftest--cluster-4, and there seem to be no entries containing shoot--perftest--cluster-96/.
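
A rough way to check which shoot's resources a downloaded full snapshot contains, based on the origin annotations (the snapshot file name is a placeholder):

strings <full-snapshot.db> | grep -o 'shoot--perftest--cluster-[0-9]*' | sort | uniq -c
# per the finding above, only shoot--perftest--cluster-4 entries should show up here,
# and none for cluster-96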

You can probably find more details in the backup itself. @prashanth26 could you check if ETCD data was loaded from an incorrect backup?

@amshuman-kr
Collaborator

> if ETCD data was loaded from an incorrect backup?

cc @shreyas-s-rao Can you please check this? Or possibly the wrong data volume got mounted?

@prashanth26 Do you remember if cluster-4 was also one of the shoots that faced the same secret encryption issue?

@prashanth26
Author

> You can probably find more details in the backup itself. @prashanth26 could you check if ETCD data was loaded from an incorrect backup?

@shreyas-s-rao / @abdasgupta / @ishan16696 can you confirm this?

> @prashanth26 Do you remember if cluster-4 was also one of the shoots that faced the same secret encryption issue?

I actually don't remember exactly, but I vaguely remember the cluster names with issues, and I don't think this was one of them. Not sure though.

@amshuman-kr
Collaborator

It might be worth checking if there are older (full) snapshots that are correct, i.e. have resources for cluster-96. I hope the snapshots are not already garbage collected. If they are no longer there, it might be worth reproducing the issue for at least one shoot to have a look at the data folder and the snapshots.

@shreyas-s-rao

I was looking into a similar issue for https://dashboard.garden.canary.k8s.ondemand.com/namespace/garden-hc-ci/shoots/531571-orc just now, so I checked the shoot and seed etcds as well as the backups. These are my findings:

The shoot etcd backup is consistent with the shoot etcd, and the secrets are showing up correctly. The only inconsistency I see is between the encryption key referenced in etcd and the actual encryption key from the etcd-encryption-secret:
etcd and etcd backup:

/registry/secrets/default/default-token-fztqb
k8s:enc:aescbc:v1:key1615384167:...

etcd-encryption-secret:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- providers:
  - aescbc:
      keys:
      - name: key1625719010
        secret: ...
  - identity: {}
  resources:
  - secrets

I checked the shoot's etcd from a backup from July 12th, and that too has the secrets encrypted with key key1615384167. But the latest etcd-encryption-secret was last updated on 2021-07-08T04:37:03Z.

Further, I checked if there is an inconsistency between the etcd backing the seed and its backup, to see whether the encryption secret itself was updated anytime recently. Those look consistent as well: I confirmed with an old backup from July 12th that the etcd entry for /registry/secrets/shoot--hc-ci--531571-orc/etcd-encryption-secret has remained unchanged.

I see that cluster-96 is in the same state where the secrets have been encrypted with key1624558497 whereas the current encryption-secret holds the key key1624642441.
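
A quick, non-authoritative way to spot which encryption key names appear in a full snapshot file (the file name is a placeholder; the prefix format is taken from the entries above):

grep -ao 'k8s:enc:aescbc:v1:key[0-9]*' <full-snapshot.db> | sort | uniq -c
# the key name(s) printed here should match the key in the shoot's etcd-encryption-secret;
# a mismatch is exactly the inconsistency described in this thread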

@plkokanov
Contributor

@shreyas-s-rao the key that is currently used to encrypt secrets in the 531571-orc etcd - key1615384167 - is from Wednesday, March 10, 2021 1:49:27 PM, which is way before the creation timestamp of the shoot cluster (2021-07-08T04:36:49Z). And as you can see, the key in the encryption secret corresponds to the creationTimestamp of the shoot.

You can search the backup for resources that are managed by the ResourceManager. Those usually carry a reference to the shoot's ResourceManager that created them in their resources.gardener.cloud/origin annotation, so you can check to which shoot the backup belongs (if I'm not mistaken).

Is there any way that we can reproduce the issue and check the backups and encryption secret then?

@vpnachev
Member

All of the above-mentioned bugs are fixed, therefore I am now closing this issue.
/close
