
Several bugs after migrating azure seed/shoot to v1.21 #328

Closed
prashanth26 opened this issue Jun 29, 2021 · 16 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug platform/azure Microsoft Azure platform/infrastructure priority/3 Priority (lower number equals higher priority)

Comments

@prashanth26

prashanth26 commented Jun 29, 2021

How to categorize this issue?

/area quality
/kind bug
/priority 3
/platform azure

What happened:
After the rollout of a seed with 100 shoots from k8s v1.20.6 to v1.21.1, I see the following issues:

  1. Nodes kept getting replaced after going into a filesystem (FS) read-only state.
  2. Nodes still kept switching between healthy/unhealthy.
  3. The kube-apiserver fails with the error below. We think this is due to the secret encryption not working as expected (a rough etcdctl check is sketched after this list). cc: @amshuman-kr
E0629 03:10:34.157664       1 cacher.go:419] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: unable to transform key "/registry/secrets/default/default-token-xxx": no matching key was found for the provided AES transformer; reinitializing...
I0629 03:10:34.181550       1 shared_informer.go:266] stop requested
I0629 03:10:34.181608       1 healthz.go:244] informer-sync check failed: readyz
  4. The volumeattachments resource state was mismatched: the VolumeAttachment (VA) said the attachment status to the node was true, yet a pod kept reporting that its PVC was still not attached. Once I deleted the VA, the pod started successfully after a new VA with the correct state was created.
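
A rough way to check which AES key name an already-stored secret references, assuming direct etcdctl access to the shoot's etcd (endpoints and certificate paths below are placeholders):

ETCDCTL_API=3 etcdctl \
  --endpoints=<etcd-endpoint> \
  --cacert=<ca.crt> --cert=<client.crt> --key=<client.key> \
  get /registry/secrets/default/default-token-xxx | head -c 200
# The stored value starts with a plaintext prefix like "k8s:enc:aescbc:v1:key<timestamp>:".
# That key name must match one of the keys in the kube-apiserver's EncryptionConfiguration,
# otherwise the apiserver fails with "no matching key was found for the provided AES transformer".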

What you expected to happen:
Seed rollout to v1.21.1 to occur without any problems.

How to reproduce it (as minimally and precisely as possible):

  1. Create a shoot on azure with:
  • k8s version v1.20.x
  • several pods with PVs
  • about 25 nodes with maxSurge set to 100% (to reproduce the scale issues)
  2. Perform a rolling update of the shoot to v1.21.1

Anything else we need to know?:
I eventually found workarounds for each of the above issues. However, proper fixes are still required.

For 1,

  • I saw several disks with multiple volumeAttachments resources for the same PV.
  • Once I deleted all the volumeAttachments so they would be re-created (see the sketch after this list), the FS issues decreased significantly.
  • However, they still keep coming up once in a while. @ialidzhikov reported this on v1.20 also.
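
A minimal sketch of that cleanup, assuming the affected VolumeAttachments have already been identified (names below are placeholders):

kubectl get volumeattachments
# look for multiple VolumeAttachments referencing the same PV, or ones reported as
# attached although the pod still waits for the volume
kubectl delete volumeattachment <stale-volumeattachment-name>
# the CSI attacher should then create a fresh VolumeAttachment with the correct state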

For 2,

  • It turned out that the CCM wasn't able to delete routes pointing to the old VMs on 1.20.
  • The CCM has a nomenclature change that caused this; it needs further investigation.
  • I manually deleted the old routes to fix it for now (a rough sketch of the cleanup follows this list).
  • However, I suspect a bug exists in the CCM where old routes for nodes are not deleted properly.
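
A rough sketch of the manual cleanup using the Azure CLI (resource group and route table names are placeholders):

az network route-table route list \
  --resource-group <shoot-resource-group> --route-table-name <route-table> -o table
# routes still pointing to deleted VMs/nodes can be removed so the CCM can recreate
# them for the new nodes
az network route-table route delete \
  --resource-group <shoot-resource-group> --route-table-name <route-table> \
  --name <stale-route-name>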

For 3,
I still don't have a solution/fix and am not sure why this occurs. Any hints on why this could happen would be appreciated.

Environment:

  • Gardener version (if relevant): v1.25.x
  • Extension version: 1.20.1
  • Kubernetes version (use kubectl version): v1.21.1
  • Cloud provider or hardware configuration: Azure with CSI enabled shoot/seed.
  • Others:
@prashanth26 prashanth26 added the kind/bug Bug label Jun 29, 2021
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related platform/azure Microsoft Azure platform/infrastructure priority/3 Priority (lower number equals higher priority) labels Jun 29, 2021
@prashanth26 prashanth26 changed the title Several bugs after migrating azure seed to v1.21 Several bugs after migrating azure seed/shoot to v1.21 Jun 29, 2021
@prashanth26
Author

For (3) I see a similar issue raised here - rancher/rancher#28400

@ialidzhikov
Member

Yep, this is something I can also confirm. I see integration test failures related to this:

time="2021-06-28T10:13:19Z" level=info msg="At 2021-06-28 09:50:24 +0000 UTC - event for shoot--it--tm1d3-s1p-worker-1-z1-66fd7-9hk2n: {route_controller } FailedToCreateRoute: Could not create route f2f39568-04ff-4bdf-95dc-fe82bbe14b50 100.96.0.0/24 for node shoot--it--tm1d3-s1p-worker-1-z1-66fd7-9hk2n after 30.164634768s: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {\r\n  \"error\": {\r\n    \"code\": \"RouteConflict\",\r\n    \"message\": \"Route: shoot--it--tm1d3-s1p-worker-1-z2-54b84-h4jpg____100961024 conflicts with Route: shoot--it--tm1d3-s1p-worker-1-z2-6474f-sf987. Routes cannot have the same AddressPrefix.\",\r\n    \"details\": []\r\n  }\r\n}"

We will definitely update our CCM once the upstream fix is also backported to the release-1.21 branch.

@vlerenc
Member

vlerenc commented Jul 14, 2021

3 out of 4 issues seem to be addressed and improve with a new patch version, but do we now know what caused (3) ("secret encryption not working as expected")? It's nothing MCM-related, so who would look into it? Do we have a time bomb with that version, or even with that functionality in general, that we are not yet aware of?

@rfranzke
Member

As we have been using secret encryption for two years and have never seen such issues so far, and as @prashanth26 reports issues with the file system/disk and with unready nodes, my first guess would be that the problem is not directly related to the encryption as such. Maybe the volume containing the encryption secret wasn't properly mounted to the kube-apiserver? To be sure, and to exclude the possibility that the issue is somehow related to v1.21, one would of course have to take a deeper look.

@rfranzke
Member

OK, I took a short look into the affected seed and shoot cluster, and I don't see any obvious issues with volumes or disks anymore. However, the kube-apiserver still cannot start due to the above error.
I noticed an inconsistency between the data in etcd and the encryption secret in use.

  • In etcd, I see this:
/registry/secrets/kube-system/metrics-server
�j�:enc:aescbc:v1:key1624637421:��𞋰&J��Ə"�*��A�H^j��...
  • In the encryption configuration, I see this:
...
  - aescbc:
      keys:
      - name: key1624642438
        secret: <some-secret>

Please note that key1624637421 and key1624642438 are different although g/g does not support any rotation of this secret. Hence, the key should actually never change during the lifetime of a shoot cluster. However, it seems the encryption secret was regenerated and no longer matches the state in etcd?!
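
A rough way to cross-check the two key names (the exact data key inside the etcd-encryption-secret may differ, so the secret is dumped as a whole here):

kubectl -n shoot--<project>--<shoot> get secret etcd-encryption-secret -o yaml
# base64-decode the entry holding the EncryptionConfiguration and note the name under
# providers[].aescbc.keys[].name, then compare it with the "enc:aescbc:v1:<key-name>"
# prefix of any secret value stored in etcd (as shown above)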

@plkokanov This reminds me of some issue with the CA certificate we observed some weeks ago. It was similarly regenerated and caused issues. We talked about this and IIRC then you wanted to take a look at the "secret manager" code in gardenlet (https://github.com/gardener/gardener/tree/master/pkg/operation/shootsecrets) to improve a few things. Could it be that there is still some bug/race condition that could lead to such situation? The entire secret handling was refactored/rewritten back then when support for the core.gardener.cloud/v1alpha1.ShootState resource (mainly used for CPM) was introduced.

It's not clear to me how it should be related to @prashanth26's tests. I'd rather think it'd be a general problem if this assumption holds true.

/cc @stoyanr

@vlerenc
Member

vlerenc commented Jul 16, 2021

Thanks @rfranzke for looking into it. @plkokanov, could you then please pick this up and investigate? If we have a problem, we would rather find out ourselves before we see even more such clusters. @prashanth26 still has the clusters around, e.g.
https://dashboard.garden.dev.k8s.ondemand.com/namespace/garden-perftest/shoots -> https://dashboard.garden.dev.k8s.ondemand.com/namespace/garden-perftest/shoots/cluster-83

@plkokanov
Contributor

I'll take a closer look. It really seems like there's a race condition happening, as that is the only way inconsistencies could arise between the secrets in the ShootState and those in the shoot's control plane.

Back when we observed a similar behaviour for an openstack managed seed cluster, I wrote a script to compare the secrets for all shoots with the data in their ShootStates and the only mismatch was that particular openstack cluster. I can run it again, but I guess it's something that happens very rarely.

@prashanth26
Author

prashanth26 commented Jul 19, 2021

> @plkokanov This reminds me of some issue with the CA certificate we observed some weeks ago. It was similarly regenerated and caused issues. We talked about this and IIRC then you wanted to take a look at the "secret manager" code in gardenlet (https://github.com/gardener/gardener/tree/master/pkg/operation/shootsecrets) to improve a few things. Could it be that there is still some bug/race condition that could lead to such situation? The entire secret handling was refactored/rewritten back then when support for the core.gardener.cloud/v1alpha1.ShootState resource (mainly used for CPM) was introduced.
>
> It's not clear to me how it should be related to @prashanth26's tests. I'd rather think it'd be a general problem if this assumption holds true.

Thanks, @rfranzke, for taking a look at this. I tried to figure out the reason for this over the past few days but wasn't successful. Yes, I would also think that it could be a general problem that was aggravated by the disk detachment issues.

> I'll take a closer look. It really seems like there's a race condition happening, as that is the only way inconsistencies could arise between the secrets in the ShootState and those in the shoot's control plane.
>
> Back when we observed a similar behaviour for an openstack managed seed cluster, I wrote a script to compare the secrets for all shoots with the data in their ShootStates and the only mismatch was that particular openstack cluster. I can run it again, but I guess it's something that happens very rarely.

Thanks, @plkokanov, for looking further into it. I saw it occur more frequently once the seed was migrated to v1.21.1 on azure. I had about 100 shoots running on the seed, and during each rollout (I updated the seed workers) I observed ~5 new clusters running into this state. I have cleaned up the other affected clusters by manually editing the secrets to allow shoot deletions; the above are a couple I have kept for investigation. As I said, this issue might have been aggravated by the bugs in azure k8s v1.21.1 or by the CSI issues. However, I think it makes sense to keep an eye out for these issues during future rollouts of seeds hosting CSI-enabled clusters on k8s 1.21.x.

Also, considering the several issues listed above, I think we should probably deprecate v1.21.1 and only support >=v1.21.2?

@plkokanov
Contributor

plkokanov commented Jul 19, 2021

@prashanth26 @stoyanr I took a closer look at the secrets but still can't see any possible race conditions. However, I checked the timestamps of the Shoot, the ShootState, and the encryption key creations (the name of the key is actually "key" + the Unix timestamp of when it was generated):
key1624637421 - generated June 25th 16:10:21 (this key, as @rfranzke pointed out, is currently used to encrypt the secrets in etcd)
key1624642438 - generated June 25th 17:33:58 (this is the key present in the ShootState and the etcd-encryption-secret in the shoot's control plane)
The Shoot has a creation timestamp of 2021-06-25T17:32:02Z
And the ShootState - 2021-06-25T17:32:04Z

For the cluster-96 shoot, the key used to encrypt is key1624558497 which is from Thursday, June 24, 2021 6:14:57, but the shoot creation timestamp is 2021-06-25T17:32:46Z

So according to this, the key that was used to encrypt the secrets was created before the Shoot.
Is it possible that this is still due to the volume issues, and that somehow stale data was loaded for the etcd?

Update:
I downloaded the oldest backup of the cluster-96 Shoot (from July 11th) and found the following:

C/registry/apiregistration.k8s.io/apiservices/v1beta1.metrics.k8s.io�á��flÆ� Õ�*÷�{"kind":"APIService","apiVersion":"apiregistration.k8s.io/v1beta1","metadata":{"name":"v1beta1.metrics.k8s.io","uid":"090ff039-61f1-404e-8dc3-a1a2e84d4331","creationTimestamp":"2021-06-24T18:17:17Z","labels":{"shoot.gardener.cloud/no-cleanup":"true"},"annotations":{"resources.gardener.cloud/description":"DO NOT EDIT - This resource is managed by gardener-resource-manager.\nAny modifications are discarded and the resource is returned to the original state.","resources.gardener.cloud/origin":"shoot--garden--az-perf-(id):shoot--perftest--cluster-4/shoot-core-metrics-server"}

Note that at the end it mentions shoot--perftest--cluster-4. Also, the creation timestamp 2021-06-24T18:17:17Z corresponds to the timestamp of the encryption key of cluster-96 - key1624558497 (June 24, 2021 6:14:57). This isn't the only entry with shoot--perftest--cluster-4, and there seem to be no entries containing shoot--perftest--cluster-96/.
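
A rough way to check which shoot's resources a downloaded full snapshot contains, based on the origin annotations (the snapshot file name is a placeholder):

strings <full-snapshot.db> | grep -o 'shoot--perftest--cluster-[0-9]*' | sort | uniq -c
# per the finding above, only shoot--perftest--cluster-4 entries should show up here,
# and none for cluster-96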

You can probably find more details in the backup itself. @prashanth26 could you check if ETCD data was loaded from an incorrect backup?

@amshuman-kr
Collaborator

> if ETCD data was loaded from an incorrect backup?

cc @shreyas-s-rao Can you please check this? Or possibly the wrong data volume got mounted?

@prashanth26 Do you remember if cluster-4 was also one of the shoots that faced the same secret encryption issue?

@prashanth26
Author

> You can probably find more details in the backup itself. @prashanth26 could you check if ETCD data was loaded from an incorrect backup?

@shreyas-s-rao / @abdasgupta / @ishan16696 can you confirm this?

> @prashanth26 Do you remember if cluster-4 was also one of the shoots that faced the same secret encryption issue?

I actually don't remember exactly, but I vaguely remember the cluster names with issues, and I don't think this was one of them. Not sure though.

@amshuman-kr
Collaborator

It might be worth checking if there are older (full) snapshots that are correct, i.e. have resources for cluster-96. I hope the snapshots are not already garbage collected. If they are no longer there, it might be worth reproducing the issue for at least one shoot to have a look at the data folder and the snapshots.

@shreyas-s-rao

I was looking into a similar issue for https://dashboard.garden.canary.k8s.ondemand.com/namespace/garden-hc-ci/shoots/531571-orc just now, so I checked the shoot and seed etcds as well as the backups. These are my findings:

The shoot etcd backup is consistent with the shoot etcd, and the secrets are showing up correctly. The only inconsistency I see is between the encryption key referenced in etcd and the actual encryption key from the etcd-encryption-secret:
etcd and etcd backup:

/registry/secrets/default/default-token-fztqb
k8s:enc:aescbc:v1:key1615384167:...

etcd-encryption-secret:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- providers:
  - aescbc:
      keys:
      - name: key1625719010
        secret: ...
  - identity: {}
  resources:
  - secrets

I checked the shoot's etcd from a backup from July 12th, and that too has the secrets encrypted with key key1615384167. But the latest etcd-encryption-secret was last updated on 2021-07-08T04:37:03Z.

Further, I checked if there is an inconsistency between the etcd backing the seed and its backup, to see whether the encryption secret itself was updated anytime recently. Those look consistent as well: I confirmed with an old backup from July 12th that the etcd entry for /registry/secrets/shoot--hc-ci--531571-orc/etcd-encryption-secret has remained unchanged.

I see that cluster-96 is in the same state where the secrets have been encrypted with key1624558497 whereas the current encryption-secret holds the key key1624642441.
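
A quick, non-authoritative way to spot which encryption key names appear in a full snapshot file (the file name is a placeholder; the prefix format is taken from the entries above):

grep -ao 'k8s:enc:aescbc:v1:key[0-9]*' <full-snapshot.db> | sort | uniq -c
# the key name(s) printed here should match the key in the shoot's etcd-encryption-secret;
# a mismatch is exactly the inconsistency described in this thread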

@plkokanov
Contributor

@shreyas-s-rao the key that is currently used to encrypt secrets in the 531571-orc etcd - key1615384167 - is from Wednesday, March 10, 2021 1:49:27 PM, which is way before the creation timestamp of the shoot cluster (2021-07-08T04:36:49Z). And as you can see, the key in the encryption secret corresponds to the creationTimestamp of the shoot.

You can search the backup for resources that are managed by the ResourceManager. Those usually carry a reference to the shoot's ResourceManager that created them in their resources.gardener.cloud/origin annotation, so you can check to which shoot the backup belongs (if I'm not mistaken).

Is there any way that we can reproduce the issue and check the backups and encryption secret then?

@vpnachev
Member

All of the above-mentioned bugs are fixed, therefore I am now closing this issue.
/close
