Several bugs after migrating azure seed/shoot to v1.21 #328
Comments
Potential solutions: …
Credits: @timebertt @vpnachev @ialidzhikov
For (3) I see a similar issue raised here - rancher/rancher#28400
Yep, it is something that I can also confirm. I see integration test failures related to this:
We will definitely update our CCM once the upstream fix is also backported to the release-1.21 branch.
3 out of 4 issues seem to be addressed and improved with a new patch version, but do we now know what caused (3) ("secret encryption not working as expected")? It's nothing MCM-related, so who should look into it? Do we have a time bomb with that version, or even with that functionality in general, that we are not yet aware of?
As we have been using secret encryption for two years and have never seen such issues so far, and as @prashanth26 talks about issues with the file system/disk and with unready nodes, my first guess would be that the problem is not directly related to the encryption as such. Maybe the volume containing the encryption secret wasn't properly mounted to the kube-apiserver.
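A quick way to check that hypothesis would be to list the volume mounts of the shoot's kube-apiserver pod; a minimal sketch, where the namespace and label selectors are placeholder assumptions:

```bash
# List volume mounts of the kube-apiserver container to verify that the
# encryption configuration volume is actually mounted (names are placeholders).
kubectl -n shoot--myproject--mycluster get pods \
  -l app=kubernetes,role=apiserver \
  -o jsonpath='{range .items[*].spec.containers[0].volumeMounts[*]}{.name}{" -> "}{.mountPath}{"\n"}{end}'
```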
OK, I took a short look into the affected seed and shoot cluster, and I don't see any obvious issues with volumes or disks anymore. However, the KAPI still cannot start with the above error.

```yaml
...
- aescbc:
    keys:
    - name: key1624642438
      secret: <some-secret>
```

Please note that …

@plkokanov This reminds me of an issue with the CA certificate that we observed some weeks ago. It was similarly regenerated and caused issues. We talked about this, and IIRC you then wanted to take a look at the "secret manager" code in gardenlet (https://github.com/gardener/gardener/tree/master/pkg/operation/shootsecrets) to improve a few things. Could it be that there is still some bug/race condition that could lead to such a situation? The entire secret handling was refactored/rewritten back then when support for the … was introduced.

It's not clear to me how it should be related to @prashanth26's tests. I'd rather think it'd be a general problem if this assumption holds true.

/cc @stoyanr
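As an aside for anyone debugging this: which key a stored object was encrypted with can be read directly from etcd, since aescbc-encrypted values carry the key name as a plaintext prefix. A minimal sketch, where the endpoints, cert paths, and secret key are placeholders:

```bash
# Values encrypted by the kube-apiserver's aescbc provider are stored as
# k8s:enc:aescbc:v1:<key-name>:<ciphertext>, so the key name used for
# encryption is visible in the raw etcd value.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-main-client:2379 \
  --cacert=ca.crt --cert=client.crt --key=client.key \
  get /registry/secrets/default/my-secret --print-value-only | head -c 64
# Expected output begins with something like:
#   k8s:enc:aescbc:v1:key1624642438:...
# If that key name is missing from the kube-apiserver's encryption
# configuration, reads fail with an error like "no matching prefix found".
```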
Thanks @rfranzke for looking into it. @plkokanov, could you then please pick this up and investigate? If we have a problem, we'd rather find out ourselves before we see even more such clusters. @prashanth26 still has the clusters around, e.g. …
I'll take a closer look. It really seems like there's a race condition happening, as that is the only way there could be inconsistencies between the secrets in the ShootState and in the shoot's control plane. Back when we observed similar behaviour for an openstack managed seed cluster, I wrote a script to compare the secrets of all shoots with the data in their ShootStates, and the only mismatch was that particular openstack cluster. I can run it again, but I guess it's something that happens very rarely.
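Such a comparison script might look roughly like the sketch below; the field paths, secret name, and kube contexts are illustrative assumptions, not the actual script:

```bash
#!/usr/bin/env bash
# Sketch: for every ShootState in the garden cluster, compare the etcd
# encryption secret it records against the secret deployed in the seed.
set -euo pipefail

kubectl --context garden get shootstates -A -o json |
  jq -r '.items[] | "\(.metadata.namespace) \(.metadata.name)"' |
while read -r ns shoot; do
  # Data recorded by gardenlet in the ShootState (entry name assumed).
  state_hash=$(kubectl --context garden get shootstate "$shoot" -n "$ns" -o json |
    jq -r '.spec.gardener[] | select(.name == "etcd-encryption-secret") | .data' |
    sha256sum | cut -d' ' -f1)

  # Data of the secret actually used in the seed's shoot namespace (name assumed).
  seed_hash=$(kubectl --context seed get secret etcd-encryption-secret \
    -n "shoot--${ns#garden-}--${shoot}" -o json |
    jq -r '.data' | sha256sum | cut -d' ' -f1)

  [ "$state_hash" = "$seed_hash" ] || echo "MISMATCH: $ns/$shoot"
done
```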
Thanks, @rfranzke, for taking a look at this. I tried to figure out the reason for the … in the past few days; however, I wasn't successful. Yes, even I would think that it could be a general problem that was aggravated by the disk detachment issues.
Thanks, @plkokanov, for looking further into it. I saw it occur more frequently once the seed was migrated to `v1.21`. Also, I think we should probably deprecate …
@prashanth26 @stoyanr I took a closer look at the secrets, but I still can't see any possible race conditions. However, I checked the timestamps of the Shoot, ShootState, and encryption key creations (the name of the key is actually …).

For the …

So according to this, the key that was used to encrypt secrets was created before the Shoot.

Update: the backup contains the following entry:

```
/registry/apiregistration.k8s.io/apiservices/v1beta1.metrics.k8s.io ... {"kind":"APIService","apiVersion":"apiregistration.k8s.io/v1beta1","metadata":{"name":"v1beta1.metrics.k8s.io","uid":"090ff039-61f1-404e-8dc3-a1a2e84d4331","creationTimestamp":"2021-06-24T18:17:17Z","labels":{"shoot.gardener.cloud/no-cleanup":"true"},"annotations":{"resources.gardener.cloud/description":"DO NOT EDIT - This resource is managed by gardener-resource-manager.\nAny modifications are discarded and the resource is returned to the original state.","resources.gardener.cloud/origin":"shoot--garden--az-perf-(id):shoot--perftest--cluster-4/shoot-core-metrics-server"}
```

Note that at the end it mentions shoot--perftest--cluster-4, and that the creation timestamp is 2021-06-24T18:17:17Z. You can probably find more details in the backup itself. @prashanth26, could you check if the ETCD data was loaded from an incorrect backup?
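If the key name does encode a unix timestamp, as the numeric suffix of `key1624642438` suggests, the creation time can be decoded directly:

```bash
# Decode the unix timestamp embedded in the encryption key name (assumption:
# the numeric suffix of key1624642438 is seconds since the epoch).
date -u -d @1624642438
# Fri Jun 25 17:33:58 UTC 2021
```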
cc @shreyas-s-rao Can you please check this? Or could the wrong data volume possibly have been mounted? @prashanth26, do you remember if …
@shreyas-s-rao / @abdasgupta / @ishan16696 can you confirm this?
I actually don't remember. But I vaguely remember the cluster names with issues. I don't think this was one of them with the same issues. Not sure, though.
It might be worth checking if there are older (full) snapshots that are correct, i.e. that have resources for …
I was looking into a similar issue for https://dashboard.garden.canary.k8s.ondemand.com/namespace/garden-hc-ci/shoots/531571-orc just now, so I checked the shoot and seed etcds as well as the backups. These are my findings: the shoot etcd backup is consistent with the shoot etcd, and the secrets are showing up correctly. The only inconsistency I see is the encryption secret key in the etcd vs. the actual encryption key from the …
I checked the shoot's etcd from a backup from July 12th, and that too has the encrypted secrets with key … Further, I checked if there's an inconsistency between the etcd backing the seed and its backup, to see whether the encryption secret itself was updated anytime recently. Looks like those are consistent as well. I confirmed with an old backup from July 12th that the etcd entry for … I see that cluster-96 is in the same state, where the secrets have been encrypted with …
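For checking old snapshots like this, one possible approach is to restore the snapshot into a scratch directory and serve it with a throwaway local etcd; a sketch with placeholder file names:

```bash
# Restore a full snapshot into a scratch data dir and query it locally.
ETCDCTL_API=3 etcdctl snapshot restore full-snapshot-2021-07-12.db \
  --data-dir=/tmp/etcd-restore

etcd --data-dir=/tmp/etcd-restore \
  --listen-client-urls=http://127.0.0.1:12379 \
  --advertise-client-urls=http://127.0.0.1:12379 &

# List some keys to inspect what the backup actually contains.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:12379 \
  get /registry/secrets/ --prefix --keys-only | head
```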
@shreyas-s-rao the key that is currently used to encrypt secrets in the 531571-orc etcd is … You can search for resources that are managed by the ResourceManager in the backup; those usually carry a reference to the shoot's ResourceManager that created them in their `resources.gardener.cloud/origin` annotation. Is there any way that we can reproduce the issue and then check the backups and the encryption secret?
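Since that annotation is stored as plain text for non-encrypted resources, a simple binary grep over the snapshot file may already surface the creating shoot; a sketch with a placeholder file name:

```bash
# Scan an etcd full snapshot for gardener-resource-manager origin annotations
# to find out which shoot's ResourceManager created the stored resources.
strings -a full-snapshot.db | grep -a 'resources.gardener.cloud/origin' | sort -u
```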
All of the above-mentioned bugs are fixed, therefore I am now closing this issue.
How to categorize this issue?
/area quality
/kind bug
/priority 3
/platform azure
What happened:
After the rollout of a seed with 100 shoots from k8s `v1.20.6` to `v1.21.1`, I see the following issues:
1. The `volumeattachments` resource state was mismatched: the VA said the attachment status to the node was true, however a pod kept reporting that the PVC was yet to be attached. Once I deleted the VA, the pod successfully started after a new VA with the correct state was created.
…

What you expected to happen:
Seed rollout to `v1.21.1` to occur without any problems.

How to reproduce it (as minimally and precisely as possible):
1. Create a seed on k8s `v1.20.x` … `100%` (to reproduce scale issues).
2. Upgrade the seed to `v1.21.1`.
Anything else we need to know?:
I eventually found some hacks for each of the above issues; however, a proper fix is still required.

For 1, I deleted the `volumeAttachments` resources for the PVs. After forcing the `volumeAttachments` to be re-created, the FS issues have reduced by a lot. … `v1.20` also.

For 2, … `1.20`.

For 3,
I still don't have a solution/fix. Not sure why this occurs. Any hints on why this could occur would be appreciated.
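For reference, a minimal sketch of the workaround described for (1), with placeholder resource names: list the VolumeAttachments whose reported state disagrees with the pod, then delete them so the CSI attacher re-creates them.

```bash
# Show each VolumeAttachment with its PV and the attachment status it reports.
kubectl get volumeattachments -o custom-columns=\
NAME:.metadata.name,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached

# If a VA reports attached=true while the pod still waits for its volume,
# deleting the VA lets the CSI attacher re-create it with the correct state.
kubectl delete volumeattachment csi-0123456789abcdef   # placeholder name
```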
Environment:
- …: `v1.25.x`
- …: `1.20.1`
- Kubernetes version (use `kubectl version`): `v1.21.1`