Update vendoring to k8s 1.20/1.21 #402

ggaurav10 · 2020-02-02T06:32:52Z

Logs of MCM:

I0201 07:37:20.264399       1 machine_safety.go:69] reconcileClusterMachineSafetyOvershooting: Start
I0201 07:37:20.264533       1 machine_safety.go:374] checkAndFreezeORUnfreezeMachineSets: MS:"shoot--garden--az-us2-cpu-worker-z1-6979c847b6" LowerThreshold:4 FullyLabeledReplicas:4 HigherThreshold:5
I0201 07:37:20.565096       1 machine_safety.go:95] reconcileClusterMachineSafetyOvershooting: End, reSync-Period: 1m0s
E0201 07:37:24.347148       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-wbmqr" not found
E0201 07:37:24.349910       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-4cpkc" not found
E0201 07:37:24.349993       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-r6wnm" not found
E0201 07:37:24.350033       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-5n7d8" not found
E0201 07:37:34.589908       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-wbmqr" not found
E0201 07:37:34.592394       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-r6wnm" not found
E0201 07:37:34.592409       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-5n7d8" not found
E0201 07:37:34.592444       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-4cpkc" not found
I0201 07:37:35.077956       1 machine_safety.go:105] reconcileClusterMachineSafetyAPIServer: Start
I0201 07:37:35.084685       1 machine_safety.go:174] reconcileClusterMachineSafetyAPIServer: Stop
E0201 07:37:55.073089       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-wbmqr" not found
E0201 07:37:55.075920       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-5n7d8" not found
E0201 07:37:55.075931       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-4cpkc" not found
E0201 07:37:55.077233       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847

MCM doesn't reconcile machines anymore. Restarting MCM results in all the machines getting recreated again, after which the issue is resolved.

ggaurav10 · 2020-02-03T08:16:59Z

Here there should be a check like this so that a non-existent machine key is not requeued. Also, please note that at the second link, the Get is served from the informer cache, while the called function reconcileClusterMachine Gets from the apiserver. Not sure why the Get call from the cache has not been failing with NotFound for such a long time.

prashanth26 · 2020-02-03T11:11:30Z

I think this is related to this issue - #394

prashanth26 · 2020-02-04T08:41:16Z

The above fix partially closes this issue. However, the larger issue of cache inconsistency is yet to be fixed.

prashanth26 · 2020-02-11T12:29:51Z

We have observed that the cache gets outdated for different machine CRD objects, causing MCM to stop reconciling objects. We need to validate why this happens.

prashanth26 · 2020-02-27T07:37:09Z

With help from @rfranzke, it has been identified that the issue is with the Kubernetes version - 1.14.8. Updating to k8s version 1.15 and above seems to fix it.

Refer to kubernetes/client-go#755 for more details.

amshuman-kr · 2020-02-27T08:53:19Z

Ideally, the client-side should implement the timeout by itself without depending on the server side to close the channel on timeout. This has already been done recently in client-go but not released yet. We should adopt this change as soon as it is released to insulate ourselves from such issues in the future.

prashanth26 · 2020-02-28T04:24:17Z

The issue has been fixed on the server side with the K8s version update to 1.15. However, for the client-side fix we could wait for 1.18 client-go.

@amshuman-kr - Let's keep this issue open to track the client-side fix. Once 1.18 is released, we could adopt the changes.

ggaurav10 mentioned this issue Feb 4, 2020

Don't requeue machine key if error returned is "not found" #403

Merged

prashanth26 closed this as completed in #403 Feb 4, 2020

prashanth26 reopened this Feb 4, 2020

prashanth26 assigned hardikdr and prashanth26 Feb 20, 2020

prashanth26 added kind/bug Bug priority/blocker Needs to be resolved now, because it breaks the service labels Feb 20, 2020

prashanth26 mentioned this issue Mar 2, 2020

Hanging of MCM #394

Closed

ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 29, 2020

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jun 29, 2020

prashanth26 added priority/normal and removed priority/blocker Needs to be resolved now, because it breaks the service labels Jul 21, 2020

prashanth26 unassigned hardikdr and prashanth26 Jul 21, 2020

prashanth26 removed the kind/bug Bug label Jul 21, 2020

prashanth26 changed the title ~~MCM stops reconciling machines~~ Update vendoring to k8s 1.18 Aug 16, 2020

gardener-robot removed the priority/normal label Aug 16, 2020

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020

hardikdr mentioned this issue Nov 6, 2020

Updated k8s.io dependency to 0.18 #534

Closed

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 16, 2020

prashanth26 modified the milestone: 2021-Q2 Feb 3, 2021

prashanth26 added effort/1w Effort for issue is around 1 week area/dev-productivity Developer productivity related (how to improve development) and removed effort/2d Effort for issue is around 2 days labels Mar 30, 2021

This was referenced Apr 6, 2021

could not apply CRDs #600

Closed

Use cached informers clients gardener/autoscaler#73

Merged

prashanth26 changed the title ~~Update vendoring to k8s 1.18~~ Update vendoring to k8s 1.2x Apr 14, 2021

prashanth26 changed the title ~~Update vendoring to k8s 1.2x~~ Update vendoring to k8s 1.20/1.21 Apr 14, 2021

AxiomSamarth self-assigned this Apr 14, 2021

AxiomSamarth mentioned this issue Apr 16, 2021

Update Kubernetes dependency versions to 1.20.6 #601

Merged

prashanth26 closed this as completed in #601 Jul 29, 2021
