Update vendoring to k8s 1.20/1.21 #402

Closed
ggaurav10 opened this issue Feb 2, 2020 · 7 comments · Fixed by #403 or #601
Labels
area/dev-productivity Developer productivity related (how to improve development) effort/1w Effort for issue is around 1 week exp/beginner Issue that requires only basic skills kind/api-change API change with impact on API users lifecycle/rotten Nobody worked on this for 12 months (final aging stage) priority/2 Priority (lower number equals higher priority)

Comments

@ggaurav10
Contributor

ggaurav10 commented Feb 2, 2020

Logs of MCM:

I0201 07:37:20.264399       1 machine_safety.go:69] reconcileClusterMachineSafetyOvershooting: Start
I0201 07:37:20.264533       1 machine_safety.go:374] checkAndFreezeORUnfreezeMachineSets: MS:"shoot--garden--az-us2-cpu-worker-z1-6979c847b6" LowerThreshold:4 FullyLabeledReplicas:4 HigherThreshold:5
I0201 07:37:20.565096       1 machine_safety.go:95] reconcileClusterMachineSafetyOvershooting: End, reSync-Period: 1m0s
E0201 07:37:24.347148       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-wbmqr" not found
E0201 07:37:24.349910       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-4cpkc" not found
E0201 07:37:24.349993       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-r6wnm" not found
E0201 07:37:24.350033       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-5n7d8" not found
E0201 07:37:34.589908       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-wbmqr" not found
E0201 07:37:34.592394       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-r6wnm" not found
E0201 07:37:34.592409       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-5n7d8" not found
E0201 07:37:34.592444       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-4cpkc" not found
I0201 07:37:35.077956       1 machine_safety.go:105] reconcileClusterMachineSafetyAPIServer: Start
I0201 07:37:35.084685       1 machine_safety.go:174] reconcileClusterMachineSafetyAPIServer: Stop
E0201 07:37:55.073089       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-wbmqr" not found
E0201 07:37:55.075920       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-5n7d8" not found
E0201 07:37:55.075931       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847b6-4cpkc" not found
E0201 07:37:55.077233       1 machine.go:158] Could not fetch machine object machines.machine.sapcloud.io "shoot--garden--az-us2-cpu-worker-z1-6979c847

MCM doesn't reconcile machines anymore. Restarting MCM results in all the machines getting recreated again, after which the issue is resolved.

@ggaurav10
Contributor Author

Here there should be a check like this so that a nonexistent machine key is not requeued. Also, note that at the second link the Get is served from the informer cache, while the called function reconcileClusterMachine gets the object from the API server. Not sure why the Get call from the cache is not failing with NotFound for such a long time.
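A minimal sketch of the kind of check meant here, assuming nothing about the actual MCM code: `ErrNotFound`, `errTransient`, `getter`, and `shouldRequeue` are hypothetical names used only for illustration. In a real controller the NotFound test would more likely be `apierrors.IsNotFound(err)` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var ErrNotFound = errors.New("not found")

// errTransient is a hypothetical stand-in for any retryable failure.
var errTransient = errors.New("temporary API failure")

// getter is a hypothetical lookup standing in for a Get against the API server.
type getter func(key string) (string, error)

// shouldRequeue reports whether key should go back on the work queue after a
// lookup. A key whose object no longer exists is dropped, because retrying it
// can never succeed.
func shouldRequeue(get getter, key string) bool {
	_, err := get(key)
	switch {
	case err == nil:
		return false // object fetched; nothing to retry
	case errors.Is(err, ErrNotFound):
		return false // object is gone; drop the key instead of retrying forever
	default:
		return true // any other failure is treated as transient; retry later
	}
}

func main() {
	objects := map[string]string{"machine-a": "Running"}
	get := func(key string) (string, error) {
		if v, ok := objects[key]; ok {
			return v, nil
		}
		return "", fmt.Errorf("get %q: %w", key, ErrNotFound)
	}
	for _, key := range []string{"machine-a", "machine-gone"} {
		fmt.Printf("%s requeue=%v\n", key, shouldRequeue(get, key))
	}
}
```

With a check like this, the repeated "Could not fetch machine object ... not found" errors in the logs above would be logged once per key rather than on every retry.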

@prashanth26
Contributor

I think this is related to issue #394.

@prashanth26
Contributor

The above fix partially closes this issue. However, the larger issue of cache inconsistency is yet to be fixed.

@prashanth26 prashanth26 reopened this Feb 4, 2020
@prashanth26
Contributor

We have observed that the cache gets outdated for different machine CRD objects, causing MCM to stop reconciling them. We need to validate why this happens.
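One way to confirm the symptom (though not its cause) could be to compare the keys held in the cache against a fresh listing from the API server. The sketch below only shows the comparison; `staleKeys` and the machine names are hypothetical, and how the two key lists are obtained is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// staleKeys returns the keys that are present in the cache but absent from
// the authoritative listing, sorted for stable output.
func staleKeys(cached, live []string) []string {
	liveSet := make(map[string]struct{}, len(live))
	for _, k := range live {
		liveSet[k] = struct{}{}
	}
	var stale []string
	for _, k := range cached {
		if _, ok := liveSet[k]; !ok {
			stale = append(stale, k)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Hypothetical data: the cache still holds two machines that the API
	// server no longer knows about.
	cached := []string{"machine-a", "machine-old-2", "machine-old-1"}
	live := []string{"machine-a", "machine-new"}
	fmt.Println(staleKeys(cached, live))
}
```

A non-empty result that persists across several resync periods would suggest the watch feeding the cache has stopped delivering events.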

@prashanth26 prashanth26 added kind/bug Bug priority/blocker Needs to be resolved now, because it breaks the service labels Feb 20, 2020
@prashanth26
Contributor

With help from @rfranzke, it has been identified that the issue is with Kubernetes version 1.14.8. Updating to k8s version 1.15 or above seems to fix it.

Refer to kubernetes/client-go#755 for more details.

@amshuman-kr
Collaborator

Ideally, the client side should implement the timeout itself, without depending on the server side to close the channel on timeout. This has already been done recently in client-go but has not been released yet. We should adopt this change as soon as it is released to insulate ourselves from such issues in the future.
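The idea of a client-side timeout can be sketched as follows. This is an illustration of the general pattern only, not the client-go implementation: `consume`, `errWatchTimeout`, and the event strings are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errWatchTimeout signals that the client gave up on a watch the server
// never closed.
var errWatchTimeout = errors.New("watch timed out on the client side")

// consume reads events until the channel is closed or the client-side
// deadline expires, whichever comes first, and returns the events seen.
func consume(events <-chan string, timeout time.Duration) ([]string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	var seen []string
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return seen, nil // the server closed the stream normally
			}
			seen = append(seen, ev)
		case <-timer.C:
			// The server did not close the channel in time. Stop waiting so
			// the caller can re-list and re-establish the watch.
			return seen, errWatchTimeout
		}
	}
}

func main() {
	events := make(chan string, 1)
	events <- "ADDED machine-a"
	// The channel is never closed, imitating a server that does not end the watch.
	seen, err := consume(events, 50*time.Millisecond)
	fmt.Println(seen, err)
}
```

The point of the design is that the caller regains control after `timeout` regardless of what the server does, so a hung watch cannot leave the cache silently frozen.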

@prashanth26
Contributor

prashanth26 commented Feb 28, 2020

The issue has been fixed on the server side with the K8s version update to 1.15. However, for the client-side fix we could wait for 1.18 client-go.

@amshuman-kr - Let's keep this issue open to track that. Once 1.18 is released, we can adopt the changes.

@ghost ghost added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Apr 29, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jun 29, 2020
@prashanth26 prashanth26 added priority/normal and removed priority/blocker Needs to be resolved now, because it breaks the service labels Jul 21, 2020
@prashanth26 prashanth26 removed the kind/bug Bug label Jul 21, 2020
@prashanth26 prashanth26 changed the title MCM stops reconciling machines Update vendoring to k8s 1.18 Aug 16, 2020
@prashanth26 prashanth26 added kind/api-change API change with impact on API users size/s Size of pull request is small (see gardener-robot robot/bots/size.py) exp/beginner Issue that requires only basic skills priority/critical Needs to be resolved soon, because it impacts users negatively and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 16, 2020
@prashanth26 prashanth26 modified the milestone: 2021-Q2 Feb 3, 2021
@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) effort/2d Effort for issue is around 2 days and removed priority/critical Needs to be resolved soon, because it impacts users negatively size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Mar 8, 2021
@prashanth26 prashanth26 added effort/1w Effort for issue is around 1 week area/dev-productivity Developer productivity related (how to improve development) and removed effort/2d Effort for issue is around 2 days labels Mar 30, 2021
@prashanth26 prashanth26 changed the title Update vendoring to k8s 1.18 Update vendoring to k8s 1.2x Apr 14, 2021
@prashanth26 prashanth26 changed the title Update vendoring to k8s 1.2x Update vendoring to k8s 1.20/1.21 Apr 14, 2021
@AxiomSamarth AxiomSamarth self-assigned this Apr 14, 2021
6 participants