This repository has been archived by the owner on Nov 9, 2022. It is now read-only.

Worker routines get stuck #99

Closed
timebertt opened this issue Dec 9, 2020 · 2 comments · Fixed by #102
Assignees
Labels
area/robustness Robustness, reliability, resilience related kind/bug Bug priority/3 Priority (lower number equals higher priority)

Comments

@timebertt
Contributor

How to categorize this issue?

/area robustness
/kind bug
/priority normal

What happened:

We have observed situations where grm gets stuck reconciling a specific managed resource and no longer acts on it.
In all cases I observed, this coincided with either a longer period of downtime of the source or target API server (before #95) or a large amount of secret data in the target cluster (as described in #92).

What you expected to happen:

grm should not get stuck and should reconcile all managed resources at the given sync interval.

How to reproduce it (as minimally and precisely as possible):

Not sure yet.
My guess would be that the worker goroutines get stuck in some WaitForCacheSync call when the API server is unavailable for a longer period of time or the amount of watched data is too big.

Anything else we need to know?:

Environment:

  • Gardener-Resource-Manager version: v0.20.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@timebertt timebertt added the kind/bug Bug label Dec 9, 2020
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/normal labels Dec 9, 2020
@timebertt
Contributor Author

I think a possible solution, or at least a good first step, would be to use a context with a timeout for each reconciliation (e.g. 1m).
This way, the WaitForCacheSync funcs will return false once the timeout expires, and the key will be marked done in the queue so it can be reconciled again.

@rfranzke
Contributor

rfranzke commented Jan 8, 2021

/assign
