source-controller OOM events #303

Closed
robparrott opened this issue Feb 24, 2021 · 18 comments

Comments

@robparrott

Describe the bug

When registering Flux against a repository on GitLab Enterprise, I am seeing OOM activity on the source-controller pod. Removing the 1GB memory limit fixes the issue.

To Reproduce

Register Flux against a repo with some level of complexity, I believe.

Expected behavior

The source-controller pod should not be killed and restarted repeatedly.

Additional context

  • Kubernetes version: 1.19
  • Git provider: gitlab self-hosted
  • Container registry provider: gitlab/ECR

Output of the following commands:

flux --version : flux version 0.8.0
flux check
► checking prerequisites
✔ kubectl 1.19.3 >=1.18.0
✔ Kubernetes 1.19.6-eks-49a6c0 >=1.16.0
► checking controllers

✔ source-controller: healthy
► ghcr.io/fluxcd/source-controller:v0.8.1
✔ kustomize-controller: healthy
► ghcr.io/fluxcd/kustomize-controller:v0.8.1
✔ helm-controller: healthy
► ghcr.io/fluxcd/helm-controller:v0.7.0
✔ notification-controller: healthy
► ghcr.io/fluxcd/notification-controller:v0.8.0
✔ all checks passed
kubectl -n flux-system get all
NAME                                           READY   STATUS             RESTARTS   AGE
pod/helm-controller-6946b6dc7f-5nr8q           1/1     Running            0          9m34s
pod/kustomize-controller-55dfcdfd58-xj25c      1/1     Running            0          10h
pod/notification-controller-649754966b-2677x   1/1     Running            0          10h
pod/source-controller-597cc769b-lp6w4          0/1     CrashLoopBackOff   5          6m23s

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/notification-controller   ClusterIP   10.100.114.245   <none>        80/TCP    10h
service/source-controller         ClusterIP   10.100.185.20    <none>        80/TCP    10h
service/webhook-receiver          ClusterIP   10.100.198.200   <none>        80/TCP    10h

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/helm-controller           1/1     1            1           10h
deployment.apps/kustomize-controller      1/1     1            1           10h
deployment.apps/notification-controller   1/1     1            1           10h
deployment.apps/source-controller         0/1     1            0           10h

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/helm-controller-6779d46d69           0         0         0       10h
replicaset.apps/helm-controller-6946b6dc7f           1         1         1       9m34s
replicaset.apps/kustomize-controller-55dfcdfd58      1         1         1       10h
replicaset.apps/notification-controller-649754966b   1         1         1       10h
replicaset.apps/source-controller-555d4f9d6          0         0         0       10h
replicaset.apps/source-controller-597cc769b          1         1         0       10h

kubectl -n <namespace> logs deploy/source-controller

--- various, without errors until killed ---

kubectl -n <namespace> logs deploy/kustomize-controller

--- various ---

level":"info","ts":"2021-02-24T00:06:40.724Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"istio-system","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.811Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.815Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"error","ts":"2021-02-24T00:06:41.825Z","logger":"controller.kustomization","msg":"Reconciliation failed after 1.059192016s, next try in 5m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"podinfo","namespace":"flux-system","revision"
:"master/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7","error":"failed to download artifact from http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz, error: Get \"http://source-controller.flux-system.svc.cl
uster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz\": dial tcp 10.100.185.20:80: connect: connection refused"}
{"level":"info","ts":"2021-02-24T00:06:41.843Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.833Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.834Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.855Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.863Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.872Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.874Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.875Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.893Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}


@robparrott
Author

Changing the source-controller deployment resources stanza as follows:

        resources:
          limits:
            cpu: 1000m
            #memory: 1Gi
          requests:
            cpu: 50m
            #memory: 64Mi

addresses the issue.
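For a bootstrap-managed cluster, the same change can be kept declarative instead of editing the Deployment by hand. A minimal sketch of a kustomize patch in the flux-system Kustomization, assuming the stock gotk-components.yaml layout and the controllers' "manager" container name; the 2Gi value mirrors what later comments settled on, rather than dropping the limit entirely:

    # flux-system/kustomization.yaml -- illustrative sketch, not taken from
    # this issue; adjust names and values to your cluster.
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - gotk-components.yaml
      - gotk-sync.yaml
    patches:
      - patch: |
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: source-controller
            namespace: flux-system
          spec:
            template:
              spec:
                containers:
                  - name: manager
                    resources:
                      limits:
                        cpu: 1000m
                        memory: 2Gi   # raised from the default 1Gi
                      requests:
                        cpu: 50m
                        memory: 64Mi
        target:
          kind: Deployment
          name: source-controller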

@stefanprodan transferred this issue from fluxcd/flux2 on Feb 24, 2021
@mahmoud-abdelhafez

I had the same issue, but in my case increasing the memory limit to 2Gi mitigated it.

@hihellobolke

I am seeing OOMs with 2Gi and I am on v0.14.1.

@thomasroot

Same here on flux2 version 0.16.2. Increasing the memory limits to 2Gi mitigated the issue.

@runningman84

This issue seems to be linked to:
#192
Our clusters also suffer from this issue; we see memory usage of 1-2GB.

Generally speaking, it is strange that a service which just downloads some files from other repos consumes so much memory.

@kav

kav commented Sep 2, 2021

I was able to trigger this issue by putting interval: 1d in my HelmRepository spec. Happy to file separately if needed, but I'm trying to limit the issue count on source-controller OOM.

@hiddeco
Member

hiddeco commented Sep 2, 2021

As with any workload on Kubernetes, the right resource limit configuration highly depends on what you are making the source-controller do (and you may thus have to increase it).

Helm-related operations, for example, are resource-intensive because at present we haven't found the right optimization path to work with repository index files without loading them into memory in full (due to certain constraints around the unmarshalling of YAML).

Combined with the popularity of some solutions like Artifactory, which likes to stuff as much as possible in a single index (in some cases resulting in a file of >100MB), and the fact that the reconciliation of resources is isolated, resource usage exceeding the defaults can be expected.

Another task that can be resource-intensive is the packaging of a Helm chart from a Git source, because Helm first loads all the chart data into an object in memory (including all files, and the files of the dependencies) before writing it to disk.

For a fun experiment: check the current resources your CI worker nodes have (or ask around), or monitor the resource usage of various helm commands on your local machine, and then take into account that the controller does this in parallel with multiple workers, for multiple resources.


Generally speaking, it is strange that a service which just downloads some files from other repos consumes so much memory.

The controller does much more than just downloading files, and I think you are oversimplifying or underestimating the inner workings of the controller, and ignoring the fact that it has several features that perform composition tasks, etc. In addition, to ensure proper isolation of e.g. credentials, most Git things are done in memory as well.

I was able to trigger this issue by putting interval: 1d in my HelmRepository spec. Happy to file separately if needed, but I'm trying to limit the issue count on source-controller OOM.

Your Helm index likely is simply too big, or your resource limit settings are too low; see the explanation above.


Lastly, we are continuously looking into ways to reduce the footprint of our controllers, and I can already tell you some paths have been identified (and are actively worked on) to help reduce it.

Do, however, always keep in mind that while the YAML creates simple-looking and composable abstractions, there will always be processes behind it that actually execute the task, and that the hardware of your local development machine often outperforms most containers.
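As one lever: because the per-worker cost multiplies, lowering the controller's --concurrent flag (the number of parallel reconciles) trades reconciliation throughput for a lower peak footprint. A rough sketch as a kustomize patch; the value is illustrative:

    # Sketch only: cap parallel reconciles to reduce peak memory; reconciling
    # many sources will take correspondingly longer.
    patches:
      - patch: |
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --concurrent=1
        target:
          kind: Deployment
          name: source-controller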

@kav

kav commented Sep 2, 2021

Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.

No, it appears 1d is simply not valid, per the log. Sorry, I should have included that:

E0902 19:20:30.626842       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.1/tools/cache/reflector.go:167: Failed to watch *v1beta1.HelmRepository: failed to list *v1beta1.HelmRepository: v1beta1.HelmRepositoryList.Items: []v1beta1.HelmRepository: v1beta1.HelmRepository.Spec: v1beta1.HelmRepositorySpec.Timeout: Interval: unmarshalerDecoder: time: unknown unit "d" in duration "1d", error found in #10 byte of ...|rval":"1d","timeout"|..., bigger context ...|0-4596-8543-9d6d4b573433"},"spec":{"interval":"1d","timeout":"60s","url":"https://raw.githubusercont|...

@hiddeco
Member

hiddeco commented Sep 2, 2021

That is expected, as 1d is simply invalid.

There is no definition for units of Day or larger to avoid confusion across daylight savings time zone transitions.

A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
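The intent behind 1d can still be expressed with valid units, e.g. 24h. A minimal sketch (the repository name and URL here are illustrative, not from this issue):

    apiVersion: source.toolkit.fluxcd.io/v1beta1
    kind: HelmRepository
    metadata:
      name: podinfo            # illustrative
      namespace: flux-system
    spec:
      interval: 24h            # "1d" is rejected; "h" is the largest valid unit
      url: https://stefanprodan.github.io/podinfo   # illustrative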

@kav

kav commented Sep 2, 2021

Yes, sure, but it synchronized that change from the repository into the HelmRepository resource and then OOMed the source-controller trying to read it. I backed out the change in Git, but then had to manually edit the HelmRepository object since the source-controller was hung. I'm not saying it should support days, just that this is a footgun. If it's not supported, I would have expected the HelmRepository to fail validation on the sync.

@hiddeco
Member

hiddeco commented Sep 3, 2021

@kav can you please move this into a separate issue? I did a small test yesterday evening and was indeed able to apply a resource with an invalid interval format, but the cluster I was testing on wasn't running any controllers at the time, so I wasn't able to validate the crash.

apatelGWS added a commit to apatelGWS/flux2-kustomize-helm-example that referenced this issue on Feb 21, 2022: "updated source-controller deployment according to this issue: fluxcd/source-controller#303"
@mkoertgen

Having the same issue with OOMKilled. With the information from #192 I pinned it down to the large Bitnami Helm repository, whose index file alone is 13.4M.

[screenshot: source-controller memory usage]

@stefanprodan
Member

For large Helm repository index files, you can enable caching to reduce the memory footprint of source-controller, docs here: https://fluxcd.io/docs/cheatsheets/bootstrap/#enable-helm-repositories-caching
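A sketch of what enabling the cache looks like as a kustomize patch, using the flags from the linked docs; the size, TTL, and purge interval here are illustrative, not recommendations:

    # Illustrative values; see the linked cheatsheet for guidance.
    patches:
      - patch: |
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --helm-cache-max-size=10
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --helm-cache-ttl=60m
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --helm-cache-purge-interval=5m
        target:
          kind: Deployment
          name: source-controller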

@mkoertgen

Thanks for the documentation link @stefanprodan. That was helpful.

Removing the Bitnami Helm repositories in redundant namespaces brought the memory footprint down to 190M, yet it still peaks every 10 minutes (the Helm repository update interval).

[screenshot: memory usage graph, peaking every 10 minutes]

I will check on enabling helm-caching. Thanks again, much appreciated.

@mkoertgen

I needed to update 0.28 -> 0.30 so the Helm cache arguments were available.

gotk_cache_events_total looks good so far. I will keep observing the memory footprint, but for now this seems to solve the issue, at least for me.

Thanks again.

@mkoertgen

Looks much better with helm-caching enabled

[screenshot: memory usage graph after enabling caching]

@stefanprodan
Member

Yep, that's consistent with what I'm seeing on my test clusters: using the source-controller cache brought the memory from 2GB down to 200MB.

@stefanprodan
Member

The Helm caching documentation is now here: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
