
ArgoCD landed AVP Secrets with placeholder values due to application-controller crash loop #17855

Open
paterczm opened this issue Apr 15, 2024 · 3 comments
Labels
bug Something isn't working

Comments

paterczm commented Apr 15, 2024

Hello!

We're using ArgoCD to manage cluster configurations. Recently, the application-controller pod entered a crashloop due to OOM errors. It happens. The concerning part is that it caused service degradation on the clusters it manages by replacing values in argocd-vault-plugin-managed Secrets with placeholder values, as if the plugin sidecar hadn't run at all. For example, this is what it did to the apiserver cert:

apiVersion: v1
data:
  tls.crt: PHRscy5wZW0+
  tls.key: PHRscy5rZXk+
kind: Secret
metadata:
  annotations:
    avp.kubernetes.io/path: apps/data/../cluster/apiserver
  labels:
    app.kubernetes.io/instance: cluster
  name: apiserver-tls
  namespace: openshift-config
type: kubernetes.io/tls
❯ echo PHRscy5wZW0+ | base64 -d
<tls.pem>

❯ echo PHRscy5rZXk+ | base64 -d
<tls.key>

Multiple Secrets were affected (not all) and some services did not take it well, resulting in cluster degradation.
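For anyone assessing the blast radius, here's a minimal sketch for enumerating Secrets that still hold unresolved placeholders. It assumes, as in the example above, that affected Secrets carry the avp.kubernetes.io/path annotation and that placeholders decode to AVP's <name> form:

#!/usr/bin/env bash
# Flag Secret values that still decode to an AVP placeholder like <tls.pem>.
kubectl get secrets --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["avp.kubernetes.io/path"] != null)
      | .metadata.namespace as $ns | .metadata.name as $n
      | (.data // {}) | to_entries[]
      | "\($ns)/\($n) \(.key) \(.value)"' \
  | while read -r secret key value; do
      decoded=$(printf '%s' "$value" | base64 -d 2>/dev/null)
      re='^<[^>]+>$'
      [[ "$decoded" =~ $re ]] && echo "unresolved: $secret key=$key -> $decoded"
    done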

I experimented with this later and was able to reproduce it with the following steps:

  1. Install ArgoCD from gitops-operator 1.12.1 (v2.10.5+335875d).
  2. Set up the AVP plugin as a sidecar.
  3. Create ArgoCD application(s) managing Secrets using AVP.
  4. Set the application-controller memory limit low enough that it gets OOMKilled (see the sketch below).
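For step 4, this is roughly what I mean; a sketch against the operator-managed ArgoCD CR (the CR name and namespace assume a default openshift-gitops install, and spec.controller.resources is the argocd-operator field for the controller's limits):

# Lower the application-controller memory limit until it OOMs under normal load.
kubectl -n openshift-gitops patch argocd openshift-gitops --type merge \
  -p '{"spec":{"controller":{"resources":{"limits":{"memory":"256Mi"}}}}}'
# Then watch for CrashLoopBackOff / OOMKilled:
kubectl -n openshift-gitops get pods -w | grep application-controller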

Expected results:
Sync fails entirely or partially, but when and where it works, it delivers correct resources, with Secrets correctly populated from HashiCorp Vault using argocd-vault-plugin (AVP).

Actual results:
A number of Secrets got synced with placeholder values instead of the actual tokens/certs/passwords from HashiCorp Vault.

When the application-controller is stable, ArgoCD works as expected, with no issues observed with Secret resolution or otherwise.

❯ argocd version
argocd: v2.10.3+0fd6344
  BuildDate: 2024-03-13T19:37:04Z
  GitCommit: 0fd6344537eb948cff602824a1d060421ceff40e
  GitTreeState: clean
  GoVersion: go1.21.7
  Compiler: gc
  Platform: linux/amd64
WARN[0000] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web. 
argocd-server: v2.10.5+335875d
  BuildDate: 2024-04-04T12:32:14Z
  GitCommit: 335875d13e018bed6e03873f4742582582964745
  GitTreeState: clean
  GoVersion: go1.21.7 (Red Hat 1.21.7-1.module+el8.10.0+21318+5ea197f8)
  Compiler: gc
  Platform: linux/amd64
  ExtraBuildInfo: {Vendor Information: Red Hat OpenShift GitOps version: v1.12.1}
  Kustomize Version: v5.2.1 unknown
  Helm Version: v3.14.0+g2a2fb3b
  Kubectl Version: v0.26.11
  Jsonnet Version: v0.20.0

The plugin is configured as a sidecar ConfigManagementPlugin:

  avp-kustomize.yaml: |
    apiVersion: argoproj.io/v1alpha1
    kind: ConfigManagementPlugin
    metadata:
      name: argocd-vault-plugin-kustomize
    spec:
      allowConcurrency: true
      discover:
        find:
          command:
            - find
            - "."
            - -name
            - kustomization.yaml
      generate:
        command:
          - bash
          - "-c"
          - "set -o pipefail; kustomize build . | argocd-vault-plugin generate -"
      lockRepo: false
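For completeness, the generate pipeline can be run outside the sidecar to confirm that placeholders resolve against Vault; a sketch (the AVP_*/VAULT_* variables are illustrative and depend on your backend and auth method):

# Run the same pipeline the plugin runs, from an application directory.
export AVP_TYPE=vault
export AVP_AUTH_TYPE=token
export VAULT_ADDR=https://vault.example.com   # hypothetical address
export VAULT_TOKEN=...                        # or AppRole credentials, depending on setup
set -o pipefail
kustomize build . | argocd-vault-plugin generate - > /tmp/rendered.yaml
# Crude check: unresolved placeholders keep their <name> form in the output.
grep -En '<[A-Za-z0-9._-]+>' /tmp/rendered.yaml && echo 'unresolved placeholders!' || echo 'looks resolved'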
@paterczm paterczm added the bug Something isn't working label Apr 15, 2024

paterczm commented Apr 15, 2024

I incorrectly reported the ArgoCD version; I've updated the initial comment to correct it.

jannfis commented Apr 16, 2024

This seems weird. Plugins are executed on behalf of the repository server, not the application controller. The application controller does not interact with plugins in any way.

Is there something else happening when the application controller gets OOM killed?
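One thing worth checking is whether the repo-server (which hosts the plugin sidecar) also restarted around the same time, e.g. (namespace assumes a default openshift-gitops install):

# Print the last termination reason for every container in the namespace;
# OOMKilled entries show which components actually crashed.
kubectl -n openshift-gitops get pods -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .status.containerStatuses[*]}{.name}{"="}{.lastState.terminated.reason}{" "}{end}{"\n"}{end}'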

paterczm commented

I didn't notice anything else wrong and didn't find anything interesting in the logs. There is definitely a correlation between the application-controller crashloop and the messed-up Secrets (I've seen it twice). Not sure about causality; perhaps it's indirect, or perhaps there was something else going on. I'll experiment some more if I find the time. So far we're running stable after bumping memory limits and sharding the app controller.
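For reference, the mitigation was roughly this (a sketch against the argocd-operator ArgoCD CR; the sharding fields are from the operator API, and the CR name/namespace assume a default install):

# Bump the controller memory limit and enable controller sharding.
kubectl -n openshift-gitops patch argocd openshift-gitops --type merge -p '
{"spec":{"controller":{
  "resources":{"limits":{"memory":"4Gi"}},
  "sharding":{"enabled":true,"replicas":2}
}}}'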
