Operator duplicates values from VMAlertManagerConfig to config secret #954

Closed
mariapoliss opened this issue May 16, 2024 · 4 comments
Labels: bug (Something isn't working)

@mariapoliss

Hello!

I found a problem with how the operator updates the Secret containing the Alertmanager config.
I am running the latest release at the moment, v0.44.0.
For example, if I create a VMAlertmanagerConfig like this:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlertmanagerConfig
metadata:
  name: test-amc
spec:
  inhibit_rules:
    - equal:
        - alertname
      source_matchers:
        - severity="critical"
      target_matchers:
        - severity="warning"
    - source_matchers:
        - alertname="QuietWeeklyNotifications"
      target_matchers:
        - alert_group="l2ci_weekly"
  receivers:
    - name: l2ci_receiver
      webhook_configs:
        - url: http://notification_stub_ci1:8080
    - name: blackhole
    - name: ca_em_receiver
      webhook_configs:
        - url: http://notification_stub_ci2:8080
  route:
    group_by:
      - alertname
      - l2ci_channel
    routes:
      - matchers:
          - alertname="QuietWeeklyNotifications"
        receiver: blackhole
      - matchers:
          - alertname="QuietDailyNotifications"
        receiver: blackhole
      - matchers:
          - alert_group=~"^l2ci.*"
        receiver: l2ci_receiver

The VictoriaMetrics operator watches test-amc and updates the secret whose name is set in the VMAlertmanager resource (spec.configSecret), and it does so on every reconcile cycle.

After the secret is updated, I see duplicated values (receivers, inhibit_rules and route.routes) in the secret's alertmanager.yaml data:

route:
  receiver: default-test-amc-blackhole
  routes:
    - routes:
        - matchers:
            - alertname="QuietWeeklyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alertname="QuietDailyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alert_group=~"^l2ci.*"
          receiver: default-test-amc-l2ci_receiver
          continue: false
      matchers:
        - namespace = "default"
      group_by:
        - alertname
        - l2ci_channel
      receiver: default-test-amc-blackhole
      continue: true
    - routes:
        - matchers:
            - alertname="QuietWeeklyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alertname="QuietDailyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alert_group=~"^l2ci.*"
          receiver: default-test-amc-l2ci_receiver
          continue: false
      matchers:
        - namespace = "default"
      group_by:
        - alertname
        - l2ci_channel
      receiver: default-test-amc-blackhole
      continue: true
inhibit_rules:
  - target_matchers:
      - severity="warning"
      - namespace = "default"
    source_matchers:
      - severity="critical"
      - namespace = "default"
    equal:
      - alertname
  - target_matchers:
      - alert_group="l2ci_weekly"
      - namespace = "default"
    source_matchers:
      - alertname="QuietWeeklyNotifications"
      - namespace = "default"
  - target_matchers:
      - severity="warning"
      - namespace = "default"
    source_matchers:
      - severity="critical"
      - namespace = "default"
    equal:
      - alertname
  - target_matchers:
      - alert_group="l2ci_weekly"
      - namespace = "default"
    source_matchers:
      - alertname="QuietWeeklyNotifications"
      - namespace = "default"
receivers:
  - name: default-test-amc-l2ci_receiver
    webhook_configs:
      - url: http://notification_stub_ci1:8080
  - name: default-test-amc-blackhole
  - name: default-test-amc-ca_em_receiver
    webhook_configs:
      - url: http://notification_stub_ci2:8080
  - name: default-test-amc-l2ci_receiver
    webhook_configs:
      - url: http://notification_stub_ci1:8080
  - name: default-test-amc-blackhole
  - name: default-test-amc-ca_em_receiver
    webhook_configs:
      - url: http://notification_stub_ci2:8080
templates: []

The root cause seems to be in the buildConfig function (https://github.com/VictoriaMetrics/operator/blob/v0.44.0/controllers/factory/alertmanager/config.go#L55): there is no check for duplicates against the existing receivers, inhibit rules, and subroutes.
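
As an illustration only, here is a minimal, self-contained Go sketch of the kind of duplicate check that appears to be missing. It is not the operator's actual code: the Receiver type and the appendUnique helper are hypothetical simplifications, and the same idea would apply to inhibit rules and subroutes:

package main

import "fmt"

// Receiver is a hypothetical, simplified stand-in for the receiver
// structure the operator generates from a VMAlertmanagerConfig.
type Receiver struct {
    Name string
}

// appendUnique adds only those generated receivers whose names are not
// already present in the base config, so repeated reconcile cycles
// cannot accumulate duplicates.
func appendUnique(base, generated []Receiver) []Receiver {
    seen := make(map[string]bool, len(base))
    for _, r := range base {
        seen[r.Name] = true
    }
    for _, r := range generated {
        if seen[r.Name] {
            continue // already present from a previous reconcile
        }
        seen[r.Name] = true
        base = append(base, r)
    }
    return base
}

func main() {
    base := []Receiver{{Name: "default-test-amc-blackhole"}}
    generated := []Receiver{
        {Name: "default-test-amc-blackhole"},
        {Name: "default-test-amc-l2ci_receiver"},
    }
    // Prints two receivers instead of three: the clashing name is skipped.
    fmt.Println(appendUnique(base, generated))
}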

Finally, I see error logs in vmalertmanager:

ts=2024-01-17T12:12:04.662Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config/alertmanager.yaml err="notification config name \"test-amc-l2ci_receiver\" is not unique"

and, after some time (several hours), in the operator as well, because the secret has accumulated too many duplicated values:

{"level":"error","ts":"2024-05-16T11:00:37Z","logger":"manager","msg":"Reconciler error","controller":"vmalertmanager","controllerGroup":"operator.victoriametrics.com","controllerKind":"VMAlertmanager","VMAlertmanager":{"name":"test-am","namespace":"test"},"namespace":"test","name":"test-am","reconcileID":"45f70f50-ccdd-4542-9c56-51835b8cdc99","error":"callback error: failed to check default Alertmanager config: Secret \"vmalertmanager-config\" is invalid: data: Too long: must have at most 1048576 bytes","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

I created a test to check the behavior locally, and the problem reproduces there as well.

alertmanager_test.zip

Could you please investigate and fix the problem?

@f41gh7 added the bug label May 16, 2024
@f41gh7 (Collaborator) commented May 16, 2024

Hello, thanks for reporting!

The issue is related to a name clash between the configSecret and the secret created by the operator itself. The operator builds the following name for its own secret: vmalertmanager-%CR_NAME-config, and it may clash with an existing secret.

An example of a config that triggers the bug:

apiVersion: v1
kind: Secret
metadata:
  name: vmalertmanager-example-config
  labels:
    app: vm-operator
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: 'blackhole'
    receivers:
      - name: 'blackhole'
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlertmanager
metadata:
  name: example
spec:
  replicaCount: 1
  configSecret: vmalertmanager-example-config
  selectAllByDefault: true

I think the proper fix is to ignore any content of the secret if its name matches the name of the secret created by the operator.

This will be fixed in the next patch release.
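
For illustration, a minimal, self-contained Go sketch of that guard, assuming the default secret name follows the vmalertmanager-%CR_NAME-config pattern described above; the function names here are hypothetical, not the operator's actual API:

package main

import "fmt"

// defaultConfigSecretName mirrors the naming scheme described above:
// vmalertmanager-%CR_NAME-config.
func defaultConfigSecretName(crName string) string {
    return fmt.Sprintf("vmalertmanager-%s-config", crName)
}

// shouldUseUserConfigSecret reports whether a user-supplied configSecret
// is safe to merge; a clash with the operator-generated name means its
// content should be ignored rather than merged back on every reconcile.
func shouldUseUserConfigSecret(crName, configSecret string) bool {
    return configSecret != "" && configSecret != defaultConfigSecretName(crName)
}

func main() {
    // The clashing example from the manifests above: content is ignored.
    fmt.Println(shouldUseUserConfigSecret("example", "vmalertmanager-example-config")) // false
    // A distinctly named secret would still be honored.
    fmt.Println(shouldUseUserConfigSecret("example", "extra-alertmanager-config")) // true
}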

f41gh7 added a commit that referenced this issue May 16, 2024
vmalertmanager: ignore content of configSecret if name clashes with default

Ignores any content from the secret defined at cr.spec.configSecret if its name clashes with the name of the secret created by the operator.

#954
f41gh7 added a commit that referenced this issue May 16, 2024
vmalertmanager: ignore content of configSecret if name clashes with default

Ignores any content from the secret defined at cr.spec.configSecret if its name clashes with the name of the secret created by the operator.

#954
@mariapoliss (Author) commented May 17, 2024

Thank you for the fast response!

One more detail: I set configSecret in my custom resource VMAlertmanager (not the default secret name), and the fix proposed now won't resolve the problem in the case where I set the secret name myself.

@f41gh7 (Collaborator) commented May 17, 2024

You have to set the value of configSecret to a different name; that should solve the issue. The linked PR should resolve it as well.

Thanks for the test case; I added it to the PR.

f41gh7 added a commit that referenced this issue May 20, 2024
vmalertmanager: ignore content of configSecret if name clashes with default (#955)

* vmalertmanager: ignore content of configSecret if name clashes with default

Ignores any content from the secret defined at cr.spec.configSecret if its name clashes with the name of the secret created by the operator.

#954

* add test
@f41gh7 (Collaborator) commented Jun 11, 2024

The changes were included in the v0.45.0 release.

@f41gh7 closed this as completed Jun 11, 2024