[8.13](backport #38471) [Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics #38761

Merged: 2 commits merged into 8.13 from mergify/bp/8.13/pr-38471 on Apr 9, 2024

Conversation

mergify bot (Contributor) commented on Apr 8, 2024

Proposed commit message

If the holder of the lease changes while using Metricbeat autodiscover, multiple hosts end up reporting the same metrics.

See issue #38543 for a more detailed description.

This only affects metrics that are unique cluster-wide, such as KSM (kube-state-metrics) metrics.
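
For reference, you can check the current holder of the lease used by the autodiscover provider (the lease name comes from the test steps below):

# lease name as created by the leader-election provider in this setup
$ kubectl get lease metricbeat-cluster-leader -n kube-system -o jsonpath='{.spec.holderIdentity}'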

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Create a two-node cluster, with Metricbeat running on both nodes.
  2. Use the Metricbeat image built from this branch to deploy the Metricbeat instance.

If you want to see this fix in action, see the Results section below.

Edit: here are more detailed steps for this:

Create a two-node cluster. You can do it this way:
$ kind create cluster --config kind-config.yaml

And kind-config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: KubeProxyConfiguration
        metricsBindAddress: "0.0.0.0"
  - role: worker
Build a Docker image from this branch, in a directory where you have the metricbeat binary.

First, build metricbeat with GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build in the metricbeat directory.
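
For example (the exact path is an assumption, depending on where you checked out the branch):

# assumes the current directory is the root of the beats repository
$ cd metricbeat
$ GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build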

You can use this Dockerfile to create the docker image:

FROM ubuntu:20.04

WORKDIR /usr/share/metricbeat

COPY metricbeat /usr/share/metricbeat/metricbeat

ENTRYPOINT ["./metricbeat"]

CMD [ "-e" ]

Then run:

$ docker build -t metricbeat-run-image .

And then load it into the kind nodes:

$ kind load docker-image metricbeat-run-image:latest

Deploy the Metricbeat manifest with this image.

I am using this manifest, which only has state_node enabled, plus a few other metricsets.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  metricbeat.yml: |-
    metricbeat.config.modules:
      # Mounted `metricbeat-daemonset-modules` configmap:
      path: ${path.config}/modules.d/*.yml
      # Reload module configs as they change:
      reload.enabled: false

    metricbeat.autodiscover:
      providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}
          unique: true
          templates:
            - config:
                - module: kubernetes
                  hosts: ["kube-state-metrics:8080"]
                  period: 10s
                  add_metadata: true
                  metricsets:
                    - state_node
                    #- state_deployment
                    #- state_daemonset
                    #- state_replicaset
                    #- state_pod
                    #- state_container
                    #- state_cronjob
                    #- state_resourcequota
                    #- state_statefulset
                    #- state_service
                    #- state_persistentvolume
                    #- state_persistentvolumeclaim
                    #- state_storageclass
                    #- state_namespace
                - module: kubernetes
                  metricsets:
                    - apiserver
                  hosts: ["https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"]
                  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                  ssl.certificate_authorities:
                    - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  period: 30s
                # Uncomment this to get k8s events:
                #- module: kubernetes
                #  metricsets:
                #    - event
        # To enable hints based autodiscover uncomment this:
        #- type: kubernetes
        #  node: ${NODE_NAME}
        #  hints.enabled: true

    logging.level: debug

    processors:
      - add_cloud_metadata:

    cloud.id: ${ELASTIC_CLOUD_ID}
    cloud.auth: ${ELASTIC_CLOUD_AUTH}

    output.elasticsearch:
      hosts: ['https://${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
      ssl.verification_mode: "none"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-modules
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  #system.yml: |-
  #  - module: system
  #    period: 10s
  #    metricsets:
  #      - cpu
  #      - load
  #      - memory
  #      - network
  #      - process
  #      - process_summary
  #      #- core
  #      #- diskio
  #      #- socket
  #    processes: ['.*']
  #    process.include_top_n:
  #      by_cpu: 5      # include top 5 processes by CPU
  #      by_memory: 5   # include top 5 processes by memory
#
  #  - module: system
  #    period: 1m
  #    metricsets:
  #      - filesystem
  #      - fsstat
  #    processors:
  #    - drop_event.when.regexp:
  #        system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)'
  kubernetes.yml: |-
    - module: kubernetes
      metricsets:
        - node
        #- system
        - pod
        - container
        #- volume
      period: 10s
      host: ${NODE_NAME}
      hosts: ["https://${NODE_NAME}:10250"]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
      # If there is a CA bundle that contains the issuer of the certificate used in the Kubelet API,
      # remove ssl.verification_mode entry and use the CA, for instance:
      #ssl.certificate_authorities:
        #- /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
    # Currently `proxy` metricset is not supported on Openshift, comment out section
    #- module: kubernetes
    #  metricsets:
    #    - proxy
    #  period: 10s
    #  host: ${NODE_NAME}
    #  hosts: ["localhost:10249"]
---
# Deploy a Metricbeat instance per node for node metrics retrieval
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
spec:
  selector:
    matchLabels:
      k8s-app: metricbeat
  template:
    metadata:
      labels:
        k8s-app: metricbeat
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
      serviceAccountName: metricbeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: metricbeat
          image: metricbeat-run-image
          imagePullPolicy: Never
          args: [
            "-c", "/etc/metricbeat.yml",
            "-e",
            "-system.hostfs=/hostfs",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: elasticsearch
            - name: ELASTICSEARCH_PORT
              value: "9200"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: "changeme"
            - name: ELASTIC_CLOUD_ID
              value:
            - name: ELASTIC_CLOUD_AUTH
              value:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: config
              mountPath: /etc/metricbeat.yml
              readOnly: true
              subPath: metricbeat.yml
            - name: data
              mountPath: /usr/share/metricbeat/data
            - name: modules
              mountPath: /usr/share/metricbeat/modules.d
              readOnly: true
            - name: proc
              mountPath: /hostfs/proc
              readOnly: true
            - name: cgroup
              mountPath: /hostfs/sys/fs/cgroup
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
        - name: config
          configMap:
            defaultMode: 0640
            name: metricbeat-daemonset-config
        - name: modules
          configMap:
            defaultMode: 0640
            name: metricbeat-daemonset-modules
        - name: data
          hostPath:
            # When metricbeat runs as non-root user, this directory needs to be writable by group (g+w)
            path: /var/lib/metricbeat-data
            type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metricbeat
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - namespaces
      - events
      - pods
      - services
      - persistentvolumes
      - persistentvolumeclaims
    verbs: ["get", "list", "watch"]
  # Enable this rule only if planing to use Kubernetes keystore
  #- apiGroups: [""]
  #  resources:
  #  - secrets
  #  verbs: ["get"]
  - apiGroups: ["extensions"]
    resources:
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - deployments
      - replicasets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      - storageclasses
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - ""
    resources:
      - nodes/stats
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat
  # should be the namespace where metricbeat is running
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
---

Don't forget to set up your Elasticsearch output if you are not using the Elastic Stack defaults.

Then deploy the manifest.
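
For example, assuming you saved the manifest above to a file:

# metricbeat-manifest.yaml is a placeholder filename
$ kubectl apply -f metricbeat-manifest.yaml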

Update the lease object like this, so that a lease renewal failure occurs.

Depending on the current holder, you might have to adjust holderIdentity:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: metricbeat-cluster-leader
  namespace: kube-system
spec:
  holderIdentity: beats-leader-metricbeat-worker
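
Instead of editing the YAML by hand, a one-liner sketch (the holderIdentity value depends on your cluster and node names):

# set holderIdentity to a value different from the current holder
$ kubectl patch lease metricbeat-cluster-leader -n kube-system \
    --type merge -p '{"spec":{"holderIdentity":"beats-leader-metricbeat-worker"}}'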

Related issues

Closes #34998.
Relates #38543.

Results

The lease belongs to the control-plane Metricbeat instance:

c@c:~$ kubectl get leases -n kube-system
NAME                                   HOLDER                                                                      AGE
...
metricbeat-cluster-leader              beats-leader-metricbeat-cluster-control-plane                               4s

Logs from the leader, the control-plane Metricbeat instance:

I0320 14:35:40.853155       1 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2024-03-20T14:35:40.853Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func1","file.name":"kubernetes/kubernetes.go","file.line":300},"message":"leader election lock GAINED, holder: beats-leader-metricbeat-cluster-control-plane, eventID: metricbeat-cluster-leader-kube-system-1710945340853430124","service.name":"metricbeat","ecs.version":"1.6.0"}
Change the leader by modifying the lease as shown above, which will cause a failure on lease renewal.

Now we see, in the logs from the previous leader (the control-plane Metricbeat instance):

I0320 14:36:47.313084       1 leaderelection.go:283] failed to renew lease kube-system/metricbeat-cluster-leader: timed out waiting for the condition
{"log.level":"debug","@timestamp":"2024-03-20T14:36:47.313Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func2","file.name":"kubernetes/kubernetes.go","file.line":304},"message":"leader election lock LOST, holder: beats-leader-metricbeat-cluster-control-plane, eventID: metricbeat-cluster-leader-kube-system-1710945340853430124","service.name":"metricbeat","ecs.version":"1.6.0"}

And in the logs from the new leader (the worker Metricbeat instance):

I0320 14:36:52.497868       1 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2024-03-20T14:36:52.498Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func1","file.name":"kubernetes/kubernetes.go","file.line":300},"message":"leader election lock GAINED, holder: beats-leader-metricbeat-cluster-worker, eventID: metricbeat-cluster-leader-kube-system-1710945412498017908","service.name":"metricbeat","ecs.version":"1.6.0"}

This results in Discover reporting metrics from only one host.name. Check by comparing the number of documents before and after the lease holder changed: the counts remain the same, so we are no longer getting the duplicates we had before:
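
If you prefer to check outside Discover, a rough sketch, assuming the default metricbeat-* indices and the credentials from the manifest above:

# host, index pattern, and credentials are assumptions
$ curl -sk -u elastic:changeme 'https://localhost:9200/metricbeat-*/_search' \
    -H 'Content-Type: application/json' -d '{
      "size": 0,
      "query": { "term": { "metricset.name": "state_node" } },
      "aggs": { "by_host": { "terms": { "field": "host.name" } } }
    }'

Each host.name bucket should keep roughly the same document rate before and after the holder change; a second bucket appearing for the same period would mean duplicates.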



This is an automatic backport of pull request #38471 done by [Mergify](https://mergify.com).

[Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics (#38471)

* Fix event id

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update libbeat/autodiscover/providers/kubernetes/kubernetes.go

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* Update libbeat/autodiscover/providers/kubernetes/kubernetes.go

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* add space to log line

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* change log.debug order

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* - run leader elector until context is cancelled
- add unit tests

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* fix lint errors

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* mage check

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* use assert instead of require

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Add test comments

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update docs

Signed-off-by: constanca <constanca.manteigas@elastic.co>

---------

Signed-off-by: constanca <constanca.manteigas@elastic.co>
Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>
(cherry picked from commit 5947565)
mergify bot requested a review from a team as a code owner on April 8, 2024 10:28
mergify bot added the backport label on Apr 8, 2024
mergify bot requested review from ycombinator and belimawr and removed the request for a team on April 8, 2024 10:28
botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on Apr 8, 2024
botelastic bot commented on Apr 8, 2024

This pull request doesn't have a Team:<team> label.

elasticmachine (Collaborator) commented on Apr 8, 2024

💚 Build Succeeded


Build stats

  • Duration: 134 min 26 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

constanca-m merged commit 938e13c into 8.13 on Apr 9, 2024
89 checks passed
constanca-m deleted the mergify/bp/8.13/pr-38471 branch on April 9, 2024 06:59