[8.13](backport #38471) [Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics #38761

Merged: 2 commits merged into 8.13 from mergify/bp/8.13/pr-38471 on Apr 9, 2024

Conversation

mergify bot (Contributor) commented on Apr 8, 2024

Proposed commit message

If the holder of the lease changes while using Metricbeat autodiscover, multiple hosts end up reporting the same metrics.

See issue #38543 for a more detailed description.

This only affects metrics that are unique cluster-wide, such as KSM (kube-state-metrics) metrics.
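
For reference, you can check the current holder of the lease used by the autodiscover provider (the lease name comes from the test steps below):

# lease name as created by the leader-election provider in this setup
$ kubectl get lease metricbeat-cluster-leader -n kube-system -o jsonpath='{.spec.holderIdentity}'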

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Create a two-node cluster, with Metricbeat running on both nodes.
  2. Use the Metricbeat image built from this branch to deploy the Metricbeat instance.

If you want to see this fix in action, see the Results section below.

Edit: here are more detailed steps for this:

Create a two-node cluster. You can do it this way:
$ kind create cluster --config kind-config.yaml

And kind-config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: KubeProxyConfiguration
        metricsBindAddress: "0.0.0.0"
  - role: worker
Build a Docker image from this branch, in a directory where you have the metricbeat binary.

First, build metricbeat with GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build in the metricbeat directory.
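
For example (the exact path is an assumption, depending on where you checked out the branch):

# assumes the current directory is the root of the beats repository
$ cd metricbeat
$ GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build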

You can use this Dockerfile to create the docker image:

FROM ubuntu:20.04

WORKDIR /usr/share/metricbeat

COPY metricbeat /usr/share/metricbeat/metricbeat

ENTRYPOINT ["./metricbeat"]

CMD [ "-e" ]

Then run:

$ docker build -t metricbeat-run-image .

And then load it into the kind nodes:

$ kind load docker-image metricbeat-run-image:latest

Deploy the Metricbeat manifest with this image.

I am using this manifest, which only has state_node enabled, plus a few other metricsets.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  metricbeat.yml: |-
    metricbeat.config.modules:
      # Mounted `metricbeat-daemonset-modules` configmap:
      path: ${path.config}/modules.d/*.yml
      # Reload module configs as they change:
      reload.enabled: false

    metricbeat.autodiscover:
      providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}
          unique: true
          templates:
            - config:
                - module: kubernetes
                  hosts: ["kube-state-metrics:8080"]
                  period: 10s
                  add_metadata: true
                  metricsets:
                    - state_node
                    #- state_deployment
                    #- state_daemonset
                    #- state_replicaset
                    #- state_pod
                    #- state_container
                    #- state_cronjob
                    #- state_resourcequota
                    #- state_statefulset
                    #- state_service
                    #- state_persistentvolume
                    #- state_persistentvolumeclaim
                    #- state_storageclass
                    #- state_namespace
                - module: kubernetes
                  metricsets:
                    - apiserver
                  hosts: ["https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"]
                  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                  ssl.certificate_authorities:
                    - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  period: 30s
                # Uncomment this to get k8s events:
                #- module: kubernetes
                #  metricsets:
                #    - event
        # To enable hints based autodiscover uncomment this:
        #- type: kubernetes
        #  node: ${NODE_NAME}
        #  hints.enabled: true

    logging.level: debug

    processors:
      - add_cloud_metadata:

    cloud.id: ${ELASTIC_CLOUD_ID}
    cloud.auth: ${ELASTIC_CLOUD_AUTH}

    output.elasticsearch:
      hosts: ['https://${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
      ssl.verification_mode: "none"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-modules
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  #system.yml: |-
  #  - module: system
  #    period: 10s
  #    metricsets:
  #      - cpu
  #      - load
  #      - memory
  #      - network
  #      - process
  #      - process_summary
  #      #- core
  #      #- diskio
  #      #- socket
  #    processes: ['.*']
  #    process.include_top_n:
  #      by_cpu: 5      # include top 5 processes by CPU
  #      by_memory: 5   # include top 5 processes by memory
#
  #  - module: system
  #    period: 1m
  #    metricsets:
  #      - filesystem
  #      - fsstat
  #    processors:
  #    - drop_event.when.regexp:
  #        system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)'
  kubernetes.yml: |-
    - module: kubernetes
      metricsets:
        - node
        #- system
        - pod
        - container
        #- volume
      period: 10s
      host: ${NODE_NAME}
      hosts: ["https://${NODE_NAME}:10250"]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
      # If there is a CA bundle that contains the issuer of the certificate used in the Kubelet API,
      # remove ssl.verification_mode entry and use the CA, for instance:
      #ssl.certificate_authorities:
        #- /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
    # Currently `proxy` metricset is not supported on Openshift, comment out section
    #- module: kubernetes
    #  metricsets:
    #    - proxy
    #  period: 10s
    #  host: ${NODE_NAME}
    #  hosts: ["localhost:10249"]
---
# Deploy a Metricbeat instance per node for node metrics retrieval
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
spec:
  selector:
    matchLabels:
      k8s-app: metricbeat
  template:
    metadata:
      labels:
        k8s-app: metricbeat
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
      serviceAccountName: metricbeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: metricbeat
          image: metricbeat-run-image
          imagePullPolicy: Never
          args: [
            "-c", "/etc/metricbeat.yml",
            "-e",
            "-system.hostfs=/hostfs",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: elasticsearch
            - name: ELASTICSEARCH_PORT
              value: "9200"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: "changeme"
            - name: ELASTIC_CLOUD_ID
              value:
            - name: ELASTIC_CLOUD_AUTH
              value:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: config
              mountPath: /etc/metricbeat.yml
              readOnly: true
              subPath: metricbeat.yml
            - name: data
              mountPath: /usr/share/metricbeat/data
            - name: modules
              mountPath: /usr/share/metricbeat/modules.d
              readOnly: true
            - name: proc
              mountPath: /hostfs/proc
              readOnly: true
            - name: cgroup
              mountPath: /hostfs/sys/fs/cgroup
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
        - name: config
          configMap:
            defaultMode: 0640
            name: metricbeat-daemonset-config
        - name: modules
          configMap:
            defaultMode: 0640
            name: metricbeat-daemonset-modules
        - name: data
          hostPath:
            # When metricbeat runs as non-root user, this directory needs to be writable by group (g+w)
            path: /var/lib/metricbeat-data
            type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metricbeat
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - namespaces
      - events
      - pods
      - services
      - persistentvolumes
      - persistentvolumeclaims
    verbs: ["get", "list", "watch"]
  # Enable this rule only if planing to use Kubernetes keystore
  #- apiGroups: [""]
  #  resources:
  #  - secrets
  #  verbs: ["get"]
  - apiGroups: ["extensions"]
    resources:
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - deployments
      - replicasets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      - storageclasses
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - ""
    resources:
      - nodes/stats
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat
  # should be the namespace where metricbeat is running
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
---

Don't forget to set up your Elasticsearch output if you are not using the Elastic Stack defaults.

Then deploy the manifest.
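
For example, assuming you saved the manifest above to a file:

# metricbeat-manifest.yaml is a placeholder filename
$ kubectl apply -f metricbeat-manifest.yaml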

Update the lease object like this, so that a lease renewal failure occurs.

Depending on the current holder, you might have to adjust holderIdentity:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: metricbeat-cluster-leader
  namespace: kube-system
spec:
  holderIdentity: beats-leader-metricbeat-worker
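
Instead of editing the YAML by hand, a one-liner sketch (the holderIdentity value depends on your cluster and node names):

# set holderIdentity to a value different from the current holder
$ kubectl patch lease metricbeat-cluster-leader -n kube-system \
    --type merge -p '{"spec":{"holderIdentity":"beats-leader-metricbeat-worker"}}'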

Related issues

Closes #34998.
Relates #38543.

Results

The lease belongs to the control-plane Metricbeat instance:

c@c:~$ kubectl get leases -n kube-system
NAME                                   HOLDER                                                                      AGE
...
metricbeat-cluster-leader              beats-leader-metricbeat-cluster-control-plane                               4s

Logs from the leader, the control-plane Metricbeat instance:

I0320 14:35:40.853155       1 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2024-03-20T14:35:40.853Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func1","file.name":"kubernetes/kubernetes.go","file.line":300},"message":"leader election lock GAINED, holder: beats-leader-metricbeat-cluster-control-plane, eventID: metricbeat-cluster-leader-kube-system-1710945340853430124","service.name":"metricbeat","ecs.version":"1.6.0"}
Change the leader by modifying the lease as shown above, which will cause a failure on lease renewal.

Now we see, in the logs from the previous leader (the control-plane Metricbeat instance):

I0320 14:36:47.313084       1 leaderelection.go:283] failed to renew lease kube-system/metricbeat-cluster-leader: timed out waiting for the condition
{"log.level":"debug","@timestamp":"2024-03-20T14:36:47.313Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func2","file.name":"kubernetes/kubernetes.go","file.line":304},"message":"leader election lock LOST, holder: beats-leader-metricbeat-cluster-control-plane, eventID: metricbeat-cluster-leader-kube-system-1710945340853430124","service.name":"metricbeat","ecs.version":"1.6.0"}

And in the logs from the new leader (the worker Metricbeat instance):

I0320 14:36:52.497868       1 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2024-03-20T14:36:52.498Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func1","file.name":"kubernetes/kubernetes.go","file.line":300},"message":"leader election lock GAINED, holder: beats-leader-metricbeat-cluster-worker, eventID: metricbeat-cluster-leader-kube-system-1710945412498017908","service.name":"metricbeat","ecs.version":"1.6.0"}

This results in Discover reporting metrics from only one host.name. Check by comparing the number of documents before and after the lease holder changed: the counts remain the same, so we are no longer getting the duplicates we had before:
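
If you prefer to check outside Discover, a rough sketch, assuming the default metricbeat-* indices and the credentials from the manifest above:

# host, index pattern, and credentials are assumptions
$ curl -sk -u elastic:changeme 'https://localhost:9200/metricbeat-*/_search' \
    -H 'Content-Type: application/json' -d '{
      "size": 0,
      "query": { "term": { "metricset.name": "state_node" } },
      "aggs": { "by_host": { "terms": { "field": "host.name" } } }
    }'

Each host.name bucket should keep roughly the same document rate before and after the holder change; a second bucket appearing for the same period would mean duplicates.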



This is an automatic backport of pull request #38471 done by [Mergify](https://mergify.com).

[Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics (#38471)

* Fix event id

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update libbeat/autodiscover/providers/kubernetes/kubernetes.go

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* Update libbeat/autodiscover/providers/kubernetes/kubernetes.go

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* add space to log line

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* change log.debug order

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* - run leader elector until context is cancelled
- add unit tests

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* fix lint errors

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* mage check

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* use assert instead of require

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Add test comments

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update docs

Signed-off-by: constanca <constanca.manteigas@elastic.co>

---------

Signed-off-by: constanca <constanca.manteigas@elastic.co>
Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>
(cherry picked from commit 5947565)
mergify bot requested a review from a team as a code owner on April 8, 2024 10:28
mergify bot added the backport label on Apr 8, 2024
mergify bot requested review from ycombinator and belimawr and removed the request for a team on April 8, 2024 10:28
botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on Apr 8, 2024
botelastic bot commented on Apr 8, 2024

This pull request doesn't have a Team:<team> label.

elasticmachine (Collaborator) commented on Apr 8, 2024

💚 Build Succeeded


Build stats

  • Duration: 134 min 26 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

constanca-m merged commit 938e13c into 8.13 on Apr 9, 2024
89 checks passed
constanca-m deleted the mergify/bp/8.13/pr-38471 branch on April 9, 2024 06:59