
[Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics #38471

Merged
merged 23 commits into elastic:main on Apr 8, 2024

Conversation

@constanca-m (Contributor) commented Mar 20, 2024

Proposed commit message

If the lease holder changes while using Metricbeat autodiscover, multiple hosts end up reporting the same metrics.

See issue #38543 for a more detailed description.

This only affects metrics that are unique cluster-wide, such as the kube-state-metrics (KSM) metrics.
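The duplication comes from autodiscover start/stop events not matching up when leadership changes. As a rough, hypothetical illustration of the idea behind the fix (not the actual libbeat autodiscover code): if the stop event reuses the same event ID as the matching start event, losing the lease cleanly removes the cluster-scoped config, so old and new leaders never report in parallel:

```python
import time

# Hypothetical sketch; names and structure are illustrative, not the
# real libbeat autodiscover implementation.
class LeaderElector:
    def __init__(self, lease, namespace):
        self.lease = lease
        self.namespace = namespace
        self.active = {}        # event ID -> running cluster-scoped config
        self.current_id = None

    def _new_event_id(self):
        # A fresh ID per lease acquisition (the PR logs show a
        # nanosecond-timestamp suffix in the real event IDs).
        return f"{self.lease}-{self.namespace}-{time.time_ns()}"

    def on_started_leading(self):
        self.current_id = self._new_event_id()
        self.active[self.current_id] = "state_* metricsets"

    def on_stopped_leading(self):
        # Stop with the SAME ID that started, so the old config is
        # actually removed instead of lingering and duplicating metrics.
        self.active.pop(self.current_id, None)
        self.current_id = None

elector = LeaderElector("metricbeat-cluster-leader", "kube-system")
elector.on_started_leading()
elector.on_stopped_leading()  # lease renewal failed, holder changed
elector.on_started_leading()  # lease re-acquired later
print(len(elector.active))    # prints 1: exactly one running config
```

Without the matching ID, the pop would miss and two configs would run at once after a holder change.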

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Create a two-node cluster with Metricbeat running on both nodes.
  2. Use the Metricbeat image built from this branch to deploy the Metricbeat instances.

If you want to see this fix in action, follow the section Results below.

Edit: here are more detailed steps for this:

Create a two-node cluster. You can do it this way:
$ kind create cluster --config kind-config.yaml

And kind-config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: KubeProxyConfiguration
        metricsBindAddress: "0.0.0.0"
  - role: worker
Build a Docker image from this branch, in a directory where you have the metricbeat binary.

First build Metricbeat in the metricbeat directory:

$ GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build

You can use this Dockerfile to create the docker image:

FROM ubuntu:20.04

WORKDIR /usr/share/metricbeat

COPY metricbeat /usr/share/metricbeat/metricbeat

ENTRYPOINT ["./metricbeat"]

CMD [ "-e" ]

Then run:

$ docker build -t metricbeat-run-image .

And then upload it to kind nodes:

$ kind load docker-image metricbeat-run-image:latest
Deploy the metricbeat manifest with this image.

I am using the following manifest, which has only state_node enabled plus a few other metricsets.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  metricbeat.yml: |-
    metricbeat.config.modules:
      # Mounted `metricbeat-daemonset-modules` configmap:
      path: ${path.config}/modules.d/*.yml
      # Reload module configs as they change:
      reload.enabled: false

    metricbeat.autodiscover:
      providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}
          unique: true
          templates:
            - config:
                - module: kubernetes
                  hosts: ["kube-state-metrics:8080"]
                  period: 10s
                  add_metadata: true
                  metricsets:
                    - state_node
                    #- state_deployment
                    #- state_daemonset
                    #- state_replicaset
                    #- state_pod
                    #- state_container
                    #- state_cronjob
                    #- state_resourcequota
                    #- state_statefulset
                    #- state_service
                    #- state_persistentvolume
                    #- state_persistentvolumeclaim
                    #- state_storageclass
                    #- state_namespace
                - module: kubernetes
                  metricsets:
                    - apiserver
                  hosts: ["https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"]
                  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
                  ssl.certificate_authorities:
                    - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  period: 30s
                # Uncomment this to get k8s events:
                #- module: kubernetes
                #  metricsets:
                #    - event
        # To enable hints based autodiscover uncomment this:
        #- type: kubernetes
        #  node: ${NODE_NAME}
        #  hints.enabled: true

    logging.level: debug

    processors:
      - add_cloud_metadata:

    cloud.id: ${ELASTIC_CLOUD_ID}
    cloud.auth: ${ELASTIC_CLOUD_AUTH}

    output.elasticsearch:
      hosts: ['https://${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
      ssl.verification_mode: "none"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-modules
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  #system.yml: |-
  #  - module: system
  #    period: 10s
  #    metricsets:
  #      - cpu
  #      - load
  #      - memory
  #      - network
  #      - process
  #      - process_summary
  #      #- core
  #      #- diskio
  #      #- socket
  #    processes: ['.*']
  #    process.include_top_n:
  #      by_cpu: 5      # include top 5 processes by CPU
  #      by_memory: 5   # include top 5 processes by memory
#
  #  - module: system
  #    period: 1m
  #    metricsets:
  #      - filesystem
  #      - fsstat
  #    processors:
  #    - drop_event.when.regexp:
  #        system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)'
  kubernetes.yml: |-
    - module: kubernetes
      metricsets:
        - node
        #- system
        - pod
        - container
        #- volume
      period: 10s
      host: ${NODE_NAME}
      hosts: ["https://${NODE_NAME}:10250"]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
      # If there is a CA bundle that contains the issuer of the certificate used in the Kubelet API,
      # remove ssl.verification_mode entry and use the CA, for instance:
      #ssl.certificate_authorities:
        #- /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
    # Currently `proxy` metricset is not supported on Openshift, comment out section
    #- module: kubernetes
    #  metricsets:
    #    - proxy
    #  period: 10s
    #  host: ${NODE_NAME}
    #  hosts: ["localhost:10249"]
---
# Deploy a Metricbeat instance per node for node metrics retrieval
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
spec:
  selector:
    matchLabels:
      k8s-app: metricbeat
  template:
    metadata:
      labels:
        k8s-app: metricbeat
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
      serviceAccountName: metricbeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: metricbeat
          image: metricbeat-run-image
          imagePullPolicy: Never
          args: [
            "-c", "/etc/metricbeat.yml",
            "-e",
            "-system.hostfs=/hostfs",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: elasticsearch
            - name: ELASTICSEARCH_PORT
              value: "9200"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: "changeme"
            - name: ELASTIC_CLOUD_ID
              value:
            - name: ELASTIC_CLOUD_AUTH
              value:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: config
              mountPath: /etc/metricbeat.yml
              readOnly: true
              subPath: metricbeat.yml
            - name: data
              mountPath: /usr/share/metricbeat/data
            - name: modules
              mountPath: /usr/share/metricbeat/modules.d
              readOnly: true
            - name: proc
              mountPath: /hostfs/proc
              readOnly: true
            - name: cgroup
              mountPath: /hostfs/sys/fs/cgroup
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
        - name: config
          configMap:
            defaultMode: 0640
            name: metricbeat-daemonset-config
        - name: modules
          configMap:
            defaultMode: 0640
            name: metricbeat-daemonset-modules
        - name: data
          hostPath:
            # When metricbeat runs as non-root user, this directory needs to be writable by group (g+w)
            path: /var/lib/metricbeat-data
            type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metricbeat
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - namespaces
      - events
      - pods
      - services
      - persistentvolumes
      - persistentvolumeclaims
    verbs: ["get", "list", "watch"]
  # Enable this rule only if planing to use Kubernetes keystore
  #- apiGroups: [""]
  #  resources:
  #  - secrets
  #  verbs: ["get"]
  - apiGroups: ["extensions"]
    resources:
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - deployments
      - replicasets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      - storageclasses
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - ""
    resources:
      - nodes/stats
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat
  # should be the namespace where metricbeat is running
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
---

Don't forget to set up your Elasticsearch output if you are not using the Elastic Stack defaults above.

Then deploy the manifest.

Update the lease object as follows, so that a lease renewal failure occurs.

Depending on your current holder, you might have to adjust holderIdentity:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: metricbeat-cluster-leader
  namespace: kube-system
spec:
  holderIdentity: beats-leader-metricbeat-worker

Related issues

Closes #34998.
Relates #38543.

Results

Lease belongs to control-plane metricbeat instance:

c@c:~$ kubectl get leases -n kube-system
NAME                                   HOLDER                                                                      AGE
...
metricbeat-cluster-leader              beats-leader-metricbeat-cluster-control-plane                               4s

Logs from leader, control-plane metricbeat instance:

I0320 14:35:40.853155       1 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2024-03-20T14:35:40.853Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func1","file.name":"kubernetes/kubernetes.go","file.line":300},"message":"leader election lock GAINED, holder: beats-leader-metricbeat-cluster-control-plane, eventID: metricbeat-cluster-leader-kube-system-1710945340853430124","service.name":"metricbeat","ecs.version":"1.6.0"}
Change the leader. You can modify the lease like this, which will cause a failure on lease renewal.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: metricbeat-cluster-leader
  namespace: kube-system
spec:
  holderIdentity: beats-leader-metricbeat-worker

Now we see in the logs of the previous leader (the control-plane Metricbeat instance):

I0320 14:36:47.313084       1 leaderelection.go:283] failed to renew lease kube-system/metricbeat-cluster-leader: timed out waiting for the condition
{"log.level":"debug","@timestamp":"2024-03-20T14:36:47.313Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func2","file.name":"kubernetes/kubernetes.go","file.line":304},"message":"leader election lock LOST, holder: beats-leader-metricbeat-cluster-control-plane, eventID: metricbeat-cluster-leader-kube-system-1710945340853430124","service.name":"metricbeat","ecs.version":"1.6.0"}

And in the logs of the new leader (the worker Metricbeat instance):

I0320 14:36:52.497868       1 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2024-03-20T14:36:52.498Z","log.logger":"autodiscover","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/autodiscover/providers/kubernetes.NewLeaderElectionManager.func1","file.name":"kubernetes/kubernetes.go","file.line":300},"message":"leader election lock GAINED, holder: beats-leader-metricbeat-cluster-worker, eventID: metricbeat-cluster-leader-kube-system-1710945412498017908","service.name":"metricbeat","ecs.version":"1.6.0"}

Discover now reports metrics from only one host.name. Check by comparing the number of documents before and after the lease holder changed: it remains the same, so we no longer have duplicates:

(screenshot: document counts before and after the lease holder change)
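To make that check concrete, here is a small hedged helper (field names assumed to follow ECS; this script is not part of the PR) that flags any (metricset, timestamp) bucket reported by more than one host:

```python
from collections import defaultdict

def duplicate_hosts(docs):
    """Return (metricset, timestamp) keys reported by more than one host."""
    hosts = defaultdict(set)
    for doc in docs:
        key = (doc["metricset.name"], doc["@timestamp"])
        hosts[key].add(doc["host.name"])
    return {key: names for key, names in hosts.items() if len(names) > 1}

# After the fix, each timestamp should have a single reporter,
# even across a holder change:
docs = [
    {"metricset.name": "state_node", "@timestamp": "t1",
     "host.name": "metricbeat-cluster-control-plane"},
    {"metricset.name": "state_node", "@timestamp": "t2",
     "host.name": "metricbeat-cluster-worker"},
]
print(duplicate_hosts(docs))  # prints {}: no duplicates
```

Feeding it exported documents from before the fix would show buckets with two host names.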

@constanca-m constanca-m added Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team bugfix backport-v8.13.0 Automated backport with mergify labels Mar 20, 2024
@constanca-m constanca-m requested a review from a team March 20, 2024 12:50
@constanca-m constanca-m self-assigned this Mar 20, 2024
@constanca-m constanca-m requested a review from a team as a code owner March 20, 2024 12:50
@constanca-m constanca-m requested review from gsantoro, tetianakravchenko, belimawr and rdner and removed request for a team March 20, 2024 12:50
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 20, 2024
@constanca-m constanca-m changed the title Fix event id [Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics Mar 20, 2024
@elasticmachine (Collaborator) commented Mar 20, 2024

❕ Build Aborted

Either there was a build timeout or someone aborted the build.


Build stats

  • Duration: 99 min 51 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@gsantoro (Contributor) left a comment


Please address my comments before merging

constanca-m and others added 2 commits March 20, 2024 15:25
@elasticmachine (Collaborator) commented Mar 20, 2024

💚 Build Succeeded


Build stats

  • Duration: 132 min 53 sec

❕ Flaky test report

No test was executed to be analysed.


constanca-m and others added 2 commits March 20, 2024 15:34
@belimawr (Contributor) commented

I finally managed to follow the steps you posted @constanca-m; however, even before updating the lease object I already have both host names reporting metrics...

(screenshot: both host names reporting metrics)

Which query did you use to take the screen shot you put in the PR description?

@constanca-m (Contributor, Author) commented Mar 25, 2024

@belimawr You will only see one host name for the metrics that are unique cluster-wide. These metrics come from, for example, the state_* metricsets.

Which query did you use to take the screen shot you put in the PR description?

Can you filter by metricset, for example metricset: state_node? Do you see more than one host.name for the same timestamp?
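As a hedged sketch of such a check (index and field names are assumptions based on ECS, not taken from this PR), an Elasticsearch aggregation body like the following counts distinct host.name values per timestamp for one cluster-wide metricset; any bucket with more than one distinct host indicates duplication:

```python
import json

# Assumed ECS field names; POST this body to <your-index>/_search.
query = {
    "size": 0,
    "query": {"term": {"metricset.name": "state_node"}},
    "aggs": {
        "per_timestamp": {
            "terms": {"field": "@timestamp", "size": 100},
            "aggs": {
                "distinct_hosts": {"cardinality": {"field": "host.name"}}
            },
        }
    },
}
print(json.dumps(query, indent=2))
```

After the fix, every per_timestamp bucket should report distinct_hosts of 1.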

@rdner rdner removed their request for review March 26, 2024 09:49
@MichaelKatsoulis (Contributor) commented

@constanca-m Have you noticed the same behaviour in Elastic Agent? The code of the leader election provider is somewhat different there.

@constanca-m (Contributor, Author) commented Mar 28, 2024

Yes @MichaelKatsoulis, I have not checked, but I documented it in the issue as one of the upcoming tasks.

constanca-m and others added 2 commits April 2, 2024 11:26
@constanca-m constanca-m merged commit 5947565 into elastic:main Apr 8, 2024
145 of 183 checks passed
@constanca-m constanca-m deleted the leader-election-issue branch April 8, 2024 10:28
mergify bot pushed a commit that referenced this pull request Apr 8, 2024
…g same metrics (#38471)

* Fix event id

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update libbeat/autodiscover/providers/kubernetes/kubernetes.go

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* Update libbeat/autodiscover/providers/kubernetes/kubernetes.go

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* add space to log line

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* change log.debug order

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* - run leader elector until context is cancelled
- add unit tests

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* fix lint errors

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* mage check

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* use assert instead of require

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update changelog

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Add test comments

Signed-off-by: constanca <constanca.manteigas@elastic.co>

* Update docs

Signed-off-by: constanca <constanca.manteigas@elastic.co>

---------

Signed-off-by: constanca <constanca.manteigas@elastic.co>
Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>
(cherry picked from commit 5947565)
constanca-m added a commit that referenced this pull request Apr 9, 2024
…g same metrics (#38471) (#38761)

(cherry picked from commit 5947565)

Co-authored-by: Constança Manteigas <113898685+constanca-m@users.noreply.github.com>
Labels
backport-v8.13.0 Automated backport with mergify bugfix Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Kubernetes provider] LeaderElection: error during lease renewal leads to events duplication
7 participants