[Kubernetes provider] LeaderElection: error during lease renewal leads to events duplication #34998

Closed
tetianakravchenko opened this issue Apr 3, 2023 · 7 comments · Fixed by #38471
Labels: Team:Cloudnative-Monitoring (Cloud Native Monitoring team)


tetianakravchenko commented Apr 3, 2023

If an error occurs during lease renewal (which may be caused by temporary network trouble, saturation, an accidentally removed resource, a temporarily unavailable API server, etc.), it is not handled correctly. This leads to a situation where multiple pods act as leaders, so metrics endpoints are scraped more than once even when `unique: true` is configured.

Kubernetes client-go issue that might be related: kubernetes/client-go#1155

How to reproduce:
0. `kind create cluster --config config.yaml` with the following config:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
      - |
        kind: KubeProxyConfiguration
        metricsBindAddress: "0.0.0.0"
- role: worker
- role: worker
```
1. `kubectl apply -f metricbeat.yaml`
2. Check the number of events coming in for one metricset that has the `unique: true` configuration, for example `state_node`:
   for each scrape there are 3 documents, all with host.name kind-worker2

[Screenshot 2023-04-03 18:25: three documents per scrape, host.name kind-worker2]

3. Check the leader election logs:
   metricbeat-1
13 leaderelection.go:248] attempting to acquire leader lease kube-system/metricbeat-cluster-leader...
13 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2023-04-03T16:27:31.963Z","log.logger":"autodiscover","log.origin":{"file.name":"kubernetes/kubernetes.go","file.line":298},"message":"leader election lock GAINED, id beats-leader-kind-worker","service.name":"metricbeat","ecs.version":"1.6.0"}

metricbeat-2

13 leaderelection.go:248] attempting to acquire leader lease kube-system/metricbeat-cluster-leader...

lease is taken by metricbeat-1:

kubectl get leases.coordination.k8s.io
metricbeat-cluster-leader   beats-leader-kind-worker                                  39m
4. Trigger an error by removing the RoleBinding for the lease Role: `kubectl delete rolebindings metricbeat`

metricbeat-1 has errors:

E0403 16:29:14.243178      13 leaderelection.go:330] error retrieving resource lock kube-system/metricbeat-cluster-leader: leases.coordination.k8s.io "metricbeat-cluster-leader" is forbidden: User "system:serviceaccount:kube-system:metricbeat" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
...
I0403 16:29:16.238161      13 leaderelection.go:283] failed to renew lease kube-system/metricbeat-cluster-leader: timed out waiting for the condition
E0403 16:29:16.238229      13 leaderelection.go:306] Failed to release lock: resource name may not be empty
{"log.level":"debug","@timestamp":"2023-04-03T16:29:16.238Z","log.logger":"autodiscover","log.origin":{"file.name":"kubernetes/kubernetes.go","file.line":303},"message":"leader election lock LOST, id beats-leader-kind-worker","service.name":"metricbeat","ecs.version":"1.6.0"}

metricbeat-2:

E0403 16:29:32.627177      13 leaderelection.go:330] error retrieving resource lock kube-system/metricbeat-cluster-leader: leases.coordination.k8s.io "metricbeat-cluster-leader" is forbidden: User "system:serviceaccount:kube-system:metricbeat" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
I0403 16:29:54.563778      13 leaderelection.go:258] successfully acquired lease kube-system/metricbeat-cluster-leader
{"log.level":"debug","@timestamp":"2023-04-03T16:29:54.563Z","log.logger":"autodiscover","log.origin":{"file.name":"kubernetes/kubernetes.go","file.line":298},"message":"leader election lock GAINED, id beats-leader-kind-worker2","service.name":"metricbeat","ecs.version":"1.6.0"}

lease:

metricbeat-cluster-leader   beats-leader-kind-worker2                                 41m
5. Events are duplicated, and metricbeat-1 still continues scraping the state_* metricsets (all metricsets with the `unique: true` configuration).

[Screenshot 2023-04-03 18:23: duplicated events after the RoleBinding removal]

cc @ChrsMark @gizas


ChrsMark commented Apr 4, 2023

Thanks for raising this @tetianakravchenko! I agree that this issue is most probably related to kubernetes/client-go#1155.

Btw, what if we manually delete the lease object while both Metricbeats hold it? I guess nothing will change, since Metricbeat-1 will continue to be stuck holding the lease, but just in case.

tetianakravchenko commented

@ChrsMark

> Btw, what if we manually delete the lease object while both Metricbeats hold it?

There is always only one Metricbeat that holds the lease. The former leader just fails to renew the lease and loses the lock:

failed to renew lease kube-system/metricbeat-cluster-leader: timed out waiting for the condition

This Metricbeat still continues behaving as the leader, and the abandoned lease still exists in the cluster.

Another Metricbeat Pod becomes the leader and rewrites the abandoned lease object (changing the holder info).

I also don't think that anything will change if the lease object is manually deleted.


ChrsMark commented Apr 5, 2023

Yes, most probably deleting the lease won't change anything. In that case only deleting the "old" leader Pod will fix things, right?


gizas commented Apr 5, 2023

Can you also try restarting the new leader before removing the old one? This might force a new leader election? (This is what they do in the temporary fix.)
(I guess deleting the old leader is like forcing it to update its status, so it should work.)

Also, I don't know whether it is worth testing RenewDeadline and RetryPeriod (https://pkg.go.dev/k8s.io/client-go/tools/leaderelection), just to make the issue happen less frequently? But we would need to pass those variables at the agent's startup.
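For reference, here is a minimal sketch of where these knobs live in client-go's leaderelection package. The lock name, namespace, identity handling and durations are illustrative only, not the values Beats actually uses:

```go
// Minimal sketch of client-go leader election with explicit timing settings.
package example

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "metricbeat-cluster-leader",
            Namespace: "kube-system",
        },
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
        RenewDeadline: 10 * time.Second, // how long the leader keeps retrying the renewal
        RetryPeriod:   2 * time.Second,  // interval between renew/acquire attempts
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // start the unique (cluster-wide) metricsets here
            },
            OnStoppedLeading: func() {
                // stop them here; this is the path where the duplication shows up
            },
        },
    })
}
```

client-go requires LeaseDuration > RenewDeadline and RenewDeadline comfortably larger than RetryPeriod, so tuning these values would only shrink the failure window, which matches the point above that it would make the issue less frequent rather than fix it.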


tetianakravchenko commented Apr 5, 2023

> Yes, most probably deleting the lease won't change anything. In that case only deleting the "old" leader Pod will fix things, right?

Yes, correct, deleting the "old" leader Pod fixes the data duplication issue.

Joren-Thijs-KasparSolutions commented

My company also ran into this using Metricbeat 7.17.12. Updating to the latest 7.17.18 triggered a restart of the nodes and fixed the issue.


constanca-m commented Mar 20, 2024

I took another look at this issue.

Even though the lease renewal fails, that should not be a reason for the previous metricbeat lease holder to keep reporting metrics. The expected behavior should be: as soon as the holder loses the lock, whether or not there was a renewal, that metricbeat instance should stop reporting metrics. As mentioned in the description:

E0403 16:29:14.243178      13 leaderelection.go:330] error retrieving resource lock kube-system/metricbeat-cluster-leader: leases.coordination.k8s.io "metricbeat-cluster-leader" is forbidden: User "system:serviceaccount:kube-system:metricbeat" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"
...
"message":"leader election lock LOST, id beats-leader-kind-worker"}

So we know at least that this function is being called correctly:

```go
OnStoppedLeading: func() {
    logger.Debugf("leader election lock LOST, id %v", id)
    eventID := fmt.Sprintf("%v-%v", metaUID, time.Now().UnixNano())
    stopLeading(uuid.String(), eventID)
},
```

Why is this problem happening then?

The reason for the duplicated metrics is actually quite simple.

The leader, once it starts, emits an event with the flag start set to true:

```go
event := bus.Event{
    "start":    true,
    "provider": uuid,
    "id":       eventID,
    "unique":   "true",
}
```

This event is then captured in this part of the code:

```go
if _, ok := event["start"]; ok {
    // if updated is true, we don't want to set it back to false
    if a.handleStart(event) {
        updated = true
    }
}
```

And this `handleStart` function initializes the right configuration for our `a.configs` here:

```go
if updated {
    a.configs[eventID] = newCfg
}
```

So now we know updated is set to true and we need to reload our autodiscover configuration.

Once we handle stop events we should do the same. However, we have a problem. When dealing with stopLeading(), we use a new event id:

```go
OnStoppedLeading: func() {
    logger.Debugf("leader election lock LOST, id %v", id)
    eventID := fmt.Sprintf("%v-%v", metaUID, time.Now().UnixNano())
    stopLeading(uuid.String(), eventID)
},
```

And the event id from the start event is the key under which the configuration was saved in autodiscover... So once we start handling the stop event, we check whether a configuration is stored under its event id and, if so, reload the autodiscover settings:

```go
if len(a.configs[eventID]) > 0 {
    a.logger.Debugf("Stopping %d configs", len(a.configs[eventID]))
    updated = true
}
```

Because this is a new event id, nothing is found there, and our metricbeat instance never stops reporting metrics...
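To illustrate the idea, here is a rough sketch (not the actual patch in #38471; logger, id, metaUID, uuid and stopLeading are the identifiers from the snippets above, and startLeading is assumed by symmetry): if the event id is computed once and reused by both callbacks, the stop event can find the entry that the start event stored in a.configs:

```go
// Sketch only: compute the event ID once, outside the callbacks, so the start
// and the stop event refer to the same entry in a.configs.
eventID := fmt.Sprintf("%v-%v", metaUID, time.Now().UnixNano())

callbacks := leaderelection.LeaderCallbacks{
    OnStartedLeading: func(_ context.Context) {
        logger.Debugf("leader election lock GAINED, id %v", id)
        startLeading(uuid.String(), eventID)
    },
    OnStoppedLeading: func() {
        logger.Debugf("leader election lock LOST, id %v", id)
        // Reuse the same eventID instead of generating a fresh one here, so that
        // len(a.configs[eventID]) > 0 matches and the running configs are stopped.
        stopLeading(uuid.String(), eventID)
    },
}
```

With a stable event id, the check shown above finds the stored configs when the lock is lost, updated becomes true, and the unique metricsets are actually stopped.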

Solution

PR: #38471

Because this issue grew a bit since the PR was opened, I put together everything we have found out so far and what is still to be done in a new issue here: #38543.
