[Kubernetes provider] LeaderElection: error during lease renewal leads to events duplication #34998
Thanks for raising this @tetianakravchenko! I agree that this issue is most probably related to kubernetes/client-go#1155. By the way, what happens if we manually delete the lease object while both Metricbeats hold it? I guess nothing will change, since Metricbeat-1 will continue to be stuck holding the lease, but it is worth checking just in case.
There is always only one Metricbeat that holds the lease. The former leader just fails to renew the lease and loses the lock: that Metricbeat still continues behaving as a leader, while the abandoned lease still exists in the cluster. The other Metricbeat Pod then becomes the leader and rewrites the abandoned lease object (changing the holder info). I also don't think that anything will change if the lease object is deleted manually.
Yes, most probably deleting the lease won't change anything. In that case, only deleting the "old" leader Pod will fix things, right?
Can you also try restarting the new leader before removing the old one? This might force a new leader election (this is what they do in the temporary fix). I also don't know whether it is worth testing RenewDeadline and RetryPeriod (https://pkg.go.dev/k8s.io/client-go/tools/leaderelection).
Yes, correct: deleting the "old" leader Pod fixes the data duplication issue.
My company also ran into this using Metricbeat 7.17.12. Updating to the latest 7.17.18 triggered a restart of the nodes and fixed the issue. |
I took another look at this issue. Even though the lease renewal fails, that should not be a reason for the previous Metricbeat lease holder to keep reporting metrics. The expected behavior is: as soon as the holder loses the lock, whether or not there was a renewal, that Metricbeat instance should stop reporting metrics. As mentioned in the description:
So we know at least that this function is being called correctly (`beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go`, lines 301 to 305 at `10ff992`).
Why is this problem happening then? The reason for the duplicated metrics is actually quite simple. The leader, once it starts, emits an event flagged as a start event (`beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go`, lines 208 to 213 at `10ff992`).
This event is then captured in this part of the code (`beats/libbeat/autodiscover/autodiscover.go`, lines 141 to 146 at `10ff992`).
And this handle-start function initializes the right configuration (`beats/libbeat/autodiscover/autodiscover.go`, lines 264 to 266 at `10ff992`).
So now we know how start events are handled; once we handle stop events, we should do the same. However, we have a problem: when dealing with stop events, a new event id is generated (`beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go`, lines 301 to 305 at `10ff992`).
And the start event's id was the one used to save the configuration in autodiscover. So once we start handling the stop event, we check whether we have the configuration there and update the autodiscover settings (`beats/libbeat/autodiscover/autodiscover.go`, lines 281 to 284 at `10ff992`).
Because the stop event carries a new event id, nothing is found there, and our Metricbeat instance never stops reporting metrics. Solution: PR #38471. Because this issue grew a bit since opening that PR, I put together everything we found out so far, and what is still to do, in a new issue: #38543.
If some error occurs during lease renewal (which may be caused by temporary network trouble, saturation, an accidentally removed resource, the API server being temporarily unavailable, etc.), it is not handled correctly. This leads to a situation where multiple Pods act as leaders, hence metrics endpoints are scraped more than once, even if configured with `unique: true`.

Kubernetes client-go issue that might be related: kubernetes/client-go#1155
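For reference, a minimal sketch of the kind of autodiscover configuration involved (an assumed fragment, not the exact `metricbeat.yaml` from this report; host and period values are placeholders):

```yaml
# With unique: true, only the Pod currently holding the autodiscover
# lease should run these metricsets.
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      unique: true
      templates:
        - config:
            - module: kubernetes
              metricsets: ["state_node"]
              hosts: ["kube-state-metrics:8080"]
              period: 10s
```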
How to reproduce:
0. `kind create cluster --config config.yaml`
1. `kubectl apply -f metricbeat.yaml`: two Metricbeat Pods (`metricbeat-1`, `metricbeat-2`) with a `unique: true` configuration, for example the `state_node` metricset.
2. See, for each scrape, 3 documents with `host.name: kind-worker2`; the lease is taken by `metricbeat-1`.
3. `kubectl delete rolebindings metricbeat`: `metricbeat-1` now logs lease renewal errors, and `metricbeat-2` takes over the lease (the lease's holder info changes).
4. `metricbeat-1` still continues scraping the `state_*` metricsets (all metricsets with a `unique: true` configuration).

cc @ChrsMark @gizas