[Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics #38471
Is there any possibility of data races with this? Presumably OnStoppedLeading and OnStartedLeading should never be called concurrently...
Yes, you are correct. The code for running the leader election is in these lines, with startLeading running as a goroutine. So I will check for race conditions and try to prevent them from happening.
I tried to write a unit test to check for the race condition.

Edit: Yes, it never happens. See comment [Metricbeat][Autodiscover Kubernetes] Fix multiple instances reporting same metrics #38471 (comment).

The test has to wait (leaseDuration + retryPeriod) * 2, which means it waits at most 34s for one iteration. I was testing with multiple iterations, so it could take 2 minutes. Is it a good idea to commit such a long test, since it would be triggered in other PRs? What do you think?
I agree, we should avoid such long-running tests. We could perhaps make the lease duration configurable. Another option would be to pass in a fake clock, like https://github.com/jonboulle/clockwork, and then we can control it in the test.
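For what it's worth, a minimal sketch of the fake-clock idea, assuming the code that waits on the lease duration can accept a clockwork.Clock (the waitLeaseExpiry helper and its parameters are made up for illustration, not code from this PR):

```go
package kubernetes

import (
	"testing"
	"time"

	"github.com/jonboulle/clockwork"
)

// waitLeaseExpiry stands in for code that sleeps for the lease duration;
// it takes a clockwork.Clock so a test can substitute a fake clock.
func waitLeaseExpiry(clock clockwork.Clock, leaseDuration time.Duration) {
	clock.Sleep(leaseDuration)
}

func TestLeaseExpiryWithFakeClock(t *testing.T) {
	fakeClock := clockwork.NewFakeClock()
	done := make(chan struct{})

	go func() {
		waitLeaseExpiry(fakeClock, 15*time.Second)
		close(done)
	}()

	// Wait until the goroutine is blocked on the fake clock, then advance
	// time instantly instead of sleeping for 15 real seconds.
	fakeClock.BlockUntil(1)
	fakeClock.Advance(15 * time.Second)

	select {
	case <-done:
		// the "lease duration" elapsed without any real waiting
	case <-time.After(time.Second):
		t.Fatal("timed out waiting for the fake lease to expire")
	}
}
```

Whether the real leader-election code can be made to accept such a clock is a separate question; this only shows how a fake clock removes the real-time waits from a test.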
I added a unit test. It still takes a few seconds to complete, at most 30s. I could not reduce the lease duration fields any further, because doing so caused unexpected lease renewals (because of network issues, I'm guessing).
Why do we include the timestamp in the event ID? I'm wondering if we could drop it and just use the leaseId. That way we don't need to store/reference eventID, and that would eliminate the question of data races entirely.
I think it is risky and it could cause more problems. Example:
1. metricbeat-1 is the leader.
2. metricbeat-1 loses the lease.
3. metricbeat-1 is the new leader again, but:
4. startLeading from this new leader election is called.
5. stopLeading is called after, triggered by the loss of the lease from step 2.

In that case, since they are using the same eventId, we end up entirely without a leader because we deleted the configs. If we use an event ID with a timestamp, we make sure this never happens.
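Purely as an illustration of the two schemes being compared here (the names below are made up, not the provider's actual fields): with a bare lease ID, the old term and the new term publish under the same key, so the stale stopLeading deletes the new leader's config, while a per-term timestamp keeps the keys distinct:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	leaseID := "metricbeat-cluster-leader" // hypothetical lease identifier

	// Same event ID for every term: a stale stopLeading from an old term
	// removes the config that the new term just published under this key.
	withoutTimestamp := leaseID

	// Unique event ID per term: a stale stopLeading only touches the old
	// term's key, so the new leader's config survives.
	withTimestamp := fmt.Sprintf("%s-%d", leaseID, time.Now().UnixNano())

	fmt.Println(withoutTimestamp, withTimestamp)
}
```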
Actually... we use Run for leader election:

beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go, line 355 in 4525d78

And its documentation says that Run will not return before the leader election loop is stopped by ctx or it has stopped holding the leader lease. This means we will never have the same leader twice! Because they stop running and they are never re-elected!
So we have a problem. Example:
1. We have node-1 and node-2.
2. node-1 is the first leader.
3. node-1 loses the lock, so it stops running.
4. node-2 gets elected.
5. node-2 loses the lease (for example, the rolebinding gets deleted). It stops running.

I tried this with a unit test, trying to renew around 20 times, and I could see the leader election had stopped, like in the example above.
So I believe the implementation as of now (the official one, not the one from this branch) has two problems: the race between startLeading and stopLeading around the shared eventID, and the fact that an instance that loses the lease never runs for election again.
I think we need to consider other alternatives to leader election, or find a way to make it run again, because as it is we will be forcing users to delete pods so they can start again.
To make it run indefinitely, we would have to put the code in this function in a for loop:

beats/libbeat/autodiscover/providers/kubernetes/kubernetes.go, lines 330 to 334 in 4525d78

Maybe not ideal... Would it be easier to just use a lease and discard the leader election altogether?
I see two possible solutions:
1. Make leader election run again once it stops, so we know the metricbeat instance is always a candidate to be a leader.
2. Use a watcher that keeps track of the lease. This way, once the lease changes holder, we can start reporting metrics with this metricbeat instance (see the rough sketch after this list).
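To make the second option slightly more concrete, here is a rough sketch of watching the coordination.k8s.io Lease with a client-go informer; the function name, the namespace/lease/identity parameters, and the onHolderChange callback are assumptions for illustration, not an existing API in this provider:

```go
package kubernetes

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	"k8s.io/client-go/informers"
	k8sclient "k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchLease reports, via onHolderChange, whether this instance currently
// holds the given Lease, based on watch events rather than running the
// leader-election loop itself.
func watchLease(client k8sclient.Interface, namespace, leaseName, identity string,
	onHolderChange func(isLeader bool)) cache.SharedIndexInformer {

	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 10*time.Minute, informers.WithNamespace(namespace))
	informer := factory.Coordination().V1().Leases().Informer()

	handle := func(obj interface{}) {
		lease, ok := obj.(*coordinationv1.Lease)
		if !ok || lease.Name != leaseName {
			return
		}
		holder := ""
		if lease.Spec.HolderIdentity != nil {
			holder = *lease.Spec.HolderIdentity
		}
		onHolderChange(holder == identity)
	}

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    handle,
		UpdateFunc: func(_, newObj interface{}) { handle(newObj) },
	})
	return informer
}
```

The informer still has to be started (for example via the factory's Start method), and a real version would also need to handle the lease being deleted; this is only meant to show the shape of the approach.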
Yikes... another good find.
I'm not familiar enough with Kubernetes leadership election or resource watchers, so it's probably best to discuss with others - but my instinct is to go with the first one: making leader election run again once it stops.
I'm not sure if we can do better by using watchers, but this option means a relatively small change, so it seems less risky to me. I don't think this would consume any more resources than having multiple Metricbeat instances that each attempt to acquire a lease?
In terms of code, I think we would need to change the go le.Run(ctx) to something like this:
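A rough sketch of what that could look like, assuming we simply re-create the elector and re-enter the election whenever Run returns while the provider is still running (the function name and its signature are placeholders, not the change actually proposed in this PR):

```go
package kubernetes

import (
	"context"

	"k8s.io/client-go/tools/leaderelection"
)

// startLeaderElectorLoop is a hypothetical replacement for `go le.Run(ctx)`:
// it keeps re-running the election until the context is cancelled, so the
// instance stays a candidate for leadership after losing the lease.
func startLeaderElectorLoop(ctx context.Context, lec leaderelection.LeaderElectionConfig) {
	go func() {
		for {
			le, err := leaderelection.NewLeaderElector(lec)
			if err != nil {
				return // invalid config; nothing to run
			}
			le.Run(ctx) // blocks until the lease is lost or ctx is cancelled
			if ctx.Err() != nil {
				return // provider is shutting down; stop re-entering the election
			}
			// The lease was lost: loop and run the election again.
		}
	}()
}
```

With this shape, losing the lease only means the instance goes back to being a candidate instead of dropping out permanently, which is what the first option above describes.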