Unexpected behaviour during continuous master election #14
You can't depend on etcd's election alone to do your business. In my opinion, the etcd election service APIs exposed by …

According to the description, I think you should acquire a distributed lock first, then do your business. See …
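For example, something along these lines (a minimal sketch using etcd-client's lock and lease APIs; the endpoint, key name and TTL are just illustrative):

```rust
use etcd_client::{Client, Error, LockOptions};

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Tie the lock to a lease so it is released automatically
    // if this process dies without unlocking.
    let lease_id = client.lease_grant(10, None).await?.id();
    let options = LockOptions::new().with_lease(lease_id);

    // Blocks until the lock is acquired.
    let resp = client.lock("my-lock", Some(options)).await?;
    let key = resp.key().to_vec();

    // ... do your business here while keeping the lease alive ...

    client.unlock(key).await?;
    Ok(())
}
```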
Thank you for your response!
I could not find any mention of this in the documentation. Using etcd, ZooKeeper and similar services for master election is a very common scenario, and the API is not marked as internal. Anyway, I've replaced master election with a lock, but it behaves the same way. After the lock holder releases the lock by failing to keep its lease alive, everyone awakens and they all get `Ok(LockResponse)`. During the lock holder's expiry the others were blocking on their own original leases, which by then are long expired. This is where the first surprise happens: when I try to immediately run a keep-alive, I get `ttl() == 0` and break.

```rust
let (mut keeper, mut keeper_responses) = etcd_client.lease_keep_alive(lease_id).await?;
keeper.keep_alive().await?;
if let Some(resp) = keeper_responses.message().await? {
    if resp.ttl() == 0 {
        break; // Let's create a new lease and start over.
    }
    // I win!
}
```

This is the only way that I see so far to figure out who won. `Ok(LockResponse)` is completely uninformative, as it can be returned spuriously without actually acquiring the lock.
Sorry, maybe I misled you. In fact, the election and lock interfaces are not included in the etcd API reference. In golang, election and lock are not in the core client either, but in the clientv3/concurrency package. Perhaps you need a concurrency-like crate in Rust.
Hi,
I am trying to write code for a cluster where only one node periodically downloads a file and stores it for everyone else. I don't want to put the file via `promote`, because I want to know who the leader is first, so that only that node does the work. I plan to write the file to another key and let everyone observe it. I could probably still use `promote`, but the file should remain readable even when it is stale, and I'd probably have to put an empty value during the initial `campaign` call.

Anyway, it seems like I cannot implement a continuous master election. I am getting spurious wake-ups from `client.campaign(...)` for both leaders and non-leaders. This is due to lease expiry during the `campaign` call. Here is the code that reproduces the issue:
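A condensed sketch of the loop each client runs (the endpoint, election name, TTL and value are illustrative; in the full reproduction several clients run this concurrently):

```rust
use etcd_client::{Client, Error};

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut client = Client::connect(["localhost:2379"], None).await?;
    loop {
        // Each candidate campaigns with its own short-lived lease.
        let lease_id = client.lease_grant(5, None).await?.id();
        let resp = client.campaign("my-election", "some-value", lease_id).await?;
        println!("campaign returned: {:?}", resp.leader());

        // campaign() may complete long after our lease has already
        // expired, so check the lease before assuming leadership.
        let (mut keeper, mut responses) = client.lease_keep_alive(lease_id).await?;
        keeper.keep_alive().await?;
        if let Some(ka) = responses.message().await? {
            if ka.ttl() == 0 {
                continue; // lease expired while blocked in campaign(): start over
            }
        }

        // Leader: refresh the lease a few times, then stop sending
        // keep-alives to simulate a failure and trigger re-election.
        for _ in 0..5 {
            keeper.keep_alive().await?;
            tokio::time::sleep(std::time::Duration::from_secs(1)).await;
        }
    }
}
```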
The log:
Notice how the first election runs correctly. Only one client gets to win the election; everyone else waits.
The leader sends keep-alives as expected. To simulate failure, it stops sending keep-alives after 5 rounds. Now the interesting part begins.
A new election runs. Someone wins, but everyone's `campaign` call completes. I'd expect an error to be returned at this point, but instead everyone gets a `LeaderKey` with their own `lease_id`. Did everyone just win and lose instantly? I am not sure, but it does not look like it. Only one client can successfully keep its lease alive, and that is `idx=0`, because that client got a chance to refresh its lease by completing the loop. Still, I don't really know who the leader is. Eventually they all complete this strange loop, refresh their grants, and participate in the correct election again.
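The only workaround I can think of would be to ask etcd explicitly who the leader is and compare against my own key, assuming the crate exposes the v3election `Leader` RPC as `client.leader` and that `LeaderResponse` carries the winning key:

```rust
use etcd_client::{Client, Error};

// Sketch: after campaign() returns a LeaderKey, cross-check it against
// the key that etcd itself reports as the current leader. `my_key` is
// the key() of the LeaderKey from our own CampaignResponse.
async fn really_won(client: &mut Client, election: &str, my_key: &[u8]) -> Result<bool, Error> {
    let resp = client.leader(election).await?;
    Ok(resp.kv().map(|kv| kv.key()) == Some(my_key))
}
```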
What is most surprising is that none of this ever returns `Result::Err`. Is there anything wrong in my code, and should I be checking something else? Should `Err` be returned on an expired lease? Both?

Thanks,
Igor.