This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Leader failover too slow, taking 40 seconds #1227

Closed
huxihx opened this issue Sep 22, 2015 · 18 comments

Comments

@huxihx

huxihx commented Sep 22, 2015

Hi there,
I've started two swarm managers with advertise set to their own hostnames. With Consul, leader failover works, but the whole process always takes 40 seconds before the new leader information is written to the KV store in Consul. I guess this is because the default TTL for a Consul session is 15 seconds, and Consul takes a conservative approach: it multiplies the TTL by a factor of two and adds 10 seconds as the check interval.
Do you guys know how to decrease the failover interval? It seems no existing config parameter helps. Thanks.

@jimmyxian
Contributor

@Amethystic
Related to #930

@huxihx
Author

huxihx commented Sep 22, 2015

Thanks for the response. It seems the configurable TTL only applies to 'swarm join'. It's not a valid parameter for 'swarm manage'.

image

@jimmyxian
Contributor

@Amethystic Yeah.
It's not supported in Swarm leader election yet. It will come soon. :)

Just as @abronan said:

it's done on the libkv side for at least etcd and zookeeper.
The only bits left are the Consul TTL scheme (with session delete behavior) in libkv and finally add the flag here to configure the TTL accordingly. I guess that this issue should be tracked on both sides now that we have a separate repository to handle the distributed locking logic.

@huxihx
Author

huxihx commented Sep 22, 2015

That's bad luck, since we have to use Consul. Do you think it's okay to change the source code directly to set a lower TTL?

@jimmyxian
Contributor

@Amethystic
Actually, no.
First we have to add TTL support in libkv, and then support it in Swarm.

@huxihx
Author

huxihx commented Sep 22, 2015

Thanks for the helpful response. Below is what I want to do; correct me if I am wrong:
candidate.go:
    lock, err := c.client.NewLock(c.key, &store.LockOptions{Value: []byte(c.node), TTL: 5}) // add TTL: 5
consul.go:
    if options != nil {
        consulOpts.Value = options.Value
        consulOpts.SessionTTL = options.TTL // add this line to receive the TTL passed from candidate.go
    }

Do you think this is a workable way to set a lower TTL, since the default 15 seconds is too long?

@jimmyxian
Contributor

@Amethystic Aha, I'm not familiar with Discovery. :)
ping @abronan

@huxihx
Author

huxihx commented Sep 22, 2015

Thanks. @abronan , please kindly advise.

@abronan
Contributor

abronan commented Sep 22, 2015

Hi @Amethystic, you can't set the TTL lower than 20 seconds with Consul using libkv. As you mentioned, Consul takes the conservative approach of multiplying the TTL by 2, but this was misleading because other K/V stores don't do this, so libkv divides the TTL by 2 to get back to the original value. If you want a lower TTL, you need to use Consul 0.5.2 and set the session_ttl_min parameter in the Consul config to 2s, for example. Then you'll be able to set the TTL to, say, 5s, and it will actually be 5s and not 10s because, as mentioned, libkv divides the TTL by 2 to match etcd and the other backends.

@abronan
Contributor

abronan commented Sep 22, 2015

I'll handle this on libkv's side.

@huxihx
Author

huxihx commented Sep 24, 2015

Please also add support to specify lock-delay when creating consul sessions.

@huxihx
Author

huxihx commented Sep 24, 2015

How can I specify the lock-delay for a Consul session that is used for leader election? It seems Swarm hard-codes 15 seconds as the default value. Can I change it?

@abronan
Contributor

abronan commented Sep 24, 2015

@Amethystic It's done in #1228. You might want to take a look; it's just waiting to be reviewed and merged ;)

@abronan
Contributor

abronan commented Sep 24, 2015

@Amethystic Ah, I just realized: no, there is no way to specify the lock-delay in this PR. However, the lock-delay is not 15s; it is fixed at 1ms in libkv (for whatever reason, setting it to 0 does not work as intended and can block for the default period). So the lock-delay is virtually disabled.
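
For reference, a rough sketch of what a session request carrying an explicit lock-delay might look like if you talked to Consul's HTTP session endpoint directly instead of going through libkv. The field names follow the session create payload as I understand it; the session name is made up, and whether a sub-second value like "1ms" is accepted by a given Consul version is an assumption. The sketch only builds the JSON body and does not contact Consul.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sessionRequest models a subset of the session create payload.
// Only the fields relevant to this discussion are included.
type sessionRequest struct {
	Name      string `json:"Name,omitempty"`
	TTL       string `json:"TTL,omitempty"`
	LockDelay string `json:"LockDelay,omitempty"`
}

// buildSessionBody returns the JSON body for a session with the given
// TTL and lock-delay, both expressed as duration strings.
func buildSessionBody(name, ttl, lockDelay string) ([]byte, error) {
	return json.Marshal(sessionRequest{Name: name, TTL: ttl, LockDelay: lockDelay})
}

func main() {
	// "swarm-leader" is a hypothetical session name.
	body, err := buildSessionBody("swarm-leader", "10s", "1ms")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(body))
}
```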

@huxihx
Author

huxihx commented Sep 25, 2015

@abronan Thanks for the information, but the result shows that the lock-delay is still 15 seconds, as shown below:
image

Did I make a mistake?

@abronan
Contributor

abronan commented Sep 25, 2015

@Amethystic Actually, you are right: I just noticed that the libkv tests for Consul take 15s longer than the etcd ones, so setting the LockDelay has no effect in this case. Thanks for the update. I will investigate and fix 👍

@huxihx
Author

huxihx commented Sep 25, 2015

Thanks!

@abronan
Contributor

abronan commented Oct 19, 2015

@Amethystic Coming back to this, it seems that it works as intended with Consul. Just using --replication-ttl "10s" shows that the LockDelay in the session is set to 1000000ns, which is 1ms. (I should probably set this to 1ns, but it would be better to fix this in Consul and allow specifying a LockDelay of 0.)

related to hashicorp/consul#1077
