Small number of high latency requests after a leader election #12680
Comments
It's aligned with my recent observation on 3.4.13. What I found is that the node that was a follower and becomes the leader during the election has some trouble responding to the read-only requests it used to be serving. Can you check whether the 'delayed' requests are served by the new leader or by one of the other nodes?
So I'm not actually sure, and I have had difficulty trying to figure this kind of thing out in the past. Specifically, tracking which node a request is dispatched to has been troublesome; is there an endpoint I can hook into for it? Additionally, this should be a write-only workload (just …)
You can have 3 concurrently running clients, each of them connecting to a specific etcd node.
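For what it's worth, a minimal sketch of that kind of check, assuming clientv3 (v3.4 import path) and placeholder endpoints for the three members (a Get is shown here, but the same applies to whatever request type the test issues):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.4 import path
)

func main() {
	// One client per member, each pinned to a single endpoint, so a latency
	// spike can be attributed to the node that particular client talks to.
	endpoints := []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}

	for _, ep := range endpoints {
		ep := ep
		go func() {
			cli, err := clientv3.New(clientv3.Config{
				Endpoints:   []string{ep},
				DialTimeout: 2 * time.Second,
			})
			if err != nil {
				fmt.Println(ep, "dial error:", err)
				return
			}
			defer cli.Close()

			for {
				start := time.Now()
				_, err := cli.Get(context.Background(), "bench-key")
				fmt.Printf("%s latency=%v err=%v\n", ep, time.Since(start), err)
				time.Sleep(10 * time.Millisecond)
			}
		}()
	}
	select {} // run until interrupted
}
```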
Hi, I managed to get all that working today. It seems that you were correct: the delayed requests do seem to be read requests served by the new leader. (Leader killed at 20s, that process restarted at 40s, just read requests.) AFAICT the original leader is "10.0.0.1", and "10.0.0.2" is subsequently elected.
It might be caused by dropping the request at lines 1075 to 1078 in aefbd22, as later on we wait for a response (and never get one) in etcd/server/etcdserver/v3_server.go, line 754 in aefbd22, so nothing happens until the timeout. On my local setup, commenting out that `if` …
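If I understand the above correctly, the failure mode is roughly the following (a toy model only, not etcd's actual code; the names, the exact drop condition, and the timeout value are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// readStateC stands in for the channel on which read-index results are handed
// back to the server's linearizable-read loop.
var readStateC = make(chan uint64, 1)

// handleReadIndex mimics the guard under discussion: if the (new) leader has
// not yet committed an entry in its current term, the request is silently
// dropped instead of being answered or queued.
func handleReadIndex(index uint64, committedEntryInCurrentTerm bool) {
	if !committedEntryInCurrentTerm {
		return // dropped: the waiting side is never told
	}
	readStateC <- index
}

// waitForReadState mimics the waiting side: it blocks until a result arrives
// or the request timeout expires, so a dropped request turns into a stall for
// the full timeout rather than a quick retry.
func waitForReadState(timeout time.Duration) (uint64, error) {
	select {
	case idx := <-readStateC:
		return idx, nil
	case <-time.After(timeout):
		return 0, errors.New("timed out waiting for read state")
	}
}

func main() {
	start := time.Now()
	handleReadIndex(42, false) // right after the election: nothing committed yet
	_, err := waitForReadState(2 * time.Second)
	fmt.Println("waited", time.Since(start).Round(time.Millisecond), "err:", err)
}
```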
@wpedrak I think that the … Is it possible to do a case-split kind of thing? So in the standard case everything stays as it is, and in the …
@Cjen1 Could you please rerun your test with the `if` commented out to confirm that root cause? I agree that we should somehow delay execution of the code at lines 1083 to 1093 in aefbd22.
Line 509 in aefbd22
seems to always be executed post-commit. I don't know whether, in clusters without any RW traffic, we can assume a commit soon after the election that would drain the queued 'readIndex' updates.
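Very loosely, the "postpone until the first commit in the new term" idea could be sketched like this (illustrative only, not the actual patch; the types and hooks are invented):

```go
package main

import "fmt"

// readIndexQueue sketches the postpone idea: instead of dropping read-index
// requests that arrive before the new leader has committed an entry in its
// own term, queue them and flush the queue once that first commit happens.
type readIndexQueue struct {
	committedInTerm bool
	pending         []uint64
	serve           func(uint64) // hook that actually answers a read-index request
}

func (q *readIndexQueue) handle(req uint64) {
	if !q.committedInTerm {
		q.pending = append(q.pending, req) // postpone instead of dropping
		return
	}
	q.serve(req)
}

// onCommitInCurrentTerm would be called from the post-commit path mentioned
// above, draining everything that was queued during the election window.
func (q *readIndexQueue) onCommitInCurrentTerm() {
	q.committedInTerm = true
	for _, req := range q.pending {
		q.serve(req)
	}
	q.pending = nil
}

func main() {
	q := &readIndexQueue{serve: func(r uint64) { fmt.Println("served read index", r) }}
	q.handle(1)               // arrives during the election window: queued
	q.handle(2)               // queued
	q.onCommitInCurrentTerm() // first commit in the new term: both are served
	q.handle(3)               // served immediately from now on
}
```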
It is the second approach (the first being etcd-io#12762) to solve etcd-io#12680.
@wpedrak Below are repeats of the test using the fixes. It appears that they work as expected. As a minor point, I don't think that the current build-from-source instructions are correct; I had a devil of a time building your branches. For interpretation's sake, there are three repeats of each of the tests (etcd, etcd with postponing the reads, and etcd with retrying the reads after 500ms).
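For reference, the "retry the reads after 500ms" variant boils down to something like the sketch below (again illustrative, not the actual branch; `sendReadIndex` and the durations are placeholders): rather than blocking on the response channel for the full request timeout, the loop re-issues the read-index request every 500ms, so a dropped request costs at most one retry interval.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// requestReadIndexWithRetry sketches the retry approach: re-send the
// read-index request every retryInterval until a response arrives or the
// overall request timeout expires.
func requestReadIndexWithRetry(
	sendReadIndex func(),
	readStateC <-chan uint64,
	retryInterval, requestTimeout time.Duration,
) (uint64, error) {
	deadline := time.After(requestTimeout)
	retry := time.NewTicker(retryInterval)
	defer retry.Stop()

	sendReadIndex()
	for {
		select {
		case idx := <-readStateC:
			return idx, nil
		case <-retry.C:
			// The previous request may have been dropped (e.g. right after a
			// leader election); ask again instead of waiting the full timeout.
			sendReadIndex()
		case <-deadline:
			return 0, errors.New("timed out waiting for read index")
		}
	}
}

func main() {
	readStateC := make(chan uint64, 1)
	attempts := 0
	send := func() {
		attempts++
		if attempts >= 3 { // pretend the first two requests were dropped
			readStateC <- 42
		}
	}
	idx, err := requestReadIndexWithRetry(send, readStateC, 500*time.Millisecond, 7*time.Second)
	fmt.Println("attempts:", attempts, "index:", idx, "err:", err)
}
```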
@Cjen1 Great to hear that it works for you. Could you elaborate on the build issues you encountered? I just ran … and it went through without any issue.
I was getting an error that a build directory wasn't … (this was from within the … directory). I think it might have been related to using an older version of golang (v1.11.5), but I don't know enough about Go's tooling to fix it.
For the master branch we expect golang-1.15+. |
@ptabor Ah ok! |
I've recently been testing etcd's availability around leader failures.
I create a new client and run a load generator that dispatches 1000 requests per second, using a new goroutine for each request so that each request is applied asynchronously.
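Roughly, the setup looks like the sketch below (illustrative only, not the exact code used: the endpoints, key, test duration, and the clientv3 v3.4 import path are assumptions):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // v3.4 import path
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Placeholder endpoints for the three local members.
		Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// 1000 requests per second: one tick per millisecond, one goroutine per
	// request so that a slow request never blocks the dispatch of later ones.
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()

	deadline := time.Now().Add(60 * time.Second) // 60s test run
	for time.Now().Before(deadline) {
		<-ticker.C
		go func() {
			dispatched := time.Now()
			_, err := cli.Put(context.Background(), "bench-key", "value")
			// Record dispatch time and latency for plotting later.
			fmt.Printf("dispatched=%s latency=%v err=%v\n",
				dispatched.Format(time.RFC3339Nano), time.Since(dispatched), err)
		}()
	}
	time.Sleep(10 * time.Second) // let in-flight requests finish
}
```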
In a three-node configuration, when I kill the leader 20s into the test (it restarts at 40s), I observe the odd behaviour shown in the attached plot.
The plot shows the latency of each request against the time at which it was dispatched.
At 20s the leader is killed and a leader election occurs, so requests submitted during this period are stalled until the election completes. This explains why no requests submitted between 20s and 21s have latencies lower than 100ms.
However, after the election there is a subset of requests (~10-15%) which are stalled for roughly a further 8s (the downward-curving line of high-latency requests).
Are there any possible reasons for this? I've found it to be fairly reproducible, but I can't find an explanation for it.
(The nodes are all running on my local machine. The nodes and the client are running etcd v3.4.14.)