
etcd client missing a watch update. #9107

Closed

adityadani opened this issue Jan 6, 2018 · 10 comments

@adityadani commented Jan 6, 2018

Env info:

etcd (client and server) version: 3.2.2
etcd server cluster size: 3

I have 3 etcd clients, each configured with all 3 etcd server endpoints.

Expectation:
I am testing a scenario where one of the etcd server endpoints goes down; the etcd clients should still receive watch updates.

Outcome:
One of the 3 clients does not get an update for 30 seconds, while the other 2 clients get the update instantly.

The key put operation was done at timestamp 2018-01-05T10:05:13. We expect a watch update on every client within 30 seconds.
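For reference, here is a minimal sketch of one of these clients, assuming the v3.2-era clientv3 Go API; the endpoint addresses are placeholders for the three servers.

```go
// One of the three watch clients: configured with all three server
// endpoints, watching "foo" with a 30-second deadline.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Placeholder addresses for servers 169/170/171.
		Endpoints:   []string{"http://10.0.0.169:2379", "http://10.0.0.170:2379", "http://10.0.0.171:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// The put on "foo" should show up here within the deadline; if the
	// channel closes without an event, the client missed the update.
	for resp := range cli.Watch(ctx, "foo") {
		for _, ev := range resp.Events {
			log.Printf("watch event: %s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
			return
		}
	}
	log.Print("no watch update within 30 seconds")
}
```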

Here are the etcd server logs:
etcd-server-169
etcd-server-170 - Node which was stopped
etcd-server-171

To debug this issue, I added a debug statement in etcd clientv3's balancer code to print the last pinned etcd server address.
Two clients were pinned to etcd server 169, while the one client that did not get the update was pinned to etcd server 171.

An interesting log line I found on 169 was this:

2018-01-05 10:05:13.815332 W | etcdserver: timed out waiting for read index response

The above timestamp lies within the 30-second window in which we expect the update.
I have referred to #7970, which suggests that high fsync durations cause leader elections. But could that lead to a client connected to another etcd server missing an update?

Any help will be much appreciated.

@xiang90 (Contributor) commented Jan 6, 2018

Can you provide detailed steps or a simple script to reproduce the problem you hit?

@adityadani (Author) commented Jan 6, 2018

I am not able to reproduce this issue consistently; it happens maybe 2 out of 10 times.
I don't have a script as such, since this test case is part of a bigger test effort.
But essentially it does this in a loop:

On the server side:
1. Select one node randomly from the etcd cluster and stop it.
2. Start the node back and let it join the cluster.

Set up 3 clients to talk to the 3 etcd servers.
On the client side:
1. Select one client randomly and put a key "foo" into etcd
2. Expect a watch update on all the clients within 30 seconds
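Not the actual harness, but a rough standalone sketch of that loop, assuming the three nodes run as systemd units; the unit names and endpoint addresses are hypothetical placeholders.

```go
// Repro loop sketch: stop/restart a random node, put "foo" from a
// random client, and require a watch event on every client within 30s.
package main

import (
	"context"
	"log"
	"math/rand"
	"os/exec"
	"time"

	"github.com/coreos/etcd/clientv3"
)

var (
	// Placeholder endpoints and systemd unit names for the three nodes.
	endpoints = []string{"http://10.0.0.169:2379", "http://10.0.0.170:2379", "http://10.0.0.171:2379"}
	units     = []string{"etcd-169", "etcd-170", "etcd-171"}
)

func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Fatalf("%s %v: %v (%s)", name, args, err, out)
	}
}

func main() {
	rand.Seed(time.Now().UnixNano())

	// Each client is configured with all three endpoints, as in the setup above.
	clients := make([]*clientv3.Client, 3)
	for i := range clients {
		c, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
		if err != nil {
			log.Fatal(err)
		}
		clients[i] = c
	}

	for iter := 0; ; iter++ {
		// Server side: stop one random node, then let it rejoin.
		u := units[rand.Intn(len(units))]
		run("systemctl", "stop", u)
		run("systemctl", "start", u)
		time.Sleep(5 * time.Second) // crude wait for the node to rejoin

		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)

		// Client side: open a watch on every client before the put...
		watches := make([]clientv3.WatchChan, len(clients))
		for i, c := range clients {
			watches[i] = c.Watch(ctx, "foo")
		}

		// ...put "foo" from one random client...
		if _, err := clients[rand.Intn(len(clients))].Put(ctx, "foo", time.Now().String()); err != nil {
			log.Fatalf("iter %d: put failed: %v", iter, err)
		}

		// ...and expect an update on every watch within the 30-second window.
		for i, w := range watches {
			if resp, ok := <-w; !ok || len(resp.Events) == 0 {
				log.Fatalf("iter %d: client %d missed the watch update", iter, i)
			}
		}
		cancel()
	}
}
```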
@xiang90 (Contributor) commented Jan 6, 2018

> I don't have a script as such, since this test case is part of a bigger test effort.
> But essentially it does this in a loop:

Can you please put some effort into creating a script that we can run to reproduce it?

@adityadani (Author) commented Jan 6, 2018

@xiang90 Sure. Let me do that.

On a side note, can high fsync latency on a node cause such missed watch updates? Or are they completely unrelated?

@xiang90 (Contributor) commented Jan 6, 2018

@wgw335363240 commented Jan 15, 2018

I met the same problem with etcd version 3.2.9, and the same with 3.3.0-rc.

@xiang90 (Contributor) commented Jan 15, 2018

@wgw335363240

Can you reproduce it?

@wgw335363240 commented Jan 17, 2018

A new environment does not have the problem.

@xiang90 (Contributor) commented Jan 17, 2018

@wgw335363240

There is no way for us to help unless you tell us how to reproduce the problem.

@xiang90 (Contributor) commented Jan 24, 2018

I am going to close this due to low activity. Please reopen if anyone can reproduce it or can prove this is a real issue.

xiang90 closed this Jan 24, 2018
