500 InternalServerError when trying to PUT a key into a cluster #1538
Comments
@rocketraman I am unable to reproduce. What platform, version of Go, etc.?
@rocketraman Never mind, another developer, @unihorn, knows about this issue. Apparently it is intermittent and depends on the timing of proposals versus the HTTP response. We will work on it.
I think this is expected behavior. We set a timeout for each client request and cancel the request when the deadline passes. The timeout is currently fixed at 0.5s, which is short IMO. Some thoughts:
@rocketraman @philips @xiangli-cmu vote for them?
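To illustrate the per-request timeout described above, here is a minimal sketch in Go. It is not etcd's actual code: `propose` and `handlePut` are hypothetical names, and the commit is simulated with a timer. It only shows how a fixed deadline turns a slow commit into a 500 response. In this sketch the simulated proposal is abandoned when the deadline passes, whereas in the report in this issue the key was still set despite the 500.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// propose is a hypothetical stand-in for submitting a write to the cluster.
// It "commits" after commitMs milliseconds unless the context ends first.
func propose(ctx context.Context, commitMs int) error {
	select {
	case <-time.After(time.Duration(commitMs) * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// handlePut returns the HTTP status a handler might send for a PUT when
// each request is given a fixed deadline of timeoutMs milliseconds.
func handlePut(timeoutMs, commitMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	err := propose(ctx, commitMs)
	if errors.Is(err, context.DeadlineExceeded) {
		// The client sees a 500 because the deadline passed first.
		return http.StatusInternalServerError,
			fmt.Errorf("no commit within %dms: %w", timeoutMs, err)
	}
	if err != nil {
		return http.StatusInternalServerError, err
	}
	return http.StatusOK, nil
}

func main() {
	// Illustrative numbers only: a fast commit, then one slower than the deadline.
	fmt.Println(handlePut(500, 10))
	fmt.Println(handlePut(50, 300))
}
```

Because the status depends only on whether the commit beats the deadline, a cluster whose commit latency sits near the timeout would be expected to fail intermittently.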
Ah, that explains it. But why would my cluster latency be so high? Everything is running on localhost on a fast machine. Cluster statistics show that communication is faster than with the pre-0.5 code, but it is indeed averaging just above the 0.5s mark:
👍 A
It would increase complexity, but could the timeout be based on the cluster statistics?
Alternatively, the HTTP response could include a processing ID that the client uses to check completion status in a subsequent request.
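One way the statistics-based timeout suggested above might look is sketched below. Everything here is an assumption for illustration: the `LatencyStats` type, the `adaptiveTimeout` function, the "average plus k standard deviations" rule, and the numbers in `main` are not taken from etcd.

```go
package main

import (
	"fmt"
	"time"
)

// LatencyStats is a hypothetical summary of observed commit latency,
// expressed in milliseconds.
type LatencyStats struct {
	AverageMs float64
	StdDevMs  float64
}

// adaptiveTimeout derives a request timeout from observed latency:
// average + k standard deviations, clamped to [floorMs, ceilingMs].
func adaptiveTimeout(s LatencyStats, k, floorMs, ceilingMs float64) time.Duration {
	ms := s.AverageMs + k*s.StdDevMs
	if ms < floorMs {
		ms = floorMs
	}
	if ms > ceilingMs {
		ms = ceilingMs
	}
	return time.Duration(ms * float64(time.Millisecond))
}

func main() {
	// Illustrative values: latency averaging a little over 0.5s.
	stats := LatencyStats{AverageMs: 520, StdDevMs: 40}
	fmt.Println(adaptiveTimeout(stats, 3, 500, 5000))
}
```

The floor keeps a fast cluster from getting an unreasonably tight deadline, and the ceiling bounds how long a client can be left waiting when latency degrades.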
@rocketraman I am not sure whether this is mostly related to network latency; we will analyze the performance soon. 0.4 does not have this timeout and returns only when it gets the result.
@unihorn In almost all cases the disk sync is more expensive than the network latency. My guess is that the same is happening here.
@rocketraman FWIW we have bumped the timeout to 5 mins in 0.5 to restore the same behaviour as 0.4.
Is there a way I can help analyze this by running some profiling on my machine? Are other people seeing similar numbers when doing
@rocketraman How often do you hit this case? I run a local cluster on my mid-range Mac Pro and never hit it.
I get this timeout almost 100% of the time. It's a new test cluster created by
Is it really a
I'm running on a software RAID-5 array, which might slow things down a bit, but it shouldn't be significant on the order of fractions of a second. Here is a dd-based benchmark using
So many small writes of 512b with sync are pretty slow at 84ms each (as might be expected on a software RAID-5), but are still well within the 0.5s timeout window. Are there any other benchmarks I can provide? I also noticed that when the cluster is running, but idle, the
Here is some strace summary output showing the system calls for one of the etcd processes over about 10 seconds:
@rocketraman The continuous small requests are expected; they are caused by the internal implementation.
For comparison, I set up etcd to use my SSD for storage. I have no problems on the SSD, so the issue appears to be related to my RAID array. The
I started an etcd cluster with 3 members (via goreman) and performed a PUT every 0.1s for 30s, while tracing I/O using
The blktrace -> blkparse -> btt output is here: https://gist.github.com/rocketraman/a8ea43255674b561c53b
The device major/minor numbers are:
I'm not seeing any obvious problems at the I/O layer that should be causing timeouts in
@rocketraman Can you please try with the current master? We have fixed a bunch of small issues, which might cause this.
@rocketraman I am going to close this issue. Thanks for reporting!
Using master as of commit 6087e2b, started a 3-member cluster:
Attempt to put a key:
Despite the 500 Internal Server Error, the key is set:
The console shows: