Exceptionally weird server crashes due to panic: sync: negative WaitGroup counter #7

Hey,

First of all thank you for this really helpful library! :-)

After having spent multiple hours testing this lib and reasoning about your code, whether it is free of race conditions and whether it is robust enough to survive in a high-profile scenario, I decided to give it a try. Since then I have been using endless in a mission-critical production app.

All in all I'm very happy with the result. But there are situations where my Go server sporadically crashes due to a panic (sync: negative WaitGroup counter) in endless with the following trace:

(Note: I forked your lib and extended it with the ability to write pid files. But the same problem appears in the original lib as well.)

So, I again started to reason about the endless code in order to get rid of this problem. There are basically two ways the WaitGroup counter can become negative. The first is when hammerTime() tries to forcefully shut down the parent process. The second, the one that is panicking here, is when a connection is closed. To me this is really weird, because according to your code this panic implies that endless wants to close connections that have never been established.

Any hints on this one? :-)

Bye,
Phil
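The panic itself is plain sync.WaitGroup behavior: it fires as soon as Done() has been called more often than Add(). A minimal standalone demonstration of just that mechanism (not endless code):

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        wg.Done()
        wg.Done() // one Done too many: panic: sync: negative WaitGroup counter
    }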
Comments
hi phil, thanks for your report! the waitgroup going below zero should - as you describe - only happen after a server gets hammered. i was not happy with the recover solution there, but it seemed to work... not for you it seems, so let's look for a better solution. do you have any test code you could provide to help me reproduce your issue? cheers
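For reference, a recover-based guard of the kind mentioned here looks roughly like the following sketch (safeDone is a made-up name; this is not the library's actual code):

    package main

    import (
        "log"
        "sync"
    )

    // safeDone swallows the negative-counter panic instead of letting it
    // crash the server.
    func safeDone(wg *sync.WaitGroup) {
        defer func() {
            if r := recover(); r != nil {
                log.Println("recovered in Done():", r)
            }
        }()
        wg.Done()
    }

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        safeDone(&wg)
        safeDone(&wg) // would normally panic; here it is only logged
    }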
Hi @fvbock, sorry for my belated response. I'm currently quite busy over here. I analyzed my logs very thoroughly and couldn't find any [STOP - Hammer Time] Forcefully shutting down parent. entries. So this means that the for-loop that counts the WaitGroup down to zero (in hammerTime()) is never executed and can't be the cause. I wish that I could isolate the problem with a test case or something. But unfortunately it's not that easy. If I could do so, it probably wouldn't be too hard to find a solution. Meanwhile my fork catches the panic and writes a log statement.
And from the log statement I try to figure out patterns that show up when the problem occurs. But to be honest, no real pattern has shown up so far. There are situations where we've got a lot of traffic and everything is fine, even for a longer period of time. And then there are situations where the server only has to handle a couple of requests and the problem shows up. All in all very weird. That's all I can tell you for now. Will continue to monitor the problem and inform you if I gain more information. Bye, Phil
I have the same problem and can repeat it easily. It happens every time I stress test my server, and also when I just quickly reload my page in the browser. P.S.
@flamedmg thanks! is this maybe related? johnniedoe@715b6ce. i completely missed that PR :-( sounds like it could have something to do with it. i will look into it tomorrow. is the stress test you're running something you could post in a gist so that i could use it?
I'm using wrk to do the stress testing for me. The command looks like this: wrk ... I'm not the owner of mytest.com; I overrode it in the /etc/hosts file. Hope this will help. In the meantime I can revert the changes you mentioned. Thanks & Regards
I figured out that johnniedoe@715b6ce was not accepted by you, so I added the change myself to my local copy. The server still fails, even after a 3-second stress test.
What I figured out is that the Close method on the endlessConn object is called multiple times for the same connection object. I discovered this by giving each connection a unique identifier and seeing some of them in the log at least twice. This does not happen all the time: the fewer connections there are, the lower the chance of getting this behavior. 100 connections is enough to get it in 100% of cases on my machine.
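A sketch of how such per-connection tagging can be done (taggedConn and the listener wrapper are illustrative names, not the actual debugging code used here):

    package main

    import (
        "log"
        "net"
        "net/http"
        "sync/atomic"
    )

    var nextID uint64

    // taggedConn carries a unique id and counts how often Close is called.
    type taggedConn struct {
        net.Conn
        id     uint64
        closes uint32
    }

    func (c *taggedConn) Close() error {
        if n := atomic.AddUint32(&c.closes, 1); n > 1 {
            log.Printf("conn %d: Close called %d times", c.id, n)
        }
        return c.Conn.Close()
    }

    // taggedListener wraps every accepted connection in a taggedConn.
    type taggedListener struct{ net.Listener }

    func (l taggedListener) Accept() (net.Conn, error) {
        c, err := l.Listener.Accept()
        if err != nil {
            return nil, err
        }
        return &taggedConn{Conn: c, id: atomic.AddUint64(&nextID, 1)}, nil
    }

    func main() {
        ln, err := net.Listen("tcp", "localhost:4242")
        if err != nil {
            log.Fatal(err)
        }
        log.Fatal(http.Serve(taggedListener{ln}, http.DefaultServeMux))
    }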
i tried a bunch of things, but i could not reproduce this so far. i used variations of this https://github.com/fvbock/endless/blob/master/examples/testserver.go one: i dropped the delay in the handler and sent 1k, 10k, 100k and 1000k payloads that i created from /dev/urandom. i tested with the server being restarted while the test was running and without. i did use wrk as well. what go version and OS are you guys running?
Your testserver is not failing under wrk either. I tried several timeout values. I'm running Linux Mint 17.1 and go 1.4.2 linux/amd64.
that's a start. can you post some (or all) of your test server code? what are the differences...? i guess yours is more complex?
@justphil can you see any general difference between the basic https://github.com/fvbock/endless/blob/master/examples/testserver.go and your server code?
Sorry @fvbock, @flamedmg, I'm currently very busy due to my job. Will take a look at it on the weekend, and will post details about my system configuration as well. BTW.
I think I found the problem. I added some code in endlessConn.Close to identify the connection being closed and the call stack. It turns out that a connection is closed twice: once from net/http/server.go:274 and once from net/http/server.go:1071. So I guess whenever the connection gets interrupted while writing, it will be closed twice. But it doesn't crash the app immediately. The crash happens when the last few connections (depending on how many times it happened) are about to get closed. Here are the stack traces of both closing actions. As you can see, they happened almost at the same time.
In between I got an error from io.Copy complaining about "broken pipe", which pretty much explains what triggered the closing action.
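One way to make the wrapped Close idempotent, so the WaitGroup is decremented at most once per connection, is sync.Once. A sketch of the idea, not the actual endless patch:

    package main

    import (
        "fmt"
        "net"
        "sync"
    )

    // onceConn decrements the server's WaitGroup at most once, even if
    // net/http ends up calling Close twice on the same connection.
    type onceConn struct {
        net.Conn
        wg   *sync.WaitGroup
        once sync.Once
    }

    func (c *onceConn) Close() (err error) {
        c.once.Do(func() {
            err = c.Conn.Close()
            c.wg.Done()
        })
        return
    }

    func main() {
        var wg sync.WaitGroup
        c1, c2 := net.Pipe()
        defer c2.Close()

        wg.Add(1)
        conn := &onceConn{Conn: c1, wg: &wg}
        conn.Close()
        conn.Close() // second call is a no-op: no negative-counter panic
        wg.Wait()
        fmt.Println("closed exactly once")
    }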
@ledzep2 That's exactly what I've found, but I can't reproduce it with a small sample app, only with my pretty large code base.
@flamedmg Did you try manually interrupting the connection while transferring data (like killall -9 wrk)? Theoretically that should do the trick.
No, I didn't terminate the process in any way; please check my earlier messages in this thread. I found the issue during load testing. What that tool does is open a specified number of keep-alive connections and make requests, and after that it closes them. During closing I found that some connections are closed two or even more times. That makes the open-connection counter negative and the library fails. I'm still waiting on a solution while our app is not yet in production. When it goes to production I think I will do what @justphil did: basically catch and silence all panics in that part of the code. Ugly, but it will work. Or I will work on rewriting that logic.
@flamedmg I read your previous posts. I'm trying to locate the problem in the source code and reproduce it. Will report back later.
@justphil @flamedmg I reproduced it, folks. Replace the handler in testserver.go with the following:
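The exact snippet is missing here; a handler in this spirit, streaming a large body slowly so the client can abort mid-response, reliably triggers the double close (the /foo route, payload size, and sleep interval are assumptions):

    package main

    import (
        "log"
        "net/http"
        "time"

        "github.com/fvbock/endless"
    )

    func main() {
        mux := http.NewServeMux()
        // Stream ~100 MB slowly so the download can be interrupted mid-transfer.
        mux.HandleFunc("/foo", func(w http.ResponseWriter, r *http.Request) {
            chunk := make([]byte, 1<<20) // 1 MB of zero bytes per write
            for i := 0; i < 100; i++ {
                if _, err := w.Write(chunk); err != nil {
                    log.Println("write failed (client gone):", err)
                    return
                }
                if f, ok := w.(http.Flusher); ok {
                    f.Flush()
                }
                time.Sleep(100 * time.Millisecond)
            }
        })
        log.Println(endless.ListenAndServe("localhost:4242", mux))
    }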
Then wget /foo and interrupt it with ctrl+c as soon as it starts. Crashes every time.
I created a pull request for this with the commit above. #11 |
I think this still happens. See #13.