
Exceptionally weird server crashes due to panic: sync: negative WaitGroup counter #7

Closed
justphil opened this issue Jun 24, 2015 · 24 comments


@justphil

Hey,
First of all thank you for this really helpful library! :-)

After spending multiple hours testing this lib and reasoning about your code, checking whether it is free of race conditions and robust enough to survive a high-profile scenario, I decided to give it a try. Since then I have been using endless in a mission-critical production app.

All in all I'm very happy with the result. But there are situations where my Go server sporadically crashes due to a panic (sync: negative WaitGroup counter) in endless with the following trace:

goroutine 3849 [running]:
runtime.panic(0x764000, 0xc208404de0)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/panic.c:279 +0xf5
sync.(*WaitGroup).Add(0xc208042080, 0xffffffffffffffff)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/sync/waitgroup.go:64 +0x93
sync.(*WaitGroup).Done(0xc208042080)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/sync/waitgroup.go:82 +0x30
bitbucket.org/justphil/glutenplan.de/vendor/endless.endlessConn.Close(0x7f0997bd9ce0, 0xc20803e018, 0xc208042000, 0x0, 0x0)
    /Users/pt/Dev/Workspaces/go/src/bitbucket.org/justphil/glutenplan.de/vendor/endless/endless.go:508 +0x48
bitbucket.org/justphil/glutenplan.de/vendor/endless.(*endlessConn).Close(0xc2083d8360, 0x0, 0x0)
    <autogenerated>:16 +0xa4
net/http.(*conn).close(0xc20836e000)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1047 +0x4f
net/http.func·011()
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1104 +0x22a
net/http.(*conn).serve(0xc20836e000)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1187 +0x78b
created by net/http.(*Server).Serve
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1721 +0x313

(Note: I forked your lib and extended it with the ability to write pid files. But the same problem appears in the original lib as well.)

So, I again started to reason about the endless code in order to get rid of this problem. There are basically two ways the WaitGroup counter can become negative. The first is when hammerTime() tries to forcefully shut down the parent process. The second is the one that is panicking here, when a connection is closed. To me this is really weird, because according to your code this panic implies that endless wants to close connections that have never been established.
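
For illustration, here is a minimal standalone snippet (independent of endless) that triggers this class of panic simply by calling Done more often than Add:

package main

import "sync"

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    wg.Done()
    // a second Done without a matching Add drives the counter below zero
    // and panics with: sync: negative WaitGroup counter
    wg.Done()
}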

Any hints on this one? :-)

Bye,
Phil

@fvbock
Owner

fvbock commented Jun 28, 2015

hi phil,

thanks for your report!

the waitgroup going below zero should - as you describe - only happen after a server needs to get hammered. i was not happy with the recover solution there, but it seemed to work... not for you, it seems, so let's look for a better solution.

do you have any test code you could provide to help me reproduce your issue?

cheers
_f

@justphil
Author

justphil commented Jul 6, 2015

Hi @fvbock,

sorry for my belated response. I'm currently quite busy over here.

I analyzed my logs very thoroughly and couldn't find any [STOP - Hammer Time] Forcefully shutting down parent entries. This means that the for-loop that counts the WaitGroup down to zero (in the hammerTime() func) was not executed in the crash scenario. Therefore my assumption is that there is no causal link between the hammering and the crash.
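
For reference, the force-shutdown path described above boils down to a loop of roughly this shape (a simplified, self-contained sketch based on the description in this thread, not the library's actual code; all names here are illustrative):

package main

import (
    "log"
    "sync"
    "time"
)

// hammerTime is a simplified sketch of a force shutdown: after a grace
// period it keeps decrementing the WaitGroup until the blocked Wait()
// in the shutdown path is released.
func hammerTime(wg *sync.WaitGroup, grace time.Duration, done <-chan struct{}) {
    defer func() {
        // decrementing past zero panics; recover so shutdown can still proceed
        if r := recover(); r != nil {
            log.Println("hammerTime recovered:", r)
        }
    }()
    time.Sleep(grace)
    log.Println("[STOP - Hammer Time] Forcefully shutting down parent")
    for {
        select {
        case <-done:
            return
        default:
            wg.Done()
        }
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(3) // three connections that never finish on their own
    done := make(chan struct{})
    go hammerTime(&wg, 100*time.Millisecond, done)
    wg.Wait() // released once hammerTime has drained the counter
    close(done)
}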

I wish I could isolate the problem with a test case or something, but unfortunately it's not that easy. If I could, it probably wouldn't be too hard to find a solution.

Meanwhile, my Close() func looks like this:

func (w endlessConn) Close() error {
    // prevent the server from crashing (this is only a mitigation!)
    // TODO: find the root cause for the problem...
    defer func() {
        if err := recover(); err != nil {
            log.Println("endless WaitGroup went below 0 when closing connection", err)
        }
    }()
    w.server.wg.Done()
    return w.Conn.Close()
}

From the log statement I try to figure out patterns that show up when the problem occurs, but to be honest it hasn't shown a real pattern so far. There are situations where we've got a lot of traffic and everything is fine, even for a longer period of time. And then there are situations where the server only has to handle a couple of requests and the problem shows up. All in all very weird.

That's all I can tell you for now. I will continue to monitor the problem and let you know if I gain more information.

Bye,
Phil

@flamedmg

I have the same problem and can reproduce it easily. It happens every time I stress test my server, and also when I just reload my page quickly in the browser.

P.S.
Forgot to mention that the server fails after the stress test is complete. While requests are still coming in, the server does not fail.

@fvbock
Owner

fvbock commented Jul 10, 2015

@flamedmg thanks! is this maybe related? johnniedoe@715b6ce i completely missed that PR :-( sounds like it could have something to do with it.

i will look into it tomorrow.

is the stress test you're running something you could post in a gist so that i could use it?

@flamedmg

I'm using wrk to do the stress testing for me. The command looks like this: wrk -t100 -c5000 -d120s http://www.mytest.com:8080

I'm not the owner of mytest.com; I overrode it in my /etc/hosts file.

Hope this helps. In the meantime I can revert the changes you mentioned and test again.


@flamedmg

I figured out that johnniedoe@715b6ce has not been merged by you, so I added the change myself to my local copy. The server still fails, even after a 3-second stress test.

@flamedmg

What I figured out is that the Close method on the endlessConn object is called multiple times for the same connection object. I discovered this by giving each connection a unique identifier and seeing some of them in the log at least twice. This does not happen all the time: the fewer connections there are, the lower the chance of hitting this behavior. 100 connections is enough to trigger it in 100% of cases on my machine.
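
Instrumentation along these lines makes the double close visible - a hypothetical sketch (the type, fields, and log format are illustrative, not the actual code I used):

package main

import (
    "log"
    "net"
    "runtime/debug"
    "sync/atomic"
)

// loggedConn wraps a net.Conn with a unique id and logs a stack trace
// whenever Close is called more than once on the same connection.
type loggedConn struct {
    net.Conn
    id     uint64
    closed int32 // 0 = open, 1 = already closed
}

func (c *loggedConn) Close() error {
    if !atomic.CompareAndSwapInt32(&c.closed, 0, 1) {
        log.Printf("connection #%d closed more than once:\n%s", c.id, debug.Stack())
    }
    return c.Conn.Close()
}

func main() {
    c1, _ := net.Pipe()
    lc := &loggedConn{Conn: c1, id: 1}
    lc.Close()
    lc.Close() // the second close is logged together with its call stack
}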

@fvbock
Owner

fvbock commented Jul 12, 2015

hi @justphil @flamedmg

i tried a bunch of things, but i could not reproduce this so far. i used variations of the example server at https://github.com/fvbock/endless/blob/master/examples/testserver.go: i dropped the delay in the handler and sent 1k, 10k, 100k and 1000k payloads that i created from /dev/urandom.

i tested with the server being restarted while running the test and without.

i did use ab instead of wrk: ab -c 2000 -n 2000000 http://localhost:4242/foo

i am running
Linux 3.19.0-21-generic #21-Ubuntu SMP
go version go1.4.2 linux/amd64

what are you guys running?

@flamedmg

Your testserver is not failing under wrk either. I tried several timeout values.

I'm running Linux Mint 17.1 and go 1.4.2 linux/amd64.

@fvbock
Owner

fvbock commented Jul 13, 2015

that's a start. can you post (some or all of) the code of your test server? what are the differences...? i guess yours is more complex?

@fvbock
Owner

fvbock commented Jul 16, 2015

Your testserver is not failing under wrk either. I tried several timeout values.

@justphil can you see any general difference between the basic https://github.com/fvbock/endless/blob/master/examples/testserver.go and your server code?

@justphil
Author

Sorry @fvbock, @flamedmg, I'm currently very busy with my job. I will take a look at it on the weekend and will post details about my system configuration as well.

BTW, I think I can now see an emerging crash pattern. The first process seems to run "endlessly" without any problems. The problems start to occur after the first hot redeployment, when the parent process passes the listening socket to the child process. From that point on the mentioned problem shows up.

@fvbock
Owner

fvbock commented Jul 23, 2015

@justphil @flamedmg sorry for taking a while again... i tried a few more things but with a server based on the testserver i was not able to reproduce the behaviour both of you observed.
any code that produces it would be helpful at this point.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

I think I found the problem. I added some code in endlessConn.Close to identify the connection being closed and the call stack. It turns out that a connection is closed twice: once from net/http/server.go:274 and once from net/http/server.go:1071.

So I guess whenever a connection gets interrupted while writing, it ends up being closed twice. But it doesn't crash the app immediately. The crash happens when the last few connections (depending on how many times it happened) are about to get closed. Here are the stack traces of both closing actions. As you can see, they happened almost at the same time.

2015/08/11 07:22:59 Closing connection #379
/root/repo/go/src/github.com/fvbock/endless/endless.go:514 (0x4c0c0f)
<autogenerated>:16 (0x4c1de2)
/usr/lib/go/src/net/http/server.go:274 (0x493eed)
/usr/lib/go/src/bufio/bufio.go:562 (0x50e375)
/usr/lib/go/src/net/http/server.go:1005 (0x498d13)
/usr/lib/go/src/net/http/server.go:977 (0x498a77)
<autogenerated>:47 (0x4b1c29)
/usr/lib/go/src/io/io.go:364 (0x4c6af8)
/usr/lib/go/src/net/http/server.go:391 (0x494a39)
/usr/lib/go/src/bufio/bufio.go:433 (0x50da94)
/usr/lib/go/src/io/io.go:354 (0x4c6932)
/root/repo/go/src/bs2proxy/proxy.go:221 (0x407179)
/root/repo/go/src/bs2proxy/controller.go:273 (0x404537)
/root/repo/go/src/bs2proxy/main.go:33 (0x40afa5)
/usr/lib/go/src/net/http/server.go:1265 (0x49a2b1)
/usr/lib/go/src/net/http/server.go:1541 (0x49ba1d)
/usr/lib/go/src/net/http/server.go:1703 (0x49c33a)
/usr/lib/go/src/net/http/server.go:1204 (0x499e07)
/usr/lib/go/src/runtime/asm_amd64.s:2232 (0x448f51)

In between, I got an error from io.Copy complaining about "broken pipe", which pretty much explains what caused the closing action.

2015/08/11 07:22:59 Closing connection #379
/root/repo/go/src/github.com/fvbock/endless/endless.go:514 (0x4c0c0f)
<autogenerated>:16 (0x4c1de2)
/usr/lib/go/src/net/http/server.go:1071 (0x4990b0)
/usr/lib/go/src/net/http/server.go:1134 (0x4ac0fb)
/usr/lib/go/src/net/http/server.go:1217 (0x499ab9)
/usr/lib/go/src/runtime/asm_amd64.s:2232 (0x448f51)
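
One way to make the connection wrapper robust against such a double close is to make the WaitGroup decrement idempotent, for example with sync.Once. A minimal sketch of the idea (illustrative only; the names are made up and this is not necessarily the fix that was eventually merged):

package main

import (
    "net"
    "sync"
)

// idempotentConn guards the WaitGroup decrement with sync.Once so that a
// second Close (as seen in the traces above) cannot drive the counter negative.
type idempotentConn struct {
    net.Conn
    wg   *sync.WaitGroup
    once sync.Once
}

func (c *idempotentConn) Close() error {
    c.once.Do(c.wg.Done) // Done runs at most once, however often Close is called
    return c.Conn.Close()
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    c1, _ := net.Pipe()
    c := &idempotentConn{Conn: c1, wg: &wg}
    c.Close()
    c.Close() // the double close no longer panics
    wg.Wait()
}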

@flamedmg

@ledzep2 That's exactly what I've found, but I can't reproduce it in a small sample app, just in my pretty large code base.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

@flamedmg Did you try manually interrupting the connection while transferring data (like killall -9 wrk)? Theoretically that should do the trick.

@flamedmg

No, I did not terminate the process in any way; please check my earlier messages in this thread. I found the issue during load testing. What that tool does is open a specified number of keep-alive connections and make requests; after that it closes them. During closing I found that some connections are closed two or even more times. That makes the open-connection counter negative and the library fails. I'm still waiting on a solution while our app is not yet in production. When it goes to production, I think I will do what @justphil did - basically catch and silence all panics in that part of the code. Ugly, but it will work - or I will work on rewriting that logic.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

@flamedmg I read your previous posts. I'm trying to locate the problem in the source code and reproduce it. I'll report back later.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

@justphil @flamedmg I reproduced it, folks. Replace the handler in testserver.go with the following (make sure bytes and io are imported):

func handler(w http.ResponseWriter, r *http.Request) {
    // stream a 50 MB zero-filled buffer so the client can interrupt mid-transfer
    buf := make([]byte, 1000*1000*50)
    br := bytes.NewReader(buf)
    io.Copy(w, br)
}

Then wget /foo and interrupt it with Ctrl+C once the download begins. It crashes every time.

ledzep2 added a commit to ledzep2/endless that referenced this issue Aug 11, 2015
@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

I created a pull request for this with the commit above. #11

fvbock added a commit that referenced this issue Aug 11, 2015
@fvbock
Owner

fvbock commented Aug 11, 2015

with the handler @ledzep2 posted i was also able to reproduce the problem, and #11 fixed it. i merged the PR.

@justphil @flamedmg can you confirm that it fixes the problem in your scenarios too?

@fvbock
Owner

fvbock commented Aug 15, 2015

@justphil @flamedmg closing this for now - please let me know if you still experience problems.

@fvbock fvbock closed this as completed Aug 15, 2015
@justphil
Author

Thank you for the further investigation @fvbock @ledzep2 and @flamedmg.
Will try the fix in our staging environment and report back if there are still problems.

Bye, Phil

@fzerorubigd
Contributor

I think this still happens. See #13.
