
Exceptionally weird server crashes due to panic: sync: negative WaitGroup counter #7

Closed
justphil opened this issue Jun 24, 2015 · 24 comments


@justphil

Hey,
First of all thank you for this really helpful library! :-)

After spending multiple hours testing this lib and reasoning about your code, checking whether it is free of race conditions and robust enough to survive a high-profile scenario, I decided to give it a try. Since then I have been using endless in a mission-critical production app.

All in all I'm very happy with the result. But there are situations where my Go server sporadically crashes due to a panic (sync: negative WaitGroup counter) in endless with the following trace:

goroutine 3849 [running]:
runtime.panic(0x764000, 0xc208404de0)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/panic.c:279 +0xf5
sync.(*WaitGroup).Add(0xc208042080, 0xffffffffffffffff)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/sync/waitgroup.go:64 +0x93
sync.(*WaitGroup).Done(0xc208042080)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/sync/waitgroup.go:82 +0x30
bitbucket.org/justphil/glutenplan.de/vendor/endless.endlessConn.Close(0x7f0997bd9ce0, 0xc20803e018, 0xc208042000, 0x0, 0x0)
    /Users/pt/Dev/Workspaces/go/src/bitbucket.org/justphil/glutenplan.de/vendor/endless/endless.go:508 +0x48
bitbucket.org/justphil/glutenplan.de/vendor/endless.(*endlessConn).Close(0xc2083d8360, 0x0, 0x0)
    <autogenerated>:16 +0xa4
net/http.(*conn).close(0xc20836e000)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1047 +0x4f
net/http.func·011()
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1104 +0x22a
net/http.(*conn).serve(0xc20836e000)
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1187 +0x78b
created by net/http.(*Server).Serve
    /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/http/server.go:1721 +0x313

(Note: I forked your lib and extended it with the ability to write pid files. But the same problem appears in the original lib as well.)

So, I again started to reason about the endless code in order to get rid of this problem. There are basically two ways the WaitGroup counter can become negative. The first is when hammerTime() tries to forcefully shut down the parent process. The second is the one that is panicking here, when a connection is closed. To me this is really weird, because according to your code this panic implies that endless wants to close connections that have never been established.
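
For illustration, here is a minimal standalone snippet (independent of endless) that triggers this class of panic simply by calling Done more often than Add:

package main

import "sync"

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    wg.Done()
    // a second Done without a matching Add drives the counter below zero
    // and panics with: sync: negative WaitGroup counter
    wg.Done()
}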

Any hints on this one? :-)

Bye,
Phil

@fvbock
Owner

fvbock commented Jun 28, 2015

hi phil,

thanks for your report!

the waitgroup going below zero should - as you describe - only happen after a server needs to get hammered. i was not happy with the recover solution there, but it seemed to work... not for you, it seems, so let's look for a better solution.

do you have any test code you could provide to help me reproduce your issue?

cheers
_f

@justphil
Author

justphil commented Jul 6, 2015

Hi @fvbock,

sorry for my belated response. I'm currently quite busy over here.

I analyzed my logs very thoroughly and couldn't find any [STOP - Hammer Time] Forcefully shutting down parent entries. This means that the for-loop that counts the WaitGroup down to zero (in the hammerTime() func) was not executed in the crash scenario. Therefore my assumption is that there is no causal link between the hammering and the crash.
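
For reference, the force-shutdown path described above boils down to a loop of roughly this shape (a simplified, self-contained sketch based on the description in this thread, not the library's actual code; all names here are illustrative):

package main

import (
    "log"
    "sync"
    "time"
)

// hammerTime is a simplified sketch of a force shutdown: after a grace
// period it keeps decrementing the WaitGroup until the blocked Wait()
// in the shutdown path is released.
func hammerTime(wg *sync.WaitGroup, grace time.Duration, done <-chan struct{}) {
    defer func() {
        // decrementing past zero panics; recover so shutdown can still proceed
        if r := recover(); r != nil {
            log.Println("hammerTime recovered:", r)
        }
    }()
    time.Sleep(grace)
    log.Println("[STOP - Hammer Time] Forcefully shutting down parent")
    for {
        select {
        case <-done:
            return
        default:
            wg.Done()
        }
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(3) // three connections that never finish on their own
    done := make(chan struct{})
    go hammerTime(&wg, 100*time.Millisecond, done)
    wg.Wait() // released once hammerTime has drained the counter
    close(done)
}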

I wish I could isolate the problem with a test case or something, but unfortunately it's not that easy. If I could, it probably wouldn't be too hard to find a solution.

Meanwhile, my Close() func looks like this:

func (w endlessConn) Close() error {
    // prevent the server from crashing (this is only a mitigation!)
    // TODO: find the root cause for the problem...
    defer func() {
        if err := recover(); err != nil {
            log.Println("endless WaitGroup went below 0 when closing connection", err)
        }
    }()
    w.server.wg.Done()
    return w.Conn.Close()
}

From the log statement I try to figure out patterns that show up when the problem occurs, but to be honest it hasn't shown a real pattern so far. There are situations where we've got a lot of traffic and everything is fine, even for a longer period of time. And then there are situations where the server only has to handle a couple of requests and the problem shows up. All in all very weird.

That's all I can tell you for now. I will continue to monitor the problem and let you know if I gain more information.

Bye,
Phil

@flamedmg

I have the same problem and can reproduce it easily. It happens every time I stress test my server, and also when I just reload my page quickly in the browser.

P.S.
Forgot to mention that the server fails after the stress test is complete. While requests are still coming in, the server does not fail.

@fvbock
Owner

fvbock commented Jul 10, 2015

@flamedmg thanks! is this maybe related? johnniedoe@715b6ce i completely missed that PR :-( sounds like it could have something to do with it.

i will look into it tomorrow.

is the stress test you're running something you could post in a gist so that i could use it?

@flamedmg

I'm using wrk to do the stress testing for me. The command looks like this: wrk -t100 -c5000 -d120s http://www.mytest.com:8080

I'm not the owner of mytest.com; I overrode it in my /etc/hosts file.

Hope this helps. In the meantime I can revert the changes you mentioned and test again.


@flamedmg

I figured out that johnniedoe@715b6ce has not been merged by you, so I added the change myself to my local copy. The server still fails, even after a 3-second stress test.

@flamedmg

What I figured out is that the Close method on the endlessConn object is called multiple times for the same connection object. I discovered this by giving each connection a unique identifier and seeing some of them in the log at least twice. This does not happen all the time: the fewer connections there are, the lower the chance of hitting this behavior. 100 connections is enough to trigger it in 100% of cases on my machine.
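
Instrumentation along these lines makes the double close visible - a hypothetical sketch (the type, fields, and log format are illustrative, not the actual code I used):

package main

import (
    "log"
    "net"
    "runtime/debug"
    "sync/atomic"
)

// loggedConn wraps a net.Conn with a unique id and logs a stack trace
// whenever Close is called more than once on the same connection.
type loggedConn struct {
    net.Conn
    id     uint64
    closed int32 // 0 = open, 1 = already closed
}

func (c *loggedConn) Close() error {
    if !atomic.CompareAndSwapInt32(&c.closed, 0, 1) {
        log.Printf("connection #%d closed more than once:\n%s", c.id, debug.Stack())
    }
    return c.Conn.Close()
}

func main() {
    c1, _ := net.Pipe()
    lc := &loggedConn{Conn: c1, id: 1}
    lc.Close()
    lc.Close() // the second close is logged together with its call stack
}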

@fvbock
Owner

fvbock commented Jul 12, 2015

hi @justphil @flamedmg

i tried a bunch of things, but i could not reproduce this so far. i used variations of the example server at https://github.com/fvbock/endless/blob/master/examples/testserver.go: i dropped the delay in the handler and sent 1k, 10k, 100k and 1000k payloads that i created from /dev/urandom.

i tested with the server being restarted while running the test and without.

i did use ab instead of wrk: ab -c 2000 -n 2000000 http://localhost:4242/foo

i am running
Linux 3.19.0-21-generic #21-Ubuntu SMP
go version go1.4.2 linux/amd64

what are you guys running?

@flamedmg

Your testserver is not failing under wrk either. I tried several timeout values.

I'm running Linux Mint 17.1 and go 1.4.2 linux/amd64.

@fvbock
Owner

fvbock commented Jul 13, 2015

that's a start. can you post (some or all of) the code of your test server? what are the differences...? i guess yours is more complex?

@fvbock
Owner

fvbock commented Jul 16, 2015

Your testserver is not failing under wrk either. I tried several timeout values.

@justphil can you see any general difference between the basic https://github.com/fvbock/endless/blob/master/examples/testserver.go and your server code?

@justphil
Author

Sorry @fvbock, @flamedmg, I'm currently very busy with my job. I will take a look at it on the weekend and will post details about my system configuration as well.

BTW, I think I can now see an emerging crash pattern. The first process seems to run "endlessly" without any problems. The problems start to occur after the first hot redeployment, when the parent process passes the listening socket to the child process. From that point on the mentioned problem shows up.

@fvbock
Owner

fvbock commented Jul 23, 2015

@justphil @flamedmg sorry for taking a while again... i tried a few more things but with a server based on the testserver i was not able to reproduce the behaviour both of you observed.
any code that produces it would be helpful at this point.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

I think I found the problem. I added some code in endlessConn.Close to identify the connection being closed and the call stack. It turns out that a connection is closed twice: once from net/http/server.go:274 and once from net/http/server.go:1071.

So I guess whenever a connection gets interrupted while writing, it ends up being closed twice. But it doesn't crash the app immediately. The crash happens when the last few connections (depending on how many times it happened) are about to get closed. Here are the stack traces of both closing actions. As you can see, they happened almost at the same time.

2015/08/11 07:22:59 Closing connection #379
/root/repo/go/src/github.com/fvbock/endless/endless.go:514 (0x4c0c0f)
<autogenerated>:16 (0x4c1de2)
/usr/lib/go/src/net/http/server.go:274 (0x493eed)
/usr/lib/go/src/bufio/bufio.go:562 (0x50e375)
/usr/lib/go/src/net/http/server.go:1005 (0x498d13)
/usr/lib/go/src/net/http/server.go:977 (0x498a77)
<autogenerated>:47 (0x4b1c29)
/usr/lib/go/src/io/io.go:364 (0x4c6af8)
/usr/lib/go/src/net/http/server.go:391 (0x494a39)
/usr/lib/go/src/bufio/bufio.go:433 (0x50da94)
/usr/lib/go/src/io/io.go:354 (0x4c6932)
/root/repo/go/src/bs2proxy/proxy.go:221 (0x407179)
/root/repo/go/src/bs2proxy/controller.go:273 (0x404537)
/root/repo/go/src/bs2proxy/main.go:33 (0x40afa5)
/usr/lib/go/src/net/http/server.go:1265 (0x49a2b1)
/usr/lib/go/src/net/http/server.go:1541 (0x49ba1d)
/usr/lib/go/src/net/http/server.go:1703 (0x49c33a)
/usr/lib/go/src/net/http/server.go:1204 (0x499e07)
/usr/lib/go/src/runtime/asm_amd64.s:2232 (0x448f51)

In between, I got an error from io.Copy complaining about "broken pipe", which pretty much explains what caused the closing action.

2015/08/11 07:22:59 Closing connection #379
/root/repo/go/src/github.com/fvbock/endless/endless.go:514 (0x4c0c0f)
<autogenerated>:16 (0x4c1de2)
/usr/lib/go/src/net/http/server.go:1071 (0x4990b0)
/usr/lib/go/src/net/http/server.go:1134 (0x4ac0fb)
/usr/lib/go/src/net/http/server.go:1217 (0x499ab9)
/usr/lib/go/src/runtime/asm_amd64.s:2232 (0x448f51)
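
One way to make the connection wrapper robust against such a double close is to make the WaitGroup decrement idempotent, for example with sync.Once. A minimal sketch of the idea (illustrative only; the names are made up and this is not necessarily the fix that was eventually merged):

package main

import (
    "net"
    "sync"
)

// idempotentConn guards the WaitGroup decrement with sync.Once so that a
// second Close (as seen in the traces above) cannot drive the counter negative.
type idempotentConn struct {
    net.Conn
    wg   *sync.WaitGroup
    once sync.Once
}

func (c *idempotentConn) Close() error {
    c.once.Do(c.wg.Done) // Done runs at most once, however often Close is called
    return c.Conn.Close()
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    c1, _ := net.Pipe()
    c := &idempotentConn{Conn: c1, wg: &wg}
    c.Close()
    c.Close() // the double close no longer panics
    wg.Wait()
}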

@flamedmg

@ledzep2 That's exactly what I've found, but I can't reproduce it in a small sample app, just in my pretty large code base.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

@flamedmg Did you try manually interrupting the connection while transferring data (like killall -9 wrk)? Theoretically that should do the trick.

@flamedmg

No, I did not terminate the process in any way; please check my earlier messages in this thread. I found the issue during load testing. What that tool does is open a specified number of keep-alive connections and make requests; after that it closes them. During closing I found that some connections are closed two or even more times. That makes the open-connection counter negative and the library fails. I'm still waiting on a solution while our app is not yet in production. When it goes to production, I think I will do what @justphil did - basically catch and silence all panics in that part of the code. Ugly, but it will work - or I will work on rewriting that logic.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

@flamedmg I read your previous posts. I'm trying to locate the problem in the source code and reproduce it. I'll report back later.

@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

@justphil @flamedmg I reproduced it, folks. Replace the handler in testserver.go with the following (make sure bytes and io are imported):

func handler(w http.ResponseWriter, r *http.Request) {
    // stream a 50 MB zero-filled buffer so the client can interrupt mid-transfer
    buf := make([]byte, 1000*1000*50)
    br := bytes.NewReader(buf)
    io.Copy(w, br)
}

Then wget /foo and interrupt it with Ctrl+C once the download begins. It crashes every time.

ledzep2 added a commit to ledzep2/endless that referenced this issue Aug 11, 2015
@ledzep2
Contributor

ledzep2 commented Aug 11, 2015

I created a pull request for this with the commit above. #11

fvbock added a commit that referenced this issue Aug 11, 2015
@fvbock
Owner

fvbock commented Aug 11, 2015

with the handler @ledzep2 posted i was also able to reproduce the problem, and #11 fixed it. i merged the PR.

@justphil @flamedmg can you confirm that it fixes the problem in your scenarios too?

@fvbock
Owner

fvbock commented Aug 15, 2015

@justphil @flamedmg closing this for now - please let me know if you still experience problems.

@fvbock fvbock closed this as completed Aug 15, 2015
@justphil
Author

Thank you for the further investigation @fvbock @ledzep2 and @flamedmg.
Will try the fix in our staging environment and report back if there are still problems.

Bye, Phil

@fzerorubigd
Contributor

I think this still happens. See #13.
