Deadlock at disconnect #126

Closed
troian opened this issue Jul 18, 2017 · 17 comments

Comments

@troian

troian commented Jul 18, 2017

Pretty often I see the client hang at the disconnect stage.
Here is the log:

[net] done putting msg on incomingPubChan
[net] putting puback msg on obound
[net] obound priority msg to write, type *packets.DisconnectPacket
[net] outbound wrote disconnect, stopping
[pinger] keepalive stopped
[net] incoming stopped

You might notice that there is no "done putting puback msg on obound" message, because the oboundP channel is full and the outgoing worker has finished its job before alllogic.
A quick workaround is to increase the channel depth via SetMessageChannelDepth, but that doesn't help when there is extensive message exchange.
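(For reference, a rough sketch of that workaround against the paho.mqtt.golang client options API; the broker URL and client ID below are placeholders, and raising the depth only makes the hang less likely, it does not remove the race.)

```go
package main

import mqtt "github.com/eclipse/paho.mqtt.golang"

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetClientID("depth-workaround").            // placeholder client ID
		// A larger depth makes it less likely that the outbound channels fill
		// up before the outgoing worker exits; it narrows the window, it does
		// not close it.
		SetMessageChannelDepth(1000)

	c := mqtt.NewClient(opts)
	if token := c.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	// ... publish/subscribe as usual ...
	c.Disconnect(250)
}
```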

@troian
Author

troian commented Jul 18, 2017

At first sight it seems worth removing the return at net.go:203, which would give alllogic a chance to finish properly without deadlocking, but that might break some internal use case.

@ghost

ghost commented Jul 20, 2017

Could you have a look at #122? I think it's the same issue.

Another alternative could be a cascading teardown of the goroutines, so that a stop gets propagated to all 4 goroutines being waited on (keepalive optional).
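(Not the library's code, just a generic Go sketch of the cascading idea: a single stop signal shuts down the first worker, which closes the channel feeding the next one on its way out, so no goroutine is left blocked on a channel whose other end has already exited.)

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical workers wired so that shutdown cascades: closing `stop`
// stops the producer, which closes `out`, which in turn lets the consumer
// drain the remaining messages and exit. Nothing blocks forever.
func main() {
	stop := make(chan struct{})
	out := make(chan string, 1)
	var wg sync.WaitGroup

	wg.Add(2)
	go func() { // producer (stands in for the outgoing side)
		defer wg.Done()
		defer close(out) // cascade: tell the consumer we're done
		for i := 0; ; i++ {
			select {
			case <-stop:
				return
			case out <- fmt.Sprintf("msg %d", i):
			}
		}
	}()
	go func() { // consumer drains until the producer closes `out`
		defer wg.Done()
		for m := range out {
			_ = m
		}
	}()

	close(stop) // single stop signal; teardown cascades from here
	wg.Wait()   // returns instead of deadlocking
}
```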

@troian
Author

troian commented Jul 20, 2017

@pshirali #122 looks like the same issue. +1 for cascading shutdown.

@troian
Author

troian commented Jul 20, 2017

Closing as a dup of #122.

@troian troian closed this as completed Jul 20, 2017
@troian
Author

troian commented Aug 10, 2017

Reopening, as the deadlock seems to persist on the recent codebase.
See the attached goroutine dump: goroutines.txt

@troian troian reopened this Aug 10, 2017
@ghost

ghost commented Aug 11, 2017

I have the same problem: a hang in the disconnect() method of client.go, at c.workers.Wait().
I use the iot.eclipse.org broker and send 1 message per second. It hangs randomly, anywhere from minutes to hours after starting.
I have bypassed the problem for now by using a single connection for publications.

@troian
Author

troian commented Aug 11, 2017

@sdariz if possible for your implementation, you can temporarily set the keep alive to 0 as a workaround.
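(A minimal sketch of that workaround, assuming the usual ClientOptions API; the client ID is a placeholder. As far as I can tell the keepalive/pinger goroutine isn't started at all when keep alive is 0, but the broker then can't detect half-open connections either, so treat it as a stopgap.)

```go
package main

import mqtt "github.com/eclipse/paho.mqtt.golang"

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://iot.eclipse.org:1883"). // broker mentioned above
		SetClientID("keepalive-workaround").     // placeholder client ID
		SetKeepAlive(0)                          // 0 should skip the pinger entirely

	c := mqtt.NewClient(opts)
	if token := c.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	defer c.Disconnect(250)
	// ... publish 1 msg/sec as before ...
}
```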

@ghost

ghost commented Aug 11, 2017

@troian Thanks, I'm going to try it.

@alsm
Contributor

alsm commented Aug 11, 2017

Thanks for reporting it again. I think I was getting far too complicated with the keepalive and timers; I'm trying something much simpler that should be better.

alsm added a commit that referenced this issue Aug 11, 2017
Cond vars, multiple timers, locks, handling resets. It'd gotten too complex and unwieldy, and worse, didn't seem to work reliably.
There is now a simple calculation for sending a ping and determining if we haven't received one.
There is a side effect that we actually wait the longer of PingTimeout or keepalive/2 (when keepalive <= 10 seconds) before triggering that we haven't seen an expected pingresp.
Finally, hopefully the resolution to #126
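(Not the commit's actual code, just a rough sketch of the kind of timing rule the message above describes, with the grace period taken as the longer of PingTimeout and keepalive/2; the real logic lives in the library's ping handling and differs in detail.)

```go
package main

import (
	"fmt"
	"time"
)

// pingDue is an illustrative check only: send a ping once a full keepalive
// interval has passed since we last sent anything, and declare the pingresp
// missing only after an extra grace period equal to the longer of
// pingTimeout and keepalive/2.
func pingDue(lastSent, lastReceived time.Time, keepalive, pingTimeout time.Duration) (sendPing, timedOut bool) {
	now := time.Now()
	sendPing = now.Sub(lastSent) >= keepalive

	grace := pingTimeout
	if keepalive/2 > grace {
		grace = keepalive / 2
	}
	timedOut = now.Sub(lastReceived) >= keepalive+grace
	return sendPing, timedOut
}

func main() {
	last := time.Now().Add(-50 * time.Second)
	send, dead := pingDue(last, last, 30*time.Second, 10*time.Second)
	fmt.Println("send ping:", send, "timed out:", dead) // send ping: true timed out: true
}
```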
@alsm
Contributor

alsm commented Aug 11, 2017

Please test the above; if it's good I'll do a point release on Monday.

@troian
Author

troian commented Aug 12, 2017

Looks good to me; at least no hangs in my tests.

@ghost

ghost commented Aug 12, 2017

Same as troian. No hangs so far. Seems OK.

@ghost

ghost commented Aug 17, 2017

I currently haven't integrated your latest changes (e020008), as there was a national holiday here in India. I'm going to integrate your changes today and report by tomorrow or the day after.

I had previously integrated aff1577. I did find some hangs on a long-running instance (in the goroutines spawned inside keepalive), but that code doesn't exist in e020008. So, hopefully, there may be no hangs now.

@alsm
Contributor

alsm commented Aug 23, 2017

Any feedback on this, @pshirali?

@ghost

ghost commented Aug 23, 2017

I've run 50 instances of my program for 6 days now, with a disconnect-reconnect happening every 40 seconds. No issues! Great work @alsm.

You might also want to check out a comment at the bottom of this commit, though.

@daenney

daenney commented Aug 31, 2017

I just upgraded to master and get similar results to @pshirali. We had some behaviour that very much looked like a deadlock; it always happened after mosquitto logged a socket error and saw the client disconnect, but the client never came back. We've run master instead of v1.1.0 for the past 24 hours and so far none of it. Interestingly enough, the "socket error"s on mosquitto's side seem to have disappeared as well, whereas before we'd see about 10 of those over the course of 24 hours.

@alsm
Contributor

alsm commented Sep 8, 2017

I'm going to call this done then. Thanks to everyone for the details and feedback you provided.

@alsm alsm closed this as completed Sep 8, 2017