Deadlock at disconnect #126

Closed
troian opened this issue Jul 18, 2017 · 17 comments

Comments

@troian

troian commented Jul 18, 2017

Pretty often I see the client hang at the disconnect stage.
Here is the log:

[net] done putting msg on incomingPubChan
[net] putting puback msg on obound
[net] obound priority msg to write, type *packets.DisconnectPacket
[net] outbound wrote disconnect, stopping
[pinger] keepalive stopped
[net] incoming stopped

You might notice that there is no "done putting puback msg on obound" message, because the oboundP channel is full and the outgoing worker has finished its job before alllogic.
A quick workaround is to increase the channel depth via SetMessageChannelDepth, but that doesn't help when there is extensive message exchange.
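(For reference, a rough sketch of that workaround against the paho.mqtt.golang client options API; the broker URL and client ID below are placeholders, and raising the depth only makes the hang less likely, it does not remove the race.)

```go
package main

import mqtt "github.com/eclipse/paho.mqtt.golang"

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetClientID("depth-workaround").            // placeholder client ID
		// A larger depth makes it less likely that the outbound channels fill
		// up before the outgoing worker exits; it narrows the window, it does
		// not close it.
		SetMessageChannelDepth(1000)

	c := mqtt.NewClient(opts)
	if token := c.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	// ... publish/subscribe as usual ...
	c.Disconnect(250)
}
```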

@troian
Author

troian commented Jul 18, 2017

At first sight it seems worth removing the return at net.go:203, which would give alllogic a chance to finish properly without deadlocking, but that might break some internal use case.

@ghost

ghost commented Jul 20, 2017

Could you have a look at #122? I think it's the same issue.

Another alternative could be a cascading teardown of the goroutines, so that a stop gets propagated to all 4 goroutines being waited on (keepalive optional).
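(Not the library's code, just a generic Go sketch of the cascading idea: a single stop signal shuts down the first worker, which closes the channel feeding the next one on its way out, so no goroutine is left blocked on a channel whose other end has already exited.)

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical workers wired so that shutdown cascades: closing `stop`
// stops the producer, which closes `out`, which in turn lets the consumer
// drain the remaining messages and exit. Nothing blocks forever.
func main() {
	stop := make(chan struct{})
	out := make(chan string, 1)
	var wg sync.WaitGroup

	wg.Add(2)
	go func() { // producer (stands in for the outgoing side)
		defer wg.Done()
		defer close(out) // cascade: tell the consumer we're done
		for i := 0; ; i++ {
			select {
			case <-stop:
				return
			case out <- fmt.Sprintf("msg %d", i):
			}
		}
	}()
	go func() { // consumer drains until the producer closes `out`
		defer wg.Done()
		for m := range out {
			_ = m
		}
	}()

	close(stop) // single stop signal; teardown cascades from here
	wg.Wait()   // returns instead of deadlocking
}
```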

@troian
Author

troian commented Jul 20, 2017

@pshirali #122 looks like the same issue. +1 for cascading shutdown.

@troian
Author

troian commented Jul 20, 2017

Closing as a dup of #122.

@troian troian closed this as completed Jul 20, 2017
@troian
Author

troian commented Aug 10, 2017

Reopening, as the deadlock seems to persist on the recent codebase.
See the attached goroutine dump: goroutines.txt

@troian troian reopened this Aug 10, 2017
@ghost

ghost commented Aug 11, 2017

I have the same problem: a hang in the disconnect() method of client.go, at c.workers.Wait().
I use the iot.eclipse.org broker and send 1 message per second. It hangs randomly, anywhere from minutes to hours after starting.
I have bypassed the problem for now by using a single connection for publications.

@troian
Author

troian commented Aug 11, 2017

@sdariz if possible for your implementation, you can temporarily set the keep alive to 0 as a workaround.
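(A minimal sketch of that workaround, assuming the usual ClientOptions API; the client ID is a placeholder. As far as I can tell the keepalive/pinger goroutine isn't started at all when keep alive is 0, but the broker then can't detect half-open connections either, so treat it as a stopgap.)

```go
package main

import mqtt "github.com/eclipse/paho.mqtt.golang"

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://iot.eclipse.org:1883"). // broker mentioned above
		SetClientID("keepalive-workaround").     // placeholder client ID
		SetKeepAlive(0)                          // 0 should skip the pinger entirely

	c := mqtt.NewClient(opts)
	if token := c.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	defer c.Disconnect(250)
	// ... publish 1 msg/sec as before ...
}
```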

@ghost

ghost commented Aug 11, 2017

@troian Thanks, I'm going to try it.

@alsm
Contributor

alsm commented Aug 11, 2017

Thanks for reporting it again. I think I was getting far too complicated with the keepalive and timers; I'm trying something much simpler that should be better.

alsm added a commit that referenced this issue Aug 11, 2017
Cond vars, multiple timers, locks, handling resets. It'd gotten too complex and unwieldy, and worse, didn't seem to work reliably.
There is now a simple calculation for sending a ping and determining if we haven't received one.
There is a side effect that we actually wait the longer of PingTimeout or keepalive/2 (when keepalive <= 10 seconds) before triggering that we haven't seen an expected pingresp.
Finally, hopefully the resolution to #126
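(Not the commit's actual code, just a rough sketch of the kind of timing rule the message above describes, with the grace period taken as the longer of PingTimeout and keepalive/2; the real logic lives in the library's ping handling and differs in detail.)

```go
package main

import (
	"fmt"
	"time"
)

// pingDue is an illustrative check only: send a ping once a full keepalive
// interval has passed since we last sent anything, and declare the pingresp
// missing only after an extra grace period equal to the longer of
// pingTimeout and keepalive/2.
func pingDue(lastSent, lastReceived time.Time, keepalive, pingTimeout time.Duration) (sendPing, timedOut bool) {
	now := time.Now()
	sendPing = now.Sub(lastSent) >= keepalive

	grace := pingTimeout
	if keepalive/2 > grace {
		grace = keepalive / 2
	}
	timedOut = now.Sub(lastReceived) >= keepalive+grace
	return sendPing, timedOut
}

func main() {
	last := time.Now().Add(-50 * time.Second)
	send, dead := pingDue(last, last, 30*time.Second, 10*time.Second)
	fmt.Println("send ping:", send, "timed out:", dead) // send ping: true timed out: true
}
```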
@alsm
Contributor

alsm commented Aug 11, 2017

Please test the above; if it's good I'll do a point release on Monday.

@troian
Author

troian commented Aug 12, 2017

Looks good to me; at least no hangs in my tests.

@ghost

ghost commented Aug 12, 2017

Same as troian. No hangs so far. Seems OK.

@ghost

ghost commented Aug 17, 2017

I currently haven't integrated your latest changes (e020008), as there was a national holiday here in India. I'm going to integrate your changes today and report by tomorrow or the day after.

I had previously integrated aff1577. I did find some hangs on a long-running instance (in the goroutines spawned inside keepalive), but that code doesn't exist in e020008. So, hopefully, there may be no hangs now.

@alsm
Contributor

alsm commented Aug 23, 2017

Any feedback on this, @pshirali?

@ghost

ghost commented Aug 23, 2017

I've run 50 instances of my program for 6 days now, with a disconnect-reconnect happening every 40 seconds. No issues! Great work @alsm.

You might also want to check out a comment at the bottom of this commit, though.

@daenney

daenney commented Aug 31, 2017

I just upgraded to master and get similar results to @pshirali. We had some behaviour that very much looked like a deadlock; it always happened after mosquitto logged a socket error and saw the client disconnect, but the client never came back. We've run master instead of v1.1.0 for the past 24 hours and so far none of it. Interestingly enough, the "socket error"s on mosquitto's side seem to have disappeared as well, whereas before we'd see about 10 of those over the course of 24 hours.

@alsm
Contributor

alsm commented Sep 8, 2017

I'm going to call this done then. Thanks to everyone for the details and feedback you provided.

@alsm alsm closed this as completed Sep 8, 2017