Panic on connection lost / reconnecting #488

Closed

rockwithamoon opened this issue Feb 23, 2021 · 5 comments

@rockwithamoon

In a program with many clients (in the thousands), when there is a network error with the broker, there is sometimes a panic on closing a channel:

... panic: close of closed channel
goroutine 23171 [running]:
github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers.func2(0xc0229ca288, 0xc0229ca290, 0xc0253d40c0, 0xc01a0dae00)
        /go/src/bitbucket.org/xxx/xxx/vendor/github.com/eclipse/paho.mqtt.golang/client.go:604 +0x365
created by github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers
        /go/src/bitbucket.org/xxx/xxx/vendor/github.com/eclipse/paho.mqtt.golang/client.go:579 +0x4b3

Sometimes the OnConnectionLost handler is called before the panic (but not always, and not every lost connection leads to a panic):

connection lost: EOF

This happens with v1.3.1 and v1.3.2, which were the versions tested.

Occasionally the program will call client.IsConnected() && client.Disconnect(). Is there a concurrency issue with Disconnect()? The program will not call Disconnect() twice, but it is possible for a network error to occur during Disconnect() while AutoReconnect is set to true. Is there any way to avoid this?
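For reference, this is roughly the pattern in use (broker URL and client ID below are placeholders, not our real values):

```go
package main

import (
	"log"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetClientID("client-0001").                 // placeholder ID
		SetAutoReconnect(true)

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}

	// ... normal operation ...

	if client.IsConnected() {
		client.Disconnect(250) // a network error racing with this call is the suspect
	}
}
```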

@MattBrittan
Contributor

MattBrittan commented Feb 24, 2021

Quick Analysis

The panic is due to closing the already-closed channel c.commsStopped. The only places this channel is accessed are:

startCommsWorkers:
578: c.commsStopped = make(chan struct{})
604: close(c.commsStopped)

stopCommsWorkers:
643: <-c.commsStopped // wait for comms routine to stop

The use of this channel is pretty well contained, so the only thing I can think of that would explain the panic is that startCommsWorkers is being called before the goroutine started by the previous call has completed.

startCommsWorkers is called from within Connect and reconnect. reconnect is only called from within internalConnLost (and that calls stopCommsWorkers, which waits for the channel to close unless it's already closed).
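To make the suspected interleaving concrete, here is a standalone sketch (simplified, not the library's actual code) in which a second startCommsWorkers call replaces the channel before the goroutine from the first call has closed it:

```go
package main

import "time"

type client struct {
	commsStopped chan struct{}
}

func (c *client) startCommsWorkers() {
	c.commsStopped = make(chan struct{}) // cf. client.go:578
	go func() {
		time.Sleep(10 * time.Millisecond) // stands in for the comms loop
		close(c.commsStopped)             // cf. client.go:604
	}()
}

func main() {
	c := &client{}
	c.startCommsWorkers()
	c.startCommsWorkers() // second call before the first goroutine has finished

	// Both goroutines read the field after it was reassigned, so both close
	// the second channel -> panic: close of closed channel.
	time.Sleep(time.Second)
}
```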

Hypothesis

There is an issue here; the situation of calling Disconnect when c.options.AutoReconnect = true is not catered for: when Disconnect() is called, it sets the status to disconnected immediately and then closes the connection. This may (in fact, will) trigger an error, resulting in a call to internalConnLost. If c.options.AutoReconnect is true, this will trigger a reconnection attempt.

So what I suspect is happening is:

  • Disconnect is called (triggering startCommsWorkers() in a roundabout way, as per the above).
  • Your code calls Connect; startCommsWorkers is now running twice, which could lead to the error you are seeing.

Does this match what your code is doing (i.e. are you calling Connect on the same client after the Disconnect completes)? Much of the code in Client was written without consideration for reuse (e.g. after calling Disconnect, throw away the client and reconnect with a new one); I believe I have addressed most of the issues preventing reuse, but it's not something a lot of testing effort has gone into (I think all of the tests call NewClient after Disconnect if they want to reconnect).

I can see a few options to fix this; my initial thought is to set c.options.AutoReconnect = false when Disconnect() is called (if you are asking for the connection to be dropped, then you don't want it to reconnect). However, this may not be what a user would expect. While it would be possible to use status for this, I'm reluctant to do so because that variable is already problematic (it's one thing I did not attempt to fix in my rewrite of the comms code, and its thread safety is a bit dubious).
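As a toy model of the general idea (hypothetical names and deliberately simplified state, not the real client code), the connection-lost path would check whether the disconnect was deliberate before reconnecting:

```go
package main

import (
	"fmt"
	"sync"
)

const (
	disconnected = iota
	connected
)

type client struct {
	mu            sync.Mutex
	status        int
	autoReconnect bool
}

// Disconnect marks the drop as deliberate before tearing down the connection.
func (c *client) Disconnect() {
	c.mu.Lock()
	c.status = disconnected
	c.mu.Unlock()
	c.connLost() // closing the network connection still surfaces an error
}

// connLost only schedules a reconnect after an unexpected drop.
func (c *client) connLost() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.autoReconnect && c.status == connected {
		fmt.Println("reconnecting")
		return
	}
	fmt.Println("staying disconnected")
}

func main() {
	c := &client{status: connected, autoReconnect: true}
	c.Disconnect() // prints "staying disconnected" despite autoReconnect
}
```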

@rockwithamoon
Author

rockwithamoon commented Feb 24, 2021

Thank you for the quick response.

Yes, I"m calling Connect() again after Disconnect() with AutoReconnect = true. This may seem weird but we have specific requirements that I need to address. But now come to think of it, Disconnect() / Connect() might not have been the best way but seemed the simpler one.

I guess one option for me would be to simply drop the client after Disconnect() and re-create it with NewClient(). I would prefer to use the same ClientId, so as to avoid Unsubscribe() and possibly other issues. Do you see any problem with that?

MattBrittan added a commit to ChIoT-Tech/paho.mqtt.golang that referenced this issue Feb 24, 2021
…ct could trigger another connection attempt.

Signed-off-by: Matt Brittan <matt@brittan.nz>
@MattBrittan
Contributor

I have merged a fix for this - please see if master fixes your issue (go get github.com/eclipse/paho.mqtt.golang@master). While I'm not keen on using status for this, I think the risk is fairly low in this case and it's the simplest option (and it fixes a problem others may encounter).

I guess one option for me would be to simply drop the client after Disconnect() and re-create it with NewClient(). I would prefer to use the same ClientId, so as to avoid Unsubscribe() and possibly other issues. Do you see any problem with that?

The only thing to watch out for is that the new client will not remember that subscriptions are in place (so data received from the broker may be ignored). Use Subscribe or AddRoute to tell it how messages on the new connection should be directed.
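Something along these lines should work (broker URL, ClientId and topic below are placeholders):

```go
package main

import (
	"fmt"
	"log"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// newClient builds a fresh client; the OnConnect handler re-subscribes so the
// new connection knows how incoming messages should be directed.
func newClient() mqtt.Client {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883").
		SetClientID("client-0001"). // reusing the previous ClientId
		SetAutoReconnect(true).
		SetOnConnectHandler(func(c mqtt.Client) {
			c.Subscribe("sensors/#", 1, func(_ mqtt.Client, m mqtt.Message) {
				fmt.Printf("%s: %s\n", m.Topic(), m.Payload())
			})
		})
	return mqtt.NewClient(opts)
}

func main() {
	client := newClient()
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}

	// ... later, when the application wants to drop the connection ...
	client.Disconnect(250)

	client = newClient() // discard the old client entirely
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}
}
```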

@rockwithamoon
Author

I have had it running for 12 hours now and it seems fine; I'll keep you posted if anything changes.

@MattBrittan
Contributor

Closing this because it appears that the fix works.
