Skip to content
This repository has been archived by the owner on Aug 28, 2019. It is now read-only.

Data stream stops, cannot close connection. Possible deadlock? #10

Closed
rcolombo opened this issue May 29, 2018 · 3 comments
Closed

Data stream stops, cannot close connection. Possible deadlock? #10

rcolombo opened this issue May 29, 2018 · 3 comments

Comments

@rcolombo
Copy link

rcolombo commented May 29, 2018

I'm following your Bittrex example and I've noticed that after a random period of time (usually 5+ hours), my signalr client's MsgHandler() function will stop receiving data.

I coded in a workaround which monitors the amount of data coming through my system, and if I stop getting data, I'll run the Close() function on my existing connection, then I'll build a new one and start over.

When testing, this works perfectly fine. However my tests don't actually reproduce the issue of data not being pushed through the stream. When I run my workaround in production when the issue occurs, the Close() function ends up hanging indefinitely, and I'm never able to build a new connection. This leads me to believe that there is a deadlock somewhere in this package.

Since the Close() function just pushes an item into a channel (https://github.com/carterjones/signalr/blob/master/signalr.go#L671), and that channel is only read from in the readMessage() function (https://github.com/carterjones/signalr/blob/master/signalr.go#L551), I think that I'm hitting a deadlock in readMessage() which is preventing my connection from being closed. Also - it's likely the source of my error in the first place.

Apologies if the above is too vague, but I wanted to at least drop an initial report here to see if you had any thoughts. My next step is going to be to dig into this package a big deeper and try to debug.

@carterjones
Copy link
Owner

Thanks for the extremely detailed bug report! I really appreciate it. I'm quite busy at the moment, but if I get a chance to investigate this in the next few weeks, I will.

I think you might be onto something with that close channel. There is quite possibly a race condition/deadlock lying in there somewhere. The readMessage() function is a bit confusing. I might have messed up the logic in there somehow.

carterjones added a commit that referenced this issue Jun 16, 2018
@carterjones
Copy link
Owner

carterjones commented Jun 16, 2018

I tried to trace the code flow that might lead to this bug. Here's what I found:

  1. readMessage
    case err := <-errs
         ok = c.processReadMessagesError(err, errhandler)
  2. processReadMessagesError
    case 1006:
        ok = c.attemptReconnect()
  3. attemptReconnect (up to MaxReconnectRetries times, default: 5)
    _, err := c.Reconnect()
  4. Reconnect
    conn, err := c.xconnect(u.String(), true)
  5. xconnect
    for i := 0; i < retryCount; i++ { // MaxReconnectRetries, default: 5
       conn, resp, err = dialer.dial(url, header)
       // https://godoc.org/github.com/gorilla/websocket#pkg-variables:
       // var DefaultDialer = &Dialer{
       //    Proxy:            http.ProxyFromEnvironment,
       //    HandshakeTimeout: 45 * time.Second, <----------
       // }
       // ...
       // First possible sleep
       if resp == nil {
           time.Sleep(c.RetryWaitDuration) // default: 1 minute
       // Second possible sleep
       case 503:
           time.Sleep(c.RetryWaitDuration) // default: 1 minute

So potentially, the third step might be spawing 5 instances of the following a
subloop that attempts 5 retries (so we're up to potentially 25 retries) of a
dial attempt, which could take up to 45 seconds, followed by one of two things:

  1. a 1 minute sleep if the response is nil or
  2. a 1 minute sleep if the response is a 503.

This means that in the worst case scenario, this could take 43.75 minutes (25 *
1.75 minutes) prior to the Close() function being completed. This seems to
explain the behavior that is seen where calling the Close() function ends up
doing nothing.

I still have to add some more tests since new code paths have been added, but
hopefully, this will fix the issue you were experiencing. Once I've written
tests, I'll merge it into master. Until then, you can test the code if you want
by checking out the bugfix/issue10 branch.

@carterjones
Copy link
Owner

This should be fixed in master as of v0.3.3 (7289a89). If the issue is not fixed for you, please feel free to comment and I'll re-open.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants