Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unhandled server-side disconnects #61

Open
dblock opened this Issue Aug 19, 2018 · 27 comments

Comments

Projects
None yet
3 participants
@dblock
Copy link

dblock commented Aug 19, 2018

Coming from slack-ruby/slack-ruby-client#208 and celluloid/celluloid-io#187, we're seeing the Slack server drop connections and the client not noticing and being stuck in a run loop.

There's some chance this is a bug here or misunderstanding in the way we use websocket-ruby-driver. Initially I thought this was celluloid-io, but I think this is the driver not handling something correctly.

This currently affects many thousands of users as the slack-ruby-client library is widely used in production and we haven't had much luck even when they looked at it from their side via @aoberoi.

Questions:

  • Could someone please take a look at this code to see if there's an obvious bug in the run loop?
  • Can someone please share debugging code to translate the opaque binary messages read to add here?
  • Any other suggestions?
@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 19, 2018

There are two kinds of disconnects - the web socket disconnect event, and the socket read returning nil or throwing an exception (e.g. EPIPE).

Are you detecting the later cases?

https://github.com/slack-ruby/slack-ruby-client/blob/d8b7b267377c4ef97742de2919ee328249e10b49/lib/slack/real_time/concurrency/celluloid.rb#L37-L39

Are you sure that's covering all cases?

Is this returning nil?

https://github.com/slack-ruby/slack-ruby-client/blob/d8b7b267377c4ef97742de2919ee328249e10b49/lib/slack/real_time/concurrency/celluloid.rb#L55

Are you handling that case?

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 19, 2018

Thanks @ioquatix!

I didn't realize the socket read can return nil meaning something like a disconnect, just to confirm, does that mean it's disconnected and is that a special kind of disconnect (we want to distinguish between client closing and server throwing the client out)? Looks like we're not handling that ...

Help me write the correct code here? What should it look like on nil read?

I would have expected to never need this code (and the driver to emit this in all the right cases):

driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection')) unless @closing

Are we doing it wrong?

Will add logging to see if we're getting a nil read too.

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 19, 2018

Ruby sockets are a bit of a mess. A read of length 0 gives nil return. You also need to handle empty string, i.e. read(x > 0) -> nil | "" indicates disconnected. Try printing out the result of that line to see what you get.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 19, 2018

I kinda expected the driver to do this for me as it emits other kinds of disconnects. Am I asking for too much?

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 19, 2018

Yes. Do I think it's annoying? EPIPE, EOFError, nil, "". Yes, it's a mess.

Here is an implementation that works:

https://github.com/socketry/async-websocket/blob/f946504fc21a62c3546aa5efa3fd3443dbe6a931/lib/async/websocket/connection.rb#L56-L72

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 20, 2018

This doesn't look like the problem :(

I added this code to log all messages, handle and show nil or empty buffers.

I never see one.

I, [2018-08-20T16:21:42.177568 #214]  INFO -- : Starting team game=pool, name=Team, id=ID.
I, [2018-08-20T16:21:43.122604 #214]  INFO -- : [69912811601720, :server_open]
I, [2018-08-20T16:21:43.123755 #214]  INFO -- : [69912811601720, :server_message, "{\"type\": \"hello\"}"]
I, [2018-08-20T16:21:43.124090 #214]  INFO -- : [69912811601720, "Successfully connected team Team (ID) to https://....slack.com."]
I, [2018-08-20T16:21:52.980706 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:22:33.646281 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"user_change\", ...}"]
I, [2018-08-20T16:23:21.760776 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"user_change\", ...}"]
I, [2018-08-20T16:23:52.985827 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:23:58.959916 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"user_change\", ...}"]
I, [2018-08-20T16:25:52.973440 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:38:14.770694 #214]  INFO -- : [69912811601720, :server_ping, "", "repeated 7 times"]
I, [2018-08-20T16:39:53.168919 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:53:00.228227 #214]  INFO -- : [69912811601720, :server_ping, "", "repeated 7 times"]
I, [2018-08-20T19:15:20.998797 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"dnd_updated_user\", ..."]
I, [2018-08-20T19:15:53.211556 #214]  INFO -- : [69912811601720, :server_ping, ""]
W, [2018-08-20T19:47:40.985025 #214]  WARN -- : DOWN: game=pool, name=Team, id=ID

That last down is from a ping thread that watches slack status and restarts a bot when slack reports it's down.

I noticed in your implementation you implement next_event and keep a queue to process. This is a bit different from our code that loops over socket.readpartial and asynchronously lets the driver.parse on that returned data which I believe triggers the events. Could we be skipping messages because too slow? Doesn't seem so but I am out of ideas.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 20, 2018

I also have a piece of ping code that can identify a socket/driver stuck in this disconnected state. I can examine anything on the object when that happens, ie. an instance of ::WebSocket::Driver or Celluloid::IO::SSLSocket, if something could be helpful?

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 20, 2018

I wonder if there are bugs in Celluloid::IO::SSLSocket. Maybe check the implementation of readpartial.

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 20, 2018

As you know if it's a bug in Celluloid, it's unlikely to ever get fixed.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 21, 2018

I am seeing more new problems, now with the DNS resolver ... sigh.

What should I be replacing celluloid-io with?

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 21, 2018

I already shower how to use async with slack: https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 21, 2018

Yes thank you @ioquatix for being helpful.

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 21, 2018

Let me know if I can be of further assistance.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Aug 21, 2018

Well given all the money I am paying you why don't you just implement slack-ruby/slack-ruby-client#210 :))) I am KIDDING totally of course. I'll eventually get to it. Much much thanks for being responsive and pitching in your comments, suggestions and thoughts. Let me know how I can be helpful.

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Aug 22, 2018

Hehe. Well, if you want to pay me money to make it work, I won't stop you :)

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Sep 5, 2018

I want to resurrect this since we've implemented async support in slack-ruby/slack-ruby-client#219 and the problem persists.

Can someone help me understand what code in this library handles the ping/pong frames per https://tools.ietf.org/html/rfc6455#section-5.5.2? What's the best way to observe these? What happens if the server decides to go away and how does a :close event get emitted when that happens?

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Sep 6, 2018

The next step to debugging this is to dump all socket read/write events.

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Sep 6, 2018

I also wouldn’t rely on a close event from the websocket driver at worst you won’t receive it.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Sep 6, 2018

I also wouldn’t rely on a close event from the websocket driver at worst you won’t receive it.

It seems that you're saying that the websocket driver implementation can't always notice that it got disconnected? What good is such implementation for?

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Sep 6, 2018

It’s for graceful shutdown of the connection. If the underlying TCP connection fails e.g. timeout or someone trips over the ethernet, there won’t be any high level close event.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Sep 6, 2018

I don't get it. How's a client of the web socket supposed to know someone pulled the cable? How do other language libraries handle this?

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Sep 6, 2018

#read from the underlying socket returns nil, or it throws EOFError or EPIPE.

@aoberoi

This comment has been minimized.

Copy link

aoberoi commented Sep 6, 2018

yep, its not the most usable API, @dblock.

this is actually the entire purpose of sending "ping" messages. it gives both the server and client a mechanism to test whether the connection is alive, rather than depending on a close event that reports the connection is no longer alive.

as a tangent: it's not really the fault of the implementors or designers of the WebsSocket API. even TCP connections (lower level protocol) have a similar problem that when there's no traffic you can't really tell the difference between a dead connection and one where there's just nothing being said. that's why in general we call these ping/pong messages a keep-alive mechanism.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Sep 6, 2018

First, it's nobody's fault, I am just trying to get to the bottom of an issue that's affecting hundreds if not thousands of bots. Everyone who's contributing and helping is amazing, including @aoberoi and @ioquatix.

#read from the underlying socket returns nil, or it throws EOFError or EPIPE.

I see, that makes sense. Not the "best" API, but if a disconnect causes a nil or EOFError or EPIPE it would be fine. It's just not happening. I've never seen a nil returned and in the particular issue described here we're not seeing an exception raised.

this is actually the entire purpose of sending "ping" messages

You two @ioquatix and @aoberoi are contradicting each-other. It seems like @ioquatix is saying the implementation "handles" disconnects and will return nil or raise an error and @aoberoi is saying it won't and is supposed to just sit there and wait on something to happen.

@aoberoi

This comment has been minimized.

Copy link

aoberoi commented Sep 6, 2018

Okay let me back up a step and confirm my understanding of what's happening with you both (and anyone else who finds this issue interesting).

Apps built using slack-ruby-client client are commonly landing into an unresponsive state, where the app (websocket client) believes to be connected but is not receiving any messages.

Currently, we're determining that the connection is unresponsive by periodically sending a request to Slack's Web API method users.getPresence, and testing whether the bot user is online. This is checking for an asymmetry in the client's and the server's understanding of the connection state, which is a reasonable approach. However, my theory is that if we were instead to send a message over the websocket (from that state) the underlying driver would either return nil or raise one of the previously mentioned errors. @dblock, can you test this theory easily? Do you already know it to be untrue?

I don't think @ioquatix and myself are disagreeing (but please correct me if I'm wrong). We're both saying that returning nil or raising an error is as good as this driver can do to inform an application of a certain class of disconnects. I'm just saying that rather than periodically performing a request to users.getPresence, we should instead be periodically sending a message over the websocket, to stimulate the nil or error to occur, otherwise the TCP layer beneath the websocket or the platform (OS) may not have acknowledged that the connection has died.

Again, this is my current understanding, but I admit that I've not read all the code or followed every development along the way, so I'm happy to be corrected.

@ioquatix

This comment has been minimized.

Copy link
Contributor

ioquatix commented Sep 6, 2018

I think that’s a reasonable assessment.

If the TCP connection is still alive but the websocket session is dead, a periodic ping at the websocket layer should detect it. If TCP and websockets are still working but the bot is non-functional for some other high level reason, then checking if the bot is in the channel might be the only solution.

@dblock

This comment has been minimized.

Copy link
Author

dblock commented Sep 7, 2018

I agree, this seems reasonable.

I am going to first get our current situation wrapped up, ie. get to a mergeable state for slack-ruby/slack-ruby-client#219, slack-ruby/slack-ruby-bot#198 and slack-ruby/slack-ruby-bot-server#75 that replaces celluloid with async and force-closes the websocket when the bot goes offline according to slack status.

Next I'll try (unless someone beats me to it :) to build a websocket-level ping monitoring worker into slack-ruby-client itself, slack-ruby/slack-ruby-client#223.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.