Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unhandled server-side disconnects #61

Open
dblock opened this issue Aug 19, 2018 · 47 comments
Open

Unhandled server-side disconnects #61

dblock opened this issue Aug 19, 2018 · 47 comments

Comments

@dblock
Copy link

@dblock dblock commented Aug 19, 2018

Coming from slack-ruby/slack-ruby-client#208 and celluloid/celluloid-io#187, we're seeing the Slack server drop connections and the client not noticing and being stuck in a run loop.

There's some chance this is a bug here or misunderstanding in the way we use websocket-ruby-driver. Initially I thought this was celluloid-io, but I think this is the driver not handling something correctly.

This currently affects many thousands of users as the slack-ruby-client library is widely used in production and we haven't had much luck even when they looked at it from their side via @aoberoi.

Questions:

  • Could someone please take a look at this code to see if there's an obvious bug in the run loop?
  • Can someone please share debugging code to translate the opaque binary messages read to add here?
  • Any other suggestions?
@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 19, 2018

There are two kinds of disconnects - the web socket disconnect event, and the socket read returning nil or throwing an exception (e.g. EPIPE).

Are you detecting the later cases?

https://github.com/slack-ruby/slack-ruby-client/blob/d8b7b267377c4ef97742de2919ee328249e10b49/lib/slack/real_time/concurrency/celluloid.rb#L37-L39

Are you sure that's covering all cases?

Is this returning nil?

https://github.com/slack-ruby/slack-ruby-client/blob/d8b7b267377c4ef97742de2919ee328249e10b49/lib/slack/real_time/concurrency/celluloid.rb#L55

Are you handling that case?

@dblock
Copy link
Author

@dblock dblock commented Aug 19, 2018

Thanks @ioquatix!

I didn't realize the socket read can return nil meaning something like a disconnect, just to confirm, does that mean it's disconnected and is that a special kind of disconnect (we want to distinguish between client closing and server throwing the client out)? Looks like we're not handling that ...

Help me write the correct code here? What should it look like on nil read?

I would have expected to never need this code (and the driver to emit this in all the right cases):

driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection')) unless @closing

Are we doing it wrong?

Will add logging to see if we're getting a nil read too.

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 19, 2018

Ruby sockets are a bit of a mess. A read of length 0 gives nil return. You also need to handle empty string, i.e. read(x > 0) -> nil | "" indicates disconnected. Try printing out the result of that line to see what you get.

@dblock
Copy link
Author

@dblock dblock commented Aug 19, 2018

I kinda expected the driver to do this for me as it emits other kinds of disconnects. Am I asking for too much?

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 19, 2018

Yes. Do I think it's annoying? EPIPE, EOFError, nil, "". Yes, it's a mess.

Here is an implementation that works:

https://github.com/socketry/async-websocket/blob/f946504fc21a62c3546aa5efa3fd3443dbe6a931/lib/async/websocket/connection.rb#L56-L72

@dblock
Copy link
Author

@dblock dblock commented Aug 20, 2018

This doesn't look like the problem :(

I added this code to log all messages, handle and show nil or empty buffers.

I never see one.

I, [2018-08-20T16:21:42.177568 #214]  INFO -- : Starting team game=pool, name=Team, id=ID.
I, [2018-08-20T16:21:43.122604 #214]  INFO -- : [69912811601720, :server_open]
I, [2018-08-20T16:21:43.123755 #214]  INFO -- : [69912811601720, :server_message, "{\"type\": \"hello\"}"]
I, [2018-08-20T16:21:43.124090 #214]  INFO -- : [69912811601720, "Successfully connected team Team (ID) to https://....slack.com."]
I, [2018-08-20T16:21:52.980706 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:22:33.646281 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"user_change\", ...}"]
I, [2018-08-20T16:23:21.760776 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"user_change\", ...}"]
I, [2018-08-20T16:23:52.985827 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:23:58.959916 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"user_change\", ...}"]
I, [2018-08-20T16:25:52.973440 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:38:14.770694 #214]  INFO -- : [69912811601720, :server_ping, "", "repeated 7 times"]
I, [2018-08-20T16:39:53.168919 #214]  INFO -- : [69912811601720, :server_ping, ""]
I, [2018-08-20T16:53:00.228227 #214]  INFO -- : [69912811601720, :server_ping, "", "repeated 7 times"]
I, [2018-08-20T19:15:20.998797 #214]  INFO -- : [69912811601720, :server_message, "{\"type\":\"dnd_updated_user\", ..."]
I, [2018-08-20T19:15:53.211556 #214]  INFO -- : [69912811601720, :server_ping, ""]
W, [2018-08-20T19:47:40.985025 #214]  WARN -- : DOWN: game=pool, name=Team, id=ID

That last down is from a ping thread that watches slack status and restarts a bot when slack reports it's down.

I noticed in your implementation you implement next_event and keep a queue to process. This is a bit different from our code that loops over socket.readpartial and asynchronously lets the driver.parse on that returned data which I believe triggers the events. Could we be skipping messages because too slow? Doesn't seem so but I am out of ideas.

@dblock
Copy link
Author

@dblock dblock commented Aug 20, 2018

I also have a piece of ping code that can identify a socket/driver stuck in this disconnected state. I can examine anything on the object when that happens, ie. an instance of ::WebSocket::Driver or Celluloid::IO::SSLSocket, if something could be helpful?

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 20, 2018

I wonder if there are bugs in Celluloid::IO::SSLSocket. Maybe check the implementation of readpartial.

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 20, 2018

As you know if it's a bug in Celluloid, it's unlikely to ever get fixed.

@dblock
Copy link
Author

@dblock dblock commented Aug 21, 2018

I am seeing more new problems, now with the DNS resolver ... sigh.

What should I be replacing celluloid-io with?

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 21, 2018

I already shower how to use async with slack: https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda

@dblock
Copy link
Author

@dblock dblock commented Aug 21, 2018

Yes thank you @ioquatix for being helpful.

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 21, 2018

Let me know if I can be of further assistance.

@dblock
Copy link
Author

@dblock dblock commented Aug 21, 2018

Well given all the money I am paying you why don't you just implement slack-ruby/slack-ruby-client#210 :))) I am KIDDING totally of course. I'll eventually get to it. Much much thanks for being responsive and pitching in your comments, suggestions and thoughts. Let me know how I can be helpful.

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Aug 22, 2018

Hehe. Well, if you want to pay me money to make it work, I won't stop you :)

@dblock
Copy link
Author

@dblock dblock commented Sep 5, 2018

I want to resurrect this since we've implemented async support in slack-ruby/slack-ruby-client#219 and the problem persists.

Can someone help me understand what code in this library handles the ping/pong frames per https://tools.ietf.org/html/rfc6455#section-5.5.2? What's the best way to observe these? What happens if the server decides to go away and how does a :close event get emitted when that happens?

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Sep 6, 2018

The next step to debugging this is to dump all socket read/write events.

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Sep 6, 2018

I also wouldn’t rely on a close event from the websocket driver at worst you won’t receive it.

@dblock
Copy link
Author

@dblock dblock commented Sep 6, 2018

I also wouldn’t rely on a close event from the websocket driver at worst you won’t receive it.

It seems that you're saying that the websocket driver implementation can't always notice that it got disconnected? What good is such implementation for?

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Sep 6, 2018

It’s for graceful shutdown of the connection. If the underlying TCP connection fails e.g. timeout or someone trips over the ethernet, there won’t be any high level close event.

@dblock
Copy link
Author

@dblock dblock commented Sep 6, 2018

I don't get it. How's a client of the web socket supposed to know someone pulled the cable? How do other language libraries handle this?

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Sep 6, 2018

#read from the underlying socket returns nil, or it throws EOFError or EPIPE.

@aoberoi
Copy link

@aoberoi aoberoi commented Sep 6, 2018

yep, its not the most usable API, @dblock.

this is actually the entire purpose of sending "ping" messages. it gives both the server and client a mechanism to test whether the connection is alive, rather than depending on a close event that reports the connection is no longer alive.

as a tangent: it's not really the fault of the implementors or designers of the WebsSocket API. even TCP connections (lower level protocol) have a similar problem that when there's no traffic you can't really tell the difference between a dead connection and one where there's just nothing being said. that's why in general we call these ping/pong messages a keep-alive mechanism.

@dblock
Copy link
Author

@dblock dblock commented Sep 6, 2018

First, it's nobody's fault, I am just trying to get to the bottom of an issue that's affecting hundreds if not thousands of bots. Everyone who's contributing and helping is amazing, including @aoberoi and @ioquatix.

#read from the underlying socket returns nil, or it throws EOFError or EPIPE.

I see, that makes sense. Not the "best" API, but if a disconnect causes a nil or EOFError or EPIPE it would be fine. It's just not happening. I've never seen a nil returned and in the particular issue described here we're not seeing an exception raised.

this is actually the entire purpose of sending "ping" messages

You two @ioquatix and @aoberoi are contradicting each-other. It seems like @ioquatix is saying the implementation "handles" disconnects and will return nil or raise an error and @aoberoi is saying it won't and is supposed to just sit there and wait on something to happen.

@aoberoi
Copy link

@aoberoi aoberoi commented Sep 6, 2018

Okay let me back up a step and confirm my understanding of what's happening with you both (and anyone else who finds this issue interesting).

Apps built using slack-ruby-client client are commonly landing into an unresponsive state, where the app (websocket client) believes to be connected but is not receiving any messages.

Currently, we're determining that the connection is unresponsive by periodically sending a request to Slack's Web API method users.getPresence, and testing whether the bot user is online. This is checking for an asymmetry in the client's and the server's understanding of the connection state, which is a reasonable approach. However, my theory is that if we were instead to send a message over the websocket (from that state) the underlying driver would either return nil or raise one of the previously mentioned errors. @dblock, can you test this theory easily? Do you already know it to be untrue?

I don't think @ioquatix and myself are disagreeing (but please correct me if I'm wrong). We're both saying that returning nil or raising an error is as good as this driver can do to inform an application of a certain class of disconnects. I'm just saying that rather than periodically performing a request to users.getPresence, we should instead be periodically sending a message over the websocket, to stimulate the nil or error to occur, otherwise the TCP layer beneath the websocket or the platform (OS) may not have acknowledged that the connection has died.

Again, this is my current understanding, but I admit that I've not read all the code or followed every development along the way, so I'm happy to be corrected.

@ioquatix
Copy link
Contributor

@ioquatix ioquatix commented Sep 6, 2018

I think that’s a reasonable assessment.

If the TCP connection is still alive but the websocket session is dead, a periodic ping at the websocket layer should detect it. If TCP and websockets are still working but the bot is non-functional for some other high level reason, then checking if the bot is in the channel might be the only solution.

@dblock
Copy link
Author

@dblock dblock commented Sep 7, 2018

I agree, this seems reasonable.

I am going to first get our current situation wrapped up, ie. get to a mergeable state for slack-ruby/slack-ruby-client#219, slack-ruby/slack-ruby-bot#198 and slack-ruby/slack-ruby-bot-server#75 that replaces celluloid with async and force-closes the websocket when the bot goes offline according to slack status.

Next I'll try (unless someone beats me to it :) to build a websocket-level ping monitoring worker into slack-ruby-client itself, slack-ruby/slack-ruby-client#223.

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented May 28, 2019

Apologies for taking a while to get to this and thank you to @ioquatix for pitching in :)

I think the root of this issue is a question of scope: this library does not handle any I/O or network concerns for you, it only handles the WebSocket protocol data flow. This is because Ruby has a huge variety of I/O and concurrency frameworks, and we wanted something that could be combined with any of them.

Therefore, how you interact with the network socket is out of scope: you read from it, and feed the results into parse(). In return, the driver calls write() on your I/O adapter whenever it knows data should be written, which could be automatically in response to incoming data (handshakes, ping/pong, close frames) or as a result of you calling text() or ping() to send a message.

The driver is only capable of detecting events that result from data coming over the wire. Any network-related events like disconnections need to be handled by the caller -- if your TCP socket becomes disconnected, the driver cannot detect that and you need to handle it yourself.

This library will emit a close event if it receives a closing frame that was sent by the other peer; if you call close on your driver, that will send a close frame to the other peer, which should respond in kind, and if you receive that close frame and feed it into parse(), the driver will emit a close event. If you do not receive such a frame the driver will remain in the closing state and will not emit a close event.

If you want to see how ping/pong is handled, that happens here:

https://github.com/faye/websocket-driver-ruby/blob/0.7.0/lib/websocket/driver/hybi.rb#L369-L378

If this is still a problem and is in scope for us to fix in this library, can you leave further comments? Otherwise let us know this is out of scope and close the issue :)

Happy to try and answer any other questions you might have.

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented May 28, 2019

One thing I wanted to add is that we do provide an interface for sending a ping message and detecting the peer's response to it -- the block will execute if/when we receive a matching pong frame from the remote peer:

driver.ping("my ping message") do
  # this runs if/when we receive a pong response
end

However as this library is agnostic about your concurrency framework, it does not support calling ping() periodically, or setting a timeout on this response being received. You should use whatever methods are appropriate for your environment to support this, if you like.

For example, faye-websocket combines this library with EventMachine to produce a complete client. It can be configured to call ping() periodically using EventMachine's timers:

https://github.com/faye/faye-websocket-ruby/blob/0.10.7/lib/faye/websocket/api.rb#L52-L57

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented May 14, 2020

Is this an ongoing issue, on can I close it @dblock ?

@mattbrictson
Copy link

@mattbrictson mattbrictson commented May 14, 2020

My team is still seeing at least 1 disconnect every day. Our workaround is some extra code to ping every minute to test that the connection is alive, and if not, close and reconnect as needed. That is working fine but the underlying issue is still there.

@dblock
Copy link
Author

@dblock dblock commented May 14, 2020

We implemented ping in slack-ruby-client, read https://code.dblock.org/2019/03/04/solving-slack-side-disconnects-in-slack-ruby-client.html. If you think the driver should be doing this, then leave open. Otherwise close (and document)?

@mattbrictson
Copy link

@mattbrictson mattbrictson commented May 15, 2020

Thanks, I didn't realize there was a higher-level fix here: slack-ruby/slack-ruby-client#226. Seems like this issue can be closed, then.

@dblock
Copy link
Author

@dblock dblock commented May 15, 2020

Thanks, I didn't realize there was a higher-level fix here: slack-ruby/slack-ruby-client#226. Seems like this issue can be closed, then.

This may not be what you want! I think an argument could be for the websocket driver to maintain the connection all on its own. After all, it's surprising to users that a driver would lose a connection pretty much guaranteed at some point.

@mattbrictson
Copy link

@mattbrictson mattbrictson commented May 15, 2020

Hmm. I am not a consumer of this library directly; I'm using it via slack-ruby-client. So I don't think I am knowledgable enough to weigh in on this question. Sorry if I derailed this thread!

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented May 18, 2020

@dblock Can you elaborate on why this is surprising, because it would help me understand how I might have mis-communicated the fact that the network (or other I/O) layer is out of scope for this library and it is not capable by design of handling events from the network layer.

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented May 18, 2020

@dblock I notice in that blog post you have the following code:

def run_loop
  loop { read }
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
  driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection'))
end

I would not advise calling emit yourself, it's really only intended for internal use. Making it private would be a breaking change to your library so I won't do that, but in general I would avoid triggering events on objects you don't own. The library providing that object might have its own logic bound to those events and you triggering them when they're not expected can mess with that.

@dblock
Copy link
Author

@dblock dblock commented May 18, 2020

@dblock Can you elaborate on why this is surprising, because it would help me understand how I might have mis-communicated the fact that the network (or other I/O) layer is out of scope for this library and it is not capable by design of handling events from the network layer.

Because ping is part of the websocket protocol to maintain connections?
Because the issue is not a failure of the network, but a silent failure of the server going away that becomes undetectable on the client, and the client just sits ducks?

I totally see your point, though, re:design.

I guess someone has to maintain the connection via ping to really make websockets permanent and useful for something like slack-ruby-client. So either we're missing a websocket-connection-that-works-reliably library or one has to implement it here or there.

@dblock
Copy link
Author

@dblock dblock commented May 18, 2020

@dblock I notice in that blog post you have the following code:

def run_loop
  loop { read }
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
  driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection'))
end

I would not advise calling emit yourself, it's really only intended for internal use. Making it private would be a breaking change to your library so I won't do that, but in general I would avoid triggering events on objects you don't own. The library providing that object might have its own logic bound to those events and you triggering them when they're not expected can mess with that.

We spent so much time getting this to work in 100% of the cases that I am going to leave this as is until someone comes up with better :) In this case we need the websocket driver to abort itself (this is the scenario where the server went away but the driver is sitting there waiting).

@dblock
Copy link
Author

@dblock dblock commented May 18, 2020

I did want to add a thank you to everyone here for keeping the thread alive and for writing this driver and maintaining it and more. I am not a websocket expert by any means, I am in the "i thought it would just work" camp :)

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented May 20, 2020

@dblock We might be talking at cross purposes. What I was trying to say is that this library only handles the WebSocket protocol, which is to say everything related to the data that goes over the socket. It does not handle the socket itself, and the server going away or closing the connection is a socket-related event that is out of scope for this library.

It also doesn't take a position on when or how often ping should be called, because that depends on the application protocol you're running on top of WebSocket. And, not being tied to any particular concurrency framework, it has no way of scheduling pings even if it wanted to.

So the question I was asking is about how we could have communicated this better in the documentation, so that you would have known you needed to handle this yourself?

@dblock
Copy link
Author

@dblock dblock commented May 20, 2020

So the question I was asking is about how we could have communicated this better in the documentation, so that you would have known you needed to handle this yourself?

Probably an FAQ? Frankly I don't know if I would have discovered it, but it would have been nice to read a detailed explanation on the responsibility of clients to maintain connections. I found https://django-websocket-redis.readthedocs.io/en/latest/heartbeats.html that seems to explain something similar in python, or https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#Pings_and_Pongs_The_Heartbeat_of_WebSockets.

I do think a library that is linked from this project that implements a connected, and maintained websocket, with ping, is a better answer, but I am not writing one ATM.

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented Jun 8, 2020

That Django documentation highlights what I was trying to say about your WebSocket keep-alive mechanism depending on the application protocol you're running on top of it. For example, this library was developed as part of the Faye project, which implements a protocol called Bayeux.

Faye used to implement heartbeats on the client side by sending an empty JSON list of messages ([]) to the server. However, the canonical Bayeux implementation stopped accepting frames with no messages, so it became the case that there was no valid message the client could send that would act purely as a heartbeat, and carry no application-level information.

In the case of Bayeux, that protocol has its own heartbeat mechanism that works over WebSocket, HTTP or any other transport -- it doesn't rely on the transport layer to do keep-alive itself. Nevertheless the Faye server can be configured to send WebSocket ping messages if desired.

You'll also find that it doesn't necessarily make sense to make a WebSocket automatically reconnect on failure, because the application layer needs to know about that interruption in case it resulted in dropped messages. So, there's not really a neat solution to "just keeping a WebSocket connected" that doesn't involve the application layer, unfortunately.

@tjad
Copy link

@tjad tjad commented Sep 16, 2020

I have a sneaky suspicion that these experiences are all due to "keep alive" mechanisms. What I have noticed is that the client does not respond to ping messages with a pong message. In the api integration I am doing, the websocket server expects pongs to pings it sends. This might sound like an implementation detail, it does seem that as the ping pong is part of the protocol, that this is missing from the client.

I do see that in the hybi driver, there is

emit(:pong, PongEvent.new(payload))

The server I am joining is disconnecting me as if I didn't respond to its pong message - will do further investigation (wireshark)

@tjad
Copy link

@tjad tjad commented Sep 19, 2020

I have a sneaky suspicion that these experiences are all due to "keep alive" mechanisms. What I have noticed is that the client does not respond to ping messages with a pong message. In the api integration I am doing, the websocket server expects pongs to pings it sends. This might sound like an implementation detail, it does seem that as the ping pong is part of the protocol, that this is missing from the client.

I do see that in the hybi driver, there is

emit(:pong, PongEvent.new(payload))

The server I am joining is disconnecting me as if I didn't respond to its pong message - will do further investigation (wireshark)

I see, my issue with ping | pong not enabled is that my ping setting was not set as per suggested answer here
https://stackoverflow.com/questions/63757851/faye-websocket-ruby-not-working-as-expected

@jcoglan
Copy link
Collaborator

@jcoglan jcoglan commented Sep 19, 2020

@tjad Just to confirm: websocket-driver should always send a pong message if it receives a ping. It will only send a ping message if explicitly instructed.

@tjad
Copy link

@tjad tjad commented Sep 19, 2020

@jcoglan Thanks a lot. Yeah makes sense. Now receiving and handling opcode 10s . Perhaps in this case server was expecting inital ping. All working as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants