Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sending a message right after handshake #21

Closed
madsheep opened this issue Jul 19, 2017 · 11 comments
Closed

sending a message right after handshake #21

madsheep opened this issue Jul 19, 2017 · 11 comments

Comments

@madsheep
Copy link

Hi again!

We have been struggling with a very weird issue recently (that's also something that triggered us to update iodine).

In our websocket use case, we send an initial message to the server right after the handshake is finished. The message is relatively small, something like {"init": true} (we ask server for some initial data).

Every now and then we don't get anything back as if the message didn't make it at all. It happens more often if we are in the same location as the server (i.e. it's a local machine connection, or local network connection).

In our logs we also don't see anything happening with the on_message callback. If we repeat the message second later we get the reply back normally.

Our educated guess (I don't know C well enough) was that the callbacks for messages are set after the websocket handshake is finished - and hence if we are very quick we can squeeze the initial message on the socket before there is anything listening for it.
Could you please take a look if that's the case? Or maybe there is some other race condition in there?

@boazsegev
Copy link
Owner

Hi @madsheep ,

Thanks for opening this issue 🙏🏻 It's intriguing and I wish to understand it more before I try to reproduce it.

If you have an example ready up your sleeve, that would be great for tracking the issue down.

For example...:

  • Does the issue occurs in single-threaded mode (command line -t 1)?

  • Which client are you using?

If we repeat the message second later we get the reply back normally

  • Are you getting the reply twice or was the first message discarded?

As for your guess and question:

the callbacks for messages are set after the websocket handshake is finished...

Unless you wrote something explicit, than this shouldn't be the case. The C level uses locks to set the protocol object practically atomically. The Ruby bridge doesn't set any callbacks, it sets a whole object as a callback object (the object implements the callbacks)... so there's no "in between" time.

It could be (in theory) that the HTTP protocol object is receiving the first message. But this shouldn't be the case unless the first message was sent before the handshake was complete.

Or maybe there is some other race condition in there?

There's a theoretical race condition between the on_open, the actual (internal) upgrade and on_message.

I stress tested it using the websocket shootout client and at the moment it's solved by using the HTTP protocol's lock to call the on_open callback synchronously before switching protocol objects in the C layer.

The lock prevents new network "read" events from being processed, so the message event should be waiting until the "upgrade" task is complete (including the on_open).

@boazsegev
Copy link
Owner

boazsegev commented Jul 19, 2017

I tried reproducing the issue with no success.

Could you send me example client code as well... I'm not sure it's a server-side issue.

I've been using the following JavaScript edited to add some concurrency (the test might take a long while, depending on the client you use, mostly because client concurrency is limited):

var count = 0;
var max_count = 65536;
// change this to test pub/sub at: ws://localhost:3000/
var ws_uri = "ws://localhost:3000/no_pubsub";
var fresh = true;
function attempt_small_msg() {
  count++
  ws = new WebSocket(ws_uri);
  ws.attmpt_num = count;
  ws.onmessage = function(e) {
    if(!e.target.timeout)
      console.log("SUCCESS for attempt " + e.target.attmpt_num + " (misreported).");
    else
      clearTimeout(e.target.timeout);
    e.target.close();
    if(count < max_count) { attempt_small_msg(); }
    else if(fresh) { fresh  = false; console.log("Couldn't reproduce issue."); }
  };
  ws.onopen = function(e) {
    e.target.timeout = setTimeout(function (ws) {
      fresh  = false;
      ws.timeout = false;
      console.log("attempt " + ws.attmpt_num + " FAILED to get initial message in time.");
    }, 2000, e.target);
    e.target.send("hi");
    e.target.test_flag = true;
  };
  ws.onclose = function(e) {
    if(!e.target.test_flag)
    console.log("Attempt " + ws.attmpt_num + " failed to connnect(!)");
  }
}
// go get coffee
count = 0;
for (var i = 0; i < 128; i++) {
  attempt_small_msg();
}

With the following Ruby code:

require 'iodine'
# A simple router - Checks for Websocket Upgrade and answers HTTP.
module MyHTTPRouter
  # This is the HTTP response object according to the Rack specification.
  HTTP_RESPONSE = [200, { 'Content-Type' => 'text/html',
          'Content-Length' => '32' },
   ['Please connect using websockets.']]

   WS_RESPONSE = [0, {}, []]

   # this is function will be called by the Rack server (iodine) for every request.
   def self.call env
     # check if this is an upgrade request.
     if(env['upgrade.websocket?'.freeze])
       if env['PATH_INFO'] == '/'.freeze
         env['upgrade.websocket'.freeze] = WSHandler.new
       else
         env['upgrade.websocket'.freeze] = WSHandler2.new
       end
       return [0, {}, []]
       return WS_RESPONSE
     end
     # simply return the RESPONSE object, no matter what request was received.
     HTTP_RESPONSE
   end
end

# A simple Websocket Callback Object.
class WSHandler
  # seng a message to new clients.
  def on_open
    subscribe channel: "chat"
    # let everyone know we arrived
    # publish channel: "chat", message: "#{@name} entered the chat."
  end
  # send a message, letting the client know the server is suggunt down.
  def on_shutdown
    write "Server shutting down. Goodbye."
  end
  # perform the echo using a channel, so it fails if `on_open` didn't happen yet
  def on_message data
    publish channel: "chat", message: "#{@name}: #{data}"
  end
  def on_close
    # let everyone know we left
    publish channel: "chat", message: "#{@name} left the chat."
    # we don't need to unsubscribe, subscriptions are cleared automatically once the connection is closed.
  end
end


# A simple Websocket Callback Object.
class WSHandler2
  # just echo.
  def on_message data
    write data
  end
end

Iodine::Rack.app = MyHTTPRouter
Iodine.threads = 8
Iodine.start

@madsheep
Copy link
Author

madsheep commented Jul 20, 2017

hi!

Answering your questions quickly:

  • it does happen with -t 1
  • we never receive the reply to the first message - it's discarded

A colleague of mine (@nilclass) worked further on narrowing this down and found an easy way to reproduce this. He replaced the write call here: https://github.com/boazsegev/iodine/blob/master/examples/echo.ru#L44 with sleep 0.1

and started spamming with a client like so:

require 'faye/websocket'

def try_echo(i)
  ws = Faye::WebSocket::Client.new 'ws://127.0.0.1:3000'
  ws.on :message do |evt|
    if evt.data == "hello"
      puts "#{i}"
      ws.close
      try_echo i + 1
    elsif evt.data == "Welcome to our echo service!"
      # ignore.
    else
      $stderr.puts "Unexpected: #{evt.data.inspect}"
      exit -1
    end
  end

  ws.on :open do |evt|
    ws.send "hello"
  end
end

EM.run do
  try_echo(0)
end
  • without the sleep: runs forever (ctrl-c after 10000 connections)
  • with sleep 3: hangs on the first try
  • with sleep 0.1: hangs on 4th or 5th try usually

In our case, we do some initial setup in the websocket handler object in on_open (i.e. lookup database for relevant objects, do rabbitmq rpc calls here and there). We think that's what causing it.

@madsheep
Copy link
Author

Btw - we also figured out a workaround - we moved the initial setup and heavy lifting to the initialize method of our handler - so now it happens before browser gets the handshake reply.

@boazsegev
Copy link
Owner

@madsheep and @nilclass - Thanks for the example and for exposing the issue!
🙏🏻

I can replicate the issue coding in Ruby and using the sleep method in the on_open callback.

I'm trying to figure this out, but it's an elusive one...

I checked the flow and noticed that the on_data network event doesn't seem to fire after the on_open returns.

So I tried replicating the issue in the C level (where the on_data event occurs) - however, I can't replicate this issue when coding in C (using sleep in the on_open callback), either because I continued development on the C library (facil.io or because it's actually related to the Ruby layer.

I'm sure I'm overlooking something super simple and stupid, but it might take me a while to figure out what that is 🙄

I'll keep you posted,
B.

@boazsegev
Copy link
Owner

Update:

I've managed to recreate the issue using the C layer.

The difference was the number of threads (Iodine has a dedicated IO thread that makes sure the socket buffer is cleared, but the C layer doesn't actually send the response until it returns from on_open, it only places it in the outgoing buffer queue).

This is very good news. Now that I can replicate the issue, I have it's scent, I'm on the hunt, and this bug is going down! 👍🏻

@boazsegev
Copy link
Owner

I fixed it... I think.

Could you install from Github to test? I want to make sure the next release solves your issue.

boazsegev pushed a commit that referenced this issue Jul 20, 2017
HTTP/1.1 was running in a loop, except for pipelined requests, meaning
that any incoming data that was received during processing (not
pipelined) was processed within the loop.

By having the `on_open` delay the review of the loop, the loop resumed
after new data entered the buffer.

This was the root cause and this is an actual fix and not just a
stop-gag that masks the issue.

This fix implements a `facil_force_event` for looping, so if the
protocol was changed mid-stream, the new protocol will receive the new
data.
@boazsegev
Copy link
Owner

P.S.

I released the fixed versions (0.4.7) and I'm closing the issue.

Thanks again for opening the issue 🙏🏻

If anything else pops up, please let me know 👍🏻☺️

@boazsegev
Copy link
Owner

Hi @madsheep ,

I just wanted to drop a note to let you know I released an update to fix a minor logging issue that popped into the patch I implemented for this issue.

When experimenting with the issue you exposed, I updated iodine's facil.io library to use the edge version...

However, the edge version included a testing message for an experiment Date string caching feature.

I just realized it today as I was working on refactoring the HTTP/1.x parser and reviewed the logs.

I apologize for any inconvenience.

I've released an update (v. 0.4.8) with both a fix, an updated HTTP parser and the Date caching feature (which you probably won't use, since Rack often writes a Date header.

I hope everything works as expected after this update.

Kindly,
B.

@nilclass
Copy link

Hi @boazsegev,

I've tested v0.4.8 just now, and it works like a charm!
Thank you for such an awesome tool.

One tiny thing i want to mention, because i keep seeing it in the examples and in the code you added above:

       return [0, {}, []]
       return WS_RESPONSE

one of those suffices 🙊

cheers!
-n

@boazsegev
Copy link
Owner

Hi @nilclass ,

Thanks for keeping me posted 🙏🏻

I'm very happy everything's working and you're enjoying iodine ☺️👍🏻

One tiny thing i want to mention ...

😂😂😂

I guess it shows how highly skilled I am at Copy&Paste, that I didn't even notice I kept pasting this... 😂

Thanks 👍🏻

Have a good one!
B.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants