wait_readable inside Celluloid::Actor#timeout not unregistering the IO descriptor #101

Closed
HoneyryderChuck opened this issue Mar 11, 2014 · 43 comments

Comments

@HoneyryderChuck

I'm having an issue using wait_readable inside the timeout handler. I have something resembling this code:

# inside the actor
def wait_readable(t)
  timeout(t) do
    @sock.wait_readable
  end
end

I have a case in which wait_readable blocks and the timeout is triggered; a Celluloid::Task::TimeoutError exception is raised.

The problem is that right afterwards another method calls @sock.wait_writable, and I get the following exception:
this IO is already registered with selector

This only happens in that scenario. Could it be that the timeout handling is not taking the possibly still-registered sockets into account?

@HoneyryderChuck
Author

What I would like is an API to deregister sockets. I can understand if the timeout doesn't do this by default, but then I could handle the exception from the exact code snippet above by deregistering the socket myself, with something like Celluloid::IO.deregister(socket).
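
A sketch of how I would use it, assuming such a hypothetical Celluloid::IO.deregister existed:

# Hypothetical API sketch; Celluloid::IO.deregister does not exist today.
def wait_readable(t)
  timeout(t) { @sock.wait_readable }
rescue Celluloid::Task::TimeoutError
  Celluloid::IO.deregister(@sock) # free the descriptor for later waits
  raise
end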

@HoneyryderChuck
Author

A hint might be that the timeout is not firing within the expected interval inside the reactor. I set the timeout to 0.01, and I still get a break of around 8-10 seconds.

@tarcieri
Member

This is with master?

@HoneyryderChuck
Author

No, 0.15.0. Has this been fixed in master? I can give it a look tomorrow.

@tarcieri
Member

Yes, I pushed a potential fix to Celluloid master:

celluloid/celluloid@bf37c11

...but we didn't have a test to properly verify whether the behavior was fixed.

@HoneyryderChuck
Author

It still happens using celluloid and celluloid-io master. Looking at the commit, you rescue the timeout from the message handling. But I think the timeout in this case happens inside a timer (the snippet above uses Celluloid::Actor#timeout, which is a wrapper for #after). Might this be the issue?

@HoneyryderChuck
Author

Hmm, I think that somehow the "handle_message message" line blocks on @sock.wait_readable and does not let the timers fire. Do we have a timeout on waiting for readable?

@HoneyryderChuck
Author

Forget what I said. I ran it with the debugger (and the master versions of the gems) and this is where I arrived: timeout is called, then wait_readable; the socket is registered and the task gets suspended. The timer is then triggered and the exception is passed to the task, so the task raises the new Celluloid::Task::TimeoutError, which is then caught inside the actor run loop. But when the timer fires, the socket is still in the reactor selector, and it is still there when the timeout exception is raised (at this point I'm already "out" of celluloid and inside my own handling of the timeout block above, where I rescue Celluloid::Task::TimeoutError). So in order for it to run properly, I should then deregister the socket myself, right? Or is there something in the flow I'm forgetting?

@HoneyryderChuck
Author

I found a better snippet for the example:

class A
  include Celluloid::IO

  attr_accessor :sock
  def initialize
    @sock = TCPSocket.new("127.0.0.1", 22)
  end

  def test
    timeout(2) do
      @sock.read
    end
  rescue Celluloid::Task::TimeoutError
    @sock.write "bang"
  end
end

a = A.new
a.async.test
a.async.test

My backtrace:

ArgumentError: this IO is already registered with selector
/home/taacati1/celluloid-io/lib/celluloid/io/reactor.rb:43:in `register'

@HoneyryderChuck
Author

I think a good solution would be to extend this here: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/tasks.rb#L85

When the exception is raised, the task would access its monitor and close it. But I see that the monitor holds the reference to the task, and not the other way around.

@tarcieri
Member

@TiagoCardoso1983 so if I understand correctly, the descriptor in wait_readable/writable isn't getting deregistered correctly on timeout?

"this IO is already registered with selector" definitely indicates a bug

@HoneyryderChuck
Author

Exactly. I managed to follow the process with the debugger (the socket being registered, then the actor handling the timeout error). I got to the point where the task is resumed and the TimeoutError is raised further (because it is a resumable error), and the socket was left behind in the NIO selector.

@tarcieri
Member

Perhaps the remaining hurdle is to deregister the socket from the selector on timeout. What's really needed is a spec that captures the failing behavior.

@HoneyryderChuck
Author

Feel free to change the syntax, but the main idea is this one: if a timeout occurs while the socket waits for readability, then the descriptor must be released immediately so that subsequent uses of the socket are possible.

If you think such a use case is very edgy, think about the waittime option of telnet or ssh clients, where the client is supposed to wait x seconds for further output after it matched the prompt (sometimes the output does not end with the first prompt). That means they block on read for x seconds, and when the timeout occurs they assume the output is over and stop reading.
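
A rough sketch of that waittime pattern inside a Celluloid::IO actor (the method name, buffer handling, and chunk size are all illustrative):

# Wait up to `waittime` seconds for each further chunk after the prompt;
# a timeout means no more output is coming and the response is complete.
def read_until_quiet(waittime)
  buffer = ""
  loop do
    buffer << timeout(waittime) { @sock.readpartial(1024) }
  end
rescue Celluloid::Task::TimeoutError, EOFError
  buffer
end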

@HoneyryderChuck
Author

@tarcieri, have you already had time to look at this issue? Is the way to go for the task to keep a record of its registered sockets and deregister them when handling a ResumableError? I'd like to patch the behaviour or maybe propose a possible solution for it, but not before discussing it through.

@tarcieri
Member

@TiagoCardoso1983 I have not. It does seem like we just need to deregister in the event of a timeout. Perhaps it would be good to have a specific exception for timeouts arising inside a Celluloid::IO reactor (e.g. Celluloid::IO::ReactorTimeout).

@HoneyryderChuck
Author

So, if I understood correctly:

In this line:
https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L152
we might send a message to the mailbox
(or maybe here?: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L253).

But how does the mailbox know which task failed to complete within the given timeout? The way I see it, in the second link we can figure out which task failed (Thread.current[:celluloid_task], right?). The task could keep a record of the IO it is waiting for (this might require extending the task class under celluloid-io with an io attribute) before calling suspend :iowait : https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L43 . Actually, one could "decorate" the task's #suspend method to store the IO before doing the normal suspension:

# on celluloid/io/task_io.rb
module Celluloid::IO::TaskIO
  def suspend(status, io)
    return super(status) unless status == :iowait
    self.io = io # remember which IO this task is waiting on
    super(status)
  rescue Celluloid::ResumableError => e
    # the task timed out: free the descriptor from the actor's reactor
    Thread.current[:celluloid_actor].mailbox.reactor.deregister(io)
    raise e
  ensure
    self.io = nil
  end
end

# prepend (rather than include) so this wrapper runs before Task#suspend
Celluloid::Task.send(:prepend, Celluloid::IO::TaskIO)

What do you think of the overall idea? Right now a few things prevent this from working, namely: the suspend method takes one argument and not two, there is no attribute to store IOs under the task, and there is no method on the reactor to deregister an IO object (it only forwards the register method to the NIO::Selector instance); a sketch of that missing method follows below. But this way I think one could patch it easily: decorate a task method under celluloid-io, keep a pointer to the IO object whenever the task suspends on iowait, and then handle the resumable errors by freeing up the socket to other contending tasks.
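
The missing reactor method could be as thin as this, assuming the reactor keeps its NIO::Selector in @selector as it does for #register (NIO::Selector itself does expose #deregister):

class Celluloid::IO::Reactor
  # Remove an IO object from the underlying selector so that other
  # tasks can register it again.
  def deregister(io)
    @selector.deregister(io)
  end
end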

@HoneyryderChuck
Author

I have been experimenting with this solution. It works when the released socket waits for readability and the selected one for writability, or vice-versa, but not when both waits are of the same type: they lock each other, hence a deadlock. And even then, it does not seem to respect the timeout delay (the delay that releases the task seems to be the delay of the reactor's select). How could I signal to the reactor that a certain IO is to be deregistered? (It seems such an event can only happen from within the loop, right?)

@HoneyryderChuck
Author

Hmm, I guess I'm finding deeper issues. Consider a task that starts a timeout action in which it requests a socket inside an actor's reactor loop, and a next task that requires the same socket. Since the timeout action will only release the socket after the action is complete or the timeout expires, the second task will block on selecting the socket. And this happens all the time: I hit this issue every time I run a timer method inside the reactor while a socket is waiting on read or write. I think the problem is that, as of now, only one task at a time may monitor an IO (an IO lives inside a monitor, which also holds a task). Since tasks inside an actor are not really run concurrently, shouldn't we be letting the socket roam free across tasks? Please correct my assumptions. A minimal shape of the contention is sketched below.
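
Reusing the wait_readable wrapper from the first comment, the contention looks like this (illustrative):

a = A.new
# first task registers @sock with the reactor and suspends
a.async.wait_readable(5)
# second task wants the same @sock; it must wait until the first
# task completes or times out, even though tasks never run in parallel
a.async.wait_readable(5)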

@ioquatix
Contributor

I think the important thing is whether or not it is useful for the socket to be available in multiple tasks. In many cases, e.g. a UDP server, the main loop waits on recvfrom; once an incoming request arrives, it typically dispatches an async task to construct a response and sendto it back across the network. In my case, recvfrom needs to have a timeout (wait for an incoming message, but if you don't get it within 2 seconds, try something else or signal an error).

The other thing to consider is whether you share the same UDP socket across multiple actors, or perhaps even tasks. For a high-performance server, you may want completely isolated actors receiving input from a single socket round-robin - e.g. in my case, a DNS server. Often the OS can multiplex quite well (http://stackoverflow.com/questions/1981372/are-parallel-calls-to-send-recv-on-the-same-socket-valid), so the ability to time out a socket should be task-specific.

My workaround for the request-based timeout (e.g. request this resource; if it isn't available after a few seconds, give up and try the next server) is to use after(timeout) { socket.close } and then catch the IOError that gets thrown by recvfrom. It seems to work reasonably well for now as a temporary workaround.
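
In code, the workaround looks roughly like this (the method name, the 2-second delay, and @socket are illustrative):

def receive_with_timeout
  timer = after(2) { @socket.close } # force the blocked recvfrom to abort
  data, _addr = @socket.recvfrom(512)
  data
rescue IOError
  nil # the timer closed the socket: treat it as a timeout
ensure
  timer.cancel
end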

@tarcieri
Member

I've talked rather extensively with the Rust developers about the various tradeoffs here as they're building a similar system that abstracts across both "green" (i.e. Celluloid::IO) and "native" (i.e. non-Celluloid::IO) scenarios.

Concurrent reads are garbage, but concurrent writes are useful, as write() is atomic (or should be treated as such). Celluloid::IO contains logic to prevent interleaving here:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/stream.rb#L39
https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/stream.rb#L58
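
The idea behind those two lines, reduced to a plain-Ruby sketch (the real stream.rb uses Celluloid-aware synchronization rather than a bare Mutex):

require "thread"

class LockedStream
  def initialize(io)
    @io = io
    @write_lock = Mutex.new
  end

  # Serialize writers so that two tasks cannot interleave
  # partial writes on the same stream.
  def write(data)
    @write_lock.synchronize { @io.write(data) }
  end
end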

UDP is a different case. The kernel should handle shared writes. I'm not sure what to do about shared reads for UDP.

@ioquatix
Contributor

Concurrent reads from UDP are well defined by POSIX - oh whoops, missed the last sentence :D I guess whatever you do it should work - can't you just let the OS handle it?

@tarcieri
Member

Letting the OS handle it is what the existing code should be doing

@ioquatix
Contributor

Okay, right, so back to the main issue: how to handle socket IO with a timeout. Is it possible for wait_readable to save a handle which allows the IO to be cancelled?

@tarcieri
Member

@ioquatix the most helpful thing that could be done to solve this problem is to create specs that accurately capture the failing cases
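
A minimal RSpec sketch of the failing case, reusing class A from the reproduction above (the setup details are illustrative):

RSpec.describe "timeout around a blocking read" do
  it "deregisters the socket when the timeout fires" do
    actor = A.new # connects to 127.0.0.1:22 and blocks on read
    actor.test    # times out, rescues, writes "bang"
    # currently raises "this IO is already registered with selector"
    expect { actor.test }.not_to raise_error
  end
end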

@ioquatix
Contributor

Okay sure, I will see if I have time to do that today.

@ioquatix
Contributor

@TiagoCardoso1983 I believe this issue has been resolved with the PRs I've submitted.

celluloid/celluloid#432
celluloid/celluloid#435

Once these have both been integrated, it would be great if you could test again.

@HoneyryderChuck
Author

Nice! I'll take a look as soon as I have the opportunity. This was merged into master, I presume?

@tarcieri
Member

@TiagoCardoso1983 not yet, still needs to be reviewed

@ioquatix
Contributor

@TiagoCardoso1983 My master branch has all the fixes: https://github.com/ioquatix/celluloid

@HoneyryderChuck
Author

@ioquatix @tarcieri, I've just re-run the issue using @ioquatix's celluloid master branch, the current celluloid-io branch, and the timers master branch (@ioquatix, you've already set the timers 3.0.0 dependency in the gemspec, and it is currently in a .pre phase, if I understood correctly?). And the result: it worked like a charm :)

For me there is only one "small" issue left, which is not exactly a celluloid concern: I'm using a Celluloid::IO-compatible Telnet client based on this PR: https://github.com/pberndt/celluloid-io/blob/73060d15f175b1c773606c22ea155be91dc88c99/lib/celluloid/net/telnet.rb#L501 and an SSH client which responds to the same API: https://github.com/jasonkarns/net-ssh-telnet/blob/master/lib/net/ssh/telnet.rb#L353 .

This concerns the "waittime" feature of such clients (wait n seconds for more output after matching against a prompt). In the telnet case, in order to be fully Celluloid::IO compatible, I have to rescue Celluloid::Task::TimeoutError as well. The SSH client case is more complicated, since it calls IO::select directly. Regarding the second example: will the non-Celluloid::IO select call block the reactor? How could I "celluloidize" it? Regarding the first: is it possible somewhere inside celluloid to catch the Celluloid::Task::TimeoutError and re-raise it as a Timeout::Error? (@tarcieri, we already discussed this concerning the timeout gem, I guess.)
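
For the telnet case, the re-raise could also be done at the call site, along these lines (a sketch; celluloid does not do this for you):

require "timeout"

begin
  timeout(waittime) { @sock.wait_readable }
rescue Celluloid::Task::TimeoutError
  # surface the actor-level timeout as the stdlib exception
  # that the client code already rescues
  raise Timeout::Error, "read timed out"
end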

Maybe we can continue discussing it here or in a separate thread, whichever is best for you. From my side, since the main celluloid issue seems fixed, I assume we can close the issue once we have an official release. Thx for the contributions :)

@ioquatix
Contributor

ioquatix commented Jun 4, 2014

@TiagoCardoso1983 That is good news and I'm glad that the issue has been resolved.

Here is how I do IO with timeout: https://github.com/ioquatix/rubydns/blob/celluloid/lib/rubydns/resolver.rb#L98-L100

Basically, my understanding of celluloid is that using IO::select directly would cause problems: it will block the entire actor from processing tasks. It shouldn't block the entire process, though; other actors would continue to run on their own threads.

Regarding how to fix this issue, perhaps it would be possible for Celluloid to monkey patch the SSH code such that IO::select redirects to the current actor's IO reactor. I'm not sure if this is the best option so perhaps @tarcieri can comment on this too. In any case, I'd suggest we close this thread and open a new issue specifically for this case - do you want to do that?

@HoneyryderChuck
Author

I think @tarcieri will not like it, as we had a similar discussion regarding timeouts (hence the vaporware timeout gem reference). In a better world, such Ruby APIs would have opt-in handles for us to replace at will, but that's not the world we live in :)

I'll wait for whatever @tarcieri has to say before we open this new thread; it only makes sense if there is an objective path to a solution.

@ioquatix
Contributor

ioquatix commented Jun 4, 2014

How many connections do you need in parallel? Can you spawn one actor per connection?

@HoneyryderChuck
Author

The number of connections varies (1 - n). Usually it batch-connects and does work on all of them. For this reason I went with one actor per available CPU core, distributing the connections among them and supervising them all. I took this decision since the Ruby VM maps to native threads and doesn't control the context switching à la Erlang or Go (I don't know how right I am about this).

@ioquatix
Contributor

ioquatix commented Jun 4, 2014

@TiagoCardoso1983 My advice right now would be to use one actor per connection (i.e. N actors) and let Ruby schedule the IO::select calls internally. Unless N > 1000, you'll probably be fine.

The alternative is to cap N at, say, 100, then allocate (number of connections) / N jobs to each actor, and have them go through one at a time.

Does that make sense?

I tried a similar thing in RubyDNS, where I was doing over 5000 UDP requests in parallel. At some point you start to run into limitations of the networking stack/OS doing too much in parallel, so I batched it into groups of 1000 requests at a time.
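
A rough sketch of the capped-pool alternative (ConnectionWorker, hosts, and the pool size are illustrative names, not Celluloid API):

POOL_SIZE = 100
workers = Array.new(POOL_SIZE) { ConnectionWorker.new }

# split the hosts into at most POOL_SIZE batches and hand one batch
# to each worker; every worker processes its batch sequentially
hosts.each_slice((hosts.size.to_f / POOL_SIZE).ceil)
     .each_with_index do |batch, i|
  workers[i].async.process(batch)
end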

@HoneyryderChuck
Author

Hmm, but if you are using one actor per connection, why Celluloid::IO? Its benefit is having one actor handle multiple connections in its thread, in a quasi-green-threads scenario. The case you describe could be done using plain Celluloid, letting the single connection in each thread block on IO, right?

The batch scenario is something that will have to be done eventually anyway; you're right that the OS does have limitations on the number of open sockets.

@tarcieri
Member

tarcieri commented Jun 5, 2014

@TiagoCardoso1983 the main reason to use Celluloid::IO even in an actor-per-connection scenario is so you can multiplex I/O events with actor messages.

However, keep in mind that each Celluloid::IO actor requires 2 file descriptors, as it uses a pipe for wakeup.

@HoneyryderChuck
Author

@tarcieri I had forgotten that IO in plain Celluloid does block the mailbox, thx for reminding me. But still, it does make sense to distribute the load across the available cores and let each actor/core handle n connections, right?

@tarcieri
Member

tarcieri commented Jun 6, 2014

You can use more than N threads for N cores if you're just doing blocking I/O.

If you're using some library with a lot of weird interaction with Timeout, for now I'd say just use blocking I/O. Maybe in the future we can get a better solution for Timeout.

@HoneyryderChuck
Author

I rewrote the logic so that it doesn't rely on the timeout-dependent feature (the "waittime" option), even though this fix makes it possible to use.

@ioquatix
Contributor

I believe this issue can now be closed; it was resolved by my patches to timers/celluloid. The issue I just brought up appears to be a new/different one.

@HoneyryderChuck
Author

Solved in #122
