wait_readable inside Celluloid::Actor#timeout not unregistering the IO descriptor #101

Closed
HoneyryderChuck opened this issue Mar 11, 2014 · 43 comments

Comments

@HoneyryderChuck

I'm having an issue using wait_readable inside the timeout handler. I have something resembling this code:

# inside the actor
def wait_readable(t)
  timeout(t) do
    @sock.wait_readable
  end
end

I have a case in which wait_readable blocks and the timeout is triggered; a Celluloid::Task::TimeoutError exception is raised.

The problem is that right afterwards another method calls @sock.wait_writable, and I get the following exception:
this IO is already registered with selector

This only happens in that scenario. Could it be that the timeout handling is not taking the possibly still-registered sockets into account?

@HoneyryderChuck
Author

What I would like is an API to deregister sockets. I can understand if the timeout doesn't do this by default, but then I could handle the exception from the exact code snippet above by deregistering the socket myself, with something like Celluloid::IO.deregister(socket).
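
A sketch of how I would use it, assuming such a hypothetical Celluloid::IO.deregister existed:

# Hypothetical API sketch; Celluloid::IO.deregister does not exist today.
def wait_readable(t)
  timeout(t) { @sock.wait_readable }
rescue Celluloid::Task::TimeoutError
  Celluloid::IO.deregister(@sock) # free the descriptor for later waits
  raise
end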

@HoneyryderChuck
Author

A hint might be that the timeout is not firing within the expected interval inside the reactor. I set the timeout to 0.01, and I still get a break of around 8-10 seconds.

@tarcieri
Member

This is with master?

@HoneyryderChuck
Author

No, 0.15.0. Has this been fixed in master? I can give it a look tomorrow.

@tarcieri
Member

Yes, I pushed a potential fix to Celluloid master:

celluloid/celluloid@bf37c11

...but we didn't have a test to properly verify whether the behavior was fixed.

@HoneyryderChuck
Author

It still happens using celluloid and celluloid-io master. Looking at the commit, you rescue the timeout from the message handling. But I think the timeout in this case happens inside a timer (the snippet above uses Celluloid::Actor#timeout, which is a wrapper for #after). Might this be the issue?

@HoneyryderChuck
Author

Hmm, I think that somehow the "handle_message message" line blocks on @sock.wait_readable and does not let the timers fire. Do we have a timeout on waiting for readable?

@HoneyryderChuck
Author

Forget what I said. I ran it with the debugger (and the master versions of the gems) and this is where I arrived: timeout is called, then wait_readable; the socket is registered and the task gets suspended. The timer is then triggered and the exception is passed to the task, so the task raises the new Celluloid::Task::TimeoutError, which is then caught inside the actor run loop. But when the timer fires, the socket is still in the reactor selector, and it is still there when the timeout exception is raised (at this point I'm already "out" of celluloid and inside my own handling of the timeout block above, where I rescue Celluloid::Task::TimeoutError). So in order for it to run properly, I should then deregister the socket myself, right? Or is there something in the flow I'm forgetting?

@HoneyryderChuck
Author

I found a better snippet for the example:

class A
  include Celluloid::IO

  attr_accessor :sock
  def initialize
    @sock = TCPSocket.new("127.0.0.1", 22)
  end

  def test
    timeout(2) do
      @sock.read
    end
  rescue Celluloid::Task::TimeoutError
    @sock.write "bang"
  end
end

a = A.new
a.async.test
a.async.test

My backtrace:

ArgumentError: this IO is already registered with selector
/home/taacati1/celluloid-io/lib/celluloid/io/reactor.rb:43:in `register'

@HoneyryderChuck
Author

I think a good solution would be to extend this here: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/tasks.rb#L85

When the exception is raised, the task would access its monitor and close it. But I see that the monitor holds the reference to the task, and not the other way around.

@tarcieri
Member

@TiagoCardoso1983 so if I understand correctly, the descriptor in wait_readable/writable isn't getting deregistered correctly on timeout?

"this IO is already registered with selector" definitely indicates a bug

@HoneyryderChuck
Author

Exactly. I managed to follow the process with the debugger (the socket being registered, then the actor handling the timeout error). I got to the point where the task is resumed and the TimeoutError is raised further (because it is a resumable error), and the socket was left behind in the NIO selector.

@tarcieri
Member

Perhaps the remaining hurdle is to deregister the socket from the selector on timeout. What's really needed is a spec that captures the failing behavior.

@HoneyryderChuck
Author

Feel free to change the syntax, but the main idea is this one: if a timeout occurs while the socket waits for readability, then the descriptor must be released immediately so that subsequent uses of the socket are possible.

If you think such a use case is very edgy, think about the waittime option of telnet or ssh clients, where the client is supposed to wait x seconds for further output after it matched the prompt (sometimes the output does not end with the first prompt). That means they block on read for x seconds, and when the timeout occurs they assume the output is over and stop reading.
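
A rough sketch of that waittime pattern inside a Celluloid::IO actor (the method name, buffer handling, and chunk size are all illustrative):

# Wait up to `waittime` seconds for each further chunk after the prompt;
# a timeout means no more output is coming and the response is complete.
def read_until_quiet(waittime)
  buffer = ""
  loop do
    buffer << timeout(waittime) { @sock.readpartial(1024) }
  end
rescue Celluloid::Task::TimeoutError, EOFError
  buffer
end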

@HoneyryderChuck
Author

@tarcieri, have you already had time to look at this issue? Is the way to go for the task to keep a record of its registered sockets and deregister them when handling a ResumableError? I'd like to patch the behaviour or maybe propose a possible solution for it, but not before discussing it through.

@tarcieri
Member

@TiagoCardoso1983 I have not. It does seem like we just need to deregister in the event of a timeout. Perhaps it would be good to have a specific exception for timeouts arising inside a Celluloid::IO reactor (e.g. Celluloid::IO::ReactorTimeout).

@HoneyryderChuck
Author

So, if I understood correctly:

In this line:
https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L152
we might send a message to the mailbox
(or maybe here?: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/actor.rb#L253).

But how does the mailbox know which task failed to complete within the given timeout? The way I see it, in the second link we can figure out which task failed (Thread.current[:celluloid_task], right?). The task could keep a record of the IO it is waiting for (this might require extending the task class under celluloid-io with an io attribute) before calling suspend :iowait : https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L43 . Actually, one could "decorate" the task's #suspend method to store the IO before doing the normal suspension:

# on celluloid/io/task_io.rb
module Celluloid::IO::TaskIO
  def suspend(status, io)
    return super(status) unless status == :iowait
    self.io = io # remember which IO this task is waiting on
    super(status)
  rescue Celluloid::ResumableError => e
    # the task timed out: free the descriptor from the actor's reactor
    Thread.current[:celluloid_actor].mailbox.reactor.deregister(io)
    raise e
  ensure
    self.io = nil
  end
end

# prepend (rather than include) so this wrapper runs before Task#suspend
Celluloid::Task.send(:prepend, Celluloid::IO::TaskIO)

What do you think of the overall idea? Right now a few things prevent this from working, namely: the suspend method takes one argument and not two, there is no attribute to store IOs under the task, and there is no method on the reactor to deregister an IO object (it only forwards the register method to the NIO::Selector instance); a sketch of that missing method follows below. But this way I think one could patch it easily: decorate a task method under celluloid-io, keep a pointer to the IO object whenever the task suspends on iowait, and then handle the resumable errors by freeing up the socket to other contending tasks.
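
The missing reactor method could be as thin as this, assuming the reactor keeps its NIO::Selector in @selector as it does for #register (NIO::Selector itself does expose #deregister):

class Celluloid::IO::Reactor
  # Remove an IO object from the underlying selector so that other
  # tasks can register it again.
  def deregister(io)
    @selector.deregister(io)
  end
end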

@HoneyryderChuck
Author

I have been experimenting with this solution. It works when the released socket waits for readability and the selected one for writability, or vice-versa, but not when both waits are of the same type: they lock each other, hence a deadlock. And even then, it does not seem to respect the timeout delay (the delay that releases the task seems to be the delay of the reactor's select). How could I signal to the reactor that a certain IO is to be deregistered? (It seems such an event can only happen from within the loop, right?)

@HoneyryderChuck
Author

Hmm, I guess I'm finding deeper issues. Consider a task that starts a timeout action in which it requests a socket inside an actor's reactor loop, and a next task that requires the same socket. Since the timeout action will only release the socket after the action is complete or the timeout expires, the second task will block on selecting the socket. And this happens all the time: I hit this issue every time I run a timer method inside the reactor while a socket is waiting on read or write. I think the problem is that, as of now, only one task at a time may monitor an IO (an IO lives inside a monitor, which also holds a task). Since tasks inside an actor are not really run concurrently, shouldn't we be letting the socket roam free across tasks? Please correct my assumptions. A minimal shape of the contention is sketched below.
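
Reusing the wait_readable wrapper from the first comment, the contention looks like this (illustrative):

a = A.new
# first task registers @sock with the reactor and suspends
a.async.wait_readable(5)
# second task wants the same @sock; it must wait until the first
# task completes or times out, even though tasks never run in parallel
a.async.wait_readable(5)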

@ioquatix
Contributor

I think the important thing is whether or not it is useful for the socket to be available in multiple tasks. In many cases, e.g. a UDP server, the main loop waits on recvfrom; once an incoming request arrives, it typically dispatches an async task to construct a response and sendto it back across the network. In my case, recvfrom needs to have a timeout (wait for an incoming message, but if you don't get it within 2 seconds, try something else or signal an error).

The other thing to consider is whether you share the same UDP socket across multiple actors, or perhaps even tasks. For a high-performance server, you may want completely isolated actors receiving input from a single socket round-robin - e.g. in my case, a DNS server. Often the OS can multiplex quite well (http://stackoverflow.com/questions/1981372/are-parallel-calls-to-send-recv-on-the-same-socket-valid), so the ability to time out a socket should be task-specific.

My workaround for the request-based timeout (e.g. request this resource; if it isn't available after a few seconds, give up and try the next server) is to use after(timeout) { socket.close } and then catch the IOError that gets thrown by recvfrom. It seems to work reasonably well for now as a temporary workaround.
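
In code, the workaround looks roughly like this (the method name, the 2-second delay, and @socket are illustrative):

def receive_with_timeout
  timer = after(2) { @socket.close } # force the blocked recvfrom to abort
  data, _addr = @socket.recvfrom(512)
  data
rescue IOError
  nil # the timer closed the socket: treat it as a timeout
ensure
  timer.cancel
end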

@tarcieri
Member

I've talked rather extensively with the Rust developers about the various tradeoffs here as they're building a similar system that abstracts across both "green" (i.e. Celluloid::IO) and "native" (i.e. non-Celluloid::IO) scenarios.

Concurrent reads are garbage, but concurrent writes are useful, as write() is atomic (or should be treated as such). Celluloid::IO contains logic to prevent interleaving here:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/stream.rb#L39
https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/stream.rb#L58
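
The idea behind those two lines, reduced to a plain-Ruby sketch (the real stream.rb uses Celluloid-aware synchronization rather than a bare Mutex):

require "thread"

class LockedStream
  def initialize(io)
    @io = io
    @write_lock = Mutex.new
  end

  # Serialize writers so that two tasks cannot interleave
  # partial writes on the same stream.
  def write(data)
    @write_lock.synchronize { @io.write(data) }
  end
end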

UDP is a different case. The kernel should handle shared writes. I'm not sure what to do about shared reads for UDP.

@ioquatix
Contributor

Concurrent reads from UDP are well defined by POSIX - oh whoops, missed the last sentence :D I guess whatever you do it should work - can't you just let the OS handle it?

@tarcieri
Member

Letting the OS handle it is what the existing code should be doing

@ioquatix
Contributor

Okay, right, so back to the main issue: how to handle socket IO with a timeout. Is it possible for wait_readable to save a handle which allows the IO to be cancelled?

@tarcieri
Member

@ioquatix the most helpful thing that could be done to solve this problem is to create specs that accurately capture the failing cases
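
A minimal RSpec sketch of the failing case, reusing class A from the reproduction above (the setup details are illustrative):

RSpec.describe "timeout around a blocking read" do
  it "deregisters the socket when the timeout fires" do
    actor = A.new # connects to 127.0.0.1:22 and blocks on read
    actor.test    # times out, rescues, writes "bang"
    # currently raises "this IO is already registered with selector"
    expect { actor.test }.not_to raise_error
  end
end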

@ioquatix
Contributor

Okay sure, I will see if I have time to do that today.

@ioquatix
Contributor

@TiagoCardoso1983 I believe this issue has been resolved with the PRs I've submitted.

celluloid/celluloid#432
celluloid/celluloid#435

Once these have both been integrated, it would be great if you could test again.

@HoneyryderChuck
Author

Nice! I'll take a look as soon as I have the opportunity. This was merged into master, I presume?

@tarcieri
Member

@TiagoCardoso1983 not yet, still needs to be reviewed

@ioquatix
Contributor

@TiagoCardoso1983 My master branch has all the fixes: https://github.com/ioquatix/celluloid

@HoneyryderChuck
Author

@ioquatix @tarcieri, I've just re-run the issue using @ioquatix's celluloid master branch, the current celluloid-io branch, and the timers master branch (@ioquatix, you've already set the timers 3.0.0 dependency in the gemspec, and it is currently in a .pre phase, if I understood correctly?). And the result: it worked like a charm :)

For me there is only one "small" issue left, which is not exactly a celluloid concern: I'm using a Celluloid::IO-compatible Telnet client based on this PR: https://github.com/pberndt/celluloid-io/blob/73060d15f175b1c773606c22ea155be91dc88c99/lib/celluloid/net/telnet.rb#L501 and an SSH client which responds to the same API: https://github.com/jasonkarns/net-ssh-telnet/blob/master/lib/net/ssh/telnet.rb#L353 .

This concerns the "waittime" feature of such clients (wait n seconds for more output after matching against a prompt). In the telnet case, in order to be fully Celluloid::IO compatible, I have to rescue Celluloid::Task::TimeoutError as well. The SSH client case is more complicated, since it calls IO::select directly. Regarding the second example: will the non-Celluloid::IO select call block the reactor? How could I "celluloidize" it? Regarding the first: is it possible somewhere inside celluloid to catch the Celluloid::Task::TimeoutError and re-raise it as a Timeout::Error? (@tarcieri, we already discussed this concerning the timeout gem, I guess.)
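
For the telnet case, the re-raise could also be done at the call site, along these lines (a sketch; celluloid does not do this for you):

require "timeout"

begin
  timeout(waittime) { @sock.wait_readable }
rescue Celluloid::Task::TimeoutError
  # surface the actor-level timeout as the stdlib exception
  # that the client code already rescues
  raise Timeout::Error, "read timed out"
end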

Maybe we can continue discussing it here or in a separate thread, whichever is best for you. From my side, since the main celluloid issue seems fixed, I assume we can close the issue once we have an official release. Thx for the contributions :)

@ioquatix
Contributor

ioquatix commented Jun 4, 2014

@TiagoCardoso1983 That is good news and I'm glad that the issue has been resolved.

Here is how I do IO with timeout: https://github.com/ioquatix/rubydns/blob/celluloid/lib/rubydns/resolver.rb#L98-L100

Basically, my understanding of celluloid is that using IO::select directly would cause problems: it will block the entire actor from processing tasks. It shouldn't block the entire process, though; other actors would continue to run on their own threads.

Regarding how to fix this issue, perhaps it would be possible for Celluloid to monkey patch the SSH code such that IO::select redirects to the current actor's IO reactor. I'm not sure if this is the best option so perhaps @tarcieri can comment on this too. In any case, I'd suggest we close this thread and open a new issue specifically for this case - do you want to do that?

@HoneyryderChuck
Author

I think @tarcieri will not like it, as we had a similar discussion regarding timeouts (hence the vaporware timeout gem reference). In a better world, such Ruby APIs would have opt-in handles for us to replace at will, but that's not the world we live in :)

I'll wait for whatever @tarcieri has to say before we open this new thread; it only makes sense if there is an objective path to a solution.

@ioquatix
Contributor

ioquatix commented Jun 4, 2014

How many connections do you need in parallel? Can you spawn one actor per connection?

@HoneyryderChuck
Author

The number of connections varies (1 - n). Usually it batch-connects and does work on all of them. For this reason I went with one actor per available CPU core, distributing the connections among them and supervising them all. I took this decision since the Ruby VM maps to native threads and doesn't control the context switching à la Erlang or Go (I don't know how right I am about this).

@ioquatix
Contributor

ioquatix commented Jun 4, 2014

@TiagoCardoso1983 My advice right now would be to use one actor per connection (i.e. N actors) and let Ruby schedule the IO::select calls internally. Unless N > 1000, you'll probably be fine.

The alternative is to cap N at, say, 100, then allocate (number of connections) / N jobs to each actor, and have them go through one at a time.

Does that make sense?

I tried a similar thing in RubyDNS, where I was doing over 5000 UDP requests in parallel. At some point you start to run into limitations of the networking stack/OS doing too much in parallel, so I batched it into groups of 1000 requests at a time.
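
A rough sketch of the capped-pool alternative (ConnectionWorker, hosts, and the pool size are illustrative names, not Celluloid API):

POOL_SIZE = 100
workers = Array.new(POOL_SIZE) { ConnectionWorker.new }

# split the hosts into at most POOL_SIZE batches and hand one batch
# to each worker; every worker processes its batch sequentially
hosts.each_slice((hosts.size.to_f / POOL_SIZE).ceil)
     .each_with_index do |batch, i|
  workers[i].async.process(batch)
end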

@HoneyryderChuck
Author

Hmm, but if you are using one actor per connection, why Celluloid::IO? Its benefit is having one actor handle multiple connections in its thread, in a quasi-green-threads scenario. The case you describe could be done using plain Celluloid, letting the single connection in each thread block on IO, right?

The batch scenario is something that will have to be done eventually anyway; you're right that the OS does have limitations on the number of open sockets.

@tarcieri
Member

tarcieri commented Jun 5, 2014

@TiagoCardoso1983 the main reason to use Celluloid::IO even in an actor-per-connection scenario is so you can multiplex I/O events with actor messages.

However, keep in mind that each Celluloid::IO actor requires 2 file descriptors, as it uses a pipe for wakeup.

@HoneyryderChuck
Author

@tarcieri I had forgotten that IO in plain Celluloid does block the mailbox, thx for reminding me. But still, it does make sense to distribute the load across the available cores and let each actor/core handle n connections, right?

@tarcieri
Member

tarcieri commented Jun 6, 2014

You can use more than N threads for N cores if you're just doing blocking I/O.

If you're using some library with a lot of weird interaction with Timeout, for now I'd say just use blocking I/O. Maybe in the future we can get a better solution for Timeout.

@HoneyryderChuck
Author

I rewrote the logic so that it doesn't rely on the timeout-dependent feature (the "waittime" option), even though this fix makes it possible to use.

@ioquatix
Contributor

I believe this issue can now be closed; it was resolved by my patches to timers/celluloid. The issue I just brought up appears to be a new/different one.

@HoneyryderChuck
Author

Solved in #122
