ArgumentError: this IO is already registered with selector #121

ioquatix · 2014-10-23T16:11:00Z

Can't for the life of me figure out why this is happening, albeit very rarely:

ArgumentError: this IO is already registered with selector
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `register'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `wait'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:21:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io.rb:53:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:13:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:28:in `rescue in recvfrom'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:20:in `recvfrom'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:151:in `try_udp_server'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:124:in `try_server'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:99:in `block (2 levels) in dispatch_request'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid.rb:407:in `block in timeout'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/actor.rb:252:in `timeout'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid.rb:406:in `timeout'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:98:in `block in dispatch_request'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:208:in `block in each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:205:in `each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:205:in `each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:92:in `dispatch_request'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:62:in `query'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:68:in `block in addresses_for'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:67:in `times'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:67:in `addresses_for'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:63:in `dispatch'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'

e.g. https://travis-ci.org/ioquatix/rubydns/jobs/38832552

I believe we may have discussed this a bit here too: #101 but I'm raising a new issue since that stuff is all so old.

The text was updated successfully, but these errors were encountered:

ioquatix · 2014-10-23T16:19:11Z

The function that causes the failure is here:

https://github.com/ioquatix/rubydns/blob/7cedb543ab7d28561ff0f4eed509fdfab7ca5199/lib/rubydns/resolver.rb#L146-L162

Not sure how that could possibly be failing.. perhaps the ensure block? Some kind of weird flow control due to a timeout somewhere?

ioquatix · 2014-10-23T16:26:09Z

Okay, yeah, might be exactly the same issue as #101, because

https://github.com/ioquatix/rubydns/blob/7cedb543ab7d28561ff0f4eed509fdfab7ca5199/lib/rubydns/resolver.rb#L98-L100

Hmmmmmm...

ioquatix · 2014-10-23T16:29:40Z

Ah, damn it, it's 5:30AM already.

ioquatix · 2014-10-24T10:14:45Z

I believe there is a bug in celluloid-io where sockets are not being deregistered correctly.

I made a test case in RubyDNS which mostly hits the error condition systematically: https://github.com/ioquatix/rubydns/blob/master/spec/rubydns/celluloid_bug_spec.rb

I augmented celluloid-io with some debugging output:

      # Wait for the given IO operation to complete
      def wait(io, set)
        # zomg ugly type conversion :(
        unless io.is_a?(::IO) or io.is_a?(OpenSSL::SSL::SSLSocket)
          if io.respond_to? :to_io
            io = io.to_io
          elsif ::IO.respond_to? :try_convert
            io = ::IO.try_convert(io)
          end

          raise TypeError, "can't convert #{io.class} into IO" unless io.is_a?(::IO)
        end

        puts "-> wait(#{io} -> #{io.inspect}, #{set})"
        monitor = @selector.register(io, set)
        monitor.value = Task.current
        Task.suspend :iowait        
        puts "<- wait(#{io}, #{set})"
      end

I checked several times and I had the argument error but never noticed entering @selector.register more than once, i.e. -> wait and <- wait were balanced for every socket, but still got error.

I added @selector.deregister io after Task.suspend and everything is working correctly now.

I'm going to assume that somehow, the register (https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L201-L232) function needs to be matched with deregister (https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L241-L257) in order to work correctly in all conditions?

Perhaps you can give me some feedback on what is going on here.

ioquatix · 2014-10-28T10:04:51Z

@tarcieri Can you please check this?

tarcieri · 2014-10-28T17:36:30Z

@ioquatix oh, I see what might be happening. Yeah, you're right, except:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L51

monitor.close gets called there, except if something else resumes the task first, it won't be

We should probably move monitor.close out of the reactor, and add it to an ensure block after Task.suspend

ioquatix · 2014-10-28T21:39:57Z

Okay, do you want me to submit a PR? I'm pretty keen to get this sorted as it's causing bugs in the latest RubyDNS release.

tarcieri · 2014-10-28T21:43:07Z

@ioquatix sure! Delete the monitor.close out of the reactor, and close it in #wait

ioquatix · 2014-10-28T22:25:09Z

I tried adding it to an ensure block like you said but Ruby segfaults. Removing the ensure block, but just leaving it at the end (i.e. monitor.close as the last line in #wait) appears to work fine. Any ideas?

ioquatix · 2014-10-29T09:07:33Z

Sorry to bump this (don't mean to be annoying), but have you made any progress on this?

tarcieri · 2014-10-29T15:26:08Z

I haven't had a chance to take a look

ioquatix · 2014-11-05T08:34:48Z

@tarcieri Did you get a chance to take a look?

tarcieri · 2014-11-05T17:12:40Z

no

ioquatix · 2014-11-19T10:45:59Z

bump :)

tarcieri · 2014-11-19T18:14:59Z

No progress, sorry

HoneyryderChuck · 2014-11-27T11:09:56Z

Bumped into it as well. It seems #101 is not as fixed as we thought... Here is a small gist:

# ran with ruby 2.0.0
require 'celluloid/io'                                      

class Dad                                                   
  include Celluloid::IO                                     

  def initialize                                            
    @sock = TCPSocket.new("localhost", 22)                  
  end                                                       

  def wiriwiri                                              
    begin                                                   
      timeout(2) do                                         
        loop do                                             
          @sock.wait_readable                               
        end                                                 
      end                                                   
    rescue                                                  
      @sock.wait_readable                                   
    end                                                     
  end                                                       
end                                                         

Dad.new.wiriwiri

I theory, this script shouldn't blow. I force the endless state of waiting on socket to be readable. When the timeout happens, the socket should be fred from the reactor. But this is not happening. My "theory": the timeout failed message doesn't get prioritized in the mailbox, or does not trigger the reactor select. And the next evaluated message uses the socket which is still under the reactor registered collection.

One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors. This is how most evented implementations handle timeouts: they IO::select for a certain time period. An example:

https://github.com/net-ssh/net-ssh/blob/043bbca9b65cb7f8d862371f66cd4e163287291b/lib/net/ssh/transport/packet_stream.rb#L75
https://github.com/net-ssh/net-ssh/blob/441d07882645114b5acc764d2b0b83930c6cacd7/lib/net/ssh/connection/session.rb#L209

First example checks whether socket is available for reading by passing 0. Second example waits for a definite period on descriptors.

So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.

My proposal would be to pass an optional 3rd parameter to the #wait method: a timeout. when not passed (nil), it defaults to the current behaviour and waits indefinitely (like IO::select). when passed a parameter, it times out, deregisters the socket and return true/false, whether socket is available or not.

Actually, @ioquatix and me have been using the timeout feature because this API is not available (correct me if I'm wrong).

Let me know what you think. I think this would mean some changes on nio4r, probably. Anyway, the "recover from timeout" thing must succeed first.

HoneyryderChuck · 2014-11-27T11:16:46Z

@ioquatix, I used the monitor.close call in the ensure block and it worked well for me. Can you troubleshoot exactly where does the segfault come from? My reasoning for it is: if an exception occurs while the fiber is paused, what happens to it? Does it get resumed? Can we say for sure that it will always safely close?

HoneyryderChuck · 2014-11-27T14:32:54Z

@ioquatix , found one way to "force" the stackdump when the monitor.close is in the ensure block: when on timeout loop, you send a SIGTERM to the process it.

Also found out what I wanted: if not inside the ensure block, the time out will not jump to the instructions after task suspending, hence it will not close the monitor and fail with "ArgumentError: this IO is already registered with selector".

HoneyryderChuck · 2014-11-27T14:36:44Z

here's the stack dump (the relevant nio4r part):

...
7f1b01525000-7f1b017a7000 rw-p 00000000 00:00 0
7f1b017a7000-7f1b017b6000 r-xp 00000000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b017b6000-7f1b019b5000 ---p 0000f000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b5000-7f1b019b6000 rw-p 0000e000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b6000-7f1b019b7000 rw-p 00000000 00:00 0
7f1b019b7000-7f1b019ba000 r-xp 00000000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b019ba000-7f1b01bb9000 ---p 00003000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bb9000-7f1b01bba000 rw-p 00002000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bba000-7f1b01bbb000 r-xp 00000000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01bbb000-7f1b01dba000 ---p 00001000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dba000-7f1b01dbb000 rw-p 00000000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dbb000-7f1b01dc0000 r-xp 00000000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01dc0000-7f1b01fc0000 ---p 00005000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc0000-7f1b01fc1000 rw-p 00005000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc1000-7f1b02006000 r-xp 00000000 fd:00 2093157                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/openssl.sozsh: abort      bundle exec ruby timeout_script.rb

@tarcieri , if this might be helpful.

HoneyryderChuck · 2014-11-27T15:02:27Z

Correction: "timers" seems to be the issue (not handling SIGTERM properly on some c extension??)

tarcieri · 2014-11-29T20:42:47Z

One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors.

...and the first thing anyone wants to do is wrap this up into an event loop that can calculate that timeout based on any number of potential timers.

So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.

Celluloid is a full-featured event system complete with timers, and expects you to use that functionality to implement timeout handling.

Adding timeouts with Celluloid::IO (now that they actually work) is as simple as:

timeout(n) { io_operation_here }

Timeouts are handled by the reactor:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L48

HoneyryderChuck · 2014-12-01T00:14:34Z

@tarcieri , not what I meant at all. I'm fully aware that the reactor handles the timeouts, I meant more of a way of "replacing" native calls with celluloid calls to select, like in the net-ssh example, which would "in theory" involve overwriting the ruby_compat module. This could be then applied to other "native" libraries that use the IO::select approach. But I don't want to move the attention away from the bug above :) Tomorrow I'll get back to this.

ioquatix · 2014-12-01T01:07:34Z

Hi Guys, I suggest we have a group Google Hangout to bump heads on this issue and see if we can resolve it. Are you interested?

ioquatix · 2014-12-01T01:08:10Z

@TiagoCardoso1983 can you please confirm you were using Timers 4.0.1 for your example above?

ioquatix · 2014-12-01T01:35:58Z

@TiagoCardoso1983 I tend to agree with @tarcieri on the timer issue, explicit timeouts are a real pain especially with nested APIs. In RubyDNS, this is used in the resolver:

https://github.com/ioquatix/rubydns/blob/master/lib/rubydns/resolver.rb#L101-L103

In this example, it would be a pain to track the timeout through several read/write calls.

HoneyryderChuck · 2014-12-01T09:54:01Z

@ioquatix , I mostly agree with both, I'm just getting frustrated with net-ssh (another topic, another already existing issue ;) ). But I've been working around it.

Concerning this, yes, I confirm it was the timers 4.0.1 (and nio4r 1.0.1) It seems by the stack dump it is coming from either hitimes or the nio4r libev extensions.

HoneyryderChuck · 2014-12-01T14:41:54Z

Forgot also to had this missing dump:

HoneyryderChuck · 2014-12-01T14:42:40Z

-- C level backtrace information -------------------------------------------
ruby() [0x523d05]
ruby() [0x56b7b6]
ruby(rb_bug+0xb3) [0x56b953]
ruby() [0x4b0bc0]
/lib64/libpthread.so.0(+0xf710) [0x7fe658d18710]
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(ev_io_stop+0x53) [0x7fe64e4baa93] ../libev/ev.c:3659
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(+0x4bcc) [0x7fe64e4b8bcc] monitor.c:130
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x521689]
ruby(rb_rescue2+0x15b) [0x41546b]
ruby() [0x50e319]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x519f23]
ruby() [0x51abc5]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x51226f]
ruby() [0x52fb73]
/lib64/libc.so.6(+0x43bf0) [0x7fe6580f1bf0]

-- Other runtime information -----------------------------------------------

seems also relevant, even though I must admit I have 0 experience with C extensions.

HoneyryderChuck · 2014-12-01T15:34:06Z

The offender might then have been this line:

https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659

@tarcieri , shouldn't the assert call "bubble up" to ruby?

Asmod4n · 2014-12-01T16:29:32Z

asserts get disabled when compiled with -DN(O?)DEBUG

Von einem mobilen Gerät gesendet

Am 01.12.2014 um 16:34 schrieb Tiago notifications@github.com:

The offender might then have been this line:

https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659

@tarcieri , shouldn't the assert call "bubble up" to ruby?

—
Reply to this email directly or view it on GitHub.

ioquatix · 2014-12-03T00:35:33Z

@TiagoCardoso1983 my proposed changes fix the problem https://github.com/ioquatix/celluloid-io/commits/fix-monitor but I'm not convinced it is the correct solution. Can you please try it out and let me know what you think? I can run your example given above and it works. Correctly? Not sure about that, but it doesn't blow up.

ioquatix · 2014-12-03T00:39:46Z

Specifically this commit: ioquatix@6eb2bd8

HoneyryderChuck · 2014-12-04T12:45:01Z

@ioquatix it does correct the segfault. I do get an IOError: selector is closed, which is probably ok (I don't see where the socket gets closed though). I'd say it's good to go!

tarcieri · 2014-12-04T18:52:28Z

So the ensure block triggered the segfault?

Does that mean that one of the C extensions isn't handling ensure properly?

tarcieri · 2014-12-04T21:53:13Z

We could merge ioquatix@6eb2bd8 for now and investigate adding an ensure later

ioquatix · 2014-12-04T22:20:14Z

My proposed solution using ensure is ioquatix@a6a8ad7

ioquatix · 2014-12-04T22:21:41Z

I think merging this but not releasing it might be a good first step. We can test it. I believe that we need to make some specific tests to hit this case so we don't have regressions in the future. We also want to make sure this is the correct logic. I just moved lines around to see if it would work based on @tarcieri's advice, but I don't actually know WHY it fixed the issue completely, or if it HAS fixed the issue..

tarcieri · 2014-12-04T22:23:12Z

Regardless of why it corrects the problem, it's still a better, easier-to-understand factoring, IMO

tarcieri · 2014-12-04T22:23:21Z

Want to open a PR?

ioquatix · 2014-12-04T22:24:11Z

Okay cap'in, I'll open a PArrrrrgh. Yarrrr.

ioquatix mentioned this issue Oct 23, 2014

Please tag v1.x.x socketry/rubydns#28

Closed

ioquatix mentioned this issue Dec 5, 2014

Ensure monitor is closed after task is resumed, fixes #121. #122

Merged

tarcieri closed this as completed in aebd2d6 Dec 5, 2014

mighe mentioned this issue Apr 12, 2015

ArgumentError: this IO is already registered with selector kenichi/angelo#53

Open

ArgumentError: this IO is already registered with selector #121

ArgumentError: this IO is already registered with selector #121

Comments

ioquatix commented Oct 23, 2014

ioquatix commented Oct 23, 2014

ioquatix commented Oct 23, 2014

ioquatix commented Oct 23, 2014

ioquatix commented Oct 24, 2014

ioquatix commented Oct 28, 2014

tarcieri commented Oct 28, 2014

ioquatix commented Oct 28, 2014

tarcieri commented Oct 28, 2014

ioquatix commented Oct 28, 2014

ioquatix commented Oct 29, 2014

tarcieri commented Oct 29, 2014

ioquatix commented Nov 5, 2014

tarcieri commented Nov 5, 2014

ioquatix commented Nov 19, 2014

tarcieri commented Nov 19, 2014

HoneyryderChuck commented Nov 27, 2014

HoneyryderChuck commented Nov 27, 2014

HoneyryderChuck commented Nov 27, 2014

HoneyryderChuck commented Nov 27, 2014

HoneyryderChuck commented Nov 27, 2014

tarcieri commented Nov 29, 2014

HoneyryderChuck commented Dec 1, 2014

ioquatix commented Dec 1, 2014

ioquatix commented Dec 1, 2014

ioquatix commented Dec 1, 2014

HoneyryderChuck commented Dec 1, 2014

HoneyryderChuck commented Dec 1, 2014

HoneyryderChuck commented Dec 1, 2014

HoneyryderChuck commented Dec 1, 2014

Asmod4n commented Dec 1, 2014

ioquatix commented Dec 3, 2014

ioquatix commented Dec 3, 2014

HoneyryderChuck commented Dec 4, 2014

tarcieri commented Dec 4, 2014

tarcieri commented Dec 4, 2014

ioquatix commented Dec 4, 2014

ioquatix commented Dec 4, 2014

tarcieri commented Dec 4, 2014

tarcieri commented Dec 4, 2014

ioquatix commented Dec 4, 2014