Skip to content
This repository has been archived by the owner on Dec 7, 2018. It is now read-only.

ArgumentError: this IO is already registered with selector #121

Closed
ioquatix opened this issue Oct 23, 2014 · 40 comments
Closed

ArgumentError: this IO is already registered with selector #121

ioquatix opened this issue Oct 23, 2014 · 40 comments

Comments

@ioquatix
Copy link
Contributor

Can't for the life of me figure out why this is happening, albeit very rarely:

ArgumentError: this IO is already registered with selector
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `register'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:42:in `wait'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/reactor.rb:21:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io.rb:53:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:13:in `wait_readable'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:28:in `rescue in recvfrom'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-io-0.16.1/lib/celluloid/io/udp_socket.rb:20:in `recvfrom'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:151:in `try_udp_server'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:124:in `try_server'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:99:in `block (2 levels) in dispatch_request'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid.rb:407:in `block in timeout'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/actor.rb:252:in `timeout'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid.rb:406:in `timeout'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:98:in `block in dispatch_request'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:208:in `block in each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:205:in `each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:205:in `each'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:92:in `dispatch_request'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:62:in `query'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:68:in `block in addresses_for'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:67:in `times'
    /home/travis/build/ioquatix/rubydns/lib/rubydns/resolver.rb:67:in `addresses_for'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `dispatch'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/calls.rb:63:in `dispatch'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/cell.rb:60:in `block in invoke'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/cell.rb:71:in `block in task'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/actor.rb:357:in `block in task'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/tasks.rb:57:in `block in initialize'
    /home/travis/.rvm/gems/ruby-2.0.0-p576/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:15:in `block in create'

e.g. https://travis-ci.org/ioquatix/rubydns/jobs/38832552

I believe we may have discussed this a bit here too: #101 but I'm raising a new issue since that stuff is all so old.

@ioquatix
Copy link
Contributor Author

The function that causes the failure is here:

https://github.com/ioquatix/rubydns/blob/7cedb543ab7d28561ff0f4eed509fdfab7ca5199/lib/rubydns/resolver.rb#L146-L162

Not sure how that could possibly be failing.. perhaps the ensure block? Some kind of weird flow control due to a timeout somewhere?

@ioquatix
Copy link
Contributor Author

Okay, yeah, might be exactly the same issue as #101, because

https://github.com/ioquatix/rubydns/blob/7cedb543ab7d28561ff0f4eed509fdfab7ca5199/lib/rubydns/resolver.rb#L98-L100

Hmmmmmm...

@ioquatix
Copy link
Contributor Author

Ah, damn it, it's 5:30AM already.

@ioquatix
Copy link
Contributor Author

I believe there is a bug in celluloid-io where sockets are not being deregistered correctly.

I made a test case in RubyDNS which mostly hits the error condition systematically: https://github.com/ioquatix/rubydns/blob/master/spec/rubydns/celluloid_bug_spec.rb

I augmented celluloid-io with some debugging output:

      # Wait for the given IO operation to complete
      def wait(io, set)
        # zomg ugly type conversion :(
        unless io.is_a?(::IO) or io.is_a?(OpenSSL::SSL::SSLSocket)
          if io.respond_to? :to_io
            io = io.to_io
          elsif ::IO.respond_to? :try_convert
            io = ::IO.try_convert(io)
          end

          raise TypeError, "can't convert #{io.class} into IO" unless io.is_a?(::IO)
        end

        puts "-> wait(#{io} -> #{io.inspect}, #{set})"
        monitor = @selector.register(io, set)
        monitor.value = Task.current
        Task.suspend :iowait        
        puts "<- wait(#{io}, #{set})"
      end

I checked several times and I had the argument error but never noticed entering @selector.register more than once, i.e. -> wait and <- wait were balanced for every socket, but still got error.

I added @selector.deregister io after Task.suspend and everything is working correctly now.

I'm going to assume that somehow, the register (https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L201-L232) function needs to be matched with deregister (https://github.com/celluloid/nio4r/blob/3e4e8cb90856502c84e896fb667cb407ff68d353/ext/nio4r/selector.c#L241-L257) in order to work correctly in all conditions?

Perhaps you can give me some feedback on what is going on here.

@ioquatix
Copy link
Contributor Author

@tarcieri Can you please check this?

@tarcieri
Copy link
Member

@ioquatix oh, I see what might be happening. Yeah, you're right, except:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L51

monitor.close gets called there, except if something else resumes the task first, it won't be

We should probably move monitor.close out of the reactor, and add it to an ensure block after Task.suspend

@ioquatix
Copy link
Contributor Author

Okay, do you want me to submit a PR? I'm pretty keen to get this sorted as it's causing bugs in the latest RubyDNS release.

@tarcieri
Copy link
Member

@ioquatix sure! Delete the monitor.close out of the reactor, and close it in #wait

@ioquatix
Copy link
Contributor Author

I tried adding it to an ensure block like you said but Ruby segfaults. Removing the ensure block, but just leaving it at the end (i.e. monitor.close as the last line in #wait) appears to work fine. Any ideas?

@ioquatix
Copy link
Contributor Author

Sorry to bump this (don't mean to be annoying), but have you made any progress on this?

@tarcieri
Copy link
Member

I haven't had a chance to take a look

@ioquatix
Copy link
Contributor Author

ioquatix commented Nov 5, 2014

@tarcieri Did you get a chance to take a look?

@tarcieri
Copy link
Member

tarcieri commented Nov 5, 2014

no

@ioquatix
Copy link
Contributor Author

bump :)

@tarcieri
Copy link
Member

No progress, sorry

@HoneyryderChuck
Copy link

Bumped into it as well. It seems #101 is not as fixed as we thought... Here is a small gist:

# ran with ruby 2.0.0
require 'celluloid/io'                                      

class Dad                                                   
  include Celluloid::IO                                     

  def initialize                                            
    @sock = TCPSocket.new("localhost", 22)                  
  end                                                       

  def wiriwiri                                              
    begin                                                   
      timeout(2) do                                         
        loop do                                             
          @sock.wait_readable                               
        end                                                 
      end                                                   
    rescue                                                  
      @sock.wait_readable                                   
    end                                                     
  end                                                       
end                                                         

Dad.new.wiriwiri                                            

I theory, this script shouldn't blow. I force the endless state of waiting on socket to be readable. When the timeout happens, the socket should be fred from the reactor. But this is not happening. My "theory": the timeout failed message doesn't get prioritized in the mailbox, or does not trigger the reactor select. And the next evaluated message uses the socket which is still under the reactor registered collection.

One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors. This is how most evented implementations handle timeouts: they IO::select for a certain time period. An example:

https://github.com/net-ssh/net-ssh/blob/043bbca9b65cb7f8d862371f66cd4e163287291b/lib/net/ssh/transport/packet_stream.rb#L75
https://github.com/net-ssh/net-ssh/blob/441d07882645114b5acc764d2b0b83930c6cacd7/lib/net/ssh/connection/session.rb#L209

First example checks whether socket is available for reading by passing 0. Second example waits for a definite period on descriptors.

So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.

My proposal would be to pass an optional 3rd parameter to the #wait method: a timeout. when not passed (nil), it defaults to the current behaviour and waits indefinitely (like IO::select). when passed a parameter, it times out, deregisters the socket and return true/false, whether socket is available or not.

Actually, @ioquatix and me have been using the timeout feature because this API is not available (correct me if I'm wrong).

Let me know what you think. I think this would mean some changes on nio4r, probably. Anyway, the "recover from timeout" thing must succeed first.

@HoneyryderChuck
Copy link

@ioquatix, I used the monitor.close call in the ensure block and it worked well for me. Can you troubleshoot exactly where does the segfault come from? My reasoning for it is: if an exception occurs while the fiber is paused, what happens to it? Does it get resumed? Can we say for sure that it will always safely close?

@HoneyryderChuck
Copy link

@ioquatix , found one way to "force" the stackdump when the monitor.close is in the ensure block: when on timeout loop, you send a SIGTERM to the process it.

Also found out what I wanted: if not inside the ensure block, the time out will not jump to the instructions after task suspending, hence it will not close the monitor and fail with "ArgumentError: this IO is already registered with selector".

@HoneyryderChuck
Copy link

here's the stack dump (the relevant nio4r part):

...
7f1b01525000-7f1b017a7000 rw-p 00000000 00:00 0
7f1b017a7000-7f1b017b6000 r-xp 00000000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b017b6000-7f1b019b5000 ---p 0000f000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b5000-7f1b019b6000 rw-p 0000e000 00:1d 9442576991091956877        /home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so
7f1b019b6000-7f1b019b7000 rw-p 00000000 00:00 0
7f1b019b7000-7f1b019ba000 r-xp 00000000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b019ba000-7f1b01bb9000 ---p 00003000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bb9000-7f1b01bba000 rw-p 00002000 00:1d 12052790693188749692       /home/taacati1/sem/.bundle/ruby/1.9.1/gems/hitimes-1.2.2/lib/hitimes/1.9/hitimes.so
7f1b01bba000-7f1b01bbb000 r-xp 00000000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01bbb000-7f1b01dba000 ---p 00001000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dba000-7f1b01dbb000 rw-p 00000000 fd:00 2093087                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/fiber.so
7f1b01dbb000-7f1b01dc0000 r-xp 00000000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01dc0000-7f1b01fc0000 ---p 00005000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc0000-7f1b01fc1000 rw-p 00005000 fd:00 2093153                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/strscan.so
7f1b01fc1000-7f1b02006000 r-xp 00000000 fd:00 2093157                    /opt/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/x86_64-linux/openssl.sozsh: abort      bundle exec ruby timeout_script.rb

@tarcieri , if this might be helpful.

@HoneyryderChuck
Copy link

Correction: "timers" seems to be the issue (not handling SIGTERM properly on some c extension??)

@tarcieri
Copy link
Member

One flaw IMHO from the celluloid io API when comparing to the native ruby implementation is that: IO::Select's last parameter is a timeout for waiting on descriptors.

...and the first thing anyone wants to do is wrap this up into an event loop that can calculate that timeout based on any number of potential timers.

So, this is the main thing about it: celluloid-io doesn't allow me to wait on a socket for a definite period of time.

Celluloid is a full-featured event system complete with timers, and expects you to use that functionality to implement timeout handling.

Adding timeouts with Celluloid::IO (now that they actually work) is as simple as:

timeout(n) { io_operation_here }

Timeouts are handled by the reactor:

https://github.com/celluloid/celluloid-io/blob/master/lib/celluloid/io/reactor.rb#L48

@HoneyryderChuck
Copy link

@tarcieri , not what I meant at all. I'm fully aware that the reactor handles the timeouts, I meant more of a way of "replacing" native calls with celluloid calls to select, like in the net-ssh example, which would "in theory" involve overwriting the ruby_compat module. This could be then applied to other "native" libraries that use the IO::select approach. But I don't want to move the attention away from the bug above :) Tomorrow I'll get back to this.

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 1, 2014

Hi Guys, I suggest we have a group Google Hangout to bump heads on this issue and see if we can resolve it. Are you interested?

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 1, 2014

@TiagoCardoso1983 can you please confirm you were using Timers 4.0.1 for your example above?

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 1, 2014

@TiagoCardoso1983 I tend to agree with @tarcieri on the timer issue, explicit timeouts are a real pain especially with nested APIs. In RubyDNS, this is used in the resolver:

https://github.com/ioquatix/rubydns/blob/master/lib/rubydns/resolver.rb#L101-L103

In this example, it would be a pain to track the timeout through several read/write calls.

@HoneyryderChuck
Copy link

@ioquatix , I mostly agree with both, I'm just getting frustrated with net-ssh (another topic, another already existing issue ;) ). But I've been working around it.

Concerning this, yes, I confirm it was the timers 4.0.1 (and nio4r 1.0.1) It seems by the stack dump it is coming from either hitimes or the nio4r libev extensions.

@HoneyryderChuck
Copy link

Forgot also to had this missing dump:

@HoneyryderChuck
Copy link

-- C level backtrace information -------------------------------------------
ruby() [0x523d05]
ruby() [0x56b7b6]
ruby(rb_bug+0xb3) [0x56b953]
ruby() [0x4b0bc0]
/lib64/libpthread.so.0(+0xf710) [0x7fe658d18710]
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(ev_io_stop+0x53) [0x7fe64e4baa93] ../libev/ev.c:3659
/home/taacati1/sem/.bundle/ruby/1.9.1/gems/nio4r-1.0.1/lib/nio4r_ext.so(+0x4bcc) [0x7fe64e4b8bcc] monitor.c:130
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x521689]
ruby(rb_rescue2+0x15b) [0x41546b]
ruby() [0x50e319]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x519f23]
ruby() [0x51abc5]
ruby() [0x51267e]
ruby() [0x514d6d]
ruby() [0x5189c8]
ruby() [0x51226f]
ruby() [0x52fb73]
/lib64/libc.so.6(+0x43bf0) [0x7fe6580f1bf0]

-- Other runtime information -----------------------------------------------

seems also relevant, even though I must admit I have 0 experience with C extensions.

@HoneyryderChuck
Copy link

The offender might then have been this line:

https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659

@tarcieri , shouldn't the assert call "bubble up" to ruby?

@Asmod4n
Copy link
Contributor

Asmod4n commented Dec 1, 2014

asserts get disabled when compiled with -DN(O?)DEBUG

Von einem mobilen Gerät gesendet

Am 01.12.2014 um 16:34 schrieb Tiago notifications@github.com:

The offender might then have been this line:

https://github.com/celluloid/nio4r/blob/master/ext/libev/ev.c#L3659

@tarcieri , shouldn't the assert call "bubble up" to ruby?


Reply to this email directly or view it on GitHub.

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 3, 2014

@TiagoCardoso1983 my proposed changes fix the problem https://github.com/ioquatix/celluloid-io/commits/fix-monitor but I'm not convinced it is the correct solution. Can you please try it out and let me know what you think? I can run your example given above and it works. Correctly? Not sure about that, but it doesn't blow up.

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 3, 2014

Specifically this commit: ioquatix@6eb2bd8

@HoneyryderChuck
Copy link

@ioquatix it does correct the segfault. I do get an IOError: selector is closed, which is probably ok (I don't see where the socket gets closed though). I'd say it's good to go!

@tarcieri
Copy link
Member

tarcieri commented Dec 4, 2014

So the ensure block triggered the segfault?

Does that mean that one of the C extensions isn't handling ensure properly?

@tarcieri
Copy link
Member

tarcieri commented Dec 4, 2014

We could merge ioquatix@6eb2bd8 for now and investigate adding an ensure later

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 4, 2014

My proposed solution using ensure is ioquatix@a6a8ad7

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 4, 2014

I think merging this but not releasing it might be a good first step. We can test it. I believe that we need to make some specific tests to hit this case so we don't have regressions in the future. We also want to make sure this is the correct logic. I just moved lines around to see if it would work based on @tarcieri's advice, but I don't actually know WHY it fixed the issue completely, or if it HAS fixed the issue..

@tarcieri
Copy link
Member

tarcieri commented Dec 4, 2014

Regardless of why it corrects the problem, it's still a better, easier-to-understand factoring, IMO

@tarcieri
Copy link
Member

tarcieri commented Dec 4, 2014

Want to open a PR?

@ioquatix
Copy link
Contributor Author

ioquatix commented Dec 4, 2014

Okay cap'in, I'll open a PArrrrrgh. Yarrrr.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants