
0.16.1? #457

Closed · mperham opened this issue Sep 10, 2014 · 22 comments

Comments

@mperham (Contributor) commented Sep 10, 2014

Sidekiq users have reported several showstopper bugs in 0.16.

  1. Sidekiq no longer shuts down on TERM: Stucking in Stoping sidekiq/sidekiq#1929
  2. Queues don't process quickly: Takes a long time for enqueued jobs to enter the busy tab sidekiq/sidekiq#1937
  3. Memory leak: Potential memory leak when running sidekiq with jRuby sidekiq/sidekiq#1934

I'll probably release 3.2.5 today which locks to 0.15.2. In the meantime, please update this issue if there's a new release which I can smoke test for you.
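
For anyone following along, "locks to 0.15.2" corresponds to a hard dependency pin along these lines (a sketch, not the actual sidekiq.gemspec):

  # Hypothetical gemspec line: pin Celluloid to the last known-good release
  spec.add_dependency 'celluloid', '= 0.15.2'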

@tarcieri (Member)

Yeah, locking to 0.15.2 is definitely the best course of action right now.

I'll be releasing timers 4.0.1 soon, which should address item 3 above (the memory leak).

@mperham (Contributor, Author) commented Sep 10, 2014

Regarding 1, when I Ctrl-C Sidekiq, it says this as expected but does not actually exit:

2014-09-10T17:38:02.864Z 2504 TID-owe9u406g INFO: Shutting down
2014-09-10T17:38:04.133Z 2504 TID-owe9yp1oc INFO: Shutting down 25 quiet workers

When I send QUIT to the process after that, I get this output:

2014-09-10T17:38:42.916Z 2504 TID-owe9yo8k0 WARN: {:context=>"Launcher#stop"}
2014-09-10T17:38:42.917Z 2504 TID-owe9yo8k0 WARN: task was terminated
2014-09-10T17:38:42.917Z 2504 TID-owe9yo8k0 WARN: /Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/tasks/task_fiber.rb:34:in `terminate'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/actor.rb:345:in `each'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/actor.rb:345:in `cleanup'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/actor.rb:329:in `shutdown'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/actor.rb:164:in `run'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/actor.rb:130:in `block in start'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/thread_handle.rb:13:in `block in initialize'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/actor_system.rb:32:in `block in get_thread'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/internal_pool.rb:130:in `call'
/Users/mike/.gem/ruby/2.1.2/gems/celluloid-0.16.0/lib/celluloid/internal_pool.rb:130:in `block in create'

@tarcieri (Member)

@mperham I'll take a look.

That said, I'm getting reports that timers 4.0.1 addressed the memory leak (see #455)

@ioquatix (Contributor)

Yes, you should definitely be using timers 4.0.1 IMHO.

@tarcieri (Member)

@mperham I haven't had time to dig into the Sidekiq INTR problem. Is there a corresponding Sidekiq issue specifically for that, and can we publicize it to get people digging into it? I'd like to get Sidekiq on 0.16.

@mperham (Contributor, Author) commented Oct 12, 2014

There are a number of closed ones. Feel free to open a new one if you'd like.


@ioquatix (Contributor)

@tarcieri What's the timeline for 0.16.1? Anything that I can help with?

@tarcieri (Member)

@ioquatix AFAIK, the biggest problem right now is Sidekiq shutdown

@ioquatix (Contributor)

What is the issue #?

@tarcieri (Member)

Per @mperham, sounds like there should be a new one. Backtrace is earlier in this ticket.

@petergoldstein (Contributor)

@mperham and @tarcieri - I've done some research on the Sidekiq hang on exit tonight and I wanted to share my results and get your thoughts. I started by following the procedure @mperham outlined in #466 and experimented with various changes to Sidekiq and Celluloid. Conclusions follow.

The source of the hang appears to be these two lines - https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/launcher.rb#L51-L52

The hang on exit is occurring on the second line - the wait call. Makes sense.

Now if I change this line - https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/manager.rb#L50 to

@ready.each do |x|
  x.terminate if x.alive?
  sleep 0.01 # slow the loop so the shutdown event fires after the wait begins
end

then the Sidekiq process will exit on a Ctrl-C. That means the underlying problem looks like a race condition to me.

With the sleep 0.01, the worker termination loop takes longer to run, so the fire_event(:shutdown) in the manager happens after the invoking thread has called wait(:shutdown) on line 52 of launcher.rb.
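
The lost-wakeup pattern being described can be reproduced in miniature with plain Ruby threads (a sketch, not Sidekiq's code): if the signal fires before the wait begins, the waiter blocks forever.

  require 'thread'

  mutex = Mutex.new
  cond  = ConditionVariable.new

  signaler = Thread.new do
    mutex.synchronize { cond.signal } # may run before the main thread waits
  end

  mutex.synchronize do
    cond.wait(mutex) # hangs forever if the signal already fired; signals are not queued
  end
  signaler.join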

If I revert line 50 of manager.rb to the GitHub version and update https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/manager.rb#L205-L208 to be

def shutdown
  requeue         # push unstarted jobs back onto the queue
  terminate       # mark the manager actor dead, so manager.alive? returns false
  signal_shutdown # wake the thread blocked in wait(:shutdown) in launcher.rb
end

then the Sidekiq process will exit on a Ctrl-C. This makes sense, because manager.alive? no longer returns true: the wait clears, and nothing hangs. I think this change to shutdown is probably correct.

I'd submit this change as a PR, but I get some odd errors in the console on exit. Specifically, I get a warning that the code attempted to call a dead actor, with stack traces like this:

WARN: attempted to call a dead actor
WARN: /Users/peter/.rbenv/versions/2.1.4/lib/ruby/gems/2.1.0/gems/celluloid-0.16.0/lib/celluloid/responses.rb:29:in `value'
/Users/peter/.rbenv/versions/2.1.4/lib/ruby/gems/2.1.0/gems/celluloid-0.16.0/lib/celluloid/calls.rb:92:in `value'
/Users/peter/.rbenv/versions/2.1.4/lib/ruby/gems/2.1.0/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:33:in `method_missing'
/Users/peter/src/sidekiq/lib/sidekiq/launcher.rb:52:in `block in stop'
/Users/peter/src/sidekiq/lib/sidekiq/util.rb:15:in `watchdog'
/Users/peter/src/sidekiq/lib/sidekiq/launcher.rb:45:in `stop'
...

Do either of you (or anyone else) understand why these warning messages would be generated in this scenario?

@ioquatix (Contributor)

@petergoldstein Awesome investigative work.

"attempted to call a dead actor" means the actor has been shut down or terminated and you tried to send it a method. I believe there is a small subset of methods you can still send to a dead actor (e.g. #hash might work).
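
Here is that behavior in miniature (a sketch against the Celluloid 0.16 API; Worker is a made-up class):

  require 'celluloid'

  class Worker
    include Celluloid
    def work; :done; end
  end

  w = Worker.new
  w.terminate
  w.alive? # => false
  w.work   # raises Celluloid::DeadActorError ("attempted to call a dead actor")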

@tarcieri (Member)

x.terminate if x.alive?

...is racy, as x could die between the time of the alive check and when it's told to terminate
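
One way to sidestep that check-then-act race (a sketch, not necessarily the fix this thread settles on) is to drop the alive? check and rescue instead:

  begin
    x.terminate
  rescue Celluloid::DeadActorError
    # the actor died between scheduling and the call; nothing left to do
  end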

@petergoldstein (Contributor)

@ioquatix Yep, I got that. My real question is why we're seeing that happen in this circumstance.

After sleeping on it I have a couple of thoughts here:

  1. I think the Celluloid 0.15.2 => 0.16.0 upgrade is something of a red herring. I think there is an underlying race condition in how Sidekiq shuts down that was hidden in 0.15.2 and exposed in 0.16.0, probably because of speed improvements in timers. I'm going to do some more testing to confirm or refute that proposition.
  2. From the Celluloid docs, it really looks like the Sidekiq Manager should be rebuilt as a Celluloid SupervisionGroup managing a pool of Sidekiq Processor actors (see the sketch after this list). I'm not 100% sure on this point, but it really looks like the natural Celluloid model for this case. The big questions I have are: 1) how (if at all) you can terminate pool members externally, as shutdown seems to depend on a finalizer, and 2) whether it is possible to send them a message to kill their threads (as Actor.kill does). It may be possible to customize the SupervisionGroup to make that happen.
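
The shape described in (2) would look roughly like this under the 0.16 SupervisionGroup API (class names are hypothetical; a sketch, not a drop-in design):

  require 'celluloid'

  class Processor
    include Celluloid
    def process(work)
      # ... perform the job ...
    end
  end

  class ProcessorGroup < Celluloid::SupervisionGroup
    # supervise a pool of Processor actors, registered under a name
    pool Processor, as: :processors, size: 25
  end

  ProcessorGroup.run!
  Celluloid::Actor[:processors].async.process(work)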

@tarcieri Yep, agreed. That's definitely a possible source of the issue. However I'd like to understand why I always get the stack trace - 100% of the time - which is not what I'd expect if the race condition were so simple. But I don't have a great understanding of Celluloid internals, so it's very possible my intuition is off here.

@petergoldstein (Contributor)

@tarcieri If you have any comments on my thoughts above - especially thought 2 - I'd appreciate any insight you'd be able to share on the best way to use Celluloid in this case.

@tarcieri (Member)

@petergoldstein I'd probably suggest an immediate fix rather than trying to make massive changes to the way Sidekiq works today.

Your suggested fix:

def shutdown
  requeue
  terminate
  signal_shutdown
end

...looks good to me, but I see how it could generate spurious logspam.

Perhaps I could switch the loglevel of calls to dead actors to DEBUG rather than WARN, since in the shutdown case they're spurious.

@mperham (Contributor, Author) commented Oct 30, 2014

All of the Celluloid shutdown issues I had to deal with make me want to change Sidekiq to be much more draconian about shutdown. That is, don't bother trying to shut down the Actors nicely: just stop processing jobs on USR1, start the shutdown timer at TERM and exit(0) once the shutdown timer expires.
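
A bare-bones sketch of that "draconian" shape (signal names from the comment above; the timeout, flag, and job-fetch method are illustrative, not Sidekiq's actual code):

  SHUTDOWN_TIMEOUT = 8 # seconds; illustrative

  quiet    = false
  deadline = nil

  trap('USR1') { quiet = true }                           # stop picking up new jobs
  trap('TERM') { deadline = Time.now + SHUTDOWN_TIMEOUT } # start the shutdown timer

  loop do
    exit(0) if deadline && Time.now >= deadline           # hard exit; skip actor teardown
    fetch_and_process_job unless quiet                    # hypothetical work method
    sleep 0.1
  end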

@petergoldstein (Contributor)

@tarcieri I agree - I think the quick fix is desirable right now, provided that the logspam doesn't indicate a real problem.

That said, I think it would be a good idea to get a sense of exactly how Sidekiq should be using Celluloid, and target any required changes for the next Sidekiq major version. Sidekiq is awesome, and I use it in most of my Ruby projects (thanks @mperham!), but even after several hours of digging in I don't entirely understand how it interacts with (or is supposed to use) Celluloid.

I think it would be advantageous on a number of levels for Sidekiq to serve as an example of Celluloid best practices. That might require some time and effort, but I think it would be beneficial for both projects. I'd be happy to contribute to that effort.

amatriain added a commit to amatriain/feedbunch that referenced this issue Nov 9, 2014
Versions 3.2.4 and below have issues when shutting down. There's also mention in the bug report of a memory leak, but I haven't been able to confirm it in this app.

The fix is to add a >= 3.2.5 condition in the Gemfile; that is the first version that fixes this issue, or at least works around it. It seems to be some problem with recent Celluloid versions, or a race condition that only manifests in recent Celluloid versions. Version 3.2.5 of Sidekiq pins version 0.15.2 of Celluloid as a dependency, which seems to make the problem go away for now.
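
The Gemfile change being described (version number from this commit message):

  gem 'sidekiq', '>= 3.2.5'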

I don't directly include Celluloid, so I'm not too worried about staying with older versions as long as Sidekiq works.

For more about this bug:

celluloid/celluloid#457

@CygnusAlpha

I'm not sure how helpful this is, but I'm still getting this with celluloid 0.15.2 and sidekiq 3.2.6.

2014-11-19T15:20:23.411Z 24639 TID-ovxvs5yuo WARN: task was terminated
2014-11-19T15:20:23.411Z 24639 TID-ovxvs5yuo WARN: /home/gitotron/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/celluloid-0.15.2/lib/celluloid/tasks/task_fiber.rb:32:in `terminate'
/home/gitotron/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/celluloid-0.15.2/lib/celluloid/actor.rb:404:in `block in cleanup'
...
2014-11-19T15:20:23.411Z 24639 TID-ovxvs5yuo WARN: Terminating task: type=:exit_handler, meta={:method_name=>:processor_died}, status=:callwait
2014-11-19T15:20:23.411Z 24639 TID-ovxvs5yuo WARN: {:context=>"Manager#processor_died died"}

@digitalextremist (Member)

@mperham && @vangberg is this still an issue for you? We've got a 0-16-stable branch right now, and we're moving forward to 0.17.0 ... unless there is a point release needed for Sidekiq or Adhearsion?

@mperham (Contributor, Author) commented Mar 28, 2015

All good.

mperham closed this as completed Mar 28, 2015
@digitalextremist (Member)

Thanks @mperham
