For review: "truly" synchronous sync calls #554

Closed · e2 opened this issue Apr 8, 2015 · 19 comments

@e2
Contributor

e2 commented Apr 8, 2015

TL;DR - seems to work well, but probably has a lot of implications

Here is the patch: e2@5e17eae

I don't particularly like it, but it seemed necessary with Threads, and it seems to help with other JRuby issues.

It bypasses the mailbox when an actor calls its own methods - this was necessary (in the cases I encountered), because there was no way to protect a "receiver" task that makes a sync call back to its sender (deadlock, because the sender is waiting for a signal, while the receiver mails a sync call to the "waiting" sender task).
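
For illustration only, a minimal, self-contained sketch of that idea using a hypothetical `TinyActor` (plain threads and queues, not Celluloid's real internals): if the calling thread is the actor's own thread, dispatch directly instead of going through the mailbox.

```ruby
require "thread"

class TinyActor
  def initialize(subject)
    @subject = subject
    @mailbox = Queue.new
    @thread  = Thread.new { loop { process(@mailbox.pop) } }
  end

  def sync_call(meth, *args)
    if Thread.current == @thread
      # Self-call: going through the mailbox would deadlock, because this
      # thread is the only one that drains it - so call directly instead.
      @subject.public_send(meth, *args)
    else
      # Normal path: enqueue the call and block until the actor thread replies.
      reply = Queue.new
      @mailbox << [meth, args, reply]
      reply.pop
    end
  end

  private

  def process((meth, args, reply))
    reply << @subject.public_send(meth, *args)
  end
end
```

A call made from another thread blocks on `reply.pop` much like a sync call blocks on its reply; a call made from the actor's own thread short-circuits the mailbox, which is what the patch does for self-calls.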

On MRI, without this line: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/tasks/task_thread.rb#L31 you get a deadlock in the Probe tests. On JRuby, though, removing it prevents the following from getting stuck:

```ruby
# Not actual code, but more or less what happens in the actor system:
SupervisionGroup.new do |group|
  group.supervise_as :notifications_fanout, Notifications::Fanout
  # (...)
end
```

Which pretty much becomes:

```ruby
class SupervisionGroup
  def initialize
    supervise_as :notifications_fanout, Notifications::Fanout
  end
end
```

On MRI, the hacky signal "pushes things forward" - but I believe that line is just a deadlock workaround, and not a solution to the sync problem (which probably doesn't exist on MRI anyway - due to how tasks are suspended and resumed everywhere).

I'm still fixing tests, but this seems to work fine.

The upside may be a small performance increase (no mailbox overhead, etc.); the downside, maybe lower responsiveness in some cases.

But then again, ideally the user should avoid sync calls unless absolutely necessary (since it potentially means locking the whole call chain).
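
For context, Celluloid's public API makes that trade-off visible: a plain method call on an actor is a blocking sync call, while the `.async` proxy returns immediately. A tiny example (assuming the celluloid gem is installed; depending on the version you may need a different require):

```ruby
require "celluloid"  # on 0.16+ you may need require "celluloid/autostart" instead

class EventLogger
  include Celluloid

  def log(msg)
    puts msg
  end
end

logger = EventLogger.new
logger.log("sync: the caller blocks until the actor has replied")
logger.async.log("async: returns immediately; no reply is awaited")
```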

Would this be ok to merge?

@tarcieri
Member

tarcieri commented Apr 8, 2015

At one point I implemented these semantics. I guess they got lost along the way?

Yes, this is definitely how it should work. Things are actually broken without it (although unfortunately I can't recall a specific case).

@digitalextremist
Member

I've been running into this more and more, so much so that I'm going to refactor supervision during 0.17.0 rather than 0.18.0. In a few hours I will come back to this with a more thorough response.

e2 mentioned this issue Apr 8, 2015
digitalextremist modified the milestone: 0.17.0 Apr 8, 2015
@digitalextremist
Member

This landed in #556, for better or worse. Reopen if worse.

@e2
Contributor Author

e2 commented Apr 11, 2015

@tarcieri - I think this is an architecture problem: an overreliance on MRI and its GIL.

I'm guessing the reason it can't be synchronous on MRI is because of potential deadlocks.

I'd say there should be an "idle", non-sleeping thread that takes control on MRI so everything else can sleep without deadlocking. Timeouts are another workaround.

I think MRI should start to behave like JRuby/Rubinius (and not the other way around).

Thoughts?

@e2
Contributor Author

e2 commented Apr 11, 2015

@digitalextremist - revert this on the 0.17.0 branch, and then purge this from the master branch in abstractive (I don't mind history being rewritten in this case if you don't either). Then let me know so I can rebase and dump this patch into a separate branch.

@digitalextremist
Member

@e2 in the referenced symptom (which I am testing a revert against) the task is losing its block context. Why do you figure deadlocking is the reason for that?

@digitalextremist
Member

I'm going to comment it out and reference the ticket for now.

@e2
Contributor Author

e2 commented Apr 11, 2015

@digitalextremist - not 100% sure it's the reason, but on MRI SyncCall calls are made in a separate task. This patch eliminated the "extra task" by blocking the caller (which is faster, because there's no task overhead) ... but on MRI it means a potential deadlock.

The MRI "cure" is presumably here (mentioned in this issue description): https://github.com/celluloid/celluloid/blob/master/lib/celluloid/tasks/task_thread.rb#L31

TL;DR - something is fundamentally broken in Celluloid with MRI-specific "bandaids".

@tarcieri
Member

@e2 can you give a timeline of the events and how you believe they're leading to a deadlock?

@digitalextremist
Member

(and why this issue is happening in MRI, JRuby, and rbx if it's centered around one Ruby)

@e2
Contributor Author

e2 commented Apr 11, 2015

@tarcieri - I was basically testing on threads (not fibers) when I created the issue.

Occasionally on JRuby the Probe tests deadlocked. It went like this:

  1. the main thread creates SupervisionGroup (from the actor system)
  2. SupervisionGroup gets :initialize in its mailbox
  3. :initialize executes the given block (with fanout and incidentreporter) ... and waits for a signal
  4. the block sends a :supervise_as message to the SupervisionGroup mailbox ...
     ... but the SupervisionGroup thread is still waiting for the signal (sketched below)
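
A runnable sketch of steps 1-4 with plain threads and queues (hypothetical names, not Celluloid code): the actor thread runs the block, the block posts a sync call to the actor's own mailbox and waits for a reply, but the only thread that could produce that reply is the one now stuck waiting.

```ruby
require "thread"

mailbox   = Queue.new
reply_box = Queue.new

actor = Thread.new do
  # steps 2-3: :initialize runs the given block, then would wait for a signal
  block = proc do
    # step 4: the block posts a sync :supervise_as to the actor's own mailbox...
    mailbox << [:supervise_as, reply_box]
    reply_box.pop # ...and blocks, waiting for a reply that never comes
  end
  block.call
  loop { mailbox.pop } # the mailbox loop that would produce the reply is never reached
end

sleep 0.5
puts actor.alive? ? "deadlocked: the actor is waiting on its own mailbox" : "ok"
actor.kill
```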

On MRI, the signal() method was called (I don't remember where the resume came from), which resumed the supervision group mailbox - which was strange, really, since executing the block should be synchronous - and once finished, the SG should then finish :initialize and resume checking the mailbox.

So it worked on MRI because of the strange "signal" method invocation (I don't remember where it came from) - without it, MRI deadlocks every time (because it can't wait for the (cond)signal and listen to the mailbox at the same time).

But removing that line fixed JRuby, because the call became synchronous (instead of deadlocking when the signal was sent at the wrong moment later).

To be honest, I'm very confused about how the suspend/resume is supposed to work on MRI.

Let me know if that helps.

The issue with the smokers example, though, may be unrelated - there may (or may not) be a circular dependency of calls there, so it needs investigating. Needs a test before anything else.

@tarcieri
Member

> On MRI, the signal() method was called (I don't remember where the resume came from), which resumed the supervision group mailbox - which was strange, really, since executing the block should be synchronous - and once finished, the SG should then finish :initialize and resume checking the mailbox.
>
> So it worked on MRI because of the strange "signal" method invocation

It sounds like some codepaths were sending messages through the mailbox. That's odd, but I'm not sure why it's alarming?

When a deadlock occurs, what's contending on what?

@e2
Contributor Author

e2 commented Apr 12, 2015

> It sounds like some codepaths were sending messages through the mailbox. That's odd, but I'm not sure why it's alarming?

What's alarming to me is that I have no idea how/why MRI got the signal (which happened before :supervise_as was called) and was able to handle the message. It may make sense on MRI to have signal() called before :supervise_as returns, but that amounts to aborting a synchronous call simply because of the GIL.

It's also kind of confusing, semantically, that initialize is treated like a regular method in Celluloid - and can actually return a value. (Maybe :initialize should have a special return value other than 'nil'? At least when tracing/debugging.)

I pretty much abandoned the issue (and moved on to keep up with the changes), but here's the state the threads were in (based on my notes):

... so ...

0x16fe was waiting for the signal (to deliver the result of the whole call chain back to its caller, 0x1700), while 0x1704 sent the synchronous message to the now-waiting 0x16fe.

This was on JRuby with InternalPool.

On MRI the situation was avoided because of the signal() call from a resume() somewhere - which freed the mutex, allowing deliver() to finish (with a nil), then handle the :supervise_as message and continue.

Shortly after I got into this, the InternalPool was "dethroned" and modified, and the task classes were changed, so I rebased to avoid effectively working "on a separate project".

I also thought this might be an InternalPool-specific issue, or a side-effect of something else - so I moved it aside.

A JRuby lockup happened 3 out of 5 times, but this specific case was 1 out of those 3. (In one of the other cases, the same Probe test got stuck in, I think, a similar way, but "later" - the TestProbe client suddenly stopped getting events as a result of the Probe being stuck - which could've been an entirely different issue altogether.)

Sorry for not being specific or creating an isolated failing example (I tried briefly) - debugging this was just too spartan an experience, and maybe the race only occurred due to puts statements being present.

So instead, I focused on an overhaul on the test side - hoping to be better prepared, with better tools and a better setup, before this issue rears its ugly head again...

@e2
Contributor Author

e2 commented Apr 12, 2015

The smokers example is fixed - I forgot to pass the block parameter (there's a test for this now).

@digitalextremist - you can merge and test this. You shouldn't be getting any more LocalJump errors.

I'm still not sure this won't deadlock in certain cases, so I can't guarantee it's ready for 0.17.

e2 added a commit to e2/celluloid that referenced this issue Apr 12, 2015
digitalextremist pushed a commit that referenced this issue Apr 12, 2015
@e2
Contributor Author

e2 commented Apr 12, 2015

TL;DR - more info about the problem and what could be causing it (based on behavior)

@tarcieri - The issue reappeared on both JRuby and Rubinius - but there's a lot of merging/rebasing ahead again, so I'll try to reproduce a failing example and report back when I do.

I'm pretty sure it's related to sync calls and waiting for results as above, so no new insights yet.

I think it happens when the tasks/calls "overlap" and come in an unexpected order - this results in sync calls screwing up the conditional mutex (linked above) at the wrong time, and one of the actors gets stuck on it, and then the test fails (because ultimately not all the events get through to the testprobeclient).

I rarely get "sync call timeouts" (I've added them); more often the test doesn't get the probe events it needs (it times out - e.g. it gets 90% of the events in 0.7 seconds, then nothing until the 5 second timeout). So it's like the last few events get "eaten".

The problem occurs more frequently if I switch to async notifications everywhere (I'm guessing calls pile up and overlap/cross more often).

My guess is sync calls are unintentionally resumed - or e.g. 80% of the deliver/wait blocks get resumed out of order - and then things get stuck (e.g. getting 2 signals for one wait, and then 0 signals for the next). This is probably why it also sometimes gets stuck before even initializing the dummy actor class.

This happens both on Fibers and Threads, so it worries me that it could be a bigger bug.

I believe the deliver(value) method either isn't called at all on JRuby/RBX, or isn't called when it should be - and things get stuck when it isn't. Maybe the calls are coming back faster than the cond mutex can wait - and when they don't, things grind to a halt.
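
For illustration, a minimal sketch of the "2 signals for one wait, 0 for the next" failure mode with a plain Mutex/ConditionVariable (not Celluloid's actual deliver/wait code): a signal sent before the waiter has actually started waiting is simply lost, and the next wait blocks forever.

```ruby
require "thread"

mutex = Mutex.new
cond  = ConditionVariable.new

# The signal fires while nobody is waiting, so it is lost...
Thread.new { mutex.synchronize { cond.signal } }.join

# ...and the subsequent wait never wakes up.
waiter = Thread.new { mutex.synchronize { cond.wait(mutex) } }

sleep 0.5
puts waiter.alive? ? "stuck: the earlier signal was lost" : "woke up"
waiter.kill
```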

This may also explain why a certain number of debug statements in the right places makes the tests pass every time. I'm careful not to make SyncCall calls in the debug output (e.g. inspect, class, etc.), and I pretty much log all the crashes (I hope).

The chain is: trigger -> probe -> notification/fanout -> testprobeclient ... and it fails randomly, sometimes as close to completion as just missing the last events the probe tests expect ... and sometimes it gets screwed up while setting up the supervisor and registering actor names (it never even gets to sending notifications, or it even gets clogged up before subscribe() kicks in).

I probably just need to wrap my head around the task resume/suspend logic...

Once I get a chance, I'll try and avoid supervisors completely (by "unsupervising" the fanout and incident reporter temporarily). I've tried removing the incident reporter - but nothing really changed (one probe test needs it anyway).

Any tips on how to get to the core issue would be much appreciated (even if it just turns out to be a race that occurs only in tests).

@digitalextremist
Member

@e2 what's the status with this? I'd like to get in all the internals-changing commits I can before refactoring the supervision gem.

@e2
Contributor Author

e2 commented Apr 17, 2015

@digitalextremist - merged into 0.17 prerelease. The rest of the discussion is about other issues. If tests fail they should become new issues, so I'm closing this.

e2 closed this as completed Apr 17, 2015
@digitalextremist
Member

@e2 that's what I thought. Thanks. The issue count for 0.17.0 is looking more manageable.
