For review: "truly" synchronous sync calls #554
Comments
At one point I implemented these semantics. I guess they got lost along the way? Yes, this is definitely how it should work. Things are actually broken without it (although unfortunately I can't point to a specific case).
I've been running into this more and more, so much that I'm going to refactor supervision during
This landed in #556, for better or worse. Reopen if worse.
@tarcieri - I think this is an architecture problem: an over-reliance on MRI and its GIL. I'm guessing the reason it can't be synchronous on MRI is potential deadlocks. I'd say there should be an "idle", non-sleeping thread that takes control on MRI so everything else can sleep without a deadlock. Timeouts are another workaround. I think MRI should start to behave like JRuby/Rubinius (and not the other way around). Thoughts?
@digitalextremist - revert this on 0.17.0, and then purge this from the master branch in abstractive (I don't mind history being rewritten in this case if you don't either). Then let me know so I can rebase and dump this patch into a separate branch.
@e2 in the referenced symptom (which I am testing a revert against), the task is losing its block context. Why do you figure deadlocking is the reason for that?
I'm going to comment it out and reference the ticket for now. |
@digitalextremist - not 100% sure it's the reason, but on MRI, SyncCall calls are made in a separate task. This patch eliminated the "extra task" by blocking the caller (which is faster, because there's no task overhead)... but on MRI that means a potential deadlock. The MRI "cure" is presumably here (mentioned in this issue description): https://github.com/celluloid/celluloid/blob/master/lib/celluloid/tasks/task_thread.rb#L31 TL;DR - something is fundamentally broken in Celluloid, with MRI-specific "bandaids".
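For readers following along, here's a minimal sketch of what "blocking the caller" means, using a plain `Queue` as a stand-in for an actor mailbox. All names here (`TinyActor`, `sync_call`) are hypothetical illustrations, not Celluloid's actual internals:

```ruby
class TinyActor
  def initialize(target)
    @target  = target
    @mailbox = Queue.new
    # One thread owns the target object and drains the mailbox.
    @thread  = Thread.new { loop { handle(@mailbox.pop) } }
  end

  # "Truly" synchronous: block the calling thread on a reply queue,
  # instead of spawning an extra task to wait for the response.
  def sync_call(method, *args)
    reply = Queue.new
    @mailbox << [method, args, reply]
    reply.pop
  end

  private

  def handle((method, args, reply))
    reply << @target.public_send(method, *args)
  end
end

actor = TinyActor.new("hello, world")
actor.sync_call(:upcase)  # => "HELLO, WORLD"
```

The trade-off discussed in this thread: blocking the caller is cheaper than an extra task, but if the blocked thread is itself needed to process a message (as can happen on MRI), you deadlock.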
@e2 can you give a timeline of the events and how you believe they're leading to a deadlock?
(and why this issue is happening in
@tarcieri - I was basically testing on threads (not fibers) when I created the issue. Occasionally on JRuby the Probe tests deadlocked. It went like this:
On MRI, the signal() method was called (I don't remember where the resume came from), which resumed the supervision group mailbox. That was strange, really, since executing the block should be synchronous, and once it finished, the SG should then finish.

So it worked on MRI because of that strange signal() invocation (I don't remember where from) - without it, MRI deadlocks every time (because it can't wait for the (cond)signal and listen to the mailbox at the same time). But removing that line fixed JRuby, because the call became synchronous (instead of deadlocking when the signal was sent at the wrong moment later).

To be honest, I'm very confused about how the suspend/resume is supposed to work on MRI. Let me know if that helps. The issue with smokers, though, may be unrelated - there's a circular dependency of calls there (or not), so it needs investigating. Needs a test before anything else.
It sounds like some codepaths were sending messages through the mailbox. That's odd, but I'm not sure why it's alarming? When a deadlock occurs, what's contending on what? |
What's alarming to me is that I have no idea how/why MRI got the signal (which happened before

It's also kind of confusing semantically that initialize is treated like a method in Celluloid - and can actually return a value.

I pretty much abandoned the issue (and moved on to keep up with the changes), but here's the state the threads were in (based on my notes):
... so ... 0x16fe was waiting for the signal (to deliver the result of the whole call chain back to its caller, 0x1700), while 0x1704 sent the synchronous message to the now-waiting 0x16fe. This was on JRuby with InternalPool. On MRI the situation was avoided because of the signal() call from a resume() somewhere - which freed the mutex, allowing deliver() to finish (with a nil), then handle the

Shortly after I got into this, the InternalPool was "dethroned", modified, and the task classes were changed, so I rebased to avoid effectively working "on a separate project". I also thought this might be an InternalPool-specific issue, or a side effect of something else - so I set it aside.

A JRuby lockup happened 3 out of 5 times, but this specific case was 1 of those 3. (In one of the other cases, the same Probe test was stuck, I think in a similar way, but "later" - the TestProbe client suddenly stopped getting events, as a result of the Probe being stuck - which could've been an entirely different issue altogether.)

Sorry for not being specific or creating an isolated failing example (I tried briefly) - just debugging this was too spartan an experience, and maybe the race occurred due to puts statements being present. So instead, I focused on an overhaul on the test side - hoping to be better prepared, with better tools and a better setup, before this issue rears its ugly head again...
The smokers example is fixed - I forgot to pass the block parameter (there's a test for this now). @digitalextremist - you can merge and test this. You shouldn't be getting any more LocalJump errors. I'm still not sure this won't deadlock in certain cases, so I can't guarantee it's ready for 0.17. |
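The LocalJump fix amounts to carrying the caller's block inside the call object and forwarding it at dispatch time. A hypothetical sketch (`SyncCallSketch` and `Names` are invented names, not Celluloid's actual classes):

```ruby
SyncCallSketch = Struct.new(:name, :args, :block) do
  def dispatch(receiver)
    # Forwarding &block is the one-line fix: without it, any method
    # that yields raises LocalJumpError when dispatched.
    receiver.public_send(name, *args, &block)
  end
end

class Names
  def each_name
    yield "alice"
    yield "bob"
  end
end

collected = []
SyncCallSketch.new(:each_name, [], proc { |n| collected << n }).dispatch(Names.new)
collected  # => ["alice", "bob"]
```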
add missing block param in SyncCall (#554)
TL;DR - more info about the problem and what could be causing it (based on behavior).

@tarcieri - The issue reappeared on both JRuby and Rubinius - but there's a lot of merging/rebasing ahead again, so I'll try to reproduce a failing example and report back when I do. I'm pretty sure it's related to sync calls and waiting for results as above, so no new insights yet.

I think it happens when the tasks/calls "overlap" and come in an unexpected order - this results in sync calls screwing up the conditional mutex (linked above) at the wrong time; one of the actors gets stuck on it, and then the test fails (because ultimately not all the events get through to the testprobeclient). I rarely get "sync call timeouts" (I've added them); more often the test doesn't get the probe events it needs, so it times out - e.g. it gets 90% of the events in 0.7 seconds, then nothing until the 5-second timeout. So it's like the last few events get "eaten". The problem occurs more frequently if I switch to async notifications everywhere (I'm guessing calls pile up and overlap/cross more often).

My guess is that sync calls are unintentionally resumed - or e.g. 80% of the deliver/wait blocks get resumed out of order - and then things get stuck (e.g. getting 2 signals for one wait, and then 0 signals for the next). This is probably why it also sometimes gets stuck before even initializing the dummy actor class. This happens both on Fibers and Threads, so it worries me that it could be a bigger bug.

This may also explain why a certain amount of debugs in the right places causes the tests to pass every time. I'm careful not to make SyncCall calls in the debugs (e.g. inspect, class, etc.) and I pretty much log all the crashes (I hope).

The chain is: trigger -> probe -> notification/fanout -> testprobeclient ... and it fails randomly, sometimes as near completion as just missing the events the probe tests expect ...
and sometimes it gets screwed up while setting up the supervisor and registering actor names (it never even gets to sending notifications, or it gets clogged up before subscribe() kicks in). I probably just need to wrap my head around the task resume/suspend logic...

Once I get a chance, I'll try to avoid supervisors completely (by "unsupervising" the fanout and incident reporter temporarily). I've tried removing the incident reporter - but nothing really changed (one probe test needs it anyway). Any tips on how to get to the core issue would be much appreciated (even if it just turns out to be a race that occurs only in tests).
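The "2 signals for one wait, then 0 signals for the next" failure mode is a classic lost-signal race on a bare ConditionVariable: a signal sent before the waiter is asleep simply vanishes. A minimal sketch of the safe pattern - guarding the wait with a predicate, so ordering doesn't matter (all variable names are illustrative, not Celluloid code):

```ruby
mutex  = Mutex.new
cond   = ConditionVariable.new
done   = false
result = nil

worker = Thread.new do
  mutex.synchronize do
    result = :computed
    done   = true
    cond.signal  # may fire before the main thread ever starts waiting
  end
end

mutex.synchronize do
  # Re-check the predicate in a loop: if the signal already fired we
  # never wait at all, and spurious wakeups are harmless.
  cond.wait(mutex) until done
end
worker.join
```

Without the `until done` guard, a signal delivered before the `wait` (or a spurious wakeup) leaves the waiter stranded forever, which matches the "last few events get eaten" symptom described above.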
@e2 what's the status with this? I'd like to get in all the internals-changing commits I can before refactoring the supervision gem.
@digitalextremist - merged into 0.17 prerelease. The rest of the discussion is about other issues. If tests fail they should become new issues, so I'm closing this. |
@e2 that's what I thought. Thanks. The issue count for
TL;DR - seems to work well, but probably has a lot of implications
Here is the patch: e2@5e17eae
I don't particularly like it, but it seemed necessary with Threads, and it seems to help with other JRuby issues.
It bypasses the mailbox when an actor calls its own methods - this was necessary (in the cases I encountered), because there was no way to protect a "receiver" task that makes a sync call back to its sender (deadlock, because the sender is waiting for the signal, while the receiver mails a sync call to the "waiting" sender task).
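A minimal sketch of that bypass, with invented names (`SelfCallingActor` is not Celluloid's implementation): if the calling thread is the actor's own thread, invoke the method directly instead of mailing it, since that thread cannot both block on a reply and drain its own mailbox.

```ruby
class SelfCallingActor
  def initialize
    @mailbox = Queue.new
    @thread  = Thread.new { loop { process(@mailbox.pop) } }
  end

  def call(method, *args)
    # Bypass the mailbox when the actor calls its own methods: queueing
    # here and blocking on the reply would deadlock, because this very
    # thread is the one that has to process the mailbox.
    return send(method, *args) if Thread.current == @thread

    reply = Queue.new
    @mailbox << [method, args, reply]
    reply.pop
  end

  def outer
    call(:inner) + 1  # re-entrant sync call into the same actor
  end

  def inner
    41
  end

  private

  def process((method, args, reply))
    reply << send(method, *args)
  end
end

SelfCallingActor.new.call(:outer)  # => 42, instead of hanging forever
```

Without the `Thread.current == @thread` check, `outer`'s nested `call(:inner)` would enqueue a message that nothing can ever dequeue - the deadlock shape described above.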
On MRI without this line: https://github.com/celluloid/celluloid/blob/master/lib/celluloid/tasks/task_thread.rb#L31 you get a deadlock in the Probe tests. On JRuby though, removing it prevents the following from getting stuck:
Which pretty much becomes:
On MRI, the hacky signal "pushes things forward" - but I believe that line is just a deadlock workaround, and not a solution to the sync problem (which probably doesn't exist on MRI anyway, due to how tasks are suspended and resumed everywhere).
I'm still fixing tests, but this seems to work fine.
The upside may be a small performance increase (no mailbox overhead, etc.), the downside - maybe lower responsiveness in some cases.
But then again, ideally the user should avoid sync calls unless absolutely necessary (since it potentially means locking the whole call chain).
Would this be ok to merge?