New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Undo method invocation checking #563
Comments
This is literally the highest churn part of Celluloid's codebase. I suggest intimately familiarizing yourself with its history before suggesting any changes. |
@tarcieri - that's what I'm asking: what am I missing? I looked through git blame, I searched the issues, I've already dabbled in that area before ... I even saw how the messages once upon a time turned into the Call objects we have today. |
Perhaps a good way to start would be compiling all of the commits and all of the issues around this particular section of code so we can discuss them. There are many of them, and this is a very subtle and complex issue which has caused a lot of breakage in the past. We need to figure out what those breakages were so we don't repeat them. |
The reason I didn't go much deeper is that after axing it away, all the specs run. The only thing I had to change was expect actors to be dead (and wait until they process the message). |
@halorgium do you have any battle scars related to this section of code you care to relate? |
This is merged into abstractive/celluloid/master, to be reverted if it comes up as a huge issue. But so far, as of that merge -- for the first time in ages, Celluloid is passing its tests. |
I'm pretty strongly against this change. While it may pass the tests, it's a big semantic change. Celluloid uses In the context of something like DCell, it means a malicious client can crash a server just by sending it an invalid method /cc @niamster tl;dr: the intentional separation here is akin to 4xx vs 5xx in HTTP: the client/caller made an error versus the server/recipient made an error Finally, I cannot stress enough that THIS IS THE HIGHEST CHURN PORTION OF CELLULOID. Please do not change it unless you know the entire history of why it's in its current state and exactly what you are doing. |
@e2 let's revert this on our branch and find another way to resolve the tests without introducing the associated vulnerability. |
@e2 I hate to ask you to go on an archeology expedition but I really feel one is warranted here. This is the portion of the codebase where most attempts at "improvements" have resulted in substantial breakages across multiple projects. Tread carefully. |
@tarcieri - isn't this what In Ruby, you'll only know if an object supports a method by actually calling it. The current implementation doesn't let you "opt-out" from the checking. Not just that - a custom proxy class could actually validate parameters and throw an ArgumentError if something is out of bounds. Or it could even validate sessions, etc. Currently, Celluloid can only respond with a 400, 404 or 418 ... - not much of a choice for users. Since this should be implemented in a custom proxy, I'd axe this until there's a PR with a better implementation. The code needs to be removed from where it is anyway. |
@digitalextremist - the patch applies, because the code will disappear one way or another. @tarcieri - would moving the code to a custom proxy_class solve the issue? Anything you can do from a SyncCall can ultimately also be done from a custom proxy. (And then, any extra sync calls would be explicit). |
@digitalextremist , @tarcieri - I've added proof of concept code for custom proxies using a simplified version of the existing code: My question: if this is acceptable, how should I name the module and where should I put it? (The module and specs shouldn't be there since these are specs for a custom proxy, and not actors). @digitalextremist - please reopen this. @tarcieri - I believe it's too much of a burden for Celluloid to "carry" this responsibility in such a core area. I don't deny this is useful - I just don't think it should be in Celluloid, forget being hardcoded. If there are caveats, or serious issues - they should be mentioned in the tests. Especially this - it's way too prone to breakage. |
@e2 I can't overemphasize this enough: please look at the history of this section of code. I can do some digging this weekend too. |
@e2 reopened for deeeeep delving into the "Long long ago, before time" and figuring out how to carry out new approaches to the epicenter of old safely. |
My main worry is that past attempts at changing this code caused a lot of subtle breakages in people's apps, and with all the other changes going on I think you're playing with fire here. Any changes to these codepaths need to be done carefully, and with a lot of testing on real world apps. |
If I broke something, I'll prefer to help debug and fix + add specs rather than carefully trying to move a turd without waking it up ... I'm constantly testing the changes against MRI, JRuby and RBX right now - they are running in loops and I'm trying to work out if stuff is related to bad tests or races/bugs in the code. "being careful" means the tests suck - to be blunt. I'm already quite careful - you just can't "see" it from just looking at the PRs. Compatibility is one thing (needs tests) and having a codebase where development doesn't require "careful surgery" is a second issue (needs tests + refactoring) and having prod apps working reliably is a third thing (needs thorough stress tests + all the previous fixed) ... ... and as for "not breaking things", everything is already broken. No one is "carefully" fixing the InternalPool, for example. Or the PoolManager. Forget compatibility - even guard doesn't work anymore on the project because celluloid-pool has been extracted. I haven't even finished merging other fixes yet - so for me the codebase is broken to begin with. E.g. some RBX failures are warranted (backtrace handling - fixed in rbx-head), but some are bugs in tests or implementation. Probe is now consistently breaking - which IMHO is a good thing. @digitalextremist - let me know if I can help you track down the issues. I'd rather know what's broken ASAP rather than later. I'll be on IRC if you need me.
That's why I wanted the code out of there - it's much easier to unit test and fix a module (than debugging the whole Celluloid stack with notifiers and incident reporters and debugging calls, etc.), and it's much easier for users to add/remove the arity-checking functionality and test their apps against it. This functionality also has to be properly tested against futures, async methods and blocks - setting up a whole Celluloid "stack" in a test - just to test an arity-check code path - is very discouraging. This means edge cases won't have tests (I have a TODO marked for them - the example specs I showed don't have complete coverage - that's because coverage comes soon after tests are green, which they aren't). But that's all a tangent - what specifically have I broken? I already went through git blame up till the beginning of the file (I even mentioned the 2011 "erlangism removal" above, which is the first commit). Seriously - what did I break here SPECIFICALLY (without "speculation")? I might have broken something (e.g. async calls, futures, DCell compatibility, exception handling, NameError vs NoMethodError impact on current apps), but I'll find out what sooner if the code is in a separate module with separate unit tests. Even debugging this path is problematic, forget reproducing issues. (If I wasn't VERY serious about this, I would've merged it - I didn't. I created a PR, since those are easier to discuss.) Back to that race condition in RBX ... and maybe a stab at the probe specs once I'm done. |
@digitalextremist , @tarcieri - I'm already deep into this. I'd like to resolve this in a way so that ... no one else will ever have to. Tests suck because writing tests for this stuff sucks - partially because everything is so "integrated". Breakage is a symptom. Avoiding a symptom doesn't fix the root cause (hard to write good tests fast). So breakage but "fix-friendly" codebase instead - is actually the solution here. |
@e2 I have meetings for a few hours, then I'll be here all night. I'll work through this with you if you're there, then @tarcieri we can talk about this also this weekend. I think @tarcieri is saying the same thing I'm feeling: @e2 changing the highest churn area is totally dangerous, but fine if it's merited and proven that you're really hitting the historic issues head-on, with eyes-open, not breaking anuone out there in the process. |
We can't do that. We can't break people's Sidekiq projects. Based on my past experience with these codepaths, unless you fully understand what's going on, you're bound to do that.
Yes, the tests do suck!
That's the problem, the tests suck, so you don't know. However, about 5 people (iirc?) have attempted to make changes to these codepaths (who aren't me), and 4 of them broke things and we had to revert their changes. Only @halorgium has been successful here, and I think that's because he was extremely meticulous and careful about his changes. Changing this particular part of the codebase should be followed by many real-world trials with large Celluloid projects like Sidekiq workers and Adhearsion systems. We run the former at Square so I can help there, but please keep in mind that most of the past experiments in this area have lead to huge code breakages. In the absence of tests that cover all the cases, that's the best we can do. Now that I've said all that, this change enables a DoS attack against DCell systems, since an attacker can crash remote actors simply by sending them an invalid method. I do not think you're ready to change these codepaths. I plan on going through the commit and issue history for this particular section of the codebase and trying to summarize it for you so you can know what's involved. I really think you should know the history here before you attempt any changes, and if you don't, you're probably doomed to repeat it, and I would prefer not unnecessarily breaking things in the name of ill-conceived experiments. This is, in my opinion, literally the hardest code in Celluloid to make changes to and I am imploring you to tread carefully. If you break things for Sidekiq users I will revert your code commit-by-commit until things are working again. If it were any other part of the codebase I would have a different opinion, but this is where the dragons are, and you seem to want to play with them without realizing you will be burned. @digitalextremist is already reporting breakages so I feel like you haven't done your homework. But the harder problem is the long-tail of systems using Celluloid you might potentially break. To change this, you really need to test it with a lot of canary projects to make sure you aren't breaking anything. Please take this seriously. My personal recommendation is that you hold off on changing this code until we've done another major release (e.g. 0.17) with all the other potentially breaking changes. |
I feel your eagerness @e2 but I agree with @tarcieri. I wish @halorgium could weigh in here too soon. He has been our stongest internals figure. Let's attack this in |
I have 0 issues with that.
Definitely won't - send people to me (mention me) if it's related. If it's the cause, I'll revert it myself.
@niamster - let me know if you have a use case affected by this. Don't hesitate to ask. |
@e2 the reason why I said write repros is because there are current failures which simple test suites and the like don't show. I would gladly dedicate all my time to this issue after |
That's the difference - I care if something intended for this release doesn't work perfect. Give me more info - do you have the other patch as well? (open an issue) |
@e2 Good luck with merging this and it sounds like you've done a good job on the test coverage. Perhaps it could be good to write a brief page in the wiki explaining how things worked pre-merge and how things work post-merge and how to restore the default behaviour - it sounds like you said you could include a module and get the default behaviour - is this still the case? I will try to put aside some time to test with RubyDNS, can you let me know when it is merged? |
Just a note on this:
Naively deleting cruft is the right thing to do from a technical perspective, however, sometimes in cruft lies hard-won knowledge. I tried to put the onus on you to discover and understand why the cruft was there, but getting you to actually research that was like pulling teeth. You and I are both aware the tests are/were in sorry shape, and that makes making major changes to Celluloid quite difficult, because it requires a lot of in-production testing of changes to tease out the complications. All I am asking is when you do try to gut the cruft from Celluloid, you really need to understand why things are in the state they are before making changes, as opposed to "What's this? It sucks! [Delete]" |
@tarcieri I'm not sure if you mixed up what you wanted to say, but I'd like to respectfully suggest that no one should be naively removing or changing anything. Due diligence for all changes please. Getting feedback from people who wrote the code is awesome. Let's try to make celluloid even more awesome :) |
@tarcieri is saying what you are saying, and that @e2 is bypassing that whether knowingly or unknowingly ( but saying otherwise? thinking otherwise? ) due to lack of historical context; nit knowing for sure what happened before, and what will happen ( based on proven knowledge, thoroughly tested, discussed and clear facts gained from those who came before ) in the future if x code is changed. |
@tarcieri - to me, if you can't give a straight technical answer, then no one has a clue why that code is there. I understand the need for "archeology" - but that's a bad "general answer". I felt I was given such a "generic answer", which had nothing to with the "change". This change wasn't introduced with the same scrutiny I'm expected to deal with here. That's unfair to say the least. On top of that, it's an unreasonable expectation to ask me to prove this code can be removed without side-effects. If I was replacing the code with something else, that's a different story. This was about removing a behavior, not changing it. If that results in side-effects, those are even more important to discover, document and prevent than "not touching things". How can I prove the lack of something doesn't introduce something unknown? Especially if I didn't put that code there in the first place? If the code was a design mistake (one of the many needed to "learn" in order to make any framework better over time), how else can it be proven a mistake without removing it? While the tests don't reflect production, they are "full stack tests". Things like supervisor initialization goes through there. The Probe, Pool and StackDump tests were pretty brutal to deal with - the fact that tests failures were very irregular (even on the exact same JRuby version, for example) shows that tests already do cover a significant amount of real life cases. A historical context doesn't apply here - because a historical context doesn't justify this code in the first place. Not to mention the change is ancient - and many changes have been introduced elsewhere in the codebase since. It's not like every other change is deeply investigated to avoid breaking this method. Archeology isn't fun - though I'm pretty certain I use My initial post did have some frustration expressed in it - because I did do archeology and found nothing (realizing all this time I've been babysitting a dud). Maybe that's where the idea came that my change was unmindful? |
I gave several different rationales over the course of many posts for why it's there (which you didn't like, and it seems you're philosophically opposed or simply don't care, but that's beside the point). But even if nobody could, that doesn't just give you carte blanche to change it, especially since it is, as I have repeatedly insisted, the highest churn code in the entire library. That should be a signal to you to tread carefully, and I don't feel you've done that. I'm a bit worried that when 0.17.0 does come out, there's going to be an awful lot of people with an awful lot of problems, and I hope you stick around long enough to address them all. But... go ahead, change it. All I ask is that you please own up to what you're attempting to do. |
@ioquatix - I actually planned to write a Wiki page on the possible changes in behavior (in how calls are scheduled differently and ideas on how to design the app differently). But that doesn't make sense, since I have no merged commits to point to right now. |
@e2 make your experimental branch and let people try. I'll make I don't know if you remember @e, when @tarcieri talked about approaching |
And I explained there's a different solution for that (e.g. via a "proxy actor") and I created 2 actor specs to demonstrate (those 2 examples were merged).
which suggests that area needs lots of rework - I can't think of a saner solution.
In all honestly, I have no idea what I could've done specifically to please you here.
The only 2 behavior changes I've made in 'lib' are well known - I haven't changed anything else (I tried extremely hard not to). I'd rather rework logging and tracing, so it's easier for people (and us) to diagnose problems. With so many changes, it's better to "abandon" a false sense of stability - and prepare to more effectively deal with whatever happens. Most issues people have will be unpredictable, anyway. Even a new MRI release can cause painful breakage. I don't want to tell people "sorry, we can't fix this quickly, because it means changing code in a high-churn area". I can commit to support around the 2 patches - but that doesn't make me responsible for 0.17, because I have no control over what goes in there or not. If someone has a tough race condition, and I'm "not allowed" to add tracing facilities "because of performance" or "historical context" ... committing to support would be like hanging myself (I wouldn't have been able to fix the tests as I did without a crap load of changes for debugging and tracing). Given the changes, I've never believed 0.17 will be without major problems - the tests are barely green to begin with. Many people may just lock to 0.16 (others may have already locked to 0.14 long ago anyway). The best approach is to reduce the efforts in debugging, diagnosing, testing, fixing and preventing issues. The "be careful with changes" approach is perfect for versions up to 0.16. If people are going to have major issues, it's best we find ways to prepare for them. The worst issues are when "something hangs" or "blows up" without extra info (people can't diagnose, we can't help either) - being able to quickly fix an issue can instantly turn any initial frustration around. 0.17 has passed the point of "an 0.16 update" - just extracting the gems means all projects have to be modified to even run. There's a reason RSpec was split into 'rspec' and 'rspec-core'. There's a historical context for it and use cases to back that up. But that's not an area of Celluloid anyone cares about. I'll prepare the patches on whatever branch I'm told to and I'll write a Wiki explaining what could change and fixes/workarounds. |
You could've actually looked into the history of the code in question the first time I asked, instead of precipitating this long, drawn-out argument. It looks like the first time you actually even bothered looking at the history of this section of code was here: That's about a week and dozens of invective-laced messages after I told you to originally. |
Per @digitalextremist your changes are breaking things, so this issue isn't closed yet. Please let @digitalextremist close the issue when he's satisfied your changes actually work. |
@tarcieri - you had me convinced the code was important for something, and I couldn't find out how/what. The history doesn't say why the method and arity checking is important. It just doesn't. Ironically, my mistake all along was assuming the code was crucial. As for this patch - it has been reverted (so it's not in 0.17). So this is closed - I asked @digitalextremist above to open a new issue (since it could be occurring with - but unrelated to this removal). |
@e2 have you read anything I've said this entire thread, or do you just like hearing yourself talk? |
@e2 honestly, in my two decades of experience in open source, this is the most dysfunctional encounter I've ever had with anyone, period. |
@tarcieri - sorry about that. I'll work it out. @digitalextremist - this issue is on you, since you're the only one with an (technically unclear) objection. Prepare a branch with the unreverted patch (and the other syncproxy commit above), and let's work out what's preventing you from putting this into a release. |
Cool! 😄 |
This reverts commit b35bf18. Conflicts: lib/celluloid/calls.rb
See: #579 for experimentation for the time being. |
This forever moved to abstractive/celluloid-experimental#1, and once proven can be merged into |
I want to undo: f2cf9e7
While the idea is nice (calling an actor with bad param count should crash the sender), the issues are:
inspect
inside Kernel#method(), and in general being slightly "incompatible" on the metaprogramming side) - the workarounds would be horrible and make the code unrefactorableAny objections? Thoughts?
The text was updated successfully, but these errors were encountered: