Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undo method invocation checking #563

Closed
e2 opened this issue Apr 9, 2015 · 109 comments
Closed

Undo method invocation checking #563

e2 opened this issue Apr 9, 2015 · 109 comments
Milestone

Comments

@e2
Copy link
Contributor

e2 commented Apr 9, 2015

I want to undo: f2cf9e7

While the idea is nice (calling an actor with bad param count should crash the sender), the issues are:

  1. it's waaaaaay too complex (metaprograming metaprograming - yes that's correct)
  2. it's incomplete and misleading, because while the invocation matches, the receiver can still decide to throw a NoMethodError, NameError, ArgumentError and friends - so it doesn't make sense to "partially" detect
  3. since now sync calls to self are short-circuited (without the mailbox), this is overhead if the object checked isn't a proxy
  4. for correctness, if the receiver object is a proxy ... you'd need multiple calls (going through the mailbox) back and forth to "negotiate" if a method even exists (example: method_missing inside the actor)
  5. Rubinius just does things a little differently (like calling a potentially crashing inspect inside Kernel#method(), and in general being slightly "incompatible" on the metaprogramming side) - the workarounds would be horrible and make the code unrefactorable
  6. Any actor can crash at any time for any reason - in my mind it makes no architectural sense to e.g. shift some of the exceptions to e.g. the main thread "just to make things work like plain Ruby".
  7. It's constantly confusing and hard to handle, because "sometimes the actor crashes" and "sometimes the receiver crashes" - so crashes have to be handled on both sides.
  8. I know people don't like crashing a service simply by calling a non existing method - but these are also practically "runtime errors" - and if someone really wants to protect the service, they should implement method_missing appropriately. (Code injection is also possible, so there's probably no way to protect a service from crashing if someone wants to crash it).

Any objections? Thoughts?

@tarcieri
Copy link
Member

tarcieri commented Apr 9, 2015

This is literally the highest churn part of Celluloid's codebase. I suggest intimately familiarizing yourself with its history before suggesting any changes.

@e2
Copy link
Contributor Author

e2 commented Apr 9, 2015

@tarcieri - that's what I'm asking: what am I missing?

I looked through git blame, I searched the issues, I've already dabbled in that area before ... I even saw how the messages once upon a time turned into the Call objects we have today.

@tarcieri
Copy link
Member

tarcieri commented Apr 9, 2015

Perhaps a good way to start would be compiling all of the commits and all of the issues around this particular section of code so we can discuss them.

There are many of them, and this is a very subtle and complex issue which has caused a lot of breakage in the past.

We need to figure out what those breakages were so we don't repeat them.

@e2
Copy link
Contributor Author

e2 commented Apr 9, 2015

The reason I didn't go much deeper is that after axing it away, all the specs run. The only thing I had to change was expect actors to be dead (and wait until they process the message).

@tarcieri
Copy link
Member

tarcieri commented Apr 9, 2015

@halorgium do you have any battle scars related to this section of code you care to relate?

@digitalextremist
Copy link
Member

This is merged into abstractive/celluloid/master, to be reverted if it comes up as a huge issue. But so far, as of that merge -- for the first time in ages, Celluloid is passing its tests.

@tarcieri
Copy link
Member

I'm pretty strongly against this change. While it may pass the tests, it's a big semantic change. Celluloid uses abort (probably better aliased to fail in the post-Exceptional Ruby days) to signal caller versus receiver error.

In the context of something like DCell, it means a malicious client can crash a server just by sending it an invalid method /cc @niamster

tl;dr: the intentional separation here is akin to 4xx vs 5xx in HTTP: the client/caller made an error versus the server/recipient made an error

Finally, I cannot stress enough that THIS IS THE HIGHEST CHURN PORTION OF CELLULOID. Please do not change it unless you know the entire history of why it's in its current state and exactly what you are doing.

@digitalextremist
Copy link
Member

@e2 let's revert this on our branch and find another way to resolve the tests without introducing the associated vulnerability.

@tarcieri
Copy link
Member

@e2 I hate to ask you to go on an archeology expedition but I really feel one is warranted here. This is the portion of the codebase where most attempts at "improvements" have resulted in substantial breakages across multiple projects.

Tread carefully.

@e2
Copy link
Contributor Author

e2 commented Apr 10, 2015

tl;dr: the intentional separation here is akin to 4xx vs 5xx

@tarcieri - isn't this what proxy_class is for? So the user can check authentication, etc.? Isn't this akin to an Interface in Java?

In Ruby, you'll only know if an object supports a method by actually calling it. The current implementation doesn't let you "opt-out" from the checking.

Not just that - a custom proxy class could actually validate parameters and throw an ArgumentError if something is out of bounds. Or it could even validate sessions, etc.

Currently, Celluloid can only respond with a 400, 404 or 418 ... - not much of a choice for users.

Since this should be implemented in a custom proxy, I'd axe this until there's a PR with a better implementation. The code needs to be removed from where it is anyway.

@e2
Copy link
Contributor Author

e2 commented Apr 10, 2015

@digitalextremist - the patch applies, because the code will disappear one way or another.

@tarcieri - would moving the code to a custom proxy_class solve the issue? Anything you can do from a SyncCall can ultimately also be done from a custom proxy. (And then, any extra sync calls would be explicit).

@e2
Copy link
Contributor Author

e2 commented Apr 10, 2015

@digitalextremist , @tarcieri - I've added proof of concept code for custom proxies using a simplified version of the existing code:

https://github.com/abstractive/celluloid/commit/46d06259c1125f595b98e9212e2716682b086f34#diff-4083b355c5431bd94cf6aac5f5d705dbR1035

My question: if this is acceptable, how should I name the module and where should I put it? (The module and specs shouldn't be there since these are specs for a custom proxy, and not actors).

@digitalextremist - please reopen this.

@tarcieri - I believe it's too much of a burden for Celluloid to "carry" this responsibility in such a core area. I don't deny this is useful - I just don't think it should be in Celluloid, forget being hardcoded. If there are caveats, or serious issues - they should be mentioned in the tests. Especially this - it's way too prone to breakage.

@tarcieri
Copy link
Member

@e2 I can't overemphasize this enough: please look at the history of this section of code. I can do some digging this weekend too.

@digitalextremist
Copy link
Member

@e2 reopened for deeeeep delving into the "Long long ago, before time" and figuring out how to carry out new approaches to the epicenter of old safely.

@tarcieri
Copy link
Member

My main worry is that past attempts at changing this code caused a lot of subtle breakages in people's apps, and with all the other changes going on I think you're playing with fire here.

Any changes to these codepaths need to be done carefully, and with a lot of testing on real world apps.

@digitalextremist
Copy link
Member

@tarcieri point we'll taken. In fact, this branch is already breaking my app. Preparing to go to work and get back to it. @e2 maybe we can return to this after some reflection, while working in other areas.

@e2
Copy link
Contributor Author

e2 commented Apr 10, 2015

past attempts at changing this code caused a lot of subtle breakages in people's apps

If I broke something, I'll prefer to help debug and fix + add specs rather than carefully trying to move a turd without waking it up ...

I'm constantly testing the changes against MRI, JRuby and RBX right now - they are running in loops and I'm trying to work out if stuff is related to bad tests or races/bugs in the code.

"being careful" means the tests suck - to be blunt. I'm already quite careful - you just can't "see" it from just looking at the PRs.

Compatibility is one thing (needs tests) and having a codebase where development doesn't require "careful surgery" is a second issue (needs tests + refactoring) and having prod apps working reliably is a third thing (needs thorough stress tests + all the previous fixed) ...

... and as for "not breaking things", everything is already broken.

No one is "carefully" fixing the InternalPool, for example. Or the PoolManager. Forget compatibility - even guard doesn't work anymore on the project because celluloid-pool has been extracted.

I haven't even finished merging other fixes yet - so for me the codebase is broken to begin with. E.g. some RBX failures are warranted (backtrace handling - fixed in rbx-head), but some are bugs in tests or implementation. Probe is now consistently breaking - which IMHO is a good thing.

@digitalextremist - let me know if I can help you track down the issues. I'd rather know what's broken ASAP rather than later. I'll be on IRC if you need me.

@tarcieri -

My main worry is that past attempts at changing this code caused a lot of subtle breakages

That's why I wanted the code out of there - it's much easier to unit test and fix a module (than debugging the whole Celluloid stack with notifiers and incident reporters and debugging calls, etc.), and it's much easier for users to add/remove the arity-checking functionality and test their apps against it.

This functionality also has to be properly tested against futures, async methods and blocks - setting up a whole Celluloid "stack" in a test - just to test an arity-check code path - is very discouraging.

This means edge cases won't have tests (I have a TODO marked for them - the example specs I showed don't have complete coverage - that's because coverage comes soon after tests are green, which they aren't).

But that's all a tangent - what specifically have I broken? I already went through git blame up till the beginning of the file (I even mentioned the 2011 "erlangism removal" above, which is the first commit).

Seriously - what did I break here SPECIFICALLY (without "speculation")? I might have broken something (e.g. async calls, futures, DCell compatibility, exception handling, NameError vs NoMethodError impact on current apps), but I'll find out what sooner if the code is in a separate module with separate unit tests. Even debugging this path is problematic, forget reproducing issues.

(If I wasn't VERY serious about this, I would've merged it - I didn't. I created a PR, since those are easier to discuss.)

Back to that race condition in RBX ... and maybe a stab at the probe specs once I'm done.

@e2
Copy link
Contributor Author

e2 commented Apr 10, 2015

@digitalextremist , @tarcieri - I'm already deep into this. I'd like to resolve this in a way so that ... no one else will ever have to. Tests suck because writing tests for this stuff sucks - partially because everything is so "integrated". Breakage is a symptom. Avoiding a symptom doesn't fix the root cause (hard to write good tests fast).

So breakage but "fix-friendly" codebase instead - is actually the solution here.

@digitalextremist
Copy link
Member

@e2 I have meetings for a few hours, then I'll be here all night. I'll work through this with you if you're there, then @tarcieri we can talk about this also this weekend.

I think @tarcieri is saying the same thing I'm feeling: @e2 changing the highest churn area is totally dangerous, but fine if it's merited and proven that you're really hitting the historic issues head-on, with eyes-open, not breaking anuone out there in the process.

@e2
Copy link
Contributor Author

e2 commented Apr 10, 2015

@digitalextremist - for the record, the frustration in this area (or this part of Celluloid) isn't new to me:

4104287
1b68ee4
68d5ca2
b522b0b
(#496)

@tarcieri
Copy link
Member

@e2

Forget compatibility

We can't do that. We can't break people's Sidekiq projects. Based on my past experience with these codepaths, unless you fully understand what's going on, you're bound to do that.

"being careful" means the tests suck - to be blunt.

Yes, the tests do suck!

Seriously - what did I break here SPECIFICALLY (without "speculation")?

That's the problem, the tests suck, so you don't know. However, about 5 people (iirc?) have attempted to make changes to these codepaths (who aren't me), and 4 of them broke things and we had to revert their changes. Only @halorgium has been successful here, and I think that's because he was extremely meticulous and careful about his changes.

Changing this particular part of the codebase should be followed by many real-world trials with large Celluloid projects like Sidekiq workers and Adhearsion systems. We run the former at Square so I can help there, but please keep in mind that most of the past experiments in this area have lead to huge code breakages.

In the absence of tests that cover all the cases, that's the best we can do.

Now that I've said all that, this change enables a DoS attack against DCell systems, since an attacker can crash remote actors simply by sending them an invalid method.

I do not think you're ready to change these codepaths. I plan on going through the commit and issue history for this particular section of the codebase and trying to summarize it for you so you can know what's involved. I really think you should know the history here before you attempt any changes, and if you don't, you're probably doomed to repeat it, and I would prefer not unnecessarily breaking things in the name of ill-conceived experiments.

This is, in my opinion, literally the hardest code in Celluloid to make changes to and I am imploring you to tread carefully. If you break things for Sidekiq users I will revert your code commit-by-commit until things are working again. If it were any other part of the codebase I would have a different opinion, but this is where the dragons are, and you seem to want to play with them without realizing you will be burned. @digitalextremist is already reporting breakages so I feel like you haven't done your homework. But the harder problem is the long-tail of systems using Celluloid you might potentially break. To change this, you really need to test it with a lot of canary projects to make sure you aren't breaking anything.

Please take this seriously.

My personal recommendation is that you hold off on changing this code until we've done another major release (e.g. 0.17) with all the other potentially breaking changes.

@digitalextremist
Copy link
Member

I feel your eagerness @e2 but I agree with @tarcieri. I wish @halorgium could weigh in here too soon. He has been our stongest internals figure. Let's attack this in 0.17.5

@digitalextremist digitalextremist modified the milestones: 0.17.5, 0.17.0 Apr 10, 2015
@digitalextremist
Copy link
Member

@tarcieri, I complain.

@e2 Celluloid breaks with the patch, and I don't have time to figure out why right now.

@e2
Copy link
Contributor Author

e2 commented Apr 20, 2015

@tarcieri -

Just prepare for me to revert it the instant anyone complains.

I have 0 issues with that.

And please don't complain when that happens.

Definitely won't - send people to me (mention me) if it's related. If it's the cause, I'll revert it myself.

Heads up @niamster, I guess you need to completely rethink the way DCell validates inputs, since Celluloid is no longer doing it for you.

@niamster - let me know if you have a use case affected by this. Don't hesitate to ask.

@digitalextremist
Copy link
Member

@e2 the reason why I said write repros is because there are current failures which simple test suites and the like don't show. I would gladly dedicate all my time to this issue after 0.17.0 ... or even once it's pre-released and we're waiting for feedback.

@e2
Copy link
Contributor Author

e2 commented Apr 20, 2015

@digitalextremist

Celluloid breaks with the patch, and I don't have time to figure out why right now.

That's the difference - I care if something intended for this release doesn't work perfect. Give me more info - do you have the other patch as well? (open an issue)

@ioquatix
Copy link
Contributor

@e2 Good luck with merging this and it sounds like you've done a good job on the test coverage. Perhaps it could be good to write a brief page in the wiki explaining how things worked pre-merge and how things work post-merge and how to restore the default behaviour - it sounds like you said you could include a module and get the default behaviour - is this still the case? I will try to put aside some time to test with RubyDNS, can you let me know when it is merged?

@tarcieri
Copy link
Member

Just a note on this:

I thought this discussion would be purely technical ("merge" or a list of specific objections for which alternatives may be researched), but it went into releases, compatibility, process, "waiting", etc.

Naively deleting cruft is the right thing to do from a technical perspective, however, sometimes in cruft lies hard-won knowledge. I tried to put the onus on you to discover and understand why the cruft was there, but getting you to actually research that was like pulling teeth.

You and I are both aware the tests are/were in sorry shape, and that makes making major changes to Celluloid quite difficult, because it requires a lot of in-production testing of changes to tease out the complications.

All I am asking is when you do try to gut the cruft from Celluloid, you really need to understand why things are in the state they are before making changes, as opposed to "What's this? It sucks! [Delete]"

@ioquatix
Copy link
Contributor

@tarcieri I'm not sure if you mixed up what you wanted to say, but I'd like to respectfully suggest that no one should be naively removing or changing anything. Due diligence for all changes please. Getting feedback from people who wrote the code is awesome. Let's try to make celluloid even more awesome :)

@digitalextremist
Copy link
Member

@tarcieri is saying what you are saying, and that @e2 is bypassing that whether knowingly or unknowingly ( but saying otherwise? thinking otherwise? ) due to lack of historical context; nit knowing for sure what happened before, and what will happen ( based on proven knowledge, thoroughly tested, discussed and clear facts gained from those who came before ) in the future if x code is changed.

@e2
Copy link
Contributor Author

e2 commented Apr 21, 2015

@tarcieri - to me, if you can't give a straight technical answer, then no one has a clue why that code is there. I understand the need for "archeology" - but that's a bad "general answer".

I felt I was given such a "generic answer", which had nothing to with the "change". This change wasn't introduced with the same scrutiny I'm expected to deal with here. That's unfair to say the least.

On top of that, it's an unreasonable expectation to ask me to prove this code can be removed without side-effects. If I was replacing the code with something else, that's a different story. This was about removing a behavior, not changing it. If that results in side-effects, those are even more important to discover, document and prevent than "not touching things".

How can I prove the lack of something doesn't introduce something unknown? Especially if I didn't put that code there in the first place? If the code was a design mistake (one of the many needed to "learn" in order to make any framework better over time), how else can it be proven a mistake without removing it?

While the tests don't reflect production, they are "full stack tests". Things like supervisor initialization goes through there. The Probe, Pool and StackDump tests were pretty brutal to deal with - the fact that tests failures were very irregular (even on the exact same JRuby version, for example) shows that tests already do cover a significant amount of real life cases.

A historical context doesn't apply here - because a historical context doesn't justify this code in the first place. Not to mention the change is ancient - and many changes have been introduced elsewhere in the codebase since. It's not like every other change is deeply investigated to avoid breaking this method.

Archeology isn't fun - though I'm pretty certain I use git for digging more often than most other developers out there. Even though I expect tests to be a more reliable source of knowledge.

My initial post did have some frustration expressed in it - because I did do archeology and found nothing (realizing all this time I've been babysitting a dud). Maybe that's where the idea came that my change was unmindful?

@tarcieri
Copy link
Member

to me, if you can't give a straight technical answer, then no one has a clue why that code is there.

I gave several different rationales over the course of many posts for why it's there (which you didn't like, and it seems you're philosophically opposed or simply don't care, but that's beside the point).

But even if nobody could, that doesn't just give you carte blanche to change it, especially since it is, as I have repeatedly insisted, the highest churn code in the entire library. That should be a signal to you to tread carefully, and I don't feel you've done that. I'm a bit worried that when 0.17.0 does come out, there's going to be an awful lot of people with an awful lot of problems, and I hope you stick around long enough to address them all.

But... go ahead, change it. All I ask is that you please own up to what you're attempting to do.

@e2
Copy link
Contributor Author

e2 commented Apr 21, 2015

@ioquatix - I actually planned to write a Wiki page on the possible changes in behavior (in how calls are scheduled differently and ideas on how to design the app differently). But that doesn't make sense, since I have no merged commits to point to right now.

@digitalextremist
Copy link
Member

@e2 make your experimental branch and let people try. I'll make 0.17.5-experimental for you to send PRs to, like we talked about. As it stands, 0.17.0 is such a breaking release we will need to leave it in prerelease for a while, and start 0.17.5 without 0.17.0 even merged into master. That I am sure of.

I don't know if you remember @e, when @tarcieri talked about approaching rubinius about the behavior of Kernel.method and look how that went! We're on contentious ground here, all the time, because Celluloid is close to being a ruby interpreter in complexity and ramifications. We can't just decide how things ought to behave, we have to be sensitive to so many factors, under the hood and in the ecosystem of people using Celluloid ... but we've "talked" about this so many times. And here we're still going around and around about it. Just make an experimental branch and promote it for now... that ought to solve your main problems of not having the code up, not having it for yourself, etc. Pretty much every serious person here depends on their own fork anyway at some time or another.

@e2
Copy link
Contributor Author

e2 commented Apr 21, 2015

@tarcieri -

I gave a rationale for why it's there (which you didn't like)

And I explained there's a different solution for that (e.g. via a "proxy actor") and I created 2 actor specs to demonstrate (those 2 examples were merged).

as I have repeatedly insisted, the highest churn code in the entire library

which suggests that area needs lots of rework - I can't think of a saner solution.

That should be a signal to you to tread carefully, and I don't feel you've done that.

In all honestly, I have no idea what I could've done specifically to please you here.

I'm a bit worried that when 0.17.0 does come out, there's going to be an awful lot of people with an awful lot of problems

The only 2 behavior changes I've made in 'lib' are well known - I haven't changed anything else (I tried extremely hard not to). I'd rather rework logging and tracing, so it's easier for people (and us) to diagnose problems.

With so many changes, it's better to "abandon" a false sense of stability - and prepare to more effectively deal with whatever happens. Most issues people have will be unpredictable, anyway. Even a new MRI release can cause painful breakage.

I don't want to tell people "sorry, we can't fix this quickly, because it means changing code in a high-churn area". I can commit to support around the 2 patches - but that doesn't make me responsible for 0.17, because I have no control over what goes in there or not.

If someone has a tough race condition, and I'm "not allowed" to add tracing facilities "because of performance" or "historical context" ... committing to support would be like hanging myself (I wouldn't have been able to fix the tests as I did without a crap load of changes for debugging and tracing).

Given the changes, I've never believed 0.17 will be without major problems - the tests are barely green to begin with. Many people may just lock to 0.16 (others may have already locked to 0.14 long ago anyway). The best approach is to reduce the efforts in debugging, diagnosing, testing, fixing and preventing issues.

The "be careful with changes" approach is perfect for versions up to 0.16. If people are going to have major issues, it's best we find ways to prepare for them.

The worst issues are when "something hangs" or "blows up" without extra info (people can't diagnose, we can't help either) - being able to quickly fix an issue can instantly turn any initial frustration around.

0.17 has passed the point of "an 0.16 update" - just extracting the gems means all projects have to be modified to even run. There's a reason RSpec was split into 'rspec' and 'rspec-core'. There's a historical context for it and use cases to back that up. But that's not an area of Celluloid anyone cares about.

I'll prepare the patches on whatever branch I'm told to and I'll write a Wiki explaining what could change and fixes/workarounds.

@e2 e2 closed this as completed Apr 21, 2015
@tarcieri
Copy link
Member

In all honestly, I have no idea what I could've done specifically to please you here.

You could've actually looked into the history of the code in question the first time I asked, instead of precipitating this long, drawn-out argument.

It looks like the first time you actually even bothered looking at the history of this section of code was here:

#563 (comment)

That's about a week and dozens of invective-laced messages after I told you to originally.

@tarcieri
Copy link
Member

Per @digitalextremist your changes are breaking things, so this issue isn't closed yet. Please let @digitalextremist close the issue when he's satisfied your changes actually work.

@tarcieri tarcieri reopened this Apr 21, 2015
@e2
Copy link
Contributor Author

e2 commented Apr 21, 2015

@tarcieri - you had me convinced the code was important for something, and I couldn't find out how/what. The history doesn't say why the method and arity checking is important. It just doesn't.

Ironically, my mistake all along was assuming the code was crucial.

As for this patch - it has been reverted (so it's not in 0.17). So this is closed - I asked @digitalextremist above to open a new issue (since it could be occurring with - but unrelated to this removal).

@tarcieri
Copy link
Member

@e2 have you read anything I've said this entire thread, or do you just like hearing yourself talk?

@tarcieri
Copy link
Member

@e2 honestly, in my two decades of experience in open source, this is the most dysfunctional encounter I've ever had with anyone, period.

@e2
Copy link
Contributor Author

e2 commented Apr 21, 2015

@tarcieri - sorry about that. I'll work it out.

@digitalextremist - this issue is on you, since you're the only one with an (technically unclear) objection. Prepare a branch with the unreverted patch (and the other syncproxy commit above), and let's work out what's preventing you from putting this into a release.

@tarcieri
Copy link
Member

sorry about that. I'll work it out.

Cool! 😄

@digitalextremist digitalextremist modified the milestones: renaissance, 0.18.0 Apr 21, 2015
e2 added a commit to e2/celluloid that referenced this issue Apr 21, 2015
This reverts commit b35bf18.

Conflicts:
	lib/celluloid/calls.rb
@celluloid celluloid locked and limited conversation to collaborators Apr 21, 2015
@digitalextremist
Copy link
Member

See: #579 for experimentation for the time being.

@digitalextremist
Copy link
Member

This forever moved to abstractive/celluloid-experimental#1, and once proven can be merged into celluloid/current

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants