Ruby: Add basic subclassing support to API Graphs #7663

Merged: hmac merged 8 commits into main from hmac/api-graph-subclass on Feb 3, 2022

Contributor

@hmac hmac commented Jan 19, 2022

Given the code

class A; end
class B < A; end
class C < A; end

You can find uses of B and C with the expression

API::getTopLevelMember("A").getASubclass()

We do this by adding edges from the use of A in class B < A to all uses of B.

API Graph paths that use getASubclass() will, I think, always be longer than an equivalent "canonical" path which doesn't use getASubclass(). For example, a use of the class B above is accessible via API::getTopLevelMember("B") and API::getTopLevelMember("A").getASubclass(). Therefore when it comes to testing getASubclass(), we're interested in non-canonical paths, for example:

class A; end
class B < A; end

B # API::getTopLevelMember("B") [canonical], API::getTopLevelMember("A").getASubclass()
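To make the canonical versus non-canonical distinction concrete, here is a toy Ruby model. It is only a sketch, not the CodeQL implementation: the node names, the `TOY_EDGES` table, `paths_to`, and `render` are all hypothetical, and "fewest edges" stands in for whatever the real canonical-path selection does. The only thing taken from the description above is that `class B < A` adds a subclass edge from the use of A to uses of B.

```ruby
# Toy API graph: each edge is [from, label, to]. All names are hypothetical.
TOY_EDGES = [
  [:root,  'getTopLevelMember("A")', :use_a],
  [:root,  'getTopLevelMember("B")', :use_b],
  [:use_a, 'getASubclass()',         :use_b] # added because of `class B < A`
].freeze

# Return every label sequence leading from `node` to `target`.
# The toy graph is acyclic, so plain recursion terminates.
def paths_to(target, node = :root, labels = [])
  found = node == target ? [labels] : []
  TOY_EDGES.each do |from, label, to|
    found.concat(paths_to(target, to, labels + [label])) if from == node
  end
  found
end

# Turn a label sequence into an API-graph-style path string.
def render(labels)
  "API::" + labels.join(".")
end

all_paths = paths_to(:use_b)
canonical = all_paths.min_by(&:size) # fewest edges, standing in for the canonical path

all_paths.each { |labels| puts render(labels) }
puts "canonical: #{render(canonical)}"
```

Both paths reach the same use of B, but only the two-edge one goes through the subclass edge, which is why a shortest-path test would never exercise it.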

To test subclass support using inline tests, I've extended the InlineTest framework to support optional results, which are matched against annotations but do not trigger a failure if there is no matching annotation. This allows us to add annotations for non-canonical paths where we want to test subclassing, but leave existing tests alone.
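The matching rule for optional results can be sketched in a few lines of Ruby. This is a toy model under my own assumptions, not the InlineTest framework itself: `check_annotations`, the `[line, tag]` pair shape, and the `use=...` tag strings are hypothetical and exist only to show which combinations count as failures.

```ruby
# Toy matcher, not the InlineTest implementation.
# Each result or annotation is a [line_number, tag] pair; all names are hypothetical.
def check_annotations(annotations, required, optional)
  failures = []
  # A required result with no matching annotation fails the test.
  (required - annotations).each { |r| failures << [:missing_annotation, r] }
  # An annotation matched by neither kind of result also fails.
  (annotations - required - optional).each { |a| failures << [:unexpected_annotation, a] }
  # Optional results that nobody annotated are deliberately not reported.
  failures
end

required = [[4, 'use=getMember("B")']]
optional = [[4, 'use=getMember("A").getASubclass()']]

# Only the canonical path is annotated: no failures.
p check_annotations(required, required, optional)
# The subclass path is annotated too: still no failures.
p check_annotations(required + optional, required, optional)
# An annotation that matches nothing is reported.
p check_annotations(required + [[4, 'use=getMember("Z")']], required, optional)
```

The asymmetry is the point: an optional result can satisfy an annotation, but it never demands one.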

@hmac hmac force-pushed the hmac/api-graph-subclass branch from cfcbca0 to e225d9d Compare January 23, 2022 23:25
hmac added 4 commits January 25, 2022 16:40

Now that API graphs have basic subclassing support, we can simplify some
of the ActiveRecord and ActionController code.

This simplifies some of the code.

The idea behind optional results is that there may be instances where
each line of source code has many results and you don't want to annotate
all of them, but you still want to ensure that any annotations you do
have are correct.

This change makes that possible by exposing a new predicate
`hasOptionalResult`, which has the same signature as `hasResult`.

Results produced by `hasOptionalResult` will be matched against any
annotations, but the lack of a matching annotation will not cause a
failure.

We will use this in the inline tests for the API edge getASubclass,
because for each API path that uses getASubclass there is always a
shorter path that does not use it, and thus we can't use the normal
shortest-path matching approach that works for other API Graph tests.
@hmac hmac force-pushed the hmac/api-graph-subclass branch from e225d9d to c5904b7 Compare January 25, 2022 03:41
// In Rails applications `ApplicationController` typically extends `ActionController::Base`, but we
// treat it separately in case the `ApplicationController` definition is not in the database.
API::getTopLevelMember("ApplicationController")
].getASubclass*().getAUse().asExpr().getExpr()
Contributor

I think it would be nice to have a method named getADirectSubclass and define getASubclass to be getADirectSubclass*. I expect that in almost all cases users need the transitive closure, and they are likely to forget to add the *. Having the nicely named method "just do the right thing in most cases" would be helpful.

Contributor

@hvitved hvitved left a comment

I don't have much experience with subclasses in API graphs, but @max-schaefer mentioned that it might be a good principle for API graphs to adhere to the substitution principle. The way I understand this, it would mean that when asking, say, for getMember(C), we should already be getting C and all subclasses thereof. However, whether we would also like to be able to get just C, I don't know.

@@ -174,8 +174,7 @@ module API {
// avoid producing strings longer than 1MB
result.length() < 1000 * 1000
)
) and
length in [1 .. Impl::distanceFromRoot(this)]
Contributor

I don't think we should change this just so we can use it in a test. I think it would be OK to replicate the above in the test itself instead.

Contributor Author

I removed this because I noticed that we were only calling this predicate in

string getPath() {
  result = min(string p | p = this.getAPath(Impl::distanceFromRoot(this)) | p)
}

which already passes in a max length of Impl::distanceFromRoot(this), so I thought it was redundant. Am I mistaken?

Contributor

[...] which already passes in a max length of Impl::distanceFromRoot(this), so I thought it was redundant. Am I mistaken?

Careful! This is "top-down" reasoning ("passes in"), but the evaluation is actually bottom-up.

Consider how getAPath is evaluated in isolation. The lines you deleted would prevent this predicate from containing a result and length where length is greater than the shortest path to the given node. Without those lines, there's no such restriction (except the fact that result can't be more than 1000000 characters, which is likely something that happens much further out). In particular, if you have a loop in the API graph (which is almost certain), this predicate will keep going round and round in that loop, producing path strings of greater and greater length, until finally hitting the 1M limit.
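A small Ruby simulation may help picture this. It is a sketch under stated assumptions, not how the QL evaluator works internally: the three-node graph in `LOOP_EDGES`, the `DIST` table, the `path_tuples` helper, and the 40-character `MAX_CHARS` cap (standing in for the 1MB limit) are all hypothetical. The function computes every tuple the relation would contain, without knowing how a caller might later filter it.

```ruby
# Toy graph with a loop between :a and :b. All names are hypothetical.
LOOP_EDGES = { root: [[:a, "a"]], a: [[:b, "b"]], b: [[:a, "a"]] }.freeze
DIST = { a: 1, b: 2 }.freeze # shortest distance from :root to each node
MAX_CHARS = 40               # small stand-in for the 1MB string cap

# Compute all [node, labels] tuples to a fixpoint, independent of any caller.
# With `bounded: true`, tuples longer than the node's shortest distance are dropped.
def path_tuples(bounded:)
  tuples = []
  frontier = [[:root, []]]
  until frontier.empty?
    # Extend every path on the frontier by one edge.
    frontier = frontier.flat_map do |node, labels|
      LOOP_EDGES.fetch(node, []).map { |succ, label| [succ, labels + [label]] }
    end
    # Keep only the tuples that satisfy the restrictions.
    frontier = frontier.select do |node, labels|
      labels.join(".").length < MAX_CHARS &&
        (!bounded || labels.size <= DIST.fetch(node))
    end
    tuples.concat(frontier)
  end
  tuples
end

puts "with length bound:    #{path_tuples(bounded: true).size} tuples"
puts "without length bound: #{path_tuples(bounded: false).size} tuples"
```

With the bound, the loop is cut off as soon as a path exceeds the shortest distance to its node; without it, the only thing that stops the iteration is the string-length cap.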

In general, it's useful to consider the question "which tuples will be in this relation" regardless of any context that may limit that set. If we're lucky, that context will be automatically magicked in, but it might not (and so it's better to just limit the number of tuples in advance).

Contributor Author

Thanks for explaining! I'm definitely still getting used to thinking in the bottom-up evaluation mindset :)

Contributor Author

I've reverted this change in a followup commit - the test now uses a copy of getAPath without the length restriction.

use(succ, b) and
resolveConstant(b.asExpr().getExpr()) = resolveConstantWriteAccess(c) and
c.getSuperclassExpr() = a.asExpr().getExpr() and
lbl = Label::subclass()
Contributor

I wonder if this should include the name of the subclass.

Contributor Author

That sounds reasonable. I can't immediately see what benefit it would bring, though. Does it make anything in particular easier/harder?

Contributor

Let's just skip it for now.

@max-schaefer
Contributor

@max-schaefer mentioned that it might be a good principle for API graphs to adhere to the substitution principle

Yeah, that was admittedly a bit of a throw-away line. I haven't implemented this yet.

My intuition was that if I ask the API graph to give me all instances of C, it would be nice if that included instances of subtypes, since (by the substitution principle) they are also instances of C. Similarly, if I ask for method m of C (perhaps specifying a signature for languages where that makes sense), it would seem reasonable for that to include overrides of m in subclasses of C.
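In plain Ruby terms, the intuition looks roughly like the sketch below. The class names `Parent` and `Child` and the helper `instances_of` are hypothetical, and this only shows the runtime behaviour the substitution principle describes; it says nothing about how an API graph would model it.

```ruby
class Parent
  def m
    "Parent#m"
  end
end

class Child < Parent
  # Override of m in a subclass.
  def m
    "Child#m"
  end
end

# Select the objects that count as instances of `klass`, subclasses included.
def instances_of(klass, objects)
  objects.select { |obj| obj.is_a?(klass) }
end

objects = [Parent.new, Child.new, Object.new]
found = instances_of(Parent, objects)

# Instances of Child are also instances of Parent.
puts found.map { |obj| obj.class.name }.join(", ")
# Calling m on "an instance of Parent" may dispatch to the override.
puts found.map(&:m).join(", ")
```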

I remember the @github/codeql-python folks remarking how their API-graph code ended up using getASubclass*() all over the place, which could perhaps be avoided by this scheme.

@tausbn
Contributor

tausbn commented Jan 27, 2022

I remember the @github/codeql-python folks remarking how their API-graph code ended up using getASubclass*() all over the place, which could perhaps be avoided by this scheme.

Indeed. When I originally implemented support for subclassing, you mentioned this issue and I discussed it with the other members of the Python team (as I was eager to avoid peppering our code with getASubclass*()).

If memory serves, one of the concerns raised was that if some_class_api_node implicitly represents not just itself but all of its subclasses, then we lose the ability to distinguish between these cases (which might be relevant at some point). So instead we opted for the more explicit (and in turn perhaps more Pythonic) getASubclass, to make it explicitly clear that we're now including all subclasses. In practice, this isn't terribly much of a nuisance. (Also, checking our libraries just now, most instances use getASubclass*, but there is one instance where we do getASubclass+ instead.)

@max-schaefer
Contributor

If memory serves, one of the concerns raised was that if some_class_api_node implicitly represents not just itself but all of its subclasses, then we lose the ability to distinguish between these cases (which might be relevant at some point)

Absolutely! That's why my proposal was to only bring in the substitution principle when referring to instances or members (not when referring to the class itself), but there are still things you can't express that way, and it's entirely possible that they are practically relevant.

@aibaars
Contributor

aibaars commented Jan 27, 2022

If memory serves, one of the concerns raised was that if some_class_api_node implicitly represents not just itself but all of its subclasses, then we lose the ability to distinguish between these cases (which might be relevant at some point)

Absolutely! That's why my proposal was to only bring in the substitution principle when referring to instances or members (not when referring to the class itself), but there are still things you can't express that way, and it's entirely possible that they are practically relevant.

I think we should make the features that people need 90% of the time as convenient as possible. Ideally we'd still offer some predicates for the curious corner cases where one really needs the distinction. For example, let getAnInstance/getASubClass return the transitive result, while keeping things like getAnImmediateInstance/getAnImmediateSubClass available to implement the special cases.

@max-schaefer
Contributor

I like that suggestion! It's similar to how we have getAUse and getAnImmediateUse that do/don't follow interprocedural flow: you usually want the former but the latter is available for the occasional case where you don't.

hmac added 2 commits January 28, 2022 16:44
    class A; end
    class B < A; end
    class C < B; end

In the example above, `getMember("A").getAnImmediateSubclass()` will
select only uses of B, whereas `getMember("A").getASubclass()` will
select uses of A, B and C. This is usually the behaviour you want.
@hmac
Contributor Author

hmac commented Feb 1, 2022

Thanks everyone for your comments and the insightful discussion! I've pushed a change to the exposed API such that:

  • getAnImmediateSubclass() returns direct subclasses of the receiver
  • getASubclass() returns the reflexive transitive closure of the above, so its results include the receiver itself

As a result, to get a class A and all its subclasses you still have to explicitly call getASubclass().
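The difference between the two predicates can be mirrored in ordinary Ruby, as a rough analogy only. The helpers `immediate_subclasses` and `subclass_closure` and the `KNOWN_CLASSES` list are hypothetical; the real predicates work on uses of classes in the API graph, whereas this sketch works on the class objects themselves.

```ruby
class A; end
class B < A; end
class C < B; end

# The classes this toy example knows about (hypothetical, fixed list).
KNOWN_CLASSES = [A, B, C].freeze

# Analogue of getAnImmediateSubclass(): direct subclasses only.
def immediate_subclasses(klass)
  KNOWN_CLASSES.select { |k| k.superclass == klass }
end

# Analogue of getASubclass(): the class itself plus all descendants.
# `k <= klass` is true when k is klass or inherits from it.
def subclass_closure(klass)
  KNOWN_CLASSES.select { |k| k <= klass }
end

puts "immediate: #{immediate_subclasses(A).inspect}"
puts "closure:   #{subclass_closure(A).inspect}"
```

Here the immediate version yields only B, while the closure yields A, B and C.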

If you want to match calls to Foo::Bar.baz whilst also including any subclasses of Foo and subclasses of Foo::Bar, you will have to use

API::getTopLevelMember("Foo").getASubclass().getMember("Bar").getASubclass().getMethodCall("baz")

However I think this is a rare case. It is more common for Foo to be a module, in which case you can't create subclasses of it.
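For reference, these are the kinds of Ruby call sites I understand that longer expression to be aimed at. The names `SubFoo` and `SubBar` are hypothetical, and whether each call is actually matched depends on the API graph implementation; the snippet only shows that all three are valid ways of reaching the same `baz`.

```ruby
class Foo
  class Bar
    def self.baz
      :baz
    end
  end
end

class SubFoo < Foo; end      # a subclass of Foo
class SubBar < Foo::Bar; end # a subclass of Foo::Bar

results = [
  Foo::Bar.baz,    # the plain call
  SubFoo::Bar.baz, # Bar looked up through a subclass of Foo
  SubBar.baz       # baz called on a subclass of Foo::Bar
]

p results
# SubFoo::Bar resolves to the very same class object as Foo::Bar.
p SubFoo::Bar.equal?(Foo::Bar)
```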

I think this is a reasonable middle ground, and I think it would be nice to merge this PR and try using it in our queries etc., and then make further changes/improvements as we encounter the need. How does that sound?

@hmac hmac marked this pull request as ready for review February 1, 2022 23:18
@hmac hmac requested review from a team as code owners February 1, 2022 23:18
@MathiasVP
Contributor

To test subclass support using inline tests, I've extended the InlineTest framework to support optional results, which are matched against annotations but do not trigger a failure if there is no matching annotation. This allows us to add annotations for non-canonical paths where we want to test subclassing, but leave existing tests alone.

I've wanted this feature myself for a while ❤️. In fact, I did it the wrong way in #5417 and closed it again following the discussion with @aschackmull. But I like the API you went with here a lot more 👍.

MathiasVP previously approved these changes Feb 1, 2022
@hmac
Contributor Author

hmac commented Feb 2, 2022

(Sorry about the mass ping, language teams! Changes to the shared inline test framework caused GitHub to request reviews from everyone.)

@erik-krogh
Contributor

erik-krogh commented Feb 2, 2022

I'm currently implementing def nodes for the API graph implementation in Python, and I also encountered similar testing issues.
I ended up porting the API-graph testing framework from JS, but using the same syntax as the inline expectation tests.

You can see my implementation here.
A clear advantage is that the error messages you get are usually very good, and there is no potential performance issue from computing a basically unbounded number of paths.

E.g. I think your test will blow up if you try a test like this one.
(Some quick math suggests that there are at least 2^(1000000/17)≈10^17707 paths of length less than 1 million to the API-nodes in that function).
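The conversion from a power of two to a power of ten can be checked with a couple of lines of Ruby. This only verifies the arithmetic, taking the 1000000 and 17 from the estimate as given; the helper name `power_of_ten_exponent` is hypothetical.

```ruby
# log10(base ** power) == power * log10(base), so this returns the x in 10^x.
def power_of_ten_exponent(base, power)
  power * Math.log10(base)
end

exponent = power_of_ten_exponent(2, 1_000_000 / 17.0)
puts "2^(1000000/17) is roughly 10^#{exponent.floor}"
```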

*/
Node getASubclass() { result = this.getASuccessor(Label::subclass()) }
Node getASubclass() { result = this.getAnImmediateSubclass*() }
Contributor

Should we update

Node getInstance() { result = this.getASuccessor(Label::instance()) }

to

Node getInstance() { result = this.getASubclass().getASuccessor(Label::instance()) }

@asgerf , @aibaars I think this is related to our discussion yesterday

Contributor

I agree that's probably better (though I still don't have a good intuition about what it's like to model Ruby frameworks).

Contributor Author

I think that makes sense. If we do so, we should probably also do the same for getReturn. Together that should cover both instance and class methods.

Contributor Author

Added this in 704b585

Change the behaviour of `API::getInstance()` and `API::getReturn()` to
include results on subclasses of the current API node.
@hmac
Copy link
Contributor Author

hmac commented Feb 3, 2022

I'm currently implementing def nodes for the API graph implementation in Python, and I also encountered similar testing issues. I ended up porting the API-graph testing framework from JS, but using the same syntax as the inline expectation tests.

You can see my implementation here. A clear advantage is that the error messages you get are usually very good, and there is no potential performance issue from computing a basically unbounded number of paths.

E.g. I think your test will blow up if you try a test like this one. (Some quick math suggests that there are at least 2^(1000000/17)≈10^17707 paths of length less than 1 million to the API-nodes in that function).

Thanks @erik-krogh! I think for now I'm going to stick with the simple approach in this PR but if we encounter performance problems in future then it's good to know there's a better implementation we can switch to 👍

@hmac hmac merged commit ab7fd89 into main Feb 3, 2022
@hmac hmac deleted the hmac/api-graph-subclass branch February 3, 2022 21:19
8 participants