-
Notifications
You must be signed in to change notification settings - Fork 1.9k
JavaScript: Use API graphs for library modelling #4082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
34afcd5 to
943ac47
Compare
I wonder how this relates to how we describe flow summaries in C#. We also recently added access path sensitivity to flow summaries. |
|
I don't think it's very closely related (though I'm basing this on an admittedly somewhat superficial understanding of the C# flow summaries). The goal of this PR is to identify data-flow nodes that are reads/writes of global access paths, whereas the flow summaries deal, I think, primarily with local access paths. For example, we want to be able to describe the call-back parameter require('mysql').createPool().getConnection(function(err, conn) { ... })And crucially we want this to work even if the calls aren't nicely chained as in this example, but local or even global data flow is needed to connect up the individual steps. Global access paths can also be used to describe flow summaries, but that's not in scope for this PR. (Apologies for the under-documented state of this; I'll add more details as I work towards moving it out of draft status.) |
erik-krogh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a very superficial look because I wanted to see how it works.
I got two small suggestions related to Promises (both are untested, I'm not sure I got it right).
19376c5 to
8447215
Compare
|
Just in time for the weekend, I have some encouraging early performance numbers, which suggest a mild overall slowdown. Getting to neutral performance would be great, but I think for now the important thing is that it doesn't completely blow up. (Also note that relative performance will improve as more individual type trackers are replaced by API graphs.) |
asgerf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Pretty much LGTM, just some superficial bits to consider.
| /** | ||
| * Gets a data-flow node corresponding to a use of this API feature. | ||
| */ | ||
| DataFlow::Node getAUse() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a brief source code example?
For example,
require('fs').readFileis a use of a feature from thefsmodule.
| /** | ||
| * Gets a data-flow node corresponding to the right-hand side of a definition of this API | ||
| * feature. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise.
For example, the function expression in
exports.foo = function() {}could be a definition of an API feature.
If this is a parameter feature, gets an argument flowing into that parameter.
The last point seems particularly counter intuitive, as most readers would probably assume that the "definition" of a parameter is its declaration, not an argument flowing into it. I mean it's not suprising to me because we talked about these graphs quite a bit, but if I was just reading the client code without knowing about API graphs beforehand, reading something like getParameter(0).getADefinition() would be confusing.
Would it be worth renaming this to getARhs()? It would also make it slightly more obvious that getALocalSource is not implied.
javascript/ql/src/ApiGraphs.qll
Outdated
| } | ||
|
|
||
| /** | ||
| * Gets a feature representing an instance of this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Gets a feature representing an instance of this one. | |
| * Gets a feature representing an instance of this one, that is, an object whose constructor is this feature. |
| exists(string prop | | ||
| this = [createConnection(), createPool()].getOptionArgument(0, prop).asExpr() and | ||
| exists(API::Feature call, string prop | | ||
| (call = createConnection() or call = createPool()) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some reason to go from set literal to (... or ...) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No; set literal would be much better.
| optionsArg = -2 and | ||
| cmdArg = 0 | ||
| or | ||
| mod = "remote-exec" and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to sneak in support for remote-exec here? Which is fine of course, just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently I managed to duplicate that (it's already defined further down). Will fix.
| exists(DataFlow::ModuleImportNode expressLimiter | | ||
| expressLimiter.getPath() = "express-limiter" and | ||
| expressLimiter.getACall().getArgument(0).getALocalSource().asExpr() = | ||
| exists(API::Feature expressLimiter | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we don't need this exists anymore
The added test shows how this helps us avoid false positives.
The optimiser decided that it would be a great idea to start the pipeline with `getReturn().getAUse().(DataFlow::InvokeNode)`. It's not.
This is instead of making every access to those variables source nodes, and fixes a regression in `DeadStoreOfProperty`.
c4c1e0c to
ec3c1f1
Compare
`SourceNode` in cached layers seems particularly problematic.
31b9e23 to
45cf73e
Compare
|
I've added the cleanups and the tests, in self-contained commits to ease reviewing. Apart from the outstanding evaluation, there is one issue I'd like to see resolved before this go ahead, which is the inevitable terminology bikeshed: what do people think about the "API feature" terminology, and in particular the naming of the class Clearly, API features are just nodes of the API graph, but my thinking was that we already have tons of different things called "nodes" (AST nodes, CFG nodes, data-flow graph nodes, and probably a few I've missed), and I didn't want to add to that pile. However, I find myself slipping back into "node" terminology quite frequently. Thoughts? Should we stick with |
35275ad to
799d720
Compare
|
Slightly awkward restructuring needed to make the tests pass on both the old and the new test runner (cf last commit)... |
…PI graphs. In particular, we now have two different kinds of module features: module definitions and module uses. For the most part, `API::Definition`s correspond to right-hand sides in the data-flow graph, and `API::Use`s correspond to references. However, module definitions can have references (via the CommonJS `module` variable), and so can their exports (via `module.exports` or `exports`). Note that this is different from references to uses of the module, which are simply imports.
…t it from `javascript.qll`.
799d720 to
4c4b83b
Compare
With the old test runner we cannot have `VerifyAssertions.qlref`s for each individual test that reference a shared `VerifyAssertions.ql` in the parent directory, since it doesn't like nested tests. Instead, we have to turn `VerifyAssertions.ql` into `VerifyAssertions.qll`, and each `VerifyAsssertions.qlref` into a `VerifyAssertions.ql` that imports it. But then that doesn't work with our old directory structure, since the import path would have to contain the invalid identifier `library-tests`. As a workaround, I have moved the API graph tests into a directory without dashes in its path.
4c4b83b to
252902d
Compare
Actually, yes I think we should rename it. We already use graph and edge terminology, and I thought some of the doc comments ended up sounding a little weird due to the word "feature". We have many things called node but at least we can be consistent about it. |
It turned out to be more confusing than helpful, so we're back with plain old API-graph "nodes".
|
OK, I've renamed accordingly and updated a few out-of-date comments while I was at it. |
asgerf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The stars haven't quite aligned yet; investigating a 20% slowdown on TypeScript... (Everything else looks fine, fortunately.) |
This prevents spurious recomputation of a cached stage.
This blows up on the TypeScript compiler, and is likely to be much less useful than tracking type names and namespaces, which we still do.
a1b985c to
b8a4924
Compare
|
@asgerf: Performance is looking happy again (see link in internal issue). Would you like to take another look at the recent commits? |
asgerf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's merge!
This PR contributes an implementation of API graphs, which are a way of describing the API surface produced and/or consumed by a code base. The nodes of the API graph, referred to as "API features", represent uses and definitions of API components like functions exported by npm packages or their parameters and return values. Edges are directed and labelled, and indicate how the features relate to each other.
As a concrete example, consider this code snippet:
It is described by this API graph (very slightly simplified):
Nodes rendered as boxes represent definitions, diamonds are uses; the oval root node is neither a use nor a definition. I have labelled nodes with the code snippet they correspond to where possible (the node representing the use of module
fsand the root node do not have corresponding code).Note that by reading off the labels on a path from the root to a node you get a global access path for the component used/defined by this node: for example, the
errparameter is the 0th parameter of the (callback passed as) 1st parameter to memberreadFileof the exports of modulefs, so its access path is/module fs/member exports/member readFile/parameter 1/parameter 0(your notation may vary).When the API graph is a tree, as in this case, it is just a representation of a (finite) set of access paths. In general, however, API graphs can be DAGs or even completely general graphs with cycles. In the former case, a single node can have multiple access paths, and in the latter it can even have infinitely many access paths. Put another way, API graphs represent a possibly infinite set of access paths, and encode aliasing relationship between them.
One practical application of API graphs is in library modelling, where they can be used as a DSL to describe how a code base accesses a library API. This PR contributes a library for doing that, which would allow us, for example, to abstractly describe the parameter
errin the example above asNote that the sequence of QL predicate calls essentially just encodes the access path, with
moduleImportconveniently compressing the first two steps into one.Once we have selected an API feature in this way, we can then use the predicates
getAUse()andgetARhs()to map it to a data-flow node representing either a use or (the right-hand side of) a definition of the corresponding API feature.The present PR shows how to port our libraries for modelling SQL connectors and command-execution libraries to make use of API graphs.
The look-and-feel is very similar to our current
SourceNode-based approach, but while that approach only performs local data flow and inter-procedural flow has to be added manually via type tracking, API graphs come with inter-procedural flow built in. This scales because they only type track nodes that are reachable from the API surface, which in practice tends to work out. (I am running a final evaluation to confirm, but so far experiments show only a mild performance penalty, which will pay for itself as we port more of the library to use API graphs.)As a concrete example where this additional power is useful, this PR also switches
MissingRateLimiting.qllto API graphs, fixing #4000 without the syntactic overhead additional type tracking would require.