@max-schaefer max-schaefer commented Aug 14, 2020

This PR contributes an implementation of API graphs, which are a way of describing the API surface produced and/or consumed by a code base. The nodes of the API graph, referred to as "API features", represent uses and definitions of API components like functions exported by npm packages or their parameters and return values. Edges are directed and labelled, and indicate how the features relate to each other.

As a concrete example, consider this code snippet:

const fs = require('fs');
fs.readFile('data.csv', (err, data) => {
  if (err) throw err;
  console.log(data);
});

It is described by this API graph (very slightly simplified):

[Image: api-graph]

Nodes rendered as boxes represent definitions, diamonds are uses; the oval root node is neither a use nor a definition. I have labelled nodes with the code snippet they correspond to where possible (the node representing the use of module fs and the root node do not have corresponding code).

Note that by reading off the labels on a path from the root to a node you get a global access path for the component used/defined by this node: for example, the err parameter is the 0th parameter of the (callback passed as) 1st parameter to member readFile of the exports of module fs, so its access path is /module fs/member exports/member readFile/parameter 1/parameter 0 (your notation may vary).

When the API graph is a tree, as in this case, it is just a representation of a (finite) set of access paths. In general, however, API graphs can be DAGs or even completely general graphs with cycles. In the former case, a single node can have multiple access paths, and in the latter it can even have infinitely many access paths. Put another way, API graphs represent a possibly infinite set of access paths, and encode aliasing relationships between them.
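To make the relationship between graphs and access paths concrete, here is a minimal sketch in plain JavaScript (not the QL library contributed by this PR). It writes the graph for the fs.readFile snippet as a list of labelled edges and enumerates access paths up to a bounded length. The node names, the `accessPaths` helper, and the extra `member cause` self-edge are all hypothetical and exist only for illustration.

```javascript
// Hypothetical edge list for the fs.readFile example; node names are made up.
const edges = [
  { from: "root", label: "module fs", to: "use(fs)" },
  { from: "use(fs)", label: "member exports", to: "fs" },
  { from: "fs", label: "member readFile", to: "fs.readFile" },
  { from: "fs.readFile", label: "parameter 1", to: "callback" },
  { from: "callback", label: "parameter 0", to: "err" },
  { from: "callback", label: "parameter 1", to: "data" },
];

// Enumerates every access path of at most `maxLen` labels from `start` to `target`.
// The bound is needed because a cyclic graph has infinitely many paths.
function accessPaths(graph, start, target, maxLen) {
  const result = [];
  function walk(node, labels) {
    if (node === target) result.push("/" + labels.join("/"));
    if (labels.length === maxLen) return;
    for (const e of graph) {
      if (e.from === node) walk(e.to, [...labels, e.label]);
    }
  }
  walk(start, []);
  return result;
}

// Tree case: exactly one access path for `err`.
console.log(accessPaths(edges, "root", "err", 10));

// Cyclic case (hypothetical self-edge): one more path for every extra step allowed.
const cyclic = [...edges, { from: "err", label: "member cause", to: "err" }];
console.log(accessPaths(cyclic, "root", "err", 7).length); // → 3
```

In the tree case the only path printed is the one read off the figure. In the cyclic case the number of paths grows with the bound, which is the sense in which a node in a cyclic graph has infinitely many access paths.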

One practical application of API graphs is in library modelling, where they can be used as a DSL to describe how a code base accesses a library API. This PR contributes a library for doing that, which would allow us, for example, to abstractly describe the parameter err in the example above as

API::moduleImport("fs").getMember("readFile").getParameter(1).getParameter(0)

Note that the sequence of QL predicate calls essentially just encodes the access path, with moduleImport conveniently compressing the first two steps into one.
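As a rough illustration of how the call chain mirrors the access path, here is a toy JavaScript stand-in (it is not the QL API, and the `ApiNode` class is hypothetical). Each call appends one label, and `moduleImport` contributes the first two labels at once.

```javascript
// Toy stand-in for the QL predicates: each call appends one label to the access path.
class ApiNode {
  constructor(labels) {
    this.labels = labels;
  }
  getMember(name) {
    return new ApiNode([...this.labels, `member ${name}`]);
  }
  getParameter(index) {
    return new ApiNode([...this.labels, `parameter ${index}`]);
  }
  toString() {
    return "/" + this.labels.join("/");
  }
}

// Mirrors how moduleImport compresses the "module" and "member exports" steps into one.
function moduleImport(name) {
  return new ApiNode([`module ${name}`, "member exports"]);
}

const err = moduleImport("fs").getMember("readFile").getParameter(1).getParameter(0);
console.log(String(err));
// → /module fs/member exports/member readFile/parameter 1/parameter 0
```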

Once we have selected an API feature in this way, we can then use the predicates getAUse() and getARhs() to map it to a data-flow node representing either a use or (the right-hand side of) a definition of the corresponding API feature.

The present PR shows how to port our libraries for modelling SQL connectors and command-execution libraries to make use of API graphs.

The look-and-feel is very similar to our current SourceNode-based approach, but while that approach only performs local data flow and inter-procedural flow has to be added manually via type tracking, API graphs come with inter-procedural flow built in. This scales because they only type track nodes that are reachable from the API surface, which in practice tends to work out. (I am running a final evaluation to confirm, but so far experiments show only a mild performance penalty, which will pay for itself as we port more of the library to use API graphs.)

As a concrete example where this additional power is useful, this PR also switches MissingRateLimiting.qll to API graphs, fixing #4000 without the syntactic overhead additional type tracking would require.

@max-schaefer max-schaefer requested a review from asgerf August 14, 2020 16:59
@adityasharad adityasharad changed the base branch from master to main August 14, 2020 18:33
@hvitved
Contributor

hvitved commented Aug 17, 2020

> This PR contributes an implementation of API graphs, which are a way of describing the API surface of a library both in terms of its implementation and in terms of its uses. Technically, it basically amounts to an implementation of global access paths enriched with aliasing information. (More complete description TBD.)

I wonder how this relates to how we describe flow summaries in C#. We also recently added access path sensitivity to flow summaries.

@max-schaefer
Contributor Author

I don't think it's very closely related (though I'm basing this on an admittedly somewhat superficial understanding of the C# flow summaries). The goal of this PR is to identify data-flow nodes that are reads/writes of global access paths, whereas the flow summaries deal, I think, primarily with local access paths.

For example, we want to be able to describe the callback parameter conn in the following snippet:

require('mysql').createPool().getConnection(function(err, conn) { ... })

And crucially, we want this to work even if the calls aren't nicely chained as in this example and local or even global data flow is needed to connect up the individual steps.
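For illustration, an unchained variant of the same snippet might look like the sketch below. The `mysql` object here is a hand-written stand-in (an assumption, so that the code runs without the real package or a database), and `makePool` and `onConnection` are hypothetical names.

```javascript
// Stand-in for require('mysql') so the sketch runs without the package or a database.
// It only mimics the shape of createPool() and getConnection(callback).
const mysql = {
  createPool() {
    return {
      getConnection(callback) {
        callback(null, { id: "fake-connection" });
      },
    };
  },
};

// Step 1 happens inside a helper, so reaching the pool needs inter-procedural flow.
function makePool(driver) {
  return driver.createPool();
}

const connections = [];

// Step 3: the callback is a named function rather than an inline expression.
// `conn` is still parameter 1 of the callback passed as parameter 0 of getConnection.
function onConnection(err, conn) {
  if (err) throw err;
  connections.push(conn);
}

// Step 2: the pool travels through a local variable before the call.
const pool = makePool(mysql);
pool.getConnection(onConnection);

console.log(connections.length); // → 1
```

Here `conn` denotes the same API component as in the chained one-liner, but connecting `makePool`'s return value to the `getConnection` call, and `onConnection` to the callback argument, takes data flow rather than syntactic matching.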

Global access paths can also be used to describe flow summaries, but that's not in scope for this PR.

(Apologies for the under-documented state of this; I'll add more details as I work towards moving it out of draft status.)

@asgerf asgerf added the JS label Aug 17, 2020
Contributor

@erik-krogh erik-krogh left a comment

I took a very superficial look because I wanted to see how it works.

I have two small suggestions related to Promises (both are untested, and I'm not sure I got them right).

@max-schaefer max-schaefer force-pushed the js/api-graph branch 7 times, most recently from 19376c5 to 8447215 Compare August 26, 2020 14:26
@max-schaefer
Contributor Author

Just in time for the weekend, I have some encouraging early performance numbers, which suggest a mild overall slowdown. Getting to neutral performance would be great, but I think for now the important thing is that it doesn't completely blow up. (Also note that relative performance will improve as more individual type trackers are replaced by API graphs.)

Contributor

@asgerf asgerf left a comment

This looks great! Pretty much LGTM, just some superficial bits to consider.

/**
* Gets a data-flow node corresponding to a use of this API feature.
*/
DataFlow::Node getAUse() {
Contributor

Could you add a brief source code example?

For example, require('fs').readFile is a use of a feature from the fs module.

/**
* Gets a data-flow node corresponding to the right-hand side of a definition of this API
* feature.
*/
Contributor

Likewise.

For example, the function expression in exports.foo = function() {} could be a definition of an API feature.

If this is a parameter feature, gets an argument flowing into that parameter.

The last point seems particularly counterintuitive, as most readers would probably assume that the "definition" of a parameter is its declaration, not an argument flowing into it. I mean it's not surprising to me because we talked about these graphs quite a bit, but if I were just reading the client code without knowing about API graphs beforehand, reading something like getParameter(0).getADefinition() would be confusing.

Would it be worth renaming this to getARhs()? It would also make it slightly more obvious that getALocalSource is not implied.

}

/**
* Gets a feature representing an instance of this one.
Contributor

Suggested change
- * Gets a feature representing an instance of this one.
+ * Gets a feature representing an instance of this one, that is, an object whose constructor is this feature.

exists(string prop |
this = [createConnection(), createPool()].getOptionArgument(0, prop).asExpr() and
exists(API::Feature call, string prop |
(call = createConnection() or call = createPool()) and
Contributor

Is there some reason to go from set literal to (... or ...) here?

Contributor Author

No; set literal would be much better.

optionsArg = -2 and
cmdArg = 0
or
mod = "remote-exec" and
Contributor

Did you mean to sneak in support for remote-exec here? Which is fine of course, just checking.

Contributor Author

Apparently I managed to duplicate that (it's already defined further down). Will fix.

exists(DataFlow::ModuleImportNode expressLimiter |
expressLimiter.getPath() = "express-limiter" and
expressLimiter.getACall().getArgument(0).getALocalSource().asExpr() =
exists(API::Feature expressLimiter |
Contributor

It seems we don't need this exists anymore

`SourceNode` in cached layers seems particularly problematic.
@max-schaefer max-schaefer marked this pull request as ready for review September 3, 2020 14:51
@max-schaefer max-schaefer requested a review from a team as a code owner September 3, 2020 14:51
@max-schaefer
Contributor Author

I've added the cleanups and the tests, in self-contained commits to ease reviewing.

Apart from the outstanding evaluation, there is one issue I'd like to see resolved before this goes ahead, which is the inevitable terminology bikeshed: what do people think about the "API feature" terminology, and in particular the naming of the class API::Feature?

Clearly, API features are just nodes of the API graph, but my thinking was that we already have tons of different things called "nodes" (AST nodes, CFG nodes, data-flow graph nodes, and probably a few I've missed), and I didn't want to add to that pile. However, I find myself slipping back into "node" terminology quite frequently.

Thoughts? Should we stick with API::Feature, or just give up and rename it to API::Node?

@max-schaefer
Contributor Author

Slightly awkward restructuring needed to make the tests pass on both the old and the new test runner (cf. last commit)...

Max Schaefer added 6 commits September 3, 2020 22:28
…PI graphs.

In particular, we now have two different kinds of module features: module definitions and module uses.

For the most part, `API::Definition`s correspond to right-hand sides in the data-flow graph, and `API::Use`s correspond to references. However, module definitions can have references (via the CommonJS `module` variable), and so can their exports (via `module.exports` or `exports`). Note that this is different from references to uses of the module, which are simply imports.
With the old test runner we cannot have `VerifyAssertions.qlref`s for each individual test that reference a shared `VerifyAssertions.ql` in the parent directory, since it doesn't like nested tests.

Instead, we have to turn `VerifyAssertions.ql` into `VerifyAssertions.qll`, and each `VerifyAssertions.qlref` into a `VerifyAssertions.ql` that imports it.

But then that doesn't work with our old directory structure, since the import path would have to contain the invalid identifier `library-tests`. As a workaround, I have moved the API graph tests into a directory without dashes in its path.
@asgerf
Contributor

asgerf commented Sep 4, 2020

> Should we stick with API::Feature, or just give up and rename it to API::Node?

Actually, yes I think we should rename it. We already use graph and edge terminology, and I thought some of the doc comments ended up sounding a little weird due to the word "feature". We have many things called node but at least we can be consistent about it.

It turned out to be more confusing than helpful, so we're back with plain old API-graph "nodes".
@max-schaefer
Contributor Author

OK, I've renamed accordingly and updated a few out-of-date comments while I was at it.

asgerf previously approved these changes Sep 4, 2020
Contributor

@asgerf asgerf left a comment

LGTM

@max-schaefer
Contributor Author

I'll do another performance evaluation over the weekend just to be sure, and if everything goes well (...)

The stars haven't quite aligned yet; investigating a 20% slowdown on TypeScript... (Everything else looks fine, fortunately.)

This prevents spurious recomputation of a cached stage.
This blows up on the TypeScript compiler, and is likely to be much less useful than tracking type names and namespaces, which we still do.
@max-schaefer
Contributor Author

@asgerf: Performance is looking happy again (see link in internal issue). Would you like to take another look at the recent commits?

Contributor

@asgerf asgerf left a comment

LGTM. Let's merge!

@codeql-ci codeql-ci merged commit 903bc00 into github:main Sep 11, 2020