JavaScript: Use API graphs for library modelling #4082

max-schaefer · 2020-08-14T16:59:53Z

This PR contributes an implementation of API graphs, which are a way of describing the API surface produced and/or consumed by a code base. The nodes of the API graph, referred to as "API features", represent uses and definitions of API components like functions exported by npm packages or their parameters and return values. Edges are directed and labelled, and indicate how the features relate to each other.

As a concrete example, consider this code snippet:

const fs = require('fs');
fs.readFile('data.csv', (err, data) => {
  if (err) throw err;
  console.log(data);
});

It is described by this API graph (very slightly simplified):

Nodes rendered as boxes represent definitions, diamonds are uses; the oval root node is neither a use nor a definition. I have labelled nodes with the code snippet they correspond to where possible (the node representing the use of module fs and the root node do not have corresponding code).

Note that by reading off the labels on a path from the root to a node you get a global access path for the component used/defined by this node: for example, the err parameter is the 0th parameter of the (callback passed as) 1st parameter to member readFile of the exports of module fs, so its access path is /module fs/member exports/member readFile/parameter 1/parameter 0 (your notation may vary).
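To make the "reading off the labels" step concrete, here is a minimal JavaScript sketch. It is not the implementation in this PR: the edge list, the node names (`fsModule`, `callback`, and so on) and the `accessPath` helper are all hypothetical, chosen only to mirror the `fs.readFile` example and the path notation used above.

```javascript
// Hypothetical edge list for the fs.readFile example. Node names are made up
// for illustration; labels follow the access-path notation used in the text.
const edges = [
  { from: 'root', label: 'module fs', to: 'fsModule' },
  { from: 'fsModule', label: 'member exports', to: 'fsExports' },
  { from: 'fsExports', label: 'member readFile', to: 'readFile' },
  { from: 'readFile', label: 'parameter 0', to: 'pathArg' },
  { from: 'readFile', label: 'parameter 1', to: 'callback' },
  { from: 'callback', label: 'parameter 0', to: 'err' },
  { from: 'callback', label: 'parameter 1', to: 'data' },
];

// Depth-first search from the root, collecting edge labels along the way.
// Returns the first access path found, or null if `target` is unreachable.
function accessPath(target, node = 'root', labels = [], seen = new Set()) {
  if (node === target) return '/' + labels.join('/');
  if (seen.has(node)) return null;
  seen.add(node);
  for (const e of edges) {
    if (e.from !== node) continue;
    const found = accessPath(target, e.to, [...labels, e.label], seen);
    if (found !== null) return found;
  }
  return null;
}

console.log(accessPath('err'));
```

For the `err` node this prints the same path as spelled out above, because the graph is a tree and each node has exactly one path from the root.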

When the API graph is a tree, as in this case, it is just a representation of a (finite) set of access paths. In general, however, API graphs can be DAGs or even completely general graphs with cycles. In the former case, a single node can have multiple access paths, and in the latter it can even have infinitely many access paths. Put another way, API graphs represent a possibly infinite set of access paths, and encode aliasing relationships between them.
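The difference between trees, DAGs and cyclic graphs can be illustrated with a small JavaScript sketch. Everything in it is an assumption made for illustration: the module name `lib`, the members `a`, `b` and `self`, and the `accessPaths` helper are hypothetical and not part of this PR. Members `a` and `b` alias the same function node (sharing, as in a DAG), and `self` points back at the exports object (a cycle).

```javascript
// Hypothetical graph: `a` and `b` are aliases of one function node, and
// `self` is an edge from the exports object back to itself.
const edges = [
  { from: 'root', label: 'module lib', to: 'exports' },
  { from: 'exports', label: 'member a', to: 'fn' },
  { from: 'exports', label: 'member b', to: 'fn' },
  { from: 'exports', label: 'member self', to: 'exports' },
];

// Enumerate every access path to `target` that uses at most `maxLen` edges.
// The bound is required: with a cycle, the unbounded set of paths is infinite.
function accessPaths(target, maxLen) {
  const results = [];
  const walk = (node, labels) => {
    if (node === target) results.push('/' + labels.join('/'));
    if (labels.length === maxLen) return;
    for (const e of edges) {
      if (e.from === node) walk(e.to, [...labels, e.label]);
    }
  };
  walk('root', []);
  return results;
}

console.log(accessPaths('fn', 2));
console.log(accessPaths('fn', 3));
```

With a bound of 2 the single `fn` node already has two access paths (via `a` and via `b`); raising the bound to 3 adds two more that go through `self`, and each further increase adds another pair.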

One practical application of API graphs is in library modelling, where they can be used as a DSL to describe how a code base accesses a library API. This PR contributes a library for doing that, which would allow us, for example, to abstractly describe the parameter err in the example above as

API::moduleImport("fs").getMember("readFile").getParameter(1).getParameter(0)

Note that the sequence of QL predicate calls essentially just encodes the access path, with moduleImport conveniently compressing the first two steps into one.

Once we have selected an API feature in this way, we can then use the predicates getAUse() and getARhs() to map it to a data-flow node representing either a use or (the right-hand side of) a definition of the corresponding API feature.

The present PR shows how to port our libraries for modelling SQL connectors and command-execution libraries to make use of API graphs.

The look-and-feel is very similar to our current SourceNode-based approach, but while that approach only performs local data flow and inter-procedural flow has to be added manually via type tracking, API graphs come with inter-procedural flow built in. This scales because they only type track nodes that are reachable from the API surface, which in practice tends to work out. (I am running a final evaluation to confirm, but so far experiments show only a mild performance penalty, which will pay for itself as we port more of the library to use API graphs.)

As a concrete example where this additional power is useful, this PR also switches MissingRateLimiting.qll to API graphs, fixing #4000 without the syntactic overhead additional type tracking would require.

hvitved · 2020-08-17T06:36:39Z

This PR contributes an implementation of API graphs, which are a way of describing the API surface of a library both in terms of its implementation and in terms of its uses. Technically, it basically amounts to an implementation of global access paths enriched with aliasing information. (More complete description TBD.)

I wonder how this relates to how we describe flow summaries in C#. We also recently added access path sensitivity to flow summaries.

max-schaefer · 2020-08-17T07:24:07Z

I don't think it's very closely related (though I'm basing this on an admittedly somewhat superficial understanding of the C# flow summaries). The goal of this PR is to identify data-flow nodes that are reads/writes of global access paths, whereas the flow summaries deal, I think, primarily with local access paths.

For example, we want to be able to describe the call-back parameter conn in the following snippet:

require('mysql').createPool().getConnection(function(err, conn) { ... })

And crucially we want this to work even if the calls aren't nicely chained as in this example, but local or even global data flow is needed to connect up the individual steps.
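As an illustration of the non-chained case, here is a sketch of the same call sequence split across functions. The `mysql` object below is a hypothetical stand-in so that the snippet runs without the real package; only the call shape matters, and the helper names (`makePool`, `onConnection`, `connect`) are made up.

```javascript
// Stand-in for require('mysql') so the sketch runs without the package.
// It only mimics the call shape, not the behaviour of the real library.
const mysql = {
  createPool() {
    return {
      getConnection(callback) {
        callback(null, { id: 'stub-connection' });
      },
    };
  },
};

// The pool is created in one function...
function makePool() {
  return mysql.createPool();
}

// ...the callback is defined separately...
const received = [];
function onConnection(err, conn) {
  if (err) throw err;
  received.push(conn); // `conn` is the parameter we want to identify
}

// ...and the getConnection call happens in yet another function.
function connect(pool, handler) {
  pool.getConnection(handler);
}

connect(makePool(), onConnection);
console.log(received.length);
```

Here `conn` has the same role as in the chained one-liner, but relating it to `createPool` requires following the pool through a return value and a function argument, and the callback through another argument.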

Global access paths can also be used to describe flow summaries, but that's not in scope for this PR.

(Apologies for the under-documented state of this; I'll add more details as I work towards moving it out of draft status.)

erik-krogh

I took a very superficial look because I wanted to see how it works.

I have two small suggestions related to Promises (both are untested, and I'm not sure I got them right).