Python: Add support for API graphs #5069

tausbn · 2021-02-01T17:31:35Z

Currently only supports the "use" side of things.

For the most part, this follows the corresponding implementation for
JavaScript. Major differences include:

No MkImportUse nodes -- we just move directly from
MkModuleImport to its uses.
Paths are no longer labelled by s-expressions, but rather by a
string that mirrors how you would access it in QL. This makes it very
easy to see how to access an API component -- simply look at its
toString!

This PR also extends LocalSourceNode to support looking up attribute
references and invocations of such nodes. This was again based on the
JavaScript equivalent (though without specific classes for
InvokeNode and the like, it's a bit more awkward to use).

Still missing:

Import handling beyond simple import foo is not working exactly right. For instance, from foo.bar import baz becomes importModule("foo.bar").getMember("baz") but it would probably be more useful for it to be importModule("foo").getMember("bar").getMember("baz")
Tests.
Change note.

Currently only supports the "use" side of things. For the most part, this follows the corresponding implementation for JavaScript. Major differences include: - No `MkImportUse` nodes -- we just move directly from `MkModuleImport` to its uses. - Paths are no longer labelled by s-expressions, but rather by a string that mirrors how you would access it in QL. This makes it very easy to see how to access an API component -- simply look at its `toString`! This PR also extends `LocalSourceNode` to support looking up attribute references and invocations of such nodes. This was again based on the JavaScript equivalent (though without specific classes for `InvokeNode` and the like, it's a bit more awkward to use).

This turned out to be fairly simple. Given an import such as ```python from foo.bar.baz import quux ``` we create an API-graph node for each valid dotted prefix of `foo.bar.baz`, i.e. `foo`, `foo.bar`, and `foo.bar.baz`. For these, we then insert nodes in the API graph, such that `foo` steps to `foo.bar` along an edge labeled `bar`, etc. Finally, we only allow undotted names to hang off of the API-graph root. Thus, `foo` will have a `moduleImport` edge off of the root, and a `getMember` edge for `bar` (which in turn has a `getMember` edge for `baz`). Relative imports are explicitly ignored. Finally, this commit also adds inline tests for a variety of ways of importing modules, including a copy of the "import-helper" tests (with a few modifications to allow a single annotation per line, as these get rather long quickly!).

RasmusWL

Nice 💪 really looking forward to being able to use this 🤩

I don't understand the difference between getACall() and getReturn() 😕 Ohh, is it the case that for the following example, moduleImport("foo").getMember("bar").getACall() would only give foo.bar() as a result, whereas moduleImport("foo").getMember("bar").getReturn() would give both foo.bar(), and x in the print call? (and possibly also the x on the LHS of the assignment)

import foo
x = foo.bar()
print(x)

Besides that, only a few minor comments.

python/ql/test/experimental/dataflow/ApiGraphs/test.py

python/ql/src/semmle/python/ApiGraphs.qll

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

A slightly odd fix, but still morally okay, I think. The main issue here was that global variables have their first occurrence in an inner scope inside a so-called "scope entry definition", that then subsequently flows to the first use of this variable. This meant that that first use was _not_ a `LocalSourceNode` (since _something_ flowed into it), and this blocked `trackUseNode` from type-tracking to it (as it expects all nodes to be `LocalSourceNode`s). The answer, then, is to say that a `LocalSourceNode` is simply one that doesn't have flow to it from _any `CfgNode`_ (through one or more steps). This disregards the flow from the scope entry definition, as that is flow from an `EssaNode`. Additionally, it makes sense to exclude `ModuleVariableNode`s. These should never be considered local sources, since they always have flow from (at least) the place where the corresponding global variable is introduced.

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Makes the `a.b.c.d` test more sensible. Also adds a test that shows a case where we're currently _not_ getting the right flow.

yoff

The code around use/2 and use/3 is quite hard to read. I think it could be simplified by adding base as an argument to trackUseNode and move use/2 into the base case (but record the first argument rather than throw it away). Then I think other uses of use/2 can be simplified away.

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

yoff · 2021-02-04T09:02:11Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

+    // Declaring `source` to be a `SourceNode` currently causes a redundant check in the
+    // recursive case, so instead we check it explicitly here.


Has this actually been seen to affect performance?

In JavaScript almost certainly, otherwise this refactoring probably wouldn't be there.
In Python who knows (but I think it would be wise for us to learn from the JavaScript implementation, rather than painfully rediscover these performance problems on our own).

Ok, let us leave it for now, but check it at some point (not sure when that would be, though).

yoff · 2021-02-04T09:03:49Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

  /** Holds if this `LocalSourceNode` can flow to `nodeTo` in one or more local flow steps. */
  cached
-  predicate flowsTo(Node nodeTo) { simpleLocalFlowStep*(this, nodeTo) }
+  predicate flowsTo(Node nodeTo) { Cached::hasLocalSource(nodeTo, this) }


Does this provide a better implementation? (Or could we just use flowsTo in place of hasLocalSource?)

(Reading the code below, I feel that flowsTo would be clearer because of the order of arguments.)

Does this provide a better implementation? (Or could we just use flowsTo in place of hasLocalSource?)

I honestly don't know what you're asking here. Surely we can't use flowsTo, as that's the predicate that's being defined. Did you mean renaming hasLocalSource to flowsTo? That would be the wrong order of arguments.

I meant, could we disband hasLocalSource and just use the previous implementation of flowsTo? And then use flowsTo in the places that currently use hasLocalSource.

We could certainly do that without changing the meaning of the code, but I can't guarantee the same about the performance. In particular, I don't know that the * operation is able to use the knowledge that this is a LocalSourceNode at this point (and so it may end up calculating the entire simpleLocalFlowStep* relation before restricting the first argument).

Right, that is a valid point. I still think it is a lot easier to read a.flowsTo(b) than hasLocalSource(b, a) but flowsTo is probably not available in the API graph module..

... You've lost me again. All uses of hasLocalSource are inside DataFlowPublic and most of them are inside Cached. Everywhere else we use flowsTo. 😕

I think as part of the general cleanup of the dataflow libraries (mentioned elsewhere) we'll probably want to consider whether Cached should be exposed at this point or put somewhere else. Ideally people should want to interact with local flow through flowsTo.

Ah, you are right, I mixed up the modules. So I can potentially have the readable code, I have submitted suggestions, let me know if that is too optimistic...

python/ql/test/experimental/dataflow/ApiGraphs/test7.py

python/ql/src/semmle/python/ApiGraphs.qll

Spotted by yoff in github#5069 (comment)

python/ql/test/experimental/dataflow/ApiGraphs/test4.py

Co-authored-by: yoff <lerchedahl@gmail.com>

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

tausbn · 2021-02-04T11:43:35Z

I don't understand the difference between getACall() and getReturn() Ohh, is it the case that for the following example, moduleImport("foo").getMember("bar").getACall() would only give foo.bar() as a result, whereas moduleImport("foo").getMember("bar").getReturn() would give both foo.bar(), and x in the print call? (and possibly also the x on the LHS of the assignment)
import foo
x = foo.bar()
print(x)

This is not quite right, as getReturn returns an API node. You would have to additionally call getAUse on it to get to (all of) its uses (including both foo.bar() and x in your example).

tausbn · 2021-02-04T11:49:48Z

The code around use/2 and use/3 is quite hard to read. I think it could be simplified by adding base as an argument to trackUseNode and move use/2 into the base case (but record the first argument rather than throw it away). Then I think other uses of use/2 can be simplified away.

You may be right that we can rewrite it in this way, however I would caution that we're currently only seeing "half" of the API graph interface implemented here. I wouldn't want us to make this rewrite now, only to have to undo it at a later date because it turns out to have been written the way it is for a reason. (Looking at the JS original, I note that while most uses of use/2 occur in the presence of trackUseNode, not all of them do.)

In general, I have tried hard not to diverge from the original JS version, just in case we end up in a place where we might want to share the code between these languages.

That being said, I think use/2 could do with some more documentation, which I'll add.

RasmusWL · 2021-02-04T14:30:41Z

In general, I have tried hard not to diverge from the original JS version

I think that's a very good approach 👍

yoff · 2021-02-04T15:01:14Z

In general, I have tried hard not to diverge from the original JS version

I think that's a very good approach 👍

I think there is a balance to be struck. I would like to take over all of the battle tested robustness and performance optimality. But take over none of the technical debt 😛

tausbn · 2021-02-04T15:35:01Z

Argh, lesson learned: don't expect consistent behaviour if you import the test database (without renaming) and then re-run the tests. Everything was looking just fine with the imports/2 simplification, but now not so much. 😞

There is now a bit of redundancy in the tests, but I thought it useful to actually include some of the cases called out explicitly in the documentation, so as to make it easy to see that the code actually does what we expect (in these cases, anyway).

In lieu of removing the offending flow (which would likely have consequences for a lot of other tests), I opted to simply _include_ the relevant nodes directly.

yoff

Smal typo

python/ql/src/semmle/python/ApiGraphs.qll

Co-authored-by: yoff <lerchedahl@gmail.com>

yoff

I much prefer this, if it is feasible?

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

tausbn · 2021-02-05T10:56:01Z

I think this could be feasible if we change the annotation on flowsTo from cached to pragma[inline]. In that case, I would expect the above code to do exactly the same as before.

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

Co-authored-by: yoff <lerchedahl@gmail.com>

Makes it more similar to the other functions in this module.

yoff

Thanks, I am ready to merge this now :-)

yoff · 2021-02-05T12:17:27Z

Merging, ship to learn!

github-actions bot added the Python label Feb 1, 2021

tausbn added 2 commits February 1, 2021 18:57

Python: Add missing documentation

cd7b013

tausbn marked this pull request as ready for review February 2, 2021 21:34

tausbn requested a review from a team as a code owner February 2, 2021 21:34

RasmusWL reviewed Feb 3, 2021

View reviewed changes

tausbn and others added 8 commits February 3, 2021 16:41

Python: Use call instead of invocation

6ce160c

Python: Make toString abstract

c5d6792

Python: Better explanation in use/3

05f290f

Python: Improve documentation for moduleImport

56515c5

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Autoformat

ebfb1fa

Python: Further elaboration of use/3

ba98b08

Python: Update test file

5974af6

Makes the `a.b.c.d` test more sensible. Also adds a test that shows a case where we're currently _not_ getting the right flow.

yoff requested changes Feb 4, 2021

View reviewed changes

RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Feb 4, 2021

Python: Fix small typo in test-output

9605ddb

Spotted by yoff in github#5069 (comment)

RasmusWL mentioned this pull request Feb 4, 2021

Python: Fix small typo in test-output #5088

Merged

RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Feb 4, 2021

Python: Fix small typo in test-output

ac0f2d3

Spotted by yoff in github#5069 (comment)

RasmusWL reviewed Feb 4, 2021

View reviewed changes

python/ql/test/experimental/dataflow/ApiGraphs/test4.py Outdated Show resolved Hide resolved

tausbn and others added 2 commits February 4, 2021 12:18

Python: Fix typos in test files

e5ec1e1

Co-authored-by: yoff <lerchedahl@gmail.com>

Python: Fix more typos

4627799

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Add documentation from DataFlowUtil::importNode

d01d7ee

tausbn added 2 commits February 4, 2021 15:58

Python: Greatly simplify imports/2 predicate

e54c925

Python: More documentation

07ffa9f

tausbn added 6 commits February 4, 2021 16:46

Python: Fix imports/2

305bfab

Python: Add big explanatory comment about prefixes.

aa7e9f0

Python: Add more test cases

2524f23

There is now a bit of redundancy in the tests, but I thought it useful to actually include some of the cases called out explicitly in the documentation, so as to make it easy to see that the code actually does what we expect (in these cases, anyway).

Python: Fix missing global variable source nodes

f6e1ea5

In lieu of removing the offending flow (which would likely have consequences for a lot of other tests), I opted to simply _include_ the relevant nodes directly.

Python: Fix typo

3c7d9c3

Python: Adhere to QLDoc style guide

a505eb6

tausbn requested a review from yoff February 4, 2021 17:35

Python: Add change note

d035937

github-actions bot added the documentation label Feb 4, 2021

yoff requested changes Feb 5, 2021

View reviewed changes

python/ql/src/semmle/python/ApiGraphs.qll Outdated Show resolved Hide resolved

Python: Fix typo in docs

a667431

Co-authored-by: yoff <lerchedahl@gmail.com>

yoff requested changes Feb 5, 2021

View reviewed changes

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll Show resolved Hide resolved

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll Show resolved Hide resolved

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll Show resolved Hide resolved

tausbn commented Feb 5, 2021

View reviewed changes

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll Outdated Show resolved Hide resolved

tausbn and others added 2 commits February 5, 2021 12:42

Python: Use flowsTo instead of hasLocalSource

6c8dfb2

Co-authored-by: yoff <lerchedahl@gmail.com>

Python: Slight cleanup of Cached::call

78cb534

Makes it more similar to the other functions in this module.

tausbn requested a review from yoff February 5, 2021 12:11

yoff approved these changes Feb 5, 2021

View reviewed changes

yoff merged commit 7fef1a8 into github:main Feb 5, 2021

tausbn deleted the python-api-graphs branch February 5, 2021 12:18

		// Declaring `source` to be a `SourceNode` currently causes a redundant check in the
		// recursive case, so instead we check it explicitly here.

Python: Add support for API graphs #5069

Python: Add support for API graphs #5069

Uh oh!

Conversation

tausbn commented Feb 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RasmusWL left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yoff left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tausbn commented Feb 4, 2021

Uh oh!

tausbn commented Feb 4, 2021

Uh oh!

RasmusWL commented Feb 4, 2021

Uh oh!

yoff commented Feb 4, 2021

Uh oh!

tausbn commented Feb 4, 2021

Uh oh!

yoff left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yoff left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tausbn commented Feb 5, 2021

Uh oh!

Uh oh!

yoff left a comment

Choose a reason for hiding this comment

Uh oh!

yoff commented Feb 5, 2021

Uh oh!

Reviewers

Assignees

Labels

tausbn commented Feb 1, 2021 •

edited

Loading