Skip to content

Conversation

@tausbn
Copy link
Contributor

@tausbn tausbn commented Feb 1, 2021

Currently only supports the "use" side of things.

For the most part, this follows the corresponding implementation for
JavaScript. Major differences include:

  • No MkImportUse nodes -- we just move directly from
    MkModuleImport to its uses.

  • Paths are no longer labelled by s-expressions, but rather by a
    string that mirrors how you would access it in QL. This makes it very
    easy to see how to access an API component -- simply look at its
    toString!

This PR also extends LocalSourceNode to support looking up attribute
references and invocations of such nodes. This was again based on the
JavaScript equivalent (though without specific classes for
InvokeNode and the like, it's a bit more awkward to use).


Still missing:

  • Import handling beyond simple import foo is not working exactly right. For instance, from foo.bar import baz becomes importModule("foo.bar").getMember("baz") but it would probably be more useful for it to be importModule("foo").getMember("bar").getMember("baz")
  • Tests.
  • Change note.

Currently only supports the "use" side of things.

For the most part, this follows the corresponding implementation for
JavaScript. Major differences include:

- No `MkImportUse` nodes -- we just move directly from
  `MkModuleImport` to its uses.

- Paths are no longer labelled by s-expressions, but rather by a
string that mirrors how you would access it in QL. This makes it very
easy to see how to access an API component -- simply look at its
`toString`!

This PR also extends `LocalSourceNode` to support looking up attribute
references and invocations of such nodes. This was again based on the
JavaScript equivalent (though without specific classes for
`InvokeNode` and the like, it's a bit more awkward to use).
@github-actions github-actions bot added the Python label Feb 1, 2021
This turned out to be fairly simple. Given an import such as
```python
from foo.bar.baz import quux
```
we create an API-graph node for each valid dotted prefix of
`foo.bar.baz`, i.e. `foo`, `foo.bar`, and `foo.bar.baz`. For these, we
then insert nodes in the API graph, such that `foo` steps to `foo.bar`
along an edge labeled `bar`, etc.

Finally, we only allow undotted names to hang off of the API-graph
root. Thus, `foo` will have a `moduleImport` edge off of the root, and
a `getMember` edge for `bar` (which in turn has a `getMember` edge for
`baz`).

Relative imports are explicitly ignored.

Finally, this commit also adds inline tests for a variety of ways of
importing modules, including a copy of the "import-helper" tests (with
a few modifications to allow a single annotation per line, as these
get rather long quickly!).
@tausbn tausbn marked this pull request as ready for review February 2, 2021 21:34
@tausbn tausbn requested a review from a team as a code owner February 2, 2021 21:34
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 💪 really looking forward to being able to use this 🤩

I don't understand the difference between getACall() and getReturn() 😕 Ohh, is it the case that for the following example, moduleImport("foo").getMember("bar").getACall() would only give foo.bar() as a result, whereas moduleImport("foo").getMember("bar").getReturn() would give both foo.bar(), and x in the print call? (and possibly also the x on the LHS of the assignment)

import foo
x = foo.bar()
print(x)

Besides that, only a few minor comments.

tausbn and others added 8 commits February 3, 2021 16:41
A slightly odd fix, but still morally okay, I think. The main issue
here was that global variables have their first occurrence in an inner
scope inside a so-called "scope entry definition", that then
subsequently flows to the first use of this variable. This meant that
that first use was _not_ a `LocalSourceNode` (since _something_ flowed
into it), and this blocked `trackUseNode` from type-tracking to it (as
it expects all nodes to be `LocalSourceNode`s).

The answer, then, is to say that a `LocalSourceNode` is simply one
that doesn't have flow to it from _any `CfgNode`_ (through one or more
steps). This disregards the flow from the scope entry definition, as
that is flow from an `EssaNode`.

Additionally, it makes sense to exclude `ModuleVariableNode`s. These
should never be considered local sources, since they always have flow
from (at least) the place where the corresponding global variable is
introduced.
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Makes the `a.b.c.d` test more sensible.

Also adds a test that shows a case where we're currently _not_ getting
the right flow.
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code around use/2 and use/3 is quite hard to read. I think it could be simplified by adding base as an argument to trackUseNode and move use/2 into the base case (but record the first argument rather than throw it away). Then I think other uses of use/2 can be simplified away.

Comment on lines +494 to +495
// Declaring `source` to be a `SourceNode` currently causes a redundant check in the
// recursive case, so instead we check it explicitly here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this actually been seen to affect performance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JavaScript almost certainly, otherwise this refactoring probably wouldn't be there.
In Python who knows (but I think it would be wise for us to learn from the JavaScript implementation, rather than painfully rediscover these performance problems on our own).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let us leave it for now, but check it at some point (not sure when that would be, though).

/** Holds if this `LocalSourceNode` can flow to `nodeTo` in one or more local flow steps. */
cached
predicate flowsTo(Node nodeTo) { simpleLocalFlowStep*(this, nodeTo) }
predicate flowsTo(Node nodeTo) { Cached::hasLocalSource(nodeTo, this) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this provide a better implementation? (Or could we just use flowsTo in place of hasLocalSource?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reading the code below, I feel that flowsTo would be clearer because of the order of arguments.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this provide a better implementation? (Or could we just use flowsTo in place of hasLocalSource?)

I honestly don't know what you're asking here. Surely we can't use flowsTo, as that's the predicate that's being defined. Did you mean renaming hasLocalSource to flowsTo? That would be the wrong order of arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, could we disband hasLocalSource and just use the previous implementation of flowsTo? And then use flowsTo in the places that currently use hasLocalSource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could certainly do that without changing the meaning of the code, but I can't guarantee the same about the performance. In particular, I don't know that the * operation is able to use the knowledge that this is a LocalSourceNode at this point (and so it may end up calculating the entire simpleLocalFlowStep* relation before restricting the first argument).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that is a valid point. I still think it is a lot easier to read a.flowsTo(b) than hasLocalSource(b, a) but flowsTo is probably not available in the API graph module..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... You've lost me again. All uses of hasLocalSource are inside DataFlowPublic and most of them are inside Cached. Everywhere else we use flowsTo. 😕

I think as part of the general cleanup of the dataflow libraries (mentioned elsewhere) we'll probably want to consider whether Cached should be exposed at this point or put somewhere else. Ideally people should want to interact with local flow through flowsTo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right, I mixed up the modules. So I can potentially have the readable code, I have submitted suggestions, let me know if that is too optimistic...

RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Feb 4, 2021
RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Feb 4, 2021
tausbn and others added 2 commits February 4, 2021 12:18
Co-authored-by: yoff <lerchedahl@gmail.com>
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@tausbn
Copy link
Contributor Author

tausbn commented Feb 4, 2021

I don't understand the difference between getACall() and getReturn() Ohh, is it the case that for the following example, moduleImport("foo").getMember("bar").getACall() would only give foo.bar() as a result, whereas moduleImport("foo").getMember("bar").getReturn() would give both foo.bar(), and x in the print call? (and possibly also the x on the LHS of the assignment)

import foo
x = foo.bar()
print(x)

This is not quite right, as getReturn returns an API node. You would have to additionally call getAUse on it to get to (all of) its uses (including both foo.bar() and x in your example).

@tausbn
Copy link
Contributor Author

tausbn commented Feb 4, 2021

The code around use/2 and use/3 is quite hard to read. I think it could be simplified by adding base as an argument to trackUseNode and move use/2 into the base case (but record the first argument rather than throw it away). Then I think other uses of use/2 can be simplified away.

You may be right that we can rewrite it in this way, however I would caution that we're currently only seeing "half" of the API graph interface implemented here. I wouldn't want us to make this rewrite now, only to have to undo it at a later date because it turns out to have been written the way it is for a reason. (Looking at the JS original, I note that while most uses of use/2 occur in the presence of trackUseNode, not all of them do.)

In general, I have tried hard not to diverge from the original JS version, just in case we end up in a place where we might want to share the code between these languages.

That being said, I think use/2 could do with some more documentation, which I'll add.

@RasmusWL
Copy link
Member

RasmusWL commented Feb 4, 2021

In general, I have tried hard not to diverge from the original JS version

I think that's a very good approach 👍

@yoff
Copy link
Contributor

yoff commented Feb 4, 2021

In general, I have tried hard not to diverge from the original JS version

I think that's a very good approach 👍

I think there is a balance to be struck. I would like to take over all of the battle tested robustness and performance optimality. But take over none of the technical debt 😛

@tausbn
Copy link
Contributor Author

tausbn commented Feb 4, 2021

Argh, lesson learned: don't expect consistent behaviour if you import the test database (without renaming) and then re-run the tests. Everything was looking just fine with the imports/2 simplification, but now not so much. 😞

There is now a bit of redundancy in the tests, but I thought it useful
to actually include some of the cases called out explicitly in the
documentation, so as to make it easy to see that the code actually
does what we expect (in these cases, anyway).
In lieu of removing the offending flow (which would likely have
consequences for a lot of other tests), I opted to simply _include_
the relevant nodes directly.
@tausbn tausbn requested a review from yoff February 4, 2021 17:35
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smal typo

Co-authored-by: yoff <lerchedahl@gmail.com>
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I much prefer this, if it is feasible?

@tausbn
Copy link
Contributor Author

tausbn commented Feb 5, 2021

I think this could be feasible if we change the annotation on flowsTo from cached to pragma[inline]. In that case, I would expect the above code to do exactly the same as before.

tausbn and others added 2 commits February 5, 2021 12:42
Co-authored-by: yoff <lerchedahl@gmail.com>
Makes it more similar to the other functions in this module.
@tausbn tausbn requested a review from yoff February 5, 2021 12:11
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I am ready to merge this now :-)

@yoff yoff merged commit 7fef1a8 into github:main Feb 5, 2021
@yoff
Copy link
Contributor

yoff commented Feb 5, 2021

Merging, ship to learn!

@tausbn tausbn deleted the python-api-graphs branch February 5, 2021 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants