Python: Add support for API graphs #5069
Currently only supports the "use" side of things.

For the most part, this follows the corresponding implementation for JavaScript. Major differences include:

- No `MkImportUse` nodes -- we just move directly from `MkModuleImport` to its uses.
- Paths are no longer labelled by s-expressions, but rather by a string that mirrors how you would access it in QL. This makes it very easy to see how to access an API component -- simply look at its `toString`!

This PR also extends `LocalSourceNode` to support looking up attribute references and invocations of such nodes. This was again based on the JavaScript equivalent (though without specific classes for `InvokeNode` and the like, it's a bit more awkward to use).
This turned out to be fairly simple. Given an import such as

```python
from foo.bar.baz import quux
```

we create an API-graph node for each valid dotted prefix of `foo.bar.baz`, i.e. `foo`, `foo.bar`, and `foo.bar.baz`. For these, we then insert nodes in the API graph, such that `foo` steps to `foo.bar` along an edge labeled `bar`, etc.

Finally, we only allow undotted names to hang off of the API-graph root. Thus, `foo` will have a `moduleImport` edge off of the root, and a `getMember` edge for `bar` (which in turn has a `getMember` edge for `baz`).

Relative imports are explicitly ignored.

Finally, this commit also adds inline tests for a variety of ways of importing modules, including a copy of the "import-helper" tests (with a few modifications to allow a single annotation per line, as these get rather long quickly!).
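The prefix-and-edge construction described in this commit message can be modelled outside of QL. The following is only an illustrative Python sketch, not the CodeQL implementation: the function names, the `"root"` placeholder, and the textual edge-label format are assumptions made for the example.

```python
def dotted_prefixes(module_path):
    """Return every dotted prefix of a module path, shortest first."""
    parts = module_path.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def import_edges(module_path):
    """Model the API-graph edges as (source, label, target) triples.

    Assumption: "root" stands in for the API-graph root, and labels are
    rendered as strings purely for readability.
    """
    if module_path.startswith("."):
        return []  # relative imports are explicitly ignored
    prefixes = dotted_prefixes(module_path)
    # Only the undotted first component hangs off the root.
    edges = [("root", f'moduleImport("{prefixes[0]}")', prefixes[0])]
    # Each longer prefix is reached from the previous one via a member edge.
    for parent, child in zip(prefixes, prefixes[1:]):
        member = child.rsplit(".", 1)[1]
        edges.append((parent, f'getMember("{member}")', child))
    return edges


for edge in import_edges("foo.bar.baz"):
    print(edge)
```

For `foo.bar.baz` this yields one `moduleImport` edge from the root to `foo`, followed by `getMember` edges to `foo.bar` and `foo.bar.baz`.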
RasmusWL
left a comment
Nice 💪 really looking forward to being able to use this 🤩
I don't understand the difference between getACall() and getReturn() 😕 Ohh, is it the case that for the following example, moduleImport("foo").getMember("bar").getACall() would only give foo.bar() as a result, whereas moduleImport("foo").getMember("bar").getReturn() would give both foo.bar(), and x in the print call? (and possibly also the x on the LHS of the assignment)
```python
import foo
x = foo.bar()
print(x)
```

Besides that, only a few minor comments.
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll
A slightly odd fix, but still morally okay, I think. The main issue here was that global variables have their first occurrence in an inner scope inside a so-called "scope entry definition", that then subsequently flows to the first use of this variable. This meant that that first use was _not_ a `LocalSourceNode` (since _something_ flowed into it), and this blocked `trackUseNode` from type-tracking to it (as it expects all nodes to be `LocalSourceNode`s). The answer, then, is to say that a `LocalSourceNode` is simply one that doesn't have flow to it from _any `CfgNode`_ (through one or more steps). This disregards the flow from the scope entry definition, as that is flow from an `EssaNode`. Additionally, it makes sense to exclude `ModuleVariableNode`s. These should never be considered local sources, since they always have flow from (at least) the place where the corresponding global variable is introduced.
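The revised definition can be illustrated with a toy flow graph. This is a hedged Python sketch of the rule as stated in the commit message, not the QL code itself: the node names, the `kinds` mapping, and the `steps` edge list are hypothetical, and the sketch assumes that a path from a `CfgNode` may pass through nodes of any kind.

```python
from collections import defaultdict

# Node kinds in this toy model (names borrowed from the commit message).
CFG, ESSA, MODULE_VAR = "CfgNode", "EssaNode", "ModuleVariableNode"


def is_local_source(node, kinds, steps):
    """Return True if no CfgNode reaches `node` in one or more flow steps."""
    if kinds[node] == MODULE_VAR:
        return False  # module variables are never treated as local sources
    preds = defaultdict(set)
    for src, dst in steps:
        preds[dst].add(src)
    # Walk backwards over the flow steps, looking for any CfgNode ancestor.
    seen, stack = set(), list(preds[node])
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if kinds[current] == CFG:
            return False
        stack.extend(preds[current])
    return True


# A global variable's scope entry definition flows to its first use,
# which in turn flows to a second use.
kinds = {"entry_def": ESSA, "first_use": CFG, "second_use": CFG, "g": MODULE_VAR}
steps = [("entry_def", "first_use"), ("first_use", "second_use")]

print(is_local_source("first_use", kinds, steps))
print(is_local_source("second_use", kinds, steps))
```

In this model `first_use` counts as a local source because its only inbound flow comes from an `EssaNode`, whereas `second_use` does not, since a `CfgNode` flows into it.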
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Makes the `a.b.c.d` test more sensible. Also adds a test that shows a case where we're currently _not_ getting the right flow.
yoff
left a comment
The code around use/2 and use/3 is quite hard to read. I think it could be simplified by adding base as an argument to trackUseNode and moving use/2 into the base case (but recording the first argument rather than throwing it away). Then I think other uses of use/2 can be simplified away.
```ql
// Declaring `source` to be a `SourceNode` currently causes a redundant check in the
// recursive case, so instead we check it explicitly here.
```
Has this actually been seen to affect performance?
In JavaScript almost certainly, otherwise this refactoring probably wouldn't be there.
In Python who knows (but I think it would be wise for us to learn from the JavaScript implementation, rather than painfully rediscover these performance problems on our own).
Ok, let us leave it for now, but check it at some point (not sure when that would be, though).
```ql
/** Holds if this `LocalSourceNode` can flow to `nodeTo` in one or more local flow steps. */
cached
predicate flowsTo(Node nodeTo) { simpleLocalFlowStep*(this, nodeTo) }
predicate flowsTo(Node nodeTo) { Cached::hasLocalSource(nodeTo, this) }
```
Does this provide a better implementation? (Or could we just use flowsTo in place of hasLocalSource?)
(Reading the code below, I feel that flowsTo would be clearer because of the order of arguments.)
> Does this provide a better implementation? (Or could we just use flowsTo in place of hasLocalSource?)
I honestly don't know what you're asking here. Surely we can't use flowsTo, as that's the predicate that's being defined. Did you mean renaming hasLocalSource to flowsTo? That would be the wrong order of arguments.
I meant, could we disband hasLocalSource and just use the previous implementation of flowsTo? And then use flowsTo in the places that currently use hasLocalSource.
We could certainly do that without changing the meaning of the code, but I can't guarantee the same about the performance. In particular, I don't know that the * operation is able to use the knowledge that this is a LocalSourceNode at this point (and so it may end up calculating the entire simpleLocalFlowStep* relation before restricting the first argument).
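The performance concern raised here can be illustrated in plain Python. This sketch says nothing about how the QL evaluator actually plans a `*` closure; it only contrasts exploring from known sources with building the closure for every node and filtering afterwards. All function names and the sample graph are hypothetical.

```python
def reachable_from(start, steps):
    """Nodes reachable from `start` in zero or more steps."""
    succs = {}
    for src, dst in steps:
        succs.setdefault(src, set()).add(dst)
    seen, stack = {start}, [start]
    while stack:
        for nxt in succs.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def flows_restricted(sources, steps):
    """Explore only from the known sources."""
    return {(s, n) for s in sources for n in reachable_from(s, steps)}


def flows_unrestricted(sources, steps):
    """Build the closure for every node, then filter on the first argument.

    Returns the filtered pairs and the size of the intermediate relation.
    """
    nodes = {n for step in steps for n in step} | set(sources)
    full = {(a, b) for a in nodes for b in reachable_from(a, steps)}
    return {(a, b) for (a, b) in full if a in sources}, len(full)


steps = [("a", "b"), ("b", "c"), ("c", "d")]
sources = {"a"}
restricted = flows_restricted(sources, steps)
unrestricted, intermediate_size = flows_unrestricted(sources, steps)
print(restricted == unrestricted, len(restricted), intermediate_size)
```

Both strategies produce the same pairs, but the unrestricted one materialises a larger intermediate relation first, which is the kind of cost the comment above is worried about.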
Right, that is a valid point. I still think it is a lot easier to read a.flowsTo(b) than hasLocalSource(b, a), but flowsTo is probably not available in the API graph module.
... You've lost me again. All uses of hasLocalSource are inside DataFlowPublic and most of them are inside Cached. Everywhere else we use flowsTo. 😕
I think as part of the general cleanup of the dataflow libraries (mentioned elsewhere) we'll probably want to consider whether Cached should be exposed at this point or put somewhere else. Ideally people should want to interact with local flow through flowsTo.
Ah, you are right, I mixed up the modules. So I can potentially have the readable code after all. I have submitted suggestions; let me know if that is too optimistic...
Spotted by yoff in github#5069 (comment)
Spotted by yoff in github#5069 (comment)
Co-authored-by: yoff <lerchedahl@gmail.com>
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
This is not quite right, as
You may be right that we can rewrite it in this way, however I would caution that we're currently only seeing "half" of the API graph interface implemented here. I wouldn't want us to make this rewrite now, only to have to undo it at a later date because it turns out to have been written the way it is for a reason. (Looking at the JS original, I note that while most uses of In general, I have tried hard not to diverge from the original JS version, just in case we end up in a place where we might want to share the code between these languages. That being said, I think
I think that's a very good approach 👍
I think there is a balance to be struck. I would like to take over all of the battle tested robustness and performance optimality. But take over none of the technical debt 😛
Argh, lesson learned: don't expect consistent behaviour if you import the test database (without renaming) and then re-run the tests. Everything was looking just fine with the
There is now a bit of redundancy in the tests, but I thought it useful to actually include some of the cases called out explicitly in the documentation, so as to make it easy to see that the code actually does what we expect (in these cases, anyway).
In lieu of removing the offending flow (which would likely have consequences for a lot of other tests), I opted to simply _include_ the relevant nodes directly.
yoff
left a comment
Small typo
Co-authored-by: yoff <lerchedahl@gmail.com>
yoff
left a comment
I much prefer this, if it is feasible?
I think this could be feasible if we change the annotation on
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll
Co-authored-by: yoff <lerchedahl@gmail.com>
Makes it more similar to the other functions in this module.
yoff
left a comment
Thanks, I am ready to merge this now :-)
Merging, ship to learn!
Still missing:

- `import foo` is not working exactly right. For instance, `from foo.bar import baz` becomes `importModule("foo.bar").getMember("baz")`, but it would probably be more useful for it to be `importModule("foo").getMember("bar").getMember("baz")`.