-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Python dataflow: flow summaries restart #8781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python dataflow: flow summaries restart #8781
Conversation
I've only done a superficial glance at this, but I'd like to talk about the issue with The root of the problem is that the following cycle is created between call graph construction and API graphs: Unless we want API graphs to be mutually recursive with type-tracking steps and call construction, we'll need to break the cycle somehow. As far as I can tell, you've broken the cycle like this: This means anything not depending on API graphs can contribute to I believe Ruby's solution so far is to make type-tracking steps explicitly depend on non-library callables only. The downside here is that none of their standard library models are visible to type-tracking. (@hvitved @aibaars do you agree here?) @yoff Since I'm unfamiliar with the Python libraries I'm not 100% sure my interpretation above is an accurate description of what you've done here. Could you please check if this is the case? If so, I think it's a fine solution provided we can come up with names that aren't confusing. |
1430ab6
to
b4aa941
Compare
Force pushed because I could see that my amended commit message to the last commit had not been pushed. |
Correct (since flow summaries depend on API graphs, which depend on type tracking). |
The diagram looks correct to me. Is the difference with Ruby just that we have not (yet?) written any standard library models using flow summaries? Or is there a more fundamental difference? |
b4aa941
to
ebedc7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found 8 vulnerabilities.
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Fixed
Show fixed
Hide fixed
/** | ||
* A query predicate for outputting flow summaries in semi-colon separated format in QL tests. | ||
* The syntax is: "namespace;type;overrides;name;signature;ext;inputspec;outputspec;kind", | ||
* ext is hardcoded to empty. | ||
*/ |
Check warning
Code scanning / CodeQL
Predicate QLDoc style.
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Fixed
Show fixed
Hide fixed
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Fixed
Show fixed
Hide fixed
The failing language test belongs to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this Herculean effort!
I think in terms of the implementation, this looks pretty solid. I've added a bunch of requests for clarification (and also some requests for renaming stuff). I'll be very happy to discuss this synchronously if that's easier. 🙂
abstract predicate isParameterOf(DataFlowCallable c, int i); | ||
} | ||
|
||
class SourceParameterNode extends ParameterNode, CfgNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took me a while to figure out what Source
was referring to in this context (what with this being dataflow and all). I wonder if there is a better name for this, or is SourceParameterNode
just the accepted term?
} | ||
} | ||
|
||
abstract class DataFlowSourceCall extends DataFlowCall, TDataFlowSourceCall { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like my comment on SourceParameterNode
, I feel like this is also potentially confusing. At the very least there should be QLDoc that explains what's going on.
@@ -282,6 +282,8 @@ module EssaFlow { | |||
nodeTo = TKwOverflowNode(call, callable) and | |||
nodeFrom.asCfgNode() = call.getNode().getKwargs().getAFlowNode() | |||
) | |||
or | |||
FlowSummaryImpl::Private::Steps::summaryLocalStep(nodeFrom, nodeTo, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... So we allow summaries to extend local flow steps (which means they underpin all of the dataflow machinery). Does this not lead to everything being recursively dependent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there are now two local step relations. One for type trackers and one for the global data flow computation.
kind = kind and | ||
exists(Call call, Name f, FunctionDef def | | ||
f = call.getAnArg() and | ||
def.getDefinedFunction().getName() = f.getId() and | ||
// c.getCallableValue() = def.getDefinedFunction().getDefinition() and | ||
c.getName() = f.getId() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may have been fixed in a later commit, but as QL-for-QL helpfully points out, creation
isn't restricted by this disjunct.
* Includes function calls, method calls, class calls and library calls. | ||
* All these will be associated with a `CallNode`. | ||
*/ | ||
TNonSpecialCall(CallNode call) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something about NonSpecial...
rubs me the wrong way. Could we use a positive formulation instead? I'm tempted to suggest CallNodeCall
, though that is a bit of a mouthful. (However, it does precisely say that these are calls that happen at call nodes.)
Alternatively, since we have SpecialCall
, perhaps its counterpart should be NormalCall
?
/** Gets a callable value for this callable, if one exists. */ | ||
abstract CallableValue getCallableValue(); | ||
|
||
abstract CallNode getACall2(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a better name. Even if it is only internal, it certainly doesn't convey what it's actually supposed to do.
@@ -497,59 +504,69 @@ class NonSpecialCall extends DataFlowSourceCall, TNonSpecialCall { | |||
override DataFlowCallable getEnclosingCallable() { result.getScope() = call.getNode().getScope() } | |||
} | |||
|
|||
abstract class NonLibraryDataFlowSourceCall extends NonSpecialCall { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this name needs a bit of workshopping. Could we come up with a better description of "non-library"?
} | ||
|
||
class SourceParameterNode extends ParameterNode, CfgNode { | ||
//, LocalSourceNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a bit ambivalent about these not being local sources, and I would like to understand this a bit more. Is this merely because of negative recursion issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negative recursion is what directly prevents me from inheriting. But I can add that inheritance into the Ruby codebase which basically has the same structure, so I think the solution is to clear up the mutual recursive structure of our predicates, which I fear lacks a bit of discipline at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With all of the recent changes, is your comment still in effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, yes. I do feel we should be able to resolve this, but I had a quick go and it seems there are multiple cycles to untangle, so I would prefer to do this later..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. Thanks for checking!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few drive-by comments. Is the logic actually hooked up yet? I fail to see references to e.g. any of the FlowSummaryImpl::Private::Steps
predicates. See e.g. DataFlowPrivate.qll
for ruby for how this is done.
@@ -277,21 +286,26 @@ ExprNode exprNode(DataFlowExpr e) { result.getNode().getNode() = e } | |||
* The value of a parameter at function entry, viewed as a node in a data | |||
* flow graph. | |||
*/ | |||
class ParameterNode extends CfgNode, LocalSourceNode { | |||
abstract class ParameterNode extends Node { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you want an abstract
class in the public API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, probably not..
@@ -0,0 +1,175 @@ | |||
/** | |||
* Provides Ruby specific classes and predicates for defining flow summaries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python
or | ||
call = any(ClassValue c | not c.isAbsent()).getACall() | ||
or | ||
call = any(SpecialMethodCallNode special) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call instanceof SpecialMethodCallNode
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Show resolved
Hide resolved
@@ -2,6 +2,7 @@ private import python | |||
private import DataFlowPublic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are missing allowParameterReturnInSelf
, see https://github.com/github/codeql/blob/main/ruby/ql/lib/codeql/ruby/dataflow/internal/DataFlowPrivate.qll#L940.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Indeed we just have none()
.
They are mostly in |
Ah, good, sorry for the false alarm. I guess I must have looked at a diff limited to a subset of commits perhaps. |
- add flowsummaries shared files - register in indentical files - fix initial non-monotonic recursions - add DataFlowSourceCall - add resolvedCall - add SourceParameterNode failing tests: - 3/library-tests/with/test.ql
allowing more `OutNode`s (not restricting to `CallNode`s), gives more flow in the `classesCallGraph` test
also convert to inline tests
summary flow is excluded from the local flow relation used for typetracking, but included in the one used for global data flow.
- Branch predicates are made simple. In particular, they do not try to detect library calls. - All branches based on `CallNode`s are gathered into one. - That branch has been given a class `NonSpecialCall`, which is the new parent of call classes based on `CallNode`s. (Those classes now have more involved charpreds.) - A new such class, 'LambdaCall` has been split out from `FunctionCall` to allow the latter to replace its general `CallNode` field with a specific `FunctionValue` one. - `NonSpecialCall` is not an abstract class, but it has some abstract overrides. Therefor, it is not considered a resolved call in the test `UnresolvedCalls.qll`.
Two classes have been inserted into the hierarchies: - `NonLibraryDataFlowCallable` with a method `getACall2`. This method implements "get a call, not considering flow summaries". For `NonLibraryDataFlowCallable`s, `getACall` will defer to `getACall2`. While you could have a synthesised call to such a callable, it would not correspond to a `CallNode`. - `NonLibraryDataFlowSourceCall` with methods `getArg2` and `getCallable2`. These also refer to a call graph that does not consider flow summaries. `getArg2` is used to synthesise pre-update nodes for arguments. `getCallable2` is used in `connects` to compute argument passing. This is used to define data flow nodes for overflow arguments. `getACall2` ensures that `LibraryCallableValue::getACall` is not called when the charpred of `FunctionCall` is evaluated.
- sync summary files - format files - fix compilation
This can be used when one wants to consider a (source) parameter node as a local source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found 1 vulnerability.
by making client code use the "new" class. Really, this part of the split class should have the old name, to minimise disruptions to clients. Same goes for the other split classes.
Thank you for the review. I think both renaming and qldocs are warranted. Specifically, when classes are split (in order to only have one participate in a recursion that would otherwise be non-monotonic) the user-facing one should be retaining the old name so as to minimise breaking changes. |
to avoid nodes in the extracted stdlib Was there a reason for this depth to be 1?
ede5f3c
to
56dcfc2
Compare
since map resolves to a class call also fix test expectation
This is not necessary as long as `LibraryCall` only includes unresolved calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be super nice with a change to the result type of SummarizedCallable.getACall
. (which seems like a fairly simple change, I can even do it myself if you want me to)
Besides that, I don't think I have requests for any other significant changes.
I put it in frameworks\Builtins.qll. It could also go in the StdLib folder, I guess. Or perhaps you have a different place that you find natural, feel free to suggest :-)
We have normally put both builtins and standard library modeling in the Stdlib.qll
file. I guess that is not technically 100% correct. However, on the top of my head, it doesn't seem like we're gaining anything from splitting it into two, so I would suggest you move it into the existing file (but do let me know if I'm overlooking something 🤷). ((note: if you do want to keep the Builtins.qll
file, you should add module structure like in all the other files in the frameworks folder)
class ReversedSummary extends SummarizedCallable { | ||
ReversedSummary() { this = "builtins.reversed" } | ||
|
||
override CallNode getACall() { result = API::builtin("reversed").getACall().getNode() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of code makes me wonder if we could change getACall
for SummarizedCallable
to have DataFlow::CallCfgNode
as a result type, so this could just be written as result = API::builtin("reversed").getACall()
. The extra .getNode()
seems like it will be included every time we define a SummarizedCallable
, so we might as well get away with it now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is slightly inconsistent for LibraryCallable
to return a DatFlow::Node
here, when the other callables return control flow nodes, but I tried it now, see if you like :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'm happy 👍 Much of it will be changed with the new call-graph anyway 🤷
python/ql/lib/semmle/python/dataflow/new/internal/FlowSummaryImplSpecific.qll
Show resolved
Hide resolved
The other callables return control flow nodes, so it is slightly inconsistent for this to return a data flow node, but it does make models based on API graphs nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor fixes needed... but almost 😅
python/ql/lib/semmle/python/dataflow/new/internal/FlowSummaryImplSpecific.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤞 for the tests 😆
Indeed, thanks for the fix :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments here and there. Otherwise, I think this looks very good!
Regarding the overloading of "source" to mean "in source code", how about "extracted"? This, to me, seems fairly unambiguous...
predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver) { none() } | ||
predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver) { | ||
receiver = call.(SummaryCall).getReceiver() and | ||
kind = kind |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I understand correctly, this is to allow kind
to be any LambdaCallKind
? If so, maybe exists(kind)
is a more idiomatic way of indicating an intentional cartesian product?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct (and LambdaCallKind
is currently Unit
).
} | ||
|
||
class SourceParameterNode extends ParameterNode, CfgNode { | ||
//, LocalSourceNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With all of the recent changes, is your comment still in effect?
The precedence for the use of "source" to denote elements of source code is found in `EssaVariable::getSourceVariable` as well as in the Ruby code base. But it clashes with the many uses of source to mean "source of flow" found in the data flow library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This demonstrates that flow summaries can be based on API graphs and all our existing tests can pass 🎉
I have trimmed down the changes needed to get to this state, and given detailed explanations of the tricky transformations needed to avoid non-monotonic recursion.
For a moment, the commit history was beautiful.
Then main moved on and more changes needed to be made, now it is a bit longer and more hairy.
Also, there is a branch where it is even more hairy, cutting up some of the inheritance hierarchies in order to achieve
SourceParameterNode extends LocalSourceNode
, but I suggest we leave that for now.SourceParameterNode
is no longer something non-internal development should be concerned about.Names have become a bite nicer and the impact on user code and on
.expected
-files has been minimised.We might still do more test, showing off the capabilities, especially the use from MaD which is not demonstrated at the moment. However, simply having it compile and the basic tests pass seems like a worthy milestone at this point.