Skip to content

Python dataflow: flow summaries restart #8781

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 84 commits into from
Sep 20, 2022

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Apr 20, 2022

This demonstrates that flow summaries can be based on API graphs and all our existing tests can pass 🎉

I have trimmed down the changes needed to get to this state, and given detailed explanations of the tricky transformations needed to avoid non-monotonic recursion.

For a moment, the commit history was beautiful.

Then main moved on and more changes needed to be made, now it is a bit longer and more hairy.
Also, there is a branch where it is even more hairy, cutting up some of the inheritance hierarchies in order to achieve
SourceParameterNode extends LocalSourceNode, but I suggest we leave that for now. SourceParameterNode is no longer something non-internal development should be concerned about.

Names have become a bite nicer and the impact on user code and on .expected-files has been minimised.

We might still do more test, showing off the capabilities, especially the use from MaD which is not demonstrated at the moment. However, simply having it compile and the basic tests pass seems like a worthy milestone at this point.

@asgerf
Copy link
Contributor

asgerf commented Apr 20, 2022

I've only done a superficial glance at this, but I'd like to talk about the issue with getACall2. This is a fundamental issue when using API graphs and flow summaries together, and will likely affect all dynamic languages attempting to use flow summaries. I'd like to get a bit more clarity on what the problem is and what solution to use (ideally we'll all end up using the same solution).

The root of the problem is that the following cycle is created between call graph construction and API graphs:
Screenshot 2022-04-20 at 12 36 24

Unless we want API graphs to be mutually recursive with type-tracking steps and call construction, we'll need to break the cycle somehow. As far as I can tell, you've broken the cycle like this:

Screenshot 2022-04-20 at 12 41 11

This means anything not depending on API graphs can contribute to getACall2, and things depending on API graphs can use getACall. Type tracking steps will only be discovered for getACall2, which makes it more suitable for things like standard library modelling.

I believe Ruby's solution so far is to make type-tracking steps explicitly depend on non-library callables only. The downside here is that none of their standard library models are visible to type-tracking. (@hvitved @aibaars do you agree here?)

@yoff Since I'm unfamiliar with the Python libraries I'm not 100% sure my interpretation above is an accurate description of what you've done here. Could you please check if this is the case? If so, I think it's a fine solution provided we can come up with names that aren't confusing.

@yoff yoff force-pushed the python-dataflow/flow-summaries-from-scratch branch from 1430ab6 to b4aa941 Compare April 21, 2022 06:15
@yoff
Copy link
Contributor Author

yoff commented Apr 21, 2022

Force pushed because I could see that my amended commit message to the last commit had not been pushed.

@hvitved
Copy link
Contributor

hvitved commented Apr 21, 2022

I believe Ruby's solution so far is to make type-tracking steps explicitly depend on non-library callables only. The downside here is that none of their standard library models are visible to type-tracking. (@hvitved @aibaars do you agree here?)

Correct (since flow summaries depend on API graphs, which depend on type tracking).

@yoff
Copy link
Contributor Author

yoff commented Apr 21, 2022

The diagram looks correct to me. Is the difference with Ruby just that we have not (yet?) written any standard library models using flow summaries? Or is there a more fundamental difference?

@yoff yoff force-pushed the python-dataflow/flow-summaries-from-scratch branch from b4aa941 to ebedc7e Compare April 28, 2022 11:16
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 8 vulnerabilities.

Comment on lines 1047 to 1059
/**
* A query predicate for outputting flow summaries in semi-colon separated format in QL tests.
* The syntax is: "namespace;type;overrides;name;signature;ext;inputspec;outputspec;kind",
* ext is hardcoded to empty.
*/

Check warning

Code scanning / CodeQL

Predicate QLDoc style.

The QLDoc for a predicate without a result should start with 'Holds'.
@yoff
Copy link
Contributor Author

yoff commented May 4, 2022

The failing language test belongs to https://github.com/github/semmle-code and cannot be fixed in this PR (I believe the fix is simply to update the test expectations).

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this Herculean effort!

I think in terms of the implementation, this looks pretty solid. I've added a bunch of requests for clarification (and also some requests for renaming stuff). I'll be very happy to discuss this synchronously if that's easier. 🙂

abstract predicate isParameterOf(DataFlowCallable c, int i);
}

class SourceParameterNode extends ParameterNode, CfgNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took me a while to figure out what Source was referring to in this context (what with this being dataflow and all). I wonder if there is a better name for this, or is SourceParameterNode just the accepted term?

}
}

abstract class DataFlowSourceCall extends DataFlowCall, TDataFlowSourceCall {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like my comment on SourceParameterNode, I feel like this is also potentially confusing. At the very least there should be QLDoc that explains what's going on.

@@ -282,6 +282,8 @@ module EssaFlow {
nodeTo = TKwOverflowNode(call, callable) and
nodeFrom.asCfgNode() = call.getNode().getKwargs().getAFlowNode()
)
or
FlowSummaryImpl::Private::Steps::summaryLocalStep(nodeFrom, nodeTo, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... So we allow summaries to extend local flow steps (which means they underpin all of the dataflow machinery). Does this not lead to everything being recursively dependent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there are now two local step relations. One for type trackers and one for the global data flow computation.

Comment on lines 895 to 947
kind = kind and
exists(Call call, Name f, FunctionDef def |
f = call.getAnArg() and
def.getDefinedFunction().getName() = f.getId() and
// c.getCallableValue() = def.getDefinedFunction().getDefinition() and
c.getName() = f.getId()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may have been fixed in a later commit, but as QL-for-QL helpfully points out, creation isn't restricted by this disjunct.

* Includes function calls, method calls, class calls and library calls.
* All these will be associated with a `CallNode`.
*/
TNonSpecialCall(CallNode call) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about NonSpecial... rubs me the wrong way. Could we use a positive formulation instead? I'm tempted to suggest CallNodeCall, though that is a bit of a mouthful. (However, it does precisely say that these are calls that happen at call nodes.)

Alternatively, since we have SpecialCall, perhaps its counterpart should be NormalCall?

/** Gets a callable value for this callable, if one exists. */
abstract CallableValue getCallableValue();

abstract CallNode getACall2();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a better name. Even if it is only internal, it certainly doesn't convey what it's actually supposed to do.

@@ -497,59 +504,69 @@ class NonSpecialCall extends DataFlowSourceCall, TNonSpecialCall {
override DataFlowCallable getEnclosingCallable() { result.getScope() = call.getNode().getScope() }
}

abstract class NonLibraryDataFlowSourceCall extends NonSpecialCall {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name needs a bit of workshopping. Could we come up with a better description of "non-library"?

}

class SourceParameterNode extends ParameterNode, CfgNode {
//, LocalSourceNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit ambivalent about these not being local sources, and I would like to understand this a bit more. Is this merely because of negative recursion issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative recursion is what directly prevents me from inheriting. But I can add that inheritance into the Ruby codebase which basically has the same structure, so I think the solution is to clear up the mutual recursive structure of our predicates, which I fear lacks a bit of discipline at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all of the recent changes, is your comment still in effect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, yes. I do feel we should be able to resolve this, but I had a quick go and it seems there are multiple cycles to untangle, so I would prefer to do this later..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Thanks for checking!

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few drive-by comments. Is the logic actually hooked up yet? I fail to see references to e.g. any of the FlowSummaryImpl::Private::Steps predicates. See e.g. DataFlowPrivate.qll for ruby for how this is done.

@@ -277,21 +286,26 @@ ExprNode exprNode(DataFlowExpr e) { result.getNode().getNode() = e }
* The value of a parameter at function entry, viewed as a node in a data
* flow graph.
*/
class ParameterNode extends CfgNode, LocalSourceNode {
abstract class ParameterNode extends Node {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you want an abstract class in the public API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, probably not..

@@ -0,0 +1,175 @@
/**
* Provides Ruby specific classes and predicates for defining flow summaries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python

or
call = any(ClassValue c | not c.isAbsent()).getACall()
or
call = any(SpecialMethodCallNode special)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call instanceof SpecialMethodCallNode

@@ -2,6 +2,7 @@ private import python
private import DataFlowPublic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Indeed we just have none().

@yoff
Copy link
Contributor Author

yoff commented May 10, 2022

A few drive-by comments. Is the logic actually hooked up yet? I fail to see references to e.g. any of the FlowSummaryImpl::Private::Steps predicates. See e.g. DataFlowPrivate.qll for ruby for how this is done.

They are mostly in DataFlowPrivate.qll and then one in DataFlowUtil.qll. I followed the structure from Ruby closely. The interface does not let you search for these terms inside collapsed files, which is a bit misleading.

@hvitved
Copy link
Contributor

hvitved commented May 10, 2022

They are mostly in DataFlowPrivate.qll and then one in DataFlowUtil.qll. I followed the structure from Ruby closely. The interface does not let you search for these terms inside collapsed files, which is a bit misleading.

Ah, good, sorry for the false alarm. I guess I must have looked at a diff limited to a subset of commits perhaps.

yoff and others added 13 commits May 10, 2022 12:48
- add flowsummaries shared files
- register in indentical files
- fix initial non-monotonic recursions
  - add DataFlowSourceCall
  - add resolvedCall
  - add SourceParameterNode

failing tests:
- 3/library-tests/with/test.ql
allowing more `OutNode`s (not restricting to `CallNode`s),
gives more flow in the `classesCallGraph` test
also convert to inline tests
summary flow is excluded from the local flow relation used for
typetracking, but included in the one used for global data flow.
- Branch predicates are made simple. In particular, they do not try to detect library calls.
- All branches based on `CallNode`s are gathered into one.
- That branch has been given a class `NonSpecialCall`, which is the new parent of call classes based on `CallNode`s. (Those classes now have more involved charpreds.)
- A new such class, 'LambdaCall` has been split out from `FunctionCall` to allow the latter to replace its
  general `CallNode` field with a specific `FunctionValue` one.
- `NonSpecialCall` is not an abstract class, but it has some abstract overrides. Therefor, it is not
  considered a resolved call in the test `UnresolvedCalls.qll`.
Two classes have been inserted into the hierarchies:

- `NonLibraryDataFlowCallable` with a method `getACall2`.
This method implements "get a call, not considering flow summaries".
For `NonLibraryDataFlowCallable`s, `getACall` will defer to `getACall2`.
While you could have a synthesised call to such a callable,
it would not correspond to a `CallNode`.

- `NonLibraryDataFlowSourceCall` with methods
`getArg2` and `getCallable2`. These also refer to a call graph that
does not consider flow summaries.

`getArg2` is used to synthesise pre-update nodes for arguments.

`getCallable2` is used in `connects` to compute argument passing.
This is used to define data flow nodes for overflow arguments.

`getACall2` ensures that `LibraryCallableValue::getACall` is not called
when the charpred of `FunctionCall` is evaluated.
- sync summary files
- format files
- fix compilation
This can be used when one wants to consider a
(source) parameter node as a local source.
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 vulnerability.

by making client code use the "new" class.
Really, this part of the split class should have the old name,
to minimise disruptions to clients.
Same goes for the other split classes.
@yoff
Copy link
Contributor Author

yoff commented May 11, 2022

I think in terms of the implementation, this looks pretty solid. I've added a bunch of requests for clarification (and also some requests for renaming stuff). I'll be very happy to discuss this synchronously if that's easier. 🙂

Thank you for the review. I think both renaming and qldocs are warranted. Specifically, when classes are split (in order to only have one participate in a recursion that would otherwise be non-monotonic) the user-facing one should be retaining the old name so as to minimise breaking changes.

to avoid nodes in the extracted stdlib

Was there a reason for this depth to be 1?
@yoff yoff force-pushed the python-dataflow/flow-summaries-from-scratch branch from ede5f3c to 56dcfc2 Compare September 12, 2022 21:34
@yoff yoff marked this pull request as draft September 14, 2022 07:19
since map resolves to a class call

also fix test expectation
@yoff yoff marked this pull request as ready for review September 14, 2022 10:40
This is not necessary as long as `LibraryCall` only
includes unresolved calls.
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be super nice with a change to the result type of SummarizedCallable.getACall. (which seems like a fairly simple change, I can even do it myself if you want me to)

Besides that, I don't think I have requests for any other significant changes.

I put it in frameworks\Builtins.qll. It could also go in the StdLib folder, I guess. Or perhaps you have a different place that you find natural, feel free to suggest :-)

We have normally put both builtins and standard library modeling in the Stdlib.qll file. I guess that is not technically 100% correct. However, on the top of my head, it doesn't seem like we're gaining anything from splitting it into two, so I would suggest you move it into the existing file (but do let me know if I'm overlooking something 🤷). ((note: if you do want to keep the Builtins.qll file, you should add module structure like in all the other files in the frameworks folder)

class ReversedSummary extends SummarizedCallable {
ReversedSummary() { this = "builtins.reversed" }

override CallNode getACall() { result = API::builtin("reversed").getACall().getNode() }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line of code makes me wonder if we could change getACall for SummarizedCallable to have DataFlow::CallCfgNode as a result type, so this could just be written as result = API::builtin("reversed").getACall(). The extra .getNode() seems like it will be included every time we define a SummarizedCallable, so we might as well get away with it now?

Copy link
Contributor Author

@yoff yoff Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is slightly inconsistent for LibraryCallable to return a DatFlow::Node here, when the other callables return control flow nodes, but I tried it now, see if you like :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'm happy 👍 Much of it will be changed with the new call-graph anyway 🤷

The other callables return control flow nodes,
so it is slightly inconsistent for this to return a
data flow node, but it does make models based
on API graphs nicer.
@yoff yoff requested a review from RasmusWL September 19, 2022 12:37
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor fixes needed... but almost 😅

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@yoff yoff requested a review from RasmusWL September 19, 2022 12:52
RasmusWL
RasmusWL previously approved these changes Sep 19, 2022
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤞 for the tests 😆

@yoff
Copy link
Contributor Author

yoff commented Sep 19, 2022

🤞 for the tests 😆

Indeed, thanks for the fix :-)

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments here and there. Otherwise, I think this looks very good!

Regarding the overloading of "source" to mean "in source code", how about "extracted"? This, to me, seems fairly unambiguous...

predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver) { none() }
predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver) {
receiver = call.(SummaryCall).getReceiver() and
kind = kind
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I understand correctly, this is to allow kind to be any LambdaCallKind? If so, maybe exists(kind) is a more idiomatic way of indicating an intentional cartesian product?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct (and LambdaCallKind is currently Unit).

}

class SourceParameterNode extends ParameterNode, CfgNode {
//, LocalSourceNode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all of the recent changes, is your comment still in effect?

The precedence for the use of "source" to denote elements of source code
is found in `EssaVariable::getSourceVariable` as well as in the Ruby
code base. But it clashes with the many uses of source to mean
"source of flow" found in the data flow library.
@yoff yoff requested a review from tausbn September 20, 2022 11:29
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@yoff yoff merged commit ea74317 into github:main Sep 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants