Skip to content

JS: Call graph changes#2106

Merged
max-schaefer merged 25 commits intogithub:masterfrom
asger-semmle:call-graph-3
Oct 28, 2019
Merged

JS: Call graph changes#2106
max-schaefer merged 25 commits intogithub:masterfrom
asger-semmle:call-graph-3

Conversation

@asger-semmle
Copy link
Contributor

@asger-semmle asger-semmle commented Oct 9, 2019

Reorganizes the call graph computation motivated by the need for library models to reason about callbacks that escape into the library. For example, in the following we could now detect that event refers to an untrusted cross-window event:

function addListener(cb) {
  window.addEventListener("message", cb)
}
function foo(x, event) { /* ... */ }
addListener(foo.bind(this, x))

This consists of a number of changes seemed independent but I had to roll them up into a single change to avoid intermediate regressions or redundant work:

  • Call graph computation is now centralized in CallGraphs.qll
  • InvokeNode.getACallee() and Node.getAFunctionValue() now include the call edges discovered through type tracking -- previously only available internally through FlowSteps::calls.
  • AdditionalPartialInvokeNode has been renamed to PartialInvokeNode with a ::Range companion.
  • Added type tracking for the output of a PartialInvokeNode. (To keep the scope of this PR in check, we don't currently type-track function objects in general, just the result of partial invokes.)

Previously, the call graph exposed through getACallee(0) had the nice property that if any callees were found, we almost certainly had all of them. This is doesn't quite hold anymore. Since it now includes results from type tracking, there will be cases where a proper subset of the callees could be found (where previously nothing was found). This has a few ramifications:

  • Static types now contribute to the call graph even if some edges were also found through data flow. We would lose useful call edges otherwise.
  • The illegal-invocation and inconsistent-new queries had to be dialed back to be more conservative. In principle we could restore their original behaviour by using analyze().getAValue() directly, but the results we lose didn't seem valuable.

Also some changes that could in principle be landed separately, but helped me verify that things actually work:

  • Support for the goog.bind and angular.bind partial invokes.
  • PostMessageEventHandler uses the new API (as per the above example)

Evaluations:

  • Evaluation and re-run of the outlier
  • Call graph metrics show some ups and downs. I've spent a lot of time looking at call graph diffs, and concluded that the remaining changes are benign, although I obvs haven't check all of it. The lost call edges I've seen were either spurious or a call to an extern that is also defined by a polyfill (no longer found due to GlobalAcessPaths::isAssignedInUniqueFile).

@asger-semmle asger-semmle marked this pull request as ready for review October 10, 2019 13:40
@asger-semmle asger-semmle requested a review from a team as a code owner October 10, 2019 13:40
@asger-semmle asger-semmle changed the title Call graph changes JS: Call graph changes Oct 10, 2019
Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few initial thoughts, I'll take another look tomorrow.

FunctionNode getCallback(int i) { result.flowsTo(getArgument(i)) }

/**
* Gets a function passed as the `i`th argument of this invocation,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no parameter i, and the meaning of param isn't documented. I also think this predicate is complex enough to merit a brief example in the doc comment.

}

/**
* A data flow node that performs a partial function application.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A data flow node that performs a partial function application.
* A data-flow node that performs a partial function application.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? None of the other classes use that convention, but I can update all of them if you like.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we're not being particularly consistent about this. I think the version with the hyphen is correct, but feel free to ignore.


module PartialInvokeNode {
/**
* A data flow node that performs a partial function application.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A data flow node that performs a partial function application.
* A data-flow node that performs a partial function application.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto in a few more places below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 238 occurrences of data flow node. I'll do it in a separate PR to avoid conflicts.

result = getABoundFunctionReferenceAux(function, boundArgs, t)
}

pragma[noopt]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😢

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and it's in a recursive predicate, too! Are you absolutely positively sure this is needed?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a concrete suggestion, a quick experiment suggests that you can weaken the pragma[noopt] to a pragma[inline] like this:

  (...)
  or
  exists(DataFlow::StepSummary summary, DataFlow::TypeTracker t2 |
    result = getABoundFunctionReferenceAux(function, boundArgs, t2, summary) and
    t = t2.append(summary)
  )
}

pragma[noinline]
private
DataFlow::SourceNode getABoundFunctionReferenceAux(DataFlow::FunctionNode function, int boundArgs, DataFlow::TypeTracker t, DataFlow::StepSummary summary) {
  exists(DataFlow::SourceNode prev |
    prev = getABoundFunctionReference(function, boundArgs, t) and
    DataFlow::StepSummary::step(prev, result, summary)
  )
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I remember it solving a performance problem, but I can run another evaluation with noinline if you like.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great. I'm not opposed to pragmas on principle, but pragma[noopt] is perhaps an overly blunt instrument. In particular, I believe it prevents us from having different orders for different delta sizes, which is generally a good thing. (And yes, it absolutely does solve a performance problem; removing the pragma entirely makes PostMessageStar grind to a halt on brackets, for instance.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results on nightly look fine. I'm running another eval on default to be sure, but for now I'll merge it in the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker ran out of disk space during the rerun :-\

Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more thoughts. On the whole I'm fine with this PR, though I am a little concerned about the pragma[noopt]. Making InconsistentNew and IllegalInvocation more conservative is fine by me; the former never gave interesting results, and the latter is slowly outliving its usefulness now that classes are natively supported everywhere.

/**
* A call to `goog.bind`, as a partial function invocation.
*/
private class BindCall extends DataFlow::PartialInvokeNode::Range, DataFlow::CallNode {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a test for this and angular.bind below?

addEventListener = DataFlow::globalVarRef("addEventListener").getACall() and
addEventListener.getArgument(0).mayHaveStringValue("message") and
addEventListener.getArgument(1).analyze().getAValue().(AbstractFunction).getFunction() = this
addEventListener.getArgument(1).getABoundFunctionValue(paramIndex).getFunction() = this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test demonstrating the new results this allows us to find?

/** Gets a function value that may reach this node. */
final FunctionNode getAFunctionValue() {
result = getAFunctionValue(_)
CallGraph::getAFunctionReference(result, 0).flowsTo(this)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we restrict imprecision to zero now? Wasn't it unrestricted before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff is relative to an earlier commit. On master, the imprecision bit wasn't easily available and I don't think it was a deliberate choice to include imprecise values here in the first place.

It seems to me that dealing with the imprecise callees is mainly for queries that rely on seeing all the potential callees, which has become a bit of a niche thing given that data flow queries usually don't do that.

But we should make sure it behaves the same way as InvokeNode.getACallee(), which defaults to any imprecision level. I'd prefer to make the default imprecision level 0 for both 0-argument predicates, so you have to "opt in" to see the imprecise callees. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure. If we restrict ourselves to zero imprecision, what do we lose? Is it just calls to externs (modelling built-ins) and global functions defined elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a commit that makes imprecision zero the default for getAFunctionValue and for getACallee

xiemaisi
xiemaisi previously approved these changes Oct 25, 2019
max-schaefer
max-schaefer previously approved these changes Oct 25, 2019
@asger-semmle asger-semmle dismissed stale reviews from max-schaefer and xiemaisi via 04ee483 October 25, 2019 13:10
@max-schaefer max-schaefer merged commit b333c6a into github:master Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants