Skip to content

Conversation

jbj
Copy link
Contributor

@jbj jbj commented Jan 4, 2019

This PR is the first major step towards CPP-257: constructing the CFG in QL. A library is added to compute a CFG from QL, and tests are added to validate that it computes the same CFG as the extractor does. The extractor-based CFG is not replaced by the QL-based CFG in the ControlFlowNode class, so all analyses are unaffected for now.

When this is merged, I'll ask @ian-semmle to make the extractor produce the necessary db data to get rid of SyntheticDestructorCalls.qll, which means we'll no longer be relying on the successors relation for placing compiler-generated destructor calls. The data should be structured similarly to getDestructorCallAfterNode.

I've validated the implementation as follows.

  1. I've copied most *.c and *.cpp files from our tests into the library-tests/qlcfg/ directory, where tellDifferent.ql checks that their QL-produced CFG is identical to their extractor-produced CFG. I've kept the files that showed a difference at some point, so they serve as regression tests. As you can see from tellDifferent.expected, there are still some problematic cases left. They are caused by CPP-314 and by shortcomings of SyntheticDestructorCalls.qll.
  2. I've run tellDifferent.ql on comdb2, linux, and Wireshark. Where there have been differences between the extractor CFG and the QL CFG, I've attempted to fix them and boil them down to regression tests in library-tests/qlcfg. There are some issues I haven't been able to work around, and those are filed as extractor bugs linked to CPP-257.
  3. I've run a CPP-Differences job on my cfg-enable branch, which contains these changes plus the changes needed to make ControlFlowNode use the QL-based CFG.
    1. There were two changes, both in Boost: a FP in "Infinite loop with unsatisfiable exit condition" was removed, and a FP in "Missing return statement" was added. The latter FP is caused by ODASA-394.
    2. Performance on Jenkins is hard to measure. The run time for the whole suite is within a factor of two on all seven sample projects. I think any slowdown due to CFG calculation are eclipsed by the differences due to hardware and concurrent jobs. The detailed timings (before -> after) are in this gist. The only significant regression is Linux, and I checked manually that the new CFG construction stage takes 156 seconds. Compare that to 187 seconds for the existing CFG pruning stage.
    3. In my initial run of CPP-Differences, everything timed out. This was fixed by C++: Factor out reachable base case #721.
  4. I ran Language-Test/CPP with my cfg-enable branch and addressed the test differences that showed up. The remaining differences are benign and can be accepted when the time comes.

@jbj jbj added the C++ label Jan 4, 2019
@jbj jbj requested a review from a team as a code owner January 4, 2019 12:29
This implements calculation of the control-flow graph in QL. The new
code is not enabled yet as we'll need more extractor changes first.

The `SyntheticDestructorCalls.qll` file is a temporary solution that can
be removed when the extractor produces this information directly.
@pavgust pavgust changed the base branch from next to master January 7, 2019 09:55
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was something specific you wanted me and/or @rdmarsh2 to look at in this PR, wasn't there?

* ("before" and "after") and collapse the edges around them, we are left with
* the correct CFG for `e`:
*
* e1 -> e2 -> e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent explanation.

* `getExpr()` is an access to `c` (with possible casts), and `getVariable` is
* the variable `c`, which has an initializer `x < y`.
* `getVariableAccess()` is an access to `c` (with possible casts),
* `getVariable` is the variable `c`, which has an initializer `x < y`, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing as we're in a comma-separated list, I think it would be clearer to put which has an initializer x < y in parentheses.

* the variable `c`, which has an initializer `x < y`.
* `getVariableAccess()` is an access to `c` (with possible casts),
* `getVariable` is the variable `c`, which has an initializer `x < y`, and
* `getInitializingExpr` is `x < y`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have () after some predicates but not others.

// have multiple predecessors.
// - But after ReturnStmt, that may happen.
/**
* Describes a straight line of `SyntheticDestructorCall`s. Node that such
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Node/Note/

or
this instanceof MicrosoftTryExceptStmt
or
// Detecting exception edges out of a MicrosoftTryExceptStmt is not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean MicrosoftTryFinallyStmt?

* specifies the shape of the CFG for all known language constructs. The case
* analysis is large but does _not_ contain recursion. Recursion is needed in
* the second stage in order to collapse virtual nodes, but that recursion is
* simply a transitive closure and can be fast.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"so is fast"?

* To produce all edges around each control-flow node without recursion, we
* need to pre-compute the targets of exception sources (throw, propagating
* handlers, ...) and short-circuiting operators (||, ? :, ...). This
* pre-computation involves recursion, but it's quick to compute because in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it

* For many AST nodes, their control flow can be described in simpler terms
* than the full generality of describing each of their individual sub-edges.
* To add control flow for a new AST construct, one of the following predicates
* can be used, listed roughly in order of increasing generality.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence lost me at first. Perhaps something like: Many kinds of AST nodes share the same pattern of control flow. To add control flow for a new AST construct, it can often be easier to use one of the following predicates (listed roughly in order of increasing generality) than to define it directly.

private import semmle.code.cpp.controlflow.internal.SyntheticDestructorCalls

/** A control-flow node. */
private class Node extends ControlFlowNodeBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the qldoc should explain how this fits in with ControlFlowNode

or
result = this.(Stmt).getParent()
or
// An Initializer under a ConditionDeclExpr is not part of the CFG.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it's not as bad as it sounds. I'll clarify this comment.

* A constant that indicates the type of sub-node in a pair of `(Node, Pos)`.
* See the comment block at the top of this file.
*/
private class Pos extends int {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be better as an ADT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I liked having it as int during development because I'd get a compile error in any disjunct where I forgot to constrain it. I can check whether that can be done with bindingset for an IPA type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a commit to turn Pos and Spec into IPA types.

jbj added 2 commits January 8, 2019 13:29
This is cleaner than extending `int` and working with magic numbers.
Performance appears to be unaffected.
@ian-semmle ian-semmle merged commit b3bcabf into github:master Jan 9, 2019
cklin pushed a commit that referenced this pull request Apr 26, 2022
Post-release preparation for codeql-cli-2.9.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants