C++: Simplify `buffer.qll` repair by MathiasVP · Pull Request #11042 · github/codeql

MathiasVP · 2022-10-30T12:14:28Z

#10833 prepared the buffer.qll library for use-use flow. In the end, the PR ended up being a lot more complicated than I had anticipated.

After thinking a bit about this, I reached the conclusion that this is all because of the added use of asIndirectExpr to convert an expression to a DataFlow::Node in the buffer library. The complexity comes from the fact that, if n is the dataflow node tracking *p, then there's flow from that node to the indirect node tracking *(++p). So by using node.asIndirectExpr in the buffer library we would have dataflow through operations that incremented a pointer.

This required a bunch of complexity: It required the need for localFlowIncrStep (to track how much the pointer was incremented), and because this predicate had an integer column to count how much the pointer was incremented, the library could suddenly diverge on loops like:

while(true) { ++p; }

because we'd have dataflow from the indirect node of p to the indirect node of ++p, and localFlowIncrStep would loop forever incrementing the integer column. To fix this, I disabled flow through back-edges in localFlowNotIncrStep.

All in all, this was kinda messy. @jketema also reported that he was seeing multiple possible buffer sizes returned by the getBufferSize predicate, which is certainly not intended.

I think this PR is a much safer and simpler way to approach this. By not using asIndirectExpr we avoid this extra flow, and we don't need to block flow through back-edges since we don't have the localFlowIncrStep predicate anymore. The end result is a much simpler implementation 🎉. It might also solve multiple sizes being reported. @jketema let's discuss this on Monday.

The PR is split into two commits. The first commit contains all of the fixes explained above, and the second commit is purely a performance optimization (see the commit message for tuple counts. Note that the number of tuples in the commit message changes because the predicate is now part of the recursion. The number of tuples in the main predicate at the end is unchanged, though).

… the flow from 'x' to 'x++', which makes the whole library a lot simpler.

large TC in 'localFlowToExpr': ``` Evaluated relational algebra for predicate Buffer#61e3d199::localFlowStepToExpr#2#ff@0a49913i with tuple counts: 4713946 ~0% {2} r1 = SCAN DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff OUTPUT In.1, In.0 40897385 ~46% {2} r2 = JOIN boundedFastTC:Buffer#61e3d199::localFlowToExprStep#2#ff_10#higher_order_body:DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff_0#higher_order_body WITH DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1 45611331 ~43% {2} r3 = r1 UNION r2 3376553 ~14% {2} r4 = JOIN r3 WITH DataFlowUtil#47741e1f::ExprNode::getExpr#0#dispred#ff ON FIRST 1 OUTPUT Lhs.1, Rhs.1 return r4 ``` After this commit the tuple counts looks like: ``` Evaluated relational algebra for predicate Buffer#61e3d199::localFlowStepToExpr#2#ff@8cc38x5k on iteration 2 running pipeline standard with tuple counts: 51367 ~3% {2} r1 = JOIN Buffer#61e3d199::getBufferSize0#1#f#prev_delta WITH DataFlowUtil#47741e1f::ExprNode::getExpr#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.0 124933 ~18% {2} r2 = JOIN r1 WITH #Buffer#61e3d199::localFlowToExprStep#2Plus#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1 176300 ~17% {2} r3 = r1 UNION r2 184685 ~22% {2} r4 = JOIN r3 WITH DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1 56646 ~47% {2} r5 = JOIN r4 WITH DataFlowUtil#47741e1f::ExprNode::getExpr#0#dispred#ff ON FIRST 1 OUTPUT Lhs.1, Rhs.1 44635 ~16% {2} r6 = r5 AND NOT Buffer#61e3d199::localFlowStepToExpr#2#ff#prev(Lhs.0, Lhs.1) return r6 ```

jketema · 2022-11-01T12:32:44Z

Because only the delta is looked at in the unique, there are unfortunately still cases where we're getting results where in the old situation we were not getting any:

void test15(bool b)
{
	char in_buffer1[8000];
	char in_buffer2[9000];
	char out_buffer[9000];
	
	char *in_buffer = b ? in_buffer1 : in_buffer2;

	bcopy(in_buffer, out_buffer, sizeof(out_buffer)); // GOOD: whether we're reading too much depends on `b`, whose value we do not know.
}

I think unique should be about complete buffer sizes and not just a delta (as was done in the original version).

MathiasVP · 2022-11-01T17:40:02Z

Thanks for the testcase, @jketema! Indeed, we don't want to report anything in that one. After I noticed this testcase:

union myUnion {
  char small[10];
  char large[100];
};

void myUnionTest()
{
  myUnion mu;
  // ...
  memset(&(mu.small), 0, 200); // BAD

I realized that what we're after isn't actually unique since both mu.small and mu.large flows to the memset (because mu is a union). Instead, what we want is to calculate to calculate the largest possible buffer size. This means that we do get a result in the above case (since 200 is larger than the largest possible buffer that flows to memset), but we don't get a result in your case above since the largest possible buffer that flows to in_buffer has size 900 (which matches the size of the write).

30f1547 removes the unique aggregate in the delta case, and instead uses max around the complete buffer size.

jketema · 2022-11-01T19:00:38Z

I would have expected that the union case was already handled by:

  exists(Union bufferType |
    // buffer is the address of a union member; in this case, we
    // take the size of the union itself rather the union member, since
    // it's usually OK to access that amount (e.g. clearing with memset).
    why = bufferExpr.(AddressOfExpr).getAddressable() and
    bufferType.getAMemberVariable() = why and
    result = bufferType.getSize()
  )

Why isn't this sufficient?

C++: Simplify 'Buffer.qll' by avoiding 'asIndirectExpr'. This removes…

aa8214a

… the flow from 'x' to 'x++', which makes the whole library a lot simpler.

MathiasVP requested a review from a team as a code owner October 30, 2022 12:14

github-actions Bot added the C++ label Oct 30, 2022

MathiasVP added the no-change-note-required This PR does not need a change note label Oct 30, 2022

MathiasVP force-pushed the simplify-buffer.qll branch from 2a05855 to 1b50168 Compare October 30, 2022 13:20

C++: Use 'max' instead of 'unique.'

30f1547

jketema approved these changes Nov 3, 2022

View reviewed changes

MathiasVP merged commit 18802a2 into github:mathiasvp/replace-ast-with-ir-use-usedataflow Nov 3, 2022

MathiasVP mentioned this pull request Nov 3, 2022

C++: Accept test change #11097

Merged

MathiasVP mentioned this pull request Dec 16, 2022

C++: Use unique in getBufferSize #11722

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C++: Simplify `buffer.qll` repair#11042

C++: Simplify `buffer.qll` repair#11042
MathiasVP merged 3 commits intogithub:mathiasvp/replace-ast-with-ir-use-usedataflowfrom
MathiasVP:simplify-buffer.qll

MathiasVP commented Oct 30, 2022 •

edited

Loading

Uh oh!

jketema commented Nov 1, 2022 •

edited

Loading

Uh oh!

MathiasVP commented Nov 1, 2022 •

edited

Loading

Uh oh!

jketema commented Nov 1, 2022 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MathiasVP commented Oct 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jketema commented Nov 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MathiasVP commented Nov 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jketema commented Nov 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MathiasVP commented Oct 30, 2022 •

edited

Loading

jketema commented Nov 1, 2022 •

edited

Loading

MathiasVP commented Nov 1, 2022 •

edited

Loading

jketema commented Nov 1, 2022 •

edited

Loading