Skip to content

C++: Simplify buffer.qll repair#11042

Merged
MathiasVP merged 3 commits intogithub:mathiasvp/replace-ast-with-ir-use-usedataflowfrom
MathiasVP:simplify-buffer.qll
Nov 3, 2022
Merged

C++: Simplify buffer.qll repair#11042
MathiasVP merged 3 commits intogithub:mathiasvp/replace-ast-with-ir-use-usedataflowfrom
MathiasVP:simplify-buffer.qll

Conversation

@MathiasVP
Copy link
Copy Markdown
Contributor

@MathiasVP MathiasVP commented Oct 30, 2022

#10833 prepared the buffer.qll library for use-use flow. In the end, the PR ended up being a lot more complicated than I had anticipated.

After thinking a bit about this, I reached the conclusion that this is all because of the added use of asIndirectExpr to convert an expression to a DataFlow::Node in the buffer library. The complexity comes from the fact that, if n is the dataflow node tracking *p, then there's flow from that node to the indirect node tracking *(++p). So by using node.asIndirectExpr in the buffer library we would have dataflow through operations that incremented a pointer.

This required a bunch of complexity: It required the need for localFlowIncrStep (to track how much the pointer was incremented), and because this predicate had an integer column to count how much the pointer was incremented, the library could suddenly diverge on loops like:

while(true) { ++p; }

because we'd have dataflow from the indirect node of p to the indirect node of ++p, and localFlowIncrStep would loop forever incrementing the integer column. To fix this, I disabled flow through back-edges in localFlowNotIncrStep.

All in all, this was kinda messy. @jketema also reported that he was seeing multiple possible buffer sizes returned by the getBufferSize predicate, which is certainly not intended.

I think this PR is a much safer and simpler way to approach this. By not using asIndirectExpr we avoid this extra flow, and we don't need to block flow through back-edges since we don't have the localFlowIncrStep predicate anymore. The end result is a much simpler implementation 🎉. It might also solve multiple sizes being reported. @jketema let's discuss this on Monday.

The PR is split into two commits. The first commit contains all of the fixes explained above, and the second commit is purely a performance optimization (see the commit message for tuple counts. Note that the number of tuples in the commit message changes because the predicate is now part of the recursion. The number of tuples in the main predicate at the end is unchanged, though).

… the flow from 'x' to 'x++', which makes the whole library a lot simpler.
@MathiasVP MathiasVP requested a review from a team as a code owner October 30, 2022 12:14
@github-actions github-actions Bot added the C++ label Oct 30, 2022
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Oct 30, 2022
large TC in 'localFlowToExpr':
```
Evaluated relational algebra for predicate Buffer#61e3d199::localFlowStepToExpr#2#ff@0a49913i with tuple counts:
    4713946   ~0%    {2} r1 = SCAN DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff OUTPUT In.1, In.0

  40897385  ~46%    {2} r2 = JOIN boundedFastTC:Buffer#61e3d199::localFlowToExprStep#2#ff_10#higher_order_body:DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff_0#higher_order_body WITH DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1

  45611331  ~43%    {2} r3 = r1 UNION r2
    3376553  ~14%    {2} r4 = JOIN r3 WITH DataFlowUtil#47741e1f::ExprNode::getExpr#0#dispred#ff ON FIRST 1 OUTPUT Lhs.1, Rhs.1
                    return r4
```

After this commit the tuple counts looks like:
```
Evaluated relational algebra for predicate Buffer#61e3d199::localFlowStepToExpr#2#ff@8cc38x5k on iteration 2 running pipeline standard with tuple counts:
         51367   ~3%    {2} r1 = JOIN Buffer#61e3d199::getBufferSize0#1#f#prev_delta WITH DataFlowUtil#47741e1f::ExprNode::getExpr#0#dispred#ff_10#join_rhs ON FIRST 1 OUTPUT Rhs.1, Lhs.0

        124933  ~18%    {2} r2 = JOIN r1 WITH #Buffer#61e3d199::localFlowToExprStep#2Plus#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1

        176300  ~17%    {2} r3 = r1 UNION r2
        184685  ~22%    {2} r4 = JOIN r3 WITH DataFlowUtil#47741e1f::simpleLocalFlowStep#2#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1
         56646  ~47%    {2} r5 = JOIN r4 WITH DataFlowUtil#47741e1f::ExprNode::getExpr#0#dispred#ff ON FIRST 1 OUTPUT Lhs.1, Rhs.1
         44635  ~16%    {2} r6 = r5 AND NOT Buffer#61e3d199::localFlowStepToExpr#2#ff#prev(Lhs.0, Lhs.1)
                        return r6
```
@MathiasVP MathiasVP force-pushed the simplify-buffer.qll branch from 2a05855 to 1b50168 Compare October 30, 2022 13:20
@jketema
Copy link
Copy Markdown
Contributor

jketema commented Nov 1, 2022

Because only the delta is looked at in the unique, there are unfortunately still cases where we're getting results where in the old situation we were not getting any:

void test15(bool b)
{
	char in_buffer1[8000];
	char in_buffer2[9000];
	char out_buffer[9000];
	
	char *in_buffer = b ? in_buffer1 : in_buffer2;

	bcopy(in_buffer, out_buffer, sizeof(out_buffer)); // GOOD: whether we're reading too much depends on `b`, whose value we do not know.
}

I think unique should be about complete buffer sizes and not just a delta (as was done in the original version).

@MathiasVP
Copy link
Copy Markdown
Contributor Author

MathiasVP commented Nov 1, 2022

Thanks for the testcase, @jketema! Indeed, we don't want to report anything in that one. After I noticed this testcase:

union myUnion {
  char small[10];
  char large[100];
};

void myUnionTest()
{
  myUnion mu;
  // ...
  memset(&(mu.small), 0, 200); // BAD

I realized that what we're after isn't actually unique since both mu.small and mu.large flows to the memset (because mu is a union). Instead, what we want is to calculate to calculate the largest possible buffer size. This means that we do get a result in the above case (since 200 is larger than the largest possible buffer that flows to memset), but we don't get a result in your case above since the largest possible buffer that flows to in_buffer has size 900 (which matches the size of the write).

30f1547 removes the unique aggregate in the delta case, and instead uses max around the complete buffer size.

@jketema
Copy link
Copy Markdown
Contributor

jketema commented Nov 1, 2022

I would have expected that the union case was already handled by:

  exists(Union bufferType |
    // buffer is the address of a union member; in this case, we
    // take the size of the union itself rather the union member, since
    // it's usually OK to access that amount (e.g. clearing with memset).
    why = bufferExpr.(AddressOfExpr).getAddressable() and
    bufferType.getAMemberVariable() = why and
    result = bufferType.getSize()
  )

Why isn't this sufficient?

@MathiasVP MathiasVP merged commit 18802a2 into github:mathiasvp/replace-ast-with-ir-use-usedataflow Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C++ no-change-note-required This PR does not need a change note

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants