C++: Fix more FPs on `cpp/invalid-pointer-deref` #12971

MathiasVP · 2023-04-28T14:27:50Z

It turns out that the back-edge detection predicate we have on PhiNodes wasn't correct for the purpose it was introduced for. It was meant to block back-edge flow (see #10593), but it turns out that it's blocking too much flow, which we're seeing when add the barrier to fix the testcases introduced in #12960.

Commit-by-commit review recommended: The first commit fixes the FPs by adding said barrier, and the second commit fixes the FNs by fixing what we consider to be a back-edge.

…dges.

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

jketema · 2023-05-03T09:09:10Z

I don't understand why this is correct. What cases are ruled out by switching to a strict dominator, and are we not accidentally excluding cases we do want to report?

MathiasVP · 2023-05-03T09:15:09Z

I don't understand why this is correct. What cases are ruled out by switching to a strict dominator, and are we not accidentally excluding cases we do want to report?

For some reason some phi nodes have phi edges that goto themselves. I checked whether this is a C/C++ specific issue by adding a consistency check here (which I didn't finish yet because I couldn't compile some of the other extractors 😭), and it looks like all languages have phi nodes that depend directly on themselves. Such phi instructions would be ruled out by using non-strict domination.

I haven't seen any real-world example that's negatively impacted by the change.

jketema · 2023-05-03T09:29:34Z

For some reason some phi nodes have phi edges that goto themselves

So this seems like a more fundamental problem (bug?).

MathiasVP · 2023-05-03T10:03:57Z

So this seems like a more fundamental problem (bug?).

Mayybbbeeee. Or it shows that I'm not fully understanding phi edges 😂. There's a phi node phi for i that satisfies localFlowStep(phi, phi) in this example:

void test(int i)
{
  while (i)
  {
    --i;
  }
}

and that seems weird to me since I'd imagine the input of the phi to be from:

The parameter i
The result of --i,

but it looks like the input is from:

The phi itself
The result of --i.

So it's not like there's a redundant phi edge.

jketema · 2023-05-03T10:11:40Z

That's weird. It's also not the story the aliased SSA is telling me:

#  434| void test(int)
#  434|   Block 0
#  434|     v434_1(void)       = EnterFunction          : 
#  434|     m434_2(unknown)    = AliasedDefinition      : 
#  434|     m434_3(unknown)    = InitializeNonLocal     : 
#  434|     m434_4(unknown)    = Chi                    : total:m434_2, partial:m434_3
#  434|     r434_5(glval<int>) = VariableAddress[i]     : 
#  434|     m434_6(int)        = InitializeParameter[i] : &:r434_5
#-----|   Goto -> Block 1

#  436|   Block 1
#  436|     m436_1(int)        = Phi                : from 0:m434_6, from 2:m438_5
#  436|     r436_2(glval<int>) = VariableAddress[i] : 
#  436|     r436_3(int)        = Load[i]            : &:r436_2, m436_1
#  436|     r436_4(int)        = Constant[0]        : 
#  436|     r436_5(bool)       = CompareNE          : r436_3, r436_4
#  436|     v436_6(void)       = ConditionalBranch  : r436_5
#-----|   False -> Block 3
#-----|   True -> Block 2

#  438|   Block 2
#  438|     r438_1(glval<int>) = VariableAddress[i] : 
#  438|     r438_2(int)        = Load[i]            : &:r438_1, m436_1
#  438|     r438_3(int)        = Constant[1]        : 
#  438|     r438_4(int)        = Sub                : r438_2, r438_3
#  438|     m438_5(int)        = Store[i]           : &:r438_1, r438_4
#-----|   Goto (back edge) -> Block 1

#  440|   Block 3
#  440|     v440_1(void) = NoOp         : 
#  434|     v434_7(void) = ReturnVoid   : 
#  434|     v434_8(void) = AliasedUse   : m434_3
#  434|     v434_9(void) = ExitFunction :

MathiasVP · 2023-05-03T10:17:36Z

That's the IR-based SSA analysis (which is not what we use for dataflow). Dataflow uses the shared SSA library, which has a few differences from the shared SSA library. Notably, the shared library introduces phi edges for SSA reads (in addition to SSA writes).

And yeah, my expectation of what SSA is matches the output of the IR-based SSA. But not the shared SSA library's output 😄.

jketema · 2023-05-03T10:49:47Z

Because, of course we have multiple ways to do SSA :tableflip:.

MathiasVP · 2023-05-03T11:22:11Z

Yeah, the IR-based SSA is really good for must-analyses like the one we need in range analysis and value numbering, but not at all what we want for dataflow 😭. For a while we wanted to improve the IR-based analysis to the point where it was usable for dataflow ... but that never happened, and I pulled in the shared SSA library to do the heavy lifting instead.

jketema · 2023-05-04T07:16:37Z

Mayybbbeeee. Or it shows that I'm not fully understanding phi edges 😂.

Is there any easy way for me to see what the phi edges are for a particular example?

MathiasVP · 2023-05-04T08:22:41Z

Is there any easy way for me to see what the phi edges are for a particular example?

The easiest way of doing this is probably just to query for SsaPhiNodes and use the getAnInput predicate like:

import semmle.code.cpp.ir.dataflow.internal.DataFlowUtil
from SsaPhiNode n, boolean fromBackEdge
select n, n.getAnInput(fromBackEdge), fromBackEdge

If we want to get a more visual representation, we could resurrect the old PrintIRLocalFlow.qll file which can augment the printed IR with dataflow information.

jketema · 2023-05-09T11:46:02Z

So will this change still be needed after #13059?

MathiasVP · 2023-05-09T11:49:15Z

So will this change still be needed after #13059?

Yeah, it looks like it.

jketema · 2023-05-09T11:50:05Z

😕 Do we understand why?

MathiasVP · 2023-05-09T11:51:52Z

Not yet. I've delayed any investigations until #13059 was merged hoping that it would remove the necessity for this change. I can look into why this change is necessary even after #13059 is merged, but I can also let you do it if you would like to drive the investigation.

jketema · 2023-05-09T11:52:40Z

Let's wait until the other PR is merged

jketema · 2023-05-10T10:42:40Z

This indeed still seems needed. Looking at the back-edges of:

    for (char* p = begin; p <= end; ++p) {
        *p = 0; // BAD
    }

There's a back-edge from end to p in p <= end, and those of course live in the same block. It's a bit "funny" though that that's a back-edge. Is this something that comes from the SSA library?

There's more I don't quite understand though. If I repeat the loop, there's also an edge from the p phi node of the first loop to the p phi node of the second loop, although these are two different variables.

MathiasVP · 2023-05-10T11:01:19Z

There's a back-edge from end to p in p <= end, and those of course live in the same block. It's a bit "funny" though that that's a back-edge. Is this something that comes from the SSA library?

Keep in mind that the location for phi nodes isn't super helpful: they get their location from the enclosing basic block. So just because the location overlaps with a specific variable access doesn't mean it's a phi for that variable. On the snippet above I get 3 back-edges:

From (the address of) p in ++p to (the address of) p in p <= end
From (the value of) ++p to (the value of) p in p <= end
From (the indirection of) ++p to (the indirection of) p in p <= end

jketema · 2023-05-10T11:03:47Z

I'm also seeing edges from end in p <= end to p. Are you seeing those too?

MathiasVP · 2023-05-10T11:13:34Z

Oops. I was running with strict domination instead of domination for my above example. Yeah, I'm seeing some additional back-edges other than the ones I mentioned above. Let's discuss it in a sync later today 🤔

…ead-phi and use it to restrict flow in 'cpp/invalid-pointer-deref'.

MathiasVP · 2023-05-10T13:29:22Z

@jketema I've pushed the changes from our sync discussion. For everyone else:

It turned out that changing the definition of back-edge from using dominators to strict dominators was not the right way to fix these FPs. Instead, we noticed that the nodes being marked as barriers by the isBarrier definition added in this PR were nodes that represented the back-edge of a phi-read node. In a loop like this:

end = ...
for(p = start; p < end; ++p) {
  *p = 0;
}

there's a phi-read node for end in the basic-block for p < end, and there's a back-edge from the phi-read node for end whose source is end (i.e., end -> phi for end is a back edge).

Previously, this PR excluded this end node from the set of barriers by saying that the back-edge node should not be in the same block as the phi (which it was in the case of the p < end block). Instead, we're now only including back-edges from phi nodes that aren't phi-read nodes.

Now you may ask (as @jketema rightly did): Why do we even need to add a barrier in the configuration that just searches for a dereference of a pointer that's found to be out-of-bounds, and the reason is super subtle. Consider this snippet:

void test19(unsigned len)
{
  int *xs = new int[len];
  int *end = xs + len;
  for (int *x = xs; x <= end; x++)
  {
    int i = *x;
  }
}

The idea is that:

Dataflow will flow from len (in new int[len]) to len in xs + len (which is equal to end)
Range-analysis will then find that x is non-strictly upper bounds end
Dataflow will find flow from x to a dereference to conclude that we're loading an out-of-bounds pointer.

That is all fine, but now consider this example:

void test19(unsigned len)
{
  int *xs = new int[len];
  int *end = xs + len;
  for (int *x = xs; x < end; x++)
  {
    int i = *x;
  }
}

which is totally fine (because we're now using a strict relational operator instead). But (up until this PR) we were generating a FP here because:

Dataflow will flow from len (in new int[len]) to len in xs + len (which is equal to end) - just as before.
Range-analysis will then find that x (after x++) is non-strictly upper bounds end. This happens at the very last increment of the loop.
Dataflow then finds flow from x to the dereference - as before.

The problem is that, while x is equal to end after the very last increment of x++, control-flow will not enter the body of the loop to dereference the pointer. But dataflow doesn't care about that. So we get a FP.

MathiasVP · 2023-05-10T20:09:48Z

I think DCA looks good. Performance looks unchanged, but we need to double check that the lost results match the FPs we intended to remove 🤞.

jketema · 2023-05-10T21:34:35Z

Let me check those results (unless you're already doing it).

MathiasVP · 2023-05-10T21:36:56Z

I haven't done so yet, no. Feel free to do that 🙂.

jketema · 2023-05-10T21:38:12Z

I'll have a look tomorrow morning. Seems more sensible that I do it, as I need to dive into the FPs of this query anyway.

jketema · 2023-05-11T08:57:24Z

As far as I can tell the DCA alerts that have disappeared are all of the form we expect. The only one I have trouble with is wireshark__wireshark: epan/dissectors/file-elf.c:1099:13:1099:35. However, as far as I can tell that is a FP, so I'm not too concerned.

jketema

LGTM

jketema · 2023-05-11T08:58:17Z

cpp/ql/src/experimental/Security/CWE/CWE-193/InvalidPointerDeref.ql

@@ -229,6 +229,10 @@ module InvalidPointerToDerefConfig implements DataFlow::ConfigSig {

  pragma[inline]
  predicate isSink(DataFlow::Node sink) { isInvalidPointerDerefSink(sink, _, _) }
+
+  predicate isBarrier(DataFlow::Node node) {
+    node = any(DataFlow::SsaPhiNode phi | not phi.isPhiRead()).getAnInput(true)


I wonder if we can introduce this restriction in other places?

Yeah, I wouldn't be surprised if we could construct some FN that would be removed if we did the same trick to the isBarrier2 in the product flow configuration.

MathiasVP · 2023-05-11T09:04:27Z

As far as I can tell the DCA alerts that have disappeared are all of the form we expect. The only one I have trouble with is wireshark__wireshark: epan/dissectors/file-elf.c:1099:13:1099:35. However, as far as I can tell that is a FP, so I'm not too concerned.

Yeah, I guess there's some complex invariant about when is_cie is true (in which case augmentation_string will be assigned to some non-empty string)

MathiasVP added 2 commits April 28, 2023 11:36

C++: Disable flow through nodes that are sources of phi edges' back e…

faf846b

…dges.

C++: Fix back edge detection for phi nodes.

4352757

MathiasVP requested a review from a team as a code owner April 28, 2023 14:27

github-actions bot added the C++ label Apr 28, 2023

MathiasVP requested a review from jketema April 28, 2023 14:28

MathiasVP added the no-change-note-required This PR does not need a change note label Apr 28, 2023

MathiasVP assigned jketema Apr 28, 2023

jketema reviewed May 3, 2023

View reviewed changes

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll Outdated Show resolved Hide resolved

Merge branch 'main' into fix-fp-in-invalid-deref-2

c3a7f98

C++: Add a member predicate to phi nodes for checking if a phi is a r…

f05cce8

…ead-phi and use it to restrict flow in 'cpp/invalid-pointer-deref'.

jketema approved these changes May 11, 2023

View reviewed changes

MathiasVP merged commit fd62820 into github:main May 11, 2023

C++: Fix more FPs on cpp/invalid-pointer-deref #12971

C++: Fix more FPs on cpp/invalid-pointer-deref #12971

Uh oh!

Conversation

MathiasVP commented Apr 28, 2023

Uh oh!

Uh oh!

jketema commented May 3, 2023

Uh oh!

MathiasVP commented May 3, 2023

Uh oh!

jketema commented May 3, 2023

Uh oh!

MathiasVP commented May 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jketema commented May 3, 2023

Uh oh!

MathiasVP commented May 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jketema commented May 3, 2023

Uh oh!

MathiasVP commented May 3, 2023

Uh oh!

jketema commented May 4, 2023

Uh oh!

MathiasVP commented May 4, 2023

Uh oh!

jketema commented May 9, 2023

Uh oh!

MathiasVP commented May 9, 2023

Uh oh!

jketema commented May 9, 2023

Uh oh!

MathiasVP commented May 9, 2023

Uh oh!

jketema commented May 9, 2023

Uh oh!

jketema commented May 10, 2023

Uh oh!

MathiasVP commented May 10, 2023

Uh oh!

jketema commented May 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MathiasVP commented May 10, 2023

Uh oh!

MathiasVP commented May 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MathiasVP commented May 10, 2023

Uh oh!

jketema commented May 10, 2023

Uh oh!

MathiasVP commented May 10, 2023

Uh oh!

jketema commented May 10, 2023

Uh oh!

jketema commented May 11, 2023

Uh oh!

jketema left a comment

Choose a reason for hiding this comment

Uh oh!

jketema May 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MathiasVP May 11, 2023

Choose a reason for hiding this comment

Uh oh!

MathiasVP commented May 11, 2023

Uh oh!

Uh oh!

C++: Fix more FPs on `cpp/invalid-pointer-deref` #12971

C++: Fix more FPs on `cpp/invalid-pointer-deref` #12971

MathiasVP commented May 3, 2023 •

edited

Loading

MathiasVP commented May 3, 2023 •

edited

Loading

jketema commented May 10, 2023 •

edited

Loading

MathiasVP commented May 10, 2023 •

edited

Loading

jketema May 11, 2023 •

edited

Loading