Skip to content

C++: Fix more FPs on cpp/invalid-pointer-deref #12971

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 11, 2023

Conversation

MathiasVP
Copy link
Contributor

It turns out that the back-edge detection predicate we have on PhiNodes wasn't correct for the purpose it was introduced for. It was meant to block back-edge flow (see #10593), but it turns out that it's blocking too much flow, which we're seeing when add the barrier to fix the testcases introduced in #12960.

Commit-by-commit review recommended: The first commit fixes the FPs by adding said barrier, and the second commit fixes the FNs by fixing what we consider to be a back-edge.

@MathiasVP MathiasVP requested a review from a team as a code owner April 28, 2023 14:27
@github-actions github-actions bot added the C++ label Apr 28, 2023
@MathiasVP MathiasVP requested a review from jketema April 28, 2023 14:28
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Apr 28, 2023
@jketema
Copy link
Contributor

jketema commented May 3, 2023

I don't understand why this is correct. What cases are ruled out by switching to a strict dominator, and are we not accidentally excluding cases we do want to report?

@MathiasVP
Copy link
Contributor Author

I don't understand why this is correct. What cases are ruled out by switching to a strict dominator, and are we not accidentally excluding cases we do want to report?

For some reason some phi nodes have phi edges that goto themselves. I checked whether this is a C/C++ specific issue by adding a consistency check here (which I didn't finish yet because I couldn't compile some of the other extractors 😭), and it looks like all languages have phi nodes that depend directly on themselves. Such phi instructions would be ruled out by using non-strict domination.

I haven't seen any real-world example that's negatively impacted by the change.

@jketema
Copy link
Contributor

jketema commented May 3, 2023

For some reason some phi nodes have phi edges that goto themselves

So this seems like a more fundamental problem (bug?).

@MathiasVP
Copy link
Contributor Author

MathiasVP commented May 3, 2023

So this seems like a more fundamental problem (bug?).

Mayybbbeeee. Or it shows that I'm not fully understanding phi edges 😂. There's a phi node phi for i that satisfies localFlowStep(phi, phi) in this example:

void test(int i)
{
  while (i)
  {
    --i;
  }
}

and that seems weird to me since I'd imagine the input of the phi to be from:

  • The parameter i
  • The result of --i,

but it looks like the input is from:

  • The phi itself
  • The result of --i.

So it's not like there's a redundant phi edge.

@jketema
Copy link
Contributor

jketema commented May 3, 2023

That's weird. It's also not the story the aliased SSA is telling me:

#  434| void test(int)
#  434|   Block 0
#  434|     v434_1(void)       = EnterFunction          : 
#  434|     m434_2(unknown)    = AliasedDefinition      : 
#  434|     m434_3(unknown)    = InitializeNonLocal     : 
#  434|     m434_4(unknown)    = Chi                    : total:m434_2, partial:m434_3
#  434|     r434_5(glval<int>) = VariableAddress[i]     : 
#  434|     m434_6(int)        = InitializeParameter[i] : &:r434_5
#-----|   Goto -> Block 1

#  436|   Block 1
#  436|     m436_1(int)        = Phi                : from 0:m434_6, from 2:m438_5
#  436|     r436_2(glval<int>) = VariableAddress[i] : 
#  436|     r436_3(int)        = Load[i]            : &:r436_2, m436_1
#  436|     r436_4(int)        = Constant[0]        : 
#  436|     r436_5(bool)       = CompareNE          : r436_3, r436_4
#  436|     v436_6(void)       = ConditionalBranch  : r436_5
#-----|   False -> Block 3
#-----|   True -> Block 2

#  438|   Block 2
#  438|     r438_1(glval<int>) = VariableAddress[i] : 
#  438|     r438_2(int)        = Load[i]            : &:r438_1, m436_1
#  438|     r438_3(int)        = Constant[1]        : 
#  438|     r438_4(int)        = Sub                : r438_2, r438_3
#  438|     m438_5(int)        = Store[i]           : &:r438_1, r438_4
#-----|   Goto (back edge) -> Block 1

#  440|   Block 3
#  440|     v440_1(void) = NoOp         : 
#  434|     v434_7(void) = ReturnVoid   : 
#  434|     v434_8(void) = AliasedUse   : m434_3
#  434|     v434_9(void) = ExitFunction : 

@MathiasVP
Copy link
Contributor Author

MathiasVP commented May 3, 2023

That's the IR-based SSA analysis (which is not what we use for dataflow). Dataflow uses the shared SSA library, which has a few differences from the shared SSA library. Notably, the shared library introduces phi edges for SSA reads (in addition to SSA writes).

And yeah, my expectation of what SSA is matches the output of the IR-based SSA. But not the shared SSA library's output 😄.

@jketema
Copy link
Contributor

jketema commented May 3, 2023

Because, of course we have multiple ways to do SSA :tableflip:.

@MathiasVP
Copy link
Contributor Author

Yeah, the IR-based SSA is really good for must-analyses like the one we need in range analysis and value numbering, but not at all what we want for dataflow 😭. For a while we wanted to improve the IR-based analysis to the point where it was usable for dataflow ... but that never happened, and I pulled in the shared SSA library to do the heavy lifting instead.

@jketema
Copy link
Contributor

jketema commented May 4, 2023

Mayybbbeeee. Or it shows that I'm not fully understanding phi edges 😂.

Is there any easy way for me to see what the phi edges are for a particular example?

@MathiasVP
Copy link
Contributor Author

Is there any easy way for me to see what the phi edges are for a particular example?

The easiest way of doing this is probably just to query for SsaPhiNodes and use the getAnInput predicate like:

import semmle.code.cpp.ir.dataflow.internal.DataFlowUtil
from SsaPhiNode n, boolean fromBackEdge
select n, n.getAnInput(fromBackEdge), fromBackEdge

If we want to get a more visual representation, we could resurrect the old PrintIRLocalFlow.qll file which can augment the printed IR with dataflow information.

@jketema
Copy link
Contributor

jketema commented May 9, 2023

So will this change still be needed after #13059?

@MathiasVP
Copy link
Contributor Author

So will this change still be needed after #13059?

Yeah, it looks like it.

@jketema
Copy link
Contributor

jketema commented May 9, 2023

😕 Do we understand why?

@MathiasVP
Copy link
Contributor Author

Not yet. I've delayed any investigations until #13059 was merged hoping that it would remove the necessity for this change. I can look into why this change is necessary even after #13059 is merged, but I can also let you do it if you would like to drive the investigation.

@jketema
Copy link
Contributor

jketema commented May 9, 2023

Let's wait until the other PR is merged

@jketema
Copy link
Contributor

jketema commented May 10, 2023

This indeed still seems needed. Looking at the back-edges of:

    for (char* p = begin; p <= end; ++p) {
        *p = 0; // BAD
    }

There's a back-edge from end to p in p <= end, and those of course live in the same block. It's a bit "funny" though that that's a back-edge. Is this something that comes from the SSA library?

There's more I don't quite understand though. If I repeat the loop, there's also an edge from the p phi node of the first loop to the p phi node of the second loop, although these are two different variables.

@MathiasVP
Copy link
Contributor Author

There's a back-edge from end to p in p <= end, and those of course live in the same block. It's a bit "funny" though that that's a back-edge. Is this something that comes from the SSA library?

Keep in mind that the location for phi nodes isn't super helpful: they get their location from the enclosing basic block. So just because the location overlaps with a specific variable access doesn't mean it's a phi for that variable. On the snippet above I get 3 back-edges:

  • From (the address of) p in ++p to (the address of) p in p <= end
  • From (the value of) ++p to (the value of) p in p <= end
  • From (the indirection of) ++p to (the indirection of) p in p <= end

@jketema
Copy link
Contributor

jketema commented May 10, 2023

I'm also seeing edges from end in p <= end to p. Are you seeing those too?

@MathiasVP
Copy link
Contributor Author

Oops. I was running with strict domination instead of domination for my above example. Yeah, I'm seeing some additional back-edges other than the ones I mentioned above. Let's discuss it in a sync later today 🤔

…ead-phi and use it to restrict flow in 'cpp/invalid-pointer-deref'.
@MathiasVP
Copy link
Contributor Author

MathiasVP commented May 10, 2023

@jketema I've pushed the changes from our sync discussion. For everyone else:

It turned out that changing the definition of back-edge from using dominators to strict dominators was not the right way to fix these FPs. Instead, we noticed that the nodes being marked as barriers by the isBarrier definition added in this PR were nodes that represented the back-edge of a phi-read node. In a loop like this:

end = ...
for(p = start; p < end; ++p) {
  *p = 0;
}

there's a phi-read node for end in the basic-block for p < end, and there's a back-edge from the phi-read node for end whose source is end (i.e., end -> phi for end is a back edge).

Previously, this PR excluded this end node from the set of barriers by saying that the back-edge node should not be in the same block as the phi (which it was in the case of the p < end block). Instead, we're now only including back-edges from phi nodes that aren't phi-read nodes.

Now you may ask (as @jketema rightly did): Why do we even need to add a barrier in the configuration that just searches for a dereference of a pointer that's found to be out-of-bounds, and the reason is super subtle. Consider this snippet:

void test19(unsigned len)
{
  int *xs = new int[len];
  int *end = xs + len;
  for (int *x = xs; x <= end; x++)
  {
    int i = *x;
  }
}

The idea is that:

  1. Dataflow will flow from len (in new int[len]) to len in xs + len (which is equal to end)
  2. Range-analysis will then find that x is non-strictly upper bounds end
  3. Dataflow will find flow from x to a dereference to conclude that we're loading an out-of-bounds pointer.

That is all fine, but now consider this example:

void test19(unsigned len)
{
  int *xs = new int[len];
  int *end = xs + len;
  for (int *x = xs; x < end; x++)
  {
    int i = *x;
  }
}

which is totally fine (because we're now using a strict relational operator instead). But (up until this PR) we were generating a FP here because:

  1. Dataflow will flow from len (in new int[len]) to len in xs + len (which is equal to end) - just as before.
  2. Range-analysis will then find that x (after x++) is non-strictly upper bounds end. This happens at the very last increment of the loop.
  3. Dataflow then finds flow from x to the dereference - as before.

The problem is that, while x is equal to end after the very last increment of x++, control-flow will not enter the body of the loop to dereference the pointer. But dataflow doesn't care about that. So we get a FP.

@MathiasVP
Copy link
Contributor Author

I think DCA looks good. Performance looks unchanged, but we need to double check that the lost results match the FPs we intended to remove 🤞.

@jketema
Copy link
Contributor

jketema commented May 10, 2023

Let me check those results (unless you're already doing it).

@MathiasVP
Copy link
Contributor Author

I haven't done so yet, no. Feel free to do that 🙂.

@jketema
Copy link
Contributor

jketema commented May 10, 2023

I'll have a look tomorrow morning. Seems more sensible that I do it, as I need to dive into the FPs of this query anyway.

@jketema
Copy link
Contributor

jketema commented May 11, 2023

As far as I can tell the DCA alerts that have disappeared are all of the form we expect. The only one I have trouble with is wireshark__wireshark: epan/dissectors/file-elf.c:1099:13:1099:35. However, as far as I can tell that is a FP, so I'm not too concerned.

Copy link
Contributor

@jketema jketema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -229,6 +229,10 @@ module InvalidPointerToDerefConfig implements DataFlow::ConfigSig {

pragma[inline]
predicate isSink(DataFlow::Node sink) { isInvalidPointerDerefSink(sink, _, _) }

predicate isBarrier(DataFlow::Node node) {
node = any(DataFlow::SsaPhiNode phi | not phi.isPhiRead()).getAnInput(true)
Copy link
Contributor

@jketema jketema May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can introduce this restriction in other places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wouldn't be surprised if we could construct some FN that would be removed if we did the same trick to the isBarrier2 in the product flow configuration.

@MathiasVP
Copy link
Contributor Author

As far as I can tell the DCA alerts that have disappeared are all of the form we expect. The only one I have trouble with is wireshark__wireshark: epan/dissectors/file-elf.c:1099:13:1099:35. However, as far as I can tell that is a FP, so I'm not too concerned.

Yeah, I guess there's some complex invariant about when is_cie is true (in which case augmentation_string will be assigned to some non-empty string)

@MathiasVP MathiasVP merged commit fd62820 into github:main May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants