C++: Reduce duplication from crement operations #14867

MathiasVP · 2023-11-21T18:07:28Z

This PR fixes a dataflow duplication issue caused by having multiple dataflow nodes return the same expression when the expression has tricky semantics that's difficult to model using just Exprs.

For example, consider the f(x++) operation. For this expression we generate IR equivalent to:

tmp = x
x = x + 1
f(tmp)

and the question is then: which dataflow node's asExpr() should return the x++ expression. Should it be the one corresponding to x = x + 1 (that is, the incremented value), or should it be tmp in f(tmp) (that is, the original value)?

The C/C++ standard would say that it should be the latter, but when writing queries people are less interested in what the standard demands, and more interesting in whatever fits their query logic, and often that ends up being the former 😂.

So C/C++ currently does both. So both the node corresponding to x = x + 1, and the node corresponding to tmp in f(tmp), return x++. This causes duplication when writing a query such as this (where we track flow from what source() points to, to what the argument of sink points to):

This PR fixes this duplication by adding another dataflow node predicate called Node.asDefinition (and Node.asIndirectDefinition for the indirect case). In the context of the f(x++) example above, the Node.asDefinition predicate will have a result for the dataflow node corresponding to x = x + 1, and Node.asExpr will have a result only for the node corresponding to tmp.

This means that the query from before will only have 1 result:

…means we can reduce duplication from 'asExpr'.

geoffw0 · 2023-11-22T09:57:06Z

Seems like a reasonable direction. My main concern being whether people actually know / remember to use asDefinition() rather than asExpr() when they should. The explanation in the QLDoc on asDefinition() certainly helps.

MathiasVP · 2023-11-22T10:01:25Z

Seems like a reasonable direction. My main concern being whether people actually know / remember to use asDefinition() rather than asExpr() when they should. The explanation in the QLDoc on asDefinition() certainly helps.

Thanks for the feedback! It should be noted that some other languages already have an asDefinition predicate on dataflow nodes, so it's hopefully not too unknown for people 🤞

bdrodes · 2023-11-30T13:27:27Z

The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far). The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit). If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect).

Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential.

bdrodes · 2023-11-30T13:56:54Z

The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far). The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit). If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect).

Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential.

@MathiasVP and I discussed these points offline. He is going to alter a few of the QLDocs to reflect my concerns. Other than that, I'm on board with the approach.

MathiasVP · 2023-11-30T14:37:47Z

The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far).

It's our goal that you don't have to dig into IR to answer stadard dataflow questions. So you never using the IR directly
is great! It's there in case you want to use it, but you should never be forced to do so.

The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit).

You are correct in that we generate multiple instructions for a given expression. One example is the increment expression
I showed in the PR description. One of those instructions generated for a given expression represents the "result of evaluating the expression". For example, if I have the expression x + y, there will ultimately be an AddInstruction
that represents the result of the addition. And the dataflow node corresponding to that instruction is what's being
selected when you say node.asExpr() intanceof AddExpr. We do our best to ensure that, for a given expression e, there is a unique dataflow node n such that n.asExpr() = e. Failure to do so is what's causing result duplication.

So what you're describing in your "naive solution" really is what we're doing. The challange is how to codify "if there are multiple IR to that expression, I'm still only giving you one hit" :joy . And I believe we have a pretty good handle on that by now.

If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect).
Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential.

"Expect" is a difficult thing to formalize in code! I hope that asExpr() will do what you expect. Do you expect flow from
0 to sink in this snippet, for instance?

int x = 0;
sink(x++);

or
what about this one?

int x = 0;
sink(++x);

Giving you the ability to precisely specify what you're looking for is the goal of this PR.

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

jketema · 2023-11-30T15:46:05Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

+    (
+      result = n.asExpr()
+      or
+      result = n.asDefinition() and
+      result instanceof CrementOperation
+    )


Why does this need updating, and why only for the crement and not for the assignment store?

Let me split the answer into two paragraphs:

Why does this need updating?

It needed to be updated to match the use in the CWE-119 queries (where the Buffer.qll library expect localExprFlow(e1, e2) to have a result when e1 is a postfix increment expression. This worked before since asExpr conflated the value before the increment with the value after the increment, but now that this conflation is gone we need to do this conflation manually in the asExpr predicate used for computing the localFlowExpr relation.

Why only for crement and not for the assignment store?

That's a very good catch! In ce28c9b I've added a testcase that fails because I forgot this, and in 359b15b I've fixed the FP by adding the missing case.

jketema · 2023-11-30T15:46:45Z

cpp/ql/src/Likely Bugs/Format/NonConstantFormat.ql

-    e = node.asExpr() and isIndirect = false
+    e = [node.asExpr(), node.asDefinition()] and isIndirect = false
    or
-    e = node.asIndirectExpr() and isIndirect = true
+    e = [node.asIndirectExpr(), node.asIndirectDefinition()] and isIndirect = true


This is just to keep the results as they were? Or is there are deeper reason for this?

Yeah, this is simply for keeping results as they were. Consider the case where e instanceof CrementOperation (and assume that e is actually a PostfixIncrExpr). On main this selects both the node representing the value before the increment, and the StoreInstruction that represents the value after the increment.

After this change, simply using node.asExpr() won't give you the node that represents the value after the increment. So we'd lose results on an example like:

const char *hello = "Hello, World\n"; hello++; printf(hello); // BAD

because node.asExpr() would select the value of the hello pointer prior to the increment.

You're example I cannot quite follow. I would have expected that the hello argument of printf uses the value after the increment, and that suffices, because we're looking at the argument of printf when looking for sinks.

Oh, now I think I know what you mean. You're saying that you would have expected the node for which node.asExpr() instanceof CrementOperation holds to be the node that's used for the dataflow path. And that value would flow all the way to the argument of printf.

I can totally see why you'd expect that 🤔. The problem is that there's no such dataflow node since the IR doesn't generate an instruction when the increment occurs in a void context. However, if it does occur in a void context all the further uses of the variable will be the incremented version. So when the crement operation occurs in a void context node.asExpr() instanceof CrementOperation should really mean the same thing as node.asDefinition() instanceof CrementOperation.

Is that kinda what you were thinking about?

The problem is that there's no such dataflow node since the IR doesn't generate an instruction when the increment occurs in a void context.

So there are no addition and store in that case?

I expect the result of the load to be the thing that is tainted here and picked up, but that's apparently not what's going on given that we're somehow talking about the increment operation here.

The source in the non-constant format query is the increment operation. That's why we're talking about that operation here, no? 😂 And the load is loading from the address produced by the increment operation.

Ahhh.

I see why I got confused. We're fixing something that affects duplication at sinks, but somehow - which I had not understood - it also affects sources.

Ok, so doing the special handling in void contexts - as you do now - is probably the least confusing way to deal with this from a user perspective. Avoids having to learn about definitions being special most of the time.

Yeah, I think I'm actually pretty happy with this now. Now you'll only need to deal with asDefinition() when there's actually a difference between what you'd get from calling asExpr() and asDefinition().

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

…increments.

…texts.

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

cpp/ql/src/Likely Bugs/Format/NonConstantFormat.ql

MathiasVP · 2023-12-04T17:53:59Z

Bahh, looks like 03b77db re-introduced some result duplication in cpp/leap-year/adding-365-days-per-year. I'll take a look at that tomorrow 👀

Edit: Fixed in a8020f4 🎉

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

MathiasVP added 3 commits November 21, 2023 17:49

C++: Add a new API for mapping a dataflow node to a definition. This …

ef2caa3

…means we can reduce duplication from 'asExpr'.

C++: Fixup queries to keep the old results.

976adc3

C++: Accept test changes.

a80dbc5

github-actions bot added the C++ label Nov 21, 2023

MathiasVP added 2 commits November 30, 2023 14:20

C++: Add more comments.

43932b6

Merge branch 'main' into reduce-duplication-from-operators

c1561e8

MathiasVP marked this pull request as ready for review November 30, 2023 14:37

MathiasVP requested a review from a team as a code owner November 30, 2023 14:37

C++: Add change note.

3a61dd0

github-actions bot added the documentation label Nov 30, 2023

jketema reviewed Nov 30, 2023

View reviewed changes

MathiasVP and others added 7 commits December 4, 2023 10:34

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

09117d3

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

5a12a0a

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

1198d23

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

6020457

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

C++: Add more CWE-119 testcases with compound assignments instead of …

ce28c9b

…increments.

C++: Fix FP by special-casing compound assignments in 'asExprInternal'.

359b15b

C++: Make 'node.asExpr()' behave as 'node.asDefinition()' in void con…

03b77db

…texts.

MathiasVP commented Dec 4, 2023

View reviewed changes

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll Outdated Show resolved Hide resolved

github-advanced-security bot found potential problems Dec 4, 2023

View reviewed changes

cpp/ql/src/Likely Bugs/Format/NonConstantFormat.ql Fixed Show resolved Hide resolved

cpp/ql/src/Likely Bugs/Format/NonConstantFormat.ql Fixed Show resolved Hide resolved

C++: Fix Code Scanning errors.

d9d36ff

C++: Add barrier to prevent duplication.

a8020f4

jketema previously approved these changes Dec 5, 2023

View reviewed changes

jketema reviewed Dec 5, 2023

View reviewed changes

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll Outdated Show resolved Hide resolved

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

2908acf

Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>

MathiasVP dismissed jketema’s stale review via 2908acf December 5, 2023 11:00

jketema approved these changes Dec 5, 2023

View reviewed changes

MathiasVP merged commit 8ce4bbe into github:main Dec 5, 2023

C++: Reduce duplication from crement operations #14867

C++: Reduce duplication from crement operations #14867

Uh oh!

Conversation

MathiasVP commented Nov 21, 2023

Uh oh!

geoffw0 commented Nov 22, 2023

Uh oh!

MathiasVP commented Nov 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bdrodes commented Nov 30, 2023

Uh oh!

bdrodes commented Nov 30, 2023

Uh oh!

MathiasVP commented Nov 30, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Why does this need updating?

Why only for crement and not for the assignment store?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MathiasVP commented Dec 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MathiasVP commented Nov 22, 2023 •

edited

Loading

MathiasVP commented Dec 4, 2023 •

edited

Loading