Skip to content

C++: Reduce duplication from crement operations #14867

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 5, 2023

Conversation

MathiasVP
Copy link
Contributor

This PR fixes a dataflow duplication issue caused by having multiple dataflow nodes return the same expression when the expression has tricky semantics that's difficult to model using just Exprs.

For example, consider the f(x++) operation. For this expression we generate IR equivalent to:

tmp = x
x = x + 1
f(tmp)

and the question is then: which dataflow node's asExpr() should return the x++ expression. Should it be the one corresponding to x = x + 1 (that is, the incremented value), or should it be tmp in f(tmp) (that is, the original value)?

The C/C++ standard would say that it should be the latter, but when writing queries people are less interested in what the standard demands, and more interesting in whatever fits their query logic, and often that ends up being the former 😂.

So C/C++ currently does both. So both the node corresponding to x = x + 1, and the node corresponding to tmp in f(tmp), return x++. This causes duplication when writing a query such as this (where we track flow from what source() points to, to what the argument of sink points to):

Screenshot 2023-11-20 at 15 34 42

This PR fixes this duplication by adding another dataflow node predicate called Node.asDefinition (and Node.asIndirectDefinition for the indirect case). In the context of the f(x++) example above, the Node.asDefinition predicate will have a result for the dataflow node corresponding to x = x + 1, and Node.asExpr will have a result only for the node corresponding to tmp.

This means that the query from before will only have 1 result:
Screenshot 2023-11-21 at 18 06 15

@github-actions github-actions bot added the C++ label Nov 21, 2023
@geoffw0
Copy link
Contributor

geoffw0 commented Nov 22, 2023

Seems like a reasonable direction. My main concern being whether people actually know / remember to use asDefinition() rather than asExpr() when they should. The explanation in the QLDoc on asDefinition() certainly helps.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Nov 22, 2023

Seems like a reasonable direction. My main concern being whether people actually know / remember to use asDefinition() rather than asExpr() when they should. The explanation in the QLDoc on asDefinition() certainly helps.

Thanks for the feedback! It should be noted that some other languages already have an asDefinition predicate on dataflow nodes, so it's hopefully not too unknown for people 🤞

@bdrodes
Copy link
Contributor

bdrodes commented Nov 30, 2023

The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far). The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit). If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect).

Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential.

@bdrodes
Copy link
Contributor

bdrodes commented Nov 30, 2023

The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far). The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit). If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect).

Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential.

@MathiasVP and I discussed these points offline. He is going to alter a few of the QLDocs to reflect my concerns. Other than that, I'm on board with the approach.

@MathiasVP
Copy link
Contributor Author

The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far).

It's our goal that you don't have to dig into IR to answer stadard dataflow questions. So you never using the IR directly
is great! It's there in case you want to use it, but you should never be forced to do so.

The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit).

You are correct in that we generate multiple instructions for a given expression. One example is the increment expression
I showed in the PR description. One of those instructions generated for a given expression represents the "result of evaluating the expression". For example, if I have the expression x + y, there will ultimately be an AddInstruction
that represents the result of the addition. And the dataflow node corresponding to that instruction is what's being
selected when you say node.asExpr() intanceof AddExpr. We do our best to ensure that, for a given expression e, there is a unique dataflow node n such that n.asExpr() = e. Failure to do so is what's causing result duplication.

So what you're describing in your "naive solution" really is what we're doing. The challange is how to codify "if there are multiple IR to that expression, I'm still only giving you one hit" :joy . And I believe we have a pretty good handle on that by now.

If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect).
Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential.

"Expect" is a difficult thing to formalize in code! I hope that asExpr() will do what you expect. Do you expect flow from
0 to sink in this snippet, for instance?

int x = 0;
sink(x++);

or
what about this one?

int x = 0;
sink(++x);

Giving you the ability to precisely specify what you're looking for is the goal of this PR.

@MathiasVP MathiasVP marked this pull request as ready for review November 30, 2023 14:37
@MathiasVP MathiasVP requested a review from a team as a code owner November 30, 2023 14:37
Comment on lines 2068 to 2073
(
result = n.asExpr()
or
result = n.asDefinition() and
result instanceof CrementOperation
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need updating, and why only for the crement and not for the assignment store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me split the answer into two paragraphs:

Why does this need updating?

It needed to be updated to match the use in the CWE-119 queries (where the Buffer.qll library expect localExprFlow(e1, e2) to have a result when e1 is a postfix increment expression. This worked before since asExpr conflated the value before the increment with the value after the increment, but now that this conflation is gone we need to do this conflation manually in the asExpr predicate used for computing the localFlowExpr relation.

Why only for crement and not for the assignment store?

That's a very good catch! In ce28c9b I've added a testcase that fails because I forgot this, and in 359b15b I've fixed the FP by adding the missing case.

Comment on lines 71 to 73
e = node.asExpr() and isIndirect = false
e = [node.asExpr(), node.asDefinition()] and isIndirect = false
or
e = node.asIndirectExpr() and isIndirect = true
e = [node.asIndirectExpr(), node.asIndirectDefinition()] and isIndirect = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to keep the results as they were? Or is there are deeper reason for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is simply for keeping results as they were. Consider the case where e instanceof CrementOperation (and assume that e is actually a PostfixIncrExpr). On main this selects both the node representing the value before the increment, and the StoreInstruction that represents the value after the increment.

After this change, simply using node.asExpr() won't give you the node that represents the value after the increment. So we'd lose results on an example like:

const char *hello = "Hello, World\n";
hello++;
printf(hello); // BAD

because node.asExpr() would select the value of the hello pointer prior to the increment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're example I cannot quite follow. I would have expected that the hello argument of printf uses the value after the increment, and that suffices, because we're looking at the argument of printf when looking for sinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, now I think I know what you mean. You're saying that you would have expected the node for which node.asExpr() instanceof CrementOperation holds to be the node that's used for the dataflow path. And that value would flow all the way to the argument of printf.

I can totally see why you'd expect that 🤔. The problem is that there's no such dataflow node since the IR doesn't generate an instruction when the increment occurs in a void context. However, if it does occur in a void context all the further uses of the variable will be the incremented version. So when the crement operation occurs in a void context node.asExpr() instanceof CrementOperation should really mean the same thing as node.asDefinition() instanceof CrementOperation.

Is that kinda what you were thinking about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that there's no such dataflow node since the IR doesn't generate an instruction when the increment occurs in a void context.

So there are no addition and store in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect the result of the load to be the thing that is tainted here and picked up, but that's apparently not what's going on given that we're somehow talking about the increment operation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source in the non-constant format query is the increment operation. That's why we're talking about that operation here, no? 😂 And the load is loading from the address produced by the increment operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh.

I see why I got confused. We're fixing something that affects duplication at sinks, but somehow - which I had not understood - it also affects sources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so doing the special handling in void contexts - as you do now - is probably the least confusing way to deal with this from a user perspective. Avoids having to learn about definitions being special most of the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I'm actually pretty happy with this now. Now you'll only need to deal with asDefinition() when there's actually a difference between what you'd get from calling asExpr() and asDefinition().

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Dec 4, 2023

Bahh, looks like 03b77db re-introduced some result duplication in cpp/leap-year/adding-365-days-per-year. I'll take a look at that tomorrow 👀

Edit: Fixed in a8020f4 🎉

jketema
jketema previously approved these changes Dec 5, 2023
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
@MathiasVP MathiasVP merged commit 8ce4bbe into github:main Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants