-
Notifications
You must be signed in to change notification settings - Fork 1.8k
C++: Reduce duplication from crement operations #14867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++: Reduce duplication from crement operations #14867
Conversation
…means we can reduce duplication from 'asExpr'.
|
Seems like a reasonable direction. My main concern being whether people actually know / remember to use |
Thanks for the feedback! It should be noted that some other languages already have an |
|
The approach makes sense to me, but I also wonder what the practical implications are. I never use the IR directly, I'm always looking for expressions. What underlying IR the expression is mapped to doesn't really matter for any query I write (so far). The way I'm understanding the solution is there is no way to conflate multiple IR hits to one dataflow node.asExpr (i.e., my naive solution would be to say, if you are asking for an asExpr, I give you the expression in code and if there are multiple IR to that expression, I'm still only giving you one hit). If it's the case you can't do that for technical reasons (which I assume is the case) the approach works for me so long as the asExpr still behaves the intuitive way I'm looking for (one expression returned and its pointing to the expression in the code I expect). Secondly, regardless of the technical details of why two separate predicates may be necessary. Can you describe how I would use the new predicate in practice? Again, I never introspect to the IR, and I'm not sure the distinction would ever buy me anything, but perhaps that's because I'm not using dataflow nodes to their true potential. |
@MathiasVP and I discussed these points offline. He is going to alter a few of the QLDocs to reflect my concerns. Other than that, I'm on board with the approach. |
It's our goal that you don't have to dig into IR to answer stadard dataflow questions. So you never using the IR directly
You are correct in that we generate multiple instructions for a given expression. One example is the increment expression So what you're describing in your "naive solution" really is what we're doing. The challange is how to codify "if there are multiple IR to that expression, I'm still only giving you one hit" :joy . And I believe we have a pretty good handle on that by now.
"Expect" is a difficult thing to formalize in code! I hope that int x = 0;
sink(x++);or int x = 0;
sink(++x);Giving you the ability to precisely specify what you're looking for is the goal of this PR. |
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
| ( | ||
| result = n.asExpr() | ||
| or | ||
| result = n.asDefinition() and | ||
| result instanceof CrementOperation | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need updating, and why only for the crement and not for the assignment store?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me split the answer into two paragraphs:
Why does this need updating?
It needed to be updated to match the use in the CWE-119 queries (where the Buffer.qll library expect localExprFlow(e1, e2) to have a result when e1 is a postfix increment expression. This worked before since asExpr conflated the value before the increment with the value after the increment, but now that this conflation is gone we need to do this conflation manually in the asExpr predicate used for computing the localFlowExpr relation.
Why only for crement and not for the assignment store?
That's a very good catch! In ce28c9b I've added a testcase that fails because I forgot this, and in 359b15b I've fixed the FP by adding the missing case.
| e = node.asExpr() and isIndirect = false | ||
| e = [node.asExpr(), node.asDefinition()] and isIndirect = false | ||
| or | ||
| e = node.asIndirectExpr() and isIndirect = true | ||
| e = [node.asIndirectExpr(), node.asIndirectDefinition()] and isIndirect = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just to keep the results as they were? Or is there are deeper reason for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is simply for keeping results as they were. Consider the case where e instanceof CrementOperation (and assume that e is actually a PostfixIncrExpr). On main this selects both the node representing the value before the increment, and the StoreInstruction that represents the value after the increment.
After this change, simply using node.asExpr() won't give you the node that represents the value after the increment. So we'd lose results on an example like:
const char *hello = "Hello, World\n";
hello++;
printf(hello); // BADbecause node.asExpr() would select the value of the hello pointer prior to the increment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're example I cannot quite follow. I would have expected that the hello argument of printf uses the value after the increment, and that suffices, because we're looking at the argument of printf when looking for sinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, now I think I know what you mean. You're saying that you would have expected the node for which node.asExpr() instanceof CrementOperation holds to be the node that's used for the dataflow path. And that value would flow all the way to the argument of printf.
I can totally see why you'd expect that 🤔. The problem is that there's no such dataflow node since the IR doesn't generate an instruction when the increment occurs in a void context. However, if it does occur in a void context all the further uses of the variable will be the incremented version. So when the crement operation occurs in a void context node.asExpr() instanceof CrementOperation should really mean the same thing as node.asDefinition() instanceof CrementOperation.
Is that kinda what you were thinking about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that there's no such dataflow node since the IR doesn't generate an instruction when the increment occurs in a void context.
So there are no addition and store in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect the result of the load to be the thing that is tainted here and picked up, but that's apparently not what's going on given that we're somehow talking about the increment operation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source in the non-constant format query is the increment operation. That's why we're talking about that operation here, no? 😂 And the load is loading from the address produced by the increment operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh.
I see why I got confused. We're fixing something that affects duplication at sinks, but somehow - which I had not understood - it also affects sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so doing the special handling in void contexts - as you do now - is probably the least confusing way to deal with this from a user perspective. Avoids having to learn about definitions being special most of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think I'm actually pretty happy with this now. Now you'll only need to deal with asDefinition() when there's actually a difference between what you'd get from calling asExpr() and asDefinition().
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
This PR fixes a dataflow duplication issue caused by having multiple dataflow nodes return the same expression when the expression has tricky semantics that's difficult to model using just
Exprs.For example, consider the
f(x++)operation. For this expression we generate IR equivalent to:and the question is then: which dataflow node's
asExpr()should return thex++expression. Should it be the one corresponding tox = x + 1(that is, the incremented value), or should it betmpinf(tmp)(that is, the original value)?The C/C++ standard would say that it should be the latter, but when writing queries people are less interested in what the standard demands, and more interesting in whatever fits their query logic, and often that ends up being the former 😂.
So C/C++ currently does both. So both the node corresponding to
x = x + 1, and the node corresponding totmpinf(tmp), returnx++. This causes duplication when writing a query such as this (where we track flow from whatsource()points to, to what the argument ofsinkpoints to):This PR fixes this duplication by adding another dataflow node predicate called
Node.asDefinition(andNode.asIndirectDefinitionfor the indirect case). In the context of thef(x++)example above, theNode.asDefinitionpredicate will have a result for the dataflow node corresponding tox = x + 1, andNode.asExprwill have a result only for the node corresponding totmp.This means that the query from before will only have 1 result:
