-
Notifications
You must be signed in to change notification settings - Fork 1.8k
C++: Promote insecure removal of memset query #5256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8604784
to
cbeeed2
Compare
cbeeed2
to
70a953b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good qhelp and fantastic tests! I've yet to review the query itself...
|
||
// x86-64 gcc 9.2: deleted | ||
// x86-64 clang 9.0.0: deleted | ||
// x64 msvc v19.14 (WINE): deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love these annotations!
void nobadFunc2_0_1(){ | ||
unsigned char buff1[PW_SIZE]; | ||
memset(buff1, '\0', sizeof(buff1)); | ||
memset(buff1, 0, PW_SIZE); // GOOD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we trying to test here? Theoretically either (but not both) could be deleted, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this was my attempt at porting a test from the experimental query in a way that didn't overlap with the tests we already had through a combination of the draft query + the tests added by the experimental-query author. Indeed, one of them (but not both) could be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query looks like a good, conservative starting point appropriate for code scanning. It would be nice to revisit some of the simpler bad test cases we do not flag at some point, provided we can maintain the accuracy we have now.
or | ||
this.hasGlobalOrStdName("wmemset") | ||
or | ||
this.hasGlobalName(["bzero", "__builtin_memset"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stuff like this makes me wonder if the private models like MemsetFunction
(models of clearly defined sets of functions that is) should in fact be public, or have public interfaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I think we should have a public interface expressing something like: "A function that writes a value x
to the first n
entries of a given pointer".
This case is a bit special because there are functions like bzero_explicit
which fits exactly this interface, but which we don't want to include in this query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is a bit special because there are functions like
bzero_explicit
which fits exactly this interface, but which we don't want to include in this query.
I think that's a strong reason not to use shared models in this case.
…and add a testcase for this.
Performance looks good. The new results also look like true positives. No sensitive data as far as I can tell, though. It's all just zero'ing buffers after they're declared, but afterward are never actually used for anything. |
cpp-differences results have arrived and LGTM. They're all results we've seen on LGTM already I think. I appreciate the effort you've put into gathering real world results and diffs by the way. They appear to be a mix of potential security issues, and clears of buffers that are simply unused. There's a difference in severity, but I'm happy that both types of results are reported. I didn't see any false positives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass: I've looked at the QL file only.
or | ||
this.hasGlobalOrStdName("wmemset") | ||
or | ||
this.hasGlobalName(["bzero", "__builtin_memset"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is a bit special because there are functions like
bzero_explicit
which fits exactly this interface, but which we don't want to include in this query.
I think that's a strong reason not to use shared models in this case.
// There is no later use of `v`. | ||
not v.getAnAccess() = call.getASuccessor*() and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm nervous about the performance of TC'ing the successor relation in the CFG. Using basic blocks would help here. But if the optimizer pushes enough context into the starting points of this TC, perhaps it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was nervous about that too, but it looks like magic saved us:
m##ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus(unique int this) :-
{3} r1 = SCAN Access::Access::getTarget_dispred#ff OUTPUT 0, In.0, In.1
{2} r2 = JOIN r1 WITH Call::Call::getArgument_dispred#fff_120#join_rhs ON FIRST 2 OUTPUT Lhs.2, Rhs.2 'this'
{2} r3 = JOIN r2 WITH Variable::Variable::getUnspecifiedType_dispred#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1 'this'
{1} r4 = JOIN r3 WITH Type::ArrayType#class#f ON FIRST 1 OUTPUT Lhs.1 'this'
{2} r5 = SCAN Variable::Variable::getUnspecifiedType_dispred#ff OUTPUT In.1, In.0
{1} r6 = JOIN r5 WITH Struct::Struct#class#f ON FIRST 1 OUTPUT Lhs.1
{1} r7 = JOIN r6 WITH project#Access::Access::getTarget_dispred#ff ON FIRST 1 OUTPUT Lhs.0
{2} r8 = JOIN r7 WITH Expr::AddressOfExpr::getAddressable_dispred#ff_10#join_rhs ON FIRST 1 OUTPUT 0, Rhs.1
{1} r9 = JOIN r8 WITH Call::Call::getArgument_dispred#fff_120#join_rhs ON FIRST 2 OUTPUT Rhs.2 'this'
{1} r10 = r4 UNION r9
return r10
EVALUATE RECURSIVE LAYER:
rec #ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus(int this, int result) :- [[[ BASE CASE ]]]
{2} r1 = JOIN m##ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus WITH ConstantExprs::Cached::successors_adapted#ff@staged_ext ON FIRST 1 OUTPUT Lhs.0 'this', Rhs.1 'result'
return r1
[[[ SEMINAIVE VARIANT - STANDARD ORDER]]]
{2} r2 = SCAN #ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus#prev_delta OUTPUT In.1, In.0 'this'
{2} r3 = JOIN r2 WITH ConstantExprs::Cached::successors_adapted#ff@staged_ext ON FIRST 1 OUTPUT Lhs.1 'this', Rhs.1 'result'
{2} r4 = r3 AND NOT #ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus#prev(Lhs.0 'this', Lhs.1 'result')
return r4
I can rewrite this to use basic blocks if we don't want to rely on optimizations.
// `v` only escapes as the argument to `memset`. | ||
forall(Expr escape | variableAddressEscapesTree(v.getAnAccess(), escape) | | ||
call.getArgument(0) = escape.getUnconverted() | ||
) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this requirement is too strong, and it's likely the reason why CPP-Differences only shows memset
calls to variables that aren't used at all. Most buffers will be arguments to memcpy
or printf
or something like that, and their address will escape there.
A simple solution could be to assume that every call to functions other than memset
will read from v
if it escapes. This can be improved by using our models library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this requirement is too strong, and it's likely the reason why CPP-Differences only shows memset calls to variables that aren't used at all. Most buffers will be arguments to memcpy or printf or something like that, and their address will escape there.
That's a good point. I realized that this was a very strict requirement, but I hadn't made the connection to this affecting the results on CPP-differences.
A simple solution could be to assume that every call to functions other than memset will read from v if it escapes. This can be improved by using our models library.
Yeah, I actually considered this approach. But I figured that it would be better to just teach variableAddressEscapesTree
about the AliasFunction
model after getting this query merged. But I see that there's not much value in the query without this additional model information.
Here's an LGTM run that compares the query in the PR with a version that uses the solution you proposed (at least I think it is) with support from the models library: https://lgtm.com/query/2490354227721723019/ It produces actual good (i.e., sensitive-data related memsets) results on 3 projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I figured that it would be better to just teach
variableAddressEscapesTree
about theAliasFunction
model after getting this query merged.
That eventual enhancement should probably go in a new variableAddressEscapes
predicate (without the Tree
) since some callers of the original variableAddressEscapesTree
don't use it as an escape analysis but are concerned precisely with the question of whether the variable address ends up at the top of some expression tree. An example of such a caller is the one I proposed in #5256 (comment).
Co-authored-by: Jonas Jensen <jbj@github.com>
…he local variable being used in memset.
Co-authored-by: intrigus-lgtm <60750685+intrigus-lgtm@users.noreply.github.com>
…le for something. Otherwise it looks like a zero-initialization of a buffer, which the query now tries to exclude.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query finds good results now and seems robust. I'd like to merge it after another CPP-Differences run.
Should be ready within a few hours: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1761/ |
Co-authored-by: Jonas Jensen <jbj@github.com>
@github/docs-content-codeql, please review the qhelp, name and description of this new query. |
CPP-differences looks good:
|
|
There's been a lot of discussion above, so for clarity: all of my comments above have been addressed and/or should not be considered blocking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small editorial suggestions, otherwise LGTM from a docs perspective 📝
Co-authored-by: Shati Patel <42641846+shati-patel@users.noreply.github.com>
Co-authored-by: Shati Patel <42641846+shati-patel@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-approved 👍🏽
Fixes https://github.com/github/codeql-c-analysis-team/issues/236.
A couple of LGTM links are in order:
It seems like that query has a high false-positive rate, so I only used it to check whether we flag something that the old query didn't flag (i.e., we obviously missed something).
bitcoin/bitcoin
andredis/redis
.alibaba/AliSQL
, and lose a true positive on the very next line.I think all of these can all be recovered by teaching
variableAddressEscapesTree
aboutAliasFunction
'sparameterNeverEscapes
predicate. But I think the query is good enough for promotion as it is now.While the issue says that we should:
my initial strategy for doing this was to specify that there are no more mentions of the variable in the function body. This turned out to also catch a lot of good results where the
memset
isn't the last thing in the body, but merely the last mention of the variable. I think this strategy:CPP-differences: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1755/