Skip to content

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Feb 24, 2021

Fixes https://github.com/github/codeql-c-analysis-team/issues/236.

A couple of LGTM links are in order:

  • This run shows the results on the usual LGTM projects we consider. All these results look like true positives to me.
  • This run compares the new query with the results from @dbartol's old version of the query.
    It seems like that query has a high false-positive rate, so I only used it to check whether we flag something that the old query didn't flag (i.e., we obviously missed something).
  • This run compares the new query with the experimental version of this query. The new results are neither a subset nor a superset of the old ones. For instance:
    • We sadly lose some true-positives encryption results on bitcoin/bitcoin and redis/redis.
    • Interestingly, we gain a true positive on alibaba/AliSQL, and lose a true positive on the very next line.

I think all of these can all be recovered by teaching variableAddressEscapesTree about AliasFunction's parameterNeverEscapes predicate. But I think the query is good enough for promotion as it is now.

While the issue says that we should:

[...] only focus on memsets that appear as the very last thing in a function body.

my initial strategy for doing this was to specify that there are no more mentions of the variable in the function body. This turned out to also catch a lot of good results where the memset isn't the last thing in the body, but merely the last mention of the variable. I think this strategy:

  • Catches more results that we care about, and
  • is also the more concise than other ways of specifying "happens as the last thing in the function".

CPP-differences: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1755/

@MathiasVP MathiasVP added the C++ label Feb 24, 2021
@MathiasVP MathiasVP requested a review from jbj February 24, 2021 16:10
@MathiasVP MathiasVP requested a review from a team as a code owner February 24, 2021 16:10
@MathiasVP MathiasVP force-pushed the promote-insecure-memset-query branch from 8604784 to cbeeed2 Compare February 24, 2021 16:43
@MathiasVP MathiasVP force-pushed the promote-insecure-memset-query branch from cbeeed2 to 70a953b Compare February 24, 2021 17:02
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good qhelp and fantastic tests! I've yet to review the query itself...


// x86-64 gcc 9.2: deleted
// x86-64 clang 9.0.0: deleted
// x64 msvc v19.14 (WINE): deleted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these annotations!

void nobadFunc2_0_1(){
unsigned char buff1[PW_SIZE];
memset(buff1, '\0', sizeof(buff1));
memset(buff1, 0, PW_SIZE); // GOOD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we trying to test here? Theoretically either (but not both) could be deleted, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this was my attempt at porting a test from the experimental query in a way that didn't overlap with the tests we already had through a combination of the draft query + the tests added by the experimental-query author. Indeed, one of them (but not both) could be deleted.

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query looks like a good, conservative starting point appropriate for code scanning. It would be nice to revisit some of the simpler bad test cases we do not flag at some point, provided we can maintain the accuracy we have now.

or
this.hasGlobalOrStdName("wmemset")
or
this.hasGlobalName(["bzero", "__builtin_memset"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stuff like this makes me wonder if the private models like MemsetFunction (models of clearly defined sets of functions that is) should in fact be public, or have public interfaces?

Copy link
Contributor Author

@MathiasVP MathiasVP Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I think we should have a public interface expressing something like: "A function that writes a value x to the first n entries of a given pointer".

This case is a bit special because there are functions like bzero_explicit which fits exactly this interface, but which we don't want to include in this query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is a bit special because there are functions like bzero_explicit which fits exactly this interface, but which we don't want to include in this query.

I think that's a strong reason not to use shared models in this case.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Feb 25, 2021

CPP-differences: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1755/

Performance looks good. The new results also look like true positives. No sensitive data as far as I can tell, though. It's all just zero'ing buffers after they're declared, but afterward are never actually used for anything.

@geoffw0
Copy link
Contributor

geoffw0 commented Feb 25, 2021

cpp-differences results have arrived and LGTM. They're all results we've seen on LGTM already I think.

I appreciate the effort you've put into gathering real world results and diffs by the way. They appear to be a mix of potential security issues, and clears of buffers that are simply unused. There's a difference in severity, but I'm happy that both types of results are reported. I didn't see any false positives.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass: I've looked at the QL file only.

or
this.hasGlobalOrStdName("wmemset")
or
this.hasGlobalName(["bzero", "__builtin_memset"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is a bit special because there are functions like bzero_explicit which fits exactly this interface, but which we don't want to include in this query.

I think that's a strong reason not to use shared models in this case.

Comment on lines +43 to +44
// There is no later use of `v`.
not v.getAnAccess() = call.getASuccessor*() and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nervous about the performance of TC'ing the successor relation in the CFG. Using basic blocks would help here. But if the optimizer pushes enough context into the starting points of this TC, perhaps it's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was nervous about that too, but it looks like magic saved us:

m##ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus(unique int this) :- 
  {3} r1 = SCAN Access::Access::getTarget_dispred#ff OUTPUT 0, In.0, In.1
  {2} r2 = JOIN r1 WITH Call::Call::getArgument_dispred#fff_120#join_rhs ON FIRST 2 OUTPUT Lhs.2, Rhs.2 'this'
  {2} r3 = JOIN r2 WITH Variable::Variable::getUnspecifiedType_dispred#ff ON FIRST 1 OUTPUT Rhs.1, Lhs.1 'this'
  {1} r4 = JOIN r3 WITH Type::ArrayType#class#f ON FIRST 1 OUTPUT Lhs.1 'this'

  {2} r5 = SCAN Variable::Variable::getUnspecifiedType_dispred#ff OUTPUT In.1, In.0
  {1} r6 = JOIN r5 WITH Struct::Struct#class#f ON FIRST 1 OUTPUT Lhs.1
  {1} r7 = JOIN r6 WITH project#Access::Access::getTarget_dispred#ff ON FIRST 1 OUTPUT Lhs.0
  {2} r8 = JOIN r7 WITH Expr::AddressOfExpr::getAddressable_dispred#ff_10#join_rhs ON FIRST 1 OUTPUT 0, Rhs.1
  {1} r9 = JOIN r8 WITH Call::Call::getArgument_dispred#fff_120#join_rhs ON FIRST 2 OUTPUT Rhs.2 'this'

  {1} r10 = r4 UNION r9
  return r10

EVALUATE RECURSIVE LAYER:
rec #ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus(int this, int result) :-     [[[ BASE CASE ]]]
  {2} r1 = JOIN m##ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus WITH ConstantExprs::Cached::successors_adapted#ff@staged_ext ON FIRST 1 OUTPUT Lhs.0 'this', Rhs.1 'result'
  return r1

  [[[ SEMINAIVE VARIANT - STANDARD ORDER]]]
  {2} r2 = SCAN #ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus#prev_delta OUTPUT In.1, In.0 'this'
  {2} r3 = JOIN r2 WITH ConstantExprs::Cached::successors_adapted#ff@staged_ext ON FIRST 1 OUTPUT Lhs.1 'this', Rhs.1 'result'
  {2} r4 = r3 AND NOT #ControlFlowGraph::ControlFlowNode::getASuccessor_dispred#bfPlus#prev(Lhs.0 'this', Lhs.1 'result')
  return r4

I can rewrite this to use basic blocks if we don't want to rely on optimizations.

Comment on lines 31 to 34
// `v` only escapes as the argument to `memset`.
forall(Expr escape | variableAddressEscapesTree(v.getAnAccess(), escape) |
call.getArgument(0) = escape.getUnconverted()
) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this requirement is too strong, and it's likely the reason why CPP-Differences only shows memset calls to variables that aren't used at all. Most buffers will be arguments to memcpy or printf or something like that, and their address will escape there.

A simple solution could be to assume that every call to functions other than memset will read from v if it escapes. This can be improved by using our models library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this requirement is too strong, and it's likely the reason why CPP-Differences only shows memset calls to variables that aren't used at all. Most buffers will be arguments to memcpy or printf or something like that, and their address will escape there.

That's a good point. I realized that this was a very strict requirement, but I hadn't made the connection to this affecting the results on CPP-differences.

A simple solution could be to assume that every call to functions other than memset will read from v if it escapes. This can be improved by using our models library.

Yeah, I actually considered this approach. But I figured that it would be better to just teach variableAddressEscapesTree about the AliasFunction model after getting this query merged. But I see that there's not much value in the query without this additional model information.

Here's an LGTM run that compares the query in the PR with a version that uses the solution you proposed (at least I think it is) with support from the models library: https://lgtm.com/query/2490354227721723019/ It produces actual good (i.e., sensitive-data related memsets) results on 3 projects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I figured that it would be better to just teach variableAddressEscapesTree about the AliasFunction model after getting this query merged.

That eventual enhancement should probably go in a new variableAddressEscapes predicate (without the Tree) since some callers of the original variableAddressEscapesTree don't use it as an escape analysis but are concerned precisely with the question of whether the variable address ends up at the top of some expression tree. An example of such a caller is the one I proposed in #5256 (comment).

MathiasVP and others added 2 commits February 25, 2021 15:48
@MathiasVP MathiasVP dismissed a stale review via 2777ca4 February 25, 2021 18:49
MathiasVP and others added 3 commits February 25, 2021 19:49
Co-authored-by: intrigus-lgtm <60750685+intrigus-lgtm@users.noreply.github.com>
…le for something. Otherwise it looks like a zero-initialization of a buffer, which the query now tries to exclude.
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query finds good results now and seems robust. I'd like to merge it after another CPP-Differences run.

@MathiasVP
Copy link
Contributor Author

The query finds good results now and seems robust. I'd like to merge it after another CPP-Differences run.

Should be ready within a few hours: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1761/

@jbj
Copy link
Contributor

jbj commented Feb 26, 2021

@github/docs-content-codeql, please review the qhelp, name and description of this new query.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Feb 26, 2021

Should be ready within a few hours: https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/1761/

CPP-differences looks good:

  • No performance concerns.
  • The query now only flags two calls to memset (both in Wireshark), and they both look like true positives to me.

@docs-bot
Copy link
Contributor

:octocat:📚 Thanks for the docs ping! 🛎️ This was added to our docs first-responder project board. A team member will be along shortly to respond. To request changes to the docs you can also open a CodeQL docs issue.

@geoffw0
Copy link
Contributor

geoffw0 commented Feb 26, 2021

There's been a lot of discussion above, so for clarity: all of my comments above have been addressed and/or should not be considered blocking.

shati-patel
shati-patel previously approved these changes Feb 26, 2021
Copy link
Contributor

@shati-patel shati-patel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small editorial suggestions, otherwise LGTM from a docs perspective 📝

MathiasVP and others added 2 commits February 26, 2021 19:16
Co-authored-by: Shati Patel <42641846+shati-patel@users.noreply.github.com>
Co-authored-by: Shati Patel <42641846+shati-patel@users.noreply.github.com>
Copy link
Contributor

@shati-patel shati-patel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approved 👍🏽

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants