Skip to content

C++: Use a TaintTracking::Configuration in three more queries #8382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

MathiasVP
Copy link
Contributor

Initially, I was also going to modernize the sanitizers used in the three queries, but I realized that this code should be changed anyway once we merged the ongoing work on range analysis. So for now I've copied the ad-hoc sanitizers into these three queries to not change their behavior.

@MathiasVP MathiasVP added the C++ label Mar 9, 2022
@MathiasVP MathiasVP requested a review from a team as a code owner March 9, 2022 12:12
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I've made a few comments and questions. I will also do some kind of run on LGTM of each query before approving this to merge.

@geoffw0
Copy link
Contributor

geoffw0 commented Mar 9, 2022

LGTM runs...

cpp/overflow-destination

cpp/unclear-array-index-validation

cpp/uncontrolled-allocation-size

@MathiasVP
Copy link
Contributor Author

LGTM runs...

cpp/overflow-destination

cpp/unclear-array-index-validation

cpp/uncontrolled-allocation-size

Thanks a lot! I was just in the process of creating diff queries myself.

@geoffw0
Copy link
Contributor

geoffw0 commented Mar 9, 2022

They're not diff queries - I decided that viewing the difference in presentation (path-problem) was more important in this case.

@MathiasVP
Copy link
Contributor Author

They're not diff queries - I decided that viewing the difference in presentation (path-problem) was more important in this case.

Yeah, that's fair. I stripped out the path-problem stuff in cpp/overflow-destination and created a diff like this: https://lgtm.com/query/2343590706647795193/, but I'm totally okay with just using two runs and comparing manually.

@geoffw0
Copy link
Contributor

geoffw0 commented Mar 9, 2022

On cpp/overflow-destination:

  • paths show where taint comes from, and for this query taint is only used to 'filter' results to the most dangerous cases - so it isn't as relevant as for other queries. Still nice to have.
  • 2 new results for gpac/gpac; LGTM.
  • 3 lost results for alibaba/AliSQL (the last 3 in page0zip.cc)
  • 6 new results for arvidn/libtorrent; they're displaying as a table not as alerts, I think they're outside the source location?
  • 1 new result for kamailio/kamalio (the one in parser_carrierroute.c); LGTM.
  • 1 new result for haiku/haiku (the one in pool.c); LGTM.
  • 1 new result for pure-data; LGTM.
  • 2 new results for libretro-uae (in dr_flac.h and stdstring.c); this first is a little weak IMO but not wrong, the second LGTM.
  • 1 lost result for lsds/sgx-lkl (the one in symbol.c)
  • overall the new results look good, I suspect they're due to wider sources; the lost results are sparse and don't concern me.
  • this requires a change note, and I think we can claim wider results as well as paths.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Mar 9, 2022

Thanks for looking at the result changes for cpp/overflow-destination! I can take a look at the other ones tomorrow.

overall the new results look good, I suspect they're due to wider sources; the lost results are sparse and don't concern me.

It may also be due to the removal of the barriers on "unpredictable" instructions. If we like the results I don't think we should include that barrier to remove these results.

this requires a change note, and I think we can claim wider results as well as paths.

Indeed, this should have a change note. If nothing else, the change note should mention that these queries are now path-problem queries. I'll write the change note first thing tomorrow if we like the result changes on the other two queries.

@geoffw0
Copy link
Contributor

geoffw0 commented Mar 9, 2022

On cpp/uncontrolled-allocation-size:

  • 1 new result in vim/vim (line 741); very long path but looks plausible.
  • 1 new result in tesseract-ocr/tesseract (in bitvec.h); similar to the above.
  • 2 lost results in aws/s2n-tls; perhaps we don't support / model fscanf as a source?
  • 2 new results in gpac/gpac; look OK I think. str2ulong is terrifying, perhaps we didn't track taint through that previously???
  • 10 new results in alibaba/AliSQL (all but the one in my_malloc.c); seem plausible.
  • 1 new result in apache/trafficserver (CacheTool.cc); the path goes through a bunch of largely irrelevant template code; the reinterpret_cast on line 739 is the key; looks correct and impressive!
  • I stopped reading at this point; there are quite a lot of new results but they seem good to me.

@geoffw0
Copy link
Contributor

geoffw0 commented Mar 9, 2022

It may also be due to the removal of the barriers on "unpredictable" instructions. If we like the results I don't think we should include that barrier to remove these results.

You might be right, specifically the case:

// don't use dataflow through calls to pure functions if two or more operands
// are unpredictable

which might mean we now get taint flowing through things like pointer differences (endOfBuffer - pointerIntoBuffer) where we might not have done before. I think I'm OK with the results like that I saw for cpp/overflow-destination.

@geoffw0
Copy link
Contributor

geoffw0 commented Mar 9, 2022

On cpp/unclear-array-index-validation:

  • this is the nosiest of the three queries and has got noisier (results in 83 -> 143 of 184 projects)
    • that's a bit of a warning sign that something may be wrong with the query.
    • though it lacks a @precision tag, and is probably not used all that much.
      • I'd be happy if you want to give it @precision low. That's neater and I don't think will mean it suddenly gets run in one of our standard suites?
  • 2 new results in bitcoin/bitcoin
    • in uint256.cpp, I think taint is flowing through a char type that may not be appropriate
      • were there any barriers based on char / small types previously?
    • in system.cpp, are we tainting i from i < argc? If not, I'm not sure what's going on.
  • I'd like to improve this one a bit.

@MathiasVP
Copy link
Contributor Author

I'd be happy if you want to give it @precision low. That's neater and I don't think will mean it suddenly gets run in one of our standard suites?

Indeed, I think we should give it a @precision low. I can do that in this PR.

  • 2 new results in bitcoin/bitcoin
    • in uint256.cpp, I think taint is flowing through a char type that may not be appropriate
      • were there any barriers based on char / small types previously?
    • in system.cpp, are we tainting i from i < argc? If not, I'm not sure what's going on.
  • I'd like to improve this one a bit.

There are no barriers based on small types in DefaultTaintTracking, no. With that said, that might be a reasonable approach to filtering out some FPs from this query.

I might start by porting over the unpredictable instruction barrier to just this one query. I don't want to do too much work on improving the FP rate of a @precision low query, but I think we can improve it a lot with reasonable little work.

@MathiasVP
Copy link
Contributor Author

  • I'd be happy if you want to give it @precision low. That's neater and I don't think will mean it suddenly gets run in one of our standard suites?

Fixed in 693eca2.

@MathiasVP
Copy link
Contributor Author

@geoffw0 all the new results on cpp/unclear-array-index-validation happened because I incorrectly changed the sink in the query to be the entire array expression (instead of the offset expression) here). I've fixed this in 0d3e47b.

We now get a much more manageable difference. Here's a diff query and here are all the results. I think they look good!

geoffw0
geoffw0 previously approved these changes Mar 10, 2022
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, looks much better. 👍

There are still false positives for that query to do with characters being checked against (presumably) 256-long arrays, but I don't think we need to fix that as part of this work.

@MathiasVP
Copy link
Contributor Author

DCA showed a giant number of new results on cpp/unclear-array-index-validation (likely due to the bug I introduced in 2328898, but which I then fixed in 0d3e47b). I'll start another DCA run on the HEAD of this PR and merge it if the results look better.

geoffw0
geoffw0 previously approved these changes Mar 11, 2022
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still 👍

@MathiasVP MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Mar 14, 2022
@MathiasVP
Copy link
Contributor Author

I've created the internal PR with the required changes.

…pp/unclear-array-index-validation' to prevent an explosion of new results.
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM.

nodeIsBarrierEqualityCandidate(node, access, checkedVar)
)
or
// Don't use dataflow into binary instructions if both operands are unpredictable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder what case this is supposed to cover besides subtraction ... but the existence of the case below as well suggests there's more to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, these two barriers block flow in ~100 examples on vim/vim. For this first case, it's most often subtraction/addition. The next case also handles stuff like getting the index of some character in a non-constant string (which they know is bounded due to invariants on the string). I imagine we could do something much more clever here if we want to, but that should probably be once we decide to invest effort in rewriting this query to something much more modern.

@MathiasVP MathiasVP merged commit 9642e59 into github:main Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants