C++: Add support for `Element` content #16791

MathiasVP · 2024-06-19T16:58:11Z

This PR adds support for Element content for C++, as we already have for the other languages CodeQL supports. Given code such as:

std::vector<int> v;
int x = source();
v.push_back(x);
// ...
int y = v.at(i);
sink(y);

we now track the flow into v by a store step that pushes Element onto the access path and transfers flow to v. Then, once we reach v.at(i) we pop Element off the access path and transfer flow to the return value. This means we now track container flow as data flow instead of taint flow.
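The push and pop behaviour described above can be pictured with a toy model. This is only an illustrative sketch: `TrackedFlow`, `storeStep`, `readStep`, and `reachesSink` are hypothetical names made up for this example and are not part of the CodeQL dataflow library.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a tracked value: the access path records which contents
// must be read before the original value is reached again.
struct TrackedFlow {
  std::vector<std::string> accessPath;
};

// Store step (e.g. v.push_back(x)): flow moves to the container and
// `content` is pushed onto the access path.
void storeStep(TrackedFlow& flow, const std::string& content) {
  flow.accessPath.push_back(content);
}

// Read step (e.g. v.at(i)): succeeds only if `content` is on top of the
// access path, in which case it is popped.
bool readStep(TrackedFlow& flow, const std::string& content) {
  if (flow.accessPath.empty() || flow.accessPath.back() != content) return false;
  flow.accessPath.pop_back();
  return true;
}

// The value reaches a sink as plain data only with an empty access path.
bool reachesSink(const TrackedFlow& flow) { return flow.accessPath.empty(); }

// Replays the vector example: push_back stores, at() reads.
bool vectorExampleReachesSink() {
  TrackedFlow flow;                        // x = source()
  storeStep(flow, "Element");              // v.push_back(x)
  assert(!reachesSink(flow));              // v itself is not the source value
  bool read = readStep(flow, "Element");   // v.at(i)
  return read && reachesSink(flow);        // sink(y)
}
```

In this model, passing `v` itself to a sink does not count as the source value reaching it, which is the difference from treating container operations as taint steps.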

This is all very standard stuff for other languages by now (Tom did this for C# almost four years ago), but there are (of course 😂) a few challenges for C++. Some of these I've resolved in this PR, and I'll create follow-up issues for the rest.

Commit-by-commit review recommended.

Signatures for templates

How do we distinguish between these two constructors in a MaD summary (from here):

explicit vector(size_type count);
explicit vector(size_type count, const T& value = T(), const Allocator& alloc = Allocator());

The obvious answer in MaD is to distinguish them by providing a signature. However, the signature for the latter constructor mentions the type T, whose name comes from a template parameter list (here, that of the enclosing class). And we can't be sure that this parameter is named T across all implementations of the STL.

Likewise, consider this constructor:

template< class InputIt > vector( InputIt first, InputIt last, const Allocator& alloc);

whose parameters are given by the template list of the function itself.

To solve the above problem, I've introduced MaD syntax for introducing template parameters when specifying the class name and the function name. Staying true to the spirit of C++, I've chosen this syntax:

["std", "vector<T,Allocator>", _, "vector<InputIterator>", "(InputIterator,InputIterator,const Allocator &)", "", _, _, _, _]

As you can see, the signature mentions both Allocator (from the class template), and InputIterator (from the function template).

This ensures that this MaD row will match the right constructor no matter what the implementation picked as template parameter names.

Internally, the signature (InputIterator,InputIterator,const Allocator &) is normalized to a form that doesn't mention the template parameter names before being compared with a normalized version of the parameter list in the database.
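To make the normalization idea concrete, here is a rough sketch in C++ (the real implementation is in QL and may differ). The `func:N` placeholder mirrors the replacement string visible in this PR's diff; the `class:N` placeholder, the function name `normalizeSignature`, and the whole-identifier matching are assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Replaces every identifier in `signature` that names a template parameter
// with a positional placeholder, so the result no longer depends on the
// names an STL implementation happened to choose.
std::string normalizeSignature(const std::string& signature,
                               const std::vector<std::string>& classParams,
                               const std::vector<std::string>& funcParams) {
  auto placeholder = [&](const std::string& ident) -> std::string {
    for (std::size_t i = 0; i < funcParams.size(); ++i)
      if (funcParams[i] == ident) return "func:" + std::to_string(i);
    for (std::size_t i = 0; i < classParams.size(); ++i)
      if (classParams[i] == ident) return "class:" + std::to_string(i);
    return ident;  // not a template parameter: keep as written
  };

  std::string result, ident;
  for (char c : signature) {
    if (std::isalnum(static_cast<unsigned char>(c)) || c == '_') {
      ident += c;  // still inside an identifier
    } else {
      result += placeholder(ident);  // flush the finished identifier
      ident.clear();
      result += c;
    }
  }
  return result + placeholder(ident);
}

// Two hypothetical STL implementations with different parameter names
// normalize to the same string.
void checkSameAfterNormalization() {
  std::string a = normalizeSignature(
      "(InputIterator,InputIterator,const Allocator &)",
      {"T", "Allocator"}, {"InputIterator"});
  std::string b = normalizeSignature(
      "(_InputIt,_InputIt,const _Alloc &)", {"_Tp", "_Alloc"}, {"_InputIt"});
  assert(a == b);
}
```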

Handling indirections

The obvious summary for a function like push_back is: Argument[*0] -> Argument[-1].Element[*].

That is, it reads the value pointed to by the 0'th argument (because push_back takes a parameter by reference), and stores it as an Element content on the qualifier.

However, consider this example:

std::vector<int*> v;
int x = source();
int* p = &x;
v.push_back(p);
...
int* q = v.at(i);
sink(*q);

In this case, we track *p (i.e., the indirection of p) when we perform push_back, which means that the value that actually flows into push_back is the address of *p. That is, the node representing **p is actually what ends up flowing to the 0'th argument of push_back.

To solve this, we could introduce another summary for push_back: Argument[**0] -> Argument[-1].Element[**].

However, that introduces quite a burden on the developer when adding MaD rows. For example, if we need to add an additional indirection to every model we need to introduce another MaD row to every function.

To solve this problem I introduced a kind of "template" (no pun intended) for specifying an arbitrary number of indirections. So the summary for push_back actually looks like: Argument[*@0] -> Argument[-1].Element[*@] where @ means "any number of additional indirections".

When consuming MaD rows we then expand the summaries to:

Argument[*0] -> Argument[-1].Element[*]
Argument[**0] -> Argument[-1].Element[**]
Argument[***0] -> Argument[-1].Element[***]
Argument[****0] -> Argument[-1].Element[****]
Argument[*****0] -> Argument[-1].Element[*****]

I've capped the number at 5. Performance-wise, dataflow should have no problem with a higher cap, but it does mean we produce a lot more strings. We could also consider capping this at "the maximum number of indirections in the database". I'll let this be future work, though 😂

I'm not super happy with the syntax I've chosen here, so I'm happy to bikeshed on this once we're happy with everything else in this PR.
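The expansion of `@` listed above can be sketched as follows. This is a simplified illustration in C++ rather than the QL used in the PR: `expandIndirections` is a hypothetical name, and the sketch assumes every `@` directly follows a `*`, as in the push_back summary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expands a summary containing `@` into one row per indirection level.
// Each `@` becomes 0, 1, ..., maxStars - 1 extra stars, so `*@` yields
// between 1 and maxStars stars in total.
std::vector<std::string> expandIndirections(const std::string& summary,
                                            int maxStars = 5) {
  assert(maxStars >= 1);
  std::vector<std::string> rows;
  for (int extra = 0; extra < maxStars; ++extra) {
    std::string row;
    for (char c : summary) {
      if (c == '@') {
        row.append(static_cast<std::size_t>(extra), '*');
      } else {
        row += c;
      }
    }
    rows.push_back(row);
  }
  return rows;
}
```

Calling `expandIndirections("Argument[*@0] -> Argument[-1].Element[*@]")` produces the five rows shown above, with the same number of stars on both sides of each row.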

Supporting associative containers

Initially, I wanted to also replace all of our taint models for std::map with MaD rows. However, I ran into a performance problem. To see what happens, consider the hypothetical MaD summary for std::map::insert:

std::pair<iterator, bool> insert( const value_type& value );

(ignoring @s for this example since they're not relevant for this discussion): Argument[*0] -> ReturnValue.Field[first].Element[*].

That is, the input is the value pointed to by the first argument (since insert takes a const reference), and the output is the first field of the returned std::pair, which is an iterator to the inserted element.

Now: Which exact field does Field[first] refer to? Since std::pair<T, U> is a template there will be many many std::pair<T, U> instantiations in the database, and the above MaD row will create a store step that writes any of them to the access path. On a DB I was testing with there were ~1500 instantiations of std::pair<T, U> (i.e., ~1500 combinations of T and U) which made the flow summary library really unhappy.
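As a small illustration of why a bare field name is ambiguous, the following sketch shows that every instantiation of std::pair has its own `first` member. The alias names and `kDistinctFirstFields` are made up for this example; how the CodeQL database represents these fields is not shown here.

```cpp
#include <type_traits>
#include <utility>

// Pointer-to-member types for the `first` field of two instantiations.
using FirstOfIntBool = decltype(&std::pair<int, bool>::first);
using FirstOfLongBool = decltype(&std::pair<long, bool>::first);

// The two `first` fields belong to different classes and have different
// types, so the name "first" alone does not identify a single field.
constexpr bool kDistinctFirstFields =
    !std::is_same<FirstOfIntBool, FirstOfLongBool>::value;

static_assert(kDistinctFirstFields,
              "each std::pair instantiation declares its own `first`");
```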

To solve this problem we need to extend the MaD language so that we can annotate which first field we're talking about (i.e., it's the first field of the std::pair class whose template parameters are T = the template argument from the enclosing std::map and U = bool). I haven't done this yet, though.

To "solve" this performance problem I've simply chosen not to model any associative containers for now. However, anyone who ventures into doing this will hit the same performance problem. So we need to resolve this at some point (but not necessarily in this PR since this is large enough as-is).

MathiasVP · 2024-06-19T18:22:16Z

DCA shows nothing spectacular:

- Performance is fine (this was my main worry 🎉)
- ~~One lost result which has to do with iterators still not being fully converted to use Element content. I expect this result to appear once we model iterators fully with Element content.~~ Fixed by 40fb59d
- Two new FPs on cpp/double-free. These appear because Element content allows us to have dataflow through containers now (instead of simply having it be taintflow), so we need to restrict container flow in cpp/double-free similarly to how we restrict flow through ArrayExpr in that query.

geoffw0 · 2024-06-24T15:51:56Z

I don't really understand the motivation for @. If we're not propagating indirect data properly through models, isn't that an issue with data-flow itself (or the indirections feature) - not an issue with the MAD representation?

MathiasVP · 2024-06-24T16:03:34Z

> I don't really understand the motivation for @. If we're not propagating indirect data properly through models, isn't that an issue with data-flow itself (or the indirections feature) - not an issue with the MAD representation?

In general we don't infer a *n1 -> *n2 step when a model specifies a n1 -> n2 step. We actually used to do this (see this PR), but pulled it out pretty quickly because of your excellent comment on the PR 😂

The TLDR of your comment is that we cannot assume in general that *n1 -> *n2 holds just because a model gives us n1 -> n2. Your char *stringToUpper(char *source) example is one where we cannot assume this.

So the @ notation is a way to do what Jonas suggested here (as a response to your observation).

geoffw0 · 2024-06-24T16:10:10Z

Thanks, that makes sense. I'll find time to review the code changes at a later point, please ping me if anything is waiting on this.

MathiasVP · 2024-06-24T16:11:02Z

Sounds good. Thanks! I'm not blocked on getting this merged so feel free to take all the time you need 😄

geoffw0

Initial review. There's a lot going on here, so lots of comments, sorry.

I've reviewed most of the QL changes, skimmed some parts though, and looked at the DCA run as well. I have not yet reviewed the new models or the impact on tests.

I don't agree with the no-change-note-required tag. I think we need to explain the changes to the MAD syntax to users.

geoffw0 · 2024-06-25T14:01:55Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowPrivate.qll

+ * This should be equal to the largest number of stars (i.e., `*`s) in any
+ * `Element` content across all of our MaD summaries, sources, and sinks.
+ */
+int getMaxElementContentIndirectionIndex() { result = 5 }


Are we planning to replace this with something better? If so, what could it look like?

It could be the maximum number of indirections in the database, i.e., something like:

max(any(IndirectOperand n, int indirection | n.hasOperandAndIndirectionIndex(_, indirection) | indirection))

geoffw0 · 2024-06-26T11:41:07Z

cpp/ql/lib/semmle/code/cpp/dataflow/ExternalFlow.qll

+private predicate parseAngles(
+  string s, string beforeAngles, string betweenAngles, string afterAngles
+) {
+  beforeAngles = s.regexpCapture("([^<]+)(?:<([^>]+)>(.*))?", 1) and


I'm concerned about how we handle types with multiple angle brackets. For example, I think vector<foo<T>> will be interpreted as:

beforeAngles = vector
betweenAngles = foo<T
afterAngles = >

Ah, that's part of the syntax that I didn't spell out in the comments. The following isn't a valid signature:

(const vector<T> &)

for exactly the reason you've written it: It wouldn't be easy to parse. So the signature you specify in the MaD row cannot include templates. Instead, you'll have to write it as:

(const vector &)

I'll add a comment at the appropriate place for this.

For example, you can see a use of such a signature in the copy constructor for vector here: https://github.com/github/codeql/pull/16791/files#diff-4ac8e1a72ed803d7de0e342cfe1bbc7cd0cd05c6618ed33a837c5c0d6a1ea78eR45 (note the missing angle brackets around vector).

> I'll add a comment at the appropriate place for this.

Done in d38ce61
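The nested-template concern from this thread can be reproduced outside QL. The sketch below applies the same pattern with std::regex and a whole-string match, which is assumed here to correspond to how regexpCapture matches. `AngleParts` and `parseAnglesSketch` are hypothetical names, and mapping capture groups 2 and 3 to betweenAngles and afterAngles is an assumption, since only the first capture appears in the diff excerpt.

```cpp
#include <cassert>
#include <regex>
#include <string>

struct AngleParts {
  std::string before, between, after;
};

// Splits `s` around its first pair of angle brackets using the pattern
// from the diff. Groups that do not participate come back empty.
AngleParts parseAnglesSketch(const std::string& s) {
  static const std::regex pattern("([^<]+)(?:<([^>]+)>(.*))?");
  std::smatch m;
  AngleParts parts;
  if (std::regex_match(s, m, pattern)) {
    parts.before = m[1].str();
    parts.between = m[2].str();
    parts.after = m[3].str();
  }
  return parts;
}

// Nested angle brackets are split at the first `>`, as noted in the review.
void checkNestedTemplate() {
  AngleParts p = parseAnglesSketch("vector<foo<T>>");
  assert(p.before == "vector");
  assert(p.between == "foo<T");
  assert(p.after == ">");
}
```

A flat parameter list such as `vector<T,Allocator>` splits cleanly into `vector` and `T,Allocator`, which is consistent with the rule that signatures in MaD rows must not instantiate templates.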

geoffw0 · 2024-06-26T11:43:57Z

cpp/ql/lib/semmle/code/cpp/dataflow/ExternalFlow.qll

+    tp = templateFunction.getTemplateArgument(remaining) and
+    result = mid.replaceAll(tp.getName(), "func:" + remaining.toString())
+  )
+}


I think I'd like to see test cases for this and/or related predicates. As it is, it's difficult to be confident they're really working correctly.

Sure. They're being tested through our existing taint flow tests, but I can add unit tests for it if that will make you more comfortable.

e845204 adds tests for this. I hope that's the kind of thing you were thinking of? If not, please let me know and I'll be happy to do something else!

MathiasVP · 2024-06-26T12:30:53Z

> I don't agree with the no-change-note-required tag. I think we need to explain the changes to the MAD syntax to users.

That's fair. I wanted to delay the change note until we were sure this was the format we wanted. Note that the syntax is backwards compatible with everything we had before. Once we're happy with this syntax I'll amend our existing documentation and add a proper change note (which may happen in a subsequent PR if that's okay with you).

MathiasVP · 2024-06-26T12:55:12Z

> Initial review. There's a lot going on here, so lots of comments, sorry.

No need to apologise for having many comments 😄

MathiasVP · 2024-06-26T13:12:09Z

cpp/ql/lib/ext/Boost.Asio.model.yml

@@ -1,4 +1,3 @@
-extensions:


Note: This was just a drive-by fix. We had two top-level extensions keys.

MathiasVP · 2024-06-26T13:13:12Z

cpp/ql/test/library-tests/dataflow/dataflow-tests/type-bugs.expected

@@ -1,5 +1,37 @@
 astTypeBugs
 irTypeBugs
+| ../../../include/iterator.h:21:3:21:10 | ../../../include/iterator.h:21:3:21:10 | ../../../include/iterator.h:21:3:21:10 | [summary param] *0 in iterator |


I'm not totally sure what's going on with these tests. It may be related to how we model iterators in dataflow (i.e., we have special code to recognize *it = x as being a write to the container that created the iterator it). I don't think we need to fix this in this PR.

geoffw0 · 2024-06-27T16:48:09Z

cpp/ql/lib/semmle/code/cpp/models/implementations/StdContainer.qll

-    output.isQualifierObject()
-  }
-
-  override predicate isPartialWrite(FunctionOutput output) { output.isQualifierObject() }


Is the information from isPartialWrite in the old models encoded in the new models somewhere?

That happens automatically due to the way field flow works (the same is true for all CodeQL languages). Consider an example such as:

1. std::vector<int> v;
2. v.push_back(source());
3. v.push_back(0);
4. sink(v.at(0));

On main, the modeledFlowBarrier predicate excludes cases where isPartialWrite holds. However, now that this is tracked via field flow, the dataflow graph looks like:

graph TD;
  A["source()"]-->|storeStep| B["v [post update] on line 2"];
  B-->|SSA| C["v on line 3"];
  C-->|SSA| D["v on line 4"];
  E["0"] -->|storeStep| G["v [post update] on line 3"];
  G -->|SSA| D;
  D -->|readStep| F["v.at(0)"];

and this happens completely automatically via the existing interaction between SSA and field flow.
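The reason push_back has to behave as a partial write can be checked directly in C++: a later push_back does not overwrite elements stored earlier. In this sketch `source()` is a hypothetical stand-in that returns a recognizable value.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the tainted source in the example.
int source() { return 42; }

// Mirrors the four-line example: the second push_back appends a new
// element and leaves the element stored by the first one in place.
int firstElementAfterTwoPushes() {
  std::vector<int> v;
  v.push_back(source());
  v.push_back(0);
  assert(v.size() == 2);
  return v.at(0);
}
```

Because `v.at(0)` still holds the value from `source()`, flow from line 2 must survive the store on line 3, which is what the two incoming edges to "v on line 4" in the graph express.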

geoffw0 · 2024-06-27T16:49:53Z

cpp/ql/lib/ext/std.array.model.yml

+      pack: codeql/cpp-all
+      extensible: summaryModel
+    data: # namespace, type, subtypes, name, signature, ext, input, output, kind, provenance
+      - ["std", "array", True, "at", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"]


Why are all the models using Element[*] (or Element[*@]) and not just Element[] (or Element[@])? Surely array.at(0) contains elements, not element references?

Both vector::at and vector::operator[] return references to their elements: https://en.cppreference.com/w/cpp/container/vector/at

This is why you can do stuff like:

std::vector<int> v(10);
v.at(0) = 42;
v[1] = 43;
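The reference return types can also be confirmed at compile time. The following is a minimal sketch; `writeThroughReferences` is a name made up for this example.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Both accessors return int&, not int, for a non-const std::vector<int>.
static_assert(
    std::is_same<decltype(std::declval<std::vector<int>&>().at(0)), int&>::value,
    "vector::at returns a reference to the element");
static_assert(
    std::is_same<decltype(std::declval<std::vector<int>&>()[0]), int&>::value,
    "vector::operator[] returns a reference to the element");

// Writing through the returned references modifies the stored elements.
int writeThroughReferences() {
  std::vector<int> v(10);  // ten value-initialized elements
  assert(v.size() == 10);
  v.at(0) = 42;
  v[1] = 43;
  return v.at(0) + v[1];
}
```

Since the return value is a reference to an element, the element's value sits one indirection behind it, which matches the `ReturnValue[*@]` output in the models.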

geoffw0 · 2024-06-27T16:53:17Z

cpp/ql/lib/ext/std.array.model.yml

+      - ["std", "array", True, "data", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"]
+      - ["std", "array", True, "operator[]", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"]
+      - ["std", "array", True, "rbegin", "", "", "Argument[-1].Element[*@]", "ReturnValue.Element[*@]", "value", "manual"]
+      - ["std", "array", True, "rcbegin", "", "", "Argument[-1].Element[*@]", "ReturnValue.Element[*@]", "value", "manual"]


This is probably the last thing you should fix (after other changes), but all of the std models included bsl equivalents as well.

Good point. I've added these in 2ad8704. I didn't find any bsl version of iterators, so I've left those out for now. These are just straightforward copies of the std ones with std replaced by bsl.

Did another DCA run after 2ad8704, and results are unchanged

geoffw0 · 2024-06-27T16:59:06Z

> I wanted to delay the change note until we were sure this was the format we wanted.

That sounds sensible. There's also the information at the top of ExternalFlow.qll that needs to be updated with the new syntax (Element and @).

MathiasVP added 3 commits June 19, 2024 13:36

C++: Get rid of all the 'StdContainer' taint models.

c158f80

C++: Add support for 'Element' content in dataflow.

013ee9c

C++: Support 'Element' content in flow summaries.

901fac4

MathiasVP requested a review from a team as a code owner June 19, 2024 16:58

github-actions bot added the C++ label Jun 19, 2024

MathiasVP added the no-change-note-required This PR does not need a change note label Jun 19, 2024

github-advanced-security bot found potential problems Jun 19, 2024

MathiasVP added 5 commits June 21, 2024 12:47

C++: Support a richer language in MaD summaries for selecting member …

716df0c

…functions.

C++: Add summaries for std dequence containers.

4dc13e8

C++: Add MaD syntax for an arbitrary number of dereferences.

b0bd7a9

C++: Update MaD models.

85da3d2

C++: Fix 'toString' on parameter and argument positions.

47309df

MathiasVP force-pushed the collection-content-2 branch from da2b14b to c61aa93 Compare June 21, 2024 11:48

MathiasVP added 6 commits June 21, 2024 14:04

C++: Accept test changes.

6513c33

C++: No need for 'decodeUnknownContent' if we specify the MaD summary…

3e98623

… rows correctly. This avoids a bad join in a compiler-generated predicate.

C++: Fix Code Scanning errors.

f507b51

C++: Accept test changes. Nothing exciting to see here.

c8f3c1a

C++: Add gnu iterator models to regain TP.

40fb59d

C++: Accept more test changes.

2e74ae4

MathiasVP force-pushed the collection-content-2 branch from c61aa93 to 2e74ae4 Compare June 21, 2024 13:05

C++: Simplify 'parseParens'.

9a2c7d3

geoffw0 reviewed Jun 26, 2024

C++: Mention in the QLDoc that you cannot instantiate templates in th…

d38ce61

…e signature string.

C++: Add tests of internal MaD predicates and accept test changes.

e845204

MathiasVP commented Jun 26, 2024

geoffw0 reviewed Jun 27, 2024

MathiasVP added 2 commits June 27, 2024 19:48

C++: Add bsl models.

2ad8704

C++: Accept test changes.

4535043
