Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Add support for Element content #16791

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

MathiasVP
Copy link
Contributor

This PR adds support for Element content for C++ like we have for other CodeQL supported languages. Given code such as:

std::vector<int> v;
int x = source();
v.push_back(x);
// ...
int y = v.at(i);
sink(y);

we now track the flow into v by a store step that pushes Element onto the access path and transfers flow to v. Then, once we reach v.at(i) we pop Element off the access path and transfer flow to the return value. This means we now track container flow as data flow instead of taint flow.

This is all very standard stuff for other languages by now (Tom did this for C# almost four years ago), but there are (of course 😂) a few challenges for C++. Some of these I've resolved in this PR, and I'll create follow-up issues for the rest.

Commit-by-commit review recommended.

Signatures for templates

How do we distinguish between these two constructors in a MaD summary (from here):

  • explicit vector(size_type count);
  • explicit vector(size_type count, const T& value = T(), const Allocator& alloc);

the obvious answer in MaD is to distinguish them by providing a signature. However, the signature for the latter constructor mentions the parameter T whose parameter name is given by a template (i.e., the enclosing class' template list). And we can't be sure that this template is named T across all implementations of the STL.

Likewise, consider this constructor:

template< class InputIt > vector( InputIt first, InputIt last, const Allocator& alloc);

whose parameters are given by the template list of the function itself.

To solve the above problem, I've introduced MaD syntax for introducing template parameters when specifying the class name and the function name. Staying true to the spirit of C++, I've chosen this syntax:

["std", "vector<T,Allocator>", _, "vector<InputIterator>", "(InputIterator,InputIterator,const Allocator &)", "", _, _, _, _]

As you can see, the signature mentions both Allocator (from the class template), and InputIterator (from the function template).

This ensures that this MaD row will match the right constructor no matter what the implementation picked as template parameter names.

Internally, the name (InputIterator,InputIterator,const Allocator &) is normalized to a name that doesn't mention the template parameters before being compared with a normalized version of the parameter list in the database.

Handling indirections

The obvious summary for a function like push_back is: Argument[*0] -> Argument[-1].Element[*].

That is, it reads the value pointed to by the 0'th argument (because push_back takes a parameter by reference), and stores it as an Element content on the qualifier.

However, consider this example:

std::vector<int*> v;
int x = source();
int* p = &x;
v.push_back(p);
...
int* q = v.at(i);
sink(*q);

In this case, we track *p (i.e., the indirection of p) when we perform push_back, which means that the value that actually flows into push_back is the address of *p. That is, the node reresenting **p is actually what ends up flowing to the 0'th argument of push_back.

To solve this, we could introduce another summary for push_back: Argument[**0] -> Argument[-1].Element[**].

However, that introduces quite a burden on the developer when adding MaD rows. For example, if we need to add an additional indirection to every model we need to introduce another MaD row to every function.

To solve this problem I introduced a kind of "template" (no pun intended) for specifying an arbitrary number of indirections. So the summary for push_back actually looks like: Argument[*@0] -> Argument[-1].Element[*@] where @ means (any number of indirections).

When consuming MaD rows we then expand the summaries to:

  • Argument[*0] -> Argument[-1].Element[*]
  • Argument[**0] -> Argument[-1].Element[**]
  • Argument[***0] -> Argument[-1].Element[***]
  • Argument[****0] -> Argument[-1].Element[****]
  • Argument[*****0] -> Argument[-1].Element[*****]

And I've capped the number to 5. Dataflow performance wise there should be no concerns regarding pushing this further up, but it does mean we produce a lot more strings. We could also consider capping this to
"the maximum number of indirections in the database". I'll let this be future work, though 😂

I'm not super happy with the syntax I've chosen here, so I'm happy to bikeshed on this once we're happy with everything else in this PR.

Supporting associative containers

Initially, I wanted to also replace all of our taint models for std::map with MaD rows. However, I ran into a performance problem. To see what happens, consider the hypothetical MaD summary for std::map::insert:

std::pair<iterator, bool> insert( const value_type& value );

(ignoring @s for this example since it's not relevant for this discussion): Argument[*0] -> ReturnValue.Field[first].Element[*].

That is, the input is the pointer to the first argument (since insert takes a const reference), and returns a std::pair whose first element is an iterator to the inserted element.

Now: Which exact field does Field[first] refer to? Since std::pair<T, U> is a template there will be many many std::pair<T, U> instantiations in the database, and the above MaD row will create a store step that writes any of them to the access path. On a DB I was testing with there were ~1500 instantiations of std::pair<T, U> (i.e., ~1500 combinations of T and U) which made the flow summary library really unhappy.

To solve this problem we need to extend the MaD language so that we can annotate which first field we're talking about (i.e., it's the first field of the std::pair class whose template parameters are T = the template argument from the enclosing std::map and U = bool). I haven't done this yet, though.

To "solve" this performance problem I've simply chosen not to model any associative containers for now. However, anyone who ventures into doing this will hit the same performance problem. So we need to resolve this at some point (but not necessarily in this PR since this is large enough as-is).

@MathiasVP MathiasVP requested a review from a team as a code owner June 19, 2024 16:58
@github-actions github-actions bot added the C++ label Jun 19, 2024
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Jun 19, 2024
@MathiasVP
Copy link
Contributor Author

MathiasVP commented Jun 19, 2024

DCA shows nothing spectacular:

  • Performance is fine (this was my main worry 🎉)
    - One lost result which has to do with iterators still not being fully converted to use Element content. I expect this result to appear once we model iterators fully with Element content. Fixed by 40fb59d
  • Two new FPs on cpp/double-free. These appear because Element content allows us to have dataflow through containers now (instead of simply having it be taintflow), so we need to restrict container flow in cpp/double-free similarly to how we restrict flow through ArrayExpr in that query.

@geoffw0
Copy link
Contributor

geoffw0 commented Jun 24, 2024

I don't really understand the motivation for @. If we're not propagating indirect data properly through models, isn't that an issue with data-flow itself (or the indirections feature) - not an issue with the MAD representation?

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Jun 24, 2024

I don't really understand the motivation for @. If we're not propagating indirect data properly through models, isn't that an issue with data-flow itself (or the indirections feature) - not an issue with the MAD representation?

In general we don't infer a *n1 -> *n2 step when a model specifies a n1 -> n2 step. We actually used to do this (see this PR), but pulled it out pretty quickly because of your excellent comment on the PR 😂

The TLDR of your comment is that we cannot assume in general that *n1 -> *n2 holds just because a model gives us n1 -> n2. Your char *stringToUpper(char *source) example is one where we cannot assume this.

So the @ notation is a way to do what Jonas suggested here (as a response to your observation).

@geoffw0
Copy link
Contributor

geoffw0 commented Jun 24, 2024

Thanks, that makes sense. I'll find time to review the code changes at a later point, please ping me if anything is waiting on this.

@MathiasVP
Copy link
Contributor Author

Sounds good. Thanks! I'm not blocked on getting this merged so feel fee to take all the time you need 😄

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review. There's a lot going on here, so lots of comments, sorry.

I've reviewed most of the QL changes, skimmed some parts though, and looked at the DCA run as well. I have not yet reviewed the new models or the impact on tests.

I don't agree with the no-change-note-required tag. I think we need to explain the changes to the MAD syntax to users.

* This should be equal to the largest number of stars (i.e., `*`s) in any
* `Element` content across all of our MaD summaries, sources, and sinks.
*/
int getMaxElementContentIndirectionIndex() { result = 5 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning to replace this with something better? If so, what could it look like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be the maximum number of indirections in the database. i.e,. something like:

max(any(IndirectOperand n, int indirection | n.hasOperandAndIndirectionIndex(_, indirection) | indirection))

private predicate parseAngles(
string s, string beforeAngles, string betweenAngles, string afterAngles
) {
beforeAngles = s.regexpCapture("([^<]+)(?:<([^>]+)>(.*))?", 1) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about how we handle types with multiple angle brackets. For example, I think vector<foo<T>> will be interpreted as:

beforeAngles = vector
betweenAngles = foo<T
afterAngles = >

Copy link
Contributor Author

@MathiasVP MathiasVP Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's part of the syntax that I didn't spell out in the comments. The following isn't a valid signature:

(const vector<T> &)

for exactly the reason you've written it: It wouldn't be easy to parse. So the signature you specify in the MaD row cannot include templates. Instead, you'll have to write it as:

(const vector &)

I'll add a comment at the appropriate place for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, you can see a use of such a signature in the copy constructor for vector here: https://github.com/github/codeql/pull/16791/files#diff-4ac8e1a72ed803d7de0e342cfe1bbc7cd0cd05c6618ed33a837c5c0d6a1ea78eR45 (note the missing angle brackets around vector).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment at the appropriate place for this.

Done in d38ce61

tp = templateFunction.getTemplateArgument(remaining) and
result = mid.replaceAll(tp.getName(), "func:" + remaining.toString())
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to see test cases for this and/or related predicates. As it is, it's difficult to be confident they're really working correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. They're being tested through our existing taint flow tests, but I can add unit tests for it if that will make you more comfortable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e845204 adds test for this. I hope that's the kind of thing you were thinking of? If not, please let me know and I'll be happy to do something else!

@MathiasVP
Copy link
Contributor Author

I don't agree with the no-change-note-required tag. I think we need to explain the changes to the MAD syntax to users.

That's fair. I wanted to delay the change note until we were sure this was the format we wanted. Note that the syntax is backwards compatible with everything we had before. Once we're happy with this syntax I'll amend our existing documentation and add a proper change note (which may happen in a subsequent PR if that's okay with you).

@MathiasVP
Copy link
Contributor Author

Initial review. There's a lot going on here, so lots of comments, sorry.

No need to apologise for having many comments 😄

@@ -1,4 +1,3 @@
extensions:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This was just a drive-by fix. We had two top-level extensions keys.

@@ -1,5 +1,37 @@
astTypeBugs
irTypeBugs
| ../../../include/iterator.h:21:3:21:10 | ../../../include/iterator.h:21:3:21:10 | ../../../include/iterator.h:21:3:21:10 | [summary param] *0 in iterator |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure what's going on with these tests. It may be related to how we model iterators in dataflow (i.e., we have special code to recognize *it = x as being a write to the container that created the iterator it). I don't think we need to fix this in this PR.

output.isQualifierObject()
}

override predicate isPartialWrite(FunctionOutput output) { output.isQualifierObject() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the information from isPartialWrite in the old models encoded in the new models somewhere?

Copy link
Contributor Author

@MathiasVP MathiasVP Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That happens automatically due to the way field flow works (the same is true for all CodeQL languages). Consider an example such as:

1. std::vector<int> v;
2. v.push_back(source());
3. v.push_back(0);
4. sink(v.at(0));

on main the modeledFlowBarrier predicate excludes cases where isPartialWrite holds. However, now that this is tracked via field flow, the dataflow graph looks like:

graph TD;
    A["source()"]-->|storeStep| B["v [post update] on line 2"];
    B-->|SSA| C["v on line 3"];
    C-->|SSA| D["v on line 4"];
    E["0"] -->|storeStep| G["v [post update] on line 3"];
    G -->|SSA| D;
    D -->|readStep| F["v.at(0)"];
Loading

and this happens completely automatically via the existing interaction between SSA and field flow.

pack: codeql/cpp-all
extensible: summaryModel
data: # namespace, type, subtypes, name, signature, ext, input, output, kind, provenance
- ["std", "array", True, "at", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all the models using Element[*] (or Element[*@]) and not just Element[] (or Element[@])? Surely array.at(0) contains elements, not element references?

Copy link
Contributor Author

@MathiasVP MathiasVP Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both vector::at and vector::operator[] return references to their elements: https://en.cppreference.com/w/cpp/container/vector/at

This is why you can do stuff like:

std::vector<int> v{10};
v.at(0) = 42;
v[1] = 43;

- ["std", "array", True, "data", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"]
- ["std", "array", True, "operator[]", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"]
- ["std", "array", True, "rbegin", "", "", "Argument[-1].Element[*@]", "ReturnValue.Element[*@]", "value", "manual"]
- ["std", "array", True, "rcbegin", "", "", "Argument[-1].Element[*@]", "ReturnValue.Element[*@]", "value", "manual"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the last thing you should fix (after other changes), but all of the std models included bsl equivalents as well.

Copy link
Contributor Author

@MathiasVP MathiasVP Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've added these in 2ad8704. I didn't find any bsl version of iterators, so I've left those out for now. These are just straightforward copies of the std ones with std replaced by bsl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another DCA run after 2ad8704, and results are unchanged

@geoffw0
Copy link
Contributor

geoffw0 commented Jun 27, 2024

I wanted to delay the change note until we were sure this was the format we wanted.

That sounds sensible. There's also the information at the top of ExternalFlow.qll that needs to be updated with the new syntax (Element and @).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants