-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++: Add support for Element
content
#16791
base: main
Are you sure you want to change the base?
Conversation
DCA shows nothing spectacular:
|
da2b14b
to
c61aa93
Compare
… rows correctly. This avoids a bad join in a compiler-generated predicate.
c61aa93
to
2e74ae4
Compare
I don't really understand the motivation for |
In general we don't infer a The TLDR of your comment is that we cannot assume in general that So the |
Thanks, that makes sense. I'll find time to review the code changes at a later point, please ping me if anything is waiting on this. |
Sounds good. Thanks! I'm not blocked on getting this merged so feel fee to take all the time you need 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review. There's a lot going on here, so lots of comments, sorry.
I've reviewed most of the QL changes, skimmed some parts though, and looked at the DCA run as well. I have not yet reviewed the new models or the impact on tests.
I don't agree with the no-change-note-required
tag. I think we need to explain the changes to the MAD syntax to users.
* This should be equal to the largest number of stars (i.e., `*`s) in any | ||
* `Element` content across all of our MaD summaries, sources, and sinks. | ||
*/ | ||
int getMaxElementContentIndirectionIndex() { result = 5 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning to replace this with something better? If so, what could it look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be the maximum number of indirections in the database. i.e,. something like:
max(any(IndirectOperand n, int indirection | n.hasOperandAndIndirectionIndex(_, indirection) | indirection))
private predicate parseAngles( | ||
string s, string beforeAngles, string betweenAngles, string afterAngles | ||
) { | ||
beforeAngles = s.regexpCapture("([^<]+)(?:<([^>]+)>(.*))?", 1) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about how we handle types with multiple angle brackets. For example, I think vector<foo<T>>
will be interpreted as:
beforeAngles = vector
betweenAngles = foo<T
afterAngles = >
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's part of the syntax that I didn't spell out in the comments. The following isn't a valid signature:
(const vector<T> &)
for exactly the reason you've written it: It wouldn't be easy to parse. So the signature you specify in the MaD row cannot include templates. Instead, you'll have to write it as:
(const vector &)
I'll add a comment at the appropriate place for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, you can see a use of such a signature in the copy constructor for vector here: https://github.com/github/codeql/pull/16791/files#diff-4ac8e1a72ed803d7de0e342cfe1bbc7cd0cd05c6618ed33a837c5c0d6a1ea78eR45 (note the missing angle brackets around vector
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment at the appropriate place for this.
Done in d38ce61
tp = templateFunction.getTemplateArgument(remaining) and | ||
result = mid.replaceAll(tp.getName(), "func:" + remaining.toString()) | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd like to see test cases for this and/or related predicates. As it is, it's difficult to be confident they're really working correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. They're being tested through our existing taint flow tests, but I can add unit tests for it if that will make you more comfortable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e845204 adds test for this. I hope that's the kind of thing you were thinking of? If not, please let me know and I'll be happy to do something else!
That's fair. I wanted to delay the change note until we were sure this was the format we wanted. Note that the syntax is backwards compatible with everything we had before. Once we're happy with this syntax I'll amend our existing documentation and add a proper change note (which may happen in a subsequent PR if that's okay with you). |
…e signature string.
No need to apologise for having many comments 😄 |
@@ -1,4 +1,3 @@ | |||
extensions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This was just a drive-by fix. We had two top-level extensions
keys.
@@ -1,5 +1,37 @@ | |||
astTypeBugs | |||
irTypeBugs | |||
| ../../../include/iterator.h:21:3:21:10 | ../../../include/iterator.h:21:3:21:10 | ../../../include/iterator.h:21:3:21:10 | [summary param] *0 in iterator | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally sure what's going on with these tests. It may be related to how we model iterators in dataflow (i.e., we have special code to recognize *it = x
as being a write to the container that created the iterator it
). I don't think we need to fix this in this PR.
output.isQualifierObject() | ||
} | ||
|
||
override predicate isPartialWrite(FunctionOutput output) { output.isQualifierObject() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the information from isPartialWrite
in the old models encoded in the new models somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That happens automatically due to the way field flow works (the same is true for all CodeQL languages). Consider an example such as:
1. std::vector<int> v;
2. v.push_back(source());
3. v.push_back(0);
4. sink(v.at(0));
on main
the modeledFlowBarrier predicate excludes cases where isPartialWrite
holds. However, now that this is tracked via field flow, the dataflow graph looks like:
graph TD;
A["source()"]-->|storeStep| B["v [post update] on line 2"];
B-->|SSA| C["v on line 3"];
C-->|SSA| D["v on line 4"];
E["0"] -->|storeStep| G["v [post update] on line 3"];
G -->|SSA| D;
D -->|readStep| F["v.at(0)"];
and this happens completely automatically via the existing interaction between SSA and field flow.
pack: codeql/cpp-all | ||
extensible: summaryModel | ||
data: # namespace, type, subtypes, name, signature, ext, input, output, kind, provenance | ||
- ["std", "array", True, "at", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are all the models using Element[*]
(or Element[*@]
) and not just Element[]
(or Element[@]
)? Surely array.at(0)
contains elements, not element references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both vector::at
and vector::operator[]
return references to their elements: https://en.cppreference.com/w/cpp/container/vector/at
This is why you can do stuff like:
std::vector<int> v{10};
v.at(0) = 42;
v[1] = 43;
- ["std", "array", True, "data", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"] | ||
- ["std", "array", True, "operator[]", "", "", "Argument[-1].Element[*@]", "ReturnValue[*@]", "value", "manual"] | ||
- ["std", "array", True, "rbegin", "", "", "Argument[-1].Element[*@]", "ReturnValue.Element[*@]", "value", "manual"] | ||
- ["std", "array", True, "rcbegin", "", "", "Argument[-1].Element[*@]", "ReturnValue.Element[*@]", "value", "manual"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably the last thing you should fix (after other changes), but all of the std
models included bsl
equivalents as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I've added these in 2ad8704. I didn't find any bsl
version of iterators, so I've left those out for now. These are just straightforward copies of the std
ones with std
replaced by bsl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another DCA run after 2ad8704, and results are unchanged
That sounds sensible. There's also the information at the top of |
This PR adds support for
Element
content for C++ like we have for other CodeQL supported languages. Given code such as:we now track the flow into
v
by a store step that pushesElement
onto the access path and transfers flow tov
. Then, once we reachv.at(i)
we popElement
off the access path and transfer flow to the return value. This means we now track container flow as data flow instead of taint flow.This is all very standard stuff for other languages by now (Tom did this for C# almost four years ago), but there are (of course 😂) a few challenges for C++. Some of these I've resolved in this PR, and I'll create follow-up issues for the rest.
Commit-by-commit review recommended.
Signatures for templates
How do we distinguish between these two constructors in a MaD summary (from here):
explicit vector(size_type count);
explicit vector(size_type count, const T& value = T(), const Allocator& alloc);
the obvious answer in MaD is to distinguish them by providing a signature. However, the signature for the latter constructor mentions the parameter
T
whose parameter name is given by a template (i.e., the enclosing class' template list). And we can't be sure that this template is namedT
across all implementations of the STL.Likewise, consider this constructor:
whose parameters are given by the template list of the function itself.
To solve the above problem, I've introduced MaD syntax for introducing template parameters when specifying the class name and the function name. Staying true to the spirit of C++, I've chosen this syntax:
["std", "vector<T,Allocator>", _, "vector<InputIterator>", "(InputIterator,InputIterator,const Allocator &)", "", _, _, _, _]
As you can see, the signature mentions both
Allocator
(from the class template), andInputIterator
(from the function template).This ensures that this MaD row will match the right constructor no matter what the implementation picked as template parameter names.
Internally, the name
(InputIterator,InputIterator,const Allocator &)
is normalized to a name that doesn't mention the template parameters before being compared with a normalized version of the parameter list in the database.Handling indirections
The obvious summary for a function like push_back is:
Argument[*0] -> Argument[-1].Element[*]
.That is, it reads the value pointed to by the 0'th argument (because
push_back
takes a parameter by reference), and stores it as anElement
content on the qualifier.However, consider this example:
In this case, we track
*p
(i.e., the indirection ofp
) when we performpush_back
, which means that the value that actually flows intopush_back
is the address of*p
. That is, the node reresenting**p
is actually what ends up flowing to the 0'th argument ofpush_back
.To solve this, we could introduce another summary for
push_back
:Argument[**0] -> Argument[-1].Element[**]
.However, that introduces quite a burden on the developer when adding MaD rows. For example, if we need to add an additional indirection to every model we need to introduce another MaD row to every function.
To solve this problem I introduced a kind of "template" (no pun intended) for specifying an arbitrary number of indirections. So the summary for
push_back
actually looks like:Argument[*@0] -> Argument[-1].Element[*@]
where@
means (any number of indirections).When consuming MaD rows we then expand the summaries to:
Argument[*0] -> Argument[-1].Element[*]
Argument[**0] -> Argument[-1].Element[**]
Argument[***0] -> Argument[-1].Element[***]
Argument[****0] -> Argument[-1].Element[****]
Argument[*****0] -> Argument[-1].Element[*****]
And I've capped the number to
5
. Dataflow performance wise there should be no concerns regarding pushing this further up, but it does mean we produce a lot more strings. We could also consider capping this to"the maximum number of indirections in the database". I'll let this be future work, though 😂
I'm not super happy with the syntax I've chosen here, so I'm happy to bikeshed on this once we're happy with everything else in this PR.
Supporting associative containers
Initially, I wanted to also replace all of our taint models for
std::map
with MaD rows. However, I ran into a performance problem. To see what happens, consider the hypothetical MaD summary for std::map::insert:(ignoring
@
s for this example since it's not relevant for this discussion):Argument[*0] -> ReturnValue.Field[first].Element[*]
.That is, the input is the pointer to the first argument (since insert takes a const reference), and returns a std::pair whose first element is an iterator to the inserted element.
Now: Which exact field does
Field[first]
refer to? Sincestd::pair<T, U>
is a template there will be many manystd::pair<T, U>
instantiations in the database, and the above MaD row will create a store step that writes any of them to the access path. On a DB I was testing with there were ~1500 instantiations ofstd::pair<T, U>
(i.e., ~1500 combinations ofT
andU
) which made the flow summary library really unhappy.To solve this problem we need to extend the MaD language so that we can annotate which
first
field we're talking about (i.e., it's thefirst
field of thestd::pair
class whose template parameters areT = the template argument from the enclosing std::map
andU = bool
). I haven't done this yet, though.To "solve" this performance problem I've simply chosen not to model any associative containers for now. However, anyone who ventures into doing this will hit the same performance problem. So we need to resolve this at some point (but not necessarily in this PR since this is large enough as-is).