-
Notifications
You must be signed in to change notification settings - Fork 1.8k
C++: Don't use Fields to define FieldContent
#20901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…plate instantiations.
|
I've looked at the three new results. They're extremely long paths (150+ steps) that all look plausible. There are a couple of plausible ways I could imagine this change having this impact (e.g., maybe this impacts the field-flow branch limit along certain paths), but I don't think it's worth digging too much into why. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors C++ dataflow analysis to use canonical representations of fields and unions instead of instantiated fields, addressing performance issues with template instantiations. The approach is similar to C#'s use of unbound declarations.
Key Changes:
- Introduces
CanonicalFieldandCanonicalUnionabstract classes with two implementations each: aNonLocalvariant (for most cases) and aLocalvariant (for local class definitions) - Updates
FieldContentandUnionContentto use canonical representations internally while still exposing actual fields through their public interfaces - Modifies content type definitions to work with canonical representations, improving performance by reducing fan-out from template instantiations
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll | Core implementation of canonical field/union representations and updated content types to use them |
| cpp/ql/test/library-tests/variables/variables/variable.expected | Updated test expectations reflecting that fields are now classified as NonLocalCanonicalField instead of Field |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Can you explain where we can a "large fan-out"? |
Yes. This is also back-linked in the internal issue now. I had forgotten to add a link - sorry about that! The problem is that, in MaD, when we specify that a function writes to a field named e.g., In the back-linked issue I mention an alternative approach which involves making the MaD summary more precise by specifying that it can be the |
jketema
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and comments below
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
| bindingset[f] | ||
| pragma[inline_late] | ||
| private int getFieldSize(Field f) { result = f.getType().getSize() } | ||
| private int getFieldSize(CanonicalField f) { result = max(f.getAType().getSize()) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this effectively give us an over-approximation? What is the impact on FNs/FPs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field size is used as an optimization when implementing field-flow through unions (this was described in this PR).
The idea is that a write to one field of a union should obviously flow to another read of the union even though it's reading another field. However, naively doing this leads to performance problems. So instead, we "partition" the union up into chunks indexed by the size of the field so that, if you write to a field and then read from another field of the same size you'll get flow. So I suppose, we could now miss flow in something like:
template<typename T>
union U {
int x;
T y;
};
void test() {
U<int> u_int;
U<VeryLargeStruct> u_very_large; // assume sizeof(VeryLargeStruct) > sizeof(int)
u_int.x = source();
sink(u_int.y);
}on main there will be 1 "partition" that makes up the UnionContent for U<int>: the Union<int> with field size sizeof(int). However, on this branch there will be two partitions of the UnionContent: The Union<T> with sizeof(int), and the Union<T> with sizeof(VeryLargeStruct) (because sizeof(VeryLargeStruct) is the largest byte size of the y field across all instantiations). So now a write to (Union<T>, sizeof(int)) will not flow to a read of (Union<T>, sizeof(VeryLargeStruct)).
I've added a test demonstrating the missing flow in 2024f32. We could have a similar case of new FPs coming from missing a field clearing by using a similar test construction.
| // the indirection index for field content starts at 1 (because `TNonUnionContent` is thought of as | ||
| // the address of the field, `FieldAddress` in the IR). | ||
| indirectionIndex = [1 .. SsaImpl::getMaxIndirectionsForType(f.getUnspecifiedType())] and | ||
| indirectionIndex = [1 .. max(SsaImpl::getMaxIndirectionsForType(f.getAnUnspecifiedType()))] and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this effectively give us an over-approximation? What is the impact on FNs/FPs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not have an impact on results. The only effect this should have is that all the indirections for a given instantiated field is available when we need them in readStep and storeStep.
jketema
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
In C/C++ we use
Fieldto defineDataFlow::FieldContent. These fields can come from template instantiations which can result in large fan-out when we add a MaD summary of a function that writes a field that exists in many template instantiations (I'm looking at youstd::pair!)C# fixes this by using unbound declarations
codeql/csharp/ql/lib/semmle/code/csharp/dataflow/internal/DataFlowPrivate.qll
Line 1188 in 14f9997
However, we don't have the luxury in C/C++ of having an easy way to relate a field from a template to the corresponding field of an instantiation in all the cases we care about. So for the cases where this is possible, we now switch to tracking
FieldContentusing the field from the uninstantiated class (and similarly forUnionContent). For the cases where we cannot map between uninstantiated fields and instantiated fields we continue to use the instantiated field.I've locally tested that this solves the performance problems we have had with adding flow summaries for associative containers.