C++: Don't use `Field`s to define `FieldContent` #20901

MathiasVP · 2025-11-24T12:38:44Z

In C/C++ we use Field to define DataFlow::FieldContent. These fields can come from template instantiations, which can result in a large fan-out when we add a MaD summary for a function that writes a field that exists in many template instantiations (I'm looking at you, std::pair!).

C# fixes this by using unbound declarations:

codeql/csharp/ql/lib/semmle/code/csharp/dataflow/internal/DataFlowPrivate.qll

Line 1188 in 14f9997

TFieldContent(Field f) { f.isUnboundDeclaration() } or

That's the approach I'm going with here as well.

However, we don't have the luxury in C/C++ of having an easy way to relate a field from a template to the corresponding field of an instantiation in all the cases we care about. So for the cases where this is possible, we now switch to tracking FieldContent using the field from the uninstantiated class (and similarly for UnionContent). For the cases where we cannot map between uninstantiated fields and instantiated fields we continue to use the instantiated field.

I've locally tested that this solves the performance problems we have had with adding flow summaries for associative containers.

MathiasVP · 2025-11-24T14:56:39Z

I've looked at the three new results. They're extremely long paths (150+ steps) that all look plausible. There are a couple of plausible ways I could imagine this change having this impact (e.g., maybe this impacts the field-flow branch limit along certain paths), but I don't think it's worth digging too much into why.

Copilot

Pull request overview

This PR refactors C++ dataflow analysis to use canonical representations of fields and unions instead of instantiated fields, addressing performance issues with template instantiations. The approach is similar to C#'s use of unbound declarations.

Key Changes:

- Introduces CanonicalField and CanonicalUnion abstract classes with two implementations each: a NonLocal variant (for most cases) and a Local variant (for local class definitions)
- Updates FieldContent and UnionContent to use canonical representations internally while still exposing actual fields through their public interfaces
- Modifies content type definitions to work with canonical representations, improving performance by reducing fan-out from template instantiations

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll | Core implementation of canonical field/union representations and updated content types to use them |
| cpp/ql/test/library-tests/variables/variables/variable.expected | Updated test expectations reflecting that fields are now classified as `NonLocalCanonicalField` instead of `Field` |

jketema · 2025-11-25T10:30:04Z

Can you explain where we can get a "large fan-out"?

MathiasVP · 2025-11-25T10:40:53Z

> Can you explain where we can get a "large fan-out"?

Yes. This is also back-linked in the internal issue now. I had forgotten to add a link - sorry about that!

The problem is that when a MaD summary specifies that a function writes to a field named, say, first, it generates a store step that pushes a FieldContent (currently defined in terms of Fields) onto the dataflow access path. The question is which FieldContent should be pushed (i.e., which Field should be used in the storeStep), since all MaD has to go on is the name of the field. In the case of first, there can be thousands of instantiations of std::pair, and so there are thousands of structs with a field named first. MaD adds a store step for each of them, and this is what generates the fan-out.
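To make the fan-out concrete, here is a rough toy model in C++ (not QL, and not the actual implementation). All names in it (`FieldModel`, `storeStepsPerInstantiatedField`, `storeStepsPerTemplateField`, `kPairFields`) are hypothetical, and the three instantiations are made up; it only counts how many store steps a name-based lookup would produce under each representation.

```cpp
#include <cstddef>

// Hypothetical toy model; none of these names exist in the CodeQL libraries.
struct FieldModel {
    const char* name;    // field name, the only thing a MaD summary provides
    int instantiation;   // which template instantiation declares this field
    int templateField;   // id of the corresponding field in the uninstantiated template
};

// Compares two NUL-terminated strings; usable in constant expressions.
constexpr bool sameName(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) { ++a; ++b; }
    return *a == *b;
}

// Content keyed on instantiated fields: one store step per matching field.
template <std::size_t N>
constexpr std::size_t storeStepsPerInstantiatedField(const FieldModel (&fields)[N],
                                                     const char* name) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (sameName(fields[i].name, name)) ++count;
    }
    return count;
}

// Content keyed on template fields: one store step per distinct template field.
template <std::size_t N>
constexpr std::size_t storeStepsPerTemplateField(const FieldModel (&fields)[N],
                                                 const char* name) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (!sameName(fields[i].name, name)) continue;
        bool seenBefore = false;
        for (std::size_t j = 0; j < i; ++j) {
            if (sameName(fields[j].name, name) &&
                fields[j].templateField == fields[i].templateField) {
                seenBefore = true;
            }
        }
        if (!seenBefore) ++count;
    }
    return count;
}

// Three made-up instantiations of a pair-like template.
constexpr FieldModel kPairFields[] = {
    {"first", 0, 100}, {"second", 0, 101},
    {"first", 1, 100}, {"second", 1, 101},
    {"first", 2, 100}, {"second", 2, 101},
};
```

With this data, `storeStepsPerInstantiatedField(kPairFields, "first")` yields 3 (one per instantiation), while `storeStepsPerTemplateField(kPairFields, "first")` yields 1. With thousands of instantiations the first number grows with them and the second stays at 1.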

In the back-linked issue I mention an alternative approach which involves making the MaD summary more precise by specifying that it can be the first field of the struct pair when given template arguments <T, U>. However, the shared library makes this really hard since the MaD syntax has been parsed away at the point where we need to know these template instantiation arguments.

jketema

Some questions and comments below

jketema · 2025-11-25T10:59:32Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

 bindingset[f]
 pragma[inline_late]
-private int getFieldSize(Field f) { result = f.getType().getSize() }
+private int getFieldSize(CanonicalField f) { result = max(f.getAType().getSize()) }


Does this effectively give us an over-approximation? What is the impact on FNs/FPs?

MathiasVP · 2025-11-25

The field size is used as an optimization when implementing field-flow through unions (this was described in this PR).

The idea is that a write to one field of a union should obviously flow to another read of the union even though it's reading another field. However, naively doing this leads to performance problems. So instead, we "partition" the union up into chunks indexed by the size of the field so that, if you write to a field and then read from another field of the same size you'll get flow. So I suppose, we could now miss flow in something like:

template<typename T>
union U { int x; T y; };

void test() {
  U<int> u_int;
  U<VeryLargeStruct> u_very_large; // assume sizeof(VeryLargeStruct) > sizeof(int)
  u_int.x = source();
  sink(u_int.y);
}

On main there will be one "partition" that makes up the UnionContent for U<int>: U<int> with field size sizeof(int). On this branch, however, there will be two partitions of the UnionContent: U<T> with sizeof(int), and U<T> with sizeof(VeryLargeStruct) (because sizeof(VeryLargeStruct) is the largest byte size of the y field across all instantiations). So a write to (U<T>, sizeof(int)) will no longer flow to a read of (U<T>, sizeof(VeryLargeStruct)).

I've added a test demonstrating the missing flow in 2024f32. A similar test construction could produce new FPs, caused by a missed field clearing rather than a missed flow step.
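The partitioning argument can be sketched as a small toy model in C++. This is only an illustration of the reasoning in this comment, not the QL implementation: the names (`UnionFieldModel`, `maxSize`, `flowsOnMain`, `flowsOnBranch`) are hypothetical, and the sizes 4 and 64 are assumed stand-ins for sizeof(int) and sizeof(VeryLargeStruct).

```cpp
#include <cstddef>

// Hypothetical toy model of the union partitioning; not CodeQL library code.
struct UnionFieldModel {
    int instantiation;   // 0 = U<int>, 1 = U<VeryLargeStruct>
    int templateField;   // 0 = x, 1 = y in the uninstantiated U<T>
    int size;            // assumed size in bytes of the field's type in this instantiation
};

constexpr UnionFieldModel kUnionFields[] = {
    {0, 0, 4},   // U<int>::x
    {0, 1, 4},   // U<int>::y
    {1, 0, 4},   // U<VeryLargeStruct>::x
    {1, 1, 64},  // U<VeryLargeStruct>::y
};

// Largest size of a template field across all modelled instantiations.
template <std::size_t N>
constexpr int maxSize(const UnionFieldModel (&fields)[N], int templateField) {
    int result = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (fields[i].templateField == templateField && fields[i].size > result) {
            result = fields[i].size;
        }
    }
    return result;
}

// main: the partition key is (instantiated union, size of the instantiated field).
constexpr bool flowsOnMain(const UnionFieldModel& write, const UnionFieldModel& read) {
    return write.instantiation == read.instantiation && write.size == read.size;
}

// This branch: the partition key is (uninstantiated union, max size across instantiations).
// Only one union template is modelled, so the union component always matches.
template <std::size_t N>
constexpr bool flowsOnBranch(const UnionFieldModel (&fields)[N],
                             const UnionFieldModel& write,
                             const UnionFieldModel& read) {
    return maxSize(fields, write.templateField) == maxSize(fields, read.templateField);
}
```

For the write to `u_int.x` and the read of `u_int.y`, `flowsOnMain(kUnionFields[0], kUnionFields[1])` is true because both fields have size 4 in U<int>, whereas `flowsOnBranch(kUnionFields, kUnionFields[0], kUnionFields[1])` is false because the maximum size of y across instantiations is 64.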

jketema · 2025-11-25T11:00:33Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

    // the indirection index for field content starts at 1 (because `TNonUnionContent` is thought of as
    // the address of the field, `FieldAddress` in the IR).
-    indirectionIndex = [1 .. SsaImpl::getMaxIndirectionsForType(f.getUnspecifiedType())] and
+    indirectionIndex = [1 .. max(SsaImpl::getMaxIndirectionsForType(f.getAnUnspecifiedType()))] and


Does this effectively give us an over-approximation? What is the impact on FNs/FPs?

MathiasVP · 2025-11-25

This should not have an impact on results. The only effect should be that all the indirections for a given instantiated field are available when we need them in readStep and storeStep.

jketema

LGTM

C++: Represent field content using a column that is shared by all template instantiations.

ecb80cb

github-actions bot added the C++ label Nov 24, 2025

C++: Accept test changes from tests that use getAQlClass.

0487e06

MathiasVP marked this pull request as ready for review November 24, 2025 14:56

MathiasVP requested a review from a team as a code owner November 24, 2025 14:56

Copilot AI review requested due to automatic review settings November 24, 2025 14:56

MathiasVP added the no-change-note-required This PR does not need a change note label Nov 24, 2025

Copilot started reviewing on behalf of MathiasVP November 24, 2025 14:57

Copilot finished reviewing on behalf of MathiasVP November 24, 2025 14:59

Copilot AI reviewed Nov 24, 2025

MathiasVP and others added 2 commits November 24, 2025 15:01

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

2e53370

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

eb6b085

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jketema reviewed Nov 25, 2025

MathiasVP and others added 3 commits November 25, 2025 12:06

C++: Respond to review comments.

47ab307

C++: Add an example with missing flow.

2024f32

Merge branch 'main' into canonical-content

861ca75

jketema approved these changes Nov 25, 2025

MathiasVP merged commit 26e5320 into github:main Nov 25, 2025
17 checks passed

github deleted a comment from gthb00gt Nov 25, 2025
