@MathiasVP
Contributor

In C/C++ we use Field to define DataFlow::FieldContent. These fields can come from template instantiations, which can result in a large fan-out when we add a MaD (models-as-data) summary of a function that writes a field that exists in many template instantiations (I'm looking at you, std::pair!)

C# fixes this by using unbound declarations:

TFieldContent(Field f) { f.isUnboundDeclaration() } or

And that's the approach I'm going with here as well.

However, we don't have the luxury in C/C++ of having an easy way to relate a field from a template to the corresponding field of an instantiation in all the cases we care about. So for the cases where this is possible, we now switch to tracking FieldContent using the field from the uninstantiated class (and similarly for UnionContent). For the cases where we cannot map between uninstantiated fields and instantiated fields we continue to use the instantiated field.
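
As a rough illustration of that fallback (this is not the actual QL; `Field`, `templateField`, and `canonicalKey` are hypothetical names made up for this sketch), the choice of content key could be modelled in C++ roughly like this:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified model of a field (not the CodeQL `Field` class).
struct Field {
    std::string qualifiedName;                 // e.g. "std::pair<int, char>::first"
    std::optional<std::string> templateField;  // uninstantiated counterpart, when it can be found
};

// Key used to track content for `f`: the uninstantiated field when the
// mapping is known, otherwise the instantiated field itself.
std::string canonicalKey(const Field& f) {
    return f.templateField.value_or(f.qualifiedName);
}

// Usage: two instantiations share one key; an unmapped field keeps its own.
void canonicalKeyExample() {
    Field a{"std::pair<int, char>::first", "std::pair<T, U>::first"};
    Field b{"std::pair<long, bool>::first", "std::pair<T, U>::first"};
    Field c{"Unmapped<int>::x", std::nullopt};
    assert(canonicalKey(a) == canonicalKey(b));
    assert(canonicalKey(c) == "Unmapped<int>::x");
}
```

The point of the sketch is only that mapped fields collapse to one key while unmapped fields behave exactly as before.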

I've locally tested that this solves the performance problems we have had with adding flow summaries for associative containers.

@github-actions github-actions bot added the C++ label Nov 24, 2025
@MathiasVP
Contributor Author

I've looked at the three new results. They're extremely long paths (150+ steps) that all look plausible. There are a couple of plausible ways I could imagine this change having this impact (e.g., maybe this impacts the field-flow branch limit along certain paths), but I don't think it's worth digging too much into why.

@MathiasVP MathiasVP marked this pull request as ready for review November 24, 2025 14:56
@MathiasVP MathiasVP requested a review from a team as a code owner November 24, 2025 14:56
Copilot AI review requested due to automatic review settings November 24, 2025 14:56
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Nov 24, 2025
Copilot finished reviewing on behalf of MathiasVP November 24, 2025 14:59
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors C++ dataflow analysis to use canonical representations of fields and unions instead of instantiated fields, addressing performance issues with template instantiations. The approach is similar to C#'s use of unbound declarations.

Key Changes:

  • Introduces CanonicalField and CanonicalUnion abstract classes with two implementations each: a NonLocal variant (for most cases) and a Local variant (for local class definitions)
  • Updates FieldContent and UnionContent to use canonical representations internally while still exposing actual fields through their public interfaces
  • Modifies content type definitions to work with canonical representations, improving performance by reducing fan-out from template instantiations

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll | Core implementation of canonical field/union representations and updated content types to use them |
| cpp/ql/test/library-tests/variables/variables/variable.expected | Updated test expectations reflecting that fields are now classified as NonLocalCanonicalField instead of Field |

MathiasVP and others added 2 commits November 24, 2025 15:01
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@jketema
Contributor

jketema commented Nov 25, 2025

Can you explain where we can get a "large fan-out"?

@MathiasVP
Contributor Author

MathiasVP commented Nov 25, 2025

Can you explain where we can get a "large fan-out"?

Yes. This is also back-linked in the internal issue now. I had forgotten to add a link - sorry about that!

The problem is that when we specify in MaD that a function writes to a field named, e.g., first, this generates a store step that pushes a FieldContent (which is currently defined in terms of Fields) onto the dataflow access path. The question is which FieldContent should be pushed (i.e., which Field should be used in the storeStep), since all MaD has to go on is the name of the field. In the case of first, there can be thousands of instantiations of std::pair, and so there are thousands of structs with a field named first. MaD adds a store step for each of them, and this is what generates the fan-out.
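
To make the numbers concrete, here is a small C++ sketch of the counting argument. It is a toy model, not how the dataflow library is implemented: the `Field` struct, the generated instantiation names, and both counting functions are assumptions invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a field: its name, the instantiated class that
// declares it, and the template that class was instantiated from.
struct Field {
    std::string name;
    std::string declaringClass;     // e.g. "std::pair<T0, int>"
    std::string declaringTemplate;  // e.g. "std::pair<T, U>"
};

// Distinct contents matched by a name-only summary when content is keyed
// by the instantiated field (one store step per instantiation).
std::size_t contentsPerInstantiation(const std::vector<Field>& fields,
                                     const std::string& name) {
    std::set<std::string> contents;
    for (const Field& f : fields)
        if (f.name == name) contents.insert(f.declaringClass + "::" + f.name);
    return contents.size();
}

// Same count when content is keyed by the field of the uninstantiated class.
std::size_t contentsPerTemplate(const std::vector<Field>& fields,
                                const std::string& name) {
    std::set<std::string> contents;
    for (const Field& f : fields)
        if (f.name == name) contents.insert(f.declaringTemplate + "::" + f.name);
    return contents.size();
}

void fanOutExample() {
    std::vector<Field> fields;
    for (int i = 0; i < 1000; ++i) {
        std::string inst = "std::pair<T" + std::to_string(i) + ", int>";
        fields.push_back({"first", inst, "std::pair<T, U>"});
        fields.push_back({"second", inst, "std::pair<T, U>"});
    }
    assert(contentsPerInstantiation(fields, "first") == 1000);  // the fan-out
    assert(contentsPerTemplate(fields, "first") == 1);          // after canonicalization
}
```

With a thousand instantiations, the name-only lookup yields a thousand candidate contents under the old keying and a single one under the canonical keying.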

In the back-linked issue I mention an alternative approach which involves making the MaD summary more precise by specifying that it can be the first field of the struct pair when given template arguments <T, U>. However, the shared library makes this really hard since the MaD syntax has been parsed away at the point where we need to know these template instantiation arguments.

Contributor

@jketema jketema left a comment

Some questions and comments below

  bindingset[f]
  pragma[inline_late]
- private int getFieldSize(Field f) { result = f.getType().getSize() }
+ private int getFieldSize(CanonicalField f) { result = max(f.getAType().getSize()) }
Contributor

Does this effectively give us an over-approximation? What is the impact on FNs/FPs?

Contributor Author

The field size is used as an optimization when implementing field-flow through unions (this was described in this PR).

The idea is that a write to one field of a union should obviously flow to another read of the union, even if that read goes through a different field. However, naively doing this leads to performance problems. So instead, we "partition" the union into chunks indexed by the size of the field, so that if you write to a field and then read from another field of the same size, you'll get flow. So I suppose we could now miss flow in something like:

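// Declarations assumed here so the snippet compiles on its own.
int source();
void sink(int);
struct VeryLargeStruct { char data[1024]; };

template<typename T>
union U {
  int x;
  T y;
};

void test() {
  U<int> u_int;
  U<VeryLargeStruct> u_very_large; // assume sizeof(VeryLargeStruct) > sizeof(int)

  u_int.x = source();
  sink(u_int.y);
}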

On main there will be one "partition" that makes up the UnionContent for U<int>: U<int> with field size sizeof(int). However, on this branch there will be two partitions of the UnionContent: U<T> with sizeof(int), and U<T> with sizeof(VeryLargeStruct) (because sizeof(VeryLargeStruct) is the largest byte size of the y field across all instantiations). So now a write to (U<T>, sizeof(int)) will not flow to a read of (U<T>, sizeof(VeryLargeStruct)).
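
The partition arithmetic can be checked with a short C++ sketch. It only models the (union, field size) keys described above; the `Partition` alias, the helper functions, and the 1024-byte definition of `VeryLargeStruct` are assumptions made for the illustration, not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

struct VeryLargeStruct { char data[1024]; };  // assumed definition, larger than int

template <typename T>
union U {
    int x;
    T y;
};

// A partition is modelled as (union, field size in bytes).
using Partition = std::pair<std::string, std::size_t>;

// main: partitions come from the instantiated union U<int>.
Partition mainX() { return {"U<int>", sizeof(U<int>::x)}; }
Partition mainY() { return {"U<int>", sizeof(U<int>::y)}; }

// This branch: partitions come from U<T>, using the largest size of each
// field across the instantiations U<int> and U<VeryLargeStruct>.
Partition branchX() {
    return {"U<T>", std::max(sizeof(U<int>::x), sizeof(U<VeryLargeStruct>::x))};
}
Partition branchY() {
    return {"U<T>", std::max(sizeof(U<int>::y), sizeof(U<VeryLargeStruct>::y))};
}

// A write flows to a read only if both land in the same partition.
bool flows(const Partition& write, const Partition& read) { return write == read; }

void partitionExample() {
    assert(flows(mainX(), mainY()));       // flow found on main
    assert(!flows(branchX(), branchY()));  // flow missed on this branch
}
```

In this model the write to x and the read of y share a key on main, while on this branch the y key is widened by the U<VeryLargeStruct> instantiation and no longer matches.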

I've added a test demonstrating the missing flow in 2024f32. We could have a similar case of new FPs coming from missing a field clearing by using a similar test construction.

  // the indirection index for field content starts at 1 (because `TNonUnionContent` is thought of as
  // the address of the field, `FieldAddress` in the IR).
- indirectionIndex = [1 .. SsaImpl::getMaxIndirectionsForType(f.getUnspecifiedType())] and
+ indirectionIndex = [1 .. max(SsaImpl::getMaxIndirectionsForType(f.getAnUnspecifiedType()))] and
Contributor

Does this effectively give us an over-approximation? What is the impact on FNs/FPs?

Contributor Author

This should not have an impact on results. The only effect this should have is that all the indirections for a given instantiated field are available when we need them in readStep and storeStep.
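
A small C++ sketch of why taking the maximum only widens the range: every instantiation's own index range is contained in the canonical one. The `indirections` function is a made-up proxy (one for the field address plus one per pointer level) and is not claimed to match what SsaImpl::getMaxIndirectionsForType computes.

```cpp
#include <algorithm>
#include <cassert>
#include <type_traits>

// Made-up proxy for the number of indirections of a type: one for the field
// address plus one per pointer level. The real predicate may count differently.
template <typename T>
constexpr int indirections() {
    if constexpr (std::is_pointer_v<T>)
        return 1 + indirections<std::remove_pointer_t<T>>();
    else
        return 1;
}

// Upper bound of the index range for a canonical field: the maximum over the
// types the field has in its instantiations.
template <typename... FieldTypes>
constexpr int maxIndirections() {
    return std::max({indirections<FieldTypes>()...});
}

void indirectionExample() {
    // A field `T f` instantiated with T = int and T = int**.
    constexpr int upper = maxIndirections<int, int**>();
    static_assert(upper == 3);
    // Each instantiation's own range [1 .. n] fits inside [1 .. upper].
    assert(indirections<int>() <= upper);
    assert(indirections<int**>() <= upper);
}
```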

Contributor

@jketema jketema left a comment

LGTM

@MathiasVP MathiasVP merged commit 26e5320 into github:main Nov 25, 2025
17 checks passed
@github github deleted a comment from gthb00gt Nov 25, 2025