C++: Implement models-as-data #15371
…sing models-as-data.
Some initial comments. Looking good so far! I think starting simply with sources and sinks was a great idea as it avoids you having to deal with a large part of the dataflow library right away!
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowPrivate.qll
cpp/ql/lib/semmle/code/cpp/dataflow/internal/FlowSummaryImpl.qll
Thanks for the early review @MathiasVP. It's going to take me a bit of time to address everything, along with the known issues in this PR.
I’ve fixed … We do, however, introduce a new bodge in …
…es until we get to the SSA implementations of them.
… not in the database.
Just added four commits. Broadly:
@MathiasVP I'd like your opinion on the last commit in particular, which modifies …
    indirectionIndex =
      [0 .. max(Ssa::Function f |
          |
          Ssa::getMaxIndirectionsForType(f.getUnspecifiedType()) - 1 // -1 because a returned value is a prvalue not a glvalue
        )]
So instead of relying on the presence of a return value instruction we're now relying on the function's return type? I can see how that also works, but I have some concerns:
- This makes dataflow more dependent on the surface syntax in a way that it wasn't before. I'd prefer to keep dataflow completely unaware of the C/C++ AST (and with a very limited set of cases it is) since it means we can freely change AST classes in the future, and as long as the IR stays correct dataflow will do the right thing.
  An example of where this may bite us: in the IR we synthesize function bodies for some things that are not `Function`s. In particular, we do so for the initializer of a global variable and for `static` local variables.
- Eventually, I'd like to make the construction of indirection nodes smarter to handle things like:

      unsigned long get_data_as_long() {
        char* my_data = source(); // *my_data is tainted
        unsigned long my_data_as_an_unsigned_long = (unsigned long)my_data;
        // *my_data_as_an_unsigned_long should be tainted, but unsigned longs
        // have no indirection nodes. So flow is lost!
        return my_data_as_an_unsigned_long;
      }

      unsigned long data = get_data_as_long();
      char* my_data2 = (char*)data;
      sink(my_data2);

  The solution to the above will be to be smarter about how many dataflow nodes we allocate (see this internal link for more information on this).
  And since the current logic in the `TNormalReturnKind` branch only talks about the presence of an indirection operand, that logic will continue to work once we allocate indirection nodes in a smarter way. In contrast, the change here hardwires that we create return kinds based on the return type of a function.
Instead, I think a better solution would be:
- Keep the existing logic
- Replace the `indirectionIndex = 0 // TODO: very much a bodge so that it works on the test that has no return statements` disjunct with a call into the flow summary library (i.e., the library you're filling in in this PR) that finds the set of indirections necessary for the modelled functions that exist as MaD rows. That is, we should have a predicate in `FlowSummaryImpl` that reads something like:

      /**
       * Gets the maximum number of indirections that can be returned by the function
       * modelled using the MaD row `package;type;subtypes;name;signature;ext`.
       */
      int maxIndirectionForModelledFunction(string package, string type, boolean subtypes, string name, string signature, string ext) {
        exists(interpretElement(package, type, subtypes, name, signature, ext)) and
        result = /* Extract the number of stars in the return type specified by signature */
      }

and then we could call this predicate instead of hardcoding the `indirectionIndex = 0` case, i.e., we'd do something like:

    indirectionIndex = maxIndirectionForModelledFunction(_, _, _, _, _, _)

This would ensure that we have `ReturnKind`s for all the MaD rows we have, and we keep the nice property that we don't depend on the C++ surface AST for most of dataflow.
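The "extract the number of stars" step left as a placeholder above could look roughly like the following. This is only an illustrative C++ sketch, not the QL that ended up in the PR: the helper name `countReturnIndirections` is hypothetical, and it assumes the return type is available as plain text in which every `*` is a pointer declarator (function pointer types and references are not handled).

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper (not part of CodeQL): counts the pointer indirections
// in a return type written as plain text.
// Assumption: every '*' in the string is a pointer declarator.
int countReturnIndirections(const std::string& returnType) {
    return static_cast<int>(
        std::count(returnType.begin(), returnType.end(), '*'));
}

// Small usage example for the helper above.
void exampleCountReturnIndirections() {
    assert(countReturnIndirections("int") == 0);
    assert(countReturnIndirections("const char *") == 1);
    assert(countReturnIndirections("char **") == 2);
}
```

Taking the maximum of this count over all modelled functions would give the upper bound that the suggested predicate is meant to provide.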
Makes sense, though notably the original code doesn't even work any more - I'm trying to debug a monotonic recursion error with that, then wire up the new case.
Oh, that's odd. It worked after 2bea0ad, no?
I'm not sure, I've rebased this PR a couple of times so the history is quite confusing.
The issue appears to be that `TNormalReturnKind` depends on `DataFlow::Node` in order to find all the return kinds / indirection levels; and `DataFlow::Node` now includes `TFlowSummaryNode`, i.e. depends on the flow summary library, and that of course depends on return kinds. Theoretically we can break the loop by having `TNormalReturnKind` only depend on the non-summary `DataFlow::Node`s (and having a separate check of the flow summaries themselves as you suggest) ... but simply adding `not return instanceof TFlowSummaryNode` in `TNormalReturnKind` does not actually break the dependency.
... I'm coming back to the idea that return kinds should be lower level than data flow nodes, and thus should not depend on them. This way it's easy to reason about them and it doesn't produce any dependency cycles. If we do smarter things with indirection nodes in the future, we'll just have to extend the `ReturnKind` type manually as part of that process.
The above means leaving the code more-or-less how it currently is in this PR. It might be worth adding the `maxIndirectionForModelledFunction` thing as well. Is there actually any alternative design?
> but simply adding `not return instanceof TFlowSummaryNode` in `TNormalReturnKind` does not actually break the dependency.
Indeed, that won't break any dependencies.
I'll pull your PR down locally and see if I can see what's wrong. The changes we did very early in this PR should've prevented exactly this non-monotonic recursion.
I think you've just gotten a bit lost in the commit history 😂 Doing the changes in 2bea0ad solves the non-monotonic recursion problem. When you said:
> Makes sense, though notably the original code doesn't even work any more - I'm trying to debug a monotonic recursion error with that, then wire up the new case.
I guess you mean that the code on `main` doesn't work any more. That's indeed expected - and that's why we did 2bea0ad. Reintroducing those changes fixed the non-monotonic recursion.
> ... I'm coming back to the idea that return kinds should be lower level than data flow nodes, and thus should not depend on them. This way it's easy to reason about them and it doesn't produce any dependency cycles.

I totally agree. That's what 2bea0ad did. It ensured that `ReturnKind`s didn't depend on dataflow nodes, but only on SSA.
I've pushed two commits now:
- 2159256 reintroduces the changes from 2bea0ad (this is now slightly simpler because we don't need the `parameterIsRedefined` predicate since it was deleted from the code on `main` 🎉). The diff is slightly longer than intended because I autoformatted the code by mistake as I was writing it 😅. The only change is the one in `TReturnKind`, I promise!
- 448a901 adds a predicate to compute the indirections necessary from all the MaD models like I described in C++: Implement models-as-data #15371 (comment)
(following a 1:1 meeting and a few more commits) I'm now cautiously optimistic that this issue is resolved.
I've just added the change notes and fixed the other minor issues. This PR is now ready for review. I will shortly begin a second DCA run.
I'm going through all the changes one last time. There are some small nits here and there, so I may add more comments throughout today
Co-authored-by: Mathias Vorreiter Pedersen <mathiasvp@github.com>
One final comment, but otherwise I think we're good to go!
The second DCA run:
So there's probably a small performance loss. I'd say it's not unacceptable, but if there are any leads for possible improvements I would like to follow them. We may well be better off doing general performance work in future rather than pursuing tiny gains here. I also ran a Swift DCA experiment, which was uneventful. That is unsurprising, as the Swift changes in this PR are very minor (they just mirror C++ changes in a few places, mostly comments).
I've fixed the CI issues (hopefully) and the merge, though alert provenance is not properly computed (I've created another follow-up issue to cover that).
Ready for final approval + merge.
Let's Get This Merged 🎉
Implement models-as-data for C++. That is, support for CSV formatted flow sources, sinks and summaries that look something like this:
The implementation is ported from Swift, and uses the shared MAD library to do the heavy lifting. I've created a range of "synthetic" tests (that is, tests that use models defined in the tests), and also created "real" sources for `getc` and friends (which have corresponding "real" tests).

This is currently a draft PR, and is missing most of the results it should find in tests. Things I need to do:
- TODO: fix this, there's no good reason for it.
- .encodeContent
- union content --- created follow-up issue
- post-update nodes? --- created follow-up issue
- indirect global variables --- created follow-up issue

@MathiasVP I would appreciate an early review and constructive feedback, when you're available.
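For readers unfamiliar with the row shape referred to in the review discussion (`package;type;subtypes;name;signature;ext`), the following C++ sketch shows one way such a semicolon-separated row could be split into its columns. It is an illustration only: `splitModelRow` is a hypothetical helper, the example row is made up, and only the six leading columns named in the review are assumed (real rows may carry further columns).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of CodeQL): splits a semicolon-separated
// model row into its columns, keeping empty columns (including a trailing one).
std::vector<std::string> splitModelRow(const std::string& row) {
    std::vector<std::string> fields;
    std::string current;
    for (char c : row) {
        if (c == ';') {
            fields.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    fields.push_back(current); // last column, possibly empty
    return fields;
}

// Usage with a made-up row: package;type;subtypes;name;signature;ext
void exampleSplitModelRow() {
    const auto fields =
        splitModelRow("examplepkg;ExampleType;true;exampleFunction;(char *);");
    assert(fields.size() == 6);
    assert(fields[3] == "exampleFunction");
    assert(fields[5].empty()); // the `ext` column is empty in this row
}
```

The split is done by hand rather than with `std::getline` because the latter would silently drop a trailing empty column.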