Skip to content

C++: Use-use flow through global variables#11171

Merged
MathiasVP merged 11 commits intogithub:mathiasvp/replace-ast-with-ir-use-usedataflowfrom
MathiasVP:global-flow
Feb 3, 2023
Merged

C++: Use-use flow through global variables#11171
MathiasVP merged 11 commits intogithub:mathiasvp/replace-ast-with-ir-use-usedataflowfrom
MathiasVP:global-flow

Conversation

@MathiasVP
Copy link
Copy Markdown
Contributor

This PR fixes global variable flow on the use-use feature branch. For an example like:

int global;
void setGlobal() {
  global = source();
}

void readGlobal() {
  sink(global);
}

we will model this like:

int global;
void setGlobal() {
  global = source();
  // implicit use of `global` inserted here
}

void readGlobal() {
  // implicit def of `global` inserted here
  sink(global);
}

and we'll have flow like:

source() -> global -> implicit use of global -> VariableNode(global) -> implicit def of global -> global

I also did a drive-by fix for getLocation on VariableNodes since I was seeing new consistency failures because we now allocate more VariableNodes

@github-actions github-actions Bot added the C++ label Nov 8, 2022
Comment thread cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll Outdated
@MathiasVP MathiasVP marked this pull request as ready for review February 2, 2023 09:22
@MathiasVP MathiasVP requested a review from a team as a code owner February 2, 2023 09:22
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Feb 2, 2023
Copy link
Copy Markdown
Contributor

@jketema jketema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand this correctly, instead of having jump steps at arbitrary places in functions - which is how I understood the old code - they now just occur at the beginning and end of functions.

If the above if correct, this is probably a good approximation for now. I guess in general this is probably not completely correct, as we might be analysing a multi-threaded application. Assuming there are no data races (which is a separate problem), I guess we could always handle those in the future by introducing additional jump steps at places where synchronisation takes place.

int getIndirectionIndex() { result = indirectionIndex }

/** Holds if this definition or use has index `index` in block `block`. */
final predicate hasIndexInBlock(IRBlock block, int index) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also have an override? And can the QLDoc be dropped (as it is an override)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually not an override because this class doesn't extend DefImpl. If GlobalDef did extend DefImpl then your proposal is very much correct, but because a DefImpl currently has to have an Operand representing the address of the definition a GlobalDef cannot be a DefImpl (since a GlobalDef is a kind of synthesized definition whose address isn't present in the IR).

The Use side of this has been refactored in a previous PR to handle this situation: We have a UseImpl class, and we have a OperandBasedUse to handle the common case that a Use has an operand that represents the address directly in the IR. But (unlike a DefImpl) a UseImpl doesn't have to be OperandBasedUse, and that's why the GlobalUse can extend UseImpl without requiring an Operand.

We could do a similar refactoring for DefImpl (i.e.,
remove the requirement that DefImpl has to have an Operand, and create a OperandBasedDef class and let most of the SSA definitions extend that class. Then we could have GlobalDef just extend the DefImpl class). It shouldn't be a difficult refactoring to do. I just chose to not do it in this PR to keep the diff as small as possible. But I'd be happy to do that refactoring if it helps you review the PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, I was mistakenly under the impression that it was extending more than TGlobalDef.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do the refactoring here.

variableWriteCand(bb, i, v) and
(
variableWriteCand(bb, i, v) or
sourceVariableIsGlobal(v, _, _, _)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a good point. I should add a comment to this. The reason is as follows:

variableWriteCand restricts the set of writes in the second SSA phase to those writes that target variables that are determined to be live by the first SSA phase. However, the first SSA phase doesn't know about global variables (as you can see, nothing in this PR touches the ssa0/SsaInternals.qll file). So it may be the case that a write to a global variable is determined to not be live by the first SSA phase (because there's no synthetic "final use of the global variable" at the end of the function body).

Now, there are two ways of handling this situation:

  1. Teach ssa0/SsaInternals.qll about global variables (i.e., by doing the same transformation this PR does, but also do it in the ssa0/SsaInternals.qll file), or
  2. Exclude global variables from the pruning by saying "either the variable isn't pruned away in the first SSA phase, or it's a global variable (like we do in this PR).

The reason I chose the second approach is that I don't expect the first phase to prune much (if anything) away in the case of global variables (since, if you write to a global variable then there's a very very high chance that you also read it somewhere in the database). So it wouldn't really prune anything away, and it would only increase the amount of data the first SSA phase is working with.

Does that make sense?

Copy link
Copy Markdown
Contributor

@jketema jketema Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so that I mostly got from other parts of the code. What confuses me here is that we both have this addition and the following just a bit lower:

    or
    exists(GlobalDef global |
      global.hasIndexInBlock(bb, i, v) and
      certain = true
    )

Why isn't it sufficient to just have this latter bit?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I think you're right 😂. Let me double check that I can remove that sourceVariableIsGlobal disjunct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wait. No, I think it's actually needed. Basically, there are two kinds of writes to global variables:

  1. An actual write to the variable in the body of the function. That's this part:
exists(DefImpl def | def.hasIndexInBlock(bb, i, v) |
  if def.isCertain() then certain = true else certain = false
)
  1. A synthesizes initial definition of the global variable. That's this part:
exists(GlobalDef global |
  global.hasIndexInBlock(bb, i, v) and
  certain = true
)

Consider a program like:

void write() {
  global = 42;
}

void read() {
  use(global);
}

the write to global is an instance of DefImpl, and so hits the first case. But we don't want to prune away that write (since there is a read of the variable. Just not in the same callable).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. That clarifies things.

Comment on lines +1303 to +1304
not exists(unique( | | v.getLocation())) and
result instanceof UnknownDefaultLocation
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR (and I don't know how relevant this is for the user), but if this is about parameters of functions shouldn't we just pick the one associated with the actual definition of the function (and not of any of its declarations)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's probably a much better approach. I'll create an issue for this.

@MathiasVP
Copy link
Copy Markdown
Contributor Author

So if I understand this correctly, instead of having jump steps at arbitrary places in functions - which is how I understood the old code - they now just occur at the beginning and end of functions.

Exactly. And the new way is really the right way to do it. If we do global flow that jumps to arbitrary places, we get FPs like:

int global;

void foo() {
  sanitizer(&global);
  sink(global); // <-- flow jumps directly from source() to sink(global) without considering the sanitizer.
}

void bar() {
  global = source();
}

IIRC, we had to exclude global variables because of this in some query exactly because of this reason, but I can't seem to find the PR (nor the query) were we had to do this 🤔.

@MathiasVP
Copy link
Copy Markdown
Contributor Author

If the above if correct, this is probably a good approximation for now. I guess in general this is probably not completely correct, as we might be analysing a multi-threaded application. Assuming there are no data races (which is a separate problem), I guess we could always handle those in the future by introducing additional jump steps at places where synchronisation takes place.

That's an excellent point, yes. I imagine this would be a good approach to handling multi-threaded dataflow in the future.

@MathiasVP MathiasVP merged commit 4317381 into github:mathiasvp/replace-ast-with-ir-use-usedataflow Feb 3, 2023
@jketema
Copy link
Copy Markdown
Contributor

jketema commented Feb 3, 2023

I hadn't approve this one yet, although I'm fine with it 😄 . I assume we'll have a small follow-up PR that fixes the "spurious" issue we discussed internally.

@MathiasVP
Copy link
Copy Markdown
Contributor Author

Oh, I'm sorry. I had somehow convinced myself that you'd approved it 😀. Let's do it as a follow up, indeed.

@jketema
Copy link
Copy Markdown
Contributor

jketema commented Feb 3, 2023

No worries, as I had already told you I was fine with it except for the test case you added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C++ no-change-note-required This PR does not need a change note

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants