Skip to content

Python: Port use-use implementation from Java#4235

Merged
RasmusWL merged 9 commits intogithub:mainfrom
yoff:SharedDataflow_UseUseFlow
Sep 10, 2020
Merged

Python: Port use-use implementation from Java#4235
RasmusWL merged 9 commits intogithub:mainfrom
yoff:SharedDataflow_UseUseFlow

Conversation

@yoff
Copy link
Contributor

@yoff yoff commented Sep 9, 2020

I think it is time to let the team see what is going on with the implementation of use-use.

After re-adding def-use steps to global variables and fixing the bug that pre-update nodes lost out-flow, all existing flow is recovered (as viewed by our test files, we should add specific tests for use-use).

After fixing the bug that post-update nodes were never having out-flow, we finally obtain some of the expected improvements by passing the taint-tracking test for list_append (as feared, fixing those bugs on main does not give the same improvement).

The details

The implementation follows java/ql/src/semmle/code/java/dataflow/SSA.qll lines 754-856 and is carried out in SsaCompute.qll. That file already contains several predicates that seem to be copied from the Java implementation and then adapted (not always adapting the comments) to the Python analysis which includes refinements and many implicit uses.

This PR adds a new module in SsaCompute.qll called AdjacentUsesImpl which is exported as AdjacentUses. This module contains the ported computation, but also a redefinition of some of the underlying predicates to exclude refinements and implicit uses. For instance, the Java computation relies on a predicate called defUseRank and SsaCompute.qll already provides one in SsaComputeImpl. But rather than reuse that one, AdjacentUsesImpl defines defSourceUseRank which is based on getASourceUse rather than on getAUse and which excludes refinements (compare variableUse to the new variableSourceUse and variableDef to variableDefine).

Apart from renaming defUseRank to defSourceUseRank and variableUse to variableSourceUse, the java implementation can be used almost verbatim. Only definesAt had to be implemented (and getABBSucessor renamed to getASucessor).

@yoff yoff added the Python label Sep 9, 2020
@yoff yoff marked this pull request as ready for review September 10, 2020 06:36
@yoff yoff requested a review from a team as a code owner September 10, 2020 06:36
@yoff
Copy link
Contributor Author

yoff commented Sep 10, 2020

Decisions about which predicates and modules to make cached and/or private are made to be roughly consistent with the existing Python code. I suspect that may not be the best guide for these matters...

@tausbn
Copy link
Contributor

tausbn commented Sep 10, 2020

I'm a bit worried about leaving out cached annotations, as this can impact performance. Can you point out the places where you've diverged from the Java implementation?

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall really good stuff 👍 and results looks promising 🎉

Can you explain why we don't need adjacentUseUse? -- I would think we should use that instead of adjacentUseUseSameVar.

Comment on lines 389 to 391
* Holds if `b2` is a transitive successor of `b1` and `v` occurs in `b1` [and
* in `b2` or one of its transitive successors]? but not in any block on the path
* between `b1` and `b2`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add the square brackets to the qldoc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is wrong, it does not hold in the base case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted, now that it is correct.

@RasmusWL
Copy link
Member

Decisions about which predicates and modules to make cached and/or private are made to be roughly consistent with the existing Python code. I suspect that may not be the best guide for these matters...

cached annotations are still black magic to me, so I'm not going to be able to help there 😐

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but otherwise this looks really nice!

i = rank[rankix](int j | variableDefine(v, _, b, j) or variableSourceUse(v, _, b, j))
}

/** A `VarAccess` `use` of `v` in `b` at index `i`. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think VarAccess has a special meaning in the Python libraries (whereas I assume it does in the Java libraries), so maybe this should just be spelled out as variable access instead?
(And in writing this, I realise that the variableUse predicate also has this odd reference to VarAccess)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it was copied from the same place.. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed both places.

@yoff
Copy link
Contributor Author

yoff commented Sep 10, 2020

In java, the module AdjacentUsesImpl is private and has no cached annotations. Some of the predicates are not in that module (firstUse and adjacentUseUseSameVar) and they are cached. In Python, all the predicates are cached.

@yoff
Copy link
Contributor Author

yoff commented Sep 10, 2020

adjacentUseUse is the same as adjacentUseUseSameVar but for a source variable rather than for an SSA variable. We have, like C#, SSA variables in our dataflow graph, whereas Java does not.

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@RasmusWL
Copy link
Member

RasmusWL commented Sep 10, 2020

adjacentUseUse is the same as adjacentUseUseSameVar but for a source variable rather than for an SSA variable. We have, like C#, SSA variables in our dataflow graph, whereas Java does not.

adjacentUseUse in Java is not the same as adjacentUseUseSameVar in Java, since adjacentUseUse also allows passing through phi nodes and uncertain implicit updates.

Are you saying that we don't need to handle this explicitly in a adjacentUseUse predicate, since we already handle it in our dataflow?

@yoff
Copy link
Contributor Author

yoff commented Sep 10, 2020

Yes, the phi-nodes are in our dataflow graph. But perhaps they should not be. Perhaps we should only have EssaNodeDefinitions and then use adjacentUseUse.

@tausbn
Copy link
Contributor

tausbn commented Sep 10, 2020

Perhaps @hvitved can weigh in on the situation in C# (as we're aligning ourselves more with that than with Java)?

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the open question of how to handle phi nodes, looks good to me.

@RasmusWL
Copy link
Member

gonna merge this now, thinking we can resolve that part in a separate PR.

@RasmusWL RasmusWL merged commit 52d8f7d into github:main Sep 10, 2020
@@ -120,6 +128,14 @@ module EssaFlow {
nodeFrom.(EssaNode).getVar() = p.getAnInput()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these three cases, C# takes any of the last reads of the input variable as nodeFrom, and only if there are no reads do we take the SSA node. I believe this will currently not work

if (..)
  x = taint;
  clean(x);
else
  x = taint;
  clean(x)
sink(x)

because you will jump directly from both definitions of x to the call to sink.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even worse, something like this will (I believe) also not work:

x = ...
if (...)
   x.Foo = taint;
else
   x = ...
sink(x.Foo)

because there is not step from the x in x.Foo = taint to the phi node for x after the if-then-else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yoff have you added the testcases from above somewhere, and checked how we handle them after use-use flow? 😊

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think the latter case will work, as long as the store step in x.Foo = taint targets the refined SSA node for x. But if it instead targets the post-update node for x (as for Java and C#), the change is needed (and it will be needed for the first case anyway).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet (but very soon), the need is tracked here. I think we probably need to remove some essa-flow and let use-use do the work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants