Python: Port use-use implementation from Java by yoff · Pull Request #4235 · github/codeql

yoff · 2020-09-09T10:22:18Z

I think it is time to let the team see what is going on with the implementation of use-use.

After re-adding def-use steps to global variables and fixing the bug that pre-update nodes lost out-flow, all existing flow is recovered (as judged by our test files; we should add specific tests for use-use).

After fixing the bug that post-update nodes never had out-flow, we finally obtain some of the expected improvements: the taint-tracking test for list_append now passes (as feared, fixing those bugs on main does not give the same improvement).
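A minimal Python sketch of why out-flow from post-update nodes matters for a case like list_append. The three-line program in the comment, the node names, and the `reachable` helper are illustrative assumptions, not the actual test file or the CodeQL dataflow library; the edges only model the idea of def-use versus use-use steps.

```python
# Hypothetical program under analysis (names are illustrative):
#   l = []              # definition of l
#   l.append(tainted)   # use 1 of l; the call mutates l, giving a post-update node
#   sink(l)             # use 2 of l


def reachable(edges, start):
    """Return the set of nodes reachable from `start` along `edges`."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        todo.extend(edges.get(node, ()))
    return seen


# Def-use only: the definition flows to every use, but nothing leaves the
# post-update node, so the mutation is invisible at the sink.
def_use = {
    "def l": ["use1 l", "use2 l"],
    "tainted": ["post-update use1 l"],
}

# Use-use: the definition flows to the first use only, each use flows to the
# next one, and the post-update node of use 1 also flows to use 2.
use_use = {
    "def l": ["use1 l"],
    "use1 l": ["use2 l"],
    "post-update use1 l": ["use2 l"],
    "tainted": ["post-update use1 l"],
}

taint_reaches_sink_def_use = "use2 l" in reachable(def_use, "tainted")
taint_reaches_sink_use_use = "use2 l" in reachable(use_use, "tainted")
print(taint_reaches_sink_def_use, taint_reaches_sink_use_use)
```

In this toy model the taint only reaches the second use of `l` once the post-update node has an outgoing edge, which is the shape of the improvement described above.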

The details

The implementation follows java/ql/src/semmle/code/java/dataflow/SSA.qll lines 754-856 and is carried out in SsaCompute.qll. That file already contains several predicates that seem to have been copied from the Java implementation and then adapted (though not always with adapted comments) to the Python analysis, which includes refinements and many implicit uses.

This PR adds a new module in SsaCompute.qll called AdjacentUsesImpl, which is exported as AdjacentUses. This module contains the ported computation, but also redefines some of the underlying predicates to exclude refinements and implicit uses. For instance, the Java computation relies on a predicate called defUseRank, and SsaCompute.qll already provides one in SsaComputeImpl. Rather than reusing that one, AdjacentUsesImpl defines defSourceUseRank, which is based on getASourceUse rather than on getAUse and which excludes refinements (compare variableUse to the new variableSourceUse, and variableDef to variableDefine).

Apart from renaming defUseRank to defSourceUseRank and variableUse to variableSourceUse, the Java implementation can be used almost verbatim. Only definesAt had to be implemented (and getABBSuccessor renamed to getASuccessor).
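To make the ranking idea concrete, here is a rough Python model of what a predicate like defSourceUseRank computes within a single basic block, and how firstUse and adjacentUseUseSameVar pairs fall out of consecutive ranks. This is a sketch under assumptions: the function names and the representation of definitions and uses as integer indices are hypothetical, indices are assumed distinct, and the cross-block part of the computation is left out.

```python
def ranked_refs(defs, uses):
    """Rank the indices at which a variable is defined or read in one block.

    Returns (rank, index, kind) triples with ranks starting at 1, mirroring
    a `rank[rankix](int j | ...)` aggregate over definitions and source uses.
    """
    refs = sorted([(i, "def") for i in defs] + [(i, "use") for i in uses])
    return [(rank, i, kind) for rank, (i, kind) in enumerate(refs, start=1)]


def adjacent_refs(defs, uses):
    """Pairs of references with consecutive ranks (nothing in between)."""
    refs = ranked_refs(defs, uses)
    return [((k1, i1), (k2, i2)) for (_, i1, k1), (_, i2, k2) in zip(refs, refs[1:])]


def first_uses(defs, uses):
    """Definition followed directly by a use."""
    return [(a, b) for a, b in adjacent_refs(defs, uses) if a[0] == "def" and b[0] == "use"]


def adjacent_use_use(defs, uses):
    """Use followed directly by another use of the same variable."""
    return [(a, b) for a, b in adjacent_refs(defs, uses) if a[0] == "use" and b[0] == "use"]


# One block where the variable is defined at index 0 and read at 2, 5 and 7.
print(ranked_refs([0], [2, 5, 7]))
print(first_uses([0], [2, 5, 7]))
print(adjacent_use_use([0], [2, 5, 7]))
```

Excluding refinements and implicit uses corresponds, in this model, to simply not listing them among `defs` and `uses` before ranking.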

yoff · 2020-09-10T09:02:25Z

Decisions about which predicates and modules to make cached and/or private are made to be roughly consistent with the existing Python code. I suspect that may not be the best guide for these matters...

tausbn · 2020-09-10T10:06:58Z

I'm a bit worried about leaving out cached annotations, as this can impact performance. Can you point out the places where you've diverged from the Java implementation?

RasmusWL

Overall really good stuff 👍 and results look promising 🎉

Can you explain why we don't need adjacentUseUse? -- I would think we should use that instead of adjacentUseUseSameVar.

RasmusWL · 2020-09-10T09:33:58Z

python/ql/src/semmle/python/essa/SsaCompute.qll

+     * Holds if `b2` is a transitive successor of `b1` and `v` occurs in `b1` [and
+     * in `b2` or one of its transitive successors]? but not in any block on the path
+     * between `b1` and `b2`.


Why did you add the square brackets to the qldoc?

Because it is wrong, it does not hold in the base case.

Reverted, now that it is correct.
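For readers who want to see what the qldoc above describes, here is a small Python sketch that reads it literally, with the bracketed condition applied in every case. It is not a transcription of the QL predicate: the function names, the `succ` successor map, and the `occurs` set are hypothetical stand-ins for the basic-block graph and for "the variable occurs in this block".

```python
def successors_closure(succ, b):
    """Return `b` together with all of its transitive successors."""
    seen, todo = set(), [b]
    while todo:
        block = todo.pop()
        if block not in seen:
            seen.add(block)
            todo.extend(succ.get(block, ()))
    return seen


def block_precedes_var(succ, occurs, b):
    """True if the variable occurs in `b` or in a transitive successor of `b`."""
    return any(block in occurs for block in successors_closure(succ, b))


def var_block_reaches(succ, occurs, b1):
    """Blocks b2 reachable from b1 without crossing another occurrence.

    The variable must occur in b1, and in b2 or one of its transitive
    successors, but in no block strictly between b1 and b2.
    """
    if b1 not in occurs:
        return set()
    result = set()
    todo = list(succ.get(b1, ()))
    while todo:
        b2 = todo.pop()
        if b2 in result or not block_precedes_var(succ, occurs, b2):
            continue
        result.add(b2)
        if b2 not in occurs:
            # Only continue past blocks that do not mention the variable.
            todo.extend(succ.get(b2, ()))
    return result


# Diamond-shaped control flow: A -> B -> D -> E and A -> C -> D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
occurs = {"A", "C", "D"}
print(sorted(var_block_reaches(succ, occurs, "A")))
```

Block E is excluded because the search stops at D, where the variable occurs; filtering the result to blocks in `occurs` gives the blocks holding the next reference.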

RasmusWL · 2020-09-10T10:21:42Z

Decisions about which predicates and modules to make cached and/or private are made to be roughly consistent with the existing Python code. I suspect that may not be the best guide for these matters...

cached annotations are still black magic to me, so I'm not going to be able to help there 😐

tausbn

A few comments, but otherwise this looks really nice!

tausbn · 2020-09-10T11:11:56Z

python/ql/src/semmle/python/essa/SsaCompute.qll

+      i = rank[rankix](int j | variableDefine(v, _, b, j) or variableSourceUse(v, _, b, j))
+    }
+
+    /** A `VarAccess` `use` of `v` in `b` at index `i`. */


I don't think VarAccess has a special meaning in the Python libraries (whereas I assume it does in the Java libraries), so maybe this should just be spelled out as variable access instead?
(And in writing this, I realise that the variableUse predicate also has this odd reference to VarAccess)

Probably it was copied from the same place.. :)

I changed both places.

yoff · 2020-09-10T12:33:10Z

In Java, the module AdjacentUsesImpl is private and has no cached annotations. Some of the predicates are not in that module (firstUse and adjacentUseUseSameVar) and they are cached. In Python, all the predicates are cached.

yoff · 2020-09-10T13:00:29Z

adjacentUseUse is the same as adjacentUseUseSameVar but for a source variable rather than for an SSA variable. We have, like C#, SSA variables in our dataflow graph, whereas Java does not.

RasmusWL · 2020-09-10T13:12:22Z

adjacentUseUse is the same as adjacentUseUseSameVar but for a source variable rather than for an SSA variable. We have, like C#, SSA variables in our dataflow graph, whereas Java does not.

adjacentUseUse in Java is not the same as adjacentUseUseSameVar in Java, since adjacentUseUse also allows passing through phi nodes and uncertain implicit updates.

Are you saying that we don't need to handle this explicitly in an adjacentUseUse predicate, since we already handle it in our dataflow?

yoff · 2020-09-10T13:28:20Z

Yes, the phi-nodes are in our dataflow graph. But perhaps they should not be. Perhaps we should only have EssaNodeDefinitions and then use adjacentUseUse.

tausbn · 2020-09-10T13:44:16Z

Perhaps @hvitved can weigh in on the situation in C# (as we're aligning ourselves more with that than with Java)?

RasmusWL

Besides the open question of how to handle phi nodes, looks good to me.

RasmusWL · 2020-09-10T14:12:20Z

gonna merge this now, thinking we can resolve that part in a separate PR.

hvitved · 2020-09-11T10:13:55Z

python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

@@ -120,6 +128,14 @@ module EssaFlow {
      nodeFrom.(EssaNode).getVar() = p.getAnInput()


In these three cases, C# takes any of the last reads of the input variable as nodeFrom, and only if there are no reads do we take the SSA node. I believe this will currently not work

    if (..)
      x = taint;
      clean(x);
    else
      x = taint;
      clean(x)
    sink(x)

because you will jump directly from both definitions of x to the call to sink.
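The rule described for C# (step into the phi node from the last reads of an input, falling back to the SSA definition only when there are no reads) can be sketched in Python as follows. This is an illustrative model, not the C# or Python library code: the node names, the helpers `phi_sources`, `build_edges` and `reaches`, and the choice to treat the argument of `clean(x)` as a barrier are all assumptions.

```python
def reaches(edges, start, target, barriers=frozenset()):
    """True if `target` is reachable from `start` without entering a barrier."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node in seen or node in barriers:
            continue
        seen.add(node)
        todo.extend(edges.get(node, ()))
    return target in seen


def phi_sources(ssa_definition, last_reads):
    """Nodes that step into the phi node for one input variable."""
    return list(last_reads) if last_reads else [ssa_definition]


def build_edges(use_last_reads):
    """Flow graph for the if/else example, with or without the last-read rule."""
    edges = {"phi x": ["sink(x)"]}
    for branch in ("then", "else"):
        definition = f"x = taint ({branch})"
        read = f"clean(x) ({branch})"
        edges[definition] = [read]  # definition to first use
        last_reads = [read] if use_last_reads else []
        for source in phi_sources(definition, last_reads):
            edges.setdefault(source, []).append("phi x")
    return edges


barriers = {"clean(x) (then)", "clean(x) (else)"}
start = "x = taint (then)"
flow_without_rule = reaches(build_edges(False), start, "sink(x)", barriers)
flow_with_rule = reaches(build_edges(True), start, "sink(x)", barriers)
print(flow_without_rule, flow_with_rule)
```

Without the last-read rule the definition steps straight into the phi node and bypasses `clean(x)`; with it, the only route to the sink goes through the read that the barrier blocks.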

Even worse, something like this will (I believe) also not work:

    x = ...
    if (...)
      x.Foo = taint;
    else
      x = ...
    sink(x.Foo)

because there is no step from the x in x.Foo = taint to the phi node for x after the if-then-else.

@yoff have you added the testcases from above somewhere, and checked how we handle them after use-use flow? 😊

Actually, I think the latter case will work, as long as the store step in x.Foo = taint targets the refined SSA node for x. But if it instead targets the post-update node for x (as for Java and C#), the change is needed (and it will be needed for the first case anyway).

Not yet (but very soon), the need is tracked here. I think we probably need to remove some essa-flow and let use-use do the work.

Python: Port use-use implementation from Java

c661f43

yoff added the Python label Sep 9, 2020

yoff added 4 commits September 9, 2020 13:27

Python: Add def-use jump-steps

ce7f82d

Python: Repair flow from pre-update nodes

9e59d79

Python: Repair flow out of post-update nodes

b156782

Python: fix comment and source uses

7b10a3a

yoff marked this pull request as ready for review September 10, 2020 06:36

yoff requested a review from a team as a code owner September 10, 2020 06:36

yoff added 2 commits September 10, 2020 10:55

Merge branch 'main' of github.com:github/codeql into SharedDataflow_U…

deb1a4c

…seUseFlow

Python: update test expectations

2eb8ea8

RasmusWL requested changes Sep 10, 2020

tausbn requested changes Sep 10, 2020

Apply suggestions from code review

3a19b1e

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Address review comments

92e7a56

tausbn approved these changes Sep 10, 2020

RasmusWL approved these changes Sep 10, 2020

RasmusWL merged commit 52d8f7d into github:main Sep 10, 2020

hvitved reviewed Sep 11, 2020

