Python: Prevent bad TC and add a bit of caching #5535

tausbn · 2021-03-25T17:16:39Z

Using `simpleLocalFlowStep+` with the first argument specialised to `CfgNode` was causing the compiler to turn this into a very slowly converging manual transitive closure (TC) computation.

Instead, we use `simpleLocalFlowStep*` (which is fast) and then join that with a single step from any `CfgNode`. This should amount to the same thing.

I also noticed that the characteristic predicate (charpred) for `LocalSourceNode` was getting recomputed a lot, so this is now cached. (The recomputation was especially bad since it relied on `simpleLocalFlowStep+`, but anyway it's a good idea not to recompute this.)
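The equivalence relied on here (restricting the first argument of `step+` to a set of sources yields the same pairs as taking one step from a source and then following `step*`) can be checked on a toy graph. The sketch below is a Python model rather than QL, and it does not reproduce how the QL evaluator computes either form; the graph, `SOURCES`, and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical toy graph standing in for simpleLocalFlowStep (includes a cycle).
NODES = {1, 2, 3, 4, 5, 6}
EDGES = {(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)}
# Hypothetical stand-in for "the first argument is a CfgNode".
SOURCES = {1, 5}


def reflexive_transitive_closure(nodes, edges):
    """Model of step*: all (a, b) with b reachable from a in zero or more steps."""
    successors = defaultdict(set)
    for a, b in edges:
        successors[a].add(b)
    closure = set()
    for start in nodes:
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in successors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure.update((start, node) for node in seen)
    return closure


def restricted_plus_fixpoint(sources, edges):
    """Model of step+ with the first argument restricted to `sources`,
    computed by extending known pairs one step at a time until nothing changes."""
    reach = {(a, b) for a, b in edges if a in sources}
    while True:
        new = {(a, c) for a, b in reach for b2, c in edges if b == b2} - reach
        if not new:
            return reach
        reach |= new


def one_step_then_star(sources, nodes, edges):
    """Model of the rewrite: a single step from a source, joined with step*."""
    star = reflexive_transitive_closure(nodes, edges)
    first_step = {(a, b) for a, b in edges if a in sources}
    return {(a, c) for a, b in first_step for b2, c in star if b == b2}


plus_result = restricted_plus_fixpoint(SOURCES, EDGES)
star_result = one_step_then_star(SOURCES, NODES, EDGES)
print(sorted(plus_result))
print(plus_result == star_result)
```

Both functions return the same set of pairs on this graph, which is the sense in which the rewrite "should amount to the same thing". The model says nothing about relative speed; that depends on the evaluator.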

A more principled approach is possible here, but in the short term this will prevent an explosion. For reference, openstack/cinder has roughly 19000 `ForTarget`s and tuples of size up to 5300, and we were calculating the cartesian product of these.

yoff

LGTM

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

It's always a sad thing to see a good plan go wrong:

    86860032  ~0%  {4} r26 = JOIN r19 WITH DataFlowPublic::TupleElementContent#class#ff CARTESIAN PRODUCT OUTPUT Lhs.0 'nodeFrom', Lhs.1 'nodeTo', Rhs.0, Rhs.1
    129256    ~3%  {4} r27 = SELECT r26 ON In.3 <= 7
    129256    ~0%  {3} r28 = SCAN r27 OUTPUT In.0 'nodeFrom', In.2 'c', In.1 'nodeTo'

Happily, now it looks like this:

    129256    ~0%  {3} r20 = JOIN r19 WITH DataFlowPrivate::small_tuple#f CARTESIAN PRODUCT OUTPUT Lhs.0 'nodeFrom', Rhs.0, Lhs.1 'nodeTo'
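The row counts in the two plans are consistent with a simple reading: the filter on the index was being applied after the cartesian product instead of before it. The Python sketch below checks that arithmetic and replays both orders on toy data. It assumes that tuple indices start at 0 with one `TupleElementContent` per index (so `<= 7` leaves 8 values); the inferred sizes, the toy data, and the function names are illustrative and are not taken from the actual database.

```python
from itertools import product

# Row counts copied from the two plans.
ROWS_PRODUCT_FIRST = 86_860_032  # r26: product with every TupleElementContent
ROWS_FILTERED = 129_256          # r27/r28 before the change, r20 after it
MAX_INDEX = 7

# Assumption: indices are 0..MAX_INDEX, one content per index.
small_tuple_size = MAX_INDEX + 1
r19_rows = ROWS_FILTERED // small_tuple_size     # inferred size of r19
all_contents = ROWS_PRODUCT_FIRST // r19_rows    # inferred number of contents


def product_then_filter(pairs, indices, limit):
    """Old plan shape: build the full product, then select index <= limit."""
    joined = [(a, b, i) for (a, b), i in product(pairs, indices)]
    kept = [(a, i, b) for a, b, i in joined if i <= limit]
    return len(joined), kept


def filter_then_product(pairs, indices, limit):
    """New plan shape: restrict the indices first, then build the product."""
    small = [i for i in indices if i <= limit]
    joined = [(a, i, b) for (a, b), i in product(pairs, small)]
    return len(joined), joined


# Toy data: 5 (nodeFrom, nodeTo) pairs and 100 possible indices.
toy_pairs = [(f"from{k}", f"to{k}") for k in range(5)]
toy_indices = range(100)

old_size, old_rows = product_then_filter(toy_pairs, toy_indices, MAX_INDEX)
new_size, new_rows = filter_then_product(toy_pairs, toy_indices, MAX_INDEX)
print(r19_rows, all_contents)
print(old_size, new_size, old_rows == new_rows)
```

Under that assumption the numbers divide evenly, and on the toy data both orders produce the same rows while the intermediate result shrinks from 500 tuples to 40.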

yoff

LGTM, but the testing will tell..

RasmusWL · 2021-03-26T13:02:57Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

  )
 }

+pragma[noinline]
+TupleElementContent small_tuple() { result.getIndex() <= 7 }


Did any special logic go into picking the value 7? Or could it just as well have been 10? (Either way, I would really appreciate a comment in the code explaining those details, so it's easy to figure out in a year's time when we look at it again 😉)

tausbn · 2021-03-26

I originally said 10, then yoff said that might be a bit high, then I said 5 but then changed my mind and said 7, which we both agreed was a nice round number.

I don't expect this code to be around in a year's time, and ideally not even a month from now. We already have a plan for a more principled approach for this (mimicking the "any index"-ness of list element content), but in the short term restricting this value is the best way to avoid blowup.

yoff

This version of hasLocalSource looks much more natural to me.

tausbn added the no-change-note-required label (This PR does not need a change note) Mar 25, 2021

tausbn requested a review from yoff March 25, 2021 17:16

tausbn requested a review from a team as a code owner March 25, 2021 17:16

github-actions bot added the Python label Mar 25, 2021

tausbn added 2 commits March 25, 2021 18:28

Python: Slight cleanup

8734df3

yoff previously approved these changes Mar 25, 2021

yoff added the Awaiting evaluation label (Do not merge yet, this PR is waiting for an evaluation to finish) Mar 25, 2021

tausbn dismissed yoff’s stale review via c2f112c March 25, 2021 18:07

yoff previously approved these changes Mar 25, 2021

RasmusWL reviewed Mar 26, 2021

Python: Fix another bad TC.

f17bbd9

This one is a bit awkward, since the previous version was supposed to improve indexing. Unfortunately this is vastly outweighed by the slow convergence of the TC. Right now we pay the cost of inverting the `hasFlowSource` relation, but this is still cheaper.

tausbn dismissed yoff’s stale review via f17bbd9 March 26, 2021 20:52

yoff approved these changes Mar 27, 2021

codeql-ci merged commit 3613ceb into github:main Mar 29, 2021
