Python: Port URL sanitisation queries to API graphs #5250

tausbn · 2021-02-23T21:13:48Z

Really, this boils down to "Port re library model to use API graphs
instead of points-to", which is what this PR actually does.

Instead of using points-to to track flags, we use a type tracker. To
handle multiple flags at the same time, we add additional flow from

x to x | y and y | x

and, as an added bonus, the above with + instead of |, neatly
fixing #4707
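To illustrate why both operators matter, here is a minimal Python sketch. It is not taken from the PR, and the pattern and input string are made up for illustration. It shows that distinct `re` flags combined with `|` and with `+` select the same behaviour, so an analysis that only followed `|` would miss the second form.

```python
import re

# Two distinct flags combined with bitwise-or and with addition.
# For distinct single-bit flags both forms yield the same integer value.
or_flags = re.IGNORECASE | re.MULTILINE
add_flags = re.IGNORECASE + re.MULTILINE

# Hypothetical pattern and input, chosen only for illustration.
text = "first\nSECOND"
or_match = re.search(r"^second$", text, or_flags)
add_match = re.search(r"^second$", text, add_flags)
```

Note that the equivalence only holds when the flags do not share bits; adding a flag to itself would produce a different value.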

I had to modify the Qualified.ql test slightly, as it now had a
result stemming from the standard library (in warnings.py) that
points-to previously ignored.

It might be possible to implement this as a type tracker on
LocalSourceNodes, but with the added steps for the above operations,
this was not obvious to me, and so I opted for the simpler
"smallstep" variant.

Before merging, I need to

- check that performance isn't impacted because of bad joins in the type tracker,
- check that no other tests need to be modified because of the change to the re model, and
- figure out if a change note is needed. The queries haven't changed and the behaviour should be the same, but I did deprecate a public predicate in Regex.qll, so that probably merits a change note.

yoff

Basically looks great, two comments.

python/ql/test/library-tests/regex/Qualified.ql

yoff · 2021-02-24T07:32:23Z

python/ql/src/semmle/python/regex.qll

-    flag = Value::named("sre_constants.SRE_FLAG_" + result).(ObjectInternal).intValue() and
-    obj.(ObjectInternal).intValue().bitAnd(flag) = flag
+    flag = Value::named("sre_constants.SRE_FLAG_" + result).(OI::ObjectInternal).intValue() and
+    obj.(OI::ObjectInternal).intValue().bitAnd(flag) = flag
  )


As far as I can tell, this does not involve points-to, but please reassure me that this is not code we will be moving away from :-)

Your statement confuses me. The change you have highlighted above does indeed involve points-to, and it is code we will be moving away from. However, it's a predicate that could have seen use outside of this codebase, and so I couldn't simply delete it.

I should add a comment saying it's deprecated, though, to make this clear.

I have added the indicated statement. (And dated it, so we can actually see when we can delete it. 🙂)

Thanks, and I guess that the signature makes it rather awkward to rewrite it to not use points-to...
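For readers unfamiliar with the bit test in the deprecated predicate quoted above, the following Python sketch mirrors the same idea at runtime. It is an illustration only: `FLAG_NAMES` and `flag_names` are hypothetical names, and the list of long flag names is assumed to correspond to the `SRE_FLAG_*` suffixes rather than copied from the library.

```python
import re

# Long flag names, assumed here to mirror the SRE_FLAG_* suffixes.
FLAG_NAMES = [
    "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL",
    "UNICODE", "VERBOSE", "ASCII",
]


def flag_names(value):
    """Return the names of all flags whose bit is set in `value`."""
    names = []
    for name in FLAG_NAMES:
        flag = int(getattr(re, name))
        # Same test as `intValue().bitAnd(flag) = flag` in the predicate.
        if int(value) & flag == flag:
            names.append(name)
    return names
```

The predicate needs a concrete integer for the mode object, which is what points-to supplied; the type-tracker approach instead follows each named flag to its uses.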

python/ql/test/library-tests/regex/options

tausbn · 2021-02-24T11:32:37Z

With the join-order fix, the performance is now well within parameters.

RasmusWL

Besides me wanting to understand the rationale behind t.continue() a bit deeper, this PR LGTM 👍

RasmusWL · 2021-02-24T17:00:39Z

python/ql/src/semmle/python/regex.qll

+  exists(API::Node flag | flag_name = canonical_name(flag) and result = flag.getAUse())
+  or
+  exists(BinaryExprNode binop, DataFlow::Node operand |
+    operand.getALocalSource() = re_flag_tracker(flag_name, t.continue()) and


I have a bit of trouble understanding if t.continue() is the only right choice here.

I would have thought that if we were able to track a flag with some type tracker t2, if that flag is used in a binary or operation, the resulting type-tracker would be the continuation of t2. So

exists(BinaryExprNode binop, DataFlow::Node operand, DataFlow::TypeTracker t2 | operand.getALocalSource() = re_flag_tracker(flag_name, t2) and t = t2.continue() and ...

But I'm also wondering if we need this continue stuff at all, or we could just use re_flag_tracker(flag_name) instead -- and if there would be any bad consequences of that.

exists(BinaryExprNode binop, DataFlow::Node operand | operand.getALocalSource() = re_flag_tracker(flag_name) and ...

Your suggested rewrite would have at least two consequences, both of which may affect performance, and one of which will affect behaviour:

By referring to re_flag_tracker/1, the fixpoint computation now has to evaluate both this and re_flag_tracker/2 at the same time. Currently, re_flag_tracker/1 is simply an extra join on top of the result of the other predicate. The impact on performance probably isn't terribly big, but I imagine there is some overhead in doing this.

Probably more impactful on performance is the fact that by not reusing the type tracker, we lose track of whether we have previously propagated the type information across a call. Thus, with your suggestion we might track into a call, then through a binary operation and then out of a different call to the same function. This is potentially a much larger set of nodes.

Finally, rewriting to use t2 actually doesn't change the behaviour. t = t2.continue() is equivalent to t = t2 and t.attr = "" (with a slight abuse of notation). Thus, continue really just checks that we're not tracking an attribute (which makes sense -- we can add re flags, but not objects that happen to have an re flag in an attribute).
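The second point can be pictured with a small Python program of the kind being analysed. This is a hypothetical example, not code from the PR or its tests; `with_multiline`, `a` and `b` are made-up names.

```python
import re


def with_multiline(flags):
    # The binary operation sits inside the function body.
    return flags | re.MULTILINE


# Two different call sites of the same function.
a = with_multiline(re.IGNORECASE)
b = with_multiline(re.DOTALL)
```

Tracking `re.IGNORECASE` goes into the first call and through the `|`. If the tracker were restarted at the binary operation, the fact that we entered via a call would be forgotten, and the flag could appear to flow out to `b` as well as to `a`, even though `b` never contains it at runtime.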

👍

> Finally, rewriting to use t2 actually doesn't change the behaviour. t = t2.continue() is equivalent to t = t2 and t.attr = "" (with a slight abuse of notation). Thus, continue really just checks that we're not tracking an attribute (which makes sense -- we can add re flags, but not objects that happen to have an re flag in an attribute).

Right. Although the behavior ends up being the same, for me, it just reads wrong semantically. I would like to use the t2 approach, and have made a suggestion for that (couldn't do it in this thread, since that only covered one of the lines 🤦)

Everywhere else (in Python and JavaScript) where this construction appears, and I do mean everywhere, it is in the form t.continue(). We only see t2.continue() (and currently only in JavaScript) when (back)track(t2,t) is used in an adjacent disjunct.

So I think to suddenly change what is the standard idiom doesn't make much sense.

Rather, it seems to me that a better solution would be to add predicate canContinue to TypeTracker, with the meaning that this.attr is empty. Then we can reuse t in both places, as long as we make sure to check ... and t.canContinue() and we don't have to waste precious space on declaring a variable that is exactly equal to t anyway.

This, however, I think is outside the scope of the present PR.
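As a rough picture of that proposal, here is a toy Python model. It is not the CodeQL implementation: `ToyTracker`, `has_call`, `can_continue` and `continue_` are hypothetical names, and the model only captures the description given in this thread, namely that continuing is allowed exactly when no attribute is being tracked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTracker:
    """Toy stand-in for a type tracker: a call marker plus a tracked attribute."""
    has_call: bool = False
    attr: str = ""

    def can_continue(self) -> bool:
        # The proposed check: true exactly when no attribute is tracked.
        return self.attr == ""

    def continue_(self) -> Optional["ToyTracker"]:
        # Models `t = t2.continue()`: the same tracker, or no result at all.
        return self if self.can_continue() else None


plain = ToyTracker(has_call=True)
in_attr = ToyTracker(attr="flags")
```

In this model `plain.continue_()` returns `plain` itself, call marker included, while `in_attr.continue_()` yields nothing, which matches the reading that `continue` is only a filter on the attribute.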

I agree that changing the way it's used probably isn't in the scope of this PR.

yoff

LGTM

RasmusWL · 2021-02-25T11:43:57Z

python/ql/src/semmle/python/regex.qll

+  exists(BinaryExprNode binop, DataFlow::Node operand |
+    operand.getALocalSource() = re_flag_tracker(flag_name, t.continue()) and


Suggested change

-  exists(BinaryExprNode binop, DataFlow::Node operand |
-    operand.getALocalSource() = re_flag_tracker(flag_name, t.continue()) and
+  exists(BinaryExprNode binop, DataFlow::Node operand, DataFlow::TypeTracker t2 |
+    operand.getALocalSource() = re_flag_tracker(flag_name, t2) and
+    t = t2.continue() and

Dismissed for the reasons outlined in #5250 (comment)

tausbn added the Python, Awaiting evaluation, and no-change-note-required labels Feb 23, 2021

tausbn requested a review from a team as a code owner February 23, 2021 21:13

Python: Import API graphs privately

2942a11

yoff reviewed Feb 24, 2021

tausbn added 3 commits February 24, 2021 10:18

Python: Add deprecation notice to mode_from_mode_object

cac6c4a

Python: Use source nodes and prevent bad join order

e77c105

Python: Decrease import depth in regex tests

af644a0

These were increased because of the indirection needed to get to the regex flags, but as we no longer rely on this, we can make do with a smaller import depth.

RasmusWL reviewed Feb 24, 2021

python/ql/test/library-tests/regex/options

Python: Get rid of superfluous options file

404649d

tausbn removed the Awaiting evaluation label Feb 24, 2021

tausbn requested review from yoff and RasmusWL February 24, 2021 11:36

RasmusWL reviewed Feb 24, 2021

yoff approved these changes Feb 24, 2021

RasmusWL reviewed Feb 25, 2021

tausbn merged commit 01d581e into github:main Feb 25, 2021

tausbn deleted the python-port-re-security-queries branch February 25, 2021 12:14
