-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Python: Port URL sanitisation queries to API graphs #5250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Really, this boils down to "Port `re` library model to use API graphs instead of points-to", which is what this PR actually does. Instead of using points-to to track flags, we use a type tracker. To handle multiple flags at the same time, we add additional flow from `x` to `x | y` and `y | x` and, as an added bonus, the above with `+` instead of `|`, neatly fixing github#4707 I had to modify the `Qualified.ql` test slightly, as it now had a result stemming from the standard library (in `warnings.py`) that points-to previously ignored. It might be possible to implement this as a type tracker on `LocalSourceNode`s, but with the added steps for the above operations, this was not obvious to me, and so I opted for the simpler "`smallstep`" variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically looks great, two comments.
flag = Value::named("sre_constants.SRE_FLAG_" + result).(ObjectInternal).intValue() and | ||
obj.(ObjectInternal).intValue().bitAnd(flag) = flag | ||
flag = Value::named("sre_constants.SRE_FLAG_" + result).(OI::ObjectInternal).intValue() and | ||
obj.(OI::ObjectInternal).intValue().bitAnd(flag) = flag | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this does not involve points-to, but please reassure me that this is not code we will be moving away from :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your statement confuses me. The change you have highlighted above does indeed involve points-to, and it is code we will be moving away from. However, it's a predicate that could have seen use outside of this codebase, and so I couldn't simply delete it.
I should add a comment saying it's deprecated, though, to make this clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the indicated statement. (And dated it, so we can actually see when we can delete it. 🙂)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, and I guess that the signature makes it rather awkward to rewrite it to not use points-to...
These were increased because of the indirection needed to get to the regex flags, but as we no longer rely on this, we can make do with a smaller import depth.
With the join-order fix, the performance is now well within parameters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides me wanting to understand the rationale behind t.continue()
a bit deeper, this PR LGTM 👍
exists(API::Node flag | flag_name = canonical_name(flag) and result = flag.getAUse()) | ||
or | ||
exists(BinaryExprNode binop, DataFlow::Node operand | | ||
operand.getALocalSource() = re_flag_tracker(flag_name, t.continue()) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a bit of trouble understanding if t.continue()
is the only right choice here.
I would have thought that if we were able to track a flag with some type tracker t2
, if that flag is used in a binary or operation, the resulting type-tracker would be the continuation of t2
. So
exists(BinaryExprNode binop, DataFlow::Node operand, DataFlow::TypeTracker t2 |
operand.getALocalSource() = re_flag_tracker(flag_name, t2) and
t = t2.continue() and
...
But I'm also wondering if we need this continue stuff at all, or we could just use re_flag_tracker(flag_name)
instead -- and if there would be any bad consequences of that.
exists(BinaryExprNode binop, DataFlow::Node operand |
operand.getALocalSource() = re_flag_tracker(flag_name) and
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your suggested rewrite would have at least two consequences, both of which may affect performance, and one of which will affect behaviour:
- By referring to
re_flag_tracker/1
, the fixpoint computation now has to evaluate both this andre_flag_tracker/2
at the same time. Currently,re_flag_tracker/1
is simply an extra join on top of the result of the other predicate. The impact on performance probably isn't terribly big, but I imagine there is some overhead in doing this. - Probably more impactful on performance is the fact that by not reusing the type tracker, we lose track of whether we have previously propagated the type information across a call. Thus, with your suggestion we might track into a call, then through a binary operation and then out of a different call to the same function. This is potentially a much larger set of nodes.
Finally, rewriting to use t2
actually doesn't change the behaviour. t = t2.continue()
is equivalent to t = t2 and t.attr = ""
(with a slight abuse of notation). Thus, continue
really just checks that we're not tracking an attribute (which makes sense -- we can add re
flags, but not objects that happen to have an re
flag in an attribute).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Finally, rewriting to use t2 actually doesn't change the behaviour. t = t2.continue() is equivalent to t = t2 and t.attr = "" (with a slight abuse of notation). Thus, continue really just checks that we're not tracking an attribute (which makes sense -- we can add re flags, but not objects that happen to have an re flag in an attribute).
Right. Although the behavior ends up being the same, for me, it just reads wrong semantically. I would like to use the t2
approach, and have made a suggestion for that (couldn't do it in this thread, since that only covered one of the lines 🤦)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere else (in Python and JavaScript) where this construction appears, and I do mean everywhere is in the form t.continue()
. We only see t2.continue()
(and currently only in JavaScript) when (back)track(t2,t)
is used in an adjacent disjunct.
So I think to suddenly change what is the standard idiom doesn't make much sense.
Rather, it seems to me that a better solution would be to add predicate canContinue
to TypeTracker
, with the meaning that this.attr
is empty. Then we can reuse t
in both places, as long as we make sure to check ... and t.canContinue()
and we don't have to waste precious space on declaring a variable that is exactly equal to t
anyway.
This, however, I think it outside the scope of the present PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that changing the way it's used probably isn't in the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
exists(BinaryExprNode binop, DataFlow::Node operand | | ||
operand.getALocalSource() = re_flag_tracker(flag_name, t.continue()) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exists(BinaryExprNode binop, DataFlow::Node operand | | |
operand.getALocalSource() = re_flag_tracker(flag_name, t.continue()) and | |
exists(BinaryExprNode binop, DataFlow::Node operand, DataFlow::TypeTracker t2 | | |
operand.getALocalSource() = re_flag_tracker(flag_name, t2) and | |
t = t2.continue() and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed for the reasons outlined in #5250 (comment)
Really, this boils down to "Port
re
library model to use API graphsinstead of points-to", which is what this PR actually does.
Instead of using points-to to track flags, we use a type tracker. To
handle multiple flags at the same time, we add additional flow from
x
tox | y
andy | x
and, as an added bonus, the above with
+
instead of|
, neatlyfixing #4707
I had to modify the
Qualified.ql
test slightly, as it now had aresult stemming from the standard library (in
warnings.py
) thatpoints-to previously ignored.
It might be possible to implement this as a type tracker on
LocalSourceNode
s, but with the added steps for the above operations,this was not obvious to me, and so I opted for the simpler
"
smallstep
" variant.Before merging, I need to
re
modelRegex.qll
, so that probably merits a change note.