-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Python: Dataflow, unpacking assignment #4752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Dataflow, unpacking assignment #4752
Conversation
Well. We should probably do both. At least for modeling parts of HTTP requests, the second one is required my_tainted_list = [TAINTED_STRING, ...]
external_tainted_list = external_lib.returns_tainted_list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're adding support for iterable unpacking, we need a few more test-cases.
In Python 3, they extended interable unpacking: https://www.python.org/dev/peps/pep-3132/
I created some tests for the python 3 specific stuff in https://github.com/github/codeql/blob/081d66eaa38a911e06c4606ae6c18eb0a55c2c82/python/ql/test/3/library-tests/taint/unpacking/test.py
BUT, notice that this is iterable unpacking. Maybe we should have a IterableElementContent
?
Some examples we currently wouldn't cover:
a, *b, c = range(10)
print(a, b, c) # 0 [1, 2, 3, 4, 5, 6, 7, 8] 9
a, *b, c = (-i for i in range(10))
print(a, b, c) # 0 [-1, -2, -3, -4, -5, -6, -7, -8] -9
def foo():
for i in range(10):
yield i
a, *b, c = foo()
print(a, b, c) # 0 [1, 2, 3, 4, 5, 6, 7, 8] 9
iteration, and conversion
Make sure tests are valid Fix wrong test annotations Big refactor to make code readable Big comment to explain code
(and fix annotations again)
A warning about reviewing this commit-by-commit. It does add most tests up front and then introduce progressively more complicate predicates to make the tests pass. But then at the end everything is simplified significantly with a big explanatory comment, so it may not be worth it to try to understand the interim state (although it still captures my thinking somewhat). I also got the test annotations wrong initially (and a few times later), so that might be confusing. Let me know if you want me to produce a polished history.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not reviewed all code in full yet, actually just the tests, but I'm going to request changes now, and go through the rest of the PR.
Iterable unpacking in for
We don't have any tests for using iterable unpacking in for
, for example
@expects(2)
def test_iterable_unpacking_in_for():
tl = [(SOURCE, NONSOURCE), (SOURCE, NONSOURCE)]
for x,y in tl:
SINK(x)
SINK_F(y)
but it would also be nice to have one that uses new extended *
feature from Python 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still chewing over some of the finer details of this, but I figured this was a good point to submit my review.
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
* In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)` | ||
* directly to `receiver`. | ||
* | ||
* The strategy is then via several read-, store-, and flow steps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like having example code for each of these cases would greatly elucidate what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is a good idea and should not be difficult to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted for an example after the step; inlining did not work quite as well.
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
|
||
/** | ||
* A synthetic node representing an iterable element. Used for changing content type | ||
* for instance from a `ListElement` to a `TupleElement`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is exactly the same as for IterableSequence
, which may make it unclear why both classes are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I could add something like "via a read step to an IterableElement followed by a store step" to the above and "usually via a read step from an IterableSequence followed by a store step" here.
Co-authored-by: Taus <tausbn@github.com>
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think these new complex tests show that it would be really nice to have inline-test-expectations for these tests. I would almost say it's essential to get done soon, since I did not look at the change to the .expected file, and therefore don't actually know what things this PR handles and not handles :(
I've added some comments here requesting to get things cleared up, but would like to chat with you (live) when you've had a chance to look them over, so I can better understand the whole of this PR 😉 just ping me when you're ready to do so
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
* where `a` should not receive content, but `b` and `c` should. `c` will be `["c"]` so | ||
* should have the content converted and transferred, while `b` should read it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should c
receive content?
In my mind we should be able to see that b
will receive TupleElementContent(0)
, so c
should only receive ListElementContent
if exists(TupleElementContent tc | tc.getIndex() >= 1)
-- but possibly I'm misunderstanding something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case b
and c
are both inside a list so if there is any content in the tuple matched by [b, *c]
then [b, *c]
should have content ListElement [rest of acces path]
and b
should have [rest of acces path]
and c
should have ListElement [rest of acces path]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, my understanding of iterable unpacking is that it doesn't matter if you use normal parentheses or square brackets to match the pattern (but that having both can enable writing clean code). So
a, (b, c) = iterable
(a, (b, c)) = iterable
[a, [b, c]] = iterable
a, [b, c] = iterable
are all semantically equivalent -- or have I misunderstood something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that, apart from the first two, these are all different. But the difference is only in the types of intermediate results, so you only see it if that difference matters. It might for instance matter for runtime. It matters for us, because we distinguish list-content and tuple-content. In terms of the values of a
, b
, and c
, the interpreter does not see a difference because the third element of a list is just as precisely known as the third element of a tuple and so it can move back and forth between the two without noticing. Our analysis does see a difference, which is why it does not agree with the interpreter on the following program:
l = [SOURCE, NONSOURCE]
t = (l[0], l[1])
The interpreter sees t
having SOURCE
only in the first component, our analysis says it can be found in both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, just to be perfectly clear, do you agree that these are all semantically equivalent? (in terms of Python semantics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is Python semantics? They may incur different runtimes or memory footprints, and if you manage to change the subscripting of either list or tuple, they could also give different values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is anything, really? :)
My point here is that we could make our analysis a bit simpler, by ignoring whether square brackets or normal parenthesis was used on the LHS of the assignment. I'm guessing that with our current Python libraries, we represent these as TupleNode
or ListNode
, but I don't think we need to make the distinction here.
After spending some time looking for evidence on this, I finally found https://docs.python.org/3/reference/simple_stmts.html#assignment-statements:
Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.
So if a target is actually a target list (to use the wording from that Python reference), I think we can just consider that target list to be a tuple. This will allow us to simplify the data-flow modeling a bit, and might lead to a bit more precision in some cases (for example (a, [b, *c]) = ("a", ("tainted string", "c"))
).
non-goals
I'm not suggesting we should change the behavior of our data-flow modeling in general. So in the example below, when tracking data-flow from SOURCE
, I'm perfectly fine with modeling content of l
as some element, meaning that we think the content might flow to b
. If we ever want to change that, this is not the PR to do it in at least 😄
l = [SOURCE, NON_SOURCE]
t = (SOURCE, NON_SOURCE)
a, b = l
x, y = t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is anything, really? :)
I meant that one could define many semantics for Python, depending on ones interest. One such might or might not include expected runtime or memory footprint.
After spending some time looking for evidence on this, I finally found https://docs.python.org/3/reference/simple_stmts.html#assignment-statements:
Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.
So if a target is actually a target list (to use the wording from that Python reference), I think we can just consider that target list to be a tuple. This will allow us to simplify the data-flow modeling a bit, and might lead to a bit more precision in some cases (for example
(a, [b, *c]) = ("a", ("tainted string", "c"))
).
Indeed, it looks like you are right! The different notations are just there for convenience and always denote a "target list", which can indeed model as a tuple.
Reading the reference, it also occurred to me that we may not handle the assignment (and the possible call to a special method) hidden in
a, b[0] = 1, 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a, b[0] = 1, 2
*goes to check in ipython* ... good catch! Would be great with a test-case at least, but not sure if you want to handle that (more general) case in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding a test case is probably a good idea, but I don't think fixing this needs to be done in this PR.
Also, it seems like this is two different issues:
- Correct handling of content when assigning to a subscripted expression, and
- correctly handling the case where
__setitem__
exists onb
.
Of these, I would say 1. is probably the higher priority (though there may exist fancy libraries that use __setitem__
to do clever stuff, so we may end up needing both).
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
I have addressed the issues raised so far except |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this wins the prize for "biggest QLDoc comment". 😄
I have added a few comments and suggestions, but I think this looks like it should be possible to merge really soon. Nicely done! 🎉
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll
Outdated
Show resolved
Hide resolved
* A synthetic node representing that there may be an iterable element | ||
* for `consumer` to consume. | ||
*/ | ||
TIterableElement(UnpackingAssignmentTarget consumer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most of these disjuncts, we have names of the form ...Node
(though notably TKwUnpacked
slipped through the cracks somehow). Should we perhaps endeavour to keep this consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that does seem to be a convention :-)
override string toString() { result = "IterableSequence" } | ||
|
||
override DataFlowCallable getEnclosingCallable() { | ||
result = any(CfgNode node | node = TCfgNode(consumer)).getEnclosingCallable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's something about the way this is written that bugs me a bit. I'm guessing it is the case that any IterableSequence
is in fact also a CfgNode
(otherwise we would have a consistency problem for this predicate). In that case, it's slightly awkward that we have to jump back through the IPA type constructor to get this link back.
Could we perhaps instead just say that TIterableSequence
contains not just a ControlFlowNode
, but in fact a CfgNode
that itself contains said ControlFlowNode
?
This might make other parts of the code a bit more awkward, though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could do a more local refactor, making the field, consumer
, a CfgNode
and have the charpred link them up?
Differences job for version without unhelpful store steps: Job showing no changes and analysis times with salt seeing an increase just shy of 10% (so somewhat concerning). |
Co-authored-by: Taus <tausbn@github.com>
…unpacking-assignment
New job after merging in main: Job, no changes, Analysis times. Much smaller increase this time :-) |
Simplified the model, knowing that all LHS sequences are the same (now modelled with I also added slightly more precise modelling of indices in the presence of starred variables. |
New job, no changes, no change in analysis time. |
probably a copy-paste error..
Invalid test was rejected (yay). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor bits, but then I'll merge it, I promise! :)
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Taus <tausbn@github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spend a considerable amount of time looking through the comments, and trying to understand it all :) I think I do now, and the docs were very helpful 💪
I've made a couple of suggestions to make them even better (I hope)
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
* The strategy for converting content type is to break the transfer up into a read step | ||
* and a store step, together creating a converting transfer step. | ||
* For this we need a synthetic node in the middle, which we call `TIterableElement(receiver)`. | ||
* It is associated with the receiver of the transfer, because we know the receiver type (tuple) from the syntax. | ||
* Since we sometimes need a converting read step (in the example above, `[b, *c]` reads the content | ||
* `ListElementContent` but should have content `TupleElementContent(0)` and `TupleElementContent(0)`), | ||
* we actually need a second synthetic node. A converting read step is a read step followed by a | ||
* converting transfer. | ||
* | ||
* We can have a uniform treatment by always having two synthetic nodes and so we can view it as | ||
* two stages of the same node. So we read into (or transfer to) `TIterableSequence(receiver)`, | ||
* from which we take a read step to `TIterableElement(receiver)` and then a store step to `receiver`. | ||
* | ||
* In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)` | ||
* directly to `receiver`. | ||
* | ||
* The strategy is then via several read-, store-, and flow steps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this part is unclear, and I would like to rewrite it.
* The strategy for converting content type is to break the transfer up into a read step | |
* and a store step, together creating a converting transfer step. | |
* For this we need a synthetic node in the middle, which we call `TIterableElement(receiver)`. | |
* It is associated with the receiver of the transfer, because we know the receiver type (tuple) from the syntax. | |
* Since we sometimes need a converting read step (in the example above, `[b, *c]` reads the content | |
* `ListElementContent` but should have content `TupleElementContent(0)` and `TupleElementContent(0)`), | |
* we actually need a second synthetic node. A converting read step is a read step followed by a | |
* converting transfer. | |
* | |
* We can have a uniform treatment by always having two synthetic nodes and so we can view it as | |
* two stages of the same node. So we read into (or transfer to) `TIterableSequence(receiver)`, | |
* from which we take a read step to `TIterableElement(receiver)` and then a store step to `receiver`. | |
* | |
* In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)` | |
* directly to `receiver`. | |
* | |
* The strategy is then via several read-, store-, and flow steps: | |
* To transfer content from RHS to the elements of the LHS in the expression `sequence = iterable`, we use two synthetic nodes: | |
* - `TIterableSequence(sequence)` which captures the content-modeling the entire `sequence` will have | |
* (essentially just a copy of the content-modeling the RHS has) | |
* - `TIterableElement(sequence)` which captures the content-modeling that will be assigned to an element. Note that | |
* an empty access path means that the value we are tracking flows directly to the element. | |
* | |
* Since we need to handle recursive structures on the LHS, we can have a uniform treatment by always having these | |
* two synthetic nodes and so we can view it as two stages of the same node. So we read into (or transfer to) | |
* `TIterableSequence(receiver)`, from which we take a read step to `TIterableElement(receiver)` and then a store step | |
* to `receiver`. | |
* | |
* This is accomplished via several read-, store-, and flow steps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally happy with the rewrite :-) Two questions, though:
- Did you mean to leave out this bit "In order to preserve precise content, we also take a flow step from
TIterableSequence(receiver)
directly toreceiver
."? (Do you feel that is implied?) - Since we are here, do you feel the two stages help? I originally wanted to make it a single branch
TIterableElement(receiver, direction)
withdirection
being"in"
or"out"
, but the two branches seemed much clearer end cleaner. So perhaps the two stages idea is not relevant anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Did you mean to leave out this bit "In order to preserve precise content, we also take a flow step from
TIterableSequence(receiver)
directly toreceiver
."? (Do you feel that is implied?)
I must admit that "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver)
directly to receiver
." never really made sense to me.
My understanding is that the only time Step 2 (TIterableSequence(sequence)
-> sequence
) is useful, is when the content-modeling is already TupleElementContent
. The unpackingAssignmentFlowStep
predicate doesn't have any constraint showing this, but the only way to transfer things to the elements of the sequence is to use Step 5, which always requires that the content-modeling is TupleElementContent
.
If I'm misunderstanding somewhere here, we definitely need to highlight this in some way. If my understanding is correct, we might want to add a comment to the explanation of Step 2, saying it's only needed to handle things (from RHS) where the content-modeling is already using TupleElementContent
- Since we are here, do you feel the two stages help? I originally wanted to make it a single branch
TIterableElement(receiver, direction)
withdirection
being"in"
or"out"
, but the two branches seemed much clearer end cleaner. So perhaps the two stages idea is not relevant anymore?
No I'm happy about the current setup 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up creating a mashup, turns out I still felt a need for justifying TIterableSequence
😅
Your understanding above is correct and hopefully now fully reflected in the comment.
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifications. I think it will greatly help anyone who dares to look at this in the future 😄 ... probably to the point where it doesn't feel so complicated that it can be described as huge undertaking to look at it (since the docs are more clear now, it's easier to understand) 💪
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🎉
What a ride!
It looked suspicious that we should need a taint tracking step for this, but it turns out that we had no handling in the data flow layer. I added a test and a data flow step, but we cannot yet remove the taint tracking step without getting test failures.
I think there are at least two reasons for this which should be solved (in future PRs, probably):
TAINTED_LIST
as a variable and should probably use an actually tainted list.Update: