Python: Dataflow, unpacking assignment #4752

yoff · 2020-11-30T15:44:13Z

It looked suspicious that we should need a taint tracking step for this, but it turns out that we had no handling in the data flow layer. I added a test and a data flow step, but we cannot yet remove the taint tracking step without getting test failures.

I think there are at least two reasons for this which should be solved (in future PRs, probably):

We have no data flow for slicing
We are modelling TAINTED_LIST as a variable and should probably use an actually tainted list.

Update:

Differences job for latest version: Job, no changes, no change in analysis time
Semantics has been simplified by modelling all target lists as tuples.
Changes from Python: Add new-style tests #4665 has been merged.

RasmusWL · 2020-11-30T17:03:24Z

We are modelling TAINTED_LIST as a variable and should probably use an actually tainted list.

Well. We should probably do both. At least for modeling parts of HTTP requests, the second one is required

my_tainted_list = [TAINTED_STRING, ...]
external_tainted_list = external_lib.returns_tainted_list()

RasmusWL

If we're adding support for iterable unpacking, we need a few more test-cases.

In Python 3, they extended iterable unpacking: https://www.python.org/dev/peps/pep-3132/

I created some tests for the python 3 specific stuff in https://github.com/github/codeql/blob/081d66eaa38a911e06c4606ae6c18eb0a55c2c82/python/ql/test/3/library-tests/taint/unpacking/test.py

BUT, notice that this is iterable unpacking. Maybe we should have an IterableElementContent?

Some examples we currently wouldn't cover:

a, *b, c = range(10)
print(a, b, c) # 0 [1, 2, 3, 4, 5, 6, 7, 8] 9

a, *b, c = (-i for i in range(10))
print(a, b, c) # 0 [-1, -2, -3, -4, -5, -6, -7, -8] -9

def foo():
    for i in range(10):
        yield i

a, *b, c = foo()
print(a, b, c) # 0 [1, 2, 3, 4, 5, 6, 7, 8] 9

yoff · 2021-01-14T06:59:21Z

A warning about reviewing this commit-by-commit. It does add most tests up front and then introduces progressively more complicated predicates to make the tests pass. But then at the end everything is simplified significantly with a big explanatory comment, so it may not be worth it to try to understand the interim state (although it still captures my thinking somewhat). I also got the test annotations wrong initially (and a few times later), so that might be confusing.

Let me know if you want me to produce a polished history.

RasmusWL

I've not reviewed all code in full yet, actually just the tests, but I'm going to request changes now, and go through the rest of the PR.

Iterable unpacking in `for`

We don't have any tests for using iterable unpacking in `for`, for example:

@expects(2)
def test_iterable_unpacking_in_for():
    tl = [(SOURCE, NONSOURCE), (SOURCE, NONSOURCE)]
    for x,y in tl:
        SINK(x)
        SINK_F(y)

but it would also be nice to have one that uses the new extended `*` feature from Python 3.
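As a rough illustration of what such a test would exercise, here is a minimal sketch of starred unpacking in a `for` header. The names `SOURCE`, `NONSOURCE`, and `split_rows` are hypothetical stand-ins, not the test suite's actual helpers, and the sketch only shows the runtime behaviour, not the data-flow expectations.

```python
# Hypothetical stand-ins for the SOURCE / NONSOURCE markers used in this thread.
SOURCE = "source"
NONSOURCE = "not a source"


def split_rows(rows):
    """Unpack each row with a starred target inside the `for` header."""
    heads, rests = [], []
    for head, *rest in rows:
        heads.append(head)  # first element of each row
        rests.append(rest)  # remaining elements, always collected into a list
    return heads, rests


tl = [(SOURCE, NONSOURCE, NONSOURCE), (SOURCE, NONSOURCE)]
heads, rests = split_rows(tl)
```

The starred target receives a fresh list on every iteration, so a test along these lines would presumably need to check list content on `rest` rather than tuple content.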

tausbn

Still chewing over some of the finer details of this, but I figured this was a good point to submit my review.

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

tausbn · 2021-01-15T09:57:30Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+ * In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)`
+ * directly to `receiver`.
+ *
+ * The strategy is then via several read-, store-, and flow steps:


I feel like having example code for each of these cases would greatly elucidate what's going on.

Yes, that is a good idea and should not be difficult to add.

I opted for an example after the step; inlining did not work quite as well.

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

python/ql/test/experimental/dataflow/coverage/test.py

tausbn · 2021-01-15T10:52:41Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

+
+/**
+ * A synthetic node representing an iterable element. Used for changing content type
+ * for instance from a `ListElement` to a `TupleElement`.


This example is exactly the same as for IterableSequence, which may make it unclear why both classes are needed.

Right, I could add something like "via a read step to an IterableElement followed by a store step" to the above and "usually via a read step from an IterableSequence followed by a store step" here.

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

RasmusWL

Overall I think these new complex tests show that it would be really nice to have inline-test-expectations for these tests. I would almost say it's essential to get done soon, since I did not look at the change to the .expected file and therefore don't actually know what this PR does and does not handle :(

I've added some comments here requesting to get things cleared up, but would like to chat with you (live) when you've had a chance to look them over, so I can better understand the whole of this PR 😉 just ping me when you're ready to do so

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

RasmusWL · 2021-01-15T11:12:35Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+ * where `a` should not receive content, but `b` and `c` should. `c` will be `["c"]` so
+ * should have the content converted and transferred, while `b` should read it.


Why should c receive content?

In my mind we should be able to see that b will receive TupleElementContent(0), so c should only receive ListElementContent if exists(TupleElementContent tc | tc.getIndex() >= 1) -- but possibly I'm misunderstanding something.

In this case b and c are both inside a list, so if there is any content in the tuple matched by [b, *c], then [b, *c] should have content ListElement [rest of access path], b should have [rest of access path], and c should have ListElement [rest of access path].

Hmm, my understanding of iterable unpacking is that it doesn't matter if you use normal parentheses or square brackets to match the pattern (but that having both can enable writing clean code). So

a, (b, c) = iterable
(a, (b, c)) = iterable
[a, [b, c]] = iterable
a, [b, c] = iterable

are all semantically equivalent -- or have I misunderstood something?

I would say that, apart from the first two, these are all different. But the difference is only in the types of intermediate results, so you only see it if that difference matters. It might for instance matter for runtime. It matters for us, because we distinguish list-content and tuple-content. In terms of the values of a, b, and c, the interpreter does not see a difference because the third element of a list is just as precisely known as the third element of a tuple and so it can move back and forth between the two without noticing. Our analysis does see a difference, which is why it does not agree with the interpreter on the following program:

l = [SOURCE, NONSOURCE]
t = (l[0], l[1])

The interpreter sees t having SOURCE only in the first component, our analysis says it can be found in both.
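To make the contrast concrete, here is a toy sketch of the two views. It is an assumption-laden illustration, not how the CodeQL library is implemented: `abstract_list`, `read_list`, and `abstract_tuple` are hypothetical helpers that model list content as a single shared flag and tuple content per index.

```python
def abstract_list(taints):
    """Toy list model: one flag standing for 'some element is tainted'."""
    return {"ListElement": any(taints)}


def read_list(model, index):
    """Reading any index returns the shared flag; the index is ignored."""
    return model["ListElement"]


def abstract_tuple(taints):
    """Toy tuple model: one flag per index, so positions stay distinct."""
    return {i: t for i, t in enumerate(taints)}


# Abstract view of: l = [SOURCE, NONSOURCE]; t = (l[0], l[1])
l_model = abstract_list([True, False])
t_model = abstract_tuple([read_list(l_model, 0), read_list(l_model, 1)])

# Concrete view of the same program, as the interpreter runs it.
SOURCE, NONSOURCE = "source", "nonsource"
l = [SOURCE, NONSOURCE]
t = (l[0], l[1])
```

In the toy model both tuple positions end up flagged, while the concrete tuple holds the source only at index 0.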

So, just to be perfectly clear, do you agree that these are all semantically equivalent? (in terms of Python semantics)

What is Python semantics? They may incur different runtimes or memory footprints, and if you manage to change the subscripting of either list or tuple, they could also give different values.

What is anything, really? :)

My point here is that we could make our analysis a bit simpler by ignoring whether square brackets or normal parentheses were used on the LHS of the assignment. I'm guessing that with our current Python libraries, we represent these as TupleNode or ListNode, but I don't think we need to make the distinction here.

After spending some time looking for evidence on this, I finally found https://docs.python.org/3/reference/simple_stmts.html#assignment-statements:

Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.

So if a target is actually a target list (to use the wording from that Python reference), I think we can just consider that target list to be a tuple. This will allow us to simplify the data-flow modeling a bit, and might lead to a bit more precision in some cases (for example (a, [b, *c]) = ("a", ("tainted string", "c"))).

non-goals

I'm not suggesting we should change the behavior of our data-flow modeling in general. So in the example below, when tracking data-flow from SOURCE, I'm perfectly fine with modeling content of l as some element, meaning that we think the content might flow to b. If we ever want to change that, this is not the PR to do it in at least 😄

l = [SOURCE, NON_SOURCE]
t = (SOURCE, NON_SOURCE)
a, b = l
x, y = t

What is anything, really? :)

I meant that one could define many semantics for Python, depending on one's interests. One such might or might not include expected runtime or memory footprint.

After spending some time looking for evidence on this, I finally found https://docs.python.org/3/reference/simple_stmts.html#assignment-statements:

Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.

So if a target is actually a target list (to use the wording from that Python reference), I think we can just consider that target list to be a tuple. This will allow us to simplify the data-flow modeling a bit, and might lead to a bit more precision in some cases (for example (a, [b, *c]) = ("a", ("tainted string", "c"))).

Indeed, it looks like you are right! The different notations are just there for convenience and always denote a "target list", which we can indeed model as a tuple.
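For reference, a small sketch that checks this at runtime. The helper `bind` is hypothetical and only compares the bound values; it says nothing about runtime cost or about how the analysis models the intermediate sequences.

```python
def bind(value):
    """Unpack `value` with four spellings of the same target list."""
    a1, (b1, c1) = value
    (a2, (b2, c2)) = value
    [a3, [b3, c3]] = value
    a4, [b4, c4] = value
    return [(a1, b1, c1), (a2, b2, c2), (a3, b3, c3), (a4, b4, c4)]


results = bind(("a", ["b", "c"]))
all_equal = all(r == results[0] for r in results)

# The example from the thread: the starred target is bound to a list,
# whatever bracket style is used and whatever the type of the RHS is.
(a, [b, *c]) = ("a", ("tainted string", "c"))
```

All four spellings bind the same values, and `c` ends up as a list even though the nested right-hand side is a tuple.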

Reading the reference, it also occurred to me that we may not handle the assignment (and the possible call to a special method) hidden in

a, b[0] = 1, 2

a, b[0] = 1, 2

*goes to check in ipython* ... good catch! Would be great with a test-case at least, but not sure if you want to handle that (more general) case in this PR?

I think adding a test case is probably a good idea, but I don't think fixing this needs to be done in this PR.
Also, it seems like this is two different issues:

1. Correct handling of content when assigning to a subscripted expression, and
2. correctly handling the case where __setitem__ exists on b.

Of these, I would say 1. is probably the higher priority (though there may exist fancy libraries that use __setitem__ to do clever stuff, so we may end up needing both).
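A quick sketch of both cases, for anyone who wants to see the hidden call. The class `Recorder` is a hypothetical example, not something from the test suite.

```python
class Recorder:
    """Records every subscript assignment instead of storing the value."""

    def __init__(self):
        self.calls = []

    def __setitem__(self, key, value):
        self.calls.append((key, value))


# Case 2: the subscripted target triggers a call to __setitem__.
b = Recorder()
a, b[0] = 1, 2

# Case 1: the same assignment against a plain list stores content at index 0.
plain = [None]
x, plain[0] = 1, 2
```

After this runs, `b.calls` holds the single call `(0, 2)` and `plain` is `[2]`, so the unpacking assignment both stores content and may invoke user code.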

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

yoff · 2021-01-15T19:27:18Z

I have addressed the issues raised so far except for. Mentally I have been solving unpacking assignment rather than iterated unpacking. I realise that we should probably rename the module and have it be used in all the contexts where iterated unpacking happens (for-iteration and comprehensions). It will probably not be very much code to add for `for`, and I need to add a `for` step anyway, so I will try that shortly. But those extensions could also be postponed to a separate PR...

tausbn

I think this wins the prize for "biggest QLDoc comment". 😄
I have added a few comments and suggestions, but I think this looks like it should be possible to merge really soon. Nicely done! 🎉

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

tausbn · 2021-01-20T14:18:13Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

+   * A synthetic node representing that there may be an iterable element
+   * for `consumer` to consume.
+   */
+  TIterableElement(UnpackingAssignmentTarget consumer)


For most of these disjuncts, we have names of the form ...Node (though notably TKwUnpacked slipped through the cracks somehow). Should we perhaps endeavour to keep this consistent?

Sure, that does seem to be a convention :-)

tausbn · 2021-01-20T14:21:44Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

+  override string toString() { result = "IterableSequence" }
+
+  override DataFlowCallable getEnclosingCallable() {
+    result = any(CfgNode node | node = TCfgNode(consumer)).getEnclosingCallable()


There's something about the way this is written that bugs me a bit. I'm guessing it is the case that any IterableSequence is in fact also a CfgNode (otherwise we would have a consistency problem for this predicate). In that case, it's slightly awkward that we have to jump back through the IPA type constructor to get this link back.

Could we perhaps instead just say that TIterableSequence contains not just a ControlFlowNode, but in fact a CfgNode that itself contains said ControlFlowNode?

This might make other parts of the code a bit more awkward, though...

Perhaps we could do a more local refactor, making the field, consumer, a CfgNode and have the charpred link them up?

yoff · 2021-01-20T16:01:50Z

Differences job for version without unhelpful store steps: Job showing no changes and analysis times with salt seeing an increase just shy of 10% (so somewhat concerning).

yoff · 2021-01-20T18:33:49Z

New job after merging in main: Job, no changes, Analysis times. Much smaller increase this time :-)

yoff · 2021-01-22T18:43:41Z

Simplified the model, knowing that all LHS sequences are the same (now modelled with TupleElementContent). It did not immediately allow me to remove TIterableSequence as I had hoped, because we may still read some nested list content that we need to convert.

I also added slightly more precise modelling of indices in the presence of starred variables.

yoff · 2021-01-22T18:47:16Z

New job, no changes, no change in analysis time.

yoff · 2021-01-25T15:51:15Z

Invalid test was rejected (yay).

tausbn

A few minor bits, but then I'll merge it, I promise! :)

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPublic.qll

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

RasmusWL

I spent a considerable amount of time looking through the comments, and trying to understand it all :) I think I do now, and the docs were very helpful 💪

I've made a couple of suggestions to make them even better (I hope)

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

RasmusWL · 2021-01-26T17:47:35Z

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

+ * The strategy for converting content type is to break the transfer up into a read step
+ * and a store step, together creating a converting transfer step.
+ * For this we need a synthetic node in the middle, which we call `TIterableElement(receiver)`.
+ * It is associated with the receiver of the transfer, because we know the receiver type (tuple) from the syntax.
+ * Since we sometimes need a converting read step (in the example above, `[b, *c]` reads the content
+ * `ListElementContent` but should have content `TupleElementContent(0)` and `TupleElementContent(0)`),
+ * we actually need a second synthetic node. A converting read step is a read step followed by a
+ * converting transfer.
+ *
+ * We can have a uniform treatment by always having two synthetic nodes and so we can view it as
+ * two stages of the same node. So we read into (or transfer to) `TIterableSequence(receiver)`,
+ * from which we take a read step to `TIterableElement(receiver)` and then a store step to `receiver`.
+ *
+ * In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)`
+ * directly to `receiver`.
+ *
+ * The strategy is then via several read-, store-, and flow steps:


I think this part is unclear, and I would like to rewrite it.

Suggested change

- * The strategy for converting content type is to break the transfer up into a read step
- * and a store step, together creating a converting transfer step.
- * For this we need a synthetic node in the middle, which we call `TIterableElement(receiver)`.
- * It is associated with the receiver of the transfer, because we know the receiver type (tuple) from the syntax.
- * Since we sometimes need a converting read step (in the example above, `[b, *c]` reads the content
- * `ListElementContent` but should have content `TupleElementContent(0)` and `TupleElementContent(0)`),
- * we actually need a second synthetic node. A converting read step is a read step followed by a
- * converting transfer.
- *
- * We can have a uniform treatment by always having two synthetic nodes and so we can view it as
- * two stages of the same node. So we read into (or transfer to) `TIterableSequence(receiver)`,
- * from which we take a read step to `TIterableElement(receiver)` and then a store step to `receiver`.
- *
- * In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)`
- * directly to `receiver`.
- *
- * The strategy is then via several read-, store-, and flow steps:
+ * To transfer content from RHS to the elements of the LHS in the expression `sequence = iterable`, we use two synthetic nodes:
+ * - `TIterableSequence(sequence)` which captures the content-modeling the entire `sequence` will have
+ *   (essentially just a copy of the content-modeling the RHS has)
+ * - `TIterableElement(sequence)` which captures the content-modeling that will be assigned to an element. Note that
+ *   an empty access path means that the value we are tracking flows directly to the element.
+ *
+ * Since we need to handle recursive structures on the LHS, we can have a uniform treatment by always having these
+ * two synthetic nodes and so we can view it as two stages of the same node. So we read into (or transfer to)
+ * `TIterableSequence(receiver)`, from which we take a read step to `TIterableElement(receiver)` and then a store step
+ * to `receiver`.
+ *
+ * This is accomplished via several read-, store-, and flow steps:

Generally happy with the rewrite :-) Two questions, though:

Did you mean to leave out this bit "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver) directly to receiver."? (Do you feel that is implied?)

Since we are here, do you feel the two stages help? I originally wanted to make it a single branch TIterableElement(receiver, direction) with direction being "in" or "out", but the two branches seemed much clearer and cleaner. So perhaps the two stages idea is not relevant anymore?

Did you mean to leave out this bit "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver) directly to receiver."? (Do you feel that is implied?)

I must admit that "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver) directly to receiver." never really made sense to me.

My understanding is that the only time Step 2 (TIterableSequence(sequence) -> sequence) is useful is when the content-modeling is already TupleElementContent. The unpackingAssignmentFlowStep predicate doesn't have any constraint showing this, but the only way to transfer things to the elements of the sequence is to use Step 5, which always requires that the content-modeling is TupleElementContent.

If I'm misunderstanding something here, we definitely need to highlight this in some way. If my understanding is correct, we might want to add a comment to the explanation of Step 2, saying it's only needed to handle things (from RHS) where the content-modeling is already using TupleElementContent.

Since we are here, do you feel the two stages help? I originally wanted to make it a single branch TIterableElement(receiver, direction) with direction being "in" or "out", but the two branches seemed much clearer and cleaner. So perhaps the two stages idea is not relevant anymore?

No I'm happy about the current setup 👍

I ended up creating a mashup, turns out I still felt a need for justifying TIterableSequence 😅
Your understanding above is correct and hopefully now fully reflected in the comment.
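For readers who find the step description abstract, here is a toy simulation of the idea in plain Python. It is only a sketch under loud assumptions: access paths are modelled as tuples of content labels, and `read`, `store`, and `to_receiver` are hypothetical helpers, not the QL predicates from DataFlowPrivate.qll.

```python
LIST_ELEMENT = ("ListElement",)


def tuple_element(index):
    """Hypothetical label for tuple content at a known index."""
    return ("TupleElement", index)


def read(path, content):
    """Read step: drop `content` from the head of the access path, if present."""
    if path and path[0] == content:
        return path[1:]
    return None


def store(path, content):
    """Store step: push `content` onto the head of the access path."""
    return (content,) + path


def to_receiver(path, arity):
    """Paths reaching `receiver` from the toy IterableSequence node."""
    if path and path[0][0] == "TupleElement":
        # Flow step: content is already tuple content, keep it precise.
        return [path]
    element = read(path, LIST_ELEMENT)  # IterableSequence -> IterableElement
    if element is None:
        return []
    # IterableElement -> receiver: any index may hold the element.
    return [store(element, tuple_element(i)) for i in range(arity)]


from_list = to_receiver((LIST_ELEMENT,), arity=2)
from_tuple = to_receiver((tuple_element(0),), arity=2)
```

In this toy, list content fans out to every tuple index, while content that is already tuple content passes through unchanged, which is the role described for Step 2 above.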

python/ql/src/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

RasmusWL

Thanks for the clarifications. I think it will greatly help anyone who dares to look at this in the future 😄 ... probably to the point where it no longer feels so complicated that looking at it can be described as a huge undertaking (since the docs are clearer now, it's easier to understand) 💪

tausbn

🚀 🎉

What a ride!

yoff added 3 commits November 30, 2020 14:18

Python: Test for unpacking assignment

673ff90

Python: Adjust test expectations

f345e55

Python: Add read step for unpacking assignment

289b9e6

yoff requested a review from a team as a code owner November 30, 2020 15:44

github-actions bot added the Python label Nov 30, 2020

RasmusWL requested changes Nov 30, 2020

yoff added 11 commits January 12, 2021 12:30

Python: Add more unpacking tests

4d9f5be

Python: add tests for conversion during unpacking

9c08467

Python: start support for nested unpacking

a1ab5cc

Python: add test annotations

d8d8b45

Python: model conversion during unpacking

4ee2f49

Python: start handling iterated unpacking

b10cf78

Python: Test interaction between nesting,

b2d95e6

iteration, and conversion

Python: big refactor and fix tests

36a4a50

Make sure tests are valid Fix wrong test annotations Big refactor to make code readable Big comment to explain code

Python: Fix inconsostencies to fix flow

e3199fb

(and fix annotations again)

Python: Final(?!) fix of annotations

6dc0d69

Python: FIx flow

dfdfd3c

yoff requested a review from RasmusWL January 14, 2021 06:55

RasmusWL requested changes Jan 15, 2021

tausbn requested changes Jan 15, 2021

Apply suggestions from code review

48910d0

Co-authored-by: Taus <tausbn@github.com>

tausbn reviewed Jan 15, 2021

RasmusWL reviewed Jan 15, 2021

yoff and others added 2 commits January 15, 2021 18:50

Apply suggestions from code review

1edad03

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Address reviews

5f189a7

yoff requested review from RasmusWL and tausbn January 15, 2021 19:27

yoff requested a review from tausbn January 19, 2021 19:20

tausbn requested changes Jan 20, 2021

yoff and others added 2 commits January 20, 2021 17:38

Apply suggestions from code review

e072864

Co-authored-by: Taus <tausbn@github.com>

Merge branch 'main' of github.com:github/codeql into python-dataflow-…

7a5d553

…unpacking-assignment

yoff added 3 commits January 21, 2021 10:43

Python: Have Node-postfix consistently

19918e2

Python: Small refactor

bc1b507

Python: Elaborate comments for steps

88db8f5

yoff requested a review from tausbn January 21, 2021 09:56

yoff added 2 commits January 22, 2021 16:26

Merge branch 'main' of github.com:github/codeql into python-dataflow-…

f948ef8

…unpacking-assignment

Python: Simplify modelling

0d20a4c

Python: fix test expectation

4ff2c6d

probably a copy-paste error..

tausbn requested changes Jan 25, 2021

Apply suggestions from code review

09bb300

Co-authored-by: Taus <tausbn@github.com>

yoff requested a review from tausbn January 25, 2021 20:58

RasmusWL requested changes Jan 26, 2021

Apply suggestions from code review

500ea12

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

yoff commented Jan 26, 2021

yoff and others added 3 commits January 26, 2021 19:16

Update python/ql/src/semmle/python/dataflow/new/internal/DataFlowPriv…

cd85cf1

…ate.qll

Python: autoformat

d18c160

Python: Adjust comment based on review.

0e0b18c

yoff requested a review from RasmusWL January 28, 2021 00:13

RasmusWL approved these changes Jan 28, 2021

yoff mentioned this pull request Jan 28, 2021

Python: dataflow, unify iterated unpacking #5047

Merged

tausbn approved these changes Jan 29, 2021

tausbn merged commit cb195a0 into github:main Jan 29, 2021
