Conversation

Contributor

@yoff yoff commented Nov 30, 2020

It looked suspicious that we should need a taint tracking step for this, but it turns out that we had no handling in the data flow layer. I added a test and a data flow step, but we cannot yet remove the taint tracking step without getting test failures.

I think there are at least two reasons for this which should be solved (in future PRs, probably):

  1. We have no data flow for slicing
  2. We are modelling TAINTED_LIST as a variable and should probably use an actually tainted list.
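For illustration, a rough sketch of what test cases for both gaps might look like (using the TAINTED_LIST/TAINTED_STRING/SINK names from the existing taint tests; the exact shape of such tests is an assumption here, not something this PR adds):

sliced = TAINTED_LIST[1:]            # 1. slicing: no data flow step models this read yet
SINK(sliced[0])

really_tainted = [TAINTED_STRING]    # 2. a list whose element actually is tainted,
SINK(really_tainted[0])              #    rather than a variable merely named TAINTED_LIST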

Update:

@yoff yoff requested a review from a team as a code owner November 30, 2020 15:44
@RasmusWL
Member

> We are modelling TAINTED_LIST as a variable and should probably use an actually tainted list.

Well. We should probably do both. At least for modeling parts of HTTP requests, the second one is required:

my_tainted_list = [TAINTED_STRING, ...]
external_tainted_list = external_lib.returns_tainted_list()

Member

@RasmusWL RasmusWL left a comment

If we're adding support for iterable unpacking, we need a few more test-cases.

In Python 3, they extended iterable unpacking: https://www.python.org/dev/peps/pep-3132/

I created some tests for the python 3 specific stuff in https://github.com/github/codeql/blob/081d66eaa38a911e06c4606ae6c18eb0a55c2c82/python/ql/test/3/library-tests/taint/unpacking/test.py

BUT, notice that this is iterable unpacking. Maybe we should have an IterableElementContent?

Some examples we currently wouldn't cover:

a, *b, c = range(10)
print(a, b, c) # 0 [1, 2, 3, 4, 5, 6, 7, 8] 9

a, *b, c = (-i for i in range(10))
print(a, b, c) # 0 [-1, -2, -3, -4, -5, -6, -7, -8] -9

def foo():
    for i in range(10):
        yield i

a, *b, c = foo()
print(a, b, c) # 0 [1, 2, 3, 4, 5, 6, 7, 8] 9

@yoff yoff requested a review from RasmusWL January 14, 2021 06:55
Contributor Author

yoff commented Jan 14, 2021

A warning about reviewing this commit-by-commit. It adds most tests up front and then introduces progressively more complicated predicates to make the tests pass. But then at the end everything is simplified significantly with a big explanatory comment, so it may not be worth it to try to understand the interim state (although it still captures my thinking somewhat). I also got the test annotations wrong initially (and a few times later), so that might be confusing.

Let me know if you want me to produce a polished history...

Member

@RasmusWL RasmusWL left a comment

I've not reviewed all the code in full yet (actually just the tests), but I'm going to request changes now and then go through the rest of the PR.

Iterable unpacking in for

We don't have any tests for using iterable unpacking in `for` loops, for example:

@expects(2)
def test_iterable_unpacking_in_for():
    tl = [(SOURCE, NONSOURCE), (SOURCE, NONSOURCE)]
    for x, y in tl:
        SINK(x)
        SINK_F(y)

but it would also be nice to have one that uses the new extended `*` feature from Python 3.
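For instance, a sketch of such a test in the same style as the one above (the test name is hypothetical, and the @expects count assumes the same convention of counting expected flows):

@expects(2)
def test_extended_unpacking_in_for():
    tl = [(SOURCE, NONSOURCE, NONSOURCE), (SOURCE, NONSOURCE, NONSOURCE)]
    for x, *rest in tl:
        SINK(x)       # the first element of each tuple is tainted
        SINK_F(rest)  # the starred rest only collects untainted values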

Contributor

@tausbn tausbn left a comment

Still chewing over some of the finer details of this, but I figured this was a good point to submit my review.

* In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)`
* directly to `receiver`.
*
* The strategy is then via several read-, store-, and flow steps:
Contributor

I feel like having example code for each of these cases would greatly elucidate what's going on.

Contributor Author

Yes, that is a good idea and should not be difficult to add.

Contributor Author

I opted for an example after the step; inlining did not work quite as well.


/**
* A synthetic node representing an iterable element. Used for changing content type
* for instance from a `ListElement` to a `TupleElement`.
Contributor

This example is exactly the same as for IterableSequence, which may make it unclear why both classes are needed.

Contributor Author

Right, I could add something like "via a read step to an IterableElement followed by a store step" to the above and "usually via a read step from an IterableSequence followed by a store step" here.

Co-authored-by: Taus <tausbn@github.com>
Member

@RasmusWL RasmusWL left a comment

Overall I think these new complex tests show that it would be really nice to have inline test expectations for these tests. I would almost say it's essential to get this done soon, since I did not look at the change to the .expected file, and therefore don't actually know which things this PR does and doesn't handle :(

I've added some comments here requesting to get things cleared up, but I would like to chat with you (live) when you've had a chance to look them over, so I can better understand the whole of this PR 😉 Just ping me when you're ready to do so.

Comment on lines 1066 to 1067
* where `a` should not receive content, but `b` and `c` should. `c` will be `["c"]` so
* should have the content converted and transferred, while `b` should read it.
Member

Why should c receive content?

In my mind we should be able to see that b will receive TupleElementContent(0), so c should only receive ListElementContent if exists(TupleElementContent tc | tc.getIndex() >= 1) -- but possibly I'm misunderstanding something.

Contributor Author

In this case b and c are both inside a list, so if there is any content in the tuple matched by [b, *c], then [b, *c] should have content ListElement [rest of access path], b should have [rest of access path], and c should have ListElement [rest of access path].
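A concrete (hypothetical) instance of the pattern being discussed, using the SOURCE/SINK test conventions:

a, [b, *c] = (NONSOURCE, (SOURCE, SOURCE))
SINK_F(a)   # no content reaches a
SINK(b)     # b reads the tainted element directly ([rest of access path])
SINK(c[0])  # c is a list whose elements carry the content (ListElement [rest of access path])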

Member

Hmm, my understanding of iterable unpacking is that it doesn't matter if you use normal parentheses or square brackets to match the pattern (but that having both can enable writing clean code). So

a, (b, c) = iterable
(a, (b, c)) = iterable
[a, [b, c]] = iterable
a, [b, c] = iterable

are all semantically equivalent -- or have I misunderstood something?

Contributor Author

I would say that, apart from the first two, these are all different. But the difference is only in the types of intermediate results, so you only see it if that difference matters. It might for instance matter for runtime. It matters for us, because we distinguish list-content and tuple-content. In terms of the values of a, b, and c, the interpreter does not see a difference because the third element of a list is just as precisely known as the third element of a tuple and so it can move back and forth between the two without noticing. Our analysis does see a difference, which is why it does not agree with the interpreter on the following program:

  l = [SOURCE, NONSOURCE]
  t = (l[0], l[1])

The interpreter sees t having SOURCE only in the first component; our analysis says it can be found in both.

Member

So, just to be perfectly clear, do you agree that these are all semantically equivalent? (in terms of Python semantics)

Contributor Author

What is Python semantics? They may incur different runtimes or memory footprints, and if you manage to change the subscripting of either list or tuple, they could also give different values.

Member

What is anything, really? :)

My point here is that we could make our analysis a bit simpler by ignoring whether square brackets or normal parentheses were used on the LHS of the assignment. I'm guessing that with our current Python libraries, we represent these as TupleNode or ListNode, but I don't think we need to make the distinction here.

After spending some time looking for evidence on this, I finally found https://docs.python.org/3/reference/simple_stmts.html#assignment-statements:

> Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.

So if a target is actually a target list (to use the wording from that Python reference), I think we can just consider that target list to be a tuple. This will allow us to simplify the data-flow modeling a bit, and might lead to a bit more precision in some cases (for example (a, [b, *c]) = ("a", ("tainted string", "c"))).

non-goals

I'm not suggesting we should change the behavior of our data-flow modeling in general. So in the example below, when tracking data-flow from SOURCE, I'm perfectly fine with modeling the content of l as some element, meaning that we think the content might flow to b. If we ever want to change that, this is not the PR to do it in, at least 😄

l = [SOURCE, NON_SOURCE]
t = (SOURCE, NON_SOURCE)

a, b = l
x, y = t

Contributor Author

> What is anything, really? :)

I meant that one could define many semantics for Python, depending on one's interest. One such might or might not include expected runtime or memory footprint.

> After spending some time looking for evidence on this, I finally found https://docs.python.org/3/reference/simple_stmts.html#assignment-statements:
>
> Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.
>
> So if a target is actually a target list (to use the wording from that Python reference), I think we can just consider that target list to be a tuple. This will allow us to simplify the data-flow modeling a bit, and might lead to a bit more precision in some cases (for example (a, [b, *c]) = ("a", ("tainted string", "c"))).

Indeed, it looks like you are right! The different notations are just there for convenience and always denote a "target list", which we can indeed model as a tuple.

Reading the reference, it also occurred to me that we may not handle the assignment (and the possible call to a special method) hidden in

  a, b[0] = 1, 2
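A minimal sketch of the hidden call (the Sink class is hypothetical, only there to make the special-method call visible; SOURCE/SINK are the usual test helpers):

class Sink:
    def __setitem__(self, index, value):
        SINK(value)  # the unpacking assignment below ends up calling this method

b = Sink()
a, b[0] = 1, SOURCE  # roughly equivalent to: a = 1; b.__setitem__(0, SOURCE)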

Member

 a, b[0] = 1, 2

*goes to check in ipython* ... good catch! It would be great to have a test-case at least, but I'm not sure if you want to handle that (more general) case in this PR?

Contributor

I think adding a test case is probably a good idea, but I don't think fixing this needs to be done in this PR.
Also, it seems like these are two different issues:

  1. Correct handling of content when assigning to a subscripted expression, and
  2. correctly handling the case where `__setitem__` exists on `b`.

Of these, I would say 1. is probably the higher priority (though there may exist fancy libraries that use `__setitem__` to do clever stuff, so we may end up needing both).

yoff and others added 2 commits January 15, 2021 18:50
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Contributor Author

yoff commented Jan 15, 2021

I have addressed the issues raised so far, except for the following: mentally, I have been solving unpacking assignment rather than iterable unpacking. I realise that we should probably rename the module and have it be used in all the contexts where iterable unpacking happens (for-iteration and comprehensions). It probably will not be very much code to add `for` support, and I need to add a `for` step anyway, so I will try that shortly. But those extensions could also be postponed to a separate PR...

@yoff yoff requested review from RasmusWL and tausbn January 15, 2021 19:27
@yoff yoff requested a review from tausbn January 19, 2021 19:20
Contributor

@tausbn tausbn left a comment

I think this wins the prize for "biggest QLDoc comment". 😄
I have added a few comments and suggestions, but I think this looks like it should be possible to merge really soon. Nicely done! 🎉

* A synthetic node representing that there may be an iterable element
* for `consumer` to consume.
*/
TIterableElement(UnpackingAssignmentTarget consumer)
Contributor

For most of these disjuncts, we have names of the form ...Node (though notably TKwUnpacked slipped through the cracks somehow). Should we perhaps endeavour to keep this consistent?

Contributor Author

Sure, that does seem to be a convention :-)

override string toString() { result = "IterableSequence" }

override DataFlowCallable getEnclosingCallable() {
result = any(CfgNode node | node = TCfgNode(consumer)).getEnclosingCallable()
Contributor

There's something about the way this is written that bugs me a bit. I'm guessing it is the case that any IterableSequence is in fact also a CfgNode (otherwise we would have a consistency problem for this predicate). In that case, it's slightly awkward that we have to jump back through the IPA type constructor to get this link back.

Could we perhaps instead just say that TIterableSequence contains not just a ControlFlowNode, but in fact a CfgNode that itself contains said ControlFlowNode?

This might make other parts of the code a bit more awkward, though...

Contributor Author

Perhaps we could do a more local refactor, making the field `consumer` a `CfgNode` and having the charpred link them up?

Contributor Author

yoff commented Jan 20, 2021

Differences job for the version without unhelpful store steps: Job showing no changes, and analysis times with salt showing an increase just shy of 10% (so somewhat concerning).

yoff and others added 2 commits January 20, 2021 17:38
Contributor Author

yoff commented Jan 20, 2021

New job after merging in main: Job, no changes, Analysis times. Much smaller increase this time :-)

@yoff yoff requested a review from tausbn January 21, 2021 09:56
Contributor Author

yoff commented Jan 22, 2021

Simplified the model, knowing that all LHS sequences are the same (now modelled with TupleElementContent). It did not immediately allow me to remove TIterableSequence as I had hoped, because we may still read some nested list content that we need to convert.

I also added slightly more precise modelling of indices in the presence of starred variables.
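As a (hypothetical) example of the kind of precision this is aiming at, using the SOURCE/SINK test conventions (whether this exact case is covered depends on the final predicates):

a, *b, c = (SOURCE, NONSOURCE, NONSOURCE)
SINK(a)     # index 0 is the tainted element
SINK_F(c)   # with precise index handling, no flow should be reported for the last element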

Contributor Author

yoff commented Jan 22, 2021

New job, no changes, no change in analysis time.

probably a copy-paste error..
Contributor Author

yoff commented Jan 25, 2021

Invalid test was rejected (yay).

Contributor

@tausbn tausbn left a comment

A few minor bits, but then I'll merge it, I promise! :)

Co-authored-by: Taus <tausbn@github.com>
@yoff yoff requested a review from tausbn January 25, 2021 20:58
Member

@RasmusWL RasmusWL left a comment

I spent a considerable amount of time looking through the comments and trying to understand it all :) I think I do now, and the docs were very helpful 💪

I've made a couple of suggestions to make them even better (I hope)

Comment on lines 1064 to 1080
* The strategy for converting content type is to break the transfer up into a read step
* and a store step, together creating a converting transfer step.
* For this we need a synthetic node in the middle, which we call `TIterableElement(receiver)`.
* It is associated with the receiver of the transfer, because we know the receiver type (tuple) from the syntax.
* Since we sometimes need a converting read step (in the example above, `[b, *c]` reads the content
* `ListElementContent` but should have content `TupleElementContent(0)` and `TupleElementContent(0)`),
* we actually need a second synthetic node. A converting read step is a read step followed by a
* converting transfer.
*
* We can have a uniform treatment by always having two synthetic nodes and so we can view it as
* two stages of the same node. So we read into (or transfer to) `TIterableSequence(receiver)`,
* from which we take a read step to `TIterableElement(receiver)` and then a store step to `receiver`.
*
* In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)`
* directly to `receiver`.
*
* The strategy is then via several read-, store-, and flow steps:
Member

I think this part is unclear, and I would like to rewrite it.

Suggested change
* The strategy for converting content type is to break the transfer up into a read step
* and a store step, together creating a converting transfer step.
* For this we need a synthetic node in the middle, which we call `TIterableElement(receiver)`.
* It is associated with the receiver of the transfer, because we know the receiver type (tuple) from the syntax.
* Since we sometimes need a converting read step (in the example above, `[b, *c]` reads the content
* `ListElementContent` but should have content `TupleElementContent(0)` and `TupleElementContent(0)`),
* we actually need a second synthetic node. A converting read step is a read step followed by a
* converting transfer.
*
* We can have a uniform treatment by always having two synthetic nodes and so we can view it as
* two stages of the same node. So we read into (or transfer to) `TIterableSequence(receiver)`,
* from which we take a read step to `TIterableElement(receiver)` and then a store step to `receiver`.
*
* In order to preserve precise content, we also take a flow step from `TIterableSequence(receiver)`
* directly to `receiver`.
*
* The strategy is then via several read-, store-, and flow steps:
* To transfer content from RHS to the elements of the LHS in the expression `sequence = iterable`, we use two synthetic nodes:
* - `TIterableSequence(sequence)` which captures the content-modeling the entire `sequence` will have
* (essentially just a copy of the content-modeling the RHS has)
* - `TIterableElement(sequence)` which captures the content-modeling that will be assigned to an element. Note that
* an empty access path means that the value we are tracking flows directly to the element.
*
* Since we need to handle recursive structures on the LHS, we can have a uniform treatment by always having these
* two synthetic nodes and so we can view it as two stages of the same node. So we read into (or transfer to)
* `TIterableSequence(receiver)`, from which we take a read step to `TIterableElement(receiver)` and then a store step
* to `receiver`.
*
* This is accomplished via several read-, store-, and flow steps:

Contributor Author

@yoff yoff Jan 26, 2021

Generally happy with the rewrite :-) Two questions, though:

  1. Did you mean to leave out this bit "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver) directly to receiver."? (Do you feel that is implied?)
  2. Since we are here, do you feel the two stages help? I originally wanted to make it a single branch TIterableElement(receiver, direction) with direction being "in" or "out", but the two branches seemed much clearer and cleaner. So perhaps the two stages idea is not relevant anymore?

Member

  • Did you mean to leave out this bit "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver) directly to receiver."? (Do you feel that is implied?)

I must admit that "In order to preserve precise content, we also take a flow step from TIterableSequence(receiver) directly to receiver." never really made sense to me.

My understanding is that the only time Step 2 (TIterableSequence(sequence) -> sequence) is useful is when the content-modeling is already TupleElementContent. The unpackingAssignmentFlowStep predicate doesn't have any constraint showing this, but the only way to transfer things to the elements of the sequence is to use Step 5, which always requires that the content-modeling is TupleElementContent.

If I'm misunderstanding something here, we definitely need to highlight this in some way. If my understanding is correct, we might want to add a comment to the explanation of Step 2, saying it's only needed to handle things (from the RHS) where the content-modeling is already using TupleElementContent.
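To make that concrete, here is a small sketch of the situation where Step 2 matters (SOURCE/NONSOURCE/SINK as in the tests; the step numbering refers to the QLDoc above):

t = (SOURCE, NONSOURCE)  # RHS content is already TupleElementContent(0)/(1)
a, b = t                 # the direct step TIterableSequence(sequence) -> sequence
SINK(a)                  # preserves the precise indices, so only a is flagged
SINK_F(b)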

  • Since we are here, do you feel the two stages help? I originally wanted to make it a single branch TIterableElement(receiver, direction) with direction being "in" or "out", but the two branches seemed much clearer end cleaner. So perhaps the two stages idea is not relevant anymore?

No, I'm happy with the current setup 👍

Contributor Author

I ended up creating a mashup; it turns out I still felt a need for justifying TIterableSequence 😅
Your understanding above is correct and hopefully now fully reflected in the comment.

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@yoff yoff requested a review from RasmusWL January 28, 2021 00:13
Member

@RasmusWL RasmusWL left a comment

Thanks for the clarifications. I think they will greatly help anyone who dares to look at this in the future 😄 ... probably to the point where it no longer feels like such a huge undertaking to look at (since the docs are clearer now, it's easier to understand) 💪

Contributor

@tausbn tausbn left a comment

🚀 🎉

What a ride!

@tausbn tausbn merged commit cb195a0 into github:main Jan 29, 2021