Python: Attribute access API #4423

tausbn · 2020-10-06T14:46:11Z

This PR provides a unified interface for specifying attribute reads and writes.

Currently not handled:

Attribute access via the __dict__ attribute. This should be fairly easy to add (and the tests are there already), but did not seem to be a pressing concern.
Attribute reads for module imports (from module import foo acting as an access of module.foo). These initially turned out to be very awkward to get working properly, and I thought it would be better to simply handle them directly in the type tracker implementation. Update: this now works as intended. ImportMemberNode turned out to be the magic ingredient I was missing.
Class attributes should be propagated to instance attributes, but are currently not. I'm not sure how to fit this into the greater scheme of dataflow, as this will also need to be modelled correctly for field flow.
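For reference, the three cases listed above correspond to the following plain Python behaviour. This is only an illustrative sketch of the runtime semantics the analysis would need to model; the names `Config`, `verbose`, and `level` are made up for the example and do not come from the PR or its tests.

```python
import importlib

class Config:
    verbose = False  # class attribute

cfg = Config()

# 1. Access via __dict__: a write that bypasses the `obj.attr` syntax.
cfg.__dict__["level"] = 3
level_via_attr = cfg.level          # reads back the value stored through __dict__

# 2. A named import reads an attribute of the module object.
from math import pi
math_module = importlib.import_module("math")
pi_via_attr = math_module.pi        # the same object that `from math import pi` bound

# 3. Class attributes are visible through instances until shadowed.
verbose_via_instance = cfg.verbose  # falls back to Config.verbose
cfg.verbose = True                  # creates an instance attribute; the class is unchanged
```

In each case the attribute is reached without a plain `object.attr` expression on the relevant object, which is why these cases need separate treatment.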

Other points:

I'm not sure the way I implemented getattr etc. is the best way of doing so. Also, classes like BuiltinCallNode seem sort of out of place alongside the attributes. Maybe it would be better to move these into a file devoted entirely to modelling builtins?
getAttributeNameExpr is supposed to return the data flow node that defines the name of the attribute. However, for simple attribute accesses, there is no underlying control flow node for attr in object.attr (in fact there isn't even an AST node for it!) and so for these kinds of attribute accesses, the aforementioned method has no return value. I tried synthesizing extra data flow nodes to avoid this, but this quickly got very involved, and so I opted to remove that part from this PR.
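Python's standard `ast` module has a similar shape, which may make the limitation easier to see. This is an analogy only: the CodeQL AST and control flow graph are separate representations, and nothing below is CodeQL API.

```python
import ast

# `object.attr`: the attribute name is stored as a plain string field,
# not as a child expression node.
simple = ast.parse("object.attr", mode="eval").body
simple_name = simple.attr

# `getattr(object, "attr")`: the name is an ordinary expression node
# (the second call argument), so there is something to point at.
call = ast.parse('getattr(object, "attr")', mode="eval").body
name_arg = call.args[1]
```

Here `simple_name` is a `str`, whereas `name_arg` is an `ast.Constant` node, which mirrors why a name-expression accessor has a result for `getattr` calls but not for simple attribute accesses.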

RasmusWL

Overall looks very useful (and I would have to fix up a LOT of code) 👍 💪

So far it is only used by the type-tracking library code. Are there any places in the core dataflow library where this should also be used? (or is that planned for a different PR?)

classes like BuiltinCallNode seem sort of out of place alongside the attributes. Maybe it would be better to move these into a file devoted entirely to modelling builtins?

Would be very cool to have proper modeling of builtins, so we can also handle ga = getattr; ga(object, "foo") and import builtins; builtins.getattr(object, "foo").
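As a quick illustration of why these forms matter, the sketch below shows that all three reach the same built-in at runtime. The class `Target` and its attribute `foo` are hypothetical names chosen for the example.

```python
import builtins

class Target:
    foo = "value"

obj = Target()

# Direct call by name.
direct = getattr(obj, "foo")

# Alias: the call site no longer mentions the name `getattr`.
ga = getattr
via_alias = ga(obj, "foo")

# Qualified access through the builtins module.
via_module = builtins.getattr(obj, "foo")
```

All three calls read the same attribute, so a model that only recognises the direct spelling would miss the other two.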

I was planning to look at the code injection query soon, and will need to model the builtins exec and eval, so maybe we should coordinate this effort together? 😊

python/ql/src/experimental/dataflow/internal/Attributes.qll

RasmusWL · 2020-10-07T13:36:46Z

python/ql/src/experimental/dataflow/internal/Attributes.qll

+}
+
+/** A simple attribute assignment: `object.attr = value`. */
+private class AttributeAssignmentAsAttrWrite extends AttrWrite, CfgNode {


That's a very long class name exposing information the type-system also knows. Would something simpler work?

I initially suggested AttributeAssignment, but reading through more of the code, it starts to make sense for me why you did this. Would a postfix of Node make things clearer? (I guess not, since that still doesn't say whether it's a DataFlow::Node or a ControlFlowNode... damn all those nodes).

Regarding the long name, yeah, I agree it's a bit of a mouthful. The JS libraries have a similar affliction (e.g. StaticClassMemberAsPropWrite). However, the class is private, and not really intended for public consumption (whereas AttrWrite very much is).

python/ql/src/experimental/dataflow/internal/Attributes.qll

yoff

It seems a pity to not use this functionality in attributeStoreStep and attributeReadStep, can we not do that?


tausbn · 2020-10-08T14:55:33Z

Thank you for the many excellent comments. Most of them have been addressed in the commits I just pushed, and the code has been cleaned up considerably as a consequence!

I decided to take another stab at the named-import-as-attribute-read issue, so the comment for that remains.

Also left to do is to integrate this into the attribute read and store steps in the data flow library. I'm inclined to do this in a separate PR, as it may affect the test output.


yoff

Nice work! One small question, but otherwise looks good. Looking forward to having it used also in data flow :-)

yoff · 2020-10-12T09:28:26Z

python/ql/src/experimental/dataflow/internal/Attributes.qll


  /** Holds if this attribute reference may access an attribute named `attrName`. */
-  predicate mayHaveAttributeName(string attrName) { none() }
+  predicate mayHaveAttributeName(string attrName) {


Is there a reason to not call this hasAttributeName? Are we any more uncertain of the values reported here than in getAttributeName or getAttributeNameExpr?
I am asking because I think this is the surface predicate, and we want to encourage users to use this one for normal use. Perhaps we should also mention this in the comment?

The main reason is to mimic the JavaScript API. I think the "may" is useful in this case (at least with how the library functions currently), in that getAttributeName will only yield a result if the name of the attribute is fixed, but mayHaveAttributeName can hold for several attribute names. Thus, I would expect something like

    if random.randint(0,1):
        attr = "foo"
    else:
        attr = "bar"
    x = getattr(object, attr)

to have (at least) two values for mayHaveAttributeName.

Actually, this has me wondering. Perhaps getAttributeName should be rewritten to do the same kind of local flow as mayHaveAttributeName, but only yield a value if the result is unique. 🤔
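The distinction being discussed can be sketched as a toy Python model. This is not QL and not the library's implementation: the function names are hypothetical, and the `candidates` set simply stands in for whatever string values local flow finds reaching the attribute-name argument.

```python
from typing import Optional, Set

def may_have_attribute_name(candidates: Set[str]) -> Set[str]:
    """Every name that may reach the attribute-name argument."""
    return set(candidates)

def get_attribute_name(candidates: Set[str]) -> Optional[str]:
    """The name only when exactly one candidate exists, otherwise None."""
    if len(candidates) == 1:
        return next(iter(candidates))
    return None

fixed = {"foo"}             # e.g. getattr(object, "foo")
branching = {"foo", "bar"}  # e.g. the if/else example above
```

Under this model the "may" variant reports both names for the branching case, while the "get" variant only answers when the name is unique, trading recall for precision.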

In CodeQL, I have come to expect that a predicate named getAttributeName may yield more than one value. Is the case where the name of the attribute is fixed important?

Ultimately, I think this will depend on whether we end up seeing false positives because of attribute confusion. I can certainly see an argument for using mayHaveAttributeName in something like type tracking, where we want to propagate types as much as possible (even at the cost of a bit of imprecision), but I can also imagine a situation where conflating two attribute names leads to an erroneous flow of taint. (So in particular, the data flow library itself should probably use getAttributeName for precision.)

So, I was curious, and added a test case to see if we were getting the intended behaviour for mayHaveAttributeName, and it seems we're not.

    def setattr_indirect_multiple_write():
        if random.randint(0,1):
            attr = "foo"
        else:
            attr = "bar"
        x = SomeClass() # $tracked=foo $f-:tracked=bar
        setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=bar

Note the f- annotations above. It seems we only consider local flow from attr = "foo" and not from attr = "bar". This is true even if I negate the conditional, so I expect it's simply always picking the first branch. This feels like it might be a bug in the implementation of local flow.

Good catch 👍

Hm, yes, we will have to sort that out...

If you try the same with a conditional expression, do you get the expected behaviour?

The following test code passes (with f- annotations):

    def setattr_indirect_multiple_write_ifexpr():
        attr = "foo" if random.randint(0,1) else "bar"
        x = SomeClass() # $tracked=foo $f-:tracked=bar
        setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=bar

So, no. Same behaviour. ☹️
(Also, this was after manually applying 0f077f5, since that commit is not present on this branch.)

I'm wondering if the problem is elsewhere, though. I'll have to debug this a bit.

RasmusWL

I would like to see a clearer distinction between mayHaveAttributeName and getAttributeName before merging 😊

python/ql/src/experimental/dataflow/internal/DataFlowUtil.qll

python/ql/src/experimental/dataflow/internal/Attributes.qll

RasmusWL

Very happy with the updated QLDoc 💪 🥇

Python: Attribute access API

b905a3d

github-actions bot added the Python label Oct 6, 2020

tausbn marked this pull request as ready for review October 6, 2020 17:50

tausbn requested a review from a team as a code owner October 6, 2020 17:50

RasmusWL requested changes Oct 7, 2020


yoff requested changes Oct 8, 2020


tausbn added 4 commits October 8, 2020 14:53

Python: Implement and use mayHaveAttributeName

e9ecc00

Python: Clean up and extend built-in call node classes

31596ef

Python: Reuse existing node fields

ceb2496

Also changes `x = TCfgNode(y)` to `x.asCfgNode() = y` where applicable.

Python: Remove flow from getAttributeName

df447c0

tausbn and others added 2 commits October 8, 2020 18:08

Python: Support named imports as attribute reads

d46453c

Required a small change in `DataFlow::importModule` to get the desired behaviour (cf. the type trackers defined in `moduleattr.ql`), but this should be harmless. The node that is added doesn't have any flow anywhere.

Python: Update python/ql/src/experimental/dataflow/internal/Attribute…

60eec7b

…s.qll Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

tausbn requested review from RasmusWL and yoff October 9, 2020 10:00

yoff previously approved these changes Oct 12, 2020


RasmusWL requested changes Oct 12, 2020


Python: Clear up attribute name access QLDoc

b07c7ab

tausbn dismissed yoff’s stale review via b07c7ab October 12, 2020 11:49

Python: Hopefully final changes to documentation.

3288cf1

RasmusWL approved these changes Oct 12, 2020


codeql-ci merged commit d3f8fb5 into github:main Oct 13, 2020

RasmusWL mentioned this pull request Oct 13, 2020

Python: simplify import modeling #4448

Merged

tausbn deleted the python-add-attribute-access-interface branch February 12, 2021 18:03
