Skip to content

Conversation

@tausbn
Copy link
Contributor

@tausbn tausbn commented Oct 6, 2020

This PR provides a unified interface for specifying attribute reads and writes.

Currently not handled:

  • Attribute access via the __dict__ attribute. This should be fairly easy to add (and the tests are there already), but did not seem to be a pressing concern.
  • Attribute reads for module imports (from module import foo acting as an access of module.foo). These turned out to be very awkward to get to work properly, and I think it would be better to simply handle them directly in the type tracker implementation. This now works as intended. ImportMemberNode turned out to be the magic ingredient I was missing.
  • Class attributes should be propagated to instance attributes, but are currently not. I'm not sure how to fit this into the greater scheme of dataflow, as this will also need to be modelled correctly for field flow.

Other points:

  • I'm not sure the way I implemented getattr etc. is the best way of doing so. Also, classes like BuiltinCallNode seem sort of out of place alongside the attributes. Maybe it would be better to move these into a file devoted entirely to modelling builtins?
  • getAttributeNameExpr is supposed to return the data flow node that defines the name of the attribute. However, for simple attribute accesses, there is no underlying control flow node for attr in object.attr (in fact there isn't even an AST node for it!) and so for these kinds of attribute accesses, the aforementioned method has no return value. I tried synthesizing extra data flow nodes to avoid this, but this quickly got very involved, and so I opted to remove that part from this PR.

@github-actions github-actions bot added the Python label Oct 6, 2020
@tausbn tausbn marked this pull request as ready for review October 6, 2020 17:50
@tausbn tausbn requested a review from a team as a code owner October 6, 2020 17:50
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks very useful (and I would have to fix up a LOT of code) 👍 💪

So far it is only used by the type-tracking library code. Are there any places in the core dataflow library where this should also be used? (or is that planned for a different PR?)

classes like BuiltinCallNode seem sort of out of place alongside the attributes. Maybe it would be better to move these into a file devoted entirely to modelling builtins?

Would be very cool to have proper modeling of builtins, so we can also handle ga = getattr; ga(object, "foo") and import builtins; builtins.getattr(object, "foo").

I was planning to look at the code injection query soon, and will need to model the builtins exec and eval, so maybe we should coordinate this effort together? 😊

}

/** A simple attribute assignment: `object.attr = value`. */
private class AttributeAssignmentAsAttrWrite extends AttrWrite, CfgNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very long class name exposing information the type-system also knows. Would something simpler work?

I initially suggested AttributeAssignment, but reading through more of the code, it starts to make sense for me why you did this. Would a postfix of Node make things clearer? (I guess not, since that still doesn't say whether it's a DataFlow::Node or a ControlFlowNode... damn all those nodes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the long name, yeah, I agree it's a bit of a mouthful. The JS libraries has a similar affliction (e.g. StaticClassMemberAsPropWrite). However, the class is private, and not really intended for public consumption (whereas AttrWrite very much is).

Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a pity to not use this functionality in attributeStoreStep and attributeReadStep, can we not do that?

@tausbn
Copy link
Contributor Author

tausbn commented Oct 8, 2020

Thank you for the many excellent comments. Most of them have been addressed in the commits I just pushed, and the code has been cleaned up considerably as a consequence!

I decided to take another stab at the named-import-as-attribute-read issue, so the comment for that remains.

Also left to do is to integrate this into the attribute read and store steps in the data flow library. I'm inclined to do this in a separate PR, as it may affect the test output.

tausbn and others added 2 commits October 8, 2020 18:08
Required a small change in `DataFlow::importModule` to get the desired
behaviour (cf. the type trackers defined in `moduleattr.ql`, but this
should be harmless. The node that is added doesn't have any flow
anywhere.
…s.qll

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@tausbn tausbn requested review from RasmusWL and yoff October 9, 2020 10:00
yoff
yoff previously approved these changes Oct 12, 2020
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! One small question, but otherwise looks good. Looking forward to having it used also in data flow :-)


/** Holds if this attribute reference may access an attribute named `attrName`. */
predicate mayHaveAttributeName(string attrName) { none() }
predicate mayHaveAttributeName(string attrName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not call this hasAttributeName? Are we any more uncertain of the values reported here than in getAttributeName or getAttributeNameExpr?
I am asking because I think this is the surface predicate, and we want to encourage users to use this one for normal use. Perhaps we should also mention this in the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main reason is: to mimic the JavaScript API. I think the may is useful in this case (at least with how the library functions currently) in that getAttributeName will only yield a result if the name of the attribute is fixed, but mayHaveAttributeName can hold for several attribut names. Thus, I would expect something like

if random.randint(0,1):
    attr = "foo"
else:
    attr = "bar"
x = getattr(object, attr)

to have (at least) two values for mayHaveAttributeName.

Actually, this has me wondering. Perhaps getAttributeName should be rewritten to do the same kind of local flow as mayHaveAttributeName, but only yield a value if the result is unique. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CodeQL, I have come to expect that a predicate named getAttributeName may yield more than one value. Is the case where the name of the attribute is fixed important?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately, I think this will depend on whether we end up seeing false positives because of attribute confusion. I can certainly see an argument for using mayHaveAttributeName in something like type tracking, where we want to propagate types as much as possible (even at the cost of a bit of imprecision), but I can also imagine a situation where conflating two attribute names leads to an erroneous flow of taint. (So in particular, the data flow library itself should probably use getAttributeName for precision.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I was curious, and added a test case to see if we were getting the intended behaviour for mayHaveAttributeName, and it seems we're not.

def setattr_indirect_multiple_write():
    if random.randint(0,1):
        attr = "foo"
    else:
        attr = "bar"
    x = SomeClass() # $tracked=foo $f-:tracked=bar
    setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=bar

Note the f- annotations above. It seems we only consider local flow from attr = foo and not attr = bar. This is true even if I negate the conditional, so I expect it's simply always picking the first branch. This feels like it might be a bug in the implementation of local flow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yes, we will have to sort that out...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you try the same with a conditional expression, do you get the expected behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test code passes (with f- annotations):

def setattr_indirect_multiple_write_ifexpr():
    attr = "foo" if random.randint(0,1) else "bar"
    x = SomeClass() # $tracked=foo $f-:tracked=bar
    setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=bar

So, no. Same behaviour. ☹️
(Also this was after manually applying 0f077f5 manually, since that commit is not present on this branch.)

I'm wondering if the problem is elsewhere, though. I'll have to debug this a bit.

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see a more clear distinction between mayHaveAttributeName and getAttributeName before merging 😊

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very happy with the updated QLDoc 💪 🥇

@codeql-ci codeql-ci merged commit d3f8fb5 into github:main Oct 13, 2020
@tausbn tausbn deleted the python-add-attribute-access-interface branch February 12, 2021 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants