-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Python: Attribute access API #4423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Attribute access API #4423
Conversation
RasmusWL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks very useful (and I would have to fix up a LOT of code) 👍 💪
So far it is only used by the type-tracking library code. Are there any places in the core dataflow library where this should also be used? (or is that planned for a different PR?)
classes like BuiltinCallNode seem sort of out of place alongside the attributes. Maybe it would be better to move these into a file devoted entirely to modelling builtins?
Would be very cool to have proper modeling of builtins, so we can also handle ga = getattr; ga(object, "foo") and import builtins; builtins.getattr(object, "foo").
I was planning to look at the code injection query soon, and will need to model the builtins exec and eval, so maybe we should coordinate this effort together? 😊
| } | ||
|
|
||
| /** A simple attribute assignment: `object.attr = value`. */ | ||
| private class AttributeAssignmentAsAttrWrite extends AttrWrite, CfgNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very long class name exposing information the type-system also knows. Would something simpler work?
I initially suggested AttributeAssignment, but reading through more of the code, it starts to make sense for me why you did this. Would a postfix of Node make things clearer? (I guess not, since that still doesn't say whether it's a DataFlow::Node or a ControlFlowNode... damn all those nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the long name, yeah, I agree it's a bit of a mouthful. The JS libraries has a similar affliction (e.g. StaticClassMemberAsPropWrite). However, the class is private, and not really intended for public consumption (whereas AttrWrite very much is).
yoff
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a pity to not use this functionality in attributeStoreStep and attributeReadStep, can we not do that?
Also changes `x = TCfgNode(y)` to `x.asCfgNode() = y` where applicable.
|
Thank you for the many excellent comments. Most of them have been addressed in the commits I just pushed, and the code has been cleaned up considerably as a consequence! I decided to take another stab at the named-import-as-attribute-read issue, so the comment for that remains. Also left to do is to integrate this into the attribute read and store steps in the data flow library. I'm inclined to do this in a separate PR, as it may affect the test output. |
Required a small change in `DataFlow::importModule` to get the desired behaviour (cf. the type trackers defined in `moduleattr.ql`, but this should be harmless. The node that is added doesn't have any flow anywhere.
…s.qll Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
yoff
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! One small question, but otherwise looks good. Looking forward to having it used also in data flow :-)
|
|
||
| /** Holds if this attribute reference may access an attribute named `attrName`. */ | ||
| predicate mayHaveAttributeName(string attrName) { none() } | ||
| predicate mayHaveAttributeName(string attrName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to not call this hasAttributeName? Are we any more uncertain of the values reported here than in getAttributeName or getAttributeNameExpr?
I am asking because I think this is the surface predicate, and we want to encourage users to use this one for normal use. Perhaps we should also mention this in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main reason is: to mimic the JavaScript API. I think the may is useful in this case (at least with how the library functions currently) in that getAttributeName will only yield a result if the name of the attribute is fixed, but mayHaveAttributeName can hold for several attribut names. Thus, I would expect something like
if random.randint(0,1):
attr = "foo"
else:
attr = "bar"
x = getattr(object, attr)to have (at least) two values for mayHaveAttributeName.
Actually, this has me wondering. Perhaps getAttributeName should be rewritten to do the same kind of local flow as mayHaveAttributeName, but only yield a value if the result is unique. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CodeQL, I have come to expect that a predicate named getAttributeName may yield more than one value. Is the case where the name of the attribute is fixed important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately, I think this will depend on whether we end up seeing false positives because of attribute confusion. I can certainly see an argument for using mayHaveAttributeName in something like type tracking, where we want to propagate types as much as possible (even at the cost of a bit of imprecision), but I can also imagine a situation where conflating two attribute names leads to an erroneous flow of taint. (So in particular, the data flow library itself should probably use getAttributeName for precision.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I was curious, and added a test case to see if we were getting the intended behaviour for mayHaveAttributeName, and it seems we're not.
def setattr_indirect_multiple_write():
if random.randint(0,1):
attr = "foo"
else:
attr = "bar"
x = SomeClass() # $tracked=foo $f-:tracked=bar
setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=barNote the f- annotations above. It seems we only consider local flow from attr = foo and not attr = bar. This is true even if I negate the conditional, so I expect it's simply always picking the first branch. This feels like it might be a bug in the implementation of local flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, yes, we will have to sort that out...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you try the same with a conditional expression, do you get the expected behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following test code passes (with f- annotations):
def setattr_indirect_multiple_write_ifexpr():
attr = "foo" if random.randint(0,1) else "bar"
x = SomeClass() # $tracked=foo $f-:tracked=bar
setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=barSo, no. Same behaviour.
(Also this was after manually applying 0f077f5 manually, since that commit is not present on this branch.)
I'm wondering if the problem is elsewhere, though. I'll have to debug this a bit.
RasmusWL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see a more clear distinction between mayHaveAttributeName and getAttributeName before merging 😊
RasmusWL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very happy with the updated QLDoc 💪 🥇
This PR provides a unified interface for specifying attribute reads and writes.
Currently not handled:
__dict__attribute. This should be fairly easy to add (and the tests are there already), but did not seem to be a pressing concern.Attribute reads for module imports (This now works as intended.from module import fooacting as an access ofmodule.foo). These turned out to be very awkward to get to work properly, and I think it would be better to simply handle them directly in the type tracker implementation.ImportMemberNodeturned out to be the magic ingredient I was missing.Other points:
getattretc. is the best way of doing so. Also, classes likeBuiltinCallNodeseem sort of out of place alongside the attributes. Maybe it would be better to move these into a file devoted entirely to modelling builtins?getAttributeNameExpris supposed to return the data flow node that defines the name of the attribute. However, for simple attribute accesses, there is no underlying control flow node forattrinobject.attr(in fact there isn't even an AST node for it!) and so for these kinds of attribute accesses, the aforementioned method has no return value. I tried synthesizing extra data flow nodes to avoid this, but this quickly got very involved, and so I opted to remove that part from this PR.