Python: Small Cleanups #5926
Now that I got started adding small things that are nice, I've been missing this one (that is available on an `AttrNode`).
To use all the good new stuff 🎉
Highlight why we need to import `DataFlowPrivate`
Some of this modeling could probably go to the standard lib modeling file, but this chain of commits is already pretty feature creep :|
But now we suddenly don't handle the call to `unicode` :O -- at least not when I run the test locally (using Python 3).
I don't want to lose results on this, so until type-tracking/API graphs can handle this, I want to keep our syntactic handling.
I do like a good clean-up. 🧹
I have a few comments, but apart from that this looks good!
 * Gets the data flow node corresponding to the object whose attribute named
 * `attrName` is being read or written.
 */
Node getObject(string attrName) {
The JavaScript libraries define an `accesses` method for the case where you want to link up the nodes in question. Would this work as an alternative?
I tried this out now, and while it would work, it doesn't end up being as nice. I added it as
/**
 * Holds if this data flow node accesses attribute named `attrName` on object `object`.
 */
predicate accesses(Node object, string attrName) {
  this.getObject() = object and getAttributeName() = attrName
}
I found two places where this needed to be replaced:
@@ -172,13 +172,15 @@ predicate containerStep(DataFlow::CfgNode nodeFrom, DataFlow::Node nodeTo) {
// dict
"values", "items", "get", "popitem"
] and
- call.getFunction().(DataFlow::AttrRead).getObject(name) = nodeFrom
+ call.getFunction().(DataFlow::AttrRead).accesses(nodeFrom, name)
)
or
// list.append, set.add
exists(DataFlow::CallCfgNode call, string name |
name in ["append", "add"] and
- call.getFunction().(DataFlow::AttrRead).getObject(name).getPostUpdateNode() = nodeTo and
+ call.getFunction()
+ .(DataFlow::AttrRead)
+ .accesses(any(DataFlow::Node obj | obj.getPostUpdateNode() = nodeTo), name) and
call.getArg(0) = nodeFrom
)
}
I don't really mind the first one. I think either looks fine, and I guess my slight bias towards `.getObject` is just based on being used to seeing that one.
I really don't like the second change though, it becomes much more obscure to read (and write) in my opinion. The fact that member-predicates can be chained with `getObject(name).getPostUpdateNode()` is really nice.
So based on this experiment, I would like to go ahead with the `getObject(name)` added in this PR, even though it is not able to align with JS. Does that sound ok?
I agree that the second one looks less than ideal, but I think in this case it's because of another shortcoming in our API. The awkward part here is the fact that we have to invert `getPostUpdateNode` with an `any`. If instead we could write it as `.accesses(nodeTo.getPreUpdateNode(), name)`, it would look much nicer. And honestly, I find this much more intuitive than `getObject/1`. (My main objection is that the argument to `getObject/1` seems totally disconnected from its result. After all, `getObject/0` returns the exact same node.)
I am tempted to suggest a third alternative, in the form of a method on `AttrRef` like this:
DataFlow::Node withName(string name) {
  result = this and
  this.getName() = name
}
That would at least make it clear what the role of the otherwise magic argument to `getObject/1` is (at least, to me it reads better with `attr_ref.withName(name).getObject()`). However, I think this is also a mistake, and it's better overall to align on a common API.
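To make the three shapes under discussion easier to compare, here is a toy model in plain Python. It is not CodeQL and not the library's API: `Node`, `AttrRef`, and the snake_case method names are hypothetical stand-ins, and a QL predicate that "has no result" is approximated by returning `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a data flow node."""
    label: str


@dataclass(frozen=True)
class AttrRef:
    """Hypothetical stand-in for an attribute reference such as `my_list.append`."""
    obj: Node
    attr_name: str

    def get_object(self, attr_name: Optional[str] = None) -> Optional[Node]:
        # Mimics getObject/0 and getObject/1: the name argument only filters,
        # the node returned is the same in both cases.
        if attr_name is not None and attr_name != self.attr_name:
            return None
        return self.obj

    def accesses(self, obj: Node, attr_name: str) -> bool:
        # Mimics accesses/2: holds if this reference reads `attr_name` on `obj`.
        return self.obj == obj and self.attr_name == attr_name

    def with_name(self, attr_name: str) -> Optional["AttrRef"]:
        # Mimics the suggested withName/1: yields this reference only on a name match.
        return self if self.attr_name == attr_name else None


my_list = Node("my_list")
ref = AttrRef(obj=my_list, attr_name="append")
```

In this model `ref.get_object("append")` and `ref.get_object()` return the same node, which is the "argument disconnected from its result" point, while `ref.accesses(my_list, "append")` states both constraints at once.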
I guess we could do
.accesses(nodeTo.(DataFlow::PostUpdateNode).getPreUpdateNode(), name)
which is slightly shorter than
.accesses(any(DataFlow::Node obj | obj.getPostUpdateNode() = nodeTo), name)
Here's something that would make your first snippet above a bit cleaner: #6079
I think with this, `.accesses` actually looks pretty intuitive and clean.
name in [
// general
"copy", "pop",
// dict
"values", "items", "get", "popitem"
] and
- call.getFunction().(AttrNode).getObject(name) = nodeFrom.asCfgNode()
+ call.getFunction().(DataFlow::AttrRead).getObject(name) = nodeFrom
This feels like something we would want to capture directly on `CallCfgNode`. That is, a way of saying "this is a method call with such and such object".
JavaScript does this using the `calls` method on `MethodCallNode`. (We may also want to align on these subclasses.)
I think in general doing it in this very syntactical way, like this code does, is fundamentally wrong, since it does not capture `meth = my_set.pop; meth()`.
So while I like the idea of doing this in an easy way, I think we should make it easy to do it the right way instead 😉
> So while I like the idea of doing this in an easy way, I think we should make it easy to do it the right way instead

I may be misreading this, but it seems to be implying that I'm suggesting something that's "an easy way" and not "the right way", but that's really not the case. What I'm suggesting is something that captures "this is a call, and the local source of the function part is an `AttrRead` with such-and-such object and attribute name". This would capture your two-step method call just fine.
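The difference between matching on the call's function expression directly and matching on where that function came from can be sketched with Python's `ast` module. This is only a toy, flow-insensitive approximation of the "local source" idea, not how CodeQL implements it; the helper names `syntactic_method_calls` and `alias_aware_method_calls` are hypothetical.

```python
import ast

SOURCE = """
my_set = {1, 2, 3}
my_set.pop()
meth = my_set.pop
meth()
"""


def syntactic_method_calls(tree, name):
    """Line numbers of calls whose function is literally `<expr>.name`."""
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == name
    )


def alias_aware_method_calls(tree, name):
    """Also follow simple `alias = <expr>.name` assignments (toy local-source lookup)."""
    aliases = {
        target.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Assign)
        and isinstance(node.value, ast.Attribute)
        and node.value.attr == name
        for target in node.targets
        if isinstance(target, ast.Name)
    }
    hits = set(syntactic_method_calls(tree, name))
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in aliases
        ):
            hits.add(node.lineno)
    return sorted(hits)


tree = ast.parse(SOURCE)
syntactic = syntactic_method_calls(tree, "pop")      # finds only `my_set.pop()`
alias_aware = alias_aware_method_calls(tree, "pop")  # also finds `meth()`
```

The alias lookup here ignores scoping and reassignment, so it over-approximates; it is only meant to show that looking through the assignment is what recovers the two-step call.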
Very likely that I have misunderstood then. Happy to see a PR for improving this in the future 😉
So what I had in mind is here: #6064
... but I see now that this is perhaps not the most appropriate for this particular use case, since it operates at the level of local source nodes. It's likely to still be of some use, though (as we have many instances of "manual method matching" in our various models).
These seem like good, conservative changes. We can align later, if desired? (This PR does not really introduce new misalignments, it just wraps an existing function...) I do wonder if the last commit should be omitted. It seems that |
Historically, we have not chosen this approach, even if we were just wrapping existing functions because it introduces a maintenance burden that can only be slowly
Good point. It seems to me that our handling of built-ins in the API graph should also include |
I see that these changes happen approximately at the same time as we changed how builtins are handled with #5880, so going to give it a try 👍 |
But it did not work out. I tried my best to illustrate so with the commit history 🤷 |
Ah, I partially understand what's going on now. Because of
# Workaround for Python3 not having unicode
import sys
if sys.version_info[0] == 3:
    unicode = str
we of course don't find |
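A small sketch of what that workaround does at runtime, assuming the snippet is executed under Python 3: the name `unicode` becomes an ordinary module-level variable bound to `str`, and no builtin of that name exists. This only illustrates the runtime behaviour; it makes no claim about what the analysis resolves.

```python
import builtins
import sys

# Workaround for Python 3 not having unicode (as in the test file quoted above).
if sys.version_info[0] == 3:
    unicode = str

# Under Python 3, `unicode` is now just an alias for `str` defined in this module...
alias_is_str = unicode is str
# ...and there is no builtin called `unicode` to look up.
has_builtin_unicode = hasattr(builtins, "unicode")
```

Under Python 3, a call such as `unicode(x)` in that file is therefore a call to a user-defined global that happens to point at `str`.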
See this thread for discussion: github#5926 (comment)
As discussed in #6064 (review), in favor of not doing too much bike-shedding, I'm accepting to try out |
I have made a small suggestion (that does not change the semantics).
Regarding `noinline`, I think it's probably a good idea. There should be no performance penalty in evaluating `accesses` in isolation (as I don't believe we can actually have multiple results for `getObject` or `getAttributeName` for a given `AttrRef`), and conversely inlining could make things worse (by resulting in a join order where we join separately with `getObject` and `getAttributeName`, which may be a big join, and only join on `this` afterwards).
Regarding `getPostUpdateNode`, please see my latest comment on #6079
Oh, and you can now rewrite `MethodCallNode::calls` in terms of `accesses`. 🙂
python/ql/src/semmle/python/dataflow/new/internal/Attributes.qll
See this thread for discussion: github#5926 (comment)
This reverts commit 9137f04.
Co-authored-by: Taus <tausbn@github.com>
I force pushed to fix the commit message of |
Thank you for your patience! 🙂