Skip to content

Python: Small Cleanups #5926

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jul 2, 2021
Merged

Python: Small Cleanups #5926

merged 16 commits into from
Jul 2, 2021

Conversation

RasmusWL
Copy link
Member

No description provided.

RasmusWL added 7 commits May 19, 2021 11:57
Now that I got started adding small things that are nice, I've been
missing this one (that is available on an `AttrNode`).
To use all the good new stuff 🎉
Highlight why we need to import `DataFlowPrivate`
Some of this modeling could probably go to the standard lib modeling
file, but this chain of commits is already pretty feature creep :|
But now we suddenly don't handle the call to `unicode` :O -- at least
not when I run the test locally (using Python 3).
I don't want to loose results on this, so until type-tracking/API graphs
can handle this, I want to keep our syntactic handling.
@RasmusWL RasmusWL requested a review from a team as a code owner May 19, 2021 11:02
@RasmusWL RasmusWL added the no-change-note-required This PR does not need a change note label May 19, 2021
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like a good clean-up. 🧹
I have a few comments, but apart from that this looks good!

* Gets the data flow node corresponding to the object whose attribute named
* `attrName` is being read or written.
*/
Node getObject(string attrName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JavaScript libraries define an accesses method for the case where you want to link up the nodes in question. Would this work as an alternative?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this out now, and while it would work, it doesn't end up being as nice. I added it as

  /**
   * Holds if this data flow node accesses attribute named `attrName` on object `object`.
   */
  predicate accesses(Node object, string attrName) {
    this.getObject() = object and getAttributeName() = attrName
  }

I found two places where this needed to be replaced:

@@ -172,13 +172,15 @@ predicate containerStep(DataFlow::CfgNode nodeFrom, DataFlow::Node nodeTo) {
         // dict
         "values", "items", "get", "popitem"
       ] and
-    call.getFunction().(DataFlow::AttrRead).getObject(name) = nodeFrom
+    call.getFunction().(DataFlow::AttrRead).accesses(nodeFrom, name)
   )
   or
   // list.append, set.add
   exists(DataFlow::CallCfgNode call, string name |
     name in ["append", "add"] and
-    call.getFunction().(DataFlow::AttrRead).getObject(name).getPostUpdateNode() = nodeTo and
+    call.getFunction()
+        .(DataFlow::AttrRead)
+        .accesses(any(DataFlow::Node obj | obj.getPostUpdateNode() = nodeTo), name) and
     call.getArg(0) = nodeFrom
   )
 }

I don't really mind the first one. I think either looks fine, and I guess my slight bias towards .getObject is just based on being used to seeing that one.

I really don't like the second change though, it becomes much more obscure to read (and write) in my opinion. The fact that member-predicates can be chained with getObject(name).getPostUpdateNode() is really nice.

So based on this experiment, I would like to go ahead with the getObject(name) added in this pr, even though it is not able to align with JS. Does that sound ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the second one looks less than ideal, but I think in this case it's because of another shortcoming in our API. The awkward part here is the fact that we have to invert getPostUpdateNode with an any. If instead we could write it as .accesses(nodeTo.getPreUpdateNode(), name), it would look much nicer. And honestly, I find this much more intuitive than getObject/1. (My main objection is that the argument to getObject/1 seems totally disconnected from its result. After all, getObject/0 returns the exact same node.)

I am tempted to suggest a third alternative, in the form of a method on AttrRef like this:

DataFlow::Node withName(string name) {
  result = this and
  this.getName() = name
}

That would at least make it clear what the role of the otherwise magic argument to getObject/1 is (at least, to me it reads better with attr_ref.withName(name).getObject()). However, I think this is also a mistake, and it's better overall to align on a common API.

Copy link
Member Author

@RasmusWL RasmusWL Jun 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could do

.accesses(nodeTo.(DataFlow::PostUpdateNote).getPreUpdateNode(), name)

which is slightly shorter than

.accesses(any(DataFlow::Node obj | obj.getPostUpdateNode() = nodeTo), name)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's something that would make your first snippet above a bit cleaner: #6079
I think with this, .accesses actually looks pretty intuitive and clean.

name in [
// general
"copy", "pop",
// dict
"values", "items", "get", "popitem"
] and
call.getFunction().(AttrNode).getObject(name) = nodeFrom.asCfgNode()
call.getFunction().(DataFlow::AttrRead).getObject(name) = nodeFrom
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like something we would want to capture directly on CallCfgNode. That is, a way of saying "this is a method call with such and such object".

JavaScript does this using the calls method on MethodCallNode. (We may also want to align on these subclasses.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general doing it in this very syntactical way, like this code does, is fundamentally wrong, since it does not capture meth = my_set.pop; meth().

So while I like the idea of doing this in an easy way, I think we should make it easy to do it the right way instead 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So while I like the idea of doing this in an easy way, I think we should make it easy to do it the right way instead

I may be misreading this, but it seems to be implying that I'm suggesting something that's "an easy way" and not "the right way", but that's really not the case. What I'm suggesting is something that captures "this is a call, and the local source of the function part is an AttrRead with such-and-such object and attribute name". This would capture your two-step method call just fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very likely that I have misunderstood then. Happy to see a PR for improving this in the future 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what I had in mind is here: #6064
... but I see now that this is perhaps not the most appropriate for this particular use case, since it operates at the level of local source nodes. It's likely to still be of some use, though (as we have many instances of "manual method matching" in our various models).

@RasmusWL RasmusWL requested a review from tausbn May 26, 2021 12:12
@yoff
Copy link
Contributor

yoff commented Jun 11, 2021

These seem like good, conservative changes. We can align later, if desired? (This PR does not really introduce new misalignments, it just wraps an existing function...)

I do wonder if the last commit should be omitted. It seems that unicode should be caught as referring to str and that we do not realise this is what the tests now document?

@tausbn
Copy link
Contributor

tausbn commented Jun 11, 2021

These seem like good, conservative changes. We can align later, if desired? (This PR does not really introduce new misalignments, it just wraps an existing function...)

Historically, we have not chosen this approach, even if we were just wrapping existing functions because it introduces a maintenance burden that can only be slowly deprecated away (see e.g. #5253).

I do wonder if the last commit should be omitted. It seems that unicode should be caught as referring to str and that we do not realise this is what the tests now document?

Good point. It seems to me that our handling of built-ins in the API graph should also include unicode, so maybe that commit is superfluous now?

@RasmusWL
Copy link
Member Author

Good point. It seems to me that our handling of built-ins in the API graph should also include unicode, so maybe that commit is superfluous now?

I see that these changes happen approximately at the same time as we changed how builtins are handled with #5880, so going to give it a try 👍

@RasmusWL
Copy link
Member Author

But it did not work out. I tried my best to illustrate so with the commit history 🤷

@tausbn
Copy link
Contributor

tausbn commented Jun 14, 2021

Ah, I partially understand what's going on now. Because of

# Workaround for Python3 not having unicode
import sys
if sys.version_info[0] == 3:
    unicode = str

we of course don't find unicode(ts) to be a call to the built-in unicode (because it's being rewritten in the global scope, and so we throw our hands in the air and say it could be anything). What's more curious is that we don't flag it as a use of str, then. We may be missing some global flow. (I'm still trying to figure out exactly where this flow is going missing.)

RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Jun 22, 2021
@RasmusWL
Copy link
Member Author

As discussed in #6064 (review), in favor of not doing too much bike-shedding, I'm accepting to try out accesses (now also added in this PR). The JS version has pragma[noinline] -- I'm not sure whether we should just (blindly) copy that as well, or only deal with it if we see performance problems?

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made a small suggestion (that does not change the semantics).

Regarding noinline, I think it's probably a good idea. There should be no performance penalty in evaluating accesses in isolation (as I don't believe we can actual have multiple results for getObject or getAttributeName for a given AttrRef), and conversely inlining could make things worse (by resulting in a join order where we join separately with getObject and getAttributeName -- which may be a big join -- and only joining on this afterwards).

Regarding getPostUpdateNode, please see my latest comment on #6079

Oh, and you can now rewrite MethodCallNode::calls in terms of accesses. 🙂

Co-authored-by: Taus <tausbn@github.com>
@RasmusWL RasmusWL requested a review from tausbn July 2, 2021 11:29
@RasmusWL
Copy link
Member Author

RasmusWL commented Jul 2, 2021

I force pushed to fix the commit message of Python: AttrRef getOjbect/1 -> accesses/2 , since it said Python: AttrRef getOjbect/1 -> accesses/0 previously.

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience! 🙂

@RasmusWL
Copy link
Member Author

RasmusWL commented Jul 2, 2021

:shipit:

@codeql-ci codeql-ci merged commit a25933a into github:main Jul 2, 2021
@RasmusWL RasmusWL deleted the small-cleanups branch July 2, 2021 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants