Skip to content

Python: Add getAMethodCall to LocalSourceNode #6064

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 22, 2021

Conversation

tausbn
Copy link
Contributor

@tausbn tausbn commented Jun 11, 2021

This seems like something we have been missing for a while now, so I
figured it might be useful to add. It is roughly based on the JavaScript
equivalent, with one major difference: in the JavaScript libraries,
getAMethodCall is reserved for syntactic method calls (obj.m(...))
whereas getAMemberInvocation is used for both this and the case where
the bound method obj.m is stored in a temporary variable and then
subsequently invoked in the same local scope.

It seems to me that the more general predicate is more useful, and hence
should have the simpler name. (And also we don't really work with a
notion of "invocation" in the Python libraries, so we would need a
better name for it anyway.)

I think as long as the documentation makes the behaviour clear, it
should be okay.


Needs a change note, so I'm keeping it as draft for now. Change note added.

This seems like something we have been missing for a while now, so I
figured it might be useful to add. It is roughly based on the JavaScript
equivalent, with one major difference: in the JavaScript libraries,
`getAMethodCall` is reserved for syntactic method calls (`obj.m(...)`)
whereas `getAMemberInvocation` is used for both this and the case where
the bound method `obj.m` is stored in a temporary variable and then
subsequently invoked in the same local scope.

It seems to me that the more general predicate is more useful, and hence
should have the simpler name. (And also we don't really work with a
notion of "invocation" in the Python libraries, so we would need a
better name for it anyway.)

I think as long as the documentation makes the behaviour clear, it
should be okay.
@tausbn tausbn mentioned this pull request Jun 11, 2021
tausbn added 6 commits June 15, 2021 15:04
Roughly patterned after the JS equivalent.
Not as many opportunities to clean stuff up here.
Now that we have a `MethodCallNode` class, it would be silly not to use
that as the return type.
@tausbn tausbn marked this pull request as ready for review June 15, 2021 15:54
@tausbn tausbn requested a review from a team as a code owner June 15, 2021 15:54
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really love it 😍

I think it's important to highlight that this only tracks method references in local scope.

def foo(meth):
    res = meth() # not found by MethodCallNode
    print(res)

def bar():
    r = RocketStation()
    meth = r.fire_the_rockets
    meth() # found by MethodCallNode
    foo(meth)

Receiver

We haven't used the term receiver a lot in the Python libraries until now, and personally I've only seen it used here in CodeQL. So I just want to check if there are arguments for keeping it this way, or whether we should maybe call it getOjbect()? (I think we talked about this the other day, but I simply can't remember)

calls

Personally I don't immediately love the way calls read, as in the example below. Something rubs me wrong about the way the object/receiver is specified... just like for accesses. I think it's very natural that we should either go with both of these, or none of them.

result.(DataFlow::MethodCallNode).calls(cipherInstance(algorithmName), "encryptor")

Based on the fact that I can't really find a different approach replacing calls for MethodCallNode, that means the only alternative is to not provide these shortcuts, and always require a query writer to specify both getMethodName() and getReceiver() (likewise for AttrRead/AttrWrite).

I feel that if we provide these shortcuts, it should be because we want people (mostly us) to use them. I personally don't love them, but if you're feeling good about them, we can try them out, and maybe I'll change my mind after some time 🤷‍♂️ (and I guess you are, since you've made these suggestion for using these already 😛)

Comment on lines 186 to 187
* Also covers the case where the method lookup is done separately from the call itself, as in
* `temp = foo.bar; temp(...)`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Also covers the case where the method lookup is done separately from the call itself, as in
* `temp = foo.bar; temp(...)`.
* Also covers the case where the method lookup is done separately from the call itself, as in
* `temp = foo.bar; temp(...)`, although this is only tracked through local scope.

@RasmusWL
Copy link
Member

I got curious how it would handle multiple methods flowing to the same call-site.

So I made a little test:

if cond:
    meth = obj1.foo
else:
    meth = obj2.bar

meth()

with this query:

import python
import semmle.python.dataflow.new.DataFlow
import experimental.dataflow.TestUtil.PrintNode

from DataFlow::MethodCallNode methCall
select prettyNode(methCall), methCall.getMethodName(), prettyNode(methCall.getReceiver())

results in

| meth() | bar | obj1 |
| meth() | bar | obj2 |
| meth() | foo | obj1 |
| meth() | foo | obj2 |

So there is a risk of someone wanting all objects where the foo method is called on, using any(DataFlow::MethodCallNode methCall).calls(result, "foo"), and getting FPs (like obj2) 😞 (and technically also FNs due to the local scope only property).

@tausbn do you see an (easy) way to fix this? I personally don't, but it would be nice if we avoid this 😊 (I'm thinking we would need a new constructor in TNode, which then couldn't just subclass CallCfgNode, and it would be sort of a mess).

Otherwise I guess it's a tradeoff between how big of a problem we estimate this to be vs. how much of an improvement we expect MethodCallNode to provide.

@tausbn
Copy link
Contributor Author

tausbn commented Jun 21, 2021

Amusingly, your second comment (which is an excellent find, by the way ❤️) explains why we need calls in the first place: taken separately, getReceiver/getObject and getMethodName may have multiple results (because, indeed, there may be multiple methods flowing to the call in question). If we require people to specify these separately as a conjuction, we will get the kind of overlap you describe above, leading to false positives.

The only way to avoid this is to specify/access the object and the method name at the same time, for which the following implementation of calls will do:

  predicate calls(Node object, string methodName) {
    object = method_lookup.getObject() and
    methodName = method_lookup.getAttributeName()
  }

By ensuring that it's the same method_lookup, we avoid the unfortunate crosstalk. (I have created a test illustrating this point, which I'll push shortly.)

With this in mind, I also think it's reasonable to add a big disclaimer on getObject and getMethodName saying that you should use calls if you want to specify both at the same time (to avoid people falling into this trap by mistake).


Regarding some of your other points, we agreed that getObject was more in line with the rest of the Python library, so I will make this change as well. Your suggestion for clarifying that this only works in local scope will also be folded in. 🙂

- changes `getReceiver` to `getObject`
- fixes `calls` to avoid unwanted cross-talk
- adds some more documentation to highlight the above issue
@tausbn tausbn requested a review from RasmusWL June 21, 2021 16:12
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amusingly, your second comment (which is an excellent find, by the way ) explains why we need calls in the first place: taken separately, getReceiver/getObject and getMethodName may have multiple results (because, indeed, there may be multiple methods flowing to the call in question). If we require people to specify these separately as a conjuction, we will get the kind of overlap you describe above, leading to false positives.

I got to the same conclusion while I was eating my breakfast thinking this over, but nice to see it has already been implemented 👍

@RasmusWL RasmusWL merged commit e05d6e7 into github:main Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants