Python: port modification of default value #6557

yoff · 2021-08-26T12:32:59Z

Only missing bit is sanitizers. @tausbn mentioned comparing against the empty list. I could add that, but I am not really sure that is a recommended practice. On the other hand, we do not want a slew of false positives from missing that. Let me know.

The old query uses `pointsTo` to limit the sinks to methods on lists and dictionaries. That constraint is omitted here which could hurt performance.

This should avoid potentially terrible performance. Also noted the missing syntactic constructs, as I went through the documnetation.

Consider removing the original test

The one found in the original test case

RasmusWL

Overall looks like some good improvements 💪

I feel that the tests the makes a modification that end up not changing the value (like dict_nonempty_nochange and dict_update_op_nochange), feels a bit like a hypothetical code-pattern, and that we shouldn't spend more time trying to handle those better for now. If we end up getting lots of complaints on this, or seeing this being a major problem from real code, we can consider looking more into it 😊 The same goes for the sanitizer test-case. (unless of course you have seen this in the wild a lot, but I don't see that mentioned anywhere)

Return of default value

I just wanted to note that a related problem can occur if you return the default value, illustrated in the example below.

def foo(x = [])
    return x

foo().append(42)
assert foo() == [42]

BarrierGuard

I wanted to go a bit more into details with the effort that was put into not giving an alert on this code

def sanitizer(l = []):
    if l:
        l.append(1)  # actually ok, since it will not trigger for default value
    ...

I don't see any changes to results from the barrier guard introduced in a762373 -- the changes to the .expected files seems to be related to the changes made in 49ae549

I also don't think it quite works as intended, since for if not x: the expression for x does not seem to be a part of the IdentityGuarded class, so the logic for handling the not will not be applicable. To get it to be part of the IdentityGuarded class we would need something along the lines of any(If i).getTest().contains(this), although that code might not be the best way to handle it in practice (I haven't really thought that through).

As I mentioned above, I would think that code that modifies the argument value in this way would be a rare occurrence, and not something we should worry too much about. So although it might sound a bit disappointing, I will suggest that we remove this bit of the PR, and then just be happy about the other improvements you have made for this query 💪 (if you already have the data indicating I'm wrong, very happy to take a closer look at this though)

In case you agree that we can drop support for this, I think there's a few of my PR comments that can be ignored 😉

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

python/ql/test/query-tests/Functions/ModificationOfParameterWithDefault/test.py

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

RasmusWL · 2021-08-31T08:19:50Z

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

+   * A sanitizer guard for detecting modifications of a parameters default value.
+   */
+  abstract class BarrierGuard extends DataFlow::BarrierGuard {
+    abstract boolean blocksNonEmpty();


needs QLDoc

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

RasmusWL · 2021-08-31T09:15:23Z

I don't have any good suggestions for how to fix the tests 😞

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

…fication-of-default-value Files have moved around, specifically PrintNode.qll.

The test case has different behaviour between py2/3. When merging this, we should create an issue to resolve it.

RasmusWL

I think the logic involving invertedness, emptyness, and truthyness is quite complicated. I'm reminded of this quote:

There are two methods in software design. One is to make the program so simple, there are obviously no errors. The other is to make it so complicated, there are no obvious errors.

I didn't invest too much in addressing this part initially, since I thought it might be going away altogether.

As we discussed offline, we probably don't need to handle not not x, so we should be able to get away with just handling a single level of not, thereby removing a lot of the complexity in this queue. Hopefully we'll be able to invest more effort into making guards just work nicely out of the box, instead of implementors having to go through these hoops themselves.

RasmusWL · 2021-09-07T08:54:44Z

python/ql/test/query-tests/Functions/general/functions_test.py

+# Python 2 and 3 versions of the analysis.
+# We comment it out until we have resoved the issue.
+#
+# def issue1143(expr, param=[]):


AHA, so #1143 is the reason we want to have that guard 🤔

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefault.qll

RasmusWL · 2021-09-07T09:48:35Z

python/ql/test/query-tests/Functions/ModificationOfParameterWithDefault/test.py

+# OK
+def sanitizer(l = []):
+    if l:
+        l.append(1)
+    return l
+
+# OK
+def sanitizer_negated(l = [1]):
+    if not l:
+        l.append(1)
+    return l
+
+# Not OK
+def sanitizer(l = []):
+    if not l:
+        l.append(1)  #$ modification=l
+    return l
+
+# Not OK
+def sanitizer_negated(l = [1]):
+    if l:
+        l.append(1)  #$ modification=l
+    return l


To make the tests for guards more complete, can we get else branches on all of these, such as:

def sanitizer(l = []): if l: # OK l.append(1) else: # NOT OK l.append(1) # $ modification=l return l

Good idea 👍

Thanks! These highlighted the inherent problem with using BarrierGuards.

RasmusWL · 2021-09-07T10:06:32Z

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

+  /**
+   * A sanitizer guard that does not let a truthy value flow to the true branch.
+   *
+   * Since guards with different behaviour cannot exist on the same node,
+   * we let all guards have the same behaviour, in the sense that they all check
+   * the true branch. Instead, we partition guards into those that block
+   * truthy values and those that block falsy values.
+   */
+  abstract class BlocksTruthy extends DataFlow::BarrierGuard { }
+
+  /**
+   * A sanitizer guard that does not let a falsy value flow to the true branch.
+   *
+   * Since guards with different behaviour cannot exist on the same node,
+   * we let all guards have the same behaviour, in the sense that they all check
+   * the true branch. Instead, we partition guards into those that block
+   * truthy values and those that block falsy values.
+   */
+  abstract class BlocksFalsey extends DataFlow::BarrierGuard { }


Removed since new barrier are introduced below in other suggestion.

Suggested change

/**

* A sanitizer guard that does not let a truthy value flow to the true branch.

*

* Since guards with different behaviour cannot exist on the same node,

* we let all guards have the same behaviour, in the sense that they all check

* the true branch. Instead, we partition guards into those that block

* truthy values and those that block falsy values.

*/

abstract class BlocksTruthy extends DataFlow::BarrierGuard { }

/**

* A sanitizer guard that does not let a falsy value flow to the true branch.

*

* Since guards with different behaviour cannot exist on the same node,

* we let all guards have the same behaviour, in the sense that they all check

* the true branch. Instead, we partition guards into those that block

* truthy values and those that block falsy values.

*/

abstract class BlocksFalsey extends DataFlow::BarrierGuard { }

I thought we wanted all these abstract classes as extension points for future, unknown customisations?

RasmusWL · 2021-09-07T10:06:38Z

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

+  /**
+   * An expression that is checked directly in an `if`, possibly with `not`, such as `if x:` or `if not x:`.
+   */
+  private class IdentityGuarded extends Expr {
+    boolean inverted;
+
+    IdentityGuarded() {
+      this = any(If i).getTest() and
+      inverted = false
+      or
+      exists(IdentityGuarded ig, UnaryExpr notExp |
+        notExp.getOp() instanceof Not and
+        ig = notExp and
+        notExp.getOperand() = this
+      |
+        inverted = ig.isInverted().booleanNot()
+      )
+    }
+
+    /**
+     * Whether this guard has been inverted. For `if x:` the result is `false`, and for `if not x:` the result is `true`.
+     */
+    boolean isInverted() { result = inverted }
+  }
+
+  /**
+   * Holds iff `guard` is checking the `Name` represented by `guarded` for truthyness.
+   * `result` is true if the check is inverted and false if it is not.
+   */
+  boolean isIdentityGuard(DataFlow::GuardNode guard, ControlFlowNode guarded) {
+    exists(IdentityGuarded ig |
+      ig instanceof Name and
+      // In `not l`, the `ControlFlowNode` for `l` is not an instance of `GuardNode`.
+      // TODO: This is slightly naive, not handling e.g. `l or cond` correctly.
+      // We should change it when we have a proper guards library.
+      guard.getNode().getAChildNode*() = ig and
+      result = ig.isInverted() and
+      guarded.getNode() = ig
+    )
+  }
+
+  /**
+   * A sanitizer guard that does not let a truthy value flow to the true branch.
+   * Based on `isIdentityGuard`, so comes with the same caveats.
+   */
+  class BlocksTruthyGuard extends BlocksTruthy {
+    ControlFlowNode guarded;
+
+    BlocksTruthyGuard() {
+      // The raw guard is true if the value is non-empty.
+      // We wish to send truthy falues to the false branch,
+      // se we are looking for inverted guards.
+      isIdentityGuard(this, guarded) = true
+    }
+
+    override predicate checks(ControlFlowNode node, boolean branch) {
+      node = guarded and
+      branch = true
+    }
+  }
+
+  /**
+   * A sanitizer guard that does not let a falsy value flow to the true branch.
+   * Based on `isIdentityGuard`, so comes with the same caveats.
+   */
+  class BlocksFalseyGuard extends BlocksFalsey {
+    ControlFlowNode guarded;
+
+    BlocksFalseyGuard() {
+      // The raw guard is true if the value is non-empty.
+      // We wish to send falsy falues to the false branch,
+      // se we are looking for guards that are not inverted.
+      isIdentityGuard(this, guarded) = false
+    }
+
+    override predicate checks(ControlFlowNode node, boolean branch) {
+      node = guarded and
+      branch = true
+    }
+  }


As discussed offline

Suggested change

/**

* An expression that is checked directly in an `if`, possibly with `not`, such as `if x:` or `if not x:`.

*/

private class IdentityGuarded extends Expr {

boolean inverted;

IdentityGuarded() {

this = any(If i).getTest() and

inverted = false

or

exists(IdentityGuarded ig, UnaryExpr notExp |

notExp.getOp() instanceof Not and

ig = notExp and

notExp.getOperand() = this

|

inverted = ig.isInverted().booleanNot()

)

}

/**

* Whether this guard has been inverted. For `if x:` the result is `false`, and for `if not x:` the result is `true`.

*/

boolean isInverted() { result = inverted }

}

/**

* Holds iff `guard` is checking the `Name` represented by `guarded` for truthyness.

* `result` is true if the check is inverted and false if it is not.

*/

boolean isIdentityGuard(DataFlow::GuardNode guard, ControlFlowNode guarded) {

exists(IdentityGuarded ig |

ig instanceof Name and

// In `not l`, the `ControlFlowNode` for `l` is not an instance of `GuardNode`.

// TODO: This is slightly naive, not handling e.g. `l or cond` correctly.

// We should change it when we have a proper guards library.

guard.getNode().getAChildNode*() = ig and

result = ig.isInverted() and

guarded.getNode() = ig

)

}

/**

* A sanitizer guard that does not let a truthy value flow to the true branch.

* Based on `isIdentityGuard`, so comes with the same caveats.

*/

class BlocksTruthyGuard extends BlocksTruthy {

ControlFlowNode guarded;

BlocksTruthyGuard() {

// The raw guard is true if the value is non-empty.

// We wish to send truthy falues to the false branch,

// se we are looking for inverted guards.

isIdentityGuard(this, guarded) = true

}

override predicate checks(ControlFlowNode node, boolean branch) {

node = guarded and

branch = true

}

}

/**

* A sanitizer guard that does not let a falsy value flow to the true branch.

* Based on `isIdentityGuard`, so comes with the same caveats.

*/

class BlocksFalseyGuard extends BlocksFalsey {

ControlFlowNode guarded;

BlocksFalseyGuard() {

// The raw guard is true if the value is non-empty.

// We wish to send falsy falues to the false branch,

// se we are looking for guards that are not inverted.

isIdentityGuard(this, guarded) = false

}

override predicate checks(ControlFlowNode node, boolean branch) {

node = guarded and

branch = true

}

}

/**

* A sanitizer guard that does not let a truthy value flow to the true branch.

*

* Blocks flow of `x` in the true branch in the example below.

* ```py

* if x:

* x.append(42)

* ```

*

* Implementation note: Since guards with different behaviour cannot exist on the same

* node, we let all guards have the same behaviour, in the sense that they all check

* the true branch. Instead, we partition guards into those that block truthy values

* and those that block falsy values.

*/

class BlocksTruthy extends DataFlow::BarrierGuard {

BlocksTruthy() {

this instanceof NameNode

}

override predicate checks(ControlFlowNode node, boolean branch) {

node = this and

branch = true

}

}

/**

* A sanitizer guard that does not let a falsey value flow to the true branch.

*

* Blocks flow of `x` in the true branch in the example below.

* ```py

* if not x:

* x.append(42)

* ```

*

* Implementation note: Since guards with different behaviour cannot exist on the same

* node, we let all guards have the same behaviour, in the sense that they all check

* the true branch. Instead, we partition guards into those that block truthy values

* and those that block falsy values.

*/

class BlocksFalsey extends DataFlow::BarrierGuard {

NameNode guarded;

BlocksFalsey() {

guarded = this.(UnaryExprNode).getOperand() and

this.(UnaryExprNode).getNode().getOp() instanceof Not

}

override predicate checks(ControlFlowNode node, boolean branch) {

node = guarded and

branch = true

}

}

RasmusWL · 2021-09-07T10:07:39Z

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

+  /**
+   * A sanitizer for detecting modifications of a parameters default value.
+   */
+  abstract class Barrier extends DataFlow::Node { }


this one is not used in the configuration anymore, so let's just remove it

Suggested change

/**

* A sanitizer for detecting modifications of a parameters default value.

*/

abstract class Barrier extends DataFlow::Node { }

I thought we wanted all these abstract classes as extension points for future, unknown customisations?

…ithDefault.qll Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

They are simply not right for this problem. We should not even make them available as an extension point.

yoff · 2021-09-10T10:51:08Z

I have moved away from BarrierGuards for this query and also removed them as an extension point. The current code contains a small amount of duplication of the BarrierGuard functionality. Let me know if that needs to be factored out somehow.

…fication-of-default-value To get the subpaths.

RasmusWL

Overall looks ok. It's not perfect that our data-flow library is not powerful enough to handle the guards nicely out of the box, but I also don't think it's worth spending too much more time on this.

A few minor comments, otherwise looks ready to be merged from my point of view.

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

RasmusWL · 2021-09-14T15:03:44Z

python/ql/src/semmle/python/functions/ModificationOfParameterWithDefaultCustomizations.qll

+    DataFlow::GuardNode guard;
+    NameNode guarded;
+    boolean branch;
+    boolean truthy;


Since it's only the field truthy that is used by subclasses, can you please move the other field to an exists? (to get more idiomatic CodeQL code) 😊

I don't see how this should change the behavior of the code, but do let me know if I'm wrong

yoff · 2021-09-15T10:24:50Z

Analysis of results reproduced here for convenience:

For the 15 unclassified:
All except the one in YouTube_dl are "intended positives" in that they are modifications of default values, but the consequences are intended; they are various forms of caches or state. These are false positives, since they should not be reported, but they signify the query being unrefined rather than buggy. The one in YouTube_dl, I do not understand, the code just looks wrong.

For the 6 classified ones:
These are all counters or states used to test iterations. Again, intended, so false positives.

So how were these results filtered out before?
The new results come from new sources of modification: l[index] = x, l[index] += x, and del l[index].
The first is often used to store a set of handled objects, a prototypic pattern is seen[node] = 1.
The second is often used to store a counter in the first index, as in count[0] += 1.
The third is often used in combination with the first to remove the handled object from the set.

The previous query would only include the forms l += elements and l.m().

…ithDefaultCustomizations.qll Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

yoff · 2021-09-15T10:33:29Z

I guess the question is if we should elide these new sources of modifications or not...

RasmusWL · 2021-09-15T13:27:11Z

I guess the question is if we should elide these new sources of modifications or not...

You mean sinks of modification right? (bad dad joke laugh)

Anyway, sounds good if they are "intended positives". I think the new additional places we detect modifications are good.

The one in youtube-dl looks good to me, and personally I would call that a FP.

yoff · 2021-09-15T13:47:50Z

I guess there is a pattern where you populate the default value before you use it. If all modifications are idempotent and come before all reads, is that simply safe?

RasmusWL · 2021-09-21T07:55:11Z

I guess there is a pattern where you populate the default value before you use it. If all modifications are idempotent and come before all reads, is that simply safe?

Sounds reasonable to me.

RasmusWL

Performance looks fine 👍

yoff added 7 commits August 24, 2021 16:35

Python: Add tests for modification of defaults

e3765ce

Python: straight port of query

e865a29

The old query uses `pointsTo` to limit the sinks to methods on lists and dictionaries. That constraint is omitted here which could hurt performance.

Python: switch from negative to positive list

5bff518

This should avoid potentially terrible performance. Also noted the missing syntactic constructs, as I went through the documnetation.

Python: More tests of syntactic constructs

8614563

Python: test simple sanitizer

d834cec

Python: add inline expectations test

097c23e

Consider removing the original test

Python: Implement modifying syntax

49ae549

yoff requested a review from a team as a code owner August 26, 2021 12:33

github-actions bot added the Python label Aug 26, 2021

yoff added 2 commits August 30, 2021 11:04

Python: Implement simple barrier guard

a762373

The one found in the original test case

Python: Add change note

1903cb8

github-actions bot added the documentation label Aug 30, 2021

yoff added 2 commits August 30, 2021 15:03

Python: Add qldoc

0de621e

Python: Try to remove py2/3 differences

a855074

RasmusWL requested changes Aug 31, 2021

View reviewed changes

yoff and others added 3 commits September 3, 2021 14:23

Apply suggestions from code review

c6eb795

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Add suggested comments and test case

913990b

Python: Fix simple guards

4998a48

yoff requested a review from RasmusWL September 6, 2021 20:41

yoff added 5 commits September 7, 2021 10:09

Python: Add missing qldoc

ae8408b

Python: Remember to update test expectations

29cb067

Merge branch 'main' of github.com:github/codeql into python/port-modi…

8729701

…fication-of-default-value Files have moved around, specifically PrintNode.qll.

Python: fix reference to PrintNode.qll

b48caaf

Python: coment out discriminating test

e8644f6

The test case has different behaviour between py2/3. When merging this, we should create an issue to resolve it.

RasmusWL requested changes Sep 7, 2021

View reviewed changes

yoff and others added 3 commits September 7, 2021 15:08

Update python/ql/src/semmle/python/functions/ModificationOfParameterW…

43effd2

…ithDefault.qll Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Simplify guards as suggested

b20232d

Python: Do not use BarrierGuards

7cfa08a

They are simply not right for this problem. We should not even make them available as an extension point.

yoff requested a review from RasmusWL September 10, 2021 10:51

yoff added 2 commits September 10, 2021 14:03

Merge branch 'main' of github.com:github/codeql into python/port-modi…

02fd63c

…fication-of-default-value To get the subpaths.

Python: Subpaths in test output

2eb1173

RasmusWL requested changes Sep 14, 2021

View reviewed changes

yoff and others added 2 commits September 15, 2021 12:25

Update python/ql/src/semmle/python/functions/ModificationOfParameterW…

758b6bd

…ithDefaultCustomizations.qll Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Unexpose fields as suggested.

8ea7a28

yoff requested a review from RasmusWL September 15, 2021 13:18

RasmusWL approved these changes Sep 21, 2021

View reviewed changes

RasmusWL merged commit 4a16be2 into github:main Sep 21, 2021

Python: port modification of default value #6557

Python: port modification of default value #6557

Uh oh!

Conversation

yoff commented Aug 26, 2021

Uh oh!

RasmusWL left a comment

Choose a reason for hiding this comment

Return of default value

BarrierGuard

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

RasmusWL commented Aug 31, 2021

Uh oh!

RasmusWL left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yoff commented Sep 10, 2021

Uh oh!

RasmusWL left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yoff commented Sep 15, 2021

Uh oh!

yoff commented Sep 15, 2021

Uh oh!

RasmusWL commented Sep 15, 2021

Uh oh!

yoff commented Sep 15, 2021

Uh oh!

RasmusWL commented Sep 21, 2021

Uh oh!

RasmusWL left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!