Skip to content

Python: port modification of default value #6557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Sep 21, 2021

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Aug 26, 2021

Only missing bit is sanitizers. @tausbn mentioned comparing against the empty list. I could add that, but I am not really sure that is a recommended practice. On the other hand, we do not want a slew of false positives from missing that. Let me know.

yoff added 7 commits August 24, 2021 16:35
The old query uses `pointsTo` to limit the sinks
to methods on lists and dictionaries.
That constraint is omitted here which could hurt performance.
This should avoid potentially terrible performance.
Also noted the missing syntactic constructs,
as I went through the documnetation.
Consider removing the original test
@yoff yoff requested a review from a team as a code owner August 26, 2021 12:33
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks like some good improvements 💪

I feel that the tests the makes a modification that end up not changing the value (like dict_nonempty_nochange and dict_update_op_nochange), feels a bit like a hypothetical code-pattern, and that we shouldn't spend more time trying to handle those better for now. If we end up getting lots of complaints on this, or seeing this being a major problem from real code, we can consider looking more into it 😊 The same goes for the sanitizer test-case. (unless of course you have seen this in the wild a lot, but I don't see that mentioned anywhere)

Return of default value

I just wanted to note that a related problem can occur if you return the default value, illustrated in the example below.

def foo(x = [])
    return x

foo().append(42)
assert foo() == [42]

BarrierGuard

I wanted to go a bit more into details with the effort that was put into not giving an alert on this code

def sanitizer(l = []):
    if l:
        l.append(1)  # actually ok, since it will not trigger for default value
    ...

I don't see any changes to results from the barrier guard introduced in a762373 -- the changes to the .expected files seems to be related to the changes made in 49ae549

I also don't think it quite works as intended, since for if not x: the expression for x does not seem to be a part of the IdentityGuarded class, so the logic for handling the not will not be applicable. To get it to be part of the IdentityGuarded class we would need something along the lines of any(If i).getTest().contains(this), although that code might not be the best way to handle it in practice (I haven't really thought that through).

As I mentioned above, I would think that code that modifies the argument value in this way would be a rare occurrence, and not something we should worry too much about. So although it might sound a bit disappointing, I will suggest that we remove this bit of the PR, and then just be happy about the other improvements you have made for this query 💪 (if you already have the data indicating I'm wrong, very happy to take a closer look at this though)

In case you agree that we can drop support for this, I think there's a few of my PR comments that can be ignored 😉

* A sanitizer guard for detecting modifications of a parameters default value.
*/
abstract class BarrierGuard extends DataFlow::BarrierGuard {
abstract boolean blocksNonEmpty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs QLDoc

@RasmusWL
Copy link
Member

I don't have any good suggestions for how to fix the tests 😞

yoff and others added 3 commits September 3, 2021 14:23
@yoff yoff requested a review from RasmusWL September 6, 2021 20:41
yoff added 5 commits September 7, 2021 10:09
…fication-of-default-value

Files have moved around, specifically PrintNode.qll.
The test case has different behaviour between py2/3.
When merging this, we should create an issue to resolve it.
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic involving invertedness, emptyness, and truthyness is quite complicated. I'm reminded of this quote:

There are two methods in software design. One is to make the program so simple, there are obviously no errors. The other is to make it so complicated, there are no obvious errors.

I didn't invest too much in addressing this part initially, since I thought it might be going away altogether.

As we discussed offline, we probably don't need to handle not not x, so we should be able to get away with just handling a single level of not, thereby removing a lot of the complexity in this queue. Hopefully we'll be able to invest more effort into making guards just work nicely out of the box, instead of implementors having to go through these hoops themselves.

# Python 2 and 3 versions of the analysis.
# We comment it out until we have resoved the issue.
#
# def issue1143(expr, param=[]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHA, so #1143 is the reason we want to have that guard 🤔

Comment on lines 124 to 146
# OK
def sanitizer(l = []):
if l:
l.append(1)
return l

# OK
def sanitizer_negated(l = [1]):
if not l:
l.append(1)
return l

# Not OK
def sanitizer(l = []):
if not l:
l.append(1) #$ modification=l
return l

# Not OK
def sanitizer_negated(l = [1]):
if l:
l.append(1) #$ modification=l
return l
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the tests for guards more complete, can we get else branches on all of these, such as:

def sanitizer(l = []):
    if l:
        # OK
        l.append(1)
    else:
        # NOT OK
        l.append(1) # $ modification=l
    return l

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! These highlighted the inherent problem with using BarrierGuards.

Comment on lines 34 to 52
/**
* A sanitizer guard that does not let a truthy value flow to the true branch.
*
* Since guards with different behaviour cannot exist on the same node,
* we let all guards have the same behaviour, in the sense that they all check
* the true branch. Instead, we partition guards into those that block
* truthy values and those that block falsy values.
*/
abstract class BlocksTruthy extends DataFlow::BarrierGuard { }

/**
* A sanitizer guard that does not let a falsy value flow to the true branch.
*
* Since guards with different behaviour cannot exist on the same node,
* we let all guards have the same behaviour, in the sense that they all check
* the true branch. Instead, we partition guards into those that block
* truthy values and those that block falsy values.
*/
abstract class BlocksFalsey extends DataFlow::BarrierGuard { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed since new barrier are introduced below in other suggestion.

Suggested change
/**
* A sanitizer guard that does not let a truthy value flow to the true branch.
*
* Since guards with different behaviour cannot exist on the same node,
* we let all guards have the same behaviour, in the sense that they all check
* the true branch. Instead, we partition guards into those that block
* truthy values and those that block falsy values.
*/
abstract class BlocksTruthy extends DataFlow::BarrierGuard { }
/**
* A sanitizer guard that does not let a falsy value flow to the true branch.
*
* Since guards with different behaviour cannot exist on the same node,
* we let all guards have the same behaviour, in the sense that they all check
* the true branch. Instead, we partition guards into those that block
* truthy values and those that block falsy values.
*/
abstract class BlocksFalsey extends DataFlow::BarrierGuard { }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wanted all these abstract classes as extension points for future, unknown customisations?

Comment on lines 137 to 216
/**
* An expression that is checked directly in an `if`, possibly with `not`, such as `if x:` or `if not x:`.
*/
private class IdentityGuarded extends Expr {
boolean inverted;

IdentityGuarded() {
this = any(If i).getTest() and
inverted = false
or
exists(IdentityGuarded ig, UnaryExpr notExp |
notExp.getOp() instanceof Not and
ig = notExp and
notExp.getOperand() = this
|
inverted = ig.isInverted().booleanNot()
)
}

/**
* Whether this guard has been inverted. For `if x:` the result is `false`, and for `if not x:` the result is `true`.
*/
boolean isInverted() { result = inverted }
}

/**
* Holds iff `guard` is checking the `Name` represented by `guarded` for truthyness.
* `result` is true if the check is inverted and false if it is not.
*/
boolean isIdentityGuard(DataFlow::GuardNode guard, ControlFlowNode guarded) {
exists(IdentityGuarded ig |
ig instanceof Name and
// In `not l`, the `ControlFlowNode` for `l` is not an instance of `GuardNode`.
// TODO: This is slightly naive, not handling e.g. `l or cond` correctly.
// We should change it when we have a proper guards library.
guard.getNode().getAChildNode*() = ig and
result = ig.isInverted() and
guarded.getNode() = ig
)
}

/**
* A sanitizer guard that does not let a truthy value flow to the true branch.
* Based on `isIdentityGuard`, so comes with the same caveats.
*/
class BlocksTruthyGuard extends BlocksTruthy {
ControlFlowNode guarded;

BlocksTruthyGuard() {
// The raw guard is true if the value is non-empty.
// We wish to send truthy falues to the false branch,
// se we are looking for inverted guards.
isIdentityGuard(this, guarded) = true
}

override predicate checks(ControlFlowNode node, boolean branch) {
node = guarded and
branch = true
}
}

/**
* A sanitizer guard that does not let a falsy value flow to the true branch.
* Based on `isIdentityGuard`, so comes with the same caveats.
*/
class BlocksFalseyGuard extends BlocksFalsey {
ControlFlowNode guarded;

BlocksFalseyGuard() {
// The raw guard is true if the value is non-empty.
// We wish to send falsy falues to the false branch,
// se we are looking for guards that are not inverted.
isIdentityGuard(this, guarded) = false
}

override predicate checks(ControlFlowNode node, boolean branch) {
node = guarded and
branch = true
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline

Suggested change
/**
* An expression that is checked directly in an `if`, possibly with `not`, such as `if x:` or `if not x:`.
*/
private class IdentityGuarded extends Expr {
boolean inverted;
IdentityGuarded() {
this = any(If i).getTest() and
inverted = false
or
exists(IdentityGuarded ig, UnaryExpr notExp |
notExp.getOp() instanceof Not and
ig = notExp and
notExp.getOperand() = this
|
inverted = ig.isInverted().booleanNot()
)
}
/**
* Whether this guard has been inverted. For `if x:` the result is `false`, and for `if not x:` the result is `true`.
*/
boolean isInverted() { result = inverted }
}
/**
* Holds iff `guard` is checking the `Name` represented by `guarded` for truthyness.
* `result` is true if the check is inverted and false if it is not.
*/
boolean isIdentityGuard(DataFlow::GuardNode guard, ControlFlowNode guarded) {
exists(IdentityGuarded ig |
ig instanceof Name and
// In `not l`, the `ControlFlowNode` for `l` is not an instance of `GuardNode`.
// TODO: This is slightly naive, not handling e.g. `l or cond` correctly.
// We should change it when we have a proper guards library.
guard.getNode().getAChildNode*() = ig and
result = ig.isInverted() and
guarded.getNode() = ig
)
}
/**
* A sanitizer guard that does not let a truthy value flow to the true branch.
* Based on `isIdentityGuard`, so comes with the same caveats.
*/
class BlocksTruthyGuard extends BlocksTruthy {
ControlFlowNode guarded;
BlocksTruthyGuard() {
// The raw guard is true if the value is non-empty.
// We wish to send truthy falues to the false branch,
// se we are looking for inverted guards.
isIdentityGuard(this, guarded) = true
}
override predicate checks(ControlFlowNode node, boolean branch) {
node = guarded and
branch = true
}
}
/**
* A sanitizer guard that does not let a falsy value flow to the true branch.
* Based on `isIdentityGuard`, so comes with the same caveats.
*/
class BlocksFalseyGuard extends BlocksFalsey {
ControlFlowNode guarded;
BlocksFalseyGuard() {
// The raw guard is true if the value is non-empty.
// We wish to send falsy falues to the false branch,
// se we are looking for guards that are not inverted.
isIdentityGuard(this, guarded) = false
}
override predicate checks(ControlFlowNode node, boolean branch) {
node = guarded and
branch = true
}
}
/**
* A sanitizer guard that does not let a truthy value flow to the true branch.
*
* Blocks flow of `x` in the true branch in the example below.
* ```py
* if x:
* x.append(42)
* ```
*
* Implementation note: Since guards with different behaviour cannot exist on the same
* node, we let all guards have the same behaviour, in the sense that they all check
* the true branch. Instead, we partition guards into those that block truthy values
* and those that block falsy values.
*/
class BlocksTruthy extends DataFlow::BarrierGuard {
BlocksTruthy() {
this instanceof NameNode
}
override predicate checks(ControlFlowNode node, boolean branch) {
node = this and
branch = true
}
}
/**
* A sanitizer guard that does not let a falsey value flow to the true branch.
*
* Blocks flow of `x` in the true branch in the example below.
* ```py
* if not x:
* x.append(42)
* ```
*
* Implementation note: Since guards with different behaviour cannot exist on the same
* node, we let all guards have the same behaviour, in the sense that they all check
* the true branch. Instead, we partition guards into those that block truthy values
* and those that block falsy values.
*/
class BlocksFalsey extends DataFlow::BarrierGuard {
NameNode guarded;
BlocksFalsey() {
guarded = this.(UnaryExprNode).getOperand() and
this.(UnaryExprNode).getNode().getOp() instanceof Not
}
override predicate checks(ControlFlowNode node, boolean branch) {
node = guarded and
branch = true
}
}

Comment on lines 29 to 32
/**
* A sanitizer for detecting modifications of a parameters default value.
*/
abstract class Barrier extends DataFlow::Node { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is not used in the configuration anymore, so let's just remove it

Suggested change
/**
* A sanitizer for detecting modifications of a parameters default value.
*/
abstract class Barrier extends DataFlow::Node { }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wanted all these abstract classes as extension points for future, unknown customisations?

yoff and others added 3 commits September 7, 2021 15:08
…ithDefault.qll

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
They are simply not right for this problem.
We should not even make them available as an extension point.
@yoff
Copy link
Contributor Author

yoff commented Sep 10, 2021

I have moved away from BarrierGuards for this query and also removed them as an extension point. The current code contains a small amount of duplication of the BarrierGuard functionality. Let me know if that needs to be factored out somehow.

@yoff yoff requested a review from RasmusWL September 10, 2021 10:51
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks ok. It's not perfect that our data-flow library is not powerful enough to handle the guards nicely out of the box, but I also don't think it's worth spending too much more time on this.

A few minor comments, otherwise looks ready to be merged from my point of view.

Comment on lines 149 to 152
DataFlow::GuardNode guard;
NameNode guarded;
boolean branch;
boolean truthy;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's only the field truthy that is used by subclasses, can you please move the other field to an exists? (to get more idiomatic CodeQL code) 😊

I don't see how this should change the behavior of the code, but do let me know if I'm wrong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@yoff
Copy link
Contributor Author

yoff commented Sep 15, 2021

Analysis of results reproduced here for convenience:

For the 15 unclassified:
All except the one in YouTube_dl are "intended positives" in that they are modifications of default values, but the consequences are intended; they are various forms of caches or state. These are false positives, since they should not be reported, but they signify the query being unrefined rather than buggy. The one in YouTube_dl, I do not understand, the code just looks wrong.

For the 6 classified ones:
These are all counters or states used to test iterations. Again, intended, so false positives.

So how were these results filtered out before?
The new results come from new sources of modification: l[index] = x, l[index] += x, and del l[index].
The first is often used to store a set of handled objects, a prototypic pattern is seen[node] = 1.
The second is often used to store a counter in the first index, as in count[0] += 1.
The third is often used in combination with the first to remove the handled object from the set.

The previous query would only include the forms l += elements and l.m().

yoff and others added 2 commits September 15, 2021 12:25
…ithDefaultCustomizations.qll

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@yoff
Copy link
Contributor Author

yoff commented Sep 15, 2021

I guess the question is if we should elide these new sources of modifications or not...

@yoff yoff requested a review from RasmusWL September 15, 2021 13:18
@RasmusWL
Copy link
Member

I guess the question is if we should elide these new sources of modifications or not...

You mean sinks of modification right? (bad dad joke laugh)

Anyway, sounds good if they are "intended positives". I think the new additional places we detect modifications are good.

The one in youtube-dl looks good to me, and personally I would call that a FP.

@yoff
Copy link
Contributor Author

yoff commented Sep 15, 2021

I guess there is a pattern where you populate the default value before you use it. If all modifications are idempotent and come before all reads, is that simply safe?

@RasmusWL
Copy link
Member

I guess there is a pattern where you populate the default value before you use it. If all modifications are idempotent and come before all reads, is that simply safe?

Sounds reasonable to me.

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance looks fine 👍

@RasmusWL RasmusWL merged commit 4a16be2 into github:main Sep 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants