Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 10 additions & 7 deletions python/ql/src/experimental/dataflow/TypeTracker.qll
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ private import internal.DataFlowPrivate

/** Any string that may appear as the name of an attribute or access path. */
class AttributeName extends string {
AttributeName() { this = any(Attribute a).getName() }
AttributeName() { this = any(AttrRef a).getAttributeName() }
}

/** Either an attribute name, or the empty string (representing no attribute). */
Expand Down Expand Up @@ -115,19 +115,22 @@ predicate returnStep(ReturnNode nodeFrom, Node nodeTo) {
* assignment to `z` inside `bar`, even though this attribute write happens _after_ `bar` is called.
*/
predicate basicStoreStep(Node nodeFrom, Node nodeTo, string attr) {
exists(AttributeAssignment a, Node var |
a.getName() = attr and
simpleLocalFlowStep*(nodeTo, var) and
var.asVar() = a.getInput() and
nodeFrom.asCfgNode() = a.getValue()
exists(AttrWrite a |
a.mayHaveAttributeName(attr) and
nodeFrom = a.getValue() and
simpleLocalFlowStep*(nodeTo, a.getObject())
)
}

/**
* Holds if `nodeTo` is the result of accessing the `attr` attribute of `nodeFrom`.
*/
predicate basicLoadStep(Node nodeFrom, Node nodeTo, string attr) {
exists(AttrNode s | nodeTo.asCfgNode() = s and s.getObject(attr) = nodeFrom.asCfgNode())
exists(AttrRead a |
a.mayHaveAttributeName(attr) and
nodeFrom = a.getObject() and
nodeTo = a
)
}

/**
Expand Down
256 changes: 256 additions & 0 deletions python/ql/src/experimental/dataflow/internal/Attributes.qll
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
/** This module provides an API for attribute reads and writes. */

import DataFlowUtil
import DataFlowPublic
private import DataFlowPrivate

/**
* A data flow node that reads or writes an attribute of an object.
*
* This abstract base class only knows about the base object on which the attribute is being
* accessed, and the attribute itself, if it is statically inferrable.
*/
abstract class AttrRef extends Node {
/**
* Gets the data flow node corresponding to the object whose attribute is being read or written.
*/
abstract Node getObject();

/**
* Gets the expression node that defines the attribute being accessed, if any. This is
* usually an identifier or literal.
*/
abstract ExprNode getAttributeNameExpr();

/**
* Holds if this attribute reference may access an attribute named `attrName`.
* Uses local data flow to track potential attribute names, which may lead to imprecision. If more
* precision is needed, consider using `getAttributeName` instead.
*/
predicate mayHaveAttributeName(string attrName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not call this hasAttributeName? Are we any more uncertain of the values reported here than in getAttributeName or getAttributeNameExpr?
I am asking because I think this is the surface predicate, and we want to encourage users to use this one for normal use. Perhaps we should also mention this in the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main reason is: to mimic the JavaScript API. I think the may is useful in this case (at least with how the library functions currently) in that getAttributeName will only yield a result if the name of the attribute is fixed, but mayHaveAttributeName can hold for several attribut names. Thus, I would expect something like

if random.randint(0,1):
    attr = "foo"
else:
    attr = "bar"
x = getattr(object, attr)

to have (at least) two values for mayHaveAttributeName.

Actually, this has me wondering. Perhaps getAttributeName should be rewritten to do the same kind of local flow as mayHaveAttributeName, but only yield a value if the result is unique. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CodeQL, I have come to expect that a predicate named getAttributeName may yield more than one value. Is the case where the name of the attribute is fixed important?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately, I think this will depend on whether we end up seeing false positives because of attribute confusion. I can certainly see an argument for using mayHaveAttributeName in something like type tracking, where we want to propagate types as much as possible (even at the cost of a bit of imprecision), but I can also imagine a situation where conflating two attribute names leads to an erroneous flow of taint. (So in particular, the data flow library itself should probably use getAttributeName for precision.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I was curious, and added a test case to see if we were getting the intended behaviour for mayHaveAttributeName, and it seems we're not.

def setattr_indirect_multiple_write():
    if random.randint(0,1):
        attr = "foo"
    else:
        attr = "bar"
    x = SomeClass() # $tracked=foo $f-:tracked=bar
    setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=bar

Note the f- annotations above. It seems we only consider local flow from attr = foo and not attr = bar. This is true even if I negate the conditional, so I expect it's simply always picking the first branch. This feels like it might be a bug in the implementation of local flow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yes, we will have to sort that out...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you try the same with a conditional expression, do you get the expected behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test code passes (with f- annotations):

def setattr_indirect_multiple_write_ifexpr():
    attr = "foo" if random.randint(0,1) else "bar"
    x = SomeClass() # $tracked=foo $f-:tracked=bar
    setattr(x, attr, tracked) # $tracked $tracked=foo $f-:tracked=bar

So, no. Same behaviour. ☹️
(Also this was after manually applying 0f077f5 manually, since that commit is not present on this branch.)

I'm wondering if the problem is elsewhere, though. I'll have to debug this a bit.

attrName = this.getAttributeName()
or
exists(Node nodeFrom |
localFlow(nodeFrom, this.getAttributeNameExpr()) and
attrName = nodeFrom.asExpr().(StrConst).getText()
)
}

/**
* Gets the name of the attribute being read or written. For dynamic attribute accesses, this
* method is not guaranteed to return a result. For such cases, using `mayHaveAttributeName` may yield
* better results.
*/
abstract string getAttributeName();
}

/**
* A data flow node that writes an attribute of an object. This includes
* - Simple attribute writes: `object.attr = value`
* - Dynamic attribute writes: `setattr(object, attr, value)`
* - Fields written during class initialization: `class MyClass: attr = value`
*/
abstract class AttrWrite extends AttrRef {
/** Gets the data flow node corresponding to the value that is written to the attribute. */
abstract Node getValue();
}

/**
* Represents a control flow node for a simple attribute assignment. That is,
* ```python
* object.attr = value
* ```
* Also gives access to the `value` being written, by extending `DefinitionNode`.
*/
private class AttributeAssignmentNode extends DefinitionNode, AttrNode, DataFlowCfgNode {
override ControlFlowNode getValue() { result = DefinitionNode.super.getValue() }
}

/** A simple attribute assignment: `object.attr = value`. */
private class AttributeAssignmentAsAttrWrite extends AttrWrite, CfgNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very long class name exposing information the type-system also knows. Would something simpler work?

I initially suggested AttributeAssignment, but reading through more of the code, it starts to make sense for me why you did this. Would a postfix of Node make things clearer? (I guess not, since that still doesn't say whether it's a DataFlow::Node or a ControlFlowNode... damn all those nodes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the long name, yeah, I agree it's a bit of a mouthful. The JS libraries has a similar affliction (e.g. StaticClassMemberAsPropWrite). However, the class is private, and not really intended for public consumption (whereas AttrWrite very much is).

override AttributeAssignmentNode node;

override Node getValue() { result.asCfgNode() = node.getValue() }

override Node getObject() { result.asCfgNode() = node.getObject() }

override ExprNode getAttributeNameExpr() {
// Attribute names don't exist as `Node`s in the control flow graph, as they can only ever be
// identifiers, and are therefore represented directly as strings.
// Use `getAttributeName` to access the name of the attribute.
none()
}

override string getAttributeName() { result = node.getName() }
}

import semmle.python.types.Builtins

/** Represents `CallNode`s that may refer to calls to built-in functions or classes. */
private class BuiltInCallNode extends CallNode, DataFlowCfgNode {
string name;

BuiltInCallNode() {
// TODO disallow instances where the name of the built-in may refer to an in-scope variable of that name.
exists(NameNode id | this.getFunction() = id and id.getId() = name and id.isGlobal()) and
name = any(Builtin b).getName()
}

/** Gets the name of the built-in function that is called at this `CallNode` */
string getBuiltinName() { result = name }
}

/**
* Represents a call to the built-ins that handle dynamic inspection and modification of
* attributes: `getattr`, `setattr`, `hasattr`, and `delattr`.
*/
private class BuiltinAttrCallNode extends BuiltInCallNode {
BuiltinAttrCallNode() { name in ["setattr", "getattr", "hasattr", "delattr"] }

/** Gets the control flow node for object on which the attribute is accessed. */
ControlFlowNode getObject() { result in [this.getArg(0), this.getArgByName("object")] }

/**
* Gets the control flow node for the value that is being written to the attribute.
* Only relevant for `setattr` calls.
*/
ControlFlowNode getValue() {
// only valid for `setattr`
name = "setattr" and
result in [this.getArg(2), this.getArgByName("value")]
}

/** Gets the control flow node that defines the name of the attribute being accessed. */
ControlFlowNode getName() { result in [this.getArg(1), this.getArgByName("name")] }
}

/** Represents calls to the built-in `setattr`. */
private class SetAttrCallNode extends BuiltinAttrCallNode {
SetAttrCallNode() { name = "setattr" }
}

/** Represents calls to the built-in `getattr`. */
private class GetAttrCallNode extends BuiltinAttrCallNode {
GetAttrCallNode() { name = "getattr" }
}

/** An attribute assignment using `setattr`, e.g. `setattr(object, attr, value)` */
private class SetAttrCallAsAttrWrite extends AttrWrite, CfgNode {
override SetAttrCallNode node;

override Node getValue() { result.asCfgNode() = node.getValue() }

override Node getObject() { result.asCfgNode() = node.getObject() }

override ExprNode getAttributeNameExpr() { result.asCfgNode() = node.getName() }

override string getAttributeName() {
result = this.getAttributeNameExpr().asExpr().(StrConst).getText()
}
}

/**
* Represents an attribute of a class that is assigned statically during class definition. For instance
* ```python
* class MyClass:
* attr = value
* ...
* ```
* Instances of this class correspond to the `NameNode` for `attr`, and also gives access to `value` by
* virtue of being a `DefinitionNode`.
*/
private class ClassAttributeAssignmentNode extends DefinitionNode, NameNode, DataFlowCfgNode { }

/**
* An attribute assignment via a class field, e.g.
* ```python
* class MyClass:
* attr = value
* ```
* is treated as equivalent to `MyClass.attr = value`.
*/
private class ClassDefinitionAsAttrWrite extends AttrWrite, CfgNode {
ClassExpr cls;
override ClassAttributeAssignmentNode node;

ClassDefinitionAsAttrWrite() { node.getScope() = cls.getInnerScope() }

override Node getValue() { result.asCfgNode() = node.getValue() }

override Node getObject() { result.asCfgNode() = cls.getAFlowNode() }

override ExprNode getAttributeNameExpr() { none() }

override string getAttributeName() { result = node.getId() }
}

/**
* A read of an attribute on an object. This includes
* - Simple attribute reads: `object.attr`
* - Dynamic attribute reads using `getattr`: `getattr(object, attr)`
* - Qualified imports: `from module import attr as name`
*/
abstract class AttrRead extends AttrRef, Node { }

/**
* A convenience class for embedding `AttrNode` into `DataFlowCfgNode`, as the former is not
* obviously a subtype of the latter.
*/
private class DataFlowAttrNode extends AttrNode, DataFlowCfgNode { }

/** A simple attribute read, e.g. `object.attr` */
private class AttributeReadAsAttrRead extends AttrRead, CfgNode {
override DataFlowAttrNode node;

override Node getObject() { result.asCfgNode() = node.getObject() }

override ExprNode getAttributeNameExpr() {
// Attribute names don't exist as `Node`s in the control flow graph, as they can only ever be
// identifiers, and are therefore represented directly as strings.
// Use `getAttributeName` to access the name of the attribute.
none()
}

override string getAttributeName() { result = node.getName() }
}

/** An attribute read using `getattr`: `getattr(object, attr)` */
private class GetAttrCallAsAttrRead extends AttrRead, CfgNode {
override GetAttrCallNode node;

override Node getObject() { result.asCfgNode() = node.getObject() }

override ExprNode getAttributeNameExpr() { result.asCfgNode() = node.getName() }

override string getAttributeName() {
result = this.getAttributeNameExpr().asExpr().(StrConst).getText()
}
}

/**
* A convenience class for embedding `ImportMemberNode` into `DataFlowCfgNode`, as the former is not
* obviously a subtype of the latter.
*/
private class DataFlowImportMemberNode extends ImportMemberNode, DataFlowCfgNode { }

/**
* Represents a named import as an attribute read. That is,
* ```python
* from module import attr as attr_ref
* ```
* is treated as if it is a read of the attribute `module.attr`, even if `module` is not imported directly.
*/
private class ModuleAttributeImportAsAttrRead extends AttrRead, CfgNode {
override DataFlowImportMemberNode node;

override Node getObject() { result.asCfgNode() = node.getModule(_) }

override ExprNode getAttributeNameExpr() {
// The name of an imported attribute doesn't exist as a `Node` in the control flow graph, as it
// can only ever be an identifier, and is therefore represented directly as a string.
// Use `getAttributeName` to access the name of the attribute.
none()
}

override string getAttributeName() { exists(node.getModule(result)) }
}
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
private import python
private import DataFlowPrivate
import experimental.dataflow.TypeTracker
import Attributes
private import semmle.python.essa.SsaCompute

/**
Expand Down
30 changes: 27 additions & 3 deletions python/ql/src/experimental/dataflow/internal/DataFlowUtil.qll
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ predicate localFlowStep(Node nodeFrom, Node nodeTo) { simpleLocalFlowStep(nodeFr
predicate localFlow(Node source, Node sink) { localFlowStep*(source, sink) }

/**
* Gets an EssaNode that holds the module imported by `name`.
* Gets a `Node` that refers to the module referenced by `name`.
* Note that for the statement `import pkg.mod`, the new variable introduced is `pkg` that is a
* reference to the module `pkg`.
*
Expand All @@ -27,6 +27,9 @@ predicate localFlow(Node source, Node sink) { localFlowStep*(source, sink) }
* 2. `from <package> import <module>` when `<name> = <package> + "." + <module>`
* 3. `from <module> import <member>` when `<name> = <module> + "." + <member>`
*
* Finally, in `from <module> import <member>` we consider the `ImportExpr` corresponding to
* `<module>` to be a reference to that module.
*
* Note:
* While it is technically possible that `import mypkg.foo` and `from mypkg import foo` can give different values,
* it's highly unlikely that this will be a problem in production level code.
Expand All @@ -36,7 +39,7 @@ predicate localFlow(Node source, Node sink) { localFlowStep*(source, sink) }
*
* Also see `DataFlow::importMember`
*/
EssaNode importModule(string name) {
Node importModule(string name) {
exists(Variable var, Import imp, Alias alias |
alias = imp.getAName() and
alias.getAsname() = var.getAStore() and
Expand All @@ -45,8 +48,29 @@ EssaNode importModule(string name) {
or
name = alias.getValue().(ImportExpr).getImportedModuleName()
) and
result.getVar().(AssignmentDefinition).getSourceVariable() = var
result.(EssaNode).getVar().(AssignmentDefinition).getSourceVariable() = var
)
or
// Although it may seem superfluous to consider the `foo` part of `from foo import bar as baz` to
// be a reference to a module (since that reference only makes sense locally within the `import`
// statement), it's important for our use of type trackers to consider this local reference to
// also refer to the `foo` module. That way, if one wants to track references to the `bar`
// attribute using a type tracker, one can simply write
//
// ```ql
// DataFlow::Node bar_attr_tracker(TypeTracker t) {
// t.startInAttr("bar") and
// result = foo_module_tracker()
// or
// exists(TypeTracker t2 | result = bar_attr_tracker(t2).track(t2, t))
// }
// ```
//
// Where `foo_module_tracker` is a type tracker that tracks references to the `foo` module.
// Because named imports are modelled as `AttrRead`s, the statement `from foo import bar as baz`
// is interpreted as if it was an assignment `baz = foo.bar`, which means `baz` gets tracked as a
// reference to `foo.bar`, as desired.
result.asCfgNode().getNode() = any(ImportExpr i | i.getAnImportedModuleName() = name)
}

/**
Expand Down
Original file line number Diff line number Diff line change
@@ -1,16 +1,36 @@
importModule
| test1.py:1:8:1:12 | ControlFlowNode for ImportExpr | mypkg |
| test1.py:1:8:1:12 | GSSA Variable mypkg | mypkg |
| test2.py:1:6:1:10 | ControlFlowNode for ImportExpr | mypkg |
| test2.py:1:6:1:10 | ControlFlowNode for ImportExpr | mypkg |
| test2.py:1:19:1:21 | GSSA Variable foo | mypkg.foo |
| test2.py:1:24:1:26 | GSSA Variable bar | mypkg.bar |
| test3.py:1:8:1:16 | ControlFlowNode for ImportExpr | mypkg |
| test3.py:1:8:1:16 | ControlFlowNode for ImportExpr | mypkg.foo |
| test3.py:2:8:2:16 | ControlFlowNode for ImportExpr | mypkg |
| test3.py:2:8:2:16 | ControlFlowNode for ImportExpr | mypkg.bar |
| test3.py:2:8:2:16 | GSSA Variable mypkg | mypkg |
| test4.py:1:8:1:16 | ControlFlowNode for ImportExpr | mypkg |
| test4.py:1:8:1:16 | ControlFlowNode for ImportExpr | mypkg.foo |
| test4.py:1:21:1:24 | GSSA Variable _foo | mypkg.foo |
| test4.py:2:8:2:16 | ControlFlowNode for ImportExpr | mypkg |
| test4.py:2:8:2:16 | ControlFlowNode for ImportExpr | mypkg.bar |
| test4.py:2:21:2:24 | GSSA Variable _bar | mypkg.bar |
| test5.py:1:8:1:12 | ControlFlowNode for ImportExpr | mypkg |
| test5.py:1:8:1:12 | GSSA Variable mypkg | mypkg |
| test5.py:9:6:9:10 | ControlFlowNode for ImportExpr | mypkg |
| test5.py:9:26:9:29 | GSSA Variable _bar | mypkg.bar |
| test6.py:1:8:1:12 | ControlFlowNode for ImportExpr | mypkg |
| test6.py:1:8:1:12 | GSSA Variable mypkg | mypkg |
| test6.py:5:8:5:16 | ControlFlowNode for ImportExpr | mypkg |
| test6.py:5:8:5:16 | ControlFlowNode for ImportExpr | mypkg.foo |
| test6.py:5:8:5:16 | GSSA Variable mypkg | mypkg |
| test7.py:1:6:1:10 | ControlFlowNode for ImportExpr | mypkg |
| test7.py:1:19:1:21 | GSSA Variable foo | mypkg.foo |
| test7.py:5:8:5:16 | ControlFlowNode for ImportExpr | mypkg |
| test7.py:5:8:5:16 | ControlFlowNode for ImportExpr | mypkg.foo |
| test7.py:5:8:5:16 | GSSA Variable mypkg | mypkg |
| test7.py:9:6:9:10 | ControlFlowNode for ImportExpr | mypkg |
| test7.py:9:19:9:21 | GSSA Variable foo | mypkg.foo |
importMember
| test2.py:1:19:1:21 | GSSA Variable foo | mypkg | foo |
Expand Down
Loading