Python: add models for stdlib #15306

Draft
wants to merge 43 commits into
base: main
43 commits
0072941
Python: add models for `stdlib`
yoff Jan 12, 2024
e8ba502
Python: small refactor
yoff Jan 23, 2024
a5ab628
Python: add flow to `re.compile`
yoff Jan 23, 2024
74f330f
Python: convenience for running on dbs with stdlib
yoff Jan 23, 2024
a5dcd79
Python: script to find stdlib uses
yoff Jan 23, 2024
c31c0cb
python: use generated query
yoff Jan 31, 2024
d8952e1
python: Start modelling using MaD
yoff Feb 2, 2024
6afe275
python: Improve query and add new modelling
yoff Feb 2, 2024
4e68967
Python: tweak script and use new models
yoff Feb 5, 2024
9f25e39
python: more robust query, more summaries
yoff Feb 6, 2024
3d1e7c7
python: refactor script
yoff Feb 8, 2024
4b91823
python: more robust query
yoff Feb 9, 2024
aaa2aa6
python: more tests
yoff Feb 21, 2024
14ffa6c
python: adapt some models by hand
yoff Feb 21, 2024
1e286c1
python: attempt to improve query
yoff Feb 22, 2024
b12bc20
python: improve `FindUses.ql`
yoff Feb 26, 2024
7378528
python: split up file
yoff Mar 6, 2024
b6c2de4
python: bit more models
yoff Mar 6, 2024
a6e87c0
python: add handy query predicate
yoff Mar 19, 2024
f104192
python: improve query
yoff Mar 19, 2024
237f19c
python: comment out query predicate
yoff Mar 19, 2024
8a1fd7a
python: use updated models
yoff Mar 19, 2024
2c42b78
python: properly detect writing to `self`
yoff Mar 19, 2024
50d28c7
python: update model
yoff Mar 19, 2024
bbf8e77
python: fix spelling
yoff Mar 19, 2024
cf06a05
python: better handling of methods
yoff Mar 20, 2024
a3ed75a
python: example of including only one query
yoff Mar 22, 2024
8a55580
python: lost model
yoff Mar 22, 2024
ad4359e
python: more robust programming
yoff Apr 9, 2024
0340529
python: improved script and new models
yoff Apr 10, 2024
b7ac6fc
python: do not extract stdlib by default
yoff Apr 11, 2024
335a02a
python: improve generator and update models
yoff Apr 12, 2024
b03fa93
Python: Model that `asyncio.log.logger` is a `logging.Logger`
yoff Apr 24, 2024
15c124f
python: logger detection improvements
yoff Apr 26, 2024
b789948
python: logger detection improvements
yoff Apr 26, 2024
dc2a7a6
Merge branch 'main' of https://github.com/github/codeql into python/a…
yoff May 27, 2024
5c3e8f5
Python: models for command injection and cleartext logging
yoff May 29, 2024
c80de96
Python: models for path injection
yoff May 30, 2024
ca51eed
python: more path-injection modeling
yoff Jun 3, 2024
cc77fb1
Merge branch 'main' of https://github.com/github/codeql into python/a…
yoff Jun 4, 2024
b0f7235
Python: add comments
yoff Jun 14, 2024
64ae950
python: do `copy.copy/deepcopy` as dataflow
yoff Jun 14, 2024
2a77a49
Merge branch 'main' of https://github.com/github/codeql into python/a…
yoff Jun 18, 2024
@@ -13,10 +13,10 @@ rm -rf dbs

mkdir dbs

CODEQL_EXTRACTOR_PYTHON_DONT_EXTRACT_STDLIB=True $CODEQL database create dbs/without-stdlib --language python --source-root repo_dir/
$CODEQL database create dbs/without-stdlib --language python --source-root repo_dir/
$CODEQL query run --database dbs/without-stdlib query.ql > query.without-stdlib.actual
diff query.without-stdlib.expected query.without-stdlib.actual

LGTM_INDEX_EXCLUDE="/usr/lib/**" $CODEQL database create dbs/with-stdlib --language python --source-root repo_dir/
LGTM_INDEX_EXCLUDE="/usr/lib/**" CODEQL_EXTRACTOR_PYTHON_EXTRACT_STDLIB=True $CODEQL database create dbs/with-stdlib --language python --source-root repo_dir/
$CODEQL query run --database dbs/with-stdlib query.ql > query.with-stdlib.actual
diff query.with-stdlib.expected query.with-stdlib.actual
8 changes: 4 additions & 4 deletions python/extractor/semmle/cmdline.py
@@ -102,8 +102,8 @@ def make_parser():
config_options.add_option("--colorize", dest="colorize", default=False, action="store_true",
help = """Colorize the logging output.""")

config_options.add_option("--dont-extract-stdlib", dest="extract_stdlib", default=True, action="store_false",
help="Do not extract the standard library.")
config_options.add_option("--extract-stdlib", dest="extract_stdlib", default=False, action="store_true",
help="Extract the standard library.")

parser.add_option_group(config_options)

@@ -224,8 +224,8 @@ def parse(command_line):
max_import_depth = float('inf')
options.max_import_depth = max_import_depth

if 'CODEQL_EXTRACTOR_PYTHON_DONT_EXTRACT_STDLIB' in os.environ:
options.extract_stdlib = False
if 'CODEQL_EXTRACTOR_PYTHON_EXTRACT_STDLIB' in os.environ:
options.extract_stdlib = True

options.prune = True
return options, args
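
The flag and environment handling changed in this hunk can be sketched in isolation. This is a minimal sketch, not the extractor's actual `make_parser`/`parse` code: the helper name `parse_extract_stdlib` and its `environ` parameter are hypothetical, and it assumes an `optparse`-style parser, as the `add_option` calls suggest. Only the `--extract-stdlib` flag and the `CODEQL_EXTRACTOR_PYTHON_EXTRACT_STDLIB` variable are taken from the diff.

```python
import os
from optparse import OptionParser


def parse_extract_stdlib(command_line, environ=None):
    """Return True if stdlib extraction is requested (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    parser = OptionParser()
    # Opt-in flag, off by default, mirroring the new option in the hunk.
    parser.add_option("--extract-stdlib", dest="extract_stdlib",
                      default=False, action="store_true",
                      help="Extract the standard library.")
    options, _args = parser.parse_args(command_line)
    # As in the hunk, only the presence of the variable is checked,
    # so any value (even "False" or an empty string) enables extraction.
    if "CODEQL_EXTRACTOR_PYTHON_EXTRACT_STDLIB" in environ:
        options.extract_stdlib = True
    return options.extract_stdlib
```

Because the check is a presence test, setting `CODEQL_EXTRACTOR_PYTHON_EXTRACT_STDLIB=False` would still turn extraction on in this sketch.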
102 changes: 102 additions & 0 deletions python/ql/lib/ext/StdLib.model.yml
@@ -0,0 +1,102 @@
extensions:
  - addsTo:
      pack: codeql/python-all
      extensible: sourceModel
    data: []

  - addsTo:
      pack: codeql/python-all
      extensible: sinkModel
    data:
      - ["subprocess.Popen!","Subclass.Call.Argument[0,args:]", "log-injection"]
      - ["zipfile.ZipFile","Member[extractall].Argument[0,path:]", "path-injection"]

  - addsTo:
      pack: codeql/python-all
      extensible: summaryModel
    data:
      - ["_collections_abc", "Member[Mapping].Subclass.Instance.Member[get]", "Argument[1,default:]", "ReturnValue", "taint"]
      - ["argparse", "Member[ArgumentParser].Subclass.Instance.Member[_parse_known_args]", "Argument[0,arg_strings:]", "ReturnValue", "taint"]
      - ["argparse", "Member[ArgumentParser].Subclass.Instance.Member[_read_args_from_files]", "Argument[0,arg_strings:]", "ReturnValue", "taint"]
      - ["argparse", "Member[ArgumentParser].Subclass.Instance.Member[parse_args]", "Argument[0,args:]", "ReturnValue", "taint"]
      - ["argparse", "Member[ArgumentParser].Subclass.Instance.Member[parse_known_args]", "Argument[0,args:]", "ReturnValue", "taint"]
      - ["cgi", "Member[FieldStorage].Subclass.Instance.Member[getvalue]", "Argument[self]", "ReturnValue", "taint"]
      - ["contextlib", "Member[_BaseExitStack].Subclass.Instance.Member[enter_context]", "Argument[0,cm:]", "ReturnValue", "taint"]
      - ["copy", "Member[copy,deepcopy]", "Argument[0,x:]", "ReturnValue", "value"]
      - ["ctypes", "Member[create_unicode_buffer]", "Argument[0,init:]", "ReturnValue", "taint"]
      - ["distutils", "Member[util].Member[change_root]", "Argument[0,new_root:]", "ReturnValue", "taint"]
      - ["email", "Member[header].Member[Header].Subclass.Call", "Argument[0,s:]", "ReturnValue", "taint"]
      - ["email", "Member[utils].Member[parseaddr]", "Argument[0,addr:]", "ReturnValue", "taint"]
      - ["fnmatch", "Member[filter]", "Argument[0,names:]", "ReturnValue", "taint"]
      - ["functools", "Member[reduce]", "Argument[1,sequence:]", "ReturnValue", "taint"]
      - ["getopt", "Member[getopt]", "Argument[0,args:]", "ReturnValue", "taint"]
      - ["getopt", "Member[getopt]", "Argument[2,longopts:]", "ReturnValue", "taint"]
      - ["gettext", "Member[gettext]", "Argument[0,message:]", "ReturnValue", "taint"]
      - ["gzip", "Member[GzipFile].Subclass.Call", "Argument[0,filename:]", "ReturnValue", "taint"]
      - ["html", "Member[escape]", "Argument[0,s:]", "ReturnValue", "taint"]
      - ["html", "Member[parser].Member[HTMLParser].Subclass.Instance.Member[feed]", "Argument[0,data:]", "Argument[self]", "taint"]
      - ["imp", "Member[find_module]", "Argument[0,name:]", "ReturnValue", "taint"]
      - ["imp", "Member[find_module]", "Argument[1,path:]", "ReturnValue", "taint"]
      - ["logging", "Member[getLevelName]", "Argument[0,level:]", "ReturnValue", "taint"]
      - ["logging", "Member[LogRecord].Subclass.Instance.Member[getMessage]", "Argument[self]", "ReturnValue", "taint"]
      - ["mimetypes", "Member[guess_type]", "Argument[0,url:]", "ReturnValue", "taint"]
      - ["multiprocessing", "Member[connection].Member[Listener].Subclass.Call", "Argument[3,authkey:]", "ReturnValue", "taint"]
      - ["nturl2path", "Member[pathname2url]", "Argument[0,p:]", "ReturnValue", "taint"]
      - ["nturl2path", "Member[url2pathname]", "Argument[0,url:]", "ReturnValue", "taint"]
      - ["optparse", "Member[OptionParser].Subclass.Instance.Member[parse_args]", "Argument[0,args:]", "ReturnValue", "taint"]
      - ["pathlib", "Member[Path].Subclass.Instance.Member[__enter__]", "Argument[self]", "ReturnValue", "taint"]
      - ["pathlib", "Member[PurePath].Subclass.Instance.Member[__fspath__]", "Argument[self]", "ReturnValue", "taint"]
      - ["queue", "Member[Queue].Subclass.Instance.Member[put]", "Argument[0,item:]", "Argument[self]", "taint"]
      - ["random", "Member[choice]", "Argument[0,seq:]", "ReturnValue", "taint"]
      - ["random", "Member[Random].Subclass.Instance.Member[choice]", "Argument[0,seq:]", "ReturnValue", "taint"]
      - ["re", "Member[split]", "Argument[0,pattern:]", "ReturnValue", "taint"]
      - ["shlex", "Member[quote]", "Argument[0,s:]", "ReturnValue", "taint"]
      - ["shutil", "Member[which]", "Argument[0,cmd:]", "ReturnValue", "taint"]
      - ["shutil", "Member[which]", "Argument[2,path:]", "ReturnValue", "taint"]
      - ["subprocess", "Member[Popen].Subclass.Call", "Argument[0,args:]", "ReturnValue", "taint"]
      - ["tarfile", "Member[TarFile].Subclass.Instance.Member[open]", "Argument[0,name:]", "ReturnValue", "taint"]
      - ["tarfile", "Member[TarFile].Subclass.Instance.Member[open]", "Argument[2,fileobj:]", "ReturnValue", "taint"]
      - ["tempfile", "Member[mkdtemp]", "Argument[0,suffix:]", "ReturnValue", "taint"]
      - ["tempfile", "Member[mkdtemp]", "Argument[1,prefix:]", "ReturnValue", "taint"]
      - ["tempfile", "Member[mkdtemp]", "Argument[2,dir:]", "ReturnValue", "taint"]
      - ["tempfile", "Member[mkstemp]", "Argument[0,suffix:]", "ReturnValue", "taint"]
      - ["tempfile", "Member[mkstemp]", "Argument[2,dir:]", "ReturnValue", "taint"]
      - ["textwrap", "Member[dedent]", "Argument[0,text:]", "ReturnValue", "taint"]
      - ["traceback", "Member[StackSummary].Subclass.Instance.Member[from_list]", "Argument[0,a_list:]", "ReturnValue", "taint"]
      - ["typing", "Member[cast]", "Argument[1,val:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[quote_plus]", "Argument[0,string:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[quote]", "Argument[0,string:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[splitquery]", "Argument[0,url:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[unquote_plus]", "Argument[0,string:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[unquote]", "Argument[0,string:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[urlencode]", "Argument[0,query:]", "ReturnValue", "taint"]
      - ["urllib", "Member[parse].Member[urljoin]", "Argument[1,url:]", "ReturnValue", "taint"]
      - ["urllib", "Member[request].Member[pathname2url]", "Argument[0,pathname:]", "ReturnValue", "taint"]
      - ["urllib", "Member[request].Member[Request].Subclass.Call", "Argument[0,url:]", "ReturnValue", "taint"]
      - ["urllib", "Member[request].Member[Request].Subclass.Instance.Member[get_full_url]", "Argument[self]", "ReturnValue", "taint"]
      - ["urllib", "Member[request].Member[url2pathname]", "Argument[0,pathname:]", "ReturnValue", "taint"]
      - ["urllib", "Member[request].Member[urlretrieve]", "Argument[0,url:]", "ReturnValue", "taint"]
      - ["urllib", "Member[request].Member[unquote]", "Argument[0,string:]", "ReturnValue", "taint"]
      - ["urllib2", "Member[unquote]", "Argument[0,string:]", "ReturnValue", "taint"]
      - ["zipfile", "Member[CompleteDirs].Subclass.Instance.Member[namelist]", "Argument[self]", "ReturnValue", "taint"]
      - ["zipfile", "Member[ZipFile].Subclass.Call", "Argument[0,file:]", "ReturnValue", "taint"]
      - ["zipfile", "Member[ZipFile].Call", "Argument[0,file:]", "ReturnValue.Attribute[filelist].ListElement.Attribute[filename]", "value"]
      - ["zipfile", "Member[ZipFile].Subclass.Instance.Member[_extract_member]", "Argument[1,targetpath:]", "ReturnValue", "taint"]
      - ["zipfile", "Member[ZipFile].Subclass.Instance.Member[infolist]", "Argument[self]", "ReturnValue", "taint"]
      - ["zipfile", "Member[ZipFile].Subclass.Instance.Member[infolist]", "Argument[self].Attribute[filelist]", "ReturnValue", "value"]
      - ["zipfile", "Member[ZipFile].Subclass.Instance.Member[namelist]", "Argument[self]", "ReturnValue", "taint"]
      - ["shutil", "Member[rmtree]", "Argument[0,path:]", "Argument[2,onerror:].Argument[1]", "taint"]
  - addsTo:
      pack: codeql/python-all
      extensible: neutralModel
    data: []

  - addsTo:
      pack: codeql/python-all
      extensible: typeModel
    data: []

  - addsTo:
      pack: codeql/python-all
      extensible: typeVariableModel
    data: []
2 changes: 2 additions & 0 deletions python/ql/lib/qlpack.yml
@@ -15,4 +15,6 @@ dependencies:
codeql/yaml: ${workspace}
dataExtensions:
- semmle/python/frameworks/**/*.model.yml
- ext/*.model.yml
- ext/generated/*.model.yml
warnOnImplicitThis: true
@@ -28,6 +28,12 @@ newtype TNode =
isExpressionNode(node)
or
node.getNode() instanceof Pattern
// not node.getLocation().getFile().inStdlib() and
// (
// isExpressionNode(node)
// or
// node.getNode() instanceof Pattern
// )
} or
/**
* A node corresponding to a scope entry definition. That is, the value of a variable
@@ -46,8 +46,6 @@ private module Cached {
or
containerStep(nodeFrom, nodeTo)
or
copyStep(nodeFrom, nodeTo)
or
DataFlowPrivate::forReadStep(nodeFrom, _, nodeTo)
or
DataFlowPrivate::iterableUnpackingReadStep(nodeFrom, _, nodeTo)
@@ -67,6 +65,12 @@ private module Cached {
}
}

predicate summaryLocalStep(DataFlow::Node nodeFrom, DataFlow::Node nodeTo, string model) {
FlowSummaryImpl::Private::Steps::summaryLocalStep(nodeFrom
.(DataFlowPrivate::FlowSummaryNode)
.getSummaryNode(), nodeTo.(DataFlowPrivate::FlowSummaryNode).getSummaryNode(), false, model)
}

import Cached

/**
@@ -191,18 +195,6 @@ predicate containerStep(DataFlow::Node nodeFrom, DataFlow::Node nodeTo) {
DataFlowPrivate::comprehensionStoreStep(nodeFrom, _, nodeTo)
}

/**
* Holds if taint can flow from `nodeFrom` to `nodeTo` with a step related to copying.
*/
predicate copyStep(DataFlow::CfgNode nodeFrom, DataFlow::CfgNode nodeTo) {
exists(DataFlow::CallCfgNode call | call = nodeTo |
call = API::moduleImport("copy").getMember(["copy", "deepcopy"]).getACall() and
call.getArg(0) = nodeFrom
)
or
nodeTo.(DataFlow::MethodCallNode).calls(nodeFrom, "copy")
}
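
For reference, the three copy forms that the removed `copyStep` predicate matched behave as follows on the Python side. This is only an illustrative sketch (the `payload` data is made up); of the three, `copy.copy` and `copy.deepcopy` are the ones that appear as a `value` row in `StdLib.model.yml` in this PR.

```python
import copy

payload = {"user": ["alice"]}  # made-up data for illustration

shallow = copy.copy(payload)    # new dict, same inner list object
deep = copy.deepcopy(payload)   # new dict and a new inner list
method = payload.copy()         # dict.copy(), also a shallow copy
```

In each case the result compares equal to the input, which is consistent with treating the call as value-preserving rather than as a taint-only step.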

/**
* Holds if taint can flow from `nodeFrom` to `nodeTo` with an `await`-step,
* such that the whole expression `await x` is tainted if `x` is tainted.
131 changes: 129 additions & 2 deletions python/ql/lib/semmle/python/frameworks/Stdlib.qll
@@ -254,10 +254,14 @@ module Stdlib {
* See https://docs.python.org/3.9/library/logging.html#logging.Logger.
*/
module Logger {
private import semmle.python.dataflow.new.internal.DataFlowDispatch as DD

/** Gets a reference to the `logging.Logger` class or any subclass. */
API::Node subclassRef() {
result = API::moduleImport("logging").getMember("Logger").getASubclass*()
or
result = API::moduleImport("logging").getMember("getLoggerClass").getReturn().getASubclass*()
or
result = ModelOutput::getATypeNode("logging.Logger~Subclass").getASubclass*()
}
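
The new `getLoggerClass` disjunct targets code that derives a logger class from the return value of `logging.getLoggerClass()` instead of naming `logging.Logger` directly. A minimal sketch of that pattern follows; the `AuditLogger` class and its `audit` method are hypothetical and exist only for illustration.

```python
import logging

# logging.getLoggerClass() returns logging.Logger unless
# logging.setLoggerClass() has replaced it earlier in the process.
Base = logging.getLoggerClass()


class AuditLogger(Base):  # hypothetical subclass, for illustration only
    def audit(self, msg):
        # Inside the class body, `self` is itself a logger instance.
        self.info("AUDIT %s", msg)


is_logger_subclass = issubclass(AuditLogger, logging.Logger)
```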

@@ -277,6 +281,13 @@ module Stdlib {
ClassInstantiation() {
this = subclassRef().getACall()
or
this =
DD::selfTracker(subclassRef()
.getAValueReachableFromSource()
.asExpr()
.(ClassExpr)
.getInnerScope())
or
this = API::moduleImport("logging").getMember("root").asSource()
or
this = API::moduleImport("logging").getMember("getLogger").getACall()
@@ -1492,6 +1503,8 @@ module StdlibPrivate {
or
// io.open is a special case, since it is an alias for the builtin `open`
result = API::moduleImport("io").getMember("open")
or
result = API::moduleImport("codecs").getMember("open")
}
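
The added `codecs.open` line treats that function like the builtin `open`. As a small sketch of why that is plausible: `codecs.open` takes the file name as its first argument and returns a file-like object. The temporary directory and the file name below are assumptions made so the example is self-contained.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # illustrative file name
    # The first argument is the path, as with the builtin open().
    with codecs.open(path, "w", encoding="utf-8") as fh:
        fh.write("héllo")
    with codecs.open(path, "r", encoding="utf-8") as fh:
        content = fh.read()
```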

/**
@@ -2655,6 +2668,16 @@ module StdlibPrivate {
}
}

// // Codecs
// /** A file system access from a `codecs.open` call. */
// private class CodecsFileAccess extends FileSystemAccess::Range, API::CallNode {
// DataFlow::Node pathArgument;
// CodecsFileAccess() {
// this = API::moduleImport("codecs").getMember("open").getACall() and
// pathArgument = this.getParameter(0, "filename").asSink()
// }
// override DataFlow::Node getAPathArgument() { result = pathArgument }
// }
// ---------------------------------------------------------------------------
// pathlib
// ---------------------------------------------------------------------------
@@ -3251,8 +3274,13 @@ module StdlibPrivate {

override predicate propagatesFlow(string input, string output, boolean preservesValue) {
input in ["Argument[0]", "Argument[pattern:]"] and
output = "ReturnValue.Attribute[pattern]" and
preservesValue = true
(
output = "ReturnValue.Attribute[pattern]" and
preservesValue = true
or
output = "ReturnValue" and
preservesValue = false
)
}
}
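
The two outputs in this summary correspond to two ways the pattern string remains reachable after `re.compile`. The sketch below shows the value-preserving one: the compiled object exposes the original string through its `pattern` attribute. The `user_pattern` variable stands in for attacker-controlled input and is an assumption of the example.

```python
import re

user_pattern = r"\d+"  # stands in for attacker-controlled input
compiled = re.compile(user_pattern)

# The original string is stored on the compiled object.
pattern_roundtrips = compiled.pattern == user_pattern
```

The added `ReturnValue` output with `preservesValue = false` additionally marks the compiled object itself as tainted, which the attribute read alone would not express.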

@@ -3491,6 +3519,90 @@ module StdlibPrivate {
}
}

/**
* A flow summary for `urllib.parse.urljoin`
*
* See https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
*/
class UrljoinSummary extends SummarizedCallable {
UrljoinSummary() { this = "urllib.parse.urljoin" }

override DataFlow::CallCfgNode getACall() {
result = API::moduleImport("urllib").getMember("parse").getMember("urljoin").getACall()
}

override DataFlow::ArgumentNode getACallback() {
result =
API::moduleImport("urllib")
.getMember("parse")
.getMember("urljoin")
.getAValueReachableFromSource()
}

override predicate propagatesFlow(string input, string output, boolean preservesValue) {
input in ["Argument[0]", "Argument[base:]"] and
output = "ReturnValue" and
preservesValue = false
}
}
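
`UrljoinSummary` covers the first argument (`base`), while `StdLib.model.yml` in this PR carries a row for the second (`Argument[1,url:]`). A short sketch of the Python behaviour shows why either argument can determine the result; the URLs are placeholders.

```python
from urllib.parse import urljoin

base = "http://example.com/a/"

# A relative second argument is resolved against the base.
relative = urljoin(base, "b")

# An absolute second argument replaces the base entirely.
absolute = urljoin(base, "http://other.example/x")
```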

// ---------------------------------------------------------------------------
// fnmatch
// ---------------------------------------------------------------------------
/**
* A flow summary for `fnmatch.filter`
*
* See https://docs.python.org/3/library/fnmatch.html#fnmatch.filter
*/
class FnmatchFilterSummary extends SummarizedCallable {
FnmatchFilterSummary() { this = "fnmatch.filter" }

override DataFlow::CallCfgNode getACall() {
result = API::moduleImport("fnmatch").getMember("filter").getACall()
}

override DataFlow::ArgumentNode getACallback() {
result = API::moduleImport("fnmatch").getMember("filter").getAValueReachableFromSource()
}

override predicate propagatesFlow(string input, string output, boolean preservesValue) {
input in ["Argument[0].ListElement", "Argument[names:].ListElement"] and
output = "ReturnValue.ListElement" and
preservesValue = true
}
}
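
The `ListElement` to `ListElement` flow with `preservesValue = true` matches how `fnmatch.filter` behaves: it returns a list containing a subset of the input names, unchanged. A minimal sketch with made-up file names:

```python
import fnmatch

names = ["report.txt", "notes.md", "data.txt"]  # made-up input
matched = fnmatch.filter(names, "*.txt")
```

Every element of `matched` is one of the elements of `names`, so a tainted element of the input can reappear as an element of the output.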

// ---------------------------------------------------------------------------
// optparse
// ---------------------------------------------------------------------------
/**
* A flow summary for `optparse.OptionParser.parse_args`
*
* See https://docs.python.org/3/library/optparse.html
*/
class OptparseParseArgsSummary extends SummarizedCallable {
OptparseParseArgsSummary() { this = "optparse.parse_args" }

override DataFlow::CallCfgNode getACall() {
result =
API::moduleImport("optparse").getMember("OptionParser").getMember("parse_args").getACall()
}

override DataFlow::ArgumentNode getACallback() {
result =
API::moduleImport("optparse")
.getMember("OptionParser")
.getMember("parse_args")
.getAValueReachableFromSource()
}

override predicate propagatesFlow(string input, string output, boolean preservesValue) {
input in ["Argument[1]", "Argument[args:]"] and
output = "ReturnValue.TupleElement[1]" and
preservesValue = false
}
}
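
The `ReturnValue.TupleElement[1]` output reflects the shape of what `parse_args` returns: a pair of the parsed options and the leftover positional arguments. A small sketch, with a hypothetical `--name` option and argument list:

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option("--name", dest="name")  # hypothetical option

argv = ["--name", "alice", "input.txt"]
options, leftover = parser.parse_args(argv)
```

Here `leftover` holds the arguments that were not consumed as options, which is the part of the input that this summary tracks into element 1 of the returned tuple.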

// ---------------------------------------------------------------------------
// tempfile
// ---------------------------------------------------------------------------
@@ -4959,6 +5071,21 @@ module StdlibPrivate {

override predicate isShellInterpreted(DataFlow::Node arg) { arg = this.getCommand() }
}

/**
* An instance of `logging.Logger` from the `asyncio` module.
* See https://docs.python.org/3/library/asyncio-dev.html#logging
* and https://github.com/python/cpython/blob/3.12/Lib/asyncio/log.py#L7
*/
private class AsyncIOLogger extends Stdlib::Logger::InstanceSource {
AsyncIOLogger() {
this =
API::moduleImport("asyncio")
.getMember("log")
.getMember("logger")
.getAValueReachableFromSource()
}
}
}
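
The claim behind `AsyncIOLogger` can be checked directly from Python: the module-level `logger` in `asyncio.log` is an ordinary `logging.Logger`. The variable name `asyncio_logger` below is just for the example.

```python
import asyncio.log
import logging

asyncio_logger = asyncio.log.logger
```

It is the same object that `logging.getLogger("asyncio")` returns, so calls on it behave like calls on any other logger.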

// ---------------------------------------------------------------------------
@@ -96,4 +96,10 @@ module CleartextLogging {
)
}
}

private import semmle.python.frameworks.data.ModelsAsData

private class SinkFromModel extends Sink {
SinkFromModel() { this = ModelOutput::getASinkNode("log-injection").asSink() }
}
}