diff --git a/docs/language/learn-ql/python/control-flow-graph.rst b/docs/language/learn-ql/python/control-flow-graph.rst deleted file mode 100644 index 099c252784b4..000000000000 --- a/docs/language/learn-ql/python/control-flow-graph.rst +++ /dev/null @@ -1,9 +0,0 @@ -Python control flow graph -========================= - -:doc:`Back to tutorial: control flow analysis ` - -|Python control flow graph| - -.. |Python control flow graph| image:: ../../images/python-flow-graph.png - diff --git a/docs/language/learn-ql/python/control-flow.rst b/docs/language/learn-ql/python/control-flow.rst index fc41f59c9332..d16453157845 100644 --- a/docs/language/learn-ql/python/control-flow.rst +++ b/docs/language/learn-ql/python/control-flow.rst @@ -1,7 +1,12 @@ -Tutorial: Control flow analysis -=============================== +Analyzing control flow in Python +================================ -To analyze the `Control-flow graph `__ of a ``Scope`` we can use the two CodeQL classes ``ControlFlowNode`` and ``BasicBlock``. These classes allow you to ask such questions as "can you reach point A from point B?" or "Is it possible to reach point B *without* going through point A?". To report results we use the class ``AstNode``, which represents a syntactic element and corresponds to the source code - allowing the results of the query to be more easily understood. +You can write CodeQL queries to explore the control-flow graph of a Python program, for example, to discover unreachable code or mutually exclusive blocks of code. + +About analyzing control flow +-------------------------------------- + +To analyze the control-flow graph of a ``Scope`` we can use the two CodeQL classes ``ControlFlowNode`` and ``BasicBlock``. These classes allow you to ask such questions as "can you reach point A from point B?" or "Is it possible to reach point B *without* going through point A?". 
To report results we use the class ``AstNode``, which represents a syntactic element and corresponds to the source code - allowing the results of the query to be more easily understood. For more information, see `Control-flow graph `__ on Wikipedia. The ``ControlFlowNode`` class ----------------------------- @@ -19,11 +24,18 @@ To show why this complex relation is required consider the following Python code finally: close_resource() -There are many paths through the above code. There are three different paths through the call to ``close_resource();`` one normal path, one path that breaks out of the loop, and one path where an exception is raised by ``might_raise()``. (An annotated flow graph can be seen :doc:`here `.) +There are many paths through the above code. There are three different paths through the call to ``close_resource();`` one normal path, one path that breaks out of the loop, and one path where an exception is raised by ``might_raise()``. + +An annotated flow graph: + +|Python control flow graph| + +.. |Python control flow graph| image:: ../../images/python-flow-graph.png The simplest use of the ``ControlFlowNode`` and ``AstNode`` classes is to find unreachable code. There is one ``ControlFlowNode`` per path through any ``AstNode`` and any ``AstNode`` that is unreachable has no paths flowing through it. Therefore, any ``AstNode`` without a corresponding ``ControlFlowNode`` is unreachable. -**Unreachable AST nodes** +Example finding unreachable AST nodes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: ql @@ -33,9 +45,10 @@ The simplest use of the ``ControlFlowNode`` and ``AstNode`` classes is to find u where not exists(node.getAFlowNode()) select node -➤ `See this in the query console `__. The demo projects on LGTM.com all have some code that has no control flow node, and is therefore unreachable. 
However, since the ``Module`` class is also a subclass of the ``AstNode`` class, the query also finds any modules implemented in C or with no source code. Therefore, it is better to find all unreachable statements: +➤ `See this in the query console `__. The demo projects on LGTM.com all have some code that has no control flow node, and is therefore unreachable. However, since the ``Module`` class is also a subclass of the ``AstNode`` class, the query also finds any modules implemented in C or with no source code. Therefore, it is better to find all unreachable statements. -**Unreachable statements** +Example finding unreachable statements +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: ql @@ -45,15 +58,15 @@ The simplest use of the ``ControlFlowNode`` and ``AstNode`` classes is to find u where not exists(s.getAFlowNode()) select s -➤ `See this in the query console `__. This query gives fewer results, but most of the projects have some unreachable nodes. These are also highlighted by the standard query: `Unreachable code `__. +➤ `See this in the query console `__. This query gives fewer results, but most of the projects have some unreachable nodes. These are also highlighted by the standard "Unreachable code" query. For more information, see `Unreachable code `__ on LGTM.com. The ``BasicBlock`` class ------------------------ -The ``BasicBlock`` class represents a `basic block `__ of control flow nodes. The ``BasicBlock`` class is not that useful for writing queries directly, but is very useful for building complex analyses, such as data flow. The reason it is useful is that it shares many of the interesting properties of control flow nodes, such as what can reach what and what `dominates `__ what, but there are fewer basic blocks than control flow nodes - resulting in queries that are faster and use less memory. +The ``BasicBlock`` class represents a basic block of control flow nodes. 
The ``BasicBlock`` class is not that useful for writing queries directly, but is very useful for building complex analyses, such as data flow. The reason it is useful is that it shares many of the interesting properties of control flow nodes, such as, what can reach what, and what dominates what, but there are fewer basic blocks than control flow nodes - resulting in queries that are faster and use less memory. For more information, see `Basic block `__ and `Dominator `__ on Wikipedia. -Example: Finding mutually exclusive basic blocks -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Example finding mutually exclusive basic blocks +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Suppose we have the following Python code: @@ -84,7 +97,8 @@ However, by that definition, two basic blocks are mutually exclusive if they are Combining these conditions we get: -**Mutually exclusive blocks within the same function** +Example finding mutually exclusive blocks within the same function +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: ql @@ -98,10 +112,11 @@ Combining these conditions we get: ) select b1, b2 -➤ `See this in the query console `__. This typically gives a very large number of results, because it is a common occurrence in normal control flow. It is, however, an example of the sort of control-flow analysis that is possible. Control-flow analyses such as this are an important aid to data flow analysis which is covered in the next tutorial. +➤ `See this in the query console `__. This typically gives a very large number of results, because it is a common occurrence in normal control flow. It is, however, an example of the sort of control-flow analysis that is possible. Control-flow analyses such as this are an important aid to data flow analysis. For more information, see :doc:`Analyzing data flow and tracking tainted data in Python `. + +Further reading +--------------- -What next? 
----------- +- ":doc:`Analyzing data flow and tracking tainted data in Python `" -- Experiment with the worked examples in the tutorial topic :doc:`Taint tracking and data flow analysis in Python `. -- Find out more about QL in the `QL language handbook `__ and `QL language specification `__. +.. include:: ../../reusables/python-other-resources.rst diff --git a/docs/language/learn-ql/python/functions.rst b/docs/language/learn-ql/python/functions.rst index c3c8a5e6eacf..8fa89f5e188a 100644 --- a/docs/language/learn-ql/python/functions.rst +++ b/docs/language/learn-ql/python/functions.rst @@ -1,7 +1,9 @@ -Tutorial: Functions +Functions in Python =================== -This example uses the standard CodeQL class ``Function`` (see :doc:`Introducing the Python libraries `). +You can use syntactic classes from the standard CodeQL library to find Python functions and identify calls to them. + +These examples use the standard CodeQL class `Function `__. For more information, see ":doc:`Introducing the Python libraries `." Finding all functions called "get..." ------------------------------------- @@ -55,7 +57,7 @@ We can modify the query further to include only methods whose body consists of a and count(f.getAStmt()) = 1 select f, "This function is (probably) a getter." -➤ `See this in the query console `__. This query returns fewer results, but if you examine the results you can see that there are still refinements to be made. This is refined further in :doc:`Tutorial: Statements and expressions `. +➤ `See this in the query console `__. This query returns fewer results, but if you examine the results you can see that there are still refinements to be made. This is refined further in ":doc:`Expressions and statements in Python `." Finding a call to a specific function ------------------------------------- @@ -76,8 +78,12 @@ The ``Call`` class represents calls in Python. 
The ``Call.getFunc()`` predicate Due to the dynamic nature of Python, this query will select any call of the form ``eval(...)`` regardless of whether it is a call to the built-in function ``eval`` or not. In a later tutorial we will see how to use the type-inference library to find calls to the built-in function ``eval`` regardless of name of the variable called. -What next? ----------- +Further reading +--------------- + +- ":doc:`Expressions and statements in Python `" +- ":doc:`Pointer analysis and type inference in Python `" +- ":doc:`Analyzing control flow in Python `" +- ":doc:`Analyzing data flow and tracking tainted data in Python `" -- Experiment with the worked examples in the following tutorial topics: :doc:`Statements and expressions `, :doc:`Control flow `, and :doc:`Points-to analysis and type inference `. -- Find out more about QL in the `QL language handbook `__ and `QL language specification `__. +.. include:: ../../reusables/python-other-resources.rst diff --git a/docs/language/learn-ql/python/introduce-libraries-python.rst b/docs/language/learn-ql/python/introduce-libraries-python.rst index 54276aedd8e8..c7809eb710b3 100644 --- a/docs/language/learn-ql/python/introduce-libraries-python.rst +++ b/docs/language/learn-ql/python/introduce-libraries-python.rst @@ -1,16 +1,18 @@ -Introducing the CodeQL libraries for Python -=========================================== +CodeQL library for Python +========================= -There is an extensive library for analyzing CodeQL databases extracted from Python projects. The classes in this library present the data from a database in an object-oriented form and provide abstractions and predicates to help you with common analysis tasks. The library is implemented as a set of QL modules, that is, files with the extension ``.qll``. 
The module ``python.qll`` imports all the core Python library modules, so you can include the complete library by beginning your query with: +When you need to analyze a Python program, you can make use of the large collection of classes in the CodeQL library for Python. -.. code-block:: ql +About the CodeQL library for Python +----------------------------------- - import python +The CodeQL library for each programming language uses classes with abstractions and predicates to present data in an object-oriented form. -The rest of this tutorial summarizes the contents of the standard libraries for Python. We recommend that you read this and then work through the practical examples in the tutorials shown at the end of the page. +Each CodeQL library is implemented as a set of QL modules, that is, files with the extension ``.qll``. The module ``python.qll`` imports all the core Python library modules, so you can include the complete library by beginning your query with: -Overview of the library ------------------------ +.. code-block:: ql + + import python The CodeQL library for Python incorporates a large number of classes. Each class corresponds either to one kind of entity in Python source code or to an entity that can be derived from the source code using static analysis. These classes can be divided into four categories: @@ -20,16 +22,14 @@ The CodeQL library for Python incorporates a large number of classes. Each class - **Taint tracking** - classes that represent the source, sinks and kinds of taint used to implement taint-tracking queries. Syntactic classes -~~~~~~~~~~~~~~~~~ - -This part of the library represents the Python source code. The ``Module``, ``Class``, and ``Function`` classes correspond to Python modules, classes, and functions respectively, collectively these are known as ``Scope`` classes. Each ``Scope`` contains a list of statements each of which is represented by a subclass of the class ``Stmt``. 
Statements themselves can contain other statements or expressions which are represented by subclasses of ``Expr``. Finally, there are a few additional classes for the parts of more complex expressions such as list comprehensions. Collectively these classes are subclasses of ``AstNode`` and form an `Abstract syntax tree `__ (AST). The root of each AST is a ``Module``. +----------------- -`Symbolic information `__ is attached to the AST in the form of variables (represented by the class ``Variable``). +This part of the library represents the Python source code. The ``Module``, ``Class``, and ``Function`` classes correspond to Python modules, classes, and functions respectively, collectively these are known as ``Scope`` classes. Each ``Scope`` contains a list of statements each of which is represented by a subclass of the class ``Stmt``. Statements themselves can contain other statements or expressions which are represented by subclasses of ``Expr``. Finally, there are a few additional classes for the parts of more complex expressions such as list comprehensions. Collectively these classes are subclasses of ``AstNode`` and form an Abstract syntax tree (AST). The root of each AST is a ``Module``. Symbolic information is attached to the AST in the form of variables (represented by the class ``Variable``). For more information, see `Abstract syntax tree `__ and `Symbolic information `__ on Wikipedia. Scope ^^^^^ -A Python program is a group of modules. Technically a module is just a list of statements, but we often think of it as composed of classes and functions. These top-level entities, the module, class, and function are represented by the three CodeQL classes (`Module `__, `Class `__ and `Function `__ which are all subclasses of ``Scope``. +A Python program is a group of modules. Technically a module is just a list of statements, but we often think of it as composed of classes and functions. 
These top-level entities, the module, class, and function are represented by the three CodeQL classes `Module `__, `Class `__ and `Function `__ which are all subclasses of ``Scope``. - ``Scope`` @@ -237,11 +237,14 @@ Other - ``Comment`` – A comment Control flow classes -~~~~~~~~~~~~~~~~~~~~ +-------------------- + +This part of the library represents the control flow graph of each ``Scope`` (classes, functions, and modules). Each ``Scope`` contains a graph of ``ControlFlowNode`` elements. Each scope has a single entry point and at least one (potentially many) exit points. To speed up control and data flow analysis, control flow nodes are grouped into basic blocks. For more information, see `Basic block `__ on Wikipedia. -This part of the library represents the control flow graph of each ``Scope`` (classes, functions, and modules). Each ``Scope`` contains a graph of ``ControlFlowNode`` elements. Each scope has a single entry point and at least one (potentially many) exit points. To speed up control and data flow analysis, control flow nodes are grouped into `basic blocks `__. +Example +^^^^^^^ -As an example, we might want to find the longest sequence of code without any branches. A ``BasicBlock`` is, by definition, a sequence of code without any branches, so we just need to find the longest ``BasicBlock``. +If we want to find the longest sequence of code without any branches, we need to consider control flow. A ``BasicBlock`` is, by definition, a sequence of code without any branches, so we just need to find the longest ``BasicBlock``. First of all we introduce a simple predicate ``bb_length()`` which relates ``BasicBlock``\ s to their length. @@ -289,7 +292,12 @@ The classes in the control-flow part of the library are: Type-inference classes ---------------------- -The CodeQL library for Python also supplies some classes for accessing the inferred types of values. 
The classes ``Value`` and ``ClassValue`` allow you to query the possible classes that an expression may have at runtime. For example, which ``ClassValue``\ s are iterable can be determined using the query: +The CodeQL library for Python also supplies some classes for accessing the inferred types of values. The classes ``Value`` and ``ClassValue`` allow you to query the possible classes that an expression may have at runtime. + +Example +^^^^^^^ + +For example, which ``ClassValue``\ s are iterable can be determined using the query: **Find iterable "ClassValue"s** @@ -301,10 +309,10 @@ The CodeQL library for Python also supplies some classes for accessing the infer where cls.hasAttribute("__iter__") select cls -➤ `See this in the query console `__ This query returns a list of classes for the projects analyzed. If you want to include the results for `builtin classes `__, which do not have any Python source code, show the non-source results. +➤ `See this in the query console `__ This query returns a list of classes for the projects analyzed. If you want to include the results for ``builtin`` classes, which do not have any Python source code, show the non-source results. For more information, see `builtin classes `__ in the Python documentation. Summary -~~~~~~~ +^^^^^^^ - `Value `__ @@ -312,7 +320,7 @@ Summary - ``CallableValue`` - ``ModuleValue`` -These classes are explained in more detail in :doc:`Tutorial: Points-to analysis and type inference `. +For more information about these classes, see ":doc:`Pointer analysis and type inference in Python `." Taint-tracking classes ---------------------- @@ -321,16 +329,21 @@ The CodeQL library for Python also supplies classes to specify taint-tracking an Summary -~~~~~~~ +^^^^^^^ - `TaintKind `__ - `Configuration `__ -These classes are explained in more detail in :doc:`Tutorial: Taint tracking and data flow analysis in Python `. 
+For more information about these classes, see ":doc:`Analyzing data flow and tracking tainted data in Python `." + +Further reading +--------------- -What next? ----------- +- ":doc:`Functions in Python `" +- ":doc:`Expressions and statements in Python `" +- ":doc:`Pointer analysis and type inference in Python `" +- ":doc:`Analyzing control flow in Python `" +- ":doc:`Analyzing data flow and tracking tainted data in Python `" -- Experiment with the worked examples in the following tutorial topics: :doc:`Functions `, :doc:`Statements and expressions `, :doc:`Control flow `, :doc:`Points-to analysis and type inference `, and :doc:`Taint tracking and data flow analysis in Python `. -- Find out more about QL in the `QL language handbook `__ and `QL language specification `__. +.. include:: ../../reusables/python-other-resources.rst diff --git a/docs/language/learn-ql/python/pointsto-type-infer.rst b/docs/language/learn-ql/python/pointsto-type-infer.rst index 7ae9368d02cb..40f2ecb81fff 100644 --- a/docs/language/learn-ql/python/pointsto-type-infer.rst +++ b/docs/language/learn-ql/python/pointsto-type-infer.rst @@ -1,7 +1,7 @@ -Tutorial: Points-to analysis and type inference -=============================================== +Pointer analysis and type inference in Python +============================================= -This topic contains worked examples of how to write queries using the standard CodeQL library classes for Python type inference. +At runtime, each Python expression has a value with an associated type. You can learn how an expression behaves at runtime by using type-inference classes from the standard CodeQL library. The ``Value`` class -------------------- @@ -9,7 +9,7 @@ The ``Value`` class The ``Value`` class and its subclasses ``FunctionValue``, ``ClassValue``, and ``ModuleValue`` represent the values an expression may hold at runtime. 
Summary -~~~~~~~ +^^^^^^^ Class hierarchy for ``Value``: @@ -22,9 +22,7 @@ Class hierarchy for ``Value``: Points-to analysis and type inference ------------------------------------- -Points-to analysis, sometimes known as `pointer analysis `__, allows us to determine which objects an expression may "point to" at runtime. - -`Type inference `__ allows us to infer what the types (classes) of an expression may be at runtime. +Points-to analysis, sometimes known as pointer analysis, allows us to determine which objects an expression may "point to" at runtime. Type inference allows us to infer what the types (classes) of an expression may be at runtime. For more information, see `Pointer analysis `__ and `Type inference `__ on Wikipedia. The predicate ``ControlFlowNode.pointsTo(...)`` shows which object a control flow node may "point to" at runtime. @@ -123,7 +121,7 @@ Combining the parts of the query we get this: ) select t, ex1, ex2 -➤ `See this in the query console `__. This query finds only one result in the demo projects on LGTM.com (`youtube-dl `__). The result is also highlighted by the standard query: `Unreachable 'except' block `__. +➤ `See this in the query console `__. This query finds only one result in the demo projects on LGTM.com (`youtube-dl `__). The result is also highlighted by the standard "Unreachable 'except' block" query. For more information, see `Unreachable 'except' block `__ on LGTM.com. .. pull-quote:: @@ -183,7 +181,7 @@ The ``Value`` class has a method ``getACall()`` which allows us to find calls to If we wish to restrict the callables to actual functions we can use the ``FunctionValue`` class, which is a subclass of ``Value`` and corresponds to function objects in Python, in much the same way as the ``ClassValue`` class corresponds to class objects in Python. -Returning to an example from :doc:`Tutorial: Functions `, we wish to find calls to the ``eval`` function. 
+Returning to an example from ":doc:`Functions in Python `," we wish to find calls to the ``eval`` function. The original query looked this: @@ -225,8 +223,10 @@ Then we can use ``Value.getACall()`` to identify calls to the ``eval`` function, ➤ `See this in the query console `__. This accurately identifies calls to the builtin ``eval`` function even when they are referred to using an alternative name. Any false positive results with calls to other ``eval`` functions, reported by the original query, have been eliminated. -What next? ----------- +Further reading +--------------- + +- ":doc:`Analyzing control flow in Python `" +- ":doc:`Analyzing data flow and tracking tainted data in Python `" -- Find out more about QL in the `QL language handbook `__ and `QL language specification `__. -- Read a description of the CodeQL database in :doc:`What's in a CodeQL database? <../database>` +.. include:: ../../reusables/python-other-resources.rst diff --git a/docs/language/learn-ql/python/ql-for-python.rst b/docs/language/learn-ql/python/ql-for-python.rst index 680c0c374b5c..b4f47e8a70cf 100644 --- a/docs/language/learn-ql/python/ql-for-python.rst +++ b/docs/language/learn-ql/python/ql-for-python.rst @@ -1,37 +1,16 @@ CodeQL for Python ================= +Experiment and learn how to write effective and efficient queries for CodeQL databases generated from Python code bases. + .. toctree:: :glob: - :hidden: + :maxdepth: 2 introduce-libraries-python functions statements-expressions + pointsto-type-infer control-flow - control-flow-graph taint-tracking - pointsto-type-infer - -The following tutorials and worked examples are designed to help you learn how to write effective and efficient queries for Python projects. You should work through these topics in the order displayed. - -- `Basic Python query `__ describes how to write and run queries using LGTM. 
- -- :doc:`Introducing the CodeQL libraries for Python ` introduces the standard libraries used to write queries for Python code. - -- :doc:`Tutorial: Functions ` demonstrates how to write queries using the standard CodeQL library classes for Python functions. - -- :doc:`Tutorial: Statements and expressions ` demonstrates how to write queries using the standard CodeQL library classes for Python statements and expressions. - -- :doc:`Tutorial: Control flow ` demonstrates how to write queries using the standard CodeQL library classes for Python control flow. - -- :doc:`Tutorial: Points-to analysis and type inference ` demonstrates how to write queries using the standard CodeQL library classes for Python type inference. - -- :doc:`Taint tracking and data flow analysis in Python ` demonstrates how to write queries using the standard taint tracking and data flow libraries for Python. - -Other resources ---------------- -- For examples of how to query common Python elements, see the `Python cookbook `__. -- For the queries used in LGTM, display a `Python query `__ and click **Open in query console** to see the code used to find alerts. -- For more information about the library for Python see the `CodeQL library for Python `__. diff --git a/docs/language/learn-ql/python/statements-expressions.rst b/docs/language/learn-ql/python/statements-expressions.rst index d3b4e68af6c9..eda2d1e45781 100644 --- a/docs/language/learn-ql/python/statements-expressions.rst +++ b/docs/language/learn-ql/python/statements-expressions.rst @@ -1,6 +1,8 @@ -Tutorial: Statements and expressions +Expressions and statements in Python ==================================== +You can use syntactic classes from the CodeQL library to explore how Python expressions and statements are used in a code base. 
+ Statements ---------- @@ -37,13 +39,11 @@ Here is the full class hierarchy: - ``While`` – A ``while`` statement - ``With`` – A ``with`` statement -Example: Finding redundant 'global' statements -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Example finding redundant 'global' statements +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``global`` statement in Python declares a variable with a global (module-level) scope, when it would otherwise be local. Using the ``global`` statement outside a class or function is redundant as the variable is already global. -**Finding redundant global statements** - .. code-block:: ql import python @@ -56,13 +56,11 @@ The ``global`` statement in Python declares a variable with a global (module-lev The line: ``g.getScope() instanceof Module`` ensures that the ``Scope`` of ``Global g`` is a ``Module``, rather than a class or function. -Example: Finding 'if' statements with redundant branches --------------------------------------------------------- +Example finding 'if' statements with redundant branches +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ An ``if`` statement where one branch is composed of just ``pass`` statements could be simplified by negating the condition and dropping the ``else`` clause. -**An 'if' statement that could be simplified** - .. code-block:: python if cond(): @@ -70,9 +68,7 @@ An ``if`` statement where one branch is composed of just ``pass`` statements cou else: do_something -To find statements like this we can run the following query: - -**Find 'if' statements with empty branches** +To find statements like this that could be simplified we can write a query. .. code-block:: ql @@ -131,8 +127,8 @@ Each kind of Python expression has its own class. 
Here is the full class hierarc - ``Yield`` – A ``yield`` expression - ``YieldFrom`` – A ``yield from`` expression (Python 3.3+) -Example: Finding comparisons to integer or string literals using 'is' -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Example finding comparisons to integer or string literals using 'is' +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Python implementations commonly cache small integers and single character strings, which means that comparisons such as the following often work correctly, but this is not guaranteed and we might want to check for them. @@ -141,9 +137,7 @@ Python implementations commonly cache small integers and single character string x is 10 x is "A" -We can check for these as follows: - -**Find comparisons to integer or string literals using** ``is`` +We can check for these using a query. .. code-block:: ql @@ -164,15 +158,11 @@ The clause ``cmp.getOp(0) instanceof Is and cmp.getComparator(0) = literal`` che We have to use ``cmp.getOp(0)`` and ``cmp.getComparator(0)``\ as there is no ``cmp.getOp()`` or ``cmp.getComparator()``. The reason for this is that a ``Compare`` expression can have multiple operators. For example, the expression ``3 < x < 7`` has two operators and two comparators. You use ``cmp.getComparator(0)`` to get the first comparator (in this example the ``3``) and ``cmp.getComparator(1)`` to get the second comparator (in this example the ``7``). -Example: Duplicates in dictionary literals -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Example finding duplicates in dictionary literals +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If there are duplicate keys in a Python dictionary, then the second key will overwrite the first, which is almost certainly a mistake. We can find these duplicates with CodeQL, but the query is more complex than previous examples and will require us to write a ``predicate`` as a helper. 
-Here is the query: - -**Find duplicate dictionary keys** - .. code-block:: ql import python @@ -188,7 +178,7 @@ Here is the query: and k1 != k2 and same_key(k1, k2) select k1, "Duplicate key in dict literal" -➤ `See this in the query console `__. When we ran this query on LGTM.com, the source code of the *saltstack/salt* project contained an example of duplicate dictionary keys. The results were also highlighted as alerts by the standard `Duplicate key in dict literal `__ query. Two of the other demo projects on LGTM.com refer to duplicate dictionary keys in library files. +➤ `See this in the query console `__. When we ran this query on LGTM.com, the source code of the *saltstack/salt* project contained an example of duplicate dictionary keys. The results were also highlighted as alerts by the standard "Duplicate key in dict literal" query. Two of the other demo projects on LGTM.com refer to duplicate dictionary keys in library files. For more information, see `Duplicate key in dict literal `__ on LGTM.com. The supporting predicate ``same_key`` checks that the keys have the same identifier. Separating this part of the logic into a supporting predicate, instead of directly including it in the query, makes it easier to understand the query as a whole. The casts defined in the predicate restrict the expression to the type specified and allow predicates to be called on the type that is cast-to. For example: @@ -204,12 +194,10 @@ is equivalent to The short version is usually used as this is easier to read. 
-Example: Finding Java-style getters -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Example finding Java-style getters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Returning to the example from :doc:`Tutorial: Functions `, the query identified all methods with a single line of code and a name starting with ``get``: - -**Basic: Find Java-style getters** +Returning to the example from ":doc:`Functions in Python `," the query identified all methods with a single line of code and a name starting with ``get``. .. code-block:: ql @@ -220,9 +208,7 @@ Returning to the example from :doc:`Tutorial: Functions `, the query and count(f.getAStmt()) = 1 select f, "This function is (probably) a getter." -This basic query can be improved by checking that the one line of code is of the form ``return self.attr`` - -**Improved: Find Java-style getters** +This basic query can be improved by checking that the one line of code is a Java-style getter of the form ``return self.attr``. .. code-block:: ql @@ -236,21 +222,17 @@ This basic query can be improved by checking that the one line of code is of the ➤ `See this in the query console `__. Of the demo projects on LGTM.com, only the *openstack/nova* project has examples of functions that appear to be Java-style getters. -In this query, the condition: - .. code-block:: ql ret = f.getStmt(0) and ret.getValue() = attr -checks that the first line in the method is a return statement and that the expression returned (``ret.getValue()``) is an ``Attribute`` expression. Note that the equality ``ret.getValue() = attr`` means that ``ret.getValue()`` is restricted to ``Attribute``\ s, since ``attr`` is an ``Attribute``. - -The condition: +This condition checks that the first line in the method is a return statement and that the expression returned (``ret.getValue()``) is an ``Attribute`` expression. Note that the equality ``ret.getValue() = attr`` means that ``ret.getValue()`` is restricted to ``Attribute``\ s, since ``attr`` is an ``Attribute``. .. 
code-block:: ql attr.getObject() = self and self.getId() = "self" -checks that the value of the attribute (the expression to the left of the dot in ``value.attr``) is an access to a variable called ``"self"``. +This condition checks that the value of the attribute (the expression to the left of the dot in ``value.attr``) is an access to a variable called ``"self"``. Class and function definitions ------------------------------ @@ -271,8 +253,12 @@ Here is the relevant part of the class hierarchy: - ``Class`` - ``Function`` -What next? ----------- +Further reading +--------------- + +- ":doc:`Functions in Python `" +- ":doc:`Pointer analysis and type inference in Python `" +- ":doc:`Analyzing control flow in Python `" +- ":doc:`Analyzing data flow and tracking tainted data in Python `" -- Experiment with the worked examples in the following tutorial topics: :doc:`Control flow ` and :doc:`Points-to analysis and type inference `. -- Find out more about QL in the `QL language handbook `__ and `QL language specification `__. +.. include:: ../../reusables/python-other-resources.rst diff --git a/docs/language/learn-ql/python/taint-tracking.rst b/docs/language/learn-ql/python/taint-tracking.rst index 2ea24369bf40..bfdae7aa4eb4 100644 --- a/docs/language/learn-ql/python/taint-tracking.rst +++ b/docs/language/learn-ql/python/taint-tracking.rst @@ -1,8 +1,10 @@ -Taint tracking and data flow analysis in Python -=============================================== +Analyzing data flow and tracking tainted data in Python +======================================================= -Overview --------- +You can use CodeQL to track the flow of data through a Python program. Tracking user-controlled, or tainted, data is a key technique for security researchers. + +About data flow and taint tracking +---------------------------------- Taint tracking is used to analyze how potentially insecure, or 'tainted' data flows throughout a program at runtime. 
You can use taint tracking to find out whether user-controlled input can be used in a malicious way, @@ -14,12 +16,12 @@ For example, in the assignment ``dir = path + "/"``, if ``path`` is tainted then even though there is no data flow from ``path`` to ``path + "/"``. Separate CodeQL libraries have been written to handle 'normal' data flow and taint tracking in :doc:`C/C++ <../cpp/dataflow>`, :doc:`C# <../csharp/dataflow>`, :doc:`Java <../java/dataflow>`, and :doc:`JavaScript <../javascript/dataflow>`. You can access the appropriate classes and predicates that reason about these different modes of data flow by importing the appropriate library in your query. -In Python analysis, we can use the same taint tracking library to model both 'normal' data flow and taint flow, but we are still able make the distinction between steps that preserve value and those that don't by defining additional data flow properties. +In Python analysis, we can use the same taint tracking library to model both 'normal' data flow and taint flow, but we are still able to make the distinction between steps that preserve values and those that don't by defining additional data flow properties. -For further information on data flow and taint tracking with CodeQL, see :doc:`Introduction to data flow <../intro-to-data-flow>`. +For further information on data flow and taint tracking with CodeQL, see ":doc:`Introduction to data flow <../intro-to-data-flow>`." -Fundamentals of taint tracking and data flow analysis -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Fundamentals of taint tracking using data flow analysis +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The taint tracking library is in the `TaintTracking `__ module. Any taint tracking or data flow analysis query has three explicit components, one of which is optional, and an implicit component.
@@ -39,7 +41,7 @@ The kind of taint determines which non-value-preserving steps are possible, in a In the above example ``dir = path + "/"``, taint flows from ``path`` to ``dir`` if the taint represents a string, but not if the taint is ``None``. Limitations -~~~~~~~~~~~ +^^^^^^^^^^^ Although taint tracking is a powerful technique, it is worth noting that it depends on the underlying data flow graphs. Creating a data flow graph that is both accurate and covers a large enough part of a program is a challenge, @@ -79,6 +81,9 @@ A simple taint tracking query has the basic form: where config.hasFlow(src, sink) select sink, "Alert message, including reference to $@.", src, "string describing the source" +Example +^^^^^^^ + As a contrived example, here is a query that looks for flow from a HTTP request to a function called ``"unsafe"``. The sources are predefined and accessed by importing library ``semmle.python.web.HttpRequest``. The sink is defined by using a custom ``TaintTracking::Sink`` class. @@ -126,8 +131,8 @@ The sink is defined by using a custom ``TaintTracking::Sink`` class. -Implementing path queries -~~~~~~~~~~~~~~~~~~~~~~~~~ +Converting a taint-tracking query to a path query +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Although the taint tracking query above tells which sources flow to which sinks, it doesn't tell us how. For that we need a path query. @@ -202,8 +207,8 @@ Thus, our example query becomes: -Custom taint kinds and flows ----------------------------- +Tracking custom taint kinds and flows +------------------------------------- In the above examples, we have assumed the existence of a suitable ``TaintKind``, but sometimes it is necessary to model the flow of other objects, such as database connections, or ``None``. @@ -226,8 +231,8 @@ The ``TaintKind`` itself is just a string (a QL string, not a CodeQL entity repr which provides methods to extend flow and allow the kind of taint to change along the path. 
The ``TaintKind`` class has many predicates allowing flow to be modified. This simplest ``TaintKind`` does not override any predicates, meaning that it only flows as opaque data. -An example of this is the `Hard-coded credentials query `_, -which defines the simplest possible taint kind class, ``HardcodedValue``, and custom source and sink classes. +An example of this is the "Hard-coded credentials" query, +which defines the simplest possible taint kind class, ``HardcodedValue``, and custom source and sink classes. For more information, see `Hard-coded credentials `_ on LGTM.com. .. code-block:: ql @@ -251,8 +256,11 @@ which defines the simplest possible taint kind class, ``HardcodedValue``, and cu } } -What next? ----------- +Further reading +--------------- + +- ":doc:`Pointer analysis and type inference in Python `" +- ":doc:`Analyzing control flow in Python `" +- ":doc:`Analyzing data flow and tracking tainted data in Python `" -- Experiment with the worked examples in the following tutorial topics: :doc:`Control flow ` and :doc:`Points-to analysis and type inference `. -- Find out more about QL in the `QL language handbook `__ and `QL language specification `__. +.. include:: ../../reusables/python-other-resources.rst diff --git a/docs/language/reusables/python-other-resources.rst b/docs/language/reusables/python-other-resources.rst new file mode 100644 index 000000000000..9668db06d6d2 --- /dev/null +++ b/docs/language/reusables/python-other-resources.rst @@ -0,0 +1,3 @@ +- "`QL language handbook `__" +- `Python cookbook queries `__ in the Semmle wiki +- `Python queries in action `__ on LGTM.com