enhancements treevisitors

Generic tree visitors

Overview

In the future, we might have both an Abstract Syntax Tree (AST) and a bit later in the compilation phase, a tree that contains the (remains of the) current Cython compiler which I'll call the Pyrex tree (since it is mostly a legacy from the Pyrex implementation).

Common for both such trees (and perhaps other components in the program) is that it is often considered a good idea to seperate the data/structure representation from the algorithms operating on them. The leading paragraphs of http://en.wikipedia.org/wiki/Visitor_pattern contains some good philosophy on this.

This is then about what conveniences we should supply to make it easy to operate on trees (and possibly graphs) with algorithms that are external to data structure.

One might filter it (input and output have same basic format but some changes are made), make in-place modifications (same but doesn't make copies), do a conversion (input and output has different formats), build auxiliary data and so on, that's not important.

One has some tree or graph in one form or another that one wants to operate on.

In a lot of cases, it is natural to talk about input and output because they have different formats (like, Pyrex tree -> C source code, or AST -> control flow graph). In other cases, one might, but often it is cleaner and more easily testable to work with input and output.

Efficient tree transform processes

This is a bit complicated (impossible?) to follow, I need to braindump it for now.

In many places it is likely to have many small, decoupled algorithms/changes to the tree. For instance, consider the post-parsing phase: Lots of small changes are likely to happen all over the tree.

One is left with three possibilities:

Do a full tree traversal for every single syntax change
Have all changes for a phase within the same module
Simply assign priorites to changes, and have all changes combined into one transform automatically.

The rest is about doing the last one.

Some of the changes might trigger simply on node type, which is what Fabrizio's examples for the "with" transform etc. does. However other changes, such as "for in => for from" has a lot more complex criteria. Therefore the second option above might mean that one should look into basically have a node match state machine traversal of the tree, namely:

For each phase in the compiler, the aggregate of the match selections of the transforms/tree changes that should happen make up a state machine coupling a tree traversal state to a Python function that makes a change in the tree. Also the order of the transformations are kept.

A tree traversal happens, updating our state in the tree as we move along. Some states (ie set of current future matches within this subtree) might contain predicates, in that case we fork a sub-traversal to resolve the status of the predicate without really changing our position.

* For each position in the tree, start a transform process:

Set CT to point to the first transformation.

Look in the state machine to see if a transformation with order >= CT applies for this node.

If so, run the transform on the node and take the result and put it in the tree. Also backtrack the state machine one node and "revisit" this node again, but with an incremented CT.

Keeping track of CT rules out infinite cycles (a with transform changes to a for changes to a with changes to a ...).

The result should be idistinguishable from running the transforms in order on the tree, but one only needs to traverse it once. (This also means than rather than actually implementing the mess above, one can stub it by applying the transforms one after another, but make sure the API is up to the job.)

If our position indicates a match, it will contain a list of transforms to apply. The

I believe that heavy hacking of webpath mentioned below can provide such a state machine. If not, one might simply drop this.

Visitor pattern

This is a standard way of doing this in OOP.

#!python

class NameNode(Node):
   ...
   def visit(self, visitor, *args, **kwargs): return visitor.accept_NameNode(self, *args, **kwargs)

Accepting on superclasses

One can have an AbstractVisitor like this:

class B(A): ...

class AbstractVisitor:
    def accept_B(self, *args, **kwargs):
        # if not overriden, call accept_B
        self.accept_A(*args, **kwargs)

It is also possible to do stuff like this automatically by type introspection and getattr overloading but that is perhaps too convoluted...

Queries

Some sort of query language decorating member functions (we can type the decorators out for Python 2.3 support, but using Python 2.4 syntax below for brevity).

For instance, find a decent Python XPath engine, then throw a standard W3C DOM on top of our structures, and we're ready for action. Proposed DOM:

Let every node occupy two levels in the three, one for their "type" and one by listing their children attribute names (level 1, 3, 5 would be node types, level 2, 4, 6 would be attribute names)
Have one namespace for each type of tree (for instance, "cython:pyrextree" and "cython:parsetree") which the nodes live in
DOM attributes is used for looking up values in nodes (strings, ints) which are not children in the tree
Child nodes are used for attributes which contain children in the tree
One might want the children in a seperate namespace for brevity and clarity, for instance always the default namespace
Attributes containing lists of children are expanded directly (ie many sibling nodes in the tree)

if a: pass

would be represented by the DOM API as (the "XML" will never hit string representation at any time of course):

#!xml
<p:if xmlns:p="cython:parsetree">
  <condition>
    <p:name name="a"/>
  </condition>
  <body>
    <p:statlist>
        <stats>
            <p:pass>
        </stats>
    </p:statlist>
  </body>
</p:if>

Then one can do:

#!python
import cython.dom as d

class MyTransform(TemplateMatchTransform):
  @template("p:module/body/p:if")
  def if_tests_in_module_scope(self, node, name):
      ...


  @template("p:funcdef//p:funcdef")
  def inner_functions(self, node, name):
      ...

Real example: The for-in to for-from transformation example has this "selection":

#!python
        if isinstance(node, Nodes.ForInStatNode) and \
                isinstance(node.iterator, ExprNodes.IteratorNode) and \
                isinstance(node.iterator.sequence, ExprNodes.SimpleCallNode) and \
                isinstance(node.iterator.sequence.function, ExprNodes.NameNode) and \
                node.iterator.sequence.function.name == "range":

which would be this in XPath (playing with the idea of letting functions acts as transforms as well):

#!python
@template('p:forin[iterator/p:iterator/sequence/p:simplecall/function/p:name[@name = "range"]]', nses={"p": "cython:pyrextree"})
def forin_to_forfrom(node):
  ...

while a more decent implementation would be something like (note that the = operator is generalized to set intersection in XPath!):

#!python
@template(
  'p:forin[iterator/p:iterator/sequence/p:simplecall[@fully_resolved_name = $rangefuncs]]',
  nses={"p": "cython:pyrextree"},
  vars={"rangefuncs" : ["__builtin__.range", "__builtin__.xrange"]}
)
def forin_to_forfrom(node):
  ...

(BTW, to use the XPaths above, // is always prepended. Really, they are XSLT matches, which are not the same thing as XPath but very similar.)

Pros:

Convenient for more complex matches
The W3C DOM can also be used to for instance use : xmldiff in order to compare two parse trees, making writing tests easy (for testing purposes, so speed doesn't matter)
XML serialization comes automatically (useful for debugging, won't be used for production purposes!)
One can optimize compiler performance by using a good XPath engine, which is a well studied problem, so we don't have to reinvent the wheel for some of the most complex queries (where we'd probably implement a suboptimal algorithm)

This is a naive XPath 2 implementation written entirely in Python: http://sourceforge.net/projects/webpath

Some quick testing makes me very confident that making this work with any tree we come up with is very little work, the "only" problem would be performance. In order to hack something like XSLT "match", one would select a nodeset and check on node id on traversal (possibly setting manual priorities, or extend the XPath lib to compute expression complexity according to XSLT spec).

In order for this to work more efficiently though it would be necesarry to implement "parallell XPaths" (working through many XPaths in parallell in a search) and an XPath combiner; giving something like a state machine that's traversed during tree traversal. Attractive, but perhaps overkill for the simple problem one is trying to solve here.

Finding an XSLT implementation written in Python (or at least able to operate on any foreign DOM) could be better, but finding that is not likely as efficient XPath implementations probably must cooperate closely with the DOM it operates on.

Discussion

DagSverreSeljebotn: I think one should look into XPaths for when the power is needed, but make it very optional and have visitor patterns be the most common choice.

DagSverreSeljebotn: Update: Consensus on mailing list seems to be that XPath is too complicated. I guess one should at least push visitor patterns to their limit first.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly