DagSverreSeljebotn PipelineThoughts

DagSverreSeljebotn edited this page May 13, 2008 · 1 revision
Clone this wiki locally


This is in response to this statement by Robert:


I think this is good to talk about now. Basically, you want to change
Cython to use the "visitor pattern" rather than the recursive pattern
that it currently uses. This has pros and cons, but I think this
could be a good thing, and you are certainly convinced that its
needed to get started on the GSoC stuff.


So, leaving the patch and implementation strategies aside for now I'm
going to attempt discussing this from the mountain top. (If a lot of
this seems trivial, all the better -- it can then start to serve as
documentation for a direction, rather than my "private" efforts :-). If
not, it will be good to discuss it.)

I honestly believe that any time spent discussing and implementing this
now can repay itself many times later on in saved developer hours and a
lower learning curve for new developers. If people agree with me in how
important this is, I think it would be good to perhaps spend some time
at the beginning of dev1 talking and discussing this (with some slides

Here we go:

Assertion 1: Development of Cython has a potential for becoming stagnant
(if it hasn't already) quickly.

My (subjective) felling is that the codebase has a somewhat high
learning curve. If one wants to add a fundamentally new features (as
opposed to quick fixes, which do currently happen very quickly -- but
think type inference!) it looks like one has to be very careful to keep
many different effects of one's changes in the head at the same time.

Assertion 2: The reason for this state can at least in some part be
attributed to some confusion as to what the current code tree *is*.

I believe that the reason that the codebase can be practical in the
current state is tightly connected with how similar C and CPython is
with Python as a language. The design, so to speak, arose in an
historical happy accident, because the input and output are almost 1:1
related. A Python function and a C function loosely correspond, and so
FuncDefNode can serve the role as both the parsed element and the
element due for C serialization. However there is no reason to assume
that the 1:1 correspondance holds everywhere; and I think there is a
tendency that "ugly hacks" already tend to arise in situations where the
correspondance in program structure diverges more. (To consider how this
might become worse, think about a "method template node".)

If this was only a theoretically founded critisism one might simply
dismiss it, but I believe the problems in assertion 1 comes from this
very fact, and that they can be solved by dealing with this conceptual

The proposed solution: Work towards "The ideal solution" below in tiny,
incremental steps. All *new* major features are implemented with the
ideal solution in mind, and old code refactored as far as is needed for
that to happen, but one doesn't start rewriting working and bugfixed
code just for the sake of abstract purity.

The ideal solution: A strict pipeline-based approach where the "data"
(one or more trees, graphs or similar) is in different stages, with
different phases transforming one into the other. I.e, always try to fit
what one does into the following example scheme:

data stage:string of code -> phase:parsing ->
data stage:syntax tree -> phase:expand with statements ->
data stage:tree without with statements -> phase:create scopes ->
data stage:tree with scopes -> phase:analyse types ->
data stage:tree with ".type" attributes -> ...
data stage:tree closely related to *structure* of C source -> phase:output

Whenever something doesn't quit fit in the scheme of naive input/output,
split it up into more phases until it does :-)

(Digression: Closely related to C structure above means, for instance,
no inner functions. No with statements. No untyped variables. That kind
of thing. Classes may be included though, since one can create a 1:1
correspondance between C code using the CPython API and a class; so it
is not about language "features" as such but some deeper structure.)

Note that all phases does not need to be visitors! For instance, the
output-to-C phase will stay a recursive process for the foreseeable
future. This is a way of thinking, not a recipe for implementation. (And
indeed, this way of thinking is already followed to a degree -- but not
everywhere, and we get problems at the spots where it is not.)

A very important side-effect of this is that it lends itself to
"implementing complex Cython statements as more simple Cython
statements". (Though ideally "more simple Cython statements" can also be
"intermediate simpler instructions" that bridges the gap between Cython
and C in a more fine-grained way in order to give more code reuse than
there is today. But I digress.)

(Implementation notes: This does not necesarrily mean that the classes
of the nodes changes between phases; one does whatever is most
convenient. For instance, if one has T2 = expand_with_nodes(T1), then T1
is allowed to keep WithStatementNode, T2 is however not, but otherwise
the trees are the same. In the same way, if T1 = type_analysis(T0), then
ExprNodes in T0 is not allowed to have the type attribute, while
ExprNodes in T1 are required to have them. This can be a documentation
matter or enforced (in a number of ways), that issue is better left for

Final words: This might seem trivial. It is not. The reason is that
currently, almost wherever you try to implement two simple phases f and
g one ends up having the result of f depend on what is done in g for
some nodes, and the result of g depend on f for other nodes. And so
neither f(g(T)) nor g(f(T)) can be made to work.

The point is, this way of changing how one thinks mostly affects the
compiler design as such, more than it is a specific style of coding.
Visitors happen to be the most natural way to implement this, but are
not the main point. I'd be happy (or, happier than I am today) with a
cleanly defined phase-based recursive process.

Dag Sverre

Comments in retrospect

The point about stagnation was a bit harsh and probably wrong.