DagSverreSeljebotn soc

DagSverreSeljebotn edited this page Mar 31, 2008 · 44 revisions

Developing Cython for Easy NumPy Integration


This project is about improving several aspects of Cython in order to integrate NumPy in a better way. In short, the plan is to add a selected set of new features to Cython (like parametrized types, compiled operator overloads, and simple template support). This will in turn make it possible to improve NumPy integration with Cython.

NumPy support is already present in Cython; however, one currently has to choose between an interface that is easy to use but lacks runtime performance and an interface that performs as well as C but is much harder to learn and use. This project aims to improve this situation. I believe the project should be attractive because most of the tools are already well developed and in production use; this project will add a small finishing touch that has a big impact on usability. With maturity, this can give Python with NumPy and Cython an edge as a tool for numerical computations that is hard to find anywhere else: most tools are appropriate for either high-level array operations or low-level for-loops, but not both.

Detailed description

Cython (http://cython.org) is a fork of the well-known Pyrex, initiated by the SAGE project (http://sagemath.org). Cython's original purpose is to be a good tool for wrapping C libraries into compiled Python modules; however, support is being added all the time to make it a good all-round Python compiler as well. Cython does not convert Python code to a pure C program; instead, it generates C code that makes full use of the CPython libraries and runtime environment. This makes it easy to mix compiled and non-compiled code.

NumPy (http://numpy.scipy.org/) is a Python module that is widely used for numerical calculations. However, one is usually restricted to higher-level operations that operate on thousands of numbers at a time, because for-loops within the Python interpreter are too slow. Sometimes custom low-level loops are needed as well, and currently one often implements such loops in C or FORTRAN instead. Cython can provide an alternative here, allowing one to write code that operates in a full Python environment with all its conveniences and in a Python-like syntax, while achieving the speed of the equivalent C code.

The goal of the project is that it should be easy for Python and NumPy users to write fast for-loop calculations by writing Cython code that, with the exception of added type information, closely resembles the Python code they are used to. Example:

import numpy

def negative_grayscale_image(numpy.ndarray(numpy.uint8, 2) img):
  cdef int i, j
  for i in range(img.shape[0]):
    for j in range(img.shape[1]):
      img[i, j] = 255 - img[i, j]

A draft of the wanted functionality for NumPy integration can be found here: http://wiki.cython.org/enhancements/numpy

The above code will, with a few modifications, compile in the current Cython. However, it will access the array data through NumPy's generic Python interface, making it too slow. NumPy can also be accessed through its inner C interface from Cython, which gives the necessary speed but at the cost of a steeper learning curve and code that is much less convenient and harder to maintain (further details on the page linked to above). By improving Cython in a few areas, it will be possible to offer the convenience of the former approach with the performance of the latter.
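
To make the difference between the two access paths more concrete, the following pure-Python sketch mimics what direct buffer access amounts to: the element at [i, j] is located by computing a byte offset from the array's strides, rather than by going through a generic indexing call. A plain bytearray stands in for the array's data buffer. The function name negate_strided and its explicit shape and strides arguments are illustrative assumptions, not part of NumPy's or Cython's API.

```python
def negate_strided(buf, shape, strides):
    """Invert an 8-bit grayscale image stored in a flat buffer, in place.

    buf     -- bytearray holding the pixel data (stand-in for the data pointer)
    shape   -- (rows, cols)
    strides -- (bytes between rows, bytes between columns)
    """
    rows, cols = shape
    row_stride, col_stride = strides
    for i in range(rows):
        for j in range(cols):
            # The offset computation that compiled code would perform directly
            offset = i * row_stride + j * col_stride
            buf[offset] = 255 - buf[offset]


# A 2x3 image stored row by row, one byte per pixel
image = bytearray([0, 10, 20, 235, 245, 255])
negate_strided(image, shape=(2, 3), strides=(3, 1))
print(list(image))  # [255, 245, 235, 20, 10, 0]
```

In Python this loop is of course still slow; the point is only that the per-element work reduces to one multiply-add and one memory access, which is what the compiled version should ideally emit.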

Cython access to C libraries (including Python extensions written in C) is normally added by writing "pxd" files. These files are analogous to C header files and declare the types, functions and structures that should be made available from the C library. The plan is to add features to Cython so that efficient and convenient NumPy support can be provided entirely by writing a NumPy pxd file. This approach ensures that any work that goes into NumPy support can equally benefit other, similar libraries.

Detailed implementation plan:


In short, it involves first implementing a few features similar to those of C++ (but with a Cython twist), namely parametrized types, inlineable code in pxd "header" files, and simple template support, and then writing such a NumPy pxd file. While the features might seem many, they are all restricted in scope. Rather than, for example, trying to craft the perfect template system, only the kind of templates that are needed and wanted now will be supported. This makes it easier to complete the project and allows actual needs to drive Cython development, rather than spending time polishing and adding extras that might never be used. For instance, class templates, which might be nice to have, are left out since they are not needed now.

The implementation plan sketches a few different options that can be taken depending on how quickly development goes. In the main plan, a working implementation with suboptimal performance is achieved first; a later step then introduces generic Cython features that optimize the performance.

The new features can be tested by making a small code snippet testing the feature in isolation, and I plan to write such isolated tests for all new features. None of the added features should impact Cython backwards-compatibility at all. Regression testing will be done by checking that the Cython compiler produces the exact same result as before when compiling existing Cython code (compiling the Cython code within the SAGE project will provide a very good test-case).
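
The regression check described above could be sketched roughly as follows. Here generate_c stands in for whatever invokes the compiler and returns the generated C text; both it and check_regression are hypothetical names used for illustration, not part of Cython.

```python
import difflib


def check_regression(source, baseline_c, generate_c):
    """Return unified-diff lines; an empty list means the output is unchanged."""
    new_c = generate_c(source)
    return list(difflib.unified_diff(
        baseline_c.splitlines(), new_c.splitlines(),
        fromfile="baseline", tofile="current", lineterm=""))


# Toy stand-in for the compiler (assumption: the real one returns C source text)
def fake_generate_c(source):
    return "/* generated */\n" + source.upper()


baseline = fake_generate_c("x = 1")
unchanged = check_regression("x = 1", baseline, fake_generate_c)
changed = check_regression("x = 2", baseline, fake_generate_c)
print(len(unchanged), len(changed) > 0)
```

Returning the diff rather than a plain boolean makes it easier to see which part of the generated code a new feature has disturbed.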

Motivation and involvement with project:

At my own university, Python is used as the introductory programming language; however, the current culture is to put Python away and switch to C or FORTRAN for serious work later. As I'm going to do a master's degree involving numerical computations, it would be nice to have a more convenient tool available. From my experience as an undergraduate, I also think that this could help with the adoption of Python as a generic teaching tool for numerical algorithms.

Also, it looks like it will be a fun and motivating project. Quickly diving into the (previously) unfamiliar source code of Cython, getting to know it and its strengths and deficiencies, and trying to think of ways to integrate new features both cleanly and without breaking existing code has so far been a very stimulating experience for me. I became interested in Cython at the beginning of March 2008, and have been active on the mailing list since then. I have also contributed the following addition to Cython, which has been well received in the Cython community: http://wiki.cython.org/enhancements/parsetreetransforms

About me:

I'm graduating with a bachelor's degree in "Mathematics, informatics and technology" from the University of Oslo this spring (consisting of about one third calculus and mathematics, one third computer science, and one third probability theory and statistics).

In autumn 2007 I was a Teaching Assistant for a master-level general algorithms course (INF4130) at my university. I have considerable practical experience with Python, C and C++ (as well as Java, SAX and XML transforms, XSLT...) and believe I have the necessary knowledge of the required design patterns, algorithms etc. for this project.

Before starting my studies I worked full-time for six months for a company developing Norwegian language tools (http://nynodata.no), doing software development involving porting a code-base in a clean and dependable way from C to Java. I got experience there with test-driven development and thinking very consciously about code architecture. I've had two summer internships in the same company since then.

In high school I was one of five active developers in an open source computer game project that failed miserably (the language was C++ with heavy use of templates etc.). In retrospect I believe this has been important because I learned a lot about what not to do there. In particular, I learned first-hand that releasing small improvements early and often is much more important than having a 100% polished dream solution.

In high school I was also twice on the Norwegian team in the International Olympiad in Informatics (algorithm competition). I am currently the project administrator for a (successful) effort for running Linux on the Compaq EVO T20 thin client (http://open-evot20.sourceforge.net).