enhancements buffersyntax

DagSverreSeljebotn edited this page Jun 17, 2009 · 45 revisions

Proposal for a new buffer syntax

  • Author: Dag Seljebotn

This proposal is obsolete; please see the Cython array type CEP.

What is the problem?

The current syntax is e.g.

cdef object[int, mode="fortran"] x
cdef np.ndarray[double, ndim=2] y

This comes from viewing the buffer syntax as an optimization of the Python [] operator in certain special cases. Any non-optimizable operations are passed to the underlying object. In addition, the typename controls the default access mode ("strided" vs. "indirect").

Advantages:

  • It allows Python/NumPy syntax (except in variable declarations), so one can simply add types to existing code

Disadvantages:

  • It gives no clear way of bringing about efficient slicing, efficient arithmetic operations, etc. E.g., with
cdef object[int] a = np.arange(10)
cdef object[int] b
b = a[5:] # made efficient
print b[0] # prints 5

Now, which Python object does b refer to? The one of a?

print b[<object>(0)] # huh, prints 0?

Or, perhaps None?

print b.foo() # crashes program

  • One often forgets to type the indices, or indexes the wrong way (arr[i][j] or similar), and gets no warning at all that the code will be slowed down by a factor of hundreds
  • The mechanism itself is rather crude (it only optimizes one specific case), yet the syntax doesn't show this, so it acquires a "magic" feel.
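
The slicing ambiguity described above can be reproduced today with Python's built-in memoryview. It is used here only as a stand-in for a typed buffer (an assumption for illustration, not the behaviour this page specifies). The sliced view starts at element 5, yet the object it reports as its owner is still the whole array:

```python
from array import array

a = array('i', range(10))   # exporter of a PEP 3118 buffer
view = memoryview(a)
b = view[5:]                # slicing the view, not the object

first = b[0]                # element 0 of the slice is element 5 of `a`
owner_is_a = b.obj is a     # the owning object is still all of `a`
owner_first = b.obj[0]      # indexing the owner gives a[0], not b[0]
```

Any design that lets b fall back to its owning object has to decide which of these two answers it wants.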

Proposed solution

The proposed solution is to introduce buffers as a first-class native type with a new syntax.

cdef int[:] buf = obj
print buf[2] # fast access
print obj.some_method()
print buf.some_method() # NOT ALLOWED!

The syntax would embed everything needed to optimize PEP 3118 buffer access, without knowing anything about the underlying object type (such as NumPy arrays) and without allowing operations directly on the object owning the buffer.
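
As a rough analogy (not the proposed implementation), Python's built-in memoryview already behaves this way: it exposes the element data of the exporting object but none of that object's methods. The variable names below are illustrative only.

```python
from array import array

obj = array('i', [10, 20, 30])
buf = memoryview(obj)                     # acquires a PEP 3118 view of obj

third = buf[2]                            # element access goes through the buffer
obj_can_append = hasattr(obj, 'append')   # the array object has append()
buf_can_append = hasattr(buf, 'append')   # the view does not
buf.release()                             # hand the buffer back to the exporter
```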

PEP 3118 allows for a very wide class of buffer layouts. The accepted layouts can be restricted in many ways, and almost any restriction allows considerably faster access code.
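
To give a feel for how much layout information a PEP 3118 buffer carries, the sketch below inspects a 2-by-3 view of C ints built with the standard library. The shape and the variable names are chosen for illustration only.

```python
import struct

itemsize = struct.calcsize('i')              # size of a C int on this platform
raw = bytearray(2 * 3 * itemsize)
view = memoryview(raw).cast('i', shape=[2, 3])

layout = {
    'format': view.format,                   # element type code
    'ndim': view.ndim,                       # number of dimensions
    'shape': view.shape,                     # extent of each dimension
    'strides': view.strides,                 # byte step along each dimension
    'c_contiguous': view.c_contiguous,
    'f_contiguous': view.f_contiguous,
}
```

Code that knows at compile time that the last stride equals the item size, for example, does not have to read and multiply by that stride on every access.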

It could work like this. Assume from cython import strided, contig, full, ptr:

  • int[:,:,:] is 3D strided mode.
  • int[::strided, ::strided, ::strided] is the same written out in full
  • int[:,:,::1] is 3D C-contiguous mode
  • int[::contig, ::contig, ::strided(1)] is again the same in full
  • int[::1,:,:] is 3D Fortran-contiguous mode
  • int[::full, ::full] is generic 2D buffer supporting any layout (adds lots of branches...)
  • int[::ptr, ::1] is a strided array of pointers to contiguous arrays.
  • int[::ptr(1), ::1] is a contiguous array of pointers to contiguous arrays.
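
The difference between these modes comes down to what may be assumed about the strides when an element's address is computed. The following plain-Python sketch uses hypothetical helper names (`contiguous_strides`, `byte_offset`) to show the address arithmetic for the general strided case, and how the C- and Fortran-contiguous layouts pin one stride to the item size.

```python
def contiguous_strides(shape, itemsize, order='C'):
    """Return byte strides for a contiguous array of the given shape."""
    # The fastest-varying dimension is the last one for C order
    # and the first one for Fortran order.
    dims = list(shape) if order == 'F' else list(reversed(shape))
    strides, step = [], itemsize
    for extent in dims:
        strides.append(step)
        step *= extent
    return tuple(strides) if order == 'F' else tuple(reversed(strides))


def byte_offset(index, strides):
    """Generic strided access: sum of index * stride over all dimensions."""
    return sum(i * s for i, s in zip(index, strides))


c_strides = contiguous_strides((2, 3, 4), 4, 'C')   # last stride == itemsize
f_strides = contiguous_strides((2, 3, 4), 4, 'F')   # first stride == itemsize
c_off = byte_offset((1, 0, 2), c_strides)
f_off = byte_offset((1, 0, 2), f_strides)
```

The indirect ("ptr") modes are not modelled here; they would add a pointer dereference between dimensions.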

Of course, not all of this needs to be supported at once.

The existing usecase

An alternative to the existing syntax would be code like this:

from cython import shape # or cython.buffer
def mysum(int[:] arr):
    cdef int s = 0  # the accumulator must be initialized before use
    cdef Py_ssize_t i
    for i in range(shape(arr, 0)):
        s += arr[i]
    return s

Here int[:] is an alternative to object[int, ndim=1].
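
For comparison, roughly the same function can be written in plain Python against the built-in memoryview. This is only an analogy for the semantics (any object exporting a one-dimensional buffer of C ints is accepted); it has none of the speed the typed syntax is meant to deliver.

```python
from array import array


def mysum(obj):
    """Sum a one-dimensional buffer of C ints exported by obj."""
    with memoryview(obj) as arr:       # acquire, then release, the buffer
        if arr.ndim != 1 or arr.format != 'i':
            raise TypeError("expected a 1D buffer of C ints")
        s = 0
        for i in range(arr.shape[0]):
            s += arr[i]
        return s


total = mysum(array('i', [1, 2, 3, 4]))
```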

int[:,:,:] is three-dimensional, and so on. This makes a clear distinction from the C array syntax and looks more Pythonic. It is also within the Python grammar.

int[:] accesses only the buffer, not the corresponding Python object. Coercion from an object acquires a buffer view, while coercion to an object is disallowed in earlier Python versions and gives a standard Python memoryview in newer versions (backports could also be done, though something like a __frombuffer__ operator in numpy.pxd for efficient numpy.ndarray(buf) construction would likely work better with less effort).

  • Read-only vs. read-write is automatic like today
  • Mode is passed as a string, e.g. int[:,:,"fortran"] for a 2D array with Fortran contiguous ordering. Default mode is "strided", one can pass "indirect" to get indirect indexing.
  • Whether negative indices are allowed can be controlled per dimension, e.g. int[:,0:]. This is slightly more featureful than today (here, negative wraparound is disallowed on the second dimension only)
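
Per-dimension control of negative indices could behave as in the sketch below. The helper `resolve_index` and its `wraparound` flags are hypothetical and only model the semantics; they are not part of the proposed API.

```python
def resolve_index(index, shape, wraparound):
    """Map a possibly negative index tuple to non-negative positions.

    wraparound[d] says whether dimension d accepts negative indices.
    """
    resolved = []
    for i, extent, wrap in zip(index, shape, wraparound):
        if i < 0 and wrap:
            i += extent                  # Python-style wraparound
        if not 0 <= i < extent:
            raise IndexError("index out of bounds")
        resolved.append(i)
    return tuple(resolved)


# Wraparound allowed on the first dimension but not on the second.
pos = resolve_index((-1, 2), (4, 5), (True, False))
```

Dropping the wraparound branch for a dimension removes one comparison and one addition from every access along it.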

Main differences from today, in the context of NumPy:

def f():
    cdef int[:] a = ..., b = ..., c
    c = a + b    # would not work before new features are implemented
    c = a[2:110] # would not work before new features are implemented
    print a.flags # nope

So, int[:] represents only the buffer and not the NumPy array object. Slicing and arithmetic on these