enhancements cep1000

robertwb edited this page Apr 15, 2012 · 24 revisions
Clone this wiki locally

CEP 1000: Convention for native dispatches through Python callables

Many callable objects are simply wrappers around native code. This holds for any Cython function, f2py functions, manually written CPython extensions, Numba, etc.

Obviously, when native code calls other native code, it would be nice to skip the significant cost of boxing and unboxing all the arguments. Early binding at compile-time is only possible between different Cython modules, not between all the tools listed above.

CEP 523 deals with Cython-specific aspects (and is out-of-date w.r.t. this CEP); this CEP is intended to be about a cross-project convention only. If a success, this CEP may be proposesd as a PEP in a modified form.

Motivating example (looking a year or two into the future):

def f(x): return 2 * x

def g(x : cython.double): return 3 * x

from fortranmod import h

print f(3)
print g(3)
print h(3)
print scipy.integrate.quad(f, 0.2, 3) # fast callback!
print scipy.integrate.quad(g, 0.2, 3) # fast callback!
print scipy.integrate.quad(h, 0.2, 3) # fast callback!

The native-call slot

We need fast access to probing whether a callable object supports this CEP. Other mechanisms, such as an attribute in a dict, is too slow for many purposes (quoting robertwb: "We're trying to get a 300ns dispatch down to 10ns; you do not want a 50ns dict lookup"). (Obviously, if you call a callable in a loop you can fetch the pointer outside of the loop. But in particular if this becomes a language feature in Cython it will be used in all sorts of places.)

So we hack another type slot into existing and future CPython implementations in the following way: This CEP provides a C header that for all Python versions define a macro Py_TPFLAGS_UNOFFICIAL_EXTRAS for a free bit in tp_flags in the PyTypeObject.

If present, then we extend PyTypeObject as follows:

typedef struct {
   PyTypeObject tp_main;
   size_t tp_unofficial_flags;
   size_t tp_nativecall_offset;
} PyUnofficialTypeObject;

tp_unofficial_flags is unused and should be all 0 for the time being, but can be used later to indicate features beyond this CEP.

If tp_nativecall_offset != 0, this CEP is supported, and the information for doing a native dispatch on a callable obj is located at

(char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;

GIL-less accesss

It is OK to access the native-call table without holding the GIL. This should of course only be used to call functions that state in their signature that they don't need the GIL.

This is important for JITted callables who would like to rewrite their table as more specializations gets added; if one needs to reallocate the table, the old table must linger along long enough that all threads that are currently accessing it are done with it.

Native dispatch descriptor

The final format for the descriptor is not agreed upon yet; this sums up the major alternatives.

The descriptor should be a list of specializations/overload, each described by a function pointer and a signature specification string, such as "id)i" for int f(int, double).

The way it is stored must cater for two cases; first, when the caller expects one or more hard-coded signatures:

if (obj has signature "id)i") {
} else if (obj has signature "if)i") {
   call with promoted second argument;
} else {
   box all arguments;

The second is when a call stack is built dynamically while parsing the string. Since this has higher overhead anyway, optimizing for the first case makes sense.

Approach 1: Interning/run-time allocated IDs

1A: Let each overload have a struct

struct {
   size_t signature_id;
   char *signature;
   void *func_ptr;

Within each process run, there is a 1:1 between signature and signature_id. signature_id is allocated by some central registry.

1B: Intern the string instead:

struct {
   char *signature; /* pointer must come from the central registry */
   void *func_ptr;

However this is not trivial, since signature strings can be allocated on the heap (e.g., a JIT would do this), so interned strings must be memory managed and reference counted. This could be done by each object passing in the signature both when incref-ing and decref-ing the signature string in the interning machinery. Using Python bytes objects is another option. Another option would to re-use and never reclaim them, as they are likely to be small and limited in (distinct) number over even a long-running session.


The cost of comparing a signature: Comparing a global variable (needle) to a value that is guaranteed to already be in cache (candidate match)


  • Conceptually simple struct format.


  • Requires a registry for interning strings. This must be
"handshaked" between the implementors of this CEP (probably by "first to get at sys.modules["_nativecall" sticks it there), as we can't ship a common dependency library for this CEP.

Approach 2: Efficient strcmp of verbatim signatures

The idea is to store the full signatures and the function pointers together in the same memory area, but still have some structure to allow for quick scanning through the list.

Each entry has the structure [signature_string, funcptr] where:

  • The signature string has variable length, but the length is
divisible by 8 bytes on all platforms. The funcptr is always 8 bytes (it is padded on 32-bit systems).
  • The total size of the entry should be divisible by 16 bytes (= the
signature data should be 8 bytes, or 24 bytes, or...)
  • All but the first chunk of signature data should start with a
continuation character "-", i.e. a really long signature string could be "iiiidddd-iiidddd-iiidddd-)d". That is, a "-" is inserted on all positions in the string divisible by 8, except the first.

The point is that if you know a signature, you can quickly scan through the binary blob for the signature in 128 bit increments, without worrying about the variable size nature of each entry. The rules above protects against spurious matches.

Optional: Encoding

The strcmp approach can be made efficient for larger signatures by using a more efficient encoding than ASCII. E.g., an encoding could use 4 bits for the 12 most common symbols and 8 bits for 64 symbols (for a total of 78 symbols), of which some could be letter combinations ("Zd", "T{"). This should be reasonably simple to encode and decode.

The CEP should provide C routines in a header file to work with the signatures. Callers that wish to parse the format string and build a call stack on the fly should probably work with the encoded representation.


The cost of comparing a signature: For the vast majority of functions, the cost is comparing a 64-bit number stored in the CPU instruction stream (needle) to a value that is guaranteed to already be in cache (candidate match).


  • Readability-wise, one can use the C switch statement to dispatch
  • "Stateless data", for compiled code it does not require any
run-time initialization like interning does
  • One less pointer-dereference in the common case of a short


  • Long signatures will require more than 8 bytes to store and could
thus be more expensive than interned strings
  • Format looks uglier in the form of literals in C source code

Signature strings

Example: The function

int f(double x, float y);

would have the signature string "df)i" (or, to save space, "idf").

Fields would follow the PEP3118 extensions of the struct-module format string, but with some modifications:

  • The format should be canonical and fit for strcmp-like
comparison: No whitespace, no field names (TBD: what else?)
  • TBD: Information about GIL requirements (nogil, with gil?), how
exceptions are reported
  • TBD: Support for Cython-specific constructs like memoryview slices
(so that arrays with strides and shape can be passed faster than passing an "O").