Skip to content
Browse files
Python -> C/C++ blog post
  • Loading branch information
PepperJo committed Jan 25, 2019
1 parent 4358827 commit b0c26e9e6a2eb62eaca7b7cbbc6fc8ba8c9eb719
Showing 4 changed files with 4,999 additions and 0 deletions.
@@ -0,0 +1,168 @@
layout: post
title: "Crail Python API: Python -> C/C++ call overhead"
author: Jonas Pfefferle
category: blog
comments: true
<div style="text-align: justify">
With python being used in many machine learning applications, serverless frameworks, etc.
as the go-to language, we believe a Crail client Python API would be a useful tool to
broaden the use-case for Crail.
Since the Crail core is written in Java, performance has always been a concern due to
just-in-time compilation, garbage collection, etc.
However with careful engineering (Off heap buffers, stateful verbs calls, ...)
we were able to show that Crail can devliever similar or better performance compared
to other statically compiled storage systems. So how can we engineer the Python
library to deliver the best possible performance?
Python's reference implementation, also the most widely-used, CPython has historically
always been an interpreter and not a JIT compiler like PyPy. We will focus on
CPython since its alternatives are in general not plug-and-play replacements.
Crail is client-driven so most of its logic is implemented in the client library.
For this reason we do not want to reimplement the client logic for every new
language we want to support as it would result in a maintance nightmare.
However interfacing with Java is not feasible since it encurs in to much overhead
so we decided to implement a C++ client (more on this in a later blog post).
The C++ client allows us to use a foreign function interface in Python to call
C++ functions directly from Python.

### Options, Options, Options

<div style="text-align: justify">
There are two high-level concepts of how to integrate (C)Python and C: extension
modules and embedding.
Embedding Python uses Python as a component in an application. Our aim is to
develop a Python API to be used by other Python applications so embeddings are
not what we look for.
Extension modules are shared libraries that extend the Python interpreter.
For this use-case CPython offers a C API to interact with the Python interpreter
and allows to define modules, objects and functions in C which can be called
from Python. Note that there is also the option to extend the Python interpreter
through a Python library like ctypes or cffi. They are generally easier to
use and should preserve portability (extension modules are CPython specific).
However they do not give as much flexibility as extension modules and incur
in potentially more overhead (see below). There are multiple wrapper frameworks
available for CPython's C API to ease development of extension modules.
Here is an overview of frameworks and libraries we tested:
* [**Cython:**]( optimising static compiler for Python and the Cython programming
language (based on Pyrex). C/C++ function, objects, etc. can be directly
accessed from Cython. The compiler generates C code from Cython which interfaces
with the CPython C-API.
* [**SWIG:**]( (Simplified Wrapper and Interface Generator) is a tool to connect
C/C++ with various high-level languages. C/C++ interfaces that should be available
in Python have to be defined in a SWIG interface file. The interface files
are compiled to C/C++ wrapper files which interface with the CPython C-API.
* [**Boost.Python:**]( is a C++ library that wraps CPython's C-API. It uses
advanced metaprogramming techniques to simplify the usage and allows wrapping
C++ interfaces non-intrusively.
* [**ctypes:**](
is a foreign function library. It allows calling C functions in shared libraries
with predefined compatible data types. It does not require writing any glue code
and does not interface with the CPython C-API directly.

### Benchmarks

In this blog post we focus on the overhead of calling a C/C++ function from Python.
We vary the number of arguments, argument types and the return types. We also
test passing strings to C/C++ since it is part of the Crail API e.g. when
opening or creating a file. Some frameworks expect `bytes` when passing a string
to a underlying `const char *`, some allow to pass a `str` and others allow both.
If C++ is supported by the framework we also test passing a `std::string` to a
C++ function. Note that we perform all benchmarks with CPython version 3.5.2.
We measure the time it takes to call the Python function until it returns.
The C/C++ functions are empty, except a `return` statement where necessary.

<div style="text-align:center"><img src ="{{ site.base }}/img/blog/python_c/python_c_foo.svg" width="725"/></div>

The plot shows that adding more arguments to a function increases runtime.
Introducing the first argument increases the runtime the most. Adding a the integer
return type only increased runtime slightly.

As expected, cytpes as the only test which is not based on extension modules
performed the worst. Function call overhead for a function without return value
and any arguments is almost 300ns and goes up to 1/2 a microsecond with 4
arguments. Considering that RDMA writes can be performed below 1us this would
introduce a major overhead (more on this below in the discussion section).

SWIG and Boost.Python show similar performance where Boost is slightly slower and
out of the implementations based on extension modules is the slowest.
Cython is also based on extension modules so it was a surprise to us that it showed
the best performance of all methods tested. Investigating the performance difference
between Cython and our extension module implementation we found that Cython makes
better use of the C-API.

Our extension module implementation follows the official tutorial and uses
`PyArg_ParseTuple` to parse the arguments. However as shown below we found that
manually unpacking the arguments with `PyArg_UnpackTuple` already significantly
increased the performance. Although these numbers still do not match Cython's
performance we did not further investigate possible optimizations
to our code.

<div style="text-align:center"><img src ="{{ site.base }}/img/blog/python_c/python_c_foo_opt.svg" width="725"/></div>

Let's take a look at the string performance. `bytes` and `str` is used whereever
applicable. To pass strings as bytes the 'b' prefix is used.

<div style="text-align:center"><img src ="{{ site.base }}/img/blog/python_c/python_c_foo_str.svg" width="725"/></div>

Again Cython and the extension module implementation with manual unpacking seem to
deliver the best performance. Passing a 64bit value in form of a `const char *`
pointer seems to be slightly faster than passing an integer argument (up to 20%).
Passing the string to a C++ function which takes a `std::string`
is ~50% slower than passing a `const char *`, probably because of the
instantiation of the underlying data buffer and copying however we have not
confirmed this.

### Discussion

One might think a difference of 100ns should not really matter and you should
anyway not call to often into C/C++. However we believe that this is not true
when it comes to latency sensitive or high IOPS applications. For example
using RDMA one can perform IO operations below a 1us RTT so 100ns is already
a 10% performance hit. Also batching operations (to reduce amount of calls to C)
is not feasible for low latency operations since it typically incurs in wait
time until the batch size is large enough to be posted. Furthermore, even in high
IOPS applications batching is not always feasible and might lead to undesired
latency increase.

Efficient IO is typically performed through an asynchronous
interface to allow not having to wait for IO to complete to perform the next
operation. Even with an asynchronous interface, not only the latency of the operation
is affected but the call overhead also limits the maximum IOPS. For example,
in the best case scenario, our async call only takes one pointer as an argument so
100ns call overhead. And say our C library is capable of posting 5 million requests
per seconds (and is limited by the speed of posting not the device) that calculates
to 200ns per operation. If we introduce a 100ns overhead we limit the IOPS to 3.3
million operations per second which is a 1/3 decrease in performance. This is
already significant consider using ctypes for such an operation now we are
talking about limiting the throughput by a factor of 3.

Besides performance another aspect is the usability of the different approaches.
Considering only ease of use *ctypes* is a clear winner for us. However it only
supports to interface with C and is slow. *Cython*, *SWIG* and *Boost.Python*
require a similar amount of effort to declare the interfaces, however here
*Cython* clearly wins the performance crown. Writing your own *extension module*
is feasible however as shown above to get the best performance one needs
a good understanding of the CPython C-API/internals. From the tested approaches
this one requires the most glue code.

0 comments on commit b0c26e9

Please sign in to comment.