---
layout: post
title: "Crail Python API: Python -> C/C++ call overhead"
author: Jonas Pfefferle
category: blog
comments: true
---
<div style="text-align: justify">
<p>
With Python being the go-to language in many machine learning applications, serverless
frameworks, etc., we believe a Crail client Python API would be a useful tool to
broaden the use cases for Crail.
Since the Crail core is written in Java, performance has always been a concern due to
just-in-time compilation, garbage collection, etc.
However, with careful engineering (off-heap buffers, stateful verb calls, ...)
we were able to show that Crail can deliver similar or better performance compared
to other statically compiled storage systems. So how can we engineer the Python
library to deliver the best possible performance?
</p>
<p>
Python's reference implementation, CPython, which is also the most widely used, has
historically been an interpreter rather than a JIT compiler like PyPy. We will focus on
CPython since its alternatives are in general not plug-and-play replacements.
</p>
<p>
Crail is client-driven, so most of its logic is implemented in the client library.
For this reason we do not want to reimplement the client logic for every new
language we want to support, as that would result in a maintenance nightmare.
However, interfacing with Java is not feasible since it incurs too much overhead,
so we decided to implement a C++ client (more on this in a later blog post).
The C++ client allows us to use a foreign function interface in Python to call
C++ functions directly from Python.
</p>
</div>

### Options, Options, Options

<div style="text-align: justify">
<p>
There are two high-level approaches to integrating (C)Python and C: extension
modules and embedding.
</p>
<p>
Embedding uses Python as a component inside an application. Our aim is to
develop a Python API to be used by other Python applications, so embedding is
not what we are looking for.
</p>
<p>
Extension modules are shared libraries that extend the Python interpreter.
For this use case CPython offers a C API to interact with the Python interpreter,
which allows defining modules, objects and functions in C that can be called
from Python. Note that there is also the option to extend the Python interpreter
through a Python library like ctypes or cffi. These are generally easier to
use and should preserve portability (extension modules are CPython-specific).
However, they do not offer as much flexibility as extension modules and potentially
incur more overhead (see below). There are multiple wrapper frameworks
available for CPython's C API to ease the development of extension modules.
Here is an overview of the frameworks and libraries we tested:
</p>
</div>
* [**Cython:**](https://cython.org/) an optimising static compiler for Python and for the Cython programming
language (based on Pyrex). C/C++ functions, objects, etc. can be accessed
directly from Cython. The compiler generates C code from Cython which interfaces
with the CPython C-API.
* [**SWIG:**](http://www.swig.org/) the Simplified Wrapper and Interface Generator is a tool to connect
C/C++ with various high-level languages. C/C++ interfaces that should be available
in Python have to be declared in a SWIG interface file. The interface files
are compiled to C/C++ wrapper files which interface with the CPython C-API.
* [**Boost.Python:**](https://www.boost.org/) a C++ library that wraps CPython's C-API. It uses
advanced metaprogramming techniques to simplify its usage and allows wrapping
C++ interfaces non-intrusively.
* [**ctypes:**](https://docs.python.org/3.7/library/ctypes.html#module-ctypes)
a foreign function library. It allows calling C functions in shared libraries
with predefined compatible data types. It does not require writing any glue code
and does not interface with the CPython C-API directly.
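
All of the extension-module based approaches above ultimately go through the CPython
C-API. For reference, a minimal hand-written module following the official tutorial looks
roughly like the sketch below (module and function names are illustrative, this is not our
actual Crail binding):

```c
#include <Python.h>

/* A no-op function taking two integers, similar in shape to the functions benchmarked below. */
static PyObject *foo(PyObject *self, PyObject *args)
{
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;                 /* argument parsing failed, exception is already set */
    Py_RETURN_NONE;
}

static PyMethodDef DemoMethods[] = {
    {"foo", foo, METH_VARARGS, "no-op taking two ints"},
    {NULL, NULL, 0, NULL}            /* sentinel */
};

static struct PyModuleDef demomodule = {
    PyModuleDef_HEAD_INIT, "demo", NULL, -1, DemoMethods
};

/* Entry point the interpreter looks for when importing the 'demo' shared library. */
PyMODINIT_FUNC PyInit_demo(void)
{
    return PyModule_Create(&demomodule);
}
```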

### Benchmarks

In this blog post we focus on the overhead of calling a C/C++ function from Python.
We vary the number of arguments, the argument types and the return types. We also
test passing strings to C/C++ since this is part of the Crail API, e.g. when
opening or creating a file. Some frameworks expect `bytes` when passing a string
to an underlying `const char *`, some allow passing a `str`, and others allow both.
If C++ is supported by the framework we also test passing a `std::string` to a
C++ function. Note that we perform all benchmarks with CPython version 3.5.2.
We measure the time from calling the Python function until it returns.
The C/C++ functions are empty, except for a `return` statement where necessary.
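
The called functions are of roughly the following shape (an illustrative sketch with made-up
names, not the exact benchmark code):

```c
/* Empty target functions of the kind used in the call-overhead benchmark. */
void foo(void)                           { }
void foo4(int a, int b, int c, int d)    { }
int  foo_ret(void)                       { return 0; }
void foo_str(const char *s)              { }
/* Where C++ is supported we also benchmark: void foo_string(const std::string &s); */
```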

<div style="text-align:center"><img src ="{{ site.base }}/img/blog/python_c/python_c_foo.svg" width="725"/></div>
<p></p>

The plot shows that adding more arguments to a function increases the runtime,
with the first argument increasing it the most. Adding an integer
return type only increases the runtime slightly.

As expected, ctypes, the only approach tested that is not based on extension modules,
performed the worst. The call overhead for a function without return value
and without arguments is almost 300ns and goes up to half a microsecond with 4
arguments. Considering that RDMA writes can be performed in under 1us, this would
introduce a major overhead (more on this in the discussion section below).

SWIG and Boost.Python show similar performance, with Boost being slightly slower and
thus the slowest of the implementations based on extension modules.
Cython is also based on extension modules, so it was a surprise to us that it showed
the best performance of all methods tested. Investigating the performance difference
between Cython and our own extension module implementation, we found that Cython makes
better use of the C-API.

Our extension module implementation follows the official tutorial and uses
`PyArg_ParseTuple` to parse the arguments. However, as shown below, we found that
manually unpacking the arguments with `PyArg_UnpackTuple` already significantly
improves performance. Although these numbers still do not match Cython's
performance, we did not investigate further optimizations to our code.
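
As an illustration, the manually unpacked variant looks roughly like this (a sketch with
illustrative names, not our exact benchmark code); the module boilerplate stays the same as
in the extension module sketch above:

```c
/* Same two-int no-op as before, but unpacking the argument tuple manually
 * instead of going through PyArg_ParseTuple's format-string machinery. */
static PyObject *foo_unpack(PyObject *self, PyObject *args)
{
    PyObject *a_obj, *b_obj;
    if (!PyArg_UnpackTuple(args, "foo_unpack", 2, 2, &a_obj, &b_obj))
        return NULL;                               /* wrong number of arguments */

    /* Convert the argument objects ourselves. */
    long a = PyLong_AsLong(a_obj);
    long b = PyLong_AsLong(b_obj);
    if ((a == -1 || b == -1) && PyErr_Occurred())
        return NULL;                               /* not an integer */

    (void)a; (void)b;                              /* the benchmarked body is empty */
    Py_RETURN_NONE;
}
```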
<div style="text-align:center"><img src ="{{ site.base }}/img/blog/python_c/python_c_foo_opt.svg" width="725"/></div>
<p></p>

Let's take a look at the string performance. `bytes` and `str` are used wherever
applicable. To pass strings as bytes, the 'b' prefix is used.

<div style="text-align:center"><img src ="{{ site.base }}/img/blog/python_c/python_c_foo_str.svg" width="725"/></div>
<p></p>

Again, Cython and the extension module implementation with manual unpacking
deliver the best performance. Passing a 64bit value in the form of a `const char *`
pointer is slightly faster than passing an integer argument (up to 20%).
Passing the string to a C++ function which takes a `std::string`
is ~50% slower than passing a `const char *`, probably because of the
instantiation of the underlying data buffer and the copying involved; however, we have
not confirmed this.

### Discussion

One might think that a difference of 100ns should not really matter, and that one should
not call into C/C++ too often anyway. However, we believe this is not true
when it comes to latency-sensitive or high-IOPS applications. For example,
using RDMA one can perform IO operations below a 1us round-trip time, so 100ns is already
a 10% performance hit. Also, batching operations (to reduce the number of calls into C)
is not feasible for low-latency operations since it typically incurs waiting
time until the batch is large enough to be posted. Furthermore, even in high-IOPS
applications batching is not always feasible and might lead to an undesired
latency increase.

Efficient IO is typically performed through an asynchronous
interface, so that one does not have to wait for an operation to complete before issuing
the next one. Even with an asynchronous interface, not only is the latency of each
operation affected, but the call overhead also limits the maximum IOPS. Consider a
best-case scenario where our async call takes only one pointer argument, i.e. around
100ns of call overhead, and say our C library is capable of posting 5 million requests
per second (limited by the speed of posting, not the device), i.e. 200ns per operation.
Introducing a 100ns overhead limits the IOPS to 3.3 million operations per second,
a drop of one third. This is already significant; with ctypes-like overhead for such
an operation we are talking about limiting the throughput by a factor of roughly 3.
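
Written out, the back-of-the-envelope numbers above look like this (the overhead values
are the illustrative ones from this post, not precise measurements):

```c
#include <stdio.h>

int main(void)
{
    const double post_ns   = 200.0;   /* C library alone: 5M requests/s            */
    const double ext_ns    = 100.0;   /* tuned extension-module call overhead      */
    const double ctypes_ns = 500.0;   /* ctypes-like call overhead (several args)  */

    /* 1e3 / (ns per operation) gives millions of operations per second. */
    printf("C library alone:     %.1f M ops/s\n", 1e3 / post_ns);
    printf("+ extension module:  %.1f M ops/s\n", 1e3 / (post_ns + ext_ns));
    printf("+ ctypes overhead:   %.1f M ops/s\n", 1e3 / (post_ns + ctypes_ns));
    return 0;
}
```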
Besides performance, another aspect is the usability of the different approaches.
Considering only ease of use, *ctypes* is a clear winner for us. However, it only
supports interfacing with C and it is slow. *Cython*, *SWIG* and *Boost.Python*
require a similar amount of effort to declare the interfaces, but here
*Cython* clearly wins the performance crown. Writing your own *extension module*
is feasible; however, as shown above, getting the best performance requires
a good understanding of the CPython C-API and its internals. Of the tested approaches,
this one requires the most glue code.