Adds a C extension to speed up simpleion module. (#181)
cheqianh committed Dec 10, 2021
1 parent 223088d commit 1c0860c
Showing 19 changed files with 2,651 additions and 173 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -22,5 +22,5 @@ jobs:

- run: pip install --upgrade setuptools
- run: pip install -r requirements.txt
- - run: pip install .
+ - run: pip install -e .
- run: py.test
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,5 +1,6 @@
*.pyc
*.pyo
*.so
*~
*#
*.swp
@@ -13,4 +14,5 @@
/dist
/amazon.ion.egg-info
/docs/_build/
/amazon/ion/ion-c-build/

4 changes: 4 additions & 0 deletions .gitmodules
@@ -2,3 +2,7 @@
path = vectors
url = https://github.com/amzn/ion-tests
branch = master
[submodule "ion-c"]
path = ion-c
url = https://github.com/amzn/ion-c
branch = master
135 changes: 135 additions & 0 deletions C_EXTENSION.md
@@ -0,0 +1,135 @@
# Ion Python C Extension

1. [Overall](#overall)
2. [Motivation](#motivation)
3. [Performance Improvement](#performance-improvement)
4. [Setup](#setup)
5. [Development](#development)
6. [Technical Details](#technical-details)<br>
6.1 &nbsp;[Common Binary Encoding Differences between C Extension and Original Ion Python](#1-common-binary-encoding-differences-between-c-extension-and-original-ion-python)<br>
6.2 &nbsp;[Known Issues](#2-known-issues)<br>
7. [TODO](#todo)
8. [Deploy](#deploy)<br>
8.1 &nbsp;[Distribution](#1-distribution)<br>

## Overall

The Ion Python C extension uses Ion C to read and write Ion data, closing the performance gap between the Ion Python simpleion module and other Ion implementations.

The C extension currently supports only a limited set of simpleion options; support for more will be added incrementally. Refer to [TODO](#todo) for details.

## Motivation

Python is a comparatively slow language, which makes Ion Python slower than other Ion implementations. Ion Python is also slower than similar Python data serialization libraries such as simplejson, a JSON encoder and decoder. The main reason for the performance difference between simplejson and the Ion Python simpleion module is that simplejson binds to a C extension, while Ion Python is implemented purely in Python.

There are several technologies we can choose from to bind the C extension to the C binaries (Ion C): CFFI, ctypes, Cython, and the CPython APIs.

CFFI and ctypes are slower than CPython and Cython for most of our use cases. Cython is slightly faster than CPython, but it is a compiler for a new programming language and would require more development time. Whichever tool we use, one of the most challenging issues is how to distribute the Ion C binaries, which are `.dylib` files on Mac, `.so` files on Linux, and `.lib` files on Windows. In addition, the CPython C extension code for simpleion was almost completed 2 years ago, so we chose this option.

If performance becomes our biggest concern in the future, we should reevaluate the performance implications of the C extension to make sure we're keeping up with innovations in the Python C extension ecosystem.




## Performance Improvement

The performance improvement depends on many variables (e.g., how the files are structured and which APIs are called most often). Our experiments show an improvement of **around** 6000% for the text writer/reader and 1400% for the binary writer/reader.

We use the `timeit` module to measure the execution time.
```.py
import timeit

setup = "from amazon.ion import simpleion"
# Round-trip the file: read every top-level value, then serialize them again.
code = '''
with open("file_name", "br") as fp:
    simpleion.dumps(simpleion.load(fp, single_value=False))
'''
print(timeit.timeit(setup=setup, stmt=code, number=1))
```

#### Experiment Result
`test-driver-report.ion(10n)` are reports generated by [ion-test-driver](https://github.com/amzn/ion-test-driver), which consist of Ion structs and strings.<br/>
`log.ion(10n)` are logs that contain a variety of scalar types, annotations, and nested containers.<br/>

|Files|C extension|Ion Python|Improvement|
|---|---|---|---|
|test-driver-report.ion (42MB)|3.8s|217s|5611%|
|test-driver-report.10n (13.7MB)|3.6s|55s|1428%|
|log.ion (84MB)|14.8s|987s|6569%|
|log.10n (14MB)|15s|221s|1373%|
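
The Improvement column is consistent with the speedup measured relative to the C extension's time, `(t_python - t_c) / t_c`. The source does not state the formula, so the sketch below assumes it; the helper name `improvement_percent` is hypothetical.

```python
def improvement_percent(c_ext_seconds, pure_python_seconds):
    """Relative improvement of the C extension over pure Python, in percent.

    Assumed formula (not stated in the source): (t_python - t_c) / t_c * 100.
    """
    return (pure_python_seconds - c_ext_seconds) / c_ext_seconds * 100


# Timings copied from the table above: (file, C extension, Ion Python).
timings = [
    ("test-driver-report.ion", 3.8, 217),
    ("test-driver-report.10n", 3.6, 55),
    ("log.ion", 14.8, 987),
    ("log.10n", 15, 221),
]

for name, c_ext, pure_python in timings:
    print(f"{name}: {improvement_percent(c_ext, pure_python):.0f}%")
```

Rounded to the nearest percent, these values match the table (5611%, 1428%, 6569%, 1373%).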


## Setup

Ensure that cmake is installed. The setup for the Ion Python C extension is the same as the original [Ion Python Setup](https://github.com/amzn/ion-python#development). If the C extension runs into any issue during initialization, Ion Python falls back to the regular pure-Python implementation. **No extra action is needed.**

The C extension is built under `ion-python/amazon/ion` and named in the following format, which may differ slightly depending on your platform: `ionc.cpython-$py_version-$platform.$suffix` (e.g., `ionc.cpython-39-darwin.so`).
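
Because the fallback is silent, it can be useful to check whether the compiled module was actually built. The sketch below uses only the standard library. It assumes the extension is importable as `amazon.ion.ionc`, which is inferred from the build location and file name above; the helper name `c_extension_available` is hypothetical and not part of Ion Python.

```python
import importlib.util


def c_extension_available(module_name="amazon.ion.ionc"):
    """Return True if the named module can be located on this system.

    The default module path is an assumption based on where the extension
    is built (amazon/ion) and how it is named (ionc.*).
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `amazon`) is not installed.
        return False


if c_extension_available():
    print("C extension found")
else:
    print("C extension not found; pure-Python simpleion will be used")
```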

#### Getting Started with C Extension:
```
>>> import amazon.ion.simpleion as ion
>>> obj = ion.loads('{abc: 123}')
>>> obj['abc']
123
>>> ion.dumps(obj, binary=True)
b'\xe0\x01\x00\xea\xe9\x81\x83\xd6\x87\xb4\x83abc\xd3\x8a!{'
```


## Development

Architecture of Ion Python C extension:
```
                                   ioncmodule.c
                                        |
                                        |
Ion C -------> Ion C binaries -----> setup.py ------> C extension -------------------> Ion Python simpleion module
      compile                                  setup               import ionc module
```
After setup, the C extension is built and imported into the simpleion module. If `ioncmodule.c` changes, rebuild the C extension by running `python setup.py build_ext --inplace`.


## Technical Details

### 1. Common Binary Encoding Differences between C Extension and Original Ion Python
Note that both binary encodings are **equivalent**; one encoding is not more "correct" than the other.<br/>

#### 1.1 Different ways to represent a struct's length. Refer to [Amazon Ion Binary Encoding](https://amzn.github.io/ion-docs/docs/binary.html#13-struct) for details.<br/>
For Ion struct `{a:2}`:
```text
Text        IVM               ion_symbol_table::{ symbols:["a"] }   { "a": 2 }
Ion C       \xe0\x01\x00\xea  \xe7\x81\x83 \xd4     \x87\xb2\x81a   \xd3     \x8a \x21\x02
Ion Python  \xe0\x01\x00\xea  \xe8\x81\x83 \xde\x84 \x87\xb2\x81a   \xde\x83 \x8a \x21\x02
```
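
Both encodings declare the same struct length. In the Ion binary format, the low nibble `L` of a type descriptor holds the length directly when it is 0 to 13, while `L = 14` means the length follows as a VarUInt. The sketch below illustrates that rule for the struct bytes shown above. The function names are hypothetical and are not part of Ion Python or Ion C.

```python
def read_var_uint(data, pos):
    """Decode a VarUInt starting at `pos`; return (value, next_pos).

    Each octet carries 7 bits; the high bit marks the final octet.
    """
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:
            return value, pos


def struct_length(data):
    """Return the declared content length of a binary Ion struct."""
    type_code, low_nibble = data[0] >> 4, data[0] & 0x0F
    if type_code != 0xD:
        raise ValueError("not a struct type descriptor")
    if low_nibble == 15:
        raise ValueError("null.struct has no length")
    if low_nibble in (1, 14):
        # L=14: length follows as a VarUInt. For structs, L=1 (sorted
        # field names) is also followed by a VarUInt length.
        length, _ = read_var_uint(data, 1)
        return length
    return low_nibble


# The two encodings of {a:2} declare the same length.
print(struct_length(b"\xd3\x8a\x21\x02"))      # Ion C form
print(struct_length(b"\xde\x83\x8a\x21\x02"))  # Ion Python form
```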

#### 1.2 Different order of symbols within a symbol table.<br/>
For symbol `abc` with two annotations `annot1` and `annot2`, `annot1::annot2::abc`:
```text
Ion C text         ion_symbol_table::{ symbols:[ "abc", "annot1", "annot2" ] }          annot1($11)::annot2($12)::abc($10)
Ion C binary       \xee\x99\x81\x83 \xde\x95 \x87\xbe\x92 \x83abc\x86annot1\x86annot2   \xe5\x82 \x8b \x8c \x71\x0a
Ion Python text    ion_symbol_table::{ symbols:[ "annot1", "annot2", "abc" ] }          annot1($10)::annot2($11)::abc($12)
Ion Python binary  \xee\x99\x81\x83 \xde\x95 \x87\xbe\x92 \x86annot1\x86annot2\x83abc   \xe5\x82 \x8a \x8b \x71\x0c
```
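
The symbol IDs differ only because local symbols are numbered in the order they appear in the symbol table. In Ion 1.0 the system symbols occupy `$1` to `$9`, so local symbols start at `$10`. A small sketch of that numbering, using the two orderings from the example above (`assign_sids` is a hypothetical helper, not an Ion Python API):

```python
FIRST_LOCAL_SID = 10  # Ion 1.0 system symbols occupy $1..$9


def assign_sids(symbols):
    """Map each symbol text to its ID, in order of declaration."""
    return {text: FIRST_LOCAL_SID + i for i, text in enumerate(symbols)}


ion_c = assign_sids(["abc", "annot1", "annot2"])
ion_python = assign_sids(["annot1", "annot2", "abc"])

for text in ("annot1", "annot2", "abc"):
    print(f"{text}: Ion C ${ion_c[text]}, Ion Python ${ion_python[text]}")
```

Either table resolves `annot1::annot2::abc` to the same text, which is why the encodings are equivalent.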

### 2. Known Issues

1. Memory leaks have rarely been observed recently, but it is possible that some still exist. Refer to [amzn/ion-python#155](https://github.com/amzn/ion-python/issues/155) for details.
2. The C extension supports a timestamp precision of at most 9. Refer to [amzn/ion-python#160](https://github.com/amzn/ion-python/issues/160) for details.
3. The C extension supports at most 34 decimal digits. Refer to [amzn/ion-python#159](https://github.com/amzn/ion-python/issues/159) for details.
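
Callers that handle high-precision decimals may want to detect values beyond the 34-digit limit before serializing. The sketch below assumes the limit refers to the number of digits in the decimal's coefficient, which the source does not spell out. The helper name `fits_c_extension_decimal` is hypothetical.

```python
from decimal import Decimal

# Limit stated in the known issues above.
MAX_C_EXTENSION_DECIMAL_DIGITS = 34


def fits_c_extension_decimal(value):
    """Return True if the decimal's coefficient has at most 34 digits.

    Assumption: the documented limit applies to coefficient digits.
    """
    return len(value.as_tuple().digits) <= MAX_C_EXTENSION_DECIMAL_DIGITS


print(fits_c_extension_decimal(Decimal("123.456")))
print(fits_c_extension_decimal(Decimal("1" * 35)))
```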


## TODO

1. More bug fixing.
2. More performance improvement.
3. Support more simpleion options such as `imports`, `catalog`, and `omit_version_marker`. (Ion Python currently uses the pure Python implementation to handle unsupported options.)
4. Support pretty print.

## Deploy

### 1. Distribution
PyPI supports two ways of distribution: [Source Code Distribution](https://packaging.python.org/guides/distributing-packages-using-setuptools/#source-distributions) and [Wheel Distribution](https://packaging.python.org/guides/distributing-packages-using-setuptools/#wheels). This version uses source code distribution, which builds Ion C locally and automatically when the package is installed. <br/>

We will add wheel distribution in a future release because of the following benefits:
1. Pre-compiling Ion C library avoids potential build/compile issues and does not require a C compiler to be present on the user's machine.
2. Installation of wheels is faster and more efficient.

2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -3,3 +3,5 @@ recursive-include tests *.py
graft vectors
global-exclude *.pyc
global-exclude .git*
include install.py
include amazon/ion/_ioncmodule.h
6 changes: 3 additions & 3 deletions README.md
@@ -11,7 +11,7 @@ This package is designed to work with **Python 3.6+**

Start with the [simpleion](https://ion-python.readthedocs.io/en/latest/amazon.ion.html#module-amazon.ion.simpleion)
module, which provides four APIs (`dump`, `dumps`, `load`, `loads`) that will be familiar to users of Python's
-built-in JSON parsing module.
+built-in JSON parsing module. Simpleion module's performance is improved by an optional [C extension](https://github.com/amzn/ion-python/blob/master/C_EXTENSION.md).

For example:

@@ -27,8 +27,8 @@ For example:
For additional examples, consult the [cookbook](http://amzn.github.io/ion-docs/guides/cookbook.html).

## Git Setup
-This repository contains a [git submodule](https://git-scm.com/docs/git-submodule)
-called `ion-tests`, which holds test data used by `ion-python`'s unit tests.
+This repository contains two [git submodules](https://git-scm.com/docs/git-submodule).
+`ion-tests` holds test data used by `ion-python`'s unit tests and `ion-c` speeds up `ion-python`'s simpleion module.

The easiest way to clone the `ion-python` repository and initialize its `ion-tests`
submodule is to run the following command.
19 changes: 19 additions & 0 deletions amazon/ion/_ioncmodule.h
@@ -0,0 +1,19 @@
#ifndef _IONCMODULE_H_
#define _IONCMODULE_H_

#include "structmember.h"
#include "decimal128.h"
#include "ion.h"

PyObject* ionc_init_module(void);
iERR ionc_write_value(hWRITER writer, PyObject* obj, PyObject* tuple_as_sexp);
PyObject* ionc_read(PyObject* self, PyObject *args, PyObject *kwds);
iERR ionc_read_all(hREADER hreader, PyObject* container, BOOL in_struct, BOOL emit_bare_values);
iERR ionc_read_value(hREADER hreader, ION_TYPE t, PyObject* container, BOOL in_struct, BOOL emit_bare_values);

iERR _ion_writer_write_symbol_id_helper(ION_WRITER *pwriter, SID value);
iERR _ion_writer_add_annotation_sid_helper(ION_WRITER *pwriter, SID sid);
iERR _ion_writer_write_field_sid_helper(ION_WRITER *pwriter, SID sid);
ION_API_EXPORT void ion_helper_breakpoint(void);

#endif
2 changes: 1 addition & 1 deletion amazon/ion/core.py
@@ -409,7 +409,7 @@ class Timestamp(datetime):
* The ``precision`` field is passed as a keyword argument of the same name.
* The ``fractional_precision`` field is passed as a keyword argument of the same name.
-This field only relates to to the ``microseconds`` field and can be thought of
+This field only relates to the ``microseconds`` field and can be thought of
as the number of decimal digits that are significant. This is an integer that
that is in the closed interval ``[0, 6]``. If ``0``, ``microseconds`` must be
``0`` indicating no precision below seconds. This argument is optional and only valid