# Faster JSON parsing

For most purposes, the standard library's `json` module is all you need.  However, sometimes, you need to process a *lot* of JSON data, and you need to do it fast.  There are three libraries you should look at.  These are all third-party libraries, and need to be installed with `pip` or `conda`.

First, and easiest: `ujson`.  `ujson` is a slimmed-down alternative to `json` that works as a drop-in replacement for probably 95% of all use cases.  It's got `ujson.loads()` and `ujson.dumps()`, which work like their `json` counterparts for everyday calls, but are just faster: usually between 2x and 10x faster, depending on how big the JSONs are and how complex they are (i.e., how deeply nested: JSONs inside JSONs, arrays inside arrays, JSONs in arrays of JSONs containing arrays, etc.).
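
As a minimal sketch of the drop-in idea, the snippet below imports `ujson` under the name `json` and falls back to the standard library if `ujson` isn't installed.  The `record` dictionary is made up for illustration; the point is only that the calling code doesn't change.

```python
# Prefer ujson when it is available; otherwise use the standard library.
# Either way, the rest of the code only ever refers to `json`.
try:
    import ujson as json  # third-party, installed with pip or conda
except ImportError:
    import json  # standard-library fallback

# Hypothetical example data, not from the benchmarks in this notebook.
record = {"id": 1, "tags": ["a", "b"], "nested": {"ok": True}}

text = json.dumps(record)        # serialize to a JSON string
round_tripped = json.loads(text)  # parse it back into a dictionary

print(round_tripped == record)
```

This only covers plain `loads()`/`dumps()` calls; code that relies on `json`-specific options such as `object_hook` or `cls` may need to keep using the standard library.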

I'll demonstrate this using the `timeit` module in the standard library, which will run a little snippet of code a bunch of times and tell us how long it takes.

In [50]:
from timeit import timeit

import json
import ujson
# If you're converting code that already uses the built-in `json` module,
# you can use this next line, and remove the `import json` from your code.
# import ujson as json

print(
    "JSON library, loading json_data 1,000,000 times: ",
    timeit("json.loads(json_data)", globals=globals(), number=1_000_000)
)

print(
    "UJSON library, loading json_data 1,000,000 times: ",
    timeit("ujson.loads(json_data)", globals=globals(), number=1_000_000)
)

# Load that same JSON as a Python dictionary, then test how long it takes
# to serialize it back to a JSON string.
json_dict = json.loads(json_data)

print(
    "JSON library, serializing json_data 1,000,000 times: ",
    timeit("json.dumps(json_dict)", globals=globals(), number=1_000_000)
)

print(
    "UJSON library, serializing json_data 1,000,000 times: ",
    timeit("ujson.dumps(json_dict)", globals=globals(), number=1_000_000)
)

JSON library, loading json_data 1,000,000 times:  3.5366121000006387
UJSON library, loading json_data 1,000,000 times:  1.635926400000244
JSON library, serializing json_data 1,000,000 times:  4.752410699999018
UJSON library, serializing json_data 1,000,000 times:  1.9669594999995752


Sometimes, you don't need the loading to be all that fast, but you need the serialization to be as fast as possible.  This is where `orjson` comes in.  Like `ujson`, it can work as a faster, nearly drop-in replacement for `json`.  Its `loads()` speed is about on par with `ujson`.  But it is absurdly fast at serializing a Python dictionary into JSON; this is because it serializes directly to a byte string (`bytes`), which is way faster than serializing to a Python text string.  It does require you to be aware that you're holding a byte/binary string, though: you'll need to open files in `"wb"` mode to save it, for example.
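
Here is a small sketch of what handling that byte string looks like in practice.  The `to_json_bytes()` helper, the `payload` dictionary, and the temporary file path are all made up for this example, and the fallback branch (used only when `orjson` isn't installed) simply encodes the standard library's output so the rest of the code behaves the same way.

```python
import json
import os
import tempfile

try:
    import orjson  # third-party, installed with pip or conda

    def to_json_bytes(obj):
        # orjson.dumps() returns bytes, not str.
        return orjson.dumps(obj)
except ImportError:
    def to_json_bytes(obj):
        # Fallback for this sketch only: encode the stdlib output to bytes.
        return json.dumps(obj).encode("utf-8")

# Hypothetical example data.
payload = {"user": "example", "scores": [1, 2, 3]}
data = to_json_bytes(payload)

# Bytes must be written to a file opened in binary ("wb") mode.
path = os.path.join(tempfile.mkdtemp(), "payload.json")
with open(path, "wb") as f:
    f.write(data)

# If a text string is needed, decode explicitly.
text = data.decode("utf-8")
print(type(data).__name__, type(text).__name__)
```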

In [51]:
import orjson

print(
    "ORJSON library, loading json_data 1,000,000 times: ",
    timeit("orjson.loads(json_data)", globals=globals(), number=1_000_000)
)

print(
    "ORJSON library, serializing json_data 1,000,000 times: ",
    timeit("orjson.dumps(json_dict)", globals=globals(), number=1_000_000)
)

ORJSON library, loading json_data 1,000,000 times:  1.584552000000258
ORJSON library, serializing json_data 1,000,000 times:  0.586960700000418


Lastly, there's `simdjson`, for when you need the parsing/loading step to be as fast as possible.  `simdjson` can be used as another drop-in replacement for `json` that will generally be faster, but if you want to get the most speed possible, you need to use its native/non-drop-in-replacement API.

Note: `simdjson` gets installed as `pysimdjson` when you're using `pip` or `conda`, but it gets imported as `simdjson`.

*Digression: why `simdjson` is so fast.*

The vast, vast majority of the time spent parsing JSON data--when you're not using `json`--is not spent parsing the data itself.  That step is actually pretty fast.  The slow part is *constructing the Python dictionary.*  Every time you call a JSON library's `loads()` function, it converts the entire JSON into a Python dictionary.  This can be a problem when the JSON is very large, and you only need one or two things out of it (*especially* if those things are nested a few layers deep).  This might be the case if, for example, you have a huge amount of JSON data and you need to filter it down; maybe you're filtering tweet data (which comes in JSON format from the Twitter API) and only want to keep tweets that have geolocation tags.

`simdjson` solves this problem by not constructing the Python dictionary.  It's actually not a Python library at all--it's a library for the C++ programming language, and it uses a *lot* of extremely cool (and sophisticated) tricks to parse JSON data at absurd speeds.  The `simdjson` library in Python is basically just a "translation" layer between Python and the C++ code.

The Python library manages to get a lot of its speed by letting the much, much faster C++ code handle all the parsing and loading.  By default, *nothing* is handed over to Python until you specifically ask for it.  This is because parsing and loading the data in C++ is *way* faster than in Python, so `simdjson` wants to avoid converting things into Python objects unless it *absolutely* has to.

The end result is that when you ask `simdjson` to parse some JSON data, it does--more or less--the following:

1. Parse the JSON in C++, using every trick in the book to make it fast.
2. Store the result as a C++ data structure.  Provide you a Python interface that behaves--on the surface--like a dictionary.
3. Don't convert anything from C++ to Python until you ask for it (i.e., until you index the dictionary-like thing that you get back).

Or, in other words: Python, as a language, is quite slow.  C++, as a language, is quite fast.  `simdjson` makes sure C++ does as much of the work as possible, and that Python does as little as possible.

In [52]:
# simdjson's drop-in API is not actually all that fast.
import simdjson

print(
    "SIMDJSON library, loading json_data 1,000,000 times: ",
    timeit("simdjson.loads(json_data)", globals=globals(), number=1_000_000)
)

print(
    "SIMDJSON library, serializing json_data 1,000,000 times: ",
    timeit("simdjson.dumps(json_dict)", globals=globals(), number=1_000_000)
)

SIMDJSON library, loading json_data 1,000,000 times:  2.2729243000012502
SIMDJSON library, serializing json_data 1,000,000 times:  4.769098700000541


To show off the native parsing API, we have to set up the `timeit` call a little differently.  We want to create the parser once, then explicitly loop over our JSON object and keep re-parsing it with `simdjson`.  `timeit.timeit()` can take a multi-line statement string, but it's simpler to put the loop in a function and time a single call to that function.

Note: we have to completely delete all the parsed data *before we can parse another JSON.*  This has to do with some of the C++ optimization; it's related to re-using the same chunk of memory, rather than constantly getting a new chunk to store each parsed JSON in.  The details aren't super important, but you'll get weird error messages if you don't do this.

In [53]:
# simdjson's native parsing API
def simdjson_demo():
    parser = simdjson.Parser()
    for i in range(1_000_000):
        parsed = parser.parse(json_data)
        del parsed
    return

print(
    "SIMDJSON library, loading json_data 1,000,000 times with the native parsing API: ",
    timeit("simdjson_demo()", globals=globals(), number=1)
)

SIMDJSON library, loading json_data 1,000,000 times with the native parsing API:  0.5380136999992828


The speed difference is even bigger for larger and more complex JSONs--I've seen the native simdjson API speed up some of my code by nearly 35x when I was working with really, really big JSONs that I only needed a few fields out of.
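
To make the "only a few fields" case concrete, here is a rough sketch of a filter that keeps only records with a non-null location field.  The `has_geo()` helper, the `"geo"` field name, and the sample records are all made up for illustration.  It assumes `pysimdjson`'s `Parser.parse()` returns a dictionary-like object that raises `KeyError` for a missing key; if `simdjson` isn't installed, the sketch falls back to `json.loads()` so it still runs (just without the speed benefit).

```python
import json

try:
    import simdjson  # installed as `pysimdjson`

    _parser = simdjson.Parser()  # one parser, reused for every record

    def parse(raw):
        return _parser.parse(raw)
except ImportError:
    def parse(raw):
        # Fallback for this sketch only: builds the whole dictionary.
        return json.loads(raw)


def has_geo(raw):
    """Return True if the record has a non-null "geo" field."""
    doc = parse(raw)
    try:
        # Only this one field is pulled out; reduce it to a plain bool.
        found = doc["geo"] is not None
    except KeyError:
        found = False
    # Drop the parsed document before the parser is used again.
    del doc
    return found


# Hypothetical records, one JSON document per entry.
lines = [
    b'{"id": 1, "text": "hello", "geo": {"lat": 1.5, "lon": 2.5}}',
    b'{"id": 2, "text": "no location", "geo": null}',
    b'{"id": 3, "text": "field missing"}',
]

kept = [line for line in lines if has_geo(line)]
print(len(kept))
```

Reducing the result to a plain Python `bool` before the next call matters here: nothing returned from `has_geo()` still refers to the parser's memory, so the same parser can safely be reused for the next record.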

So: what JSON libraries should you use, and when?  Here's a good rule of thumb:
1. Start with the standard library's `json`.  It's very robust and very flexible.
2. Use `ujson` if you don't need any of the extra bells and whistles of `json`, or need faster parsing/serialization with minimal code changes.
3. Use `orjson` if fast serialization is an absolute must-have.
4. Use `pysimdjson`'s native API if ultra-fast parsing is an absolute must-have; you'll need to do some code re-writing, though, to use the native API.

I will often stick with `json` for simplicity's sake in my projects, but I have definitely had projects where I used `pysimdjson` for parsing, `json` for debugging when `pysimdjson` ran into problems, and `orjson` for serialization.  So don't be afraid to mix and match.