# Blazingly fast and simple JSON parser: O(n) time and O(1) space

> This section walks through a clever and efficient parsing algorithm for JSON data. The main idea is to scan the JSON string linearly and maintain a small set of states that mark the beginning, end, and type of each JSON element, without allocating any extra space.

JSON is a time-tested format for exchanging data between independent nodes (for example, a client and a server). A server may publish a protocol describing the keys and values it expects from the client to perform certain tasks. The client sends its request in JSON and receives the result in JSON as well. This decouples the two systems: all they need to share is an understanding of the format they communicate in.




> Note: The usage and implementation of the parser are available [here](https://github.com/zserge/jsmn). This post aims to give a quick sense of how the parser works and why it is so fast. We will also look at how to extend it for the case where the JSON bytes arrive as a stream rather than all at once. It will be interesting to see how much that affects the performance of the algorithm.

> Tip: One of the main overheads for highly performant servers that process JSON requests is parsing the JSON itself. Usually, a parser allocates a DOM object and, after parsing, returns the entire tree so the server can look up keys and get the relevant values. But for servers where request-processing speed is critical, even allocating space for the DOM object on every incoming request can bring down performance. The time spent parsing JSON requests is always overhead, so it is good to have linear-time, constant-space parsers that in *normal* circumstances require no memory allocation at all.

## Example JSON string for running through the algorithm

We will run an example through the algorithm, using the following JSON string:

In [2]:
# num_val and primitive_val are placeholders for a number and a primitive (true/false/null)
JSON_STRN = '{"key1": "str_val1", "key2": {"key21": num_val, "key22": primitive_val}, "key3": ["str_val31", "str_val32"]}'

> Important: JSON supports only a limited number of types: objects, arrays, strings, and primitives. Primitives include numbers, booleans (`true`/`false`), and `null`. To see how a JSON document is built from these types, refer to this [resource](https://www.json.org/json-en.html).
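
Because the type set is so small, the first character of a value is enough to tell which kind of token is starting. The helper below is a minimal sketch of that idea; `classify_token` is a hypothetical name used only for illustration and is not part of jsmn.

```python
def classify_token(first_char: str) -> str:
    """Guess the JSON type of a value from its first character (illustrative only)."""
    if first_char == "{":
        return "object"
    if first_char == "[":
        return "array"
    if first_char == '"':
        return "string"
    # Numbers start with '-' or a digit; true, false and null start with t, f, n.
    if len(first_char) == 1 and first_char in "-0123456789tfn":
        return "primitive"
    raise ValueError(f"not the start of a JSON value: {first_char!r}")


for sample in ['{"a": 1}', "[1, 2]", '"text"', "-3.5", "true", "null"]:
    print(sample, "->", classify_token(sample[0]))
```

This only inspects the first character, so it does not validate the rest of the value.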

## Data structure for parsing

The goal of the parsing algorithm is to fill in an array of tokens as it scans through the JSON request. Each JSON component, be it an object, array, string, or primitive, is considered a token, and the array holds a few pieces of information about each of these tokens and their *sub-tokens*. A picture might help to show what the algorithm intends to do.

![](my_icons/jsmn_parser.jpg)

Each highlighted segment in the JSON string is a token. A token can itself contain sub-tokens (as arrays and objects do).

Each token in the array of tokens holds the following information:

In [9]:
from enum import Enum

class JSON_type(Enum):
    JSON_UNDEFINED = 0
    JSON_OBJECT = 1
    JSON_ARRAY = 2
    JSON_STRING = 3
    JSON_PRIMITIVE = 4


class JSON_token:
    def __init__(self, start: int = -1, stop: int = -1, size: int = -1, parent: int = -1,
                 tok_type: JSON_type = JSON_type.JSON_UNDEFINED):
        self.start = start        # position in the JSON string where this token starts
        self.stop = stop          # position in the JSON string where this token ends
        self.size = size          # number of sub-tokens within this token
        self.parent = parent      # if this is a sub-token, the index of its parent token
        self.tok_type = tok_type  # type of the token


# allocate the array of tokens once, up front
max_possible_tokens = 128
token_list = [JSON_token() for _ in range(max_possible_tokens)]
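
To make the fields concrete, here is a hand-filled token array for a tiny document. This is a sketch under assumptions: `Tok` is a hypothetical stand-in for `JSON_token`, `stop` is treated as an exclusive offset, string tokens exclude their quotes, and a key counts its value as its single sub-token. The point is that a token never stores the value itself, only offsets into the original string.

```python
from dataclasses import dataclass


@dataclass
class Tok:  # hypothetical stand-in for JSON_token
    start: int   # offset of the first character of the token
    stop: int    # offset one past the last character (assumed exclusive)
    size: int    # number of direct sub-tokens
    parent: int  # index of the parent token, -1 for the root
    kind: str


doc = '{"a": [1, 2]}'
tokens = [
    Tok(0, 13, 1, -1, "object"),     # the whole document, one key
    Tok(2, 3, 1, 0, "string"),       # key a, whose sub-token is its value
    Tok(6, 12, 2, 1, "array"),       # [1, 2]
    Tok(7, 8, 0, 2, "primitive"),    # 1
    Tok(10, 11, 0, 2, "primitive"),  # 2
]


def token_text(tok: Tok) -> str:
    """Recover a token's text by slicing the original document."""
    return doc[tok.start:tok.stop]


def children(index: int) -> list:
    """Indices of the tokens whose parent is the token at `index`."""
    return [i for i, tok in enumerate(tokens) if tok.parent == index]


for i, tok in enumerate(tokens):
    print(i, tok.kind, repr(token_text(tok)), "children:", children(i))
```

Python slicing copies the substring, so this is only a model; in a C parser the same lookup would be a pointer plus a length, with no copy.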

## Parsing Algorithm

Here we traverse the JSON string and, based on each character, update the list of token objects. The actual parser is written robustly and takes care of all the corner cases. Here I only identify the main cases and add comments describing what each one should handle.

In [11]:
from typing import List

def parser(tokens_list: List[JSON_token], json: str):
    for idx, element in enumerate(json):
        if element in ('{', '['):
            pass
            # get the next available token slot from tokens_list; if none is available, allocate more
            # mark the token slot with the appropriate type (object or array)
            # mark start as idx
            # set the parent variable to indicate this is the parent token for the upcoming tokens
        elif element in ('}', ']'):
            pass
            # go back in the tokens array and search for the parent token for this closure
            # it can be identified by start != -1 and stop == -1
            # fill in the stop value for this token, marking stop as idx
            # reset the parent variable appropriately
        elif element == ':':
            pass
            # this indicates a value is coming up next, so mark the previous token slot
            # as the parent of the upcoming one
        elif element == ',':
            pass
            # this indicates the end of a key:value pair; we move on to the next pair
            # reset the parent variable from the preceding key back to the enclosing token
        elif element == '"':
            pass
            # this indicates a string, so traverse to the closing quote and fill in the values
        else:
            pass
            # this indicates a primitive type, so traverse to the end of it and fill in the values
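
The cases above can be filled in to get something runnable. What follows is a simplified sketch written for this post, not the jsmn implementation: `Token` and `tokenize` are hypothetical names, `stop` is assumed to be exclusive, the token list is a growing Python list rather than a preallocated array, and error handling is minimal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:  # hypothetical, loosely mirrors JSON_token
    kind: str
    start: int
    stop: int = -1    # exclusive end; -1 while the token is still open
    size: int = 0     # number of direct sub-tokens
    parent: int = -1  # index of the parent token, -1 for the root


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    parent = -1  # index of the token that new tokens attach to
    i, n = 0, len(text)

    def add(kind: str, start: int, stop: int = -1) -> int:
        tokens.append(Token(kind, start, stop, 0, parent))
        if parent != -1:
            tokens[parent].size += 1
        return len(tokens) - 1

    while i < n:
        ch = text[i]
        if ch in "{[":
            parent = add("object" if ch == "{" else "array", i)
        elif ch in "}]":
            # Walk up the parent links to the innermost container still open.
            j = parent
            while j != -1 and not (tokens[j].kind in ("object", "array") and tokens[j].stop == -1):
                j = tokens[j].parent
            if j == -1 or tokens[j].kind != ("object" if ch == "}" else "array"):
                raise ValueError(f"unexpected {ch!r} at offset {i}")
            tokens[j].stop = i + 1
            parent = tokens[j].parent
        elif ch == ":":
            parent = len(tokens) - 1  # the key just read becomes the parent of its value
        elif ch == ",":
            # After a value, step back from the key to the enclosing container.
            if parent != -1 and tokens[parent].kind not in ("object", "array"):
                parent = tokens[parent].parent
        elif ch == '"':
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == "\\" else 1  # skip escaped characters
            if j >= n:
                raise ValueError("unterminated string")
            add("string", i + 1, j)  # offsets exclude the quotes
            i = j
        elif ch not in " \t\r\n":
            j = i
            while j < n and text[j] not in " \t\r\n,]}:":
                j += 1
            add("primitive", i, j)
            i = j - 1
        i += 1
    return tokens


doc = '{"a": [1, 2]}'
for tok in tokenize(doc):
    print(tok.kind, repr(doc[tok.start:tok.stop]), "size:", tok.size, "parent:", tok.parent)
```

The scan touches each character once, which is where the linear running time comes from. Closing a container walks up parent links instead of searching the whole array, a design choice of this sketch that relies on every token recording its parent.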
            
            

> Warning: One obvious limitation comes to mind: we have to know the number of tokens upfront for the JSON requests, and this may not always be possible to guess. That said, many services cap the size of a JSON request, and that cap can be used to heuristically choose the size of the token array.
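
One way to turn a request-size cap into a token-array size is a simple counting argument. Assuming strict JSON, every token occupies at least one character and sibling tokens are separated by a comma or a colon, so a document with T tokens needs at least 2T - 1 characters. The sketch below uses that bound; `max_tokens_for` is a hypothetical helper, and the 4096-byte cap is an arbitrary example value.

```python
def max_tokens_for(max_request_bytes: int) -> int:
    """Upper bound on the token count of a strict JSON text of the given length.

    Assumes T tokens need at least 2T - 1 characters, so T <= (n + 1) // 2.
    """
    return (max_request_bytes + 1) // 2


# Example: "[1,1]" is 5 characters and holds 3 tokens (the array and two numbers).
print(max_tokens_for(len("[1,1]")))

# Sizing the token array for a service that caps requests at 4096 bytes (assumed cap).
max_possible_tokens = max_tokens_for(4096)
print(max_possible_tokens)
```

This is a worst-case bound, so typical requests will use far fewer slots than it reserves.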

In any case, to deal with this limitation, and also to handle requests that arrive as a stream (that is, when not all the bytes of the JSON request are available at once), we can modify the algorithm above to cover those cases as well, at the cost of increased time complexity.

## Stream Parsing

Imagine a scenario where the client and the server talk to each other over TCP, and the server has no control over or knowledge of the buffer size at the client's end. Depending on the difference in buffer sizes, the full JSON request may not arrive at the server in one piece. So we need a mechanism to hold on to the relevant tokens whose values may still get filled in when a later batch of bytes arrives. This also deals with the issue of having to know the maximum number of tokens upfront. Again, we can first picture it visually and then look at how to do it in code.
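
As a rough illustration of holding on to open tokens between batches, the sketch below carries scanner state across calls. It is a deliberate simplification and not the jsmn approach: `StreamScanner` is a hypothetical class that records only object and array spans, tracks strings just well enough to ignore brackets inside them, and grows its lists as needed.

```python
from typing import List


class StreamScanner:
    """Tracks object/array spans across chunks (hypothetical sketch)."""

    def __init__(self) -> None:
        self.offset = 0                   # absolute position of the next character
        self.spans: List[List[int]] = []  # [start, stop] per container; stop is -1 while open
        self.open_stack: List[int] = []   # indices into spans of containers still open
        self.in_string = False
        self.escaped = False

    def feed(self, chunk: str) -> bool:
        """Consume one chunk; return True once every container seen so far has closed."""
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in "{[":
                self.open_stack.append(len(self.spans))
                self.spans.append([self.offset, -1])
            elif ch in "}]":
                if not self.open_stack:
                    raise ValueError(f"unbalanced {ch!r} at offset {self.offset}")
                # Fill in the end of a token that may have been opened in an earlier chunk.
                self.spans[self.open_stack.pop()][1] = self.offset + 1
            self.offset += 1
        return bool(self.spans) and not self.open_stack


scanner = StreamScanner()
chunks = ['{"a": [1, ', '2], "b": "x}', '"}']  # one document split at arbitrary points
results = [scanner.feed(chunk) for chunk in chunks]
print(results)
print(scanner.spans)
```

The state that survives between calls (the open stack, the string flags, and the absolute offset) is what lets a token opened in one chunk be completed in a later one. Note that the stack grows with nesting depth, so this sketch is no longer constant-space.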