Refactor codebase to use independent module to parse incoming HTTP requests #618

Open · wants to merge 3 commits into master
Conversation

vkuznet (Contributor) commented Jan 16, 2020

This PR addresses the large memory footprint of the DBS server; see the full discussion in #599.

The code is refactored as follows:

  • I replaced a common pattern used in DBSWriterModel.py (a sketch of such a parseFileObject dispatcher is given after this list), see
-            body = request.body.read()
-            indata = cjson.decode(body)
+            indata = parseFileObject(request.body, method='cjson')
  • I provided a new parsers.py module which implements different (de)serialization methods, e.g. cjson, json, json_stream, yaml
  • the new module makes it possible to write custom serialization of input data streams, e.g. we could write a custom C module to optimize serialization of incoming data
  • I provided a test example to exercise the different formats, e.g.
# use cjson format, and run tests 3 times
Server/Python/src/dbs/utils/parsers.py --fin=blocks.json --format=cjson --times=3
# use json format, and run tests 3 times
Server/Python/src/dbs/utils/parsers.py --fin=blocks.json --format=json --times=3
# use json_stream format, and run tests 3 times
Server/Python/src/dbs/utils/parsers.py --fin=blocks.json_stream --format=json_stream --times=3
# use yaml format, and run tests 3 times
Server/Python/src/dbs/utils/parsers.py --fin=blocks.yaml --format=yaml --times=3
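
For reference, here is a minimal sketch of what such a parseFileObject dispatcher could look like; this is an illustration only, the actual parsers.py in this PR may organize the backends differently:

import json

def parseFileObject(fobj, method='json'):
    # deserialize an incoming request body (a file-like object)
    # with the chosen backend; sketch only, not the PR's actual code
    if method == 'cjson':
        import cjson                    # legacy C parser, if installed
        return cjson.decode(fobj.read())
    if method == 'json':
        return json.load(fobj)          # parse directly from the file object
    if method == 'json_stream':
        # one JSON token per line; join tokens back and decode
        # (a fully streaming decoder would consume tokens incrementally)
        return json.loads(''.join(line.strip() for line in fobj))
    if method == 'yaml':
        import yaml
        return yaml.safe_load(fobj)
    raise NotImplementedError("unsupported method: %s" % method)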

Due to the dynamic nature of Python memory allocation it is hard to evaluate the impact of a particular format on a long-running DBS server, but this PR makes it easy to switch between formats and test them. To do that, the clients interacting with the DBS server will need to send data in the proper format, e.g. json_stream, so that we can measure the memory footprint of the DBS server in that case.

The provided convert2json_stream function converts either a given JSON (dict) object or a file object containing a JSON data stream, e.g.

# example how to convert json to json_stream
from dbs.utils.parsers import convert2json_stream
import json
data={"data":1, "foo":[1,2,3]}
convert2json_stream(data)
# this will produce the following output
{
"foo"
:
[1
, 2
, 3
]
,
"data"
:
1
}
# to write this output to a file, pass an output file object
obj = open('YOUR_FILE_NAME', 'w')
convert2json_stream(data, obj)

# similarly, if you have a file object containing a JSON stream, you may pass it directly
fobj = open('YOUR_FILE.json')
convert2json_stream(fobj)
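
To see where the memory savings of the json_stream layout come from: a consumer can process one token per line without ever holding the full serialized document in memory. A minimal sketch, assuming the one-token-per-line layout shown above; iter_json_tokens is a hypothetical helper, not part of parsers.py:

def iter_json_tokens(fobj):
    # each line of a json_stream file carries one JSON token,
    # so the read buffer stays bounded by the longest token
    for line in fobj:
        token = line.strip()
        if token:
            yield token

# example: scan a large json_stream file without loading it whole
with open('blocks.json_stream') as fobj:
    ntokens = sum(1 for _ in iter_json_tokens(fobj))
print("number of tokens:", ntokens)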

# similarly, the provided convert2yaml function converts given JSON to YAML
from dbs.utils.parsers import convert2yaml
data = {"data": 1, "foo": [1, 2, 3]}
print(convert2yaml(data))
data: 1
foo:
- 1
- 2
- 3

With this module we can perform various tests on the DBS server using different input data formats.
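
As an illustration of the kind of measurement this enables, here is a minimal sketch (not part of the PR) that compares the peak memory of a one-shot json.load against a line-by-line scan of a json_stream file, using the standard tracemalloc module; the file names are placeholders:

import json
import tracemalloc

def peak_kb(func, *args):
    # run func and report the peak memory allocated during the call, in KB
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1024.0

def load_whole(path):
    with open(path) as fobj:
        json.load(fobj)              # materializes the full document at once

def scan_stream(path):
    with open(path) as fobj:
        for line in fobj:            # one token per line, bounded buffer
            line.strip()

print("json.load peak KB:   %.1f" % peak_kb(load_whole, 'blocks.json'))
print("stream scan peak KB: %.1f" % peak_kb(scan_stream, 'blocks.json_stream'))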

yuyiguo (Member) commented Jan 16, 2020

Valentin,

I will look into this after the DBS partitioning work is done. It may take a few months.
