This repository has been archived by the owner on May 28, 2021. It is now read-only.

Optimise downloads #423

Closed
DMWMBot opened this issue Sep 28, 2010 · 25 comments

@DMWMBot

DMWMBot commented Sep 28, 2010

Investigate ways of optimising the transfers of large chunks of JSON, eg from a query "dataset", whether by socket configuration or streaming the decoding.

@vkuznet
Contributor

vkuznet commented Sep 28, 2010

valya: I'm attaching test code. It takes almost 300 sec to get all datasets from DBS2, which seems quite a lot to me. I also found the following discussions about the slowness of HTTP requests on Linux:

http://code.google.com/p/httplib2/issues/detail?id=28
http://code.google.com/p/httplib2/issues/detail?id=91

@drsm79

drsm79 commented Sep 28, 2010

metson: See also #350

@vkuznet
Contributor

vkuznet commented Jan 6, 2011

valya: I wrote a first implementation of jsonstreamer as a decorator. The code shown below demonstrates its usage within a CherryPy application. I think the code belongs in WMCore.WebTools and can be adopted by others in their applications. I see DBS and PhEDEx as the primary consumers of jsonstreamer.
It would be nice to generate large JSON doc(s) to test the proposed solution.

{{{
#!/usr/bin/env python

import json
import cherrypy

from types import GeneratorType
from json import JSONEncoder
from cherrypy import expose

def test_records(num):
    for idx in range(0, num):
        rec = dict(id=idx, test=idx, foo=idx)
        yield rec

def jsonstreamer(func):
    """JSON streamer decorator"""
    def wrapper(self, *_args, **_kwds):
        """Decorator wrapper"""
        data = func(self, *_args, **_kwds)
        cherrypy.response.headers['Content-Type'] = "application/json"
        if isinstance(data, dict):
            for chunk in JSONEncoder().iterencode(data):
                yield chunk
        elif isinstance(data, list) or isinstance(data, GeneratorType):
            for rec in data:
                for chunk in JSONEncoder().iterencode(rec):
                    yield chunk
    return wrapper

class Main(object):
    """Class which demonstrates usage of jsonstreamer"""
    def __init__(self):
        self.data = {'foo': 1}

    @expose
    def index(self):
        """Home page"""
        return "Home page"

    @expose
    @jsonstreamer
    def records(self):
        """Method which generates data and returns them via jsonstreamer"""
        data = test_records(10)
        return data

if __name__ == '__main__':
    cherrypy.config.update({'server.socket_port': 8888,
                            'server.thread_pool': 20,
                            'environment': 'production',
                            'log.screen': True,
                           })
    cherrypy.quickstart(Main(), '/')
}}}

@drsm79

drsm79 commented Jan 6, 2011

metson: Replying to [comment:4 valya]:

I wrote first implementation of jsonstreamer as a decorator. The code shown below demonstrates its usage within CherryPy application. I think code should belong to WMCore.WebTools and can be adopted by others in their application.

Yup. Please provide a patch for that when you're ready. We also need a client that can do sane things with streamed json...

I see DBS, Phedex as primary consumers of jsonstreamer.

Indeed, that's why #350 was assigned to DBS.

It would be nice to generate a large JSON doc(s) to test the proposed solution.

Just have something that sends junk data for some long period of time. Bonus points if the duration is set by HTTP query string.
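
A minimal sketch of such a junk-data endpoint (not from this thread; the class and method names, record shape, and default duration are made up for illustration, with the duration taken from the HTTP query string as suggested):

{{{
#!/usr/bin/env python
import time
import json
import cherrypy

class Junk(object):
    """Hypothetical CherryPy handler that streams junk JSON records."""
    @cherrypy.expose
    def junk(self, duration=60):
        """Stream one junk record per line for `duration` seconds,
        e.g. http://localhost:8888/junk?duration=300"""
        cherrypy.response.headers['Content-Type'] = "application/json"
        stop = time.time() + float(duration)
        def stream():
            idx = 0
            while time.time() < stop:
                yield json.dumps({'id': idx, 'junk': 'x' * 1024}) + '\n'
                idx += 1
        return stream()
    junk._cp_config = {'response.stream': True}

if __name__ == '__main__':
    cherrypy.quickstart(Junk(), '/')
}}}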

@DMWMBot
Author

DMWMBot commented Jan 6, 2011

gfball: Try PhEDEx for a big sample JSON document, eg https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas?node=T1_US_FNAL_MSS , which is 50+MB when I just fetched it.

I have had a look at the possibility of streaming the decoding, but it doesn't look that promising at the moment. It is not supported by the default python JSON library.

It would in theory be possible (read chunks from the incoming stream, have a JSON decoder that decodes until the current end of stream, appending to strings, lists and dictionaries as their sub-elements become available), but there are significant obstacles.

  1. the result is always a single object, so we still get bottlenecked waiting for our decoder to return that object before it can be transformed or injected into mongo.
  2. we don't know validity until we reach EOF, so chunking the decoding risks starting to process input that turns out to be invalid.
  3. we could in theory iteratively decode straight to BSON, because it doesn't require, e.g., foreknowledge of how long an array will be, but that doesn't help much since we want the data to be transformed by AbstractService before injection.
  4. there is little in the existing json codebase we can re-use. json.scanner.Scanner and json.decoder.JSON{Number,String,Constant,Object,Array} would have to be extensively re-written or implemented again.
  5. we lose the speedup of cjson, and iterative code (although it lets us use otherwise dead time spent in IO wait on a socket) is probably slower than the equivalent existing pure-python json code.
  6. we have to deal with lots of messy edge cases like having received half of a unicode char.

On the other hand, it would probably be something worth contributing back to the python standard library if we were to write it, but I think it's fairly clearly not worth pursuing this side for the moment.

jsonstreamer looks good, but one query - is it necessary to loop over a list or generator yourself? Won't iterencode do that for you, avoiding the overhead of an extra loop?

@DMWMBot
Author

DMWMBot commented Jan 6, 2011

gfball: Alternatively, to generate random JSON the random_X functions in WMCore/HTTPFrontEnd/PlotFairy/Examples.py could be adapted with infinite instead of finite loops (random_baobab produces nested dictionaries).

Ignore the comment about why you add the extra loop; I see json rejects iterating over a generator.

The cherrypy docs (cherrypy.org/wiki/ReturnVsYield) indicate you have to set "response.stream" to True or it buffers the yield statements for you.
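
A quick aside illustrating that rejection: iterencode serializes a single JSON-encodable object, and a generator is not one, so consuming the chunks raises TypeError.

{{{
from json import JSONEncoder

def test_records(num):
    for idx in range(0, num):
        yield dict(id=idx, test=idx, foo=idx)

try:
    # a generator is not a JSON-serializable object, so this fails
    for chunk in JSONEncoder().iterencode(test_records(10)):
        print chunk
except TypeError, exc:
    print "rejected:", exc
}}}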

@vkuznet
Contributor

vkuznet commented Jan 6, 2011

valya: Here is the updated version of the jsonstreamer. The trick to make the client work is to yield an extra newline character after each encoded record. This allows a client to read record chunks line by line. It is based on how file-object methods are designed, see http://docs.python.org/release/2.5.2/lib/bltin-file-objects.html

If you put a print statement for the chunk, you'll see that the JSON record
{{{
{"test":1}
}}}
will be chunked as

{{{
chunk {
chunk "test"
chunk :
chunk 1
chunk }
}}}

Of course this approach has one flaw: we need to ensure that an encoded JSON record doesn't contain a newline character (I'll test it later). Meanwhile, here is the server code (I've set response.stream to True inside the decorator function, and added an extra yield for the newline character):
{{{
#!/usr/bin/env python

import json
import cherrypy

from types import GeneratorType
from json import JSONEncoder
from cherrypy import expose

def test_records(num):
    for idx in range(0, num):
        rec = dict(id=idx, test=idx, foo=idx)
        yield rec

def jsonstreamer(func):
    """JSON streamer decorator"""
    def wrapper(self, *_args, **_kwds):
        """Decorator wrapper"""
        cherrypy.response.headers['Content-Type'] = "application/json"
        func._cp_config = {'response.stream': True}
        data = func(self, *_args, **_kwds)
        if isinstance(data, dict):
            for chunk in JSONEncoder().iterencode(data):
                yield chunk
            yield '\n'
        elif isinstance(data, list) or isinstance(data, GeneratorType):
            for rec in data:
                for chunk in JSONEncoder().iterencode(rec):
                    yield chunk
                yield '\n'
    return wrapper

class Main(object):
    """Class which demonstrates usage of jsonstreamer"""
    def __init__(self):
        self.data = {'foo': 1}

    @expose
    def index(self):
        """Home page"""
        return "Home page"

    @expose
    @jsonstreamer
    def records(self):
        """Method which generates data and returns them via jsonstreamer"""
        data = test_records(10)
        return data

if __name__ == '__main__':
    cherrypy.config.update({'server.socket_port': 8888,
                            'server.thread_pool': 20,
                            'environment': 'production',
                            'log.screen': True,
                           })
    cherrypy.quickstart(Main(), '/')
}}}

Here is a simple client which makes an HTTP call to the server and reads the file-like object line by line.

{{{
#!/usr/bin/env python
import urllib
import urllib2
import json
import types

def getdata(url, params, headers=None):
    """
    Invoke URL call and retrieve data from data-service based
    on provided URL and set of parameters. All data will be parsed
    by data-service parsers to provide uniform JSON representation
    for further processing.
    """
    host = url.replace('http://', '').split('/')[0]

    input_params = params
    encoded_data = urllib.urlencode(params, doseq=True)
    if  encoded_data:
        url = url + '?' + encoded_data
    if  not headers:
        headers = {}
    req = urllib2.Request(url)
    for key, val in headers.items():
        req.add_header(key, val)
    try:
        data = urllib2.urlopen(req)
    except urllib2.HTTPError, httperror:
        msg  = 'HTTPError, url=%s, args=%s, headers=%s' \
                    % (url, params, headers)
        data = {'error': msg}
        try:
            err  = httperror.read()
            data.update({'httperror': extract_http_error(err)})
        except:
            data.update({'httperror': None})
            pass
    except:
        msg  = 'HTTPError, url=%s, args=%s, headers=%s' \
                    % (url, params, headers)
        data = {'error': msg,
                'reason': 'Unable to invoke HTTP call to data-service'}
    return data

def jsonstream_decoder(fp_object):
    """Decode data coming out from fp-object, e.g. streamer"""
    for line in fp_object:
        yield json.loads(line)

def client():
    url = 'http://localhost:8888/records/'
    params = {}
    headers = {'Accept': 'application/json'}
    data = getdata(url, params, headers)
    print "fp_object data:", data, type(data)
    gen = jsonstream_decoder(data)
    for rec in gen:
        print rec

if __name__ == '__main__':
    client()
}}}

@vkuznet
Contributor

vkuznet commented Jan 6, 2011

valya: This approach will certainly be beneficial for services which stream their data, like DBS3, since they create a list of JSON records and send it back to the client. PhEDEx is another beast: it constructs one monolithic JSON record (one deeply nested dict) and sends it over. I think we need to revisit the PhEDEx server to avoid such a CPU/memory/network-consuming approach and make it stream its records.

@vkuznet
Contributor

vkuznet commented Jan 7, 2011

valya: Hi,
I performed a first series of tests. I used the blockreplicas PhEDEx API and got a 54MB JSON file. In order to load it using the json module, the code allocates 800MB of RAM. This is a real problem with the way PhEDEx packs its data. Everything is stored in one single dict, and the JSON parser needs to load every data structure into memory to build the final JSON object. The RAM allocation is a known issue which I reported upstream, see http://bugs.python.org/issue6594

So I used DBS3 output for datasets and made 420K records, which yields a total file size of 60MB. But unlike PhEDEx, all records are stored in a list, e.g.

{{{
[
{"record":1},
{"record":2},
]
}}}

so it allows the parser to walk record by record and yield each record to the stream. I used my code and found that this time the client code only consumes a steady 11MB of RAM across all records, and the server using jsonstreamer used about 150MB of RAM to stream all records. Both processes are CPU bound. I measured the total time the client spends getting and parsing the data. The HTTP calls to get the data on my localhost took 26 sec, while parsing consumed 100 sec for 420K records, which gives 4000 records/sec throughput. Not great, but not bad either. But (!), when I used cjson to decode records on the client I saw a huge improvement and spent only 9 sec on parsing. So it's 42K records/second with the same RAM utilization (10MB). The time to get the data remains the same. The change I made is that in the client code I replaced

{{{
yield json.loads(line)
}}}

into

{{{
yield cjson.decode(line)
}}}

so we can use our jsonwrapper in a client and get a huge speed-up.
So to conclude, I recommend using jsonstreamer on the server side to stream JSON to the client, and jsonstream_decoder (with the json wrapper) on the client side. The client needs to use the standard urllib/urllib2/httplib libraries to get a file-like object for the requested data and pass it into jsonstream_decoder. The catch here is to close the file descriptor, but that can easily be done with a with statement, e.g.

{{{
with urllib2.urlopen(url) as fp_object:
    for line in fp_object:
        yield json.loads(line)
}}}
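
One caveat, noted as an editorial aside: in Python 2 the file-like object returned by urllib2.urlopen does not itself support the with statement, so contextlib.closing is one way to get the same effect:

{{{
import json
import urllib2
from contextlib import closing

def stream_records(url):
    """Yield newline-delimited JSON records from url, closing the socket."""
    with closing(urllib2.urlopen(url)) as fp_object:
        for line in fp_object:
            if line.strip():
                yield json.loads(line)
}}}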

@drsm79

drsm79 commented Jan 7, 2011

metson: That's great. Can you turn this into a concrete patch?

@vkuznet
Contributor

vkuznet commented Jan 7, 2011

valya: I know where to put the decorator for the server; it will go together with the others in WebTools. What about the client? It's better to put getdata and jsonstream_decoder somewhere in WMCore, but I'm not sure where they should go. There are different ways to code the client; I chose urllib2.Request, but it can be done via httplib or urllib too.

@vkuznet
Contributor

vkuznet commented Jan 7, 2011

valya: I've added jsonstreamer to Page.py. Since it is intended to be used on a server, I hope that's the proper place for it. We could put it into WMCore.Wrapper.JsonWrapper, but that would bring a cherrypy dependency into it. At the same time I really want to move all decorators out of Page.py itself into a separate file, but that is a separate issue.

As I wrote before, I'm not sure about the client parts.

@vkuznet
Contributor

vkuznet commented Jan 7, 2011

valya: Ohh, the patch is attached to the ticket; I thought TRAC would inform you about it :)

@drsm79

drsm79 commented Jan 7, 2011

metson: The client stuff could go with the Service/Request classes. Some kind of streaming json Request class would make sense, I reckon.

@DMWMBot
Author

DMWMBot commented Jan 7, 2011

gfball: I'm not sure I understand how the decoding is working here; if I understand right you read the input from the socket line by line, which for the well-formed DBS example gives you:

'['
'{"record": 1},'
'{"record": 2},'
']'

etc., none of which is successfully decoded by either json or cjson (unterminated array, extra trailing data, and cannot parse, respectively). So I'm not sure I understand how this is working. What is actually returned by each yield cjson.decode(line) statement?

@vkuznet
Contributor

vkuznet commented Jan 7, 2011

valya: Hi Gordon,
the input which comes from the socket is

{{{
'{'
'"record"'
':'
'1'
'}'
'\n'
'{'
'"record"'
':'
'2'
'}'
'\n'
}}}

It is not a list; it is a series of records separated by newlines. The server yields only records when walking through the list. That's why I needed to put the newline yield in the server. This allows the client to read the stream similarly to reading a file, where it can read line by line with newline as the separator. You can use my example code, feed it with your records, and print out what the chunks are.

@DMWMBot
Author

DMWMBot commented Jan 10, 2011

gfball: So, following my comment above about writing a general JSON decoder that could work on streamed data, I had a go at doing so*.

I have written from scratch a stack-based JSON decoder that reads arbitrary-length blocks of text from a generator (eg, a socket or file object), updates the internal state and yields top level objects as they become available.

Eg, if the text received is '["hello", "wo' and then 'rld"]', the decoder will yield "hello" after receiving the first chunk and "world" after receiving the rest.

The advantage of this approach is it does not break JSON compatibility (whereas jsonstream_decoder is a new JSON dialect).

The downside (at the moment) is performance - it's only about half as fast as the standard library JSON decoder (although that uses a C module for string scanning, whereas my code is pure python), and presumably significantly slower than cjson. There are probably optimisations possible, though; it was written quite quickly.
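
For comparison, a much simpler incremental loop can be built on the standard library's JSONDecoder.raw_decode. This is not the stack-based decoder described above, and it only yields complete top-level values from a concatenated stream rather than elements inside a single unfinished array, but it illustrates the idea of decoding as data arrives:

{{{
import json

def iter_json_values(chunks):
    """Yield complete top-level JSON values as soon as they are decodable.

    chunks is any iterable of text blocks (e.g. successive socket reads).
    """
    decoder = json.JSONDecoder()
    buf = ''
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                break  # value is still incomplete, wait for more data
            yield obj
            buf = buf[end:]

# e.g. list(iter_json_values(['{"a": 1} {"b"', ': 2}'])) -> [{'a': 1}, {'b': 2}]
}}}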

However, I did also find an existing library (although I haven't tried it out) which might be useful to us: YAJL ( http://lloyd.github.com/yajl/ ) and python bindings ( https://github.com/rtyler/py-yajl/ ), which appears to be a JSON library aimed at this use-case. I'm not clear how mature it is or how much code we need to write on top of it, though.

(*when the power went out in our machine room and there wasn't much else I could do...)

@vkuznet
Contributor

vkuznet commented Jan 11, 2011

valya: Gordon, thanks for the reference to YAJL, it seems pretty cool. I have installed it on my Mac and run a simple
benchmark test to load/dump phedex.json (56MB). Here are the numbers:

json: 77.7787978649
cjson: 10.656167984
yajl: 9.62515997887

as you see, it is even faster than cjson!!! It supports file-like objects and uses the same methods as json, so I can do yajl.dump, yajl.dumps, etc. It means we can easily integrate it with JsonWrapper. It beats cjson because it handles file-like objects, while with cjson we need to encode/decode strings and then read/write them to the file separately. I'll go ahead and prepare an RPM for YAJL, then I can modify JsonWrapper and add support for it.

Please note that all of them have significant RAM usage, due to the fact that the PhEDEx JSON is one big dict. Streaming the JSON doesn't help here, but in combination with my/Gordon's code we can get the best of both worlds.
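
For reference, a sketch of the kind of timing comparison quoted above (the file name and measurement details are assumptions; the real numbers are the ones in this comment):

{{{
#!/usr/bin/env python
import time
import json
import cjson
import yajl

def bench(name, decode, text):
    """Time a single decode of text with the given decoder."""
    start = time.time()
    decode(text)
    print name, time.time() - start

if __name__ == '__main__':
    text = open('phedex.json').read()   # the 56MB PhEDEx blockreplicas dump
    bench('json',  json.loads,   text)
    bench('cjson', cjson.decode, text)
    bench('yajl',  yajl.loads,   text)
}}}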

@evansde77
evansde77 commented Jan 11, 2011

evansde: How many JSON libraries do we have now?

@vkuznet
Contributor

vkuznet commented Jan 11, 2011

valya: We only have the standard json and cjson. cjson is fast but lacks file-like object support, so adding the new YAJL library is beneficial.

@evansde77
evansde77 commented Jan 11, 2011

evansde: Does YAJL then replace cjson? I think we should try and keep a lid on the number of externals, especially if (functionally) they do the same thing.

@vkuznet
Contributor

vkuznet commented Jan 11, 2011

valya: I think so. It performs better and covers everything that json provides and that we need. I will still do a few more tests, try to build it, etc. Once this is done I can report back whether I succeeded or not.
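
A sketch of the kind of preference order a JsonWrapper could apply, assuming the module names used in this thread (yajl from py-yajl, cjson, and the standard json); this is illustrative, not the actual WMCore wrapper:

{{{
import json

try:
    import yajl
except ImportError:
    yajl = None
try:
    import cjson
except ImportError:
    cjson = None

def loads(text):
    """Decode JSON text, preferring yajl, then cjson, then standard json."""
    if yajl is not None:
        return yajl.loads(text)
    if cjson is not None:
        return cjson.decode(text)
    return json.loads(text)

def dumps(obj):
    """Encode an object to JSON with the same preference order."""
    if yajl is not None:
        return yajl.dumps(obj)
    if cjson is not None:
        return cjson.encode(obj)
    return json.dumps(obj)
}}}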

@vkuznet
Contributor

vkuznet commented Jan 14, 2011

valya: Found one difference between standard json and yajl. We used json.JSONEncoder(sort_keys=True) to ensure that a stored dict is encoded the same way regardless of the order of the keys passed in the dict. Such an option doesn't exist in yajl, and therefore a fall-back to the standard JSONEncoder must be provided.
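
A minimal sketch of that fall-back (assuming yajl is py-yajl and that only the need for deterministic key order forces the slower path):

{{{
import json

try:
    import yajl
except ImportError:
    yajl = None

def dumps(obj, sort_keys=False):
    """Encode obj; use the standard encoder whenever sort_keys is requested,
    since yajl has no equivalent option."""
    if sort_keys or yajl is None:
        return json.JSONEncoder(sort_keys=sort_keys).encode(obj)
    return yajl.dumps(obj)
}}}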

@vkuznet
Contributor

vkuznet commented Jan 17, 2011

valya: Another pitfall,

{{{
test="123"

print json.loads(test)
123

print yajl.loads(test)
ValueError: eof was met before the parse could complete

print cjson.decode(test)
123
}}}

but yajl successfully parses test="123\n", so it needs an end-of-input marker in the stream.
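
A possible workaround sketch for that pitfall (untested here): terminate the input before handing it to yajl.

{{{
import yajl

def safe_loads(text):
    """yajl wants the value terminated, so bare scalars such as "123" fail;
    appending a newline works around it (sketch, not fully verified)."""
    if not text.endswith('\n'):
        text = text + '\n'
    return yajl.loads(text)
}}}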

@vkuznet
Contributor

vkuznet commented Mar 13, 2011

valya: This is fully implemented. I added yajl into DAS and refactored the code, see #1231.

@ghost ghost assigned vkuznet Jul 24, 2012
This issue was closed.