Optimise downloads #423
Comments
valya: I'm attaching test code. It takes almost 300 sec to get all datasets from DBS2, which seems quite a lot to me. I also found the following discussion about the slowness of HTTP requests on Linux: http://code.google.com/p/httplib2/issues/detail?id=28
metson: See also #350
valya: I wrote a first implementation of jsonstreamer as a decorator. The code below demonstrates its usage within a CherryPy application. I think the code belongs in WMCore.WebTools and can be adopted by others in their applications. I see DBS and PhEDEx as the primary consumers of jsonstreamer.
{{{
import json
from types import GeneratorType

def test_records(num):
    ...

def jsonstreamer(func):
    ...

class Main(object):
    ...

if __name__ == '__main__':
    ...
}}}
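Only the skeleton of the listing survived in the ticket; a minimal sketch of what such a jsonstreamer decorator might look like (names taken from the ticket, bodies are my own reconstruction, and the CherryPy wiring is reduced to a comment so the sketch runs standalone):

```python
import json
from types import GeneratorType

def jsonstreamer(func):
    # Wrap a handler whose return value is a dict, list or generator and
    # turn it into a generator of JSON text chunks. In a real CherryPy
    # handler you would also set cherrypy.response.stream = True so the
    # yielded chunks are sent as they are produced rather than buffered.
    def wrapper(*args, **kwargs):
        data = func(*args, **kwargs)
        if isinstance(data, (list, GeneratorType)):
            for record in data:
                yield json.dumps(record)
        else:
            yield json.dumps(data)
    return wrapper

@jsonstreamer
def test_records(num):
    # Stand-in for a DBS/PhEDEx data method returning many records.
    for idx in range(num):
        yield {"dataset": "/primary/processed/%d" % idx}
```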
metson: Replying to [comment:4 valya]:
Yup. Please provide a patch for that when you're ready. We also need a client that can do sane things with streamed json...
Indeed, that's why #350 was assigned to DBS.
Just have something that sends junk data for some long period of time. Bonus points if the duration is set by the HTTP query string.
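As a rough illustration of such a junk-data endpoint (written as a plain generator so it runs outside CherryPy; in the real handler the duration would be parsed from the query string):

```python
import random
import string
import time

def junk_stream(duration, chunk_size=1024):
    # Yield random text chunks for roughly `duration` seconds.
    # In a CherryPy handler, `duration` would come from the HTTP query
    # string and response.stream would be set to True.
    deadline = time.time() + duration
    while time.time() < deadline:
        yield "".join(random.choice(string.ascii_letters)
                      for _ in range(chunk_size))
```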
gfball: Try PhEDEx for a big sample JSON document, eg https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas?node=T1_US_FNAL_MSS , which is 50+MB when I just fetched it. I have had a look at the possibility of streaming the decoding, but it doesn't look that promising at the moment. It is not supported by the default python JSON library. It would in theory be possible (read chunks from the incoming stream, have a JSON decoder that decodes until the current end of stream, appending to strings, lists and dictionaries as their sub-elements become available), but there are significant obstacles.
On the other hand, it would probably be something worth contributing back to the python standard library if we were to write it, but I think it's fairly clearly not worth pursuing this side for the moment. jsonstreamer looks good, but one query: is it necessary to loop over a list or generator yourself? Won't iterencode do that for you, avoiding the overhead of an extra loop?
gfball: Alternatively, to generate random JSON, the random_X functions in WMCore/HTTPFrontEnd/PlotFairy/Examples.py could be adapted with infinite instead of finite loops (random_baobab produces nested dictionaries). Ignore my comment about why you add the extra loop; I see json rejects iterating over a generator. The CherryPy docs (cherrypy.org/wiki/ReturnVsYield) indicate you have to set "response.stream" -> True or it buffers the yield statements for you.
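A sketch of the infinite variant described here (the record shape is invented for illustration; the real random_X helpers in PlotFairy differ):

```python
import itertools
import json
import random

def random_records():
    # Endless stream of random JSON documents, one per line -- an
    # infinite-loop adaptation of a finite random-data generator,
    # useful as a test feed for a streaming client.
    for idx in itertools.count():
        record = {"id": idx,
                  "value": random.random(),
                  "tags": [random.randint(0, 9) for _ in range(3)]}
        yield json.dumps(record) + "\n"
```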
valya: Here is the updated version of jsonstreamer. The trick to make the client work is to yield an extra newline character after each encoded record. This allows the client to read record chunks line by line, which matches how file object methods are designed, see http://docs.python.org/release/2.5.2/lib/bltin-file-objects.html If you put a print statement on each chunk you'll see the JSON records. Of course this approach has one flaw: we need to ensure that an encoded JSON record doesn't contain a newline character (I'll test that later). Meanwhile, here is the server code (I've set response.stream to True inside the decorator function, and added the extra yield of the newline character):
{{{
import json
from types import GeneratorType

def test_records(num):
    ...

def jsonstreamer(func):
    ...

class Main(object):
    ...

if __name__ == '__main__':
    ...
}}}
Here is a simple client which makes an HTTP call to the server and reads the file-like object line by line:
{{{
def getdata(url, params, headers=None):
    ...

def jsonstream_decoder(fp_object):
    ...

def client():
    ...

if __name__ == '__main__':
    ...
}}}
valya: This approach will certainly be beneficial for services which stream their data, like DBS3, since they create a list of JSON records and send it back to the client. PhEDEx is another beast: it constructs one monolithic JSON record (one deeply nested dict) and sends it over. I think we need to revisit the PhEDEx server to avoid such a CPU-, memory- and network-hungry approach and make it stream its records.
valya: Hi. So I used DBS3 output for datasets and made 420K records, which comes to 60MB total in a file. Apart from PhEDEx, all records are stored in lists, which allows the parser to walk record by record and yield each record to the stream. Using my code I found that the client consumes only a steady 11MB of RAM across all records, while the server using jsonstreamer used about 150MB of RAM to stream them all. Both processes are CPU bound. I measured the total time the client spends getting and parsing the data: the HTTP calls to get the data on my localhost took 26 sec, while parsing consumed 100 sec for 420K records, which gives 4000 records/sec throughput. Not great, but not bad either. But (!) when I used cjson to decode records on the client I saw a huge improvement and spent only 9 sec on parsing. That's about 42K records/sec with the same RAM utilization (10MB); the time to get the data remains the same. The change I made is to replace the json decode call in the client with cjson's, so we can use our jsonwrapper in a client and get a huge speedup.
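The swap described above amounts to changing one decode call in the client; a hedged sketch of the measurement (cjson is optional here and the sketch falls back to the stdlib json when it is not installed, so the absolute numbers will differ):

```python
import json
import time

try:
    import cjson                    # C extension; much faster decode
    decode = cjson.decode
except ImportError:
    decode = json.loads             # stdlib fallback

def decode_throughput(lines):
    # Decode newline-delimited JSON records and report records/sec,
    # a rough version of the 4K vs 42K records/sec comparison above.
    start = time.time()
    records = [decode(line) for line in lines]
    elapsed = time.time() - start
    rate = len(records) / elapsed if elapsed > 0 else float("inf")
    return records, rate
```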
metson: That's great. Can you turn this into a concrete patch?
valya: I know where to put the decorator for the server; it will go together with the others in WebTools. What about the client? It's better to put getdata and jsonstream_decoder somewhere in WMCore, but I'm not sure where they should go. There are different ways to code the client; I chose urllib2.Request, but it can be done via httplib or urllib too.
valya: I've added jsonstreamer to Page.py. Since it is intended to be used on a server, I hope that's the proper place for it. We could put it into WMCore.Wrapper.JsonWrapper, but that would bring a cherrypy dependency into it. At the same time I really want to move all decorators out of Page.py itself into a separate file, but that is a separate issue. As I wrote before, I'm not sure about the client parts.
valya: Ohh, the patch is attached to the ticket; I thought TRAC would announce it :)
metson: The client stuff could go with the Service/Request classes. Some kind of streaming JSON Request class would make sense, I reckon.
gfball: I'm not sure I understand how the decoding is working here. If I understand right, you read the input from the socket line by line, which for the well-formed DBS example gives you fragments like '[' etc., none of which are successfully decoded by either json or cjson (unterminated array, extra trailing data, cannot parse, respectively). So I'm not sure I understand how this is working. What is actually returned by each read?
valya: Hi Gordon, it is not a list; it is a series of records separated by newlines. The server yields only the records when walking through the list; that's why I needed to put the newline yield in the server. This lets the client read the stream the way it would read a file, line by line, with newline as the separator. You can use my example code, feed it your records, and print out what the chunks are.
gfball: So, following my comment above that writing a general JSON decoder that could work on streamed data would be possible, I had a go at doing so*. I have written from scratch a stack-based JSON decoder that reads arbitrary-length blocks of text from a generator (eg a socket or file object), updates its internal state and yields top-level objects as they become available. Eg, if the text received is '["hello", "wo' and then 'rld"]', the decoder will yield "hello" after receiving the first chunk and "world" after receiving the rest. The advantage of this approach is that it does not break JSON compatibility (whereas jsonstream_decoder is a new JSON dialect). The downside (at the moment) is performance: it's only about half as fast as the standard library JSON decoder (although that uses a C module for string scanning, whereas my code is pure python), and presumably significantly slower than cjson. There are probably optimisations possible though; it was written quite quickly. However, I also found an existing library (although I haven't tried it out) which might be useful to us: YAJL ( http://lloyd.github.com/yajl/ ) with python bindings ( https://github.com/rtyler/py-yajl/ ), which appears to be a JSON library aimed at this use case. I'm not clear how mature it is or how much code we would need to write on top of it, though. (*when the power went out in our machine room and there wasn't much else I could do...)
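The stack-based decoder itself is not in the ticket; a much simpler approximation of the idea buffers incoming chunks and repeatedly pulls complete top-level documents off the front with the stdlib's raw_decode. Unlike the decoder described above it cannot yield elements of a partially received array, but it shows the buffering principle:

```python
import json

def stream_decode(chunks):
    # Accumulate text from an iterable of chunks (e.g. socket reads)
    # and yield each complete top-level JSON document as soon as the
    # buffer contains one; on a partial document, wait for more data.
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                break          # incomplete document: need more data
            yield obj
            buf = buf[end:]
```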
valya: Gordon, thanks for the reference to YAJL, it seems pretty cool. I installed it on my mac and ran a simple test (json: 77.7787978649); as you see, it is even faster than cjson!!! It supports file-like objects and uses the same method names as json, so I can do yajl.dump, yajl.dumps, etc. That means we can easily integrate it with JsonWrapper. It beats cjson because it handles file-like objects, while with cjson we need to encode/decode strings and then read/write them to the file separately. I'll go ahead and prepare an RPM for YAJL, then I can modify JsonWrapper and add support for it. Please note that all of them have significant RAM usage, due to the fact that the PhEDEx JSON is one big dict; serializing the JSON record by record doesn't apply here, but in combination with my/Gordon's code we can get the best of both worlds.
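Since py-yajl mirrors the stdlib json API, the JsonWrapper integration could be as small as a preference chain; a sketch under that assumption (yajl is optional and the stdlib is always the fallback):

```python
import json as _stdlib_json

try:
    import yajl as _backend        # py-yajl exposes loads/dumps like json
except ImportError:
    _backend = _stdlib_json        # stdlib is always available

def loads(s):
    # Decode with the fastest available backend.
    return _backend.loads(s)

def dumps(obj):
    # Encode with the fastest available backend.
    return _backend.dumps(obj)
```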
evansde: How many JSON libraries do we have now?
valya: We only have the standard json and cjson. cjson is fast, but lacks file-like object support, so adding the new YAJL library is beneficial.
evansde: Does YAJL then replace cjson? I think we should try and keep a lid on the number of externals, especially if (functionally) they do the same thing.
valya: I think so. It performs better and covers everything we need from json. I will still do a few more tests, try to build it, etc. Once that is done I can report back whether I succeeded or not.
valya: Found one difference between standard json and yajl. We used json.JSONEncoder(sort_keys=True) to ensure that a stored dict is encoded identically regardless of the order of the keys passed in. No such option exists in yajl, so a fallback to the standard JSONEncoder must be provided.
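A sketch of that fallback, assuming the wrapper should prefer yajl but drop to the stdlib encoder whenever deterministic key order is requested:

```python
import json

try:
    import yajl
except ImportError:
    yajl = None

def dumps(obj, sort_keys=False):
    # yajl has no sort_keys option, so any caller relying on
    # json.JSONEncoder(sort_keys=True) must get the stdlib encoder;
    # otherwise use yajl when it is available.
    if sort_keys or yajl is None:
        return json.JSONEncoder(sort_keys=sort_keys).encode(obj)
    return yajl.dumps(obj)
```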
valya: Another pitfall:
{{{
test = "123"
print json.loads(test)
print yajl.loads(test)
print cjson.decode(test)
}}}
yajl fails on the bare test = "123" but successfully parses test = "123\n", so it needs an end-of-input marker in the stream.
valya: This is now fully implemented. I added yajl into DAS and did the code refactoring, see #1231.
Investigate ways of optimising the transfers of large chunks of JSON, eg from a query "dataset", whether by socket configuration or streaming the decoding.