Something is too slow... #11

Closed
UnitedMarsupials-zz opened this issue Dec 5, 2013 · 10 comments
Comments

@UnitedMarsupials-zz

Hello! I needed to parse a collection of large JSON files and performance of the pure-Tcl json::json2dict was unsatisfactory.

Unaware of yajl-tcl I wrote my own -- which uses json-c for the actual heavy-lifting the way you are using yajl.

Only after I was done did it occur to me to search for existing C-implementations of JSON-parsing -- and I found yours.

I then compared the performance and now have the following numbers. All tests ran the following script on 12 JSON files (over 60 MB in total):

foreach f $argv {
    set fd  [open $f]
    set d   [yajl::json2dict [read $fd]]
    close $fd
    if {![dict exists $d users]} continue
    foreach user [dict get $d users] {
        array set a $user
        if {[info exists users($a(name))]} continue
        set users($a(name)) $a(fullname)
    }
}

This is, actually, what I needed to do -- extract the "users" part of all JSON-files, collect all such users into an array and print the array at the end.

The performance, as reported by tcsh's time-command (note the times and the memory-use):

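| Implementation | utime (seconds) | stime (seconds) | elapsed time | memory use |
|----------------|-----------------|-----------------|--------------|------------|
| Pure TCL       | 41.487          | 0.518           | 0:42.09      | 5+303566k  |
| yajl           | 19.261          | 0.314           | 0:19.61      | 5+162699k  |
| json-c         | 2.510           | 0.423           | 0:02.93      | 5+192485k  |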

As you can see, the json-c based implementation is dramatically faster than both the Pure TCL and the yajl based ones, even if it uses somewhat more memory than the latter. I doubt there is anything magic about my code -- the performance differences are likely attributable to the differences in the underlying JSON parsers (json-c vs. yajl).

Maybe, json-c is using a hash-table, where yajl (or you?) are using a regular array? This would explain the higher memory use...

In any case, this is something you may wish to investigate closer.

Edit: tests were done with tclsh8.6 as provided by the FreeBSD lang/tcl86 port on FreeBSD-9.2/i386.

@UnitedMarsupials-zz
Author

Oh, and, just for kicks, the same code implemented in PHP:

foreach ($files as $f) {
    $d = json_decode(file_get_contents($f), true);
    if (!is_array($d['users']))
        continue;
    foreach ($d['users'] as $user) {
        // isset(), not defined(): defined() checks named constants, not array keys
        if (isset($users[$user['name']]))
            continue;
        $users[$user['name']] = $user['fullname'];
    }
}

is even faster still: 1.703u 0.290s 0:01.99 100.0% 3963+129477k 0+0io 0pf+0w

@lehenbauer
Collaborator

The issue is that yajl-tcl takes the result of yajl's parse and produces a straight sort of left-to-right output that looks like

map_open map_key glossary map_open map_key title string {example glossary} map_key GlossDiv map_open map_key title string S map_key GlossList map_open map_key GlossEntry map_open map_key ID string SGML map_key SortAs string SGML map_key GlossTerm string {Standard Generalized Markup Language} map_key Acronym string SGML map_key Abbrev string {ISO 8879:1986} map_key GlossDef map_open map_key para string {A meta-markup language, used to create markup languages such as DocBook.} map_key GlossSeeAlso array_open string GML string XML array_close map_close map_key GlossSee string markup map_close map_close map_close map_close map_close

yajl::json2dict is written in Tcl and does a considerable amount of manipulation of that parse to produce the dict. That's the source of the slowness. For sure.

To speed it up to be reasonably competitive (I don't know if it would be faster or slower than json-c), yajl-tcl would need a new "parse2dict" C method added to yajltcl_yajlObjectObjCmd in generic/yajltcl.c that would directly build the dict using the Tcl C calls for manipulating dicts such as Tcl_DictObjPut or Tcl_DictObjPutKeyList.
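
To make the cost concrete, here is a minimal pure-Tcl sketch of the kind of work involved in turning that left-to-right stream into a dict. It is not the actual yajl::json2dict code; the proc name events2value is hypothetical, and the set of scalar event names it accepts (string, number, integer, double, bool, and a bare null) is an assumption based on the stream shown above.

```tcl
# Hypothetical helper, not part of yajl-tcl: walk a flat event list and
# build nested Tcl dicts (for maps) and lists (for arrays).
# Returns a two-element list: {value indexOfNextUnreadEvent}.
proc events2value {events {i 0}} {
    set tok [lindex $events $i]
    incr i
    switch -- $tok {
        map_open {
            set d [dict create]
            while {[lindex $events $i] ne "map_close"} {
                # Each map entry is: map_key <name> <value events...>
                if {[lindex $events $i] ne "map_key"} {
                    error "expected map_key at index $i"
                }
                set key [lindex $events [expr {$i + 1}]]
                lassign [events2value $events [expr {$i + 2}]] val i
                dict set d $key $val
            }
            # Skip over the map_close token.
            return [list $d [expr {$i + 1}]]
        }
        array_open {
            set l [list]
            while {[lindex $events $i] ne "array_close"} {
                lassign [events2value $events $i] val i
                lappend l $val
            }
            # Skip over the array_close token.
            return [list $l [expr {$i + 1}]]
        }
        null {
            # Assumption: null carries no argument and maps to the word "null".
            return [list null $i]
        }
        string - number - integer - double - bool {
            # Scalar events carry exactly one argument: the value itself.
            return [list [lindex $events $i] [expr {$i + 1}]]
        }
        default {
            error "unexpected token '$tok' at index [expr {$i - 1}]"
        }
    }
}

set events {map_open map_key moo string cow map_key pig string oink map_key rabbit null map_close}
lassign [events2value $events] d next
puts $d
```

Every event costs several list lookups, a proc call and a result list at the script level, which is the overhead a C-level method calling Tcl_DictObjPut directly would avoid.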

@UnitedMarsupials-zz
Author

Oh, I see. Well, the Pure TCL implementation currently in tcllib has the "excuse" of being, well, pure TCL.

But, if compiling is already required for yajl-tcl, then, perhaps, it should be doing everything in C?

@UnitedMarsupials-zz
Author

In the interest of benchmarking in the meantime, could you, perhaps, rewrite the Tcl code snippet I posted to use only the C methods of yajl-tcl to extract the users subtree of the parsed JSON? That would make it easier to separate the parsing performance from the dictionary-creation performance... Thanks!

@lehenbauer
Collaborator

We originally wrote yajl-tcl to generate JSON quickly. We added parsing later, and the output of the parse is a direct analogue to what's fed to the generator. So the first use of the yajl-tcl parser was to take some JSON that we desired to generate the equivalent of and produce the parse stream that could then be modified to create the matching JSON output with values substituted as desired. So the parser was effectively a tool to help with generation.

I am personally not a huge fan of dicts. To me they are often either too much or too little or, sometimes, both. As to whether everything "should" be done in C, it's a function of need, and desire. I'm pretty sure I see how to do it. I'm somewhat interested in doing it to satisfy my curiosity as to how it will turn out. It could, for that matter, be coded to produce a hierarchy of namespaces with arrays as Tcl arrays, which might be kind of cool.

If you want to just see how fast the raw parse is, try something like

package require yajltcl

yajl create yajlparser

foreach f $argv {
    set fd  [open $f]
    set d   [yajlparser parse [read $fd]]
    close $fd
}

It won't produce a dict but it will produce the left-to-right parse I referred to earlier.

@bovine
Member

bovine commented Dec 5, 2013

The yajl::json2dict method was added just to make yajl-tcl a drop-in replacement for applications already using json::json2dict, so its primary goal was just to be interface compatible and faster than that.

Making yajl::json2dict even faster by rewriting it in pure C would be an excellent and welcome improvement, however.

@lehenbauer
Collaborator

All right, well, I added a pure C "parse2dict" method to the yajltcl object. It's on the master branch. I've only tested it a little bit but it produces a character-for-character identical parse of the contents of playpen/foo.json as ::yajl::json2dict.

Timing the two routines parsing a variable containing the contents of that file, parse2dict is 38X faster than ::yajl::json2dict.

After we gain confidence in the code we can update ::yajl::json2dict to use it.

@bovine
Member

bovine commented Dec 5, 2013

The test cases run by tests/dict.tcl show a difference on null values:

Expected: moo cow pig oink rabbit null
Actual: moo cow pig oink rabbit null
Actual2: moo cow pig oink rabbit {{}}
Input: {"moo": "cow", "pig": "oink", "rabbit" : null}
FAILED

@bovine
Member

bovine commented Dec 5, 2013

For speed comparison purposes, here is the relative timing difference of one of my tests:

tcllib took 11272571 clicks
yajl took 2923963 clicks
yajl2 took 102771 clicks
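
For anyone wanting to collect comparable figures, a minimal sketch of a click-based timer follows. The proc name clicks_for and the stand-in loop are hypothetical; the actual test script behind the numbers above is not shown in this thread. The last part only divides the posted click counts to get relative speedups.

```tcl
# Hypothetical helper: run a script in the caller's scope and return
# how many high-resolution clock clicks it took.
proc clicks_for {script} {
    set start [clock clicks]
    uplevel 1 $script
    expr {[clock clicks] - $start}
}

# Stand-in workload; replace the body with the parser call being measured.
set n [clicks_for {
    for {set i 0} {$i < 100000} {incr i} {
        set x [expr {$i * 2}]
    }
}]
puts "stand-in took $n clicks"

# Relative speedups computed from the click counts posted above.
set yajl2 102771
foreach {label clicks} {tcllib 11272571 yajl 2923963} {
    set ratio [expr {double($clicks) / $yajl2}]
    puts [format "%s / yajl2 = %.1f" $label $ratio]
}
```

By these figures the C-based path is roughly 28 times faster than the Tcl-based yajl path and roughly 110 times faster than tcllib on this particular test. Clicks are platform dependent, so only the ratios are meaningful across machines.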

@UnitedMarsupials-zz
Author

The figures certainly look impressive -- as does the overnight turn-around... Will test here soon. Thank you!

@bovine bovine closed this as completed in 268c704 Dec 5, 2013