
improve memory usage when parsing large tokens #31

Merged

creationix merged 4 commits into creationix:master from hayes:big-strings on Jan 16, 2017

Conversation

@hayes (Contributor) commented Jan 10, 2017

I have been working on a project that handles uploads of large JSON files, and I have found that something very inefficient happens when parsing large string tokens.

The root of the problem seems to be in how V8 handles concatenation of small strings. I am not yet sure whether this is an issue across all Node versions; so far I have only been testing on Node 6.

The main symptom is that parsing a single string token of 50 MB (I think the actual cutoff is somewhere between 30 and 50 MB) will run a Node process out of memory (with the default 1.7 GB heap). I have narrowed the issue down to a single line: https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L137. I did some heap profiling to try to figure out exactly what was going wrong. I don't yet have a definitive answer as to what is using all the memory, but I did find a large number of 64k chunks being allocated. My suspicion is that a 64k chunk is allocated for each character in the token, and that the string representing the token (this.string) is simply a container of pointers to the memory locations of its constituent characters. (I know V8 does this to improve the performance and memory use of applications that do a lot of string concatenation.)
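One way to see the symptom outside the parser (a hypothetical sketch, not code from this PR): grow a string one character at a time, the same pattern as the jsonparse line linked above. Each append can create a new V8 cons-string node that keeps its predecessor alive until the string is flattened.

```js
// Hypothetical repro: each `+=` may allocate a new cons-string node whose
// left child is the previous string, so the whole chain stays live.
// WARNING: at this size it can exhaust the default Node heap.
var N = 50 * 1024 * 1024; // a ~50 MB token's worth of characters
var s = '';
for (var i = 0; i < N; i++) {
  s += 'a';
}
console.log(s.length);
```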

I have a solution here that seems to solve the memory-use issue while still maintaining good performance (I had several other attempts that were far slower 😿), but I am not 100% sure it's the right solution, since I am not too familiar with the internals of how strings work in V8.
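The idea behind the fix, as a standalone sketch (the names here are illustrative, not the actual patch): accumulate the raw bytes of the token in a Buffer and decode to a JavaScript string once, when the token ends.

```js
// Sketch of the buffering approach with made-up names. `new Buffer` is
// used to match the Node versions discussed in this PR (0.8 through 7).
function StringAccumulator() {
  this.buf = new Buffer(64 * 1024); // initial capacity; doubled as needed
  this.offset = 0;
}

StringAccumulator.prototype.appendByte = function (byte) {
  if (this.offset >= this.buf.length) {
    var bigger = new Buffer(this.buf.length * 2);
    this.buf.copy(bigger); // move existing bytes into the larger buffer
    this.buf = bigger;
  }
  this.buf[this.offset++] = byte;
};

StringAccumulator.prototype.finish = function () {
  // One flat string allocation instead of one cons node per character.
  var str = this.buf.toString('utf8', 0, this.offset);
  this.offset = 0;
  return str;
};
```

Doubling keeps appends amortized O(1), and peak memory stays proportional to the token size rather than tens of bytes per character.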

Let me know what you think, and if there is anything else I can do to help get this addressed.

PS:
I am aware that 50 MB is a ridiculously large size for a token, but I think you start to see symptoms of this issue with much smaller values; they are just a bit harder to notice, and values of a few MB are not that uncommon.

@hayes force-pushed the big-strings branch 2 times, most recently from 98b5f61 to b24b04a on January 10, 2017 03:59
Review thread on jsonparse.js (outdated):
}else if(n === 0x66){ this.appendStringChar("\f".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x6e){ this.appendStringChar("\n".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x72){ this.appendStringChar("\r".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x74){ this.appendStringChar("\t".charCodeAt(0)); this.tState = STRING1;
@hayes (author):

these character codes should probably be hardcoded
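For illustration, the hardcoded form would look something like this (a sketch, not the committed change; the numeric values are the standard ASCII codes of the escape characters):

```js
// '\f' is 0x0C, '\n' is 0x0A, '\r' is 0x0D, '\t' is 0x09
}else if(n === 0x66){ this.appendStringChar(0x0C); this.tState = STRING1;
}else if(n === 0x6e){ this.appendStringChar(0x0A); this.tState = STRING1;
}else if(n === 0x72){ this.appendStringChar(0x0D); this.tState = STRING1;
}else if(n === 0x74){ this.appendStringChar(0x09); this.tState = STRING1;
```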

@creationix (Owner):

Sometimes I miss C style character literals.

@hayes (author):

would have been useful here

@hayes commented Jan 10, 2017

I have tested on Node 0.8 through 7; the issue exists in all versions, and the fix now works across all of them as well.

@hayes force-pushed the big-strings branch 2 times, most recently from b917d82 to 153c707 on January 10, 2017 23:19
Review thread on jsonparse.js:
  this.unicode += String.fromCharCode(n);
  if (this.tState++ === STRING6) {
-   this.string += String.fromCharCode(parseInt(this.unicode, 16));
+   this.appendStringBuf(Buffer(String.fromCharCode(parseInt(this.unicode, 16))));
@hayes (author):

this won't work on multibyte characters; should add a test for this case

@hayes:

nevermind, I did handle that case... and there is probably already a test

@hayes:

Confirmed. The test is here: https://github.com/creationix/jsonparse/blob/master/test/primitives.js#L49. It doesn't specifically cover 3-4 byte Unicode characters, but those are encoded as two separate characters in the JSON output, so I don't think anything additional is needed here.
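For context, a sketch of what that case looks like (hypothetical usage following the onValue/write style from the jsonparse README, not the linked test): a 4-byte UTF-8 character such as U+1F600 arrives in JSON as a surrogate pair of \uXXXX escapes.

```js
var Parser = require('jsonparse');

var p = new Parser();
p.onValue = function (value) {
  // The two escapes decode to a single surrogate pair: 😀
  console.log(value === '\uD83D\uDE00'); // true
};

// U+1F600 is 4 bytes in UTF-8 but appears in JSON as two \uXXXX escapes.
p.write('"\\uD83D\\uDE00"');
```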

@hayes commented Jan 11, 2017

I don't think I have anything else to add to this PR. @creationix, is there anything you want me to change or add?

@creationix (Owner):

Looks great.

@creationix merged commit 8722aa9 into creationix:master on Jan 16, 2017
@creationix (Owner):

Published as 1.3.0.

@hayes deleted the big-strings branch on January 17, 2017 17:36