
improve memory usage when parsing large tokens #31

Merged

creationix merged 4 commits into creationix:master from hayes:big-strings on Jan 16, 2017

Conversation

@hayes (Contributor) commented Jan 10, 2017

I have been working on a project that handles uploads of large JSON files, and I have found that something very inefficient happens when parsing large string tokens.

The root of the problem seems to be in how V8 handles concatenation of small strings. I am not yet sure whether this is an issue across all Node versions; so far I have only been testing on Node 6.

The main symptom is that parsing a single string token of 50 MB (I think the actual cutoff is somewhere between 30 and 50 MB) will run a Node process out of memory (with the default 1.7 GB heap). I have narrowed the issue down to a single line: https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L137. I did some heap profiling to try to figure out exactly what was going wrong. I don't yet have a definitive answer as to what is using all the memory, but I did find a large number of 64k chunks being allocated. My suspicion is that a 64k chunk is allocated for each character in the token, and that the string representing the token (this.string) is simply a container of pointers to the memory locations of its constituent characters. (I know V8 does this to improve the performance and memory use of applications that do a lot of string concatenation.)
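One way to see the symptom outside the parser (a hypothetical sketch, not code from this PR): grow a string one character at a time, the same pattern as the jsonparse line linked above. Each append can create a new V8 cons-string node that keeps its predecessor alive until the string is flattened.

```js
// Hypothetical repro: each `+=` may allocate a new cons-string node whose
// left child is the previous string, so the whole chain stays live.
// WARNING: at this size it can exhaust the default Node heap.
var N = 50 * 1024 * 1024; // a ~50 MB token's worth of characters
var s = '';
for (var i = 0; i < N; i++) {
  s += 'a';
}
console.log(s.length);
```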

I have a solution here that seems to solve the memory-use issue while still maintaining good performance (I had several other attempts that were far slower 😿), but I am not 100% sure it's the right solution, since I am not too familiar with the internals of how strings work in V8.
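The idea behind the fix, as a standalone sketch (the names here are illustrative, not the actual patch): accumulate the raw bytes of the token in a Buffer and decode to a JavaScript string once, when the token ends.

```js
// Sketch of the buffering approach with made-up names. `new Buffer` is
// used to match the Node versions discussed in this PR (0.8 through 7).
function StringAccumulator() {
  this.buf = new Buffer(64 * 1024); // initial capacity; doubled as needed
  this.offset = 0;
}

StringAccumulator.prototype.appendByte = function (byte) {
  if (this.offset >= this.buf.length) {
    var bigger = new Buffer(this.buf.length * 2);
    this.buf.copy(bigger); // move existing bytes into the larger buffer
    this.buf = bigger;
  }
  this.buf[this.offset++] = byte;
};

StringAccumulator.prototype.finish = function () {
  // One flat string allocation instead of one cons node per character.
  var str = this.buf.toString('utf8', 0, this.offset);
  this.offset = 0;
  return str;
};
```

Doubling keeps appends amortized O(1), and peak memory stays proportional to the token size rather than tens of bytes per character.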

Let me know what you think, and if there is anything else I can do to help get this addressed.

PS:
I am aware that 50 MB is a ridiculously large size for a token, but I think you start to see symptoms of this issue with much smaller values; they are just a bit harder to notice, and values of a few MB are not that uncommon.

@hayes force-pushed the big-strings branch 2 times, most recently from 98b5f61 to b24b04a on January 10, 2017 03:59
Review thread on jsonparse.js (outdated):
}else if(n === 0x66){ this.appendStringChar("\f".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x6e){ this.appendStringChar("\n".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x72){ this.appendStringChar("\r".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x74){ this.appendStringChar("\t".charCodeAt(0)); this.tState = STRING1;
@hayes (author):

these character codes should probably be hardcoded
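For illustration, the hardcoded form would look something like this (a sketch, not the committed change; the numeric values are the standard ASCII codes of the escape characters):

```js
// '\f' is 0x0C, '\n' is 0x0A, '\r' is 0x0D, '\t' is 0x09
}else if(n === 0x66){ this.appendStringChar(0x0C); this.tState = STRING1;
}else if(n === 0x6e){ this.appendStringChar(0x0A); this.tState = STRING1;
}else if(n === 0x72){ this.appendStringChar(0x0D); this.tState = STRING1;
}else if(n === 0x74){ this.appendStringChar(0x09); this.tState = STRING1;
```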

@creationix (Owner):

Sometimes I miss C style character literals.

@hayes (author):

would have been useful here

@hayes commented Jan 10, 2017

I have tested on Node 0.8 through 7; the issue exists in all versions, and the fix now works across all of them as well.

@hayes force-pushed the big-strings branch 2 times, most recently from b917d82 to 153c707 on January 10, 2017 23:19
Review thread on jsonparse.js:
  this.unicode += String.fromCharCode(n);
  if (this.tState++ === STRING6) {
-   this.string += String.fromCharCode(parseInt(this.unicode, 16));
+   this.appendStringBuf(Buffer(String.fromCharCode(parseInt(this.unicode, 16))));
@hayes (author):

this won't work on multibyte characters; should add a test for this case

@hayes:

nevermind, I did handle that case... and there is probably already a test

@hayes:

Confirmed. The test is here: https://github.com/creationix/jsonparse/blob/master/test/primitives.js#L49. It doesn't specifically cover 3-4 byte Unicode characters, but those are encoded as two separate characters in the JSON output, so I don't think anything additional is needed here.
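For context, a sketch of what that case looks like (hypothetical usage following the onValue/write style from the jsonparse README, not the linked test): a 4-byte UTF-8 character such as U+1F600 arrives in JSON as a surrogate pair of \uXXXX escapes.

```js
var Parser = require('jsonparse');

var p = new Parser();
p.onValue = function (value) {
  // The two escapes decode to a single surrogate pair: 😀
  console.log(value === '\uD83D\uDE00'); // true
};

// U+1F600 is 4 bytes in UTF-8 but appears in JSON as two \uXXXX escapes.
p.write('"\\uD83D\\uDE00"');
```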

@hayes commented Jan 11, 2017

I don't think I have anything else to add to this PR. @creationix, is there anything you want me to change or add?

@creationix (Owner):

Looks great.

@creationix merged commit 8722aa9 into creationix:master on Jan 16, 2017
@creationix (Owner):

Published as 1.3.0.

@hayes deleted the big-strings branch on January 17, 2017 17:36