improve memory usage when parsing large tokens #31
creationix merged 4 commits into creationix:master from
Conversation
(force-pushed from 98b5f61 to b24b04a)
}else if(n === 0x66){ this.appendStringChar("\f".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x6e){ this.appendStringChar("\n".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x72){ this.appendStringChar("\r".charCodeAt(0)); this.tState = STRING1;
}else if(n === 0x74){ this.appendStringChar("\t".charCodeAt(0)); this.tState = STRING1;
these character codes should probably be hardcoded
Sometimes I miss C style character literals.
would have been useful here
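For illustration, a minimal sketch of what the suggested hardcoding might look like; the hex values are the standard ASCII codes for these escape characters, and this is not the code that was actually merged:

}else if(n === 0x66){ this.appendStringChar(0x0C); this.tState = STRING1; // "\f" form feed
}else if(n === 0x6e){ this.appendStringChar(0x0A); this.tState = STRING1; // "\n" line feed
}else if(n === 0x72){ this.appendStringChar(0x0D); this.tState = STRING1; // "\r" carriage return
}else if(n === 0x74){ this.appendStringChar(0x09); this.tState = STRING1; // "\t" tab

This avoids recomputing "\f".charCodeAt(0) and friends on every escape encountered while scanning a token.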
I have tested on Node 0.8 through 7, and the issue exists in all versions; the fix now works across all versions as well.
(force-pushed from b917d82 to 153c707)
this.unicode += String.fromCharCode(n);
if (this.tState++ === STRING6) {
this.string += String.fromCharCode(parseInt(this.unicode, 16));
this.appendStringBuf(Buffer(String.fromCharCode(parseInt(this.unicode, 16))));
this won't work on multibyte characters, should add a test for this case
never mind, I did handle that case... and there is probably already a test
Confirmed. The test is here: https://github.com/creationix/jsonparse/blob/master/test/primitives.js#L49 It doesn't specifically cover 3-4 byte Unicode characters, but those are encoded as two separate \u escapes (a surrogate pair) in the JSON output, so I don't think anything additional is needed here.
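To illustrate that point with a stand-alone example (not part of the patch): a character outside the Basic Multilingual Plane appears in escaped JSON as a surrogate pair, i.e. two separate \uXXXX sequences, so the parser only ever has to decode individual 4-digit escapes:

var json = '"\\ud83d\\ude00"';            // escaped form of U+1F600 (😀), as it may appear in JSON text
var parsed = JSON.parse(json);
console.log(parsed);                      // "😀" -- the two \u escapes combine into one character
console.log(parsed.length);               // 2   -- stored as a surrogate pair (two UTF-16 code units)
console.log(Buffer.from(parsed).length);  // 4   -- four bytes when encoded as UTF-8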
I don't think I have anything else I want to add to this PR. @creationix, is there anything you want me to change or add?
Looks great.
Published as 1.3.0.
I have been working on a project that handles uploads of large JSON files, and I have found that something very inefficient happens when parsing large string tokens.
The root of the problem seems to be in how V8 handles concatenation of small strings. I am not yet sure whether this is an issue across all Node versions; so far I have only been testing on Node 6.
The main symptom is that parsing a single string token of 50 MB (I think the actual cutoff is somewhere between 30 and 50 MB) will run a Node process out of memory (with the default ~1.7 GB heap). I have narrowed the issue down to a single line: https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L137

I did some heap profiling to try to figure out exactly what was going wrong. I don't yet have a definitive answer for what is using all the memory, but I did find a large number of 64 KB chunks being allocated. My suspicion is that a 64 KB chunk is allocated for each character in the token, and that the string representing the token (this.string) is simply a container with pointers to the memory locations of its constituent characters. (I know V8 does this to improve the performance and memory use of applications that do a lot of string concatenation.)
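For context, here is a rough stand-alone sketch of the pattern described above (hypothetical and simplified; the size and variable names are mine, not from the test suite, and whether this exact loop exhausts memory will depend on the V8 version). It mirrors the per-character concatenation the parser was doing on the line linked above:

// Build one large token the way jsonparse did before this PR: one character at
// a time via string concatenation (the hot path was this.string += String.fromCharCode(n)).
var SIZE = 50 * 1024 * 1024;              // ~50 MB token, as in the report
var token = '';
for (var i = 0; i < SIZE; i++) {
  token += String.fromCharCode(0x61);     // append the letter "a"
}
console.log(token.length, process.memoryUsage().heapUsed);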
I have a solution here that seems to solve the memory use issue while still maintaining good performance (I had several other attempts that were far slower 😿), but I am not 100% sure it's the right solution since I am not too familiar with the internals of how strings work in V8.
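For anyone skimming the PR, here is a minimal sketch of the buffering idea behind the fix, based on the appendStringChar / appendStringBuf calls visible in the diff above. The chunk size, constructor shape, and flushing details are assumptions for illustration and may differ from the merged code:

var STRING_BUFFER_SIZE = 64 * 1024;

function Parser() {
  this.string = '';
  // pre-allocated scratch buffer (use Buffer.alloc on newer Node versions)
  this.stringBuffer = new Buffer(STRING_BUFFER_SIZE);
  this.stringBufferOffset = 0;
}

// Append a single byte; flush to the result string in one large chunk only
// when the scratch buffer fills up, instead of concatenating per character.
Parser.prototype.appendStringChar = function (char) {
  if (this.stringBufferOffset >= STRING_BUFFER_SIZE) {
    this.string += this.stringBuffer.toString('utf8', 0, this.stringBufferOffset);
    this.stringBufferOffset = 0;
  }
  this.stringBuffer[this.stringBufferOffset++] = char;
};

// Append a whole Buffer (used for decoded \uXXXX escapes in the diff above).
Parser.prototype.appendStringBuf = function (buf) {
  for (var i = 0; i < buf.length; i++) {
    this.appendStringChar(buf[i]);
  }
};

A real implementation also flushes whatever remains in the scratch buffer when the token's closing quote is reached, and has to avoid flushing in the middle of a multibyte UTF-8 sequence.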
Let me know what you think, and if there is anything else I can do to help get this addressed.
PS:
I am aware that 50 MB is a ridiculously large size for a token, but I think you start to see symptoms of this issue with much smaller values; they are just a bit harder to notice, and values of a few MB are not that uncommon.