feat(gatsby): use json-stream-stringify to serialize redux state #9370

Merged: 8 commits into gatsbyjs:master from stringify-stream, Oct 29, 2018

Conversation

@rametta (Contributor) commented Oct 24, 2018

This PR changes the way the internal Gatsby store is saved to disk, to account for sites that have very large payloads.

Closes issue #9362
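
The change, in essence, is to stream the serialized state to disk instead of building one giant JSON string in memory first. A minimal sketch of that idea (not the exact PR diff; the import shape of json-stream-stringify also varies by version):

```js
// Minimal sketch of the streaming approach, not the exact PR diff.
// Older json-stream-stringify versions expose the stream class as the
// default export; newer ones export { JsonStreamStringify }.
const fs = require(`fs`)
const JsonStreamStringify = require(`json-stream-stringify`)

function saveState(state) {
  return new Promise((resolve, reject) => {
    new JsonStreamStringify(state) // emits the JSON text in chunks
      .pipe(fs.createWriteStream(`.cache/redux-state.json`))
      .on(`finish`, resolve) // file fully flushed to disk
      .on(`error`, reject)
  })
}
```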

@DSchau (Contributor) left a comment

Well that was quick!

Would you be able to test this with your local site and validate that it improves the scalability of Gatsby? Additionally, testing on a smaller site and validating similar performance would be nice to have, but I don't anticipate this causing a lot of issues.

One other note: json-stringify-safe handled circular references, whereas I don't think this new approach does. I'm not sure whether that's a huge concern, but it's worth considering!

@KyleAMathews (Contributor) commented
We got a fair number of bug reports about circular references in the past, so we definitely need to handle that.

@rametta (Contributor, author) commented Oct 24, 2018

@DSchau Yes, definitely. I tested with my local build: with the old approach I was getting the error mentioned in the issue I opened (#9362), and now with this change it works with any object size. (The file my build was generating was around 500 MB.)

@DSchau (Contributor) commented Oct 24, 2018

@rametta looks like we can keep the circular-reference-safe json-stringify-safe and pass it as an option to streaming-json-stringify to replace the JSON.stringify call; in fact, it uses that by default. So I'd recommend trying out streaming-json-stringify (instead of stream-json-stringify) because it's:

  • more downloaded (but also older?)
  • circular-reference safe, by default

Are you able to check that out and validate it still works with your site?

@rametta (Contributor, author) commented Oct 24, 2018

@DSchau That's a good idea. I will try that approach and let you guys know.

@pieh (Contributor) commented Oct 24, 2018

Here's another JSON streaming utility that has some handling for circular references: https://www.npmjs.com/package/bfj
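
For reference, bfj's write helper returns a promise that resolves once the file has been written. A hedged sketch (option names vary across bfj versions, and `store` here stands in for the redux store):

```js
// Hedged sketch of persisting state with bfj; `store` is assumed to be
// the redux store. bfj.write(path, data[, options]) returns a promise.
const bfj = require(`bfj`)

bfj
  .write(`.cache/redux-state.json`, store.getState())
  .then(() => console.log(`redux state persisted`))
  .catch(error => console.error(`failed to persist redux state`, error))
```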

@DSchau (Contributor) commented Oct 24, 2018

@pieh even better 👌 Thanks for that link!

@rametta (Contributor, author) commented Oct 25, 2018

@KyleAMathews I agree that slowing down the build is not ideal, but I think it's worth it. Here is why:

  • It will allow bigger sites to be built
  • A slightly slower build is better than no build at all
  • Using streams is async, so it does not block the event loop the way the current implementation does
  • It also safeguards against circular references
  • Small to medium sites will barely be affected

Here are my findings after experimenting with different stringify methods:

50,000 records (~134 MB file)

| Method | Time to complete | Circular refs |
| --- | --- | --- |
| native JSON.stringify | 1.10 s (sync) | No |
| json-stringify-safe | 1.19 s (sync) | Yes |
| bfj | 6.9 s (async) | Some |
| json-stream-stringify | 7.4 s (async) | Yes |

100,000 records (~251 MB file)

| Method | Time to complete | Circular refs |
| --- | --- | --- |
| native JSON.stringify | 2.02 s (sync) | No |
| json-stringify-safe | 2.22 s (sync) | Yes |
| bfj | 13.36 s (async) | Some |
| json-stream-stringify | 19.3 s (async) | Yes |

500,000 records (~1.24 GB file)

| Method | Time to complete | Circular refs |
| --- | --- | --- |
| native JSON.stringify | CRASH | No |
| json-stringify-safe | CRASH | Yes |
| bfj | 62 s (async) | Some |
| json-stream-stringify | 194 s (async) | Yes |
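
A rough sketch of the kind of harness that produces numbers like these (hypothetical; the actual benchmark script wasn't shared in the thread, and the record count and padding are illustrative):

```js
// Hypothetical benchmark harness; the real script wasn't shared.
// Generates N padded records, then times a sync serializer against a
// streaming one writing to disk.
const fs = require(`fs`)
const JsonStreamStringify = require(`json-stream-stringify`) // default export in v1

const records = Array.from({ length: 50000 }, (_, i) => ({
  id: i,
  payload: `x`.repeat(2048), // pad each record to inflate the file size
}))

// Sync baseline: builds the entire string in memory before writing.
console.time(`JSON.stringify`)
fs.writeFileSync(`out-sync.json`, JSON.stringify({ records }))
console.timeEnd(`JSON.stringify`)

// Streaming: writes chunks as they are produced, never holding the
// whole document in memory at once.
console.time(`json-stream-stringify`)
new JsonStreamStringify({ records })
  .pipe(fs.createWriteStream(`out-stream.json`))
  .on(`finish`, () => console.timeEnd(`json-stream-stringify`))
```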

@pieh (Contributor) commented Oct 25, 2018

But does it actually slow down the builds, or just slow down persisting the redux state to a file? The bfj docs mention:

> BFJ yields frequently to avoid monopolising the event loop, interrupting its own execution to let other event handlers run

So this might not be a bad idea and might actually speed up builds. Plus, there are options we can play with so it isn't incredibly slow.

@rametta (Contributor, author) commented Oct 25, 2018

Unfortunately, in my tests BFJ was not handling all types of circular references, so I switched my PR to use json-stream-stringify instead, which seems to handle everything flawlessly. It's a bit slower, but since it's async it should not be a problem.

@pieh (Contributor) commented Oct 25, 2018

Is json-stream-stringify slower than bfj, or than json-stringify-safe?

@rametta (Contributor, author) commented Oct 25, 2018

> Is json-stream-stringify slower than bfj, or than json-stringify-safe?

Both; I've updated my little benchmark table above with the new lib.

@DSchau (Contributor) commented Oct 26, 2018

@rametta would you be able to run performance benchmarks with a sample site (e.g. something in benchmarks) to ensure this doesn't introduce a performance regression?

@rametta (Contributor, author) commented Oct 27, 2018

@DSchau Here are my results with the create-pages benchmark. They show that build time is essentially unchanged, which is good.

(After every test, I deleted the .cache and public folders manually.)

yarn build, create-pages benchmark (times in seconds)

| Branch | Pages | Build 1 | Build 2 | Build 3 | Avg |
| --- | --- | --- | --- | --- | --- |
| v2.0.31 | 5,000 | 8.52 | 7.44 | 7.04 | 7.66 |
| v2.0.31 | 10,000 | 12.15 | 9.94 | 10.82 | 10.97 |
| v2.0.31 | 50,000 | 47.30 | 39.86 | 37.85 | 41.67 |
| v2.0.31 | 100,000 | 86.01 | 81.78 | 78.20 | 81.99 |
| This PR | 5,000 | 7.42 | 7.84 | 7.69 | 7.65 |
| This PR | 10,000 | 11.67 | 9.56 | 11.20 | 10.81 |
| This PR | 50,000 | 45.55 | 35.13 | 36.01 | 38.89 |
| This PR | 100,000 | 80.09 | 80.77 | 81.82 | 80.89 |

@KyleAMathews (Contributor) commented
Sorry for not being more explicit. Can you use the markdown benchmark instead? The create-pages benchmark basically doesn't add anything to the cache, so it's not a good test for this.

@KyleAMathews (Contributor) commented
E.g. I just tried a 50k markdown site and it OOMed when it tried to write out the cache file.

@rametta (Contributor, author) commented Oct 27, 2018

@KyleAMathews No problem. I ran the markdown benchmark a few times and realized I would have to make each node's content VERY large, so I replaced the content property of every node with a very large snippet of lorem ipsum that has to be written to .cache/redux-state.json. Even when creating just 5k pages the old way, I ran into this error:

```
info bootstrap finished - 82.756 s

error Invalid string length

  RangeError: Invalid string length

  - JSON.stringify

  - stringify.js:5 stringify
    [test_stream]/[json-stringify-safe]/stringify.js:5:15

  - index.js:90 saveState
    [test_stream]/[gatsby]/dist/redux/index.js:90:23

  - index.js:118 emitter.on
    [test_stream]/[gatsby]/dist/redux/index.js:118:5

  - mitt.js:1
    [test_stream]/[mitt]/dist/mitt.js:1:268

  - Array.map
```

Then when I switched to my PR, it built with no problem. (Granted, I had to provide 12 GB of memory to my node command, but at least that mimics the project I am working on now that led me to find this issue.) When my build passed, I had a redux-state.json file of about 990 MB, whereas the current Gatsby would crash somewhere around a 500 MB file size, because that seems to be the limit of the JSON.stringify / json-stringify-safe implementation.
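
A hypothetical reconstruction of that benchmark tweak (the exact change wasn't shared; page count and snippet size are illustrative):

```js
// Hypothetical reconstruction of the benchmark tweak described above:
// generate markdown pages whose bodies are large lorem-ipsum blobs so
// the serialized redux state balloons.
const fs = require(`fs`)

const lorem = `Lorem ipsum dolor sit amet, consectetur adipiscing elit. `.repeat(
  10000
) // roughly half a megabyte of body text per page

for (let i = 0; i < 5000; i++) {
  fs.writeFileSync(
    `./src/pages/page-${i}.md`,
    `---\ntitle: Page ${i}\n---\n\n${lorem}`
  )
}
```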

@KyleAMathews (Contributor) commented
Awesome! And also no perf regressions?

@rametta (Contributor, author) commented Oct 27, 2018

I don't think so; if there are any, I don't know how to find them.

@KyleAMathews (Contributor) commented
I meant: could you do what you did in #9370 (comment) with the markdown benchmark?

@rametta (Contributor, author) commented Oct 28, 2018

@KyleAMathews here are the markdown benchmarks with a large amount of data passed into each node.

Markdown benchmark (times in seconds)

| Branch | Pages | Build 1 | Build 2 | Build 3 | Avg | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| v2.0.31 | 500 | 22.29 | 21.81 | 21.55 | 21.88 | - |
| v2.0.31 | 1,000 | 45.92 | 40.11 | 39.77 | 41.93 | - |
| v2.0.31 | 1,500 | 58.46 | 57.46 | 57.62 | 57.85 | - |
| This PR | 500 | 20.37 | 20.44 | 20.55 | 20.45 | -1.43 |
| This PR | 1,000 | 37.88 | 37.44 | 37.01 | 37.44 | -4.49 |
| This PR | 1,500 | 55.23 | 54.24 | 54.13 | 54.53 | -3.32 |

@DSchau (Contributor) commented Oct 29, 2018

@rametta one neat trick I built into that markdown pages benchmark is that you can conditionally (with environment variables) change the amount of content for each page :)

So here's how that works:

  • I expose a NUM_ROWS variable, which will basically create a huge markdown table, which tends to lead to a huge markdown AST
  • So running, e.g., NUM_PAGES=1500 NUM_ROWS=100 yarn build will build out 1,500 pages with a fair amount of content

~~So this will test the cache but not redux state. Maybe it's worth building in something additional to test the node creation process that follows a similar approach?~~

This tests both, so consider checking it out! (Thanks @KyleAMathews)

@KyleAMathews (Contributor) commented
> So this will test the cache but not redux state. Maybe it's worth building in something additional to test the node creation process that follows a similar approach?

Not sure what you mean by this?

@KyleAMathews (Contributor) commented
The raw node content is in the redux state (though not the AST, as you mention).

@DSchau (Contributor) commented Oct 29, 2018

> So this will test the cache but not redux state. Maybe it's worth building in something additional to test the node creation process that follows a similar approach?

@KyleAMathews yeah, whoops! I was thinking only the cache was touched here, but updating that env var will increase the size of the generated node too, so it'll actually test both. The other info is accurate :)

@rametta (Contributor, author) commented Oct 29, 2018

> So running, e.g., NUM_PAGES=1500 NUM_ROWS=100 yarn build will build out 1,500 pages with a fair amount of content

Ah okay, that's good to know, thanks!

@KyleAMathews (Contributor) commented
Even without extra rows, the markdown benchmark OOMed somewhere before 50k pages, as I mentioned above in #9370 (comment).

Just tested this PR with the same benchmark and 100k pages, and it finished just fine! Super exciting!

```
Kyles-MacBook-Pro:markdown (stringify-stream $)$ NUM_PAGES=100000 gatsby build
success open and validate gatsby-configs — 0.006 s
success load plugins — 0.109 s
success onPreInit — 0.715 s
success delete html and css files from previous builds — 0.004 s
success initialize cache — 22.712 s
success copy gatsby files — 0.058 s
success onPreBootstrap — 0.012 s
success source and transform nodes — 82.279 s
success building schema — 5.159 s
success createPages — 29.534 s
success createPagesStatefully — 0.017 s
success onPreExtractQueries — 0.000 s
success update schema — 12.211 s
success extract queries from components — 0.195 s
success run graphql queries — 507.226 s — 100001/100001 197.16 queries/second
success write out page data — 1.296 s
success write out redirect data — 0.001 s
success onPostBootstrap — 0.353 s

info bootstrap finished - 664.473 s

success Building production JavaScript and CSS bundles — 25.580 s
success Building static HTML for pages — 35.440 s — 100001/100001 3060.76 pages/second
info Done building in 726.477 sec
```

@pieh (Contributor) left a comment
This is awesome, @rametta!

I'll just update yarn.lock again (it seems yarn removed a bunch of integrity SHAs when you changed the dependency).

@pieh changed the title from "Use json stringify stream for large gatsby sites" to "feat(gatsby): use json-stream-stringify to serialize redux state" on Oct 29, 2018
@pieh merged commit c334075 into gatsbyjs:master on Oct 29, 2018
@gatsbot (bot) commented Oct 29, 2018

Holy buckets, @rametta — we just merged your PR to Gatsby! 💪💜

Gatsby is built by awesome people like you. Let us say “thanks” in two ways:

  1. We’d like to send you some Gatsby swag. As a token of our appreciation, you can go to the Gatsby Swag Store and log in with your GitHub account to get a coupon code good for one free piece of swag. (Currently we’ve got a couple t-shirts available, plus some socks that are really razzing our berries right now.)
  2. We just invited you to join the Gatsby organization on GitHub. This will add you to our team of maintainers. Accept the invite by visiting https://github.com/orgs/gatsbyjs/invitation. By joining the team, you’ll be able to label issues, review pull requests, and merge approved pull requests.

If there’s anything we can do to help, please don’t hesitate to reach out to us: tweet at @gatsbyjs and we’ll come a-runnin’.

Thanks again!

@rametta deleted the stringify-stream branch Oct 29, 2018
@kakadiadarpan mentioned this pull request Oct 30, 2018
jedrichards pushed a commit to jedrichards/gatsby that referenced this pull request Nov 1, 2018
@pieh (Contributor) commented Nov 5, 2018

Hey, I've noticed a few problems with this implementation:

  • In some scenarios when the build part is really fast (e.g. rendering just one static HTML page), gatsby build will finish before the save completes, leaving redux-state.json in an unfinished state (screenshot omitted).
  • There can also be scenarios where we try to do multiple writes in parallel (we don't return a promise, and even then lodash.debounce doesn't wait for promises, I think?).

What I think we would need to do (see the sketch after this list):
a. write to a tmp file and rename it to redux-state.json on completion (so we're not left with redux-state.json in an incomplete state, for example if gatsby develop exits mid-write)
b. in commands/build, actually wait for state saving to finish
c. investigate better debouncing that would queue the next write instead of running in parallel
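
A minimal sketch of all three fixes together (a hypothetical helper, not the implementation pieh eventually wrote):

```js
// Hypothetical sketch of fixes a/b/c above, not the eventual Gatsby code.
// Writes to a sibling tmp file, renames it on success, and chains
// overlapping save requests so writes never run in parallel.
const fs = require(`fs`)
const JsonStreamStringify = require(`json-stream-stringify`)

let pendingSave = Promise.resolve()

function saveState(state, target = `.cache/redux-state.json`) {
  // Fix c: chain onto the previous save instead of writing in parallel.
  pendingSave = pendingSave.then(() => {
    const tmp = `${target}.tmp-${process.pid}`
    return new Promise((resolve, reject) => {
      new JsonStreamStringify(state)
        .pipe(fs.createWriteStream(tmp))
        .on(`finish`, resolve)
        .on(`error`, reject)
    }) // Fix a: only swap the real file in once the write has completed.
      .then(() => fs.promises.rename(tmp, target))
  })
  // Fix b: callers (e.g. commands/build) can await this promise.
  return pendingSave
}
```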

@pieh (Contributor) commented Nov 10, 2018

I did some work on the above and noticed another problem. The way this handles cyclic refs (or rather, repeated refs to the same object) is cool (see gatsbyDependencies or miscDependencies, which are subsets of the allDependencies array): https://gist.github.com/pieh/5c2a52d4744bb20678db07934172f4d6

The problem is that we don't recreate those references when building state on the next run, which results in weird data and, in consequence, a different schema and failing queries. This seems to be mentioned in the json-stream-stringify docs:

> To restore cyclical structures; use Crockfords Retrocycle method on the parsed object (not included in this module).

So I think we should (at least temporarily) revert this and fix those issues before applying this change again, because right now the cache is basically broken.
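
To illustrate the failure mode (a hedged example; the exact pointer syntax depends on the json-stream-stringify version):

```js
// Hedged illustration of the broken round-trip; the exact $ref/JSONPath
// syntax varies by json-stream-stringify version.
const shared = { hello: `world` }
const state = { a: shared, b: shared } // b is the very same object as a

// json-stream-stringify serializes the repeated reference as a pointer,
// roughly: {"a":{"hello":"world"},"b":{"$ref":"$[\"a\"]"}}
// Plain JSON.parse does not understand pointers, so on the next run:
const restored = JSON.parse(`{"a":{"hello":"world"},"b":{"$ref":"$[\\"a\\"]"}}`)
console.log(restored.b) // { '$ref': '$["a"]' } - not the shared object
```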

@pieh (Contributor) commented Nov 10, 2018

I've run the same markdown benchmark with 100,000 pages as Kyle did, except this time I actually waited for redux-state.json to finish saving. Here are some results:

```
info bootstrap finished - 712.475 s

gatsby:redux/index saving state to /var/folders/7b/7_tqb8f50bx7jhyj0bn9r8v80000gn/T/5460e83930b39f96c859e29c3403d54c33bf3214 +0ms
success Building production JavaScript and CSS bundles — 26.208 s
success Building static HTML for pages — 31.109 s — 100001/100001 3551.78 pages/second
info Done building in 770.045 sec
gatsby:redux/index saved state to /var/folders/7b/7_tqb8f50bx7jhyj0bn9r8v80000gn/T/5460e83930b39f96c859e29c3403d54c33bf3214, moving to /Users/misiek/dev/gatsby/benchmarks/markdown/.cache/redux-state.json +8m
gatsby:redux/index saved moved to /Users/misiek/dev/gatsby/benchmarks/markdown/.cache/redux-state.json OK +1ms
info Done building + saving state in 1166.995 sec
```

So it took an additional ~6 minutes to wait for redux-state.json to finish saving (granted, it didn't run out of memory). At this point I really think we should revert this (or put it behind some feature flag), as in the end this degrades performance for 99% of users.

pieh added a commit to pieh/gatsby that referenced this pull request Nov 12, 2018
@pieh (Contributor) commented Nov 12, 2018

PR to revert this: #9896

pieh added a commit that referenced this pull request Nov 13, 2018:

The current implementation using `json-stream-stringify` is causing a lot of problems and I think this should be reverted - see #9370 (comment).

Shortly put: this does help with not getting OOM, but it essentially makes `.cache/redux-state.json` not work at all.
jlengstorf pushed a commit to jlengstorf/gatsby that referenced this pull request Nov 14, 2018 (same revert message as above)
@ivanoats (Contributor) commented Jan 10, 2019

What's the best way to incorporate these changes without forking Gatsby? A PR to make it an option for large sites? Fix the issues mentioned by @pieh above (the cyclic refs, and points a, b, c)?

gpetrioli pushed two commits to gpetrioli/gatsby that referenced this pull request Jan 22, 2019 (the original change and its revert)