feat(gatsby): use json-stream-stringify to serialize redux state #9370

Merged: 8 commits into gatsbyjs:master from stringify-stream, Oct 29, 2018

Conversation

@rametta (Contributor) commented Oct 24, 2018

This PR changes the way the internal Gatsby store is saved to disk, to account for sites that have very large payloads.

Closes issue #9362
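
The change, in essence, is to stream the serialized state to disk instead of building one giant JSON string in memory first. A minimal sketch of that idea (not the exact PR diff; the import shape of json-stream-stringify also varies by version):

```js
// Minimal sketch of the streaming approach, not the exact PR diff.
// Older json-stream-stringify versions expose the stream class as the
// default export; newer ones export { JsonStreamStringify }.
const fs = require(`fs`)
const JsonStreamStringify = require(`json-stream-stringify`)

function saveState(state) {
  return new Promise((resolve, reject) => {
    new JsonStreamStringify(state) // emits the JSON text in chunks
      .pipe(fs.createWriteStream(`.cache/redux-state.json`))
      .on(`finish`, resolve) // file fully flushed to disk
      .on(`error`, reject)
  })
}
```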

@DSchau (Contributor) left a comment

Well that was quick!

Would you be able to test this with your local site and validate that it improves the scalability of Gatsby? Additionally, testing on a smaller site and validating similar performance would be nice to have, but I don't anticipate this causing a lot of issues.

One other note: json-stringify-safe handled circular references, whereas I don't think this new approach does. I'm not sure whether that's a huge concern, but it's worth considering!

@KyleAMathews (Contributor) commented
We got a fair number of bug reports about circular references in the past, so we definitely need to handle that.

@rametta (Contributor, author) commented Oct 24, 2018

@DSchau Yes, definitely. I tested with my local build: with the old approach I was getting the error mentioned in the issue I opened (#9362), and now with this change it works with any object size. (The file my build was generating was around 500 MB.)

@DSchau (Contributor) commented Oct 24, 2018

@rametta looks like we can keep the circular-reference-safe json-stringify-safe and pass it as an option to streaming-json-stringify to replace the JSON.stringify call; in fact, it uses that by default. So I'd recommend trying out streaming-json-stringify (instead of stream-json-stringify) because it's:

  • more downloaded (but also older?)
  • circular-reference safe, by default

Are you able to check that out and validate it still works with your site?

@rametta (Contributor, author) commented Oct 24, 2018

@DSchau That's a good idea. I will try that approach and let you guys know.

@pieh (Contributor) commented Oct 24, 2018

Here's another JSON streaming utility that has some handling for circular references: https://www.npmjs.com/package/bfj
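
For reference, bfj's write helper returns a promise that resolves once the file has been written. A hedged sketch (option names vary across bfj versions, and `store` here stands in for the redux store):

```js
// Hedged sketch of persisting state with bfj; `store` is assumed to be
// the redux store. bfj.write(path, data[, options]) returns a promise.
const bfj = require(`bfj`)

bfj
  .write(`.cache/redux-state.json`, store.getState())
  .then(() => console.log(`redux state persisted`))
  .catch(error => console.error(`failed to persist redux state`, error))
```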

@DSchau (Contributor) commented Oct 24, 2018

@pieh even better 👌 Thanks for that link!

@rametta (Contributor, author) commented Oct 25, 2018

@KyleAMathews I agree that slowing down the build is not ideal, but I think it's worth it. Here is why:

  • It will allow bigger sites to be built
  • A slightly slower build is better than no build at all
  • Using streams is async, so it does not block the event loop the way the current implementation does
  • It also safeguards against circular references
  • Small to medium sites will barely be affected

Here are my findings after experimenting with different stringify methods:

50,000 records (~134 MB file)

| Method | Time to complete | Circular refs |
| --- | --- | --- |
| native JSON.stringify | 1.10 s (sync) | No |
| json-stringify-safe | 1.19 s (sync) | Yes |
| bfj | 6.9 s (async) | Some |
| json-stream-stringify | 7.4 s (async) | Yes |

100,000 records (~251 MB file)

| Method | Time to complete | Circular refs |
| --- | --- | --- |
| native JSON.stringify | 2.02 s (sync) | No |
| json-stringify-safe | 2.22 s (sync) | Yes |
| bfj | 13.36 s (async) | Some |
| json-stream-stringify | 19.3 s (async) | Yes |

500,000 records (~1.24 GB file)

| Method | Time to complete | Circular refs |
| --- | --- | --- |
| native JSON.stringify | CRASH | No |
| json-stringify-safe | CRASH | Yes |
| bfj | 62 s (async) | Some |
| json-stream-stringify | 194 s (async) | Yes |
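
A rough sketch of the kind of harness that produces numbers like these (hypothetical; the actual benchmark script wasn't shared in the thread, and the record count and padding are illustrative):

```js
// Hypothetical benchmark harness; the real script wasn't shared.
// Generates N padded records, then times a sync serializer against a
// streaming one writing to disk.
const fs = require(`fs`)
const JsonStreamStringify = require(`json-stream-stringify`) // default export in v1

const records = Array.from({ length: 50000 }, (_, i) => ({
  id: i,
  payload: `x`.repeat(2048), // pad each record to inflate the file size
}))

// Sync baseline: builds the entire string in memory before writing.
console.time(`JSON.stringify`)
fs.writeFileSync(`out-sync.json`, JSON.stringify({ records }))
console.timeEnd(`JSON.stringify`)

// Streaming: writes chunks as they are produced, never holding the
// whole document in memory at once.
console.time(`json-stream-stringify`)
new JsonStreamStringify({ records })
  .pipe(fs.createWriteStream(`out-stream.json`))
  .on(`finish`, () => console.timeEnd(`json-stream-stringify`))
```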

@pieh (Contributor) commented Oct 25, 2018

But does it actually slow down the builds, or just slow down persisting the redux state to a file? The bfj docs mention:

> BFJ yields frequently to avoid monopolising the event loop, interrupting its own execution to let other event handlers run

So this might not be a bad idea and might actually speed up builds. Plus, there are options we can play with so it isn't incredibly slow.

@rametta (Contributor, author) commented Oct 25, 2018

Unfortunately, in my tests BFJ was not handling all types of circular references, so I switched my PR to use json-stream-stringify instead, which seems to handle everything flawlessly. It's a bit slower, but since it's async it should not be a problem.

@pieh (Contributor) commented Oct 25, 2018

Is json-stream-stringify slower than bfj, or than json-stringify-safe?

@rametta (Contributor, author) commented Oct 25, 2018

> Is json-stream-stringify slower than bfj, or than json-stringify-safe?

Both; I've updated my little benchmark table above with the new lib.

@DSchau (Contributor) commented Oct 26, 2018

@rametta would you be able to run performance benchmarks with a sample site (e.g. something in benchmarks) to ensure this doesn't introduce a performance regression?

@rametta (Contributor, author) commented Oct 27, 2018

@DSchau Here are my results with the create-pages benchmark. They show that build time is essentially unchanged, which is good.

(After every test, I deleted the .cache and public folders manually.)

yarn build, create-pages benchmark (times in seconds)

| Branch | Pages | Build 1 | Build 2 | Build 3 | Avg |
| --- | --- | --- | --- | --- | --- |
| v2.0.31 | 5,000 | 8.52 | 7.44 | 7.04 | 7.66 |
| v2.0.31 | 10,000 | 12.15 | 9.94 | 10.82 | 10.97 |
| v2.0.31 | 50,000 | 47.30 | 39.86 | 37.85 | 41.67 |
| v2.0.31 | 100,000 | 86.01 | 81.78 | 78.20 | 81.99 |
| This PR | 5,000 | 7.42 | 7.84 | 7.69 | 7.65 |
| This PR | 10,000 | 11.67 | 9.56 | 11.20 | 10.81 |
| This PR | 50,000 | 45.55 | 35.13 | 36.01 | 38.89 |
| This PR | 100,000 | 80.09 | 80.77 | 81.82 | 80.89 |

@KyleAMathews (Contributor) commented
Sorry for not being more explicit. Can you use the markdown benchmark instead? The create-pages benchmark basically doesn't add anything to the cache, so it's not a good test for this.

@KyleAMathews (Contributor) commented
E.g. I just tried a 50k markdown site and it OOMed when it tried to write out the cache file.

@rametta (Contributor, author) commented Oct 27, 2018

@KyleAMathews No problem. I ran the markdown benchmark a few times and realized I would have to make each node's content VERY large, so I replaced the content property of every node with a very large snippet of lorem ipsum that has to be written to .cache/redux-state.json. Even when creating just 5k pages the old way, I ran into this error:

```
info bootstrap finished - 82.756 s

error Invalid string length

  RangeError: Invalid string length

  - JSON.stringify

  - stringify.js:5 stringify
    [test_stream]/[json-stringify-safe]/stringify.js:5:15

  - index.js:90 saveState
    [test_stream]/[gatsby]/dist/redux/index.js:90:23

  - index.js:118 emitter.on
    [test_stream]/[gatsby]/dist/redux/index.js:118:5

  - mitt.js:1
    [test_stream]/[mitt]/dist/mitt.js:1:268

  - Array.map
```

Then when I switched to my PR, it built with no problem. (Granted, I had to provide 12 GB of memory to my node command, but at least that mimics the project I am working on now that led me to find this issue.) When my build passed, I had a redux-state.json file of about 990 MB, whereas the current Gatsby would crash somewhere around a 500 MB file size, because that seems to be the limit of the JSON.stringify / json-stringify-safe implementation.
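
A hypothetical reconstruction of that benchmark tweak (the exact change wasn't shared; page count and snippet size are illustrative):

```js
// Hypothetical reconstruction of the benchmark tweak described above:
// generate markdown pages whose bodies are large lorem-ipsum blobs so
// the serialized redux state balloons.
const fs = require(`fs`)

const lorem = `Lorem ipsum dolor sit amet, consectetur adipiscing elit. `.repeat(
  10000
) // roughly half a megabyte of body text per page

for (let i = 0; i < 5000; i++) {
  fs.writeFileSync(
    `./src/pages/page-${i}.md`,
    `---\ntitle: Page ${i}\n---\n\n${lorem}`
  )
}
```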

@KyleAMathews (Contributor) commented
Awesome! And also no perf regressions?

@rametta (Contributor, author) commented Oct 27, 2018

I don't think so; if there are any, I don't know how to find them.

@KyleAMathews (Contributor) commented
I meant: could you do what you did in #9370 (comment) with the markdown benchmark?

@rametta (Contributor, author) commented Oct 28, 2018

@KyleAMathews here are the markdown benchmarks with a large amount of data passed into each node.

Markdown benchmark (times in seconds)

| Branch | Pages | Build 1 | Build 2 | Build 3 | Avg | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| v2.0.31 | 500 | 22.29 | 21.81 | 21.55 | 21.88 | - |
| v2.0.31 | 1,000 | 45.92 | 40.11 | 39.77 | 41.93 | - |
| v2.0.31 | 1,500 | 58.46 | 57.46 | 57.62 | 57.85 | - |
| This PR | 500 | 20.37 | 20.44 | 20.55 | 20.45 | -1.43 |
| This PR | 1,000 | 37.88 | 37.44 | 37.01 | 37.44 | -4.49 |
| This PR | 1,500 | 55.23 | 54.24 | 54.13 | 54.53 | -3.32 |

@DSchau (Contributor) commented Oct 29, 2018

@rametta one neat trick I built into that markdown pages benchmark is that you can conditionally (with environment variables) change the amount of content for each page :)

So here's how that works:

  • I expose a NUM_ROWS variable, which will basically create a huge markdown table, which tends to lead to a huge markdown AST
  • So running, e.g., NUM_PAGES=1500 NUM_ROWS=100 yarn build will build out 1,500 pages with a fair amount of content

~~So this will test the cache but not redux state. Maybe it's worth building in something additional to test the node creation process that follows a similar approach?~~

This tests both, so consider checking it out! (Thanks @KyleAMathews)

@KyleAMathews (Contributor) commented
> So this will test the cache but not redux state. Maybe it's worth building in something additional to test the node creation process that follows a similar approach?

Not sure what you mean by this?

@KyleAMathews (Contributor) commented
The raw node content is in the redux state (though not the AST, as you mention).

@DSchau (Contributor) commented Oct 29, 2018

> So this will test the cache but not redux state. Maybe it's worth building in something additional to test the node creation process that follows a similar approach?

@KyleAMathews yeah, whoops! I was thinking only the cache was touched here, but updating that env var will increase the size of the generated node too, so it'll actually test both. The other info is accurate :)

@rametta (Contributor, author) commented Oct 29, 2018

> So running, e.g., NUM_PAGES=1500 NUM_ROWS=100 yarn build will build out 1,500 pages with a fair amount of content

Ah okay, that's good to know, thanks!

@KyleAMathews (Contributor) commented
Even without extra rows, the markdown benchmark OOMed somewhere before 50k pages, as I mentioned above in #9370 (comment).

Just tested this PR with the same benchmark and 100k pages, and it finished just fine! Super exciting!

```
Kyles-MacBook-Pro:markdown (stringify-stream $)$ NUM_PAGES=100000 gatsby build
success open and validate gatsby-configs — 0.006 s
success load plugins — 0.109 s
success onPreInit — 0.715 s
success delete html and css files from previous builds — 0.004 s
success initialize cache — 22.712 s
success copy gatsby files — 0.058 s
success onPreBootstrap — 0.012 s
success source and transform nodes — 82.279 s
success building schema — 5.159 s
success createPages — 29.534 s
success createPagesStatefully — 0.017 s
success onPreExtractQueries — 0.000 s
success update schema — 12.211 s
success extract queries from components — 0.195 s
success run graphql queries — 507.226 s — 100001/100001 197.16 queries/second
success write out page data — 1.296 s
success write out redirect data — 0.001 s
success onPostBootstrap — 0.353 s

info bootstrap finished - 664.473 s

success Building production JavaScript and CSS bundles — 25.580 s
success Building static HTML for pages — 35.440 s — 100001/100001 3060.76 pages/second
info Done building in 726.477 sec
```

@pieh (Contributor) left a comment
This is awesome, @rametta!

I'll just update yarn.lock again (it seems yarn removed a bunch of integrity SHAs when you changed the dependency).

@pieh changed the title from "Use json stringify stream for large gatsby sites" to "feat(gatsby): use json-stream-stringify to serialize redux state" on Oct 29, 2018
@pieh merged commit c334075 into gatsbyjs:master on Oct 29, 2018
@gatsbot (bot) commented Oct 29, 2018

Holy buckets, @rametta — we just merged your PR to Gatsby! 💪💜

Gatsby is built by awesome people like you. Let us say “thanks” in two ways:

  1. We’d like to send you some Gatsby swag. As a token of our appreciation, you can go to the Gatsby Swag Store and log in with your GitHub account to get a coupon code good for one free piece of swag. (Currently we’ve got a couple t-shirts available, plus some socks that are really razzing our berries right now.)
  2. We just invited you to join the Gatsby organization on GitHub. This will add you to our team of maintainers. Accept the invite by visiting https://github.com/orgs/gatsbyjs/invitation. By joining the team, you’ll be able to label issues, review pull requests, and merge approved pull requests.

If there’s anything we can do to help, please don’t hesitate to reach out to us: tweet at @gatsbyjs and we’ll come a-runnin’.

Thanks again!

@rametta deleted the stringify-stream branch Oct 29, 2018
@kakadiadarpan mentioned this pull request Oct 30, 2018
jedrichards pushed a commit to jedrichards/gatsby that referenced this pull request Nov 1, 2018
@pieh (Contributor) commented Nov 5, 2018

Hey, I've noticed a few problems with this implementation:

  • In some scenarios when the build part is really fast (e.g. rendering just one static HTML page), gatsby build will finish before the save completes, leaving redux-state.json in an unfinished state (screenshot omitted).
  • There can also be scenarios where we try to do multiple writes in parallel (we don't return a promise, and even then lodash.debounce doesn't wait for promises, I think?).

What I think we would need to do (see the sketch after this list):
a. write to a tmp file and rename it to redux-state.json on completion (so we're not left with redux-state.json in an incomplete state, for example if gatsby develop exits mid-write)
b. in commands/build, actually wait for state saving to finish
c. investigate better debouncing that would queue the next write instead of running in parallel
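
A minimal sketch of all three fixes together (a hypothetical helper, not the implementation pieh eventually wrote):

```js
// Hypothetical sketch of fixes a/b/c above, not the eventual Gatsby code.
// Writes to a sibling tmp file, renames it on success, and chains
// overlapping save requests so writes never run in parallel.
const fs = require(`fs`)
const JsonStreamStringify = require(`json-stream-stringify`)

let pendingSave = Promise.resolve()

function saveState(state, target = `.cache/redux-state.json`) {
  // Fix c: chain onto the previous save instead of writing in parallel.
  pendingSave = pendingSave.then(() => {
    const tmp = `${target}.tmp-${process.pid}`
    return new Promise((resolve, reject) => {
      new JsonStreamStringify(state)
        .pipe(fs.createWriteStream(tmp))
        .on(`finish`, resolve)
        .on(`error`, reject)
    }) // Fix a: only swap the real file in once the write has completed.
      .then(() => fs.promises.rename(tmp, target))
  })
  // Fix b: callers (e.g. commands/build) can await this promise.
  return pendingSave
}
```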

@pieh (Contributor) commented Nov 10, 2018

I did some work on the above and noticed another problem. The way this handles cyclic refs (or rather, repeated refs to the same object) is cool (see gatsbyDependencies or miscDependencies, which are subsets of the allDependencies array): https://gist.github.com/pieh/5c2a52d4744bb20678db07934172f4d6

The problem is that we don't recreate those references when building state on the next run, which results in weird data and, in consequence, a different schema and failing queries. This seems to be mentioned in the json-stream-stringify docs:

> To restore cyclical structures; use Crockfords Retrocycle method on the parsed object (not included in this module).

So I think we should (at least temporarily) revert this and fix those issues before applying this change again, because right now the cache is basically broken.
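
To illustrate the failure mode (a hedged example; the exact pointer syntax depends on the json-stream-stringify version):

```js
// Hedged illustration of the broken round-trip; the exact $ref/JSONPath
// syntax varies by json-stream-stringify version.
const shared = { hello: `world` }
const state = { a: shared, b: shared } // b is the very same object as a

// json-stream-stringify serializes the repeated reference as a pointer,
// roughly: {"a":{"hello":"world"},"b":{"$ref":"$[\"a\"]"}}
// Plain JSON.parse does not understand pointers, so on the next run:
const restored = JSON.parse(`{"a":{"hello":"world"},"b":{"$ref":"$[\\"a\\"]"}}`)
console.log(restored.b) // { '$ref': '$["a"]' } - not the shared object
```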

@pieh (Contributor) commented Nov 10, 2018

I've run the same markdown benchmark with 100,000 pages as Kyle did, except this time I actually waited for redux-state.json to finish saving. Here are some results:

```
info bootstrap finished - 712.475 s

gatsby:redux/index saving state to /var/folders/7b/7_tqb8f50bx7jhyj0bn9r8v80000gn/T/5460e83930b39f96c859e29c3403d54c33bf3214 +0ms
success Building production JavaScript and CSS bundles — 26.208 s
success Building static HTML for pages — 31.109 s — 100001/100001 3551.78 pages/second
info Done building in 770.045 sec
gatsby:redux/index saved state to /var/folders/7b/7_tqb8f50bx7jhyj0bn9r8v80000gn/T/5460e83930b39f96c859e29c3403d54c33bf3214, moving to /Users/misiek/dev/gatsby/benchmarks/markdown/.cache/redux-state.json +8m
gatsby:redux/index saved moved to /Users/misiek/dev/gatsby/benchmarks/markdown/.cache/redux-state.json OK +1ms
info Done building + saving state in 1166.995 sec
```

So it took an additional ~6 minutes to wait for redux-state.json to finish saving (granted, it didn't run out of memory). At this point I really think we should revert this (or put it behind some feature flag), as in the end this degrades performance for 99% of users.

pieh added a commit to pieh/gatsby that referenced this pull request Nov 12, 2018
@pieh (Contributor) commented Nov 12, 2018

PR to revert this: #9896

pieh added a commit that referenced this pull request Nov 13, 2018:

The current implementation using `json-stream-stringify` is causing a lot of problems and I think this should be reverted - see #9370 (comment).

Shortly put: this does help with not getting OOM, but it essentially makes `.cache/redux-state.json` not work at all.
jlengstorf pushed a commit to jlengstorf/gatsby that referenced this pull request Nov 14, 2018 (same revert message as above)
@ivanoats (Contributor) commented Jan 10, 2019

What's the best way to incorporate these changes without forking Gatsby? A PR to make it an option for large sites? Fix the issues mentioned by @pieh above (the cyclic refs, and points a, b, c)?

gpetrioli pushed two commits to gpetrioli/gatsby that referenced this pull request Jan 22, 2019 (the original change and its revert)