
fix(gatsby): Chunk nodes when serializing redux to prevent OOM #21555

Merged (5 commits) on Feb 25, 2020

Conversation

@pvdz (Contributor) commented Feb 18, 2020

We are using `v8.serialize` to write and read the redux state. This is faster than `JSON.stringify`/`JSON.parse`. Unfortunately, as reported in #17233, this can lead to a fatal error when the contents of the redux state are too big to be serialized to a Buffer (hard max of 2GB). We also hit this problem on large sites, for example one with a million small md pages.

The solution is to shard the `nodes` property, which holds all the page data. In this change I've added a simple heuristic to determine the max chunk size (mind you, currently that's basically `Infinity`). It serializes about 11 individual nodes, measures their size, and based on the biggest node determines how many nodes would fit in 1.5GB.
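Roughly, the heuristic looks like this (a simplified sketch; the function name, sampling details, and budget handling here are illustrative, not the literal implementation):

```ts
import * as v8 from "v8"

// Simplified sketch: serialize a small sample of nodes (~11), take the biggest
// serialized size, and compute how many such nodes fit in roughly 1.5GB.
function guessSafeChunkSize(nodes: Map<string, unknown>): number {
  const values = Array.from(nodes.values())
  if (values.length === 0) return 1

  const sampleCount = Math.min(11, values.length)
  const step = Math.max(1, Math.floor(values.length / sampleCount))
  let maxSize = 1
  for (let i = 0; i < values.length; i += step) {
    maxSize = Math.max(maxSize, v8.serialize(values[i]).length)
  }

  // Pessimistically assume every node is as big as the largest sample and
  // stay well under the ~2GB Buffer ceiling.
  const targetBytes = 1.5 * 1024 * 1024 * 1024
  return Math.max(1, Math.floor(targetBytes / maxSize))
}
```

With a 1.5GB budget and typically small nodes, this currently works out to a huge chunk size in practice, hence the `Infinity` remark above.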

The serialization process is updated to no longer put the `nodes` in the main redux file; instead they are sharded over a few dedicated files. When reading the state from cache, these files are all read and their contents are put together into a single Map again. If there are no node chunk files this step does nothing, so it's even backwards compatible.
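In sketch form, the sharded write and read could look like this (file names and helper names are illustrative assumptions, not the exact Gatsby internals):

```ts
import * as v8 from "v8"
import * as path from "path"
import * as fs from "fs-extra"

// Sketch of the sharded write: slice the nodes Map into chunks and write each
// chunk to its own file. The `redux.node.state_<i>` name is illustrative.
function writeNodeChunks(
  nodes: Map<string, unknown>,
  folder: string,
  chunkSize: number
): void {
  const entries = Array.from(nodes.entries())
  for (let i = 0; i * chunkSize < entries.length; i++) {
    const chunk = entries.slice(i * chunkSize, (i + 1) * chunkSize)
    fs.writeFileSync(
      path.join(folder, `redux.node.state_${i}`),
      v8.serialize(chunk)
    )
  }
}

// Sketch of the read path: deserialize every chunk file and rebuild one Map.
// With no chunk files this simply yields an empty Map (backwards compatible).
function readNodeChunks(folder: string): Map<string, unknown> {
  if (!fs.existsSync(folder)) return new Map()

  const entries: Array<[string, unknown]> = []
  for (const file of fs.readdirSync(folder)) {
    if (file.startsWith(`redux.node.state_`)) {
      const chunk = v8.deserialize(
        fs.readFileSync(path.join(folder, file))
      ) as Array<[string, unknown]>
      entries.push(...chunk)
    }
  }
  return new Map(entries)
}
```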

Also fixes the TS type for a page node in redux and exports it so we can reference it explicitly.

@blainekasten (Contributor) previously approved these changes Feb 18, 2020 and left a comment:

I like this approach a lot. Great job Peter 🚢

@pvdz (Contributor, Author) commented Feb 19, 2020

Added the mechanism for transactional state. This required a bit of refactoring; please have a look.

In general it works as explained before: a tmp folder is created, the new cache is written to that folder, the old cache folder is renamed to a backup location, the tmp folder is moved into the cache folder position, and then an attempt is made to drop the old folder.
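A rough sketch of that swap, assuming `fs-extra` and placeholder names (the `.tmp`/`.bak` suffixes and the `writeCache` callback are illustrative):

```ts
import * as fs from "fs-extra"

// Hedged sketch of the transactional cache swap described above; folder names
// and the write callback are placeholders, not the actual Gatsby code.
function transactionalWrite(
  reduxCacheFolder: string,
  writeCache: (targetFolder: string) => void
): void {
  const tmpFolder = `${reduxCacheFolder}.tmp`
  const bakFolder = `${reduxCacheFolder}.bak`

  // 1. Write the new cache into a fresh tmp folder.
  fs.emptyDirSync(tmpFolder)
  writeCache(tmpFolder)

  // 2. Move the current cache out of the way.
  if (fs.existsSync(reduxCacheFolder)) {
    fs.moveSync(reduxCacheFolder, bakFolder, { overwrite: true })
  }

  // 3. Promote the tmp folder to the real cache location.
  fs.moveSync(tmpFolder, reduxCacheFolder)

  // 4. Best-effort cleanup of the old cache; a failure here is not fatal.
  try {
    fs.removeSync(bakFolder)
  } catch {
    // Leave the backup folder behind rather than failing the build.
  }
}
```

In this sketch, a crash midway leaves either the old cache or the new one fully in place, plus at worst a leftover `.tmp`/`.bak` folder to clean up later.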

Legacy files are properly read and deleted when a new cache is created. Upgrading should work transparently. Downgrading will not work, because an older version will not know to find the new redux cache folder, so it will just miss the cache. I think that's fine, both for us and our users.

Updated some tests to take new things into account. Feels like I've been mocking half the fs-extra lib now :p

@pvdz (Contributor, Author) commented Feb 19, 2020

(Shuffled the functions around to make the TS lint happy. It does not seem to like calls to sibling functions at the same level unless the called function is declared first ... odd coming from the JS world, where function declarations are hoisted.)
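For illustration, this is the kind of pattern the linter flags (presumably `@typescript-eslint/no-use-before-define`), even though it runs fine because function declarations are hoisted:

```ts
function persistAll(): void {
  writeChunks() // lint: 'writeChunks' was used before it was defined
}

function writeChunks(): void {
  // declared after its caller, which the rule flags by default
}
```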

@pvdz (Contributor, Author) commented Feb 19, 2020

Ignore the next few pushes. I'm adding some logging to debug the Windows run on CircleCI 🤡

@pvdz force-pushed the fix-v8-serialize branch 2 times, most recently from a32a3bb to 938eca5 on February 20, 2020 10:49
@pvdz (Contributor, Author) commented Feb 20, 2020

Ok. All tests are passing. Ready to go

@pieh (Contributor) previously approved these changes Feb 21, 2020 and left a comment:

The code looks good.

It would be nice to have some tests that assert behaviour when some fs operation fails, but I'm leaving the decision about their feasibility and long-term maintenance to you. Additionally, I left one inline comment about a potential low-effort extra sanity check.

packages/gatsby/src/redux/persist.ts (resolved)
@pvdz (Contributor, Author) commented Feb 21, 2020

Ok, I've added the else case. I've not added more tests because at that point it feels like we'd be testing the mocks rather than actual fs operations.

@pvdz (Contributor, Author) commented Feb 21, 2020

The return type is incorrect, but a quick attempt to fix and propagate the type led into a deep rabbit hole, so for now I'm doing the same as the parent calling code.

@pieh (Contributor) previously approved these changes Feb 21, 2020 and left a comment:

Looks good!

@pvdz (Contributor, Author) commented Feb 21, 2020

Ah, that else now breaks the tests because they omit the nodes property due to loki ... ok ok.

@pieh (Contributor) commented Feb 21, 2020

> Ah, that else now breaks the tests because they omit the nodes property due to loki ... ok ok.

Ehh, didn't think about loki ... but also great that there were tests that caught it (uff).

I don't know if it's worth trying to keep that code path then. We could make it not show the warning / return an empty state for loki, but it gets convoluted.

@pieh (Contributor) commented Feb 21, 2020

A different and maybe more robust (still not super robust) check could be to store the number of nodes in the rest file and compare it to the number of nodes that were read from the chunked node files.
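A sketch of that check, with hypothetical field and function names:

```ts
// Hypothetical shape: the "rest" file would record how many nodes were written.
interface RestState {
  numNodes?: number
}

// If the recorded count does not match what was restored from the chunk files,
// treat the cache as corrupt and let the caller fall back to a cold build.
function assertNodeCount(rest: RestState, nodes: Map<string, unknown>): void {
  if (typeof rest.numNodes === `number` && rest.numNodes !== nodes.size) {
    throw new Error(
      `Expected ${rest.numNodes} nodes from cache but restored ${nodes.size}`
    )
  }
}
```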

@pvdz (Contributor, Author) commented Feb 21, 2020

The breakage is artificial; it's because the test removes the nodes property before testing it.

I can just add a loki if-branch for the test and have the redux branch not remove those props. I think that'd be fine?

@wardpeet (Contributor) left a comment:

I only have nitpicks and questions; no need to block merging this PR on them.

packages/gatsby/src/redux/persist.ts (outdated, resolved)
packages/gatsby/src/redux/persist.ts (outdated, resolved)
packages/gatsby/src/redux/persist.ts (resolved)

function safelyRenameToBak(reduxCacheFolder: string): string {
// Basically try to work around the potential of previous renamed caches
// not being removed for whatever reason. _That_ should not be a blocker.
Contributor:
Why not try to remove the bak file? Why would we want to keep other bak files around if they didn't get deleted? (Permission issues would be weird because we're creating them.)

Contributor:
We do try to remove it later on (check the `// Now try to yolorimraf the old cache folder` section). If something goes wrong while writing out the new cache, we could potentially try to restore the previous one (instead of blowing the entire cache away). Though I don't think we try to restore the bak cache in this case right now (and I also don't know if it's safe to do - it should be).

It's not only permission issues here - fs can be finicky (I found multiple fs problems, mostly on Windows, and they were intermittent - just restarting something was "solving" the problem).

Contributor:
Isn't that why we use fs-extra, to fix Windows retries?

pvdz (Contributor, Author):
As mentioned above, the only reason is to try and do the least amount of dangerous work until the new cache is persisted.

I'm not entirely convinced this makes much of a difference, because if a fatal error happens the cache is in a questionable state anyway, but I think it helps?

The drawback is temporarily requiring double the space. That much is true.

I guess nothing was depending on this yet but the `nodes` property is definitely a `Map`, not an array.

Also exporting the Node type because we'll need this a few times later.

Because the write is no longer done as one atomic file, the process will now write the redux cache to its own `redux` folder. When writing a new cache it will prepare the new cache in a tmp folder first, then move the existing `redux` folder to a temp location, move the new folder into the `redux` position, and then try to drop the old folder. This is about as transactional as you can get and should leave the cache in either a stale, empty, or updated state, but never in a partial state.
@pieh (Contributor) left a comment:

Verified by testing against .org

Making 3 builds (https://github.com/gatsbyjs/gatsby/compare/test-fix-v8-serialize-on-dot-org):
1. Cold cache
2. Make a change to a React component (so the existing cache is reused)
3. Make a change to gatsby-node.js (invalidating the cache)
