feat(performance): change data.json to trie #12711

bennetthardwick · 2019-03-21T08:02:33Z

Description

Gatsby sites that have a lot of pages (50k - 60k) spend a lot of time during run time in the findPage method when prefetching links. This is due to the method looping through all members of the page-manifest.json file to search for matching paths.

To fix this, I've changed the page-manifests.json file to use a trie-based data structure. This allows for near-constant time lookups and path matching, rather than the O(n) complexity of an array search.

This could also be used in the future for lazy loading chunked page-manifest.json files for particular path segments, and also optimising the dev 404 page to a tree-view as it struggles to load when trying to render large quantities of pages.

One concern that I have regarding this change is that it removes the @reach/router match function. While the trie version does support splats (*) for matchPath, it doesn't support more complex routes such as route parameters (which are not currently used by Gatsby).

You can see the benefits of the change by using the gatsby-dev-cli with the gatsby-intense-benchmark.

Our company has been working on a Gatsby site which has 30k pages at the moment, with the intention to scale to 100k+. We believe improvements to runtime performance are crucial to meet this goal.

Related Issues

KyleAMathews · 2019-03-21T08:17:09Z

/cc @Moocar

KyleAMathews · 2019-03-21T08:39:44Z

This is super interesting! @Moocar is working on a related change to lazy load page metadata which will make this PR as it stands not mergeable but moving to a trie is definitely needed and we'd love your help moving forward to land it!

We don't use reach's matching algorithm at all basically so that's not a problem. The main thing we need to support is detecting if a route should be handled client side. Otherwise it's just exact matches.

bennetthardwick · 2019-03-21T08:50:14Z

That'll be awesome. Do you have any word on the state of @Moocar's change? Would it be worth turning that into an a kind of "page manifest improvements" epic and have me rebase my branch onto there?

muescha · 2019-03-21T09:51:56Z

Maybe better names (or use an object/struct):

c - child
v - value

c - path
v - value

bennetthardwick · 2019-03-21T11:25:10Z

Maybe better names (or use an object/struct):
c - child
v - value

c - path
v - value

I do agree that the key names aren't great. But I was thinking that it was just a serialised version of the sites pages it didn't necessarily need to be human readable and should be as small as possible. I'll wait to hear what's going to happen with this feature before making any more changes.

Moocar · 2019-03-21T21:43:47Z

Hey @bennetthardwick cool stuff! The work I'm doing is best summarized by #11982. In a nutshell, there won't be a global data.json anymore. Instead, we'll generate a single page-data.json for each page, and the front end will load them as required. As a result, we won't have a massive in memory array of pages to search through.

I've also separated match-paths into a separate collection (matchPaths.json). E.g:

{ 
  "foo/*": "/foo/page/",
  ...
}

So the logic to find a page is now as so:

Lookup matchPaths[path]. This returns an actual page path, or if not found, just use the original path.
Use that path to make a request to the server for the page-data.json. E.g /page-data/foo/page/page-data.json. This contains componentChunkName, query results etc.

I'm still working on it, but I think the above strategy will mean we won't need a more complex datastructure to lookup paths. The number of pages in memory will depend on how many links a user has clicked, and how many have been prefetched as a result. I'm also assuming that most sites have very few matchPaths. But if that assumption is wrong, then a trie based lookup could potentially be used for that part.

me4502 · 2019-03-22T03:09:51Z

won't need a more complex datastructure to lookup paths

It's worth noting that the trie data structure is literally built for this task. Whilst yes, it's not as necessary, it's very possible that someone will have a hundred or so links on a single page. The added complexity is negligible compared to the potential gains, even in a case where only linked pages are listed.

Moocar · 2019-03-22T05:36:08Z

@me4502 If a site defines hundreds of matchPath (regex) paths, then for sure, the trie will start to help. But for most sites, I'm assuming there will only ever be a few of those. In which case, in the new page-data.json code I'm introducing, a search for a page will result in iterating through match-paths.json entries, followed by a request to the server using the found path. I.e, there's no longer a server-side array of pages to have to search through. Only an array of { regex paths -> path } objects.

Let me know if I'm missing something. I'm definitely up for introducing a trie for the lookup of match-paths.json.

bennetthardwick · 2019-03-25T00:03:09Z

Hey @Moocar, I'm really excited about these changes because it addresses another one of our pain points (the size of the page-manifest.json file). Do you have a branch / fork up with even a rudimentary version of the changes that we can test to see if it makes the performance fix in this PR redundant?

Moocar · 2019-03-25T01:34:33Z

@bennetthardwick Great to hear! Hard to say when the first cut will be ready, but hopefully in the next week. I'll try and remember to ping this PR once it's up, as I'd really appreciate some testing on large sites.

bennetthardwick · 2019-04-01T00:53:02Z

Hey @Moocar, any updates on this? We set a site live last week with about 25k pages using this change and it fixed all our performance issues. You can test out the site here if you like. We still feel like using this datastructure is necessary even with smaller page-manifest.json files and are totally willing to work to make it compatible with your changes.

Moocar · 2019-04-01T02:13:14Z

@bennetthardwick It's working on my machine (TM). Just documenting, testing etc and then I'll be releasing a series of PRs to make the reviewer's job easier (it touches 58 files so far). Since it's such a large change to Gatsby, I don't imagine it will be merged soon. But I'll shoot through the branch name when it's ready for testing. Hopefully this week (for real this time).

Great to hear that the trie works for the large sites. My PR will remove the need for a pages-manifest.json entirely. Rather than go into depth here, I'll explain the reasoning properly in the PR.

Moocar · 2019-04-01T21:26:18Z

@bennetthardwick pinged you on #13004

bennetthardwick · 2019-04-02T00:14:02Z

This issue is fixed by #13004

bennetthardwick requested a review from a team as a code owner March 21, 2019 08:02

bennetthardwick changed the title ~~Feature/trie for page manifest~~ feat(performance): change data.json to trie Mar 21, 2019

bennetthardwick force-pushed the feature/trie-for-page-manifest branch from cd6de07 to b96a96b Compare March 25, 2019 00:10

Add trie and path matching functions

cd4c4ac

bennetthardwick force-pushed the feature/trie-for-page-manifest branch from b96a96b to cd4c4ac Compare March 27, 2019 01:03

Moocar mentioned this pull request Apr 1, 2019

[WIP] Replace global data.json with page-data.json per page #13004

Closed

30 tasks

bennetthardwick closed this Apr 2, 2019

me4502 mentioned this pull request May 13, 2019

O(pages * visible links) performance in prefetch code. #13979

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(performance): change data.json to trie #12711

feat(performance): change data.json to trie #12711

bennetthardwick commented Mar 21, 2019

KyleAMathews commented Mar 21, 2019

KyleAMathews commented Mar 21, 2019

bennetthardwick commented Mar 21, 2019

muescha commented Mar 21, 2019

bennetthardwick commented Mar 21, 2019 •

edited

Moocar commented Mar 21, 2019 •

edited

me4502 commented Mar 22, 2019

Moocar commented Mar 22, 2019

bennetthardwick commented Mar 25, 2019

Moocar commented Mar 25, 2019

bennetthardwick commented Apr 1, 2019

Moocar commented Apr 1, 2019

Moocar commented Apr 1, 2019

bennetthardwick commented Apr 2, 2019

feat(performance): change data.json to trie #12711

feat(performance): change data.json to trie #12711

Conversation

bennetthardwick commented Mar 21, 2019

Description

Related Issues

KyleAMathews commented Mar 21, 2019

KyleAMathews commented Mar 21, 2019

bennetthardwick commented Mar 21, 2019

muescha commented Mar 21, 2019

bennetthardwick commented Mar 21, 2019 • edited

Moocar commented Mar 21, 2019 • edited

me4502 commented Mar 22, 2019

Moocar commented Mar 22, 2019

bennetthardwick commented Mar 25, 2019

Moocar commented Mar 25, 2019

bennetthardwick commented Apr 1, 2019

Moocar commented Apr 1, 2019

Moocar commented Apr 1, 2019

bennetthardwick commented Apr 2, 2019

bennetthardwick commented Mar 21, 2019 •

edited

Moocar commented Mar 21, 2019 •

edited