Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(performance): change data.json to trie #12711

Conversation

bennetthardwick
Copy link
Contributor

Description

Gatsby sites that have a lot of pages (50k - 60k) spend a lot of time during run time in the findPage method when prefetching links. This is due to the method looping through all members of the page-manifest.json file to search for matching paths.

To fix this, I've changed the page-manifests.json file to use a trie-based data structure. This allows for near-constant time lookups and path matching, rather than the O(n) complexity of an array search.

This could also be used in the future for lazy loading chunked page-manifest.json files for particular path segments, and also optimising the dev 404 page to a tree-view as it struggles to load when trying to render large quantities of pages.

One concern that I have regarding this change is that it removes the @reach/router match function. While the trie version does support splats (*) for matchPath, it doesn't support more complex routes such as route parameters (which are not currently used by Gatsby).

You can see the benefits of the change by using the gatsby-dev-cli with the gatsby-intense-benchmark.

Our company has been working on a Gatsby site which has 30k pages at the moment, with the intention to scale to 100k+. We believe improvements to runtime performance are crucial to meet this goal.

Related Issues

@bennetthardwick bennetthardwick requested a review from a team as a code owner March 21, 2019 08:02
@bennetthardwick bennetthardwick changed the title Feature/trie for page manifest feat(performance): change data.json to trie Mar 21, 2019
@KyleAMathews
Copy link
Contributor

/cc @Moocar

@KyleAMathews
Copy link
Contributor

This is super interesting! @Moocar is working on a related change to lazy load page metadata which will make this PR as it stands not mergeable but moving to a trie is definitely needed and we'd love your help moving forward to land it!

We don't use reach's matching algorithm at all basically so that's not a problem. The main thing we need to support is detecting if a route should be handled client side. Otherwise it's just exact matches.

@bennetthardwick
Copy link
Contributor Author

That'll be awesome. Do you have any word on the state of @Moocar's change? Would it be worth turning that into an a kind of "page manifest improvements" epic and have me rebase my branch onto there?

@muescha
Copy link
Contributor

muescha commented Mar 21, 2019

Maybe better names (or use an object/struct):

c - child
v - value

c - path
v - value

@bennetthardwick
Copy link
Contributor Author

bennetthardwick commented Mar 21, 2019

Maybe better names (or use an object/struct):

c - child
v - value

c - path
v - value

I do agree that the key names aren't great. But I was thinking that it was just a serialised version of the sites pages it didn't necessarily need to be human readable and should be as small as possible. I'll wait to hear what's going to happen with this feature before making any more changes.

@Moocar
Copy link
Contributor

Moocar commented Mar 21, 2019

Hey @bennetthardwick cool stuff! The work I'm doing is best summarized by #11982. In a nutshell, there won't be a global data.json anymore. Instead, we'll generate a single page-data.json for each page, and the front end will load them as required. As a result, we won't have a massive in memory array of pages to search through.

I've also separated match-paths into a separate collection (matchPaths.json). E.g:

{ 
  "foo/*": "/foo/page/",
  ...
}

So the logic to find a page is now as so:

  1. Lookup matchPaths[path]. This returns an actual page path, or if not found, just use the original path.
  2. Use that path to make a request to the server for the page-data.json. E.g /page-data/foo/page/page-data.json. This contains componentChunkName, query results etc.

I'm still working on it, but I think the above strategy will mean we won't need a more complex datastructure to lookup paths. The number of pages in memory will depend on how many links a user has clicked, and how many have been prefetched as a result. I'm also assuming that most sites have very few matchPaths. But if that assumption is wrong, then a trie based lookup could potentially be used for that part.

@me4502
Copy link
Contributor

me4502 commented Mar 22, 2019

won't need a more complex datastructure to lookup paths

It's worth noting that the trie data structure is literally built for this task. Whilst yes, it's not as necessary, it's very possible that someone will have a hundred or so links on a single page. The added complexity is negligible compared to the potential gains, even in a case where only linked pages are listed.

@Moocar
Copy link
Contributor

Moocar commented Mar 22, 2019

@me4502 If a site defines hundreds of matchPath (regex) paths, then for sure, the trie will start to help. But for most sites, I'm assuming there will only ever be a few of those. In which case, in the new page-data.json code I'm introducing, a search for a page will result in iterating through match-paths.json entries, followed by a request to the server using the found path. I.e, there's no longer a server-side array of pages to have to search through. Only an array of { regex paths -> path } objects.

Let me know if I'm missing something. I'm definitely up for introducing a trie for the lookup of match-paths.json.

@bennetthardwick
Copy link
Contributor Author

Hey @Moocar, I'm really excited about these changes because it addresses another one of our pain points (the size of the page-manifest.json file). Do you have a branch / fork up with even a rudimentary version of the changes that we can test to see if it makes the performance fix in this PR redundant?

@Moocar
Copy link
Contributor

Moocar commented Mar 25, 2019

@bennetthardwick Great to hear! Hard to say when the first cut will be ready, but hopefully in the next week. I'll try and remember to ping this PR once it's up, as I'd really appreciate some testing on large sites.

@bennetthardwick
Copy link
Contributor Author

Hey @Moocar, any updates on this? We set a site live last week with about 25k pages using this change and it fixed all our performance issues. You can test out the site here if you like. We still feel like using this datastructure is necessary even with smaller page-manifest.json files and are totally willing to work to make it compatible with your changes.

@Moocar
Copy link
Contributor

Moocar commented Apr 1, 2019

@bennetthardwick It's working on my machine (TM). Just documenting, testing etc and then I'll be releasing a series of PRs to make the reviewer's job easier (it touches 58 files so far). Since it's such a large change to Gatsby, I don't imagine it will be merged soon. But I'll shoot through the branch name when it's ready for testing. Hopefully this week (for real this time).

Great to hear that the trie works for the large sites. My PR will remove the need for a pages-manifest.json entirely. Rather than go into depth here, I'll explain the reasoning properly in the PR.

@Moocar
Copy link
Contributor

Moocar commented Apr 1, 2019

@bennetthardwick pinged you on #13004

@bennetthardwick
Copy link
Contributor Author

This issue is fixed by #13004

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants