Proposal: Deterministic loading of data from path #4626

chmac · 2018-03-20T17:19:27Z

tl;dr Could we remove the map of path to data file in app-*.js and instead try to fetch data by converting the link path to a data filename, handling 404s if it doesn't exist, etc?

History

I'm experimenting with a Gatsby site that has ~3.5k pages. The bundle sizes are like so:

2.7M app-*.js
240K chunk-manifest.json
300K commons-*.js

I haven't fully understood Gatsby's data structure, but checking the network tools shows that app-*.js is loaded as soon as the page finishes loading.

It seems like the current architecture uses webpack to build a map from every page path to the relevant data file on disk. This means that as the number of pages grows, the site's bundle size grows with it. I presume this approach will not scale very well for sites with 10k or 100k pages.

Idea

Would it be possible to deterministically map path to data file? Further, if we could do that (which I guess we could), would it be possible to skip the "list of pages" and just fetch data by transforming the path variable into its data file?
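To make the idea concrete, here is a rough sketch of what "transform the path into its data file and handle 404s" could look like. Everything in it is an assumption for illustration: `pathToDataFile`, `loadPageData`, the `/static/d/` prefix and the `--` separator are hypothetical names and conventions, not anything that exists in Gatsby. The fetch function is passed in so the sketch does not depend on any particular runtime.

```typescript
// Hypothetical sketch: none of these names or paths exist in Gatsby.
type JsonResponse = { ok: boolean; status: number; json(): Promise<unknown> };
type FetchLike = (url: string) => Promise<JsonResponse>;

// Turn a link path such as "/blog/hello-world/" into a data filename.
// The prefix and separator are assumptions; a real scheme would also
// have to guarantee that two different paths never map to the same file.
function pathToDataFile(path: string): string {
  const trimmed = path.replace(/^\/+|\/+$/g, "");
  const name = trimmed === "" ? "index" : trimmed.replace(/\//g, "--");
  return `/static/d/${name}.json`;
}

// Fetch the data for a path. A 404 means "no such page" and yields null;
// any other non-OK status is treated as an error.
async function loadPageData(path: string, fetcher: FetchLike): Promise<unknown> {
  const response = await fetcher(pathToDataFile(path));
  if (response.status === 404) return null;
  if (!response.ok) throw new Error(`Unexpected status ${response.status}`);
  return response.json();
}
```

Because the filename is derived only from the path, there is no room in it for a content hash, which is the trade-off raised under "Extra thoughts".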

Extra thoughts

I'm super new to Gatsby core development, with zero experience with webpack, so my idea might be total nonsense in that context. If so, apologies, feel free to close.
This would remove the content based hash from the filename, which is a very useful technique for cache busting, etc.

pieh · 2018-03-20T18:17:38Z

AFAIK the only reason hashes need to be there is exactly for cache busting. I am actually currently working on speeding up the build process and am touching this part of the code (but no real change there: the map to files is still dumped to a single file). I'd be interested to hear ideas on how we could handle that so it would scale better (but not by removing hashes from files :) ).

vinniejames · 2018-03-20T20:50:54Z

This would be a nice feature as well. In the docs, it looks like maybe it could override the $path variable in the template, instead using the filesystem path.

chmac · 2018-03-20T21:18:58Z

@pieh Is your work in a branch somewhere that I could use as a starting point to try and dig into this?

I fear that this comes from deep in Gatsby's architecture and so could be difficult to change / refactor. I'm definitely willing to dig into it and see what I can figure out.

pieh · 2018-03-20T21:44:45Z

@chmac My branch is here https://github.com/pieh/gatsby/tree/json-loader

More context about it: together with @m-allanson we are working on speeding up the build and develop processes by moving the bundling and loading of JSON data (results of queries) out of webpack and doing that directly in Gatsby. So this doesn't actually focus on reducing app bundle size, but I do make a small change to async-requires.js, which is really just one big map to resources (to page/layout components and to JSON files with results of queries). Here's WIP PR #4555 with code from @m-allanson (related to the develop part); my part (build speed-up) is mentioned in the first comment there along with a link to the branch/changes.

If you wish to dive in the code here are some entry points you might want to check (links to current master branch as I don't change too much in this department and my branch is still WIP):

pages-writer.js which writes out async-requires.js
loader.js which handles loading data and components needed for current page from async-requires.js during runtime

Before making changes in code we should probably figure out how we could design it so it doesn't increase build time too much and produces more manageable bundles.

chmac · 2018-03-22T18:59:26Z

@pieh Awesome, thanks for the tips, that's a huge help. I've spent a few days deep diving into this stuff.

Here's what I've understood (please correct me if I've misunderstood any of this stuff, it's a real possibility).

The json-loader branch moves data out of webpack
A new jsonName property is added to pages
A map of jsonName to .json file name is written to static-data-paths.json
This data is built into app-*.js and embedded in every page HTML body

Goals

I'd suggest the following goals:

Remove the list of pages from the HTML and the core javascript. These files should not grow with every new page.
Retain content hashes in all non-HTML for cache busting.

Idea

Here's one idea about how we could move towards those goals.

Move static-data-paths.json out of the built HTML
Move it out of app-*.js as well
Create an async function to take a jsonName and return a JSON filename
Refactor to use this new function

The async function would allow us to fetch static-data-paths.json on demand. It would also allow us to come up with more complicated schemes in the future. For example, we could shard the file, splitting it into chunks, and only fetch each chunk as it's required. However, switching from this being sync to async might make refactoring difficult.
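A minimal sketch of that async function, assuming the map is a plain object of jsonName to JSON filename. `createJsonFileResolver`, `getJsonFileName` and the injected `loadDataPaths` loader are hypothetical names for illustration; how the map is actually fetched (for example, a request for static-data-paths.json) is left to the caller.

```typescript
// Hypothetical sketch; the map shape and loader signature are assumptions.
type DataPaths = Record<string, string>; // jsonName -> JSON filename
type LoadDataPaths = () => Promise<DataPaths>;

// Returns an async lookup that loads the map on first use and then
// reuses the same promise, so the map is requested at most once.
function createJsonFileResolver(loadDataPaths: LoadDataPaths) {
  let cached: Promise<DataPaths> | undefined;
  return async function getJsonFileName(jsonName: string): Promise<string | undefined> {
    if (cached === undefined) cached = loadDataPaths();
    const paths = await cached;
    return paths[jsonName]; // undefined when the jsonName is unknown
  };
}
```

Since callers only see a function from jsonName to a promise of a filename, the loader behind it could later be swapped for a sharded one without touching the call sites.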

Very open to any feedback.

pieh · 2018-03-22T19:15:26Z

@chmac I was thinking about this a little and for initial load and mounting react components we don't need that map in app bundle or in html - we can delay loading that after initial component is mounted.

Not sure how we could approach chunking that map in the next step - how would we know what chunk we need to load to get path to data for given page?

chmac · 2018-03-22T19:40:43Z

Yes, loading it later makes sense. That will make it async anyway, which paves the way for fancier stuff.

Sharding, I'm thinking back to my WordPress days and database sharding on MySQL. We used to use a remarkably simple scheme that looked something like this:

const calculateShardNameForId = (shardLength: number, id: string): string => md5(id).slice(0, shardLength)

Any hashing algo would work, and the only thing we need to know is the shardLength. In an ideal world, we could decide that at build time. We could also create a map of shardName -> shardJsonFile so we could content hash the JSON files. Then inside the app-*.js file we only need to keep our map of shards.

A shardLength of 2 would give us 256 shards (two hex characters, 16 × 16), so 100k pages would be about 390 per shard. Even 500k pages would only be about 2k per shard.
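Here is a self-contained sketch of that scheme and its arithmetic. Since any hashing algo would work, it swaps md5 for a small FNV-1a style hash over the string's UTF-16 code units so that no imports are needed; `fnv1aHex`, `shardCount` and `averagePagesPerShard` are hypothetical helper names.

```typescript
// FNV-1a style 32-bit hash, rendered as 8 lowercase hex characters.
// Swapped in for md5 purely so this sketch needs no imports.
function fnv1aHex(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// The shard name is the first shardLength hex characters of the hash.
function calculateShardNameForId(shardLength: number, id: string): string {
  return fnv1aHex(id).slice(0, shardLength);
}

// Each hex character has 16 possible values, so there are 16^shardLength shards.
function shardCount(shardLength: number): number {
  return Math.pow(16, shardLength);
}

// Average pages per shard, assuming the hash spreads ids evenly.
function averagePagesPerShard(pageCount: number, shardLength: number): number {
  return Math.floor(pageCount / shardCount(shardLength));
}
```

With these helpers, `shardCount(2)` is 256, `averagePagesPerShard(100000, 2)` is 390 and `averagePagesPerShard(500000, 2)` is 1953. Note that the 8-character output limits this particular hash to a shardLength of at most 8.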

There are probably lots of potential optimisations, but that was the general approach I was thinking about.

KyleAMathews · 2018-03-22T20:05:38Z

Lazy loading of the paths to page json files is the obvious next step. Sharding would be nice for really large sites. Ideally you'd shard by something like path names so a shard for /blog/*, as those are likely to be needed together. I don't think we need to worry about that right away though as with v2, the amount of data needed per page is something like 10x smaller so sharding would only be helpful for sites with 25k+ pages.
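As a rough illustration of sharding by path name rather than by hash, the sketch below keys each page on its first path segment, so everything under /blog/ lands in one shard. `shardKeyForPath`, `groupPathsByShard` and the "root" key for top-level pages are assumptions made for the example only.

```typescript
// Hypothetical sketch: shard key = first path segment, "root" for "/".
function shardKeyForPath(path: string): string {
  const segments = path.split("/").filter((segment) => segment !== "");
  return segments[0] ?? "root";
}

// Group page paths by shard key, e.g. at build time before writing
// one JSON map per shard.
function groupPathsByShard(paths: string[]): Record<string, string[]> {
  const shards: Record<string, string[]> = {};
  for (const path of paths) {
    const key = shardKeyForPath(path);
    (shards[key] = shards[key] ?? []).push(path);
  }
  return shards;
}
```

Unlike hash-based sharding, shard sizes here depend entirely on how the site's paths are laid out, so a site with most pages under one prefix would see little benefit.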

chmac · 2018-03-28T15:05:30Z

OK, sounds like we're reaching consensus around the plan:

Lazy load the map of path to JSON file names
Think about sharding (chunking) it at some point in the future

How do we move forward? There's currently work being done on switching from webpack to our own JSON pipeline in #4555 (described somewhat in #3575). Do we fold the lazy loading into one of those tickets? Create a new ticket for the lazy loading idea?

The original idea I proposed in this ticket doesn't make sense, since we'd break the cache busting / content hashing.

pieh · 2018-03-28T15:08:21Z

Lazy loading of the paths to JSON file names, and of the map that specifies which components (pages/layouts) are used for which paths, is pretty much done in #4715 (I should probably have pinged here when I posted it).

pieh · 2018-03-28T15:08:55Z

It's for v2 and it's based on #4555

chmac · 2018-03-28T16:10:19Z

@pieh Awesome! v2 is looking better and better!

In that case, I'll close this issue, and I'll create a new one about sharding data.json (linking to here for history). I assume sharding is a low priority upgrade to consider at some point in the future.

pieh · 2018-03-28T16:17:25Z

Just to give more info: when I run my tests against https://github.com/freeCodeCamp/guides (~2800 pages), the gzipped "webpackified" data.json is 141KB. This surely won't scale very nicely for sites with 100 000 pages, but up to 10 000 I think this shouldn't be that much of an issue.

vinniejames · 2018-04-03T21:34:52Z

@pieh thanks for cracking this! I'm curious if there is a planned/estimated release date for v2?

m-allanson · 2018-04-04T08:15:26Z

@vinniejames There's no date but you can track progress over at https://github.com/gatsbyjs/gatsby/projects/2

pieh self-assigned this Mar 20, 2018

pieh mentioned this issue Mar 20, 2018

Any way to reduce the size of __webpack_require__() in the bundle? #4625

Closed

chmac mentioned this issue Mar 22, 2018

Proposal: Replace _.kebabCase() with a different function #4637

Closed

chmac closed this as completed Mar 28, 2018

chmac mentioned this issue Mar 28, 2018

Split page metadata so can lazy load it and reduce the initial JS #4746

Closed

Moocar mentioned this issue Feb 22, 2019

Don't build with webpack on content changes (no more data.json) #11982

Closed
