Don't build with webpack on content changes (no more data.json) #11982

Moocar · 2019-02-22T02:10:21Z

Summary

I've been looking into ways to get rid of data.json. It's a complicated problem so any feedback on the below would be appreciated.

Background

Gatsby writes a data.json (also called pages-manifest) file on every build that maps pages to their componentChunkName and dataPath. This is imported by async-requires.js, which is in turn imported by production-app.js.

The upsides of this approach are:

Once loaded, the running Gatsby app knows the component and dataPath for every page. It can then look at all Link elements, retrieve their "to" path, and start prefetching the data for each page the user might click on.
Since data.json is part of webpack's build, it is hashed according to content. So all data is 100% cacheable.
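
A simplified sketch of the lookup described above, assuming data.json boils down to a map from page path to componentChunkName and dataPath. The real file layout, the `pagesManifest` object, and the `resourcesToPrefetch` helper are hypothetical and only illustrate the idea:

```javascript
// Hypothetical, simplified stand-in for data.json: page path -> resources.
const pagesManifest = {
  "/": { componentChunkName: "component---src-pages-index-js", dataPath: "d/index" },
  "/blog/": { componentChunkName: "component---src-templates-blog-js", dataPath: "d/blog" },
};

// Given the "to" paths of the Link elements on the current page, return the
// resources that could be prefetched. Unknown and repeated paths are skipped.
function resourcesToPrefetch(manifest, linkPaths) {
  const seen = new Set();
  const resources = [];
  for (const path of linkPaths) {
    const known = Object.prototype.hasOwnProperty.call(manifest, path);
    if (!known || seen.has(path)) continue;
    seen.add(path);
    resources.push({ path, ...manifest[path] });
  }
  return resources;
}
```

Note that the whole map has to be in memory before the first lookup can happen, which is the cost the downsides below describe.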

A global data.json has downsides though:

For gatsbyjs.org, with almost 2000 pages, this file is already over 500kb (116kb zipped). This entire file must be loaded before prefetching can occur.
Any time any graphql query is rerun, the data.json file changes, which results in a new webpack build, which is slooooooow. It also makes future work on incremental builds impossible.

Solution: compilation-specific non-cached page-manifest

No more data.json, so async-requires.js only contains import statements for components. Most sites with lots of pages use templated components, so this should be a small file.
Before building page HTML, we produce a page-manifest file for each page. It contains the componentChunkName and dataPath for the page. It is named using the webpack compilation hash and the page's JSON name: [webpack-compilation-hash]/[jsonName]-manifest.json
```
// public/11593e3b3ac85436984a/path---index--723-manifest.json
{
  "componentChunkName": "component---src-pages-index-js-",
  "dataPath": "621/path---index-723..."
}
```
cache-dir/static-entry.js no longer references data.json. Instead, it reads the page's manifest file. It also adds the webpack compilation hash to the window CDATA.
When navigating to a page, the Gatsby app behaves the same, except that when resolving a page's component/dataPath it makes a request for the page manifest file. It knows the compilation hash and JSON name, so it can construct the manifest's URL. Once it has the manifest, it can use the dataPath to download the page's data.
When a query result changes, we only need to update that page's manifest. Since webpack has not been rerun, the data in the query result will be compatible with the running browser's component implementation (right? might need to double check this).
When a webpack rebuild occurs, we must generate new page manifest files for every page.
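
The client-side resolution described in the steps above could look roughly like this. It is a sketch under assumptions, not the proposed implementation: `manifestUrl`, `loadPage`, and the injected `fetchJson` are hypothetical names, and the `/static/d/` URL prefix for page data is assumed.

```javascript
// Build the manifest URL from the compilation hash and the page's JSON name,
// following the [webpack-compilation-hash]/[jsonName]-manifest.json pattern.
function manifestUrl(compilationHash, jsonName) {
  return `/${compilationHash}/${jsonName}-manifest.json`;
}

// Resolve a page in two steps: manifest first, then the data it points to.
// `fetchJson` is injected so the sketch does not depend on a specific HTTP
// client; it takes a URL and returns a promise of the parsed JSON.
async function loadPage(compilationHash, jsonName, fetchJson) {
  const manifest = await fetchJson(manifestUrl(compilationHash, jsonName));
  // Assumption: page data is served under /static/d/<dataPath>.json.
  const data = await fetchJson(`/static/d/${manifest.dataPath}.json`);
  return { componentChunkName: manifest.componentChunkName, data };
}
```

The two sequential awaits are the "double the requests" cost listed under cons below.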

pros

webpack is no longer rerun on every data change
no global data.json
- Less data usage for very large sites
- Better browser CPU performance for very large sites (iterating through that data.json could become CPU intensive)
- Initial load of Gatsby doesn't require loading entire data.json. So prefetching can occur earlier.

cons

For a page with many links, prefetching will result in double the requests: one for the metadata, and another for the actual data. But HTTP/2 multiplexing helps with this. It's also occurring in the background, so the user shouldn't notice any difference unless clicking around very fast.
Navigating to a non-linked URL will mean the browser must make 2 requests for that page: one for the manifest, and another for the dataPath.
We still need to rebuild all html pages any time there's a code change. But it's an improvement over doing it any time there's any content change.

NOTE: for sites with 100'000 pages, we'll end up with a compilation directory with 100'000 files. So we might need to use a similar approach to static/d to bucket those files under subdirectories.
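
One possible way to bucket the manifests, sketched under assumptions: the bucket name is taken from the first hex characters of an MD5 digest of the JSON name. Whether static/d works exactly this way is not confirmed here, and `bucketedManifestPath` is a hypothetical helper.

```javascript
const crypto = require("crypto");

// Spread manifest files across subdirectories so that no single directory
// holds every page. With the default prefix length of 3 hex characters there
// are 16^3 = 4096 buckets, so 100'000 pages average about 24 files per bucket.
function bucketedManifestPath(compilationHash, jsonName, prefixLength = 3) {
  const digest = crypto.createHash("md5").update(jsonName).digest("hex");
  const bucket = digest.slice(0, prefixLength);
  return `${compilationHash}/${bucket}/${jsonName}-manifest.json`;
}
```

The browser can recompute the bucket from the JSON name alone, so no extra lookup table is needed, provided the same hash function is available client side.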

Shout out to @pieh and @KyleAMathews for the ideas/brainstorming

Alternatives

drop compilation-specific component

In this approach, we'd just save the page-manifest without the compilation hash in the filename. The problem is that the page-manifest lists the componentChunkName, not a link to the actual component. So if a build occurs in the background and the component changes, the frontend won't know to refresh. It might then try to load the new query result into an old component, resulting in all kinds of errors (e.g. field is undefined on result.data.node.field).

A middle ground is to include the compilation hash as a field in each manifest. That way, at the very least, the frontend can compare the manifest's compilation hash to its own to see if a rebuild has occurred. The benefit of this would be less disk usage, since we wouldn't have multiple manifests per page.
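
A minimal sketch of that comparison, assuming the manifest carries the hash in a field named `webpackCompilationHash` (a hypothetical name) and that a full page reload is an acceptable way to pick up new code:

```javascript
// Decide what to do with a freshly fetched manifest. If it was written by a
// different webpack compilation than the one running in the browser, the
// component code may be out of date, so signal a reload instead of rendering.
function actionForManifest(manifest, runningCompilationHash) {
  if (manifest.webpackCompilationHash !== runningCompilationHash) {
    return "reload";
  }
  return "render";
}
```

In a browser, the `"reload"` case could be handled by navigating with `window.location` rather than client-side routing, so the new bundles are fetched.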

Calculate linked pages server side

Using react-tree-walker, while server-side rendering components, we can walk the page component and find all the Link elements. We therefore know at build time which pages a page links to. In theory, we could therefore construct a linked-pages.json for each page. For each linked page, it would include the componentChunkName, dataPath, and linkedPagesPath. The entire file would be content hashed and immutable, so that it was specific to this build. And then, when we build each page's HTML, we could reference its linkedPagesPath in the window data.

Now, when we navigate to a new page, the app simply looks up the appropriate entry in linkedPages, and knows exactly which component and dataPath to load. Even better, it also knows the navigated-to page's linkedPagesPath, so it can repeat the process.

Unfortunately, I can't figure out a way to reliably build the linked-pages.json files for each path. The problem is that when you draw out a graph of the links on a site, there are many back references, e.g. index -> blogs -> blog -> index (via a header div, for example). So it's a cyclic graph: each file's content hash would depend on the hashes of the files it links to, and with circular dependencies there is no order in which to build them.
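
To make the back-reference problem concrete, here is a small depth-first search over the example link graph. The `links` object and `findCycle` are illustrative only; any cycle it returns is a set of pages whose content-hashed linked-pages files would depend on each other.

```javascript
// Link graph from the example: index -> blogs -> blog -> index.
const links = {
  index: ["blogs"],
  blogs: ["blog"],
  blog: ["index"],
};

// Return one cycle as a list of nodes (first node repeated at the end),
// or null if the graph is acyclic.
function findCycle(graph) {
  const state = new Map(); // node -> "visiting" | "done"
  const stack = [];

  function visit(node) {
    state.set(node, "visiting");
    stack.push(node);
    for (const next of graph[node] || []) {
      if (state.get(next) === "visiting") {
        // Back reference to a node still on the stack: that is a cycle.
        return [...stack.slice(stack.indexOf(next)), next];
      }
      if (!state.has(next)) {
        const cycle = visit(next);
        if (cycle) return cycle;
      }
    }
    stack.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of Object.keys(graph)) {
    if (!state.has(node)) {
      const cycle = visit(node);
      if (cycle) return cycle;
    }
  }
  return null;
}
```

For `links`, `findCycle` returns `["index", "blogs", "blog", "index"]`, so no topological build order exists for those three files.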

path-dependent buckets of page metadata

@kyle wrote an awesome PR (#6651) that produces buckets of page manifests depending on path segments. It would work great if we were going to run webpack on every data change, but that's what we're trying to avoid.

Related Issues


wardpeet · 2019-02-22T07:08:21Z

I really like this approach and we can probably even make it better in the long run.

> Before building page HTML, we produce a page-manifest file. It contains the componentChunk name and dataPath for the page. It is named with a webpack compilation hash and json name. [webpack-compilation-hash]/[jsonName]-manifest.json

This one is really hard to solve if your build system does not have incremental builds itself. Perhaps Webpack will support this soon or a bazel rule that can help us with this. With webpack, we might be able to save the dep tree ourselves and manually traverse it to see if work needs to be done (unsure if this will work).

> Navigating to a non-linked URL will mean the browser must make 2 requests for that page. One for the manifest, and another for the dataPath

On Gatsby cloud, we could even use h2 push to reduce the network time so they get both requests at once.

> Calculate linked pages server side

We can still do something similar on the client to actually prefetch manifest files depending on all a tags that are not external so we can fetch the component files when people hover.

Moocar · 2019-02-24T04:47:34Z

@wardpeet Just trying to understand your first comment. Maybe I'm missing something, but webpack provides a compilation hash for the entire build in the stats.json output. So every time there is a new build, there should be a new hash. This is what I'm proposing we use. To be clear, I'm not relying on any incremental functionality in webpack itself. Any time there is any source file change anywhere, we'll rebuild everything. We just won't have to rebuild on data changes.

> We can still do something similar on the client to actually prefetch manifest files depending on all a tags that are not external so we can fetch the component files when people hover.

Yep, exactly. I propose we keep Gatsby's existing functionality here, i.e. it triggers a prefetch whenever a Link mounts. But where it currently looks up data.json to find the component and dataPath, it will instead request the page-manifest.

KyleAMathews · 2019-03-02T19:03:58Z

We could also use something like https://github.com/gaearon/react-side-effect for the linked-pages.json. The way it works: while a page is rendered, each <Link> component passes its pathname upwards as it renders, and the pathnames can be collected once the page finishes.

Moocar · 2019-03-03T21:10:40Z

@KyleAMathews nice, we'd still have the back references problem, but react-side-effect would definitely be a more reliable way of tracking the page linking in the first place.

Moocar · 2019-03-05T01:48:48Z

@KyleAMathews raised the idea of including query results in the new page-manifest file. So instead of first requesting the page-manifest to get the dataPath and then requesting the actual data, the browser would only need to download the page-manifest, which would look something like:

{
  "componentChunkName": "...",
  "data": {
    "allMarkdownRemark": {}
  },
  "pageContext": {
    "path": "/blah"
  }
}

The downside obviously is that query results would no longer be infinitely cacheable. But, since page-manifests force the browser to download something each time, it's not so bad.

On the plus side, the server will be able to monitor whether the underlying file has changed and send back 304 responses to avoid unnecessary downloads.
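
A sketch of how such revalidation could work on the server, using an ETag derived from the manifest's content. This is a simplified illustration, not a description of any particular host: `etagFor` and `respond` are hypothetical helpers, and real servers also handle weak validators and multiple values in If-None-Match.

```javascript
const crypto = require("crypto");

// Derive a strong ETag from the manifest's bytes.
function etagFor(body) {
  return `"${crypto.createHash("sha1").update(body).digest("hex")}"`;
}

// Answer a request for a page-manifest. `ifNoneMatch` is the value of the
// request's If-None-Match header, if any. When it matches the current ETag
// the file has not changed, so a bodiless 304 is enough.
function respond(body, ifNoneMatch) {
  const etag = etagFor(body);
  if (ifNoneMatch === etag) {
    return { status: 304, headers: { ETag: etag }, body: null };
  }
  // "no-cache" lets the browser store the response but forces revalidation.
  return { status: 200, headers: { ETag: etag, "Cache-Control": "no-cache" }, body };
}
```

The browser still makes a request per manifest, but an unchanged page costs only headers rather than the full query result.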

wardpeet · 2019-03-22T10:52:53Z

@Moocar sorry for taking so long to get back to you but compilation hash seems great, I was looking at file hashes, so don't mind that comment 😄 Looking forward to some code!

m-allanson · 2019-06-13T14:20:42Z

Fixed by #13004 and released in gatsby@2.9.x. 💪🎉

Moocar changed the title from "Don't run webpack on content changes (no more data.json)" to "Don't build with webpack on content changes (no more data.json)" · Feb 22, 2019

wardpeet added the type: maintenance, no triage, and not stale labels · Feb 22, 2019

djfarly mentioned this issue · Mar 4, 2019: Find a way to not duplicate pages djfarly/gatsby-plugin-graphql-preview#3 (closed)

pieh mentioned this issue · Mar 15, 2019: pages-manifest[chunk].js should be cacheable / renamed, otherwise every single page' html get's changed on every build. Causing huge deploy upload times on netlify for example. #12591 (closed)

Moocar mentioned this issue · Mar 21, 2019: feat(performance): change data.json to trie #12711 (closed)

Moocar mentioned this issue · Mar 27, 2019: refactor(gatsby): enable running of static/page queries separately #12891 (closed)

pieh mentioned this issue · Mar 29, 2019: Authentication support #1100 (closed)

Moocar mentioned this issue · Apr 1, 2019: [WIP] Replace global data.json with page-data.json per page #13004 (closed, 30 tasks)

jackhair mentioned this issue · Apr 30, 2019: Missing resources for / #11524 (closed)

Moocar mentioned this issue · May 23, 2019: Add prefetch for linked component modules to page HTML #14262 (closed)

m-allanson closed this as completed · Jun 13, 2019

KyleAMathews mentioned this issue · Jul 18, 2019: Can you opt-out of webpackCompilationHash (to improve Netlify build times) #15872 (closed)

LekoArts removed the not stale label · May 26, 2020
