Don't build with webpack on content changes (no more data.json) #11982

Closed
Moocar opened this issue Feb 22, 2019 · 7 comments
Labels
type: maintenance An issue or pull request describing a change that isn't a bug, feature or documentation change

Comments

@Moocar
Contributor

Moocar commented Feb 22, 2019

Summary

I've been looking into ways to get rid of data.json. It's a complicated problem so any feedback on the below would be appreciated.

Background

Gatsby writes a data.json (also called pages-manifest) file on every build that maps pages to their componentChunkName and dataPath. This is imported by async-requires.js, which is in turn imported by production-app.js.

The upsides of this approach are:

  • Once loaded, the running Gatsby app knows the component and dataPath for every page. It can then look at all Link elements, retrieve their "to" path, and start prefetching the data for each page the user might click on.
  • Since data.json is part of webpack's build, it is hashed according to content. So all data is 100% cacheable.
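To make the lookup described above concrete, here is a minimal sketch of resolving prefetch targets from a global manifest. The object shape (`pages`, `dataPaths`), the sample values, and the function name `resolvePrefetchTargets` are assumptions for illustration; the real data.json layout may differ.

```javascript
// Hypothetical shape of the global manifest; the real data.json may differ.
const dataJson = {
  pages: [
    { path: "/", componentChunkName: "component---src-pages-index-js", jsonName: "index" },
    { path: "/blog/", componentChunkName: "component---src-templates-blog-js", jsonName: "blog" },
  ],
  dataPaths: { index: "621/path---index", blog: "4f2/path---blog" },
};

// Given the "to" paths of the Link elements on a page, work out which
// component chunk and data file each known target needs.
function resolvePrefetchTargets(manifest, linkPaths) {
  const byPath = new Map(manifest.pages.map((page) => [page.path, page]));
  return linkPaths
    .filter((path) => byPath.has(path)) // skip links to unknown pages
    .map((path) => {
      const page = byPath.get(path);
      return {
        path,
        componentChunkName: page.componentChunkName,
        dataPath: manifest.dataPaths[page.jsonName],
      };
    });
}

const targets = resolvePrefetchTargets(dataJson, ["/blog/", "/missing/"]);
```

Note that nothing can be resolved until the whole manifest has been downloaded and parsed, which is the cost the downsides below describe.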

A global data.json has downsides though:

  • For gatsby.org, with almost 2000 pages, this file is already over 500kb (116kb zipped). This entire file must be loaded before prefetching can occur.
  • Any time any graphql query is rerun, the data.json file changes, which results in a new webpack build, which is slooooooow. It also makes future work on incremental builds impossible.

Solution: compilation-specific non-cached page-manifest

  • no more data.json, therefore async-requires.js only contains import statements for components. Most sites with lots of pages use templated components so this should be a small file.

  • Before building page html, we produce a page-manifest file. It contains the componentChunk name and dataPath for the page. It is named with the webpack compilation hash and the page's json name: [webpack-compilation-hash]/[jsonName]-manifest.json

    // public/11593e3b3ac85436984a/path---index--723-manifest.json
    {
      "componentChunkName": "component---src-pages-index-js-",
      "dataPath": "621/path---index-723..."
    }
  • cache-dir/static-entry.js no longer references data.json. Instead, it reads the page's manifest file. It also adds the webpack compilation hash to the window CDATA.

  • When navigating to a page, the gatsby app behaves the same, except that when resolving a page's component/dataPath, it makes a request for the page manifest file. It knows the compilation hash and json name, so it can locate the manifest. Once it has the manifest, it can use the dataPath to download the page's data.

  • When a query result changes, we only need to update that page's manifest. Since webpack has not been rerun, the data in the query result will be compatible with the running browser's component implementation (right? might need to double check this).

  • When a webpack rebuild occurs, we must generate new page manifest files for every page.
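The two-step resolution in this proposal could be sketched roughly as follows. This is an illustration under assumptions, not the eventual implementation: `pageManifestUrl`, `loadPageResources`, the injected `fetchJson` helper, and the `/static/d/` URL prefix for data files are all hypothetical names chosen for the example.

```javascript
// Build the manifest URL from the naming scheme proposed above:
// [webpack-compilation-hash]/[jsonName]-manifest.json
function pageManifestUrl(compilationHash, jsonName) {
  return `/${compilationHash}/${jsonName}-manifest.json`;
}

// Two requests per page: first the small manifest, then the data it points to.
// `fetchJson` is injected (hypothetical) so the sketch is not tied to a
// particular network API. The "/static/d/" prefix is an assumption.
async function loadPageResources(fetchJson, compilationHash, jsonName) {
  const manifest = await fetchJson(pageManifestUrl(compilationHash, jsonName));
  const data = await fetchJson(`/static/d/${manifest.dataPath}.json`);
  return { componentChunkName: manifest.componentChunkName, data };
}
```

Because the manifest URL depends only on the compilation hash and the json name, a query result can change (and its manifest be rewritten) without touching anything webpack produced.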

pros

  • webpack is no longer rerun on every data change
  • no global data.json
    • Less data usage for very large sites
    • Better browser CPU performance for very large sites (iterating through that data.json could become CPU intensive)
    • Initial load of Gatsby doesn't require loading entire data.json. So prefetching can occur earlier.

cons

  • For a page with many links, prefetching will result in double the requests: one for the metadata, and another for the actual data. But http/2 multiplexing helps with this. It's also occurring in the background, so the user shouldn't notice any difference unless clicking around very fast.
  • Navigating to a non-linked URL will mean the browser must make 2 requests for that page. One for the manifest, and another for the dataPath
  • We still need to rebuild all html pages any time there's a code change. But it's an improvement over doing it any time there's any content change.

NOTE: for sites with 100'000 pages, we'll end up with a compilation directory with 100'000 files. So we might need to use a similar approach to static/d to bucket those files under sub directories.
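One way the bucketing mentioned in this note might look is sketched below. It does not claim to match what static/d actually does: the FNV-1a hash, the fixed 256 buckets, and the name `bucketedManifestPath` are assumptions made for the example.

```javascript
// FNV-1a (32-bit) over UTF-16 code units. Used only to spread file names
// evenly across buckets; any stable hash would do.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

const BUCKET_COUNT = 256; // assumption: two hex digits per bucket directory

// Place each manifest under [compilationHash]/[bucket]/ instead of one flat directory.
function bucketedManifestPath(compilationHash, jsonName) {
  const bucket = (fnv1a(jsonName) % BUCKET_COUNT).toString(16).padStart(2, "0");
  return `${compilationHash}/${bucket}/${jsonName}-manifest.json`;
}
```

With 100'000 pages and 256 buckets, each directory would hold roughly 400 files, and the browser can recompute the bucket from the json name alone.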

Shout out to @pieh and @KyleAMathews for the ideas/brainstorming

Alternatives

drop compilation-specific component

In this approach, we'd just save the page-manifest without the compilation hash in the filename. The problem is that the page-manifest lists the componentChunkName, not a link to the actual component. So if a build occurs in the background and the component changes, the frontend won't know to refresh. It might then try to load the new query result into an old component, resulting in all kinds of errors (e.g. field is undefined on result.data.node.field).

A middle ground is to include the compilation hash as a field in each manifest. That way, at the very least, the frontend can compare the manifest compilation hash to its own to see if a rebuild has occurred. The benefit of this would be less disk usage since we wouldn't have multiple manifests per page.
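A minimal sketch of that comparison, assuming the manifest carries a `compilationHash` field. The field name, the function names, and the "reload" reaction are hypothetical; the issue does not specify what the frontend should do on a mismatch.

```javascript
// True when the manifest was written by a different webpack compilation
// than the one the running app was built from.
function isManifestStale(manifest, runningCompilationHash) {
  return manifest.compilationHash !== runningCompilationHash;
}

// Decide how to handle a navigation once the manifest has been fetched.
// Reloading on a mismatch is one possible reaction, chosen here for illustration.
function resolveNavigation(manifest, runningCompilationHash) {
  if (isManifestStale(manifest, runningCompilationHash)) {
    return { action: "reload" };
  }
  return {
    action: "render",
    componentChunkName: manifest.componentChunkName,
    dataPath: manifest.dataPath,
  };
}
```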

Calculate linked pages server side

Using react-tree-walker, while server-side rendering components, we can walk the page component and find all the Link elements. We therefore know which pages a page links to at build time. In theory therefore, we could construct a linked-pages.json for each page. For each linked page, it would include the componentChunkName, dataPath, and linkedPagePath. The entire file would be content hashed and immutable so that it was specific to this build. And then, when we build each page html, we could reference its linkedPagePath in the window data.

Now, when we navigate to a new page, the app simply looks up the appropriate entry in linkedPages, and knows exactly which component and dataPath to load. Even better, it also knows the navigated to page's linkedPagesPath too, so can repeat the process.

Unfortunately, I can't figure out a way to reliably build the linked-pages.json files for each path. The problem is that when you draw out a graph of the links on a site, there are many back references. E.g. index -> blogs -> blog -> index (via header div for example). So it's a cyclic graph, and we therefore can't build a dependency graph due to circular dependencies.
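The back-reference problem can be illustrated with a small depth-first search over the link graph from the example. The graph literal and the `findCycle` helper are illustrative only. If each linked-pages.json embeds the content-hashed path of the files it links to, then any cycle found this way means no file's hash can be computed first.

```javascript
// Link graph from the example: index -> blogs -> blog -> index
const links = {
  index: ["blogs"],
  blogs: ["blog"],
  blog: ["index"],
};

// Depth-first search that returns the first cycle found as a list of
// nodes (start node repeated at the end), or null if the graph is acyclic.
function findCycle(graph) {
  const state = new Map(); // node -> "visiting" | "done"
  const stack = [];

  function visit(node) {
    if (state.get(node) === "done") return null;
    if (state.get(node) === "visiting") {
      // Back reference: everything on the stack from `node` onward is a cycle.
      return stack.slice(stack.indexOf(node)).concat(node);
    }
    state.set(node, "visiting");
    stack.push(node);
    for (const next of graph[node] || []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

const cycle = findCycle(links);
```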

path-dependent buckets of page metadata

@kyle wrote an awesome PR (#6651) that produces buckets of page manifests depending on path segments. It would work great if we were going to run webpack on every data change, but that's what we're trying to avoid.

Related Issues

@Moocar changed the title from "Don't run webpack on content changes (no more data.json)" to "Don't build with webpack on content changes (no more data.json)" on Feb 22, 2019
@wardpeet added the type: maintenance, no triage, and not stale labels on Feb 22, 2019
@wardpeet
Contributor

I really like this approach and we can probably even make it better in the long run.

  • Before building page HTML, we produce a page-manifest file. It contains the componentChunk name and dataPath for the page. It is named with a webpack compilation hash and json name. [webpack-compilation-hash]/[jsonName]-manifest.json

This one is really hard to solve if your build system does not have incremental builds itself. Perhaps Webpack will support this soon or a bazel rule that can help us with this. With webpack, we might be able to save the dep tree ourselves and manually traverse it to see if work needs to be done (unsure if this will work).

  • Navigating to a non-linked URL will mean the browser must make 2 requests for that page. One for the manifest, and another for the dataPath

On Gatsby cloud, we could even use h2 push to reduce the network time so they get both requests at once.

Calculate linked pages server side

We can still do something similar on the client to actually prefetch manifest files depending on all a tags that are not external so we can fetch the component files when people hover.

@Moocar
Contributor Author

Moocar commented Feb 24, 2019

@wardpeet Just trying to understand your first comment. Maybe I'm missing something, but webpack provides a compilation hash for the entire build in the stats.json output. So every time there is a new build, there should be a new hash. This is what I'm proposing we use. To be clear, I'm not relying on any incremental functionality in webpack itself. Any time there is any source file change anywhere, we'll rebuild everything. We just won't have to rebuild on data changes.
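As a rough sketch of the rule described here: webpack's stats JSON exposes the compilation hash as a top-level `hash` field, and that value alone could decide how many page manifests need regenerating. The function name and parameter names below are hypothetical.

```javascript
// `statsJson` is webpack's stats output (e.g. stats.toJson() or stats.json);
// its top-level `hash` is the compilation hash for the whole build.
// Everything else here is an assumed shape for illustration.
function manifestsToRegenerate({ previousHash, statsJson, allPages, pagesWithChangedData }) {
  if (statsJson.hash !== previousHash) {
    // Some source file changed somewhere: new compilation, rewrite everything.
    return allPages;
  }
  // Same compilation: only pages whose query results changed need a new manifest.
  return pagesWithChangedData;
}
```

This does not rely on any incremental behaviour in webpack itself; the hash is only used as an all-or-nothing signal.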

We can still do something similar on the client to actually prefetch manifest files depending on all a tags that are not external so we can fetch the component files when people hover.

Yep, exactly. I propose we keep Gatsby's existing functionality here. I.e., it triggers a prefetch whenever a Link mounts. But where it currently looks up data.json to find the component and dataPath, it will instead request the page-manifest.
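A prefetcher along these lines might look like the sketch below. `createManifestPrefetcher` and the injected `request` function are hypothetical, and the de-duplication of repeated links is an addition for the example rather than something the proposal specifies.

```javascript
// Returns a prefetch function to call when a Link mounts. Each manifest is
// requested at most once, even if many Links point at the same page.
// `request` is injected (hypothetical) and is expected to return a promise.
function createManifestPrefetcher(compilationHash, request) {
  const inFlight = new Map(); // jsonName -> promise of the manifest

  return function prefetch(jsonName) {
    if (!inFlight.has(jsonName)) {
      const url = `/${compilationHash}/${jsonName}-manifest.json`;
      inFlight.set(jsonName, request(url));
    }
    return inFlight.get(jsonName);
  };
}

// Usage with a fake request function that records which URLs were asked for.
const requested = [];
const prefetch = createManifestPrefetcher("11593e3b3ac85436984a", (url) => {
  requested.push(url);
  return Promise.resolve({});
});
prefetch("path---index--723");
prefetch("path---index--723"); // second Link to the same page: no new request
```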

@KyleAMathews
Contributor

We could also use something like https://github.com/gaearon/react-side-effect for the linked-pages.json. The way it works is that while a page is rendered, each <Link> component passes its pathname upwards as it renders, and the pathnames can be collected once the page finishes.

@Moocar
Contributor Author

Moocar commented Mar 3, 2019

@KyleAMathews nice, we'd still have the back references problem, but react-side-effect would definitely be a more reliable way of tracking the page linking in the first place.

@Moocar
Contributor Author

Moocar commented Mar 5, 2019

@KyleAMathews raised the idea of including query results in the new page-manifest file. Instead of first requesting the page-manifest to get the dataPath and then requesting the actual data, the browser would only need to download the page-manifest, which would look something like:

{
  "componentChunkName": "...",
  "data": {
    "allMarkdownRemark": {}
  },
  "pageContext": {
    "path": "/blah"
  }
}

The downside obviously is that query results would no longer be infinitely cacheable. But, since page-manifests force the browser to download something each time, it's not so bad.

On the plus side, the server will be able to monitor whether the underlying file has changed and send back 304 responses to avoid unnecessary downloads.
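The 304 behaviour could be sketched as a conditional GET keyed on an ETag derived from the file's size and modification time. This is a generic HTTP pattern rather than anything Gatsby-specific; the function names and the `{ size, mtimeMs }` input (the shape of a Node `fs.Stats` object) are assumptions for the example.

```javascript
// Weak ETag built from file size and mtime, so it changes whenever the
// underlying manifest file is rewritten with different metadata.
function weakEtag(stat) {
  return `W/"${stat.size.toString(16)}-${Math.floor(stat.mtimeMs).toString(16)}"`;
}

// Decide between 200 and 304 for a manifest request.
// Simplification: compares a single If-None-Match value by exact string
// match; a real server would also handle comma-separated lists and "*".
function respondToManifestRequest(requestHeaders, stat) {
  const etag = weakEtag(stat);
  if (requestHeaders["if-none-match"] === etag) {
    return { status: 304, headers: { ETag: etag } }; // no body sent
  }
  return {
    status: 200,
    // "no-cache" lets the browser store the response but forces revalidation.
    headers: { ETag: etag, "Cache-Control": "no-cache" },
  };
}
```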

@wardpeet
Contributor

@Moocar sorry for taking so long to get back to you. The compilation hash seems great; I was looking at file hashes, so don't mind that comment 😄 Looking forward to some code!

@m-allanson
Contributor

Fixed by #13004 and released in gatsby@2.9.x. 💪🎉
