Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(gatsby): bundle and page-data stats for telemetry #20146

Merged
merged 23 commits into from
Dec 20, 2019

Conversation

rase-
Copy link
Contributor

@rase- rase- commented Dec 16, 2019

This PR uses the stats object returned by webpack to aggregate file size statistics for all JS bundles. It also computes similar statistics for all page-data.json files.

Because page-data.json files are only written if something changed, we cache page-data sizes by file path in .cache with a new cache for gatsby-telemetry. Because we can only reliably flush page-data aggregates in an exit handler, we can't rely on async code. For this reason, we flush any page-data size measurements from the cache and buffer any changes over the course of the build to overwrite the cache, finally flushing the results from memory.

Addresses #18819 and could be useful for the last point in #15272

@rase- rase- requested a review from a team as a code owner December 16, 2019 11:51
@rase- rase- requested a review from jamo December 16, 2019 11:53
jamo
jamo previously approved these changes Dec 16, 2019
Copy link
Contributor

@jamo jamo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@KyleAMathews
Copy link
Contributor

Looks great! Can you paste some example data e.g. from one of the official starters & gatsbyjs.org?

Copy link
Contributor

@pieh pieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we never purge stale data with this (i.e. page was removed from build, but we keep stats for page-data file around as long as we don't delete .cache)

This is great - it also sets up foundation for #19780

packages/gatsby/src/utils/page-data.js Outdated Show resolved Hide resolved
@rase-
Copy link
Contributor Author

rase- commented Dec 17, 2019

I think we never purge stale data with this (i.e. page was removed from build, but we keep stats for page-data file around as long as we don't delete .cache)

This is great - it also sets up foundation for #19780

Yeah we don't purge any deleted page-data files from the cache here. I wasn't able to find a good way to do this without having to scan the whole directory structure and start reading the files on every gatsby build. This is something we definitely want to avoid. I might've missed it, but to me it seemed like we also keep old page-data files around even if a page was deleted, is that correct? Having assumed that based on my investigation, I wasn't able to find anything to hook into to purge entries from the cache. It seemed like a reasonable trade-off to having to scan and stat the page-data.json files.

Any suggestions or tips to do this better are definitely welcome!

@pieh
Copy link
Contributor

pieh commented Dec 17, 2019

I might've missed it, but to me it seemed like we also keep old page-data files around even if a page was deleted, is that correct?

That's unfortunately correct (and known) issue. There is some work slowly being done in addressing that ( #19004 ), but it uses not very efficient way of scanning page-data directories (which as you mentioned, is not great).

@jamo Question here - do we have to address this stale data issue in this PR? Or would it be acceptable to get this in with known flaw (until this is worked out in performant way)?

@jamo
Copy link
Contributor

jamo commented Dec 17, 2019

@jamo Question here - do we have to address this stale data issue in this PR? Or would it be acceptable to get this in with known flaw (until this is worked out in performant way)?

This stale data is ok for now, so this is good to go with this :shipit:

Copy link
Contributor

@pieh pieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes for using redux store for pageDataStats:

We don't persist pages metadata between builds right now, so we probably would need to create new namespace + reducer in

module.exports = {
program: require(`./program`),
nodes: getNodesReducer(),
nodesByType: getNodesByTypeReducer(),
resolvedNodesCache: require(`./resolved-nodes`),
nodesTouched: require(`./nodes-touched`),
lastAction: require(`./last-action`),
flattenedPlugins: require(`./flattened-plugins`),
config: require(`./config`),
pages: require(`./pages`),
schema: require(`./schema`),
status: require(`./status`),
componentDataDependencies: require(`./component-data-dependencies`),
components: require(`./components`),
staticQueryComponents: require(`./static-query-components`),
jobs: require(`./jobs`),
webpack: require(`./webpack`),
webpackCompilationHash: require(`./webpack-compilation-hash`),
redirects: require(`./redirects`),
babelrc: require(`./babelrc`),
schemaCustomization: require(`./schema-customization`),
themes: require(`./themes`),
logs: require(`gatsby-cli/lib/reporter/redux/reducer`),
inferenceMetadata: require(`./inference-metadata`),
}

I think using a Map/Object wile filenames as keys and size as values is fine.

We don't have to keep our redux state immutable as we don't get any benefits from it, so mutating the Map in response to actions is fine (I would even say it's actually advised to mutate to avoid excessive allocations and GC work).

We can get redux store instance by using store named export from https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/redux/index.js

And in page-data utils file, we can dispatch something like:

store.dispatch({
  type: `SET_PAGE_DATA_STATS`,
  payload: {
    filePath,
    size: pageDataSize
  }
})

I think this will also allow us to not interact with telemetry module there, as we can grab everything from redux state with store.getState().<whateverIsTheNamespaceSpecified in reducers) just before sending BUILD_END` event (?). Maybe it would also make it simpler in telemetry itself (we might not need the bufferring feature - but maybe I'm missing why this is needed)

One last part is that we persist just subset of our redux state (we don't bother with persisting things that don't have proper cache handling - like pages, don't benefit from it - like list of plugins or are derived/denormalized as we can rebuild it - like nodesByType). So to actually persist new redux namespace we do need to add it to

// Persist state.
const saveState = () => {
const state = store.getState()
const pickedState = _.pick(state, [
`nodes`,
`status`,
`componentDataDependencies`,
`components`,
`staticQueryComponents`,
`webpackCompilationHash`,
])
return writeToCache(pickedState)
}

Gatsby will automatically persist state after any action was dispatched (with some debounce to not do that after every single action), so there's nothing more needed to be done.

packages/gatsby/src/utils/page-data.js Outdated Show resolved Hide resolved
packages/gatsby-telemetry/src/telemetry.js Show resolved Hide resolved
@rase-
Copy link
Contributor Author

rase- commented Dec 19, 2019

Thanks for the notes regarding using the redux store. I'll look into it!

jamo
jamo previously approved these changes Dec 19, 2019
Copy link
Contributor

@sidharthachatterjee sidharthachatterjee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Thanks @rase-

Left a comment

packages/gatsby/src/commands/build.js Outdated Show resolved Hide resolved
@sidharthachatterjee sidharthachatterjee added the bot: merge on green Gatsbot will merge these PRs automatically when all tests passes label Dec 20, 2019
Copy link
Contributor

@sidharthachatterjee sidharthachatterjee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌮

@pieh
Copy link
Contributor

pieh commented Dec 20, 2019

Looks great! Can you paste some example data e.g. from one of the official starters & gatsbyjs.org?

This is from gatsby-starter-blog:

{ bundleStats:
      { count: 8,
        min: 0.164,
        max: 197.11,
        sum: 304.549,
        mean: 38.068625,
        median: 1.254,
        stdDev: 73.04034346253447,
        skewness: 1.271274575799052 },
     pageDataStats:
      { count: 7,
        min: 0.195,
        max: 6.603,
        sum: 13.357000000000001,
        mean: 1.9081428571428574,
        median: 1.104,
        stdDev: 2.432424951125346,
        skewness: 0.9382828763341604 } },

(didn't try .org)

One thing we discussed with @sidharthachatterjee on the call is wether we need some special distinction for pages that don't use queries (i.e gatsby-starter-default doesn't use page queries at all). But decided against it, as page-data is still used in runtime even if there are no page queries result, so the stats will be representative.

Also nice thing about this checking page-data size (and not just query result size) is that it will handle cases where there are massive objects passed via context

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot: merge on green Gatsbot will merge these PRs automatically when all tests passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants