-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(gatsby): bundle and page-data stats for telemetry #20146
Conversation
…stats-for-telemetry
…stats-for-telemetry
…stats-for-telemetry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
…stats-for-telemetry
Looks great! Can you paste some example data e.g. from one of the official starters & gatsbyjs.org? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we never purge stale data with this (i.e. page was removed from build, but we keep stats for page-data
file around as long as we don't delete .cache
)
This is great - it also sets up foundation for #19780
Yeah we don't purge any deleted page-data files from the cache here. I wasn't able to find a good way to do this without having to scan the whole directory structure and start reading the files on every gatsby build. This is something we definitely want to avoid. I might've missed it, but to me it seemed like we also keep old page-data files around even if a page was deleted, is that correct? Having assumed that based on my investigation, I wasn't able to find anything to hook into to purge entries from the cache. It seemed like a reasonable trade-off to having to scan and stat the page-data.json files. Any suggestions or tips to do this better are definitely welcome! |
That's unfortunately correct (and known) issue. There is some work slowly being done in addressing that ( #19004 ), but it uses not very efficient way of scanning @jamo Question here - do we have to address this stale data issue in this PR? Or would it be acceptable to get this in with known flaw (until this is worked out in performant way)? |
This stale data is ok for now, so this is good to go with this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes for using redux store for pageDataStats
:
We don't persist pages metadata between builds right now, so we probably would need to create new namespace + reducer in
gatsby/packages/gatsby/src/redux/reducers/index.js
Lines 43 to 67 in 2ea7e31
module.exports = { | |
program: require(`./program`), | |
nodes: getNodesReducer(), | |
nodesByType: getNodesByTypeReducer(), | |
resolvedNodesCache: require(`./resolved-nodes`), | |
nodesTouched: require(`./nodes-touched`), | |
lastAction: require(`./last-action`), | |
flattenedPlugins: require(`./flattened-plugins`), | |
config: require(`./config`), | |
pages: require(`./pages`), | |
schema: require(`./schema`), | |
status: require(`./status`), | |
componentDataDependencies: require(`./component-data-dependencies`), | |
components: require(`./components`), | |
staticQueryComponents: require(`./static-query-components`), | |
jobs: require(`./jobs`), | |
webpack: require(`./webpack`), | |
webpackCompilationHash: require(`./webpack-compilation-hash`), | |
redirects: require(`./redirects`), | |
babelrc: require(`./babelrc`), | |
schemaCustomization: require(`./schema-customization`), | |
themes: require(`./themes`), | |
logs: require(`gatsby-cli/lib/reporter/redux/reducer`), | |
inferenceMetadata: require(`./inference-metadata`), | |
} |
I think using a Map/Object wile filenames as keys and size as values is fine.
We don't have to keep our redux state immutable as we don't get any benefits from it, so mutating the Map in response to actions is fine (I would even say it's actually advised to mutate to avoid excessive allocations and GC work).
We can get redux store instance by using store
named export from https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/redux/index.js
And in page-data
utils file, we can dispatch something like:
store.dispatch({
type: `SET_PAGE_DATA_STATS`,
payload: {
filePath,
size: pageDataSize
}
})
I think this will also allow us to not interact with telemetry module there, as we can grab everything from redux state with store.getState().<whateverIsTheNamespaceSpecified in reducers) just before sending
BUILD_END` event (?). Maybe it would also make it simpler in telemetry itself (we might not need the bufferring feature - but maybe I'm missing why this is needed)
One last part is that we persist just subset of our redux state (we don't bother with persisting things that don't have proper cache handling - like pages
, don't benefit from it - like list of plugins or are derived/denormalized as we can rebuild it - like nodesByType
). So to actually persist new redux namespace we do need to add it to
gatsby/packages/gatsby/src/redux/index.js
Lines 53 to 66 in 2ea7e31
// Persist state. | |
const saveState = () => { | |
const state = store.getState() | |
const pickedState = _.pick(state, [ | |
`nodes`, | |
`status`, | |
`componentDataDependencies`, | |
`components`, | |
`staticQueryComponents`, | |
`webpackCompilationHash`, | |
]) | |
return writeToCache(pickedState) | |
} |
Gatsby will automatically persist state after any action was dispatched (with some debounce to not do that after every single action), so there's nothing more needed to be done.
Thanks for the notes regarding using the redux store. I'll look into it! |
…stats-for-telemetry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Thanks @rase-
Left a comment
… after we run page queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🌮
This is from { bundleStats:
{ count: 8,
min: 0.164,
max: 197.11,
sum: 304.549,
mean: 38.068625,
median: 1.254,
stdDev: 73.04034346253447,
skewness: 1.271274575799052 },
pageDataStats:
{ count: 7,
min: 0.195,
max: 6.603,
sum: 13.357000000000001,
mean: 1.9081428571428574,
median: 1.104,
stdDev: 2.432424951125346,
skewness: 0.9382828763341604 } }, (didn't try .org) One thing we discussed with @sidharthachatterjee on the call is wether we need some special distinction for pages that don't use queries (i.e Also nice thing about this checking page-data size (and not just query result size) is that it will handle cases where there are massive objects passed via context |
This PR uses the stats object returned by webpack to aggregate file size statistics for all JS bundles. It also computes similar statistics for all
page-data.json
files.Because
page-data.json
files are only written if something changed, we cache page-data sizes by file path in.cache
with a new cache forgatsby-telemetry
. Because we can only reliably flush page-data aggregates in an exit handler, we can't rely on async code. For this reason, we flush any page-data size measurements from the cache and buffer any changes over the course of the build to overwrite the cache, finally flushing the results from memory.Addresses #18819 and could be useful for the last point in #15272