fix(build): remove stale page-data files #26937
Conversation
c5c258e to f6bab95 (Compare)
Gatsby Cloud Build Report (gatsby): 🎉 Your build was successful! See the Deploy preview here. Build Details: 🕐 Build time: 18m
Consider explicitly listening to the error event and rejecting on it, and adding a comment explaining it. Beyond that, LGTM and you can merge at your leisure, with or without changes.
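For illustration only, a minimal sketch of what that suggestion could look like, assuming the stream API of `@nodelib/fs.walk` is used; the helper name and filtering are made up here, not taken from this PR:

```ts
import * as fsWalk from "@nodelib/fs.walk"

// Hypothetical helper: wrap the stream-based walk in a Promise that
// rejects explicitly on the stream's "error" event.
function listPageDataFiles(publicDir: string): Promise<Array<string>> {
  return new Promise((resolve, reject) => {
    const files: Array<string> = []
    const stream = fsWalk.walkStream(publicDir)

    // Without this handler a failed walk (e.g. a permissions error) would
    // surface as an unhandled stream error instead of rejecting the promise.
    stream.on("error", reject)

    stream.on("data", (entry: { path: string }) => {
      if (entry.path.endsWith("page-data.json")) {
        files.push(entry.path)
      }
    })

    stream.on("end", () => resolve(files))
  })
}
```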
It could cause very weird edge cases if a user actually has pages with the `/sq/d/` prefix, and we already check for the `page-data.json` file (static query results will have `[hash].json` names).
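As a hedged illustration of the check being described (the helper name is hypothetical, not code from this PR):

```ts
import * as path from "path"

// Hypothetical predicate: static query results are written as
// page-data/sq/d/[hash].json, so checking the basename for
// `page-data.json` already excludes them. Filtering on the `/sq/d/`
// prefix instead would be fragile, because a user page whose path
// starts with /sq/d/ would also live under that prefix.
function isPageDataFile(filePath: string): boolean {
  return path.basename(filePath) === "page-data.json"
}
```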
Looks good to me!
The integration test seems to fail, but I know you're gonna look into that and I couldn't stop you even if I wanted to, so gtg.
Description
Initial version - just adds tests and lets CI run to showcase the (currently) failing scenario. Following that, I will push a commit with the actual fix and adjust the description.
--- edit:
The added tests alone fail ( https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/49374/workflows/a5bff456-778f-49ae-bae4-5469e0ce072a/jobs/506165 ), asserting the problematic behaviour.
The next commit adds the actual fix (for the build command). A few notes on this implementation:
- It uses fs walking to get the list of page-data files in the `public` dir. This is not the most performant way to do this, but it is the safest. We could persist the pages slice of redux state and use that instead of traversing `public/page-data` to get the list of previous page-data files, but this assumes that the `.cache` and `public` directories are consistent - it would break if the user deleted either `.cache` or `public` alone, so it would need additional safeguards ... and would need the fs traversal as a fallback if `.cache` was deleted but `public` wasn't. Because the fs traversal would be needed anyway, I went with it as the initial implementation (a rough sketch of this approach is included at the end of this description).
- I benchmarked a couple of fs traversing/globbing methods/packages on sites of varying size (10k, 50k, 100k, 250k pages) and page path structures: flat paths, which are `/[some-slug]/`, and randomly nested paths, which can vary from `/[some-slug]/` to `/[some-slug-1]/[some-slug-2]/[some-slug-3]/[some-slug-4]/` - note that the nesting is really randomized, so the nested results shouldn't be compared across sizes because the structure won't be the same; instead treat each as a random sample. The tested fs traversal methods are in https://github.com/pieh/benchmark-lot-of-pages/tree/master/page-data-finders (the `cli-find` one is omitted because this was meant to be a quick experiment, but its initial results were different from those of the other methods).

Results:
Flat structure:
Nested structure:
In all tests `@nodelib/fs.walk` won, so I settled on using it. The worst (perf-wise) recorded case was 250k flat pages at 0.2 ops/sec (so the fs traversal takes on average 5s). While this increases build time, 5s for 250k pages feels acceptable if it ensures the correctness of the deployed site (no stale page-data files there). On top of that - as we improve performance in other parts, we will be able to improve perf here as well (e.g. implement a special case that uses stored state instead of fs traversal when we can be certain that the `.cache`/`public` dirs were not tampered with since the last build).
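To make the fs-traversal approach above concrete, here is a rough, hedged sketch only - not the PR's actual code. The helper names, the `public/page-data/<page-path>/page-data.json` layout mapping (with `index` for the root page), and the use of `fs-extra` for deletion are assumptions:

```ts
import * as path from "path"
import * as fs from "fs-extra"
import * as fsWalk from "@nodelib/fs.walk"

// Assumed layout: public/page-data/<page-path>/page-data.json,
// with the index page stored as public/page-data/index/page-data.json.
function pageDataDirToPagePath(pageDataDir: string): string {
  return pageDataDir === "index" ? "/" : `/${pageDataDir}/`
}

async function removeStalePageDataFiles(
  publicDir: string,
  currentPagePaths: Set<string>
): Promise<void> {
  const pageDataRoot = path.join(publicDir, "page-data")

  // fs traversal (the "safest" option from the notes above): list every
  // page-data.json currently on disk in public/page-data. Static query
  // results ([hash].json under sq/d/) are excluded by the basename check.
  const existingFiles = await new Promise<Array<string>>((resolve, reject) => {
    fsWalk.walk(pageDataRoot, (error, entries) => {
      if (error) {
        reject(error)
        return
      }
      resolve(
        entries
          .map(entry => entry.path)
          .filter(filePath => path.basename(filePath) === "page-data.json")
      )
    })
  })

  for (const filePath of existingFiles) {
    const relativeDir = path.relative(pageDataRoot, path.dirname(filePath))
    const pagePath = pageDataDirToPagePath(relativeDir.split(path.sep).join("/"))

    // Delete page-data files whose page no longer exists in this build.
    if (!currentPagePaths.has(pagePath)) {
      await fs.remove(filePath)
    }
  }
}
```

A usage sketch would pass the set of page paths from the current build, e.g. `removeStalePageDataFiles("public", new Set(["/", "/blog/post-1/"]))`, after all page-data files for the current pages have been written.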