fix(build): remove stale page-data files #26937

Merged
merged 4 commits into from
Sep 24, 2020

@pieh pieh commented Sep 17, 2020

Description

Initial version: this only adds tests and lets CI run to showcase the (currently) failing scenario. After that I will push a commit with the actual fix and adjust the description.

---edit

The added tests fail on their own ( https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/49374/workflows/a5bff456-778f-49ae-bae4-5469e0ce072a/jobs/506165 ), asserting the problematic behaviour.

The next commit adds the actual fix (for the build command). A few notes on this implementation:

It walks the filesystem to get the list of page-data files in the public dir. This is not the most performant way to do it, but it is the safest. We could persist the pages slice of the redux state and use that instead of traversing public/page-data to find the previous files, but that assumes the .cache and public directories are consistent. It would break if the user deleted either .cache or public alone, so it would need additional safeguards, with the fs traversal as a fallback if .cache was deleted but public wasn't. Because the traversal would be needed anyway, I went with it as the initial implementation.
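
A minimal sketch of the traversal-and-delete idea described above, not the code from this PR. It uses plain `fs.readdir` instead of @nodelib/fs-walk so that it stays dependency-free, and the helper names (`findPageDataFiles`, `pageDataFilePath`, `removeStalePageData`) are hypothetical. It also assumes the usual layout of `public/page-data/<page path>/page-data.json`, with `/` stored under `index`.

```typescript
import { promises as fs, Dirent } from "fs";
import * as path from "path";

// Recursively collect every file named `page-data.json` under `dir`.
// Matching on the file name means other JSON files in the tree are left alone.
async function findPageDataFiles(dir: string): Promise<string[]> {
  let entries: Dirent[];
  try {
    entries = await fs.readdir(dir, { withFileTypes: true });
  } catch (err) {
    // A missing directory simply means there is nothing to clean up.
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return [];
    throw err;
  }
  const found: string[] = [];
  for (const entry of entries) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...(await findPageDataFiles(full)));
    } else if (entry.name === "page-data.json") {
      found.push(full);
    }
  }
  return found;
}

// Hypothetical mapping from a page path to its page-data file (assumed layout).
function pageDataFilePath(publicDir: string, pagePath: string): string {
  const fixed = pagePath === "/" ? "index" : pagePath.replace(/^\/|\/$/g, "");
  return path.join(publicDir, "page-data", fixed, "page-data.json");
}

// Delete page-data files on disk that no current page accounts for.
// Returns the list of removed files.
async function removeStalePageData(
  publicDir: string,
  currentPagePaths: Iterable<string>
): Promise<string[]> {
  const expected = new Set<string>();
  for (const pagePath of currentPagePaths) {
    expected.add(pageDataFilePath(publicDir, pagePath));
  }
  const onDisk = await findPageDataFiles(path.join(publicDir, "page-data"));
  const stale = onDisk.filter(file => !expected.has(file));
  await Promise.all(stale.map(file => fs.unlink(file)));
  return stale;
}
```

Because the disk is the source of truth here, the sketch still works when .cache has been deleted but public has not, which is the property the description argues for.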

I benchmarked a couple of fs traversal/globbing methods and packages on sites of varying size (10k, 50k, 100k, 250k pages) and with two page path structures: flat paths, which are /[some-slug]/, and randomly nested paths, which can vary from /[some-slug]/ to /[some-slug-1]/[some-slug-2]/[some-slug-3]/[some-slug-4]/. Note that the nesting really is randomized, so the nested results for one size shouldn't be compared against another size because the structure won't be the same; treat each as a random sample. The tested fs traversal methods are in https://github.com/pieh/benchmark-lot-of-pages/tree/master/page-data-finders (the cli-find one is omitted: it was meant as a quick experiment, but its initial results were different from those of the other methods).

Results:

Flat structure:

  • 10k:
fs-extra x 4.76 ops/sec ±5.12% (72 runs sampled)
globby x 4.53 ops/sec ±3.17% (71 runs sampled)
@nodelib/fs-walk x 6.84 ops/sec ±4.18% (81 runs sampled)
readdirp x 3.61 ops/sec ±3.35% (68 runs sampled)
  • 50k:
fs-extra x 0.87 ops/sec ±3.26% (54 runs sampled)
globby x 0.82 ops/sec ±1.70% (53 runs sampled)
@nodelib/fs-walk x 1.13 ops/sec ±2.42% (55 runs sampled)
readdirp x 0.67 ops/sec ±2.05% (53 runs sampled)
  • 100k:
fs-extra x 0.41 ops/sec ±2.02% (51 runs sampled)
globby x 0.40 ops/sec ±2.54% (51 runs sampled)
@nodelib/fs-walk x 0.54 ops/sec ±1.82% (52 runs sampled)
readdirp x 0.31 ops/sec ±1.81% (51 runs sampled)
  • 250k
fs-extra x 0.16 ops/sec ±1.90% (50 runs sampled)
globby x 0.15 ops/sec ±1.44% (50 runs sampled)
@nodelib/fs-walk x 0.20 ops/sec ±1.79% (51 runs sampled)
readdirp x 0.12 ops/sec ±1.55% (50 runs sampled)

Nested structure

  • 10k
fs-extra x 6.59 ops/sec ±3.48% (80 runs sampled)
globby x 4.39 ops/sec ±3.67% (72 runs sampled)
@nodelib/fs-walk x 7.20 ops/sec ±4.06% (84 runs sampled)
readdirp x 3.62 ops/sec ±3.09% (67 runs sampled)
  • 50k
fs-extra x 0.91 ops/sec ±7.01% (55 runs sampled)
globby x 0.79 ops/sec ±1.88% (53 runs sampled)
@nodelib/fs-walk x 1.27 ops/sec ±2.40% (56 runs sampled)
readdirp x 0.68 ops/sec ±3.24% (53 runs sampled)
  • 100k
fs-extra x 0.46 ops/sec ±2.18% (52 runs sampled)
globby x 0.40 ops/sec ±2.83% (52 runs sampled)
@nodelib/fs-walk x 0.57 ops/sec ±2.17% (52 runs sampled)
readdirp x 0.32 ops/sec ±2.03% (51 runs sampled)
  • 250k
fs-extra x 0.17 ops/sec ±2.46% (50 runs sampled)
globby x 0.14 ops/sec ±0.85% (50 runs sampled)
@nodelib/fs-walk x 0.21 ops/sec ±2.69% (50 runs sampled)
readdirp x 0.13 ops/sec ±0.96% (50 runs sampled)

In all tests @nodelib/fs-walk won, so I settled on using it. The worst recorded case (perf-wise) was 250k flat at 0.2 ops/sec, so an fs traversal takes 5s on average. While this increases build time, 5s for 250k pages feels acceptable if it ensures correctness of the deployed site (no stale page-data files there). On top of that, as we improve performance in other parts, we will be able to improve perf here as well, by implementing a special case that uses stored state instead of fs traversal when we can be certain that the .cache/public dirs were not tampered with since the last build.
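
The 5s figure is just the reciprocal of the measured throughput. As a small illustration, the snippet below converts the 250k flat results listed above into mean seconds per traversal; `secondsPerRun` is a hypothetical helper name, and the numbers are copied from the results, not re-measured.

```typescript
// Mean wall-clock seconds for one traversal, given an ops/sec figure.
function secondsPerRun(opsPerSec: number): number {
  if (!(opsPerSec > 0)) {
    throw new RangeError("opsPerSec must be a positive number");
  }
  return 1 / opsPerSec;
}

// ops/sec values from the "Flat structure, 250k" results above.
const flat250k: Record<string, number> = {
  "fs-extra": 0.16,
  globby: 0.15,
  "@nodelib/fs-walk": 0.2,
  readdirp: 0.12,
};

for (const [name, opsPerSec] of Object.entries(flat250k)) {
  console.log(`${name}: ${secondsPerRun(opsPerSec).toFixed(2)}s per traversal`);
}
```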

@gatsbot gatsbot bot added the status: triage needed label Sep 17, 2020

gatsby-cloud bot commented Sep 17, 2020

Gatsby Cloud Build Report

gatsby

🎉 Your build was successful! See the Deploy preview here.

Build Details

View the build logs here.

🕐 Build time: 18m

@pieh pieh marked this pull request as ready for review September 18, 2020 11:35
@pieh pieh added topic: stale-artifacts* and removed status: triage needed labels Sep 18, 2020
pvdz previously approved these changes Sep 18, 2020

@pvdz pvdz left a comment

Consider explicitly listening to the error event and rejecting on it, and adding a comment that explains why. Beyond that lgtm and you can merge at your leisure, with or without changes.
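
A sketch of what that suggestion could look like, assuming the traversal is consumed as a Node.js stream wrapped in a promise. `collectStream` is a hypothetical helper, and this is not the code under review:

```typescript
import { Readable } from "stream";

// Collect every item emitted by a stream into an array.
function collectStream<T>(stream: Readable): Promise<T[]> {
  return new Promise<T[]>((resolve, reject) => {
    const items: T[] = [];
    stream.on("data", (item: T) => {
      items.push(item);
    });
    // Listen for "error" explicitly: an "error" event with no listener is
    // thrown as an uncaught exception instead of rejecting this promise,
    // so callers awaiting the result would never see the failure.
    stream.on("error", reject);
    stream.on("end", () => resolve(items));
  });
}
```

With the listener in place, a caller can handle traversal failures with an ordinary `try`/`catch` around `await collectStream(...)`.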

packages/gatsby/src/utils/page-data.ts (outdated, resolved)
packages/gatsby/src/utils/page-data.ts (resolved)
it could cause very weird edge cases if a user actually has pages with
a `/sq/d/` prefix, and we already check for the `page-data.json` file name
(static query results will have [hash].json names)

@sidharthachatterjee sidharthachatterjee left a comment

Looks good to me!

@pieh pieh dismissed stale reviews from sidharthachatterjee and pvdz via 424d840 September 18, 2020 14:27

@pvdz pvdz left a comment

Integration test seems to fail but I know you're gonna look into that and I couldn't stop you even if I wanted to so gtg
