fix(build): remove stale page-data files #26937

pieh · 2020-09-17T14:30:37Z

Description

Initial version: this only adds tests and lets CI run to showcase the (currently) failing scenario. I will follow up with a commit containing the actual fix and adjust the description.

---edit

The added tests fail on their own ( https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/49374/workflows/a5bff456-778f-49ae-bae4-5469e0ce072a/jobs/506165 ), asserting the problematic behaviour.

The next commit added the actual fix (for the build command). A few notes on this implementation:

It walks the filesystem to get the list of page-data files in the public dir. This is not the most performant way to do this, but it is the safest. We could persist the pages slice of redux state and use that instead of traversing public/page-data to get the list of previous files, but that assumes the .cache and public directories are consistent. It would break if the user deleted either .cache or public alone, so it would need additional safeguards, with the fs traversal as a fallback if .cache was deleted but public wasn't. Because the traversal would be needed anyway, I went with it as the initial implementation.
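A minimal sketch of the traversal-based cleanup described above, not the code from this PR. It uses Node's built-in `fs` module rather than `@nodelib/fs-walk`, and the helper names `collectPageDataFiles` and `removeStalePageData` are hypothetical. The caller is assumed to already know which page-data files the current build expects.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical helper: recursively collect every page-data.json under `dir`.
// Files with any other name (for example static query results) are ignored.
function collectPageDataFiles(dir: string, found: string[] = []): string[] {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      collectPageDataFiles(fullPath, found);
    } else if (entry.isFile() && entry.name === "page-data.json") {
      found.push(fullPath);
    }
  }
  return found;
}

// Hypothetical helper: delete every page-data.json on disk that the current
// build does not expect, and return the list of deleted files.
function removeStalePageData(pageDataDir: string, expected: Set<string>): string[] {
  const removed: string[] = [];
  for (const file of collectPageDataFiles(pageDataDir)) {
    if (!expected.has(file)) {
      fs.unlinkSync(file);
      removed.push(file);
    }
  }
  return removed;
}

// Usage on a throwaway directory: one page still exists, one was deleted.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "page-data-"));
const kept = path.join(root, "about", "page-data.json");
const stale = path.join(root, "old-post", "page-data.json");
for (const file of [kept, stale]) {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, "{}");
}
const removed = removeStalePageData(root, new Set([kept]));
console.log(removed.length); // 1
```

Because the list of previous files comes from disk rather than from stored state, the sketch keeps working when .cache has been deleted but public has not, which is the trade-off the note above describes.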

I benchmarked a couple of fs traversing/globbing methods/packages on sites of varying size (10k, 50k, 100k, 250k) and with two page path structures: flat paths, which are /[some-slug]/, and randomly nested paths, which can vary from /[some-slug]/ to /[some-slug-1]/[some-slug-2]/[some-slug-3]/[some-slug-4]/. Note that the nesting really is randomized, so nested results shouldn't be compared across sizes because the structure won't be the same; treat each as a random sample. The tested fs traversal methods are in https://github.com/pieh/benchmark-lot-of-pages/tree/master/page-data-finders (the cli-find one is omitted because this was meant to be a quick experiment, but its initial results were different from those of the other methods).
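To make the two path structures concrete, here is a small sketch of how such fixture paths could be generated. It is an illustration only, not the generator from the linked benchmark repository; `flatPath`, `nestedPath`, the `Rng` type and the slug naming are all hypothetical.

```typescript
// A random source returning a number in [0, 1), injected so runs are repeatable.
type Rng = () => number;

// Flat structure: every page lives directly under the root, e.g. /some-slug/.
function flatPath(slug: string): string {
  return `/${slug}/`;
}

// Nested structure: each page gets a random depth between 1 and maxDepth.
function nestedPath(index: number, rng: Rng, maxDepth = 4): string {
  const depth = 1 + Math.floor(rng() * maxDepth);
  const segments: string[] = [];
  for (let level = 1; level <= depth; level++) {
    segments.push(`slug-${index}-${level}`);
  }
  return `/${segments.join("/")}/`;
}

console.log(flatPath("some-slug"));
console.log(nestedPath(1, Math.random));
```

Since the depth of every page is drawn independently, two generated sites of different sizes will not share a directory shape, which is why the nested numbers are best read as random samples.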

Results:

Flat structure:

10k:

fs-extra x 4.76 ops/sec ±5.12% (72 runs sampled)
globby x 4.53 ops/sec ±3.17% (71 runs sampled)
@nodelib/fs-walk x 6.84 ops/sec ±4.18% (81 runs sampled)
readdirp x 3.61 ops/sec ±3.35% (68 runs sampled)

50k:

fs-extra x 0.87 ops/sec ±3.26% (54 runs sampled)
globby x 0.82 ops/sec ±1.70% (53 runs sampled)
@nodelib/fs-walk x 1.13 ops/sec ±2.42% (55 runs sampled)
readdirp x 0.67 ops/sec ±2.05% (53 runs sampled)

100k:

fs-extra x 0.41 ops/sec ±2.02% (51 runs sampled)
globby x 0.40 ops/sec ±2.54% (51 runs sampled)
@nodelib/fs-walk x 0.54 ops/sec ±1.82% (52 runs sampled)
readdirp x 0.31 ops/sec ±1.81% (51 runs sampled)

250k

fs-extra x 0.16 ops/sec ±1.90% (50 runs sampled)
globby x 0.15 ops/sec ±1.44% (50 runs sampled)
@nodelib/fs-walk x 0.20 ops/sec ±1.79% (51 runs sampled)
readdirp x 0.12 ops/sec ±1.55% (50 runs sampled)

Nested structure

10k

fs-extra x 6.59 ops/sec ±3.48% (80 runs sampled)
globby x 4.39 ops/sec ±3.67% (72 runs sampled)
@nodelib/fs-walk x 7.20 ops/sec ±4.06% (84 runs sampled)
readdirp x 3.62 ops/sec ±3.09% (67 runs sampled)

50k

fs-extra x 0.91 ops/sec ±7.01% (55 runs sampled)
globby x 0.79 ops/sec ±1.88% (53 runs sampled)
@nodelib/fs-walk x 1.27 ops/sec ±2.40% (56 runs sampled)
readdirp x 0.68 ops/sec ±3.24% (53 runs sampled)

100k

fs-extra x 0.46 ops/sec ±2.18% (52 runs sampled)
globby x 0.40 ops/sec ±2.83% (52 runs sampled)
@nodelib/fs-walk x 0.57 ops/sec ±2.17% (52 runs sampled)
readdirp x 0.32 ops/sec ±2.03% (51 runs sampled)

250k

fs-extra x 0.17 ops/sec ±2.46% (50 runs sampled)
globby x 0.14 ops/sec ±0.85% (50 runs sampled)
@nodelib/fs-walk x 0.21 ops/sec ±2.69% (50 runs sampled)
readdirp x 0.13 ops/sec ±0.96% (50 runs sampled)

In all tests @nodelib/fs-walk won, so I settled on using it. Its worst recorded case (perf-wise) was 250k flat at 0.20 ops/sec (so the fs traversal takes about 5s on average). While this increases build time, 5s for 250k pages feels acceptable if it ensures correctness of the deployed site (no stale page-data files there). On top of that, as we improve performance in other parts, we will be able to improve perf here as well (by implementing a special case that uses stored state instead of fs traversal if we can be certain that the .cache/public dirs were not tampered with since the last build).
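The ops/sec figures convert to time per traversal by taking the reciprocal. The sketch below applies that to the 250k flat results listed above; `secondsPerRun` is a hypothetical helper name.

```typescript
// 250k flat results from the benchmark above, in ops/sec.
const flat250k: Record<string, number> = {
  "fs-extra": 0.16,
  globby: 0.15,
  "@nodelib/fs-walk": 0.2,
  readdirp: 0.12,
};

// One traversal takes 1 / (ops per second) seconds on average.
function secondsPerRun(opsPerSec: number): number {
  return 1 / opsPerSec;
}

for (const name of Object.keys(flat250k)) {
  console.log(`${name}: ~${secondsPerRun(flat250k[name]).toFixed(2)}s per traversal`);
}
// fs-extra: ~6.25s, globby: ~6.67s, @nodelib/fs-walk: ~5.00s, readdirp: ~8.33s
```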

gatsby-cloud · 2020-09-17T14:54:18Z

Gatsby Cloud Build Report

gatsby

🎉 Your build was successful! See the Deploy preview here.

Build Details

View the build logs here.

🕐 Build time: 18m

pvdz

Consider explicitly listening to the error event and rejecting on it, and adding a comment explaining it. Beyond that, LGTM; you can merge at your leisure, with or without changes.
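A sketch of the pattern this suggestion refers to, assuming the traversal is exposed as a Node stream that is collected into a promise. It uses a plain `Readable` from Node's `stream` module as a stand-in for the walker's stream; `collectFromStream` is a hypothetical name and this is not the code that was merged.

```typescript
import { Readable } from "stream";

// Collect every item emitted by an object-mode stream into an array.
function collectFromStream<T>(stream: Readable): Promise<T[]> {
  return new Promise<T[]>((resolve, reject) => {
    const results: T[] = [];
    stream.on("data", (item: T) => {
      results.push(item);
    });
    // Without this listener a stream error would surface as an uncaught
    // exception and the promise would never settle; rejecting lets the
    // caller handle a failed traversal like any other async error.
    stream.on("error", reject);
    stream.on("end", () => resolve(results));
  });
}

// Usage with an in-memory stream standing in for the directory walker.
const pending = collectFromStream<string>(Readable.from(["a/page-data.json", "b/page-data.json"]));
pending.then((files) => console.log(files.length));
```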

packages/gatsby/src/utils/page-data.ts

The deepFilter option could cause very weird edge cases if a user actually has pages with a `/sq/d/` prefix, and we already check for the `page-data.json` file name (static query results have `[hash].json` names).

sidharthachatterjee

Looks good to me!

pvdz

Integration test seems to fail but I know you're gonna look into that and I couldn't stop you even if I wanted to so gtg

gatsbot bot added the status: triage needed Issue or pull request that need to be triaged and assigned to a reviewer label Sep 17, 2020

pieh added 2 commits September 17, 2020 16:48

test(artifacts): assert (un)expected page-data/html files

f6bab95

fix(gatsby): delete stale page-data on builds

ca59565

pieh force-pushed the cleanup-stale-page-data branch from c5c258e to f6bab95 Compare September 17, 2020 14:49

pieh marked this pull request as ready for review September 18, 2020 11:35

pieh added topic: stale-artifacts* and removed status: triage needed Issue or pull request that need to be triaged and assigned to a reviewer labels Sep 18, 2020

pvdz previously approved these changes Sep 18, 2020

remove deepFilter option

00fb3ed

sidharthachatterjee self-requested a review September 18, 2020 13:25

sidharthachatterjee previously approved these changes Sep 18, 2020

handle fs traversal errors (kind of)

424d840

pieh dismissed stale reviews from sidharthachatterjee and pvdz via 424d840 September 18, 2020 14:27

pvdz approved these changes Sep 18, 2020

pieh merged commit dfe9fb0 into master Sep 24, 2020

delete-merged-branch bot deleted the cleanup-stale-page-data branch September 24, 2020 10:03

pieh mentioned this pull request Oct 1, 2020

page-data.json files not deleted with pages #17402

Closed

pieh mentioned this pull request Nov 6, 2020

feat(develop): add explicit express handler for page-data requests #27882

Merged
