Description
There are issues with detecting too much (e.g., admin view CSS files that are not used on the site) and detecting too little (e.g., WPML paginated URLs).
Ironically, normal full spidering was how this plugin used to function many years ago, more like SimplerStatic/SimplyStatic.
Not wanting to totally rearchitect this plugin (that's what WP2Static is for), a relatively simple adjustment should be made to continually crawl until all newly discovered internal URLs have been crawled.
Previous shortcomings in the HTML/CSS parsing meant that not all assets were detected, which is where the greedy detection was a useful workaround, but it should no longer be needed.
This proposed change is not the ultimate in elegance/extensibility, but should provide a good improvement without requiring too much effort.
Flow
- plugin clears URLs list before starting export
- plugin initially detects as many WP content paths as possible (posts, pages, pagination, author, etc).
- current 2-phase crawling replaced with 1 recursive crawler, consuming URLs seeded by above detection
- only HTML and CSS files yield new URLs for further crawling
Note: crawl progress will be harder to measure in % terms, so reporting should shift to Detected vs Processed counts:
100 URLs Detected, 0 Processed
...
100 URLs Detected, 75 Processed
120 URLs Detected, 80 Processed
...
120 URLs Detected, 120 Processed
The crawler knows to continue solely by the crawl_queue not being empty.
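The flow above can be sketched as a single queue-driven loop. This is only an illustration in Python, not the plugin's code: `crawl`, `fetch_links` and `report` are hypothetical names, and the in-memory `site` dict stands in for real HTTP fetching and HTML/CSS parsing. It assumes `fetch_links` returns an empty list for anything that is not HTML or CSS.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    seed_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    report: Callable[[int, int], None] = lambda detected, processed: None,
) -> set[str]:
    """Crawl until the queue is empty and return every detected URL."""
    queue: deque[str] = deque()
    detected: set[str] = set()

    # Seed the queue from the initial detection (posts, pages, etc.).
    for url in seed_urls:
        if url not in detected:
            detected.add(url)
            queue.append(url)

    processed = 0
    report(len(detected), processed)

    # The only stop condition is an empty queue.
    while queue:
        url = queue.popleft()
        # Hypothetical helper: yields links for HTML/CSS, nothing otherwise.
        for link in fetch_links(url):
            if link not in detected:
                detected.add(link)
                queue.append(link)
        processed += 1
        report(len(detected), processed)

    return detected


# Stand-in for a real site: URL -> internal links found in that response.
site = {
    "/": ["/about/", "/style.css"],
    "/about/": ["/", "/page/2/"],
    "/style.css": ["/font.woff"],
    "/page/2/": [],
    "/font.woff": [],
}

progress: list[tuple[int, int]] = []
found = crawl(
    ["/"],
    lambda url: site.get(url, []),
    lambda detected, processed: progress.append((detected, processed)),
)
for detected_count, processed_count in progress:
    print(f"{detected_count} URLs Detected, {processed_count} Processed")
```

The Detected count can grow while Processed catches up, which is why a percentage is hard to report; the run is finished when the two numbers match.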
Tasks
- don't detect plugin/theme assets to build initial crawl list
- new DB tables: crawl_queue, crawl_log (path, where detected, response status)
- detected URLs are written into both crawl_queue and crawl_log
- remove `crawl_again` task
- `crawl` task to use crawl_queue
- use fixed archive and zip name
- remove flaky detection of pagination URLs
- move URL lists away from flat TXT files into Database (pending deploy lists):
  - Bunny
  - S3
  - GitHub
  - GitLab (including file list from gitlab - just grab to memory)
  - Bitbucket
  - Netlify (shouldn't need, sends zip)
- adjust progress indicators in UI and CLI (pending deploy / invalidation progress)
- prevent any query string URLs being detected (`Pending /?wp_block=untitled-reusable-block Note: initial_crawl_list`)
- check options to ignore/delete DeployCache
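
A rough sketch of how the crawl_queue and crawl_log tables and the query string filter might fit together, using Python and an in-memory SQLite database purely for illustration. The column names (`detected_in`, `status`) and the helper functions are assumptions, not the plugin's schema or API; the issue only specifies that crawl_log holds the path, where it was detected, and the response status.

```python
import sqlite3
from typing import Optional
from urllib.parse import urlsplit


def open_db() -> sqlite3.Connection:
    """Create the two tables in an in-memory database (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE crawl_queue (path TEXT PRIMARY KEY);
        CREATE TABLE crawl_log (
            path        TEXT PRIMARY KEY,
            detected_in TEXT,     -- where the URL was detected
            status      INTEGER   -- response status, NULL while pending
        );
        """
    )
    return db


def add_detected(db: sqlite3.Connection, path: str, detected_in: str) -> bool:
    """Write a newly detected path into both tables; return True if added."""
    # Skip any URL carrying a query string, e.g. /?wp_block=...
    if urlsplit(path).query:
        return False
    seen = db.execute(
        "SELECT 1 FROM crawl_log WHERE path = ?", (path,)
    ).fetchone()
    if seen:
        return False
    db.execute(
        "INSERT INTO crawl_log (path, detected_in) VALUES (?, ?)",
        (path, detected_in),
    )
    db.execute("INSERT INTO crawl_queue (path) VALUES (?)", (path,))
    return True


def next_path(db: sqlite3.Connection) -> Optional[str]:
    """Return one queued path, or None when the queue is empty."""
    row = db.execute("SELECT path FROM crawl_queue LIMIT 1").fetchone()
    return row[0] if row else None


def mark_crawled(db: sqlite3.Connection, path: str, status: int) -> None:
    """Remove the path from the queue and record its response status."""
    db.execute("DELETE FROM crawl_queue WHERE path = ?", (path,))
    db.execute("UPDATE crawl_log SET status = ? WHERE path = ?", (status, path))
```

Under this layout, the Detected count would be the number of rows in crawl_log and the Processed count the number of those rows with a status set, so the crawl is done when crawl_queue is empty.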