This repo contains some temporary scripts used by the search team to analyse the results of an SEO crawl of GOV.UK.
The scripts call Elasticsearch, so you should run them from your dev VM with a copy of the mainstream, government and detailed search indexes.
To see the full usage of all the commands, run `bin/analyse --help`.
- `bin/analyse missing_from_sitemap CRAWLER_SITEMAP`: Find pages which were found by the crawler but are missing from the GOV.UK sitemap. The `CRAWLER_SITEMAP` parameter is the sitemap index XML file generated by the crawler. Any sitemap chunk files are expected to be saved in the same directory as the sitemap index. Output is saved to a file.
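The script's internals aren't shown in this repo's README, but the comparison it performs can be sketched. Here is a minimal Python illustration, assuming the crawler sitemap follows the standard sitemaps.org XML format; the function names and sample URLs are hypothetical:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Extract every <loc> URL from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]


def missing_from_sitemap(crawled_urls, sitemap_urls):
    """Pages the crawler found which the sitemap does not list."""
    return sorted(set(crawled_urls) - set(sitemap_urls))


# Hypothetical sitemap index pointing at one chunk file.
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.gov.uk/sitemaps/sitemap_1.xml</loc></sitemap>
</sitemapindex>"""

chunk_urls = urls_from_sitemap(index_xml)
missing = missing_from_sitemap(
    ["https://www.gov.uk/a", "https://www.gov.uk/b"],
    ["https://www.gov.uk/a"],
)
```

In practice the real command would also fetch and parse each chunk file referenced by the index, which is why the chunks need to sit alongside the index file on disk.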
- `bin/analyse load_sitemap SITEMAP_FILE INDEX_NAME`: Load the data from the crawler sitemap into a temporary Elasticsearch index. This makes it possible to run the `missing_from_crawl` command without having to load the entire sitemap into memory each time that command is run.
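Loading documents into Elasticsearch in bulk is normally done through the newline-delimited JSON `_bulk` API. As a rough sketch of the kind of payload such a load might build (the index name and document shape here are assumptions, not what the script actually stores):

```python
import json


def bulk_payload(urls, index_name):
    """Build an Elasticsearch _bulk request body: one action line plus one
    document line per URL, newline-delimited, with a trailing newline."""
    lines = []
    for url in urls:
        # Action line: index this document, using the URL as its ID.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": url}}))
        # Source line: the document body itself.
        lines.append(json.dumps({"url": url}))
    return "\n".join(lines) + "\n"


# Hypothetical usage with a made-up temporary index name.
payload = bulk_payload(["https://www.gov.uk/vat-rates"], "seo-crawl-temp")
```

The payload would then be POSTed to the cluster's `_bulk` endpoint.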
- `bin/analyse missing_from_crawl CRAWLER_INDEX`: Find pages which are in the current GOV.UK search index and sitemap but were not found by the crawler. The `CRAWLER_INDEX` parameter is an Elasticsearch index generated by the `load_sitemap` command. Output is saved to a CSV file containing the base path and some other properties of the missing documents, such as withdrawn status and document type.
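The CSV-producing step can be sketched as a set difference over the two sources. This is an illustrative Python sketch only; the field names (`base_path`, `withdrawn`, `document_type`) mirror the properties the README mentions but may not match the actual search index schema:

```python
import csv
import io


def missing_from_crawl_csv(indexed_docs, crawled_paths):
    """CSV report of documents in the search index that the crawler missed."""
    crawled = set(crawled_paths)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["base_path", "withdrawn", "document_type"])
    for doc in indexed_docs:
        if doc["base_path"] not in crawled:
            writer.writerow(
                [doc["base_path"], doc.get("withdrawn", False), doc.get("document_type", "")]
            )
    return out.getvalue()


# Hypothetical documents from the search index.
docs = [
    {"base_path": "/vat-rates", "withdrawn": False, "document_type": "answer"},
    {"base_path": "/old-page", "withdrawn": True, "document_type": "guide"},
]
report = missing_from_crawl_csv(docs, ["/vat-rates"])
```

Withdrawn status is worth keeping in the output because withdrawn pages being absent from a crawl may be expected rather than a problem.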
- `bin/analyse clean_indices INDEX_NAME`: Clean up all the temporary indexes generated by the `load_sitemap` command. Warning: this deletes search indexes. Run with care.
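Because this command deletes indexes, the important safeguard is selecting only the temporary ones before issuing any deletes. How the real command matches index names is not documented here; the prefix convention below is purely a hypothetical illustration of that safeguard:

```python
def temporary_indices(all_indices, temp_prefix):
    """Select only index names matching the temporary-index prefix, so the
    real search indexes (mainstream, government, detailed) are never touched."""
    return sorted(name for name in all_indices if name.startswith(temp_prefix))


# Hypothetical cluster listing with one temporary index.
to_delete = temporary_indices(
    ["government", "detailed", "seo-crawl-2019", "mainstream"],
    "seo-crawl-",
)
```

Only the names returned by a filter like this would then be passed to a delete call, never a wildcard.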