This repository has been archived by the owner on Jan 29, 2019. It is now read-only.

Scripts to analyse the results of a crawl of GOV.UK

alphagov/govuk-crawler-analysis

Site crawler analysis

This repo contains some temporary scripts used by the search team to analyse the results of an SEO crawl of GOV.UK.

Running the scripts

The scripts query Elasticsearch, so run them from your dev VM, which needs a copy of the mainstream, government and detailed search indexes.

To see the full usage of all the commands, run bin/analyse --help.

Commands

  1. bin/analyse missing_from_sitemap CRAWLER_SITEMAP: Find pages which were found by the crawler but are missing from the GOV.UK sitemap.

    The CRAWLER_SITEMAP parameter is the sitemap index XML file generated by the crawler. It expects any sitemap chunk files to be saved to the same directory as the sitemap index.

    Output is saved to a file.

  2. bin/analyse load_sitemap SITEMAP_FILE INDEX_NAME: Load the data from the crawler sitemap into a temporary Elasticsearch index.

    This makes it possible to run the missing_from_crawl command without having to load the entire sitemap into memory every time that command is run.

  3. bin/analyse missing_from_crawl CRAWLER_INDEX: Find pages which are in the current GOV.UK search index and sitemap but were not found by the crawler.

    The CRAWLER_INDEX parameter is an Elasticsearch index generated by the load_sitemap command.

    Output is saved to a CSV file containing the base path and some other properties of the missing documents such as withdrawn status and document type.

  4. bin/analyse clean_indices INDEX_NAME: Clean up all the temporary indexes generated by the load_sitemap command. Warning: this deletes search indexes. Run with care.
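As a rough illustration of the comparison behind missing_from_sitemap, the Python sketch below reads a sitemap index, loads the chunk files saved beside it, and returns the paths found in one set but not the other. It is not the code in this repo. The function names (chunk_paths, base_paths, crawled_paths, missing_from) are hypothetical, and it assumes the files use the standard sitemaps.org namespace and that each chunk is saved under the last path segment of its loc URL.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemap protocol namespace (assumed to be what the crawler emits).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def chunk_paths(index_file):
    """Yield local paths of the chunk files listed in a sitemap index."""
    directory = os.path.dirname(os.path.abspath(index_file))
    root = ET.parse(index_file).getroot()
    for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
        # Keep only the file name: chunks are expected in the index's directory.
        name = os.path.basename(urlparse(loc.text.strip()).path)
        yield os.path.join(directory, name)


def base_paths(chunk_file):
    """Return the set of URL paths listed in one sitemap chunk file."""
    root = ET.parse(chunk_file).getroot()
    return {
        urlparse(loc.text.strip()).path
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
    }


def crawled_paths(index_file):
    """Collect every path from all chunks referenced by the sitemap index."""
    paths = set()
    for chunk in chunk_paths(index_file):
        paths |= base_paths(chunk)
    return paths


def missing_from(found, reference):
    """Return the sorted paths that are in `found` but not in `reference`."""
    return sorted(set(found) - set(reference))
```

Calling missing_from(crawled_paths("sitemap.xml"), govuk_paths), where govuk_paths is a set of base paths taken from the GOV.UK sitemap, gives the pages the crawler found that the sitemap lacks. Swapping the arguments gives the reverse comparison, which is the idea behind missing_from_crawl.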
