Script to load aggregate statistics about rates of problem reports and searches
into the Performance Platform's govuk-info/info-statistics dataset. The data
which is sent to that dataset is also written to a local CSV file named for the
date range it covers.
The data could be used to power a dashboard of the pages with the most searches, most problem reports etc, but for now the CSV report can be used to compare the performance of pages on GOV.UK with widely differing pageview rates. There is no frontend yet to display the dataset from the Performance Platform.
The script fetches its source data from GOV.UK's search API and from the Performance Platform (PP):
- the URLs for all pages of all smart answers and simple smart answers (from the search API)
- the numbers of anonymous feedback (problem) reports per page on GOV.UK in the
last 6 weeks
(from PP's
govuk-info/page-contactsdataset, fetched by starting letter/a-/z) - the numbers of searches from pages on GOV.UK in the last 6 weeks
(from PP's
govuk-info/search-termsdataset, fetched by starting letter/a-/z) - the numbers of pageviews in the last 6 weeks for each page which appears in
the
page-contactsandsearch-termsdata (from PP'sgovuk-info/page-statisticsdataset, fetched individually per URL; this makes around 7000 GET requests)
It then combines the datapoints for all pages of each smart answer and simple smart answer so that the whole smart answer is represented by a single datapoint.
Each datapoint contains the problem report and search rates per 100,000 pageviews.
The aggregated datapoints are then written to a CSV file and POSTed to the
govuk-info/info-statistics PP dataset.
Datapoints for URLs which have previously appeared but which don't have any
associated data from the last 6 weeks are not deleted from the info-statistics
dataset, but the dates in the datapoints will indicate that they are no longer
current.
Environment variables for configuration:
PP_DATASET_TOKEN: (required) a token for thegovuk-info/info-statisticsdatasetDATA_DOMAIN: the base URL for the Performance Platform; defaults tohttps://www.performance.service.gov.uk/dataLOG_LEVEL: valid values:DEBUG,INFO(default),WARNING,ERROR,CRITICAL
To update data in the Performance Platform, use ./run.sh (this script will
create its own virtualenv).
Install the dependencies (you may want to do this inside a virtualenv):
pip install -r requirements.txt
Set PP_DATASET_TOKEN as above (perhaps to something invalid) and call the
Python script directly (you may want to comment-out the POST to the Performance
Platform first):
python -m stats.main
Install the test dependencies (you may want to do this inside a virtualenv):
pip install -r requirements_for_tests.txt
Run tests:
nosetests
The script sometimes fails because requests to the Performance Platform time out or otherwise fail (even though the PP client retries up to 5 times for 500, 502 and 503 reponses). It's safe to run the script again if this happens, because it only makes a single POST request at the end so the dataset cannot have been partially updated by the failure.