# GOV.UK Crawler Worker


This is a worker that will consume GOV.UK URLs from a message queue and crawl them, saving the output to disk.

## Requirements

To run this worker you will need:

  - Go (see the .go-version file for the version the project is built against)
  - A running RabbitMQ instance, which provides the message queue the worker consumes from
  - A running Redis instance, which the worker uses to keep track of URLs it has already crawled

## Development

You can run the tests locally by running `make`.

This project uses Godep to manage its dependencies. If you have a working Go development setup, you should be able to install Godep by running:

```sh
go get github.com/tools/godep
```
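
Once Godep is installed, the dependency versions pinned under the Godeps directory can then be fetched with, for example:

```sh
godep restore
```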

## Running

To run the worker you'll first need to build it using `go build` to generate a binary. You can then run the built binary directly using `./govuk_crawler_worker`. All configuration is injected using environment variables; for details on this, look at the main.go file.
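
As a rough illustration, a local run might look like the following. The environment variable names here are hypothetical placeholders, since the real names are defined in main.go:

```sh
go build

# NOTE: AMQP_ADDRESS and REDIS_ADDRESS are illustrative placeholder
# names; see main.go for the variables the worker actually reads.
AMQP_ADDRESS="amqp://guest:guest@localhost:5672/" \
REDIS_ADDRESS="127.0.0.1:6379" \
  ./govuk_crawler_worker
```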

## How it works

This is a message queue worker that consumes URLs from a queue and crawls them, saving the output to disk. While that is the main reason for the worker to exist, it carries out a few other activities before each page gets written to disk.

## Workflow

The workflow for the worker can be defined as the following set of steps:

  1. Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
  2. Crawl the received URL
  3. Write the body of the crawled URL to disk
  4. Extract any matching URLs from the HTML body of the crawled URL
  5. Publish the extracted URLs to the worker's own exchange
  6. Acknowledge that the URL has been crawled
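
Steps 2 to 6 can be sketched in Go roughly as below. This is a simplified, hypothetical outline rather than the worker's actual code: the real pipeline lives in workflow.go, and the helper functions here (`crawl`, `extractURLs`, `publish`) are illustrative stubs.

```go
// A simplified, hypothetical sketch of steps 2 to 6 of the workflow.
// The real implementation in workflow.go is structured differently;
// the helpers below are illustrative stubs, not the worker's API.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// crawl fetches the page body for a URL (stubbed for illustration).
func crawl(url string) ([]byte, error) {
	return []byte(`<html><a href="/browse">Browse</a></html>`), nil
}

// extractURLs pulls matching GOV.UK links out of an HTML body (stubbed).
func extractURLs(body []byte) []string {
	return []string{"https://www.gov.uk/browse"}
}

// publish sends a URL back to the worker's exchange (stubbed).
func publish(url string) error {
	fmt.Println("publishing:", url)
	return nil
}

// process runs a single URL, already read from the queue, through the
// remaining workflow steps.
func process(url, rootDir string) error {
	body, err := crawl(url) // step 2: crawl the received URL
	if err != nil {
		return err
	}

	// step 3: write the body to disk under a path derived from the URL
	name := strings.TrimPrefix(url, "https://www.gov.uk/")
	if err := os.WriteFile(filepath.Join(rootDir, name+".html"), body, 0644); err != nil {
		return err
	}

	// steps 4 and 5: extract further URLs and publish them to the exchange
	for _, u := range extractURLs(body) {
		if err := publish(u); err != nil {
			return err
		}
	}

	// step 6: acknowledging the message on the queue would happen here
	return nil
}

func main() {
	if err := process("https://www.gov.uk/bank-holidays", os.TempDir()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```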

## The Interface

The public interface for the worker is the exchange labelled `govuk_crawler_exchange`. When the worker starts it creates this exchange and binds it to its own queue for consumption.

If you provide user credentials for RabbitMQ that aren't on the root vhost /, you may wish to bind a global exchange yourself so that other applications can publish to it more easily.
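
As an illustration, another application could seed a crawl by publishing a URL to this exchange. The sketch below assumes the github.com/streadway/amqp client and that the message body is a bare URL string; the routing key is also an assumption, so check how the worker binds its queue before relying on this:

```go
// A hypothetical publisher that seeds the crawl by sending a URL to the
// worker's exchange. Assumes the github.com/streadway/amqp client and a
// bare URL string as the message body; verify both against the worker's
// source before using.
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// Connect to a local RabbitMQ on the root vhost "/".
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatalln("dial:", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatalln("channel:", err)
	}
	defer ch.Close()

	// Publish a URL to the exchange the worker created at startup; the
	// worker's own queue is bound to it, so the message will be crawled.
	err = ch.Publish(
		"govuk_crawler_exchange", // exchange (created by the worker)
		"#",                      // routing key (assumed)
		false,                    // mandatory
		false,                    // immediate
		amqp.Publishing{
			ContentType: "text/plain",
			Body:        []byte("https://www.gov.uk/bank-holidays"),
		},
	)
	if err != nil {
		log.Fatalln("publish:", err)
	}
}
```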
