GOV.UK Crawler Worker
This is a worker that will consume GOV.UK URLs from a message queue and crawl them, saving the output to disk.
To run this worker you will need:
You can run the tests locally by running
go get github.com/tools/godep
To run the worker, you'll first need to build it using go build to
generate a binary. You can then run the built binary directly using
./govuk_crawler_worker. All configuration is injected using
environment variables. For details on this look at the
How it works
This is a message queue worker that consumes URLs from a queue and crawls them, saving the output to disk. Whilst this is the main reason for the worker to exist, it carries out a few other activities before the page gets written to disk.
The workflow for the worker can be defined as the following set of steps:
- Read a URL from the queue, e.g. https://www.gov.uk/bank-holidays
- Crawl the received URL
- Write the body of the crawled URL to disk
- Extract any matching URLs from the HTML body of the crawled URL
- Publish the extracted URLs to the worker's own exchange
- Acknowledge that the URL has been crawled
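The steps above can be sketched in Go roughly as follows. This is an illustration only, not the worker's actual code: the steps type, the handle and extractURLs functions, the regular expression used in place of a real HTML parser, and the "same host as the crawled page" rule for deciding which URLs match are all assumptions made for the example. The queue, crawler and disk are replaced by plain functions so the sketch runs without RabbitMQ or network access.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefPattern is a deliberately simple stand-in for a real HTML parser.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// extractURLs returns the absolute form of every link in body that
// points at the same host as base (an assumed matching rule).
func extractURLs(base *url.URL, body []byte) []string {
	var found []string
	for _, m := range hrefPattern.FindAllSubmatch(body, -1) {
		ref, err := url.Parse(string(m[1]))
		if err != nil {
			continue // skip links that cannot be parsed
		}
		abs := base.ResolveReference(ref)
		if abs.Host == base.Host {
			found = append(found, abs.String())
		}
	}
	return found
}

// steps bundles the side effects of the workflow as plain functions
// (hypothetical; the real worker talks to RabbitMQ and the filesystem).
type steps struct {
	crawl   func(u string) ([]byte, error)
	write   func(u string, body []byte) error
	publish func(u string) error
	ack     func(u string) error
}

// handle processes one URL read from the queue, in the order listed above.
func handle(rawURL string, s steps) error {
	base, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	body, err := s.crawl(rawURL)
	if err != nil {
		return err
	}
	if err := s.write(rawURL, body); err != nil {
		return err
	}
	for _, next := range extractURLs(base, body) {
		if err := s.publish(next); err != nil {
			return err
		}
	}
	// Acknowledge only once every earlier step has succeeded.
	return s.ack(rawURL)
}

func main() {
	var published []string
	s := steps{
		crawl: func(string) ([]byte, error) {
			return []byte(`<a href="/browse">Browse</a> <a href="https://example.com/">Elsewhere</a>`), nil
		},
		write:   func(string, []byte) error { return nil },
		publish: func(u string) error { published = append(published, u); return nil },
		ack:     func(string) error { return nil },
	}
	if err := handle("https://www.gov.uk/bank-holidays", s); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(published)
}
```

In this sketch a failure at any step returns early without acknowledging, on the assumption that an unacknowledged message can be handled again later.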
The public interface for the worker is the exchange labelled govuk_crawler_exchange. When the worker starts it creates this exchange and binds it to its own queue for consumption.
If you provide user credentials for RabbitMQ that aren't on the root
virtual host /, you may wish to bind a global exchange yourself for
easier publishing by other applications.