get Hansard into a saner format
Latest commit 41b1729 Apr 23, 2009 Daniel Haran keep links inside interventions


Hansard scraper

Parliament's minutes (“Hansard”) are stuck in the age of dead trees. This project is a first step toward making them truly digital.

The objective of this scraper is to extract structured content from a Hansard page. Anyone can use it, although it was originally created for use by citizen-factory, “Hansard 2.0”:

Try it

ruby output.rb > semantic.html

It's effing slow: a run takes around 2 minutes on a MacBook Pro.

The code's also not the prettiest.


The project does not try to resolve / disambiguate members or bills. URLs are preserved so the importing application can do so.
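As a rough illustration of what "the importing application can do so" might look like, here is a minimal sketch that pulls the preserved links out of an intervention and turns them into absolute URLs, which an importer could then use as keys for members or bills. Everything specific in it is an assumption: the markup in `sample`, the `BASE_URL` host, the example member and bill, and the `extract_links` helper are all hypothetical and not part of this project. The scraper's real output may be structured differently.

```ruby
require "uri"

# Hypothetical base URL; the real Hansard host is not stated in this README.
BASE_URL = "http://example.org/hansard/"

# Collect [link text, absolute URL] pairs from an HTML fragment.
# A regex is enough for a sketch; a real importer should use an HTML parser.
def extract_links(html, base = BASE_URL)
  html.scan(/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/m).map do |href, text|
    [text.strip, URI.join(base, href).to_s]
  end
end

# Hypothetical markup: the scraper's actual output structure may differ.
sample = <<~HTML
  <div class="intervention">
    <a href="/members/123">Mr. Example</a>:
    I rise to speak to <a href="/bills/C-2">Bill C-2</a>.
  </div>
HTML

extract_links(sample).each { |text, url| puts "#{text} -> #{url}" }
```

Because the URL rather than the display name is the stable part, an importer would probably want to look up members and bills by the resolved URL and treat the link text as a label only.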

About the data

The last session of parliament as of writing:

A couple of sample pages are saved in hansards/
