get parl.gc.ca Hansard into a saner format
Latest commit 41b1729 Apr 23, 2009 Daniel Haran keep links inside interventions


Hansard scraper

Parliament's minutes (“Hansard”) are stuck in the age of dead trees. This project is a first step toward making them really digital.

The objective of this scraper is to extract structured content from a Hansard page. Anyone can use it, although it was originally created for use by citizen-factory, “Hansard 2.0”: github.com/danielharan/citizen-factory/tree/master

Try it

ruby output.rb > semantic.html

It's effing slow: it takes around 2 minutes on a MacBook Pro.

The code's also not the prettiest.

Limitation

The project does not try to resolve / disambiguate members or bills. URLs are preserved so the importing application can do so.
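As a rough illustration of what "the importing application can do so" might look like, here is a minimal sketch that groups link texts by their preserved URL, so that different spellings pointing at the same URL end up together. Everything in it is an assumption: the sample markup, the URL paths, and the `links_by_url` helper are hypothetical and not part of this project, and the real output of output.rb may be shaped differently. A regex is used only to keep the sketch dependency-free; a real importer would likely use an HTML parser.

```ruby
# Hypothetical fragment; the markup actually emitted by output.rb may differ.
SAMPLE = <<~HTML
  <p>I thank <a href="/members/123">the Minister of Finance</a> and
  <a href="/members/123">Mr. Example</a> for Bill <a href="/bills/C-10">C-10</a>.</p>
HTML

# Hypothetical helper: collect every link as [href, text], then group the
# distinct link texts under the URL they point to.
def links_by_url(html)
  html.scan(/<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/m)
      .group_by { |href, _text| href }
      .transform_values { |pairs| pairs.map { |_href, text| text.strip }.uniq }
end

links_by_url(SAMPLE).each do |url, names|
  puts "#{url}: #{names.join(' | ')}"
end
```

Keying on the URL rather than the displayed name is the point: the scraper leaves names unresolved, but two mentions that share a URL can be treated as the same member or bill by the importer.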

About the data

The most recent session of Parliament as of writing: www2.parl.gc.ca/housechamberbusiness/chambersittings.aspx?Parl=40&Ses=2&Language=E&Mode=2

A couple of sample pages are saved in hansards/
