.gitignore
LICENSE
README.md
__init__.py
clean.py
collect.py
config.py
scrape.py

Ontario Sunshine List Open Scraper

This is a set of scripts and helper functions for anyone who wants to work with the Public Sector Salary Disclosure data, also known as the Sunshine List, as published by the Ontario Ministry of Finance.

How to Use

Download the Data

The basic output from this toolchain is available for download in CSV format here: https://s3-us-west-1.amazonaws.com/ontariosunshinelist/data_2015_06_17.csv
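
As a quick sanity check after downloading, the file can be inspected with Python's standard library. This is a minimal sketch, not part of the toolchain: `summarize_csv` is a hypothetical helper, and because the column names are not documented here, it reads them from the header row instead of hard-coding any.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for a UTF-8 CSV file with a header row."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows at all
        row_count = sum(1 for _ in reader)  # data rows after the header
    return header, row_count
```

For example, `columns, n = summarize_csv('data_2015_06_17.csv')` assumes the file was saved under that name in the current directory.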

Process Your Own

import ontario_sunshine_list as osl

Collect the raw HTML from http://www.fin.gov.on.ca/en/publications/salarydisclosure/
As of June 17, 2015, this is approximately 500 MB of data.

col = osl.Collector()
col.run('/home/aleksey/data/sunshine/')
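
To confirm that the collection step fetched roughly the expected amount of data, the size of the target directory can be measured once the run finishes. This is only a sketch using the standard library; `directory_size_mb` is a hypothetical helper and is not provided by this package.

```python
import os


def directory_size_mb(root):
    """Total size of all regular files under root, in megabytes (10**6 bytes)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skips broken symlinks
                total += os.path.getsize(path)
    return total / 1e6
```

Calling `directory_size_mb('/home/aleksey/data/sunshine/')` after the collector has run should then give a figure in the neighbourhood of the size quoted above, although later disclosure years will add to it.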

Scrape the data

scr = osl.Scraper()
df = scr.run('/home/aleksey/data/sunshine/')

Clean the data

cle = osl.Cleaner()
df = cle.run(df)

Save

df.to_csv('/home/aleksey/data.csv', encoding='utf-8')
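
The four steps above can be combined into a single function. The sketch below uses only the calls shown in this README (`Collector().run`, `Scraper().run`, `Cleaner().run`, `to_csv`); the `run_pipeline` name and the choice to pass the imported module in as an argument are assumptions made for illustration.

```python
def run_pipeline(osl, raw_dir, out_csv):
    """Collect, scrape, clean, and save, using the calls shown in this README.

    `osl` is the imported ontario_sunshine_list module. Passing it in as an
    argument, rather than importing it here, is a choice of this sketch.
    """
    osl.Collector().run(raw_dir)     # download the raw HTML into raw_dir
    df = osl.Scraper().run(raw_dir)  # scrape the saved HTML into df
    df = osl.Cleaner().run(df)       # clean the scraped data
    df.to_csv(out_csv, encoding='utf-8')
    return df
```

With the package imported as `osl`, this would be called as `run_pipeline(osl, '/home/aleksey/data/sunshine/', '/home/aleksey/data.csv')`.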

Outstanding Issues

  • Only the initial disclosure is scraped. Addenda are not scraped or processed.
  • A couple of garbled datapoints in the HTML are not captured.