node.js scraping script for NYC OMB Capital Budget PDFs
csv
pdf
.eslintrc
.gitignore
LICENSE.md
README.md
package.json
scrape.js

nyc-capital-commitment-scrape

Node.js scraping script for NYC OMB Capital Commitment Plans. The NYC Capital Commitment Plan is a detailed budget document that complements the Capital Budget, showing sub-project committed costs and expected commitment dates. The Capital Commitment Plan is published as a 4-part PDF, but machine-readable data at the commitment level is not published.

Disclaimer

I have not thoroughly QC'd the output csv in this repo, and cannot vouch for its accuracy. I recommend that you spot-check individual commitments against the source PDFs if you plan to use this dataset. Please open an issue in this repo if you find discrepancies, or submit a pull request if you can help with the scraping code.
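
One way to choose which commitments to spot-check is to take an evenly spaced sample of rows from the output file and look each one up in the source PDF by hand. The sketch below assumes the rows have already been parsed into an array; the function name pickSpotChecks is hypothetical and is not part of scrape.js.

```javascript
// Hypothetical helper, not part of scrape.js: returns up to `count` rows
// spread evenly across the dataset, starting with the first row.
function pickSpotChecks(rows, count) {
  if (count <= 0 || rows.length === 0) return [];
  // Distance between sampled rows; at least 1 so the loop always advances.
  const step = Math.max(1, Math.floor(rows.length / count));
  const picks = [];
  for (let i = 0; i < rows.length && picks.length < count; i += step) {
    picks.push(rows[i]);
  }
  return picks;
}

// Stand-in rows; in practice these would come from a parsed commitments.csv.
const exampleRows = Array.from({ length: 10 }, (_, i) => ({ row: i + 1 }));
console.log(pickSpotChecks(exampleRows, 3));
```

An evenly spaced sample covers every part of the file, which helps when extraction errors are concentrated in particular pages or agencies.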

Get Data

October 2016 Capital Commitment Plan - Individual Commitments (csv) - 26,432 commitments, $84.3B

October 2016 Capital Commitment Plan - Grouped by Project ID (csv) - 9,207 Capital Projects

January 2017 Capital Commitment Plan - Individual Commitments (csv) - 29,616 commitments, $99.6B

January 2017 Capital Commitment Plan - Grouped by Project ID (csv) - 9,543 Capital Projects

April 2017 Capital Commitment Plan - Individual Commitments (csv) - 33,259 commitments, $105.8B

April 2017 Capital Commitment Plan - Grouped by Project ID (csv) - 9,983 Capital Projects

Agency Code Lookup (csv)
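
The "Grouped by Project ID" files roll individual commitments up to the project level. A minimal sketch of that kind of roll-up is shown here. The field names projectId and amount are assumptions for illustration; the actual column headers in the csv files may differ, and this is not necessarily how the published files were produced.

```javascript
// Hypothetical field names (projectId, amount); check the real CSV headers.
function groupByProject(commitments) {
  const totals = new Map();
  for (const { projectId, amount } of commitments) {
    // Start a new entry the first time a project ID is seen.
    const entry = totals.get(projectId) || { projectId, commitmentCount: 0, total: 0 };
    entry.commitmentCount += 1;
    entry.total += amount;
    totals.set(projectId, entry);
  }
  // One summary object per project, in first-seen order.
  return [...totals.values()];
}

// Made-up sample data, not taken from any Capital Commitment Plan.
const sampleCommitments = [
  { projectId: 'A-1', amount: 100 },
  { projectId: 'B-2', amount: 50 },
  { projectId: 'A-1', amount: 25 },
];
console.log(groupByProject(sampleCommitments));
```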

How to Use

Install dependencies: npm install

Run scrape.js with a directory of Capital Commitment Plan PDFs as an argument.

For example, if you have Capital Commitment Plan PDFs in /pdf/2017-Jan, run: node scrape /pdf/2017-Jan

The script will create a directory of the same name in /csv, with a new file called commitments.csv containing the data.
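
The mapping from input directory to output file described above can be sketched as follows. This is an illustration only: the helper name outputPathFor and the relative 'csv' root are assumptions, and scrape.js may build the path differently.

```javascript
const path = require('path');

// Hypothetical helper, not taken from scrape.js: maps a directory of PDFs
// to the commitments.csv path under a same-named directory in csv.
function outputPathFor(pdfDir, csvRoot = 'csv') {
  // basename of the resolved path is the last directory name, e.g. "2017-Jan".
  const planName = path.basename(path.resolve(pdfDir));
  return path.join(csvRoot, planName, 'commitments.csv');
}

console.log(outputPathFor('/pdf/2017-Jan'));
```

Under these assumptions, the directory name 2017-Jan carries over unchanged from the input path to the output path, so each plan's csv sits alongside the others under one root.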