Safe Water

This is a Code for Boston project that is trying to predict health-based drinking water violations using the Environmental Protection Agency's Safe Drinking Water Information System (SDWIS).

We will explore other datasets from the EPA, including the Toxics Release Inventory database, the Superfund Enterprise Management System, the Environmental Radiation Monitoring database, and Enforcement and Compliance History Online (ECHO).

We are using Python for the analysis and are in the early stages.

Find us in the #water channel on the Code for Boston Slack.

Getting started

The easiest way to install the Python dependencies is using Pipenv. Ensure that you have Pipenv installed, then, with the repo as your working directory, run:

pipenv install

To add a new Python dependency, run:

pipenv install antigravity  # Replace `antigravity` with desired package name

Be sure to commit Pipfile and Pipfile.lock to the repo.
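
Once those files are committed, other contributors can recreate the same environment with the pipenv install command above. To install exactly the versions pinned in Pipfile.lock (this is general Pipenv behaviour, not something specific to this repo), you can instead run:

pipenv sync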

Running the notebooks

To run the notebooks, run:

pipenv run jupyter notebook jupyter_notebooks

Running the scraper

To run the scraper, run:

pipenv run python -i swid-db-scraper.py

This will load the file and put you into an interactive Python prompt. From that prompt, run the following:

pull_envirofacts_data()

⚠️ Note: The scraper can take hours to run.
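
Because the scrape can take hours, you may prefer to start it non-interactively and leave it running in the background. The one-liner below is only a sketch: it assumes the script defines pull_envirofacts_data() at module level (as the interactive instructions above imply) and that no flags are required; scraper.out is just an example output file name.

nohup pipenv run python -c "exec(open('swid-db-scraper.py').read()); pull_envirofacts_data()" > scraper.out 2>&1 &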

Data aggregation

The web scraper functions as a command-line program with a flag-based interface. Run it with --help (or -h) for further information; the flags it covers are listed here, with an example invocation and an illustrative sketch after the list:

  • -p takes the number of processes the script can use to work through the different datasets
  • -l takes a pathname to a logfile that the script will write to
  • -rs takes a number setting how many records are included in any one request
  • -mq takes a number setting how many times a URL should be attempted before giving up
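
For example, a run combining all four flags might look like the line below (the file name follows the instructions above; the values are illustrative, not recommended settings):

pipenv run python swid-db-scraper.py -p 4 -l scraper.log -rs 1000 -mq 3

And here is a minimal sketch of how a flag-based interface like this could be wired up with argparse and multiprocessing. It is not the project's actual implementation: the table names, URL pattern, defaults, and helper functions are all illustrative assumptions.

import argparse
import multiprocessing
import time
import requests

def fetch_with_retries(url, max_attempts):
    # Retry a URL up to max_attempts times (the behaviour the -mq flag controls).
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            time.sleep(2 * attempt)  # simple backoff before the next attempt

def scrape_table(job):
    # Pull one table in a single request of -rs records; a real scraper would paginate.
    table, request_size, max_attempts = job
    url = "https://example.invalid/{}/rows/0:{}".format(table, request_size - 1)  # placeholder URL
    return table, fetch_with_retries(url, max_attempts)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="illustrative flag-based scraper interface")
    parser.add_argument("-p", type=int, default=1, help="number of worker processes")
    parser.add_argument("-l", default="scraper.log", help="pathname of a logfile to write to")
    parser.add_argument("-rs", type=int, default=1000, help="records per request")
    parser.add_argument("-mq", type=int, default=3, help="attempts per URL before giving up")
    opts = parser.parse_args()

    tables = ["VIOLATION", "WATER_SYSTEM"]  # placeholder dataset names
    jobs = [(table, opts.rs, opts.mq) for table in tables]
    with multiprocessing.Pool(opts.p) as pool:
        for table, _payload in pool.imap_unordered(scrape_table, jobs):
            with open(opts.l, "a") as log:
                log.write("finished {}\n".format(table))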

Webscraping TODO

  • test the parallel implementation of the web scraper
  • adjust the web scraper to reference the file name
  • update the list of databases we want beyond the SDWIS databases
  • test that data is read in correctly
  • figure out how we can join and aggregate data so that we don't always have to navigate
  • additional printout info to better understand where we are in the scripts
  • additional error handling

Running scripts on your machine

(These instructions have only been tested on Solus Linux, but they should hold on other OSes.) On the command line, cd to the safe water directory and run:

python3.6 -i swid-db-scraper.py

This will load the file and put you into an interactive Python prompt. From that prompt, run:

pull_envirofacts_data()