# Data Preparation

This notebook extracts the data, cleans and processes it, merges the different data sets, and outputs the final data used for modelling.   

We first extract data from the EPA's Safe Drinking Water Information System ([SDWIS](https://www.epa.gov/enviro/sdwis-model)): the [water systems](https://enviro.epa.gov/enviro/ef_metadata_html.ef_metadata_table?p_table_name=WATER_SYSTEM&p_topic=SDWIS) table, which gives each system's characteristics (notably the ZIP code where it is located), and the [violations](https://enviro.epa.gov/enviro/ef_metadata_html.ef_metadata_table?p_table_name=VIOLATION&p_topic=SDWIS) table, which records Maximum Contaminant Level (MCL) violations (notably which contaminants were involved and when).  

Then, 


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import time

import requests # to read the data from the REST API of EPA Envirofacts
import csv # needed as data accessed through the REST API are .csv (other possibilities are .xml or .xls)


## Extracting Water Systems Data from SDWIS

In [26]:
# more on the API: https://www.epa.gov/enviro/envirofacts-data-service-api

# notes on the API:
#     - WATER_SYSTEM = table name
#     - PWS_ACTIVITY_CODE/A ==> select only active water systems
#     - EPA_REGION/01 ==> New England

CSV_URL = 'https://enviro.epa.gov/enviro/efservice/WATER_SYSTEM/EPA_REGION/01/PWS_ACTIVITY_CODE/A/CSV'

with requests.Session() as s:
    download = s.get(CSV_URL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    initial_WS = list(cr)        
    WATER_SYSTEM_raw = pd.DataFrame(initial_WS)
print(WATER_SYSTEM_raw.shape)
WATER_SYSTEM_raw.head()

(10483, 48)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,38,39,40,41,42,43,44,45,46,47
0,WATER_SYSTEM.PWSID,WATER_SYSTEM.PWS_NAME,WATER_SYSTEM.NPM_CANDIDATE,WATER_SYSTEM.PRIMACY_AGENCY_CODE,WATER_SYSTEM.EPA_REGION,WATER_SYSTEM.SEASON_BEGIN_DATE,WATER_SYSTEM.SEASON_END_DATE,WATER_SYSTEM.PWS_ACTIVITY_CODE,WATER_SYSTEM.PWS_DEACTIVATION_DATE,WATER_SYSTEM.PWS_TYPE_CODE,...,WATER_SYSTEM.ZIP_CODE,WATER_SYSTEM.COUNTRY_CODE,WATER_SYSTEM.STATE_CODE,WATER_SYSTEM.SOURCE_WATER_PROTECTION_CODE,WATER_SYSTEM.SOURCE_PROTECTION_BEGIN_DATE,WATER_SYSTEM.OUTSTANDING_PERFORMER,WATER_SYSTEM.OUTSTANDING_PERFORM_BEGIN_DATE,WATER_SYSTEM.CITIES_SERVED,WATER_SYSTEM.COUNTIES_SERVED,
1,ME0004628,MACHIAS TRAILER PARK,Y,ME,01,,,A,,CWS,...,04654,US,ME,N,,,,MACHIAS,Washington,
2,ME0092288,MARSH BROOK ESTATES,Y,ME,01,,,A,,CWS,...,01746,US,MA,N,,,,SANFORD,York,
3,ME0009198,TRAILS END STEAK HOUSE & TAVERN,Y,ME,01,01-01,12-31,A,,TNCWS,...,04936,US,ME,N,,,,EUSTIS,Franklin,
4,ME0094505,R & R VACATION HOME PARK,Y,ME,01,01-01,12-31,A,,TNCWS,...,04055,US,ME,N,,,,NAPLES,Cumberland,
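
The two extraction cells in this notebook use URLs that follow the same pattern: the table name, then column/value filter pairs, then the output format. As a minimal sketch, a small helper could assemble these URLs so that the filters are spelled out in one place. The names `BASE_URL` and `build_efservice_url` are hypothetical and not part of the Envirofacts API; the helper only joins strings and makes no request.

```python
# Hypothetical helper (not part of the EPA API): it only reproduces the
# URL pattern used in this notebook, TABLE/COLUMN/VALUE/.../FORMAT.
BASE_URL = "https://enviro.epa.gov/enviro/efservice"

def build_efservice_url(table, filters=None, output="CSV"):
    """Join table name, column/value filter pairs and output format into one URL."""
    parts = [BASE_URL, table]
    for column, value in (filters or {}).items():
        parts.extend([column, str(value)])  # each filter adds COLUMN/VALUE
    parts.append(output)
    return "/".join(parts)

# The two queries used in this notebook:
water_system_url = build_efservice_url(
    "WATER_SYSTEM", {"EPA_REGION": "01", "PWS_ACTIVITY_CODE": "A"}
)
violation_url = build_efservice_url("VIOLATION", {"EPA_REGION": "01"})
print(water_system_url)
print(violation_url)
```

The resulting strings can be passed to `requests.get` exactly like `CSV_URL`.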


In [27]:
# Some Data Cleaning:

water_system = WATER_SYSTEM_raw.copy()

# set the first row as header:
new_header = water_system.iloc[0] # grab the first row for the header
new_header = new_header.str.split('.').str[1] # we remove the redundant table name (WATER_SYSTEM) in column names
new_header = new_header.str.lower() # set to lower case, as less annoying
water_system = water_system[1:] # take the data less the header row
water_system.columns = new_header # set the header row as the df header

# we remove the last column of null, that is an artifact of the extraction:
water_system = water_system.dropna(axis = 1, how='all') # axis = 1 = columns

water_system.tail() # looks good for now.

Unnamed: 0,pwsid,pws_name,npm_candidate,primacy_agency_code,epa_region,season_begin_date,season_end_date,pws_activity_code,pws_deactivation_date,pws_type_code,...,city_name,zip_code,country_code,state_code,source_water_protection_code,source_protection_begin_date,outstanding_performer,outstanding_perform_begin_date,cities_served,counties_served
10478,NH1108030,WINDY RIDGE ORCHARD,Y,NH,1,06-01,10-31,A,,TNCWS,...,HAVERHILL,3774,US,NH,,,,,HAVERHILL,Grafton
10479,NH1109020,MOUNTAIN VALLEY TREATMENT CTR,Y,NH,1,01-01,12-31,A,,TNCWS,...,ORFORD,3777,US,NH,,,,,HAVERHILL,Grafton
10480,NH1112010,STONEGATE ACRES,Y,NH,1,,,A,,CWS,...,CONCORD,3302,US,NH,,,,,HEBRON,Grafton
10481,NH1113010,HILLSIDE INN CONDOS,Y,NH,1,,,A,,CWS,...,HEBRON,3241,US,NH,,,,,HEBRON,Grafton
10482,NH1117010,CAMP BEREA/DINING HALL,Y,NH,1,01-01,12-31,A,,TNCWS,...,HEBRON,3241,US,NH,,,,,HEBRON,Grafton
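
The header promotion and column cleanup above could probably also be done at parse time. The sketch below is one possible alternative, not the method used in this notebook: it parses the decoded text with `pd.read_csv`, reading every column as a string so that New England ZIP codes keep their leading zero (the raw output shows `04654`, while the displayed tail shows values such as `3774`). The function name `parse_efservice_csv` and the three-column `sample` text are illustrative assumptions; the sample assumes the header row ends with a trailing comma, as the empty last column of the raw output suggests.

```python
import io
import pandas as pd

def parse_efservice_csv(text):
    """Parse Envirofacts-style CSV text into a DataFrame of strings (hypothetical helper)."""
    # dtype=str keeps codes such as '04654' intact instead of converting them to numbers
    df = pd.read_csv(io.StringIO(text), dtype=str)
    # the trailing comma in the header produces an extra 'Unnamed: N' column; drop it
    df = df.loc[:, [not c.startswith("Unnamed") for c in df.columns]]
    # 'WATER_SYSTEM.ZIP_CODE' -> 'zip_code'
    df.columns = [c.split(".")[-1].lower() for c in df.columns]
    return df

# Illustrative sample built from values shown in the output above (assumed layout)
sample = (
    "WATER_SYSTEM.PWSID,WATER_SYSTEM.PWS_NAME,WATER_SYSTEM.ZIP_CODE,\n"
    "ME0004628,MACHIAS TRAILER PARK,04654\n"
    "ME0009198,TRAILS END STEAK HOUSE & TAVERN,04936\n"
)
parsed = parse_efservice_csv(sample)
print(parsed)
```

In the notebook, `decoded_content` would take the place of `sample`.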


In [31]:
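# We save the cleaned data as CSV, in case the API stops working in the future:
import os
active_water_systems_NewEngland = water_system.copy() # better name
os.makedirs('../data', exist_ok=True) # to_csv needs an existing folder and a file name, not just a folder
# NOTE: the file name below is a placeholder; the original cell only gave the '../data/' folder
active_water_systems_NewEngland.to_csv('../data/active_water_systems_NewEngland.csv')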

## Extracting Violations Data from SDWIS

In [29]:
# more on the API: https://www.epa.gov/enviro/envirofacts-data-service-api

# notes on the API:
#     - VIOLATION = table name
#     - EPA_REGION/01 ==> New England

CSV_URL = 'https://enviro.epa.gov/enviro/efservice/VIOLATION/EPA_REGION/01/CSV'

with requests.Session() as s:
    download = s.get(CSV_URL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    initial_V = list(cr)        
    VIOLATIONS_raw = pd.DataFrame(initial_V)
print(VIOLATIONS_raw.shape)
VIOLATIONS_raw.head()

(100002, 35)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
0,VIOLATION.PWSID,VIOLATION.VIOLATION_ID,VIOLATION.FACILITY_ID,VIOLATION.POPULATION_SERVED_COUNT,VIOLATION.NPM_CANDIDATE,VIOLATION.PWS_ACTIVITY_CODE,VIOLATION.PWS_DEACTIVATION_DATE,VIOLATION.PRIMARY_SOURCE_CODE,VIOLATION.POP_CAT_5_CODE,VIOLATION.PRIMACY_AGENCY_CODE,...,VIOLATION.RTC_ENFORCEMENT_ID,VIOLATION.RTC_DATE,VIOLATION.PUBLIC_NOTIFICATION_TIER,VIOLATION.ORIGINATOR_CODE,VIOLATION.SAMPLE_RESULT_ID,VIOLATION.CORRECTIVE_ACTION_ID,VIOLATION.RULE_CODE,VIOLATION.RULE_GROUP_CODE,VIOLATION.RULE_FAMILY_CODE,
1,ME0094672,157508,,388,N,A,,GW,1,ME,...,,,2,S,,,110,100,110,
2,ME0009683,60007,,454,N,A,,GW,1,ME,...,1510,08-SEP-14,2,S,,,110,100,110,
3,ME0000625,6318,,100,N,A,,GW,1,ME,...,638928,04-NOV-13,3,S,,,410,400,410,
4,ME0000625,6316,,100,N,A,,GW,1,ME,...,638930,13-AUG-14,3,S,,,210,200,210,
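
The raw violations table has the same layout as the raw water systems table (header in the first row, an empty last column), so the same cleanup should apply. The sketch below is an assumption about how that step might look, not the notebook's actual code: a hypothetical `promote_header` helper, applied to a tiny hand-made frame that imitates `VIOLATIONS_raw` with three of its columns, followed by parsing `RTC_DATE` values such as `08-SEP-14`. The date format `%d-%b-%y` is inferred from the rows shown above and assumes English month abbreviations.

```python
import pandas as pd

# Tiny stand-in for VIOLATIONS_raw (values copied from the output above)
raw = pd.DataFrame([
    ["VIOLATION.PWSID", "VIOLATION.VIOLATION_ID", "VIOLATION.RTC_DATE", ""],
    ["ME0094672", "157508", "", ""],
    ["ME0009683", "60007", "08-SEP-14", ""],
    ["ME0000625", "6318", "04-NOV-13", ""],
])

def promote_header(raw):
    """Use the first row as lower-case column names and drop the unnamed column."""
    header = raw.iloc[0].str.split(".").str[-1].str.lower().tolist()
    df = raw.iloc[1:].reset_index(drop=True)
    df.columns = header
    # the empty-named column is the artifact of the trailing comma
    return df.loc[:, df.columns != ""].copy()

violations = promote_header(raw)

# Empty strings become NaT; '08-SEP-14' becomes 2014-09-08
violations["rtc_date"] = pd.to_datetime(
    violations["rtc_date"], format="%d-%b-%y", errors="coerce"
)
print(violations)
```

Other date columns in the table may use the same format, but that would need to be checked against the data before reusing the format string.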
