# Data Processing and Feature Engineering

In this notebook, we merge the datasets, select features, and engineer new ones. This first pass at feature selection and engineering is based on my knowledge of the data.

We proceed in this order:  

1. Filter water systems of interest (active water systems in New England)
2. Select features of interest for water systems, and data cleaning
3. Filter violations of interest (pesticides)
4. Select features of interest for violations, and data cleaning
5. Add estimated pesticide use for the water systems
6. Merge water systems and violations to obtain violations by water systems (!)
7. Engineer new features of interest for violations by water systems


In the end, we obtain a dataset on which we can do model training and selection. In a later step, when trying different models, a new round of feature selection and engineering might be needed; it will be performed in a separate notebook.


In [2]:
import pandas as pd
import numpy as np
import datetime
import time

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import geopandas as gpd # to identify neighboring water systems via ZIP code
from shapely.geometry import Point, Polygon

## Water Systems

We have already selected the water systems of interest: only active water systems in EPA Region 01 (New England).

In [6]:
# load water systems:
ws_raw = pd.read_csv('../data/active_water_systems_NewEngland.csv')
ws_raw.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,1,2,3,4,5
pwsid,ME0004628,ME0092288,ME0009198,ME0094505,ME0007311
pws_name,MACHIAS TRAILER PARK,MARSH BROOK ESTATES,TRAILS END STEAK HOUSE & TAVERN,R & R VACATION HOME PARK,NOKOMIS CAMPING AREA LLC
npm_candidate,Y,Y,Y,Y,Y
primacy_agency_code,ME,ME,ME,ME,ME
epa_region,1,1,1,1,1
season_begin_date,,,01-01,01-01,05-01
season_end_date,,,12-31,12-31,09-30
pws_activity_code,A,A,A,A,A
pws_deactivation_date,,,,,
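The `Unnamed: 0` row in the output above is what pandas typically shows when a CSV was saved together with its index. A minimal sketch, on a made-up two-row CSV rather than the real file, of how `index_col=0` avoids the extra column (the toy `csv_text` is an assumption for illustration only):

```python
import io
import pandas as pd

# Toy CSV mimicking a file written by DataFrame.to_csv() with its default index.
# The header starts with an empty field, which pandas names "Unnamed: 0".
csv_text = ",pwsid,epa_region\n1,ME0004628,1\n2,ME0092288,1\n"

# Default read: the saved index comes back as an ordinary column.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the index instead.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

In this notebook the column is dropped anyway by the feature selection further down, so this is only a convenience.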


In [93]:
# DATA CLEANING FOR WATER SYSTEMS
ws = ws_raw.copy()

# IMPORTANT NOTE: we cannot use the ZIP code to locate the water systems,
# as it is sometimes the ZIP code of the legal entity, which is not necessarily in the same place.

# ws.state_code.value_counts() # cannot be used either, as it is the state of the legal entity.

# Thus we locate the water systems with counties_served + primacy_agency_code
# ws.primacy_agency_code.value_counts()

# The 5 water systems administered by "tribes" have no city/county served.
# ws[ws.primacy_agency_code == '01'].T

# We manually impute those, based on the city_name (legal entity):
# ws[ws.counties_served == 'Not Reported']

ws.loc[ws.pwsid == 'VT0021432', 'counties_served'] = 'windham'

ws.loc[ws['pwsid'] == '010307001', 'counties_served'] = 'dukes'
ws.loc[ws['pwsid'] == '010307001', 'cities_served'] = 'chilmark'

ws.loc[ws['pwsid'] == '010106001', 'counties_served'] = 'new london'
ws.loc[ws['pwsid'] == '010106001', 'cities_served'] = 'ledyard'

ws.loc[ws['pwsid'] == '010109005', 'counties_served'] = 'new london'
ws.loc[ws['pwsid'] == '010109005', 'cities_served'] = 'montville'

ws.loc[ws['pwsid'] == '010502003', 'counties_served'] = 'washington'
ws.loc[ws['pwsid'] == '010502003', 'cities_served'] = 'charlestown'
ws.loc[ws['pwsid'] == '010502002', 'counties_served'] = 'washington'
ws.loc[ws['pwsid'] == '010502002', 'cities_served'] = 'charlestown'

# we set the 'cities_served' and 'counties_served' to lower case:
ws.cities_served = ws.cities_served.str.lower() # set to lower case, as less annoying
ws.counties_served = ws.counties_served.str.lower() # set to lower case, as less annoying
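The manual imputation above repeats the same `.loc` assignment for each system. A possible alternative is to keep the corrections in one lookup table and apply them in a loop. The sketch below uses a small toy frame (`ws_demo`) instead of the real `ws`; the `manual_locations` dictionary and the toy rows are assumptions for illustration, and only two of the corrections are reproduced.

```python
import pandas as pd

# Toy stand-in for `ws`, with only the columns touched by the imputation.
ws_demo = pd.DataFrame({
    "pwsid": ["VT0021432", "010307001", "ME0004628"],
    "cities_served": ["Not Reported", "Not Reported", "MACHIAS"],
    "counties_served": ["Not Reported", "Not Reported", "Washington"],
})

# pwsid -> (county, city); None means "leave the existing value untouched".
manual_locations = {
    "VT0021432": ("windham", None),
    "010307001": ("dukes", "chilmark"),
}

for pwsid, (county, city) in manual_locations.items():
    mask = ws_demo["pwsid"] == pwsid
    ws_demo.loc[mask, "counties_served"] = county
    if city is not None:
        ws_demo.loc[mask, "cities_served"] = city

# Same normalisation as in the cell above.
ws_demo["cities_served"] = ws_demo["cities_served"].str.lower()
ws_demo["counties_served"] = ws_demo["counties_served"].str.lower()

print(ws_demo)
```

Keeping the corrections in one dictionary makes them easier to review and extend if more unreported locations turn up.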


In [94]:
# First Filter to select needed columns

select_features_ws = ['pwsid', 'pws_name', 'primacy_agency_code', 'pws_type_code',
                      'gw_sw_code', # whether the water system is considered to have a ground water ("gw") or surface water ("sw") source under SDWA.
                      'owner_type_code', 'population_served_count', 'primary_source_code',
                      'is_wholesaler_ind', # whether the system is a wholesaler of water.
                      'is_school_or_daycare_ind', # whether the water system's primary service area is a school or daycare
                      'service_connections_count',
                      'source_water_protection_code', # N: WS has not implemented source water protection according to state policy. Y: WS has substantially implemented it.
                      # ! source_water_protection_code: most Y only after 01-JAN-2012
                      'cities_served', 'counties_served']

ws = ws.loc[:, select_features_ws] # keep only selected columns
ws.head()


Unnamed: 0,pwsid,pws_name,primacy_agency_code,pws_type_code,gw_sw_code,owner_type_code,population_served_count,primary_source_code,is_wholesaler_ind,is_school_or_daycare_ind,service_connections_count,source_water_protection_code,cities_served,counties_served
0,ME0004628,MACHIAS TRAILER PARK,ME,CWS,GW,P,65,GW,N,N,26,N,machias,washington
1,ME0092288,MARSH BROOK ESTATES,ME,CWS,GW,P,70,GW,N,N,28,N,sanford,york
2,ME0009198,TRAILS END STEAK HOUSE & TAVERN,ME,TNCWS,GW,P,390,GW,N,N,1,N,eustis,franklin
3,ME0094505,R & R VACATION HOME PARK,ME,TNCWS,GW,P,40,GW,N,N,1,N,naples,cumberland
4,ME0007311,NOKOMIS CAMPING AREA LLC,ME,TNCWS,GW,P,118,GW,N,N,1,N,harrison,cumberland


### Finding the Location of the Water Systems

TO BE DONE.

We filter the violations by year at this stage, because we are only interested in recent ones. The number of observed violations increased sharply and reached a new plateau in 2009 (known from [previous work](https://github.com/de-la-viz/US-Public-Water-Systems/blob/master/US%20Drinking%20Water%20Quality%20Violations.ipynb)) because of the introduction of new guidelines and rules. We therefore focus on violations from 2009 onwards.
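A minimal sketch of such a year filter, on made-up rows. The date column name `compl_per_begin_date` is an assumption here; the column that actually carries the violation date should be used instead.

```python
import pandas as pd

# Toy violations table; the dates are invented for illustration.
violations_demo = pd.DataFrame({
    "pwsid": ["ME0004628", "ME0092288", "ME0009198"],
    "compl_per_begin_date": ["2007-04-01", "2009-01-01", "2015-07-01"],  # assumed column name
})

# Parse the dates, then keep only violations from 2009 onwards.
violations_demo["compl_per_begin_date"] = pd.to_datetime(
    violations_demo["compl_per_begin_date"]
)
recent_violations = violations_demo[
    violations_demo["compl_per_begin_date"].dt.year >= 2009
]

print(recent_violations)
```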

## Merging all SDWIS Together

We first add the contaminant code information to the violations.


In [None]:
# merging contaminants codes with violations

violations = violations.merge(contaminant_codes, how='left', on='contaminant_code') # we want to keep all violations
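When doing a left merge like this, it can be useful to check that no violation is duplicated and to see which codes found no match. The sketch below uses toy frames with invented codes and names; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy data: the codes and names are made up, not real SDWIS values.
violations_demo = pd.DataFrame({
    "pwsid": ["ME0004628", "ME0092288", "ME0009198"],
    "contaminant_code": [1, 1, 3],
})
codes_demo = pd.DataFrame({
    "contaminant_code": [1, 2],
    "contaminant_name": ["contaminant A", "contaminant B"],
})

merged_demo = violations_demo.merge(
    codes_demo,
    how="left",               # keep every violation
    on="contaminant_code",
    validate="many_to_one",   # raises if a code appears twice in codes_demo
    indicator=True,           # adds a "_merge" column describing each row's origin
)

# Violations whose contaminant code has no entry in the code table.
unmatched = merged_demo[merged_demo["_merge"] == "left_only"]
print(unmatched)
```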


We then merge the water systems and violations by _PWSID_ (the relation is not one-to-one).

In [None]:
# merging water systems with violations:

# one water system may have several violations,
# and one violation may affect several water systems (although this is rare)
# `ws` is the cleaned water systems frame from above; we merge explicitly on the PWSID
NE_viol = ws.merge(violations, how='outer', on='pwsid')
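To illustrate what the outer merge produces, here is a small sketch on toy frames: systems without violations are kept with missing violation fields, and a per-system count can then be derived. The `violation_id` column and all values are assumptions for illustration.

```python
import pandas as pd

# Toy water systems and violations; system "C" has no violation.
ws_demo = pd.DataFrame({
    "pwsid": ["A", "B", "C"],
    "population_served_count": [65, 70, 390],
})
violations_demo = pd.DataFrame({
    "pwsid": ["A", "A", "B"],
    "violation_id": [101, 102, 103],  # hypothetical identifier column
})

# Outer merge on the shared key: one row per (system, violation) pair,
# plus one row for each system that has no violation at all.
merged_demo = ws_demo.merge(violations_demo, how="outer", on="pwsid")

# count() ignores missing values, so systems without violations get 0.
n_violations = merged_demo.groupby("pwsid")["violation_id"].count()
print(n_violations)
```

A count like this is one candidate for the "violations by water systems" features mentioned in the outline.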


In [1]:
# pesticides

In [None]:
# the years 2013 to 2017 are estimates.
# the years 2015 to 2017 are not directly accessible yet.
# they first have to be downloaded; we then load them here:

pesticide_use_2015 = pd.read_csv('../data/pesticide_use/2015PreliminaryEstimates/EPest.county.estimates.2015.txt', sep='\t')
pesticide_use_2016 = pd.read_csv('../data/pesticide_use/2106PreliminaryEstimates/EPest.county.estimates.2016.txt', sep='\t')
pesticide_use_2017 = pd.read_csv('../data/pesticide_use/2017PreliminaryEstimatesNoCA/EPest.county.estimates_noCA.2017.txt', sep='\t')

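# append to previous years (pd.concat, since DataFrame.append was removed in pandas 2.0):
pesticide_use_2009_17 = pd.concat(
    [pesticide_use_2009_14, pesticide_use_2015, pesticide_use_2016, pesticide_use_2017],
    ignore_index=True)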

print(pesticide_use_2009_17.shape)
pesticide_use_2009_17.head()
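Since the water systems are restricted to New England, the county estimates can be restricted in the same way once the years are stacked. The sketch below works on toy frames. The column names (`COMPOUND`, `YEAR`, `STATE_FIPS_CODE`, `COUNTY_FIPS_CODE`, `EPEST_HIGH_KG`) are assumptions about the file layout and the values are invented; the state FIPS codes for the six New England states are the standard ones.

```python
import pandas as pd

# Standard state FIPS codes for the six New England states.
NEW_ENGLAND_FIPS = {9: "CT", 23: "ME", 25: "MA", 33: "NH", 44: "RI", 50: "VT"}

# Toy yearly tables; column names are assumed, values are made up.
use_2014 = pd.DataFrame({
    "COMPOUND": ["ATRAZINE", "ATRAZINE"],
    "YEAR": [2014, 2014],
    "STATE_FIPS_CODE": [23, 6],
    "COUNTY_FIPS_CODE": [29, 1],
    "EPEST_HIGH_KG": [1.5, 20.0],
})
use_2015 = pd.DataFrame({
    "COMPOUND": ["ATRAZINE", "ATRAZINE"],
    "YEAR": [2015, 2015],
    "STATE_FIPS_CODE": [50, 6],
    "COUNTY_FIPS_CODE": [25, 1],
    "EPEST_HIGH_KG": [0.7, 18.0],
})

# Stack the years, then map the FIPS code to a state abbreviation.
# Rows outside New England get NaN and are dropped.
combined = pd.concat([use_2014, use_2015], ignore_index=True)
combined["state"] = combined["STATE_FIPS_CODE"].map(NEW_ENGLAND_FIPS)
new_england_use = combined.dropna(subset=["state"])

print(new_england_use)
```

The state abbreviation matches the `primacy_agency_code` values used for the water systems, which could help when joining pesticide use to systems by state and county.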