# Data Processing and Feature Engineering

In this notebook, we merge the different datasets together, select features and engineer some new features. This first step in feature selection and engineering is based on my knowledge of the data. 

We proceed in this order:  

1. Filter water systems of interest (active water systems in New England)
2. Select features of interest for water systems, and data cleaning
3. Filter violations of interest (pesticides)
4. Select features of interest for violations, and data cleaning
5. Add estimated pesticide use for the water systems
6. Merge water systems and violations to obtain violations by water systems (!)
7. Engineer new features of interest for violations by water systems


In the end, we obtain a dataset on which we can do model training and selection. In a later step, when trying different models, a new round of feature selection and engineering might be needed; it will be performed in a separate notebook.


In [194]:
import pandas as pd
import numpy as np
import datetime
import time

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import geopandas as gpd # to identify neighboring water systems via ZIP code
from shapely.geometry import Point, Polygon

## Water Systems

We have already selected the water systems of interest: EPA region 01 (New England), active water systems only.

In [195]:
# load water systems:
ws_raw = pd.read_csv('../data/active_water_systems_NewEngland.csv')
ws_raw.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,0,1,2,3,4
pwsid,ME0004628,ME0092288,ME0009198,ME0094505,ME0007311
pws_name,MACHIAS TRAILER PARK,MARSH BROOK ESTATES,TRAILS END STEAK HOUSE & TAVERN,R & R VACATION HOME PARK,NOKOMIS CAMPING AREA LLC
npm_candidate,Y,Y,Y,Y,Y
primacy_agency_code,ME,ME,ME,ME,ME
epa_region,1,1,1,1,1
season_begin_date,,,01-01,01-01,05-01
season_end_date,,,12-31,12-31,09-30
pws_activity_code,A,A,A,A,A
pws_deactivation_date,,,,,


In [215]:
# DATA CLEANING FOR WATER SYSTEMS
ws = ws_raw.copy()

# IMPORTANT NOTE: we cannot use the ZIP code to locate the water systems, 
# as it is sometimes the ZIP code of the legal entity, which is not necessarily in the same place.
# ==> we can locate it with the ansi_entity_code.

# We remove the 6 water systems without a location (no ansi_entity_code)
# (It turns out they are tribal lands) ==> ! introduction of a potential bias
# ws_raw.ansi_entity_code.isnull().sum() 
ws = ws[~ws['ansi_entity_code'].isnull()]

# verifying that the counties_served from the water system table is the same as the county_served from geography:
# (ws.county_served != ws.counties_served).sum()
# so deleting one:
ws.drop('county_served', axis=1, inplace=True)

ws.shape

(10476, 49)
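
The column check in the comments above relies on `!=`, which counts two missing values as different. A minimal sketch of a NaN-aware comparison, on a toy frame with made-up values (the real table is not used here), could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the water systems table (values are made up)
demo = pd.DataFrame({
    'county_served':   ['York', 'Franklin', np.nan],
    'counties_served': ['York', 'Franklin', np.nan],
})

# Naive comparison: NaN != NaN is True, so the row with two missing values counts as a mismatch
naive_mismatches = (demo['county_served'] != demo['counties_served']).sum()

# NaN-aware comparison: a row is a mismatch only if the values differ
# and are not both missing
both_missing = demo['county_served'].isnull() & demo['counties_served'].isnull()
true_mismatches = ((demo['county_served'] != demo['counties_served']) & ~both_missing).sum()

print(naive_mismatches, true_mismatches)
```

If the NaN-aware count is zero, the two columns carry the same information and dropping one is safe.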

In [216]:
# First Filter to select needed columns

select_features_ws = ['pwsid', 'pws_name', 'primacy_agency_code', 'pws_type_code', 
                      'gw_sw_code', # whether the water system is considered to have a ground water ("gw") or surface water ("sw") source under SDWA.
                     'owner_type_code', 'population_served_count', 'primary_source_code',
                      'is_wholesaler_ind', # whether the system is a wholesaler of water.
                     'is_school_or_daycare_ind', # if the water system’s primary service area is a school or daycare
                     'service_connections_count', 
                      'source_water_protection_code', # N: WS has not implemented source water protection according to state policy. Y: WS has substantially implemented.
                     # ! source_water_protection_code: most Y only after 01-JAN-2012
                      'cities_served', 'counties_served', 'ansi_entity_code']

ws = ws.loc[:, select_features_ws] # keep only selected columns
ws.head()


Unnamed: 0,pwsid,pws_name,primacy_agency_code,pws_type_code,gw_sw_code,owner_type_code,population_served_count,primary_source_code,is_wholesaler_ind,is_school_or_daycare_ind,service_connections_count,source_water_protection_code,cities_served,counties_served,ansi_entity_code
0,ME0004628,MACHIAS TRAILER PARK,ME,CWS,GW,P,65,GW,N,N,26,N,MACHIAS,Washington,29.0
1,ME0092288,MARSH BROOK ESTATES,ME,CWS,GW,P,70,GW,N,N,28,N,SANFORD,York,31.0
2,ME0009198,TRAILS END STEAK HOUSE & TAVERN,ME,TNCWS,GW,P,390,GW,N,N,1,N,EUSTIS,Franklin,7.0
3,ME0094505,R & R VACATION HOME PARK,ME,TNCWS,GW,P,40,GW,N,N,1,N,NAPLES,Cumberland,5.0
4,ME0007311,NOKOMIS CAMPING AREA LLC,ME,TNCWS,GW,P,118,GW,N,N,1,N,HARRISON,Cumberland,5.0


### Adding Geographic Information to the Water Systems

TO BE DONE?
* = add the shape from shapefile
* = add Lat/Lon from county (centroid)



## Violations

We restrict the data by year, because we are only interested in recent violations. The number of observed violations increased sharply and reached a new plateau in 2009 (known from [previous work](https://github.com/de-la-viz/US-Public-Water-Systems/blob/master/US%20Drinking%20Water%20Quality%20Violations.ipynb)) because of the introduction of new guidelines and rules. We will thus focus on violations from 2009 onwards.

In [131]:
# load the violations:
violations_raw = pd.read_csv('../data/violations_NewEngland.csv')
# violations_raw.head().T

In [132]:
# SOME FILTERING AND DATA CLEANING:

violations = violations_raw.copy()

# transform the dates to datetime:
violations.rtc_date = pd.to_datetime(violations.rtc_date)
violations.compl_per_begin_date = pd.to_datetime(violations.compl_per_begin_date)
violations.compl_per_end_date = pd.to_datetime(violations.compl_per_end_date)

violations.loc[:,'year'] = violations['compl_per_begin_date'].dt.year # year when the violation was discovered
violations.loc[:,'month'] = violations['compl_per_begin_date'].dt.month # month when the violation was discovered

# create new column with quarters (1 to 4; stays missing where the date is missing):
violations.loc[:,'quarter'] = violations['compl_per_begin_date'].dt.quarter


select_features_viol = ['pwsid', 'violation_id', 'violation_code', 'violation_category_code', 
                       'is_health_based_ind', 'contaminant_code', 'is_major_viol_ind',
                       'rule_group_code',
                       'year', 'month', 'quarter']

violations = violations.loc[:, select_features_viol] # keep only selected columns

print(violations.shape)
violations.head().T




(100001, 11)


Unnamed: 0,0,1,2,3,4
pwsid,ME0094672,ME0009683,ME0000625,ME0000625,ME0000625
violation_id,157508,60007,6318,6316,6315
violation_code,22,22,75,27,27
violation_category_code,MCL,MCL,Other,MR,MR
is_health_based_ind,Y,Y,N,N,N
contaminant_code,3100,3100,7500,2456,2950
is_major_viol_ind,,,,Y,Y
rule_group_code,100,100,400,200,200
year,2008,2014,2014,2011,2011
month,6,8,10,1,1
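
The sample output above still includes a violation dated 2008, so the restriction to 2009 onwards has to be applied explicitly. A minimal sketch on toy data follows; `FIRST_YEAR` and the demo frame are names introduced here for illustration only:

```python
import pandas as pd

FIRST_YEAR = 2009  # hypothetical constant: first year kept in the analysis

# Toy violations with made-up ids and dates
demo = pd.DataFrame({
    'violation_id': [1, 2, 3, 4],
    'compl_per_begin_date': pd.to_datetime(['2008-06-01', '2014-08-01', '2011-01-01', None]),
})
demo['year'] = demo['compl_per_begin_date'].dt.year

# Keep violations from FIRST_YEAR onwards; rows with a missing date
# compare as False and are therefore dropped as well
recent = demo[demo['year'] >= FIRST_YEAR].copy()

print(recent['violation_id'].tolist())
```

Whether rows with a missing date should be dropped or kept is a judgment call; this sketch drops them.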


### Adding the Contaminant Names

In [136]:
# loading the contaminants codes:
contaminants = pd.read_csv('../data/contaminant_codes.csv')
contaminants = contaminants.drop('Unnamed: 0', axis=1)
# contaminants.head()

In [138]:
# merging the name to the violations
violations = violations.merge(contaminants, how='left', on='contaminant_code')

### Selection of the Contaminants of Interest: Pesticides

We start with a quick look at the number of violations per contaminant:

In [174]:
# most frequently occurring contaminants in New England:
violations.contaminant_name.value_counts().head(50)

COLIFORM (TCR)                   9459
PUBLIC NOTICE                    2904
LEAD & COPPER RULE               2217
NITRATE                          1709
CONSUMER CONFIDENCE RULE         1385
ARSENIC                           620
NITRITE                           560
E. COLI                           499
TOTAL HALOACETIC ACIDS (HAA5)     465
TETRACHLOROETHYLENE               457
1,2,4-TRICHLOROBENZENE            450
TOLUENE                           449
TRICHLOROETHYLENE                 447
CIS-1,2-DICHLOROETHYLENE          447
DICHLOROMETHANE                   446
STYRENE                           445
P-DICHLOROBENZENE                 444
BENZENE                           443
1,1,1-TRICHLOROETHANE             443
1,1,2-TRICHLOROETHANE             442
1,2-DICHLOROETHANE                442
1,1-DICHLOROETHYLENE              441
CHLOROBENZENE                     441
VINYL CHLORIDE                    441
TRANS-1,2-DICHLOROETHYLENE        441
O-DICHLOROBENZENE                 441
CARBON TETRA

In [175]:
# some contaminant names are missing:
violations['contaminant_name'].isnull().sum()

52392

In [184]:
# let's check those and find out why:
empty_cont_name = violations[violations['contaminant_name'].isnull()]
empty_cont_name.contaminant_code.isnull().sum() # only 125 are missing a contaminant code
empty_cont_name.contaminant_code.value_counts()
# 8000: not found


3100.0    11228
7500.0     3754
5000.0     2706
8000.0     2213
7000.0     2023
1040.0     1746
8000       1471
3014.0      713
1005.0      689
2950.0      505
2987.0      451
1041.0      451
2979.0      446
2983.0      446
2982.0      445
2980.0      443
2976.0      443
2977.0      442
2968.0      442
2964.0      442
2985.0      442
2990.0      442
2969.0      441
2984.0      441
2989.0      440
2996.0      440
2456.0      439
2992.0      438
2991.0      438
2981.0      437
          ...  
1022.0       18
4102.0       17
4174.0       17
100.0        16
4012.0       15
2920.0       13
4007.0       11
4008.0       11
400.0        11
4270.0        3
4264.0        3
1925.0        3
4172.0        3
1011.0        3
4101.0        3
1920.0        2
600.0         2
1002.0        2
1009.0        2
1008.0        2
1013.0        2
1905.0        2
4044.0        2
1006.0        1
3015.0        1
800.0         1
1095.0        1
2257.0        1
2265.0        1
3002.0        1
Name: contaminant_code, 
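
The counts above list `8000.0` and `8000` as separate entries, which suggests that `contaminant_code` holds mixed types. If the lookup table stores its codes in yet another form, the merge cannot match them, which would explain many of the empty names. This is an assumption, not something verified in this notebook. A minimal sketch of normalizing both keys before merging, on toy data with placeholder codes and names:

```python
import pandas as pd

# Toy stand-ins; codes and names are placeholders, not real SDWIS entries
violations_demo = pd.DataFrame({'contaminant_code': [1111.0, '2222', 3333, None]})
contaminants_demo = pd.DataFrame({
    'contaminant_code': ['1111', '2222', '3333'],
    'contaminant_name': ['NAME_A', 'NAME_B', 'NAME_C'],
})

def normalize_code(codes):
    """Coerce codes to float64 so that 8000, 8000.0 and '8000' compare equal.
    Values that cannot be parsed become NaN."""
    return pd.to_numeric(codes, errors='coerce').astype('float64')

violations_demo['contaminant_code'] = normalize_code(violations_demo['contaminant_code'])
contaminants_demo['contaminant_code'] = normalize_code(contaminants_demo['contaminant_code'])

merged = violations_demo.merge(contaminants_demo, how='left', on='contaminant_code')

# Only the violation without any code should remain unnamed
print(merged['contaminant_name'].isnull().sum())
```

After such a normalization, the remaining empty names would point to codes that are truly absent from the lookup table.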

We add a new column to the violations to identify pesticides:

In [149]:
# list of pesticides:
pesticide_use_2009_14 = pd.read_csv('../data/pesticide_use/pesticide_use_2009_14.csv')
pesticide_use_2015 = pd.read_csv('../data/pesticide_use/2015PreliminaryEstimates/EPest.county.estimates.2015.txt', sep='\t')
pesticide_use_2016 = pd.read_csv('../data/pesticide_use/2106PreliminaryEstimates/EPest.county.estimates.2016.txt', sep='\t')
pesticide_use_2017 = pd.read_csv('../data/pesticide_use/2017PreliminaryEstimatesNoCA/EPest.county.estimates_noCA.2017.txt', sep='\t')

# append to previous years (DataFrame.append was removed in pandas 2.0, so we use pd.concat):
pesticide_use_2009_17 = pd.concat(
    [pesticide_use_2009_14, pesticide_use_2015, pesticide_use_2016, pesticide_use_2017],
    ignore_index=True)


In [155]:
pesticides = pesticide_use_2009_17.COMPOUND.unique() # a list of all pesticides

In [171]:
# flag violations whose contaminant is in the pesticide list (missing names are flagged 0):
violations.loc[:,'is_pesticide'] = violations['contaminant_name'].isin(pesticides).astype(int)

In [166]:
violations['contaminant_name'][0] in pesticides

False

In [172]:
violations['contaminant_name'].isnull().sum()

52392

TO DO: 

1. ADD THE CONTAMINANTS NAME
2. Filter for pesticide



array(['2,4-D', '2,4-DB', '6-BENZYLADENINE', 'ABAMECTIN', 'ACEPHATE',
       'ACEQUINOCYL', 'ACETAMIPRID', 'ACETOCHLOR', 'ACIBENZOLAR',
       'ACIFLUORFEN', 'ALACHLOR', 'ALDICARB', 'ALUMINUM PHOSPHIDE',
       'AMETRYN', 'AMINOPYRALID', 'AMITRAZ', 'AMPELOMYCES QUISQUALIS',
       'ASULAM', 'ATRAZINE', 'AVIGLYCINE', 'AZADIRACHTIN',
       'AZINPHOS-METHYL', 'AZOXYSTROBIN', 'BACILLUS CEREUS',
       'BACILLUS PUMILIS', 'BACILLUS SUBTILIS', 'BACILLUS THURINGIENSIS',
       'BENFLURALIN', 'BENOMYL', 'BENSULFURON', 'BENSULIDE', 'BENTAZONE',
       'BIFENAZATE', 'BIFENTHRIN', 'BISPYRIBAC', 'BOSCALID', 'BROMACIL',
       'BROMOXYNIL', 'BUPROFEZIN', 'BUTRALIN', 'BUTYLATE',
       'CALCIUM POLYSULFIDE', 'CAPTAN', 'CARBARYL', 'CARBOFURAN',
       'CARBOPHENOTHION', 'CARBOXIN', 'CARFENTRAZONE-ETHYL',
       'CHINOMETHIONAT', 'CHLORANTRANILIPROLE', 'CHLORETHOXYFOS',
       'CHLORFENAPYR', 'CHLORIDAZON', 'CHLORIMURON', 'CHLORMEQUAT',
       'CHLORONEB', 'CHLOROPICRIN', 'CHLOROTHALONIL', 'CHLORPROP

## Merging all SDWIS Together

We first add the contaminants codes information to the violations.


In [None]:
# merging contaminant codes with violations
# (skip this cell if the names were already merged above, to avoid duplicated columns)

violations = violations.merge(contaminants, how='left', on='contaminant_code') # we want to keep all violations


We then merge the water systems and violations on _PWSID_ (the relation is not one-to-one).

In [None]:
# merging water systems with violations:

# 1 water system might see several violations, 
# and 1 violation might affect several water systems (albeit it is rare)
NE_viol = ws.merge(violations, how='outer', on='pwsid')
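
Because one water system can have many violations, it can help to let pandas check the expected relation and to record where each row came from. A minimal sketch on toy data, using the `validate` and `indicator` parameters of `DataFrame.merge` (the demo frames and their values are made up):

```python
import pandas as pd

# Toy water systems and violations
ws_demo = pd.DataFrame({'pwsid': ['A1', 'A2', 'A3'],
                        'pws_type_code': ['CWS', 'TNCWS', 'CWS']})
viol_demo = pd.DataFrame({'pwsid': ['A1', 'A1', 'A3'],
                          'violation_id': [10, 11, 12]})

# validate='one_to_many' raises if pwsid is duplicated in the water systems table;
# indicator=True adds a '_merge' column telling which side each row came from
merged = ws_demo.merge(viol_demo, how='outer', on='pwsid',
                       validate='one_to_many', indicator=True)

# Water systems without any recorded violation
no_violation = merged[merged['_merge'] == 'left_only']

print(merged.shape, no_violation['pwsid'].tolist())
```

With an outer merge, rows marked `right_only` would be violations whose water system was filtered out earlier; whether to keep them depends on the analysis.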


In [1]:
# pesticides

In [None]:
# the years 2013 to 2017 are estimates.
# the years 2015 to 2017 are not directly accessible yet. 
# we first have to download them, then I reload them here:

pesticide_use_2015 = pd.read_csv('../data/pesticide_use/2015PreliminaryEstimates/EPest.county.estimates.2015.txt', sep='\t')
pesticide_use_2016 = pd.read_csv('../data/pesticide_use/2106PreliminaryEstimates/EPest.county.estimates.2016.txt', sep='\t')
pesticide_use_2017 = pd.read_csv('../data/pesticide_use/2017PreliminaryEstimatesNoCA/EPest.county.estimates_noCA.2017.txt', sep='\t')

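# append to previous years (DataFrame.append was removed in pandas 2.0, so we use pd.concat):
pesticide_use_2009_17 = pd.concat(
    [pesticide_use_2009_14, pesticide_use_2015, pesticide_use_2016, pesticide_use_2017],
    ignore_index=True)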

print(pesticide_use_2009_17.shape)
pesticide_use_2009_17.head()
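
Step 5 of the outline links these county-level estimates to the water systems. A rough sketch of one way to do this, on toy data, follows. It assumes that the estimate tables carry `YEAR`, `STATE_FIPS_CODE`, `COUNTY_FIPS_CODE` and `EPEST_HIGH_KG` columns (only `COMPOUND` is confirmed by this notebook) and that `ansi_entity_code` is the county FIPS code. It also uses a hypothetical `state_fips` column on the water systems. All numbers are made up.

```python
import pandas as pd

# Toy pesticide estimates (column names other than COMPOUND are assumptions)
use_demo = pd.DataFrame({
    'COMPOUND': ['ATRAZINE', '2,4-D', 'ATRAZINE'],
    'YEAR': [2015, 2015, 2015],
    'STATE_FIPS_CODE': [23, 23, 23],
    'COUNTY_FIPS_CODE': [29, 29, 31],
    'EPEST_HIGH_KG': [1.5, 2.5, 4.0],
})

# Toy water systems; 'state_fips' is a hypothetical helper column
ws_demo = pd.DataFrame({
    'pwsid': ['X1', 'X2', 'X3'],
    'state_fips': [23, 23, 23],
    'ansi_entity_code': [29.0, 31.0, 7.0],
})

# Total estimated use per county and year, summed over all compounds
county_use = (use_demo
              .groupby(['YEAR', 'STATE_FIPS_CODE', 'COUNTY_FIPS_CODE'], as_index=False)['EPEST_HIGH_KG']
              .sum()
              .rename(columns={'EPEST_HIGH_KG': 'pesticide_kg'}))

# Attach the county totals to each water system; systems in counties
# without an estimate keep a missing value
ws_use = ws_demo.merge(county_use, how='left',
                       left_on=['state_fips', 'ansi_entity_code'],
                       right_on=['STATE_FIPS_CODE', 'COUNTY_FIPS_CODE'])

print(ws_use[['pwsid', 'pesticide_kg']])
```

Summing over all compounds is only one option; keeping one column per compound of interest would preserve more detail for the later feature engineering.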