# Data Processing and Feature Engineering

In this notebook, I merge the different datasets together, select features and engineer some new features. This first step in feature selection and engineering is based on my knowledge of the data. 

I proceed in this order:  

1. Filter water systems of interest (active water systems in New England)
2. Select features of interest for water systems, and data cleaning
3. Filter violations of interest (pesticides)
4. Select features of interest for violations, and data cleaning
5. Add estimated pesticide use for the water systems
6. Merge water systems and violations to obtain violations by water systems (!)
7. Create outcome variables for violations by water systems
8. Engineer new features of interest for violations by water systems 


In the end, we obtain a dataset on which we can do model training and selection. In a later step, when trying different models, a new loop of feature selection and engineering might be needed; it will be performed in a separate notebook.


In [2]:
import pandas as pd
import numpy as np
import datetime
import pandas_profiling # fast way to perform exploratory data analysis of a Pandas Dataframe
# import time

# import matplotlib.pyplot as plt
# %matplotlib inline
# import seaborn as sns

# import geopandas as gpd # to identify neighboring water systems (could be done)
# from shapely.geometry import Point, Polygon

## Water Systems

I have already selected the water systems of interest: EPA region 01 (New England), and only active water systems.

In [3]:
# load water systems:
ws_raw = pd.read_csv('../data/active_water_systems_NewEngland.csv')
ws_raw.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,0,1,2,3,4
pwsid,ME0004628,ME0092288,ME0009198,ME0094505,ME0007311
pws_name,MACHIAS TRAILER PARK,MARSH BROOK ESTATES,TRAILS END STEAK HOUSE & TAVERN,R & R VACATION HOME PARK,NOKOMIS CAMPING AREA LLC
npm_candidate,Y,Y,Y,Y,Y
primacy_agency_code,ME,ME,ME,ME,ME
epa_region,1,1,1,1,1
season_begin_date,,,01-01,01-01,05-01
season_end_date,,,12-31,12-31,09-30
pws_activity_code,A,A,A,A,A
pws_deactivation_date,,,,,


In [4]:
# DATA CLEANING FOR WATER SYSTEMS
ws = ws_raw.copy()

# IMPORTANT NOTE: we cannot use the ZIP code to locate the water systems, 
# as it is sometimes the ZIP code of the legal entity, which is not necessarily in the same place.
# ==> we can locate it with the ansi_entity_code.

# I remove the 6 water systems without localization (no ansi_entity_code)
# (It turns out they are tribal lands) ==> ! introduction of a potential bias
# ws_raw.ansi_entity_code.isnull().sum() 
ws = ws[~ws['ansi_entity_code'].isnull()]

# verifying that the counties_served from the water system table is the same as the county_served from geography:
# (ws.county_served != ws.counties_served).sum()
# so deleting one:
ws.drop('county_served', axis=1, inplace=True)

ws.shape

(10476, 49)

In [5]:
# First Filter to select needed columns

select_features_ws = ['pwsid', 'pws_name', 'primacy_agency_code', 'pws_type_code', 
                      'gw_sw_code', # whether the water system is considered to have a ground water ("gw") or surface water ("sw") source under SDWA.
                     'owner_type_code', 'population_served_count', 'primary_source_code',
                      'is_wholesaler_ind', # whether the system is a wholesaler of water.
                     'is_school_or_daycare_ind', # if the water system’s primary service area is a school or daycare
                     'service_connections_count', 
                      'source_water_protection_code', # N: WS has not implemented source water protection according to state policy. Y: WS has substantially implemented.
                     # ! source_water_protection_code: most Y only after 01-JAN-2012
                      'cities_served', 'counties_served', 'ansi_entity_code']

ws = ws.loc[:, select_features_ws] # keep only selected columns
ws.head()


Unnamed: 0,pwsid,pws_name,primacy_agency_code,pws_type_code,gw_sw_code,owner_type_code,population_served_count,primary_source_code,is_wholesaler_ind,is_school_or_daycare_ind,service_connections_count,source_water_protection_code,cities_served,counties_served,ansi_entity_code
0,ME0004628,MACHIAS TRAILER PARK,ME,CWS,GW,P,65,GW,N,N,26,N,MACHIAS,Washington,29.0
1,ME0092288,MARSH BROOK ESTATES,ME,CWS,GW,P,70,GW,N,N,28,N,SANFORD,York,31.0
2,ME0009198,TRAILS END STEAK HOUSE & TAVERN,ME,TNCWS,GW,P,390,GW,N,N,1,N,EUSTIS,Franklin,7.0
3,ME0094505,R & R VACATION HOME PARK,ME,TNCWS,GW,P,40,GW,N,N,1,N,NAPLES,Cumberland,5.0
4,ME0007311,NOKOMIS CAMPING AREA LLC,ME,TNCWS,GW,P,118,GW,N,N,1,N,HARRISON,Cumberland,5.0


In [6]:
ws['ansi_entity_code'] = ws['ansi_entity_code'].astype(object)
ws.dtypes

pwsid                           object
pws_name                        object
primacy_agency_code             object
pws_type_code                   object
gw_sw_code                      object
owner_type_code                 object
population_served_count          int64
primary_source_code             object
is_wholesaler_ind               object
is_school_or_daycare_ind        object
service_connections_count        int64
source_water_protection_code    object
cities_served                   object
counties_served                 object
ansi_entity_code                object
dtype: object

### Adding Geographic Information to the Water Systems

TO BE DONE?
* = add the shape from shapefile
* = add Lat/Lon from county (centroid)



## Violations

I restrict the analysis by year, because I am only interested in recent violations. The number of observed violations increased sharply and reached a new plateau in 2009 (known from [previous work](https://github.com/de-la-viz/US-Public-Water-Systems/blob/master/US%20Drinking%20Water%20Quality%20Violations.ipynb)) because new guidelines and rules were introduced. We will thus focus on violations from 2009 onwards.

In [7]:
# load the violations:
violations_raw = pd.read_csv('../data/violations_NewEngland.csv', dtype='object')
# violations_raw.head().T

In [8]:
# SOME FILTERING AND DATA CLEANING:

violations = violations_raw.copy()

# transform the dates to datetime:
violations.rtc_date = pd.to_datetime(violations.rtc_date)
violations.compl_per_begin_date = pd.to_datetime(violations.compl_per_begin_date)
violations.compl_per_end_date = pd.to_datetime(violations.compl_per_end_date)

violations.loc[:,'year'] = violations['compl_per_begin_date'].dt.year # year when the violation was discovered
violations.loc[:,'month'] = violations['compl_per_begin_date'].dt.month # month when the violation was discovered

# create new column with quarters:
def by_quarter(row):
    if row['month'] < 4:
        return 1
    elif row['month'] >= 4 and row['month'] < 7:
        return 2
    elif row['month'] >= 7 and row['month'] < 10:
        return 3
    else:
        return 4
violations.loc[:,'quarter'] = violations.apply(by_quarter, axis=1) 


select_features_viol = ['pwsid', 'violation_id', 'violation_code', 'violation_category_code', 
                       'is_health_based_ind', 'contaminant_code', 'is_major_viol_ind',
                       'rule_group_code',
                       'year', 'month', 'quarter']

violations = violations.loc[:, select_features_viol] # keep only selected columns

# Note: 91 violations have an empty contaminant_code.
# violations.contaminant_code.isnull().sum()

print(violations.shape)
violations.head().T


(65534, 11)


Unnamed: 0,0,1,2,3,4
pwsid,ME0094672,ME0009683,ME0000625,ME0000625,ME0000625
violation_id,157508,60007,6318,6316,6315
violation_code,22,22,75,27,27
violation_category_code,MCL,MCL,Other,MR,MR
is_health_based_ind,Y,Y,N,N,N
contaminant_code,3100,3100,7500,2456,2950
is_major_viol_ind,,,,Y,Y
rule_group_code,100,100,400,200,200
year,2008,2014,2014,2011,2011
month,6,8,10,1,1
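
The year, month and quarter can also be read directly from the datetime column, without a row-wise `apply`. A minimal sketch on a toy frame (the dates are made up for illustration). One difference worth noting: `by_quarter` falls through to its `else` branch and returns 4 when the month is missing, whereas `.dt.quarter` keeps the value missing.

```python
import pandas as pd

# toy data: made-up compliance period start dates, one of them missing
toy = pd.DataFrame({
    "compl_per_begin_date": pd.to_datetime(
        ["2014-01-15", "2014-06-30", "2014-08-01", "2014-12-31", None]
    )
})

# the .dt accessor exposes year, month and quarter as vectorized attributes
toy["year"] = toy["compl_per_begin_date"].dt.year
toy["month"] = toy["compl_per_begin_date"].dt.month
toy["quarter"] = toy["compl_per_begin_date"].dt.quarter

print(toy)
```

On a table with tens of thousands of rows, the vectorized accessor is usually much faster than `apply(..., axis=1)`.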


In [9]:
violations.contaminant_code.isnull().sum() # 91 violations without specified contaminant. those will be lost...

91

In [10]:
np.sort(violations.year.unique())


array([1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989,
       1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018])

### Adding the Contaminant Names

In [11]:
# loading the contaminants codes:
contaminants = pd.read_csv('../data/contaminant_codes.csv')
# contaminants.dtypes
# contaminants.head()

In [12]:
# merging the name to the violations
violations_cont = violations.merge(contaminants, how='left', on='contaminant_code')

The most common violations:

In [13]:
# most frequently occurring violations in New England:
# (not all are contaminants, nor pesticides)
violations_cont.contaminant_name.value_counts().head(50)

COLIFORM (TCR)                   13108
PUBLIC NOTICE                     4533
LEAD & COPPER RULE                3440
CONSUMER CONFIDENCE RULE          2409
NITRATE                           2162
E. COLI                            901
ARSENIC                            838
TTHM                               665
NITRITE                            662
TOTAL HALOACETIC ACIDS (HAA5)      619
TETRACHLOROETHYLENE                572
1,2-DICHLOROETHANE                 564
VINYL CHLORIDE                     564
TRANS-1,2-DICHLOROETHYLENE         564
TOLUENE                            561
P-DICHLOROBENZENE                  561
DICHLOROMETHANE                    561
1,2-DICHLOROPROPANE                560
CIS-1,2-DICHLOROETHYLENE           560
1,1-DICHLOROETHYLENE               560
STYRENE                            559
BENZENE                            559
TRICHLOROETHYLENE                  559
CARBON TETRACHLORIDE               559
O-DICHLOROBENZENE                  558
1,1,1-TRICHLOROETHANE    

In [14]:
# some contaminant names are empty:
violations_cont['contaminant_name'].isnull().sum()

2721

In [15]:
# let's check those and find why:
empty_cont_name = violations_cont[violations_cont['contaminant_name'].isnull()]
# empty_cont_name.contaminant_code.isnull().sum() # only 91 miss a contaminant code
print(empty_cont_name.contaminant_code.value_counts())
# 8000: not found ==> it is the "Revised Total Coliform Rule", 
# c.f: https://www.epa.gov/sites/production/files/2018-06/documents/2017_annual_dc_drinking_water_compliance_report_508_0.pdf

# we replace the missing names:
violations_cont.loc[violations_cont['contaminant_code'] == '8000', 
                    'contaminant_name'] = "Revised Total Coliform Rule"



8000    2630
Name: contaminant_code, dtype: int64
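
Unmatched codes can also be spotted at merge time with the `indicator` option of `merge`. The sketch below uses toy stand-ins for the two tables; the contaminant names are placeholders, not the real names behind these codes.

```python
import pandas as pd

# toy stand-ins for the violations and the contaminant lookup table
toy_violations = pd.DataFrame({
    "violation_id": ["1", "2", "3", "4"],
    "contaminant_code": ["3100", "8000", "2456", None],
})
toy_contaminants = pd.DataFrame({
    "contaminant_code": ["3100", "2456"],
    "contaminant_name": ["NAME A", "NAME B"],  # placeholder names
})

# indicator=True adds a "_merge" column telling where each row was found
merged = toy_violations.merge(
    toy_contaminants, how="left", on="contaminant_code", indicator=True
)

# rows flagged "left_only" found no match in the lookup table
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["contaminant_code"].value_counts(dropna=False))
```

Here the unmatched rows are the one with code 8000 (absent from the lookup table) and the one without any code, which mirrors the two causes of missing names found above.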


### Selection of the Contaminants of Interest: Pesticides


I add a new column to the violations that flags pesticides:

In [16]:
# list of pesticides:
pesticide_use_2009_14 = pd.read_csv('../data/pesticide_use/pesticide_use_2009_14.csv')
pesticide_use_2015 = pd.read_csv('../data/pesticide_use/2015PreliminaryEstimates/EPest.county.estimates.2015.txt', sep='\t')
pesticide_use_2016 = pd.read_csv('../data/pesticide_use/2106PreliminaryEstimates/EPest.county.estimates.2016.txt', sep='\t')
pesticide_use_2017 = pd.read_csv('../data/pesticide_use/2017PreliminaryEstimatesNoCA/EPest.county.estimates_noCA.2017.txt', sep='\t')

# concatenate all years (DataFrame.append was removed in pandas 2.0);
# sort=True keeps the alphabetical column order that append produced:
pesticide_use_2009_17 = pd.concat(
    [pesticide_use_2009_14, pesticide_use_2015, pesticide_use_2016, pesticide_use_2017],
    ignore_index=True, sort=True)


In [17]:
pesticide_use_2009_17.YEAR.unique()

array([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017])

In [50]:
pesticides = pesticide_use_2009_17.COMPOUND.unique() # a list of all pesticides

In [27]:
def is_pesticide(row):
    if row['contaminant_name'] is None:
        return 0
    elif row['contaminant_name'] in pesticides:
        return 1
    else:
        return 0
violations_cont.loc[:,'is_pesticide'] = violations_cont.apply(is_pesticide, axis=1)


In [28]:
violations_cont.is_pesticide.value_counts()

0    61807
1     3727
Name: is_pesticide, dtype: int64

In [52]:
print("! only few pesticides listed by the NAWQA are found in the SDWIS subset...")
violations_cont[violations_cont.is_pesticide == 1].contaminant_name.unique()

! only few pesticides listed by the NAWQA are found in the SDWIS subset...


array(['METHOXYCHLOR', 'SIMAZINE', 'PICLORAM', 'OXAMYL', 'GLYPHOSATE',
       'DIQUAT', 'DINOSEB', 'DALAPON', 'CARBOFURAN', 'ATRAZINE', '2,4-D',
       'METRIBUZIN', 'DICAMBA', 'ALDICARB', 'METOLACHLOR', 'METHOMYL',
       'CARBARYL', 'ZINC'], dtype=object)
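
The pesticide flag relies on exact string matches between the two sources, which may explain part of the low overlap. A minimal sketch of a vectorized version with `Series.isin`, under the assumption (not verified here) that names may differ in case or surrounding whitespace; the toy names and the `normalize` helper are illustrative only.

```python
import pandas as pd

# toy stand-ins, not the full NAWQA or SDWIS name lists
toy_pesticides = pd.Series(["Atrazine", "Simazine ", "Glyphosate"])
toy_violations = pd.DataFrame({
    "contaminant_name": ["ATRAZINE", "NITRATE", None, "SIMAZINE"],
})

def normalize(names: pd.Series) -> pd.Series:
    """Strip and upper-case names so formatting differences do not block a match."""
    return names.str.strip().str.upper()

pesticide_set = set(normalize(toy_pesticides).dropna())

# isin is evaluated for the whole column at once; missing names give False
toy_violations["is_pesticide"] = (
    normalize(toy_violations["contaminant_name"]).isin(pesticide_set).astype(int)
)
print(toy_violations)
```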

## Merging all SDWIS Together

I want to create a dataframe with all the water systems and, for each year of interest, a binary variable that says whether they had a violation or not.

I start by repeating the water systems data for each year:


In [29]:
years_of_interest = [2012, 2013, 2014, 2015, 2016, 2017]
# one copy of the water systems data per year, stacked into a single dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
ws_years = pd.concat([ws.assign(year=year_) for year_ in years_of_interest],
                     ignore_index=True)
print(ws_years.shape)
# ws_years.head()
# ws_years.year.value_counts()

(62856, 16)


I now have a dataset with all the information about the water systems repeated for each year of interest. I need to add new columns to this dataframe that indicate whether there was a violation in a given year. In a second step we can engineer new features, for instance indicating past violations.
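
Repeating the water systems for each year can also be expressed as a cross join. A minimal sketch on toy data, assuming pandas 1.2 or later (where `how="cross"` is available); the `pwsid` values are hypothetical.

```python
import pandas as pd

# toy water systems table with hypothetical identifiers
toy_ws = pd.DataFrame({
    "pwsid": ["ME0000001", "ME0000002"],
    "pws_type_code": ["CWS", "TNCWS"],
})
years = pd.DataFrame({"year": [2012, 2013, 2014]})

# how="cross" pairs every water system with every year
toy_ws_years = toy_ws.merge(years, how="cross")
print(toy_ws_years.shape)
```

The result has one row per combination of water system and year, which is the same structure as `ws_years`.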


In [30]:
# in this chunk, I count (aggregation) the violations and pesticide violations by water system and year

# group by year and water system to count the number of violations:
yearly_viol_count = violations_cont.groupby(['pwsid', 
                         'year'], as_index=False).count()[['pwsid', 
                                                           'year', 
                                                           'violation_id']]
yearly_viol_count = yearly_viol_count.rename(index=str, columns={"violation_id": "n_viol"})

# group by year and water system to count the number of pesticide violations:
# we first have to keep only pesticides violations:
pesticide_violations = violations_cont[violations_cont.is_pesticide == 1]
yearly_viol_pesticide_count = pesticide_violations.groupby(['pwsid', 
                         'year'], as_index=False).count()[['pwsid', 
                                                           'year', 
                                                           'violation_id']]
yearly_viol_pesticide_count = yearly_viol_pesticide_count.rename(index=str, 
                                                                 columns={"violation_id": "n_pesticide_viol"})


In [31]:
# Then I merge the count of violations to the water systems, to add this information in a new column
ws_years_viol = ws_years.merge(yearly_viol_count, how='left', on=(['pwsid', 'year']))
ws_years_viol = ws_years_viol.merge(yearly_viol_pesticide_count, how='left', on=(['pwsid', 'year']))
ws_years_viol.head().T

Unnamed: 0,0,1,2,3,4
pwsid,ME0004628,ME0092288,ME0009198,ME0094505,ME0007311
pws_name,MACHIAS TRAILER PARK,MARSH BROOK ESTATES,TRAILS END STEAK HOUSE & TAVERN,R & R VACATION HOME PARK,NOKOMIS CAMPING AREA LLC
primacy_agency_code,ME,ME,ME,ME,ME
pws_type_code,CWS,CWS,TNCWS,TNCWS,TNCWS
gw_sw_code,GW,GW,GW,GW,GW
owner_type_code,P,P,P,P,P
population_served_count,65,70,390,40,118
primary_source_code,GW,GW,GW,GW,GW
is_wholesaler_ind,N,N,N,N,N
is_school_or_daycare_ind,N,N,N,N,N
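
The aggregation and the left merge can be sketched on toy data as follows. This is an illustration, not the code used above: it counts rows with `size()` (the notebook counts non-null `violation_id` values, which gives the same result when every violation has an id), and it adds `validate="one_to_one"` so that the merge fails loudly if a (`pwsid`, `year`) pair is duplicated on either side.

```python
import pandas as pd

# toy data with hypothetical identifiers
toy_violations = pd.DataFrame({
    "pwsid": ["A", "A", "A", "B"],
    "year": [2012, 2012, 2013, 2013],
    "violation_id": ["1", "2", "3", "4"],
})
toy_ws_years = pd.DataFrame({
    "pwsid": ["A", "A", "B", "B"],
    "year": [2012, 2013, 2012, 2013],
})

# one row per (pwsid, year) with the number of violations
counts = (
    toy_violations.groupby(["pwsid", "year"])
    .size()
    .reset_index(name="n_viol")
)

# validate raises a MergeError if the keys are not unique on both sides
merged = toy_ws_years.merge(
    counts, how="left", on=["pwsid", "year"], validate="one_to_one"
)
print(merged)
```

Water systems without any violation in a given year get `NaN` in `n_viol` after the left merge.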


## Creating Outcome Variables

I can use the number of violations per year and water system - *n_viol* - or the same count restricted to violations due to the presence of pesticides in the water - *n_pesticide_viol* - as continuous outcome variables.  

I will create two more outcome variables for classification by binarizing the two previous ones: *had_violation* indicates whether a water system had one or more drinking water violations in the given year, and *had_pesticide_violation* whether it had one or more violations due to the presence of pesticides.  

The number of water systems that had a violation due to pesticide levels above the MCL in the drinking water is probably too low for training a good model...

In [32]:
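# a count above zero means at least one violation; water systems without any
# violation have NaN here (from the left merge), which also evaluates to 0:
ws_years_viol.loc[:,'had_violation'] = np.where(ws_years_viol.n_viol > 0, 1, 0)
ws_years_viol.loc[:,'had_pesticide_violation'] = np.where(ws_years_viol.n_pesticide_viol > 0, 1, 0)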

In [33]:
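# Filling the NaN values in n_viol and n_pesticide_viol with 0
# (assigning the result back is more reliable than inplace=True on a .loc selection):
ws_years_viol['n_viol'] = ws_years_viol['n_viol'].fillna(0)
ws_years_viol['n_pesticide_viol'] = ws_years_viol['n_pesticide_viol'].fillna(0)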
ws_years_viol.head().T

Unnamed: 0,0,1,2,3,4
pwsid,ME0004628,ME0092288,ME0009198,ME0094505,ME0007311
pws_name,MACHIAS TRAILER PARK,MARSH BROOK ESTATES,TRAILS END STEAK HOUSE & TAVERN,R & R VACATION HOME PARK,NOKOMIS CAMPING AREA LLC
primacy_agency_code,ME,ME,ME,ME,ME
pws_type_code,CWS,CWS,TNCWS,TNCWS,TNCWS
gw_sw_code,GW,GW,GW,GW,GW
owner_type_code,P,P,P,P,P
population_served_count,65,70,390,40,118
primary_source_code,GW,GW,GW,GW,GW
is_wholesaler_ind,N,N,N,N,N
is_school_or_daycare_ind,N,N,N,N,N
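
Whether the pesticide outcome is too rare for modelling can be checked from the share of positive cases. A minimal sketch on made-up counts, which fills the missing values first and then binarizes:

```python
import pandas as pd

# toy counts per water system and year; NaN means "no violation record"
toy = pd.DataFrame({
    "n_viol": [2.0, None, 1.0, None],
    "n_pesticide_viol": [None, None, 1.0, None],
})

# fill first, then binarize: a count above zero means at least one violation
toy = toy.fillna(0)
toy["had_violation"] = (toy["n_viol"] > 0).astype(int)
toy["had_pesticide_violation"] = (toy["n_pesticide_viol"] > 0).astype(int)

# the mean of a 0/1 column is the share of positive cases
positive_share = toy[["had_violation", "had_pesticide_violation"]].mean()
print(positive_share)
```

A very small share for `had_pesticide_violation` would confirm the concern about class imbalance raised above.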


## Feature Engineering

I engineer some new features, based on our knowledge of what might increase the chances of occurrence of drinking water violations.   

**Done:**  

* whether the WS had a violation in the previous year

**Could be done** (might need linking to external data via *ansi_entity_code*):  

* whether the WS had violations in several previous years
* rural VS urban binary feature
* average income (or anything indicating the wealth of a county)
* estimated pesticide use
* rainfall (average precipitation per year? number of large rainfall events?)
* distance to industries? 


**Violations in previous year:**

In [39]:
# ADDING IF VIOLATION IN PREVIOUS YEAR:

# need to sort by water system, so that the shift works:
ws_years_viol.sort_values(by=['pwsid', 'year'], inplace=True)

# shift the values of had_violation one row. 2012 will be NaN:
ws_years_viol['had_violation_lastyear'] = ws_years_viol.groupby(['pwsid'])['had_violation'].shift(1)
ws_years_viol['had_pesticide_violation_lastyear'] = ws_years_viol.groupby(['pwsid'])['had_pesticide_violation'].shift(1)
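
How the grouped shift behaves can be seen on a toy panel. The second feature is a hypothetical extension for the "several previous years" idea listed under "Could be done"; it is not computed in this notebook, and its column name is made up.

```python
import pandas as pd

# toy panel: two water systems observed over three years
toy = pd.DataFrame({
    "pwsid": ["A", "A", "A", "B", "B", "B"],
    "year": [2012, 2013, 2014, 2012, 2013, 2014],
    "had_violation": [1, 0, 1, 0, 0, 1],
}).sort_values(["pwsid", "year"])

# value of the previous year within the same water system (NaN for the first year)
toy["had_violation_lastyear"] = toy.groupby("pwsid")["had_violation"].shift(1)

# hypothetical extension: number of earlier years with a violation
toy["n_years_with_violation_before"] = (
    toy.groupby("pwsid")["had_violation"].cumsum() - toy["had_violation"]
)
print(toy)
```

Grouping by `pwsid` before shifting prevents the last year of one water system from leaking into the first year of the next one.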


**Pesticide use in county:**

In [49]:
ws_years_viol.groupby(['primacy_agency_code', 'ansi_entity_code']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,pwsid,pws_name,pws_type_code,gw_sw_code,owner_type_code,population_served_count,primary_source_code,is_wholesaler_ind,is_school_or_daycare_ind,service_connections_count,source_water_protection_code,cities_served,counties_served,year,n_viol,n_pesticide_viol,had_violation,had_pesticide_violation,had_violation_lastyear,had_pesticide_violation_lastyear
primacy_agency_code,ansi_entity_code,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
CT,1.0,2280,2280,2280,2280,2280,2280,2280,2280,2280,2280,2124,2280,2280,2280,2280,2280,2280,2280,1900,1900
CT,3.0,1368,1368,1368,1368,1368,1368,1368,1368,1368,1368,1242,1368,1368,1368,1368,1368,1368,1368,1140,1140
CT,5.0,2196,2196,2196,2196,2196,2196,2196,2196,2196,2196,2064,2196,2196,2196,2196,2196,2196,2196,1830,1830
CT,7.0,1782,1782,1782,1782,1782,1782,1782,1782,1782,1782,1578,1782,1782,1782,1782,1782,1782,1782,1485,1485
CT,9.0,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1272,1350,1350,1350,1350,1350,1350,1350,1125,1125
CT,11.0,2388,2388,2388,2388,2388,2388,2388,2388,2388,2388,2040,2388,2388,2388,2388,2388,2388,2388,1990,1990
CT,13.0,1794,1794,1794,1794,1794,1794,1794,1794,1794,1794,1620,1794,1794,1794,1794,1794,1794,1794,1495,1495
CT,15.0,1602,1602,1602,1602,1602,1602,1602,1602,1602,1602,1452,1602,1602,1602,1602,1602,1602,1602,1335,1335
MA,1.0,990,990,990,990,990,990,990,990,990,990,0,990,990,990,990,990,990,990,825,825
MA,3.0,1158,1158,1158,1152,1158,1158,1152,1158,1158,1158,0,1158,1158,1158,1158,1158,1158,1158,965,965


In [78]:
pesticide_use_2012_2017 = pesticide_use_2009_17.loc[pesticide_use_2009_17.YEAR > 2011] 
# pesticide_use_2012_2017.YEAR.unique()
pesticide_use_2012_2017.STATE_FIPS_CODE.unique()

# replace the state FIPS code with the abbreviation of the state:
# CT=9, MA=25, ME=23, NH=33, RI=44, VT=50.
def replace_state_FIPS(row):
    if row['STATE_FIPS_CODE'] == 9:
        return 'CT'
    elif row['STATE_FIPS_CODE'] == 25:
        return 'MA'
    elif row['STATE_FIPS_CODE'] == 23:
        return 'ME'
    elif row['STATE_FIPS_CODE'] == 33:
        return 'NH'
    elif row['STATE_FIPS_CODE'] == 44:
        return 'RI'
    elif row['STATE_FIPS_CODE'] == 50:
        return 'VT'
    else:
        return 'Not in New England'
    
# assign returns a new dataframe with the extra column; note that
# .loc['primacy_agency_code'] = ... would set a row with that label, not a column
pesticide_use_2012_2017 = pesticide_use_2012_2017.assign(
    primacy_agency_code=pesticide_use_2012_2017.apply(replace_state_FIPS, axis=1))

pesticide_use_2012_2017.head()


Unnamed: 0.1,COMPOUND,COUNTY_FIPS_CODE,EPEST_HIGH_KG,EPEST_LOW_KG,STATE_FIPS_CODE,Unnamed: 0,YEAR,primacy_agency_code
1113593,"2,4-D",1,2874.9,2863.3,1,1113593.0,2012,Not in New England
1113594,"2,4-D",3,4116.9,2854.3,1,1113594.0,2012,Not in New England
1113595,"2,4-D",5,749.4,556.9,1,1113595.0,2012,Not in New England
1113596,"2,4-D",7,12.3,5.4,1,1113596.0,2012,Not in New England
1113597,"2,4-D",9,9455.7,9201.2,1,1113597.0,2012,Not in New England
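
The same recoding can be written with a dictionary and `Series.map`, which avoids the row-wise `apply` on a table with more than a million rows. A minimal sketch on a toy column; the dictionary name is made up, and the codes are the ones listed in the cell above.

```python
import pandas as pd

# state FIPS codes of the six New England states
STATE_FIPS_TO_ABBR = {9: "CT", 25: "MA", 23: "ME", 33: "NH", 44: "RI", 50: "VT"}

toy = pd.DataFrame({"STATE_FIPS_CODE": [1, 9, 25, 50]})

# Series.map returns NaN for codes that are missing from the dictionary
toy["primacy_agency_code"] = (
    toy["STATE_FIPS_CODE"].map(STATE_FIPS_TO_ABBR).fillna("Not in New England")
)
print(toy)
```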


In [100]:
# keep only new england:
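# (.copy() so that the column assignments and the in-place drop below act on an independent dataframe)
pesticide_use_2012_2017_NE = pesticide_use_2012_2017[pesticide_use_2012_2017.primacy_agency_code != 'Not in New England'].copy()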
# just to get the right column name before merging:
pesticide_use_2012_2017_NE.loc[:,'ansi_entity_code'] = pesticide_use_2012_2017_NE['COUNTY_FIPS_CODE']
pesticide_use_2012_2017_NE.loc[:,'year'] = pesticide_use_2012_2017_NE['YEAR'] 
# keep only necessary columns:
pesticide_use_2012_2017_NE.drop(['Unnamed: 0', 'COUNTY_FIPS_CODE', 'EPEST_HIGH_KG', 'STATE_FIPS_CODE', 'YEAR'], 
                                axis=1, inplace=True)
# NOTE: I keep only the lower estimate of pesticide use (EPEST_LOW_KG, in kg).
# NOTE: I sum across all compounds to get the total pesticide use in kg,
# regardless of the "concentration" of the individual pesticides.
# sum pesticide use by year and county:
pesticide_by_county = pesticide_use_2012_2017_NE.groupby(['year', 'primacy_agency_code', 'ansi_entity_code'], 
                                   as_index=False)['EPEST_LOW_KG'].sum()

pesticide_by_county.head(20)


Unnamed: 0,year,primacy_agency_code,ansi_entity_code,EPEST_LOW_KG
0,2012,CT,1,3675.1
1,2012,CT,3,24402.9
2,2012,CT,5,14824.0
3,2012,CT,7,5459.5
4,2012,CT,9,13749.0
5,2012,CT,11,11980.9
6,2012,CT,13,12871.8
7,2012,CT,15,13687.8
8,2012,MA,1,1015.8
9,2012,MA,3,6916.6


In [102]:
# Join the pesticides use to the data:
ws_years_viol = ws_years_viol.merge(pesticide_by_county, how='left', 
                                    on=['year', 'primacy_agency_code', 'ansi_entity_code'])
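
County FIPS codes are only unique within a state, which is why the state is part of the merge key. The keys should also have the same dtype on both sides: here `ansi_entity_code` holds float values stored as `object` on the water systems side and integers on the pesticide side. A minimal sketch of a defensive version on toy data (the pesticide quantities are made up), which aligns the dtypes and counts the rows left without an estimate:

```python
import pandas as pd

toy_ws = pd.DataFrame({
    "primacy_agency_code": ["CT", "MA"],
    "ansi_entity_code": pd.Series([1.0, 3.0], dtype=object),
    "year": [2012, 2012],
})
toy_pesticide = pd.DataFrame({
    "primacy_agency_code": ["CT", "MA"],
    "ansi_entity_code": [1, 3],
    "year": [2012, 2012],
    "EPEST_LOW_KG": [10.0, 20.0],  # made-up quantities
})

# cast the county code to a plain integer on both sides before merging
toy_ws["ansi_entity_code"] = toy_ws["ansi_entity_code"].astype(float).astype(int)
toy_pesticide["ansi_entity_code"] = toy_pesticide["ansi_entity_code"].astype(int)

keys = ["year", "primacy_agency_code", "ansi_entity_code"]
merged = toy_ws.merge(toy_pesticide, how="left", on=keys, indicator=True)

# number of rows for which no pesticide estimate was found
n_unmatched = (merged["_merge"] == "left_only").sum()
print(n_unmatched)
```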


## Exploratory Data Analysis

I have a look at the features I pre-selected for the modelling with the _pandas-profiling_ package. I then decide on how to handle the issues.

In [103]:
# data.profile_report() # use this line to generate profile in this notebook.
profile = ws_years_viol.profile_report(title='Initial Exploration of Dataset')
profile.to_file(output_file="../documents/Initial Exploration of Dataset.html") # profile generated as html
# look at this document to see data profile...

In [104]:
ws_years_viol.isnull().sum() # another way to identify missing values

pwsid                                   0
pws_name                                0
primacy_agency_code                     0
pws_type_code                           0
gw_sw_code                             30
owner_type_code                         0
population_served_count                 0
primary_source_code                    30
is_wholesaler_ind                       0
is_school_or_daycare_ind                0
service_connections_count               0
source_water_protection_code        28842
cities_served                           0
counties_served                         0
ansi_entity_code                        0
year                                    0
n_viol                                  0
n_pesticide_viol                        0
had_violation                           0
had_pesticide_violation                 0
had_violation_lastyear              10476
had_pesticide_violation_lastyear    10476
EPEST_LOW_KG                            0
dtype: int64

In [105]:
# I will drop the column "source_water_protection_code", as there are lots of NAs
ws_years_viol.drop("source_water_protection_code", axis=1, inplace=True)

In [106]:
# Then I have a closer look at the missing gw_sw_code and primary_source_code:
ws_years_viol.loc[ws_years_viol['gw_sw_code'].isnull()]
# they are all privately owned water systems from MA, but of different types.
# (data are missing for all years)

# I have no way of imputing these values (source of the water), so I exclude those cases.
ws_years_viol.dropna(axis=0, how='any', subset=['gw_sw_code', 'primary_source_code'], inplace=True)

## Saving Dataset for Modelling

In [107]:
ws_years_viol.to_csv('../data/data_input_for_model.csv', index=False)