# Data Processing and Feature Engineering

In this notebook, we merge the different datasets, select features, and engineer new ones. This first pass at feature selection and engineering is based on my knowledge of the data.

We proceed in this order:  

1. Filter water systems of interest (active water systems in New England)
2. Select features of interest for water systems
3. Filter violations of interest (pesticides)
4. Select features of interest for violations
5. Add estimated pesticide use for the water systems
6. Merge water systems and violations to obtain violations by water systems (!)
7. Engineer new features of interest for violations by water systems


In the end, we obtain a dataset on which we can do model training and selection. In a later step, when trying different models, another round of feature selection and engineering might be needed; it will be performed in a separate notebook.


## Water Systems

We have already selected the water systems of interest: active water systems in EPA Region 01 (New England).

In [None]:
# load water systems

We already filter the data by year, because we are only interested in recent violations. The number of observed violations greatly increased to reach a new plateau in 2009 (known from [previous work](https://github.com/de-la-viz/US-Public-Water-Systems/blob/master/US%20Drinking%20Water%20Quality%20Violations.ipynb)) because of the introduction of new guidelines and rules. We will thus focus on violations from 2009 onwards. 
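
The year filter described above can be sketched as follows. This is a minimal illustration on toy data: the column name `compliance_begin_year` and the PWSID values are assumptions made for the example, and the real SDWIS export may store the date differently.

```python
import pandas as pd

# Toy stand-in for the violations table; column names and values are made up.
toy_violations = pd.DataFrame({
    "PWSID": ["CT0000001", "MA0000002", "ME0000003", "VT0000004"],
    "compliance_begin_year": [2005, 2009, 2012, 2008],
})

FIRST_YEAR = 2009  # first year of the plateau mentioned in the text

# Keep only violations from 2009 onwards.
recent_violations = toy_violations[
    toy_violations["compliance_begin_year"] >= FIRST_YEAR
].reset_index(drop=True)

print(recent_violations)
```

If the raw data holds a full date rather than a year, the year would first have to be extracted (for example with `pd.to_datetime(...).dt.year`) before applying the same comparison.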

## Merging all SDWIS Together

We first add the contaminant code information to the violations.


In [None]:
# merging contaminants codes with violations

violations = violations.merge(contaminant_codes, how='left', on='contaminant_code') # we want to keep all violations
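
A left merge like the one above should leave the number of violations unchanged, as long as each code appears only once in the lookup table. The sketch below shows one way to check this on toy data; the frames, codes, and names are invented for illustration, and only `contaminant_code` comes from the notebook.

```python
import pandas as pd

# Toy violations; code 3 deliberately has no entry in the lookup table.
toy_violations = pd.DataFrame({
    "violation_id": [101, 102, 103],
    "contaminant_code": [1, 2, 3],
})

# Toy lookup table with made-up names.
toy_codes = pd.DataFrame({
    "contaminant_code": [1, 2],
    "contaminant_name": ["pesticide A", "pesticide B"],
})

merged = toy_violations.merge(
    toy_codes,
    how="left",
    on="contaminant_code",
    validate="many_to_one",  # raises MergeError if a code is duplicated in the lookup
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# A left merge against a unique lookup keeps exactly one row per violation.
assert len(merged) == len(toy_violations)

# Violations whose code was not found in the lookup table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["violation_id", "contaminant_code"]])
```

The `validate` argument is a cheap guard: a duplicated code in the lookup would otherwise silently duplicate violations.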


We then merge the water systems and violations on _PWSID_ (the relation is not one-to-one).

In [None]:
# merging water systems with violations:

# one water system might see several violations,
# and one violation might affect several water systems (although this is rare)
NE_viol = water_system.merge(violations, how='outer')
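
To see what an outer merge does with a relation that is not one-to-one, here is a small sketch on toy data. The column `PWSID` is named in the text; the other columns and all values are assumptions for the example. Passing `on` explicitly and setting `indicator=True` makes it easy to see which rows come from which table.

```python
import pandas as pd

# Toy water systems: B and C have no violations.
toy_systems = pd.DataFrame({
    "PWSID": ["A", "B", "C"],
    "population_served": [100, 2500, 40],
})

# Toy violations: system A has two, and D is not in the systems table.
toy_viol = pd.DataFrame({
    "PWSID": ["A", "A", "D"],
    "violation_id": [1, 2, 3],
})

toy_merged = toy_systems.merge(toy_viol, how="outer", on="PWSID", indicator=True)

# Rows matched in both tables, only in systems, only in violations.
n_both = (toy_merged["_merge"] == "both").sum()
n_systems_only = (toy_merged["_merge"] == "left_only").sum()
n_viol_only = (toy_merged["_merge"] == "right_only").sum()
print(n_both, n_systems_only, n_viol_only)

# Violations per water system; count() ignores the NaN rows of systems without violations.
viol_per_system = toy_merged.groupby("PWSID")["violation_id"].count()
print(viol_per_system)
```

Without an explicit `on`, pandas merges on every column name the two frames share, so naming the key avoids surprises if the tables have other columns in common.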


In [1]:
# pesticides

In [None]:
# the years 2013 to 2017 are estimates.
# the years 2015 to 2017 are not directly accessible yet:
# they have to be downloaded first, then loaded here:

pesticide_use_2015 = pd.read_csv('../data/pesticide_use/2015PreliminaryEstimates/EPest.county.estimates.2015.txt', sep='\t')
pesticide_use_2016 = pd.read_csv('../data/pesticide_use/2106PreliminaryEstimates/EPest.county.estimates.2016.txt', sep='\t')
pesticide_use_2017 = pd.read_csv('../data/pesticide_use/2017PreliminaryEstimatesNoCA/EPest.county.estimates_noCA.2017.txt', sep='\t')

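# append to previous years (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
pesticide_use_2009_17 = pd.concat(
    [pesticide_use_2009_14, pesticide_use_2015, pesticide_use_2016, pesticide_use_2017],
    ignore_index=True,
)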

print(pesticide_use_2009_17.shape)
pesticide_use_2009_17.head()
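
As a sanity check when stacking yearly tables, it can help to confirm that every expected year is present in the result. The sketch below uses toy frames; the column names `YEAR`, `COMPOUND`, and `EPEST_HIGH_KG` and all values are assumptions for the example and should be checked against the actual files.

```python
import pandas as pd

# Toy yearly tables with made-up values.
toy_2014 = pd.DataFrame({"YEAR": [2014, 2014], "COMPOUND": ["X", "Y"], "EPEST_HIGH_KG": [1.0, 2.0]})
toy_2015 = pd.DataFrame({"YEAR": [2015], "COMPOUND": ["X"], "EPEST_HIGH_KG": [1.5]})
toy_2016 = pd.DataFrame({"YEAR": [2016], "COMPOUND": ["Y"], "EPEST_HIGH_KG": [0.5]})

# Stack all years at once; ignore_index gives a clean 0..n-1 index.
combined = pd.concat([toy_2014, toy_2015, toy_2016], ignore_index=True)

# Check that no year was dropped along the way.
expected_years = {2014, 2015, 2016}
missing_years = expected_years - set(combined["YEAR"].unique())
print(combined.shape, missing_years)
```

An empty `missing_years` set means every yearly table made it into the combined frame.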