## CUNY MSDA Fall 2017 Semester  
### DATA 620  
  
**Homework 4: Hudson River Enterococcus Levels Analysis**

By Dmitriy Vecheruk

This analysis is based on the initial [example provided in the course](https://github.com/charleyferrari/CUNY_DATA608/blob/master/lecture4/Hudson_River.ipynb).
  
Data source: Riverkeeper (www.riverkeeper.org) data on Hudson River Enterococcus levels

### 1. Load libraries and preprocess the data

In [48]:
import pandas as pd
import numpy as np
import plotly.offline as py
from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
%matplotlib inline

In [8]:
dat = pd.read_csv("https://raw.githubusercontent.com/datafeelings/CUNY_DATA608/master/lecture4/Data/riverkeeper_data_2013.csv")

Let's look at the data:

In [13]:
dat[0:8]

| | Site | Date | EnteroCount | FourDayRainTotal | SampleCount |
|---|---|---|---|---|---|
| 0 | Hudson above Mohawk River | 10/16/2011 | 1733 | 1.5 | 35 |
| 1 | Hudson above Mohawk River | 10/21/2013 | 4 | 0.2 | 35 |
| 2 | Hudson above Mohawk River | 9/21/2013 | 20 | 0.0 | 35 |
| 3 | Hudson above Mohawk River | 8/19/2013 | 6 | 0.0 | 35 |
| 4 | Hudson above Mohawk River | 7/21/2013 | 31 | 0.0 | 35 |
| 5 | Hudson above Mohawk River | 6/4/2013 | 238 | 1.2 | 35 |
| 6 | Hudson above Mohawk River | 10/15/2012 | 23 | 1.4 | 35 |
| 7 | Hudson above Mohawk River | 9/15/2012 | 11 | 0.1 | 35 |


In [18]:
dat.dtypes

Site                        object
Date                datetime64[ns]
EnteroCount                 object
FourDayRainTotal           float64
SampleCount                  int64
dtype: object

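pandas reads `EnteroCount` as `object` (strings) rather than as a number, and `Date` has to be parsed into a datetime as well. (The `dtypes` output above was captured after the `Date` conversion below had already been run, which is why it already shows `datetime64[ns]`.) Let's fix both fields.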

In [15]:
dat["Date"] = pd.to_datetime(dat["Date"],format="%m/%d/%Y")

In [16]:
# check if any dates could not be parsed 
dat[dat.Date.isnull()]

Unnamed: 0,Site,Date,EnteroCount,FourDayRainTotal,SampleCount


In [17]:
print(dat["Date"].min(), dat["Date"].max())

2006-09-19 00:00:00 2013-10-21 00:00:00


The dates seem to have been parsed correctly.
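Note that `pd.to_datetime` raises an error by default when a value does not match the given format, so a null check only finds something if parsing is told to coerce failures. A minimal sketch of that variant, using made-up toy strings rather than the Riverkeeper file (`raw`, `parsed`, and `bad_rows` are illustrative names):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Riverkeeper file)
raw = pd.Series(["10/16/2011", "6/4/2013", "not a date", "13/45/2012"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Original strings for the rows that failed to parse
bad_rows = raw[parsed.isna()]
print(bad_rows.tolist())

# min() and max() skip NaT, so the date range reflects only valid dates
print(parsed.min(), parsed.max())
```

With coercion enabled, an empty `bad_rows` would be actual evidence that every date was parsed.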
  
As for `EnteroCount`, the problem is the "<" and ">" signs used in the field to mark values outside the measurable range. We'll strip these signs and, to be conservative, keep the boundary values themselves (for example, ">2420" becomes 2420).

In [19]:
dat[dat["EnteroCount"].str.contains("<|>",regex=True)]["EnteroCount"].unique()

array(['>2420', '<1', '<10', '>24196'], dtype=object)

In [20]:
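# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
dat["EnteroCount"] = dat["EnteroCount"].str.replace("<|>", "", regex=True)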
dat["EnteroCount"] = dat["EnteroCount"].astype("int")
dat.describe()

| | EnteroCount | FourDayRainTotal | SampleCount |
|---|---|---|---|
| count | 3397.0 | 3397.0 | 3397.0 |
| mean | 387.747719 | 0.568001 | 56.88637 |
| std | 2046.114024 | 1.000387 | 41.588476 |
| min | 0.0 | 0.0 | 27.0 |
| 25% | 10.0 | 0.0 | 37.0 |
| 50% | 18.0 | 0.2 | 42.0 |
| 75% | 85.0 | 0.7 | 50.0 |
| max | 24196.0 | 8.5 | 187.0 |


There are no NA values in the `EnteroCount` field (the count matches the other columns), so we can proceed to the analysis.
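Stripping the "<" and ">" signs discards the information about which readings were censored. As a hedged sketch of one possible alternative (not what this notebook does), the flags can be recorded before the signs are removed. The values below are a made-up sample in the style of the column, and the `below_limit` / `above_limit` column names are assumptions for illustration:

```python
import pandas as pd

# Toy values in the same style as the EnteroCount column (hypothetical sample)
raw = pd.Series(["1733", "<1", ">2420", "20", "<10", ">24196"])

# Record which readings were censored before stripping the markers
below_limit = raw.str.startswith("<")
above_limit = raw.str.startswith(">")

# Remove a leading "<" or ">" and convert the remainder to integers
counts = raw.str.replace(r"^[<>]", "", regex=True).astype(int)

cleaned = pd.DataFrame({
    "EnteroCount": counts,
    "below_limit": below_limit,
    "above_limit": above_limit,
})
print(cleaned)
```

Keeping the flags makes it possible to tell later whether a value of 2420 is an exact reading or a lower bound.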

### 2. Analysis from the homework assignment

1) Create lists & graphs of the best and worst places to swim in the dataset.  
2) The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.  
3) Is there a relationship between the amount of rain and water quality? Show this relationship graphically. If you can, estimate the effect of rain on quality at different sites and create a visualization to compare them.  

#### 2.1. The best and worst places to swim in the dataset

In [21]:
len(dat.Site.unique())

75

Overall, there are 75 unique measurement sites.  
First, let's inspect how the `EnteroCount` values are distributed over time across all sites.

In [39]:
plotly_data = [
    Scatter(
        x = dat['Date'],
        y = dat['EnteroCount'],
        mode = 'markers'
    )
]

layout = Layout(title="Observed Entero levels over time",
                xaxis=dict(title='Measurement time'),
                yaxis=dict(title='Enterococcus count per 100mL'))

fig = Figure(data=plotly_data, layout=layout)
py.iplot(fig)

We can see that while the values remain largely under the critical level (an EnteroCount of 2420 per 100 mL), there are some spikes in the data. Many points also pile up at exactly 2420, because we replaced the censored ">2420" readings with that boundary value.
Because of these spikes, the median value per site should be a more robust indicator of overall water quality than the mean.  
The median does not reflect changes in quality over time, however, so we should also consider those.
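To illustrate why the median is less sensitive to spikes than the mean, here is a small sketch on made-up data for two hypothetical sites, "A" and "B" (none of these numbers come from the Riverkeeper file):

```python
import pandas as pd

# Toy data: site A is usually clean but has one extreme spike,
# site B is consistently at a moderate level
toy = pd.DataFrame({
    "Site": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "EnteroCount": [5, 8, 10, 12, 2420, 40, 45, 50, 55, 60],
})

# Compare the two summary statistics per site
summary = toy.groupby("Site")["EnteroCount"].agg(["mean", "median"])
print(summary)
```

In this toy data, site A has a mean of 491 but a median of 10, while site B has a mean and median of 50. Ranked by mean, A looks far worse than B. Ranked by median, A looks better, which matches its typical reading.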

In [52]:
# Median EnteroCount per site, sorted from cleanest to most polluted
dat_site_median = dat.groupby("Site")["EnteroCount"].median().reset_index().sort_values("EnteroCount")

In [64]:
best_10 = dat_site_median.head(10)
worst_10 = dat_site_median.tail(10)

In [65]:
best_10.head(10)

| | Site | EnteroCount |
|---|---|---|
| 50 | Norrie Point mid-channel | 2.5 |
| 68 | Tivoli Landing | 4.0 |
| 58 | Port Ewen Drinking Water Intake | 4.0 |
| 59 | Poughkeepsie Drinking Water Intake | 4.5 |
| 72 | West Point STP Outfall | 7.0 |
| 40 | Kingston Point Beach | 8.0 |
| 69 | Ulster Landing Beach | 8.5 |
| 71 | Wappingers Creek | 9.0 |
| 44 | Marlboro Landing | 9.0 |
| 37 | Irvington Beach | 10.0 |


In [69]:
trace_best = Bar(x=best_10.EnteroCount,
                  y=best_10.Site,
                  name='Best 10',
                  marker=dict(color='green'),orientation = 'h')
    
trace_worst = Bar(x=worst_10.EnteroCount,
                  y=worst_10.Site,
                  name='Worst 10',
                  marker=dict(color='red'),orientation = 'h')

layout = Layout(title="10 Best and 10 Worst Sites by Water Quality",
                xaxis=dict(title='Enterococcus count per 100mL',type='log',autorange=True),
                yaxis=dict(title='Site name')
                )

fig = Figure(data=[trace_best,trace_worst], layout=layout)
py.iplot(fig)