## CUNY MSDA Fall 2017 Semester  
### DATA 620  
  
**Homework 4: Hudson River Enterococcus Levels Analysis**

By Dmitriy Vecheruk

This analysis is based on the initial [example provided in the course](https://github.com/charleyferrari/CUNY_DATA608/blob/master/lecture4/Hudson_River.ipynb).
  
Data source: Riverkeeper (www.riverkeeper.org) data on Hudson River Enterococcus levels

### 1. Load libraries and preprocess the data

In [1]:
import pandas as pd
import plotly.plotly as py
from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
%matplotlib inline

In [2]:
dat = pd.read_csv("https://raw.githubusercontent.com/datafeelings/CUNY_DATA608/master/lecture4/Data/riverkeeper_data_2013.csv")

Let's look at the data:

In [3]:
dat[10:16]

Unnamed: 0,Site,Date,EnteroCount,FourDayRainTotal,SampleCount
10,Hudson above Mohawk River,6/16/2012,10,0.2,35
11,Hudson above Mohawk River,5/20/2012,11,0.0,35
12,Hudson above Mohawk River,6/24/2013,30,1.4,35
13,Hudson above Mohawk River,9/19/2011,11,0.1,35
14,Hudson above Mohawk River,8/21/2011,231,0.4,35
15,Hudson above Mohawk River,7/14/2011,11,0.3,35


In [4]:
dat.describe()

Unnamed: 0,FourDayRainTotal,SampleCount
count,3397.0,3397.0
mean,0.568001,56.88637
std,1.000387,41.588476
min,0.0,27.0
25%,0.0,37.0
50%,0.2,42.0
75%,0.7,50.0
max,8.5,187.0


Uh-oh: `Date` and `EnteroCount` are missing from the summary because pandas read both columns as plain strings instead of dates and numbers. Let's fix that.
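Before fixing it, the diagnosis can be confirmed by looking at the column dtypes. The snippet below is a minimal sketch on a made-up three-column frame (the `toy` name and its values are illustrative assumptions, not the Riverkeeper data). It shows that `describe()` only summarizes columns pandas considers numeric.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same kind of columns as the raw data
toy = pd.DataFrame({
    "Date": ["6/16/2012", "5/20/2012"],
    "EnteroCount": ["10", "<1"],        # strings, because of the "<" sign
    "FourDayRainTotal": [0.2, 0.0],
})

# Inspect how pandas typed each column
print(toy.dtypes)

# Only numeric columns survive in describe()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
described_cols = toy.describe().columns.tolist()
print(numeric_cols, described_cols)
```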

In [5]:
dat["Date"] = pd.to_datetime(dat["Date"],format="%m/%d/%Y")

In [6]:
# check if any dates could not be parsed 
dat[dat.Date.isnull()]

Unnamed: 0,Site,Date,EnteroCount,FourDayRainTotal,SampleCount


In [7]:
# print() as a function, so the cell also runs under Python 3
print(dat["Date"].min(), dat["Date"].max())

2006-09-19 00:00:00 2013-10-21 00:00:00


The dates seem to have been parsed correctly.
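One caveat: by default `pd.to_datetime` raises an error on a string it cannot parse, so the null check above would only surface missing values, not malformed ones. A minimal sketch on made-up dates, assuming it is acceptable to coerce failures to `NaT` and inspect them afterwards:

```python
import pandas as pd

# Made-up date strings (not from the dataset); the last one is deliberately broken
dates = pd.Series(["6/16/2012", "5/20/2012", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Rows that failed to parse can now be listed explicitly
bad = dates[parsed.isnull()]
print(bad.tolist())

# min/max skip NaT, so the range check still works
print(parsed.min(), parsed.max())
```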
  
As for `EnteroCount`, the problem is the "<" and ">" signs that flag readings below or above the measurable range. We'll strip the signs and, to stay conservative, keep the boundary values themselves.

In [8]:
dat[dat["EnteroCount"].str.contains("<|>",regex=True)]["EnteroCount"].unique()

array(['>2420', '<1', '<10', '>24196'], dtype=object)

In [9]:
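# regex=True is needed in recent pandas, where str.replace treats the pattern literally by default
dat["EnteroCount"] = dat["EnteroCount"].str.replace("<|>", "", regex=True)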
dat["EnteroCount"] = dat["EnteroCount"].astype("int")
dat.describe()

Unnamed: 0,EnteroCount,FourDayRainTotal,SampleCount
count,3397.0,3397.0,3397.0
mean,387.747719,0.568001,56.88637
std,2046.114024,1.000387,41.588476
min,0.0,0.0,27.0
25%,10.0,0.0,37.0
50%,18.0,0.2,42.0
75%,85.0,0.7,50.0
max,24196.0,8.5,187.0


The `EnteroCount` count (3,397) matches the other columns, so the field has no NA values and we can proceed to the analysis.
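The cleaning step can also be sketched so that the information carried by the signs is not lost. The example below uses a made-up Series (the `raw`, `censored`, and `clean` names are illustrative assumptions). It records which readings were flagged before stripping the signs, then converts with `pd.to_numeric`.

```python
import pandas as pd

# Made-up strings mimicking the raw EnteroCount column
raw = pd.Series(["10", "<1", ">2420", "231", "<10", ">24196"])

# Remember which readings carried a "<" or ">" sign
censored = raw.str.contains("<|>", regex=True)

# Strip the signs, then convert; errors="coerce" would turn leftovers into NaN
clean = pd.to_numeric(raw.str.replace("<|>", "", regex=True), errors="coerce")

print(clean.tolist())
print(censored.tolist())
print(clean.isnull().sum())
```

Keeping the `censored` flag makes it possible to treat boundary values differently later, for example when averaging.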

### 2. Analysis from the homework assignment

1) Create lists & graphs of the best and worst places to swim in the dataset.  
2) The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.  
3) Is there a relationship between the amount of rain and water quality? Show this relationship graphically. If you can, estimate the effect of rain on quality at different sites and create a visualization to compare them.  

#### 2.1. The best and worst places to swim in the dataset

In [10]:
dat.head()

Unnamed: 0,Site,Date,EnteroCount,FourDayRainTotal,SampleCount
0,Hudson above Mohawk River,2011-10-16,1733,1.5,35
1,Hudson above Mohawk River,2013-10-21,4,0.2,35
2,Hudson above Mohawk River,2013-09-21,20,0.0,35
3,Hudson above Mohawk River,2013-08-19,6,0.0,35
4,Hudson above Mohawk River,2013-07-21,31,0.0,35


In [13]:
plotlyData = [
    Scatter(
        x = dat['Date'],
        y = dat['EnteroCount'],
        mode = 'markers'
    )
]

# Render offline with the iplot imported from plotly.offline above;
# py.iplot uploads to the Plotly cloud and fails without account credentials
iplot(plotlyData)
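One possible way to turn the raw points into a best/worst list is to rank sites by a summary statistic. The sketch below uses made-up site names and counts, and it assumes the per-site median is a reasonable ranking criterion (the median is less sensitive than the mean to the occasional extreme reading). It is an illustration of the idea, not the result for the real data.

```python
import pandas as pd

# Made-up measurements for three hypothetical sites
toy = pd.DataFrame({
    "Site": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "EnteroCount": [4, 10, 20, 200, 1733, 90, 30, 31, 2420],
})

# Median count per site, sorted from cleanest to dirtiest
site_median = toy.groupby("Site")["EnteroCount"].median().sort_values()

# Lowest medians are the best candidates for swimming, highest the worst
best = site_median.head(2)
worst = site_median.tail(2)
print(best)
print(worst)
```

On the real data the same pattern would apply to `dat`, with `head(10)` and `tail(10)` feeding two bar charts.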