# Exploratory Data Analysis

Exploratory data analysis (EDA) is often described as more of an art than a science. We disagree. It's important to have a clear *aim*: some problem we want to solve or hypothesis we want to test with the data. With an aim in hand, EDA is often more straightforward than people think. Here we perform some EDA on the well-known [Bay Area Bike Share trip data](http://www.bayareabikeshare.com/open-data) using the [pandas package](http://pandas.pydata.org/).

In [None]:
import pandas
import numpy as np
%matplotlib inline

## Series

The most basic objects in pandas are series, which use numpy arrays under the hood (we'll learn more about numpy in future sessions).

In [None]:
series = pandas.Series(np.random.standard_normal(10))

In [None]:
series

In [None]:
series.values # values is the raw data

In [None]:
type(series.values)

In [None]:
series[4] # list-style access

In [None]:
series[1:4]

### The index

The index allows us to conveniently access the data.

In [None]:
series.index
# a range index allows us to use the series like a list

In [None]:
# let's change the index to something more interesting
series.index = list('abcdefghij')

In [None]:
list('abcdefghij')

In [None]:
series['e']

In [None]:
series['a':'f'] # label slicing works as well; unlike list slicing, the end label is included

In [None]:
# indices can be passed to the constructor directly
# using the index keyword argument
pandas.Series([1,2,3], index=['foo', 'bar', 'baz'])

In [None]:
# Series can also be created from a dictionary
pandas.Series({'a': 12, 'b': 42})

### Arithmetic operations

Arithmetic operations on series work element by element.

In [None]:
series

In [None]:
series + 5

In [None]:
series**2

In [None]:
series + series

In [None]:
sum(series**2)

In [None]:
# Be careful with arithmetic operations on series
# whose indices don't match up!
series + pandas.Series({'a': 2})

In [None]:
series + pandas.Series({'a': 2, 'c': 2})

### NaN

`NaN` is short for 'not a number'. It marks missing values, such as the entries above where the two indices did not match. You can drop these entries using the `dropna` method.

In [None]:
(series + pandas.Series({'a': 2, 'c': 2})).dropna()
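
Dropping the `NaN` entries is not the only option. As a minimal sketch (the series `left` and `right` and their values are made up for illustration), `Series.add` with `fill_value` treats a missing partner as a given value instead of producing `NaN`:

```python
import pandas

# Two small series with partially overlapping indices (made-up values).
left = pandas.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pandas.Series({'a': 2.0, 'c': 2.0})

# The + operator aligns on the index; 'b' has no partner, so it becomes NaN.
plain_sum = left + right

# Series.add with fill_value substitutes 0 for the missing partner instead.
filled_sum = left.add(right, fill_value=0)
```

Whether filling with 0 is appropriate depends on what a missing entry means in your data.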

## Data Frames

Not unlike `R`'s data frames, pandas `DataFrame` objects are collections of named series.

In [None]:
fish = pandas.DataFrame({'size': [100, 120, 70],
                         'weight': [20, 30, 25]},
                        index = ['Brown Trout', 'Atlantic Salmon', 'Chinook Salmon'])

In [None]:
fish

In [None]:
fish.index

In [None]:
# columns can be accessed like attributes of an object
fish.weight
# this is not always recommended though:
# it won't work if the column name contains e.g. spaces
# or clashes with an existing attribute or method.
# fish.size, for example, returns the number of
# elements rather than the 'size' column

In [None]:
# access by string: return the series of that name
fish['size']
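
To see why access by string is the safer habit, the sketch below rebuilds the `fish` data frame from above and compares the two access styles for the `size` column. The names `n_cells` and `size_column` are only illustrative:

```python
import pandas

fish = pandas.DataFrame({'size': [100, 120, 70],
                         'weight': [20, 30, 25]},
                        index=['Brown Trout', 'Atlantic Salmon', 'Chinook Salmon'])

# Attribute access collides with DataFrame.size, the number of cells.
n_cells = fish.size           # 3 rows x 2 columns

# Access by string always returns the column.
size_column = fish['size']
```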

In [None]:
fish['size'] > 100 # makes a series of booleans

In [None]:
# access with a series of booleans:
# returns a new data frame with the rows where the condition is true
fish[fish['size'] > 100]

In [None]:
fish[fish['size'] > 100]['weight']
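
Filtering rows and then picking a column takes two indexing steps. As a sketch of an alternative, `.loc` accepts the row condition and the column label together (the name `heavy` is just for illustration):

```python
import pandas

fish = pandas.DataFrame({'size': [100, 120, 70],
                         'weight': [20, 30, 25]},
                        index=['Brown Trout', 'Atlantic Salmon', 'Chinook Salmon'])

# Rows where size > 100, column 'weight', selected in a single step.
heavy = fish.loc[fish['size'] > 100, 'weight']
```

The single-step form is also the one to prefer when assigning to the selection, since assigning through two chained steps may act on a temporary copy.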

## Reading Data

Pandas has many methods for reading data in different formats.

In [None]:
for i in dir(pandas):
    if i.startswith("read"):
        print(i)

In [None]:
df = pandas.read_csv('data/201508_trip_data.csv.gz')
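
If you don't have the data file at hand, `read_csv` also accepts any file-like object. The sketch below reads a made-up two-row stand-in for the trip data from an in-memory string; the rows are invented for illustration and only the column names are taken from the real file:

```python
import io
import pandas

# A made-up two-row stand-in for the trip file, for illustration only.
csv_text = """Duration,Start Date,Start Station
300,8/31/2015 23:26,Embarcadero at Sansome
1200,8/31/2015 21:29,Harry Bridges Plaza (Ferry Building)
"""

# io.StringIO wraps the string so that read_csv can treat it like a file.
sample = pandas.read_csv(io.StringIO(csv_text))
```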

In [None]:
# get basic information about data
df.info()

In [None]:
# get summary statistics
df.describe()

## Our EDA

The aim of this EDA is to see how much the average trip duration from a number of stations varies with the day of the week and from month to month. This could, for example, be used to see whether we can predict the duration of a trip in order to estimate when a given bike will be available again.

In [None]:
# data frames and series have a number of plot functions
# here, we use a histogram
df['Duration'].plot.hist()

In [None]:
df['Duration'].plot.hist(xlim=(0,100)) # doesn't do what we want

In [None]:
# much better!
# limit duration to 45 minutes (Duration is in seconds)
df[df['Duration'] < 60*45]['Duration'].plot.hist(bins=30)

In [None]:
df['Start Date'].head()

In [None]:
import datetime
def mdy_hm(datetimestring):
    return datetime.datetime.strptime(datetimestring,
                            '%m/%d/%Y %H:%M')
df['Start Date'] = df['Start Date'].apply(mdy_hm) # element-wise
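
Applying a Python function element by element works, but pandas can also parse a whole column in one call. A minimal sketch, using two made-up timestamps in the same layout as the `Start Date` column:

```python
import pandas

# Made-up timestamps in the month/day/year hour:minute layout.
raw = pandas.Series(['8/31/2015 23:26', '9/1/2014 0:05'])

# to_datetime parses the whole series at once given an explicit format.
parsed = pandas.to_datetime(raw, format='%m/%d/%Y %H:%M')
```

On the full trip data this would look like `pandas.to_datetime(df['Start Date'], format='%m/%d/%Y %H:%M')`, assuming every entry follows that format.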

## Pivot / Stack

In [None]:
fish

In [None]:
fish.stack()

In [None]:
type(fish.stack())

In [None]:
stacked = fish.stack().reset_index()

In [None]:
stacked

In [None]:
stacked.columns = ['name', 'info', 'value']

In [None]:
stacked

In [None]:
stacked.pivot(index='name', columns='info', values='value')

In [None]:
df.head()

In [None]:
stations = ['Embarcadero at Sansome',
 'Temporary Transbay Terminal (Howard at Beale)',
 'Harry Bridges Plaza (Ferry Building)',
 'San Francisco Caltrain 2 (330 Townsend)',
 'San Francisco Caltrain (Townsend at 4th)']
df = df[df['Start Station'].isin(stations)]

In [None]:
df.head()

In [None]:
departures = df[['Start Station', 'Start Date', 'Duration']]

In [None]:
departures.head()

In [None]:
pivoted = departures.pivot_table(index='Start Date', columns='Start Station', values='Duration')

In [None]:
pivoted.head()
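
Unlike `pivot`, `pivot_table` copes with several rows sharing the same index/column combination: it aggregates them, using the mean by default. The sketch below shows this on a made-up data frame `toy` in which two trips leave station `A` at the same minute:

```python
import pandas

# Made-up departures: two trips share the same start time and station.
toy = pandas.DataFrame({
    'Start Date': pandas.to_datetime(['2014-09-01 08:00',
                                      '2014-09-01 08:00',
                                      '2014-09-01 09:00']),
    'Start Station': ['A', 'A', 'B'],
    'Duration': [300, 500, 600],
})

# Duplicates are averaged; combinations that never occur become NaN.
toy_pivoted = toy.pivot_table(index='Start Date',
                              columns='Start Station',
                              values='Duration')
```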

## Time Series

In [None]:
daily_averages = pivoted.resample('1D').mean()

In [None]:
daily_averages.head()
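
To make the resampling step concrete, here is a small sketch on made-up data: three trip durations at irregular times over two days are binned by calendar day and averaged. It also selects a single day with a date string via `.loc`. All names and values are illustrative:

```python
import pandas

# Made-up trip durations (seconds) at irregular times over two days.
times = pandas.to_datetime(['2014-09-01 08:00',
                            '2014-09-01 17:30',
                            '2014-09-02 09:15'])
durations = pandas.Series([300, 500, 600], index=times)

# One bin per calendar day, averaged within each bin.
daily = durations.resample('1D').mean()

# A date string passed to .loc selects every entry on that day.
first_day = durations.loc['2014-09-01']
```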

In [None]:
daily_averages.loc['2014'].head() # select rows by year; recent pandas versions require .loc here

In [None]:
daily_averages.loc['2014-10'].head() # select rows by month

In [None]:
daily_averages.loc['2014-10-2':'2014-10-7'].head() # a range of days, end point included

## Groupby

In [None]:
groupby_example = pandas.DataFrame({'key': ['a', 'b', 'a', 'b'],
                                    'value': [1,2,1,2]})

In [None]:
groupby_example

In [None]:
groupby_example.groupby('key').sum()

In [None]:
daily_averages['Weekday'] = daily_averages.index.weekday

In [None]:
mean_weekday = daily_averages.groupby('Weekday').mean()

In [None]:
mean_weekday.plot(kind='bar', ylim=(0, 5000))

In [None]:
import calendar

In [None]:
daily_averages['Weekday'] = daily_averages['Weekday'].apply(lambda x: calendar.day_abbr[x])

In [None]:
mean_weekday = daily_averages.groupby('Weekday').mean()
mean_weekday.plot(kind='bar', ylim=(0, 5000))
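
Because `groupby` sorts the group keys, the weekday names now appear in alphabetical rather than calendar order. One way to restore the calendar order, sketched here on a made-up series, is to `reindex` with the names from `calendar.day_abbr`:

```python
import calendar
import pandas

# Weekday abbreviations in calendar order, starting with Monday.
day_names = list(calendar.day_abbr)

# Made-up weekday means, sorted alphabetically as groupby would return them.
alphabetical = pandas.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                             index=day_names).sort_index()

# reindex rearranges the rows to follow the given label order.
ordered = alphabetical.reindex(day_names)
```

The same call should work on the data frame above, e.g. `mean_weekday.reindex(list(calendar.day_abbr))`, before plotting.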

In [None]:
mean_weekday['Embarcadero at Sansome']

In [None]:
mean_weekday.iloc[[1,3]]

In [None]:
daily_averages['Month'] = daily_averages.index.month

In [None]:
# the 'Weekday' column now holds strings, so restrict the mean to numeric columns
daily_averages.groupby('Month').mean(numeric_only=True).plot.bar(ylim=(0, 4000))