In [1]:
# Telling IPython to render plots inside cells
%matplotlib inline

In [3]:
# Importing required Libraries
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import ggplot as gg
from IPython.display import display
from IPython.display import Image
from IPython.display import HTML

# Problem Statement

A traffic analyst in the city of Zreeha wants to find out whether there is any difference in the crash frequencies (number of crashes per year) between rear-end and side-swipe crashes. The transport department collects crash frequencies for one year at 10 four-legged intersection sites. The data is described below in the data frame **df**. Statistically speaking, the analyst wants to answer the question:  

> **Are the crash frequencies of rear-end and side-swipe crashes at 4-legged intersections statistically different?**

In [14]:
# Rear-end Crash
HTML('<img src="http://upload.wikimedia.org/wikipedia/commons/1/1f/Head_On_Collision.jpg" width=600 height=400/>')

In [13]:
# Side-swipe Crash
HTML('<img src="http://upload.wikimedia.org/wikipedia/commons/5/50/Japanese_car_accident_blur.jpg" width=600 height=400/>')

## Data Description

### Reading Data
We first read the data, which is saved in a CSV file:

In [21]:
df = pd.read_csv('C:\\Users\\durraniu\\Documents\\HT2.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Crash Frequency (Crashes per year),Unnamed: 2
0,Site #,Rear-end,Side-swipe
1,1,10,12
2,2,7,9
3,3,6,4
4,4,5,7


The first row is unnecessary here: the actual column names (Site #, Rear-end, Side-swipe) appear below it, so we skip the extra rows when reading the file.

In [22]:
df = pd.read_csv('C:\\Users\\durraniu\\Documents\\HT2.csv', skiprows = 2)
df.head()

Unnamed: 0,Site #,Rear-end,Side-swipe
0,1,10,12
1,2,7,9
2,3,6,4
3,4,5,7
4,5,9,8


### Summary Statistics

In [23]:
df.describe()

Unnamed: 0,Site #,Rear-end,Side-swipe
count,10.0,10.0,10.0
mean,5.5,8.2,8.3
std,3.02765,1.932184,2.311805
min,1.0,5.0,4.0
25%,3.25,7.0,7.0
50%,5.5,8.5,8.0
75%,7.75,9.75,9.75
max,10.0,11.0,12.0


However, we are not really interested in the individual averages of rear-end and side-swipe crashes, but in the difference between them at each site. Our main goal is to test whether the mean of these differences is statistically different from zero.

### Hypothesis Testing
To test whether the mean difference in crash frequencies is significant, we first compute the difference at each site:

In [24]:
df['d'] = df['Rear-end'] - df['Side-swipe']
df.head()

Unnamed: 0,Site #,Rear-end,Side-swipe,d
0,1,10,12,-2
1,2,7,9,-2
2,3,6,4,2
3,4,5,7,-2
4,5,9,8,1
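Because the CSV file lives on a local path, the differencing step can also be reproduced without it. The following is a minimal sketch that rebuilds only the five sites printed by `df.head()` above; the name `sample` is introduced here for illustration, and the remaining five sites of the real data are not included.

```python
import pandas as pd

# Only the five sites shown in the df.head() output above.
# The full data set has 10 sites; the other five are not reproduced here.
sample = pd.DataFrame({
    'Site #': [1, 2, 3, 4, 5],
    'Rear-end': [10, 7, 6, 5, 9],
    'Side-swipe': [12, 9, 4, 7, 8],
})

# Paired difference at each site: rear-end minus side-swipe
sample['d'] = sample['Rear-end'] - sample['Side-swipe']
print(sample)
```

Since this subset is incomplete, its mean and standard deviation will not match the values reported for the full data frame.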


The mean of the differences of two samples is:

In [27]:
dbar = df['d'].mean()
print(dbar)

-0.1


And the sample standard deviation of the differences is:

In [28]:
s = df['d'].std()
print(s)

1.66332999332


#### Hypothesis
Our null hypothesis is that there is no difference between the crash frequencies of rear-end and side-swipe crashes or, in other words, that the mean of the population of all these differences is zero:  
$H_0$: $\mu_D$ = 0  
and the alternative hypothesis is:  
$H_A$: $\mu_D$ $\neq$ 0  

**Level of significance = 0.05**

In [64]:
HTML('<img src="HT2.png" width=750 height=500/>')

#### Critical Value
Because we have a sample size of only 10 and the population standard deviation is unknown, we will use the *t* distribution instead of the Z distribution. Under the null hypothesis, the sampling distribution of the mean difference in crash frequencies between rear-end and side-swipe crashes is centered at the population mean difference, which is assumed to be zero in this case. 

We can find the critical t for a two-tailed test at the 0.05 significance level with 9 degrees of freedom using the following command:

In [73]:
from scipy.stats import distributions as dists

tcritical = dists.t.ppf(1-0.05/2, 9)
print(tcritical)

2.26215716274
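As a sanity check on the critical value, the area in both tails beyond ±t-critical should add up to the significance level. The sketch below uses `stats.t.ppf` and `stats.t.sf` from SciPy; the variable names `alpha`, `dof`, `t_crit` and `two_tail_area` are chosen here for illustration.

```python
from scipy import stats

alpha = 0.05   # significance level
dof = 9        # degrees of freedom: n - 1 with n = 10 sites

# Upper critical value for a two-tailed test
t_crit = stats.t.ppf(1 - alpha / 2, dof)

# Survival function gives the upper-tail area; double it for both tails
two_tail_area = 2 * stats.t.sf(t_crit, dof)

print(t_crit)
print(two_tail_area)
```

The second printed value should come back as the chosen significance level, which confirms that `1 - alpha / 2` is the right quantile for a two-tailed test.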


#### t-statistic
From our data we can compute the t-score using the following formula:

$$t = \frac{\bar{d} - \mu_D}{s/\sqrt{n}}$$ 

We can use the following command from the stats module to find the t-statistic and the p-value for a two-tailed test:

In [74]:
paired_sample = stats.ttest_rel(df['Rear-end'], df['Side-swipe'])
print("The t-statistic is %.3f and the p-value is %.3f." % tuple(paired_sample))

The t-statistic is -0.190 and the p-value is 0.853.
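The t-score formula can also be checked by hand against the `ttest_rel` result. The sketch below plugs in the mean and standard deviation of the differences printed earlier (`-0.1` and `1.66332999332`) with `n = 10`; the names `se`, `t_manual` and `p_manual` are introduced here for illustration.

```python
import math
from scipy import stats

dbar = -0.1          # mean of the paired differences (printed earlier)
s = 1.66332999332    # sample standard deviation of the differences
n = 10               # number of sites
mu_D = 0.0           # hypothesized mean difference under the null

# Standard error of the mean difference
se = s / math.sqrt(n)

# t-statistic from the formula above
t_manual = (dbar - mu_D) / se

# Two-tailed p-value with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

print("t = %.3f, p = %.3f" % (t_manual, p_manual))
```

Both values should agree with the output of `stats.ttest_rel`, since a paired t-test is a one-sample t-test on the differences.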


## Conclusion

Because the t-statistic (-0.190) falls in the non-rejection region, i.e. between the critical t-values of -2.262 and 2.262, we fail to reject the null hypothesis.  
Another way to interpret the result is through the p-value: 0.853 is higher than the 0.05 significance level. In other words, the probability of observing a mean difference at least as extreme as ours, given that the null hypothesis is true, is higher than the probability we are willing to accept of rejecting the null hypothesis when it is in fact true. Therefore, we fail to reject the null hypothesis. In the context of this example, we say that the mean difference between rear-end and side-swipe crash frequencies is not statistically significant.
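The two decision rules can be written out explicitly. The sketch below reuses the numbers reported in this notebook (rounded t-statistic and p-value, mean and standard deviation of the differences) and, as an additional equivalent check not part of the original analysis, builds a 95% confidence interval for the mean difference. All variable names are chosen here for illustration.

```python
import math
from scipy import stats

alpha = 0.05
n = 10
dbar, s = -0.1, 1.66332999332     # mean and sd of differences (printed earlier)
t_stat, p_value = -0.190, 0.853   # rounded values reported by ttest_rel above

t_crit = stats.t.ppf(1 - alpha / 2, n - 1)

# Rule 1: reject if the t-statistic lies outside +/- t_crit
reject_by_t = abs(t_stat) > t_crit

# Rule 2: reject if the p-value is below the significance level
reject_by_p = p_value < alpha

# Equivalent view: 95% confidence interval for the mean difference
margin = t_crit * s / math.sqrt(n)
ci = (dbar - margin, dbar + margin)

print(reject_by_t, reject_by_p)
print("95%% CI for mean difference: (%.3f, %.3f)" % ci)
```

Both rules give the same answer, and an interval that contains zero is consistent with failing to reject the null hypothesis.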

## Resources
* [Learning Python for Data Analysis and Visualization](https://www.udemy.com/learning-python-for-data-analysis-and-visualization/)
* [Data Analysis and Statistical Inference course](https://www.coursera.org/course/statistics)
* Caldwell, Sally. Statistics unplugged. Cengage Learning, 2012.
* [paired t test in python](http://iaingallagher.tumblr.com/post/50980987285/t-tests-in-python)

In [67]:
%reload_ext version_information

%version_information numpy, scipy, matplotlib, sympy, pandas, ggplot

Software,Version
Python,2.7.9 64bit [MSC v.1500 64 bit (AMD64)]
IPython,3.0.0
OS,Windows 8 6.2.9200
numpy,1.9.2
scipy,0.15.1
matplotlib,1.4.3
sympy,0.7.6
pandas,0.16.0
ggplot,0.6.5
Sun Apr 26 17:40:56 2015 Eastern Daylight Time,Sun Apr 26 17:40:56 2015 Eastern Daylight Time
