## Paired samples t-test using SciPy stats

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as ss


The paired samples t-test compares measurements taken from the same subjects at two points in time, often to test whether an intervention has had a significant effect on the outcome. An example is testing a weight-loss intervention on patients by recording their performance on a physical task both before and after the intervention. The test checks whether there is a statistically significant difference between the means of the two sets of results, which is equivalent to asking whether the mean of the paired differences is significantly different from zero.
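To make the mechanics concrete, here is a minimal sketch that computes the paired t statistic by hand and compares it with `ss.ttest_rel`. The `before` and `after` arrays are made-up illustrative values, not data from this notebook.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical before/after measurements for six subjects (illustrative values only).
before = np.array([82.0, 95.5, 77.2, 101.3, 88.9, 91.4])
after = np.array([80.1, 93.0, 78.0, 97.8, 87.5, 89.9])

# The paired test works on the per-subject differences.
diffs = before - after
n = diffs.size

# t = mean(d) / (sd(d) / sqrt(n)), using the sample standard deviation (ddof=1).
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * ss.t.sf(abs(t_manual), df=n - 1)

# SciPy's paired test should agree with the manual calculation.
result = ss.ttest_rel(before, after)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

The same numbers should also come out of a one-sample t-test of `diffs` against zero, which is why the assumptions below are stated in terms of the pair differences.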

The paired t-test has four assumptions:
* that the data is numerical or continuous
* that the subjects from which measurements are taken are independent and do not affect each other's measurements
* that the differences between the pairs are normally distributed
* that there are no extreme values in the distribution of pair differences
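The last two assumptions can be checked numerically. The sketch below is one possible approach, not necessarily the one used in this notebook: it runs a Shapiro-Wilk test for normality and applies the 1.5 × IQR rule to flag potential outliers. The `diffs` array is synthetic data generated purely for illustration.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical pair differences, drawn from a normal distribution for illustration.
rng = np.random.default_rng(0)
diffs = rng.normal(loc=5.0, scale=20.0, size=48)

# Assumption 3: Shapiro-Wilk test for normality.
# A small p-value (for example below 0.05) suggests the differences are not normal.
shapiro_stat, shapiro_p = ss.shapiro(diffs)

# Assumption 4: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(diffs, [25, 75])
iqr = q3 - q1
outliers = diffs[(diffs < q1 - 1.5 * iqr) | (diffs > q3 + 1.5 * iqr)]

print(f"Shapiro-Wilk: W={shapiro_stat:.3f}, p={shapiro_p:.3f}")
print(f"Potential outliers: {outliers}")
```

A histogram or Q-Q plot of the differences is a useful visual complement to these checks, since formal normality tests can be insensitive with small samples.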

For this exercise, I will use the US Traffic Fatalities dataset, which can be found [here](https://vincentarelbundock.github.io/Rdatasets/csv/AER/Fatalities.csv), and compare the fatality counts for 48 US states in 1987 and 1988.
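The test needs one row per state with a column for each year. If the raw data is in long format (one row per state and year), a pivot along the following lines could produce that layout. This is only a sketch: the column names `state`, `year` and `fatal` are assumptions about the raw file, and the three states and their counts are copied from the table output shown further down.

```python
import pandas as pd

# Hypothetical long-format extract: one row per state and year.
# The column names 'state', 'year' and 'fatal' are assumptions, not confirmed by this notebook.
long_df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Arizona", "Arizona", "Arkansas", "Arkansas"],
    "year": [1987, 1988, 1987, 1988, 1987, 1988],
    "fatal": [1110, 1023, 937, 944, 639, 610],
})

# Reshape to one row per state with one column per year.
wide = long_df.pivot(index="state", columns="year", values="fatal")

# Use string column labels so the years can be selected as '1987' and '1988'.
wide.columns = wide.columns.astype(str)

# Paired difference for each state.
wide["Difference"] = wide["1987"] - wide["1988"]
wide = wide.reset_index()
print(wide)
```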

I will first check assumption #3 by looking at the distribution of differences between the two years.

In [4]:
df = pd.read_csv('us-road-fatalities.csv')

df.head()

| Unnamed: 0 | State | 1987 | 1988 | Difference |
|---|---|---|---|---|
| 0 | Alabama | 1110 | 1023 | 87 |
| 1 | Arizona | 937 | 944 | -7 |
| 2 | Arkansas | 639 | 610 | 29 |
| 3 | California | 5504 | 5390 | 114 |
| 4 | Colorado | 591 | 497 | 94 |


In [9]:
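# Fatality counts per state for each year
year1 = df['1987']
year2 = df['1988']

# Paired t-test; the statistic is based on the per-state differences year1 - year2
ss.ttest_rel(year1, year2)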

Ttest_relResult(statistic=-1.4657751972691198, pvalue=0.14936820290124436)
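To read this result, the p-value is compared against a chosen significance level. The sketch below uses the conventional 0.05 threshold, which is my assumption since the notebook does not state one, and the statistic and p-value are copied from the output above.

```python
# Values copied from the Ttest_relResult output above.
statistic = -1.4657751972691198
pvalue = 0.14936820290124436

# Conventional significance level; an assumption, not stated in the notebook.
alpha = 0.05

if pvalue < alpha:
    conclusion = "reject the null hypothesis of equal means"
else:
    conclusion = "fail to reject the null hypothesis of equal means"

print(f"t = {statistic:.3f}, p = {pvalue:.3f}: {conclusion} at alpha = {alpha}")
```

With a p-value of about 0.149, the test does not provide evidence at the 5% level that mean fatalities differed between 1987 and 1988.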