<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
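To make the two-sample idea concrete, here is a minimal sketch that computes the pooled (equal-variance) t statistic by hand and compares it with `scipy.stats.ttest_ind`. The vote arrays and the helper name `pooled_t_statistic` are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 votes for two small groups (not taken from the dataset).
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1])

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)  # sample variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error

manual_t = pooled_t_statistic(group_a, group_b)
scipy_result = stats.ttest_ind(group_a, group_b)  # equal_var=True by default
print(manual_t, scipy_result.statistic)
```

Both numbers should agree, which shows that the test compares the difference in group means against its estimated standard error.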

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

## Load and Clean Data

In [1]:
### YOUR CODE STARTS HERE
import pandas as pd
import scipy.stats
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"
names=['party', 'handicapped-infants', 'water-project', 
'budget', 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups', 
'anti-satellite-ban', 'aid-to-contras', 'mx-missile', 'immigration', 
'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free', 
'south-africa']
df = pd.read_csv(url, names=names)

print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [2]:
#replace "?", and reformat y/n to 1/0
import numpy as np
df = df.replace({'?': np.nan, 'y': 1, 'n': 0})  # np.NaN alias was removed in NumPy 2.0
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [3]:
#count null values
df.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64
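
Since `south-africa` alone has 104 missing votes, dropping every row that contains any missing value would discard at least 104 of the 435 congresspeople. Dropping missing votes one issue at a time keeps more data for each test. The sketch below illustrates the difference on a hypothetical toy table (the column names `issue-a` and `issue-b` are made up for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the cleaned vote table (1 = yes, 0 = no).
toy = pd.DataFrame({
    'party': ['democrat', 'democrat', 'republican', 'republican'],
    'issue-a': [1, np.nan, 0, 0],
    'issue-b': [1, 1, np.nan, 0],
})

# Listwise deletion: keep only rows with no missing vote on any issue.
rows_after_listwise = len(toy.dropna())

# Per-issue deletion: count the non-missing votes available for each issue.
votes_per_issue = toy[['issue-a', 'issue-b']].count()

print(rows_after_listwise)
print(votes_per_issue)
```

In this toy table only two rows survive listwise deletion, while each issue on its own still has three usable votes. Passing `nan_policy='omit'` to `ttest_ind`, as done later in this notebook, corresponds to the per-issue approach.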

In [4]:
#overall stats
df.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,423.0,387.0,424.0,424.0,420.0,424.0,421.0,420.0,413.0,428.0,414.0,404.0,410.0,418.0,407.0,331.0
mean,0.44208,0.503876,0.596698,0.417453,0.504762,0.641509,0.567696,0.57619,0.501211,0.504673,0.362319,0.423267,0.509756,0.593301,0.427518,0.812689
std,0.497222,0.500632,0.49114,0.493721,0.500574,0.480124,0.495985,0.49475,0.500605,0.500563,0.481252,0.49469,0.500516,0.491806,0.495327,0.390752
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
rep = df[df['party'] == 'republican']
dem = df[df['party'] == 'democrat']
print('Rep Shape: {}'.format(rep.shape))
print('Dem Shape: {}'.format(dem.shape))

Rep Shape: (168, 17)
Dem Shape: (267, 17)


In [6]:
#Democratic party summary statistics
dem.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,258.0,239.0,260.0,259.0,255.0,258.0,259.0,263.0,248.0,263.0,255.0,249.0,252.0,257.0,251.0,185.0
mean,0.604651,0.502092,0.888462,0.054054,0.215686,0.476744,0.772201,0.828897,0.758065,0.471483,0.505882,0.144578,0.289683,0.350195,0.63745,0.935135
std,0.489876,0.501045,0.315405,0.226562,0.412106,0.50043,0.420224,0.377317,0.429121,0.500138,0.500949,0.352383,0.454518,0.477962,0.481697,0.246956
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
#Republican party summary statistics
rep.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,165.0,148.0,164.0,165.0,165.0,166.0,162.0,157.0,165.0,165.0,159.0,155.0,158.0,161.0,156.0,146.0
mean,0.187879,0.506757,0.134146,0.987879,0.951515,0.89759,0.240741,0.152866,0.115152,0.557576,0.132075,0.870968,0.860759,0.981366,0.089744,0.657534
std,0.391804,0.501652,0.341853,0.10976,0.215442,0.304104,0.428859,0.36101,0.320176,0.498186,0.339643,0.336322,0.347298,0.135649,0.286735,0.476168
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Democratic support $\gt$ Republican

In [0]:
from scipy.stats import ttest_ind

In [9]:
#budget
dem_mean = dem['budget'].mean()
rep_mean = rep['budget'].mean()
print('Budget')
print('Democratic Mean: {0:.4f}'.format(dem_mean))
print('Republican Mean: {0:.4f}'.format(rep_mean))

Budget
Democratic Mean: 0.8885
Republican Mean: 0.1341


In [10]:
#null hypothesis: democratic support = republican
t_test = ttest_ind(rep['budget'], dem['budget'], nan_policy='omit')
print('Statistic = {}'.format(t_test.statistic))
print('P-value = {}'.format(t_test.pvalue))

Statistic = -23.21277691701378
P-value = 2.0703402795404463e-77


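The p-value is far below the required threshold (p $\lt$ 0.01), so the null hypothesis of equal support is rejected. The t-statistic is negative because the Republican sample was passed first, and the Democratic mean (0.8885) is much higher than the Republican mean (0.1341). Democratic support for the budget is therefore significantly greater than Republican support.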
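
The test above is two-sided, and the direction is read off the means afterwards. A directional hypothesis can also be tested directly. The sketch below assumes SciPy 1.6 or newer, where `ttest_ind` accepts an `alternative` argument, and uses hypothetical vote data rather than the real columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical votes (1 = yes, 0 = no, NaN = missing), not the real data.
dem_votes = pd.Series([1, 1, 1, 0, 1, np.nan, 1, 1])
rep_votes = pd.Series([0, 0, 1, 0, np.nan, 0, 0])

# Drop missing votes explicitly, then test H1: mean(dem) > mean(rep).
# The `alternative` argument is an assumption about the installed SciPy version.
one_sided = ttest_ind(dem_votes.dropna(), rep_votes.dropna(), alternative='greater')
two_sided = ttest_ind(dem_votes.dropna(), rep_votes.dropna())
print(one_sided.pvalue, two_sided.pvalue)
```

When the statistic is positive, the one-sided p-value is half of the two-sided one, so an issue that passes the two-sided test at p $\lt$ 0.01 in the expected direction also passes the one-sided test.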

## Republican support $\gt$ Democratic

In [11]:
#physician-fee-freeze
dem_mean = dem['physician-fee-freeze'].mean()
rep_mean = rep['physician-fee-freeze'].mean()
print('Physician Fee Freeze')
print('Democratic Mean: {0:.4f}'.format(dem_mean))
print('Republican Mean: {0:.4f}'.format(rep_mean))

Physician Fee Freeze
Democratic Mean: 0.0541
Republican Mean: 0.9879


In [12]:
#null hypothesis: republican support = democratic
t_test = ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit')
print('Statistic = {}'.format(t_test.statistic))
print('P-value = {}'.format(t_test.pvalue))

Statistic = 49.36708157301406
P-value = 1.994262314074344e-177


The t-statistic for the physician fee freeze is large (about 49.4), which indicates that equal support across the two parties is very unlikely. The p-value is extremely small (p $\lt$ 0.01), so we can reject the null hypothesis. Comparing the two means, Republican support (0.9879) is far higher than Democratic support (0.0541).
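
By default `ttest_ind` assumes that both groups have the same variance. The summary tables show different standard deviations for the two parties on this issue (about 0.11 versus 0.23), so Welch's t-test, which drops that assumption, may be a reasonable cross-check. This is a sketch on hypothetical data, not a rerun of the analysis above.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes with clearly different spreads (illustrative only).
rep_votes = np.array([1, 1, 1, 1, 1, 0, 1, 1])
dem_votes = np.array([0, 0, 0, 1, 0, 0])

# equal_var=False switches from Student's t-test to Welch's t-test.
student = ttest_ind(rep_votes, dem_votes)
welch = ttest_ind(rep_votes, dem_votes, equal_var=False)
print('Student:', student.statistic, student.pvalue)
print('Welch:  ', welch.statistic, welch.pvalue)
```

In the notebook this would amount to adding `equal_var=False` to the existing `ttest_ind` calls; with differences as large as the ones seen here, the conclusion is unlikely to change.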

## Democratic Support $\approx$ Republican Support (p $\gt$ 0.1)

In [13]:
#water project
dem_mean = dem['water-project'].mean()
rep_mean = rep['water-project'].mean()
print('Water Project')
print('Democratic Mean: {0:.4f}'.format(dem_mean))
print('Republican Mean: {0:.4f}'.format(rep_mean))

Water Project
Democratic Mean: 0.5021
Republican Mean: 0.5068


In [14]:
#null hypothesis: republican support = democratic
t_test = ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')
print('Statistic = {}'.format(t_test.statistic))
print('P-value = {}'.format(t_test.pvalue))

Statistic = 0.08896538137868286
P-value = 0.9291556823993485


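The p-value is greater than 0.1, so we cannot reject the null hypothesis. The two means (0.5021 for Democrats, 0.5068 for Republicans) are nearly identical, so the data show no statistically significant difference between the parties in support for the water project. Failing to reject the null hypothesis does not prove that support is exactly equal.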
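
One way to express "no detectable difference" more informatively is a confidence interval for the difference in means. The sketch below builds a 95% interval by hand on hypothetical data, using the same pooled-variance assumption as the default `ttest_ind`.

```python
import numpy as np
from scipy import stats

# Hypothetical votes for two groups with similar support (illustrative only).
group_a = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 1])
group_b = np.array([0, 1, 0, 1, 1, 0, 1, 0])

n_a, n_b = len(group_a), len(group_b)
diff = group_a.mean() - group_b.mean()

# Pooled standard error of the difference in means.
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# 95% confidence interval based on the t distribution.
t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci_low, ci_high = diff - t_crit * std_err, diff + t_crit * std_err
print(diff, (ci_low, ci_high))
```

An interval that contains zero is consistent with equal support, but its width also shows how large a difference the data could still be hiding.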

# Stretch

In [32]:
for name in names[1:]:
  dem_mean = dem[name].mean()
  rep_mean = rep[name].mean()
  print(name.replace('-',' ').title() + '\n')
  print('Democratic Mean: {0:.4f}'.format(dem_mean))
  print('Republican Mean: {0:.4f}\n'.format(rep_mean))  
  #null hypothesis: republican support = democratic
  t_test = ttest_ind(rep[name], dem[name], nan_policy='omit')
  print('Statistic = {}'.format(t_test.statistic))
  print('P-value = {}'.format(t_test.pvalue)) 
  print("---" * 12)


Handicapped Infants

Democratic Mean: 0.6047
Republican Mean: 0.1879

Statistic = -9.205264294809222
P-value = 1.613440327937243e-18
------------------------------------
Water Project

Democratic Mean: 0.5021
Republican Mean: 0.5068

Statistic = 0.08896538137868286
P-value = 0.9291556823993485
------------------------------------
Budget

Democratic Mean: 0.8885
Republican Mean: 0.1341

Statistic = -23.21277691701378
P-value = 2.0703402795404463e-77
------------------------------------
Physician Fee Freeze

Democratic Mean: 0.0541
Republican Mean: 0.9879

Statistic = 49.36708157301406
P-value = 1.994262314074344e-177
------------------------------------
El Salvador Aid

Democratic Mean: 0.2157
Republican Mean: 0.9515

Statistic = 21.13669261173219
P-value = 5.600520111729011e-68
------------------------------------
Religious Groups

Democratic Mean: 0.4767
Republican Mean: 0.8976

Statistic = 9.737575825219457
P-value = 2.3936722520597287e-20
------------------------------------
Anti Sa

In [0]:
def ttest(df1, df2, name):
  """Print both sample means and the two-sample t-test result for column `name`."""
  df1_mean = df1[name].mean()
  df2_mean = df2[name].mean()
  print(name.replace('-',' ').title() + '\n')
  print('Sample 1 Mean: {0:.4f}'.format(df1_mean))
  print('Sample 2 Mean: {0:.4f}\n'.format(df2_mean))
  #null hypothesis: df1 mean = df2 mean (missing votes are ignored)
  t_test = ttest_ind(df1[name], df2[name], nan_policy='omit')
  print('Statistic = {}'.format(t_test.statistic))
  print('P-value = {}'.format(t_test.pvalue))

In [35]:
#Aid to contras example
ttest(dem, rep, 'aid-to-contras')
#df1 = dem sample
#df2 = rep sample

Aid To Contras

Sample 1 Mean: 0.8289
Sample 2 Mean: 0.1529

Statistic = 18.052093200819733
P-value = 2.82471841372357e-54
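
As a possible next step for the first stretch goal, the results can be collected in a table instead of being printed, which makes it easier to filter issues by p-value. The helper `summarize_issues` and the toy data below are hypothetical and only sketch the idea.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def summarize_issues(df, issues, group_col='party', groups=('democrat', 'republican')):
    """Return one row per issue with both group means, the t statistic and p-value."""
    first = df[df[group_col] == groups[0]]
    second = df[df[group_col] == groups[1]]
    rows = []
    for issue in issues:
        # Drop missing votes per issue before running the test.
        result = ttest_ind(first[issue].dropna(), second[issue].dropna())
        rows.append({
            'issue': issue,
            'mean_' + groups[0]: first[issue].mean(),
            'mean_' + groups[1]: second[issue].mean(),
            'statistic': result.statistic,
            'pvalue': result.pvalue,
        })
    # Smallest p-values (strongest evidence of a difference) come first.
    return pd.DataFrame(rows).sort_values('pvalue').reset_index(drop=True)

# Hypothetical toy data standing in for the cleaned vote table.
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue-a': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, np.nan, 1],
    'issue-b': [1, 0, 1, 0, np.nan, 1, 0, 1, 1, 0, 1, 0],
})
summary = summarize_issues(toy, ['issue-a', 'issue-b'])
print(summary)
```

With the notebook's data, the equivalent call would be `summarize_issues(df, names[1:])`, after which rows can be filtered with conditions such as `summary['pvalue'] < 0.01`.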
