<a href="https://colab.research.google.com/github/dunkelweizen/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/Cai_Nowicki_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
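To make the two-sample idea concrete, here is a minimal sketch that computes the pooled-variance t statistic by hand with only the standard library. This is the same statistic `ttest_ind` reports with its default `equal_var=True`. The helper name `two_sample_t` and the vote lists are made up for illustration and are not part of the assignment data.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled-variance (Student) two-sample t statistic and degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, n_a + n_b - 2

# Hypothetical yes (1) / no (0) votes for two small groups
group_a = [1, 1, 1, 0, 1, 1, 0, 1]
group_b = [0, 0, 1, 0, 0, 1, 0, 0]
t_stat, dof = two_sample_t(group_a, group_b)
print(t_stat, dof)
```

A larger absolute t statistic means the gap between the two group means is large relative to the noise within each group, which is what drives the p-value down.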

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [0]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

df = pd.read_csv(url, header=None)

In [6]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
columns = ['Class Name', 'handicapped-infants', 
           'water-project-cost-sharing',
           'adoption-of-the-budget-resolution',
           'physician-fee-freeze',
           'el-salvador-aid',
           'religious-groups-in-schools',
           'anti-satellite-test-ban', 
           'aid-to-nicaraguan-contras',
           'mx-missile',
           'immigration',
           'synfuels-corporation-cutback',
           'education-spending',
           'superfund-right-to-sue',
           'crime',
           'duty-free-exports',
           'export-administration-act-south-africa']
df.columns = columns

In [8]:
df.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
df = df.replace(to_replace='?', value=np.nan)

In [47]:
df['Class Name'].value_counts()

democrat      267
republican    168
Name: Class Name, dtype: int64

In [48]:
df.isnull().sum()
# Every issue has missing votes, so dropping all rows with a null would throw away a lot of usable data

Class Name                                  0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
immigration                                 7
synfuels-corporation-cutback               21
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64
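The trade-off noted in the comment above can be sketched on a tiny, made-up table: dropping every row that has any missing vote shrinks the data much more than skipping missing values one column at a time. The frame `toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the voting table; NaN marks a missing vote
toy = pd.DataFrame({
    'Class Name': ['democrat', 'democrat', 'republican', 'republican'],
    'issue-a': [1.0, np.nan, 0.0, 1.0],
    'issue-b': [np.nan, 1.0, 1.0, 0.0],
})

# Row-wise cleaning: keep only rows with no missing vote at all
rows_after_dropna = len(toy.dropna())

# Column-wise view: how many usable votes each issue has on its own
votes_per_issue = toy[['issue-a', 'issue-b']].notna().sum()

print(rows_after_dropna)
print(votes_per_issue)
```

Here only two of four rows survive `dropna()`, while each issue still has three usable votes when handled separately.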

Null Hypothesis: The share of 'yes' votes on 'immigration' is the same for Democrats and Republicans.

Alternate Hypothesis: The share of 'yes' votes on 'immigration' is higher for Democrats than for Republicans.

In [49]:
df['immigration'].value_counts()

y    216
n    212
Name: immigration, dtype: int64

In [0]:
df = df.replace({'y': 1, 'n': 0})

In [0]:
dem_mask = df['Class Name'] == 'democrat'
df_dems = df[dem_mask]
rep_mask = df['Class Name'] == 'republican'
df_reps = df[rep_mask]

In [55]:
df_dems.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [0]:
average_dems = df_dems['immigration'].mean()
average_reps = df_reps['immigration'].mean()

In [61]:
print(average_dems, average_reps)

# A higher average means more 1s (yes votes) and therefore more support

0.4714828897338403 0.5575757575757576


On average, Republicans supported this vote more than Democrats did (about 56% versus 47% voting yes), which is the opposite direction from my alternate hypothesis.

In [58]:
ttest_ind(df_dems['immigration'], df_reps['immigration'], nan_policy='omit')

Ttest_indResult(statistic=-1.7359117329695164, pvalue=0.08330248490425066)

The p-value for the 'immigration' vote is 0.0833. At the 0.01 threshold used in this notebook, I fail to reject my null hypothesis: this test does not show a statistically significant difference between Republicans and Democrats on this issue.
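The alternate hypothesis for this vote was directional, while `ttest_ind` runs a two-sided test by default. As a rough sketch, and assuming SciPy 1.6 or later (where `ttest_ind` accepts an `alternative` argument), a one-sided version could look like the following. The arrays are invented toy votes, not the congressional data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes (1) / no (0) votes; the first group has the lower mean
dems = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
reps = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Default: two-sided test (is there a difference in either direction?)
two_sided = ttest_ind(dems, reps)

# One-sided test: is the first group's mean greater than the second's?
one_sided = ttest_ind(dems, reps, alternative='greater')

print(two_sided.pvalue, one_sided.pvalue)
```

When the observed difference points the opposite way from the alternative, as it does here, the one-sided p-value ends up above 0.5, so a directional hypothesis of this kind cannot be supported by such data.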

### Function to test columns of dataframe

In [0]:
def stat_difference(column, alpha=0.01):
  """Print each party's mean vote and the two-sample t-test p-value for a column."""
  # nan_policy='omit' skips missing votes instead of returning NaN
  s, p = ttest_ind(df_dems[column], df_reps[column], nan_policy='omit')
  print('Republican Average for', column, '=', df_reps[column].mean())
  print('Democrat Average for', column, '=', df_dems[column].mean())
  print('p-value =', p)
  if p > alpha:
    print('No statistical difference')
  else:
    print('Statistically different!')
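One possible variation, sketched here under the assumption that returning values is more convenient than printing them, is a helper that hands back the numbers so they can be collected or compared later. The name `compare_groups`, its parameters, and the toy frames are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_groups(group_a, group_b, column, alpha=0.01):
    """Return means, p-value and a significance flag for one column (hypothetical helper)."""
    # nan_policy='omit' ignores missing votes within each group
    _, p = ttest_ind(group_a[column], group_b[column], nan_policy='omit')
    return {
        'column': column,
        'mean_a': group_a[column].mean(),
        'mean_b': group_b[column].mean(),
        'p_value': float(p),
        'significant': bool(p < alpha),
    }

# Made-up votes with one missing value in each group
toy_a = pd.DataFrame({'issue': [1, 1, 1, 1, 0, 1, 1, np.nan]})
toy_b = pd.DataFrame({'issue': [0, 0, 0, 1, 0, 0, np.nan, 0]})
result = compare_groups(toy_a, toy_b, 'issue')
print(result)
```

Returning a dictionary keeps the test logic separate from how the result is displayed, so the same function can feed a printed report or a summary table.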

In [64]:
df.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [72]:
# Generic null hypothesis: votes will show no difference between Republicans and Democrats
# Generic alternate hypothesis: votes will show a difference
# 'immigration' is not repeated here because it was tested above
stat_difference('handicapped-infants')
stat_difference('water-project-cost-sharing')
stat_difference('adoption-of-the-budget-resolution')
stat_difference('physician-fee-freeze')
stat_difference('el-salvador-aid')
stat_difference('religious-groups-in-schools')
stat_difference('anti-satellite-test-ban')
stat_difference('aid-to-nicaraguan-contras')
stat_difference('mx-missile')
stat_difference('synfuels-corporation-cutback')
stat_difference('education-spending')
stat_difference('superfund-right-to-sue')
stat_difference('crime')
stat_difference('duty-free-exports')
stat_difference('export-administration-act-south-africa')

Republican Average for handicapped-infants = 0.18787878787878787
Democrat Average for handicapped-infants = 0.6046511627906976
p-value = 1.613440327937243e-18
Statistically different!
Republican Average for water-project-cost-sharing = 0.5067567567567568
Democrat Average for water-project-cost-sharing = 0.502092050209205
p-value = 0.9291556823993485
No statistical difference
Republican Average for adoption-of-the-budget-resolution = 0.13414634146341464
Democrat Average for adoption-of-the-budget-resolution = 0.8884615384615384
p-value = 2.0703402795404463e-77
Statistically different!
Republican Average for physician-fee-freeze = 0.9878787878787879
Democrat Average for physician-fee-freeze = 0.05405405405405406
p-value = 1.994262314074344e-177
Statistically different!
Republican Average for el-salvador-aid = 0.9515151515151515
Democrat Average for el-salvador-aid = 0.21568627450980393
p-value = 5.600520111729011e-68
Statistically different!
Republican Average for religious-groups-in-sch

### Results of t-tests

Votes with more Democrat support were 'handicapped-infants', 'adoption-of-the-budget-resolution', 'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile', 'synfuels-corporation-cutback', 'duty-free-exports', and 'export-administration-act-south-africa'.

Votes with more Republican support were 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups-in-schools', 'education-spending', 'superfund-right-to-sue', and 'crime'. 

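Votes with no significant difference between the two groups at the 0.01 level were 'immigration' (p ≈ 0.083) and 'water-project-cost-sharing' (p ≈ 0.929). Of these, only 'water-project-cost-sharing' also meets the assignment's p > 0.1 criterion.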
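The grouping above could also be produced programmatically. The sketch below labels each issue using the assignment's two thresholds (p < 0.01 for a clear party lean, p > 0.1 for no clear difference). The helper `classify_issue`, the label strings, and the toy table are all assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import ttest_ind

def classify_issue(dem_votes, rep_votes, strong=0.01, weak=0.1):
    """Label one issue using the assignment's thresholds (hypothetical helper)."""
    # Drop missing votes within each group before testing
    _, p = ttest_ind(dem_votes.dropna(), rep_votes.dropna())
    if p < strong:
        return 'democrat' if dem_votes.mean() > rep_votes.mean() else 'republican'
    # Between the two thresholds the result fits neither assignment goal
    return 'no clear difference' if p > weak else 'in between'

# Made-up table: issue-x splits the parties, issue-y does not
toy = pd.DataFrame({
    'party': ['democrat'] * 8 + ['republican'] * 8,
    'issue-x': [1, 1, 1, 1, 1, 1, 1, 0] + [0, 0, 0, 0, 0, 0, 0, 1],
    'issue-y': [1, 0, 1, 0, 1, 0, 1, 0] + [0, 1, 0, 1, 0, 1, 0, 1],
})
dems = toy[toy['party'] == 'democrat']
reps = toy[toy['party'] == 'republican']
labels = {col: classify_issue(dems[col], reps[col]) for col in ['issue-x', 'issue-y']}
print(labels)
```

Keeping a separate 'in between' label makes it visible when a p-value is above 0.01 but still below 0.1.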