

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this analysis calls for *two-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed value.
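
As a minimal sketch of what such a test looks like on yes/no votes (the two arrays below are made-up toy data, not the congressional records), `scipy.stats.ttest_ind` compares the two group means. Passing `nan_policy='omit'` drops missing votes instead of letting them turn the result into NaN:

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy data (hypothetical): 1 = yes, 0 = no, NaN = missing vote
group_a = np.array([1.0] * 18 + [0.0] * 2 + [np.nan])
group_b = np.array([0.0] * 17 + [1.0] * 3 + [np.nan])

# Default nan_policy='propagate': a single NaN makes the result NaN
propagated = ttest_ind(group_a, group_b)

# nan_policy='omit': missing votes are dropped before testing
omitted = ttest_ind(group_a, group_b, nan_policy='omit')

p_propagated = float(propagated.pvalue)
p_omitted = float(omitted.pvalue)
t_omitted = float(omitted.statistic)

print('p-value with NaNs propagated:', p_propagated)
print('p-value with NaNs omitted:', p_omitted)
print('t statistic (positive means group_a > group_b):', t_omitted)
```

The sign of the t statistic gives the direction of the difference, which matters for goals 2 and 3: the p-value alone does not say which party supported the issue more.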

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Run my imports
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import ttest_ind


In [8]:
# Get my data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-30 02:49:52--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2020-01-30 02:49:52 (284 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [0]:
# Make it into a data frame
column_headers = [
    'party', 'handicapped-infants', 'water-project-cost-sharing',
    'adoption-of-the-budget-resolution', 'physician-fee-freeze',
    'el-salvador-aid', 'rel-groups-in-schools', 'anti-satellite-test-ban',
    'aid-to-nicaraguan-contras', 'mx-missile', 'immigration',
    'synfuels-corporation-cutback', 'education-spending',
    'superfund-right-to-sue', 'crime', 'duty-free-exports',
    'export-administration-act-south-africa',
]

In [24]:

df = pd.read_csv('house-votes-84.data', header=None, names=column_headers, na_values='?')
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,rel-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [25]:
# recode data as numeric so that I can complete the analysis
df = df.replace({'y':1, 'n':0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,rel-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
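
The load-and-recode steps above can be checked on a small in-memory sample. This is only a sketch: the three rows and the `bill-a`/`bill-b`/`bill-c` column names are invented for illustration, and `Series.map` is used here as one possible alternative to `df.replace`. It also counts the missing votes per bill, which is worth knowing before running any tests:

```python
import io
import pandas as pd

# Three made-up rows in the same layout as house-votes-84.data ('?' = missing)
raw = "republican,n,y,?\ndemocrat,?,y,y\ndemocrat,y,?,n\n"
names = ['party', 'bill-a', 'bill-b', 'bill-c']  # hypothetical column names

# na_values='?' turns the '?' placeholders into NaN while reading
toy = pd.read_csv(io.StringIO(raw), header=None, names=names, na_values='?')

# Recode only the vote columns: y -> 1.0, n -> 0.0, NaN stays NaN
vote_cols = [c for c in names if c != 'party']
for col in vote_cols:
    toy[col] = toy[col].map({'y': 1.0, 'n': 0.0}).astype(float)

missing_per_bill = toy[vote_cols].isnull().sum()  # number of missing votes
yes_share = toy[vote_cols].mean()                 # mean() skips NaN by default

print(missing_per_bill)
print(yes_share)
```

Because `mean()` skips NaN, the share of yes votes is computed only over members who actually voted on that bill.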


In [26]:
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [27]:
rep = df[df['party']=='republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,rel-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [29]:
dem = df[df['party']=='democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,rel-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [40]:
# Look at adoption of the budget resolution
print('The proportion of Reps who voted yes is', rep['adoption-of-the-budget-resolution'].mean())
print('The proportion of Dems who voted yes is', dem['adoption-of-the-budget-resolution'].mean())


The proportion of Reps who voted yes is 0.13414634146341464
The proportion of Dems who voted yes is 0.8884615384615384


In [69]:
ttest_ind(rep['adoption-of-the-budget-resolution'], dem['adoption-of-the-budget-resolution'], nan_policy='omit').pvalue<.01

True

In [0]:
# Dems supported the adoption-of-the-budget-resolution more than Reps, and the difference was statistically significant (p-value < .01)
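
To make the p-value less of a black box, here is a sketch of the calculation that `ttest_ind` performs with its default `equal_var=True` setting (the pooled-variance Student's t-test). The vote arrays are made-up toy data with missing values already removed, and the variable names are illustrative only:

```python
import numpy as np
from scipy import stats

# Toy yes/no votes (hypothetical), with missing values already removed
a = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], dtype=float)
b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0], dtype=float)

n1, n2 = len(a), len(b)
dof = n1 + n2 - 2  # degrees of freedom for the pooled test

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof

# t statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(a, b)  # equal_var=True by default
print(t_manual, float(result.statistic))
print(p_manual, float(result.pvalue))
```

The manual values should agree with SciPy's up to floating-point error.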


In [67]:
# Look at religious groups in schools
print('The proportion of Reps who voted yes is', rep['rel-groups-in-schools'].mean())
print('The proportion of Dems who voted yes is', dem['rel-groups-in-schools'].mean())
ttest_ind(rep['rel-groups-in-schools'], dem['rel-groups-in-schools'], nan_policy='omit').pvalue<.01


The proportion of Reps who voted yes is 0.8975903614457831
The proportion of Dems who voted yes is 0.47674418604651164


True

In [0]:
# Reps supported rel-groups-in-schools more than Dems, and the difference was statistically significant (p-value < .01)

In [44]:
# Now let's find a vote where there wasn't a statistically significant difference between the parties (p-value > .1).
ttest_ind(rep['immigration'], dem['immigration'], nan_policy='omit').pvalue>.1

False

In [45]:
ttest_ind(rep['synfuels-corporation-cutback'], dem['synfuels-corporation-cutback'], nan_policy='omit').pvalue>.1

False

In [46]:
ttest_ind(rep['education-spending'], dem['education-spending'], nan_policy='omit').pvalue>.1

False

In [47]:
ttest_ind(rep['superfund-right-to-sue'], dem['superfund-right-to-sue'], nan_policy='omit').pvalue>.1

False

In [48]:
ttest_ind(rep['crime'], dem['crime'], nan_policy='omit').pvalue>.1

False

In [49]:
ttest_ind(rep['duty-free-exports'], dem['duty-free-exports'], nan_policy='omit').pvalue>.1

False

In [50]:
ttest_ind(rep['export-administration-act-south-africa'], dem['export-administration-act-south-africa'], nan_policy='omit').pvalue>.1

False

In [51]:
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit').pvalue>.1

False

In [52]:
ttest_ind(rep['water-project-cost-sharing'], dem['water-project-cost-sharing'], nan_policy='omit').pvalue>.1

True

In [0]:
# I cannot reject the null hypothesis that the water project cost sharing bill was equally supported (or not supported) by Dems and Reps.

In [0]:
# Refactor my code
bills = [
    'handicapped-infants', 'water-project-cost-sharing',
    'adoption-of-the-budget-resolution', 'physician-fee-freeze',
    'el-salvador-aid', 'rel-groups-in-schools', 'anti-satellite-test-ban',
    'aid-to-nicaraguan-contras', 'mx-missile', 'immigration',
    'synfuels-corporation-cutback', 'education-spending',
    'superfund-right-to-sue', 'crime', 'duty-free-exports',
    'export-administration-act-south-africa',
]
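
One possible way to meet the "refactor into functions" stretch goal is sketched below. The helper name `compare_parties`, its parameters, and the toy DataFrame are all hypothetical and not part of the original notebook. The function returns the group means, the t statistic, and the p-value, so the direction of the difference is visible as well as its significance:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(df, issue, group_col='party',
                    first='democrat', second='republican'):
    """Two-sample t-test of `issue` between two groups, ignoring missing votes."""
    x = df.loc[df[group_col] == first, issue].to_numpy(dtype=float)
    y = df.loc[df[group_col] == second, issue].to_numpy(dtype=float)
    result = ttest_ind(x, y, nan_policy='omit')
    return {
        'issue': issue,
        'mean_first': float(np.nanmean(x)),   # share of yes votes in `first`
        'mean_second': float(np.nanmean(y)),  # share of yes votes in `second`
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }

# Toy data (hypothetical): bill-x splits the parties, bill-y does not
toy = pd.DataFrame({
    'party': ['democrat'] * 20 + ['republican'] * 20,
    'bill-x': [1.0] * 18 + [0.0] + [np.nan] + [1.0] * 2 + [0.0] * 17 + [np.nan],
    'bill-y': [1.0] * 10 + [0.0] * 10 + [1.0] * 10 + [0.0] * 10,
})

summary = pd.DataFrame([compare_parties(toy, issue) for issue in ['bill-x', 'bill-y']])
summary['significant_at_01'] = summary['pvalue'] < 0.01
print(summary)
```

With the real data, the same list comprehension could iterate over `bills` and take `df` as input, under the assumption that the votes have already been recoded to 1/0.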


In [65]:
for x in bills:
  print('The hypothesis that a similar proportion of Dems and Reps supported the', x, 'bill is rejected at p < .01:', ttest_ind(rep[x], dem[x], nan_policy='omit').pvalue<.01)

The hypothesis that a similar proportion of Dems and Reps supported the handicapped-infants bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the water-project-cost-sharing bill is rejected at p < .01: False
The hypothesis that a similar proportion of Dems and Reps supported the adoption-of-the-budget-resolution bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the physician-fee-freeze bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the el-salvador-aid bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the rel-groups-in-schools bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the anti-satellite-test-ban bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the aid-to-nicaraguan-contras bill is rejected at p < .01: True
The hypothesis that a similar proportion of Dems and Reps supported the mx-missile bill is rejected at p < .01: True
The hypothesis that a similar pro