<a href="https://colab.research.google.com/github/alxanderpierre/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/Pierre_Nelson_LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
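
As a rough illustration of what a 2-sample t-test computes, the sketch below builds the pooled-variance t statistic by hand and compares it with `scipy.stats.ttest_ind`. The two small arrays are made-up toy votes (not taken from the dataset), and the helper name `pooled_t` is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled variance (hypothetical helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err

# Toy yes(1)/no(0) votes, invented for illustration only
group_a = [1, 1, 0, 1, 0, 1, 1, 0]
group_b = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

manual = pooled_t(group_a, group_b)
scipy_result = ttest_ind(group_a, group_b)  # equal_var=True by default
print(manual, scipy_result.statistic)
```

The two printed numbers should agree, since `ttest_ind` uses the same pooled-variance formula unless `equal_var=False` is passed.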

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
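
One possible shape for the first stretch goal is sketched below: a single function that takes the cleaned frame and an issue name and returns the group means with the test result. The function name `compare_parties`, its `party_col` parameter, and the toy frame are assumptions made for illustration; the real notebook would pass its own `df`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(data, issue, party_col='party'):
    """Run a 2-sample t-test on one issue, republicans vs. democrats (hypothetical helper)."""
    rep_votes = data.loc[data[party_col] == 'republican', issue]
    dem_votes = data.loc[data[party_col] == 'democrat', issue]
    result = ttest_ind(rep_votes, dem_votes, nan_policy='omit')
    return {
        'issue': issue,
        'rep_mean': rep_votes.mean(),  # pandas skips NaN by default
        'dem_mean': dem_votes.mean(),
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }

# Tiny made-up frame standing in for the cleaned voting data
toy = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'some-issue': [1, 1, 1, 0, np.nan, 0, 0, 1, 0, 0],
})
summary = compare_parties(toy, 'some-issue')
print(summary)
```

Because republicans are passed first, a positive statistic means the republican mean is higher and a negative one means the democrat mean is higher.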

In [0]:
### YOUR CODE STARTS HERE

# **Load and clean the data**

In [0]:
import pandas as pd
import numpy as np 
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel
import scipy.stats

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-08-19 19:30:26--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2019-08-19 19:30:26 (286 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [0]:
df = pd.read_csv('house-votes-84.data', header=None, names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])

In [0]:
df.shape

(435, 17)

In [0]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
df = df.replace({'?': np.nan, 'n': 0, 'y': 1})  # one replace call maps '?' to NaN and n/y to 0/1
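
An alternative worth knowing is to mark `?` as missing while the file is parsed, using the `na_values` argument of `pd.read_csv`, and then map `y`/`n` to numbers. The sketch below is only an illustration on two rows copied from the `df.head()` output above; the variable names `sample`, `raw` and `clean` are made up here.

```python
from io import StringIO
import pandas as pd

# Two rows copied from the df.head() output above
sample = StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
)
columns = ['party', 'handicapped-infants', 'water-project', 'budget',
           'physician-fee-freeze', 'el-salvador-aid', 'religious-groups',
           'anti-satellite-ban', 'aid-to-contras', 'mx-missile', 'immigration',
           'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free',
           'south-africa']

# na_values='?' makes pandas treat '?' as missing while parsing
raw = pd.read_csv(sample, header=None, names=columns, na_values='?')
issue_cols = columns[1:]
# Map y/n to 1/0; missing entries stay NaN
numeric = raw[issue_cols].apply(lambda col: col.map({'y': 1.0, 'n': 0.0}))
clean = pd.concat([raw[['party']], numeric], axis=1)
print(clean.isnull().sum().sum())
```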

In [0]:
df.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64
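
Missing votes are spread unevenly across issues (south-africa alone has 104), so dropping every row with any missing value would discard many usable votes. The tests below instead pass `nan_policy='omit'`, which drops missing values per column at test time. A minimal sketch on made-up votes, showing that this should match calling `dropna()` on the column first:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# Made-up votes with two missing entries
votes = pd.Series([1, 0, np.nan, 1, 1, np.nan, 0, 1])

omit_result = ttest_1samp(votes, 0.5, nan_policy='omit')
dropna_result = ttest_1samp(votes.dropna(), 0.5)
print(float(omit_result.statistic), float(dropna_result.statistic))
```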

In [0]:
rep = df[df.party == 'republican']  # boolean mask (with dot notation for the column) keeps only the republican rows
print(rep.shape)
rep.head()

(168, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
dem = df[df.party == "democrat"]
print(dem.shape)
dem.head()

(267, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [0]:
df.party.value_counts()

democrat      267
republican    168
Name: party, dtype: int64

# **Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01**

In [0]:
from scipy.stats import ttest_1samp

In [0]:
## ***1 Sample Test ***

In [0]:
rep['handicapped-infants'].mean()  # share of republicans who voted yes on the handicapped-infants bill (NaN skipped)

In [0]:
ttest_1samp(rep['handicapped-infants'], 0, nan_policy='omit')
# One-sample t-test. Null hypothesis: the mean republican support for this bill is 0 (no republican voted yes), hence the 0.
# The p-value is far below .05 (and below the assignment's .01), so we reject the null hypothesis:
# there is some republican support for this bill.

Ttest_1sampResult(statistic=6.159569669016066, pvalue=5.434587970316366e-09)

In [0]:
ttest_1samp(rep['handicapped-infants'], .5, nan_policy='omit')
# Null hypothesis: the mean is .5, i.e. republicans split evenly on this bill.
# The p-value is far below .05, so we reject the even-split hypothesis.
# The negative t statistic says the mean is below .5,
# meaning most republicans voted on one side of this issue: against it.

Ttest_1sampResult(statistic=-10.232833482397659, pvalue=2.572179359890009e-19)

In [0]:
ttest_1samp(rep['handicapped-infants'], 1, nan_policy='omit')

Ttest_1sampResult(statistic=-26.625236633811387, pvalue=1.978873197183477e-61)

In [0]:
rep['handicapped-infants'].value_counts()

0.0    134
1.0     31
Name: handicapped-infants, dtype: int64
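
The counts just printed are enough to check the one-sample statistic by hand. The sketch below rebuilds the republican votes from those counts (31 yes, 134 no) and applies the textbook formula for the test against .5; the variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Rebuild the republican votes from the counts printed above: 31 yes, 134 no
rep_votes = np.array([1.0] * 31 + [0.0] * 134)

null_mean = 0.5
# t = (sample mean - null mean) / (sample std / sqrt(n))
std_err = rep_votes.std(ddof=1) / np.sqrt(len(rep_votes))
manual_t = (rep_votes.mean() - null_mean) / std_err

scipy_t = ttest_1samp(rep_votes, null_mean).statistic
print(manual_t, scipy_t)
```

Both values should land on the statistic of about -10.23 reported by the earlier cell.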

In [0]:
dem['handicapped-infants'].value_counts() #dems voted more for this issue than rep

1.0    156
0.0    102
Name: handicapped-infants, dtype: int64

In [0]:
# ***2 Sample Test ***

In [0]:
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit')
# Two-sample t-test: is the difference between the republican and democrat sample means statistically significant?
# Null hypothesis: there is no difference between the two means.
# Alternative hypothesis: there is a difference.
# The p-value is below .05 by a wide margin (and below the assignment's .01), so we reject the null hypothesis.
# The t statistic is negative with republicans passed first, so democrats supported this issue more.


Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
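
As a sanity check, the same two-sample result can be rebuilt from the vote counts alone, without the original file. This sketch assumes only the `value_counts()` numbers printed earlier (republicans 31 yes / 134 no, democrats 156 yes / 102 no):

```python
import numpy as np
from scipy.stats import ttest_ind

# Votes reconstructed from the value_counts() outputs above
rep_votes = np.array([1.0] * 31 + [0.0] * 134)
dem_votes = np.array([1.0] * 156 + [0.0] * 102)

result = ttest_ind(rep_votes, dem_votes)
print(rep_votes.mean(), dem_votes.mean())
print(result.statistic, result.pvalue)
```

The sign of the statistic follows the argument order: republicans come first and have the lower mean, so the statistic is negative.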

# **Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01**

In [0]:
rep["el-salvador-aid"].mean()

0.9515151515151515

In [0]:
ttest_1samp(rep['el-salvador-aid'], 1, nan_policy='omit')
# Interesting: the p-value (about .004) is below .05, but much closer to it than in the previous tests.
# We reject the null hypothesis that every republican voted for the bill, though the mean of .95 says nearly all of them did.



Ttest_1sampResult(statistic=-2.890793645020198, pvalue=0.004363402589282088)

In [0]:
ttest_1samp(rep['el-salvador-aid'], 0, nan_policy='omit')
# Null hypothesis: no republican voted for this bill. It is rejected as well.
# The major difference is that this p-value is effectively 0, far smaller than when the null hypothesis was 1.
# That fits the data: the republican mean of .95 is much farther from 0 than from 1.


Ttest_1sampResult(statistic=56.73182528352142, pvalue=1.0573299896233442e-109)

In [0]:
rep['el-salvador-aid'].value_counts()

1.0    157
0.0      8
Name: el-salvador-aid, dtype: int64

In [0]:
# Yep, the counts confirm it: republicans favored this bill far more than they opposed it.
# Let's dive a little deeper and see how the dems voted. Muahaha!

In [0]:
dem['el-salvador-aid'].mean()

0.21568627450980393

In [0]:
ttest_1samp(dem['el-salvador-aid'], 0, nan_policy='omit')
# Most republicans voted for the bill, so now check whether the democrats voted against it to block the republicans
# (even though these are supposed to be independent samples).
# Null hypothesis: no democrat voted for this bill. The p-value is below .05, so we reject it again.
# But did the democrats vote more for or against this issue?

Ttest_1sampResult(statistic=8.357631243360764, pvalue=4.2308289907515245e-15)

In [0]:
ttest_1samp(dem['el-salvador-aid'], .5, nan_policy='omit')

Ttest_1sampResult(statistic=-11.016877548066462, pvalue=2.5007537432253433e-23)

In [0]:
ttest_1samp(dem['el-salvador-aid'], 1, nan_policy='omit')
# We reject this one too: the democrats were nowhere near unanimous support for this issue.

Ttest_1sampResult(statistic=-30.39138633949369, pvalue=1.4023105444477201e-86)

In [0]:
ttest_ind(rep['el-salvador-aid'], dem['el-salvador-aid'], nan_policy='omit')
# The p-value is far below .05, so we reject the null hypothesis that the two party means are equal.
# Put another way, a difference this large is very unlikely to be due to random chance alone.

Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68)

In [0]:
dem['el-salvador-aid'].value_counts()# lots of nos from the dems

0.0    200
1.0     55
Name: el-salvador-aid, dtype: int64

In [0]:
rep['el-salvador-aid'].value_counts() # lots of yeses for this bill from the reps

1.0    157
0.0      8
Name: el-salvador-aid, dtype: int64
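
The two groups here have quite different spreads (republicans are nearly unanimous, democrats are not), and the default `ttest_ind` assumes equal variances. A hedged cross-check is to rerun the comparison with `equal_var=False`, which gives Welch's t-test. The sketch below rebuilds the votes from the counts printed above rather than from the file; Welch's test is an addition for comparison, not what the cells above ran.

```python
import numpy as np
from scipy.stats import ttest_ind

# Votes reconstructed from the value_counts() outputs above
rep_votes = np.array([1.0] * 157 + [0.0] * 8)
dem_votes = np.array([1.0] * 55 + [0.0] * 200)

student = ttest_ind(rep_votes, dem_votes)                 # pooled variance, as in the cell above
welch = ttest_ind(rep_votes, dem_votes, equal_var=False)  # Welch's t-test, no equal-variance assumption
print(rep_votes.var(ddof=1), dem_votes.var(ddof=1))
print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
```

If both versions reject the null hypothesis, the conclusion does not hinge on the equal-variance assumption.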

# **Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)**

In [0]:
df.head()  # look at the columns again to choose which bill to run the t-tests on

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
dem['budget'].value_counts()  # looking for an issue where there's not much of a difference

1.0    231
0.0     29
Name: budget, dtype: int64

In [0]:
rep['budget'].value_counts()

0.0    142
1.0     22
Name: budget, dtype: int64

In [0]:
dem['water-project'].value_counts()

1.0    120
0.0    119
Name: water-project, dtype: int64

In [0]:
rep['water-project'].value_counts()

1.0    75
0.0    73
Name: water-project, dtype: int64

In [0]:
# Hmm, the water bill looks interesting! Let's put on our super villain hats and see why. Muahaha!

In [0]:
ttest_1samp(rep['water-project'],0, nan_policy='omit')

Ttest_1sampResult(statistic=12.28932045559371, pvalue=2.525482675130834e-24)

In [0]:
ttest_1samp(rep['water-project'],.5, nan_policy='omit')
# Oh snap! Finally a p-value that is actually greater than .05.
# Here we fail to reject the null hypothesis.
# Why? The null hypothesis is a mean of .5, meaning half voted for and half voted against this bill,
# and the republican votes (75 yes, 73 no) are almost exactly that.


Ttest_1sampResult(statistic=0.16385760607458383, pvalue=0.8700683158522193)

In [0]:
# Even though we can guess how the rest of these will turn out, let's turn our attention to the dems!

In [0]:
ttest_1samp(dem['water-project'], .5, nan_policy='omit')
# Again we fail to reject the null hypothesis: the democrats also look evenly split.

Ttest_1sampResult(statistic=0.06454972243678961, pvalue=0.9485867005339235)

In [0]:
ttest_1samp(dem['water-project'], 1, nan_policy='omit')
# Null hypothesis: all the democrats voted in favor of this bill. We reject it.


Ttest_1sampResult(statistic=-15.36283393995609, pvalue=1.8031537722768159e-37)

In [0]:
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')
# This is where things get a little tricky.
# The p-value is much greater than .05 (and greater than the assignment's .1), so we fail to reject the null hypothesis.
# But what exactly are we failing to reject?
# A two-sample t-test compares the means of the two samples; the null hypothesis is that the means are equal.
# We would reject that hypothesis if the p-value were < .05.
# Here it is far greater, so we have no evidence of a difference between the party means.
# That fits the one-sample tests above: neither party's mean could be distinguished from .5.
# Failing to reject is not proof that the parties voted identically, only that this data shows no detectable gap.

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
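
To tie the three goals together, the small sketch below labels a `ttest_ind(rep, dem)` result using the thresholds from the assignment (p < 0.01 and p > 0.1). The helper `describe_result` and its labels are hypothetical, and the (statistic, p-value) pairs are rounded from the outputs shown in this notebook.

```python
def describe_result(statistic, pvalue, strong=0.01, weak=0.1):
    """Label a ttest_ind(rep, dem) result; thresholds follow the assignment goals (hypothetical helper)."""
    if pvalue < strong:
        # Positive statistic: the republican mean is higher, since rep was passed first
        return 'republicans support more' if statistic > 0 else 'democrats support more'
    if pvalue > weak:
        return 'no clear difference'
    return 'inconclusive'

# (statistic, pvalue) pairs rounded from the ttest_ind outputs in this notebook
results = {
    'handicapped-infants': (-9.205, 1.61e-18),
    'el-salvador-aid': (21.137, 5.60e-68),
    'water-project': (0.089, 0.929),
}
labels = {issue: describe_result(stat, p) for issue, (stat, p) in results.items()}
for issue, label in labels.items():
    print(issue, '->', label)
```

The 'inconclusive' label for p-values between the two thresholds is a choice made for this sketch; the assignment itself does not name that middle range.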