

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
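
As a minimal sketch of what a two-sample t-test looks like in code, the example below compares two small groups of yes (1) / no (0) votes. The arrays `group_a` and `group_b` are hypothetical values made up for illustration, not rows from the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes(1)/no(0) votes for two groups; not taken from the dataset.
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# ttest_ind compares the two sample means.
# By default the p-value it returns is two-sided.
result = ttest_ind(group_a, group_b)

# A positive statistic means group_a has the higher mean.
print(result.statistic, result.pvalue)
```

The sign of the statistic follows the argument order, so swapping the two groups flips the sign but leaves the two-sided p-value unchanged.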

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

In [0]:
# Importing libraries
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import seaborn as sns

In [4]:
# Downloading file from UCI website
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-30 21:52:17--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-30 21:52:18 (137 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [5]:
# Reading data into dataframe
column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

df = pd.read_csv('house-votes-84.data', header=None, names=column_headers, na_values="?")

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [6]:
# Replacing 'y' with 1 and 'n' with 0
df = df.replace(to_replace={'y':1, 'n':0})
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
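
Because the data has missing values, it can help to count them per issue before deciding how to handle them. The sketch below rebuilds the five rows shown in the `head()` output as raw text, assuming `?` marks a missing vote (as the `na_values="?"` argument implies). The names `raw`, `sample`, `issue_cols` and `missing_per_issue` are illustrative only, and `Series.map` is used here as an alternative to `DataFrame.replace`, not as the notebook's own method.

```python
import io
import pandas as pd

# The five rows from the head() output, written back as raw text ('?' = missing vote).
raw = """republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
"""

column_headers = ['party', 'handicapped-infants', 'water-project', 'budget',
                  'physician-fee-freeze', 'el-salvador-aid', 'religious-groups',
                  'anti-satellite-ban', 'aid-to-contras', 'mx-missile',
                  'immigration', 'synfuels', 'education', 'right-to-sue',
                  'crime', 'duty-free', 'south-africa']

sample = pd.read_csv(io.StringIO(raw), header=None,
                     names=column_headers, na_values="?")

# Map votes to numbers column by column; missing votes stay NaN.
issue_cols = column_headers[1:]
sample[issue_cols] = sample[issue_cols].apply(
    lambda col: col.map({"y": 1.0, "n": 0.0}))

# Count missing votes per issue and show only the issues that have any.
missing_per_issue = sample[issue_cols].isna().sum()
print(missing_per_issue[missing_per_issue > 0])
```

Counting first makes it easier to judge whether dropping rows with missing votes would discard too much of a given issue's data.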


In [7]:
df.sort_values(by='party')

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
370,democrat,1.0,1.0,1.0,0.0,,1.0,1.0,1.0,0.0,1.0,,,0.0,0.0,1.0,1.0
200,democrat,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
199,democrat,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,,0.0,1.0,0.0,0.0,0.0,1.0,
198,democrat,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,,0.0,1.0,
196,democrat,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
231,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
233,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
150,republican,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0


In [8]:
# Creating data groups for both parties
df_rep = df[df['party']=='republican']
print(df_rep.head(3))
df_dem = df[df['party']=='democrat']
print(df_dem.head(3))

        party  handicapped-infants  ...  duty-free  south-africa
0  republican                  0.0  ...        0.0           1.0
1  republican                  0.0  ...        0.0           NaN
7  republican                  0.0  ...        NaN           1.0

[3 rows x 17 columns]
      party  handicapped-infants  water-project  ...  crime  duty-free  south-africa
2  democrat                  NaN            1.0  ...    1.0        0.0           0.0
3  democrat                  0.0            1.0  ...    0.0        0.0           1.0
4  democrat                  1.0            1.0  ...    1.0        1.0           1.0

[3 rows x 17 columns]


In [10]:
# Creating list of issues voted on
issues = ['handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']
len(issues)

16

In [0]:
# Assigning variables to each column of both democrat and republican dataframes
hi_rep = df_rep['handicapped-infants'].astype(float)
hi_dem = df_dem['handicapped-infants'].astype(float)
wp_rep = df_rep['water-project'].astype(float)
wp_dem = df_dem['water-project'].astype(float)
bt_rep = df_rep['budget'].astype(float)
bt_dem = df_dem['budget'].astype(float)
pf_rep = df_rep['physician-fee-freeze'].astype(float)
pf_dem = df_dem['physician-fee-freeze'].astype(float)
es_rep = df_rep['el-salvador-aid'].astype(float)
es_dem = df_dem['el-salvador-aid'].astype(float)
rg_rep = df_rep['religious-groups'].astype(float)
rg_dem = df_dem['religious-groups'].astype(float)
ab_rep = df_rep['anti-satellite-ban'].astype(float)
ab_dem = df_dem['anti-satellite-ban'].astype(float)
ac_rep = df_rep['aid-to-contras'].astype(float)
ac_dem = df_dem['aid-to-contras'].astype(float)
mx_rep = df_rep['mx-missile'].astype(float)
mx_dem = df_dem['mx-missile'].astype(float)
im_rep = df_rep['immigration'].astype(float)
im_dem = df_dem['immigration'].astype(float)
sf_rep = df_rep['synfuels'].astype(float)
sf_dem = df_dem['synfuels'].astype(float)
ed_rep = df_rep['education'].astype(float)
ed_dem = df_dem['education'].astype(float)
rs_rep = df_rep['right-to-sue'].astype(float)
rs_dem = df_dem['right-to-sue'].astype(float)
cm_rep = df_rep['crime'].astype(float)
cm_dem = df_dem['crime'].astype(float)
dy_rep = df_rep['duty-free'].astype(float)
dy_dem = df_dem['duty-free'].astype(float)
sa_rep = df_rep['south-africa'].astype(float)
sa_dem = df_dem['south-africa'].astype(float)

In [0]:
# Collating data into lists categorized by party
issues_rep = [hi_rep, wp_rep, bt_rep, pf_rep, es_rep, rg_rep, ab_rep, ac_rep, 
              mx_rep, im_rep, sf_rep, ed_rep, rs_rep, cm_rep, dy_rep, sa_rep]

issues_dem = [hi_dem, wp_dem, bt_dem, pf_dem, es_dem, rg_dem, ab_dem, ac_dem, 
              mx_dem, im_dem, sf_dem, ed_dem, rs_dem, cm_dem, dy_dem, sa_dem]

In [13]:
# Performing t-test on all issues for both parties (ttest_ind returns a two-sided p-value by default)
hi_pvalue = ttest_ind(hi_rep, hi_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(hi_pvalue))
wp_pvalue = ttest_ind(wp_rep, wp_dem, nan_policy='omit').pvalue
print('{:.40f}'.format(wp_pvalue))
bt_pvalue = ttest_ind(bt_rep, bt_dem, nan_policy='omit').pvalue
print('{:.40f}'.format(bt_pvalue))
pf_pvalue = ttest_ind(pf_rep, pf_dem, nan_policy='omit').pvalue
print('{:.40f}'.format(pf_pvalue))
es_pvalue = ttest_ind(es_rep, es_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(es_pvalue))
rg_pvalue = ttest_ind(rg_rep, rg_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(rg_pvalue))
ab_pvalue = ttest_ind(ab_rep, ab_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(ab_pvalue))
ac_pvalue = ttest_ind(ac_rep, ac_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(ac_pvalue))
mx_pvalue = ttest_ind(mx_rep, mx_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(mx_pvalue))
im_pvalue = ttest_ind(im_rep, im_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(im_pvalue))
sf_pvalue = ttest_ind(sf_rep, sf_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(sf_pvalue))
ed_pvalue = ttest_ind(ed_rep, ed_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(ed_pvalue))
rs_pvalue = ttest_ind(rs_rep, rs_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(rs_pvalue))
cm_pvalue = ttest_ind(cm_rep, cm_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(cm_pvalue))
dy_pvalue = ttest_ind(dy_rep, dy_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(dy_pvalue))
sa_pvalue = ttest_ind(sa_rep, sa_dem, nan_policy='omit').pvalue 
print('{:.40f}'.format(sa_pvalue))

0.0000000000000000016134403279372430164766
0.9291556823993485370039024928701110184193
0.0000000000000000000000000000000000000000
0.0000000000000000000000000000000000000000
0.0000000000000000000000000000000000000000
0.0000000000000000000239367225205972874114
0.0000000000000000000000000000008521033017
0.0000000000000000000000000000000000000000
0.0000000000000000000000000000000000000000
0.0833024849042506565499621729031787253916
0.0000000000000015759322301054063559183251
0.0000000000000000000000000000000000000000
0.0000000000000000000000000000000001227858
0.0000000000000000000000000000000000000000
0.0000000000000000000000000000000599769717
0.0000000000365267436167222602172333246487
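
One possible way to address the first stretch goal is to wrap the per-issue test in a function and loop over the issue names. The sketch below is not the notebook's own code: `compare_parties`, the toy DataFrame and its columns `issue-a` and `issue-b` are hypothetical, and the votes are randomly generated rather than taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issues, group_col="party"):
    """Run a two-sample t-test (democrat vs. republican) for each issue column."""
    dem = df[df[group_col] == "democrat"]
    rep = df[df[group_col] == "republican"]
    rows = []
    for issue in issues:
        # nan_policy='omit' drops missing votes for this issue only.
        result = ttest_ind(dem[issue], rep[issue], nan_policy="omit")
        rows.append({
            "issue": issue,
            "dem_mean": dem[issue].mean(),
            "rep_mean": rep[issue].mean(),
            # Positive t_stat: democrats voted 'yes' more often than republicans.
            "t_stat": float(result.statistic),
            "p_two_sided": float(result.pvalue),
        })
    return pd.DataFrame(rows).set_index("issue")


# Hypothetical toy data: 50 democrats followed by 50 republicans.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "party": ["democrat"] * 50 + ["republican"] * 50,
    "issue-a": np.r_[rng.binomial(1, 0.9, 50),
                     rng.binomial(1, 0.1, 50)].astype(float),
    "issue-b": rng.binomial(1, 0.5, 100).astype(float),
})
toy.loc[[3, 60], "issue-a"] = np.nan  # simulate two missing votes

summary = compare_parties(toy, ["issue-a", "issue-b"])
print(summary)
```

Keeping the group means and the t-statistic next to each p-value shows the direction of every difference, which the p-value alone does not.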


In [0]:
# Creating DataFrame for all p-values (ttest_ind returns two-sided p-values by default, so the 'PValue-One-tailed' label is a misnomer)
df_pvalues = pd.DataFrame(data={'PValue-One-tailed' : [hi_pvalue, wp_pvalue, bt_pvalue, pf_pvalue, es_pvalue, rg_pvalue, ab_pvalue, ac_pvalue,
                                           mx_pvalue, im_pvalue, sf_pvalue, ed_pvalue, rs_pvalue, cm_pvalue, dy_pvalue, sa_pvalue]}, index=issues)

In [0]:
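# Halving the p-values. Halving a two-sided p-value gives the one-tailed p-value
# (when the observed difference is in the hypothesized direction), so this column
# actually holds one-tailed values despite its 'PValue-Two-Tailed' label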
df_pvalues['PValue-Two-Tailed'] = df_pvalues['PValue-One-tailed']/2
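
SciPy's `ttest_ind` reports a two-sided p-value unless told otherwise, and a one-tailed p-value is half of it only when the observed difference points in the hypothesized direction. The sketch below checks that relationship on hypothetical random votes. The arrays `a` and `b` are made up, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes(1)/no(0) votes, generated at random for illustration.
rng = np.random.default_rng(1)
a = rng.binomial(1, 0.7, 80).astype(float)
b = rng.binomial(1, 0.4, 80).astype(float)

# Default call: two-sided p-value.
two_sided = ttest_ind(a, b)

# One-tailed test of "mean(a) > mean(b)"; requires SciPy >= 1.6.
one_sided = ttest_ind(a, b, alternative="greater")

# Halving is only valid when the statistic has the hypothesized sign;
# otherwise the one-tailed p-value is 1 minus half the two-sided value.
if two_sided.statistic > 0:
    expected = two_sided.pvalue / 2
else:
    expected = 1 - two_sided.pvalue / 2

print(two_sided.pvalue, one_sided.pvalue, expected)
```

Passing `alternative` explicitly avoids having to remember which direction the halving applies to.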

In [0]:
# Setting display options in pandas for number format
pd.options.display.float_format = '{:.20f}'.format

In [17]:
df_pvalues.head()

Unnamed: 0,PValue-One-tailed,PValue-Two-Tailed
handicapped-infants,0.0,0.0
water-project,0.9291556823993484,0.4645778411996742
budget,0.0,0.0
physician-fee-freeze,0.0,0.0
el-salvador-aid,0.0,0.0


In [0]:
# Resetting index for Pvalues dataframe
df_pvalues = df_pvalues.reset_index()

In [19]:
df_pvalues

Unnamed: 0,index,PValue-One-tailed,PValue-Two-Tailed
0,handicapped-infants,0.0,0.0
1,water-project,0.9291556823993484,0.4645778411996742
2,budget,0.0,0.0
3,physician-fee-freeze,0.0,0.0
4,el-salvador-aid,0.0,0.0
5,religious-groups,0.0,0.0
6,anti-satellite-ban,0.0,0.0
7,aid-to-contras,0.0,0.0
8,mx-missile,0.0,0.0
9,immigration,0.0833024849042506,0.0416512424521253


In [0]:
# For P<0.01
df_pvalues['P<0.01 for 2 Tails'] = df_pvalues['PValue-Two-Tailed']<0.01
df_pvalues['P<0.01 for 1 Tail'] = df_pvalues['PValue-One-tailed']<0.01
# For P>0.1
df_pvalues['P>0.1 for 2 Tails'] = df_pvalues['PValue-Two-Tailed']>0.1
df_pvalues['P>0.1 for 1 Tail'] = df_pvalues['PValue-One-tailed']>0.1

In [21]:
df_pvalues

Unnamed: 0,index,PValue-One-tailed,PValue-Two-Tailed,P<0.01 for 2 Tails,P<0.01 for 1 Tail,P>0.1 for 2 Tails,P>0.1 for 1 Tail
0,handicapped-infants,0.0,0.0,True,True,False,False
1,water-project,0.9291556823993484,0.4645778411996742,False,False,True,True
2,budget,0.0,0.0,True,True,False,False
3,physician-fee-freeze,0.0,0.0,True,True,False,False
4,el-salvador-aid,0.0,0.0,True,True,False,False
5,religious-groups,0.0,0.0,True,True,False,False
6,anti-satellite-ban,0.0,0.0,True,True,False,False
7,aid-to-contras,0.0,0.0,True,True,False,False
8,mx-missile,0.0,0.0,True,True,False,False
9,immigration,0.0833024849042506,0.0416512424521253,False,False,False,False


# Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

In [0]:
# Null Hypothesis (H0): democrat support for an issue == republican support for an issue (no difference)
# Alternative Hypothesis (H1): democrat support for an issue > republican support for an issue (democrats more supportive)
# Significance level = 0.01 (99% confidence level)
# This entails a one-tailed test, as it checks whether one party supports an issue more than the other (a directional alternative)

In [48]:
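# P<0.01 for 14 out of 16 issues; the exceptions are Water Project and Immigration, which saw some bi-partisan support.
# Hence the null hypothesis (H0) is rejected for those 14 issues, while for Water Project and Immigration we fail to reject it.
# This table only shows that the parties differ; which party is more supportive must be read from the group means or the sign of the t-statistic.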

df_pvalues[['index', 'P<0.01 for 2 Tails']].sort_values(by='P<0.01 for 2 Tails')

Unnamed: 0,index,P<0.01 for 2 Tails
1,water-project,False
9,immigration,False
0,handicapped-infants,True
2,budget,True
3,physician-fee-freeze,True
4,el-salvador-aid,True
5,religious-groups,True
6,anti-satellite-ban,True
7,aid-to-contras,True
8,mx-missile,True


# Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

In [51]:
df_pvalues[['index', 'P<0.01 for 1 Tail']].sort_values(by='P<0.01 for 1 Tail', ascending=False)

Unnamed: 0,index,P<0.01 for 1 Tail
0,handicapped-infants,True
2,budget,True
3,physician-fee-freeze,True
4,el-salvador-aid,True
5,religious-groups,True
6,anti-satellite-ban,True
7,aid-to-contras,True
8,mx-missile,True
10,synfuels,True
11,education,True


In [0]:
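# It can be seen from the above table that every issue listed has p < 0.01.
# The table does not show the direction of the difference, so which of these issues
# republicans support more must be checked against the group means or the sign of the t-statistic.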

# Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

In [50]:
df_pvalues[['index', 'P>0.1 for 2 Tails']].sort_values(by='P>0.1 for 2 Tails', ascending=False)

Unnamed: 0,index,P>0.1 for 2 Tails
1,water-project,True
0,handicapped-infants,False
2,budget,False
3,physician-fee-freeze,False
4,el-salvador-aid,False
5,religious-groups,False
6,anti-satellite-ban,False
7,aid-to-contras,False
8,mx-missile,False
9,immigration,False
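
The three goals each need a direction as well as a p-value, so one way to pick issues is to filter on the sign of the t-statistic together with the threshold. The sketch below assumes a summary table with a `t_stat` column (computed as democrats minus republicans) and a two-sided `p_two_sided` column. The table, its column names and all its values are hypothetical, not results from the voting data.

```python
import pandas as pd

# Hypothetical summary table; the numbers are made up for illustration only.
summary = pd.DataFrame(
    {"t_stat": [9.2, -15.0, 0.1], "p_two_sided": [1e-18, 1e-40, 0.93]},
    index=["issue-a", "issue-b", "issue-c"],
)

# One-tailed p-value in the observed direction is half the two-sided value.
one_tailed = summary["p_two_sided"] / 2

# Goal 2: democrats support more (positive t_stat) with p < 0.01.
dem_more = summary[(summary["t_stat"] > 0) & (one_tailed < 0.01)]

# Goal 3: republicans support more (negative t_stat) with p < 0.01.
rep_more = summary[(summary["t_stat"] < 0) & (one_tailed < 0.01)]

# Goal 4: no clear difference, judged on the two-sided p-value.
no_clear_difference = summary[summary["p_two_sided"] > 0.1]

print(list(dem_more.index), list(rep_more.index), list(no_clear_difference.index))
```

Filtering on the sign first keeps an issue from being counted toward the wrong party just because its p-value is small.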
