
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (a yes or no vote on each issue). Be aware that some votes are missing!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
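
As a rough illustration of what a two-sample t-test computes, here is a minimal sketch of the pooled-variance t statistic (the equal-variance form, which `scipy.stats.ttest_ind` uses by default), written with only the standard library. The vote lists and the function name `pooled_t_statistic` are made up for illustration and are not taken from the dataset.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 in the denominator)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Pooled variance weights each group's variance by its degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err

# Hypothetical yes (1) / no (0) votes for two small groups
group_1 = [1, 0, 0, 0, 1, 0]
group_2 = [1, 1, 1, 0, 1, 1]

t_stat = pooled_t_statistic(group_1, group_2)
print("t =", round(t_stat, 3))
```

A negative statistic means the first group's mean is lower than the second's; the p-value then comes from comparing the statistic against a t distribution with `n_a + n_b - 2` degrees of freedom.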

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE

#1. Load and clean the data (or determine the best method to drop observations when running tests)


In [0]:
from scipy.stats import ttest_ind
import pandas as pd
import numpy as np

In [8]:
#Get the Data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-30 01:55:33--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-30 01:55:33 (605 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
#Create column headers and make data into a dataframe
column_headers = ['party', 'handicapped-infants','water-project','budget',
                  'physician-fee-freeze','el-salvador-aid','religious-groups',
                  'anti-satellite-ban','aid-to-contras','mx-missile',
                  'immigration','synfuels','education','right-to-sue','crime',
                  'duty-free','south-africa']

In [13]:
df = pd.read_csv('house-votes-84.data', header=None,
                 names=column_headers, na_values='?')

print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [18]:
#record votes as numeric

df = df.replace({'y':1, 'n':0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [30]:
# Drop the NaN values for the immigration column

# Split the data by party: how did republicans and democrats vote?
rep = df[df['party']=='republican']
dem = df[df['party']=='democrat']

# Republican immigration votes with the NaNs removed
rep_immig_no_nans = rep['immigration'].dropna()
print(rep_immig_no_nans)

# Democrat immigration votes with the NaNs removed
dem_immig_no_nans = dem['immigration'].dropna()
print(dem_immig_no_nans)

0      1.0
1      0.0
7      0.0
8      0.0
10     0.0
      ... 
420    1.0
427    1.0
430    1.0
432    0.0
434    1.0
Name: immigration, Length: 165, dtype: float64
2      0.0
3      0.0
4      0.0
5      0.0
6      0.0
      ... 
425    1.0
426    1.0
428    1.0
429    1.0
431    1.0
Name: immigration, Length: 263, dtype: float64
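
Before choosing how to drop observations, it can help to see how many votes are missing per party. The sketch below uses a tiny made-up dataframe (`toy`) in place of `df`, so the counts it prints say nothing about the real data. It only assumes the same layout: a `party` column plus 0/1 vote columns with NaN for missing votes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: same layout, invented values
toy = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'immigration': [1.0, np.nan, 0.0, 0.0, np.nan],
    'budget': [0.0, 0.0, 1.0, np.nan, 1.0],
})

# Count missing votes per party for every bill column
missing_by_party = toy.drop(columns='party').isna().groupby(toy['party']).sum()
print(missing_by_party)
```

If missing votes are spread unevenly across parties or bills, dropping them per column (as done above for immigration) keeps more data than dropping every row that contains any NaN.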


#2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

In [43]:
#take the mean of the house of representative votes on a specific bill
#find one that is obviously very different

print("Republican support = ", rep['handicapped-infants'].mean())
print("Democratic support = ", dem['handicapped-infants'].mean())


Republican support =  0.18787878787878787
Democratic support =  0.6046511627906976


In [37]:
#omit the nan's with the nan_policy while performing the ttest
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)

In [41]:
p = ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit').pvalue 
print("p = "'{:.20f}'.format(p))

p = 0.00000000000000000161


1) Null Hypothesis: There is no difference in average support for the handicapped infants bill between democrats and republicans in the House of Representatives (support is equal).

$\bar{x}_1 = \bar{x}_2$, where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 < \bar{x}_2$: support for the bill is greater amongst democrats.

3) 99% Confidence Level (significance level $\alpha = 0.01$)

#Conclusion: 
With p ≈ 1.6e-18, far below 0.01, and a negative t statistic (-9.21, republicans minus democrats), we reject the null hypothesis and conclude that support for this bill is greater amongst democrats than amongst republicans.
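
`ttest_ind` reports a two-sided p-value by default, while the alternative hypothesis above is one-sided. A minimal sketch of the usual conversion, assuming a symmetric t distribution: when the sign of the t statistic agrees with the alternative, the one-sided p-value is half the two-sided one. The helper name `one_sided_p` is hypothetical; the inputs are the statistic and p-value that `ttest_ind` printed for this bill.

```python
def one_sided_p(statistic, two_sided_p, alternative):
    """Convert a two-sided t-test p-value to a one-sided one.

    alternative='less' tests mean_1 < mean_2; 'greater' tests mean_1 > mean_2.
    """
    if alternative not in ('less', 'greater'):
        raise ValueError("alternative must be 'less' or 'greater'")
    agrees = statistic < 0 if alternative == 'less' else statistic > 0
    half = two_sided_p / 2
    # If the observed difference points the wrong way, the one-sided p is large
    return half if agrees else 1 - half

# Values reported by ttest_ind for handicapped-infants (republicans first)
t_stat = -9.205264294809222
p_two_sided = 1.613440327937243e-18

p_less = one_sided_p(t_stat, p_two_sided, 'less')
print('one-sided p = {:.3e}'.format(p_less))
```

Either way the result is far below 0.01, so the conclusion does not change. Recent SciPy releases also accept an `alternative` argument in `ttest_ind`, which may be simpler if your installed version supports it.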

#3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01


In [47]:
#take the mean of the house of representative votes on a specific bill
#find one that is obviously very different

print("Republican support = ", rep['physician-fee-freeze'].mean())
print("Democratic support = ", dem['physician-fee-freeze'].mean())


Republican support =  0.9878787878787879
Democratic support =  0.05405405405405406


In [50]:
#omit the nan's with the nan_policy while performing the ttest

p = ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit').pvalue
print("p = "'{:.100f}'.format(p))

p = 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
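
Fixed-point formatting hides very small p-values: anything below about 1e-100 prints as a row of zeros at 100 decimal places, as in the output above. Scientific notation avoids this. The sketch below uses the handicapped-infants p-value from earlier and a made-up tiny value (`tiny_p`, which is not the real physician-fee-freeze result).

```python
# p-value reported earlier for handicapped-infants
p_infants = 1.613440327937243e-18
# Hypothetical very small p-value, for illustration only
tiny_p = 3.2e-120

# Fixed-point with 100 decimals rounds the tiny value to all zeros
fixed = '{:.100f}'.format(tiny_p)
print(set(fixed[2:]))  # only the character '0' remains after "0."

# Scientific notation keeps the magnitude visible
print('p = {:.3e}'.format(p_infants))
print('p = {:.3e}'.format(tiny_p))
```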


1) Null Hypothesis: There is no difference in average support for the physician fee freeze bill between democrats and republicans in the House of Representatives (support is equal).

$\bar{x}_1 = \bar{x}_2$, where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 > \bar{x}_2$: support for the bill is greater amongst republicans.

3) 99% Confidence Level (significance level $\alpha = 0.01$)

#Conclusion: 
The p-value is zero to 100 decimal places, far below 0.01, so we reject the null hypothesis and conclude that support for this bill is greater amongst republicans than amongst democrats.

#4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)


In [65]:
#take the mean of the house of representative votes on a specific bill
#find one where support is nearly the same in both parties

print("Republican support = ", rep['water-project'].mean())
print("Democratic support = ", dem['water-project'].mean())

Republican support =  0.5067567567567568
Democratic support =  0.502092050209205


In [66]:
#omit the nan's with the nan_policy while performing the ttest

p = ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit').pvalue
print("p = ", '{:.10f}'.format(p))

p =  0.9291556824


1) Null Hypothesis: There is no difference in average support for the water project bill between democrats and republicans.

$\bar{x}_1 = \bar{x}_2$, where $\bar{x}_1$ is the mean of republican votes and $\bar{x}_2$ is the mean of democrat votes.

2) Alternative Hypothesis:

$\bar{x}_1 \neq \bar{x}_2$: support for the bill differs between republicans and democrats.


#Conclusion: 
With p ≈ 0.93, well above 0.1, we fail to reject the null hypothesis: the data show no statistically significant difference in support for the water project bill between democrats and republicans.
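
A large p-value means the data do not show a difference, not that none exists. One way to see how much uncertainty remains is a rough confidence interval for the difference in support. The sketch below uses a normal approximation rather than the t-test itself. The vote counts (75 of 148 republicans and 120 of 239 democrats voting yes) are inferred from the means printed above, not read from the data directly, so treat them as an assumption.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(yes_1, n_1, yes_2, n_2, level=0.95):
    """Normal-approximation CI for the difference of two proportions."""
    p_1, p_2 = yes_1 / n_1, yes_2 / n_2
    std_err = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_1 - p_2
    return diff - z * std_err, diff + z * std_err

# Assumed counts: republicans first, then democrats (water-project)
low, high = diff_confidence_interval(75, 148, 120, 239)
print('95% CI for the difference: ({:.3f}, {:.3f})'.format(low, high))
```

An interval that spans zero but extends roughly ten percentage points in each direction is consistent with no difference and also with a modest one.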

#Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables

2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
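# Refactored code
# This function is specific to this assignment (it reads the global
# `rep` and `dem` dataframes), but it should work for any bill column.
def my_ttest_func(bill_name):
    """Print mean support by party and the two-sample t-test p-value."""
    print("Republican support = ", rep[bill_name].mean())
    print("Democratic support = ", dem[bill_name].mean())
    # nan_policy='omit' ignores missing votes when running the test
    p = ttest_ind(rep[bill_name], dem[bill_name], nan_policy='omit').pvalue
    print("p = ", '{:.10f}'.format(p))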

In [68]:
my_ttest_func('immigration')

Republican support =  0.5575757575757576
Democratic support =  0.4714828897338403
p =  0.0833024849
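
To rerun the test with arbitrary variables, one possible next step is a version that takes the dataframe as an argument and returns a summary table instead of printing. This is a sketch, not the assignment's required solution: `ttest_all_bills` and the toy dataframe are hypothetical, and the toy votes are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def ttest_all_bills(data, party_col='party'):
    """Return mean support by party and the t-test p-value for each bill."""
    rep = data[data[party_col] == 'republican']
    dem = data[data[party_col] == 'democrat']
    rows = []
    for bill in data.columns.drop(party_col):
        # Drop missing votes explicitly before testing
        rep_votes = rep[bill].dropna()
        dem_votes = dem[bill].dropna()
        result = ttest_ind(rep_votes, dem_votes)
        rows.append({'bill': bill,
                     'rep_support': rep_votes.mean(),
                     'dem_support': dem_votes.mean(),
                     'p_value': result.pvalue})
    # Smallest p-values (strongest evidence of a difference) first
    return pd.DataFrame(rows).set_index('bill').sort_values('p_value')

# Invented votes, only to show the shape of the output
toy = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'bill-a': [1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0],
    'bill-b': [1, 0, 1, 0, 1, 0, 1, np.nan, 1, 0],
})
summary = ttest_all_bills(toy)
print(summary)
```

Passing the dataframe in, rather than relying on the global `rep` and `dem`, makes the function reusable on other datasets with the same layout.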
