
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this assignment calls for *2-sample* t-tests, because you're comparing the average vote of two groups (republicans and democrats) against each other, rather than comparing a single group's average against a fixed value stated in the null hypothesis.

In [0]:
### imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind

In [3]:
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-05-21 02:38:11--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-05-21 02:38:11 (287 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [6]:
# read in data and assign column headers
# treat '?' entries as missing values (NaN)
column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']
house = pd.read_csv('house-votes-84.data', header=None, names=column_headers, na_values='?')
house.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [7]:
# change 'y' and 'n' to binary
house = house.replace({'y':1, 'n':0})
house.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
# separate republicans from democrats
rep = house[house['party']=='republican']
dem = house[house['party']=='democrat']

In [17]:
# check work
print(house['party'].value_counts())
print(rep.shape)
dem.shape

democrat      267
republican    168
Name: party, dtype: int64
(168, 17)


(267, 17)

In [20]:
# look at an issue to compare
# share of ALL members in each party who voted yes
# (len() counts members with missing votes, so this understates the yes rate)
print(dem['physician-fee-freeze'].sum()/len(dem))
rep['physician-fee-freeze'].sum()/len(rep)

0.052434456928838954


0.9702380952380952

In [22]:
# avg rate of voting yes among members with a recorded vote (mean() skips NaN)
print(dem['physician-fee-freeze'].mean())
rep['physician-fee-freeze'].mean()

0.05405405405405406


0.9878787878787879

In [30]:
# big difference in sample, conduct the ttest
results = ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit')
results

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

In [33]:
round(results.pvalue, 176)

0.0

In [0]:
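# ^Republicans (not democrats) support 'physician-fee-freeze' more, with p value < 0.01^
# the t statistic is positive and rep was passed first, so the rep mean is the higher one
# (this meets goal 3; goal 2 still needs an issue that democrats support more)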

In [37]:
# look for an issue that reps support over dems
# check percentage of each who voted yes
print(rep['crime'].sum()/len(rep))
dem['crime'].sum()/len(dem)

0.9404761904761905


0.33707865168539325

In [38]:
# avg rate of voting yes
print(rep['crime'].mean())
dem['crime'].mean()

0.9813664596273292


0.35019455252918286

In [40]:
# looks like a big difference, conduct ttest
results2 = ttest_ind(rep['crime'], dem['crime'], nan_policy='omit')
results2

Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47)

In [42]:
round(results2.pvalue, 45)

0.0

In [0]:
# ^Republicans support 'crime' more than democrats with p value < 0.01^

In [45]:
# look for an issue where there is no sig diff between parties
print(rep['immigration'].sum()/len(rep))
dem['immigration'].sum()/len(dem)

0.5476190476190477


0.46441947565543074

In [46]:
# avg rate of yes votes
print(rep['immigration'].mean())
dem['immigration'].mean()

0.5575757575757576


0.4714828897338403

In [47]:
# look fairly close, conduct ttest
ttest_ind(rep['immigration'], dem['immigration'], nan_policy='omit')

Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066)

In [0]:
# p value is not > 0.1 but getting close
# try another

In [50]:
print(rep['water-project'].sum()/len(rep))
dem['water-project'].sum()/len(dem)

0.44642857142857145


0.449438202247191

In [52]:
# wow that's close! conduct ttest
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

In [0]:
# there it is, p value > 0.1! 

## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy, to get "under the hood" and learn the topic more thoroughly.
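
For the first stretch goal, one possible refactor is sketched below. The helper name `compare_parties`, its parameters, and the small demo frame are hypothetical and chosen only for illustration; the `house` DataFrame built above (with votes coded 1/0 and NaN for missing) could be passed in instead.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, group_col='party', a='republican', b='democrat'):
    """Return each group's yes rate and the 2-sample t-test for one issue.

    Hypothetical helper: assumes votes are coded 1/0 with NaN for missing.
    """
    # dropna() has the same effect here as nan_policy='omit'
    votes_a = df.loc[df[group_col] == a, issue].dropna()
    votes_b = df.loc[df[group_col] == b, issue].dropna()
    result = ttest_ind(votes_a, votes_b)
    return {
        'issue': issue,
        f'{a}_mean': votes_a.mean(),
        f'{b}_mean': votes_b.mean(),
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }


# Made-up demo data, NOT taken from the voting records
demo = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'some-issue': [1, 1, 1, 0, np.nan, 0, 0, 1, 0, 0],
})
summary = compare_parties(demo, 'some-issue')
print(summary)
```

A positive `statistic` means group `a` has the higher yes rate, which makes it easy to loop over every issue column and sort by direction and p-value.
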
### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
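
The steps above can be sketched with NumPy alone, assuming the usual 1-sample formula t = (sample mean - hypothesized mean) / (s / sqrt(n)), where s is the sample standard deviation. The function name `one_sample_t` and the vote values are made up for illustration.

```python
import numpy as np


def one_sample_t(sample, popmean):
    """t statistic and degrees of freedom for a 1-sample t-test, NaN omitted."""
    x = np.asarray(sample, dtype=float)
    x = x[~np.isnan(x)]                      # omit missing votes
    n = x.size
    std_err = x.std(ddof=1) / np.sqrt(n)     # ddof=1 gives the sample std (n - 1)
    t_stat = (x.mean() - popmean) / std_err
    return t_stat, n - 1


# Made-up votes (1 = yes, 0 = no, NaN = missing), tested against H0: mean = 0.5
votes = [1, 1, 0, 1, np.nan, 1, 0, 1, 1]
t_stat, dof = one_sample_t(votes, 0.5)
print(t_stat, dof)
```

The returned t statistic and degrees of freedom are the two numbers a t table or the applet asks for.
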

 ### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
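
A comparable sketch for two samples follows. It assumes the pooled-variance (Student) form of the statistic, which is what `ttest_ind` computes by default (`equal_var=True`); the linked formula may instead show the unequal-variance (Welch) form. The function name `two_sample_t` and the data are made up for illustration.

```python
import numpy as np


def two_sample_t(a, b):
    """Pooled-variance t statistic and degrees of freedom, NaN omitted."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]  # omit missing votes
    n_a, n_b = a.size, b.size
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err, dof


# Made-up votes for two groups (1 = yes, 0 = no, NaN = missing)
group_a = [1, 1, 1, 0, np.nan]
group_b = [0, 0, 1, 0, 0]
t_stat, dof = two_sample_t(group_a, group_b)
print(t_stat, dof)
```
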

 ### Then check your Answers using Scipy!
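
One way to run that check is sketched below on made-up data: the statistic is computed by hand with the pooled-variance formula, the two-sided p-value is taken from SciPy's t distribution (`scipy.stats.t.sf`) in place of a table, and both are compared with `ttest_ind`.

```python
import numpy as np
from scipy import stats

# Made-up votes with missing values already removed
a = np.array([1, 1, 1, 0], dtype=float)
b = np.array([0, 0, 1, 0, 0], dtype=float)

# Hand-computed pooled-variance t statistic
dof = a.size + b.size - 2
pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / dof
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / a.size + 1 / b.size))

# Two-sided p-value: twice the upper-tail area beyond |t|
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy's answer (equal_var=True is the default)
result = stats.ttest_ind(a, b)
stat_matches = bool(np.isclose(t_manual, result.statistic))
p_matches = bool(np.isclose(p_manual, result.pvalue))
print(stat_matches, p_matches)
```
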