

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware that there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

In [3]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-05-21 02:45:45--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-05-21 02:45:45 (279 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel


In [17]:
df = pd.read_csv('house-votes-84.data',
                 header=None,
                 names=['party','handicapped-infants','water-project',
                        'adoption-budget-resolution','physician-fee-freeze',
                        'el-salvador-aid','religion','anti-satellite-test-ban',
                        'aid-contras','mx-missile','immigration',
                        'synfuels-cutback','education-spending',
                        'right-to-sue','crime','duty-free-exports:',
                        'south-africa'])
print(df.shape)
df.head()
                       


(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,adoption-budget-resolution,physician-fee-freeze,el-salvador-aid,religion,anti-satellite-test-ban,aid-contras,mx-missile,immigration,synfuels-cutback,education-spending,right-to-sue,crime,duty-free-exports:,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [47]:
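# Encode votes numerically: '?' (missing) -> NaN, 'n' -> 0, 'y' -> 1
# (np.nan rather than np.NaN, since the np.NaN alias was removed in NumPy 2.0)
df = df.replace({'?': np.nan, 'n': 0, 'y': 1})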
df.head()

Unnamed: 0,party,handicapped-infants,water-project,adoption-budget-resolution,physician-fee-freeze,el-salvador-aid,religion,anti-satellite-test-ban,aid-contras,mx-missile,immigration,synfuels-cutback,education-spending,right-to-sue,crime,duty-free-exports:,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [48]:
df.isnull().sum()

party                           0
handicapped-infants            12
water-project                  48
adoption-budget-resolution     11
physician-fee-freeze           11
el-salvador-aid                15
religion                       11
anti-satellite-test-ban        14
aid-contras                    15
mx-missile                     22
immigration                     7
synfuels-cutback               21
education-spending             31
right-to-sue                   25
crime                          17
duty-free-exports:             28
south-africa                  104
dtype: int64

In [49]:
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [50]:
rep = df[df['party']=='republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,adoption-budget-resolution,physician-fee-freeze,el-salvador-aid,religion,anti-satellite-test-ban,aid-contras,mx-missile,immigration,synfuels-cutback,education-spending,right-to-sue,crime,duty-free-exports:,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [51]:
dem = df[df['party']=='democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,adoption-budget-resolution,physician-fee-freeze,el-salvador-aid,religion,anti-satellite-test-ban,aid-contras,mx-missile,immigration,synfuels-cutback,education-spending,right-to-sue,crime,duty-free-exports:,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [78]:
print(ttest_1samp(dem['aid-contras'], .9, nan_policy='omit'))
print(ttest_1samp(rep['aid-contras'], .9, nan_policy='omit'))
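# Each line is a one-sample test of one party's mean against 0.9, not a two-sample comparison.
# The far more negative republican statistic suggests democrats support aid-contras more than
# republicans do; a direct test is ttest_ind(dem['aid-contras'], rep['aid-contras'], nan_policy='omit').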

Ttest_1sampResult(statistic=-3.0560309518048747, pvalue=0.002474603228803694)
Ttest_1sampResult(statistic=-25.931573491776362, pvalue=1.9428524679262596e-58)


In [61]:
print(ttest_1samp(dem['el-salvador-aid'], .9, nan_policy='omit'))
print(ttest_1samp(rep['el-salvador-aid'], .9, nan_policy='omit'))
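# Each line is a one-sample test of one party's mean against 0.9, not a two-sample comparison.
# The positive republican and strongly negative democrat statistics suggest republicans support
# el-salvador-aid more; a direct test is ttest_ind(rep['el-salvador-aid'], dem['el-salvador-aid'], nan_policy='omit').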

Ttest_1sampResult(statistic=-26.516484581208243, pvalue=3.947986026637245e-75)
Ttest_1sampResult(statistic=3.0714682478339617, pvalue=0.0024943083818825657)


In [64]:
print(ttest_1samp(dem['education-spending'], .9, nan_policy='omit'))
print(ttest_1samp(rep['education-spending'], .9, nan_policy='omit'))

Ttest_1sampResult(statistic=-33.82778672821468, pvalue=6.837447861906306e-95)
Ttest_1sampResult(statistic=-1.0747092630102346, pvalue=0.28418586900991843)


In [67]:
print(ttest_1samp(dem['water-project'], .9, nan_policy='omit'))
print(ttest_1samp(rep['water-project'], .9, nan_policy='omit'))

Ttest_1sampResult(statistic=-12.277357207477515, pvalue=3.632269189898332e-27)
Ttest_1sampResult(statistic=-9.536512673540718, pvalue=4.452308815850444e-17)
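The cells above run one-sample tests against a fixed value of 0.9, while the assignment asks for a comparison between the two parties. A minimal sketch of the two-sample version, wrapped in a function so it can be rerun for any issue, might look like the following. The function name `compare_parties` and the toy DataFrame are hypothetical and for illustration only; with the notebook's `df`, the call would be something like `compare_parties(df, 'aid-contras')`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(votes, issue, party_col='party'):
    """Two-sample t-test of democrat vs. republican support for one issue."""
    # Drop missing votes separately for each party and issue
    dem_votes = votes.loc[votes[party_col] == 'democrat', issue].dropna()
    rep_votes = votes.loc[votes[party_col] == 'republican', issue].dropna()
    result = ttest_ind(dem_votes, rep_votes)
    return {
        'issue': issue,
        'dem_mean': dem_votes.mean(),
        'rep_mean': rep_votes.mean(),
        'statistic': result.statistic,  # positive: democrats vote yes more often
        'pvalue': result.pvalue,
    }


# Made-up votes (1 = yes, 0 = no, NaN = missing), NOT the real dataset
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 5,
    'issue-a': [1, 1, 0, 1, np.nan, 1, 0, 0, 1, 0, 0],
})
summary = compare_parties(toy, 'issue-a')
print(summary)
```

A positive statistic with a small p-value would indicate that democrats support the issue more, a negative statistic with a small p-value that republicans do, and a large p-value that the data show no clear difference.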


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy, to get "under the hood" and learn the topic more thoroughly.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
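The steps above can be sketched roughly as follows. The statistic is computed by hand as t = (sample mean - hypothesized mean) / (s / sqrt(n)) with n - 1 degrees of freedom. As an assumption for convenience, `scipy.stats.t.sf` stands in for the printed table or applet when converting t into a two-sided p-value. The function name `one_sample_t` and the toy votes are hypothetical.

```python
import numpy as np
from scipy import stats  # used only for the t-distribution lookup, in place of a table


def one_sample_t(sample, popmean):
    """Return (t statistic, two-sided p-value, degrees of freedom), ignoring NaNs."""
    x = np.asarray(sample, dtype=float)
    x = x[~np.isnan(x)]                # omit missing values
    n = x.size
    sd = x.std(ddof=1)                 # sample standard deviation (n - 1 denominator)
    t_stat = (x.mean() - popmean) / (sd / np.sqrt(n))
    dof = n - 1
    p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided tail area
    return t_stat, p_value, dof


# Made-up votes (1 = yes, 0 = no, NaN = missing), for illustration only
toy_votes = [1, 0, 1, 1, np.nan, 0, 1, 1, 0, 1]
t_stat, p_value, dof = one_sample_t(toy_votes, 0.5)
print(t_stat, p_value, dof)
```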

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
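One possible sketch of the two-sample calculation is shown here. It assumes equal variances in the two groups (the pooled, Student's form), which matches the default behaviour of `scipy.stats.ttest_ind`; the formula in the linked image may differ, for example if it is Welch's version. As before, `scipy.stats.t.sf` is used only as a stand-in for the table lookup, and the function name `two_sample_t` and the toy data are hypothetical.

```python
import numpy as np
from scipy import stats  # used only for the t-distribution lookup, in place of a table


def two_sample_t(a, b):
    """Pooled-variance two-sample t-test; returns (t, two-sided p, dof), ignoring NaNs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]                # omit missing values in each group
    b = b[~np.isnan(b)]
    n1, n2 = a.size, b.size
    v1, v2 = a.var(ddof=1), b.var(ddof=1)   # sample variances
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof
    std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (a.mean() - b.mean()) / std_err
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value, dof


# Made-up votes for two groups (1 = yes, 0 = no, NaN = missing), for illustration only
group_a = [1, 1, 1, 0, 1, np.nan, 1, 1]
group_b = [0, 0, 1, 0, np.nan, 0, 0, 1]
t_stat2, p_value2, dof2 = two_sample_t(group_a, group_b)
print(t_stat2, p_value2, dof2)
```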

### Then check your answers using SciPy!