

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [13]:
politics = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                      na_values = '?')
politics.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
1,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y


In [0]:
bills = {
    'republican': 'party',
    'n': 'handicapped-infants',
    'y': 'water-project-cost-sharing',
    'n.1': 'adoption-of-the-budget-resolution',
    'y.1': 'physician-fee-freeze',
    'y.2': 'el-salvador-aid',
    'y.3': 'religious-groups-in-schools',
    'n.2': 'anti-satellite-test-ban',
    'n.3': 'aid-to-nicaraguan-contras', 
    'n.4': 'mx-missile', 
    'y.4': 'immigration', 
    '?': 'synfuels-corporation-cutback', 
    'y.5': 'education-spending', 
    'y.6': 'superfund-right-to-sue', 
    'y.7': 'crime', 
    'n.5': 'duty-free-exports', 
    'y.8': 'export-administration-act-south-africa'
}

In [25]:
politics.shape

(434, 17)
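The shape above is one row short of the 435 records the dataset description mentions, and the column labels in the first `head()` output (`republican`, `n`, `y`, ...) look like a data record. Both are consistent with `read_csv` treating the first line of the file as a header. Below is a minimal sketch of one way to keep that row, assuming it is acceptable to pass `header=None` and supply the column names directly. The three inline records copy the first rows shown above so the example runs offline; `column_names`, `sample`, and `votes` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# Column names in file order: party first, then the 16 issues.
column_names = [
    'party', 'handicapped-infants', 'water-project-cost-sharing',
    'adoption-of-the-budget-resolution', 'physician-fee-freeze',
    'el-salvador-aid', 'religious-groups-in-schools',
    'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
    'immigration', 'synfuels-corporation-cutback', 'education-spending',
    'superfund-right-to-sue', 'crime', 'duty-free-exports',
    'export-administration-act-south-africa',
]

# Stand-in for the remote file: the first three records, in the raw format.
sample = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# header=None stops pandas from using the first record as column labels.
votes = pd.read_csv(sample, header=None, names=column_names, na_values='?')

# Encode votes numerically; missing votes stay NaN.
issue_columns = column_names[1:]
votes[issue_columns] = votes[issue_columns].apply(
    lambda col: col.map({'y': 1.0, 'n': 0.0})
)

print(votes.shape)
```

With the real file, the same `header=None, names=column_names, na_values='?'` arguments would apply to the URL used above. The counts reported in the rest of this notebook were produced without that first record.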

In [39]:
politics = politics.rename(columns = bills)
politics = politics.replace({'y': 1.0, 'n': 0.0})
politics.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
1,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
2,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
4,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0


In [40]:
politics.isna().sum()

party                                       0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
immigration                                 7
synfuels-corporation-cutback               20
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64

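The NaNs are spread throughout every issue column. Apart from export-administration-act-south-africa (104 missing out of 434 rows, roughly 24%), no column is missing more than 48 values (about 11%), so the missing votes probably aren't plentiful enough to be of significance. We can more or less safely drop them when running the tests.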
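Raw counts do not show whether the missing votes are concentrated in one party, which matters when the tests compare parties. Here is a small sketch of how the share of missing votes per party could be checked, using a toy frame in place of `politics`; `toy` and `missing_share` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the politics frame: 1.0 = yes, 0.0 = no, NaN = no recorded vote.
toy = pd.DataFrame({
    'party': ['democrat', 'democrat', 'republican', 'republican'],
    'issue-a': [1.0, np.nan, 0.0, 0.0],
    'issue-b': [np.nan, np.nan, 1.0, np.nan],
})

# isna() gives booleans; the mean of booleans within each party is the
# fraction of that party's votes that are missing.
missing_share = toy.drop(columns='party').isna().groupby(toy['party']).mean()

print(missing_share)
```

Running the same two steps on `politics` would show, for each issue, whether one party abstained noticeably more often than the other.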

In [0]:
republicans = politics[politics['party'] == 'republican']
democrats = politics[politics['party'] == 'democrat']

In [48]:
print(politics['handicapped-infants'].value_counts(),'\n',
      democrats['handicapped-infants'].value_counts(),'\n',
      republicans['handicapped-infants'].value_counts() )

0.0    235
1.0    187
Name: handicapped-infants, dtype: int64 
 1.0    156
0.0    102
Name: handicapped-infants, dtype: int64 
 0.0    133
1.0     31
Name: handicapped-infants, dtype: int64


In [50]:
democrats['handicapped-infants'].mean()
#.mean() skips NaN values by default (skipna=True), so no special handling is needed
#to account for them

0.6046511627906976

In [51]:
republicans['handicapped-infants'].mean()

0.18902439024390244

In [0]:
#these means look really different, so the p-value for the difference in support for this bill
#will likely be below the required threshold

In [54]:
#We want to know whether one party favors a bill more than the other, so the null hypothesis is
#that the two group means are equal (x1 = x2). Comparing two groups calls for a 2-sample t-test.

ttest_ind(democrats['handicapped-infants'], republicans['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=9.15392122841775, pvalue=2.4195550274149624e-18)

Our p-value is $\approx 2.42 \times 10^{-18}$. This is well below $0.01$, so we reject the null hypothesis that democrats and republicans support this bill equally. Because democrats were passed first and the t-statistic is positive, we infer that democrats support the bill more than republicans. (Had we passed republicans first, we would have gotten the same statistic with a negative sign, telling us republican support for the bill was lower than democrat support.)
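To see where the statistic comes from, it can be recomputed by hand. This sketch rebuilds the two groups from the `value_counts()` output above (156 yes and 102 no for democrats, 31 yes and 133 no for republicans) and applies the pooled-variance formula that `ttest_ind` uses by default (`equal_var=True`). The names `dem`, `rep`, and `t_manual` are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

# Rebuild the non-missing votes from the value counts shown earlier.
dem = np.r_[np.ones(156), np.zeros(102)]
rep = np.r_[np.ones(31), np.zeros(133)]

n1, n2 = len(dem), len(rep)

# Pooled variance: weighted average of the two sample variances (ddof=1).
pooled_var = ((n1 - 1) * dem.var(ddof=1) + (n2 - 1) * rep.var(ddof=1)) / (n1 + n2 - 2)

# t = difference in means divided by its standard error.
t_manual = (dem.mean() - rep.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

result = ttest_ind(dem, rep)
print(t_manual, result.statistic, result.pvalue)
```

The manual value and the `ttest_ind` statistic should agree, and both should land near the 9.15 reported above. Swapping the argument order flips only the sign.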

In [56]:
democrats['crime'].mean()

0.35019455252918286

In [57]:
republicans['crime'].mean()

0.98125

In [0]:
#means look really different, this is a good place to look for low p-values

In [60]:
#same as before: the null hypothesis is x1 = x2 and we try to reject it, so we use two-sample testing

ttest_ind(republicans['crime'], democrats['crime'], nan_policy='omit')

Ttest_indResult(statistic=16.288201256755894, pvalue=1.796481827173887e-46)

Our p-value is $\approx 1.8 \times 10^{-46}$. This is far below $0.01$, so we reject the null hypothesis that democrats and republicans support this bill equally. Because republicans were passed first and the t-statistic is positive, we infer that republicans support the bill more than democrats. (Had we passed democrats first, we would have gotten the same statistic with a negative sign, telling us democrat support for the bill was lower than republican support.)
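The default `ttest_ind` assumes both groups share one variance. For the crime bill the two spreads differ noticeably, so it may be worth checking that the conclusion does not hinge on that assumption. The sketch below uses `ttest_ind_from_stats` with the means, standard deviations, and counts that `.describe()` reports for the `crime` column later in this notebook, and compares the pooled test with Welch's unequal-variance test (`equal_var=False`). Welch's test is an addition here, not something the notebook itself runs.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics for the 'crime' column, copied from the describe() tables.
rep_mean, rep_std, rep_n = 0.98125, 0.136067, 160
dem_mean, dem_std, dem_n = 0.350195, 0.477962, 257

# Pooled-variance test: same assumption as ttest_ind's default.
pooled = ttest_ind_from_stats(rep_mean, rep_std, rep_n,
                              dem_mean, dem_std, dem_n,
                              equal_var=True)

# Welch's test: does not assume equal variances.
welch = ttest_ind_from_stats(rep_mean, rep_std, rep_n,
                             dem_mean, dem_std, dem_n,
                             equal_var=False)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

Both variants should give a positive statistic and a p-value far below 0.01, so the conclusion for this bill does not depend on the equal-variance assumption.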

In [0]:
#finding an issue that they agreed on felt more difficult than picking an issue they disagreed on,
#so I decided to cheat a little: use .describe() to look at ALL the means and find one that might
#be similar between the two parties. Didn't have to go far, water-project-cost-sharing looks like
#it will fit the bill

In [72]:
democrats.describe()

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
count,258.0,239.0,260.0,259.0,255.0,258.0,259.0,263.0,248.0,263.0,255.0,249.0,252.0,257.0,251.0,185.0
mean,0.604651,0.502092,0.888462,0.054054,0.215686,0.476744,0.772201,0.828897,0.758065,0.471483,0.505882,0.144578,0.289683,0.350195,0.63745,0.935135
std,0.489876,0.501045,0.315405,0.226562,0.412106,0.50043,0.420224,0.377317,0.429121,0.500138,0.500949,0.352383,0.454518,0.477962,0.481697,0.246956
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [73]:
republicans.describe()

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
count,164.0,147.0,163.0,164.0,164.0,165.0,161.0,156.0,164.0,164.0,159.0,154.0,157.0,160.0,155.0,145.0
mean,0.189024,0.503401,0.134969,0.987805,0.95122,0.89697,0.242236,0.153846,0.115854,0.554878,0.132075,0.87013,0.859873,0.98125,0.090323,0.655172
std,0.392727,0.501698,0.342744,0.110092,0.216069,0.304924,0.429773,0.361963,0.32103,0.498501,0.339643,0.337257,0.34823,0.136067,0.287573,0.47696
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [74]:
ttest_ind(democrats['water-project-cost-sharing'], republicans['water-project-cost-sharing'], nan_policy='omit')

Ttest_indResult(statistic=-0.02491808700047811, pvalue=0.9801332440121653)

Our p-value is $\approx 0.98$. That is far above the $0.1$ threshold from goal 4, so we fail to reject the null hypothesis that democrats and republicans support this bill equally. Failing to reject does not prove the parties agree, but the two means (about $0.502$ and $0.503$) are nearly identical, so party affiliation alone shows no evident preference either way on this bill.
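For the first stretch goal, the three tests above follow the same pattern, so they could be wrapped in one function. This is a minimal sketch, not the notebook's own code: `compare_parties` is a hypothetical helper, and the toy frame stands in for `politics` so the example runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(votes, issue, group_col='party',
                    first='democrat', second='republican'):
    """Two-sample t-test of `first` vs `second` on one issue, ignoring NaNs."""
    a = votes.loc[votes[group_col] == first, issue]
    b = votes.loc[votes[group_col] == second, issue]
    result = ttest_ind(a, b, nan_policy='omit')
    return {
        'issue': issue,
        'mean_first': a.mean(),
        'mean_second': b.mean(),
        # A positive statistic means `first` supports the issue more.
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }


# Toy data: issue-a leans democrat, issue-b is split identically in both parties.
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue-a': [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, 0],
    'issue-b': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

summary = pd.DataFrame([compare_parties(toy, col) for col in ['issue-a', 'issue-b']])
print(summary)
```

Applied to `politics`, looping over every issue column and sorting by `pvalue` would surface candidates for all three goals at once.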