

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware: there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data calls for *two-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
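
To make the distinction concrete, here is a minimal sketch on made-up data. The arrays `group_a` and `group_b` are hypothetical and are not taken from the voting records; the only library calls used are `scipy.stats.ttest_ind` and `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical yes (1) / no (0) votes for two small groups; NaN marks a missing vote.
group_a = np.array([1, 1, 1, 0, 1, 1, np.nan, 1])
group_b = np.array([0, 0, 1, 0, 0, np.nan, 0, 0])

# Two-sample test: compares the mean of group_a with the mean of group_b.
# nan_policy='omit' drops missing votes instead of returning NaN.
two_sample = stats.ttest_ind(group_a, group_b, nan_policy='omit')

# One-sample test, for contrast: compares group_a's mean with a fixed value (0.5).
one_sample = stats.ttest_1samp(group_a, 0.5, nan_policy='omit')

print(two_sample.statistic, two_sample.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

The sign of the two-sample statistic follows the argument order: it is positive when the first group's mean is larger.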

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [31]:
import pandas as pd
import numpy as np
from scipy import stats

# Column headers for data
columns=['party','handicapped-infants','water-project',
         'budget','physician-fee-freeze', 'el-salvador-aid',
         'religious-groups','anti-satellite-ban',
         'aid-to-contras','mx-missile','immigration',
         'synfuels', 'education', 'right-to-sue','crime','duty-free',
         'south-africa']

# Import data and set column headers
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', names=columns)

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [32]:
# Double-check for anything out of the ordinary
for col in df.columns:
  print(df[col].value_counts())

democrat      267
republican    168
Name: party, dtype: int64
n    236
y    187
?     12
Name: handicapped-infants, dtype: int64
y    195
n    192
?     48
Name: water-project, dtype: int64
y    253
n    171
?     11
Name: budget, dtype: int64
n    247
y    177
?     11
Name: physician-fee-freeze, dtype: int64
y    212
n    208
?     15
Name: el-salvador-aid, dtype: int64
y    272
n    152
?     11
Name: religious-groups, dtype: int64
y    239
n    182
?     14
Name: anti-satellite-ban, dtype: int64
y    242
n    178
?     15
Name: aid-to-contras, dtype: int64
y    207
n    206
?     22
Name: mx-missile, dtype: int64
y    216
n    212
?      7
Name: immigration, dtype: int64
n    264
y    150
?     21
Name: synfuels, dtype: int64
n    233
y    171
?     31
Name: education, dtype: int64
y    209
n    201
?     25
Name: right-to-sue, dtype: int64
y    248
n    170
?     17
Name: crime, dtype: int64
n    233
y    174
?     28
Name: duty-free, dtype: int64
y    269
?    104
n     62
Name: 

In [33]:
# Besides the 'party' column, every column uses a y/n/? format.
# I'm assuming these mean a yes / no / missing (possibly abstained) vote on the bill named by the column.

# Convert the votes to numbers so we can compute means and run t-tests
# (np.nan rather than np.NaN, which was removed in NumPy 2.0)
df = df.replace({'y': 1, 'n': 0, '?': np.nan})

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
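
Since the assignment warns about missing values, it can help to count them per issue before testing. The sketch below uses a hypothetical three-row frame (`raw`) in the same y/n/? format, maps only the vote columns so the `party` labels are never touched, and then counts the NaNs. The names `raw`, `vote_map`, and `clean` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample in the same y/n/? format as the voting data.
raw = pd.DataFrame({
    'party': ['republican', 'democrat', 'democrat'],
    'budget': ['n', 'y', '?'],
    'crime': ['y', '?', 'n'],
})

vote_map = {'y': 1.0, 'n': 0.0, '?': np.nan}
vote_cols = list(raw.columns.drop('party'))

# Map each vote column separately so the 'party' labels are never touched.
clean = raw.copy()
clean[vote_cols] = raw[vote_cols].apply(lambda s: s.map(vote_map))

# Count the missing votes per issue to see how many observations each test will drop.
missing_per_issue = clean[vote_cols].isna().sum()
print(missing_per_issue)
```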


In [34]:
# Separate the data set by party for comparison purposes later
dem = df[df['party']=='democrat']
rep = df[df['party']=='republican']

dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [35]:
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [36]:
# Looks good. Now we can look for bills that one party favors over the other.

# First, though, I want to know how strongly each party supported each bill.
# numeric_only=True skips the text 'party' column (pandas 2.0+ raises a TypeError otherwise).

print("Republican Voting Means\n", rep.mean(numeric_only=True), '\n\nDemocrat Voting Means')
print(dem.mean(numeric_only=True))

Republican Voting Means
 handicapped-infants     0.187879
water-project           0.506757
budget                  0.134146
physician-fee-freeze    0.987879
el-salvador-aid         0.951515
religious-groups        0.897590
anti-satellite-ban      0.240741
aid-to-contras          0.152866
mx-missile              0.115152
immigration             0.557576
synfuels                0.132075
education               0.870968
right-to-sue            0.860759
crime                   0.981366
duty-free               0.089744
south-africa            0.657534
dtype: float64 

Democrat Voting Means
handicapped-infants     0.604651
water-project           0.502092
budget                  0.888462
physician-fee-freeze    0.054054
el-salvador-aid         0.215686
religious-groups        0.476744
anti-satellite-ban      0.772201
aid-to-contras          0.828897
mx-missile              0.758065
immigration             0.471483
synfuels                0.505882
education               0.144578
right-to-sue
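
One quick way to read the two lists of means side by side is to subtract them and sort. The sketch below uses three support rates rounded from the output above; the Series names `rep_means` and `dem_means` are illustrative, and in the notebook they would presumably come from the party means computed in this cell.

```python
import pandas as pd

# Support rates (share of yes votes) rounded from the printed means for three issues.
rep_means = pd.Series({'budget': 0.13, 'education': 0.87, 'water-project': 0.51})
dem_means = pd.Series({'budget': 0.89, 'education': 0.14, 'water-project': 0.50})

# Positive gap: Republicans more supportive; negative gap: Democrats more supportive.
gap = (rep_means - dem_means).sort_values()
print(gap)
```

Issues at either end of the sorted gap are the natural candidates for the directional goals, and issues near zero are candidates for the "no clear difference" goal. The gap alone says nothing about significance, which is what the t-tests are for.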

In [37]:
# Alright, now let's run a two-sample t-test on every issue.
# rep comes first, so a positive statistic means Republicans were more supportive.
for col in df.columns[df.columns != 'party']:
  print(col,":",stats.ttest_ind(rep[col], dem[col], nan_policy='omit'))

handicapped-infants : Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
water-project : Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
budget : Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
physician-fee-freeze : Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)
el-salvador-aid : Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68)
religious-groups : Ttest_indResult(statistic=9.737575825219457, pvalue=2.3936722520597287e-20)
anti-satellite-ban : Ttest_indResult(statistic=-12.526187929077842, pvalue=8.521033017443867e-31)
aid-to-contras : Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54)
mx-missile : Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47)
immigration : Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066)
synfuels : Ttest_indResult(statistic=-8.293603989407588, pvalue=1.57593
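
The three assignment goals amount to reading the sign of the statistic together with the p-value. A minimal sketch of that bookkeeping follows, run on made-up data. The helper `classify_issues`, the frame `toy`, and the issue names `issue_x` and `issue_y` are all hypothetical; the thresholds 0.01 and 0.1 come from the goals listed at the top.

```python
import numpy as np
import pandas as pd
from scipy import stats

def classify_issues(df, group_col='party', a='republican', b='democrat'):
    """Label each issue by the sign and p-value of a two-sample t-test (hypothetical helper)."""
    rows = []
    for col in df.columns.drop(group_col):
        t, p = stats.ttest_ind(df.loc[df[group_col] == a, col],
                               df.loc[df[group_col] == b, col],
                               nan_policy='omit')
        t, p = float(t), float(p)
        if p < 0.01:
            # Group a is passed first, so a positive statistic means a's mean is higher.
            label = f'{a} higher' if t > 0 else f'{b} higher'
        elif p > 0.1:
            label = 'no clear difference'
        else:
            label = 'inconclusive'
        rows.append({'issue': col, 'statistic': t, 'pvalue': p, 'label': label})
    return pd.DataFrame(rows).set_index('issue')

# Hypothetical toy data: 10 Republicans and 10 Democrats, two made-up issues.
toy = pd.DataFrame({
    'party': ['republican'] * 10 + ['democrat'] * 10,
    'issue_x': [1] * 9 + [0] + [0] * 8 + [1, np.nan],
    'issue_y': [1, 0] * 5 + [1, 0] * 5,
})
labels = classify_issues(toy)
print(labels)
```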

**Hypotheses:**

**1:**

Null Hypothesis: The Republicans support the 'budget' bill as much as the Democrats do.

Alternate: The Republicans do not support the 'budget' bill as much as the Democrats do.

Confidence: 99% (reject the null if p < 0.01)

Results: With a t-stat of -23.212 and a p-value of $2.07 \times 10^{-77}$, we reject the null hypothesis that the Republicans support the 'budget' bill as much as the Democrats do. The negative t-stat (Republicans minus Democrats) indicates that Democrats supported it more.

**2:**

Null Hypothesis: The Democrats support the 'education' bill as much as the Republicans do.

Alternate: The Democrats do not support the 'education' bill as much as the Republicans do.

Confidence: 99% (reject the null if p < 0.01)

Results: With a t-stat of 20.5006 and a p-value of $1.88 \times 10^{-64}$, we reject the null hypothesis that the Democrats support the 'education' bill as much as the Republicans do. The positive t-stat (Republicans minus Democrats) indicates that Republicans supported it more.

**3:**

Null Hypothesis: The Democrats support the 'water-project' bill as much as the Republicans do.

Alternate: The Democrats do not support the 'water-project' bill as much as the Republicans do.

Confidence: 90% (reject the null if p < 0.1)

Results: With a t-stat of 0.089 and a p-value of $0.929$, we fail to reject the null hypothesis that the Democrats support the 'water-project' bill as much as the Republicans do. This is not proof that the parties agree, only a lack of evidence that they differ.
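
The first two hypotheses are directional ("do not support as much"), while `ttest_ind` reports a two-sided p-value by default. A minimal sketch of a one-sided version follows, on made-up votes. It assumes SciPy 1.6 or newer, where `ttest_ind` accepts an `alternative` argument, and it drops NaNs by hand rather than relying on `nan_policy`. The arrays `rep_votes` and `dem_votes` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical yes (1) / no (0) votes; NaN marks a missing vote.
rep_votes = np.array([0, 0, 1, 0, 0, 0, np.nan, 0, 0, 0])
dem_votes = np.array([1, 1, 1, 0, 1, 1, 1, np.nan, 1, 1])

# Drop missing votes by hand so the call does not depend on nan_policy.
rep_clean = rep_votes[~np.isnan(rep_votes)]
dem_clean = dem_votes[~np.isnan(dem_votes)]

two_sided = stats.ttest_ind(rep_clean, dem_clean)
# alternative='less' tests H1: mean(rep) < mean(dem); assumes SciPy >= 1.6.
one_sided = stats.ttest_ind(rep_clean, dem_clean, alternative='less')

print(two_sided.pvalue, one_sided.pvalue)
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one, so with p-values as small as those reported above the conclusions would not change either way.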

In [41]:
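# Stretch goal: refactor the code into a function so it's easy to rerun with arbitrary variables
def do_ttest(dataframe, column, condition1, condition2):
  """Print a two-sample t-test on `column` between the rows selected by two boolean masks."""
  df1 = dataframe[condition1]
  df2 = dataframe[condition2]
  # nan_policy='omit' drops missing votes within each group before testing
  print(stats.ttest_ind(df1[column], df2[column], nan_policy='omit'))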

# Testing the function
# Test 1: Republicans voted the same way as Democrats on the 'water-project' bill
do_ttest(df,'water-project',(df['party']=='republican'),(df['party']=='democrat'))

# Test 2: Congressmen who voted in favor of the 'water-project' bill voted the same way as congressmen
#         who voted in favor of the 'education' bill on the 'aid-to-contras' bill
do_ttest(df,'aid-to-contras',(df['water-project']==1),(df['education']==1))

# Test 3: Congressmen who voted against the 'budget' bill voted the same way as those who voted
#         in favor of the 'mx-missile' bill on the 'education' bill
do_ttest(df,'education',(df['budget']==0),(df['mx-missile']==1))

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
Ttest_indResult(statistic=6.4190912289471935, pvalue=4.462714491251616e-10)
Ttest_indResult(statistic=18.688408177622005, pvalue=1.0287746496723199e-54)
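
One caveat for Tests 2 and 3: `ttest_ind` assumes the two samples are independent, but masks built from different columns can select the same congressperson twice. A minimal sketch for checking that overlap follows, on a hypothetical five-row frame (`toy` is made up and `overlap` is an illustrative name).

```python
import pandas as pd

# Hypothetical votes: 1 = yes, 0 = no.
toy = pd.DataFrame({
    'water-project': [1, 1, 0, 1, 0],
    'education':     [1, 0, 1, 1, 0],
})

cond1 = toy['water-project'] == 1
cond2 = toy['education'] == 1

# Rows that satisfy both conditions would be counted in both samples.
overlap = int((cond1 & cond2).sum())
print(overlap)
```

If the overlap is nonzero, the p-value from an independent-samples test should be read with caution. The party masks in Test 1 are mutually exclusive, so that comparison is not affected.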
