<a href="https://colab.research.google.com/github/AVData/DS-Unit-1-Sprint-2-Statistics/blob/master/module1/Agustin_Vargas_LS_DS10_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
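
To make the two-sample idea concrete, here is a minimal sketch that computes the pooled-variance t statistic by hand. The vote lists are made up for illustration, and `pooled_t_statistic` is a hypothetical helper, not part of the assignment. It follows the same equal-variance form that `scipy.stats.ttest_ind` uses by default (`equal_var=True`).

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic, assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two group means.
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err


# Made-up yes/no votes (1 = yes, 0 = no) for two groups.
group_a = [1, 1, 1, 0, 1, 1, 0, 1]
group_b = [0, 0, 1, 0, 0, 1, 0, 0]

t_stat = pooled_t_statistic(group_a, group_b)
print(round(t_stat, 4))
```

A positive statistic means the first group has the higher mean; the p-value then comes from the t distribution with `n_a + n_b - 2` degrees of freedom.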

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
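
For the first stretch goal, one possible shape for a reusable function is sketched below. The name `compare_parties`, its parameters, and the toy dataframe are all assumptions made for illustration; the sketch assumes votes are already encoded as 1/0 with NaN for missing values.

```python
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, group_col="party"):
    """Two-sample t-test on one issue, dropping missing votes in each group."""
    dems = df.loc[df[group_col] == "democrat", issue].dropna()
    reps = df.loc[df[group_col] == "republican", issue].dropna()
    result = ttest_ind(dems, reps)
    return {
        "issue": issue,
        "dem_mean": dems.mean(),
        "rep_mean": reps.mean(),
        "statistic": result.statistic,  # positive when democrats' mean is higher
        "pvalue": result.pvalue,
    }


# Toy data, not taken from the real voting records.
toy = pd.DataFrame({
    "party": ["democrat"] * 4 + ["republican"] * 4,
    "budget": [1, 1, 1, None, 0, 0, 1, 0],
})

summary = compare_parties(toy, "budget")
print(summary)
```

Calling the function once per column name would rerun the same test for any issue without copying cells.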

In [1]:
# importing the data

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-11-04 22:49:20--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-11-04 22:49:20 (463 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
# importing libraries

import pandas as pd
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel
import numpy as np

In [4]:
# giving the dataframe column headers


column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

# Loading the data; '?' marks a missing vote and is read as NaN

df = pd.read_csv('house-votes-84.data', 
                 header=None, 
                 names=column_headers, 
                 na_values='?')

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y
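
Before choosing how to handle the gaps, it can help to count them per column. The sketch below uses three made-up rows in the same layout as `house-votes-84.data` (the column names `issue_a` to `issue_c` are placeholders) and relies on `na_values='?'` just like the cell above.

```python
import io

import pandas as pd

# Three made-up rows in the same comma-separated layout (not real records).
raw = io.StringIO(
    "republican,n,y,?\n"
    "democrat,?,y,y\n"
    "democrat,y,?,?\n"
)

toy = pd.read_csv(
    raw,
    header=None,
    names=["party", "issue_a", "issue_b", "issue_c"],
    na_values="?",
)

# Number of missing votes in each column.
missing_per_issue = toy.isna().sum()
print(missing_per_issue)
```

On the real dataframe the same `df.isna().sum()` call would show which issues lose the most observations when NaN values are dropped.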


In [9]:
# replacing y's and n's

df = df.replace({'y': 1, 'n': 0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [53]:
# before deciding how to handle NaN values, let's take a look at the
# dataframes we're going to be working with

# Note: the larger the mean, the more 'y' votes; the smaller the mean, the more 'n' votes

# Breaking up the df into dems and reps

dems = df[df['party'] == 'democrat']
print(dems.head())

reps = df[df['party'] == 'republican']
reps.head()

      party  handicapped-infants  water-project  ...  crime  duty-free  south-africa
2  democrat                  NaN            1.0  ...    1.0        0.0           0.0
3  democrat                  0.0            1.0  ...    0.0        0.0           1.0
4  democrat                  1.0            1.0  ...    1.0        1.0           1.0
5  democrat                  0.0            1.0  ...    1.0        1.0           1.0
6  democrat                  0.0            1.0  ...    1.0        1.0           1.0

[5 rows x 17 columns]


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


#  1) Issue that Democrats support more than Republicans with a p-value < 0.01


We reject the null hypothesis (p ≈ 2.1e-77, far below 0.01): there is a statistically significant difference between Democrats' and Republicans' support for the budget. The t-statistic is positive with Democrats passed as the first sample, so Democrats supported it more.

In [24]:
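# Now we take a look at the columns we want to compare

# republicans
# note: len(reps) also counts members with no recorded vote, so this is the
# share of all Republicans voting yes, not the mean of the recorded votes

reps['budget'].sum()/len(reps)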

0.13095238095238096

In [25]:
# democrats

dems['budget'].sum()/len(dems)

0.8651685393258427
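
The two cells above divide by `len(...)`, which includes rows where the vote is NaN. The sketch below, on a made-up series, shows how that differs from dividing by the number of recorded votes, which is what `Series.mean()` does by default.

```python
import numpy as np
import pandas as pd

# Made-up votes: 1 = yes, 0 = no, NaN = no recorded vote.
votes = pd.Series([1, 1, 0, np.nan, np.nan])

share_of_all_rows = votes.sum() / len(votes)   # NaN rows still count in the denominator
share_of_voters = votes.sum() / votes.count()  # count() ignores NaN
mean_skipping_nan = votes.mean()               # mean() skips NaN by default

print(share_of_all_rows, share_of_voters, mean_skipping_nan)
```

The t-tests in this notebook drop NaN values, so the means they compare correspond to the second and third quantities rather than the first.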

In [26]:
# Working on 'budget' to answer the following question

# Null Hypothesis:
# There will be no difference in support of the budget issue between
# Democrats and Republicans

# Alternative Hypothesis:
# There is a difference in support for the budget issue between Republicans
# and Democrats

# 'Issue that democrats support more than republicans by p < 0.01'

# working with ttest_ind() function

ttest_ind(dems['budget'], reps['budget'], nan_policy='omit')

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)

In [30]:
# Selecting the 'budget' column for each party and removing NaN values

col = dems['budget']
dems_budget_no_nans = col[~np.isnan(col)]

col = reps['budget']
reps_budget_no_nans = col[~np.isnan(col)]

print(len(reps_budget_no_nans))
print(len(dems_budget_no_nans))

164
260


In [33]:
ttest_ind(dems_budget_no_nans, reps_budget_no_nans)

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795405602e-77)
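
The two results above agree because `nan_policy='omit'` and manual NaN filtering leave the same observations in each sample. A small check on made-up arrays (the names `group_a` and `group_b` are placeholders) might look like this:

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up 0/1 votes with a few missing values.
group_a = np.array([1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, np.nan, 0, 1, 0, 0, 1])

# Option 1: let scipy drop the NaN values.
omit_result = ttest_ind(group_a, group_b, nan_policy="omit")

# Option 2: drop the NaN values by hand, as in the cells above.
manual_result = ttest_ind(group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)])

same_statistic = np.isclose(float(omit_result.statistic), float(manual_result.statistic))
same_pvalue = np.isclose(float(omit_result.pvalue), float(manual_result.pvalue))
print(same_statistic, same_pvalue)
```

Either approach is enough on its own; doing both, as in this notebook, is harmless but redundant.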

# 2) Issue that Republicans support more than Democrats with a p-value < 0.01


We reject the null hypothesis (p ≈ 5.6e-68, far below 0.01): there is a statistically significant difference between Democrats' and Republicans' support for the El Salvador aid issue. The t-statistic is negative with Democrats passed as the first sample, so Republicans supported it more.


In [45]:
# Choosing an issue to work with
# Looking at the 'el-salvador-aid' issue

dems['el-salvador-aid'].sum()/len(dems)

0.20599250936329588

In [46]:
reps['el-salvador-aid'].sum()/len(reps)

0.9345238095238095

In [0]:
# Selecting the 'el-salvador-aid' column for each party and removing NaN values

col = dems['el-salvador-aid']
dems_el_salvador_aid = col[~np.isnan(col)]

col = reps['el-salvador-aid']
reps_el_salvador_aid = col[~np.isnan(col)]

In [52]:
# Running the t-test with 'ttest_ind()'

ttest_ind(dems_el_salvador_aid, reps_el_salvador_aid, nan_policy='omit')

Ttest_indResult(statistic=-21.136692611732194, pvalue=5.600520111728605e-68)
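
`ttest_ind` reports a two-sided p-value by default, while the goal here is directional ("Republicans support more than Democrats"). Because the t distribution is symmetric, a one-sided p-value can be derived from the two-sided one and the sign of the statistic. The helper `one_sided_p` below is a hypothetical illustration, applied to the numbers printed in the output above.

```python
def one_sided_p(statistic, two_sided_p, first_group_expected_larger):
    """One-sided p-value from a two-sided t-test result (symmetric t distribution)."""
    if first_group_expected_larger:
        observed_matches = statistic > 0
    else:
        observed_matches = statistic < 0
    half = two_sided_p / 2
    # If the observed direction matches the claim, the one-sided p is half the
    # two-sided p; otherwise it is the complementary tail.
    return half if observed_matches else 1 - half


# Values copied from the ttest_ind output above (Democrats passed first).
statistic = -21.136692611732194
two_sided_p = 5.600520111728605e-68

# Claim under test: Republicans (the second group) support the issue more.
p_reps_more = one_sided_p(statistic, two_sided_p, first_group_expected_larger=False)
print(p_reps_more < 0.01)
```

With a p-value this small, the one-sided and two-sided conclusions are the same; the distinction matters more for borderline results.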

# 3) Issue where the difference in support between Republicans and Democrats has a p-value > 0.1


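With a recorded p-value of about 0.55 (well above 0.1), we fail to reject the null hypothesis: this test gives no evidence of a difference in support for anti-satellite-ban between the Republicans and Democrats of the 1980s. Failing to reject the null does not prove that the two parties voted the same way.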

In [65]:
# Choosing an issue
# looking at means

dems['anti-satellite-ban'].sum()/len(dems)
reps['anti-satellite-ban'].sum()/len(reps)

0.7490636704119851
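
A notebook cell displays only the value of its last expression, so the cell above shows one number even though it computes two. A minimal sketch of getting both party means in a single result, using a made-up dataframe and `groupby`, could look like this:

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real voting records.
toy = pd.DataFrame({
    "party": ["democrat", "democrat", "democrat", "republican", "republican"],
    "anti-satellite-ban": [1, 1, np.nan, 0, 1],
})

# One entry per party; mean() skips NaN, so this is the yes-share among recorded votes.
yes_share = toy.groupby("party")["anti-satellite-ban"].mean()
print(yes_share)
```

Wrapping each expression in `print(...)` would be another way to see both values.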

In [0]:
# working with 'anti-satellite-ban' considering it's during the Cold War

col = dems['anti-satellite-ban']
dems_anti_satellite_ban = col[~np.isnan(col)]

col = reps['anti-satellite-ban']
reps_anti_satellite_ban = col[~np.isnan(col)]

In [70]:
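# both samples must come from the same issue; the output below was recorded
# with `dems_aid_to_contras` as the first argument, so re-run before relying on it
ttest_ind(dems_anti_satellite_ban, reps_anti_satellite_ban, nan_policy='omit')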

Ttest_indResult(statistic=-0.5956033771469363, pvalue=0.5517648858930779)