<a href="https://colab.research.google.com/github/VS-Coder/CheatSheets/blob/master/Michael_Davis_Copy_of_LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
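
For reference, here is a sketch of the statistic such a test computes, assuming the equal-variance (pooled) form, which is the default behavior of `scipy.stats.ttest_ind`. The subscripts $R$ and $D$ (republicans and democrats) are notation introduced here for illustration; $\bar{x}$ is the share of "yes" votes, $s^2$ the sample variance, and $n$ the number of non-missing votes in each group.

$$
t = \frac{\bar{x}_R - \bar{x}_D}{s_p \sqrt{\dfrac{1}{n_R} + \dfrac{1}{n_D}}},
\qquad
s_p^2 = \frac{(n_R - 1)\,s_R^2 + (n_D - 1)\,s_D^2}{n_R + n_D - 2}
$$

The statistic is compared against a t distribution with $n_R + n_D - 2$ degrees of freedom. A positive $t$ means the first group (republicans in this ordering) supported the issue more, and a negative $t$ means the second group did.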

In [0]:
# Import statements
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel


In [0]:
# Load the voting data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data


--2020-05-21 02:02:53--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-05-21 02:02:54 (137 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
# Defining the column headers for the data file.

column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

In [0]:

# Create the DataFrame
votes_df = pd.read_csv('house-votes-84.data',
                       header=None,
                       names=column_headers,
                       na_values="?")
votes_df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
# Changing the vote values to 1s and 0s. Recoding the values.
votes_df = votes_df.replace({'y':1, 'n':0})

In [0]:
votes_df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
votes_df.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,423.0,387.0,424.0,424.0,420.0,424.0,421.0,420.0,413.0,428.0,414.0,404.0,410.0,418.0,407.0,331.0
mean,0.44208,0.503876,0.596698,0.417453,0.504762,0.641509,0.567696,0.57619,0.501211,0.504673,0.362319,0.423267,0.509756,0.593301,0.427518,0.812689
std,0.497222,0.500632,0.49114,0.493721,0.500574,0.480124,0.495985,0.49475,0.500605,0.500563,0.481252,0.49469,0.500516,0.491806,0.495327,0.390752
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
# Hypothesis test: republicans support the issue more than democrats (goal 3).
# I chose physician-fee-freeze for this exercise.
reps = votes_df[votes_df['party'] == 'republican']
dems = votes_df[votes_df['party'] == 'democrat']

# Drop the NaN values (missing votes) from the republican column.
col = reps['physician-fee-freeze']
clean_physician_fee_freeze = col.dropna()

# Series.mean() skips NaN by default, so each mean is the share of "yes" votes.
print(reps['physician-fee-freeze'].mean())
dems['physician-fee-freeze'].mean()


0.9878787878787879


0.05405405405405406

In [0]:
# Two-sample t-test for this vote; nan_policy='omit' drops the missing votes.
# (ttest_ind was already imported in the first cell.)
ttest_ind(reps['physician-fee-freeze'], 
          dems['physician-fee-freeze'],
          nan_policy='omit')



Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

In [0]:
# The positive t statistic and p < 0.01 show that republicans supported this
# bill significantly more than democrats, who mostly voted against it.
dems['physician-fee-freeze'].value_counts()

0.0    245
1.0     14
Name: physician-fee-freeze, dtype: int64

In [0]:
# Hypothesis test: democrats support the issue more than republicans (goal 2).
# I chose south-africa for this exercise.
reps = votes_df[votes_df['party'] == 'republican']
dems = votes_df[votes_df['party'] == 'democrat']

# Drop the NaN values (missing votes) from the republican column.
col = reps['south-africa']
clean_south_africa = col.dropna()

# Series.mean() skips NaN by default, so each mean is the share of "yes" votes.
print(reps['south-africa'].mean())
dems['south-africa'].mean()

0.6575342465753424


0.9351351351351351

In [0]:
# Two-sample t-test for this vote; nan_policy='omit' drops the missing votes.
# (ttest_ind was already imported in the first cell.)
ttest_ind(reps['south-africa'], 
          dems['south-africa'],
          nan_policy='omit')



Ttest_indResult(statistic=-6.849454815841208, pvalue=3.652674361672226e-11)

In [0]:
# Here we see dems voting for this bill more than reps.

print(reps['south-africa'].value_counts())
print(dems['south-africa'].value_counts())

1.0    96
0.0    50
Name: south-africa, dtype: int64
1.0    173
0.0     12
Name: south-africa, dtype: int64


In [0]:
# mx-missile
# Hypothesis test: looking for an issue where democrats and republicans
# vote similarly, i.e. p > 0.1 (goal 4).
# I chose mx-missile for this exercise.
reps = votes_df[votes_df['party'] == 'republican']
dems = votes_df[votes_df['party'] == 'democrat']

# Drop the NaN values (missing votes) from the republican column.
col = reps['mx-missile']
clean_mx_missile = col.dropna()

# Series.mean() skips NaN by default, so each mean is the share of "yes" votes.
print(reps['mx-missile'].mean())
dems['mx-missile'].mean()

0.11515151515151516


0.7580645161290323

In [0]:
# Here we see dems voting for this bill more than reps.

print(reps['mx-missile'].value_counts())
print(dems['mx-missile'].value_counts())

0.0    146
1.0     19
Name: mx-missile, dtype: int64
1.0    188
0.0     60
Name: mx-missile, dtype: int64


In [0]:
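# Two-sample t-test for this vote; nan_policy='omit' drops the missing votes.
# Goal 4 needs p > 0.1, but the p-value below is far smaller, so mx-missile
# does not qualify and a different issue is needed.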
ttest_ind(reps['mx-missile'], 
          dems['mx-missile'],
          nan_policy='omit')

Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47)
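
The mx-missile p-value above is far below 0.1, so finding an issue for goal 4 means testing more columns. One possible way to do that is to wrap the repeated cells in a function and scan every issue. The sketch below is illustrative only: `compare_parties`, `scan_issues`, and the `toy` DataFrame are hypothetical names and made-up data, not part of the original notebook. It assumes votes are already recoded to 1/0 with NaN for missing values, as in `votes_df`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, party_col="party"):
    """Two-sample t-test of republican vs. democrat support for one issue."""
    reps = df.loc[df[party_col] == "republican", issue]
    dems = df.loc[df[party_col] == "democrat", issue]
    # nan_policy='omit' drops missing votes, matching the cells above.
    result = ttest_ind(reps, dems, nan_policy="omit")
    return {
        "issue": issue,
        "rep_mean": reps.mean(),
        "dem_mean": dems.mean(),
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
    }


def scan_issues(df, party_col="party"):
    """Run compare_parties on every column except the party label."""
    issues = [c for c in df.columns if c != party_col]
    rows = [compare_parties(df, issue, party_col) for issue in issues]
    return pd.DataFrame(rows).set_index("issue")


# Made-up toy data (NOT the real voting records), just to exercise the helpers.
toy = pd.DataFrame({
    "party": ["republican"] * 6 + ["democrat"] * 6,
    "issue-a": [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 0, 0, 1],
    "issue-b": [1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})

summary = scan_issues(toy)
# Issues where the parties are not clearly different (the goal 4 criterion).
similar = summary[summary["pvalue"] > 0.1]
print(similar)
```

On the real data, `scan_issues(votes_df)` should return one row per issue, which can then be filtered with `pvalue > 0.1` for goal 4 or with `pvalue < 0.01` and the sign of `statistic` for goals 2 and 3.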

## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Perform a t-test without using SciPy to get "under the hood" and learn this topic more thoroughly.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
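
The steps above can be sketched with only the Python standard library. This is a minimal illustration rather than a reference implementation: `one_sample_t` is a hypothetical helper name, and the null value of 0.5 (an evenly split vote) is an assumption chosen for the example. The democrat votes are rebuilt from the south-africa value counts shown earlier (173 yes, 12 no).

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test."""
    # Omit NaN values from the sample, as the assignment asks.
    clean = [float(x) for x in sample if not math.isnan(x)]
    n = len(clean)
    mean = statistics.fmean(clean)
    # statistics.stdev uses the n - 1 (sample) denominator.
    std_err = statistics.stdev(clean) / math.sqrt(n)
    return (mean - popmean) / std_err, n - 1


# Tiny hand-checkable sample: mean 0.75, stdev 0.5, n = 4.
t_toy, df_toy = one_sample_t([1, 1, 1, 0, float("nan")], popmean=0.5)
print(t_toy, df_toy)

# Democrat south-africa votes rebuilt from the value counts (173 yes, 12 no).
dem_votes = [1.0] * 173 + [0.0] * 12
t_dem, df_dem = one_sample_t(dem_votes, popmean=0.5)
print(t_dem, df_dem)
```

The returned t statistic and degrees of freedom are the two numbers needed to look up a p-value in a t table or in the applet linked above.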

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
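
A standard-library sketch of the 2-sample case follows. It assumes the equal-variance (pooled) form of the test, which is what `ttest_ind` uses by default. `two_sample_t` is a hypothetical helper name, and the vote lists are rebuilt from the south-africa value counts shown earlier (republicans: 96 yes, 50 no; democrats: 173 yes, 12 no).

```python
import math
import statistics


def two_sample_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled 2-sample t-test."""
    # Omit NaN values from both samples.
    a = [float(x) for x in sample_a if not math.isnan(x)]
    b = [float(x) for x in sample_b if not math.isnan(x)]
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: both sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.fmean(a) - statistics.fmean(b)) / std_err, df


# south-africa votes rebuilt from the value counts printed earlier.
rep_votes = [1.0] * 96 + [0.0] * 50
dem_votes = [1.0] * 173 + [0.0] * 12

t_stat, df = two_sample_t(rep_votes, dem_votes)
print(t_stat, df)
```

With republicans passed first, the statistic should agree with the `ttest_ind` result for south-africa reported earlier (about -6.85), with 329 degrees of freedom for the table or applet lookup.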

### Then check your answers using SciPy!