

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data calls for *2-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
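
For reference, here is a sketch of the statistic behind a pooled-variance 2-sample t-test, which is what `scipy.stats.ttest_ind` computes by default (`equal_var=True`). The symbols are introduced here for illustration only: $\bar{x}_1, \bar{x}_2$ are the two group means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the group sizes after missing values are dropped.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The statistic is compared against a t distribution with $n_1 + n_2 - 2$ degrees of freedom. Its sign depends on which group is listed first.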

In [0]:
# Imports
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

In [2]:
# Get the data 
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-11-14 02:20:45--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-11-14 02:20:46 (124 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [5]:
# Build the dataframe
column_headers = ['party','handicapped-infants','water-project', 'budget','physician-fee-freeze','el-salvador-aid','religious-groups',
                  'anti-satellite-ban','aid-to-contras','mx-missile','immigration','synfuels', 'education', 'right-to-sue','crime',
                  'duty-free','south-africa']

df = pd.read_csv('house-votes-84.data', header=None, names=column_headers, na_values="?")
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [6]:
# Recode votes as numeric
df = df.replace({'y':1, 'n':0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
# Split the dataset into republican and democrat datasets
rep = df[df['party']=='republican']
dem = df[df['party']=='democrat']

### An issue that democrats support more than republicans with p < 0.01

In [14]:
# Let's test whether the average voting rates on the budget differ between democrats and republicans
# Check for missing values
print('Republican:', rep['budget'].isnull().sum())
print('Democrat:', dem['budget'].isnull().sum())

Republican: 4
Democrat: 7


In [16]:
# Remove NaN values from the `budget` column
# Republican dataset
budget_rep_no_nans = rep['budget'].dropna()

# Democrat dataset
budget_dem_no_nans = dem['budget'].dropna()

# The proportion of "yes" votes on the `budget`
print("Republicans:", budget_rep_no_nans.mean())
print("Democrats:", budget_dem_no_nans.mean())

Republicans: 0.13414634146341464
Democrats: 0.8884615384615384


1) **Null Hypothesis:** There is no difference between average voting rates for the budget between democrats and republicans in the house of representatives

**Alternative Hypothesis:** The average voting rates for the budget between democrats and republicans in the house of representatives differ

2) Confidence level: 99% (significance level α = 0.01)

In [20]:
# T test:
ttest_ind(budget_rep_no_nans, budget_dem_no_nans)

Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795405602e-77)

3) **T-statistic:** -23.213  
**P-value:** ≈ 2.07e-77 (effectively 0)

4) **Conclusion:** Because the p-value is far below 0.01, I reject the null hypothesis that republicans and democrats support the budget at the same rate. At the 1% significance level, the average voting rates for the budget differ between democrats and republicans in the House of Representatives. The negative t-statistic (republicans were passed first) indicates that democrats supported the budget more than republicans.
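
The goal asks about a direction (democrats *more than* republicans), while the test above reports a two-sided p-value. Below is a minimal sketch of converting one into the other, relying on the symmetry of the t distribution. The helper name `one_sided_p_value` is hypothetical and not part of SciPy; the numbers are copied from the `ttest_ind` output above.

```python
def one_sided_p_value(t_stat, p_two_sided, alternative):
    """Convert a two-sided t-test p-value into a one-sided one.

    Assumes t_stat came from a test of (first group - second group).
    alternative='greater' tests first > second; 'less' tests first < second.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = p_two_sided / 2
    # Does the observed difference point in the hypothesized direction?
    in_direction = t_stat > 0 if alternative == "greater" else t_stat < 0
    return half if in_direction else 1 - half


# Values reported by ttest_ind(republicans, democrats) for the budget vote
t_stat = -23.21277691701378
p_two_sided = 2.0703402795405602e-77

# Republicans were passed first, so "democrats support it more" means 'less'
p_less = one_sided_p_value(t_stat, p_two_sided, "less")
p_greater = one_sided_p_value(t_stat, p_two_sided, "greater")
print("P(republicans < democrats):", p_less)
print("P(republicans > democrats):", p_greater)
```

Recent SciPy versions also accept an `alternative` argument on `ttest_ind`, which may be preferable when it is available.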

### An issue that republicans support more than democrats with p < 0.01

In [24]:
# Mean support of each party for the physician fee freeze
print('Republicans:', rep['physician-fee-freeze'].mean())
print('Democrats:', dem['physician-fee-freeze'].mean())

Republicans: 0.9878787878787879
Democrats: 0.05405405405405406


In [25]:
# Check for missing values
print('Republicans:', rep['physician-fee-freeze'].isnull().sum())
print('Democrats:', dem['physician-fee-freeze'].isnull().sum())

Republicans: 3
Democrats: 8


1) **Null Hypothesis:** The levels of support for the physician fee freeze between democrats and republicans are equal

**Alternative Hypothesis:** The levels of support for the physician fee freeze between democrats and republicans differ

2) Confidence level: 99% (significance level α = 0.01)

In [26]:
# Test the hypothesis
ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit')

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

3) **T-statistic:** 49.367  
**P-value:** ≈ 1.99e-177 (effectively 0)

4) **Conclusion:** Because the p-value is far below 0.01, I reject the null hypothesis that republicans and democrats support the physician fee freeze equally. At the 1% significance level, the levels of support for the physician fee freeze differ between democrats and republicans. The positive t-statistic (republicans were passed first) indicates that republicans supported it more than democrats.

### An issue where the difference between republicans and democrats has p > 0.1

In [71]:
# The mean support of parties for water project
print('Republicans:', rep['water-project'].mean())
print('Democrats:', dem['water-project'].mean())

Republicans: 0.5067567567567568
Democrats: 0.502092050209205


In [73]:
# Check for missing values
print('Republicans:', rep['water-project'].isnull().sum())
print('Democrats:', dem['water-project'].isnull().sum())

Republicans: 20
Democrats: 28


1) **Null Hypothesis:** The levels of support for the water project between democrats and republicans are equal

**Alternative Hypothesis:** The levels of support for the water project between democrats and republicans differ

2) Confidence level: 90% (significance level α = 0.1)

In [74]:
# Test the hypothesis
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

3) **T-statistic:** 0.089                                                      
**P-value:** 0.929

4) **Conclusion:** Because the p-value (0.929) is greater than 0.1, I fail to reject the null hypothesis that republicans and democrats support the water project equally. The data show no statistically significant difference between the parties at the 10% significance level. This does not prove that the levels of support are equal, only that this sample gives no evidence of a difference.

### Stretch goals

In [0]:
# Define a function to return the mean of a column
def mean(col):
  return col.mean()

# Define a function to count the missing values
def missing(col):
  return col.isnull().sum()

# Define a function to test the null hypothesis
def test(col1, col2):
  return ttest_ind(col1, col2, nan_policy='omit')

In [86]:
# Test the functions on the column used for the second goal (budget)
c1 = rep['budget']
c2 = dem['budget']

# Print the means
print("The average of \"yes\" votes:")
print("Republicans:", mean(c1))
print("Democrats:", mean(c2))

# Print the missing values
print("\nNumber of missing values:")
print("Republicans:", missing(c1))
print("Democrats:", missing(c2))

# Test the null hypothesis
print("\nT-test:")
print(test(c1, c2))

The average of "yes" votes:
Republicans: 0.13414634146341464
Democrats: 0.8884615384615384

Number of missing values:
Republicans: 4
Democrats: 7

T-test:
Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
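
To rerun the same comparison on arbitrary issues, the helpers above could be folded into one function and applied to every vote column. This is a minimal sketch on a made-up toy DataFrame: the data, the `compare_parties` name, and the returned keys are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, party_col="party"):
    """Summarize one issue: party means, missing counts, and a 2-sample t-test."""
    rep_votes = df.loc[df[party_col] == "republican", issue]
    dem_votes = df.loc[df[party_col] == "democrat", issue]
    # Drop missing votes explicitly before testing
    t_stat, p_value = ttest_ind(rep_votes.dropna(), dem_votes.dropna())
    return {
        "issue": issue,
        "rep_mean": rep_votes.mean(),
        "dem_mean": dem_votes.mean(),
        "rep_missing": int(rep_votes.isnull().sum()),
        "dem_missing": int(dem_votes.isnull().sum()),
        "t_stat": t_stat,
        "p_value": p_value,
    }


# Toy data for illustration only (1 = yes, 0 = no, NaN = missing)
toy = pd.DataFrame({
    "party": ["republican"] * 6 + ["democrat"] * 6,
    "issue-a": [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, 0],
    "issue-b": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})

issues = toy.columns.drop("party")
summary = pd.DataFrame([compare_parties(toy, issue) for issue in issues])
print(summary)
```

With the real data, the same call could be made with `df` in place of `toy`, and the resulting table sorted by `p_value` to find issues for each goal.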


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn more thoroughly about this topic.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
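
The steps above can be sketched in plain Python without SciPy. This is one possible approach, not a reference implementation: the function names are hypothetical, the sample votes are made up, and the p-value is obtained by numerically integrating the t density with Simpson's rule instead of reading a table.

```python
import math


def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def two_sided_p_value(t_stat, dof, steps=10_000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even).

    Extremely small p-values are lost to floating-point rounding and come
    out as roughly 0.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_area = total * h / 3
    return min(1.0, max(0.0, 1 - 2 * central_area))


def one_sample_t_test(sample, popmean):
    """Return (t statistic, two-sided p-value), omitting NaN values."""
    values = [v for v in sample if not math.isnan(v)]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    std_error = math.sqrt(variance / n)
    t_stat = (mean - popmean) / std_error
    return t_stat, two_sided_p_value(t_stat, n - 1)


# Made-up votes (1 = yes, 0 = no, NaN = missing); H0: true support is 50%
votes = [1, 0, 1, 1, 0, 1, 1, 1, float("nan"), 0, 1]
t_stat, p_value = one_sample_t_test(votes, 0.5)
print("t statistic:", round(t_stat, 4))
print("p-value:", round(p_value, 4))
```

The result can be compared with `scipy.stats.ttest_1samp` on the same values (with NaN omitted) as a check.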

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
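
A minimal sketch of the 2-sample calculation in plain Python follows. It assumes the pooled-variance (equal variances) form of the test, which matches the default of `scipy.stats.ttest_ind`; the function names and the two toy samples are made up for illustration. It returns the t statistic and the degrees of freedom, which can then be looked up in a table or the applet.

```python
import math


def drop_missing(sample):
    """Return the sample without NaN values."""
    return [v for v in sample if not math.isnan(v)]


def two_sample_t_statistic(sample_a, sample_b):
    """Pooled-variance 2-sample t statistic and its degrees of freedom."""
    a = drop_missing(sample_a)
    b = drop_missing(sample_b)
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    # Sums of squared deviations from each group's own mean
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    dof = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / dof
    std_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_error, dof


# Made-up votes (1 = yes, 0 = no, NaN = missing)
group_a = [1, 1, 1, 1, 0, float("nan")]
group_b = [0, 0, 0, 1, 0, 0]

t_stat, dof = two_sample_t_statistic(group_a, group_b)
print("t statistic:", round(t_stat, 3))
print("degrees of freedom:", dof)
```

Swapping the order of the two groups flips the sign of the statistic but leaves a two-sided p-value unchanged.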

### Then check your answers using SciPy!