<a href="https://colab.research.google.com/github/ryanleeallred/DS-Unit-1-Sprint-2-Statistics/blob/master/module1/LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data contains 435 instances (one per congressperson), each with a class label (democrat or republican) and 16 binary attributes (a yes or no vote on each issue). Be aware: there are missing values, recorded as `?` in the raw file.

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
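
For reference, here is the statistic that `scipy.stats.ttest_ind` computes with its default `equal_var=True` setting (the pooled-variance form). The symbols are introduced here for illustration: $\bar{x}_i$, $s_i^2$, and $n_i$ are the mean, sample variance, and number of non-missing votes in group $i$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The statistic has $n_1 + n_2 - 2$ degrees of freedom, and its sign depends on which group is passed first.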

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, ttest_1samp


In [254]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-10-08 02:20:30--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2019-10-08 02:20:31 (288 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [0]:
# Load Data
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])

In [0]:
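# Encode votes numerically: '?' (missing) -> NaN, 'y' -> 1, 'n' -> 0
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df.replace({'?': np.nan, 'y': 1, 'n': 0})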



In [0]:
# Split the data into one DataFrame per party

republicans = df[df.party == "republican"]
democrats = df[df.party == "democrat"]

In [0]:

issues_list = df.columns.to_list()
issues_list.remove('party')

In [0]:
# Define a function to compare the means between parties

def compare_means(issues):
  """Print the mean 'yes' rate of each party for every issue in `issues`."""
  # ANSI escape codes for styling the printed text
  BOLD = '\033[1m'
  END = '\033[0m'
  UNDERLINE = '\033[4m'

  d = BOLD + "Dem mean is: " + END
  r = BOLD + "Rep mean is: " + END

  for issue in issues:
    # Series.mean() skips NaN by default, so missing votes are ignored
    dem_mean = str(democrats[issue].mean())
    rep_mean = str(republicans[issue].mean())

    issue_str = UNDERLINE + "\tIssue: " + str(issue).title().replace("-", " ") + END

    print(issue_str +
          "\n" + d + dem_mean +
          "\n" + r + rep_mean)
    print("\n" + "-~"*20)


In [280]:
compare_means(issues_list)

	Issue: Handicapped Infants
Dem mean is: 0.6046511627906976
Rep mean is: 0.18787878787878787

-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
	Issue: Water Project
Dem mean is: 0.502092050209205
Rep mean is: 0.5067567567567568

-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
	Issue: Budget
Dem mean is: 0.8884615384615384
Rep mean is: 0.13414634146341464

-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
	Issue: Physician Fee Freeze
Dem mean is: 0.05405405405405406
Rep mean is: 0.9878787878787879

-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
	Issue: El Salvador Aid
Dem mean is: 0.21568627450980393
Rep mean is: 0.9515151515151515

-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
	Issue: Religious Groups
Dem mean is: 0.47674418604651164
Rep mean is: 0.8975903614457831

-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~
	Issue: Anti Satellite Ban
Dem mean is: 0.772200772200772

In [261]:
# Find an issue that republicans support more than democrats

ttest_ind(republicans['physician-fee-freeze'], 
          democrats['physician-fee-freeze'], 
          nan_policy='omit')

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

In [262]:
# Find an issue that democrats support more than republicans

ttest_ind(republicans['aid-to-contras'], 
          democrats['aid-to-contras'],
          nan_policy='omit')

Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54)

In [263]:
## Using hypothesis testing, find an issue where the difference between republicans
## and democrats has p > 0.1 (i.e. there may not be much of a difference)

ttest_ind(republicans['water-project'],
          democrats['water-project'], 
          nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

In [0]:
# Define a function to format the result of ttest_ind()
def ttest_str_clean(result):
  # Read the fields directly rather than parsing the repr string,
  # whose format differs between SciPy versions
  str_out = f"Statistic={float(result.statistic)} Pvalue={float(result.pvalue)}".title()

  return str_out

# Define a function to apply ttest_ind() to issues_list
def two_sample_ttest(issues):
  # ANSI escape codes for styling the printed text
  BOLD = '\033[1m'
  END = '\033[0m'
  UNDERLINE = '\033[4m'

  for issue in issues:
    # Republicans are passed first, so a positive statistic means
    # republicans voted 'yes' at a higher rate than democrats
    ttest_result = ttest_ind(republicans[issue],
                             democrats[issue],
                             nan_policy='omit')

    ttest_str = ttest_str_clean(ttest_result)

    issue_str = "\t\t" + UNDERLINE + "Issue:" + BOLD + str(issue).title().replace("-", " ") + END

    print(issue_str +
          "\n" + ttest_str +
          "\n" + "-~~-"*20)


In [299]:
two_sample_ttest(issues_list)

		Issue:Handicapped Infants
Statistic=-9.205264294809222 Pvalue=1.613440327937243E-18
-~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~-
		Issue:Water Project
Statistic=0.08896538137868286 Pvalue=0.9291556823993485
-~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~-
		Issue:Budget
Statistic=-23.21277691701378 Pvalue=2.0703402795404463E-77
-~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~-
		Issue:Physician Fee Freeze
Statistic=49.36708157301406 Pvalue=1.994262314074344E-177
-~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~-
		Issue:El Salvador Aid
Statistic=21.13669261173219 Pvalue=5.600520111729011E-68
-~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~-
		Issue:Religious Groups
Statistic=9.737575825219457 Pvalue=2.3936722520597287E-20
-~~--~~--~~--~~--~~--~~--~~--~~--~

## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn the topic more thoroughly.

### Start with a 1-sample t-test
 - Establish the conditions for your test
 - [Calculate the t-statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (you'll need to omit NaN values from your sample).
 - Translate that t-statistic into a p-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html).
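
The steps above can be sketched with only the Python standard library. This is a minimal illustration, not a reference implementation: the function names (`one_sample_t`, `t_pdf`, `two_sided_p_value`) are hypothetical, the null value of 0.5 is an arbitrary choice, and the sample counts are an assumption picked to be consistent with the democrat mean printed for Handicapped Infants earlier. The p-value is approximated by integrating the t density with Simpson's rule instead of reading a table, so very small p-values lose precision.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test."""
    # Omit missing votes, which the notebook encodes as NaN
    clean = [x for x in sample if not math.isnan(x)]
    n = len(clean)
    mean = statistics.mean(clean)
    std = statistics.stdev(clean)  # sample standard deviation (divides by n - 1)
    t_stat = (mean - popmean) / (std / math.sqrt(n))
    return t_stat, n - 1

def t_pdf(x, df):
    """Density of Student's t distribution, computed in log space for stability."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate the two-sided p-value as 1 - 2 * (area from 0 to |t|)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

# Assumed counts (not read from the file): 156 'yes', 102 'no', 9 missing
sample = [1.0] * 156 + [0.0] * 102 + [float("nan")] * 9
t_stat, df = one_sample_t(sample, popmean=0.5)
print("t statistic:", t_stat)
print("degrees of freedom:", df)
print("approximate two-sided p-value:", two_sided_p_value(t_stat, df))
```

Comparing the printed values against `scipy.stats.ttest_1samp` on the same data is a quick way to check the arithmetic.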

### Then try a 2-sample t-test
 - Establish the conditions for your test
 - [Calculate the t-statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (you'll need to omit NaN values from your sample).
 - Translate that t-statistic into a p-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html).
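
One possible sketch of the 2-sample calculation, again using only the standard library. It assumes the pooled-variance (equal variance) form, which is what `ttest_ind` uses by default; the linked formula may use a different variant. The helper names (`drop_nan`, `two_sample_t`) are hypothetical, and the vote counts are an assumption chosen to be consistent with the party means printed for Handicapped Infants earlier, not values read from the file.

```python
import math
import statistics

def drop_nan(sample):
    """Return the sample with missing (NaN) votes removed."""
    return [x for x in sample if not math.isnan(x)]

def two_sample_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance 2-sample t-test."""
    a, b = drop_nan(sample_a), drop_nan(sample_b)
    n_a, n_b = len(a), len(b)
    # Sample variances (divide by n - 1)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Pool the two variances, weighting each by its degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t_stat, n_a + n_b - 2

nan = float("nan")
# Assumed counts (not read from the file):
# republicans 31 'yes', 134 'no', 3 missing; democrats 156 'yes', 102 'no', 9 missing
republican_votes = [1.0] * 31 + [0.0] * 134 + [nan] * 3
democrat_votes = [1.0] * 156 + [0.0] * 102 + [nan] * 9

t_stat, df = two_sample_t(republican_votes, democrat_votes)
print("t statistic:", t_stat)
print("degrees of freedom:", df)
```

If the assumed counts are right, the statistic should land close to the value `ttest_ind` printed for Handicapped Infants (about -9.205). The sign is negative because republicans are passed first and supported the issue less often.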

### Then check your answers using SciPy!