## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), each with a class (democrat or republican) and 16 binary attributes (yes or no votes on specific issues). Be aware: there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)
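
For goal 1, it may help to see how much data each cleaning choice keeps. The sketch below uses a small made-up frame (the column names `issue-a` and `issue-b` are hypothetical, not from the data set) to compare dropping whole rows against dropping missing values one column at a time:

```python
import numpy as np
import pandas as pd

# Made-up votes for illustration only (1 = yes, 0 = no, NaN = missing)
votes = pd.DataFrame({
    'issue-a': [1, 0, np.nan, 1],
    'issue-b': [0, np.nan, 1, 1],
})

# Count the missing values in each column
missing_per_issue = votes.isna().sum()

# Dropping every row that has any missing value discards more data...
complete_rows = votes.dropna()

# ...than dropping missing values for one issue at a time
issue_a_only = votes['issue-a'].dropna()

print(missing_per_issue)
print(len(complete_rows), len(issue_a_only))
```

Dropping per column keeps a congressperson's recorded votes even when they abstained or were absent on other issues.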

Note that this data calls for *two-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
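
As a rough illustration of a two-sample t-test on a single issue, here is a minimal sketch using `scipy.stats.ttest_ind`. The two vote arrays are invented for the example and are not taken from the data set:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes on one issue (1 = yes, 0 = no, NaN = missing); not real data
dem_votes = np.array([1, 1, 0, 1, np.nan, 1, 1, 0])
rep_votes = np.array([0, 0, 1, 0, 0, np.nan, 0, 1])

# nan_policy='omit' ignores the missing votes when computing the statistic
t_stat, p_value = ttest_ind(dem_votes, rep_votes, nan_policy='omit')

# A positive t statistic means the first group has the higher mean,
# i.e. a larger share of yes votes
print(t_stat, p_value)
```

The sign of the t statistic depends on the argument order, so it is worth fixing one order (here, democrats first) and keeping it throughout.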

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
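
One possible shape for the refactoring in stretch goal 1 is sketched below. The function name `compare_groups`, its parameters, and the toy data are all assumptions made for illustration; this is one way to do it, not a required design:

```python
import pandas as pd
from scipy.stats import ttest_ind


def compare_groups(df, group_col, group_a, group_b, value_cols):
    """Run a two-sample t-test of group_a vs. group_b on each column in value_cols."""
    a = df[df[group_col] == group_a]
    b = df[df[group_col] == group_b]
    rows = {}
    for col in value_cols:
        # Drop missing values per column, so a missing vote on one issue
        # does not remove the row from the tests on other issues
        t_stat, p_value = ttest_ind(a[col].dropna(), b[col].dropna())
        rows[col] = {'t_stat': float(t_stat), 'p_value': float(p_value)}
    return pd.DataFrame.from_dict(rows, orient='index')


# Tiny made-up frame for illustration only
toy = pd.DataFrame({
    'party': ['democrat'] * 5 + ['republican'] * 5,
    'issue-x': [1, 1, 1, 0, None, 0, 0, 1, 0, 0],
})
results = compare_groups(toy, 'party', 'democrat', 'republican', ['issue-x'])
print(results)
```

Passing the grouping column and group labels as arguments means the same function could be reused on other data sets with a two-level categorical column.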

In [0]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data -o voting_data.csv

import pandas as pd

# data set has no header; this will fix that
column_headers = ['political-party', 'handicapped-infants', 
                  'water-project-cost-sharing', 
                  'adoption-of-the-budget-resolution', 'physician-fee-freeze', 
                  'el-salvador-aid', 'religious-groups-in-schools',
                  'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 
                  'mx-missile', 'immigration', 'synfuels-corporation-cutback', 
                  'education-spending', 'superfund-right-to-sue', 'crime', 
                  'duty-free-exports', 
                  'export-administration-act-south-africa']

# missing votes are stored as '?'; read them in as NaN
voting_data = pd.read_csv('voting_data.csv', names=column_headers, 
                          na_values='?')

# encode 'y' as 1 and 'n' as 0, so each column's mean is the share of yes votes
voting_data.replace({'y': 1, 'n': 0}, inplace=True)

voting_data.head()
# looks good

Unnamed: 0,political-party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
from scipy.stats import ttest_ind

In [0]:
# creating separate republican and democratic dataframes
rep = voting_data[voting_data['political-party'] == 'republican']
dem = voting_data[voting_data['political-party'] == 'democrat']

# creating empty lists for a future t-test results dataframe...
t_stats = []
p_values = []

# now we can fill those lists, running each t-test once and omitting NaNs
# (democrats first, so a positive t statistic means more democratic support)
for column_header in column_headers[1:]:
    t_stat, p_value = ttest_ind(dem[column_header], rep[column_header],
                                nan_policy='omit')
    t_stats.append(t_stat)
    p_values.append(p_value)

# we can create a dictionary from the lists with which to create the dataframe
# (named results so it does not shadow the built-in dict)
results = {'t_stats': t_stats, 'p_values': p_values}

# and we can create the dataframe, using the issue names for the index
t_test_results = pd.DataFrame(results, index=column_headers[1:])

t_test_results

Unnamed: 0,t_stats,p_values
handicapped-infants,9.205264,1.61344e-18
water-project-cost-sharing,-0.088965,0.9291557
adoption-of-the-budget-resolution,23.212777,2.07034e-77
physician-fee-freeze,-49.367082,1.994262e-177
el-salvador-aid,-21.136693,5.60052e-68
religious-groups-in-schools,-9.737576,2.393672e-20
anti-satellite-test-ban,12.526188,8.521033e-31
aid-to-nicaraguan-contras,18.052093,2.824718e-54
mx-missile,16.437503,5.030793e-47
immigration,-1.735912,0.08330248


In [0]:
print("Democrats were more likely to support the bills on:")
condition = ((t_test_results['p_values'] < .01) & (t_test_results['t_stats'] > 0))
print("\t","\n\t".join(t_test_results[condition].index.tolist()), sep='')
print("with a p-value of less than .01\n")

print("Republicans were more likely to support the bills on:")
condition = ((t_test_results['p_values'] < .01) & (t_test_results['t_stats'] < 0))
print("\t","\n\t".join(t_test_results[condition].index.tolist()), sep='')
print("with a p-value of less than .01\n")

print("There is not that much of a difference between Democrats and Republicans on:")
condition = (t_test_results['p_values'] > .1)
print("\t","\n\t".join(t_test_results[condition].index.tolist()), sep='')
print("When we ran a t-test on the above, the p-value was greater than .1.\n")

print("The one issue not responsive to the three questions is:")
condition = ((t_test_results['p_values'] > .01) & (t_test_results['p_values'] < .1))
print("\t","\n\t".join(t_test_results[condition].index.tolist()), sep='')
print("When we ran a t-test on the above, the p-value was greater than .01 but less than .1.")

Democrats were more likely to support the bills on:
	handicapped-infants
	adoption-of-the-budget-resolution
	anti-satellite-test-ban
	aid-to-nicaraguan-contras
	mx-missile
	synfuels-corporation-cutback
	duty-free-exports
	export-administration-act-south-africa
with a p-value of less than .01

Republicans were more likely to support the bills on:
	physician-fee-freeze
	el-salvador-aid
	religious-groups-in-schools
	education-spending
	superfund-right-to-sue
	crime
with a p-value of less than .01

There is not that much of a difference between Democrats and Republicans on:
	water-project-cost-sharing
When we ran a t-test on the above, the p-value was greater than .1.

The one issue not responsive to the three questions is:
	immigration
When we ran a t-test on the above, the p-value was greater than .01 but less than .1.
