<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
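
As a rough illustration of what a 2 sample t-test looks like in code, here is a minimal sketch using `scipy.stats.ttest_ind`. The two arrays are made-up votes (1 = yes, 0 = no) for two hypothetical groups; they are not taken from the congressional dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes/no votes (1 = yes, 0 = no) for two made-up groups.
group_a = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# Two-sample t-test: compares the mean of group_a against the mean of group_b.
result = ttest_ind(group_a, group_b)

# A positive statistic means group_a has the higher mean.
print(result.statistic, result.pvalue)
```

The mean of a 0/1 column is the share of "yes" votes, so comparing means here amounts to comparing levels of support.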

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
#imports
import pandas as pd
from scipy.stats import ttest_ind
import numpy as np

In [0]:
#going to pull the headers from the .names file
columns = ['party', 'handicapped-infants', 'water-project-cost-sharing', 'adoption-of-the-budget-resolution', 'physician-fee-freeze',
           'el-salvador-aid', 'religious-groups-in-schools', 'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
           'immigration', 'synfuels-corporation-cutback', 'education-spending', 'superfund-right-to-sue', 'crime', 'duty-free-exports',
           'export-administration-act-south-africa']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 header = None, names = columns, na_values = '?')
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
#okay looks like we are all still strings with nan values.
df.dtypes

party                                     object
handicapped-infants                       object
water-project-cost-sharing                object
adoption-of-the-budget-resolution         object
physician-fee-freeze                      object
el-salvador-aid                           object
religious-groups-in-schools               object
anti-satellite-test-ban                   object
aid-to-nicaraguan-contras                 object
mx-missile                                object
immigration                               object
synfuels-corporation-cutback              object
education-spending                        object
superfund-right-to-sue                    object
crime                                     object
duty-free-exports                         object
export-administration-act-south-africa    object
dtype: object

In [0]:
#let's convert y & n to 1 & 0 for easier analysis.
df.replace({'y': 1, 'n': 0}, inplace = True)

In [0]:
#I keep getting floats instead of ints.  That's because the columns still contain NaN, and NaN is a float,
#so pandas stores the whole column as float64.  Doesn't really matter for the t-tests.
df.dtypes

party                                      object
handicapped-infants                       float64
water-project-cost-sharing                float64
adoption-of-the-budget-resolution         float64
physician-fee-freeze                      float64
el-salvador-aid                           float64
religious-groups-in-schools               float64
anti-satellite-test-ban                   float64
aid-to-nicaraguan-contras                 float64
mx-missile                                float64
immigration                               float64
synfuels-corporation-cutback              float64
education-spending                        float64
superfund-right-to-sue                    float64
crime                                     float64
duty-free-exports                         float64
export-administration-act-south-africa    float64
dtype: object
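
On why these columns come out as `float64` rather than an integer type: NaN is itself a floating-point value, so a column that holds 1s, 0s, and NaN is stored as floats. The sketch below shows this on a tiny made-up Series, and also shows pandas' nullable `Int64` dtype as one possible alternative. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A tiny made-up vote column with one missing value.
votes = pd.Series(['y', 'n', np.nan]).map({'y': 1, 'n': 0})
print(votes.dtype)  # NaN forces a float dtype

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
nullable = votes.astype('Int64')
print(nullable.dtype)
print(nullable.isna().sum())  # number of missing entries
```

For the t-tests in this notebook the float representation works fine, so the conversion is optional.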

In [0]:
#okay lastly let's split these into dataframes for each party, since we are comparing party voting patterns.
#I think I need to do deep copies for how my code is going to work
rep = df[df['party'] == 'republican'].copy(deep = True)
dem = df[df['party'] == 'democrat'].copy(deep = True)

In [0]:
print(rep.shape)
rep.head()

(168, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
print(dem.shape)
dem.head()

(267, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [0]:
#okay, so we don't need to have the party column any more, since we split them up by parties.
rep.drop('party', axis =1, inplace = True)
dem.drop('party', axis = 1, inplace = True)

In [0]:
#lets see if we dropped that party column
rep.head()

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
dem.head()

Unnamed: 0,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
2,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


Okay so, after reading the names file provided by UCI, the NaN values are not "missing data" per se, but could represent politicians abstaining from voting.  So I'll leave the NaNs in and just omit them for each issue during testing.  Dropping every row that contains a NaN would also throw away that person's votes on all the other issues, which I feel would leave a less complete dataset.  Maybe that doesn't matter with my analysis method.
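
To make "omit them during testing" concrete, here is a small sketch on made-up data. It suggests that passing `nan_policy='omit'` to `ttest_ind` gives the same p-value as dropping the NaNs from each column by hand first. The Series `a` and `b` are hypothetical, not real votes.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Made-up vote columns with a few NaNs in each.
a = pd.Series([1, 0, 1, np.nan, 1, 1])
b = pd.Series([0, 0, np.nan, 1, 0, np.nan])

# Let scipy skip the NaNs for us...
omit = ttest_ind(a, b, nan_policy='omit')
# ...or drop them manually, column by column, before testing.
manual = ttest_ind(a.dropna(), b.dropna())

print(float(omit.pvalue), float(manual.pvalue))
print(a.isna().sum(), b.isna().sum())  # how many votes were skipped per group
```

Because the NaNs are dropped per column, a congressperson who skipped one vote still counts on every other issue.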

Okay so goals 2, 3, and 4 have a common element: first, we need to identify whether there is a statistically significant difference in the vote between parties.

Then, after identifying the columns that do have a statistically significant difference in support between the two parties, I will look at which party favored each one.
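
The two-step plan above could be sketched as a small helper. This is only an illustration under stated assumptions: `classify_issue` is a hypothetical name, the Series are made-up votes, and the direction is read from the sign of the t statistic (positive when the first group has the higher mean).

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def classify_issue(group_a, group_b, alpha=0.01):
    """Return 'a' or 'b' for the group with significantly more support, else 'neither'."""
    result = ttest_ind(group_a, group_b, nan_policy='omit')
    # Step 1: is there a significant difference at all?
    if result.pvalue >= alpha:
        return 'neither'
    # Step 2: the sign of the statistic says which group's mean is higher.
    return 'a' if result.statistic > 0 else 'b'

# Made-up votes: strong support in one group, weak support in the other.
high = pd.Series([1] * 18 + [0] * 2 + [np.nan])
low = pd.Series([0] * 17 + [1] * 3)
# Made-up votes with identical support levels.
even_1 = pd.Series([1, 0] * 10)
even_2 = pd.Series([0, 1] * 10)

print(classify_issue(high, low))
print(classify_issue(low, high))
print(classify_issue(even_1, even_2))
```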

In [0]:
#So first step is to prove-out the method.
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy = 'omit').pvalue <.01

True

First step, for each of the bills:

Null Hypothesis: There is no difference in levels of support between democrats and republicans for each of the bills in the 1984 dataset.

Alternative Hypothesis: Levels of support differ between the two parties. I'll reject the null hypothesis when p < 0.01, i.e. at a 99% confidence level.
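
Written out more formally, the hypotheses for a single bill might look like the following. The notation is my own assumption: $\mu_R$ and $\mu_D$ are the true proportions of yes votes among republicans and democrats, $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and number of non-missing votes in each group, and $\alpha$ is the significance level. The statistic shown is the pooled-variance form, which is what `ttest_ind` computes with its default `equal_var=True`.

$$H_0: \mu_R = \mu_D \qquad H_1: \mu_R \neq \mu_D$$

$$t = \frac{\bar{x}_R - \bar{x}_D}{s_p \sqrt{\dfrac{1}{n_R} + \dfrac{1}{n_D}}}, \qquad s_p^2 = \frac{(n_R - 1)\,s_R^2 + (n_D - 1)\,s_D^2}{n_R + n_D - 2}$$

$$\text{Reject } H_0 \text{ when } p < \alpha = 0.01$$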

In [0]:
#okay so lets just loop through them.  We know that they have the same columns so lets just grab it from one
reject_null = [] #is this pythonic?
fail_reject_null = []

for column in rep.columns:
  if ttest_ind(rep[column], dem[column], nan_policy = 'omit').pvalue < .01:
    reject_null.append(column)
  else:
    fail_reject_null.append(column)

print(reject_null)
print(fail_reject_null)


['handicapped-infants', 'adoption-of-the-budget-resolution', 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups-in-schools', 'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile', 'synfuels-corporation-cutback', 'education-spending', 'superfund-right-to-sue', 'crime', 'duty-free-exports', 'export-administration-act-south-africa']
['water-project-cost-sharing', 'immigration']


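Question 4:
At the 0.01 threshold I failed to reject the null hypothesis for 'water-project-cost-sharing' and 'immigration'. Note that this only shows p >= 0.01 for those two; goal 4 asks for p > 0.1, so their actual p-values still need to be checked.

For the rest, I reject the null hypothesis in favor of the alternative: there is a significant difference in party support.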
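
Since the loop above only records whether p fell below 0.01, one way to check goal 4's p > 0.1 condition is to keep the p-values themselves. The sketch below does that on made-up data; `pvalue_table`, `toy_a` and `toy_b` are hypothetical names, and in the notebook the `rep` and `dem` dataframes would presumably take their place.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def pvalue_table(group_a, group_b):
    """Return a Series of two-sample t-test p-values, one per column of group_a."""
    pvalues = {
        column: float(ttest_ind(group_a[column], group_b[column],
                                nan_policy='omit').pvalue)
        for column in group_a.columns
    }
    return pd.Series(pvalues).sort_values()

# Made-up votes: 'issue-1' splits the groups sharply, 'issue-2' does not.
toy_a = pd.DataFrame({
    'issue-1': [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    'issue-2': [1, 0, 1, 0, 1, 0, 1, 0, np.nan, 1, 0, 1],
})
toy_b = pd.DataFrame({
    'issue-1': [0] * 10 + [1, 0],
    'issue-2': [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, np.nan, 1],
})

table = pvalue_table(toy_a, toy_b)
print(table)
# Issues with no evidence of a difference, by goal 4's p > 0.1 criterion.
print(table[table > 0.1].index.tolist())
```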

In [0]:
#Is step 2 as simple as comparing to see who came out on top? Seems deceptive, but the two-sided t-test only
#tells us the means differ, not which party is higher, so I'll compare the means. Let me just test this out here:
dem_support = []
rep_support = []
fail_reject_null = []

for column in rep.columns:
  if ttest_ind(rep[column], dem[column], nan_policy = 'omit').pvalue < .01:

    #I can do this because the initial conditional will weed out our mostly
    #equal voting records
    if rep[column].mean() > dem[column].mean():
      rep_support.append(column)
    else:
      dem_support.append(column)
    
  else:
    fail_reject_null.append(column)

print("Democrats support more than republicans: " + ', '.join(dem_support))
print("Republicans support more than democrats: " + ', '.join(rep_support))
print("No significant difference at p < 0.01: " + ', '.join(fail_reject_null))

Democrats support more than republicans: handicapped-infants, adoption-of-the-budget-resolution, anti-satellite-test-ban, aid-to-nicaraguan-contras, mx-missile, synfuels-corporation-cutback, duty-free-exports, export-administration-act-south-africa
Republicans support more than democrats: physician-fee-freeze, el-salvador-aid, religious-groups-in-schools, education-spending, superfund-right-to-sue, crime
No significant difference at p < 0.01: water-project-cost-sharing, immigration


In [0]:
#Okay lets take a stab at turning this into a function.  Seems like if I were to do this as a function, I'd need to break this up
#a little bit at least.