<a href="https://colab.research.google.com/github/unburied/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns

In [0]:
columns = ['party', 'handicapped_infants', 'water_project_cost_sharing', 
           'adoption_of_the_budget_resolution', 'physician_fee_freeze',
           'el_salvador_aid', 'religious_groups_in_schools', 
           'anti_satellite_test_ban', 'aid_to_nicaraguan_contras',
           'mx_missile', 'immigration', 'synfuels_corporation_cutback',
           'education_spending', 'superfund_right_to_sue' , 'crime' ,
           'duty_free_exports', 'export_administration_act_south_africa']

house_votes = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                         header = None, names =  columns)

In [16]:
house_votes.head()

Unnamed: 0,party,handicapped_infants,water_project_cost_sharing,adoption_of_the_budget_resolution,physician_fee_freeze,el_salvador_aid,religious_groups_in_schools,anti_satellite_test_ban,aid_to_nicaraguan_contras,mx_missile,immigration,synfuels_corporation_cutback,education_spending,superfund_right_to_sue,crime,duty_free_exports,export_administration_act_south_africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [17]:
house_votes[columns[1]].value_counts()

n    236
y    187
?     12
Name: handicapped_infants, dtype: int64

In [0]:
#Assign NA values based on the ratio in party votes
def clean(features):  
  
  #get vote counts to replace NAN values
  r_yays, r_nays = vote_counts('republican', features)
  d_yays, d_nays = vote_counts('democrat' , features)
  
  #filter down to '?' for current feature and replace the top number of rows
  #based on vote counts. Check to ensure vote counts are greater than zero
  if r_yays > 0:
    (house_votes.loc[(house_votes.party == 'republican') &
                  (house_votes[features] == '?'), features])[:r_yays] = 'y'
    (house_votes.loc[(house_votes.party == 'republican') &
                  (house_votes[features] == '?'), features]) = 'n'
  else:
    (house_votes.loc[(house_votes.party == 'republican') &
                  (house_votes[features] == '?'), features])[:r_nays] = 'n'
    (house_votes.loc[(house_votes.party == 'republican') &
                  (house_votes[features] == '?'), features]) = 'y'
    
  if d_yays > 0:
    (house_votes.loc[(house_votes.party == 'democrat') &
                  (house_votes[features] == '?'), features])[:d_yays] = 'y'
    (house_votes.loc[(house_votes.party == 'democrat') &
                  (house_votes[features] == '?'), features]) = 'n'
  else:
    (house_votes.loc[(house_votes.party == 'democrat') &
                  (house_votes[features] == '?'), features])[:d_nays] = 'n'
    (house_votes.loc[(house_votes.party == 'democrat') &
                  (house_votes[features] == '?'), features]) = 'y'
    
    
  

In [0]:
#return number of vote to fillNA based on party and feature
def vote_counts(party, feature):

    #get series of based off party and feature
    subset = house_votes[house_votes.party == party]
    
    #divide values from the series
    values = subset[feature].value_counts().to_dict()
    yays = values['y']
    nays = values['n']
    abstained = values['?']

    #convert above values to create NA replacement values
    #based on ratio of current votes
    vote_yay = int((yays / (yays + nays)) * abstained) 
    vote_nay = int((nays / (yays + nays)) * abstained)
  
    #ensure new values equal current NA sum
    if (vote_yay - vote_nay) > 0:
      while  (vote_yay + vote_nay) < abstained:
            vote_yay += 1
    else: 
      while (vote_yay + vote_nay) < abstained:
             vote_nay += 1

    return vote_yay, vote_nay

In [0]:
#CLean all '?' values in dataframe based on ratio of party votes
columns.pop(0)

for col in columns:
  clean(col)

In [21]:
house_votes[columns[0]].value_counts()

n    245
y    190
Name: handicapped_infants, dtype: int64

In [0]:
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [0]:
for col in columns:
  house_votes[col] = np.where(house_votes[col] == 'y', 1,0) 

In [24]:
house_votes.head()

Unnamed: 0,party,handicapped_infants,water_project_cost_sharing,adoption_of_the_budget_resolution,physician_fee_freeze,el_salvador_aid,religious_groups_in_schools,anti_satellite_test_ban,aid_to_nicaraguan_contras,mx_missile,immigration,synfuels_corporation_cutback,education_spending,superfund_right_to_sue,crime,duty_free_exports,export_administration_act_south_africa
0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
2,democrat,0,1,1,1,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1


In [0]:
def ptest(feature):
  repubs = house_votes.loc[(house_votes.party == 'republican') , feature]
  dems = house_votes.loc[(house_votes.party == 'democrat') , feature]

  sts, pval =  ttest_ind(repubs, dems)

  return sts, pval

In [0]:
def party_issues_test(feature):
  sts, pval = ptest(feature)
  
  if int(sts) < 0 and pval < .01:
    return 'Republican'
  elif int(sts) > 0 and pval <.01:
    return 'Democrat'
  else:
    return 'Center'

In [108]:
for col in columns:
  print("The", col, ' issue is a ', party_issues_test(col), ' supported issue')

The handicapped_infants  issue is a  Republican  supported issue
The water_project_cost_sharing  issue is a  Center  supported issue
The adoption_of_the_budget_resolution  issue is a  Republican  supported issue
The physician_fee_freeze  issue is a  Democrat  supported issue
The el_salvador_aid  issue is a  Democrat  supported issue
The religious_groups_in_schools  issue is a  Democrat  supported issue
The anti_satellite_test_ban  issue is a  Republican  supported issue
The aid_to_nicaraguan_contras  issue is a  Republican  supported issue
The mx_missile  issue is a  Republican  supported issue
The immigration  issue is a  Center  supported issue
The synfuels_corporation_cutback  issue is a  Republican  supported issue
The education_spending  issue is a  Democrat  supported issue
The superfund_right_to_sue  issue is a  Democrat  supported issue
The crime  issue is a  Democrat  supported issue
The duty_free_exports  issue is a  Republican  supported issue
The export_administration_act_s