

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues show "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The dataset contains 435 instances (one per congressperson), each with a class (democrat or republican) and 16 binary attributes (a yes or no vote on each issue). Be aware: there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
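
As a minimal sketch of what a two-sample test looks like in practice, the snippet below runs `scipy.stats.ttest_ind` on two made-up groups of yes/no votes. The arrays `group_a` and `group_b` are hypothetical and are not drawn from the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes (1) / no (0) votes for two groups; not taken from the dataset.
group_a = np.array([1] * 40 + [0] * 10)  # 80% yes
group_b = np.array([1] * 10 + [0] * 40)  # 20% yes

# Two-sample t-test: compares the means of the two groups directly.
stat, pval = ttest_ind(group_a, group_b)

# A positive statistic means the first group's mean is the higher one.
print(f"t = {stat:.2f}, p = {pval:.2g}")
```

The sign of the statistic depends on argument order, so swapping the two groups flips the sign but leaves the p-value unchanged.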

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

In [17]:
names = col_names = ['party','Disabled Infants', 'Water Project', 'Budget Resolution', 'Physician Fee', 'El Salvador Aid', 'Religion-Schools', 'Anti-Satellite Test', 'Nicaragua Contras Aid', 'MX Missile', 'Immigration', 'Synfuels Cutback', 'Education Spending', 'Superfund Right To Sue', 'Crime', 'Duty Free Exports', 'Export Admin S. Africa']
gov = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                  na_values='?', header=None, names=names)
gov.head()

Unnamed: 0,party,Disabled Infants,Water Project,Budget Resolution,Physician Fee,El Salvador Aid,Religion-Schools,Anti-Satellite Test,Nicaragua Contras Aid,MX Missile,Immigration,Synfuels Cutback,Education Spending,Superfund Right To Sue,Crime,Duty Free Exports,Export Admin S. Africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y
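
Before choosing how to handle the gaps, it can help to count them. The sketch below rebuilds a few columns of the five preview rows by hand (so it runs without a network connection) and counts missing votes overall and per party. The name `sample` is an assumption for illustration; on the full data the same calls would be applied to `gov`.

```python
import numpy as np
import pandas as pd

# First five rows of three issue columns, copied from the preview above.
sample = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'Disabled Infants': ['n', 'n', np.nan, 'n', 'y'],
    'Water Project': ['y', 'y', 'y', 'y', 'y'],
    'Physician Fee': ['y', 'y', np.nan, 'n', 'n'],
})

# Missing votes per column across everyone.
missing_total = sample.isna().sum()

# Missing votes per issue column, split by party.
missing_by_party = (
    sample.drop(columns='party').isna().groupby(sample['party']).sum()
)

print(missing_total)
print(missing_by_party)
```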


In [0]:
reps = gov[gov['party'] == 'republican']
dems = gov[gov['party'] == 'democrat']

In [0]:
reps.head()

In [0]:
dems.head()

In [0]:
def convert_to_bool(x):
  """Encode a vote as 1 ('y') or 0 ('n'); return anything else (NaN, party name) unchanged."""
  if str(x) == 'y':
    return 1
  if str(x) == 'n':
    return 0
  return x
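
An alternative to the element-wise function above is a dictionary lookup with `Series.map`, sketched here on a hypothetical `votes` series. Unlike `convert_to_bool`, `map` turns any value that is not a dictionary key into NaN, so it should only be applied to the vote columns, not to `party`.

```python
import numpy as np
import pandas as pd

# Hypothetical votes, including one missing value.
votes = pd.Series(['y', 'n', np.nan, 'y'])

# Values that are not keys of the dict (here the NaN) come back as NaN.
encoded = votes.map({'y': 1, 'n': 0})
print(encoded.tolist())
```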

In [0]:
# Note: applymap is deprecated since pandas 2.1; DataFrame.map replaces it there.
dems = dems.applymap(convert_to_bool)
reps = reps.applymap(convert_to_bool)

In [0]:
reps.head()

In [0]:
dems.head()

In [0]:
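def fill_mode(column_names, frame):
  # Fill each issue column (all but 'party') with its most common vote.
  for name in column_names[1:]:
    # Assign back instead of using inplace=True on a column selection,
    # which may not update the frame in recent pandas versions.
    frame[name] = frame[name].fillna(frame[name].mode()[0])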


In [0]:
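def fill_zero(column_names, frame):
  # Treat every missing vote as a 'no' (0).
  for name in column_names[1:]:
    # Assign back; inplace=True on a column selection may not update the frame.
    frame[name] = frame[name].fillna(0)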

In [0]:
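def fill_mean(column_names, frame):
  # Fill each issue column with the mean of its observed votes.
  for name in column_names[1:]:
    # Assign back; inplace=True on a column selection may not update the frame.
    frame[name] = frame[name].fillna(frame[name].mean())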
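
To see why the choice of fill method can change a test result, the sketch below applies each strategy to one small, hypothetical column of votes and compares the resulting means and sample sizes. The series `votes` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 2 yes, 1 no, 2 missing.
votes = pd.Series([1, 1, 0, np.nan, np.nan])

filled = {
    'exclude nan': votes.dropna(),
    'zero fill': votes.fillna(0),
    'mean fill': votes.fillna(votes.mean()),
    'mode fill': votes.fillna(votes.mode()[0]),
}

# Each strategy implies a different mean and/or sample size for the t-test.
for method, series in filled.items():
    print(f'{method}: mean = {series.mean():.3f}, n = {len(series)}')
```

Zero fill pulls the mean toward "no" and mode fill pushes it toward the majority vote. Mean fill keeps the mean but increases the sample size without adding information, which shrinks the estimated variance.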

In [0]:
def do_ttests(column_names, dems, reps):
  # Collect issue names by outcome.
  dems_support, reps_support, bipartisan = [], [], []

  # Iterate through the issue columns (skip 'party').
  for name in column_names[1:]:

    # Two-sample t-test on this column; NaN votes are ignored.
    stat, pval = ttest_ind(dems[name], reps[name], nan_policy='omit')

    # Set a boolean indicating whether the p value is at or below the 0.01 threshold.
    sig = pval <= 0.01
    if sig:

      # Dems vote yes at a higher rate (stat > 0) and a majority of dems vote yes.
      if stat > 0 and (sum(dems[name] == 1) > sum(dems[name] == 0)):
        dems_support.append(name)

      # Reps vote yes at a higher rate (stat < 0) and a majority of reps vote yes.
      elif stat < 0 and (sum(reps[name] == 1) > sum(reps[name] == 0)):
        reps_support.append(name)

      # Note: a significant issue that meets neither majority condition lands in no list.

    # If no significant difference is found at the 0.01 level, add to bipartisan list.
    else:
      bipartisan.append(name)
  print('Dem supported issues: {}'.format(', '.join(dems_support)))
  print('Rep supported issues: {}'.format(', '.join(reps_support)))
  print('Bipartisan issues: {}\n'.format(', '.join(bipartisan)))
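
In `do_ttests`, an issue with a significant difference is only listed when the party with the higher yes rate also has more yes votes than no votes, so an issue that both parties mostly oppose at different rates appears in none of the three lists. One possible variant, sketched below, labels every issue by the sign of the statistic alone. The helper `classify_issue`, its labels, and the vote arrays are hypothetical, not part of the notebook.

```python
import numpy as np
from scipy.stats import ttest_ind

def classify_issue(dem_votes, rep_votes, alpha=0.01):
    """Hypothetical helper: label one issue using only the test result."""
    stat, pval = ttest_ind(dem_votes, rep_votes, nan_policy='omit')
    if pval > alpha:
        return 'no significant difference'
    # The sign of the statistic says which group has the higher yes rate.
    return 'dems more supportive' if stat > 0 else 'reps more supportive'

# Hypothetical votes: both parties mostly vote no, but at clearly different rates.
dem = np.array([1] * 40 + [0] * 60 + [np.nan] * 5)
rep = np.array([1] * 5 + [0] * 95)

print(classify_issue(dem, rep))
```

With this labeling every issue falls into exactly one category, at the cost of no longer distinguishing "supports more" from "opposes less".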


In [0]:
def report_vote(columns, dems, reps):
  for column in columns[1:]:
    print(f'Issue: {column}')
    print(f'Dems: Yea {sum(dems[column] == 1)}, Nay {sum(dems[column] == 0)}')
    print(f'Reps: Yea {sum(reps[column] == 1)}, Nay {sum(reps[column] == 0)}\n')
    

In [0]:
def do_it_all(data):
  columns = data.columns
  
  dems = data[data['party'] == 'democrat']
  reps = data[data['party'] == 'republican']
  
  dems = dems.applymap(convert_to_bool)
  reps = reps.applymap(convert_to_bool)
  
  # Make copies and use various fill methods.
  dems_zero = dems.copy()
  fill_zero(columns, dems_zero)
  
  reps_zero = reps.copy()
  fill_zero(columns, reps_zero)
  
  dems_mean = dems.copy()
  fill_mean(columns, dems_mean)
  
  reps_mean = reps.copy()
  fill_mean(columns, reps_mean)
  
  dems_mode = dems.copy()
  fill_mode(columns, dems_mode)
  
  reps_mode = reps.copy()
  fill_mode(columns, reps_mode)
  
  for name, dem, rep in zip(['exclude nan', 'zero fill', 'mean fill', 'mode fill'],
                            [dems, dems_zero, dems_mean, dems_mode],
                            [reps, reps_zero, reps_mean, reps_mode]):
    print(f'Fill NA method: {name}')
    # report_vote(columns, dem, rep)  # Uncomment to print raw vote counts.
    do_ttests(columns, dem, rep)

In [60]:
do_it_all(gov)

Fill NA method: exclude nan
Dem supported issues: Disabled Infants, Budget Resolution, Anti-Satellite Test, Nicaragua Contras Aid, MX Missile, Synfuels Cutback, Duty Free Exports, Export Admin S. Africa
Rep supported issues: Physician Fee, El Salvador Aid, Religion-Schools, Education Spending, Superfund Right To Sue, Crime
Bipartisan issues: Water Project, Immigration

Fill NA method: zero fill
Dem supported issues: Disabled Infants, Budget Resolution, Anti-Satellite Test, Nicaragua Contras Aid, MX Missile, Duty Free Exports
Rep supported issues: Physician Fee, El Salvador Aid, Religion-Schools, Education Spending, Superfund Right To Sue, Crime
Bipartisan issues: Water Project, Immigration, Export Admin S. Africa

Fill NA method: mean fill
Dem supported issues: Disabled Infants, Budget Resolution, Anti-Satellite Test, Nicaragua Contras Aid, MX Missile, Synfuels Cutback, Duty Free Exports, Export Admin S. Africa
Rep supported issues: Physician Fee, El Salvador Aid, Religion-Schools, Edu