<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware: there are missing values, recorded as `?` in the raw file!
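As a rough sketch of how the missing values could be spotted at load time, the snippet below parses the five rows shown in this notebook's `df.head()` output and counts the missing votes per issue. The column names `issue_1` through `issue_16` are placeholders invented for this example, and passing `na_values='?'` to `read_csv` is just one possible way to handle the `?` marker, not necessarily the approach used in the cells below.

```python
import io
import pandas as pd

# First five rows of the voting data, as shown in this notebook's df.head() output.
raw = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
    "democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y\n"
    "democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y\n"
)

# Placeholder column names (issue_1 ... issue_16) are assumptions for this sketch.
issue_cols = [f"issue_{i}" for i in range(1, 17)]
sample = pd.read_csv(raw, header=None, names=["party"] + issue_cols, na_values="?")

# Map y/n to 1/0; votes that were '?' are already NaN and stay NaN.
sample[issue_cols] = sample[issue_cols].apply(lambda col: col.map({"y": 1, "n": 0}))

# Count missing votes per issue and show only the issues that have any.
missing_per_issue = sample[issue_cols].isna().sum()
print(missing_per_issue[missing_per_issue > 0])
```

Counting missing votes per issue before testing helps decide whether to drop rows up front or to let each test omit its own missing values.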

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data calls for *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
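To make the two-sample idea concrete, here is a minimal sketch on made-up votes (the arrays `group_a` and `group_b` are hypothetical, not taken from the dataset). It uses `scipy.stats.ttest_ind` with `nan_policy='omit'` so that missing votes are skipped rather than propagated.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes (1 = yes, 0 = no, NaN = missing), invented for illustration.
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# nan_policy='omit' drops the NaN entries from each group before testing.
result = ttest_ind(group_a, group_b, nan_policy='omit')
t_stat, p_value = float(result.statistic), float(result.pvalue)

# The sign of t follows the argument order: positive means group_a's mean is higher.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Because the sign of the statistic depends on which group is passed first, the same test can answer both "which group supports this more?" and "is the difference significant?".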

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
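One possible shape for the first stretch goal is sketched below. The helper `compare_groups` and the `toy` DataFrame are hypothetical names and data made up for this example. The idea is to pass the grouping column, the two group labels, and the variables to test as arguments, and to get the results back as a table instead of printed lines.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_groups(data, group_col, group_a, group_b, value_cols):
    """Two-sample t-test of group_a vs group_b for each column in value_cols."""
    a = data[data[group_col] == group_a]
    b = data[data[group_col] == group_b]
    rows = []
    for col in value_cols:
        # Missing values are omitted separately for each column.
        t_stat, p_value = ttest_ind(a[col], b[col], nan_policy='omit')
        rows.append({'variable': col, 't': float(t_stat), 'p': float(p_value)})
    return pd.DataFrame(rows)

# Toy data invented for illustration: two groups, two binary variables.
toy = pd.DataFrame({
    'group': ['x'] * 5 + ['y'] * 5,
    'v1': [1, 1, 0, 1, np.nan, 0, 0, 0, 1, 0],
    'v2': [1.0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
})

results = compare_groups(toy, 'group', 'x', 'y', ['v1', 'v2'])
print(results)
```

Returning a DataFrame makes it easy to filter afterwards, for example by the sign of `t` and a threshold on `p`.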

In [0]:
import pandas as pd
from scipy.stats import ttest_1samp, ttest_ind
import seaborn as sns

In [0]:
# Import csv and add headers
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 names=['party', 'handicapped_infants', 'water_project', 'budget', 'physician_fee_freeze',
                         'el_salvador_aid', 'religious_groups', 'anti_satellite_ban', 'aid_to_contras',
                         'mx_missle', 'immigration', 'synfuels', 'education', 'right_to_sue', 'crime', 'duty_free_exports',
                         'south_africa_act'],
                 header=None)

In [12]:
df.head()

Unnamed: 0,party,handicapped_infants,water_project,budget,physician_fee_freeze,el_salvador_aid,religious_groups,anti_satellite_ban,aid_to_contras,mx_missle,immigration,synfuels,education,right_to_sue,crime,duty_free_exports,south_africa_act
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
# Replace yes with 1 and no with 0 and '?' with None
df = df.replace({'n': 0, 'y': 1, '?': None})

In [113]:
df.head()

Unnamed: 0,party,handicapped_infants,water_project,budget,physician_fee_freeze,el_salvador_aid,religious_groups,anti_satellite_ban,aid_to_contras,mx_missle,immigration,synfuels,education,right_to_sue,crime,duty_free_exports,south_africa_act
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
# Split df into two groups, democrats and republicans
dem = df[df['party'] == 'democrat']
rep = df[df['party'] == 'republican']

In [0]:
# Create a bills list
bills = ['handicapped_infants','water_project', 'budget',
         'physician_fee_freeze', 'el_salvador_aid', 'religious_groups',
         'anti_satellite_ban', 'aid_to_contras', 'mx_missle', 'immigration',
         'synfuels', 'education', 'right_to_sue', 'crime', 'duty_free_exports', 
         'south_africa_act']

## #2 Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

In [0]:
# Two-sample t-test comparing democrat and republican votes on a given bill.
# A positive t statistic means democrats voted yes at a higher rate than republicans.
def ttest_ind_bill(bill):
  return ttest_ind(dem[bill], rep[bill], nan_policy='omit')


# One-sample t-tests of a single party's votes against a hypothesized mean
def ttest_1samp_dem(bill, null_hypothesis):
  return ttest_1samp(dem[bill], null_hypothesis, nan_policy='omit')

def ttest_1samp_rep(bill, null_hypothesis):
  return ttest_1samp(rep[bill], null_hypothesis, nan_policy='omit')

In [201]:
# Print the bills that democrats support significantly more than republicans (p < 0.01)
def dem_support():
  for bill in bills:
    # Run the test once and unpack the t statistic and p-value
    t, p = ttest_ind_bill(bill)
    if t > 0 and p < 0.01:
      print(bill)
      
dem_support()

handicapped_infants
budget
anti_satellite_ban
aid_to_contras
mx_missle
synfuels
duty_free_exports
south_africa_act


## #3 Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

In [200]:
# Print the bills that republicans support significantly more than democrats (p < 0.01)
def rep_support():
  for bill in bills:
    # Run the test once and unpack the t statistic and p-value
    t, p = ttest_ind_bill(bill)
    if t < 0 and p < 0.01:
      print(bill)
      
rep_support()

physician_fee_freeze
el_salvador_aid
religious_groups
education
right_to_sue
crime


## #4 Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1

In [194]:
# Print the bills where the difference between parties is not statistically significant (p > 0.1)
def party_agree():
  for bill in bills:
    p = ttest_ind_bill(bill)[1]
    if p > 0.1:
      print(bill)
      
party_agree()

water_project
