<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Load and clean the data (or determine the best method to drop observations when running tests)
# import pandas and numpy
import pandas as pd
import numpy as np

# data set url
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

# update the column headers to be correct
column_headers = ['party', 'class_name', 'handicapped_infants', 'water_project_cost_sharing', 
                  'adoption_of_the_budget_resolution', 'physician_fee_freeze', 'el_salvador_aid', 
                  'religious_groups_in_schools', 'anti_satellite_test_ban',
                  'aid_to_nicaraguan_contras', 'mx_missile', 'immigration', 
                  'synfuels_corporation_cutback', 'education_spending', 'superfund_right_to_sue', 
                  'crime, duty_free_exports', 'export_administration_act_south_africa']

# read in the data
df = pd.read_csv(url, header=None, names=column_headers)

In [2]:
# Checking headers are properly in place
df.head(10)

Unnamed: 0,party,class_name,handicapped_infants,water_project_cost_sharing,adoption_of_the_budget_resolution,physician_fee_freeze,el_salvador_aid,religious_groups_in_schools,anti_satellite_test_ban,aid_to_nicaraguan_contras,mx_missile,immigration,synfuels_corporation_cutback,education_spending,superfund_right_to_sue,"crime, duty_free_exports",export_administration_act_south_africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [3]:
# Changing votes to numeric
df = df.replace({'y':1, 'n':0})
df.head(5)

Unnamed: 0,party,class_name,handicapped_infants,water_project_cost_sharing,adoption_of_the_budget_resolution,physician_fee_freeze,el_salvador_aid,religious_groups_in_schools,anti_satellite_test_ban,aid_to_nicaraguan_contras,mx_missile,immigration,synfuels_corporation_cutback,education_spending,superfund_right_to_sue,"crime, duty_free_exports",export_administration_act_south_africa
0,republican,0,1,0,1,1,1,0,0,0,1,?,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,?
2,democrat,?,1,1,?,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,?,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,?,1,1,1,1


In [4]:
# how many from each party
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [7]:
# checking a random columns nulls
rep['handicapped_infants'].isnull().value_counts()

False    168
Name: handicapped_infants, dtype: int64

In [0]:
# replace the ? as a nan and then drop all na values in the data frame
df = df.replace({'?':np.NaN}).dropna()


In [27]:
# checking a column to make sure it dropped the nans and converted correctly
df['handicapped_infants']

5      1.0
8      1.0
19     1.0
23     1.0
25     0.0
      ... 
423    1.0
426    0.0
427    0.0
430    0.0
431    0.0
Name: handicapped_infants, Length: 232, dtype: float64

In [5]:
# how did republicans vote
rep = df[df['party']=='republican']
len(rep)

168

In [6]:
# how did democrats vote
dem = df[df['party']=='democrat']
len(dem)

267

In [31]:
rep['handicapped_infants'].mean

<bound method Series.mean of 0      1
1      1
7      1
8      1
10     1
      ..
427    0
430    0
432    ?
433    0
434    1
Name: handicapped_infants, Length: 168, dtype: object>