<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
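
As a rough illustration of what a two-sample t-test computes, the sketch below runs `scipy.stats.ttest_ind` on made-up yes/no votes and checks the statistic against the pooled-variance formula. The group sizes, the yes rates, and the names `group_a` and `group_b` are assumptions chosen for the example, not values from the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes (1 = yes, 0 = no) for two made-up groups; NOT the real data.
rng = np.random.default_rng(0)
group_a = rng.binomial(1, 0.8, size=200)  # roughly 80% yes
group_b = rng.binomial(1, 0.2, size=150)  # roughly 20% yes

# Two-sample t-test: the null hypothesis is that both groups have the same mean.
result = ttest_ind(group_a, group_b)

# The same statistic by hand, using the pooled (equal-variance) formula,
# which is what ttest_ind uses by default.
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
manual_t = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

print(result.statistic, manual_t, result.pvalue)
```

A positive statistic here means the first group's mean (its share of yes votes) is higher than the second's.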

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE

### 1. Load and Clean the Data

In [0]:
#imports
from google.colab import files
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

In [11]:
upload=files.upload()

Saving house-votes-84.data to house-votes-84 (1).data


In [12]:
# Name the columns and read in the data
col_headers = ['Party', 'handicapped-infants', 
               'water-project-cost-sharing', 'adoption-of-the-budget-resolution', 
               'physician-fee-freeze', 'el-salvador-aid', 
               'religious-groups-in-schools', 'anti-satellite-test-ban', 
               'aid-to-nicaraguan-contras', 'mx-missile', 'immigration', 
               'synfuels-corporation-cutback', 'education-spending', 
               'superfund-right-to-sue', 'crime', 'duty-free-exports', 
               'export-administration-act-south-africa']

hv = pd.read_csv('house-votes-84.data', header=None, names= col_headers, na_values='?')
hv.head()

Unnamed: 0,Party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [13]:
# check for null
hv.isnull().sum()

Party                                       0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
immigration                                 7
synfuels-corporation-cutback               21
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64

In [14]:
# encode votes as numbers (y -> 1, n -> 0) so means and t-tests can be computed
hv = hv.replace({'y': 1, 'n': 0})
hv.head()

Unnamed: 0,Party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [16]:
# look at how many members belong to each party
hv['Party'].value_counts()

democrat      267
republican    168
Name: Party, dtype: int64

In [17]:
# make new data frames for each party
dem = hv[hv['Party']=='democrat']
len(dem)

267

In [18]:
rep = hv[hv['Party']=='republican']
len(rep)

168

In [32]:
# find the percentage of yes votes for each column
for i in range(1,17):
  print("Democrats were", (dem[col_headers[i]].mean() * 100), "% in favor of the", col_headers[i], "bill.")
  print("Republicans were", (rep[col_headers[i]].mean() * 100), "% in favor of the", col_headers[i], "bill.")
  print()


Democrats were 60.46511627906976 % in favor of the handicapped-infants bill.
Republicans were 18.787878787878785 % in favor of the handicapped-infants bill.

Democrats were 50.2092050209205 % in favor of the water-project-cost-sharing bill.
Republicans were 50.67567567567568 % in favor of the water-project-cost-sharing bill.

Democrats were 88.84615384615384 % in favor of the adoption-of-the-budget-resolution bill.
Republicans were 13.414634146341465 % in favor of the adoption-of-the-budget-resolution bill.

Democrats were 5.405405405405405 % in favor of the physician-fee-freeze bill.
Republicans were 98.7878787878788 % in favor of the physician-fee-freeze bill.

Democrats were 21.568627450980394 % in favor of the el-salvador-aid bill.
Republicans were 95.15151515151516 % in favor of the el-salvador-aid bill.

Democrats were 47.674418604651166 % in favor of the religious-groups-in-schools bill.
Republicans were 89.7590361445783 % in favor of the religious-groups-in-schools bill.


### 2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

In [0]:
# adding a function that can be reused for finding the pvalue
def ttest_a_bill(bill):
  return ttest_ind(dem[bill], rep[bill], nan_policy='omit')
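
The `nan_policy='omit'` argument tells `ttest_ind` to ignore missing votes instead of returning `nan`. A minimal sketch, using made-up arrays `a` and `b` rather than the real columns, suggests that this gives the same p-value as dropping the missing values by hand before testing:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes/no votes with missing values (NaN); NOT the real data.
a = np.array([1, 1, np.nan, 0, 1, 1, np.nan, 1])
b = np.array([0, np.nan, 0, 1, 0, 0, 0])

# Let SciPy skip the NaNs...
omit = ttest_ind(a, b, nan_policy='omit')

# ...or remove them ourselves before running the test.
manual = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

print(float(omit.pvalue), float(manual.pvalue))
```

Either way, each bill is tested only on the members who actually cast a yes or no vote, so the sample size differs from bill to bill.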

In [63]:
# I'm going to test the hypothesis that the democrats were more in favor of the
#   aid-to-nicaraguan-contras bill than republicans
bill = 'aid-to-nicaraguan-contras'

dem[bill].mean()

0.8288973384030418

In [64]:
rep[bill].mean()

0.15286624203821655

In [0]:
# it looks like they were, but let's check with a t-test
results = ttest_a_bill(bill)

In [68]:
print('{:.60f}'.format(results.pvalue))
print(results.pvalue <= .001)

0.000000000000000000000000000000000000000000000000000002824718
True


With p < 0.001, we can reject the null hypothesis of no difference between the parties: democrats as a whole were significantly more in favor of aid-to-nicaraguan-contras than republicans.
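
By default `ttest_ind` runs a two-sided test, so the direction of the difference has to be read from the group means. If a directional question ("does one group support this more?") is wanted, one option is the `alternative` parameter, which is available in SciPy 1.6 and later. The sketch below uses made-up data and the hypothetical names `group_a` and `group_b`; it is not run on the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes/no votes for two made-up groups; NOT the real data.
rng = np.random.default_rng(1)
group_a = rng.binomial(1, 0.7, size=120)  # roughly 70% yes
group_b = rng.binomial(1, 0.4, size=100)  # roughly 40% yes

# Default: two-sided test (is there any difference in means?).
two_sided = ttest_ind(group_a, group_b)

# One-sided test: is the mean of group_a greater than the mean of group_b?
# Assumes SciPy >= 1.6, where the `alternative` parameter was added.
one_sided = ttest_ind(group_a, group_b, alternative='greater')

print(two_sided.pvalue, one_sided.pvalue)
```

When the statistic is positive, the one-sided p-value is half of the two-sided one, so a result that clears p < 0.01 two-sided also clears it one-sided in the observed direction.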

### 3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

In [69]:
# I'm going to test the hypothesis that the republicans were more in favor of the
#   religious-groups-in-schools bill than democrats
bill = 'religious-groups-in-schools'

rep[bill].mean()

0.8975903614457831

In [70]:
dem[bill].mean()

0.47674418604651164

In [0]:
# it appears to be true but let's ttest it to make sure.
results = ttest_a_bill(bill)

In [72]:
print('{:.25f}'.format(results.pvalue))
print(results.pvalue <= .001)

0.0000000000000000000239367
True


With p < 0.001, we can reject the null hypothesis of no difference between the parties: republicans as a whole were significantly more in favor of religious-groups-in-schools than democrats.

### 4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

In [73]:
# I'm going to test the hypothesis that there is not much difference between the
#   voting of republicans and democrats on the immigration bill.
bill = 'immigration'

rep[bill].mean()

0.5575757575757576

In [74]:
dem[bill].mean()

0.4714828897338403

In [0]:
#ttest it
results = ttest_a_bill(bill)

In [76]:
print('{:.5f}'.format(results.pvalue))
print(results.pvalue>=.1)

0.08330
False


With p ≈ 0.083, which is below 0.1, the immigration bill does not meet the p > 0.1 criterion, so it is not a good example of an issue with little difference between the parties.

In [77]:
# I will try the water-project-cost-sharing bill now
bill= 'water-project-cost-sharing'

rep[bill].mean()

0.5067567567567568

In [78]:
dem[bill].mean()

0.502092050209205

In [0]:
results = ttest_a_bill(bill)

In [80]:
print('{:.5f}'.format(results.pvalue))
print(results.pvalue>=.1)

0.92916
True


For this bill the p-value is high (about 0.93), so we cannot reject the null hypothesis that the two parties supported it equally.

### Extra

In [0]:
# a function that prints the pvalue for a bill and whether it is at or below
#   common significance levels.
def test_bill(bill):
  results = ttest_ind(dem[bill], rep[bill], nan_policy='omit')
  print("The pvalue for", bill, "is", results.pvalue)
  print("Is this pvalue <= .05 confidence:", results.pvalue <= .05)
  print("Is this pvalue <= .01 confidence:", results.pvalue <= .01)
  print("Is this pvalue <= .001 confidence:", results.pvalue <= .001)
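
`test_bill` prints its results, which is handy for reading but harder to sort or filter. One possible refactor, sketched below, returns a DataFrame instead and takes the grouping column and value columns as arguments so it can be rerun on other data. The function name `summarize_ttests`, its parameters, and the toy frame are all hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def summarize_ttests(df, group_col, group_a, group_b, value_cols):
    """Run a two-sample t-test per column and return one row per column."""
    rows = []
    for col in value_cols:
        # Drop missing values per column so each test uses all available votes.
        a = df.loc[df[group_col] == group_a, col].dropna().to_numpy()
        b = df.loc[df[group_col] == group_b, col].dropna().to_numpy()
        result = ttest_ind(a, b)
        rows.append({
            'column': col,
            'mean_a': a.mean(),
            'mean_b': b.mean(),
            'statistic': float(result.statistic),
            'pvalue': float(result.pvalue),
        })
    # Smallest p-values (largest apparent differences) first.
    return pd.DataFrame(rows).sort_values('pvalue').reset_index(drop=True)

# Hypothetical toy data with the same shape as the cleaned votes; NOT the real data.
toy = pd.DataFrame({
    'Party': ['democrat'] * 6 + ['republican'] * 6,
    'bill-x': [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, 0],
    'bill-y': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})
summary = summarize_ttests(toy, 'Party', 'democrat', 'republican', ['bill-x', 'bill-y'])
print(summary)
```

On the notebook's data, a call along the lines of `summarize_ttests(hv, 'Party', 'democrat', 'republican', col_headers[1:])` should give one row per bill, which can then be filtered by `pvalue`.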

In [91]:
# loop through all bills to test each one.
for i in range(1,17):
  test_bill(col_headers[i])
  print()

The pvalue for handicapped-infants is 1.613440327937243e-18
Is this pvalue <= .05 confidence: True
Is this pvalue <= .01 confidence: True
Is this pvalue <= .001 confidence: True

The pvalue for water-project-cost-sharing is 0.9291556823993485
Is this pvalue <= .05 confidence: False
Is this pvalue <= .01 confidence: False
Is this pvalue <= .001 confidence: False

The pvalue for adoption-of-the-budget-resolution is 2.0703402795404463e-77
Is this pvalue <= .05 confidence: True
Is this pvalue <= .01 confidence: True
Is this pvalue <= .001 confidence: True

The pvalue for physician-fee-freeze is 1.994262314074344e-177
Is this pvalue <= .05 confidence: True
Is this pvalue <= .01 confidence: True
Is this pvalue <= .001 confidence: True

The pvalue for el-salvador-aid is 5.600520111729011e-68
Is this pvalue <= .05 confidence: True
Is this pvalue <= .01 confidence: True
Is this pvalue <= .001 confidence: True

The pvalue for religious-groups-in-schools is 2.3936722520597287e-20
Is this pvalue <