
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
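
To make the 2 sample idea concrete, here is a minimal sketch that computes the pooled-variance t statistic by hand and compares it with `ttest_ind`. The vote arrays and the helper name `pooled_t_statistic` are made up for illustration and are not part of the assignment data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy 0/1 votes (made-up data, not from the congressional dataset)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=float)
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0], dtype=float)

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic using a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error

manual_t = pooled_t_statistic(group_a, group_b)
scipy_result = ttest_ind(group_a, group_b)  # equal_var=True is the default
print(manual_t, scipy_result.statistic, scipy_result.pvalue)
```

The sign of the statistic follows the argument order: it is positive when the first group's mean is higher.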

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import ttest_ind

In [2]:
from google.colab import files
uploaded = files.upload()

Saving house-votes-84.data to house-votes-84.data


In [0]:
# Read in the data and check its shape
house_votes = pd.read_csv('house-votes-84.data')
house_votes.shape

(434, 17)

In [0]:
# Get a look at the data
house_votes.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y


In [0]:
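# The file has no header row, so pandas used the first record as column
# names (hence 434 rows instead of 435). Define proper column names.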
house_columns = ['Class Name', 'handicapped-infants', 'water-project-cost-sharing', 
           'adoption-of-the-budget-resolution', 'physician-fee-freeze', 
           'el-salvador-aid', 'religious-groups-in-schools', 
           'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
           'immigration', 'synfuels-corporation-cutback', 'education-spending',
           'superfund-right-to-sue', 'crime', 'duty-free-exports', 
           'export-administration-act-south-africa']


In [0]:
# Re-read the data with the new column names, treating '?' as missing
house_votes = pd.read_csv('house-votes-84.data', header=None, names=house_columns, na_values='?')

In [5]:
# Check to ensure our renaming worked
house_votes.columns

Index(['Class Name', 'handicapped-infants', 'water-project-cost-sharing',
       'adoption-of-the-budget-resolution', 'physician-fee-freeze',
       'el-salvador-aid', 'religious-groups-in-schools',
       'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
       'immigration', 'synfuels-corporation-cutback', 'education-spending',
       'superfund-right-to-sue', 'crime', 'duty-free-exports',
       'export-administration-act-south-africa'],
      dtype='object')

In [0]:
# Get a closer look at the data
print(house_votes.describe())
print(house_votes.head())
print(house_votes.tail())
print(house_votes.sample(5))

In [0]:
# Looking into the data we see some empty values
house_votes.isnull().sum()

Class Name                                  0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
immigration                                 7
synfuels-corporation-cutback               21
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64
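
Since the tests compare two parties, it may also help to see how the missing votes split between them. The sketch below uses a tiny made-up frame (`toy`) rather than the real data; the same expression should work on `house_votes` before the votes are recoded.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Class Name': ['republican', 'democrat', 'democrat', 'republican'],
    'immigration': ['y', np.nan, 'n', 'y'],
    'crime': [np.nan, np.nan, 'n', 'y'],
})

# Flag missing votes, then count them separately for each party
missing_by_party = (
    toy.drop(columns='Class Name')
       .isna()
       .groupby(toy['Class Name'])
       .sum()
)
print(missing_by_party)
```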

In [6]:
# Recode the votes as numbers (y -> 1, n -> 0)
house_votes = house_votes.replace({'y':1, 'n':0})
house_votes.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
# Look at how many of each class we have:
house_votes['Class Name'].value_counts()


democrat      267
republican    168
Name: Class Name, dtype: int64

In [11]:
# Look at how each class is voting
republican_votes = house_votes[house_votes['Class Name']=='republican']
democratic_votes = house_votes[house_votes['Class Name']=='democrat']
print(len(republican_votes))
print(len(democratic_votes))

168
267


# handicapped-infants data

In [0]:
# null hypothesis: the support for handicapped-infants will be equal for both democrats and republicans

# alternative hypothesis: support for handicapped-infants will be different

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Share of republicans voting 'yes' on handicapped-infants
# (the denominator here still counts the NaN rows)
republican_votes['handicapped-infants'].sum()/len(republican_votes)

0.18452380952380953

In [0]:
# We have to remove the NaN values first
rep_col = republican_votes['handicapped-infants']
rep_handicapped_infants_no_nans = rep_col.dropna()

In [0]:
# Try again with NaN values removed
republican_votes['handicapped-infants'].sum()/len(rep_handicapped_infants_no_nans)

0.18787878787878787

In [0]:
# What is the mean support of republicans? 
republican_votes['handicapped-infants'].mean()


0.18787878787878787
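
The two numbers above match because pandas skips NaN values when it computes a mean, so `.mean()` on a 0/1 column is already the share of 'yes' votes among those who voted. A small sketch on a made-up Series (not the real data):

```python
import numpy as np
import pandas as pd

votes = pd.Series([1, 0, np.nan, 0, 1, np.nan, 0])  # made-up 0/1 votes

share_all_rows = votes.sum() / len(votes)   # NaN rows inflate the denominator
share_valid = votes.sum() / votes.count()   # count() ignores NaN
print(share_all_rows, share_valid, votes.mean())
```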

In [0]:
# What about the democrats? Remove NaN values first
dem_col = democratic_votes['handicapped-infants']
dem_handicapped_infants_no_nans = dem_col.dropna()

In [0]:
democratic_votes['handicapped-infants'].sum()/len(dem_handicapped_infants_no_nans)

0.6046511627906976

In [0]:
# What is the mean support of democrats?
democratic_votes['handicapped-infants'].mean()

0.6046511627906976

In [0]:
# T-test time!
ttest_ind(republican_votes['handicapped-infants'], democratic_votes['handicapped-infants'])

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# Let's account for the null values and try again
ttest_ind(republican_votes['handicapped-infants'], democratic_votes['handicapped-infants'], nan_policy='omit')


Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
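
As a sanity check on what `nan_policy='omit'` does, the sketch below compares it with dropping the NaN values by hand. The two Series are made up for illustration; the expectation is that both approaches give the same statistic and p-value.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Made-up 0/1 votes with a missing value in each group
rep = pd.Series([0, 0, 1, np.nan, 0, 0])
dem = pd.Series([1, 1, np.nan, 0, 1, 1, 1])

with_nans = ttest_ind(rep, dem)                    # default nan_policy='propagate'
omitted = ttest_ind(rep, dem, nan_policy='omit')   # scipy skips the NaNs
dropped = ttest_ind(rep.dropna(), dem.dropna())    # drop the NaNs ourselves

print(with_nans)
print(float(omitted.statistic), float(dropped.statistic))
```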

In [0]:
# The statistic is about -9.2: it is negative because the republican mean
# (first argument) is lower than the democratic mean, and the pvalue is far
# below 0.01, so we can reject the null hypothesis.
ttest_ind(republican_votes['handicapped-infants'], democratic_votes['handicapped-infants'], nan_policy='omit').pvalue

1.613440327937243e-18

In [0]:
# is our p < 0.01?
ttest_ind(republican_votes['handicapped-infants'], democratic_votes['handicapped-infants'], nan_policy='omit').pvalue < 0.01

True

In [0]:
# This p value allows me to reject the null hypothesis that both groups support handicapped-infants equally

In [0]:
# Comparing the groups' mean values, it appears that the democrats show a much greater support
# for the handicapped-infants bill than do the republicans
republican_votes['handicapped-infants'].mean() < democratic_votes['handicapped-infants'].mean()

True
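
The two parties differ in size (168 vs 267), and `ttest_ind` assumes equal variances by default. One optional robustness check is Welch's t-test via `equal_var=False`. The sketch below uses simulated votes, not the real data, and the support rates of 0.2 and 0.6 are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Simulated 0/1 votes with made-up support rates
small_group = rng.binomial(1, 0.2, size=168).astype(float)
large_group = rng.binomial(1, 0.6, size=267).astype(float)

student = ttest_ind(small_group, large_group)                 # pooled variance
welch = ttest_ind(small_group, large_group, equal_var=False)  # Welch's t-test
print(student.pvalue, welch.pvalue)
```

When the difference between groups is this large, both versions should lead to the same conclusion.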

## Let's create some functions to speed up our coding

In [0]:
# A function to drop the NaN votes from a republican column
def republic_column(column):
  rep_col = republican_votes[column]
  rep_no_nans = rep_col.dropna()
  return rep_no_nans

In [0]:
# A similar function for the democratic data
def democratic_column(column):
  dem_col = democratic_votes[column]
  dem_no_nans = dem_col.dropna()
  return dem_no_nans

In [0]:
# Let's create a function to calculate vote percentage for democrats
def democrat_vote_percentage(column, dem_no_nans):
  dem_percent = democratic_votes[column].sum()/len(dem_no_nans)
  return dem_percent


In [12]:
# Make sure our democrat functions worked
dem_handicapped_infants_no_nans = democratic_column('handicapped-infants')
democrat_vote_percentage('handicapped-infants', dem_handicapped_infants_no_nans)


0.6046511627906976

In [0]:
# Now a function to calculate vote percentage for republicans
def republican_vote_percentage(column, rep_no_nans):
  rep_percent = republican_votes[column].sum()/len(rep_no_nans)
  return rep_percent

In [14]:
# And make sure our republican functions work correctly
rep_handicapped_infants_no_nans = republic_column('handicapped-infants')
republican_vote_percentage('handicapped-infants', rep_handicapped_infants_no_nans)

0.18787878787878787

In [0]:
# Create a function for our ttest (NaN votes are not handled yet)
def ttest(column):
  result = ttest_ind(republican_votes[column], democratic_votes[column])
  return result

In [16]:
# Check that it works
ttest('handicapped-infants')

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# Finally a function for our ttest that omits NaN values
def ttest_no_nans(column):
  result = ttest_ind(republican_votes[column], democratic_votes[column], nan_policy='omit')
  return result

In [18]:
# Check that it works
ttest_no_nans('handicapped-infants')

Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
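
The helpers above rely on the global `republican_votes` and `democratic_votes` frames. One possible way to go further with the refactoring stretch goal is a single function that takes the DataFrame and an issue name, so it can be looped over every column. This is only a sketch: `compare_parties` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(votes, issue, party_col='Class Name'):
    """Return party means and t-test results for one 0/1 issue column."""
    rep = votes.loc[votes[party_col] == 'republican', issue].dropna()
    dem = votes.loc[votes[party_col] == 'democrat', issue].dropna()
    result = ttest_ind(rep, dem)
    return {
        'issue': issue,
        'rep_mean': rep.mean(),
        'dem_mean': dem.mean(),
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }

# Made-up votes for illustration only (not the real dataset)
toy_votes = pd.DataFrame({
    'Class Name': ['republican'] * 5 + ['democrat'] * 5,
    'issue-a': [0, 0, 0, np.nan, 1, 1, 1, 1, 1, np.nan],
    'issue-b': [1, 0, 1, 0, 1, 0, 1, 0, 1, np.nan],
})

issues = [col for col in toy_votes.columns if col != 'Class Name']
summary = pd.DataFrame([compare_parties(toy_votes, issue) for issue in issues])
print(summary)
```

On the real data, the same loop over `house_votes` would give one row per issue, which makes it easier to scan for issues that meet each goal.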

# anti-satellite-test-ban data

In [0]:
# null hypothesis: both republicans and democrats will equally support the anti-satellite-test-ban

# alternative hypothesis: support for the anti-satellite-test-ban will be different between republicans and democrats

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for null values first
rep_anti_satellite_no_nans = republic_column('anti-satellite-test-ban')

In [0]:
# Percent of republicans voting yes for anti-satellite-test-ban
republican_vote_percentage('anti-satellite-test-ban', rep_anti_satellite_no_nans)

0.24074074074074073

In [0]:
# Mean support of republicans voting yes
republican_votes['anti-satellite-test-ban'].mean()

0.24074074074074073

In [0]:
dem_anti_satellite_no_nans = democratic_column('anti-satellite-test-ban')

In [0]:
# Percent of democrats voting yes
democrat_vote_percentage('anti-satellite-test-ban', dem_anti_satellite_no_nans)

0.7722007722007722

In [0]:
# Mean support of democrats voting yes
democratic_votes['anti-satellite-test-ban'].mean()

0.7722007722007722

In [0]:
# t-test!
ttest('anti-satellite-test-ban')

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# Account for nan values
ttest_no_nans('anti-satellite-test-ban')

Ttest_indResult(statistic=-12.526187929077842, pvalue=8.521033017443867e-31)

In [0]:
# A statistic of about -12.5 means the difference in means (signal) is
# large relative to its standard error (noise)

In [0]:
# Is the pvalue < 0.01?
ttest_no_nans('anti-satellite-test-ban').pvalue < 0.01

True

In [0]:
# This allows me to reject the null hypothesis that each group supports the bill equally

In [0]:
# Comparing the two means shows that democrats are much more in favor of the bill than republicans
democratic_votes['anti-satellite-test-ban'].mean() > republican_votes['anti-satellite-test-ban'].mean()

True

# immigration

In [0]:
# null hypothesis: both republicans and democrats will equally support the immigration bill

# alternative hypothesis: support for the immigration bill will be different between republicans and democrats

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for republican null values
rep_immigration_no_nans = republic_column('immigration')

In [0]:
# Percent of republicans voting yes
republican_vote_percentage('immigration', rep_immigration_no_nans)

0.5575757575757576

In [0]:
# Mean of republicans voting yes
republican_votes['immigration'].mean()

0.5575757575757576

In [0]:
# Account for democratic null values
dem_immigration_no_nans = democratic_column('immigration')

In [0]:
# Percent of dems voting yes
democrat_vote_percentage('immigration', dem_immigration_no_nans)

0.4714828897338403

In [0]:
# Mean of dems voting yes
democratic_votes['immigration'].mean()

0.4714828897338403

In [0]:
# ttest time
ttest('immigration')

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# Account for nulls
ttest_no_nans('immigration')

Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066)

In [0]:
# The statistic (about 1.7) is fairly small, meaning the difference in means
# (signal) is not large relative to the noise

In [0]:
# Is our pvalue significant?
ttest_no_nans('immigration').pvalue < 0.01

False

In [0]:
# No. With p of about 0.083 I fail to reject the null hypothesis at the 0.01 level.
# Note that p is still below 0.1, so immigration does not meet the p > 0.1 goal either.

In [0]:
# Let's compare means just for the fun of it
print(republican_votes['immigration'].mean())
print(democratic_votes['immigration'].mean())

0.5575757575757576
0.4714828897338403
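
To keep the three assignment thresholds straight, here is a small sketch that labels an issue from its party means and p-value. `classify_issue` is a hypothetical helper, and the labels are my own wording; the numbers passed in are the immigration results printed above.

```python
def classify_issue(rep_mean, dem_mean, pvalue):
    """Label an issue using the assignment's p < 0.01 and p > 0.1 thresholds."""
    if pvalue < 0.01:
        if dem_mean > rep_mean:
            return 'democrats support more'
        return 'republicans support more'
    if pvalue > 0.1:
        return 'no clear difference'
    # Between 0.01 and 0.1: meets none of the three goals
    return 'inconclusive'

# Immigration: means and p-value copied from the cells above
print(classify_issue(0.5575757575757576, 0.4714828897338403, 0.08330248490425066))
```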


# Let's look at the religious-groups-in-schools column

In [0]:
# null hypothesis: the support for religious-groups-in-schools will be equal for both democrats and republicans

# alternative hypothesis: support for religious-groups-in-schools will be different

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for republican nulls
rep_religious_groups_no_nans = republic_column('religious-groups-in-schools')

In [0]:
# Check percent of republican support
republican_vote_percentage('religious-groups-in-schools', rep_religious_groups_no_nans)

0.8975903614457831

In [0]:
# Check mean for republican support
republican_votes['religious-groups-in-schools'].mean()

0.8975903614457831

In [0]:
# Account for democratic nulls
dem_religious_groups_no_nans = democratic_column('religious-groups-in-schools')

In [0]:
# Check percent of democratic support
democrat_vote_percentage('religious-groups-in-schools', dem_religious_groups_no_nans)

0.47674418604651164

In [0]:
# Check mean for dem support
democratic_votes['religious-groups-in-schools'].mean()

0.47674418604651164

In [0]:
# ttest!
ttest('religious-groups-in-schools')

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# ttest accounting for nans
ttest_no_nans('religious-groups-in-schools')

Ttest_indResult(statistic=9.737575825219457, pvalue=2.3936722520597287e-20)

In [0]:
# The statistic is large (about 9.7), so the signal clearly stands out
# from the noise

In [0]:
# Is our pvalue less than 0.01?
ttest_no_nans('religious-groups-in-schools').pvalue < 0.01

True

In [0]:
# I can reject the null hypothesis at the 0.01 significance level (99% confidence)

In [0]:
# Comparing the means we see that the republicans support this 
# bill much more than the democrats
print(republican_votes['religious-groups-in-schools'].mean())
print(democratic_votes['religious-groups-in-schools'].mean())

0.8975903614457831
0.47674418604651164


# Let's look at some more columns just for practice! 

Let's start with mx-missile

In [0]:
# null hypothesis: the support for mx-missile bill will be equal for both democrats and republicans

# alternative hypothesis: support for mx-missile bill will be different

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for republican nulls
rep_mx_missile_no_nans = republic_column('mx-missile')

In [0]:
# Get percent of republican support
republican_vote_percentage('mx-missile', rep_mx_missile_no_nans)

0.11515151515151516

In [0]:
# Get republican support mean
republican_votes['mx-missile'].mean()

0.11515151515151516

In [0]:
# Democratic nulls
dem_mx_missile_no_nans = democratic_column('mx-missile')

In [0]:
# Percent of democratic support
democrat_vote_percentage('mx-missile', dem_mx_missile_no_nans)

0.7580645161290323

In [0]:
# Dem support mean
democratic_votes['mx-missile'].mean()

0.7580645161290323

In [0]:
# ttest
ttest('mx-missile')

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# ttest no nans
ttest_no_nans('mx-missile')

Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47)

In [0]:
# Statistic shows that the signal definitely stands out from the noise

In [0]:
# Is the pvalue < 0.01?
ttest_no_nans('mx-missile').pvalue < 0.01

True

In [0]:
# I can reject my null hypothesis at the 0.01 significance level (99% confidence)

In [0]:
# Comparing the republican and dem support means we see that democrats are much
# more likely to favor this bill than republicans
print(republican_votes['mx-missile'].mean())
print(democratic_votes['mx-missile'].mean())

0.11515151515151516
0.7580645161290323


Let's look at crime now

In [0]:
# null hypothesis: the support for the crime bill will be equal for both democrats and republicans

# alternative hypothesis: support for the crime bill will be different

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for republican nulls
rep_crime_no_nans = republic_column('crime')

In [0]:
# Find republican support percent
republican_vote_percentage('crime', rep_crime_no_nans)

0.9813664596273292

In [0]:
# Find republican support mean
republican_votes['crime'].mean()

0.9813664596273292

In [0]:
# Account for democrat nulls
dem_crime_no_nans = democratic_column('crime')

In [0]:
# Find democratic support percent
democrat_vote_percentage('crime', dem_crime_no_nans)

0.35019455252918286

In [0]:
# Find democrat support mean
democratic_votes['crime'].mean()

0.35019455252918286

In [0]:
# ttest
ttest('crime')

Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# ttest no nans
ttest_no_nans('crime')

Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47)

In [0]:
# An exceptionally large statistic shows that our signal definitely
# stands out from the noise
ttest_no_nans('crime').statistic

16.342085656197696

In [0]:
# Is our pvalue < 0.01?
ttest_no_nans('crime').pvalue < 0.01

True

In [0]:
# I can reject the null hypothesis at the 0.01 significance level (99% confidence)

In [0]:
# Comparing the mean support values we can see that the republican group 
# overwhelmingly supports this bill more than the democratic group
print(republican_votes['crime'].mean())
print(democratic_votes['crime'].mean())

0.9813664596273292
0.35019455252918286


Let's look at synfuels-corporation-cutback

In [0]:
# null hypothesis: the support for synfuels-corporation-cutback will be equal for both democrats and republicans

# alternative hypothesis: support for synfuels-corporation-cutback will be different

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for republican nulls
rep_synfuels_cutback_no_nans = republic_column('synfuels-corporation-cutback')

In [22]:
# Republican support percent
republican_vote_percentage('synfuels-corporation-cutback', rep_synfuels_cutback_no_nans)

0.1320754716981132

In [23]:
# Republican support mean
republican_votes['synfuels-corporation-cutback'].mean()

0.1320754716981132

In [0]:
# Account for democratic nulls
dem_synfuels_cutback_no_nans = democratic_column('synfuels-corporation-cutback')

In [25]:
# Democratic support percent
democrat_vote_percentage('synfuels-corporation-cutback', dem_synfuels_cutback_no_nans)

0.5058823529411764

In [26]:
# Democratic mean
democratic_votes['synfuels-corporation-cutback'].mean()

0.5058823529411764

In [27]:
# ttest!
ttest('synfuels-corporation-cutback')

Ttest_indResult(statistic=nan, pvalue=nan)

In [28]:
# ttest no nans
ttest_no_nans('synfuels-corporation-cutback')

Ttest_indResult(statistic=-8.293603989407588, pvalue=1.5759322301054064e-15)

In [29]:
# The statistic value shows that the signal sufficiently stands out 
# from the noise
ttest_no_nans('synfuels-corporation-cutback').statistic

-8.293603989407588

In [30]:
# Is the pvalue < 0.01?
ttest_no_nans('synfuels-corporation-cutback').pvalue < 0.01

True

In [0]:
# Given the above, I can reject the null hypothesis at the 0.01 significance level (99% confidence)

In [33]:
# Comparing the support means for both groups shows that the democratic group 
# supports the synfuels-corporation-cutback bill much more than the republicans
print(republican_votes['synfuels-corporation-cutback'].mean())
print(democratic_votes['synfuels-corporation-cutback'].mean())

0.1320754716981132
0.5058823529411764



Let's look at duty-free-exports



In [0]:
# null hypothesis: the support for duty-free-exports will be equal for both democrats and republicans

# alternative hypothesis: support for duty-free-exports will be different

# Significance level: 0.01 (99% confidence); reject the null if pvalue < 0.01

In [0]:
# Account for republican nulls
rep_duty_free_exports_no_nans = republic_column('duty-free-exports')

In [37]:
# Check republican support percent
republican_vote_percentage('duty-free-exports', rep_duty_free_exports_no_nans)

0.08974358974358974

In [38]:
# Check republican support mean
republican_votes['duty-free-exports'].mean()

0.08974358974358974

In [0]:
# Account for democratic nulls
dem_duty_free_exports_no_nans = democratic_column('duty-free-exports')

In [41]:
# Check democrat support percent
democrat_vote_percentage('duty-free-exports', dem_duty_free_exports_no_nans)

0.6374501992031872

In [42]:
# Check democrat support mean
democratic_votes['duty-free-exports'].mean()

0.6374501992031872

In [43]:
# ttest
ttest('duty-free-exports')

Ttest_indResult(statistic=nan, pvalue=nan)

In [44]:
# ttest no nans
ttest_no_nans('duty-free-exports')

Ttest_indResult(statistic=-12.853146132542978, pvalue=5.997697174347365e-32)

In [45]:
# The statistic shows that the signal stands out from the noise
ttest_no_nans('duty-free-exports').statistic

-12.853146132542978

In [46]:
# Is the pvalue < 0.01?
ttest_no_nans('duty-free-exports').pvalue < 0.01

True

In [0]:
# I can reject the null hypothesis at the 0.01 significance level (99% confidence)

In [49]:
# Looking at the support means for both groups, it is clear that the democrats
# are much more likely to support the duty-free-exports bill than the republicans
print(republican_votes['duty-free-exports'].mean())
print(democratic_votes['duty-free-exports'].mean())

0.08974358974358974
0.6374501992031872
