<a href="https://colab.research.google.com/github/Daniel-Benson-Poe/DS-Unit-1-Sprint-2-Statistics/blob/master/daniel_benson_LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
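
For reference, the pooled-variance form of the two-sample statistic, which `scipy.stats.ttest_ind` uses by default (`equal_var=True`), can be written as below. The symbols are introduced here for illustration: $\bar{x}_1, \bar{x}_2$ are the group means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the group sizes.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The statistic is compared against a t distribution with $n_1 + n_2 - 2$ degrees of freedom.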

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import ttest_ind

In [0]:
from google.colab import files
uploaded = files.upload()

Saving house-votes-84.data to house-votes-84.data


In [0]:
# Read in the data and check its shape
house_votes = pd.read_csv('house-votes-84.data')
house_votes.shape

(434, 17)

In [0]:
# Get a look at the columns
house_votes.columns

Index(['republican', 'n', 'y', 'n.1', 'y.1', 'y.2', 'y.3', 'n.2', 'n.3', 'n.4',
       'y.4', '?', 'y.5', 'y.6', 'y.7', 'n.5', 'y.8'],
      dtype='object')
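
The labels above (`'republican'`, `'n'`, `'y'`, ...) are really the first congressperson's votes. The file has no header row, so `read_csv` consumed the first record as column names, which is why the shape shows 434 rows rather than 435. A minimal sketch of the effect on a made-up three-row sample (the inline data and the `issue-a`/`issue-b` names are illustrative, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless layout
raw = "republican,n,y\ndemocrat,?,y\ndemocrat,y,n\n"

# Default behaviour: the first record is consumed as the header
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the labels
with_names = pd.read_csv(io.StringIO(raw), header=None,
                         names=['Class Name', 'issue-a', 'issue-b'])

print(with_default.shape)  # one row short
print(with_names.shape)    # all three rows kept
```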

In [0]:
# We have some column renaming to do
house_columns = ['Class Name', 'handicapped-infants', 'water-project-cost-sharing', 
           'adoption-of-the-budget-resolution', 'physician-fee-freeze', 
           'el-salvador-aid', 'religious-groups-in-schools', 
           'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
           'immigration', 'synfuels-corporation-cutback', 'education-spending',
           'superfund-right-to-sue', 'crime', 'duty-free-exports', 
           'export-administration-act-south-africa']


In [0]:
# Reread in the data with their new column names
house_votes = pd.read_csv('house-votes-84.data', names=house_columns)

In [0]:
# Check to ensure our renaming worked
house_votes.columns

Index(['Class Name', 'handicapped-infants', 'water-project-cost-sharing',
       'adoption-of-the-budget-resolution', 'physician-fee-freeze',
       'el-salvador-aid', 'religious-groups-in-schools',
       'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile',
       'immigration', 'synfuels-corporation-cutback', 'education-spending',
       'superfund-right-to-sue', 'crime', 'duty-free-exports',
       'export-administration-act-south-africa'],
      dtype='object')

In [0]:
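# Get a closer look at the data
# (describe is referenced without parentheses here, so the first print shows
# the bound method's repr, i.e. the frame itself, not summary statistics)
print(house_votes.describe)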
print(house_votes.head())
print(house_votes.tail())
print(house_votes.sample(5))

<bound method NDFrame.describe of      Class Name  ... export-administration-act-south-africa
0    republican  ...                                      y
1    republican  ...                                      ?
2      democrat  ...                                      n
3      democrat  ...                                      y
4      democrat  ...                                      y
..          ...  ...                                    ...
430  republican  ...                                      y
431    democrat  ...                                      y
432  republican  ...                                      y
433  republican  ...                                      y
434  republican  ...                                      n

[435 rows x 17 columns]>
   Class Name  ... export-administration-act-south-africa
0  republican  ...                                      y
1  republican  ...                                      ?
2    democrat  ...                            

In [0]:
# Some values are marked with a question mark, yet isnull() finds no
# nulls because '?' is stored as an ordinary string.
# According to the source site, these marks mean the member was
# undecided or chose not to vote.
house_votes.isnull().sum()

Class Name                                0
handicapped-infants                       0
water-project-cost-sharing                0
adoption-of-the-budget-resolution         0
physician-fee-freeze                      0
el-salvador-aid                           0
religious-groups-in-schools               0
anti-satellite-test-ban                   0
aid-to-nicaraguan-contras                 0
mx-missile                                0
immigration                               0
synfuels-corporation-cutback              0
education-spending                        0
superfund-right-to-sue                    0
crime                                     0
duty-free-exports                         0
export-administration-act-south-africa    0
dtype: int64

In [0]:
# Let's see how many question marks there are
# (note: len() of the boolean frame is just the row count,
# not the number of '?' entries)
len(house_votes=='?')

435
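
The 435 above is the number of rows: comparing the frame to `'?'` yields a boolean frame of the same shape, and `len` of a DataFrame is its row count. One way to count the marks themselves is to sum the boolean mask, sketched here on a small made-up frame (the column names are placeholders):

```python
import pandas as pd

# Made-up votes, only to illustrate the counting
toy = pd.DataFrame({
    'Class Name': ['republican', 'democrat', 'democrat'],
    'issue-a': ['n', '?', 'y'],
    'issue-b': ['?', '?', 'n'],
})

mask = toy == '?'
per_column = mask.sum()         # '?' count in each column
total = int(mask.sum().sum())   # '?' count across the whole frame

print(len(mask))    # row count, regardless of how many '?' exist
print(per_column)
print(total)
```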

In [0]:
# Recode the votes as numbers. Note that '?' is mapped to 0,
# so abstentions are counted as 'no' votes in everything below.
house_votes = house_votes.replace({'y':1, 'n':0, '?':0})
house_votes.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
2,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1
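
Mapping `'?'` to 0 counts every abstention as a 'no' vote, which pulls each group's mean support down. An alternative worth considering, sketched below on made-up votes, is to map `'?'` to `NaN` so that pandas skips those entries when averaging:

```python
import numpy as np
import pandas as pd

votes = pd.Series(['y', 'y', '?', 'n', '?'])  # made-up votes

# '?' treated as a 'no' vote (the approach used in this notebook)
as_no = votes.map({'y': 1, 'n': 0, '?': 0})

# '?' treated as missing; Series.mean() skips NaN by default
as_missing = votes.map({'y': 1, 'n': 0, '?': np.nan})

print(as_no.mean())       # 2 yes out of 5 entries
print(as_missing.mean())  # 2 yes out of 3 actual votes
```

With `NaN` in place, `ttest_ind` accepts `nan_policy='omit'` to drop the missing entries for each test.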


In [0]:
# Look at how many of each class we have:
house_votes['Class Name'].value_counts()


democrat      267
republican    168
Name: Class Name, dtype: int64

In [0]:
# Look at how each class is voting
republican_votes = house_votes[house_votes['Class Name']=='republican']
democratic_votes = house_votes[house_votes['Class Name']=='democrat']
print(len(republican_votes))
print(len(democratic_votes))

168
267


# handicapped-infants data

null hypothesis: democrats and republicans support handicapped-infants equally (both groups have the same mean share of 'yes' votes)

alternative hypothesis: mean support for handicapped-infants differs between the two parties

In [0]:
# Proportion of republicans voting 'yes' for handicapped-infants
republican_votes['handicapped-infants'].sum()/len(republican_votes)

0.18452380952380953

In [0]:
# What is the mean support of republicans? 
republican_votes['handicapped-infants'].mean()


0.18452380952380953

In [0]:
# What is the mean support of democrats?
democratic_votes['handicapped-infants'].mean()

0.5842696629213483

In [0]:
# T-test time!
ttest_ind(republican_votes['handicapped-infants'], democratic_votes['handicapped-infants'])

Ttest_indResult(statistic=-8.897130738692912, pvalue=1.5743382054891396e-17)
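
As a sanity check, the statistic can be reproduced by hand from the group sizes and means printed above. The 'yes' counts below are inferred from those means (0.1845 × 168 ≈ 31 and 0.5843 × 267 ≈ 156). The helper `pooled_t` is a hypothetical name, not part of SciPy, and it assumes the pooled-variance form that `ttest_ind` uses by default:

```python
import math

def pooled_t(yes_a, n_a, yes_b, n_b):
    """Pooled-variance two-sample t statistic for 0/1 data, given 'yes' counts."""
    mean_a, mean_b = yes_a / n_a, yes_b / n_b
    # For 0/1 data the sum of squared deviations is n * p * (1 - p)
    ss_a = n_a * mean_a * (1 - mean_a)
    ss_b = n_b * mean_b * (1 - mean_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err

# Republicans: 31 yes of 168; democrats: 156 yes of 267 (inferred counts)
t_stat = pooled_t(31, 168, 156, 267)
print(round(t_stat, 3))
```

The result should agree with the reported statistic up to rounding. The negative sign reflects that the first group passed in (republicans) has the lower mean.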

In [0]:
# is our p < 0.01?
ttest_ind(republican_votes['handicapped-infants'], democratic_votes['handicapped-infants']).pvalue < 0.01

True

In [0]:
# This p value allows me to reject the null hypothesis that both groups support handicapped-infants equally

In [0]:
# Comparing the groups' mean values, it appears that the democrats show a much greater support
# for handicapped-infants than do the republicans
republican_votes['handicapped-infants'].mean() < democratic_votes['handicapped-infants'].mean()

True

# anti-satellite-test-ban data

null hypothesis: republicans and democrats support the anti-satellite-test-ban equally

alternative hypothesis: support for the anti-satellite-test-ban differs between republicans and democrats

In [0]:
# Proportion of republicans voting yes for anti-satellite-test-ban
republican_votes['anti-satellite-test-ban'].sum()/len(republican_votes)

0.23214285714285715

In [0]:
# Mean support of republicans voting yes
republican_votes['anti-satellite-test-ban'].mean()

0.23214285714285715

In [0]:
# Proportion of democrats voting yes
democratic_votes['anti-satellite-test-ban'].sum()/len(democratic_votes)

0.7490636704119851

In [0]:
# Mean support of democrats voting yes
democratic_votes['anti-satellite-test-ban'].mean()

0.7490636704119851

In [0]:
# t-test!
ttest_ind(republican_votes['anti-satellite-test-ban'], democratic_votes['anti-satellite-test-ban'])

Ttest_indResult(statistic=-12.201859703940068, pvalue=1.2271674100823548e-29)

In [0]:
# Is the pvalue < 0.01?
ttest_ind(republican_votes['anti-satellite-test-ban'], democratic_votes['anti-satellite-test-ban']).pvalue < 0.01

True

In [0]:
# This allows me to reject the null hypothesis that each group supports the bill equally

In [0]:
# Comparing the two means shows that democrats are much more in favor of the bill than republicans
democratic_votes['anti-satellite-test-ban'].mean() > republican_votes['anti-satellite-test-ban'].mean()

True
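
The two analyses above repeat the same steps, so they could be folded into one helper. The sketch below is one possible shape for it: `compare_support` and its return keys are hypothetical names, and it is demonstrated on a small made-up frame rather than the real vote data.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_support(df, issue, alpha=0.01, party_col='Class Name'):
    """Two-sample t-test of republican vs democrat support for one issue."""
    rep = df.loc[df[party_col] == 'republican', issue]
    dem = df.loc[df[party_col] == 'democrat', issue]
    result = ttest_ind(rep, dem)
    return {
        'republican_mean': rep.mean(),
        'democrat_mean': dem.mean(),
        'statistic': result.statistic,
        'pvalue': result.pvalue,
        'significant': bool(result.pvalue < alpha),
    }

# Made-up 0/1 votes, only to show the call
toy = pd.DataFrame({
    'Class Name': ['republican'] * 4 + ['democrat'] * 4,
    'some-issue': [1, 1, 0, 1, 0, 0, 1, 0],
})
summary = compare_support(toy, 'some-issue')
print(summary)
```

Because republicans are passed first, a negative statistic means democrats have the higher mean, which is the same way the direction of support is read in the cells above.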