
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
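
To make the two-sample comparison concrete, here is a minimal sketch of the pooled-variance (Student's) t statistic computed by hand with the standard library. The function name `two_sample_t` and the toy vote lists are hypothetical and not part of the assignment. `scipy.stats.ttest_ind`, with its default `equal_var=True`, computes this same pooled statistic.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Student's two-sample t statistic, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighted by each group's degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two group means
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err


# Hypothetical toy votes (1 = yes, 0 = no), not taken from the dataset
group_a = [1, 1, 1, 0, 1, 1, 0, 1]
group_b = [0, 0, 1, 0, 0, 1, 0, 0]
t_stat = two_sample_t(group_a, group_b)
print(round(t_stat, 3))
```

A positive statistic means the first group's mean is higher; swapping the arguments flips the sign but not the magnitude.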

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
# imports
from scipy.stats import ttest_ind
import pandas as pd
# download the raw data with a shell command (wget)
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-30 20:12:42--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-30 20:12:43 (129 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
# make into a dataframe - step one, set column headers
column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

In [0]:
df = pd.read_csv('house-votes-84.data',
                 header=None,
                 names=column_headers,
                 na_values='?')

# na_values='?' tells read_csv to treat question marks as missing values (NaN)


In [0]:
# recode votes as numeric
# This will mean higher mean = more in favor
df = df.replace({'y': 1, 'n': 0})
df.head()

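| | party | handicapped-infants | water-project | budget | physician-fee-freeze | el-salvador-aid | religious-groups | anti-satellite-ban | aid-to-contras | mx-missile | immigration | synfuels | education | right-to-sue | crime | duty-free | south-africa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 1 | republican | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN |
| 2 | democrat | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | democrat | 0.0 | 1.0 | 1.0 | 0.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | democrat | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 |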


In [0]:
df.shape

(435, 17)

In [0]:
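# Sum the recoded votes over rows that contain at least one NaN, to get a rough
# sense of which columns are affected. Note that this totals the "yes" votes in
# those rows; it does not count NaNs per column.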
df[df.isnull().any(axis=1)].sum()

party                   republicanrepublicandemocratdemocratdemocratde...
handicapped-infants                                                    91
water-project                                                          88
budget                                                                130
physician-fee-freeze                                                   64
el-salvador-aid                                                        84
religious-groups                                                      123
anti-satellite-ban                                                    115
aid-to-contras                                                        123
mx-missile                                                             94
immigration                                                            88
synfuels                                                               70
education                                                              63
right-to-sue                          
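
The expression in the cell above adds up the votes in rows that have a missing value somewhere, so it does not directly report how many values are missing in each column. A minimal sketch of a per-column missing-value count follows, run on a hypothetical toy frame (`toy`, `nan_counts`, and `nan_share` are illustrative names, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the voting data
toy = pd.DataFrame({
    'party': ['republican', 'democrat', 'democrat', 'republican'],
    'budget': [0.0, 1.0, np.nan, 0.0],
    'crime': [1.0, np.nan, np.nan, 1.0],
})

# isnull() marks each missing cell True; summing the booleans counts them per column
nan_counts = toy.isnull().sum()
print(nan_counts)

# The mean of the same booleans gives the fraction of rows missing per column
nan_share = toy.isnull().mean()
print(nan_share)
```

Applied to the real data, the same two calls on `df` would show which issues lose the most observations when NaNs are omitted from a test.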

In [0]:
#setting shorthand variables
dem = df[df['party']=='democrat']
rep = df[df['party']=='republican']


# **2. Find an issue Democrats support more than republicans**


In [0]:
''' The mx-missile data appears to reflect the decommissioning of ICBM 
Peacekeeper missiles: https://fas.org/nuke/guide/usa/icbm/lgm-118.htm. It makes
sense (to my mind) that democrats would support this more than republicans. '''
dem['mx-missile'].mean()

0.7580645161290323

In [0]:
rep['mx-missile'].mean()

0.11515151515151516

In [0]:
ttest_ind(rep['mx-missile'], dem['mx-missile'], nan_policy='omit')

Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47)

In [0]:
# The two-sided p-value is 5.03e-47; halved for a one-tailed test (about 2.5e-47),
# it is far below 0.01. We can reject the null hypothesis: democrats appear to
# support this bill more than republicans.
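
Halving the two-sided p-value is only valid when the t statistic points in the hypothesized direction. A small sketch of that rule follows, assuming a hypothetical helper named `one_tailed_p` and toy vote series that are not drawn from the dataset:

```python
import pandas as pd
from scipy.stats import ttest_ind


def one_tailed_p(first, second):
    """p-value for H1: mean(first) < mean(second), from a two-sided t-test."""
    # Drop NaNs up front so the result does not depend on nan_policy handling
    t_stat, p_two_sided = ttest_ind(first.dropna(), second.dropna())
    # Halve the two-sided p-value only when t points in the hypothesized direction
    if t_stat < 0:
        return p_two_sided / 2
    return 1 - p_two_sided / 2


# Hypothetical toy votes (1 = yes, 0 = no, None = missing), not the real data
low = pd.Series([0, 0, 1, 0, None, 0, 0, 1])
high = pd.Series([1, 1, 1, 0, 1, None, 1, 1])
p_less = one_tailed_p(low, high)
p_wrong_direction = one_tailed_p(high, low)
print(p_less, p_wrong_direction)
```

With the arguments reversed, the statistic is positive and the one-tailed p-value is close to 1 rather than close to 0, which is why the sign of the statistic should be checked before dividing by two.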

# **3. Find an issue republicans support more than democrats**

In [0]:
dem['education'].mean()

0.14457831325301204

In [0]:
rep['education'].mean()

0.8709677419354839

In [0]:
''' After looking at the means, I became curious about this bill. 
I found it here: https://www.congress.gov/bill/101st-congress/house-bill/1675
It appears to be a bill that rewards high-performing schools, which 
might explain why it was so popular among republicans. As such, my null 
hypothesis is that republicans' support of this bill is less than or equal to 
that of democrats.'''
# nan_policy='omit' tells the t-test to ignore NaN values

ttest_ind(rep['education'], dem['education'], nan_policy='omit')

Ttest_indResult(statistic=20.500685724563073, pvalue=1.8834203990450192e-64)

In [0]:
# The two-sided p-value is 1.88e-64; halved for a one-tailed test (about 9.4e-65),
# it is far below 0.01, so republicans appear to support this bill significantly
# more than democrats.

# **4. Find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)**

In [0]:
dem['water-project'].mean()

0.502092050209205

In [0]:
rep['water-project'].mean()

0.5067567567567568

In [0]:
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

In [0]:
# With p ≈ 0.93 (well above 0.1), we cannot reject the null hypothesis that the
# two parties' mean support is the same.

# **Stretch Goals**

In [0]:
# A function to quickly print the summary information we inspected above
def des(df):
  print(df.shape)
  print(df.head())
  print(df[df.isnull().any(axis=1)].sum())

des(df)





(435, 17)
        party  handicapped-infants  ...  duty-free  south-africa
0  republican                  0.0  ...        0.0           1.0
1  republican                  0.0  ...        0.0           NaN
2    democrat                  NaN  ...        0.0           0.0
3    democrat                  0.0  ...        0.0           1.0
4    democrat                  1.0  ...        1.0           1.0

[5 rows x 17 columns]
party                   republicanrepublicandemocratdemocratdemocratde...
handicapped-infants                                                    91
water-project                                                          88
budget                                                                130
physician-fee-freeze                                                   64
el-salvador-aid                                                        84
religious-groups                                                      123
anti-satellite-ban                                         
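
For the first stretch goal, one possible refactoring is sketched below. It wraps the per-issue comparison in a function and runs it over every vote column. The names `compare_parties` and `compare_all` and the toy frame are hypothetical; the sketch drops NaNs before testing rather than using `nan_policy='omit'`, which should give the same result under the default pooled-variance test.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(votes, issue, group_col='party'):
    """Two-sample t-test of one issue column, democrats vs republicans."""
    dem_votes = votes.loc[votes[group_col] == 'democrat', issue].dropna()
    rep_votes = votes.loc[votes[group_col] == 'republican', issue].dropna()
    # Positive t means democrats voted yes more often than republicans
    t_stat, p_value = ttest_ind(dem_votes, rep_votes)
    return {
        'issue': issue,
        'dem_mean': dem_votes.mean(),
        'rep_mean': rep_votes.mean(),
        't': t_stat,
        'p': p_value,
    }


def compare_all(votes, group_col='party'):
    """Run compare_parties on every column except the grouping column."""
    issues = [col for col in votes.columns if col != group_col]
    rows = [compare_parties(votes, col, group_col) for col in issues]
    return pd.DataFrame(rows).set_index('issue')


# Hypothetical toy frame (1 = yes, 0 = no, NaN = missing), not the real data
toy = pd.DataFrame({
    'party': ['democrat'] * 4 + ['republican'] * 4,
    'issue_a': [1, 1, 1, np.nan, 0, 0, 1, 0],
    'issue_b': [1, 0, 1, 0, 0, 1, 0, 1],
})
summary = compare_all(toy)
print(summary)
```

On the real data, `compare_all(df)` would return one row per issue, which could then be sorted by `p` to find candidates for goals 2 through 4.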