

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
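
As a minimal sketch of what a two-sample t-test computes, the block below uses made-up 0/1 votes (not the congressional data) and compares `scipy.stats.ttest_ind` with the pooled-variance formula written out by hand. The arrays `group_a` and `group_b` and the helper `pooled_t` are hypothetical names used only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up yes (1) / no (0) votes for two groups; illustrative only
group_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0])

def pooled_t(a, b):
    """Student's t statistic with pooled variance (scipy's default, equal_var=True)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1)
                  + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

result = ttest_ind(group_a, group_b)
manual_t = pooled_t(group_a, group_b)

# The two statistics should agree; the p-value comes from scipy
print(result.statistic, manual_t, result.pvalue)
```

A positive statistic means the first group's mean is higher than the second's. Passing `equal_var=False` to `ttest_ind` would run Welch's test instead, which does not assume the two groups share a variance.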

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [2]:
### YOUR CODE STARTS HERE

from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel
import pandas as pd
import numpy as np

columnHeaders = [ 'party', 'handicapped-infants', 'water-project',
                  'budget', 'physician-fee-freeze', 'el-salvador-aid',
                  'religious-groups', 'anti-satellite-ban',
                  'aid-to-contras', 'mx-missile', 'immigration',
                  'synfuels', 'education', 'right-to-sue', 'crime',
                  'duty-free', 'south-africa']

df = pd.read_csv( 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', 
                 header = None,
                 names = columnHeaders)

# Short test; if we're able to identify strings like this, we should be able to 
# exclude them just as easily, without altering the data so much that it 
# (potentially) loses its original meaning

df[ df.budget == '?'].describe()


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11
unique,2,3,3,1,3,3,3,3,3,3,3,3,3,3,3,3,3
top,democrat,?,?,?,?,n,y,n,y,y,n,?,n,y,y,n,?
freq,7,5,7,11,6,4,7,5,4,6,5,4,4,7,5,5,5


In [0]:
# For instance, in the above slice, we can tell that 11 congressmen abstained
# from or were absent during the budgetary vote. And of those 11, 7 of them
# were Democrats. Also of those same 11, 5 of them abstained from voting on
# handicapped infants, and 5 of them voted for crime #misleadingColumnNames

In [3]:
df[ 'party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [4]:
rep = df[ df.party == 'republican']
dem = df[ df.party == 'democrat']

rep.head()

# As you can see, it's much more readable than 0s and 1s, and (for now) can be 
# manipulated and sorted just as well as numerical values

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
10,republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n


In [5]:
## Out of the total number of republicans these are the percentages who voted yes:

# First, a temp var representing a straight 'yes' divided by all
repTemp1 = rep[ 'handicapped-infants'].str.count('y').sum()/len(rep)

# This is the total number of republicans (len(rep)), minus those that have a '?'
# as their recorded vote
repsPresent = len(rep) - (rep[ 'handicapped-infants'] == '?').sum()

# And then finally, another temp var representing a 'yes' vote, divided by all
# without a '?' as their recorded vote.
## I imagine this might be possible without needing to create another var
## ('repsPresent'), but it wouldn't be nearly as readable, I don't think
repTemp2 = rep[ 'handicapped-infants'].str.contains('y').sum()/repsPresent

# And here's some print statements to make it a little fancy:
print( 'Out of all {} republicans, {:.2f}% voted "yes"'.format( len(rep), 
                                                              repTemp1*100))
print( 'But out of all {} republicans, only {} actually voted'.format( 
    len(rep), repsPresent))
print( 'And so out of those {} that actually voted, it was {:.2f}% that '
      'voted "yes"'.format( repsPresent, repTemp2*100) )



Out of all 168 republicans, 18.45% voted "yes"
But out of all 168 republicans, only 165 actually voted
And so out of those 165 that actually voted, it was 18.79% that voted "yes"
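
The two denominators used above (everyone versus only those who voted) can also be sketched without a separate counter variable, assuming the votes are mapped to 1/0 with `'?'` left as NaN. The series `votes` below is made-up data, not a column from the real dataset.

```python
import pandas as pd

# Made-up votes for illustration; '?' marks a missing vote
votes = pd.Series(['y', 'n', '?', 'y', 'n', 'n', '?', 'y'])

# Map to 1/0; values missing from the mapping ('?') become NaN
numeric = votes.map({'y': 1, 'n': 0})

share_of_all = (votes == 'y').sum() / len(votes)  # denominator includes '?'
share_of_voters = numeric.mean()                  # mean() skips NaN by default
n_voters = numeric.notna().sum()                  # how many actually voted

print('{:.2f}% of all {}'.format(share_of_all * 100, len(votes)))
print('{:.2f}% of the {} who voted'.format(share_of_voters * 100, n_voters))
```

Because `mean()` ignores NaN, the second figure is the share of "yes" among recorded votes only, which is the same quantity `repTemp2` computes.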


In [0]:
# The above, while not necessarily completing the assignment, was a personal
# stretch goal, to show that (some?) statistical work can still be done without
# cleaning the data to the point of being almost unreadable (to a human)

In [6]:
# But to go any further, I _will_ need to convert those strings to numerical
# values. I can't really seem to find a way around it (for now)

# So... BEST METHOD = using the split rep/dem datasets:
rep1 = rep.replace({ 'y': 1, 'n': 0, '?': np.nan})
dem1 = dem.replace({ 'y': 1, 'n': 0, '?': np.nan})

# With NaN values:
print( ttest_ind( rep1[ 'water-project'], dem1[ 'water-project'], 
                 nan_policy = 'omit') )

# Remove NaN values and verify 
repNanRip_temp = rep1[ 'water-project'].dropna()
demNanRip_temp = dem1[ 'water-project'].dropna()
print( len( repNanRip_temp) )
print( len( demNanRip_temp) )

# And without NaN values: 
### Same result as nan_policy='omit' (up to float rounding in the p-value).
### 'omit' just drops the NaNs for us, so the manual dropna() isn't needed
print( ttest_ind( repNanRip_temp, demNanRip_temp) )

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
148
239
Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823994811)
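
The matching results above are what one would expect if `nan_policy='omit'` simply discards the NaNs in each sample before testing. A small sketch on made-up arrays (`a` and `b` are hypothetical, not the real vote columns) compares the three ways of handling missing values:

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up 0/1 votes with NaN for missing values; illustrative only
a = np.array([1, 0, np.nan, 1, 1, 0, 1, np.nan])
b = np.array([0, 0, 1, np.nan, 0, 1, 0, 0])

# Let scipy drop the NaNs
omitted = ttest_ind(a, b, nan_policy='omit')

# Drop the NaNs by hand first
dropped = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

# Default policy ('propagate'): any NaN in the input makes the result NaN
propagated = ttest_ind(a, b)

print(omitted.statistic, dropped.statistic, propagated.statistic)
```

So the point of `nan_policy='omit'` is convenience: without it (or a manual `dropna()`), a single missing vote turns the whole test result into NaN.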


In [7]:
# This function takes the bill name as an argument and returns the test result
def teaTest( bill):
    return ttest_ind( rep1[ bill], dem1[ bill], 
                     nan_policy = 'omit')

bill = 'education'

print( teaTest( bill))


Ttest_indResult(statistic=20.500685724563073, pvalue=1.8834203990450192e-64)


In [29]:
# While this one __asks for user input__ (type "exit" to quit)
def teaQuest():
    while True:
        bill = input( "Which bill would you like to test? " )
        # Check for the exit keyword before using the input as a column name
        if bill == "exit":
            return
        print( '\n', ttest_ind( rep1[ bill], dem1[ bill], 
                         nan_policy = 'omit') ) 
        print( " Percentages of 'yes' votes: {:.2f}% democrats, {:.2f}% republicans".format( 
                  dem1[ bill].mean()*100, 
                  rep1[ bill].mean()*100 ), '\n ----------\n' )

teaQuest()

Which bill would you like to test? handicapped-infants

 Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
 Percentages of 'yes' votes: 60.47% democrats, 18.79% republicans 
 ----------

Which bill would you like to test? budget

 Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
 Percentages of 'yes' votes: 88.85% democrats, 13.41% republicans 
 ----------

Which bill would you like to test? water-project

 Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
 Percentages of 'yes' votes: 50.21% democrats, 50.68% republicans 
 ----------

Which bill would you like to test? physician-fee-freeze

 Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)
 Percentages of 'yes' votes: 5.41% democrats, 98.79% republicans 
 ----------

Which bill would you like to test? el-salvador-aid

 Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68)
 Percentages of 'yes' votes: 21.57% demo

KeyError: ignored

1.   \<Done>

2.   Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

    *   The budget. According to the data, almost 89% of democrats who voted supported the bill, while just over 13% of republicans did. This gap yields a p-value of 2.07x10^-77.

3.   Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

    *   Likewise, an issue that republicans supported waaaay more than democrats was the freeze on physician fees. According to the data we have, almost 99% of republicans who voted were in favor of the bill, while under 6% of democrats supported it. This massive delta yields a p-value of 1.99x10^-177.

4.   Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

    *   Finally, there is only one issue/bill that members of both parties supported almost equally. That is the Water Project bill, coming in with 50.21% and 50.68% of democrat and republican votes, respectively. This sort of confluence of opinions yields a p-value of 0.929. 
    
        There was only one other bill that even came close to meeting the required p-value threshold of > 0.1, and that was the Immigration bill. That bill garnered 47.15% of votes from democrats, and 55.76% from republicans, yielding a p-value of 0.0833. It almost clears the hurdle set, but 'almost' only counts in horseshoes or with hand grenades ¯\\\_(ツ)_/¯
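
The issue-by-issue search described in the answers above could be sketched as a single helper that tests every column and then filters on the p-value thresholds. This is an assumption-laden sketch rather than the notebook's own method: `compare_parties`, `toy_rep`, and `toy_dem` are hypothetical names, and the tiny frames are made-up stand-ins for `rep1` and `dem1`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(rep_df, dem_df, issues):
    """Run a two-sample t-test per issue and return a summary DataFrame."""
    rows = []
    for issue in issues:
        stat, p = ttest_ind(rep_df[issue], dem_df[issue], nan_policy='omit')
        rows.append({'issue': issue,
                     'rep_yes': rep_df[issue].mean(),   # share of 'yes', NaN skipped
                     'dem_yes': dem_df[issue].mean(),
                     'statistic': float(stat),
                     'pvalue': float(p)})
    return pd.DataFrame(rows).set_index('issue')

# Tiny made-up frames standing in for rep1 / dem1; not the real data
toy_rep = pd.DataFrame({'issue-a': [1, 1, 1, 1, 0, 1, 1, np.nan],
                        'issue-b': [1, 0, 1, 0, 1, 0, np.nan, 1]})
toy_dem = pd.DataFrame({'issue-a': [0, 0, 0, 1, 0, 0, 0, 0],
                        'issue-b': [0, 1, 0, 1, 1, 0, 1, 0]})

summary = compare_parties(toy_rep, toy_dem, ['issue-a', 'issue-b'])
print(summary[summary.pvalue < 0.01])   # strong differences between the groups
print(summary[summary.pvalue > 0.1])    # no clear difference
```

With the real data, something like `compare_parties(rep1, dem1, columnHeaders[1:])` would cover all 16 issues at once; the sign of `statistic` then shows which party supported the issue more (positive means the first frame, here republicans).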


# ~Stretch goal 1~ \<complete>
*   Functions are fun

# Stretch goal 2 \<hmm..>