<a href="https://colab.research.google.com/github/cedro-gasque/DS-Unit-1-Sprint-2-Statistics/blob/master/module1/LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware: there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
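
As a minimal sketch of what a 2 sample t-test with missing values looks like, the example below runs `ttest_ind` on two small made-up arrays of 0/1 votes (the arrays `group_a` and `group_b` are illustrative assumptions, not taken from the voting data). It compares letting SciPy skip the missing votes via `nan_policy='omit'` against dropping them by hand first; the two routes should agree.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy 0/1 votes, made up for illustration; np.nan marks a missing vote.
group_a = np.array([1, 1, 0, 1, np.nan, 1, 1, 0])
group_b = np.array([0, 0, 1, 0, 0, np.nan, 0, 1])

# Option 1: let SciPy ignore the missing votes.
omit = ttest_ind(group_a, group_b, nan_policy='omit')

# Option 2: drop the missing votes by hand, then run the same test.
manual = ttest_ind(group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)])

# A positive statistic means the first sample has the higher mean.
print('omit:  ', float(omit.statistic), float(omit.pvalue))
print('manual:', float(manual.statistic), float(manual.pvalue))
```

Either option drops observations per issue rather than per congressperson, so a missing vote on one issue does not discard that person's votes on the others.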

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_1samp, ttest_rel, ttest_ind_from_stats

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'voting-records/house-votes-84.data',
                 header=None,
                 names=['party', 'handicapped-infants', 'water-project',
                        'budget', 'physician-fee-freeze', 'el-salvador-aid',
                        'religious-groups', 'anti-satellite-ban',
                        'aid-to-contras', 'mx-missle', 'immigration',
                        'synfuels', 'education', 'right-to-sue', 'crime',
                        'duty-free', 'south-africa'])
df

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missle,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
432,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
433,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y


In [2]:
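# replace '?' with NaN and encode the votes as numbers (n -> 0, y -> 1)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = df.replace({'?': np.nan, 'n': 0, 'y': 1})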
df

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missle,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,republican,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
431,democrat,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
432,republican,0.0,,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0
433,republican,0.0,0.0,0.0,1.0,1.0,1.0,,,,,0.0,1.0,1.0,1.0,0.0,1.0


In [0]:
# separate the parties to test the null hypothesis
rep = df[df.party == 'republican']
dem = df[df.party == 'democrat']

In [11]:
# sanity check: share of 'yes' votes per party on one issue (missing votes are skipped)
print('rep:', rep['handicapped-infants'].mean(), '\ndem:', dem['handicapped-infants'].mean())

rep: 0.18787878787878787 
dem: 0.6046511627906976
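
Because votes are encoded as 0 and 1, the mean of a column is the share of "yes" votes among the members who actually voted. A small sketch with a made-up series (the values in `votes` are an assumption for illustration only) shows how pandas skips `NaN` when averaging:

```python
import numpy as np
import pandas as pd

# Made-up votes: 1 = yes, 0 = no, NaN = missing.
votes = pd.Series([1, 0, np.nan, 1, 1, np.nan, 0])

share_yes = votes.mean()        # NaN is skipped: 3 yes out of 5 recorded votes
n_counted = votes.count()       # number of non-missing votes
n_missing = votes.isna().sum()  # number of missing votes

print(share_yes, n_counted, n_missing)
```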


In [0]:
class Result:
    """Outcome of a two-sample t-test for one issue (Democrats vs. Republicans)."""

    def __init__(self, issue, statistic, pvalue):
        self.issue = issue
        self.statistic = statistic
        self.pvalue = pvalue

    def __repr__(self):
        # a statistic of exactly 0 means both samples have the same mean,
        # which typically happens when the same sample is passed in twice
        if self.statistic == 0:
            return ('These are the exact same dataset...\n'
                    '...What are you playing at?')

        # a positive statistic means the first sample (Democrats) has the
        # higher mean; otherwise the order of the two parties is swapped
        dr = ('Democrats', 'Republicans')
        n = 0 if self.statistic > 0 else 1

        return f'{dr[n]} support {self.issue} more than {dr[1 - n]}. statistic: {self.statistic}, pvalue: {self.pvalue}'

In [0]:
def hypothesis_test(issue, d1=dem, d2=rep):
    # two-sample t-test on one issue; missing votes are dropped per sample
    statistic, pvalue = ttest_ind(d1[issue], d2[issue], nan_policy='omit')
    return Result(issue, statistic, pvalue)

In [38]:
issues = df.drop(labels='party', axis=1).columns
tests = []
for issue in issues:
    tests.append(hypothesis_test(issue))
    print(tests[-1])

Democrats support handicapped-infants more than Republicans. statistic: 9.205264294809222, pvalue: 1.613440327937243e-18
Republicans support water-project more than Democrats. statistic: -0.08896538137868286, pvalue: 0.9291556823993485
Democrats support budget more than Republicans. statistic: 23.21277691701378, pvalue: 2.0703402795404463e-77
Republicans support physician-fee-freeze more than Democrats. statistic: -49.36708157301406, pvalue: 1.994262314074344e-177
Republicans support el-salvador-aid more than Democrats. statistic: -21.13669261173219, pvalue: 5.600520111729011e-68
Republicans support religious-groups more than Democrats. statistic: -9.737575825219457, pvalue: 2.3936722520597287e-20
Democrats support anti-satellite-ban more than Republicans. statistic: 12.526187929077842, pvalue: 8.521033017443867e-31
Democrats support aid-to-contras more than Republicans. statistic: 18.052093200819733, pvalue: 2.82471841372357e-54
Democrats support mx-missle more than Republicans. stati

In [0]:
def check_for_pvalue(list_of_tests, pvalue, direction='lt'):
    # 'lt' keeps tests with a p-value below the threshold;
    # any other direction keeps tests with a p-value above it
    if direction == 'lt':
        return [test for test in list_of_tests if test.pvalue < pvalue]
    return [test for test in list_of_tests if test.pvalue > pvalue]

In [54]:
less_than_0point01 = check_for_pvalue(tests, 0.01)
less_than_0point01

[Democrats support handicapped-infants more than Republicans. statistic: 9.205264294809222, pvalue: 1.613440327937243e-18,
 Democrats support budget more than Republicans. statistic: 23.21277691701378, pvalue: 2.0703402795404463e-77,
 Republicans support physician-fee-freeze more than Democrats. statistic: -49.36708157301406, pvalue: 1.994262314074344e-177,
 Republicans support el-salvador-aid more than Democrats. statistic: -21.13669261173219, pvalue: 5.600520111729011e-68,
 Republicans support religious-groups more than Democrats. statistic: -9.737575825219457, pvalue: 2.3936722520597287e-20,
 Democrats support anti-satellite-ban more than Republicans. statistic: 12.526187929077842, pvalue: 8.521033017443867e-31,
 Democrats support aid-to-contras more than Republicans. statistic: 18.052093200819733, pvalue: 2.82471841372357e-54,
 Democrats support mx-missle more than Republicans. statistic: 16.437503268542994, pvalue: 5.03079265310811e-47,
 Democrats support synfuels more than Republ

In [57]:
# print the issues Democrats support more than Republicans (p < 0.01)

for r in less_than_0point01:
    if r.statistic > 0:
        print(r)

Democrats support handicapped-infants more than Republicans. statistic: 9.205264294809222, pvalue: 1.613440327937243e-18
Democrats support budget more than Republicans. statistic: 23.21277691701378, pvalue: 2.0703402795404463e-77
Democrats support anti-satellite-ban more than Republicans. statistic: 12.526187929077842, pvalue: 8.521033017443867e-31
Democrats support aid-to-contras more than Republicans. statistic: 18.052093200819733, pvalue: 2.82471841372357e-54
Democrats support mx-missle more than Republicans. statistic: 16.437503268542994, pvalue: 5.03079265310811e-47
Democrats support synfuels more than Republicans. statistic: 8.293603989407588, pvalue: 1.5759322301054064e-15
Democrats support duty-free more than Republicans. statistic: 12.853146132542978, pvalue: 5.997697174347365e-32
Democrats support south-africa more than Republicans. statistic: 6.849454815841208, pvalue: 3.652674361672226e-11


In [51]:
len(tests)

16

In [55]:
len(less_than_0point01)

14

In [59]:
# only one issue has p > 0.1
greater_than_0point1 = check_for_pvalue(tests, 0.1, direction='gt')
greater_than_0point1

[Republicans support water-project more than Democrats. statistic: -0.08896538137868286, pvalue: 0.9291556823993485]
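
The cells above print the issues Democrats support more at p < 0.01; the mirror-image filter for Republicans (goal 3) could look like the sketch below. To keep it self-contained it does not reuse `tests`; instead it copies a few statistic and p-value pairs from the output above into a hypothetical `Row` namedtuple, which is an assumption made only for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the Result objects; values copied from the output above.
Row = namedtuple('Row', ['issue', 'statistic', 'pvalue'])

rows = [
    Row('handicapped-infants', 9.205264294809222, 1.613440327937243e-18),
    Row('water-project', -0.08896538137868286, 0.9291556823993485),
    Row('physician-fee-freeze', -49.36708157301406, 1.994262314074344e-177),
    Row('el-salvador-aid', -21.13669261173219, 5.600520111729011e-68),
    Row('religious-groups', -9.737575825219457, 2.3936722520597287e-20),
]

# A negative statistic means the second sample (Republicans) has the higher mean.
rep_support = [r.issue for r in rows if r.statistic < 0 and r.pvalue < 0.01]
print(rep_support)
```

Note that water-project has a negative statistic but is excluded, since its p-value is far above 0.01.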