<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!
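As a minimal sketch of how the missing values might be surfaced, the snippet below parses three made-up rows laid out like the UCI file (party first, then `y`/`n`/`?` votes). The rows and the column names `issue_a`, `issue_b`, `issue_c` are illustrative assumptions, not real records.

```python
import io

import pandas as pd

# Three made-up rows in the dataset's layout: party, then y/n/? votes.
# Only three issue columns are used to keep the sketch short.
sample = io.StringIO(
    "republican,n,y,?\n"
    "democrat,?,y,y\n"
    "democrat,y,?,n\n"
)

issue_cols = ["issue_a", "issue_b", "issue_c"]  # hypothetical names

votes = pd.read_csv(
    sample,
    header=None,
    names=["party"] + issue_cols,
    na_values="?",  # treat '?' as a missing vote
)

# Map votes to numbers; missing entries stay NaN.
votes[issue_cols] = votes[issue_cols].apply(lambda s: s.map({"n": 0.0, "y": 1.0}))

# Count the missing votes per issue before deciding how to handle them.
missing_per_issue = votes[issue_cols].isna().sum()
print(missing_per_issue)
```

Dropping missing votes per issue, rather than dropping every row that has any missing vote, keeps more observations for each individual test.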

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
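A two-sample test of this kind might look like the sketch below. The vote arrays are made up for illustration; `scipy.stats.ttest_ind` is the SciPy function for an independent two-sample t-test, and `nan_policy='omit'` tells it to skip missing entries.

```python
import numpy as np
from scipy import stats

# Made-up yes (1) / no (0) votes for two groups; NaN marks a missing vote.
group_a = np.array([1, 1, 1, 0, 1, 1, np.nan, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# Independent two-sample t-test; NaN entries are omitted from each sample.
result = stats.ttest_ind(group_a, group_b, nan_policy="omit")
t_stat, p_value = float(result.statistic), float(result.pvalue)

# The sign of t follows the argument order:
# a positive t means the first sample has the higher mean.
print(f"mean a = {np.nanmean(group_a):.2f}, mean b = {np.nanmean(group_b):.2f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Because the sign of t depends on which group is passed first, comparing the two group means directly is a simple way to tell which party supports an issue more.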

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
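One possible shape for the first stretch goal is sketched below. The function name `compare_groups`, its parameters, and the toy data are all hypothetical; the idea is only that a single function takes a grouping column and a list of value columns and returns one summary row per column.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_groups(df, group_col, group_a, group_b, value_cols):
    """Run a two-sample t-test per column; return one summary row for each."""
    rows = []
    for col in value_cols:
        # Drop missing values separately for each column and group.
        a = df.loc[df[group_col] == group_a, col].dropna()
        b = df.loc[df[group_col] == group_b, col].dropna()
        t, p = stats.ttest_ind(a, b)
        rows.append({"column": col, "mean_a": a.mean(), "mean_b": b.mean(),
                     "t": float(t), "p": float(p)})
    return pd.DataFrame(rows)


# Made-up data for illustration only.
toy = pd.DataFrame({
    "party": ["x"] * 6 + ["y"] * 6,
    "issue_1": [1, 1, 1, 1, np.nan, 1, 0, 0, 0, 1, 0, 0],
    "issue_2": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, np.nan, 1],
})

summary = compare_groups(toy, "party", "x", "y", ["issue_1", "issue_2"])
print(summary)
```

Returning a DataFrame instead of printing inside the loop makes it easy to filter the results afterwards, for example by `p < 0.01` or `p > 0.1`.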

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy import stats

# Needed for grabbing the dataset from https
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# To improve readability, I ran all informational calls elsewhere
# so that the output from things like head(), tail(), and others
# would not clutter this notebook. If demonstrating these skills
# in notebooks is still required, please let me know.

# 1. Load and clean the data (or determine the best method to drop observations when running tests)

columns = ['party',
           'handicapped-infants',
           'water-project',
           'budget',
           'physician-fee-freeze',
           'el-salvador-aid',
           'religious-groups',
           'anti-satellite-ban',
           'aid-to-contras',
           'mx-missile',
           'immigration',
           'synfuels',
           'education',
           'right-to-sue',
           'crime',
           'duty-free',
           'south-africa']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 header = None,
                 names = columns,
                 na_values = '?'
                 )


# Replace string values with numbers so they are easier to process later.
#####################

df = df.replace({'n':0,
                 'y':1})


# Separate the two parties by subsetting.
#####################

# - Republicans
df_r = df[df['party'] == 'republican']

# Party is no longer needed; drop it and discard the old index
df_r = df_r.drop(['party'], axis = 1).reset_index(drop = True)

# - Democrats
df_d = df[df['party'] == 'democrat']

# Party is no longer needed; drop it and discard the old index
df_d = df_d.drop(['party'], axis = 1).reset_index(drop = True)


# Sliced list of issues for iteration later
issues = columns[1:]

# Utility function to run a two-sample t-test on one column
def do_test(df_a, df_b, col):

    # Process a: drop missing votes, then take the mean
    # (the share of "yes" votes, since y = 1 and n = 0)
    a = df_a[col].dropna()
    a_mean = a.mean()

    # Process b the same way
    b = df_b[col].dropna()
    b_mean = b.mean()

    # ttest
    t, p = stats.ttest_ind(a,
                           b,
                           nan_policy='omit')

    # 1 - t value
    # 2 - p value
    # 3 - sample_a mean
    # 4 - sample_b mean

    return t, p, a_mean, b_mean


# 2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

# 3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

# 4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

for i in issues:

    t, p, r_mean, d_mean = do_test(df_r, df_d, i)

    print('Issue: ' + i)
    print('******************')
    if r_mean > d_mean:
        print('- More Republican support:')
        print('---'+str(round(r_mean * 100))+'% of Republicans')
        print('---'+str(round(d_mean * 100))+'% of Democrats')
        print('------ t = '+ str(t))
        print('------ p = '+ str(p))
        print('---------- Hypothesis:')

        if p < 0.01:
            print(
                'Given the information above, we reject the null hypothesis. The difference is not "random"\nthere is too much of a difference')
            print('\n\n')
        elif p > 0.1:
            print(
                'In this case, we fail to reject the null hypothesis, there is not enough of a difference \nbetween how each party voted')
            print('\n\n')
        else:
            print('unknown\n\n')

    if d_mean > r_mean:
        print('- More Democratic support:')
        print('---'+str(round(d_mean * 100, 2))+'% of Democrats')
        print('---'+str(round(r_mean * 100, 2))+'% Republicans')
        print('------ t = ' + str(t))
        print('------ p = ' + str(p))
        print('---------- Hypothesis:')


        if p < 0.01:
            print('Given the information above, we reject the null hypothesis. The difference is not "random"\nthere is too much of a difference')
            print('\n\n')
        elif p > 0.1:
            print('In this case, we fail to reject the null hypothesis, there is not enough of a difference \nbetween how each party voted')
            print('\n\n')
        else:
            print('unknown\n\n')



Issue: handicapped-infants
******************
- More Democratic support:
---60.47% of Democrats
---18.79% Republicans
------ t = -9.205264294809222
------ p = 1.613440327937243e-18
---------- Hypothesis:
Given the information above, we reject the null hypothesis. The difference is not "random"
there is too much of a difference



Issue: water-project
******************
- More Republican support:
---51.0% of Republicans
---50.0% of Democrats
------ t = 0.08896538137868286
------ p = 0.9291556823993485
---------- Hypothesis:
In this case, we fail to reject the null hypothesis, there is not enough of a difference 
between how each party voted



Issue: budget
******************
- More Democratic support:
---88.85% of Democrats
---13.41% Republicans
------ t = -23.21277691701378
------ p = 2.0703402795404463e-77
---------- Hypothesis:
Given the information above, we reject the null hypothesis. The difference is not "random"
there is too much of a difference



Issue: physician-fee-freeze
***