
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
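As a minimal sketch of what a two-sample t-test looks like in practice, the snippet below compares two small, made-up groups of yes/no votes. The arrays `group_a` and `group_b` are hypothetical illustration data, not taken from the congressional dataset, and dropping missing votes per group is only one reasonable way to handle NaNs.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes on one issue: 1 = yes, 0 = no, NaN = missing.
group_a = np.array([1, 1, 1, 0, 1, 1, np.nan, 1, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, np.nan, 0, 0, 0, 0])

# Drop missing votes within each group before testing.
clean_a = group_a[~np.isnan(group_a)]
clean_b = group_b[~np.isnan(group_b)]

# Two-sample (independent) t-test: compares the mean of group A
# against the mean of group B rather than against a fixed value.
result = ttest_ind(clean_a, clean_b)

print(f"mean A = {clean_a.mean():.3f}, mean B = {clean_b.mean():.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.5f}")
```

Because the votes are coded as 0 and 1, each group mean is simply the share of "yes" votes, so the test asks whether the two shares differ by more than chance would explain.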

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

In [2]:
# Specify the columns for the UCI dataset
cols = [
    'party',
    'handicapped-infants',
    'water-project',
    'budget',
    'physician-fee-freeze',
    'el-salvador-aid',
    'religious-groups',
    'anti-satellite-ban',
    'aid-to-contras',
    'mx-missile',
    'immigration',
    'synfuels',
    'education',
    'right-to-sue',
    'crime',
    'duty-free',
    'south-africa',
    ]


# Assign the url string to a variable
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

# Fetch the csv data, assuming no header row in the data
#   1. assign the column names specified above
#   2. replace '?' with NaN values
df_house = pd.read_csv(url, header=None, names=cols, na_values='?')

# Translate 'y' and 'n' votes to numeric 1 and 0
df_house = df_house.replace({'y': 1, 'n': 0})
df_house.sample(10)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
260,democrat,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
148,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
49,republican,0.0,,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
391,democrat,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
60,democrat,1.0,1.0,1.0,0.0,0.0,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
336,democrat,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,,1.0
304,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,
113,republican,0.0,,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
29,democrat,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0


In [3]:
# Construct two dataframes: republican votes & democrat votes
df_house_gop = df_house[df_house['party'] == 'republican']
df_house_dem = df_house[df_house['party'] == 'democrat']

df_house_gop.sample(10)
df_house_dem.sample(10)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
183,democrat,,,,,,,,,1.0,,,,,,,
291,democrat,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,,0.0,1.0,0.0,1.0
415,democrat,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,,1.0,0.0,0.0,0.0,0.0,0.0,1.0
139,democrat,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
29,democrat,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
34,democrat,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
17,democrat,1.0,,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
320,democrat,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0
361,democrat,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,
12,democrat,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,,


In [0]:
# Define a function to apply a test
#   * issu:  the issue column name being tested
#   * pdict: the party (string key: 'republican' or 'democrat') voting record (value: dataframe) being tested
def poly_diff(issu, pdict):
  # Assign each party's voting record on the issue to a temporary Series
  df_tmp_r = pdict['republican'][issu]
  df_tmp_d = pdict['democrat'][issu]

  # Remove NaNs from each temporary dataframe
  df_tmp_r = df_tmp_r.dropna()
  df_tmp_d = df_tmp_d.dropna()

  return ttest_ind(df_tmp_r, df_tmp_d)


In [0]:
# Create a results list
list_rslts = []

# Iterate through every issue column (index 0 is 'party', so start at 1)
for i in range(1, len(cols)):
  # Set up some working variables
  wrk_lt_pt01 = ""
  wrk_gt_pt1  = ""

  # Create a list to house this iteration's output
  lst_iter = []
  
  # Call the poly_diff function processing this iteration's issue
  tst_itr = poly_diff(cols[i], {"republican": df_house_gop, "democrat": df_house_dem})

  # Set user friendly data values characterizing this iteration's p-value
  if tst_itr.pvalue < .01 : wrk_lt_pt01 = "Yes" 
  if tst_itr.pvalue > .1  : wrk_gt_pt1  = "Yes"

  # Assign poly_diff output to the iteration list
  lst_iter.append(cols[i])
  lst_iter.append(tst_itr.statistic)
  lst_iter.append(tst_itr.pvalue)
  lst_iter.append(wrk_lt_pt01)
  lst_iter.append(wrk_gt_pt1)

  # Append this iteration's results list to the overall results list
  list_rslts.append(lst_iter)


In [10]:
# Done iterating, create a results dataframe
df_results = pd.DataFrame(list_rslts, columns=['Issue', 't-value', 'p-value', 'p-value < .01', 'p-value > 0.1'])
df_results

Unnamed: 0,Issue,t-value,p-value,p-value < .01,p-value > 0.1
0,handicapped-infants,-9.205264,1.61344e-18,Yes,
1,water-project,0.088965,0.9291557,,Yes
2,budget,-23.212777,2.07034e-77,Yes,
3,physician-fee-freeze,49.367082,1.994262e-177,Yes,
4,el-salvador-aid,21.136693,5.60052e-68,Yes,
5,religious-groups,9.737576,2.393672e-20,Yes,
6,anti-satellite-ban,-12.526188,8.521033e-31,Yes,
7,aid-to-contras,-18.052093,2.824718e-54,Yes,
8,mx-missile,-16.437503,5.030793e-47,Yes,
9,immigration,1.735912,0.08330248,,


## Discussion

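**NOTE**: The function above calls `ttest_ind` with the Republican votes as the first argument and the Democrat votes as the second. A positive t-value therefore means Republicans voted "yes" at a higher rate, and a negative t-value means Democrats did.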
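The effect of argument order can be checked with a short sketch. The lists `first` and `second` are hypothetical votes made up for illustration; the point is only that swapping the arguments flips the sign of the t-value while leaving the p-value unchanged.

```python
from scipy.stats import ttest_ind

# Hypothetical yes (1) / no (0) votes for two groups on one issue.
first = [1, 1, 1, 1, 0, 1, 1, 1]   # mostly "yes"
second = [0, 0, 1, 0, 0, 0, 1, 0]  # mostly "no"

forward = ttest_ind(first, second)  # first group passed first
reverse = ttest_ind(second, first)  # same data, arguments swapped

# The statistic is based on mean(first argument) - mean(second argument),
# so its sign depends on the order; the two-sided p-value does not.
print(f"forward: t = {forward.statistic:.3f}, p = {forward.pvalue:.5f}")
print(f"reverse: t = {reverse.statistic:.3f}, p = {reverse.pvalue:.5f}")
```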

***Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01***

Democrats supported the *budget* issue more than Republicans: the t-value is negative (-23.21, with Republicans as the first sample) and the p-value (about 2.07e-77) is far below .01. Conventional wisdom holds that Democrats generally support government spending and programs, so this result appears to align with that political tenet.

***Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01***

Republicans supported the *religious-groups* issue more than Democrats: the t-value is positive (9.74, with Republicans as the first sample) and the p-value (about 2.39e-20) is far below .01. Conventional wisdom holds that Republicans generally support more involvement of religious groups and beliefs in the public sector, so this result also appears to align with a common political tenet.


***Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)***

The *water-project* issue has a p-value of about 0.93 (t = 0.09), well above .1. In this case we fail to reject the null hypothesis that Republicans and Democrats voted "yes" at the same rate on this issue.
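The reasoning used in this discussion (read the sign of the t-value, then compare the p-value with the thresholds) can be sketched as a small helper. The `classify` function and the `supported more by` column are hypothetical names introduced for illustration; the numbers are copied from a few rows of the results table above, where the Republican votes were passed first.

```python
import pandas as pd

# A few rows copied from the results table (Republicans passed first).
results = pd.DataFrame({
    'Issue': ['handicapped-infants', 'water-project', 'budget',
              'religious-groups', 'immigration'],
    't-value': [-9.205264, 0.088965, -23.212777, 9.737576, 1.735912],
    'p-value': [1.61344e-18, 0.9291557, 2.07034e-77, 2.393672e-20, 0.08330248],
})

def classify(row, alpha=0.01, no_diff=0.1):
    """Label one row using the assignment's thresholds (hypothetical helper)."""
    if row['p-value'] < alpha:
        # Negative t: the second sample (Democrats) has the higher mean.
        return 'democrats' if row['t-value'] < 0 else 'republicans'
    if row['p-value'] > no_diff:
        return 'no clear difference'
    # Between the two thresholds: neither goal's criterion is met.
    return 'inconclusive'

results['supported more by'] = results.apply(classify, axis=1)
print(results[['Issue', 'supported more by']])
```

Under these thresholds *immigration* (p of about 0.083) falls between the two cutoffs, so it satisfies neither the p < 0.01 goals nor the p > 0.1 goal.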

