<a href="https://colab.research.google.com/github/Vertex138/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/submission/Colin_Brinkley_LS_DS_131_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
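As a rough illustration of a two-sample t-test on binary votes with missing values, here is a minimal sketch. The arrays `group_a` and `group_b` are hypothetical toy data, not the congressional dataset. It compares two ways of handling missing values: letting SciPy skip them with `nan_policy='omit'`, or dropping them by hand first.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy votes: 1 = yes, 0 = no, NaN = missing
group_a = np.array([1, 1, 0, 1, np.nan, 1, 0, 1])
group_b = np.array([0, 0, 1, np.nan, 0, 0, 1, 0])

# Option 1: let SciPy ignore the missing values
omit = ttest_ind(group_a, group_b, nan_policy='omit')

# Option 2: drop the missing values explicitly before testing
clean_a = group_a[~np.isnan(group_a)]
clean_b = group_b[~np.isnan(group_b)]
manual = ttest_ind(clean_a, clean_b)

print("omit:  ", float(omit.statistic), float(omit.pvalue))
print("manual:", float(manual.statistic), float(manual.pvalue))
```

Both approaches should give the same statistic and p-value, so the choice is mostly about readability.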

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_1samp

In [0]:
#Importing the dataset, cleaning it up (missing -> NaN, n -> 0, y -> 1), and splitting it in two by 'class'
columnNames = [
    'class',
    'infant',
    'water',
    'budget',
    'physician',
    'elsalvador',
    'religion',
    'satellite',
    'contras',
    'missile',
    'immigration',
    'synfuels',
    'education',
    'superfund',
    'crime',
    'exports',
    'africa']
congressDF = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',names=columnNames)
congressDF = congressDF.replace({'?':np.nan, 'n':0, 'y':1})
demDF = congressDF[congressDF['class'] == 'democrat']
repDF = congressDF[congressDF['class'] == 'republican']

# One-Sample T-Testing

In [0]:
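def dem0TTest(name, conf = .05):
  #One-sample t-test of Democrat votes against a mean of 0, ignoring missing votes
  #('conf' is kept for a consistent signature but is not used here)
  return ttest_1samp(demDF[name], 0, nan_policy='omit')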
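For reference, a one-sample t-test compares a sample mean with a fixed reference value using t = (mean - mu0) / (s / sqrt(n)). The sketch below checks that formula against `ttest_1samp`. The `votes` array and the reference value `mu0 = 0.5` are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical toy votes from a single group: 1 = yes, 0 = no
votes = np.array([1, 1, 0, 1, 1, 0, 1, 1])
mu0 = 0.5  # assumed reference value: an evenly split vote

# Manual t-statistic using the sample standard deviation (ddof=1)
n = len(votes)
manual_t = (votes.mean() - mu0) / (votes.std(ddof=1) / np.sqrt(n))

# Same test through SciPy
result = ttest_1samp(votes, mu0)

print("manual t:", manual_t)
print("scipy t: ", float(result.statistic), "p:", float(result.pvalue))
```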

# Two-Sample T-Testing

In [0]:
def twoPartyTTest(name, conf = .05):
  #Note: 'conf' is the significance level (alpha); 1 - conf is the confidence level
  
  #Run the two-sample t-test once (ignoring missing votes) and calculate the means
  result = ttest_ind(repDF[name], demDF[name], nan_policy='omit')
  pvalue = result.pvalue
  stat = result.statistic
  repMean = repDF[name].mean()
  demMean = demDF[name].mean()
  
  #Calculate whether to reject or not reject the null hypothesis, and give a reason why
  if (pvalue < conf):
    print("Reject the null hypothesis.")
    if (repMean > demMean):
      print("Republicans are more in favor for the \'"+ name+ "\' vote than Democrats.")
    else:
      print("Democrats are more in favor for the \'"+ name+ "\' vote than Republicans.")
  else:
    print("Fail to reject the null hypothesis.\n",
          "Democrats and Republicans are in similar amounts of favor for the \'"+ name+ "\' vote.")
    
  #Output the variables calculated during the function
  print("RAW VALUES:\n",
        "P Value:\t\t"+ str(pvalue)+"\n",
        "Confidence:\t\t"+str((1-conf)*100)+"%\n",
        "Republican Mean:\t"+str(repMean)+"\n",
        "Democrat Mean:\t\t" +str(demMean)+"\n",
        "T-Statistic:\t\t"+str(stat))
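One possible way to approach the first stretch goal is to return the results instead of printing them, so the same test can be run over many issues and collected in a table. This is a minimal sketch, not the notebook's method: `compare_parties`, its parameters, and the `toy` DataFrame are all hypothetical names and data introduced for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(df, issue, group_col='class', alpha=0.05):
    """Two-sample t-test of one issue between republicans and democrats."""
    # Drop missing votes per issue rather than dropping whole rows
    rep = df.loc[df[group_col] == 'republican', issue].dropna()
    dem = df.loc[df[group_col] == 'democrat', issue].dropna()
    result = ttest_ind(rep, dem)
    return {
        'issue': issue,
        'rep_mean': rep.mean(),
        'dem_mean': dem.mean(),
        't_stat': float(result.statistic),
        'p_value': float(result.pvalue),
        'reject_null': bool(result.pvalue < alpha),
    }

# Hypothetical toy data in the same shape as the cleaned voting data
toy = pd.DataFrame({
    'class': ['republican'] * 8 + ['democrat'] * 8,
    'issue_a': [1, 1, 1, 1, 1, 1, 0, np.nan, 0, 0, 0, 0, 0, 0, 1, np.nan],
    'issue_b': [1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})

# One row of results per issue
summary = pd.DataFrame([compare_parties(toy, col) for col in ['issue_a', 'issue_b']])
print(summary)
```

With the real data, the same list comprehension could loop over every column except `class`.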

## Democrat Support Hypothesis Testing

**Null Hypothesis:** Democrats and Republicans support 'budget' at the same average rate.

**Alternative Hypothesis:** Democrats support 'budget' more than Republicans.

**Confidence Level:** 95% (significance level α = 0.05)

**T-Statistic:** -23.21277691701378

**P-Value:** 2.0703402795404463e-77

In [7]:
# Democrats support 'budget' more than Republicans at the 95% confidence level:
twoPartyTTest('budget')

Reject the null hypothesis.
Democrats are more in favor for the 'budget' vote than Republicans.
RAW VALUES:
 P Value:		2.0703402795404463e-77
 Confidence:		95.0%
 Republican Mean:	0.13414634146341464
 Democrat Mean:		0.8884615384615384
 T-Statistic:		-23.21277691701378
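The alternative hypothesis for 'budget' is directional, while `ttest_ind` reports a two-sided p-value by default. The sketch below shows one way to request a one-sided test, assuming SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument. The arrays are stand-ins rebuilt from the reported means under the assumption of 164 non-missing Republican votes (22 yes) and 260 non-missing Democrat votes (231 yes). They are not the loaded dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Assumed reconstruction from the reported means (22/164 and 231/260)
rep_votes = np.array([1] * 22 + [0] * 142)
dem_votes = np.array([1] * 231 + [0] * 29)

# Default two-sided test, as used by twoPartyTTest
two_sided = ttest_ind(rep_votes, dem_votes)

# One-sided test: the alternative is that the Republican mean is lower
one_sided = ttest_ind(rep_votes, dem_votes, alternative='less')

print("t-statistic:      ", float(two_sided.statistic))
print("two-sided p-value:", float(two_sided.pvalue))
print("one-sided p-value:", float(one_sided.pvalue))
```

Because the t-statistic is negative, the one-sided p-value here is half the two-sided one, so the conclusion is unchanged.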


## Republican Support Hypothesis Testing

**Null Hypothesis:** Democrats and Republicans support 'physician' at the same average rate.

**Alternative Hypothesis:** Republicans support 'physician' more than Democrats.

**Confidence Level:** 95% (significance level α = 0.05)

**T-Statistic:** 49.36708157301406

**P-Value:** 1.994262314074344e-177

In [5]:
# Republicans support 'physician' more than Democrats at the 95% confidence level:
twoPartyTTest('physician')

Reject the null hypothesis.
Republicans are more in favor for the 'physician' vote than Democrats.
RAW VALUES:
 P Value:		1.994262314074344e-177
 Confidence:		95.0%
 Republican Mean:	0.9878787878787879
 Democrat Mean:		0.05405405405405406
 T-Statistic:		49.36708157301406


## Similar Support Hypothesis Testing

**Null Hypothesis:** Democrats and Republicans support 'immigration' at the same average rate.

**Alternative Hypothesis:** Democrats and Republicans support 'immigration' at different average rates.

**Confidence Level:** 95% (significance level α = 0.05)

**T-Statistic:** 1.7359117329695164

**P-Value:** 0.08330248490425066

In [6]:
# At the 95% confidence level, support for 'immigration' does not differ significantly between the parties
# (note: p is about 0.083, which is still below the p > 0.1 target in goal 4)
twoPartyTTest('immigration')

Fail to reject the null hypothesis.
 Democrats and Republicans are in similar amounts of favor for the 'immigration' vote.
RAW VALUES:
 P Value:		0.08330248490425066
 Confidence:		95.0%
 Republican Mean:	0.5575757575757576
 Democrat Mean:		0.4714828897338403
 T-Statistic:		1.7359117329695164
