<a href="https://colab.research.google.com/github/hurshd0/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/module1-statistics-probability-and-inference/LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
#################### BOILER PLATE ###################
%matplotlib inline

import numpy as np # Linear algebra lib
import pandas as pd # Data analysis lib
import matplotlib.pyplot as plt # plotting lib
import seaborn as sns # matplotlib wrapper plotting lib
import random # python random lib

# Matplotlib and Seaborn params
from matplotlib import rcParams
rcParams['figure.figsize'] = 10, 6
plt.style.use('fivethirtyeight')

# Removes rows and columns truncation of '...'
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

# Load stats modules
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, ttest_1samp

### 1. Load and clean the data (or determine the best method to drop observations when running tests)


In [2]:
column_names = ['party','handicapped-infants','water-project','budget','physician-fee-freeze','el-salvador-aid','religious-groups','anti-satellite-ban','aid-to-contras','mx-missile','immigration','synfuels','education','right-to-sue','crime','duty-free','south-africa']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', names=column_names, na_values='?', ).replace({"y":1, "n":0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [3]:
dems = df[df['party'] == 'democrat']
dems.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [4]:
reps = df[df['party'] == 'republican']
reps.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


# 1-sample t-test example

In a 1-sample t-test we are testing the mean of one sample against a **null hypothesis of our choosing**.

The null hypothesis that we designate depends on how we have encoded our data and the kind of questions that we want to test. 

If I have encoded votes as 0 for no and 1 for yes, I want to test Democratic support for an issue, and I use a null hypothesis of 0, then I am comparing Democrat voting support against a null hypothesis of no Democrat support at all for a given issue.

If I use a null hypothesis of .5 then I am comparing the democrat voting support against a null hypothesis of democrats being neither in favor of nor against a particular issue.

If I use a null hypothesis of 1 then I am comparing the democrat voting support against a null hypothesis of all democrats being in favor of a particular issue.

Let's use the 0, .5, and 1 null hypotheses to test the significance of those particular claims. They're all valid questions to ask; each one just tests something slightly different.

![](https://i.imgur.com/sRhrGQ6.png)

In [5]:
dems['handicapped-infants'].value_counts(dropna=False)

1.0    156
0.0    102
NaN      9
Name: handicapped-infants, dtype: int64

In [6]:
dems['handicapped-infants'].mean()

0.6046511627906976

In [7]:
ttest_1samp(dems['handicapped-infants'], .5, nan_policy='omit')

Ttest_1sampResult(statistic=3.431373087696574, pvalue=0.000699612317167372)

In [8]:
ttest_1samp(dems['handicapped-infants'], 0, nan_policy='omit')

Ttest_1sampResult(statistic=19.825711173357988, pvalue=1.0391992873567661e-53)

In [9]:
ttest_1samp(dems['handicapped-infants'], 1, nan_policy='omit')

Ttest_1sampResult(statistic=-12.96296499796484, pvalue=6.590394568934029e-30)

In [10]:
ttest_1samp(dems['handicapped-infants'], .6, nan_policy='omit')

Ttest_1sampResult(statistic=0.15250547056429176, pvalue=0.8789079366662332)

In [11]:
ttest_1samp(dems['handicapped-infants'], .6046511627906976, nan_policy='omit')

Ttest_1sampResult(statistic=0.0, pvalue=1.0)
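
To make the statistic less of a black box, here is a minimal sketch that recomputes the `.5` test by hand from the vote counts shown earlier (156 yes, 102 no, NaNs dropped). The helper name `one_sample_t` is made up for illustration; it is not part of scipy.

```python
import numpy as np
from scipy import stats

# Vote counts copied from the value_counts() output above (NaNs already dropped).
yes, no = 156, 102
votes = np.array([1.0] * yes + [0.0] * no)

def one_sample_t(sample, null_mean):
    """Return (t, two-sided p) for H0: population mean == null_mean."""
    n = sample.size
    std_err = sample.std(ddof=1) / np.sqrt(n)        # standard error of the mean
    t_stat = (sample.mean() - null_mean) / std_err   # distance from the null in SE units
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # area in both tails
    return t_stat, p_value

t_half, p_half = one_sample_t(votes, 0.5)
print(f"t = {t_half:.4f}, p = {p_half:.6f}")
```

The printed values should agree with the `ttest_1samp(..., .5, nan_policy='omit')` result above, since both use the sample standard deviation with `n - 1` degrees of freedom.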

# 2-sample t-test (for means) example

![](https://i.imgur.com/x1WQr9h.png)

![](https://i.imgur.com/rrfsigE.png)

Null Hypothesis: There's no difference; both parties voted the same.

In [12]:
print("Democrat Support: ", dems['south-africa'].mean())
print("Republican Support: ", reps['south-africa'].mean())

Democrat Support:  0.9351351351351351
Republican Support:  0.6575342465753424


- Null Hypothesis: The mean of democrat support == the mean of republican support. (These two parties support this bill at the same rate.) $H_0: \mu_1 = \mu_2$

- Alternative Hypothesis: The means are different (not the same level of support) $H_1: \mu_1 \neq \mu_2$

Given the gap between the two means above (about 0.94 vs. 0.66), I would expect to REJECT the null hypothesis that there is the same level of support for this bill among democrats and republicans, though that conclusion needs the test itself to confirm it.
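
The cell above prints only the two means, so here is a minimal sketch of the missing test. The vote counts are an assumption: they are back-calculated from the printed means (173/185 and 96/146 are the only fractions that fit within the party sizes), not read from the data. In the notebook itself the equivalent call would be `ttest_ind(dems['south-africa'], reps['south-africa'], nan_policy='omit')`.

```python
import numpy as np
from scipy.stats import ttest_ind

# ASSUMPTION: counts back-calculated from the printed means, not read from the data
# (173/185 is about 0.9351 for democrats, 96/146 is about 0.6575 for republicans).
dem_votes = np.array([1.0] * 173 + [0.0] * 12)
rep_votes = np.array([1.0] * 96 + [0.0] * 50)

# Two-sided, pooled-variance test, which is the ttest_ind default.
t_stat, p_value = ttest_ind(dem_votes, rep_votes)
print(f"t = {t_stat:.4f}, p = {p_value:.3g}")
```

A positive t here means the first group (democrats) has the higher mean; the p-value is what decides whether the null hypothesis is rejected.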

### 2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01


### Before:

- Null Hypothesis: 
   > There is no difference between their support, $H_0: \mu_1 = \mu_2$

- Alternate Hypothesis: 
   > Democrats support the issue more than republicans, $H_1: \mu_1 > \mu_2$ (with $\mu_1$ the democrat mean and $\mu_2$ the republican mean)

In [13]:
dems['anti-satellite-ban'].value_counts(dropna=False)

1.0    200
0.0     59
NaN      8
Name: anti-satellite-ban, dtype: int64

In [14]:
dems['anti-satellite-ban'].mean()

0.7722007722007722

In [15]:
reps['anti-satellite-ban'].value_counts(dropna=False)

0.0    123
1.0     39
NaN      6
Name: anti-satellite-ban, dtype: int64

In [16]:
reps['anti-satellite-ban'].mean()

0.24074074074074073

### After:

In [17]:
ttest_ind(dems['anti-satellite-ban'], reps['anti-satellite-ban'], nan_policy='omit')

Ttest_indResult(statistic=12.526187929077842, pvalue=8.521033017443867e-31)

In [0]:
def t_test_results(df_x, df_y, alternative='two-sided', usevar='pooled'):
    """Print a two-sample t-test summary for two Series, dropping NaNs first."""
    # Imported locally so it does not shadow scipy's ttest_ind used elsewhere.
    # Unlike scipy's version, this one also returns the degrees of freedom.
    from statsmodels.stats.weightstats import ttest_ind
    test_statistic, p_value, degrees_of_freedom = ttest_ind(
        df_x.dropna(), df_y.dropna(), alternative=alternative, usevar=usevar)
    print('''
    =============  T-test ============
    degrees of freedom: {:.4f}
    p-value           : {:.4f}
    test statistic    : {:.4f}
    ==================================
    '''.format(degrees_of_freedom, p_value, test_statistic))

In [19]:
t_test_results(dems['anti-satellite-ban'], reps['anti-satellite-ban'])


    degrees of freedom: 419.0000
    p-value           : 0.0000
    test statistic    : 12.5262
    


### Interpretation:

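Given the results of the above test (p < 0.01, t = 12.5262), I would REJECT the null hypothesis that there is the same level of support for this bill among democrats and republicans, in favor of the alternate hypothesis that democrats support it more than republicans. The test above is two-sided; because t is positive, the one-sided p-value for $H_1: \mu_1 > \mu_2$ is half the two-sided value, so it is still far below 0.01.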

A succinct way to say this: "The evidence against $H_0$ is significant at the $\alpha = 0.01$ level."
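
The alternate hypothesis for this goal is one-sided, while the tests run above are two-sided. As a rough sketch of the one-sided version, the block below rebuilds the two samples from the `value_counts()` outputs above (200 yes / 59 no for democrats, 39 yes / 123 no for republicans) and reads the upper-tail probability directly from the t distribution. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Counts copied from the value_counts() outputs above (NaNs dropped).
dem_votes = np.array([1.0] * 200 + [0.0] * 59)
rep_votes = np.array([1.0] * 39 + [0.0] * 123)

# Two-sided, pooled-variance test (same settings as the cells above).
t_stat, p_two_sided = stats.ttest_ind(dem_votes, rep_votes)
dof = dem_votes.size + rep_votes.size - 2

# One-sided p-value for H1: mu_dem > mu_rep is the upper tail beyond t.
p_greater = stats.t.sf(t_stat, df=dof)
print(f"t = {t_stat:.4f}, dof = {dof}, one-sided p = {p_greater:.3g}")
```

Recent SciPy releases also accept `alternative='greater'` in `ttest_ind`, and the statsmodels function used in `t_test_results` accepts `alternative='larger'`; the tail-area calculation here is just the version-independent route.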

### 3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01


### Before:

- Null Hypothesis: 
   > There is no difference between their support, $H_0: \mu_1 = \mu_2$

- Alternate Hypothesis: 
   > Republicans support the issue more than democrats (equivalently, democrats support it less than republicans), $H_1: \mu_1 < \mu_2$

In [20]:
dems['crime'].value_counts(dropna=False)

0.0    167
1.0     90
NaN     10
Name: crime, dtype: int64

In [21]:
dems['crime'].mean()

0.35019455252918286

In [22]:
reps['crime'].value_counts(dropna=False)

1.0    158
NaN      7
0.0      3
Name: crime, dtype: int64

In [23]:
reps['crime'].mean()

0.9813664596273292

### After:

In [24]:
ttest_ind(dems['crime'], reps['crime'], nan_policy='omit')

Ttest_indResult(statistic=-16.342085656197696, pvalue=9.952342705606092e-47)

In [25]:
t_test_results(dems['crime'], reps['crime'])


    degrees of freedom: 416.0000
    p-value           : 0.0000
    test statistic    : -16.3421
    


### Interpretation:

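Given the results of the above test (p < 0.01, t = -16.3421), I would REJECT the null hypothesis that there is the same level of support for this bill among democrats and republicans, in favor of the alternate hypothesis that democrats support it less than republicans. The test above is two-sided; because t is negative, the one-sided p-value for $H_1: \mu_1 < \mu_2$ is half the two-sided value, so it is still far below 0.01.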

### 4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

### Before:

- Null Hypothesis: 
   > There is no difference between their support, $H_0: \mu_1 = \mu_2$

- Alternate Hypothesis: 
   > The means are different (not the same level of support) $H_1: \mu_1 \neq \mu_2$

In [35]:
dems['water-project'].value_counts(dropna=False)

1.0    120
0.0    119
NaN     28
Name: water-project, dtype: int64

In [36]:
reps['water-project'].value_counts(dropna=False)

1.0    75
0.0    73
NaN    20
Name: water-project, dtype: int64

In [37]:
dems['water-project'].mean()

0.502092050209205

In [38]:
reps['water-project'].mean()

0.5067567567567568

### After:

In [34]:
ttest_ind(dems['water-project'], reps['water-project'], nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

In [39]:
t_test_results(dems['water-project'], reps['water-project'])


    degrees of freedom: 385.0000
    p-value           : 0.9292
    test statistic    : -0.0890
    


### Interpretation:

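Given the results of the above test (p = 0.9292 > 0.1, t = -0.0890), I FAIL TO REJECT the null hypothesis that there is the same level of support for this bill among democrats and republicans. This does not prove the two parties support it equally; it only means the data show no statistically significant difference (the group means are about 0.502 and 0.507).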
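
For the first stretch goal (refactoring into functions), one possible shape is a helper that runs the same test on every issue and returns a sorted table. This is only a sketch: `compare_parties` and its parameters are hypothetical names, and the small `toy` frame is made-up data used to show the call, not the real voting records.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(frame, issues, group_col="party", a="democrat", b="republican"):
    """Run a two-sided, pooled two-sample t-test of group a vs. group b per issue."""
    rows = []
    for issue in issues:
        # Drop missing votes per issue, so each test keeps as many rows as it can.
        x = frame.loc[frame[group_col] == a, issue].dropna()
        y = frame.loc[frame[group_col] == b, issue].dropna()
        t_stat, p_value = ttest_ind(x, y)
        rows.append({"issue": issue, "mean_a": x.mean(), "mean_b": y.mean(),
                     "t": t_stat, "p": p_value})
    # Smallest p-values first, so the strongest differences are at the top.
    return pd.DataFrame(rows).set_index("issue").sort_values("p")

# Hypothetical toy data, only to show the call; not the real voting records.
toy = pd.DataFrame({
    "party": ["democrat"] * 6 + ["republican"] * 6,
    "issue-a": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    "issue-b": [1, 0, 1, 0, None, 1, 0, 1, 0, 1, 1, None],
})
results = compare_parties(toy, ["issue-a", "issue-b"])
print(results)
```

On the notebook's data the call would look like `compare_parties(df, column_names[1:])`. A positive `t` marks issues democrats support more, a negative `t` marks issues republicans support more, and rows with large `p` are candidates for goal 4.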