
# Lambda School Data Science Module 141
## Statistics, Probability, and Inference

## Prepare - examine what's available in SciPy

As we delve into statistics, we'll be using more libraries - in particular the [stats package from SciPy](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html).

In [0]:
from scipy import stats

In [0]:
# As usual, lots of stuff here! There's our friend, the normal distribution
norm = stats.norm()
print(norm.mean())
print(norm.std())
print(norm.var())

0.0
1.0
1.0


In [0]:
# And a new friend - t
t1 = stats.t(5)  # 5 is df "shape" parameter
print(t1.mean())
print(t1.std())
print(t1.var())

0.0
1.2909944487358056
1.6666666666666667


![T distribution PDF with different shape parameters](https://upload.wikimedia.org/wikipedia/commons/4/41/Student_t_pdf.svg)

*(Picture from [Wikipedia](https://en.wikipedia.org/wiki/Student's_t-distribution#/media/File:Student_t_pdf.svg))*

The t-distribution is "normal-ish": the larger its shape parameter (the degrees of freedom, which grow with the sample size), the closer it gets to the standard normal.

In [0]:
t2 = stats.t(30)  # Will be closer to normal
print(t2.mean())
print(t2.std())
print(t2.var())

0.0
1.0350983390135313
1.0714285714285714
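The printed variances agree with the closed form for Student's t, df / (df - 2) for df > 2, which approaches 1 as the degrees of freedom grow. A minimal check, assuming only `scipy.stats` (the loop variable `dof` and the extra value 1000 are illustrative choices):

```python
from scipy import stats

# For more than 2 degrees of freedom, Var = dof / (dof - 2).
for dof in (5, 30, 1000):
    closed_form = dof / (dof - 2)
    print(dof, stats.t(dof).var(), closed_form)
```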


Why is it different from normal? Its heavier tails better reflect the extra uncertainty of small samples and of situations where the population standard deviation is unknown and has to be estimated from the data. In other words, the normal distribution is still the nice pure ideal in the limit (thanks to the central limit theorem), but the t-distribution is more useful in many real-world situations.
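One way to see the practical difference is to compare two-sided 95% cutoffs. The sketch below is only an illustration: the variable names are made up, and the degrees of freedom 5 and 30 are reused from the cells above.

```python
from scipy import stats

# Two-sided 95% critical values: the cutoff that leaves 2.5% in each tail.
z_crit = stats.norm.ppf(0.975)
t_crit_5 = stats.t.ppf(0.975, 5)    # 5 degrees of freedom
t_crit_30 = stats.t.ppf(0.975, 30)  # 30 degrees of freedom
print(z_crit, t_crit_5, t_crit_30)
```

With fewer degrees of freedom the cutoff sits further from zero, so a small sample needs a larger observed effect to reach the same significance level.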



In [0]:
# Now with confidence!

import scipy

## Assignment - apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
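To make the one-sample versus two-sample distinction concrete, here is a small sketch on synthetic data. The group sizes, support rates, and variable names are all assumptions for illustration and have nothing to do with the real voting records.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical yes (1) / no (0) votes for two groups with different support rates.
group_a = rng.binomial(1, 0.8, size=200).astype(float)
group_b = rng.binomial(1, 0.3, size=200).astype(float)

# One-sample: is the mean of group_a different from a fixed value (0.5)?
one_sample = stats.ttest_1samp(group_a, 0.5)
# Two-sample: are the means of group_a and group_b different from each other?
two_sample = stats.ttest_ind(group_a, group_b)
print(one_sample.pvalue, two_sample.pvalue)
```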

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
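
For the first stretch goal, one possible shape for such a function is sketched below. The name `compare_groups`, its parameters, and the toy frame are hypothetical, and this is only one of several reasonable designs.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_groups(data, value_col, group_col="party",
                   groups=("republican", "democrat")):
    """Two-sample t-test of value_col between two groups, ignoring NaNs."""
    a = data.loc[data[group_col] == groups[0], value_col].to_numpy(dtype=float)
    b = data.loc[data[group_col] == groups[1], value_col].to_numpy(dtype=float)
    result = stats.ttest_ind(a, b, nan_policy="omit")
    return float(result.statistic), float(result.pvalue)

# Tiny made-up frame, only to show the call shape.
toy = pd.DataFrame({
    "party": ["republican"] * 4 + ["democrat"] * 4,
    "some_issue": [1.0, 1.0, 0.0, np.nan, 0.0, 0.0, 1.0, 0.0],
})
stat, p = compare_groups(toy, "some_issue")
print(stat, p)
```

Returning both the statistic and the p-value keeps the sign available, which indicates which group has the higher mean.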

In [0]:
import pandas as pd
import numpy as np

col_names=['party','handicapped_infants','water_project','budget_resolution','physician_fee_freeze','el_salvador_aid','religious-groups-in-schools','anti-satellite-test-ban','aid-to-nicaraguan-contras','mx-missile','immigration','synfuels-corporation-cutback','education-spending','superfund-right-to-sue','crime','duty-free-exports','export-administration-act-south-africa'] 

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', names=col_names, header=None)

In [80]:
df.head()

Unnamed: 0,party,handicapped_infants,water_project,budget_resolution,physician_fee_freeze,el_salvador_aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [81]:
test = df  # note: an alias of df, not a copy
test.replace('y',1)  # returns a new DataFrame for display; test itself is unchanged

Unnamed: 0,party,handicapped_infants,water_project,budget_resolution,physician_fee_freeze,el_salvador_aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,1,n,1,1,1,n,n,n,1,?,1,1,1,n,1
1,republican,n,1,n,1,1,1,n,n,n,n,n,1,1,1,n,?
2,democrat,?,1,1,?,1,1,n,n,n,n,1,n,1,1,n,n
3,democrat,n,1,1,n,?,1,n,n,n,n,1,n,1,n,n,1
4,democrat,1,1,1,n,1,1,n,n,n,n,1,?,1,1,1,1
5,democrat,n,1,1,n,1,1,n,n,n,n,n,n,1,1,1,1
6,democrat,n,1,n,1,1,1,n,n,n,n,n,n,?,1,1,1
7,republican,n,1,n,1,1,1,n,n,n,n,n,n,1,1,?,1
8,republican,n,1,n,1,1,1,n,n,n,n,n,1,1,1,n,1
9,democrat,1,1,1,n,n,n,1,1,1,n,n,n,n,n,?,?


In [0]:
x = test.replace('n',0.0)

In [0]:
x = x.replace('y',1.0)

In [111]:
x.head()

Unnamed: 0,party,handicapped_infants,water_project,budget_resolution,physician_fee_freeze,el_salvador_aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0,1,0,1,1,1,0,0,0,1,?,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,?
2,democrat,?,1,1,?,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,?,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,?,1,1,1,1


In [112]:
x = x.replace('?', np.nan)
x.head()

Unnamed: 0,party,handicapped_infants,water_project,budget_resolution,physician_fee_freeze,el_salvador_aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
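
The three `replace` steps above can also be folded into the load itself. The following is a sketch under assumptions: it reads three made-up rows from an in-memory string rather than the UCI file, and the column names are hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Three made-up rows in the same y/n/? encoding as the voting file.
raw = io.StringIO("republican,n,y,?\ndemocrat,y,?,n\ndemocrat,y,y,y\n")
cols = ["party", "issue_a", "issue_b", "issue_c"]  # hypothetical column names

# na_values turns '?' into NaN while reading; map converts y/n to 1.0/0.0.
votes = pd.read_csv(raw, names=cols, header=None, na_values="?")
issue_cols = cols[1:]
votes[issue_cols] = votes[issue_cols].apply(lambda s: s.map({"y": 1.0, "n": 0.0}))
print(votes)
```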


In [113]:
repub = x[x['party'] == 'republican']
dem = x[x['party'] == 'democrat']
scipy.stats.ttest_ind(repub['handicapped_infants'], dem['handicapped_infants'], axis=0, equal_var=True, nan_policy='omit')

Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)

In [168]:
# Every column except 'party' is an issue to test
new_list = list(x)
new_list.remove('party')

for item in new_list:
  # ttest_ind returns (statistic, pvalue); only the p-value is printed here
  result = scipy.stats.ttest_ind(repub[item], dem[item], axis=0, equal_var=True, nan_policy='omit')
  print(item+':' + '  ' + str(result.pvalue)+ '\n')

handicapped_infants:  1.613440327937243e-18

water_project:  0.9291556823993485

budget_resolution:  2.0703402795404463e-77

physician_fee_freeze:  1.994262314074344e-177

el_salvador_aid:  5.600520111729011e-68

religious-groups-in-schools:  2.3936722520597287e-20

anti-satellite-test-ban:  8.521033017443867e-31

aid-to-nicaraguan-contras:  2.82471841372357e-54

mx-missile:  5.03079265310811e-47

immigration:  0.08330248490425066

synfuels-corporation-cutback:  1.5759322301054064e-15

education-spending:  1.8834203990450192e-64

superfund-right-to-sue:  1.2278581709672758e-34

crime:  9.952342705606092e-47

duty-free-exports:  5.997697174347365e-32

export-administration-act-south-africa:  3.652674361672226e-11
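
The p-values alone do not say which party supported an issue more. With republicans passed as the first sample, a negative statistic (such as the -9.2 for handicapped_infants earlier) means democrats had the higher mean. A sketch of reading the direction from the sign, on made-up votes with hypothetical names:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up votes: republicans mostly "yes" (1), democrats mostly "no" (0).
repub_votes = rng.binomial(1, 0.9, size=150).astype(float)
dem_votes = rng.binomial(1, 0.2, size=250).astype(float)

result = stats.ttest_ind(repub_votes, dem_votes, nan_policy="omit")
# statistic > 0 means the first sample (republicans here) has the larger mean.
if result.pvalue < 0.01:
    leaning = "republican" if result.statistic > 0 else "democrat"
else:
    leaning = "no clear difference at p < 0.01"
print(leaning, result.statistic, result.pvalue)
```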



In [115]:
np.var(repub['handicapped_infants'])

0.15258034894398528

In [0]:
#scipy.stats.ttest_ind(repub, dem, axis=0, equal_var=True, nan_policy='omit')


# Next: check whether the variance differs between democrats and republicans (unequal variances would favor Welch's over Student's t-test)

In [116]:
np.var(dem['handicapped_infants'])

0.2390481341265549
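
If the group variances do differ, `ttest_ind` can run Welch's version via `equal_var=False`. The sketch below uses synthetic normal data with deliberately unequal spreads and sizes; it is an illustration of the switch, not a rerun of the voting analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Made-up groups with unequal sizes and unequal variances.
small_spread = rng.normal(loc=0.0, scale=1.0, size=40)
large_spread = rng.normal(loc=0.5, scale=3.0, size=200)

student = stats.ttest_ind(small_spread, large_spread, equal_var=True)   # pooled variance
welch = stats.ttest_ind(small_spread, large_spread, equal_var=False)    # separate variances
print(student.pvalue, welch.pvalue)
```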

In [169]:
b = list(x)
b.pop(0)
variance_dem = [np.var(dem[item]) for item in b]
print(b)

['handicapped_infants', 'water_project', 'budget_resolution', 'physician_fee_freeze', 'el_salvador_aid', 'religious-groups-in-schools', 'anti-satellite-test-ban', 'aid-to-nicaraguan-contras', 'mx-missile', 'immigration', 'synfuels-corporation-cutback', 'education-spending', 'superfund-right-to-sue', 'crime', 'duty-free-exports', 'export-administration-act-south-africa']


In [0]:
variance_rep = [np.var(repub[item]) for item in b]

In [0]:
d_var_array = np.asarray(variance_dem)

In [137]:
np.var(x['handicapped_infants'])

0.24664531741640533

In [0]:
r_var_array= np.asarray(variance_rep)

In [136]:
d_var_array - r_var_array

array([ 8.64677852e-02,  4.12770878e-05, -1.70534674e-02,  3.91579250e-02,
        1.23031638e-01,  1.57537263e-01, -6.87789688e-03,  1.23283867e-02,
        8.15110618e-02,  2.50174228e-03,  1.35333856e-01,  1.12924901e-02,
        8.59139782e-02,  2.09271996e-01,  1.49417765e-01, -1.64525547e-01])

In [0]:
total_var = [np.var(x[item]) for item in b]  # overall variance per issue, both parties combined

In [0]:
tot_var_array = np.asarray(total_var)

In [0]:
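np.set_printoptions(suppress=True)
# Absolute variance gap between parties as a fraction of each issue's overall variance (a ratio, not a percentage)
percent_var_dif = np.round(np.absolute(d_var_array - r_var_array) / tot_var_array, 2)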

In [147]:
scipy.stats.ttest_1samp(percent_var_dif, 0, axis=0, nan_policy='omit')

Ttest_1sampResult(statistic=4.2498584200780565, pvalue=0.0006989288093475429)

In [148]:
percent_var_dif


array([0.35, 0.  , 0.07, 0.16, 0.49, 0.69, 0.03, 0.05, 0.33, 0.01, 0.59,
       0.05, 0.34, 0.87, 0.61, 1.08])
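
Because these ratios are absolute values, they cannot fall below zero, so a one-sample test against 0 says little about any single issue. One alternative, which is not what this notebook did, is to test equality of variances directly for each issue, for example with Levene's test from `scipy.stats`. A sketch on made-up votes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Made-up 0/1 votes with a few missing values (NaN).
rep = rng.binomial(1, 0.85, size=160).astype(float)
dem = rng.binomial(1, 0.5, size=260).astype(float)
rep[:5] = np.nan

# Drop missing votes first so levene receives NaN-free arrays.
rep_clean = rep[~np.isnan(rep)]
dem_clean = dem[~np.isnan(dem)]
stat, p = stats.levene(rep_clean, dem_clean)
print(stat, p)
```

A small p-value here would suggest unequal variances for that issue, which is the case where Welch's t-test is the safer choice.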