<a href="https://colab.research.google.com/github/StevenMElliott/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/Copy_of_LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data calls for *two-sample* t-tests, because you're comparing averages across two groups (Republicans and Democrats) rather than comparing a single group's average against a fixed hypothesized value.
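
As a rough illustration of what a two-sample t-test computes, the sketch below builds the pooled-variance t-statistic by hand and compares it with `ttest_ind`. The two vote arrays are made up for the example and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes for two small groups (illustrative only)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=float)
group_b = np.array([0, 0, 1, 0, 0, 1, 0], dtype=float)

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: ttest_ind assumes equal variances by default
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# Difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(
    pooled_var * (1 / n_a + 1 / n_b)
)

t_scipy, p_value = ttest_ind(group_a, group_b)
print(t_manual, t_scipy, p_value)
```

A positive statistic means the first group's mean is larger than the second's.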

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
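
One possible shape for the first stretch goal is sketched below. The function name `compare_parties`, the `party_col` parameter, and the toy DataFrame are hypothetical and chosen only for illustration; the sketch assumes votes are already coded as 0/1 with NaN for missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, party_col='party'):
    """Two-sample t-test on one issue; returns (dem_mean, rep_mean, t, p)."""
    # Drop missing votes per group so each test uses all available data
    dem_votes = df.loc[df[party_col] == 'democrat', issue].dropna()
    rep_votes = df.loc[df[party_col] == 'republican', issue].dropna()
    t_stat, p_value = ttest_ind(dem_votes, rep_votes)
    return dem_votes.mean(), rep_votes.mean(), t_stat, p_value


# Tiny made-up frame, not the congressional data
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'some_issue': [1, 1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0, np.nan],
})

dem_mean, rep_mean, t_stat, p_value = compare_parties(toy, 'some_issue')
print(dem_mean, rep_mean, t_stat, p_value)
```

Calling the function in a loop over the issue columns would rerun the same test for every vote.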

# Get and Clean the Data

In [0]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import random
from statistics import mean, stdev
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-06-14 11:34:21--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-06-14 11:34:21 (616 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [3]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.names

--2019-06-14 11:34:22--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6868 (6.7K) [application/x-httpd-php]
Saving to: ‘house-votes-84.names’


2019-06-14 11:34:22 (134 MB/s) - ‘house-votes-84.names’ saved [6868/6868]



In [4]:
#peek at the first line; the with-block closes the file afterwards
with open('house-votes-84.data', "r") as file:
    first_line = file.readline()
first_line

'republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n'

In [5]:
df = pd.read_csv('house-votes-84.data', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
df.columns = ['party'
             , 'handicapped_infants'
             , 'water_project'
             , 'budget_resolution'
             , 'pysician_fee_freeze'
             , 'el_salvador_aid'
             , 'religious_groups_in_school'
             , 'anit_satellite_test_ban'
             , 'contras'
             , 'mx_missle'
             , 'immegration'
             , 'synfuels_corperation_cutback'
             , 'education_spending'
             , 'superfund_right_to_sue'
             , 'crime'
             , 'duty_free_exports'
             , 'south_africa']

In [7]:
df.head()

Unnamed: 0,party,handicapped_infants,water_project,budget_resolution,pysician_fee_freeze,el_salvador_aid,religious_groups_in_school,anit_satellite_test_ban,contras,mx_missle,immegration,synfuels_corperation_cutback,education_spending,superfund_right_to_sue,crime,duty_free_exports,south_africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [8]:
df['contras'].value_counts()

y    242
n    178
?     15
Name: contras, dtype: int64

In [0]:
df.to_csv('clean_congress.csv')

In [0]:
#replace the '?' placeholders with NaN so missing votes can be skipped
df = df.replace(to_replace='?', value=np.nan)

In [0]:
df = df.replace('n', 0.0)

In [0]:
df = df.replace('y', 1.0)
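
The three replacement steps above could also be folded into the load itself. The sketch below is one alternative, not the approach used in this notebook: it reads two sample rows (copied from the `df.head()` output) from an in-memory string, treats `?` as missing via `na_values`, and maps `n`/`y` to 0.0/1.0. The variable names are hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the data preview, standing in for the real file
sample = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# na_values='?' turns the missing-vote marker into NaN while reading
raw = pd.read_csv(sample, header=None, na_values='?')

# Map votes to numbers; anything not in the dict (including NaN) stays NaN
vote_codes = {'n': 0.0, 'y': 1.0}
votes = raw.iloc[:, 1:].apply(lambda col: col.map(vote_codes))

# Reattach the party column (column 0) to the numeric votes
clean = pd.concat([raw.iloc[:, [0]], votes], axis=1)
print(clean)
```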

In [0]:
#make the subsets
rep = df[df['party'] == 'republican']
dem = df[df['party'] == 'democrat']

In [14]:
dem.shape + rep.shape

(267, 17, 168, 17)

In [15]:
df.party.value_counts()

democrat      267
republican    168
Name: party, dtype: int64

# Hypothesis Testing

## Contras - Dems show greater support

In [17]:
ttest_1samp(dem['contras'], .8, nan_policy='omit')

Ttest_1sampResult(statistic=1.24202327453032, pvalue=0.21533860496566257)

Null Hypothesis: 80% of Democrats support the Contras measure (mean = 0.8).

Alt Hypothesis: Democratic support for the Contras measure is not 80%.

Given the results of the test above (p = 0.215), I FAIL TO REJECT the null hypothesis at the 5% significance level (95% confidence).
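
To make the decision rule concrete, here is a minimal sketch of a one-sample t-test computed by hand and checked against `ttest_1samp`. The vote array is hypothetical, and `alpha = 0.05` is an assumed threshold matching the 95% level mentioned above.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical 0/1 votes with one missing value (not the real data)
votes = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, np.nan, 1])
observed = votes[~np.isnan(votes)]  # same effect as nan_policy='omit'

mu0 = 0.8  # hypothesized level of support

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (observed.mean() - mu0) / (
    observed.std(ddof=1) / np.sqrt(len(observed))
)

t_scipy, p_value = ttest_1samp(observed, mu0)

alpha = 0.05  # assumed significance threshold
reject_null = p_value < alpha
print(t_manual, t_scipy, p_value, reject_null)
```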

In [18]:
ttest_ind(dem['contras'], rep['contras'], nan_policy='omit')

Ttest_indResult(statistic=18.052093200819733, pvalue=2.82471841372357e-54)

Null Hypothesis: Republicans and Democrats support the Contras measure equally.

Alt Hypothesis: Support from the two parties is not equal.

Given the results of the test above, I REJECT the null hypothesis that the level of support is the same between Republicans and Democrats. The positive t-statistic indicates that Democratic support is higher.
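
Goals 2 and 3 are directional ("support more than"), while `ttest_ind` reports a two-sided p-value by default. One common way to get a one-sided p-value, sketched here with a hypothetical helper `one_sided_p` and made-up vote lists, is to halve the two-sided value when the statistic has the expected sign.

```python
from scipy.stats import ttest_ind


def one_sided_p(t_stat, two_sided_p):
    """p-value for the alternative: mean(first sample) > mean(second sample)."""
    # The t distribution is symmetric, so the two-sided p splits evenly
    if t_stat > 0:
        return two_sided_p / 2
    return 1 - two_sided_p / 2


# Hypothetical 0/1 votes (illustrative only)
group_a = [1, 1, 1, 0, 1, 1, 1, 1]
group_b = [0, 0, 1, 0, 0, 0, 1, 0]

t_stat, p_two = ttest_ind(group_a, group_b)
p_greater = one_sided_p(t_stat, p_two)
print(t_stat, p_two, p_greater)
```

With p-values as small as the ones in this notebook, the one-sided and two-sided tests lead to the same conclusion.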

In [48]:
print('Democrat:', dem.contras.mean()) 
print('Republican:', rep.contras.mean())

Democrat: 0.8288973384030418
Republican: 0.15286624203821655


## Religious Groups in Schools - Republicans show greater support

In [58]:
ttest_ind(dem['religious_groups_in_school'], rep['religious_groups_in_school'], nan_policy='omit')

Ttest_indResult(statistic=-9.737575825219457, pvalue=2.3936722520597287e-20)

Null Hypothesis: Support for Religious Groups in Schools is equal between the two parties.

Alt Hypothesis: Support for Religious Groups in Schools is not equal.

Based on the test above, I REJECT the null hypothesis that the level of support is the same between Republicans and Democrats. The negative t-statistic indicates that Republican support is higher.

In [65]:
print('Scenario One:', ttest_1samp(dem['religious_groups_in_school'], .5, nan_policy='omit')
      ,"\n Scenario Two:" ,ttest_1samp(dem['religious_groups_in_school'], .6, nan_policy='omit'))

Scenario One: Ttest_1sampResult(statistic=-0.7464459604122172, pvalue=0.45608033540995874) 
 Scenario Two: Ttest_1sampResult(statistic=-3.9561635901847527, pvalue=9.857653395038522e-05)


- Scenario 1

Null Hypothesis: There is 50-50 support among Democrats

Alt Hypothesis: There is not 50-50 support among Democrats

Given the test above, I FAIL TO REJECT the null hypothesis of 50-50 support among Democrats.

- Scenario 2

Null Hypothesis: There is 60% support among Democrats.

Alt Hypothesis: There isn't 60% support among Democrats.

Given the test above, I REJECT the null hypothesis of 60% support among Democrats.

- Conclusion

Comparing the two scenarios, the data are consistent with 50% Democratic support but not with 60%, so the sample mean is closer to 50% than to 60%.
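
Rather than testing candidate values one at a time, a confidence interval shows the whole range of plausible means at once: a value rejected by a two-sided one-sample t-test at the 5% level falls outside the 95% interval. The sketch below uses a hypothetical helper `mean_confidence_interval` and made-up votes.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, mean, upper) for a t-based interval, ignoring NaN."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    n = len(values)
    mean = values.mean()
    # Critical t value times the standard error of the mean
    margin = (
        stats.t.ppf((1 + confidence) / 2, df=n - 1)
        * values.std(ddof=1)
        / np.sqrt(n)
    )
    return mean - margin, mean, mean + margin


# Hypothetical 0/1 votes with one missing value (not the real data)
votes = [1, 0, 1, 0, 0, 1, 1, 0, np.nan, 1, 0]
low, mean, high = mean_confidence_interval(votes)
print(low, mean, high)
```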

In [66]:
print('Scenario One:', ttest_1samp(rep['religious_groups_in_school'], .5, nan_policy='omit')
      ,"\n Scenario Two:" ,ttest_1samp(rep['religious_groups_in_school'], .6, nan_policy='omit'))

Scenario One: Ttest_1sampResult(statistic=16.844895175868118, pvalue=1.103043623086e-37) 
 Scenario Two: Ttest_1sampResult(statistic=12.608148813452804, pvalue=5.898826128224674e-26)


- Scenario 1

Null Hypothesis: There is 50-50 support among Republicans.

Alt Hypothesis: There is not 50-50 support among Republicans.

Given the test above, I REJECT the null hypothesis of 50-50 support among Republicans.

- Scenario 2

Null Hypothesis: There is 60% support among Republicans.

Alt Hypothesis: There isn't 60% support among Republicans.

Given the test above, I REJECT the null hypothesis of 60% support among Republicans.

- Conclusion

Both hypothesized values are rejected. The smaller t-statistic in Scenario 2 shows that 60% is closer to the sample mean than 50%, but the large positive statistics in both scenarios indicate that Republican support is well above either value.

In [68]:
print('Democrat:', dem['religious_groups_in_school'].mean())
print('Republican:', rep['religious_groups_in_school'].mean())

Democrat: 0.47674418604651164
Republican: 0.8975903614457831


## Bi-partisan Support

In [49]:
ttest_ind(dem['water_project'], rep['water_project'], nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

Null Hypothesis: Both parties support the Water Project equally.

Alt Hypothesis: Support for the Water Project is not equal.

Based on the test above (p = 0.93), I FAIL TO REJECT the null hypothesis that both parties support the Water Project equally.

In [50]:
print('Democrat:', dem['water_project'].mean())
print('Republican:', rep['water_project'].mean())

Democrat: 0.502092050209205
Republican: 0.5067567567567568
