
## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
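As a minimal sketch of what a two-sample test looks like in code, the snippet below runs `scipy.stats.ttest_ind` on two small made-up vote arrays (the names `group_a` and `group_b` and their values are illustrative assumptions, not part of the dataset). It also shows `nan_policy='omit'`, which is one way to handle the missing votes mentioned above.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy votes (1 = yes, 0 = no, NaN = missing); values are made up for illustration.
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# nan_policy='omit' drops missing votes before computing the statistic.
result = ttest_ind(group_a, group_b, nan_policy='omit')
t_stat = float(result.statistic)
p_value = float(result.pvalue)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# The same test after dropping NaNs by hand, for comparison.
manual = ttest_ind(group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)])
```

The sign of the statistic follows argument order: a positive `t_stat` here means the first sample (`group_a`) has the larger mean.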

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, ttest_1samp

In [61]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-09-17 00:01:47--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.2’


2019-09-17 00:01:48 (286 KB/s) - ‘house-votes-84.data.2’ saved [18171/18171]



In [62]:
# Load Data
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [63]:
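# Encode votes numerically: '?' -> missing, 'n' -> 0, 'y' -> 1
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df.replace({'?': np.nan, 'n': 0, 'y': 1})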
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
dem = df[df['party'] == 'democrat']
rep = df[df['party'] == 'republican']

In [65]:
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [66]:
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [67]:
df.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64

In [68]:
# 1 sample test

rep['immigration'].mean()

0.5575757575757576

In [69]:
# Find the sample size, excluding NaN values (count() skips missing entries)
rep['immigration'].count()

165

1) Null Hypothesis:

$H_0: \bar x = 0$ (Republican support for the immigration bill is zero)

2) Alternative Hypothesis:

$H_a: \bar x \neq 0$ (Republican support for the immigration bill is non-zero)

3) Confidence Level: 95%

In [70]:
ttest_1samp(rep['immigration'], 0, nan_policy='omit')

Ttest_1sampResult(statistic=14.376541013291384, pvalue=7.541248569126767e-31)

4) t-statistic: 14.377

5) p-value: 7.541e-31

---
Conclusion: Because the p-value is almost 0, far below the 0.05 threshold, I reject the null hypothesis that Republican support is zero and conclude that Republican support is non-zero.
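To see where the t-statistic comes from, here is a rough sketch that recomputes it by hand as the sample mean divided by its standard error. The count of 92 "yes" votes is not printed anywhere above; it is inferred from the reported mean (0.5576) and sample size (165), so treat it as a derived assumption.

```python
import math

# 165 non-missing Republican votes with mean 0.5575757... implies 92 "yes" votes
# (derived from the outputs above, not read from the data directly).
n = 165
yes_votes = 92
votes = [1.0] * yes_votes + [0.0] * (n - yes_votes)

mean = sum(votes) / n
# Sample variance with the n - 1 (Bessel) correction.
variance = sum((v - mean) ** 2 for v in votes) / (n - 1)
standard_error = math.sqrt(variance / n)

# One-sample t statistic against a hypothesized mean of 0.
t_stat = (mean - 0) / standard_error
print(round(t_stat, 3))  # 14.377, matching ttest_1samp above
```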

In [71]:
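# numeric_only=True skips the string 'party' column, which newer pandas
# versions would otherwise try (and fail) to average
dem.mean(numeric_only=True)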

handicapped-infants     0.604651
water-project           0.502092
budget                  0.888462
physician-fee-freeze    0.054054
el-salvador-aid         0.215686
religious-groups        0.476744
anti-satellite-ban      0.772201
aid-to-contras          0.828897
mx-missile              0.758065
immigration             0.471483
synfuels                0.505882
education               0.144578
right-to-sue            0.289683
crime                   0.350195
duty-free               0.637450
south-africa            0.935135
dtype: float64

In [72]:
# numeric_only=True skips the string 'party' column
rep.mean(numeric_only=True)

handicapped-infants     0.187879
water-project           0.506757
budget                  0.134146
physician-fee-freeze    0.987879
el-salvador-aid         0.951515
religious-groups        0.897590
anti-satellite-ban      0.240741
aid-to-contras          0.152866
mx-missile              0.115152
immigration             0.557576
synfuels                0.132075
education               0.870968
right-to-sue            0.860759
crime                   0.981366
duty-free               0.089744
south-africa            0.657534
dtype: float64

## 2 Sample T-test

1) Null Hypothesis:

$H_0: \bar x_1 = \bar x_2$ (mean Democrat and Republican support for the water-project bill are equal)

2) Alternative Hypothesis:

$H_a: \bar x_1 \neq \bar x_2$

3) Confidence Level: 95%
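For reference, with its default `equal_var=True`, `ttest_ind` uses the pooled-variance form of the statistic. Here $n_i$ and $s_i^2$ denote the size and sample variance of each group (symbols introduced for this note):

$$
t = \frac{\bar x_1 - \bar x_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The p-value is then read from a t-distribution with $n_1 + n_2 - 2$ degrees of freedom. The sign of $t$ depends on which group is passed first.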

In [73]:
ttest_ind(dem['water-project'], rep['water-project'], nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

4) t-statistic: -0.089

5) p-value: 0.929

---
Conclusion: With a p-value of 0.929, well above 0.1, I fail to reject the null hypothesis.

This is an issue where there isn't much difference in support between the two parties.

#### Second 2 sample T-test

1) Null Hypothesis:

$H_0: \bar x_1 = \bar x_2$ (mean Republican and Democrat support for the handicapped-infants bill are equal)

2) Alternative Hypothesis:

$H_a: \bar x_1 \neq \bar x_2$

3) Confidence Level: 99%

In [74]:
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)

In [75]:
rep['handicapped-infants'].mean()

0.18787878787878787

In [76]:
dem['handicapped-infants'].mean()

0.6046511627906976

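4) t-statistic: -9.205

5) p-value: 1.61e-18

---
Conclusion: With a p-value of 1.61e-18, far below 0.01, I reject the null hypothesis.

Republicans were passed as the first sample, so the negative t-statistic shows that Democrats support this issue more than Republicans (mean support 0.605 vs. 0.188).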

#### Third 2 sample T-test

1) Null Hypothesis:

$H_0: \bar x_1 = \bar x_2$ (mean Republican and Democrat support for the education bill are equal)

2) Alternative Hypothesis:

$H_a: \bar x_1 \neq \bar x_2$

3) Confidence Level: 99%

In [77]:
ttest_ind(rep['education'], dem['education'], nan_policy='omit')

Ttest_indResult(statistic=20.500685724563073, pvalue=1.8834203990450192e-64)

In [78]:
rep['education'].mean()

0.8709677419354839

In [79]:
dem['education'].mean()

0.14457831325301204

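4) t-statistic: 20.501

5) p-value: 1.88e-64

---
Conclusion: With a p-value of 1.88e-64, far below 0.01, I reject the null hypothesis.

Republicans were passed as the first sample, so the positive t-statistic shows that Republicans support this issue more than Democrats (mean support 0.871 vs. 0.145).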

#### Fourth 2 sample T-test

1) Null Hypothesis:

$H_0: \bar x_1 = \bar x_2$ (mean Republican and Democrat support for the mx-missile bill are equal)

2) Alternative Hypothesis:

$H_a: \bar x_1 \neq \bar x_2$

3) Confidence Level: 99%

In [81]:
ttest_ind(rep['mx-missile'], dem['mx-missile'], nan_policy='omit')

Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47)

In [83]:
dem['mx-missile'].mean()

0.7580645161290323

In [84]:
rep['mx-missile'].mean()

0.11515151515151516

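4) t-statistic: -16.438

5) p-value: 5.03e-47

---
Conclusion: With a p-value of 5.03e-47, far below 0.01, I reject the null hypothesis.

Republicans were passed as the first sample, so the negative t-statistic shows that Democrats support this issue more than Republicans (mean support 0.758 vs. 0.115).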
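#### Stretch goal: refactor into functions

The four tests above repeat the same steps, so one possible refactor is sketched below. The helper names `compare_parties` and `summarize_issues` are hypothetical, and the small `toy` frame is made-up data so the block runs on its own. With the notebook's cleaned `df`, calling `summarize_issues(df)` should cover all 16 issues at once.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, group_col='party',
                    group_a='democrat', group_b='republican'):
    """Two-sample t-test of group_a vs. group_b on one 0/1 issue column."""
    # Drop missing votes per group, mirroring nan_policy='omit'.
    a = df.loc[df[group_col] == group_a, issue].dropna()
    b = df.loc[df[group_col] == group_b, issue].dropna()
    result = ttest_ind(a, b)
    return {
        'issue': issue,
        'mean_a': a.mean(),
        'mean_b': b.mean(),
        't': float(result.statistic),  # positive when group_a's mean is larger
        'p': float(result.pvalue),
    }


def summarize_issues(df, group_col='party'):
    """Run compare_parties on every column except the grouping column."""
    issues = [col for col in df.columns if col != group_col]
    rows = [compare_parties(df, issue, group_col) for issue in issues]
    return pd.DataFrame(rows).set_index('issue')


# Made-up example data (not the real voting records).
toy = pd.DataFrame({
    'party': ['democrat'] * 4 + ['republican'] * 4,
    'issue-x': [1, 1, 1, np.nan, 0, 0, 0, 1],
    'issue-y': [1, 0, 1, 0, 1, 0, np.nan, 0],
})
summary = summarize_issues(toy)
print(summary)
```

Filtering the result, for example `summary[summary['p'] < 0.01]`, then gives the candidate issues for goals 2 and 3, with the sign of `t` indicating which party supports the issue more.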