<a href="https://colab.research.google.com/github/Phatdeluxe/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/module1-statistics-probability-and-inference/LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Try some 1-sample t-tests as well.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. See if you can make a visualization that communicates your results
3. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [1]:
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

--2019-09-16 17:51:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-09-16 17:51:08 (275 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

In [7]:
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [11]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
df = df.replace({'?': np.nan, 'n': 0, 'y': 1})

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
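Since every `?` becomes `NaN` after the replacement above, it can help to see how many votes are missing per issue and per party before running any tests. The following is a minimal sketch on a made-up miniature frame; `toy` and `missing_by_party` are hypothetical names introduced for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaned frame: 1 = yes, 0 = no, NaN = missing vote.
toy = pd.DataFrame({
    'party': ['democrat', 'democrat', 'republican', 'republican'],
    'budget': [1.0, np.nan, 0.0, 0.0],
    'crime': [np.nan, np.nan, 1.0, 1.0],
})

def missing_by_party(frame, party_col='party'):
    """Count NaN votes per issue within each party (illustrative helper)."""
    issues = frame.columns.drop(party_col)
    # isna() gives booleans; summing them per party counts the missing votes.
    return frame[issues].isna().groupby(frame[party_col]).sum()

missing = missing_by_party(toy)
print(missing)
```

On the real data the same helper could be called as `missing_by_party(df)`. Passing `nan_policy='omit'` to the tests drops these missing votes issue by issue, so a congressperson who abstained on one issue still counts on the others.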


In [0]:
dem = df[df['party'] == 'democrat']
rep = df[df['party'] == 'republican']

In [13]:
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [14]:
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


## 2. Something Dems support more than Reps

1. $H_0: \bar{x}_1 = \bar{x}_2$

2. $H_a: \bar{x}_1 \neq \bar{x}_2$

3. Confidence level: 95%

In [22]:
ttest_ind(dem['handicapped-infants'], rep['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=9.205264294809222, pvalue=1.613440327937243e-18)

4. T-statistic: 9.21

5. P-value: 1.61e-18

---

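Conclusion: With a t-statistic of 9.21 and a p-value of 1.61e-18, far below the 0.01 threshold, we reject the null hypothesis that the two parties supported this issue equally. Because Democrats are the first sample and the t-statistic is positive, the Democrats voted in favor of the handicapped-infants issue at a higher rate than the Republicans.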
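The direction of the result comes from the sign of the t-statistic, which can be checked by hand. The sketch below assumes the pooled-variance (Student) form, which is what `ttest_ind` computes with its default `equal_var=True`. The helper `pooled_t` and the two small vote arrays are hypothetical and only serve to illustrate the calculation.

```python
import numpy as np
from scipy.stats import ttest_ind

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance, ignoring NaN (illustrative)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]
    n1, n2 = len(a), len(b)
    # Pooled sample variance, using the unbiased (ddof=1) variance of each group.
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    # Positive when the first group's mean is larger than the second's.
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Made-up votes: 1 = yes, 0 = no, NaN = missing.
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 0])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 1])

t_manual = pooled_t(group_a, group_b)
t_scipy = float(ttest_ind(group_a, group_b, nan_policy='omit').statistic)
print(t_manual, t_scipy)
```

The two values should agree, and the statistic is positive because the first group has the higher share of yes votes. Swapping the argument order flips the sign but leaves the two-sided p-value unchanged.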

## 3. Something Reps support more than Dems

1. $H_0: \bar{x}_1 = \bar{x}_2$

2. $H_a: \bar{x}_1 \neq \bar{x}_2$

3. Confidence level: 95%

In [20]:
ttest_ind(dem['education'], rep['education'], nan_policy='omit')

Ttest_indResult(statistic=-20.500685724563073, pvalue=1.8834203990450192e-64)

4. T-statistic: -20.5

5. P-value: 1.88e-64

---

Conclusion:
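With a t-statistic of -20.5 and a p-value of 1.88e-64, far below the 0.01 threshold, we reject the null hypothesis that the two parties supported this issue equally. Because Democrats are the first sample and the t-statistic is negative, the Republicans voted in favor of the education issue at a higher rate than the Democrats.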

## 4. Something Dems and Reps agree on:

1. $H_0$: $\bar{x}_1 = \bar{x}_2$

2. $H_a$: $\bar{x}_1 \neq \bar{x}_2$

3. Confidence level: 95%

In [16]:
ttest_ind(dem['water-project'], rep['water-project'], nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

4. T-statistic: -0.089
5. P-value: 0.93

---

Conclusion:

With a t-statistic of -0.089 and a p-value of 0.93, well above the 0.1 threshold, we fail to reject the null hypothesis that Democrats and Republicans supported the water-project issue equally. The data show no detectable difference between the parties on this issue.
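The three comparisons above each test one hand-picked issue. One way to address the first stretch goal is to wrap the test in a function and run it over every issue column. This is a sketch under assumptions: `scan_issues` is a hypothetical helper, and the small `toy` frame is fabricated so the block runs without the data file.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def scan_issues(frame, party_col='party',
                group_a='democrat', group_b='republican'):
    """Run a 2-sample t-test per issue and return results sorted by p-value."""
    a = frame[frame[party_col] == group_a]
    b = frame[frame[party_col] == group_b]
    rows = []
    for issue in frame.columns.drop(party_col):
        result = ttest_ind(a[issue], b[issue], nan_policy='omit')
        rows.append({
            'issue': issue,
            'mean_a': a[issue].mean(),   # share of yes votes in group_a
            'mean_b': b[issue].mean(),   # share of yes votes in group_b
            't': float(result.statistic),
            'p': float(result.pvalue),
        })
    return pd.DataFrame(rows).sort_values('p').reset_index(drop=True)

# Fabricated example: issue_x splits the parties, issue_y does not.
toy = pd.DataFrame({
    'party': ['democrat'] * 20 + ['republican'] * 20,
    'issue_x': [1.0] * 18 + [np.nan, 0.0] + [1.0] * 2 + [0.0] * 18,
    'issue_y': [1.0, 0.0] * 20,
})

summary = scan_issues(toy)
print(summary)
```

Called as `scan_issues(df)` on the cleaned voting data, rows with `p < 0.01` and a positive `t` would be candidates for goal 2, rows with `p < 0.01` and a negative `t` for goal 3, and rows with `p > 0.1` for goal 4.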

## 1 Sample testing

1. $H_0$: $\bar{x}_1 = 0.0$ - There is no Republican support for the handicapped infants bill

2. $H_a$: $\bar{x}_1 \neq 0.0$ - There is non-zero Republican Support for the handicapped infants bill

3. Confidence level: 95%

In [28]:
ttest_1samp(rep['handicapped-infants'], 0.0, nan_policy='omit')

Ttest_1sampResult(statistic=6.159569669016066, pvalue=5.434587970316366e-09)

4. T-Statistic: 6.16

5. P-value: 5.43e-9

---

Conclusion:
With a t-statistic of 6.16 and a p-value of 5.43e-9, we reject the null hypothesis and conclude that there is non-zero Republican support for the handicapped infants bill.

1. $H_0$: $\bar{x}_1 = 0.5$ - There is 50% Democrat support for the religious groups bill

2. $H_a$: $\bar{x}_1 \neq 0.5$ - There is not 50% Democrat support for the religious groups bill

3. Confidence level: 95%

In [29]:
ttest_1samp(dem['religious-groups'], 0.5, nan_policy='omit')

Ttest_1sampResult(statistic=-0.7464459604122172, pvalue=0.45608033540995874)

In [30]:
dem['religious-groups'].mean()

0.47674418604651164

4. T-statistic: -0.75

5. P-value: 0.46

---

Conclusion:
With a p-value of 0.46, we fail to reject the null hypothesis that Democrat support for the religious groups bill is 50%. The observed support of about 47.7% is consistent with an even split.
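A confidence interval gives another view of the same 1-sample result: a two-sided test at the 5% level fails to reject a hypothesized mean exactly when that value lies inside the 95% confidence interval. The sketch below builds the interval from the t distribution; `mean_confidence_interval` and the `votes` array are hypothetical and shown only for illustration.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for the mean of values, ignoring NaN."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    mean = x.mean()
    sem = stats.sem(x)  # standard error of the mean (ddof=1 by default)
    # Two-sided critical value of the t distribution with n - 1 degrees of freedom.
    margin = sem * stats.t.ppf((1 + confidence) / 2, len(x) - 1)
    return mean, mean - margin, mean + margin

# Made-up votes: 1 = yes, 0 = no, NaN = missing.
votes = np.array([1, 0, 1, 0, np.nan, 1, 0, 0, 1, 1])

mean, lower, upper = mean_confidence_interval(votes)
p_value = float(stats.ttest_1samp(votes, 0.5, nan_policy='omit').pvalue)
print(mean, lower, upper, p_value)
```

On the real data, `mean_confidence_interval(dem['religious-groups'])` would be the analogous call.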