<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

### Assignment Goals

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

- Also practice running 1 sample t-tests

---

#### Stretch Goals

0. Try to make some kind of visualization that communicates the results
1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

---

In [1]:
# The usual imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
# SciPy Imports
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel

#### 1. Load and Clean the Data (or determine the best method to drop observations when running tests)

In [1]:
# Get the dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-09-16 14:29:24--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2019-09-16 14:29:25 (323 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [3]:
# Load data in dataframe and add in column headers
df_0 = pd.read_csv("house-votes-84.data",
                   header=None,
                   names=["party", "handicapped-infants", "water-project",
                         "budget", "physician-fee-freeze", "el-salvador-aid",
                         "religious-groups", "anti-satellite-ban",
                         "aid-to-contras", "mx-missile", "immigration",
                         "synfuels", "education", "right-to-sue", "crime",
                         "duty-free", "south-africa"])

print(df_0.shape)
df_0.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [6]:
# Replace "?" with NaN and "y" / "n" with 1 / 0
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df_0.replace({
    "?": np.nan,
    "n": 0,
    "y": 1,
})

print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [13]:
# Check out a summary of null values
df.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64

In [7]:
# Testing out how many rows would be dropped if I dropped them at this point
df_dropna = df.dropna(axis=0, how="any")
print(df_dropna.shape)  # Looks like around half of the rows would be dropped - not ideal

(232, 17)
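
To illustrate why row-wise dropping costs so much, the sketch below compares it with dropping missing values one issue at a time. It uses a small made-up frame (`toy` and its `issue_*` columns are hypothetical, not the real voting data), so the counts are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each row is missing at most one vote
toy = pd.DataFrame({
    "issue_a": [1, 0, np.nan, 1, 0, 1],
    "issue_b": [0, np.nan, 1, 1, 0, 0],
    "issue_c": [1, 1, 0, np.nan, 1, 0],
})

# Row-wise: a row is discarded if ANY issue is missing
rows_kept = len(toy.dropna(axis=0, how="any"))

# Per-issue: only the missing entries of that one column are discarded
votes_kept = toy.notna().sum()

print("Rows kept after row-wise dropna:", rows_kept)
print("Votes kept per issue:")
print(votes_kept)
```

Each issue loses only its own missing votes when handled separately, which is the effect of omitting nulls at test time instead of dropping rows up front.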


In [9]:
# Going ahead with the filtering without dropping
# I will omit the null values while running the tests

# Filter Dems and Reps into separate dataframes - our 2 "samples"
dem = df[df["party"] == "democrat"]
rep = df[df["party"] == "republican"]

print("Democrats:", dem.shape)
print("Republicans:", rep.shape)

Democrats: (267, 17)
Republicans: (168, 17)


In [14]:
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [15]:
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


#### 2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

##### 2-Sample t-Test

Null hypothesis: no difference in mean support between Republicans and Democrats on the "synfuels" bill.

$H_0$: $\bar{x}_1 = \bar{x}_2$

Alternative hypothesis: the mean support for the "synfuels" bill is different between Republicans and Democrats.

$H_a$: $\bar{x}_1 ≠ \bar{x}_2$

Confidence Level: 99% or 0.99

In [17]:
# First I'll look at "synfuels"
# Just by looking at the head I can see dems who voted for it but no reps
ttest_ind(rep["synfuels"], dem["synfuels"], nan_policy="omit")

Ttest_indResult(statistic=-8.293603989407588, pvalue=1.5759322301054064e-15)

##### Results

t-stat: -8.29

p-value: 0.0000000000000016 (1.58e-15)

##### Conclusion

The two-sample t-test on the "synfuels" bill yields a p-value of about 1.58e-15, which is effectively zero and far below the 0.01 threshold. Therefore, I reject the null hypothesis that there is no difference in support between the two samples. Furthermore, because the Republican sample was passed as the first argument, the negative t-statistic indicates that Democrats were more supportive of the bill than their Republican counterparts.

In [20]:
# Just to look at the difference, I'll calculate the mean for each sample
print(rep["synfuels"].mean())
print(dem["synfuels"].mean())

0.1320754716981132
0.5058823529411764
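
The sign of the t-statistic follows the order of the arguments: it is built from the first sample's mean minus the second's. As a rough check, the sketch below recomputes the statistic by hand using the pooled-variance (equal variance) form, which is what `ttest_ind` uses by default. The arrays `group_1` and `group_2` are hypothetical votes, not the real data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes for two groups (not the real data)
group_1 = np.array([0, 0, 1, 0, 0, 1, 0, 0])
group_2 = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])

n1, n2 = len(group_1), len(group_2)
mean_1, mean_2 = group_1.mean(), group_2.mean()

# Sample variances (ddof=1), pooled into a single estimate
var_1, var_2 = group_1.var(ddof=1), group_2.var(ddof=1)
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)

# First mean minus second mean, divided by the pooled standard error
t_manual = (mean_1 - mean_2) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_scipy, p_scipy = ttest_ind(group_1, group_2)

print("manual t:", t_manual)
print("scipy  t:", t_scipy)
```

Swapping the two arguments flips the sign of the statistic but leaves the two-sided p-value unchanged.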


---

#### 3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

##### 2-Sample t-Test

Null hypothesis: no difference in mean support between Republicans and Democrats on the "physician-fee-freeze" bill.

$H_0$: $\bar{x}_1 = \bar{x}_2$

Alternative hypothesis: the mean support for the "physician-fee-freeze" bill is different between Republicans and Democrats.

$H_a$: $\bar{x}_1 ≠ \bar{x}_2$

Confidence Level: 99% or 0.99

In [19]:
# Using the same intuition as the previous bill, 
# I'm going to analyze the support of "physician-fee-freeze"
ttest_ind(rep["physician-fee-freeze"], dem["physician-fee-freeze"], nan_policy="omit")

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

##### Results

t-stat: 49.37

p-value: 1.99e-177 (effectively 0)

##### Conclusion

The two-sample t-test on the "physician-fee-freeze" bill yields a p-value of about 1.99e-177, which is effectively zero and far below the 0.01 threshold. Therefore, I reject the null hypothesis that there is no difference in support between the two samples. Furthermore, because the Republican sample was passed as the first argument, the positive t-statistic indicates that Republicans were more supportive of the bill than their Democrat colleagues.

In [22]:
# View the difference in the means
print(rep["physician-fee-freeze"].mean())
print(dem["physician-fee-freeze"].mean())

0.9878787878787879
0.05405405405405406
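
The tests above are two-sided, with direction read off the sign of the t-statistic. Since the goal is phrased directionally ("support more than"), a one-sided test is another option. The sketch below assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument, and uses hypothetical vote series rather than the real data. Missing votes are dropped explicitly before testing.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes with one missing value in each group
group_1 = pd.Series([1, 1, 1, 0, 1, 1, np.nan, 1, 1, 1])
group_2 = pd.Series([0, 0, 1, 0, np.nan, 0, 0, 0, 1, 0])

# Drop missing votes explicitly before testing
a = group_1.dropna()
b = group_2.dropna()

# H_a: the mean of group_1 is greater than the mean of group_2
one_sided = ttest_ind(a, b, alternative="greater")
two_sided = ttest_ind(a, b)

print("one-sided p:", one_sided.pvalue)
print("two-sided p:", two_sided.pvalue)
```

When the observed difference is in the hypothesized direction, the one-sided p-value is half of the two-sided one, so a result that clears p < 0.01 two-sided also clears it one-sided.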


---

#### 4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

In [25]:
# Look at the head to determine which bill might be supported equally between the parties
rep.head(16)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0
11,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,,
14,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,0.0,
15,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,,0.0,
18,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,0.0,0.0
28,republican,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0


In [26]:
dem.head(16)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0
9,democrat,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,
12,democrat,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,,
13,democrat,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,,1.0,1.0,,0.0,0.0,1.0,
16,democrat,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,,1.0,1.0,1.0,,0.0,0.0,1.0
17,democrat,1.0,,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0


From a brief visual inspection, it seems that the "water-project" bill has roughly the same support from each side of the aisle.

##### 2-Sample t-Test

Null hypothesis: no difference in mean support between Republicans and Democrats on the "water-project" bill.

$H_0$: $\bar{x}_1 = \bar{x}_2$

Alternative hypothesis: the mean support for the "water-project" bill is different between Republicans and Democrats.

$H_a$: $\bar{x}_1 ≠ \bar{x}_2$

Confidence Level: 90% or 0.90

In [27]:
# Conduct the t-test
ttest_ind(rep["water-project"], dem["water-project"], nan_policy="omit")

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

##### Results

t-statistic: 0.09

p-value: 0.93 (0.93 > 0.1)

##### Conclusion

The t-test between Republicans and Democrats on the "water-project" bill yields a t-statistic of 0.09 and a p-value of 0.93. Since 0.93 is well above the significance level of 0.1 (a 90% confidence level), I fail to reject the null hypothesis that there is no difference in mean support between the two samples.

---

### 1-Sample t-Testing

5. Practice running 1-sample t-tests
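
A 1-sample t-test compares one group's mean against a fixed reference value instead of against a second group. The sketch below is one possible way to practice it: it tests whether a single group's support differs from an even split (0.5). The `votes` array is hypothetical, and the choice of 0.5 as the null value is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical 0/1 votes from one party on one issue (NaN = no vote recorded)
votes = np.array([1, 1, 0, 1, 1, np.nan, 1, 0, 1, 1, 1, np.nan, 1, 1])

# H_0: mean support equals 0.5 (the group is evenly split)
result = ttest_1samp(votes, popmean=0.5, nan_policy="omit")

print("mean support:", np.nanmean(votes))
print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
```

On the real data the same call could be applied to a column such as `dem["synfuels"]`, with the conclusion read the same way as in the 2-sample sections: compare the p-value against the chosen significance level.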

---

#### Stretch Goals


#### 0. Try to make some kind of visualization that communicates the results
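
One possible visualization is a grouped bar chart of the share of "yes" votes per issue, split by party. The sketch below is only a starting point: `plot_support` is a hypothetical helper and `toy` is a made-up stand-in for the cleaned frame (a `party` column plus 0/1/NaN vote columns).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned frame: "party" plus 0/1/NaN votes
toy = pd.DataFrame({
    "party": ["democrat"] * 4 + ["republican"] * 4,
    "issue_a": [1, 1, 0, 1, 0, 0, np.nan, 0],
    "issue_b": [0, np.nan, 0, 0, 1, 1, 1, 0],
})

def plot_support(frame):
    """Bar chart of the share of 'yes' votes per issue, split by party."""
    # mean() skips NaN by default, so missing votes are ignored per issue
    support = frame.groupby("party").mean(numeric_only=True).T
    ax = support.plot(kind="bar", figsize=(8, 4))
    ax.set_ylabel("Share voting yes")
    ax.set_ylim(0, 1)
    ax.set_title("Support by party")
    return ax, support

ax, support = plot_support(toy)
```

Calling `plot_support(df)` on the real frame should work the same way, since `df` has the same layout. Issues with the largest gap between bars are the ones expected to produce the smallest p-values.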

#### 1. Refactor your code into functions so it's easy to rerun with arbitrary variables
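
A minimal sketch of such a refactor follows. `compare_groups` is a hypothetical helper, not part of the work above, and `toy` is a made-up frame standing in for the cleaned voting data. Dropping each column's NaNs before calling `ttest_ind` is intended to match the effect of `nan_policy="omit"`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_groups(frame, issue, group_col="party",
                   group_a="republican", group_b="democrat"):
    """Two-sample t-test on one 0/1 column, dropping that column's NaNs."""
    a = frame.loc[frame[group_col] == group_a, issue].dropna()
    b = frame.loc[frame[group_col] == group_b, issue].dropna()
    result = ttest_ind(a, b)
    return {
        "issue": issue,
        f"{group_a}_mean": a.mean(),
        f"{group_b}_mean": b.mean(),
        "t": result.statistic,
        "p": result.pvalue,
    }

# Hypothetical toy frame standing in for the cleaned voting data
toy = pd.DataFrame({
    "party": ["republican"] * 5 + ["democrat"] * 5,
    "issue_a": [1, 1, 1, 0, 1, 0, 0, np.nan, 0, 1],
    "issue_b": [1, 0, np.nan, 1, 0, 0, 1, 1, 0, 1],
})

# One row of results per issue
summary = pd.DataFrame(
    [compare_groups(toy, col) for col in toy.columns.drop("party")]
).set_index("issue")
print(summary)
```

Run against the real `df`, the resulting table could be sorted by `p` to find candidate issues for each goal, with the sign of `t` indicating which party was more supportive.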

#### 2. Apply hypothesis testing to your personal project data

 (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)