


<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
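
For reference, the statistic behind a two-sample test can be written out. This assumes the pooled-variance (Student) form, which is what `scipy.stats.ttest_ind` computes with its default `equal_var=True`. The symbols $n_i$, $\bar{x}_i$ and $s_i^2$ (sample size, mean and sample variance of group $i$) are notation introduced here, not part of the assignment text:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The p-value is then read from a t distribution with $n_1 + n_2 - 2$ degrees of freedom.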

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel
import pandas as pd
import numpy as np

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-01-20 23:39:13--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-01-20 23:39:14 (619 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [0]:
#Created a list of column headers for the UCI data
column_headers = ['party', 'handicapped-infants', 'water-project', 
'budget', 'physician-fee-freeze', 'el-salvador-aid', 'religious-groups', 
'anti-satellite-ban', 'aid-to-contras', 'mx-missile', 'immigration', 
'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free', 
'south-africa']

In [0]:
#Load the UCI data into a dataframe. The file has no header row, so pass header=None, supply the column names, and read every "?" as NaN
df = pd.read_csv('house-votes-84.data',
                 header=None,
                 names=column_headers,
                 na_values="?")
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
#replace each "y" with a 1 and each "n" with a 0 to make comparisons possible
df = df.replace({'y': 1, 'n': 0})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
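
The five rows above hint at why it may be better to drop missing votes issue by issue than to drop whole rows. The sketch below uses only the standard library and the five records shown in `df.head()`. The names `RAW`, `ENCODE`, `complete_rows` and `first_issue` are illustrative, not part of the notebook.

```python
import csv
import io

# First five records of house-votes-84.data, as displayed by df.head() above.
# "?" marks a missing vote.
RAW = """\
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
"""

ENCODE = {"y": 1, "n": 0, "?": None}

# Encode the 16 vote columns of each record (column 0 is the party label).
rows = [[ENCODE[v] for v in rec[1:]] for rec in csv.reader(io.StringIO(RAW))]

# Listwise deletion: keep only records with no missing vote at all.
complete_rows = [r for r in rows if None not in r]

# Per-issue omission: for one issue, drop only the records missing that vote.
first_issue = [r[0] for r in rows if r[0] is not None]  # handicapped-infants

print(len(complete_rows))  # every one of these five records has a gap somewhere
print(len(first_issue))    # only one record lacks a handicapped-infants vote
```

In this small sample listwise deletion would discard every record, while per-issue omission keeps four of the five votes. That is the behaviour requested later in the notebook with `nan_policy='omit'`.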


In [0]:
#Create a new dataframe that specifically only has republican data
rep = df[df['party'] == 'republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
#Create a new dataframe that specifically only has democrat data
dem = df[df['party'] == 'democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


# **democrats support more than republicans with p < 0.01**

In [0]:
#democrats support the budget bill more than the republicans
print(dem['budget'].mean())
print(rep['budget'].mean())

0.8884615384615384
0.13414634146341464


In [0]:
#vote counts confirm that democrats supported the bill more than republicans did
col = dem['budget']
budget_nonans_dem = col[~np.isnan(col)]

budget_nonans_dem.value_counts()

1.0    231
0.0     29
Name: budget, dtype: int64

In [0]:
#231 of 260 democrats voted yes vs. 22 of 164 republicans
col = rep['budget']
budget_nonans_rep = col[~np.isnan(col)]

budget_nonans_rep.value_counts()

0.0    142
1.0     22
Name: budget, dtype: int64

In [0]:
#ttest between democrats and republicans for the budget bill
ttest_ind(dem['budget'], rep['budget'], nan_policy='omit')

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)
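
As a sanity check, the statistic above can be reproduced by hand from the yes/no counts in the two `value_counts()` outputs. This is a minimal standard-library sketch that assumes the pooled-variance form of the test (the `equal_var=True` default of `ttest_ind`). `pooled_t_from_counts` is a hypothetical helper, not a SciPy function.

```python
import math


def pooled_t_from_counts(yes1, no1, yes2, no2):
    """Pooled-variance two-sample t statistic for two groups of 0/1 votes."""
    n1, n2 = yes1 + no1, yes2 + no2
    mean1, mean2 = yes1 / n1, yes2 / n2
    # For 0/1 data the sum of squared deviations from the mean is yes * no / n.
    ss1, ss2 = yes1 * no1 / n1, yes2 * no2 / n2
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err


# Budget vote: democrats 231 yes / 29 no, republicans 22 yes / 142 no.
t_budget = pooled_t_from_counts(231, 29, 22, 142)
print(t_budget)
```

The printed value should agree with the statistic reported by `ttest_ind` to several decimal places. This also suggests that `nan_policy='omit'` simply leaves the missing votes out of both groups.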

In [0]:
2.0703402795404463e-77 < 0.01

True

Because the p-value (2.07e-77) is far below 0.01, I **reject** the null hypothesis that democrats and republicans supported the budget bill equally, at the 1% significance level (99% confidence level). The positive t statistic indicates that democrats supported it more.

# **republicans support more than democrats with p < 0.01**

In [0]:
#republicans support the physician fee freeze bill more than democrats
print(dem['physician-fee-freeze'].mean())
print(rep['physician-fee-freeze'].mean())

0.05405405405405406
0.9878787878787879


In [0]:
#vote counts confirm that republicans supported the bill more than democrats did
col = dem['physician-fee-freeze']
physician_fee_freeze_nonans_dem = col[~np.isnan(col)]

physician_fee_freeze_nonans_dem.value_counts()

0.0    245
1.0     14
Name: physician-fee-freeze, dtype: int64

In [0]:
#163 of 165 republicans voted yes vs. 14 of 259 democrats
col = rep['physician-fee-freeze']
physician_fee_freeze_nonans_rep = col[~np.isnan(col)]

physician_fee_freeze_nonans_rep.value_counts()

1.0    163
0.0      2
Name: physician-fee-freeze, dtype: int64

In [0]:
#ttest between democrats and republicans for the physician fee freeze bill
ttest_ind(dem['physician-fee-freeze'], rep['physician-fee-freeze'], nan_policy='omit')

Ttest_indResult(statistic=-49.36708157301406, pvalue=1.994262314074344e-177)

In [0]:
1.994262314074344e-177 < 0.01

True

Because the p-value (1.99e-177) is far below 0.01, I **reject** the null hypothesis that democrats and republicans supported the physician fee freeze bill equally, at the 1% significance level (99% confidence level). The negative t statistic indicates that republicans supported it more.

# **the difference between republicans and democrats has p > 0.1**

In [0]:
#The water project bill is the only bill on which democrats and republicans were evenly split
print(dem['water-project'].mean())
print(rep['water-project'].mean())

0.502092050209205
0.5067567567567568


In [0]:
#ttest between democrats and republicans for the water project bill
ttest_ind(dem['water-project'], rep['water-project'], nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

In [0]:
0.9291556823993485 > 0.1

True

Because the p-value (0.929) is well above 0.1, I **fail to reject** the null hypothesis that democrats and republicans supported the water project bill equally. The data show no statistically significant difference between the parties on this issue, even at the 10% significance level.

# **Stretch Goal**

In [0]:
#A function where you only have to plug in the bill name and it will output the ttest for the specified bill
def ttest_func(column):
  return ttest_ind(dem[column], rep[column], nan_policy='omit')
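
A possible next step for this stretch goal is a function that does not rely on the global `dem` and `rep` dataframes and that also reports the decision, so the hard-coded p-value comparisons are no longer needed. The sketch below is one way to do it under stated assumptions: `compare_parties`, its parameters, and the `toy` frame (rebuilt from the budget counts reported earlier, plus two made-up missing votes) are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(frame, issue, party_col="party", alpha=0.01):
    """Run a two-sample t-test on one issue and summarise the outcome."""
    dem_votes = frame.loc[frame[party_col] == "democrat", issue]
    rep_votes = frame.loc[frame[party_col] == "republican", issue]
    # nan_policy="omit" leaves missing votes out of both groups.
    result = ttest_ind(dem_votes, rep_votes, nan_policy="omit")
    return {
        "issue": issue,
        "dem_mean": dem_votes.mean(),  # pandas skips NaN by default
        "rep_mean": rep_votes.mean(),
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "significant": bool(result.pvalue < alpha),
    }


# Toy frame: 231 yes / 29 no democrats and 22 yes / 142 no republicans,
# with one illustrative missing vote added to each party.
toy = pd.DataFrame({
    "party": ["democrat"] * 261 + ["republican"] * 165,
    "budget": [1.0] * 231 + [0.0] * 29 + [np.nan]
              + [1.0] * 22 + [0.0] * 142 + [np.nan],
})

summary = compare_parties(toy, "budget")
print(summary)
```

On the real data the call would look like `compare_parties(df, 'crime', alpha=0.05)`. Passing the dataframe and `alpha` as arguments keeps the function reusable for other issues and thresholds.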

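Null hypothesis: democrats and republicans support the crime bill equally.

$\bar{x}_{1} = \bar{x}_{2}$

Alternative hypothesis: democrats and republicans differ in their support for the crime bill.

$\bar{x}_{1} \neq \bar{x}_{2}$

Here $\bar{x}_{1}$ and $\bar{x}_{2}$ are the mean support among democrats and republicans, respectively.

Confidence level: 95% (significance level 0.05)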

In [0]:
#utilizing the ttest_func function
ttest_func('crime')

Ttest_indResult(statistic=-16.342085656197696, pvalue=9.952342705606092e-47)

In [0]:
9.952342705606092e-47 < 0.05

True

Because the p-value (9.95e-47) is far below 0.05, I **reject** the null hypothesis that democrats and republicans supported the crime bill equally. The negative t statistic indicates that republicans supported it more.