<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
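
As a minimal sketch of what a 2-sample test looks like in code, the example below compares two small groups of made-up votes (they are not taken from the dataset) with `scipy.stats.ttest_ind`. The names `group_a` and `group_b` are illustrative only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes (1 = yes, 0 = no, NaN = missing); not real data.
group_a = np.array([1, 1, 1, 0, 1, 1, np.nan, 1])
group_b = np.array([0, 0, 1, 0, 0, np.nan, 0, 0])

# nan_policy='omit' drops missing votes before the two means are compared.
result = ttest_ind(group_a, group_b, nan_policy='omit')

# A positive statistic means the first group has the higher mean.
print(float(result.statistic), float(result.pvalue))
```

The sign of the statistic depends on argument order, so it helps to decide up front which group goes first.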

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
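
One possible shape for the first stretch goal is sketched below. The function name `compare_parties`, its `party_col` parameter, and the toy DataFrame are all assumptions made for illustration; the sketch expects votes already encoded as 1, 0, and NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_parties(df, issue, party_col="party"):
    """Two-sample t-test on one issue; returns (statistic, pvalue).

    A positive statistic means democrats voted yes more often.
    """
    dem_votes = df.loc[df[party_col] == "democrat", issue]
    rep_votes = df.loc[df[party_col] == "republican", issue]
    result = ttest_ind(dem_votes, rep_votes, nan_policy="omit")
    return float(result.statistic), float(result.pvalue)


# Tiny made-up frame just to exercise the function.
toy = pd.DataFrame({
    "party": ["democrat"] * 5 + ["republican"] * 5,
    "some-issue": [1, 1, 1, np.nan, 0, 0, 0, 0, 1, np.nan],
})
stat, p = compare_parties(toy, "some-issue")
print(stat, p)
```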

In [0]:
#loading scipy library + tests
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel
import pandas as pd
import numpy as np
%matplotlib inline
import seaborn as sns

In [2]:
#Getting data using bash command !

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-12-10 19:34:43--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2019-12-10 19:34:44 (124 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [3]:
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [4]:
#Summing the null/NaN values in each column and sorting from least to greatest

df.isnull().sum().sort_values()

#Checking again because I *know* there are missing values
#(isnull() finds none because missing votes are stored as the string '?')
df.isnull().values.any()

False
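
The `False` result is expected when missing votes are stored as the string `'?'`. The sketch below, using two made-up rows in the same comma-separated layout, shows two ways to surface them: counting the `'?'` markers directly, or passing `na_values='?'` to `pd.read_csv` so they are parsed as NaN on load. The column names here are placeholders.

```python
import io
import pandas as pd

# Two made-up rows in the same comma-separated layout as the data file.
raw = "republican,n,y,?\ndemocrat,?,y,n\n"
cols = ["party", "issue-a", "issue-b", "issue-c"]

as_text = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(as_text.isnull().values.any())   # '?' is just a string here
print((as_text == "?").sum().sum())    # count the '?' markers directly

# na_values tells read_csv to parse '?' as NaN at load time.
as_nan = pd.read_csv(io.StringIO(raw), header=None, names=cols, na_values="?")
print(as_nan.isnull().sum().sum())
```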

In [5]:
#Using melt to tidy the dataset: the issue column headers move into an 'Issue' column
#and their values into a new 'Vote' column

formatted_df = pd.melt(df,
                       ["party"],
                       var_name="Issue",
                       value_name="Vote")
formatted_df.head()

Unnamed: 0,party,Issue,Vote
0,republican,handicapped-infants,n
1,republican,handicapped-infants,n
2,democrat,handicapped-infants,?
3,democrat,handicapped-infants,n
4,democrat,handicapped-infants,y


In [0]:
# #Replacing all '?' values with 'Other' to more accurately convey the Values
# formatted_df.replace(to_replace ="?", 
#                  value ="Other")

In [7]:
#Encoding votes as numbers: '?' becomes NaN, 'n' becomes 0, 'y' becomes 1
df = df.replace({'?':np.nan, 'n':0, 'y':1})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
#filtering df into two separate dataframes, one per party
dem = df[df['party'] == 'democrat']
rep = df[df['party'] == 'republican']

In [9]:
#checking out my new dataframes
dem.head()
dem.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,258.0,239.0,260.0,259.0,255.0,258.0,259.0,263.0,248.0,263.0,255.0,249.0,252.0,257.0,251.0,185.0
mean,0.604651,0.502092,0.888462,0.054054,0.215686,0.476744,0.772201,0.828897,0.758065,0.471483,0.505882,0.144578,0.289683,0.350195,0.63745,0.935135
std,0.489876,0.501045,0.315405,0.226562,0.412106,0.50043,0.420224,0.377317,0.429121,0.500138,0.500949,0.352383,0.454518,0.477962,0.481697,0.246956
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [10]:
#checking out my new dataframes
rep.head()
rep.describe()


In [0]:
#Getting a count of each y or n vote on each issue based on party to help make assumptions for T-tests
count = formatted_df.groupby(['Issue', 'Vote', 'party']).size() 
# print(count)

In [44]:
count.tail(5)

Issue          Vote  party     
water-project  ?     republican     20
               n     democrat      119
                     republican     73
               y     democrat      120
                     republican     75
dtype: int64

In [0]:
#Sample Testing


#importing scipy tests
from scipy.stats import ttest_1samp


In [14]:
#Getting mean of republican votes on el-salvador-aid
rep['el-salvador-aid'].mean()


0.9515151515151515

In [15]:
#Checking for null values in el-salvador-aid
rep['el-salvador-aid'].isnull().sum()

3

In [16]:
# sample size (length of the column, excluding NaN values)
len(rep['el-salvador-aid']) - rep['el-salvador-aid'].isnull().sum()

165
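
The subtraction above can also be written with `Series.count()`, which counts only non-null entries. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

votes = pd.Series([1, 0, np.nan, 1, np.nan])  # made-up values

n_manual = len(votes) - votes.isnull().sum()  # length minus missing
n_count = votes.count()                       # non-null entries only
print(n_manual, n_count)
```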

In [17]:
# Run 1-sample t-test, passing the sample and the null-hypothesis mean (1 = everyone votes yes)
# pass nan_policy='omit' any time you have NaN values in a column


ttest_1samp(rep['physician-fee-freeze'], 1, nan_policy='omit')


Ttest_1sampResult(statistic=-1.4185450076223511, pvalue=0.1579292482594923)

In [18]:


ttest_1samp(dem['physician-fee-freeze'], 1, nan_policy='omit')

Ttest_1sampResult(statistic=-67.19374970932937, pvalue=1.7479453896049469e-165)
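
To make the statistic less of a black box, the sketch below recomputes a 1-sample t statistic by hand as (sample mean - null mean) / (sample std / sqrt(n)) and compares it with `ttest_1samp`. The votes are made up, and `t_manual` is just an illustrative name.

```python
import numpy as np
from scipy.stats import ttest_1samp

votes = np.array([1, 1, 0, 1, 1, 1, 0, 1], dtype=float)  # made-up votes
null_mean = 1.0  # null hypothesis: everyone votes yes

n = votes.size
standard_error = votes.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample std
t_manual = (votes.mean() - null_mean) / standard_error

res = ttest_1samp(votes, null_mean)
print(t_manual, float(res.statistic))
```

The statistic is negative whenever the sample mean is below the null value, which is why every test against a null of 1 above comes out negative.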

In [0]:
# Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
# Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
# Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)


In [19]:
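#p-value is far below 0.01, so reject the null hypothesis of equal support

#2-sample test: rep is passed first, so a positive statistic means Republicans support this issue more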

ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit')

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

In [0]:
# 'handicapped-infants','water-project',
#                           'budget','physician-fee-freeze', 'el-salvador-aid',
#                           'religious-groups','anti-satellite-ban',
#                           'aid-to-contras','mx-missile','immigration',
#                           'synfuels', 'education', 'right-to-sue','crime','duty-free',
#                           'south-africa'

In [35]:
ttest_1samp(rep['mx-missile'], 1, nan_policy='omit')

Ttest_1sampResult(statistic=-35.49944402826317, pvalue=6.985438779792963e-79)

In [36]:
ttest_1samp(dem['right-to-sue'], 1, nan_policy='omit')

Ttest_1sampResult(statistic=-24.808582253419022, pvalue=1.7639983932923968e-69)

In [32]:
#2-sample test: issue Reps support more
ttest_ind(rep['right-to-sue'], dem['right-to-sue'], nan_policy='omit')

Ttest_indResult(statistic=13.51064251060933, pvalue=1.2278581709672758e-34)

In [0]:
#I'm fairly certain this is a 1. I understand what the code does but it's taking me a bit more time to understand the 
#statistical concepts. Still taking in documentation (statquest, statistics in plain english, TK videos etc) 
#on this material until it sticks. 
#Had a few issues with overall order of process structure (the way the code was written and executed 
#in the lecture vs the way I wrote it + finding ways to get my code to work with the code for the new t-test material ) 
#Feeling unclear, but definitely making progress. 
#100% *need* and will be taking advantage of all TL hours available to me this week.  

In [37]:
#2-sample test where Dem and Rep votes don't differ much
ttest_ind(rep['water-project'], dem['water-project'], nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
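
Rather than testing issues one at a time, the three goals could be checked in a single pass. The sketch below is one way to do that, under assumptions: `classify_issues` and its labels are hypothetical names, the thresholds are the assignment's 0.01 and 0.1, and the toy frame is made up so the expected labels are easy to see.

```python
import pandas as pd
from scipy.stats import ttest_ind


def classify_issues(df, issues, party_col="party"):
    """Run a 2-sample t-test per issue and label it against the goals."""
    rows = []
    for issue in issues:
        dem = df.loc[df[party_col] == "democrat", issue]
        rep = df.loc[df[party_col] == "republican", issue]
        res = ttest_ind(dem, rep, nan_policy="omit")
        stat, p = float(res.statistic), float(res.pvalue)
        if p < 0.01:
            # dem is passed first, so a positive statistic favors democrats
            label = "dem_more" if stat > 0 else "rep_more"
        elif p > 0.1:
            label = "no_clear_difference"
        else:
            label = "inconclusive"
        rows.append({"issue": issue, "statistic": stat,
                     "pvalue": p, "label": label})
    return pd.DataFrame(rows)


# Made-up votes: 20 members per party, already encoded as 1/0.
toy = pd.DataFrame({
    "party": ["democrat"] * 20 + ["republican"] * 20,
    "issue-x": [1] * 18 + [0] * 2 + [1] * 2 + [0] * 18,
    "issue-y": [1] * 2 + [0] * 18 + [1] * 18 + [0] * 2,
    "issue-z": [1, 0] * 10 + [1, 0] * 10,
})
summary = classify_issues(toy, ["issue-x", "issue-y", "issue-z"])
print(summary)
```

On the real data, the same call would take the cleaned `df` and the list of 16 issue column names.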