
## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class label (democrat or republican), and 16 binary attributes (a yes or no vote on each issue). Be aware: there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
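As a rough illustration of what a two-sample test computes, the sketch below compares a hand-computed pooled-variance t statistic with `scipy.stats.ttest_ind`. The arrays `group_a` and `group_b` and the helper `pooled_t_statistic` are hypothetical names with made-up 0/1 votes; they are not taken from the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes for two small groups (illustrative data only).
group_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0])


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic using a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    # Weighted average of the two sample variances (ddof=1 for sample variance).
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))


manual_t = pooled_t_statistic(group_a, group_b)
# ttest_ind assumes equal variances by default (equal_var=True),
# which corresponds to the pooled formula above.
scipy_t, scipy_p = ttest_ind(group_a, group_b)
print(manual_t, scipy_t, scipy_p)
```

The two statistics should agree up to floating-point error, since both use the same pooled-variance formula.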

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [5]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.names

1. Title: 1984 United States Congressional Voting Records Database

2. Source Information:
    (a) Source:  Congressional Quarterly Almanac, 98th Congress, 
                 2nd session 1984, Volume XL: Congressional Quarterly Inc. 
                 Washington, D.C., 1985.
    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    (c) Date: 27 April 1987 

3. Past Usage
   - Publications
     1. Schlimmer, J. C. (1987).  Concept acquisition through 
        representational adjustment.  Doctoral dissertation, Department of 
        Information and Computer Science, University of California, Irvine, CA.
        -- Results: about 90%-95% accuracy appears to be STAGGER's asymptote
     - Predicted attribute: party affiliation (2 classes)

4. Relevant Information:
      This data set includes votes for each of the U.S. House of
      Representatives Congressmen on the 16 key votes identified by the
      CQA.  The CQA lists nine different types of votes: voted for, paired
      

In [0]:
cols = ['class_name', 'handicapped_infants', 'water_project_cost_sharing', 'adoption_of_budget_resolution', 'physician_fee_freeze', 'el_salvador_aid', 
        'religious_groups_in_schools', 'anti_satellite_test_ban', 'aid_to_contras', 'mx_missile', 'immigration', 'synfuels_cutback', 'education_spending', 
        'superfund_right_to_sue', 'crime', 'duty_free_exports', 'export_act_SA']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', header = None, names = cols, na_values = '?')

In [7]:
print(df.shape)
df.head()

(435, 17)


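| | class_name | handicapped_infants | water_project_cost_sharing | adoption_of_budget_resolution | physician_fee_freeze | el_salvador_aid | religious_groups_in_schools | anti_satellite_test_ban | aid_to_contras | mx_missile | immigration | synfuels_cutback | education_spending | superfund_right_to_sue | crime | duty_free_exports | export_act_SA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | n | y | n | y | y | y | n | n | n | y | NaN | y | y | y | n | y |
| 1 | republican | n | y | n | y | y | y | n | n | n | n | n | y | y | y | n | NaN |
| 2 | democrat | NaN | y | y | NaN | y | y | n | n | n | n | y | n | y | y | n | n |
| 3 | democrat | n | y | y | n | NaN | y | n | n | n | n | y | n | y | n | n | y |
| 4 | democrat | y | y | y | n | y | y | n | n | n | n | y | NaN | y | y | y | y |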


In [8]:
df.isna().sum()

class_name                         0
handicapped_infants               12
water_project_cost_sharing        48
adoption_of_budget_resolution     11
physician_fee_freeze              11
el_salvador_aid                   15
religious_groups_in_schools       11
anti_satellite_test_ban           14
aid_to_contras                    15
mx_missile                        22
immigration                        7
synfuels_cutback                  21
education_spending                31
superfund_right_to_sue            25
crime                             17
duty_free_exports                 28
export_act_SA                    104
dtype: int64

In [9]:
# The data is binary, and we don't want imputation to skew the yes/no distribution in either direction. Let's look at the mode and value counts for each feature.

for feat in df:
  print(feat)
  print(df[feat].mode())
  print(df[feat].value_counts())
  print('-----')

class_name
0    democrat
dtype: object
democrat      267
republican    168
Name: class_name, dtype: int64
-----
handicapped_infants
0    n
dtype: object
n    236
y    187
Name: handicapped_infants, dtype: int64
-----
water_project_cost_sharing
0    y
dtype: object
y    195
n    192
Name: water_project_cost_sharing, dtype: int64
-----
adoption_of_budget_resolution
0    y
dtype: object
y    253
n    171
Name: adoption_of_budget_resolution, dtype: int64
-----
physician_fee_freeze
0    n
dtype: object
n    247
y    177
Name: physician_fee_freeze, dtype: int64
-----
el_salvador_aid
0    y
dtype: object
y    212
n    208
Name: el_salvador_aid, dtype: int64
-----
religious_groups_in_schools
0    y
dtype: object
y    272
n    152
Name: religious_groups_in_schools, dtype: int64
-----
anti_satellite_test_ban
0    y
dtype: object
y    239
n    182
Name: anti_satellite_test_ban, dtype: int64
-----
aid_to_contras
0    y
dtype: object
y    242
n    178
Name: aid_to_contras, dtype: int64
-----
mx_mis

In [10]:
# In many cases the yes and no votes are close to evenly split, or the split closely mirrors the party split.
# Since at least some of these votes likely fall along party lines, filling with the mode would skew the distribution toward one side.
# Instead, we can fill with a new value that represents the missing votes neutrally. Abstaining from a vote is roughly like voting half yes, half no.
# Therefore, we replace yes with 1, no with 0, and NaN with 0.5.

df2 = df.copy()
df2.replace(to_replace = ['y','n'], value = [1,0], inplace=True)
df2.fillna(0.5, inplace=True)
df2.isna().sum()

class_name                       0
handicapped_infants              0
water_project_cost_sharing       0
adoption_of_budget_resolution    0
physician_fee_freeze             0
el_salvador_aid                  0
religious_groups_in_schools      0
anti_satellite_test_ban          0
aid_to_contras                   0
mx_missile                       0
immigration                      0
synfuels_cutback                 0
education_spending               0
superfund_right_to_sue           0
crime                            0
duty_free_exports                0
export_act_SA                    0
dtype: int64
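The first goal also allows dropping missing observations per test instead of imputing them. A minimal sketch of that alternative follows, assuming SciPy's `nan_policy='omit'` option on `ttest_ind`; the `toy` frame and the column `some_issue` are hypothetical stand-ins, not the real voting data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy frame with missing votes (illustrative data only).
toy = pd.DataFrame({
    'class_name': ['democrat'] * 5 + ['republican'] * 5,
    'some_issue': ['y', 'y', np.nan, 'y', 'n', 'n', 'n', 'y', np.nan, 'n'],
})
# Encode y/n as 1/0; missing values stay NaN instead of becoming 0.5.
toy['some_issue'] = toy['some_issue'].map({'y': 1.0, 'n': 0.0})

dems = toy.loc[toy.class_name == 'democrat', 'some_issue'].to_numpy()
reps = toy.loc[toy.class_name == 'republican', 'some_issue'].to_numpy()

# Option 1: let SciPy ignore NaNs within each sample.
omit_result = ttest_ind(dems, reps, nan_policy='omit')
# Option 2: drop NaNs explicitly before testing; this should give the same answer.
dropna_result = ttest_ind(dems[~np.isnan(dems)], reps[~np.isnan(reps)])

print(float(omit_result.statistic), float(omit_result.pvalue))
```

Dropping per test keeps every observed vote for each issue, whereas filling with 0.5 pulls both group means toward the middle and shrinks the apparent difference slightly.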

In [0]:
# Let's split our data into two separate samples - Democrats and Republicans.

burros = df2[df2.class_name == 'democrat']
elefantes = df2[df2.class_name =='republican']

# and check to make sure we're not missing anything

assert len(df2) == len(burros)+len(elefantes)

In [0]:
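# Make a list containing the numeric features.
# Build a new list instead of aliasing `cols`, so `cols` is not mutated and this cell can be rerun safely.
features = [col for col in cols if col != 'class_name']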

In [13]:
for feat in features:
  print(feat)
  print(burros[feat].mean())
  print(elefantes[feat].mean())
  print('-------')

handicapped_infants
0.601123595505618
0.19345238095238096
-------
water_project_cost_sharing
0.50187265917603
0.5059523809523809
-------
adoption_of_budget_resolution
0.8782771535580525
0.14285714285714285
-------
physician_fee_freeze
0.06741573033707865
0.9791666666666666
-------
el_salvador_aid
0.22846441947565543
0.9434523809523809
-------
religious_groups_in_schools
0.47752808988764045
0.8928571428571429
-------
anti_satellite_test_ban
0.7640449438202247
0.25
-------
aid_to_contras
0.8239700374531835
0.17559523809523808
-------
mx_missile
0.7397003745318352
0.12202380952380952
-------
immigration
0.47191011235955055
0.5565476190476191
-------
synfuels_cutback
0.5056179775280899
0.15178571428571427
-------
education_spending
0.16853932584269662
0.8422619047619048
-------
superfund_right_to_sue
0.301498127340824
0.8392857142857143
-------
crime
0.35580524344569286
0.9613095238095238
-------
duty_free_exports
0.6292134831460674
0.11904761904761904
-------
export_act_SA
0.8014981273408

In [14]:
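# Run an independent two-sample t-test for each feature.
# The printed difference is Democrat mean minus Republican mean, so a positive value means Democrats voted yes more often.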

for feat in features:
  tstat, pvalue = ttest_ind(burros[feat], elefantes[feat])
  print(feat)
  print(burros[feat].mean() - elefantes[feat].mean())
  print(tstat)
  print(pvalue)
  print('---------')

handicapped_infants
0.40767121455323707
9.22317772154614
1.2761169357253626e-18
---------
water_project_cost_sharing
-0.004079721776350964
-0.08764559884421878
0.9301988772663682
---------
adoption_of_budget_resolution
0.7354200107009097
22.821693043884803
2.872115314395808e-76
---------
physician_fee_freeze
-0.9117509363295879
-46.10191006844654
3.967141133302638e-169
---------
el_salvador_aid
-0.7149879614767255
-20.895617123040896
1.4659659186479053e-67
---------
religious_groups_in_schools
-0.41532905296950245
-9.815876256106362
1.142999405504256e-20
---------
anti_satellite_test_ban
0.5140449438202247
12.448556296273836
1.2736295885307941e-30
---------
aid_to_contras
0.6483747993579454
17.791848422270405
1.4948014750035628e-53
---------
mx_missile
0.6176765650080257
16.326540222505365
4.863267267891218e-47
---------
immigration
-0.08463750668806852
-1.7350166356866614
0.08344939720307315
---------
synfuels_cutback
0.3538322632423756
8.20071170109401
2.7434037173701792e-15
--------
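The two party samples differ in size (267 versus 168), and `ttest_ind` assumes equal variances by default. As a hedged side note, the sketch below shows how Welch's variant could be requested with `equal_var=False`. The samples are randomly generated placeholders (`large_group`, `small_group`), not the actual votes.

```python
import numpy as np
from scipy.stats import ttest_ind

# Randomly generated 0/1 placeholders matching the two group sizes (not real votes).
rng = np.random.default_rng(0)
large_group = rng.integers(0, 2, size=267).astype(float)
small_group = rng.integers(0, 2, size=168).astype(float)

# Student's t-test: pooled variance (the default used in this notebook).
student = ttest_ind(large_group, small_group)
# Welch's t-test: does not assume the two groups share a variance.
welch = ttest_ind(large_group, small_group, equal_var=False)

print('student:', student.statistic, student.pvalue)
print('welch:  ', welch.statistic, welch.pvalue)
```

When the two p-values are close, the equal-variance assumption is unlikely to change the conclusion for that feature.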

In [15]:
# Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01

tstat, pvalue = ttest_ind(burros['handicapped_infants'], elefantes['handicapped_infants'])
print('handicapped_infants')
print(burros['handicapped_infants'].mean() - elefantes['handicapped_infants'].mean())
print(tstat)
print(pvalue)

handicapped_infants
0.40767121455323707
9.22317772154614
1.2761169357253626e-18


In [16]:
# Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01

tstat, pvalue = ttest_ind(burros['physician_fee_freeze'], elefantes['physician_fee_freeze'])
print('physician_fee_freeze')
print(burros['physician_fee_freeze'].mean() - elefantes['physician_fee_freeze'].mean())
print(tstat)
print(pvalue)

physician_fee_freeze
-0.9117509363295879
-46.10191006844654
3.967141133302638e-169


In [17]:
# Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

tstat, pvalue = ttest_ind(burros['water_project_cost_sharing'], elefantes['water_project_cost_sharing'])
print('water_project_cost_sharing')
print(burros['water_project_cost_sharing'].mean() - elefantes['water_project_cost_sharing'].mean())
print(tstat)
print(pvalue)

water_project_cost_sharing
-0.004079721776350964
-0.08764559884421878
0.9301988772663682
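The three cells above repeat the same steps, which suggests one way to approach the first stretch goal. The function below is a sketch only: `compare_groups` and the `demo` frame are hypothetical names introduced for illustration, and the toy values are made up.

```python
import pandas as pd
from scipy.stats import ttest_ind


def compare_groups(data, group_col, group_a, group_b, features):
    """Run an independent two-sample t-test per feature; return one row per feature."""
    a = data[data[group_col] == group_a]
    b = data[data[group_col] == group_b]
    rows = []
    for feat in features:
        tstat, pvalue = ttest_ind(a[feat], b[feat])
        rows.append({
            'feature': feat,
            'mean_diff': a[feat].mean() - b[feat].mean(),  # group_a minus group_b
            't_stat': tstat,
            'p_value': pvalue,
        })
    return pd.DataFrame(rows).set_index('feature')


# Hypothetical toy data using the same 1 / 0 / 0.5 encoding as the notebook.
demo = pd.DataFrame({
    'class_name': ['democrat'] * 4 + ['republican'] * 4,
    'issue_x': [1, 1, 1, 0.5, 0, 0, 0.5, 0],
    'issue_y': [1, 0, 1, 0, 1, 0, 1, 0],
})
results = compare_groups(demo, 'class_name', 'democrat', 'republican', ['issue_x', 'issue_y'])
print(results)
```

With the notebook's own objects, the equivalent call would be `compare_groups(df2, 'class_name', 'democrat', 'republican', features)`, and the resulting frame can be filtered on `p_value` and the sign of `mean_diff` to answer each goal.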
