<a href="https://colab.research.google.com/github/scrunts23/DS-Unit-1-Sprint-2-Statistics/blob/master/module1/LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!
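
As a rough sketch of what the raw file looks like and how the missing values can be handled at load time, the snippet below parses three records copied from the data. The short column names (`party`, `issue_1` ... `issue_16`) are placeholders chosen for illustration, not names from the dataset.

```python
import io
import pandas as pd

# Three records copied from the raw file; '?' marks a missing vote.
raw = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# Placeholder column names (an assumption for this sketch).
columns = ["party"] + [f"issue_{i}" for i in range(1, 17)]

# header=None keeps the first record as data; na_values turns '?' into NaN.
sample = pd.read_csv(raw, header=None, names=columns, na_values="?")

# Recode y/n as 1.0/0.0; NaN entries stay NaN.
votes = sample.drop(columns="party").apply(
    lambda col: col.map({"y": 1.0, "n": 0.0})
)

print(votes.isna().sum().sum())  # number of missing votes in the sample
```

Keeping the missing votes as NaN, rather than dropping whole rows, leaves every recorded vote available for the per-issue tests.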

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)
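
For goal 1, one option is to leave the NaNs in place and let the test drop them one column at a time. The sketch below uses made-up vote arrays (not the congressional data) to check that `nan_policy='omit'` in `scipy.stats.ttest_ind` gives the same result as removing the NaNs by hand.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes (1) / no (0) votes with some missing entries.
group_a = np.array([1, 1, 0, 1, np.nan, 1, 0, 1])
group_b = np.array([0, np.nan, 0, 1, 0, 0, np.nan, 1])

# Let scipy ignore the NaNs...
omit = ttest_ind(group_a, group_b, nan_policy="omit")

# ...or drop them manually before testing.
manual = ttest_ind(group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)])

print(float(omit.statistic), float(manual.statistic))
```

Dropping per column this way keeps more observations than discarding every congressperson who missed at least one vote.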

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
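
To make the two-sample statistic concrete, here is a minimal sketch on toy data (not the voting records) that computes the pooled-variance t-statistic by hand and compares it with `scipy.stats.ttest_ind`, which assumes equal variances by default.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy yes (1) / no (0) votes for two groups of different sizes.
a = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=float)
b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1], dtype=float)

n_a, n_b = len(a), len(b)

# Pooled sample variance across both groups.
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# Difference in means divided by its standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_scipy, p_scipy = ttest_ind(a, b)
print(t_manual, t_scipy, p_scipy)
```

The sign of the statistic follows the argument order: it is positive when the first group has the higher mean.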

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
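
One possible shape for stretch goal 1 is a function that takes the DataFrame, the grouping column, and the two group labels as arguments, so nothing is hard-coded. The name `compare_groups` and the toy DataFrame below are assumptions made for this sketch.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_groups(df, group_col, group_a, group_b, value_col):
    """Two-sample t-test of value_col between two labels of group_col."""
    a = df.loc[df[group_col] == group_a, value_col].dropna()
    b = df.loc[df[group_col] == group_b, value_col].dropna()
    result = ttest_ind(a, b)
    return float(result.statistic), float(result.pvalue)

# Toy data with one missing vote (hypothetical values).
toy = pd.DataFrame({
    "party": ["x"] * 5 + ["y"] * 5,
    "vote": [1.0, 1.0, 0.0, 1.0, float("nan"), 0.0, 0.0, 1.0, 0.0, 0.0],
})

t_stat, p_value = compare_groups(toy, "party", "x", "y", "vote")
print(t_stat, p_value)
```

Passing the group labels explicitly also makes the direction of the comparison visible at the call site.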

In [0]:
# import
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [0]:
# load data via website 
congressional_voting_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

In [4]:
# read in data (without header=None, pandas treats the first record as the header row)
voting_data = pd.read_csv(congressional_voting_data_url)
voting_data

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
430,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
431,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
432,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y


In [0]:
# set column names and read '?' as NaN
column_header = ['political party', 'handicapped infants', 'water project cost sharing', 'adoption of the budget resolution', 'physician fee freeze', 'el salvador aid','religious groups in schools', 'anti satellite test ban', 'aid to nicaraguan contras', 'mx missile', 'immigration', 'synfuels corporation cutback', 'education spending', 'superfund right to sue', 'crime', 'duty free exports', 'export administration act south africa']
voting_data = pd.read_csv(congressional_voting_data_url, header=None, names=column_header, na_values="?")

In [6]:
# inspect header
voting_data.head()

Unnamed: 0,political party,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [7]:
# recode votes as numeric values 
voting_data = voting_data.replace({'y': 1, 'n': 0})
voting_data.head()

Unnamed: 0,political party,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [8]:
# how many from each party?
voting_data['political party'].value_counts()

democrat      267
republican    168
Name: political party, dtype: int64

In [9]:
# look at republicans voting 
rep = voting_data[voting_data['political party']=='republican']
rep.head()

Unnamed: 0,political party,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [10]:
rep.describe()

Unnamed: 0,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
count,165.0,148.0,164.0,165.0,165.0,166.0,162.0,157.0,165.0,165.0,159.0,155.0,158.0,161.0,156.0,146.0
mean,0.187879,0.506757,0.134146,0.987879,0.951515,0.89759,0.240741,0.152866,0.115152,0.557576,0.132075,0.870968,0.860759,0.981366,0.089744,0.657534
std,0.391804,0.501652,0.341853,0.10976,0.215442,0.304104,0.428859,0.36101,0.320176,0.498186,0.339643,0.336322,0.347298,0.135649,0.286735,0.476168
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [12]:
# look at democrat voting 
dem = voting_data[voting_data['political party']=='democrat']
dem.head()

Unnamed: 0,political party,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [13]:
dem.describe()

Unnamed: 0,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
count,258.0,239.0,260.0,259.0,255.0,258.0,259.0,263.0,248.0,263.0,255.0,249.0,252.0,257.0,251.0,185.0
mean,0.604651,0.502092,0.888462,0.054054,0.215686,0.476744,0.772201,0.828897,0.758065,0.471483,0.505882,0.144578,0.289683,0.350195,0.63745,0.935135
std,0.489876,0.501045,0.315405,0.226562,0.412106,0.50043,0.420224,0.377317,0.429121,0.500138,0.500949,0.352383,0.454518,0.477962,0.481697,0.246956
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [50]:
# mean support of Republicans
rep['adoption of the budget resolution'].mean()

0.13414634146341464

In [51]:
# mean support of Democrats 
dem['adoption of the budget resolution'].mean()

0.8884615384615384

In [52]:
# two-sample t-test on adoption of the budget resolution (democrats vs. republicans)
ttest_ind(dem['adoption of the budget resolution'], rep['adoption of the budget resolution'], nan_policy='omit')

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)

The positive t-statistic and the group means (0.89 vs. 0.13) show that Democrats support this issue more than Republicans, and with p < 0.01 the difference is statistically significant.
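
The p-value from `ttest_ind` is two-sided, so on its own it does not say which group is higher; the sign of the t-statistic does. As a hedged sketch, a one-sided p-value for "first group supports more" can be derived from the two-sided result, relying on the symmetry of the t distribution. The helper `one_sided_p` and the toy arrays are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def one_sided_p(t_stat, p_two_sided):
    """p-value for H1: mean(first) > mean(second), from a two-sided t-test."""
    return p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# Toy yes (1) / no (0) votes; the first group votes yes more often.
first = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], dtype=float)
second = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=float)

t_stat, p_two = ttest_ind(first, second)

p_greater = one_sided_p(t_stat, p_two)   # first > second
p_swapped = one_sided_p(-t_stat, p_two)  # same data, groups swapped
print(t_stat, p_two, p_greater, p_swapped)
```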

In [53]:
# mean support of republicans physician fee freeze
rep['physician fee freeze'].mean()

0.9878787878787879

In [54]:
# mean support of Democrats physician fee freeze
dem['physician fee freeze'].mean()

0.05405405405405406

In [55]:
# two-sample t-test on physician fee freeze (republicans vs. democrats)
ttest_ind(rep['physician fee freeze'], dem['physician fee freeze'], nan_policy='omit')

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

Here Republicans support the issue more than Democrats (mean 0.99 vs. 0.05), and with p < 0.01 the difference is statistically significant.

In [56]:
# mean support of rep water project cost sharing
rep['water project cost sharing'].mean()

0.5067567567567568

In [57]:
# mean support of dem water project cost sharing
dem['water project cost sharing'].mean()

0.502092050209205

In [58]:
# two-sample t-test on water project cost sharing (republicans vs. democrats)
ttest_ind(rep['water project cost sharing'], dem['water project cost sharing'], nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

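With p > 0.1 we fail to reject the null hypothesis of equal support: both parties are split almost evenly on this issue (about 51% of Republicans and 50% of Democrats voted yes), so there is no evidence of a difference between them.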
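
A large p-value is not proof that the parties agree, so it can help to look at how wide the plausible range for the difference is. The sketch below takes the counts and means for this issue from the `describe()` output above and builds an approximate 95% interval for the difference in yes-vote proportions. The normal (Wald) approximation is an assumption of this sketch, not something the t-test itself reports.

```python
import math

# Non-missing counts and mean support for 'water project cost sharing',
# taken from rep.describe() and dem.describe().
n_rep, mean_rep = 148, 0.506757
n_dem, mean_dem = 239, 0.502092

# Difference in proportions and its approximate standard error.
diff = mean_rep - mean_dem
se = math.sqrt(mean_rep * (1 - mean_rep) / n_rep + mean_dem * (1 - mean_dem) / n_dem)

# Approximate 95% interval using the normal quantile 1.96.
lower, upper = diff - 1.96 * se, diff + 1.96 * se
print(f"difference = {diff:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that contains zero is consistent with the test result: the data do not separate the two parties on this vote.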

In [0]:
# Stretch goal

In [0]:
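# wrap the test in a reusable function
def two_sample_t_test(column):
  """Return (t-statistic, p-value) for republicans vs. democrats on one issue, ignoring NaNs."""
  tstat, pvalue = ttest_ind(rep[column], dem[column], nan_policy='omit')
  return tstat, pvalue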

In [20]:
# test def
two_sample_t_test('water project cost sharing')

(0.08896538137868286, 0.9291556823993485)

In [23]:
# slice voting issues from columns
voting_slice = column_header[1:]
voting_slice

['handicapped infants',
 'water project cost sharing',
 'adoption of the budget resolution',
 'physician fee freeze',
 'el salvador aid',
 'religious groups in schools',
 'anti satellite test ban',
 'aid to nicaraguan contras',
 'mx missile',
 'immigration',
 'synfuels corporation cutback',
 'education spending',
 'superfund right to sue',
 'crime',
 'duty free exports',
 'export administration act south africa']

In [26]:
# run the test for every issue and store the results
t_vals = {}
p_vals = {}

for column in voting_slice:
  # reuse the helper instead of calling ttest_ind a second time
  tstat, pvalue = two_sample_t_test(column)
  print("The two sample T-Test result in "+ column + " is:")
  print((tstat, pvalue))
  t_vals[column] = tstat
  p_vals[column] = pvalue

The two sample T-Test result in handicapped infants is:
(-9.205264294809222, 1.613440327937243e-18)
The two sample T-Test result in water project cost sharing is:
(0.08896538137868286, 0.9291556823993485)
The two sample T-Test result in adoption of the budget resolution is:
(-23.21277691701378, 2.0703402795404463e-77)
The two sample T-Test result in physician fee freeze is:
(49.36708157301406, 1.994262314074344e-177)
The two sample T-Test result in el salvador aid is:
(21.13669261173219, 5.600520111729011e-68)
The two sample T-Test result in religious groups in schools is:
(9.737575825219457, 2.3936722520597287e-20)
The two sample T-Test result in anti satellite test ban is:
(-12.526187929077842, 8.521033017443867e-31)
The two sample T-Test result in aid to nicaraguan contras is:
(-18.052093200819733, 2.82471841372357e-54)
The two sample T-Test result in mx missile is:
(-16.437503268542994, 5.03079265310811e-47)
The two sample T-Test result in immigration is:
(1.7359117329695164, 0.083