<a href="https://colab.research.google.com/github/EvidenceN/DS-Unit-1-Sprint-2-Statistics/blob/master/Evidence%20N.%20Answers_LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
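To make the two-sample idea concrete, here is a minimal sketch of the pooled (equal-variance) t statistic computed by hand. The function name `pooled_t_statistic` and the vote lists are made up for illustration and are not part of the assignment data. The equal-variance assumption mirrors the default of SciPy's `ttest_ind` (`equal_var=True`).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: each group weighted by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2

# Made-up votes (1 = yes, 0 = no), not taken from the real dataset.
group_a = [1, 1, 1, 0, 1, 1, 0, 1]
group_b = [0, 0, 1, 0, 0, 1, 0, 0]

t_stat, dof = pooled_t_statistic(group_a, group_b)
print(t_stat, dof)
```

A positive statistic means the first group's mean is higher than the second's; the p-value then comes from the t distribution with `dof` degrees of freedom.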

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel


In [0]:
### Loading the assignment data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-10-07 20:13:22--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2019-10-07 20:13:22 (284 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [0]:
# Load Data and give the names of each column
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])

print(df.shape)

df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
# Replace the unknown values ('?') with NaN and encode the votes as numbers
# (n -> 0, y -> 1), because means and t-tests cannot be computed on strings.
df = df.replace({'?':np.nan, 'n':0, 'y':1})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
# Count the missing values (NaN) in each column.
df.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64

In [0]:
# Split the data into republican and democrat subsets.
rep = df[df['party']=='republican']
dem = df[df['party']=='democrat']
print('rep:', rep.shape)
print('dem:', dem.shape)

# Getting the mean of rep data set and dem data set. 

print('rep mean:', rep['handicapped-infants'].mean())
print('dem mean:', dem['handicapped-infants'].mean())

rep: (168, 17)
dem: (267, 17)
rep mean: 0.18787878787878787
dem mean: 0.6046511627906976


In [0]:
# Wrap the t-tests in helper functions and check them against the class
# notes. If these helpers reproduce the results the instructor got in
# class, I know they are fully functional and I can use them for my
# assignment.

# a = sample, b = population mean under the null hypothesis; NaNs are omitted by default
def one_sample_ttest(a, b, nan_policy='omit'):
  return ttest_1samp(a, b, nan_policy=nan_policy)

# a = sample one, b = sample two; NaNs are omitted by default
def two_sample_ttest(a, b, nan_policy='omit'):
  return ttest_ind(a, b, nan_policy=nan_policy)

one_sample_ttest(rep['physician-fee-freeze'], 1)
two_sample_ttest(rep['handicapped-infants'], dem['handicapped-infants'])
# The results match exactly what the instructor got in class,
# so I can use these functions for my assignment.
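Goal 1 asks for the best way to drop observations. As a small check, the sketch below suggests that `nan_policy='omit'` gives the same answer as dropping the NaNs from each sample by hand before the test. The two series are made-up votes, not the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Made-up votes with missing values (hypothetical, for illustration only).
a = pd.Series([1, 0, np.nan, 1, 1, 0, 1])
b = pd.Series([0, 0, 1, np.nan, 0, 1, np.nan, 0])

# Let SciPy skip the NaNs ...
omitted = ttest_ind(a, b, nan_policy='omit')
# ... versus removing them from each sample first.
dropped = ttest_ind(a.dropna(), b.dropna())

print(omitted.statistic, dropped.statistic)
print(omitted.pvalue, dropped.pvalue)
```

Because the NaNs are removed per sample and per issue, a congressperson who skipped one vote still counts in the tests for every other issue, which would not be the case after a blanket `df.dropna()`.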

In [0]:
# Create a list holding every feature for the rep and dem data sets so that
# I can easily iterate through them to find the results I am looking for.

rep_data = [rep['handicapped-infants'], rep['physician-fee-freeze'], 
           rep['water-project'], rep['budget'],
           rep['el-salvador-aid'], rep['religious-groups'],
           rep['anti-satellite-ban'], rep['aid-to-contras'],
           rep['mx-missile'], rep['immigration'],
           rep['synfuels'], rep['education'],
           rep['right-to-sue'], rep['crime'],
           rep['duty-free'], rep['south-africa']]
           

dem_data = [dem['handicapped-infants'], dem['physician-fee-freeze'], 
           dem['water-project'], dem['budget'],
           dem['el-salvador-aid'], dem['religious-groups'],
           dem['anti-satellite-ban'], dem['aid-to-contras'],
           dem['mx-missile'], dem['immigration'],
           dem['synfuels'], dem['education'],
           dem['right-to-sue'], dem['crime'],
           dem['duty-free'], dem['south-africa']]

# Testing whether I can use a list comprehension to iterate through
# the different features without typing them out individually.

# republican data with a null hypothesis of 0.5
[one_sample_ttest(a,0.5) for a in rep_data]

# my list comprehension works

In [0]:
# republican data with a null hypothesis of 1
[one_sample_ttest(a,1) for a in rep_data]
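# A one-sample test against 1 asks whether republican support differs from a
# unanimous "yes"; it does not compare republicans with democrats.
# The issues where republican support is closest to unanimous are
# "physician-fee-freeze" (p of about 0.16, not significantly different from 1),
# "el-salvador-aid" (p of about 0.004) and "crime": their p-values are not
# vanishingly small (no large negative exponent) like the others.
# Whether republicans support them more than democrats is checked with the
# two-sample tests further down.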

[Ttest_1sampResult(statistic=-26.625236633811387, pvalue=1.978873197183477e-61),
 Ttest_1sampResult(statistic=-1.4185450076223511, pvalue=0.1579292482594923),
 Ttest_1sampResult(statistic=-11.961605243444543, pvalue=1.8656648229239887e-23),
 Ttest_1sampResult(statistic=-32.43595087385152, pvalue=5.293293090366981e-73),
 Ttest_1sampResult(statistic=-2.890793645020198, pvalue=0.004363402589282088),
 Ttest_1sampResult(statistic=-4.338836636208457, pvalue=2.488389920449274e-05),
 Ttest_1sampResult(statistic=-22.533735393166197, pvalue=1.1791229999687983e-51),
 Ttest_1sampResult(statistic=-29.402380855978315, pvalue=1.646323339608132e-65),
 Ttest_1sampResult(statistic=-35.49944402826317, pvalue=6.985438779792963e-79),
 Ttest_1sampResult(statistic=-11.407472760546423, pvalue=1.4608897556395605e-22),
 Ttest_1sampResult(statistic=-32.22244115962839, pvalue=2.3835416257429457e-71),
 Ttest_1sampResult(statistic=-4.776485613378817, pvalue=4.122594240061259e-06),
 Ttest_1sampResult(statistic=-5.03

In [0]:
# democrat data with a null hypothesis of 1
[[one_sample_ttest(a,1) for a in dem_data]]
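# A one-sample test against 1 asks whether democrat support differs from a
# unanimous "yes"; it does not compare democrats with republicans.
# Every p-value is below 0.01, so democrat support is not unanimous on any issue.
# "South Africa" is the only issue whose p-value is not vanishingly small
# (no large negative exponent), which suggests democrat support is closest
# to unanimous there.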

[[Ttest_1sampResult(statistic=-12.96296499796484, pvalue=6.590394568934029e-30),
  Ttest_1sampResult(statistic=-67.19374970932937, pvalue=1.7479453896049469e-165),
  Ttest_1sampResult(statistic=-15.36283393995609, pvalue=1.8031537722768159e-37),
  Ttest_1sampResult(statistic=-5.702205846437985, pvalue=3.217258173105712e-08),
  Ttest_1sampResult(statistic=-30.39138633949369, pvalue=1.4023105444477201e-86),
  Ttest_1sampResult(statistic=-16.7950341092749, pvalue=3.12862805821349e-43),
  Ttest_1sampResult(statistic=-8.724104538575864, pvalue=3.377871549159516e-16),
  Ttest_1sampResult(statistic=-7.354085178140069, pvalue=2.464384519854037e-12),
  Ttest_1sampResult(statistic=-8.87861403790268, pvalue=1.418109388321789e-16),
  Ttest_1sampResult(statistic=-17.137489559066015, pvalue=1.1309092327824875e-44),
  Ttest_1sampResult(statistic=-15.750968962442089, pvalue=1.839716417353916e-39),
  Ttest_1sampResult(statistic=-38.305787204198445, pvalue=3.89658572232871e-106),
  Ttest_1sampResult(sta

In [0]:
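# Testing how I can use a list comprehension with two-sample t-tests.

#[two_sample_ttest(a,b) for a in rep_data for b in dem_data]

# The nested version above compares every republican feature with every
# democrat feature (16 x 16 tests) instead of matching them issue by issue.
# I don't know how to use a list comprehension or a for loop with multiple
# data sets to get all my results at once, and Google is not helping either.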

#for data in rep_data:
 # for b_data in dem_data:
  #  a = data
   # b = b_data
    #result = (two_sample_ttest(a,b))
#print(result)
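One way to get a single test per issue is to walk the two lists in parallel with `zip`. The sketch below uses two short made-up lists, `rep_demo` and `dem_demo`, as hypothetical stand-ins for `rep_data` and `dem_data`.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical stand-ins for rep_data and dem_data: two issues per party.
rep_demo = [np.array([1, 1, 0, 1, np.nan]), np.array([0, 0, 1, 0, 0])]
dem_demo = [np.array([0, 0, 1, 0, 0, 1]), np.array([1, 1, np.nan, 1, 0, 1])]

# zip pairs the i-th republican column with the i-th democrat column,
# giving one test per issue instead of every rep/dem combination.
results = [ttest_ind(r, d, nan_policy='omit') for r, d in zip(rep_demo, dem_demo)]

for res in results:
    print(res.statistic, res.pvalue)
```

This only works if both lists hold the issues in the same order, which is the case for `rep_data` and `dem_data` as built above.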

In [0]:
# Handicapped infants
print('Handicapped infants:', two_sample_ttest(rep_data[0],dem_data[0]))

# physician-fee-freeze
print('physician-fee-freeze:', two_sample_ttest(rep_data[1],dem_data[1]))

# water project vote comparison between democrats and republicans
print('Water Project:', two_sample_ttest(rep_data[2],dem_data[2]))

# budget vote comparison between democrats and republicans
print('Budget:', two_sample_ttest(rep_data[3],dem_data[3]))

# el salvador aid vote comparison between democrats and republicans
print('el-salvador-aid:', two_sample_ttest(rep_data[4],dem_data[4]))

# religious groups vote comparison between democrats and republicans
print('Religious Groups:', two_sample_ttest(rep_data[5],dem_data[5]))

# anti satellite ban vote comparison between democrats and republicans
print('Anti Satellite ban:', two_sample_ttest(rep_data[6],dem_data[6]))

# aid to contras vote comparison between democrats and republicans
print('aid to contras:', two_sample_ttest(rep_data[7],dem_data[7]))

# mx missile vote comparison between democrats and republicans
print('mx -  missile:', two_sample_ttest(rep_data[8],dem_data[8]))

# immigration vote comparison between democrats and republicans
print('immigration:', two_sample_ttest(rep_data[9],dem_data[9]))

# synfuels vote comparison between democrats and republicans
print('synfuels:', two_sample_ttest(rep_data[10],dem_data[10]))

# education vote comparison between democrats and republicans
print('education:', two_sample_ttest(rep_data[11],dem_data[11]))

# right to sue vote comparison between democrats and republicans
print('right to sue:', two_sample_ttest(rep_data[12],dem_data[12]))

# crime vote comparison between democrats and republicans
print('crime:', two_sample_ttest(rep_data[13],dem_data[13]))

# duty free vote comparison between democrats and republicans
print('duty free:', two_sample_ttest(rep_data[14],dem_data[14]))

# South Africa vote comparison between democrats and republicans
print('South Africa:', two_sample_ttest(rep_data[15],dem_data[15]))

Handicapped infants: Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
physician-fee-freeze: Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)
Water Project: Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)
Budget: Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)
el-salvador-aid: Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68)
Religious Groups: Ttest_indResult(statistic=9.737575825219457, pvalue=2.3936722520597287e-20)
Anti Satellite ban: Ttest_indResult(statistic=-12.526187929077842, pvalue=8.521033017443867e-31)
aid to contras: Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54)
mx -  missile: Ttest_indResult(statistic=-16.437503268542994, pvalue=5.03079265310811e-47)
immigration: Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066)
synfuels: Ttest_indResult(statistic=-8.293603989407588, pvalue=1.5759322301054

In [0]:
'''The issue where the difference between democrats and republicans
has a p-value > 0.1 is "Water Project", with a p-value of about 0.93.
"Immigration" comes close with a p-value of about 0.08, but that is
below the 0.1 threshold.'''
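For the first stretch goal, the repeated `print` calls could be folded into one function that tests every issue column and returns a table. The sketch below is one possible shape, not the only one: `compare_parties`, the `favoured_by` column, and the small `demo` frame are all hypothetical names and made-up data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(df, party_col='party', groups=('republican', 'democrat')):
    """Run a two-sample t-test on every issue column and return a summary table."""
    first = df[df[party_col] == groups[0]]
    second = df[df[party_col] == groups[1]]
    rows = []
    for issue in df.columns.drop(party_col):
        stat, p = ttest_ind(first[issue], second[issue], nan_policy='omit')
        rows.append({'issue': issue, 'statistic': float(stat), 'pvalue': float(p)})
    return pd.DataFrame(rows).set_index('issue')

# Small made-up frame (not the real votes) just to exercise the function.
demo = pd.DataFrame({
    'party': ['republican'] * 6 + ['democrat'] * 6,
    'issue-a': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    'issue-b': [1, 0, 1, 0, np.nan, 1, 1, 0, 1, 0, 1, 0],
})

summary = compare_parties(demo)
# With republicans passed first, a positive statistic means higher republican support.
summary['favoured_by'] = np.where(summary['statistic'] > 0, 'republican', 'democrat')
print(summary)
```

Called on the cleaned `df` from this notebook, the same function should reproduce the printed results above; filtering the table on `pvalue < 0.01` or `pvalue > 0.1` together with the sign of `statistic` then answers goals 2 to 4 directly.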