

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

In [0]:
# load data, add column headers, and treat '?' as NaN, all in one call

column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

house = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', 
                 header=None, 
                 names=column_headers,
                 na_values="?")
house.sample(10)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
236,democrat,n,n,y,n,n,n,y,y,y,n,n,n,n,n,y,y
304,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n,
399,republican,n,y,n,y,,y,n,n,n,y,n,y,y,y,n,n
244,democrat,y,n,y,n,n,n,n,y,y,y,n,n,n,n,y,y
109,democrat,y,,y,n,n,n,y,y,y,n,n,n,n,n,y,
45,democrat,y,y,y,n,n,n,y,y,,n,y,n,n,n,y,
105,democrat,y,y,y,n,n,n,n,y,y,n,y,n,n,n,y,y
235,republican,n,n,n,y,y,y,n,n,n,y,n,y,n,y,n,y
369,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,n,y
199,democrat,y,y,n,n,n,n,y,y,,n,y,n,n,n,y,


In [0]:
# verify shape as expected: 435 rows, 17 columns (party plus 16 issues)
house.shape

(435, 17)

In [0]:
# use replace to change yes and no votes to 1 and 0
# (columns show as floats because missing votes stay NaN)
# replace works on the whole DataFrame; map is used at the column level

house = house.replace({'y':1,'n':0})
house.sample(5)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
60,democrat,1.0,1.0,1.0,0.0,0.0,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0
111,republican,0.0,,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
47,democrat,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,
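
Goal 1 asks for a way to handle the missing values. As a minimal sketch on a hypothetical four-row table (`toy` and its values are invented for illustration, not taken from the real data), the following shows that mapping `y`/`n` to 1/0 leaves missing votes as NaN, which is why the columns above display as floats, and how the gaps can be counted per issue:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the voting table (not the real data)
toy = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat', 'republican'],
    'budget': ['y', 'n', np.nan, 'n'],
    'crime': ['n', 'y', 'n', np.nan],
})

vote_cols = ['budget', 'crime']

# Map votes to numbers one column at a time; NaN is not in the
# mapping, so missing votes stay NaN and the column becomes float
for col in vote_cols:
    toy[col] = toy[col].map({'y': 1.0, 'n': 0.0})

# Count the missing votes per issue
missing_per_issue = toy[vote_cols].isnull().sum()
print(missing_per_issue)
```

On the real data, `house.isnull().sum()` would give the same kind of per-issue count.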


In [0]:
# find out how many members are in each party
house['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [0]:
# make two smaller data frames, one for republicans one for democrats
# republicans
gop=house[house['party']=='republican']
print(gop.shape)
gop.sample(15)


(168, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
283,republican,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
378,republican,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
274,republican,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0
327,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
67,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
399,republican,0.0,1.0,0.0,1.0,,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
123,republican,1.0,,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
392,republican,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0
191,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,,0.0,1.0,0.0,1.0,1.0,1.0,0.0,
117,republican,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0


In [0]:
#now democrats
dnc=house[house['party']=='democrat']
print(dnc.shape)
dnc.sample(15)

(267, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
362,democrat,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0
47,democrat,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,
341,democrat,0.0,,1.0,,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,,,1.0,1.0
297,democrat,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,,0.0,1.0,
323,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,,1.0,0.0,0.0,1.0,1.0,0.0,
249,democrat,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,,0.0,1.0,0.0,0.0,0.0,1.0,1.0
387,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,
159,democrat,0.0,1.0,1.0,0.0,,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,,0.0,
42,democrat,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
272,democrat,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,


In [0]:
# pick a column to examine: 157 of 168 republicans cast a yes/no vote
gop['aid-to-contras'].value_counts()


0.0    133
1.0     24
Name: aid-to-contras, dtype: int64

In [0]:
# verify that length of column is as expected
# note length includes all values including NaN, which doesn't work for
# calculating vote percentages
len(gop['aid-to-contras'])

168

In [0]:
# count() excludes NaN, so no new column or explicit drop is needed
gop['aid-to-contras'].count()


157

In [0]:
# what percent of republicans voted yes on aid to contras, yes/total
# should be 24/157=0.15287
gop['aid-to-contras'].sum()/gop['aid-to-contras'].count()

0.15286624203821655

In [0]:
# mean function is a tidy way to do 4 steps above
gop['aid-to-contras'].mean()

0.15286624203821655

In [0]:
# what is the democratic voting percentage?
dnc['aid-to-contras'].mean()

0.8288973384030418
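
The cells above rely on `mean()` skipping NaN. A small sketch with a hypothetical vote column (the values are invented for illustration) makes the difference between the two possible denominators explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical votes: 1 = yes, 0 = no, NaN = no recorded vote
votes = pd.Series([1.0, 0.0, np.nan, 1.0, 0.0, 0.0])

# Share of yes votes among members who actually voted
manual = votes.sum() / votes.count()

# mean() skips NaN by default, so it matches the manual calculation
share_of_voters = votes.mean()

# Dividing by len() counts the NaN row too and understates support
share_of_all_rows = votes.sum() / len(votes)

print(manual, share_of_voters, share_of_all_rows)
```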

# Assignment Recap


2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

# To check my t-test understanding:
The p-value for democrats vs republicans on support for aid to contras should be much lower than 0.01, with democrats supporting it much more.

1.   Null hypothesis: democrats and republicans support aid to contras at "the same" level
2.   Alternate hypothesis: support for aid to contras is not "the same"




In [0]:
# with missing values, a nan policy must be stated for the t-test:
# the default ('propagate') returns nan if any value is NaN
# no nan policy stated
ttest_ind(gop['aid-to-contras'],dnc['aid-to-contras'])


Ttest_indResult(statistic=nan, pvalue=nan)

In [0]:
# nan policy in place
ttest_ind(gop['aid-to-contras'],dnc['aid-to-contras'], nan_policy='omit')

Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54)
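
As a quick check of what `nan_policy='omit'` does, the sketch below uses two short hypothetical arrays (not the real votes) and compares three calls: the default, `omit`, and dropping the NaN values by hand first. Under these assumptions the last two should agree:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes with one missing value in each group
a = np.array([1.0, 0.0, 1.0, np.nan, 1.0, 1.0])
b = np.array([0.0, 0.0, np.nan, 1.0, 0.0, 0.0])

# Default policy is 'propagate': any NaN makes the result NaN
default = ttest_ind(a, b)

# 'omit' ignores the NaN entries
omitted = ttest_ind(a, b, nan_policy='omit')

# Same test after removing NaN manually
dropped = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

print(default.pvalue, float(omitted.pvalue), dropped.pvalue)
```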

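Null hypothesis rejected, as expected: the p-value is far below 0.01, and the negative t-statistic (republicans minus democrats) shows that democrats support aid to contras more.

Objective 2 completed.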

An unscientific eyeball test suggests that freezing physician fees may be rather more supported by republicans than democrats. Will this test satisfy objective 3?

1. Null hypothesis: democrats and republicans support the physician fee freeze at "the same" level
2. Alternate hypothesis: support for the physician fee freeze is not "the same"

In [0]:
# t-test on physician-fee-freeze
ttest_ind(gop['physician-fee-freeze'],dnc['physician-fee-freeze'], nan_policy = 'omit')

Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

In [0]:
# check means to verify that republicans indeed support the fee freeze more
print(gop['physician-fee-freeze'].mean())
print(dnc['physician-fee-freeze'].mean())

0.9878787878787879
0.05405405405405406


null hypothesis rejected, objective 3 completed
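
Objectives 2 and 3 are directional ("support more than"), while `ttest_ind` as called here is two-sided. A minimal sketch of reading the direction from the sign of the statistic, using hypothetical vote arrays (`group_a` and `group_b` are invented, with NaN already dropped):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes/no votes (1 = yes, 0 = no), NaN already removed
group_a = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])
group_b = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

result = ttest_ind(group_a, group_b)

# The statistic is (mean of first group - mean of second group)
# divided by a standard error, so a positive sign means the first
# group supports the issue more
a_supports_more = result.statistic > 0

# The t distribution is symmetric, so the one-sided p-value for
# "a > b" is half the two-sided p-value when t is positive
if a_supports_more:
    one_sided_p = result.pvalue / 2
else:
    one_sided_p = 1 - result.pvalue / 2

print(result.statistic, result.pvalue, one_sided_p)
```

This is why the notebook's positive statistic for `physician-fee-freeze` (republicans passed first) and negative statistic for `aid-to-contras` point in opposite directions.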

In search of an issue where democrats and republicans show a very similar level of support, to complete objective 4 above.

1. Null hypothesis: democrats and republicans support the chosen issue at "the same" level
2. Alternate hypothesis: support for the chosen issue is not "the same"
3. A p-value > 0.1 means the null hypothesis cannot be rejected even at the 10% significance level (90% confidence), let alone at 99% confidence


In [0]:
# run ttests on other columns to find one with close support, avoiding
# columns used in lecture samples
ttest_ind(gop['south-africa'],dnc['south-africa'], nan_policy = 'omit')


Ttest_indResult(statistic=-6.849454815841208, pvalue=3.652674361672226e-11)

In [0]:
ttest_ind(gop['religious-groups'],dnc['religious-groups'], nan_policy = 'omit')

Ttest_indResult(statistic=9.737575825219457, pvalue=2.3936722520597287e-20)

In [0]:
ttest_ind(gop['duty-free'], dnc['duty-free'], nan_policy= 'omit')

Ttest_indResult(statistic=-12.853146132542978, pvalue=5.997697174347365e-32)

In [0]:
ttest_ind(gop['anti-satellite-ban'], dnc['anti-satellite-ban'], nan_policy= 'omit')

Ttest_indResult(statistic=-12.526187929077842, pvalue=8.521033017443867e-31)

In [0]:
ttest_ind(gop['budget'], dnc['budget'], nan_policy= 'omit')

Ttest_indResult(statistic=-23.21277691701378, pvalue=2.0703402795404463e-77)

In [0]:
# p-value is the closest yet, but it is still below 0.1
ttest_ind(gop['immigration'], dnc['immigration'], nan_policy= 'omit')

Ttest_indResult(statistic=1.7359117329695164, pvalue=0.08330248490425066)

In [0]:
print(gop['immigration'].mean())
print(dnc['immigration'].mean())

0.5575757575757576
0.4714828897338403


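With p ≈ 0.083, the null hypothesis that support for immigration is the same between republicans and democrats is rejected only at the 10% significance level; it is not rejected at the 5% or 1% level. Because p is below 0.1, immigration does not quite meet objective 4.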

Now to rewatch the lecture to understand why such different percentages still give a borderline p-value. Perhaps it is due to the relative sample sizes?




In [60]:
#  Lecture - df.describe() will give means to look at
# gives an idea of values to compare
gop.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,165.0,148.0,164.0,165.0,165.0,166.0,162.0,157.0,165.0,165.0,159.0,155.0,158.0,161.0,156.0,146.0
mean,0.187879,0.506757,0.134146,0.987879,0.951515,0.89759,0.240741,0.152866,0.115152,0.557576,0.132075,0.870968,0.860759,0.981366,0.089744,0.657534
std,0.391804,0.501652,0.341853,0.10976,0.215442,0.304104,0.428859,0.36101,0.320176,0.498186,0.339643,0.336322,0.347298,0.135649,0.286735,0.476168
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [61]:
dnc.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,258.0,239.0,260.0,259.0,255.0,258.0,259.0,263.0,248.0,263.0,255.0,249.0,252.0,257.0,251.0,185.0
mean,0.604651,0.502092,0.888462,0.054054,0.215686,0.476744,0.772201,0.828897,0.758065,0.471483,0.505882,0.144578,0.289683,0.350195,0.63745,0.935135
std,0.489876,0.501045,0.315405,0.226562,0.412106,0.50043,0.420224,0.377317,0.429121,0.500138,0.500949,0.352383,0.454518,0.477962,0.481697,0.246956
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The means in the two describe tables above indicate that immigration and water-project are the two issues on which republicans and democrats are most likely to vote similarly. However, water-project was part of the lecture, and immigration has p ≈ 0.083, which falls just short of the p > 0.1 threshold in objective 4. The highest confidence level at which the immigration difference still counts as significant can be calculated as 1 - p.

In [68]:
x= 1-0.08330248490425066
y=x*100
print('The difference in immigration support between democrats and republicans')
print('in 1984 is significant only at confidence levels up to',y,'percent.')

The difference in immigration support between democrats and republicans
in 1984 is significant only at confidence levels up to 91.66975150957494 percent.
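
On the earlier question of why a visible gap in support still gives a borderline p-value, the following sketch recomputes the immigration test from the summary statistics printed by `describe()` (counts, means, and standard deviations copied from the tables above, so the result is only as precise as those rounded values). It assumes the pooled-variance form of the test, which is SciPy's default:

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

# 'immigration' summary statistics copied from the describe() tables
gop_mean, gop_std, gop_n = 0.557576, 0.498186, 165
dnc_mean, dnc_std, dnc_n = 0.471483, 0.500138, 263

# Difference in the share of yes votes
diff = gop_mean - dnc_mean

# Pooled variance and the standard error of the difference
pooled_var = ((gop_n - 1) * gop_std**2 + (dnc_n - 1) * dnc_std**2) / (gop_n + dnc_n - 2)
std_err = np.sqrt(pooled_var * (1 / gop_n + 1 / dnc_n))

# The t-statistic is the difference measured in standard errors
t_manual = diff / std_err

# Same test computed by SciPy from the summary statistics
result = ttest_ind_from_stats(gop_mean, gop_std, gop_n,
                              dnc_mean, dnc_std, dnc_n)

print(diff, std_err, t_manual, result.pvalue)
```

For yes/no data with support near 50%, the standard deviation is close to its maximum of 0.5, so the standard error here is about 0.05. A gap of roughly 0.086 is then only about 1.7 standard errors, which is why the p-value lands near 0.08 rather than far below 0.01.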


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on Performing a T-test without using Scipy in order to get "under the hood" and learn more thoroughly about this topic.
### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)

 ### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)

 ### Then check your Answers using Scipy!
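
A possible starting point for the 2-sample stretch goal, offered as a sketch rather than a reference implementation: the function name `two_sample_t` and the sample arrays are hypothetical, and the pooled-variance (equal variance) formula is assumed because it matches SciPy's default. SciPy is used only for the t distribution and for the final check.

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y):
    """Pooled-variance 2-sample t-test; NaN values are omitted."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x[~np.isnan(x)]
    y = y[~np.isnan(y)]
    n1, n2 = len(x), len(y)
    dof = n1 + n2 - 2
    # Pooled sample variance (ddof=1 gives the unbiased estimate)
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / dof
    t_stat = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Two-sided p-value from the t distribution's survival function
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value

# Hypothetical votes with missing values (not the real data)
sample_a = [1, 0, 1, 1, np.nan, 1, 0, 1]
sample_b = [0, 0, 1, np.nan, 0, 0, 1, 0]

t_stat, p_value = two_sample_t(sample_a, sample_b)
check = stats.ttest_ind(sample_a, sample_b, nan_policy='omit')
print(t_stat, p_value)
print(float(check.statistic), float(check.pvalue))
```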

In [73]:
a=gop
b=dnc
col_list=['handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

# run the 2-sample t-test on every issue, omitting NaN votes
for col in col_list:
  tester = ttest_ind(a[col], b[col], nan_policy='omit')
  print(col, ' ', tester)
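
One way to approach stretch goal 1 is to wrap the per-issue test in a function that returns a table sorted by p-value. The sketch below is an assumption about how that might look: `compare_groups`, `toy_a`, and `toy_b` are hypothetical names, and the toy votes are invented so the example runs without downloading the data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_groups(df_a, df_b, columns):
    """Run a 2-sample t-test per column and collect the results."""
    rows = []
    for col in columns:
        result = ttest_ind(df_a[col], df_b[col], nan_policy='omit')
        rows.append({
            'issue': col,
            'mean_a': df_a[col].mean(),
            'mean_b': df_b[col].mean(),
            't': float(result.statistic),
            'p': float(result.pvalue),
        })
    # Smallest p-values (strongest evidence of a difference) first
    return pd.DataFrame(rows).set_index('issue').sort_values('p')

# Hypothetical votes for two groups (1 = yes, 0 = no, NaN = missing)
toy_a = pd.DataFrame({
    'issue-x': [1, 1, 1, 0, 1, np.nan, 1, 1],
    'issue-y': [1, 0, 1, 0, np.nan, 1, 0, 1],
})
toy_b = pd.DataFrame({
    'issue-x': [0, 0, 1, 0, 0, 0, np.nan, 0],
    'issue-y': [0, 1, 0, 1, 1, 0, 1, np.nan],
})

summary = compare_groups(toy_a, toy_b, ['issue-x', 'issue-y'])
print(summary)
```

With the notebook's own frames, the call would be `compare_groups(gop, dnc, col_list)`; filtering the result with `summary[summary['p'] > 0.1]` would list candidate issues for objective 4.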