## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), each with a class (democrat or republican) and 16 binary attributes (yes or no for voting for or against certain issues). Be aware: there are missing values, coded as `?` in the raw file!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)
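
For goal 1, the two cleaning options can be compared on a small example before touching the real data. The sketch below uses a hypothetical toy frame (the `party`, `issue_1`, and `issue_2` names and values are made up for illustration, not taken from the dataset) to show how dropping every incomplete row discards far more data than dropping missing votes one issue at a time.

```python
import pandas as pd

# Hypothetical toy data: 'y' = yes, 'n' = no, '?' = missing vote
votes = pd.DataFrame({
    'party': ['democrat', 'democrat', 'republican', 'republican'],
    'issue_1': ['y', '?', 'n', 'y'],
    'issue_2': ['n', 'n', '?', '?'],
})

issue_cols = ['issue_1', 'issue_2']

# Map yes/no to 1/0; anything not in the mapping (here '?') becomes NaN
encoded = votes.copy()
encoded[issue_cols] = votes[issue_cols].apply(
    lambda col: col.map({'y': 1.0, 'n': 0.0})
)

# Count missing votes per issue
missing_per_issue = encoded[issue_cols].isna().sum()
print(missing_per_issue)

# Option A: keep only rows with no missing votes at all
complete_rows = encoded.dropna()

# Option B: drop missing votes for one issue at a time, just before testing it
issue_1_votes = encoded['issue_1'].dropna()

print(len(complete_rows), len(issue_1_votes))
```

Option B keeps every recorded vote for the issue under test, which is the same effect `nan_policy='omit'` has inside `scipy.stats.ttest_ind`.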

Note that this analysis calls for *two-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group against a fixed value under a null hypothesis.
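
As a minimal sketch of what a two-sample test looks like in practice, the example below runs `scipy.stats.ttest_ind` on two hypothetical groups of binary votes (the arrays are invented for illustration). It also checks that `nan_policy='omit'` gives the same statistic as removing the missing values by hand.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); not from the dataset
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# nan_policy='omit' ignores missing votes within each group
t_statistic, p_value = ttest_ind(group_a, group_b, nan_policy='omit')

# Same test after dropping the NaNs manually
t_manual, p_manual = ttest_ind(group_a[~np.isnan(group_a)],
                               group_b[~np.isnan(group_b)])

# A positive t-statistic means the first group's mean is the larger one
print(f'mean a: {np.nanmean(group_a):.3f}, mean b: {np.nanmean(group_b):.3f}')
print(f't = {float(t_statistic):.3f}, p = {float(p_value):.4f}')
```

Because the votes are coded 0/1, each group mean is the share of "yes" votes, so the sign of the t-statistic tells you which group supported the issue more.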

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
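
One possible shape for stretch goal 1 is sketched below. The helper name `compare_groups`, its parameters, and the toy data are all hypothetical; it is one way to wrap the test so it can be rerun on any grouping column and any numeric column, not the required solution.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_groups(df, group_col, value_col, group_a, group_b, alpha=0.01):
    """Two-sample t-test of value_col between two levels of group_col.

    Missing values are dropped per group before testing.
    """
    a = df.loc[df[group_col] == group_a, value_col].dropna().to_numpy(dtype=float)
    b = df.loc[df[group_col] == group_b, value_col].dropna().to_numpy(dtype=float)
    t_statistic, p_value = ttest_ind(a, b)
    return {
        't_statistic': float(t_statistic),
        'p_value': float(p_value),
        'n_a': len(a),
        'n_b': len(b),
        'significant': bool(p_value < alpha),
    }


# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue': [1, 1, 1, 1, np.nan, 1, 0, 0, 0, 1, 0, 0],
})

result = compare_groups(toy, 'party', 'issue', 'democrat', 'republican')
print(result)
```

Returning a dictionary keeps the sample sizes next to the test result, which makes it easier to spot issues where many votes were missing.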

In [0]:
### YOUR CODE STARTS HERE
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, ttest_ind

In [0]:
categories = ['class-name', 'handicapped-infants', 'water-project-cost-sharing',
              'adoption-of-the-budget-resolution', 'physician-fee-freeze',
              'el-salvador-aid', 'religious-groups-in-schools',
              'anti-satellite-test-ban', 'aid-to-nicaraguan-contras',
              'mx-missile', 'immigration', 'synfuels-corporation-cutback',
              'education-spending', 'superfund-right-to-sue', 'crime',
              'duty-free-exports', 'export-administration-act-south-africa']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', names=categories)

In [13]:
# Encode votes numerically: yes -> 1, no -> 0, missing ('?') -> NaN
df.replace(to_replace={'y': 1, 'n': 0, '?': np.nan}, inplace=True)
df.head()

Unnamed: 0,class-name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [15]:
dem = df[df['class-name'] == 'democrat']
rep = df[df['class-name'] == 'republican']

print('dem shape: ', dem.shape)
print('rep shape: ', rep.shape)

dem shape:  (267, 17)
rep shape:  (168, 17)


In [41]:
dem_support = []
rep_support = []
little_difference = []
other = []

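# Run a two-sample t-test per issue; nan_policy='omit' skips missing votes
for category in categories[1:]:
  # t > 0 means democrats voted yes at a higher rate than republicans
  t_statistic, p_value = ttest_ind(dem[category], rep[category], nan_policy='omit')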
  
  if p_value < 0.01:
    if t_statistic > 0:
      dem_support.append(category)
    elif t_statistic < 0:
      rep_support.append(category)
  elif p_value > 0.1:
    little_difference.append(category)
  else:
    other.append(category)
  
  print(f'{category}:\nt-statistic: {t_statistic}\np-value:     {p_value}\n\n')

handicapped-infants:
t-statistic: 9.205264294809222
p-value:     1.613440327937243e-18


water-project-cost-sharing:
t-statistic: -0.08896538137868286
p-value:     0.9291556823993485


adoption-of-the-budget-resolution:
t-statistic: 23.21277691701378
p-value:     2.0703402795404463e-77


physician-fee-freeze:
t-statistic: -49.36708157301406
p-value:     1.994262314074344e-177


el-salvador-aid:
t-statistic: -21.13669261173219
p-value:     5.600520111729011e-68


religious-groups-in-schools:
t-statistic: -9.737575825219457
p-value:     2.3936722520597287e-20


anti-satellite-test-ban:
t-statistic: 12.526187929077842
p-value:     8.521033017443867e-31


aid-to-nicaraguan-contras:
t-statistic: 18.052093200819733
p-value:     2.82471841372357e-54


mx-missile:
t-statistic: 16.437503268542994
p-value:     5.03079265310811e-47


immigration:
t-statistic: -1.7359117329695164
p-value:     0.08330248490425066


synfuels-corporation-cutback:
t-statistic: 8.293603989407588
p-value:     1.57593223

In [50]:
print('Null Hypothesis: the two parties supported the bill at the same rate.')
print('Alternative Hypothesis: the two parties supported the bill at different rates.')
print('Confidence Level: 99%\n')

print('''
For the following votes, where the Democrats favored the bill more than the
Republicans, we reject the null hypothesis that the two parties supported
the bill at the same rate.
''')

for bill in dem_support:
  print(bill)
print('\n********')

print('''
For the following votes, where the Republicans favored the bill more than
the Democrats, we reject the null hypothesis that the two parties supported
the bill at the same rate.
''')

for bill in rep_support:
  print(bill)
print('\n********')

print('''
For the following vote we fail to reject the null hypothesis that the
two parties supported the bill at the same rate, with a p-value
of more than 10%.
''')

for bill in little_difference:
  print(bill)
print('\n********')

print('''
The following bill did not meet any of the above criteria. The p-value was
less than 10% but more than 1%.
''')

for bill in other:
  print(bill)



Null Hypothesis: the two parties supported the bill at the same rate.
Alternative Hypothesis: the two parties supported the bill at different rates.
Confidence Level: 99%


For the following votes, where the Democrats favored the bill more than the
Republicans, we reject the null hypothesis that the two parties supported
the bill at the same rate.

handicapped-infants
adoption-of-the-budget-resolution
anti-satellite-test-ban
aid-to-nicaraguan-contras
mx-missile
synfuels-corporation-cutback
duty-free-exports
export-administration-act-south-africa

********

For the following votes, where the Republicans favored the bill more than
the Democrats, we reject the null hypothesis that the two parties supported
the bill at the same rate.

physician-fee-freeze
el-salvador-aid
religious-groups-in-schools
education-spending
superfund-right-to-sue
crime

********

For the following vote we fail to reject the null hypothesis that the
two parties supported the bill at the same rate, with a p-value
of more than 1