
<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware that some votes are missing!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this analysis calls for *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
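As a rough illustration of a two-sample t-test with missing values, here is a minimal sketch on made-up votes. The arrays `group_a` and `group_b` are hypothetical and are not taken from the congressional data; `nan_policy='omit'` is one way to drop missing observations at test time instead of cleaning them beforehand.

```python
import numpy as np
from scipy import stats

# Made-up votes for illustration only: 1 = yes, 0 = no, NaN = missing
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 1])

# nan_policy='omit' ignores the missing votes rather than returning NaN
result = stats.ttest_ind(group_a, group_b, nan_policy='omit')
t_stat, p_value = float(result.statistic), float(result.pvalue)

# A positive statistic means the first group's mean is higher than the second's
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Because the votes are coded 0/1, each group's mean is simply the share of "yes" votes among the members who voted.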

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
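For the first stretch goal, one possible shape for a reusable function is sketched below. The name `compare_parties`, its parameters, and the `toy` frame are assumptions made for illustration; the function only presumes a `party` column holding `'republican'` / `'democrat'` and issue columns coded 0/1 with NaN for missing votes.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_parties(df, issue, party_col='party'):
    """Run a two-sample t-test of republican vs. democrat votes on one issue."""
    rep_votes = df.loc[df[party_col] == 'republican', issue]
    dem_votes = df.loc[df[party_col] == 'democrat', issue]
    # Missing votes are dropped at test time instead of being imputed
    result = stats.ttest_ind(rep_votes, dem_votes, nan_policy='omit')
    return {
        'issue': issue,
        'rep_mean': rep_votes.mean(),   # Series.mean() skips NaN by default
        'dem_mean': dem_votes.mean(),
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
    }

# Tiny made-up frame, only to show the call
toy = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'crime': [1, 1, 1, 0, np.nan, 0, 0, 1, 0, 0],
})
summary = compare_parties(toy, 'crime')
print(summary)
```

Returning a dictionary makes it easy to collect one row per issue, for example with `pd.DataFrame([compare_parties(df, col) for col in names[1:]])`.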

In [0]:
import numpy as np
import pandas as pd
from scipy import stats

# column headers
names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

# load the data frame from the CSV file, supplying the column names
df=pd.read_csv('/content/house-votes-84.data',names=names)


In [5]:
# encode votes as numbers (n -> 0, y -> 1) and mark '?' as missing,
# then check that the changes persisted
df=df.replace({'n':0,'y':1,'?':np.nan})
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [8]:
rep=df[df['party']=='republican']
dem=df[df['party']=='democrat']

# using ugly output to print heads to see if it looks right
print(rep.head())
print(dem.head())


         party  handicapped-infants  ...  duty-free  south-africa
0   republican                  0.0  ...        0.0           1.0
1   republican                  0.0  ...        0.0           NaN
7   republican                  0.0  ...        NaN           1.0
8   republican                  0.0  ...        0.0           1.0
10  republican                  0.0  ...        0.0           0.0

[5 rows x 17 columns]
      party  handicapped-infants  water-project  ...  crime  duty-free  south-africa
2  democrat                  NaN            1.0  ...    1.0        0.0           0.0
3  democrat                  0.0            1.0  ...    0.0        0.0           1.0
4  democrat                  1.0            1.0  ...    1.0        1.0           1.0
5  democrat                  0.0            1.0  ...    1.0        1.0           1.0
6  democrat                  0.0            1.0  ...    1.0        1.0           1.0

[5 rows x 17 columns]


In [19]:
# find the mean value of each column for both parties
# votes are coded 0/1, so the mean is the share of "yes" votes on each issue
# the party column is excluded because the mean of a string doesn't make sense
print('\nREP:')
for col in names[1:]:
  print(f"issue: {col} Mean votes: {rep[col].mean()}")
print("\nDEM:")
for col in names[1:]:
  print(f"issue: {col} Mean votes: {dem[col].mean()}")



REP:
issue: handicapped-infants Mean votes: 0.18787878787878787
issue: water-project Mean votes: 0.5067567567567568
issue: budget Mean votes: 0.13414634146341464
issue: physician-fee-freeze Mean votes: 0.9878787878787879
issue: el-salvador-aid Mean votes: 0.9515151515151515
issue: religious-groups Mean votes: 0.8975903614457831
issue: anti-satellite-ban Mean votes: 0.24074074074074073
issue: aid-to-contras Mean votes: 0.15286624203821655
issue: mx-missile Mean votes: 0.11515151515151516
issue: immigration Mean votes: 0.5575757575757576
issue: synfuels Mean votes: 0.1320754716981132
issue: education Mean votes: 0.8709677419354839
issue: right-to-sue Mean votes: 0.8607594936708861
issue: crime Mean votes: 0.9813664596273292
issue: duty-free Mean votes: 0.08974358974358974
issue: south-africa Mean votes: 0.6575342465753424

DEM:
issue: handicapped-infants Mean votes: 0.6046511627906976
issue: water-project Mean votes: 0.502092050209205
issue: budget Mean votes: 0.8884615384615384
issue: 

# Finding a Republican issue

$H_0$: Republicans and Democrats support the crime bill at the same rate (the mean votes are equal).

In [20]:
# looking at the means, I noticed that 98% of the republicans voted to support the crime bill,
# whereas 35% of the democrats did, so this seems like a good issue to choose
stats.ttest_ind(rep['crime'],dem['crime'],nan_policy='omit')

Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47)
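To read this result against the assignment's threshold, here is a small sketch. The helper `describe` is hypothetical; the two numbers are copied from the output above, and `alpha = 0.01` comes from the goals. Since `rep` was passed first, a positive statistic means the republican mean is higher.

```python
# Values copied from the Ttest_indResult printed above
statistic = 16.342085656197696
pvalue = 9.952342705606092e-47
alpha = 0.01  # significance threshold from the assignment goals

def describe(statistic, pvalue, alpha, first='republicans', second='democrats'):
    """Turn a two-sample t-test result into a one-line verdict (hypothetical helper)."""
    if pvalue >= alpha:
        return f"no significant difference at alpha={alpha}"
    # The sign of the statistic shows which sample has the larger mean
    higher, lower = (first, second) if statistic > 0 else (second, first)
    return f"{higher} support the issue more than {lower} (p < {alpha})"

verdict = describe(statistic, pvalue, alpha)
print(verdict)  # republicans support the issue more than democrats (p < 0.01)
```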