<a href="https://colab.research.google.com/github/tallywiesenberg/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/Wiesenberg_LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues show "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one per congressperson), a class label (democrat or republican), and 16 binary attributes (a yes or no vote on each issue). Be aware that there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this assignment calls for *two-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed value under a null hypothesis.
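To make the two-sample setup concrete, here is a minimal sketch on made-up 0/1 votes. The arrays `group_a` and `group_b` are hypothetical and are not taken from the dataset. The sketch computes the pooled-variance t statistic by hand and compares it with `scipy.stats.ttest_ind`, which assumes equal variances by default.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical 0/1 votes for two groups (illustrative only, not real data)
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])

n_a, n_b = len(group_a), len(group_b)

# Pooled sample variance (ddof=1 gives the unbiased sample variance)
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# t statistic: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# scipy's two-sample test should give the same statistic
t_scipy, p_value = ttest_ind(group_a, group_b)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_value:.4f}")
```

The sign of the statistic follows the argument order: it is positive when the first group has the higher mean.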

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE
import pandas as pd
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

### Load Data

In [123]:
# Column names: the class label followed by the 16 votes
column_names = ['party',
                'handicapped-infants',
                'water-project-cost-sharing',  # stray leading space removed from this name
                'adoption-of-the-budget-resolution',
                'physician-fee-freeze',
                'el-salvador-aid',
                'religious-groups-in-schools',
                'anti-satellite-test-ban',
                'aid-to-nicaraguan-contras',
                'mx-missile',
                'immigration',
                'synfuels-corporation-cutback',
                'education-spending',
                'superfund-right-to-sue',
                'crime',
                'duty-free-exports',
                'export-administration-act-south-africa']

# The raw file has no header row and marks missing votes with '?'
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 na_values='?', header=None, names=column_names)
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [124]:
df.isna().sum()

party                                       0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
immigration                                 7
synfuels-corporation-cutback               21
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64
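One way to handle these missing votes without imputing anything is to drop them per issue at test time. The sketch below is a hedged illustration on a hypothetical miniature table (`toy` and the column `some-bill` are made up), and it assumes the votes have already been mapped to 0/1 with missing votes left as NaN. It shows that `ttest_ind` with `nan_policy='omit'` matches an explicit `dropna()`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical miniature of the voting table; NaN marks a missing vote
toy = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'some-bill': [1, 1, np.nan, 1, 0, 0, 0, 1, np.nan, 0],
})

rep = toy.loc[toy['party'] == 'republican', 'some-bill']
dem = toy.loc[toy['party'] == 'democrat', 'some-bill']

# nan_policy='omit' ignores NaN entries instead of propagating them
t_omit, p_omit = ttest_ind(rep, dem, nan_policy='omit')

# Equivalent: drop the missing votes explicitly before testing
t_drop, p_drop = ttest_ind(rep.dropna(), dem.dropna())

print(f"omit: t = {float(t_omit):.4f}, p = {float(p_omit):.4f}")
print(f"drop: t = {float(t_drop):.4f}, p = {float(p_drop):.4f}")
```

Dropping per issue keeps every recorded vote and avoids pushing the missing entries toward the majority position, at the cost of a slightly different sample size for each issue.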

In [125]:
# Map y/n votes to 1/0
vote_map = {'n': 0, 'y': 1}
for col in df.columns[df.columns != 'party']:
  # Impute missing votes with the column's mode; mode() returns a Series, so take [0]
  df[col] = df[col].fillna(df[col].mode()[0])
  df[col] = df[col].map(vote_map)
# Confirm that no missing values remain
assert df.isna().sum().sum() == 0
df.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
2,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1


### Hypothesis Testing: More Support From Republicans

In [128]:
# Issue: superfund-right-to-sue

# Split the votes by party
republicans = df['superfund-right-to-sue'][df['party'] == 'republican']
democrats = df['superfund-right-to-sue'][df['party'] == 'democrat']
democrats

2      1
3      1
4      1
5      1
6      1
9      0
12     1
13     0
16     1
17     1
19     0
20     0
21     1
22     0
23     0
24     0
25     0
26     0
27     0
29     0
31     0
32     1
34     0
39     1
40     0
41     0
42     0
43     0
44     0
45     0
      ..
385    1
386    0
387    0
388    1
389    0
390    0
391    0
394    1
395    0
396    1
397    1
398    0
406    1
407    1
408    1
411    0
414    0
415    0
417    1
418    0
419    0
421    1
422    0
423    0
424    1
425    0
426    0
428    1
429    0
431    0
Name: superfund-right-to-sue, Length: 267, dtype: int64

In [134]:
# Mean and standard deviation of the 0/1 votes: republicans first, then democrats
for p in [republicans, democrats]:
  print(f'Mean: {p.mean()}')
  print(f'St Dev: {p.std()}')
  print('----------')
# Two-sample t-test; a positive t statistic means the first group (republicans) has the higher mean
tstat, pvalue = ttest_ind(republicans, democrats)
print(f'T Score: {tstat}')
print(f'P Value: {pvalue}')

Mean: 0.8690476190476191
St Dev: 0.3383567866677383
----------
Mean: 0.3295880149812734
St Dev: 0.47094631449244206
----------
T Score: 12.897349288686685
P Value: 1.960355885693202e-32
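The two groups here differ in both size (168 republicans, 267 democrats) and spread (standard deviations of about 0.34 and 0.47), while `ttest_ind` pools the variances by default. As a hedged sketch, the example below compares the default test with Welch's t-test (`equal_var=False`) on simulated votes. The arrays `rep_votes` and `dem_votes` are randomly generated stand-ins, not the real data, so the printed numbers will not match the output above.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated 0/1 votes (assumption: NOT the real data), sized like the two parties
rng = np.random.default_rng(0)
rep_votes = rng.binomial(1, 0.87, size=168)
dem_votes = rng.binomial(1, 0.33, size=267)

# Student's t-test: pools the two variances (scipy's default)
t_student, p_student = ttest_ind(rep_votes, dem_votes)

# Welch's t-test: estimates each group's variance separately
t_welch, p_welch = ttest_ind(rep_votes, dem_votes, equal_var=False)

print(f"Student: t = {t_student:.2f}, p = {p_student:.2e}")
print(f"Welch:   t = {t_welch:.2f}, p = {p_welch:.2e}")
```

With a gap this large, both variants should agree on significance. The choice matters more for issues where the p-value sits near the threshold.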


### Writing a Function That Calculates the t Statistic and p-Value for a Given Column

In [0]:
def t_test(col):
  """Print each party's mean and standard deviation for `col`, then the two-sample t-test result."""
  # Print the bill name
  print(f'Bill: {col}')
  print('--------')
  # Split the votes by party (reads the global df)
  republicans = df[col][df['party'] == 'republican']
  democrats = df[col][df['party'] == 'democrat']
  # Mean and standard deviation: republicans first, then democrats
  for p in [republicans, democrats]:
    print(f'Mean: {p.mean()}')
    print(f'St Dev: {p.std()}')
    print('--------')
  # Two-sample t-test; a positive t statistic means republicans have the higher mean
  tstat, pvalue = ttest_ind(republicans, democrats)
  print(f'T Score: {tstat}')
  print(f'P Value: {pvalue}')
  # Report which party favors the bill at the 0.01 significance level
  if pvalue < 0.01:
    if republicans.mean() > democrats.mean():
      print('Republicans Favor Bill')
    else:
      print('Democrats Favor Bill')
  else:
    print('No Significant Difference in Voting Across Party Lines')
  print('--------\n\n\n--------')
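For the stretch goal of rerunning the test with arbitrary variables, one possible refactor returns the results instead of printing them, so they can be collected into a summary table. This is a sketch under assumptions: the function name `party_t_test`, its parameters, and the `toy` frame are hypothetical and not part of the assignment.

```python
import pandas as pd
from scipy.stats import ttest_ind

def party_t_test(data, col, group_col='party', groups=('republican', 'democrat')):
    """Return both group means, the t statistic, and the p-value for one 0/1 column."""
    a = data.loc[data[group_col] == groups[0], col].dropna()
    b = data.loc[data[group_col] == groups[1], col].dropna()
    tstat, pvalue = ttest_ind(a, b)
    return {
        'bill': col,
        f'{groups[0]}_mean': a.mean(),
        f'{groups[1]}_mean': b.mean(),
        'tstat': tstat,
        'pvalue': pvalue,
    }

# Hypothetical toy frame standing in for the cleaned voting data
toy = pd.DataFrame({
    'party': ['republican'] * 4 + ['democrat'] * 4,
    'bill-a': [1, 1, 1, 0, 0, 0, 1, 0],
    'bill-b': [1, 0, 1, 0, 0, 1, 0, 1],
})

# One row per bill, sorted so the strongest differences come first
summary = pd.DataFrame([party_t_test(toy, c) for c in toy.columns.drop('party')])
print(summary.sort_values('pvalue'))
```

Passing the frame and column names as arguments avoids the dependence on a global `df`, and the returned table can be filtered by p-value to pick out issues for each goal.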

In [148]:
t_test('el-salvador-aid')

Bill: el-salvador-aid
--------
Mean: 0.9523809523809523
St Dev: 0.21359550471631136
--------
Mean: 0.250936329588015
St Dev: 0.4343661266962051
--------
T Score: 19.49464560374767
P Value: 3.2189704657195496e-61
Republicans Favor Bill
--------


--------


### Run Function on Every Column

In [151]:
for col in df.columns[df.columns != 'party']:
  t_test(col)

Bill: handicapped-infants
--------
Mean: 0.18452380952380953
St Dev: 0.3890704560731812
--------
Mean: 0.5842696629213483
St Dev: 0.4937730011175113
--------
T Score: -8.897130738692912
P Value: 1.5743382054891396e-17
Democrats Favor Bill
--------


--------
Bill: water-project-cost-sharing
--------
Mean: 0.5654761904761905
St Dev: 0.49717622934179645
--------
Mean: 0.5543071161048689
St Dev: 0.49797540030615184
--------
T Score: 0.22789967012174497
P Value: 0.8198318156454878
No Significant Difference in Voting Across Party Lines
--------


--------
Bill: adoption-of-the-budget-resolution
--------
Mean: 0.15476190476190477
St Dev: 0.3627588109849967
--------
Mean: 0.8913857677902621
St Dev: 0.31173892143110515
--------
T Score: -22.5072082955668
P Value: 7.529526421191217e-75
Democrats Favor Bill
--------


--------
Bill: physician-fee-freeze
--------
Mean: 0.9702380952380952
St Dev: 0.17043780322336816
--------
Mean: 0.052434456928838954
St Dev: 0.22332010811379255
--------
T Score: