
## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
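
As a minimal sketch of what a two-sample test looks like in code, the example below runs `scipy.stats.ttest_ind` on two small made-up vote arrays. The values are hypothetical and are not taken from the dataset. Passing `nan_policy='omit'` drops missing votes before the test is computed.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); not real data
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# Two-sample t-test: do the two group means differ?
result = ttest_ind(group_a, group_b, nan_policy='omit')
t_stat = float(result.statistic)
p_value = float(result.pvalue)

# A positive statistic means group_a has the higher mean
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The sign of the statistic depends on argument order, so swapping the two groups flips the sign but leaves the p-value unchanged.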

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
### YOUR CODE STARTS HERE

# Import pandas and t-tests
import pandas as pd
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel

In [3]:
# List of column names
col_names = ['party', 'handicapped-infants', 'water-project', 'budget',
             'physician-fee-freeze', 'el-salvador-aid', 'religious-groups',
             'anti-satellite-ban', 'aid-to-contras', 'mx-missile', 'immigration',
             'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free', 'south-africa']

# Load data into df and change '?' to NaN
votes = pd.read_csv('/content/house-votes-84.data', names=col_names, na_values='?')

# Change y/n to 1/0
votes.replace({'y': 1, 'n': 0}, inplace=True)

votes.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
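
The empty cells in the preview are missing votes. Before choosing how to handle them, it may help to count them per issue and per party. The sketch below does this on a hypothetical miniature frame (the name `toy` and its values are made up for illustration); the same expression should work on `votes`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the votes frame (already mapped to 1/0/NaN)
toy = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat'],
    'budget': [0.0, 0.0, 1.0, np.nan],
    'crime': [1.0, np.nan, np.nan, 0.0],
})

# Count missing votes per issue, broken down by party
missing = toy.drop(columns='party').isna().groupby(toy['party']).sum()
print(missing)
```

If the gaps are spread across many rows, dropping every row with any missing value would discard a lot of data, which is one reason to omit missing values column by column at test time instead.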


In [5]:
# Split df into dem and rep
rep = votes[votes['party'] == 'republican']
dem = votes[votes['party'] == 'democrat']

rep.head(), dem.head()

(         party  handicapped-infants  ...  duty-free  south-africa
 0   republican                  0.0  ...        0.0           1.0
 1   republican                  0.0  ...        0.0           NaN
 7   republican                  0.0  ...        NaN           1.0
 8   republican                  0.0  ...        0.0           1.0
 10  republican                  0.0  ...        0.0           0.0
 
 [5 rows x 17 columns],
       party  handicapped-infants  water-project  ...  crime  duty-free  south-africa
 2  democrat                  NaN            1.0  ...    1.0        0.0           0.0
 3  democrat                  0.0            1.0  ...    0.0        0.0           1.0
 4  democrat                  1.0            1.0  ...    1.0        1.0           1.0
 5  democrat                  0.0            1.0  ...    1.0        1.0           1.0
 6  democrat                  0.0            1.0  ...    1.0        1.0           1.0
 
 [5 rows x 17 columns])

Null Hypothesis: Democrats are no more or less likely than chance (50%) to support this bill

Alternative Hypothesis: Democrats lean one way on this bill more than chance would predict

Confidence Level: 95% (we reject the null hypothesis if p < .05)


In [6]:
# Rerun t-test from lecture
ttest_1samp(dem['handicapped-infants'], .5, nan_policy='omit')

Ttest_1sampResult(statistic=3.431373087696574, pvalue=0.000699612317167372)

t-statistic: 3.43

p-value = .0007

Because the p-value is below .05, we reject the null hypothesis: Democratic support for this bill differs from 50%
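
The t-statistic is the gap between the sample mean and the hypothesized mean, measured in standard errors. As a rough check of that definition, the sketch below computes it by hand on hypothetical toy votes (not taken from the dataset) and compares the result with `ttest_1samp`.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical toy votes; not taken from the dataset
votes_toy = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0], dtype=float)
popmean = 0.5  # null hypothesis: support is a coin flip

n = votes_toy.size
sample_mean = votes_toy.mean()
sample_std = votes_toy.std(ddof=1)  # sample standard deviation

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - popmean) / (sample_std / np.sqrt(n))

t_scipy = float(ttest_1samp(votes_toy, popmean).statistic)
print(t_manual, t_scipy)
```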

### Another Test

Null Hypothesis: Democrats and Republicans are equally likely to support

Alternative Hypothesis: Dems and Reps are not equally likely to support

Confidence Level: 99%

In [7]:
# 2 sample ttest on crime bill
ttest_ind(dem['crime'], rep['crime'], nan_policy='omit')

Ttest_indResult(statistic=-16.342085656197696, pvalue=9.952342705606092e-47)

t-statistic: -16.34

p-value: 9.95e-47

With a p-value far below .01, we reject the null hypothesis at the 99% confidence level
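
Very small p-values are easier to read and compare in scientific notation. A minimal sketch, using the p-value from the test output above (the names `alpha` and `formatted` are illustrative):

```python
# p-value copied from the ttest_ind output above
p_value = 9.952342705606092e-47

# Scientific notation with two decimals in the mantissa
formatted = f"{p_value:.2e}"
print(formatted)

# Compare against the significance threshold for 99% confidence
alpha = 0.01
print(p_value < alpha)
```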


In [25]:
# See how they actually voted
print(dem['crime'].value_counts())
print(rep['crime'].value_counts())

0.0    167
1.0     90
Name: crime, dtype: int64
1.0    158
0.0      3
Name: crime, dtype: int64


Republicans are more likely than Democrats to support the crime bill, with p < .01
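
The t-test establishes that the means differ; the raw counts show by how much. The sketch below turns the `value_counts()` output above into support rates (variable names are illustrative).

```python
# Vote counts copied from the value_counts() output above
dem_yes, dem_no = 90, 167
rep_yes, rep_no = 158, 3

# Share of non-missing votes that were "yes"
dem_rate = dem_yes / (dem_yes + dem_no)
rep_rate = rep_yes / (rep_yes + rep_no)

print(f"Democrats: {dem_rate:.1%}")    # Democrats: 35.0%
print(f"Republicans: {rep_rate:.1%}")  # Republicans: 98.1%
```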

### Another Test

Null Hypothesis: Dem mean == Rep mean for budget bill

Alternative Hypothesis: The mean of Dem and Rep support for budget bill differ

Confidence Level: 99%

In [26]:
# 2 sample ttest on budget bill
ttest_ind(dem['budget'], rep['budget'], nan_policy='omit')

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)

t-statistic: 23.212

p-value: 2.07e-77

From the results of test above, we see a p-value < .01 and therefore reject the null hypothesis

In [27]:
# See how they voted
print(dem['budget'].value_counts())
print(rep['budget'].value_counts())

1.0    231
0.0     29
Name: budget, dtype: int64
0.0    142
1.0     22
Name: budget, dtype: int64


Democrats are more likely than Republicans to support the budget bill, with p < .01

### Another Test

Null Hypothesis: Dems and Reps are equally likely to support immigration bill

Alternative Hypothesis: They are not equally likely to support

Confidence Level: 99%

In [28]:
# 2 Sample ttest on immigration bill
ttest_ind(dem['immigration'], rep['immigration'], nan_policy='omit')

Ttest_indResult(statistic=-1.7359117329695164, pvalue=0.08330248490425066)

t-statistic: -1.74

p-value: .08

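With a p-value above .01, we fail to reject the null hypothesis at the 99% confidence level

We cannot rule out that Dems and Reps are equally likely to support this immigration bill. Note that p = .08 falls short of the p > 0.1 target in goal 4; the water-project bill (p ≈ .93 in the two-sample results later in this notebook) meets it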

In [17]:
# Drop party column for ttest function
test_rep = rep.drop('party', axis=1)
test_dem = dem.drop('party', axis=1)

test_dem.head(), test_rep.head()

(   handicapped-infants  water-project  budget  ...  crime  duty-free  south-africa
 2                  NaN            1.0     1.0  ...    1.0        0.0           0.0
 3                  0.0            1.0     1.0  ...    0.0        0.0           1.0
 4                  1.0            1.0     1.0  ...    1.0        1.0           1.0
 5                  0.0            1.0     1.0  ...    1.0        1.0           1.0
 6                  0.0            1.0     0.0  ...    1.0        1.0           1.0
 
 [5 rows x 16 columns],
     handicapped-infants  water-project  budget  ...  crime  duty-free  south-africa
 0                   0.0            1.0     0.0  ...    1.0        0.0           1.0
 1                   0.0            1.0     0.0  ...    1.0        0.0           NaN
 7                   0.0            1.0     0.0  ...    1.0        NaN           1.0
 8                   0.0            1.0     0.0  ...    1.0        0.0           1.0
 10                  0.0            1.0     0

In [0]:
# Function to run one-sample t-tests on every column in one df
# Null Hypothesis: the party's support for each bill is 50% (random chance)

def ttest_1(df):
  for column in df.columns:
    results = ttest_1samp(df[column], .5, nan_policy='omit')
    print(column)
    print('t-statistic: ', results[0])
    print('p-value: ', results[1])
    print('---'*10)
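
Printing works for a quick look, but returning the results makes them easier to sort and filter. The following is one possible variant, not the function used in this notebook: `ttest_1_table`, `popmean`, and the toy frame are hypothetical names and data introduced for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

def ttest_1_table(df, popmean=0.5):
    """Run a one-sample t-test on every column and return a table."""
    rows = []
    for column in df.columns:
        result = ttest_1samp(df[column], popmean, nan_policy='omit')
        rows.append({'issue': column,
                     't_statistic': float(result.statistic),
                     'p_value': float(result.pvalue)})
    return pd.DataFrame(rows).set_index('issue')

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); not real data
toy_dem = pd.DataFrame({
    'issue_a': [1, 1, 1, 1, 0, 1, 1, np.nan],
    'issue_b': [1, 0, 1, 0, 0, 1, np.nan, 1],
})

table_1 = ttest_1_table(toy_dem)
print(table_1.sort_values('p_value'))
```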

In [21]:
# Run ttest on dem df
ttest_1(test_dem)

handicapped-infants
t-statistic:  3.431373087696574
p-value:  0.000699612317167372
------------------------------
water-project
t-statistic:  0.06454972243678961
p-value:  0.9485867005339235
------------------------------
budget
t-statistic:  19.859406568628835
p-value:  5.75931504660857e-54
------------------------------
physician-fee-freeze
t-statistic:  -31.67705343439813
p-value:  6.796885728494356e-91
------------------------------
el-salvador-aid
t-statistic:  -11.016877548066462
p-value:  2.5007537432253433e-23
------------------------------
religious-groups
t-statistic:  -0.7464459604122172
p-value:  0.45608033540995874
------------------------------
anti-satellite-ban
t-statistic:  10.424565592705058
p-value:  1.8326900884510166e-21
------------------------------
aid-to-contras
t-statistic:  14.13618595353591
p-value:  4.190313037098042e-34
------------------------------
mx-missile
t-statistic:  9.470521640429526
p-value:  2.3590277159598606e-18
------------------------------


In [0]:
# Function to run 2 sample ttest on dfs
# Null Hypothesis is that they have the same mean

def ttest_2(df_a, df_b):
  for column in df_a.columns:
    results = ttest_ind(df_a[column], df_b[column], nan_policy='omit')
    print(column)
    print('t-statistic: ', results[0])
    print('p-value: ', results[1])
    print('---'*10)
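
To check the assignment goals directly, the two-sample results can be collected into a table and filtered by sign and p-value. This is a sketch under assumptions: `ttest_2_table` and the toy frames are hypothetical, and the thresholds are the ones stated in the goals (p < 0.01 and p > 0.1).

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def ttest_2_table(df_a, df_b):
    """Run a two-sample t-test on every shared column and return a table."""
    rows = []
    for column in df_a.columns:
        result = ttest_ind(df_a[column], df_b[column], nan_policy='omit')
        rows.append({'issue': column,
                     't_statistic': float(result.statistic),
                     'p_value': float(result.pvalue)})
    return pd.DataFrame(rows).set_index('issue')

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); not real data
toy_dem = pd.DataFrame({
    'issue_a': [1, 1, 1, 1, 0, 1, 1, np.nan],
    'issue_b': [1, 0, 1, 0, 0, 1, np.nan, 1],
})
toy_rep = pd.DataFrame({
    'issue_a': [0, 0, 1, 0, 0, np.nan, 0, 0],
    'issue_b': [0, 1, 1, 0, 1, 0, 0, np.nan],
})

table = ttest_2_table(toy_dem, toy_rep)

# Positive t: first group (dem) has the higher mean; negative t: second group
dem_favored = table[(table['t_statistic'] > 0) & (table['p_value'] < 0.01)]
rep_favored = table[(table['t_statistic'] < 0) & (table['p_value'] < 0.01)]
no_clear_difference = table[table['p_value'] > 0.1]

print(dem_favored.index.tolist())
print(rep_favored.index.tolist())
print(no_clear_difference.index.tolist())
```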

In [33]:
# Run ttests on all columns in dfs
ttest_2(test_dem, test_rep)

handicapped-infants
t-statistic:  9.205264294809222
p-value:  1.613440327937243e-18
------------------------------
water-project
t-statistic:  -0.08896538137868286
p-value:  0.9291556823993485
------------------------------
budget
t-statistic:  23.21277691701378
p-value:  2.0703402795404463e-77
------------------------------
physician-fee-freeze
t-statistic:  -49.36708157301406
p-value:  1.994262314074344e-177
------------------------------
el-salvador-aid
t-statistic:  -21.13669261173219
p-value:  5.600520111729011e-68
------------------------------
religious-groups
t-statistic:  -9.737575825219457
p-value:  2.3936722520597287e-20
------------------------------
anti-satellite-ban
t-statistic:  12.526187929077842
p-value:  8.521033017443867e-31
------------------------------
aid-to-contras
t-statistic:  18.052093200819733
p-value:  2.82471841372357e-54
------------------------------
mx-missile
t-statistic:  16.437503268542994
p-value:  5.03079265310811e-47
-----------------------------