<a href="https://colab.research.google.com/github/SarmenSinanian/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/Sarmen_Sinanian_LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)


### WRITE OUT
- 1) NULL HYPOTHESIS
- 2) ALTERNATIVE HYPOTHESIS
- 3) CONFIDENCE LEVEL
- 4) T-STATISTIC
- 5) P-VALUE
- 6) INTERPRETATION

## The Four Parts of a T-test

### Before:

1) Null Hypothesis: "The Boring Hypothesis", "The default state of the world"

- The cooking times on the two burners are the same: $\mu_{1} = \mu_{2}$

2) Alternative Hypothesis: "The opposite of the null hypothesis"

- The cooking times are different: $\mu_{1} \neq \mu_{2}$

3) Confidence Level: 95%

**Confidence level of 95%** (when the null hypothesis is actually true, we wrongly reject it only 5% of the time)

p < 1 - Confidence Level

1 - Confidence Level = $\alpha$

- I reject the null hypothesis when P-value < alpha (1 - Confidence Level)

### After:

4) T-statistic: 6.362

5) P-value: .000000000251

What is the t-statistic?: Roughly, the number of standard errors by which the observed difference in sample means differs from zero (the difference expected under the null hypothesis), given the sample sizes and the variability in each sample.

What is the p-value?: The probability of observing a difference at least as extreme as the one we observed (a t-statistic at least this large in absolute value) if the null hypothesis were true and only random chance were at work.

### Interpretation: 

Because we calculated a t-statistic of 6.362, which corresponds to a p-value of .000000000251 (far below $\alpha$ = 0.05), we reject the null hypothesis that the mean cooking times of the two burners are equal, in favor of the alternative hypothesis that they are different.
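
To make the pieces above concrete, here is a minimal sketch that computes a pooled two-sample t-statistic and two-sided p-value by hand and compares them with `scipy.stats.ttest_ind`. The burner cooking times are synthetic numbers generated only for illustration, and the helper name `pooled_t_test` is hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic cooking times (minutes) for two burners; illustrative data only.
rng = np.random.default_rng(0)
burner_1 = rng.normal(loc=10.0, scale=1.0, size=40)
burner_2 = rng.normal(loc=11.0, scale=1.0, size=35)


def pooled_t_test(a, b):
    """Two-sided Student's t-test for two samples, assuming equal variances."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / dof
    # Standard error of the difference in means.
    std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (np.mean(a) - np.mean(b)) / std_err
    # Two-sided p-value: probability of |T| >= |t_stat| under the null.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value, dof


t_manual, p_manual, dof = pooled_t_test(burner_1, burner_2)
t_scipy, p_scipy = stats.ttest_ind(burner_1, burner_2)

alpha = 1 - 0.95  # confidence level of 95%
print('manual:', t_manual, p_manual, 'dof =', dof)
print('scipy: ', t_scipy, p_scipy)
print('reject null:', p_manual < alpha)
```

The manual and SciPy results should agree to floating-point precision, which is a quick check that the formula matches SciPy's default (equal-variance) test.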



In [0]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, ttest_1samp

In [51]:
house_votes_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n
republican,n,y,n,y,y,y,n,n,n,n,y,?,y,y,?,?
democrat,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?
democrat,y,y,y,n,n,y,y,y,?,y,y,?,n,n,y,?
republican,n,y,n,y,y,y,n,n,n,n,n,y,?,?,n,?
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,?,n,?
democrat,y,n,y,n,n,y,n,y,?,y,y,y,?,n,n,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,y,n,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,?,y,y,n,n
democrat,y,y,y,n,n,n,y,y,y,n,y,n,n,n,y,y
democrat,y,y,y,n,n,?,y,y,n,n,y,n,n,n,y,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,?,?,y,y
democrat,y,?,y,n,n,n,y,y,y,n,n,?,n,n,y,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,y,

In [0]:
df = pd.read_csv(house_votes_url, header=None, na_values = '?')

In [53]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
# 1. Class Name: 2 (democrat, republican) 
# 2. handicapped-infants: 2 (y,n) 
# 3. water-project-cost-sharing: 2 (y,n) 
# 4. adoption-of-the-budget-resolution: 2 (y,n) 
# 5. physician-fee-freeze: 2 (y,n) 
# 6. el-salvador-aid: 2 (y,n) 
# 7. religious-groups-in-schools: 2 (y,n) 
# 8. anti-satellite-test-ban: 2 (y,n) 
# 9. aid-to-nicaraguan-contras: 2 (y,n) 
# 10. mx-missile: 2 (y,n) 
# 11. immigration: 2 (y,n) 
# 12. synfuels-corporation-cutback: 2 (y,n) 
# 13. education-spending: 2 (y,n) 
# 14. superfund-right-to-sue: 2 (y,n) 
# 15. crime: 2 (y,n) 
# 16. duty-free-exports: 2 (y,n) 
# 17. export-administration-act-south-africa: 2 (y,n)

In [0]:
columns = ['party', 'handicapped-infants', 'water-project-cost-sharing',
           'adoption-of-the-budget-resolution', 'physician-fee-freeze',
           'el-salvador-aid', 'religious-groups-in-schools',
           'anti-satellite-test-ban', 'aid-to-nicaraguan-contras',
           'mx-missile', 'immigration', 'synfuels-corporation-cutback',
           'education-spending', 'superfund-right-to-sue', 'crime',
           'duty-free-exports', 'export-administration-act-south-africa']


In [0]:
df.columns = columns

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
party                                     435 non-null object
handicapped-infants                       423 non-null object
water-project-cost-sharing                387 non-null object
adoption-of-the-budget-resolution         424 non-null object
physician-fee-freeze                      424 non-null object
el-salvador-aid                           420 non-null object
religious-groups-in-schools               424 non-null object
anti-satellite-test-ban                   421 non-null object
aid-to-nicaraguan-contras                 420 non-null object
mx-missile                                413 non-null object
immigration                               428 non-null object
synfuels-corporation-cutback              414 non-null object
education-spending                        404 non-null object
superfund-right-to-sue                    410 non-null object
crime                      

In [0]:
df.replace('y',1, inplace=True)

In [0]:
df.replace('n',0,inplace=True)
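
As an alternative to the two in-place `replace` calls, the encoding can be done in one pass with an explicit mapping. This is only a sketch on a hypothetical toy frame (`toy`, `issue_a`, `issue_b` are made-up names, not the real voting columns); missing votes stay as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the voting data (not the real records).
toy = pd.DataFrame({
    'party': ['republican', 'democrat', 'democrat'],
    'issue_a': ['n', 'y', np.nan],
    'issue_b': ['y', np.nan, 'n'],
})

vote_map = {'y': 1.0, 'n': 0.0}
vote_columns = [c for c in toy.columns if c != 'party']

# Series.map returns NaN for anything not in the mapping, so missing votes
# remain missing and each vote column becomes float.
encoded = toy.assign(**{c: toy[c].map(vote_map) for c in vote_columns})
print(encoded)
```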

In [0]:
rep = df[df['party'] == 'republican']

In [0]:
dem = df[df['party'] == 'democrat']

In [62]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [63]:
df.shape

(435, 17)

In [64]:
df.isnull().sum()

party                                       0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
immigration                                 7
synfuels-corporation-cutback               21
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64
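
Because the missing counts differ so much from issue to issue, dropping every row that has any missing vote would discard more data than dropping missing votes one issue at a time. The sketch below illustrates the difference on a hypothetical toy frame (made-up column names and values, not the real data) and also counts missing votes by party.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 1.0 = yes, 0.0 = no, NaN = missing vote.
toy = pd.DataFrame({
    'party': ['republican', 'republican', 'democrat', 'democrat', 'democrat'],
    'issue_a': [0.0, np.nan, 1.0, 1.0, 0.0],
    'issue_b': [1.0, 1.0, np.nan, 0.0, np.nan],
})

# Listwise deletion: keep only rows with no missing votes at all.
rows_listwise = len(toy.dropna())

# Per-issue deletion: each issue keeps all of its own non-missing votes.
rows_per_issue = toy[['issue_a', 'issue_b']].notna().sum()

# Missing votes per issue, split by party.
missing_by_party = toy.drop(columns='party').isna().groupby(toy['party']).sum()

print('rows kept (listwise):', rows_listwise)
print(rows_per_issue)
print(missing_by_party)
```

Passing `nan_policy='omit'` to `ttest_ind` corresponds to the per-issue approach.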

In [65]:
print('Republican Support: ', rep['handicapped-infants'].mean())
print('Democrat Support: ', dem['handicapped-infants'].mean())

Republican Support:  0.18787878787878787
Democrat Support:  0.6046511627906976


In [66]:
ttest_ind(rep['handicapped-infants'], dem['handicapped-infants'], nan_policy='omit')
# Unlike the 1-sample t-test, no null-hypothesis value is passed here: the two
# samples are compared against each other.
# With nan_policy='omit', missing votes are dropped from each sample. By default
# (equal_var=True) the variances are pooled and the degrees of freedom are
# n1 + n2 - 2, where n1 and n2 are the non-missing sample sizes.

Ttest_indResult(statistic=-9.205264294809222, pvalue=1.613440327937243e-18)
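
As a quick check of how `nan_policy='omit'` behaves, the sketch below runs the test on two small made-up vote arrays (hypothetical values, not taken from the dataset) and compares it with dropping the `NaN` entries by hand. It also computes the degrees of freedom for the default pooled-variance test.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up votes: 1 = yes, 0 = no, NaN = missing. Illustrative only.
rep_votes = np.array([0, 0, 1, np.nan, 0, 1, 0, np.nan, 0, 0])
dem_votes = np.array([1, 1, np.nan, 1, 0, 1, 1, 0, 1, 1, 1, 0])

# Let SciPy drop the missing values.
omit_result = ttest_ind(rep_votes, dem_votes, nan_policy='omit')

# Drop the missing values manually, then run the plain test.
rep_clean = rep_votes[~np.isnan(rep_votes)]
dem_clean = dem_votes[~np.isnan(dem_votes)]
manual_result = ttest_ind(rep_clean, dem_clean)

# Degrees of freedom for the default (equal_var=True) test.
dof = len(rep_clean) + len(dem_clean) - 2

print(float(omit_result.statistic), float(omit_result.pvalue))
print(float(manual_result.statistic), float(manual_result.pvalue))
print('degrees of freedom:', dof)
```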

- 1) NULL HYPOTHESIS: Support for *handicapped-infants*, measured as the mean proportion of "yes" votes, **is the same** for republicans and democrats.
- 2) ALTERNATIVE HYPOTHESIS: Support for *handicapped-infants*, measured as the mean proportion of "yes" votes, **is different** for republicans and democrats.
- 3) CONFIDENCE LEVEL: 99%
- 4) T-STATISTIC: -9.205
- 5) P-VALUE: 1.61e-18
- 6) INTERPRETATION: Because the p-value is far below $\alpha$ = 0.01, we **REJECT** the null hypothesis that republicans and democrats supported *handicapped-infants* equally. The negative t-statistic (republicans were passed as the first sample) and the sample means (democrats 0.605 vs. republicans 0.188) indicate that democrats supported this issue more.

In [67]:
print('Republican Support: ', rep['physician-fee-freeze'].mean())
print('Democrat Support: ', dem['physician-fee-freeze'].mean()) 

Republican Support:  0.9878787878787879
Democrat Support:  0.05405405405405406


In [68]:
ttest_ind(rep['physician-fee-freeze'], dem['physician-fee-freeze'], nan_policy='omit')


Ttest_indResult(statistic=49.36708157301406, pvalue=1.994262314074344e-177)

- 1) NULL HYPOTHESIS: Support for *freezing physician fees*, measured as the mean proportion of "yes" votes, **is the same** for republicans and democrats.
- 2) ALTERNATIVE HYPOTHESIS: Support for *freezing physician fees*, measured as the mean proportion of "yes" votes, **is different** for republicans and democrats.
- 3) CONFIDENCE LEVEL: 95%
- 4) T-STATISTIC: 49.367
- 5) P-VALUE: 1.99e-177 (effectively 0)
- 6) INTERPRETATION: Because the p-value is far below $\alpha$ = 0.05 (and below the 0.01 threshold in the goals), we **REJECT** the null hypothesis that republicans and democrats supported *freezing physician fees* equally. The positive t-statistic and the sample means (republicans 0.988 vs. democrats 0.054) indicate that republicans supported this issue more.

In [69]:
print('Republican Support: ', rep['water-project-cost-sharing'].mean())
print('Democrat Support: ', dem['water-project-cost-sharing'].mean()) 

Republican Support:  0.5067567567567568
Democrat Support:  0.502092050209205


In [70]:
ttest_ind(rep['water-project-cost-sharing'], dem['water-project-cost-sharing'], nan_policy='omit')

Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

- 1) NULL HYPOTHESIS: Support for *water project cost sharing*, measured as the mean proportion of "yes" votes, **is the same** for republicans and democrats.
- 2) ALTERNATIVE HYPOTHESIS: Support for *water project cost sharing*, measured as the mean proportion of "yes" votes, **is different** for republicans and democrats.
- 3) CONFIDENCE LEVEL: 95%
- 4) T-STATISTIC: 0.089
- 5) P-VALUE: 0.929
- 6) INTERPRETATION: Because the p-value (0.929) is well above 0.1, we **FAIL TO REJECT** the null hypothesis that republicans and democrats supported *water project cost sharing* equally. The sample means (republicans 0.507 vs. democrats 0.502) are nearly identical.
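
For the first stretch goal, the three tests above could be folded into one reusable function. The following is a minimal sketch, not the notebook's own code: the name `compare_support`, its return format, and the toy frame used to exercise it are all assumptions made for illustration. It expects a frame shaped like `df` above, with a `party` column and votes encoded as 1/0/`NaN`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_support(frame, issue, alpha=0.01):
    """Two-sample t-test of republican vs. democrat support for one issue."""
    rep_votes = frame.loc[frame['party'] == 'republican', issue].dropna().to_numpy(dtype=float)
    dem_votes = frame.loc[frame['party'] == 'democrat', issue].dropna().to_numpy(dtype=float)
    statistic, pvalue = ttest_ind(rep_votes, dem_votes)
    return {
        'issue': issue,
        'rep_mean': rep_votes.mean(),
        'dem_mean': dem_votes.mean(),
        'n_rep': len(rep_votes),
        'n_dem': len(dem_votes),
        't_statistic': float(statistic),
        'p_value': float(pvalue),
        'reject_null': bool(pvalue < alpha),
    }


# Hypothetical toy data (not the real voting records) to exercise the helper.
toy = pd.DataFrame({
    'party': ['republican'] * 6 + ['democrat'] * 6,
    'issue_a': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    'issue_b': [1, 0, 1, 0, np.nan, 1, 1, 0, 1, 0, 1, 0],
})

result_a = compare_support(toy, 'issue_a', alpha=0.05)
result_b = compare_support(toy, 'issue_b', alpha=0.05)
print(result_a)
print(result_b)
```

With the real data, a loop such as `[compare_support(df, issue) for issue in columns[1:]]` would run the same test for every issue.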