<a href="https://colab.research.google.com/github/SamAlrahmani/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
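
As a rough illustration of what a two-sample test looks like in code, the sketch below runs `scipy.stats.ttest_ind` on two small made-up vote arrays (1 = yes, 0 = no). The arrays `group_a` and `group_b` are invented for this example and are not taken from the voting data. Using `equal_var=False` (Welch's test) is a choice made here, not something the assignment prescribes.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical yes/no votes (1 = yes, 0 = no); NOT the real voting data.
group_a = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# Welch's two-sample t-test, which does not assume equal variances.
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)

# A positive t statistic means the first group's mean is the higher one.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The sign of the t statistic depends on argument order, so it is worth fixing one order (for example, democrats first) and keeping it throughout.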

In [53]:
### MY CODE STARTS HERE
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel

# The raw file has no header row, so the column names are supplied here.
v = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
    header=None,
    names=['party', 'handicapped', 'water-project', 'budget',
           'physician-fee-freez', 'el-salvador-aid', 'religious-groups',
           'anti-satellite-test-ban', 'aid-to-contras', 'mx-missile',
           'mmigration', 'synfuels', 'education', 'right-to-sue',
           'crime', 'duty-free', 'south-africa'])
v


Unnamed: 0,party,handicapped,water-project,budget,physician-fee-freez,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-contras,mx-missile,mmigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
432,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
433,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y


In [54]:
# Replace the unknown values ('?') with NaN and encode no/yes as 0/1,
# because the t-tests (and any plots) need numbers, not strings.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
v = v.replace({'?': np.nan, 'n': 0, 'y': 1})
v.head()

Unnamed: 0,party,handicapped,water-project,budget,physician-fee-freez,el-salvador-aid,religious-groups,anti-satellite-test-ban,aid-to-contras,mx-missile,mmigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [55]:
# Count the missing values in each column.
v.isnull().sum()


party                        0
handicapped                 12
water-project               48
budget                      11
physician-fee-freez         11
el-salvador-aid             15
religious-groups            11
anti-satellite-test-ban     14
aid-to-contras              15
mx-missile                  22
mmigration                   7
synfuels                    21
education                   31
right-to-sue                25
crime                       17
duty-free                   28
south-africa               104
dtype: int64
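
Because the missing votes are spread unevenly across issues, dropping every row that contains any NaN would throw away far more data than dropping NaNs one issue at a time. The sketch below compares two per-issue options on a small invented frame (`toy` is hypothetical, not the real data): calling `dropna()` before the test, or passing `nan_policy="omit"` to `ttest_ind`. It also shows that leaving the NaNs in place with the default policy yields a NaN result.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy frame with missing votes; NOT the real voting data.
toy = pd.DataFrame({
    "party": ["republican"] * 6 + ["democrat"] * 6,
    "issue": [1, 1, np.nan, 1, 0, 1, 0, np.nan, 0, 1, 0, 0],
})
rep_votes = toy.loc[toy["party"] == "republican", "issue"]
dem_votes = toy.loc[toy["party"] == "democrat", "issue"]

# Option 1: drop the NaNs for this issue only, then test.
res_drop = ttest_ind(rep_votes.dropna().to_numpy(), dem_votes.dropna().to_numpy())

# Option 2: let SciPy skip the NaNs itself.
res_omit = ttest_ind(rep_votes.to_numpy(), dem_votes.to_numpy(), nan_policy="omit")

# With the default policy ("propagate"), NaNs make the result NaN.
res_nan = ttest_ind(rep_votes.to_numpy(), dem_votes.to_numpy())

print("dropna:    ", float(res_drop.pvalue))
print("omit:      ", float(res_omit.pvalue))
print("propagate: ", float(res_nan.pvalue))
```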

In [56]:
# Split the data into separate republican and democrat DataFrames.
rep = v[v['party']=='republican']
dem = v[v['party']=='democrat']
print('rep:', rep.shape)
print('dem:', dem.shape)

rep: (168, 17)
dem: (267, 17)


In [57]:
# Mean of the 'crime' column per party, i.e. the share of 'yes' votes (NaN is skipped).
print('rep mean: ', rep['crime'].mean())
print('dem mean: ', dem['crime'].mean())

rep mean:  0.9813664596273292
dem mean:  0.35019455252918286


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn this topic more thoroughly.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
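
The steps above could be sketched roughly as follows. The t statistic is computed by hand with NumPy as $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, where $s$ is the sample standard deviation. For the p-value, `scipy.stats.t.sf` stands in for the table or applet lookup, which is a convenience chosen here. The function name `one_sample_t`, the `votes` array, and the null value of 0.5 are all illustrative assumptions.

```python
import numpy as np
from scipy import stats


def one_sample_t(sample, popmean):
    """Hand-rolled one-sample t-test; returns (t statistic, two-sided p-value)."""
    x = np.asarray(sample, dtype=float)
    x = x[~np.isnan(x)]                 # omit NaN values
    n = x.size
    sample_mean = x.mean()
    sample_sd = x.std(ddof=1)           # sample standard deviation (n - 1)
    t_stat = (sample_mean - popmean) / (sample_sd / np.sqrt(n))
    dof = n - 1
    # Two-sided p-value from the t distribution (replaces the table lookup).
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value


# Hypothetical votes on one issue (1 = yes, 0 = no, NaN = missing).
votes = np.array([1, 0, 1, 1, np.nan, 1, 0, 1, 1, 1, np.nan, 0])

# Null hypothesis assumed for illustration: support is split 50/50.
t_manual, p_manual = one_sample_t(votes, popmean=0.5)

# Cross-check against SciPy on the same NaN-free values.
check = stats.ttest_1samp(votes[~np.isnan(votes)], 0.5)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {check.statistic:.4f}, p = {check.pvalue:.4f}")
```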

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
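
A comparable sketch for the two-sample case is shown below. It assumes Welch's version of the test (unequal variances, with the Welch-Satterthwaite approximation for the degrees of freedom), which may differ from the formula in the linked image; a pooled-variance version would change the standard error and use $n_1 + n_2 - 2$ degrees of freedom. The function name `welch_t` and both arrays are illustrative only.

```python
import numpy as np
from scipy import stats


def welch_t(sample_a, sample_b):
    """Hand-rolled Welch two-sample t-test; returns (t, two-sided p-value)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    a = a[~np.isnan(a)]                 # omit NaN values
    b = b[~np.isnan(b)]
    # Squared standard error contributed by each group.
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value


# Hypothetical votes for two groups (1 = yes, 0 = no, NaN = missing).
votes_a = np.array([1, 1, 0, 1, np.nan, 1, 1, 0, 1])
votes_b = np.array([0, 0, 1, 0, 0, np.nan, 0, 1, 0, 0])

t_manual, p_manual = welch_t(votes_a, votes_b)

# Cross-check against SciPy's Welch test on the NaN-free values.
check = stats.ttest_ind(
    votes_a[~np.isnan(votes_a)], votes_b[~np.isnan(votes_b)], equal_var=False
)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {check.statistic:.4f}, p = {check.pvalue:.4f}")
```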

### Then check your answers using SciPy!