<a href="https://colab.research.google.com/github/SamAlrahmani/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/Sam_Alrahmani_LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

In [60]:
### MY CODE STARTS HERE
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import ttest_1samp, ttest_ind, ttest_ind_from_stats, ttest_rel

v= pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                    header=None,names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']) 
v


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
432,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
433,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y


In [61]:
# replacing the unknown values with NaN, and replacing yes and no with 0 & 1
# because we can't graph strings. we can only graph numbers. 
v = v.replace({'?':np.NaN, 'n':0, 'y':1})
v.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [62]:
# checking the empty data sets. 
v.isnull().sum()


party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64

In [63]:
v['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [0]:
# Generate the data to democrats and republican data sets. 
rep = v[v['party']=='republican']
dem = v[v['party']=='democrate']


In [65]:
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
import scipy
from scipy import stats
from scipy.stats import ttest_1samp


In [67]:
stats.ttest_1samp(rep['education'], .95, nan_policy='omit')

Ttest_1sampResult(statistic=-2.925597438194524, pvalue=0.003958290033777179)

In [0]:
# I reject the Null HypothHypothesis due to the p-value i I had for the rep

In [69]:
stats.ttest_ind(dem['handicapped-infants'],rep['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=nan, pvalue=nan)

In [79]:
#getting the mean of each data
print('rep mean: ', rep['handicapped-infants'].mean())
print('dem mean: ', dem['handicapped-infants'].mean())

rep mean:  0.18787878787878787
dem mean:  nan


In [71]:
#create artificial data !

a=0 # mean 
b=1 #std
sample = np.random.normal(a, b, 1000)
sample[:10]

array([ 1.23059759,  0.34340793,  0.58422031,  1.1059311 , -0.45167716,
       -0.53010136,  0.86184036, -1.15788016,  0.9686254 ,  1.17985721])

In [72]:
v

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,republican,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
431,democrat,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
432,republican,0.0,,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0
433,republican,0.0,0.0,0.0,1.0,1.0,1.0,,,,,0.0,1.0,1.0,1.0,0.0,1.0


In [73]:
#combine all the features that the rep and dem have it to get back through it and find the answer we looking for 
rep_data = [rep['handicapped-infants'], rep['physician-fee-freeze'], 
           rep['water-project'], rep['budget'],
           rep['el-salvador-aid'], rep['religious-groups'],
           rep['anti-satellite-ban'], rep['aid-to-contras'],
           rep['mx-missile'], rep['immigration'],
           rep['synfuels'], rep['education'],
           rep['right-to-sue'], rep['crime'],
           rep['duty-free'], rep['south-africa']]

dem_data = [dem['handicapped-infants'], dem['physician-fee-freeze'], 
           dem['water-project'], dem['budget'],
           dem['el-salvador-aid'], dem['religious-groups'],
           dem['anti-satellite-ban'], dem['aid-to-contras'],
           dem['mx-missile'], dem['immigration'],
           dem['synfuels'], dem['education'],
           dem['right-to-sue'], dem['crime'],
           dem['duty-free'], dem['south-africa']]
# I test the lists and it could work togather combine with no need to 
#individually write them

#Null hypothesis of 0.5 for the republican's data
[one_sample_ttest(a, 0.5) for a in rep_data]
#AND THE LIST LOOKS WORKING GOOD !

[Ttest_1sampResult(statistic=-10.232833482397659, pvalue=2.572179359890009e-19),
 Ttest_1sampResult(statistic=57.09643655679979, pvalue=3.8884659739684535e-110),
 Ttest_1sampResult(statistic=0.16385760607458383, pvalue=0.8700683158522193),
 Ttest_1sampResult(statistic=-13.705331355148527, pvalue=6.249973784238298e-29),
 Ttest_1sampResult(statistic=26.920515819250607, pvalue=4.531756691646779e-62),
 Ttest_1sampResult(statistic=16.844895175868118, pvalue=1.103043623086e-37),
 Ttest_1sampResult(statistic=-7.694446231812848, pvalue=1.3430544790393879e-12),
 Ttest_1sampResult(statistic=-12.048344034968558, pvalue=4.758689364006887e-24),
 Ttest_1sampResult(statistic=-15.439826683525418, pvalue=8.750571156437711e-34),
 Ttest_1sampResult(statistic=1.4845341263724807, pvalue=0.1395867786115413),
 Ttest_1sampResult(statistic=-13.659513100277254, pvalue=1.5157657361710158e-28),
 Ttest_1sampResult(statistic=13.7323961384641, pvalue=1.5754742985301713e-28),
 Ttest_1sampResult(statistic=13.057014526

In [74]:
rep.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,165.0,148.0,164.0,165.0,165.0,166.0,162.0,157.0,165.0,165.0,159.0,155.0,158.0,161.0,156.0,146.0
mean,0.187879,0.506757,0.134146,0.987879,0.951515,0.89759,0.240741,0.152866,0.115152,0.557576,0.132075,0.870968,0.860759,0.981366,0.089744,0.657534
std,0.391804,0.501652,0.341853,0.10976,0.215442,0.304104,0.428859,0.36101,0.320176,0.498186,0.339643,0.336322,0.347298,0.135649,0.286735,0.476168
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [75]:
dem.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,,,,,,,,,,,,,,,
std,,,,,,,,,,,,,,,,
min,,,,,,,,,,,,,,,,
25%,,,,,,,,,,,,,,,,
50%,,,,,,,,,,,,,,,,
75%,,,,,,,,,,,,,,,,
max,,,,,,,,,,,,,,,,


In [76]:
#Null hypothesis of 1 for the rep's data
[one_sample_ttest(a, 1) for a in rep_data]

[Ttest_1sampResult(statistic=-26.625236633811387, pvalue=1.978873197183477e-61),
 Ttest_1sampResult(statistic=-1.4185450076223511, pvalue=0.1579292482594923),
 Ttest_1sampResult(statistic=-11.961605243444543, pvalue=1.8656648229239887e-23),
 Ttest_1sampResult(statistic=-32.43595087385152, pvalue=5.293293090366981e-73),
 Ttest_1sampResult(statistic=-2.890793645020198, pvalue=0.004363402589282088),
 Ttest_1sampResult(statistic=-4.338836636208457, pvalue=2.488389920449274e-05),
 Ttest_1sampResult(statistic=-22.533735393166197, pvalue=1.1791229999687983e-51),
 Ttest_1sampResult(statistic=-29.402380855978315, pvalue=1.646323339608132e-65),
 Ttest_1sampResult(statistic=-35.49944402826317, pvalue=6.985438779792963e-79),
 Ttest_1sampResult(statistic=-11.407472760546423, pvalue=1.4608897556395605e-22),
 Ttest_1sampResult(statistic=-32.22244115962839, pvalue=2.3835416257429457e-71),
 Ttest_1sampResult(statistic=-4.776485613378817, pvalue=4.122594240061259e-06),
 Ttest_1sampResult(statistic=-5.03

In [77]:
#Null hypothesis of 1 for the dem's data
[[one_sample_ttest(a, 1) for a in dem_data]]

  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)
  **kwargs)
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
  ret = ret.dtype.type(ret / rcount)


[[Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan),
  Ttest_1sampResult(statistic=nan, pvalue=nan)]]

## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on Performing a T-test without using Scipy in order to get "under the hood" and learn more thoroughly about this topic.
### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)

 ### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)

 ### Then check your Answers using Scipy!