
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data calls for *2-sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than comparing a single group's average against a fixed hypothesized value.
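As a rough illustration of that distinction, the sketch below runs both kinds of test on made-up votes. The `group_a` and `group_b` lists are invented for this example and are not taken from the data set. `ttest_1samp` compares one group's mean with a fixed value, while `ttest_ind` compares the means of two independent groups.

```python
from scipy.stats import ttest_1samp, ttest_ind

# Invented yes (1) / no (0) votes for two hypothetical groups
group_a = [1] * 8 + [0] * 2
group_b = [1] * 2 + [0] * 8

# 1-sample: is group_a's support different from an even 50/50 split?
one_sample = ttest_1samp(group_a, popmean=0.5)

# 2-sample: is group_a's support different from group_b's?
two_sample = ttest_ind(group_a, group_b)

print("1-sample:", one_sample.statistic, one_sample.pvalue)
print("2-sample:", two_sample.statistic, two_sample.pvalue)
```

The assignment uses the second form throughout, with one party's votes as each sample.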

In [0]:
# Import libraries
from textwrap import dedent  # used by vote_info to strip indentation from its report

import pandas as pd
from scipy.stats import ttest_ind

In [3]:
!wget --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2020-04-09 01:08:06--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-04-09 01:08:07 (286 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [6]:
column_headers = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']

hvd = pd.read_csv('house-votes-84.data', 
                 header=None, 
                 names=column_headers,
                 na_values="?")

hvd.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [7]:
# replace n and y with 0 and 1

hvd = hvd.replace({'y' : 1, 'n': 0})
hvd.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
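The missing votes stay as NaN after the recoding, so they have to be handled before each test. The sketch below, on an invented `toy` frame (not the real data), compares two options: dropping every row that has any missing vote, and dropping missing values one issue at a time. It also shows that, under these assumptions, `ttest_ind` with `nan_policy="omit"` gives the same p-value as calling `dropna()` on each column first.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Invented votes with gaps; column names are hypothetical
toy = pd.DataFrame({
    "party": ["republican"] * 4 + ["democrat"] * 4,
    "issue-a": [1, 1, np.nan, 0, 0, 0, 1, np.nan],
    "issue-b": [np.nan, 1, 1, 1, 0, np.nan, 0, 0],
})

# Row-wise dropping discards a row if ANY issue is missing
rows_left = len(toy.dropna())
# Column-wise dropping keeps every recorded vote for the issue being tested
votes_left = int(toy["issue-a"].notna().sum())
print(f"rows kept by row-wise dropna: {rows_left}, votes kept for issue-a: {votes_left}")

rep = toy[toy["party"] == "republican"]
dem = toy[toy["party"] == "democrat"]

# Two equivalent ways to ignore missing votes for a single issue
dropped = ttest_ind(rep["issue-a"].dropna(), dem["issue-a"].dropna())
omitted = ttest_ind(rep["issue-a"], dem["issue-a"], nan_policy="omit")
print(float(dropped.pvalue), float(omitted.pvalue))
```

Dropping per issue keeps more observations, which is why the functions in this notebook call `dropna()` on one column at a time.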


In [0]:
# Define helper functions that split the data frame by party and compare votes

# Define a function that runs the comparison for every subject
def subject_conclusions(df, feature):
  dfs = split_df(df, feature)  # split once and reuse the result for every subject
  subjects = next(iter(dfs.values())).columns.to_list()
  for subject in subjects:
    vote_info(dfs, subject)

# Define a function that splits the dataframe into one dataframe per value of `feature`
def split_df(df, feature):
  dfs = {}
  for value in df[feature].unique().tolist():
    dfs[value] = df[df[feature] == value].drop(feature, axis='columns')
  return dfs

# Define a function that compares both parties' votes on a single subject
def vote_info(df_dict, subject):
  rep_vote = df_dict['republican'][subject].dropna()
  # Count votes directly so a missing category gives 0 instead of a KeyError
  rep_no_votes = int((rep_vote == 0).sum())
  rep_yes_votes = int((rep_vote == 1).sum())
  dem_vote = df_dict['democrat'][subject].dropna()
  dem_no_votes = int((dem_vote == 0).sum())
  dem_yes_votes = int((dem_vote == 1).sum())
  rep_mean = rep_vote.mean()
  dem_mean = dem_vote.mean()
  vote_result = ''
  if rep_mean > dem_mean:
    vote_result = 'Republicans supported this bill more than Democrats'
  elif dem_mean > rep_mean:
    vote_result = 'Democrats supported this bill more than Republicans'
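  # Two-sided p-value: the probability of seeing a difference at least this
  # large if both parties truly voted alike. The "degree of certainty" wording
  # below is informal shorthand for the significance level reached, not the
  # probability that the difference is real.
  pval = ttest_ind(rep_vote, dem_vote).pvalue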
  result = 'There is a degree of certainty less than 95% that the difference in results is not random'
  conclusion = 'The null hypothesis can not be rejected in this case'
  if pval < 0.001:
    result = 'There is a degree of certainty greater than 99.9% that the difference in results is not random'
    conclusion = 'The null hypothesis can be rejected in this case'
  elif pval < 0.01:
    result = 'There is a degree of certainty greater than 99% that the difference in results is not random'
    conclusion = 'The null hypothesis can be rejected in this case'
  elif pval < 0.05:
    result = 'There is a degree of certainty greater than 95% that the difference in results is not random'
    conclusion = 'The null hypothesis can be rejected in this case'
  print(dedent(f"""
  ------------------------------------------------------------------------------
  {subject} vote information:
  Republican Votes: Yes {rep_yes_votes}, No {rep_no_votes}
  Democrat Votes: Yes {dem_yes_votes}, No {dem_no_votes}
  Republican: {rep_mean}
  Democrat: {dem_mean}
  {vote_result}
  P-Value: {pval}
  {result}
  {conclusion}
  ------------------------------------------------------------------------------"""))

In [81]:
# call function on original dataframe
subject_conclusions(hvd, 'party')


------------------------------------------------------------------------------
handicapped-infants vote information:
Republican Votes: Yes 31, No 134
Democrat Votes: Yes 156, No 102
Republican: 0.18787878787878787
Democrat: 0.6046511627906976
Democrats supported this bill more than Republicans
P-Value: 1.613440327936998e-18
There is a degree of certainty greater than 99.9% that the difference in results is not random
The null hypothesis can be rejected in this case
------------------------------------------------------------------------------

------------------------------------------------------------------------------
water-project vote information:
Republican Votes: Yes 75, No 73
Democrat Votes: Yes 120, No 119
Republican: 0.5067567567567568
Democrat: 0.502092050209205
Republicans supported this bill more than Democrats
P-Value: 0.9291556823994811
There is a degree of certainty less than 95% that the difference in results is not random
The null hypothesis can not be rejected in this case
------------------------------------------------------------------------------
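The three goals can also be checked in one pass. The sketch below is one possible approach, not part of the original notebook: a hypothetical helper `summarize_issues` collects the group means, t statistic, and two-sided p-value for each issue, and the sign of `t` indicates which party voted yes more often. It runs on an invented `toy` frame; passing the recoded `hvd` frame instead should work the same way, assuming its issue columns hold 0/1 with NaN for missing votes.

```python
import pandas as pd
from scipy.stats import ttest_ind

def summarize_issues(df, group_col="party"):
    """Return one row per issue: party means, t statistic, two-sided p-value."""
    dem = df[df[group_col] == "democrat"]
    rep = df[df[group_col] == "republican"]
    rows = []
    for issue in df.columns.drop(group_col):
        d, r = dem[issue].dropna(), rep[issue].dropna()
        res = ttest_ind(d, r)  # positive t means democrats voted yes more often
        rows.append({"issue": issue, "dem_mean": d.mean(), "rep_mean": r.mean(),
                     "t": float(res.statistic), "p": float(res.pvalue)})
    return pd.DataFrame(rows).set_index("issue")

# Invented data: 10 democrats followed by 10 republicans
toy = pd.DataFrame({
    "party": ["democrat"] * 10 + ["republican"] * 10,
    "dem-favored": [1] * 9 + [0] + [0] * 9 + [1],
    "rep-favored": [0] * 9 + [1] + [1] * 9 + [0],
    "split": [1, 0] * 10,
})

summary = summarize_issues(toy)
dem_issues = summary[(summary["t"] > 0) & (summary["p"] < 0.01)]  # goal 2
rep_issues = summary[(summary["t"] < 0) & (summary["p"] < 0.01)]  # goal 3
no_diff = summary[summary["p"] > 0.1]                             # goal 4
print(summary)
```

Keeping the results in a data frame makes it easy to sort by p-value or to change the thresholds later.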

## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn more thoroughly about this topic.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
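The steps above can be sketched with the standard library alone. This is a minimal illustration rather than a reference implementation: the `votes` list is invented, `one_sample_t` is a hypothetical helper name, and the null value of 0.5 (an even split) is an assumption chosen for the example.

```python
import math
import statistics

def one_sample_t(values, popmean):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test."""
    # Omit missing votes, whether stored as None or as NaN
    clean = [v for v in values if v is not None and not math.isnan(v)]
    n = len(clean)
    mean = statistics.fmean(clean)
    std = statistics.stdev(clean)      # sample standard deviation (divides by n - 1)
    std_err = std / math.sqrt(n)       # standard error of the mean
    return (mean - popmean) / std_err, n - 1

# Invented yes (1) / no (0) votes with two missing entries
votes = [1, 1, None, 1, 0, 1, 1, float("nan"), 1, 0, 1, 1]
t_stat, dof = one_sample_t(votes, popmean=0.5)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

The t statistic and the degrees of freedom are the two numbers needed to look up a p-value in a table or in the applet.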

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
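A similar standard-library sketch works for two samples. It assumes the pooled-variance (Student) form of the statistic, which is what `ttest_ind` computes by default; the linked formula may instead show the unequal-variance (Welch) form, which uses a different standard error. The helper name `two_sample_t` and both vote lists are invented for illustration.

```python
import math
import statistics

def two_sample_t(a, b):
    """Return (t statistic, degrees of freedom) using the pooled variance."""
    # Omit missing votes, whether stored as None or as NaN
    a = [v for v in a if v is not None and not math.isnan(v)]
    b = [v for v in b if v is not None and not math.isnan(v)]
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.fmean(a) - statistics.fmean(b)) / std_err, dof

# Invented yes (1) / no (0) votes; None marks a missing vote
dem_votes = [1] * 8 + [0] * 2 + [None]
rep_votes = [1] * 2 + [0] * 8
t_stat, dof = two_sample_t(dem_votes, rep_votes)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

A positive t here means the first sample has the higher mean, so the order of the arguments determines the sign.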

### Then check your answers using SciPy!