<a href="https://colab.research.google.com/github/lowea/DS-Unit-1-Sprint-2-Statistics/blob/master/module1/LS_DS_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
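
As a rough illustration of the two-sample setup (not the assignment data), the sketch below runs `scipy.stats.ttest_ind` on two made-up groups of 0/1 votes that contain missing values. The arrays `group_a` and `group_b` are invented for this example. With `nan_policy='omit'` the missing entries are dropped before testing, which should give the same result as dropping them by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical yes (1) / no (0) votes for two groups; NaN marks a missing vote.
group_a = np.array([1, 1, 0, 1, np.nan, 1, 1, 0, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, np.nan, 0, 1, 0, np.nan])

# Two-sample test: compares the mean of group_a with the mean of group_b.
t_omit, p_omit = stats.ttest_ind(group_a, group_b, nan_policy='omit')

# Same test after dropping the missing votes by hand.
a_clean = group_a[~np.isnan(group_a)]
b_clean = group_b[~np.isnan(group_b)]
t_clean, p_clean = stats.ttest_ind(a_clean, b_clean)

print(f"t = {float(t_omit):.3f}, two-sided p = {float(p_omit):.4f}")
print(f"t = {float(t_clean):.3f}, two-sided p = {float(p_clean):.4f}")
```

A positive t statistic here means the first group's mean (its share of "yes" votes) is higher than the second's.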

In [8]:
import pandas as pd
import numpy as np
from scipy import stats

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
df = pd.read_csv(url, names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
df = df.replace({'?': np.nan, 'y': 1, 'n': 0})

print(df.head())

        party  handicapped-infants  ...  duty-free  south-africa
0  republican                  0.0  ...        0.0           1.0
1  republican                  0.0  ...        0.0           NaN
2    democrat                  NaN  ...        0.0           0.0
3    democrat                  0.0  ...        0.0           1.0
4    democrat                  1.0  ...        1.0           1.0

[5 rows x 17 columns]


In [13]:
rep = df[df['party'] == 'republican']
dem = df[df['party'] == 'democrat']

print(rep.head())

         party  handicapped-infants  ...  duty-free  south-africa
0   republican                  0.0  ...        0.0           1.0
1   republican                  0.0  ...        0.0           NaN
7   republican                  0.0  ...        NaN           1.0
8   republican                  0.0  ...        0.0           1.0
10  republican                  0.0  ...        0.0           0.0

[5 rows x 17 columns]


In [14]:
print(dem.head())

      party  handicapped-infants  water-project  ...  crime  duty-free  south-africa
2  democrat                  NaN            1.0  ...    1.0        0.0           0.0
3  democrat                  0.0            1.0  ...    0.0        0.0           1.0
4  democrat                  1.0            1.0  ...    1.0        1.0           1.0
5  democrat                  0.0            1.0  ...    1.0        1.0           1.0
6  democrat                  0.0            1.0  ...    1.0        1.0           1.0

[5 rows x 17 columns]


In [22]:
# Find an issue republicans support more than democrats (p < 0.01)
issue = ''
for col in df.columns[1:]:
  t_stat, p_value = stats.ttest_ind(rep[col], dem[col], nan_policy='omit')
  # One-sided p-value: halve the two-sided value; t_stat > 0 means
  # the republican mean is higher.
  p_value = p_value / 2
  if p_value < .01 and t_stat > 0:
    issue = col

print(issue)

crime


In [21]:
# Find an issue democrats support more than republicans
issue = ''
for col in df.columns[1:]:
  t_stat, p_value = stats.ttest_ind(dem[col], rep[col], nan_policy='omit')
  # One-sided p-value: halve the two-sided value; t_stat > 0 means
  # the democrat mean is higher.
  p_value = p_value / 2
  if p_value < .01 and t_stat > 0:
    issue = col

print(issue)

south-africa


In [17]:
# Find an issue for which p > .1
issue = ''
for col in df.columns[1:]:
  t_stat, p_value = stats.ttest_ind(rep[col], dem[col], nan_policy='omit')
  if p_value > .1:
    issue = col

print(issue)

water-project
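
Each loop above keeps only the last column that matches its condition. One possible way to see every issue at once is to collect the test results in a table, as in the sketch below. The function name `compare_parties` and the `toy` frame with its `issue-x` and `issue-y` columns are hypothetical stand-ins for this example, not part of the assignment data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_parties(df, issues, group_col='party', a='democrat', b='republican'):
    """Run a two-sample t-test per issue and return one row per issue."""
    rows = []
    for col in issues:
        t_stat, p_value = stats.ttest_ind(df.loc[df[group_col] == a, col],
                                          df.loc[df[group_col] == b, col],
                                          nan_policy='omit')
        rows.append({'issue': col,
                     't_stat': float(t_stat),    # > 0 means group a has the higher mean
                     'p_value': float(p_value)}) # two-sided p-value
    return pd.DataFrame(rows).set_index('issue')

# Tiny made-up frame standing in for the voting data.
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue-x': [1, 1, 1, 1, np.nan, 1, 0, 0, 0, 1, 0, 0],
    'issue-y': [1, 0, 1, 0, 1, 0, 0, 1, np.nan, 1, 0, 1],
})
results = compare_parties(toy, ['issue-x', 'issue-y'])
print(results)
```

On the real data the call would presumably be `compare_parties(df, df.columns[1:])`, and a filter such as `results[(results.t_stat > 0) & (results.p_value / 2 < 0.01)]` would list the issues democrats support more at the one-sided 0.01 level.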


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn the topic more thoroughly.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
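
The steps above can be sketched as follows. The statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom, where $s$ is the sample standard deviation. The `votes` array and the null value of 0.5 are made up for illustration, and `scipy.stats.t.sf` stands in for the table or applet lookup; SciPy is otherwise used only for the final check.

```python
import numpy as np
from scipy import stats  # used only for the p-value lookup and the final check

def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test, ignoring NaNs."""
    x = np.asarray(sample, dtype=float)
    x = x[~np.isnan(x)]                    # omit missing values
    n = x.size
    std_err = x.std(ddof=1) / np.sqrt(n)   # sample std dev uses n - 1
    return (x.mean() - popmean) / std_err, n - 1

# Hypothetical votes (1 = yes, 0 = no, NaN = missing); H0: support is 50%.
votes = np.array([1, 1, 0, 1, np.nan, 1, 1, 0, 1, 1, 1, np.nan])
t_stat, dof = one_sample_t(votes, 0.5)

# Two-sided p-value: probability of a |t| at least this large under H0.
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Cross-check against SciPy's own 1-sample test on the cleaned sample.
check = stats.ttest_1samp(votes[~np.isnan(votes)], 0.5)
print(t_stat, p_value)
print(check.statistic, check.pvalue)
```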

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
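
A minimal sketch of the two-sample steps, assuming the pooled-variance (equal variance) form, which is what `scipy.stats.ttest_ind` computes by default; the formula behind the link above may instead be Welch's version, which corresponds to `equal_var=False`. Here $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_p^2 (1/n_1 + 1/n_2)}$ with $n_1 + n_2 - 2$ degrees of freedom, where $s_p^2$ is the pooled variance. The vote arrays are invented for the example, and `scipy.stats.t.sf` again stands in for the table lookup.

```python
import numpy as np
from scipy import stats  # used only for the p-value lookup and the final check

def two_sample_t(a, b):
    """Return (t statistic, degrees of freedom) for a pooled-variance 2-sample t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]   # omit missing values
    n_a, n_b = a.size, b.size
    dof = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err, dof

# Hypothetical votes (1 = yes, 0 = no, NaN = missing) for two groups.
dem_votes = np.array([1, 1, 0, 1, np.nan, 1, 1, 0, 1, 1])
rep_votes = np.array([0, 0, 1, 0, 0, np.nan, 0, 1, 0, np.nan])
t_stat, dof = two_sample_t(dem_votes, rep_votes)

# Two-sided p-value from the t distribution with dof degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Cross-check against SciPy's default (equal-variance) two-sample test.
check = stats.ttest_ind(dem_votes, rep_votes, nan_policy='omit')
print(t_stat, p_value)
print(float(check.statistic), float(check.pvalue))
```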

### Then check your answers using SciPy!