<a href="https://colab.research.google.com/github/ryankoul/DS-Unit-1-Sprint-2-Statistics/blob/master/module1-Statistics-Probability-Assignment/LS_DS15_121_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01 (99% confidence level)
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. the difference is not significant even at the 90% confidence level, so there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

In [1]:
import pandas as pd
import numpy as np

# Grab file from UCI
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'

--2020-04-13 22:54:44--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2020-04-13 22:54:44 (285 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [0]:
# Read in CSV file into DataFrame, manually assign column headers
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 names=[
                        'party','handicapped-infants','water-project',
                        'budget','physician-fee-freeze', 'el-salvador-aid',
                        'religious-groups','anti-satellite-ban',
                        'aid-to-contras','mx-missile','immigration',
                        'synfuels', 'education', 'right-to-sue','crime','duty-free',
                        'south-africa'
                        ]
                 )

In [3]:
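# Convert yes and no to binary, and ? (placeholder value) to np.nan
# (np.nan is the spelling that works across NumPy versions; the np.NaN alias was removed in NumPy 2.0)
df = df.replace({'y': 1, 'n': 0, '?': np.nan})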
df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [4]:
# Check counts of Republicans, Democrats, NaNs
print(df['party'].value_counts())
print()
print(df.isnull().sum())

democrat      267
republican    168
Name: party, dtype: int64

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64


In [5]:
# Dataframe filtering to make two samples
dems = df[df['party'] == 'democrat']
reps = df[df['party'] == 'republican']

reps.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


## Issue that Democrats support more than Republicans

**Variables**

$ \bar{x_1} = $ Sample 1 mean: the proportion of Republicans in the US House of Representatives in 1984 who voted yes on the `aid-to-contras` bill.

$ \bar{x_2} = $ Sample 2 mean: the proportion of Democrats in the US House of Representatives in 1984 who voted yes on the `aid-to-contras` bill.

**T-test procedure**
1. Null hypothesis: The percent of Republicans who support the `aid-to-contras` bill = the percent of Democrats who support it.
   - $H_0: \bar{x_1} = \bar{x_2}$ 
2. Alternative hypothesis: The percent of Republicans who support the `aid-to-contras` bill ≠ the percent of Democrats who support it.
   - $H_1: \bar{x_1} \neq \bar{x_2}$ 
3. Confidence level: 99% (reject the null if p-value < 0.01)

In [9]:
# Test
from scipy import stats
stats.ttest_ind(reps['aid-to-contras'], dems['aid-to-contras'], nan_policy='omit')


Ttest_indResult(statistic=-18.052093200819733, pvalue=2.82471841372357e-54)

### Conclusion:
The sign of the t-statistic tells us which sample mean is larger. 
   - If $ \bar{x_1} $ > $ \bar{x_2} $, the t-statistic will be positive.
   - If $ \bar{x_1} $ < $ \bar{x_2} $, the t-statistic will be negative.
   
![](https://www.statsdirect.co.uk/help/generatedimages/equations/equation167.svg)

If Democrats and Republicans supported `aid-to-contras` at the same rate (as we hypothesized), the probability that we'd observe data at least this extreme is < 1%. In fact, the negative sign on the t statistic indicates that Democrats support it more than Republicans.

Thus, we reject the null that Democrats support `aid-to-contras` at the same rate as Republicans. 
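The sign argument above can be checked directly by comparing the two group means with the t-statistic. The sketch below uses a small made-up table (the `toy` frame and its values are hypothetical, not the real voting records) and passes the Republican sample first, as in the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); NOT the real data.
toy = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'aid-to-contras': [0, 0, 1, 0, np.nan, 1, 1, 1, 0, 1],
})
toy_reps = toy[toy['party'] == 'republican']
toy_dems = toy[toy['party'] == 'democrat']

# Series.mean() skips NaN by default, so these are the yes-vote rates.
rep_rate = toy_reps['aid-to-contras'].mean()
dem_rate = toy_dems['aid-to-contras'].mean()

# Same argument order as the notebook: Republicans first, Democrats second.
result = stats.ttest_ind(toy_reps['aid-to-contras'],
                         toy_dems['aid-to-contras'],
                         nan_policy='omit')

# The sign of t matches the sign of (first sample mean - second sample mean).
print(rep_rate, dem_rate, float(result.statistic))
```

Swapping the argument order flips the sign of the statistic but leaves the two-sided p-value unchanged, so it helps to print the group means next to the test result.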

## Issue that Republicans support more than Democrats
**Variables**

$ \bar{x_1} = $ The proportion of Republicans who supported the 1984 `crime` bill.

$ \bar{x_2} = $ The proportion of Democrats who supported the 1984 `crime` bill.

**T-test procedure**
1. Null hypothesis: The percent of Republicans who support the `crime` bill = the percent of Democrats who support it.
   - $H_0: \bar{x_1} = \bar{x_2}$ 
2. Alternative hypothesis: The percent of Republicans who support the `crime` bill ≠ the percent of Democrats who support it.
   - $H_1: \bar{x_1} \neq \bar{x_2}$ 
3. Confidence level: 99% (reject the null if p-value < 0.01)

In [27]:
# Test
stats.ttest_ind(reps['crime'], dems['crime'], nan_policy='omit')

Ttest_indResult(statistic=16.342085656197696, pvalue=9.952342705606092e-47)

### Conclusion

If Democrats and Republicans supported the 1984 `crime` bill at the same rate (as we hypothesized), the probability that we'd observe data at least this extreme is < 1%. In fact, the positive sign on the t statistic indicates that Republicans support it more than Democrats.

Thus, we reject the null that Democrats support `crime` at the same rate as Republicans. 

## Issue that Republicans and Democrats Support Roughly the Same

**Variables**

$ \bar{x_1} = $ The proportion of Republicans who supported the 1984 `water-project` bill.

$ \bar{x_2} = $ The proportion of Democrats who supported the 1984 `water-project` bill.

**T-test procedure**
1. Null hypothesis: The percent of Republicans who support the `water-project` bill = the percent of Democrats who support it.
   - $H_0: \bar{x_1} = \bar{x_2}$ 
2. Alternative hypothesis: The percent of Republicans who support the `water-project` bill ≠ the percent of Democrats who support it.
   - $H_1: \bar{x_1} \neq \bar{x_2}$ 
3. Confidence level: 90% (fail to reject the null if p-value > 0.1)

In [10]:
# Test
stats.ttest_ind(reps['water-project'], dems['water-project'], nan_policy='omit')


Ttest_indResult(statistic=0.08896538137868286, pvalue=0.9291556823993485)

### Conclusion
If Democrats and Republicans supported the 1984 `water-project` bill at the same rate (as we hypothesized), the probability that we'd observe a difference at least this extreme is about 93%. In other words, the observed gap between the parties is entirely consistent with chance, so the data give no evidence that Republican support for `water-project` differs from Democratic support. Note that the p-value is not the probability that the null is true, and 1 - p is not a level of confidence in the alternative.

Thus, we fail to reject the null that Democrats support `water-project` at the same rate as Republicans. 


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn more thoroughly about this topic.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
    - Degrees of freedom = sample size - 1
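
The 1-sample steps above can be sketched by hand as follows. The `votes` array and the hypothesized mean `popmean = 0.5` are made-up illustration values, and `scipy.stats.t.sf` is used here in place of a printed t-table or the applet to turn the statistic into a p-value. The last lines compare the result against `stats.ttest_1samp` as a check.

```python
import numpy as np
from scipy import stats

# Hypothetical yes/no votes with missing values; NOT the real data.
votes = np.array([1, 1, 0, 1, np.nan, 1, 0, 1, 1, np.nan, 1, 1])
sample = votes[~np.isnan(votes)]       # omit NaN values

popmean = 0.5                          # hypothesized mean under H0 (assumed)
n = len(sample)
x_bar = sample.mean()                  # sample mean
s = sample.std(ddof=1)                 # sample standard deviation (n - 1)

# t = (sample mean - hypothesized mean) / standard error
t_stat = (x_bar - popmean) / (s / np.sqrt(n))

dof = n - 1                            # degrees of freedom = sample size - 1
# Two-sided p-value: twice the upper-tail area beyond |t|.
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Check against SciPy.
check = stats.ttest_1samp(sample, popmean)
print(t_stat, p_value)
print(check.statistic, check.pvalue)
```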

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
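
A hand-rolled version of the 2-sample steps might look like the sketch below. It assumes the pooled-variance (Student) form of the statistic, which is what `stats.ttest_ind` computes by default with `equal_var=True`; the linked formula may instead be the unequal-variance (Welch) form. The arrays `a` and `b` are hypothetical values, not the real votes.

```python
import numpy as np
from scipy import stats

# Hypothetical yes/no votes for two groups; NOT the real data.
a = np.array([1, 1, 0, 1, np.nan, 1, 1, 0])
b = np.array([0, 0, 1, 0, 0, np.nan, 0, 1, 0])
a = a[~np.isnan(a)]                    # omit NaN values
b = b[~np.isnan(b)]

n1, n2 = len(a), len(b)
var1, var2 = a.var(ddof=1), b.var(ddof=1)   # sample variances

# Pooled variance weights each sample variance by its degrees of freedom.
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof

# t = difference in means / standard error of the difference
t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value

# Check against SciPy (equal_var=True is the default).
check = stats.ttest_ind(a, b)
print(t_stat, p_value)
print(check.statistic, check.pvalue)
```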

### Then check your answers using SciPy!

In [20]:
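# The two parties are independent (not paired) samples with disjoint row labels,
# so an element-wise difference of the two Series would be all NaN.
# Drop the NaNs and work with each sample separately instead.
sample1 = dems['physician-fee-freeze'].dropna()
sample2 = reps['physician-fee-freeze'].dropna()
np.sqrt(len(sample1)), np.sqrt(len(sample2))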

# test Aliya's code
df3 = df[:]
df4 = df[:]

df3 = df3.dropna()
print(df3.shape)
df4.shape

(232, 17)


(435, 17)
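
The shapes above show what dropping every row with any missing vote costs: only 232 of 435 rows survive. Dropping NaNs one issue at a time keeps more observations per test, and that is what `nan_policy='omit'` does. The sketch below illustrates the difference on a small made-up frame (`toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy votes with scattered missing values; NOT the real data.
toy = pd.DataFrame({
    'party': ['republican'] * 4 + ['democrat'] * 4,
    'crime': [1, 0, np.nan, 1, 0, 1, 0, 0],
    'budget': [0, np.nan, 0, 1, 1, 1, np.nan, 1],
})

listwise = toy.dropna()              # drops a row if ANY column is missing
per_column = toy['crime'].dropna()   # drops only rows missing this one issue
print(len(listwise), len(per_column))

r = toy.loc[toy['party'] == 'republican', 'crime']
d = toy.loc[toy['party'] == 'democrat', 'crime']

# nan_policy='omit' should match dropping NaNs from each sample by hand.
omit = stats.ttest_ind(r, d, nan_policy='omit')
manual = stats.ttest_ind(r.dropna(), d.dropna())
print(float(omit.statistic), float(manual.statistic))
```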

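# Refactor code into function
from scipy import stats

def one_sample_ttest(df, col, popmean):
  """Accepts 3 parameters: a dataframe, a column name, and the hypothesized population mean.
  Returns the t statistic and p-value as a SciPy result object (NaN values are omitted)."""
  # The sample mean is computed by SciPy from df[col], so it is not a parameter.
  return stats.ttest_1samp(df[col], popmean, nan_policy='omit')

result = one_sample_ttest(df=reps, col='physician-fee-freeze', popmean=0.5)
assert np.isfinite(result.statistic)
result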
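
For the first stretch goal, one possible refactor is a helper that runs the 2-sample test on every issue at once and returns a sorted table. The function name `two_sample_ttests`, its parameters, and the `toy` data below are all hypothetical illustrations rather than part of the assignment.

```python
import numpy as np
import pandas as pd
from scipy import stats

def two_sample_ttests(df, group_col, group_a, group_b):
    """Run an independent 2-sample t-test on every column except group_col.

    Returns a DataFrame indexed by issue, sorted by p-value. A positive
    statistic means group_a has the higher mean for that issue.
    """
    a = df[df[group_col] == group_a]
    b = df[df[group_col] == group_b]
    rows = []
    for col in df.columns.drop(group_col):
        result = stats.ttest_ind(a[col], b[col], nan_policy='omit')
        rows.append({'issue': col,
                     'statistic': float(result.statistic),
                     'pvalue': float(result.pvalue)})
    return pd.DataFrame(rows).set_index('issue').sort_values('pvalue')

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); NOT the real data.
toy = pd.DataFrame({
    'party': ['republican'] * 6 + ['democrat'] * 6,
    'crime': [1, 1, 1, 0, 1, np.nan, 0, 0, 1, 0, 0, 0],
    'water-project': [1, 0, 1, 0, np.nan, 1, 1, 0, 1, 0, 1, 0],
})

summary = two_sample_ttests(toy, 'party', 'republican', 'democrat')
print(summary)
```

On the real data, the same call with `df` in place of `toy` would list all 16 issues, which makes it easy to pick out those with p < 0.01 in each direction and those with p > 0.1.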