<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
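
As a minimal sketch of what a two-sample test looks like in practice, the snippet below calls `scipy.stats.ttest_ind` on two small made-up vote arrays. The arrays `group_a` and `group_b` are hypothetical toy data, not the real voting records, and `nan_policy='omit'` is just one way to handle the missing values.

```python
import numpy as np
from scipy import stats

# Hypothetical toy votes (1 = yes, 0 = no, NaN = missing); not the real data.
group_a = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
group_b = np.array([0, 0, 1, 0, np.nan, 0, 0, 1])

# Two-sample (independent) t-test: compares the means of the two groups.
# nan_policy='omit' skips the missing votes instead of returning NaN.
result = stats.ttest_ind(group_a, group_b, nan_policy='omit')

print("t statistic:", float(result.statistic))
print("p value:", float(result.pvalue))
```

The sign of the t statistic follows the argument order: a positive value means the first group's mean is higher than the second's.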

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
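
One possible shape for the first stretch goal is sketched below: a helper that runs the same two-sample test on every issue column and returns the results as a table. The function name `compare_parties`, its arguments, and the output columns are all illustrative assumptions, and the `toy` frame is made-up data used only to show the call.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_parties(df, group_col='party', group_a='republican', group_b='democrat'):
    """Run a two-sample t-test on every column except group_col.

    Hypothetical helper: the name, arguments, and output layout are illustrative.
    """
    a = df[df[group_col] == group_a]
    b = df[df[group_col] == group_b]
    rows = []
    for col in df.columns.drop(group_col):
        # Missing votes (NaN) are skipped rather than dropped row-wise.
        res = stats.ttest_ind(a[col], b[col], nan_policy='omit')
        rows.append({
            'issue': col,
            'mean_a': a[col].mean(),
            'mean_b': b[col].mean(),
            't_stat': float(res.statistic),
            'p_value': float(res.pvalue),
        })
    # Smallest p-values first, so the clearest differences are at the top.
    return pd.DataFrame(rows).sort_values('p_value').reset_index(drop=True)


# Tiny made-up frame, only to show the call; not the real voting records.
toy = pd.DataFrame({
    'party': ['republican'] * 5 + ['democrat'] * 5,
    'issue-1': [1, 1, 1, 0, np.nan, 0, 0, 0, 1, 0],
    'issue-2': [1, 0, 1, 0, 1, 1, 0, np.nan, 0, 1],
})
summary = compare_parties(toy)
print(summary)
```

With the real data, filtering the returned table on `p_value < 0.01` or `p_value > 0.1` and on the sign of `t_stat` would give candidates for each of the three goals.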

In [0]:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data", names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])

In [0]:
# Encode votes numerically: '?' (missing) -> NaN, 'n' -> 0, 'y' -> 1
df.replace(to_replace='?', value=np.nan, inplace=True)
df.replace(to_replace='n', value=0, inplace=True)
df.replace(to_replace='y', value=1, inplace=True)

# Missing votes stay as NaN; the t-tests below use nan_policy='omit' to skip them
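
To illustrate what `nan_policy='omit'` does, the sketch below compares it with dropping the missing values from each group by hand before testing. The two arrays are hypothetical toy data, and the comparison assumes the default equal-variance test.

```python
import numpy as np
from scipy import stats

# Hypothetical toy columns (1 = yes, 0 = no, NaN = missing); not the real data.
rep_votes = np.array([1, 1, np.nan, 1, 0, 1])
dem_votes = np.array([0, np.nan, 0, 1, 0, 0])

# Option 1: let SciPy skip the NaN entries.
omit = stats.ttest_ind(rep_votes, dem_votes, nan_policy='omit')

# Option 2: remove NaN entries from each group manually, then test.
manual = stats.ttest_ind(rep_votes[~np.isnan(rep_votes)],
                         dem_votes[~np.isnan(dem_votes)])

same_stat = bool(np.isclose(float(omit.statistic), float(manual.statistic)))
same_p = bool(np.isclose(float(omit.pvalue), float(manual.pvalue)))
print(same_stat, same_p)
```

Either way, only the missing entries of the column being tested are excluded. Calling `df.dropna()` on the whole frame instead would discard every member who missed any vote, which shrinks the sample for every issue.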

In [14]:
print("#                        df.shape")
print(df.shape)
print("#                        df.head(10)")
print(df.head(10))
print("#                        df.tail(10)")
print(df.tail(10))
print("#                        df.isnull().sum().sort_values()")
print(df.isnull().sum().sort_values())
print("#                        df.describe()")
print(df.describe())

#                        df.shape
(435, 17)
#                        df.head(10)
        party  handicapped-infants  ...  duty-free  south-africa
0  republican                  0.0  ...        0.0           1.0
1  republican                  0.0  ...        0.0           NaN
2    democrat                  NaN  ...        0.0           0.0
3    democrat                  0.0  ...        0.0           1.0
4    democrat                  1.0  ...        1.0           1.0
5    democrat                  0.0  ...        1.0           1.0
6    democrat                  0.0  ...        1.0           1.0
7  republican                  0.0  ...        NaN           1.0
8  republican                  0.0  ...        0.0           1.0
9    democrat                  1.0  ...        NaN           NaN

[10 rows x 17 columns]
#                        df.tail(10)
          party  handicapped-infants  ...  duty-free  south-africa
425    democrat                  0.0  ...        1.0           NaN
426    de

In [15]:
dem = df[df['party']=='democrat']
rep = df[df['party']=='republican']

print("#                        dem.shape")
print(dem.shape)
print("#                        dem.head(10)")
print(dem.head(10))
print("#                        dem.tail(10)")
print(dem.tail(10))
print("#                        dem.isnull().sum().sort_values()")
print(dem.isnull().sum().sort_values())
print("#                        dem.describe()")
print(dem.describe())
print("=================================================================")
print("#                        rep.shape")
print(rep.shape)
print("#                        rep.head(10)")
print(rep.head(10))
print("#                        rep.tail(10)")
print(rep.tail(10))
print("#                        rep.isnull().sum().sort_values()")
print(rep.isnull().sum().sort_values())
print("#                        rep.describe()")
print(rep.describe())

#                        dem.shape
(267, 17)
#                        dem.head(10)
       party  handicapped-infants  ...  duty-free  south-africa
2   democrat                  NaN  ...        0.0           0.0
3   democrat                  0.0  ...        0.0           1.0
4   democrat                  1.0  ...        1.0           1.0
5   democrat                  0.0  ...        1.0           1.0
6   democrat                  0.0  ...        1.0           1.0
9   democrat                  1.0  ...        NaN           NaN
12  democrat                  0.0  ...        NaN           NaN
13  democrat                  1.0  ...        1.0           NaN
16  democrat                  1.0  ...        0.0           1.0
17  democrat                  1.0  ...        1.0           1.0

[10 rows x 17 columns]
#                        dem.tail(10)
        party  handicapped-infants  ...  duty-free  south-africa
419  democrat                  1.0  ...        0.0           1.0
421  democrat        

In [22]:
# Testing to see how they return values
# The result can be indexed like a tuple...
# t_stat = stats.ttest_1samp([0,1,2], 0)[0]
# print(t_stat)
# pvalue = stats.ttest_1samp([0,1,2], 0)[1]
# print(pvalue)
# ...or read through its named attributes
t_test = stats.ttest_1samp([0,1,2], 0)
print(t_test.statistic)
print(t_test.pvalue)

1.7320508075688774
0.22540333075851657


In [0]:
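def two_sample_result(rep_set, dem_set):
  """Print a two-sample t-test comparing republican and democrat votes on one issue."""
  # nan_policy='omit' skips missing votes instead of returning NaN
  t_test = stats.ttest_ind(rep_set, dem_set, nan_policy='omit')
  print("Results:")
  print("\nT Statistic: " + str(t_test.statistic))
  print("\nP Value: " + str(t_test.pvalue))
  # This function uses a 0.05 cutoff. The assignment goals use p < 0.01 and
  # p > 0.1, so compare the printed p-value against those thresholds as well.
  if t_test.pvalue <= 0.05:
    print("\nReject H_0.")
    # rep_set is the first argument, so a positive t means the republican mean is higher
    if t_test.statistic > 0:
      print("\nSuggest republican support is greater. R.mean=" + str(rep_set.mean()) + ", D.mean=" + str(dem_set.mean()))
    elif t_test.statistic < 0:
      print("\nSuggest democrat support is greater. R.mean=" + str(rep_set.mean()) + ", D.mean=" + str(dem_set.mean()))
    else:
      print("\nT Stat = 0, how did you get here??")
  else:
    print("\nDo not reject H_0. Suggest levels of support are approximately the same between R/D, or not different enough to be statistically significant.")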

In [37]:
two_sample_result(rep['education'], dem['education']) # Test the example from class to double check it works as expected

Results:

T Statistic: 20.500685724563073

P Value: 1.8834203990450192e-64

Reject H_0.

Suggest republican support is greater. R.mean=0.8709677419354839, D.mean=0.14457831325301204


In [38]:
two_sample_result(rep['immigration'], dem['immigration']) # Candidate for goal 4; note the p-value below (about 0.083) is under the goal's p > 0.1 threshold

Results:

T Statistic: 1.7359117329695164

P Value: 0.08330248490425066

Do not reject H_0. Suggest levels of support are approximately the same between R/D, or not different enough to be statistically significant.


In [41]:
two_sample_result(rep['aid-to-contras'], dem['aid-to-contras']) # Goal 2: an issue democrats support more, with p < 0.01

Results:

T Statistic: -18.052093200819733

P Value: 2.82471841372357e-54

Reject H_0.

Suggest democrat support is greater. R.mean=0.15286624203821655, D.mean=0.8288973384030418


In [42]:
two_sample_result(rep['crime'], dem['crime']) # Goal 3: an issue republicans support more, with p < 0.01

Results:

T Statistic: 16.342085656197696

P Value: 9.952342705606092e-47

Reject H_0.

Suggest republican support is greater. R.mean=0.9813664596273292, D.mean=0.35019455252918286
