
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.

### Loading in Data Set

In [0]:
import pandas as pd
import numpy as np


# Let's pass some column headers (the raw file has no header row)
column_headers = ['party','handicapped-infants','water-project',
                  'budget','physician-fee-freeze', 'el-salvador-aid',
                  'religious-groups','anti-satellite-ban',
                  'aid-to-contras','mx-missile','immigration',
                  'synfuels', 'education', 'right-to-sue','crime','duty-free',
                  'south-africa']

voting_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',names=column_headers)


### Clean Data Set

In [25]:
## Check to see the shape.
# It shows 435 rows and 17 columns. Makes sense because there are 435 representatives. 1 column is for class and the other 16 have binary attributes.
print(voting_data.shape)

(435, 17)


In [34]:
# Let's take a better look
voting_data.head(15)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
9,democrat,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,


In [27]:
voting_data.tail()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
432,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
433,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y
434,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,?,n


In [28]:
# Let's find out some basic summary metrics

voting_data.describe(include='all')

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,435,435,435,435,435,435,435,435,435,435,435,435,435,435,435,435,435
unique,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
top,democrat,n,y,y,n,y,y,y,y,y,y,n,n,y,y,n,y
freq,267,236,195,253,247,212,272,239,242,207,216,264,233,209,248,233,269


### Convert data to numeric data & Find null values

In [29]:
# Encode the votes numerically: 'y' becomes 1, 'n' becomes 0, and '?' (missing) becomes NaN.
# This makes it easy to take means and run t-tests on each column.

voting_data = voting_data.replace({'y': 1, 'n': 0, '?': np.nan})

voting_data.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
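As an alternative sketch of the same cleaning step, the `?` markers can be turned into NaN while the file is read, using the `na_values` parameter of `pd.read_csv`. The in-memory sample below is an assumption made so the example runs offline; its five rows are copied from the `voting_data.tail()` output above, and the name `votes` is hypothetical.

```python
import io
import pandas as pd

column_headers = ['party', 'handicapped-infants', 'water-project',
                  'budget', 'physician-fee-freeze', 'el-salvador-aid',
                  'religious-groups', 'anti-satellite-ban',
                  'aid-to-contras', 'mx-missile', 'immigration',
                  'synfuels', 'education', 'right-to-sue', 'crime',
                  'duty-free', 'south-africa']

# Stand-in for the remote file: the last five rows of the raw data
sample = io.StringIO(
    "republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y\n"
    "democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y\n"
    "republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y\n"
    "republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,?,n\n"
)

# na_values='?' converts the missing-vote marker to NaN at load time
votes = pd.read_csv(sample, names=column_headers, na_values='?')

# Map y/n to 1.0/0.0 on the vote columns only; NaN stays NaN
vote_cols = column_headers[1:]
votes[vote_cols] = votes[vote_cols].apply(
    lambda col: col.map({'y': 1.0, 'n': 0.0})
)

print(votes[vote_cols].isnull().sum().sum())  # total number of missing votes
```

Passing the real URL instead of `sample` should give the same layout as `voting_data` after the `replace` call.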


In [30]:
# Find null values
voting_data.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64

### Subset Data

In [31]:
voting_data['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [32]:
# Democrats

dem = voting_data[voting_data['party']=='democrat']
dem.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


In [33]:
# Republicans

rep = voting_data[voting_data['party']=='republican']
rep.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


### Finding a good bill to use
I want to know how many Democrats and Republicans have no recorded vote (a missing value) on each bill, to get a better idea of which data to use for my t-tests.

In [35]:
dem['physician-fee-freeze'].isnull().sum()

8

In [36]:
rep['physician-fee-freeze'].isnull().sum()

3

In [37]:
dem['el-salvador-aid'].isnull().sum()

12

In [38]:
rep['el-salvador-aid'].isnull().sum()

3

In [39]:
dem['religious-groups'].isnull().sum()

9

In [40]:
rep['religious-groups'].isnull().sum()

2

In [41]:
dem['immigration'].isnull().sum()

4

In [42]:
rep['immigration'].isnull().sum()

3

In [43]:
dem['education'].isnull().sum()

18

In [44]:
rep['education'].isnull().sum()

13

In [45]:
dem['right-to-sue'].isnull().sum()

15

In [46]:
rep['right-to-sue'].isnull().sum()

10

In [47]:
dem['aid-to-contras'].isnull().sum()

4

In [48]:
rep['aid-to-contras'].isnull().sum()

11
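The per-bill, per-party counts above can also be produced in one step. This is a minimal sketch on a toy frame with hypothetical values (the names `toy` and `missing_by_party` are made up for illustration); running the same chain on `voting_data` should give one row per bill.

```python
import numpy as np
import pandas as pd

# Toy stand-in for voting_data: same layout, hypothetical values
toy = pd.DataFrame({
    'party': ['democrat', 'democrat', 'republican', 'republican', 'republican'],
    'immigration': [1.0, np.nan, 0.0, 1.0, np.nan],
    'education': [np.nan, np.nan, 1.0, 1.0, 0.0],
})

# Flag missing votes, then count the flags within each party
missing_by_party = (
    toy.drop(columns='party')
       .isnull()
       .groupby(toy['party'])
       .sum()
       .T                      # bills as rows, parties as columns
)

print(missing_by_party)
```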

# Two Sample T-Test Time !

## Republicans > Democrats on El Salvador Aid Bill

1) Null Hypothesis: Republicans and Democrats support the El Salvador Aid Bill equally.

2) Alternative Hypothesis: Republicans and Democrats do NOT support the bill equally.

3) Confidence Level: 99%

In [53]:
# Two Sample T-Test 
# El Salvador Aid Bill

from scipy import stats

stats.ttest_ind(rep['el-salvador-aid'],dem['el-salvador-aid'], nan_policy='omit')

Ttest_indResult(statistic=21.13669261173219, pvalue=5.600520111729011e-68)

Conclusion: Based on a t-statistic of 21.14 and a p-value of 5.6 * 10^(-68), I **reject** the null hypothesis that both parties support the bill equally, in favor of the alternative hypothesis that the parties support the bill unequally.

Since the t-statistic is positive and Republicans are the first sample passed to the test, a larger share of Republicans than Democrats supported this bill.

In [58]:
dem['el-salvador-aid'].dropna().mean()

# About 22% of the Democrats with a recorded vote voted yes

0.21568627450980393

In [59]:
rep['el-salvador-aid'].dropna().mean()
# About 95% of the Republicans with a recorded vote voted yes

0.9515151515151515
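Because the votes are coded 1/0, the mean of a column is the share of yes votes among members with a recorded vote, not the turnout. A small sketch on hypothetical toy values shows both numbers side by side; `mean` and `count` skip NaN by default, so no explicit `dropna()` is needed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for voting_data (hypothetical values)
toy = pd.DataFrame({
    'party': ['democrat'] * 4 + ['republican'] * 3,
    'el-salvador-aid': [0.0, 0.0, 1.0, np.nan, 1.0, 1.0, np.nan],
})

# mean  -> share of yes votes among recorded votes
# count -> number of recorded (non-missing) votes
support = toy.groupby('party')['el-salvador-aid'].agg(['mean', 'count'])

print(support)
```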

# Two Sample T-Test
## Democrats > Republicans on Aid to Contras Bill

1) Null Hypothesis: Republicans and Democrats support the Aid to Contras Bill equally.

2) Alternative Hypothesis: Republicans and Democrats do NOT support the bill equally.

3) Confidence Level: 99%


In [0]:
# Two Sample T-Test 
# Aid to Contras Bill

stats.ttest_ind(dem['aid-to-contras'],rep['aid-to-contras'],nan_policy='omit')

Ttest_indResult(statistic=18.052093200819733, pvalue=2.82471841372357e-54)

Conclusion: Based on a t-statistic of 18.05 and a p-value of 2.8 * 10^(-54), I **reject** the null hypothesis that both parties supported the Aid to Contras Bill equally, in favor of the alternative hypothesis that Democrats and Republicans did not support the bill equally.

Since the t-statistic is positive and Democrats are the first sample passed to the test, a larger share of Democrats than Republicans supported this bill.

# Immigration Bill
### 2 sample T Test

1) Null Hypothesis: Democrats and Republicans supported the Immigration Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Immigration Bill equally.

3) Confidence Level: 90%

In [61]:
#Two Sample T Test
#Immigration Bill

stats.ttest_ind(dem['immigration'],rep['immigration'],nan_policy='omit')

Ttest_indResult(statistic=-1.7359117329695164, pvalue=0.08330248490425066)

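Conclusion: Based on a t-statistic of -1.74 and a p-value of 0.083, I **reject** the null hypothesis at the 90% confidence level (p < 0.10), in favor of the alternative that the parties voted differently on the Immigration Bill. The difference is not significant at the 99% level (p > 0.01), and because p < 0.1 this bill does not satisfy goal 4.

Because the t-statistic is negative and Democrats are the first sample, a larger share of Republicans than Democrats voted for the Immigration Bill, although the evidence is weak.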

# Education Bill

1) Null Hypothesis: Democrats and Republicans supported the Education Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Education Bill equally.

3) Confidence Level: 90%

In [62]:
stats.ttest_ind(dem['education'],rep['education'],nan_policy='omit')

Ttest_indResult(statistic=-20.500685724563073, pvalue=1.8834203990450192e-64)

Conclusion: Based on a t-statistic of -20.50 and a p-value of 1.88*10^(-64), I **reject** the null hypothesis that the parties voted equally in support of the Education Bill, in favor of the alternative that they voted differently.

Because the t-statistic is negative and Democrats are the first sample, a larger share of Republicans than Democrats were in favor of this bill.

# Physician Fee Freeze

1) Null Hypothesis: Democrats and Republicans supported the Physician Fee Freeze Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Physician Fee Freeze Bill equally.

3) Confidence Level: 90%

In [63]:
stats.ttest_ind(dem['physician-fee-freeze'],rep['physician-fee-freeze'],nan_policy='omit')

Ttest_indResult(statistic=-49.36708157301406, pvalue=1.994262314074344e-177)

Conclusion: Based on a t-statistic of -49.37 and a p-value of 1.99*10^(-177), I **reject** the null hypothesis that the parties voted equally in support of the Physician Fee Freeze Bill, in favor of the alternative that they voted differently.

Because the t-statistic is negative and Democrats are the first sample, a larger share of Republicans than Democrats were in favor of this bill.

# Religious Groups Bill

1) Null Hypothesis: Democrats and Republicans supported the Religious Groups Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Religious Groups Bill equally.

3) Confidence Level: 90%

In [64]:
stats.ttest_ind(dem['religious-groups'],rep['religious-groups'],nan_policy='omit')

Ttest_indResult(statistic=-9.737575825219457, pvalue=2.3936722520597287e-20)

Conclusion: Based on a t-statistic of -9.74 and a p-value of 2.39*10^(-20), I **reject** the null hypothesis that the parties voted equally in support of the Religious Groups Bill, in favor of the alternative that they voted differently.

Because the t-statistic is negative and Democrats are the first sample, a larger share of Republicans than Democrats were in favor of this bill.

# Right to Sue

1) Null Hypothesis: Democrats and Republicans supported the Right to Sue Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Right to Sue Bill equally.

3) Confidence Level: 90%

In [65]:
stats.ttest_ind(dem['right-to-sue'],rep['right-to-sue'],nan_policy='omit')

Ttest_indResult(statistic=-13.51064251060933, pvalue=1.2278581709672758e-34)

Conclusion: Based on a t-statistic of -13.51 and a p-value of 1.23*10^(-34), I **reject** the null hypothesis that the parties voted equally in support of the Right to Sue Bill, in favor of the alternative that they voted differently.

Because the t-statistic is negative and Democrats are the first sample, a larger share of Republicans than Democrats were in favor of this bill.

# Water Project

1) Null Hypothesis: Democrats and Republicans supported the Water Project Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Water Project Bill equally.

3) Confidence Level: 90%

In [66]:
stats.ttest_ind(dem['water-project'],rep['water-project'],nan_policy='omit')

Ttest_indResult(statistic=-0.08896538137868286, pvalue=0.9291556823993485)

Conclusion: Based on a t-statistic of -0.09 and a p-value of 0.929, I **fail to reject** the null hypothesis that the parties voted equally in support of the Water Project Bill. With p > 0.1, this is an issue where there may not be much of a difference between the parties (goal 4).

The t-statistic is negative, but it is so close to zero that its sign says nothing reliable about which party was more in favor of this bill.

# Budget

1) Null Hypothesis: Democrats and Republicans supported the Budget Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Budget Bill equally.

3) Confidence Level: 90%

In [67]:
stats.ttest_ind(dem['budget'],rep['budget'],nan_policy='omit')

Ttest_indResult(statistic=23.21277691701378, pvalue=2.0703402795404463e-77)

Conclusion: Based on a t-statistic of 23.21 and a p-value of 2.07*10^(-77), I **reject** the null hypothesis that the parties voted equally in support of the Budget Bill, in favor of the alternative that they voted differently.

Because the t-statistic is positive and Democrats are the first sample, a larger share of Democrats than Republicans were in favor of this bill.

# Anti-Satellite Ban

1) Null Hypothesis: Democrats and Republicans supported the Anti-Satellite Ban Bill equally.

2) Alternative Hypothesis: Democrats and Republicans did not support the Anti-Satellite Ban Bill equally.

3) Confidence Level: 90%

In [68]:
stats.ttest_ind(dem['anti-satellite-ban'],rep['anti-satellite-ban'],nan_policy='omit')

Ttest_indResult(statistic=12.526187929077842, pvalue=8.521033017443867e-31)

In [69]:
stats.ttest_ind(dem['mx-missile'],rep['mx-missile'],nan_policy='omit')

Ttest_indResult(statistic=16.437503268542994, pvalue=5.03079265310811e-47)

In [70]:
stats.ttest_ind(dem['synfuels'],rep['synfuels'],nan_policy='omit')

Ttest_indResult(statistic=8.293603989407588, pvalue=1.5759322301054064e-15)

In [71]:
stats.ttest_ind(dem['crime'],rep['crime'],nan_policy='omit')

Ttest_indResult(statistic=-16.342085656197696, pvalue=9.952342705606092e-47)

In [72]:
stats.ttest_ind(dem['duty-free'],rep['duty-free'],nan_policy='omit')

Ttest_indResult(statistic=12.853146132542978, pvalue=5.997697174347365e-32)

In [73]:
stats.ttest_ind(dem['south-africa'],rep['south-africa'],nan_policy='omit')

Ttest_indResult(statistic=6.849454815841208, pvalue=3.652674361672226e-11)
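The repeated `stats.ttest_ind` calls above follow one pattern, so they can be wrapped in a function and run over any list of issues. The sketch below is one possible refactor, not the only one: `compare_parties` is a hypothetical helper name, and the toy frame with `issue_a` and `issue_b` holds made-up values so the example is self-contained.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_parties(df, issues, group_col='party',
                    first='democrat', second='republican'):
    """Hypothetical helper: two-sample t-test (first vs. second) per issue."""
    rows = []
    for issue in issues:
        x = df.loc[df[group_col] == first, issue]
        y = df.loc[df[group_col] == second, issue]
        result = stats.ttest_ind(x, y, nan_policy='omit')
        rows.append({
            'issue': issue,
            'first_mean': x.mean(),      # share of yes votes, NaN skipped
            'second_mean': y.mean(),
            't': float(result.statistic),
            'p': float(result.pvalue),
        })
    # Smallest p-values (strongest evidence of a difference) first
    return pd.DataFrame(rows).set_index('issue').sort_values('p')

# Toy data with made-up votes; NaN marks a missing vote
toy = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue_a': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    'issue_b': [1, 0, 1, 0, 1, np.nan, 1, 0, 1, 0, np.nan, 0],
})

table = compare_parties(toy, ['issue_a', 'issue_b'])
print(table)
```

On the real data, `compare_parties(voting_data, column_headers[1:])` should reproduce the statistics above in a single table; a positive `t` means the first group had the larger share of yes votes.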

## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn more thoroughly about this topic.
### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
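The three steps above can be sketched as follows. The t statistic is computed with NumPy only; for the p-value the sketch uses `scipy.stats.t.sf` as a stand-in for looking the value up in a table, which is an assumption about how far "without SciPy" should go. The helper name `one_sample_t`, the toy votes, and the null value of 0.5 are all hypothetical.

```python
import numpy as np
from scipy import stats

def one_sample_t(sample, popmean):
    """Hypothetical helper: one-sample t statistic and two-sided p-value."""
    x = np.asarray(sample, dtype=float)
    x = x[~np.isnan(x)]                            # omit NaN values
    n = x.size
    standard_error = x.std(ddof=1) / np.sqrt(n)    # ddof=1: sample std dev
    t_stat = (x.mean() - popmean) / standard_error
    # Two-sided p-value from the t distribution with n - 1 degrees of freedom
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value

# Made-up votes; the null hypothesis is 50% support
toy_votes = [1, 0, 1, 1, np.nan, 0, 1, 1]
t_manual, p_manual = one_sample_t(toy_votes, popmean=0.5)

# Cross-check against SciPy on the same NaN-free sample
clean = [v for v in toy_votes if not np.isnan(v)]
check = stats.ttest_1samp(clean, 0.5)

print(t_manual, p_manual)
print(check.statistic, check.pvalue)
```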

 ### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
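A matching sketch for the two-sample case is below. It assumes the pooled-variance (Student) form of the test, which is what `stats.ttest_ind` computes by default (`equal_var=True`); the unequal-variance Welch form would use a different standard error and degrees of freedom. The helper name `two_sample_t` and the toy samples are hypothetical, and `scipy.stats.t.sf` again stands in for a table lookup.

```python
import numpy as np
from scipy import stats

def two_sample_t(a, b):
    """Hypothetical helper: pooled-variance two-sample t-test by hand."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]                  # omit NaN values from each sample
    b = b[~np.isnan(b)]
    n1, n2 = a.size, b.size
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / df
    standard_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (a.mean() - b.mean()) / standard_error
    p_value = 2 * stats.t.sf(abs(t_stat), df=df)   # two-sided
    return t_stat, p_value

# Made-up votes for two groups; NaN marks a missing vote
group_1 = [1, 1, 1, 0, 1, np.nan, 1]
group_2 = [0, 0, 1, 0, np.nan, 0, 1, 0]

t_manual, p_manual = two_sample_t(group_1, group_2)

# Cross-check against SciPy, which drops NaN with nan_policy='omit'
check = stats.ttest_ind(group_1, group_2, nan_policy='omit')

print(t_manual, p_manual)
print(float(check.statistic), float(check.pvalue))
```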

 ### Then check your Answers using Scipy!