<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (a yes or no vote on each of 16 issues). Be aware that there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
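To make the distinction concrete, here is a minimal sketch contrasting a one-sample test with a two-sample test. The two vote arrays are hypothetical toy data, not taken from the voting records.

```python
import numpy as np
from scipy import stats

# Hypothetical toy votes (1 = yes, 0 = no); NOT taken from the dataset.
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1])
group_b = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# One-sample test: is group A's mean different from a fixed value (0.5)?
one_sample = stats.ttest_1samp(group_a, popmean=0.5)

# Two-sample test: are the means of A and B different from each other?
two_sample = stats.ttest_ind(group_a, group_b)

print('one-sample p-value:', one_sample.pvalue)
print('two-sample p-value:', two_sample.pvalue)
```

Because votes are coded 0/1, each group's mean is simply its share of "yes" votes, so the two-sample test compares support rates between the groups.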

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [4]:
### YOUR CODE STARTS HERE
%pwd

'C:\\Users\\vince\\Desktop\\lambdaschool_temp\\week3\\DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments\\module1-statistics-probability-and-inference'

In [6]:
# Get Data
import wget
from snippets import files

files.DownloadFile("https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data", 'voting_data')

voting_data.data successfully downloaded to voting_data.data


'voting_data.data'

In [16]:
# Load Data and Check
import pandas as pd

columns = ['party','handicapped-infants','water-project','budget','physician-fee-freeze','el-salvador-aid','religious-groups','anti-satellite-ban','aid-to-contras','mx-missile','immigration','synfuels','education','right-to-sue','crime','duty-free','south-africa']
data = pd.read_csv('voting_data.data', header=None)
data.columns=columns
data.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
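An alternative way to load the file, sketched here under assumptions, is to let `read_csv` assign the column names and treat `?` as missing in one step. The two rows below are copied from the preview above and stand in for the real file so the snippet runs on its own.

```python
import io
import pandas as pd

columns = ['party', 'handicapped-infants', 'water-project', 'budget',
           'physician-fee-freeze', 'el-salvador-aid', 'religious-groups',
           'anti-satellite-ban', 'aid-to-contras', 'mx-missile', 'immigration',
           'synfuels', 'education', 'right-to-sue', 'crime', 'duty-free',
           'south-africa']

# Two rows copied from the preview above, used in place of the downloaded file.
raw = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# names= sets the header; na_values='?' turns the missing-vote marker into NaN.
sample = pd.read_csv(raw, header=None, names=columns, na_values='?')
print(sample.isna().sum())
```

With the real data, `raw` would be replaced by the path `'voting_data.data'`.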


In [19]:
# Map party to short labels (R, D) and votes to 0/1; '?' has no entry in the mapping, so it becomes NaN
clean_d = data.copy()
for col in columns:
    clean_d[col] = data[col].map({
        'republican':'R',
        'democrat': 'D',
        'n': 0,
        'y': 1,
    })
    
clean_d.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,R,0.0,1,0,1.0,1.0,1,0,0,0,1,,1.0,1,1,0,1.0
1,R,0.0,1,0,1.0,1.0,1,0,0,0,0,0.0,1.0,1,1,0,
2,D,,1,1,,1.0,1,0,0,0,0,1.0,0.0,1,1,0,0.0
3,D,0.0,1,1,0.0,,1,0,0,0,0,1.0,0.0,1,0,0,1.0
4,D,1.0,1,1,0.0,1.0,1,0,0,0,0,1.0,,1,1,1,1.0
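With votes coded as 0/1, a group mean is the share of recorded "yes" votes, and pandas skips NaN by default. A small sketch using only the `party` and `handicapped-infants` values from the five preview rows above:

```python
import numpy as np
import pandas as pd

# 'party' and 'handicapped-infants' values copied from the five preview rows above.
preview = pd.DataFrame({
    'party': ['R', 'R', 'D', 'D', 'D'],
    'handicapped-infants': [0.0, 0.0, np.nan, 0.0, 1.0],
})

grouped = preview.groupby('party')['handicapped-infants']

# mean() skips NaN by default, so each value is the share of recorded "yes" votes.
yes_share = grouped.mean()
# count() reports how many non-missing votes went into each share.
n_votes = grouped.count()

print(yes_share)
print(n_votes)
```

Five rows are far too few to draw conclusions from; the point is only that missing votes drop out of both the mean and the count.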


In [20]:
# Describe data to see what we're looking at
clean_d.describe()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,435,423,387,424,424,420,424,421,420,413,428,414,404,410,418,407,331
unique,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
top,D,0,1,1,0,1,1,1,1,1,1,0,0,1,1,0,1
freq,267,236,195,253,247,212,272,239,242,207,216,264,233,209,248,233,269
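The `count` row gives the number of non-missing votes per issue, so subtracting it from the 435 rows gives the number of missing votes. A quick sketch of that arithmetic for a few of the columns, with counts copied from the table above:

```python
import pandas as pd

TOTAL_ROWS = 435  # one row per congressperson, per the dataset description

# Non-missing counts copied from the describe() output above (subset of columns).
counts = pd.Series({
    'handicapped-infants': 423,
    'water-project': 387,
    'immigration': 428,
    'south-africa': 331,
})

missing = TOTAL_ROWS - counts
missing_share = (missing / TOTAL_ROWS).round(3)

print(pd.DataFrame({'missing': missing, 'share': missing_share}))
```

On the full frame, `clean_d.isna().sum()` should give the same numbers directly.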


In [85]:
## Create splitting and early analysis functions to test issues across parties

from scipy import stats

def splitByParty(data, issue):
    """Split one issue's votes by party and return a dict of two equal-length lists."""
    subset = data[['party', issue]]
    r_data = subset[subset.party == 'R']
    d_data = subset[subset.party == 'D']

    def equalize_lists(df1, df2, issue):
        """Drop missing votes, then randomly downsample the larger group to the smaller one's size.

        The downsample is random, so results change from run to run;
        several iterations are needed to get a stable picture.
        """
        # Drop any missing values
        temp_df1 = df1.dropna(how='any')
        temp_df2 = df2.dropna(how='any')
        # Count how many good datapoints are left
        count1 = temp_df1[issue].count()
        count2 = temp_df2[issue].count()

        # Debug print
        print('Non missing data: ', count1, count2)

        if count1 > count2:
            return [temp_df1.sample(n=count2), temp_df2]
        elif count2 > count1:
            return [temp_df1, temp_df2.sample(n=count1)]
        else:
            return [temp_df1, temp_df2]

    # Call equalizer
    r_data, d_data = equalize_lists(r_data, d_data, issue)
    # Store data as lists and return
    r_data = r_data[issue].tolist()
    d_data = d_data[issue].tolist()
    print('Returning series of length:', len(r_data))
    return {'R': r_data, 'D': d_data}


def ttest2(series1, series2):
    """Run a 2-sample t-test on the two series passed in, ignoring any NaN values."""
    return stats.ttest_ind(series1, series2, nan_policy='omit')
    
    

In [86]:
split_set = splitByParty(clean_d, 'handicapped-infants')
print(len(split_set['R']), len(split_set['D']))

Non missing data:  165 258
Returning series of length: 165
165 165


In [87]:
## Testing with one issue
split_data = splitByParty(clean_d, 'handicapped-infants')
ttest2(split_data['R'], split_data['D'])

Non missing data:  165 258
Returning series of length: 165


Ttest_indResult(statistic=-8.981194774780398, pvalue=2.1400307200407278e-17)

In [95]:
# See if that makes any sense
import matplotlib.pyplot as plt
df = pd.DataFrame.from_dict(split_data)
df.describe()

Unnamed: 0,R,D
count,165.0,165.0
mean,0.187879,0.624242
std,0.391804,0.485792
min,0.0,0.0
25%,0.0,0.0
50%,0.0,1.0
75%,0.0,1.0
max,1.0,1.0
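As a sanity check, the summary statistics in this table are enough to recompute the t-test without the raw votes. The sketch below feeds the means, standard deviations, and counts from the table into `scipy.stats.ttest_ind_from_stats`; because the inputs are rounded to six decimals, the result should only match the earlier statistic (about -8.98) approximately.

```python
from scipy import stats

# Means, standard deviations, and counts copied from the describe() table above.
result = stats.ttest_ind_from_stats(
    mean1=0.187879, std1=0.391804, nobs1=165,  # R
    mean2=0.624242, std2=0.485792, nobs2=165,  # D
)

print(result.statistic, result.pvalue)
```

The negative sign reflects the argument order: the R mean is below the D mean, so democrats supported this issue more in the sampled data.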


In [102]:
# So the means are certainly not going to be the same.  I was just a little surprised by how differently they voted.
# Since we can, let's just test them all!
for issue in columns[1:]:
    print('Regarding Issue:', issue)
    split_data = splitByParty(clean_d, issue)
    print(ttest2(split_data['R'], split_data['D']))
    print('')

Regarding Issue: handicapped-infants
Non missing data:  165 258
Returning series of length: 165
Ttest_indResult(statistic=-8.981194774780398, pvalue=2.1400307200407278e-17)

Regarding Issue: water-project
Non missing data:  148 239
Returning series of length: 148
Ttest_indResult(statistic=0.5797213274999463, pvalue=0.5625465328556588)

Regarding Issue: budget
Non missing data:  164 260
Returning series of length: 164
Ttest_indResult(statistic=-19.382265479268277, pvalue=3.2666341590367923e-56)

Regarding Issue: physician-fee-freeze
Non missing data:  165 259
Returning series of length: 165
Ttest_indResult(statistic=47.41541906719711, pvalue=7.500755796676712e-149)

Regarding Issue: el-salvador-aid
Non missing data:  165 255
Returning series of length: 165
Ttest_indResult(statistic=20.84533991440558, pvalue=4.765899053740693e-62)

Regarding Issue: religious-groups
Non missing data:  166 258
Returning series of length: 166
Ttest_indResult(statistic=9.135388756820058, pvalue=6.74366203409
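To tie these results back to the assignment goals, here is a hedged sketch of a helper that labels each result. The function name `classify` and its labels are hypothetical; the thresholds (p < 0.01 and p > 0.1) come from the goals, and the sign convention assumes the republican series is passed first, as in `ttest2(split_data['R'], split_data['D'])`. Only issues whose full output appears above are included.

```python
def classify(statistic, pvalue, alpha=0.01, no_diff=0.1):
    """Label a 2-sample t-test result; a positive statistic means mean(R) > mean(D)."""
    if pvalue < alpha:
        if statistic > 0:
            return 'republicans support more'
        return 'democrats support more'
    if pvalue > no_diff:
        return 'no clear difference'
    return 'inconclusive'


# (statistic, pvalue) pairs copied from the printed output above.
results = {
    'handicapped-infants': (-8.981194774780398, 2.1400307200407278e-17),
    'water-project': (0.5797213274999463, 0.5625465328556588),
    'budget': (-19.382265479268277, 3.2666341590367923e-56),
    'physician-fee-freeze': (47.41541906719711, 7.500755796676712e-149),
    'el-salvador-aid': (20.84533991440558, 4.765899053740693e-62),
}

labels = {issue: classify(t, p) for issue, (t, p) in results.items()}
for issue, label in labels.items():
    print(f'{issue}: {label}')
```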

In [97]:
# On the second issue ('water-project') there isn't much difference: I would NOT reject the null hypothesis that the means are equal.
# Important to note here, though, is that running this once isn't good enough, because matching the sample sizes draws a new random subsample each time!
# A Monte Carlo simulation is in order!

1
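One way the Monte Carlo idea could look, as a rough sketch: repeat the random downsampling many times and look at the spread of p-values rather than a single draw. The function name `monte_carlo_pvalues` is hypothetical, and the vote arrays are synthetic; only the group sizes (148 and 239) come from the 'water-project' output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic 0/1 votes. Only the group sizes (148 and 239) come from the
# 'water-project' output above; the yes/no split is made up for illustration.
r_votes = rng.integers(0, 2, size=148)
d_votes = rng.integers(0, 2, size=239)


def monte_carlo_pvalues(small, large, n_iter=1000, rng=rng):
    """Repeatedly downsample `large` to len(small) and collect t-test p-values."""
    pvalues = np.empty(n_iter)
    for i in range(n_iter):
        # Draw a fresh subsample without replacement on every iteration.
        subsample = rng.choice(large, size=len(small), replace=False)
        pvalues[i] = stats.ttest_ind(small, subsample).pvalue
    return pvalues


pvalues = monte_carlo_pvalues(r_votes, d_votes)
print('median p-value:', np.median(pvalues))
print('share of runs with p < 0.01:', np.mean(pvalues < 0.01))
```

With the real data, `r_votes` and `d_votes` would be the non-missing votes for each party. If nearly all runs land on the same side of the threshold, the single-run conclusion is probably safe.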