
In the discussion folder, you'll find the `turnout.csv` data, which was drawn from the 2012 National Election Survey. The data record each respondent's age, education level (total years in school), income, race (Caucasian or not), and past voting record (i.e., whether or not the respondent voted in the 2012 Presidential election). The sample is composed of 2,000 individual respondents. 

Please split the data into a training dataset (1,600 entries, 80%) and a test dataset (400 entries, 20%). 

Build a Naive Bayes classifier that predicts whether a respondent will vote in a presidential election, Pr(Vote == 1). The classifier must be built from scratch: do not use a third-party ML or statistical package. 

Run your algorithm and evaluate it on the test data by calculating its predictive accuracy. 

Does your model perform better than chance (i.e., a coin flip)?

In [26]:
import numpy as np
import pandas as pd
import scipy.stats as st 
import pprint as pp 


In [72]:
#importing and viewing data 
filepath='/Users/ellisobrien/Desktop/Georgetown Semester 1/Data Science/coding_discussions_ppol564_fall2021/05_coding_discussion/turnout.csv'
csv = pd.read_csv(filepath)
csv.head()


   id  age  educate  income  vote  white
0   1   60     14.0  3.3458     1      1
1   2   51     10.0  1.8561     0      1
2   3   24     12.0  0.6304     0      1
3   4   38      8.0  3.4183     1      1
4   5   25     12.0  2.7852     1      1


In [73]:
#dropping ID column
csv.drop(["id"], axis=1, inplace=True)


#setting seed
np.random.seed(36)

#converting data to Training and Test Data
TrainData = csv.sample(frac=.8).reset_index(drop=True)
TestData = csv.drop(TrainData.index).reset_index(drop=True)
bi_vars = TrainData[['vote', 'white']]
cont_vars = TrainData[['vote', 'age', 'educate', 'income']]

#printing rows of data
print("Training Data:",TrainData.shape[0],
      "\nTest Data:",TestData.shape[0])



Training Data: 1600 
Test Data: 400
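
One caveat about the split above: `reset_index(drop=True)` is applied to `TrainData` before `csv.drop(TrainData.index)` runs, so the labels being dropped are 0 to 1599 rather than the labels of the sampled rows. `TestData` therefore consists of the last 400 rows of `csv`, some of which can also appear in `TrainData`. A minimal sketch of a disjoint split, shown on a small made-up frame (`toy`, `train`, and `test` are hypothetical names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (made-up values).
toy = pd.DataFrame({"vote": np.arange(10) % 2, "age": np.arange(20, 30)})

# Sample the training rows first, keeping their original index labels ...
train = toy.sample(frac=0.8, random_state=36)
# ... so that drop() removes exactly the sampled rows from the full frame.
test = toy.drop(train.index)

# Only now reset both indexes.
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

print("Training rows:", train.shape[0], "\nTest rows:", test.shape[0])
```

Applying the same ordering to `csv` would change which rows end up in `TestData`, so the outputs below would need to be regenerated.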


In [74]:

Total = TrainData.shape[0]

# Subset the data by who voted
vote = TrainData.query("vote == 1")
novote = TrainData.query("vote == 0")

# Calculate the probability of each class
pr_vote = vote.shape[0]/Total
pr_no_vote = novote.shape[0]/Total

# Print the probabilities
print(
f"""
Pr(vote = 1): {pr_vote}
Pr(vote = 0): {pr_no_vote}
""")


Pr(vote = 1): 0.745
Pr(vote = 0): 0.255



In [75]:

def binary_probs(data,outcome_var=""):
    '''
    Calculate the class probabilities and the conditional probabilities
    of each binary predictor given the class.
    
    Inputs: data frame of binary variables (including the outcome) and
            the name of the outcome column
    Outputs: dictionary of class probabilities and dictionary of
             conditional probabilities keyed by (variable, value, class)
    '''
    # Generate empty dictionary containers.
    class_probs = {}; cond_probs = {}
    # Locate all variables that are not the outcome.
    # (Use the `data` argument rather than the global `bi_vars`.)
    predictors = [v for v in data.columns if v != outcome_var]
    # Iterate through the class outcomes.
    for y, d in data.groupby(outcome_var): 
        # Calculate the class probabilities.
        class_probs.update({y: d.shape[0]/data.shape[0]})
        for v in predictors:
            # Calculate the conditional probabilities for each variable given the class.
            pr = d[v].sum()/d.shape[0]
            cond_probs[(v,1,y)] = pr 
            cond_probs[(v,0,y)] = 1 - pr
    return class_probs, cond_probs


# Run
class_probs, cond_probs = binary_probs(bi_vars,outcome_var="vote")

# Print
print("class probabilities",end="\n\n")
pp.pprint(class_probs)
print("\n")
print("conditional probabilities",end="\n\n")
pp.pprint(cond_probs)

class probabilities

{0: 0.255, 1: 0.745}


conditional probabilities

{('white', 0, 0): 0.20588235294117652,
 ('white', 0, 1): 0.1174496644295302,
 ('white', 1, 0): 0.7941176470588235,
 ('white', 1, 1): 0.8825503355704698}


In [76]:
# Get the mean and standard deviation for each continuous variable
vars = [v for v in cont_vars.columns if v != "vote"]
dist_locs = {}
for v in vars:
    dist_locs.update({(v, 1): {'mean': vote[v].mean(), 'sd': vote[v].std()}})
    dist_locs.update({(v, 0): {'mean': novote[v].mean(), 'sd': novote[v].std()}})
    
dist_locs

{('age', 1): {'mean': 46.23154362416108, 'sd': 16.79350795057688},
 ('age', 0): {'mean': 42.5514705882353, 'sd': 19.23467634884542},
 ('educate', 1): {'mean': 12.600671140939598, 'sd': 3.204467843303058},
 ('educate', 0): {'mean': 10.67156862745098, 'sd': 3.147281575764395},
 ('income', 1): {'mean': 4.2633896812080465, 'sd': 2.9064856249133473},
 ('income', 0): {'mean': 2.7019500000000005, 'sd': 2.095294474234877}}
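
The class probabilities, the conditional probabilities for `white`, and the per-class means and standard deviations are the ingredients of the Naive Bayes score. As a sketch of how they combine, assuming the predictors are conditionally independent given the class and that the continuous variables are normally distributed within each class:

$$
\Pr(\text{vote}=c \mid x) \;\propto\; \Pr(\text{vote}=c)\;\Pr(\text{white}=x_{\text{white}} \mid \text{vote}=c) \prod_{j \in \{\text{age},\ \text{educate},\ \text{income}\}} \mathcal{N}\!\left(x_j;\ \mu_{j,c},\ \sigma_{j,c}\right)
$$

Here $\mathcal{N}(x;\ \mu,\ \sigma)$ denotes the normal density with mean $\mu$ and standard deviation $\sigma$, and $\mu_{j,c}$, $\sigma_{j,c}$ are the entries of `dist_locs[(j, c)]`. The predicted class is the value of $c$ with the larger score.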

In [77]:
def predict(data, dist_locs, cond_probs):
    '''
    Score each row of `data` under both classes and predict the class
    with the larger score.
    
    Inputs: data frame of observations, dictionary of class-specific means
            and standard deviations for the continuous variables, and
            dictionary of conditional probabilities for `white`.
            Also uses the class probabilities pr_vote and pr_no_vote
            defined above.
    Outputs: data frame with the unnormalized scores (pr_0, pr_1) and the
             predicted class (pred)
    '''
    
    store_preds = []
    for i,row in data.iterrows():
        # Start with the Gaussian densities of the continuous variables
        pr_0 = 1; pr_1 = 1
        for v in ['age', 'educate', 'income']:
            pr_0 *= st.norm(dist_locs[(v,0)]['mean'],
                            dist_locs[(v,0)]['sd']).pdf(row[v])
            pr_1 *= st.norm(dist_locs[(v,1)]['mean'], 
                            dist_locs[(v,1)]['sd']).pdf(row[v])
        # Multiply by class probabilities
        pr_0 *= pr_no_vote
        pr_1 *= pr_vote
        # Multiply by conditional probabilities
        pr_0 *= cond_probs['white', row['white'], 0]
        pr_1 *= cond_probs['white', row['white'], 1]
        
        # Assign the class designation to the highest score
        if pr_0 >= pr_1:
            class_pred = 0
        else:
            class_pred = 1
            
        store_preds.append([pr_0,pr_1,class_pred])
        
    # Turn to DataFrame
    return pd.DataFrame(store_preds,columns=["pr_0","pr_1","pred"])
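
The assignment asks for a classifier without third-party statistical packages, while `predict` relies on `scipy.stats.norm` for the Gaussian density. The density is a short formula, so it can be written by hand. A minimal sketch, where `gaussian_pdf` is a hypothetical helper name that should agree with `st.norm(mean, sd).pdf(x)`:

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd, evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Density of age 60 under the voter age distribution, using the
# rounded mean and sd reported above for ('age', 1).
print(gaussian_pdf(60, 46.23, 16.79))
```

Swapping this helper into `predict` would remove the `scipy` dependency without changing the logic of the classifier.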



In [78]:
# Run
preds_train = predict(TrainData, dist_locs, cond_probs)
preds_train.head(10)

           pr_0      pr_1  pred
0  1.821826e-05     6e-05     1
1   3.35316e-06   2.5e-05     1
2  1.681127e-06   3.5e-05     1
3  1.593059e-06   8.3e-05     1
4  6.999225e-05  0.000191     1
5  6.844807e-06   3.5e-05     1
6  4.730186e-08   1.1e-05     1
7  1.359446e-10     2e-06     1
8  3.701611e-06   1.6e-05     1
9  3.290427e-07     1e-05     1
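
The `pr_0` and `pr_1` columns are unnormalized scores (prior times likelihood), which is why they are so small and do not sum to 1. Dividing each by their sum gives an estimate of Pr(vote = 0) and Pr(vote = 1) for a respondent. A minimal sketch, where `normalize_scores` is a hypothetical helper and the inputs are the rounded values displayed in the first row above:

```python
def normalize_scores(pr_0, pr_1):
    """Turn two unnormalized class scores into probabilities that sum to 1."""
    total = pr_0 + pr_1
    return pr_0 / total, pr_1 / total

# Scores copied from the first row of the training predictions (as displayed).
post_0, post_1 = normalize_scores(1.821826e-05, 6e-05)
print(round(post_0, 3), round(post_1, 3))
```

The predicted class is unaffected by this step, since dividing both scores by the same total does not change which one is larger.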


In [79]:
#testing the training accuracy 
accuracy_train = sum(TrainData.vote == preds_train.pred)/TrainData.shape[0]
accuracy_train

0.74125

In [80]:
# Run prediction function on test data
preds_test = predict(TestData,dist_locs, cond_probs)
preds_test.head(10)



           pr_0          pr_1  pred
0  4.396933e-06  1.332856e-05     1
1   1.77916e-05  0.0001875516     1
2  3.146377e-05  6.985682e-05     1
3   7.96735e-05  0.0001921454     1
4  1.073979e-06  4.408899e-05     1
5  5.577799e-05  8.492188e-05     1
6  5.267228e-08  9.101569e-09     0
7  4.657539e-05  7.381272e-05     1
8  1.102416e-05  7.264487e-06     0
9  3.574112e-05  0.0001102321     1


In [82]:

# Compare prediction to actual
accuracy_test = sum(TestData.vote == preds_test.pred)/TestData.shape[0]
accuracy_test


0.7125

## Results 

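This model accurately predicts whether or not someone will vote 71.25% of the time on the test data, which is 21.25 percentage points better than a coin flip (50%). Accuracy on the training data is slightly higher, at 74.125%. 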
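
A coin flip is a fairly low bar here: the class probabilities above show that 74.5% of the training respondents voted, so a rule that always predicts a vote is also worth comparing against. A minimal sketch of that comparison, using hypothetical helpers `accuracy` and `majority_baseline` and made-up toy labels rather than the survey data:

```python
import numpy as np

def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    return float((actual == predicted).mean())

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return accuracy(test_labels, np.full(len(test_labels), majority))

# Made-up labels, for illustration only.
toy_train = [1, 1, 1, 0]
toy_test = [1, 0, 1, 1]
toy_pred = [1, 1, 1, 1]

print(accuracy(toy_test, toy_pred))
print(majority_baseline(toy_train, toy_test))
```

In the notebook, the analogous call would be `majority_baseline(TrainData.vote, TestData.vote)`, to be compared with `accuracy_test`.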
