# Coding Discussion 05

### Can we predict whether someone will vote or not?

• The data is drawn from the 2012 National Election Survey and records the age, eduction level (of total years in school), income, race (caucasian or not), and past voting record (i.e. whether or not the respondent voted in the 2012 Presidential election). The sample is composed of 2000 individual respondents.

### Objectives

• Please break the data up into a training (1600 entries, 80%) and test dataset (400 entries, 20%).

• Build a Naive Bayesian Classifier from scratch that tries to predict whether a respondent will vote in a presidential election or not, pr(Vote==1). The classifier must be built from scratch. Do not use a third party ML or statistical package.

• Run your algorithm and see how it predicts on the test data by calculating the predictive accuracy.

• Does your model perform better than chance (i.e. coin flip)?

• When completing this answer, be sure to: 1.) comment on all of your code 2.) provide a narrative for what you're doing and 3.) summarize your results and findings 

In [314]:
import pandas as pd
import numpy as np
import pprint as pp # for printing
import scipy.stats as st # for Normal PDF
import requests

# Plotting libraries 
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *

# Silence warnings 
import warnings
warnings.filterwarnings("ignore")

In [315]:
# Read in csv file from github 
url = 'https://raw.githubusercontent.com/edunford/coding_discussions_ppol564_fall2021/main/05_coding_discussion/turnout.csv'
res = requests.get(url, allow_redirects=True)
with open('turnout.csv','wb') as file:
    file.write(res.content)
turnout = pd.read_csv('turnout.csv')

In [316]:
# Train-Test split (just using Pandas)
train = turnout.sample(frac=.8).reset_index(drop=True)
test = turnout.drop(train.index).reset_index(drop=True)
# Print off the split count 
print("Training Data:",train.shape[0],
      "\nTest Data:",test.shape[0])

Training Data: 1600 
Test Data: 400


In [317]:
# Drop, rearrange, and rename data 
train_binary = train.drop(columns=['id', 'age', 'educate', 'income'])
y, x1 = train_binary.iloc[1,:]
test_binary = test.drop(columns=['id', 'age', 'educate', 'income'])

<h2><center> Naive Bayesian Classifier


$$Pr(class | data) = Pr( x_1| class)\times Pr( x_2| class) \times \dots \times  Pr(class)$$

A Naive Bayes simplifies a traditional Bayesian classifier by assuming each variable is independent of each other and removing the normalizing factor. 

<h3><center> Naive Bayesian Classifier with Binary Predictors


We will need to calculate each component of the equation above. First, let's determine the $Pr(class)$ or the fraction of people who voted and people who did not vote.

In [319]:
N = train.shape[0]

# Subset the data by class
vote1 = train.query("vote == 1")
vote0 = train.query("vote == 0")

# Calculate the probability for each class
pr_vote_1 = vote1.shape[0]/N
pr_vote_0 = vote0.shape[0]/N

# Print the probabilities
print(
f"""
Pr(vote = 1): {pr_vote_1}
Pr(vote = 0): {pr_vote_0}
""")


Pr(vote = 1): 0.74375
Pr(vote = 0): 0.25625



Next, we will need to calculate the conditional probabilities or $Pr(data | class)$. In other words, from the training set, let's find the number of white individuals who voted and did not vote and non-white individuals who voted and did not vote. 

In [320]:
# Given vote == 1
w1_vote1 = vote1.query("white == 1").shape[0]/vote1.shape[0]
w0_vote1 = vote1.query("white == 0").shape[0]/vote1.shape[0]


# Given vote == 0
w1_vote0 = vote0.query("white == 1").shape[0]/vote0.shape[0]
w0_vote0 = vote0.query("white == 0").shape[0]/vote0.shape[0]

print(
f"""
Pr(white = 1 |vote = 1): {w1_vote1}
Pr(white = 0 |vote = 1): {w0_vote1}
Pr(white = 1 |vote = 0): {w1_vote0}
Pr(white = 0 |vote = 0): {w0_vote0}
""")


Pr(white = 1 |vote = 1): 0.8815126050420168
Pr(white = 0 |vote = 1): 0.11848739495798319
Pr(white = 1 |vote = 0): 0.775609756097561
Pr(white = 0 |vote = 0): 0.22439024390243903



In [321]:
prob_vote1 = w0_vote1 * pr_vote_1
prob_vote0 = w0_vote0 * pr_vote_0

print(f"""
Pr(vote = 1) = {prob_vote1}
Pr(vote = 0) = {prob_vote0}
""")


Pr(vote = 1) = 0.088125
Pr(vote = 0) = 0.057499999999999996



In [322]:
prob_vote1 = w1_vote1 * pr_vote_1
prob_vote0 = w1_vote0 * pr_vote_0

print(f"""
Pr(cw = 1) = {prob_vote1}
Pr(cw = 0) = {prob_vote0}
""")


Pr(cw = 1) = 0.655625
Pr(cw = 0) = 0.19874999999999998



Let's calculate the underlying probabilities and then calculate the predictions for each observation in the data.

In [323]:
def calc_probs(data,outcome_var=""):
    '''
    Function calculates the class and conditional probabilities in 
    the binary data. 
    
    Note that I'm using dictionaries with tuple keys to keep
    track of the variable, it's val, and the outcome, which we're conditioning on. 
    '''
    # Generate empty dictionary containers.
    class_probs = {};cond_probs = {}
    # Locate all variables that are not the outcome.
    vars = [v for v in data.columns if v != outcome_var]
    # iterate through the class outcomes
    for y, d in data.groupby(outcome_var): 
        # calculate the class probabilities
        class_probs.update({y: d.shape[0]/data.shape[0]})
        for v in vars:
            # calculate the conditional probabilities for each variable given the class.
            pr = d[v].sum()/d.shape[0]
            cond_probs[(v,1,y)] = pr 
            cond_probs[(v,0,y)] = 1 - pr
    return class_probs, cond_probs


# Run
class_probs, cond_probs = calc_probs(train_binary,outcome_var="vote")

# Print
print("class probabilities",end="\n\n")
pp.pprint(class_probs)
#print("\n")
#print("conditional probabilities",end="\n\n")
#pp.pprint(cond_probs)

class probabilities

{0: 0.25625, 1: 0.74375}


Let's build a prediction function that runs through the observations in the data and calculates the probabilities and makes a class prediction.

In [324]:
def predict(data,class_probs,cond_probs):
    '''
    Function calculates the conditiona probability for membership into each class.
    Then returns both the probabilities and the most likely class. 
    '''
    store_preds = []
    for i,row in data.iterrows():
        pr_1 = 1; pr_0 = 1
        for j in range(1,len(row.index)):
            pr_0 *= cond_probs[(row.index[j],row.values[j],0)]
            pr_1 *= cond_probs[(row.index[j],row.values[j],1)]     
        pr_0 *= class_probs[0]
        pr_1 *= class_probs[1]
        store_preds.append([pr_0,pr_1,max([(pr_0,0),(pr_1,1)])[1]])
    return pd.DataFrame(store_preds,columns=["pr_0","pr_1","pred"])

# Run 
preds = predict(train_binary, class_probs, cond_probs)

Finally, calculate the predictive accuracy of the model.

In [325]:
accuracy = sum(train_binary.vote == preds.pred)/train.shape[0]
accuracy

0.74375

Now let's the accuracy of the model on our test data.

In [326]:
test_preds = predict(test_binary, class_probs, cond_probs)
test_accuracy = sum(test_binary.vote == test_preds.pred)/test.shape[0]
test_accuracy

0.7

Though the Naive Bayes Classifier is quite simplistic when compared to other modeling strategies (such as a neural net or a gradient boosting machine), it proves to be effective on a wide array of prediction tasks.

<h3><center> Naive Bayesian Classifier with Continuous Predictors

We need a way to map a continuous variable into a probability space. Here we'll use the probability density function for Gaussian (normal) distribution to convert continuous values into probabilities.

Note that we can use information regarding the distribution of each continuous predict and find out where any single point is on that continuous variables probability distribution.

We can calculate the conditional mean and standard deviation for each value of the outcome and then calculate the predictions from there for any one of our continuous variables.

In [327]:
# Drop, rearrange, and rename train df 
train_drop = train.drop(columns=['id', 'white', 'age'])
train_drop = train_drop[['vote', 'educate', 'income']]
y,x1,x2 = train_drop.iloc[1,:]
train_drop.columns = ['y', 'x1', 'x2']

# Drop, rearrange, and rename test df 
test_drop = test.drop(columns=['id', 'white', 'age'])
test_drop = test_drop[['vote', 'educate', 'income']]
test_drop.columns = ['y', 'x1', 'x2']

Calculate the class probabilities or $Pr(class)$. 

In [304]:
y1 = train_drop.query("y == 1")
y0 = train_drop.query("y == 0")

# Class probabilities.
pr_y1 = y1.shape[0]/train_drop.shape[0]
pr_y0 = y0.shape[0]/train_drop.shape[0]

Calculate the means and standard devitaions. 

In [305]:
# Collect the mean and standard dev. of each conditional distribution
dist_locs = \
{("x1",1):{'mean':y1.x2.mean(),'sd':y1.x1.std()},
 ("x1",0):{'mean':y0.x2.mean(),'sd':y0.x1.std()},
 ("x2",1):{'mean':y1.x1.mean(),'sd':y1.x2.std()},
 ("x2",0):{'mean':y0.x1.mean(),'sd':y0.x2.std()}
}

# Print
pp.pprint(dist_locs)

{('x1', 0): {'mean': 2.8572895261845397, 'sd': 3.263255626180008},
 ('x1', 1): {'mean': 4.226724103419518, 'sd': 3.297833599903187},
 ('x2', 0): {'mean': 10.688279301745636, 'sd': 2.3290857540116523},
 ('x2', 1): {'mean': 12.51834862385321, 'sd': 2.901260701952806}}


<h4><center> Predicting a Single Observation

In [312]:
# Prediction for the 1 class
a = st.norm(dist_locs[("x1",1)]['mean'], dist_locs[("x1",1)]['sd']).pdf(x1)
b = st.norm(dist_locs[("x2",1)]['mean'], dist_locs[("x2",1)]['sd']).pdf(x2)
c = pr_y1
pr_1 = a * b * c 

# Prediction for the 0 class
a = st.norm(dist_locs[("x1",0)]['mean'], dist_locs[("x1",0)]['sd']).pdf(x1)
b = st.norm(dist_locs[("x2",0)]['mean'], dist_locs[("x2",0)]['sd']).pdf(x2)
c = pr_y0
pr_0 = a * b * c 

print(
f'''
    Pr(y == 1| X): {pr_1}
    Pr(y == 0| X): {pr_0}
''')


    Pr(y == 1| X): 6.563751571296217e-07
    Pr(y == 0| X): 5.55000404139091e-08



<h4><center> Predicting Multiple Observations 

In [307]:
def predict(data,dist_locs):
    ''''''
    store_preds = []
    for i,row in data.iterrows():
        
        # Get the predictions using a Gaussan distribution
        pr_0 = 1; pr_1 = 1
        for j in range(1,len(row)):
            
            pr_0 *= st.norm(dist_locs[(row.index[j],0)]['mean'],
                            dist_locs[(row.index[j],0)]['sd']).pdf(row.values[j])
            pr_1 *= st.norm(dist_locs[(row.index[j],1)]['mean'], 
                            dist_locs[(row.index[j],1)]['sd']).pdf(row.values[j])
        pr_0 *= pr_y0
        pr_1 *= pr_y1
        
        # Assign the class designation to the highest probability
        if pr_0 >= pr_1:
            class_pred = 0
        else:
            class_pred = 1
            
        store_preds.append([pr_0,pr_1,class_pred])
        
    return pd.DataFrame(store_preds,columns=["pr_0","pr_1","pred"])

# Run
preds_train = predict(train_drop,dist_locs)


Determine the predictive accuracy of the training data.

In [310]:
# Predictive accuracy of training data 
accuracy_train = sum(train_drop.y == preds_train.pred)/train_drop.shape[0]
accuracy_train

0.749375

Determine the predictive accuracy of the test data.

In [309]:
# Test test df 
preds_test = predict(test_drop, dist_locs)

# Predictive accuracy of test data 
accuracy_train = sum(test_drop.y == preds_test.pred)/test_drop.shape[0]
accuracy_train

0.7

Based on age and income, the model predicted whether someone would vote or not with 70 percent accuracy. The test data was not as accurate as the training data (as is expected), but it is still more accurate than a coin flip. Likewise, using "white" vs "non-white" as a predictor, produced a model with 70 percent accuracy as well. 