### PPOL 564 - Coding Discussion #5<br/>Ryan Ripper<br/>11/14/21

In [1]:
# Import the necessary Python Modules.
import pandas as pd # For Pandas.
import numpy as np # For Numpy.
import pprint as pp # For printing.
import scipy.stats as st # For Normal PDF.
import matplotlib.pyplot as plt
from plotnine import *

# Silence warnings.
import warnings
warnings.filterwarnings("ignore")

We will build a Naive Bayes classifier that uses both discrete and continuous predictors to predict whether or not an individual voted in the 2012 Presidential election. To handle both kinds of predictor, we build two separate dictionaries: one holds the conditional probabilities for the discrete predictors, and the other holds the mean and standard deviation of each continuous predictor's conditional distribution. Once both dictionaries are built, we look up the relevant entry whenever it is needed. For the discrete predictors, we use the exact conditional probabilities stored in the dictionary. For the continuous predictors, we estimate the conditional probabilities using a Gaussian distribution.
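
The decision rule described above can be written compactly. This is a sketch of the standard Naive Bayes rule; the symbols $x_j$ (the value of predictor $j$), $\mu_{jy}$ and $\sigma_{jy}$ (the mean and standard deviation of predictor $j$ within class $y$) are notation introduced here and do not appear in the notebook.

```latex
\hat{y} = \arg\max_{y \in \{0, 1\}} \; P(y) \prod_{j} P(x_j \mid y),
\qquad
P(x_j \mid y) \approx \frac{1}{\sigma_{jy}\sqrt{2\pi}}
\exp\left(-\frac{(x_j - \mu_{jy})^2}{2\sigma_{jy}^2}\right)
\quad \text{for continuous } x_j
```

For a discrete predictor, $P(x_j \mid y)$ is the observed share of class-$y$ training observations that take the value $x_j$.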

In [2]:
# Load in the Turnout data.
turnout_data = pd.read_csv("../turnout.csv")

# Remove "id" and reorder columns in the Turnout data.
turnout_data = turnout_data.reindex(columns = ["vote", "age", "educate", "income", "white"])

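# Break up the data into training (80% of total) and testing (20% of total) data.
# Keep the original index on the sample until the sampled rows have been dropped
# from the full data. Resetting the index first would drop rows by position
# instead of dropping the rows that were actually sampled.
train = turnout_data.sample(frac = .8)
test = turnout_data.drop(train.index).reset_index(drop = True)
train = train.reset_index(drop = True)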

In [3]:
# Check to make sure the data was loaded properly.

# Print off the split count.
print("Training Data:", train.shape[0],
      "\nTest Data:", test.shape[0])

# Look at the head of the training data.
train.head()

Training Data: 1600 
Test Data: 400


| | vote | age | educate | income | white |
|---|---|---|---|---|---|
| 0 | 1 | 64 | 16.0 | 5.8684 | 1 |
| 1 | 1 | 65 | 12.0 | 3.7435 | 1 |
| 2 | 0 | 30 | 12.0 | 1.726 | 1 |
| 3 | 0 | 22 | 7.0 | 0.2364 | 1 |
| 4 | 1 | 46 | 10.0 | 1.3566 | 0 |


In [4]:
# First, calculate class probabilities.

# Collect observations that did vote (first subset).
vote1 = train.query("vote == 1")

# Collect observations that did not vote (second subset).
vote0 = train.query("vote == 0")

# Calculate class probabilities for observations that did vote and did not vote.
pr_vote1 = vote1.shape[0] / train.shape[0]
pr_vote0 = vote0.shape[0] / train.shape[0]

In [5]:
# Second, calculate the conditional probabilities.

# Consider the discrete predictors first.

# Collect the conditional probabilities in a dictionary with tuple keys ("white", [0, 1], [0, 1]).
# The second item in the tuple key designates whether or not the observation is white.
# The third item in the tuple key designates whether or not the observation voted.
dist_locs_discrete = \
{
    ("white", 0, 0) : vote0.query("white == 0").shape[0] / vote0.shape[0],
    ("white", 0, 1) : vote1.query("white == 0").shape[0] / vote1.shape[0],
    ("white", 1, 0) : vote0.query("white == 1").shape[0] / vote0.shape[0],
    ("white", 1, 1) : vote1.query("white == 1").shape[0] / vote1.shape[0]
}

In [6]:
# Consider the continuous predictors.

# Collect the mean and standard deviation of each conditional distribution.
# The means and standard deviations are held in a dictionary with tuple keys (predictor, [0, 1]).
# The first item in the tuple key designates the predictor.
# The second item in the tuple key designates whether or not the observation voted.
dist_locs_continuous = \
{
    ("age", 0) : {"mean" : vote0.age.mean(), "sd" : vote0.age.std()},
    ("age", 1) : {"mean" : vote1.age.mean(), "sd" : vote1.age.std()},
    ("educate", 0) : {"mean" : vote0.educate.mean(), "sd" : vote0.educate.std()},
    ("educate", 1) : {"mean" : vote1.educate.mean(), "sd" : vote1.educate.std()},
    ("income", 0) : {"mean" : vote0.income.mean(), "sd" : vote0.income.std()},
    ("income", 1) : {"mean" : vote1.income.mean(), "sd" : vote1.income.std()}   
}
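
To make explicit what is later computed from each stored mean and standard deviation, here is a minimal sketch of the Gaussian density written with the standard library only. The function name `gaussian_pdf` and the parameter values are hypothetical and chosen purely for illustration; they are not estimates from the turnout data. The result should agree with `st.norm(mean, sd).pdf(x)`.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Hypothetical parameters for illustration only (not values from the data).
example_params = {"mean": 45.0, "sd": 17.0}

# The density is highest at the mean and falls off as x moves away from it.
density_at_mean = gaussian_pdf(45.0, **example_params)
density_far_away = gaussian_pdf(90.0, **example_params)
```

Note that a density is not a probability and can exceed 1 when the standard deviation is small. Naive Bayes only compares the products across classes, so this does not affect the predicted class.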

In [7]:
"""
Create a function that takes data along with the dictionaries for both the discrete and
continuous predictors and predicts multiple observations.
"""
def predict(data, dist_locs_discrete, dist_locs_continuous):
    """
    The predict function creates a Pandas DataFrame.
    The DataFrame holds the associated probability of the observation voting or not and the corresponding prediction.
    
    Arguments
    -----
    data: Pandas DataFrame
        A Pandas DataFrame containing observations from the 2012 National Election Survey.
        
    dist_locs_discrete: dictionary
        A dictionary with tuple keys holding the conditional probabilities for the discrete predictors in the data.
        
    dist_locs_continuous: dictionary
        A dictionary with tuple keys holding the mean and standard deviation of each continuous predictor by class.
    
    return
    -----
    A Pandas DataFrame with corresponding probabilities for an observation voting, not voting, and prediction.
    """
    
    # Create list to store all the predictions in.
    store_preds = []
    
    # Iterate through all the rows in the data to be predicted.
    for i, row in data.iterrows():
        
        # Initialize the probability to 1 so that we can multiply on to the predicted probability.
        pr_0 = 1
        pr_1 = 1
        
        for j in range(1, len(row)):
            # Consider only the continuous predictors.
            if (row.index[j] != "white"):
                # Get the conditional densities using a Gaussian distribution.
                pr_0 *= st.norm(dist_locs_continuous[(row.index[j], 0)]['mean'],
                                dist_locs_continuous[(row.index[j], 0)]['sd']).pdf(row.values[j])

                pr_1 *= st.norm(dist_locs_continuous[(row.index[j], 1)]['mean'], 
                                dist_locs_continuous[(row.index[j], 1)]['sd']).pdf(row.values[j])
            # Consider the discrete predictors.
            else:
                # Pull in the probability of this value of white given did not vote from the dictionary.
                pr_0 *= dist_locs_discrete[(row.index[j], row.values[j], 0)]
        
                # Pull in the probability of this value of white given did vote from the dictionary.
                pr_1 *= dist_locs_discrete[(row.index[j], row.values[j], 1)]
        
        # Multiply on the class probability for did not vote.
        pr_0 *= pr_vote0
        
        # Multiply on the class probability for did vote.
        pr_1 *= pr_vote1
        
        # Assign the class designation to the highest probability.
        if pr_0 >= pr_1:
            class_pred = 0
        else:
            class_pred = 1
        
        # Add the prediction to the list of final predictions.
        store_preds.append([pr_0, pr_1, class_pred])
    
    # Return the list of all predictions as a Pandas DataFrame.
    return pd.DataFrame(store_preds, columns = ["pr_0", "pr_1", "pred"])

In [8]:
# Run on the training data.
preds_train = predict(train, dist_locs_discrete, dist_locs_continuous)

In [9]:
# Examine the predicted output.
preds_train.head(10)

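| | pr_0 | pr_1 | pred |
|---|---|---|---|
| 0 | 4e-06 | 7.4e-05 | 1 |
| 1 | 3.9e-05 | 0.000136 | 1 |
| 2 | 5.9e-05 | 0.000107 | 1 |
| 3 | 1.4e-05 | 8e-06 | 0 |
| 4 | 2.1e-05 | 1.6e-05 | 0 |
| 5 | 6e-06 | 0.000115 | 1 |
| 6 | 1e-06 | 8e-06 | 1 |
| 7 | 1.5e-05 | 2.7e-05 | 1 |
| 8 | 2.7e-05 | 0.000128 | 1 |
| 9 | 1.1e-05 | 0.000144 | 1 |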
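
The scores above are already on the order of 1e-6 with only four predictors, and a product of many small terms can eventually underflow to zero. A common alternative, not used in this notebook, is to sum logarithms instead of multiplying probabilities. The sketch below illustrates that variant under stated assumptions: the helper names (`log_gaussian_pdf`, `log_score`), the data layout, and all numbers are hypothetical, and every probability is assumed to be strictly positive.

```python
import math


def log_gaussian_pdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution evaluated at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2 * math.pi)


def log_score(row, prior, discrete_probs, continuous_params):
    """Log of prior times the product of conditional terms for one class.

    row: dict mapping predictor name to observed value.
    discrete_probs: dict mapping predictor name to {value: probability}.
    continuous_params: dict mapping predictor name to (mean, sd).
    """
    total = math.log(prior)
    for name, value in row.items():
        if name in discrete_probs:
            total += math.log(discrete_probs[name][value])
        else:
            mean, sd = continuous_params[name]
            total += log_gaussian_pdf(value, mean, sd)
    return total


# Hypothetical observation and per-class parameters (illustration only).
row = {"age": 40.0, "white": 1}
score_0 = log_score(row, 0.25, {"white": {0: 0.2, 1: 0.8}}, {"age": (40.0, 10.0)})
score_1 = log_score(row, 0.75, {"white": {0: 0.1, 1: 0.9}}, {"age": (50.0, 10.0)})

# The class with the larger log score is the predicted class.
predicted = 0 if score_0 >= score_1 else 1
```

Because the logarithm is monotonic, comparing log scores selects the same class as comparing the original products.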


In [10]:
# Examine the predictive accuracy of the training data.
accuracy_train = sum(train.vote == preds_train.pred) / train.shape[0]
accuracy_train

0.743125

In [11]:
# Run on the test data.
preds_test = predict(test, dist_locs_discrete, dist_locs_continuous)

In [12]:
# Examine the predictive accuracy on the test data.
accuracy_test = sum(test.vote == preds_test.pred) / test.shape[0]
accuracy_test

0.71

Our model correctly predicts whether or not someone voted in the 2012 Presidential election for 74.3% of the training observations. On the test data the accuracy is 71%, so out-of-sample performance is slightly lower than in-sample performance. The model predicts the vote result correctly in a clear majority of cases, which is better than a coin flip; a stricter benchmark is the accuracy obtained by always predicting the more common class.
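
One way to put the accuracy figures in context is to compare them with a majority-class baseline, which always predicts whichever class is more common in the training data. The sketch below is a minimal illustration; the function name `majority_baseline_accuracy` and the toy labels are hypothetical and are not taken from the turnout data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_class = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_class)
    return hits / len(test_labels)


# Toy labels for illustration only (1 = voted, 0 = did not vote).
toy_train = [1, 1, 1, 0, 1, 0, 1, 1]
toy_test = [1, 0, 1, 1]

baseline = majority_baseline_accuracy(toy_train, toy_test)
```

With the notebook's data, the same function could presumably be called as `majority_baseline_accuracy(train.vote, test.vote)` and the result compared against `accuracy_test`.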