In [1]:
# Modules
import pandas as pd
import numpy as np
import scipy.stats as st # used for Normal PDF

# **Can we predict whether someone will vote or not?**

In the discussion folder, you'll find the `turnout.csv` data, which was drawn from the 2012 National Election Survey. The data records each respondent's age, education level (total years in school), income, race (Caucasian or not), and voting record (i.e. whether or not the respondent voted in the 2012 Presidential election). The sample is composed of 2000 individual respondents.

Please break the data up into a training (1600 entries, 80%) and test dataset (400 entries, 20%). 

Build a Naive Bayes classifier from scratch that predicts whether a respondent will vote in a presidential election, i.e. that estimates Pr(vote == 1). Do not use a third-party ML or statistical package.

Run your algorithm and see how it predicts on the test data by calculating the predictive accuracy. 

### Load, Split, and Inspect Data

In [2]:
# Set seed for reproducibility
np.random.seed(8191)

# Read in data from csv
dta = pd.read_csv("../turnout.csv")

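# Split train-test using Pandas. Take the test rows before resetting the
# index: after a reset, train.index is just 0..1599, so dropping it would
# remove the first 1600 rows of dta rather than the sampled ones.
train = dta.sample(frac=.8)
test = dta.drop(train.index).reset_index(drop=True)
train = train.reset_index(drop=True)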

# Print length of train-test
print("Training Data:",train.shape[0],"\nTest Data:",test.shape[0])

# Print random sample
train.sample(5)

Training Data: 1600 
Test Data: 400


Unnamed: 0,id,age,educate,income,vote,white
752,727,33,14.0,5.2331,1,1
1153,655,23,16.0,4.8954,1,1
615,951,50,10.0,2.917,1,1
1389,1499,48,15.0,3.46,1,1
472,756,55,13.0,8.7545,1,1
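
A quick sanity check for any index-based split is that no respondent lands in both sets. The sketch below runs that check on a made-up toy frame rather than on `turnout.csv`; the names `toy`, `toy_train`, and `toy_test` are illustrative, and it assumes `id` is unique per respondent.

```python
import pandas as pd

# Toy stand-in for the survey data; `id` is assumed unique per respondent.
toy = pd.DataFrame({"id": range(10), "vote": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]})

# Sample on the original index, take the complement, then reset both indexes.
toy_train = toy.sample(frac=0.8, random_state=0)
toy_test = toy.drop(toy_train.index)
toy_train = toy_train.reset_index(drop=True)
toy_test = toy_test.reset_index(drop=True)

# An empty intersection means the two sets share no respondents.
overlap = set(toy_train["id"]) & set(toy_test["id"])
print("Shared ids:", len(overlap))
```

The same intersection on `train.id` and `test.id` would flag any leakage in the real split.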


### Calculate class and conditional probabilities 

We will need these to calculate the Bayesian probabilities that give this classifier its name.
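
Written out, the quantity the classifier compares for each class $y \in \{0, 1\}$ is the class prior times the per-feature likelihoods, which Naive Bayes treats as independent given the class. The notation here ($x_j$ for the continuous features, $\mu_{jy}$ and $\sigma_{jy}$ for their class-conditional mean and standard deviation) is introduced for illustration and is not part of the original assignment:

```latex
\Pr(\text{vote}=y \mid \text{white}, x) \;\propto\;
  \Pr(\text{vote}=y)\,
  \Pr(\text{white} \mid \text{vote}=y)
  \prod_{j \in \{\text{age},\,\text{educate},\,\text{income}\}}
  \frac{1}{\sigma_{jy}\sqrt{2\pi}}
  \exp\!\left(-\frac{(x_j-\mu_{jy})^2}{2\sigma_{jy}^2}\right)
```

The predicted class is whichever $y$ gives the larger value.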

In [3]:
# Split the training data into categorical and continuous variables;
# the two types use slightly different calculations for their conditional probabilities
categorical = train[['vote', 'white']]
continuous = train[['age', 'educate', 'income', 'vote']]

In [4]:
# Create empty dictionaries to store class and conditional probabilities
class_probs = {}
conditional_probs = {}

# Get list of categorical columns that are not the dependent variable, 'vote'
cat_cols = [col for col in categorical.columns if col != 'vote']

# Loop through categorical data grouped by 'vote' dummy variable (2 times only)
# y is the value of the dummy(0 or 1), d is the data grouped according to y
for y, d in categorical.groupby('vote'): 
    
    # update class probability 
    class_probs.update({y: d.shape[0]/ categorical.shape[0]})
    
    # loop through categorical columns (except vote), calculate Pr(col == 1 | vote == y),
    # and store it in the conditional probs dictionary, which uses tuple keys
    # of the form (col_name, col_value, vote_class)
    for col in cat_cols:
        pr = d[col].sum()/d.shape[0]
        conditional_probs[(col,1,y)] = pr 
        conditional_probs[(col,0,y)] = 1 - pr
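
To make the key layout concrete, here is a minimal sketch on made-up data (six respondents, not the survey sample). It relies on the fact that the mean of a 0/1 dummy within a group is the conditional probability of a 1; the `toy_*` names are illustrative only.

```python
import pandas as pd

# Toy data (made up for illustration): 6 respondents, 4 of whom voted.
toy = pd.DataFrame({
    "vote":  [1, 1, 1, 1, 0, 0],
    "white": [1, 1, 1, 0, 1, 0],
})

# Class priors: the share of rows with each vote value.
toy_class_probs = toy["vote"].value_counts(normalize=True).to_dict()

# Pr(white == 1 | vote == y) is the mean of the 0/1 dummy within each group.
p_white = toy.groupby("vote")["white"].mean()

# Same tuple keys as above: (col_name, col_value, vote_class).
toy_conditional_probs = {}
for y, p in p_white.items():
    toy_conditional_probs[("white", 1, y)] = p
    toy_conditional_probs[("white", 0, y)] = 1 - p

print(toy_class_probs)
print(toy_conditional_probs[("white", 1, 1)])  # Pr(white == 1 | vote == 1)
```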

### Get the mean and standard deviation of each conditional distribution
We need the mean and standard deviation of each continuous variable within each vote class. These are the parameters of the normal pdf that supplies each continuous variable's likelihood.

In [5]:
#Create empty dictionary to store means and st devs
dist_locs = {}

# loop through all possible conditional continuous probabilities and add mean/st dev to dictionary 
for i in range(2): # use 2 because vote can be 0 or 1
    # create subset of data based on value of vote
    sub = train.query(f'vote == {i}')
    
    # loop through each continuous variable, make sure vote is not included (it shouldn't be anyway)
    con_cols = [col for col in continuous.columns if col != 'vote']
    
    # find and add mean/st dev to dictionary using tuple keys
    for col in con_cols:
        dist_locs[(col,i)] = {'mean':sub[col].mean(),'sd':sub[col].std()}
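
The notebook relies on `scipy.stats` only for the normal density. Since the assignment asks for a from-scratch classifier, the density could also be written by hand. This is a minimal sketch; `normal_pdf` is a hypothetical helper and the parameter values are made up rather than taken from `dist_locs`.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Made-up parameters standing in for an entry such as dist_locs[("age", 1)].
example = {"mean": 48.0, "sd": 17.0}
density = normal_pdf(33, example["mean"], example["sd"])
print(density)
```

Note that pandas' `.std()` returns the sample standard deviation (`ddof=1`) by default, which is what `dist_locs` stores.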

### Create the classifier and make the prediction

In [6]:
def predict(data, class_probs, conditional_probs, dist_locs):
    '''
    Predicts whether each respondent voted using a Naive Bayes classifier, based on
    a categorical value (white or not white) and continuous values (age, income, education).
    --------------------------
    Arguments:
        data (df): the rows to classify (train or test) from the turnout data
        class_probs (dict): the prior probability of each vote class
        conditional_probs (dict): the conditional probabilities for categorical variables
        dist_locs (dict): the mean and standard deviation of continuous variables
    Returns:
        pred (df): the unnormalized score for each class (pr_0, pr_1) and the predicted class (pred)
    '''
    # create empty list to store predictions 
    store_preds = []
    
    # loop through each row in the dataframe
    for i,row in data.iterrows():
        # initialize both probabilities at 1
        pr_0 = pr_1 = 1
        
        # loop through all continuous columns
        for j in ['age','income','educate']:
            # adjust Pr0 and Pr1 based on normal pdf of column
            pr_0 *= st.norm(dist_locs[(j,0)]['mean'],dist_locs[(j,0)]['sd']).pdf(row[j])
            pr_1 *= st.norm(dist_locs[(j,1)]['mean'],dist_locs[(j,1)]['sd']).pdf(row[j])
            
        # multiply in the categorical likelihood and the class prior
        pr_0 *= conditional_probs[('white',row.white,0)]*class_probs[0]
        pr_1 *= conditional_probs[('white',row.white,1)]*class_probs[1]
        
        # classify based on most likely probability 
        class_pred = 0 if pr_0 >= pr_1 else 1
        
        # store prediction in list
        store_preds.append([pr_0,pr_1,class_pred])
    
    # create dataframe of prediction info
    pred = pd.DataFrame(store_preds,columns=["pr_0","pr_1","pred"])
    return pred
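
The `pr_0` and `pr_1` values returned above are unnormalized scores, not probabilities, and products of many small densities can underflow as features are added. One common alternative, sketched here under made-up numbers, is to sum log terms and normalize at the end. `log_normal_pdf` and `posterior_vote` are hypothetical helpers, not part of the notebook.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2 * math.pi)

def posterior_vote(log_score_0, log_score_1):
    """Turn two unnormalized log scores into Pr(vote == 1 | features)."""
    # Subtract the larger score before exponentiating to avoid underflow.
    m = max(log_score_0, log_score_1)
    e0 = math.exp(log_score_0 - m)
    e1 = math.exp(log_score_1 - m)
    return e1 / (e0 + e1)

# Made-up numbers: class prior, Pr(white | vote), and one continuous feature.
log_score_0 = math.log(0.25) + math.log(0.8) + log_normal_pdf(33, 43.0, 19.0)
log_score_1 = math.log(0.75) + math.log(0.9) + log_normal_pdf(33, 47.0, 17.0)

p_vote = posterior_vote(log_score_0, log_score_1)
print(round(p_vote, 3))
```

Classifying as 1 when `p_vote` exceeds 0.5 gives the same decision as comparing the two raw scores.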

### Inspect Training Accuracy
Let's see how we did on the training data. A high accuracy score here does not necessarily mean we have a good classifier; the model may simply have overfit the training data.

In [7]:
# get predictions of train data
train_preds = predict(train, class_probs, conditional_probs, dist_locs)

# calculate accuracy of train data
train_accuracy = sum(train.vote == train_preds.pred)/train.shape[0]
train_accuracy

0.74

74% accuracy, not bad! Now we hope to see a similar number in the test data set.

### Inspect Test Accuracy
This is where it counts!

In [8]:
# get predictions of test data
test_preds = predict(test, class_probs, conditional_probs, dist_locs)

# calculate accuracy of test data
test_accuracy = sum(test.vote == test_preds.pred)/test.shape[0]
test_accuracy

0.7125
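
A single accuracy number hides which kind of mistake the classifier makes. A small sketch of tallying (actual, predicted) pairs, using made-up labels rather than the notebook's predictions; `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Made-up labels for illustration only.
actual = [1, 1, 0, 1, 0, 1]
predicted = [1, 1, 1, 1, 0, 0]

counts = confusion_counts(actual, predicted)
for (a, p), n in sorted(counts.items()):
    print(f"actual={a} predicted={p}: {n}")
```

In the notebook, the same function could be called with `test.vote` and `test_preds.pred`.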

### Does your model perform better than chance (i.e. coin flip)?

71%! That's very close to our training accuracy and much better than a coin flip. The model performed well, but with so few variables we were never going to predict voting with a very high degree of accuracy, since so many other things affect whether someone votes.
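
A coin flip is a weak benchmark when the classes are unbalanced: always predicting the more common class can already beat 50%. The sketch below shows how such a majority-class baseline might be computed. The labels are made up, and `accuracy` and `majority_baseline` are hypothetical helpers.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of positions where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(train_labels, test_labels):
    """Accuracy from always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority] * len(test_labels))

# Made-up labels for illustration only.
toy_train_labels = [1, 1, 1, 0, 1, 0, 1, 1]
toy_test_labels = [1, 0, 1, 1]

print(majority_baseline(toy_train_labels, toy_test_labels))
```

Calling `majority_baseline(list(train.vote), list(test.vote))` would give the number the classifier's test accuracy needs to beat.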