# Naive Bayes Classification Models  
## Introduction  
The following Python code demonstrates basic __Naive Bayes__ classification with Spark MLlib. We create a small weather data set for predicting whether someone will 'Play' tennis. The data is hard-coded as a list of lists and loaded into a DataFrame, which is then mapped into an RDD of labeled point vectors. __Notice__ that since the feature vector of a `LabeledPoint` holds only numeric values, categorical variables are recoded as binary indicator variables. For example, __outlook__ = `sunny`, `overcast` or `rainy` is replaced by three variables: `sunny` (1 or 0), `overcast` (1 or 0), `rainy` (1 or 0).  
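The indicator-variable recoding described above can be sketched in plain Python, independent of Spark. The helper name `one_hot` and the list `outlook_levels` are hypothetical, introduced only for this illustration; the notebook itself uses a lookup dictionary instead.

```python
# Hypothetical helper (not part of the notebook): encode one categorical
# value as a list of binary indicator variables.
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == c else 0 for c in categories]

# Assumed ordering of the outlook levels: sunny, overcast, rainy
outlook_levels = ['sunny', 'overcast', 'rainy']

for level in outlook_levels:
    print(one_hot(level, outlook_levels))
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

Raising an error on an unknown category is a design choice for this sketch: a silent all-zero vector would be indistinguishable from a missing value.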

In [1]:
# Initialize the environment
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Create the `rawdata`, loosely based on the UCI weather data set
rawdata = [
['sunny',85,85,'FALSE',0],
['sunny',80,90,'TRUE',0],
['overcast',83,86,'FALSE',1],
['rainy',70,96,'FALSE',1],
['rainy',68,80,'FALSE',1],
['rainy',65,70,'TRUE',0],
['overcast',64,65,'TRUE',1],
['sunny',72,95,'FALSE',0],
['sunny',69,70,'FALSE',1],
['rainy',75,80,'FALSE',1],
['sunny',75,70,'TRUE',1],
['overcast',72,90,'TRUE',1],
['overcast',81,75,'FALSE',1],
['rainy',71,91,'TRUE',0]
]

# Create a Data Frame from the `rawdata`
from pyspark.sql import SQLContext,Row
sqlContext = SQLContext(sc)

data_df = sqlContext.createDataFrame(rawdata,
   ['outlook','temp','humid','windy','play'])

# Transform categorical variables into indicator variables
out2index = {'sunny':[1,0,0],'overcast':[0,1,0],'rainy':[0,0,1]}

# Make an RDD of labeled vectors
def newrow(dfrow):
    outrow = list(out2index.get((dfrow[0])))  #get dictionary entry for outlook
    outrow.append(dfrow[1])   #temp
    outrow.append(dfrow[2])   #humidity
    if dfrow[3]=='TRUE':      #windy
        outrow.append(1)
    else:
        outrow.append(0)
    return (LabeledPoint(dfrow[4],outrow))

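# Map over the DataFrame's underlying RDD (DataFrame.map is not available in newer Spark versions)
datax_rdd = data_df.rdd.map(newrow)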

- Verify the __RDD__ contents and the record count.

In [2]:
datax_rdd.take(5)

[LabeledPoint(0.0, [1.0,0.0,0.0,85.0,85.0,0.0]),
 LabeledPoint(0.0, [1.0,0.0,0.0,80.0,90.0,1.0]),
 LabeledPoint(1.0, [0.0,1.0,0.0,83.0,86.0,0.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,70.0,96.0,0.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,68.0,80.0,0.0])]

In [3]:
datax_rdd.count()

14

## Building the Naive Bayes Model  

In [4]:
# Import the Naive Bayes classifier and train the model
import scipy
from pyspark.mllib.classification import NaiveBayes

my_nbmodel = NaiveBayes.train(datax_rdd)

# Verify the trained model and run predictions on the training set
datax_col = datax_rdd.collect()
trainset_pred =[]
for x in datax_col:
    trainset_pred.append(my_nbmodel.predict(x.features))

print("Model object:")
print(my_nbmodel)

print("Model Prediction:")
print(trainset_pred)

Model object:
<pyspark.mllib.classification.NaiveBayesModel object at 0x7f3c593b1250>
Model Prediction:
[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]


In [5]:
# Generate a confusion matrix. Rows are the true class label (0 or 1), columns are the predicted label.
nb_cf_mat=np.zeros([2,2])  #num of classes
for pnt in datax_col:
    predctn = my_nbmodel.predict(np.array(pnt.features))
    # labels and predictions are floats; cast to int before using them as indices
    nb_cf_mat[int(pnt.label)][int(predctn)]+=1

corrcnt=0
for i in range(2):
    corrcnt+=nb_cf_mat[i][i]
nb_per_corr=corrcnt/nb_cf_mat.sum()

print("Naive Bayes Confusion Matrix:")
print(nb_cf_mat)
print("Percent Correct:")
print(nb_per_corr)

Naive Bayes Confusion Matrix:
[[ 3.  2.]
 [ 0.  9.]]
Percent Correct:
0.857142857143
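The confusion matrix also yields per-class metrics. As a minimal sketch, assuming class 1 ('play') is treated as the positive class (a choice made here only for illustration), accuracy, precision and recall can be derived from the counts printed above:

```python
# Confusion matrix copied from the output above:
# rows = true label (0, 1), columns = predicted label (0, 1)
cf = [[3.0, 2.0],
      [0.0, 9.0]]

total = sum(sum(row) for row in cf)

# Fraction of all points on the diagonal (correctly classified)
accuracy = (cf[0][0] + cf[1][1]) / total

# Assumption for this sketch: class 1 ('play') is the positive class
precision = cf[1][1] / (cf[0][1] + cf[1][1])  # of predicted 1s, how many are truly 1
recall = cf[1][1] / (cf[1][0] + cf[1][1])     # of true 1s, how many were predicted 1

print("accuracy  = %.3f" % accuracy)   # accuracy  = 0.857
print("precision = %.3f" % precision)  # precision = 0.818
print("recall    = %.3f" % recall)     # recall    = 1.000
```

Both misclassified points are true 0s predicted as 1, which is why recall for class 1 is perfect while precision is not.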




## Test Model Performance by introducing "Dummy" Variables  
In this scenario, we re-run the model against the same data set, but add a "useless" variable that is constant (always 1) for every row, placed after the 'play' column.  

In [6]:
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

#outlook,temperature,humidity,windy,play, copied from Weka's data example
rawdata=[
['sunny',85,85,'FALSE',0,1],
['sunny',80,90,'TRUE',0,1],
['overcast',83,86,'FALSE',1,1],
['rainy',70,96,'FALSE',1,1],
['rainy',68,80,'FALSE',1,1],
['rainy',65,70,'TRUE',0,1],
['overcast',64,65,'TRUE',1,1],
['sunny',72,95,'FALSE',0,1],
['sunny',69,70,'FALSE',1,1],
['rainy',75,80,'FALSE',1,1],
['sunny',75,70,'TRUE',1,1],
['overcast',72,90,'TRUE',1,1],
['overcast',81,75,'FALSE',1,1],
['rainy',71,91,'TRUE',0,1]
]

from pyspark.sql import SQLContext,Row
sqlContext = SQLContext(sc)

data_df=sqlContext.createDataFrame(rawdata,
                                   ['outlook','temp','humid','windy','play','mydummy']) #<--add field

#transform categoricals into indicator variables
out2index={'sunny':[1,0,0],'overcast':[0,1,0],'rainy':[0,0,1]}

#make RDD of labeled vectors
def newrow(dfrow):    
    outrow = list(out2index.get((dfrow[0])))  #get dictionary entry
    outrow.append(dfrow[1])   #temp    
    outrow.append(dfrow[2])   #humidity    
    if dfrow[3]=='TRUE':      #windy        
        outrow.append(1)    
    else:        
        outrow.append(0)    
    outrow.append(dfrow[5])  # <---- add this     
    return (LabeledPoint(dfrow[4],outrow))

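# Map over the DataFrame's underlying RDD (DataFrame.map is not available in newer Spark versions)
datax_rdd = data_df.rdd.map(newrow)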

In [7]:
datax_rdd.take(5)

[LabeledPoint(0.0, [1.0,0.0,0.0,85.0,85.0,0.0,1.0]),
 LabeledPoint(0.0, [1.0,0.0,0.0,80.0,90.0,1.0,1.0]),
 LabeledPoint(1.0, [0.0,1.0,0.0,83.0,86.0,0.0,1.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,70.0,96.0,0.0,1.0]),
 LabeledPoint(1.0, [0.0,0.0,1.0,68.0,80.0,0.0,1.0])]

In [8]:
from pyspark.mllib.classification import NaiveBayes

#train the model; training needs only a single pass over the data
my_nbmodel = NaiveBayes.train(datax_rdd)

#Some info on model 
#print my_nbmodel
#some checks,get some of training data and test it:
datax_col=datax_rdd.collect()   #if datax_rdd was big, use sample or take

trainset_pred =[]
for x in datax_col:
    trainset_pred.append(my_nbmodel.predict(x.features))

#print trainset_pred

#get a confusion matrix
#the row is the true class label 0 or 1, columns are predicted label
nb_cf_mat=np.zeros([2,2])  #num of classes
for pnt in datax_col:
    predctn = my_nbmodel.predict(np.array(pnt.features))
    # labels and predictions are floats; cast to int before using them as indices
    nb_cf_mat[int(pnt.label)][int(predctn)]+=1

corrcnt=0
for i in range(2):
    corrcnt+=nb_cf_mat[i][i]
nb_per_corr=corrcnt/nb_cf_mat.sum()

print("Naive Bayes Confusion Matrix:")
print(nb_cf_mat)
print("Percent Correct:")
print(nb_per_corr)

Naive Bayes Confusion Matrix:
[[ 3.  2.]
 [ 0.  9.]]
Percent Correct:
0.857142857143
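To see how a constant column enters the computation, here is a pure-Python sketch of multinomial naive Bayes scoring. It is not Spark's implementation: it assumes the multinomial model with additive smoothing of 1.0 (which we believe matches the MLlib defaults), and the names `train_multinomial_nb` and `predict` are hypothetical. The rows repeat the encoded training data from above.

```python
import math

# Encoded training data: ([sunny, overcast, rainy, temp, humid, windy], play)
rows = [
    ([1, 0, 0, 85, 85, 0], 0), ([1, 0, 0, 80, 90, 1], 0),
    ([0, 1, 0, 83, 86, 0], 1), ([0, 0, 1, 70, 96, 0], 1),
    ([0, 0, 1, 68, 80, 0], 1), ([0, 0, 1, 65, 70, 1], 0),
    ([0, 1, 0, 64, 65, 1], 1), ([1, 0, 0, 72, 95, 0], 0),
    ([1, 0, 0, 69, 70, 0], 1), ([0, 0, 1, 75, 80, 0], 1),
    ([1, 0, 0, 75, 70, 1], 1), ([0, 1, 0, 72, 90, 1], 1),
    ([0, 1, 0, 81, 75, 0], 1), ([0, 0, 1, 71, 91, 1], 0),
]

def train_multinomial_nb(data, smoothing=1.0):
    """Return {label: (log_prior, [log_theta_j, ...])} (hypothetical helper)."""
    n_features = len(data[0][0])
    labels = sorted(set(label for _, label in data))
    model = {}
    for c in labels:
        feats = [f for f, label in data if label == c]
        # Per-feature totals within class c, treated as event counts
        totals = [sum(f[j] for f in feats) for j in range(n_features)]
        denom = sum(totals) + smoothing * n_features
        log_theta = [math.log((t + smoothing) / denom) for t in totals]
        log_prior = math.log((len(feats) + smoothing) /
                             (len(data) + smoothing * len(labels)))
        model[c] = (log_prior, log_theta)
    return model

def predict(model, features):
    """Return the label maximizing log prior + sum_j x_j * log theta_j."""
    def score(c):
        log_prior, log_theta = model[c]
        return log_prior + sum(x * lt for x, lt in zip(features, log_theta))
    return max(sorted(model), key=score)

# Model without the constant column
base_model = train_multinomial_nb(rows)
base_pred = [predict(base_model, f) for f, _ in rows]

# Append the constant column (always 1) and retrain
rows_dummy = [(f + [1], label) for f, label in rows]
dummy_model = train_multinomial_nb(rows_dummy)
dummy_pred = [predict(dummy_model, f) for f, _ in rows_dummy]

print(base_pred)
print(dummy_pred)
```

In this formulation the constant column is not ignored: it adds one more smoothed count to each class and contributes the same extra term to the score of every point. Whether that changes any prediction depends on how close the class scores already are.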




## Test Model Performance on New Unseen Data    
In this scenario, we run the trained model against a new, unseen data point and inspect its prediction.
The `newpoint` corresponds to the following:

- Outlook binary indicator variables (sunny, overcast, rainy) = 1, 0, 0, i.e. sunny
- Temperature = 68
- Humidity = 79
- Windy = 0
- Useless-constant (dummy variable) = 1
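The mapping from raw values to this feature layout can be sketched as a small function. The name `to_features` is hypothetical and not part of the notebook; it reuses the `out2index` dictionary defined earlier and assumes the constant dummy value is 1.

```python
# Same lookup table as in the notebook
out2index = {'sunny': [1, 0, 0], 'overcast': [0, 1, 0], 'rainy': [0, 0, 1]}

def to_features(outlook, temp, humid, windy, dummy=1):
    """Hypothetical helper: build a feature list in the order
    [sunny, overcast, rainy, temp, humid, windy, dummy]."""
    windy_flag = 1 if windy == 'TRUE' else 0
    return list(out2index[outlook]) + [temp, humid, windy_flag, dummy]

print(to_features('sunny', 68, 79, 'FALSE'))
# [1, 0, 0, 68, 79, 0, 1]
```

Building new points through the same function used for training rows keeps the column order consistent, which matters because the model only sees positions, not names.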

In [9]:
# Add a new data point
newpoint  = np.array([1,0,0,68,79,0,1])

# Execute the new test point through the Naive Bayes Model
print("Prediction on new data:")
my_nbmodel.predict(newpoint)

Prediction on new data:


1.0