# Classification with Decision Tree and Naive Bayes example

### Importing MLlib libraries

In [1]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.classification import NaiveBayes
import pyspark.mllib.linalg
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark import SparkConf, SparkContext

### Read data and pre-processing
The **Iris flower data set**, or Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and width of the sepals and of the petals, in centimetres.

In [2]:
...
conf = SparkConf()
...
sc = SparkContext(conf=conf)

In [3]:
rawData = sc.textFile("data/iris.csv")

In [4]:
'''
val splitlines = rawData.map(lines => {
    lines.split(',')
  })
splitlines.first()
'''
splitlines = rawData.map(lambda lines: lines.split(','))
splitlines.first()

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']

In [5]:
'''
val Data = splitlines.map { col =>   
     val species = col(col.size - 1)                       
     val label = if (species == "Iris-versicolor") 0.toInt else if (species == "Iris-setosa") 1.toInt else 2.toInt
     val features = col.slice(0, col.size - 1).map(_.toDouble)
     LabeledPoint(label, Vectors.dense(features))
}
Data.take(5).foreach(println)
'''
def iris_label(argument):
    iris = {
        "Iris-versicolor": 0,
        "Iris-setosa": 1,
        "Iris-virginica": 2,
    }
    return iris.get(argument, 0)

def extractor(col):
    # The last column is the species name; the others are numeric features.
    species = col[-1]
    label = iris_label(species)
    features = [float(x) for x in col[:-1]]
    return LabeledPoint(label, Vectors.dense(features))

Data = splitlines.map(extractor)

for x in Data.take(5):
    print(x)

(1.0,[5.1,3.5,1.4,0.2])
(1.0,[4.9,3.0,1.4,0.2])
(1.0,[4.7,3.2,1.3,0.2])
(1.0,[4.6,3.1,1.5,0.2])
(1.0,[5.0,3.6,1.4,0.2])
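
Note that `iris.get(argument, 0)` silently labels any unrecognised species string as class 0 (versicolor). As a minimal sketch, assuming every valid row has exactly four numeric columns followed by a species name, a stricter parser could reject malformed rows instead. The names `SPECIES_TO_LABEL` and `parse_row` are hypothetical and not part of the notebook; the function uses plain Python so it can be tried without Spark.

```python
# Same species-to-label mapping as the notebook's iris_label function.
SPECIES_TO_LABEL = {
    "Iris-versicolor": 0,
    "Iris-setosa": 1,
    "Iris-virginica": 2,
}


def parse_row(line):
    """Parse one CSV line into (label, features), raising on bad input."""
    cols = line.strip().split(",")
    if len(cols) != 5:
        raise ValueError("expected 5 columns, got %d: %r" % (len(cols), line))
    species = cols[-1]
    if species not in SPECIES_TO_LABEL:
        raise ValueError("unknown species: %r" % species)
    features = [float(x) for x in cols[:-1]]
    return SPECIES_TO_LABEL[species], features


label, features = parse_row("5.1,3.5,1.4,0.2,Iris-setosa")
print(label, features)
```

In the Spark pipeline, the returned pair would be wrapped in `LabeledPoint(label, Vectors.dense(features))`, as in the cell above.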


### Split the data into training and test sets (40% held out for testing)

In [7]:
'''
val splits = Data.randomSplit(Array(0.6, 0.4), seed = 11L)
val trainingData = splits(0).cache()
val testData = splits(1)
println("Training Data")
trainingData.take(5).foreach(println)
println("Test Data")
testData.take(5).foreach(println)
'''
splits = Data.randomSplit([0.6, 0.4], seed = 11)
trainingData = splits[0].cache()
testData = splits[1]
print(trainingData.count())
print("Training Data")
for x in trainingData.take(5):
    print(x)
print("Test Data")
for x in testData.take(5):
    print(x)

98
Training Data
(1.0,[5.1,3.5,1.4,0.2])
(1.0,[4.9,3.0,1.4,0.2])
(1.0,[5.0,3.6,1.4,0.2])
(1.0,[5.4,3.9,1.7,0.4])
(1.0,[4.6,3.4,1.4,0.3])
Test Data
(1.0,[4.7,3.2,1.3,0.2])
(1.0,[4.6,3.1,1.5,0.2])
(1.0,[5.0,3.4,1.5,0.2])
(1.0,[4.4,2.9,1.4,0.2])
(1.0,[4.9,3.1,1.5,0.1])
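
The training set has 98 rows rather than exactly 90 (60% of 150) because `randomSplit` assigns each row to a split at random according to the weights, so the split sizes are only approximately proportional. The pure-Python sketch below illustrates that idea under simple assumptions; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds in [0, 1], e.g. [0.6, 1.0] for weights [0.6, 0.4].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                parts[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            parts[-1].append(row)
    return parts


train, test = random_split(list(range(150)), [0.6, 0.4], seed=11)
print(len(train), len(test))
```

The two sizes always add up to 150, but the exact counts depend on the seed and the random number generator, which is why they will not match Spark's 98 and 52.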


### Train a Decision Tree
Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks. MLlib supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
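
The tree trained below uses `"entropy"` as its impurity measure. As a rough illustration of what that measure computes, the sketch below evaluates the entropy of a set of labels and the information gain of a candidate split in plain Python. The helper names `entropy` and `information_gain` are hypothetical, and the base-2 logarithm is an assumption made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) \
        + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children


parent = [0, 0, 1, 1]
print(entropy(parent))                            # 1.0: two equally likely classes
print(information_gain(parent, [0, 0], [1, 1]))   # 1.0: the split separates them fully
```

A tree learner of this kind picks, at each node, the feature and threshold whose split gives the largest gain.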

In [8]:
'''
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "entropy"
val maxDepth = 3
val maxBins = 10
val dtModel = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
println(dtModel.toDebugString)
'''
numClasses = 3
categoricalFeaturesInfo = {}  # empty dict: all features are treated as continuous
impurity = "entropy"
maxDepth = 3
maxBins = 10
dtModel = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
print(dtModel.toDebugString())

DecisionTreeModel classifier of depth 3 with 13 nodes
  If (feature 2 <= 3.5)
   If (feature 1 <= 2.6)
    If (feature 0 <= 4.8)
     Predict: 1.0
    Else (feature 0 > 4.8)
     Predict: 0.0
   Else (feature 1 > 2.6)
    Predict: 1.0
  Else (feature 2 > 3.5)
   If (feature 3 <= 1.7)
    If (feature 2 <= 5.3)
     Predict: 0.0
    Else (feature 2 > 5.3)
     Predict: 2.0
   Else (feature 3 > 1.7)
    If (feature 2 <= 5.0)
     Predict: 2.0
    Else (feature 2 > 5.0)
     Predict: 2.0
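
The debug string above can be read as nested if/else rules. As a sketch, the function below hand-transcribes the printed tree into plain Python. `predict_from_printed_tree` is a hypothetical name, and the sample rows are illustrative measurements in the column order of the first row shown earlier (assumed to be sepal length, sepal width, petal length, petal width). Labels follow the notebook's mapping: 0 = versicolor, 1 = setosa, 2 = virginica.

```python
def predict_from_printed_tree(f):
    """Apply the rules from dtModel.toDebugString() to a feature list f."""
    if f[2] <= 3.5:
        if f[1] <= 2.6:
            return 1.0 if f[0] <= 4.8 else 0.0
        return 1.0
    if f[3] <= 1.7:
        return 0.0 if f[2] <= 5.3 else 2.0
    # Both leaves under "feature 3 > 1.7" predict 2.0, so the final
    # split on feature 2 does not change the prediction.
    return 2.0


print(predict_from_printed_tree([5.1, 3.5, 1.4, 0.2]))  # 1.0 (setosa)
print(predict_from_printed_tree([7.0, 3.2, 4.7, 1.4]))  # 0.0 (versicolor)
print(predict_from_printed_tree([6.3, 3.3, 6.0, 2.5]))  # 2.0 (virginica)
```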



In [9]:
'''
val dtTotalCorrect = trainingData.map { point =>
  if (dtModel.predict(point.features) == point.label) 1 else 0
  }.sum

println(dtTotalCorrect)
println(trainingData.count)
'''
predictions = dtModel.predict(trainingData.map(lambda point: point.features))
labelsAndPredictions = trainingData.map(lambda lp: lp.label).zip(predictions)
dtTotalCorrect = labelsAndPredictions.map(lambda line: 1 if line[0]==line[1] else 0).sum()

'''
for x in trainingData.collect():
    print(dtModel.predict(x.features))
'''

print(dtTotalCorrect)
print(trainingData.count())

95
98


In [10]:
'''
val dtAccuracy = dtTotalCorrect / trainingData.count
println(dtAccuracy)
'''
dtAccuracy = dtTotalCorrect / trainingData.count()
print(dtAccuracy)

0.9693877551020408


In [11]:
'''
val dtTotalCorrect = testData.map { point =>
    if (dtModel.predict(point.features) == point.label) 1 else 0
}.sum

println(dtTotalCorrect)
println(testData.count)
'''
predictions = dtModel.predict(testData.map(lambda point: point.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
dtTotalCorrect = labelsAndPredictions.map(lambda line: 1 if line[0]==line[1] else 0).sum()

print(dtTotalCorrect)
print(testData.count())

51
52


In [12]:
'''
val dtAccuracy = dtTotalCorrect / testData.count
println(dtAccuracy)
'''
dtAccuracy = dtTotalCorrect / testData.count()
print(dtAccuracy)

0.9807692307692307


### Train Naive Bayes
Naive Bayes is a simple multiclass classification algorithm that assumes every pair of features is conditionally independent given the label. It can be trained very efficiently: in a single pass over the training data, it computes the conditional probability distribution of each feature given the label, then applies Bayes' theorem to compute the conditional probability distribution of the label given an observation, and uses that distribution for prediction.
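
To make the single-pass training concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier with additive smoothing. It assumes the multinomial formulation, in which non-negative feature values are treated like counts; whether it matches MLlib's internals in every detail is an assumption, not something this notebook verifies. The names `train_multinomial_nb` and `predict_nb`, and the toy data, are hypothetical.

```python
import math


def train_multinomial_nb(points, smoothing=1.0):
    """points: list of (label, features) with non-negative feature values.
    Returns {label: (log_prior, [log_theta_j, ...])}."""
    num_features = len(points[0][1])
    counts = {}  # label -> number of training points
    sums = {}    # label -> per-feature sums of the feature values
    for label, features in points:  # the single pass over the data
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * num_features)
        for j, x in enumerate(features):
            acc[j] += x
    n, k = len(points), len(counts)
    model = {}
    for label in counts:
        log_prior = math.log((counts[label] + smoothing) / (n + k * smoothing))
        denom = sum(sums[label]) + num_features * smoothing
        log_theta = [math.log((s + smoothing) / denom) for s in sums[label]]
        model[label] = (log_prior, log_theta)
    return model


def predict_nb(model, features):
    """Return the label with the highest log posterior (up to a constant)."""
    def score(label):
        log_prior, log_theta = model[label]
        return log_prior + sum(x * t for x, t in zip(features, log_theta))
    return max(model, key=score)


toy = [(0.0, [3.0, 0.0]), (0.0, [2.0, 1.0]),
       (1.0, [0.0, 3.0]), (1.0, [1.0, 2.0])]
model = train_multinomial_nb(toy, smoothing=1.0)
print(predict_nb(model, [4.0, 0.0]))  # 0.0
print(predict_nb(model, [0.0, 4.0]))  # 1.0
```

Because this formulation models features as counts rather than continuous measurements, it may be a poor fit for the Iris features, which could partly explain the accuracy reported below.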

In [13]:
'''
val nbModel = NaiveBayes.train(trainingData)
println(nbModel)
'''
nbModel = NaiveBayes.train(trainingData, 1.0)  # 1.0 is the additive smoothing parameter (lambda)
print(nbModel)

<pyspark.mllib.classification.NaiveBayesModel object at 0x7ffb409b4780>


In [14]:
'''
val nbTotalCorrect = trainingData.map { point =>
    if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
println(nbTotalCorrect)
println(trainingData.count)
'''
predictions = nbModel.predict(trainingData.map(lambda point: point.features))
labelsAndPredictions = trainingData.map(lambda lp: lp.label).zip(predictions)
nbTotalCorrect = labelsAndPredictions.map(lambda line: 1 if line[0]==line[1] else 0).sum()
print(nbTotalCorrect)
print(trainingData.count())

70
98


In [15]:
'''
val nbAccuracy = nbTotalCorrect / trainingData.count
println(nbAccuracy)
'''
nbAccuracy = nbTotalCorrect / trainingData.count()
print(nbAccuracy)

0.7142857142857143


### Test

In [16]:
'''
val nbTotalCorrect = testData.map { point =>
    if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum

println(nbTotalCorrect)
println(testData.count)
'''
predictions = nbModel.predict(testData.map(lambda point: point.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
nbTotalCorrect = labelsAndPredictions.map(lambda line: 1 if line[0]==line[1] else 0).sum()
print(nbTotalCorrect)
print(testData.count())

30
52


In [17]:
'''
val nbAccuracy = nbTotalCorrect / testData.count
println(nbAccuracy)
'''
nbAccuracy = nbTotalCorrect / testData.count()
print(nbAccuracy)

0.5769230769230769


### Complete evaluation on test set for Decision Tree model

In [18]:
'''
val predictionAndLabels = testData.map { case LabeledPoint(label, features) =>
  val prediction = dtModel.predict(features)
  (prediction, label)
}
'''
predictions = dtModel.predict(testData.map(lambda point: point.features))
predictionAndLabels = predictions.zip(testData.map(lambda lp: lp.label))

'''
// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)
'''
metrics = MulticlassMetrics(predictionAndLabels)

'''
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)
'''
print("Confusion matrix:")
#print(metrics.confusionMatrix())

'''
// Overall Statistics
val precision = metrics.precision
val recall = metrics.recall // same as true positive rate
val f1Score = metrics.fMeasure
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")
'''
precision = metrics.precision()
recall = metrics.recall() # same as true positive rate
f1Score = metrics.fMeasure()
print("Summary Statistics")
print("Precision = ", precision)
print("Recall = ", recall)
print("F1 Score = ", f1Score)

'''
// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}
'''
labels = trainingData.map(lambda l:l.label).distinct().sortBy(lambda l: l).collect()
for l in labels:
    print("Precision(%s) = %s" % (l,metrics.precision(l)))
    
'''
// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}
'''
for l in labels:
    print("Recall(%s) = %s" % (l,metrics.recall(l)))

'''
// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}
'''
for l in labels:
    print("FPR(%s) = %s" % (l,metrics.falsePositiveRate(l)))

'''
// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}
'''
for l in labels:
    print("F1-Score(%s) = %s" % (l,metrics.fMeasure(l)))

'''
// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")
'''
print("Weighted precision: ", metrics.weightedPrecision)
print("Weighted recall: ", metrics.weightedRecall)
print("Weighted F1 score: ", metrics.weightedFMeasure())
print("Weighted false positive rate: ", metrics.weightedFalsePositiveRate)

Confusion matrix:
Summary Statistics
Precision =  0.9807692307692307
Recall =  0.9807692307692307
F1 Score =  0.9807692307692307
Precision(0.0) = 0.9565217391304348
Precision(1.0) = 1.0
Precision(2.0) = 1.0
Recall(0.0) = 1.0
Recall(1.0) = 1.0
Recall(2.0) = 0.9333333333333333
FPR(0.0) = 0.03333333333333333
FPR(1.0) = 0.0
FPR(2.0) = 0.0
F1-Score(0.0) = 0.9777777777777777
F1-Score(1.0) = 1.0
F1-Score(2.0) = 0.9655172413793104
Weighted precision:  0.9816053511705686
Weighted recall:  0.9807692307692307
Weighted F1 score:  0.9806513409961686
Weighted false positive rate:  0.014102564102564101
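
The confusion matrix itself was not printed, but the per-label figures above are consistent with a single error: one class 2 sample predicted as class 0. The sketch below uses a confusion matrix inferred from those figures (an inference, not output from the notebook) to show how each per-label metric is computed. `per_label_metrics` is a hypothetical helper.

```python
# Rows are actual labels 0, 1, 2; columns are predicted labels 0, 1, 2.
# Inferred from the reported precision, recall and FPR values (52 test rows).
confusion = [
    [22, 0, 0],
    [0, 15, 0],
    [1, 0, 14],
]


def per_label_metrics(confusion, i):
    """Return (precision, recall, false positive rate, F1) for label index i."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[i][i]
    actual = sum(confusion[i])                    # rows truly in class i
    predicted = sum(row[i] for row in confusion)  # rows predicted as class i
    fp = predicted - tp
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    fpr = fp / (total - actual) if total != actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, fpr, f1


for i in range(3):
    print(i, per_label_metrics(confusion, i))
```

For label 0 this gives a precision of 22/23, a recall of 1.0 and a false positive rate of 1/30, in line with the values printed above.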


For definitions of these label-based metrics, see http://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#label-based-metrics

### Complete evaluation on test set for Naive Bayes model

In [19]:
'''
val predictionAndLabels = testData.map { case LabeledPoint(label, features) =>
  val prediction = nbModel.predict(features)
  (prediction, label)
}
'''
predictions = nbModel.predict(testData.map(lambda point: point.features)).map(lambda pre: float(pre))
predictionAndLabels = predictions.zip(testData.map(lambda lp: lp.label))

'''
// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)
'''
metrics = MulticlassMetrics(predictionAndLabels)

'''
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)
'''
print("Confusion matrix:")
#print(metrics.confusionMatrix())

'''
// Overall Statistics
val precision = metrics.precision
val recall = metrics.recall // same as true positive rate
val f1Score = metrics.fMeasure
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")
'''
precision = metrics.precision()
recall = metrics.recall() # same as true positive rate
f1Score = metrics.fMeasure()
print("Summary Statistics")
print("Precision = ", precision)
print("Recall = ", recall)
print("F1 Score = ", f1Score)

'''
// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}
'''
labels = trainingData.map(lambda l:l.label).distinct().sortBy(lambda l: l).collect()
for l in labels:
    print("Precision(%s) = %s" % (l,metrics.precision(l)))
    
'''
// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}
'''
for l in labels:
    print("Recall(%s) = %s" % (l,metrics.recall(l)))

'''
// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}
'''
for l in labels:
    print("FPR(%s) = %s" % (l,metrics.falsePositiveRate(l)))

'''
// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}
'''
for l in labels:
    print("F1-Score(%s) = %s" % (l,metrics.fMeasure(l)))

'''
// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")
'''
print("Weighted precision: ", metrics.weightedPrecision)
print("Weighted recall: ", metrics.weightedRecall)
print("Weighted F1 score: ", metrics.weightedFMeasure())
print("Weighted false positive rate: ", metrics.weightedFalsePositiveRate)

Confusion matrix:
Summary Statistics
Precision =  0.5769230769230769
Recall =  0.5769230769230769
F1 Score =  0.5769230769230769
Precision(0.0) = 0.0
Precision(1.0) = 1.0
Precision(2.0) = 0.40540540540540543
Recall(0.0) = 0.0
Recall(1.0) = 1.0
Recall(2.0) = 1.0
FPR(0.0) = 0.0
FPR(1.0) = 0.0
FPR(2.0) = 0.5945945945945946
F1-Score(0.0) = 0.0
F1-Score(1.0) = 1.0
F1-Score(2.0) = 0.5769230769230769
Weighted precision:  0.4054054054054054
Weighted recall:  0.5769230769230769
Weighted F1 score:  0.4548816568047337
Weighted false positive rate:  0.17151767151767153

