# Classification with Decision Tree and Naive Bayes example

### Importing MLlib libraries 

In [1]:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics

### Read data and pre-processing

The **Iris flower data set**, also known as Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

In [2]:
val rawData = sc.textFile("data/iris.csv")

In [3]:
val splitlines = rawData.map(lines => {
    lines.split(',')
  })
splitlines.first()

Array(5.1, 3.5, 1.4, 0.2, Iris-setosa)

In [4]:
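// Turn each row into a LabeledPoint: the last column is the species, the rest are features.
val Data = splitlines.map { col =>
  val species = col(col.size - 1)
  // Encode the species as a numeric class label (labels are stored as Double).
  val label = species match {
    case "Iris-versicolor" => 0.0
    case "Iris-setosa"     => 1.0
    case _                 => 2.0
  }
  val features = col.slice(0, col.size - 1).map(_.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
Data.take(5).foreach(println)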

(1.0,[5.1,3.5,1.4,0.2])
(1.0,[4.9,3.0,1.4,0.2])
(1.0,[4.7,3.2,1.3,0.2])
(1.0,[4.6,3.1,1.5,0.2])
(1.0,[5.0,3.6,1.4,0.2])
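
The label encoding in the cell above can be exercised without a Spark context. The following is a minimal sketch in plain Scala. `IrisParsing`, `encodeLabel` and `parseLine` are hypothetical names introduced here, and returning `None` for blank or malformed lines is an added assumption; the notebook cell assumes every line has five well-formed fields.

```scala
import scala.util.Try

object IrisParsing {
  // Same encoding as the notebook cell: versicolor -> 0.0, setosa -> 1.0, anything else -> 2.0.
  def encodeLabel(species: String): Double = species match {
    case "Iris-versicolor" => 0.0
    case "Iris-setosa"     => 1.0
    case _                 => 2.0
  }

  // Returns None unless the line has four numeric fields followed by a species name.
  def parseLine(line: String): Option[(Double, Vector[Double])] = {
    val cols = line.trim.split(',')
    if (cols.length != 5) None
    else Try(cols.init.map(_.toDouble).toVector).toOption
      .map(features => (encodeLabel(cols.last), features))
  }
}
```

Rejecting bad rows up front means a stray blank line at the end of the CSV file would be dropped instead of failing later inside `toDouble`.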


### Split the data into training and test sets (40% held out for testing)

In [5]:
val splits = Data.randomSplit(Array(0.6, 0.4), seed = 11L)
val trainingData = splits(0).cache()
val testData = splits(1)
println("Training Data")
trainingData.take(100).foreach(println)
println("Test Data")
testData.take(100).foreach(println)

Training Data
(1.0,[5.1,3.5,1.4,0.2])
(1.0,[4.9,3.0,1.4,0.2])
(1.0,[5.0,3.6,1.4,0.2])
(1.0,[5.4,3.9,1.7,0.4])
(1.0,[4.6,3.4,1.4,0.3])
(1.0,[5.0,3.4,1.5,0.2])
(1.0,[4.4,2.9,1.4,0.2])
(1.0,[5.8,4.0,1.2,0.2])
(1.0,[5.7,4.4,1.5,0.4])
(1.0,[5.4,3.9,1.3,0.4])
(1.0,[5.1,3.5,1.4,0.3])
(1.0,[5.7,3.8,1.7,0.3])
(1.0,[5.1,3.8,1.5,0.3])
(1.0,[5.4,3.4,1.7,0.2])
(1.0,[4.8,3.4,1.9,0.2])
(1.0,[5.0,3.0,1.6,0.2])
(1.0,[5.0,3.4,1.6,0.4])
(1.0,[5.2,3.5,1.5,0.2])
(1.0,[4.7,3.2,1.6,0.2])
(1.0,[4.8,3.1,1.6,0.2])
(1.0,[5.4,3.4,1.5,0.4])
(1.0,[4.9,3.1,1.5,0.2])
(1.0,[5.0,3.2,1.2,0.2])
(1.0,[5.5,3.5,1.3,0.2])
(1.0,[4.9,3.6,1.4,0.1])
(1.0,[4.4,3.0,1.3,0.2])
(1.0,[5.0,3.5,1.3,0.3])
(1.0,[5.0,3.5,1.6,0.6])
(1.0,[5.1,3.8,1.9,0.4])
(1.0,[4.8,3.0,1.4,0.3])
(1.0,[5.1,3.8,1.6,0.2])
(1.0,[5.3,3.7,1.5,0.2])
(0.0,[7.0,3.2,4.7,1.4])
(0.0,[6.9,3.1,4.9,1.5])
(0.0,[5.5,2.3,4.0,1.3])
(0.0,[6.5,2.8,4.6,1.5])
(0.0,[5.7,2.8,4.5,1.3])
(0.0,[6.3,3.3,4.7,1.6])
(0.0,[4.9,2.4,3.3,1.0])
(0.0,[5.0,2.0,3.5,1.0])
(0.0,[5.9,3.0,4.2,1.5])
(0
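
The split sizes are approximate rather than exact: the cells further down report 89 training rows and 61 test rows, not 90 and 60. A simplified single-machine sketch of why, assuming each row is assigned independently from a seeded random draw. It illustrates the idea only and is not Spark's implementation; `SplitSketch` and `split` are hypothetical names.

```scala
import scala.util.Random

object SplitSketch {
  // Sends each row to the first part with probability trainFraction, otherwise to the second.
  def split[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw one uniform number in [0, 1) per row, then partition on it.
    val tagged = rows.map(row => (row, rng.nextDouble()))
    val (train, test) = tagged.partition(_._2 < trainFraction)
    (train.map(_._1), test.map(_._1))
  }
}
```

Calling `SplitSketch.split(1 to 150, 0.6, 11L)` twice returns the same two parts because the seed is fixed, and the two sizes always add up to 150 even though neither is guaranteed to be exactly 90 or 60.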

### Train a Decision Tree

Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
MLlib supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
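
The tree trained in the next cell uses entropy as its impurity measure. As a rough illustration of what a candidate split is scored on, here is a plain-Scala sketch of entropy and information gain over per-class sample counts. The base-2 logarithm and the names `ImpuritySketch`, `entropy` and `informationGain` are assumptions of this sketch, not taken from the notebook.

```scala
object ImpuritySketch {
  // Entropy of a node given the number of samples per class; empty classes contribute nothing.
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Parent entropy minus the size-weighted entropy of the two children.
  def informationGain(left: Seq[Int], right: Seq[Int]): Double = {
    val parent = left.zip(right).map { case (l, r) => l + r }
    val n = parent.sum.toDouble
    entropy(parent) - (left.sum / n) * entropy(left) - (right.sum / n) * entropy(right)
  }
}
```

For the full Iris set with 50 samples per species, `entropy(Seq(50, 50, 50))` is log2(3), about 1.585. A split that isolates one species, `informationGain(Seq(50, 0, 0), Seq(0, 50, 50))`, gains about 0.918, since the remaining node still mixes two species evenly.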

In [6]:
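val numClasses = 3                             // three Iris species
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: all four features are continuous
val impurity = "entropy"                       // split criterion
val maxDepth = 3                               // maximum depth of the tree
val maxBins = 10                               // number of bins used to discretize continuous features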
val dtModel = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
println(dtModel.toDebugString)


DecisionTreeModel classifier of depth 3 with 11 nodes
  If (feature 2 <= 3.5)
   If (feature 1 <= 2.7)
    Predict: 0.0
   Else (feature 1 > 2.7)
    Predict: 1.0
  Else (feature 2 > 3.5)
   If (feature 3 <= 1.6)
    If (feature 2 <= 4.9)
     Predict: 0.0
    Else (feature 2 > 4.9)
     Predict: 2.0
   Else (feature 3 > 1.6)
    If (feature 3 <= 1.8)
     Predict: 2.0
    Else (feature 3 > 1.8)
     Predict: 2.0
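
The printed tree is a set of nested threshold tests on the feature vector. Transcribed by hand into a plain Scala function, it reads as follows. The feature order (0 = sepal length, 1 = sepal width, 2 = petal length, 3 = petal width) is assumed from the column order of the sample row shown earlier, and `TreeRules.predict` is a hypothetical name. The two leaves under `feature 3 > 1.6` both predict 2.0, so they are merged here.

```scala
object TreeRules {
  // Hand transcription of the printed DecisionTreeModel; f holds the four features of one sample.
  def predict(f: Array[Double]): Double =
    if (f(2) <= 3.5) {
      if (f(1) <= 2.7) 0.0 else 1.0
    } else if (f(3) <= 1.6) {
      if (f(2) <= 4.9) 0.0 else 2.0
    } else {
      2.0
    }
}
```

For example, the first row of the data set, `Array(5.1, 3.5, 1.4, 0.2)`, takes the `feature 2 <= 3.5` branch and then the `feature 1 > 2.7` branch, giving label 1.0 (Iris-setosa under the encoding used in this notebook).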



In [7]:
val dtTotalCorrect = trainingData.map { point =>
  if (dtModel.predict(point.features) == point.label) 1 else 0
  }.sum

println(dtTotalCorrect)
println(trainingData.count)

88.0
89


In [8]:
val dtAccuracy = dtTotalCorrect / trainingData.count
println(dtAccuracy)

0.9887640449438202


### Test the Decision Tree

In [9]:
val dtTotalCorrect = testData.map { point =>
  if (dtModel.predict(point.features) == point.label) 1 else 0
  }.sum
println(dtTotalCorrect)
println(testData.count)

58.0
61


In [10]:
val dtAccuracy = dtTotalCorrect / testData.count
println(dtAccuracy)

0.9508196721311475


### Train Naive Bayes

Naive Bayes is a simple multiclass classification algorithm that assumes every pair of features is independent given the class label. It can be trained very efficiently: in a single pass over the training data, it computes the conditional probability distribution of each feature given the label, then applies Bayes' theorem to compute the conditional probability distribution of the label given an observation, and uses that distribution for prediction.
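
To make the two steps concrete, here is a small single-machine sketch of a multinomial Naive Bayes with additive smoothing in plain Scala. It illustrates the idea only: the multinomial form, the smoothing value `lambda = 1.0`, and the names `NaiveBayesSketch`, `Model`, `train` and `predict` are assumptions of this sketch, and it is not claimed to match MLlib's implementation term for term.

```scala
object NaiveBayesSketch {
  // Log prior per class and log feature weight per class and feature.
  final case class Model(
      labels: Vector[Double],
      logPrior: Vector[Double],
      logTheta: Vector[Vector[Double]])

  // Single pass over the data: count rows and sum feature values per class, then smooth.
  def train(data: Seq[(Double, Array[Double])], lambda: Double = 1.0): Model = {
    val numFeatures = data.head._2.length
    val byLabel = data.groupBy(_._1).toVector.sortBy(_._1)
    val logPrior = byLabel.map { case (_, rows) =>
      math.log(rows.size + lambda) - math.log(data.size + byLabel.size * lambda)
    }
    val logTheta = byLabel.map { case (_, rows) =>
      val sums = Vector.tabulate(numFeatures)(j => rows.map(_._2(j)).sum)
      val total = sums.sum
      sums.map(s => math.log(s + lambda) - math.log(total + numFeatures * lambda))
    }
    Model(byLabel.map(_._1), logPrior, logTheta)
  }

  // Bayes' theorem in log space: pick the class with the largest log prior plus log likelihood.
  def predict(model: Model, x: Array[Double]): Double = {
    val scores = model.logPrior.indices.map { i =>
      model.logPrior(i) + x.indices.map(j => x(j) * model.logTheta(i)(j)).sum
    }
    model.labels(scores.indexOf(scores.max))
  }
}
```

Working in log space turns the product of per-feature probabilities into a sum, which avoids numerical underflow and is where the independence assumption enters: each feature contributes its own term regardless of the others.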

In [11]:
val nbModel = NaiveBayes.train(trainingData)
println(nbModel)

org.apache.spark.mllib.classification.NaiveBayesModel@1c5eb597


In [12]:
val nbTotalCorrect = trainingData.map { point =>
    if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
println(nbTotalCorrect)
println(trainingData.count)

88.0
89


In [13]:
val nbAccuracy = nbTotalCorrect / trainingData.count
println(nbAccuracy)

0.9887640449438202


### Test Naive Bayes

In [14]:
val nbTotalCorrect = testData.map { point =>
    if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum

println(nbTotalCorrect)
println(testData.count)

57.0
61


In [15]:
val nbAccuracy = nbTotalCorrect / testData.count
println(nbAccuracy)

0.9344262295081968


### Complete evaluation on test set for Decision Tree model

In [16]:
val predictionAndLabels = testData.map { case LabeledPoint(label, features) =>
  val prediction = dtModel.predict(features)
  (prediction, label)
}

// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)

// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)

// Overall Statistics
val precision = metrics.precision
val recall = metrics.recall // same as true positive rate
val f1Score = metrics.fMeasure
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")

// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}

// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}

// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")

Confusion matrix:
19.0  0.0   2.0   
1.0   17.0  0.0   
0.0   0.0   22.0  
Summary Statistics
Precision = 0.9508196721311475
Recall = 0.9508196721311475
F1 Score = 0.9508196721311475
Precision(0.0) = 0.95
Precision(1.0) = 1.0
Precision(2.0) = 0.9166666666666666
Recall(0.0) = 0.9047619047619048
Recall(1.0) = 0.9444444444444444
Recall(2.0) = 1.0
FPR(0.0) = 0.025
FPR(1.0) = 0.0
FPR(2.0) = 0.05128205128205128
F1-Score(0.0) = 0.9268292682926829
F1-Score(1.0) = 0.9714285714285714
F1-Score(2.0) = 0.9565217391304348
Weighted precision: 0.9527322404371584
Weighted recall: 0.9508196721311476
Weighted F1 score: 0.950698478372626
Weighted false positive rate: 0.027101723413198824
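
The per-label values printed above can be recomputed by hand from the confusion matrix. The sketch below does so in plain Scala, reading rows as actual labels and columns as predicted labels, which is consistent with the printed per-label values. `ConfusionStats` and its method names are hypothetical. The overall precision, recall and F1 score printed above all coincide with the accuracy, 58 correct out of 61.

```scala
object ConfusionStats {
  type Matrix = Vector[Vector[Double]]

  // Decision tree confusion matrix from the output above (rows: actual, columns: predicted).
  val dtMatrix: Matrix = Vector(
    Vector(19.0, 0.0, 2.0),
    Vector(1.0, 17.0, 0.0),
    Vector(0.0, 0.0, 22.0))

  def total(m: Matrix): Double = m.map(_.sum).sum
  def actual(m: Matrix, l: Int): Double = m(l).sum            // row sum
  def predicted(m: Matrix, l: Int): Double = m.map(_(l)).sum  // column sum

  def precision(m: Matrix, l: Int): Double = m(l)(l) / predicted(m, l)
  def recall(m: Matrix, l: Int): Double = m(l)(l) / actual(m, l)

  // Samples wrongly predicted as l, divided by all samples whose actual label is not l.
  def falsePositiveRate(m: Matrix, l: Int): Double =
    (predicted(m, l) - m(l)(l)) / (total(m) - actual(m, l))

  def f1(m: Matrix, l: Int): Double = {
    val p = precision(m, l)
    val r = recall(m, l)
    2 * p * r / (p + r)
  }

  def accuracy(m: Matrix): Double = m.indices.map(i => m(i)(i)).sum / total(m)
}
```

For label 0.0, for instance, 20 samples were predicted as 0.0 and 19 of them were correct, giving a precision of 19 / 20 = 0.95, while 21 samples actually had label 0.0, giving a recall of 19 / 21.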


### Complete evaluation on test set for Naive Bayes model

In [18]:
val predictionAndLabels = testData.map { case LabeledPoint(label, features) =>
  val prediction = nbModel.predict(features)
  (prediction, label)
}

// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)

// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)

// Overall Statistics
val precision = metrics.precision
val recall = metrics.recall // same as true positive rate
val f1Score = metrics.fMeasure
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")

// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}

// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}

// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")

Confusion matrix:
14.0  0.0   1.0   
0.0   20.0  0.0   
1.0   0.0   14.0  
Summary Statistics
Precision = 0.96
Recall = 0.96
F1 Score = 0.96
Precision(0.0) = 0.9333333333333333
Precision(1.0) = 1.0
Precision(2.0) = 0.9333333333333333
Recall(0.0) = 0.9333333333333333
Recall(1.0) = 1.0
Recall(2.0) = 0.9333333333333333
FPR(0.0) = 0.02857142857142857
FPR(1.0) = 0.0
FPR(2.0) = 0.02857142857142857
F1-Score(0.0) = 0.9333333333333333
F1-Score(1.0) = 1.0
F1-Score(2.0) = 0.9333333333333333
Weighted precision: 0.9600000000000001
Weighted recall: 0.9600000000000001
Weighted F1 score: 0.9600000000000001
Weighted false positive rate: 0.01714285714285714
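
The weighted statistics above are averages of the per-label values, with each label weighted by its share of the actual labels. A plain-Scala sketch that reproduces two of them from the printed confusion matrix follows; `WeightedStats` and its method names are hypothetical, and rows are read as actual labels, columns as predicted labels.

```scala
object WeightedStats {
  type Matrix = Vector[Vector[Double]]

  // Naive Bayes confusion matrix from the output above (rows: actual, columns: predicted).
  val nbMatrix: Matrix = Vector(
    Vector(14.0, 0.0, 1.0),
    Vector(0.0, 20.0, 0.0),
    Vector(1.0, 0.0, 14.0))

  // Average a per-label metric, weighting each label by its fraction of the actual labels.
  def weighted(m: Matrix)(metric: Int => Double): Double = {
    val total = m.map(_.sum).sum
    m.indices.map(l => metric(l) * m(l).sum / total).sum
  }

  def precision(m: Matrix, l: Int): Double = m(l)(l) / m.map(_(l)).sum

  def falsePositiveRate(m: Matrix, l: Int): Double = {
    val predicted = m.map(_(l)).sum
    (predicted - m(l)(l)) / (m.map(_.sum).sum - m(l).sum)
  }
}
```

With this matrix, labels 0.0 and 2.0 each make up 15 of the 50 samples and label 1.0 makes up 20, so the weighted precision is 0.3 × 0.9333 + 0.4 × 1.0 + 0.3 × 0.9333 = 0.96.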

