# Lab3 - Second part: Machine learning in Spark
## Task 2 - SVM with L-BFGS

Repeat task 1 running SVM with the L-BFGS optimization algorithm. Setting the maximum number of iterations to 30 is fine for the purpose of this assignment.

Reference: https://spark.apache.org/docs/1.2.0/mllib-optimization.html
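
For orientation, the objective being minimized in this task is the L2-regularized hinge loss. The plain-Scala sketch below is independent of Spark and only illustrates the quantities involved. The names `HingeSketch`, `hinge`, and `objective` are hypothetical. The `{0, 1}` label mapping and the `regParam / 2` scaling reflect my understanding of MLlib's `HingeGradient` and `SquaredL2Updater`, and should be read as assumptions.

```scala
object HingeSketch {
  // Dot product of two equal-length arrays.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Hinge loss and one subgradient for a single example.
  // Assumption: labels are 0.0 or 1.0 and are mapped to -1 / +1.
  def hinge(w: Array[Double], x: Array[Double], label: Double): (Double, Array[Double]) = {
    val y = 2.0 * label - 1.0
    val margin = y * dot(w, x)
    if (margin < 1.0) (1.0 - margin, x.map(_ * -y)) // inside the margin: loss > 0
    else (0.0, Array.fill(x.length)(0.0))           // correctly classified with margin
  }

  // Mean hinge loss plus (regParam / 2) * ||w||^2.
  def objective(w: Array[Double], data: Seq[(Double, Array[Double])], regParam: Double): Double = {
    val meanLoss = data.map { case (label, x) => hinge(w, x, label)._1 }.sum / data.size
    meanLoss + 0.5 * regParam * dot(w, w)
  }
}
```

With all-zero weights every example has margin 0, so each contributes a loss of 1. This is the starting point for both optimizers when the initial weights are zero.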

In [19]:
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.optimization.{LBFGS, HingeGradient, SquaredL2Updater}
import org.apache.spark.mllib.linalg.Vectors

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "toxicity.txt")
val numFeatures = data.take(1)(0).features.size

// Split data into training and test.
val splits = data.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

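// Optimize the SVM weights with L-BFGS (hinge loss, L2 regularization).
// A bias term is appended to each feature vector, so the intercept is learned as the last weight.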
val (weightsWithIntercept, _) = LBFGS.runLBFGS(
    training.map(x => (x.label, MLUtils.appendBias(x.features))),
    new HingeGradient(),
    new SquaredL2Updater(),
    numCorrections = 10,
    convergenceTol = 1e-4,
    maxNumIterations = 30,
    regParam = 0.01,
    initialWeights = Vectors.dense(new Array[Double](numFeatures + 1))
)

// Build the SVM model: split the learned vector into feature weights and intercept.
val model2 = new SVMModel(
    Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
    weightsWithIntercept(weightsWithIntercept.size - 1)
)
// Baseline for comparison: SVM trained with SGD, also limited to 30 iterations.
val model1 = SVMWithSGD.train(training, 30)

// Clear the default threshold so that predict() returns raw scores instead of 0/1 labels.
model2.clearThreshold()
model1.clearThreshold()

// Compute the raw score (margin relative to the hyperplane) for each test example.
val scoreAndLabels2 = test.map { point =>
  val score = model2.predict(point.features)
  (score, point.label)
}

val scoreAndLabels1 = test.map { point =>
  val score = model1.predict(point.features)
  (score, point.label)
}

// Compute the area under the ROC curve with MLlib's BinaryClassificationMetrics.
val metrics2 = new BinaryClassificationMetrics(scoreAndLabels2)
val auROC2 = metrics2.areaUnderROC()

val metrics1 = new BinaryClassificationMetrics(scoreAndLabels1)
val auROC1 = metrics1.areaUnderROC()

println("Area under ROC LBFGS = " + auROC2)
println("Area under ROC SGD = " + auROC1)

Area under ROC LBFGS = 0.7272727272727273
Area under ROC SGD = 0.6136363636363636


## Does L-BFGS perform better than SGD for this training set?

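Yes. On the same 80/20 split, the SVM trained with L-BFGS reaches an area under the ROC curve of `0.727` on the test set, compared with `0.614` for SGD, with both limited to 30 iterations. The result comes from a single random split (seed 11), so it shows L-BFGS doing better on this data within the same iteration budget, not a general ranking of the two optimizers.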
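
To make the compared metric concrete, the area under the ROC curve can be read as the fraction of (positive, negative) example pairs in which the positive example gets the higher score, with ties counted as half. The plain-Scala sketch below computes it that way. `AucSketch` and `auc` are hypothetical names. That this pairwise form matches the value returned by `BinaryClassificationMetrics.areaUnderROC()` is an assumption based on the usual equivalence with the trapezoidal ROC area, not something checked against the notebook's data.

```scala
object AucSketch {
  // AUC as the share of (positive, negative) pairs ranked correctly; ties count half.
  // Input mirrors the notebook's (score, label) pairs, with labels 0.0 or 1.0.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

For example, `AucSketch.auc(Seq((0.9, 1.0), (0.4, 0.0), (0.6, 1.0), (0.7, 0.0)))` ranks three of the four positive/negative pairs correctly and returns `0.75`. Because the value is a ratio over pairs, a small test set can only produce a coarse set of AUC values, which is worth keeping in mind when reading the gap between the two optimizers.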