# Lab3 - Second part: Machine learning in Spark
## Challenge - Linear Classification Methods

Repeat this task with at least two additional linear classification methods. For example, you could try L1-regularized SVM or SVM without regularization. Another alternative is Logistic Regression with L1 or L2 regularization. Report your findings in terms of model performance.
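As a reminder of what these options vary, the methods above can all be written as the same regularized optimization problem with a different loss and penalty. The notation below is an assumption taken from the common textbook formulation (labels $y_i \in \{-1, +1\}$, while MLlib itself expects 0/1 labels), and $\lambda$ plays the role of the regularization parameter:

```latex
\min_{\mathbf{w}} \; \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{w}; \mathbf{x}_i, y_i) + \lambda \, R(\mathbf{w})

% Loss functions
L_{\text{hinge}}(\mathbf{w}; \mathbf{x}, y) = \max\bigl(0,\; 1 - y\,\mathbf{w}^{\top}\mathbf{x}\bigr)        % SVM
L_{\text{logistic}}(\mathbf{w}; \mathbf{x}, y) = \log\bigl(1 + \exp(-y\,\mathbf{w}^{\top}\mathbf{x})\bigr)  % Logistic Regression

% Regularizers
R_{\text{none}}(\mathbf{w}) = 0, \qquad
R_{L1}(\mathbf{w}) = \lVert \mathbf{w} \rVert_1, \qquad
R_{L2}(\mathbf{w}) = \tfrac{1}{2} \lVert \mathbf{w} \rVert_2^2
```

Under this reading, options (A) and (B) keep the hinge loss and change only the regularizer, while options (C) and (D) switch to the logistic loss.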

## Options

(A) L1 regularized SVM
(B) SVM without regularization
(C) Logistic Regression with L1 regularization
(D) Logistic Regression with L2 regularization

I will go for options A and B.

## A) L1 regularized SVM and B) SVM without regularization

In [2]:
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.optimization.{L1Updater, SimpleUpdater}

// Load dataset in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "toxicity.txt")

// Split data into training (80%) and test (20%).
val splits = data.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0).cache() // caching training data
val test = splits(1)

// Run training algorithm to build the model
// SVM using SGD (Stochastic Gradient Descent) optimization method with L1 (default is L2)
val svmAlgA = new SVMWithSGD()
svmAlgA.optimizer.
  setNumIterations(100).
  setRegParam(0.1).
  setUpdater(new L1Updater)
val modelA = svmAlgA.run(training)

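// Model without regularization: SimpleUpdater ignores the regularization
// parameter, so it is set to 0.0 here to make that explicit.
val svmAlgB = new SVMWithSGD()
svmAlgB.optimizer.
  setNumIterations(100).
  setRegParam(0.0).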
  setUpdater(new SimpleUpdater)
val modelB = svmAlgB.run(training)


// Clear the default threshold so that predict() returns raw scores instead of 0/1 labels.
modelA.clearThreshold()
modelB.clearThreshold()

// Compute the raw score (signed margin relative to the separating hyperplane)
// for each test example.
val scoreAndLabelsA = test.map { point =>
  val score = modelA.predict(point.features)
  (score, point.label)
}

val scoreAndLabelsB = test.map { point =>
  val score = modelB.predict(point.features)
  (score, point.label)
}


// Get evaluation metrics.
// Compute the area under the ROC curve using the MLlib primitive
val metricsA = new BinaryClassificationMetrics(scoreAndLabelsA)
val auROCA = metricsA.areaUnderROC()

val metricsB = new BinaryClassificationMetrics(scoreAndLabelsB)
val auROCB = metricsB.areaUnderROC()

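println("Area under ROC (A, L1 regularized SVM) = " + auROCA)
println("Area under ROC (B, SVM without regularization) = " + auROCB)

Area under ROC (A, L1 regularized SVM) = 0.18181818181818177
Area under ROC (B, SVM without regularization) = 0.5681818181818181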


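Neither model performs well. The area under the ROC curve (AUC) is `0.182` for SVM with L1 regularization and `0.568` for SVM without regularization. The unregularized model is therefore only slightly better than random guessing (AUC of `0.5`), while the L1-regularized model is well below `0.5`, meaning it ranks negative examples above positive ones more often than not.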
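To make the reading of an AUC below 0.5 concrete, here is a minimal sketch in plain Scala (no Spark required). It is not MLlib's implementation: `AucSketch`, `auc`, and the toy scores are hypothetical and only illustrate the pairwise-ranking definition of AUC. It shows that negating the scores turns an AUC of `a` into `1 - a`, so a value such as `0.182` would correspond to `0.818` with the ranking reversed.

```scala
object AucSketch {
  // AUC as the probability that a randomly chosen positive example (label 1.0)
  // gets a higher score than a randomly chosen negative one (label 0.0).
  // Ties count as half a win.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "both classes must be present")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical (score, label) pairs, not taken from the toxicity dataset.
    val toy = Seq((-1.2, 1.0), (-0.4, 1.0), (0.1, 1.0),
                  (0.3, 0.0), (0.9, 0.0), (-0.1, 0.0))
    val original = auc(toy)
    // Negating every score reverses the ranking.
    val flipped = auc(toy.map { case (s, l) => (-s, l) })
    println(s"AUC = $original, AUC with negated scores = $flipped")
  }
}
```

This is a diagnostic rather than a fix: a reversed ranking can also come from a very small test set or from an optimizer that has not converged, so the step size, the number of iterations, and the size of the test split are worth checking before drawing conclusions.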