# Lab3 - Second part: Machine learning in Spark
## Task 1 - Toxicology prediction with SVM

Write a Spark program that builds a toxicology prediction model from the toxicity.txt dataset using Support Vector Machines (SVM). To evaluate the model, use 80% of the dataset as the training set and hold out the remaining 20% for testing. As the evaluation metric, you can use the area under the ROC curve.

### Data and preparation
Looking at the file in LIBSVM format, each line looks as follows:

    1.0 855:1.0 15924:1.0 51912:1.0 54501:2.0 57936:1.0 74579:2.0 76053:1.0 81127:4.0 ...
    <label> <index>:<value> <index>:<value> ...
 
The **label** indicates molecular toxicity: 1 means toxic and 0 means non-toxic. The **features** are molecular descriptors (CDK molecular signatures).
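To make the line format concrete, here is a minimal sketch of how one such line could be split into a label and its sparse features. `LibSvmSketch`, `Example`, and `parseLine` are hypothetical names introduced for illustration; they are not part of MLlib, and the lab code itself relies on `MLUtils.loadLibSVMFile` instead. The sketch keeps the indices exactly as written in the file (one-based), whereas `loadLibSVMFile` converts them to zero-based indices when it builds its vectors.

```scala
// Hypothetical helper for illustration only; MLlib's own loader is MLUtils.loadLibSVMFile.
object LibSvmSketch {
  // One labeled example: the class label plus its non-zero features,
  // stored as (index, value) pairs with the indices as written in the file.
  final case class Example(label: Double, features: Seq[(Int, Double)])

  // Parses a line of the form "<label> <index>:<value> <index>:<value> ...".
  def parseLine(line: String): Example = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toSeq.map { token =>
      token.split(":") match {
        case Array(index, value) => (index.toInt, value.toDouble)
        case _ => throw new IllegalArgumentException(s"Malformed feature: $token")
      }
    }
    Example(label, features)
  }
}
```

For example, `LibSvmSketch.parseLine("1.0 855:1.0 54501:2.0")` yields the label `1.0` and the two pairs `(855, 1.0)` and `(54501, 2.0)`. All features that do not appear in a line are implicitly zero, which is why the format suits very sparse descriptors such as molecular signatures.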

    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils

    // Load dataset in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "toxicity.txt")

    // Split data into training (80%) and test (20%).
    val splits = data.randomSplit(Array(0.8, 0.2), seed = 11L)
    val training = splits(0).cache() // caching training data
    val test = splits(1)

    // Run training algorithm to build the model:
    // SVM using the SGD (Stochastic Gradient Descent) optimization method.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the default threshold so that predict returns raw scores.
    model.clearThreshold()

    // Compute raw scores on the test set
    // (distance from the hyperplane for each test example).
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    // Get evaluation metrics:
    // compute the area under the ROC curve using the MLlib primitive.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

    println("Area under ROC = " + auROC)

Output:

    Area under ROC = 0.5681818181818181
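As a rough illustration of what clearing the threshold changes, the sketch below mimics a linear SVM prediction in plain Scala. `SvmScoreSketch`, `margin`, and `predict` are hypothetical names and the weights are made up; this is a simplified picture of the behaviour, not MLlib's actual code. With a threshold set, the prediction is a hard 0/1 label; without one, it is the raw score `w · x + b`, which is what the ROC computation needs in order to rank the test examples.

```scala
// Simplified, hypothetical model of a linear SVM prediction; not MLlib's implementation.
object SvmScoreSketch {
  // Raw score w . x + b for a sparse example given as (zero-based index, value) pairs.
  def margin(weights: Array[Double], intercept: Double, features: Seq[(Int, Double)]): Double =
    features.map { case (i, v) => weights(i) * v }.sum + intercept

  // With Some(threshold): returns the class label 1.0 or 0.0.
  // With None (threshold cleared): returns the raw score itself.
  def predict(
      weights: Array[Double],
      intercept: Double,
      features: Seq[(Int, Double)],
      threshold: Option[Double]): Double = {
    val m = margin(weights, intercept, features)
    threshold match {
      case Some(t) => if (m > t) 1.0 else 0.0
      case None    => m
    }
  }
}
```

With the made-up weights `Array(0.5, -1.0, 2.0)`, intercept `0.25`, and features `Seq((0, 2.0), (2, 1.0))`, the raw score is `3.25`, while the thresholded prediction with `Some(0.0)` collapses it to the label `1.0`. Feeding only hard labels into the ROC computation would leave a single operating point, so the raw scores are the more informative input.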


## What is the limitation of Hadoop that makes Machine Learning infeasible? How does Spark overcome this limitation?

Hadoop ships with HDFS, which stores large datasets on disk in a distributed way, and uses MapReduce to process that data in parallel. MapReduce is designed for batch analytics: jobs run one after another, and each job reads its input from disk and writes its output back to disk.

Spark can keep data in memory using resilient distributed datasets (RDDs).

Most machine learning algorithms iterate over the same dataset many times and feed the result of one iteration into the next. Hadoop MapReduce has no easy way to share state from one iteration to the next, so each iteration becomes a separate job that rereads the training data from disk, which slows things down. Spark avoids this by caching the training data in memory once and reusing it in every iteration.
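The iterative access pattern can be sketched with a toy gradient descent on a cached RDD. Everything here is an assumption made for illustration: the object name `CachingSketch`, the three-point dataset (which follows `y = 3x`), the step size, and the local Spark context. It requires Spark on the classpath and is not the SVM training used in Task 1.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Toy example (hypothetical): fit y = w * x by gradient descent over a cached RDD.
object CachingSketch {
  def learnWeight(sc: SparkContext, iterations: Int): Double = {
    // cache() asks Spark to keep the partitions in memory after the first action,
    // so later iterations reuse them instead of recomputing or rereading them.
    val points = sc.parallelize(Seq((1.0, 3.0), (2.0, 6.0), (3.0, 9.0))).cache()
    val n = points.count()
    val stepSize = 0.1
    var w = 0.0
    for (_ <- 1 to iterations) {
      val current = w // state carried from the previous iteration
      // Every iteration makes a full pass over the same cached data.
      val gradient = points.map { case (x, y) => (current * x - y) * x }.sum() / n
      w = current - stepSize * gradient
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(s"learned weight = ${learnWeight(sc, 50)}")
    finally sc.stop()
  }
}
```

The driver holds the shared state (`w`) between iterations and ships it to the workers with each pass, while the data itself stays cached on the workers. In a chain of MapReduce jobs, both the data and the intermediate state would typically go through disk on every iteration.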

## How would you rate the model that you obtained in this task? Is that a good model?

The area under the **ROC** curve is `0.568`. In the lecture it was said that "the area under the ROC curve for a good model is close to 1".

The value is close to `0.5`, which is what a random predictor achieves, so this is not a useful model.
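One way to see why `0.5` corresponds to random guessing is the pairwise reading of the metric: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Scala. `AucSketch` and `areaUnderRoc` are hypothetical names, and this brute-force version is meant only for small inputs; it is not how `BinaryClassificationMetrics` computes the value.

```scala
// Brute-force pairwise AUC for illustration; quadratic in the number of examples.
object AucSketch {
  // scoreAndLabels: (score, label) pairs, where label 1.0 is positive and 0.0 is negative.
  def areaUnderRoc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val positives = scoreAndLabels.collect { case (score, label) if label == 1.0 => score }
    val negatives = scoreAndLabels.collect { case (score, label) if label == 0.0 => score }
    require(positives.nonEmpty && negatives.nonEmpty, "need at least one example of each class")
    // A correctly ordered (positive, negative) pair counts 1, a tie 0.5, a wrong order 0.
    val pairScores = for (p <- positives; n <- negatives) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    pairScores.sum / (positives.size.toLong * negatives.size)
  }
}
```

A scorer that ranks every toxic molecule above every non-toxic one gets `1.0`, a scorer that assigns the same score to everything gets `0.5`, and one that ranks them exactly backwards gets `0.0`. A value of about `0.568` therefore means the model orders a toxic/non-toxic pair correctly only slightly more often than a coin flip.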