Good Scala tutorials: 

* http://blog.codacy.com/2015/07/03/how-to-learn-scala/
* http://www.newthinktank.com/2015/08/learn-scala-one-video/
* https://learnxinyminutes.com/docs/scala/

# Classification problem statement # 

Here we use the Kaggle StumbleUpon problem (https://www.kaggle.com/c/stumbleupon) to classify web pages as:

* **Evergreen**: Pages that are persistently popular, labeled as 1.
* **Ephemeral**: Pages that are popular only for a short amount of time, labeled as 0.

We will download train.tsv, the training set, which contains 7,395 URLs.

The feature set is as follows:

<img src = "evergreen1.png">
<img src = "evergreen2.png">



# Step 1: Data preprocessing #

Load the raw file, which still includes the header row, into `rawData`:

`scala> val rawData = sc.textFile("/Users/adarshnair/spark-2.0.1-bin-hadoop2.7/spark_projects/Classification/train.tsv")`

**Preview the data**

`scala> rawData.first
res4: String = "url"	"urlid"	"boilerplate"	"alchemy_category"	"alchemy_category_score"	"avglinksize"	"commonlinkratio_1"	"commonlinkratio_2"	"commonlinkratio_3"	"commonlinkratio_4"	"compression_ratio"	"embed_ratio"	"framebased"	"frameTagRatio"	"hasDomainLink"	"html_ratio"	"image_ratio"	"is_news"	"lengthyLinkDomain"	"linkwordscore"	"news_front_page"	"non_markup_alphanum_characters"	"numberOfLinks"	"numwords_in_url"	"parametrizedLinkRatio"	"spelling_errors_ratio"	"label"`

The first line of the data contains the column headers.

**Remove the column headers**

We can do this using the `sed` command.

`Adarshs-MacBook-Pro:Classification adarshnair$ sed 1d train.tsv > train_noheader.tsv`

Now point `rawData` at train_noheader.tsv and preview the data again:

`scala> val rawData = sc.textFile("/Users/adarshnair/spark-2.0.1-bin-hadoop2.7/spark_projects/Classification/train_noheader.tsv")
rawData: org.apache.spark.rdd.RDD[String] = /Users/adarshnair/spark-2.0.1-bin-hadoop2.7/spark_projects/Classification/train_noheader.tsv MapPartitionsRDD[5] at textFile at <console>:24`

`scala> rawData.first
res5: String = "http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"	"4042"	"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ...`


**Split each line on the tab character (`\t`)**

`scala> val records = rawData.map(line => line.split("\t"))`

Preview data:

`scala> records.first
res6: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees ...`

**Data cleaning**

* Trim the extra quotation marks
* Replace missing values, which are marked with '?', with 0.0
* Set `label` to the last column
* Set `features` to columns 5 to 26 (zero-based indices 4 to 25), the numeric columns between `alchemy_category` and `label`
* Wrap `label` and `features` in a `LabeledPoint`, which stores the features as an MLlib vector (http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point)

`import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors`


`scala> val data = records.map { r =>
     | val trimmed  = r.map(_.replaceAll("\"", ""))
     | val label = trimmed(r.size - 1).toInt
     | val features = trimmed.slice(4, r.size - 1).map(x => if (x == "?") 0.0 else x.toDouble )
     | LabeledPoint(label, Vectors.dense(features))
     | }`
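The same cleaning steps can be tried on a single line without a Spark context. The following is a minimal plain-Scala sketch, assuming four leading non-numeric columns and the label in the last column; `LineCleaner`, `parseLine` and the shortened sample record are hypothetical and only for illustration.

```scala
object LineCleaner {
  // Parse one tab-separated record into (label, features).
  // Assumes four leading non-numeric columns and the label in the last column.
  def parseLine(line: String): (Int, Array[Double]) = {
    val trimmed = line.split("\t").map(_.replaceAll("\"", ""))
    val label = trimmed.last.toInt
    val features = trimmed
      .slice(4, trimmed.length - 1)
      .map(x => if (x == "?") 0.0 else x.toDouble) // '?' marks a missing value
    (label, features)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical shortened record: 4 text columns, 3 numeric columns, 1 label.
    val sample = Seq("\"http://example.com\"", "\"1\"", "\"{}\"", "\"business\"",
      "\"0.5\"", "\"?\"", "\"-1\"", "\"1\"").mkString("\t")
    val (label, features) = parseLine(sample)
    val shown = features.mkString(", ")
    println(s"label=$label features=$shown")
  }
}
```

Calling `LineCleaner.main(Array())` in the shell prints the parsed label and features for the sample record.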

**Cache the data**

`scala> data.cache`

**Count the number of rows (web pages)**

`scala> val numData = data.count
numData: Long = 7395  `

**Data cleaning specific to the Naive Bayes classifier**

The NB classifier cannot use features that have negative values. We will create a version of the dataset in which negative values are replaced with 0.0 by adding `.map(x => if (x < 0) 0.0 else x)`

`scala> val nbData = records.map { r =>
     | val trimmed = r.map(_.replaceAll("\"", ""))
     | val label = trimmed(r.size - 1).toInt
     | val features = trimmed.slice(4, r.size -1).map(x => if (x == "?") 0.0 else x.toDouble).map(x => if (x < 0) 0.0 else x)
     | LabeledPoint(label, Vectors.dense(features))
     | }`

# Step 2: Training classification models #

We will train four classification models on the data.

`scala> val numIterations = 10
numIterations: Int = 10`

`scala> val maxTreeDepth = 5
maxTreeDepth: Int = 5`

## Step 2.1: Logistic Regression ##

`scala> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD`

We set the number of iterations to 10.

`scala> val lrModel = LogisticRegressionWithSGD.train(data, numIterations)`

## Step 2.2: SVM ##

`scala> import org.apache.spark.mllib.classification.SVMWithSGD`

`scala> val svmModel = SVMWithSGD.train(data, numIterations)`


## Step 2.3: Naive Bayes ##

`scala> import org.apache.spark.mllib.classification.NaiveBayes`

We will use the nbData which has the negative values for features replaced with 0.

`scala> val nbModel = NaiveBayes.train(nbData) `

## Step 2.4: Decision Trees ##

`scala> import org.apache.spark.mllib.tree.DecisionTree`

`scala> import org.apache.spark.mllib.tree.configuration.Algo`

`scala> import org.apache.spark.mllib.tree.impurity.Entropy`

We will set the maxTreeDepth to 5, the mode to Algo.Classification and the impurity measure to Entropy.

`scala> val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)`

# Step 3.a: Make predictions and check performance using our trained classification models #

## Step 3.1.a: Logistic regression predictions ##

Prediction on a single datapoint.

`scala> val dataPoint = data.first`

`scala> val prediction = lrModel.predict(dataPoint.features)
prediction: Double = 1.0`

Compare this prediction with the actual label of the first datapoint.

`scala> val trueLabel = dataPoint.label
trueLabel: Double = 0.0`

Thus the model did not get it right for the first datapoint.

Make predictions on the entire training set.

`scala> val predictions = lrModel.predict(data.map(x => x.features))`

Check the values of the first five predictions.

`scala> predictions.take(5)
16/10/19 14:38:23 WARN Executor: 1 block locks were not released by TID = 87:
[rdd_7_0]
res8: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0)`

## Step 3.1.b: Logistic regression performance ##


**Accuracy**

This is the number of correctly classified instances divided by the total number of instances.

`scala> val lrTotalCorrect = data.map { point =>
     | if (lrModel.predict(point.features) == point.label) 1 else 0
     | }.sum
lrTotalCorrect: Double = 3806.0`

`scala> val lrAccuracy = lrTotalCorrect / numData
lrAccuracy: Double = 0.5146720757268425`

The accuracy is 51.47%.
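The accuracy calculation itself does not depend on Spark. As a minimal sketch on an in-memory sequence of (prediction, label) pairs, with the hypothetical names `AccuracyExample` and `accuracy` and made-up sample pairs:

```scala
object AccuracyExample {
  // Fraction of (prediction, label) pairs that agree.
  def accuracy(predictionsAndLabels: Seq[(Double, Double)]): Double = {
    require(predictionsAndLabels.nonEmpty, "need at least one example")
    val correct = predictionsAndLabels.count { case (p, l) => p == l }
    correct.toDouble / predictionsAndLabels.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical predictions: 3 of the 4 pairs match.
    val pairs = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0))
    println(f"accuracy = ${accuracy(pairs) * 100}%.2f%%")
  }
}
```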

## Step 3.2: SVM predictions and performance ##

**Predictions**

`scala> val svmTotalCorrect = data.map { x =>
     | if (svmModel.predict(x.features) == x.label) 1 else 0
     | }.sum
svmTotalCorrect: Double = 3806.0 `

**Accuracy**

`scala> val svmAccuracy = svmTotalCorrect / numData
svmAccuracy: Double = 0.5146720757268425`

The accuracy is 51.47%.

## Step 3.3: Naive Bayes predictions and performance ##

**Predictions**

`scala> val nbTotalCorrect = nbData.map { point =>
     | if (nbModel.predict(point.features) == point.label) 1 else 0
     | }.sum
nbTotalCorrect: Double = 4292.0 `

**Accuracy**

`scala> val nbAccuracy = nbTotalCorrect / numData
nbAccuracy: Double = 0.5803921568627451`

The accuracy is 58.04%.

## Step 3.4: Decision tree predictions and performance ##

**Predictions**

`scala> val dtTotalCorrect = data.map { point =>
     | val score = dtModel.predict(point.features)
     | val predicted = if (score > 0.5) 1 else 0 
     | if (predicted == point.label) 1 else 0
     | }.sum
dtTotalCorrect: Double = 4794.0`

**Accuracy**

`scala> val dtAccuracy = dtTotalCorrect / numData
dtAccuracy: Double = 0.6482758620689655`

The accuracy is 64.8%.

## Step 3.b: Additional performance metrics ##

**Precision**

`TP / (TP + FP)`

**Recall / Sensitivity / True Positive Rate**

`TP / (TP + FN)`

**Area under the Precision-Recall curve**

A value of 1 denotes a perfect classifier.

**False Positive Rate**

`FP / (FP + TN)`

**ROC curve / Area under the ROC curve (AUC)**

The ROC curve plots a classifier's true positive rate against its false positive rate as the decision threshold varies. An AUC value of 1 denotes a perfect classifier, while 0.5 corresponds to random guessing.
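The formulas above can be checked on a small in-memory example. This is a plain-Scala sketch, assuming binary labels where 1.0 is the positive class; `ConfusionMetrics`, `Counts` and `count` are hypothetical names, not MLlib classes.

```scala
object ConfusionMetrics {
  // Confusion-matrix counts for binary labels (1.0 = positive, 0.0 = negative).
  // The ratios below are NaN when their denominator is zero.
  final case class Counts(tp: Int, fp: Int, tn: Int, fn: Int) {
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double = tp.toDouble / (tp + fn) // true positive rate
    def falsePositiveRate: Double = fp.toDouble / (fp + tn)
  }

  // Tally (prediction, label) pairs into the four counts.
  def count(predictionsAndLabels: Seq[(Double, Double)]): Counts =
    predictionsAndLabels.foldLeft(Counts(0, 0, 0, 0)) {
      case (c, (1.0, 1.0)) => c.copy(tp = c.tp + 1)
      case (c, (1.0, 0.0)) => c.copy(fp = c.fp + 1)
      case (c, (0.0, 0.0)) => c.copy(tn = c.tn + 1)
      case (c, (0.0, 1.0)) => c.copy(fn = c.fn + 1)
      case (c, _)          => c // ignore anything that is not a 0/1 pair
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical predictions and labels.
    val c = count(Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)))
    println(s"precision=${c.precision} recall=${c.recall} fpr=${c.falsePositiveRate}")
  }
}
```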

**Computing the area under the PR curve and the area under the ROC curve (AUC) for the Logistic Regression and SVM models**

`scala> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics`

`scala> val metrics = Seq(lrModel, svmModel).map { model => 
     | val scoreAndLabels = data.map { point =>
     | (model.predict(point.features), point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
     | }
metrics: Seq[(String, Double, Double)] = List((LogisticRegressionModel,0.7567586293858841,0.5014181143280931), (SVMModel,0.7567586293858841,0.5014181143280931))`

We get the following values:
* LogisticRegressionModel - area under PR: 0.757, AUC: 0.501
* SVMModel - area under PR: 0.757, AUC: 0.501
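One thing worth keeping in mind when reading these numbers: `predict` returns hard 0/1 class labels here rather than continuous scores. Assuming the ROC curve is then built from that single operating point, the area under it reduces to `(1 + TPR - FPR) / 2`. The sketch below works this out with the trapezoid rule; `SinglePointAuc` and its sample rates are hypothetical and not taken from the models above.

```scala
object SinglePointAuc {
  // Area under the ROC polyline (0,0) -> (fpr,tpr) -> (1,1), via the trapezoid rule.
  def auc(tpr: Double, fpr: Double): Double = {
    val left = fpr * tpr / 2.0                  // triangle from (0,0) to (fpr,tpr)
    val right = (1.0 - fpr) * (tpr + 1.0) / 2.0 // trapezoid from (fpr,tpr) to (1,1)
    left + right                                // algebraically (1 + tpr - fpr) / 2
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical model that predicts the positive class for almost every page.
    println(auc(tpr = 0.99, fpr = 0.98))
  }
}
```

A model that labels nearly everything as positive has both rates close to 1, so its AUC under this assumption stays close to 0.5.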

**Computing the area under the PR curve and the area under the ROC curve (AUC) for the Naive Bayes model**

`scala> val nbMetrics = Seq(nbModel).map{ model =>
     | val scoreAndLabels = nbData.map { point =>
     | val score = model.predict(point.features)
     | (if (score > 0.5) 1.0 else 0.0, point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
     | }
nbMetrics: Seq[(String, Double, Double)] = List((NaiveBayesModel,0.6808510815151734,0.5835585110136261))`

We get the following values:
* NaiveBayesModel - area under PR: 0.681, AUC: 0.584

**Computing the area under the PR curve and the area under the ROC curve (AUC) for the Decision Tree model**

`scala> val dtMetrics = Seq(dtModel).map{ model =>
     | val scoreAndLabels = data.map { point =>
     | val score = model.predict(point.features)
     | (if (score > 0.5) 1.0 else 0.0, point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
     | }
dtMetrics: Seq[(String, Double, Double)] = List((DecisionTreeModel,0.7430805993331199,0.6488371887050935))`

We get the following values:
* DecisionTreeModel - area under PR: 0.743, AUC: 0.649

**Final output with all values:**

`scala> val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics: Seq[(String, Double, Double)] = List((LogisticRegressionModel,0.7567586293858841,0.5014181143280931), (SVMModel,0.7567586293858841,0.5014181143280931), (NaiveBayesModel,0.6808510815151734,0.5835585110136261), (DecisionTreeModel,0.7430805993331199,0.6488371887050935))`

`scala> allMetrics.foreach{ case (m, pr, roc) => 
     | println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%") 
     | }`
     
`LogisticRegressionModel, Area under PR: 75.6759%, Area under ROC: 50.1418%
SVMModel, Area under PR: 75.6759%, Area under ROC: 50.1418%
NaiveBayesModel, Area under PR: 68.0851%, Area under ROC: 58.3559%
DecisionTreeModel, Area under PR: 74.3081%, Area under ROC: 64.8837%`

# Step 4: Feature Standardization # 

## Step 4.1: Analyse features ##

Many models work best when the features are on comparable scales and roughly normally distributed. To investigate this, we represent the feature vectors as a distributed matrix in MLlib using the `RowMatrix` class. A `RowMatrix` wraps an RDD of vectors, where each vector is one row of the matrix.

`scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix`

Create `vectors` variable which has our features.

`scala> val vectors = data.map(lp => lp.features)`

Create `matrix` variable which is the matrix of our features.

`scala> val matrix = new RowMatrix(vectors)`

**Compute summary stats of our features using `computeColumnSummaryStatistics()`**

`scala> val matrixSummary = matrix.computeColumnSummaryStatistics()`

Using the `matrixSummary` object we can read off the mean, min, max, variance and number of non-zero values for each column.

`scala> println(matrixSummary.mean)`

`scala> println(matrixSummary.max)`

`scala> println(matrixSummary.variance)`

`scala> println(matrixSummary.numNonzeros)`
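To make the column statistics concrete, here is a plain-Scala sketch that computes per-column mean and variance for a small in-memory matrix. It assumes the unbiased (n - 1) variance estimator; `ColumnStats` and the sample rows are hypothetical.

```scala
object ColumnStats {
  // Per-column mean of a row-major matrix (all rows the same length).
  def mean(rows: Seq[Seq[Double]]): Seq[Double] =
    rows.transpose.map(col => col.sum / col.size)

  // Per-column sample variance, dividing by n - 1 (needs at least two rows).
  def variance(rows: Seq[Seq[Double]]): Seq[Double] =
    rows.transpose.map { col =>
      val m = col.sum / col.size
      col.map(x => (x - m) * (x - m)).sum / (col.size - 1)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical matrix: the second column is constant, so its variance is zero.
    val rows = Seq(Seq(1.0, 10.0), Seq(3.0, 10.0), Seq(5.0, 10.0))
    println(mean(rows))
    println(variance(rows))
  }
}
```

Columns whose variance is very different from 1, or whose mean is far from 0, are the ones that standardization changes the most.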

## Step 4.2: Standardize and scale features ##

**Import the StandardScaler:**

`scala> import org.apache.spark.mllib.feature.StandardScaler`

It takes two arguments: `withMean`, which when set to `true` subtracts the column mean from each feature, and `withStd`, which when set to `true` scales each feature to unit standard deviation.

`scala> val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)`

**Get the scaled data:**

`scala> val scaledData = data.map(x => LabeledPoint(x.label, scaler.transform(x.features)))`
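What the scaler does to each value can be sketched in plain Scala as `(x - mean) / std` per column. The sketch assumes the (n - 1) standard deviation and maps constant columns to 0.0, which is a choice made here for illustration; `Standardize` and `fit` are hypothetical names, not the MLlib API.

```scala
object Standardize {
  // Learn per-column mean and (n - 1) standard deviation, then return a row transform.
  def fit(rows: Seq[Seq[Double]]): Seq[Double] => Seq[Double] = {
    val cols = rows.transpose
    val means = cols.map(c => c.sum / c.size)
    val stds = cols.zip(means).map { case (c, m) =>
      math.sqrt(c.map(x => (x - m) * (x - m)).sum / (c.size - 1))
    }
    row => row.zip(means).zip(stds).map { case ((x, m), s) =>
      if (s == 0.0) 0.0 else (x - m) / s // constant columns map to 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: first column has mean 3 and std 2, second is constant.
    val rows = Seq(Seq(1.0, 7.0), Seq(3.0, 7.0), Seq(5.0, 7.0))
    val transform = fit(rows)
    rows.map(transform).foreach(println)
  }
}
```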

**Unscaled features:**

`scala> println(data.first.features)
16/10/19 16:06:00 WARN Executor: 1 block locks were not released by TID = 180:
[rdd_7_0]
[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,
0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,
5424.0,170.0,8.0,0.152941176,0.079129575]`

**Scaled features:**

`scala> println(scaledData.first.features)
16/10/19 16:06:10 WARN Executor: 1 block locks were not released by TID = 181:
[rdd_7_0]
[1.137647336497678,-0.08193557169294771,1.0251398128933331,-0.05586356442541689,
-0.4688932531289357,-0.3543053263079386,-0.3175352172363148,0.3384507982396541,
0.0,0.828822173315322,-0.14726894334628504,0.22963982357813484,-0.14162596909880876,
0.7902380499177364,0.7171947294529865,-0.29799681649642257,-0.2034625779299476,
-0.03296720969690391,-0.04878112975579913,0.9400699751165439,-0.10869848852526258,
-0.2788207823137022]`

## Step 4.3: Retrain Logistic Regression model using scaled features ##

*Naive Bayes and Decision Trees are unaffected by scaling features.*

**Train model using scaled features**

`scala> val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numIterations)`

**Get total correct predictions using scaled features**

`scala> val lrTotalCorrectScaled = scaledData.map { point =>
     | if (lrModelScaled.predict(point.features) == point.label) 1 else 0
     | }.sum
lrTotalCorrectScaled: Double = 4588.0`

**Get accuracy**

`scala> val lrAccuracyScaled = lrTotalCorrectScaled / numData
lrAccuracyScaled: Double = 0.6204192021636241`


**Pair predictions with actual labels**

`scala> val lrPredictionsVsTrue = scaledData.map { point => 
     | (lrModelScaled.predict(point.features), point.label) 
     | }`

**Get Metrics object**

`scala> val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
lrMetricsScaled: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@49a9766b`

**Area under the PR curve**

`scala> val lrPr = lrMetricsScaled.areaUnderPR
lrPr: Double = 0.7272540762713375`

**AUC score**

`scala> val lrRoc = lrMetricsScaled.areaUnderROC
lrRoc: Double = 0.6196629669112512`

Using scaled features we improved our AUC score from ~50% to ~62%.

# Step 5: Considering more features #

In Step 1 we used only the numeric features (columns 5 to 26) in our feature set:

` val features = trimmed.slice(4, r.size - 1).map(x => if (x == "?") 0.0 else x.toDouble )`

Now we will also use the **alchemy_category** feature, which is the 4th column.

As it is a string feature, we have to encode it numerically, using **1-of-k** encoding.

## Step 5.1: 1-of-k encoding ##

**1-of-k encode the categories using `zipWithIndex.toMap`**

`scala> val categories = records.map(r => r(3)).distinct.collect.zipWithIndex.toMap
categories: scala.collection.immutable.Map[String,Int] = Map("weather" -> 0, "sports" -> 6, "unknown" -> 4, "computer_internet" -> 12, "?" -> 11, "culture_politics" -> 3, "religion" -> 8, "recreation" -> 2, "arts_entertainment" -> 9, "health" -> 5, "law_crime" -> 10, "gaming" -> 13, "business" -> 1, "science_technology" -> 7)`

**Get the number of categories**

`scala> val numCategories = categories.size
numCategories: Int = 14`

Create a vector of length 14 to represent this feature and assign a value of 1 at the index of the relevant category for each data point.

`scala> val dataCategories = records.map { r =>
     | val trimmed = r.map(_.replaceAll("\"", ""))
     | val label = trimmed(r.size - 1).toInt
     | val categoryIdx = categories(r(3))
     | val categoryFeatures = Array.ofDim[Double](numCategories)
     | categoryFeatures(categoryIdx) = 1.0
     | val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
     | val features = categoryFeatures ++ otherFeatures
     | LabeledPoint(label, Vectors.dense(features))
     | }`

Check the new feature set for the first webpage:

`scala> println(dataCategories.first.features)
[0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,
0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,
8.0,0.152941176,0.079129575]`

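The first 14 values are the 1-of-k encoding of the category, with a 1 at the index of the category the web page belongs to (here index 1, `business`) and 0 everywhere else. The remaining 22 values are the numeric features from Step 1.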
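The encoding can also be tried on a small in-memory list. This plain-Scala sketch mirrors the `zipWithIndex.toMap` approach; `OneOfK`, `buildIndex` and `encode` are hypothetical names, and a category missing from the index would throw a `NoSuchElementException`.

```scala
object OneOfK {
  // Assign each distinct category a column index, in order of first appearance.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Vector of zeros with a single 1.0 at the category's index.
  def encode(category: String, index: Map[String, Int]): Array[Double] = {
    val v = Array.ofDim[Double](index.size)
    v(index(category)) = 1.0
    v
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical category column with a repeated value.
    val index = buildIndex(Seq("business", "sports", "business", "?"))
    val shown = encode("sports", index).mkString(", ")
    println(shown)
  }
}
```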

## Step 5.2: Standardize the features ##

**Standardize the features**

`scala> val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))`

`scala> val scaledDataCats = dataCategories.map(lp => LabeledPoint(lp.label, scalerCats.transform(lp.features)))`



## Step 5.3: Train model on scaled data and expanded feature set ##

**Train model**

`scala> val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCats, numIterations)`

## Step 5.4: Performance metrics ##

**Make predictions**

`scala> val lrTotalCorrectScaledCats = scaledDataCats.map { point =>
     | if (lrModelScaledCats.predict(point.features) == point.label) 1 else 0
     | }.sum
lrTotalCorrectScaledCats: Double = 4923.0`

**Get accuracy of predictions**

`scala> val lrAccuracyScaledCats = lrTotalCorrectScaledCats / numData
lrAccuracyScaledCats: Double = 0.6657200811359026`

**Get area under the PR curve**

`scala> val lrPredictionsVsTrueCats = scaledDataCats.map { point => 
     | (lrModelScaledCats.predict(point.features), point.label) 
     | }`

`scala> val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueCats)`

`scala> val lrPrCats = lrMetricsScaledCats.areaUnderPR
lrPrCats: Double = 0.7579640787676577`

**Get AUC value**

`scala> val lrRocCats = lrMetricsScaledCats.areaUnderROC
lrRocCats: Double = 0.6654826844243996`

**Collate performance metric values**

`scala> println(f"${lrModelScaledCats.getClass.getSimpleName}\nAccuracy: ${lrAccuracyScaledCats * 100}%2.4f%%\nArea under PR: ${lrPrCats * 100.0}%2.4f%%\nArea under ROC: ${lrRocCats * 100.0}%2.4f%%") 
LogisticRegressionModel
Accuracy: 66.5720%
Area under PR: 75.7964%
Area under ROC: 66.5483%`

# Step 6: Using the correct form of data #

When the data mixes categorical features (0/1 values) with frequency data (counts that can range from 0 to n), it can affect the performance of our model. To illustrate this, we will train the Naive Bayes model using only the 1-of-k encoded categorical feature (column 4).

`scala> val dataNB = records.map { r =>
     | val trimmed = r.map(_.replaceAll("\"", ""))
     | val label = trimmed(r.size - 1).toInt
     | val categoryIdx = categories(r(3))
     | val categoryFeatures = Array.ofDim[Double](numCategories)
     | categoryFeatures(categoryIdx) = 1.0
     | LabeledPoint(label, Vectors.dense(categoryFeatures))
     | }`


**Train model**

`scala> val nbModelCats = NaiveBayes.train(dataNB)`

**Get performance metrics (accuracy) for the newly trained model**

`scala> val nbTotalCorrectCats = dataNB.map { point =>
     | if (nbModelCats.predict(point.features) == point.label) 1 else 0
     | }.sum
nbTotalCorrectCats: Double = 4508.0`

`scala> val nbAccuracyCats = nbTotalCorrectCats / numData
nbAccuracyCats: Double = 0.6096010818120352`

`scala> val nbPredictionsVsTrueCats = dataNB.map { point => 
     | (nbModelCats.predict(point.features), point.label) 
     | }`

**Get area under the PR curve and AUC values**

`scala> val nbMetricsCats = new BinaryClassificationMetrics(nbPredictionsVsTrueCats)`

`scala> val nbPrCats = nbMetricsCats.areaUnderPR
nbPrCats: Double = 0.7405222106704076                                           

scala> val nbRocCats = nbMetricsCats.areaUnderROC
nbRocCats: Double = 0.6051384941549446`

`scala> println(f"${nbModelCats.getClass.getSimpleName}\nAccuracy: ${nbAccuracyCats * 100}%2.4f%%\nArea under PR: ${nbPrCats * 100.0}%2.4f%%\nArea under ROC: ${nbRocCats * 100.0}%2.4f%%") 
NaiveBayesModel
Accuracy: 60.9601%
Area under PR: 74.0522%
Area under ROC: 60.5138%`

The accuracy of the Naive Bayes model rose by about 2.9 percentage points, from 58.04% to 60.96%, by using just the categorical feature.

# Step 7: Tuning parameters #

## Step 7.1.a: Logistic Regression(SGD) parameters ##

The arguments for logistic regression are as follows:

* stepSize
* numIterations
* regParam
* miniBatchFraction

**Necessary imports:**

`scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD`

`scala> import org.apache.spark.mllib.optimization.Updater
import org.apache.spark.mllib.optimization.Updater`

`scala> import org.apache.spark.mllib.optimization.SimpleUpdater
import org.apache.spark.mllib.optimization.SimpleUpdater`

`scala> import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.optimization.L1Updater`

`scala> import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.optimization.SquaredL2Updater`

`scala> import org.apache.spark.mllib.classification.ClassificationModel
import org.apache.spark.mllib.classification.ClassificationModel`

**Helper function to train logistic regression model**

We pass the function the input data along with the `regParam`, `numIterations`, `updater` and `stepSize` arguments.

`scala> def trainWithParams(input: RDD[LabeledPoint], regParam: Double, numIterations: Int, updater: Updater, stepSize: Double) = {
     | val lr = new LogisticRegressionWithSGD
     | lr.optimizer.setNumIterations(numIterations).setUpdater(updater).setRegParam(regParam).setStepSize(stepSize)
     | lr.run(input)
     | }`

**Helper function to calculate AUC metric**

`scala> def createMetrics(label: String, data: RDD[LabeledPoint], model: ClassificationModel) = {
     | val scoreAndLabels = data.map { point =>
     | (model.predict(point.features), point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (label, metrics.areaUnderROC)
     | }`

**Cache the data to speed up repeated runs against the dataset**

`scala> scaledDataCats.cache`

## Step 7.1.b: Tuning Logistic regression parameters ##

We will be tuning 3 parameters:

**Number of iterations**

We will run the model with 1, 5, 10 and 50 iterations and use the helper functions to train and compute the AUC for each.

`scala> val iterResults = Seq(1, 5, 10, 50).map { param =>
     | val model = trainWithParams(scaledDataCats, 0.0, param, new SimpleUpdater, 1.0)
     | createMetrics(s"$param iterations", scaledDataCats, model)
     | }`

`scala> iterResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
1 iterations, AUC = 64.95%
5 iterations, AUC = 66.62%
10 iterations, AUC = 66.55%
50 iterations, AUC = 66.81%`

**Step size**

The step size controls how far the algorithm moves along the negative gradient (the direction of steepest descent) each time it updates the model weight vector. We will try step sizes of 0.001, 0.01, 0.1, 1.0 and 10.0.

`scala> val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
     | val model = trainWithParams(scaledDataCats, 0.0, numIterations, new SimpleUpdater, param)
     | createMetrics(s"$param step size", scaledDataCats, model)
     | }`

`scala> stepResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
0.001 step size, AUC = 64.97%
0.01 step size, AUC = 64.96%
0.1 step size, AUC = 65.52%
1.0 step size, AUC = 66.55%
10.0 step size, AUC = 61.92%`

The AUC improves as we go from a step size of 0.001 to 1.0 but falls at 10.0.
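The same pattern shows up on a toy problem. The sketch below runs plain gradient descent on the hypothetical objective `f(w) = 0.5 * (w - 3)^2`, which is not the logistic loss and omits any step-size decay the MLlib optimizer may apply; `StepSizeDemo` and `descend` are hypothetical names.

```scala
object StepSizeDemo {
  // Gradient descent on the toy objective f(w) = 0.5 * (w - 3)^2.
  // Its gradient is (w - 3), so each update is w - stepSize * (w - 3).
  def descend(stepSize: Double, iterations: Int, start: Double = 0.0): Double =
    (1 to iterations).foldLeft(start) { (w, _) => w - stepSize * (w - 3.0) }

  def main(args: Array[String]): Unit =
    Seq(0.001, 0.1, 1.0, 10.0).foreach { s =>
      val w = descend(s, 10)
      println(s"step size $s: w after 10 iterations = $w (optimum is 3.0)")
    }
}
```

With a very small step the weight barely moves in 10 iterations, while a step that is too large overshoots the optimum by more on every update and diverges.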

**Updater**

This argument controls the form of regularization. Regularization can help prevent overfitting by penalizing model complexity.

When regularization is **too low, models tend to overfit**; when it is **too high, models tend to underfit.**

The forms of regularization available are:

* SimpleUpdater - default for **Logistic Regression**, no regularization
* SquaredL2Updater - default for **SVMs**, squared L2 norm of the weight vector
* L1Updater - L1 norm of the weight vector

**We will use the SquaredL2Updater regularizer with regularization parameter values 0.001, 0.01, 0.1, 1.0, 10.0**

`scala> val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
     | val model = trainWithParams(scaledDataCats, param, numIterations, new SquaredL2Updater, 1.0)
     | createMetrics(s"$param L2 regularization parameter", scaledDataCats, model)
     | }`

`scala> regResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
0.001 L2 regularization parameter, AUC = 66.55%
0.01 L2 regularization parameter, AUC = 66.55%
0.1 L2 regularization parameter, AUC = 66.63%
1.0 L2 regularization parameter, AUC = 66.04%
10.0 L2 regularization parameter, AUC = 35.33%`

With a high regularization parameter (10.0), we can see that the AUC drops drastically due to underfitting.
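Why a large penalty causes underfitting can be seen on a one-weight toy problem. The sketch below minimizes the hypothetical objective `0.5 * (w - 3)^2 + 0.5 * regParam * w^2`, whose exact minimizer is `3 / (1 + regParam)`; it is not the logistic regression loss, and `L2ShrinkageDemo` and `fit` are hypothetical names.

```scala
object L2ShrinkageDemo {
  // Minimize 0.5 * (w - 3)^2 + 0.5 * regParam * w^2 by gradient descent.
  // The gradient is (w - 3) + regParam * w; the exact minimizer is 3 / (1 + regParam).
  def fit(regParam: Double, stepSize: Double = 0.05, iterations: Int = 2000): Double =
    (1 to iterations).foldLeft(0.0) { (w, _) =>
      w - stepSize * ((w - 3.0) + regParam * w)
    }

  def main(args: Array[String]): Unit =
    Seq(0.0, 0.1, 1.0, 10.0).foreach { r =>
      val w = fit(r)
      println(s"regParam $r: fitted w = $w (unregularized optimum is 3.0)")
    }
}
```

As `regParam` grows, the fitted weight is pulled further toward zero and away from the value that best fits the data.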

## Step 7.2: Tuning Decision Tree parameters ##

We will be tuning:

* **maxDepth**: controls the complexity of the model
* **impurity**: choose between `Gini` and `Entropy`
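The two impurity measures can be compared on class counts at a single node. This is a plain-Scala sketch of the standard definitions (Gini as one minus the sum of squared class proportions, entropy in bits); `ImpurityDemo` is a hypothetical name and the counts are made up.

```scala
object ImpurityDemo {
  // Gini impurity: 1 - sum of squared class proportions.
  def gini(counts: Seq[Double]): Double = {
    val total = counts.sum
    1.0 - counts.map(c => (c / total) * (c / total)).sum
  }

  // Entropy in bits: sum of -p * log2(p), skipping empty classes.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical node with 30 evergreen and 10 ephemeral pages.
    val counts = Seq(30.0, 10.0)
    println(s"gini=${gini(counts)} entropy=${entropy(counts)}")
  }
}
```

Both measures are 0 for a pure node and largest for an evenly mixed one, which is consistent with the two settings giving similar trees.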

**Load libraries**

`scala> import org.apache.spark.mllib.tree.impurity.Impurity
import org.apache.spark.mllib.tree.impurity.Impurity

scala> import org.apache.spark.mllib.tree.impurity.Entropy
import org.apache.spark.mllib.tree.impurity.Entropy

scala> import org.apache.spark.mllib.tree.impurity.Gini
import org.apache.spark.mllib.tree.impurity.Gini`

**Helper function to iterate through tuned parameters**

`scala> def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int, impurity: Impurity) = {
     | DecisionTree.train(input, Algo.Classification, impurity, maxDepth)
     | }`

**Iterating through different values for maxDepth with the Entropy impurity parameter**

We will use 1, 2, 3, 4, 5, 10, 20 values for the maxDepth parameter.

`scala> val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
     | val model = trainDTWithParams(data, param, Entropy)
     | val scoreAndLabels = data.map { point =>
     | val score = model.predict(point.features)
     | (if (score > 0.5) 1.0 else 0.0, point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (s"$param tree depth", metrics.areaUnderROC)
     | }`

`scala> dtResultsEntropy.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
1 tree depth, AUC = 59.33%
2 tree depth, AUC = 61.68%
3 tree depth, AUC = 62.61%
4 tree depth, AUC = 63.63%
5 tree depth, AUC = 64.88%
10 tree depth, AUC = 76.26%
20 tree depth, AUC = 98.45%`

We get our best result at the highest model complexity, with a depth of 20, but since we evaluate on the training data this is very likely because we are overfitting.

**Iterating through different values for maxDepth with the Gini impurity parameter**

`scala> val dtResultsGini = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
     | val model = trainDTWithParams(data, param, Gini)
     | val scoreAndLabels = data.map { point =>
     | val score = model.predict(point.features)
     | (if (score > 0.5) 1.0 else 0.0, point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (s"$param tree depth", metrics.areaUnderROC)
     | }`

`scala> dtResultsGini.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
1 tree depth, AUC = 59.33%
2 tree depth, AUC = 61.68%
3 tree depth, AUC = 62.61%
4 tree depth, AUC = 63.63%
5 tree depth, AUC = 64.89%
10 tree depth, AUC = 78.37%
20 tree depth, AUC = 98.87%`

We get very similar results with the Gini impurity as well.

## Step 7.3: Tuning Naive Bayes model ##

Here we will tune the **`lambda`** parameter, which controls additive smoothing. Smoothing handles the case where a class and a feature value never occur together in the dataset.
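As a rough illustration of additive smoothing, the sketch below uses the usual multinomial formulation `(count + lambda) / (classTotal + lambda * numFeatures)`. MLlib's internal implementation may differ in detail, and `AdditiveSmoothing` and `smoothed` are hypothetical names.

```scala
object AdditiveSmoothing {
  // Smoothed estimate of P(feature | class) from counts:
  // (count + lambda) / (classTotal + lambda * numFeatures)
  def smoothed(count: Double, classTotal: Double, numFeatures: Int, lambda: Double): Double =
    (count + lambda) / (classTotal + lambda * numFeatures)

  def main(args: Array[String]): Unit = {
    // Hypothetical class with 10 observations spread over 14 categories,
    // and a category that never occurred with this class (count = 0).
    Seq(0.0, 0.001, 1.0, 10.0).foreach { l =>
      println(s"lambda $l: P = ${smoothed(0.0, 10.0, 14, l)}")
    }
  }
}
```

Without smoothing an unseen combination gets probability 0, which would zero out the whole class likelihood; any positive `lambda` avoids that.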

**Helper function to iterate through lambda values**

`scala> def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
     | val nb = new NaiveBayes
     | nb.setLambda(lambda)
     | nb.run(input)
     | }`

**Iterate through values for lambda**

`scala> val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
     | val model = trainNBWithParams(dataNB, param)
     | val scoreAndLabels = dataNB.map { point =>
     | (model.predict(point.features), point.label)
     | }
     | val metrics = new BinaryClassificationMetrics(scoreAndLabels)
     | (s"$param lambda", metrics.areaUnderROC)
     | }`

`scala> nbResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
0.001 lambda, AUC = 60.51%
0.01 lambda, AUC = 60.51%
0.1 lambda, AUC = 60.51%
1.0 lambda, AUC = 60.51%
10.0 lambda, AUC = 60.51%`

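# Step 8: Cross Validation #

Here we use the simplest form of validation, a single hold-out split, rather than k-fold cross-validation.

**Step 1: Split the data into a 60% training set and a 40% test set**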

`scala> val trainTestSplit = scaledDataCats.randomSplit(Array(0.6, 0.4), 123)`

`scala> val train = trainTestSplit(0)`

`scala> val test = trainTestSplit(1)`
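A seeded split can be sketched on an in-memory sequence as well. This version assumes each element is assigned to a part independently at random, so the part sizes are only approximately 60/40; `SeededSplit` and `split` are hypothetical names and this is not the Spark implementation.

```scala
import scala.util.Random

object SeededSplit {
  // Send each element to the first part with probability `trainFraction`.
  // The same seed always produces the same split.
  def split[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    data.partition(_ => rng.nextDouble() < trainFraction)
  }

  def main(args: Array[String]): Unit = {
    val (train, test) = split(1 to 1000, 0.6, 123L)
    println(s"train size = ${train.size}, test size = ${test.size}")
  }
}
```

Fixing the seed keeps the train and test sets identical across runs, so differences in AUC come from the parameters rather than from a different split.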

**Step 2: Train Logistic Regression model on training data and make predictions on the test set**

`scala> val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map { param =>
     | val model = trainWithParams(train, param, numIterations, new SquaredL2Updater, 1.0)
     | createMetrics(s"$param L2 regularization parameter", test, model)
     | }`

`scala> regResultsTest.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.6f%%") }
0.0 L2 regularization parameter, AUC = 66.126842%
0.001 L2 regularization parameter, AUC = 66.126842%
0.0025 L2 regularization parameter, AUC = 66.126842%
0.005 L2 regularization parameter, AUC = 66.126842%
0.01 L2 regularization parameter, AUC = 66.093195%`

**Step 3 (for evaluation purposes): Train model on training data and make predictions on the training set**

`scala> val regResultsTrain = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map { param =>
     | val model = trainWithParams(train, param, numIterations, new SquaredL2Updater, 1.0)
     | createMetrics(s"$param L2 regularization parameter", train, model)
     | }`

`scala> regResultsTrain.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.6f%%") }
0.0 L2 regularization parameter, AUC = 66.233459%
0.001 L2 regularization parameter, AUC = 66.233459%
0.0025 L2 regularization parameter, AUC = 66.233459%
0.005 L2 regularization parameter, AUC = 66.257100%
0.01 L2 regularization parameter, AUC = 66.278745%`

As we can see, the AUC is slightly higher when we evaluate on the training data, which the model has already seen, than on the held-out test set.

# Step 9: Conclusion #

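Thus our logistic regression model, trained on standardized features plus the 1-of-k encoded category, distinguishes `ephemeral` from `evergreen` pages with an AUC of about 66% on held-out data, up from about 50% for the initial untuned model.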