### 11. Spark ML

This notebook will introduce Spark ML and its API.

#### Correlation calculation


In [4]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SparkML").getOrCreate()
import spark.implicits._

In [5]:
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.ml.linalg.{Vectors, Matrix}
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),  
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)
val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println("\nSpearman correlation matrix:\n" + coeff2.toString)

Pearson correlation matrix:
1.0                   0.055641488407465814  NaN  0.4004714203168137  
0.055641488407465814  1.0                   NaN  0.9135958615342522  
NaN                   NaN                   1.0  NaN                 
0.4004714203168137    0.9135958615342522    NaN  1.0                 

Spearman correlation matrix:
1.0                  0.10540925533894532  NaN  0.40000000000000174  
0.10540925533894532  1.0                  NaN  0.9486832980505141   
NaN                  NaN                  1.0  NaN                  
0.40000000000000174  0.9486832980505141   NaN  1.0                  




Let us understand the above piece of code step by step. There are two ways to create vectors, first one is dense and another is sparse. In a dense vector we specify all n elements and its values. The dense vector is simply created as ``Vectors.dense(4.0, 5.0, 0.0, 3.0)`` where as sparse is created as ``Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))`` wgere the first number if the number of elements/dimension in the vector, and the ``Seq`` given is a tuple of index, (value pair). Thus ``Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))`` is same as ``Vectors.dense(1.0, 0.0, 0.0, -2.0)`` as seen below

In [6]:
Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))).toDense

[1.0,0.0,0.0,-2.0]


The correlation matrix is is always symmetric across diagonal with diagonal values being 1 as the corelation of a vector with itself is always 1. The matrix is symmetric across the diagonal that is element (0, 1) is same as (1, 0), (0, 2) same as (2, 0) and so on.

The formula for pearson correlation is 

$p\:=\:\frac{n\sum{xy} - (\sum{x})(\sum{y})}{\sqrt{[n\sum{x^2} - (\sum{x})^2][n\sum{y^2} - (\sum{y})^2]}}$

The correlation between two lists (1.0, 4.0, 6.0, 9.0) and (0.0, 5.0, 7.0, 0.0)
as per [this](http://calculator.vhex.net/calculator/statistics/pearson-correlation) URL is expected to be 0.055641.

The following code snippet calculates this pearson correlation between two list of doubles of equal length.

In [7]:
import scala.math.sqrt

def pearsonCorrelation(x:List[Double], y: List[Double]): Double = {
    val n = x.length
    val sumx = x.reduce(_ + _)
    val sumxsquare = x.map(e => e * e).reduce(_ + _)
    val sumy = y.reduce(_ + _)
    val sumysquare = y.map(e => e * e).reduce(_ + _)
    val sumxy = (x zip y).map{case (l, r) => l * r}.reduce(_ + _)
    val numerator = (n * sumxy) - (sumx * sumy)
    val denominator = (n * sumxsquare - sumx * sumx) * (n * sumysquare - sumy * sumy)
    numerator / math.sqrt(denominator)
}

val v1 = List(1.0, 4.0, 6.0, 9.0)
val v2 = List(0, 5.0, 7.0, 0)
println("Pearson coefficient between v1 and v2 is " + pearsonCorrelation(v1, v2))

Pearson coefficient between v1 and v2 is 0.055641488407465724


In [8]:
// Alternate implementation of above but by calculating the mean value first. 
// The above implementation doesn't need to calculate
val n = v1.length
val meanx = v1.reduce(_ + _) * 1.0 / n
val meany = v2.reduce(_ + _) * 1.0 / n

val numerator = (v1 zip v2).map{
 case (e1, e2) => (e1 - meanx) * (e2 - meany)
}.reduce(_ + _)

val sumxsquare = v1.map( e => (e - meanx) * (e - meanx)).reduce(_ + _)
val sumysquare= v2.map(e => (e - meany) * (e - meany)).reduce(_ + _)
val denominator = math.sqrt(sumxsquare) * math.sqrt(sumysquare)


println("Covariance between x and y is " + numerator / denominator)

Covariance between x and y is 0.055641488407465724



The vectors defined in ``data`` for which we computed the pearson coefficient are

```
[1.0,0.0,0.0,-2.0]
[4.0,5.0,0.0,3.0]
[6.0,7.0,0.0,8.0]
[9.0,0.0,0.0,1.0]
```

The line ``val df = data.map(Tuple1.apply).toDF("features")`` creates a ``DataFrame``. The ``toDF(columnName)`` is a a function that can be applied to ``Seq[Tuple{n}]``. This is an implicit function that lets us convert tuples to data frame which we get by the import ``scala.implicits._``. Each tuple in the sequence becomes a row and each element of the tuple becomes a column. To achieve this, we need to convert ``Seq[Vector]`` to ``Seq[Tuple1]`` using ``map`` so that we can convert the ``Seq[Tuple1]`` to a ``DataFrame`` with one column which we will call features.

The code ``Correlation.corr(df, "features")`` computes the correlation matrix. The return value of this call is a ``DataFrame`` with one row and one column. The name of the column is ``pearson(<nae of the original column in dataset>)`` and the type of the value is a ``Matrix``. The code ``val Row(coeff1: Matrix) = Correlation.corr(df, "features").head`` is a one liner to get the first and only row of this ``DataFrame`` and assign the value of the ``Matrix`` in it to the variable ``coeff1``.

The parameters of the ``corr`` function are the ``DataFrame`` instance, the name of the column in the ``DataFrame`` and an optional third parameter for the correlation method, which defaults to ``pearson``. Only valid value for now is ``spearman``.


---

#### ChiSquared Test

We will now look at Hypothesis testing which tests whether the result we get is stastically significant or not. We will see Pearsons Chi-Squared tests in this section. First a code sample and the result we get




In [23]:
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.linalg.Vector

val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)

val df = data.toDF("label", "features")
val chiDF = ChiSquareTest.test(df, "features", "label")
val chi = chiDF.head
println("Printing the ChiSquaredTest DataFrame")
chiDF.show(truncate = false)
println("pValues = " + chi.getAs[Vector](0))
println("degreesOfFreedom = " + chi.getSeq[Int](1).mkString("[", ",", "]"))
println("statistics = " + chi.getAs[Vector](2))

Printing the ChiSquaredTest DataFrame
+---------------------------------------+----------------+----------+
|pValues                                |degreesOfFreedom|statistics|
+---------------------------------------+----------------+----------+
|[0.6872892787909721,0.6822703303362126]|[2, 3]          |[0.75,1.5]|
+---------------------------------------+----------------+----------+

pValues = [0.6872892787909721,0.6822703303362126]
degreesOfFreedom = [2,3]
statistics = [0.75,1.5]



We will now derive the above statistics using plain RDDs to see how the calculation is performed step by step. Note that this is not necessarily the most efficient way, but it demonstrates the steps nevertheless. 

We start by creating a contingency matrix for the each feature. The number of rows of this matrix are same as the number of unique values of that feature and the number of columns is same as number of unique labels. The values in the matrix are same as the number of occurances for that feature, label combination. In our case we will build two contingency matrix for each feature. For the first feature the unique values are (0.5, 1.5, 3.5) and the second feature has the unique values (10, 20, 30, 40). The number of columns are 2 in both cases for the labels (1, 0).

As an illustration, the matrix for first feature would be as follows. We have also added the row sum and the col sum for the following matrix

|               | **0**          | **1**  | **sum**|
|:-------------: |:-------------:|:-------------: |:-------------: |
| **0.5**      |  1| 0 |1|
| **1.5**      | 1      |   1 |2|
| **3.5**      | 2     |    1 |3|
| **sum**      | 4    |   2  ||


Let us give the input as a List of ``Tuples[(Double, Double)]``. This is for one dimension of the feature vector.

In [75]:
val inputList = List((0.5, 0), (1.5, 0), (1.5, 1), (3.5, 0), (3.5, 0), (3.5, 1))
val inputs = sc.parallelize(inputList)
val freq = inputs.map(i => (i, 1)).reduceByKey(_ + _).collectAsMap
val rowSum = inputs.map(i => (i._1, 1)).reduceByKey(_ + _).collectAsMap
val colSum = inputs.map(i => (i._2, 1)).reduceByKey(_ + _).collectAsMap
val inputSize = inputList.size
val uniqueLabels = inputList.map(_._2).toSet
val uniqueFeatures = inputList.map(_._1).toSet
println("(Label, Feature) pairs counts are " + 
                freq.mkString("[", ", ", "]") + 
                "\nRow Sum are " + rowSum.mkString("[", ", ", "]") + 
                "\nCol Sum are " + colSum.mkString("[", ", ", "]") )

(Label, Feature) pairs counts are [(1.5,0) -> 1, (3.5,0) -> 2, (3.5,1) -> 1, (0.5,0) -> 1, (1.5,1) -> 1]
Row Sum are [3.5 -> 3, 1.5 -> 2, 0.5 -> 1]
Col Sum are [1 -> 2, 0 -> 4]



The above Map, Row Sum and Col Sum values are same as the the Matrix we saw earlier. The next step is to compute the expected value. The way we compute the expected value is compute the probability of each of the possible labels and then multiply the total number of occurances of the feature with these probability.

For example, in the above case, from Col Sum we have the frequency of label 0 and 1 is 0.67 (4 / 6) and 0.33 (2 / 6) respectively. Expected value for feature value 3.5 would be 2( 3 $\times$ 0.67) and 1( 3 $\times$ 0.33) for label 0 and 1 respectively

Following code computes the expected value of each feature value, label value combination.

In [87]:
val expectedValues = (for(f <- uniqueFeatures ; l <-  uniqueLabels) yield(f, l)) map {
    case (f, l) => ((f, l), 1.0 * rowSum(f) * colSum(l) / inputSize)
}
println("Expected values for possible (Label, Feature) pair are " + expectedValues)

Expected values for possible (Label, Feature) pair are Set(((1.5,0),1.3333333333333333), ((3.5,0),2.0), ((0.5,1),0.3333333333333333), ((0.5,0),0.6666666666666666), ((1.5,1),0.6666666666666666), ((3.5,1),1.0))



For calculating $\chi^2$ stat. We will implement the following

$\chi^2\:=\:\sum_{i=1}^{N}\frac{(O_i - E_i)^2}{E_i}$

Where

- $O_i$ = the number of observations of type i.
- N = Total Number of observations
- $E_i$ = Expected theoritical probability of type i


Degrees of freedom is simply computed as $(n_f - 1) \times (n_l - 1)$

Where 

- $n_f$: Number of unique feature values
- $n_l$: Number of unique labels

In [94]:
val stat = expectedValues.foldLeft(0.0){
    case(acc, (k, e)) =>
        val f = freq.getOrElse(k, 0)
        val diff = f - e
        acc + diff * diff / e        
}
val df = (uniqueFeatures.size - 1) * (uniqueLabels.size - 1)
println("Stat is " + stat + ", degrees of freedom are " + df)

Stat is 0.7500000000000001, degrees of freedom are 2




With the Stat and degrees of freedom calculated, we will now calculate the pValue. The above degrees of freedom and stat are in sync with what we calculated using SparkML.

pValue calculation isnt straight forward and SparkML uses Apache Commons Math. We will do the same as follows

In [99]:
import org.apache.commons.math3.distribution.ChiSquaredDistribution
println("The p value for the stat " + stat + " for " + df + " degrees of freedom is " + 
        (1 - new ChiSquaredDistribution(df).cumulativeProbability(stat)))

The p value for the stat 0.7500000000000001 for 2 degrees of freedom is 0.6872892787909721


---

#### ML Pipeline

Spark ML standardizes the API for ML Algorithms to make it easy to use different algorithms in the same pipeline. Following are the important components

- DataFrame: This is the same dataframe we saw in Spark SQL. The dataframe is used to store the dataset and can be used to store various features, text, feature vectors etc.
- Transformer: This converts a ``DataFrame`` to another ``DataFrame``. ML model is a transformer which converts the input data frame or features to predictions. It can also be used for feature engineering where we use an existing DataFrame and add more engineered columns to the DataFrame.
- Estimator: This takes in a dataframe and gives a model. That is, it takes in a DataFrame and gives a Transformer.The learning algorithm used is in fact an Estimator.
- Pipeline: It chains multiple Estimators and Transformers to specify an ML Workflow.
- Parameter: All Transformers and Estimators share a common API for specifying parameters.


A Pipeline is a Transformer. Thus ``transform`` method is invoked on each component of the ``Pipeline`` to produce a ``DataFrame`` fed into the next component of the pipeline. If any component is an ``Estimator`` it calls the ``fit``method to produce a ``Model`` which is a ``Transformer``.

---

Let us now see how to train a model. We will use a ``LogisticRegression`` model with max 10 iterations and use 0.01 as $\lambda$ for regularization.


In [135]:
import org.apache.spark.ml.classification.LogisticRegression
val inputVectors = spark.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("Labels", "Features")

println("Input Vectors are\n") 
inputVectors.show(truncate = false)
  
val lr1 = new LogisticRegression()
println("Explanation of parameters for this Regression is \n\n" + lr.explainParams + "\n\n")

Input Vectors are

+------+--------------+
|Labels|Features      |
+------+--------------+
|1.0   |[0.0,1.1,0.1] |
|0.0   |[2.0,1.0,-1.0]|
|0.0   |[2.0,1.3,1.0] |
|1.0   |[0.0,1.2,-0.5]|
+------+--------------+

Explanation of parameters for this Regression is 

aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial. (default: auto)
featuresCol: features column name (default: features, current: Features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label, current: Labels)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. (undefined)
lowerBoundsOnInterc


Calling ``explainParam`` on any ``Transformer`` will give us a documentation of all possible parameters the ``Transformer`` supports. Let us train the Model with the given input vectors

In [124]:

lr.setMaxIter(10).setRegParam(0.01).setLabelCol("Labels").setFeaturesCol("Features")
val model1 = lr.fit(inputVectors)
println("Model was fit using parameters: " + model1.extractParamMap)

Model was fit using parameters: {
	logreg_0b047d778a82-aggregationDepth: 2,
	logreg_0b047d778a82-elasticNetParam: 0.0,
	logreg_0b047d778a82-family: auto,
	logreg_0b047d778a82-featuresCol: Features,
	logreg_0b047d778a82-fitIntercept: true,
	logreg_0b047d778a82-labelCol: Labels,
	logreg_0b047d778a82-maxIter: 10,
	logreg_0b047d778a82-predictionCol: prediction,
	logreg_0b047d778a82-probabilityCol: probability,
	logreg_0b047d778a82-rawPredictionCol: rawPrediction,
	logreg_0b047d778a82-regParam: 0.01,
	logreg_0b047d778a82-standardization: true,
	logreg_0b047d778a82-threshold: 0.5,
	logreg_0b047d778a82-tol: 1.0E-6
}



As we see above, we train a Model with the input Model instance. We can configure it by chaining calls to set the required parameters. Following is an alternate way to create a similar model.


In [127]:
import org.apache.spark.ml.param.ParamMap

val lr2 = new LogisticRegression()
val pm = ParamMap(lr2.regParam -> 0.01, lr2.labelCol -> "Labels", lr2.featuresCol -> "Features", lr2.maxIter -> 10)
val model2 = lr2.fit(inputVectors, pm)
println("Model was fit using parameters: " + model2.extractParamMap)

Model was fit using parameters: {
	logreg_1b9826f1f816-aggregationDepth: 2,
	logreg_1b9826f1f816-elasticNetParam: 0.0,
	logreg_1b9826f1f816-family: auto,
	logreg_1b9826f1f816-featuresCol: Features,
	logreg_1b9826f1f816-fitIntercept: true,
	logreg_1b9826f1f816-labelCol: Labels,
	logreg_1b9826f1f816-maxIter: 10,
	logreg_1b9826f1f816-predictionCol: prediction,
	logreg_1b9826f1f816-probabilityCol: probability,
	logreg_1b9826f1f816-rawPredictionCol: rawPrediction,
	logreg_1b9826f1f816-regParam: 0.01,
	logreg_1b9826f1f816-standardization: true,
	logreg_1b9826f1f816-threshold: 0.5,
	logreg_1b9826f1f816-tol: 1.0E-6
}



Following code snippet will now use the Test model and give results

In [134]:
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("Labels", "Features")

val res = model2.transform(test.select("Features"))
println("Predictions are\n")
res.show(truncate = false)
println("Inputs with the labels are\n")
test.show(truncate = false)


Predictions are

+--------------+--------------------------------------+------------------------------------------+----------+
|Features      |rawPrediction                         |probability                               |prediction|
+--------------+--------------------------------------+------------------------------------------+----------+
|[-1.0,1.5,1.3]|[-6.587201443935503,6.587201443935503]|[0.0013759947069214356,0.9986240052930786]|1.0       |
|[3.0,2.0,-0.1]|[3.980182819425658,-3.980182819425658]|[0.9816604009374171,0.018339599062582944] |0.0       |
|[0.0,2.2,-1.5]|[-6.37651770286046,6.37651770286046]  |[0.0016981475578358373,0.9983018524421641]|1.0       |
+--------------+--------------------------------------+------------------------------------------+----------+

Inputs with the labels are

+------+--------------+
|Labels|Features      |
+------+--------------+
|1.0   |[-1.0,1.5,1.3]|
|0.0   |[3.0,2.0,-0.1]|
|1.0   |[0.0,2.2,-1.5]|
+------+--------------+



---

#### Sample Pipeline



In [153]:
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.{Pipeline, PipelineModel}

val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "Text", "Label")

//1. Create Transformers to amend the inputDataFrame to add new columns, Tokens and tf_features
val token = new Tokenizer().setInputCol("Text").setOutputCol("Tokens")

val hash = new HashingTF().setInputCol("Tokens").setOutputCol("tf_features")

val lr = new LogisticRegression().setFeaturesCol("tf_features").setLabelCol("Label").setMaxIter(10).setRegParam(0.01)

val pipe = new Pipeline().setStages(Array(token, hash, lr))

val pipelineModel = pipe.fit(training)




In the above code snippet we did the following

- Create a training ``DataFrame``
- Create a transformer ``token`` or type ``Tokenizer`` which tokanizes the input sentences. To see what possible values can be configured we can invoke ``token.explainParams``
- Create a transformer ``hash`` which is of type ``HashingTF``. This transformer takes in word vectors and outputs the term frequency usinh Hashing. Again, ``hash.explainParams`` will give details of all the possible parameters.
- Create a ``LogisticRegression`` instance.
- Create a ``Pipeline`` which adds the three stages for tokenize, hashing and LogisticRegression in that order.
- Fit the pipeline with the traning data which gives us a ``PipelineModel``. 

To see what exactly is output by the first two transformers.

In [172]:
val tokenized = token.transform(training)
tokenized.show(truncate = false)

val hashed = hash.transform(tokenized)
hashed.select("id", "Tokens", "tf_features").show(truncate = false)

+---+----------------+-----+----------------------+
|id |Text            |Label|Tokens                |
+---+----------------+-----+----------------------+
|0  |a b c d e spark |1.0  |[a, b, c, d, e, spark]|
|1  |b d             |0.0  |[b, d]                |
|2  |spark f g h     |1.0  |[spark, f, g, h]      |
|3  |hadoop mapreduce|0.0  |[hadoop, mapreduce]   |
+---+----------------+-----+----------------------+

+---+----------------------+--------------------------------------------------------------------------+
|id |Tokens                |tf_features                                                               |
+---+----------------------+--------------------------------------------------------------------------+
|0  |[a, b, c, d, e, spark]|(262144,[17222,27526,28698,30913,227410,234657],[1.0,1.0,1.0,1.0,1.0,1.0])|
|1  |[b, d]                |(262144,[27526,30913],[1.0,1.0])                                          |
|2  |[spark, f, g, h]      |(262144,[15554,24152,51505,234657],



Now that we have the ``Pipeline`` and the ``PipelineModel``. Let us test with our sample input data in the following snippet.

In [176]:
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("Id", "Text")

pipelineModel.transform(test).select("id", "Text", "probability", "prediction")show(truncate = false)

+---+------------------+-----------------------------------------+----------+
|id |Text              |probability                              |prediction|
+---+------------------+-----------------------------------------+----------+
|4  |spark i j k       |[0.5406433544852275,0.45935664551477245] |0.0       |
|5  |l m n             |[0.933438262738352,0.06656173726164803]  |0.0       |
|6  |spark hadoop spark|[0.22922657813243238,0.7707734218675676] |1.0       |
|7  |apache hadoop     |[0.9768636139518374,0.023136386048162642]|0.0       |
+---+------------------+-----------------------------------------+----------+




The ``DataFrame`` we provided just contains the text. The pipeline internally tokenizes and then computes the feature vector before feeding it into the classifier.

The process of training the model is not cheap and takes hours and at times days. We thus need a way to save the trained ``PipelineModel`` and even the ``Pipeline`` (in this case we need to train the model all over again). Following code snippet shows how we can achieve this.  We will load the pre-trained model and get predictions for our test ``DataFrame``. The predictions should be identical to the ones we just saw.



In [183]:
pipelineModel.write.overwrite.save("pipeline-model")
pipe.write.overwrite.save("pipeline")

val preTrained = PipelineModel.load("pipeline-model")
preTrained.transform(test).select("id", "Text", "probability", "prediction")show(truncate = false)

+---+------------------+-----------------------------------------+----------+
|id |Text              |probability                              |prediction|
+---+------------------+-----------------------------------------+----------+
|4  |spark i j k       |[0.5406433544852275,0.45935664551477245] |0.0       |
|5  |l m n             |[0.933438262738352,0.06656173726164803]  |0.0       |
|6  |spark hadoop spark|[0.22922657813243238,0.7707734218675676] |1.0       |
|7  |apache hadoop     |[0.9768636139518374,0.023136386048162642]|0.0       |
+---+------------------+-----------------------------------------+----------+



---

#### Feature Extraction, Selection and Transformation.

##### TF-IDF.

This is a feature vector used in Natural Language Processing to analyse the importance of a term in the document. The TermFrequency ``TF(t, d)`` is used to denote the number of times the term occured in the document. The Document ``DF(t, D)`` is the number of times the term occured in the document corpus ``D``.

$IDF(t, D)\:= log\frac{|D| + 1}{DF(t, D) + 1}$

if the term is very common across corpus the fraction tends to 1 and the log becomes 0. If the word is rare in the corpus, the value of the fraction is large and thus the log value is large.

Thus $TFIDF(t, d, D)\:=\:TF(t, d)\cdot IDF(t, D)$

We have already seen the HashingTF which calculates the term frequency in the document using Hashing. another alternative is ``CountVectorizer``. Following is the result of using both implementations.


In [202]:
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, CountVectorizer}
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "a b c"),
  (0.0, "a b b c a d")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val countTF = new CountVectorizer().setInputCol("words").setOutputCol("rawFeatures").fit(wordsData)
hashingTF.transform(wordsData).select("words", "rawFeatures")show(truncate = false)
val tfDF = countTF.transform(wordsData)
tfDF.select("words", "rawFeatures").show(truncate = false)


+------------------+-----------------------------------------------------+
|words             |rawFeatures                                          |
+------------------+-----------------------------------------------------+
|[a, b, c]         |(262144,[28698,30913,227410],[1.0,1.0,1.0])          |
|[a, b, b, c, a, d]|(262144,[27526,28698,30913,227410],[1.0,1.0,2.0,2.0])|
+------------------+-----------------------------------------------------+

+------------------+-------------------------------+
|words             |rawFeatures                    |
+------------------+-------------------------------+
|[a, b, c]         |(4,[0,1,2],[1.0,1.0,1.0])      |
|[a, b, b, c, a, d]|(4,[0,1,2,3],[2.0,2.0,1.0,1.0])|
+------------------+-------------------------------+




As we see in the above the output of both ``HashingTF`` and ``CountVectorizer`` both give is a ``Tuple3`` with the fields (``numFields``, ``featureId``, ``wordCount``) 

The size of ``featureId`` and ``wordCount`` is same and has a 1-1 correspondence between the term id and the term count.

Now that we know what the feature count is, we then find the Inverse document frequency. Note how the term ``d`` gets a higher value and others get a value of 0 as they appear in all the documents in the corpus.

In [204]:
import org.apache.spark.ml.feature.IDF

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("tfidf")
val idfModel = idf.fit(tfDF)
idfModel.transform(tfDF).select("words", "rawFeatures", "tfidf").show(truncate = false)

+------------------+-------------------------------+----------------------------------------------+
|words             |rawFeatures                    |tfidf                                         |
+------------------+-------------------------------+----------------------------------------------+
|[a, b, c]         |(4,[0,1,2],[1.0,1.0,1.0])      |(4,[0,1,2],[0.0,0.0,0.0])                     |
|[a, b, b, c, a, d]|(4,[0,1,2,3],[2.0,2.0,1.0,1.0])|(4,[0,1,2,3],[0.0,0.0,0.0,0.4054651081081644])|
+------------------+-------------------------------+----------------------------------------------+




##### Word2Vec

In Word2Vec we represent words as word vectors such that similar words are close to each other in vector space.

In Word2Vec we have two vectors for each word w, $u_w$ and $v_w$ when the word w is the context word and the center word. There is another parameter k which is for the context window size.

Word2Vec has two possible approaches to calculate the word vectors, Skip gram and CBOW(Continuous Bag of Words). Spark use Skip gram approach to calculate the word vectors. The cost function that is maximized for training this model is

$\frac{1}{T}\sum_{t = 1}^T\sum_{j = -k}^klog(\frac{p_{t+j}}{p_t})$

Since log of a fraction is negative, we maximize as we need the number to be close to 0.

Thus, given the center word $w_j$, the probability that the word $w_i$ is in the window is 

$p(w_i / w_j)\:=\:\frac{exp(u_{w_i}^T\cdot v_{w_j})}{\sum_{l = 1}^V exp(u_l^T\cdot v_{w_j})}$

Following code snippet, calculates the word vectors for the vocabulary. The Vector we get for each sentence is a mean of all the vector for the words in the sentence.

In [214]:
import org.apache.spark.ml.feature.Word2Vec

val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val w2v = new Word2Vec().setInputCol("text").setOutputCol("vector").setVectorSize(3).setMinCount(0)
val w2vModel = w2v.fit(documentDF)
val vectors = w2vModel.transform(documentDF)
vectors.show(truncate = false)

+------------------------------------------+---------------------------------------------------------------+
|text                                      |vector                                                         |
+------------------------------------------+---------------------------------------------------------------+
|[Hi, I, heard, about, Spark]              |[0.03173386193811894,0.009443491697311401,0.024377789348363876]|
|[I, wish, Java, could, use, case, classes]|[0.025682436302304268,0.0314303718706859,-0.01815584538105343] |
|[Logistic, regression, models, are, neat] |[0.022586782276630402,-0.01601201295852661,0.05122732147574425]|
+------------------------------------------+---------------------------------------------------------------+




---

#### Feature Transformers

##### Tokenizer

We split a sentence into word in this type of transformation. By default the splitting is done by spaces. Following code snippet demonstrated this


In [228]:
val testDF = spark.createDataFrame(Seq(
            (1, "Hi I heard about Spark"),
            (2, "I wish Java could use case classes"),
            (3, "Logistic regression models are neat")
            )).toDF("Id", "Text")
val tokenizer = new Tokenizer().setInputCol("Text").setOutputCol("tokens")
val tokenized = tokenizer.transform(testDF)
tokenized.show(truncate = false)

+---+-----------------------------------+------------------------------------------+
|Id |Text                               |tokens                                    |
+---+-----------------------------------+------------------------------------------+
|1  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|2  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|3  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+---+-----------------------------------+------------------------------------------+



##### StopWordRemover 

Takes in tokenized dataset and adds new column which contains the subset of tokenized words except for stop words. A List of stop words for different languages can be found [here](https://github.com/apache/spark/tree/master/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords)

In [235]:
import org.apache.spark.ml.feature.StopWordsRemover

val stopWordTokenizer = new StopWordsRemover().setInputCol("tokens").setOutputCol("s_tokens")
val stopWordDF = stopWordTokenizer.transform(tokenized)
stopWordDF.select("tokens", "s_tokens").show(truncate = false)

+------------------------------------------+------------------------------------+
|tokens                                    |s_tokens                            |
+------------------------------------------+------------------------------------+
|[hi, i, heard, about, spark]              |[hi, heard, spark]                  |
|[i, wish, java, could, use, case, classes]|[wish, java, use, case, classes]    |
|[logistic, regression, models, are, neat] |[logistic, regression, models, neat]|
+------------------------------------------+------------------------------------+




##### NGrams

The following transformer create n grams from tokenized words, by default the value of n is 2.

In [239]:
import org.apache.spark.ml.feature.NGram

val nGramsTransformer = new NGram().setInputCol("s_tokens").setOutputCol("nGrams")
val nGramsDF = nGramsTransformer.transform(stopWordDF)
nGramsDF.select("s_tokens", "nGrams")show(truncate = false)

+------------------------------------+-----------------------------------------------------+
|s_tokens                            |nGrams                                               |
+------------------------------------+-----------------------------------------------------+
|[hi, heard, spark]                  |[hi heard, heard spark]                              |
|[wish, java, use, case, classes]    |[wish java, java use, use case, case classes]        |
|[logistic, regression, models, neat]|[logistic regression, regression models, models neat]|
+------------------------------------+-----------------------------------------------------+




##### Binarizer

This transformer takes in a ``DataFrame`` and emits a scalar/vector of same size with 1 if the value is greater than the threshold else 0.
The value in the output column is either a Dense or sparse vector whichever takes less space.


In [256]:
import org.apache.spark.ml.feature.Binarizer
val numericDF = spark.createDataFrame(Seq((1, Vectors.dense(1, 2, 3, 4, 5)), 
                    (1, Vectors.dense(2, 1, 0, 1, 5)))).toDF("Id", "Vectors")
val binarizer = new Binarizer().setInputCol("Vectors").setOutputCol("Binaries").setThreshold(2)
binarizer.transform(numericDF).select("Vectors", "Binaries").show(truncate = false)


+---------------------+---------------------+
|Vectors              |Binaries             |
+---------------------+---------------------+
|[1.0,2.0,3.0,4.0,5.0]|[0.0,0.0,1.0,1.0,1.0]|
|[2.0,1.0,0.0,1.0,5.0]|(5,[4],[1.0])        |
+---------------------+---------------------+




##### PCA

This Model can be used to perfome [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) for the given input vectors.

In [263]:
import org.apache.spark.ml.feature.PCA
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df)

val result = pca.transform(df)
result.show(truncate = false)

+---------------------+-----------------------------------------------------------+
|features             |pcaFeatures                                                |
+---------------------+-----------------------------------------------------------+
|(5,[1,3],[1.0,7.0])  |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[2.0,0.0,3.0,4.0,5.0]|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[4.0,0.0,0.0,6.0,7.0]|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+---------------------+-----------------------------------------------------------+




##### Polynomial Expansion

It is a process of expanding the features into polynomial space. For example, with input vector [2, 1] and degree of 3 we have the following possible values $[2^1, 2^2, 2^3, 1^1, 1^2, 1^3, 1\cdot2, 1^2\cdot 2, 1\cdot2^2]$

In [264]:
import org.apache.spark.ml.feature.PolynomialExpansion

val data = Array(
  Vectors.dense(2.0, 1.0),
  Vectors.dense(0.0, 0.0),
  Vectors.dense(3.0, -1.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val polyExpansion = new PolynomialExpansion().setInputCol("features").setOutputCol("polyFeatures").setDegree(3)

val polyDF = polyExpansion.transform(df)
polyDF.show(truncate = false)

+----------+------------------------------------------+
|features  |polyFeatures                              |
+----------+------------------------------------------+
|[2.0,1.0] |[2.0,4.0,8.0,1.0,2.0,4.0,1.0,2.0,1.0]     |
|[0.0,0.0] |[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]     |
|[3.0,-1.0]|[3.0,9.0,27.0,-1.0,-3.0,-9.0,1.0,3.0,-1.0]|
+----------+------------------------------------------+




##### StringIndexer and OneHotEncoder

These transformers transform a string value to an index and the index column to a vector with maximum one bit set corresponding to the index value.

The one hot encoding in thie case doesn't have 3 bits for three distinct values but achieves this using 2 bits only.

In [291]:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
  )).toDF("id", "category")

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

import org.apache.spark.sql.functions._

val toDense = udf((x: Vector) => x.toDense)
encoded.withColumn("OneHot", toDense($"categoryVec")).select("id","category", "categoryIndex", "OneHot").show()

+---+--------+-------------+---------+
| id|category|categoryIndex|   OneHot|
+---+--------+-------------+---------+
|  0|       a|          0.0|[1.0,0.0]|
|  1|       b|          2.0|[0.0,0.0]|
|  2|       c|          1.0|[0.0,1.0]|
|  3|       a|          0.0|[1.0,0.0]|
|  4|       a|          0.0|[1.0,0.0]|
|  5|       c|          1.0|[0.0,1.0]|
+---+--------+-------------+---------+



##### VectorIndexer 
Converts the dimensions of the vectors to categorical fields wherever the number of unique values in the dimension of less than or equal to the max categories available.

In [313]:
import org.apache.spark.ml.feature.VectorIndexer

val df = spark.createDataFrame(
                Seq((0, Vectors.dense(10, 1)),
                (2, Vectors.dense(9, 2)),
                (3, Vectors.dense(10, 3)),
                (4, Vectors.dense(9, 4)),
                (5, Vectors.dense(10, 5)))).toDF("id", "vectors")
  
val vi = new VectorIndexer().setInputCol("vectors").setMaxCategories(2).setOutputCol("Categorized")
val viModel = vi.fit(df)
println("The index of the category map is " + viModel.categoryMaps.keys)
viModel.transform(df).show()

The index of the category map is Set(0)
+---+----------+-----------+
| id|   vectors|Categorized|
+---+----------+-----------+
|  0|[10.0,1.0]|  [1.0,1.0]|
|  2| [9.0,2.0]|  [0.0,2.0]|
|  3|[10.0,3.0]|  [1.0,3.0]|
|  4| [9.0,4.0]|  [0.0,4.0]|
|  5|[10.0,5.0]|  [1.0,5.0]|
+---+----------+-----------+




##### Interaction and VectorAssembler

VectorAssembler takes in multiple vectors and scalers to create one vector in the output.  Interaction simply takes in multiple vectors to create another vector of with values which is a cross product of all possible values of these vectors. Following is an example code snippet



In [322]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Interaction

val df = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 10, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 10, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("id1", "id2", "id3", "id4", "id5", "id6", "id7")

val assembler1 = new VectorAssembler().setInputCols(Array("id2", "id3", "id4")).setOutputCol("vec1")
val assembler2 = new VectorAssembler().setInputCols(Array("id5", "id6", "id7")).setOutputCol("vec2")

val df1 = assembler2.transform(assembler1.transform(df))


val interaction = new Interaction().setInputCols(Array("vec1", "vec2")).setOutputCol("interactionCol")
val df2 = interaction.transform(df1)
df2.select("vec1", "vec2", "interactionCol").show(truncate = false)

+--------------+--------------+----------------------------------------------+
|vec1          |vec2          |interactionCol                                |
+--------------+--------------+----------------------------------------------+
|[1.0,2.0,3.0] |[8.0,4.0,5.0] |[8.0,4.0,5.0,16.0,8.0,10.0,24.0,12.0,15.0]    |
|[4.0,3.0,8.0] |[7.0,9.0,8.0] |[28.0,36.0,32.0,21.0,27.0,24.0,56.0,72.0,64.0]|
|[6.0,1.0,9.0] |[2.0,3.0,6.0] |[12.0,18.0,36.0,2.0,3.0,6.0,18.0,27.0,54.0]   |
|[10.0,8.0,6.0]|[9.0,4.0,5.0] |[90.0,40.0,50.0,72.0,32.0,40.0,54.0,24.0,30.0]|
|[9.0,2.0,7.0] |[10.0,7.0,3.0]|[90.0,63.0,27.0,20.0,14.0,6.0,70.0,49.0,21.0] |
|[1.0,1.0,4.0] |[2.0,8.0,4.0] |[2.0,8.0,4.0,2.0,8.0,4.0,8.0,32.0,16.0]       |
+--------------+--------------+----------------------------------------------+




##### Normalizer

Normalizes the vector by dividing each value in the vector with the normaizer. The normalizer is $\sqrt[p]{\sum_{i = 0}^N |v_i|^p}$ where N is the length of the vector.

In [330]:
import org.apache.spark.ml.feature.Normalizer

val normalizer = new Normalizer().setInputCol("vec1").setOutputCol("NormalizedVec1").setP(2)
normalizer.transform(df1).select("vec1", "NormalizedVec1").show(truncate = false)


+--------------+------------------------------------------------------------+
|vec1          |NormalizedVec1                                              |
+--------------+------------------------------------------------------------+
|[1.0,2.0,3.0] |[0.2672612419124244,0.5345224838248488,0.8017837257372732]  |
|[4.0,3.0,8.0] |[0.423999152002544,0.31799936400190804,0.847998304005088]   |
|[6.0,1.0,9.0] |[0.552344770738994,0.09205746178983235,0.8285171561084911]  |
|[10.0,8.0,6.0]|[0.7071067811865475,0.565685424949238,0.4242640687119285]   |
|[9.0,2.0,7.0] |[0.7774815830232241,0.17277368511627203,0.6047078979069521] |
|[1.0,1.0,4.0] |[0.23570226039551587,0.23570226039551587,0.9428090415820635]|
+--------------+------------------------------------------------------------+




##### StandardScaler

Standardizes each vector to have unit standard deviation and zero mean

In [366]:
import org.apache.spark.ml.feature.StandardScaler

val simpleDF = spark.createDataFrame(Seq(
                    Vectors.dense(5, 2, 7), 
                    Vectors.dense(1, 3, 9)).map(Tuple1.apply)).toDF("Features")

val ss1 = new StandardScaler().setInputCol("Features").setOutputCol("StdFeatures").setWithMean(false).setWithStd(true)
val ssModel = ss.fit(simpleDF)
ssModel.transform(simpleDF).show(truncate = false)
val ss2 = new StandardScaler().setInputCol("Features").setOutputCol("StdFeatures").setWithMean(true).setWithStd(true)
val ssModel1 = ss.fit(simpleDF)
ssModel1.transform(simpleDF).show(truncate = false)

+-------------+---------------------------------------------------------+
|Features     |StdFeatures                                              |
+-------------+---------------------------------------------------------+
|[5.0,2.0,7.0]|[1.7677669529663687,2.82842712474619,4.949747468305832]  |
|[1.0,3.0,9.0]|[0.35355339059327373,4.242640687119285,6.363961030678928]|
+-------------+---------------------------------------------------------+

+-------------+---------------------------------------------------------+
|Features     |StdFeatures                                              |
+-------------+---------------------------------------------------------+
|[5.0,2.0,7.0]|[1.7677669529663687,2.82842712474619,4.949747468305832]  |
|[1.0,3.0,9.0]|[0.35355339059327373,4.242640687119285,6.363961030678928]|
+-------------+---------------------------------------------------------+



The above calculation is done by calculating standard deviation across all dimensions of the vectors in different rows and calculating the z-scores. In one case the values are centered across mean and in another it isnt.


##### MinMaxScaler

MinMaxScaler scales each feature between a given range. The rescaling each feature E is done as follows

$Rescaled(e_i)\:=\:\frac{e_i - E}{E_{max} - E_{min}} * (max - min) + min$

Default value of max and min is 0 and 1

In [376]:
import org.apache.spark.ml.feature.MinMaxScaler

val simpleDF = spark.createDataFrame(Seq(
                    Vectors.dense(5, 2, 7), 
                    Vectors.dense(1, 3, 9),
                    Vectors.dense(4, 8, 6)).map(Tuple1.apply)).toDF("Features")

val minMaxScaler = new MinMaxScaler().setInputCol("Features").setOutputCol("StdFeatures").setMin(-1).setMax(1)
val minMaxScalerModel = minMaxScaler.fit(simpleDF)
minMaxScalerModel.transform(simpleDF).show(truncate = false)

+-------------+-------------------------------+
|Features     |StdFeatures                    |
+-------------+-------------------------------+
|[5.0,2.0,7.0]|[1.0,-1.0,-0.33333333333333337]|
|[1.0,3.0,9.0]|[-1.0,-0.6666666666666667,1.0] |
|[4.0,8.0,6.0]|[0.5,1.0,-1.0]                 |
+-------------+-------------------------------+




##### Bucketizer

Splits the feature column values specified by the user into buckets. With n + 1 splits, there are n splits.



In [383]:
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val bucketizer = new Bucketizer().setInputCol("features").setOutputCol("bucketedFeatures").setSplits(splits)

val bucketedData = bucketizer.transform(dataFrame)

bucketedData.show()

+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|  -999.9|             0.0|
|    -0.5|             1.0|
|    -0.3|             1.0|
|     0.0|             2.0|
|     0.2|             2.0|
|   999.9|             3.0|
+--------+----------------+




##### ElementwiseProduct

This transformer performs element wise multiplication between the given transformer vector and each vector row in the given dataset.

In [384]:
import org.apache.spark.ml.feature.ElementwiseProduct


val dataFrame = spark.createDataFrame(Seq(
  ("a", Vectors.dense(1.0, 2.0, 3.0)),
  ("b", Vectors.dense(4.0, 5.0, 6.0)))).toDF("id", "vector")

val transformingVector = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct().setScalingVec(transformingVector).setInputCol("vector").setOutputCol("transformedVector")

transformer.transform(dataFrame).show()

+---+-------------+-----------------+
| id|       vector|transformedVector|
+---+-------------+-----------------+
|  a|[1.0,2.0,3.0]|    [0.0,2.0,6.0]|
|  b|[4.0,5.0,6.0]|   [0.0,5.0,12.0]|
+---+-------------+-----------------+




##### SQLTransformer

This transformer uses SQL queries to transform data frames to other dataframes. We can use built in functions provided by Spark SQL or UDFs to achieve this.


In [388]:
import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTransformer = new SQLTransformer().setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTransformer.transform(df).show


+---+---+---+---+----+
| id| v1| v2| v3|  v4|
+---+---+---+---+----+
|  0|1.0|3.0|4.0| 3.0|
|  2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+




##### QuantileDiscretizer

Transforms the given input continuous range of values into discrete quantiles.

TODO: Analyse the code and algorithm.

##### Imputer

The imputer imputes missing values in a column to either mean or median value as seen below

In [393]:
import org.apache.spark.ml.feature.Imputer

val df = spark.createDataFrame(Seq(
  (1.0, Double.NaN),
  (2.0, Double.NaN),
  (Double.NaN, 3.0),
  (4.0, 4.0),
  (5.0, 5.0)
)).toDF("a", "b")

val imputer = new Imputer().setInputCols(Array("a", "b")).setOutputCols(Array("out_a", "out_b")).setStrategy("median")
val imputingModel = imputer.fit(df)
imputingModel.transform(df).show



+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  2.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+




TODO: Explore Local Sensitivity Hashing for Spark DF.


### Classification and Regression

This part of the notebook will focus on how to implement ML models in Spark.

#### Logistic Regression

We have a Binomial Logistic regression for predicting between two possible classes and a Multinomial Logistic regression for predicting between multiple classes.



In [410]:
import org.apache.spark.ml.classification.LogisticRegression

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
val lrModel = lr.fit(data)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

Coefficients: (692,[244,263,272,300,301,328,350,351,378,379,405,406,407,428,433,434,455,456,461,462,483,484,489,490,496,511,512,517,539,540,568],[-7.353983524188197E-5,-9.102738505589466E-5,-1.9467430546904298E-4,-2.0300642473486668E-4,-3.1476183314863995E-5,-6.842977602660743E-5,1.5883626898239883E-5,1.4023497091372047E-5,3.5432047524968605E-4,1.1443272898171087E-4,1.0016712383666666E-4,6.014109303795481E-4,2.840248179122762E-4,-1.1541084736508837E-4,3.85996886312906E-4,6.35019557424107E-4,-1.1506412384575676E-4,-1.5271865864986808E-4,2.804933808994214E-4,6.070117471191634E-4,-2.008459663247437E-4,-1.421075579290126E-4,2.739010341160883E-4,2.7730456244968115E-4,-9.838027027269332E-5,-3.808522443517704E-4,-2.5315198008555033E-4,2.7747714770754307E-4,-2.443619763919199E-4,-0.0015394744687597765,-2.3073328411331293E-4]) Intercept: 0.22456315961250325



In the above example we create an instance of ``LogisticRegression`` and set the regularization parameter to 0.3 and the elastic param for setting the weight of L1 regularization to 0.8. The algorithm is ``auto`` and the class figures it out based on the possible label values. We can however use ``multinomial`` regression for a a problem with only two possible label values.

In [412]:
val mlr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setFamily("multinomial")
val mlrModel = lr.fit(data)
println(s"Coefficients: ${mlrModel.coefficientMatrix} Intercept: ${mlrModel.interceptVector}")

Coefficients: 1 x 692 CSCMatrix
(0,244) -7.353983524188197E-5
(0,263) -9.102738505589466E-5
(0,272) -1.9467430546904298E-4
(0,300) -2.0300642473486668E-4
(0,301) -3.1476183314863995E-5
(0,328) -6.842977602660743E-5
(0,350) 1.5883626898239883E-5
(0,351) 1.4023497091372047E-5
(0,378) 3.5432047524968605E-4
(0,379) 1.1443272898171087E-4
(0,405) 1.0016712383666666E-4
(0,406) 6.014109303795481E-4
(0,407) 2.840248179122762E-4
(0,428) -1.1541084736508837E-4
(0,433) 3.85996886312906E-4
(0,434) 6.35019557424107E-4
(0,455) -1.1506412384575676E-4
(0,456) -1.5271865864986808E-4
(0,461) 2.804933808994214E-4
(0,462) 6.070117471191634E-4
(0,483) -2.008459663247437E-4
(0,484) -1.421075579290126E-4
(0,489) 2.739010341160883E-4
(0,490) 2.7730456244968115E-4
(0,496) -9.838027027269332E-5
(0,511) -3.808522443517704E-4
(0,512) -2.5315198008555033E-4
(0,517) 2.7747714770754307E-4
(0,539) -2.443619763919199E-4
(0,540) -0.0015394744687597765
(0,568) -2.3073328411331293E-4 Intercept: [0.22456315961250325]



Following code shows how to get the summary of a training. Currently we only get the summary of ``BinomialLogisticRegression``

In [421]:
import org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary

val modelSummary = lrModel.summary
val biRegModelSumm = modelSummary.asInstanceOf[BinaryLogisticRegressionTrainingSummary]
println("Object History for each iteration is\n\n")
modelSummary.objectiveHistory.foreach(println(_))
println("\nROC (Receiver Operating Characterictics) is ")
val roc = biRegModelSumm.roc
roc.show(5)

Object History for each iteration is


0.6833149135741672
0.6662875751473734
0.6217068546034618
0.6127265245887887
0.6060347986802873
0.6031750687571562
0.5969621534836274
0.5940743031983118
0.5906089243339022
0.5894724576491042
0.5882187775729587

ROC (Receiver Operating Characterictics) is 
+---+--------------------+
|FPR|                 TPR|
+---+--------------------+
|0.0|                 0.0|
|0.0|0.017543859649122806|
|0.0| 0.03508771929824561|
|0.0| 0.05263157894736842|
|0.0| 0.07017543859649122|
+---+--------------------+
only showing top 5 rows




TODO: Get details on ROC curve

The DataFrame ``biRegModelSumm.fMeasureByThreshold`` gives us F1 Scores for various thresholds. Since our goal is to maximize the F1 Scores, we select the maximum value of the F-Measure and then threshold the model to that value. The value of threshold is used to determine either the value has label of 0 or 1.

In [438]:
val fMeasures = biRegModelSumm.fMeasureByThreshold
val maxF1Score = fMeasures.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasures.where($"F-Measure" === maxF1Score).head().getDouble(0)
lrModel.setThreshold(bestThreshold)
println(s"Found max F1 score of $maxF1Score for threshold $bestThreshold")

Found max F1 score of 1.0 for threshold 0.5585022394278357




##### DecisionTreeClassifier

We will now look at decision tree classifiers.

In [477]:
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

In [484]:
// Identify categorical features for vectors wherever possible and index the labels.

import org.apache.spark.ml.feature.{VectorIndexer, StringIndexer, IndexToString}

val labelIndexer = 
    new StringIndexer().setInputCol("label").setOutputCol("indexedLabels").fit(data)
    
val vIndexer = 
    new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(data)
    
val labelLookup = 
    new IndexToString().setInputCol("prediction").setOutputCol("predictionLabel").setLabels(labelIndexer.labels)

In above case we create labelIndexer and vectorIndexer with entire data set to ensure we have index in place for all possible values we will see

Create the DecisionTree classifier

In [485]:
import org.apache.spark.ml.classification.DecisionTreeClassifier

val dt = new DecisionTreeClassifier().setLabelCol("indexedLabels").setFeaturesCol("indexedFeatures")



Split the data in training and test set Create a pipeline of the above transformers and tree

In [486]:

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val pipeline = new Pipeline()
pipeline.setStages(Array(labelIndexer, vIndexer, dt, labelLookup))

val pipelineModel = pipeline.fit(trainingData)



Make predictions and test accuracy of the model as follows


In [487]:
// Make predictions.
val predictions = pipelineModel.transform(testData)

// Select example rows to display.
predictions.select("predictionLabel", "label").show(5)


+---------------+-----+
|predictionLabel|label|
+---------------+-----+
|            0.0|  0.0|
|            0.0|  0.0|
|            0.0|  0.0|
|            0.0|  0.0|
|            0.0|  0.0|
+---------------+-----+
only showing top 5 rows



In [502]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = 
    new MulticlassClassificationEvaluator().
        setLabelCol("indexedLabels").setPredictionCol("prediction").setMetricName("accuracy")
        
val accuracy = evaluator.evaluate(predictions)
printf("Test Accuracy = %.2f%%\n", accuracy * 100)

val f1Score = 
    new MulticlassClassificationEvaluator().
        setLabelCol("indexedLabels").setPredictionCol("prediction").setMetricName("f1")

val score = f1Score.evaluate(predictions)
printf("F1 Score is = %.3f", score)


Test Accuracy = 96.67%
F1 Score is = 0.967


#### RandomForestClassifier

We will now train the above model using RandomForests, using the RandomForestClassifier

In [10]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer, IndexToString}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val stringIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
val vectorIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(data)
val indexToString = 
    new IndexToString().setInputCol("prediction").setOutputCol("predictionLabel").setLabels(stringIndexer.labels)

val rfClassifier = new RandomForestClassifier().setFeaturesCol("indexedFeatures").setLabelCol("indexedLabel").setNumTrees(7)
val pipeline = new Pipeline()
pipeline.setStages(Array(stringIndexer, vectorIndexer, rfClassifier, indexToString))
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

println("Evaluating results")
val evaluator = 
    new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
 
val accuracy = evaluator.evaluate(predictions)
printf("Test Accuracy = %.2f%%\n", accuracy * 100)
val f1Score = 
    new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("f1")

val score = f1Score.evaluate(predictions)
printf("F1 Score is = %.3f", score)                   


Evaluating results
Test Accuracy = 100.00%
F1 Score is = 1.000

#### Gradient Boosted Tree classifier (GBTClassifier)



In [25]:
import org.apache.spark.ml.feature.{VectorIndexer, StringIndexer, IndexToString}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val Array(trainingData, testData) = data.randomSplit(Array(70, 30))

val stringIndexer = new StringIndexer().setOutputCol("indexedLabel").setInputCol("label").fit(data)
val vectorIndexer = new VectorIndexer().setOutputCol("indexedFeatures").setInputCol("features").fit(data)
val indexToString = new IndexToString().setLabels(stringIndexer.labels).setInputCol("prediction").setOutputCol("predictionLabel")

val gbtClassifier = new GBTClassifier().setFeaturesCol("indexedFeatures").setLabelCol("indexedLabel").setMaxIter(12)

val pipeline = new Pipeline();
pipeline.setStages(Array(stringIndexer, vectorIndexer, gbtClassifier, indexToString))
val model = pipeline.fit(trainingData)
val prediction = model.transform(testData)

val accuracyEvaluator =
    new MulticlassClassificationEvaluator().setMetricName("accuracy").setLabelCol("indexedLabel").setPredictionCol("prediction")
    
val accuracy = accuracyEvaluator.evaluate(prediction)
printf("Accuracy is %.2f%%\n", accuracy * 100)


Accuracy is 100.00%



#### Multi Layer Percepton



In [63]:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val data = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(70, 30))

val mlpModel = 
new MultilayerPerceptronClassifier().setLayers(Array(4, 5, 4, 3)).setLabelCol("label").setFeaturesCol("features").fit(trainingData)

val predictions = mlpModel.transform(testData)

val evaluator = 
 new MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("label")
 
val accuracy = evaluator.evaluate(predictions)

printf("Accuracy of the model is %.2f%%\n", accuracy * 100)

Accuracy of the model is 97.50%




#### LinearSupportVectorMachine

SVM constructs a hyperplane in multi dimensional space to create a separation with largest distance to the nearest training points of any class. 


In [70]:
import org.apache.spark.ml.classification.LinearSVC


val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val model = new LinearSVC().setRegParam(0.01).setMaxIter(20).fit(data)

println(s"Intercepts and coefficients are ${model.intercept} and ${model.coefficients}")

Intercepts and coefficients are 0.01392478111699671 and [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-5.069812824819294E-4,-1.1494312477323097E-4,-8.709557950749516E-5,8.356190826529035E-5,-5.61725819329025E-6,-6.85436974883161E-6,-1.9214534421189166E-5,3.656850257885353E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.541118001126304E-4,3.205513952947573E-4,3.804331349899807E-4,-1.9723038982671933E-4,-8.191348194466776E-6,0.0,0.0,0.0,-6.983999260319252E-6,-7.143722252448831E-6,7.5672246805187E-7,7.723760988027544E-5,1.8913591510090662E-4,2.007566364504202E-4,-6.88230705255455E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.8901832


##### One Vs Rest classifier

OneVsRest classifier uses binary classifier to make multi class predictions. There are k classifters trained to predict k classes. Each classifier predicts class i against all others.



In [83]:
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
val Array(training, test) = data.randomSplit(Array(70, 30))

val lr = new LogisticRegression().setFitIntercept(true).setTol(1e-8).setMaxIter(25)

val ovr = new OneVsRest().setClassifier(lr)

val model = ovr.fit(training)

val prediction = model.transform(training)

val evaluator = 
new MulticlassClassificationEvaluator().setMetricName("accuracy").setLabelCol("label").setPredictionCol("prediction")

printf("Accuracy is %.2f%%\n", evaluator.evaluate(prediction) * 100)

Accuracy is 97.12%
