### 11. Spark ML

This notebook will introduce Spark ML and its API.

#### Correlation calculation


In [10]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SparkML").getOrCreate()
import spark.implicits._

In [34]:
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.ml.linalg.{Vectors, Matrix}
import org.apache.spark.sql.Row

val data = Seq(
  // Same as Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))
  Vectors.dense(1.0, 0.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println("Pearson correlation matrix:\n" + coeff1.toString)
val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println("Spearman correlation matrix:\n" + coeff2.toString)

Pearson correlation matrix:
1.0                   0.055641488407465814  NaN  0.4004714203168137  
0.055641488407465814  1.0                   NaN  0.9135958615342522  
NaN                   NaN                   1.0  NaN                 
0.4004714203168137    0.9135958615342522    NaN  1.0                 
Spearman correlation matrix:
1.0                  0.10540925533894532  NaN  0.40000000000000174  
0.10540925533894532  1.0                  NaN  0.9486832980505141   
NaN                  NaN                  1.0  NaN                  
0.40000000000000174  0.9486832980505141   NaN  1.0                  




Let us go through the code above step by step. There are two ways to create a vector: dense and sparse. A dense vector lists all n values explicitly, for example ``Vectors.dense(4.0, 5.0, 0.0, 3.0)``. A sparse vector is created as ``Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))``, where the first argument is the number of elements (the dimension) of the vector and the ``Seq`` holds (index, value) pairs for the non-zero entries. Thus ``Vectors.sparse(4, Seq((0, 1.0), (3, -2.0)))`` is the same as ``Vectors.dense(1.0, 0.0, 0.0, -2.0)``, as seen below.

In [19]:
Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))).toDense

[1.0,0.0,0.0,-2.0]
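
To make the (index, value) encoding concrete, here is a minimal plain-Scala sketch of the expansion that ``toDense`` performs conceptually. The helper ``sparseToDense`` is a hypothetical name used only for illustration. It is not part of Spark's API, and Spark's own implementation may differ.

```scala
// Hypothetical helper (not part of Spark): expand (index, value) pairs
// into a dense array of the given size, filling the gaps with 0.0.
def sparseToDense(size: Int, entries: Seq[(Int, Double)]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  entries.foreach { case (index, value) => dense(index) = value }
  dense
}

val expanded = sparseToDense(4, Seq((0, 1.0), (3, -2.0)))
println(expanded.mkString("[", ",", "]"))  // prints [1.0,0.0,0.0,-2.0]
```

Only the non-zero entries are stored in the sparse form, which is why it is convenient for vectors that are mostly zeros.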


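The correlation matrix is always symmetric about its diagonal, that is, element (0, 1) is the same as (1, 0), (0, 2) is the same as (2, 0), and so on. The diagonal values are 1 because the correlation of any variable with itself is 1. The NaN entries arise because the third feature (index 2) is 0 in every vector, so it has zero variance and its correlation with the other features is undefined.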

The formula for the Pearson correlation is

$p\:=\:\frac{n\sum{xy} - (\sum{x})(\sum{y})}{\sqrt{[n\sum{x^2} - (\sum{x})^2][n\sum{y^2} - (\sum{y})^2]}}$

According to [this calculator](http://calculator.vhex.net/calculator/statistics/pearson-correlation), the correlation between the two lists (1.0, 0.0, 0.0, -2.0) and (4.0, 5.0, 0.0, 3.0) is expected to be about 0.1227.
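
As a sanity check, the formula can be worked through by hand for these two lists, taking x = (1.0, 0.0, 0.0, -2.0), y = (4.0, 5.0, 0.0, 3.0) and n = 4. The intermediate sums are:

$\sum{x} = -1,\quad \sum{y} = 12,\quad \sum{xy} = -2,\quad \sum{x^2} = 5,\quad \sum{y^2} = 50$

Substituting them gives:

$p\:=\:\frac{4(-2) - (-1)(12)}{\sqrt{[4(5) - (-1)^2][4(50) - 12^2]}}\:=\:\frac{4}{\sqrt{19 \cdot 56}}\:=\:\frac{4}{\sqrt{1064}}\:\approx\:0.1226$

This is close to the value quoted above.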

The following code snippet calculates the Pearson correlation between two lists of doubles of equal length.

In [58]:
import scala.math.sqrt

// Single-pass form of the Pearson formula; x and y must have the same length.
def pearsonCorrelation(x: List[Double], y: List[Double]): Double = {
    val n = x.length
    // Sums and sums of squares for each list
    val sumx = x.sum
    val sumxsquare = x.map(e => e * e).sum
    val sumy = y.sum
    val sumysquare = y.map(e => e * e).sum
    // Sum of the element-wise products
    val sumxy = (x zip y).map { case (l, r) => l * r }.sum
    val numerator = (n * sumxy) - (sumx * sumy)
    val denominator = (n * sumxsquare - sumx * sumx) * (n * sumysquare - sumy * sumy)
    numerator / sqrt(denominator)
}

val x = List(1.0, 0.0, 0.0, -2.0)
val y = List(4.0, 5.0, 0.0, 3.0)
val z = List(6.0, 7.0, 0.0, 8.0)
println("Pearson coefficient between x and y is " + pearsonCorrelation(x, y))
println("Pearson coefficient between x and x is " + pearsonCorrelation(x, x))
println("Pearson coefficient between x and z is " + pearsonCorrelation(x, z))
println("Pearson coefficient between y and z is " + pearsonCorrelation(y, z))

Pearson coefficient between x and y is 0.12262786789699316
Pearson coefficient between x and x is 1.0
Pearson coefficient between x and z is -0.3501151884184551
Pearson coefficient between y and z is 0.8586775814821836


In [59]:
// Alternative implementation of the above that calculates the mean values first.
// The earlier implementation does not need the means, so it avoids an extra pass over the lists.

val n = x.length  // n was local to pearsonCorrelation, so define it here
val meanx = x.sum / n
val meany = y.sum / n

val numerator = (x zip y).map{
 case (e1, e2) => (e1 - meanx) * (e2 - meany)
}.reduce(_ + _)

val sumxsquare = x.map( e => (e - meanx) * (e - meanx)).reduce(_ + _)
val sumysquare= y.map(e => (e - meany) * (e - meany)).reduce(_ + _)
val denominator = math.sqrt(sumxsquare) * math.sqrt(sumysquare)


println("Pearson correlation between x and y is " + numerator / denominator)

Pearson correlation between x and y is 0.12262786789699315



TODO: Investigate why the correlation in the matrix is not the same as the one we calculated for the vectors.
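
One likely explanation for the mismatch: ``Correlation.corr`` treats each vector as one observation (a row) and correlates the feature columns with each other, whereas the hand calculation correlated two row vectors. The plain-Scala sketch below checks this by transposing the data and applying the same formula to the columns. It assumes the first input vector is (1.0, 0.0, 0.0, -2.0), the dense form of the commented-out sparse vector. The names ``pearson``, ``rows`` and ``columns`` are introduced here for illustration only.

```scala
import scala.math.sqrt

// Same single-pass Pearson formula as used earlier in this notebook.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val n = x.length
  val sumX = x.sum
  val sumY = y.sum
  val sumXY = x.zip(y).map { case (a, b) => a * b }.sum
  val sumXX = x.map(a => a * a).sum
  val sumYY = y.map(b => b * b).sum
  (n * sumXY - sumX * sumY) /
    sqrt((n * sumXX - sumX * sumX) * (n * sumYY - sumY * sumY))
}

// The four input vectors, one per row (assumed first row: 1.0, 0.0, 0.0, -2.0).
val rows = Seq(
  Seq(1.0, 0.0, 0.0, -2.0),
  Seq(4.0, 5.0, 0.0, 3.0),
  Seq(6.0, 7.0, 0.0, 8.0),
  Seq(9.0, 0.0, 0.0, 1.0)
)

// Transpose so that each inner Seq holds one feature across all rows.
val columns = rows.transpose

println("corr(col 0, col 1) = " + pearson(columns(0), columns(1)))
println("corr(col 0, col 3) = " + pearson(columns(0), columns(3)))
println("corr(col 1, col 3) = " + pearson(columns(1), columns(3)))
// Column 2 is all zeros, so the formula reduces to 0.0 / 0.0, which is NaN.
println("corr(col 0, col 2) = " + pearson(columns(0), columns(2)))
```

The first three values should come out close to entries (0, 1), (0, 3) and (1, 3) of the Pearson matrix printed earlier (about 0.0556, 0.4005 and 0.9136), which supports the column-wise reading.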