<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width = 750, align = "center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 2: Preparing Data 

## Identifying Outliers

### Lesson Objectives

After completing this lesson, you should be able to:

- Compute the inverse of the covariance matrix of a dataset
- Compute Mahalanobis Distance for all elements in a dataset
- Remove outliers from a dataset


## Mahalanobis Distance 

- A multi-dimensional generalization of measuring how many standard deviations a point lies from the mean
- Measured along each principal component axis
- Unitless and scale-invariant
- Takes the correlations in the dataset into account
- Used to detect outliers
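
For reference, these properties follow from the standard definition of the distance. The symbols are introduced here and are not part of the lesson: $\mathbf{x}$ is a data point, $\boldsymbol{\mu}$ the mean vector and $\Sigma$ the covariance matrix of the dataset.

$$
D_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\top}\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})}
$$

In one dimension this reduces to $|x-\mu|/\sigma$, the number of standard deviations between the point and the mean.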

In [None]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.util.MLUtils


In [None]:
// Continuing from the previous example: three random feature columns

val dfRandom = spark.range(0, 10).select("id").
 withColumn("uniform", rand(10L)).
 withColumn("normal1", randn(10L)).
 withColumn("normal2", randn(11L))

// Combine the three columns into a single vector column named "features"
val assembler = new VectorAssembler().
 setInputCols(Array("uniform","normal1","normal2")).
 setOutputCol("features")

// VectorAssembler emits spark.ml vectors; convert them to spark.mllib vectors
val dfVec = MLUtils.convertVectorColumnsFromML(assembler.transform(dfRandom))

In [None]:
// Continuing from the previous example
dfVec.select("id","features").show()

// Append one artificial outlier, the point (3, 3, 3), with id 10
// (union replaces unionAll, which is deprecated since Spark 2.0)
val dfOutlier = dfVec.select("id","features").union(spark.createDataFrame(Seq((10L, Vectors.dense(3, 3, 3)))))
dfOutlier.sort(dfOutlier("id").desc).show(5)

In [None]:
// Standardize each feature to zero mean and unit standard deviation

val scaler = new StandardScaler().
 setInputCol("features").setOutputCol("scaledFeat").
 setWithStd(true).setWithMean(true)

// spark.ml's StandardScaler needs spark.ml vectors, so convert back first
val scalerModel = scaler.fit(MLUtils.convertVectorColumnsToML(dfOutlier.select("id","features")))

val dfScaled = scalerModel.transform(MLUtils.convertVectorColumnsToML(dfOutlier)).select("id","scaledFeat")
dfScaled.sort(dfScaled("id").desc).show(3)

In [None]:
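import org.apache.spark.mllib.stat.Statistics

import breeze.linalg._

// Statistics.corr works on an RDD of spark.mllib vectors
val rddVec = MLUtils.convertVectorColumnsFromML(dfScaled.select("scaledFeat")).rdd.map(_(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector])

// Pearson correlation matrix; for standardized features it coincides with their covariance matrix
val colCov = Statistics.corr(rddVec)
// Invert with Breeze (Spark and Breeze both store dense matrices in column-major order)
val invColCovB = inv(new DenseMatrix(colCov.numRows, colCov.numCols, colCov.toArray))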
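
The cell above calls `Statistics.corr` even though the lesson speaks of a covariance matrix. A short justification, assuming the scaler and the correlation estimate use the same sample mean $\mu_i$ and sample standard deviation $\sigma_i$ for each feature $x_i$ (symbols introduced here for illustration):

$$
z_i = \frac{x_i-\mu_i}{\sigma_i}
\quad\Longrightarrow\quad
\operatorname{Cov}(z_i, z_j) = \frac{\operatorname{Cov}(x_i, x_j)}{\sigma_i\,\sigma_j} = \rho_{ij}
$$

Under that assumption the covariance matrix of the standardized features is the correlation matrix of the original features, so inverting either gives the same result.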

In [None]:
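// Computing Mahalanobis Distance
// scaledFeat is already centered, so no mean is subtracted here.
// Note: the quadratic form below is the squared distance; apply math.sqrt for the distance itself.
// The ranking of points is the same either way.

val mahalanobis = udf[Double, org.apache.spark.ml.linalg.Vector]{ v =>
 val vB = new DenseVector(v.toArray)
 vB.t * invColCovB * vB
}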

val dfMahalanobis = dfScaled.withColumn("mahalanobis", mahalanobis(dfScaled("scaledFeat")))
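
To see what the quadratic form in the UDF above measures without a Spark session, here is a minimal dependency-free sketch in two dimensions. Everything in it is illustrative rather than part of the lesson: the object `Mahalanobis2D`, its helper functions, and the made-up sample data. It works on raw data with the sample covariance matrix, whereas the notebook standardizes first and uses the correlation matrix.

```scala
// Illustrative sketch only; none of these names come from the lesson.
object Mahalanobis2D {
  type Point = (Double, Double)

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample covariance (divides by n - 1).
  def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
    val mx = mean(xs)
    val my = mean(ys)
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.size - 1)
  }

  // Squared Mahalanobis distance of p from the mean of data.
  def squaredDistance(data: Seq[Point], p: Point): Double = {
    val xs = data.map(_._1)
    val ys = data.map(_._2)
    val sxx = cov(xs, xs)
    val syy = cov(ys, ys)
    val sxy = cov(xs, ys)
    val det = sxx * syy - sxy * sxy
    require(det != 0.0, "covariance matrix is singular")
    // The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    val dx = p._1 - mean(xs)
    val dy = p._2 - mean(ys)
    (dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det
  }

  // Made-up, strongly correlated sample whose mean is (3, 3).
  val data: Seq[Point] =
    Seq((1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8))

  def main(args: Array[String]): Unit = {
    val along = squaredDistance(data, (4.0, 4.0))   // follows the correlation
    val against = squaredDistance(data, (4.0, 2.0)) // breaks the correlation
    println(s"along = $along, against = $against")
  }
}
```

Both test points lie at the same Euclidean distance from the mean, yet the point that breaks the correlation receives a much larger score. This is the sense in which the distance takes correlations into account.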

In [None]:
// Computing Mahalanobis Distance 
dfMahalanobis.show()

In [None]:
// Removing Outliers 

dfMahalanobis.sort(dfMahalanobis("mahalanobis").desc).show(2)

val ids = dfMahalanobis.select("id","mahalanobis").sort(dfMahalanobis("mahalanobis").desc).drop("mahalanobis").collect() 

val idOutliers = ids.map(_(0).asInstanceOf[Long]).slice(0,2)

## Removing Outliers

In [None]:
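// Drop the rows flagged above instead of hard-coding their ids (10 and 2 in the original run)
dfOutlier.filter(!dfOutlier("id").isin(idOutliers: _*)).show()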

## Lesson Summary

Having completed this lesson, you should be able to:

- Compute the inverse of the covariance matrix of a dataset
- Compute Mahalanobis Distance for all elements in a dataset
- Remove outliers from a dataset

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.