<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width="750" align="center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 2: Preparing Data - Data Normalization

## Data Normalization 

### Lesson Objectives

After completing this lesson, you should be able to:

- Normalize a dataset to have unit p-norm
- Normalize a dataset to have unit standard deviation and zero mean
- Normalize a dataset to have given minimum and maximum values


## Normalizer

- A Transformer which transforms a dataset of `Vector` rows, normalizing each `Vector` to have unit norm
- Takes a parameter `p`, which specifies the p-norm used for normalization (`p = 2` by default)
- Normalization can help standardize input data and improve the behavior of learning algorithms
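
To make "unit p-norm" concrete, here is a minimal plain-Scala sketch of the arithmetic applied to a single row. It is not Spark's implementation; `PNormSketch`, `pNorm`, and `normalize` are hypothetical names introduced for illustration, and the sketch assumes a finite `p >= 1`.

```scala
// Illustrative sketch only: unit p-norm scaling of one row in plain Scala.
// PNormSketch and its methods are hypothetical helpers, not Spark APIs.
object PNormSketch {
  // p-norm of a vector: (sum of |x_i|^p)^(1/p), assuming finite p >= 1
  def pNorm(v: Array[Double], p: Double): Double = {
    require(p >= 1.0 && !p.isInfinity, "this sketch assumes a finite p >= 1")
    math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  }

  // Divide every entry by the row's p-norm; an all-zero row is returned unchanged
  // (a choice made for this sketch to avoid dividing by zero)
  def normalize(v: Array[Double], p: Double = 2.0): Array[Double] = {
    val norm = pNorm(v, p)
    if (norm == 0.0) v.clone() else v.map(_ / norm)
  }
}

val sampleRow = Array(3.0, -4.0)
// p = 1: the absolute values of the result sum to 1
println(PNormSketch.normalize(sampleRow, 1.0).mkString(", "))
// p = 2 (default): the result has Euclidean length 1
println(PNormSketch.normalize(sampleRow).mkString(", "))
```

Note that the scaling is applied per row, so each row is treated independently of the others.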

In [None]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

In [None]:
// Continuing from the previous example

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._

// Ten rows with one uniform and two standard-normal random columns (fixed seeds)
val dfRandom = spark.range(0, 10).select("id").
 withColumn("uniform", rand(10L)).
 withColumn("normal1", randn(10L)).
 withColumn("normal2", randn(11L))

// Combine the three numeric columns into a single Vector column named "features"
val assembler = new VectorAssembler().
 setInputCols(Array("uniform","normal1","normal2")).
 setOutputCol("features")

val dfVec = assembler.transform(dfRandom)

dfVec.select("id","features").show()
 

In [None]:

// A Simple Normalizer 

import org.apache.spark.ml.feature.Normalizer

// p = 1.0: each row is scaled so that the absolute values of its entries sum to 1
val scaler1 = new Normalizer().setInputCol("features").setOutputCol("scaledFeat").setP(1.0)
scaler1.transform(dfVec.select("id","features")).show(5)

## Standard Scaler

- An Estimator which can be fit on a dataset to produce a `StandardScalerModel`
- The fitted `StandardScalerModel` is a Transformer which transforms a dataset of `Vector` rows, normalizing each feature to have unit standard deviation and/or zero mean
- Takes two parameters:
	- `withStd`: scales the data to unit standard deviation (default: true)
	- `withMean`: centers the data with mean before scaling (default: false)
- With `withMean` set to true it builds a dense output, so take care with sparse inputs (some older Spark versions raise an exception for them)
- If the standard deviation of a feature is zero, it returns 0.0 in the `Vector` for that feature
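
The effect of the two parameters on a single feature column can be sketched in plain Scala. This is an illustration under stated assumptions, not Spark's code: `StandardizeSketch` and `standardize` are hypothetical names, and the sketch uses the corrected (n - 1) sample standard deviation.

```scala
// Illustrative sketch only: standardizing one feature column in plain Scala.
// StandardizeSketch is a hypothetical helper, not a Spark API.
object StandardizeSketch {
  def standardize(col: Array[Double],
                  withMean: Boolean = false,
                  withStd: Boolean = true): Array[Double] = {
    require(col.length > 1, "need at least two values for a sample standard deviation")
    val mean = col.sum / col.length
    // Corrected sample variance: divide by (n - 1)
    val variance = col.map(x => (x - mean) * (x - mean)).sum / (col.length - 1)
    val std = math.sqrt(variance)
    col.map { x =>
      val centered = if (withMean) x - mean else x
      if (!withStd) centered
      else if (std == 0.0) 0.0 // constant feature: emit 0.0, as described above
      else centered / std
    }
  }
}

val sampleColumn = Array(1.0, 2.0, 3.0, 4.0, 5.0)
// Centered and scaled: the values are symmetric around 0
println(StandardizeSketch.standardize(sampleColumn, withMean = true).mkString(", "))
// Scaled only (the defaults): the values stay positive but have unit standard deviation
println(StandardizeSketch.standardize(sampleColumn).mkString(", "))
```

Unlike the row-wise `Normalizer`, this operates per feature, which is why the statistics have to be computed from the whole dataset in a fitting step first.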

In [None]:
// A Simple Standard Scaler 

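import org.apache.spark.ml.feature.StandardScaler

// Center each feature to zero mean and scale it to unit standard deviation
val scaler2 = new StandardScaler().
 setInputCol("features").setOutputCol("scaledFeat").
 setWithStd(true).setWithMean(true)

// fit() computes the per-feature statistics; the resulting model applies them
val scaler2Model = scaler2.fit(dfVec.select("id","features"))
scaler2Model.transform(dfVec.select("id","features")).show(5)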

## MinMax Scaler 

- An Estimator which can be fit on a dataset to produce a `MinMaxScalerModel`
- The fitted `MinMaxScalerModel` is a Transformer which transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often `[0, 1]`)
- Takes two parameters:
	- `min`: lower bound after transformation, shared by all features (default: 0.0)
	- `max`: upper bound after transformation, shared by all features (default: 1.0)
- Since zero values are likely to be transformed to non-zero values, sparse inputs may result in dense outputs
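
The rescaling of one feature column can be sketched in plain Scala as a linear map from the observed range onto `[min, max]`. `MinMaxSketch` and `rescale` are hypothetical names used only for illustration. Mapping a constant feature to the midpoint of the target range follows the formula given in the Spark documentation, but treat it here as an assumption of the sketch.

```scala
// Illustrative sketch only: min-max rescaling of one feature column in plain Scala.
// MinMaxSketch is a hypothetical helper, not a Spark API.
object MinMaxSketch {
  def rescale(col: Array[Double], min: Double = 0.0, max: Double = 1.0): Array[Double] = {
    require(col.nonEmpty, "column must not be empty")
    require(min < max, "min must be smaller than max")
    val observedMin = col.min
    val observedMax = col.max
    if (observedMax == observedMin) {
      // Constant feature: map every value to the midpoint of the target range
      col.map(_ => 0.5 * (max + min))
    } else {
      // Map [observedMin, observedMax] linearly onto [min, max]
      col.map(x => (x - observedMin) / (observedMax - observedMin) * (max - min) + min)
    }
  }
}

val rawColumn = Array(2.0, 4.0, 10.0)
// Target range [-1, 1]: the smallest value maps to -1 and the largest to 1
println(MinMaxSketch.rescale(rawColumn, -1.0, 1.0).mkString(", "))
```

The sketch also shows why sparse inputs tend to become dense: a zero entry is mapped to zero only if the target range happens to send it there.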

In [None]:

// A Simple MinMax Scaler 
import org.apache.spark.ml.feature.MinMaxScaler

val scaler3 = new MinMaxScaler().
 setInputCol("features").setOutputCol("scaledFeat").
 setMin(-1.0).setMax(1.0)

val scaler3Model = scaler3.fit(dfVec.select("id","features"))
scaler3Model.transform(dfVec.select("id","features")).show(5)

## Lesson Summary 

Having completed this lesson, you should be able to:

- Normalize a dataset to have unit p-norm
- Normalize a dataset to have unit standard deviation and zero mean
- Normalize a dataset to have given minimum and maximum values

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.