# Normalization

## Table of Contents

1. Introduction to Normalization
1. Function `normalize`
1. Utility class `Normalizer` 
1. Using `Normalizer` in a pipeline

## Introduction to Normalization

Normalization is the process of scaling individual samples to have unit norm. Each sample is a row of the data matrix, and after normalization the numeric values in that row collectively have unit norm. This representation is the basis of the Vector Space Model often used in text classification and clustering contexts.

About Vector Space Model:
- https://en.wikipedia.org/wiki/Vector_space_model
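
To make the link to the Vector Space Model concrete, here is a minimal sketch. The two toy vectors in `docs` are illustrative assumptions, not data from this notebook. It shows that once rows have unit L2 norm, their plain dot product equals the cosine similarity of the original vectors.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two hypothetical document vectors (e.g. term counts); values are illustrative only.
docs = np.array([[3.0, 0.0, 4.0],
                 [1.0, 2.0, 2.0]])

# Scale each row to unit L2 norm.
docs_unit = normalize(docs, norm='l2')

# Cosine similarity computed from the raw, unnormalized vectors.
a, b = docs
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# For unit-norm rows, the plain dot product gives the same number.
dot_unit = docs_unit[0] @ docs_unit[1]

print(np.linalg.norm(docs_unit, axis=1))  # row norms after normalization
print(cosine, dot_unit)
```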

Load libraries.

In [6]:
from sklearn import preprocessing
import numpy as np

## Function `normalize`

Transform a toy dataset `X` using the `normalize` function. Then store the normalized data in an object `X_normalized`.

In [9]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -2.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized                                      

In the above output, the sum of the squares of the elements of each row is equal to `1`, as the following cell shows for the first row.

In [11]:
X_normalized[0,:], X_normalized[0,:]**2, np.sum(X_normalized[0,:]**2), np.sqrt(np.sum(X_normalized[0,:]**2))

In the above example, the normalized vectors in `X_normalized` are calculated using the 'l2' norm, which for a row vector with components \\(x_1, \dots, x_n\\) is defined as

\\(z = \sqrt{\sum_{i=1}^n x^2_i}\\)  

Each component of a non-zero row of `X` is divided by \\(z\\), the square root of the sum of the squares of that row's components.
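
As a sketch of the division described above, the 'l2' normalization can be reproduced with plain NumPy and compared against `normalize`. The names `z_l2` and `X_l2_manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -2.]])

# z for each row: square root of the sum of the squared components.
z_l2 = np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# Divide every component of a row by that row's norm.
X_l2_manual = X / z_l2

# The manual result should agree with normalize(..., norm='l2').
print(np.allclose(X_l2_manual, normalize(X, norm='l2')))

# Every row of the result has unit L2 norm.
print(np.linalg.norm(X_l2_manual, axis=1))
```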

The 'l1' norm is defined as

\\(z = \sum_{i=1}^n \lvert x_i \rvert \\)

Each component of a non-zero row of `X` is divided by \\(z\\), the sum of the absolute values of that row's components.
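
The same calculation can be sketched by hand for the 'l1' norm. The names `z_l1` and `X_l1_manual` are illustrative; the toy matrix is the one used throughout this notebook.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -2.]])

# z for each row: sum of the absolute values of the components.
z_l1 = np.abs(X).sum(axis=1, keepdims=True)

# Divide every component of a row by that row's L1 norm.
X_l1_manual = X / z_l1
print(X_l1_manual)

# After normalization, the absolute values in each row sum to 1.
print(np.abs(X_l1_manual).sum(axis=1))
```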

The 'max' norm is defined as

\\(z = \max_{i} x_i \\)

Each component of a non-zero row of `X` is divided by \\(z\\), the maximum value in that row.
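
Here is a minimal NumPy sketch of this division. Whether the maximum is taken over the raw values or over their absolute values matters for rows whose largest-magnitude entry is negative, so both variants are computed. Which variant `normalize(X, norm='max')` applies is an assumption to verify against your installed scikit-learn version. The names `z_max_raw` and `z_max_abs` are illustrative.

```python
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -2.]])

# Variant 1: maximum of the raw values in each row (the definition given above).
z_max_raw = X.max(axis=1, keepdims=True)

# Variant 2: maximum of the absolute values in each row.
z_max_abs = np.abs(X).max(axis=1, keepdims=True)

# The variants only differ for the third row, whose largest-magnitude entry is -2.
print(X / z_max_raw)
print(X / z_max_abs)
```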

Transform the dataset `X` using the `normalize` function with the 'l1' norm. Then store the normalized data in an object `X_normalized_l1`.

In [14]:
X_normalized_l1 = preprocessing.normalize(X, norm='l1')
X_normalized_l1

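Transform the dataset `X` using the `normalize` function with the 'max' norm. Then store the normalized data in an object `X_normalized_m`. Note that, as described here, the 'max' norm does not take absolute values first, so the third vector stays `[0, 1, -2]` (its maximum is `1`). This behavior is version-dependent: more recent scikit-learn releases are understood to divide by the maximum absolute value, which would give `[0, 0.5, -1]` for that row, so check the output of your installed version.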

In [16]:
X_normalized_m = preprocessing.normalize(X, norm='max')
X_normalized_m

## Utility class `Normalizer`

The `preprocessing` module in scikit-learn further provides a utility class `Normalizer` that implements the same operation using the Transformer API. The `fit` method learns nothing in this case, because the operation treats samples independently. It exists so that `Normalizer` follows the same interface as other transformers and can be used as a step in a pipeline.
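
The following sketch illustrates that point. It assumes the `norm` constructor parameter of `Normalizer` and uses the illustrative name `normalizer_l1`. Calling `fit` returns the same object, and `transform` gives the same result as the `normalize` function whether or not `fit` was called first.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -2.]])

normalizer_l1 = Normalizer(norm='l1')

# fit() learns nothing from the data and returns the transformer itself.
fitted = normalizer_l1.fit(X)
print(fitted is normalizer_l1)

# transform() matches the normalize function with the same norm.
X_l1 = normalizer_l1.transform(X)
print(np.allclose(X_l1, normalize(X, norm='l1')))

# A fresh, unfitted instance produces the same output.
X_l1_unfitted = Normalizer(norm='l1').transform(X)
print(np.allclose(X_l1, X_l1_unfitted))
```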

Create an object `normalizer`.

In [20]:
normalizer = preprocessing.Normalizer()
normalizer

The `normalizer` instance can then be used on sample vectors as a transformer. The `transform` method scales each non-zero row of `X` to unit norm.

In [22]:
normalizer.transform(X) 
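
The "non-zero row" qualification can be checked with a small sketch. The matrix `X_with_zero` is a made-up example, not part of the dataset above. A row of all zeros has no direction to preserve, so it is returned unchanged instead of producing a division by zero.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical input: one ordinary row and one all-zero row.
X_with_zero = np.array([[3., 4., 0.],
                        [0., 0., 0.]])

out = Normalizer().transform(X_with_zero)

# The first row is scaled to unit L2 norm; the zero row is left as zeros.
print(out)
print(np.isnan(out).any())
```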

The same instance can also transform new sample vectors, scaling each one to unit norm.

In [24]:
normalizer.transform([[-1.,  1., 0.]]) 

The sections above show two quick ways to normalize a single dataset: the `normalize` function and the `Normalizer` class. Because `Normalizer` follows the Transformer API, it is also useful in the early steps of a `sklearn.pipeline.Pipeline`.

## Using `Normalizer` in a pipeline

The following example checks whether normalizing the features of the California Housing Prices dataset has any impact on the performance of a k-nearest neighbors estimator.

Load libraries.

In [29]:
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

Read the dataset using the `fetch_california_housing` function and then split it into train and test using the `train_test_split` function.

In [31]:
dataset = fetch_california_housing()
X_full, y_full = dataset.data, dataset.target
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=20)

Here we use a k-nearest neighbors regressor as the final step of a pipeline whose first step normalizes the samples. For comparison, a second k-nearest neighbors regressor is trained directly on the unnormalized data in a later cell.

In [33]:
steps = [('normalizer', Normalizer()),
         ('knn', KNeighborsRegressor())]

pipeline = Pipeline(steps)

Fit the pipeline using `X_train` as training data and `y_train` as target values, and store the fitted pipeline in `knn_normalized`. Also fit a k-nearest neighbors regressor on the unnormalized training data and store the fitted estimator in `knn_unnormalized`.

In [35]:
knn_normalized = pipeline.fit(X_train, y_train)
knn_unnormalized = KNeighborsRegressor().fit(X_train, y_train)
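
To clarify what the pipeline does internally, here is a sketch on synthetic stand-in data. The arrays `X_demo` and `y_demo` are assumptions made for illustration and are not the housing dataset. Predicting with the pipeline is equivalent to normalizing the rows by hand before both fitting and predicting with the regressor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, normalize

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_demo_train, X_demo_test = X_demo[:150], X_demo[150:]
y_demo_train = y_demo[:150]

# Pipeline: normalize each row, then fit the regressor on the normalized rows.
pipe = Pipeline([('normalizer', Normalizer()),
                 ('knn', KNeighborsRegressor())])
pipe.fit(X_demo_train, y_demo_train)
pipe_pred = pipe.predict(X_demo_test)

# Manual equivalent: apply the same normalization at fit time and at predict time.
knn_manual = KNeighborsRegressor().fit(normalize(X_demo_train), y_demo_train)
manual_pred = knn_manual.predict(normalize(X_demo_test))

print(np.allclose(pipe_pred, manual_pred))
```

The pipeline's advantage is that the normalization step cannot be forgotten at prediction time, since `predict` applies every preceding step automatically.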

Compute and print the mean squared error on the test set for both models.

In [37]:
print('Prediction Error with normalization: {}'.format(mean_squared_error(y_test, knn_normalized.predict(X_test))))
print('Prediction Error without normalization: {}'.format(mean_squared_error(y_test, knn_unnormalized.predict(X_test))))

The output above shows that, on this train/test split, normalization significantly improved the performance of the k-nearest neighbors regressor on the California Housing Prices dataset: the pipeline with `Normalizer` has the lower mean squared error.