# Data Preprocessing

This is a mini tutorial on data preprocessing with the scikit-learn package. The topics covered are:
-  Mean removal
-  Min Max Scaling
-  Normalization
-  Binarization
-  One hot encoding

## Standardization or Mean Removal:

Standardization is a common requirement for many of the machine learning estimators implemented in scikit-learn: they may behave badly if individual features are on very different scales or are far from centered. To avoid this, we standardize each feature (column) to zero mean and unit variance.

In [2]:
from sklearn import preprocessing
import numpy as np

X_train = np.array([[2.,  1.,  0.],
                    [1., -1.,  2.],
                    [0.,  1., -1.]])

# Let's use the scale function to get zero mean and unit variance per column
X_scaled = preprocessing.scale(X_train)

X_scaled

array([[ 1.22474487,  0.70710678, -0.26726124],
       [ 0.        , -1.41421356,  1.33630621],
       [-1.22474487,  0.70710678, -1.06904497]])

__Let's check the mean and standard deviation of each column:__

In [3]:
X_scaled.mean(axis=0)

array([  0.00000000e+00,   7.40148683e-17,   0.00000000e+00])

In [4]:
X_scaled.std(axis=0)

array([ 1.,  1.,  1.])
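
To make the computation behind `scale` concrete, here is a minimal sketch that standardizes the same matrix by hand with NumPy and compares the result with `preprocessing.scale`. It assumes the population standard deviation (`ddof=0`, NumPy's default), and the variable names `col_mean`, `col_std` and `X_manual` are only illustrative.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[2.,  1.,  0.],
                    [1., -1.,  2.],
                    [0.,  1., -1.]])

# Per-column mean and population standard deviation (ddof=0)
col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)

# Subtract the mean, then divide by the standard deviation, column by column
X_manual = (X_train - col_mean) / col_std

# Compare with scikit-learn's result, allowing for floating-point error
matches_scale = np.allclose(X_manual, preprocessing.scale(X_train))
print(matches_scale)
```

If the two results agree, `matches_scale` is `True`, which suggests that `scale` is doing nothing more than this column-wise shift and rescale.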

## Min Max Scaling:

This is an alternative to standardization that scales each feature to lie within a given range (often between zero and one).

In [6]:
# Sample data matrix
X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Scaling a data matrix to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)

X_train_minmax

array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
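
The range does not have to be [0, 1]. The sketch below rescales the same matrix to an arbitrarily chosen range of [-1, 1], once by hand and once with the `feature_range` parameter of `MinMaxScaler`. The names `lo`, `hi`, `X_unit` and `X_manual` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Target range; (-1, 1) is an arbitrary choice for illustration
lo, hi = -1.0, 1.0

col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# Rescale each column to [0, 1], then stretch and shift it to [lo, hi]
X_unit = (X_train - col_min) / (col_max - col_min)
X_manual = X_unit * (hi - lo) + lo

# The same transformation using scikit-learn
scaler = preprocessing.MinMaxScaler(feature_range=(lo, hi))
X_sklearn = scaler.fit_transform(X_train)

print(np.allclose(X_manual, X_sklearn))
```

Note that the manual version divides by `col_max - col_min`, so it would fail on a constant column; this sketch assumes every column has at least two distinct values.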

## Normalization:

This process involves scaling individual samples (rows) to have unit norm.

In [11]:
# Sample data
X = [[1., -1.,  2.],
     [2.,  0.,  0.],
     [0.,  1., -1.]]

# Let's normalize each row; the norm can be 'l1' or 'l2'
X_normalized = preprocessing.normalize(X, norm='l1')

X_normalized

array([[ 0.25, -0.25,  0.5 ],
       [ 1.  ,  0.  ,  0.  ],
       [ 0.  ,  0.5 , -0.5 ]])
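
As a rough check of what "unit norm" means, the following sketch normalizes the same data with the `l2` norm and confirms that every row then has Euclidean length 1. It also reproduces the `l1` result by hand, dividing each row by the sum of its absolute values. The variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# l2 normalization: each row is divided by its Euclidean length
X_l2 = preprocessing.normalize(X, norm='l2')
row_norms = np.linalg.norm(X_l2, axis=1)
print(row_norms)

# l1 normalization by hand: divide each row by the sum of its absolute values
X_l1_manual = X / np.abs(X).sum(axis=1, keepdims=True)
print(np.allclose(X_l1_manual, preprocessing.normalize(X, norm='l1')))
```

Unlike the two scaling techniques above, normalization works row by row, so each sample is rescaled independently of the others.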

## Binarization:

Binarization is the process of thresholding numerical features to get boolean values: values greater than the threshold map to 1, and all other values map to 0.

In [13]:
X = [[1., -1.,  2.],
     [2.,  0.,  0.],
     [0.,  1., -1.]]

binarizer = preprocessing.Binarizer(threshold=1.1)

binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
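
The thresholding rule can also be written as a single NumPy comparison. The sketch below assumes the same threshold of 1.1 and compares the hand-written version with `Binarizer`; `threshold` and `X_manual` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

threshold = 1.1

# Strictly greater than the threshold becomes 1.0, everything else 0.0
X_manual = (X > threshold).astype(float)

# The same operation with scikit-learn; fit_transform is used so the
# example does not depend on calling transform on an unfitted estimator
X_sklearn = preprocessing.Binarizer(threshold=threshold).fit_transform(X)

print(np.array_equal(X_manual, X_sklearn))
```

Only the two entries equal to 2 exceed the threshold, which is why the output above contains exactly two ones.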

## One Hot Encoding:

One hot encoding converts categorical features into binary features. The `OneHotEncoder` estimator transforms each categorical feature with m possible values into m binary features, of which exactly one is active (set to 1) for any given sample.

In [15]:
gender = ['male', 'female']  # [0, 1]
continent = ['from Europe', 'from US', 'from Asia']  # [0, 1, 2]
browser = ['uses Firefox',
           'uses Chrome',
           'uses Safari', 
           'uses Internet Explorer']  # [0, 1, 2, 3]

sample_data = ['male', 'from US', 'uses Internet Explorer']  # encoded as [0, 1, 3]

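enc = preprocessing.OneHotEncoder()
# Each row is one sample: [gender, continent, browser] as integer codes
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# Encode sample_data; the result is sparse, so convert it to a dense array
enc.transform([[0, 1, 3]]).toarray()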

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
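
The 9 output columns are the three features laid out side by side: 2 columns for gender, 3 for continent and 4 for browser. The sketch below splits the encoded vector back into those blocks. It assumes a scikit-learn version in which the fitted encoder exposes a `categories_` attribute (0.20 or later, to my knowledge); `n_values` and `blocks` are illustrative names.

```python
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Number of distinct values seen in each input column during fit
n_values = [len(c) for c in enc.categories_]

# Encode one sample and take the single row of the dense result
encoded = enc.transform([[0, 1, 3]]).toarray()[0]

# Split the flat vector back into one block per original feature
blocks = []
start = 0
for n in n_values:
    blocks.append(encoded[start:start + n].tolist())
    start += n

print(n_values)
print(blocks)
```

Each block contains a single 1 at the position of the encoded category, which is the "only one active" property described above.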