# Preprocessing Data

Goal: Preprocess raw data and convert it into a format that downstream algorithms can use. Several techniques are available:

1. Binarization
2. Mean removal
3. Scaling
4. Normalization

## Binarization

Convert numerical values into boolean values: values greater than the threshold become 1, and all other values become 0.

In [19]:
import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.2, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Binarize data
data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print('data_binarized:\n', data_binarized)

data_binarized:
 [[ 1.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]]
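
To make the thresholding rule concrete, here is a minimal NumPy-only sketch of the same comparison. The names `threshold` and `manual_binarized` are introduced only for illustration, and the snippet is an approximation of the behaviour described above, not scikit-learn's actual implementation.

```python
import numpy as np

input_data = np.array([[5.2, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

threshold = 2.1  # illustrative name; same value as in the cell above

# Values strictly greater than the threshold become 1.0, all others 0.0
manual_binarized = (input_data > threshold).astype(float)
print('manual_binarized:\n', manual_binarized)
```

Because the comparison is strict, the value 2.1 in the third row maps to 0, which agrees with the output shown above.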


## Mean Removal

Why is it useful to remove the mean from our feature vector?
- so that each feature is centered on zero
- to remove bias from the features in our feature vector

In [11]:
# Print mean and standard deviation
print('before:')
print('mean = {}'.format(input_data.mean(axis=0)))
print('std deviation = {}'.format(input_data.std(axis=0)))

before:
mean = [ 3.8  -1.15 -1.3 ]
std deviation = [ 3.13129366  6.36651396  4.0620192 ]


In [14]:
# Remove mean (scale() also rescales each feature to unit variance by default)
data_scaled = preprocessing.scale(input_data)
print('after:')
print('mean = {}'.format(data_scaled.mean(axis=0)))
print('std deviation = {}'.format(data_scaled.std(axis=0)))

after:
mean = [  5.55111512e-17   0.00000000e+00   2.77555756e-17]
std deviation = [ 1.  1.  1.]
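
The default behaviour of `preprocessing.scale` (center each column, then divide by its standard deviation) can be sketched by hand in NumPy. This is a rough equivalent under the default settings, not the library's code; `col_mean`, `col_std`, and `manual_scaled` are names made up for this example.

```python
import numpy as np

input_data = np.array([[5.2, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Per-feature (column-wise) statistics
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
manual_scaled = (input_data - col_mean) / col_std

print('mean = {}'.format(manual_scaled.mean(axis=0)))
print('std deviation = {}'.format(manual_scaled.std(axis=0)))
```

After this transformation each column should have a mean that is zero up to floating-point error and a standard deviation of one.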


## Scaling

Scale the values of each feature so that they lie in the range 0 to 1.

In [20]:
# Fit the scaler on the data, then keep the transformed array under its own name
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print('Min max scaled data:\n', data_scaled_minmax)

Min max scaled data:
 [[ 0.75294118  0.39548023  1.        ]
 [ 0.          1.          0.        ]
 [ 0.6         0.5819209   0.87234043]
 [ 1.          0.          0.17021277]]
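
For a target range of 0 to 1, min-max scaling amounts to `(x - min) / (max - min)` computed per column. The sketch below reproduces that arithmetic in plain NumPy as an illustration; `col_min`, `col_max`, and `manual_minmax` are hypothetical names, and the snippet assumes no column is constant (which would cause a division by zero).

```python
import numpy as np

input_data = np.array([[5.2, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Per-feature (column-wise) minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1
manual_minmax = (input_data - col_min) / (col_max - col_min)
print('Manually min max scaled data:\n', manual_minmax)
```

For example, the first entry is (5.2 - (-1.2)) / (7.3 - (-1.2)) = 6.4 / 8.5, roughly 0.7529, in line with the output above.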
