# Chapter 7
# Prepare Your Data For Machine Learning

Many machine learning algorithms make assumptions about your data. It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use. After completing this lesson you will know how to:
1. Rescale data.
2. Standardize data.
3. Normalize data.
4. Binarize data.

Let's get started.

## 7.1 Need For Data Pre-processing

You almost always need to pre-process your data. It is a required step. A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, even when you follow all of the rules and prepare your data, sometimes algorithms can deliver better results without pre-processing.

Generally, I would <b>recommend creating many different views and transforms of your data, then exercise a handful of algorithms on each view of your dataset</b>. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.
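One way to act on this advice is sketched below. It is only an illustration: the dataset is synthetic (generated with `make_classification`), and the four views, the two algorithms and the 5-fold cross-validation are all assumptions chosen to keep the example small. Each transform is wrapped in a pipeline so that it is refit inside every cross-validation fold.

```python
# Sketch: score a few algorithms on several "views" of the same data.
# The synthetic dataset, the views and the models are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                   Normalizer, StandardScaler)

X, y = make_classification(n_samples=200, n_features=8, random_state=7)

views = {
    'raw': FunctionTransformer(),       # identity: leaves the data unchanged
    'rescaled': MinMaxScaler(),
    'standardized': StandardScaler(),
    'normalized': Normalizer(),
}
models = {
    'logistic': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
}

results = {}
for view_name, transform in views.items():
    for model_name, model in models.items():
        # the pipeline refits the transform on the training part of each fold
        pipeline = make_pipeline(transform, model)
        scores = cross_val_score(pipeline, X, y, cv=5)
        results[(view_name, model_name)] = scores.mean()

# mean accuracy for every (view, algorithm) pair
for key, mean_accuracy in sorted(results.items()):
    print(key, round(mean_accuracy, 3))
```

Reading the printed table row by row gives a quick feel for which views help which algorithms on a given problem.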

## 7.2 Data Transforms

In this lesson you will work through 4 different data pre-processing recipes for machine learning. Each recipe follows the same structure:
- Load the dataset from a CSV file.
- Split the dataset into the input and output variables for machine learning.
- Apply a pre-processing transform to the input variables.
- Summarize the data to show the change.

The scikit-learn library provides two standard idioms for transforming data, each useful in different circumstances. The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future. The scikit-learn documentation has more information on how to use the various pre-processing methods. The two idioms are:
- Fit and Multiple Transform.
- Combined Fit-And-Transform.

The <b>Fit and Multiple Transform</b> method is the <a>preferred approach. You call the fit() function to prepare the parameters of the transform once on your data. Then later you can use the transform() function on the same data to prepare it for modeling, and again on the test or validation dataset or on new data that you may see in the future</a>. The <b>Combined Fit-And-Transform</b> is a convenience that you can use for one-off tasks. <a>This might be useful if you are interested in plotting or summarizing the transformed data</a>.
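The two idioms can be sketched side by side as follows. The toy arrays `X_train` and `X_new` are hypothetical stand-ins for your training data and for data you see later, and `StandardScaler` is used only as an example transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: 4 training rows and 2 unseen rows, 2 attributes each
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[2.5, 25.0], [5.0, 50.0]])

# Fit and Multiple Transform: learn the parameters once, then reuse them
scaler = StandardScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
new_scaled = scaler.transform(X_new)    # uses the training mean and stdev

# Combined Fit-And-Transform: one call, handy for a quick look at the data
quick_look = StandardScaler().fit_transform(X_train)

print(scaler.mean_)     # per-column means learned from X_train
print(new_scaled)
```

Note that `X_new` is never passed to `fit()`. It is transformed with the parameters learned from the training data, which is what lets you treat future data exactly as the model saw the training data.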

## 7.3 Rescale Data

When your data is composed of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale. <a>Often this is referred to as normalization and attributes are often rescaled into the range between 0 and 1</a>. This is useful for <b>optimization algorithms used in the core of machine learning algorithms like gradient descent</b>. It is also useful for <b>algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like k-Nearest Neighbors</b>.

In [1]:
# Rescale data (between 0 and 1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
set_printoptions(precision=3)

print(rescaledX[0:5,:])

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]


<b>Note</b>. After rescaling you can see that all of the values are in the range between 0 and 1.
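If you want to see what the rescaling does, the sketch below repeats it by hand on a small made-up array (not the diabetes data) and compares the result with `MinMaxScaler`. Each column has its minimum subtracted and is then divided by its range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up data: two attributes on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 1000.0]])

# min-max rescaling by hand:
# (x - column minimum) / (column maximum - column minimum)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(manual)
print(np.allclose(manual, rescaled))
```

The hand-written version would divide by zero for a column that holds a single constant value, so treat it as an explanation of the idea rather than a replacement for the library class.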

## 7.4 Standardize Data

Standardization is a useful technique to transform <b>attributes with a Gaussian distribution</b> and <a>differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1</a>. It is most suitable for techniques that assume a Gaussian distribution in the input variables and <a>work better with rescaled data</a>, such as <b>linear regression</b>, <b>logistic regression</b> and <b>linear discriminant</b> analysis.

In [2]:
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions

filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
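
As a rough check of what standardization computes, the sketch below applies the z-score formula by hand to a small made-up array and compares it with `StandardScaler`. The array is an assumption for illustration only. Note that `StandardScaler` uses the population standard deviation, which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: two attributes with different means and spreads
X = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 1000.0], [5.0, 600.0]])

# z-score by hand: subtract the column mean, divide by the column stdev
# (np.std defaults to ddof=0, the population standard deviation)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

standardized = StandardScaler().fit_transform(X)

print(np.allclose(manual, standardized))
print(standardized.mean(axis=0))   # approximately 0 for every column
print(standardized.std(axis=0))    # approximately 1 for every column
```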


## 7.5 Normalize Data

Normalizing in scikit-learn refers to <b>rescaling each observation (row) to have a length of 1</b> (<a>called a unit norm or a vector with the length of 1 in linear algebra</a>). This pre-processing method can be useful for sparse datasets (<a>lots of zeros</a>) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as k-Nearest Neighbors.

In [3]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions

filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
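
The sketch below shows the same idea on a tiny made-up array: each row is divided by its own Euclidean (L2) length, which is the default norm used by `Normalizer`. The numbers are chosen only so that the result is easy to check by eye.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# made-up data: the third row points the same way as the first, but is longer
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# normalize by hand: divide each row by its Euclidean (L2) length
row_lengths = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_lengths

normalized = Normalizer(norm='l2').fit_transform(X)

print(normalized)
print(np.linalg.norm(normalized, axis=1))   # every row now has length 1
```

Unlike the two previous transforms, this one works row by row and learns nothing from the data, so calling `fit()` does not compute any parameters. Rows that point in the same direction, like the first and third rows here, become identical after normalizing.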


## 7.6 Binarize Data (<font color='blue'>Make Binary</font>)

You can transform your data using a binary threshold. <a>All values above the threshold are marked 1 and all equal to or below are marked as 0. This is called binarizing your data or thresholding</a> your data. It can be <b>useful when you have probabilities that you want to make crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful</b>.

In [4]:
# binarization
from sklearn.preprocessing import Binarizer
from pandas import read_csv
from numpy import set_printoptions

filename = './data/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]


<b>Note</b>. You can see that all values equal to or less than 0 are marked 0 and all of those above 0 are marked 1.
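The probability use case mentioned above might look like the sketch below. The `probabilities` array and the threshold of 0.5 are hypothetical values picked for illustration. The comparison written by hand shows the rule the transform applies: only values strictly greater than the threshold become 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# hypothetical predicted probabilities, one column
probabilities = np.array([[0.10], [0.50], [0.51], [0.90]])

# values strictly above 0.5 become 1, everything else becomes 0
binarizer = Binarizer(threshold=0.5)
crisp = binarizer.fit_transform(probabilities)

# the same rule written by hand
manual = (probabilities > 0.5).astype(float)

print(crisp.ravel())
print(np.array_equal(manual, crisp))
```

A probability of exactly 0.5 is marked 0 here, so choose the threshold with the boundary case in mind.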