## What is data preprocessing?

**Data preprocessing** means using manipulation techniques to make your dataset ready for running a machine learning model. In particular, you'll be transforming the features (columns containing inputs to the model).

## What are centering and scaling?

**Centering** means subtracting the mean of a feature from each element of the feature, so that the mean of the processed feature is zero.

**Scaling** means dividing each element of the feature by the feature's standard deviation, so that the standard deviation of the processed feature is one. Applying centering and scaling together is usually called **standardizing**: the processed feature has a mean of zero and a standard deviation of one.
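The two definitions above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up values, not part of the case study; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical feature values, made up for illustration.
# Their mean is 5.0 and their (population) standard deviation is 2.0.
feature = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Centering: subtract the mean, so the result has a mean of zero.
centered = feature - feature.mean()

# Scaling: divide by the standard deviation, so the result has a
# standard deviation of one.
scaled = centered / feature.std()

print(centered.mean())  # mean after centering
print(scaled.std())     # standard deviation after scaling
```

Note that `np.std` computes the population standard deviation by default, which is also what scikit-learn's `StandardScaler` uses.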

## When should I use centering and scaling?

Centering and scaling are essential for model types that are sensitive to the scale of the features, either because they rely on distances between observations or because they penalize the size of the coefficients. That includes

- K-nearest neighbors
- Support vector machines (particularly when using the radial basis function kernel)
- Regularized regression (lasso and ridge regression)

Centering and scaling are not essential but can help with convergence for the following model types.

- Linear and logistic regression
- Neural networks

Centering and scaling have little or no effect on the following model types, so they are generally unnecessary.

- Tree-based models (decision trees, random forests, gradient boosting)
- Naive Bayes
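As a rough check of the claim about tree-based models, the sketch below fits the same decision tree to raw and standardized versions of a synthetic dataset and compares the predictions. The synthetic data, the tree depth, and the variable names are assumptions chosen for illustration; they are not part of the case study.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the diamonds dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# Stretch one feature so that the raw feature scales differ a lot.
X_demo[:, 0] *= 1000

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same model settings and random seed for both fits.
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo_scaled, y_demo)

# Fraction of observations where the two trees predict the same class.
agreement = (tree_raw.predict(X_demo) == tree_scaled.predict(X_demo_scaled)).mean()
print(agreement)
```

Tree splits depend only on the ordering of values within each feature, and standardizing preserves that ordering, so the two trees should agree on essentially every prediction.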

## What Python packages can I use for centering and scaling data?

- **scikit-learn** (used here)
- **PyCaret**
- **pandas**
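For comparison with scikit-learn, here is a minimal sketch of centering and scaling with plain pandas arithmetic. The small data frame is made up for illustration. One detail to watch: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, made up for illustration.
df = pd.DataFrame({
    "carat": [0.23, 0.31, 0.70, 1.04],
    "price": [326.0, 335.0, 2401.0, 5324.0],
})

# pandas: subtract the column means, then divide by the column standard deviations.
# ddof=0 matches the population standard deviation used by StandardScaler.
scaled_pandas = (df - df.mean()) / df.std(ddof=0)

# scikit-learn: the same calculation, returned as a NumPy array.
scaled_sklearn = StandardScaler().fit_transform(df)

print(scaled_pandas)
print(scaled_sklearn)
```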

## Case study: k-nearest neighbors using the diamonds dataset

The diamonds dataset is a classic dataset on diamond prices, originally found in R's **ggplot2** package, and available to Python users in the **plotnine** package.

In [1]:
from plotnine.data import diamonds
diamonds

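| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |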


We'll try to predict the **cut** of the diamonds using the numeric features in the dataset. (That is, for simplicity, we'll ignore **color** and **clarity**.)

Before modeling, let's look at some summary statistics in the features.

In [2]:
diamonds.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


Notice that the maximum value of carat is about 5, but the maximum price is almost 20,000. Sadly, you can't get a one carat diamond for a dollar, so the scales of the features are very different.
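To see why the difference in scales matters for a distance-based model such as k-nearest neighbors, the sketch below computes the Euclidean distance between two diamonds using only carat and price. The values are copied from the first and last rows displayed above and from the `describe()` output; restricting the calculation to two features is a simplification for illustration.

```python
import numpy as np

# (carat, price) for the first and last diamonds displayed in the table above.
diamond_a = np.array([0.23, 326.0])
diamond_b = np.array([0.86, 2757.0])

# Standard deviations of carat and price, copied from the describe() output.
feature_std = np.array([0.474011, 3989.439738])

# Raw distance: the price difference swamps the carat difference.
raw_diff = np.abs(diamond_a - diamond_b)
raw_distance = np.sqrt((raw_diff ** 2).sum())

# Distance after scaling. Centering shifts both points by the same amount,
# so it does not change the differences; only the division matters here.
scaled_diff = raw_diff / feature_std
scaled_distance = np.sqrt((scaled_diff ** 2).sum())

print(raw_diff, raw_distance)
print(scaled_diff, scaled_distance)
```

On the raw data the distance is almost exactly the price difference, so carat barely influences which diamonds count as "near". After scaling, both features contribute on comparable terms.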

## Importing the required functions

We'll use the following:

- [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the dataset into training and testing sets.
- [KNeighborsClassifier()](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to fit the k-nearest neighbors model.
- [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to scale the features used in the model.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

## Creating training and testing sets

- The **cut** column is our response variable (the thing to predict). We'll assign this to `y`.
- We'll use all the other numeric variables (everything except the response, **color**, and **clarity**) for features. We'll assign these to `X`.

In [4]:
y = diamonds["cut"]
X = diamonds.drop(columns=["cut", "color", "clarity"])

Now we perform the train-test split, using default options. By default, a random 25% of the rows go to the testing set and the rest go to the training set.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
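Because the split is random, results change from run to run. If you want a reproducible split that also keeps the class proportions the same in both sets, `train_test_split()` accepts `random_state` and `stratify` arguments. The sketch below uses made-up toy data and an arbitrary seed; it is an optional variation, not what the case study does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 20 rows, 2 features, 2 classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["Ideal"] * 12 + ["Good"] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,    # the default proportion, written out explicitly
    random_state=42,   # arbitrary seed, so the split is repeatable
    stratify=y_toy,    # keep the Ideal/Good proportions in both sets
)

print(len(X_tr), len(X_te))
print((y_te == "Ideal").sum(), (y_te == "Good").sum())
```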

## Creating a K-NN classifier

To run the k-nearest neighbors model, we need to create a `KNeighborsClassifier` object.

In [6]:
knn = KNeighborsClassifier()

## Running the model without standardizing

First we fit the model to the training data.

In [7]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

Now we measure the accuracy of the predictions on the testing set.

In [8]:
knn.score(X_test, y_test)

0.557211716722284

## Running the model with standardizing

First we create a standard scaler object.

In [9]:
ss = StandardScaler()

Now we fit the scaler (calculate the means and standard deviations) and transform the features (subtract those means and divide by the standard deviations).

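It's important that we fit the scaler on the training set only, and then use those training-set means and standard deviations to transform both the training and testing sets. Fitting on the testing set as well causes **data leakage**, where information from the testing set has "leaked" into the preprocessing, and it puts the two sets on slightly different scales.

In [10]:
# Fit on the training set only, then transform it.
X_train_scaled = ss.fit_transform(X_train)
# Reuse the training-set means and standard deviations for the testing set.
X_test_scaled = ss.transform(X_test)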

Again we fit the model.

In [11]:
knn.fit(X_train_scaled, y_train)

KNeighborsClassifier()

... and calculate the accuracy.

In [12]:
knn.score(X_test_scaled, y_test)

0.7061920652576937

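Notice the substantial improvement in accuracy, from about 0.56 to about 0.71. Because the train-test split is random, the exact values will vary slightly from run to run.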
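One way to make sure the scaler only ever learns from the training data is to chain it with the model using scikit-learn's `make_pipeline()`. The sketch below uses synthetic data rather than the diamonds dataset, and the variable names are hypothetical. It checks that the pipeline gives the same accuracy as scaling by hand with statistics from the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration), with one inflated feature.
X_syn, y_syn = make_classification(n_samples=600, n_features=6, random_state=0)
X_syn[:, 0] *= 1000

X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, random_state=0
)

# Pipeline: fit() fits the scaler on the training data and then the model;
# score() transforms the testing data with the training statistics.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_syn_train, y_syn_train)
pipeline_accuracy = pipe.score(X_syn_test, y_syn_test)

# Manual equivalent: fit the scaler on the training set, transform both sets.
scaler = StandardScaler().fit(X_syn_train)
knn_manual = KNeighborsClassifier().fit(scaler.transform(X_syn_train), y_syn_train)
manual_accuracy = knn_manual.score(scaler.transform(X_syn_test), y_syn_test)

print(pipeline_accuracy, manual_accuracy)
```

A pipeline also works with cross-validation helpers, where the scaler is refitted on each training fold.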

## What other types of scaling are available?

Scikit-learn provides several other scaler classes in the `sklearn.preprocessing` submodule.

- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) subtracts the median and divides by the inter-quartile range.
- [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) divides each value by the maximum absolute value (so all values are between -1 and 1).
- [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) rescales each feature to a given range (0 to 1 by default).
- [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html) scales each row, rather than each column, so that by default the sum of the squares of the values equals one.
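The four scalers listed above can be compared on a small array. This is a minimal sketch with made-up values, including one outlier in the first column, using each scaler's default settings.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, RobustScaler

# Made-up data: the first column contains an outlier (100.0).
X_small = np.array([
    [1.0, -10.0],
    [2.0, 0.0],
    [3.0, 10.0],
    [4.0, 20.0],
    [100.0, 30.0],
])

# Column-wise: subtract the median, divide by the inter-quartile range.
robust = RobustScaler().fit_transform(X_small)

# Column-wise: divide by the maximum absolute value.
max_abs = MaxAbsScaler().fit_transform(X_small)

# Column-wise: rescale to the range 0 to 1 (the default).
min_max = MinMaxScaler().fit_transform(X_small)

# Row-wise: divide each row by its Euclidean length.
row_normalized = Normalizer().fit_transform(X_small)

for name, result in [("robust", robust), ("max_abs", max_abs),
                     ("min_max", min_max), ("row_normalized", row_normalized)]:
    print(name)
    print(result)
```

`RobustScaler` is the one designed with outliers in mind, since the median and inter-quartile range are much less affected by extreme values than the mean and standard deviation.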


## Where can I learn more?

- DataCamp's [Preprocessing for Machine Learning in Python](https://app.datacamp.com/learn/courses/preprocessing-for-machine-learning-in-python) and [Feature Engineering for Machine Learning in Python](https://app.datacamp.com/learn/courses/feature-engineering-for-machine-learning-in-python) courses.
- scikit-learn's [Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html) tutorial.
- Quora Q&A on [Which machine algorithms require data scaling/normalization?](https://www.quora.com/Which-machine-algorithms-require-data-scaling-normalization)