# Introduction to Scaling

Before you implement your first clustering algorithm yourself, this notebook introduces the concept of feature scaling.

## Feature Scaling

The input features of a model often have different units, which means the variables also have different scales. Some model types (e.g. tree-based models such as decision trees or random forests) are unaffected by the scale of numerical input variables. Many other machine learning algorithms, in particular those that rely on distance measures (e.g. K-Means), perform better when the input features are brought to a comparable scale.

Because K-Means relies on distance measures, we cover scaling here.
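
To make the effect on distances concrete, here is a minimal sketch with three made-up cars described by horsepower and weight. The values are assumptions chosen for illustration, not rows from the dataset. It compares Euclidean distances before and after rescaling each column to mean 0 and standard deviation 1 (standardization, which is introduced later in this notebook).

```python
import numpy as np

# Hypothetical cars: [horsepower, weight in 1000 lbs]
car_a = np.array([110.0, 2.6])
car_b = np.array([112.0, 5.2])   # similar horsepower, twice as heavy
car_c = np.array([150.0, 2.6])   # same weight, more horsepower

# Raw Euclidean distances: horsepower dominates because its numbers are larger
dist_ab = np.linalg.norm(car_a - car_b)
dist_ac = np.linalg.norm(car_a - car_c)
print("raw:   ", dist_ab, dist_ac)

# Rescale each column to mean 0 and standard deviation 1
cars_demo = np.vstack([car_a, car_b, car_c])
scaled = (cars_demo - cars_demo.mean(axis=0)) / cars_demo.std(axis=0)

# After scaling, both features contribute on a comparable scale
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])
print("scaled:", scaled_ab, scaled_ac)
```

On the raw values, car A looks far closer to car B than to car C even though B is twice as heavy. After scaling, the two distances are of similar size.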

The most popular techniques for scaling are **normalization** and **standardization**. 

See [this article on StandardScaler and MinMaxScaler transforms](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/) for further information.

![](images/normalization_vs_standardization.png)

**As an example, we will use the cars dataset from our Linear Regression repo.**

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [None]:
# Read in the dataset
cars = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv")
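# The first column holds the car model; depending on the CSV header it may load as 'Unnamed: 0' or 'rownames'
cars = cars.rename(columns={'Unnamed: 0': 'car_model',
                            'rownames': 'car_model',
                            'wt': 'weight'})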
cars.head()

In [None]:
# Before we have a look at the different scaling methods, we have to define which columns we want to scale.
# Within the linear regression repo, we used weight and horsepower to predict mpg using a multiple linear regression.
# Thus, we want to scale the independent variables weight and horsepower.
col_scale = ['hp', 'weight']

### Data Standardization 

To standardize a dataset, rescale the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1.  
You can think of it as centering the data (subtracting the mean) and then dividing by the standard deviation, so that every feature has unit variance.  
For this purpose, sklearn provides the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

A value is standardized as follows: 

$ x_{scaled} = \frac{x - \mu}{\sigma} $, where

$ \mu = \frac{\sum{x}}{m} $ is the mean, where m is the number of observations

$ \sigma = \sqrt{ \frac{\sum{ (x - \mu)^2 }}{m}} $ is the standard deviation
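
The formulas above can be worked through by hand with NumPy. This is a minimal sketch on a handful of horsepower values that are assumed for illustration. It computes the mean and the standard deviation exactly as written (dividing by m) and then checks that the scaled values have mean 0 and standard deviation 1.

```python
import numpy as np

# Illustrative horsepower values (assumed for this sketch)
hp = np.array([110.0, 110.0, 93.0, 110.0, 175.0])
m = len(hp)

mu = hp.sum() / m                            # mean
sigma = np.sqrt(((hp - mu) ** 2).sum() / m)  # standard deviation, dividing by m

hp_scaled = (hp - mu) / sigma

print("mean:", mu, "std:", sigma)
print("scaled:", hp_scaled)
print("scaled mean:", hp_scaled.mean(), "scaled std:", hp_scaled.std())
```

Note that `np.std` divides by m by default (`ddof=0`), which matches the formula for the standard deviation used here.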

In [None]:
# Scaling with standard scaler
# First, we create a StandardScaler instance with default hyperparameters.
# Then we call fit_transform() and pass it the data we want to transform.
scaler = StandardScaler()
cars_scaled = scaler.fit_transform(cars[col_scale])

In [None]:
# The result is a NumPy array of transformed values.
# Convert it back to a DataFrame, keeping the column names and index, and check the scaled result
cars_scaled_df = pd.DataFrame(cars_scaled, columns=col_scale, index=cars.index)
cars_scaled_df.head()

In [None]:
# Drop original hp and weight columns from original cars dataframe and concatenate it with scaled columns
cars_dropped = cars.drop(col_scale, axis=1)
cars_preprocessed = pd.concat([cars_scaled_df, cars_dropped], axis=1)

In [None]:
cars_preprocessed.head()
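
A quick sanity check after standardizing is to confirm that each scaled column has mean 0 and standard deviation 1. The sketch below does this on a small stand-in DataFrame; the values and the names `demo`, `demo_scaled`, and `restored` are assumptions made for illustration. One point to watch: `StandardScaler` divides by m, while pandas' `.std()` divides by m - 1 by default, so pass `ddof=0` when comparing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative frame (values assumed for this sketch)
demo = pd.DataFrame({
    "hp": [110, 110, 93, 110, 175],
    "weight": [2.620, 2.875, 2.320, 3.215, 3.440],
})

scaler = StandardScaler()
demo_scaled = pd.DataFrame(
    scaler.fit_transform(demo), columns=demo.columns, index=demo.index
)

# Column means should be (numerically) zero
print(demo_scaled.mean())

# With ddof=0 the standard deviation is 1; the pandas default (ddof=1) is slightly larger
print(demo_scaled.std(ddof=0))
print(demo_scaled.std())

# inverse_transform maps the scaled values back to the original units
restored = scaler.inverse_transform(demo_scaled)
print(restored)
```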

### Data Normalization

Normalizing the data means rescaling it from its original range so that all values lie between 0 and 1.
We can do this with the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) from sklearn. This scaler transforms each feature individually by scaling it to a given range (the default range is 0 to 1).

A value is normalized as follows: 

$ x_{scaled} = \frac{x – x_{min}}{x_{max} – x_{min}} $

(Here $x_{min}$ and $x_{max}$ are the minimum and maximum of the feature being normalized, computed on your **train** dataset.)
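
The formula can be worked through by hand with NumPy. This is a minimal sketch with weight values that are assumed for illustration. It takes the minimum and maximum from the training values only and applies them to both the training values and one unseen value.

```python
import numpy as np

# Illustrative weights (assumed values): training data and one unseen value
weight_train = np.array([2.0, 3.0, 4.0, 5.0])
weight_new = np.array([6.0])

# Minimum and maximum are taken from the training data only
x_min = weight_train.min()
x_max = weight_train.max()

train_scaled = (weight_train - x_min) / (x_max - x_min)
new_scaled = (weight_new - x_min) / (x_max - x_min)

print(train_scaled)  # training values lie between 0 and 1
print(new_scaled)    # an unseen value can fall outside that range
```

Because the minimum and maximum come from the training data, values seen later are not guaranteed to stay between 0 and 1.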

#### Hands-On now
Try it out: use the MinMaxScaler() to normalize weight and horsepower as above.

In [None]:
# Scaling with MinMaxScaler

# Try to scale your data with the MinMaxScaler() from sklearn.
# It follows the same syntax as the StandardScaler.
# Don't forget: you have to import the scaler at the top of your notebook. 