## Feature Scaling
The scale of each variable can have a very large impact on a model's results, particularly when the model compares observations by distance. For instance, consider this example of employee salaries:


In [None]:
import pandas as pd

salaries = {
    'ID':['01','02','03'],
    'Salary':[70000,60000,52000],
    'Years of Experience':[5,4,1]
}

# Use the employee ID as the row index rather than as a data column
df = pd.DataFrame(salaries).set_index('ID')
df

We want to group employees together. Employees 1 and 3 are definitely in different groups. But how would we group Employee 2? Employee 2 is closer to Employee 3 in salary (a gap of 8,000 versus 10,000), but closer to Employee 1 in experience (a gap of 1 year versus 3).
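To make the dilemma concrete, here is a minimal sketch that measures how far apart the employees are on the raw data. It assumes Euclidean distance as the measure of closeness, which the example itself does not specify, and `euclidean` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as above, indexed by employee ID.
df = pd.DataFrame(
    {'Salary': [70000, 60000, 52000], 'Years of Experience': [5, 4, 1]},
    index=['01', '02', '03'],
)

def euclidean(a, b):
    """Straight-line distance between two rows (hypothetical helper)."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Raw distances from Employee 2 to each of the other two employees.
d_12 = euclidean(df.loc['01'], df.loc['02'])
d_23 = euclidean(df.loc['02'], df.loc['03'])
print(f"d(1,2) = {d_12:.2f}, d(2,3) = {d_23:.2f}")
```

On the raw data the two distances are almost exactly the salary gaps (about 10,000 and 8,000), so the experience column contributes next to nothing to the comparison.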

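The difference in scale is throwing us off: salaries are in the tens of thousands, while experience is in single digits. To compare the two fairly, we look at __feature scaling__. There are two common methods of feature scaling: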
1. Standardization
$$\hat{x} = \frac{x-\bar{x}}{s}$$
2. Normalization
$$\hat{x} = \frac{x-x_{min}}{x_{max}-x_{min}}$$

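Standardization rescales each variable to have mean 0 and standard deviation 1. The result is unbounded, but for roughly bell-shaped data most values fall within [-3, 3]. Normalization (min-max scaling) always gives a result in [0, 1], with the minimum mapped to 0 and the maximum mapped to 1.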
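As a rough worked example, take Employee 2's salary of 60,000 from the table above, assuming $s$ is the sample standard deviation (dividing by $n-1$). The three salaries have $\bar{x} \approx 60666.67$ and $s \approx 9018.50$, with $x_{min} = 52000$ and $x_{max} = 70000$. The subscripts below are labels added here only to tell the two results apart.

$$\hat{x}_{std} = \frac{60000 - 60666.67}{9018.50} \approx -0.074$$
$$\hat{x}_{norm} = \frac{60000 - 52000}{70000 - 52000} \approx 0.444$$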

Let's see how each method affects the data.

In [None]:
# Standardized: subtract each column's mean, then divide by its sample standard deviation
def standardize_df(x):
    return (x-x.mean())/(x.std(ddof=1))

standardize_df(df)

In [None]:
# Normalized: shift each column so its minimum is 0, then divide by its range
def normalize_df(x):
    return (x-x.min())/(x.max()-x.min())

normalize_df(df)
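A quick sanity check on the two transformations is sketched below. It rebuilds the same three-row table inline so that it runs on its own, and it reuses the two function definitions from the cells above. The variable names `standardized` and `normalized` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Salary': [70000, 60000, 52000], 'Years of Experience': [5, 4, 1]},
    index=['01', '02', '03'],
)

def standardize_df(x):
    return (x - x.mean()) / x.std(ddof=1)

def normalize_df(x):
    return (x - x.min()) / (x.max() - x.min())

standardized = standardize_df(df)
normalized = normalize_df(df)

# Each standardized column should have mean 0 and sample standard deviation 1
# (up to floating-point rounding).
print(standardized.mean().round(10))
print(standardized.std(ddof=1).round(10))

# Each normalized column should span exactly 0 to 1.
print(normalized.min())
print(normalized.max())
```

Both methods are applied column by column, so Salary and Years of Experience end up on comparable scales regardless of their original units.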

What do we see? In the original data, the salary gaps (10,000 between Employees 1 and 2, 8,000 between Employees 2 and 3) dwarf any difference in experience, so we'd say that Employee 2 was closer to Employee 3. But in the standardized and normalized data, the salary of Employee 2 sits very nearly in the middle of the other two (about -0.07 standardized, about 0.44 normalized). So, Salary alone is not a good indicator. Looking at Years of Experience, Employee 2 (about 0.32 standardized, 0.75 normalized) is much closer to Employee 1 (about 0.80 and 1.00) than to Employee 3 (about -1.12 and 0.00). So, it is more likely for Employee 2 to be grouped with Employee 1.
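The same conclusion can be checked numerically. The sketch below again assumes Euclidean distance as the measure of closeness, which the example does not prescribe, and `euclidean` and `distances` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Salary': [70000, 60000, 52000], 'Years of Experience': [5, 4, 1]},
    index=['01', '02', '03'],
)

# Same formulas as standardize_df and normalize_df.
standardized = (df - df.mean()) / df.std(ddof=1)
normalized = (df - df.min()) / (df.max() - df.min())

def euclidean(a, b):
    """Straight-line distance between two rows (hypothetical helper)."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Distance from Employee 2 to Employees 1 and 3 under each scaling.
distances = {}
for name, scaled in [('standardized', standardized), ('normalized', normalized)]:
    d_12 = euclidean(scaled.loc['01'], scaled.loc['02'])
    d_23 = euclidean(scaled.loc['02'], scaled.loc['03'])
    distances[name] = (d_12, d_23)
    print(f"{name}: d(1,2) = {d_12:.2f}, d(2,3) = {d_23:.2f}")
```

Under both scalings Employee 2 ends up nearer to Employee 1 (roughly 1.21 versus 1.69 standardized, and 0.61 versus 0.87 normalized), the opposite of what the raw data suggests.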