<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Scaling by Normalising or Standardising your Data

Scaling can be an issue for some models. 

By this I mean that if some features have much larger values than others, they can end up carrying a heavier weighting in the model than the others. 

Thinking back to Hooke's Law, we gather data for how much a spring extends when we add weights to it.

![Hooke's Law Experiment](../../Images/hookes-law.png)

And we plot the results: 
![Hooke's Law Experiment](../../Images/hookes-law-graph-1.png)

Next time, we repeat the experiment with an industrial grade spring that extends A LOT when each 1 Newton weight is added. 

![Hooke's Law Experiment](../../Images/hookes-law-graph-2.png)

(It's probably obvious that I made up these numbers but please bear with me).

Comparing the two charts: you should notice that in the second chart a small change in x produces a large change in y. Furthermore, this leads to a large coefficient in the $y=ax+b$ formula, i.e. the value for $a$ is large. 

Now consider what happens in multiple linear regression. If some of the features contain much larger numbers than others, i.e. if some columns are orders of magnitude greater than others, the coefficients will also differ widely in size, with large coefficients on some features. 

For example, when linear regression is applied to a dataset, the formula might be: 

 $$ y = 1.6x_1 + 192x_2 + b$$

 It should be obvious that a small change in $x_2$ will have a large impact on the target value, compared to a similar change in $x_1$. 
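
 To see how the units of a feature drive the size of its coefficient, here is a minimal sketch. The data is made up for illustration (the names `length_m` and `length_mm` are hypothetical and do not come from the automobiles dataset): the same measurements are expressed once in metres and once in millimetres, and a linear regression is fitted to each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical feature: a length in metres, and a target that depends on it
length_m = rng.uniform(1.0, 2.0, size=50)
y = 3.0 * length_m + 5.0 + rng.normal(0.0, 0.01, size=50)

# The same measurements expressed in millimetres (values are 1000x larger)
length_mm = length_m * 1000.0

# Fit one model per unit and read off the single coefficient
coef_m = LinearRegression().fit(length_m.reshape(-1, 1), y).coef_[0]
coef_mm = LinearRegression().fit(length_mm.reshape(-1, 1), y).coef_[0]

print("coefficient with metres:     ", coef_m)
print("coefficient with millimetres:", coef_mm)
```

 The two models make the same predictions, but the coefficient for the millimetre version is 1000 times smaller, simply because the numbers in the column are 1000 times larger.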

 The solution is to Normalise the data by: 
 * Centering 
 * Scaling

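 In order to center a column of data, subtract the column's mean from each value in the column. To scale it, divide each centered value by the column's standard deviation.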


 Models including Linear Regression, K-Nearest Neighbours, Support Vector Machines and Neural Networks are all sensitive to this problem. 

# The Automobiles dataset

In [3]:
import pandas as pd
# ensure that we can see all columns when we display a dataframe
pd.set_option('display.max_columns', None) 

# read the automobiles dataset into a dataframe
auto_df = pd.read_csv('../../Data/automobiles.csv')

# drop the symbolling and normalised losses columns
auto_df = auto_df.drop(['symboling', 'normalised_losses'], axis=1)
# drop all rows with na values
auto_df = auto_df.dropna() 
auto_df

Unnamed: 0,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,2952,ohc,four,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,volvo,gas,turbo,four,sedan,rwd,front,109.1,188.8,68.8,55.5,3049,ohc,four,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,volvo,gas,std,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3012,ohcv,six,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,volvo,diesel,turbo,four,sedan,rwd,front,109.1,188.8,68.9,55.5,3217,ohc,six,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0


In [4]:
auto_df[['stroke', 'engine_size']]

Unnamed: 0,stroke,engine_size
0,2.68,130
1,2.68,130
2,3.47,152
3,3.40,109
4,3.40,136
...,...,...
200,3.15,141
201,3.15,141
202,2.87,173
203,3.40,145


Looking at the automobiles dataset (above) we can see that the features differ by orders of magnitude ('stroke' is around 3, while 'engine_size' is typically over 100), and therefore we should be applying strategies to Normalise or Standardise the data. 

# Standardising vs Normalising

In this section we'll Standardise and then Normalise the 'stroke' and 'engine_size' columns so that you can compare and contrast the two approaches. 

## Standardising

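Once a feature is standardised it has a mean of 0 and a standard deviation of 1, so for roughly bell-shaped data most of its values (about two thirds) fall in the range -1 to +1.

This process involves 2 steps. 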
1. Centering
2. Scaling

Centering is achieved by calculating the mean for the column (or feature), and subtracting the mean from each value in the column. 

Scaling is achieved by dividing each centered value by the standard deviation of the column. 

Standardising is usually the best choice when the data follows a Gaussian distribution (bell curve). 
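
The two steps can be sketched by hand with pandas. This is only an illustration: it uses five rows of 'stroke' and 'engine_size' copied from the table above rather than the full dataset, and the variable names (`sample`, `centred`, `standardised`) are my own. It then checks the result against scikit-learn's StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five rows copied from the stroke / engine_size table above
sample = pd.DataFrame({
    'stroke': [2.68, 2.68, 3.47, 3.40, 3.40],
    'engine_size': [130, 130, 152, 109, 136],
})

# Step 1, centering: subtract each column's mean from its values
centred = sample - sample.mean()

# Step 2, scaling: divide by each column's standard deviation.
# ddof=0 (population standard deviation) is what StandardScaler uses;
# pandas defaults to ddof=1, so it has to be set explicitly.
standardised = centred / sample.std(ddof=0)

# Compare the hand-rolled result with scikit-learn's
expected = StandardScaler().fit_transform(sample)
print(np.allclose(standardised.to_numpy(), expected))
```

Note the `ddof=0` argument: without it the hand-calculated values would differ slightly from the ones StandardScaler produces.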


In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(auto_df[['stroke', 'engine_size']])
# Print out the Standardised Data as a Dataframe
pd.DataFrame(X)



Unnamed: 0,0,1
0,-1.808186,0.045215
1,-1.808186,0.045215
2,0.702918,0.575559
3,0.480415,-0.461021
4,0.480415,0.189854
...,...,...
188,-0.314238,0.310387
189,-0.314238,0.310387
190,-1.204249,1.081795
191,0.480415,0.406813


## Normalising 

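Normalising is similar in spirit: once a column (or feature) has been min-max normalised, its values fall in the range 0 to 1. Be aware that scikit-learn's Normalizer, used in the cell below, does something different: it rescales each row (sample) to unit length rather than each column to the 0 to 1 range, which is why the second column ('engine_size') comes out close to 1 in every row of the output.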

In [8]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
X = scaler.fit_transform(auto_df[['stroke', 'engine_size']])
# Print out the Normalised Data as a Dataframe
pd.DataFrame(X)

Unnamed: 0,0,1
0,0.020611,0.999788
1,0.020611,0.999788
2,0.022823,0.999740
3,0.031177,0.999514
4,0.024992,0.999688
...,...,...
188,0.022335,0.999751
189,0.022335,0.999751
190,0.016587,0.999862
191,0.023442,0.999725
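
If the aim is to bring each column into the 0 to 1 range, one option is scikit-learn's MinMaxScaler, which works column by column (Normalizer rescales each row instead). The sketch below is an illustration only: it uses five rows copied from the 'stroke' and 'engine_size' table earlier rather than the full dataset, and the name `sample` is my own.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Five rows copied from the stroke / engine_size table earlier in the notebook
sample = pd.DataFrame({
    'stroke': [2.68, 2.68, 3.47, 3.40, 3.40],
    'engine_size': [130, 130, 152, 109, 136],
})

# Rescale each column so its smallest value maps to 0 and its largest to 1
scaler = MinMaxScaler()
X = scaler.fit_transform(sample)

# Print out the rescaled data as a Dataframe, keeping the column names
print(pd.DataFrame(X, columns=sample.columns))
```

Each column is handled independently, so both 'stroke' and 'engine_size' end up spanning the same 0 to 1 range regardless of their original units.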


# Recap of Regularised Regression

At this point you may wish to recall Ridge Regression and Lasso Regression both of which can be used to shrink large coefficients.  In the case of Lasso Regression some coefficients may even shrink to zero (which is feature selection). 

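Regularisation is not a replacement for scaling, however. The penalty acts on the size of each coefficient, which itself depends on the scale of the feature, so Ridge and Lasso Regression are sensitive to feature scale too, and it is usual to Standardise the data before fitting them.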

One of the main advantages of being able to use Standardisation or Normalisation yourself is that you can apply it to any model.
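
As a sketch of what "any model" can look like in practice, scikit-learn lets you chain a scaler and a model with `make_pipeline`, so that the scaling is fitted on the training data and reapplied automatically at prediction time. The choice of K-Nearest Neighbours and of `n_neighbors=2` here is my own assumption for illustration, and the data is just five rows ('stroke', 'engine_size' and 'price') copied from the automobiles table, so the predictions themselves are not meaningful.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Five rows copied from the automobiles table: stroke, engine_size
X = np.array([
    [2.68, 130],
    [2.68, 130],
    [3.47, 152],
    [3.40, 109],
    [3.40, 136],
])
# The matching price values for those rows
y = np.array([13495.0, 16500.0, 16500.0, 13950.0, 17450.0])

# Standardise first, then fit the model on the standardised features
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2))
model.fit(X, y)

# New data passed to predict() is standardised with the same mean and
# standard deviation that were learned during fit()
print(model.predict(X[:2]))
```

Swapping KNeighborsRegressor for another estimator leaves the rest of the pipeline unchanged.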


