Machine Learning - Scale

Scale Features
When the columns in your data have very different ranges, or even different measurement units, it can be difficult to compare them. How do kilograms compare to meters? Or altitude to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

Take a look at the table below. It is the same data set that we used in the multiple regression chapter, but this time the Volume column contains values in liters instead of cm³ (1.0 instead of 1000).

In [1]:
import pandas
df = pandas.read_csv("co2_emission_dataset.csv")
df.head()

Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Toyota,Aygo,1.0,790,99
1,Mitsubishi,Space Star,1.2,1160,95
2,Skoda,Citigo,1.0,929,95
3,Fiat,500,0.9,865,90
4,Mini,Cooper,1.5,1140,105


It can be difficult to compare the volume 1.0 with the weight 790, but if we scale both into comparable values, we can easily see how one value relates to the other.

There are different methods for scaling data. In this tutorial we will use a method called standardization.

The standardization method uses this formula:

z = (x - u) / s

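Where z is the new value, x is the original value, u is the mean of the column, and s is the standard deviation of the column. Each column is scaled separately, using its own mean and standard deviation.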
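
As a minimal sketch of the formula, the snippet below standardizes by hand the five Weight values visible in the table above, using NumPy. Because it uses only those five rows rather than the full file, its results will not match values computed from the full data set. It uses the population standard deviation (NumPy's default, ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# Weight values from the five rows shown by df.head().
# Assumption: this is only a sample of the file, so the mean and
# standard deviation here differ from those of the full data set.
weights = np.array([790, 1160, 929, 865, 1140], dtype=float)

u = weights.mean()  # mean of the column
s = weights.std()   # population standard deviation (ddof=0)

# z = (x - u) / s, applied to every value at once
z = (weights - u) / s

print("mean:", u)
print("standard deviation:", s)
print("scaled values:", z)
```

Values below the mean become negative and values above it become positive, so the sign alone already says on which side of the average a car lies.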

In [3]:
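from sklearn.preprocessing import StandardScaler

# StandardScaler standardizes each column separately: z = (x - u) / s
scale = StandardScaler()

# Select the two feature columns to scale
X = df[['Weight', 'Volume']]

# Learn each column's mean and standard deviation, then transform the data
scaledX = scale.fit_transform(X)

print(scaledX)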

[[-1.61929208 -1.02454353]
 [ 0.64245979 -0.23643312]
 [-0.76960692 -1.02454353]
 [-1.16082886 -1.41859873]
 [ 0.52020293  0.94573249]
 [-0.76960692 -1.02454353]
 [ 0.3307048   0.55167728]
 [ 1.89559257  0.94573249]
 [ 0.34904333  0.94573249]
 [ 0.58133136  1.33978769]]
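
One way to check the result is to confirm that each scaled column ends up with a mean of about 0 and a standard deviation of about 1. The sketch below assumes a small stand-in DataFrame built from the five rows shown earlier, not the full CSV file, so its numbers differ from the output above. It also compares the scaler's result with the formula applied by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in data: only the five rows shown by df.head(), not the full file.
df = pd.DataFrame({
    "Weight": [790, 1160, 929, 865, 1140],
    "Volume": [1.0, 1.2, 1.0, 0.9, 1.5],
})
X = df[["Weight", "Volume"]]

scale = StandardScaler()
scaledX = scale.fit_transform(X)

# After fitting, the scaler keeps the per-column mean and standard deviation.
print("means:", scale.mean_)
print("standard deviations:", scale.scale_)

# Each scaled column should have mean ~0 and standard deviation ~1.
print("scaled means:", scaledX.mean(axis=0))
print("scaled standard deviations:", scaledX.std(axis=0))

# The same result from z = (x - u) / s applied column by column.
values = X.to_numpy(dtype=float)
manual = (values - values.mean(axis=0)) / values.std(axis=0)
print("matches manual formula:", np.allclose(manual, scaledX))
```

Keeping the fitted scale object around matters in practice: calling scale.transform() on new rows reuses the stored mean and standard deviation instead of recomputing them.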
