# Data Transformation: Feature scaling

## Introduction

Data transformation is one of the fundamental steps of data processing. At first glance, the terms scale, standardise, and normalise may seem interchangeable. However, it was pretty hard to find information about which of them to use and when. Therefore, I’m going to explain the following key aspects:

- The difference between Standardisation and Normalisation
- When to use Standardisation and when to use Normalisation
- How to apply feature scaling in Python

## Definitions

### Feature Scaling

In practice, we often encounter different types of variables in the same dataset. A significant issue is that the ranges of the variables may differ a lot. Using the original scale may put more weight on the variables with a large range. To deal with this problem, we rescale the independent variables (the features) during data pre-processing. The terms normalisation and standardisation are sometimes used interchangeably, but they usually refer to different things.

- The goal of Feature Scaling is to put the features on roughly the same scale, so that each feature carries comparable weight and the data is easier for most ML algorithms to process.

#### Rescaling
Rescaling a vector means adding or subtracting a constant and then multiplying or dividing by a constant, as you would do to change the units of measurement of the data, for example to convert a temperature from Celsius to Fahrenheit.
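As a small illustration of rescaling as a change of units, the sketch below applies the Celsius-to-Fahrenheit conversion with NumPy. The array name `celsius` and its values are made up for this example.

```python
import numpy as np

# Hypothetical temperatures in degrees Celsius
celsius = np.array([0.0, 25.0, 100.0])

# Rescaling: multiply by a constant (9/5), then add a constant (32)
fahrenheit = celsius * 9 / 5 + 32

print(fahrenheit.tolist())  # [32.0, 77.0, 212.0]
```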

#### Normalizing
Normalizing a vector most often means dividing by a norm of the vector. It also often refers to rescaling by the minimum and range of the vector so that all the elements lie between 0 and 1, which brings all the numeric columns in the dataset to a common scale.
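The two meanings give different results, as this minimal NumPy sketch shows. The vector `x` is made up, and the Euclidean norm is just one possible choice of norm.

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0])  # made-up values

# Meaning 1: divide by a norm (here the Euclidean norm), giving a unit-length vector
unit = x / np.linalg.norm(x)

# Meaning 2: min-max rescaling, mapping the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

print(unit.tolist())     # [0.6, 0.8, 0.0]
print(min_max.tolist())  # [0.75, 1.0, 0.0]
```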

#### Standardizing 
Standardizing a vector most often means subtracting a measure of location and dividing by a measure of scale. For example, if the vector contains random values with a Gaussian distribution, you might subtract the mean and divide by the standard deviation, thereby obtaining a “standard normal” random variable with mean 0 and standard deviation 1.
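A quick way to check this is to standardize a Gaussian sample and look at its mean and standard deviation. The seed, location, and scale below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)                   # fixed seed, chosen arbitrarily
x = rng.normal(loc=50.0, scale=10.0, size=1000)  # Gaussian sample

# Subtract the mean (location), divide by the standard deviation (scale).
# Note: NumPy's std defaults to ddof=0, whereas pandas defaults to ddof=1.
z = (x - x.mean()) / x.std()

print(round(float(z.mean()), 6), round(float(z.std()), 6))
```

Up to floating-point error, the standardized sample has mean 0 and standard deviation 1.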

#### Note 1
Standardization has two elements: centering and scaling. Subtracting the mean is a type of centering; dividing by the standard deviation is a type of scaling. You don't have to use those particular statistics: you could use the median and the Gini mean difference, for example. You can also center without scaling, or scale without centering.
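The sketch below shows the three variations on a made-up vector. The helper `gini_mean_difference` is a hypothetical name written for this example; it takes the Gini mean difference to be the average absolute difference over all pairs of distinct elements.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values with one outlier

centered_only = x - x.mean()      # centering without scaling
scaled_only = x / x.std(ddof=1)   # scaling without centering


def gini_mean_difference(values):
    """Mean absolute difference over all pairs of distinct elements."""
    n = len(values)
    pairwise = np.abs(values[:, None] - values[None, :])
    return pairwise.sum() / (n * (n - 1))


# Median as the measure of location, Gini mean difference as the measure of scale
robust = (x - np.median(x)) / gini_mean_difference(x)

print(gini_mean_difference(x))  # 40.0
print(robust.tolist())
```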

#### Note 2
If an algorithm is not distance-based, feature scaling is generally unimportant. Examples include Naive Bayes, Linear Discriminant Analysis, and tree-based models (gradient boosting, random forest, etc.).
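For tree-based models, one way to see this is that a split only compares a feature against a threshold, so it depends on the ordering of the values and not on their scale. The sketch below uses made-up salaries and a hypothetical helper `split_mask`; it illustrates the idea and does not cover Naive Bayes or Linear Discriminant Analysis.

```python
import numpy as np

salary = np.array([48_000.0, 54_000.0, 61_000.0, 72_000.0, 83_000.0])  # made-up
scaled = (salary - salary.min()) / (salary.max() - salary.min())


def split_mask(feature):
    """Split 'feature <= threshold' at the midpoint between the 2nd and 3rd smallest values."""
    ordered = np.sort(feature)
    threshold = (ordered[1] + ordered[2]) / 2
    return feature <= threshold


# The same observations fall on the same side of the split on either scale
print(split_mask(salary).tolist())  # [True, True, False, False, False]
print(split_mask(scaled).tolist())  # [True, True, False, False, False]
```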

### Example
 
Let's consider a dataset that contains a dependent variable (Purchased) and 3 independent variables (Country, Age, and Salary). We can easily see that the variables are not on the same scale: Age ranges from 27 to 50, while Salary ranges from 48 K to 83 K. The range of Salary is much wider than the range of Age. This will cause issues in our models, since many machine learning models, such as k-means clustering and nearest neighbour classification, are based on the Euclidean distance.
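To make the problem concrete, here is a minimal sketch with three made-up rows (Age, Salary) that stay inside the ranges quoted above. The rows and the helper `min_max` are assumptions for illustration, not the actual dataset.

```python
import numpy as np

# Made-up rows (Age, Salary) within the ranges quoted above
a = np.array([27.0, 50_000.0])
b = np.array([50.0, 51_000.0])  # very different age, similar salary
c = np.array([28.0, 60_000.0])  # similar age, different salary

# On the raw scale, the distance is driven almost entirely by Salary
raw_ab = np.linalg.norm(a - b)  # about 1000
raw_ac = np.linalg.norm(a - c)  # about 10000

# Min-max scale both features using the ranges from the text
lo = np.array([27.0, 48_000.0])
hi = np.array([50.0, 83_000.0])


def min_max(v):
    return (v - lo) / (hi - lo)


scaled_ab = np.linalg.norm(min_max(a) - min_max(b))  # about 1.00
scaled_ac = np.linalg.norm(min_max(a) - min_max(c))  # about 0.29

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

Before scaling, `a` looks closer to `b` even though their ages are at opposite ends of the range; after scaling, the age difference counts and `a` is closer to `c`.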

## Z-score standardization or Min-Max scaling?

There is no obvious answer to this question: it really depends on the application.

Some machine learning models are fundamentally based on distances between observations, for example K-Nearest-Neighbours and SVM. Neural networks are not distance-based in the same sense, but they are trained with gradient descent and are also sensitive to the scale of their inputs. Feature scaling is essential to those models, especially when the ranges of the features are very different. Otherwise, features with a large range will have a large influence in computing the distance.

Min-Max Normalisation typically allows us to transform data with varying scales so that no specific dimension dominates the statistics, and it does not require strong assumptions about the distribution of the data, which suits algorithms such as k-nearest neighbours and artificial neural networks. However, Normalisation does not handle outliers well: a single extreme value sets the minimum or maximum and squeezes the remaining values into a narrow part of the range. Standardisation is not bound to a fixed range, and because the mean and standard deviation are computed from all observations rather than only the two extremes, it is usually somewhat less sensitive to a single outlier (although it is not immune). It also helps some computational algorithms, such as gradient descent, converge. Therefore, we usually prefer standardisation over Min-Max Normalisation.
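To see how a single outlier affects each scaler, the sketch below takes column C from the example data in the Code section (where 1000 is the outlier) and measures how much room the four ordinary values occupy after scaling. The helper `spread` is a hypothetical name for this illustration, and median/interquartile scaling is included for comparison.

```python
import pandas as pd

# Column C from the example data in the Code section; 1000 is the outlier
c = pd.Series([20, 16, 1000, 3, 8], dtype=float)

min_max = (c - c.min()) / (c.max() - c.min())
z_score = (c - c.mean()) / c.std()  # pandas std uses ddof=1
robust = (c - c.median()) / (c.quantile(0.75) - c.quantile(0.25))

inliers = c < 1000  # everything except the outlier


def spread(s):
    """Range covered by the non-outlier values after scaling."""
    return s[inliers].max() - s[inliers].min()


print(spread(min_max), spread(z_score), spread(robust))
```

With these numbers, the four ordinary values end up within about 0.02 of each other under min-max scaling and about 0.04 under z-scoring, while median/interquartile scaling keeps them spread over more than one unit.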

For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question, and on whether the PCA computes the components via the correlation matrix instead of the covariance matrix).
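One way to make the covariance/correlation remark concrete: the covariance matrix of standardized features equals the correlation matrix of the raw features, so running PCA on standardized data amounts to using the correlation matrix. The NumPy sketch below checks this on made-up data; the feature scales 1 and 1000 and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
# Made-up data: two features on very different scales
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

# Standardize each column (ddof=1 to match np.cov's default normalization)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(Z, rowvar=False)
corr_of_raw = np.corrcoef(X, rowvar=False)
print(np.allclose(cov_of_standardized, corr_of_raw))  # True

# Without standardization, the first principal direction of the covariance
# matrix points almost entirely along the large-scale feature.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
first_pc_raw = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
print(np.abs(first_pc_raw))
```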

However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (e.g., 0 to 255 for the RGB color range). Also, many neural network algorithms expect input data on a 0-1 scale.
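For 8-bit images the minimum and maximum are known in advance, so Min-Max scaling to the 0-1 range reduces to dividing by 255. A minimal sketch with a made-up 2x2 grayscale image:

```python
import numpy as np

# Made-up 2x2 grayscale image stored as 8-bit integers (0 to 255)
image = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Convert to float first, then divide by the known maximum of 255
scaled = image.astype(np.float32) / 255.0

print(scaled.min(), scaled.max())  # 0.0 1.0
```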


## Conclusions
- Whether to normalize or standardize the data depends on the context of the problem we are working on and on the feature scale that the particular problem requires.
- If we want all the features to have values in the range [0,1], we go for normalization; if we want all the features centered on the mean and scaled to unit variance, we go for standardization.
- Standardization is usually a reasonable default, because many models give better results when they are trained on standardized data than on normalized data.

## Code

```python
import pandas as pd
import numpy as np


class FeatureScaling:
    def __init__(self, df):
        self.df = df

    def standardize(self):
        # Z-score: subtract the column mean and divide by the column standard
        # deviation (pandas uses the sample standard deviation, ddof=1).
        mean = self.df.mean(axis=0)
        stdValues = self.df.std(axis=0)
        self.df = (self.df - mean) / stdValues
        return {'df': self.df, 'mean': mean, 'std': stdValues}

    def normalize(self):
        # Min-max scaling: map each column onto the range [0, 1].
        minValues = self.df.min(axis=0)
        maxValues = self.df.max(axis=0)
        self.df = (self.df - minValues) / (maxValues - minValues)
        return {'df': self.df, 'min': minValues, 'max': maxValues}

    def ScaleToMedianAndQuartiles(self):
        # Robust scaling: subtract the median from every observation and divide
        # by the interquartile range. These statistics are robust to outliers.
        medianValues = self.df.median()
        scaleValues = self.df.quantile(0.75) - self.df.quantile(0.25)
        self.df = (self.df - medianValues) / scaleValues
        return {'df': self.df, 'median': medianValues, 'scale': scaleValues}

    # The methods below scale new inputs (for example, before passing them to a
    # trained model) with the statistics stored from the training data.
    def standardizeInputs(self, inputs, mean, std):
        inputs = np.asarray(inputs, dtype=float)
        return ((inputs - np.asarray(mean)) / np.asarray(std)).tolist()

    def normalizeInputs(self, inputs, metrics):
        inputs = np.asarray(inputs, dtype=float)
        minValues = np.asarray(metrics['min'])
        maxValues = np.asarray(metrics['max'])
        return ((inputs - minValues) / (maxValues - minValues)).tolist()

    def ScaleToMedianAndQuartilesInputs(self, inputs, metrics):
        inputs = np.asarray(inputs, dtype=float)
        medianValues = np.asarray(metrics['median'])
        scaleValues = np.asarray(metrics['scale'])
        return ((inputs - medianValues) / scaleValues).tolist()

    # Inverse transforms: undo the scaling with the stored statistics.
    def unstandardize(self, scaler):
        self.df = self.df * scaler["std"] + scaler["mean"]
        return self.df

    def unnormalize(self, scaler):
        self.df = self.df * (scaler["max"] - scaler["min"]) + scaler["min"]
        return self.df

    def unscaleMedianAndQuartiles(self, scaler):
        # The inverse of (x - median) / scale is x * scale + median.
        self.df = self.df * scaler["scale"] + scaler["median"]
        return self.df

    def read(self):
        return self.df
```

Standardization, followed by the inverse transform:

```python
dff = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                    "B": [70, 2, 54, 3, 2],
                    "C": [20, 16, 1000, 3, 8],
                    "D": [14, 3, 888, 188, 6],
                    "E": [900, 600, 10, 222, 8]})

print(dff)

sc = FeatureScaling(dff)
inp = [12, 70, 20, 14, 900]
scaler = sc.standardize()
print(scaler['df'])
print(sc.standardizeInputs(inp, scaler["mean"], scaler['std']))
print('******************************')
sc.unstandardize(scaler)
print(sc.read())
```

```
    A   B     C    D    E
0  12  70    20   14  900
1   7   2    16    3  600
2  11  54  1000  888   10
3   8   3     3  188  222
4   4   2     8    6    8
          A         B         C         D         E
0  1.121719  1.320500 -0.428498 -0.539264  1.408998
1 -0.436224 -0.729591 -0.437548 -0.568088  0.643238
2  0.810130  0.838125  1.788652  1.750906 -0.862756
3 -0.124635 -0.699443 -0.466959 -0.083327 -0.321619
4 -1.370989 -0.729591 -0.455647 -0.560227 -0.867861
[1.1217185151295603, 1.3204996492840566, -0.42849823341848875, -0.5392644052612866, 1.408997922734886]
******************************
      A     B       C      D      E
0  12.0  70.0    20.0   14.0  900.0
1   7.0   2.0    16.0    3.0  600.0
2  11.0  54.0  1000.0  888.0   10.0
3   8.0   3.0     3.0  188.0  222.0
4   4.0   2.0     8.0    6.0    8.0
```

Scaling by median and interquartile range, followed by the inverse transform (which multiplies by the scale and adds the median back, recovering the original data):

```python
dff = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                    "B": [70, 2, 54, 3, 2],
                    "C": [20, 16, 1000, 3, 8],
                    "D": [14, 3, 888, 188, 6],
                    "E": [900, 600, 10, 222, 8]})
print(dff)
sc = FeatureScaling(dff)
scaler = sc.ScaleToMedianAndQuartiles()
print(scaler['df'])
inputs = [12, 70, 20, 14, 900]
print('**************************************************************')
print(sc.ScaleToMedianAndQuartilesInputs(inputs, scaler))
print(sc.unscaleMedianAndQuartiles(scaler))
```

```
    A   B     C    D    E
0  12  70    20   14  900
1   7   2    16    3  600
2  11  54  1000  888   10
3   8   3     3  188  222
4   4   2     8    6    8
      A         B          C         D         E
0  1.00  1.288462   0.333333  0.000000  1.149153
1 -0.25 -0.019231   0.000000 -0.060440  0.640678
2  0.75  0.980769  82.000000  4.802198 -0.359322
3  0.00  0.000000  -1.083333  0.956044  0.000000
4 -1.00 -0.019231  -0.666667 -0.043956 -0.362712
**************************************************************
[1.0, 1.2884615384615385, 0.3333333333333333, 0.0, 1.1491525423728814]
      A     B       C      D      E
0  12.0  70.0    20.0   14.0  900.0
1   7.0   2.0    16.0    3.0  600.0
2  11.0  54.0  1000.0  888.0   10.0
3   8.0   3.0     3.0  188.0  222.0
4   4.0   2.0     8.0    6.0    8.0
```


Min-max normalization, followed by the inverse transform:

```python
dff = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                    "B": [70, 2, 54, 3, 2],
                    "C": [20, 16, 1000, 3, 8],
                    "D": [14, 3, 888, 188, 6],
                    "E": [900, 600, 10, 222, 8]})
print(dff)
sc = FeatureScaling(dff)
scaler = sc.normalize()
print(scaler['df'])
inputs = [12, 70, 20, 14, 900]
print(sc.normalizeInputs(inputs, scaler))
sc.unnormalize(scaler)
print(sc.read())
```

```
    A   B     C    D    E
0  12  70    20   14  900
1   7   2    16    3  600
2  11  54  1000  888   10
3   8   3     3  188  222
4   4   2     8    6    8
       A         B         C         D         E
0  1.000  1.000000  0.017051  0.012429  1.000000
1  0.375  0.000000  0.013039  0.000000  0.663677
2  0.875  0.764706  1.000000  1.000000  0.002242
3  0.500  0.014706  0.000000  0.209040  0.239910
4  0.000  0.000000  0.005015  0.003390  0.000000
[1.0, 1.0, 0.017051153460381142, 0.012429378531073447, 1.0]
      A     B       C      D      E
0  12.0  70.0    20.0   14.0  900.0
1   7.0   2.0    16.0    3.0  600.0
2  11.0  54.0  1000.0  888.0   10.0
3   8.0   3.0     3.0  188.0  222.0
4   4.0   2.0     8.0    6.0    8.0
```
