## Feature Scaling - Part 1

YouTube explanation of feature scaling, standardization, and normalization: https://youtu.be/opbSccCw10E


Suppose we have two features, weight (gm) and price (Rs), as in the dataset below. “Weight” cannot be meaningfully compared with “Price” because the two are measured in different units. A learning algorithm sees only the numbers, though, so because the “Weight” values are larger than the “Price” values, it effectively assumes that “Weight” is more important than “Price.”

        Fruit      = "Orange","Apple","Banana","Mango"
        weight(gm) = 100,150,170,200
        Price(Rs)  = 1,2,4,5
        
So the larger numbers start playing a more decisive role while training the model. Feature scaling is therefore needed to put every feature on the same footing, without any upfront importance. Interestingly, if we convert the weight to kg, the weight values become smaller than the prices and “Price” becomes dominant.
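To make the effect of units concrete, the sketch below compares “Orange” and “Mango” from the dataset above, once with weight in grams and once in kilograms, and reports how much of the squared Euclidean distance between them comes from the weight feature. The helper name `weight_share` is hypothetical, and Euclidean distance is only an assumed stand-in for any distance-based algorithm.

```python
# Two fruits from the dataset above, as (weight in gm, price in Rs)
orange_gm, mango_gm = (100, 1), (200, 5)

def weight_share(a, b):
    """Fraction of the squared Euclidean distance between a and b
    that is contributed by the first feature (weight)."""
    d_weight = (a[0] - b[0]) ** 2
    d_price = (a[1] - b[1]) ** 2
    return d_weight / (d_weight + d_price)

share_gm = weight_share(orange_gm, mango_gm)

# Same fruits with weight converted from gm to kg
orange_kg = (orange_gm[0] / 1000, orange_gm[1])
mango_kg = (mango_gm[0] / 1000, mango_gm[1])
share_kg = weight_share(orange_kg, mango_kg)

print(f"weight share of squared distance (gm): {share_gm:.4f}")
print(f"weight share of squared distance (kg): {share_kg:.4f}")
```

With weight in grams, almost all of the distance comes from the weight; in kilograms, almost none of it does. The fruits are the same in both cases, and only the unit has changed.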

- Feature scaling is an important pre-processing step that standardizes or normalizes the input data. When the ranges of values differ widely from column to column, we scale them to a common level before applying a machine learning algorithm to the data.

## Different Feature Scaling Techniques

We can use different techniques to scale the input dataset. We can apply either of the following:

- Standardization
- Normalization

### What is Normalization?

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging **between 0 and 1**. It is also known as Min-Max scaling.

Formula for normalization:
    

$$
X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

In the above equation:

- $X_{max}$ and $X_{min}$ are the maximum and minimum values of the feature column.
- When $X$ is the minimum value in the column, the numerator is 0, and hence $X_{norm}$ is 0.
- When $X$ is the maximum value in the column, the numerator is equal to the denominator, and thus $X_{norm}$ is 1.
- If $X$ is between the minimum and the maximum value, then $X_{norm}$ is between 0 and 1.

```python
import numpy as np

# Min-Max scaling using plain Python
x = [1, 3, 7, 6, 5, 2, 3]
minmax = [(x_i - min(x)) / (max(x) - min(x)) for x_i in x]
print("min_max_using_python :", minmax)

# Min-Max scaling using numpy (vectorized over the whole array)
x_np = np.array(x)
np_minmax = (x_np - x_np.min()) / (x_np.max() - x_np.min())
print("min_max_using_numpy :", np_minmax)
```

```
min_max_using_python : [0.0, 0.3333333333333333, 1.0, 0.8333333333333334, 0.6666666666666666, 0.16666666666666666, 0.3333333333333333]
min_max_using_numpy : [0.         0.33333333 1.         0.83333333 0.66666667 0.16666667
 0.33333333]
```
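The same formula can be applied column by column to the fruit dataset from the introduction. The following is a minimal sketch in plain Python. The function name `min_max_scale` is an assumption for illustration, and it assumes each column has at least two distinct values so that the denominator is not zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1].
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Fruit dataset: Orange, Apple, Banana, Mango
weight_gm = [100, 150, 170, 200]
price_rs = [1, 2, 4, 5]

# Each feature column is scaled independently
weight_scaled = min_max_scale(weight_gm)
price_scaled = min_max_scale(price_rs)

print("weight scaled:", weight_scaled)
print("price scaled :", price_scaled)
```

After scaling, both columns run from 0 (the smallest fruit or price) to 1 (the largest), so neither feature dominates merely because of its unit.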


### What is Standardization?

Standardization is a scaling technique that re-scales the values of a feature so that they have a mean of 0 and a variance of 1. The scaled values are also known as z-scores.

Formula for Standardization:
    

$$
X_{stand} = \frac{X - \mu}{\sigma}
$$

- $\mu$ is the mean of the feature values and $\sigma$ is the standard deviation of the feature values.
- Note that in this case, the values are not restricted to a particular range.

```python
import numpy as np

# Standardization using plain Python
x = [1, 3, 7, 6, 5, 2, 3]
mean = sum(x) / len(x)
# Population standard deviation (divide by n), which matches numpy's default
std_dev = (sum((x_i - mean) ** 2 for x_i in x) / len(x)) ** 0.5

z_scores = [(x_i - mean) / std_dev for x_i in x]
print("z_scores using python :", z_scores)

# Standardization using numpy
x_np = np.array(x)
z_scores_np = (x_np - x_np.mean()) / x_np.std()
print("z_scores using numpy :", z_scores_np)
```

```
z_scores using python : [-1.4071950894605836, -0.4221585268381751, 1.5479145984066418, 1.0553963170954377, 0.5628780357842333, -0.9146768081493794, -0.4221585268381751]
z_scores using numpy : [-1.40719509 -0.42215853  1.5479146   1.05539632  0.56287804 -0.91467681
 -0.42215853]
```


```python
# Sanity check: the standardized values should have a standard deviation of 1
round(z_scores_np.std(), 2)
```

```
1.0
```
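As a further illustration, the fruit dataset from the introduction can be standardized column by column. The sketch below uses plain Python. The function name `standardize` is hypothetical, and it uses the population standard deviation (dividing by n), which is what numpy's `std()` computes by default.

```python
def standardize(values):
    """Return z-scores using the population standard deviation
    (divide by n). Assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    std_dev = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std_dev for v in values]

# Fruit dataset: Orange, Apple, Banana, Mango
weight_gm = [100, 150, 170, 200]
price_rs = [1, 2, 4, 5]

weight_z = standardize(weight_gm)
price_z = standardize(price_rs)

# Verify that each standardized column has mean ~0 and std ~1
for name, z in (("weight", weight_z), ("price", price_z)):
    z_mean = sum(z) / len(z)
    z_std = (sum((v - z_mean) ** 2 for v in z) / len(z)) ** 0.5
    print(name, [round(v, 2) for v in z],
          "| mean ~ 0:", abs(z_mean) < 1e-9,
          "| std ~ 1:", abs(z_std - 1) < 1e-9)
```

Unlike min-max scaling, the standardized values include negative numbers and values above 1. Both columns end up centered at 0 with unit spread, but they are not confined to a fixed range.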