## Why do we need Feature Scaling?

#### Recap

In [21]:

#Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing the libraries
import numpy as np
import pandas as pd

np.set_printoptions(precision=4)
np.set_printoptions(suppress=True) #Otherwise prints in scientific format

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data: replace NaN in Age and Salary with the column mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# Encoding the Independent Variable: one-hot encode column 0 (Country), pass Age and Salary through
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = np.array(ct.fit_transform(X), dtype = float)
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size = 0.2, random_state = 0)

dataset

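|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |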


### Notice:
Age and salary are not on the same scale: age ranges from 27 to 50, while salary ranges from 48K to 83K.

Many ML models (for example k-nearest neighbours, k-means and SVMs with an RBF kernel) are based on Euclidean distance, so the variables need to be on the same scale. Otherwise the Euclidean distance will be dominated by salary.
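To make the point about salary dominating the distance concrete, here is a small sketch using the Age and Salary values of rows 0 and 1 from the table above. The variable names (`france`, `spain`, `share`) are chosen only for this illustration.

```python
import numpy as np

# Two rows from the dataset above, as (Age, Salary)
france = np.array([44.0, 72000.0])
spain = np.array([27.0, 48000.0])

diff = france - spain               # per-feature differences: 17 years, 24000 in salary
squared = diff ** 2                 # each feature's contribution to the squared distance
distance = np.sqrt(squared.sum())   # Euclidean distance between the two rows

# Fraction of the squared distance contributed by each feature
share = squared / squared.sum()

print("distance:", distance)
print("age share:", share[0])
print("salary share:", share[1])
```

The distance comes out at roughly 24000, i.e. almost exactly the salary difference; the 17-year age gap has practically no influence on it.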

#### Two types of feature scaling
    
    Standardization:
        X_std = (X - mean(X)) / std(X)
        
    Normalization:
        X_norm = (X - min(X)) / (max(X) - min(X))
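Both formulas can be tried directly with NumPy. The sketch below applies them to the Age column from the table above (with the missing value left out); the helper names `standardize` and `normalize` are made up for this example. `np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

# Age column from the dataset above, with the missing entry left out
age = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0])

def standardize(x):
    """Standardization: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()

def normalize(x):
    """Normalization (min-max scaling): map the values onto the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

age_std = standardize(age)
age_norm = normalize(age)

print("standardized:", age_std)   # mean 0, standard deviation 1
print("normalized:  ", age_norm)  # smallest value 0, largest value 1
```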

In [13]:
from sklearn.preprocessing import StandardScaler
help(StandardScaler)

Help on class StandardScaler in module sklearn.preprocessing.data:

class StandardScaler(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  Standardize features by removing the mean and scaling to unit variance
 |  
 |  Centering and scaling happen independently on each feature by computing
 |  the relevant statistics on the samples in the training set. Mean and
 |  standard deviation are then stored to be used on later data using the
 |  `transform` method.
 |  
 |  Standardization of a dataset is a common requirement for many
 |  machine learning estimators: they might behave badly if the
 |  individual feature do not more or less look like standard normally
 |  distributed data (e.g. Gaussian with 0 mean and unit variance).
 |  
 |  For instance many elements used in the objective function of
 |  a learning algorithm (such as the RBF kernel of Support Vector
 |  Machines or the L1 and L2 regularizers of linear models) assume that
 |  all features are centered around 0 an

In [17]:
sc_X = StandardScaler()
X_tr = sc_X.fit_transform(X_tr)
X_tr

array([[-1.    ,  2.6458, -0.7746,  0.2631,  0.1238],
       [ 1.    , -0.378 , -0.7746, -0.2535,  0.4618],
       [-1.    , -0.378 ,  1.291 , -1.9754, -1.5309],
       [-1.    , -0.378 ,  1.291 ,  0.0526, -1.1114],
       [ 1.    , -0.378 , -0.7746,  1.6406,  1.7203],
       [-1.    , -0.378 ,  1.291 , -0.0813, -0.1675],
       [ 1.    , -0.378 , -0.7746,  0.9518,  0.9861],
       [ 1.    , -0.378 , -0.7746, -0.5979, -0.4821]])

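#### Do not fit the scaler on the test data! Otherwise the scaling statistics would be computed from very little data (here, only 2 samples)

    We want the training and test data to be scaled identically, so the test set is transformed with the mean and standard deviation learned from the training set.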

In [19]:
X_te = sc_X.transform(X_te)
X_te

array([[ 0.    , -1.    ,  0.    , -4.1   , -4.7242],
       [ 0.    , -1.    ,  0.    , -3.9   , -4.7241]])
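The following sketch shows why refitting on a tiny test set is a problem. The salary values and the 8/2 split are hypothetical, chosen only for illustration; they are not the split produced by `train_test_split` above.

```python
import numpy as np

# Hypothetical salaries: 8 training samples and 2 test samples
salary_train = np.array([72000.0, 48000.0, 54000.0, 61000.0,
                         58000.0, 52000.0, 79000.0, 83000.0])
salary_test = np.array([67000.0, 63000.0])

# Statistics learned from the training data only
train_mean = salary_train.mean()
train_std = salary_train.std()

# Recommended: scale the test data with the training statistics
test_scaled = (salary_test - train_mean) / train_std

# Not recommended: compute the statistics from the test data itself
test_refit = (salary_test - salary_test.mean()) / salary_test.std()

print("using train statistics:", test_scaled)
print("refit on test data:    ", test_refit)
```

With only two distinct test values, refitting always maps them to +1 and -1, whatever the actual salaries are, so the information about where they lie relative to the training data is lost.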

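#### Do we have to scale dummy variables?

    It is not really needed, since dummy variables only take the values 0 and 1. It does not hurt either, although the scaled columns are no longer plain 0/1 indicators.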
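If you prefer to keep the dummy columns as plain 0/1 indicators, one option is to standardize only the numeric columns. The sketch below assumes the column layout used in this notebook (three Country dummies followed by Age and Salary); the small `X_train` array and the `numeric_cols` list are hypothetical and exist only for this example.

```python
import numpy as np

# Hypothetical encoded rows: dummies for France, Germany, Spain, then Age and Salary
X_train = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])
numeric_cols = [3, 4]  # positions of Age and Salary

# Per-column statistics of the numeric features only
mean = X_train[:, numeric_cols].mean(axis=0)
std = X_train[:, numeric_cols].std(axis=0)

# Standardize the numeric columns and leave the dummy columns untouched
X_scaled = X_train.copy()
X_scaled[:, numeric_cols] = (X_train[:, numeric_cols] - mean) / std

print(X_scaled)
```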