## Feature Scaling

<img src='images/scaling.jpg'/>

### Normalization

Rescaling each feature's values from their original range into the range 0 to 1.

<img src='images/minmax.png'/>
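A minimal sketch of min-max normalization, assuming the usual formula x' = (x - min) / (max - min). The `ages` array holds the non-missing Age values from the dataset loaded below, and the helper name `min_max_scale` is made up for this illustration.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array linearly so its minimum maps to 0 and its maximum to 1."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant feature has no spread to rescale; map it to zeros.
        return np.zeros_like(x)
    return (x - x.min()) / span

# Non-missing Age values from Data.csv
ages = np.array([44, 27, 30, 38, 40, 35, 48, 50, 37])
ages_scaled = min_max_scale(ages)
print(ages_scaled)
```

The youngest age (27) maps to 0 and the oldest (50) maps to 1; every other value lands in between.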

### Standardization

Rescaling each feature so that it has zero mean and unit variance.

<img src='images/standard.jpg'/>
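A minimal sketch of standardization, assuming the usual z-score formula z = (x - mean) / std. The `salaries` array holds the non-missing Salary values from the dataset loaded below, and the helper name `standardize` is made up for this illustration.

```python
import numpy as np

def standardize(x):
    """Shift a 1-D array to zero mean and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0).
    return (x - x.mean()) / x.std()

# Non-missing Salary values from Data.csv
salaries = np.array([72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000, 67000])
salaries_scaled = standardize(salaries)
print(salaries_scaled.mean(), salaries_scaled.std())
```

Unlike min-max normalization, the result is not confined to a fixed interval; values are expressed as a number of standard deviations from the mean.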

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('dataset/Data.csv')

In [3]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [4]:
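# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
# Replace each missing value with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')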

In [5]:
# Features: every column except the last one (Country, Age, Salary)
X = df.iloc[:,:-1].values
# Target: the Purchased column
y = df.iloc[:,3].values

#df.fillna()
#df.replace()
#df.dropna()

In [6]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [7]:
X[:,1:3] = imp.fit_transform(X[:,1:3])

In [None]:
#np.around(X[:,1:3],decimals=2)


In [8]:
X1 = pd.DataFrame(X,columns=df.columns[:-1])

In [9]:
X1

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.8
5,France,35.0,58000.0
6,Spain,38.7778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [None]:
#X1.iloc[:,1:3] = X1.iloc[:,1:3].astype(float).round(2)
#X1.dtypes
#X1.Salary.astype(np.float64)

- Age ranges from 27 to 50.

- Salary ranges from 48000 to 83000.

The two features are not on the same scale. This causes problems for machine learning
models that are based on **Euclidean distance**, because the feature with the larger range (Salary) dominates the distance.

<img src="images/ed.PNG"/>
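To make the scale problem concrete, here is a small sketch that computes the Euclidean distance between the first two rows of the dataset using only Age and Salary. The variable names are illustrative.

```python
import numpy as np

# (Age, Salary) for the first two rows of the dataset
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))

# Fraction of the squared distance contributed by each feature
share = diff ** 2 / np.sum(diff ** 2)

print(distance)  # very close to the salary gap of 24000
print(share)     # almost all of it comes from Salary
```

The 17-year age gap barely changes the result: without scaling, a distance-based model would effectively ignore Age.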

In [12]:
X[:,0]

array([0L, 2L, 1L, 2L, 1L, 0L, 2L, 0L, 1L, 0L], dtype=object)

In [10]:
#Encoding categorical data
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
label = LabelEncoder()
X[:,0] = label.fit_transform(X[:,0])
X

array([[0L, 44.0, 72000.0],
       [2L, 27.0, 48000.0],
       [1L, 30.0, 54000.0],
       [2L, 38.0, 61000.0],
       [1L, 40.0, 63777.77777777778],
       [0L, 35.0, 58000.0],
       [2L, 38.77777777777778, 52000.0],
       [0L, 48.0, 79000.0],
       [1L, 50.0, 83000.0],
       [0L, 37.0, 67000.0]], dtype=object)

In [13]:
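# `categorical_features` was removed in scikit-learn 0.22; ColumnTransformer
# one-hot encodes column 0 and passes Age and Salary through unchanged.
from sklearn.compose import ColumnTransformer
one_hot_encoder = ColumnTransformer([('country', OneHotEncoder(), [0])],
                                    remainder='passthrough', sparse_threshold=0)

In [14]:
# sparse_threshold=0 already returns a dense array, so .toarray() is not needed
X = one_hot_encoder.fit_transform(X).astype(float)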

In [15]:
X_new=X.astype(int)
X_new

array([[    1,     0,     0,    44, 72000],
       [    0,     0,     1,    27, 48000],
       [    0,     1,     0,    30, 54000],
       [    0,     0,     1,    38, 61000],
       [    0,     1,     0,    40, 63777],
       [    1,     0,     0,    35, 58000],
       [    0,     0,     1,    38, 52000],
       [    1,     0,     0,    48, 79000],
       [    0,     1,     0,    50, 83000],
       [    1,     0,     0,    37, 67000]])

In [16]:
# Use a separate encoder for the target so the feature encoder is not refit
label_y = LabelEncoder()
y = label_y.fit_transform(y)

In [17]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=0)

In [20]:
#feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

In [21]:
X_train = sc_X.fit_transform(X_train)
X_test  = sc_X.transform(X_test)
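The cell above fits the scaler on the training split and only transforms the test split, so the test rows are scaled with the training mean and standard deviation. A minimal NumPy sketch of the same idea, using a few hypothetical (Age, Salary) rows in place of the real split:

```python
import numpy as np

# Hypothetical (Age, Salary) rows standing in for a train/test split
train = np.array([[44.0, 72000.0],
                  [27.0, 48000.0],
                  [30.0, 54000.0],
                  [38.0, 61000.0]])
test = np.array([[35.0, 58000.0],
                 [48.0, 79000.0]])

# Learn the statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately zero per column
print(test_scaled.mean(axis=0))   # generally not zero
```

The scaled test columns do not have exactly zero mean, and that is expected: the test data must not influence the statistics used to preprocess it.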

In [None]:
X_train


In [None]:
X_test

In [None]:
# Should we apply feature scaling to the labels/targets?

In [None]:
# No. This is a classification problem, so the target is a categorical label
# and does not need scaling.

# In a regression problem, however, scaling the dependent variable can help
# when it spans a very large range of values.
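For the regression case, a rough sketch of scaling a target and mapping predictions back to the original units. The `y` values are hypothetical, and the "predictions" are a stand-in rather than the output of a real model; scikit-learn's `StandardScaler.inverse_transform` performs the same reverse step.

```python
import numpy as np

# Hypothetical regression target with a wide range of values
y = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

# Standardize the target using its own mean and standard deviation
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std

# A model trained on y_scaled predicts on the scaled axis;
# here the scaled targets stand in for those predictions.
pred_scaled = y_scaled.copy()

# Undo the scaling to report predictions in the original units
pred = pred_scaled * y_std + y_mean
print(pred)
```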

In [None]:
#d=pd.DataFrame(X_train)
#d.describe()
#X_train.mean(axis=0)