# Feature Scaling
- Do not apply feature scaling to dummy variables (the 0/1 columns used to add categorical data to a regression model).

- Standardisation rescales each feature to mean 0 and standard deviation 1, so most values fall roughly between -3 and +3.<br>
    <b>x(stand) = (x - mean(x)) / standard deviation(x)</b>

- Normalisation (min-max scaling) rescales each feature to the range 0 to 1.<br>
    <b>x(norm) = (x - min(x)) / (max(x) - min(x))</b>
- Apply feature scaling to numerical features whose ranges differ widely from each other.

- sc.fit_transform() computes the mean and standard deviation of each feature and then applies the scaling formula (use this for train_x).

- sc.transform() applies the standardisation formula using the mean and standard deviation previously computed by fit_transform() (use this for test_x).
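
The two formulas above can be checked by hand with NumPy. This is a minimal sketch: the values are the Age column of the dataset used later in this notebook (rows without missing values), and the names `age`, `age_stand` and `age_norm` are chosen here for illustration.

```python
import numpy as np

# Age values from the salary dataset (rows without missing values)
age = np.array([44.0, 27.0, 30.0, 38.0, 35.0, 48.0, 50.0, 37.0])

# Standardisation: subtract the mean, then divide by the standard deviation
age_stand = (age - age.mean()) / age.std()

# Normalisation (min-max): subtract the minimum, then divide by the range
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand.round(2))
print(age_norm.round(2))
```

`age.std()` uses the population standard deviation (divides by n), which matches what `StandardScaler` does. After standardisation the column has mean 0 and standard deviation 1; after normalisation the smallest value maps to 0 and the largest to 1.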

<br>
<b>Important!</b>
Normalisation is recommended when most features follow a normal distribution, whereas standardisation is applicable in all cases.

In [34]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

## Loading Dataset

In [3]:
dataset = pd.read_csv("../../../Datasets/salary_information.csv", sep=",")
dataset.columns = ['Country', 'Age', 'Salary', 'Purchased']
dataset

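|   | Country | Age | Salary | Purchased |
|---|---------|-----|--------|-----------|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |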


## Handling Missing Values

In [6]:
dataset = dataset.dropna()
dataset

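|   | Country | Age | Salary | Purchased |
|---|---------|-----|--------|-----------|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 5 | France | 35.0 | 58000.0 | Yes |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |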
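
Before dropping rows it can help to see how many values are missing and how many rows will be lost. The sketch below uses a small stand-in frame built from rows 0, 4 and 6 of the dataset; the names `df`, `missing_per_column` and `df_clean` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same gaps as rows 0, 4 and 6 of the dataset
df = pd.DataFrame({
    "Country": ["France", "Germany", "Spain"],
    "Age": [44.0, 40.0, np.nan],
    "Salary": [72000.0, np.nan, 52000.0],
})

missing_per_column = df.isna().sum()   # number of NaN values in each column
rows_before = len(df)
df_clean = df.dropna()                 # drops every row that has at least one NaN
rows_after = len(df_clean)

print(missing_per_column)
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```

`dropna()` keeps the original index labels, which is why the cleaned dataset above skips the labels 4 and 6.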


## Separating target column from dataset

In [15]:
target = dataset['Purchased']
dataset = dataset.drop('Purchased', axis=1)

## Encoding Categorical Data

### Dataset
- [0] specifies the indices of the columns to transform (here column 0, Country); remainder='passthrough' keeps the other columns unchanged

In [32]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
dataset_encoded = np.array(ct.fit_transform(dataset))
dataset_encoded

array([[1.0e+00, 0.0e+00, 0.0e+00, 4.4e+01, 7.2e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.7e+01, 4.8e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 6.1e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.5e+01, 5.8e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.8e+01, 7.9e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.7e+01, 6.7e+04]])
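
To read the encoded array it helps to know which dummy column belongs to which country. The sketch below repeats the encoding on three rows of the dataset and inspects the fitted encoder. The frame `df` and the variable names are illustrative, and `sparse_threshold=0` is added here only to make sure a dense array comes back.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as the dataset above
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
})

# sparse_threshold=0 asks ColumnTransformer to always return a dense array
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(df)

# Categories are sorted alphabetically, which fixes the order of the dummy columns
categories = list(ct.named_transformers_["encoder"].categories_[0])
print(categories)
print(encoded)
```

The first three output columns therefore stand for France, Germany and Spain, in that order, followed by Age and Salary.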

### Target Column

In [38]:
le = LabelEncoder()
target_encoded = le.fit_transform(target)
target_encoded

array([0, 1, 0, 0, 1, 1, 0, 1])
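
The mapping between labels and integer codes can be read back from the fitted encoder. A small sketch, using the same Purchased values as the cleaned dataset; the variable `decoded` is introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Purchased column after the rows with missing data were dropped
target = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"]

le = LabelEncoder()
target_encoded = le.fit_transform(target)

# classes_ lists the labels in sorted order; a label's position is its integer code
print(list(le.classes_))

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(target_encoded)
print(list(decoded))
```

Here "No" becomes 0 and "Yes" becomes 1 because the classes are sorted alphabetically.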

## Train Test Split

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
x = dataset_encoded
y = target_encoded

In [41]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
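
# A quick way to confirm the split sizes is to print the shapes. The sketch below
# uses placeholder arrays (x_demo, y_demo are illustrative names) with the same
# shape as the encoded data: 8 rows and 5 columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the encoded data
x_demo = np.arange(40, dtype=float).reshape(8, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

# test_size=0.2 of 8 rows is 1.6, which scikit-learn rounds up to 2 test rows
print(x_tr.shape, x_te.shape)
print(y_tr.shape, y_te.shape)
```

# With 8 rows this gives 6 training rows and 2 test rows, and a fixed random_state
# makes the split reproducible between runs.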

## Apply Feature Scaling
- fit_transform() computes the mean and standard deviation of each column and scales the data
- transform() scales the data using the mean and standard deviation previously computed by fit_transform()
- To scale training data, use fit_transform()
- To scale test data, use transform(), so that the test data is scaled with the mean and standard deviation of the training data
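
The difference between the two calls can be made visible by inspecting the statistics the scaler stores. This is a minimal sketch on a few Age/Salary rows from the dataset; the split into `train` and `test` is arbitrary here, and `manual` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few Age/Salary rows; the train/test split is arbitrary for this sketch
train = np.array([[27.0, 48000.0], [50.0, 83000.0], [44.0, 72000.0], [35.0, 58000.0]])
test = np.array([[38.0, 61000.0], [30.0, 54000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean_ and scale_ from train, then scales it
test_scaled = sc.transform(test)         # reuses the training statistics, learns nothing new

# transform() is equivalent to applying the formula with the stored training statistics
manual = (test - sc.mean_) / sc.scale_
print(np.allclose(test_scaled, manual))
print(sc.mean_, sc.scale_)
```

Calling `fit_transform()` on the test data instead would recompute the mean and standard deviation from the test rows, so the two sets would no longer be on the same scale.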

In [44]:
from sklearn.preprocessing import StandardScaler

In [46]:
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_train

array([[ 0.        ,  0.        ,  1.        , -1.68132541, -1.53532042],
       [ 0.        ,  1.        ,  0.        ,  1.21896092,  1.31792992],
       [ 1.        ,  0.        ,  0.        ,  0.46236449,  0.4211941 ],
       [ 1.        ,  0.        ,  0.        , -0.67253016, -0.72010604],
       [ 0.        ,  0.        ,  1.        , -0.29423195, -0.47554172],
       [ 1.        ,  0.        ,  0.        ,  0.96676211,  0.99184416]])

In [47]:
x_test[:, 3:] = sc.transform(x_test[:, 3:])
x_test

array([[ 1.        ,  0.        ,  0.        , -0.42033135,  0.01358691],
       [ 0.        ,  1.        ,  0.        , -1.30302719, -1.04619179]])
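
As an alternative to slicing with `[:, 3:]`, the same "scale only the numeric columns" rule can be expressed with a `ColumnTransformer`. This is a sketch of one possible design, not the approach used in the cells above; the arrays and the names `scaler`, `x_tr` and `x_te` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative encoded rows: three dummy columns followed by Age and Salary
x_tr = np.array([
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
    [1.0, 0.0, 0.0, 44.0, 72000.0],
])
x_te = np.array([[1.0, 0.0, 0.0, 35.0, 58000.0]])

# Scale columns 3 and 4 only; remainder="passthrough" leaves the dummy columns as they are
scaler = ColumnTransformer(
    transformers=[("scale", StandardScaler(), [3, 4])],
    remainder="passthrough",
)
x_tr_scaled = scaler.fit_transform(x_tr)
x_te_scaled = scaler.transform(x_te)

# Transformed columns come first in the output, so the dummy columns move to the end
print(x_tr_scaled)
print(x_te_scaled)
```

One thing to watch with this variant is the column order: the scaled Age and Salary columns appear first and the untouched dummy columns follow them.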