*Day 9*
# Scaling & Encoding
--- 

# **Scaling**

Scaling is a technique used to standardize the range of independent variables or features of data. In data processing, it is also known as normalization and is generally performed during the data preprocessing step.

Scaling changes the scale of the data, but not the shape of its distribution.
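As a small illustration of that point, the sketch below scales a column and compares its skewness before and after. The values are hypothetical (not taken from the tips dataset), and skewness is used here only as one convenient summary of distribution shape.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical right-skewed bills; illustrative values only
values = pd.DataFrame({'total_bill': [10.0, 12.0, 13.0, 15.0, 18.0, 50.0]})

# Rescale to the range 0-1
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(values),
    columns=values.columns,
)

# Skewness describes the shape of the distribution
skew_before = values['total_bill'].skew()
skew_after = scaled['total_bill'].skew()

print(skew_before, skew_after)
```

The range changes to 0-1, while the skewness should come out the same up to floating-point error, because min-max scaling is a linear rescaling.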

Models that benefit from scaling: KNN, Logistic Regression, Clustering, SVM, Neural Network, PCA, LDA, ...

## Types of Scaling

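1. MinMax Scaler: rescales each feature to the range 0-1 using its minimum and maximum
2. Standard Scaler: centers each feature to mean = 0 and rescales it to standard deviation = 1
3. Robust Scaler: centers on the median and rescales by the interquartile range (Q3 - Q1), so it is **robust to outliers**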

Example: apply scaling to `tip` and `total_bill`, as they are on different scales (`tip` values are smaller).
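To make the difference between the three scalers concrete, here is a minimal sketch on a hypothetical column with one outlier (the numbers are made up for illustration and are not from the tips dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical bills with one outlier (100.0)
bills = np.array([[10.0], [20.0], [30.0], [40.0], [100.0]])

minmax_scaled = MinMaxScaler().fit_transform(bills)      # (x - min) / (max - min)
standard_scaled = StandardScaler().fit_transform(bills)  # (x - mean) / std
robust_scaled = RobustScaler().fit_transform(bills)      # (x - median) / (Q3 - Q1)

print(minmax_scaled.ravel())
print(standard_scaled.ravel())
print(robust_scaled.ravel())
```

With MinMax, the outlier becomes 1 and squeezes the other values toward 0. With Robust, the median and quartiles barely react to the outlier, so the non-outlier values stay evenly spaced around 0.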

## **Encoding** 

Types of Encoding:
1. One Hot Encoding: same as dummy variables; creates one column per category (best for features with fewer than 5 categories)
2. Ordinal Encoding: converts ordered categories to integers that preserve the order (e.g. low, medium, high)
3. Binary Encoding: converts each category to an integer code written in binary digits

| Data type | OHE | Ordinal | Binary |
| --- | --- | --- | --- |
| Nominal | Yes | No | Yes |
| Ordinal | Yes | Yes | No |
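The table can be illustrated with a small sketch: an ordered feature goes through an ordinal encoder with an explicit category order, and an unordered feature goes through one-hot encoding. The column names `priority` and `color` and their values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: 'priority' is ordinal, 'color' is nominal
df = pd.DataFrame({
    'priority': ['low', 'high', 'medium', 'low'],
    'color': ['red', 'blue', 'red', 'green'],
})

# Ordinal: pass the order explicitly so low < medium < high
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
priority_codes = ordinal.fit_transform(df[['priority']])

# Nominal: one column per category, no order implied
onehot = OneHotEncoder()
color_onehot = onehot.fit_transform(df[['color']]).toarray()

print(priority_codes.ravel())
print(onehot.categories_[0])
print(color_onehot)
```

Using an ordinal encoder on a nominal feature would impose an order that does not exist in the data, which is why the table marks that combination as "No".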

## Binary Encoding

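Converts each category to an integer code, then writes that code in binary digits with one column per bit. It needs fewer columns than One Hot Encoding, but the result is less interpretable.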
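The idea can be sketched by hand with pandas. This is an illustrative re-implementation, not the code used by `category_encoders`; the integer codes (assigned in order of first appearance, starting at 1) and the `day_0`, `day_1`, ... column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample of the 'day' feature
days = pd.Series(['Thur', 'Fri', 'Sat', 'Sun', 'Sat'], name='day')

# Step 1: give each category an integer code, starting at 1
categories = list(pd.unique(days))
codes = days.map({cat: i + 1 for i, cat in enumerate(categories)})

# Step 2: find how many bits are needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: write each code in binary, one column per bit (most significant first)
binary_encoded = pd.DataFrame({
    f'day_{b}': (codes // 2 ** (n_bits - 1 - b)) % 2
    for b in range(n_bits)
})

print(binary_encoded)
```

Four categories fit in three bit columns here, whereas One Hot Encoding would need four columns; the saving grows quickly as the number of categories increases.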

## **Fit & Transform**

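- **.fit** learns the parameters (e.g. the minimum and maximum for a MinMax Scaler). It is applied only to the training set to avoid data leakage, which would give an over-optimistic score on the test set. Scalers are applied only to numerical data.
- **.transform** applies the learned parameters to both the training and the test set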

| Method | Description | Training | Test |
| --- | --- | --- | --- |
| .fit | Construct the parameters | Yes | No |
| .transform | Apply the parameters | Yes | Yes |

In [22]:
# .fit constructs the scaling parameters from the training set only
# scaler = MinMaxScaler()
# scaler.fit(X_train)

# .transform applies those parameters to both the training and the test set
# x_train_scaled = scaler.transform(X_train)
# x_test_scaled = scaler.transform(X_test)
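
A runnable version of the same pattern, using tiny hypothetical arrays in place of a real split, shows that the parameters come from the training set only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values for a single feature
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[40.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)                        # learns min=10 and max=30 from train only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the train min and max

print(scaler.data_min_, scaler.data_max_)
print(X_test_scaled)
```

The test value lands outside the 0-1 range because it is larger than anything seen during `.fit`. That is expected: the test set must not influence the scaling parameters.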

In [23]:
## EDA Standard Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats as ss

In [24]:
#ML Library

#ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
#ML TrainTest Split
from sklearn.model_selection import train_test_split
#ML Report
from sklearn.metrics import  accuracy_score

In [25]:
from warnings import filterwarnings
filterwarnings('ignore')

In [26]:
tips = sns.load_dataset('tips')

In [27]:
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


# Scaling

In [28]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [29]:
#Import the three scalers: MinMax, Standard and Robust
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler


In [30]:
minmax = MinMaxScaler()
standard = StandardScaler()
robust = RobustScaler()

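# Applying PreProcess scheme:
1. Splitting (`smoker` is the target, so it is separated from the features)
2. One Hot Encoding: sex, time
3. Binary Encoding: day
4. Scaling: total_bill
5. No treatment: size (remaining columns are passed through unchanged)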

In [31]:
#Train test split
x = tips.drop('smoker', axis=1)  # Features: all columns except the target
y = tips['smoker']               # Target: the smoker column

xtrain, xtest, ytrain, ytest = train_test_split(
    x,
    y,
    test_size=0.2,     # Hold out 20% of the data as the test set
    random_state=20,   # Random seed for reproducibility
    stratify=y         # Keep the same class proportions of y in train and test
)
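
The effect of `stratify` can be checked on a hypothetical target with a 60/40 class balance (synthetic data, not the tips dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target: 60 'No' and 40 'Yes'
y = pd.Series(['No'] * 60 + ['Yes'] * 40)
X = pd.DataFrame({'feature': range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

# Class proportions in each split
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)

print(train_share)
print(test_share)
```

Both splits should keep the 60/40 balance of the full target, which matters most when one class is rare.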

In [32]:
#Combine transformers
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

import category_encoders as ce

#Build the preprocessing transformer
#'smoker' is the target and was dropped from x, so it is not listed here
transformer = ColumnTransformer([
                                ('OHE', OneHotEncoder(drop='first'), ['sex', 'time']),
                                ('Binary', ce.BinaryEncoder(), ['day']),
                                ('Robust Scaler', RobustScaler(), ['total_bill'])], remainder='passthrough')

In [33]:
transformer

# Encoding


## 1. One Hot Encoding


In [34]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer   

In [35]:
transformer = ColumnTransformer([
                                 ('encoder',OneHotEncoder(),['sex','smoker','day','time'])
                                ])

In [36]:
tips_encoded = transformer.fit_transform(tips)

In [37]:
transformer.get_feature_names_out()

array(['encoder__sex_Female', 'encoder__sex_Male', 'encoder__smoker_No',
       'encoder__smoker_Yes', 'encoder__day_Fri', 'encoder__day_Sat',
       'encoder__day_Sun', 'encoder__day_Thur', 'encoder__time_Dinner',
       'encoder__time_Lunch'], dtype=object)

## 2. Ordinal Encoding


In [38]:
#Ordinal encoding
tips_ordinal_encoded = tips.copy()

In [39]:
!pip3 install category_encoders


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [40]:
import category_encoders as ce

In [41]:
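#Explicit order for 'day'; missing values (None) are mapped to 0
ordinal_mapping = [{'col' : 'day', 
                    'mapping' : {None:0, 'Thur':1, 'Fri':2, 'Sat':3, 'Sun':4}}]

#Only the 'day' column is encoded; the other columns are left unchanged
ordinal_encoder = ce.OrdinalEncoder(cols=['day'], mapping=ordinal_mapping)  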

In [42]:
ordinal_encoder

In [43]:
day_ord_encoded = ordinal_encoder.fit_transform(tips_ordinal_encoded)  

In [44]:
day_ord_encoded

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,4,Dinner,2
1,10.34,1.66,Male,No,4,Dinner,3
2,21.01,3.50,Male,No,4,Dinner,3
3,23.68,3.31,Male,No,4,Dinner,2
4,24.59,3.61,Female,No,4,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,3,Dinner,3
240,27.18,2.00,Female,Yes,3,Dinner,2
241,22.67,2.00,Male,Yes,3,Dinner,2
242,17.82,1.75,Male,No,3,Dinner,2
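
For comparison, the same mapping can be applied with plain pandas. This is a sketch on a hypothetical sample of the `day` column; `.astype(str)` is used as a precaution so that `map` also behaves predictably on a categorical column.

```python
import pandas as pd

# Hypothetical sample of the 'day' column
days = pd.Series(['Sun', 'Sat', 'Thur', 'Fri'], name='day')

# Same order as the mapping above; anything unmapped falls back to 0
day_order = {'Thur': 1, 'Fri': 2, 'Sat': 3, 'Sun': 4}
day_codes = days.astype(str).map(day_order).fillna(0).astype(int)

print(day_codes.tolist())
```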
