<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

<h1> Used Car Listing Price Prediction</h1>
    
</div>

<center><img src="https://raw.githubusercontent.com/anthonynamnam/anthonynamnam/main/icons/image/car-banner.png" alt="memes" width="600" /></center>

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

<h2> Project Overview</h2>
    
Please refer to the GitHub repo of this project: <a href="https://github.com/anthonynamnam/brainstation_capstone#project-overview">Link</a>

    
</div>


---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

   <h2> Notebook Overview</h2>
    
Now that we have a well-prepared dataset in hand, our next steps involve handling categorical variables, scaling features, addressing class imbalances, and ultimately building and evaluating predictive models. Each of these steps plays a pivotal role in the success of our machine learning and deep learning endeavors. In this notebook, we will guide you through the following steps of working with data:
    
<ol>
    <font size=3><li><b>Categorical Data Encoding 🎲</b></li></font>
    <p>Many machine learning algorithms require numerical inputs, necessitating the transformation of categorical variables into a format suitable for analysis. In this notebook, we'll explore various encoding techniques to convert categorical data into a numerical representation that our models can comprehend.</p>
    <font size=3><li><b>Data Scaling 📐</b></li></font>
    <p>Ensuring that all features are on a consistent scale is crucial for the performance of many machine learning algorithms. We'll delve into the importance of data scaling and demonstrate methods to standardize or normalize our features.</p>
    <font size=3><li><b>Class Imbalance ⚖️</b></li></font>
    <p>Real-world datasets often exhibit imbalances in class distribution, where certain outcomes are more prevalent than others. We'll explore techniques to address class imbalances, ensuring that our models are trained to recognize patterns effectively.</p>
    <font size=3><li><b>ML/DL Modeling 🧠</b></li></font>
    <p>The heart of our predictive analytics journey lies in building machine learning and deep learning models. We'll guide you through the process of selecting, training, and fine-tuning models that best suit the nature of our data and the goals of our project.</p>
    <font size=3><li><b>Model Evaluation 🧮</b></li></font>
    <p>As we generate predictions, it becomes imperative to assess the performance of our models. We'll introduce metrics and techniques for evaluating model accuracy, precision, recall, and other key indicators to ensure that our models meet the desired standards.</p>

</ol>
Through this notebook, we aim to equip you with the knowledge and tools needed to navigate the intricacies of turning prepared data into actionable insights. Let's harness the power of machine learning and deep learning to uncover patterns, make predictions, and elevate the impact of our project.
    
</div>

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">

<a class="anchor" id="4-toc"> 
    <h2> Table of Contents </h2>
</a>
    
<ul>    
    <li> <a href="#4-setup">Notebook Set Up</a></li>
    <li> <a href="#4-func">Functions</a></li>
    <li> <a href="#4-load">Data Loading</a></li>
    <li> <a href="#4-cat-encode">Categorical Encoding</a></li>
    <li> <a href="#4-scale">Data Scaling</a></li>
    <li> <a href="#4-imbalance">Class Imbalance</a></li>
    <li> <a href="#4-models">Proposed Models</a></li>
    <li> <a href="#4-pipelines">Model Pipelines</a></li>
    <li> <a href="#4-evaluate">Model Evaluation</a></li>
    <li> <a href="#4-learn">Learning/Takeaway</a></li>
</ul>
    
</div>

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-setup">
    <h2> Set Up </h2>
</a>
<b>Table of Contents:</b>
<ul>    
    <li> <a href="#4-import">Import Library</a></li>
    <li> <a href="#4-const">Global Constant</a></li>
</ul>
</div>

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-import">
<h3> Import Library </h3>
</a>
</div>

In [93]:
import time
import random
import logging
import warnings
import datetime

# Data Science Package
import numpy as np
import pandas as pd

import sys

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-const">
<h3> Global Constant </h3>
</a>
</div>

In [94]:

warnings.filterwarnings('ignore')

pd.options.display.max_columns = None
ran = random.Random()
ran.seed(42)

In [95]:

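# FileHandler does not create missing directories, so make sure `log/` exists
import os
os.makedirs('log', exist_ok=True)

logging.basicConfig(level=logging.INFO, 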
                    format='%(asctime)s - %(levelname)s >>> %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    handlers=[
                        logging.FileHandler(filename='log/modelling.log'),
                        logging.StreamHandler(sys.stdout)
                    ])
logger = logging.getLogger('LOGGER_NAME')

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-func">
    <h2> Functions </h2>
</a>
<b>Table of Contents:</b>
<ul>    
    <li> <a href="#4-func-print">Helper Functions (Print Info)</a></li>
    <li> <a href="#4-func-edit">Helper Functions (Edit Dataframe)</a></li>
</ul>
</div>

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-func-print">
<h3> Helper Functions (Print Info) </h3>
</a>
</div>

In [96]:
# Helper Functions to print df info and statement
import pandas as pd

def print_num_row(df: pd.DataFrame) -> None:
    """
    Description
    -----
    Retrieve the number of rows of dataframe and print it as a statement.
    
    Args
    -----
    df (pd.DataFrame): the target dataframe
    
    Returns
    -----
    None
    
    Example
    -----
    df = pd.DataFrame(data = {"height":[147,190],"weight":[47,72],"age":[12,28]},index = [0,1])
    print_num_row(df)  =>
        |
        | "The dataframe has 2 rows of record now."
        |
    
    
    """
    print(f"The dataframe has {df.shape[0]} rows of record now.")
    return
    

def print_num_col(df: pd.DataFrame) -> None:
    """
    Description
    -----
    Retrieve the number of columns of dataframe and print it as a statement.
    
    Args
    -----
    df (pd.DataFrame): the target dataframe
    
    Returns
    -----
    None
    
    Example
    -----
    df = pd.DataFrame(data = {"height":[147,190],"weight":[47,72],"age":[12,28]},index = [0,1])
    print_num_col(df) => 
        |
        | "The dataframe has 3 columns now."
        |
    
    
    """
    print(f"The dataframe has {df.shape[1]} columns now.")
    return
        
def print_dim(df: pd.DataFrame) -> None:
    """
    Description
    -----
    Retrieve the shape of dataframe and print it as a statement.
    
    Args
    -----
    df (pd.DataFrame): the target dataframe
    
    Returns
    -----
    None
    
    Example
    -----
    abc_df = pd.DataFrame(data = {"height":[147,190],"weight":[47,72],"age":[12,28]},index = [0,1])
    print_dim(abc_df) =>
        |
        | "There are 2 rows and 3 columns in this dataframe now."
        |
    
    
    """
    print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataframe now.")
    return


def print_null_count(df: pd.DataFrame,cols:list = []) -> None:
    """
    Description
    -----
    Count the null value in each columns.
    If `cols` is provided, only show the null value count for the columns in `cols`. 
    Otherwise, show null value count for all columns.
    
    Args
    -----
    df (pd.DataFrame): target dataframe
    cols (list): the column names to show the null value count. Default: []
    
    Returns
    -----
    None
    
    Example
    -----
    abc_df = pd.DataFrame(data = {"height":[147,190],"weight":[47,np.nan],"age":[np.nan,np.nan]},index = [0,1])
    print_null_count(abc_df) => 
        |
        | === Null Count ===
        | height    0
        | weight    1
        | age       2
        | dtype: int64
        |
        
    print_null_count(abc_df,cols=["age"]) => 
        |
        | === Null Count ===
        | Column `age`: 2
        |
        
    print_null_count(abc_df,cols=["age","weight"]) => 
        |
        | === Null Count ===
        | Column `age`: 2
        | Column `weight`: 1
        |
    
    """
    if len(cols) == 0:
        null_count = df.isnull().sum()
        
        print("=== Null Count ===")
        print(null_count)
    else:
        assert set(cols).issubset(df.columns)
        null_count = df[cols].isnull().sum()
        
        print("=== Null Count ===")
        for col in cols:
            print(f"Column `{col}`: {null_count[col]}")
    return 


def print_null_pct(df: pd.DataFrame,cols:list = []) -> None:
    """
    Description
    -----
    Count the null percentage in each columns.
    If `cols` is provided, only show the null percentage for the columns in `cols`. 
    Otherwise, show null percentage for all columns.
    
    Args
    -----
    df (pd.DataFrame): target dataframe
    cols (list): the column names to show the null percentage. Default: []
    
    Returns
    -----
    None
    
    Example
    -----
    abc_df = pd.DataFrame(data = {"height":[147,190],"weight":[47,np.nan],"age":[np.nan,np.nan]},index = [0,1])
    print_null_pct(abc_df) => 
        |
        | === Null Count Percentage ===
        | height      0.0%
        | weight     50.0%
        | age       100.0%
        | dtype: object
        |
        
    print_null_pct(abc_df,cols=["weight"]) => 
        |
        | === Null Count Percentage ===
        | Column weight: 50.0%
        |
    
    """
    total_row = df.shape[0]
    if len(cols) == 0:
        null_count = df.isnull().sum()
        null_pct = null_count / total_row * 100
        
        print("=== Null Count Percentage ===")
        print(null_pct.round(2).astype(str)+"%")
    else:
        assert set(cols).issubset(df.columns)
        null_count = df[cols].isnull().sum()
        null_pct = null_count / total_row * 100

        print("=== Null Count Percentage ===")
        for col in cols:
            print(f"Column {col}: {round(null_pct[col],4)}%")
    return

def print_duplicated_count(df: pd.DataFrame) -> None:
    """
    Description
    -----
    Count the number of duplicated rows.
    
    Args
    -----
    df (pd.DataFrame): target dataframe
    
    Returns
    -----
    None
    
    Example
    -----
    abc_df = pd.DataFrame(data = {"height":[147,190,147],"weight":[47,np.nan,47],"age":[13,27,13]},index = [0,1,2])
    print_duplicated_count(abc_df) => 
        |
        | There are 1 duplicated rows
        | 
        
    """
    print(f"There are {df.duplicated().sum()} duplicated rows")
    return



<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-func-edit">
<h3> Helper Functions (Edit Dataframe) </h3>
</a>
</div>

In [97]:
def drop_cols_if_exist(df: pd.DataFrame,cols_to_drop:list) -> pd.DataFrame:
    """
    Description
    -----
    Drop columns from a dataframe in place (inplace = True). Columns in `cols_to_drop` that do not exist are skipped.
    
    Args
    -----
    df (pd.DataFrame): the target dataframe
    cols_to_drop (list): the list of column to be dropped
    
    Returns
    -----
    df (pd.DataFrame): the dataframe with columns dropped
    
    Example
    -----
    # Create a DataFrame
    abc_df = pd.DataFrame(data = {"height":[147,190,147],"weight":[47,np.nan,47],"age":[13,27,13]},index = [0,1,2])
    print(abc_df)  =>
        |
        |    height  weight  age
        | 0     147    47.0   13
        | 1     190     NaN   27
        | 2     147    47.0   13
        |
        
    # Drop columns if exist
    dropped_abc_df = drop_cols_if_exist(abc_df,cols_to_drop=["weight"])
    print(dropped_abc_df)   =>
        | Successfully dropped columns: {'weight'}
        |    height  age
        | 0     147   13
        | 1     190   27
        | 2     147   13
    
    
    """
    intersect_cols = set(cols_to_drop).intersection(df.columns)
    df.drop(columns=intersect_cols,inplace=True)
    print(f"Successfully dropped columns: {intersect_cols}")
    return df    

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-load">
    <h2> Data Loading </h2>
</a>
<b>Table of Contents:</b>
<ul>    
    <li> <a href="#4-load-process">Load Processed Data</a></li>
    <li> <a href="#4-san-check-train">Sanity Check - Train Data</a></li>
    <li> <a href="#4-san-check-test">Sanity Check - Test Data</a></li>
    <li> <a href="#4-split-xy">Split Data</a></li>
</ul>
</div>

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-load-process">
<h3> Load the Split data </h3>
</a>
</div>

In [98]:
# Define dtype for split dataset
SPLIT_COL_DTYPE = {
    
    "log_miles":float,
    "year":int,
    "make":str,
    "model":str,
    "trim":str,
    "body_type":str,
    "vehicle_type":str,
    "drivetrain":str,
    "transmission":str,
    "engine_size":float,
    "engine_block":str,
    "price_range":int,
    
    # For encoded fuel type
    "fuel_M85":int,
    "fuel_Lpg":int,
    "fuel_Diesel":int,
    "fuel_Unleaded":int,
    "fuel_Hydrogen":int,
    "fuel_PremiumUnleaded":int,
    "fuel_Biodiesel":int,
    "fuel_E85":int,
    "fuel_Electric":int,
    "fuel_CompressedNaturalGas":int,
}

In [99]:
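# First, we read the train dataset.
# Note: `columns` only selects columns; the dtypes in SPLIT_COL_DTYPE are not applied on load.
train_data = pd.read_parquet(path = "data/train-data.parquet",
                     columns = list(SPLIT_COL_DTYPE))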
train_data.reset_index(drop = True, inplace=True)

In [100]:
# Then, we read the test dataset (same column selection as for the train dataset)
test_data = pd.read_parquet(path = "data/test-data.parquet",
                     columns = list(SPLIT_COL_DTYPE))
test_data.reset_index(drop = True, inplace=True)
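
The `columns` argument of `read_parquet` only selects columns, so the types listed in `SPLIT_COL_DTYPE` are not enforced on load (which is why `year` appears as a float in the previews below). If enforcing them is desired, one option is an explicit `astype` after loading. The sketch below uses a small made-up frame and a hypothetical `toy_dtypes` mapping in place of the real parquet files:

```python
import pandas as pd

# Toy stand-in for the loaded data (values are made up for illustration)
toy = pd.DataFrame({
    "log_miles": [8.35, 9.78],
    "year": [2020.0, 2018.0],      # stored as float, as in the preview
    "make": ["Toyota", "RAM"],
    "price_range": [3.0, 3.0],
    "unused": ["a", "b"],          # a column we do not want to keep
})

# Hypothetical subset of the dtype mapping; the keys double as the column selection
toy_dtypes = {"log_miles": float, "year": int, "make": str, "price_range": int}

# Select the columns by key, then cast explicitly
typed = toy[list(toy_dtypes)].astype(toy_dtypes)
print(typed.dtypes)
```

Casting to `int` only works when a column has no missing values, which matches the null counts reported in the sanity checks.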

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-san-check-train">
<h3> Sanity Check - Train </h3>
</a>
</div>

In [101]:
train_data.head()

,log_miles,year,make,model,trim,body_type,vehicle_type,drivetrain,transmission,engine_size,engine_block,price_range,fuel_M85,fuel_Lpg,fuel_Diesel,fuel_Unleaded,fuel_Hydrogen,fuel_PremiumUnleaded,fuel_Biodiesel,fuel_E85,fuel_Electric,fuel_CompressedNaturalGas
0,8.353497,2020.0,Toyota,Tacoma,SR5,Pickup,Truck,4WD,Automatic,3.5,V,3,0,0,0,1,0,0,0,0,0,0
1,9.780189,2018.0,RAM,Ram 1500 Pickup,Big Horn,Pickup,Truck,RWD,Automatic,3.0,V,3,0,0,1,0,0,0,0,0,0,0
2,9.922555,2018.0,Lexus,ES,350,Sedan,Car,FWD,Automatic,3.5,V,3,0,0,0,1,0,0,0,0,0,0
3,10.928507,2017.0,BMW,X5,sDrive35i,SUV,Truck,RWD,Automatic,3.0,I,3,0,0,0,0,0,1,0,0,0,0
4,11.268992,2015.0,Scion,tC,Release Series 9.0,Coupe,Car,FWD,Automatic,2.5,I,1,0,0,0,1,0,0,0,0,0,0


In [102]:
print_dim(train_data)

There are 1764450 rows and 22 columns in this dataframe now.


In [103]:
print_null_count(train_data)

=== Null Count ===
log_miles                    0
year                         0
make                         0
model                        0
trim                         0
body_type                    0
vehicle_type                 0
drivetrain                   0
transmission                 0
engine_size                  0
engine_block                 0
price_range                  0
fuel_M85                     0
fuel_Lpg                     0
fuel_Diesel                  0
fuel_Unleaded                0
fuel_Hydrogen                0
fuel_PremiumUnleaded         0
fuel_Biodiesel               0
fuel_E85                     0
fuel_Electric                0
fuel_CompressedNaturalGas    0
dtype: int64


In [104]:
print_duplicated_count(train_data)

There are 23968 duplicated rows


<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-san-check-test">
<h3> Sanity Check - Test </h3>
</a>
</div>

In [105]:
test_data.head()

,log_miles,year,make,model,trim,body_type,vehicle_type,drivetrain,transmission,engine_size,engine_block,price_range,fuel_M85,fuel_Lpg,fuel_Diesel,fuel_Unleaded,fuel_Hydrogen,fuel_PremiumUnleaded,fuel_Biodiesel,fuel_E85,fuel_Electric,fuel_CompressedNaturalGas
0,11.532738,2014.0,Jeep,Wrangler Unlimited,Sahara,SUV,Truck,4WD,Manual,3.6,V,2,0,0,0,1,0,0,0,0,0,0
1,12.028894,2012.0,Honda,Accord,EX-L V6,Coupe,Car,FWD,Automatic,3.5,V,1,0,0,0,1,0,0,0,0,0,0
2,10.063947,2020.0,Kia,FORTE,LXS,Sedan,Car,FWD,Automatic,2.0,I,2,0,0,0,1,0,0,0,0,0,0
3,10.526024,2019.0,Jeep,Grand Cherokee,High Altitude,SUV,Truck,4WD,Automatic,3.6,V,4,0,0,0,1,0,0,0,0,0,0
4,10.633834,2018.0,Volkswagen,Tiguan,SE,SUV,Truck,4WD,Automatic,2.0,I,2,0,0,0,1,0,0,0,0,0,0


In [106]:
print_dim(test_data)

There are 588151 rows and 22 columns in this dataframe now.


In [107]:
print_null_count(test_data)

=== Null Count ===
log_miles                    0
year                         0
make                         0
model                        0
trim                         0
body_type                    0
vehicle_type                 0
drivetrain                   0
transmission                 0
engine_size                  0
engine_block                 0
price_range                  0
fuel_M85                     0
fuel_Lpg                     0
fuel_Diesel                  0
fuel_Unleaded                0
fuel_Hydrogen                0
fuel_PremiumUnleaded         0
fuel_Biodiesel               0
fuel_E85                     0
fuel_Electric                0
fuel_CompressedNaturalGas    0
dtype: int64


In [108]:
print_duplicated_count(test_data)

There are 4218 duplicated rows


<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-split-xy">
<h3> Split Data into X and y </h3>
</a>
</div>

In [109]:
X_train = train_data.drop(columns=["price_range"])
y_train = train_data["price_range"]

In [110]:
X_test= test_data.drop(columns=["price_range"])
y_test = test_data["price_range"]

In [111]:
# Sanity check
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

(1764450, 21) (588151, 21) (1764450,) (588151,)


[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-cat-encode">
    <h2> Categorical Encoding </h2>
</a>
</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>🧠 Idea:</b></font>
<br><br>
After some research, we identified the following encoding options for our dataset.  

- `Dummy Encoding` (keep all categories)
- `Dummy Encoding` (with a fixed number of categories, e.g. the top 10 most frequent categories)
- `Dummy Encoding` (with a value-count percentage threshold, i.e. only keep categories with more than X% of total records)
- `Ordinal Encoding` (for categories with an ordinal meaning)
- `Count Encoding` (data leakage if fitted before splitting the data) `->` Use `sklearn`.`Pipeline`
- `Target Encoding`  (data leakage if fitted before splitting the data) `->` Use `sklearn`.`Pipeline`

**However**, some of them are not suitable for our columns.

`Count Encoding`: It is not useful, as we have > 7M records and some categories may have counts in the millions.  
`Ordinal Encoding`: Our categorical columns generally do not have any order or level, so ordinal encoding may not be useful at all in this scenario.  

</div>
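
To make the two restricted dummy-encoding variants above concrete, here is a minimal sketch on a made-up `make` column. The helper `limit_categories` and the `"Other"` bucket are hypothetical names introduced for illustration, not part of this project's code:

```python
import pandas as pd

def limit_categories(s, top_n=None, min_pct=None, other="Other"):
    """Keep only frequent categories and lump the rest into `other`."""
    freq = s.value_counts(normalize=True)          # sorted, most frequent first
    if top_n is not None:
        keep = set(freq.index[:top_n])             # fixed number of categories
    elif min_pct is not None:
        keep = set(freq.index[freq >= min_pct])    # share-of-records threshold
    else:
        keep = set(freq.index)                     # keep everything
    return s.where(s.isin(keep), other)

# Toy column: Toyota 50%, Honda 30%, BMW 10%, Scion 10%
makes = pd.Series(["Toyota"] * 5 + ["Honda"] * 3 + ["BMW"] + ["Scion"], name="make")

top2 = pd.get_dummies(limit_categories(makes, top_n=2), prefix="make")
by_pct = pd.get_dummies(limit_categories(makes, min_pct=0.4), prefix="make")

print(sorted(top2.columns))
print(sorted(by_pct.columns))
```

Both variants cap the number of dummy columns. Note that the frequency table is itself learned from the data, so in a real workflow it would be fitted on the training split only.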

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>

- Of the options above, because we do not want to expand our feature space too much, we will apply target encoding to the columns `make`, `model`, `trim` and `body_type`.

- Advantages of `target encoding`:
    1. `Target encoding` does not expand our feature space.
    1. `Target encoding` replaces each category with the average of the target variable (`price_range`) within that category. For `model`, for example, the encoded value is the average `price_range` of that `model`.
    
- Disadvantage of `target encoding`:
    1. Because the averaging gathers target information across rows, fitting the encoder on rows that are later used for validation or testing leads to **data leakage**. So we will not perform the encoding here. Instead, we will embed it in the modelling pipeline.

- For the other columns, we will apply `dummy variable encoding`, which does not use the target and therefore does not have this leakage problem.
    
- To integrate the encoders with cross-validation, we will embed them in the pipeline in the modelling section.

</div>
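
The leakage argument can be illustrated with a hand-rolled sketch in plain pandas. This is not scikit-learn's `TargetEncoder` (which additionally applies smoothing and cross-fitting); the function names and the toy rows are assumptions made for illustration. The key point is that the per-category means are learned from the training rows only and then merely looked up for the test rows:

```python
import pandas as pd

def fit_target_means(cat, y):
    """Learn the per-category target mean and the global mean from training rows."""
    return y.groupby(cat).mean().to_dict(), float(y.mean())

def apply_target_means(cat, means, fallback):
    """Look up the learned means; unseen categories fall back to the global mean."""
    return cat.map(means).fillna(fallback)

# Made-up training and test rows
train = pd.DataFrame({"model": ["ES", "ES", "X5", "tC"], "price_range": [3, 4, 3, 1]})
test = pd.DataFrame({"model": ["ES", "Tacoma"]})

# Fit on the training split only, so no test targets are ever seen
means, global_mean = fit_target_means(train["model"], train["price_range"])
test_encoded = apply_target_means(test["model"], means, global_mean)

print(test_encoded.tolist())  # [3.5, 2.75]
```

Inside cross-validation, the same fit-then-apply split has to happen per fold, which is what embedding the encoder in a pipeline takes care of.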

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-scale">
    <h2> Data Scaling </h2>
</a>
</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>🧠 Idea:</b></font>
<br><br>
There are three available scaling options for our dataset.  
    
---
- `Standard Scaling`
    - Output space: `[-inf,inf]` with `mean` of 0 and `variance` of 1
    - Characteristics:
        - Useful when features have different units
        - Sensitive to outliers
---
- `Min-Max Scaling`
    - Output space: `[0,1]`
    - Characteristics:
        - Maintains the shape of the original distribution
        - Sensitive to outliers
---
- `Robust Scaling`
    - Output space: `[-inf,inf]`
    - Characteristics:
        - Effective in the presence of outliers, as it uses the median and IQR.
        - Centers the data on the median, so the bulk of the data stays around 0 even when outliers are present
    

</div>
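
The three options can be compared on a tiny made-up feature containing one outlier. The sketch below writes out the formulas with NumPy; they are meant to mirror the default behaviour of scikit-learn's `StandardScaler`, `MinMaxScaler` and `RobustScaler` rather than replace them:

```python
import numpy as np

# Toy feature with one extreme value (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standard scaling: subtract the mean, divide by the population standard deviation
standard = (x - x.mean()) / x.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range
q25, q75 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q75 - q25)

print(np.round(standard, 2))
print(np.round(min_max, 2))
print(np.round(robust, 2))
```

With this input, min-max scaling squeezes the four ordinary values into roughly the first 3% of the `[0,1]` range, while robust scaling keeps them spread between -1 and 0.5 and leaves only the outlier far away.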

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>

In our dataset, there are several feature types:
- `Categorical Features` (e.g. `make`, `drivetrain`)
- `Numerical Features` (e.g. `year`, `log_miles`)

---
    
- For `Categorical Features`, we will apply `one-hot encoding` and `target encoding`. The outputs of both methods are already in `[0,1]`, so `Min-Max Scaling` would not change anything, while `standard scaling` would move them from `[0,1]` to `[-inf, inf]`. Therefore, we will not apply any scaling to them.
    
- For `Numerical Features`, which have different magnitudes, we will apply `standard scaling`.

</div>
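
A minimal sketch of this column-wise treatment, assuming a made-up three-column frame: numerical columns are standardized and a categorical column is one-hot encoded without any further scaling. The transformer names `"num"` and `"cat"` are arbitrary labels chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with two numerical columns and one categorical column
toy = pd.DataFrame({
    "year": [2015, 2017, 2018, 2020],
    "log_miles": [11.27, 10.93, 9.92, 8.35],
    "drivetrain": ["FWD", "RWD", "FWD", "4WD"],
})

preprocess = ColumnTransformer(
    transformers=[
        # Numerical features: standard scaling
        ("num", StandardScaler(), ["year", "log_miles"]),
        # Categorical features: one-hot encoding, left unscaled
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["drivetrain"]),
    ],
    sparse_threshold=0.0,  # always return a dense array for easy inspection
)

out = preprocess.fit_transform(toy)
print(out.shape)
print(np.round(out, 2))
```

The first two output columns are the standardized numerical features and the remaining three are the 0/1 indicators for `drivetrain`.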

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-imbalance">
    <h2> Class Imbalance</h2>
</a>
</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>❓ What is Class Imbalance?</b></font>
<br><br>
    <p><b>Class imbalance</b> refers to a situation in a classification problem where the distribution of the classes is not uniform, meaning that one or more classes have significantly fewer instances than the others. In other words, there is an unequal distribution of target labels in the dataset, and one or more classes are underrepresented compared to the others.</p>
<br>
    <p>For our classification task, which is <b>multi-class classification</b>, class imbalance can refer to <b>unequal distribution</b> across multiple classes. </p>

</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>🧠 How to tackle Class Imbalance?</b></font>
<br><br>

There are several techniques for **Class Imbalance**.  
1. Resampling
    - **Over Sampling**
    - **Under Sampling**
    - **Hybrid Sampling** (Over Sampling on Minority Class + Under Sampling on Majority Class) (To be tested)
2. Synthetic Data Generation
    - <b>S</b>ynthetic <b>M</b>inority <b>O</b>ver sampling <b>TE</b>chnique <b>(SMOTE)</b>
    - <b>S</b>ynthetic <b>M</b>inority <b>O</b>ver sampling <b>TE</b>chnique for <b>N</b>ominal & <b>C</b>ontinuous <b>(SMOTENC)</b>

    

</div>
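
Random over- and undersampling can be sketched with plain pandas. The helper `random_resample` below is a hypothetical illustration on made-up data, not this project's implementation; SMOTE and SMOTENC generate synthetic rows instead and are typically taken from a dedicated library, which is not shown here.

```python
import pandas as pd

def random_resample(df, target, mode, random_state=42):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    counts = df[target].value_counts()
    # Oversampling grows every class to the majority size,
    # undersampling shrinks every class to the minority size
    size = int(counts.max() if mode == "over" else counts.min())
    parts = [
        group.sample(n=size, replace=bool(len(group) < size), random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data: class 1 has 8 rows, class 2 has 3 rows, class 3 has 1 row
toy = pd.DataFrame({"x": range(12), "price_range": [1] * 8 + [2] * 3 + [3] * 1})

over = random_resample(toy, "price_range", mode="over")
under = random_resample(toy, "price_range", mode="under")

print(over["price_range"].value_counts().to_dict())
print(under["price_range"].value_counts().to_dict())
```

Resampling should only be applied to the training data, so that the test set keeps the original class distribution.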

In [None]:
# TODO

In [None]:
y_train.value_counts()

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>
TODO: Edit based on the chart  
    
- According to the above chart, the minority class makes up only **5%** of the total data. If we apply **undersampling**, every class is cut down to the size of the minority class, so we keep only **25%** of the data and lose **75%** of the information.
    
- According to the above chart, the ratio of the majority class to the minority class is **20:1**. If we apply **oversampling**, the minority class has to be replicated about **20** times.
    
**[After first round of experiments]**  
- We find that oversampling the minority class does not improve most of our models.
- We may also consider hybrid sampling in the next sprint, which combines oversampling and undersampling.
    
</div>
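
The arithmetic behind these trade-offs can be written down explicitly. The class shares below are hypothetical, chosen only so that there are five classes with a 5% minority; the real values should come from `y_train.value_counts(normalize=True)`.

```python
def undersampling_retention(shares):
    """Fraction of rows kept when every class is cut to the minority-class size."""
    return len(shares) * min(shares.values())

def oversampling_growth(shares):
    """Dataset size after every class is grown to the majority-class size,
    relative to the original size."""
    return len(shares) * max(shares.values())

# Hypothetical class shares (must sum to 1)
shares = {1: 0.40, 2: 0.25, 3: 0.20, 4: 0.10, 5: 0.05}

kept = undersampling_retention(shares)
growth = oversampling_growth(shares)
ratio = max(shares.values()) / min(shares.values())

print(f"Undersampling keeps {kept:.0%} of the rows")
print(f"Oversampling grows the data to {growth:.1f}x its size")
print(f"Majority-to-minority ratio is about {ratio:.0f}:1")
```

With five classes and a 5% minority, undersampling keeps 5 x 5% = 25% of the rows, which is where the 25% figure above comes from.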

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-models">
    <h2> Proposed Models </h2>
</a>
</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>🧠 Ideas:</b></font>


- [x] Logistic Regression
- [x] Decision Tree
- [x] Stochastic Gradient Boosting
- [x] AdaBoost
- [x] XGBoost
- [ ] Random Forest
- [ ] CatBoost
- [ ] Naive Bayes
- [ ] K-Nearest Neighbor
- [ ] Neural Network (Less Deep)
- [ ] Neural Network (Medium Deep)
- [ ] Neural Network (More Deep)


</div>

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-pipelines">
    <h2> Model Pipelines </h2>
</a>

</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>⚠️ Note:</b></font>

- For a faster training process, we will use 20% of our training data (around 350K rows) here, drawn with `train_test_split` and `stratify` = `y`.
- We will apply oversampling to deal with class imbalance.
- We will apply grid search with 5-fold cross-validation (`GridSearchCV`) to find the best hyperparameters.
    
</div>
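
The subsampling and tuning steps from the note can be sketched on synthetic data as follows. The data, the logistic regression model and the tiny parameter grid are assumptions made for illustration; the actual pipelines are built by the classes below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data: 300 / 150 / 50 rows for classes 0 / 1 / 2
rng = np.random.default_rng(42)
y = np.repeat([0, 1, 2], [300, 150, 50])
X = rng.normal(size=(500, 3)) + y[:, None]   # shift features by class so they carry signal

# Stratified 20% subsample: class proportions are preserved
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_sub))

# 5-fold grid search scored with the weighted F1 score
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_sub, y_sub)

print(search.best_params_)
print(round(search.best_score_, 3))
```

If oversampling is combined with cross-validation, it is generally safer to resample only the training part of each fold, so that duplicated rows do not appear on both sides of a split.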

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Classes for Model Pipelines </h3>
</a>

</div>

In [156]:
import os
import pprint
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold,train_test_split
from sklearn.preprocessing import OneHotEncoder, TargetEncoder, StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

import xgboost as xgb


model_performance_dict = {
    "model_name": None,
    "model": None,
    
    "sub_sampling":False,
    "sub_sampling_time":0,
    "sub_sampling_pct_X_train":None,
    "sub_sampling_pct_X_test":None,
    
    "over_sampling":False,
    "over_sampling_time":0,
    
    "X_train_shape":None,
    "X_test_shape":None,
    
    "target_encoding": False,
    "target_encoding_col": [],
    "one_hot_encoding": False,
    "one_hot_encoding_col": [],
    
    "standard_scaling": False,
    "standard_scaling_col": [],
    "min_max_scaling": False,
    "min_max_scaling_col": [],
    
    "pca": False,
    "pca_n_components": None,
    
    "scoring":None,
    
    "grid_search": False,
    "best_params": None,
    "best_train_score": None,
    
    "params": None,
    "train_score": None,
    
    
    "test_score": None,    
    "confusion_matrix": None,
}

class ModelPerf:
    
    _data = None
    _path = None

    def __init__(self,path,read = True):
        self._path = path
        # try to load the previous file
        if read:
            self.read_csv()
        
    def _is_path_valid(self):
        # Return whether the CSV file already exists
        return os.path.isfile(self._path)
        
    def read_csv(self):
        try:
            self._data = pd.read_csv(self._path)
        except Exception as e:
            logger.info(f"Creating new model performance file...")
            self._data = pd.DataFrame()
            
    def export_csv(self):
        if self._data.shape[0] > 0:
            self._data.to_csv(self._path,index = False)
            logger.info(f"Model Performance CSV is exported")
        else:
            logger.warning(f"Model Performance is empty")
            
    def add_data(self,new_data:dict,export = True):
        new_df = pd.DataFrame([new_data])
        self._data = pd.concat([self._data,new_df])
        self._data.reset_index(drop = True, inplace=True)
        if export:
            self.export_csv()
        
    def get_data(self):
        return self._data
    
    def print_data(self):
        print(self.get_data())
        
    
    

class MyModel:
    
    def __init__(self,
                 X_train,X_test,
                 y_train,y_test,
                 train_subsample_size = None,
                 test_subsample_size = None):
        # Random State
        self._random_state = 42
        self._train_subsample_size = train_subsample_size
        self._test_subsample_size = test_subsample_size
        
        # Store the data
        self.X_train = X_train
        le = LabelEncoder()
        # Keep the original row index so that y stays aligned with X
        self.y_train = pd.Series(le.fit_transform(y_train),index= y_train.index)
            
        self.X_test = X_test
        self.y_test = pd.Series(le.transform(y_test),index= y_test.index)
        
        self.print_data_size()
                
        self._classes = sorted(list(set(y_train)))
        
        # Default is 5-fold
        self.cv = None
        
        # Subsampling
        self._subsampled = False
        
        # OverSampling
        self._oversampled = False
        
        # Custom Encoder
        self._one_hot_transformer = None
        self._one_hot_col = []
        self._tar_end_transformer = None
        self._tar_end_col = []
        self.encoder = None
        self._encoder_steps = []
        self._encoder_name = ""
        
        # Custom Scaler
        self._standard_scaler = None
        self._ss_col = []
        self._min_max_scaler = None
        self._mm_col = []
        self.scaler = None
        self._scaler_steps = []
        self._scaler_name = ""
        
        # Custom PCA
        self.pca = None
        self._pca_name = None
        self._pca_n_components = None
        
        # Grid Search
        self._use_grid_search= False
        self.grid_search = None
        self.gs_params = {}
        
        # Custom Model
        self.model = None
        self._model_name = ""
        
        # Custom Score
        self._scoring = "f1_weighted"
        self._scoring_func = lambda y_true,y_pred: f1_score(y_true,y_pred,average = "weighted")
        
        # Custom Pipeline
        self.pipeline = None
        self._pipeline_steps = []
        
        # Prediction
        self.train_y_pred = None
        self.y_pred = None
        self.y_prob = None
        
        # Timer
        self._subsample_time = None
        self._oversampling_time = None
        self._fit_time = None
        self._predict_time = None
        
        
        # Model Performance
        self.model_perf = None
        self._raw_cm = None
        self._cm = None
        self._cv_results = None
        
    def init_subsampling(self):
        if self._train_subsample_size is not None or self._test_subsample_size is not None:
            start_time = time.time()
            
            if self._train_subsample_size is not None:
                # Sample the train dataset with train_test_split function
                self.X_train, _, self.y_train, _ = train_test_split(self.X_train, self.y_train,
                                                                    stratify = self.y_train, 
                                                                    train_size = self._train_subsample_size,
                                                                    random_state = self._random_state)
            if self._test_subsample_size is not None:
                # Sample the test dataset with train_test_split function
                _, self.X_test, _ ,self.y_test= train_test_split(self.X_test, self.y_test,
                                                                 stratify = self.y_test, 
                                                                 test_size = self._test_subsample_size, 
                                                                 random_state = self._random_state)

            end_time = time.time()
            
            self._subsample_time = end_time - start_time
            logger.info(f"Subsampling Completed | Time elapsed: {self.time_to_str(self._subsample_time)}")
            self.print_data_size(title = "After Subsampling")
            self._subsampled = True
            
        else: 
            logger.error("At least one of 'train_subsample_size' and 'test_subsample_size' must be set.")
        
    def init_over_sampling(self):
        start_time = time.time()
        
        logger.info("Oversampling on Train Data...")
        self.X_train, self.y_train = self.random_over_sampling(self.X_train,self.y_train)
        end_time = time.time()
        
        self._oversampling_time = end_time - start_time
        logger.info(f"Oversampling Completed | Time elapsed: {self.time_to_str(self._oversampling_time)}")
        self.print_data_size(title = "After Oversampling")
        self._oversampled = True
        
        
    def init_one_hot_encoding(self,encoder = None, columns=[]):
        if len(columns) > 0:
            if encoder is None:
                self._one_hot_transformer = Pipeline(
                    steps=[("one_hot",OneHotEncoder(sparse_output = False,drop="first"))]
                )
            else:
                self._one_hot_transformer = Pipeline(
                    steps=[("one_hot",encoder)]
                )
                
            self._set_one_hot_col(columns)
            
            # Add to encoder steps
            self._encoder_steps.append(("one_hot",self._one_hot_transformer,self._one_hot_col))
            
            logger.info("One-hot Encoder Initialization Completed")
            self.print_empty()
        
    def init_target_encoding(self,encoder = None, columns=[]):
        if len(columns) > 0:
            if encoder is None:
                self._tar_end_transformer = Pipeline(
                    steps=[("tar_end",TargetEncoder(target_type="continuous",
                                                    random_state=self._random_state))]
                )
            else:
                self._tar_end_transformer = Pipeline(
                    steps=[("tar_end",encoder)]
                )
                
            self._set_target_encode_col(columns)
            
            # Add to encoder steps
            self._encoder_steps.append(("tar_end",self._tar_end_transformer,self._tar_end_col))
            
            logger.info("Target Encoder Initialization Completed")
            self.print_empty()

    def init_encoder(self,name = "encoders"):        
        if len(self._encoder_steps) == 0:
            self.encoder = None
        else:
            self.encoder = ColumnTransformer(
                transformers=self._encoder_steps,
                remainder = "passthrough",
                verbose_feature_names_out = False
            )
            self._encoder_name = name
            self._pipeline_steps.append((self._encoder_name,self.encoder))
        
    def init_standard_scaler(self,scaler = None, columns=[]):
        if len(columns) > 0:
            if scaler is None:
                self._standard_scaler = Pipeline(
                    steps=[("standard_scaler",StandardScaler()),]
                )
            else:
                self._standard_scaler = Pipeline(
                    steps=[("standard_scaler",scaler),]
                )
                
            self._set_standard_scaling_col(columns)
                
            # Add to scaler steps
            self._scaler_steps.append(("standard_scaler",self._standard_scaler,self._ss_col))
            
            logger.info("Standard Scaler Initialization Completed")
            self.print_empty()
        
    def init_min_max_scaler(self,scaler = None,columns=[]):
        if len(columns) > 0:
            if scaler is None:
                self._min_max_scaler = Pipeline(
                    steps=[("min_max_scaler",MinMaxScaler()),]
                )
            else:
                self._min_max_scaler = Pipeline(
                    steps=[("min_max_scaler",scaler),]
                )
            
            self._set_min_max_scaling_col(columns)
                
            # Add to scaler steps
            self._scaler_steps.append(("min_max_scaler",self._min_max_scaler,self._mm_col))
            
            logger.info("Min-Max Scaler Initialization Completed")
            self.print_empty()
            
    def init_scaler(self,name = "scalers"):
        if len(self._scaler_steps) == 0:
            self.scaler = None
        else:        
            self.scaler = ColumnTransformer(
                transformers=self._scaler_steps,
                remainder = "passthrough",
                verbose_feature_names_out = False
            )
            self._scaler_name = name
            self._pipeline_steps.append((self._scaler_name,self.scaler))
            
    def init_pca(self,name="pca",n_components = None):
        
        if n_components is None:
            self.pca = PCA()
        else:
            self._pca_n_components = n_components
            self.pca = PCA(n_components = n_components)
            
        self._pca_name = name
        self._pipeline_steps.append((self._pca_name,self.pca))
        logger.info("PCA Initialization Completed")
        self.print_empty()
        
    def init_model(self,name,model):
        self.model = model
        self._model_name = name
        self._pipeline_steps.append((self._model_name,self.model))
        logger.info(f"{self.model.__class__.__name__} (name: {self._model_name}) Initialization Completed")
        self.print_empty()
                
        
    def init_grid_search(self,params,n_job = 1,no_cv = False):
        if self.pipeline is None:
            logger.error("Please initialize the model pipeline first")
            return
        if len(params) == 0:
            logger.error(f"Grid Search params cannot be empty.")
        else:
            self._set_grid_search_params(params)
            self.grid_search = GridSearchCV(self.pipeline,
                                            self.gs_params,
                                            n_jobs = n_job,
                                            cv = self.cv if not no_cv else None,
                                            scoring = self._scoring,
                                            verbose= 0)
            self._use_grid_search = True
            if self.cv is not None and not no_cv:
                logger.info(f"Grid Search with {self.cv if isinstance(self.cv, int) else self.cv.get_n_splits()}-folds cross-validation Initialization Completed")
            else:
                # cv=None does not disable CV: GridSearchCV then uses its default 5-fold split
                logger.info("Grid Search with default cross-validation Initialization Completed")
            
            self.print_empty()
     
    def init_pipeline(self):
        if self.model is None:
            logger.error(f"No Model is initiated")
            return
        if len(self._pipeline_steps) == 0:
            logger.error(f"No steps in pipeline")
            return
        else:           
            self.pipeline = Pipeline(self._pipeline_steps,verbose=False)
            self.pipeline.set_output(transform="pandas")
            
            
    # ===================== Set Function ============================== 
        
    def set_kfold_cv(self,cv = 5):
        if cv is None:
            self.cv = None
            return
        assert isinstance(cv,int), "'cv' must be integer"
        assert cv >= 2, "'cv' must be greater than 1"
        self.cv = StratifiedKFold(n_splits = cv)
        
        
    

    def _set_one_hot_col(self,columns=[]):
        if len(columns) == 0:
            return
        else:
            for col in columns:
                assert col in self.X_train.columns, f"{col} not found in X_train"
                assert col in self.X_test.columns, f"{col} not found in X_test"
            self._one_hot_col = columns
            
            
    def _set_target_encode_col(self,columns=[]):
        if len(columns) == 0:
            return
        else:
            for col in columns:
                assert col in self.X_train.columns, f"{col} not found in X_train"
                assert col in self.X_test.columns, f"{col} not found in X_test"
            self._tar_end_col = columns
            
            
    def _set_standard_scaling_col(self,columns=[]):
        if len(columns) == 0:
            return
        else:
            for col in columns:
                assert col in self.X_train.columns, f"{col} not found in X_train"
                assert col in self.X_test.columns, f"{col} not found in X_test"
            self._ss_col = columns
        
    def _set_min_max_scaling_col(self,columns=[]):
        if len(columns) == 0:
            return
        else:
            for col in columns:
                assert col in self.X_train.columns, f"{col} not found in X_train"
                assert col in self.X_test.columns, f"{col} not found in X_test"
            self._mm_col = columns
            
    def _set_grid_search_params(self,params):
        available_prefix = [step[0] for step in self.pipeline.steps]
        for param in params:
            assert param.split("__")[0] in available_prefix, f"Grid Search Params {param} not found. Only found: {available_prefix}"
        self.gs_params = params    
    
    # ================== Model Fitting =================
    def fit(self):
        if self._use_grid_search:
            start_time = time.time()
            logger.info(f"Model Fitting with Grid Search...")
            self.grid_search.fit(self.X_train,self.y_train)
            self.cv_results = pd.DataFrame(self.grid_search.cv_results_)
            self.cv_results.to_csv(f"log/gs-{self._model_name}-{datetime.datetime.now().strftime('%Y-%m-%d--%H-%M-%S')}.csv")
            logger.info("Grid Search Cross Validation Results Exported.")
        else:
            start_time = time.time()
            logger.info(f"Model Fitting...")
            self.pipeline.fit(self.X_train,self.y_train)
            self.train_y_pred = self.pipeline.predict(self.X_train)
            
        end_time = time.time()
        self._fit_time = end_time - start_time
        logger.info(f"Total Fitting Time:{self.time_to_str(self._fit_time)}")
        self.print_empty()
        
    
    # ================== Model Prediction =================
    def predict(self):
        logger.info("Predicting on Test Data...")
        start_time = time.time()
        if self._use_grid_search:
            try:
                self.y_prob = self.grid_search.predict_proba(self.X_test)
            except AttributeError as e:
                logger.error(f"{e}")
            self.y_pred = self.grid_search.predict(self.X_test)
        else:
            try:
                self.y_prob = self.pipeline.predict_proba(self.X_test)
            except AttributeError as e:
                logger.error(f"{e}")
            self.y_pred = self.pipeline.predict(self.X_test)

        end_time = time.time()
        self._predict_time = end_time - start_time
        logger.info(f"Total Predicting Time:{self.time_to_str(self._predict_time)}")
        self.print_empty()
        
        
    # ================== Model Evaluation =================
    def get_best_train_score(self):
        if self._use_grid_search:
            return self.grid_search.best_score_
        else:
            return self._scoring_func(self.y_train,self.train_y_pred)
        
    def print_best_train_score(self):
        logger.info(f"Best Train Score: {self.get_best_train_score()}")
        
    def get_test_score(self):
        if self.y_pred is not None:
            return self._scoring_func(self.y_test,self.y_pred)
        else:
            logger.error(f"No y_pred found for testing score.")
    
    def print_test_score(self):
        if self.y_pred is not None:
            logger.info(f"Test Score: {self.get_test_score()}")
        else:
            logger.error(f"No y_pred found for testing score.")
        
    def get_best_params(self):
        if self._use_grid_search:
            return self.grid_search.best_params_
        else:
            return self.pipeline.steps[-1][1].get_params()
        
    def print_best_params(self):
        self.print_title(title = "Parameters")
        pprint.pprint(self.get_best_params())
        
    def compute_raw_confusion_matrix(self):
        if self.y_pred is not None:
            logger.info("Computing Confusion Matrix...")
            self._raw_cm = confusion_matrix(self.y_test,self.y_pred)
        else:
            logger.error("Please make prediction first.")
    
    def compute_confusion_matrix(self):
        if self._raw_cm is None:
            self.compute_raw_confusion_matrix()
        self._cm = pd.DataFrame(self._raw_cm,
                                index = [f"actual_{i}" for i in self._classes],
                                columns = [f"predict_{i}" for i in self._classes])
        
    def print_confusion_matrix(self):
        if self._cm is None:
            self.compute_confusion_matrix()
                    
        self.print_title(title = "Confusion Matrix on Test Data")
        print(self._cm)
    
    # ================== Helper Function =================
    @staticmethod
    def time_to_str(t):
        time_list = str(datetime.timedelta(seconds=t)).split(".")[0].split(":")
        return f"\t{time_list[0]} hours {time_list[1]} minutes {time_list[2]} seconds"
    
    def print_data_size(self, title = "Data Shape"):
        self.print_title(title)
        logger.info(f"X_train: {self.X_train.shape} | y_train: {self.y_train.shape}")
        logger.info(f"X_test : {self.X_test.shape} | y_test : {self.y_test.shape}")
        self.print_empty()
        
    @staticmethod
    def print_title(title):
        num_of_equal = 15
        logger.info(f"{'='*num_of_equal} {title} {'='*num_of_equal}")
        
    @staticmethod
    def print_empty():
        logger.info("")
        
    @staticmethod
    def random_over_sampling(X,y):
        assert X.shape[0] == y.shape[0]

        # Work on re-indexed copies so the caller's objects are not modified
        X = X.reset_index(drop = True)
        y = y.reset_index(drop = True)

        # Join both tables
        target_col = "price_range"
        X[target_col] = y

        # Get the class in target variable
        _classes = sorted(list(set(y)))

        # Value Count for each class
        val_count = y.value_counts()
        # print(val_count)

        # Get the maximum count
        max_count = max(val_count)

        # list to store sampled data
        new_data_list = []

        for i in _classes:
            diff = max_count - val_count[i]

            # Start from a fixed seed for reproducibility
            rand_seed = 0

            # Cache the rows that belong to this class
            this_X = X[X[target_col]==i]

            # Cap each batch at 5000 rows (or the class size) to avoid duplicating the whole class at once => more variance
            sample_batch_size = min(diff,min(val_count[i],5000))

            while diff > 0:
                # Each batch is drawn without replacement, with a new seed per batch
                sampled_data = this_X.sample(min(diff,sample_batch_size), # the last batch uses the remaining diff so all classes end up the same size
                                  replace = False, # Ensure no duplicates within this batch
                                  random_state = rand_seed)
                new_data_list.append(sampled_data)
                diff -= sample_batch_size
                rand_seed += 1

        new_data_list.append(X)
        new_data = pd.concat(new_data_list)
        new_data.reset_index(drop = True, inplace = True)
        new_X = new_data.drop(columns = [target_col])
        new_y = new_data[target_col]

        return new_X, new_y
    

    def add_perf(self,model_res: ModelPerf):
        this_dict = model_performance_dict.copy()
        
        this_dict["model_name"] = self._model_name
        this_dict["model"] = self.model.__class__.__name__
        
        this_dict["sub_sampling"] = self._subsampled
        this_dict["sub_sampling_time"] = self._subsample_time
        this_dict["sub_sampling_pct_X_train"] = self._train_subsample_size
        this_dict["sub_sampling_pct_X_test"] = self._test_subsample_size
        
        this_dict["over_sampling"] = self._oversampled
        this_dict["over_sampling_time"] = self._oversampling_time
        
        this_dict["X_train_shape"] = self.X_train.shape
        this_dict["X_test_shape"] = self.X_test.shape
        
        if self._tar_end_transformer is not None:
            this_dict["target_encoding"] = True
            this_dict["target_encoding_col"] = self._tar_end_col
            
        
        if self._one_hot_transformer is not None:
            this_dict["one_hot_encoding"] = True
            this_dict["one_hot_encoding_col"] = self._one_hot_col
            
        if self._standard_scaler is not None:
            this_dict["standard_scaling"] = True
            this_dict["standard_scaling_col"] = self._ss_col
            
        if self._min_max_scaler is not None:
            this_dict["min_max_scaling"] = True
            this_dict["min_max_scaling_col"] = self._mm_col
            
        if self.pca is not None:
            this_dict["pca"] = True
            this_dict["pca_n_components"] = self._pca_n_components
            
        this_dict["scoring"] = self._scoring
        if self._use_grid_search:
            this_dict["grid_search"] = self._use_grid_search 
            this_dict["best_params"] = self.get_best_params()
            this_dict["best_train_score"] = self.get_best_train_score()
        else:
            this_dict["params"] = self.get_best_params()
            this_dict["train_score"] = self.get_best_train_score()
            
        
        this_dict["test_score"] = self.get_test_score()
        if self._cm is not None:
            this_dict["confusion_matrix"] = self._raw_cm
            
            
        model_res.add_data(this_dict)
        return model_res

In [157]:
# Log the model performance
perf = ModelPerf("log/model_performance.csv")

In [158]:
# Use 20% of the data to train the model and make prediction on the whole test data 
global_subsample_pct = 0.2

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model - Logistic Regression (Baseline Model) </h3>
</a>
<br>
<b>Why is Logistic Regression the baseline model?</b>
<ol>
    <li><b>Simplicity</b></li>
    <p>For a classification task, <b>Logistic Regression</b> is a simple and interpretable linear model. In its basic form it models the relationship between the input features and a binary outcome by applying the logistic function to a linear combination of the input features; the multinomial form extends this to more than two classes. </p>       
    <li><b>Interpretability</b></li>
    <p>The coefficients in Logistic Regression provide a direct interpretation of the impact of each feature on the log-odds of the outcome. This interpretability is valuable for gaining insights into the relationships between predictors and the target variable. </p>         
    <li><b>Robustness to Irrelevant Features</b></li>
    <p>Logistic Regression can be robust to irrelevant features, because regularization (e.g., L1 or L2) helps prevent overfitting and suppresses the impact of less informative features. </p>     

</ol>

Therefore, any model tested later on should be compared with the Logistic Regression baseline model.    
    
</div>
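
To make the log-odds interpretation concrete, here is a minimal sketch in plain Python for the binary case. The weights, bias and feature values are made up for illustration and do not come from the fitted model.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for one sample x."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(score)

def log_odds(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for two already-scaled features
weights = [0.8, -1.5]
bias = 0.2

x = [1.0, 0.5]
x_plus = [2.0, 0.5]   # same sample, first feature increased by one unit

p = predict_proba(weights, bias, x)
p_plus = predict_proba(weights, bias, x_plus)

# Raising a feature by one unit shifts the log-odds by its coefficient
delta = log_odds(p_plus) - log_odds(p)
print(round(delta, 6))
```

The printed shift equals the first coefficient, which is why coefficients on scaled features can be compared directly.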

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>⚠️ Note:</b></font>

For the baseline, we use plain Logistic Regression to get a first, simple modelling result.
    
</div>

#### Logistic Regression

In [160]:
from sklearn.linear_model import LogisticRegression

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    model.init_standard_scaler(columns=["make","model","trim","body_type"])
    model.init_scaler() # Activate all scalers and add to pipeline

    # Instantiate model and add to pipeline
    model_name = "logit_v1"
    model.init_model(name = model_name,
                     model = LogisticRegression(penalty = "l2", # Avoid Overfitting
                                                max_iter=10000,
                                                C = 1))
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
logit_v1 = model()

2023-11-09 20:09:07.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 20:09:07.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:09:07.f - INFO >>> 
2023-11-09 20:09:09.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 20:09:09.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 20:09:09.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:09:09.f - INFO >>> 
2023-11-09 20:09:09.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 20:09:09.f - INFO >>> 
2023-11-09 20:09:09.f - INFO >>> Target Encoder Initialization Completed
2023-11-09 20:09:09.f - INFO >>> 
2023-11-09 20:09:09.f - INFO >>> Standard Scaler Initialization Completed
2023-11-09 20:09:09.f - INFO >>> 
2023-11-09 20:09:09.f - INFO >>> LogisticRegression (name: logit_v1) Initialization Completed
2023-11-09 20:09:09.f - INFO >>> 
2023-11-09 20:09:09.f - INFO >>> Model Fitting...
2023-11-09 20:10:25.f -

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>
    
- The confusion matrix suggests some class imbalance. Let's try oversampling.
</div>
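
As a reference for what oversampling does, here is a minimal sketch of random oversampling in plain Python. It is a simplified stand-in for `MyModel.random_over_sampling`, not the same implementation: it draws with replacement instead of in capped batches, and the toy rows and labels are made up.

```python
import random
from collections import Counter, defaultdict

def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())

    new_rows, new_labels = list(rows), list(labels)
    for label, members in by_class.items():
        extra = target - len(members)
        # Draw the missing rows with replacement from the same class
        for row in rng.choices(members, k=extra):
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels

# Toy data: class "low" is three times as common as class "high"
rows = [[1], [2], [3], [4], [5], [6], [7], [8]]
labels = ["low"] * 6 + ["high"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
counts = Counter(balanced_labels)
print(counts)
```

As in the class above, only the training data is resampled; the test set keeps its original class distribution so that the test score stays representative.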

#### Logistic Regression + Oversampling

In [161]:
from sklearn.linear_model import LogisticRegression

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment
    model.init_over_sampling() # Oversampling for class imbalance

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline
    
    # Instantiate scaler steps
    model.init_standard_scaler(columns=["make","model","trim","body_type"])
    model.init_scaler() # Activate all scalers and add to pipeline

    # Instantiate model and add to pipeline
    model_name = "logit_v2"
    model.init_model(name = model_name,
                     model = LogisticRegression(penalty = "l2", # Avoid Overfitting
                                                max_iter=10000,
                                                C = 1))
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
logit_v2 = model()

2023-11-09 20:10:31.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 20:10:31.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:10:31.f - INFO >>> 
2023-11-09 20:10:33.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 20:10:33.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 20:10:33.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:10:33.f - INFO >>> 
2023-11-09 20:10:33.f - INFO >>> Oversampling on Train Data...
2023-11-09 20:10:34.f - INFO >>> Oversampling Completed | Time elapsed: 	0 hours 00 minutes 00 seconds
2023-11-09 20:10:34.f - INFO >>> X_train: (779665, 21) | y_train: (779665,)
2023-11-09 20:10:34.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:10:34.f - INFO >>> 
2023-11-09 20:10:34.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 20:10:34.f - INFO >>> 
2023-11-09 20:10:34.f - INFO >>> Target Encoder Initialization C

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>

- From the above confusion matrix result, it seems that `Oversampling` does not help much with Logistic Regression.
- Let's apply `grid search` for the best parameters.
</div>
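
Because the grid search wraps the whole pipeline, each parameter name takes the form `<step name>__<parameter>`, which is what `_set_grid_search_params` checks. The sketch below, in plain Python, shows how such a grid expands into candidate settings. The step names are illustrative and `expand_grid` is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

def expand_grid(param_grid, step_names):
    """Validate step prefixes, then return every parameter combination as a dict."""
    for name in param_grid:
        prefix = name.split("__")[0]
        if prefix not in step_names:
            raise ValueError(f"Unknown pipeline step in {name!r}; available: {step_names}")
    keys = sorted(param_grid)
    return [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]

steps = ["encoders", "scalers", "logit_v3"]   # illustrative pipeline step names
grid = {"logit_v3__C": [10 ** i for i in [-3, -1, 0, 1, 3]]}

candidates = expand_grid(grid, steps)
n_folds = 5
print(len(candidates), "candidates x", n_folds, "folds =", len(candidates) * n_folds, "fits")
```

With five values of `C` and 5-fold cross-validation, the search fits the pipeline 25 times, plus one final refit of the best candidate when `GridSearchCV` is left at its default `refit=True`.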

#### Logistic Regression + Grid Search 5-fold CV

In [162]:
from sklearn.linear_model import LogisticRegression

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    model.init_standard_scaler(columns=["make","model","trim","body_type"])
    model.init_scaler() # Activate all scalers and add to pipeline

    # Instantiate model and add to pipeline
    model_name = "logit_v3"
    model.init_model(name = model_name,
                     model = LogisticRegression(penalty = "l2", # Avoid Overfitting
                                                max_iter=10000))
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Set K-Fold CV
    model.set_kfold_cv(5)
    
    # Instantiate the grid search
    model.init_grid_search(params={f"{model_name}__C":[10 ** i for i in [-3,-1,0,1,3]],
                                  },
                           n_job = 1
                          )

    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
logit_v3 = model()

2023-11-09 20:14:56.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 20:14:56.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:14:56.f - INFO >>> 
2023-11-09 20:14:58.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 20:14:58.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 20:14:58.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:14:58.f - INFO >>> 
2023-11-09 20:14:58.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 20:14:58.f - INFO >>> 
2023-11-09 20:14:58.f - INFO >>> Target Encoder Initialization Completed
2023-11-09 20:14:58.f - INFO >>> 
2023-11-09 20:14:58.f - INFO >>> Standard Scaler Initialization Completed
2023-11-09 20:14:58.f - INFO >>> 
2023-11-09 20:14:58.f - INFO >>> LogisticRegression (name: logit_v3) Initialization Completed
2023-11-09 20:14:58.f - INFO >>> 
2023-11-09 20:14:58.f - INFO >>> Grid Search with 5-folds cross-validatio

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment - Logistic Regression</b></font>

    
- From the above results for logistic regression, we obtained:
|Model|Best Params / Params|training F1|testing F1|
|---|---|:---:|:---:|
|`Logistic Regression`|`{'C': 1}`|`71.93%`|`71.98%`|
|`Logistic Regression` (with Oversampling)|`{'C': 1}`|`68.91%`|`69.41%`|
|`Logistic Regression` (with 5-fold Grid Search CV)|`{'C': 1000}`|`71.78%`|`71.76%`|


- Oversampling does not help in our case: training and testing F1 both drop by roughly 3 percentage points.
- The first model gives a general sense of how well a logistic model can perform on our task, but we should always use cross-validation to obtain a more reliable score for model evaluation.
- For the following models, we will always apply 5-fold cross-validation.
- Therefore, we will use the third model (`Logistic Regression with 5-fold Grid Search CV`) as our baseline model.


</div>
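As a rough illustration of what 5-fold cross-validation does with the training rows, here is a minimal standard-library sketch. The helper `k_fold_indices` and the toy sizes are hypothetical; the notebook itself delegates this step to `MyModel.set_kfold_cv` and the grid search, whose internals are not shown here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (no shuffling)."""
    # The first n_samples % k folds get one extra sample, so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Toy usage: every sample lands in exactly one validation fold.
folds = list(k_fold_indices(n_samples=12, k=5))
val_sizes = [len(val) for _, val in folds]
all_val = sorted(i for _, val in folds for i in val)
print(val_sizes)                     # size of each validation fold
print(all_val == list(range(12)))    # each index is validated exactly once
```

Each candidate parameter setting is scored on all five validation folds and the scores are averaged, so a 5-fold grid search over G parameter combinations fits 5 × G models before any final refit.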

In [None]:
# TODO: Interpret Model coeff

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model - Decision Tree </h3>
</a>
    
Besides logistic regression, we can also try a decision tree.

</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>⚠️ Note:</b></font>

- For a decision tree, we can skip scaling the data: each split thresholds a single feature, so it depends only on the ordering of that feature's values, not on their scale or on distances between points.
    
</div>
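The scale-invariance claim can be checked on a toy example. The sketch below is a simplified, hypothetical stand-in for a tree's split search (single feature, Gini impurity, standard library only) and is not how `DecisionTreeClassifier` is implemented internally. The feature values and labels are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_partition(values, labels):
    """Return the sample indices on the left of the lowest-Gini threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # Only split between distinct feature values.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, sorted(left)
    return best_left


# Hypothetical toy feature (e.g. mileage) and price-class labels.
mileage = [12000.0, 45000.0, 30000.0, 98000.0, 76000.0, 150000.0]
price_class = [2, 1, 2, 0, 1, 0]

# Standardize the feature: subtract the mean, divide by the standard deviation.
mean = sum(mileage) / len(mileage)
std = (sum((v - mean) ** 2 for v in mileage) / len(mileage)) ** 0.5
scaled = [(v - mean) / std for v in mileage]

raw_partition = best_split_partition(mileage, price_class)
scaled_partition = best_split_partition(scaled, price_class)
print(raw_partition == scaled_partition)  # same samples go left either way
```

Standardization is a monotonic transformation, so the sorted order of the samples, and therefore every candidate partition, is unchanged; only the numeric threshold moves.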

#### Decision Tree + Grid Search 5-fold CV

In [163]:
from sklearn.tree import DecisionTreeClassifier

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    # Instantiate model and add to pipeline
    this_model_name = "dtc_v1"
    model.init_model(name = this_model_name,
                     model = DecisionTreeClassifier())
    
    # Instantiate the entire pipeline
    model.init_pipeline()
    
    # Set K-Fold CV
    model.set_kfold_cv(5)
    
    # Instantiate the grid search
    model.init_grid_search(params={f"{this_model_name}__max_depth":[5,10,25,50],
                                   f"{this_model_name}__min_samples_split":[10,25,50,100]
                                  },
                           n_job = 1
                          )

    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
dtc_v1 = model()

2023-11-09 20:40:14.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 20:40:14.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:40:14.f - INFO >>> 
2023-11-09 20:40:16.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 20:40:16.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 20:40:16.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:40:16.f - INFO >>> 
2023-11-09 20:40:16.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 20:40:16.f - INFO >>> 
2023-11-09 20:40:16.f - INFO >>> Target Encoder Initialization Completed
2023-11-09 20:40:16.f - INFO >>> 
2023-11-09 20:40:16.f - INFO >>> DecisionTreeClassifier (name: dtc_v1) Initialization Completed
2023-11-09 20:40:16.f - INFO >>> 
2023-11-09 20:40:16.f - INFO >>> Grid Search with 5-folds cross-validation Initialized Completed
2023-11-09 20:40:16.f - INFO >>> 
2023-11-09 20:40:16.f - INFO >>> Model Fitting w

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>

- Let's apply oversampling to see whether it helps the decision tree.

</div>

#### Decision Tree + Oversampling + Grid Search 5-fold CV

In [164]:
from sklearn.tree import DecisionTreeClassifier

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment
    model.init_over_sampling() # Oversampling for class imbalance

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    # Instantiate model and add to pipeline
    this_model_name = "dtc_v2"
    model.init_model(name = this_model_name,
                     model = DecisionTreeClassifier())
    
    # Instantiate the entire pipeline
    model.init_pipeline()
    
    # Set K-Fold CV
    model.set_kfold_cv(5)
    
    # Instantiate the grid search
    model.init_grid_search(params={f"{this_model_name}__max_depth":[5,10,25,50],
                                   f"{this_model_name}__min_samples_split":[10,25,50,100]
                                  },
                           n_job = 1
                          )

    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
dtc_v2 = model()

2023-11-09 20:43:14.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 20:43:14.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:43:14.f - INFO >>> 
2023-11-09 20:43:16.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 20:43:16.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 20:43:16.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:43:16.f - INFO >>> 
2023-11-09 20:43:16.f - INFO >>> Oversampling on Train Data...
2023-11-09 20:43:16.f - INFO >>> Oversampling Completed | Time elapsed: 	0 hours 00 minutes 00 seconds
2023-11-09 20:43:16.f - INFO >>> X_train: (779665, 21) | y_train: (779665,)
2023-11-09 20:43:16.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:43:16.f - INFO >>> 
2023-11-09 20:43:16.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 20:43:16.f - INFO >>> 
2023-11-09 20:43:16.f - INFO >>> Target Encoder Initialization C

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment:</b></font>

- Oversampling does not seem to improve performance with the decision tree.
</div>
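The log above shows the training set growing from 352,890 to 779,665 rows, which equals 5 × 155,933 and is consistent with each of five classes being brought up to the size of the largest one. The sketch below illustrates that idea with plain random oversampling (duplicating minority rows with replacement). This is an assumption for illustration: `init_over_sampling` may use a different technique, and the helper `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep all original rows, then draw extra copies with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical imbalanced toy data: 6 rows of class 0, 3 of class 1, 1 of class 2.
rows = [[i] for i in range(10)]
labels = [0] * 6 + [1] * 3 + [2] * 1

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(labels))       # class counts before oversampling
print(Counter(new_labels))   # class counts after oversampling
```

Because duplicated rows carry no new information, a deep tree can memorize them, which is one plausible reason for a high training score that does not carry over to the untouched test set.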

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>🧠 Idea:</b></font>

What if we use PCA to reduce the dimensionality of the features?

</div>
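The grid in the next cell passes fractional values (0.9 and 0.95) for `n_components`. Assuming `init_pca` wraps scikit-learn's `PCA`, a fraction asks for just enough leading components to exceed that share of the total variance. The selection rule can be sketched as follows; the helper `components_for_variance` and the variance ratios are hypothetical, and the library's own boundary handling may differ in edge cases.

```python
def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio exceeds `threshold`."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative > threshold:
            return k
    # Threshold never exceeded: keep every component.
    return len(explained_variance_ratio)


# Hypothetical explained-variance ratios, sorted from largest to smallest.
ratios = [0.50, 0.25, 0.125, 0.0625, 0.0625]

print(components_for_variance(ratios, 0.90))  # components kept for a 90% target
print(components_for_variance(ratios, 0.95))  # components kept for a 95% target
```

A higher target keeps more components, so the 0.95 setting discards less information than 0.9 at the cost of a wider input to the tree.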

#### Decision Tree + PCA + Grid Search 5-fold CV

In [165]:
from sklearn.tree import DecisionTreeClassifier

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    # Instantiate PCA step
    pca_name = "pca"
    model.init_pca(name=pca_name)

    # Instantiate model
    model_name = "dtc_v3"
    model.init_model(name = model_name,
                     model = DecisionTreeClassifier())
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Set K-Fold CV
    model.set_kfold_cv(5)
    
    # Instantiate the grid search
    model.init_grid_search(params={f"{model_name}__max_depth":[5,10,25,50],
                                   f"{model_name}__min_samples_split":[5,10,25,50],
                                   f"{pca_name}__n_components":[0.9,0.95],
                                  },
                           n_job = 1
                          )
    
    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
dtc_v3 = model()



2023-11-09 20:49:41.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 20:49:41.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:49:41.f - INFO >>> 
2023-11-09 20:49:43.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 20:49:43.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 20:49:43.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 20:49:43.f - INFO >>> 
2023-11-09 20:49:43.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 20:49:43.f - INFO >>> 
2023-11-09 20:49:43.f - INFO >>> Target Encoder Initialization Completed
2023-11-09 20:49:43.f - INFO >>> 
2023-11-09 20:49:43.f - INFO >>> PCA Initialization Completed
2023-11-09 20:49:43.f - INFO >>> 
2023-11-09 20:49:43.f - INFO >>> DecisionTreeClassifier (name: dtc_v3) Initialization Completed
2023-11-09 20:49:43.f - INFO >>> 
2023-11-09 20:49:43.f - INFO >>> Grid Search with 5-folds cross-validation Initiali

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment - Decision Tree</b></font>
    
- From the above results for Decision Tree, we obtained:
|Model|Best Params / Params|training F1|testing F1|
|---|---|:---:|:---:|
|`Decision Tree` (with 5-fold Grid Search CV)|`{'max_depth': 25, 'min_samples_split': 50}`|`83.57%`|`84.04%`|
|`Decision Tree` (with Oversampling & 5-fold Grid Search CV)|`{'max_depth': 50, 'min_samples_split': 10}`|`91.70%`|`81.97%`|
|`Decision Tree` (with PCA & 5-fold Grid Search CV)|`{'max_depth': 25, 'min_samples_split': 50, 'n_components': 0.95}`|`81.55%`|`82.05%`|


- Recall that our baseline model (Logistic Regression) returned a weighted F1 score of around 72%; the decision tree works better on our data, reaching about 84% on the test set.
- Although oversampling raises the training score considerably (91.70%), it does not improve the testing score (81.97% versus 84.04% without it), which suggests the oversampled tree is overfitting.
- PCA does not help either. Perhaps our data is already clean enough that PCA only removes useful information.
</div>
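The scores in this notebook are weighted F1 scores (the performance log records `f1_weighted` as the scoring metric). For reference, here is a minimal standard-library sketch of that metric: the per-class F1 scores are averaged, each weighted by the class's share of the true labels. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score


# Hypothetical labels for three price classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means the larger price classes dominate the score, which is worth keeping in mind when comparing runs with and without oversampling.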

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model - Adaptive Boosting</h3>
</a>

</div>

#### AdaBoost + Grid Search 5-fold CV (Based on the Best Decision Tree)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment
    # model.init_over_sampling() # Oversampling for class imbalance

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    # Instantiate model
    model_name = "abc_v1"
    model.init_model(name = model_name,
                     model = AdaBoostClassifier(estimator = DecisionTreeClassifier(max_depth = 25,
                                                                                   min_samples_split = 50)
                                                )
                    )
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Set K-Fold CV
    model.set_kfold_cv(5)
    
    # Instantiate the grid search
    model.init_grid_search(params={f"{model_name}__n_estimators":[100,150], 
                                   f"{model_name}__learning_rate":[0.95,1,1.05],
                                  },
                           n_job = 1)
    
    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
abc_v1 = model()


2023-11-09 23:34:20.f - INFO >>> X_train: (1764450, 21) | y_train: (1764450,)
2023-11-09 23:34:20.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 23:34:20.f - INFO >>> 
2023-11-09 23:34:22.f - INFO >>> Subsampling Completed | Time elapsed: 	0 hours 00 minutes 01 seconds
2023-11-09 23:34:22.f - INFO >>> X_train: (352890, 21) | y_train: (352890,)
2023-11-09 23:34:22.f - INFO >>> X_test : (588151, 22) | y_test : (588151,)
2023-11-09 23:34:22.f - INFO >>> 
2023-11-09 23:34:22.f - INFO >>> One-hot Encoder Initialization Completed
2023-11-09 23:34:22.f - INFO >>> 
2023-11-09 23:34:22.f - INFO >>> Target Encoder Initialization Completed
2023-11-09 23:34:22.f - INFO >>> 
2023-11-09 23:34:22.f - INFO >>> AdaBoostClassifier (name: abc_v1) Initialization Completed
2023-11-09 23:34:22.f - INFO >>> 
2023-11-09 23:34:22.f - INFO >>> Grid Search with 5-folds cross-validation Initialized Completed
2023-11-09 23:34:22.f - INFO >>> 
2023-11-09 23:34:22.f - INFO >>> Model Fitting with 

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment - AdaBoost</b></font>

TODO: Edit Score  
TODO:Comment
    
- From the above results for Adaptive Boosting, we obtained:
|Model|Best Params / Params|training F1|testing F1|
|---|---|:---:|:---:|
|`AdaBoost` (with 5-fold Grid Search CV)|`__`|`__%`|`__%`|

    

</div>

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model - Stochastic Gradient Boosting</h3>
</a>

</div>

#### Gradient Boosting + Grid Search 5-fold CV

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    # Instantiate model
    model_name = "gbc_v1"
    model.init_model(name = model_name,
                     model = GradientBoostingClassifier(max_depth = 25,
                                                        min_samples_split = 50,
                                                        n_iter_no_change = 3,
                                                        subsample = 0.5, # Apply Stochastic Gradient Boosting
                                                        verbose= 0))
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Set K-Fold CV
    model.set_kfold_cv(5)
    
    # Instantiate the grid search
    model.init_grid_search(params={f"{model_name}__learning_rate":[0.05,0.1],
                                   f"{model_name}__subsample":[0.5,0.75,0.9],
                                   f"{model_name}__n_estimators":[50,100,200],
                                  },
                           n_job = 1)
    
    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
gbc_v1 = model()

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment - Stochastic Gradient Boosting</b></font>

TODO: Edit Score  
TODO: Comment
    
- From the above results for Stochastic Gradient Boosting, we obtained:
|Model|Best Params / Params|training F1|testing F1|
|---|---|:---:|:---:|
|`Stochastic Gradient Boosting` (with 5-fold Grid Search CV)|`__`|`__%`|`__%`|

    

</div>

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model - XGBoost </h3>
</a>

</div>

#### XGBoost + Grid Search 5-fold CV

In [None]:
import xgboost as xgb


def model(test = False):
    model = MyModel(X_train,X_test,
                    y_train,y_test,
                    train_subsample_size=global_subsample_pct,
                    test_subsample_size=None)

    # Apply Subsampling for faster training in experimental stage
    model.init_subsampling() # Subsample data for faster deployment

    # Instantiate encoder steps
    model.init_one_hot_encoding(columns=["vehicle_type","drivetrain","transmission","engine_block"])
    model.init_target_encoding(columns=["make","model","trim","body_type"])
    model.init_encoder() # Activate all encoders and add to pipeline

    # Instantiate model and add to pipeline
    model_name = "xgb_v1"
    model.init_model(name = model_name,
                     model = xgb.XGBClassifier(n_estimators = 100))
    
    # Instantiate the entire pipeline
    model.init_pipeline()

    # Set K-Fold CV
    # model.set_kfold_cv(5)
    
    # Instantiate the grid search
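    # NOTE: XGBoost has no `min_samples_split`; `min_child_weight` is its closest
    # counterpart (a minimum Hessian sum per child rather than a sample count).
    model.init_grid_search(params={f"{model_name}__max_depth":[5,10,25,50],
                                   f"{model_name}__min_child_weight":[10,25,50,100]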
                                  },
                           n_job = 1)

    # Fit the Pipeline
    model.fit()
    
    # Make Prediction on X_test
    model.predict()
    
    # Print the train score (without GridSearch) / best train score (with GridSearch)
    model.print_best_train_score()
    
    # Print the test score
    model.print_test_score()
    
    # Print the best params
    model.print_best_params()
    
    # Print the confusion matrix
    model.print_confusion_matrix()

    # Add model performance
    if not test:
        global perf
        perf = model.add_perf(perf)
    
    return model

# Run this model
xgb_v1 = model(True)

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<font size="4px" color="#ffa600"><b>💬 Comment - XGBoost</b></font>

TODO: Edit Score  
TODO: Comment
    
- From the above results for XGBoost, we obtained:
|Model|Best Params / Params|training F1|testing F1|
|---|---|:---:|:---:|
|`XGBoost` (with 5-fold Grid Search CV)|`__`|`__%`|`__%`|


</div>

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border:#33a0ff solid; padding: 15px; background-color: #f0f1ff; font-size:100%; text-align:left">
<a class="anchor" id="4-evaluate">
    <h2> Model Evaluation </h2>
</a>





</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model Comparison </h3>
</a>
    
TODO: Summarise    

- Comparing the best model of each type, we obtained:
|Model|Best Params|training F1|testing F1|
|---|---|:---:|:---:|
|`Logistic Regression` (with 5-fold Grid Search CV)|`{'C': 1000}`|`71.78%`|`71.76%`|
|`Decision Tree` (with 5-fold Grid Search CV)|`{'max_depth': 25, 'min_samples_split': 50}`|`83.57%`|`84.04%`|




</div>

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<a class="anchor" id="4-model">
    <h3> Model Training Result </h3>
</a>
</div>

In [137]:
p = ModelPerf("log/model_performance.csv")
temp = p.get_data()#[["model","test_score","standard_scaling","pca","grid_search","best_params","best_train_score","params","train_score"]]
temp.sort_values(by=["test_score"],ascending = [False])

Unnamed: 0,model_name,model,sub_sampling,sub_sampling_time,sub_sampling_pct_X_train,sub_sampling_pct_X_test,over_sampling,over_sampling_time,X_train_shape,X_test_shape,target_encoding,target_encoding_col,one_hot_encoding,one_hot_encoding_col,standard_scaling,standard_scaling_col,min_max_scaling,min_max_scaling_col,pca,pca_n_components,scoring,grid_search,best_params,best_train_score,params,train_score,test_score,confusion_matrix
15,abc_v1,AdaBoostClassifier,True,2.548039,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,True,"{'abc_v1__learning_rate': 0.9, 'abc_v1__n_esti...",0.804023,,,0.801451,[[1377 212 0 0 0]\n [ 286 2166 147 ...
3,dtc_v1,DecisionTreeClassifier,True,4.236204,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,False,,,"{'ccp_alpha': 0.0, 'class_weight': None, 'crit...",0.826225,0.778531,[[1335 250 4 0 0]\n [ 253 2145 196 ...
11,dtc_v1,DecisionTreeClassifier,True,4.204373,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,True,"{'dtc_v1__max_depth': 10, 'dtc_v1__min_samples...",0.771995,,,0.776918,[[1341 246 2 0 0]\n [ 230 2148 217 ...
9,dtc_v1,DecisionTreeClassifier,True,4.139838,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,True,"{'dtc_v1__max_depth': 10, 'dtc_v1__min_samples...",0.772048,,,0.77686,[[1331 256 2 0 0]\n [ 224 2163 205 ...
7,dtc_v1,DecisionTreeClassifier,True,3.881756,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,False,,,"{'ccp_alpha': 0.0, 'class_weight': None, 'crit...",0.81146,0.773672,[[1337 249 3 0 0]\n [ 239 2148 208 ...
14,dtc_v3,DecisionTreeClassifier,True,4.296172,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],True,,f1_weighted,True,"{'dtc_v3__max_depth': 10, 'dtc_v3__min_samples...",0.765591,,,0.770113,[[1290 295 3 0 1]\n [ 212 2180 202 ...
6,dtc_v4,DecisionTreeClassifier,True,4.219699,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],True,,f1_weighted,True,"{'dtc_v4__max_depth': 15, 'dtc_v4__min_samples...",0.758576,,,0.765461,[[1312 275 2 0 0]\n [ 267 2128 198 ...
8,dtc_v2,DecisionTreeClassifier,True,4.146628,0.01,0.01,True,0.1691,"(38980, 21)","(12995, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,False,,,"{'ccp_alpha': 0.0, 'class_weight': None, 'crit...",0.821929,0.730786,[[2266 319 2 10 2]\n [ 326 1928 306 ...
4,dtc_v2,DecisionTreeClassifier,True,4.236193,0.01,0.01,True,0.221858,"(38980, 21)","(12995, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",False,[],False,[],False,,f1_weighted,False,,,"{'ccp_alpha': 0.0, 'class_weight': None, 'crit...",0.879233,0.716833,[[2254 326 9 10 0]\n [ 328 1904 319 ...
0,logit_v1,LogisticRegression,True,4.579603,0.01,0.01,False,,"(17644, 21)","(5882, 21)",True,"['make', 'model', 'trim', 'body_type']",True,"['vehicle_type', 'drivetrain', 'transmission',...",True,"['make', 'model', 'trim', 'body_type']",False,[],False,,f1_weighted,False,,,"{'C': 1, 'class_weight': None, 'dual': False, ...",0.723986,0.715027,[[1255 286 34 7 7]\n [ 274 2085 210 ...


In [None]:
# TODO: Comment

[Back-to-top](#4-toc)

---

<div style="border-radius:10px; border: #ffd500 solid; padding: 15px; background-color: #ffffcf; font-size:100%; text-align:left">
<a class="anchor" id="4-learn">
    <h2> Learning / Takeaway </h2>
</a>
</div>

[Back-to-top](#4-toc)

---