# Models

- In this notebook, we train several models and measure their performance with $MAE$, $MSE$, and $R^2$.
- Because the goal of this project is to predict `popularity`, a numeric target, we treat it as a regression problem.
- The selected models are:
    1. Decision Tree
    2. AdaBoost
    3. Random Forest
    4. Gradient Boosting (scikit-learn)
    5. Histogram-based Gradient Boosting (scikit-learn)
    6. XGBoost
    7. LightGBM 
    8. CatBoost
    9. K-Nearest Neighbors (KNN)
    10. Multilayer Perceptron (MLP)
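
For reference, the three metrics can be written with optional per-sample weights $w_i$, which is how scikit-learn evaluates them when `sample_weight` is passed (setting every $w_i = 1$ recovers the unweighted versions). Here $y_i$ is the true `popularity`, $\hat{y}_i$ the prediction, and $\bar{y}_w$ the weighted mean of the true values:

$$
\begin{aligned}
MAE &= \frac{\sum_i w_i \,\lvert y_i - \hat{y}_i \rvert}{\sum_i w_i}, \qquad
MSE = \frac{\sum_i w_i \,(y_i - \hat{y}_i)^2}{\sum_i w_i}, \\
R^2 &= 1 - \frac{\sum_i w_i \,(y_i - \hat{y}_i)^2}{\sum_i w_i \,(y_i - \bar{y}_w)^2}, \qquad
\bar{y}_w = \frac{\sum_i w_i \, y_i}{\sum_i w_i}.
\end{aligned}
$$

A negative $R^2$ means the model does worse than always predicting $\bar{y}_w$.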

---

## Preparing Helper Functions

In [2]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
import math

In [101]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from utils.lds import LDS

def split_data(df):
    # Apply LDS; the returned frame carries the per-sample `weight` column used below
    df = LDS(df)
    # Define the features and target variable
    X = df.drop(['popularity', 'weight'], axis=1)
    y = df['popularity']
    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test, weights_train, weights_test = train_test_split(
        X, y, df['weight'], test_size=0.2, random_state=42
    )
    return X_train, X_test, y_train, y_test, weights_train, weights_test
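
The implementation of `utils.lds.LDS` is not shown in this notebook. Assuming LDS stands for label distribution smoothing, the sketch below illustrates one common way such weights can be derived: bin the target, smooth the bin counts with a Gaussian kernel, and weight each sample by the inverse of its smoothed bin density. The function name `lds_weights_sketch` and the parameters `bins`, `sigma`, and `kernel_size` are hypothetical and may differ from what the project actually uses.

```python
import numpy as np
import pandas as pd


def lds_weights_sketch(df, target="popularity", bins=50, sigma=2.0, kernel_size=5):
    """Hypothetical stand-in for utils.lds.LDS: return a copy of df with a 'weight' column."""
    y = df[target].to_numpy(dtype=float)

    # Empirical label distribution as a histogram over the target
    counts, edges = np.histogram(y, bins=bins)

    # Symmetric Gaussian kernel, normalized to sum to 1
    half = kernel_size // 2
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()

    # Smoothed ("effective") label density per bin
    smoothed = np.convolve(counts, kernel, mode="same")

    # Map every sample to its bin and weight it by the inverse density
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, bins - 1)
    weights = 1.0 / np.maximum(smoothed[idx], 1e-8)

    # Rescale so the weights average to 1
    weights *= len(weights) / weights.sum()

    out = df.copy()
    out["weight"] = weights
    return out
```

With weights of this kind, samples from rare `popularity` ranges count more during training and, if the weights are also passed to the metrics, during evaluation.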

In [96]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def model_performance(model_name, y_test, y_pred, weights_test=None):
    mae = mean_absolute_error(y_test, y_pred, sample_weight=weights_test)
    mse = mean_squared_error(y_test, y_pred, sample_weight=weights_test)
    r2 = r2_score(y_test, y_pred, sample_weight=weights_test)
    print(f"Model Performance ({model_name}):\nMAE = {mae}\nMSE = {mse}\nR^2 = {r2}\n")
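
Because `model_performance` forwards `weights_test` to every metric, a weighted score and an unweighted score of the same predictions can differ substantially. The toy example below (made-up numbers, not project data) shows the effect for MAE when the one badly predicted sample carries a large weight.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Three well-predicted common samples and one poorly predicted rare sample
y_true = np.array([10.0, 10.0, 10.0, 90.0])
y_pred = np.array([10.0, 10.0, 10.0, 50.0])

# Illustrative weights: the rare sample counts five times as much
weights = np.array([1.0, 1.0, 1.0, 5.0])

mae_plain = mean_absolute_error(y_true, y_pred)                             # 40 / 4 = 10
mae_weighted = mean_absolute_error(y_true, y_pred, sample_weight=weights)   # 5 * 40 / 8 = 25

print(f"unweighted MAE = {mae_plain}, weighted MAE = {mae_weighted}")
```

This is worth keeping in mind when reading the results: a higher weighted MAE does not by itself mean the weighted model predicts worse, since the two numbers are averages over differently weighted test sets.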

---

## Models Implementation

In [97]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from xgboost import XGBRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

In [102]:
class Models:
    def __init__(self, df, lds=False):
        self.df = df
        self.lds = lds  # stored only; run_model always reports both the plain and the LDS variant
        self.X_train, self.X_test, self.y_train, self.y_test, self.w_train, self.w_test = split_data(self.df)
        
        scaler = StandardScaler()  
        scaler.fit(self.X_train)
        self.X_train_standardized = scaler.transform(self.X_train)  
        self.X_test_standardized = scaler.transform(self.X_test)  

        # Note: Normalizer rescales each sample (row) to unit norm, not each feature (column)
        normalizer = Normalizer()
        normalizer.fit(self.X_train)
        self.X_train_normalized = normalizer.transform(self.X_train)
        self.X_test_normalized = normalizer.transform(self.X_test)

        self.DecisionTree()
        self.AdaBoost()
        self.RandomForest()
        self.GB()
        self.HistGB()
        self.XGBoost()
        self.LGBM()
        self.CatBoost()
        self.KNN()
        self.MLP()

    def run_model(self, regr, model_name):
        # Baseline: unweighted fit, unweighted metrics
        regr.fit(self.X_train, self.y_train)
        y_pred = regr.predict(self.X_test)
        model_performance(model_name=model_name, y_test=self.y_test, y_pred=y_pred)

        # LDS: refit with sample weights; the metrics are weighted by the test weights too,
        # so they are not directly comparable with the baseline numbers
        regr.fit(self.X_train, self.y_train, sample_weight=self.w_train)
        y_pred = regr.predict(self.X_test)
        model_performance(model_name=f"{model_name} (with LDS)", y_test=self.y_test, y_pred=y_pred, weights_test=self.w_test)
    
    def DecisionTree(self):
        regr = DecisionTreeRegressor()
        self.run_model(regr, "Decision Tree")
    
    def AdaBoost(self):
        regr = AdaBoostRegressor(DecisionTreeRegressor(), n_estimators=50)
        self.run_model(regr, "AdaBoost")

    def RandomForest(self):
        regr = RandomForestRegressor(n_estimators=100)
        self.run_model(regr, "Random Forest")

    def GB(self):
        regr = GradientBoostingRegressor(n_estimators=100)
        self.run_model(regr, "Gradient Boosting")

    def HistGB(self):
        regr = HistGradientBoostingRegressor(max_iter=100)
        self.run_model(regr, "Hist Gradient Boosting")

    def XGBoost(self):
        regr = XGBRegressor(objective='reg:squarederror', n_estimators=100)
        self.run_model(regr, "XGBoost")

    def LGBM(self):
        regr = lgb.LGBMRegressor()
        # Note: eval_set reuses the test split, so it is not an independent validation set
        regr.fit(self.X_train, self.y_train, eval_set=[(self.X_test, self.y_test)], eval_metric='mse')
        y_pred = regr.predict(self.X_test, num_iteration=regr.best_iteration_)
        model_performance(model_name="LightGBM", y_test=self.y_test, y_pred=y_pred)

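    def CatBoost(self):
        regr = CatBoostRegressor(verbose=0)
        # Note: use_best_model=True keeps the iteration that scores best on eval_set, which is
        # the test split here, so the reported metrics are optimistic
        regr.fit(self.X_train, self.y_train, eval_set=(self.X_test, self.y_test), use_best_model=True)
        y_pred = regr.predict(self.X_test)
        model_performance(model_name="CatBoost", y_test=self.y_test, y_pred=y_pred)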

    def KNN(self):
        regr = KNeighborsRegressor(n_neighbors=5, weights='distance')
        regr.fit(self.X_train_normalized, self.y_train)
        y_pred = regr.predict(self.X_test_normalized)
        model_performance(model_name="K-Nearest Neighbors", y_test=self.y_test, y_pred=y_pred)

    def MLP(self):
        params = { 'hidden_layer_sizes' : [10,10],
            'activation' : 'relu', 'solver' : 'adam',
            'alpha' : 0.0, 'batch_size' : 10,
            'random_state' : 0, 'tol' : 0.0001,
            'nesterovs_momentum' : False,
            'learning_rate' : 'constant',
            'learning_rate_init' : 0.01,
            'max_iter' : 1000, 'shuffle' : True,
            'n_iter_no_change' : 50, 'verbose' : False }
        regr = MLPRegressor(**params)
        regr.fit(self.X_train_standardized, self.y_train)
        y_pred = regr.predict(self.X_test_standardized)
        model_performance(model_name="Multilayer Perceptron", y_test=self.y_test, y_pred=y_pred)
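
The `Models` class prepares two scaled copies of the features: `StandardScaler` output for the MLP and `Normalizer` output for KNN. The two transformers behave quite differently, which the small example below illustrates on made-up data: `StandardScaler` works per feature (column), while `Normalizer` rescales each sample (row) to unit L2 norm and therefore discards the overall magnitude of a row.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Two samples pointing in the same direction but with different magnitudes
X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# Row-wise: each sample is divided by its own L2 norm (5 and 10 here)
X_norm = Normalizer().fit_transform(X)

# Column-wise: each feature is shifted to zero mean and scaled to unit variance
X_std = StandardScaler().fit_transform(X)

print(X_norm)  # both rows become [0.6, 0.8]
print(X_std)   # rows become [-1, -1] and [1, 1]
```

After `Normalizer`, the two samples are indistinguishable to a distance-based model such as KNN. Whether that is desirable depends on the features; if the intent was to put features on a common scale, the standardized copy may be the more conventional input for KNN.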

---

## Models Training and Results

In [107]:
clean_df = pd.read_csv("data/cleandata.csv")
smogn_02_df = pd.read_csv("data/data_smogn_02.csv")
smogn_03_df = pd.read_csv("data/data_smogn_03.csv")
smogn_04_df = pd.read_csv("data/data_smogn_04.csv")
smogn_05_df = pd.read_csv("data/data_smogn_05.csv")
smogn_06_df = pd.read_csv("data/data_smogn_06.csv")
smogn_07_df = pd.read_csv("data/data_smogn_07.csv")

Every dataset contains a `weight` column for LDS, so each model that goes through `run_model` is trained twice per dataset: once without and once with the LDS sample weights.
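
Whether a second, weighted run is applicable depends on the estimator's `fit` method accepting `sample_weight`. One way to check this programmatically is scikit-learn's `has_fit_parameter` helper; the sketch below applies it to a few of the estimators used here (the `candidates` dictionary is only illustrative).

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils.validation import has_fit_parameter

# Illustrative subset of the models trained in this notebook
candidates = {
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
}

# True if the estimator's fit() signature has a sample_weight parameter
supports_weights = {
    name: has_fit_parameter(estimator, "sample_weight")
    for name, estimator in candidates.items()
}

for name, ok in supports_weights.items():
    print(f"{name}: sample_weight supported = {ok}")
```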

### Original Data

In [103]:
print('Model on Original Data\n')
Models(clean_df)

Model on Original Data

Model Performance (Decision Tree):
MAE = 18.57303879374284
MSE = 626.0107864451937
R^2 = -0.49874256892663915

Model Performance (Decision Tree (with LDS)):
MAE = 20.66388842328596
MSE = 793.1598373315657
R^2 = -0.298831402242413

Model Performance (AdaBoost):
MAE = 13.967791767936056
MSE = 358.3783201872458
R^2 = 0.1420006557793232

Model Performance (AdaBoost (with LDS)):
MAE = 16.903356135701582
MSE = 527.0719225538452
R^2 = 0.13689835007740014

Model Performance (Random Forest):
MAE = 13.570182674839362
MSE = 316.3434660426139
R^2 = 0.24263698130164324

Model Performance (Random Forest (with LDS)):
MAE = 16.976347682630145
MSE = 489.81150643928197
R^2 = 0.1979137926558212

Model Performance (Gradient Boosting):
MAE = 15.78093795728254
MSE = 379.6811001725244
R^2 = 0.09099932498482466

Model Performance (Gradient Boosting (with LDS)):
MAE = 19.19209511925938
MSE = 542.3431466970196
R^2 = 0.11189110117966516

Model Performance (Hist Gradient Boosting):
MAE = 1

<__main__.Models at 0x2b7e421e0>

### SMOGN Data

In [104]:
print('Model on SMOGN (rel_thes = 0.2) Data')
Models(smogn_02_df)

Model on SMOGN (rel_thes = 0.2) Data
Model Performance (Decision Tree):
MAE = 16.301484614755623
MSE = 545.7004294895336
R^2 = -0.2765477600036472

Model Performance (Decision Tree (with LDS)):
MAE = 19.382233282772596
MSE = 731.6062792768417
R^2 = 0.0002390503146730838

Model Performance (AdaBoost):
MAE = 12.068041889513362
MSE = 307.67921831284394
R^2 = 0.2802512226271443

Model Performance (AdaBoost (with LDS)):
MAE = 15.720310727524597
MSE = 487.9394701764362
R^2 = 0.3332167288466429

Model Performance (Random Forest):
MAE = 12.052907398880418
MSE = 272.3400660189963
R^2 = 0.3629195022606023

Model Performance (Random Forest (with LDS)):
MAE = 16.044316611782634
MSE = 451.22416867541546
R^2 = 0.38338924066943136

Model Performance (Gradient Boosting):
MAE = 13.084112321699024
MSE = 299.8431935870048
R^2 = 0.29858190237474924

Model Performance (Gradient Boosting (with LDS)):
MAE = 17.219800362443856
MSE = 446.25719125497096
R^2 = 0.3901767576763079

Model Performance (Hist Gradient

<__main__.Models at 0x2b7e42720>

In [105]:
print('Model on SMOGN (rel_thes = 0.3) Data')
Models(smogn_03_df)

Model on SMOGN (rel_thes = 0.3) Data
Model Performance (Decision Tree):
MAE = 15.625271878453967
MSE = 525.7334359931481
R^2 = -0.06940478321302912

Model Performance (Decision Tree (with LDS)):
MAE = 18.261542034966045
MSE = 670.1177808012253
R^2 = 0.17270353980366648

Model Performance (AdaBoost):
MAE = 11.893243063444467
MSE = 309.29104644490485
R^2 = 0.3708649634423191

Model Performance (AdaBoost (with LDS)):
MAE = 14.928763554929198
MSE = 444.2266553552305
R^2 = 0.4515782896242069

Model Performance (Random Forest):
MAE = 11.816574127935272
MSE = 271.32354551135813
R^2 = 0.4480954082366069

Model Performance (Random Forest (with LDS)):
MAE = 15.081717387550938
MSE = 403.32886560534797
R^2 = 0.5020688118268523

Model Performance (Gradient Boosting):
MAE = 12.745526028822013
MSE = 295.11150177649483
R^2 = 0.3997078557790701

Model Performance (Gradient Boosting (with LDS)):
MAE = 16.01248532086322
MSE = 403.8682982007494
R^2 = 0.5014028532603498

Model Performance (Hist Gradient Bo

<__main__.Models at 0x2b7f38560>

In [108]:
print('Model on SMOGN (rel_thes = 0.4) Data')
Models(smogn_04_df)

Model on SMOGN (rel_thes = 0.4) Data
Model Performance (Decision Tree):
MAE = 9.374191788498614
MSE = 321.7865053062617
R^2 = 0.47192265249052523

Model Performance (Decision Tree (with LDS)):
MAE = 11.285137683541695
MSE = 406.40276134678265
R^2 = 0.5048283330850346

Model Performance (AdaBoost):
MAE = 7.6036393955674235
MSE = 188.90721015418782
R^2 = 0.6899881852761547

Model Performance (AdaBoost (with LDS)):
MAE = 9.325801723981666
MSE = 254.5279884638352
R^2 = 0.6898764961476114

Model Performance (Random Forest):
MAE = 7.812471176387853
MSE = 172.5189944364946
R^2 = 0.7168825557482036

Model Performance (Random Forest (with LDS)):
MAE = 9.660554970766746
MSE = 234.18818829317564
R^2 = 0.7146590363100996

Model Performance (Gradient Boosting):
MAE = 12.875783040384396
MSE = 313.76331965767014
R^2 = 0.4850893407326311

Model Performance (Gradient Boosting (with LDS)):
MAE = 14.743113445233416
MSE = 378.39237959788426
R^2 = 0.5389569088249306

Model Performance (Hist Gradient Boosti

<__main__.Models at 0x2a0db8ad0>

In [109]:
print('Model on SMOGN (rel_thes = 0.5) Data')
Models(smogn_05_df)

Model on SMOGN (rel_thes = 0.5) Data
Model Performance (Decision Tree):
MAE = 11.128184492331886
MSE = 384.36702432711246
R^2 = 0.6222586014442659

Model Performance (Decision Tree (with LDS)):
MAE = 11.788212620035374
MSE = 412.8200054084128
R^2 = 0.675259555731869

Model Performance (AdaBoost):
MAE = 8.661235243969656
MSE = 219.80477067613853
R^2 = 0.783984170781088

Model Performance (AdaBoost (with LDS)):
MAE = 9.585604554094648
MSE = 255.34124576345192
R^2 = 0.7991385385813194

Model Performance (Random Forest):
MAE = 9.431072990485024
MSE = 201.25107895493323
R^2 = 0.8022180384532935

Model Performance (Random Forest (with LDS)):
MAE = 10.165269504474209
MSE = 229.1633075655677
R^2 = 0.8197311338262963

Model Performance (Gradient Boosting):
MAE = 16.689873771002194
MSE = 412.5791771822984
R^2 = 0.5945327628543338

Model Performance (Gradient Boosting (with LDS)):
MAE = 17.721417236738205
MSE = 441.271300847027
R^2 = 0.6528786483153333

Model Performance (Hist Gradient Boosting):

<__main__.Models at 0x2b7e85850>

In [110]:
print('Model on SMOGN (rel_thes = 0.6) Data')
Models(smogn_06_df)

Model on SMOGN (rel_thes = 0.6) Data
Model Performance (Decision Tree):
MAE = 16.73812059954372
MSE = 670.3386482017439
R^2 = 0.06806625163430347

Model Performance (Decision Tree (with LDS)):
MAE = 19.05719752915164
MSE = 805.2937856215826
R^2 = 0.10778981124546472

Model Performance (AdaBoost):
MAE = 11.967897112142943
MSE = 356.80555115261086
R^2 = 0.5039535082523445

Model Performance (AdaBoost (with LDS)):
MAE = 14.28931164045594
MSE = 468.5619491775642
R^2 = 0.48086555169897083

Model Performance (Random Forest):
MAE = 12.79229353754437
MSE = 334.3755939928361
R^2 = 0.5351366037036751

Model Performance (Random Forest (with LDS)):
MAE = 15.21206105689856
MSE = 450.3624080665331
R^2 = 0.5010293928956162

Model Performance (Gradient Boosting):
MAE = 16.857621054595413
MSE = 487.41775520131387
R^2 = 0.32237077954060267

Model Performance (Gradient Boosting (with LDS)):
MAE = 20.39563370142627
MSE = 631.39263726369
R^2 = 0.3004603361785054

Model Performance (Hist Gradient Boosting):

<__main__.Models at 0x286aa0980>

In [111]:
print('Model on SMOGN (rel_thes = 0.7) Data')
Models(smogn_07_df)

Model on SMOGN (rel_thes = 0.7) Data
Model Performance (Decision Tree):
MAE = 15.808059337419852
MSE = 652.5741862358093
R^2 = 0.20625547293268998

Model Performance (Decision Tree (with LDS)):
MAE = 17.678176973397264
MSE = 744.8904588765482
R^2 = 0.2363740774631844

Model Performance (AdaBoost):
MAE = 11.242900399827912
MSE = 336.9000101979445
R^2 = 0.5902189438322203

Model Performance (AdaBoost (with LDS)):
MAE = 13.418966578918473
MSE = 435.4235669121206
R^2 = 0.5536246719027385

Model Performance (Random Forest):
MAE = 12.204938833762483
MSE = 317.61306779145764
R^2 = 0.613678199962648

Model Performance (Random Forest (with LDS)):
MAE = 14.42348563283199
MSE = 421.874916740186
R^2 = 0.5675140973388975

Model Performance (Gradient Boosting):
MAE = 17.28554332334455
MSE = 521.3228609030697
R^2 = 0.3659001897335892

Model Performance (Gradient Boosting (with LDS)):
MAE = 20.394429108619878
MSE = 644.4823053600308
R^2 = 0.3393076940044806

Model Performance (Hist Gradient Boosting):

<__main__.Models at 0x2a0db6420>