
Let us start by installing CatBoost and importing the core libraries.

In [None]:
!pip install catboost

In [1]:
import pandas as pd
import numpy as np

Next, we use the mpg dataset from seaborn.

In [2]:
# Load from seaborn
import seaborn as sns
df = sns.load_dataset('mpg')

Explore the data

In [3]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [4]:
# Drop the 'name' column: it is a free-text label, not a numeric feature
df = df.drop(columns=['name'])

In [6]:
# Check the remaining columns
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin'],
      dtype='object')

In [7]:
# Check for missing values in each column
df.isnull().sum()


column,missing_count
mpg,0
cylinders,0
displacement,0
horsepower,6
weight,0
acceleration,0
model_year,0
origin,0


Perform Preprocessing

In [8]:
# One-hot encode the categorical 'origin' column; drop_first=True keeps one level as the implicit baseline
df = pd.get_dummies(df, columns=['origin'], drop_first=True)
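
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its rows are made up for illustration; only the three origin labels come from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical rows) using the three origin labels from the mpg data
toy = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

encoded = pd.get_dummies(toy, columns=["origin"], drop_first=True)

# Levels are ordered alphabetically, so 'europe' is the one that gets dropped
print(list(encoded.columns))

# The 'europe' row is represented by both remaining indicators being off
print(encoded.iloc[2].tolist())
```

A European car is therefore encoded implicitly: neither `origin_japan` nor `origin_usa` is set.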

In [9]:
# Define features and target
X = df.drop('mpg', axis=1)
y = df['mpg']

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
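
As a quick sanity check on the split sizes, the sketch below repeats the same call on placeholder arrays. It assumes the dataset has 398 rows, which is consistent with the 318 training points LightGBM reports later; `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the assumed number of rows (398)
X_demo = np.arange(398).reshape(-1, 1)
y_demo = np.arange(398)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up, and the training set gets the remainder
print(len(X_tr), len(X_te))
```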

In [12]:
from sklearn.preprocessing import StandardScaler

In [13]:
# Standardize features, fitting the scaler on the training set only.
# Tree-based models are largely insensitive to feature scaling, so this step is optional here.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [14]:
from sklearn.impute import SimpleImputer

# Fill the missing 'horsepower' values with the training-set column mean
imputer = SimpleImputer(strategy='mean')
X_train_scaled = imputer.fit_transform(X_train_scaled)
X_test_scaled = imputer.transform(X_test_scaled)
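
Note the order of the two steps: scaling runs before imputation. This works because `StandardScaler` ignores NaN when fitting and leaves it in place when transforming. A small sketch on a made-up column (the numbers in `X_toy` are hypothetical) shows what that implies for the filled value:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One feature with a single missing entry (made-up values)
X_toy = np.array([[100.0], [150.0], [np.nan], [200.0]])

# NaN is skipped when computing mean and std, and passed through unchanged
scaled = StandardScaler().fit_transform(X_toy)
print(np.isnan(scaled).sum())

# After standardization the column mean is zero, so mean imputation fills in ~0
filled = SimpleImputer(strategy="mean").fit_transform(scaled)
print(filled.ravel())
```

In other words, imputing after scaling places the missing rows at the training-set average of that feature.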


Let us now train each model

In [18]:
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
import catboost as cb

In [19]:
# 1. Gradient Boosting Regressor (sklearn)
# Trained on the scaled, imputed arrays so that no NaN reaches this estimator
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train_scaled, y_train)

# 2. XGBoost
# Trained on the raw features: XGBoost needs no scaling and handles NaN natively
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

# 3. LightGBM (also handles NaN natively)
lgb_model = lgb.LGBMRegressor(random_state=42)
lgb_model.fit(X_train, y_train)

# 4. CatBoost (verbose=0 suppresses the per-iteration training log)
cat_model = cb.CatBoostRegressor(verbose=0, random_state=42)
cat_model.fit(X_train, y_train)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000192 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 286
[LightGBM] [Info] Number of data points in the train set: 318, number of used features: 8
[LightGBM] [Info] Start training from score 23.608176


<catboost.core.CatBoostRegressor at 0x7d0cd87ab910>

Next, let us create a function to evaluate all the models at once

In [20]:
from sklearn.metrics import mean_squared_error, r2_score

In [21]:
def evaluate(model, X, y_true, name):
    """Return the RMSE and R² of a fitted model on (X, y_true) as a dict."""
    preds = model.predict(X)
    rmse = np.sqrt(mean_squared_error(y_true, preds))
    r2 = r2_score(y_true, preds)
    return {'Model': name, 'RMSE': rmse, 'R²': r2}

results = [
    evaluate(gbr, X_test_scaled, y_test, "GradientBoosting (sklearn)"),
    evaluate(xgb_model, X_test, y_test, "XGBoost"),
    evaluate(lgb_model, X_test, y_test, "LightGBM"),
    evaluate(cat_model, X_test, y_test, "CatBoost")
]

# Rank the models from lowest to highest RMSE
df_results = pd.DataFrame(results).sort_values(by='RMSE')
print(df_results)


                        Model      RMSE        R²
3                    CatBoost  2.191652  0.910663
2                    LightGBM  2.225464  0.907885
0  GradientBoosting (sklearn)  2.289252  0.902529
1                     XGBoost  2.637626  0.870606
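
To see what the two metrics measure, here is a small worked sketch that computes them by hand and compares against scikit-learn. The four target and prediction values are made up for illustration and are not taken from the models above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up mpg values and predictions
y_true = np.array([18.0, 15.0, 24.0, 30.0])
y_pred = np.array([17.0, 16.0, 25.0, 28.0])

errors = y_true - y_pred

# RMSE: square root of the mean squared error, in the same units as mpg
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```

A lower RMSE means smaller typical errors in mpg, while an R² closer to 1 means the model explains more of the variance in the target.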


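On this single 80/20 split with default hyperparameters, CatBoost has the lowest RMSE (2.19) and the highest R² (0.911). LightGBM (RMSE 2.23) and scikit-learn's GradientBoostingRegressor (RMSE 2.29) are close behind, while XGBoost trails with an RMSE of 2.64.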