# CatBoost Benchmarking

This notebook benchmarks a CatBoost ranker against the existing LightGBM model. We focus on the standard name set first and will move on to complex names if this version proves fruitful.

In [1]:
!pip install lightgbm
!pip install sqlalchemy
!pip install catboost

Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.6.0
Collecting sqlalchemy
  Downloading sqlalchemy-2.0.41-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting greenlet>=1 (from sqlalchemy)
  Downloading greenlet-3.2.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (4.1 kB)
Downloading sqlalchemy-2.0.41-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)

In [2]:
import json
import re
import random
import matplotlib.pyplot as plt
import lightgbm as lgb
import pandas as pd
import numpy as np
import sqlite3
from catboost import CatBoostRanker, Pool
from datetime import datetime
from collections import defaultdict
from sqlalchemy import select, func
from sklearn.metrics import precision_score, recall_score, f1_score

In [3]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
# Paths to files in Drive
db_path = "/content/drive/MyDrive/Colab Notebooks/database.db"
train_ids_path = "/content/train_std_ids.csv"
val_ids_path = "/content/validation_std_ids.csv"

Pull the whole standard dataset from the SQLite database into a pandas DataFrame.

In [5]:
conn = sqlite3.connect(db_path)  # or your local path
full_df = pd.read_sql_query("SELECT * FROM feature_matrix", conn)
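
The cell above leaves the SQLite connection open. A minimal sketch of the same read with the connection closed automatically is shown here. It uses an in-memory database and a made-up two-column `feature_matrix` table purely for illustration; the real table has more columns.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() guarantees conn.close() runs even if the query raises.
with closing(sqlite3.connect(":memory:")) as conn:
    # Hypothetical stand-in for the real feature_matrix table.
    conn.execute("CREATE TABLE feature_matrix (clean_row_id INTEGER, label INTEGER)")
    conn.executemany(
        "INSERT INTO feature_matrix VALUES (?, ?)", [(1, 1), (1, 0), (2, 1)]
    )
    demo_df = pd.read_sql_query("SELECT * FROM feature_matrix", conn)

print(demo_df)
```

In the notebook, replacing `":memory:"` with `db_path` and dropping the table setup would give the same `full_df` as above.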

The training and validation IDs were saved to CSV files so that we can split the dataset.

In [6]:
train_ids = pd.read_csv(train_ids_path)["train_ids"].dropna().astype(int).tolist()
val_ids = pd.read_csv(val_ids_path)["validation_ids"].dropna().astype(int).tolist()

The validation set is pulled out of the full set and the remaining rows become the training set (`full_df` is reassigned to hold them). This ensures that we don't train on validation rows.

In [7]:
val_ids_set = set(val_ids)
val_df = full_df[full_df["clean_row_id"].isin(val_ids_set)]
full_df = full_df[~full_df["clean_row_id"].isin(val_ids_set)]
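
A quick way to confirm that the split leaks nothing is to check that no `clean_row_id` appears on both sides. The sketch below does this on a toy frame; `split_by_ids` is a hypothetical helper that mirrors the cell above, not part of the original notebook.

```python
import pandas as pd


def split_by_ids(df, holdout_ids, id_col="clean_row_id"):
    """Return (train, holdout) frames, split on membership of id_col in holdout_ids."""
    mask = df[id_col].isin(set(holdout_ids))
    return df[~mask].copy(), df[mask].copy()


# Toy data: three queries with one or two candidate rows each.
toy = pd.DataFrame({"clean_row_id": [1, 1, 2, 3, 3], "label": [1, 0, 1, 0, 1]})
train_part, val_part = split_by_ids(toy, holdout_ids=[2, 3])

# No query ID may appear in both frames.
overlap = set(train_part["clean_row_id"]) & set(val_part["clean_row_id"])
assert not overlap, f"IDs present in both splits: {overlap}"
```

The same `overlap` check can be run directly on `full_df` and `val_df` after the cell above.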

This function was taken from the initial model development notebook.

In [8]:
def compute_ranking_metrics(df, k=3):
    """Return Accuracy@1, Recall@k and MRR for a scored frame (one row per candidate)."""
    # Group candidates by query (one query per clean_row_id)
    grouped = df.groupby("clean_row_id")

    # Accuracy@1: share of queries whose top-scored candidate is the correct one
    top1 = grouped.apply(lambda g: g.loc[g["score"].idxmax()]).reset_index(drop=True)
    acc1 = (top1["label"] == 1).mean()

    # Recall@k: share of queries with a correct candidate among the k highest scores
    topk = grouped.apply(lambda g: g.nlargest(k, "score")).reset_index(drop=True)
    recall_k = topk.groupby("clean_row_id")["label"].max().mean()

    # MRR: mean of 1 / (rank of the first correct candidate); 0 if a query has none
    def reciprocal_rank(g):
        sorted_g = g.sort_values("score", ascending=False).reset_index()
        match = sorted_g[sorted_g["label"] == 1]
        return 1.0 / (match.index[0] + 1) if not match.empty else 0.0

    mrr = grouped.apply(reciprocal_rank).mean()
    return acc1, recall_k, mrr
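
To make the three metrics concrete, here is a small worked example. It is a sketch, not the notebook's implementation: `ranking_metrics_vectorized` is a hypothetical rank-based variant that avoids `groupby.apply` (which newer pandas versions may warn about when the grouping column is included). It is intended to give the same numbers as `compute_ranking_metrics`, up to tie-breaking.

```python
import pandas as pd


def ranking_metrics_vectorized(df: pd.DataFrame, k: int = 3):
    """Accuracy@1, Recall@k and MRR from within-query ranks (hypothetical helper)."""
    # Rank candidates within each query; 1 = highest score, ties broken by row order.
    ranks = df.groupby("clean_row_id")["score"].rank(method="first", ascending=False)
    # Best rank reached by a correct candidate per query (NaN if the query has none).
    best_rank = ranks.where(df["label"] == 1).groupby(df["clean_row_id"]).min()

    acc1 = (best_rank == 1).mean()
    recall_k = (best_rank <= k).mean()
    mrr = (1.0 / best_rank).fillna(0.0).mean()
    return acc1, recall_k, mrr


# Query 1 ranks the correct candidate first; query 2 ranks it third.
toy = pd.DataFrame(
    {
        "clean_row_id": [1, 1, 1, 2, 2, 2],
        "label": [1, 0, 0, 0, 0, 1],
        "score": [0.9, 0.2, 0.1, 0.8, 0.5, 0.3],
    }
)
acc1, recall3, mrr = ranking_metrics_vectorized(toy, k=3)
print(acc1, recall3, mrr)
```

On this toy frame, Accuracy@1 is 1/2, Recall@3 is 2/2, and MRR is (1/1 + 1/3) / 2 = 2/3.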

This has been adapted for a CatBoost model. It is largely the same as the LightGBM version but trains a `CatBoostRanker` instead.

In [9]:
def train_catboost_model(train_df, val_df, parameters: dict, n_rounds: int = 500):
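    # Group sizes per query. groupby returns them in sorted clean_row_id order, so the
    # np.repeat group ids below assume the rows are already sorted by clean_row_id.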
    train_group = train_df.groupby("clean_row_id").size().values
    val_group = val_df.groupby("clean_row_id").size().values

    # Extract features
    X_train = train_df.drop(
        columns=["label", "clean_row_id", "investor", "firm", "template_id"]
    )
    y_train = train_df["label"]

    X_val = val_df.drop(
        columns=["label", "clean_row_id", "investor", "firm", "template_id"]
    )
    y_val = val_df["label"]

    # Create Pools for CatBoost
    train_pool = Pool(
        data=X_train,
        label=y_train,
        group_id=np.repeat(range(len(train_group)), train_group),
    )
    val_pool = Pool(
        data=X_val, label=y_val, group_id=np.repeat(range(len(val_group)), val_group)
    )

    # Train the model
    model = CatBoostRanker(iterations=n_rounds, **parameters)

    model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=50, verbose=True)

    # Predict and evaluate
    preds = model.predict(val_pool)
    val_df = val_df.copy()
    val_df["score"] = preds

    acc1, recall3, mrr = compute_ranking_metrics(val_df, k=3)

    print("\nEvaluation Metrics (Validation Set):")
    print(f"Accuracy@1 : {acc1:.4f}")
    print(f"Recall@3   : {recall3:.4f}")
    print(f"MRR        : {mrr:.4f}")

    return model
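
The `np.repeat` construction of `group_id` in the function above only lines up with the rows if each frame is already sorted by `clean_row_id`. Also, as far as I understand CatBoost's `Pool`, rows belonging to the same group need to be contiguous. The sketch below shows one way to drop that assumption; `make_group_ids` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def make_group_ids(df: pd.DataFrame, key: str = "clean_row_id"):
    """Sort rows so each query is contiguous and return matching 0-based group ids."""
    # A stable sort keeps the original row order within each query.
    sorted_df = df.sort_values(key, kind="stable").reset_index(drop=True)
    # factorize maps each distinct key to 0, 1, 2, ... in order of first appearance.
    group_ids, _ = pd.factorize(sorted_df[key])
    return sorted_df, group_ids


# Rows deliberately interleaved across queries 7, 3 and 5.
shuffled = pd.DataFrame(
    {"clean_row_id": [7, 3, 7, 3, 5], "score": [0.1, 0.2, 0.3, 0.4, 0.5]}
)
sorted_df, group_ids = make_group_ids(shuffled)
print(sorted_df)
print(group_ids)
```

With this helper, the features and labels would be taken from `sorted_df` and `group_ids` passed as `group_id` to `Pool`, so the ids match the rows regardless of the input order.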

In [10]:
catboost_params = {
    "learning_rate": 0.1,
    "loss_function": "YetiRank",
    "random_seed": 42,
    "task_type": "CPU",
}

In [11]:
# Retrain
model = train_catboost_model(full_df, val_df, catboost_params, n_rounds=2000)

0:	test: 0.9555798	best: 0.9555798 (0)	total: 1.99s	remaining: 1h 6m 9s
1:	test: 0.9625773	best: 0.9625773 (1)	total: 3.85s	remaining: 1h 4m 4s
2:	test: 0.9633213	best: 0.9633213 (2)	total: 5.68s	remaining: 1h 3m 1s
3:	test: 0.9633204	best: 0.9633213 (2)	total: 7.45s	remaining: 1h 2m
4:	test: 0.9633198	best: 0.9633213 (2)	total: 9.25s	remaining: 1h 1m 31s
5:	test: 0.9637132	best: 0.9637132 (5)	total: 11s	remaining: 1h 1m 7s
6:	test: 0.9632936	best: 0.9637132 (5)	total: 12.8s	remaining: 1h 46s
7:	test: 0.9632825	best: 0.9637132 (5)	total: 14.6s	remaining: 1h 34s
8:	test: 0.9632777	best: 0.9637132 (5)	total: 16.4s	remaining: 1h 29s
9:	test: 0.9632878	best: 0.9637132 (5)	total: 18.2s	remaining: 1h 27s
10:	test: 0.9632626	best: 0.9637132 (5)	total: 20.1s	remaining: 1h 27s
11:	test: 0.9635822	best: 0.9637132 (5)	total: 21.9s	remaining: 1h 25s
12:	test: 0.9636715	best: 0.9637132 (5)	total: 23.7s	remaining: 1h 23s
13:	test: 0.9638078	best: 0.9638078 (13)	total: 25.5s	remaining: 1h 19s
14:	tes



Evaluation Metrics (Validation Set):
Accuracy@1 : 0.9062
Recall@3   : 0.9969
MRR        : 0.9506



This is a significant improvement over the LightGBM model. Both our own metrics and CatBoost's built-in evaluation metric indicate near-perfect ranking quality on the validation set.

CatBoost evaluation metric from the YetiRank run: 0.983728901, indicating near-perfect ranking quality.

Accuracy@1: 0.9062, Recall@3: 0.9969, MRR: 0.9506. These point the same way: the top-ranked prediction is the correct template about 91% of the time, and the correct template is almost always among our top three predictions.

Let's try on the complex dataset to see how that performs.