# CatBoost Benchmarking - Complex Names

This notebook benchmarks the existing LightGBM model against a CatBoost alternative. This iteration of benchmarking focuses on the complex name set.

Complex names are flagged in the `lp_clean` dataset with these flags:

```python
complex_flags = [
    "has_multiple_first_names",
    "has_middle_name",
    "has_multiple_middle_names",
    "has_multiple_last_names",
    "has_nfkd_normalized",
    "has_german_char"
]
```
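
As a rough illustration of how these flags could be used to select the complex subset, here is a minimal sketch. It assumes the flags are stored as 0/1 (or boolean) columns and that a name counts as complex when at least one flag is set; both points are assumptions, and the toy frame is hypothetical rather than real `lp_clean` data.

```python
import pandas as pd

complex_flags = [
    "has_multiple_first_names",
    "has_middle_name",
    "has_multiple_middle_names",
    "has_multiple_last_names",
    "has_nfkd_normalized",
    "has_german_char",
]

# Hypothetical toy frame standing in for lp_clean; real columns may differ.
toy = pd.DataFrame(
    {
        "row": [0, 1, 2],
        **{flag: [0, 0, 0] for flag in complex_flags},
    }
)
toy.loc[1, "has_german_char"] = 1
toy.loc[2, "has_multiple_first_names"] = 1

# Assumption: a name is "complex" when at least one flag is set.
is_complex = toy[complex_flags].astype(bool).any(axis=1)
complex_rows = toy.loc[is_complex, "row"].tolist()
print(complex_rows)
```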

In [1]:
!pip install lightgbm
!pip install sqlalchemy
!pip install catboost

Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.6.0
Collecting sqlalchemy
  Downloading sqlalchemy-2.0.41-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting greenlet>=1 (from sqlalchemy)
  Downloading greenlet-3.2.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (4.1 kB)
Downloading sqlalchemy-2.0.41-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)

In [2]:
import json
import re
import random
import matplotlib.pyplot as plt
import lightgbm as lgb
import pandas as pd
import numpy as np
import sqlite3
from catboost import CatBoostRanker, Pool
from datetime import datetime
from collections import defaultdict
from sqlalchemy import select, func
from sklearn.metrics import precision_score, recall_score, f1_score

In [3]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [7]:
# Paths to files in Drive
db_path = "/content/drive/MyDrive/Colab Notebooks/database.db"
train_ids_path = "/content/train_com_ids.csv"
val_ids_path = "/content/validation_com_ids.csv"

Pull the whole complex dataset from the SQLite database into a pandas DataFrame.

In [5]:
conn = sqlite3.connect(db_path)  # or your local path
full_df = pd.read_sql_query("SELECT * FROM feature_matrix_complex", conn)
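
The cell above leaves the connection open. One possible variant, sketched here against an in-memory database with a hypothetical two-column table, wraps the connection in `contextlib.closing` so it is released once the frame has been loaded. The table contents are made up for illustration.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory database used only for this sketch; the notebook uses db_path.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE feature_matrix_complex (clean_row_id INTEGER, label INTEGER)"
    )
    conn.executemany(
        "INSERT INTO feature_matrix_complex VALUES (?, ?)",
        [(1, 1), (1, 0), (2, 1)],
    )
    demo_df = pd.read_sql_query("SELECT * FROM feature_matrix_complex", conn)

# The connection is closed here, but the DataFrame remains usable.
print(demo_df.shape)
```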

The training and validation IDs were saved to CSV files so that the dataset can be split the same way each time.

In [8]:
train_ids = pd.read_csv(train_ids_path)["train_ids"].dropna().astype(int).tolist()
val_ids = pd.read_csv(val_ids_path)["validation_ids"].dropna().astype(int).tolist()

The validation set is pulled out of the full set, and the remaining rows form the training set. This ensures that we don't train on the validation set.

In [9]:
val_ids_set = set(val_ids)
val_df = full_df[full_df["clean_row_id"].isin(val_ids_set)]
full_df = full_df[~full_df["clean_row_id"].isin(val_ids_set)]
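
A quick sanity check can confirm that no ID ends up in both splits. The helper below is a hypothetical addition, not part of the original notebook, and the example IDs are made up.

```python
def check_split(train_ids, val_ids):
    """Return the IDs that appear in both splits (ideally an empty list)."""
    return sorted(set(train_ids) & set(val_ids))


# Hypothetical IDs for illustration only.
example_train = [1, 2, 3, 4]
example_val = [5, 6]
leaked = check_split(example_train, example_val)
assert not leaked, f"IDs present in both splits: {leaked}"
```

In the notebook, the same check could be run as `check_split(full_df["clean_row_id"], val_df["clean_row_id"])` after the split.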

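The metrics function was pulled from the initial model development notebook. It computes Accuracy@1, Recall@k, and mean reciprocal rank (MRR) per `clean_row_id` group.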

In [10]:
def compute_ranking_metrics(df, k=3):
    # Group
    grouped = df.groupby("clean_row_id")

    # Get top 1 and calculate accuracy
    top1 = grouped.apply(lambda g: g.loc[g["score"].idxmax()]).reset_index(drop=True)
    acc1 = (top1["label"] == 1).mean()

    # Same with recall
    topk = grouped.apply(lambda g: g.nlargest(k, "score")).reset_index(drop=True)
    recall_k = topk.groupby("clean_row_id")["label"].max().mean()

    # MRR (mean reciprocal rank)
    def reciprocal_rank(g):
        sorted_g = g.sort_values("score", ascending=False).reset_index()
        match = sorted_g[sorted_g["label"] == 1]
        return 1.0 / (match.index[0] + 1) if not match.empty else 0.0

    mrr = grouped.apply(reciprocal_rank).mean()
    return acc1, recall_k, mrr
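
To make the three metrics concrete, here is a small worked example in plain Python. It is a sketch that mirrors the definitions above on hypothetical score and label lists, not the pandas implementation itself: each group is reduced to the rank of its first correct candidate, and all three metrics follow from those ranks.

```python
def rank_of_first_match(scores, labels):
    """1-based rank of the highest-scored item with label 1, or None."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            return rank
    return None


# Hypothetical groups: one (scores, labels) pair per clean_row_id.
groups = [
    ([0.9, 0.2, 0.1], [1, 0, 0]),  # correct candidate at rank 1
    ([0.3, 0.8, 0.5], [1, 0, 0]),  # correct candidate at rank 3
    ([0.7, 0.6, 0.4, 0.1], [0, 0, 0, 1]),  # correct candidate at rank 4
]

ranks = [rank_of_first_match(s, l) for s, l in groups]
acc1 = sum(r == 1 for r in ranks) / len(ranks)
recall3 = sum(r is not None and r <= 3 for r in ranks) / len(ranks)
mrr = sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

print(f"ranks={ranks} acc@1={acc1:.4f} recall@3={recall3:.4f} mrr={mrr:.4f}")
```

Here only one of three groups has the correct candidate on top, two of three have it in the top three, and the MRR is the average of 1, 1/3, and 1/4.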

The training function has been adapted to fit a CatBoost model. It is largely the same as before but trains a `CatBoostRanker` instead.

In [11]:
def train_catboost_model(train_df, val_df, parameters: dict, n_rounds: int = 500):
    # Prepare group sizes (queries)
    train_group = train_df.groupby("clean_row_id").size().values
    val_group = val_df.groupby("clean_row_id").size().values

    # Extract features
    X_train = train_df.drop(
        columns=["label", "clean_row_id", "investor", "firm", "template_id"]
    )
    y_train = train_df["label"]

    X_val = val_df.drop(
        columns=["label", "clean_row_id", "investor", "firm", "template_id"]
    )
    y_val = val_df["label"]

    # Create Pools for CatBoost
    train_pool = Pool(
        data=X_train,
        label=y_train,
        group_id=np.repeat(range(len(train_group)), train_group),
    )
    val_pool = Pool(
        data=X_val, label=y_val, group_id=np.repeat(range(len(val_group)), val_group)
    )

    # Train the model
    model = CatBoostRanker(iterations=n_rounds, **parameters)

    model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=50, verbose=True)

    # Predict and evaluate
    preds = model.predict(val_pool)
    val_df = val_df.copy()
    val_df["score"] = preds

    acc1, recall3, mrr = compute_ranking_metrics(val_df, k=3)

    print("\nEvaluation Metrics (Validation Set):")
    print(f"Accuracy@1 : {acc1:.4f}")
    print(f"Recall@3   : {recall3:.4f}")
    print(f"MRR        : {mrr:.4f}")

    return model
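
One thing to be aware of: the positional `group_id` built with `np.repeat` only lines up with the rows if each frame is already ordered by `clean_row_id`, because `groupby(...).size()` returns sizes in sorted key order. Whether `feature_matrix_complex` comes back sorted is not shown here, so the following is a sketch on a hypothetical, deliberately unsorted frame that shows the mismatch and one possible alternative (sorting once and using the column itself as the group id).

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose rows are NOT ordered by clean_row_id.
df = pd.DataFrame(
    {
        "clean_row_id": [20, 10, 20, 10, 30],
        "label": [1, 0, 0, 1, 1],
    }
)

# Positional ids as built in the training function: sizes come back in
# sorted key order (10, 20, 30), so they only match contiguous, sorted rows.
sizes = df.groupby("clean_row_id").size().values
positional_ids = np.repeat(range(len(sizes)), sizes)

# Alternative: sort once, then take the ids from the column itself.
df_sorted = df.sort_values("clean_row_id", kind="stable").reset_index(drop=True)
column_ids = df_sorted["clean_row_id"].to_numpy()

# True when at least one positional group mixes rows from different queries.
mismatch = df.groupby(positional_ids)["clean_row_id"].nunique().max() > 1
print("positional ids mix queries on unsorted data:", bool(mismatch))
```

If the input frames are already sorted, both approaches produce equivalent groupings, and the original code is fine as written.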

In [12]:
catboost_params = {
    "learning_rate": 0.1,
    "loss_function": "YetiRank",
    "random_seed": 42,
    "task_type": "CPU",
}

In [13]:
# Retrain
model = train_catboost_model(full_df, val_df, catboost_params, n_rounds=2000)

0:	test: 0.7712870	best: 0.7712870 (0)	total: 7.46s	remaining: 4h 8m 24s
1:	test: 0.8074694	best: 0.8074694 (1)	total: 14.8s	remaining: 4h 6m 41s
2:	test: 0.8082809	best: 0.8082809 (2)	total: 22.3s	remaining: 4h 7m 12s
3:	test: 0.8889232	best: 0.8889232 (3)	total: 29.7s	remaining: 4h 7m 6s
4:	test: 0.8892686	best: 0.8892686 (4)	total: 37.2s	remaining: 4h 7m 28s
5:	test: 0.8890383	best: 0.8892686 (4)	total: 44.7s	remaining: 4h 7m 34s
6:	test: 0.9091286	best: 0.9091286 (6)	total: 52.2s	remaining: 4h 7m 42s
7:	test: 0.9266756	best: 0.9266756 (7)	total: 59.7s	remaining: 4h 7m 44s
8:	test: 0.9282899	best: 0.9282899 (8)	total: 1m 7s	remaining: 4h 8m 4s
9:	test: 0.9293530	best: 0.9293530 (9)	total: 1m 14s	remaining: 4h 8m 5s
10:	test: 0.9279502	best: 0.9293530 (9)	total: 1m 22s	remaining: 4h 8m 15s
11:	test: 0.9308014	best: 0.9308014 (11)	total: 1m 29s	remaining: 4h 8m 22s
12:	test: 0.9310256	best: 0.9310256 (12)	total: 1m 37s	remaining: 4h 8m 38s
13:	test: 0.9307229	best: 0.9310256 (12)	tota

Evaluation Metrics (Validation Set):
Accuracy@1 : 0.8550
Recall@3   : 0.9850
MRR        : 0.9203



This is a significant improvement over the LightGBM model. Both our own metrics and CatBoost's built-in evaluation metric indicate near-perfect ranking quality on the validation set.

YetiRank validation metric: 0.9712444948 (higher is better), which indicates near-perfect ranking quality.

Accuracy@1: 0.8550, Recall@3: 0.9850, MRR: 0.9203. These point the same way: the top-ranked prediction is correct 85.5% of the time, and the top three predictions contain the correct template 98.5% of the time.

The CatBoost model seems to provide a much better result than the LightGBM variant. The final step before settling on one of the two is to run hyperparameter tuning on both models and do a final comparison on both datasets.
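
As a starting point for the tuning step, a small grid of candidate configurations could be generated as sketched below. The search space is hypothetical: `learning_rate`, `depth`, and `l2_leaf_reg` are CatBoost parameters, but the values are placeholders rather than recommendations, and `iter_param_sets` is an illustrative helper name.

```python
from itertools import product

# Fixed settings carried over from catboost_params.
base_params = {"loss_function": "YetiRank", "random_seed": 42, "task_type": "CPU"}

# Hypothetical search space; the values are placeholders, not recommendations.
search_space = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 8],
    "l2_leaf_reg": [1, 3],
}


def iter_param_sets(base, space):
    """Yield one parameter dict per combination in the search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}


candidates = list(iter_param_sets(base_params, search_space))
print(len(candidates), "candidate configurations")
```

Each candidate could then be passed to `train_catboost_model(full_df, val_df, params)`. Since that function currently returns only the model, it would need to also return the metrics for runs to be compared. Given the roughly four-hour estimate in the training log for 2000 rounds, a small grid or fewer rounds per run is probably the practical choice.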