# Standard Name Modelling

This note book trains a separate LightGBM model on just typical "First Name Last Name" naming conventions that dominate our dataset.

These names use 79 different syntactic templates for email types and account for 95% of our data. We have padded the data to enrich our dataset while striving to maintain firm level statistics. Complex names are marked in the clean data with any one of these flags.

In [1]:
!pip install lightgbm
!pip install sqlalchemy

Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Downloading lightgbm-4.6.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/3.6 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━[0m [32m3.4/3.6 MB[0m [31m110.7 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m78.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: lightgbm
Successfully installed lightgbm-4.6.0
Collecting sqlalchemy
  Downloading sqlalchemy-2.0.41-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting greenlet>=1 (from sqlalchemy)
  Downloading greenlet-3.2.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (4.1 kB)
Downloading sqlalchemy-2.0.41-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
[2

In [2]:
import json
import re
import random
import matplotlib.pyplot as plt
import lightgbm as lgb
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime
from collections import defaultdict
from sqlalchemy import select, func
from sklearn.metrics import precision_score, recall_score, f1_score

In [3]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [14]:
# Paths to files in Drive
db_path = "/content/drive/MyDrive/Colab Notebooks/database.db"
train_ids_path = "/content/train_std_ids.csv"
val_ids_path = "/content/validation_std_ids.csv"

Pull whole standard dataset from SQL db into pandas frame.

In [5]:
conn = sqlite3.connect(db_path)  # or your local path
full_df = pd.read_sql_query("SELECT * FROM feature_matrix", conn)

Saved the training and validation IDs in a CSV so we can split the data set.

In [15]:
train_ids = pd.read_csv(train_ids_path)["train_ids"].dropna().astype(int).tolist()
val_ids = pd.read_csv(val_ids_path)["validation_ids"].dropna().astype(int).tolist()

Validation dataset pulled from full set, remianing is training set. This is to ensure that we don't train on validation set.

In [16]:
val_ids_set = set(val_ids)
val_df = full_df[full_df["clean_row_id"].isin(val_ids_set)]
full_df = full_df[~full_df["clean_row_id"].isin(val_ids_set)]

This was pulled from the initial model development notebook.

In [17]:
def compute_ranking_metrics(df, k=3):
    # Group
    grouped = df.groupby("clean_row_id")

    # Get top 1 and calculate accuracy
    top1 = grouped.apply(lambda g: g.loc[g["score"].idxmax()]).reset_index(drop=True)
    acc1 = (top1["label"] == 1).mean()

    # Same with recall
    topk = grouped.apply(lambda g: g.nlargest(k, "score")).reset_index(drop=True)
    recall_k = topk.groupby("clean_row_id")["label"].max().mean()

    # MMR
    def reciprocal_rank(g):
        sorted_g = g.sort_values("score", ascending=False).reset_index()
        match = sorted_g[sorted_g["label"] == 1]
        return 1.0 / (match.index[0] + 1) if not match.empty else 0.0

    mrr = grouped.apply(reciprocal_rank).mean()
    return acc1, recall_k, mrr

This is so we can save logs to file in case collab logs us out and we lose our printouts.

In [18]:
# Paths
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_path = f"/content/drive/MyDrive/lgb_log_{timestamp}.txt"
model_path = f"/content/drive/MyDrive/lgb_model_{timestamp}.txt"

# Logger
log_file = open(log_path, "w")


def log_callback(period=10):
    def _callback(env):
        if env.iteration % period == 0 or env.iteration + 1 == env.end_iteration:
            result = f"[{env.iteration}] "
            for data_name, eval_name, result_val, _ in env.evaluation_result_list:
                result += f"{data_name} {eval_name}: {result_val:.5f}  "
            print(result)
            log_file.write(result + "\n")
            log_file.flush()

    return _callback

From the model development notebook, changed slightly to fit our dataframes.

In [19]:
def train_model(
    train_ids,
    val_ids,
    parameters: dict,
    n_rounds: int = 500,
    lr_decay_gamma: float = 0.95,
):
    conn = sqlite3.connect(db_path)
    # Load validation set
    X_val = val_df.drop(
        columns=["label", "clean_row_id", "investor", "firm", "template_id"]
    )
    y_val = val_df["label"]
    val_group_sizes = val_df.groupby("clean_row_id").size().tolist()
    lgb_val = lgb.Dataset(
        X_val, label=y_val, group=val_group_sizes, free_raw_data=False
    )

    # Load full training set
    full_df
    X_train = full_df.drop(
        columns=["label", "clean_row_id", "investor", "firm", "template_id"]
    )
    y_train = full_df["label"]
    train_group_sizes = full_df.groupby("clean_row_id").size().tolist()
    lgb_train = lgb.Dataset(
        X_train, label=y_train, group=train_group_sizes, free_raw_data=False
    )

    # Learning rate schedule
    def lr_decay(current_round):
        return parameters["learning_rate"] * (lr_decay_gamma**current_round)

    # Train model
    model = lgb.train(
        params=parameters,
        train_set=lgb_train,
        num_boost_round=n_rounds,
        valid_sets=[lgb_train, lgb_val],
        valid_names=["train", "val"],
        callbacks=[
            lgb.reset_parameter(learning_rate=lr_decay),
            lgb.early_stopping(stopping_rounds=500),
            lgb.log_evaluation(period=1),
            log_callback(1),
        ],
    )

    # Predict and Evaluate
    preds = model.predict(X_val, num_iteration=model.best_iteration)
    val_df["score"] = preds

    acc1, recall3, mrr = compute_ranking_metrics(val_df, k=3)

    print("\Evaluation Metrics (Validation Set):")
    print(f"Accuracy@1 : {acc1:.4f}")
    print(f"Recall@3   : {recall3:.4f}")
    print(f"MRR        : {mrr:.4f}")

    return model

In [22]:
params = {
    "objective": "lambdarank",
    "metric": ["ndcg"],
    "eval_at": [1, 3],
    "label_gain": [0, 1],
    "learning_rate": 0.1,
    "num_leaves": 128,
    "max_depth": 10,
    "min_split_gain": 1e-3,
    "min_child_weight": 0.01,
    "max_delta_step": 1.0,
    "scale_pos_weight": 78,
    "verbosity": -1,
    "boosting_type": "gbdt",
    "force_row_wise": True,
    "max_bin": 128,
    "feature_fraction": 0.8,
    "bagging_freq": 0.8,
    "bagging_freq": 1,
    "lambda_l1": 1.0,
    "min_data_per_group": 10,
}

In [23]:
# Retrain
model = train_model(train_ids, val_ids, params, n_rounds=2000)
log_file.close()
model.save_model(model_path)

[0] train ndcg@1: 0.87033  train ndcg@3: 0.94188  val ndcg@1: 0.64278  val ndcg@3: 0.81413  
[1]	train's ndcg@1: 0.870334	train's ndcg@3: 0.941877	val's ndcg@1: 0.642781	val's ndcg@3: 0.814131
Training until validation scores don't improve for 500 rounds
[1] train ndcg@1: 0.89356  train ndcg@3: 0.95588  val ndcg@1: 0.74984  val ndcg@3: 0.88450  
[2]	train's ndcg@1: 0.893562	train's ndcg@3: 0.955881	val's ndcg@1: 0.749839	val's ndcg@3: 0.884504
[2] train ndcg@1: 0.89356  train ndcg@3: 0.95588  val ndcg@1: 0.74984  val ndcg@3: 0.88449  
[3]	train's ndcg@1: 0.893562	train's ndcg@3: 0.955884	val's ndcg@1: 0.749839	val's ndcg@3: 0.884494
[3] train ndcg@1: 0.90884  train ndcg@3: 0.96441  val ndcg@1: 0.83037  val ndcg@3: 0.93137  
[4]	train's ndcg@1: 0.908838	train's ndcg@3: 0.964409	val's ndcg@1: 0.830365	val's ndcg@3: 0.931369
[4] train ndcg@1: 0.91871  train ndcg@3: 0.96814  val ndcg@1: 0.84889  val ndcg@3: 0.93838  
[5]	train's ndcg@1: 0.918706	train's ndcg@3: 0.96814	val's ndcg@1: 0.8488

  top1 = grouped.apply(lambda g: g.loc[g["score"].idxmax()]).reset_index(drop=True)
  topk = grouped.apply(lambda g: g.nlargest(k, "score")).reset_index(drop=True)


\Evaluation Metrics (Validation Set):
Accuracy@1 : 0.8616
Recall@3   : 0.9960
MRR        : 0.9273


  mrr = grouped.apply(reciprocal_rank).mean()


<lightgbm.basic.Booster at 0x7b22af54d6d0>

Performance has improved incrementally. About 1 percent in the top 1 and still a almost 100% in recall at top3. This is a good baseline model, next thing to do is some final benchmarking agains RAG and maybe try catboost.

NDCG@1: 0.861618 - precision at top 1 spot is correct almost 86% of the time.

NDCG@3: 0.944518 - at top3, it is correct almost 90% of the time.

Accuracy@1: 0.8616, Recall@3: 0.9960 - indicating the same thing, most of the time our top three ranking predictions contains the correct template.