In [None]:

import importlib.util  # importing the submodule explicitly guarantees `importlib.util` is available

if not importlib.util.find_spec("my_first_ltr"): # type: ignore
    %pip install -qqq git+https://github.com/algolia/my-first-learning-to-rank

# What's Learning to Rank?

Ranking is the process of organizing or arranging items in order based on their importance, quality, or performance. It is widely used in various applications, such as search engines, where results are ranked to show the most relevant pages first, or in recommender systems, which rank products, movies, or other items to suggest the most suitable options to users.

## Why rank? And what's ranking here?
In a search experience, ranking puts the most relevant results first, so users find what they need quickly. Our goal is to define the best order of results for each query, where a query is the word or phrase a user types to find information.

## What's a ranking model?

It's a function that maps an item to its relevance score.
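
As a toy illustration, a ranking model can be as simple as a hand-written function that returns a score, which we then sort by. The items, field names and weights below are made up for the example; they are not taken from the workshop dataset.

```python
# Toy catalogue: the field names and values are invented for illustration.
movies = [
    {"title": "Movie A", "rating": 6.1, "text_match": 0.9},
    {"title": "Movie B", "rating": 8.4, "text_match": 0.2},
    {"title": "Movie C", "rating": 7.5, "text_match": 0.8},
]


def relevance_score(item: dict) -> float:
    """A hand-written 'model': mix text match and rating into one score."""
    return 0.7 * item["text_match"] + 0.3 * item["rating"] / 10


# Ranking = sorting the items by decreasing score.
ranked = sorted(movies, key=relevance_score, reverse=True)
ranked_titles = [m["title"] for m in ranked]
print(ranked_titles)
```

Learning to rank replaces the hand-picked weights with a function learned from data.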

# It all starts with data

Our data comes from two sources:

- a user's search history on a streaming platform
- a subset of the IMDb dataset

At home, you can try with your search history if you'd like!

We did all the nasty preprocessing and cleaning for you, so you can just have fun!
The main preprocessing steps are:
- One-hot encoding: turn multi-categorical features into a list of binary features
- Create a textual relevance signal (using [OkapiBM25](https://en.wikipedia.org/wiki/Okapi_BM25))
- Compute the relevance score of the documents.
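
To give an idea of what a textual relevance signal looks like, here is a minimal sketch of the Okapi BM25 formula. It is not the workshop's preprocessing code: the whitespace tokenizer, the `k1` and `b` values (common defaults) and the toy titles are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency, always positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalised by document length with b.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


titles = ["the dark knight", "the dark crystal", "finding nemo"]
scores = bm25_scores("dark knight", titles)
print(scores)
```

The title matching both query words scores highest, the one matching a single word comes second, and the one matching nothing gets 0.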

## What are our features?
What is a feature in our context? A feature in machine learning is a piece of information or characteristic that helps the model make predictions or decisions. It’s an input the model uses to learn patterns in the data.
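
To make this concrete, here is a small sketch of how a list-valued column such as genres can be turned into binary features, which is the idea behind the one-hot encoding step listed earlier. The rows and the `genre_` prefix are invented for the example and may not match the column names in the prepared dataset.

```python
# Toy rows: titles and genres are invented for illustration.
rows = [
    {"title": "Movie A", "genres": ["comedy", "romance"]},
    {"title": "Movie B", "genres": ["action"]},
    {"title": "Movie C", "genres": ["action", "comedy"]},
]

# Every distinct genre becomes one binary feature.
all_genres = sorted({g for row in rows for g in row["genres"]})

encoded = [
    {"title": row["title"], **{f"genre_{g}": int(g in row["genres"]) for g in all_genres}}
    for row in rows
]
for row in encoded:
    print(row)
```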

In [None]:
from my_first_ltr.utils import load_raw_dataset

unprocess_dataset = load_raw_dataset()
unprocess_dataset.head(2)

Go ahead! Explore the dataset to get familiar with it. Here are a few examples to get you started!

In [None]:
# histogram for numerical values:
unprocess_dataset.imdb_score.hist()

In [None]:
# histogram for textual values (note the use of explode when the column holds lists of strings):
unprocess_dataset.explode("genres").genres.value_counts().plot.barh()
unprocess_dataset[unprocess_dataset.Action == "play"].explode("genres").genres.value_counts().plot.barh(color="red")

In [None]:
# correlation between values:
unprocess_dataset[["imdb_score", "tmdb_score"]].corr()

## What's our score here?

- What's your idea?

We base our scoring on users' past interactions with the movies returned for a query:
- If the user watched the movie, we consider it highly relevant
- If the user added it to their watchlist, we consider it relevant, but not the right mood at that time
- If the user clicked on a movie, we consider that it showed some interest but wasn't that relevant
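
One way to turn these rules into a number is to give each action a weight and sum the weights per query and movie. This is only a sketch: the weights, the `"watchlist"` and `"click"` action names and the toy events are hypothetical, and the prepared dataset may compute its score differently. Only the ordering (watched above watchlisted above clicked) comes from the rules above.

```python
from collections import defaultdict

# Hypothetical weights: only their order (play > watchlist > click) comes from the text.
ACTION_WEIGHTS = {"play": 3.0, "watchlist": 2.0, "click": 1.0}

# Toy interaction log: (query, movie id, action). All values are invented.
events = [
    ("batman", "m1", "click"),
    ("batman", "m1", "play"),
    ("batman", "m2", "click"),
    ("batman", "m3", "watchlist"),
]

# Sum the weights of all interactions for each (query, movie) pair.
scores = defaultdict(float)
for query, movie_id, action in events:
    scores[(query, movie_id)] += ACTION_WEIGHTS[action]

for (query, movie_id), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(query, movie_id, score)
```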

In [None]:
from my_first_ltr.utils import load_dataset

dataset = load_dataset(local=True)
dataset.head(5)

We could also add negative examples, which would improve the model by giving it contrast with the positive examples. Using a conversion ratio in the score, instead of a simple sum, helps the model better capture the relative importance of examples, leading to more balanced and accurate rankings.
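
The difference between a sum and a ratio is easy to see on a toy example. The sketch below assumes we also know how many times each movie was shown for a query (an `impressions` count, which is not part of the description above); all numbers are invented.

```python
# Toy counts per (query, movie): how often it was shown and how often it was played.
# All numbers are invented for illustration.
stats = {
    ("batman", "m1"): {"impressions": 1000, "plays": 50},
    ("batman", "m2"): {"impressions": 20, "plays": 10},
    ("batman", "m3"): {"impressions": 200, "plays": 0},  # a negative example
}

for key, s in stats.items():
    sum_score = s["plays"]
    ratio_score = s["plays"] / s["impressions"]
    print(key, sum_score, round(ratio_score, 3))

# The "best" movie depends on which score we pick.
by_sum = max(stats, key=lambda k: stats[k]["plays"])
by_ratio = max(stats, key=lambda k: stats[k]["plays"] / stats[k]["impressions"])
print(by_sum, by_ratio)
```

With a plain sum, the frequently shown movie wins because it accumulated more plays. With a ratio, the movie that converts half of its impressions comes first, and the movie that was shown but never played gets an explicit score of 0.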

We split data into training and testing sets to train the model on one portion of the data and evaluate its performance on unseen data, ensuring it generalizes well to new inputs.
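
For ranking, it helps to split by query rather than by row, so that all the results of a query stay in the same set. Here is a toy sketch of that idea in plain Python. The rows and the 75/25 ratio are made up, and this is not the solution to the exercise below, which works on the pandas DataFrame.

```python
import random

# Toy rows: (query, movie id, score). All values are invented.
rows = [
    ("batman", "m1", 3.0), ("batman", "m2", 1.0),
    ("nemo", "m4", 2.0), ("nemo", "m5", 0.0),
    ("alien", "m6", 3.0), ("alien", "m7", 1.0),
    ("shrek", "m8", 2.0), ("shrek", "m9", 1.0),
]

# Shuffle the distinct queries, not the rows, so a query never ends up in both sets.
queries = sorted({query for query, _, _ in rows})
random.Random(0).shuffle(queries)

cut = int(len(queries) * 0.75)  # arbitrary 75/25 split for this toy example
train_queries, test_queries = set(queries[:cut]), set(queries[cut:])

train_rows = [r for r in rows if r[0] in train_queries]
test_rows = [r for r in rows if r[0] in test_queries]
print(len(train_rows), len(test_rows))
```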

In [None]:
from my_first_ltr.train_utils import get_categories
import pandas as pd
from catboost import Pool

def dataset_split(
    dataset: pd.DataFrame,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split the dataset into training, testing, and validation sets based on queries.

    Steps to implement:
    1. Group the dataset by the "normalized_query"
    2. Split the grouped data into two sets: 95% for training and testing, 5% for validation.
    3. Further split the 95% dataset into 70% for training and 30% for testing.
    4. Explode the grouped data back into individual rows for each query.
    5. Return the Datasets
    """
    # FIXME: Complete the function here
    pass

def build_pool(dataset: pd.DataFrame, name: str) -> Pool:
    """
    Given a dataset and its name, build a CatBoost Pool object.

    Steps to implement:
    1. Sort the dataset by the "normalized_query" column.
    2. Prepare the input features:
       - Exclude columns that are not input features (e.g., "normalized_query", "id", "score").
    3. Extract:
       - Features
       - Target column ("score") as labels.
    4. To identify the categorical features in the dataset, you can use the function `get_categories(dataset: pd.DataFrame)`.
    5. Construct and return the `Pool` (Group identifiers from the "normalized_query" column with the parameter `group_id`).
    """
    # FIXME: Complete the function here
    pass

In [None]:
from my_first_ltr.train_utils import build_pool, dataset_split


train_df, test_df, val_df = dataset_split(dataset)

train_pool = build_pool(train_df, "train")
test_pool = build_pool(test_df, "test")
val_pool = build_pool(val_df, "validation")

# Then comes a model

## Pointwise: RMSE

A pointwise learning-to-rank (LTR) approach using Root Mean Square Error (RMSE) is a method where the ranking problem is treated as a regression problem. The model is trained to predict a relevance score as close as possible to the ground truth relevance score for each individual item.

**It ignores the relationships between items within a list, focusing only on the accuracy of individual predictions. Hence its name: pointwise.**
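
A small sketch shows what this means in practice: a low RMSE does not guarantee a good order, and a good order does not require a low RMSE. The scores below are invented for the example.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root of the mean squared difference between labels and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def order(scores: list[float]) -> list[int]:
    """Item indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


true_scores = [3.0, 2.0, 1.0]
close_but_misordered = [2.0, 2.1, 1.0]  # small errors, top two items swapped
far_but_well_ordered = [30.0, 20.0, 10.0]  # large errors, perfect order

print(rmse(true_scores, close_but_misordered), order(close_but_misordered))
print(rmse(true_scores, far_but_well_ordered), order(far_but_well_ordered))
```

The first prediction has the smaller error but puts the wrong item on top; the second is far off on every score yet ranks the items perfectly.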

### Let's go to practice

We initialize a `CatBoostRanker`, a gradient-boosting model, with the following parameters:

- `loss_function="RMSE"`: Optimizes `RMSE`, the square root of the average squared difference between predicted and true relevance scores.
- `learning_rate=0.15`: Determines how much the model's parameters are updated in response to the calculated error after each iteration. A smaller value leads to slower, more stable learning, while a larger value speeds learning but risks overshooting the optimal solution.
- `thread_count=1`: Uses a single CPU thread for training.
- `iterations=500`: Runs 500 iterations of boosting (adding weak learners to improve predictions).
- `random_seed=0`: Ensures reproducible results by fixing randomness.

In [None]:

from catboost import CatBoostRanker

model = CatBoostRanker(loss_function="RMSE", depth=6, learning_rate=0.15, thread_count=1, iterations=500, random_seed=0)

model.fit(train_pool, eval_set=test_pool, plot=True, metric_period=1)

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

val_df["pred_score"] = model.predict(val_pool)

results = val_df[["score", "pred_score"]]

print("Predictions vs Actuals:")
print(results.head())

rmse = np.sqrt(mean_squared_error(results["score"], results["pred_score"]))
mae = mean_absolute_error(results["score"], results["pred_score"])

print("\nEvaluation Metrics:")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")

## Pairwise: PairLogit

This approach focuses on learning the relative preference between pairs of items. The model is trained to predict which of the two items in a pair should be ranked higher, based on pairwise comparisons.
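
To see what a pairwise objective looks like, here is a sketch of a logistic loss over pairs, in the spirit of PairLogit. It is a simplified illustration: how CatBoost actually builds, samples and weights pairs may differ, and the labels and scores are invented.

```python
import math
from itertools import combinations


def pairwise_logistic_loss(labels: list[float], scores: list[float]) -> float:
    """Average logistic loss over all pairs with different labels, for one query."""
    losses = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # no preference between these two items
        winner, loser = (i, j) if labels[i] > labels[j] else (j, i)
        # Small when the winner is scored well above the loser.
        losses.append(math.log(1 + math.exp(-(scores[winner] - scores[loser]))))
    return sum(losses) / len(losses)


labels = [3.0, 2.0, 0.0]
good_scores = [2.5, 1.0, -1.0]  # same order as the labels
bad_scores = [-1.0, 1.0, 2.5]  # reversed order

print(pairwise_logistic_loss(labels, good_scores))
print(pairwise_logistic_loss(labels, bad_scores))
```

Only the difference between the two scores of a pair matters, not their absolute values, which is the main contrast with the pointwise approach.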

In [None]:
# FIXME: setup the ranker for some pairwise ranking
model_pairwise = CatBoostRanker(loss_function="PairLogit", thread_count=1, random_seed=0)
model_pairwise.fit(train_pool, eval_set=test_pool, plot=True, metric_period=1)

## Listwise: YetiRank

YetiRank optimizes a smooth approximation of an IR (Information Retrieval) metric, such as NDCG (Normalized Discounted Cumulative Gain). The use of a listwise approach means that the model learns directly to improve the ranking quality of the entire list rather than individual scores.

### Wait, NDC what?

Normalized Discounted Cumulative Gain (NDCG) is a metric used to evaluate the quality of a ranked list of items. It measures how well the ranking of retrieved items matches the ideal ranking based on relevance, emphasizing the importance of placing highly relevant items near the top of the list.
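
The computation fits in a few lines. The sketch below uses the relevance itself as the gain and a `log2` discount on the position; some definitions use an exponential gain (`2^rel - 1`) instead, so values can differ between libraries. The relevance values are invented.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(true_relevance: list[float], predicted_scores: list[float]) -> float:
    """DCG of the predicted order, divided by the DCG of the ideal order."""
    predicted_order = sorted(range(len(predicted_scores)), key=lambda i: -predicted_scores[i])
    achieved = dcg([true_relevance[i] for i in predicted_order])
    ideal = dcg(sorted(true_relevance, reverse=True))
    return achieved / ideal


true_relevance = [3.0, 2.0, 1.0, 0.0]
print(ndcg(true_relevance, [4.0, 3.0, 2.0, 1.0]))  # same order as the labels
print(ndcg(true_relevance, [3.0, 4.0, 2.0, 1.0]))  # top two swapped
print(ndcg(true_relevance, [1.0, 2.0, 3.0, 4.0]))  # fully reversed
```

Because of the discount, a mistake at the top of the list costs more than the same mistake further down.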

In [None]:
# documents ordered according to their relevance scores
from numpy import asarray
from sklearn.metrics import ndcg_score


true_relevance = asarray([list(reversed(range(21)))])
print(true_relevance)
print("Perfect:", ndcg_score(true_relevance, true_relevance))

pred_relevance = asarray([[19, 20] + list(reversed(range(19)))])
print(pred_relevance)
print("Two items swapping places:", ndcg_score(true_relevance, pred_relevance))

pred_relevance = asarray([[15, 19, 18, 17, 16, 20] + list(reversed(range(15)))])
print(pred_relevance)
print("Two items swapping places further apart:", ndcg_score(true_relevance, pred_relevance))

pred_relevance = asarray([list(range(21))])
print(pred_relevance)
print("Let's reverse everything:", ndcg_score(true_relevance, pred_relevance))

### Back to our model

In [None]:
# FIXME: setup the ranker for some listwise ranking
# Unlike the RMSE model, the loss here is not the difference between predicted and true scores,
# so we also pass the ranking metric we care about (e.g. NDCG) via custom_metric to track it during training.
model_listwise = CatBoostRanker(loss_function="YetiRank", thread_count=1, random_seed=0, custom_metric=["NDCG:top=-1;type=Base;denominator=LogPosition;hints=skip_train~false"])
model_listwise.fit(train_pool, eval_set=test_pool, plot=True, metric_period=1)

## Model's leaderboard

Create a model leaderboard and add your own iteration to compare against our baselines!

A little tip to get you started: you can use CatBoost's `eval_metrics` method to quickly retrieve a metric for a model.

In [None]:
from my_first_ltr.data_visualisation import RMSE, NDCG_20

# FIXME: try out the eval_metrics
your_metric = ...
model.eval_metrics(train_pool, your_metric, ntree_start=model.tree_count_ - 1)

# FIXME: Compare the different models.
models = {"RMSE": model, "PairLogit": model_pairwise, "YetiRank": model_listwise, "BestModelInTheWorld": ...}

In [None]:
#@title Solution to retrieve metrics for multiple models

import pandas as pd

models = {"RMSE": model, "PairLogit": model_pairwise, "YetiRank": model_listwise}
metrics = []

for k, m in models.items():
    metrics_dict = dict()
    metrics_dict["model_name"] = k
    metrics_dict['train_NDCG@20'] = m.eval_metrics(train_pool,
                                                      'NDCG:top=20;type=Base;denominator=LogPosition',
                                                      ntree_start=m.tree_count_ - 1)['NDCG:top=20;type=Base'][0]

    metrics_dict['test_NDCG@20'] = m.eval_metrics(test_pool,
                                                     'NDCG:top=20;type=Base;denominator=LogPosition',
                                                     ntree_start=m.tree_count_ - 1)['NDCG:top=20;type=Base'][0]

    metrics_dict['val_NDCG@20'] = m.eval_metrics(val_pool,
                                                'NDCG:top=20;type=Base;denominator=LogPosition',
                                                ntree_start=m.tree_count_ - 1)['NDCG:top=20;type=Base'][0]
    metrics.append(metrics_dict)


metrics_df = pd.DataFrame.from_records(metrics)
metrics_df

## Feature importance

In [None]:
import pandas as pd
from catboost import CatBoost, Pool
from my_first_ltr.train_utils import keep_input_features
import matplotlib.pyplot as plt

def cross_dataset_shape_cascade(m: CatBoost, X: pd.DataFrame, pools: dict[str, Pool]) -> None:
    """Plot the importance of each feature (CatBoost's default importance type) on every dataset."""
    df_feature_importance = pd.DataFrame(
        data={k: m.get_feature_importance(pool) for k, pool in pools.items()}, index=X.columns
    )

    plt.close("all")
    df_feature_importance.plot.barh(figsize=(10, 12))
    plt.title("cross dataset - importance of each feature")
    plt.legend()
    plt.show()

cross_dataset_shape_cascade(model, keep_input_features(train_df), {"train": train_pool, "test": test_pool, "val": val_pool})