# Training a Ranking Model

The goal of ranking is to order items by importance. The relevance value itself does not matter directly; only the order it induces does.

Ranking a set of documents with regard to a user query is an example of a ranking problem. It is only important to get the right order, and the top documents matter more.
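
As a small illustration of why only the order matters, the sketch below ranks four hypothetical documents by model score and checks that rescaling the scores leaves the ranking unchanged. The score values are made up for the example and do not come from the dataset.

```python
import numpy as np

# Hypothetical model scores for four documents of one query (illustrative values).
scores = np.array([0.2, 1.5, -0.3, 0.9])

# Indices of the documents from best to worst.
order = np.argsort(-scores)

# Any strictly increasing transformation of the scores yields the same ranking.
rescaled_scores = 10 * scores + 3
order_rescaled = np.argsort(-rescaled_scores)

print(order.tolist())           # [1, 3, 0, 2]
print(order_rescaled.tolist())  # [1, 3, 0, 2]
```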

The relevance (label) is a floating-point numerical value between 0 and 5 (generally between 0 and 4), where 0 means "completely unrelated", 4 means "very relevant", and 5 means "the same as the query".

url: https://en.wikipedia.org/wiki/Learning_to_rank
url 2: https://www.tensorflow.org/decision_forests/tutorials/beginner_colab

In [1]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

In [5]:
archive_path = tf.keras.utils.get_file("letor.zip",
  "https://download.microsoft.com/download/E/7/E/E7EABEF1-4C7B-4E31-ACE5-73927950ED5E/Letor.zip",
  extract=True)

# Path to the train and test dataset using libsvm format.
raw_dataset_path = os.path.join(os.path.dirname(archive_path), "../../dataset/Letor/OHSUMED/Data/All/OHSUMED.txt")

The dataset is stored as a .txt file in the libsvm format. We need to convert it into a CSV file before loading it with pandas.
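
To make the format concrete, here is a minimal sketch that parses one hand-written line in the LETOR-style libsvm layout: a relevance label, a `qid:<id>` group, `index:value` features, and a trailing comment. The sample line is abridged to three features and its document id is hypothetical; `parse_letor_line` is an illustrative helper, not part of any library.

```python
# Hand-written, abridged sample line (real OHSUMED lines carry 25 features).
# The document id in the trailing comment is hypothetical.
sample_line = "2 qid:1 1:3.0 2:2.079442 3:0.272727 #docid = 40626\n"

def parse_letor_line(line):
    """Parse one line into (relevance, group, features)."""
    # The last 3 space-separated items ("#docid", "=", "<id>") are a comment.
    items = line.split(" ")[:-3]
    relevance = int(items[0])
    group = "g_" + items[1].split(":")[1]
    features = {}
    for item in items[2:]:
        index, value = item.split(":")
        features["f_" + index] = float(value)
    return relevance, group, features

parsed = parse_letor_line(sample_line)
print(parsed)
```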

In [9]:
def convert_libsvm_to_csv(src_path, dst_path):
  """Converts a libsvm ranking dataset into a flat csv file.

  Note: This code is specific to the LETOR3 dataset.
  """
  # Context managers close both files even if a line fails to parse.
  with open(src_path, "r") as src_handle, open(dst_path, "w") as dst_handle:
    first_line = True
    for src_line in src_handle:
      # Note: The last 3 items are comments.
      items = src_line.split(" ")[:-3]
      relevance = items[0]
      # The second item has the form "qid:<id>"; keep only the id.
      group = items[1].split(":")[1]
      # Each remaining item has the form "<feature index>:<value>".
      features = [item.split(":") for item in items[2:]]

      if first_line:
        # CSV header
        dst_handle.write("relevance,group," + ",".join(["f_" + feature[0] for feature in features]) + "\n")
        first_line = False
      dst_handle.write(relevance + ",g_" + group + "," + ",".join([feature[1] for feature in features]) + "\n")
  
# convert the dataset
csv_dataset_path = "../../dataset/ohsumed.csv"
convert_libsvm_to_csv("../../dataset/Letor/OHSUMED/Data/All/OHSUMED.txt", csv_dataset_path)

# load a dataset into pandas
dataset_df = pd.read_csv(csv_dataset_path)

# display the first 5 examples (the default for head())
dataset_df.head()



Unnamed: 0,relevance,group,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,...,f_16,f_17,f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25
0,2,g_1,3.0,2.079442,0.272727,0.261034,37.330565,11.431241,37.29975,1.138657,...,9.340024,24.808785,0.393091,57.416517,3.294893,25.0231,3.219799,-3.87098,-3.90273,-3.87512
1,0,g_1,3.0,2.079442,0.428571,0.400594,37.330565,11.431241,37.29975,1.81448,...,9.340024,24.808785,0.349205,43.240626,2.654724,23.4903,3.156588,-3.96838,-4.00865,-3.9867
2,2,g_1,0.0,0.0,0.0,0.0,37.330565,11.431241,37.29975,0.0,...,9.340024,24.808785,0.240319,25.816989,1.551342,15.865,2.764115,-4.28166,-4.33313,-4.44161
3,2,g_1,4.0,2.772589,0.333333,0.320171,37.330565,11.431241,37.29975,1.260808,...,9.340024,24.808785,0.111496,10.092426,0.649758,14.2778,2.658706,-4.77772,-4.73563,-4.86759
4,0,g_1,0.0,0.0,0.0,0.0,37.330565,11.431241,37.29975,0.0,...,9.340024,24.808785,0.182104,23.546296,1.621393,15.2764,2.726309,-4.43073,-4.45985,-4.57053


In [10]:
def split_dataset(dataset, test_ratio=0.30):
    """Split a pandas DataFrame in two."""
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]

train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)
))

# display the first 3 examples of the training dataset
train_ds_pd.head(3)

11291 examples in training, 4849 examples for testing.


Unnamed: 0,relevance,group,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,...,f_16,f_17,f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25
0,2,g_1,3.0,2.079442,0.272727,0.261034,37.330565,11.431241,37.29975,1.138657,...,9.340024,24.808785,0.393091,57.416517,3.294893,25.0231,3.219799,-3.87098,-3.90273,-3.87512
2,2,g_1,0.0,0.0,0.0,0.0,37.330565,11.431241,37.29975,0.0,...,9.340024,24.808785,0.240319,25.816989,1.551342,15.865,2.764115,-4.28166,-4.33313,-4.44161
3,2,g_1,4.0,2.772589,0.333333,0.320171,37.330565,11.431241,37.29975,1.260808,...,9.340024,24.808785,0.111496,10.092426,0.649758,14.2778,2.658706,-4.77772,-4.73563,-4.86759


In the dataset, the relevance defines the ground-truth rank among rows of the same group.
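
Because relevance is only comparable within a group, a row-wise random split can place rows of the same query in both the training and the test set. One possible alternative, sketched below, assigns whole groups to one side or the other. The function name `split_dataset_by_group`, its `seed` parameter, and the toy DataFrame are assumptions made for this illustration; this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

def split_dataset_by_group(dataset, group_col="group", test_ratio=0.30, seed=0):
    """Split a pandas DataFrame in two, keeping each group entirely on one side."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(dataset[group_col].unique())
    # Draw one random number per group, not per row.
    test_groups = groups[rng.random(len(groups)) < test_ratio]
    is_test = dataset[group_col].isin(test_groups)
    return dataset[~is_test], dataset[is_test]

# Toy data: 10 groups with 3 rows each (illustrative values only).
toy_df = pd.DataFrame({
    "relevance": np.tile([2, 0, 1], 10),
    "group": np.repeat(["g_{}".format(i) for i in range(1, 11)], 3),
})

train_toy, test_toy = split_dataset_by_group(toy_df)
print(len(train_toy), len(test_toy))
```

With this kind of split the test ratio is approximate in terms of rows, since groups can differ in size.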

In [11]:
# name of the relevance (label) column; the grouping column is given to the model via ranking_group
relevance = "relevance"

ranking_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=relevance, task=tfdf.keras.Task.RANKING)
ranking_test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=relevance, task=tfdf.keras.Task.RANKING)




In [12]:
model_8 = tfdf.keras.GradientBoostedTreesModel(
    task=tfdf.keras.Task.RANKING,
    ranking_group="group",
    num_trees=50
)

model_8.fit(x=ranking_train_ds)

Use /var/folders/sk/f7k402kx1wvdmcz91gdz6hs00000gn/T/tmptqd686s7 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:01.934200. Found 11291 examples.
Training model...




Model trained in 0:00:00.624182
Compiling model...


[INFO kernel.cc:1176] Loading model from path /var/folders/sk/f7k402kx1wvdmcz91gdz6hs00000gn/T/tmptqd686s7/model/ with prefix d4805d9c6a7b4dbf
[INFO abstract_model.cc:1249] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO kernel.cc:1022] Use fast generic engine


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code


Model compiled.




<keras.callbacks.History at 0x17ab399d0>

Keras does not provide any ranking metrics. Instead, the training and validation metrics are shown in the training logs. In this case, the loss is lambda_mart_ndcg5 and the final (i.e., at the end of training) NDCG (normalized discounted cumulative gain) is 0.510136.
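
To give a feel for what NDCG measures, here is a minimal NumPy sketch of NDCG@5 for a single group. It assumes the common formulation with gain `2^relevance - 1` and discount `log2(rank + 1)`; the exact definition used inside the library may differ, and the helper names `dcg_at_k` and `ndcg_at_k` as well as the toy values are made up for this example.

```python
import numpy as np

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain of the first k items, in the order given."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0
    # Ranks start at 1, so the discounts are log2(2), log2(3), ...
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(gains / discounts))

def ndcg_at_k(relevances, scores, k=5):
    """NDCG@k for one group: DCG of the predicted order over DCG of the ideal order."""
    rel = np.asarray(relevances, dtype=float)
    predicted_order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ideal_dcg = dcg_at_k(np.sort(rel)[::-1], k)
    if ideal_dcg == 0.0:
        return 0.0  # convention chosen here for groups without any relevant item
    return dcg_at_k(rel[predicted_order], k) / ideal_dcg

toy_relevance = [2, 0, 1]
ndcg_perfect = ndcg_at_k(toy_relevance, scores=[0.9, 0.1, 0.5])   # best order
ndcg_reversed = ndcg_at_k(toy_relevance, scores=[0.1, 0.9, 0.5])  # worst order
print(ndcg_perfect, ndcg_reversed)
```

A dataset-level value would then typically be the average of this quantity over all groups.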

In [13]:
model_8.summary()

Model: "gradient_boosted_trees_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1
Trainable params: 0
Non-trainable params: 1
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: RANKING
Label: "__LABEL"
Rank group: "__RANK_GROUP"

Input Features (25):
	f_1
	f_10
	f_11
	f_12
	f_13
	f_14
	f_15
	f_16
	f_17
	f_18
	f_19
	f_2
	f_20
	f_21
	f_22
	f_23
	f_24
	f_25
	f_3
	f_4
	f_5
	f_6
	f_7
	f_8
	f_9

No weights

Variable Importance: MEAN_MIN_DEPTH:
    1. "__RANK_GROUP"  4.739620 ################
    2.      "__LABEL"  4.739620 ################
    3.         "f_15"  4.729743 ###############
    4.          "f_1"  4.729553 ###############
    5.         "f_13"  4.718252 ###############
    6.         "f_11"  4.717208 ###############
    7.         "f_12"  4.710133 ###############
    8.         "f_19"  4.606778 ###############
    9.          