<center><h1>From RankNet to ListNet</h1></center>

### Introduction

In this notebook, we review the paper 'Learning to Rank: From Pairwise Approach to Listwise Approach' and implement the corresponding RankNet and ListNet. Several methods take object pairs as input for learning to rank between the pairs. The paper postulates that learning to rank should adopt the listwise approach instead of pairwise approach. One of the major contributions of the paper is proposed a new probabilistic method for the approach. Specifically, it introdues two probability models:
* Permuation probability
* Top one probability
  
These two probability models have been applied to define a listwise loss function during learning the neural networks. 

The advantages with taking the pairwise approach are:
* existing methodologies on classfication can be directly applied (<span style="color:blue">high rank VS low rank is analogous to dog VS cat, meaning existing classification methods can be directly applied to this pairwise ranking problem</span>)
* the training instances of document pairs can be easily obtained in certain scenarios (<span style="color:blue">can be obtained easily for pairwise data</span>)

The probelms with taking the pairwise approach are:
* the objective of learning is formalized as minimizing errors in classification of document pairs, rather than minimizing errors in ranking of documents
* the training process is computationally costly, as the number of document pairs is very large (<span style="color:blue">this is kind of permutation: to compare as many pairs as possible</span>)
* the assumption of the document pairs are generated i.i.d. is also too strong (<span style="color:blue">this may lead biased of the data</span>)
* the number of generated document pairs varies largely from query to query which will lead in training a model biased toward queries with more document pairs 

### Listwise Approach

This paper proposed a probabilistic method to calculate the listwise loss function. They transform both the scores of the documents assigned by a ranking function and the explicit or implicit judgments of the documents given by humans into probability distributions. Specifically, we consider these two 'score' distribution separately.

<b>Firstly</b>, for the socre of the documents, we can have it by humans:

Suppose in the training dataset, we have a set of queries $Q = \{q^{(1)},q^{(2)},...,q^{(m)}\}$ is given. Each query $q^{i}$ is asscoiated with a list of documents (<span style="color:blue">Similar to hotel booking, a list of hotels is presented for each search</span>) $d^{(i)} = \Big(d_{1}^{(i)},d_{2}^{(i)},...,d_{n^{(i)}}^{(i)}\Big)$, where $d_{j}^{(i)}$ denotes the $j-$th document and $n^{(i)}$ denotes the sizes of $d^{(i)}$. Furthermore, each list of documents $d^{(i)}$ is attached with a list of scores $y^{(i)} = \Big({y_{1}^{(i)},y_{2}^{(i)},...,y_{n^{(i)}}^{(i)}}\Big)$ where $y_{j}^{(i)}$ denotes the score on document $d_{j}^{(i)}$ with respect to query $q^{(i)}$. And the score $y^{(i)}$ can be explicitly or implicitly given by humans.(<span style="color:blue"> which will be considered as the output during the learning process</span>). For example, $y_{j}^{(i)}$ can be considered as the number of clicks on document $d^{(i)}_j$, which means higher click-on rate is obsered for $d_j^{(i)}$ on $q^{i}$ the higher score (the stronger relevance exists between them) (<span style="color:blue">it can also be given by humans for example review scores</span>).

<b>Secondly</b> we consider the score (ranking) assgned by a ranking function (<span style="color:blue">I understand this as the attribute of the document which we need to learn from the neural network</span>), we use a neural network (function) to learn from the input feature vectors. 

A feature vector $x_j^{(i)} = \Psi(q^{(i)},d^{(i)}_j)$ is created from each query-document pair $(q^{(i)},d^{(i)}_j)$ for $i=1,2,...,m; j = 1,2,...,n^{(i)}$. Each list of features $x^{(i)} = (x_1^{(i)},x_2^{(i)},...,x_{n^{(i)}}^{(i)})$ where $x_j^{(i)}$ could be (should be) multiple dimension vectors. Then we kind have the input features $x_{(i)}$ and the corresponding output score $y^{(i)}$.

<b>Finally</b>, we create a ranking function $f$ (neural network), for each feature vector $x^{(i)}_j$, it provides a score $f(x_{j}^{(i)})$. The scores learned by the neural network $z^{i} = \Big(f(x_{1}^{(i)}),f(x_{2}^{(i)}),...,f(x_{n^{(i)}}^{(i)})\Big)$ should align with the human-provided scores, $y^{(i)}= \Big(y_{1}^{(i)},y_{2}^{(i)},...,y_{n^{(i)}}^{(i)}\Big)$ in terms of ordering. That is, the order of the values within each vector should be the same.  Then the objective of learning is to define a total loss function to make sure that the order of the values whithin each vector should be minimized:
$$\sum_{i=1}^{m} L(y^{(i)},z^{(i)})$$



### Probability Models

In this section, we introduce probability models to calculate the listwise loss function. The authors map a list of scores to a probability distribution using one of the models described here and then apply a metric between two probability distributions as the loss function. The two models are referred to as permutation probability and top-one probability.

#### Permutation Probability

Suppose the set of objects to be ranked is identified by the numbers $1,2,...,n$. A permutation $\pi$ of these objects is defined as a bijection from the set $\{1,2,...,n\}$ to itself. The permutation is represented as $\pi = <\pi(1),\pi(2),...,\pi(n)>$, where $\pi(j)$ denotes the object in the $j-$th position of the permutation. The set of all possible permutations of $n$ objects is denoted by $\Omega_n$. Use $s$ to denote the list of scores $s=(s_1,s_2,...,s_n)$, where $s_j$ is the score of the $j-$th object. In the article, the authors propose the following definition, lemma and theorems.

<b>Definition 1 </b> Suppose that $\pi$ is a permutation on the $n$ objects, and $\phi(\cdot)$ is an increasing and strictly positive function. Then, the probability of permutation $\pi$ given the list of scores $s$ is defined as 
$$P_s(\pi) = \prod\limits_{j=1}^{n}\frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n}\phi(s_{\pi(k)})}$$
where $s_{\pi(j)}$ is the score of object at position $j$ of permutation $\pi$.

<b>Lemma 2 </b> The permutation probabilities $P_s(\pi)$, $\pi\in \Omega_n$ form a probability distribution over the set of permutations i.e., for each $\pi \in \Omega_n$, we have $P_s(\pi)>0$, and $\sum\limits_{\pi \in \Omega_n} P_s(\pi) = 1 $.

<b>Proof</b>  <span style="color:blue">The proof can be found in the paper as following: </span>

According to the definition of $\phi(\cdot)$, we have $P_s(\pi)>0$ for any $\pi \in \Omega_n$. Furthermore,
$$
\begin{align}
\sum_{\pi\in\Omega_n} P_s(\pi) &= \sum_{\pi\in \Omega_n} \prod\limits_{j=1}^{n}\frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n}\phi(s_{\pi(k)})}\\
& = \sum_{\pi(1)=1}^n    \sum_{\pi(2)=1,\pi(2)\neq\pi(1)}^n ...  \sum_{\pi(q)=1,\pi(q)\neq\pi(l),\forall l< q}^n ...  \sum_{\pi(n)=1,\pi(n)\neq\pi(l),\forall l< n}^n  \prod\limits_{j=1}^{n}\frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n}\phi(s_{\pi(k)})}\tag{1}\\
& = \sum_{\pi(1)=1}^n \frac{\phi(s_{\pi(1)})}{\sum_{k=1}^{n}\phi(s_{\pi(k)})} \sum_{\pi(2)=1,\pi(2)\neq \pi(1)}^n \frac{\phi(s_{\pi(2)})}{\sum_{k=2}^{n}\phi(s_{\pi(k)})} ... \sum_{\pi(q)=1,\pi(q)\neq \pi(l), \forall l<q}^n \frac{\phi(s_{\pi(q)})}{\sum_{k=s}^{n}\phi(s_{\pi(k)})} ... \sum_{\pi(n)=1,\pi(n)\neq \pi(l), \forall l<n}^n \frac{\phi(s_{\pi(n)})}{\sum_{k=n}^{n}\phi(s_{\pi(k)})}
\end{align}
$$
Since for any $1\leq q\leq n$, 
$$\sum_{\pi(n)=1,\pi(n)\neq \pi(l), \forall l<n}^n \frac{\phi(s_{\pi(n)})}{\sum_{k=n}^{n}\phi(s_{\pi(k)})} = 1.$$
Then,we have 
$P_s(\pi)>0$, and $\sum\limits_{\pi \in \Omega_n} P_s(\pi) = 1 $.

<span style="color:blue"> I will prove this lemma by inducation method. </span>

<span style="color:blue"> <b>Proof</b>  <span style="color:blue"> I will prove this lemma by inducation method. Firstly: suppose the object size $o=2$ as $\{1,2\}$ and the correspong score is $\{s_1,s_2\}$. The full permuation is $\{1,2\}$ and $\{2,1\}$. By Definition 1, we have 
$$
\begin{align}
\sum_{\pi\in\Omega_2} P_s(\pi) &= \frac{\phi(s_1)}{\phi(s_1)+\phi(s_2)}\cdot\frac{\phi(s_2)}{\phi(s_2)} + \frac{\phi(s_2)}{\phi(s_1)+\phi(s_2)}\cdot\frac{\phi(s_1)}{\phi(s_1)}\\
&=\frac{\phi(s_1)+\phi(s_2)}{\phi(s_1)+\phi(s_2)}\\
&= 1
\end{align}
$$
Now assume when $o=k$, we have $\sum\limits_{\pi \in \Omega_k} P_s(\pi) = 1 $. Then we will prove $\sum\limits_{\pi \in \Omega_{k+1}} P_s(\pi) = 1 $
$$
\begin{align}
\sum_{\pi\in\Omega_{k+1}} P_s(\pi) &= \sum\limits_{\pi(1)=1, \pi \in \Omega_k} P_s(\pi)  + \sum\limits_{\pi(1)=2, \pi \in \Omega_k} P_s(\pi)  + ... + \sum\limits_{\pi(1)=k+1, \pi \in \Omega_k} P_s(\pi) \\
&= \frac{\phi(s_1)}{\phi(s_1)+\phi(s_2)+...+\phi(s_{k+1})}\cdot \sum_{\pi\in\Omega_{k}} P_s(\pi) + \frac{\phi(s_2)}{\phi(s_1)+\phi(s_2)+...+\phi(s_{k+1})}\cdot \sum_{\pi\in\Omega_{k}} P_s(\pi)+ ... + \frac{\phi(s_{k+1})}{\phi(s_1)+\phi(s_2)+...+\phi(s_{k+1})}\cdot \sum_{\pi\in\Omega_{k}} P_s(\pi)
\end{align}
$$
With the assumption $\sum\limits_{\pi \in \Omega_k} P_s(\pi) = 1 $, we can have
$$
\begin{align}
\sum_{\pi\in\Omega_{k+1}} P_s(\pi) &= \frac{\phi(s_1)}{\phi(s_1)+\phi(s_2)+...+\phi(s_{k+1})} + \frac{\phi(s_2)}{\phi(s_1)+\phi(s_2)+...+\phi(s_{k+1})}+ ...+\frac{\phi(s_{k+1})}{\phi(s_1)+\phi(s_2)+...+\phi(s_{k+1})}\\
&= 1
\end{align}
$$
</span>

<span style="color:blue"> The contruction of the permutation probability confirms its sum equals to 1. Once the permuation probability is built, we can use it as the probability distribution in the metric function (loss function). The purpose of using this permutation probability can repeatly tell the neural network the order of the score for the document with respect to the query. </span>

<b>Theorem 3 </b> Given any two permutations $\pi$ and $\pi' \in \Omega_n$, if (1) $\pi(p) = \pi'(q), p<q$; (2) $\pi(r) = \pi'(r)$, $r\neq p,q$; (3) $s_{\pi(p)}>s_{\pi(q)}$, then $P_s(\pi)>P_s(\pi')$.

<b>Proof</b> From Definition 1, we have 

$$
\begin{align}
P_s(\pi) &= \prod\limits_{j=1}^{n}\frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n}\phi(s_{\pi(k)})}
\end{align}
$$
and 
$$
\begin{align}
P_s(\pi') &= \prod\limits_{j=1}^{n}\frac{\phi(s_{\pi'(j)})}{\sum_{k=j}^{n}\phi(s_{\pi'(k)})}
\end{align}
$$
In order to prove $P_s(\pi)>P_s(\pi')$, we need to prove
$$\prod\limits_{j=p}^{q}\frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n}\phi(s_{\pi(k)})} > \prod\limits_{j=p}^{q}\frac{\phi(s_{\pi'(j)})}{\sum_{k=j}^{n}\phi(s_{\pi'(k)})}$$

<span style="color:blue"> With the $\pi$ permuation   as $(s_{\pi(1)},s_{\pi(2)},...,s_{\pi(p)},...,s_{\pi(q)},...,s_{\pi(n)})$; and the $\pi'$ permuation  as $(s_{\pi'(1)},s_{\pi'(1)},...,s_{\pi'(q)},...,s_{\pi'(p)},...,s_{\pi'(n)})$ and $\pi(r) = \pi'(r)$, $r\neq p,q$ we rewrite $P_s(\pi)$  and $P_s(\pi')$ as 
$$
\begin{align}
P_s(\pi) &=  \frac{\phi(s_{\pi(1)})}{\sum_{k=1}^{n}\phi(s_{\pi(k)})}\cdot \frac{\phi(s_{\pi(2)})}{\sum_{k=2}^{n}\phi(s_{\pi(k)})}\cdot ... \cdot\frac{\phi(s_{\pi(p)})}{\sum_{k=p}^{n}\phi(s_{\pi(k)})}\cdot ... \cdot\frac{\phi(s_{\pi(q)})}{\sum_{k=q}^{n}\phi(s_{\pi(k)})}\cdot ... \cdot \frac{\phi(s_{\pi(n)})}{\phi(s_{\pi(n)})}
\end{align}
$$
and 
$$
\begin{align}
P_s(\pi') &=  \frac{\phi(s_{\pi'(1)})}{\sum_{k=1}^{n}\phi(s_{\pi'(k)})}\cdot \frac{\phi(s_{\pi'(2)})}{\sum_{k=2}^{n}\phi(s_{\pi'(k)})}\cdot ... \cdot\frac{\phi(s_{\pi'(q)})}{\sum_{k=p}^{n}\phi(s_{\pi'(k)})}\cdot ... \cdot\frac{\phi(s_{\pi'(p)})}{\sum_{k=q}^{n}\phi(s_{\pi'(k)})}\cdot ... \cdot \frac{\phi(s_{\pi'(n)})}{\phi(s_{\pi'(n)})}\\
&=  \frac{\phi(s_{\pi(1)})}{\sum_{k=1}^{n}\phi(s_{\pi(k)})}\cdot \frac{\phi(s_{\pi(2)})}{\sum_{k=2}^{n}\phi(s_{\pi(k)})}\cdot ... \cdot\frac{\phi(s_{\pi'(q)})}{\sum_{k=p}^{n}\phi(s_{\pi'(k)})}\cdot ... \cdot\frac{\phi(s_{\pi'(p)})}{\sum_{k=q}^{n}\phi(s_{\pi'(k)})}\cdot ... \cdot \frac{\phi(s_{\pi(n)})}{\phi(s_{\pi(n)})}
\end{align}
$$
Now it is clear why only need to prove 
$$\prod\limits_{j=p}^{q}\frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n}\phi(s_{\pi(k)})} > \prod\limits_{j=p}^{q}\frac{\phi(s_{\pi'(j)})}{\sum_{k=j}^{n}\phi(s_{\pi'(k)})}$$
</span>. 

Notice that $$\prod\limits_{j=p}^{q} \phi(s_{\pi(j)})  = \prod\limits_{j=p}^{q} \phi(s_{\pi'(j)}) $$. Thus, we need to prove 
$$\prod\limits_{j=p}^{q}\frac{1}{\sum_{k=j}^{n}\phi(s_{\pi(k)})} > \prod\limits_{j=p}^{q}\frac{1}{\sum_{k=j}^{n}\phi(s_{\pi'(k)})}$$
When $j = p$, $\sum_{k=j}^{n}\phi(s_{\pi(k)}) =  \sum_{k=j}^{n}\phi(s_{\pi'(k)})$. 
When $p<j <q$, because $\phi$ is an increasing function and $s_{\pi(p)}>s_{\pi(q)}$, we have 
$$\prod\limits_{j=p+1}^{q}\frac{1}{\sum_{k=j}^{n}\phi(s_{\pi(k)})} > \prod\limits_{j=p+1}^{q}\frac{1}{\sum_{k=j}^{n}\phi(s_{\pi'(k)})}$$ 
Then we have $P_s(\pi)>P_s(\pi')$.

<b>Theorem 4 </b> For the $n$ objects, if $s_1>s_2>...>s_n$, then $P_s(<1,2,...,n>)$ is the highest permutation probability and $P_s(<n,n-1,...,1>)$ is the lowest permutation probability among the permutation probabilities of the $n$ objects.

<span style="color:blue"> This could be proved with Theorem 3. </span>

<span style="color:blue"> All the lemma and theorems are used to confirm that the permuation probability satisfies the probability distribution and better ranking will give higher probability. But this may lead time consuming to compute the permutation of the object.</span>

#### Top One Probability

The top one probability of an object represents the probability of its being ranked on the top, given the scores of all the objects.

<b>Definition 5 </b> The top one probability of object $j$ is defined as  
$$P_s(j) = \sum_{\pi(1) = j, \pi \in \Omega_n} P_s(\pi)$$
where $P_s(\pi)$ is  permutation probability of $\pi$ given $s$.

<span style="color:blue"> Here it shows that we still need to cacluate all the permutation probabilities. In order to solve this problem, the authors provide an alternative way to define the probability.</span>


<b>Theorem 6 </b> For top one probability $P_s(j)$, we have 
$$P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n}\phi(s_k)},$$
where $s_j$ is the score of object $j$, $j=1,2,...,n.$

<b>Proof </b> <span style="color:blue"> We first write $P_s(j)$ by the definition of $P_s(\pi)$. 
$$
\begin{align}
P_s(j) &= \sum_{\pi(1) = j, \pi \in \Omega_n} P_s(\pi)\\
&= \frac{\phi(s_{j})}{\sum_{k=1}^{n}\phi(s_{k})}  \sum_{\pi \in \Omega_{n-1}} P_s(\pi)\\
&= \frac{\phi(s_{j})}{\sum_{k=1}^{n}\phi(s_{k})}
\end{align}
$$
The last can be obtained by Lemma 2.
</span> 

With Theorem 6, we can consequently have the following lemma and theorem.

<b>Lemma 7 </b> Top one probabilities $P_s(j)$, $j=1,2,...,n$ forms a probability distribution over the set of $n$ objects.

<b>Theorem 8 </b> Given any two objects $j$ and $k$, if $s_j>s_k$, $j\neq k$, $j,k=1,2,...,n$ then  $P_s(j)>P_s(k)$.


If we use Cross Entropy as metric, the listwise loss function becomes:
$$
\begin{align}
L(y^{(i)},z^{(i)}) = -\sum_{j=1}^{n} P_{y^{(i)}}(j)\log(P_{z^{(i)}}(j))
\end{align}
$$


### RankNet

#### library

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

#### Model

In [2]:
class RankNet(nn.Module):
    def __init__(self, input_size):
        super(RankNet, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1)  # Output a single score
        )

    def forward(self, x):
        return self.fc(x)


In [3]:
def ranknet_loss(score1, score2, true_label):
  
    #The RankNet loss
    s_diff = score1 - score2
    prob = torch.sigmoid(s_diff)
    
    loss = nn.BCELoss()(prob, true_label)
    return loss

In [4]:
input_size = 5  # Assume each item has 5 features
model = RankNet(input_size)

#### Data

In [5]:
item1_features = torch.randn((10, input_size))  # 10 items with 5 features
item2_features = torch.randn((10, input_size))
true_labels = torch.randint(0, 2, (10, 1)).float()
print(true_labels)

tensor([[0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.]])


### Training

In [6]:
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(100):
    model.train()
    
    # Get scores for both items
    score1 = model(item1_features)
    score2 = model(item2_features)
    
    # Compute the loss
    loss = ranknet_loss(score1, score2, true_labels)
    
    # Backpropagation and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Epoch 0, Loss: 0.6771
Epoch 10, Loss: 0.6262
Epoch 20, Loss: 0.5824
Epoch 30, Loss: 0.5428
Epoch 40, Loss: 0.5052
Epoch 50, Loss: 0.4685
Epoch 60, Loss: 0.4326
Epoch 70, Loss: 0.3977
Epoch 80, Loss: 0.3639
Epoch 90, Loss: 0.3309


### ListNet

In [10]:
import torch
import numpy as np
from sklearn.model_selection import train_test_split
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [11]:
# Generate synthetic data
n_samples, n_features = 100, 10
np.random.seed(42)
X = np.random.randn(n_samples, n_features)
relevance_scores = np.random.randint(0, 5, size=(n_samples,))

X_train, X_test, y_train, y_test = train_test_split(X, relevance_scores, test_size=0.2, random_state=42)
X_train, X_test = torch.tensor(X_train, dtype=torch.float32), torch.tensor(X_test, dtype=torch.float32)
y_train, y_test = torch.tensor(y_train, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32)
y_test.shape[0]

20

In [12]:
# ListNet Model
class ListNetModel(nn.Module):
    def __init__(self, input_size):
        super(ListNetModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

In [13]:
def listnet_loss(y_pred, y_true):
    P_y_pred = F.softmax(y_pred, dim=0)
    P_y_true = F.softmax(y_true, dim=0)
    loss = -torch.sum(P_y_true * torch.log(P_y_pred))
    return loss

In [14]:
def swapped_pairs(ys_pred, ys_target):
    N = ys_target.shape[0]
    swapped = 0
    for i in range(N - 1):
        for j in range(i + 1, N):
            if ys_target[i] < ys_target[j]:
                if ys_pred[i] > ys_pred[j]:
                    swapped += 1
            elif ys_target[i] > ys_target[j]:
                if ys_pred[i] < ys_pred[j]:
                    swapped += 1
    return swapped


def ndcg(ys_true, ys_pred):
    def dcg(ys_true, ys_pred):
        _, argsort = torch.sort(ys_pred, descending=True, dim=0)
        ys_true_sorted = ys_true[argsort]
        ret = 0
        for i, l in enumerate(ys_true_sorted, 1):
            ret += (2 ** l - 1) / np.log2(1 + i)
        return ret
    ideal_dcg = dcg(ys_true, ys_true)
    pred_dcg = dcg(ys_true, ys_pred)
    return pred_dcg / ideal_dcg

In [16]:
input_size = X_train.shape[1]
model = ListNetModel(input_size)

In [19]:
num_epochs=200
for epoch in range(num_epochs):
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    model.train()
    optimizer.zero_grad()
    predicted_scores = model(X_train).squeeze()
    loss = listnet_loss(predicted_scores, y_train)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 50 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

        with torch.no_grad():
            test_pred = model(X_test)
            valid_swapped_pairs = swapped_pairs(test_pred, y_test)
            ndcg_score = ndcg(y_test, test_pred).item()
            print(f"epoch: {epoch + 1} valid swapped pairs: {valid_swapped_pairs}/{y_test.shape[0] * (y_test.shape[0] - 1) // 2} ndcg: {ndcg_score:.4f}")

Epoch [50/200], Loss: 3.8510
epoch: 50 valid swapped pairs: 61/190 ndcg: 0.7927
Epoch [100/200], Loss: 3.8577
epoch: 100 valid swapped pairs: 61/190 ndcg: 0.7922
Epoch [150/200], Loss: 3.8483
epoch: 150 valid swapped pairs: 63/190 ndcg: 0.7732
Epoch [200/200], Loss: 3.8553
epoch: 200 valid swapped pairs: 64/190 ndcg: 0.7802
