# Capstone: Learning to Rank
## RankNet

In this notebook we will train the learning-to-rank (LTR) model RankNet on randomly sampled data.

The idea behind LTR is always to start with a dataset of some queries, their returned documents and the score of relevance. This relevance may be an *a posteriori* metric like number of clicks.

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook for this lab.

Follow the instructions. Good luck!


The idea behind RankNet is to model the **probability** that `document i` is ranked before `document j`. Given the relevance scores $s_i$ and $s_j$ of the two documents, the target probability is:

- $P_{ij} = 1$ if $s_i > s_j$
- $P_{ij} = 0.5$ if $s_i = s_j$
- $P_{ij} = 0$ if $s_i < s_j$

So for *every pair of inputs* we calculate both outputs, subtract them, and pass the difference through a logistic function to model the probability:

<img src="./ranknet.png">
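
The step described above (subtract the two outputs, then apply a logistic function) can be sketched in plain Python. The function name `pairwise_probability` is an assumption made for illustration and is not part of the lab:

```python
import math

def pairwise_probability(s_i: float, s_j: float) -> float:
    """Logistic function applied to the score difference o_ij = s_i - s_j."""
    o_ij = s_i - s_j
    return 1.0 / (1.0 + math.exp(-o_ij))

# A higher score for document i pushes the probability above 0.5,
# equal scores give exactly 0.5, and swapping the documents mirrors the result.
print(pairwise_probability(2.0, 0.5))
print(pairwise_probability(1.0, 1.0))
print(pairwise_probability(0.5, 2.0))
```

Because the logistic function is symmetric around zero, the probabilities for the pair (i, j) and the pair (j, i) always sum to 1.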


In [None]:
!nvidia-smi

In [None]:
!pip install textblob 'keras-nlp' 'keras-preprocessing' 'gensim==4.2.0' np_utils

In [None]:
import multiprocessing
import os
import random
import warnings
from itertools import combinations

import gensim
import keras.backend as K
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from gensim.models import Doc2Vec
from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Activation, Dense, Subtract
from tensorflow.nn import leaky_relu

TRACE = False
embedding_dim = 100
epochs = 50
batch_size = 50
sample_queries = 20
sample_results_dataset = 100



In [None]:
def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto( device_count = {'GPU': gpus  , 'CPU': cores} , intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config) 
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')


In [None]:
%%writefile get_data.sh

if [ ! -f yelp.csv ]; then
  wget -O yelp.csv "https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0"
fi
if [ ! -f doc2vec_yelp_model ]; then
  wget -O doc2vec_yelp_model "https://www.dropbox.com/s/bibu9bashb0cd68/doc2vec_yelp_model?dl=0"
fi

In [None]:
!bash get_data.sh

In [None]:
model = Doc2Vec.load("./doc2vec_yelp_model")

In [None]:
path = './yelp.csv'
yelp = pd.read_csv(path)
train_set_reviews = yelp.sample(n=sample_results_dataset).reset_index(drop=True)
queries = yelp.text.sample(n=sample_queries).reset_index(drop=True)


In [None]:
# `results` is a tensor that holds, for each query_id and review_id, the vector of the review inferred by the doc2vec model.
# We will use it to create the pairs of reviews (xi, xj) that will be the input of our model.
results = np.zeros((len(queries), len(train_set_reviews), embedding_dim))


# The `scores` tensor holds, for each query and review, their similarity according to the doc2vec model.
# We will use this similarity score later to get pij using the formulas at the start; pij will be the true values to predict.
scores = np.zeros((len(queries), len(train_set_reviews)))

for q_ix, query in enumerate(queries):
  for r_ix, review in enumerate(train_set_reviews.text):
      #  FILL
      pass
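
One possible shape for this loop is sketched below on toy data. It assumes cosine similarity as the similarity measure, and it uses random vectors as stand-ins for the vectors that the Doc2Vec model would infer (for example with gensim's `infer_vector` on a tokenized text). The names `cosine_similarity`, `toy_queries`, `toy_reviews`, `toy_results` and `toy_scores` are hypothetical:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
toy_queries = rng.normal(size=(2, 4))  # stand-ins for inferred query vectors
toy_reviews = rng.normal(size=(3, 4))  # stand-ins for inferred review vectors

toy_results = np.zeros((2, 3, 4))  # review vector per (query, review)
toy_scores = np.zeros((2, 3))      # similarity per (query, review)
for q_ix, q_vec in enumerate(toy_queries):
    for r_ix, r_vec in enumerate(toy_reviews):
        toy_results[q_ix, r_ix] = r_vec
        toy_scores[q_ix, r_ix] = cosine_similarity(q_vec, r_vec)

print(toy_scores)
```

In the lab itself, the same two assignments would fill `results[q_ix, r_ix]` and `scores[q_ix, r_ix]`.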

In [None]:
# put data into pairs
xi = []
xj = []
pij = []
pair_id = []
pair_query_id = []

for q_ix, query in enumerate(queries):
    for pair_idx in combinations(enumerate(results[q_ix]), 2):
        pair_query_id.append(query)
        pair_id.append(pair_idx)
        ix_i, document_i = pair_idx[0]
        ix_j, document_j = pair_idx[1]
        xi.append(document_i)
        xj.append(document_j)

        _pij = None  # FILL: find pij for each q_ix, pair_idx with the help of the scores matrix and the formula at the start
        pij.append(_pij)

xi = np.array(xi)
xj = np.array(xj)
pij = np.array(pij)
pair_query_id = np.array(pair_query_id)
del results
del scores
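
The target for each pair follows directly from the three cases listed at the start. A minimal sketch on toy scores, where the helper name `target_probability` and the list `toy_scores` are assumptions made for illustration:

```python
from itertools import combinations

def target_probability(s_i: float, s_j: float) -> float:
    """Target P_ij: 1 if s_i > s_j, 0 if s_i < s_j, 0.5 on a tie."""
    if s_i > s_j:
        return 1.0
    if s_i < s_j:
        return 0.0
    return 0.5

toy_scores = [0.9, 0.4, 0.4]  # similarity scores of three documents for one query
toy_pij = [
    target_probability(toy_scores[i], toy_scores[j])
    for i, j in combinations(range(len(toy_scores)), 2)
]
print(toy_pij)  # [1.0, 1.0, 0.5]
```

In the pairing loop, the two scores would be read from `scores[q_ix, ix_i]` and `scores[q_ix, ix_j]`.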


In [None]:
# FILL

# Split xi, xj, pij, and pair_id into train and test sets setting the kwarg stratify to be pair_query_id
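
As a rough illustration of a stratified split, the sketch below runs `train_test_split` on toy arrays. The array names, the `test_size` of 0.2 and the `random_state` are assumptions, not values prescribed by the lab:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_queries, pairs_per_query, dim = 3, 10, 4
n_pairs = n_queries * pairs_per_query
rng = np.random.default_rng(0)

toy_xi = rng.normal(size=(n_pairs, dim))
toy_xj = rng.normal(size=(n_pairs, dim))
toy_pij = rng.choice([0.0, 0.5, 1.0], size=n_pairs)
toy_query_id = np.repeat(np.arange(n_queries), pairs_per_query)

# train_test_split returns a (train, test) pair for every array passed in,
# in the same order; stratify keeps each query's share equal in both sets.
(toy_xi_train, toy_xi_test,
 toy_xj_train, toy_xj_test,
 toy_pij_train, toy_pij_test,
 toy_q_train, toy_q_test) = train_test_split(
    toy_xi, toy_xj, toy_pij, toy_query_id,
    test_size=0.2, random_state=42, stratify=toy_query_id,
)

print(np.bincount(toy_q_train), np.bincount(toy_q_test))
```

Stratifying on the query identifier keeps every query represented in both the train and the test set, which matters here because pairs are only comparable within the same query.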

In [None]:
xi_train = tf.constant(xi_train)
xi_test = tf.constant(xi_test)
xj_train = tf.constant(xj_train)
xj_test = tf.constant(xj_test)
pij_train = tf.constant(pij_train)
pij_test = tf.constant(pij_test)
pair_id_train = pair_id_train
pair_id_test = pair_id_test

In [None]:

# Try to create a model with 2 dense layers with leaky_relu as activations, then a linear dense layer and a subtract layer.

# This time I will leave it blank, but the variable oij should hold the output of the subtraction.


# model architecture
class RankNet(Model):
    def __init__(self):
        super().__init__()
        # FILL

    def call(self, inputs):
        xi, xj = inputs
        # FILL
        output = Activation('sigmoid')(oij)
        return output

    def build_graph(self):
        x = [Input(shape=(embedding_dim,)), Input(shape=(embedding_dim,))]
        return Model(inputs=x, outputs=self.call(x))
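
To make the data flow concrete without giving a Keras implementation, here is a NumPy-only sketch of the forward pass: one scoring network with shared weights is applied to both documents, the scores are subtracted, and a sigmoid is applied. The layer sizes, the random weights and all names are assumptions for illustration; the slope `alpha=0.2` mirrors the default of `tf.nn.leaky_relu`:

```python
import numpy as np

def leaky_relu(x: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(0)
dim, hidden = 4, 8
W1, b1 = rng.normal(size=(dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, hidden)), np.zeros(hidden)
w_out, b_out = rng.normal(size=(hidden, 1)), np.zeros(1)

def score(x: np.ndarray) -> np.ndarray:
    """Two leaky-ReLU dense layers followed by a linear output layer."""
    h = leaky_relu(x @ W1 + b1)
    h = leaky_relu(h @ W2 + b2)
    return h @ w_out + b_out

def ranknet_forward(xi: np.ndarray, xj: np.ndarray) -> np.ndarray:
    oij = score(xi) - score(xj)         # same weights score both documents
    return 1.0 / (1.0 + np.exp(-oij))   # sigmoid of the score difference

toy_xi = rng.normal(size=(5, dim))
toy_xj = rng.normal(size=(5, dim))
p_ij = ranknet_forward(toy_xi, toy_xj)
p_ji = ranknet_forward(toy_xj, toy_xi)
print(p_ij.ravel(), p_ji.ravel())
```

Sharing the weights between the two branches is what makes the output consistent: swapping the inputs flips the sign of `oij`, so `p_ij + p_ji` equals 1, and a document paired with itself gets 0.5.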

In [None]:
# Compile the model using binary_crossentropy as the loss.
ranknet = RankNet()

ranknet.build_graph().summary()
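
Binary cross-entropy also works with the soft target 0.5 used for ties. The plain-Python sketch below shows the per-pair loss; the function name and the clipping constant `eps` are assumptions made for illustration:

```python
import math

def binary_crossentropy(target: float, predicted: float, eps: float = 1e-7) -> float:
    """Per-pair loss: -(t * log(p) + (1 - t) * log(1 - p)), with p clipped away from 0 and 1."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

print(binary_crossentropy(1.0, 0.9))  # confident and correct: small loss
print(binary_crossentropy(1.0, 0.6))  # less confident: larger loss
print(binary_crossentropy(0.5, 0.5))  # tie predicted as a tie: log(2), the minimum for this target
```

Note that for a tie the loss never reaches zero; its minimum is log(2), reached when the model predicts 0.5.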

In [None]:
# Train the model, passing [xi_train, xj_train] as inputs and pij_train as targets.
# Pass validation data as well, since val_loss is plotted below.
history = None  # FILL

In [None]:
# function for plotting loss
def plot_metrics(train_metric, val_metric=None, metric_name=None, title=None, ylim=5):
    plt.title(title)
    plt.ylim(0,ylim)
    plt.plot(train_metric,color='blue',label=metric_name)
    if val_metric is not None: plt.plot(val_metric,color='green',label='val_' + metric_name)
    plt.legend(loc="upper right")

# plot loss history
plot_metrics(history.history['loss'], history.history['val_loss'], "Loss", "Loss", ylim=1.0)

In [None]:
# Test with a new sample pair of docs to get their associated probability.

new_doci = None
new_docj = None
inputs = None

In [None]:
ranknet(inputs)