# Capstone: Learning to Rank
## RankNet

In this notebook we will train RankNet, a learning-to-rank (LTR) model, on randomly sampled data.

LTR always starts from a dataset of queries, the documents returned for each query, and a relevance score for each query-document pair. This relevance may be an *a posteriori* metric such as the number of clicks.

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab has GPU access for free.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook in this lab.

Follow the instructions. Good luck!


The idea behind RankNet is to model the **probability** $P_{ij}$ that `document i` comes before `document j`. Given the relevance scores $s_i$ and $s_j$, the target values are:

- $P_{ij} = 1$ if $s_i > s_j$
- $P_{ij} = 0.5$ if $s_i = s_j$
- $P_{ij} = 0$ if $s_i < s_j$

So for *every pair of inputs* we calculate both outputs, subtract them, and pass the difference through a logistic function to model the probability:

<img src="https://github.com/axel-sirota/practical-nlp/blob/main/1-similarity/ranknet.png?raw=1">
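As a rough numeric illustration of the step above (not part of the original lab), the sketch below takes made-up outputs `o_i` and `o_j` for three pairs, applies the logistic function to their difference, and compares the result with the targets using binary cross-entropy. All names and values are illustrative assumptions.

```python
import numpy as np

def pairwise_probability(o_i, o_j):
    """Logistic function applied to the difference of the two outputs."""
    return 1.0 / (1.0 + np.exp(-(o_i - o_j)))

def binary_cross_entropy(target, predicted, eps=1e-12):
    """Mean cross-entropy between target and predicted probabilities."""
    predicted = np.clip(predicted, eps, 1.0 - eps)
    per_pair = -(target * np.log(predicted) + (1.0 - target) * np.log(1.0 - predicted))
    return float(np.mean(per_pair))

# Hypothetical model outputs for three document pairs.
o_i = np.array([2.0, 0.5, -1.0])
o_j = np.array([0.0, 0.5, 1.0])
# Targets following the rule above: 1 if i should rank first, 0.5 for a tie, 0 otherwise.
targets = np.array([1.0, 0.5, 0.0])

p_ij = pairwise_probability(o_i, o_j)
loss = binary_cross_entropy(targets, p_ij)
print(p_ij)
print(loss)
```

Equal outputs give a probability of exactly 0.5, and swapping the two documents turns $P_{ij}$ into $1 - P_{ij}$.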


In [1]:
!nvidia-smi

Thu Oct 20 18:32:12 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
!pip install 'gensim==4.2.0'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim==4.2.0
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 1.4 MB/s 
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


In [4]:
import multiprocessing
import os
import random
import warnings
from itertools import combinations

import gensim
import keras.backend as K
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from gensim.models import Doc2Vec
from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input, layers
from tensorflow.keras.layers import Activation, Dense, Subtract
from tensorflow.nn import leaky_relu

TRACE = False
embedding_dim = 100
epochs = 10
batch_size = 50
sample_queries = 20
sample_results_dataset = 100



In [5]:
def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto(
      device_count={'GPU': gpus, 'CPU': cores},
      intra_op_parallelism_threads=1,
      inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config)
  K.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')


In [6]:
%%writefile get_data.sh

# Quote the URLs so the shell does not treat "?" as a glob character.
if [ ! -f yelp.csv ]; then
  wget -O yelp.csv "https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0"
fi
if [ ! -f doc2vec_yelp_model ]; then
  wget -O doc2vec_yelp_model "https://www.dropbox.com/s/bibu9bashb0cd68/doc2vec_yelp_model?dl=0"
fi

Writing get_data.sh


In [7]:
!bash get_data.sh

--2022-10-20 20:30:24--  https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/xds4lua69b7okw8/yelp.csv [following]
--2022-10-20 20:30:24--  https://www.dropbox.com/s/raw/xds4lua69b7okw8/yelp.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc9da2e1b518a7e0873699e20fcb.dl.dropboxusercontent.com/cd/0/inline/BvN-ew-xwQ6-3lnQHUaFL5147xpFBsYHhEyrWKMC3OcfhKxNUSnSkMFuhGEu3mxHTIvuxFDJ5WRoIFgZrSn3Rj59x-iK2uxkSOKDHspYsXA0ndgEudgOjiPBVKCetE3-JIqr-BMNbx3onIYNQqO1FEUizJQnPFz7yg9Gj67ZA4A5Jg/file# [following]
--2022-10-20 20:30:25--  https://uc9da2e1b518a7e0873699e20fcb.dl.dropboxusercontent.com/cd/0/inline/BvN-ew-xwQ6-3lnQHUaFL5147xpFBsYHhEyrWKMC3OcfhKxNUSnSkMFuhGEu3mxHTIvuxFDJ5WRoIFgZrSn3Rj5

In [8]:
model = Doc2Vec.load("./doc2vec_yelp_model")

In [9]:
path = './yelp.csv'
yelp = pd.read_csv(path)
train_set_reviews = yelp.sample(n=sample_results_dataset).reset_index(drop=True)

queries = yelp.text.sample(n=sample_queries).reset_index(drop=True)
print(queries)


0     I really try to like Old Town Scottsdale - and...
1     (aka. SKETCHY TEMPE with BONNIE G, Part One of...
2     Thought Saturday night would be busy at 6:00 P...
3     I was actually really impressed, even though I...
4     Pros:\n1.  Excellent service.  Hell, it's damn...
5     Treated with complete disrespect. Worst servic...
6     I met up with a girlfriend at borders. This Bo...
7     First time here and it was really good. I orde...
8     I'm on a low carb diet right now, so I had to ...
9     I got my nails done there last Thursday for th...
10    I was referred to Jones Family Dentistry for a...
11    Just went to this theater last night, and it w...
12    The Harkins Camelview 5 gives Arizonans the un...
13    Great new addition to the Old Town neighborhoo...
14    What can I say that hasn't already been said a...
15    This place is essentially a copy of the old Fa...
16    I have been coming here since discovering them...
17    Delicious barbecue, we had 4 meats platter

In [14]:
# For every (query, review) pair we store the feature vector of the review
# (inferred with the doc2vec model) and, as its score, the query-review similarity.
# embedding_dim is assumed to match the vector size of the pretrained model.
results = np.zeros((len(queries), len(train_set_reviews), embedding_dim))
scores = np.zeros((len(queries), len(train_set_reviews)))

# Reuse the pretrained model loaded above: a freshly created, untrained Doc2Vec
# has no vocabulary to infer vectors from.

def tokenize(text):
    # Lowercase and tokenize a raw review with gensim's simple preprocessing.
    return gensim.utils.simple_preprocess(text)

# The feature vector of a review does not depend on the query, so infer it once.
review_tokens = [tokenize(review) for review in train_set_reviews.text]
review_vectors = np.array([model.infer_vector(tokens) for tokens in review_tokens])

for q_ix, query in enumerate(queries):
    query_tokens = tokenize(query)
    results[q_ix] = review_vectors
    for r_ix, tokens in enumerate(review_tokens):
        try:
            similarity = model.similarity_unseen_docs(doc_words1=query_tokens, doc_words2=tokens)
        except KeyError:
            similarity = 0
        scores[q_ix, r_ix] = similarity
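The relevance score used above is a similarity between two inferred document vectors. In gensim, `similarity_unseen_docs` infers a vector for each token list and returns their cosine similarity. The sketch below reproduces only that last step with NumPy on made-up vectors, so the numbers are illustrative and no doc2vec model is involved.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))

# Made-up stand-ins for the vectors that infer_vector would return.
query_vector = np.array([1.0, 2.0, 0.0])
review_a = np.array([2.0, 4.0, 0.0])  # same direction as the query
review_b = np.array([0.0, 0.0, 3.0])  # orthogonal to the query

sim_a = cosine_similarity(query_vector, review_a)
sim_b = cosine_similarity(query_vector, review_b)
print(sim_a, sim_b)
```

A review pointing in the same direction as the query scores close to 1, and an orthogonal one scores 0, which is why the similarity can serve as a relevance label here.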


In [15]:
# put data into pairs
xi = []
xj = []
pij = []
pair_id = []
pair_query_id = []

for q_ix, query in enumerate(queries):
    for (ix_i, document_i), (ix_j, document_j) in combinations(enumerate(results[q_ix]), 2):
        pair_query_id.append(q_ix)  # query index, usable later for stratifying
        pair_id.append((ix_i, ix_j))
        xi.append(document_i)
        xj.append(document_j)

        # Target probability that document i ranks before document j for this query
        score_i, score_j = scores[q_ix, ix_i], scores[q_ix, ix_j]
        if score_i > score_j:
            _pij = 1.0
        elif score_i < score_j:
            _pij = 0.0
        else:
            _pij = 0.5
        pij.append(_pij)

xi = np.array(xi)
xj = np.array(xj)
pij = np.array(pij)
pair_query_id = np.array(pair_query_id)
del results
del scores
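To make the pairing step concrete, here is a small sketch on toy data (the scores are invented for illustration and `target_probability` is a hypothetical helper, not part of the lab). It enumerates every unordered pair of documents for a single query and assigns the target from the two scores.

```python
from itertools import combinations

import numpy as np

def target_probability(score_i, score_j):
    """1.0 if document i should rank first, 0.0 if document j should, 0.5 on a tie."""
    if score_i > score_j:
        return 1.0
    if score_i < score_j:
        return 0.0
    return 0.5

# Invented relevance scores of four documents for one query.
toy_scores = np.array([0.9, 0.2, 0.9, 0.5])

toy_pairs = list(combinations(range(len(toy_scores)), 2))
toy_pij = [target_probability(toy_scores[i], toy_scores[j]) for i, j in toy_pairs]

for (i, j), p in zip(toy_pairs, toy_pij):
    print(f"documents ({i}, {j}) -> target {p}")
```

Four documents yield 6 pairs. With `sample_results_dataset = 100` the same enumeration yields 4,950 pairs per query, so the pair arrays grow quickly.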

In [None]:
# FILL

# Split xi, xj, pij, and pair_id into train and test sets.
# HINT: stratify by pair_query_id
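One possible way to follow the hint, shown on synthetic stand-ins for `xi`, `xj`, `pij`, `pair_id` and `pair_query_id` (the shapes, the 80/20 split and the random seed are assumptions, not values from the lab). `train_test_split` accepts several arrays at once and splits them consistently, and `stratify` keeps each query equally represented in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_queries, pairs_per_query, dim = 4, 10, 5
n_pairs = n_queries * pairs_per_query

# Synthetic stand-ins for the arrays built in the pairing cell.
xi = rng.normal(size=(n_pairs, dim))
xj = rng.normal(size=(n_pairs, dim))
pij = rng.choice([0.0, 0.5, 1.0], size=n_pairs)
pair_id = np.arange(n_pairs)
pair_query_id = np.repeat(np.arange(n_queries), pairs_per_query)

# Every input array is split with the same shuffled indices.
(xi_train, xi_test,
 xj_train, xj_test,
 pij_train, pij_test,
 pair_id_train, pair_id_test) = train_test_split(
    xi, xj, pij, pair_id,
    test_size=0.2,
    stratify=pair_query_id,
    random_state=42,
)

print(xi_train.shape, xi_test.shape)
print(np.bincount(pair_query_id[pair_id_test]))
```

The last line counts test pairs per query, which is a quick way to check that the stratification did what was intended.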

In [None]:
xi_train = tf.constant(xi_train)
xi_test = tf.constant(xi_test)
xj_train = tf.constant(xj_train)
xj_test = tf.constant(xj_test)
pij_train = tf.constant(pij_train)
pij_test = tf.constant(pij_test)
pair_id_train = pair_id_train
pair_id_test = pair_id_test

In [None]:

# Try to create a model with 2 dense layers with leaky_relu as activations, then a linear dense layer and a subtract layer.

# This time I will leave it blank, but the variable oij should hold the output of the subtraction.


# model architecture
class RankNet(Model):
    def __init__(self):
        super().__init__()
        # FILL

    def call(self, inputs):
        xi, xj = inputs
        # FILL
        output = layers.Activation('sigmoid')(oij)
        return output

    def build_graph(self):
        x = [Input(shape=(embedding_dim,)), Input(shape=(embedding_dim,))]
        return Model(inputs=x, outputs=self.call(x))
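A minimal sketch of one way the blanks could be filled, under stated assumptions: the class name `RankNetSketch`, the hidden sizes (16 and 8) and the input dimension (10) are all hypothetical, and this is not the lab's reference solution. The key design point is that the *same* scoring layers are applied to both documents, so the two outputs are comparable before they are subtracted.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

class RankNetSketch(Model):
    """Hypothetical solution: a shared scoring network applied to both documents."""

    def __init__(self, hidden_units=(16, 8)):
        super().__init__()
        # Two dense layers with leaky_relu, then a linear layer producing one score.
        self.hidden = [layers.Dense(units, activation=tf.nn.leaky_relu)
                       for units in hidden_units]
        self.score = layers.Dense(1, activation='linear')
        self.subtract = layers.Subtract()
        self.sigmoid = layers.Activation('sigmoid')

    def score_document(self, x):
        for layer in self.hidden:
            x = layer(x)
        return self.score(x)

    def call(self, inputs):
        xi, xj = inputs
        # oij is the difference between the two document scores.
        oij = self.subtract([self.score_document(xi), self.score_document(xj)])
        return self.sigmoid(oij)

rng = np.random.default_rng(0)
doc_i = tf.constant(rng.normal(size=(4, 10)).astype('float32'))
doc_j = tf.constant(rng.normal(size=(4, 10)).astype('float32'))

sketch = RankNetSketch()
probs = sketch([doc_i, doc_j]).numpy()
swapped = sketch([doc_j, doc_i]).numpy()
print(probs.ravel())
print(swapped.ravel())
```

Because the weights are shared, swapping the inputs gives the complementary probability, which is a handy sanity check for any implementation.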

In [16]:
# Compile the model with binary_crossentropy as the loss.
ranknet = RankNet()
ranknet.compile(optimizer='adam', loss='binary_crossentropy')
a = ranknet.build_graph()
a.summary()

NameError: ignored

In [None]:
# Train the model: pass [xi_train, xj_train] as inputs and pij_train as targets.
history = None # Fill
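The training call can be sketched on synthetic pairs as follows. Everything here is an assumption made for illustration: the toy data, the compact functional-API stand-in for RankNet, and the hyperparameters. The point is the shape of the `fit` call, with the two document arrays passed as a list and a validation set supplied so that `val_loss` is recorded.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model, layers

tf.random.set_seed(0)
rng = np.random.default_rng(0)
dim = 10

# Synthetic pairs: a hidden linear "relevance" decides which document ranks first.
true_w = rng.normal(size=dim)
xi_toy = rng.normal(size=(400, dim)).astype('float32')
xj_toy = rng.normal(size=(400, dim)).astype('float32')
pij_toy = (xi_toy @ true_w > xj_toy @ true_w).astype('float32').reshape(-1, 1)

# Compact stand-in for RankNet: one shared scorer, subtract, sigmoid.
scorer = tf.keras.Sequential([
    layers.Dense(16, activation=tf.nn.leaky_relu),
    layers.Dense(1),
])
in_i, in_j = Input(shape=(dim,)), Input(shape=(dim,))
out = layers.Activation('sigmoid')(layers.Subtract()([scorer(in_i), scorer(in_j)]))
toy_model = Model(inputs=[in_i, in_j], outputs=out)
toy_model.compile(optimizer='adam', loss='binary_crossentropy')

# First 320 pairs for training, the remaining 80 for validation.
toy_history = toy_model.fit(
    [xi_toy[:320], xj_toy[:320]], pij_toy[:320],
    validation_data=([xi_toy[320:], xj_toy[320:]], pij_toy[320:]),
    epochs=5, batch_size=50, verbose=0,
)
print(toy_history.history['loss'])
print(toy_history.history['val_loss'])
```

In the lab itself the analogous call would pass `[xi_train, xj_train]` and `pij_train`, with `validation_data` built from the test arrays so that the plotting cell finds `history.history['val_loss']`.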

In [None]:
# function for plotting loss
def plot_metrics(train_metric, val_metric=None, metric_name=None, title=None, ylim=5):
    plt.title(title)
    plt.ylim(0,ylim)
    plt.plot(train_metric,color='blue',label=metric_name)
    if val_metric is not None: plt.plot(val_metric,color='green',label='val_' + metric_name)
    plt.legend(loc="upper right")

# plot loss history
plot_metrics(history.history['loss'], history.history['val_loss'], "Loss", "Loss", ylim=1.0)

In [None]:
# Test with a new sample pair of docs to get their associated probability.

new_doci = None
new_docj = None
inputs = None

In [None]:
ranknet(inputs)