# Lab 4

## Reranking on Similarity

In this notebook we will rerank the results based on their similarity to some sample reviews in the Yelp dataset, reusing the Doc2Vec model from the previous lab.

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab has GPU access for free.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook in this lab.

Follow the instructions. Good luck!

In [None]:
!pip install textblob 'keras-nlp' 'keras-preprocessing'


In [None]:
import gensim
import numpy as np
import pandas as pd
from gensim.models import Doc2Vec

np.random.seed(42)

In [None]:
%%writefile get_data.sh

if [ ! -f yelp.csv ]; then
  wget -O yelp.csv 'https://www.dropbox.com/s/xds4lua69b7okw8/yelp.csv?dl=0'
fi
if [ ! -f doc2vec_yelp_model ]; then
  wget -O doc2vec_yelp_model 'https://www.dropbox.com/s/bibu9bashb0cd68/doc2vec_yelp_model?dl=0'
fi

In [None]:
!bash get_data.sh

In [None]:
model = None # Load the Doc2Vec model from the file we downloaded

In [None]:
query = 'Best french restaurant'

In [None]:
# Use the same simple_preprocess from the previous lab to tokenize the query
tokenized_query = list()

In [None]:
inferred_vector = model.infer_vector(tokenized_query)
print(inferred_vector)

In [None]:
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
sample_reviews = yelp_best_worst.text.sample(n=200)

In [None]:
similarities = []
for review in sample_reviews:
    similarity = 0 # Find the similarity between the query and the review using gensim
    similarities.append(similarity)


OK, so far we have taken a query and computed its similarity with each of the sampled reviews.
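
One way to picture the similarity step is as a cosine similarity between two vectors. The sketch below uses plain NumPy rather than gensim, and the vectors are made-up stand-ins. In the lab, each review vector would presumably come from calling `model.infer_vector` on the tokenized review. `cosine_similarity` is a hypothetical helper name, not part of the lab.

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_b = np.asarray(vec_b, dtype=float)
    norm_product = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_product == 0.0:
        return 0.0
    return float(np.dot(vec_a, vec_b) / norm_product)


# Made-up vectors standing in for the inferred query and review vectors.
query_vector = np.array([1.0, 0.0, 1.0])
review_vectors = [
    np.array([2.0, 0.0, 2.0]),    # same direction as the query
    np.array([0.0, 1.0, 0.0]),    # orthogonal to the query
    np.array([-1.0, 0.0, -1.0]),  # opposite direction
]

similarities = [cosine_similarity(query_vector, vec) for vec in review_vectors]
print(similarities)
```

Cosine similarity ignores vector length and only compares direction, so a long review and a short query can still score as highly similar.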

The idea behind the algorithm is to reorder the results by this similarity score rather than by BM25. Let's see which one is better.
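
The reordering itself can be sketched with pandas as below. The reviews and scores are invented for illustration, and `reranked` and `above_threshold` are hypothetical names. The column names `review` and `similarity` match the ones the later print cells expect.

```python
import pandas as pd

# Invented reviews and similarity scores, in their original (retrieved) order.
reviews = ["Loud sports bar", "Cozy bistro, great prices", "Average diner"]
similarities = [0.10, 0.62, 0.35]

reviews_with_similarities = pd.DataFrame(
    {"review": reviews, "similarity": similarities}
)

# Rerank: highest similarity first, with a fresh 0..n-1 index.
reranked = reviews_with_similarities.sort_values(
    "similarity", ascending=False
).reset_index(drop=True)

# Keep only the rows that reach a chosen similarity threshold.
above_threshold = reranked[reranked["similarity"] >= 0.5]

print(reranked)
print(len(above_threshold))
```

Because `sort_values` returns a new DataFrame, the original order stays available for a before-and-after comparison.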

In [None]:
# Create a dataframe that has as columns the sample reviews and the similarity with the query
reviews_with_similarities = None

In [None]:
a = None # Sort reviews_with_similarities by the similarity column in descending order

In [None]:

print(f'Most similar document after reranking within retrieved results has description: \n\n{a["review"].iloc[0]}\n\nWith similarity: {a["similarity"].iloc[0]}\n\n---------\n\n')

In [None]:
print(f'Most similar document before reranking within retrieved results has description: \n\n{reviews_with_similarities["review"].iloc[0]}\n\nWith similarity: {reviews_with_similarities["similarity"].iloc[0]}\n\n---------\n\n')

In [None]:
print(f'Number of documents that surpass 0.5 similarity threshold: {len(a[a["similarity"] >= 0.5])}')

It is remarkable how, using DBOW, the most similar result captured the need for good prices and good food (which arguably characterizes French food). On the other hand, the least similar result is a sports bar, which seems about right as well!

It is not a perfect method, but it gives a very good indication. A good design is to place a step like this **between** stages: start from the raw results (thousands), filter them by similarity (hundreds), and then apply a learning-to-rank recommender (dozens).
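
The staged funnel described above can be sketched in a few lines. Everything here is hypothetical: the document IDs, the scores, and both function names. `final_ranker` is only a placeholder that truncates the list where a real learning-to-rank model would sit.

```python
def filter_by_similarity(candidates, threshold=0.5):
    """Keep (doc_id, similarity) pairs at or above the threshold, best first."""
    kept = [pair for pair in candidates if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def final_ranker(candidates, top_k=2):
    """Placeholder for a learning-to-rank model: just truncates to top_k."""
    return candidates[:top_k]


# Invented raw results, as a first-stage retriever might return them.
raw_results = [
    ("doc_a", 0.91),
    ("doc_b", 0.12),
    ("doc_c", 0.55),
    ("doc_d", 0.73),
    ("doc_e", 0.49),
]

shortlist = filter_by_similarity(raw_results, threshold=0.5)
final_results = final_ranker(shortlist, top_k=2)

print(shortlist)
print(final_results)
```

Each stage sees fewer candidates than the one before it, so the more expensive model only runs on the shortlist.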

TensorFlow has open-sourced TF Recommenders, which is great to plug in as an algorithm **after** these results. But this step alone would work just fine.
