# Retrieve & Re-Rank Demo over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

You can enter a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (chosen because it is smaller and fits more easily in RAM).

For semantic search, we use `SentenceTransformer('nq-distilbert-base-v1')` and retrieve
32 passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that
scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus that the bi-encoder was not trained on.
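
The two-stage flow described above can be sketched without downloading any model. This is a toy illustration only: `cheap_score` and `careful_score` are hypothetical stand-ins for the bi-encoder and the cross-encoder, and they do not reflect how the real models score text.

```python
def cheap_score(query, passage):
    """Hypothetical stand-in for the bi-encoder: number of shared unique words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def careful_score(query, passage):
    """Hypothetical stand-in for the cross-encoder: Jaccard similarity of the word sets."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q | p) if (q | p) else 0.0


def retrieve_and_rerank(query, passages, top_k=2):
    # Stage 1 (retrieve): score every passage with the cheap scorer, keep the top_k candidates
    candidates = sorted(
        range(len(passages)),
        key=lambda i: cheap_score(query, passages[i]),
        reverse=True,
    )[:top_k]
    # Stage 2 (re-rank): re-score only the candidates with the more careful scorer
    reranked = sorted(
        candidates,
        key=lambda i: careful_score(query, passages[i]),
        reverse=True,
    )
    return candidates, reranked


toy_passages = [
    "a long drama about city life and a documentary crew filming life in the arctic",
    "arctic life",
    "western romance",
    "documentary",
]
candidates, reranked = retrieve_and_rerank("arctic life documentary", toy_passages, top_k=2)
print("retrieved:", candidates)
print("re-ranked:", reranked)
```

The second stage can only reorder what the first stage retrieved, so a passage missed during retrieval can never be recovered by re-ranking.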


In [None]:
#!pip install -U sentence-transformers rank_bm25
#!pip install datasets

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# We use the bi-encoder to encode all passages so that we can run semantic search over them
bi_encoder = SentenceTransformer('nq-distilbert-base-v1')
bi_encoder.max_seq_length = 256     # Truncate long passages to 256 tokens
top_k = 32                          # Number of passages we want to retrieve with the bi-encoder

# The bi-encoder retrieves candidate passages. We then use a cross-encoder to re-rank that list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


In [None]:
from datasets import load_dataset
ds = load_dataset("Coder-Dragon/wikipedia-movies", split='train[:1000]')

In [None]:
titles = ds['Title']
plots = ds['Plot']
# Build one passage per movie in the form "Title: Plot" (ChatGPT helped with this definition, since the earlier format caused errors later in the code)
passages = [title + ": " + plot for title, plot in zip(titles, plots)]

In [None]:
print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

Passages: 1000


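
To make the "vector space" idea concrete, here is a minimal pure-Python sketch of ranking passages by cosine similarity to a query vector. The two-dimensional vectors and the helper names `cosine_similarity` and `top_k_by_cosine` are made up for illustration; the notebook itself relies on `util.semantic_search` over the real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, corpus_vecs, top_k=2):
    """Return the top_k corpus entries as dicts with 'corpus_id' and 'score'."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy 2-d "embeddings"; real sentence embeddings have hundreds of dimensions
toy_corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_hits = top_k_by_cosine([1.0, 0.1], toy_corpus, top_k=2)
for hit in toy_hits:
    print(hit["corpus_id"], round(hit["score"], 3))
```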

In [None]:
# We also compare the results to lexical search (keyword search). Here, we use
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lowercase the text, strip punctuation, and exclude stop words from the index
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)


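
For intuition about what a BM25 score measures, the following is a minimal textbook-style sketch in pure Python. It is not the `rank_bm25` implementation: the IDF variant used here and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, so the numbers will not match the scores printed by the notebook.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight (this IDF variant is always positive)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    ["cleopatra", "caesar", "egypt"],
    ["western", "romance"],
    ["egypt", "desert", "romance", "egypt"],
]
toy_scores = bm25_scores(["cleopatra", "egypt"], toy_docs)
print([round(s, 3) for s in toy_scores])
```

Documents that share no token with the query score exactly 0 under this formula.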

In [None]:
# This function searches all passages in the corpus for those that
# answer the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-5 lexical search (BM25) hits")
    for hit in bm25_hits[:5]:  # Ensure only the top 5 are printed
        print("\t{:.3f}\t{}".format(hit['score'], titles[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
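    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Move the query embedding to the same device as the corpus embeddings (works with or without a GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    # Note: only 5 candidates are retrieved here, so the cross-encoder below re-orders these same 5 hits;
    # pass top_k=top_k to re-rank all 32 candidates instead
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]  # Get the hits for the first query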

    ##### Re-Ranking #####
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-5 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-5 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[:5]:  # Ensure only the top 5 are printed
        print("\t{:.3f}\t{}".format(hit['score'], titles[hit['corpus_id']].replace("\n", " ")))

    # Output of top-5 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-5 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[:5]:  # Ensure only the top 5 are printed
        print("\t{:.3f}\t{}".format(hit['cross-score'], titles[hit['corpus_id']].replace("\n", " ")))


In [None]:
search(query = "Documentaries showcasing indigenous peoples' survival and daily life in Arctic regions")

Input question: Documentaries showcasing indigenous peoples' survival and daily life in Arctic regions
Top-5 lexical search (BM25) hits
	9.241	I Do
	8.233	Uncharted Seas
	6.430	Powers That Prey
	4.977	Rough Waters
	4.425	The King of Kings

-------------------------

Top-5 Bi-Encoder Retrieval hits
	0.448	Nanook of the North
	0.318	The Frozen North
	0.274	David Copperfield
	0.271	Straight Shooting
	0.257	The Salvation Hunters

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	-4.554	Nanook of the North
	-9.599	The Frozen North
	-10.188	The Salvation Hunters
	-11.145	David Copperfield
	-11.297	Straight Shooting


In [None]:
search(query = "Western romance")

Input question: Western romance
Top-5 lexical search (BM25) hits
	6.245	The Call of the Wild
	5.915	Romance
	5.534	Four Sons
	5.406	A Man's Fight
	5.308	Gun Smoke

-------------------------

Top-5 Bi-Encoder Retrieval hits
	0.313	Romance
	0.290	Married in Hollywood
	0.287	The Great Gatsby
	0.285	Frankenstein
	0.283	Youth's Endearing Charm

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	-4.111	Romance
	-8.114	Married in Hollywood
	-10.627	The Great Gatsby
	-11.119	Youth's Endearing Charm
	-11.349	Frankenstein


In [None]:
search(query = "Silent film about a Parisian star moving to Egypt, leaving her husband for a baron, and later reconciling after finding her family in poverty in Cairo.")

Input question: Silent film about a Parisian star moving to Egypt, leaving her husband for a baron, and later reconciling after finding her family in poverty in Cairo.
Top-5 lexical search (BM25) hits
	35.189	Sahara
	16.194	He Who Gets Slapped
	14.018	Inspiration
	13.940	A Lad from Old Ireland
	12.716	The Maltese Falcon

-------------------------

Top-5 Bi-Encoder Retrieval hits
	0.465	Married in Hollywood
	0.440	The Golden Louis
	0.417	Morocco
	0.410	Sahara
	0.409	The King on Main Street

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	4.657	Sahara
	-6.515	Morocco
	-9.318	The Golden Louis
	-9.486	Married in Hollywood
	-9.651	The King on Main Street


In [None]:
search(query = "Comedy film, office disguises, boss's daughter, elopement")

Input question: Comedy film, office disguises, boss's daughter, elopement
Top-5 lexical search (BM25) hits
	13.079	Mabel's Blunder
	9.579	Bucking Broadway
	8.525	Ask Father
	8.119	The Front Page
	7.893	His Wedding Night

-------------------------

Top-5 Bi-Encoder Retrieval hits
	0.389	Youth's Endearing Charm
	0.386	The Pasha's Daughter
	0.367	A Little Journey
	0.356	Bumping Into Broadway
	0.334	Caught in a Cabaret

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	-7.151	The Pasha's Daughter
	-7.455	Caught in a Cabaret
	-8.381	Bumping Into Broadway
	-8.831	Youth's Endearing Charm
	-10.773	A Little Journey


In [None]:
search(query = "Lost film, Cleopatra charms Caesar, plots world rule, treasures from mummy, revels with Antony, tragic end with serpent in Alexandria.")

Input question: Lost film, Cleopatra charms Caesar, plots world rule, treasures from mummy, revels with Antony, tragic end with serpent in Alexandria.
Top-5 lexical search (BM25) hits
	71.493	Cleopatra
	10.778	Captain Applejack
	9.089	Fair Lady
	8.353	Reaching for the Moon
	8.273	Tom Sawyer

-------------------------

Top-5 Bi-Encoder Retrieval hits
	0.553	Cleopatra
	0.364	The Man Who Lost Himself
	0.352	The Golden Louis
	0.348	The Lost World
	0.319	Reaching for the Moon

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	6.336	Cleopatra
	-8.899	The Lost World
	-9.969	The Man Who Lost Himself
	-10.243	The Golden Louis
	-10.423	Reaching for the Moon


In [None]:
search(query = "Denis Gage Deane-Tanner")

Input question: Denis Gage Deane-Tanner
Top-5 lexical search (BM25) hits
	29.003	Captain Alvarez
	4.123	Hangman's House
	0.000	Red Courage
	0.000	The Sea Lion
	0.000	A Sailor-Made Man

-------------------------

Top-5 Bi-Encoder Retrieval hits
	0.336	The Man from Blankley's
	0.310	Blind Youth
	0.297	The Blot
	0.291	Old Clothes
	0.289	Caught Plastered

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	-6.756	Caught Plastered
	-9.909	The Blot
	-10.953	Blind Youth
	-10.982	The Man from Blankley's
	-11.219	Old Clothes


# Analysis

BM25 Recall@1:

The relevant film appears in the BM25 top 5 for 4 of the 6 queries, but it is ranked first for only 3 of them, so Recall@1 = 3/6 = 0.5.

BM25 Mean Reciprocal Rank (MRR):

The per-query reciprocal ranks are 0, 0, 1/1, 1/3, 1/1 and 1/1 (0 means the relevant film is not in the top 5), so MRR = (0 + 0 + 1 + 0.333 + 1 + 1)/6 ≈ 0.556.

Reranker Recall@1:

The re-ranker places the relevant film first for 3 of the 6 queries, so Recall@1 = 3/6 = 0.5.

Reranker Mean Reciprocal Rank (MRR):

The per-query reciprocal ranks are 1/1, 0, 1/1, 0, 1/1 and 0, so MRR = 3/6 = 0.5.
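
The arithmetic behind these two metrics can be checked with a short script. This is a minimal sketch: the rank lists are transcribed from the per-query reciprocal ranks listed in this analysis, with `None` marking a query whose relevant film is not in the top 5, and the helper names are hypothetical.

```python
def recall_at_1(ranks):
    """Fraction of queries whose relevant item is ranked first."""
    return sum(1 for rank in ranks if rank == 1) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a missing item (None) contributes 0."""
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)


# Rank of the relevant film for each of the six queries, in the order they were run
bm25_ranks = [None, None, 1, 3, 1, 1]
reranker_ranks = [1, None, 1, None, 1, None]

for name, ranks in [("BM25", bm25_ranks), ("Reranker", reranker_ranks)]:
    print(name, "Recall@1:", round(recall_at_1(ranks), 3), "MRR:", round(mean_reciprocal_rank(ranks), 3))
```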
