# Semantic Search - Wikipedia Question-Answer-Retrieval

This example demonstrates the setup for question-answer retrieval.

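You can input a query or a question. The script then uses semantic search
to find relevant passages. The setup was designed for Simple English Wikipedia (as it is smaller and fits better in RAM); the code below applies it to the first 1000 entries of wiki_movie_plots_deduped.csv.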

As the model, we use nq-distilbert-base-v1.

It was trained on the Natural Questions dataset, which pairs real questions from Google Search with annotated Wikipedia data providing the answer. For the passages, we encode the Wikipedia article title together with the individual text passages.

In [None]:
!pip install -U sentence-transformers
!pip install -U datasets

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch
import pandas as pd
import csv

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")


# We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)
top_k = 5  # Number of passages we want to retrieve with the bi-encoder

# As dataset, we use the movie plots in wiki_movie_plots_deduped.csv instead of Simple English Wikipedia.
# We keep the first 1000 entries and treat each film's title and plot as one passage.

# Load dataset
input_filepath = 'wiki_movie_plots_deduped.csv'
ds = pd.read_csv(input_filepath)

# Select the first 1000 entries
ds = ds.head(1000)

# Gzip file
wikipedia_filepath = "movie_plots.csv.gz"
ds.to_csv(wikipedia_filepath, index=False, compression='gzip')

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    # CSV reader
    csv_reader = csv.reader(fIn)
    # Skip the header line
    next(csv_reader)
    for line_num, line in enumerate(csv_reader, start=1):
        # Extract title and paragraph
        title = line[1] if len(line) > 1 else ''
        print(title)
        paragraph = line[-1] if len(line) > 0 else ''
        print(paragraph)
        passages.append([title, paragraph])

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))
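
The loop above selects columns by position (`line[1]` and `line[-1]`), which breaks silently if the column order changes. A minimal sketch of reading by column name instead, assuming the header contains columns named `Title` and `Plot` (an assumption about this file, not confirmed by the notebook). The helper name `read_passages` and the toy rows are illustrative only.

```python
import csv
import io


def read_passages(file_obj, title_col="Title", text_col="Plot", limit=None):
    """Return [title, text] pairs from a CSV with a header row.

    The column names are assumptions; adjust them to the actual header.
    """
    passages = []
    for row in csv.DictReader(file_obj):
        if limit is not None and len(passages) >= limit:
            break
        title = (row.get(title_col) or "").strip()
        text = (row.get(text_col) or "").strip()
        if text:  # skip rows that have no plot text
            passages.append([title, text])
    return passages


# Made-up rows standing in for the real file
sample = io.StringIO(
    "Release Year,Title,Plot\n"
    '1920,Film A,"A plot, with a comma."\n'
    "1921,Film B,\n"
)
demo_passages = read_passages(sample)
print(demo_passages)
```

The same function accepts the handle returned by `gzip.open(wikipedia_filepath, 'rt', encoding='utf8')`, since `csv.DictReader` only needs an iterable of text lines.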

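# To speed things up, pre-computed embeddings are downloaded.
# The provided file holds embeddings computed with the model 'nq-distilbert-base-v1'.
# NOTE: judging by its filename, the file was computed from Simple English Wikipedia passages, not from
# the movie plots loaded above, so its rows do not line up with `passages`.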
if model_name == 'nq-distilbert-base-v1':
    embeddings_filepath = 'simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
    if not os.path.exists(embeddings_filepath):
        util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

    corpus_embeddings = torch.load(embeddings_filepath)
    corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
    if torch.cuda.is_available():
        corpus_embeddings = corpus_embeddings.to('cuda')
    corpus_embeddings = corpus_embeddings[:1000]
else:  # Here, we compute the corpus_embeddings from scratch (which can take a while depending on the GPU)
    corpus_embeddings = bi_encoder.encode([passage[1] for passage in passages[:1000]], convert_to_tensor=True, show_progress_bar=True)
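
Pre-computed embeddings are only valid for the exact passages and model they were computed from. One way to avoid loading a file that belongs to a different corpus is to derive the cache filename from the model name and a hash of the passages. The following is a minimal sketch under that assumption; `embeddings_cache_path` is a hypothetical helper, not part of sentence-transformers, and the two toy corpora are made up.

```python
import hashlib


def embeddings_cache_path(model_name, passages, prefix="embeddings"):
    """Build a cache filename that changes whenever the model or the corpus changes."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf8"))
    for title, text in passages:
        # Separator bytes keep ("ab", "c") distinct from ("a", "bc")
        digest.update(b"\x00" + title.encode("utf8") + b"\x01" + text.encode("utf8"))
    return "{}-{}-{}.pt".format(prefix, model_name, digest.hexdigest()[:16])


corpus_a = [["Film A", "Plot of film A."], ["Film B", "Plot of film B."]]
corpus_b = [["Article A", "Text of article A."]]

path_a = embeddings_cache_path("nq-distilbert-base-v1", corpus_a)
path_b = embeddings_cache_path("nq-distilbert-base-v1", corpus_b)
print(path_a)
print(path_b)
```

In the notebook, the embeddings would be loaded from this path if the file exists, and otherwise computed with `bi_encoder.encode(...)` and written there with `torch.save`.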

In [41]:
def search(query):
    # Encode the query using the bi-encoder and find potentially relevant passages
    start_time = time.time()
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query
    end_time = time.time()

    # Output of top-k hits
    print("Input question:", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for hit in hits:
      print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))
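
By default, `util.semantic_search` ranks the corpus embeddings by cosine similarity to the query embedding and returns, per query, a list of dicts with the keys `corpus_id` and `score`. The toy sketch below reproduces that ranking in plain Python on hand-made 2-D vectors so the shape of `hits` is easy to inspect. `cosine_top_k` and the vectors are illustrative; this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_top_k(query_vec, corpus_vecs, k=5):
    """Return the k best matches as dicts, highest score first."""
    scored = [
        {"corpus_id": idx, "score": cosine(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = cosine_top_k([2.0, 0.1], toy_corpus, k=2)
for hit in toy_hits:
    print("{:.3f}\t{}".format(hit["score"], hit["corpus_id"]))
```

Because `corpus_id` is only a row index, the lookup `passages[hit['corpus_id']]` is meaningful only when row i of the embedding matrix was computed from `passages[i]`.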

In [42]:
search(query = "Documentaries showcasing indigenous peoples' survival and dailylife in Arctic regions")

Input question: Documentaries showcasing indigenous peoples' survival and dailylife in Arctic regions
Results (after 0.011 seconds):
	0.317	['Terrible Teddy, the Grizzly King', 'Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading "His Photographer" and "His Press Agent" respectively, follow him into the shot; the photographer sets up his camera. "Teddy" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. "Teddy" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. "Teddy" rides the path on his horse towards the camera and out to the left of the shot, followed closely by th

In [43]:
search(query = "Western romance")

Input question: Western romance
Results (after 0.037 seconds):
	0.293	['Safety Last!', 'The film opens in 1922 with Harold Lloyd (the character has the same name as the actor) behind bars. His mother and his girlfriend, Mildred, are consoling him as a somber official and priest show up. The three of them walk toward what looks like a noose. It then becomes obvious they are at a train station and the "noose" is actually a trackside pickup hoop used by train crews to receive orders without stopping, and the bars are merely the ticket barrier. He promises to send for his girlfriend so they can get married once he has "made good" in the big city. Then he is off.\nHe gets a job as a salesclerk at the De Vore Department Store, where he has to pull various stunts to get out of trouble with the picky and arrogantly self-important head floorwalker, Mr. Stubbs. He shares a rented room with his pal "Limpy" Bill, a construction worker.\nWhen Harold finishes his shift, he sees an old friend from hi

In [44]:
search(query = "Silent film about a Parisian star moving to Egypt, leaving her husband for a baron, and later reconciling after finding her family in poverty in Cairo.")

Input question: Silent film about a Parisian star moving to Egypt, leaving her husband for a baron, and later reconciling after finding her family in poverty in Cairo.
Results (after 0.017 seconds):
	0.258	['Terrible Teddy, the Grizzly King', 'Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading "His Photographer" and "His Press Agent" respectively, follow him into the shot; the photographer sets up his camera. "Teddy" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. "Teddy" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. "Teddy" rides the path on his horse towards 

In [45]:
search(query = "Comedy film, office disguises, boss's daughter, elopement.")

Input question: Comedy film, office disguises, boss's daughter, elopement.
Results (after 0.019 seconds):
	0.312	['Terrible Teddy, the Grizzly King', 'Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading "His Photographer" and "His Press Agent" respectively, follow him into the shot; the photographer sets up his camera. "Teddy" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. "Teddy" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. "Teddy" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photograp

In [46]:
search(query = "Lost film, Cleopatra charms Caesar, plots world rule, treasures from mummy, revels with Antony, tragic end with serpent in Alexandria.")

Input question: Lost film, Cleopatra charms Caesar, plots world rule, treasures from mummy, revels with Antony, tragic end with serpent in Alexandria.
Results (after 0.014 seconds):
	0.266	['Honky Tonk', "Sophie Tucker plays Sophie Leonard, a singer in a nightclub who at great sacrifice sends her daughter Beth (Lila Lee) to Europe to be educated, keeping her work as an entertainer a secret from her. When the grown-up, expensively educated Beth returns to America, she is shocked to discover her mother's true profession and disowns her, breaking Sophie's heart."]
	0.202	['The Bishop Murder Case', 'Looking out from his private balcony, elderly Prof. Dillard and his servant see the body of family friend Joseph Robin with an arrow in his chest. Dillard calls district attorney Markham, who brings in private detective Philo Vance and lazy police detective Heath. Vance quickly deduces that the arrow scene was staged (Robin was actually bludgeoned inside the house), but there is no obvious susp

In [47]:
search(query = "Denis Gage Deane-Tanner")

Input question: Denis Gage Deane-Tanner
Results (after 0.017 seconds):
	0.343	['Feet First', "Harold Horne, an ambitious shoe salesman in Honolulu, unknowingly meets the boss' secretary Barbara (Barbara Kent) - thinking she is the boss' daughter - and tells her he is a millionaire leather tycoon.\nHorne spends much of his time around Barbara hiding his true circumstances, in both the shoe store and later as an (accidental) stowaway on board a ship. Trying to evade the ship's crew, he becomes trapped in a mailbag, which is taken off the ship and falls off a delivery cart onto a window cleaner's cradle, which is hoisted upwards. Escaping from the bag, he finds himself dangling high above the streets of Los Angeles. After several thwarted attempts to get inside the building, he climbs to the very top, only to slip off - unaware his foot is caught on the end of a rope, which rescues him inches from the ground."]
	0.250	['Close Harmony', 'A musically talented young woman named Marjorie who 

In [53]:
# Ground truth films for each query
ground_truth_films = {
    "Documentaries showcasing indigenous peoples' survival and daily life in Arctic regions": [
        "Nanook of the North",
    ]
}

# Search results
search_results = {
    "Documentaries showcasing indigenous peoples' survival and daily life in Arctic regions": [
        {'score': 0.317, 'passage': ['Terrible Teddy, the Grizzly King', 'Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading "His Photographer" and "His Press Agent" respectively, follow him into the shot; the photographer sets up his camera. "Teddy" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. "Teddy" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. "Teddy" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs.']},
        {'score': 0.210, 'passage': ['Lucky Star', "Timothy Osborn (Farrell) and Martin Wrenn (Williams) work as linemen for an utility in a rural area. Both flirt with Mary Tucker (Gaynor) who is the daughter of a widowed dairy farmer, As the film begins it is 1917, and America becomes involved in World War I. Both men join the U.S. Army. While on the battlefield, Wrenn and Osborn serve in the same unit, and Wrenn is a sergeant. Ordered to deliver food to men at the front, Wrenn instead purloins the truck that was to be used for the delivery for personal use, and Osborn uses a horse-drawn wagon to deliver the food. While going to the front he is injured by shellfire. Both men return home, and Osborn is now confined to a wheelchair. He and Wrenn vie for Mary's affection. She becomes attached to Osborn and visits him every day. Wrenn, who had been kicked out of the Army, uses money and guile to win over Mary's mother, who pressures her to marry Wrenn. She stops seeing Osborn and agrees to marry Wrenn. In the end, Osborn regains some use of his legs, walks through snow to confront Wrenn just before he is about to wed Mary. Townspeople intervene in their fight and put Wrenn on a train out of town. Osborn reunites with Mary"]},
        {'score': 0.198, 'passage': ['Shoulder Arms', 'Charlie is in boot camp in the ""awkward squad."" Once in France he gets no letters from home. He finally gets a package containing limburger cheese which requires a gas mask and which he throws over into the German trench. He goes ""over the top"" and captures thirteen Germans (""I surrounded them""), then volunteers to wander through the German lines disguised as a tree trunk. With the help of a French girl he captures the Kaiser and the Crown Prince and is given a statue and victory parade in New York and then ... fellow soldiers wake him from his dream.']},
        {'score': 0.192, 'passage': ["Hell's Heroes", "Four men, Bob Sangster, ""Barbwire"" Gibbons, ""Wild Bill"" Kearney, and José, rob the bank in the town of New Jerusalem. José and the cashier is killed, while Barbwire is shot in the shoulder. The three outlaws escape the posse, fleeing into the desert. However, their horses die and they have little water. When they reach a water hole, they are dismayed to find that not only is it dry, but there is a pregnant woman stranded there. She gives birth to a boy. Before she dies from her ordeal, she makes the three the child's godfathers and begs them to take him to his father, Frank Edwards ... the cashier they murdered. Bob wants to abandon the boy, but the other two are determined to honor the woman's request. They start walking the 40 miles to New Jerusalem. Weakened by his wound, Barbwire eventually can go no further. He makes the others continue on without him, then shoots himself. That night, they stop to rest. When Bob wakes up the next morning, he finds Bill gone. A note explains he left to conserve the little remaining water. Bob goes on, discarding his belongings along the way, including finally the loot. At one point, he leaves the baby, but then picks him up again. His strength gives out just as he reaches a poisoned water hole. Then, he comes up with a plan. He drinks his fill, knowing that he will have about an hour before it kills him. He stumbles into New Jerusalem's church, where the congregation is celebrating Christmas. Then, his task completed, he dies without uttering a word.'"]},
        {'score': 0.192, 'passage': ['The Little American', "Karl Von Austreim (Jack Holt) lives in America with his German father and American mother. He notices a young lady, Angela More (Mary Pickford). As she is celebrating her birthday on the Fourth of July of 1914, she receives flowers from the French Count Jules De Destin (Raymond Hatton). They are interrupted by Karl, who also gives her a present. They soon battle for Angela's attention. To lose his competition, Count Jules arranges for Karl to be sent to Hamburg, where he will have to join his regiment. Angela is crushed when he announces he has to leave. The next day, Angela reads in the paper the Germans and French are at war and 10,000 Germans have been killed already. Three months pass by without a word from Karl. Karl is wounded in the fighting. Word spreads that Germany will sink any ship which is thought to be carrying munitions to the Allies. Angela is aboard one of those ships when it is hit. Angela saves herself by climbing on a floating table and begging the attackers not to fire on the passengers. Angela is eventually rescued. After weeks of ceaseless hammering from the German guns, the French fall back on Vangy. Angela arrives in Vangy as well to visit her aunt, only to discover she has died. The Old Prussians are bombing the city and Angela is requested to flee. However, she is determined to stay to nurse the wounded soldiers. Meanwhile, the Germans enter the chateau with the intention of getting drunk and enjoying themselves with the young women. A French soldier tries to help Angela escape, but she is unwilling to. He next asks her to let a French soldier spy on the Germans and inform the French via a secret hidden telephone. Angela is afraid, but gives them permission. The Germans are intent on raping Angela, who is the only person in the mansion not to be hidden. She reveals herself to be an American to save herself, but they do not believe her. \nAngela attempts to run away and hide, but is discovered by a German soldier who turns out to be Karl. Angela orders him to save the other women in the house, but Karl responds he cannot give orders to his fellow Germans. She realizes there is nothing she can do. With permission to leave the mansion, she witnesses the execution of the French soldiers. She is heartbroken and decides to go back in for revenge. Angela secretly calls the French with the hidden telephone and informs them that there are three gun holders near the chateau. The French prepare themselves and attack the Germans. The Germans realize someone is giving the French information and Karl catches Angela. He tries to help her escape, but they are caught. The commander orders that Angela be shot. When Karl tries to save her, he is to sentenced to be executed as well for treason. As the couple face death, the French bomb the mansion, enabling Angela and Karl to escape. They are too weak to run and collapse near a statue of Jesus. The next day, they are found by French soldiers. They initially want to shoot Karl, but Angela begs them to set him free. They eventually allow her to fly back to America with Karl by her side as a German prisoner."]}
    ]
}

# Calculate Recall@1 and MRR
recall_1_sum = 0
mrr_sum = 0
total_queries = len(search_results)

for query, results in search_results.items():
    relevant_films = ground_truth_films.get(query, [])
    # Recall@1: is the top-ranked film one of the relevant films?
    if results and results[0]['passage'][0] in relevant_films:
        recall_1_sum += 1
    # MRR: reciprocal rank of the first relevant film, or 0 if none was retrieved.
    # This must be computed for every query, not only when the top result is relevant.
    rank_first_relevant = next((i + 1 for i, doc in enumerate(results) if doc['passage'][0] in relevant_films), 0)
    if rank_first_relevant:
        mrr_sum += 1 / rank_first_relevant

recall_1 = recall_1_sum / total_queries
mrr = mrr_sum / total_queries

print("Recall@1:", recall_1)
print("MRR:", mrr)


Recall@1: 0.0
MRR: 0.0
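
The evaluation above hard-codes the results of a single query. A small helper makes the two metrics reusable across many queries. This is a sketch: `recall_at_k`, `reciprocal_rank` and `evaluate` are hypothetical names, each ranking is a list of film titles with the best match first, and the toy rankings are made up to exercise the code, not produced by the model.

```python
def recall_at_k(ranked_titles, relevant_titles, k=1):
    """Fraction of the relevant titles that appear in the top k results."""
    relevant = set(relevant_titles)
    if not relevant:
        return 0.0
    return len(set(ranked_titles[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_titles, relevant_titles):
    """1 / rank of the first relevant title, or 0.0 if none was retrieved."""
    relevant = set(relevant_titles)
    for rank, title in enumerate(ranked_titles, start=1):
        if title in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(rankings, ground_truth, k=1):
    """Average Recall@k and MRR over all queries in `rankings`."""
    n = len(rankings)
    if n == 0:
        return 0.0, 0.0
    recall = sum(recall_at_k(r, ground_truth.get(q, []), k) for q, r in rankings.items()) / n
    mrr = sum(reciprocal_rank(r, ground_truth.get(q, [])) for q, r in rankings.items()) / n
    return recall, mrr


toy_truth = {"query 1": ["Film A"], "query 2": ["Film B"]}
toy_rankings = {"query 1": ["Film A", "Film C"], "query 2": ["Film C", "Film B"]}
toy_recall, toy_mrr = evaluate(toy_rankings, toy_truth, k=1)
print("Recall@1:", toy_recall)
print("MRR:", toy_mrr)
```

To use it with the `search` function, the titles `passages[hit['corpus_id']][0]` of each query's hits would be collected into `rankings` instead of being printed.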
