Homework 3: Search for Movie Plots  
DSBA 6188  
Eric Phann

# Semantic Search - Search for Wikipedia Movie Plots

This notebook demonstrates a baseline semantic search model.

You can input a query or a question. The script then uses semantic search
to find relevant movies in Wikipedia Movie Plots (the first 1000 movies in the dataset).

As the model, we use `nq-distilbert-base-v1`.

It was trained on the Natural Questions dataset, a dataset with real questions from Google Search together with annotated data from Wikipedia providing the answer. For the movies, we encode the Wikipedia article title together with the plot.

## Install/import dependencies

In [None]:
%%capture
!pip install -U sentence-transformers

In [None]:
%%capture
!pip install datasets

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")

In [None]:
from datasets import load_dataset

## Import data

As a dataset, we use Wikipedia Movie Plots, restricted to its first 1000 movies via the `train[:1000]` split. We can import this using Hugging Face `datasets`.

In [None]:
# import data
ds = load_dataset("Coder-Dragon/wikipedia-movies", split='train[:1000]')

In [None]:
# see an example of the data
ds[0]

{'Release Year': 1901,
 'Title': 'Kansas Saloon Smashers',
 'Origin/Ethnicity': 'American',
 'Director': 'Unknown',
 'Cast': None,
 'Genre': 'unknown',
 'Wiki Page': 'https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers',
 'Plot': "A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]",
 'Image': 'upload.wikimedia.org/wikipedia/commons/2/2d/KansasSaloonSmashers1901.jpg'}

## Create corpus embeddings (Encode)

In [None]:
# We use the Bi-Encoder to encode all movies, so that we can use it with semantic search
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)

In [None]:
# create movies[] as [title, plot]
movies = []
for title, plot in zip(ds['Title'], ds['Plot']):
  movies.append([title, plot])

# examine movies[]
print(len(movies), "\n")
movies[:3]

1000 



[['Kansas Saloon Smashers',
  "A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"],
 ['Love by the Light of the Moon',
  "The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everythi

In [None]:
# create corpus embeddings
corpus_embeddings = bi_encoder.encode(movies, convert_to_tensor=True, show_progress_bar=True)

In [None]:
corpus_embeddings

tensor([[ 0.4242,  0.3535, -0.4906,  ..., -0.4527, -0.7219, -0.3250],
        [ 0.4416,  0.4091,  0.6788,  ..., -0.3532, -0.1965, -0.0579],
        [ 0.4367,  0.0597, -0.0720,  ...,  0.0901,  0.1567,  0.0581],
        ...,
        [ 0.1280, -0.4610, -0.6371,  ...,  0.3196, -0.1572,  0.8058],
        [ 0.1239,  0.1953, -0.4919,  ...,  0.2174,  0.0470, -0.1063],
        [-0.2902, -0.0538, -0.3584,  ..., -0.0562, -0.3267, -0.3031]],
       device='cuda:0')

## Create query embeddings (Encode)

In [None]:
# a search function using util.semantic_search, where top_k = num of movies to retrieve
def search(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant movies
    start_time = time.time()
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    # use util.semantic_search to return cosine similarity and top k
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query
    end_time = time.time()

    # Output of top-k hits
    print("Query: ", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for hit in hits:
        print("\t{:.3f}\t{}".format(hit['score'], movies[hit['corpus_id']]))


In [None]:
# a search function using util.cos_sim and torch.topk, where top_k = num of movies to retrieve
def search2(query, top_k = 5):
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    # Use cosine similarity and torch.topk to find the top_k highest scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=min(top_k, len(movies)))

    # Output of top-k hits
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop {} most similar movies in corpus:".format(top_k))
    for score, idx in zip(top_results[0], top_results[1]):
        # int(idx) converts the 0-d index tensor to a plain Python int
        print("(Score: {:.4f})".format(score), movies[int(idx)])
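
To make the ranking step concrete, here is a minimal pure-Python sketch of what cosine similarity followed by a top-k selection computes. It is not how `util.cos_sim` or `torch.topk` are implemented; the helper names `cosine_similarity` and `top_k_hits` and the toy 3-dimensional vectors are assumptions made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_hits(query_vec, corpus_vecs, top_k=5):
    """Return (corpus_id, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    # Sort by score, highest first, and keep the first top_k entries
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (hypothetical values, not model output)
toy_corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [1.0, 0.1, 0.0]

hits = top_k_hits(toy_query, toy_corpus, top_k=2)
for corpus_id, score in hits:
    print("{:.3f}\t{}".format(score, corpus_id))
```

The query vector points almost exactly along the first corpus vector, so that entry ranks first and the diagonal vector ranks second.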

## Search Examples & Metrics

### Evaluation Metrics


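*   Recall@k (reported here as Recall@1)
  * $Recall@k = \frac{true\,positives@k}{true\,positives@k + false\,negatives@k}$
  * Of all results that are actually relevant to the query, what fraction appears in the top k?
  * In our case, each query has a single "ground truth movie", so Recall@1 asks whether it is the first movie returned (yes=1, no=0).
*   Mean Reciprocal Rank (MRR)
  * $MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$
  * The average, over all queries, of the reciprocal of the rank at which the ground truth movie appears. If it does not appear in the top 5 results, its reciprocal rank is counted as 0.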
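
As a rough sketch, the two metrics can be computed from a ranked list of returned titles as follows. The function names and the toy rankings are hypothetical and chosen only for illustration; the sketch assumes exactly one ground truth movie per query and scores a missing ground truth as 0.

```python
def reciprocal_rank(ranked_titles, ground_truth):
    """1/rank of the ground truth (1-indexed), or 0.0 if it was not retrieved."""
    for rank, title in enumerate(ranked_titles, start=1):
        if title == ground_truth:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_titles, ground_truth, k=1):
    """With a single relevant item per query, Recall@k is 1 if it is in the top k, else 0."""
    return int(ground_truth in ranked_titles[:k])

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_titles, ground_truth) pairs, one per query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)

# Hypothetical rankings, invented for illustration
runs = [
    (["A", "B", "C"], "A"),  # ground truth at rank 1
    (["A", "B", "C"], "C"),  # ground truth at rank 3
    (["A", "B", "C"], "Z"),  # ground truth not retrieved
]

print([recall_at_k(r, g, k=1) for r, g in runs])
print(mean_reciprocal_rank(runs))
```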


### Example 1

In [None]:
search("Documentaries showcasing indigenous peoples' survival and daily life in Arctic regions")

Query:  Documentaries showcasing indigenous peoples' survival and daily life in Arctic regions
Results (after 0.171 seconds):
	0.457	['Nanook of the North', 'The documentary follows the lives of an Inuk, Nanook, and his family as they travel, search for food, and trade in the Ungava Peninsula of northern Quebec, Canada. Nanook; his wife, Nyla; and their family are introduced as fearless heroes who endure rigors no other race could survive. The audience sees Nanook, often with his family, hunt a walrus, build an igloo, go about his day, and perform other tasks.']
	0.264	['David Copperfield', '"David Copperfield consists of three reels and as three separate films, released in three consecutive weeks, with three different titles: The Early Life of David Copperfield, Little Em\'ly and David Copperfield and The Loves of David Copperfield.[4]']
	0.251	['Straight Shooting', 'At the end of the 19th century in the Far West, a farmer is fighting for his right to plough the plains. In order to ex

Ground Truth Movie = 'Nanook of the North'  
Recall@1 = 1  
Reciprocal Rank = 1

### Example 2

In [None]:
search("Western romance")

Query:  Western romance
Results (after 0.015 seconds):
	0.305	['The Great Gatsby', "An adaptation of F. Scott Fitzgerald's Long Island-set novel, where Midwesterner Nick Carraway is lured into the lavish world of his neighbor, Jay Gatsby. Soon enough, however, Carraway will see through the cracks of Gatsby's nouveau riche existence, where obsession, madness, and tragedy await."]
	0.295	["Youth's Endearing Charm", 'The film is about a court case and embezzlement.']
	0.283	['Frankenstein', 'Described as "a liberal adaptation of Mrs. Shelley\'s famous story", the plot description in the Edison Kinetogram was:[3]']
	0.280	['High Society Blues', 'A new country family comes to live among established wealthy neighbors.']
	0.277	['Romance', 'As described in a film publication,[2] a youth (Arthur Rankin) in the prologue seeks advice from his grandfather (Sydney), who then recalls a romance of his own youth which is then shown as a flashback. A priest (Sydney) is in love with an Italian opera si

Ground Truth Movie = 'The Lucky Horseshoe'  
Recall@1 = 0  
Reciprocal Rank = 0

### Example 3

In [None]:
search("Silent film about a Parisian star moving to Egypt, leaving her husband \
for a baron, and later reconciling after finding her family in poverty in Cairo.")

Query:  Silent film about a Parisian star moving to Egypt, leaving her husband for a baron, and later reconciling after finding her family in poverty in Cairo.
Results (after 0.016 seconds):
	0.449	['Married in Hollywood', "A showgirl, part of a troupe, tours Europe where she falls in love with a Balkan prince. The prince's parents disapprove and attempt to put a stop to the romance. A revolution occurs and the prince and the showgirl elope to Hollywood."]
	0.394	['The King on Main Street', 'King Serge IV of Molvania (Menjou) comes to a small American town, and falls in love with one of its residents, Mary Young (Love).[3][4]']
	0.392	['Sahara', "Silent film femme fatale, Louise Glaum, portrays the role of Mignon, a Parisian music hall celebrity. Mignon marries a young American civil engineer, John Stanley, portrayed by Matt Moore. Stanley is transferred to Egypt to work on an engineering project in the Sahara. Mignon and her son, portrayed by Pat Moore, join Stanley in the desert.[3][

Ground Truth Movie = 'Sahara'  
Recall@1 = 0  
Reciprocal Rank = 1/3

### Example 4

In [None]:
search("Comedy film, office disguises, boss's daughter, elopement.")

Query:  Comedy film, office disguises, boss's daughter, elopement.
Results (after 0.013 seconds):
	0.403	["Youth's Endearing Charm", 'The film is about a court case and embezzlement.']
	0.400	['Dressed to Kill', 'The gang of a mob boss grow suspicious of his new girlfriend. She\'s a beautiful young girl and they don\'t believe she would actually associate with the mob and wonder if she\'s really a police "plant". The mobsters dress nattily to not appear "out of place" in the ritzy neighborhoods prior to a heist.']
	0.398	['The Boy Friend', 'Comedy about a small-town girl unhappy with her family, and a boy trying to please her by throwing a big party.']
	0.390	['A Little Journey', 'A girl travelling by train to meet her boyfriend meets another young man and falls in love with him.']
	0.373	['This Is Heaven', "Vilma Banky portrays a newly arrived Hungarian immigrant who learns to accustom herself to the new and strange life she finds in New York. The story gave Miss Banky moments of come

Ground Truth Movie = 'Ask Father'  
Recall@1 = 0  
Reciprocal Rank = 0

### Example 5

In [None]:
search("Lost film, Cleopatra charms Caesar, plots world rule, treasures from \
mummy, revels with Antony, tragic end with serpent in Alexandria.")

Query:  Lost film, Cleopatra charms Caesar, plots world rule, treasures from mummy, revels with Antony, tragic end with serpent in Alexandria.
Results (after 0.019 seconds):
	0.537	['Cleopatra', "Because the film has been lost, the following summary is reconstructed from a description in a contemporary film magazine.\r\nCleopatra (Bara), the Siren of Egypt, by a clever ruse reaches Caesar (Leiber) and he falls victim to her charms. They plan to rule the world together, but then Caesar falls. Cleopatra's life is desired by the church, as the wanton woman's rule has become intolerable. Pharon (Roscoe), a high priest, is given a sacred dagger to take her life. He gives her his love instead and, when she is in need of some money, leads her to the tomb of his ancestors, where she tears the treasure from the breast of the mummy. With this wealth she goes to Rome to meet Antony (Hall). He leaves the affairs of state and travels to Alexandria with her, where they revel. Antony is recalled to R

Ground Truth Movie = 'Cleopatra'  
Recall@1 = 1  
Reciprocal Rank = 1

### Example 6

In [None]:
search("Denis Gage Deane-Tanner")

Query:  Denis Gage Deane-Tanner
Results (after 0.015 seconds):
	0.365	['Souls for Sale', 'Remember "Mem" Steddon (Eleanor Boardman) marries Owen Scudder (Lew Cody) after a whirlwind courtship. However, on their wedding night, she has a change of heart. When the train taking them to Los Angeles stops for water, she impulsively and secretly gets off in the middle of the desert. Strangely, when Scudder realizes she is gone, he does not have the train stopped.\r\nMem sets off in search of civilization. Severely dehydrated, she sees an unusual sight: an Arab on a camel. It turns out to be actor Tom Holby (Frank Mayo); she has stumbled upon a film being shot on location. When she recuperates, she is given a role as an extra. Both Holby and director Frank Claymore (Richard Dix) are attracted to her. However, when filming ends, she does not follow the troupe back to Hollywood, but rather gets a job at a desert inn.\r\nMeanwhile, Scudder is recognized and arrested at the train station. He turns

Ground Truth Movie = 'Captain Alvarez'  
Recall@1 = 0  
Reciprocal Rank = 0

## Results

*   2 out of 6 (0.33) queries returned the ground truth as the top result (Average Recall@1)
*   Mean Reciprocal Rank (MRR) = (1 + 1 + 1/3 + 0 + 0 + 0)/6 = 7/18 ≈ 0.39
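
The aggregate figures can be double-checked with exact fractions. This short sketch only re-enters the per-example values reported in Examples 1 to 6; the variable names are illustrative.

```python
from fractions import Fraction

# Per-example values as reported in Examples 1-6
recall_at_1 = [1, 0, 0, 0, 1, 0]
reciprocal_ranks = [
    Fraction(1), Fraction(0), Fraction(1, 3),
    Fraction(0), Fraction(1), Fraction(0),
]

avg_recall_at_1 = Fraction(sum(recall_at_1), len(recall_at_1))
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

print(avg_recall_at_1, round(float(avg_recall_at_1), 2))  # 1/3 0.33
print(mrr, round(float(mrr), 2))                          # 7/18 0.39
```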




## Analysis

__What type of queries tend to do well? Which not so well?__  


*   Longer queries that give more information related to the title or plot tend to do better (Examples 1, 3, 5), although Example 3 only retrieved the ground truth at rank 3.
*   Shorter queries, or those not pertaining to the title or plot, tend to do worse and may not retrieve the ground truth in the top 5 results at all (Examples 2, 4, 6).



__For queries on which the model didn't perform well, what could be two alternative approaches?__


*   The model used was `nq-distilbert-base-v1`, which was trained on the Natural Questions dataset of real search questions and their answers. Perhaps rephrasing queries to be more question-like would help performance.
*   When creating the embeddings, we only included movie titles and plots. Perhaps including other attributes, such as cast, would help improve performance on queries like Example 6.
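
One possible way to try the second idea is to fold extra attributes into the text that gets encoded. The sketch below is untested against the actual model; `build_passage` is a hypothetical helper, and the choice of the `Cast` and `Director` fields and the `"Field: value"` formatting are assumptions. It keeps the same `[title, text]` pair layout used for `movies`.

```python
def build_passage(record, extra_fields=("Cast", "Director")):
    """Combine title, selected metadata, and plot into a [title, text] pair.

    Fields that are missing, None, empty, or 'unknown' are skipped.
    """
    parts = []
    for field in extra_fields:
        value = record.get(field)
        if value and str(value).strip().lower() != "unknown":
            parts.append(f"{field}: {value}")
    parts.append(record["Plot"])
    return [record["Title"], ". ".join(parts)]

# Shortened copy of the first record shown earlier (Cast is None, Director is 'Unknown')
example = {
    "Title": "Kansas Saloon Smashers",
    "Director": "Unknown",
    "Cast": None,
    "Plot": "A bartender is working at a saloon, serving drinks to customers.",
}

print(build_passage(example))
```

The resulting pairs could then be passed to `bi_encoder.encode` in place of `movies`. Whether this actually improves retrieval for cast-based queries such as Example 6 would need to be measured.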

