Children often ask their caregivers to find an episode of a favorite TV show based only on a very short (and loosely relevant!) description ("the one where Arthur has a wiggly tooth"), but video services like Netflix and Amazon don't currently provide this kind of content-based search. Given a summary of each episode, can we use sequence embeddings to solve this retrieval problem?

Before beginning this homework, install the following libraries:

```sh
conda install -c huggingface transformers
pip install -U sentence-transformers
conda install -c conda-forge ipywidgets
```

First, let's read in our data for the TV show "Wild Kratts" (from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Wild_Kratts_episodes)), which has the following (tab-separated) form:

|Episode|Title|Summary|
|:-|:-|:-|
|1|Mom of a Croc|At the Nile River, zoologists Chris and Martin Kratt (voiced by their real-world selves) are on a mission to show one of their fellow Wild Kratts team members—brilliant young inventor Aviva Corcovado (Athena Karkanis)—that there's more to crocodiles than just violence and snapping jaws. After shrinking themselves down to a few inches tall by using Aviva's Miniaturizer invention, the Kratt Brothers disguise themselves as crocodile eggs and sneak into a mother crocodile's new nest. In the Wild Kratts team's turtle-shaped aircraft and headquarters—the Tortuga, one of Aviva's greatest inventions—the Wild Kratts tech team, consisting of Aviva, communications expert and mechanic Koki (Heather Bambrick), and skilled pilot Jimmy Z (Jonathan Malen) monitor Chris and Martin and watch as the mother crocodile faithfully guards her nest against predators for months without even eating anything. Eventually, as the crocodile eggs hatch and the crocodile mom uses her mouth to carry several of her newly hatched babies to the river, Aviva changes her mind about crocodiles and decides that these reptiles are in fact caring and dedicated mothers. But when the mother crocodile leaves the river to go get more hatchlings from her nest, predators threaten the first batch of baby crocodiles. The Kratt Brothers must use the incredible Creature Power Suits—two of Aviva's inventions—to gain the abilities of crocodiles and protect the vulnerable crocodile hatchlings.|
|2|Whale of a Squid|The Kratt Brothers use Aviva's amphipod-inspired submersible, the Amphisub, to dive into the deep waters of the Southern Ocean. There, they witness a never-before-seen wildlife moment: a battle between a sperm whale and a giant squid. However, the water pressure at the extreme depths where the battle is taking place badly damages and partially crushes the Amphisub, forcing Aviva to use her new ExtendoArm invention to pull the submersible back to the Tortuga. To allow Chris and Martin to return to the site of the whale-versus-squid battle, Aviva programs two new Creature Power Suits—Sperm Whale Power for Chris, and Squid Power for Martin. The Kratt Brothers use their new Creature Powers to dive back into the deep sea, where the sperm whale and the giant squid are still locked in combat. Suddenly, the sperm whale becomes entangled in a discarded fishing net and begins sinking toward an area full of underwater volcanoes. To make matters worse, a colossal squid attacks the sperm whale's calf. Chris and Martin must put their Creature Powers of both sperm whale and squid to good use to rescue the mother sperm whale and her calf.|


In [1]:
def read_data(filename):
    data=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            episode=cols[0]
            title=cols[1]
            summary=cols[2]
            data.append((episode, title, summary))
    return data
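
To sanity-check the parsing logic without the data file, the same tab-splitting can be run on an in-memory sample. This is only a minimal sketch: `parse_episode_lines` is a hypothetical helper (not part of the homework), the sample rows are shortened stand-ins for the real summaries, and skipping lines with fewer than three columns is an assumption about how malformed rows should be handled.

```python
import io

def parse_episode_lines(lines):
    """Parse tab-separated (episode, title, summary) rows from an iterable of lines."""
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Assumption: rows with fewer than three columns are malformed and skipped.
        if len(cols) < 3:
            continue
        rows.append((cols[0], cols[1], cols[2]))
    return rows

# Shortened stand-ins for the real file contents (illustration only).
sample = io.StringIO(
    "1\tMom of a Croc\tThe Kratt Brothers protect crocodile hatchlings.\n"
    "2\tWhale of a Squid\tA sperm whale battles a giant squid.\n"
    "malformed line without tabs\n"
)
episodes = parse_episode_lines(sample)
print(len(episodes))   # 2
print(episodes[0][1])  # Mom of a Croc
```

Note that episode numbers stay strings here, just as in `read_data`.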

In [2]:
data=read_data("../data/wild_kratts_episodes.txt")

In [3]:
def get_document_reps_for_data(data, sequence_embedding_function, model):
    
    # This function applies the sequence_embedding_function argument (a function itself) to the summary
    # element in the input data list, and returns a copy of that list with an embedding of that summary attached.
    
    data_with_reps=[]
    
    for episode, title, summary in data:
        data_with_reps.append((episode, title, summary, sequence_embedding_function(model, summary)))
    
    return data_with_reps

In [4]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
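
Cosine similarity depends only on the direction of the two vectors, not their length, which is one reason it is a common choice for comparing embeddings. The sketch below illustrates this with a dependency-free re-implementation (`cosine_similarity_py` is a hypothetical name, and the vectors are toy values rather than real embeddings).

```python
import math

def cosine_similarity_py(a, b):
    """Plain-Python equivalent of the NumPy version, for lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different lengths: similarity is (approximately) 1.
same_direction = cosine_similarity_py([1.0, 2.0], [3.0, 6.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity_py([1.0, 0.0], [0.0, 5.0])
# Opposite directions: similarity is (approximately) -1.
opposite = cosine_similarity_py([1.0, 2.0], [-1.0, -2.0])
```

Like the NumPy version, this sketch does not guard against zero-length vectors, which would cause a division by zero.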

First, we may be tempted to use the [CLS] token for BERT to represent an entire input string (as is often done in *supervised* document classification models).  How well does this work as an out-of-the-box document representation not optimized for our particular task?

In [5]:
from transformers import BertModel, BertTokenizer
import numpy as np
from sentence_transformers import SentenceTransformer

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Q1**: Fill out the `get_cls_token_for_doc` function to return the [CLS] embedding for the input string.  The output should be a single 768-dimensional numpy vector (see `4.embeddings/BERT.ipynb` for converting between a pytorch tensor and a numpy object).

In [7]:
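def get_cls_token_for_doc(model, string):
    # Truncate to BERT's 512-token limit so that long summaries don't raise an error.
    inputs = tokenizer(string, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**inputs)
    # [CLS] is the first token of the single sequence in the batch.
    return outputs.last_hidden_state[0][0].detach().numpy()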

In [8]:
bert_cls_data=get_document_reps_for_data(data, get_cls_token_for_doc, model)

**Q2**: Use these representations to find the episode that is most similar to the description "The one where they bounce back in time", i.e., the episode whose representation has the highest cosine similarity with the representation of the query.  A sample function shell `run_query` is provided below, along with the only arguments you need, but feel free to adapt it as you see fit.

In [9]:
def run_query(query, data_with_reps, sequence_embedding_function, model):
    query_rep=sequence_embedding_function(model, query)
    sims=[]
    for ep, title, summary, document_embedding in data_with_reps:
        cos=cosine_similarity(query_rep, document_embedding)
        sims.append((cos, ep, title, summary))

    # Show the 10 most similar episodes, highest cosine similarity first.
    import pandas as pd
    from IPython.display import display
    pd.set_option('display.max_colwidth', 1000)
    df = pd.DataFrame(sorted(sims, reverse=True)[:10], columns=["Cosine", "Episode", "Title", "Summary"])
    display(df)
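
The ranking step inside `run_query` does not depend on the embedding model, so it can be tried on its own. Here is a minimal sketch using hypothetical scores: `top_k` and the toy tuples are illustrative assumptions, not part of the homework. It uses `heapq.nlargest` with a key on the score, which avoids sorting the full list and never compares the other tuple fields when scores tie.

```python
import heapq

def top_k(sims, k=10):
    """Return the k highest-scoring (score, episode, title) tuples, best first."""
    return heapq.nlargest(k, sims, key=lambda item: item[0])

# Hypothetical similarity scores, for illustration only.
toy_sims = [
    (0.12, "1", "Episode A"),
    (0.87, "2", "Episode B"),
    (0.45, "3", "Episode C"),
]
best = top_k(toy_sims, k=2)
print([ep for _, ep, _ in best])  # ['2', '3']
```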
    

In [10]:
query="The one where they bounce back in time"
run_query(query, bert_cls_data, get_cls_token_for_doc, model)

| |Cosine|Episode|Title|Summary|
|:-|:-|:-|:-|:-|
|0|0.755259|34|Little Howler|Chris and Martin find a wolf pup in Martin's bag after a chaotic morning involving a wolf pack and Chris's Creature Power Suit set in moose mode.|
|1|0.753341|137|Deer Buckaroo|The Wild Kratts search for a white-tailed deer named Buckaroo as they learn about deer society and their antlers.|
|2|0.727147|77|Crocogator Contest|The Wild Kratts go on a Crocogator Challenge to figure out what the differences between crocodiles and alligators are, and why the creature needs those differences.|
|3|0.72223|132|Hercules – The Giant Beetle|While investigating exoskeletons, the Hercules beetle accidentally gets enlarged.|
|4|0.719663|133|Hero's Journey|When Aviva's new Time Thruster invention gets attached to a migrating salmon, the Wild Kratts must follow the salmon run through the Alaskan wilderness to recover the Time Thruster.|
|5|0.71907|142|The Vanishing Stingray|A bored Martin and Chris decide to play a game of creature hide-and-seek, and search the sand flats for one of the greatest hiding animals in the ocean—the stingray.|
|6|0.717661|143|In Search of the Easter Bunny|The Kratt Brothers split up; Chris heads north while Martin heads south in search of the species of rabbit or hare who they think is the real Easter bunny.|
|7|0.711772|141|Mystery of the Mini Monkey Models|The Wild Kratts team splits up to find mini monkeys known as tamarins and marmosets to discover why they have such weird hairstyles and bright colors.|
|8|0.709808|150|Adapto the Coyote|When a coyote pup sneaks into the Tortuga, the Wild Kratts want to learn why this creature is so adaptable.|
|9|0.702731|134|Creepy Creatures!|The villains are stealing animals for a haunted house and the Wild Kratts and Wild Kratts kids must stop them and save their animal friends.|


Now let's try a model that was explicitly optimized for generating sentence representations: Sentence-BERT ([Reimers and Gurevych 2019](https://arxiv.org/pdf/1908.10084.pdf)).  Example usage (in the context of the Hugging Face transformers library) can be found [here](https://www.sbert.net).
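
One difference from the [CLS] approach is how token vectors are combined: rather than reading off a single token, Sentence-BERT pools over all token embeddings (mean pooling is the default configuration in the paper). The following is a minimal sketch of masked mean pooling on toy vectors. `mean_pool` is a hypothetical helper and the numbers are made up; the library performs the equivalent step internally on tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / count for total in totals]

# Three toy token vectors; the last one is padding and must not affect the result.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_tokens, toy_mask)
print(pooled)  # [2.0, 3.0]
```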

In [11]:
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

**Q3**: Fill out the `get_sentence_embedding` function below to return the sentence embedding for the input string, and use it again to find the episode that is most similar to the description "The one where they bounce back in time", i.e., the episode whose representation has the highest cosine similarity with the representation of the query.  Which method for generating sentence embeddings appears better for this task?

In [12]:
def get_sentence_embedding(model, string):
    embedding = model.encode(string)
    return embedding

In [13]:
sentence_transformer_data=get_document_reps_for_data(data, get_sentence_embedding, sentence_model)

In [14]:
query="The one where they bounce back in time"
run_query(query, sentence_transformer_data, get_sentence_embedding, sentence_model)

| |Cosine|Episode|Title|Summary|
|:-|:-|:-|:-|:-|
|0|0.349195|91|Back in Creature Time: Day of the Dodo|After a quick scan on Philippine eagle eggs and learning the animals are going extinct, which means the Kratt Brothers will never adventure with creatures gone forever until Aviva reveals a Trampoline like a time machine which the Kratt Brothers use to jump back in time up to 500 years in the past. First, they decided to see a dodo and Koki, Chris, and Martin jump back on Mauritius Island in the early 1600s when the dodos became extinct. Gourmand's ancestor, Great Great Granddaddy Gourmand (Zachary Bennett) plans to use the dodos for his lunch. The Wild Kratts have to rescue the dodos before they become extinct and travel back to the present day before the remote runs low on power.|
|1|0.340593|76|Where the Bison Roam|The team explores the spot where Lewis and Clark first saw the American prairie. While cleaning the teleporter's activation ring, Jimmy Z accidentally rolls it onto the horn of the lead bull in the herd. The Kratts have to find a way to get the ring back without getting charged and discover that bison and prairies are more difficult to find than they were back then.|
|2|0.334609|143|In Search of the Easter Bunny|The Kratt Brothers split up; Chris heads north while Martin heads south in search of the species of rabbit or hare who they think is the real Easter bunny.|
|3|0.334412|82|Fossa-Palooza|Chris is bummed that the crew has to leave Madagascar for Mother's Day. They have not met the fossa (a very catlike, predatorial mongoose of Madagascar). Suddenly, Martin opens the garage door in midflight and Chris falls in Madagascar to find a fossa mom and baby.|
|4|0.32979|47|Bugs Or Monkeys?|A miniaturized Martin wants to check out insects in the rainforest, but Chris wants to investigate the world of the spider monkey. Chris wins when he puts mini Martin in his backpack and heads off through the treetops. When Martin gets kidnapped by a mischievous baby spider monkey, both brothers experience spider monkey anatomy and monkey locomotion. But just as the rescued Martin starts heading back to the Tortuga on Chris in his new spider monkey power suit, a jaguar emerges from the jungle shadows with its eyes on the baby spider monkey. The Kratts must save the baby monkey and reunite it with its family again.|
|5|0.325065|17|Elephant in the Room|While driving their jeep (the Createrra) across the African Savanna, Chris and Martin find a baby elephant separated from his mother. Martin names him Thornsley, and brings him back to the Tortuga HQ, only to discover the power of a baby African elephant, and the work necessary to take care of one. While the team is sleeping, Thornsley takes control of the Tortuga, causing Aviva, Koki and Jimmy Z to crash-land in the desert, leaving the Kratt Brothers stranded in the savanna. Jimmy Z gets mad when the elephant eats his sandwich, the gang gets miniaturized by the shrink machine and gets trapped in a shoe. The bros manage to find the Tortuga, finding it a wreck by the young mischievous elephant, and bring the crew back to their original size. Chris and Martin then find his herd and his mom, who's trapped in mud, and must use elephant strength and power to save her. They do so and Thornsley is reunited with his mother.|
|6|0.320811|83|Mini Madagascar|Chris and Martin go minisized and explore the small world of Madagascar. They think that Madagascar is safe without any big predators around. They soon see that Madagascar is filled with small monsters.|
|7|0.313297|121|Fire Salamander|After seeing a salamander crawling out of a fire in the Black Forest, the Kratt Brothers set off to solve the mystery of the fire salamander's life cycle. They also have to stop Donita and Dabio from capturing 1,000 of them for her bodysuit plan, and even save Chris.|
|8|0.305579|123|Wild Ponies|When the Kratt Brothers find a herd of wild horses on a beach, Aviva and Koki are eager to see them. However, the Wild Kratts are forced to retreat into the Tortuga when a storm hits. After the storm, the team finds a wild horse foal who was separated from the herd by a wave. Chris and Martin set off to reunite the foal with his mother.|
|9|0.304159|29|Seasquatch|When Martin accidentally knocks Jimmy's controller into the ocean, the Kratt Brothers volunteer to go into the deep to find it, encountering an anglerfish and other deep sea creatures. But then the submarine loses power while they are exploring the strange landscape of the ocean depths and they are trapped on the ocean floor of the deep sea. Aviva must figure out how to harness the energy from the deep sea's hydrothermal vents to save the brothers and return them to the surface. The Wild Kratts team learns all about the amazing process of chemosynthesis and how deep sea creatures transform toxic chemicals into energy. With a little help from the yeti crab, Martin and Chris are able to capture this energy and use it to restart the Amphisub and return to the surface.|
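
When comparing two embedding methods, one option beyond eyeballing the top 10 is to look up where the episode you expected lands in each ranking. The sketch below assumes you already know which episode should match the query; `rank_of_episode`, the episode ids, and the scores are all hypothetical and serve only to show the mechanics.

```python
def rank_of_episode(sims, target_episode):
    """Return the 1-based rank of target_episode with sims sorted best first, or None."""
    ranked = sorted(sims, key=lambda item: item[0], reverse=True)
    for rank, (_, episode, _) in enumerate(ranked, start=1):
        if episode == target_episode:
            return rank
    return None

# Hypothetical (score, episode, title) tuples for two methods; not real results.
toy_method_a = [(0.75, "e1", "A"), (0.60, "e2", "B"), (0.72, "e3", "C")]
toy_method_b = [(0.30, "e1", "A"), (0.35, "e2", "B"), (0.20, "e3", "C")]

rank_a = rank_of_episode(toy_method_a, "e2")  # 3: e2 has the lowest score under method A
rank_b = rank_of_episode(toy_method_b, "e2")  # 1: e2 has the highest score under method B
```

A lower rank for the expected episode suggests the method is better suited to the query, though a single query is only weak evidence.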
