In [1]:
import pandas as pd
import numpy as np

from tqdm import tqdm
tqdm.pandas()

# ✨Learning objectives
- Learn how to vectorize text using language transformer models
- Learn how to perform information retrieval (aka similarity search) on text data

# 💼 Business question: 
Can we create a search system that will take a movie's plot description and find other similar movies?
I.e. can we create a simple recommendation system?

### Load the dataset
The source for the dataset: https://www.kaggle.com/jrobischon/wikipedia-movie-plots
```
Plot summary descriptions scraped from Wikipedia

Content
The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

Release Year - Year in which the movie was released
Title - Movie title
Origin/Ethnicity - Origin of movie (i.e. American, Bollywood, Tamil, etc.)
Director - Director(s)
Cast - Main actor and actresses
Genre - Movie Genre(s)
Wiki Page - URL of the Wikipedia page from which the plot description was scraped
Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)
```

In [2]:
df = pd.read_csv('data/wiki_movie_plots_deduped.csv')
df.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [3]:
df.shape

(34886, 8)

### Transform movie plots into vectors using language transformers 🤖

From Wikipedia:
> A transformer is a deep learning model that adopts the mechanism of attention, differentially weighing the significance of each part of the input data. It is used primarily in the field of natural language processing (NLP) and in computer vision (CV).

At the end of this lesson, you'll find some additional resources on language transformers.
Three things are important to know for our purposes:
- just like `word2vec`, transformers are capable of encoding text documents into vectors
- unlike `word2vec`, which uses a bag-of-words assumption, transformers take into account the sequence of words in the entire document
- there are plenty of pretrained models available so that we don't have to train our own language transformer model
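
To make the bag-of-words assumption from the second bullet concrete, here is a toy sketch. It is not `word2vec` itself, only an illustration of what is lost when word order is ignored; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())

a = "the dog bites the man"
b = "the man bites the dog"

# Both sentences contain exactly the same words, so their bags are identical
# even though the sentences mean different things.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)
```

A model that looks at the whole word sequence can, in principle, tell these two sentences apart.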

Below we'll take advantage of a pretrained [`MiniLM`](https://arxiv.org/abs/2002.10957) model available in the [`sentence-transformers`](https://github.com/UKPLab/sentence-transformers) library.

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [5]:
vec = model.encode('text to be encoded as a vector')
vec.shape

(384,)

In [6]:
vec

array([-6.53098598e-02, -2.37192973e-01, -4.02155757e-01, -2.92886406e-01,
       -2.83317100e-02, -1.57253161e-01, -1.68945396e-03, -1.75767168e-02,
       -1.75248206e-01,  4.23989624e-01,  8.84522870e-02, -1.37995929e-01,
       -4.12889309e-02,  2.90352821e-01,  7.73523748e-02,  1.66886538e-01,
       -4.14034098e-01,  7.47231722e-01, -9.51502994e-02, -3.27102840e-01,
        3.29275340e-01,  4.39708084e-02, -3.53455335e-01, -7.72817433e-02,
        1.17481440e-01,  7.21677959e-01,  1.14074983e-01, -1.22848421e-01,
        4.48687196e-01, -2.33250201e-01,  3.87917578e-01,  2.28660032e-01,
        5.51502526e-01,  3.48250598e-01, -4.16423440e-01,  1.44073069e-01,
        8.19080602e-03,  2.80054603e-02,  8.01526681e-02, -9.43190008e-02,
        4.41039801e-02,  4.80651036e-02, -1.41612351e-01, -2.52890736e-01,
        4.17781949e-01, -1.84089363e-01, -1.65566802e-01, -8.29408988e-02,
       -2.09465206e-01,  2.69925624e-01, -6.35559738e-01,  2.86408097e-01,
       -8.13528657e-01,  

Generate vector representations for all movie plots

In [7]:
df['vec'] = df['Plot'].progress_apply(model.encode)

100%|████████████████████████████████████████████████████| 34886/34886 [07:23<00:00, 78.60it/s]
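
Encoding one plot at a time took several minutes above. The `encode` method in `sentence-transformers` also accepts a list of texts together with `batch_size` and `show_progress_bar` arguments, which may be noticeably faster, especially on a GPU. The sketch below wraps that call in a hypothetical helper, `encode_plots`, which is not part of the library; it assumes `model` is the `SentenceTransformer` instance loaded earlier.

```python
def encode_plots(model, plots, batch_size=64):
    """Encode texts in batches and return one vector per text.

    Assumes `model` exposes the sentence-transformers style
    `encode(sentences, batch_size=..., show_progress_bar=...)` method.
    """
    embeddings = model.encode(list(plots), batch_size=batch_size, show_progress_bar=True)
    # Split the 2-D result into a list of 1-D vectors, one per input text
    return list(embeddings)

# Possible usage in this notebook (not run here):
# df['vec'] = encode_plots(model, df['Plot'])
```

The batch size of 64 is an arbitrary starting point; the best value depends on the available memory.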


In [8]:
df[['Plot', 'vec']].head()

Unnamed: 0,Plot,vec
0,"A bartender is working at a saloon, serving dr...","[0.15061118, 0.09013437, -0.1195993, -0.155564..."
1,"The moon, painted with a smiling face hangs ov...","[0.13909757, 0.16326891, 0.24832156, -0.024223..."
2,"The film, just over a minute long, is composed...","[0.13978973, 0.17100815, -0.14208138, 0.037247..."
3,Lasting just 61 seconds and consisting of two ...,"[-0.17586792, 0.01116799, 0.16533226, 0.036813..."
4,The earliest known adaptation of the classic f...,"[-0.21197395, 0.25960857, 0.2994074, -0.253225..."


### Use Similarity Search techniques to find similar movies

From [Wikipedia](https://en.wikipedia.org/wiki/Nearest_neighbor_search):
> Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.
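
As a minimal sketch of the idea in the quote, the snippet below does an exact, brute-force nearest neighbor search with plain `numpy`, using Euclidean distance as the dissimilarity function. The function name `brute_force_neighbors` and the toy points are made up for illustration.

```python
import numpy as np

def brute_force_neighbors(points, query, k=3):
    """Return indices and distances of the k points closest to `query` (Euclidean)."""
    # Distance from the query to every point in the set
    distances = np.linalg.norm(points - query, axis=1)
    # Indices of the k smallest distances, closest first
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
query = np.array([0.9, 0.1])

indices, dists = brute_force_neighbors(points, query, k=2)
print(indices)  # indices of the two closest points, closest first
```

This compares the query against every point, which is fine for a few thousand vectors but gets slow as the collection grows.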

We'll use the [`NearestNeighbors`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) search algorithm available in `sklearn`.
Below we perform the following steps:
1. extract vectors and transform them into numpy arrays
2. create an instance of NearestNeighbors
3. train our model (often referred to as a search index)

In [9]:
from sklearn.neighbors import NearestNeighbors

vectors = np.array(df['vec'].values.tolist())
nn = NearestNeighbors()
index = nn.fit(vectors)

Now our search index is trained.
Let's write a convenience function that takes the plot description of a movie and returns movies with similar plots.
For that, we'll need to:
1. turn our input text into a vector
2. pass this vector to the [`.kneighbors()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors.kneighbors) method of the search index. Optionally, we can change the number of nearest neighbors that gets returned by passing the `n_neighbors` parameter.
3. select a subset of movies based on the indices returned by the `.kneighbors()` method

In [10]:
def get_similar(text, 
                df=df, 
                cols_to_return=['Title', 'Release Year', 'Origin/Ethnicity', 'Genre'],
                n_neighbors=15):
    vector = [model.encode(text)]
    neigh_ind = index.kneighbors(vector, n_neighbors=n_neighbors, return_distance=False)
    df_subset = df.iloc[neigh_ind[0]][cols_to_return]
    return df_subset

Finally, we can give our search system a try. Let's see which movies have plots similar to [Titanic 🚢](https://en.wikipedia.org/wiki/Titanic_(1997_film)#Plot).

In [11]:
plot_text_titanic = """
In 1996, aboard the research vessel Akademik Mstislav Keldysh, Brock Lovett and his team search the wreck of RMS Titanic. They recover a safe they hope contains a necklace with a large diamond known as the Heart of the Ocean. Instead, they only find a drawing of a young nude woman wearing the necklace. The sketch is dated April 14, 1912, the same day the Titanic struck an iceberg.[Note 1] Rose Dawson Calvert, the woman in the drawing, and her granddaughter, are brought aboard Keldysh. Rose recounts her experiences aboard Titanic.
In 1912 Southampton, 17-year-old Rose DeWitt Bukater, her wealthy fiancé Cal Hockley, and Rose's widowed mother, Ruth, board the Titanic. Ruth emphasizes that Rose's marriage to Cal will resolve the family's financial problems and maintain their upper-class status. Meanwhile, Jack Dawson, a poor young artist, wins a third-class Titanic ticket in a poker game. After setting sail, Rose, distraught over her loveless engagement, climbs over the stern railing, intending to jump overboard. Jack appears and coaxes her back onto the deck. The two develop a tentative friendship, but when Cal and Ruth strongly object, Rose acquiesces and discourages Jack's attention. She soon realizes she prefers Jack over Cal.
Rose brings Jack to her state room and pays him to sketch her nude, wearing only the Heart of the Ocean necklace. They later evade Cal's servant, Lovejoy, and have sex in an automobile inside the cargo hold. On the forward deck, they witness the ship's collision with an iceberg and overhear its officers and builder discussing the serious situation. Cal discovers Jack's sketch and Rose's insulting note left inside his safe, along with the necklace. When Jack and Rose return to warn the others about the collision, Cal has Lovejoy slip the necklace into Jack's pocket to frame him for theft. Jack is then confined in the master-at-arms' office. Cal puts the necklace into his overcoat pocket.
With the ship sinking, Rose flees Cal and her mother, who has boarded a lifeboat. Rose finds and frees Jack, and they barely make it back to the boat deck. Cal and Jack urge Rose to board a lifeboat. Having arranged to save himself, Cal falsely claims he can get Jack safely off the ship. As her lifeboat is lowered, Rose, unable to abandon Jack, jumps back on board. Cal grabs Lovejoy's pistol and chases Rose and Jack into the flooding first-class dining saloon. They get away, and Cal realizes that he gave his coat, and consequently the necklace, to Rose; he later boards a lifeboat posing as a lost child's father.
Jack and Rose return to the boat deck. The lifeboats have departed and the ship's stern is rising as the flooded bow sinks. As passengers fall to their deaths, Jack and Rose desperately cling to the stern rail. The upended ship breaks in half and the bow section dives downward. The remaining stern slams back onto the ocean, then upends again before it, too, sinks. In the freezing water, Jack helps Rose onto a wooden panel buoyant enough for one person. Some time later, Rose, barely alive, is saved by a returning lifeboat, but Jack, still clutching the panel, has died of hypothermia.
The RMS Carpathia rescues the survivors; Rose avoids Cal by hiding among the steerage passengers and gives her name as Rose Dawson. Still wearing Cal's overcoat, she discovers the necklace tucked inside the pocket. In the present, Rose says she later heard that Cal committed suicide after losing his fortune in the Wall Street Crash of 1929. Lovett abandons his search after hearing Rose's story. Alone on the stern of Keldysh, Rose takes out the Heart of the Ocean that has been in her possession all along, and drops it into the sea over the wreck site. While she is seemingly asleep or has died in her bed,[8] her photos on the dresser depict a happy life of freedom and adventure.
A young Rose reunites with Jack at Titanic's Grand Staircase, applauded by those who died on the ship.
"""

get_similar(plot_text_titanic)

Unnamed: 0,Title,Release Year,Origin/Ethnicity,Genre
16392,Titanic 3D,2012,American,drama
13153,Titanic,1997,American,"historical epic, disaster"
9350,The Deep,1977,American,thriller
19401,A Night to Remember,1958,British,drama
5705,Plymouth Adventure,1952,American,drama
7924,The Three Lives of Thomasina,1964,American,family
1945,Murder on a Honeymoon,1935,American,"comedy, mystery"
15317,The Reaping,2007,American,horror
17788,The Home Song Stories,2007,Australian,drama
9158,The Land That Time Forgot,1975,American,adventure


We mostly see movies from the drama and adventure genres: a somewhat expected result given the plot of Titanic.

Let's try another movie: [Saving Private Ryan](https://en.wikipedia.org/wiki/Saving_Private_Ryan#Plot)

In [12]:
plot_text_saving_private_ryan = """
An elderly veteran walks through a cemetery, accompanied by his family. Coming across a specific grave, he is overcome with emotion and recalls his time as a soldier. On the morning of June 6, 1944, the U.S. Army lands at Omaha Beach as part of the Normandy invasion. Captain John H. Miller leads a breakout from the beach, overwhelming fierce German resistance. Meanwhile at the United States Department of War in Washington, D.C., it is learned that James Francis Ryan of the 101st Airborne Division is the last of four brothers presumed alive but missing. General George C. Marshall orders Ryan to be found and sent home.
Miller soon receives orders to lead a unit to find Ryan. Arriving in the contested town of Neuville between the German defenders and the 101st Airborne, it is learned that Ryan is defending a key bridge in the fictional town of Ramelle. While assisting the 101st in Neuville, one of Miller's men is shot by a German sniper and is killed in action. En route to Ramelle, Miller decides against the judgment of his unit to neutralize a German machine gun nest, resulting in the loss of the unit's Medic. A surviving German soldier is spared by the intervention of Upham; Miller blindfolds the soldier and orders him to surrender himself to the next Allied patrol. When Reiben threatens to desert, Miller defuses the situation by revealing his civilian background as a teacher.
Soon arriving in Ramelle, the remaining unit make contact with Ryan and inform him of his brothers' deaths. Though upset by the news, Ryan refuses to abandon his current posting, which soon comes under siege by attacking German armor. Miller and his unit fight alongside the 101st though the German armor advantage soon starts to take its toll on the Americans. In the ensuing battle, Jackson, Mellish and Horvath are killed. In an attempt to destroy the bridge with pre-placed explosives, Miller is fatally wounded by "Steamboat Willie", the German soldier he earlier spared. As the Germans approach the bridge, P-51 Mustangs as well as advancing American Shermans with infantry rout the Germans. Steamboat Willie is personally executed by Upham, who spares his comrades.
As a result of his wounds, Miller dies, but first tells Ryan to "earn this," referring to the postwar life that he will hopefully be able to experience. Ryan is revealed to be the elderly veteran from the beginning of the film and the grave belonging to Miller. Ryan expresses his gratitude for the sacrifices made by Miller and his men and states that he hopes he indeed earned it, before saluting Miller's gravestone.
"""

get_similar(plot_text_saving_private_ryan)

Unnamed: 0,Title,Release Year,Origin/Ethnicity,Genre
13347,Saving Private Ryan,1998,American,"drama, war"
6450,7th Cavalry,1956,American,western
22197,Passchendaele,2008,Canadian,"romance, world war i drama"
6507,D-Day the Sixth of June,1956,American,war
5631,Hangman's Knot,1952,American,western
13377,The Thin Red Line,1998,American,"drama, war"
15185,Day Zero,2007,American,drama
3082,I Wanted Wings,1941,American,drama
8208,First to Fight,1967,American,war
4154,A Walk in the Sun,1945,American,war drama


The resulting movies are mainly from the war, drama and western genres.
Again, this is an expected result for those who've seen the movie.

### 🏋️ Your turn: try cosine similarity metric

Video on why cosine similarity might be better than Euclidean distance (the default distance used in sklearn's `NearestNeighbors`) when it comes to finding similar text documents:
https://www.youtube.com/watch?v=3VsaLblELP0
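
A quick way to see the difference between the two metrics: take a vector and a scaled-up copy of it. The sketch below uses toy vectors and two small helper functions written for this example; cosine distance is taken as 1 minus cosine similarity.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Straight-line distance: depends on both direction and magnitude."""
    return np.linalg.norm(a - b)

short = np.array([1.0, 2.0, 3.0])
long = 10 * short  # same direction, ten times the length

cos_d = cosine_distance(short, long)
euc_d = euclidean_distance(short, long)
print(cos_d, euc_d)
```

The two vectors point in the same direction, so the cosine distance is (numerically) zero, while the Euclidean distance is large because the lengths differ.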

Your task is to:
1. create a new instance of `NearestNeighbors` but this time providing `metric='cosine'` as a parameter
2. create a new search index (let's call it `index_cosine`)
3. re-define our `get_similar` function to accept an additional parameter `index` so that we can pass different search indices if we need to. I.e. the function should have the following signature:
```python
def get_similar(text, 
                   df=df, 
                   cols_to_return=['Title', 'Release Year', 'Origin/Ethnicity', 'Genre'],
                   n_neighbors=15,
                   index=index_cosine):
                   ...                 
```

In [13]:
# THIS IS A SOLUTION: REMOVE BEFORE THE PRESENTATION
nn = NearestNeighbors(metric='cosine')
index_cosine = nn.fit(vectors)

In [20]:
# THIS IS A SOLUTION: REMOVE BEFORE THE PRESENTATION
def get_similar(text, 
                df=df, 
                cols_to_return=['Title', 'Release Year', 'Origin/Ethnicity', 'Genre'],
                n_neighbors=15,
                index=index_cosine):
    vector = [model.encode(text)]
    neigh_dist, neigh_ind = index.kneighbors(vector, n_neighbors=n_neighbors, return_distance=True)
    # .copy() avoids a possible SettingWithCopyWarning when the distance column is added
    df_subset = df.iloc[neigh_ind[0]][cols_to_return].copy()
    df_subset['distance'] = neigh_dist[0]
    return df_subset

Let's again try finding movies similar to Titanic and Saving Private Ryan, but this time using cosine distance as the metric.

In [21]:
get_similar(plot_text_titanic, index=index_cosine)

Unnamed: 0,Title,Release Year,Origin/Ethnicity,Genre,distance
13153,Titanic,1997,American,"historical epic, disaster",0.087355
16392,Titanic 3D,2012,American,drama,0.087355
19401,A Night to Remember,1958,British,drama,0.391392
9350,The Deep,1977,American,thriller,0.414046
12857,Titanic,1996,American,biography,0.424837
9158,The Land That Time Forgot,1975,American,adventure,0.43741
5705,Plymouth Adventure,1952,American,drama,0.438937
8059,Assault on a Queen,1966,American,crime drama,0.450015
17541,"20,000 Leagues Under the Sea",1985,Australian,animation / adventure,0.453581
7924,The Three Lives of Thomasina,1964,American,family,0.469458


In [22]:
get_similar(plot_text_saving_private_ryan, index=index_cosine)

Unnamed: 0,Title,Release Year,Origin/Ethnicity,Genre,distance
13347,Saving Private Ryan,1998,American,"drama, war",0.176766
6450,7th Cavalry,1956,American,western,0.354021
21680,Dunkirk,2017,British,unknown,0.357766
11863,Patriot Games,1992,American,crime drama,0.358656
22197,Passchendaele,2008,Canadian,"romance, world war i drama",0.36368
6839,Men in War,1957,American,war,0.385096
15185,Day Zero,2007,American,drama,0.38799
4154,A Walk in the Sun,1945,American,war drama,0.391066
11187,Glory,1989,American,"drama, war",0.391852
6507,D-Day the Sixth of June,1956,American,war,0.392438


While the results are similar to what we were getting with the default Euclidean metric, some of the returned movies and their order (i.e. the distances to our query vector) are different.
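
One reason the two metrics can disagree is that Euclidean distance is sensitive to vector length, while cosine distance is not. If every vector is first scaled to unit length, the two metrics rank neighbors identically, because for unit vectors the squared Euclidean distance equals twice the cosine distance. The sketch below checks this on random toy data (not on the movie embeddings); all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors_demo = rng.normal(size=(100, 8))
query_demo = rng.normal(size=8)

# Scale every vector (and the query) to unit length
unit_vectors = vectors_demo / np.linalg.norm(vectors_demo, axis=1, keepdims=True)
unit_query = query_demo / np.linalg.norm(query_demo)

cosine_dist = 1.0 - unit_vectors @ unit_query
squared_euclidean = np.sum((unit_vectors - unit_query) ** 2, axis=1)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = np.allclose(squared_euclidean, 2.0 * cosine_dist)
top5_cosine = np.argsort(cosine_dist)[:5]
top5_euclidean = np.argsort(squared_euclidean)[:5]
print(identity_holds, top5_cosine, top5_euclidean)
```

So normalizing the embeddings before fitting a Euclidean index is one way to get cosine-style rankings.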

# ✔️Summary
- It's possible to take advantage of cutting-edge models (like transformers) without having to train one yourself: use someone else's pretrained models
- Using vectors to do information retrieval allows you to find documents that have similar meanings, not necessarily keywords
- Cosine similarity is often the preferred metric for high-dimensional vectors such as text embeddings, because it compares direction and ignores magnitude. There's no clear definition of what's considered "high-dimensional", but a common rule of thumb puts the cut-off somewhere around 50-100 dimensions. In low-dimensional spaces, Euclidean distance is used more commonly

# ☝️Where to go from here?

- How do big tech companies quickly search through millions of documents? They use slightly different nearest neighbor search algorithms that do the search approximately, not exactly. You can learn more about Approximate Nearest Neighbors algorithms in this [video](https://www.youtube.com/watch?v=QkCCyLW0ehU)

- You can apply the techniques you learned here on other datasets.
  Some places where you can find text datasets to practice:
   - [Kaggle datasets](https://www.kaggle.com/datasets)
   - [Google dataset search](https://datasetsearch.research.google.com/)
   - [Reddit r/datasets](https://www.reddit.com/r/datasets/)
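
To give a flavor of the approximate nearest neighbor idea from the first bullet above, here is a toy sketch of one such technique, random-hyperplane hashing (a form of locality-sensitive hashing). It is only an illustration under simplifying assumptions, not how any particular production library works; the class `RandomHyperplaneIndex` and its parameters are made up for this example.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneIndex:
    """Toy approximate index: vectors with the same sign pattern share a bucket."""

    def __init__(self, vectors, n_planes=4, seed=0):
        rng = np.random.default_rng(seed)
        self.vectors = vectors
        self.planes = rng.normal(size=(n_planes, vectors.shape[1]))
        self.buckets = defaultdict(list)
        for i, key in enumerate(self._keys(vectors)):
            self.buckets[key].append(i)

    def _keys(self, vectors):
        # One bit per hyperplane: which side of the plane the vector falls on
        bits = (vectors @ self.planes.T) > 0
        return [tuple(row.tolist()) for row in bits]

    def query(self, vector, k=5):
        # Only vectors in the query's bucket are compared, not the whole dataset
        candidates = self.buckets.get(self._keys(vector[None, :])[0], [])
        if not candidates:
            return []
        cand = np.array(candidates)
        sims = self.vectors[cand] @ vector / (
            np.linalg.norm(self.vectors[cand], axis=1) * np.linalg.norm(vector)
        )
        # Highest cosine similarity first
        return cand[np.argsort(-sims)[:k]].tolist()

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 16))
ann = RandomHyperplaneIndex(data, n_planes=4)
result = ann.query(data[42], k=3)
print(result)
```

The search is approximate because a true neighbor that falls on the other side of one of the hyperplanes ends up in a different bucket and is never compared. Real systems typically reduce this risk by combining several such hash tables or by using other index structures.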
 

# 📓 Additional resources:
- Language Transformers:
   - [Video: Illustrated Guide to Transformers Neural Network](https://www.youtube.com/watch?v=4Bdc55j80l8)
   - [Introductory course on language transformers by HuggingFace](https://huggingface.co/course)
- Nearest neighbor search:
   - [Video: StatQuest: K-nearest neighbors](https://www.youtube.com/watch?v=HVXime0nQeI)
   - Nearest neighbor methods and vector models:
      - [Part 1](https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html)
      - [Part 2](https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html)
      - [Epilogue](https://erikbern.com/2015/10/20/nearest-neighbors-and-vector-models-epilogue-curse-of-dimensionality.html)
- Learn how to measure similarity between two words not by their meaning but by the characters present in each word:
   - [Measuring Text Similarity Using the Levenshtein Distance](https://blog.paperspace.com/measuring-text-similarity-using-levenshtein-distance/)
   - [Implementing The Levenshtein Distance for Word Autocompletion and Autocorrection](https://blog.paperspace.com/implementing-levenshtein-distance-word-autocomplete-autocorrect/)
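
For a taste of the character-level similarity covered in the last two links, here is a minimal sketch of the Levenshtein (edit) distance using the standard dynamic-programming approach. It is written for this note and is not taken from the linked articles.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Unlike the embedding-based search in this lesson, this measure knows nothing about meaning: it only counts character edits.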