# Embedding all the Plots

The goal of this notebook is to set up the pipeline for embedding all of the movie plots.

## Imports

In [9]:
import json
import pandas as pd
from sentence_transformers import SentenceTransformer
from scripts.helpers import get_embedding

## Data

In [25]:
# let's read the preprocessed data
df = pd.read_csv('../data/processed/preprocessed.csv', index_col=0)
print(df.shape)
df.head(5)

(22394, 10)


| | name | release_year | rating | num_votes | wikipedia_id | plot | freebase_id | languages | countries | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Miss Jerry | 1894.0 | 5.3 | 207 | 10109752 | After finding out that her father is suffering... | /m/02q23xk | ['Silent film'] | ['United States of America'] | ['Short Film', 'Silent film', 'Indie', 'Black-... |
| 1 | The Corbett-Fitzsimmons Fight | 1897.0 | 5.3 | 484 | 28703057 | The film no longer exists in its entirety; how... | /m/0czdh_n | [] | [] | ['Sports'] |
| 2 | The Story of the Kelly Gang | 1906.0 | 6.0 | 855 | 142995 | The Story of the Kelly Gangs tone is of sorrow... | /m/0120y4 | ['English Language'] | ['Australia'] | ['Crime Fiction', 'Silent film', 'Biography', ... |
| 3 | Robbery Under Arms | 1907.0 | 4.3 | 25 | 32986669 | Key scenes of the film included the branding o... | /m/04p7yxx | ['Silent film'] | ['Australia'] | ['Silent film', 'Drama'] |
| 4 | Captain Midnight, the Bush King | 1911.0 | 5.4 | 18 | 32987200 | Edgar Dalimore is the son of wealthy station o... | /m/0h569x9 | ['Silent film'] | [] | ['Silent film'] |


In [26]:
# keep only the columns we need
df = df[['wikipedia_id', 'plot']]

## Embedding

Before embedding, download the [model](https://huggingface.co/intfloat/e5-large-v2) into the `models` directory by running the following commands from inside that directory:

```
git lfs install
git clone https://huggingface.co/intfloat/e5-large-v2
```

In [10]:
model = SentenceTransformer('../models/e5-large-v2')
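
The `get_embedding` helper is imported from `scripts.helpers`, and its body is not shown in this notebook. The following is only a minimal sketch of what such a helper could look like, not the project's actual code. It assumes the helper prepends the `passage: ` prefix that the e5 model card asks for on documents and then calls `encode` on the model; the name `get_embedding_sketch`, the whitespace cleanup, and the choice to normalize the vectors are all assumptions made for illustration.

```python
def get_embedding_sketch(text, model, prefix="passage: "):
    """Hypothetical stand-in for scripts.helpers.get_embedding.

    `model` is expected to be a SentenceTransformer (or any object exposing a
    compatible `encode` method). The prefix, the whitespace cleanup and the
    normalization are assumptions, not taken from the project's helper.
    """
    # e5 models expect a "query: " or "passage: " prefix on every input text
    prefixed = prefix + " ".join(str(text).split())  # also collapses whitespace
    # with a SentenceTransformer, encode() on a single string returns a 1-D numpy array
    return model.encode(prefixed, normalize_embeddings=True)
```

With the model loaded above, a call would look like `get_embedding_sketch(df['plot'].iloc[0], model)`.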

In [29]:
# load the embeddings that were already computed in previous runs
with open('../data/embeddings/embeddings.json') as embedding_file:
    embeddings = json.load(embedding_file)
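
One detail worth keeping in mind when this file is reused across runs: JSON object keys are always strings, so integer ids written by `json.dump` come back as `str` after loading. A small self-contained illustration, using a made-up embedding value:

```python
import json

# an in-memory dictionary keyed by an integer id (the vector is made up)
in_memory = {10109752: [0.1, 0.2]}

# round-trip through JSON, as happens when the file is saved and reloaded
reloaded = json.loads(json.dumps(in_memory))

print(list(reloaded))             # ['10109752']
print(10109752 in reloaded)       # False
print(str(10109752) in reloaded)  # True
```

Comparing `wikipedia_id` values against these keys therefore needs a cast to `str` on the DataFrame side (or to `int` on the key side).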

In [None]:
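# skip the movies that are already embedded; JSON keys are always strings,
# so the ids are compared (and stored) as strings
embedded_movies = set(embeddings)
to_embed_df = df[~df['wikipedia_id'].astype(str).isin(embedded_movies)]

for i, (index, row) in enumerate(to_embed_df.iterrows()):
    embeddings[str(row['wikipedia_id'])] = get_embedding(row['plot'], model).tolist()

    if i % 10 == 0:  # checkpoint to disk every 10 movies
        with open('../data/embeddings/embeddings.json', 'w') as file:
            json.dump(embeddings, file)

# save once more so the movies embedded after the last checkpoint are kept
with open('../data/embeddings/embeddings.json', 'w') as file:
    json.dump(embeddings, file)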
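
Because the loop rewrites the whole file at every checkpoint, an interruption in the middle of a write could leave `embeddings.json` truncated. One possible way to guard against that, sketched here with a hypothetical helper `save_json_atomically` that is not part of the project, is to write to a temporary file in the same directory and then swap it into place with `os.replace`:

```python
import json
import os
import tempfile


def save_json_atomically(data, path):
    """Hypothetical helper: write `data` as JSON to `path` via a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # create the temp file next to the target so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(data, tmp_file)
        # the target is only replaced once the new file is fully written
        os.replace(tmp_path, path)
    except BaseException:
        # on any failure (including KeyboardInterrupt) drop the partial temp file
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the loop, each `with open(...)` / `json.dump(...)` pair could then become a single call such as `save_json_atomically(embeddings, '../data/embeddings/embeddings.json')`.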