# Embeddings

This notebook generates movie embeddings with the two main approaches that we use for building graphs:

1. **embedding approach**: use the [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) model to generate feature vectors directly from the plots;
2. **genre approach**: first let [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) predict the probability that a movie belongs to each genre and theme, then merge those probabilities into a single embedding vector.

## Imports

In [1]:
import json
import os

import numpy as np
import pandas as pd

from sentence_transformers import SentenceTransformer
from scripts.helpers import get_embedding, get_classification_from_json

In [2]:
root_path = os.path.dirname(os.path.abspath(""))

## Data

Data preprocessing is done in [preprocess.ipynb](preprocess.ipynb). That notebook cleans and merges the data, then saves the result to `data/processed`.

In [3]:
df = pd.read_csv('../data/processed/preprocessed.csv', index_col=0)
print(df.shape)
df.head(5)

(22394, 10)


|   | name | release_year | rating | num_votes | wikipedia_id | plot | freebase_id | languages | countries | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Miss Jerry | 1894.0 | 5.3 | 207 | 10109752 | After finding out that her father is suffering... | /m/02q23xk | ['Silent film'] | ['United States of America'] | ['Short Film', 'Silent film', 'Indie', 'Black-... |
| 1 | The Corbett-Fitzsimmons Fight | 1897.0 | 5.3 | 484 | 28703057 | The film no longer exists in its entirety; how... | /m/0czdh_n | [] | [] | ['Sports'] |
| 2 | The Story of the Kelly Gang | 1906.0 | 6.0 | 855 | 142995 | The Story of the Kelly Gangs tone is of sorrow... | /m/0120y4 | ['English Language'] | ['Australia'] | ['Crime Fiction', 'Silent film', 'Biography', ... |
| 3 | Robbery Under Arms | 1907.0 | 4.3 | 25 | 32986669 | Key scenes of the film included the branding o... | /m/04p7yxx | ['Silent film'] | ['Australia'] | ['Silent film', 'Drama'] |
| 4 | Captain Midnight, the Bush King | 1911.0 | 5.4 | 18 | 32987200 | Edgar Dalimore is the son of wealthy station o... | /m/0h569x9 | ['Silent film'] | [] | ['Silent film'] |


## Embedding Approach

We use E5 (`e5-large-v2`), the model described in this [paper](https://arxiv.org/pdf/2212.03533.pdf), which can be downloaded [here](https://huggingface.co/intfloat/e5-large-v2). To download it, run the following in the `models` directory:

```
git lfs install
git clone https://huggingface.co/intfloat/e5-large-v2
```

In [5]:
model = SentenceTransformer('../models/e5-large-v2')
decades = np.arange(1910, 2020, 10)
input_template = 'query: {}'

In [6]:
for decade in decades:
    # select the movies released within this decade
    decade_df = df[(df.release_year >= decade) & (df.release_year < decade + 10)]

    # map each movie name to the embedding of its plot
    # (movies sharing a name within a decade overwrite each other)
    embeddings = {}

    for _, row in decade_df.iterrows():
        # E5 models expect a task prefix on every input text
        plot = input_template.format(row['plot'])
        movie = row['name']
        embeddings[movie] = get_embedding(plot, model)

    out_path = os.path.join(root_path, 'data', 'embeddings', 'embedding', f'plots{decade}s.json')
    with open(out_path, "w") as outfile:
        json.dump(embeddings, outfile)
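
One caveat with the loop above: the dictionary is keyed by `name`, so two movies with the same title in one decade would collapse into a single entry. Below is a minimal sketch of one way to avoid this, assuming the `wikipedia_id` column is unique per movie (not verified here). `build_keyed_embeddings` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for `get_embedding` so that the snippet runs without the model.

```python
def build_keyed_embeddings(rows, embed):
    """Key each embedding by wikipedia_id (assumed unique) and keep the title alongside."""
    embeddings = {}
    for row in rows:
        key = str(row["wikipedia_id"])  # JSON object keys must be strings
        embeddings[key] = {"name": row["name"], "embedding": embed(row["plot"])}
    return embeddings


def fake_embed(text):
    """Hypothetical stand-in for get_embedding: returns a tiny deterministic vector."""
    return [float(len(text)), float(text.count(" "))]


# Toy rows: two different movies that share a title (IDs and plots are made up).
rows = [
    {"wikipedia_id": 1, "name": "Robin Hood", "plot": "An outlaw robs the rich."},
    {"wikipedia_id": 2, "name": "Robin Hood", "plot": "A fox steals from a lion."},
]

by_name = {row["name"]: fake_embed(row["plot"]) for row in rows}
by_id = build_keyed_embeddings(rows, fake_embed)

print(len(by_name))  # the second movie overwrote the first
print(len(by_id))    # both movies are kept
```

With real data, `rows` could come from `decade_df.to_dict('records')`. Any downstream code that looks movies up by title would need to be adapted to the new keys.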

## Genre Approach

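We first let the model predict the genre and theme probabilities, then convert them into embeddings: the scores are ordered by label so that every movie uses the same dimensions, and the resulting vector is L2-normalized.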

In [34]:
decades = np.arange(1910, 2020, 10)

for decade in decades:
    # predicted genre and theme probabilities for every movie of the decade
    probabilities = get_classification_from_json(decade)
    embeddings = {}

    for movie, probability in probabilities.items():
        labels = probability['genres']['labels'] + probability['themes']['labels']
        scores = probability['genres']['scores'] + probability['themes']['scores']

        # sort the labels and reorder the scores accordingly,
        # so that each dimension means the same label for every movie
        _, scores = zip(*sorted(zip(labels, scores)))

        # normalize the embedding to unit length
        embedding = np.array(scores)
        norm_embedding = embedding / np.linalg.norm(embedding)

        # convert to plain Python floats for JSON serialization
        embeddings[str(movie)] = norm_embedding.tolist()

    out_path = os.path.join(root_path, 'data', 'embeddings', 'custom_genres', f'plots_{decade}s.json')
    with open(out_path, "w") as outfile:
        json.dump(embeddings, outfile)
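
To make the conversion concrete, here is a small worked example on made-up scores. It is a sketch that uses only the standard library, so it runs without the notebook's dependencies. `scores_to_embedding` is a hypothetical helper, and the input layout (`labels` and `scores` lists under `genres` and `themes`) is assumed from the loop above. Sorting by label matters whenever the classifier returns labels in a different order for each movie (for example, ranked by score): after sorting, position *i* refers to the same label in every vector.

```python
import math


def scores_to_embedding(probability):
    """Order the genre and theme scores by label, then L2-normalize them."""
    labels = probability["genres"]["labels"] + probability["themes"]["labels"]
    scores = probability["genres"]["scores"] + probability["themes"]["scores"]

    # Sort by label so that every movie uses the same dimension order.
    ordered = [score for _, score in sorted(zip(labels, scores))]

    norm = math.sqrt(sum(score * score for score in ordered))
    if norm == 0:
        return ordered  # all-zero scores: nothing to normalize
    return [score / norm for score in ordered]


# Made-up predictions for two movies, with labels listed in different orders.
movie_a = {
    "genres": {"labels": ["drama", "comedy"], "scores": [0.6, 0.0]},
    "themes": {"labels": ["war", "love"], "scores": [0.8, 0.0]},
}
movie_b = {
    "genres": {"labels": ["comedy", "drama"], "scores": [0.0, 0.3]},
    "themes": {"labels": ["love", "war"], "scores": [0.0, 0.4]},
}

emb_a = scores_to_embedding(movie_a)
emb_b = scores_to_embedding(movie_b)

# Dimension order is comedy, drama, love, war for both movies.
print([round(x, 3) for x in emb_a])
print([round(x, 3) for x in emb_b])
```

Both movies end up with approximately the same unit vector, since their scores differ only by a constant factor and normalization removes that scale. Without the sort, the same scores would land in different positions for the two movies and the vectors would not be comparable.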