# Similarities

Now that we have the embeddings, it's time to compute the similarities and save them as JSON files. The goal is to take the embeddings generated in the previous steps and compute the cosine similarity between every pair of movies within each decade. Because all vectors are normalized to unit length, the dot product of two vectors equals their cosine similarity. We do this for two different approaches:

1. **embedding approach**: we used the e5-large model to generate normalized embeddings of size 1024 (see [embeddings.ipynb](embeddings.ipynb));
2. **genre approach**: we picked genres and themes and let the bart-large model predict the probability of each movie belonging to each genre/theme; we then turned these probabilities into vectors and normalized them to unit length (see [embeddings.ipynb](embeddings.ipynb)).
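
Both approaches above rely on unit-length vectors, which is why a plain dot product is enough. The following minimal sketch illustrates that equivalence on random vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions and are not part of the project code.

```python
import numpy as np

# Two arbitrary (non-normalized) vectors of the same size as the e5-large embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=1024)
b = rng.normal(size=1024)


def cosine_similarity(u, v):
    """Textbook cosine similarity: dot product divided by the product of the norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Normalize both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# For unit vectors the denominator is 1, so the dot product alone is the cosine similarity.
assert np.isclose(a_unit @ b_unit, cosine_similarity(a, b))
```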

## Imports

In [1]:
import itertools
import json
import os
import sys

import numpy as np
from tqdm import tqdm

# Make the parent directory importable so that the `scripts` package can be found.
sys.path.append(os.path.abspath('..'))

from scripts.helpers import get_embeddings_from_json

In [3]:
root_path = os.path.dirname(os.path.abspath(""))

## Embeddings

Since we've already embedded the movies using Google Colab and downloaded the results (see README), we can simply read the JSON files.

In [None]:
decades = np.arange(1910, 2020, 10)

In [10]:
for decade in decades:
    embeddings = get_embeddings_from_json(decade, approach='embedding')
    
    # generate all possible combinations for dot product 
    combinations = list(itertools.combinations(embeddings.keys(), 2))
    
    similarities = {}
    
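    for movie_1, movie_2 in tqdm(combinations):
        # vectors are unit length, so the dot product is the cosine similarity;
        # float() converts the NumPy scalar to a plain float for json.dump
        similarity = float(embeddings[movie_1] @ embeddings[movie_2])
        similarities['-'.join([str(movie_1), str(movie_2)])] = similarity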
    
    filepath = os.path.join(root_path, 'data', 'similarities', 'embedding', f'similarities_{decade}s.json')
    with open(filepath, "w") as outfile: 
        json.dump(similarities, outfile)
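
The pairwise loop above can also be expressed as a single matrix product. The sketch below is one possible vectorized variant, not the method used in this notebook; `pairwise_similarities` is a hypothetical helper, and it assumes `embeddings` is a non-empty dict mapping movie ids to 1-D NumPy arrays of equal length.

```python
import numpy as np


def pairwise_similarities(embeddings):
    """Return {"id1-id2": similarity} for every unordered pair of keys.

    Hypothetical helper: assumes `embeddings` maps ids to 1-D unit vectors.
    """
    keys = list(embeddings.keys())
    # Stack the vectors into an (n_movies, dim) matrix; row order matches `keys`.
    matrix = np.stack([np.asarray(embeddings[k], dtype=np.float64) for k in keys])
    # One matrix product yields every pairwise dot product at once.
    gram = matrix @ matrix.T
    # Indices of the strict upper triangle, i.e. all pairs (i, j) with i < j.
    rows, cols = np.triu_indices(len(keys), k=1)
    # float() turns NumPy scalars into plain floats so json.dump accepts them.
    return {
        f"{keys[i]}-{keys[j]}": float(gram[i, j])
        for i, j in zip(rows, cols)
    }


# Tiny usage example with three made-up unit vectors.
example = {
    1: np.array([1.0, 0.0]),
    2: np.array([0.0, 1.0]),
    3: np.array([0.6, 0.8]),
}
example_similarities = pairwise_similarities(example)
```

The trade-off is memory: the full `n x n` matrix has to fit in RAM, whereas the loop only ever holds two vectors at a time.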

## Genres

We didn't merge the for loops for the two approaches because the first one is computationally much heavier, so we kept them separate.

In [8]:
for decade in decades:

    embeddings = get_embeddings_from_json(decade, approach='custom_genres')
    
    similarities = {}
    
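    for movie_1, movie_2 in tqdm(itertools.combinations(embeddings.keys(), 2)):
        # vectors are unit length, so the dot product is the cosine similarity;
        # float() converts the NumPy scalar to a plain float for json.dump
        similarity = float(embeddings[movie_1] @ embeddings[movie_2])
        similarities['-'.join([str(movie_1), str(movie_2)])] = similarity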
    
    filepath = os.path.join(root_path, 'data', 'similarities', 'custom_genres', f'similarities_{decade}s.json')
    with open(filepath, "w") as outfile: 
        json.dump(similarities, outfile)

6786it [00:00, 279581.81it/s]
85078it [00:00, 299702.35it/s]
862641it [00:02, 323413.80it/s]
1088550it [00:03, 288493.19it/s]
1604736it [00:04, 348812.13it/s]
1313010it [00:03, 361166.52it/s]
1554966it [00:04, 356145.14it/s]
3296028it [00:09, 344640.35it/s]
5700376it [00:23, 241693.26it/s]
19453203it [01:10, 275713.85it/s]
1461195it [00:04, 307364.16it/s]
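
Each progress line above reports the number of movie pairs processed for one decade. Since `itertools.combinations(keys, 2)` yields n(n-1)/2 pairs for n movies, the movie count can be recovered from the pair count. The helper `movies_from_pairs` below is a hypothetical illustration of that arithmetic, and it assumes the lines appear in loop order (1910s first).

```python
import math


def movies_from_pairs(pair_count):
    """Solve n * (n - 1) / 2 == pair_count for a non-negative integer n."""
    n = (1 + math.isqrt(1 + 8 * pair_count)) // 2
    if math.comb(n, 2) != pair_count:
        raise ValueError(f"{pair_count} is not a valid number of unordered pairs")
    return n


# First progress line (6786 pairs) corresponds to 117 movies.
print(movies_from_pairs(6786))      # 117
# Largest progress line (19453203 pairs) corresponds to 6238 movies.
print(movies_from_pairs(19453203))  # 6238
```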
