# Similarities 

This notebook calculates pairwise similarity scores between movie plots and saves them to JSON.
Because the job is computationally heavy, it is run separately in Google Colab.

## Imports

In [1]:
# add root path to system path
import sys
sys.path.append('../../')

In [2]:
import itertools
import pandas as pd
import json
import os
from tqdm import tqdm
from scripts.helpers import get_embeddings_from_json, get_similarities_from_json, get_classification_from_json

In [5]:
root_path = os.path.dirname(os.path.abspath(""))

## Data

Data preprocessing is done in [preprocess.ipynb](preprocess.ipynb). That notebook cleans and merges the data, then saves it to `data/processed`.

In [6]:
movie_df = pd.read_csv(os.path.join(root_path, 'data/processed/preprocessed.csv'), index_col=0)
print(movie_df.shape)
movie_df.head(5)

(22394, 10)


Unnamed: 0,name,release_year,rating,num_votes,wikipedia_id,plot,freebase_id,languages,countries,genres
0,Miss Jerry,1894.0,5.3,207,10109752,After finding out that her father is suffering...,/m/02q23xk,['Silent film'],['United States of America'],"['Short Film', 'Silent film', 'Indie', 'Black-..."
1,The Corbett-Fitzsimmons Fight,1897.0,5.3,484,28703057,The film no longer exists in its entirety; how...,/m/0czdh_n,[],[],['Sports']
2,The Story of the Kelly Gang,1906.0,6.0,855,142995,The Story of the Kelly Gangs tone is of sorrow...,/m/0120y4,['English Language'],['Australia'],"['Crime Fiction', 'Silent film', 'Biography', ..."
3,Robbery Under Arms,1907.0,4.3,25,32986669,Key scenes of the film included the branding o...,/m/04p7yxx,['Silent film'],['Australia'],"['Silent film', 'Drama']"
4,"Captain Midnight, the Bush King",1911.0,5.4,18,32987200,Edgar Dalimore is the son of wealthy station o...,/m/0h569x9,['Silent film'],[],['Silent film']


## Embeddings Approach

Since we've already embedded the movies using Google Colab and downloaded the results (see the README), we can simply read the JSON.

In [9]:
DECADE = 2010  # change the decade whenever necessary
embeddings = get_embeddings_from_json(DECADE)

### Modelling

### Similarity Matrix

Once the plots are vectorized, we compute the similarity between every pair of plots as the dot product of their embeddings.
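
One point worth checking before relying on raw dot products: a dot product equals the cosine similarity only when both vectors have unit length. Whether the embeddings loaded by `get_embeddings_from_json` are already normalized is not stated here, so the sketch below is an assumption-laden illustration on toy vectors. The helper `cosine_similarity` is hypothetical and not part of `scripts.helpers`.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity of two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy vectors pointing in the same direction but with different lengths.
a = [3.0, 4.0]
b = [6.0, 8.0]

dot = float(np.dot(a, b))       # grows with vector length
cos = cosine_similarity(a, b)   # depends only on direction

print(dot, cos)
```

If the norms of the real embeddings are all close to 1, the two measures agree and the plain dot product is sufficient.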

In [10]:
# generate every unordered pair of movies for the dot product
movie_sample = list(embeddings.keys())
combinations = list(itertools.combinations(movie_sample, 2))
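
Because every unordered pair is materialized in a list, the number of pairs grows quadratically with the number of movies. The sketch below shows how to estimate that count up front. The sample IDs and the figure of 5000 movies are made up for illustration and are not taken from the dataset.

```python
import itertools
import math


def count_pairs(n_items: int) -> int:
    """Number of unordered pairs among n_items, i.e. n * (n - 1) / 2."""
    return math.comb(n_items, 2)


# Hypothetical IDs, only to check the formula against itertools.
sample_ids = ["a", "b", "c", "d"]
pairs = list(itertools.combinations(sample_ids, 2))
assert len(pairs) == count_pairs(len(sample_ids))

# A decade with 5000 movies (an assumed figure) would need this many pairs:
print(count_pairs(5000))  # 12497500
```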

In [None]:
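# map "<id_1>-<id_2>" to the dot product of the two embeddings
similarity_dict = {}

for movie_1, movie_2 in tqdm(combinations):
    similarity = embeddings[movie_1] @ embeddings[movie_2]
    # cast to a plain float so that json.dump can serialize the value
    similarity_dict[f'{movie_1}-{movie_2}'] = float(similarity)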
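
As a possible alternative to the Python-level loop, the same pairwise dot products can be obtained from a single matrix multiplication. This is a minimal sketch, assuming `embeddings` maps each movie ID to a 1-D numeric vector of equal length. The function `pairwise_dot_similarities` is hypothetical, and the toy dictionary stands in for the real embeddings.

```python
import numpy as np


def pairwise_dot_similarities(embeddings):
    """Return {"id1-id2": dot product} for every unordered pair of IDs."""
    ids = list(embeddings.keys())
    matrix = np.asarray([embeddings[i] for i in ids], dtype=np.float64)
    gram = matrix @ matrix.T  # gram[r, c] is the dot product of rows r and c
    # Indices strictly above the diagonal: each unordered pair exactly once.
    rows, cols = np.triu_indices(len(ids), k=1)
    return {f"{ids[r]}-{ids[c]}": float(gram[r, c]) for r, c in zip(rows, cols)}


# Toy embeddings, assumed shape only.
toy = {"1": [1.0, 0.0], "2": [0.0, 1.0], "3": [1.0, 1.0]}
result = pairwise_dot_similarities(toy)
print(result)
```

The trade-off is memory: the full N x N matrix is held at once, so for a large decade it may be necessary to process the rows in chunks.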

In [15]:
with open(os.path.join(root_path, 'data', 'embeddings', 'similarities', f'similarities_{DECADE}s.json'), "w") as outfile: 
    json.dump(similarity_dict, outfile)
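
To sanity-check the saved file, it can be read back and ranked by score. The sketch below uses a temporary file and made-up scores rather than the real output. `top_pairs` is a hypothetical helper, and splitting keys on `-` assumes that movie IDs contain no hyphen themselves.

```python
import json
import os
import tempfile


def top_pairs(similarities, k=3):
    """Return the k highest-scoring pairs as ((id_1, id_2), score) tuples."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    # Assumes IDs contain no hyphen, so the first "-" separates the two IDs.
    return [(tuple(key.split("-", 1)), score) for key, score in ranked[:k]]


# Made-up scores in the same "<id_1>-<id_2>" key format.
toy = {"10-20": 0.91, "10-30": 0.12, "20-30": 0.55}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarities_toy.json")
    with open(path, "w") as outfile:
        json.dump(toy, outfile)
    with open(path) as infile:
        loaded = json.load(infile)

best = top_pairs(loaded, k=2)
print(best)
```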

## Classification Approach

In [None]:
probabilities = get_classification_from_json(DECADE)