This notebook demonstrates how to build a vector index with the FAISS (Facebook AI Similarity Search) library on a movies dataset of titles and descriptions. The index can then be queried for the movies closest to a custom movie description.

In [31]:
import pandas as pd
import os
import multiprocessing
import torch
import numpy as np
import faiss
from faiss.contrib.client_server import run_index_server, ClientIndex
import sqlite3

Load the movies_metadata.csv file from https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset. A copy of the file is also included in the repo.

In [2]:
moviesMetadata = pd.read_csv("A:\\Experiment_Programs\\VertexSearch\\movies_metadata.csv")



In [33]:
moviesMetadata.shape

(45466, 24)

Picking the descriptions of only the first 1000 movies in the list to experiment with, because my system cannot handle the load of encoding the descriptions of all 45,000-odd movies.

In [3]:
inputDf = moviesMetadata.iloc[:1000][['original_title','overview']]
inputDf.rename(columns={'original_title':'movie_title','overview':'movie_description'},inplace=True)
inputDf

Unnamed: 0,movie_title,movie_description
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...
...,...,...
995,The Three Caballeros,For Donald's birthday he receives a box with t...
996,The Sword in the Stone,Wart is a young boy who aspires to be a knight...
997,So Dear to My Heart,The tale of Jeremiah Kincaid and his quest to ...
998,Robin Hood: Prince of Thieves,When the dastardly Sheriff of Nottingham murde...


In [4]:
inputDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   movie_title        1000 non-null   object
 1   movie_description  988 non-null    object
dtypes: object(2)
memory usage: 15.8+ KB


Filling null values in the movie_description column

In [5]:
inputDf['movie_description'].notna().value_counts()

movie_description
True     988
False     12
Name: count, dtype: int64

In [6]:
inputDf['movie_description'] = inputDf['movie_description'].fillna('This movie does not matter at all')

In [7]:
inputDf['movie_description'].notna().value_counts()

movie_description
True    1000
Name: count, dtype: int64

Using BERT (Bidirectional Encoder Representations from Transformers) to tokenize and embed the movie descriptions

In [8]:
from transformers import AutoModel,AutoTokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)



In [9]:
def tokenize_description(description):
    return tokenizer.tokenize(description)

In [10]:
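def embed_description(description):
    # Tokenize one description; truncation caps it at the model's maximum input length
    inputs = tokenizer(description, return_tensors="pt",truncation=True,padding=True)
    with torch.no_grad():  # inference only, so no gradients are tracked
        outputs = model(**inputs)
        # Mean-pool the per-token vectors into a single embedding for the description
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
    return embeddings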
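
To make the pooling step in `embed_description` concrete: `last_hidden_state` holds one vector per token, and `.mean(dim=1)` averages them into a single vector per description. The sketch below reproduces that averaging on hand-written numbers in plain Python. `mean_pool` and the toy values are illustrative assumptions, not part of the notebook or of the `transformers` API.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three made-up "token" vectors of dimension 4 (bert-base-uncased uses 768).
toy_hidden_states = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(toy_hidden_states)
print(pooled)  # [2.0, 1.0, 2.0, 2.0]
```

However many tokens a description has, the pooled result always has the model's hidden size, which is what lets every movie share one fixed-dimension index.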

A sample of what the embeddings for a few movie descriptions look like

In [11]:
inputDf.iloc[:5]['movie_description'].map(embed_description)

0    [-0.25185489654541016, 0.23600180447101593, 0....
1    [-0.20528849959373474, 0.14001354575157166, 0....
2    [0.05027814581990242, -0.21238787472248077, 0....
3    [0.22480608522891998, 0.11495525389909744, 0.5...
4    [-0.04765690118074417, -0.08733634650707245, 0...
Name: movie_description, dtype: object

The actual embedding of all 1000 movie descriptions - this takes a while

In [12]:
%time inputDf['embeddings'] = inputDf['movie_description'].map(embed_description)

CPU times: total: 18min 54s
Wall time: 3min 51s


Converting the embeddings column to strings so that it can be written to a SQLite table

In [13]:
# Replace each embedding list with its string form, keeping the column name
inputDf['embeddings'] = inputDf['embeddings'].apply(str)
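
The conversion above stores each list as its Python string representation. One alternative, sketched here as an assumption rather than the notebook's method, is to serialize with the standard-library `json` module, so that reading the value back is a plain parse with no code evaluation. The vector values are made up.

```python
import json

embedding = [-0.25, 0.236, 0.5]  # made-up stand-in for a 768-value vector

as_text = json.dumps(embedding)  # text form that fits in a SQLite column
restored = json.loads(as_text)   # parsed back into a list of floats

print(as_text)
print(restored == embedding)
```

For lists of finite floats the JSON text and the Python string form look the same, so this is mainly a choice about how the values are parsed back.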

In [14]:
inputDf

Unnamed: 0,movie_title,movie_description,embeddings
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[-0.25185489654541016, 0.23600180447101593, 0...."
1,Jumanji,When siblings Judy and Peter discover an encha...,"[-0.20528849959373474, 0.14001354575157166, 0...."
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[0.05027814581990242, -0.21238787472248077, 0...."
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[0.22480608522891998, 0.11495525389909744, 0.5..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[-0.04765690118074417, -0.08733634650707245, 0..."
...,...,...,...
995,The Three Caballeros,For Donald's birthday he receives a box with t...,"[-0.3604140877723694, 0.19241639971733093, 0.2..."
996,The Sword in the Stone,Wart is a young boy who aspires to be a knight...,"[-0.21462322771549225, 0.06435999274253845, 0...."
997,So Dear to My Heart,The tale of Jeremiah Kincaid and his quest to ...,"[-0.0900331661105156, 0.1550447940826416, 0.19..."
998,Robin Hood: Prince of Thieves,When the dastardly Sheriff of Nottingham murde...,"[-0.14862754940986633, 0.09581873565912247, 0...."


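Creating the FAISS index and writing it to disk. The index ranks results by inner product. The embeddings are not L2-normalized in this notebook, so the scores are raw inner products rather than cosine similarities.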

In [15]:
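import ast
# Parse the stored strings back into lists; literal_eval only accepts literals, unlike eval
embeddingsArr = np.array(inputDf['embeddings'].apply(ast.literal_eval).tolist(), dtype='float32')
len(embeddingsArr)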

1000

In [21]:
writePath = "A:\\Experiment_Programs\\VertexSearch"
indexFileName = "moviesIndex.index"

In [22]:
movieIndex = faiss.index_factory(embeddingsArr.shape[1], "Flat",faiss.METRIC_INNER_PRODUCT)
print(movieIndex.is_trained)
%time movieIndex.add(embeddingsArr)
faiss.write_index(movieIndex,os.path.join(writePath,indexFileName))

True
CPU times: total: 0 ns
Wall time: 5.55 ms
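
A `Flat` index with `METRIC_INNER_PRODUCT` compares the query against every stored vector by dot product and returns the largest scores. The plain-Python sketch below mimics that exhaustive search on made-up 2-D vectors and shows that the ranking can change once the vectors are L2-normalized, at which point the inner product equals cosine similarity. The helper names (`dot`, `l2_normalize`, `top_k_inner_product`) are hypothetical, and this is not how FAISS is implemented internally.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def top_k_inner_product(vectors, query, k):
    """Exhaustive search: score every vector, return (score, position) pairs, best first."""
    scored = [(dot(v, query), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

stored = [
    [10.0, 0.0],  # long vector, 45 degrees away from the query
    [1.0, 1.0],   # same direction as the query, but short
    [-1.0, 1.0],  # orthogonal to the query
]
query = [1.0, 1.0]

raw = top_k_inner_product(stored, query, k=2)
normalized = top_k_inner_product(
    [l2_normalize(v) for v in stored], l2_normalize(query), k=2
)

print([i for _, i in raw])         # positions ranked by raw inner product
print([i for _, i in normalized])  # positions ranked by cosine similarity
```

On raw inner product the long vector wins because of its magnitude. After normalization the vector that points in the same direction as the query comes first.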


Writing the movies dataset along with the embeddings into a SQLite Table

In [20]:
sqliteConnection = sqlite3.connect(os.path.join(writePath,'db.movies_embeddings'))
cursor = sqliteConnection.cursor()
cursor.execute('create table movies('+','.join(list(inputDf.columns))+')')
cursor.executemany("insert into movies values(?,?,?)",inputDf.values)
sqliteConnection.commit()
cursor.execute("select count(*) from movies;").fetchall()

[(1000,)]
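
The table above has no explicit key, so looking up a search result depends on rows coming back in insertion order. A minimal sketch of one alternative, assuming an in-memory database and a hypothetical `faiss_id` column that stores each row's position in the FAISS index, so that a result can be fetched with a parameterized query. The descriptions are placeholders.

```python
import sqlite3

# Position in this list plays the role of the FAISS vector id.
rows = [
    ("Toy Story", "placeholder description 0"),
    ("Jumanji", "placeholder description 1"),
    ("Grumpier Old Men", "placeholder description 2"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table movies("
    "faiss_id integer primary key, movie_title text, movie_description text)"
)
conn.executemany(
    "insert into movies values(?,?,?)",
    [(i, title, desc) for i, (title, desc) in enumerate(rows)],
)
conn.commit()

def fetch_movie(connection, faiss_id):
    """Return (title, description) for one index position, or None if absent."""
    # int() converts NumPy integer types into a plain int before binding.
    cur = connection.execute(
        "select movie_title, movie_description from movies where faiss_id = ?",
        (int(faiss_id),),
    )
    return cur.fetchone()

print(fetch_movie(conn, 1))
```

With an explicit key the lookup no longer depends on row order, and a missing id returns `None` instead of raising an index error.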

This is where the search comes in. The code up to this point is the one-time development work: embedding the descriptions, building the index and storing the data. The code that follows is what a backend would run for each query, triggered either directly by a user or through a front-end app.

In [23]:
movieIndex = faiss.read_index(os.path.join(writePath,indexFileName))

Getting the k movies, out of the 1000 indexed, whose descriptions are closest to a custom description

In [38]:
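k = 5
description = "Suggest historical movies"
# FAISS expects a 2-D float32 array of shape (number of queries, dimension)
queryArr = np.array([embed_description(description)], dtype='float32')
nearestNeighborDistances,nearestNeighborIndices = movieIndex.search(queryArr,k)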

In [39]:
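for i in range(k) :
    # FAISS ids are positions in insertion order, so this relies on SQLite returning rows in that same order
    movieRow = cursor.execute(f"select movie_title,movie_description from movies limit 1 offset {nearestNeighborIndices[0][i]};").fetchall()
    print(movieRow[0][0],' - ',movieRow[0][1])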

Honigmond  -  German Comedy
Richard III  -  Shakespeare's Play transplanted into a 1930s setting.
It's My Party  -  A gathering of friends. A gift of love. A celebration of life.
Paris, France  -  A writer has torrid fantasy affairs with young men.
Living in Oblivion  -  Film about filmmaking. It takes place during one day on set of non-budget movie. Ultimate tribute to all independent filmmakers.
