# IMDB Recommender System

**TF-IDF applies both when [documents are treated as dimensions](words-as-vectors-document-dimensions)
and when working with the [term-document matrix](term-document-matrix).**

In this section, we will use TF-IDF and cosine similarity to build a recommender system for movies.
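
Before moving to the real data, the idea can be sketched on a toy corpus. The three overviews below are made up for illustration and are not taken from the dataset; the variable names are likewise only placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus (invented for illustration, not from the dataset).
toy_overviews = [
    "A masked vigilante protects the city from crime.",
    "A vigilante hunts crime bosses across the city.",
    "Two friends open a bakery in a small village.",
]

# Each overview becomes one row of TF-IDF weights over the vocabulary.
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_overviews)  # shape: (3, n_terms)

# Pairwise cosine similarity between all overviews, shape: (3, 3).
toy_sim = cosine_similarity(toy_tfidf)

# The two vigilante plots share terms, so they are more similar to each
# other than either is to the bakery plot.
print(toy_sim.round(2))
```

A recommender built this way simply returns, for a given movie, the rows with the highest similarity scores.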

In [64]:
from collections import OrderedDict

import numpy as np
import pandas as pd
from datasets import load_dataset
from rich.pretty import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from datetime import datetime

In [65]:
# Load the movie dataset (titles, overviews, genres, votes) from the Hugging Face Hub
dataset = load_dataset("SandipPalit/Movie_Dataset")


In [66]:
YEAR = 2000

In [67]:
# Keep only movies released strictly after YEAR
dataset_cutoff = dataset.filter(
    lambda example: datetime.strptime(example["Release Date"], "%Y-%m-%d").year > YEAR
)
dataset_cutoff

DatasetDict({
    train: Dataset({
        features: ['Release Date', 'Title', 'Overview', 'Genre', 'Vote Average', 'Vote Count'],
        num_rows: 46064
    })
})

In [68]:
train_dataset = dataset_cutoff["train"]
print(f"Number of training examples: {len(train_dataset)}")

X_train = train_dataset["Overview"]
X_train = np.array(X_train)

Number of training examples: 46064


In [69]:
X_train

array(["Japan's number one extreme reality show is having it's first all-American special! Six lucky contestants, chosen from thousands of applicants, will have the chance to win millions of dollars, and all they have to do is stay alive!",
       'This documentary examines a selection of real life serial killers and compares them to the fictional Hannibal Lecter.',
       'An undercover cop lets his job get personal while on an underground assignment.',
       ...,
       "A young artist, imprisoned within the trammels of opium addiction, relives his phantasmagorical courtship with a mysterious Soprano known only as Ligeia. While exploring these opium induced memories, the lines dividing reality and fantasy, memory and hallucination, begin to blur into an indecipherable labyrinth of obsession and despair. Though he is nursed back to health by a new love, the Lady Rowena, the haunting of Ligeia's memories as well as the spirits of Azrael and Nehushtan, continue to plague his mind and s

This dataset carries no class labels, and since we are not doing any classification we do not need any:
only the `Overview` text of each movie is used.

We will be less pedantic here and not split off a validation set.

In [70]:
df_train = train_dataset.to_pandas()
df_train

Unnamed: 0,Release Date,Title,Overview,Genre,Vote Average,Vote Count
0,2001-01-01,Slashers,Japan's number one extreme reality show is hav...,"['Horror', 'Thriller']",5.5,48
1,2001-01-01,Serial Killers: The Real Life Hannibal Lecters,This documentary examines a selection of real ...,['Documentary'],7.0,13
2,2001-01-08,The Proposal,An undercover cop lets his job get personal wh...,"['Drama', 'Thriller']",6.7,10
3,2001-01-18,Super Troopers,"Five bored, occasionally high and always ineff...","['Comedy', 'Crime', 'Mystery']",6.6,856
4,2001-01-22,Enigma,The story of the WWII project to crack the cod...,"['Mystery', 'Drama', 'Thriller', 'Romance', 'W...",6.4,222
...,...,...,...,...,...,...
46059,2022-09-30,The Good House,"Hildy Good, a wry New England realtor and desc...","['Drama', 'Comedy']",0.0,0
46060,2022-10-13,Czyściec,"From the earliest times, people have wondered ...",['Drama'],0.0,0
46061,2022-10-31,Edgar Allen Poe's Ligeia,"A young artist, imprisoned within the trammels...","['Horror', 'Romance']",0.0,0
46062,2022-11-03,Заступница,In the center of the plot is the Vatican list ...,['Documentary'],0.0,0


In [71]:
df_train[df_train["Title"].str.contains("Batman")]

Unnamed: 0,Release Date,Title,Overview,Genre,Vote Average,Vote Count
981,2008-07-08,Batman: Gotham Knight,A collection of key events mark Bruce Wayne's ...,"['Science Fiction', 'Animation', 'Action', 'Ad...",6.7,516
5417,2012-08-19,The Batman Shootings,The premiere of The Dark Knight Rises was the ...,"['Documentary', 'History']",6.4,4
5427,2012-08-21,"Batman: The Dark Knight Returns, Part 1",Batman has not been seen for ten years. A new ...,"['Science Fiction', 'Action', 'Animation', 'My...",7.8,1255
7336,2013-01-03,"Batman: The Dark Knight Returns, Part 2",Batman has stopped the reign of terror that Th...,"['Science Fiction', 'Action', 'Animation', 'My...",8.0,1156
12880,2014-04-09,Batman: Strange Days,"Celebrating Batman’s 75th anniversary, DC Ente...","['Action', 'Animation', 'TV Movie']",7.1,78
13026,2014-04-19,Batman Beyond,This short celebrating 75 years of Batman from...,"['Action', 'Animation', 'Science Fiction', 'TV...",7.2,45
13310,2014-05-13,Son of Batman,"Batman learns he has a violent, unruly pre-tee...","['Animation', 'Action', 'Adventure']",7.0,847
14239,2014-08-12,Batman: Assault on Arkham,Batman works desperately to find a bomb plante...,"['Thriller', 'Animation', 'Action', 'Crime']",7.4,927
15497,2014-11-10,Batman: The Birth of the Modern Blockbuster,"A documentary exploring the impact of ""Batman""...",['Documentary'],0.0,0
15512,2014-11-11,Hanging with Batman,This elegantly constructed portrait of Batman ...,['Documentary'],0.0,0


In [72]:

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english")

# Generate the tf-idf vectors for the corpus
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
print(X_train_tfidf.shape)

(46064, 73634)


In [73]:
# Slice the sparse matrix first; densifying all 46064 x 73634 entries would need far more memory
print(X_train_tfidf[0:3].toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


The shape `(46064, 73634)` means we have $D = 46{,}064$ documents and $T = 73{,}634$ unique words.

In [74]:
len(tfidf_vectorizer.vocabulary_)

73634

Scikit-learn's `cosine_similarity` documents its inputs and output as follows (excerpt from its docstring):

```python
Input:
    X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
        Input data.

    Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features), default=None
        Input data. If None, the output will be the pairwise
        similarities between all samples in X.
...

Returns:
    kernel matrix : ndarray of shape (n_samples_X, n_samples_Y)
        Returns the cosine similarity between samples in X and Y.
```

In [75]:
%%time
# Compute the pairwise cosine similarity matrix and print its shape
cosine_sim = cosine_similarity(X_train_tfidf, X_train_tfidf)
print(cosine_sim.shape)

(46064, 46064)
CPU times: user 18.9 s, sys: 21.6 s, total: 40.5 s
Wall time: 57.9 s


In [76]:
%%time
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# (linear kernel) yields the same values as cosine similarity
cosine_sim_linear_kernel = linear_kernel(X_train_tfidf, X_train_tfidf)
print(cosine_sim_linear_kernel.shape)

(46064, 46064)
CPU times: user 20.2 s, sys: 22.2 s, total: 42.4 s
Wall time: 1min 14s
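
The two cells above produce the same matrix because `TfidfVectorizer` uses `norm="l2"` by default, which makes every non-empty row a unit vector. A small check on a made-up corpus (the overviews are placeholders, not dataset rows) could look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus (invented for illustration).
toy_overviews = [
    "A masked vigilante protects the city from crime.",
    "A vigilante hunts crime bosses across the city.",
    "Two friends open a bakery in a small village.",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# With the default norm="l2", every non-empty row has Euclidean length 1.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)

# For unit-length rows the dot product already is the cosine similarity.
same = np.allclose(cosine_similarity(toy_tfidf), linear_kernel(toy_tfidf))
print(same)
```

If the vectorizer were created with `norm=None`, the rows would no longer be unit length and `linear_kernel` would return raw dot products instead.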


How should we interpret the cosine similarity matrix computed from the TF-IDF matrix?

First, `cosine_similarity` expects its input to have shape `(n_samples, n_features)`.
In our case, `n_samples` is the number of documents and `n_features` is the number of unique words.

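It returns a matrix of shape `(n_samples, n_samples)`. The value at the $i$-th row and $j$-th column is
the cosine similarity between the $i$-th document and the $j$-th document. Writing $\mathbf{x}_i$ for the
TF-IDF row vector of the $i$-th document (notation introduced here), it is given by:

$$
\text{cosine similarity}_{i, j} = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert}
$$

Consequently, the matrix is symmetric and its diagonal is $1$, since the cosine similarity between a (non-empty) document and itself is $1$.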
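
One caveat on the diagonal: a document whose TF-IDF vector is all zeros (for example an empty overview, or one made up only of stop words) has no direction, and scikit-learn reports its similarity to everything, including itself, as $0$. The sketch below uses a made-up corpus whose last entry stands in for a movie with no overview.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the empty string mimics a missing overview.
toy_overviews = [
    "A masked vigilante protects the city from crime.",
    "A vigilante hunts crime bosses across the city.",
    "",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)
toy_sim = cosine_similarity(toy_tfidf)

# Symmetry: entry (i, j) equals entry (j, i).
print(np.allclose(toy_sim, toy_sim.T))

# Self-similarity is 1 for non-empty documents and 0 for the empty one.
print(np.diag(toy_sim))
```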

In [79]:
def recommend_movies(title, cosine_sim, top_k=10):
    """Return the top_k rows of df_train most similar to the movie with this title.

    The queried movie itself ranks first, since its self-similarity is the maximum.
    """
    # Get the index of the first movie that matches the title
    # (df_train has a default RangeIndex, so labels and positions coincide)
    idx = df_train[df_train["Title"] == title].index[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top_k highest-scoring movies
    sim_scores = sim_scores[0:top_k]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top_k most similar movies
    return df_train.iloc[movie_indices]

In [80]:
recommend_movies("Batman: The Dark Knight Returns, Part 1", cosine_sim_linear_kernel)

Unnamed: 0,Release Date,Title,Overview,Genre,Vote Average,Vote Count
5427,2012-08-21,"Batman: The Dark Knight Returns, Part 1",Batman has not been seen for ten years. A new ...,"['Science Fiction', 'Action', 'Animation', 'My...",7.8,1255
23394,2016-08-04,Batman: Bad Blood,Bruce Wayne is missing. Alfred covers for him ...,"['Science Fiction', 'Action', 'Animation']",7.2,604
20422,2015-11-27,Red Hood: The Fallen,"Following the Death of Batman, a new vigilante...",['Action'],6.5,2
35928,2019-07-19,Batman: Hush,A mysterious new villain known only as Hush us...,"['Science Fiction', 'Crime', 'Animation', 'Mys...",7.3,657
10068,2013-09-16,The Dark Knight Legacy,A fan film imagining the world after Batman's ...,"['Action', 'Crime', 'Drama']",2.0,1
26940,2017-05-06,Batman & Bill,"Everyone thinks that Bob Kane created Batman, ...",['Documentary'],7.1,61
43614,2021-07-26,"Batman: The Long Halloween, Part Two","As Gotham City's young vigilante, the Batman, ...","['Animation', 'Mystery', 'Action', 'Crime']",7.6,338
35961,2019-07-21,Lego DC Batman: Family Matters,"Suspicion is on high after Batman, Batgirl, Ro...","['Animation', 'Family', 'Action', 'Comedy']",7.1,108
981,2008-07-08,Batman: Gotham Knight,A collection of key events mark Bruce Wayne's ...,"['Science Fiction', 'Animation', 'Action', 'Ad...",6.7,516
7336,2013-01-03,"Batman: The Dark Knight Returns, Part 2",Batman has stopped the reign of terror that Th...,"['Science Fiction', 'Action', 'Animation', 'My...",8.0,1156
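
The full `(46064, 46064)` similarity matrix takes roughly 17 GB as `float64`, yet a single recommendation only needs one row of it. As a possible alternative, the sketch below computes just that row on demand. `recommend_from_tfidf` is a hypothetical helper, not part of the notebook above; it relies on the L2-normalized TF-IDF rows and, unlike `recommend_movies`, drops the queried movie from the results. The titles and overviews are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend_from_tfidf(title, df, tfidf_matrix, top_k=10):
    """Rank movies by similarity to `title` using a single similarity row."""
    # Positional index of the first movie whose title matches
    idx = np.flatnonzero((df["Title"] == title).to_numpy())[0]

    # Similarity of the query row against every row, shape: (n_movies,)
    scores = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()

    # Exclude the queried movie itself from the ranking
    scores[idx] = -np.inf

    # Positions of the top_k highest scores, best first
    top = np.argsort(scores)[::-1][:top_k]
    return df.iloc[top].assign(Similarity=scores[top])


# Hypothetical toy data (invented for illustration).
toy_df = pd.DataFrame(
    {
        "Title": ["Night Vigilante", "City of Crime", "Village Bakery", "Bread and Friends"],
        "Overview": [
            "A masked vigilante protects the city from crime.",
            "A vigilante hunts crime bosses across the city.",
            "Two friends open a bakery in a small village.",
            "A baker teaches village children to make bread.",
        ],
    }
)
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["Overview"])

result = recommend_from_tfidf("Night Vigilante", toy_df, toy_matrix, top_k=2)
print(result[["Title", "Similarity"]])
```

On the real data this would be called as `recommend_from_tfidf(title, df_train, X_train_tfidf)`, trading the one-off matrix computation for a small matrix-vector product per query.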
