# Movie Recommender System

**TF-IDF is used in [the case where documents are the dimensions](words-as-vectors-document-dimensions)
and [the case of the term-document matrix](term-document-matrix).**

In this section, we will use TF-IDF and cosine similarity to build a recommender system for movies. 

In [4]:
from collections import OrderedDict

import numpy as np
import pandas as pd
from datasets import load_dataset
from rich.pretty import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from datetime import datetime

Let's load the data and take a look at it.

In [5]:
# Load the movie dataset (titles, overviews, genres, votes) from the Hugging Face Hub
dataset = load_dataset("SandipPalit/Movie_Dataset")


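Let's keep only the movies released after the year 2000, that is, from 2001 onwards (the filter uses a strict `>` comparison against `YEAR = 2000`).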

In [6]:
YEAR = 2000

dataset_cutoff = dataset.filter(lambda example: datetime.strptime(example["Release Date"], "%Y-%m-%d").year > YEAR)
dataset_cutoff



DatasetDict({
    train: Dataset({
        features: ['Release Date', 'Title', 'Overview', 'Genre', 'Vote Average', 'Vote Count'],
        num_rows: 46064
    })
})
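To make the cutoff logic concrete, here is a minimal sketch of the same predicate applied to a few made-up records. The `toy_records` list and the `released_after` helper are hypothetical and only mirror the lambda passed to `filter` above; they are not part of the dataset or of the `datasets` API.

```python
from datetime import datetime

YEAR = 2000

# Hypothetical records that mimic the "Release Date" field of the dataset.
toy_records = [
    {"Title": "Old Film", "Release Date": "1999-12-31"},
    {"Title": "Millennium Film", "Release Date": "2000-06-15"},
    {"Title": "New Film", "Release Date": "2001-01-01"},
]


def released_after(example: dict, year: int = YEAR) -> bool:
    """Return True if the record's release year is strictly greater than `year`."""
    return datetime.strptime(example["Release Date"], "%Y-%m-%d").year > year


kept = [record["Title"] for record in toy_records if released_after(record)]
print(kept)  # ['New Film']
```

Because the comparison is strict, a movie released during 2000 itself is dropped; switching to `>=` would keep it.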

The result is a `DatasetDict`, so we index it with the key `train` to obtain our `train_dataset`.

We then convert `train_dataset` to a pandas DataFrame and take a look at it.

In [7]:
train_dataset = dataset_cutoff["train"]
print(f"Number of training examples: {len(train_dataset)}")

df_train = train_dataset.to_pandas()
df_train.head()

Number of training examples: 46064


Unnamed: 0,Release Date,Title,Overview,Genre,Vote Average,Vote Count
0,2001-01-01,Slashers,Japan's number one extreme reality show is hav...,"['Horror', 'Thriller']",5.5,48
1,2001-01-01,Serial Killers: The Real Life Hannibal Lecters,This documentary examines a selection of real ...,['Documentary'],7.0,13
2,2001-01-08,The Proposal,An undercover cop lets his job get personal wh...,"['Drama', 'Thriller']",6.7,10
3,2001-01-18,Super Troopers,"Five bored, occasionally high and always ineff...","['Comedy', 'Crime', 'Mystery']",6.6,856
4,2001-01-22,Enigma,The story of the WWII project to crack the cod...,"['Mystery', 'Drama', 'Thriller', 'Romance', 'W...",6.4,222


We are interested in the `Overview` column, which contains a short plot description of each movie.

We define `X_train` to be the array containing all the overviews (the `Overview` column).

To keep things simple, we will not split off a validation set.

In [8]:
X_train = train_dataset["Overview"]
X_train = np.array(X_train)

We will use the `TfidfVectorizer` from `sklearn` to convert the text to a matrix of TF-IDF features.
This process can be treated as a **feature extraction** step.

In [10]:
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english")

# Generate the tf-idf vectors for the corpus
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
print(X_train_tfidf.shape)

(46064, 73634)


The shape tells us we have $D = 46,064$ documents and $T = 73,634$ unique words (the vocabulary that remains after removing English stop words).

In [17]:
len(tfidf_vectorizer.vocabulary_)

73634
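To see what the vectorizer produces on a corpus small enough to inspect by hand, here is a minimal sketch. The three-sentence `toy_corpus` is made up for illustration and is not taken from the movie dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not from the movie dataset).
toy_corpus = [
    "A detective hunts a killer in the city.",
    "A killer robot hunts a detective.",
    "Two friends open a bakery in Paris.",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

# Rows are documents, columns are vocabulary terms.
print(toy_tfidf.shape)

# The learned vocabulary maps each term to its column index.
print(sorted(toy_vectorizer.vocabulary_))

# Dense view of the first document's TF-IDF weights.
print(toy_tfidf[0:1].toarray())
```

The number of columns always equals `len(toy_vectorizer.vocabulary_)`, which is the same relationship we observed between the matrix shape and the vocabulary size on the full dataset.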

We will use the `cosine_similarity` function from `sklearn.metrics.pairwise` to compute the cosine similarity between all movies.

This means computing the cosine similarity between each document and all other documents in the corpus.

Note that `cosine_similarity` takes in a matrix of `n_samples` by `n_features` and returns a matrix of `n_samples` by `n_samples`.
So in our example, the documents should correspond to the rows and the features should correspond to the columns.

In [13]:
%%time
# compute the cosine similarity matrix and print its shape
cosine_sim = cosine_similarity(X_train_tfidf, X_train_tfidf)
print(cosine_sim.shape)

(46064, 46064)
CPU times: user 20.7 s, sys: 23 s, total: 43.6 s
Wall time: 1min 3s


In [14]:
%%time
# compute the same matrix with linear_kernel (a plain dot product) and print its shape
cosine_sim_linear_kernel = linear_kernel(X_train_tfidf, X_train_tfidf)
print(cosine_sim_linear_kernel.shape)

(46064, 46064)
CPU times: user 22.3 s, sys: 23.2 s, total: 45.5 s
Wall time: 1min 8s


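Because `TfidfVectorizer` L2-normalizes every row by default (`norm="l2"`), the plain dot product computed by `linear_kernel` already equals the cosine similarity, so the two functions produce the same matrix here.
`linear_kernel` skips the normalization step that `cosine_similarity` performs, which can make it faster on large, sparse TF-IDF matrices, although in the timings above the two took about the same time.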
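The agreement between the two functions can be checked on a toy corpus. The sketch below uses a made-up `docs` list and relies on the default `norm="l2"` setting of `TfidfVectorizer`; with a different `norm` the two results would generally differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy documents, chosen so that no row ends up empty.
docs = [
    "batman protects gotham city",
    "a vigilante protects the city at night",
    "two friends open a bakery",
]

# norm="l2" is the default, so every row has unit Euclidean length.
tfidf = TfidfVectorizer().fit_transform(docs)

# Row norms computed directly from the sparse matrix.
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# With unit-length rows, the dot product is the cosine similarity.
same = np.allclose(cosine_similarity(tfidf), linear_kernel(tfidf))
print(same)
```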

Next, how should we interpret the cosine similarity matrix computed from the TF-IDF matrix?

As mentioned earlier, `cosine_similarity` expects its input to have shape `n_samples` by `n_features`,
which here correspond to the number of documents and the number of unique words respectively.

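It returns a matrix of shape `n_samples` by `n_samples`. The value at the $d$-th row and $d'$-th column is
the cosine similarity between the $d$-th document and the $d'$-th document:

$$
\text{cosine similarity}_{d, d'} = \frac{\mathbf{x}_d \cdot \mathbf{x}_{d'}}{\lVert \mathbf{x}_d \rVert \, \lVert \mathbf{x}_{d'} \rVert}
$$

where $\mathbf{x}_d$ denotes the TF-IDF vector (row) of the $d$-th document.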

Consequently, every diagonal entry is $1$, since the cosine similarity between a document and itself is $1$ (provided its TF-IDF vector is non-zero).
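As a sanity check on this interpretation, the following NumPy sketch computes the pairwise cosine similarities by hand for a tiny dense matrix. The helper `cosine_similarity_matrix` and the array `X_toy` are hypothetical illustrations, not the scikit-learn implementation; the all-zero last row stands in for a document whose overview contains no vocabulary terms.

```python
import numpy as np


def cosine_similarity_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a dense matrix."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid dividing by zero: all-zero rows stay all-zero after scaling.
    safe_norms = np.where(norms == 0.0, 1.0, norms)
    X_unit = X / safe_norms
    return X_unit @ X_unit.T


# Rows are "documents", columns are "terms".
X_toy = np.array(
    [
        [1.0, 0.0, 2.0],  # document 0
        [2.0, 0.0, 4.0],  # document 1: same direction as document 0
        [0.0, 3.0, 0.0],  # document 2: shares no terms with documents 0 and 1
        [0.0, 0.0, 0.0],  # document 3: empty vector
    ]
)

S = cosine_similarity_matrix(X_toy)
print(np.round(S, 3))
```

Documents 0 and 1 point in the same direction and score $1$, document 2 shares no terms with them and scores $0$, and the empty document scores $0$ against everything, including itself.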

The `recommender` function below is adapted from [here](https://goodboychan.github.io/python/datacamp/natural_language_processing/2020/07/17/04-TF-IDF-and-similarity-scores.html).

In [40]:
def recommender(
    title: str, df: pd.DataFrame, cosine_similarity: np.ndarray, top_k: int = 10
) -> pd.DataFrame:
    """Recommends movies based on the cosine similarity matrix.

    Args:
        title (str): Title of the movie.
        df (pd.DataFrame): DataFrame containing the movie dataset.
        cosine_similarity (np.ndarray): Cosine similarity matrix.
        top_k (int, optional): Number of top recommendations to return.
            Defaults to 10.

    Returns:
        pd.DataFrame: DataFrame containing the top-k recommendations
    """
    # Get the index of the movie that matches the title
    idx = df[df["Title"] == title].index[0]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = cosine_similarity[idx]
    sim_scores = list(enumerate(sim_scores))
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the top-k most similar movies
    top_k_sim_scores = sim_scores[1 : top_k + 1]
    print(f"Top-k most similar movies: {top_k_sim_scores}")

    # Get the indices of the top-k most similar movies
    movie_indices = [i[0] for i in top_k_sim_scores]

    # Return the top-k most similar movies
    return df.iloc[movie_indices]


In [41]:
recommender(title="Batman: The Dark Knight Returns, Part 1", df=df_train, cosine_similarity=cosine_sim_linear_kernel)

Top-k most similar movies: [(23394, 0.29454056287554864), (20422, 0.2233061848576633), (35928, 0.22283671049064951), (10068, 0.21846897815198668), (26940, 0.19796639860021925), (43614, 0.19603520822205336), (35961, 0.18803326815435364), (981, 0.1660633919733712), (7336, 0.16418589005672793), (10152, 0.16021138874781266)]


Unnamed: 0,Release Date,Title,Overview,Genre,Vote Average,Vote Count
23394,2016-08-04,Batman: Bad Blood,Bruce Wayne is missing. Alfred covers for him ...,"['Science Fiction', 'Action', 'Animation']",7.2,604
20422,2015-11-27,Red Hood: The Fallen,"Following the Death of Batman, a new vigilante...",['Action'],6.5,2
35928,2019-07-19,Batman: Hush,A mysterious new villain known only as Hush us...,"['Science Fiction', 'Crime', 'Animation', 'Mys...",7.3,657
10068,2013-09-16,The Dark Knight Legacy,A fan film imagining the world after Batman's ...,"['Action', 'Crime', 'Drama']",2.0,1
...,...,...,...,...,...,...


With just TF-IDF and the cosine similarity metric, we can already
build a simple, if somewhat naive, content-based recommender system.
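One practical caveat is memory: a dense 46,064 by 46,064 similarity matrix holds about 2.1 billion entries, roughly 17 GB if stored as 64-bit floats. A possible alternative, sketched below under the assumption that recommendations are requested one title at a time, is to compute only the similarity row for the queried movie. The function `recommend_on_demand` and the four-movie `toy_df` are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend_on_demand(title, df, tfidf_matrix, top_k=10):
    """Return the top-k most similar movies without a full similarity matrix."""
    # Positional index of the first movie whose title matches.
    matches = np.flatnonzero(df["Title"].to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = int(matches[0])

    # Similarity of the query movie to every movie: shape (n_movies,).
    # Rows are L2-normalized by TfidfVectorizer, so the dot product is the cosine.
    scores = linear_kernel(tfidf_matrix[idx : idx + 1], tfidf_matrix).ravel()
    scores[idx] = -np.inf  # never recommend the query movie itself

    top_k = min(top_k, len(scores) - 1)
    order = np.argsort(scores)[::-1][:top_k]
    return df.iloc[order].assign(similarity=scores[order])


# Hypothetical toy data with the same column names as the movie dataset.
toy_df = pd.DataFrame(
    {
        "Title": ["Gotham Nights", "Gotham Shadows", "Bakery in Paris", "Space Cargo"],
        "Overview": [
            "A masked vigilante protects Gotham city from crime.",
            "A vigilante detective fights crime in Gotham.",
            "Two friends open a small bakery in Paris.",
            "A cargo pilot is stranded on a distant planet.",
        ],
    }
)
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["Overview"])

result = recommend_on_demand("Gotham Nights", toy_df, toy_matrix, top_k=1)
print(result[["Title", "similarity"]])
```

Each query then costs one sparse matrix-vector product instead of a precomputed `n_samples` by `n_samples` matrix, at the price of recomputing the row on every call.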

In [None]:
df_train[df_train["Title"].str.contains("Batman")]

## References and Further Readings

- [Goodboychan: TF-IDF and similarity scores](https://goodboychan.github.io/python/datacamp/natural_language_processing/2020/07/17/04-TF-IDF-and-similarity-scores.html)