# 🧠 Content-Based Recommendation System - TF-IDF Version

In this notebook, we build a content-based recommendation system using movie descriptions (overviews). We apply **Term Frequency-Inverse Document Frequency (TF-IDF) vectorization** to transform text into numerical vectors, and use cosine similarity to measure content closeness.

> ⚙️ This notebook demonstrates how to use the `ContentBasedRecommender` class 
from the `src/` module to generate recommendations based on movie overviews 
using TF-IDF and cosine similarity.

The class handles preprocessing, vectorization, and similarity calculation internally.

## ⚙️ 1. Setup

We import the required libraries:

- `pandas`, `numpy` for data handling
- `TfidfVectorizer` for converting text to vector form
- `cosine_similarity` to compute similarities between vectors

In [1]:
import sys

sys.path.append("../")

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Custom package for content-based recommendation systems.
# Its recommender class uses TF-IDF vectorization and cosine similarity
# to recommend items based on their content features.
from src.content_based import ContentBasedRecommender

## 2. Load Data

We load the preprocessed dataset `enriched_movies_clean.csv`, which includes:

- Cleaned overview text
- Metadata such as `release_year`, `n_genres`, `overview_length`, etc.

In [2]:
# Load the dataset and display the first few rows
df = pd.read_csv("../data/processed/enriched_movies_clean.csv")
df.head()

| Unnamed: 0 | movieId | genres | overview | popularity | poster_path | release_date | title | tmdb_id | vote_average | release_year | n_genres | overview_length | release_decade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | ['Action', 'Crime', 'Thriller'] | The Public Enemy battle The Gangstas in a Stee... | 0.2205 | /4miPIzrKBaSznoTBv1bS0sZwoMG.jpg | 1995-07-15 | Heat (1995) | 706330 | 0.0 | 1995.0 | 3 | 290 | 1990.0 |
| 1 | 34 | ["Children's", 'Comedy', 'Drama'] | Babe is a little pig who doesn't quite know hi... | 4.5272 | /zKuQMtnbVTz9DsOnOJmlW71v4qH.jpg | 1995-07-18 | Babe (1995) | 9598 | 6.244 | 1995.0 | 3 | 383 | 1990.0 |
| 2 | 50 | ['Crime', 'Thriller'] | Held in an L.A. interrogation room, Verbal Kin... | 7.9936 | /rWbsxdwF9qQzpTPCLmDfVnVqTK1.jpg | 1995-07-19 | Usual Suspects, The (1995) | 629 | 8.175 | 1995.0 | 2 | 409 | 1990.0 |
| 3 | 1 | ['Animation', "Children's", 'Comedy'] | Led by Woody, Andy's toys live happily in his ... | 21.8546 | /uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg | 1995-11-22 | Toy Story (1995) | 862 | 7.968 | 1995.0 | 3 | 303 | 1990.0 |
| 4 | 2 | ['Adventure', "Children's", 'Fantasy'] | When siblings Judy and Peter discover an encha... | 3.1183 | /vgpXmVaVyUL7GGiDeiK1mKEKzcX.jpg | 1995-12-15 | Jumanji (1995) | 8844 | 7.237 | 1995.0 | 3 | 395 | 1990.0 |


## 3. Preprocess Overview Text

To prepare the text for vectorization:

- Fill missing overviews with empty strings
- Convert all text to lowercase to reduce variance from case sensitivity

This is handled by our custom package, which keeps the code modular and lets us reuse functions. We start with:

In [3]:
# Use the ContentBasedRecommender class to fit the model
recommender = ContentBasedRecommender(df)
recommender.fit()

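That single call performs every step described in this notebook, which would otherwise be written by hand:

- Filling missing overviews with `overview.fillna("")`
- Fitting the TF-IDF vectorizer
- Computing the `cosine_similarity` matrix
- Building a pandas `Series` used to look up movie indices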
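The internals of `ContentBasedRecommender` are not shown in this notebook, so the following is only a minimal sketch of what `fit()` plausibly does, assuming it follows the four steps listed above. The function name `fit_content_model`, its parameters, and the toy DataFrame are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def fit_content_model(df, text_col="overview", title_col="title"):
    """Hypothetical stand-in for ContentBasedRecommender.fit()."""
    # 1. Replace missing overviews with empty strings and lowercase the text
    text = df[text_col].fillna("").str.lower()
    # 2. Build the TF-IDF matrix (one row per movie)
    tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(text)
    # 3. Pairwise cosine similarity between all movies
    cosine_sim = cosine_similarity(tfidf_matrix)
    # 4. Map each title to its row position for later lookups
    indices = pd.Series(range(len(df)), index=df[title_col])
    return cosine_sim, indices


# Made-up data: the second movie has no overview
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["Toys come alive", None, "Toys go on an adventure"],
})
cosine_sim, indices = fit_content_model(toy)
print(cosine_sim.shape, indices["C"])
```

A movie with an empty overview becomes an all-zero vector, so its similarity to every other movie is 0 and it is unlikely to be recommended.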

## 4. TF-IDF Vectorization

Using `TfidfVectorizer`:

- We transform each movie overview into a vector
- Stopwords are removed to focus on meaningful content
- The resulting matrix has one row per movie and one column per unique word in the corpus

#### 🧠 What is TF-IDF?

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a statistical method used to evaluate how important a word is to a document in a collection (corpus). The main idea is:

- Words that appear frequently in a document but rarely in others receive a high weight.
- Words that appear in most documents receive a low weight; very common words (like “the”, “and”, etc.) are dropped entirely by the stopword list.

✅ What does this code do?

```python
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(df["overview_clean"])
```

- This converts each movie overview into a numerical vector using the TF-IDF weighting scheme.
- It removes common English stopwords automatically.
- It returns a sparse matrix of shape:
    ```
    (n_movies, m_unique_words_in_corpus)
    ```

Each row represents a movie, and each column represents a unique word across all overviews. The value in the matrix is the TF-IDF score of that word for that movie.

> ✅ Even though movies may have different overviews, all vectors have the same dimension because they are embedded in the same vocabulary space.

## 5. Cosine Similarity

We calculate the **cosine similarity** between every pair of movie vectors.
This gives us a square matrix where:

- Entry (i, j) represents how similar movie *i* is to movie *j*
- Values range from 0 (no shared terms) to 1 (identical direction), because TF-IDF weights are never negative

#### 📏 What is Cosine Similarity?

Cosine similarity is a metric used to measure how similar two vectors are, regardless of their magnitude.

🔸 **Formula**:

$$
\cos{(A, \, B)} = \frac{A \cdot B}{\| A \| \cdot \| B \|}
$$

where

- $A \cdot B$ is the dot product of the two vectors
- $\|A\|$ and $\|B\|$ are their respective norms (magnitudes)

In general the value lies between -1 and 1. TF-IDF vectors have no negative components, so here it ranges from 0 (completely dissimilar) to 1 (identical direction in space). This is especially useful for text, because two documents can be similar even if one is longer than the other.
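The formula can be written directly with NumPy. This is a minimal sketch with made-up vectors; the helper name `cosine` is hypothetical and not part of the notebook's package.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1-D vectors, following the formula above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 0.0])
b = 3 * a                      # same direction, three times longer
c = np.array([0.0, 0.0, 5.0])  # shares no non-zero component with a

print(cosine(a, b))  # same direction: magnitude does not matter
print(cosine(a, c))  # orthogonal vectors
```

Scaling a vector leaves the result unchanged, which is why a long overview and a short one can still score as highly similar.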

✅ What does this code do?

```python
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```

- It computes the pairwise cosine similarity between every pair of movies.
- Returns a square matrix of shape:

    ```
    (n_movies, n_movies)
    ```

- Entry [i, j] in this matrix indicates how similar movie i is to movie j, based on their TF-IDF vectors.

> ✅ This matrix allows us to find the most similar movies to any given movie using only their descriptions.
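The properties of this matrix can be verified on a small example. The three overviews below are made up. Because `TfidfVectorizer` L2-normalizes each row by default, a plain dot product (`linear_kernel`) should give the same matrix, which can be cheaper on large datasets.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews: the first two share the word "toys"
overviews = [
    "toys come alive when humans leave",
    "toys get lost and find their way home",
    "detectives question a suspect about a heist",
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)
cosine_sim = cosine_similarity(tfidf_matrix)  # same as passing the matrix twice

print(cosine_sim.shape)                       # one row and one column per movie
print(np.allclose(cosine_sim, cosine_sim.T))  # symmetric
print(np.allclose(np.diag(cosine_sim), 1.0))  # each movie matches itself
# Rows are already unit length, so the dot product agrees with cosine similarity
print(np.allclose(cosine_sim, linear_kernel(tfidf_matrix)))
```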

## 6. Recommend Function

The class exposes a method, `recommender.get_recommendations(title, top_n)`, that:

1. Locates the index of the selected movie
2. Retrieves similarity scores against all other movies
3. Sorts and selects the top n most similar
4. Returns the corresponding titles

This is a **content-based approach**: it recommends movies with similar descriptions, regardless of user behavior or ratings. The method belongs to the `ContentBasedRecommender` class, an instance of which is stored in the variable `recommender`.
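The four steps above can be sketched as a standalone function. This is not the code in `src/content_based`, which is not shown here; the helper `recommend`, its arguments, and the hand-written similarity matrix are all hypothetical, and titles are assumed to be unique.

```python
import numpy as np
import pandas as pd


def recommend(title, cosine_sim, indices, titles, top_n=5):
    """Hypothetical helper mirroring the four steps listed above."""
    idx = indices[title]              # 1. row of the selected movie
    scores = cosine_sim[idx]          # 2. similarity to every movie
    order = np.argsort(scores)[::-1]  # 3. most similar first
    order = [i for i in order if i != idx][:top_n]  # drop the movie itself
    return [titles[i] for i in order]  # 4. corresponding titles


# Hand-written similarity matrix for three made-up movies
titles = ["A", "B", "C"]
indices = pd.Series(range(len(titles)), index=titles)
cosine_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])

print(recommend("A", cosine_sim, indices, titles, top_n=2))  # ['C', 'B']
```

Excluding the query movie matters because its similarity to itself is always 1, so it would otherwise appear first in every result.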

#### 🔍 Example: Toy Story (1995)

Calling the method with "Toy Story (1995)" returns:

In [6]:
movie_title = "Toy Story (1995)"
recommender.get_recommendations(movie_title, top_n=5)

['Toy Story 2 (1999)',
 'Rebel Without a Cause (1955)',
 'Condorman (1981)',
 'Malice (1993)',
 'Man on the Moon (1999)']

We can see that “**Toy Story 2**” ranks first, which is expected given its direct thematic and narrative connection.

In [7]:
results = recommender.get_recommendations("Toy Story (1995)", top_n=5)
for i, title in enumerate(results, 1):
    print(f"{i}. {title}")

1. Toy Story 2 (1999)
2. Rebel Without a Cause (1955)
3. Condorman (1981)
4. Malice (1993)
5. Man on the Moon (1999)
