# Content-Based Filtering: Comprehensive Analysis

This notebook presents a complete analysis of content-based filtering recommendation systems across multiple datasets. It combines theoretical foundations with practical implementations and comparative analysis.

---

## 1. Introduction to Content-Based Filtering

Content-based filtering is a recommendation technique that analyzes the **features** (attributes) of items to suggest similar items to users. Unlike collaborative filtering, which relies on user behavior patterns, content-based filtering focuses on the **intrinsic characteristics** of the items themselves.

<img src="../images/cb_schema.png" alt="Content-Based Filtering Schema" width="450">


### Key Principle:
*"If a user liked item A, they will likely enjoy item B if B has similar features to A."*

### How It Works:
1. **Feature Extraction**: Extract relevant features from items (e.g., director, cast, genres, country)
2. **Profile Creation**: Create item profiles based on these features
3. **Similarity Calculation**: Compute similarity between items using mathematical metrics
4. **Recommendation Generation**: Suggest items most similar to those the user has enjoyed

### Advantages:
- ✅ **No Cold Start for New Items**: Can recommend new items immediately if their features are known
- ✅ **Transparency**: Easy to explain why an item was recommended
- ✅ **User Independence**: Doesn't require data from other users
- ✅ **Niche Recommendations**: Can recommend unpopular items with similar features

### Disadvantages:
- ❌ **Limited Diversity**: Tends to recommend similar items (filter bubble effect)
- ❌ **Feature Engineering Required**: Needs rich metadata about items
- ❌ **Cold Start for New Users**: Requires user history to make personalized recommendations
- ❌ **Overspecialization**: May not discover items outside user's known preferences

---

## 2. System Architecture

This diagram illustrates the general flow of content-based filtering: from item features to user profile creation, similarity computation, and final recommendations.

<img src="../images/cb_system_architecture.png" alt="System Architecture" width="1200">

Our implementation consists of four main components:

1. **Preprocessing**: Clean and prepare movie metadata
2. **TF-IDF Vectorization**: Convert text features into numerical vectors
3. **Cosine Similarity**: Calculate similarity between all movie pairs
4. **Recommendation Engine**: Generate top-N recommendations for any given movie

---

## 3. Technical Approach: TF-IDF and Cosine Similarity

### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection of documents.

**Term Frequency (TF)**: How often a term appears in a document

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where:
- $n_{i,j}$ = number of occurrences of term $t_i$ in document $d_j$
- $\sum_k n_{k,j}$ = total number of terms in document $d_j$

**Inverse Document Frequency (IDF)**: How rare or common a term is across all documents

$$IDF_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

where:
- $|D|$ = total number of documents
- $|\{d : t_i \in d\}|$ = number of documents containing term $t_i$

**TF-IDF Score**:

$$TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_i$$

**Interpretation**:
- High TF-IDF = term is frequent in this document but rare in others → **distinctive feature**
- Low TF-IDF = term is either rare in this document or common everywhere → **less informative**
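The formulas above can be checked by hand on a tiny example. The following is a minimal sketch on a hypothetical three-document corpus (the document names and tokens are invented for illustration). It follows the textbook definitions given here; scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each "document" is one item's metadata tokens.
docs = {
    "movie_a": ["nolan", "scifi", "scifi", "thriller"],
    "movie_b": ["nolan", "drama"],
    "movie_c": ["spielberg", "drama"],
}

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log(|D| / number of documents containing the term)."""
    containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_id, corpus):
    """TF-IDF score of a term in one document of the corpus."""
    return tf(term, corpus[doc_id]) * idf(term, corpus)

scifi_score = tf_idf("scifi", "movie_a", docs)  # frequent here, absent elsewhere
nolan_score = tf_idf("nolan", "movie_a", docs)  # appears in two of three documents
print(round(scifi_score, 4), round(nolan_score, 4))  # 0.5493 0.1014
```

The term that is frequent in one document and absent from the others receives the higher score, which matches the interpretation above.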

### Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space.

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$

where:
- $A, B$ = TF-IDF vectors for two movies
- $n$ = number of dimensions (unique terms)

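**Interpretation**:
- **1.0** = identical feature vectors (the two movies have the same weighted terms)
- **0.0** = no shared terms (completely different features)
- **Values in between** = partial overlap; the closer to 1.0, the more similar the movies

Because TF-IDF weights are never negative, the similarity between two movies always falls between 0 and 1.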
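As a quick illustration of the formula, here is a minimal pure-Python sketch. The vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen for this example rather than part of the definition.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # close to 1.0
no_overlap = cosine_similarity([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])      # 0.0
partial = cosine_similarity([1.0, 1.0, 0.0], [1.0, 0.0, 1.0])         # close to 0.5
print(same_direction, no_overlap, partial)
```

Note that the first pair scores 1.0 even though the vectors have different lengths: cosine similarity depends only on direction, not magnitude.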

### Why Remove Spaces from Features?

In our preprocessing, we remove spaces from name-like categorical features (director and cast). This matters because TF-IDF works on individual tokens:

**Without removing spaces**:
- "Christopher Nolan" → treated as 2 separate words: "christopher" and "nolan"
- Problem: "christopher" appears in both "Christopher Nolan" and "Christopher Lee" → false similarity

**With removing spaces**:
- "Christopher Nolan" → `"christophernolan"` → **one unique token**
- Benefit: TF-IDF treats "Christopher Nolan" as **one atomic feature**, not two separate words

**Impact**:
- More precise director/cast matching (Netflix)
- Fewer spurious matches caused by shared first or last names

This is a simple tokenization trick for keeping multi-word names intact. It is not named entity recognition: the names come from structured columns, so nothing has to be detected in free text.
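The effect can be reproduced without any dataset. The sketch below uses the regular expression that `TfidfVectorizer` applies as its default `token_pattern` (runs of two or more word characters); the helper names `tokenize` and `squash` are hypothetical and exist only for this illustration.

```python
import re

# Default token_pattern of TfidfVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its set of tokens."""
    return set(TOKEN_PATTERN.findall(text.lower()))

def squash(name):
    """Lowercase a name and remove its spaces."""
    return name.strip().lower().replace(" ", "")

with_spaces = tokenize("Christopher Nolan") & tokenize("Christopher Lee")
without_spaces = tokenize(squash("Christopher Nolan")) & tokenize(squash("Christopher Lee"))

print(with_spaces)     # {'christopher'}
print(without_spaces)  # set()
```

With spaces, the two directors share a token and would contribute a non-zero similarity; without spaces they share nothing.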

---

In [None]:
# Import required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import pycountry_convert as pc
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


---

# 4. Netflix Movies and TV Shows Dataset

## Dataset Description

We use the **Netflix Movies and TV Shows** dataset, which contains metadata about over 8,000 titles available on Netflix.

**Dataset Features**:
- `title`: Movie/show name
- `director`: Director(s)
- `cast`: Main actors
- `country`: Country of production
- `listed_in`: Genre categories
- `type`: Movie or TV Show
- `description`: Plot summary

**Approach**: We focus on **structured metadata** (director, cast, genres, continent) rather than textual descriptions. This emphasizes production characteristics and categorical features.

In [2]:
# Load the Netflix dataset
df_netflix = pd.read_csv('../datasets/Netflix/netflix/netflix_titles.csv')

print(f"Netflix Dataset Shape: {df_netflix.shape}")
print(f"Number of titles: {len(df_netflix)}")
df_netflix.head()

Netflix Dataset Shape: (8807, 12)
Number of titles: 8807


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


## Data Preprocessing

### Clean Special Characters

Replace punctuation characters in descriptions with spaces so that they do not stick to neighboring words.

In [3]:
chars_to_replace = ['-', '.', ':', ',', '"']

def replace_chars(text):
    """Replace special characters with spaces"""
    for char in chars_to_replace:
        text = text.replace(char, ' ')
    return text

# Fill missing descriptions first so replace_chars always receives a string
df_netflix['description'] = df_netflix['description'].fillna('').apply(replace_chars)

### Handle Missing Values

Fill missing values with empty strings to avoid errors during processing.  
For cast, we limit to top 3 actors to reduce noise.

In [4]:
# Fill missing values
df_netflix['director'] = df_netflix['director'].fillna('')
df_netflix['cast'] = df_netflix['cast'].fillna('')
df_netflix['cast'] = df_netflix['cast'].apply(lambda x: ' '.join(x.split(', ')[:3]))  # Keep only top 3 actors
df_netflix['genres'] = df_netflix['listed_in']
df_netflix['genres'] = df_netflix['genres'].fillna('')
df_netflix['description'] = df_netflix['description'].fillna('')
df_netflix['country'] = df_netflix['country'].fillna('')

print("Missing values handled successfully!")

Missing values handled successfully!


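### Normalize Text Data

Convert director, cast, and description text to lowercase and remove spaces to ensure consistent matching.  
Genres, type, and continent are left unchanged here; `TfidfVectorizer` lowercases its input by default, so "Action" and "action" still map to the same term.

**Important**: Removing spaces from a director name creates an atomic token, so "Christopher Nolan" becomes one feature rather than two words.  
Because the cast column was already joined with spaces in the previous step, this step also fuses the three cast names into a single token, so two titles share a cast feature only when their first three listed actors are identical and in the same order.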

In [5]:
def clean_data(x):
    """Normalize text: lowercase and remove spaces"""
    return str(x).strip().lower().replace(" ", "")

columns_to_clean = ['director', 'cast', 'description']
for column in columns_to_clean:
    df_netflix[column] = df_netflix[column].apply(clean_data)

print("Text normalization complete!")

Text normalization complete!
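The order of the join and clean steps determines how many cast tokens a title ends up with. The sketch below contrasts the order used in the cells above with a hypothetical alternative (`clean_names` is not part of the notebook) that cleans each name before joining. The cast string is illustrative, not taken from the dataset.

```python
def clean_data(x):
    """Same normalization as in the notebook: lowercase and remove spaces."""
    return str(x).strip().lower().replace(" ", "")

# Illustrative cast string in the dataset's "Name, Name, ..." format.
raw_cast = "Elijah Wood, Ian McKellen, Liv Tyler, Viggo Mortensen"

# Notebook order: join the first three names with spaces, then remove all spaces.
joined_then_cleaned = clean_data(" ".join(raw_cast.split(", ")[:3]))

def clean_names(cell, top_n=3):
    """Hypothetical alternative: clean each name first, then join with spaces."""
    names = [clean_data(name) for name in cell.split(",")[:top_n] if name.strip()]
    return " ".join(names)

cleaned_then_joined = clean_names(raw_cast)

print(joined_then_cleaned)  # elijahwoodianmckellenlivtyler
print(cleaned_then_joined)  # elijahwood ianmckellen livtyler
```

The first form yields one token per title, the second yields one token per actor, which would let two titles match on a single shared cast member. Switching to it would change the similarity scores reported in this notebook.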


### Feature Engineering: Country to Continent

Convert country names to continents to reduce feature dimensionality.  
This helps group movies by broader geographic regions.

In [6]:
def country_to_continent(country_name):
    """Convert country name to continent name"""
    try:
        country_code = pc.country_name_to_country_alpha2(country_name, cn_name_format="default")
        continent_code = pc.country_alpha2_to_continent_code(country_code)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
        return continent_name
    except Exception:  # unknown or unmappable country name
        return 'Unknown'

def map_countries_to_continents(countries):
    """Map list of countries to their continents"""
    country_list = countries.split(', ')
    continents = set([country_to_continent(country) for country in country_list])
    return ' '.join(continents)

df_netflix['continent'] = df_netflix['country'].apply(map_countries_to_continents)

print("Country to continent mapping complete!")
df_netflix[['country', 'continent']].head(10)

Country to continent mapping complete!


Unnamed: 0,country,continent
0,United States,North America
1,South Africa,Africa
2,,Unknown
3,,Unknown
4,India,Asia
5,,Unknown
6,,Unknown
7,"United States, Ghana, Burkina Faso, United Kin...",Africa North America Europe
8,United Kingdom,Europe
9,United States,North America
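The mapping logic itself is independent of `pycountry_convert`. As a minimal sketch, the version below replaces the library with a tiny hand-written lookup table (`CONTINENT_BY_COUNTRY` is hypothetical and far from complete) and sorts the result so the continent string is the same on every run, which a plain `set` does not guarantee.

```python
# Hypothetical stand-in for pycountry_convert: a tiny hand-written lookup table.
CONTINENT_BY_COUNTRY = {
    "United States": "North America",
    "United Kingdom": "Europe",
    "Germany": "Europe",
    "Ghana": "Africa",
    "India": "Asia",
}

def continents_for(countries):
    """Map a comma-separated country string to a sorted, de-duplicated continent string."""
    names = [name.strip() for name in countries.split(",")]
    continents = {CONTINENT_BY_COUNTRY.get(name, "Unknown") for name in names}
    return " ".join(sorted(continents))

print(continents_for("United States, Ghana, United Kingdom, Germany"))
# Africa Europe North America
print(continents_for(""))
# Unknown
```

Note that a multi-word continent such as "North America" is later split into the tokens "north" and "america" by the vectorizer, so it shares a token with "South America".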


### Create Unified Metadata Column

Combine all relevant features into a single "metadata" column.  
This column will be used for TF-IDF vectorization.

**Selected Features**:
- `director`: Film director(s)
- `cast`: Top 3 actors
- `genres`: Genre categories
- `type`: Movie or TV Show
- `continent`: Geographic region

**Note**: We exclude `description` to focus on structured metadata.  
This approach emphasizes director, cast, and genre similarities.

In [7]:
# Create metadata column by concatenating selected features
df_netflix['metadata'] = df_netflix['director'] + ' ' + df_netflix['cast'] + ' ' + df_netflix['genres'] + ' ' + df_netflix['type'] + ' ' + df_netflix['continent']

print("Metadata column created!")
print("\nExample metadata:")
print(df_netflix['metadata'].iloc[0])

Metadata column created!

Example metadata:
kirstenjohnson  Documentaries Movie North America


## TF-IDF Vectorization

Now we convert the metadata text into numerical vectors using TF-IDF.

**Parameters**:
- `stop_words='english'`: Remove common English words (the, is, at, etc.), which don't help distinguish between movies

In [8]:
# Create TF-IDF vectorizer and transform metadata
tfidf_netflix = TfidfVectorizer(stop_words='english')
tfidf_matrix_netflix = tfidf_netflix.fit_transform(df_netflix['metadata'])

print(f"TF-IDF Matrix Shape: {tfidf_matrix_netflix.shape}")
print(f"Number of movies: {tfidf_matrix_netflix.shape[0]}")
print(f"Number of unique terms: {tfidf_matrix_netflix.shape[1]}")

TF-IDF Matrix Shape: (8807, 13778)
Number of movies: 8807
Number of unique terms: 13778


## Compute Cosine Similarity Matrix

Calculate pairwise cosine similarity between all movies.  
This creates a similarity matrix where each cell (i, j) represents the similarity between movie i and movie j.

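**Note**: We use `linear_kernel`, which computes plain dot products. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows already equals their cosine similarity, and skipping the extra normalization step makes it slightly cheaper than `cosine_similarity`.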
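The equivalence is easy to verify on two small vectors. This is a pure-Python sketch with made-up numbers, not the scikit-learn implementation.

```python
import math

def dot(a, b):
    """Plain dot product, which is what a linear kernel computes for one pair."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the full formula."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 0.0, 4.0], [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# For unit-length vectors the denominator of the cosine formula is 1,
# so the dot product alone gives the same value (11/15 here).
print(dot(a_unit, b_unit), cosine(a, b))
```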

In [9]:
# Compute cosine similarity matrix
cosine_sim_netflix = linear_kernel(tfidf_matrix_netflix, tfidf_matrix_netflix)

print(f"Similarity Matrix Shape: {cosine_sim_netflix.shape}")
print(f"This is a {cosine_sim_netflix.shape[0]} x {cosine_sim_netflix.shape[1]} matrix")

Similarity Matrix Shape: (8807, 8807)
This is a 8807 x 8807 matrix


## Build Recommendation Function

Create a function that takes a movie title and returns the most similar movies.

**Algorithm**:
1. Find the index of the input movie
2. Get similarity scores for all movies compared to this movie
3. Sort movies by similarity score (descending)
4. Return top N most similar movies (excluding the input movie itself)
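The four steps can be sketched on a small hand-made similarity matrix. The titles and scores below are hypothetical. One design choice worth noting: the query is excluded by its index rather than by dropping the first sorted entry, because another title can tie with the query at 1.0 and end up first.

```python
def top_n_similar(sim_matrix, titles, query_title, n=3):
    """Return the n (title, score) pairs most similar to query_title, excluding the query."""
    idx = titles.index(query_title)                       # step 1: locate the query row
    scored = [(i, s) for i, s in enumerate(sim_matrix[idx]) if i != idx]  # step 2, minus the query
    scored.sort(key=lambda pair: pair[1], reverse=True)   # step 3: highest score first
    return [(titles[i], s) for i, s in scored[:n]]        # step 4: keep the top n

# Hypothetical 4x4 similarity matrix for four made-up titles.
titles = ["A", "A: Part 2", "B", "C"]
sim = [
    [1.0, 1.0, 0.4, 0.1],
    [1.0, 1.0, 0.4, 0.1],
    [0.4, 0.4, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]

print(top_n_similar(sim, titles, "A: Part 2", n=2))
# [('A', 1.0), ('B', 0.4)]
```

Here "A" and "A: Part 2" have identical vectors. Dropping the first sorted entry would remove "A" and leave the query itself in the results, whereas filtering by index keeps "A" as the top recommendation.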

In [10]:
# Create title-to-index mapping
titles_netflix = df_netflix['title']
indices_netflix = pd.Series(df_netflix.index, index=df_netflix['title'])

def content_based_recommendations_netflix(title, num_recommendations=20):
    """
    Generate content-based recommendations for a given movie (Netflix dataset).

    Parameters:
    -----------
    title : str
        The title of the movie to base recommendations on
    num_recommendations : int
        Number of recommendations to return (default: 20)

    Returns:
    --------
    pd.DataFrame
        DataFrame with recommended movie titles and similarity scores
    """
    # Get the index of the movie
    idx = indices_netflix[title]

    # Get pairwise similarity scores for all movies with this movie
    sim_scores = list(enumerate(cosine_sim_netflix[idx]))

    # Sort movies by similarity score (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get top N similar movies, dropping the query movie by index so that
    # ties at the top score cannot push another title out instead
    sim_scores = [pair for pair in sim_scores if pair[0] != idx][:num_recommendations]

    # Create recommendations list
    recommendations = []
    for index, score in sim_scores:
        recommendations.append({
            'title': titles_netflix.iloc[index],
            'similarity_score': score
        })

    return pd.DataFrame(recommendations)

print("Recommendation function created!")

Recommendation function created!


## Results and Analysis

Let's test our recommendation system with several movies and analyze the results.

### Example 1: "The Lord of the Rings: The Return of the King"

<img src="../images/poster_lotr.png" alt="Lord of the Rings poster" width="300">

**Expected Results**:
- Other Lord of the Rings movies
- Fantasy films with similar directors (Peter Jackson)
- Epic fantasy adventures

In [11]:
recommendations_lotr = content_based_recommendations_netflix('The Lord of the Rings: The Return of the King', num_recommendations=20)
print("Recommendations for 'The Lord of the Rings: The Return of the King':")
recommendations_lotr

Recommendations for 'The Lord of the Rings: The Return of the King':


Unnamed: 0,title,similarity_score
0,The Lord of the Rings: The Two Towers,1.0
1,The Lovely Bones,0.6125
2,Motown Magic,0.436349
3,Ghost Rider,0.395852
4,The Shannara Chronicles,0.390173
5,Clash of the Titans,0.386037
6,District 9,0.379186
7,Underworld: Rise of the Lycans,0.378178
8,Occupation,0.372201
9,Cursed,0.366939


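**Analysis**:
The top results can be traced back to specific shared features:
- **Identical metadata**: *The Two Towers* scores 1.0, which means its metadata vector (director, cast token, genres, type, continent) is effectively identical to that of the query
- **Director**: *The Lovely Bones* is another Peter Jackson film
- **Genre**: The remaining titles score below 0.45 and most likely match on genre, type, and continent terms alone

This shows that the approach works best when titles share distinctive production features; once those run out, the ranking falls back on broad genre overlap.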


### Example 2: "The Matrix"

<img src="../images/poster_matrix.png" alt="Matrix poster" width="300">

**Expected Results**:
- Sci-fi action films
- Movies by the Wachowski siblings
- Films with similar themes (cyberpunk, dystopian)

In [12]:
recommendations_matrix = content_based_recommendations_netflix('The Matrix', num_recommendations=20)
print("Recommendations for 'The Matrix':")
recommendations_matrix

Recommendations for 'The Matrix':


Unnamed: 0,title,similarity_score
0,The Matrix Reloaded,1.0
1,The Matrix Revolutions,1.0
2,Jupiter Ascending,0.627485
3,Cloud Atlas,0.433838
4,The Umbrella Academy,0.306924
5,Jupiter's Legacy,0.288729
6,Wu Assassins,0.288729
7,Star Trek: The Next Generation,0.288729
8,Star Trek: Voyager,0.288729
9,Star Trek: Deep Space Nine,0.288729


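**Analysis**:
The system recommends:
- **Franchise**: Both sequels score 1.0, meaning their metadata vectors match the query's
- **Director-based**: *Jupiter Ascending* and *Cloud Atlas* are other Wachowski films
- **Genre-based**: The lower-ranked titles are mostly sci-fi and action series; several tie at exactly 0.288729, which suggests they share the same small set of genre terms with the query

Since descriptions are excluded from the metadata, thematic similarity (cyberpunk, dystopian) is captured only indirectly, through genre labels.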


### Example 3: "Indiana Jones and the Last Crusade"

<img src="../images/poster_indiana_jones.png" alt="Indiana Jones poster" width="300">


**Expected Results**:
- Other Indiana Jones films
- Adventure films by Steven Spielberg
- Action-adventure classics

In [13]:
recommendations_indiana = content_based_recommendations_netflix('Indiana Jones and the Last Crusade', num_recommendations=20)
print("Recommendations for 'Indiana Jones and the Last Crusade':")
recommendations_indiana

Recommendations for 'Indiana Jones and the Last Crusade':


Unnamed: 0,title,similarity_score
0,Indiana Jones and the Raiders of the Lost Ark,0.628181
1,Indiana Jones and the Temple of Doom,0.628181
2,Jaws,0.529414
3,Indiana Jones and the Kingdom of the Crystal S...,0.474337
4,Schindler's List,0.462626
5,The BFG,0.448903
6,LEGO Ninjago: Masters of Spinjitzu: Day of the...,0.43867
7,The Adventures of Tintin,0.433671
8,Super Monsters: Dia de los Monsters,0.42461
9,LEGO Marvel Super Heroes: Guardians of the Galaxy,0.42461


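**Analysis**:
The recommendations emphasize:
- **Franchise**: The other three Indiana Jones movies appear in the top four
- **Director**: *Jaws*, *Schindler's List*, *The BFG*, and *The Adventures of Tintin* are all Steven Spielberg films, regardless of genre
- **Genre**: The remaining titles most likely match on shared genre terms

Unlike the previous examples, no sequel reaches 1.0: the franchise entries share a director and genres, but presumably differ in their cast token. Release year is not part of the metadata, so any clustering by era is incidental.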

---

# 5. Full TMDB Movies Dataset 2024

## Dataset Description

The **Full TMDB Movies Dataset 2024** contains information about more than one million films from The Movie Database (TMDB); the version loaded below has 1,047,554 rows.

**Dataset Features**:
- `title`: Movie name
- `vote_average`: Average user rating
- `vote_count`: Number of votes
- `release_date`: Release date (reduced to the year during preprocessing)
- `revenue`: Box office revenue
- `runtime`: Movie duration
- `budget`: Production budget
- `original_language`: Language of production
- `popularity`: TMDB popularity score
- `tagline`: Movie tagline/slogan

**Approach**: Unlike Netflix, this dataset lacks director/cast information but has rich **numerical and temporal features**. We use ratings, vote counts, release year, runtime, language, popularity, and taglines to find similar movies. Revenue and budget are used only to filter the dataset, not to compute similarity.

**Key Difference**: This approach finds movies similar in **audience reception, release period, and basic attributes** rather than creative team and genre.

In [14]:
# Load the TMDB dataset
df_tmdb = pd.read_csv('../datasets/TMDB/TMDB_movie_dataset_v11.csv')

print(f"Original TMDB Dataset Shape: {df_tmdb.shape}")
df_tmdb.head()

Original TMDB Dataset Shape: (1047554, 24)


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


## Data Preprocessing

### Feature Selection and Filtering

Select relevant columns and filter out low-quality entries.

**Filtering Criteria**:
- Must have title
- Must have ratings (vote_average > 0, vote_count > 100)
- Must have financial data (revenue > 0, budget > 0)
- Must have reasonable runtime (> 20 minutes)

These filters ensure we work with well-documented, commercially released films.

In [15]:
# Select relevant columns
df_tmdb = df_tmdb[['id', 'title', 'vote_average', 'vote_count', 'release_date',
                    'revenue', 'runtime', 'adult', 'budget', 'original_language',
                    'popularity', 'tagline']]

# Apply filters
df_tmdb = df_tmdb.dropna(subset=['title'])
df_tmdb = df_tmdb[(df_tmdb['vote_average'] != 0) &
                  (df_tmdb['vote_count'] != 0) &
                  (df_tmdb['revenue'] != 0) &
                  (df_tmdb['budget'] != 0) &
                  (df_tmdb['runtime'] > 20) &
                  (df_tmdb['vote_count'] > 100)]

print(f"Filtered TMDB Dataset Shape: {df_tmdb.shape}")
print(f"Number of movies after filtering: {len(df_tmdb)}")

Filtered TMDB Dataset Shape: (7059, 12)
Number of movies after filtering: 7059


### Handle Missing Values and Extract Year

Fill missing values and extract release year from date.

In [16]:
df_tmdb['release_date'] = df_tmdb['release_date'].fillna('')
df_tmdb['tagline'] = df_tmdb['tagline'].fillna('')

# Extract year from release_date
df_tmdb['release_date'] = df_tmdb['release_date'].apply(lambda x: x[:4] if len(x) >= 4 else '')

print("Missing values handled and year extracted!")

Missing values handled and year extracted!


### Convert Numerical Features to Strings

Convert all numerical features to strings so they can be used in TF-IDF vectorization.

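**Why convert to strings?**
- TF-IDF works with text data
- Each distinct value becomes its own token, so only movies with exactly the same value (e.g., the same release year or runtime) share that feature; values that are merely close (148 vs. 149 minutes) do not match
- Decimal values are split at the decimal point by the default tokenizer, so `8.364` contributes the token `364` rather than the rating itself
- Grouping movies with similar production scales would require binning the numbers into ranges before vectorization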
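To see how a metadata string of this kind is actually split, the sketch below applies the regular expression that `TfidfVectorizer` uses as its default `token_pattern` to the example string printed later in this notebook. It only reproduces the tokenization step; stop-word removal and document-frequency filtering happen afterwards.

```python
import re

# Default token_pattern of TfidfVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

metadata = "8.364,34495,2010,148,False,en,83.952,Your mind is the scene of the crime."
tokens = TOKEN_PATTERN.findall(metadata.lower())

print(tokens)
# ['364', '34495', '2010', '148', 'false', 'en', '83', '952',
#  'your', 'mind', 'is', 'the', 'scene', 'of', 'the', 'crime']
```

The rating `8.364` survives only as `364`, and the popularity `83.952` as `83` and `952`, so numeric closeness is not represented: two movies share a numeric token only when the digits match exactly.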

In [17]:
df_tmdb['vote_average'] = df_tmdb['vote_average'].astype(str)
df_tmdb['vote_count'] = df_tmdb['vote_count'].astype(str)
df_tmdb['revenue'] = df_tmdb['revenue'].astype(str)
df_tmdb['runtime'] = df_tmdb['runtime'].astype(str)
df_tmdb['adult'] = df_tmdb['adult'].astype(str)
df_tmdb['budget'] = df_tmdb['budget'].astype(str)
df_tmdb['popularity'] = df_tmdb['popularity'].astype(str)

print("Numerical features converted to strings!")
df_tmdb.info()

Numerical features converted to strings!
<class 'pandas.core.frame.DataFrame'>
Index: 7059 entries, 0 to 18083
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 7059 non-null   int64 
 1   title              7059 non-null   object
 2   vote_average       7059 non-null   object
 3   vote_count         7059 non-null   object
 4   release_date       7059 non-null   object
 5   revenue            7059 non-null   object
 6   runtime            7059 non-null   object
 7   adult              7059 non-null   object
 8   budget             7059 non-null   object
 9   original_language  7059 non-null   object
 10  popularity         7059 non-null   object
 11  tagline            7059 non-null   object
dtypes: int64(1), object(11)
memory usage: 716.9+ KB


### Create Unified Metadata Column

Combine selected features into metadata column.

**Selected Features**:
- `vote_average`: Rating quality
- `vote_count`: Rating quantity (popularity indicator)
- `release_date`: Era/year
- `runtime`: Movie length
- `adult`: Content rating
- `original_language`: Language
- `popularity`: TMDB popularity score
- `tagline`: Marketing message

**Note**: We use commas as separators to create distinct tokens for each feature.

In [18]:
# Create metadata column
df_tmdb['metadata'] = (df_tmdb['vote_average'] + ',' +
                       df_tmdb['vote_count'] + ',' +
                       df_tmdb['release_date'] + ',' +
                       df_tmdb['runtime'] + ',' +
                       df_tmdb['adult'] + ',' +
                       df_tmdb['original_language'] + ',' +
                       df_tmdb['popularity'] + ',' +
                       df_tmdb['tagline'])

print("Metadata column created!")
print("\nExample metadata:")
print(df_tmdb['metadata'].iloc[0])

Metadata column created!

Example metadata:
8.364,34495,2010,148,False,en,83.952,Your mind is the scene of the crime.


## TF-IDF Vectorization

Convert metadata to TF-IDF vectors with advanced parameters.

**Parameters**:
- `analyzer='word'`: Analyze at word level
- `ngram_range=(1, 2)`: Use both single words and word pairs
- `min_df=0.01`: Ignore terms appearing in < 1% of documents
- `max_df=0.9`: Ignore terms appearing in > 90% of documents
- `stop_words='english'`: Remove common English words

These parameters help filter out both very rare and very common terms.
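The effect of `min_df` and `max_df` can be sketched in a few lines of plain Python. The function `filter_vocabulary` and the ten-document corpus are hypothetical; thresholds are treated as fractions of the corpus with inclusive bounds, which is how fractional values are interpreted here.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=0.01, max_df=0.9):
    """Keep terms whose document frequency (as a fraction) lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df)

# Hypothetical corpus of 10 tiny documents.
docs = (
    [["false", "en", "2014"]] * 4
    + [["false", "en", "2003"]] * 3
    + [["false", "fr", "1980"]] * 2
    + [["false", "en", "1999", "rareword"]]
)

print(filter_vocabulary(docs, min_df=0.2, max_df=0.9))
# ['1980', '2003', '2014', 'en', 'fr']
```

The term `false` is dropped because it appears in every document, while `1999` and `rareword` are dropped because they appear in only one. On real data, near-unique values such as vote counts tend to fall below the lower threshold in the same way.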

In [19]:
# Create TF-IDF vectorizer with advanced parameters
tfidf_tmdb = TfidfVectorizer(analyzer='word',
                             ngram_range=(1, 2),
                             min_df=0.01,
                             max_df=0.9,
                             stop_words='english')
tfidf_matrix_tmdb = tfidf_tmdb.fit_transform(df_tmdb['metadata'])

print(f"TF-IDF Matrix Shape: {tfidf_matrix_tmdb.shape}")
print(f"Number of movies: {tfidf_matrix_tmdb.shape[0]}")
print(f"Number of unique terms: {tfidf_matrix_tmdb.shape[1]}")

TF-IDF Matrix Shape: (7059, 210)
Number of movies: 7059
Number of unique terms: 210


## Compute Cosine Similarity Matrix

In [20]:
# Compute cosine similarity matrix
cosine_sim_tmdb = linear_kernel(tfidf_matrix_tmdb, tfidf_matrix_tmdb)

print(f"Similarity Matrix Shape: {cosine_sim_tmdb.shape}")
print(f"This is a {cosine_sim_tmdb.shape[0]} x {cosine_sim_tmdb.shape[1]} matrix")

Similarity Matrix Shape: (7059, 7059)
This is a 7059 x 7059 matrix


## Build Recommendation Function

In [21]:
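# Create title-to-position mapping.
# After filtering, df_tmdb keeps its original index labels (0 to 18083), but the
# similarity matrix has only len(df_tmdb) rows, so titles must map to row positions.
titles_tmdb = df_tmdb['title']
indices_tmdb = pd.Series(range(len(df_tmdb)), index=df_tmdb['title'])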

def content_based_recommendations_tmdb(title, num_recommendations=20):
    """
    Generate content-based recommendations for a given movie (TMDB dataset).

    Parameters:
    -----------
    title : str
        The title of the movie to base recommendations on
    num_recommendations : int
        Number of recommendations to return (default: 20)

    Returns:
    --------
    pd.DataFrame
        DataFrame with recommended movie titles and similarity scores
    """
    # Get the index of the movie
    idx = indices_tmdb[title]

    # Get pairwise similarity scores for all movies with this movie
    sim_scores = list(enumerate(cosine_sim_tmdb[idx]))

    # Sort movies by similarity score (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get top N similar movies, dropping the query movie by index so that
    # ties at the top score cannot push another title out instead
    sim_scores = [pair for pair in sim_scores if pair[0] != idx][:num_recommendations]

    # Create recommendations list
    recommendations = []
    for index, score in sim_scores:
        recommendations.append({
            'title': titles_tmdb.iloc[index],
            'similarity_score': score
        })

    return pd.DataFrame(recommendations)

print("TMDB recommendation function created!")

TMDB recommendation function created!
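A similarity matrix is addressed by row position, whereas a filtered DataFrame keeps the index labels it had before filtering. The sketch below uses a small hypothetical DataFrame (column names and values are invented) to show how the two kinds of lookup diverge.

```python
import pandas as pd

# Hypothetical DataFrame standing in for a dataset before filtering.
df = pd.DataFrame({"title": ["A", "B", "C", "D"], "votes": [500, 10, 900, 300]})
filtered = df[df["votes"] > 100]  # keeps original labels 0, 2, 3

# Lookup by original index label versus lookup by row position.
label_lookup = pd.Series(filtered.index, index=filtered["title"])
position_lookup = pd.Series(range(len(filtered)), index=filtered["title"])

print(label_lookup["D"])     # 3: out of range for a 3x3 similarity matrix
print(position_lookup["D"])  # 2: row position of "D" among the filtered rows
```

For rows near the top of a dataset the two values often coincide, so a label-based lookup can appear to work while silently returning the wrong row for later entries.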


## Results and Analysis

Let's test the TMDB recommendation system with several movies.

### Example 1: "Interstellar"

<img src="../images/poster_interstellar.png" alt="Interstellar poster" width="300">

**Expected Results**:
- Sci-fi films with similar ratings and budget
- Movies from similar era (2010s)
- Films with comparable runtime and production scale

In [22]:
recommendations_interstellar = content_based_recommendations_tmdb('Interstellar', num_recommendations=20)
print("Recommendations for 'Interstellar':")
recommendations_interstellar

Recommendations for 'Interstellar':


Unnamed: 0,title,similarity_score
0,Captain America: The Winter Soldier,1.0
1,The Amazing Spider-Man 2,1.0
2,American Sniper,1.0
3,Exodus: Gods and Kings,1.0
4,The Judge,1.0
5,Inherent Vice,1.0
6,Leviathan,0.698998
7,The Hobbit: The Battle of the Five Armies,0.698218
8,Divergent,0.697747
9,Transformers: Age of Extinction,0.697747


**Analysis**:
The six titles that score exactly 1.0 were all released in 2014, the same year as Interstellar, and so were most of the others:
- **Temporal proximity**: Release year is the dominant shared feature
- **Sparse vocabulary**: With only 210 terms surviving the `min_df`/`max_df` filter, a 1.0 score means two movies share the same few surviving tokens, not that their ratings, runtimes, or taglines are alike
- **Ratings and popularity**: These values are nearly unique per movie, so they are unlikely to pass the 1% document-frequency threshold

Budget and revenue are not part of the metadata at all. Unlike the Netflix system (which recommends based on director/cast), this setup mainly finds movies from the **same release year** as Interstellar.


### Example 2: "The Empire Strikes Back"

<img src="../images/poster_empire_strikes_back.png" alt="Empire Strikes Back poster" width="300">

**Expected Results**:
- Classic films from similar era (1980s)
- Movies with high ratings and popularity
- Films with similar production characteristics

In [23]:
recommendations_empire = content_based_recommendations_tmdb('The Empire Strikes Back', num_recommendations=20)
print("Recommendations for 'The Empire Strikes Back':")
recommendations_empire

Recommendations for 'The Empire Strikes Back':


Unnamed: 0,title,similarity_score
0,The Mummy,0.880448
1,Jaws,0.824869
2,Supergirl,0.818263
3,Cutthroat Island,0.753386
4,National Treasure: Book of Secrets,0.712251
5,Captain America: The First Avenger,0.711668
6,Clouds of Sils Maria,0.710457
7,Million Dollar Arm,0.710457
8,Jurassic World,0.710282
9,Shooter,0.710105


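**Analysis**:
The results only partly match the expectations:
- **Era**: Few of the listed titles are from the 1980s; *Jaws* (1975), *Captain America: The First Avenger* (2011), and *Jurassic World* (2015) span four decades
- **Scores**: Similarities of 0.71 to 0.88 show substantial token overlap, but with a 210-term vocabulary that overlap may come from a handful of common terms (language, a runtime value, or a tagline word)
- **Ratings and popularity**: These values are close to unique per movie, so they are unlikely to contribute shared tokens

The "peers" found here are therefore movies that happen to share a few surviving tokens with The Empire Strikes Back, which is weaker evidence of similarity than a shared director or genre.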


### Example 3: "Kill Bill: Vol. 1"

<img src="../images/poster_kill_bill.png" alt="Kill Bill poster" width="300">

**Expected Results**:
- Action films from early 2000s
- Movies with similar ratings and runtime
- Films with comparable production budgets

In [24]:
recommendations_killbill = content_based_recommendations_tmdb('Kill Bill: Vol. 1', num_recommendations=20)
print("Recommendations for 'Kill Bill: Vol. 1':")
recommendations_killbill

Recommendations for 'Kill Bill: Vol. 1':


Unnamed: 0,title,similarity_score
0,Labor Day,0.650458
1,Almost Christmas,0.641938
2,Ratatouille,0.637608
3,The Nile Hilton Incident,0.636946
4,The Door in the Floor,0.628969
5,Human Capital,0.618011
6,Zatoichi,0.61725
7,The Object of My Affection,0.616926
8,The Conjuring: The Devil Made Me Do It,0.613066
9,My Mom Is a Character 3,0.612381


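**Analysis**:
The results diverge from the expectations above:
- **Weak matches**: The top score is about 0.65, lower than in the previous two examples, so no film is a close neighbor in feature space
- **Genre drift**: Dramas (Labor Day), comedies (Almost Christmas), and an animated film (Ratatouille) appear alongside a single samurai film (Zatoichi)
- **Loose temporal clustering**: Release years range from 1998 (The Object of My Affection) to 2021 (The Conjuring: The Devil Made Me Do It)

This demonstrates that numerical features create very different similarity patterns than creative team features, and that they do not preserve genre on their own.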

---

# 6. MovieLens 25M Dataset with Genome Tags

## Dataset Description

The **MovieLens 25M** dataset is one of the most comprehensive movie rating datasets, containing:
- **25 million ratings** from 162,000 users on 62,000 movies
- **Genome tags**: A sophisticated tagging system with 1,128 unique tags
- **Genome scores**: Relevance scores (0.0-1.0) indicating how well each tag describes each movie

**Dataset Features**:
- `movies.csv`: Movie titles and genres
- `genome-tags.csv`: Tag descriptions (e.g., "dystopian", "time travel", "emotional")
- `genome-scores.csv`: Relevance scores for each movie-tag pair

**Approach**: Unlike Netflix (director/cast) or TMDB (commercial metrics), MovieLens uses **semantic tags** that describe movie themes, moods, and characteristics. This creates a rich, multi-dimensional representation of each film.

**Key Innovation**: The genome tagging system (the Tag Genome) was created by GroupLens researchers, the group behind MovieLens, to capture subjective qualities that traditional metadata misses (e.g., "thought-provoking", "visually appealing", "dark comedy").

In [25]:
# Load MovieLens 25M dataset
# movies.csv is UTF-8 encoded; reading it as latin-1 garbles accented titles
movies_ml = pd.read_csv('../datasets/MovieLens/ml-25m/movies.csv', sep=',', encoding='utf-8', usecols=['movieId', 'title', 'genres'])
genome_scores = pd.read_csv('../datasets/MovieLens/ml-25m/genome-scores.csv')
genome_tags = pd.read_csv('../datasets/MovieLens/ml-25m/genome-tags.csv')

print(f"Movies: {len(movies_ml)}")
print(f"Genome tags: {len(genome_tags)}")
print(f"Genome scores: {len(genome_scores)}")

print("\nSample genome tags:")
genome_tags.head(10)

Movies: 62423
Genome tags: 1128
Genome scores: 15584448

Sample genome tags:


Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
5,6,1950s
6,7,1960s
7,8,1970s
8,9,1980s
9,10,19th century


## Data Preprocessing

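Split the pipe-separated genres into lists, then cast each list to its string representation so it can be concatenated with the tag text later. The brackets and quotes this adds are punctuation, which the TF-IDF tokenizer ignores.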

In [26]:
# Process genres
movies_ml['genres'] = movies_ml['genres'].str.split('|')
movies_ml['genres'] = movies_ml['genres'].fillna("").astype('str')

print("Genres processed!")
movies_ml.head()

Genres processed!


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
1,2,Jumanji (1995),"['Adventure', 'Children', 'Fantasy']"
2,3,Grumpier Old Men (1995),"['Comedy', 'Romance']"
3,4,Waiting to Exhale (1995),"['Comedy', 'Drama', 'Romance']"
4,5,Father of the Bride Part II (1995),['Comedy']


Remove tags that don't provide meaningful content information.

**Filtered tags**: 'original', 'sequel', 'good sequel', 'sequels'

**Rationale**: These tags describe a movie's position in a franchise, not its content. Including them would cause all sequels to be recommended together regardless of genre or theme.

In [27]:
# Filter out uninformative tags
uninformative_tags = ['original', 'sequel', 'good sequel', 'sequels']
n_tags_before = len(genome_tags)
genome_tags = genome_tags[~genome_tags['tag'].isin(uninformative_tags)]

# Report the number of tags actually removed, not the length of the filter list
print(f"Tags after filtering: {len(genome_tags)}")
print(f"Removed {n_tags_before - len(genome_tags)} uninformative tags")


Tags after filtering: 1124
Removed 4 uninformative tags


For each movie, select the 10 tags with highest relevance scores.

**Why top 10?**
- Focuses on most distinctive characteristics
- Reduces noise from low-relevance tags
- Creates more precise similarity matching


In [28]:
# Merge genome scores with tag names
merged_tags = pd.merge(genome_scores, genome_tags, on='tagId')

# For each movie, get top 10 tags by relevance score
top_tags = merged_tags.groupby('movieId').apply(
    lambda x: x.nlargest(10, 'relevance')['tag'].tolist()
)
top_tags_df = top_tags.reset_index(name='top_tags')

print(f"Movies with top tags: {len(top_tags_df)}")
print("\nExample - top tags for first movie:")
print(top_tags_df.iloc[0])


Movies with top tags: 13816

Example - top tags for first movie:
movieId                                                     1
top_tags    [toys, computer animation, pixar animation, ki...
Name: 0, dtype: object
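
On the full score table (about 15.6 million rows), `groupby(...).apply(...)` with a Python lambda can be slow. One possible alternative, sketched here on a small made-up table, sorts once and keeps the first `k` rows per movie. The `toy_scores` data and the value of `k` are illustrative assumptions, not part of the dataset, and ties may be ordered differently than with `nlargest`.

```python
import pandas as pd

# Hypothetical miniature of the merged genome table (movieId, tag, relevance)
toy_scores = pd.DataFrame({
    "movieId":   [1, 1, 1, 2, 2, 2],
    "tag":       ["toys", "pixar", "dark", "space", "robots", "toys"],
    "relevance": [0.90, 0.80, 0.10, 0.70, 0.95, 0.20],
})
k = 2  # the notebook uses 10

toy_top_tags = (
    toy_scores
    .sort_values(["movieId", "relevance"], ascending=[True, False])
    .groupby("movieId")
    .head(k)                     # first k rows per movie = k highest relevance
    .groupby("movieId")["tag"]
    .agg(list)                   # collect the surviving tags into one list
    .reset_index(name="top_tags")
)
print(toy_top_tags)
```

The result has the same shape as `top_tags_df`: one row per movie with a `top_tags` list ordered by descending relevance.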


Combine top tags with movie titles and genres.

In [29]:
# Merge top tags with movie data
movies_full = pd.merge(top_tags_df, movies_ml, on='movieId')

print(f"Complete dataset: {len(movies_full)} movies")
movies_full.head()

Complete dataset: 13816 movies


Unnamed: 0,movieId,top_tags,title,genres
0,1,"[toys, computer animation, pixar animation, ki...",Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
1,2,"[adventure, children, fantasy, kids, special e...",Jumanji (1995),"['Adventure', 'Children', 'Fantasy']"
2,3,"[comedy, gunfight, romance, destiny, great, cr...",Grumpier Old Men (1995),"['Comedy', 'Romance']"
3,4,"[women, chick flick, divorce, girlie movie, ro...",Waiting to Exhale (1995),"['Comedy', 'Drama', 'Romance']"
4,5,"[father daughter relationship, pregnancy, midl...",Father of the Bride Part II (1995),['Comedy']


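Convert tag lists to space-separated strings for TF-IDF processing.

**Note**: Each tag is wrapped in single quotes, but the quotes alone do not turn multi-word tags (e.g., "time travel") into single tokens. The default `TfidfVectorizer` token pattern discards punctuation and splits on whitespace, so "time travel" becomes `time` and `travel`. The bigram setting used in the vectorization step (`ngram_range=(1, 2)`) recovers two-word tags as word pairs, but it also creates pairs that span adjacent tags. This differs from the space-removal technique used for the Netflix dataset, which does yield single tokens.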

In [30]:
# Convert tag lists to strings
movies_full['top_tags'] = movies_full['top_tags'].apply(
    lambda x: x if isinstance(x, list) else ([x] if isinstance(x, str) else [])
)
movies_full['top_tags'] = movies_full['top_tags'].apply(
    lambda x: ' '.join(f"'{tag}'" for tag in x)
)

print("Tags formatted!")
print("\nExample formatted tags:")
print(movies_full['top_tags'].iloc[0])

Tags formatted!

Example formatted tags:
'toys' 'computer animation' 'pixar animation' 'kids and family' 'animation' 'kids' 'pixar' 'children' 'cartoon' 'animated'
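
To see how such a string is actually tokenized, the following sketch fits two vectorizers on a shortened version of the output above. The first uses the default token pattern; the second uses a custom `token_pattern` that captures everything between single quotes. The second variant is an assumption about how whole-tag tokens could be obtained, not what the notebook does, and it would mis-handle tags that contain an apostrophe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

formatted_tags = "'toys' 'computer animation' 'pixar animation' 'kids and family'"

# Default tokenizer: quotes are dropped and tags are split into words
word_vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
word_vec.fit([formatted_tags])
word_terms = sorted(word_vec.vocabulary_)
print(word_terms)

# Assumed alternative: treat the text between two single quotes as one token
tag_vec = TfidfVectorizer(token_pattern=r"'([^']+)'")
tag_vec.fit([formatted_tags])
tag_terms = sorted(tag_vec.vocabulary_)
print(tag_terms)
```

With the default pattern the vocabulary contains cross-tag bigrams such as `toys computer`, while the custom pattern keeps each tag intact.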


Combine genome tags and genres into metadata.

**Feature Composition**:
- `top_tags`: Top 10 semantic tags (e.g., "dystopian", "thought-provoking")
- `genres`: Traditional genre categories (e.g., "Sci-Fi", "Drama")

This combines **semantic understanding** (tags) with **categorical classification** (genres).

In [31]:
# Create metadata column
movies_full['metadata'] = movies_full['top_tags'] + ',' + movies_full['genres']

print("Metadata column created!")
print("\nExample metadata:")
print(movies_full['metadata'].iloc[0])

Metadata column created!

Example metadata:
'toys' 'computer animation' 'pixar animation' 'kids and family' 'animation' 'kids' 'pixar' 'children' 'cartoon' 'animated',['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']


## TF-IDF Vectorization

Configure TF-IDF to capture both single words and adjacent word pairs from the combined tag and genre text.

**Parameters** (same as TMDB for consistency):
- `analyzer='word'`: Word-level analysis
- `ngram_range=(1, 2)`: Single words and pairs of adjacent words
- `min_df=0.01`: Drop terms that appear in fewer than 1% of movies
- `max_df=0.9`: Drop terms that appear in more than 90% of movies
- `stop_words='english'`: Remove common English words

In [32]:
# Create TF-IDF vectorizer
tfidf_ml = TfidfVectorizer(analyzer='word',
                           ngram_range=(1, 2),
                           min_df=0.01,
                           max_df=0.9,
                           stop_words='english')
tfidf_matrix_ml = tfidf_ml.fit_transform(movies_full['metadata'])

print(f"TF-IDF Matrix Shape: {tfidf_matrix_ml.shape}")
print(f"Number of movies: {tfidf_matrix_ml.shape[0]}")
print(f"Number of unique terms: {tfidf_matrix_ml.shape[1]}")

TF-IDF Matrix Shape: (13816, 478)
Number of movies: 13816
Number of unique terms: 478
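
The effect of the two document-frequency thresholds is easiest to see on a tiny corpus. The sketch below uses a hypothetical ten-document corpus and thresholds picked for that corpus (`min_df=0.2`, `max_df=0.9`), not the notebook's values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "space" is in all 10 docs, "robot" in 5, "unicorn" in 1
toy_docs = ["space robot"] * 5 + ["space unicorn"] + ["space"] * 4

# Float thresholds are proportions of the number of documents
toy_vec = TfidfVectorizer(min_df=0.2, max_df=0.9)
toy_vec.fit(toy_docs)

kept_terms = sorted(toy_vec.vocabulary_)
print(kept_terms)
```

Only `robot` survives: `space` exceeds the 90% ceiling and `unicorn` falls below the 20% floor. The same mechanism explains why the MovieLens vocabulary shrinks to 478 terms.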


## Compute Cosine Similarity Matrix

In [33]:
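# Compute cosine similarity matrix.
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# returned by linear_kernel equals the cosine similarity.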
cosine_sim_ml = linear_kernel(tfidf_matrix_ml, tfidf_matrix_ml)

print(f"Similarity Matrix Shape: {cosine_sim_ml.shape}")
print(f"This is a {cosine_sim_ml.shape[0]} x {cosine_sim_ml.shape[1]} matrix")

Similarity Matrix Shape: (13816, 13816)
This is a 13816 x 13816 matrix
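
Using `linear_kernel` for cosine similarity relies on the TF-IDF rows having unit length. The sketch below checks this on a made-up three-document corpus and also estimates the memory footprint of a dense 13816 x 13816 `float64` matrix. The corpus and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_docs = ["pixar toys animation", "pixar animation kids", "dystopian space robots"]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)  # norm='l2' by default

# Every row has unit length, so the dot product equals the cosine similarity
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(row_norms, 1.0), np.allclose(dot_products, cosines))

# Dense float64 storage needed for the notebook's similarity matrix
n_movies = 13816
dense_gigabytes = n_movies * n_movies * 8 / 1e9
print(f"{dense_gigabytes:.2f} GB")
```

At roughly 1.5 GB the full matrix still fits in memory on many machines, but the cost grows quadratically with the number of movies.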


## Build Recommendation Function

In [34]:
# Create title-to-index mapping
titles_ml = movies_full['title']
indices_ml = pd.Series(movies_full.index, index=movies_full['title'])

def content_based_recommendations_ml(title, num_recommendations=20):
    """
    Generate content-based recommendations for a given movie (MovieLens dataset).

    Parameters:
    -----------
    title : str
        The title of the movie to base recommendations on
    num_recommendations : int
        Number of recommendations to return (default: 20)

    Returns:
    --------
    pd.DataFrame
        DataFrame with recommended movie titles and similarity scores
    """
    # Get the index of the movie (use the first match if the title is duplicated)
    idx = indices_ml[title]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Get pairwise similarity scores for all movies with this movie
    sim_scores = list(enumerate(cosine_sim_ml[idx]))

    # Drop the query movie itself instead of assuming it sorts first
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]

    # Sort movies by similarity score (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top N similar movies
    sim_scores = sim_scores[:num_recommendations]

    # Create recommendations list
    recommendations = []
    for index, score in sim_scores:
        recommendations.append({
            'title': titles_ml.iloc[index],
            'similarity_score': score
        })

    return pd.DataFrame(recommendations)

print("MovieLens recommendation function created!")

MovieLens recommendation function created!
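
If the full similarity matrix is too large to hold in memory, a single row can be computed on demand instead. The following is a minimal sketch under that assumption; `toy_movies`, `toy_tfidf`, and `recommend_on_demand` are hypothetical names, and the four-film table stands in for `movies_full`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in for movies_full
toy_movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "metadata": ["pixar toys animation", "pixar animation kids",
                 "dystopian space", "space robots dystopian"],
})
toy_tfidf = TfidfVectorizer().fit_transform(toy_movies["metadata"])

def recommend_on_demand(title, frame, matrix, k=2):
    """Return the k most similar rows, computing only one similarity row."""
    matches = np.flatnonzero((frame["title"] == title).to_numpy())
    if matches.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = int(matches[0])                    # first match if titles repeat
    scores = linear_kernel(matrix[idx], matrix).ravel()
    scores[idx] = -np.inf                    # never recommend the query itself
    top = np.argsort(scores)[::-1][:k]       # positions of the k best scores
    return pd.DataFrame({
        "title": frame["title"].to_numpy()[top],
        "similarity_score": scores[top],
    })

toy_result = recommend_on_demand("Film A", toy_movies, toy_tfidf, k=1)
print(toy_result)
```

This trades a small amount of compute per query for not storing the n x n matrix, and it excludes the query by position rather than by sort order.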


## Results and Analysis

Let's test the MovieLens recommendation system with several movies.

### Example 1: "Toy Story (1995)"

<img src="../images/poster_toy_story.png" alt="Toy Story poster" width="300">

**Expected Results**:
- Other Pixar animated films
- Family-friendly animations
- Movies with similar themes (friendship, adventure, humor)

In [35]:
recommendations_toystory = content_based_recommendations_ml('Toy Story (1995)', num_recommendations=20)
print("Recommendations for 'Toy Story (1995)':")
recommendations_toystory

Recommendations for 'Toy Story (1995)':


Unnamed: 0,title,similarity_score
0,"Ant Bully, The (2006)",0.923594
1,Toy Story Toons: Small Fry (2011),0.915293
2,"Monsters, Inc. (2001)",0.914108
3,Finding Nemo (2003),0.909835
4,"Bug's Life, A (1998)",0.908161
5,The Good Dinosaur (2015),0.903491
6,Everyone's Hero (2006),0.902183
7,Cloudy with a Chance of Meatballs 2 (2013),0.901282
8,Toy Story 2 (1999),0.887813
9,Antz (1998),0.883577


**Analysis**:
The top genome tags for Toy Story (listed in the formatted-tags output earlier) describe:
- **Subject matter**: "toys"
- **Target audience**: "kids and family", "kids", "children"
- **Animation style and studio**: "computer animation", "pixar animation", "pixar", "cartoon", "animated"

As a result, the recommendations are dominated by computer-animated family films, including several Pixar titles, rather than by films that merely share a genre label.


### Example 2: "Princess Mononoke (Mononoke-hime) (1997)"

<img src="../images/poster_mononoke.png" alt="Princess Mononoke poster" width="300">

**Expected Results**:
- Other Studio Ghibli films
- Fantasy animations with environmental themes
- Epic, visually stunning animations

In [36]:
recommendations_mononoke = content_based_recommendations_ml('Princess Mononoke (Mononoke-hime) (1997)', num_recommendations=20)
print("Recommendations for 'Princess Mononoke (Mononoke-hime) (1997)':")
recommendations_mononoke

Recommendations for 'Princess Mononoke (Mononoke-hime) (1997)':


Unnamed: 0,title,similarity_score
0,Laputa: Castle in the Sky (Tenkû no shiro Rap...,0.8052
1,Patema Inverted (2013),0.744454
2,Kubo and the Two Strings (2016),0.721319
3,Ponyo (Gake no ue no Ponyo) (2008),0.716755
4,Final Fantasy VII: Advent Children (2004),0.705494
5,Escaflowne: The Movie (Escaflowne) (2000),0.703741
6,47 Ronin (2013),0.697804
7,Legend of the Guardians: The Owls of Ga'Hoole ...,0.682183
8,Fantastic Beasts and Where to Find Them (2016),0.670305
9,"Fall, The (2006)",0.66951


**Analysis**:
Genome tags likely include:
- **Themes**: "environmental", "nature", "conflict"
- **Style**: "epic", "visually stunning", "japanese animation"
- **Tone**: "serious", "philosophical", "violent"

This demonstrates how semantic tags can capture complex thematic and stylistic elements that simple genre labels miss.


### Example 3: "Planet of the Apes (1968)"

<img src="../images/poster_apes.png" alt="Planet of the Apes poster" width="300">

**Expected Results**:
- Classic sci-fi films
- Dystopian/post-apocalyptic movies
- Thought-provoking science fiction

In [37]:
recommendations_apes = content_based_recommendations_ml('Planet of the Apes (1968)', num_recommendations=20)
print("Recommendations for 'Planet of the Apes (1968)':")
recommendations_apes

Recommendations for 'Planet of the Apes (1968)':


Unnamed: 0,title,similarity_score
0,Escape from the Planet of the Apes (1971),0.868425
1,Conquest of the Planet of the Apes (1972),0.825291
2,Things to Come (1936),0.812009
3,Snowpiercer (2013),0.805515
4,Outland (1981),0.798862
5,Silent Running (1972),0.797983
6,Sleep Dealer (2008),0.796751
7,Red Planet (2000),0.783328
8,Total Recall (1990),0.780007
9,Gattaca (1997),0.775492


**Analysis**:
Genome tags likely include:
- **Themes**: "dystopian", "post-apocalyptic", "social commentary"
- **Genre**: "sci-fi", "classic"
- **Tone**: "thought-provoking", "dark", "twist ending"

Beyond the two franchise sequels at the top, the list consists of unrelated science-fiction films such as Snowpiercer and Gattaca. This suggests that the genome tags match on shared **themes and tone**, not just on franchise membership or surface-level genre.

---

# 7. Comparison of All Three Approaches

### Feature Comparison

| Aspect | Netflix | TMDB | MovieLens 25M |
|--------|---------|------|---------------|
| **Primary Features** | Director, Cast, Genres, Continent | Ratings, Budget, Revenue, Runtime, Year | Genome Tags, Genres |
| **Feature Type** | Categorical (people, genres) | Numerical + Temporal | Semantic (themes, moods) |
| **Similarity Basis** | Creative team & content type | Commercial success & production scale | Thematic & emotional content |
| **Recommendation Style** | "More films like this in style/genre" | "Films with similar market performance" | "Films that feel similar" |
| **Strengths** | Captures artistic similarity | Captures commercial similarity | Captures subjective qualities |
| **Weaknesses** | Limited by metadata availability | Ignores creative aspects | Requires sophisticated tagging |
| **Best For** | Finding films by same creators | Finding quality/era matches | Finding thematically similar films |

### Key Differences in Results

**Netflix Approach**:
- ✅ Finds movies by same director/actors
- ✅ Strong genre consistency
- ✅ Captures franchise relationships
- ❌ Limited by metadata availability
- ❌ Cannot find "sleeper hits" with different teams

**TMDB Approach**:
- ✅ Finds commercially similar films
- ✅ Captures era/temporal patterns
- ✅ Identifies production scale similarities
- ❌ May recommend unrelated genres
- ❌ Ignores creative team entirely

**MovieLens Approach**:
- ✅ Captures subjective qualities (mood, themes)
- ✅ Finds thematically similar films across genres
- ✅ Rich semantic understanding
- ✅ Can discover unexpected connections
- ❌ Requires extensive tagging infrastructure
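- ❌ Limited coverage: only 13,816 of the 62,423 movies in this dataset have genome scores
- ❌ Tag quality depends on the user-contributed tags, ratings, and reviews from which the relevance scores are computed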

---