# TV Series Recommendation System

## Overview
This notebook builds a TV series recommendation system from The Movie Database (TMDb) data. It uses content-based filtering over series metadata: each series is represented by weighted TF-IDF features, and recommendations are ranked by content similarity combined with popularity and rating signals. No per-user data is involved, so the recommendations depend only on the reference series.

## Data Sources
- **Detailed Series Data**: Comprehensive information about TV series including title, air dates, genres, ratings, cast, and creators
- **TV Genre Mappings**: Mappings between genre IDs and their corresponding names
- **Data Storage**: All data is stored in MongoDB for efficient access and management

## Technical Approach

### 1. Data Preparation and Feature Engineering
The recommendation system employs a multi-stage preprocessing pipeline:
- Converting raw genre IDs to human-readable genre names
- Extracting first air date and start year information
- Extracting rich features from detailed series data:
  - Keywords that describe series themes and concepts
  - Creators and showrunners
  - Leading cast members (top 5 actors)
  - Production companies involved
  - Networks that broadcast the series
- Normalizing numerical features using MinMaxScaler:
  - Popularity
  - Vote average
  - Vote count
  - Number of seasons
  - Number of episodes
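
The normalization step can be sketched on a toy frame. The values below are invented for illustration, and the median fill mirrors how the notebook's code cells handle missing entries before scaling.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration); None marks a missing value.
toy = pd.DataFrame({
    "popularity": [10.0, 250.0, None, 40.0],
    "vote_average": [6.0, 8.5, 7.0, None],
})

# Fill gaps with each column's median, then rescale every column to [0, 1].
filled = toy.fillna(toy.median())
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(filled),
    columns=[f"{col}_scaled" for col in toy.columns],
    index=toy.index,
)
print(scaled.round(3))
```

Min-max scaling is sensitive to outliers: a single very popular series compresses every other value toward 0, which is worth keeping in mind when these columns feed into a score.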

### 2. Text Feature Processing
Multiple features are extracted into separate text columns for targeted processing:
- `name_text`: Series titles
- `genre_text`: Text representation of genres
- `overview_text`: Plot summaries and descriptions
- `keywords_text`: Thematic keywords from TMDb
- `creators_text`: Series creators and showrunners
- `cast_text`: Leading actors
- `companies_text`: Production studios
- `networks_text`: Broadcasting networks
- `status_text`: Current production status (created, but not vectorized in the TF-IDF step)
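
One consequence of joining list values with spaces is that multi-word names are split into separate tokens by the vectorizer. The sketch below uses hypothetical creator names; the underscore-fused variant is an assumed alternative, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical creator lists for two series (illustrative only).
creators = [["Alex Rivera"], ["Alex Chen", "Maria Rivera"]]

# Same join as the notebook: one space-separated string per series.
joined = [" ".join(names) for names in creators]

# Assumed alternative: fuse each full name into a single token first.
fused = [" ".join(n.replace(" ", "_") for n in names) for names in creators]

vocab_joined = sorted(TfidfVectorizer().fit(joined).vocabulary_)
vocab_fused = sorted(TfidfVectorizer().fit(fused).vocabulary_)

print(vocab_joined)  # first and last names become independent tokens
print(vocab_fused)   # each person stays a single token
```

With the plain join, the two toy series share the tokens `alex` and `rivera` even though they have no creator in common, so some similarity comes from name fragments rather than people.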

### 3. TF-IDF Vectorization with Feature Weighting
Instead of a single bag-of-words vector over all text, this system:
- Creates separate TF-IDF vectors for each feature category
- Applies different weights to each feature type to prioritize certain attributes:
  - Name weight: 2.5
  - Genre weight: 4.0
  - Overview weight: 1.5
  - Keywords weight: 9.0 (highest importance)
  - Creators weight: 2.0
  - Cast weight: 2.5
  - Production companies weight: 0.5
  - Networks weight: 1.5
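
A minimal sketch of the weighting step, using two toy feature columns with the genre and keyword weights from the list above. The text values are invented; only the weights come from the notebook.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy columns (made up); each list has one entry per series.
genre_text = ["drama mystery", "drama crime", "comedy"]
keywords_text = ["small town monster", "drug cartel", "office workplace"]

# One vectorizer per column, so each feature keeps its own vocabulary.
tfidf_genre = TfidfVectorizer().fit_transform(genre_text)
tfidf_keywords = TfidfVectorizer().fit_transform(keywords_text)

# Scale each block by its weight, then concatenate column-wise.
genre_weight, keywords_weight = 4.0, 9.0
weighted = hstack([
    genre_weight * tfidf_genre,
    keywords_weight * tfidf_keywords,
]).tocsr()

# Each TF-IDF row has unit length, so a combined row has length
# sqrt(4.0**2 + 9.0**2) when both blocks are non-empty.
row_norms = np.sqrt(np.asarray(weighted.multiply(weighted).sum(axis=1)).ravel())
print(weighted.shape)
print(row_norms)
```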

### 4. Similarity Calculation
- Combines weighted TF-IDF vectors to create a unified feature representation
- Computes cosine similarity between all series pairs
- Creates a mapping of series names to indices for efficient lookup
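
The effect of the weights on cosine similarity can be checked numerically. Because `TfidfVectorizer` L2-normalizes each non-empty row by default, the cosine over the concatenated vector works out to an average of the per-feature cosines weighted by the squared weights. The toy data below is invented, and the identity assumes every series has text in every feature.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy columns (made up); every row is non-empty in both features.
genre_text = ["drama mystery", "drama crime", "comedy"]
keywords_text = ["small town monster", "small town cartel", "office workplace"]

g = TfidfVectorizer().fit_transform(genre_text)
k = TfidfVectorizer().fit_transform(keywords_text)
w_g, w_k = 4.0, 9.0

# Cosine similarity over the weighted, concatenated features.
combined = cosine_similarity(hstack([w_g * g, w_k * k]).tocsr())

# Per-feature cosines averaged with squared weights.
expected = (w_g**2 * cosine_similarity(g) + w_k**2 * cosine_similarity(k)) / (w_g**2 + w_k**2)

print(np.allclose(combined, expected))
print(combined.round(3))
```

Under that assumption, influence grows with the square of the weight. With the eight weights listed in the previous section, keywords would contribute 81 of 118.25 units of squared weight, roughly 68% of the similarity.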

### 5. Hybrid Recommendation System
The final recommendation algorithm combines multiple factors:
- Content similarity (60%): Based on the cosine similarity of weighted features
- Popularity (20%): Favors more popular series
- User ratings (20%): Prioritizes highly-rated content
- Genre matching bonus: Adds up to 0.10 on top of the weighted sum, in proportion to the share of the reference series' genres that the candidate also has (so the final score can exceed 1.0)
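
As a worked example of the scoring rule, the sketch below applies the percentages above to one invented candidate. The function name `hybrid_score` and all input values are illustrative, not part of the notebook.

```python
def hybrid_score(similarity, popularity_scaled, vote_average_scaled,
                 candidate_genres, reference_genres):
    """Weighted sum of the three signals plus a genre-overlap bonus (sketch)."""
    base = (0.60 * similarity
            + 0.20 * popularity_scaled
            + 0.20 * vote_average_scaled)
    reference = set(reference_genres)
    # Fraction of the reference series' genres that the candidate shares.
    overlap = len(reference & set(candidate_genres)) / max(len(reference), 1)
    return base + 0.10 * overlap


# Invented inputs: similarity 0.50, scaled popularity 0.10, scaled rating 0.80,
# and two of the three reference genres matched.
score = hybrid_score(
    0.50, 0.10, 0.80,
    candidate_genres=["Drama", "Mystery"],
    reference_genres=["Drama", "Mystery", "Sci-Fi & Fantasy"],
)
print(round(score, 4))  # 0.30 + 0.02 + 0.16 + 0.10 * (2 / 3)
```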

## Recommendation Functions

### Content-Based Recommendations
- Provides recommendations based purely on content similarity
- Returns a one-row placeholder DataFrame when the series name is not found
- Excludes the reference series itself and any other entries that share its name
- Returns detailed information including similarity scores
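
The lookup described above can be sketched with a small similarity matrix. The catalogue and scores are invented, and `top_n_similar` is a hypothetical helper: unlike the notebook's function, it returns an empty frame for unknown names and resolves duplicate names by taking the first match.

```python
import numpy as np
import pandas as pd

# Toy catalogue with a duplicated name (all values invented).
df = pd.DataFrame({"name": ["Alpha", "Beta", "Gamma", "Alpha", "Delta"]})
sim = np.array([
    [1.0, 0.2, 0.7, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.2, 0.3],
    [0.7, 0.1, 1.0, 0.6, 0.5],
    [0.9, 0.2, 0.6, 1.0, 0.4],
    [0.4, 0.3, 0.5, 0.4, 1.0],
])


def top_n_similar(name, sim, df, top_n=2):
    """Return the top_n rows most similar to `name`, excluding same-named rows."""
    matches = np.flatnonzero(df["name"].to_numpy() == name)
    if matches.size == 0:
        return pd.DataFrame(columns=["name", "similarity"])
    idx = matches[0]  # first row with this name
    scores = pd.Series(sim[idx], index=df.index)
    scores = scores[df["name"] != name]  # drop the query and its namesakes
    top = scores.nlargest(top_n)
    return pd.DataFrame({"name": df.loc[top.index, "name"], "similarity": top})


recs = top_n_similar("Alpha", sim, df)
print(recs)
```

Filtering by name rather than by index alone keeps the second "Alpha" row out of the results even though it is the closest match numerically.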

### Hybrid Recommendations
- Combines content similarity with popularity and ratings
- Adds genre overlap bonus to further refine recommendations
- Returns a final hybrid score that balances all factors
- Sorts results by the hybrid score to present the most relevant recommendations first
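
The notebook's hybrid function fetches three times as many content-based candidates as requested (`top_n * 3`) before re-sorting. The sketch below shows why the larger pool matters, using invented candidates and leaving out the genre bonus for brevity.

```python
import pandas as pd

# Invented candidate pool, already ordered by content similarity.
pool = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "similarity": [0.90, 0.85, 0.80, 0.75, 0.70, 0.65],
    "popularity_scaled": [0.01, 0.02, 0.60, 0.05, 0.90, 0.03],
    "vote_average_scaled": [0.50, 0.55, 0.90, 0.60, 0.95, 0.40],
})
top_n = 2

# Content-only ranking simply takes the head of the pool.
content_only = pool.head(top_n)["name"].tolist()

# Hybrid ranking re-scores the whole pool, then takes the head.
pool["hybrid_score"] = (
    0.60 * pool["similarity"]
    + 0.20 * pool["popularity_scaled"]
    + 0.20 * pool["vote_average_scaled"]
)
reranked = pool.sort_values("hybrid_score", ascending=False).head(top_n)

print(content_only)
print(reranked[["name", "hybrid_score"]])
```

Here candidates C and E overtake A and B once popularity and rating are included. Had the pool been limited to `top_n` rows, they could never have surfaced.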

## Evaluation and Testing
The system is spot-checked on a set of popular series spanning several genres:
- "Stranger Things"
- "Game of Thrones"
- "Breaking Bad"
- "The Office"
- "Friends"
- "The Blacklist"
- "Narcos"

For each series, the system generates the top recommendations based on content similarity, adjusted for popularity and ratings.

## Visualization
The notebook uses a consistent visualization style:
- FiveThirtyEight plotting style for a professional appearance
- Viridis color palette for better readability
- Figures rendered at 100 DPI
- Consistent figure size (12x8)

## Future Improvements
Potential enhancements to explore:
- Incorporating collaborative filtering techniques
- Adding personalization based on user viewing history
- Implementing diversity measures to avoid recommendation echo chambers
- Accounting for series length and completion status in recommendations
- Real-time model updates as new series are released

## Conclusion
This notebook demonstrates a content-based TV series recommendation system built on detailed TMDb metadata. Per-feature weighting gives the features judged most relevant (keywords and genres) the strongest influence on similarity, while the hybrid score balances content similarity against popularity and rating.

In [1]:
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from dotenv import load_dotenv
import os
import joblib
from tqdm import tqdm

# Set up styling
plt.style.use('fivethirtyeight')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

# Load environment variables; os.path.join keeps the path portable across OSes
load_dotenv(dotenv_path=os.path.join("..", "data", ".env"))

# Connect to MongoDB
mongo_uri = os.getenv('MONGO_URI')
client = MongoClient(mongo_uri)
db = client['imdb_recommender']

# Load all necessary data
print("Loading data from MongoDB...")
detailed_series = list(db.detailed_series.find())
tv_genres = list(db.tv_genres.find())

# Convert to DataFrames
series_df = pd.DataFrame(detailed_series)
genre_map = {genre['id']: genre['name'] for genre in tv_genres}

print(f"Loaded {len(series_df)} detailed series records")

Loading data from MongoDB...
Loaded 8596 detailed series records


In [2]:
# Data Preparation and Feature Engineering
print("\nPreparing and enriching data...")

# 1. Clean and prepare basic series data
# Convert first_air_date to datetime
if 'first_air_date' in series_df.columns:
    series_df['first_air_date'] = pd.to_datetime(series_df['first_air_date'], errors='coerce')
    series_df['start_year'] = series_df['first_air_date'].dt.year

# Extract genre names
if 'genres' in series_df.columns:
    def extract_genre_names(genres_list):
        if not isinstance(genres_list, list):
            return []
        return [genre['name'] for genre in genres_list if isinstance(genre, dict) and 'name' in genre]
    
    series_df['genre_names'] = series_df['genres'].apply(extract_genre_names)

# 2. Extract features from detailed data
print("Extracting features from detailed series data...")
# Initialize new columns for extracted data
series_df['keywords_list'] = ""
series_df['creators_list'] = ""
series_df['cast_list'] = ""
series_df['production_companies_list'] = ""
series_df['networks_list'] = ""

# Extract keywords
if 'keywords' in series_df.columns:
    def extract_keywords(keywords_dict):
        if not isinstance(keywords_dict, dict) or 'results' not in keywords_dict:
            return []
        return [k['name'] for k in keywords_dict['results'] 
                if isinstance(k, dict) and 'name' in k]
    
    series_df['keywords_list'] = series_df['keywords'].apply(extract_keywords)

# Extract creators
if 'created_by' in series_df.columns:
    def extract_creators(creators_list):
        if not isinstance(creators_list, list):
            return []
        return [c['name'] for c in creators_list 
                if isinstance(c, dict) and 'name' in c]
    
    series_df['creators_list'] = series_df['created_by'].apply(extract_creators)

# Extract cast (top 5 billed actors)
if 'credits' in series_df.columns:
    def extract_cast(credits_dict):
        if not isinstance(credits_dict, dict) or 'cast' not in credits_dict:
            return []
        return [c['name'] for c in credits_dict['cast'][:5] 
                if isinstance(c, dict) and 'name' in c]
    
    series_df['cast_list'] = series_df['credits'].apply(extract_cast)

# Extract production companies
if 'production_companies' in series_df.columns:
    def extract_companies(companies_list):
        if not isinstance(companies_list, list):
            return []
        return [c['name'] for c in companies_list 
                if isinstance(c, dict) and 'name' in c]
    
    series_df['production_companies_list'] = series_df['production_companies'].apply(extract_companies)

# Extract networks
if 'networks' in series_df.columns:
    def extract_networks(networks_list):
        if not isinstance(networks_list, list):
            return []
        return [n['name'] for n in networks_list 
                if isinstance(n, dict) and 'name' in n]
    
    series_df['networks_list'] = series_df['networks'].apply(extract_networks)


Preparing and enriching data...
Extracting features from detailed series data...


In [3]:
# 3. Feature scaling for numerical features
print("\nScaling numerical features...")
numerical_features = ['popularity', 'vote_average', 'vote_count']
numerical_features = [f for f in numerical_features if f in series_df.columns]

if numerical_features:
    scaler = MinMaxScaler()
    numerical_df = series_df[numerical_features].copy()
    
    # Handle missing values
    for col in numerical_features:
        numerical_df[col] = numerical_df[col].fillna(numerical_df[col].median())
    
    # Apply scaling
    scaled_features = scaler.fit_transform(numerical_df)
    scaled_df = pd.DataFrame(scaled_features, 
                           columns=[f'{col}_scaled' for col in numerical_features],
                           index=numerical_df.index)
    
    # Add scaled features back to main dataframe
    for col in scaled_df.columns:
        series_df[col] = scaled_df[col].values

# Scale other TV-specific features
for feature in ['number_of_seasons', 'number_of_episodes']:
    if feature in series_df.columns and series_df[feature].notna().sum() > 0:
        # Fill missing values with median for scaling
        temp_values = series_df[feature].fillna(series_df[feature].median())
        # Scale and add as new column
        series_df[f'{feature}_scaled'] = MinMaxScaler().fit_transform(
            temp_values.values.reshape(-1, 1)
        ).ravel()  # flatten the (n, 1) result into a 1-D column


Scaling numerical features...


In [13]:
# 4. Create text features for TF-IDF
print("\nCreating text features for similarity calculation...")
series_df['name_text'] = series_df['name'].fillna('')
series_df['genre_text'] = series_df['genre_names'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
series_df['overview_text'] = series_df['overview'].fillna('')
series_df['keywords_text'] = series_df['keywords_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
series_df['creators_text'] = series_df['creators_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
series_df['cast_text'] = series_df['cast_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
series_df['companies_text'] = series_df['production_companies_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
series_df['networks_text'] = series_df['networks_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
series_df['status_text'] = series_df['status'].fillna('')

# 5. Create separate TF-IDF vectors for each feature category
print("Creating TF-IDF vectors...")
tfidf_name = TfidfVectorizer(stop_words='english').fit_transform(series_df['name_text'])
tfidf_genre = TfidfVectorizer(stop_words='english').fit_transform(series_df['genre_text'])
tfidf_overview = TfidfVectorizer(stop_words='english', max_features=5000).fit_transform(series_df['overview_text'])
tfidf_keywords = TfidfVectorizer(stop_words='english').fit_transform(series_df['keywords_text'])
tfidf_creators = TfidfVectorizer(stop_words='english').fit_transform(series_df['creators_text'])
tfidf_cast = TfidfVectorizer(stop_words='english').fit_transform(series_df['cast_text'])
tfidf_companies = TfidfVectorizer(stop_words='english').fit_transform(series_df['companies_text'])
tfidf_networks = TfidfVectorizer(stop_words='english').fit_transform(series_df['networks_text'])

# 6. Weight assignment (a larger weight gives that feature more influence)
from scipy.sparse import hstack

name_weight = 2.5
genre_weight = 4.0
overview_weight = 1.5
keywords_weight = 9.0
creators_weight = 2.0
cast_weight = 2.5
companies_weight = 0.5
networks_weight = 1.5

# Combine with weights
weighted_features = hstack([
    name_weight * tfidf_name,
    genre_weight * tfidf_genre,
    overview_weight * tfidf_overview,
    keywords_weight * tfidf_keywords,
    creators_weight * tfidf_creators,
    cast_weight * tfidf_cast,
    companies_weight * tfidf_companies,
    networks_weight * tfidf_networks
])

# 7. Calculate cosine similarity
print("Calculating cosine similarity matrix...")
cosine_sim = cosine_similarity(weighted_features, weighted_features)
print(f"Cosine similarity matrix shape: {cosine_sim.shape}")

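# 8. Create a mapping of series names to row indices.
# Names may repeat, so keep the first row per name; otherwise
# indices[name] would return a Series instead of a single integer.
indices = pd.Series(series_df.index, index=series_df['name'])
indices = indices[~indices.index.duplicated(keep='first')]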


Creating text features for similarity calculation...
Creating TF-IDF vectors...
Calculating cosine similarity matrix...
Cosine similarity matrix shape: (8596, 8596)


In [15]:
# 10. Define recommendation functions
def get_content_based_recommendations(name, cosine_sim, df, indices, top_n=10):
    """Get series recommendations based purely on content similarity"""
    # Check if series exists
    if name not in indices:
        return pd.DataFrame({'name': [f"Series '{name}' not found in database"],
                           'similarity': [0]})
    
    try:
        # Get the index of the series
        idx = indices[name]
        
        # Get similarity scores for all series, skipping the series itself
        # and any other rows that share its name
        sim_row = np.asarray(cosine_sim[idx]).ravel()
        names = df['name'].to_numpy()
        sim_scores = [(i, float(sim_row[i])) for i in range(len(df))
                      if i != idx and names[i] != name]
        
        # Sort based on similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        
        # Get top N most similar series
        sim_scores = sim_scores[:top_n]
        
        # Get series indices
        series_indices = [i[0] for i in sim_scores]
        
        # Create result dataframe
        columns_to_include = ['id', 'name', 'genre_names', 'vote_average', 'popularity', 'poster_path']
        result = df.iloc[series_indices][columns_to_include].copy()
        result['similarity'] = [i[1] for i in sim_scores]
        
        return result
    except Exception as e:
        print(f"Error in get_content_based_recommendations: {e}")
        return pd.DataFrame({'name': [f"Error finding recommendations: {str(e)}"],
                           'similarity': [0]})

def get_hybrid_recommendations(name, df=series_df, indices=indices, top_n=10):
    """Get recommendations using a hybrid approach combining content and metadata"""
    # Get content-based recommendations first
    content_recs = get_content_based_recommendations(name, cosine_sim, df, indices, top_n=top_n*3)
    
    # Placeholder results (series not found, or an error) lack the metadata columns
    if len(content_recs) == 0 or 'genre_names' not in content_recs.columns:
        return content_recs
    
    # Get the original indices from content_recs
    original_indices = content_recs.index
    
    # Add the scaled features from the original dataframe
    content_recs['popularity_scaled'] = df.loc[original_indices, 'popularity_scaled'].values
    content_recs['vote_average_scaled'] = df.loc[original_indices, 'vote_average_scaled'].values
    
    # Create hybrid score
    content_recs['hybrid_score'] = (
        # Content similarity (60%)
        0.60 * content_recs['similarity'] +
        # Popularity (20%)
        0.20 * content_recs['popularity_scaled'] +
        # Rating (20%)
        0.20 * content_recs['vote_average_scaled']
    )
    
    # Get the reference series' genres
    series_idx = indices[name]
    
    # Safely extract genre names and convert to a set
    genres = df.iloc[series_idx]['genre_names']
    # Ensure we have a flat list of strings
    if isinstance(genres, list):
        # Flatten any nested lists and make sure all items are strings
        flat_genres = []
        for item in genres:
            if isinstance(item, list):
                flat_genres.extend([str(subitem) for subitem in item])
            else:
                flat_genres.append(str(item))
        target_genres = set(flat_genres)
    else:
        target_genres = set()
    
    # Add genre matching bonus
    def calculate_genre_overlap(genres):
        if not target_genres or not genres:
            return 0
            
        # Ensure genres is a flat list of strings
        if isinstance(genres, list):
            # Flatten any nested lists
            flat_genres = []
            for item in genres:
                if isinstance(item, list):
                    flat_genres.extend([str(subitem) for subitem in item])
                else:
                    flat_genres.append(str(item))
            genre_set = set(flat_genres)
        else:
            genre_set = set([str(genres)])
            
        overlap = genre_set.intersection(target_genres)
        return len(overlap) / max(len(target_genres), 1)
    
    content_recs['genre_score'] = content_recs['genre_names'].apply(calculate_genre_overlap)
    
    # Adjust hybrid score with genre bonus
    content_recs['hybrid_score'] = content_recs['hybrid_score'] + (0.1 * content_recs['genre_score'])
    
    # Sort by hybrid score and return top N
    return content_recs.sort_values('hybrid_score', ascending=False).head(top_n)

In [16]:
# 11. Test with popular series
print("\nTesting recommendation system...")
test_titles = [
    'Stranger Things',
    'Game of Thrones',
    'Breaking Bad',
    'The Office',
    'Friends',
    'The Blacklist',
    'Narcos'
]

for title in test_titles:
    if title in indices:
        print(f"\nRecommendations for '{title}':")
        recommendations = get_hybrid_recommendations(title)
        for i, row in recommendations.iterrows():
            print(f"{row['name']} - Score: {row['hybrid_score']:.3f}, "
                 f"Rating: {row['vote_average']}, "
                 f"Genres: {', '.join(row['genre_names'])}")
    else:
        print(f"\nSeries '{title}' not found in database")


Testing recommendation system...

Recommendations for 'Stranger Things':
FROM - Score: 0.442, Rating: 8.189, Genres: Mystery, Drama, Sci-Fi & Fantasy
Nazar - Score: 0.425, Rating: 9.0, Genres: Mystery, Sci-Fi & Fantasy
American Horror Story - Score: 0.424, Rating: 8.116, Genres: Drama, Mystery, Sci-Fi & Fantasy
Dark - Score: 0.424, Rating: 8.4, Genres: Crime, Drama, Sci-Fi & Fantasy, Mystery
Yonimo Kimyou na Monogatari Tokubetsuhen - Score: 0.423, Rating: 9.5, Genres: Mystery, Sci-Fi & Fantasy
The Leftovers - Score: 0.419, Rating: 7.633, Genres: Sci-Fi & Fantasy, Drama
Don't Come Home - Score: 0.407, Rating: 7.8, Genres: Mystery, Drama, Sci-Fi & Fantasy
Guillermo del Toro's Cabinet of Curiosities - Score: 0.402, Rating: 7.5, Genres: Drama, Mystery, Sci-Fi & Fantasy
Haven - Score: 0.399, Rating: 7.536, Genres: Drama, Sci-Fi & Fantasy, Mystery
The Twilight Zone - Score: 0.396, Rating: 7.7, Genres: Sci-Fi & Fantasy, Drama, Mystery

Recommendations for 'Game of Thrones':
House of the Drag

In [17]:
# 12. Save the model
print("\nSaving recommendation model...")

model_components = {
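    # Save TF-IDF vectorizers. They are re-fit here because the fitted objects
    # were not kept earlier; same data and settings give the same vocabularies.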
    'tfidf_vectorizers': {
        'name': TfidfVectorizer(stop_words='english').fit(series_df['name_text']),
        'genre': TfidfVectorizer(stop_words='english').fit(series_df['genre_text']),
        'overview': TfidfVectorizer(stop_words='english', max_features=5000).fit(series_df['overview_text']),
        'keywords': TfidfVectorizer(stop_words='english').fit(series_df['keywords_text']),
        'creators': TfidfVectorizer(stop_words='english').fit(series_df['creators_text']),
        'cast': TfidfVectorizer(stop_words='english').fit(series_df['cast_text']),
        'companies': TfidfVectorizer(stop_words='english').fit(series_df['companies_text']),
        'networks': TfidfVectorizer(stop_words='english').fit(series_df['networks_text'])
    },
    
    # Save feature weights
    'feature_weights': {
        'name_weight': name_weight,
        'genre_weight': genre_weight,
        'overview_weight': overview_weight,
        'keywords_weight': keywords_weight,
        'creators_weight': creators_weight,
        'cast_weight': cast_weight,
        'companies_weight': companies_weight,
        'networks_weight': networks_weight
    },
    
    # Save cosine similarity matrix
    'cosine_sim': cosine_sim,
    
    # Save indices mapping
    'indices': indices,
    
    # Save the series DataFrame with all necessary columns
    'series_df': series_df[[
        'id', 'name', 'genre_names', 'vote_average', 'popularity', 'poster_path', 
        'overview', 'popularity_scaled', 'vote_average_scaled', 'vote_count_scaled',
        'name_text', 'genre_text', 'overview_text', 'keywords_text', 
        'creators_text', 'cast_text', 'companies_text', 'networks_text'
    ]]
}

joblib.dump(model_components, '../models/series_recommender.joblib')
print("Model saved successfully to 'models/series_recommender.joblib'")


Saving recommendation model...
Model saved successfully to 'models/series_recommender.joblib'


In [5]:
# Rank series by the product of scaled popularity, rating, and vote count
series_df['score'] = series_df['popularity_scaled'] * series_df['vote_average_scaled'] * series_df['vote_count_scaled']
top_6_series = series_df.nlargest(6, 'score')
print(top_6_series[['name', 'score']])


                name     score
170  Game of Thrones  0.125796
121   Grey's Anatomy  0.076292
134   Rick and Morty  0.067129
186       Squid Game  0.061196
273     Breaking Bad  0.059713
176     The Simpsons  0.056471
