# Movie Recommendation System

## Overview
This notebook develops a comprehensive movie recommendation system using data from The Movie Database (TMDb). It implements a sophisticated content-based filtering approach with weighted features to generate personalized movie recommendations based on content similarity and user preferences.

## Data Sources
- **Basic Movie Data**: General information about movies including title, release date, genres, and ratings
- **Detailed Movie Data**: Extended information like cast, crew, keywords, budget, and production companies
- **Genre Mappings**: Mappings between genre IDs and names

## Technical Approach

### 1. Data Preparation and Feature Engineering
The recommendation system employs a multi-stage preprocessing pipeline:
- Converting raw genre IDs to human-readable genre names
- Extracting rich features from detailed movie data:
  - Keywords that describe movie themes and concepts
  - Directors and their filmography patterns
  - Leading cast members (top 5 actors)
  - Production companies
- Normalizing numerical features (popularity, vote average, runtime) using MinMaxScaler

### 2. Text Feature Processing
Multiple features are extracted into separate text columns for targeted processing:
- `title_text`: Movie titles
- `genre_text`: Text representation of genres
- `overview_text`: Plot summaries and descriptions
- `keywords_text`: Thematic keywords from TMDb
- `directors_text`: Film directors
- `cast_text`: Leading actors
- `companies_text`: Production studios

### 3. TF-IDF Vectorization with Feature Weighting
Instead of a simple bag-of-words approach, this system:
- Creates separate TF-IDF vectors for each feature category
- Applies different weights to each feature type to prioritize certain attributes:
  - Title weight: 2.5
  - Genre weight: 2.0
  - Overview weight: 1.0
  - Keywords weight: 10.0 (highest importance)
  - Directors weight: 1.5
  - Cast weight: 2.5
  - Production companies weight: 0.5

### 4. Similarity Calculation
- Combines weighted TF-IDF vectors to create a unified feature representation
- Computes cosine similarity between all movie pairs
- Handles duplicate movie titles by keeping only the most popular version

### 5. Hybrid Recommendation System
The final recommendation algorithm combines multiple factors:
- Content similarity (60%): Based on the cosine similarity of weighted features
- Popularity (10%): Favors more popular movies
- User ratings (30%): Prioritizes highly-rated content
- Genre matching bonus: Additional points for matching genres with the reference movie

## Evaluation and Testing
The system is tested with a diverse set of popular movies to ensure quality recommendations:
- "The Godfather"
- "Spider-Man"
- "Interstellar"
- "The Dark Knight Rises"
- "Django Unchained"
- "Se7en"

For each movie, the system generates the top 10 most similar movies based on content, adjusted for popularity and ratings.

## Model Persistence
The trained recommendation model is saved for future use, including:
- TF-IDF vectorizers for each feature type
- Feature weights
- Cosine similarity matrix
- Movie index mappings
- Processed movie dataframe with all features

## Future Improvements
Potential enhancements to explore:
- Incorporating collaborative filtering techniques
- Adding personalization based on user viewing history
- Implementing diversity measures to avoid recommendation echo chambers
- Real-time model updates as new movies are released

## Conclusion
This notebook demonstrates an advanced, content-based movie recommendation system that leverages detailed movie metadata to provide high-quality recommendations. The sophisticated weighting approach ensures that the most relevant features (like keywords and cast) have a stronger influence on recommendations.

In [2]:
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from dotenv import load_dotenv
import os
import joblib
from tqdm import tqdm

# Set up styling
plt.style.use('fivethirtyeight')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

# Connect to MongoDB
load_dotenv(dotenv_path="..\\data\\.env")

mongo_uri = os.getenv('MONGO_URI')
client = MongoClient(mongo_uri)
db = client['imdb_recommender']

# Load all necessary data
print("Loading data from MongoDB...")
detailed_movies = list(db.detailed_movies.find())
movie_genres = list(db.movie_genres.find())

# Convert to DataFrames
movies_df = pd.DataFrame(detailed_movies)
genre_map = {genre['id']: genre['name'] for genre in movie_genres}



print(f"Loaded {len(movies_df)} movies")


Loading data from MongoDB...
Loaded 9464 movies


In [3]:
movies_df.head()

Unnamed: 0,_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,origin_country,...,status,tagline,title,video,vote_average,vote_count,credits,keywords,similar,videos
0,67d8210f6df8a70f77505b5f,False,/9nhjGaFLKtddDPtPaX5EmKqsWdH.jpg,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 878,...",https://tv.apple.com/movie/umc.cmc.26o403koqo2...,950396,tt13654226,[US],...,Released,The world's most dangerous secret lies between...,The Gorge,False,7.8,1649,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 8122, 'name': 'canyon'}, ...","{'page': 1, 'results': [{'adult': False, 'back...","{'results': [{'iso_639_1': 'en', 'iso_3166_1':..."
1,67d8210f6df8a70f77505b60,False,/gFFqWsjLjRfipKzlzaYPD097FNC.jpg,,25000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",https://justwatch.pro/movie/1126166/flight-risk,1126166,tt10078772,[US],...,Released,Y'all need a pilot?,Flight Risk,False,6.0,375,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 3800, 'name': 'airplane'}...","{'page': 1, 'results': [{'adult': False, 'back...","{'results': [{'iso_639_1': 'en', 'iso_3166_1':..."
2,67d8210f6df8a70f77505b61,False,/kEYWal656zP5Q2Tohm91aw6orlT.jpg,,6000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",https://anora.film,1064213,tt28607951,[US],...,Released,Love is a hustle.,Anora,False,7.099,1379,"{'cast': [{'adult': False, 'gender': 1, 'id': ...","{'keywords': [{'id': 613, 'name': 'new year's ...","{'page': 1, 'results': [{'adult': False, 'back...","{'results': [{'iso_639_1': 'en', 'iso_3166_1':..."
3,67d8210f6df8a70f77505b62,False,/zo8CIjJ2nfNOevqNajwMRO6Hwka.jpg,"{'id': 1241984, 'name': 'Moana Collection', 'p...",150000000,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",https://movies.disney.com/moana-2,1241982,tt13622970,[US],...,Released,The ocean is calling them back.,Moana 2,False,7.2,1773,"{'cast': [{'adult': False, 'gender': 1, 'id': ...","{'keywords': [{'id': 658, 'name': 'sea'}, {'id...","{'page': 1, 'results': [{'adult': False, 'back...","{'results': [{'iso_639_1': 'en', 'iso_3166_1':..."
4,67d8210f6df8a70f77505b63,False,/1w8kutrRucTd3wlYyu5QlUDMiG1.jpg,"{'id': 762512, 'name': 'The Lion King (Reboot)...",200000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 10751...",https://movies.disney.com/mufasa-the-lion-king,762509,tt13186482,[US],...,Released,The story of an orphan who would be king.,Mufasa: The Lion King,False,7.5,1534,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 6054, 'name': 'friendship...","{'page': 1, 'results': [{'adult': False, 'back...","{'results': [{'iso_639_1': 'en', 'iso_3166_1':..."


In [4]:
# 1. Clean and prepare basic movie data
# Convert release_date to datetime
if 'release_date' in movies_df.columns:
    movies_df['release_date'] = pd.to_datetime(movies_df['release_date'], errors='coerce')
    movies_df['release_year'] = movies_df['release_date'].dt.year

# Extract genre names
if 'genres' in movies_df.columns:
    def extract_genre_names(genres_list):
        if not isinstance(genres_list, list):
            return []
        return [genre['name'] for genre in genres_list if isinstance(genre, dict) and 'name' in genre]
    
    movies_df['genre_names'] = movies_df['genres'].apply(extract_genre_names)

In [5]:
# 2. Extract features from detailed data
print("Extracting features from detailed movie data...")
# Initialize new columns for extracted data
movies_df['keywords_list'] = ""
movies_df['directors'] = ""
movies_df['cast_list'] = ""
movies_df['production_companies_list'] = ""

# Extract keywords
if 'keywords' in movies_df.columns:
    def extract_keywords(keywords_dict):
        if not isinstance(keywords_dict, dict) or 'keywords' not in keywords_dict:
            return []
        return [k['name'] for k in keywords_dict['keywords'] 
                if isinstance(k, dict) and 'name' in k]
    
    movies_df['keywords_list'] = movies_df['keywords'].apply(extract_keywords)

# Extract directors
if 'credits' in movies_df.columns:
    def extract_directors(credits_dict):
        if not isinstance(credits_dict, dict) or 'crew' not in credits_dict:
            return []
        return [c['name'] for c in credits_dict['crew'] 
                if isinstance(c, dict) and c.get('job') == 'Director']
    
    movies_df['directors'] = movies_df['credits'].apply(extract_directors)

# Extract cast
if 'credits' in movies_df.columns:
    def extract_cast(credits_dict):
        if not isinstance(credits_dict, dict) or 'cast' not in credits_dict:
            return []
        return [c['name'] for c in credits_dict['cast'][:5] 
                if isinstance(c, dict) and 'name' in c]
    
    movies_df['cast_list'] = movies_df['credits'].apply(extract_cast)

# Extract production companies
if 'production_companies' in movies_df.columns:
    def extract_companies(companies_list):
        if not isinstance(companies_list, list):
            return []
        return [c['name'] for c in companies_list 
                if isinstance(c, dict) and 'name' in c]
    
    movies_df['production_companies_list'] = movies_df['production_companies'].apply(extract_companies)

# 3. Feature scaling for numerical features
print("\nScaling numerical features...")
numerical_features = ['popularity', 'vote_average', 'vote_count']
numerical_features = [f for f in numerical_features if f in movies_df.columns]

if numerical_features:
    scaler = MinMaxScaler()
    numerical_df = movies_df[numerical_features].copy()
    
    # Handle missing values
    for col in numerical_features:
        numerical_df[col] = numerical_df[col].fillna(numerical_df[col].median())
    
    # Apply scaling
    scaled_features = scaler.fit_transform(numerical_df)
    scaled_df = pd.DataFrame(scaled_features, 
                           columns=[f'{col}_scaled' for col in numerical_features],
                           index=numerical_df.index)
    
    # Add scaled features back to main dataframe
    for col in scaled_df.columns:
        movies_df[col] = scaled_df[col].values

# Scale runtime, budget and revenue if available
for feature in ['runtime', 'budget', 'revenue']:
    if feature in movies_df.columns and movies_df[feature].notna().sum() > 0:
        # Fill missing values with median for scaling
        temp_values = movies_df[feature].fillna(movies_df[feature].median())
        # Scale and add as new column
        movies_df[f'{feature}_scaled'] = MinMaxScaler().fit_transform(
            temp_values.values.reshape(-1, 1)
        )

Extracting features from detailed movie data...

Scaling numerical features...


In [6]:
movies_df.head()

Unnamed: 0,_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,origin_country,...,keywords_list,directors,cast_list,production_companies_list,popularity_scaled,vote_average_scaled,vote_count_scaled,runtime_scaled,budget_scaled,revenue_scaled
0,67d8210f6df8a70f77505b5f,False,/9nhjGaFLKtddDPtPaX5EmKqsWdH.jpg,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 878,...",https://tv.apple.com/movie/umc.cmc.26o403koqo2...,950396,tt13654226,[US],...,"[canyon, fog, romance, tower, elite sniper, li...",[Scott Derrickson],"[Miles Teller, Anya Taylor-Joy, Sigourney Weav...","[Skydance Media, Crooked Highway, Apple Studios]",1.0,0.78,0.044414,0.300948,0.0,0.0
1,67d8210f6df8a70f77505b60,False,/gFFqWsjLjRfipKzlzaYPD097FNC.jpg,,25000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",https://justwatch.pro/movie/1126166/flight-risk,1126166,tt10078772,[US],...,"[airplane, pilot, flight, airplane accident, a...",[Mel Gibson],"[Mark Wahlberg, Michelle Dockery, Topher Grace...","[Davis Entertainment, Icon Productions, Hammer...",0.950821,0.6,0.0101,0.21564,0.05,0.013825
2,67d8210f6df8a70f77505b61,False,/kEYWal656zP5Q2Tohm91aw6orlT.jpg,,6000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",https://anora.film,1064213,tt28607951,[US],...,"[new year's eve, new york city, marriage, boar...",[Sean Baker],"[Mikey Madison, Mark Eydelshteyn, Yura Borisov...","[Cre Film, FilmNation Entertainment]",0.744864,0.7099,0.037142,0.329384,0.012,0.01402
3,67d8210f6df8a70f77505b62,False,/zo8CIjJ2nfNOevqNajwMRO6Hwka.jpg,"{'id': 1241984, 'name': 'Moana Collection', 'p...",150000000,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",https://movies.disney.com/moana-2,1241982,tt13622970,[US],...,"[sea, ocean, villain, musical, sequel, duringc...","[David G. Derrick Jr., Jason Hand, Dana Ledoux...","[Auliʻi Cravalho, Dwayne Johnson, Hualālai Chu...","[Walt Disney Pictures, Walt Disney Animation S...",0.666697,0.72,0.047754,0.234597,0.3,0.357217
4,67d8210f6df8a70f77505b63,False,/1w8kutrRucTd3wlYyu5QlUDMiG1.jpg,"{'id': 762512, 'name': 'The Lion King (Reboot)...",200000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 10751...",https://movies.disney.com/mufasa-the-lion-king,762509,tt13186482,[US],...,"[friendship, paradise, lion, musical, prequel,...",[Barry Jenkins],"[Aaron Pierre, Kelvin Harrison, Jr., Tiffany B...",[Walt Disney Pictures],0.684713,0.75,0.041317,0.279621,0.4,0.23949


In [39]:
# 4. Create text features for TF-IDF
print("\nCreating text features for similarity calculation...")
movies_df['title_text'] = movies_df['title'].fillna('')
movies_df['genre_text'] = movies_df['genre_names'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
movies_df['overview_text'] = movies_df['overview'].fillna('')
movies_df['keywords_text'] = movies_df['keywords_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
movies_df['directors_text'] = movies_df['directors'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
movies_df['cast_text'] = movies_df['cast_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')
movies_df['companies_text'] = movies_df['production_companies_list'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '')

# 5. Create separate TF-IDF vectors for each feature category
print("Creating TF-IDF vectors...")
tfidf_title = TfidfVectorizer(stop_words='english').fit_transform(movies_df['title_text'])
tfidf_genre = TfidfVectorizer(stop_words='english').fit_transform(movies_df['genre_text'])
tfidf_overview = TfidfVectorizer(stop_words='english', max_features=5000).fit_transform(movies_df['overview_text'])
tfidf_keywords = TfidfVectorizer(stop_words='english').fit_transform(movies_df['keywords_text'])
tfidf_directors = TfidfVectorizer(stop_words='english').fit_transform(movies_df['directors_text'])
tfidf_cast = TfidfVectorizer(stop_words='english').fit_transform(movies_df['cast_text'])
tfidf_companies = TfidfVectorizer(stop_words='english').fit_transform(movies_df['companies_text'])

# 6. Weight assignment
from scipy.sparse import hstack

# Weight assignment (adjust these to change importance of each feature)
title_weight = 2.5
genre_weight = 3.5
overview_weight = 1.0
keywords_weight = 10.0
directors_weight = 1.5
cast_weight = 2.5
companies_weight = 0.5

# Combine with weights
weighted_features = hstack([
    title_weight * tfidf_title,
    genre_weight * tfidf_genre,
    overview_weight * tfidf_overview,
    keywords_weight * tfidf_keywords,
    directors_weight * tfidf_directors,
    cast_weight * tfidf_cast,
    companies_weight * tfidf_companies
])

# 7. Calculate sparse cosine similarity matrix that mimics the original format
print("Calculating sparse cosine similarity matrix (keeping only top 15 per movie)...")
import numpy as np
from scipy.sparse import csr_matrix

# Number of movies
num_movies = weighted_features.shape[0]

# Create empty sparse matrix with same dimensions as the original
row_indices = []
col_indices = []
data_values = []

# For each movie, calculate similarity with all others and keep only top 15
for i in range(num_movies):
    # Calculate similarity for current movie with all others
    row_similarities = cosine_similarity(weighted_features[i:i+1], weighted_features).flatten()
    
    # Find indices of top 15 most similar movies (excluding self)
    # Set self-similarity to -1 to exclude it
    row_similarities[i] = -1
    top_indices = np.argsort(row_similarities)[-15:]
    
    # Add these to our sparse matrix data
    for idx in top_indices:
        # Only include positive similarities
        if row_similarities[idx] > 0:
            row_indices.append(i)
            col_indices.append(idx)
            data_values.append(row_similarities[idx])
    
    # Print progress
    if (i+1) % 100 == 0:
        print(f"Processed {i+1}/{num_movies} movies")

# Create the sparse matrix
sparse_cosine_sim = csr_matrix((data_values, (row_indices, col_indices)), 
                              shape=(num_movies, num_movies))

# This sparse_cosine_sim can be used just like your original cosine_sim
cosine_sim = sparse_cosine_sim

print("Sparse similarity matrix created")

# 8. Create a mapping of movie titles to indices
indices = pd.Series(movies_df.index, index=movies_df['title'])


Creating text features for similarity calculation...
Creating TF-IDF vectors...
Calculating sparse cosine similarity matrix (keeping only top 15 per movie)...
Processed 100/9464 movies
Processed 200/9464 movies
Processed 300/9464 movies
Processed 400/9464 movies
Processed 500/9464 movies
Processed 600/9464 movies
Processed 700/9464 movies
Processed 800/9464 movies
Processed 900/9464 movies
Processed 1000/9464 movies
Processed 1100/9464 movies
Processed 1200/9464 movies
Processed 1300/9464 movies
Processed 1400/9464 movies
Processed 1500/9464 movies
Processed 1600/9464 movies
Processed 1700/9464 movies
Processed 1800/9464 movies
Processed 1900/9464 movies
Processed 2000/9464 movies
Processed 2100/9464 movies
Processed 2200/9464 movies
Processed 2300/9464 movies
Processed 2400/9464 movies
Processed 2500/9464 movies
Processed 2600/9464 movies
Processed 2700/9464 movies
Processed 2800/9464 movies
Processed 2900/9464 movies
Processed 3000/9464 movies
Processed 3100/9464 movies
Processed 32

In [40]:
# 10. Define recommendation functions
def get_content_based_recommendations(title, cosine_sim, df, indices, top_n=10):
    """Get movie recommendations based purely on content similarity"""
    # Check if movie exists
    if title not in indices:
        return pd.DataFrame({'title': [f"Movie '{title}' not found in database"],
                           'similarity': [0]})
    
    try:
        # Get the index of the movie
        idx = indices[title]
        
        # Get similarity scores for all movies
        sim_scores = []
        for i in range(len(df)):
            # Skip the exact same movie and any duplicate titles
            if i == idx or df.iloc[i]['title'] == title:
                continue
                
            # Handle case where similarity value might be an array
            sim_value = cosine_sim[idx, i]
            if hasattr(sim_value, '__len__') and len(sim_value) > 0:
                sim_value = float(sim_value[0])
            sim_scores.append((i, float(sim_value)))
        
        # Sort based on similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        
        # Get top N most similar movies
        sim_scores = sim_scores[:top_n]
        
        # Get movie indices
        movie_indices = [i[0] for i in sim_scores]
        
        # Create result dataframe
        columns_to_include = ['id', 'title', 'genre_names', 'vote_average', 'popularity', 'poster_path']
        result = df.iloc[movie_indices][columns_to_include].copy()
        result['similarity'] = [i[1] for i in sim_scores]
        
        return result
    except Exception as e:
        print(f"Error in get_content_based_recommendations: {e}")
        return pd.DataFrame({'title': [f"Error finding recommendations: {str(e)}"],
                           'similarity': [0]})

def get_hybrid_recommendations(title, df=movies_df, indices=indices, top_n=10):
    """Get recommendations using a hybrid approach combining content and metadata"""
    # Get content-based recommendations first
    content_recs = get_content_based_recommendations(title, cosine_sim, df, indices, top_n=top_n*3)
    
    if len(content_recs) == 0 or 'not found' in content_recs['title'].iloc[0]:
        return content_recs
    
    # We need to make sure these columns exist in content_recs
    # Get the original indices from content_recs
    original_indices = content_recs.index
    
    # Add the scaled features from the original dataframe
    content_recs['popularity_scaled'] = df.loc[original_indices, 'popularity_scaled'].values
    content_recs['vote_average_scaled'] = df.loc[original_indices, 'vote_average_scaled'].values
    
    # Create hybrid score
    content_recs['hybrid_score'] = (
        # Content similarity (60%)
        0.60 * content_recs['similarity'] +
        # Popularity (10%)
        0.10 * content_recs['popularity_scaled'] +
        # Rating (30%)
        0.30 * content_recs['vote_average_scaled']
    )
    
    # Get the recommended movie's genres
    movie_idx = indices[title]
    target_genres = set(df.iloc[movie_idx]['genre_names'])
    
    # Add genre matching bonus
    def calculate_genre_overlap(genres):
        if not target_genres or not genres:
            return 0
        overlap = set(genres).intersection(target_genres)
        return len(overlap) / max(len(target_genres), 1)
    
    content_recs['genre_score'] = content_recs['genre_names'].apply(calculate_genre_overlap)
    
    # Adjust hybrid score with genre bonus
    content_recs['hybrid_score'] = content_recs['hybrid_score'] + (0.1 * content_recs['genre_score'])
    
    # Sort by hybrid score and return top N
    return content_recs.sort_values('hybrid_score', ascending=False).head(top_n)

In [41]:
# 11. Test with popular movies
print("\nTesting recommendation system...")
test_titles = [
    'The Godfather',
    'Interstellar',
    'The Dark Knight Rises',
    'Django Unchained',
    'Se7en'
]

for title in test_titles:
    if title in indices:
        print(f"\nRecommendations for '{title}':")
        recommendations = get_hybrid_recommendations(title)
        for i, row in recommendations.iterrows():
            print(f"{row['title']} - Score: {row['hybrid_score']:.3f}, "
                 f"Rating: {row['vote_average']}, "
                 f"Genres: {', '.join(row['genre_names'])}")
    else:
        print(f"\nMovie '{title}' not found in database")



Testing recommendation system...

Recommendations for 'The Godfather':
The Godfather Part II - Score: 0.661, Rating: 8.57, Genres: Drama, Crime
The Godfather Part III - Score: 0.557, Rating: 7.415, Genres: Crime, Drama, Thriller
The Traitor - Score: 0.516, Rating: 7.662, Genres: Drama, Crime, Thriller
Casino - Score: 0.508, Rating: 8.0, Genres: Crime, Drama
GoodFellas - Score: 0.504, Rating: 8.5, Genres: Drama, Crime
The Pig, the Snake and the Pigeon - Score: 0.501, Rating: 7.3, Genres: Crime, Drama, Thriller
The Irishman - Score: 0.494, Rating: 7.6, Genres: Crime, Drama, History
Scarface - Score: 0.490, Rating: 8.162, Genres: Action, Crime, Drama
Miracles: The Canton Godfather - Score: 0.461, Rating: 7.1, Genres: Crime, Action, Comedy, Drama
The Punisher - Score: 0.453, Rating: 5.8, Genres: Action, Crime, Drama, Thriller

Recommendations for 'Interstellar':
Stowaway - Score: 0.550, Rating: 5.953, Genres: Science Fiction, Drama, Thriller, Adventure
Spaceman - Score: 0.513, Rating: 6.7

In [None]:
# 12. Save the model
print("\nSaving recommendation model...")

model_components = {
    # Save TF-IDF vectorizers
    'tfidf_vectorizers': {
        'title': TfidfVectorizer(stop_words='english').fit(movies_df['title_text']),
        'genre': TfidfVectorizer(stop_words='english').fit(movies_df['genre_text']),
        'overview': TfidfVectorizer(stop_words='english', max_features=5000).fit(movies_df['overview_text']),
        'keywords': TfidfVectorizer(stop_words='english').fit(movies_df['keywords_text']),
        'directors': TfidfVectorizer(stop_words='english').fit(movies_df['directors_text']),
        'cast': TfidfVectorizer(stop_words='english').fit(movies_df['cast_text']),
        'companies': TfidfVectorizer(stop_words='english').fit(movies_df['companies_text'])
    },
    
    # Save feature weights
    'feature_weights': {
        'title_weight': title_weight,
        'genre_weight': genre_weight,
        'overview_weight': overview_weight,
        'keywords_weight': keywords_weight,
        'directors_weight': directors_weight,
        'cast_weight': cast_weight,
        'companies_weight': companies_weight
    },
    
    # Save cosine similarity matrix
    'cosine_sim': cosine_sim,
    
    # Save indices mapping
    'indices': indices,
    
    # Save the movies DataFrame with all necessary columns
    'movies_df': movies_df[[
        'id', 'title', 'genre_names', 'vote_average', 'popularity', 'poster_path', 
        'overview', 'popularity_scaled', 'vote_average_scaled', 'vote_count_scaled',
        'title_text', 'genre_text', 'overview_text', 'keywords_text', 
        'directors_text', 'cast_text', 'companies_text'
    ]]
}

joblib.dump(model_components, '../models/movie_recommender.joblib')
print("Model saved successfully to 'models/movie_recommender.joblib'")


Saving recommendation model...


FileNotFoundError: [Errno 2] No such file or directory: 'models/movie_recommender.joblib'

In [8]:
movies_df['score'] = movies_df['popularity_scaled'] * movies_df['vote_average_scaled'] * movies_df['vote_count_scaled']
top_6_movies = movies_df.nlargest(6, 'score')
print(top_6_movies[['title', 'score']])


                                                 title     score
58                                        Interstellar  0.076298
68                              Avengers: Infinity War  0.068234
110                           The Shawshank Redemption  0.046061
109  The Lord of the Rings: The Fellowship of the Ring  0.041572
125           Harry Potter and the Philosopher's Stone  0.040668
140                                          Inception  0.039911
