# Movie Recommendation System: Exploratory Data Analysis

**Author:** atahabilder1  
**Project:** Professional Movie Recommendation System  
**Dataset:** TMDB Movies Dataset 2024 (1M+ Movies)  

---

## Objective
Conduct comprehensive exploratory data analysis on the preprocessed TMDB movie dataset to:
- Understand data distributions and patterns
- Identify key features for recommendation models
- Generate insights for feature engineering
- Create professional visualizations for portfolio presentation

## Dataset Overview
- **Original Records:** 1,293,764 movies
- **Processed Records:** 65,700 high-quality movies
- **Features:** 24+ engineered features including ratings, genres, keywords, financial data
- **Time Range:** 1950-2024 (filtered for quality and relevance)


In [None]:
# Import necessary libraries for professional data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)  # Silence deprecation noise only; keep other warnings visible

# Configure visualization settings for professional output
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

print("📊 EDA Environment Setup Complete")
print(f"📅 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")

In [None]:
# Load the preprocessed dataset
print("🔄 Loading preprocessed movie dataset...")
movies_df = pd.read_csv('../data/processed_movies.csv.gz', compression='gzip')

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {movies_df.shape}")
print(f"💾 Memory usage: {movies_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display basic information
movies_df.info()

## 1. Dataset Overview & Quality Assessment

In [None]:
# Comprehensive data overview
print("=== DATASET QUALITY ASSESSMENT ===")
print(f"Total Records: {len(movies_df):,}")
print(f"Total Features: {len(movies_df.columns)}")
print(f"Date Range: {movies_df['release_year'].min()} - {movies_df['release_year'].max()}")
print(f"Rating Range: {movies_df['vote_average'].min():.1f} - {movies_df['vote_average'].max():.1f}")

# Missing values analysis
print("\n=== MISSING VALUES ANALYSIS ===")
missing_data = movies_df.isnull().sum()
missing_percent = (missing_data / len(movies_df)) * 100
missing_summary = pd.DataFrame({
    'Missing_Count': missing_data,
    'Missing_Percentage': missing_percent
}).round(2)

# Show only columns with missing values
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]
if len(missing_summary) > 0:
    print(missing_summary)
else:
    print("✅ No missing values found")

# Sample of the dataset
print("\n=== SAMPLE DATA ===")
display(movies_df.head())
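
The cells that follow read a fixed set of preprocessed columns (`release_decade`, `movie_age`, `rating_category`, `genre_count`, and so on). A minimal sketch of a fail-fast check run before plotting, assuming only pandas; the helper name `missing_columns` and the toy frame are hypothetical and not part of the original pipeline:

```python
import pandas as pd

# Columns referenced by the analysis cells in this notebook
REQUIRED_COLUMNS = [
    'title', 'release_year', 'release_decade', 'release_month', 'movie_age',
    'vote_average', 'vote_count', 'rating_category', 'genres', 'genre_count',
    'budget', 'revenue', 'profit', 'roi', 'runtime',
    'overview_length', 'overview_word_count', 'keyword_count',
]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for movies_df (hypothetical data)
toy_df = pd.DataFrame({'title': ['A'], 'release_year': [2001], 'vote_average': [7.1]})
absent = missing_columns(toy_df, required=['title', 'release_year', 'runtime'])
print(absent)
```

Calling `missing_columns(movies_df)` right after loading would surface a schema mismatch once, instead of as a `KeyError` halfway through a figure.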

## 2. Temporal Analysis: Movie Release Patterns

In [None]:
# Create temporal analysis visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Temporal Analysis of Movie Releases', fontsize=16, fontweight='bold')

# 1. Movies by release year
yearly_counts = movies_df['release_year'].value_counts().sort_index()
axes[0, 0].plot(yearly_counts.index, yearly_counts.values, linewidth=2, color='steelblue')
axes[0, 0].set_title('Movies Released by Year')
axes[0, 0].set_xlabel('Release Year')
axes[0, 0].set_ylabel('Number of Movies')
axes[0, 0].grid(True, alpha=0.3)

# 2. Movies by decade
decade_counts = movies_df['release_decade'].value_counts().sort_index()
axes[0, 1].bar(decade_counts.index, decade_counts.values, color='lightcoral', alpha=0.8)
axes[0, 1].set_title('Movies by Decade')
axes[0, 1].set_xlabel('Decade')
axes[0, 1].set_ylabel('Number of Movies')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Movies by release month
month_counts = movies_df['release_month'].value_counts().sort_index()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[1, 0].bar(range(1, 13), [month_counts.get(i, 0) for i in range(1, 13)], 
               color='mediumseagreen', alpha=0.8)
axes[1, 0].set_title('Movie Releases by Month')
axes[1, 0].set_xlabel('Month')
axes[1, 0].set_ylabel('Number of Movies')
axes[1, 0].set_xticks(range(1, 13))
axes[1, 0].set_xticklabels(month_names, rotation=45)

# 4. Movie age distribution
axes[1, 1].hist(movies_df['movie_age'], bins=30, color='plum', alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Distribution of Movie Ages')
axes[1, 1].set_xlabel('Movie Age (Years)')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('../results/figures/eda_temporal_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Print insights
print("🔍 TEMPORAL INSIGHTS:")
print(f"📈 Peak production decade: {decade_counts.idxmax()}s ({decade_counts.max():,} movies)")
print(f"📅 Most popular release month: {month_names[int(month_counts.idxmax()) - 1]} ({month_counts.max():,} movies)")
print(f"🎬 Average movie age: {movies_df['movie_age'].mean():.1f} years")
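
`release_decade` and `movie_age` arrive precomputed from the preprocessing step, which is not shown in this notebook. One plausible derivation, offered as an assumption rather than the actual pipeline (the function name and the `reference_year` parameter are hypothetical):

```python
import pandas as pd

def add_temporal_features(df, reference_year=2024):
    """Return a copy of df with release_decade and movie_age derived from release_year."""
    out = df.copy()
    # Floor each year to the start of its decade, e.g. 1994 -> 1990
    out['release_decade'] = (out['release_year'] // 10) * 10
    # Age in whole years relative to the chosen reference year (an assumption)
    out['movie_age'] = reference_year - out['release_year']
    return out

toy = pd.DataFrame({'release_year': [1950, 1994, 2024]})
toy = add_temporal_features(toy)
print(toy)
```

If the real pipeline used the analysis date instead of a fixed year, `movie_age` would shift by a constant, which leaves the shape of the histogram unchanged.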

## 3. Rating and Popularity Analysis

In [None]:
# Rating and popularity analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Movie Ratings and Popularity Analysis', fontsize=16, fontweight='bold')

# 1. Vote average distribution
axes[0, 0].hist(movies_df['vote_average'], bins=50, color='skyblue', alpha=0.7, edgecolor='black')
axes[0, 0].axvline(movies_df['vote_average'].mean(), color='red', linestyle='--', 
                   label=f'Mean: {movies_df["vote_average"].mean():.2f}')
axes[0, 0].axvline(movies_df['vote_average'].median(), color='orange', linestyle='--',
                   label=f'Median: {movies_df["vote_average"].median():.2f}')
axes[0, 0].set_title('Distribution of Movie Ratings')
axes[0, 0].set_xlabel('Average Rating')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()

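# 2. Vote count distribution (log scale); zero counts are excluded because log10(0) is undefined
positive_votes = movies_df.loc[movies_df['vote_count'] > 0, 'vote_count']
axes[0, 1].hist(np.log10(positive_votes), bins=50, color='lightgreen', alpha=0.7, edgecolor='black')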
axes[0, 1].set_title('Distribution of Vote Counts (Log Scale)')
axes[0, 1].set_xlabel('Log10(Vote Count)')
axes[0, 1].set_ylabel('Frequency')

# 3. Rating vs Vote Count scatter
sample_data = movies_df.sample(n=min(5000, len(movies_df)), random_state=42)  # Sample for performance; fixed seed keeps the plot reproducible
axes[1, 0].scatter(sample_data['vote_count'], sample_data['vote_average'], 
                   alpha=0.5, color='purple', s=20)
axes[1, 0].set_xscale('log')
axes[1, 0].set_title('Rating vs Vote Count')
axes[1, 0].set_xlabel('Vote Count (Log Scale)')
axes[1, 0].set_ylabel('Average Rating')

# 4. Rating categories
rating_counts = movies_df['rating_category'].value_counts()
axes[1, 1].pie(rating_counts.values, labels=rating_counts.index, autopct='%1.1f%%',
               colors=['lightcoral', 'gold', 'lightgreen', 'mediumpurple'])
axes[1, 1].set_title('Distribution by Rating Category')

plt.tight_layout()
plt.savefig('../results/figures/eda_ratings_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Statistical summary
print("🔍 RATING INSIGHTS:")
print(f"⭐ Average rating: {movies_df['vote_average'].mean():.2f} ± {movies_df['vote_average'].std():.2f}")
print(f"📊 Median vote count: {movies_df['vote_count'].median():.0f}")
print(f"🏆 Highest rated movie: {movies_df.loc[movies_df['vote_average'].idxmax(), 'title']} ({movies_df['vote_average'].max():.1f})")
print(f"👥 Most voted movie: {movies_df.loc[movies_df['vote_count'].idxmax(), 'title']} ({movies_df['vote_count'].max():,} votes)")
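
The rating-versus-vote-count scatter motivates a vote-weighted score such as the `weighted_rating` column used in the correlation section. How that column was computed is not shown here; the sketch below assumes the commonly used Bayesian-average form `WR = v/(v+m) * R + m/(v+m) * C`, where `R` is the movie's mean rating, `v` its vote count, `C` the global mean rating, and `m` a hypothetical vote threshold:

```python
import pandas as pd

def weighted_rating(df, m):
    """Shrink each movie's mean rating toward the global mean by its vote count."""
    C = df['vote_average'].mean()   # global mean rating
    v = df['vote_count']            # votes per movie
    R = df['vote_average']          # mean rating per movie
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical data: two 9.0-rated movies with very different vote counts
toy = pd.DataFrame({
    'vote_average': [9.0, 9.0, 5.0],
    'vote_count': [10, 10_000, 10_000],
})
toy['weighted_rating_sketch'] = weighted_rating(toy, m=500)
print(toy)
```

With few votes the score stays near the global mean, and with many votes it approaches the movie's own average, so a 9.0 backed by 10 votes ranks below a 9.0 backed by 10,000.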

## 4. Genre Analysis

In [None]:
# Genre analysis - extract individual genres from lists
import ast

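def safe_literal_eval(val):
    """Parse a genre cell into a list of genre names.

    Accepts actual lists, string representations of lists, and plain
    comma-separated strings; anything else yields an empty list.
    """
    # Check for lists first: pd.isna() on a list returns an array, which breaks an `if` test
    if isinstance(val, list):
        return val
    if not isinstance(val, str) or not val.strip():
        return []  # Covers NaN, None, and empty strings
    try:
        parsed = ast.literal_eval(val)
        if isinstance(parsed, (list, tuple)):
            return list(parsed)
    except (ValueError, SyntaxError):
        pass
    # Fallback: treat the value as a comma-separated string
    return [genre.strip().strip("'\"") for genre in val.split(',') if genre.strip()]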

# Extract all genres
all_genres = []
for genres_list in movies_df['genres']:
    genres = safe_literal_eval(genres_list)
    all_genres.extend(genres)

# Count genre frequencies
from collections import Counter
genre_counts = Counter(all_genres)
top_genres = dict(genre_counts.most_common(15))

# Create genre visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Genre Analysis', fontsize=16, fontweight='bold')

# 1. Top genres bar chart
genre_names = list(top_genres.keys())
genre_values = list(top_genres.values())
bars = axes[0, 0].bar(range(len(genre_names)), genre_values, color='steelblue', alpha=0.8)
axes[0, 0].set_title('Top 15 Movie Genres')
axes[0, 0].set_xlabel('Genres')
axes[0, 0].set_ylabel('Number of Movies')
axes[0, 0].set_xticks(range(len(genre_names)))
axes[0, 0].set_xticklabels(genre_names, rotation=45, ha='right')

# Add value labels on bars
for bar, value in zip(bars, genre_values):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                    f'{value:,}', ha='center', va='bottom', fontsize=9)

# 2. Genre count distribution
axes[0, 1].hist(movies_df['genre_count'], bins=range(1, int(movies_df['genre_count'].max()) + 2),  # int() guards against a float dtype
                color='lightcoral', alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Distribution of Genre Count per Movie')
axes[0, 1].set_xlabel('Number of Genres')
axes[0, 1].set_ylabel('Number of Movies')

# 3. Average rating by top genres
genre_ratings = {}
for genre in genre_names[:10]:  # Top 10 genres
    genre_movies = movies_df[movies_df['genres'].apply(
        lambda x: genre in safe_literal_eval(x)
    )]
    if len(genre_movies) > 0:
        genre_ratings[genre] = genre_movies['vote_average'].mean()

sorted_genres = sorted(genre_ratings.items(), key=lambda x: x[1], reverse=True)
genre_names_sorted = [item[0] for item in sorted_genres]
rating_values = [item[1] for item in sorted_genres]

bars = axes[1, 0].bar(range(len(genre_names_sorted)), rating_values, 
                      color='mediumseagreen', alpha=0.8)
axes[1, 0].set_title('Average Rating by Genre (Top 10)')
axes[1, 0].set_xlabel('Genres')
axes[1, 0].set_ylabel('Average Rating')
axes[1, 0].set_xticks(range(len(genre_names_sorted)))
axes[1, 0].set_xticklabels(genre_names_sorted, rotation=45, ha='right')
axes[1, 0].set_ylim(min(5.5, min(rating_values) - 0.2), max(rating_values) + 0.2)  # Keep every bar visible if a genre averages below 5.5

# Add value labels
for bar, value in zip(bars, rating_values):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                    f'{value:.2f}', ha='center', va='bottom', fontsize=9)

# 4. Genre popularity over decades
top_5_genres = genre_names[:5]
decades = sorted(movies_df['release_decade'].unique())

for genre in top_5_genres:
    # Distinct name so the decade_counts Series from Section 2 is not overwritten
    genre_decade_counts = []
    for decade in decades:
        decade_movies = movies_df[movies_df['release_decade'] == decade]
        genre_count = sum(decade_movies['genres'].apply(
            lambda x: genre in safe_literal_eval(x)
        ))
        genre_decade_counts.append(genre_count)
    
    axes[1, 1].plot(decades, genre_decade_counts, marker='o', label=genre, linewidth=2)

axes[1, 1].set_title('Genre Popularity Trends by Decade')
axes[1, 1].set_xlabel('Decade')
axes[1, 1].set_ylabel('Number of Movies')
axes[1, 1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../results/figures/eda_genre_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Print insights
print("🔍 GENRE INSIGHTS:")
print(f"🎭 Most common genre: {genre_names[0]} ({genre_values[0]:,} movies)")
print(f"📊 Average genres per movie: {movies_df['genre_count'].mean():.1f}")
print(f"⭐ Highest rated genre: {genre_names_sorted[0]} ({rating_values[0]:.2f} avg rating)")
print(f"📈 Total unique genres found: {len(genre_counts)}")
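
The genre cell re-parses the `genres` column for every genre and decade it inspects. A possible alternative, sketched under the assumption that every cell holds a string representation of a list, is to parse once and reshape with `DataFrame.explode`; the toy frame and the variable names are hypothetical:

```python
import ast
import pandas as pd

# Hypothetical stand-in for movies_df
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ["['Drama', 'Comedy']", "['Drama']", "[]"],
    'vote_average': [7.0, 5.0, 6.0],
    'release_decade': [1990, 2000, 2000],
})

# Parse each string once, then emit one row per (movie, genre) pair
toy['genre_list'] = toy['genres'].apply(ast.literal_eval)
long_df = toy.explode('genre_list').dropna(subset=['genre_list'])  # empty lists become NaN rows

genre_freq = long_df['genre_list'].value_counts()
genre_mean_rating = long_df.groupby('genre_list')['vote_average'].mean()
genre_by_decade = (
    long_df.groupby(['release_decade', 'genre_list']).size().unstack(fill_value=0)
)
print(genre_freq, genre_mean_rating, genre_by_decade, sep='\n\n')
```

The three derived tables correspond to the frequency bar chart, the average-rating chart, and the decade trend lines, each computed from a single parsed copy of the column.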

## 5. Financial Analysis

In [None]:
# Financial analysis (for movies with budget/revenue data)
financial_df = movies_df[(movies_df['budget'] > 0) & (movies_df['revenue'] > 0)].copy()

print(f"📊 Movies with financial data: {len(financial_df):,} out of {len(movies_df):,} ({len(financial_df)/len(movies_df)*100:.1f}%)")

if len(financial_df) > 100:  # Only proceed if we have sufficient data
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Financial Analysis', fontsize=16, fontweight='bold')
    
    # 1. Budget distribution
    axes[0, 0].hist(np.log10(financial_df['budget']), bins=30, color='gold', alpha=0.7, edgecolor='black')
    axes[0, 0].set_title('Budget Distribution (Log Scale)')
    axes[0, 0].set_xlabel('Log10(Budget in USD)')
    axes[0, 0].set_ylabel('Frequency')
    
    # 2. Revenue distribution
    axes[0, 1].hist(np.log10(financial_df['revenue']), bins=30, color='lightgreen', alpha=0.7, edgecolor='black')
    axes[0, 1].set_title('Revenue Distribution (Log Scale)')
    axes[0, 1].set_xlabel('Log10(Revenue in USD)')
    axes[0, 1].set_ylabel('Frequency')
    
    # 3. Budget vs Revenue scatter
    sample_financial = financial_df.sample(n=min(2000, len(financial_df)), random_state=42)  # Fixed seed for reproducibility
    scatter = axes[1, 0].scatter(sample_financial['budget'], sample_financial['revenue'], 
                                c=sample_financial['vote_average'], cmap='viridis', alpha=0.6, s=30)
    axes[1, 0].set_xscale('log')
    axes[1, 0].set_yscale('log')
    axes[1, 0].set_title('Budget vs Revenue (colored by rating)')
    axes[1, 0].set_xlabel('Budget (USD, Log Scale)')
    axes[1, 0].set_ylabel('Revenue (USD, Log Scale)')
    
    # Add diagonal line for break-even
    min_val = min(sample_financial['budget'].min(), sample_financial['revenue'].min())
    max_val = max(sample_financial['budget'].max(), sample_financial['revenue'].max())
    axes[1, 0].plot([min_val, max_val], [min_val, max_val], 'r--', alpha=0.8, label='Break-even line')
    axes[1, 0].legend()
    
    # Colorbar for scatter plot
    plt.colorbar(scatter, ax=axes[1, 0], label='Average Rating')
    
    # 4. ROI distribution
    roi_data = financial_df[financial_df['roi'].between(-5, 20)]  # Filter extreme outliers
    axes[1, 1].hist(roi_data['roi'], bins=50, color='coral', alpha=0.7, edgecolor='black')
    axes[1, 1].axvline(roi_data['roi'].mean(), color='red', linestyle='--', 
                       label=f'Mean ROI: {roi_data["roi"].mean():.2f}')
    axes[1, 1].set_title('Return on Investment (ROI) Distribution')
    axes[1, 1].set_xlabel('ROI (Profit/Budget)')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].legend()
    
    plt.tight_layout()
    plt.savefig('../results/figures/eda_financial_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Print financial insights
    print("\n🔍 FINANCIAL INSIGHTS:")
    print(f"💰 Average budget: ${financial_df['budget'].mean():,.0f}")
    print(f"💸 Average revenue: ${financial_df['revenue'].mean():,.0f}")
    print(f"📈 Median ROI: {financial_df['roi'].median():.2f} (mean: {financial_df['roi'].mean():.2f}, includes the outliers excluded from the plot)")
    print(f"🏆 Most profitable movie: {financial_df.loc[financial_df['profit'].idxmax(), 'title']}")
    print(f"   Profit: ${financial_df['profit'].max():,.0f}")
    
else:
    print("⚠️ Insufficient financial data for detailed analysis")
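
The `profit` and `roi` columns are taken as given from preprocessing. Following the axis label above (ROI = Profit / Budget), they could be derived as in this sketch. Treating a zero budget as missing is an assumption suggested by the `budget > 0` filter, not something the source states, and the function name is hypothetical:

```python
import numpy as np
import pandas as pd

def add_financial_features(df):
    """Return a copy of df with profit and roi (profit divided by budget)."""
    out = df.copy()
    out['profit'] = out['revenue'] - out['budget']
    # Assumption: a zero budget means "unknown", so ROI is left as NaN there
    budget = out['budget'].where(out['budget'] > 0, np.nan)
    out['roi'] = out['profit'] / budget
    return out

# Hypothetical data
toy = pd.DataFrame({'budget': [100.0, 50.0, 0.0], 'revenue': [300.0, 25.0, 10.0]})
toy = add_financial_features(toy)
print(toy)
```

Under this definition ROI cannot fall below -1 for a movie with non-negative revenue, so the lower bound of the `between(-5, 20)` filter only matters if the real column is defined differently.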

## 6. Content Analysis

In [None]:
# Content-based analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Content Analysis for Recommendation Features', fontsize=16, fontweight='bold')

# 1. Overview length distribution
axes[0, 0].hist(movies_df['overview_length'], bins=50, color='lightblue', alpha=0.7, edgecolor='black')
axes[0, 0].axvline(movies_df['overview_length'].mean(), color='red', linestyle='--',
                   label=f'Mean: {movies_df["overview_length"].mean():.0f} chars')
axes[0, 0].set_title('Movie Overview Length Distribution')
axes[0, 0].set_xlabel('Overview Length (characters)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()

# 2. Word count distribution
axes[0, 1].hist(movies_df['overview_word_count'], bins=50, color='lightcyan', alpha=0.7, edgecolor='black')
axes[0, 1].axvline(movies_df['overview_word_count'].mean(), color='red', linestyle='--',
                   label=f'Mean: {movies_df["overview_word_count"].mean():.0f} words')
axes[0, 1].set_title('Overview Word Count Distribution')
axes[0, 1].set_xlabel('Number of Words')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# 3. Runtime distribution
axes[1, 0].hist(movies_df['runtime'], bins=50, color='lightsalmon', alpha=0.7, edgecolor='black')
axes[1, 0].axvline(movies_df['runtime'].mean(), color='red', linestyle='--',
                   label=f'Mean: {movies_df["runtime"].mean():.0f} min')
axes[1, 0].set_title('Movie Runtime Distribution')
axes[1, 0].set_xlabel('Runtime (minutes)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# 4. Keyword count distribution
axes[1, 1].hist(movies_df['keyword_count'], bins=30, color='lightsteelblue', alpha=0.7, edgecolor='black')
axes[1, 1].axvline(movies_df['keyword_count'].mean(), color='red', linestyle='--',
                   label=f'Mean: {movies_df["keyword_count"].mean():.1f} keywords')
axes[1, 1].set_title('Keyword Count Distribution')
axes[1, 1].set_xlabel('Number of Keywords')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('../results/figures/eda_content_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Content insights
print("🔍 CONTENT INSIGHTS:")
print(f"📝 Average overview length: {movies_df['overview_length'].mean():.0f} characters")
print(f"📖 Average word count: {movies_df['overview_word_count'].mean():.0f} words")
print(f"⏱️ Average runtime: {movies_df['runtime'].mean():.0f} minutes")
print(f"🏷️ Average keywords per movie: {movies_df['keyword_count'].mean():.1f}")
print(f"🎬 Longest movie: {movies_df.loc[movies_df['runtime'].idxmax(), 'title']} ({movies_df['runtime'].max():.0f} min)")

## 7. Correlation Analysis for Feature Selection

In [None]:
# Correlation analysis for numerical features
numerical_features = [
    'vote_average', 'vote_count', 'popularity', 'runtime', 
    'overview_length', 'overview_word_count', 'genre_count', 
    'keyword_count', 'movie_age', 'weighted_rating'
]

# Add financial features if available
if 'budget' in movies_df.columns and movies_df['budget'].sum() > 0:
    numerical_features.extend(['budget', 'revenue', 'profit', 'roi'])

# Select available features
available_features = [col for col in numerical_features if col in movies_df.columns]
correlation_df = movies_df[available_features].corr()

# Create correlation heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(correlation_df, dtype=bool))  # Mask upper triangle
sns.heatmap(correlation_df, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, fmt='.2f', cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Feature Correlation Matrix for Recommendation System', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../results/figures/eda_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Find strongest correlations with vote_average (used here as a proxy for quality)
correlations_with_rating = (
    correlation_df['vote_average']
    .drop('vote_average')  # Exclude the trivial self-correlation
    .abs()
    .sort_values(ascending=False)
)
print("🔍 CORRELATION INSIGHTS:")
print("\n📊 Features most correlated with movie rating:")
for feature, corr in correlations_with_rating.head(5).items():
    direction = "positively" if correlation_df.loc['vote_average', feature] > 0 else "negatively"
    print(f"  • {feature}: {corr:.3f} ({direction} correlated)")

# Feature selection summary (hard-coded guidance, not derived from the correlation matrix)
print("\n🎯 FEATURE SELECTION RECOMMENDATIONS:")
print("📈 Candidate features for content-based filtering:")
print("  • overview_length, overview_word_count (text-derived features)")
print("  • genre_count, keyword_count (count features)")
print("  • runtime (content feature)")
print("\n⭐ Candidate features for collaborative filtering:")
print("  • vote_average, vote_count (aggregate audience ratings)")
print("  • weighted_rating (quality indicator)")
print("  • popularity (trending indicator)")
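
`DataFrame.corr()` computes Pearson correlation by default, which can understate monotonic but non-linear relationships for heavily skewed columns such as `vote_count`. As a complementary check, Spearman correlation equals the Pearson correlation of the ranks. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data: rating grows with the logarithm of the vote count
toy = pd.DataFrame({
    'vote_count': [10, 100, 1_000, 10_000, 100_000],
    'vote_average': [5.0, 5.5, 6.0, 6.5, 7.0],
})

pearson = toy.corr()
# Spearman correlation is the Pearson correlation of the rank-transformed data
spearman = toy.rank().corr()

pearson_r = pearson.loc['vote_count', 'vote_average']
spearman_r = spearman.loc['vote_count', 'vote_average']
print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

Applying `movies_df[available_features].rank().corr()` next to the Pearson heatmap would show which pairs change most, which is a hint that a log transform may be worth testing during feature engineering.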

## 8. Data Quality Summary for Model Development

In [None]:
# Comprehensive data quality summary
print("=" * 60)
print("📊 COMPREHENSIVE DATA QUALITY REPORT")
print("=" * 60)

print(f"\n🎬 DATASET OVERVIEW:")
print(f"  • Total movies: {len(movies_df):,}")
print(f"  • Features: {len(movies_df.columns)}")
print(f"  • Memory usage: {movies_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"  • Time span: {movies_df['release_year'].min()}-{movies_df['release_year'].max()}")

print(f"\n⭐ RATING QUALITY:")
print(f"  • Rating range: {movies_df['vote_average'].min():.1f} - {movies_df['vote_average'].max():.1f}")
print(f"  • Average rating: {movies_df['vote_average'].mean():.2f} ± {movies_df['vote_average'].std():.2f}")
print(f"  • Median vote count: {movies_df['vote_count'].median():.0f}")
print(f"  • Reliable ratings (>100 votes): {(movies_df['vote_count'] > 100).sum():,} movies")

print(f"\n🎭 CONTENT RICHNESS:")
print(f"  • Average genres per movie: {movies_df['genre_count'].mean():.1f}")
print(f"  • Average keywords per movie: {movies_df['keyword_count'].mean():.1f}")
print(f"  • Average overview length: {movies_df['overview_length'].mean():.0f} characters")
print(f"  • Movies with substantial content: {(movies_df['overview_length'] > 100).sum():,}")

print(f"\n💰 FINANCIAL DATA:")
# Budget only; the financial analysis in Section 5 additionally requires revenue > 0
budgets = movies_df.loc[movies_df['budget'] > 0, 'budget'] if 'budget' in movies_df.columns else pd.Series(dtype=float)
print(f"  • Movies with budget data: {len(budgets):,} ({len(budgets)/len(movies_df)*100:.1f}%)")
if len(budgets) > 0:
    print(f"  • Budget range: ${budgets.min():,.0f} - ${budgets.max():,.0f}")

print(f"\n🔧 MODEL READINESS:")
print(f"  ✅ Content-based features: Complete (genres, keywords, overview)")
print(f"  ✅ Collaborative features: Complete (ratings, vote counts)")
print(f"  ✅ Temporal features: Complete (release dates, movie age)")
print(f"  ✅ Quality filters: Applied (min votes, runtime, content length)")
print(f"  ✅ Data types: Optimized for ML pipelines")

print(f"\n🎯 RECOMMENDATION STRATEGY:")
print(f"  1. Content-based: Use genres + keywords + overview text")
print(f"  2. Collaborative: Use vote_average + vote_count + popularity")
print(f"  3. Hybrid: Combine both approaches with weighted ensemble")
print(f"  4. Cold start: Use content features for new movies")

print("\n" + "=" * 60)
print("✅ DATASET IS READY FOR MODEL DEVELOPMENT")
print("=" * 60)

## 9. Next Steps: Model Development Strategy

Based on this comprehensive EDA, the recommended development approach is:

### Phase 1: Content-Based Models
1. **TF-IDF + Cosine Similarity** (baseline)
   - Features: overview text + genres + keywords
   - Fast, interpretable, good cold-start handling

2. **Word Embeddings** (Word2Vec/FastText)
   - GPU-accelerated training on overview text
   - Better semantic understanding

3. **BERT Embeddings** (transformer-based)
   - State-of-the-art text understanding
   - Leverage NVIDIA A6000 for efficient inference
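
To make the TF-IDF baseline concrete, here is a dependency-free sketch of the idea on three hypothetical overviews. It uses whitespace tokenisation and a smoothed IDF; the function names are illustrative, and a real implementation would more likely rely on a library vectorizer such as scikit-learn's `TfidfVectorizer`:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical overviews
overviews = [
    "space crew alien ship",
    "alien ship horror",
    "romantic comedy paris",
]
vecs = tfidf_vectors(overviews)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
print(f"sim(0,1)={sim_01:.3f}, sim(0,2)={sim_02:.3f}")
```

Because only item text is needed, a movie with no votes can still be matched to its nearest neighbours, which is the cold-start property mentioned above.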

### Phase 2: Collaborative Filtering
1. **Matrix Factorization** (SVD)
   - Requires per-user ratings; `vote_average` is a per-movie aggregate, so a user-level ratings source must be added first
   - Handle sparsity with regularization

2. **Neural Collaborative Filtering**
   - Deep learning approach
   - GPU-accelerated training
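
As a rough illustration of the matrix factorization idea, the sketch below applies a truncated SVD to a small, fully observed user-by-movie matrix. The matrix is hypothetical: this dataset contains no per-user ratings, and real rating matrices are sparse, so a production model would fit only the observed entries with regularization rather than call a plain SVD.

```python
import numpy as np

# Hypothetical dense rating matrix (rows: users, columns: movies)
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def low_rank_approximation(matrix, k):
    """Reconstruct matrix from its k largest singular triplets."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the first k left singular vectors, then project back
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

R1 = low_rank_approximation(R, k=1)
R2 = low_rank_approximation(R, k=2)
err1 = np.linalg.norm(R - R1)  # Frobenius norm of the residual
err2 = np.linalg.norm(R - R2)
print(f"rank-1 error: {err1:.3f}, rank-2 error: {err2:.3f}")
```

Two latent factors capture the two taste groups in the toy matrix, so the rank-2 residual is much smaller than the rank-1 residual.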

### Phase 3: Hybrid Models
1. **Weighted Ensemble**
   - Combine content + collaborative predictions
   - Optimize weights based on performance

2. **Neural Hybrid Architecture**
   - End-to-end deep learning
   - Leverage all available features
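
A weighted ensemble can be as simple as rescaling each model's scores to a common range and taking a convex combination. The sketch below assumes each model returns a dict of movie id to score; the function names, the weight `alpha`, and the scores are all hypothetical:

```python
def min_max_scale(scores):
    """Rescale a dict of scores to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}

def blend(content_scores, collab_scores, alpha=0.5):
    """Weighted sum of two score dicts after min-max scaling.

    alpha is the weight on the content-based scores; a movie missing from
    one model contributes 0.0 for that model.
    """
    c = min_max_scale(content_scores)
    f = min_max_scale(collab_scores)
    movies = set(c) | set(f)
    return {m: alpha * c.get(m, 0.0) + (1 - alpha) * f.get(m, 0.0) for m in movies}

# Hypothetical scores on different scales
content = {'A': 0.9, 'B': 0.2, 'C': 0.5}   # e.g. cosine similarities
collab = {'A': 3.0, 'B': 5.0, 'C': 4.0}    # e.g. predicted ratings
blended = blend(content, collab, alpha=0.7)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)
```

In practice `alpha` would be tuned on a validation set against the ranking metrics listed below, rather than fixed by hand.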

### Success Metrics
- **Precision@K, Recall@K, NDCG@K** for ranking quality
- **Coverage** for recommendation diversity
- **Cold-start performance** for new movies
- **Inference speed** for production readiness
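
For reference, the ranking metrics can be computed per user as in the following sketch, which assumes binary relevance (an item is either relevant or not) and uses hypothetical movie ids:

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

recommended = ['m1', 'm7', 'm3', 'm9', 'm4']  # model output, best first
relevant = {'m1', 'm3', 'm5'}                 # held-out items the user liked

p = precision_at_k(recommended, relevant, k=5)
r = recall_at_k(recommended, relevant, k=5)
n = ndcg_at_k(recommended, relevant, k=5)
print(f"P@5={p:.2f}, R@5={r:.2f}, NDCG@5={n:.3f}")
```

Averaging these values over all evaluated users gives the system-level figures; NDCG additionally rewards placing relevant items nearer the top of the list.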

---

**Next Notebook:** `02_content_based_models.ipynb`