# TMDB Movie Analysis Pipeline

This notebook orchestrates a complete ETL pipeline for movie data analysis using PySpark and the TMDb API.

## Pipeline Steps:
1. **Extract** - Fetch movie data from TMDb API
2. **Transform** - Clean and prepare data using PySpark
3. **Analyze** - Run KPI analysis and rankings
4. **Visualize** - Generate charts and insights

## Setup & Imports

In [None]:
import os
import sys

# Add the project root to the import path (assumes the notebook is run from the project root)
sys.path.insert(0, os.getcwd())

from pyspark.sql import SparkSession
from src.extract import fetch_all_movies
from src.transform import clean_and_transform, show_data_summary
from src.analyze import run_all_analysis, display_analysis_results
from src.visualize import save_all_visualizations
from utils.config import OUTPUT_DIR
from utils.logger import setup_logger

# Initialize logger
logger = setup_logger()

print("Imports successful!")

## Initialize Spark Session

In [None]:
# Create Spark session
spark = SparkSession.builder \
    .appName("TMDB Movie Analysis") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")
print("Spark session created!")

---
## Step 1: Extract Data from TMDb API

Fetch movie details from the TMDb API, using a retry mechanism and rate limiting.
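The retry and rate-limiting behaviour is implemented inside `src.extract`, which is not shown in this notebook. As a rough illustration only, the sketch below shows one common way to combine the two ideas: exponential backoff between failed attempts, and a minimum interval between consecutive calls. The names `call_with_retry` and `RateLimiter`, and all default values, are hypothetical and not taken from the project.

```python
import time


def call_with_retry(func, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    Hypothetical helper for illustration; not the src.extract implementation.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
            sleep(base_delay * (2 ** attempt))


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()
```

In use, a fetch loop would call `limiter.wait()` before each request and wrap the request itself in `call_with_retry`. The `sleep` and `clock` parameters exist only so the timing can be replaced in tests.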

In [None]:
# Extract movie data from API
raw_movies = fetch_all_movies()

print(f"\nExtracted {len(raw_movies)} movies from TMDb API")

---
## Step 2: Transform & Clean Data

Clean the raw data:
- Extract values from JSON columns
- Convert data types
- Handle missing values
- Filter valid movies
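The actual cleaning runs in PySpark inside `clean_and_transform`, whose code is not shown here. To make the four steps above concrete, here is a plain-Python sketch of the same logic applied to a single record. The field names (`genres`, `budget`, `revenue`), the `|` separator, and the validity rule are assumptions for illustration, not the project's actual schema or rules.

```python
import json


def clean_movie(raw):
    """Return a cleaned record, or None if the movie should be filtered out.

    Hypothetical per-record illustration; not the src.transform logic.
    """
    # 1. Extract values from a JSON column: keep only the genre names
    genres = raw.get("genres") or "[]"
    if isinstance(genres, str):
        genres = json.loads(genres)
    genre_names = "|".join(g["name"] for g in genres)

    # 2. Convert data types, and 3. treat missing or non-positive amounts as unknown
    def to_number(value):
        try:
            number = float(value)
        except (TypeError, ValueError):
            return None
        return number if number > 0 else None

    record = {
        "id": raw.get("id"),
        "title": raw.get("title"),
        "genres": genre_names or None,
        "budget": to_number(raw.get("budget")),
        "revenue": to_number(raw.get("revenue")),
    }

    # 4. Filter valid movies: here, require both an id and a title
    if record["id"] is None or not record["title"]:
        return None
    return record
```

Treating a zero budget or revenue as missing rather than as a real value is a design choice in this sketch; it keeps unknown amounts from distorting later averages and ratios.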

In [None]:
# Clean and transform the data
df = clean_and_transform(spark, raw_movies)

# Display summary
show_data_summary(df)

---
## Step 3: Analyze Data & Generate KPIs

Run comprehensive analysis including:
- Movie rankings (revenue, budget, profit, ROI, ratings)
- Advanced search queries
- Franchise vs Standalone comparison
- Director performance analysis
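The KPI definitions live in `src.analyze` and are not spelled out in this notebook. The plain-Python sketch below shows how profit and ROI rankings could be computed, assuming profit is revenue minus budget and ROI is revenue divided by budget. Both definitions, along with the helper names `add_kpis` and `top_n`, are assumptions made for illustration.

```python
def add_kpis(movies):
    """Add profit and ROI to each movie dict, under the assumed definitions.

    Movies without a positive budget or without revenue are skipped,
    because ROI is undefined for them.
    """
    enriched = []
    for movie in movies:
        budget, revenue = movie.get("budget"), movie.get("revenue")
        if not budget or revenue is None:
            continue
        enriched.append({
            **movie,
            "profit": revenue - budget,  # assumed: revenue minus budget
            "roi": revenue / budget,     # assumed: revenue divided by budget
        })
    return enriched


def top_n(movies, key, n=5, descending=True):
    """Rank movies by one KPI column and return the first n."""
    return sorted(movies, key=lambda m: m[key], reverse=descending)[:n]
```

For example, `top_n(add_kpis(movies), "roi", n=10)` would give the ten highest-ROI movies, and passing `descending=False` would give the lowest instead.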

In [None]:
# Run all analysis
analysis_results = run_all_analysis(df)

# Display results
display_analysis_results(analysis_results)

---
## Step 4: Create Visualizations

Generate and save visualizations:
- Revenue vs Budget trends
- ROI by Genre
- Popularity vs Rating
- Yearly box office trends
- Franchise vs Standalone comparison

In [None]:
# Generate visualizations
viz_files = save_all_visualizations(df, analysis_results)

print("\nVisualizations saved:")
for name, path in viz_files.items():
    print(f"    {name}: {path}")

---
## Display Visualizations

In [None]:
from IPython.display import Image, display

# Display saved visualizations
for name, path in viz_files.items():
    if os.path.exists(path):
        print(f"\n{name.replace('_', ' ').title()}:")
        display(Image(filename=path, width=800))

---
## Pipeline Summary

In [None]:
print("\n" + "="*60)
print("    PIPELINE COMPLETED SUCCESSFULLY!")
print("="*60)
print(f"\nResults saved to: {OUTPUT_DIR}/")
print(f"Visualizations created: {len(viz_files)}")
print(f"Movies analyzed: {df.count()}")

---
## Cleanup

In [None]:
# Stop Spark session when done
spark.stop()
print("Spark session stopped.")