# TMDB Data Cleaning & Preprocessing

This notebook transforms raw JSON movie data into clean, analysis-ready datasets.

## Objectives
1. Load raw JSON files from `data/raw/`
2. Extract and flatten nested JSON structures
3. Clean data types and handle missing values
4. Apply quality filters
5. Add derived features (ROI, profit, release year)
6. Save cleaned data to `data/processed/`

## Setup

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np

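# Resolve the project root. If `src/` is not in the current directory, assume the
# notebook was launched from a subdirectory and step up one level. This keeps the
# cell safe to re-run after the `os.chdir` call below.
project_root = Path.cwd() if (Path.cwd() / "src").is_dir() else Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
os.chdir(project_root)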

from src.transform.pipeline import DataCleaningPipeline
from src.utils.helpers import load_config, setup_logging

# Setup logger for notebook
logger = setup_logging(module_name='cleaning_notebook')
logger.info("✓ Imports successful")

## 1. Initialize Data Cleaner

Load the configuration and initialize the `DataCleaningPipeline` class.

In [None]:
# Initialize the cleaning pipeline from the project config
pipeline = DataCleaningPipeline(config_path="config/config.yaml")

logger.info("DataCleaningPipeline initialized")
logger.info(f"  Raw data path: {pipeline.raw_data_path}")
logger.info(f"  Interim data path: {pipeline.interim_data_path}")
logger.info(f"  Processed data path: {pipeline.processed_data_path}")

## 2. Load and Inspect Raw Data

Load raw JSON files and examine the initial structure.

In [None]:
# Load raw data
df_raw = pipeline.load_raw_data()

logger.info(f"Raw data shape: {df_raw.shape}")
logger.info(f"Columns: {list(df_raw.columns)}")

# Display first few rows
df_raw.head()

In [None]:
# Check data types and missing values before cleaning
logger.info("\n=== RAW DATA SUMMARY ===")
logger.info(f"Total movies: {len(df_raw)}")
logger.info(f"Total columns: {len(df_raw.columns)}")
logger.info("\nMissing values:")
missing = df_raw.isnull().sum()
missing_pct = (missing / len(df_raw) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)
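
The same missing-value summary is computed again for the cleaned data in section 4, so it may be worth wrapping in a small helper. The sketch below is one way to do that; `summarize_missing` and the demo frame are hypothetical names for illustration, not part of `src/`.

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    missing = df.isnull().sum()
    # max(..., 1) avoids a division by zero on an empty frame
    pct = (missing / max(len(df), 1) * 100).round(2)
    summary = pd.DataFrame({"Missing": missing, "Percentage": pct})
    # Keep only columns that actually have gaps
    return summary[summary["Missing"] > 0].sort_values("Missing", ascending=False)

# Hypothetical demo data, not taken from the TMDB files
demo_missing = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1.0, None, None, 4.0],
    "tagline": [None, "x", "y", "z"],
})
missing_demo = summarize_missing(demo_missing)
```

With such a helper, both inspection cells would reduce to `summarize_missing(df_raw)` and `summarize_missing(df_cleaned)`.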

## 3. Run Complete Cleaning Pipeline

Execute the full data cleaning process:
- Extract nested JSON fields
- Convert data types
- Handle missing/unrealistic values
- Apply quality filters
- Add derived features
- Reorder columns

In [None]:
# Run the complete cleaning pipeline
df_cleaned = pipeline.run(save_interim=True, save_final=True)
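
The internals of `pipeline.run` are not shown in this notebook. As a rough illustration of the "extract nested JSON fields" step, the sketch below collapses a list of `{"id": ..., "name": ...}` objects into the pipe-separated string format that the genre analysis in section 6 relies on. The helper `join_names` and the sample frame are assumptions made for this example; the actual pipeline may work differently.

```python
import pandas as pd

def join_names(items, sep="|"):
    """Collapse a list of {'name': ...} dicts into one separator-joined string."""
    if not isinstance(items, list) or not items:
        return None  # missing or empty nested field
    names = [d["name"] for d in items if isinstance(d, dict) and d.get("name")]
    return sep.join(names) if names else None

# Hypothetical raw records shaped like nested genre lists
raw_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [
        [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}],
        [],
    ],
})
raw_demo["genres"] = raw_demo["genres"].apply(join_names)
```

Returning `None` for empty lists keeps "no genres" distinguishable from an empty string, so it shows up in the missing-value summary.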

## 4. Inspect Cleaned Data

Examine the cleaned dataset structure and quality.

In [None]:
# Display cleaned data
logger.info("\n=== CLEANED DATA SUMMARY ===")
logger.info(f"Total movies: {len(df_cleaned)}")
logger.info(f"Total columns: {len(df_cleaned.columns)}")
logger.info(f"Columns: {list(df_cleaned.columns)}")

# Show first few rows
df_cleaned.head()

In [None]:
# Check data types
logger.info("\nData types:")
df_cleaned.dtypes

In [None]:
# Check missing values in cleaned data
missing = df_cleaned.isnull().sum()
missing_pct = (missing / len(df_cleaned) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_summary = missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)

logger.info("\nMissing values in cleaned data:")
missing_summary

## 5. Data Quality Statistics

Generate summary statistics for numeric columns.

In [None]:
# Statistical summary of numeric columns
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
logger.info(f"\nNumeric columns: {list(numeric_cols)}")

df_cleaned[numeric_cols].describe().round(2)

In [None]:
# Check specific data quality metrics
logger.info("\n=== DATA QUALITY METRICS ===")
logger.info(f"Date range: {df_cleaned['release_date'].min()} to {df_cleaned['release_date'].max()}")
logger.info(f"Year range: {df_cleaned['release_year'].min()} to {df_cleaned['release_year'].max()}")
logger.info("\nBudget (in millions USD):")
logger.info(f"  Mean: ${df_cleaned['budget_musd'].mean():.2f}M")
logger.info(f"  Median: ${df_cleaned['budget_musd'].median():.2f}M")
logger.info(f"  Max: ${df_cleaned['budget_musd'].max():.2f}M")
logger.info("\nRevenue (in millions USD):")
logger.info(f"  Mean: ${df_cleaned['revenue_musd'].mean():.2f}M")
logger.info(f"  Median: ${df_cleaned['revenue_musd'].median():.2f}M")
logger.info(f"  Max: ${df_cleaned['revenue_musd'].max():.2f}M")
logger.info("\nROI:")
logger.info(f"  Mean: {df_cleaned['roi'].mean():.2f}%")
logger.info(f"  Median: {df_cleaned['roi'].median():.2f}%")
logger.info(f"  Max: {df_cleaned['roi'].max():.2f}%")
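
The derived features reported above (ROI, profit, release year) are computed inside the pipeline, which this notebook does not show. The following is a minimal sketch under stated assumptions: ROI is taken to be profit divided by budget, expressed as a percentage (consistent with the `%` suffix in the log lines), and `profit_musd` is a hypothetical column name. The pipeline's real definitions may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, not real TMDB figures
movies_demo = pd.DataFrame({
    "title": ["Hit", "Flop", "Unknown budget"],
    "release_date": pd.to_datetime(["2010-07-16", "2015-03-01", None]),
    "budget_musd": [100.0, 50.0, np.nan],
    "revenue_musd": [300.0, 25.0, 10.0],
})

# Profit in millions USD (assumed column name)
movies_demo["profit_musd"] = movies_demo["revenue_musd"] - movies_demo["budget_musd"]

# ROI as a percentage; non-positive budgets become NaN instead of dividing by zero
budget = movies_demo["budget_musd"].where(movies_demo["budget_musd"] > 0)
movies_demo["roi"] = (movies_demo["profit_musd"] / budget * 100).round(2)

# Nullable integer year so that missing dates stay missing
movies_demo["release_year"] = movies_demo["release_date"].dt.year.astype("Int64")
```

If the pipeline instead stores ROI as a plain ratio (revenue divided by budget), the `%` in the log messages above would be misleading, so it is worth checking the definition in `src/transform/` before interpreting these numbers.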

## 6. Explore Categorical Data

Examine the distribution of key categorical variables.

In [None]:
# Top 10 languages
logger.info("\n=== TOP 10 ORIGINAL LANGUAGES ===")
df_cleaned['original_language'].value_counts().head(10)

In [None]:
# Most common genres (from pipe-separated values)
logger.info("\n=== GENRE DISTRIBUTION ===")
# Split genres and count
all_genres = df_cleaned['genres'].dropna().str.split('|').explode()
all_genres.value_counts().head(15)
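
To make the split-and-explode step concrete, here is the same chain applied to a toy frame (hypothetical data). Each movie contributes one row per genre after `explode`, so a movie tagged with two genres is counted once under each.

```python
import pandas as pd

# Hypothetical pipe-separated genre strings, including one missing value
toy = pd.DataFrame({"genres": ["Action|Adventure", "Drama", None, "Action|Drama"]})

genre_counts = (
    toy["genres"]
    .dropna()          # skip movies without genre information
    .str.split("|")    # "Action|Drama" becomes ["Action", "Drama"]
    .explode()         # one row per (movie, genre) pair
    .value_counts()
)
```

Because of this, the counts sum to the number of genre tags, not to the number of movies.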

In [None]:
# Collection membership
logger.info(f"\nMovies in collections: {df_cleaned['belongs_to_collection'].notna().sum()}")
logger.info(f"Standalone movies: {df_cleaned['belongs_to_collection'].isna().sum()}")

# Top collections
logger.info("\n=== TOP 10 COLLECTIONS ===")
df_cleaned['belongs_to_collection'].value_counts().head(10)

## 7. Sample of Cleaned Data

View a few complete records to verify data quality.

In [None]:
# Display sample movies with key information
sample_cols = ['title', 'release_year', 'genres', 'budget_musd', 'revenue_musd', 
               'roi', 'vote_average', 'director', 'cast']
df_cleaned[sample_cols].head(10)

## Summary

Data cleaning is complete. Provided `pipeline.run(save_interim=True, save_final=True)` finished without errors, the outputs are saved to:
- **CSV**: `data/processed/movies_cleaned.csv`
- **Parquet**: `data/processed/movies_cleaned.parquet`
- **Interim**: `data/interim/movies_interim.csv`

### Cleaning Steps Applied:
1. ✓ Loaded raw JSON files
2. ✓ Extracted nested structures (genres, cast, crew, etc.)
3. ✓ Converted data types (dates, numeric values)
4. ✓ Handled missing and unrealistic values
5. ✓ Applied quality filters (duplicates, sparse data, non-released movies)
6. ✓ Added derived features (ROI, profit, release year)
7. ✓ Reordered columns for consistency
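
The quality filters in step 5 (duplicates, sparse data, non-released movies) could look roughly like the sketch below. The function name, the `id` and `status` column names, the `"Released"` status value, and the `min_non_null` threshold are all assumptions for illustration; the pipeline's actual rules live in `src/transform/` and may differ.

```python
import pandas as pd

def apply_quality_filters(df: pd.DataFrame, min_non_null: int = 4) -> pd.DataFrame:
    """Drop duplicate IDs, non-released movies, and rows with too few populated fields."""
    out = df.drop_duplicates(subset="id", keep="first")
    out = out[out["status"] == "Released"]
    # thresh keeps rows that have at least `min_non_null` non-missing values
    out = out.dropna(thresh=min_non_null)
    return out.reset_index(drop=True)

# Hypothetical sample rows
filter_demo = pd.DataFrame({
    "id": [1, 1, 2, 3, 4],
    "title": ["A", "A", "B", None, "D"],
    "status": ["Released", "Released", "Post Production", "Released", "Released"],
    "budget_musd": [10.0, 10.0, 5.0, None, 2.0],
})
filtered = apply_quality_filters(filter_demo)
```

In this demo, the second copy of movie 1 is removed as a duplicate, movie 2 is dropped because it is not released, and movie 3 is dropped because only two of its four fields are populated.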

### Next Steps:
Proceed to `03_kpi_analysis.ipynb` for KPI calculations and analysis.