# TMDB Data Cleaning & Preprocessing

This notebook transforms raw JSON movie data into clean, analysis-ready datasets.

## Objectives
1. Load raw JSON files from `data/raw/`
2. Extract and flatten nested JSON structures
3. Clean data types and handle missing values
4. Apply quality filters
5. Add derived features (ROI, profit, release year)
6. Save cleaned data to `data/processed/`

## Setup

In [1]:
# Import required libraries
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np

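# Resolve the project root (the directory containing `src/`) so this cell
# can be re-run safely even after the working directory has changed
cwd = Path.cwd()
project_root = cwd if (cwd / "src").is_dir() else cwd.parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
os.chdir(project_root)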

from src.transform.pipeline import DataCleaningPipeline
from src.utils.helpers import load_config, setup_logging

# Setup logger for notebook
logger = setup_logging(module_name='cleaning_notebook')
logger.info("✓ Imports successful")

2025-12-09 16:22:10 - cleaning_notebook - INFO - ✓ Imports successful


## 1. Initialize Data Cleaner

Load the configuration and initialize the `DataCleaningPipeline` class.

In [2]:
# Initialize the cleaning pipeline from the project config
pipeline = DataCleaningPipeline(config_path="config/config.yaml")

logger.info("DataCleaningPipeline initialized")
logger.info(f"  Raw data path: {pipeline.raw_data_path}")
logger.info(f"  Interim data path: {pipeline.interim_data_path}")
logger.info(f"  Processed data path: {pipeline.processed_data_path}")

2025-12-09 16:22:11 - cleaning_notebook - INFO - DataCleaningPipeline initialized
2025-12-09 16:22:11 - cleaning_notebook - INFO -   Raw data path: data\raw
2025-12-09 16:22:11 - cleaning_notebook - INFO -   Interim data path: data\interim
2025-12-09 16:22:11 - cleaning_notebook - INFO -   Processed data path: data\processed


## 2. Load and Inspect Raw Data

Load raw JSON files and examine the initial structure.

In [3]:
# Load raw data
df_raw = pipeline.load_raw_data()

logger.info(f"Raw data shape: {df_raw.shape}")
logger.info(f"Columns: {list(df_raw.columns)}")

# Display first few rows
df_raw.head()

2025-12-09 16:22:11 - transform - INFO - Loading 18 JSON files from data\raw
2025-12-09 16:22:11 - transform - INFO - Loaded 18 movies with 28 columns
2025-12-09 16:22:11 - cleaning_notebook - INFO - Raw data shape: (18, 28)
2025-12-09 16:22:11 - cleaning_notebook - INFO - Columns: ['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'origin_country', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'credits', 'keywords']


Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,origin_country,original_language,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,credits,keywords
0,False,/u2bZhH3nTf0So0UIC1QxAqBvC07.jpg,"{'id': 386382, 'name': 'Frozen Collection', 'p...",150000000,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",http://movies.disney.com/frozen,109445,tt2294629,[US],en,...,102,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Only the act of true love will thaw a frozen h...,Frozen,False,7.25,17188,"{'cast': [{'adult': False, 'gender': 1, 'id': ...","{'keywords': [{'id': 7376, 'name': 'princess'}..."
1,False,/cbcpDn6XJaIGoOil1bKuskU8ds4.jpg,"{'id': 1241, 'name': 'Harry Potter Collection'...",125000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",https://www.warnerbros.com/movies/harry-potter...,12445,tt1201607,[GB],en,...,130,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,It all ends.,Harry Potter and the Deathly Hallows: Part 2,False,8.084,21464,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 616, 'name': 'witch'}, {'..."
2,False,/dF6FjTZzRTENfB4R17HDN20jLT2.jpg,"{'id': 328, 'name': 'Jurassic Park Collection'...",150000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",https://www.jurassicworld.com/,135397,tt0369610,[US],en,...,124,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The park is open.,Jurassic World,False,6.699,21127,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 2041, 'name': 'island'}, ..."
3,False,/8BTsTfln4jlQrLXUBquXJ0ASQy9.jpg,"{'id': 10, 'name': 'Star Wars Collection', 'po...",245000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.starwars.com/films/star-wars-episod...,140607,tt2488496,[US],en,...,136,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Every generation has a story.,Star Wars: The Force Awakens,False,7.3,20104,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 803, 'name': 'android'}, ..."
4,False,/ehzI1mVcnHqB58NqPyQwpMqcVoz.jpg,"{'id': 9485, 'name': 'The Fast and the Furious...",190000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",https://www.uphe.com/movies/furious-7,168259,tt2820852,[US],en,...,139,"[{'english_name': 'Arabic', 'iso_639_1': 'ar',...",Released,Vengeance hits home.,Furious 7,False,7.223,11035,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 830, 'name': 'car race'},..."
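
The body of `pipeline.load_raw_data()` is not shown in this notebook. As a rough sketch only, assuming one JSON object per file under `data/raw/`, a loader could look like the following. The function name `load_json_records` is hypothetical and not part of the project.

```python
import json
from pathlib import Path


def load_json_records(raw_dir):
    """Read every *.json file in raw_dir and return a list of dicts.

    Hypothetical helper: assumes each file holds a single JSON object
    (one movie) and skips files that fail to parse.
    """
    records = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                records.append(json.load(fh))
        except json.JSONDecodeError:
            # Skip unreadable files rather than abort the whole load
            continue
    return records
```

Passing the result to `pd.DataFrame(records)` gives one row per movie, with nested fields such as `genres` and `credits` left as Python lists and dicts in `object` columns.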


### Data Exploration

In [6]:
# Drop columns not needed for analysis; errors='ignore' keeps the cell re-runnable
df_raw = df_raw.drop(columns=['adult', 'imdb_id', 'original_title', 'video', 'homepage'], errors='ignore')
df_raw.head()

Unnamed: 0,backdrop_path,belongs_to_collection,budget,genres,id,origin_country,original_language,overview,popularity,poster_path,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,credits,keywords
0,/u2bZhH3nTf0So0UIC1QxAqBvC07.jpg,"{'id': 386382, 'name': 'Frozen Collection', 'p...",150000000,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",109445,[US],en,Young princess Anna of Arendelle dreams about ...,18.277,/itAKcobTYGpYT8Phwjd8c9hleTo.jpg,...,1274219009,102,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Only the act of true love will thaw a frozen h...,Frozen,7.25,17188,"{'cast': [{'adult': False, 'gender': 1, 'id': ...","{'keywords': [{'id': 7376, 'name': 'princess'}..."
1,/cbcpDn6XJaIGoOil1bKuskU8ds4.jpg,"{'id': 1241, 'name': 'Harry Potter Collection'...",125000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",12445,[GB],en,"Harry, Ron and Hermione continue their quest t...",17.3221,/c54HpQmuwXjHq2C9wmoACjxoom3.jpg,...,1341511219,130,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,It all ends.,Harry Potter and the Deathly Hallows: Part 2,8.084,21464,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 616, 'name': 'witch'}, {'..."
2,/dF6FjTZzRTENfB4R17HDN20jLT2.jpg,"{'id': 328, 'name': 'Jurassic Park Collection'...",150000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",135397,[US],en,Twenty-two years after the events of Jurassic ...,9.3758,/rhr4y79GpxQF9IsfJItRXVaoGs4.jpg,...,1671537444,124,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The park is open.,Jurassic World,6.699,21127,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 2041, 'name': 'island'}, ..."
3,/8BTsTfln4jlQrLXUBquXJ0ASQy9.jpg,"{'id': 10, 'name': 'Star Wars Collection', 'po...",245000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",140607,[US],en,Thirty years after defeating the Galactic Empi...,7.5842,/wqnLdwVXoBjKibFRR5U3y0aDUhs.jpg,...,2068223624,136,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Every generation has a story.,Star Wars: The Force Awakens,7.3,20104,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 803, 'name': 'android'}, ..."
4,/ehzI1mVcnHqB58NqPyQwpMqcVoz.jpg,"{'id': 9485, 'name': 'The Fast and the Furious...",190000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",168259,[US],en,Deckard Shaw seeks revenge against Dominic Tor...,17.1859,/ktofZ9Htrjiy0P6LEowsDaxd3Ri.jpg,...,1515400000,139,"[{'english_name': 'Arabic', 'iso_639_1': 'ar',...",Released,Vengeance hits home.,Furious 7,7.223,11035,"{'cast': [{'adult': False, 'gender': 2, 'id': ...","{'keywords': [{'id': 830, 'name': 'car race'},..."


In [7]:

df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   backdrop_path          18 non-null     object 
 1   belongs_to_collection  16 non-null     object 
 2   budget                 18 non-null     int64  
 3   genres                 18 non-null     object 
 4   id                     18 non-null     int64  
 5   origin_country         18 non-null     object 
 6   original_language      18 non-null     object 
 7   overview               18 non-null     object 
 8   popularity             18 non-null     float64
 9   poster_path            18 non-null     object 
 10  production_companies   18 non-null     object 
 11  production_countries   18 non-null     object 
 12  release_date           18 non-null     object 
 13  revenue                18 non-null     int64  
 14  runtime                18 non-null     int64  
 15  spoken_l

In [8]:
# Object-dtype columns to inspect ('credits' is not included in this list)
cols_to_check = [
    'backdrop_path', 
    'belongs_to_collection', 
    'genres', 
    'origin_country', 
    'original_language', 
    'overview', 
    'poster_path', 
    'production_companies', 
    'production_countries', 
    'release_date', 
    'spoken_languages', 
    'status', 
    'tagline', 
    'title', 
    'keywords'
]

# Loop to inspect the first valid value of each column
for col in cols_to_check:
    print(f"--- Column: {col} ---")
    # Get the first non-null value to see what the data actually looks like
    if df_raw[col].notna().any():
        first_val = df_raw[col].dropna().iloc[0]
        print(f"Sample Value: {first_val}")
        print(f"Type: {type(first_val)}")
    else:
        print("Column contains only NaN values.")
    print("\n")

--- Column: backdrop_path ---
Sample Value: /u2bZhH3nTf0So0UIC1QxAqBvC07.jpg
Type: <class 'str'>


--- Column: belongs_to_collection ---
Sample Value: {'id': 386382, 'name': 'Frozen Collection', 'poster_path': '/13Op41T3cALJedrKqYPrlc3cIbO.jpg', 'backdrop_path': '/s3vdRkK7KZFUDC8HEJo2GRKyVhW.jpg'}
Type: <class 'dict'>


--- Column: genres ---
Sample Value: [{'id': 16, 'name': 'Animation'}, {'id': 10751, 'name': 'Family'}, {'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}]
Type: <class 'list'>


--- Column: origin_country ---
Sample Value: ['US']
Type: <class 'list'>


--- Column: original_language ---
Sample Value: en
Type: <class 'str'>


--- Column: overview ---
Sample Value: Young princess Anna of Arendelle dreams about finding true love at her sister Elsa’s coronation. Fate takes her on a dangerous journey in an attempt to end the eternal winter that has fallen over the kingdom. She's accompanied by ice delivery man Kristoff, his reindeer Sven, and snowman Olaf. On an 

In [None]:
# Check data types and missing values before cleaning
logger.info("\n=== RAW DATA SUMMARY ===")
logger.info(f"Total movies: {len(df_raw)}")
logger.info(f"Total columns: {len(df_raw.columns)}")
logger.info("\nMissing values:")
missing = df_raw.isnull().sum()
missing_pct = (missing / len(df_raw) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)

## 3. Run Complete Cleaning Pipeline

Execute the full data cleaning process:
- Extract nested JSON fields
- Convert data types
- Handle missing/unrealistic values
- Apply quality filters
- Add derived features
- Reorder columns
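
The first step, extracting nested JSON fields, can be illustrated with a small sketch. It assumes list-valued fields such as `genres` are collapsed into pipe-separated names (the format the genre analysis later in this notebook splits on) and that `belongs_to_collection` is reduced to the collection name. Both helper names are hypothetical; the pipeline's own implementation is not shown here.

```python
def join_names(items, sep="|"):
    """Collapse a list of {'id': ..., 'name': ...} dicts into 'A|B|C'.

    Returns None for missing or empty input so the value stays null
    after flattening.
    """
    if not isinstance(items, list) or not items:
        return None
    names = [d["name"] for d in items if isinstance(d, dict) and d.get("name")]
    return sep.join(names) if names else None


def collection_name(value):
    """Reduce a belongs_to_collection dict to its 'name', else None."""
    if isinstance(value, dict):
        return value.get("name")
    return None
```

On a DataFrame these would be applied column-wise, for example `df['genres'].apply(join_names)`.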

In [None]:
# Run the complete cleaning pipeline
df_cleaned = pipeline.run(save_interim=True, save_final=True)

## 4. Inspect Cleaned Data

Examine the cleaned dataset structure and quality.

In [None]:
# Display cleaned data
logger.info("\n=== CLEANED DATA SUMMARY ===")
logger.info(f"Total movies: {len(df_cleaned)}")
logger.info(f"Total columns: {len(df_cleaned.columns)}")
logger.info(f"Columns: {list(df_cleaned.columns)}")

# Show first few rows
df_cleaned.head()

In [None]:
# Check data types
logger.info("\nData types:")
df_cleaned.dtypes
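
For reference, the kind of type conversion the dtype check is meant to verify might look like the sketch below. The conventions are assumptions, not confirmed by the pipeline source: unparseable values become `NaT`/`NaN`, and a budget or revenue of 0 is treated as missing. `coerce_types` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def coerce_types(df):
    """Return a copy with dates parsed and money columns made numeric.

    Assumed conventions: unparseable values become NaT/NaN, and a
    budget or revenue of 0 is treated as missing.
    """
    out = df.copy()
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    for col in ("budget", "revenue"):
        out[col] = pd.to_numeric(out[col], errors="coerce").replace(0, np.nan)
    return out
```

After such a step, `release_date` should show as `datetime64[ns]` and the money columns as a numeric dtype.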

In [None]:
# Check missing values in cleaned data
missing = df_cleaned.isnull().sum()
missing_pct = (missing / len(df_cleaned) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_summary = missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)

logger.info("\nMissing values in cleaned data:")
missing_summary
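
The same missing-value logic is used for both the raw and the cleaned data. One way to avoid repeating it is a small helper, sketched here under the hypothetical name `summarize_missing`:

```python
import pandas as pd


def summarize_missing(df):
    """Return null counts and percentages for columns that have any nulls."""
    missing = df.isnull().sum()
    pct = (missing / len(df) * 100).round(2)
    out = pd.DataFrame({"Missing": missing, "Percentage": pct})
    # Keep only columns with at least one null, most affected first
    return out[out["Missing"] > 0].sort_values("Missing", ascending=False)
```

It could then be called as `summarize_missing(df_raw)` and `summarize_missing(df_cleaned)`.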

## 5. Data Quality Statistics

Generate summary statistics for numeric columns.

In [None]:
# Statistical summary of numeric columns
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
logger.info(f"\nNumeric columns: {list(numeric_cols)}")

df_cleaned[numeric_cols].describe().round(2)

In [None]:
# Check specific data quality metrics
logger.info("\n=== DATA QUALITY METRICS ===")
logger.info(f"Date range: {df_cleaned['release_date'].min()} to {df_cleaned['release_date'].max()}")
logger.info(f"Year range: {df_cleaned['release_year'].min()} to {df_cleaned['release_year'].max()}")
logger.info("\nBudget (in millions USD):")
logger.info(f"  Mean: ${df_cleaned['budget_musd'].mean():.2f}M")
logger.info(f"  Median: ${df_cleaned['budget_musd'].median():.2f}M")
logger.info(f"  Max: ${df_cleaned['budget_musd'].max():.2f}M")
logger.info("\nRevenue (in millions USD):")
logger.info(f"  Mean: ${df_cleaned['revenue_musd'].mean():.2f}M")
logger.info(f"  Median: ${df_cleaned['revenue_musd'].median():.2f}M")
logger.info(f"  Max: ${df_cleaned['revenue_musd'].max():.2f}M")
logger.info("\nROI:")
logger.info(f"  Mean: {df_cleaned['roi'].mean():.2f}%")
logger.info(f"  Median: {df_cleaned['roi'].median():.2f}%")
logger.info(f"  Max: {df_cleaned['roi'].max():.2f}%")
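
The derived columns reported by this cell (`budget_musd`, `revenue_musd`, `roi`, `release_year`) are produced inside the pipeline, whose formulas are not shown in this notebook. The sketch below illustrates one plausible set of definitions for a single movie. It assumes amounts are divided by 1e6 and that ROI is profit as a percentage of budget, which the `%` formatting suggests but does not confirm. `derive_features` and the `profit_musd` key are hypothetical names.

```python
from datetime import date


def derive_features(budget, revenue, release_date):
    """Compute derived fields for one movie under assumed definitions.

    budget and revenue are in USD; release_date is an ISO 'YYYY-MM-DD'
    string or None. A missing or zero budget/revenue yields None for
    every field that depends on it.
    """
    budget_musd = budget / 1e6 if budget else None
    revenue_musd = revenue / 1e6 if revenue else None

    profit_musd = roi = None
    if budget_musd is not None and revenue_musd is not None:
        profit_musd = revenue_musd - budget_musd
        # ROI as a percentage of budget (assumed definition)
        roi = profit_musd / budget_musd * 100

    release_year = date.fromisoformat(release_date).year if release_date else None
    return {
        "budget_musd": budget_musd,
        "revenue_musd": revenue_musd,
        "profit_musd": profit_musd,
        "roi": roi,
        "release_year": release_year,
    }
```

If the pipeline instead defines ROI as the ratio `revenue / budget`, the `%` suffix in the log lines would need to be dropped.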

## 6. Explore Categorical Data

Examine the distribution of key categorical variables.

In [None]:
# Top 10 languages
logger.info("\n=== TOP 10 ORIGINAL LANGUAGES ===")
df_cleaned['original_language'].value_counts().head(10)

In [None]:
# Most common genres (from pipe-separated values)
logger.info("\n=== GENRE DISTRIBUTION ===")
# Split genres and count
all_genres = df_cleaned['genres'].dropna().str.split('|').explode()
all_genres.value_counts().head(15)

In [None]:
# Collection membership
logger.info(f"\nMovies in collections: {df_cleaned['belongs_to_collection'].notna().sum()}")
logger.info(f"Standalone movies: {df_cleaned['belongs_to_collection'].isna().sum()}")

# Top collections
logger.info("\n=== TOP 10 COLLECTIONS ===")
df_cleaned['belongs_to_collection'].value_counts().head(10)

## 7. Sample of Cleaned Data

View a few complete records to verify data quality.

In [None]:
# Display sample movies with key information
sample_cols = ['title', 'release_year', 'genres', 'budget_musd', 'revenue_musd', 
               'roi', 'vote_average', 'director', 'cast']
df_cleaned[sample_cols].head(10)

## Summary

Once the pipeline has run, the cleaned dataset is saved to:
- **CSV**: `data/processed/movies_cleaned.csv`
- **Parquet**: `data/processed/movies_cleaned.parquet`
- **Interim**: `data/interim/movies_interim.csv`

### Cleaning Steps Applied:
1. ✓ Loaded raw JSON files
2. ✓ Extracted nested structures (genres, cast, crew, etc.)
3. ✓ Converted data types (dates, numeric values)
4. ✓ Handled missing and unrealistic values
5. ✓ Applied quality filters (duplicates, sparse data, non-released movies)
6. ✓ Added derived features (ROI, profit, release year)
7. ✓ Reordered columns for consistency
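
As an illustration of step 5, the quality filters could be expressed roughly as below. The rules and the `min_non_null` threshold are assumptions made for this sketch, and `apply_quality_filters` is a hypothetical name. The actual criteria live in `DataCleaningPipeline` and its config.

```python
import pandas as pd


def apply_quality_filters(df, min_non_null=10):
    """Drop duplicate, unreleased, and sparse rows (assumed rules).

    min_non_null is a hypothetical threshold: rows with fewer
    non-null values than this are removed.
    """
    out = df.drop_duplicates(subset="id")
    out = out[out["status"] == "Released"]
    out = out.dropna(thresh=min_non_null)
    return out.reset_index(drop=True)
```

Comparing `len(df)` before and after each filter is a quick way to see how many rows each rule removes.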

### Next Steps:
Proceed to `03_kpi_analysis.ipynb` for KPI calculations and analysis.