# TMDB API Data Extraction

This notebook fetches movie data from The Movie Database (TMDB) API and saves each movie's response as a raw JSON file.

## Objectives
1. Configure API connection
2. Fetch movie data with credits and keywords
3. Save raw JSON files
4. Validate fetched data

## Setup

In [1]:
# Import required libraries
import sys
import os
from pathlib import Path

# Add project root to path and set working directory
project_root = Path.cwd().parent
sys.path.append(str(project_root))
os.chdir(str(project_root))  

from src.fetch.fetch_tmdb_api import TMDBFetcher
from src.utils.helpers import load_config, setup_logging

# Set up a logger for the notebook
logger = setup_logging(module_name='fetch_notebook')
logger.info("✓ Imports successful")

2025-12-09 13:23:16 - fetch_notebook - INFO - ✓ Imports successful


## 1. Load Configuration

Load the project configuration, which contains:
- API settings (base URL, timeout, rate limits)
- Data paths
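
A quick structural check can catch a missing setting before any request is made. The sketch below assumes the loaded configuration is a nested dict containing the three keys this notebook reads (`api.base_url`, `api.rate_limit_delay`, `paths.raw_data`). `missing_config_keys` is a hypothetical helper for illustration, not part of `src.utils.helpers`.

```python
# Hypothetical helper: report dotted keys that are absent from a nested config dict.
REQUIRED_KEYS = ("api.base_url", "api.rate_limit_delay", "paths.raw_data")


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the dotted keys from `required` that cannot be resolved in `config`."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Example values mirror the ones logged by this notebook.
example_config = {
    "api": {"base_url": "https://api.themoviedb.org/3", "rate_limit_delay": 0.25},
    "paths": {"raw_data": "data/raw/"},
}
print(missing_config_keys(example_config))  # → []
```

An empty list means every required key is present; anything else names the settings to add to the YAML file.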

In [2]:
# Load configuration
config = load_config('config/config.yaml')

logger.info("Configuration loaded successfully")
logger.info(f"  API Base URL: {config['api']['base_url']}")
logger.info(f"  Rate Limit: {config['api']['rate_limit_delay']}s")
logger.info(f"  Raw Data Path: {config['paths']['raw_data']}")

2025-12-09 13:23:18 - fetch_notebook - INFO - Configuration loaded successfully
2025-12-09 13:23:18 - fetch_notebook - INFO -   API Base URL: https://api.themoviedb.org/3
2025-12-09 13:23:18 - fetch_notebook - INFO -   Rate Limit: 0.25s
2025-12-09 13:23:18 - fetch_notebook - INFO -   Raw Data Path: data/raw/


## 2. Define Movie IDs to Fetch

Specify the TMDB movie IDs to fetch. A movie's ID is the number in its page URL on themoviedb.org (for example, `themoviedb.org/movie/109445`).
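
Cleaning the ID list up front avoids wasted requests. This is a minimal sketch that assumes valid IDs are positive integers; `clean_movie_ids` is a hypothetical helper, not part of the project code.

```python
def clean_movie_ids(raw_ids):
    """Keep positive integers only, dropping duplicates while preserving order."""
    seen = set()
    cleaned = []
    for movie_id in raw_ids:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(movie_id, bool) or not isinstance(movie_id, int):
            continue
        if movie_id <= 0 or movie_id in seen:
            continue
        seen.add(movie_id)
        cleaned.append(movie_id)
    return cleaned


print(clean_movie_ids([0, 597, 19995, 597, -3]))  # → [597, 19995]
```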

In [3]:
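# Example: a mixed selection of well-known films (not only Marvel Cinematic Universe titles; 109445 is Frozen).
# Note: 0 is not a valid TMDB movie ID and causes the 404 error shown in the fetch output.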
movie_ids = [0,299534,19995,140607,299536,597,135397,420818,24428,168259,99861,284054,12445,181808,330457,351286,109445,321612,260513]


## 3. Initialize Fetcher and Extract Data

The TMDBFetcher will:
- Connect to TMDB API using credentials from `.env` file
- Fetch movie details including credits and keywords
- Save each movie as a separate JSON file in `data/raw/`
- Apply rate limiting to respect API limits
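
The steps listed above can be sketched as a small loop. This is an illustration under stated assumptions, not the actual `TMDBFetcher` implementation: `build_movie_url` and `fetch_movies` are hypothetical names, and the HTTP call is injected as `fetch_one` so the loop can be run without network access. The URL shape (`/movie/{id}` with `append_to_response=credits,keywords`) matches the request URL that appears in this notebook's log output.

```python
import json
import time
from pathlib import Path
from urllib.parse import urlencode


def build_movie_url(base_url, movie_id, api_key):
    """Build a movie-details URL that also requests credits and keywords."""
    query = urlencode({"api_key": api_key, "append_to_response": "credits,keywords"})
    return f"{base_url}/movie/{movie_id}?{query}"


def fetch_movies(movie_ids, out_dir, fetch_one, delay=0.25, skip_existing=True):
    """Save one JSON file per movie and return the number of new files written.

    fetch_one(movie_id) -> dict is supplied by the caller (assumption) and is
    expected to raise an exception when the request fails.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for movie_id in movie_ids:
        target = out_dir / f"{movie_id}.json"
        if skip_existing and target.exists():
            continue  # already downloaded, no request needed
        try:
            data = fetch_one(movie_id)
        except Exception as exc:  # log the failure and continue with the next ID
            print(f"Error fetching movie {movie_id}: {exc}")
        else:
            target.write_text(json.dumps(data), encoding="utf-8")
            written += 1
        time.sleep(delay)  # pause after each request, successful or not
    return written
```

Injecting `fetch_one` keeps the skip and rate-limit logic independent of the HTTP client, which also makes it easy to exercise with a fake function in tests.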

In [4]:
# Initialize the fetcher
fetcher = TMDBFetcher(config_path="config/config.yaml")

logger.info("="*60)
logger.info("Starting data extraction...")
logger.info("="*60)

# Fetch movies (skip_existing=True means already downloaded files won't be re-fetched)
fetched_count = fetcher.fetch_movies(movie_ids, skip_existing=True)

logger.info("="*60)
logger.info(f"✓ Successfully fetched {fetched_count} new movies")
logger.info(f"✓ Data saved to: {fetcher.raw_data_path}")
logger.info("="*60)

2025-12-09 10:25:44 - fetch_notebook - INFO - Starting data extraction...


Fetching movies:   0%|          | 0/19 [00:00<?, ?it/s]

2025-12-09 10:25:45 - fetch - ERROR - Error fetching movie 0: 404 Client Error: Not Found for url: https://api.themoviedb.org/3/movie/0?api_key=<REDACTED>&append_to_response=credits%2Ckeywords


Fetching movies: 100%|██████████| 19/19 [02:31<00:00,  7.96s/it]

2025-12-09 10:28:15 - fetch_notebook - INFO - ✓ Successfully fetched 18 new movies
2025-12-09 10:28:15 - fetch_notebook - INFO - ✓ Data saved to: data\raw
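
The error line above includes the request URL together with its `api_key` query parameter, so credentials can end up in logs and saved notebook output. One possible mitigation is a logging filter that masks the value before a record is emitted. `redact_api_key` and `RedactApiKeyFilter` are hypothetical names for this sketch; only the standard-library `re` and `logging` modules are used.

```python
import logging
import re

# Matches "api_key=<value>" up to the next "&" or whitespace.
_API_KEY_RE = re.compile(r"(api_key=)[^&\s]+")


def redact_api_key(message):
    """Replace the value of any api_key query parameter with a placeholder."""
    return _API_KEY_RE.sub(r"\1REDACTED", message)


class RedactApiKeyFilter(logging.Filter):
    """Logging filter that masks api_key values in the formatted message."""

    def filter(self, record):
        record.msg = redact_api_key(record.getMessage())
        record.args = ()  # message is already formatted
        return True


print(redact_api_key("Not Found for url: https://api.themoviedb.org/3/movie/0?api_key=abc123&append_to_response=credits%2Ckeywords"))
# → Not Found for url: https://api.themoviedb.org/3/movie/0?api_key=REDACTED&append_to_response=credits%2Ckeywords
```

A filter only sees records that pass through the object it is attached to, so it would need to be added to the logger that emits the error (named `fetch` in the output above) or to that logger's handlers, for example with `logging.getLogger("fetch").addFilter(RedactApiKeyFilter())`.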





## 4. Verify Extracted Data

Check that the JSON files were created and examine a sample.

In [4]:
# List all JSON files in raw data directory
import json

raw_data_path = Path(config['paths']['raw_data'])
json_files = list(raw_data_path.glob("*.json"))

logger.info(f"Total JSON files in {raw_data_path}: {len(json_files)}")
logger.info("\nFiles:")
for file in sorted(json_files)[:10]:  # Show first 10
    logger.info(f"  - {file.name}")

2025-12-09 13:23:29 - fetch_notebook - INFO - Total JSON files in data\raw: 18
2025-12-09 13:23:29 - fetch_notebook - INFO - 
Files:
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 109445.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 12445.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 135397.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 140607.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 168259.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 181808.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 19995.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 24428.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 260513.json
2025-12-09 13:23:29 - fetch_notebook - INFO -   - 284054.json
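
The listing above confirms that files exist, but not which requested IDs are missing. Assuming each movie is saved as `<movie_id>.json` (as the file names above suggest), the requested IDs can be compared against the files on disk. `missing_movie_ids` is a hypothetical helper for this sketch.

```python
from pathlib import Path


def missing_movie_ids(requested_ids, raw_dir):
    """Return the requested IDs that have no `<id>.json` file in raw_dir."""
    saved = {path.stem for path in Path(raw_dir).glob("*.json")}
    return [movie_id for movie_id in requested_ids if str(movie_id) not in saved]
```

In this notebook, `missing_movie_ids(movie_ids, raw_data_path)` would list any ID whose fetch failed or was never attempted.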


In [5]:
# Examine one sample JSON file
if json_files:
    sample_file = sorted(json_files)[0]  # sort first: glob order is not guaranteed
    with open(sample_file, 'r', encoding='utf-8') as f:
        sample_data = json.load(f)
    
    logger.info(f"\nSample: {sample_file.name}")
    logger.info("="*60)
    logger.info(f"Title: {sample_data.get('title')}")
    logger.info(f"Release Date: {sample_data.get('release_date')}")
    # `or 0` guards against missing or null values, which would break the ',' format spec
    logger.info(f"Budget: ${(sample_data.get('budget') or 0):,}")
    logger.info(f"Revenue: ${(sample_data.get('revenue') or 0):,}")
    logger.info(f"Runtime: {sample_data.get('runtime')} minutes")
    logger.info(f"Vote Average: {sample_data.get('vote_average')}/10")
    logger.info(f"Vote Count: {(sample_data.get('vote_count') or 0):,}")
    logger.info(f"\nGenres: {[g['name'] for g in sample_data.get('genres', [])]}")
    logger.info(f"Cast Members: {len(sample_data.get('credits', {}).get('cast', []))}")
    logger.info(f"Crew Members: {len(sample_data.get('credits', {}).get('crew', []))}")

2025-12-09 13:23:54 - fetch_notebook - INFO - 
Sample: 109445.json
2025-12-09 13:23:54 - fetch_notebook - INFO - Title: Frozen
2025-12-09 13:23:54 - fetch_notebook - INFO - Release Date: 2013-11-20
2025-12-09 13:23:54 - fetch_notebook - INFO - Budget: $150,000,000
2025-12-09 13:23:54 - fetch_notebook - INFO - Revenue: $1,274,219,009
2025-12-09 13:23:54 - fetch_notebook - INFO - Runtime: 102 minutes
2025-12-09 13:23:54 - fetch_notebook - INFO - Vote Average: 7.25/10
2025-12-09 13:23:54 - fetch_notebook - INFO - Vote Count: 17,188
2025-12-09 13:23:54 - fetch_notebook - INFO - 
Genres: ['Animation', 'Family', 'Adventure', 'Fantasy']
2025-12-09 13:23:54 - fetch_notebook - INFO - Cast Members: 60
2025-12-09 13:23:54 - fetch_notebook - INFO - Crew Members: 285
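
Inspecting one sample does not validate the whole batch. The sketch below checks every file for readability and for a few expected top-level fields. The field list is an assumption: `title`, `release_date`, and `credits` appear in the sample above, while `keywords` is expected because the request appends it. `validate_raw_file` is a hypothetical helper.

```python
import json
from pathlib import Path

# Assumed minimum set of top-level fields for a usable record.
REQUIRED_FIELDS = ("title", "release_date", "credits", "keywords")


def validate_raw_file(path, required=REQUIRED_FIELDS):
    """Return a list of problems found in one raw JSON file (empty if none)."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    return [f"missing field: {name}" for name in required if name not in data]
```

Applied to the notebook's files, `{f.name: validate_raw_file(f) for f in json_files}` would map each file to its list of problems, with an empty list meaning the file passed.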
