# 01 - Data Sources

This notebook covers:
1. How the TMDB metadata scraper works
2. Loading the source files (ratings, tags, movies_enriched)
3. Merging ratings with tags and TMDB metadata
4. Saving the processed dataset

In [17]:
import pandas as pd
import numpy as np
from pathlib import Path

In [18]:
# Define paths
DATA_RAW = Path('../data/raw/MovieLens_100K/ml-latest-small')
DATA_PROCESSED = Path('../data/processed')

# Ensure processed directory exists
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

## TMDB Metadata Scraper

The `src/tmdb_scraper.py` script fetches additional movie metadata from the TMDB API. This section summarizes what it collects and how it behaves.

### Overview

The scraper uses the `tmdbId` from `links.csv` to fetch rich metadata for each movie:

- **Director** - Name of the first crew member credited as director
- **Cast** - First 5 cast members (pipe-separated)
- **Country** - Production country
- **Runtime** - Duration in minutes
- **Budget/Revenue** - Financial data
- **Release date** - Precise release date
- **Original language** - Language code (e.g. `en`)
- **Vote average/count** - TMDB user rating and number of votes
- **Popularity** - TMDB popularity score
- **Overview/Tagline** - Plot summary and tagline

### Key Features

1. **Polite scraping**: Random delay of 0.25-0.5 s between requests
2. **Proper User-Agent**: Identifies itself as an educational project
3. **Checkpoint support**: Can resume from where it stopped if interrupted
4. **Rate limit handling**: Waits automatically when rate limited
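
The delay and rate-limit behaviour listed above could look roughly like the sketch below. It is not taken from `src/tmdb_scraper.py`: the name `polite_get`, the injected `fetch` callable, the response tuple shape, and the choice to honour a `Retry-After` header on HTTP 429 are all assumptions made for illustration. `fetch` and `sleep` are passed in so the logic can be exercised without network access.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical response shape: (status_code, headers, json_body)
Response = Tuple[int, Dict[str, str], Optional[dict]]


def polite_get(
    url: str,
    fetch: Callable[[str], Response],
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = 3,
) -> Optional[dict]:
    """Fetch `url`, pausing before every request and backing off on HTTP 429."""
    for _ in range(max_retries + 1):
        # Polite delay: random pause in the 0.25-0.5 s range before each request
        sleep(random.uniform(0.25, 0.5))
        status, headers, body = fetch(url)
        if status == 200:
            return body
        if status == 429:
            # Rate limited: wait as long as the server asks (assumed header), then retry
            sleep(float(headers.get("Retry-After", 1)))
            continue
        # Any other status (e.g. 404 for an unknown ID) is treated as "no data"
        return None
    # Still rate limited after all retries
    return None
```

In a real run, `fetch` would wrap an HTTP client call that also sets the User-Agent header; keeping it as a parameter keeps the retry logic independent of the client library.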

### Usage

```bash
# Ensure TMDB_API_KEY is in .env file
uv run python src/tmdb_scraper.py
```

Output: `data/processed/movies_enriched.csv`
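
Checkpoint support can be approximated by treating a CSV of already-fetched rows as the checkpoint and skipping IDs that are present in it. The sketch below is one possible approach, not the script's actual mechanism; the helper names (`load_done_ids`, `append_row`, `pending_ids`) and the use of a `tmdb_id` column as the resume key are assumptions.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List, Set


def load_done_ids(checkpoint: Path) -> Set[int]:
    """Return the tmdb_id values already written to the checkpoint file."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(newline="", encoding="utf-8") as f:
        return {int(row["tmdb_id"]) for row in csv.DictReader(f)}


def append_row(checkpoint: Path, row: Dict[str, object]) -> None:
    """Append one metadata row, writing the header only for a new file."""
    is_new = not checkpoint.exists()
    with checkpoint.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def pending_ids(all_ids: Iterable[int], checkpoint: Path) -> List[int]:
    """IDs still to fetch, in their original order."""
    done = load_done_ids(checkpoint)
    return [i for i in all_ids if i not in done]
```

Appending each row as soon as it is fetched means an interruption loses at most the request in flight, at the cost of opening the file once per movie.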

In [19]:
# Key parts of the TMDB scraper (src/tmdb_scraper.py)

# The scraper class fetches movie details and credits from TMDB API:

'''
class TMDBScraper:
    def fetch_movie_metadata(self, tmdb_id: int) -> dict:
        """Fetch complete metadata for a movie."""
        metadata = {
            'tmdb_id': tmdb_id,
            'director': None,
            'cast': None,
            'country': None,
            'runtime': None,
            'budget': None,
            'revenue': None,
            'release_date': None,
            'original_language': None,
            'vote_average': None,
            'vote_count': None,
            'popularity': None,
            'tagline': None,
            'overview': None,
        }
        
        # Get movie details
        details = self.get_movie_details(tmdb_id)
        if details:
            metadata['runtime'] = details.get('runtime')
            metadata['budget'] = details.get('budget')
            # ... etc
        
        # Add polite delay
        time.sleep(random.uniform(0.25, 0.5))
        
        # Get credits (director, cast)
        credits = self.get_movie_credits(tmdb_id)
        if credits:
            # Extract director from crew
            directors = [p['name'] for p in credits['crew'] if p['job'] == 'Director']
            metadata['director'] = directors[0] if directors else None
            
            # Get top 5 cast
            cast = [p['name'] for p in credits['cast'][:5]]
            metadata['cast'] = '|'.join(cast)
        
        return metadata
'''

print("See src/tmdb_scraper.py for full implementation")

See src/tmdb_scraper.py for full implementation


## Load Source Files

In [20]:
# Load ratings.csv
ratings = pd.read_csv(DATA_RAW / 'ratings.csv')
print(f"Ratings: {len(ratings):,} rows")
print(ratings.head())
print(f"\nColumns: {list(ratings.columns)}")

Ratings: 100,836 rows
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

Columns: ['userId', 'movieId', 'rating', 'timestamp']
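
The `timestamp` column in MovieLens holds seconds since 1970-01-01 UTC. If readable dates are needed later, a conversion along these lines should work; the column name `rated_at` and the two-row `sample` frame are illustrative only.

```python
import pandas as pd

# Two timestamps copied from the ratings preview above
sample = pd.DataFrame({"timestamp": [964982703, 964981247]})

# Interpret the integers as seconds since the Unix epoch, in UTC
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)
print(sample)
```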


In [21]:
# Load tags.csv
tags = pd.read_csv(DATA_RAW / 'tags.csv')
print(f"Tags: {len(tags):,} rows")
print(tags.head())
print(f"\nColumns: {list(tags.columns)}")

Tags: 3,683 rows
   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200

Columns: ['userId', 'movieId', 'tag', 'timestamp']


In [22]:
# Load movies_enriched.csv (output from TMDB scraper)
movies_enriched = pd.read_csv(DATA_PROCESSED / 'movies_enriched.csv')
print(f"Movies enriched: {len(movies_enriched):,} rows")
print(movies_enriched.head())
print(f"\nColumns: {list(movies_enriched.columns)}")

Movies enriched: 9,734 rows
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  tmdbId         director  \
0  Adventure|Animation|Children|Comedy|Fantasy     862    John Lasseter   
1                   Adventure|Children|Fantasy    8844     Joe Johnston   
2                               Comedy|Romance   15602    Howard Deutch   
3                         Comedy|Drama|Romance   31357  Forest Whitaker   
4                                       Comedy   11862    Charles Shyer   

                                                cast country  runtime  \
0  Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...      US     81.0   
1  Robin Williams|Kirsten Dunst|Bradley Pierce|Bo...      US    104.0   
2

In [23]:
# Check enriched data coverage
print("\nEnriched data coverage:")
for col in ['director', 'cast', 'country', 'runtime', 'budget', 'revenue']:
    if col in movies_enriched.columns:
        non_null = movies_enriched[col].notna().sum()
        pct = non_null / len(movies_enriched) * 100
        print(f"  {col}: {non_null:,} ({pct:.1f}%)")


Enriched data coverage:
  director: 9,617 (98.8%)
  cast: 9,587 (98.5%)
  country: 9,566 (98.3%)
  runtime: 9,621 (98.8%)
  budget: 9,621 (98.8%)
  revenue: 9,621 (98.8%)
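
One caveat on the figures above: `notna()` counts any stored number as present. TMDB may report an unknown budget or revenue as `0` rather than leaving it empty, in which case the coverage for those two columns would be overstated. The sketch below shows one way to check, on a toy frame with made-up values; the `coverage` helper and its `zero_is_missing` parameter are hypothetical.

```python
import pandas as pd


def coverage(df: pd.DataFrame, cols, zero_is_missing=()) -> pd.Series:
    """Share of usable values per column, in percent."""
    result = {}
    for col in cols:
        usable = df[col].notna()
        if col in zero_is_missing:
            # Treat a literal 0 as "unknown" rather than a real value
            usable = usable & (df[col] != 0)
        result[col] = round(float(usable.mean()) * 100, 1)
    return pd.Series(result)


# Toy frame standing in for movies_enriched (values are illustrative)
toy = pd.DataFrame({
    "runtime": [81.0, 104.0, None, 101.0],
    "budget": [30000000.0, 0.0, None, 0.0],
})
print(coverage(toy, ["runtime", "budget"], zero_is_missing=["budget"]))
```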


## Enrich Ratings Data

Merge the ratings with two sources, each joined on `movieId` with a left join so that every rating row is kept:
1. Tags (aggregated per movie)
2. Enriched movie metadata from TMDB

In [24]:
# Aggregate tags per movie (combine all tags into pipe-separated string)
tags_agg = (
    tags
    .groupby('movieId')['tag']
    .apply(lambda x: '|'.join(x.astype(str)))
    .reset_index()
    .rename(columns={'tag': 'tags'})
)

print(f"Movies with tags: {len(tags_agg):,}")
print(tags_agg.head())

Movies with tags: 1,572
   movieId                                          tags
0        1                               pixar|pixar|fun
1        2  fantasy|magic board game|Robin Williams|game
2        3                                     moldy|old
3        5                              pregnancy|remake
4        7                                        remake
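
The plain join keeps repeated tags, as in `pixar|pixar|fun` above, which preserves how often a tag was applied. If only the distinct tags matter, a variant like the following could be used instead. It is a sketch on toy data; `join_unique` is a hypothetical helper, and the case-insensitive comparison is an assumption about what should count as a duplicate.

```python
import pandas as pd


def join_unique(values: pd.Series) -> str:
    """Join tags with '|', dropping case-insensitive repeats but keeping order."""
    seen = {}
    for tag in values.astype(str):
        key = tag.strip().lower()
        # Keep the first spelling encountered for each normalized tag
        seen.setdefault(key, tag.strip())
    return "|".join(seen.values())


toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "tag": ["pixar", "pixar", "fun", "fantasy"],
})
toy_agg = (
    toy_tags.groupby("movieId")["tag"]
    .agg(join_unique)
    .reset_index()
    .rename(columns={"tag": "tags"})
)
print(toy_agg)
```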


In [25]:
# Merge ratings with movies_enriched
ratings_enriched = ratings.merge(
    movies_enriched,
    on='movieId',
    how='left'
)

print(f"After merging with movies: {len(ratings_enriched):,} rows")

After merging with movies: 100,836 rows
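
An unchanged row count shows that the left join did not duplicate ratings, but it does not show which ratings failed to find a match. The `validate` and `indicator` arguments of `DataFrame.merge` make both checks explicit. The example below uses small toy frames rather than the real data.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 999],
    "rating": [4.0, 4.0, 3.5],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
})

merged = toy_ratings.merge(
    toy_movies,
    on="movieId",
    how="left",
    validate="many_to_one",  # raises MergeError if movieId repeats in toy_movies
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# movieIds that appear in ratings but have no metadata row
unmatched = merged.loc[merged["_merge"] == "left_only", "movieId"].unique()
print(unmatched)
```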


In [26]:
# Merge with aggregated tags
ratings_enriched = ratings_enriched.merge(
    tags_agg,
    on='movieId',
    how='left'
)

print(f"After merging with tags: {len(ratings_enriched):,} rows")
print(f"\nFinal columns: {list(ratings_enriched.columns)}")

After merging with tags: 100,836 rows

Final columns: ['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres', 'tmdbId', 'director', 'cast', 'country', 'runtime', 'budget', 'revenue', 'release_date', 'original_language', 'vote_average', 'vote_count', 'popularity', 'tagline', 'overview', 'tags']


In [27]:
# Check the result
print("\nDataset info:")
ratings_enriched.info()

print("\n\nSample rows:")
ratings_enriched.head()


Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   userId             100836 non-null  int64  
 1   movieId            100836 non-null  int64  
 2   rating             100836 non-null  float64
 3   timestamp          100836 non-null  int64  
 4   title              100823 non-null  object 
 5   genres             100823 non-null  object 
 6   tmdbId             100823 non-null  float64
 7   director           100511 non-null  object 
 8   cast               100448 non-null  object 
 9   country            100423 non-null  object 
 10  runtime            100515 non-null  float64
 11  budget             100515 non-null  float64
 12  revenue            100515 non-null  float64
 13  release_date       100515 non-null  object 
 14  original_language  100515 non-null  object 
 15  vote_average       100515 non-null  

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,tmdbId,director,cast,country,...,budget,revenue,release_date,original_language,vote_average,vote_count,popularity,tagline,overview,tags
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,862.0,John Lasseter,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,US,...,30000000.0,394436586.0,1995-11-22,en,7.97,19291.0,18.8275,The adventure takes off when toys come to life!,"Led by Woody, Andy's toys live happily in his ...",pixar|pixar|fun
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,15602.0,Howard Deutch,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,US,...,25000000.0,71500000.0,1995-12-22,en,6.5,409.0,2.2075,Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,moldy|old
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,949.0,Michael Mann,Al Pacino|Robert De Niro|Val Kilmer|Jon Voight...,US,...,60000000.0,187400000.0,1995-12-15,en,7.9,7917.0,11.0379,A Los Angeles crime saga.,Obsessive master thief Neil McCauley leads a t...,
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,807.0,David Fincher,Morgan Freeman|Brad Pitt|Gwyneth Paltrow|John ...,US,...,33000000.0,327311859.0,1995-09-22,en,8.4,22288.0,13.2001,Gluttony. Greed. Sloth. Envy. Wrath. Pride. Lust.,Two homicide detectives are on a desperate hun...,mystery|twist ending|serial killer
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,629.0,Bryan Singer,Stephen Baldwin|Gabriel Byrne|Benicio del Toro...,US,...,6000000.0,23300000.0,1995-07-19,en,8.2,10965.0,7.8931,Five criminals. One line up. No coincidence.,"Held in an L.A. interrogation room, Verbal Kin...",mindfuck|suspense|thriller|tricky|twist ending...


In [28]:
# Check for missing enriched data
print("\nMissing data counts:")
for col in ['title', 'director', 'cast', 'country', 'runtime']:
    if col in ratings_enriched.columns:
        null_count = ratings_enriched[col].isna().sum()
        pct = null_count / len(ratings_enriched) * 100
        print(f"  {col}: {null_count:,} missing ({pct:.1f}%)")


Missing data counts:
  title: 13 missing (0.0%)
  director: 325 missing (0.3%)
  cast: 388 missing (0.4%)
  country: 413 missing (0.4%)
  runtime: 321 missing (0.3%)


## Save Enriched Dataset

In [29]:
# Save to processed folder
output_path = DATA_PROCESSED / 'ratings_enriched.csv'
ratings_enriched.to_csv(output_path, index=False)

print(f"Saved enriched ratings to: {output_path}")
print(f"Total rows: {len(ratings_enriched):,}")
print(f"Total columns: {len(ratings_enriched.columns)}")

Saved enriched ratings to: ../data/processed/ratings_enriched.csv
Total rows: 100,836
Total columns: 21
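
Because some ratings have no matching movie, `tmdbId` is stored as `float64` after the merge and would be written to the CSV as values like `862.0`. If integer IDs are preferred downstream, pandas' nullable `Int64` dtype is one option. The round trip below is a sketch on a toy frame, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"movieId": [1, 2], "tmdbId": [862.0, None]})

# Nullable integer dtype keeps missing values without forcing floats
toy["tmdbId"] = toy["tmdbId"].astype("Int64")

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read back with the same dtype so the column stays integer-valued
restored = pd.read_csv(io.StringIO(csv_text), dtype={"tmdbId": "Int64"})
print(restored.dtypes)
```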
