In [9]:
from pathlib import Path
from src import DisneyETL
import pandas as pd

# DISNEY+ LLM DATA ENGINEER PRE-ASSIGNMENT

In [3]:
# Initializing the ETL pipeline execution / orchestration class:
etl = DisneyETL()

## 1. Data Ingestion


### a. The dataset

The pipeline fetches all of Disney's listed movie titles (699 in total) along with their metadata and stores each one as a JSON object.

See *data_ingestion.py* in the source code.
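
As a rough illustration of the storage half of this step, the sketch below writes each record to its own JSON file named after its slug, which is the naming pattern visible in the log output. The `save_movies` helper and the sample records are hypothetical; the actual logic lives in *data_ingestion.py* and may differ.

```python
import json
import tempfile
from pathlib import Path


def save_movies(movies, out_dir):
    """Write each movie dict to <out_dir>/<slug>.json and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for movie in movies:
        path = out_dir / f"{movie['slug']}.json"
        path.write_text(
            json.dumps(movie, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        paths.append(path)
    return paths


# Hypothetical sample records; the real ones come from the API response.
sample_movies = [
    {"slug": "101-dalmatians-1961", "title": "101 Dalmatians", "year": 1961},
    {"slug": "a-bugs-life", "title": "A Bug's Life", "year": 1998},
]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_movies(sample_movies, tmp)
    saved_names = sorted(p.name for p in saved)
    reloaded = json.loads(saved[0].read_text(encoding="utf-8"))
```

One file per movie keeps a re-run cheap to resume, since an existing file can be skipped without re-fetching.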


In [3]:
etl.collect_data()

2024-11-26 02:41:47 | DisneyETL | INFO | Starting data collection process
2024-11-26 02:41:47 | DisneyETL | INFO | Saving movie data to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data
2024-11-26 02:41:47 | DisneyETL | INFO | Fetching all movies from Disney API...
2024-11-26 02:41:48 | DisneyETL | INFO | Found 699 total movies to process


Iterating through Disney's API:   0%|          | 0/18 [00:00<?, ?it/s]

2024-11-26 02:42:03 | DisneyETL | INFO | Successfully collected 699 movies
2024-11-26 02:42:03 | DisneyETL | INFO | Found 699 movies
2024-11-26 02:42:03 | DisneyETL | INFO | Processing individual movies...


Processing movies:   0%|          | 0/699 [00:00<?, ?it/s]

2024-11-26 02:42:04 | DisneyETL | INFO | Saved movie to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data/101-dalmatians-1961.json
2024-11-26 02:42:04 | DisneyETL | INFO | Saved movie to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data/101-dalmatians-1996.json
2024-11-26 02:42:05 | DisneyETL | INFO | Saved movie to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data/101-dalmatians-2-patchs-london-adventure.json
2024-11-26 02:42:06 | DisneyETL | INFO | Saved movie to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data/102-dalmatians.json
2024-11-26 02:42:06 | DisneyETL | INFO | Saved movie to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data/20000-leagues-under-the-sea.json
2024-11-26 02:42:07 | DisneyETL | INFO | Saved movie to /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data/a-bugs-life.json
2024-11-26 0

True

### b. Database operations

Here, I load the preprocessed JSON records (see Section 2) into a SQLite database (`disney_movies.db`).
SQLite is convenient because the dataset is small; it is sufficient for querying and mimics the code structure one might see in production.

See `database_utils.py` for source code.
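
A minimal sketch of a batched load into SQLite, using only the standard library. The table schema, the `load_movies` helper, and the batch size of 10 (suggested by the "Processed 10 movies" log cadence) are assumptions for illustration, not the contents of `database_utils.py`.

```python
import sqlite3


def load_movies(conn, movies, batch_size=10):
    """Insert movie dicts in batches and return the resulting row count."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS movies (
            slug   TEXT PRIMARY KEY,
            title  TEXT NOT NULL,
            year   INTEGER,
            genres TEXT
        )
        """
    )
    rows = [(m["slug"], m["title"], m.get("year"), m.get("genres")) for m in movies]
    for start in range(0, len(rows), batch_size):
        with conn:  # each batch is committed as one transaction
            conn.executemany(
                "INSERT OR REPLACE INTO movies (slug, title, year, genres) "
                "VALUES (?, ?, ?, ?)",
                rows[start:start + batch_size],
            )
    return conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]


# Sample rows taken from the preprocessed data shown later in this notebook.
sample = [
    {"slug": "greyfriars-bobby-the-true-story-of-a-dog",
     "title": "greyfriars bobby: the true story of a dog",
     "year": 1961, "genres": "animals/nature|drama|family|historical"},
    {"slug": "101-dalmatians-1961", "title": "101 dalmatians",
     "year": 1961, "genres": "action-adventure|animation|family"},
    {"slug": "eloise-at-the-plaza", "title": "eloise at the plaza",
     "year": 2003, "genres": "comedy|family|live action"},
]

conn = sqlite3.connect(":memory:")  # in-memory database for the example
count = load_movies(conn, sample)
animated = conn.execute(
    "SELECT title FROM movies WHERE genres LIKE ? ORDER BY year",
    ("%animation%",),
).fetchall()
```

`INSERT OR REPLACE` keyed on the slug makes the load idempotent, so running it twice does not duplicate rows.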

In [5]:
etl.load_to_sqlite()

2024-11-26 02:51:15 | DisneyETL | INFO | Starting data loading process
2024-11-26 02:51:15 | DisneyETL | INFO | Initialized database at /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/processed/disney_movies.db
2024-11-26 02:51:15 | DisneyETL | INFO | Starting batch insert from /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data-preprocessed
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 10 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 20 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 30 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 40 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 50 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 60 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 70 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 80 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 90 movies
2024-11-26 02:51:15 | DisneyETL | INFO | Processed 100 movie

True

## 2. Data Preprocessing

### a. Transformation(s) & b. Storage / retrieval optimization
The dataset is already fairly well organized because of the way it was collected, but an additional preprocessing step cleans the text fields before they are loaded into the SQLite database.
See `data_preprocessing` in the source code.

The preprocessing pipeline handles both data cleaning and storage optimization through the `MoviePreprocessor` class; the cleaned data is then stored in a SQLite table.
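
The sketch below approximates the kind of cleaning visible in the preprocessed output: lowercased text, list fields joined with `|`, and a binary `animation` flag derived from the genres. The `preprocess_movie` function and the shape of the raw record are assumptions; `MoviePreprocessor` itself may work differently.

```python
def preprocess_movie(raw):
    """Lowercase text fields, join list fields with '|', add an animation flag."""

    def clean(text):
        # Collapse runs of whitespace and lowercase.
        return " ".join(str(text).split()).lower()

    def join(values):
        return "|".join(clean(v) for v in (values or []))

    genres = join(raw.get("genres"))
    return {
        "title": clean(raw.get("title", "")),
        "description": clean(raw.get("description", "")),
        "genres": genres,
        "cast": join(raw.get("cast")),
        "year": raw.get("year"),
        "animation": int("animation" in genres.split("|")),
    }


# Hypothetical raw record; the real API payload may be shaped differently.
raw_movie = {
    "title": "101 Dalmatians",
    "description": "  Walt Disney's beloved   animated masterpiece ",
    "genres": ["Action-Adventure", "Animation", "Family"],
    "cast": ["Betty Lou Gerson", "Lisa Davis"],
    "year": 1961,
}

cleaned = preprocess_movie(raw_movie)
```

Flattening lists into delimited strings keeps every field a scalar, which maps directly onto SQLite columns and CSV cells.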


In [4]:
# Run preprocessing (writes the cleaned JSON files and an analysis CSV):
etl.preprocess_data()

2024-11-26 02:48:38 | DisneyETL | INFO | Starting data preprocessing
2024-11-26 02:48:38 | DisneyETL | INFO | Starting data preprocessing
2024-11-26 02:48:38 | DisneyETL | INFO | Created analysis CSV at /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/disney_movies_analysis.csv
2024-11-26 02:48:38 | DisneyETL | INFO | 
Preprocessing Summary:
2024-11-26 02:48:38 | DisneyETL | INFO | Total movies processed: 699
2024-11-26 02:48:38 | DisneyETL | INFO | Years covered: 1937 to 2026
2024-11-26 02:48:38 | DisneyETL | INFO | Animation movies: 228
2024-11-26 02:48:38 | DisneyETL | INFO | Average description length: 398 characters
2024-11-26 02:48:38 | DisneyETL | INFO | 
Files saved to: /Users/ben-holden-artifex/Desktop/projects/disney-assessment/data/raw/movie-data-preprocessed


True

In [14]:
data = pd.read_csv("../data/raw/disney_movies.csv")
data.head(100)

Unnamed: 0,title,description,rating,runtime,genres,directors,writers,cast,year,release_date,slug,type,link,animation
0,greyfriars bobby: the true story of a dog,walt disney presents the remarkable true story...,nr,,animals/nature|drama|family|historical,don chaffey,robert westerby|eleanor atkinson,laurence naismith|jennifer nevison|sean keir|j...,1961,1961-07-17,greyfriars-bobby-the-true-story-of-a-dog,movie,https://movies.disney.com/greyfriars-bobby-the...,0
1,angels in the infield,"bob ""bungler"" bugler is the celestial coach ca...",tv-pg,,comedy|family|live action|sports,robert king,holly goldberg sloan|richard conlin|garrett k....,joanne boland|laura catalano|dan duran|colin f...,2000,2000-04-09,angels-in-the-infield,movie,https://movies.disney.com/angels-in-the-infield,0
2,sing along songs: campout at walt disney world,"in campout at walt disney world, mickey, minni...",nr,,animation|family,,,,2006,2006-09-27,sing-along-songs-campout-at-walt-disney-world,movie,https://movies.disney.com/sing-along-songs-cam...,1
3,eloise at the plaza,eloise is a fun-loving little girl with a knac...,tv-g,,comedy|family|live action,kevin lima,janet brownell|kay thompson,victor a. young|jeffrey tambor|corinne conley|...,2003,2003-04-27,eloise-at-the-plaza,movie,https://movies.disney.com/eloise-at-the-plaza,0
4,rolie polie olie: the great defender of fun,"it's little sister zowie's birthday party, and...",g,,adventure|animation|family|preschool|science f...,ron pitts,nadine van der velde|william joyce,joshua tucci|james woods|catherine disher|cole...,2002,2002-08-13,rolie-polie-olie-the-great-defender-of-fun,movie,https://movies.disney.com/rolie-polie-olie-the...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,iron man & hulk: heroes united,marvel makes cinematic history again with the ...,pg,,action-adventure|animation|science fiction|sup...,,,,2013,2013-12-03,iron-man-and-hulk-heroes-united,movie,https://movies.disney.com/iron-man-and-hulk-he...,1
96,the light in the forest,the year is 1764 when a peace treaty between t...,nr,,classics|drama|family|live action,,,wendell corey|jessica tandy|fess parker|joanne...,1958,1958-07-08,the-light-in-the-forest,movie,https://movies.disney.com/the-light-in-the-forest,0
97,meet the deedles,"laughter is taken to the ""x-treme"" when hawaii...",pg,,comedy|family|live action,,,paul walker|steve van wormer|john ashton|denni...,1998,1998-03-27,meet-the-deedles,movie,https://movies.disney.com/meet-the-deedles,0
98,101 dalmatians,walt disney’s beloved animated masterpiece 101...,g,1h 19min,action-adventure|animation|family,hamilton luske|clyde geronimi|wolfgang reitherman,dodie smith|bill peet,betty lou gerson|lisa davis|queenie leonard|ge...,1961,1961-01-25,101-dalmatians-1961,movie,https://movies.disney.com/101-dalmatians-1961,1


## 3. Vectorization

### a. Generating embeddings with Google AI's T5 & b. Vector storage with FAISS index


The vectorization pipeline leverages Google's T5-base model to generate semantic embeddings for movie data, combining movie titles, descriptions, genres, cast, and other metadata into rich vector representations. 
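
One plausible way to combine the fields into a single string before embedding is sketched here. The `build_embedding_text` helper, the field order, and the `field: value` layout are assumptions; the pipeline's actual text template is not shown in this notebook.

```python
def build_embedding_text(
    movie, fields=("title", "description", "genres", "directors", "cast")
):
    """Concatenate the non-empty fields of a movie into one string to embed."""
    parts = []
    for field in fields:
        value = movie.get(field)
        if not value:  # skip missing or empty fields
            continue
        # Pipe-delimited lists read more naturally as comma-separated text.
        parts.append(f"{field}: {str(value).replace('|', ', ')}")
    return ". ".join(parts)


movie = {
    "title": "meet the deedles",
    "genres": "comedy|family|live action",
    "directors": None,
    "cast": "paul walker|steve van wormer",
}

text = build_embedding_text(movie)
print(text)
```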

The implementation uses batch processing and GPU acceleration (via Apple Metal when available) for efficient processing of the ~700 movie dataset. 
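
Batching itself can be sketched in a few lines. The batch size of 32 is an assumption: it is one value consistent with the 22 batches shown in the progress bar for 699 movies, but the real setting is not stated here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"movie {i}" for i in range(699)]  # stand-in for the 699 movie texts
batches = list(batched(texts, 32))          # assumed batch size
num_batches = len(batches)
last_batch_size = len(batches[-1])
```

Each batch would be tokenized and passed through the encoder in one forward pass, which bounds peak memory by the batch size rather than the dataset size.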

Generated embeddings are stored using FAISS (Facebook AI Similarity Search), implementing an IVF (Inverted File) index that enables sub-linear search complexity and efficient similarity queries.


In the run logged below, embedding all 699 movies took roughly one minute (03:01:05 to 03:02:11), with each movie's textual data converted into a 768-dimensional embedding vector.

These embeddings are organized in a FAISS index optimized for CPU-based similarity search, allowing for quick retrieval (< 100ms query time) while maintaining a reasonable memory footprint (~1MB per 100 movies). 
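
As a sanity check on the memory figure, the raw vector storage can be computed directly, assuming float32 vectors (the type FAISS works with). This counts the vectors only; index overhead such as IVF centroids and ids comes on top, so it is a lower bound to compare against the ~1MB per 100 movies estimate.

```python
EMBEDDING_DIM = 768      # dimensionality stated above
BYTES_PER_FLOAT32 = 4    # assumed storage type


def raw_vector_bytes(n_vectors, dim=EMBEDDING_DIM):
    """Bytes needed to store n_vectors float32 vectors of the given dimension."""
    return n_vectors * dim * BYTES_PER_FLOAT32


per_100_mib = raw_vector_bytes(100) / 2**20
full_mib = raw_vector_bytes(699) / 2**20
print(f"{per_100_mib:.2f} MiB per 100 movies, {full_mib:.2f} MiB for 699 movies")
# prints "0.29 MiB per 100 movies, 2.05 MiB for 699 movies"
```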

The system automatically handles device selection (GPU/CPU), includes progress tracking, and implements efficient batch processing to manage memory usage, making it scalable for larger datasets.

In [None]:
etl.vectorize_data()

2024-11-26 03:00:32 | DisneyETL | INFO | Starting embedding creation
2024-11-26 03:00:32 | DisneyETL | INFO | Loading t5-base model and tokenizer...


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

2024-11-26 03:01:04 | DisneyETL | INFO | Using Apple Metal GPU acceleration
2024-11-26 03:01:05 | DisneyETL | INFO | Starting embedding generation and indexing
2024-11-26 03:01:05 | DisneyETL | INFO | Found 699 movies to process


Generating embeddings:   0%|          | 0/22 [00:00<?, ?it/s]

2024-11-26 03:02:11 | DisneyETL | INFO | Combining embeddings...
2024-11-26 03:02:11 | DisneyETL | INFO | Building FAISS index...


## 4. Query and Retrieve
### a. Text query for similarity search

In [None]:
# Character types
results = etl.movie_search("movies with clever and witty sidekick characters")

In [None]:
results = etl.movie_search("stories about unlikely heroes who save the day")

In [None]:
results = etl.movie_search("movies about the relationship between fathers and daughters")

In [None]:
results = etl.movie_search("stories about unlikely friendships between different species")

In [None]:
results = etl.movie_search("Movies starring Tom Hanks")

In [None]:
results = etl.movie_search("Christmas movies with funny villains")
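
How `movie_search` works internally is not shown in this notebook. Conceptually, a similarity search embeds the query and ranks the stored vectors by closeness; the sketch below does this by brute force with cosine similarity over toy 3-dimensional vectors. It is a stand-in for the idea only, not the FAISS IVF lookup, and all names and vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, vectors, k=2):
    """Return the k (score, name) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), name) for name, vec in vectors.items()]
    return sorted(scored, reverse=True)[:k]


# Toy embeddings; real ones would be 768-dimensional model outputs.
toy_index = {
    "movie-a": [1.0, 0.0, 0.0],
    "movie-b": [0.9, 0.1, 0.0],
    "movie-c": [0.0, 1.0, 0.0],
}

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
hit_names = [name for _, name in hits]
```

An IVF index avoids scoring every vector by first narrowing the search to a few clusters, trading a little recall for speed.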

### b. Retrieval-Augmented Generation

## 5. Documentation 

## 6. Bonus

### a. Pipeline logging

For more detail, please refer to *project_utils.py* in the source code, which contains the pipeline's logging system; logs are written to the logs directory.
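
A logger that reproduces the `timestamp | name | level | message` layout seen in the output above could be configured roughly as follows. The `build_logger` helper is hypothetical, and the example writes to an in-memory buffer rather than a file in the logs directory; the real setup in *project_utils.py* may differ.

```python
import io
import logging
import re


def build_logger(name="DisneyETL", stream=None):
    """Return a logger that emits 'timestamp | name | LEVEL | message' lines."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on re-run
    logger.propagate = False     # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(stream=buffer)
log.info("Starting data collection process")
line = buffer.getvalue().strip()

# Check the emitted line against the layout used in this notebook's output.
pattern = (
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \| DisneyETL \| INFO \| "
    r"Starting data collection process"
)
matches_format = bool(re.fullmatch(pattern, line))
```

Clearing existing handlers matters in a notebook, where re-running a cell would otherwise attach a second handler and print every message twice.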


### b. Optimizing for scalability and parallel processing

A rudimentary implementation exists in *project_utils.py* for threading/pooling certain operations. It is a quick sketch of the architecture I would use to process larger volumes of bulkier data.
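
The general shape of such pooling, sketched with the standard library's `ThreadPoolExecutor`. The `process_movie` task and the worker count are placeholders, not the code in *project_utils.py*.

```python
from concurrent.futures import ThreadPoolExecutor


def process_movie(movie):
    """Placeholder per-movie task; real work might be an API call or file write."""
    return {**movie, "title": movie["title"].lower()}


def process_in_parallel(movies, max_workers=4):
    """Apply process_movie across a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_movie, movies))


movies = [{"title": "Meet the Deedles"}, {"title": "Eloise at the Plaza"}]
processed = process_in_parallel(movies)
```

Threads suit I/O-bound steps such as fetching and saving; CPU-bound steps would more likely benefit from a process pool.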