In [1]:
from src import ProjectManager
from src import MovieScraper
from src import DatabaseOps
from src import MoviePreprocessor
from src import MovieEmbedder
from src import SearchAndRetrieval
import pandas as pd

# DISNEY+ LLM DATA ENGINEER PRE-ASSIGNMENT

Note: In deployment, this workflow would typically be wrapped in an execution/orchestration class.

See `disney_etl.py` in the source code for an example of how this might be done.

For the purposes of this demo, I'm instantiating and using each of the worker classes individually for clarity.
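
As a rough sketch of what such a wrapper could look like, the hypothetical `EtlPipeline` class below runs named steps in order and hands each step's output to the next. The class name, step names, and stand-in lambdas are illustrative assumptions; they are not taken from `disney_etl.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class EtlPipeline:
    """Hypothetical orchestrator: runs named steps in registration order."""

    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)

    def add_step(self, name: str, func: Callable[[Any], Any]) -> "EtlPipeline":
        self.steps.append((name, func))
        return self  # returning self allows chained registration

    def run(self, payload: Any = None) -> Any:
        for name, func in self.steps:
            print(f"Running step: {name}")
            payload = func(payload)  # each step receives the previous output
        return payload


# Stand-in steps; in this notebook they would delegate to the worker classes.
pipeline = (
    EtlPipeline()
    .add_step("ingest", lambda _: [{"title": "  Bambi "}])
    .add_step(
        "preprocess",
        lambda rows: [{"title": row["title"].strip().lower()} for row in rows],
    )
)
result = pipeline.run()
```

Keeping the steps as plain callables means each worker class can still be tested on its own, which is the same reason they are used individually in this demo.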

In [2]:
pm = ProjectManager()

## 1. Data Ingestion

In [3]:
scraper = MovieScraper(project_manager=pm)


### a. The dataset

The scraper fetches all of Disney's listed movie titles (699 in total) along with their metadata and stores each as a JSON object.

See `data_ingestion.py` in the source code.


In [4]:
scraper.run()

Iterating through Disney's API by page:   0%|          | 0/18 [00:00<?, ?it/s]

Processing movies:   0%|          | 0/699 [00:00<?, ?it/s]

### b. Database operations

Here, I load the raw JSON records into a SQLite database (`movie_data.db`).
SQLite is convenient for a dataset this small: it is sufficient for querying, and it mimics the code structure one might use in production.

See `database_utils.py` for source code.
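
For readers without the source at hand, here is a minimal sketch of loading a directory of JSON files into SQLite with only the standard library. The function name `load_json_dir`, the three-column schema, and the `(loaded, failed)` return value are assumptions made for illustration; the actual `DatabaseOps` implementation may differ.

```python
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Tuple


def load_json_dir(conn: sqlite3.Connection, json_dir: Path) -> Tuple[int, int]:
    """Insert every *.json file in json_dir; return (loaded, failed) counts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "slug TEXT PRIMARY KEY, title TEXT, raw_json TEXT)"
    )
    loaded, failed = 0, 0
    for path in sorted(json_dir.glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
            conn.execute(
                "INSERT OR REPLACE INTO movies (slug, title, raw_json) "
                "VALUES (?, ?, ?)",
                (record["slug"], record.get("title"), json.dumps(record)),
            )
            loaded += 1
        except (json.JSONDecodeError, KeyError):
            failed += 1  # malformed file or missing slug: skip and count it
    conn.commit()
    return loaded, failed


# Usage with a throwaway directory and an in-memory database.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "a.json").write_text(
        json.dumps({"slug": "bambi", "title": "Bambi"}), encoding="utf-8"
    )
    (tmp_dir / "b.json").write_text("{not valid json", encoding="utf-8")
    conn = sqlite3.connect(":memory:")
    counts = load_json_dir(conn, tmp_dir)
    row_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
```

Storing the full record as a JSON string next to a few indexed columns keeps the loader simple while still allowing lookups by slug.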

In [3]:
db = DatabaseOps(pm)

In [5]:
# Directory containing the raw JSON files
raw_dir = pm.directories.get("raw") / "movie-data"

In [8]:
db.batch_insert_from_json_files(raw_dir)

Loading movies to database: 0it [00:00, ?it/s]

(699, 0)

## 2. Data Preprocessing

In [4]:
preprocessor = MoviePreprocessor(project_manager=pm)

In [5]:
movies = db.get_all_movies()

In [6]:
preprocessed_movies = preprocessor.preprocess_movies(movies)

Preprocessing movies:   0%|          | 0/699 [00:00<?, ?it/s]


### a. Transformation(s) & b. storage / retrieval optimization
The dataset is already fairly well organized because of the way it was collected, but an additional preprocessing step cleans the text data before it is loaded into the SQLite database.
See `data_preprocessing` in the source code.

The `MoviePreprocessor` class handles both data cleaning and storage optimization; the preprocessed data is then stored in a SQLite table.
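
The sketch below illustrates the kind of cleaning such a step might perform: lowercasing, whitespace normalization, and dropping empty list entries. The field groupings and the helper names `clean_text` and `clean_movie` are assumptions; they are not the actual `MoviePreprocessor` API.

```python
import re
from typing import Any, Dict, List

# Assumed field groupings, based on the columns shown in this notebook.
TEXT_FIELDS = ("title", "description", "rating")
LIST_FIELDS = ("genres", "directors", "writers", "cast")


def clean_text(value: Any) -> str:
    """Lowercase, trim, and collapse runs of whitespace; None becomes ''."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", str(value)).strip().lower()


def clean_movie(movie: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of one movie record."""
    cleaned = dict(movie)
    for name in TEXT_FIELDS:
        cleaned[name] = clean_text(movie.get(name))
    for name in LIST_FIELDS:
        items: List[Any] = movie.get(name) or []
        cleaned_items = (clean_text(item) for item in items)
        cleaned[name] = [item for item in cleaned_items if item]
    return cleaned


example = clean_movie(
    {
        "title": "  101  Dalmatians ",
        "description": None,
        "rating": "G",
        "genres": ["Family", " Animation ", ""],
    }
)
```

Normalizing missing list fields to empty lists keeps later steps, such as joining values with `|`, free of `None` checks.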


In [7]:
df = pd.DataFrame(preprocessed_movies)

for col in ['genres', 'directors', 'writers', 'cast']:
    if col in df.columns:
        df[col] = df[col].apply(lambda x: '|'.join(x) if x else '')

df.head()

Unnamed: 0,title,description,rating,runtime,genres,directors,writers,cast,year,slug,type
0,101 dalmatians,walt disney’s beloved animated masterpiece 101...,g,1h 19min,family|animation|action-adventure,hamilton luske|wolfgang reitherman|clyde geronimi,bill peet|dodie smith,,1961,101-dalmatians-1961,movie
1,101 dalmatians (1996),cruella de vil dognaps all of the dalmatian pu...,g,,family|comedy|live action|adventure,stephen herek,dodie smith|john hughes,,1996,101-dalmatians-1996,movie
2,101 dalmatians ii: patch's london adventure,"the adventure begins when patch, gets the chan...",g,,family|animation|action-adventure,brian smith|jim kammerud,garrett k. schiff|dodie smith|dan root,,2003,101-dalmatians-2-patchs-london-adventure,movie
3,102 dalmatians,"oddball, the spotless dalmatian puppy on a sea...",g,,family|comedy|live action|adventure,kevin lima,noni white|bob tzudiker|dodie smith|brian rega...,,2000,102-dalmatians,movie
4,"20,000 leagues under the sea",climb aboard the nautilus and into a strange u...,g,,live action|adventure|science fiction|fantasy,richard fleischer,jules verne|earl felton,,1954,20000-leagues-under-the-sea,movie


## 3. Vectorization

In [7]:
embedder = MovieEmbedder(pm)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



### a. Generating embeddings with Google AI's T5 & b. Vector storage with FAISS index


The vectorization pipeline leverages Google's T5-base model to generate semantic embeddings for movie data, combining movie titles, descriptions, genres, cast, and other metadata into rich vector representations. 
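
To make the "combining" step concrete, here is a minimal sketch of flattening one movie record into a single string before it is passed to the model. The field order, the `name: value` layout, and the function name `movie_to_text` are assumptions; the actual `MovieEmbedder` may format its text differently.

```python
from typing import Any, Dict

# Assumed set and order of fields to include in the text representation.
FIELDS = ("title", "description", "genres", "directors", "writers", "cast")


def movie_to_text(movie: Dict[str, Any]) -> str:
    """Join the non-empty fields of a movie into one labelled string."""
    parts = []
    for name in FIELDS:
        value = movie.get(name)
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value if item)
        if value:  # skip missing fields and empty lists
            parts.append(f"{name}: {value}")
    return ". ".join(parts)


text = movie_to_text(
    {"title": "102 dalmatians", "genres": ["family", "comedy"], "cast": []}
)
```

Skipping empty fields avoids feeding the model labels with no content, which matters here because some records have no cast or runtime.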

The implementation uses batch processing and GPU acceleration (via Apple Metal when available) to process the ~700-movie dataset efficiently.

Generated embeddings are stored using FAISS (Facebook AI Similarity Search), implementing an IVF (Inverted File) index that enables sub-linear search complexity and efficient similarity queries.
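
The idea behind an inverted-file index can be shown with a toy, pure-Python version: vectors are bucketed by their nearest centroid, and a query scans only the `nprobe` closest buckets instead of every vector. This is not FAISS's API, and the `ToyIVF` class is hypothetical; the centroids are fixed by hand here, whereas a real IVF index learns them in a training step.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Toy inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids: List[Vector]) -> None:
        self.centroids = centroids
        self.lists: Dict[int, List[Tuple[int, Vector]]] = {
            i: [] for i in range(len(centroids))
        }

    def _ranked_cells(self, vec: Vector) -> List[int]:
        """Centroid indices ordered from nearest to farthest."""
        return sorted(
            range(len(self.centroids)), key=lambda i: l2(vec, self.centroids[i])
        )

    def add(self, vec_id: int, vec: Vector) -> None:
        self.lists[self._ranked_cells(vec)[0]].append((vec_id, vec))

    def search(
        self, query: Vector, k: int = 1, nprobe: int = 1
    ) -> List[Tuple[int, float]]:
        candidates: List[Tuple[int, float]] = []
        for cell in self._ranked_cells(query)[:nprobe]:  # scan nprobe buckets only
            candidates.extend(
                (vec_id, l2(query, vec)) for vec_id, vec in self.lists[cell]
            )
        return sorted(candidates, key=lambda pair: pair[1])[:k]


index = ToyIVF(centroids=[(0.0, 0.0), (10.0, 10.0)])
for i, vec in enumerate([(0.1, 0.2), (0.5, 0.4), (9.8, 10.1), (10.5, 9.9)]):
    index.add(i, vec)
hits = index.search((10.0, 10.0), k=2, nprobe=1)
```

Raising `nprobe` trades speed for recall: more buckets are scanned, so fewer true neighbours are missed.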


The pipeline processes each movie's textual data and converts it into a 768-dimensional embedding vector. 

These embeddings are organized in a FAISS index optimized for CPU-based similarity search, allowing for quick retrieval while maintaining a reasonable memory footprint.

The system automatically handles device selection (GPU/CPU), includes progress tracking, and implements efficient batch processing to manage memory usage, making it scalable for larger datasets.
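
The batching itself can be sketched in a few lines. The helper name `batched` is an assumption; the numbers come from this notebook, where 699 texts in batches of 32 give 22 batches.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = [f"movie {i}" for i in range(699)]
batches = list(batched(texts, 32))
```

Only one batch needs to be on the device at a time, which is what keeps memory usage bounded as the dataset grows.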

In [None]:
embedder.process_embeddings(movies=preprocessed_movies, use_parallel=True)

Starting embedding generation and indexing
Creating text representations for 699 movies...


Generating movie texts:   0%|          | 0/699 [00:00<?, ?it/s]

Setting up slug mappings...
Generating embeddings in batches of 32...


Generating embeddings:   0%|          | 0/22 [00:00<?, ?it/s]

Starting batch 1/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 1/22
Starting batch 2/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 2/22
Starting batch 3/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 3/22
Starting batch 4/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 4/22
Starting batch 5/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 5/22
Starting batch 6/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 6/22
Starting batch 7/22
Batch size: 32
Processing 32 texts for embedding
Running model inference...
Converting to numpy...
Completed batch 7/22
Starting batch 8/22


## 4. Query and Retrieve

In [None]:
qr = SearchAndRetrieval()


### a. Text query for similarity search

In [None]:
# Character types


### b. Retrieval-Augmented Generation

## 5. Documentation 

## 6. Bonus

I've also integrated the "bonus" functionality into the worker classes, since logging and parallel processing are useful throughout the pipeline.

These, along with some other project utility functions, live in the `ProjectManager` class.

For more detail, please refer to `project_utils.py` in the source code.

### a. Pipeline logging

In [3]:
#load logging
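
A minimal sketch of pipeline logging using only the standard library follows. The helper name `get_pipeline_logger`, the logger name, and the message format are assumptions, not the setup used by `ProjectManager`.

```python
import logging
import sys


def get_pipeline_logger(
    name: str = "pipeline", level: int = logging.INFO
) -> logging.Logger:
    """Return a logger that writes timestamped records to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers when a cell is re-run
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s | %(name)s | %(levelname)s | %(message)s"
            )
        )
        logger.addHandler(handler)
    return logger


logger = get_pipeline_logger("etl_demo")
logger.info("Pipeline step finished")
```

The handler check matters in notebooks, where re-running a cell would otherwise attach a second handler and print every message twice.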

### b. Parallel processing

`project_utils.py` contains a rudimentary implementation of threading/pooling for operations.

This is an example of a useful method for processing large amounts of data (e.g. a richer corpus of text data).
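
As a rough illustration of thread pooling, the sketch below maps a function over a list of inputs with `concurrent.futures.ThreadPoolExecutor`. The helper name `parallel_map` and the worker count are assumptions; the implementation in `project_utils.py` may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(
    func: Callable[[T], R], items: Iterable[T], max_workers: int = 4
) -> List[R]:
    """Apply func to every item using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))  # map preserves input order


lengths = parallel_map(len, ["bambi", "dumbo", "fantasia"])
```

Threads suit I/O-bound work such as API requests or file reads; CPU-bound work in Python generally benefits more from a process pool.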