# IMDB Data Preparation for Snowflake

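This notebook loads the IMDB top-1000 dataset, preprocesses the movie overviews, normalizes the data into relational tables, and saves the result for Snowflake ingestion and AI agent use.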

In [None]:
import sys

import pandas as pd

# Make the project modules in the parent directory importable
sys.path.append('..')
from data_preparation import prepare_data_for_snowflake, save_for_snowflake
from text_preprocessing import preprocess_text, extract_keywords

# Load the raw dataset
df = pd.read_csv('../data/imdb_top_1000.csv')

# Apply text preprocessing
df['processed_overview'] = df['Overview'].apply(preprocess_text)
df['keywords'] = df['Overview'].apply(lambda x: extract_keywords(x, n_keywords=5))

# Prepare normalized dataframes
prepared_data = prepare_data_for_snowflake(df)

# Create output directory for processed data
!mkdir -p ../data/processed

# Save prepared data
save_for_snowflake(prepared_data, '../data/processed')
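
The cell above relies on `preprocess_text` and `extract_keywords` from `text_preprocessing`, which are not shown in this notebook. As a rough illustration of what such helpers might do, here is a minimal stand-in. The function names `simple_preprocess_text` and `simple_extract_keywords`, the stop-word list, and the frequency-based keyword ranking are all assumptions made for this sketch, not the project's actual implementation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is",
              "his", "her", "on", "for", "with"}


def simple_preprocess_text(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words.

    Stand-in for the project's preprocess_text; non-string input
    (for example NaN from a missing overview) yields an empty string.
    """
    if not isinstance(text, str):
        return ""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def simple_extract_keywords(text, n_keywords=5):
    """Return up to n_keywords of the most frequent remaining tokens.

    Stand-in for the project's extract_keywords; ranking by raw
    frequency is an assumption of this sketch.
    """
    tokens = simple_preprocess_text(text).split()
    return [word for word, _ in Counter(tokens).most_common(n_keywords)]
```

Returning an empty string for non-string input keeps `df['Overview'].apply(...)` from raising on missing overviews, if the dataset contains any.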

## Data Schema Overview

The data has been normalized into the following tables:

1. **movies** - Main table with movie information
   - movie_id (PK)
   - Series_Title
   - release_year
   - Certificate
   - runtime_minutes
   - IMDB_Rating
   - Meta_score
   - No_of_Votes
   - gross_amount
   - processed_overview
   - created_at
   - updated_at

2. **genres** - Movie-genre relationships
   - movie_id (FK)
   - genre
   - created_at
   - updated_at

3. **credits** - Movie credits (directors and stars)
   - movie_id (FK)
   - person_name
   - role
   - created_at
   - updated_at

4. **keywords** - Extracted keywords from movie overviews
   - movie_id (FK)
   - keyword
   - created_at
   - updated_at
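
To make the normalization concrete, the sketch below derives `genres`-style and `credits`-style tables from a toy dataframe and checks that every foreign key points at an existing movie. It is not the code inside `prepare_data_for_snowflake`. The raw column names (`Genre`, `Director`, `Star1`, `Star2`), the comma-separated genre format, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows shaped like the raw CSV. Column names and values are assumptions.
raw = pd.DataFrame({
    "movie_id": [1, 2],
    "Series_Title": ["Movie A", "Movie B"],
    "Genre": ["Crime, Drama", "Action"],
    "Director": ["Director X", "Director Y"],
    "Star1": ["Actor P", "Actor Q"],
    "Star2": ["Actor R", "Actor P"],
})

# genres: split the comma-separated string and emit one row per (movie_id, genre).
genres = (
    raw[["movie_id", "Genre"]]
    .assign(genre=raw["Genre"].str.split(","))
    .explode("genre")
    .assign(genre=lambda d: d["genre"].str.strip())
    [["movie_id", "genre"]]
    .reset_index(drop=True)
)

# credits: unpivot the director and star columns into (movie_id, person_name, role).
credits_df = raw.melt(
    id_vars=["movie_id"],
    value_vars=["Director", "Star1", "Star2"],
    var_name="role",
    value_name="person_name",
)
# Collapse "Star1", "Star2", ... into a single "star" role; "Director" becomes "director".
credits_df["role"] = (
    credits_df["role"].str.replace(r"\d+$", "", regex=True).str.lower()
)
credits_df = credits_df[["movie_id", "person_name", "role"]]

# Referential integrity: every child movie_id must exist in the parent table.
child_ids = set(genres["movie_id"]) | set(credits_df["movie_id"])
orphans = child_ids - set(raw["movie_id"])
print(genres)
print(credits_df)
print("orphaned movie_ids:", orphans)
```

A check like the one on `orphans` is worth running before loading, because Snowflake does not enforce foreign key constraints on standard tables.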