<h3>Amendments Log</h3>
<table style="width:100%">
  <thead>
    <tr>
      <th style="text-align:left">Version</th>
      <th style="text-align:left">Amended By</th>
      <th style="text-align:left">Date</th>
      <th style="text-align:left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1.4</td>
      <td>Gary Manley</td>
      <td>2025-11-30</td>
      <td>Removed source_type column as it does not exist in Bronze.</td>
    </tr>
    <tr>
      <td>1.3</td>
      <td>Gary Manley</td>
      <td>2025-11-30</td>
      <td>Fixed Source Table path (pointed to 'bronze' schema) and removed resiliency checks.</td>
    </tr>
    <tr>
      <td>1.2</td>
      <td>Gary Manley</td>
      <td>2025-11-30</td>
      <td>Refactored to use Pandas for deduplication logic instead of SQL.</td>
    </tr>
    <tr>
      <td>1.1</td>
      <td>Gary Manley</td>
      <td>2025-11-30</td>
      <td>Updated deduplication logic to prioritize the latest release date.</td>
    </tr>
    <tr>
      <td>1.0</td>
      <td>Gary Manley</td>
      <td>2025-11-30</td>
      <td>Initial version.</td>
    </tr>
  </tbody>
</table>

In [10]:
# 1. SETUP & IMPORTS
import duckdb
import pandas as pd
import os
import sys
from dotenv import load_dotenv

# Load Utils
sys.path.append(os.getcwd())
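try:
    from utils.db_utils import f_add_surrogate_key
except ImportError as e:
    # Fail fast: f_add_surrogate_key is required by the Silver load below,
    # so continuing without it would only defer the failure to a NameError.
    raise ImportError("Could not import f_add_surrogate_key from utils.db_utils") from e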

# Load Env
vLocalEnvPath = r"C:/Users/garym/Documents/GitHub/MovieReleases/.env"
if os.path.exists(vLocalEnvPath):
    load_dotenv(dotenv_path=vLocalEnvPath)
else:
    load_dotenv()

vMdToken = os.getenv("MOTHERDUCK_TOKEN")
if not vMdToken:
    raise RuntimeError("MOTHERDUCK_TOKEN missing")

# Connect
print("Connecting to MotherDuck...")
vCon = duckdb.connect(f"md:?motherduck_token={vMdToken}")

Connecting to MotherDuck...


In [11]:
# PARAMETERS / CONSTANTS
cNotebookName = "process_dim_film.ipynb"
vTargetTable = "MovieReleases.silver.film_release_dim"

## 2. Extract & Deduplicate (Pandas)
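We read the Bronze history table into a Pandas DataFrame, keep only the active rows (`is_current_uda` is true), and deduplicate in Python.
For each `imdb_id_ref` we keep the row with the most recent `valid_from_uda` (System Entry Date): the rows are sorted by that column in descending order, and every duplicate after the first is dropped.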
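
As a rough illustration of the sort-then-drop pattern, here is a minimal sketch on made-up rows. Only the column names (`imdb_id_ref`, `movie_title`, `valid_from_uda`, `is_current_uda`) come from this notebook; the IDs, titles, and dates are hypothetical, and the real Bronze table may carry additional columns.

```python
import pandas as pd

# Hypothetical sample rows; only the column names match the Bronze table.
df_sample = pd.DataFrame(
    {
        "imdb_id_ref": ["tt0000001", "tt0000001", "tt0000002", None],
        "movie_title": ["Old Title", "New Title", "Other Film", "No Id"],
        "valid_from_uda": pd.to_datetime(
            ["2025-11-01", "2025-11-20", "2025-11-10", "2025-11-15"]
        ),
        "is_current_uda": [True, True, True, True],
    }
)

# Keep active rows only, then drop rows that have no business key.
df_active = df_sample[df_sample["is_current_uda"]]
df_keyed = df_active.dropna(subset=["imdb_id_ref"])

# Latest system entry first, then keep the first row per business key.
df_dedup = (
    df_keyed.sort_values(by="valid_from_uda", ascending=False)
    .drop_duplicates(subset=["imdb_id_ref"], keep="first")
    .reset_index(drop=True)
)

print(df_dedup[["imdb_id_ref", "movie_title"]])
```

With this sample, `tt0000001` survives with the title from its later `valid_from_uda` row, and the row without an ID is discarded. If two rows for the same key share an identical `valid_from_uda`, which one survives depends on sort order, so a secondary sort column may be worth adding in that case.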

In [12]:
# 1. Fetch Active Bronze Data
print("Fetching active records from Bronze...")
try:
    # Corrected Path: Read from 'bronze' schema which has the SCD2 columns
    dfBronze = vCon.table("MovieReleases.bronze.uk_releases").df()
except Exception as e:
    print(f"Error reading source table: {e}")
    dfBronze = pd.DataFrame()

if not dfBronze.empty:
    # 2. Filter Active
    dfActive = dfBronze[dfBronze['is_current_uda'] == True].copy()
    
    # 3. Deduplicate (Pandas Logic)
    # Sort by 'valid_from_uda' descending to put the latest system entry at the top
    dfSorted = dfActive.sort_values(by='valid_from_uda', ascending=False)
    
    # Drop duplicates on Business Key (imdb_id_ref), keeping the first (latest)
    dfDedup = dfSorted.drop_duplicates(subset=['imdb_id_ref'], keep='first').copy()
    
    # 4. Prepare Source Dataframe
    # Removed source_type as it is not in the source table
    dfSource = dfDedup[['imdb_id_ref', 'movie_title']].copy()
    
    # Drop rows whose business key is null (empty strings are not removed by dropna)
    dfSource = dfSource.dropna(subset=['imdb_id_ref'])
    
    print(f"Found {len(dfSource)} unique movies to process.")

    # 5. Generate/Maintain Surrogate Keys
    dfDimFilm = f_add_surrogate_key(
        vCon=vCon,
        dfNewData=dfSource,
        vTargetTableName=vTargetTable,
        vBusinessKeyCol="imdb_id_ref",
        vSkColName="sk_film_release"
    )
    
    # 6. Load to Silver (Replace Table)
    print(f"Loading to {vTargetTable}...")
    vCon.sql("CREATE SCHEMA IF NOT EXISTS MovieReleases.silver")
    vCon.register('v_stage_dim_film', dfDimFilm)
    vCon.sql(f"CREATE OR REPLACE TABLE {vTargetTable} AS SELECT * FROM v_stage_dim_film")
    
    print("Success.")
    # Validation
    vCon.sql(f"SELECT * FROM {vTargetTable} LIMIT 5").show()

else:
    print("No data found in Bronze. Skipping Silver load.")

vCon.close()

Fetching active records from Bronze...
Found 77 unique movies to process.
Generating Surrogate Keys (sk_film_release) for MovieReleases.silver.film_release_dim...
Target MovieReleases.silver.film_release_dim does not exist. Starting fresh SKs from 1.
Loading to MovieReleases.silver.film_release_dim...
Success.
┌─────────────┬──────────────────────────────────────────┬─────────────────┐
│ imdb_id_ref │               movie_title                │ sk_film_release │
│   varchar   │                 varchar                  │      int64      │
├─────────────┼──────────────────────────────────────────┼─────────────────┤
│ tt30274401  │ Five Nights at Freddy's 2                │               1 │
│ tt33978029  │ Ready or Not: Here I Come                │               2 │
│ tt30825738  │ Star Wars: The Mandalorian and Grogu     │               3 │
│ tt17490712  │ Mortal Kombat II                         │               4 │
│ tt32565993  │ Three Bags Full: A Sheep Detective Movie │              