# Orchestrator

**Objective:** Master controller for the ETL pipeline.

**Function:** 
1. Connects to MotherDuck and queries the `pipeline_control` table.
2. Reads the active steps, ordered by `step_id`.
3. Uses `papermill` to execute the child notebooks (`ingest_releases.ipynb`, `process_bronze.ipynb`) in sequence, stopping at the first failure.
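
To make the shape of the control table concrete, here is a minimal sketch of the kind of table and query the steps above rely on. It uses the standard library's `sqlite3` as a stand-in for MotherDuck so that it runs anywhere. The column names come from the orchestrator's query, but the column types, the unqualified table name, and the third (inactive) row are assumptions for illustration only.

```python
import sqlite3

# Stand-in database: SQLite in memory instead of MotherDuck.
con = sqlite3.connect(":memory:")

# Column names match the orchestrator's query; the types are assumed.
con.execute(
    """
    CREATE TABLE pipeline_control (
        step_id       INTEGER PRIMARY KEY,
        notebook_path TEXT NOT NULL,
        description   TEXT,
        is_active     INTEGER NOT NULL DEFAULT 1
    )
    """
)
con.executemany(
    "INSERT INTO pipeline_control VALUES (?, ?, ?, ?)",
    [
        (2, "process_bronze.ipynb", "Merge Landing to Bronze (SCD2)", 1),
        (1, "ingest_releases.ipynb", "Scrape IMDb and load to Landing", 1),
        (3, "example_disabled.ipynb", "Hypothetical inactive step", 0),
    ],
)

# Same shape as the orchestrator's query: active steps only, ordered by step_id.
# SQLite stores booleans as integers, hence "= 1" rather than "= TRUE".
schedule = con.execute(
    """
    SELECT step_id, notebook_path, description
    FROM pipeline_control
    WHERE is_active = 1
    ORDER BY step_id ASC
    """
).fetchall()
con.close()

for step_id, notebook_path, description in schedule:
    print(step_id, notebook_path, description)
```

Deactivating a step is then a matter of flipping `is_active`, with no change to the orchestrator itself.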

In [1]:
# 1. IMPORTS
import duckdb
import pandas as pd
import papermill as pm
import os
import sys
import time
from datetime import datetime
from dotenv import load_dotenv

In [2]:
# 2. SETUP
# Machine-specific .env location. Forward slashes work on both Windows and Linux;
# if the file is absent, fall back to python-dotenv's default .env lookup.
vLocalEnvPath = "C:/Users/garym/Documents/GitHub/MovieReleases/.env"

if os.path.exists(vLocalEnvPath):
    load_dotenv(dotenv_path=vLocalEnvPath)
else:
    load_dotenv()

# Fail fast: nothing downstream can run without the MotherDuck token.
vMdToken = os.getenv("MOTHERDUCK_TOKEN")
if not vMdToken:
    raise RuntimeError("MOTHERDUCK_TOKEN missing")

print(f"--- STARTING PIPELINE AT {datetime.now()} ---")

--- STARTING PIPELINE AT 2025-11-30 10:03:07.583447 ---
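
The fail-fast token check in the setup cell could be factored into a small helper if more variables are needed later. The sketch below is one possible shape; `require_env` and `EXAMPLE_TOKEN` are hypothetical names that do not appear in the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} missing")
    return value


# Demonstration with a hypothetical variable, not the real MotherDuck token.
os.environ["EXAMPLE_TOKEN"] = "abc123"
print(require_env("EXAMPLE_TOKEN"))
```

An empty string is treated the same as an unset variable, which matches the `if not vMdToken` check in the cell above.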


## 3. Fetch Schedule

In [3]:
try:
    print("Connecting to MotherDuck to fetch schedule...")
    con = duckdb.connect(f"md:?motherduck_token={vMdToken}")
    
    # Read the active steps from the control table.
    # A failed query (for example, a missing table on a fresh run) is reported
    # and treated as an empty schedule rather than crashing the notebook.
    try:
        vSql = """
            SELECT step_id, notebook_path, description 
            FROM MovieReleases.main.pipeline_control 
            WHERE is_active = TRUE 
            ORDER BY step_id ASC
        """
        dfSchedule = con.sql(vSql).df()
    except Exception as e:
        print(f"Could not read pipeline_control (is the table missing? run setup SQL): {e}")
        dfSchedule = pd.DataFrame()
    finally:
        # Always release the connection, even if the query failed.
        con.close()
    
    if dfSchedule.empty:
        print("No active steps found. Exiting.")
    else:
        print(f"Found {len(dfSchedule)} steps to execute.")

except Exception as e:
    # Chain the original exception so the full traceback is preserved.
    raise RuntimeError(f"Failed to fetch pipeline schedule: {e}") from e

Connecting to MotherDuck to fetch schedule...
Found 2 steps to execute.


## 4. Execute Pipeline Loop

In [5]:
vHasErrors = False

if not dfSchedule.empty:
    for index, row in dfSchedule.iterrows():
        vStepId = row['step_id']
        vNotebook = row['notebook_path']
        vDesc = row['description']
        
        print(f"\n>>> EXECUTION STEP {vStepId}: {vNotebook}")
        print(f"    Description: {vDesc}")
        
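        # Write the executed copy to logs/. Use the basename so that a
        # notebook_path containing directories still maps to a file in logs/.
        vLogDir = "logs"
        os.makedirs(vLogDir, exist_ok=True)
        vOutputNotebook = os.path.join(vLogDir, f"out_{os.path.basename(vNotebook)}")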
        
        try:
            t_start = time.time()
            
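            # PAPERMILL: Runs the notebook and saves the executed copy.
            # vResetTable is passed as a parameter; papermill warns (see the
            # output below) when the child notebook has no cell tagged 'parameters'.
            pm.execute_notebook(
                input_path=vNotebook,
                output_path=vOutputNotebook,
                parameters=dict(vResetTable=False),
                kernel_name='python3',
                progress_bar=False, 
                stdout_file=sys.stdout
            )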
            
            t_end = time.time()
            print(f"    [SUCCESS] Step {vStepId} completed in {round(t_end - t_start, 2)}s")
            
        except Exception as e:
            print(f"    [FAILURE] Step {vStepId} failed: {e}")
            print(f"    Check output notebook: {vOutputNotebook}")
            vHasErrors = True
            break

if vHasErrors:
    raise RuntimeError("Pipeline Failed")
else:
    print("\n--- PIPELINE SUCCESS ---")

Passed unknown parameter: vResetTable
Input notebook does not contain a cell with tag 'parameters'



>>> EXECUTION STEP 1: ingest_releases.ipynb
    Description: Scrape IMDb and load to Landing
Connecting to MotherDuck...
Loading 77 rows to Landing: MovieReleases.landing.uk_releases
Landing load complete.


Passed unknown parameter: vResetTable
Input notebook does not contain a cell with tag 'parameters'


    [SUCCESS] Step 1 completed in 10.16s

>>> EXECUTION STEP 2: process_bronze.ipynb
    Description: Merge Landing to Bronze (SCD2)
Connecting to MotherDuck...
Found 1 tables to process.
Processing Rule: uk_releases -> uk_releases (SCD2)
--- Processing SCD2: MovieReleases.landing.uk_releases -> MovieReleases.bronze.uk_releases ---
SCD2 Merge Complete.
Pipeline Complete.
    [SUCCESS] Step 2 completed in 7.88s

--- PIPELINE SUCCESS ---
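
The control flow of the loop above (run steps in order, time each one, stop at the first failure) can be exercised without papermill by injecting the function that runs a notebook. The following is a minimal sketch under that assumption; `run_steps`, `output_path_for`, and `fake_runner` are hypothetical names, and the fake runner simply simulates a failing second step.

```python
import os
import time
from typing import Callable, Iterable, List, Tuple

Step = Tuple[int, str]  # (step_id, notebook_path)


def output_path_for(notebook_path: str, log_dir: str = "logs") -> str:
    """Build the path of the executed copy, keeping only the notebook's file name."""
    return os.path.join(log_dir, "out_" + os.path.basename(notebook_path))


def run_steps(steps: Iterable[Step], runner: Callable[[str, str], None]) -> List[Tuple[int, str]]:
    """Run steps in order and stop at the first failure.

    Returns a list of (step_id, status) for every step that was attempted.
    """
    results = []
    for step_id, notebook_path in steps:
        out_path = output_path_for(notebook_path)
        started = time.perf_counter()
        try:
            runner(notebook_path, out_path)
        except Exception as exc:
            print(f"[FAILURE] Step {step_id} failed: {exc}")
            results.append((step_id, "failed"))
            break  # later steps depend on earlier ones, so do not continue
        elapsed = time.perf_counter() - started
        print(f"[SUCCESS] Step {step_id} completed in {elapsed:.2f}s")
        results.append((step_id, "ok"))
    return results


def fake_runner(input_path: str, output_path: str) -> None:
    """Stand-in for the notebook runner: fails only for b.ipynb."""
    if input_path == "b.ipynb":
        raise ValueError("simulated failure")


results = run_steps([(1, "a.ipynb"), (2, "b.ipynb"), (3, "c.ipynb")], fake_runner)
print(results)
```

In the real notebook the runner would wrap `pm.execute_notebook`. Step 3 never appears in `results` because the loop breaks as soon as step 2 fails.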

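
The two warnings in the output above indicate that neither child notebook has a cell tagged `parameters`, which is the tag papermill looks for when injecting values such as `vResetTable`. A notebook file is JSON, so the presence of that tag can be checked with the standard library alone. This is a sketch, with `has_parameters_cell` as a hypothetical helper and two hand-written notebook structures standing in for real files.

```python
import json


def has_parameters_cell(notebook: dict) -> bool:
    """Return True if any cell in the notebook JSON carries the 'parameters' tag."""
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if "parameters" in tags:
            return True
    return False


# Minimal hand-written notebook structures (illustrative, not the real child notebooks).
tagged = json.loads(
    '{"cells": [{"cell_type": "code", "metadata": {"tags": ["parameters"]},'
    ' "source": ["vResetTable = False"]}]}'
)
untagged = json.loads(
    '{"cells": [{"cell_type": "code", "metadata": {}, "source": ["print(1)"]}]}'
)

print(has_parameters_cell(tagged), has_parameters_cell(untagged))
```

For a real file, the dictionary would come from `json.load` on the opened `.ipynb`. Adding a cell with a default such as `vResetTable = False` and tagging it `parameters` in each child notebook should give papermill a place to override the value.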