# Detailed Pipeline Walkthrough

**A step-by-step guide to running the Fly Behavior Pipeline from a Jupyter notebook.**

This notebook mirrors exactly what `make run` does, but breaks it into clearly separated,
well-explained stages so you can understand, inspect, and control each part.

---

### What the pipeline does (high level)

1. **YOLO inference** — detect eyes and proboscis in each video frame
2. **Distance / angle computation** — derive per-frame CSV metrics
3. **Normalization** — convert raw distances to percentages
4. **RMS / envelope analysis** — compute rolling-window smoothed signals
5. **Combined analysis** — merge datasets, build matrices, generate plots
6. **Reaction prediction** — run trained ML model on combined data
7. **Backup** — copy results to SMB / Box / secured storage
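
Steps 3 and 4 can be illustrated with a small sketch. The pipeline's actual formulas are not shown in this notebook, so the min-max scaling and the trailing-window RMS below are assumptions chosen for illustration, and the function names and sample distances are hypothetical.

```python
import math


def normalize_to_percent(values):
    """Min-max scale values to 0..100 (assumed scheme, not confirmed by the pipeline)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def rolling_rms(values, window):
    """Trailing-window RMS; early samples use however many points exist so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(math.sqrt(sum(v * v for v in chunk) / len(chunk)))
    return out


distances = [10.0, 12.0, 20.0, 15.0, 10.0]  # made-up per-frame distances
percent = normalize_to_percent(distances)   # [0.0, 20.0, 100.0, 50.0, 0.0]
smoothed = rolling_rms(percent, window=3)
```

A wider `window` gives a smoother envelope at the cost of responding more slowly to sudden changes.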

### Safety defaults in this notebook

| Setting | Default | Why |
|---------|---------|-----|
| `DRY_RUN_ONLY` | `True` | Validates everything without running the actual pipeline |
| `RUN_BACKUPS` | `False` | Backups are skipped unless you explicitly enable them |
| `delete_source_after_render` | **expected to be `false`** | Section C warns if it is enabled; set it to `false` before a real run so source videos are never deleted |

> **Tip:** Run this notebook top-to-bottom once with `DRY_RUN_ONLY = True` to verify
> your environment, then set it to `False` for the real run.

---
## A) Parameters

Edit the values in this cell to control the notebook's behavior.
Everything else reads from these variables.
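
Because every later cell depends on these values, a quick sanity check can catch typos early. The sketch below is not part of the pipeline; `check_parameters` is a hypothetical helper that only inspects the types and ranges of the values described above.

```python
from pathlib import Path


def check_parameters(config_path, dry_run_only, run_backups, timeout_seconds):
    """Return a list of human-readable problems; an empty list means the values look sane."""
    problems = []
    if Path(config_path).suffix.lower() not in {".yaml", ".yml"}:
        problems.append(f"CONFIG_PATH should point to a YAML file, got {config_path!r}")
    for name, value in (("DRY_RUN_ONLY", dry_run_only), ("RUN_BACKUPS", run_backups)):
        if not isinstance(value, bool):
            problems.append(f"{name} must be True or False, got {value!r}")
    if timeout_seconds is not None:
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(timeout_seconds, (int, float)) and not isinstance(timeout_seconds, bool)
        if not is_number or timeout_seconds <= 0:
            problems.append(f"PIPELINE_TIMEOUT_SECONDS must be a positive number or None, got {timeout_seconds!r}")
    return problems


problems = check_parameters(Path("config/config.yaml"), True, False, 21600)
```

Calling it right after the Parameters cell and raising if the list is non-empty keeps mistakes from surfacing hours into a run.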

In [None]:
# ===================================================================
# PARAMETERS — edit these before running the notebook
# ===================================================================

from pathlib import Path

# Path to your config YAML (relative to repo root, or absolute)
CONFIG_PATH = Path("config/config.yaml")

# Set to False to actually execute the pipeline (make run).
# When True, the notebook only validates the environment and config.
DRY_RUN_ONLY = True

# Set to True to run backups after the pipeline finishes.
# Backups copy results to SMB, Box, and secured storage.
RUN_BACKUPS = False

# Allow the pipeline to run on CPU (no GPU).
# If False and no GPU is found, the pipeline will refuse to start.
ALLOW_CPU = True

# Directory where pipeline log files are saved.
LOG_DIR = Path("logs")

# Maximum seconds to wait for the pipeline before timing out.
# Default: 6 hours (21600 seconds). Set to None for no limit.
PIPELINE_TIMEOUT_SECONDS = 21600

---
## B) Environment Validation

This section checks that all required software and packages are available
before you try to run anything.
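
Importing heavy packages such as PyTorch just to confirm they exist can be slow. A lighter alternative, sketched here with the standard library only, is to ask the import system whether a top-level module can be found without loading it. The helper names are hypothetical.

```python
import importlib.util
import shutil


def missing_modules(module_names):
    """Return the top-level module names that the import system cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def missing_tools(tool_names):
    """Return the command-line tools that are not on PATH."""
    return [tool for tool in tool_names if shutil.which(tool) is None]


not_found = missing_modules(["json", "definitely_not_a_real_module_xyz"])
```

This only proves that a module is locatable; a broken installation can still fail at import time, which is why the cell below performs real imports.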

In [None]:
# ===================================================================
# Environment validation — checks Python, packages, and tools
# ===================================================================

import sys
import os
import shutil
import subprocess
from datetime import datetime

# ---------- Resolve repo root ----------
# This notebook lives in docs/, so the repo root is one level up.
_notebook_dir = Path(os.getcwd())
# If running from the repo root already, use cwd; otherwise go up.
if (_notebook_dir / "Makefile").exists():
    REPO_ROOT = _notebook_dir
elif (_notebook_dir.parent / "Makefile").exists():
    REPO_ROOT = _notebook_dir.parent
else:
    # Fallback: use the notebook's grandparent
    REPO_ROOT = _notebook_dir.parents[1] if len(_notebook_dir.parents) > 1 else _notebook_dir

print(f"Repo root : {REPO_ROOT}")
print(f"Python    : {sys.version}")
print()

# ---------- Check required Python packages ----------
required_packages = [
    ("yaml",       "PyYAML"),
    ("numpy",      "numpy"),
    ("pandas",     "pandas"),
    ("matplotlib", "matplotlib"),
    ("cv2",        "opencv-python"),
    ("ultralytics","ultralytics"),
    ("torch",      "torch"),
    ("scipy",      "scipy"),
    ("tqdm",       "tqdm"),
]

all_ok = True
for module_name, display_name in required_packages:
    try:
        __import__(module_name)
        print(f"  [OK]  {display_name}")
    except ImportError:
        print(f"  [MISSING]  {display_name}  <-- install with: pip install {display_name}")
        all_ok = False

print()

# ---------- Check external CLI tools ----------
cli_tools = [
    ("ffmpeg",  "Video encoding (used by compose_videos_rms)"),
    ("ffprobe", "Video metadata inspection"),
    ("make",    "Build system (runs the pipeline)"),
]

for tool_name, purpose in cli_tools:
    path = shutil.which(tool_name)
    if path:
        print(f"  [OK]  {tool_name} -> {path}")
    else:
        print(f"  [MISSING]  {tool_name} ({purpose})")
        all_ok = False

print()

# ---------- GPU check ----------
try:
    import torch
    gpu_available = torch.cuda.is_available()
    if gpu_available:
        gpu_name = torch.cuda.get_device_name(0)
        print(f"  [GPU]  CUDA available: {gpu_name}")
    else:
        print(f"  [CPU]  No CUDA GPU detected.")
        if not ALLOW_CPU:
            print("         ERROR: ALLOW_CPU is False — set it to True or use a GPU machine.")
            all_ok = False
        else:
            print("         Pipeline will run on CPU (slower but functional).")
except ImportError:
    print("  [CPU]  PyTorch not installed — GPU detection skipped.")

print()
if all_ok:
    print("Environment validation PASSED.")
else:
    print("Environment validation FAILED — fix the issues above before proceeding.")
    raise SystemExit("Environment check failed.")

---
## C) Config Selection and Validation

This section loads your `config.yaml`, prints the key settings, and warns
you about any destructive options that are currently enabled.
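
YAML keys that are present but empty load as `None`, which makes chained `.get()` calls fragile. A minimal sketch of a tolerant lookup helper follows; `get_nested` is a hypothetical name and the sample dictionary is made up.

```python
def get_nested(mapping, *keys, default=None):
    """Walk nested dicts, returning `default` if any level is missing or None."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
        if current is None:
            return default
    return current


sample = {"analysis": {"combined": None}, "force": {"yolo": True}}
secure_cleanup = get_nested(sample, "analysis", "combined", "secure_cleanup")
force_yolo = get_nested(sample, "force", "yolo", default=False)
```

With a helper like this, a config that sets `combined:` to nothing reads as "not configured" instead of raising `AttributeError`.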

In [None]:
# ===================================================================
# Load and validate config.yaml
# ===================================================================

import yaml

# Resolve config path relative to repo root
config_file = REPO_ROOT / CONFIG_PATH
if not config_file.exists():
    raise FileNotFoundError(
        f"Config file not found: {config_file}\n"
        f"Check your CONFIG_PATH parameter. Available configs:\n"
        + "\n".join(f"  - {p.name}" for p in (REPO_ROOT / "config").glob("*.yaml"))
    )

with open(config_file, "r") as fh:
    config_data = yaml.safe_load(fh) or {}

print(f"Loaded config: {config_file}")
print(f"Config size  : {config_file.stat().st_size:,} bytes")
print()

# ---------- Print key settings ----------
print("=" * 60)
print("EFFECTIVE CONFIGURATION (key values)")
print("=" * 60)

# Model path
model_path = config_data.get("model_path", "(not set)")
model_exists = Path(str(model_path)).expanduser().exists() if model_path else False
print(f"  model_path             : {model_path}")
print(f"    exists?              : {model_exists}")

# Dataset roots
main_dirs = config_data.get("main_directories") or []
if isinstance(main_dirs, str):
    main_dirs = [main_dirs]
print(f"  main_directories       : {len(main_dirs)} dataset(s)")
for d in main_dirs:
    exists = Path(d).expanduser().exists()
    status = "OK" if exists else "MISSING"
    print(f"    [{status}] {d}")

# Timing
print(f"  fps_default            : {config_data.get('fps_default', '(not set)')}")
print(f"  odor_on_s              : {config_data.get('odor_on_s', '(not set)')}")
print(f"  odor_off_s             : {config_data.get('odor_off_s', '(not set)')}")
print(f"  odor_latency_s         : {config_data.get('odor_latency_s', '(not set)')}")

# GPU / compute
print(f"  allow_cpu              : {config_data.get('allow_cpu', False)}")
print(f"  cuda_allow_tf32        : {config_data.get('cuda_allow_tf32', True)}")

# Force flags
force = config_data.get("force", {})
print(f"  force.pipeline         : {force.get('pipeline', False)}")
print(f"  force.yolo             : {force.get('yolo', False)}")
print(f"  force.combined         : {force.get('combined', False)}")
print(f"  force.reaction_predict : {force.get('reaction_prediction', False)}")
print(f"  force.reaction_matrix  : {force.get('reaction_matrix', False)}")

print()

# ---------- SAFETY WARNINGS ----------
print("=" * 60)
print("SAFETY CHECK")
print("=" * 60)

warnings_found = 0

# Check delete_source_after_render
delete_source = config_data.get("delete_source_after_render", False)
if delete_source:
    warnings_found += 1
    print(f"  [WARNING] delete_source_after_render = {delete_source}")
    print(f"            This DELETES original video files after rendering!")
    print(f"            Set it to false in config.yaml before a real run; this notebook does not edit the file.")
else:
    print(f"  [SAFE]    delete_source_after_render = {delete_source}")

# Check secure_cleanup (moves files to secured storage)
secure_cfg = (config_data.get("analysis", {}).get("combined", {}) or {}).get("secure_cleanup")
if secure_cfg:
    perform_cleanup = secure_cfg.get("perform_cleanup", True)
    if perform_cleanup:
        warnings_found += 1
        print(f"  [WARNING] secure_cleanup.perform_cleanup = True")
        print(f"            This moves/deletes files after copying to secured storage.")
    else:
        print(f"  [SAFE]    secure_cleanup.perform_cleanup = False")

# Check backup compression
backup_cfg = config_data.get("backups", {})
compression = backup_cfg.get("compression", {}).get("enabled", False)
print(f"  [INFO]    backups.compression.enabled = {compression}")

if warnings_found > 0:
    print(f"\n  >>> {warnings_found} warning(s) found. Review above before running. <<<")
else:
    print(f"\n  All safety checks passed.")

---
## D) Non-Destructive Overrides

This cell sets environment variables that make the pipeline safer.
These overrides apply **only** for this notebook session — your config.yaml is not modified.
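
Environment variables cannot change values stored inside the YAML itself. One way to force a key such as `delete_source_after_render` off for a single run, without touching `config.yaml`, is to build a sanitized copy of the loaded dictionary. This is a sketch under assumptions: `sanitized_config` is a hypothetical helper, the sample values are made up, and it presumes the pipeline will accept the copy (for example written out with `yaml.safe_dump`) through its `--config` argument.

```python
import copy


def sanitized_config(config, forced_values):
    """Return a deep copy of `config` with top-level keys overridden; the input is untouched."""
    safe = copy.deepcopy(config)
    safe.update(forced_values)
    return safe


original = {"delete_source_after_render": True, "fps_default": 40}  # made-up values
safe = sanitized_config(original, {"delete_source_after_render": False})
```

Relative paths inside the config may be resolved against the config file's location, so a copy written to a different directory should be checked before relying on it.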

In [None]:
# ===================================================================
# Apply safe runtime overrides (environment variables only)
# ===================================================================

# Force matplotlib to use the non-interactive Agg backend.
# This prevents the pipeline from trying to open GUI windows.
os.environ["MPLBACKEND"] = "Agg"
print("Set MPLBACKEND=Agg (headless plotting)")

# Suppress noisy ONNX Runtime logging.
os.environ["ORT_LOGGING_LEVEL"] = "3"
print("Set ORT_LOGGING_LEVEL=3 (errors and above)")

# Override CPU usage if the parameter allows it.
if ALLOW_CPU:
    os.environ["ALLOW_CPU"] = "1"
    print("Set ALLOW_CPU=1 (pipeline can fall back to CPU)")

# Ensure logs directory exists.
log_dir = REPO_ROOT / LOG_DIR
log_dir.mkdir(parents=True, exist_ok=True)
print(f"Log directory ready: {log_dir}")

print()
print("Runtime overrides applied. Your config.yaml is unchanged.")

---
## E) Dry-Run Checks

Before running the pipeline, let's verify:
- Input dataset directories exist and contain data
- The YOLO model file exists
- What outputs will be generated (and where)

This runs even when `DRY_RUN_ONLY = True`.
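
The dataset check boils down to counting immediate subdirectories and CSV files under each root. A self-contained sketch of that idea, using a temporary directory as a stand-in for a real dataset (the helper name and directory layout are hypothetical):

```python
import tempfile
from pathlib import Path


def summarize_dataset(root):
    """Count immediate subdirectories and all CSV files under `root`; None if root is missing."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return None
    fly_dirs = sum(1 for p in root.iterdir() if p.is_dir())
    csv_files = sum(1 for _ in root.rglob("*.csv"))
    return {"fly_dirs": fly_dirs, "csv_files": csv_files}


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "fly1").mkdir()
    (base / "fly2").mkdir()
    (base / "fly1" / "trial.csv").write_text("frame,distance\n")
    (base / "notes.csv").write_text("x\n")
    summary = summarize_dataset(base)
    missing = summarize_dataset(base / "does_not_exist")
```

Returning `None` for a missing root lets the caller distinguish "absent" from "present but empty".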

In [None]:
# ===================================================================
# Dry-run checks — verify inputs and show expected outputs
# ===================================================================

print("=" * 60)
print("DRY-RUN CHECKS")
print("=" * 60)
print()

# ---------- 1) Check dataset directories ----------
print("--- Dataset Directories ---")
datasets_ready = 0
datasets_missing = 0

for d in main_dirs:
    p = Path(d).expanduser()
    if p.exists():
        # Count fly directories (immediate subdirectories)
        fly_dirs = [x for x in p.iterdir() if x.is_dir()]
        # Count CSV files recursively
        csv_count = sum(1 for _ in p.rglob("*.csv"))
        print(f"  [OK]  {p.name}: {len(fly_dirs)} fly dir(s), {csv_count} CSV(s)")
        datasets_ready += 1
    else:
        print(f"  [MISSING] {d}")
        datasets_missing += 1

print(f"  Summary: {datasets_ready} ready, {datasets_missing} missing")
print()

# ---------- 2) Check YOLO model ----------
print("--- YOLO Model ---")
model_file = Path(str(model_path)).expanduser()
if model_file.exists():
    size_mb = model_file.stat().st_size / (1024 * 1024)
    mtime = datetime.fromtimestamp(model_file.stat().st_mtime)
    print(f"  [OK]  {model_file.name} ({size_mb:.1f} MB, modified {mtime:%Y-%m-%d %H:%M})")
else:
    print(f"  [MISSING] {model_file}")
    print(f"            Pipeline will fail at YOLO step. Check model_path in config.")
print()

# ---------- 3) Show expected output locations ----------
print("--- Expected Output Locations ---")
analysis_cfg = config_data.get("analysis", {})

output_locations = []

# Envelope visuals
env_vis = analysis_cfg.get("envelope_visuals", {})
if env_vis:
    for key in ("matrices", "envelopes"):
        sub = env_vis.get(key)
        if sub:
            items = sub if isinstance(sub, list) else [sub]
            for item in items:
                if isinstance(item, dict) and "out_dir" in item:
                    output_locations.append(("envelope_visuals." + key, item["out_dir"]))

# Combined outputs
combined = analysis_cfg.get("combined", {})
if combined:
    for key in ("matrices", "matrix", "overlay"):
        sub = combined.get(key)
        if isinstance(sub, dict) and "out_dir" in sub:
            output_locations.append(("combined." + key, sub["out_dir"]))
    envelopes = combined.get("envelopes", [])
    if not isinstance(envelopes, list):
        envelopes = [envelopes]
    for env in envelopes:
        if isinstance(env, dict) and "out_dir" in env:
            output_locations.append(("combined.envelopes", env["out_dir"]))
    wide = combined.get("wide", {})
    if wide and "output_csv" in wide:
        output_locations.append(("combined.wide", wide["output_csv"]))

# Reaction predictions
reaction = config_data.get("reaction_prediction", {})
if reaction:
    if "output_csv" in reaction:
        output_locations.append(("reaction_prediction", reaction["output_csv"]))
    matrix = reaction.get("matrix", {})
    if matrix and "out_dir" in matrix:
        output_locations.append(("reaction_prediction.matrix", matrix["out_dir"]))

for label, location in output_locations:
    loc_path = Path(str(location)).expanduser()
    exists = loc_path.exists()
    status = "exists" if exists else "will be created"
    print(f"  {label}")
    print(f"    -> {location} ({status})")

print()

# ---------- 4) Show the exact command that will run ----------
print("--- Pipeline Command ---")
pipeline_cmd = "make run"  # Uses Makefile target
print(f"  Command  : {pipeline_cmd}")
print(f"  Expands to:")
print(f"    export MPLBACKEND=Agg")
print(f"    export ORT_LOGGING_LEVEL=3")
print(f"    python scripts/pipeline/run_workflows.py --config {CONFIG_PATH}")
if RUN_BACKUPS:
    print(f"    python scripts/backup_system.py")
else:
    print(f"    (backups skipped — RUN_BACKUPS = False)")
print()

if DRY_RUN_ONLY:
    print("DRY_RUN_ONLY = True -> The pipeline will NOT execute.")
    print("Set DRY_RUN_ONLY = False in the Parameters cell to run for real.")
else:
    print("DRY_RUN_ONLY = False -> The pipeline WILL execute in the next section.")

---
## F) Run the Pipeline

This is the main execution cell. It calls the same pipeline script that `make run`
invokes, but directly, so backups stay a separate step controlled by `RUN_BACKUPS`.

**What to expect:**
- Output streams in real-time below the cell
- A timestamped log file is saved to the `logs/` directory
- If anything fails, the cell stops and shows the last 50 lines of output
- The pipeline can take 15–60+ minutes depending on your data size and hardware
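
Streaming output and enforcing a time limit interact in a subtle way: a loop over `process.stdout` blocks until the child exits, so a timeout applied only afterwards never fires. One common pattern, sketched here with a hypothetical `run_streaming` helper, is a watchdog timer that kills the process when the limit is reached.

```python
import subprocess
import sys
import threading


def run_streaming(cmd, timeout=None):
    """Run cmd, echo its output line by line, and kill it after `timeout` seconds."""
    timed_out = threading.Event()
    process = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1
    )

    def _kill():
        timed_out.set()
        process.kill()

    watchdog = threading.Timer(timeout, _kill) if timeout is not None else None
    if watchdog is not None:
        watchdog.daemon = True
        watchdog.start()

    lines = []
    try:
        for line in process.stdout:  # blocks until the child closes stdout
            print(line, end="")
            lines.append(line)
        returncode = process.wait()
    finally:
        if watchdog is not None:
            watchdog.cancel()
    return returncode, lines, timed_out.is_set()


code, lines, timed_out = run_streaming([sys.executable, "-c", "print('hello')"], timeout=30)
```

If the child spawns its own subprocesses, killing only the parent may leave them running; handling that would need process groups, which this sketch does not attempt.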

In [None]:
# ===================================================================
# Shell runner helper — streams output and logs to file
# ===================================================================

def run_shell_command(
    cmd,
    *,
    cwd=None,
    env=None,
    log_file=None,
    timeout=None,
    label="command",
):
    """
    Run a shell command with real-time output streaming.

    - Streams stdout/stderr to the notebook AND to a log file.
    - On failure: prints the last 50 lines and raises an exception.
    - Returns the process return code (0 = success).

    Parameters
    ----------
    cmd : list of str
        The command to run (e.g., ["make", "run"]).
    cwd : Path or str, optional
        Working directory. Defaults to REPO_ROOT.
    env : dict, optional
        Environment variables. Defaults to current env.
    log_file : Path or str, optional
        Path to write all output. Created automatically if not given.
    timeout : int, optional
        Max seconds to wait. None = no limit.
    label : str
        Human-readable name for the command (used in messages).
    """
    if cwd is None:
        cwd = REPO_ROOT
    if env is None:
        env = os.environ.copy()
    if log_file is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # Name the log after the label so backup logs do not match pipeline_run_*.log.
        log_file = log_dir / f"{label}_run_{timestamp}.log"

    log_file = Path(log_file)
    log_file.parent.mkdir(parents=True, exist_ok=True)

    print(f"Running: {' '.join(str(c) for c in cmd)}")
    print(f"Log    : {log_file}")
    print(f"Started: {datetime.now():%Y-%m-%d %H:%M:%S}")
    print("-" * 60)

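    import threading  # used for the timeout watchdog below

    collected_lines = []
    timed_out = threading.Event()
    watchdog = None

    try:
        process = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
            cwd=str(cwd),
            env=env,
        )

        # Reading stdout blocks until the process exits, so a timeout passed to
        # process.wait() afterwards could never fire. A watchdog timer kills the
        # process once the limit is reached instead.
        if timeout is not None:
            def _kill_on_timeout():
                timed_out.set()
                process.kill()

            watchdog = threading.Timer(timeout, _kill_on_timeout)
            watchdog.daemon = True
            watchdog.start()

        with open(log_file, "w") as lf:
            for line in process.stdout:
                print(line, end="", flush=True)
                lf.write(line)
                lf.flush()
                collected_lines.append(line)

        returncode = process.wait()
        if timed_out.is_set():
            print(f"\nTIMEOUT: {label} exceeded {timeout}s limit.")
            returncode = -1

    except Exception as exc:
        print(f"\nERROR running {label}: {exc}")
        returncode = -1
    finally:
        if watchdog is not None:
            watchdog.cancel()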

    print("-" * 60)
    print(f"Finished: {datetime.now():%Y-%m-%d %H:%M:%S}")
    print(f"Exit code: {returncode}")

    if returncode != 0:
        # Show last 50 lines for debugging
        print()
        print("=" * 60)
        print(f"FAILED — last 50 lines of {label} output:")
        print("=" * 60)
        for line in collected_lines[-50:]:
            print(line, end="")
        print()
        raise RuntimeError(
            f"{label} failed with exit code {returncode}. "
            f"Full log: {log_file}"
        )

    return returncode


print("Shell runner helper defined.")

In [None]:
# ===================================================================
# Execute the pipeline (or skip if DRY_RUN_ONLY)
# ===================================================================

if DRY_RUN_ONLY:
    print("DRY_RUN_ONLY = True — skipping pipeline execution.")
    print("Change DRY_RUN_ONLY to False in the Parameters cell and re-run.")
else:
    print("=" * 60)
    print("EXECUTING PIPELINE")
    print("=" * 60)
    print()

    # Build the command.
    # Instead of 'make run' (which includes backup), we call the pipeline
    # directly so we can control backups separately.
    pipeline_command = [
        sys.executable,
        "scripts/pipeline/run_workflows.py",
        "--config",
        str(CONFIG_PATH),
    ]

    # Build env with the same vars that Makefile sets.
    run_env = os.environ.copy()
    run_env["MPLBACKEND"] = "Agg"
    run_env["ORT_LOGGING_LEVEL"] = "3"

    # Add src/ to PYTHONPATH so imports resolve correctly.
    src_path = str(REPO_ROOT / "src")
    existing_pythonpath = run_env.get("PYTHONPATH", "")
    run_env["PYTHONPATH"] = os.pathsep.join(
        filter(None, [src_path, str(REPO_ROOT), existing_pythonpath])
    )

    run_shell_command(
        pipeline_command,
        env=run_env,
        timeout=PIPELINE_TIMEOUT_SECONDS,
        label="pipeline",
    )

    print()
    print("Pipeline completed successfully!")

---
## G) Output Summary

After the pipeline finishes, this cell shows you:
- Where outputs were saved
- How many files were generated
- Quick sanity checks on the results
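
The summary treats each output location as missing, a single file, or a directory tree. A compact sketch of that classification, using a temporary directory in place of real pipeline outputs (the helper name is hypothetical):

```python
import tempfile
from pathlib import Path


def summarize_output(path):
    """Classify a path and report how many files it holds and their total size in bytes."""
    path = Path(path)
    if not path.exists():
        return {"kind": "missing", "files": 0, "bytes": 0}
    if path.is_file():
        return {"kind": "file", "files": 1, "bytes": path.stat().st_size}
    files = [p for p in path.rglob("*") if p.is_file()]
    return {"kind": "directory", "files": len(files), "bytes": sum(p.stat().st_size for p in files)}


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "plots").mkdir()
    (base / "plots" / "a.png").write_bytes(b"abc")
    (base / "wide.csv").write_bytes(b"12345")
    dir_summary = summarize_output(base)
    file_summary = summarize_output(base / "wide.csv")
    missing_summary = summarize_output(base / "nope")
```

Collecting the file list once avoids walking a large output tree twice.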

In [None]:
# ===================================================================
# Output summary — inspect what was generated
# ===================================================================

print("=" * 60)
print("OUTPUT SUMMARY")
print("=" * 60)
print()

for label, location in output_locations:
    loc_path = Path(str(location)).expanduser()
    print(f"  {label}:")
    print(f"    Path: {loc_path}")

    if not loc_path.exists():
        print(f"    Status: does not exist (not yet generated)")
    elif loc_path.is_file():
        size_mb = loc_path.stat().st_size / (1024 * 1024)
        mtime = datetime.fromtimestamp(loc_path.stat().st_mtime)
        print(f"    Status: file, {size_mb:.2f} MB, modified {mtime:%Y-%m-%d %H:%M}")
    elif loc_path.is_dir():
        all_files = list(loc_path.rglob("*"))
        file_count = sum(1 for f in all_files if f.is_file())
        total_bytes = sum(f.stat().st_size for f in all_files if f.is_file())
        total_mb = total_bytes / (1024 * 1024)
        print(f"    Status: directory, {file_count} file(s), {total_mb:.1f} MB total")
        # Show the 5 most recently modified files
        recent = sorted(
            [f for f in all_files if f.is_file()],
            key=lambda f: f.stat().st_mtime,
            reverse=True,
        )[:5]
        if recent:
            print(f"    Recent files:")
            for f in recent:
                mtime = datetime.fromtimestamp(f.stat().st_mtime)
                print(f"      {f.name} ({mtime:%Y-%m-%d %H:%M})")
    print()

# ---------- Quick sanity check on CSVs ----------
print("--- CSV Sanity Check ---")
csv_outputs = [loc for label, loc in output_locations if str(loc).endswith(".csv")]
for csv_path_str in csv_outputs:
    csv_path = Path(str(csv_path_str)).expanduser()
    if csv_path.exists():
        try:
            import pandas as pd
            df = pd.read_csv(csv_path, nrows=5)
            print(f"  {csv_path.name}: {len(df.columns)} columns, preview OK")
            print(f"    Columns: {', '.join(map(str, df.columns[:8]))}{'...' if len(df.columns) > 8 else ''}")
        except Exception as e:
            print(f"  {csv_path.name}: ERROR reading — {e}")
    else:
        print(f"  {csv_path.name}: not found (pipeline may not have run yet)")

print()

# ---------- Log file info ----------
print("--- Log Files ---")
log_files = sorted(log_dir.glob("pipeline_run_*.log"), reverse=True)
if log_files:
    latest = log_files[0]
    size_kb = latest.stat().st_size / 1024
    print(f"  Latest: {latest.name} ({size_kb:.1f} KB)")
    print(f"  Total log files: {len(log_files)}")
else:
    print(f"  No pipeline log files found in {log_dir}")

---
## H) Backups (Optional)

Backups are **off by default** (`RUN_BACKUPS = False`). When enabled, the
backup system:

1. Copies CSV results to the SMB network share
2. Syncs to Box cloud storage via rclone
3. Copies to secured local storage

**No compression by default** — files are copied directly. Enable compression
in `config.yaml` under `backups.compression.enabled`.

To run backups: set `RUN_BACKUPS = True` in the Parameters cell, then re-run
this cell.
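
Before enabling backups it helps to see which destinations are actually switched on. The sketch below assumes the `backups.destinations` layout that the cell below reads, where each destination is a mapping with an `enabled` flag; the destination names in the sample are hypothetical.

```python
def enabled_destinations(backup_cfg):
    """Return the sorted names of destinations whose `enabled` flag is true."""
    destinations = (backup_cfg or {}).get("destinations") or {}
    return sorted(
        name
        for name, dest in destinations.items()
        if isinstance(dest, dict) and dest.get("enabled", False)
    )


sample_backups = {
    "destinations": {
        "smb": {"enabled": True},      # hypothetical destination names
        "box": {"enabled": False},
        "secured": {},
    }
}
active = enabled_destinations(sample_backups)
```

An empty result is a useful signal that turning on `RUN_BACKUPS` may not copy anything.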

In [None]:
# ===================================================================
# Run backups (only if RUN_BACKUPS is True)
# ===================================================================

if not RUN_BACKUPS:
    print("RUN_BACKUPS = False — skipping backups.")
    print("Set RUN_BACKUPS = True in the Parameters cell to enable.")
    print()
    print("Backup destinations configured in config.yaml:")
    destinations = backup_cfg.get("destinations", {})
    for name, dest in destinations.items():
        enabled = dest.get("enabled", False)
        status = "enabled" if enabled else "disabled"
        print(f"  {name}: {status}")
else:
    print("=" * 60)
    print("RUNNING BACKUPS")
    print("=" * 60)
    print()

    backup_command = [sys.executable, "scripts/backup_system.py"]

    run_shell_command(
        backup_command,
        timeout=3600,  # 1 hour max for backups
        label="backup",
    )

    print()
    print("Backups completed successfully!")

---
## I) Troubleshooting

### Common errors and what to do

| Error | Cause | Fix |
|-------|-------|-----|
| `FileNotFoundError: config.yaml` | Config path is wrong | Update `CONFIG_PATH` in Parameters cell |
| `ModuleNotFoundError: ultralytics` | Missing package | Run `pip install -r requirements.txt` |
| `CUDA out of memory` | GPU memory full | Close other GPU processes, or set `allow_cpu: true` in config |
| `No such file: best.pt` | Model file missing | Check `model_path` in config.yaml |
| `RuntimeError: pipeline failed` | General pipeline error | Check the log file path shown in the error message |
| `OSError: [Errno 28] No space` | Disk full | Free up disk space, check `df -h` |
| `Permission denied` on SMB/secured | Network mount not available | Check that SMB share is mounted |

### View the latest log file

Run the cell below to see the last 50 lines of the most recent pipeline log.
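
For very large logs, reading every line into memory just to show the tail is wasteful. A lighter variant, sketched with a hypothetical `tail_lines` helper, keeps only the last few lines while streaming through the file.

```python
import tempfile
from collections import deque
from pathlib import Path


def tail_lines(path, count=50):
    """Return the last `count` lines of a text file without holding the whole file in memory."""
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=count))


with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "demo.log"
    demo_log.write_text("".join(f"line {i}\n" for i in range(120)))
    tail = tail_lines(demo_log, count=50)
```

`errors="replace"` keeps the helper from failing on stray bytes that subprocess output sometimes contains.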

In [None]:
# ===================================================================
# View latest log file (last 50 lines)
# ===================================================================

log_files = sorted(log_dir.glob("pipeline_run_*.log"), reverse=True)

if log_files:
    latest_log = log_files[0]
    size_kb = latest_log.stat().st_size / 1024
    mtime = datetime.fromtimestamp(latest_log.stat().st_mtime)

    print(f"Latest log: {latest_log.name}")
    print(f"Size: {size_kb:.1f} KB | Modified: {mtime:%Y-%m-%d %H:%M:%S}")
    print("=" * 60)

    with open(latest_log, "r") as f:
        lines = f.readlines()

    print(f"Showing last 50 of {len(lines)} lines:\n")
    for line in lines[-50:]:
        print(line, end="")
else:
    print("No pipeline log files found.")
    print(f"Logs are saved to: {log_dir}")

---
## Run All (Sequential)

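This cell runs the pipeline, the optional backups, and a short output summary in one go.
It reuses the variables and the `run_shell_command` helper defined in sections A through F,
so run those cells first. Useful for hands-off execution once you've verified the dry-run checks.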

**Important:** Make sure you've set `DRY_RUN_ONLY = False` in the Parameters
cell before running this.
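
The stop-on-first-failure behavior can be expressed as a small staged runner. This is an illustrative sketch, not the notebook's implementation; `run_stages` and the stage functions are hypothetical.

```python
def run_stages(stages):
    """Run (name, callable) pairs in order; an exception in any stage stops the sequence."""
    completed = []
    for index, (name, func) in enumerate(stages, start=1):
        print(f"[Stage {index}/{len(stages)}] {name}")
        func()  # any exception propagates, so later stages never start
        completed.append(name)
    return completed


log = []


def validate():
    log.append("validate")


def pipeline():
    raise RuntimeError("pipeline failed")


def backup():
    log.append("backup")


try:
    run_stages([("validate", validate), ("pipeline", pipeline), ("backup", backup)])
except RuntimeError as exc:
    failure = str(exc)
```

Letting the exception propagate matches how a notebook behaves: the failing cell halts and nothing after it runs.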

In [None]:
# ===================================================================
# Run All — sequential execution of the full workflow
# ===================================================================
# This cell is a convenience wrapper around the execution stages.
# It assumes the cells in sections A-F have already been run in this
# session, since it reuses REPO_ROOT, run_shell_command, and
# output_locations defined there.
#
# Stages run here:
#   1. Run pipeline (if DRY_RUN_ONLY = False)
#   2. Run backups (if RUN_BACKUPS = True)
#   3. Print output summary
#
# If any stage fails, execution stops immediately.
# ===================================================================

print("=" * 60)
print(f"RUN ALL — started at {datetime.now():%Y-%m-%d %H:%M:%S}")
print(f"DRY_RUN_ONLY = {DRY_RUN_ONLY}")
print(f"RUN_BACKUPS  = {RUN_BACKUPS}")
print("=" * 60)
print()

if DRY_RUN_ONLY:
    print("DRY_RUN_ONLY is True — will validate but not execute.")
    print("Set DRY_RUN_ONLY = False in the Parameters cell for a real run.")
    print()
    print("Validation completed above. Review the output in the cells above.")
else:
    # --- Run pipeline ---
    print("[Stage 1/3] Running pipeline...")
    pipeline_command = [
        sys.executable,
        "scripts/pipeline/run_workflows.py",
        "--config",
        str(CONFIG_PATH),
    ]
    run_env = os.environ.copy()
    run_env["MPLBACKEND"] = "Agg"
    run_env["ORT_LOGGING_LEVEL"] = "3"
    src_path = str(REPO_ROOT / "src")
    existing = run_env.get("PYTHONPATH", "")
    run_env["PYTHONPATH"] = os.pathsep.join(filter(None, [src_path, str(REPO_ROOT), existing]))

    run_shell_command(
        pipeline_command,
        env=run_env,
        timeout=PIPELINE_TIMEOUT_SECONDS,
        label="pipeline",
    )
    print()

    # --- Run backups ---
    if RUN_BACKUPS:
        print("[Stage 2/3] Running backups...")
        run_shell_command(
            [sys.executable, "scripts/backup_system.py"],
            timeout=3600,
            label="backup",
        )
        print()
    else:
        print("[Stage 2/3] Backups skipped (RUN_BACKUPS = False).")
        print()

    # --- Output summary ---
    print("[Stage 3/3] Output summary:")
    for label, location in output_locations:
        loc_path = Path(str(location)).expanduser()
        if loc_path.exists():
            if loc_path.is_file():
                size_mb = loc_path.stat().st_size / (1024 * 1024)
                print(f"  [OK] {label}: {size_mb:.2f} MB")
            else:
                file_count = sum(1 for f in loc_path.rglob("*") if f.is_file())
                print(f"  [OK] {label}: {file_count} file(s)")
        else:
            print(f"  [--] {label}: not found")

print()
print(f"Finished at {datetime.now():%Y-%m-%d %H:%M:%S}")
print("=" * 60)