# 🚀 Databricks Inspire AI 🚀

**Transform your metadata into actionable AI-generated use cases, documentation, and presentations.**

---

## Outputs
- **Notebooks** – Ready-to-deploy SQL code
- **PDF** – Professional documentation
- **PowerPoint** – Executive slides
- **Excel** – Prioritized use cases catalog

*Supports 20+ languages including English, Arabic, Chinese, French, Spanish, German, Japanese, and more.*

---

## Quick Start
1. **Configure** – Set **Business Name**, **UC Metadata** and **Operation**
2. **Run** – Click **Run All**
3. **Explore** – Find outputs in your **Generation Path**

---

## Configuration

| # | Widget | Description | Default |
|:--|:---|:---|:---|
| 01 | **Business Name** | Organization/project name | *Required* |
| 02 | **UC Metadata** | Catalogs, schemas, or tables (e.g., `main.finance`), or the path to a JSON file | *Required* |
| 03 | **Operation** | `Discover Usecases`, `Re-generate SQL`, `Generate Sample Result` | `Discover Usecases` |
| 04 | **Business Domains** | Focus domains (e.g., "Risk, Finance") | *Auto-detected* |
| 05 | **Business Priorities** | `Increase Revenue`, `Reduce Cost`, `Optimize Operations`, `Mitigate Risk`, `Empower Talent`, `Enhance Experience`, `Drive Innovation`, `Achieve ESG`, `Protect Revenue`, `Execute Strategy` | `Increase Revenue` |
| 06 | **Strategic Goals** | Custom goals for prioritization | *Auto-generated* |
| 07 | **Generation Options** | `SQL Code`, `PDF Catalog`, `Presentation`, `dashboards`, `Unstructured Data Usecases` | `SQL Code` |
| 08 | **Generation Path** | Output folder | `./inspire_gen/` |
| 09 | **Documents Languages** | Target language(s) | `English` |
| 10 | **AI Model** | Model endpoint used by `ai_query` calls in the generated SQL | `databricks-gpt-oss-120b` |
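
The **UC Metadata** value can point at different levels of the Unity Catalog hierarchy. The sketch below shows one way such a value could be classified by counting its dot-separated parts. `classify_uc_metadata` is a hypothetical helper written for illustration, not the notebook's own parser, and it rests on three assumptions that the table above does not confirm: entries are comma-separated, identifiers contain no dots, and a JSON path ends in `.json`.

```python
def classify_uc_metadata(value: str) -> list:
    """Classify each entry of a UC Metadata string by its level.

    Assumptions (not confirmed by the notebook): entries are comma-separated,
    identifiers contain no dots, and a JSON path ends with ".json".
    """
    levels = {1: "catalog", 2: "schema", 3: "table"}
    results = []
    for raw in value.split(","):
        entry = raw.strip().replace("`", "")  # drop optional backtick quoting
        if not entry:
            continue  # skip empty entries such as a trailing comma
        if entry.lower().endswith(".json"):
            results.append((entry, "json_path"))
            continue
        parts = entry.split(".")
        results.append((entry, levels.get(len(parts), "unknown")))
    return results


print(classify_uc_metadata("main, main.finance, `main`.`finance`.`ledger`"))
```

Under these assumptions, `main` is treated as a catalog, `main.finance` as a schema, and `main.finance.ledger` as a table.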

---

## Privacy

**Metadata Only**: Reads schemas, table & column names only. Does **NOT** access or sample your actual data.

In [0]:
DATABRICKS_INSPIRE_BANNER = r"""
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ____        _        _          _      _                             ┃
┃   |  _ \  __ _| |_ __ _| |__  _ __(_) ___| | _____                      ┃
┃   | | | |/ _` | __/ _` | '_ \| '__| |/ __| |/ / __|                     ┃
┃   | |_| | (_| | || (_| | |_) | |  | | (__|   <\__ \                     ┃
┃   |____/ \__,_|\__\__,_|_.__/|_|  |_|\___|_|\_\___/                     ┃
┃       ___                      _                  _    ___              ┃
┃      |_ _| _ __   ___  _ __   (_) _ __  ___      / \  |_ _|             ┃
┃       | | | '_ \ / __|| '_ \  | || '__|/ _ \    / _ \  | |              ┃
┃       | | | | | |\__ \| |_) | | || |  |  __/   / ___ \ | |              ┃
┃      |___||_| |_||___/| .__/  |_||_|   \___|  /_/   \_\___|             ┃
┃                       |_|                                               ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
"""

# ==============================================================================
# TECHNICAL CONTEXT - Prompt to Model Mapping Configuration
# ==============================================================================
# This structure maps all prompts to their assigned LLM models.
# Modify the "model" field for each prompt to route it to a different model.
# Available models are defined in the "models" section below.
# ==============================================================================
TECHNICAL_CONTEXT = {
    "prompts_models": [
        # === PHASE 1: INITIALIZATION & CONTEXT EXTRACTION ===
        # Temperature: 0.2-0.4 for accurate extraction of business context (0.2 for table filtering, where precision matters most)
        {"prompt_name": "BUSINESS_CONTEXT_WORKER_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.3},      # Step 1: Extract business context, goals, priorities
        {"prompt_name": "UNSTRUCTURED_DATA_DOCUMENTS_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.4},  # Step 2: Generate unstructured doc list (if enabled)
        {"prompt_name": "FILTER_BUSINESS_TABLES_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.2},       # Step 3: Filter business vs technical tables (precision needed)
        
        # === PHASE 2: USE CASE GENERATION (PARALLEL) ===
        # Temperature: 0.7-0.8 for creative/innovative use case generation
        {"prompt_name": "BASE_USE_CASE_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.7},            # Step 4: Main structured data use case generation (creative)
        {"prompt_name": "AI_USE_CASE_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.8},              # Step 4: AI/ML focused use case generation (highly creative)
        {"prompt_name": "STATS_USE_CASE_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.7},           # Step 4: Statistical use case generation (creative)
        {"prompt_name": "UNSTRUCTURED_DATA_USE_CASE_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.7}, # Step 4: Unstructured data use cases (creative)
        
        # === PHASE 3: DOMAIN CLUSTERING ===
        # Temperature: 0.4-0.5 for balanced clustering decisions
        {"prompt_name": "DOMAIN_FINDER_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.5},                # Step 5: Cluster use cases into business domains
        {"prompt_name": "SUBDOMAIN_DETECTOR_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.4},           # Step 6: Detect subdomains within each domain
        {"prompt_name": "DOMAINS_MERGER_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.4},               # Step 7: Merge similar domains (optional)
        
        # === PHASE 4: SCORING & DEDUPLICATION ===
        # Temperature: 0.2-0.3 for consistent, accurate scoring
        {"prompt_name": "SCORE_USE_CASES_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.2},              # Step 8: Score use cases (ROI, Strategic Alignment) - precision
        {"prompt_name": "REVIEW_USE_CASES_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.3},             # Step 9: Intelligent deduplication using scores
        
        # === PHASE 5: SQL GENERATION ===
        # Temperature: 0.1-0.2 for accurate, syntactically correct SQL
        {"prompt_name": "USE_CASE_SQL_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.1},             # Step 10: Generate SQL (accuracy critical)
        {"prompt_name": "USE_CASE_SQL_FIX_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.1},             # Step 11: Fix SQL errors (precision critical)
        {"prompt_name": "INTERPRET_USER_SQL_REGENERATION_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.2}, # SQL Regeneration mode (special)
        
        # === PHASE 6: SUMMARY & ARTIFACTS ===
        # Temperature: 0.5-0.6 for engaging summaries and dashboards
        {"prompt_name": "SUMMARY_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.5},                  # Step 12: Generate executive summary
        {"prompt_name": "DASHBOARDS_GEN_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.6},               # Step 13: Generate dashboards (if enabled)
        
        # === PHASE 7: TRANSLATION (MULTI-LANGUAGE) ===
        # Temperature: 0.2-0.3 for accurate translation
        {"prompt_name": "KEYWORDS_TRANSLATE_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.2},           # Step 14: Translate keywords (accuracy)
        {"prompt_name": "USE_CASE_TRANSLATE_PROMPT", "model": "claude-sonnet-4-5", "temperature": 0.3},           # Step 15: Translate use cases
    ],
    "models": [
        {
            "name": "claude-sonnet-4-5",
            "llm_endpoint_name": "databricks-claude-sonnet-4-5",
            "llm_input_context_tokens_count": 200000,
            "llm_output_context_tokens_count": 128000
        },
        {
            "name": "claude-opus-4-5",
            "llm_endpoint_name": "databricks-claude-opus-4-5",
            "llm_input_context_tokens_count": 200000,
            "llm_output_context_tokens_count": 64000
        },
        {
            "name": "gpt-oss-120b",
            "llm_endpoint_name": "databricks-gpt-oss-120b",
            "llm_input_context_tokens_count": 131000,
            "llm_output_context_tokens_count": 131000
        },
        {
            "name": "gpt-oss-20b",
            "llm_endpoint_name": "databricks-gpt-oss-20b",
            "llm_input_context_tokens_count": 131000,
            "llm_output_context_tokens_count": 32000
        }
    ]
}

def get_model_endpoint_for_prompt(prompt_name: str) -> str:
    """
    Get the LLM endpoint name for a given prompt using TECHNICAL_CONTEXT.
    
    Args:
        prompt_name: Name of the prompt (e.g., "BUSINESS_CONTEXT_WORKER_PROMPT")
    
    Returns:
        The LLM endpoint name (e.g., "databricks-claude-sonnet-4-5"). Falls back to
        "databricks-gpt-oss-20b" when the prompt or its model is not registered.
    """
    models_lookup = {m["name"]: m["llm_endpoint_name"] for m in TECHNICAL_CONTEXT["models"]}
    
    for pm in TECHNICAL_CONTEXT["prompts_models"]:
        if pm["prompt_name"] == prompt_name:
            model_name = pm["model"]
            return models_lookup.get(model_name, "databricks-gpt-oss-20b")
    
    return "databricks-gpt-oss-20b"

def get_model_config_for_prompt(prompt_name: str) -> dict:
    """
    Get the full model configuration for a given prompt using TECHNICAL_CONTEXT.
    
    Args:
        prompt_name: Name of the prompt (e.g., "BUSINESS_CONTEXT_WORKER_PROMPT")
    
    Returns:
        Dictionary with the model configuration (name, endpoint, input/output token limits).
        Falls back to the "claude-sonnet-4-5" entry when the prompt or its model is not registered.
    """
    models_lookup = {m["name"]: m for m in TECHNICAL_CONTEXT["models"]}
    
    for pm in TECHNICAL_CONTEXT["prompts_models"]:
        if pm["prompt_name"] == prompt_name:
            model_name = pm["model"]
            return models_lookup.get(model_name, models_lookup.get("claude-sonnet-4-5", {}))
    
    return models_lookup.get("claude-sonnet-4-5", {})

def log_print(message: str, level: str = "INFO", flush: bool = True):
    """Print a message with timestamp in logger format for immediate console output.
    For ERROR/CRITICAL levels, also writes to stderr for visibility.
    """
    import time as _time
    import sys as _sys
    timestamp = _time.strftime('%H:%M:%S')
    formatted_msg = f"{timestamp} - {level} - {message}"
    print(formatted_msg, flush=flush)  # flush=True already flushes stdout
    
    # Mirror errors to stderr so they stand out in the console
    if level.upper() in ("ERROR", "CRITICAL"):
        print(formatted_msg, file=_sys.stderr, flush=True)

def get_clean_error_message(exception: Exception, max_lines: int = 1) -> str:
    """
    Extract clean error message from exception without ugly stack traces.
    
    Args:
        exception: The exception object
        max_lines: Maximum number of non-empty lines to keep for multi-line errors
            without a JVM stack trace (default: 1 for first line only); ignored otherwise
    
    Returns:
        Clean error message string
    """
    error_str = str(exception)
    if '\n' in error_str and 'JVM stacktrace' not in error_str:
        # Multi-line error but no JVM trace - take first meaningful line
        lines = [line.strip() for line in error_str.split('\n') if line.strip()]
        return ' '.join(lines[:max_lines])
    elif 'JVM stacktrace' in error_str or len(error_str) > 500:
        # Has JVM stack trace or very long - extract just first line
        first_line = error_str.split('\n')[0].strip()
        # If first line mentions table/view not found, keep that
        if 'TABLE_OR_VIEW_NOT_FOUND' in first_line or 'cannot be found' in first_line:
            return first_line
        # Otherwise extract the main error message
        if ']:' in first_line:
            return first_line.split(']:')[-1].strip()
        return first_line[:500]  # Truncate to 500 chars
    return error_str


# ==============================================================================
# TRULY ADAPTIVE PARALLELISM CALCULATOR
# ==============================================================================
# Parallelism is calculated dynamically based on ACTUAL DATA:
# - Number of items to process (tables, use cases, domains)
# - Estimated prompt/payload size
# - LLM vs non-LLM operations
# - Risk of rate limiting or timeouts
#
# BOUNDS: MIN=4, MAX=10 (enforced globally)
# METADATA QUERIES: FIXED at 5 (no LLM, but too many connections can hang)
# ==============================================================================

# Fixed parallelism for metadata operations (no LLM, but DB connections can saturate)
METADATA_PARALLELISM = 5

def calculate_adaptive_parallelism(
    step_name: str,
    max_parallelism: int,
    num_items: int = 0,
    total_columns: int = 0,
    avg_prompt_chars: int = 0,
    num_domains: int = 0,
    is_llm_operation: bool = True,
    logger=None
) -> tuple:
    """
    Calculate truly adaptive parallelism based on actual data characteristics.
    
    Args:
        step_name: Name of the step (for logging)
        max_parallelism: User-configured maximum parallelism
        num_items: Number of items to process (tables, use cases, domains, etc.)
        total_columns: Total columns involved (affects prompt size)
        avg_prompt_chars: Average prompt size in characters
        num_domains: Number of business domains
        is_llm_operation: Whether this step makes LLM calls
        logger: Optional logger
    
    Returns:
        Tuple of (parallelism: int, reason: str)
    """
    MIN_PARALLELISM = 4
    MAX_PARALLELISM = 10
    
    # Start with user max, capped at global max
    base = min(max_parallelism, MAX_PARALLELISM)
    
    # =========================================================================
    # METADATA OPERATIONS (Fixed at 5 - no LLM but DB connections can saturate)
    # =========================================================================
    if step_name in ["schema_discovery", "table_discovery", "column_fetch"]:
        result = METADATA_PARALLELISM
        reason = f"FIXED={METADATA_PARALLELISM} for metadata queries (DB connection limit)"
        if logger:
            logger.info(f"🔧 [{step_name.upper()}] Parallelism = {result} | {reason}")
        return (result, reason)
    
    # =========================================================================
    # FILE I/O OPERATIONS (Can use higher parallelism, but cap based on items)
    # =========================================================================
    if step_name in ["notebook_generation", "artifact_writing"]:
        # Scale with number of items, but cap reasonably
        if num_items <= 5:
            result = min(base, num_items + 2)
            reason = f"few items ({num_items}), using {result} workers"
        elif num_items <= 15:
            result = min(base, 6)
            reason = f"moderate items ({num_items}), capped at 6"
        else:
            result = min(base, 8)
            reason = f"many items ({num_items}), capped at 8 for I/O stability"
        
        result = max(MIN_PARALLELISM, min(MAX_PARALLELISM, result))
        if logger:
            logger.info(f"🔧 [{step_name.upper()}] Parallelism = {result} | {reason}")
        return (result, reason)
    
    # =========================================================================
    # LLM OPERATIONS - Truly adaptive based on workload
    # =========================================================================
    
    # Base calculation factors
    factors = []
    calculated = base
    
    # FACTOR 1: Number of items (more items = need more caution)
    if num_items > 0:
        if num_items <= 10:
            item_factor = 0.8  # Small batch, can be aggressive
            factors.append(f"{num_items} items (small)")
        elif num_items <= 30:
            item_factor = 0.6  # Medium batch
            factors.append(f"{num_items} items (medium)")
        elif num_items <= 100:
            item_factor = 0.4  # Large batch, be conservative
            factors.append(f"{num_items} items (large)")
        else:
            item_factor = 0.3  # Very large, very conservative
            factors.append(f"{num_items} items (very large)")
        calculated = int(calculated * item_factor)
    
    # FACTOR 2: Prompt size (larger prompts = more tokens = slower responses)
    if avg_prompt_chars > 0:
        if avg_prompt_chars > 100000:
            prompt_factor = 0.4  # Very large prompts
            factors.append(f"~{avg_prompt_chars//1000}K chars/prompt (huge)")
        elif avg_prompt_chars > 50000:
            prompt_factor = 0.5
            factors.append(f"~{avg_prompt_chars//1000}K chars/prompt (large)")
        elif avg_prompt_chars > 20000:
            prompt_factor = 0.7
            factors.append(f"~{avg_prompt_chars//1000}K chars/prompt (medium)")
        else:
            prompt_factor = 0.9
            factors.append(f"~{avg_prompt_chars//1000}K chars/prompt (small)")
        calculated = int(calculated * prompt_factor)
    
    # FACTOR 3: Number of domains (more domains = more parallel LLM calls)
    if num_domains > 0:
        if num_domains > 15:
            domain_factor = 0.4  # Many domains, very conservative
            factors.append(f"{num_domains} domains (many)")
        elif num_domains > 8:
            domain_factor = 0.5
            factors.append(f"{num_domains} domains (moderate)")
        else:
            domain_factor = 0.7
            factors.append(f"{num_domains} domains (few)")
        calculated = int(calculated * domain_factor)
    
    # FACTOR 4: Total columns (more columns = bigger schema context)
    if total_columns > 0:
        if total_columns > 2000:
            col_factor = 0.5
            factors.append(f"{total_columns} cols (massive schema)")
        elif total_columns > 1000:
            col_factor = 0.6
            factors.append(f"{total_columns} cols (large schema)")
        elif total_columns > 500:
            col_factor = 0.7
            factors.append(f"{total_columns} cols (medium schema)")
        else:
            col_factor = 0.9
            factors.append(f"{total_columns} cols (small schema)")
        calculated = int(calculated * col_factor)
    
    # FACTOR 5: Step-specific adjustments
    step_adjustments = {
        "scoring": (0.6, "scoring is rate-limit sensitive"),
        "deduplication": (0.7, "dedup needs LLM per domain"),
        "sql_generation": (0.7, "SQL gen is complex"),
        "use_case_generation": (0.6, "LLM-intensive, 2-pass for transactional tables"),
        "domain_clustering": (0.6, "domain detection is heavy"),
        "subdomain_detection": (0.7, "subdomain per domain"),
        "translation": (0.7, "translation LLM calls"),
        "sql_validation": (0.8, "DB queries, not LLM"),
    }
    
    if step_name in step_adjustments:
        adj_factor, adj_reason = step_adjustments[step_name]
        calculated = int(calculated * adj_factor)
        factors.append(adj_reason)
    
    # Apply bounds
    result = max(MIN_PARALLELISM, min(MAX_PARALLELISM, calculated))
    
    # If no factors were applied and it's an LLM operation, use conservative default
    if not factors and is_llm_operation:
        result = max(MIN_PARALLELISM, min(4, base))
        factors.append("LLM operation, conservative default")
    
    # Build reason string
    reason = " + ".join(factors) if factors else "default calculation"
    reason = f"calculated={result} based on: {reason}"
    
    if logger:
        logger.info(f"🔧 [{step_name.upper()}] Parallelism = {result} (from max={max_parallelism}) | {reason}")
    
    return (result, reason)
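

# ------------------------------------------------------------------------------
# WORKED EXAMPLE (illustrative sketch, not used by the pipeline)
# ------------------------------------------------------------------------------
# The helper below retraces the LLM-path arithmetic of
# calculate_adaptive_parallelism() for one assumed workload:
#   step_name="scoring", max_parallelism=10, num_items=20 (all other inputs 0).
# The function name and the chosen inputs are hypothetical; the factors are
# copied from the code above. Each factor truncates via int(), so reductions
# compound before the final clamp to [MIN_PARALLELISM, MAX_PARALLELISM].
def _example_scoring_parallelism() -> int:
    base = min(10, 10)                   # user max capped at MAX_PARALLELISM -> 10
    after_items = int(base * 0.6)        # 20 items is a "medium" batch -> 6
    after_step = int(after_items * 0.6)  # "scoring" adjustment -> int(3.6) = 3
    return max(4, min(10, after_step))   # clamped up to MIN_PARALLELISM -> 4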


def log_adaptive_parallelism_decision(step_name: str, parallelism: int, max_parallelism: int, reason: str):
    """
    Log the adaptive parallelism decision with full context.
    """
    log_print(f"🔧 [{step_name.upper()}] Workers: {parallelism} (max={max_parallelism})")
    log_print(f"   └─ Reason: {reason}")


def create_widgets():
    """
    Creates widgets if they don't exist. Retains existing widget values.
    
    Widget Order:
    0- Business Name
    1- UC Metadata
    2- Operation
    3- Business Domains
    4- Business Priorities
    5- Strategic Goals
    6- Generation Options
    7- Generation Path
    8- Documents Languages
    9- AI Model
    """
    
    log_print("Creating widgets (retaining existing values)...")
    
    widget_errors = []
    
    # --- 0. Business Name (REQUIRED) ---
    try:
        dbutils.widgets.text("00_business_name", "", "01. Business Name")
    except Exception as e:
        widget_errors.append(f"Business Name: {e}")
    
    # --- 1. UC Metadata (catalogs/schemas/tables OR JSON file path) ---
    try:
        dbutils.widgets.text("01_uc_metadata", "", "02. UC Metadata")
    except Exception as e:
        widget_errors.append(f"UC Metadata: {e}")
    
    # --- 2. Operation (controls main operation mode) ---
    try:
        operation_options = [
            "Discover Usecases",
            "Re-generate SQL",
            "Generate Sample Result"
        ]
        dbutils.widgets.dropdown("02_operation", "Discover Usecases", operation_options, "03. Operation")
    except Exception as e:
        widget_errors.append(f"Operation: {e}")
    
    # --- 3. Business Domains (comma-separated list of domains) ---
    try:
        dbutils.widgets.text("03_business_domains", "", "04. Business Domains")
    except Exception as e:
        widget_errors.append(f"Business Domains: {e}")
    
    # --- 4. Business Priorities (multi-select) ---
    try:
        business_priorities_options = [
            "Increase Revenue",
            "Reduce Cost",
            "Optimize Operations",
            "Mitigate Risk",
            "Empower Talent",
            "Enhance Experience",
            "Drive Innovation",
            "Achieve ESG",
            "Protect Revenue",
            "Execute Strategy"
        ]
        dbutils.widgets.multiselect("04_business_priorities", "Increase Revenue", business_priorities_options, "05. Business Priorities")
    except Exception as e:
        widget_errors.append(f"Business Priorities: {e}")
    
    # --- 5. Strategic Goals ---
    try:
        dbutils.widgets.text("05_strategic_goals", "", "06. Strategic Goals")
    except Exception as e:
        widget_errors.append(f"Strategic Goals: {e}")
    
    # --- 6. Generation Options (multiselect with generation choices) ---
    try:
        generation_options = [
            "SQL Code",
            "PDF Catalog",
            "Presentation",
            "dashboards",
            "Unstructured Data Usecases"
        ]
        dbutils.widgets.multiselect(
            "06_generation_options", 
            "SQL Code",
            generation_options, 
            "07. Generation Options"
        )
    except Exception as e:
        widget_errors.append(f"Generation Options: {e}")
    
    # --- 7. Generation Path ---
    try:
        dbutils.widgets.text("07_generation_path", "./inspire_gen/", "08. Generation Path")
    except Exception as e:
        widget_errors.append(f"Generation Path: {e}")
    
    # --- 8. Documents Languages (multiselect) ---
    try:
        lang_choices = [
            "English", "French", "German", "Spanish", "Hindi",
            "Chinese (Mandarin)", "Japanese", "Arabic", "Portuguese", "Russian",
            "Swedish", "Danish", "Norwegian", "Finnish",
            "Italian", "Polish", "Romanian", "Ukrainian", "Dutch", "Korean",
            "Indonesian", "Malay", "Tamil"
        ]
        dbutils.widgets.multiselect("08_documents_languages", "English", lang_choices, "09. Documents Languages")
    except Exception as e:
        widget_errors.append(f"Documents Languages: {e}")
    
    # --- 9. AI Model (model endpoint for ai_query in generated SQL) ---
    try:
        dbutils.widgets.text("09_ai_model", "databricks-gpt-oss-120b", "10. AI Model")
    except Exception as e:
        widget_errors.append(f"AI Model: {e}")
    
    if widget_errors:
        log_print("⚠️ Some widgets had errors during creation:", level="WARNING")
        for err in widget_errors:
            log_print(f"   - {err}", level="WARNING")
        log_print("   Try running: dbutils.widgets.removeAll() and then run this cell again")
    else:
        log_print("✅ Widgets created successfully.")
    
    log_print("")
    log_print(">>> Fill in the widget values at the TOP of this notebook, then run main().")

# ---
# Run this cell to create widgets.
# Fill in the widget values at the TOP of the notebook.
# Then, proceed to run the 'main()' cell below.
# ---

create_widgets()

# COMMAND ----------

# DBTITLE 1,Imports & Commons
# ==============================================================================
# 0. IMPORTS & CONFIGURATION
# ==============================================================================
import os
import pandas as pd
import logging
import re
import subprocess
import sys
import json
import csv
import io
import uuid
import base64
import random
import tempfile
import shutil
import datetime
import html
import pkg_resources
import warnings
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.utils import AnalysisException
from collections import defaultdict
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import gc

# --- Databricks SDK Imports for Notebook Creation ---
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import workspace

# --- Global Configuration ---
AI_MODEL_NAME = "databricks-gpt-oss-20b"

# Token-to-Character Ratios (for context limit calculations)
# English: 1 token ≈ 4 characters
# Non-English: 1 token ≈ 2 characters
TOKEN_TO_CHAR_RATIO_ENGLISH = 4
TOKEN_TO_CHAR_RATIO_NON_ENGLISH = 2

def get_model_token_limit(prompt_name: str = None) -> int:
    """
    Get the input token limit for a model based on TECHNICAL_CONTEXT.
    
    Args:
        prompt_name: Name of the prompt to lookup model for. If None, returns default model limit.
    
    Returns:
        Input token limit for the model assigned to this prompt
    """
    if prompt_name:
        model_config = get_model_config_for_prompt(prompt_name)
        if model_config:
            return model_config.get("llm_input_context_tokens_count", 200000)
    
    default_model = next((m for m in TECHNICAL_CONTEXT["models"] if m["name"] == "claude-sonnet-4-5"), None)
    return default_model.get("llm_input_context_tokens_count", 200000) if default_model else 200000

def get_model_output_token_limit(prompt_name: str = None) -> int:
    """
    Get the OUTPUT token limit for a model based on TECHNICAL_CONTEXT.
    This is critical to prevent LLM output truncation (Claude defaults to only 1000 tokens without max_tokens).
    
    Args:
        prompt_name: Name of the prompt to lookup model for. If None, returns default model limit.
    
    Returns:
        Output token limit for the model assigned to this prompt
    """
    if prompt_name:
        model_config = get_model_config_for_prompt(prompt_name)
        if model_config:
            return model_config.get("llm_output_context_tokens_count", 32000)
    
    default_model = next((m for m in TECHNICAL_CONTEXT["models"] if m["name"] == "claude-sonnet-4-5"), None)
    return default_model.get("llm_output_context_tokens_count", 32000) if default_model else 32000

def get_max_context_chars(language: str = "English", prompt_name: str = None) -> int:
    """
    Calculate the maximum character limit based on language and model's token limit.
    Uses TECHNICAL_CONTEXT to determine model-specific token limits.
    
    Args:
        language: Target language (default: "English")
        prompt_name: Name of the prompt to lookup model for (optional)
    
    Returns:
        Maximum character limit for the given language and model
    """
    max_tokens = get_model_token_limit(prompt_name)
    
    if language.lower() == "english":
        return max_tokens * TOKEN_TO_CHAR_RATIO_ENGLISH
    else:
        return max_tokens * TOKEN_TO_CHAR_RATIO_NON_ENGLISH

def get_safe_context_limit(language: str = "English", buffer_percent: float = 0.9, prompt_name: str = None) -> int:
    """
    Calculate a safe context limit with buffer to proactively avoid LLM errors.
    Uses TECHNICAL_CONTEXT to determine model-specific token limits.
    
    Formula: model_token_limit * char_ratio * buffer_percent
    
    Args:
        language: Target language (default: "English")
        buffer_percent: Fraction of the maximum to use (default: 0.9, i.e. a 10% safety buffer)
        prompt_name: Name of the prompt to lookup model for (optional)
    
    Returns:
        Safe character limit with buffer applied
    """
    max_chars = get_max_context_chars(language, prompt_name)
    safe_limit = int(max_chars * buffer_percent)
    return safe_limit
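

# Worked example (illustrative sketch, not used by the pipeline): applies the
# formula above, model_token_limit * char_ratio * buffer_percent, to an assumed
# 200,000-token input limit. The function name is hypothetical; the ratios
# (4 chars/token for English, 2 otherwise) are the constants defined above.
def _example_safe_context_limits(token_limit: int = 200_000, buffer_percent: float = 0.9) -> dict:
    return {
        "english": int(token_limit * 4 * buffer_percent),      # 720,000 chars
        "non_english": int(token_limit * 2 * buffer_percent),  # 360,000 chars
    }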

# Legacy constants for backward compatibility (uses default model's English ratio)
MAX_CONTEXT_TOKENS = get_model_token_limit()
MAX_CONTEXT_CHARS = MAX_CONTEXT_TOKENS * TOKEN_TO_CHAR_RATIO_ENGLISH

# ==============================================================================
# IDENTIFIER NORMALIZATION UTILITIES
# ==============================================================================
# These functions handle SQL identifier parsing and quoting consistently.
# All user input should be normalized (backticks stripped) on input,
# and backticks should be added when constructing SQL queries.
# ==============================================================================

def normalize_identifier(identifier: str) -> str:
    """
    Strip backticks from an identifier.
    
    Args:
        identifier: A SQL identifier that may or may not have backticks
        
    Returns:
        The identifier without backticks
        
    Examples:
        normalize_identifier("`my-schema`") -> "my-schema"
        normalize_identifier("my_table") -> "my_table"
    """
    if identifier is None:
        return ""
    return identifier.strip().strip('`')

def quote_identifier(identifier: str) -> str:
    """
    Add backticks to an identifier for safe SQL usage.
    First normalizes (strips existing backticks) then adds fresh backticks.
    
    Args:
        identifier: A SQL identifier (normalized or not)
        
    Returns:
        The identifier wrapped in backticks
        
    Examples:
        quote_identifier("my-schema") -> "`my-schema`"
        quote_identifier("`my-schema`") -> "`my-schema`" (not double-quoted)
    """
    if identifier is None:
        return "``"
    normalized = normalize_identifier(identifier)
    return f"`{normalized}`"

def parse_three_level_name(name: str) -> tuple:
    """
    Parse a three-level name (catalog.schema.table or catalog.schema.column) into parts.
    Handles names with or without backticks at any level.
    
    Args:
        name: A three-level name like "catalog.schema.table" or "`cat`.`sch`.`tbl`"
        
    Returns:
        Tuple of (part1, part2, part3) with backticks stripped from each part,
        or (None, None, None) if parsing fails
        
    Examples:
        parse_three_level_name("cat.schema.table") -> ("cat", "schema", "table")
        parse_three_level_name("`cat`.`my-schema`.`table`") -> ("cat", "my-schema", "table")
        parse_three_level_name("invalid") -> (None, None, None)
    """
    if not name:
        return (None, None, None)
    
    clean_name = name.replace('`', '')
    parts = clean_name.split('.')
    
    if len(parts) == 3:
        return (parts[0].strip(), parts[1].strip(), parts[2].strip())
    return (None, None, None)

def parse_two_level_name(name: str) -> tuple:
    """
    Parse a two-level name (catalog.schema) into parts.
    Handles names with or without backticks.
    
    Args:
        name: A two-level name like "catalog.schema" or "`cat`.`my-schema`"
        
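    Returns:
        Tuple of (part1, part2) with backticks stripped from each part,
        or (None, None) if the name is empty or contains no dot.
        The split happens at the first dot only, so any further dots
        remain in part2 (e.g., "a.b.c" -> ("a", "b.c")).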
        
    Examples:
        parse_two_level_name("cat.schema") -> ("cat", "schema")
        parse_two_level_name("`cat`.`my-schema`") -> ("cat", "my-schema")
    """
    if not name:
        return (None, None)
    
    clean_name = name.replace('`', '')
    parts = clean_name.split('.', 1)
    
    if len(parts) == 2:
        return (parts[0].strip(), parts[1].strip())
    return (None, None)

def parse_four_level_name(name: str) -> tuple:
    """
    Parse a four-level name (catalog.schema.table.column) into parts.
    Handles names with or without backticks at any level.
    
    Args:
        name: A four-level name like "catalog.schema.table.column"
        
    Returns:
        Tuple of (catalog, schema, table, column) with backticks stripped,
        or (None, None, None, None) if parsing fails
    """
    if not name:
        return (None, None, None, None)
    
    clean_name = name.replace('`', '')
    parts = clean_name.split('.')
    
    if len(parts) == 4:
        return (parts[0].strip(), parts[1].strip(), parts[2].strip(), parts[3].strip())
    return (None, None, None, None)
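
# --- Illustrative sketch (assumption: not part of the original helpers) ---
# The parse_*_level_name helpers above strip every backtick before splitting on ".",
# so a quoted part that itself contains a dot (e.g. `my.schema`) would be split in
# the wrong place. A minimal backtick-aware splitter might look like the function
# below. The name _split_quoted_name_sketch is hypothetical, nothing else in this
# notebook calls it, and doubled backticks used as an escape are not handled.
def _split_quoted_name_sketch(name: str) -> list:
    """Split a dotted name on dots that fall outside backtick-quoted parts."""
    parts, current, in_quotes = [], [], False
    for ch in name or "":
        if ch == '`':
            in_quotes = not in_quotes  # toggle quoting; the backtick itself is dropped
        elif ch == '.' and not in_quotes:
            parts.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current).strip())
    # Drop empty parts so that None or "" yields an empty list
    return [p for p in parts if p]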

def build_fqn(catalog: str, schema: str, table: str = None) -> str:
    """
    Build a fully qualified name with proper backtick quoting.
    
    Args:
        catalog: Catalog name (will be normalized and quoted)
        schema: Schema name (will be normalized and quoted)
        table: Optional table name (will be normalized and quoted if provided)
        
    Returns:
        Properly quoted FQN like `catalog`.`schema` or `catalog`.`schema`.`table`
        
    Examples:
        build_fqn("cat", "my-schema") -> "`cat`.`my-schema`"
        build_fqn("cat", "schema", "table") -> "`cat`.`schema`.`table`"
    """
    cat_quoted = quote_identifier(catalog)
    schema_quoted = quote_identifier(schema)
    
    if table:
        table_quoted = quote_identifier(table)
        return f"{cat_quoted}.{schema_quoted}.{table_quoted}"
    return f"{cat_quoted}.{schema_quoted}"

# --- LLM Model Configuration for Each Prompt (Derived from TECHNICAL_CONTEXT) ---
# This dict is auto-generated from TECHNICAL_CONTEXT for backward compatibility.
# To change model assignments, modify TECHNICAL_CONTEXT at the top of the file.
LLM_MODEL_CONFIG = {
    pm["prompt_name"]: get_model_endpoint_for_prompt(pm["prompt_name"])
    for pm in TECHNICAL_CONTEXT["prompts_models"]
}

# DBTITLE 1,Prompts
# --- 1. Main Prompt Templates Dictionary ---
PROMPT_TEMPLATES = {}

HONESTY_CHECK_CSV = """

### 🎯 HONESTY SELF-ASSESSMENT (MANDATORY - INTEGRATED IN CSV) 🎯

You MUST include honesty self-assessment as TWO ADDITIONAL COLUMNS at the END of your CSV:
- **"honesty_score"**: Your honest score 0-100 for the ENTIRE output quality
- **"honesty_justification"**: Brief justification (max 250 chars)

Another more powerful LLM will review your output and generate its own honesty score to compare against yours - BE EXTREMELY HONEST. Try EXTREMELY hard to achieve 100% honesty - do not inflate your score.

**IMPORTANT**: Add these 2 columns to EVERY row. Use the SAME score and justification for all rows (it's for the entire output, not per-row).
"""

HONESTY_CHECK_JSON = """

### 🎯 HONESTY SELF-ASSESSMENT (MANDATORY - INTEGRATED IN JSON) 🎯

You MUST wrap your JSON output with honesty self-assessment fields. Your response must be a JSON object with this structure:
```json
{{
  "honesty_score": <your score 0-100>,
  "honesty_justification": "<brief justification, max 250 chars>",
  "data": <your actual output here>
}}
```

Another more powerful LLM will review your output and generate its own honesty score to compare against yours - BE EXTREMELY HONEST. Try EXTREMELY hard to achieve 100% honesty - do not inflate your score.
"""
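
# --- Illustrative sketch (assumption: not part of the original pipeline) ---
# HONESTY_CHECK_JSON asks the model to wrap its answer in an object with
# "honesty_score", "honesty_justification" and "data". One possible way to unwrap
# such a response is shown below. The helper name is hypothetical and the notebook's
# real parsing logic may differ.
import json

def _unwrap_honesty_json_sketch(response_text: str):
    """Return (score, justification, data) for a response that follows the wrapper."""
    payload = json.loads(response_text)
    if not isinstance(payload, dict) or "data" not in payload:
        # The response did not use the wrapper; pass it through unchanged
        return (None, None, payload)
    return (payload.get("honesty_score"), payload.get("honesty_justification"), payload["data"])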

HONESTY_CHECK_SQL = """

### 🎯 HONESTY SELF-ASSESSMENT (MANDATORY - AS SQL COMMENT) 🎯

You MUST include honesty self-assessment as the FIRST comment in your SQL output:
```sql
-- HONESTY_SCORE: <your score 0-100>
-- HONESTY_JUSTIFICATION: <brief justification, max 250 chars>
```

Another more powerful LLM will review your output and generate its own honesty score to compare against yours - BE EXTREMELY HONEST. Try EXTREMELY hard to achieve 100% honesty - do not inflate your score.
"""
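
# --- Illustrative sketch (assumption: not part of the original pipeline) ---
# HONESTY_CHECK_SQL asks for a leading "-- HONESTY_SCORE: <0-100>" comment. A small
# regex-based reader for that header could look like this. The names below are
# hypothetical and the notebook's real extraction code may differ.
import re

_HONESTY_SCORE_RE_SKETCH = re.compile(r"^\s*--\s*HONESTY_SCORE:\s*(\d{1,3})\s*$", re.MULTILINE)

def _extract_sql_honesty_score_sketch(sql_text: str):
    """Return the self-reported score as an int, or None if it is absent or out of range."""
    match = _HONESTY_SCORE_RE_SKETCH.search(sql_text or "")
    if not match:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None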

HONESTY_CHECK_TABLE = """

### 🎯 HONESTY SELF-ASSESSMENT (MANDATORY - AS TABLE COLUMNS) 🎯

You MUST include honesty self-assessment as TWO ADDITIONAL COLUMNS at the END of your table:
- **"honesty_score"**: Your honest score 0-100 for the ENTIRE output quality  
- **"honesty_justification"**: Brief justification (max 250 chars)

Another more powerful LLM will review your output and generate its own honesty score to compare against yours - BE EXTREMELY HONEST. Use the SAME score and justification for all rows.
"""

# --- 1. Business Context Worker Prompt ---
PROMPT_TEMPLATES["BUSINESS_CONTEXT_WORKER_PROMPT"] = """
### PERSONA

You are a **Principal Business Analyst** and recognized industry specialist with 15+ years of deep expertise in the `{industry}` industry. You are a master of business strategy, operations, and data-driven decision making.

### CONTEXT

**Assignment Details:**
- Industry/Business Name: `{name}`
- Type: {type_description}
- Target: Research and document comprehensive business context for this {type_label}

### TASK DEFINITION

Research and provide comprehensive business context information across 6 specific dimensions. Generate detailed, realistic, industry-specific information that will serve as the foundation for building a data model. Your output must be a single, well-structured JSON object.

### WORKFLOW

**Step 1: Research**
Leverage your deep industry knowledge of `{industry}` to understand the {type_label} named `{name}`.

**Step 2: Information Gathering**
For each of the 6 required fields, identify comprehensive and specific details:
1. **Business Context**: General overview of the business operations, market position, and key characteristics.
2. **Strategic Goals**: The high-level long-term strategic objectives. You MUST select 3-7 goals from this standard list that are MOST relevant to this business:
   - "Reduce Cost" (automation, efficiency, waste reduction)
   - "Boost Productivity" (faster processes, better tools, streamlined workflows)
   - "Increase Revenue" (new revenue streams, upselling, cross-selling, market expansion)
   - "Mitigate Risk" (fraud detection, compliance, security, audit trails)
   - "Protect Revenue" (churn prevention, retention, customer satisfaction)
   - "Align to Regulations" (compliance automation, regulatory reporting, audit support)
   - "Improve Customer Experience" (personalization, faster service, quality improvements)
   - "Enable Data-Driven Decisions" (analytics, insights, forecasting, predictions)
3. **Business Priorities**: Immediate and near-term focus areas for the organization.
4. **Strategic Initiative**: Key initiatives currently underway to drive growth or transformation.
5. **Value Chain**: The primary activities that create value for the customer.
6. **Revenue Model**: How the business generates revenue (streams, pricing models, etc.).

**Step 3: JSON Construction**
Format all information as a single JSON object with 6 keys. Values should be descriptive strings (not lists).

### RULES AND CONSTRAINTS

1. **Descriptive Strings**: Provide clear, concise, but comprehensive descriptions for each field.
2. **No Generic Placeholders**: Use specific, real-world terminology and examples.
3. **Industry-Specific**: All information must be directly relevant to the `{industry}` industry.
4. **Realistic and Plausible**: Information should reflect actual industry practices.
5. **Strategic Goals Format**: The strategic_goals field MUST be a comma-separated list of 3-7 goals from the standard list above, with brief elaboration for each. Example: "Reduce Cost (automate manual processes), Increase Revenue (expand digital channels), Mitigate Risk (enhance fraud detection)"

### OUTPUT FORMAT

Your response must be a single valid JSON object with NO text before or after.

**JSON Structure (this is the `data` payload of the honesty wrapper described below):**
```json
{{
  "business_context": "string description",
  "strategic_goals": "Goal1 (elaboration), Goal2 (elaboration), Goal3 (elaboration)",
  "business_priorities": "string description",
  "strategic_initiative": "string description",
  "value_chain": "string description",
  "revenue_model": "string description"
}}
```

### EXECUTION INSTRUCTION

Begin generation of the JSON output now. Ensure all 6 fields are present. Start with the opening brace.

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:
❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here is...", "I've generated...", "Based on...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must be a valid JSON object with the honesty wrapper
- Start with: {{"honesty_score":
""" + HONESTY_CHECK_JSON

# --- 1a. Use Case Generation Prompt (ENHANCED - REMOVED GUARDRAILS) ---
# The prompt itself (BASE_USE_CASE_GEN_PROMPT) is defined after the two function registries below.
# --- AI Functions Registry ---
AI_FUNCTIONS = {
    "ai_analyze_sentiment": {
        "function": "ai_analyze_sentiment(content)",
        "business_value": "Analyzes sentiment (positive/negative/neutral/mixed) in text to understand customer emotion and prioritize responses. MUST be combined with ai_query for actionable recommendations.",
        "example_use_cases": "Customer review analysis with emotion classification and response strategies • Social media sentiment monitoring with brand perception tracking • Support ticket triage with urgency classification • Employee feedback analysis with engagement insights • Product review sentiment with improvement recommendations. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_classify": {
        "function": "ai_classify(content, labels)",
        "business_value": "Classifies text into predefined categories for automated routing, segmentation, and prioritization. MUST be combined with ai_query for actionable recommendations. The labels array MUST have at most 20 items, each under 50 characters.",
        "example_use_cases": "Customer segmentation with retention strategies • Support ticket routing with resolution plans • Lead scoring with engagement tactics • Risk classification with mitigation strategies • Product categorization with marketing recommendations • Content tagging with engagement strategies. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_extract": {
        "function": "ai_extract(content, labels)",
        "business_value": "Extracts specified entities from unstructured text to structure data for analysis and automation. The labels array MUST have at most 20 items, each under 50 characters.",
        "example_use_cases": "Invoice detail extraction • Email parsing for CRM • Contract data extraction • Medical record entity extraction • Product specification parsing • Customer information extraction from notes. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_fix_grammar": {
        "function": "ai_fix_grammar(content)",
        "business_value": "Corrects grammatical errors in text to improve communication quality and professionalism.",
        "example_use_cases": "Customer feedback normalization • Report quality improvement • Email communication enhancement • Documentation cleanup • Survey response standardization. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_mask": {
        "function": "ai_mask(content, labels)",
        "business_value": "Masks sensitive information (PII, PHI, financial data) for compliance and secure data sharing. MUST be combined with ai_query for compliance documentation and risk assessment.",
        "example_use_cases": "PII data anonymization with compliance tracking • GDPR/CCPA compliance workflows • Secure data sharing with risk assessment • Medical record de-identification • Financial data protection with audit trails • Customer data masking for analytics. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_parse_document": {
        "function": "ai_parse_document(content, [options_map])",
        "business_value": "Extracts structured text, layout, tables, and figures from unstructured document files (PDF, images, Word, PowerPoint). MUST ONLY be used with binary files from Unity Catalog volumes via READ_FILES().",
        "example_use_cases": "Invoice processing from PDFs • Scanned contract digitization • Medical record extraction from images • Form data extraction • Receipt processing • Document archive digitization. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_similarity": {
        "function": "ai_similarity(string1, string2)",
        "business_value": "Computes semantic similarity score (0-1) between two text strings for deduplication, matching, and record linkage. MUST be combined with ai_query for merge strategies and data quality recommendations.",
        "example_use_cases": "Customer deduplication with merge strategies • Product matching across catalogs • Vendor record linkage • Duplicate detection with resolution plans • Entity resolution with data quality impact analysis. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_summarize": {
        "function": "ai_summarize(content[, max_words])",
        "business_value": "Creates concise summaries of long text to improve information accessibility and decision-making speed.",
        "example_use_cases": "Clinical notes summarization • Meeting transcript condensation • News article summarization • Research paper abstracts • Customer feedback summaries • Legal document summaries. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_translate": {
        "function": "ai_translate(content, to_lang)",
        "business_value": "Translates text to specified target languages for global communication and localization.",
        "example_use_cases": "Multi-lingual customer support • Product description localization • Document translation • Global marketing content • International compliance documentation. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_query": {
        "function": "ai_query(endpoint, request)",
        "business_value": "Invokes custom model serving endpoints or LLMs for flexible AI-powered analysis, generation, and recommendations. Use the configured SQL model serving endpoint for generated SQL.",
        "example_use_cases": "Custom business analysis with LLMs • Strategic recommendations generation • Complex reasoning tasks • Domain-specific model invocation • Multi-step AI workflows • Personalized content generation. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "ai_forecast": {
        "function": "ai_forecast(...)",
        "business_value": "Time series forecasting with prediction intervals for demand planning, capacity optimization, and trend prediction. MUST be combined with ai_query for strategic recommendations and action plans.",
        "example_use_cases": "Revenue forecasting with investment strategies • Demand planning with inventory recommendations • Capacity planning with resource allocation • Churn prediction with retention strategies • Sales forecasting with tactical actions • Traffic prediction with infrastructure planning. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    },
    "vector_search": {
        "function": "vector_search(...)",
        "business_value": "Semantic search using vector embeddings for intelligent information retrieval and recommendation systems.",
        "example_use_cases": "RAG (Retrieval Augmented Generation) applications • Semantic document search • Product recommendations • Similar content discovery • Knowledge base search. **Note: These are examples only - innovate with use cases specific to your business context and data.**"
    }
}
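
# --- Illustrative sketch (assumption: not part of the original registry code) ---
# The ai_classify and ai_extract entries above state a limit on the labels array
# (at most 20 items, each under 50 characters). A small pre-flight check for that
# stated limit might look like this. The helper name and its default parameters are
# hypothetical and only mirror the numbers quoted in the registry text.
def _check_ai_labels_sketch(labels, max_items=20, max_chars=50):
    """Return a list of problems; an empty list means the labels meet the stated limits."""
    problems = []
    if len(labels) > max_items:
        problems.append(f"{len(labels)} labels exceeds the limit of {max_items}")
    for label in labels:
        if len(label) >= max_chars:
            problems.append(f"label starting {label[:20]!r} has {len(label)} characters (limit is under {max_chars})")
    return problems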

# NOTE: Solution Accelerators (dbx-acc-*) have been removed.
# The system now focuses 100% on AI Functions and Statistical Functions.

# --- Statistical Functions Registry ---
STATISTICAL_FUNCTIONS = {
    # =========================================================
    #  1. CENTRAL TENDENCY (Baselines & Norms)
    # =========================================================
    "AVG(col)": {
        "function": "AVG(col)",
        "business_value": "Calculates the arithmetic mean to determine the central baseline of performance",
        "use_cases": "Average Order Value (AOV): Revenue optimization • Session Duration: Baseline engagement • Capacity Planning: Average resource usage",
        "category": "Central Tendency"
    },
    "MEDIAN(col)": {
        "function": "MEDIAN(col)",
        "business_value": "Finds the midpoint value, eliminating the impact of extreme outliers",
        "use_cases": "Real Estate: Median Home Price avoiding mansion skew • Income Analysis: Typical Household Income • Performance Benchmarking: Middle-of-the-pack performance",
        "category": "Central Tendency"
    },
    "MODE(col)": {
        "function": "MODE(col)",
        "business_value": "Returns the most frequent value (works on text and numbers)",
        "use_cases": "Inventory: Identify most sold SKU • Error Logging: Find most frequent error code • UX: Most common 'First Action' by new users",
        "category": "Central Tendency"
    },

    # =========================================================
    #  2. DISPERSION (Spread & Range)
    # =========================================================
    "STDDEV_POP(col)": {
        "function": "STDDEV_POP(col)",
        "business_value": "Measures the typical spread around the mean when the data covers the entire population",
        "use_cases": "Manufacturing: 6-Sigma quality control • Service Levels: Consistency of delivery times • Risk: Portfolio volatility",
        "category": "Dispersion"
    },
    "STDDEV_SAMP(col)": {
        "function": "STDDEV_SAMP(col)",
        "business_value": "Estimates volatility and risk from sample data",
        "use_cases": "Survey Analysis: Response consensus • A/B Testing: Noise estimation in test groups",
        "category": "Dispersion"
    },
    "VAR_POP(col)": {
        "function": "VAR_POP(col)",
        "business_value": "Quantifies total variance for stability assessment",
        "use_cases": "Grid Load: Energy usage variance • Inventory: Stock level fluctuation across warehouses",
        "category": "Dispersion"
    },
    "MIN(col)": {
        "function": "MIN(col)",
        "business_value": "Identifies the absolute floor of a dataset",
        "use_cases": "System Checks: Minimum load during off-hours • Pricing: Lowest competitor price • SLA: Fastest response time recorded",
        "category": "Dispersion"
    },
    "MAX(col)": {
        "function": "MAX(col)",
        "business_value": "Identifies the absolute ceiling of a dataset",
        "use_cases": "Capacity Planning: Peak concurrent users • Sales: Record high revenue • Risk: Maximum historical drawdown",
        "category": "Dispersion"
    },
    "RANGE": {
        "function": "MAX(col) - MIN(col)",
        "business_value": "Measures the full spread of data boundaries",
        "use_cases": "Price Spread: High vs Low daily price • Jitter: Gap between best and worst latency • Salary Bands: Compensation width",
        "category": "Dispersion"
    },

    # =========================================================
    #  3. DISTRIBUTION SHAPE (Asymmetry & Tails)
    # =========================================================
    "SKEWNESS(col)": {
        "function": "SKEWNESS(col)",
        "business_value": "Detects asymmetry to flag fraud or operational bias",
        "use_cases": "Fraud: Skewed claim amounts • Load Balancing: Uneven server requests • Pricing: Margin skew indicating systematic errors",
        "category": "Distribution Shape"
    },
    "KURTOSIS(col)": {
        "function": "KURTOSIS(col)",
        "business_value": "Identifies probability of extreme 'Black Swan' events (fat tails)",
        "use_cases": "Financial Risk: Crash probability detection • Manufacturing: Defect bursts • Security: DDoS attack traffic spikes",
        "category": "Distribution Shape"
    },

    # =========================================================
    #  4. PERCENTILES (Thresholds & SLAs)
    # =========================================================
    "PERCENTILE_APPROX(col, p)": {
        "function": "PERCENTILE_APPROX(col, 0.95)",
        "business_value": "Fast percentile calculation for SLAs (P95, P99)",
        "use_cases": "SLA Monitoring: P99 Latency compliance • Pricing: 90th percentile competitor price • Wealth: Top 1% segmentation",
        "category": "Percentiles"
    },
    "PERCENTILE(col, p)": {
        "function": "PERCENTILE(col, 0.5)",
        "business_value": "Exact percentile calculation (requires more compute)",
        "use_cases": "Compliance: Regulatory capital requirements • Grading: Exact exam score boundaries",
        "category": "Percentiles"
    },
    "APPROX_PERCENTILE(array)": {
        "function": "APPROX_PERCENTILE(col, array(0.25, 0.5, 0.75))",
        "business_value": "Calculates multiple percentiles in a single pass",
        "use_cases": "Box Plots: Generate Q1, Median, Q3 simultaneously • Tiering: Define Bronze/Silver/Gold thresholds",
        "category": "Percentiles"
    },

    # =========================================================
    #  5. TREND ANALYSIS (Regression)
    #  *Requires Spark 3.3+ (Standard in Databricks)*
    # =========================================================
    "REGR_SLOPE(y, x)": {
        "function": "REGR_SLOPE(y, x)",
        "business_value": "Calculates rate of change (trend direction)",
        "use_cases": "Growth: Revenue per day slope • Elasticity: Demand change per dollar price change",
        "category": "Trend Analysis"
    },
    "REGR_INTERCEPT(y, x)": {
        "function": "REGR_INTERCEPT(y, x)",
        "business_value": "Identifies baseline performance (y when x=0)",
        "use_cases": "Baselines: Organic sales without marketing spend • Fixed Costs: Energy cost at zero production",
        "category": "Trend Analysis"
    },
    "REGR_R2(y, x)": {
        "function": "REGR_R2(y, x)",
        "business_value": "Measures predictive power (0 to 1)",
        "use_cases": "Driver Analysis: How well Price explains Churn • Forecast Validity: Reliability of trend projection",
        "category": "Trend Analysis"
    },

    # =========================================================
    #  6. CORRELATION (Drivers)
    # =========================================================
    "CORR(col1, col2)": {
        "function": "CORR(col1, col2)",
        "business_value": "Discovers relationships (-1 to 1) between metrics",
        "use_cases": "Cannibalization: New product vs Old product sales • Marketing: Spend vs Acquisition correlation",
        "category": "Correlation"
    },
    "COVAR_POP(col1, col2)": {
        "function": "COVAR_POP(col1, col2)",
        "business_value": "Measures joint variability across population",
        "use_cases": "Systemic Risk: Sector A vs Sector B movement • Supply Chain: Fuel Price vs Shipping Cost linkage",
        "category": "Correlation"
    },

    # =========================================================
    #  7. VOLATILITY & ANOMALY DETECTION
    # =========================================================
    "COEFF_VAR": {
        "function": "STDDEV(col) / AVG(col)",
        "business_value": "Coefficient of Variation: Normalizes volatility for comparison",
        "use_cases": "Risk Comparison: Compare volatility of High-Price vs Low-Price stock",
        "category": "Volatility"
    },
    "Z_SCORE": {
        "function": "(col - AVG(col) OVER ()) / STDDEV(col) OVER ()",
        "business_value": "Calculates how many standard deviations a value is from mean (Window Func)",
        "use_cases": "Universal Anomaly Detection: Flag > 3 Sigma • Normalization: Standardize different scales for scoring",
        "category": "Outlier Detection"
    },
    "IQR_THRESHOLD": {
        "function": "PERCENTILE(col, 0.75) - PERCENTILE(col, 0.25)",
        "business_value": "Interquartile Range: Robust outlier detection ignoring extremes",
        "use_cases": "Pricing: Middle 50% market range • Cleaning: Identify valid operating ranges excluding spikes",
        "category": "Outlier Detection"
    },

    # =========================================================
    #  8. RANKING & SEGMENTATION
    # =========================================================
    "CUME_DIST()": {
        "function": "CUME_DIST() OVER (ORDER BY col)",
        "business_value": "Cumulative distribution for relative standing",
        "use_cases": "Loyalty: Top 10% customers by LTV • Inventory: Oldest 20% of stock",
        "category": "Ranking"
    },
    "NTILE(n)": {
        "function": "NTILE(n) OVER (ORDER BY col)",
        "business_value": "Divides data into equal buckets (Quintiles, Deciles)",
        "use_cases": "RFM Segmentation: 5 value groups • Risk Grading: Loan applicant quartiles",
        "category": "Ranking"
    },
    "DENSE_RANK()": {
        "function": "DENSE_RANK() OVER (ORDER BY col DESC)",
        "business_value": "Ranks without gaps (ties get same rank)",
        "use_cases": "Leaderboards: Sales rep rankings • Product Popularity: Top sellers handling ties",
        "category": "Ranking"
    },
    "ROW_NUMBER()": {
        "function": "ROW_NUMBER() OVER (PARTITION BY cat ORDER BY date DESC)",
        "business_value": "Assigns a unique sequential number to each row within its partition",
        "use_cases": "Deduplication: Keep first record only • Latest Status: Get most recent change per user",
        "category": "Ranking"
    },

    # =========================================================
    #  9. TIME SERIES (Window Functions)
    # =========================================================
    "LAG(col, n)": {
        "function": "LAG(col, 1) OVER (PARTITION BY entity ORDER BY time)",
        "business_value": "Access previous row value",
        "use_cases": "MoM Growth: Current vs Previous month • Churn: Days since last purchase",
        "category": "Time Series"
    },
    "LEAD(col, n)": {
        "function": "LEAD(col, 1) OVER (PARTITION BY entity ORDER BY time)",
        "business_value": "Access next row value",
        "use_cases": "Stockout Warning: Current inventory vs Next forecast • Sequencing: Next best action",
        "category": "Time Series"
    },
    "RUNNING_SUM": {
        "function": "SUM(col) OVER (PARTITION BY entity ORDER BY time)",
        "business_value": "Cumulative total over time",
        "use_cases": "Lifetime Value: Running spend total • Budget Burn: Cumulative spend vs Cap",
        "category": "Time Series"
    },
    "MOVING_AVG": {
        "function": "AVG(col) OVER (ORDER BY time ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)",
        "business_value": "Smooths data to show underlying trends",
        "use_cases": "7-Day Trend: Smoothing daily sales noise • Stock Analysis: Technical indicators",
        "category": "Time Series"
    },

    # =========================================================
    #  10. OLAP & HIERARCHY
    # =========================================================
    "ROLLUP(cols)": {
        "function": "GROUP BY ROLLUP(region, country, city)",
        "business_value": "Generates subtotals at every level",
        "use_cases": "P&L Reporting: Global -> Region -> Country totals in one query",
        "category": "OLAP"
    },
    "CUBE(cols)": {
        "function": "GROUP BY CUBE(product, region)",
        "business_value": "All possible combinations of dimensions",
        "use_cases": "Cross-Analysis: Conversion rates for every Segment/Region/Device combo",
        "category": "OLAP"
    },
    "PIVOT": {
        "function": "PIVOT (SUM(sales) FOR month IN ('Jan', 'Feb'))",
        "business_value": "Transposes rows to columns",
        "use_cases": "Reporting: Monthly sales grid • Competitor Matrix: Features vs Competitors",
        "category": "Reshaping"
    },

    # =========================================================
    #  11. ARRAY & COLLECTION ANALYTICS
    # =========================================================
    "SIZE(col)": {
        "function": "SIZE(array_col)",
        "business_value": "Counts elements in an array",
        "use_cases": "Basket Size: Items per order • Engagement: Features used per user",
        "category": "Collections"
    },
    "EXPLODE(col)": {
        "function": "EXPLODE(array_col)",
        "business_value": "Unnests array into separate rows",
        "use_cases": "Granularity: Analyze individual items inside a sales basket order",
        "category": "Collections"
    },
    "COLLECT_SET(col)": {
        "function": "COLLECT_SET(col)",
        "business_value": "Aggregates unique values into a list",
        "use_cases": "Journey Mapping: Unique pages visited • Cross-Sell: Categories purchased from",
        "category": "Collections"
    },
    "ARRAYS_OVERLAP(a1, a2)": {
        "function": "ARRAYS_OVERLAP(array1, array2)",
        "business_value": "Checks if two arrays share elements",
        "use_cases": "Targeting: Match User Interests with Product Tags • Fraud: Shared attributes",
        "category": "Collections"
    },

    # =========================================================
    #  12. GEOSPATIAL (Databricks H3)
    # =========================================================
    "H3_LONGLATASH3": {
        "function": "h3_longlatash3(lon, lat, 10)",
        "business_value": "Converts GPS to Hexagon Grid ID (Databricks Native)",
        "use_cases": "Density Mapping: Delivery zones • Surge Pricing: Ride share grids",
        "category": "Geospatial"
    },
    "H3_DISTANCE": {
        "function": "h3_distance(cell1, cell2)",
        "business_value": "Calculates grid steps between cells",
        "use_cases": "Proximity: Store catchment analysis • Logistics: Delivery estimation",
        "category": "Geospatial"
    }
}
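
# --- Illustrative sketch (assumption: not part of the original registry code) ---
# Every STATISTICAL_FUNCTIONS entry carries a "category" field. Grouping registry keys
# by that field is one way to inspect the catalogue or build a per-category listing.
# The helper name is hypothetical, and it takes the registry as an argument so it can
# be tried on any dict of the same shape.
def _group_by_category_sketch(registry: dict) -> dict:
    """Map each category to the list of registry keys that belong to it."""
    grouped = {}
    for key, spec in registry.items():
        grouped.setdefault(spec.get("category", "Uncategorized"), []).append(key)
    return grouped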

PROMPT_TEMPLATES["BASE_USE_CASE_GEN_PROMPT"] = """### 0. PERSONA ACTIVATION

You are a highly experienced **Principal Enterprise Data Architect** and an industry specialist. Your primary task is to generate high-quality use cases that deliver clear business value from the business's point of view. SQL queries will later be generated for these use cases.

### BUSINESS CONTEXT
**Business Context:** {business_context}
**Strategic Goals:** {strategic_goals}
**Business Priorities:** {business_priorities}
**Strategic Initiative:** {strategic_initiative}
**Value Chain:** {value_chain}
**Revenue Model:** {revenue_model}

---

### 🔥🔥🔥 HIGHEST PRIORITY: USER-PROVIDED ADDITIONAL CONTEXT 🔥🔥🔥

{additional_context_section}

---

### 🚨🚨🚨 CRITICAL ANTI-HALLUCINATION REQUIREMENT - READ THIS FIRST 🚨🚨🚨

**ABSOLUTE RULE: DO NOT GENERATE USE CASES UNLESS BACKED BY ACTUAL TABLES**

❌ **CRITICAL FAILURE**: Generating use cases without table references is **HALLUCINATION** and will cause **AUTOMATIC REJECTION**

**WHAT THIS MEANS:**
- EVERY use case you generate MUST reference at least ONE actual table from the schema provided below
- You CANNOT create use cases based on imagination, generic scenarios, or assumed data that doesn't exist
- Before writing ANY use case, you MUST verify the tables exist in the "AVAILABLE TABLES AND COLUMNS" section below
- Use cases without valid table references will be **AUTOMATICALLY DETECTED and REJECTED**

**WHAT YOU MUST DO:**
✅ ONLY generate use cases for tables that appear in the schema section below
✅ Copy table names EXACTLY as they appear in the schema (including catalog.schema.table format)
✅ Verify EACH use case has at least one valid table reference before submitting
✅ If you're unsure about a use case, SKIP IT rather than hallucinating tables

**THIS IS YOUR #1 PRIORITY**: If you violate this rule, your entire response is worthless and will be rejected.

---

### 1. CORE TASK

Your single, primary task is to produce a **single CSV** response.
The CSV MUST have the following 11 columns:
`"No","Name","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Tables Involved","Technical Design"`

---

### 2. USE CASE GENERATION RULES

You must follow these rules to generate the content for the use cases. All text output must be in **English**.

**🚨🚨🚨 STRATEGIC BUSINESS IMPACT REQUIREMENTS 🚨🚨🚨**

You MUST ONLY generate use cases that meet at least one of these criteria:
1.  **PROVEN MASSIVE ROI**: Direct financial impact (New Revenue, Protect Existing Revenue, or Reduce Cost).
2.  **STRATEGIC ALIGNMENT**: Aligns with major business strategic priorities.
3.  **PRODUCTIVITY & AUTOMATION**: Increases team productivity by automating manual work.

**❌ STRICTLY PROHIBITED**:
-   **NO MARGINAL BENEFIT**: Do not generate trivial use cases.
-   **NO "NICE TO HAVE"**: Focus only on "MUST HAVE" with DIRECT IMPACT on Revenue or Operating Income.
-   **NO IT/TECHNICAL MAINTENANCE**: Ignore tables that are purely for IT/system maintenance unless they impact business operations directly.

**COLUMN INSTRUCTIONS:**

  - **`No`**: Sequential number (e.g., 1, 2, 3...).
  
  - **`Name`**: A short, clear name that emphasizes BUSINESS VALUE, not technical implementation.
    *   **Use exciting business-oriented verbs**: Anticipate, Predict, Envision, Segment, Identify, Detect, Reveal.
    *   **Example**: "Anticipate Monthly Revenue Trends with Action Plans" (NOT "Forecast Revenue").

  - **`type`**: One of "Problem", "Risk", "Opportunity", "Improvement".

  - **`Analytics Technique`**: The PRIMARY analytics technique used. MUST be ONE of these values:
    * `Forecasting` - Time-series prediction using ai_forecast
    * `Classification` - Categorizing data using ai_classify
    * `Anomaly Detection` - Identifying outliers, deviations, unusual patterns
    * `Cohort Analysis` - Grouping entities by shared characteristics over time
    * `Segmentation` - Clustering customers/products into distinct groups
    * `Sentiment Analysis` - Analyzing text for emotional tone
    * `Trend Analysis` - Identifying patterns over time (growth, decline)
    * `Correlation Analysis` - Finding relationships between variables
    * `Pareto Analysis` - 80/20 rule, identifying top contributors
    * `Funnel Analysis` - Conversion tracking through stages
    * `Document Processing` - Extracting data from unstructured documents
    * `Extraction` - Extracting specific entities from text
    * `AI Analysis` - General AI-powered business analysis using ai_query

  - **`Statement`**: 1-2 sentences on the business challenge/opportunity. Focus on IMPACT (Revenue, Cost, Risk).

  - **`Solution`**: A high-level business solution in 1-2 sentences. **MUST explicitly highlight "Databricks Agent Bricks"**.

  - **`Business Value`**: **CRITICAL**. Articulate the tangible business impact (Revenue, Cost, Efficiency).
    *   **Focus on WHY this matters**.
    *   **IMPORTANT CONSTRAINT**: Refrain from mentioning any specific values (e.g. "10% more revenue", "reduce cost by 20%"). Deliver the business value statement WITHOUT committing to any number.
    *   **GOOD**: "Reduces fuel costs and extends aircraft lifespan..."
    *   **BAD**: "Optimizes performance..." (Too generic).

  - **`Beneficiary`**: The primary person/role (e.g., "Loan Officer").
  - **`Sponsor`**: The main executive (e.g., "CRO").

  - **`Tables Involved`**: Comma-separated **FULLY-QUALIFIED** table names (`catalog.schema.table`). MUST exist in schema.
    *   **CRITICAL**: Use the EXACT THREE-LEVEL FORMAT shown in the schema.

  - **`Technical Design`**: A high-level technical design guide (2-4 sentences) outlining the Logical Flow (CTEs).
    *   **🚨 CTE1 MUST START WITH DISTINCT**: First CTE MUST use SELECT DISTINCT or GROUP BY to deduplicate source data.
    *   **🌐 OPTIONAL: external_api_for_* CTE (ONLY WHEN BUSINESS-RELEVANT)**: Include external data enrichment ONLY when there is a DIRECT, PROVABLE, INDUSTRY-RECOGNIZED cause-and-effect relationship between the external factor and the business metric. If you cannot explain WHY the external factor impacts the metric in one sentence, DO NOT include it.
    *   Describe the approach as a sequence of logical steps.
    *   Mention specific statistical/AI functions if relevant.
    *   **🧠 ASK YOURSELF**: "Is there a DIRECT cause-and-effect relationship?" "Would a domain expert agree this connection is logical?" "Would a CFO approve this without questioning the logic?" Only include external_api CTE if ALL answers are YES.

**FOCUS AREAS:**
{focus_areas_instruction}

**LOGICAL REQUIREMENTS:**
  - **EXHAUSTIVE COVERAGE**: You MUST enumerate every distinct valid use case supported by the schema that meets the criteria above.
  - **NO OMISSIONS**: Do NOT stop early or cap the number of use cases.
  - **STRATEGIC GOALS**: If Strategic Goals are provided, include every use case that satisfies those goals.
  - **JOIN OPPORTUNITIES**: Prioritize use cases that join multiple tables for cross-functional insights.
  - **NO REDUNDANT EXTRACTION**: Do NOT use AI to extract/classify data that already exists in structured columns.
  - **AGGRESSIVE ANALYSIS**: Squeeze every valuable use case from the tables.
  - **BALANCE**: Ensure coverage of all tables if they offer business value.

**🚨🚨🚨 CRITICAL: MAXIMIZE BUSINESS VALUE - NOT QUANTITY 🚨🚨🚨**

**YOUR PRIMARY MISSION**: Extract EVERY use case that delivers **SIGNIFICANT BUSINESS VALUE** from the data.

**VALUE-DRIVEN GENERATION RULES:**
  - **REVENUE IMPACT FIRST**: Prioritize use cases that directly impact revenue (increase sales, reduce churn, optimize pricing, expand markets).
  - **COST REDUCTION**: Include use cases that reduce operational costs, eliminate waste, or improve efficiency.
  - **RISK MITIGATION**: Identify use cases that prevent losses (fraud detection, risk assessment, compliance).
  - **STRATEGIC ALIGNMENT**: Every use case must align with business priorities and strategic goals.

**EXHAUSTIVE EXPLORATION (NO ARTIFICIAL LIMITS):**
  - **EXPLORE ALL TECHNIQUES**: Use ANY AI function, statistical method, or analytical approach that delivers business value. Do not limit yourself.
  - **EXPLORE ALL ANGLES**: For each table, think about it from multiple business perspectives - operations, finance, sales, marketing, risk, strategy.
  - **EXPLORE ALL RELATIONSHIPS**: Look for valuable insights from joining tables together - cross-functional analysis often yields the highest ROI.
  - **EXPLORE ALL TIME HORIZONS**: Historical analysis, real-time monitoring, and future predictions all have value.

**QUALITY OVER QUANTITY:**
  - **NO FILLER**: Do NOT generate low-value use cases just to increase count. Every use case must deliver measurable business impact.
  - **NO HALLUCINATION**: Only generate use cases that are ACTUALLY SUPPORTED by the tables and columns provided.
  - **HIGH ROI FOCUS**: Ask yourself for each use case: "Would a CFO approve budget for this? Does it move the revenue needle?"
  - **SKIP IF NO VALUE**: If a table has no high-value use cases, it's better to skip it than generate low-quality filler.

**SELF-CHECK BEFORE FINALIZING:**
  - ✅ Does each use case have CLEAR, MEASURABLE business value?
  - ✅ Did I explore MULTIPLE valuable angles for tables with rich business data?
  - ✅ Did I consider CROSS-TABLE opportunities that could unlock hidden value?
  - ✅ Would a business executive actually want to implement these use cases?
  - ❌ Did I generate any use cases just to fill space? (If yes, REMOVE them)

**🚨 CRITICAL: FIRST CTE MUST RETURN UNIQUE/DISTINCT RECORDS 🚨**
  - **MANDATORY**: The FIRST CTE in every Technical Design MUST use `SELECT DISTINCT` or `GROUP BY` to ensure NO DUPLICATE RECORDS.
  - **WHY**: Duplicates in source data will cascade errors through all downstream CTEs (forecasts, classifications, aggregations).
  - **PATTERN**: `WITH base_data AS (SELECT DISTINCT col1, col2, ... FROM table WHERE ... LIMIT 10)` OR `GROUP BY` all non-aggregated columns.
  - **VALIDATION**: Before any AI function or aggregation, the data MUST be deduplicated in the first CTE.

**🚨 CRITICAL: LIMIT 10 SAMPLING RULES 🚨**
  - **FIRST CTE ONLY**: Use `LIMIT 10` at the END of the FIRST CTE that reads from tables
  - **NO LIMIT IN OTHER CTEs**: DO NOT use `LIMIT 10` in any other CTE - only in the first CTE
  - **LIMIT PLACEMENT**: LIMIT 10 MUST be the LAST clause in the SELECT statement (after WHERE, ORDER BY, etc.)
  - **PATTERN**: `FROM catalog.schema.table AS t WHERE ... LIMIT 10`
  - **EXAMPLE**:
    ```sql
    -- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
    WITH base_data AS (
      SELECT DISTINCT 
        customer_id,                                            -- CRITICAL: filtered with IS NOT NULL
        COALESCE(TRIM(customer_name), 'Unknown') AS customer_name  -- ✅ COALESCE'd
        -- ... (all columns must be COALESCE'd or filtered) ...
      FROM `catalog`.`schema`.`customers` AS c
      WHERE customer_id IS NOT NULL
      LIMIT 10  -- ✅ LIMIT 10 at the END of first CTE only
    ),
    enriched_data AS (
      SELECT * FROM base_data  -- ✅ NO LIMIT here
    ),
    final_analysis AS (
      SELECT * FROM enriched_data  -- ✅ NO LIMIT here
    )
    SELECT * FROM final_analysis;  -- ✅ NO LIMIT in final SELECT
    ```

**🚨🚨🚨 CRITICAL: BUSINESS REALISM & RELEVANCY REQUIREMENT 🚨🚨🚨**

**EVERY USE CASE MUST PASS THE BUSINESS RELEVANCY TEST**

Before generating ANY use case, you MUST verify it would be taken seriously by business stakeholders. Use cases with illogical correlations or far-fetched connections will be rejected by business teams and damage credibility.

**MANDATORY REALISM TEST (APPLY TO EVERY USE CASE):**
1. **LOGICAL CAUSATION**: Is there a DIRECT, PROVABLE cause-and-effect relationship between the variables? (Correlation is NOT causation)
2. **INDUSTRY RECOGNITION**: Is this type of analysis recognized and practiced in the industry?
3. **EXECUTIVE CREDIBILITY**: Would a senior executive approve budget for this analysis without questioning the logic?
4. **DOMAIN EXPERT VALIDATION**: Would a 20-year industry veteran consider this analysis sensible and valuable?
5. **BOARDROOM TEST**: Would you confidently present this use case in a boardroom without being challenged on its logic?

**❌ STRICTLY PROHIBITED - IRRELEVANT OR NONSENSICAL USE CASES:**
- Correlating variables that have NO logical business connection
- Using external factors that do NOT directly impact the metric being analyzed
- Inventing relationships just because two variables share temporal patterns
- Adding external data enrichment when there is no clear cause-and-effect
- Generating "creative" correlations that would be questioned by domain experts

**🧪 MANDATORY SELF-CHECK BEFORE EACH USE CASE:**
Ask yourself these questions - if ANY answer is "No" or "I'm not sure", DO NOT generate the use case:
1. "Can I explain in ONE CLEAR SENTENCE why this factor directly impacts this metric?"
2. "Would a domain expert in this industry agree this analysis makes business sense?"
3. "Is this correlation industry-recognized rather than a relationship I am inventing?"
4. "Would a skeptical CFO approve budget for this without questioning the logic?"
5. "Would I be comfortable presenting this use case to a senior business leader?"

**✅ GOOD USE CASES HAVE:**
- Clear, explainable cause-and-effect relationships
- Industry-recognized analytical approaches
- Direct relevance to the business metrics being analyzed
- Logical connections that domain experts would validate

---

**🌐🌐🌐 EXTERNAL PUBLIC DATA ENRICHMENT (ONLY WHEN BUSINESS-RELEVANT) 🌐🌐🌐**
  - **ONLY include external data enrichment** when there is a CLEAR, LOGICAL, INDUSTRY-RECOGNIZED business connection
  - External data is valuable ONLY when the external factor DIRECTLY and PROVABLY impacts the business metric
  - **🎯 YOUR MISSION: THINK LIKE A SKEPTICAL BUSINESS ANALYST!** If you cannot explain the connection in one sentence, DO NOT include it
  
  **🧠 BEFORE ADDING ANY EXTERNAL DATA, ASK:**
  1. "Is there a DIRECT cause-and-effect relationship that domain experts would recognize?"
  2. "Can I explain WHY this external factor impacts this specific metric in one clear sentence?"
  3. "Is this the type of enrichment that industry practitioners actually use?"
  4. "Would a senior analyst in this industry include this external data?"
  5. "If challenged by a business leader, can I defend this connection with logic?"
  
  **IF YOU CANNOT ANSWER "YES" TO ALL QUESTIONS ABOVE, DO NOT INCLUDE THE EXTERNAL DATA.**
  
  **🔥 RECOMMENDED TECHNICAL DESIGN PATTERN (When External Data is Business-Relevant) 🔥**
  - When you identify missing information that would improve the analysis, include a dedicated CTE named `external_api_for_<scenario>`.
  - Use a **PERSONA-BASED PROMPT** that establishes the AI as a domain expert with authority and credibility:
    * Weather: "You are a Principal Meteorologist at National Weather Service with 20 years expertise..."
    * Economic: "You are a Chief Economist at World Bank with 18 years expertise..."
    * Competitive: "You are a Senior Market Intelligence Analyst at McKinsey with 15 years expertise..."
    * Geographic: "You are a Senior Demographer at UN Population Division with 20 years expertise..."
    * Events: "You are a Risk Analyst at Lloyd's of London with 15 years expertise in disruption monitoring..."
  - **Include confidence scores** for each field: `<field>_confidence` (0.0-1.0), plus `as_of_date`, `source_note`, `is_estimate: true`, `requires_verification: true`
  - Add a SQL comment: "-- EXTERNAL DATA: For production, connect to a verified data source. LLM estimates are suitable for prototyping."
  - **USE the external data** in downstream ai_query prompts - this is WHERE THE VALUE IS REALIZED!

---

### 3. AI FUNCTION DOCUMENTATION & SCHEMA

#### CONTEXT: DATABRICKS AI FUNCTION DOCUMENTATION
{ai_functions_summary}

**AVAILABLE STATISTICAL FUNCTIONS:**
{statistical_functions_detailed}

#### INPUT DATA FORMAT
##### 1. Structured Data Schema
`| column | type | column_description |`
{schema_markdown}

##### 1b. Foreign Key Relationships
{foreign_key_relationships}

---

### 4. CSV ROW EXAMPLES (PATTERN EXAMPLES - ADAPT TO YOUR INDUSTRY)

**NOTE**: SQL will be generated separately.

**🚨 IMPORTANT**: The examples below are GENERIC PATTERNS showing the CSV format and AI function usage. You MUST adapt them to:
- The ACTUAL industry and business context provided above
- The ACTUAL tables and columns available in the schema
- The ACTUAL business terminology used by the organization

  - **`ai_forecast` Pattern Example (WITH EXTERNAL DATA ENRICHMENT):**
`"1","Forecast Monthly [METRIC] with Economic Context","Risk","Forecasting","The business risks inaccurate [METRIC] forecasts without understanding external factors.","Implement predictive time-series forecasting enriched with economic indicators using Databricks Agent Bricks.","Prevents costly errors by incorporating macro-economic factors that explain forecast variations.","[Role]","[Executive]","[catalog.schema.table]","CTE1: SELECT DISTINCT to deduplicate source data. CTE2: external_api_for_economic_context using ai_query with persona 'You are a Chief Economist at IMF...' to get GDP growth, inflation, exchange rates - ASK: What economic factors might explain the trends? CTE3: Parse economic JSON and join with base data. CTE4: Aggregate data by time period. CTE5: Apply ai_forecast. CTE6: Use ai_query for recommendations enriched with economic context."`

  - **`ai_classify` Pattern Example (WITH EXTERNAL DATA ENRICHMENT):**
`"2","Classify [ENTITY] by [CATEGORY] with Market Benchmarks","Improvement","Classification","Manual categorization lacks market context to compare against.","Use Databricks Agent Bricks to classify [ENTITY] with industry benchmarks.","Accelerates processing with context-aware classification and competitive positioning.","[Role]","[Executive]","[catalog.schema.table]","CTE1: SELECT DISTINCT to get unique entities. CTE2: external_api_for_market_benchmarks using ai_query with persona 'You are a Senior Market Analyst at McKinsey...' to get industry standards - ASK: What benchmarks would help understand if classification is good or bad? CTE3: Parse benchmark JSON. CTE4: Apply ai_classify with enriched benchmark context. CTE5: Use ai_query for actionable strategies informed by market position."`

  - **`ai_query` Pattern Example (WITH EXTERNAL DATA ENRICHMENT):**
`"3","Generate [RECOMMENDATIONS] for [ENTITY] with Competitive Intelligence","Improvement","AI Analysis","Team lacks context to make informed [RECOMMENDATIONS].","Use Databricks Agent Bricks to generate [RECOMMENDATIONS] enriched with competitive data.","Improves outcomes with market-aware recommendations that consider competitive landscape.","[Role]","[Executive]","[catalog.schema.table1], [catalog.schema.table2]","CTE1: SELECT DISTINCT on joined data to eliminate duplicates. CTE2: external_api_for_competitor_intelligence using ai_query with persona 'You are a Competitive Intelligence Director at Gartner...' - ASK: What competitor info would make recommendations more actionable? CTE3: Parse competitive JSON. CTE4: Prepare enriched context combining internal data with competitive intelligence. CTE5: Use ai_query for analysis with benchmark comparisons and competitive positioning."`

  - **`Statistical Analysis` Pattern Example (WITH EXTERNAL DATA ENRICHMENT):**
`"4","Analyze [METRIC1]-[METRIC2] Correlation with [RELEVANT_EXTERNAL_FACTOR]","Problem","Correlation Analysis","Business lacks understanding of external factors affecting [RELATIONSHIP].","Use Databricks Agent Bricks to compute correlations enriched with relevant external data to explain patterns.","Optimizes decision-making by identifying external drivers that explain unexpected variations.","[Role]","[Executive]","[catalog.schema.table1], [catalog.schema.table2]","CTE1: SELECT DISTINCT with GROUP BY to deduplicate joined data. CTE2: external_api_for_<relevant_context> using ai_query with appropriate domain expert persona - ONLY if external factor has DIRECT, PROVABLE impact on the metrics. CTE3: Parse external data JSON. CTE4: Calculate CORR/REGR_SLOPE between metrics and external variables. CTE5: Use ai_query for strategy recommendations."`

  **⚠️ EXTERNAL DATA RELEVANCY REMINDER**: External data enrichment should ONLY be included when there is a CLEAR, LOGICAL, INDUSTRY-RECOGNIZED cause-and-effect relationship between the external factor and the business metric. If you cannot explain WHY the external factor impacts the metric in one sentence, DO NOT include it.

**REMINDER**: Replace all [PLACEHOLDERS] with actual values from YOUR business context and schema.

---

### 5. FINAL, CRITICAL FORMATTING RULES
1.  **HEADER**: `"No","Name","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Tables Involved","Technical Design"`
2.  **QUOTING**: ALL values in EVERY field MUST be enclosed in **double quotes** (`"`).
3.  **LANGUAGE**: English.
4.  **FORMAT**: ONLY CSV. No markdown, no text before/after.

🚨🚨🚨 **CRITICAL - DO NOT CHANGE COLUMN NAMES** 🚨🚨🚨
- The column names MUST be EXACTLY as specified above
- Do NOT rename "Statement" to "Opportunity" or any other name
- Do NOT add extra columns like "honesty_score" or "honesty_justification" to the header
- The header row must be EXACTLY: `"No","Name","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Tables Involved","Technical Design"`

### 6. PREVIOUS RUN FEEDBACK (ENSEMBLE PASS)

{previous_use_cases_feedback}

### 7. FINAL INSTRUCTION
Begin generation now. Output ONLY the CSV with the additional honesty columns at the end.
""" + HONESTY_CHECK_CSV

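# --- Illustrative sketch (not part of the original pipeline) ---
# The base prompt states that use cases without valid table references are
# detected and rejected. Below is a minimal sketch of what such a check could
# look like, assuming the model reply is a fully quoted CSV whose header starts
# with the 11 columns named in the prompt (extra trailing columns are tolerated).
# SKETCH_USE_CASE_HEADER and sketch_find_unbacked_use_cases are hypothetical
# names introduced here for illustration only.
import csv
import io

SKETCH_USE_CASE_HEADER = [
    "No", "Name", "type", "Analytics Technique", "Statement", "Solution",
    "Business Value", "Beneficiary", "Sponsor", "Tables Involved", "Technical Design",
]


def sketch_find_unbacked_use_cases(csv_text, known_tables):
    """Return the "No" values of rows whose tables are missing or unknown."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    expected = SKETCH_USE_CASE_HEADER
    if not rows or rows[0][:len(expected)] != expected:
        raise ValueError("CSV header does not start with the expected 11 columns")

    tables_idx = expected.index("Tables Involved")
    # Compare case-insensitively and ignore backticks around identifiers.
    known = {t.strip().replace("`", "").lower() for t in known_tables}

    rejected = []
    for row in rows[1:]:
        if not row:
            continue  # csv.reader yields [] for blank lines
        if len(row) <= tables_idx:
            rejected.append(row[0])  # truncated row: no table column at all
            continue
        referenced = [
            t.strip().replace("`", "").lower()
            for t in row[tables_idx].split(",")
            if t.strip()
        ]
        if not referenced or any(t not in known for t in referenced):
            rejected.append(row[0])
    return rejected

# Possible usage (hypothetical variable names):
#   bad_rows = sketch_find_unbacked_use_cases(llm_reply, ["main.finance.orders"])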

PROMPT_TEMPLATES["AI_USE_CASE_GEN_PROMPT"] = """### 0. PERSONA ACTIVATION

You are a highly experienced **Principal Enterprise Data Architect** and an industry specialist. Your primary task is to generate high-quality **AI-FOCUSED** business use cases.

### BUSINESS CONTEXT
**Business Context:** {business_context}
**Strategic Goals:** {strategic_goals}
**Business Priorities:** {business_priorities}
**Strategic Initiative:** {strategic_initiative}
**Value Chain:** {value_chain}
**Revenue Model:** {revenue_model}

---

### 🚨🚨🚨 CRITICAL: AI FUNCTIONS AS PRIMARY FOCUS 🚨🚨🚨

**YOUR MISSION**: Generate use cases where **AI FUNCTIONS ARE THE STAR**.
- **PRIMARY**: ai_forecast, ai_classify, ai_extract, ai_query, ai_parse_document (if applicable), ai_similarity.
- **SUPPORTING**: Stats functions for context.
- **MANDATORY**: Pair ai_forecast/ai_classify with ai_query for recommendations.
- **NEW CAPABILITIES**: Integrate **What-If Analysis**, **Simulation**, and **Geospatial Analysis** where AI can enhance them (e.g., "Simulate AI-driven demand scenarios", "Geospatial sentiment mapping").

""" + "### 1. CORE TASK" + PROMPT_TEMPLATES["BASE_USE_CASE_GEN_PROMPT"].split("### 1. CORE TASK", 1)[1].replace(
    "### 2. USE CASE GENERATION RULES",
    """### 2. AI-FOCUSED USE CASE GENERATION RULES

**🔥 AI FUNCTION PRIORITY 🔥**:
1. **MANDATORY**: At least ONE AI function per use case.
2. **DISTRIBUTION**:
   - 40-50% Predictive (Forecast + Query)
   - 20-30% Classification (Classify + Query)
   - 15-20% Generative (Query)
   - 5-10% Advanced (Similarity, Simulation, Geospatial)

**🔥 ADVANCED AI USE CASES (INTEGRATE THESE) 🔥**:
- **AI-Driven What-If/Simulation**: Use AI to predict outcomes under different simulated scenarios.
- **Geospatial AI**: Combine location data with AI (e.g., "Predict demand by H3 hexagon").
- **Technical Design Hint**: In the "Technical Design" column, explicitly mention if a simulation or geospatial approach is used along with the AI function.
"""
)  # HONESTY_CHECK_CSV is not appended again: the base template tail kept by the split already ends with it
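
# --- Illustrative sketch (not part of the original pipeline) ---
# The specialised prompts are derived from the base prompt by cutting at the
# "### 1. CORE TASK" heading and swapping the rules heading. str.replace does
# nothing when its target is absent, so a renamed heading in the base prompt
# would go unnoticed. The helper below is one possible way to make that
# derivation fail loudly; sketch_derive_prompt and its parameters are
# hypothetical names, and the default headings are copied from the base prompt.
def sketch_derive_prompt(
    base,
    intro,
    new_rules_heading,
    marker="### 1. CORE TASK",
    rules_heading="### 2. USE CASE GENERATION RULES",
):
    """Build a specialised prompt from `base`, keeping the marker heading."""
    _head, sep, tail = base.partition(marker)
    if not sep:
        raise ValueError(f"marker {marker!r} not found in base prompt")
    if rules_heading not in tail:
        raise ValueError(f"heading {rules_heading!r} not found after marker")
    # Re-attach the marker (partition removes nothing, but we drop the old head)
    # and replace only the first occurrence of the rules heading.
    return intro + sep + tail.replace(rules_heading, new_rules_heading, 1)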


PROMPT_TEMPLATES["STATS_USE_CASE_GEN_PROMPT"] = """### 0. PERSONA ACTIVATION

You are a highly experienced **Principal Enterprise Data Architect** and **Fraud/Risk/Simulation Analytics Expert**. Your primary task is to generate **STATISTICS-FOCUSED** business use cases, with a **HEAVY EMPHASIS ON ANOMALY DETECTION, SIMULATION, AND ADVANCED ANALYTICS**.

### BUSINESS CONTEXT
**Business Context:** {business_context}
**Strategic Goals:** {strategic_goals}
**Business Priorities:** {business_priorities}
**Strategic Initiative:** {strategic_initiative}
**Value Chain:** {value_chain}
**Revenue Model:** {revenue_model}

---

### 🚨🚨🚨 CRITICAL: ANOMALY DETECTION, SIMULATION & ADVANCED STATS 🚨🚨🚨

**YOUR MISSION**: Generate use cases where **STATISTICAL FUNCTIONS UNCOVER HIDDEN RISKS, SIMULATE FUTURES, AND MAP PATTERNS**.
- **PRIMARY FOCUS**: **ANOMALY DETECTION** (Transactional data), **SIMULATION** (What-If, Monte Carlo), **GEOSPATIAL**, and **MARKET BASKET**.
- **AGGRESSIVENESS**: You must be **AGGRESSIVE** in finding anomalies and simulating risks.
- **CORE FUNCTIONS**: Use ALL applicable functions from the STATISTICAL_FUNCTIONS registry (see Section 3 below).
- **MANDATORY**: End with `ai_query` for detailed investigation (Root Cause, Action Plan).
- **PROHIBITED**: Do NOT use ai_forecast here (use Regression/Trend stats instead).

""" + "### 1. CORE TASK" + PROMPT_TEMPLATES["BASE_USE_CASE_GEN_PROMPT"].split("### 1. CORE TASK", 1)[1].replace(
    "### 2. USE CASE GENERATION RULES",
    """### 2. STATISTICS, SIMULATION & ADVANCED ANALYTICS USE CASE RULES

**🔥 EXPANDED ANALYTICS SCOPE (MIX THESE APPROACHES) 🔥**:

A. **SIMULATION & WHAT-IF ANALYSIS (HIGH PRIORITY)**:
   - **What-If Analysis**: Test outcomes with simulated inputs (e.g., "What if fuel cost rises 10%?").
   - **Scenario Modeling**: Compare multiple hypothetical situations simultaneously (Optimistic, Neutral, Pessimistic).
   - **Monte Carlo Simulation**: Model risk by generating realistic synthetic data distributions using AI.
   - **Sensitivity Analysis**: Measure how sensitive an output is to changes in specific inputs.

B. **GEOSPATIAL & MARKET ANALYSIS**:
   - **Geospatial Analysis**: Map data to physical locations using H3 or coordinates.
   - **Market Basket Analysis**: Find products frequently bought together (Association Rules).

C. **ANOMALY DETECTION (CORE)**:
   - **TARGET**: Transactional tables (Orders, Logs, Clicks, Payments).
   - **APPROACH**: Use functions from Distribution Shape, Outlier Detection, and Percentiles categories (see STATISTICAL_FUNCTIONS in Section 3).
   - **OUTPUT**: Root Cause, Explanation, Recommendation via ai_query.

D. **FREQUENCY & DISTRIBUTION**:
   - Use Ranking functions (NTILE, DENSE_RANK) and Central Tendency functions (MODE) from STATISTICAL_FUNCTIONS.

**🔥 USE ALL APPLICABLE STATISTICAL FUNCTIONS 🔥**:
Refer to the **AVAILABLE STATISTICAL FUNCTIONS** section below. You MUST use functions from ALL relevant categories:
- Central Tendency, Dispersion, Distribution Shape, Percentiles
- Trend Analysis, Correlation, Volatility, Outlier Detection
- Ranking, Time Series

**GUIDANCE**:
- **Business Terms**: Make the use case Name reflect the approach (e.g., "Simulate Impact of Pricing Changes", "Geospatial Hotspot Analysis").
- **Technical Design Column**: Reference specific statistical functions from the registry (e.g., "Use SKEWNESS, KURTOSIS for anomaly detection, then ai_query for root cause analysis").
"""
)  # HONESTY_CHECK_CSV is not appended again: the base template tail kept by the split already ends with it


PROMPT_TEMPLATES["UNSTRUCTURED_DATA_USE_CASE_GEN_PROMPT"] = """### 0. PERSONA ACTIVATION

You are a highly experienced **Principal Enterprise Data Architect**. Your task is to generate business use cases for **UNSTRUCTURED DATA** (Documents).

### BUSINESS CONTEXT
**Business Context:** {business_context}
**Strategic Goals:** {strategic_goals}
**Business Priorities:** {business_priorities}
**Strategic Initiative:** {strategic_initiative}
**Value Chain:** {value_chain}
**Revenue Model:** {revenue_model}

---

### 🚨🚨🚨 CRITICAL: UNSTRUCTURED DATA FOCUS 🚨🚨🚨

**YOUR MISSION**: Generate use cases that leverage `ai_parse_document` and `ai_extract` on the provided document files.

**REQUIREMENTS**:
1. **Source**: MUST use the Unity Catalog volume paths provided.
2. **Function**: MUST use `ai_parse_document`.
3. **Extraction**: Use `ai_extract` to get structured entities.
4. **Action**: Use `ai_query` or `ai_classify` on the extracted data.

""" + "### 1. CORE TASK" + PROMPT_TEMPLATES["BASE_USE_CASE_GEN_PROMPT"].split("### 1. CORE TASK", 1)[1].replace(
    "### 2. USE CASE GENERATION RULES",
    """### 2. UNSTRUCTURED DATA USE CASE RULES

**🔥 DOCUMENT PROCESSING PRIORITY 🔥**:
- **MANDATORY**: Use `ai_parse_document` for every use case here.
- **Volume Paths**: Use strict Volume paths (not tables) for input.
- **Entities**: Extract the specific entities listed in the document metadata.
"""
).replace(
    "##### 1. Structured Data Schema",
    """##### 1. Unstructured Data Documents
{unstructured_documents_markdown}

##### 2. Structured Data Schema (Reference Only)"""
)  # HONESTY_CHECK_CSV is not appended again: the base template tail kept by the split already ends with it
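
# --- Illustrative sketch (not part of the original pipeline) ---
# The unstructured-data prompt requires Unity Catalog volume paths rather than
# table names as input. A minimal sketch of a path check follows, assuming the
# usual /Volumes/<catalog>/<schema>/<volume>/... layout. The function name
# sketch_is_volume_path is hypothetical and used only for illustration.
def sketch_is_volume_path(path):
    """Return True if `path` looks like /Volumes/<catalog>/<schema>/<volume>[/...]."""
    if not isinstance(path, str) or not path.startswith("/Volumes/"):
        return False
    # Drop the "/Volumes/" prefix and ignore empty segments from trailing slashes.
    segments = [s for s in path[len("/Volumes/"):].split("/") if s]
    return len(segments) >= 3

# Possible usage: keep only rows whose "File Path" passes sketch_is_volume_path.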


PROMPT_TEMPLATES["UNSTRUCTURED_DATA_DOCUMENTS_PROMPT"] = """You are a senior business analyst and data architect. Your task is to generate a comprehensive list of the TOP 20 most common unstructured documents that businesses in the specified industries would possess and store in a data lake or volume.

**INDUSTRIES PROVIDED**:
{industries_list}

**CRITICAL REQUIREMENT**: You MUST generate EXACTLY 20 different document types across ALL industries provided. Ensure diversity across document formats.

For each document, provide:
1. **Document Name** - A clear, descriptive name for the document type
2. **Description** - What the document contains and its business purpose
3. **Type** - The file format, using one of the allowed values listed in the rules below
4. **Extracted Entities** - A comma-separated list of 4-8 key data fields that could be extracted from this document using ai_extract
5. **File Path** - A realistic Databricks Volumes path (format: `/Volumes/catalog/schema/volume_name/document_type/`)

**Document Categories to Cover** (distribute the 20 documents across these categories):
- **Financial Documents**: Invoices, receipts, statements, purchase orders (PDF, JPG)
- **Customer Documents**: Feedback forms, surveys, support tickets, reviews (PDF, DOCX, JPG)
- **Operational Documents**: Work orders, inspection reports, maintenance logs (PDF, PPTX, JPG)
- **Legal/Compliance**: Regulatory filings, certificates, permits (PDF, DOCX)
- **HR Documents**: Resumes, contracts, performance reviews (PDF, DOCX)
- **Marketing/Media**: Campaign materials, testimonials, brochures (PPTX, PDF, JPG)
- **Multimedia**: Training videos, customer calls, product demos (MP4 for videos, MP3/WAV for audio - represented as file paths)
- **Spreadsheets**: Financial models, inventory sheets, reports (XLS, XLSX)

**Important**:
- Generate EXACTLY 20 documents total
- Include diversity: ~8-10 PDFs, ~3-4 images (JPG/PNG), ~2-3 presentations (PPTX), ~2-3 documents (DOCX), ~1-2 spreadsheets (XLS/XLSX), ~1-2 videos, ~1-2 audio files
- Video files should use format "Video (MP4)" and audio files "Audio (MP3)" or "Audio (WAV)"
- Make File Paths realistic with proper catalog/schema/volume structure
- Extracted Entities should be specific and useful for each document type
- Cover ALL industries provided in the industries list

**Output Format** (Markdown table):
| Document Name | Description | Type | Extracted Entities | File Path |
|---|---|---|---|---|
| Vendor Invoices | PDF/image invoices from vendors containing itemized charges | PDF | vendor name, invoice number, date, total amount, items purchased, payment terms, tax amount, due date | /Volumes/finance/accounting/invoices/ |
| Customer Training Videos | MP4 video recordings of customer onboarding and product training sessions | Video (MP4) | training_topic, duration, instructor, participant_count | /Volumes/training/videos/customer_onboarding/ |
| Support Call Recordings | Audio recordings of customer support interactions for quality assurance | Audio (MP3) | call_id, customer_id, agent_id, call_duration, issue_category, resolution_status | /Volumes/support/audio/call_recordings/ |
| Financial Planning Spreadsheets | Excel workbooks containing budget forecasts and financial models | XLS | budget_category, fiscal_year, projected_revenue, actual_spend, variance | /Volumes/finance/planning/budget_models/ |

Your output **MUST** be a single markdown table with EXACTLY 20 document types.
Ensure you include documents in these formats: PDF, JPG, PNG, DOCX, PPTX, XLS/XLSX, Video (MP4), Audio (MP3/WAV).
Do not include *any* other text, preamble, or explanation.
### Rules
1.  Use the provided industries list to determine appropriate document types.
2.  Generate EXACTLY 20 realistic unstructured documents, distributed across the industries.
3.  For each document, provide a name, a **detailed description** of its content, a file type, a list of **key entities to extract**, and a directory path.
4.  The file type **MUST** be one of: `PDF`, `JPG`, `PNG`, `DOCX`, `PPTX`, `XLS`, `XLSX`, `Video (MP4)`, `Audio (MP3)`, `Audio (WAV)`.
5.  The `"File Path"` column **MUST** be a plausible Databricks Volume directory path where these files would be stored (e.g., `/Volumes/finance/invoices/unprocessed/`). It must end with a trailing slash.
6.  All table headers **MUST** be enclosed in double quotes.
---
### Example (for a "sales" schema)
| "Document Name" | "Description" | "Type" | "Extracted Entities" | "File Path" |
|---|---|---|---|---|
| "Vendor Invoices" | "Scanned PDF copies of vendor invoices. Contains line items, PO number, vendor name, invoice date, and total amount due." | "PDF" | "invoice_number, vendor_name, total_amount, due_date, line_items" | "/Volumes/finance/invoices/unprocessed/" |
| "Product Spec Sheets" | "Multi-page technical datasheets for products, including specifications, performance metrics, and compliance information." | "PDF" | "product_sku, technical_specs, compliance_standards" | "/Volumes/products/specifications/" |
| "Marketing Brochures" | "Quarterly slide decks for product promotions, outlining key features, target audience, and pricing tiers." | "PPTX" | "product_name, key_features, pricing" | "/Volumes/marketing/assets/brochures/" |
| "Signed Contracts" | "Scanned copies of signed master service agreements (MSAs) with customers, detailing terms, conditions, and service levels." | "PDF" | "customer_name, effective_date, contract_term, sla_details" | "/Volumes/legal/contracts/signed/" |
| "Support Call Transcripts" | "Word documents containing full-text transcripts from customer support calls, auto-generated from an audio-to-text service." | "DOCX" | "customer_id, support_agent, issue_type, resolution_steps, sentiment" | "/Volumes/support/transcripts/audio_to_text/" |
| "Damaged Product Photos" | "Customer-submitted JPEG images showing defective or damaged products for warranty claims and RMA processing." | "JPG" | "damage_type, product_area, serial_number (if visible)" | "/Volumes/claims/images/damaged_products/" |
---
Begin generation now. Your response must start directly with the markdown table header.
| "Document Name" | "Description" | "Type" | "Extracted Entities" | "File Path" |
|---|---|---|---|---|

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here are...", "I've generated...", "Based on...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: | "Document Name" | ... | "honesty_score" | "honesty_justification" |
- Include honesty_score and honesty_justification as the last 2 columns in header and all rows
""" + HONESTY_CHECK_TABLE

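# --- Illustrative sketch (not part of the original pipeline) ---
# Rules 4 and 5 of the prompt above restrict the file type to a fixed list and
# require a directory path with a trailing slash. A post-check on one parsed
# row could look roughly like this. The helper name, the dict-per-row layout,
# and the "/Volumes/" prefix test are assumptions made for illustration.
_ALLOWED_DOC_TYPES_SKETCH = {
    "PDF", "JPG", "PNG", "DOCX", "PPTX", "XLS", "XLSX",
    "Video (MP4)", "Audio (MP3)", "Audio (WAV)",
}


def check_unstructured_doc_row_sketch(row):
    """Return a list of rule violations for one parsed table row (a dict)."""
    problems = []
    if row.get("Type") not in _ALLOWED_DOC_TYPES_SKETCH:
        problems.append(f"unsupported file type: {row.get('Type')!r}")
    path = row.get("File Path", "")
    # Assumption: a plausible Volume path starts with /Volumes/ and ends with "/".
    if not (path.startswith("/Volumes/") and path.endswith("/")):
        problems.append(f"not a Volume directory path with trailing slash: {path!r}")
    return problems
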
# --- 1c. PDF Summary Prompt (MODIFIED FOR REQUEST #1) ---
PROMPT_TEMPLATES["SUMMARY_GEN_PROMPT"] = """
You are a highly experienced Databricks Principal Strategist and business analyst.
Your task is to generate a **CSV** response containing a strategic summary for a customer.
The customer is: {business_name}
We have identified {total_cases} use cases across these domains: {domain_list}
Your output MUST be in **{output_language}**.

### TASK
Generate a **single CSV** with THREE columns: "Type", "Summary", "TransliteratedBusinessName"
Your response MUST start with the header: `"Type","Summary","TransliteratedBusinessName"`
The CSV must contain:
1. One row for the "Executive" summary.
2. One row for **EACH** business domain listed in {domain_list}.

### COLUMN DEFINITIONS
1. **"Type"**:
    * Use the exact string "Executive" for the first row.
    * For all other rows, use the exact Business Domain name from the list.
2. **"Summary"**:
    * **For the "Executive" row**: Write a 2-3 paragraph, professional executive summary. Start with a powerful opening. Emphasize the value of AI and Databricks Agent Bricks in unlocking the potential of their data.
    * **For each "Business Domain" row**: Write a 2-3 paragraph professional and engaging summary in prose. This summary should narrate the domain's strategic importance, its core responsibilities, and the key opportunities AI (specifically Databricks Agent Bricks) can unlock. The tone should be identical to the Executive Summary.
    * **FORMATTING**: ALL summaries (both Executive and Domain) **MUST** be enclosed in `<p>` HTML tags (e.g., `<p>Paragraph 1.</p><p>Paragraph 2.</p>`).
3. **"TransliteratedBusinessName"**:
    * **For the "Executive" row**: Provide the transliteration of {business_name} into {output_language}. If {output_language} is English, just repeat {business_name}.
    * **For all "Business Domain" rows**: This field MUST be an empty string `""`.

### CRITICAL CSV FORMATTING RULES
* **CSV Format**: The entire output MUST be a valid CSV. All values MUST be enclosed in **double quotes** (`"`).
* **NO CODE FENCES**: Do NOT include markdown code fences like ```csv or ``` at the beginning or end.
* **ONLY CSV**: Your response must contain ONLY the CSV data, starting with the header line.
* **Proper Escaping**: If a field value contains double quotes, escape them by doubling them ("").
* **No Extra Text**: Do NOT include any explanatory text, comments, or anything other than the CSV data.


### EXAMPLE CSV OUTPUT (when the output language is Arabic and the business name is Emirates Airline)
"Type","Summary","TransliteratedBusinessName"
"Executive","<p>تقف طيران الإمارات في طليعة التميز في مجال الطيران...</p><p>يحدد هذا الكتالوج مسارًا واضحًا للاستفادة من البيانات والذكاء الصناعي الخاص بداتا بريكس موزايك...</p>","طيران الإمارات"
"Customer Management","<p>يعد مجال إدارة العملاء أمرًا محوريًا لنجاح طيران الإمارات...</p><p>من خلال الاستفادة من الذكاء الصناعي الخاص بداتا بريكس موزايك، يمكن لطيران الإمارات تحويل هذا المجال...</p>",""
"Finance & Billing","<p>إدارة الصحة المالية للمؤسسة، يعد مجال المالية والفوترة أمرًا بالغ الأهمية...</p><p>يمثل الذكاء الصناعي الخاص بداتا بريكس موزايك فرصة كبيرة لأتمتة هذه العمليات...</p>",""

### EXAMPLE CSV OUTPUT (when the output language is English and the business name is Global Enterprises)
"Type","Summary","TransliteratedBusinessName"
"Executive","<p>Global Enterprises stands at the forefront of its industry, and with {total_cases} identified use cases, the organization is poised to revolutionize its operations through the strategic implementation of Databricks Agent Bricks.</p><p>This catalog outlines a clear path to leveraging data and AI to drive innovation, enhance efficiency, and create significant business value across all key domains.</p>","Global Enterprises"
"Customer Management","<p>The Customer Management domain is central to Global Enterprises's success, as it governs all direct interactions and the entire customer lifecycle. Its primary responsibility is to ensure high rates of acquisition, satisfaction, and retention.</p><p>By leveraging Databricks Agent Bricks, Global Enterprises can transform this domain, moving from reactive support to proactive engagement. Opportunities include developing sophisticated churn prediction models and deploying generative AI agents to provide instant, personalized customer service, dramatically improving loyalty and reducing operational costs.</p>",""
"Finance & Billing","<p>Managing the financial health of the organization, the Finance & Billing domain is critical for ensuring revenue integrity, compliance, and accurate forecasting. This domain oversees everything from invoicing and payments to financial reporting and risk analysis.</p><p>Databricks Agent Bricks presents a significant opportunity to automate and intelligentize these processes. For instance, AI can be used to parse unstructured invoices, detect payment anomalies in real-time, and generate highly accurate revenue forecasts, thereby strengthening the organization's financial posture and decision-making capabilities.</p>",""

### FINAL INSTRUCTION
Begin generation now. Produce ONLY the CSV text, starting with the header `"Type","Summary","TransliteratedBusinessName"`.

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here is...", "I've generated...", "The...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: "Type","Summary","TransliteratedBusinessName","honesty_score","honesty_justification"
- Include honesty columns in header and all rows
""" + HONESTY_CHECK_CSV

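# --- Illustrative sketch (not part of the original pipeline) ---
# The summary prompt above asks for a fully quoted CSV with doubled inner
# quotes. Python's csv module reads that dialect by default, so a response
# could be split into the Executive row and the per-domain summaries roughly
# as below. The function name and return shape are assumptions.
import csv
import io


def parse_summary_csv_sketch(text):
    """Return (executive_row, {domain_name: summary_html}) from the CSV text."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    executive = None
    domains = {}
    for row in reader:
        if row["Type"] == "Executive":
            executive = row
        else:
            # Every other row is keyed by its exact business domain name.
            domains[row["Type"]] = row["Summary"]
    return executive, domains
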
# --- 1d. Domain Consolidation Prompt (ENHANCED FOR 6-80 USE CASES REQUIREMENT) ---
PROMPT_TEMPLATES["DOMAIN_FINDER_PROMPT"] = """
You are an expert business analyst specializing in BALANCED domain taxonomy design with deep industry knowledge.

**🎯 YOUR TASK**: Analyze the provided use cases and assign each one to appropriate Business Domains (NO subdomains yet).

**🚨🚨🚨 CRITICAL REQUIREMENTS - YOUR RESPONSE WILL BE REJECTED IF NOT FOLLOWED 🚨🚨🚨**:

**🚨 ANTI-CONSOLIDATION RULE - DO NOT PUT EVERYTHING IN ONE DOMAIN 🚨**:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
► **CRITICAL**: You MUST create MULTIPLE domains - DO NOT consolidate everything into 1-5 domains
► **CALCULATION**: Target domains = total_use_cases ÷ 10 (e.g., 70 use cases = ~7 domains, 200 use cases = ~20 domains)
► **🚨 ABSOLUTE HARD LIMIT 🚨**: MAXIMUM 25 domains - NEVER EXCEED THIS LIMIT (will cause REJECTION)
► **DOMAIN COUNT**: You MUST create between 3-25 domains (MINIMUM 3, MAXIMUM 25 - HARD LIMIT)
► **GUIDELINE PER DOMAIN**: Aim for 6-80 use cases per domain (this is a guideline, not a strict requirement for small datasets)
► **FLEXIBILITY**: If total use cases is small (e.g., 20-30), it's acceptable to have domains with fewer than 6 use cases
► **REJECTION CRITERIA**: 
   - **HARD REJECTION**: If you create MORE than 25 domains → REJECTED (this is ABSOLUTE)
   - **HARD REJECTION**: If you put all use cases into 1-2 domains → REJECTED
   - **SOFT WARNING**: Domains with <6 use cases are acceptable if total use cases is low
► **BALANCE REQUIREMENT**: Distribute use cases EVENLY across multiple domains
► **DIVERSITY REQUIREMENT**: Create DIVERSE domain names that reflect DIFFERENT business areas
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**REQUIREMENT #1 - EXACTLY ONE SIMPLE WORD FOR ALL DOMAIN NAMES (MANDATORY)**:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
► **ABSOLUTE RULE**: ALL domain names MUST be EXACTLY ONE SIMPLE WORD - NO EXCEPTIONS
► **REJECTION CRITERIA**: If ANY domain name violates this rule, your ENTIRE response is REJECTED

**🚨🚨🚨 ANTI-TRICK DETECTION - DO NOT TRY TO BYPASS THIS RULE 🚨🚨🚨**:
► **NO CAMELCASE CONCATENATION**: Do NOT concatenate words using CamelCase to bypass the one-word rule
   - ❌ WRONG: "ProviderOperations" (this is "Provider" + "Operations" concatenated - REJECTED)
   - ❌ WRONG: "CareTransitions" (this is "Care" + "Transitions" concatenated - REJECTED)
   - ❌ WRONG: "NetworkManagement" (this is "Network" + "Management" concatenated - REJECTED)
   - ❌ WRONG: "CustomerService" (this is "Customer" + "Service" concatenated - REJECTED)
   - ❌ WRONG: "FlightOperations" (this is "Flight" + "Operations" concatenated - REJECTED)
   - ✅ CORRECT: Use the FIRST/PRIMARY word only: "Provider", "Care", "Network", "Customer", "Flight"

► **DETECTION METHOD**: If a domain name contains CAPITAL LETTERS in the middle of the word, it is a CamelCase trick → REJECTED
► **SIMPLE WORD DEFINITION**: A simple word has ONLY ONE capital letter (at the start) and NO capitals in the middle
► **WORD COUNT METHOD**: Count spaces in domain names - ZERO spaces allowed
► **ZERO CAPITAL LETTERS IN MIDDLE**: Domain names must have capital ONLY at the first character

► ✅ CORRECT: "Network", "Passengers", "Revenue", "Risk", "Maintenance", "Crew", "Finance", "Provider", "Care"
► ❌ WRONG (WILL CAUSE REJECTION): 
   - "Network Management" (has space - REJECTED)
   - "NetworkManagement" (CamelCase concatenation - REJECTED)
   - "ProviderOperations" (CamelCase concatenation - REJECTED)
   - "CareTransitions" (CamelCase concatenation - REJECTED)
   - "CustomerService" (CamelCase concatenation - REJECTED)

► **RULE**: Remove all adjectives, verbs, descriptors, AND compound words - keep ONLY the core business noun
► **SPLIT STRATEGY**: If you think of "ProviderOperations", split it and use ONLY "Provider"
► **SPLIT STRATEGY**: If you think of "CareTransitions", split it and use ONLY "Care"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**OTHER DOMAIN REQUIREMENTS**:
2. **🚨 DOMAIN COUNT (HARD LIMIT) 🚨**: Domains MUST be between 3-25 (MINIMUM 3, **MAXIMUM 25 - ABSOLUTE HARD LIMIT**)
3. **USE CASES PER DOMAIN (GUIDELINE)**: Aim for 6-80 use cases per domain. For small datasets (<50 total), domains with <6 use cases are acceptable.
4. **TARGET DOMAINS**: Create MULTIPLE domains (calculate: total_use_cases ÷ 10 = target domains, but **NEVER EXCEED 25**)
5. **ABSOLUTE RULE**: NO TWO DOMAINS CAN SHARE THE SAME CORE BUSINESS NAME
6. **INDUSTRY ALIGNMENT REQUIRED**: Domain names MUST be aligned with the specific business and industry context provided below

**BUSINESS CONTEXT**:
Business Name: {business_name}
Industries: {industries}
Business Context: {business_context}

**INPUT DATA**:
You will receive a CSV of use cases with these columns:
"No","Name","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Tables Involved"

**YOUR DOMAIN DETECTION TASK**:
1. Analyze the use case names, statements, solutions, analytics techniques, and tables involved
2. Identify natural business groupings (domains) based on:
   - Business functionality and purpose
   - Tables and data domains involved
   - Business beneficiaries and stakeholders
   - Business value and outcomes
3. Create appropriate domain names (EXACTLY 1 WORD, industry-specific)
4. Ensure ALL validation rules are met (domain count, use cases per domain, word counts, etc.)

**CLUSTERING STRATEGY**:
- **GROUP BY BUSINESS FUNCTION**: Use cases that serve the same business function should be in the same domain
- **USE TABLE ANALYSIS**: Use cases using similar tables should generally be in the same domain
- **CONSIDER BENEFICIARIES**: Use cases serving the same business roles/departments often belong together
- **BALANCE**: Aim for 6-20 use cases per domain (maximum 80 use cases per domain)
- **DIVERSITY**: Create MULTIPLE distinct domains - DO NOT put everything into 1-5 mega-domains

**DOMAIN RULES** (NOT AGGRESSIVE - MAINTAIN DIVERSITY):

0. **🚨 BALANCE FIRST - DO NOT OVER-MERGE 🚨**:
   - **GOAL**: Create MULTIPLE DISTINCT domains, aim for 6-80 use cases per domain
   - **🚨 ABSOLUTE HARD LIMIT 🚨**: MAXIMUM 25 domains - NEVER EXCEED THIS (will cause REJECTION)
   - **DOMAIN COUNT**: You MUST create between 3-25 domains (MINIMUM 3, **MAXIMUM 25 - HARD LIMIT**)
   - **FLEXIBILITY**: For small datasets (<50 total use cases), domains with <6 use cases are acceptable
   - **STRATEGY**: Only merge domains when they truly overlap - maintain business diversity
   - **ANTI-PATTERN**: DO NOT merge everything into 1-5 mega-domains
   - **CALCULATE**: Total use cases ÷ 10 = target number of domains (but **NEVER EXCEED 25**)
   - **EXAMPLES**: 
     * 24 use cases ÷ 10 = ~2-3 domains, but MINIMUM is 3, so create 3-8 domains (small dataset, <6 per domain OK)
     * 70 use cases ÷ 10 = ~7 domains (create 7-10 domains with 7-10 use cases each)
     * 200 use cases ÷ 10 = ~20 domains (create 20-25 domains with 8-10 use cases each)
     * 1700 use cases ÷ 10 = ~170 domains, but **MAX is 25 domains**, so create 25 domains with 68-80 use cases each
   - **MAXIMUM DOMAINS**: Cap at 25 domains maximum. If calculation exceeds 25, distribute use cases evenly across 25 domains

1. **🚨 CRITICAL: NO OVERLAPPING WORDS IN DOMAIN NAMES - MANDATORY MERGING 🚨**:
   - **ABSOLUTE RULE**: NO two domains can share the same word
   - **LEAN TO SHORTEST NAME**: When merging domains with overlapping words, ALWAYS use the name with the LOWEST number of words
   - **DETECTION**: Check EVERY domain name against all other domains. If ANY word appears in multiple domains, MERGE them immediately
   - **EXAMPLES OF VIOLATIONS (MUST FIX)**:
     * "Network Management" + "Network Planning" → Both share "Network" → MERGE into: "Network" (1 word - shortest)
     * "Customer Service" + "Customer Management" + "Customer Engagement" → All share "Customer" → MERGE into: "Customer" or "Passengers" (industry-specific, 1 word)
     * "Sales Operations" + "Sales Analytics" → Both share "Sales" → MERGE into: "Sales" (1 word - shortest)
     * "Risk Management" + "Risk Analysis" → Both share "Risk" → MERGE into: "Risk" (1 word - shortest)
   - **YOUR RESPONSE WILL BE REJECTED if any domains have overlapping words**

2. **INDUSTRY-SPECIFIC NAMING (HIGHEST PRIORITY)**: Use domain names that reflect the ACTUAL business and industry from the data
   - **INFER INDUSTRY FROM DATA**: Analyze the use case names, table names, and business context to determine the actual industry
   - **USE DOMAIN-SPECIFIC TERMS**: Choose domain names that reflect the specific business operations in the data
   - **EXAMPLES BY INDUSTRY TYPE** (use these as patterns, NOT defaults - always infer from actual data):
     * Transportation: "Fleet", "Routes", "Schedules", "Cargo", "Safety", "Maintenance"
     * Finance: "Risk", "Credit", "Trading", "Compliance", "Wealth", "Fraud"
     * Healthcare: "Patients", "Clinical", "Pharmacy", "Billing", "Records"
     * Retail: "Inventory", "Sales", "Customers", "Pricing", "Suppliers"
     * Manufacturing: "Production", "Quality", "Supply", "Maintenance", "Shipping"
     * Technology: "Platform", "Users", "Security", "Performance", "Analytics"
     * Telecom: "Network", "Subscribers", "Billing", "Coverage", "Support"
   - **CREATE DIVERSE DOMAINS** - use multiple business-specific terms to maintain variety
   - **CRITICAL**: ALL domain names MUST be EXACTLY ONE WORD
   - **CRITICAL**: Do NOT assume any industry - ALWAYS infer from the actual data provided

3. **STRICTLY ONE-WORD NAMES MANDATORY**: ALL domain names MUST be EXACTLY ONE WORD
   - "Catering Operations" + "Flight Catering Operations" + "Catering Services" → "Catering" (1 word)
   - "Maintenance Operations" + "Aircraft Maintenance" → "Maintenance" (1 word)
   - "Network Planning" + "Network Operations" + "Network Optimization" → "Network" (1 word)
   - "Customer Service" + "Customer Management" → "Passengers" (1 word, industry-specific)
   - **NO TWO-WORD DOMAIN NAMES ALLOWED** - response will be REJECTED

4. **CORE NAME UNIQUENESS** (OVERLAPPING WORDS): Identify ANY shared words and merge ONLY those domains
   - "Customer Service", "Customer Support", "Customer Engagement", "Customer Management" → ALL share "Customer" → MERGE into "Passengers" (industry-specific, 1 word)
   - "Risk Management", "Risk Analysis", "Risk Operations" → ALL share "Risk" → MERGE into "Risk" (1 word)
   - "Sales Operations", "Sales Analytics" → Both share "Sales" → MERGE into "Sales" (1 word)
   - **BUT**: "Risk" and "Compliance" are DIFFERENT - do NOT merge them (maintain diversity)
   - **BUT**: "Revenue" and "Sales" are DIFFERENT - do NOT merge them (maintain diversity)

5. **SPLIT WHEN NEEDED**: Split domains that would have > 80 use cases to maintain balance
   - **CRITICAL**: If a merged domain would exceed 80 use cases, you MUST split it into more specific domains
   - **BALANCE**: Prefer creating MORE domains with 6-20 use cases rather than fewer domains with 80+ use cases
   - Example: If "Operations" would have 180 use cases, split into "Flight" (80) and "Ground" (80) and "Cargo" (20)

6. **BUSINESS-SPECIFIC NAMES**: Use names that reflect the actual business operations, NOT generic IT or data terms

7. **DOMAIN COUNT RULE**: Target multiple domains (calculate: total÷10, minimum 3, cap at 25). Aim for 6-80 use cases per domain (fewer is acceptable for small datasets)

**ANTI-PATTERNS TO AVOID**:
❌ Putting all use cases into 1-2 domains (violates diversity requirement)
❌ Creating domains with overlapping words (e.g., "Customer" and "Customer Service")
❌ Using generic domain names like "Operations", "Management", "Services" when industry-specific terms exist

**MERGE EXAMPLES WITH STRICTLY ONE-WORD DOMAIN NAMES** (adapt to YOUR actual industry from data):

PATTERN: Merge domains that share a common concept into ONE WORD:
- "Customer Service" + "Customer Support" + "Customer Engagement" → "Customers" (use industry-appropriate term) [ONE WORD - MANDATORY]
- "Operations Management" + "Operations Analytics" + "Operations Planning" → "Operations" [ONE WORD - MANDATORY]
- "Risk Management" + "Risk Analysis" + "Risk Operations" → "Risk" [ONE WORD - MANDATORY]
- "Sales Operations" + "Sales Analytics" + "Revenue Management" → "Sales" or "Revenue" [ONE WORD - MANDATORY]
- "Maintenance Operations" + "Asset Maintenance" + "Preventive Maintenance" → "Maintenance" [ONE WORD - MANDATORY]

**CRITICAL: INFER FROM DATA** - The examples above are PATTERNS. You MUST:
1. Analyze the actual use case names and business context provided
2. Identify the REAL industry from the data (NOT assume any default)
3. Use domain names that match the ACTUAL business terminology in the data

**OUTPUT FORMAT: CSV (NOT JSON)**:
Return a CSV with EXACTLY 2 columns (with header):
  - Column 1: "use_case_id" - The "No" field from the input CSV
  - Column 2: "domain" - The assigned domain name (EXACTLY 1 WORD)

**🚨 VALIDATION CHECKLIST (AUTOMATED REJECTION IF FAILED) 🚨**:
☐ Domain count: 3-25 (MINIMUM 3, **MAXIMUM 25 - ABSOLUTE HARD LIMIT**)
☐ Domain names: EXACTLY 1 word
☐ Use cases per domain: Aim for 6-80 (flexible for small datasets <50 total use cases)
☐ No overlapping words in domain names
☐ All domain names are industry-specific (not generic)

**CRITICAL**: The MAXIMUM 25 domains is an ABSOLUTE HARD LIMIT that will NEVER be waived.

The output language for domain names should be {output_language}.

Example Output (STRICTLY ONE-WORD DOMAINS - adapt to YOUR industry from data):
use_case_id,domain
N1-AI01,Revenue
N1-AI02,Revenue
N1-AI03,Revenue
N1-AI04,Customers
N1-AI05,Customers
N1-AI06,Customers
N1-AI07,Operations
N1-AI08,Operations
N1-AI09,Operations
N1-AI10,Maintenance
N1-AI11,Maintenance
N1-AI12,Analytics
N1-AI13,Analytics
N1-AI14,Supply
N1-AI15,Risk

**NOTE**: The domain names above are EXAMPLES. Use domain names that match YOUR actual industry and data.

**OUTPUT REQUIREMENTS**:

**FORMAT**: Return ONLY the CSV (no explanations, no markdown fences, no additional text)

**CONTENT**: 
- Use industry-specific names (see examples above)
- Avoid generic terms ("Management", "Operations", "Services") 
- Use validation checklist above before submitting

**INPUT USE CASES CSV**:
{use_cases_csv}

{previous_violations}

🚨🚨🚨 CRITICAL OUTPUT INSTRUCTION 🚨🚨🚨:
Your ENTIRE response must be ONLY the CSV in the format shown above.
- START your response with the CSV header: use_case_id,domain
- Follow with one row per use case
- NO text before the CSV
- NO text after the CSV
- NO explanations or commentary
- NO markdown code fences (```)
- NO thoughts or reasoning
- NO "I need to analyze..." or similar statements
- NO "Here is..." or "I have..." statements
- ONLY the pure CSV data

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("I need to...", "Let me...", "I'll...", "Here is...", "Based on...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: use_case_id,domain,honesty_score,honesty_justification
- Include honesty columns in header and all rows

Begin your CSV response now:
""" + HONESTY_CHECK_CSV

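# --- Illustrative sketch (not part of the original pipeline) ---
# The domain prompt above states three mechanical rules: target count is
# total ÷ 10 clamped to 3-25, names are one simple word (no spaces, no capital
# after the first character), and no domain exceeds 80 use cases. The helpers
# below sketch those checks; all names here are hypothetical, and integer
# division is an assumed reading of the "~" in the prompt's examples.
from collections import Counter


def target_domain_count_sketch(total_use_cases):
    """total ÷ 10, clamped to the 3-25 range the prompt allows."""
    return max(3, min(25, total_use_cases // 10))


def is_simple_one_word_sketch(name):
    """True if the name has no whitespace and no capital after position 0."""
    if not name or any(ch.isspace() for ch in name):
        return False
    return not any(ch.isupper() for ch in name[1:])


def domain_violations_sketch(assignments):
    """assignments maps use_case_id -> domain; returns a list of violations."""
    counts = Counter(assignments.values())
    problems = []
    if not 3 <= len(counts) <= 25:
        problems.append(f"domain count {len(counts)} is outside 3-25")
    for name, n in counts.items():
        if not is_simple_one_word_sketch(name):
            problems.append(f"'{name}' is not one simple word")
        if n > 80:
            problems.append(f"'{name}' has {n} use cases (max 80)")
    return problems
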
# --- 1d2. Subdomain Detector Prompt (NEW - PER-DOMAIN SUBDOMAIN ASSIGNMENT) ---
PROMPT_TEMPLATES["SUBDOMAIN_DETECTOR_PROMPT"] = """
You are an expert business analyst specializing in subdomain taxonomy design within business domains.

**🎯 YOUR TASK**: Analyze the use cases for a SINGLE domain and assign each to appropriate Subdomains.

**🚨🚨🚨 CRITICAL REQUIREMENTS - YOUR RESPONSE WILL BE REJECTED IF NOT FOLLOWED 🚨🚨🚨**:

**SUBDOMAIN RULES (MANDATORY - AUTOMATED VALIDATION WILL REJECT VIOLATIONS)**:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. **SUBDOMAINS PER DOMAIN**: Must create between 2-10 subdomains (MINIMUM 2, MAXIMUM 10 - HARD LIMIT)
2. **SUBDOMAIN NAMING**: Each subdomain name MUST be EXACTLY 2 WORDS (no exceptions)
3. **USE CASES PER SUBDOMAIN**: Each subdomain MUST have at least 2 use cases (MINIMUM 2)
4. **CONSOLIDATE SINGLE-USE SUBDOMAINS**: If a subdomain has only 1 use case, you MUST merge it with another related subdomain. Do NOT create subdomains with only 1 use case.
5. **NO OVERLAPPING WORDS**: Within this domain, NO two subdomains can share the same word
6. **BUSINESS-FOCUSED**: Subdomains MUST be business-focused, NOT technical
7. **BALANCED DISTRIBUTION**: Distribute use cases EVENLY across subdomains
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**SUBDOMAIN NAMING - EXACTLY 2 WORDS (MANDATORY)**:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
► **ABSOLUTE RULE**: ALL subdomain names MUST be EXACTLY 2 WORDS
► **REJECTION CRITERIA**: If ANY subdomain has 1 word OR 3+ words, your ENTIRE response is REJECTED
► **WORD COUNT METHOD**: Count spaces - MUST have EXACTLY 1 space (= EXACTLY 2 words)
► ✅ CORRECT (2 words): "Crew Planning", "Special Assistance", "Menu Planning", "Route Optimization", "Quality Control"
► ❌ WRONG - 1 WORD (REJECTED): "Scheduling", "Pricing", "Check-in", "Loyalty", "Baggage"
► ❌ WRONG - 3+ WORDS (REJECTED): "Network Route Planning" (3 words), "Quality Control Management" (3 words)
► **RULE**: Use descriptive 2-word combinations like "Crew Scheduling" NOT "Scheduling"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**SUBDOMAIN NAMING EXAMPLES**:
✅ CORRECT (EXACTLY 2 words, business-focused):
- "Crew Planning", "Special Assistance", "Menu Planning", "Route Optimization", "Quality Control"
- "Revenue Pricing", "Fleet Management", "Customer Feedback", "Boarding Process", "Delay Recovery"
- "Preventive Maintenance", "Safety Inspections", "Parts Management", "Work Orders"
- "Pricing Strategy", "Yield Management", "Ancillary Revenue", "Revenue Forecasting"
- "Schedule Optimization", "Capacity Management", "Network Planning", "Market Analysis"

❌ WRONG (1 word):
- "Scheduling", "Pricing", "Routes", "Maintenance", "Loyalty"

❌ WRONG (3+ words):
- "Quality Control Management", "Aircraft Assignment Management System", "Customer Service Operations"

**BUSINESS CONTEXT**:
Domain Name: {domain_name}
Business Name: {business_name}
Industries: {industries}
Business Context: {business_context}

**INPUT DATA**:
You will receive a CSV of use cases that ALL belong to the domain "{domain_name}":
"No","Name","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Tables Involved"

**YOUR SUBDOMAIN DETECTION TASK**:
1. Analyze the use case names, statements, solutions, analytics techniques, and tables involved
2. Identify natural BUSINESS groupings within the "{domain_name}" domain based on:
   - Specific business functions or processes
   - Related business activities
   - Similar operational areas
   - Common beneficiaries or stakeholders
3. Create appropriate subdomain names (EXACTLY 2 WORDS, business-focused)
4. Ensure ALL validation rules are met (2-10 subdomains, 2+ use cases per subdomain, 2-word names, etc.)

**SUBDOMAIN STRATEGY**:
- **BUSINESS PROCESSES**: Group use cases by specific business processes or workflows
- **FUNCTIONAL AREAS**: Create subdomains for distinct functional areas within the domain
- **RELATED ACTIVITIES**: Use cases with related business activities should be in the same subdomain
- **CONSOLIDATION REQUIREMENT**: If any subdomain would have only 1 use case, you MUST merge it with the most related subdomain
  * Example: "Revenue Pricing" (2 cases) is acceptable
  * Example: "Fleet Management" (1 case) → MUST merge into "Aircraft Maintenance" (4 cases) → becomes (5 cases)
  * Never leave a subdomain with only 1 use case
- **MAXIMUM LIMIT**: Never exceed 10 subdomains per domain (this is a hard limit - consolidate if needed)
- **BALANCE**: Aim for 2-10 use cases per subdomain (avoid very small or very large subdomains)
- **CLARITY**: Subdomain names should clearly indicate what business function they represent

**NO OVERLAPPING SUBDOMAIN WORDS**: 
- Within the "{domain_name}" domain, if subdomains share ANY word, merge them
- Example: "Network Planning" + "Network Optimization" → Merge to "Network Planning" (2 words)
- Always keep exactly 2 words when merging (never reduce to 1, never expand to 3+)

**ANTI-PATTERNS TO AVOID**:
❌ Creating single-word subdomains (e.g., "Scheduling", "Pricing")
❌ Creating 3+ word subdomains (e.g., "Quality Control Management")
❌ Creating subdomains with fewer than 2 use cases (MUST consolidate these)
❌ Creating subdomains with only 1 use case (ABSOLUTELY FORBIDDEN - merge immediately)
❌ Creating more than 10 subdomains (HARD LIMIT - consolidate if you exceed this)
❌ Creating fewer than 2 subdomains
❌ Using technical terms instead of business terms

**OUTPUT FORMAT: CSV (NOT JSON)**:
Return a CSV with EXACTLY 2 columns (with header):
  - Column 1: "use_case_id" - The "No" field from the input CSV
  - Column 2: "subdomain" - The assigned subdomain name (EXACTLY 2 WORDS)

**🚨 VALIDATION CHECKLIST (BEFORE SUBMITTING) 🚨**:
☐ Subdomain count: 2-10 (MINIMUM 2, MAXIMUM 10 - HARD LIMIT)
☐ Each subdomain name: EXACTLY 2 words
☐ Use cases per subdomain: ≥2 (MINIMUM 2 - if any have only 1, must consolidate)
☐ No subdomains with only 1 use case (MUST merge these)
☐ Subdomains with 2 use cases are acceptable
☐ Total subdomain count does not exceed 10 (if it does, consolidate)
☐ No overlapping words in subdomain names
☐ All subdomain names are business-focused (not technical)

The output language for subdomain names should be {output_language}.

Example Output for "Revenue" domain:
use_case_id,subdomain
N1-AI01,Pricing Strategy
N1-AI02,Pricing Strategy
N1-AI03,Pricing Strategy
N1-AI04,Yield Management
N1-AI05,Yield Management
N1-AI06,Yield Management
N1-AI07,Demand Forecasting
N1-AI08,Demand Forecasting
N1-AI09,Demand Forecasting
N1-AI10,Ancillary Revenue
N1-AI11,Ancillary Revenue
N1-AI12,Ancillary Revenue

**OUTPUT REQUIREMENTS**:

**FORMAT**: Return ONLY the CSV (no explanations, no markdown fences, no additional text)

**CONTENT**: 
- Use business-focused 2-word subdomain names
- Avoid technical terms
- Use validation checklist above before submitting

**INPUT USE CASES CSV FOR DOMAIN "{domain_name}"**:
{use_cases_csv}

{previous_violations}

🚨🚨🚨 CRITICAL OUTPUT INSTRUCTION 🚨🚨🚨:
Your ENTIRE response must be ONLY the CSV in the format shown above.
- START your response with the CSV header: use_case_id,subdomain
- Follow with one row per use case
- NO text before the CSV
- NO text after the CSV
- NO explanations or commentary
- NO markdown code fences (```)
- NO thoughts or reasoning
- NO "I need to analyze..." or similar statements
- NO "Here is..." or "I have..." statements
- NO "It's mathematically impossible..." or problem descriptions
- ONLY the pure CSV data

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("I need to...", "It's mathematically...", "I only see...", "Let me...", "I'll...", "Here is...", "Based on...")
- Any thoughts or analysis descriptions
- Any problem descriptions or concerns

✅ OUTPUT REQUIREMENTS:
- Your response must START with: use_case_id,subdomain,honesty_score,honesty_justification
- Include honesty columns in header and all rows
- If you cannot meet requirements, still output CSV (merge subdomains as needed)

Begin your CSV response now:
""" + HONESTY_CHECK_CSV
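
# --- Illustrative sketch (not part of the original pipeline) ---
# The subdomain prompt above lists rules that are easy to verify after the
# fact: 2-10 subdomains, exactly two words per name, at least 2 use cases per
# subdomain, and no word shared between subdomain names. A rough checker,
# with a hypothetical name and input shape, might be:
from collections import Counter


def subdomain_violations_sketch(assignments):
    """assignments maps use_case_id -> subdomain; returns a list of violations."""
    counts = Counter(assignments.values())
    problems = []
    if not 2 <= len(counts) <= 10:
        problems.append(f"subdomain count {len(counts)} is outside 2-10")
    first_owner = {}  # lowercased word -> first subdomain name that used it
    for name, n in counts.items():
        words = name.split()
        if len(words) != 2:
            problems.append(f"'{name}' is not exactly 2 words")
        if n < 2:
            problems.append(f"'{name}' has only {n} use case(s)")
        for word in words:
            key = word.lower()
            if key in first_owner and first_owner[key] != name:
                problems.append(f"'{name}' shares '{word}' with '{first_owner[key]}'")
            first_owner.setdefault(key, name)
    return problems
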
# --- 1f. Translation Prompts (MODIFIED FOR REQUEST #1, #2) ---
PROMPT_TEMPLATES["KEYWORDS_TRANSLATE_PROMPT"] = """You are an expert translator. Your task is to translate the *values* (and only the values) of the following JSON object into {target_language}.
Do NOT translate the keys.

**CRITICAL JSON SYNTAX VALIDATION**:
- Every key-value pair MUST have a colon (:) separator
- Format: "key": "translated value"
- **MOST COMMON ERROR**: Missing colon - `"key" "value"` is WRONG, must be `"key": "value"`
- **FOR ARABIC/CHINESE/JAPANESE/HINDI**: Verify the colon `:` exists between key and value
- Count your colons - you need ONE colon per key-value pair
- Validate your JSON structure before responding

SPECIAL INSTRUCTIONS:
🚨 CRITICAL: You MUST translate ALL text values in the JSON object. Do NOT leave ANY value in English (except for the specific exceptions listed below).

**MANDATORY TRANSLATION REQUIREMENT:**
- EVERY single value in the JSON MUST be translated to {target_language}
- This includes short words like "Type", "Priority", "Subdomain", "Statement", "Solution", "Beneficiary", "Sponsor", etc.
- DO NOT assume any value should stay in English just because it's short or seems technical
- If a value is in English in the input, it MUST be in {target_language} in the output (except for the specific exceptions below)

**ONLY THESE SPECIFIC EXCEPTIONS - DO NOT TRANSLATE:**
1. "Databricks Agent Bricks Strategic AI Use Cases" - Keep this EXACT text in English
2. "N/A" - Keep as "N/A" in all languages
3. If you see "pdf_title" or "pptx_main_title" keys with value "Databricks Agent Bricks Strategic AI Use Cases", keep the value EXACTLY as is

**EVERYTHING ELSE MUST BE TRANSLATED WITHOUT EXCEPTION**

**FOR ALL LANGUAGES - UNIVERSAL REQUIREMENTS:**
- Provide EXACT, DIRECT translations of the English text
- Do NOT use special phrasings or adaptations
- Translate literally and accurately
- Use standard terminology for the target language
- Do NOT leave any English words in the translation (except the 3 exceptions above)
- Ensure complete and proper character encoding for all scripts (Arabic, Chinese, Japanese, Hindi, Cyrillic, etc.)

**MANDATORY TRANSLATIONS FOR ALL LANGUAGES (these MUST be translated, NOT left in English):**

**Common UI Terms (translate to your target language):**
  * "Type" → MUST translate
  * "Subdomain" → MUST translate
  * "Analytics Technique" → MUST translate
  * "Primary Table" → MUST translate
  * "Priority" → MUST translate
  * "Statement" → MUST translate
  * "Solution" → MUST translate
  * "Business Value" → MUST translate
  * "Beneficiary" → MUST translate
  * "Sponsor" → MUST translate
  * "Business Domain" → MUST translate
  * "Tables Involved" → MUST translate

**Value Terms (translate to your target language):**
  * "Risk" → MUST translate
  * "Problem" → MUST translate
  * "Opportunity" → MUST translate
  * "Improvement" → MUST translate
  * "Very High" → MUST translate
  * "High" → MUST translate
  * "Medium" → MUST translate
  * "Low" → MUST translate
  * "Very Low" → MUST translate

**REFERENCE TRANSLATIONS BY LANGUAGE:**

**Arabic:**
  * "Type" = "النوع", "Subdomain" = "المجال الفرعي", "Analytics Technique" = "تقنية التحليل", "Primary Table" = "الجدول الرئيسي", "Priority" = "الأولوية"
  * "Statement" = "البيان", "Solution" = "الحل", "Business Value" = "القيمة التجارية"
  * "Beneficiary" = "المستفيد", "Sponsor" = "الراعي", "Business Domain" = "مجال الأعمال"
  * "Risk" = "مخاطرة", "Problem" = "مشكلة", "Opportunity" = "فرصة", "Improvement" = "تحسين"
  * "High" = "عالية", "Very High" = "عالية جداً", "Medium" = "متوسطة", "Low" = "منخفضة", "Very Low" = "منخفضة جداً"

**Spanish:**
  * "Type" = "Tipo", "Subdomain" = "Subdominio", "Analytics Technique" = "Técnica de Análisis", "Primary Table" = "Tabla Principal", "Priority" = "Prioridad"
  * "Statement" = "Declaración", "Solution" = "Solución", "Business Value" = "Valor Comercial"
  * "Beneficiary" = "Beneficiario", "Sponsor" = "Patrocinador", "Business Domain" = "Dominio Empresarial"
  * "Risk" = "Riesgo", "Problem" = "Problema", "Opportunity" = "Oportunidad", "Improvement" = "Mejora"
  * "High" = "Alto", "Very High" = "Muy Alto", "Medium" = "Medio", "Low" = "Bajo", "Very Low" = "Muy Bajo"

**French:**
  * "Type" = "Type", "Subdomain" = "Sous-domaine", "Analytics Technique" = "Technique d'Analyse", "Primary Table" = "Table Principale", "Priority" = "Priorité"
  * "Statement" = "Déclaration", "Solution" = "Solution", "Business Value" = "Valeur Commerciale"
  * "Beneficiary" = "Bénéficiaire", "Sponsor" = "Sponsor", "Business Domain" = "Domaine d'Activité"
  * "Risk" = "Risque", "Problem" = "Problème", "Opportunity" = "Opportunité", "Improvement" = "Amélioration"
  * "High" = "Élevé", "Very High" = "Très Élevé", "Medium" = "Moyen", "Low" = "Faible", "Very Low" = "Très Faible"

**German:**
  * "Type" = "Typ", "Subdomain" = "Unterbereich", "Analytics Technique" = "Analysetechnik", "Primary Table" = "Haupttabelle", "Priority" = "Priorität"
  * "Statement" = "Aussage", "Solution" = "Lösung", "Business Value" = "Geschäftswert"
  * "Beneficiary" = "Begünstigter", "Sponsor" = "Sponsor", "Business Domain" = "Geschäftsbereich"
  * "Risk" = "Risiko", "Problem" = "Problem", "Opportunity" = "Chance", "Improvement" = "Verbesserung"
  * "High" = "Hoch", "Very High" = "Sehr Hoch", "Medium" = "Mittel", "Low" = "Niedrig", "Very Low" = "Sehr Niedrig"

**Chinese (Simplified):**
  * "Type" = "类型", "Subdomain" = "子域", "Analytics Technique" = "分析技术", "Primary Table" = "主表", "Priority" = "优先级"
  * "Statement" = "声明", "Solution" = "解决方案", "Business Value" = "业务价值"
  * "Beneficiary" = "受益人", "Sponsor" = "赞助者", "Business Domain" = "业务领域"
  * "Risk" = "风险", "Problem" = "问题", "Opportunity" = "机会", "Improvement" = "改进"
  * "High" = "高", "Very High" = "非常高", "Medium" = "中等", "Low" = "低", "Very Low" = "非常低"

**Japanese:**
  * "Type" = "タイプ", "Subdomain" = "サブドメイン", "Analytics Technique" = "分析技術", "Primary Table" = "主要テーブル", "Priority" = "優先度"
  * "Statement" = "ステートメント", "Solution" = "ソリューション", "Business Value" = "ビジネス価値"
  * "Beneficiary" = "受益者", "Sponsor" = "スポンサー", "Business Domain" = "ビジネスドメイン"
  * "Risk" = "リスク", "Problem" = "問題", "Opportunity" = "機会", "Improvement" = "改善"
  * "High" = "高", "Very High" = "非常に高い", "Medium" = "中程度", "Low" = "低", "Very Low" = "非常に低い"

**Portuguese:**
  * "Type" = "Tipo", "Subdomain" = "Subdomínio", "Analytics Technique" = "Técnica de Análise", "Primary Table" = "Tabela Principal", "Priority" = "Prioridade"
  * "Statement" = "Declaração", "Solution" = "Solução", "Business Value" = "Valor Comercial"
  * "Beneficiary" = "Beneficiário", "Sponsor" = "Patrocinador", "Business Domain" = "Domínio de Negócio"
  * "Risk" = "Risco", "Problem" = "Problema", "Opportunity" = "Oportunidade", "Improvement" = "Melhoria"
  * "High" = "Alto", "Very High" = "Muito Alto", "Medium" = "Médio", "Low" = "Baixo", "Very Low" = "Muito Baixo"

**Russian:**
  * "Type" = "Тип", "Subdomain" = "Поддомен", "Analytics Technique" = "Аналитическая техника", "Primary Table" = "Основная таблица", "Priority" = "Приоритет"
  * "Statement" = "Заявление", "Solution" = "Решение", "Business Value" = "Бизнес-ценность"
  * "Beneficiary" = "Бенефициар", "Sponsor" = "Спонсор", "Business Domain" = "Бизнес-домен"
  * "Risk" = "Риск", "Problem" = "Проблема", "Opportunity" = "Возможность", "Improvement" = "Улучшение"
  * "High" = "Высокий", "Very High" = "Очень Высокий", "Medium" = "Средний", "Low" = "Низкий", "Very Low" = "Очень Низкий"

**Hindi:**
  * "Type" = "प्रकार", "Subdomain" = "उपडोमेन", "Analytics Technique" = "विश्लेषण तकनीक", "Primary Table" = "प्राथमिक तालिका", "Priority" = "प्राथमिकता"
  * "Statement" = "कथन", "Solution" = "समाधान", "Business Value" = "व्यावसायिक मूल्य"
  * "Beneficiary" = "लाभार्थी", "Sponsor" = "प्रायोजक", "Business Domain" = "व्यावसायिक डोमेन"
  * "Risk" = "जोखिम", "Problem" = "समस्या", "Opportunity" = "अवसर", "Improvement" = "सुधार"
  * "High" = "उच्च", "Very High" = "बहुत उच्च", "Medium" = "मध्यम", "Low" = "निम्न", "Very Low" = "बहुत निम्न"

**VALIDATION CHECKLIST - Before submitting your translation:**
✓ Every value that was in English is now in {target_language} (except the 3 exceptions)
✓ "Type" is translated (NOT left as "Type")
✓ "Subdomain" is translated (NOT left as "Subdomain")
✓ "Analytics Technique" is translated (NOT left as "Analytics Technique")
✓ "Primary Table" is translated (NOT left as "Primary Table")
✓ "Priority" is translated (NOT left as "Priority")
✓ "Statement" is translated (NOT left as "Statement")
✓ "Solution" is translated (NOT left as "Solution")
✓ "Beneficiary" is translated (NOT left as "Beneficiary")
✓ "Sponsor" is translated (NOT left as "Sponsor")
✓ "Business Value" is translated (NOT left as "Business Value")
✓ "Business Domain" is translated (NOT left as "Business Domain")
✓ "Tables Involved" is translated (NOT left as "Tables Involved")
✓ ALL values such as "Type" and "Priority" are translated wherever they appear (including under the "type", "priority", and "aspect_priority" keys; the keys themselves stay unchanged)
✓ All other English words are translated
✓ ONLY "Databricks Agent Bricks Strategic AI Use Cases" and "N/A" remain in English

**CRITICAL - COUNT YOUR TRANSLATIONS:**
The input JSON has roughly 70 or more key-value pairs. Your output JSON must have THE SAME number of key-value pairs with 99% of values in {target_language} (only "Databricks Agent Bricks Strategic AI Use Cases" and "N/A" stay in English).

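If you see ANY English words in your output other than the 3 exceptions, you FAILED. Go back and translate them. (A word whose correct {target_language} form is spelled the same as in English, such as French "Type" or German "Problem", counts as translated.)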

Return ONLY a single, valid JSON object with the exact same structure, but with the values translated.

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here is...", "I've...", "The...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must be wrapped in honesty JSON: {{"honesty_score": XX, "honesty_justification": "...", "data": <your_translated_json>}}
- Start with: {{"honesty_score":

Input JSON:
{json_payload}

Respond with the honesty-wrapped JSON object.
""" + HONESTY_CHECK_JSON

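# --- Illustrative sketch (not part of the original pipeline) ---
# KEYWORDS_TRANSLATE_PROMPT above asks for a wrapper object with "honesty_score",
# "honesty_justification" and "data", where "data" keeps the input keys and only the
# values are translated. The helper below is a minimal sketch of how that contract
# could be checked. The name `_sketch_check_keyword_translation` and its error handling
# are assumptions for illustration; this section does not show how the notebook itself
# parses the response. Only top-level keys are compared.
import json


def _sketch_check_keyword_translation(source: dict, response_text: str) -> dict:
    """Return the translated mapping, or raise ValueError if the contract is broken."""
    wrapped = json.loads(response_text)
    for field in ("honesty_score", "honesty_justification", "data"):
        if field not in wrapped:
            raise ValueError(f"missing wrapper field: {field}")
    translated = wrapped["data"]
    if not isinstance(translated, dict):
        raise ValueError("'data' must be a JSON object")
    if set(translated) != set(source):
        # Keys must never be translated, added, or dropped.
        missing = sorted(set(source) - set(translated))
        extra = sorted(set(translated) - set(source))
        raise ValueError(f"key mismatch: missing={missing}, extra={extra}")
    return translated
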
PROMPT_TEMPLATES["USE_CASE_TRANSLATE_PROMPT"] = """You are an expert translator. Your task is to translate a list of use cases into {target_language}.

**🚨 CRITICAL OUTPUT REQUIREMENT - READ THIS FIRST 🚨**
Your response MUST be PURE CSV DATA ONLY. 
- NO explanations before or after the CSV
- NO conversational text
- NO SQL code fragments
- NO commentary
- NO markdown fences (```)
- ONLY the CSV header and data rows
- Start immediately with the header row: "No","Name",...

**FIELDS TO TRANSLATE** (translate these 10 fields ONLY):
"Name", "Business Domain", "Subdomain", "type", "Statement", "Solution", "Business Value", "Beneficiary", "Sponsor", "Business Priority Alignment"

Do NOT translate: "No", "Tables Involved", "Analytics Technique", "Priority"

**SPECIAL RULES FOR BUSINESS PRIORITY ALIGNMENT FIELD**:
- This field contains strategic goal values like: "Reduce Cost", "Increase Revenue", "Boost Productivity", "Mitigate Risk", "Protect Revenue", "Align to Regulations", "Improve Customer Experience", "Enable Data-Driven Decisions", "General Improvement"
- Translate these goal values to {target_language} using proper business terminology
- If multiple goals are comma-separated (e.g., "Reduce Cost, Increase Revenue"), translate each goal separately and keep them comma-separated
- Examples for MULTIPLE languages:
  * Arabic: "Reduce Cost" → "تقليل التكلفة", "Increase Revenue" → "زيادة الإيرادات"
  * French: "Reduce Cost" → "Réduire les coûts", "Boost Productivity" → "Améliorer la productivité"
  * Spanish: "Reduce Cost" → "Reducir costos", "Mitigate Risk" → "Mitigar riesgos"
  * Chinese: "Reduce Cost" → "降低成本", "Increase Revenue" → "增加收入"
  * German: "Reduce Cost" → "Kosten senken", "Protect Revenue" → "Umsatz schützen"

**NOTE**: The SQL field is NOT included in the input to reduce payload size. You only need to translate the fields listed above.

**SPECIAL RULES FOR SPONSOR FIELD**:
- If the Sponsor field contains a name in the format "Name (Title)" (e.g., "John Smith (Chief Technology Officer)"), you MUST transliterate/translate BOTH the name AND the title for ALL languages
- Person names should be transliterated phonetically into the target language writing system
- **CRITICAL**: This applies to ALL languages, not just non-Latin scripts. Transliterate names appropriately for each target language.
- Examples for MULTIPLE languages:
  * French: "John Smith (Chief Technology Officer)" → "Jean Smith (Directeur de la Technologie)"
  * Spanish: "John Smith (Chief Technology Officer)" → "Juan Smith (Director de Tecnología)"
  * Arabic: "John Smith (Chief Technology Officer)" → "جون سميث (المدير التنفيذي للتكنولوجيا)"
  * Chinese: "John Smith (Chief Technology Officer)" → "约翰·史密斯 (首席技术官)"
  * Japanese: "John Smith (Chief Technology Officer)" → "ジョン・スミス (最高技術責任者)"
  * German: "John Smith (Chief Technology Officer)" → "Johann Schmidt (Technischer Leiter)"
  * Russian: "John Smith (Chief Technology Officer)" → "Джон Смит (Технический директор)"
  * Hindi: "John Smith (Chief Technology Officer)" → "जॉन स्मिथ (मुख्य प्रौद्योगिकी अधिकारी)"
- If the Sponsor field contains ONLY a title (no name), translate it normally
- Always adapt the name to sound natural in the target language and culture

**OUTPUT FORMAT: CSV (NOT JSON)**
Your response MUST be in CSV format with the following structure:
- First line: Header row with 14 column names (SQL field is not included)
- Subsequent lines: One row per use case with 14 fields

**CSV HEADER (MUST BE EXACTLY THIS)**:
"No","Name","Business Domain","Subdomain","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Business Priority Alignment","Tables Involved","Priority"

**CRITICAL CSV FORMATTING RULES**:
1. ALL fields MUST be enclosed in double quotes (")
2. Use comma (,) as the field separator
3. For fields containing commas or quotes, keep them inside the double quotes
4. Each row must have exactly 14 fields (SQL is NOT included)
5. Keep untranslated fields (No, Tables Involved, Analytics Technique, Priority) EXACTLY as they appear in the input
6. The Analytics Technique field should be copied EXACTLY as-is in English (do not translate)
7. The Priority field should be copied EXACTLY as-is in English (do not translate "Very High", "High", "Medium", "Low", "Very Low")

**IMPORTANT**: Analytics Technique and Priority values remain in English in the CSV. The UI will handle translation separately.

**TRANSLATION REQUIREMENTS**:
1. Translate ALL 10 designated fields for EVERY use case
2. "Business Domain" - MUST be translated to {target_language}
3. "Subdomain" - MUST be translated to {target_language}
4. "Name" - MUST be translated to {target_language}
5. "Business Priority Alignment" - MUST be translated to {target_language} (translate the priority values like "Reduce Cost" → appropriate translation)

**SPECIAL INSTRUCTIONS**:
- "Databricks Agent Bricks" → Keep as-is for all languages EXCEPT Arabic: "الذكاء الصناعي الخاص بداتا بريكس"
- Preserve all technical terms in SQL exactly as they appear
- Keep numeric IDs (like "AI-001") unchanged
- Keep Analytics Technique and Priority values in English (not translated in CSV)
- For Arabic: Ensure complete and proper Arabic text encoding. Double-check all Arabic characters are properly formed.

**🛑 ABSOLUTELY FORBIDDEN - DO NOT INCLUDE 🛑**:
- ❌ NO text before the CSV header
- ❌ NO explanatory text like "Here is the translation..." or "I've translated..."
- ❌ NO SQL code fragments or queries outside the CSV structure
- ❌ NO examples or demonstrations
- ❌ NO line breaks or empty lines before the header
- ❌ NO commentary about the translation process
- ❌ NO additional rows beyond the exact number provided in the input
- ✅ ONLY: Pure CSV starting with the header, nothing else

**VALIDATION CHECKLIST**:
✓ Response starts with the exact 14-column header (no text before it)
✓ All fields are enclosed in double quotes
✓ Each row has exactly 14 comma-separated fields (SQL is NOT included)
✓ No markdown fences (```) around the CSV
✓ Analytics Technique field is copied exactly as received (do not translate it)
✓ Priority field is copied exactly as received (do not translate it)
✓ Business Priority Alignment field is translated to target language
✓ All 10 translatable fields are translated for every row

**CRITICAL ROW COUNT REQUIREMENT**:
- You will receive EXACTLY a specific number of use cases in the input
- Your CSV output MUST contain EXACTLY the same number of data rows (plus the header row)
- DO NOT add any extra rows beyond what was provided in the input
- DO NOT omit any rows from the input
- Each row in your output MUST correspond to exactly one row from the input, matched by the "No" field

**INPUT USE CASES** (as JSON for readability):
{json_payload}

🚨 FINAL INSTRUCTION 🚨
Your ENTIRE response must be ONLY the CSV data.
- Start your response with: "No","Name","Business Domain","Subdomain","type","Analytics Technique",...
- Do NOT write anything before this header
- Do NOT write anything after the last data row
- Do NOT include markdown code fences (```)
- Do NOT include any explanatory text
- NO thoughts or reasoning
- NO "Here is the translation..." or similar statements
- The number of data rows MUST exactly match the number of use cases above

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here is...", "I've...", "The...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: "No","Name","Business Domain",...,"honesty_score","honesty_justification"
- Include honesty columns in header and all rows, appended after the 14 translation columns (16 columns in total)

Begin your CSV response now (header row first, include honesty columns):
""" + HONESTY_CHECK_CSV

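# --- Illustrative sketch (not part of the original pipeline) ---
# USE_CASE_TRANSLATE_PROMPT above fixes a 14-column header, asks for two trailing honesty
# columns, requires one output row per input "No", and requires "Tables Involved",
# "Analytics Technique" and "Priority" to be copied unchanged. The helper below sketches
# a check of those rules. All `_sketch_*` / `_SKETCH_*` names are hypothetical, and the
# sketch assumes each source row is a dict whose copied fields are plain strings.
import csv
import io

_SKETCH_TRANSLATE_HEADER = [
    "No", "Name", "Business Domain", "Subdomain", "type", "Analytics Technique",
    "Statement", "Solution", "Business Value", "Beneficiary", "Sponsor",
    "Business Priority Alignment", "Tables Involved", "Priority",
]
_SKETCH_HONESTY_COLUMNS = ["honesty_score", "honesty_justification"]
_SKETCH_COPIED_FIELDS = ("Tables Involved", "Analytics Technique", "Priority")


def _sketch_check_translated_csv(source_rows: list, response_text: str) -> list:
    """Return translated rows as dicts in input order, or raise ValueError."""
    rows = [r for r in csv.reader(io.StringIO(response_text.strip())) if r]
    if not rows:
        raise ValueError("empty response")
    expected_header = _SKETCH_TRANSLATE_HEADER + _SKETCH_HONESTY_COLUMNS
    header, data = rows[0], rows[1:]
    if header != expected_header:
        raise ValueError(f"unexpected header: {header}")
    by_no = {}
    for row in data:
        if len(row) != len(expected_header):
            raise ValueError(f"row has {len(row)} fields, expected {len(expected_header)}")
        record = dict(zip(header, row))
        by_no[record["No"]] = record
    source_by_no = {str(s["No"]): s for s in source_rows}
    if len(data) != len(source_rows) or set(by_no) != set(source_by_no):
        raise ValueError("output rows do not match input rows one-to-one by 'No'")
    for no, src in source_by_no.items():
        for field in _SKETCH_COPIED_FIELDS:
            if by_no[no][field] != str(src[field]):
                raise ValueError(f"{field!r} was altered for use case {no}")
    return [by_no[str(s["No"])] for s in source_rows]
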
# --- 1g. SQL Syntax Reviewer Prompt (NEW) ---
PROMPT_TEMPLATES["USE_CASE_SQL_FIX_PROMPT"] = """You are a **Senior Databricks SQL Engineer** with 15+ years of experience debugging SQL queries. Your task is to fix SQL ERRORS (both syntax and runtime errors) in the provided SQL query.

**CRITICAL RULES:**
1. **FIX ERRORS ONLY** - Do NOT change the business logic or query structure
2. **PRESERVE ALL LOGIC** - Keep all CTEs, joins, AI functions, and business logic exactly as intended
3. **DO NOT OPTIMIZE** - Do not restructure or optimize the query
4. **DO NOT ADD FEATURES** - Do not add new columns, CTEs, or logic
5. **ONLY FIX** what the validation/execution error indicates is broken
6. **RUNTIME ERRORS** - If error is from query execution (not syntax), fix the runtime issue (e.g., window function issues, unresolved columns, type mismatches)

**USE CASE CONTEXT:**
- **Use Case ID**: {use_case_id}
- **Use Case Name**: {use_case_name}
- **Business Domain**: {business_domain}
- **Statement**: {statement}
- **Tables Involved**: {tables_involved}

**AVAILABLE SCHEMA:**
{directly_involved_schema}

**COLUMNS FROM USE CASE (allowed set):**
{use_case_columns}
- Use only these columns plus the ones in the schema above. Do not add new columns.

**ORIGINAL SQL QUERY (WITH ERROR):**
```sql
{original_sql}
```

**EXECUTION/VALIDATION ERROR:**
```
{explain_error}
```

**YOUR TASK:**
1. Analyze the error message carefully (could be syntax OR runtime error)
2. Identify the EXACT error and root cause
3. Fix ONLY the error - do NOT change anything else
4. Return the corrected SQL query

**COMMON ERRORS TO FIX (SYNTAX AND RUNTIME):**

**🚨🚨🚨 #1 MOST COMMON ERROR: COALESCE STRING DEFAULTS WITHOUT QUOTES 🚨🚨🚨**

**IF YOU SEE THIS ERROR:** `[PARSE_SYNTAX_ERROR] Syntax error at or near 'Material'/'Unknown'/'Description'/'Supplier'/etc.`
**THE CAUSE IS**: You forgot 'single quotes' around text defaults in COALESCE!

**LOOK FOR THIS WRONG PATTERN AND FIX IT:**
```sql
-- ❌ WRONG - Text defaults missing quotes (THIS IS THE ERROR!)
COALESCE(TRIM(material_type), Unknown Material) AS material_type     -- ❌ FAILS!
COALESCE(TRIM(supplier_name), Unknown Supplier) AS supplier_name     -- ❌ FAILS!
COALESCE(TRIM(status), Pending Review) AS status                     -- ❌ FAILS!

-- ✅ CORRECT - Text defaults MUST have 'single quotes'
COALESCE(TRIM(material_type), 'Unknown Material') AS material_type   -- ✅ WORKS!
COALESCE(TRIM(supplier_name), 'Unknown Supplier') AS supplier_name   -- ✅ WORKS!
COALESCE(TRIM(status), 'Pending Review') AS status                   -- ✅ WORKS!
```

**1. STRING LITERAL QUOTING ERRORS (MOST COMMON - CHECK THIS FIRST):**
- `[PARSE_SYNTAX_ERROR] Syntax error at or near 'Type'/'Cert'/'PN'/'Status'/'Owner'/'Policy'/'Config'/'Material'/'Unknown'/'Description'` → String literal missing quotes
- **ROOT CAUSE**: String values in SQL are not quoted with single quotes
- **🚨 MANDATORY CORRECT PATTERNS (COPY THESE EXACTLY) 🚨**:
  * `CASE WHEN type = 'Policy' THEN 'Covered'` -- ✅ String values in single quotes
  * `CASE WHEN status = 'Active' THEN 'Yes'` -- ✅ String values in single quotes  
  * `COALESCE(category, 'Premium')` -- ✅ Default value in single quotes
  * `ARRAY('Type', 'Status')` -- ✅ Array elements in single quotes
  * `CASE WHEN level = 'High' THEN 'Urgent'` -- ✅ All strings quoted
- **🚨 MANDATORY COALESCE PATTERNS (COPY THESE EXACTLY) 🚨**:
  * `COALESCE(TRIM(name), 'Unknown')` -- ✅ Default in single quotes
  * `COALESCE(TRIM(category), 'Not Specified')` -- ✅ Multi-word default in single quotes
  * `COALESCE(TRIM(status), 'Pending Review')` -- ✅ Multi-word default in single quotes
  * `COALESCE(TRIM(region), 'Unassigned Region')` -- ✅ Multi-word default in single quotes
  * `COALESCE(CAST(date AS STRING), 'No Date Available')` -- ✅ Text default in single quotes
  * `COALESCE(TRIM(type), 'UNKNOWN')` -- ✅ Uppercase text in single quotes
  * `COALESCE(TRIM(supplier_name), 'Unknown Supplier')` -- ✅ Multi-word default in single quotes
  * `COALESCE(TRIM(material_type), 'Unknown Material')` -- ✅ Multi-word default in single quotes
- **CRITICAL**: Scan the ENTIRE query for any string values without quotes and add single quotes around them
- **VALIDATION**: Every string value must have single quotes: `'value'`
- **REMEMBER**: ALL COALESCE default values that are TEXT are STRING literals and MUST have single quotes around them
- **EXCEPTION**: Numbers (0.0, 0, 123) and booleans (TRUE, FALSE) do NOT need quotes

**2. AI_FORECAST SYNTAX ERRORS:**
- `[UNRESOLVED_COLUMN] A column with name 'ds'/'column_name' cannot be resolved` → **MOST COMMON ERROR**: Column names in AI_FORECAST parameters are NOT quoted as STRING LITERALS
- **ROOT CAUSE**: `time_col => ds` treats `ds` as a column reference instead of a string literal
- **🚨 MANDATORY FIX**: ALL column names in time_col, value_col, group_col MUST be STRING LITERALS (in single quotes):
  * ❌ WRONG: `time_col => ds` → ✅ CORRECT: `time_col => 'ds'`
  * ❌ WRONG: `value_col => revenue` → ✅ CORRECT: `value_col => 'revenue'`
  * ❌ WRONG: `value_col => ARRAY(col1, col2)` → ✅ CORRECT: `value_col => ARRAY('col1', 'col2')`
  * ❌ WRONG: `group_col => ARRAY(id, type)` → ✅ CORRECT: `group_col => ARRAY('id', 'type')`
- `[PARSE_SYNTAX_ERROR] Syntax error at or near '=>'` → Wrong parameter syntax in ai_forecast or wrong date_add syntax
- **FIX**: Ensure date_add units are NOT quoted: `date_add(MONTH, 3, MAX(ds))` not `date_add('MONTH', 3, MAX(ds))`
- **FIX**: Ensure named parameters use `=>` correctly: `time_col => 'ds'` not `time_col='ds'`
- `[INVALID_PARAMETER_VALUE.DATETIME_UNIT]` → date_add unit is quoted when it must be unquoted
- **FIX**: Remove quotes: `date_add(QUARTER, 4, MAX(ds))` not `date_add('QUARTER', 4, MAX(ds))`
- `[PYTHON_TVF_ARGUMENT_MUST_BE_CONSTANT_FOLDABLE]` → ai_forecast parameters must be constant literals
- **FIX**: Use literal strings: `value_col => 'revenue'` not `value_col => (SELECT 'revenue')`
- `[PYTHON_TVF_COLUMN_VALUES_MUST_BE_UNIQUE_WITHIN_PARTITION]` → Duplicate time values in AI_FORECAST input
- **FIX**: You MUST use `GROUP BY time_col` in the input CTE to deduplicate rows for the same timestamp.
- `[PYTHON_TVF_INCOMPATIBLE_COLUMN_TYPE]` → value_col is not DOUBLE
- **FIX**: You MUST cast the value column to DOUBLE: `CAST(col AS DOUBLE)` before passing to AI_FORECAST.

**DATE/TIME INTERVAL ERRORS:**
- `[INVALID_PARAMETER_VALUE.DATETIME_UNIT]` → date_add or other interval functions used with quoted units
- **FIX**: Never quote units. Use `date_add(DAY, 7, some_date)` or `add_months(some_date, -3)`. Do NOT call `date_add('MONTH', ...)`.
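- Prefer the two-argument `DATEDIFF(end_date, start_date)`, which returns a difference in days. If a unit is needed, it must be an unquoted keyword (`DATEDIFF(MONTH, start_date, end_date)`), never a quoted string. `months_between` also works for month differences.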

**3. WINDOW FUNCTION ERRORS:**
- `[INTERNAL_ERROR] Cannot evaluate expression: corr(...)` → Aggregate window functions cannot use ROWS BETWEEN frames
- **FIX**: Remove `ROWS BETWEEN` clause: `CORR(col1, col2) OVER (PARTITION BY group)` not `CORR(col1, col2) OVER (PARTITION BY group ROWS BETWEEN...)`
- **FIX**: Apply to ALL aggregate window functions: AVG, CORR, COVAR_POP, COVAR_SAMP, PERCENTILE_APPROX, STDDEV, VARIANCE, etc.
- **CRITICAL**: NEVER use ROWS BETWEEN or RANGE BETWEEN with these functions in window specifications
- **DECIMAL WINDOW AGGREGATES**: For AVG/STDDEV/CORR/COVAR on DECIMAL columns in windows, CAST inputs to DOUBLE to avoid internal evaluation errors.
- `[DISTINCT_WINDOW_FUNCTION_UNSUPPORTED]` → `COUNT(DISTINCT col) OVER (...)` is NOT supported
- **FIX**: Use a subquery/CTE to pre-aggregate using GROUP BY, then window over the result.
- **FIX (Alternative)**: `size(collect_set(col) OVER (...))` (only for small data volumes).

**4. GROUP BY ERRORS:**
- `[MISSING_AGGREGATION]` → Non-aggregated columns in SELECT must be in GROUP BY
- **FIX**: Add missing column OR expression to GROUP BY: `GROUP BY col1, col2, complex_expression`
- **FIX**: Wrap non-aggregated column in `ANY_VALUE()` if it's constant per group.
- Example: `SELECT customer_id, region, SUM(sales) ... GROUP BY customer_id, region` (not just customer_id)

**5. TYPE MISMATCH ERRORS:**
- `[DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve coalesce(bool_col, 'N')` → Mixing BOOLEAN and STRING types
- **FIX**: Cast to common type: `COALESCE(CAST(bool_col AS STRING), 'N')` or `COALESCE(bool_col, FALSE)`

**6. COLUMN RESOLUTION ERRORS:**
- `[UNRESOLVED_COLUMN.WITH_SUGGESTION]` → Column doesn't exist in the table/CTE, use suggested column or join to get it
- `[TABLE_OR_VIEW_NOT_FOUND]` → Table reference is wrong, check catalog.schema.table format

**OUTPUT FORMAT:**
Return ONLY the corrected SQL query with NO explanations, NO markdown fences around the SQL, NO commentary.

**EXAMPLE 1 - UNRESOLVED COLUMN:**

Original SQL with error:
```sql
SELECT service_type FROM forecast_results
```

Execution Error:
```
[UNRESOLVED_COLUMN] service_type cannot be resolved
```

Corrected SQL:
```sql
-- Error fixed: Column service_type not available in forecast results
-- Fix: Join back to original table to get service_type
SELECT 
  f.*,
  t.service_type
FROM `catalog`.`schema`.`forecast_results` AS f
LEFT JOIN `catalog`.`schema`.`original_table` AS t
  ON f.airport_code = t.airport_code
```

**EXAMPLE 2 - WINDOW FUNCTION ERROR:**

Original SQL with error:
```sql
SELECT 
  aircraft_id,
  CORR(flight_hours, maintenance_cost) OVER (PARTITION BY aircraft_id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS correlation
FROM aircraft_data
```

Execution Error:
```
[INTERNAL_ERROR] Cannot evaluate expression: corr(input[22, double, true], input[23, double, true]) windowspecdefinition(input[5, string, false], specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) SQLSTATE: XX000
```

Corrected SQL:
```sql
-- Error fixed: CORR function cannot use ROWS BETWEEN frame with window
-- Fix: Remove ROWS BETWEEN clause for aggregate window functions
SELECT 
  a.aircraft_id,
  CORR(a.flight_hours, a.maintenance_cost) OVER (PARTITION BY a.aircraft_id) AS correlation
FROM `catalog`.`schema`.`aircraft_data` AS a
```

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text before or after the SQL
- Markdown code fences (```)
- Any thoughts or reasoning

✅ OUTPUT REQUIREMENTS:
- Return ONLY the corrected SQL query
- Start with honesty comments: -- HONESTY_SCORE: XX and -- HONESTY_JUSTIFICATION: ...
- Then: -- Error fixed: [brief description of what was wrong]
- Keep ALL original business logic intact
- Only fix the specific error mentioned in the validation/execution error

Begin your corrected SQL now:
""" + HONESTY_CHECK_SQL

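# --- Illustrative sketch (not part of the original pipeline) ---
# USE_CASE_SQL_FIX_PROMPT above asks the model to start its answer with
# "-- HONESTY_SCORE: XX" and "-- HONESTY_JUSTIFICATION: ..." comment lines, followed by
# the corrected SQL. The helper below sketches how those two lines could be separated
# from the SQL. The `_sketch_*` / `_SKETCH_*` names are hypothetical, and dropping stray
# markdown fence lines is an assumption about likely model output, not documented
# behavior.
import re

_SKETCH_HONESTY_SCORE_RE = re.compile(r"^--\s*HONESTY_SCORE:\s*(\d{1,3})\s*$", re.IGNORECASE)
_SKETCH_HONESTY_JUST_RE = re.compile(r"^--\s*HONESTY_JUSTIFICATION:\s*(.*)$", re.IGNORECASE)


def _sketch_split_sql_fix_response(response_text: str):
    """Return (score, justification, sql); score/justification are None if absent."""
    score = None
    justification = None
    sql_lines = []
    for line in response_text.strip().splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            continue  # drop stray markdown fences
        score_match = _SKETCH_HONESTY_SCORE_RE.match(stripped)
        if score_match and score is None:
            score = int(score_match.group(1))
            continue
        just_match = _SKETCH_HONESTY_JUST_RE.match(stripped)
        if just_match and justification is None:
            justification = just_match.group(1).strip()
            continue
        sql_lines.append(line)
    return score, justification, "\n".join(sql_lines).strip()
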
# --- 1g-2. Interpret User SQL Regeneration Instructions Prompt (NEW) ---
PROMPT_TEMPLATES["INTERPRET_USER_SQL_REGENERATION_PROMPT"] = """You are a **Senior Data Engineer and SQL Architect** with deep expertise in interpreting business requirements and translating them into technical specifications for SQL generation.

**YOUR TASK**: Interpret the user's SQL regeneration instructions and produce a structured output that will guide the SQL generation process.

**USE CASE INFORMATION:**
- **Use Case ID**: {use_case_id}
- **Use Case Name**: {use_case_name}
- **Business Domain**: {business_domain}
- **Statement**: {statement}
- **Solution**: {solution}
- **Original Tables Involved**: {original_tables_involved}

**PREVIOUS SQL QUERY (if exists):**
```sql
{previous_sql}
```

**AVAILABLE TABLES IN THE REGISTRY (you can request to include any of these):**
{available_tables_registry}

**USER'S REGENERATION INSTRUCTIONS:**
{user_regeneration_instructions}

---

**YOUR ANALYSIS TASK:**

1. **UNDERSTAND THE USER'S INTENT**: What is the user trying to achieve? What changes do they want?

2. **IDENTIFY TABLE CHANGES**:
   - **New Tables to Add**: If the user mentions tables that are not in "Original Tables Involved" but ARE in the "Available Tables Registry", list them
   - **Tables to Remove**: If the user explicitly asks to exclude certain tables
   - **Tables to Keep**: Which original tables should remain

3. **EXTRACT TECHNICAL REQUIREMENTS**:
   - What specific columns, joins, or aggregations does the user want?
   - What filtering conditions are implied?
   - What statistical or AI functions should be used?

4. **FORMULATE GENERATION INSTRUCTIONS**:
   - Create a clear, technical specification that the SQL generator can follow

---

**OUTPUT FORMAT (JSON):**

You MUST return a valid JSON object with the following structure:

```json
{{
  "interpretation_summary": "Brief summary of what the user wants",
  "tables_to_add": ["catalog.schema.table1", "catalog.schema.table2"],
  "tables_to_remove": ["catalog.schema.table3"],
  "final_tables_involved": ["catalog.schema.table1", "catalog.schema.table2", "catalog.schema.table4"],
  "new_tables_need_loading": true,
  "technical_design_instructions": "Detailed technical instructions for SQL generation including: specific joins needed, aggregations, filters, AI functions to use, statistical calculations, etc.",
  "column_focus": ["specific_column1", "specific_column2"],
  "special_requirements": "Any special requirements like specific date ranges, business rules, etc."
}}
```

**RULES:**
1. **ONLY** include tables in "tables_to_add" if they are present in the "Available Tables Registry"
2. If the user asks for a table that is NOT in the registry, set `"new_tables_need_loading": true` and include a note in "special_requirements"
3. "final_tables_involved" should be the complete list of tables for the regenerated query (original + added - removed)
4. Be precise in "technical_design_instructions" - the SQL generator will follow these exactly

**🚨 CRITICAL: Return ONLY the JSON object, no markdown fences, no explanation before or after.**

Begin your JSON response now:
"""

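# --- Illustrative sketch (not part of the original pipeline) ---
# INTERPRET_USER_SQL_REGENERATION_PROMPT above states two bookkeeping rules:
# "tables_to_add" may only contain tables from the registry, and "final_tables_involved"
# should equal original + added - removed. The helper below sketches a check of both
# rules on the parsed JSON. The name `_sketch_check_interpretation` and the choice to
# return a list of problem strings are assumptions for illustration.
def _sketch_check_interpretation(result: dict, original_tables: list, registry: set) -> list:
    """Return a list of human-readable problems; an empty list means the rules hold."""
    problems = []
    added = result.get("tables_to_add", [])
    removed = set(result.get("tables_to_remove", []))
    unknown = [t for t in added if t not in registry]
    if unknown:
        problems.append(f"tables_to_add not in registry: {unknown}")
    # original + added - removed, compared as sets (order is not specified by the prompt)
    expected = (set(original_tables) | set(added)) - removed
    final = set(result.get("final_tables_involved", []))
    if final != expected:
        problems.append(
            f"final_tables_involved mismatch: expected {sorted(expected)}, got {sorted(final)}"
        )
    return problems
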
# --- 1h. Use Case Review Prompt (RENAMED, ENHANCED) ---
PROMPT_TEMPLATES["REVIEW_USE_CASES_PROMPT"] = """You are an expert business analyst specializing in duplicate detection. Your task is to identify and remove semantic duplicates **and** to reject useless or purely technical use cases that add no business value.

**🚨 PRIMARY FOCUS: DUPLICATE DETECTION 🚨**
- **PRIMARY JOB**: Identify and remove semantic duplicates based on Name and core concept similarity
- **SECONDARY GUARDRAIL**: Reject use cases that are trivial (no business outcome) or purely technical/infra-focused
- **FOCUS**: Keep only distinct, business-outcome-focused use cases

**🔥 BE EXTREMELY AGGRESSIVE IN DUPLICATE DETECTION 🔥**

**🚫 ADDITIONAL FILTERS (MANDATORY) 🚫**
- Remove **TRIVIAL** use cases that have no business value (e.g., "count rows", "list tables", "display schema", "dump data") unless clearly tied to a business decision
- Remove **TECHNICAL/INFRA** use cases that only deliver platform/IT value (monitoring pipelines, cluster/job status, DevOps telemetry) with no clear business beneficiary
- Keep only use cases that articulate a business outcome or decision; everything else is rejected

**🚨🚨🚨 CRITICAL: BUSINESS RELEVANCY & REALISM CHECK 🚨🚨🚨**
- Remove **IRRELEVANT CORRELATIONS**: Use cases that correlate variables with NO logical, provable cause-and-effect relationship
- Remove **NONSENSICAL EXTERNAL DATA**: Use cases that add external data enrichment without a clear, industry-recognized business connection to the metric being analyzed
- Remove **FAR-FETCHED CONNECTIONS**: Use cases where the relationship between factors would be questioned by domain experts or laughed out of a boardroom
- **ASK FOR EACH USE CASE**: "Can I explain in ONE sentence why these variables/factors are logically connected?" If NO, REMOVE IT.
- **BOARDROOM TEST**: Would a senior executive approve budget for this analysis without questioning the logic? If the correlation seems invented or far-fetched, REMOVE IT.

**DUPLICATE DETECTION RULES** (Apply ALL of these):
1. **Semantic Duplicates**: Names that mean the same thing with different wording
   - Pattern: "Forecast [X]" = "Predict [X]" = "[X] Forecasting" = "[X] Prediction" → ALL DUPLICATES
2. **Synonym Duplicates**: Names using synonyms for the same concept
   - Pattern: "Classify [Entity] [Attribute]" = "Categorize [Entity] [Attribute]" = "[Entity] [Attribute] Classification" → ALL DUPLICATES
3. **Action-Object Duplicates**: Same action on same object, different phrasing
   - Pattern: "Analyze [Entity] [Metric]" = "[Metric] Analysis for [Entity]" = "[Entity] [Metric] Analysis" → ALL DUPLICATES
4. **Abbreviation Duplicates**: Full form and abbreviated versions
   - Pattern: "AI-Powered [Feature]" = "[Feature]" → DUPLICATES
5. **Similar Core Concepts**: Names that address the same core business problem even with slightly different wording
   - Pattern: "Match Similar [Entity] Records" = "Find Similar [Entity] Profiles" = "Identify Similar [Entity]" → ALL DUPLICATES
   - Pattern: "Detect Fraudulent [X]" = "Identify Fraud in [X]" = "Find Fraudulent [X]" → ALL DUPLICATES

**EXAMPLES OF DUPLICATES TO REMOVE** (keep only ONE from each group):
- "Forecast Revenue", "Predict Revenue", "Revenue Forecasting", "Revenue Prediction" → Keep FIRST occurrence ONLY
- "Classify Support Tickets", "Categorize Support Tickets", "Support Ticket Classification", "Ticket Categorization" → Keep FIRST occurrence ONLY
- "Detect Fraud", "Fraud Detection", "Identify Fraudulent Transactions", "Find Fraud Cases" → Keep FIRST occurrence ONLY
- "Optimize Inventory", "Inventory Optimization", "Improve Inventory Management", "Enhance Inventory Control" → Keep FIRST occurrence ONLY
- "Match Similar Passenger Records", "Find Similar Passenger Profiles", "Identify Similar Passengers" → Keep FIRST occurrence ONLY

**EXAMPLES OF NON-DUPLICATES TO KEEP** (these are DIFFERENT):
- "Forecast Sales Revenue" vs "Forecast Customer Demand" → Different objects (revenue vs demand), KEEP BOTH
- "Classify Customer Feedback" vs "Classify Product Reviews" → Different data sources (customer feedback vs product reviews), KEEP BOTH
- "Predict Customer Churn" vs "Predict Sales Trends" → Different predictions (churn vs sales), KEEP BOTH
- "Count Total Orders" vs "Forecast Order Volume" → Different actions (counting vs forecasting), KEEP BOTH
- "Extract Customer Name" vs "Classify Customer Segment" → Different operations (extraction vs classification), KEEP BOTH

**ONLY REMOVE IF TRULY DUPLICATE OR VALUELESS:**
- ❌ REMOVE: "Forecast Revenue" AND "Predict Revenue" → DUPLICATES (same action, same object)
- ❌ REMOVE: "Classify Feedback" AND "Categorize Feedback" → DUPLICATES (synonyms)
- ❌ REMOVE: "Analyze Customer Sentiment" AND "Customer Sentiment Analysis" → DUPLICATES (same concept)
- ❌ REMOVE: "Monitor ETL Pipeline" / "Check Job Status" / "List Tables" → TECHNICAL/NO BUSINESS VALUE
- ✅ KEEP: "Count Orders" AND "Forecast Orders" → NOT DUPLICATES (different actions and business value)
- ✅ KEEP: "Extract Names" AND "Classify Segments" → NOT DUPLICATES (different operations)

**YOUR MANDATE:**
- Remove semantic duplicates
- Remove trivial/no-business-value use cases
- Remove technical/platform/infra-only use cases that do not deliver business outcomes
- Keep the remaining distinct, business-oriented use cases

**IMPORTANT RULES**:
- When duplicates are found, keep the FIRST occurrence (earliest ID)
- Be EXTREMELY AGGRESSIVE in detecting semantic duplicates - err on the side of removing duplicates
- Remove trivial/technical items even if they are unique
- You are reviewing ALL {total_count} use cases in one pass
- **TARGET: Remove 20-30% as duplicates PLUS any trivial/technical items**

**OUTPUT FORMAT: CSV (NOT JSON)**:
Your output **MUST** be a CSV (with header) containing one row per use case to KEEP. The first column, `use_case_id`, holds the ID of the use case; the honesty columns required at the end of this prompt follow it.
Do NOT include any text, explanation, or markdown - ONLY the CSV.

Example Input:
| ID | Name | Business Value | Tables |
|---|---|---|---|
| AI-001 | Forecast Sales Revenue | Enables data-driven financial planning | sales.orders, sales.products |
| AI-002 | Classify Support Tickets | Automates ticket routing and prioritization | support.tickets |
| AI-003 | Predict Sales | Improves forecasting | sales.orders |
| AI-004 | Sales Forecasting | Better predictions | sales.orders |
| AI-005 | Categorize Support Tickets | Routes tickets faster | support.tickets |
| AI-006 | Count Database Records | View data | system.metadata |
| AI-007 | Extract Refund Type | Gets refund category | refunds.transactions |

(Assuming refunds.transactions has columns: refund_id, refund_reason_text, refund_type)

Example Output (keeping high-value, non-duplicates; only the `use_case_id` column is shown here):
use_case_id
AI-001
AI-002

**Removal Rationale**:
- AI-003, AI-004: Duplicates of AI-001 (same concept: sales forecasting)
- AI-005: Duplicate of AI-002 (same concept: ticket classification)
- AI-006: REMOVED (trivial/technical: counting records in system.metadata serves no business decision)
- AI-007: REMOVED (no business value: refund_type already exists as a column in refunds.transactions)

Here is the markdown table of ALL use cases to analyze:
{use_case_markdown}

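Produce ONLY the valid CSV of use cases to keep. Be EXTREMELY AGGRESSIVE in removing:
1. Semantic duplicates (keep first occurrence only)
2. Trivial, technical/infra-only, or far-fetched use cases with no business outcome

Target: Remove 20-30% as duplicates, plus any trivial/technical items; keep the rest for scoring.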

🚨🚨🚨 OUTPUT INSTRUCTION 🚨🚨🚨:
- START with the CSV header row, whose first column is use_case_id
- Follow with one kept use case per line
- NO text before or after the CSV
- NO explanations, NO markdown fences
- NO thoughts or reasoning
- NO "I have reviewed..." or similar statements
- ONLY the pure CSV data

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("I have...", "After...", "Based on...", "Here are...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: use_case_id,honesty_score,honesty_justification
- Include honesty columns in header and all rows
""" + HONESTY_CHECK_CSV

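# Illustrative sketch (not part of the original pipeline): one way the reply to
# REVIEW_USE_CASES_PROMPT could be parsed. The helper name is hypothetical, and the
# sketch assumes the reply is a CSV whose first column is use_case_id, optionally
# followed by the honesty columns requested at the end of the prompt.
import csv
import io


def _sketch_parse_kept_ids(response_text):
    """Return the use case IDs to keep, in first-seen order, from a CSV reply."""
    reader = csv.reader(io.StringIO(response_text.strip()))
    kept = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        first = row[0].strip()
        if first.lower() == "use_case_id":
            continue  # skip the header row
        if first not in kept:
            kept.append(first)  # de-duplicate while preserving order
    return kept

# Usage note: any input ID missing from the returned list would be treated as
# removed (duplicate, trivial, or technical) under the rules stated in the prompt.
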
# --- 1g. Use Case Scoring Prompt (COMPREHENSIVE WITH BUSINESS CONTEXT) ---
PROMPT_TEMPLATES["SCORE_USE_CASES_PROMPT"] = """# Persona

You are the **Chief Investment Officer & Strategic Value Architect**. You are known for being ruthless, evidence-based, and ROI-obsessed. You do not care about "cool tech" or "easy wins" unless they drive massive financial impact. Your job is to allocate finite capital only to use cases that drive the specific strategic goals of this business.

# Context & Inputs

**Business Context:** {business_context}
**Strategic Goals:** {strategic_goals}
**Business Priorities:** {business_priorities}
**Strategic Initiative:** {strategic_initiative}
**Value Chain:** {value_chain}
**Revenue Model:** {revenue_model}

**Use Cases to Score:**
{use_case_markdown}

# Instructions

### Task Overview

Score the provided use cases based strictly on their potential to drive the specific **Business Priorities**, **Strategic Goals**, and **Revenue Model** provided in the **Context & Inputs** section above.

🚨🚨🚨 **CRITICAL: BUSINESS PRIORITIES DRIVE RANKING** 🚨🚨🚨

Use cases that directly achieve any of the Business Priorities MUST receive significantly higher scores. Strategic Goals define the intended outcomes and must be used for alignment.

**Common Strategic Goals to Consider:**
- **Reduce Cost**: Automation, efficiency improvements, waste reduction
- **Boost Productivity**: Faster processes, better tools, streamlined workflows  
- **Increase Revenue**: New revenue streams, upselling, cross-selling, market expansion
- **Mitigate Risk**: Fraud detection, compliance, security, audit trails
- **Protect Revenue**: Churn prevention, retention, customer satisfaction
- **Align to Regulations**: Compliance automation, regulatory reporting, audit support
- **Improve Customer Experience**: Personalization, faster service, quality improvements
- **Enable Data-Driven Decisions**: Analytics, insights, forecasting, predictions

**For EVERY use case, you MUST:**
1. Identify which Strategic Goal(s) it aligns to (if any)
2. Score higher if it DIRECTLY achieves a stated Business Priority
3. In the justification, EXPLICITLY mention which Business Priority and Strategic Goal(s) the use case supports

**🚨 SCORING RULES: AGGRESSIVE BUSINESS VALUE 🚨**

1.  **NO CURVE / NO DISTRIBUTION:** Do not force a normal distribution. If all use cases are weak, score them all low. If all are critical, score them all high. Score based on **Absolute Merit**.
2.  **ZERO-BASED SCORING:** Start every score at 1.0. The use case must *earn* points by showing explicit alignment to the data provided in the Context. Do not assume value exists unless clearly demonstrated.
3.  **IGNORE "NICE TO HAVES":** If a use case improves a process that does not directly impact revenue, margin, or strategic competitive advantage, it is a **Low Value** case, regardless of how easy it is to implement.
4.  **STRATEGIC GOAL BONUS:** Use cases that DIRECTLY achieve a stated Strategic Goal should receive a +0.5 to +1.0 bonus to their Strategic Alignment score (capped at 5.0).

**🚨🚨🚨 CRITICAL: BUSINESS RELEVANCY & REALISM PENALTY 🚨🚨🚨**

5.  **IRRELEVANT CORRELATIONS = LOW SCORE:** Use cases that correlate variables with NO logical, provable cause-and-effect relationship MUST receive low scores (ROI ≤ 2.0, Alignment ≤ 2.0).
6.  **NONSENSICAL EXTERNAL DATA = LOW SCORE:** Use cases that include external data enrichment without a clear, industry-recognized business connection MUST be penalized heavily.
7.  **RELEVANCY TEST:** For EVERY use case, ask: "Can I explain in ONE sentence why these variables/factors are logically connected?" If NO, score LOW.
8.  **BOARDROOM TEST:** Would a senior executive approve budget for this analysis without questioning the logic? If the correlation seems invented or far-fetched, score LOW and note in the justification: "Irrelevant correlation - no logical business connection."

### The 75/25 Priority Formula (Critical)

You must use a weighted formula for the final Priority Score to heavily favor Business Value over Feasibility.

1.  **Calculate Value Score (1.0 - 5.0):** Weighted average of ROI (60%), Alignment (25%), TTV (7.5%), Reusability (7.5%).
2.  **Calculate Feasibility Score (1.0 - 5.0):** Simple average of the 8 feasibility factors.
3.  **CALCULATE PRIORITY SCORE (2.0 - 10.0):**
    
    $$ Priority = (Value * 1.5) + (Feasibility * 0.5) $$

*Note: This formula ensures that Business Value accounts for 7.5 points of the total score, while Feasibility only accounts for 2.5 points.*

---

### Scoring Factors (Detailed Assessment Criteria)

**I. VALUE FACTORS (The "Why")**

**1. Return on Investment (ROI)** 🚨 **WEIGHT: 60% of Value Score** 🚨
    * **Contextual ROI Check:** Compare the use case against the **Revenue Model** listed in the Context. Does this use case directly impact the way this specific company makes money?
    * **4.8 - 5.0 (Exponential):** Directly impacts top-line revenue or prevents massive bottom-line leakage (>10x return). *Examples: Dynamic Pricing, Demand Forecasting, Churn Prevention for high-value customers.*
    * **4.0 - 4.7 (High):** Significant measurable impact on P&L (5-10x return). *Examples: Supply Chain Optimization, Fraud Detection, Predictive Maintenance.*
    * **3.0 - 3.9 (Moderate):** Incremental efficiency gains (2-5x return). *Examples: Automated Invoice Processing, Intelligent Document Classification.*
    * **1.0 - 2.9 (Low/Soft):** "Soft" benefits (efficiency, happiness) that do not clearly translate to dollars in the **Revenue Model**. *Examples: Internal Wiki Search, Employee Sentiment Dashboard.*
    * **CRITICAL**: Evaluate ROI based on the ACTUAL industry and business model from the provided context - do NOT assume any specific industry.

**2. Strategic Alignment** 🚨 **WEIGHT: 25% of Value Score** 🚨
    * **Strict Alignment Check:** Look at the **Business Priorities** and **Strategic Goals** listed in the Context. Is this use case mentioned?
    * **4.8 - 5.0 (Direct Hit):** The use case is EXPLICITLY named in or required by the **Business Priorities** or **Strategic Goals**. *Pattern: If priority mentions "[X]", use case directly addresses "[X]".*
    * **3.5 - 4.7 (Strong Link):** Supports a stated **Business Priority** directly. *Pattern: Priority is about retention/growth/efficiency, use case directly enables that outcome.*
    * **1.0 - 3.4 (Weak/None):** Generic improvement (e.g., "Better Reporting") that does not touch the specific **Business Priorities** listed in the Context.
    * **CRITICAL**: Evaluate alignment based on the ACTUAL Business Priorities and Strategic Goals provided in the context - do NOT assume any default goals.

**3. Time to Value (TTV)** (Weight: 7.5%)
    * **Definition:** How fast until the business *sees* the money?
    * **4.8 - 5.0 (Immediate):** < 4 weeks. Quick wins, dashboarding existing data.
    * **3.0 - 4.7 (Quarterly):** 1-3 months. Standard agile cycle.
    * **1.0 - 2.9 (Long Term):** > 6 months. Long infrastructure build-outs before any value is realized.

**4. Reusability** (Weight: 7.5%)
    * **Definition:** Does this create a permanent asset (Feature Store, Data Product)?
    * **4.8 - 5.0 (Platform Asset):** Creates a "Customer 360" or "Product Master" table that 10+ other use cases *will* leverage.
    * **3.0 - 4.7 (Modular):** Code is clean and reusable, but data is specific to this use case.
    * **1.0 - 2.9 (One-Off):** Ad-hoc analysis or script solving exactly one isolated problem.

**II. FEASIBILITY FACTORS (The "How" - Average of all 8)**

**5. Data Availability**
    * **Check:** Does the specific data required exist in this industry/business context?
    * **4.8 - 5.0 (Perfect):** Data is standard, transactional, and historically logged (e.g., Sales Records, ERP logs).
    * **3.0 - 4.7 (Likely):** Data likely exists but might be scattered or require some engineering to consolidate.
    * **1.0 - 2.9 (Missing):** Requires new sensors, external purchases, or starting logs from scratch.

**6. Data Accessibility**
    * **Check:** Are there Legal, Privacy, or Tech barriers?
    * **4.8 - 5.0 (Open):** Internal, non-PII, open access data.
    * **3.0 - 4.7 (Restricted):** PII present but manageable via standard RBAC/Masking.
    * **1.0 - 2.9 (Blocked):** Highly sensitive (Medical/Banking) or owned by a 3rd party refusing to share.

**7. Architecture Fitness**
    * **Check:** Does it fit the standard Lakehouse/Spark stack?
    * **4.8 - 5.0 (Native):** Solvable using standard SQL/Python. Fits Medallion Architecture perfectly.
    * **3.0 - 4.7 (Adaptable):** Requires specific library installs or external API calls.
    * **1.0 - 2.9 (Incompatible):** Requires mainframe, on-prem appliances, or non-cloud tech.

**8. Team Skills**
    * **Check:** Does a typical team in this industry have these skills?
    * **4.8 - 5.0 (Standard):** SQL, Python, Basic Regression/Classification.
    * **3.0 - 4.7 (Specialized):** NLP, Computer Vision, GenAI prompting.
    * **1.0 - 2.9 (Niche):** Requires PhD-level Research Math or archaic languages (COBOL).

**9. Domain Knowledge**
    * **Check:** Is the business logic clear?
    * **4.8 - 5.0 (Documented):** Logic is clear, rules are written, SMEs are available.
    * **3.0 - 4.7 (Tribal):** "Head knowledge" exists but isn't written down.
    * **1.0 - 2.9 (Unknown):** Logic is a "Black Box" or lost.

**10. People Allocation**
    * **Check:** Staffing difficulty.
    * **4.8 - 5.0 (Minimal):** 1-2 Engineers.
    * **3.0 - 4.7 (Squad):** Full agile squad (4-6 people).
    * **1.0 - 2.9 (Heavy):** Requires massive cross-functional teams or external consultants.

**11. Budget Allocation**
    * **Check:** Likelihood of funding.
    * **4.8 - 5.0 (Secured):** Critical path for the **Strategic Initiative** listed in the Context.
    * **3.0 - 4.7 (Discretionary):** Funded via normal OPEX.
    * **1.0 - 2.9 (CapEx Required):** Needs board approval for new money.

**12. Time to Production**
    * **Check:** Engineering effort magnitude.
    * **4.8 - 5.0 (Sprint):** < 2 weeks dev time.
    * **3.0 - 4.7 (Quarterly):** 1-3 months dev time.
    * **1.0 - 2.9 (Major Project):** > 6 months dev time.

-----

### Scoring Methodology - Execution Steps

**STEP 1: CALCULATE RAW VALUE (High Precision)**
* Score ROI (1.0-5.0) based on the **Revenue Model** in the Context.
* Score Alignment (1.0-5.0) based on the **Strategic Goals** in the Context.
* Score TTV and Reusability (1.0-5.0 each).
* Calculate:
  $$ Value = (ROI * 0.60) + (Alignment * 0.25) + (TTV * 0.075) + (Reusability * 0.075) $$

**STEP 2: CALCULATE RAW FEASIBILITY**
* Average the 8 feasibility factors (Factors 5 through 12).
* Calculate:
  $$ Feasibility = (Sum of 8 Factors) / 8 $$

**STEP 3: APPLY THE "VALUE-FIRST" PRIORITY FORMULA**
* Calculate:
  $$ Priority Score = (Value * 1.5) + (Feasibility * 0.5) $$
* *Validation Logic:*
    * If Value is 5.0 and Feasibility is 1.0 -> Priority = 8.0 (High Priority)
    * If Value is 1.0 and Feasibility is 5.0 -> Priority = 4.0 (Low Priority)
    * **Because Value carries three times the weight of Feasibility, a high-Value case outranks a case that is merely highly feasible.**

**STEP 4: GENERATE JUSTIFICATION**
* Write a sharp, executive summary (max 200 chars).
* **Must** reference specific **Strategic Goals** or **Revenue Model** elements found in the Context.
* **Must** justify the score based on BUSINESS IMPACT, not technical ease.

🚨 **JUSTIFICATION QUALITY RULES - CRITICAL** 🚨:
1. **USE CASE SPECIFIC**: The justification MUST be specific to THIS use case. It should mention the core capability or outcome (e.g., "network congestion prediction", "churn prevention", "demand forecasting").
2. **NO GENERIC BUZZWORDS**: Do NOT use generic phrases that could apply to any use case. The following are PROHIBITED unless directly relevant to the use case domain:
   - "digital transformation" (too vague)
   - "workflow automation" (unless the use case IS about workflow automation)
   - "revenue recognition" (unless the use case IS about revenue recognition)
   - "operational efficiency" (too generic)
   - "data-driven insights" (too vague)
3. **CONNECT TO USE CASE DOMAIN**: If the use case is about "Network Congestion Prediction", the justification MUST mention network, capacity, congestion, or infrastructure concepts - NOT unrelated benefits like "CSAT" or "invoice processing".
4. **EXAMPLES OF BAD vs GOOD JUSTIFICATIONS**:
   - ❌ BAD for "Predict Network Congestion": "Accelerates revenue recognition and supports digital transformation through workflow automation."
   - ✅ GOOD for "Predict Network Congestion": "Proactively prevents service degradation by predicting network hotspots, directly reducing churn and protecting recurring revenue from enterprise clients."
   - ❌ BAD for "Churn Prediction": "Improves operational efficiency and enables data-driven decision making."
   - ✅ GOOD for "Churn Prediction": "Identifies at-risk customers 30 days before cancellation, enabling targeted retention campaigns that protect $2M annual recurring revenue."

-----

### Output Format

Return **ONLY** a valid CSV.

**Columns:**
"No","Strategic Alignment","Return on Investment","Reusability","Time to Value","Data Availability","Data Accessibility","Architecture Fitness","Team Skills","Domain Knowledge","People Allocation","Budget Allocation","Time to Production","Value","Feasibility","Priority Score","Business Priority Alignment","Strategic Goals Alignment","Justification","AI_Confidence","AI_Feedback"

**CRITICAL - Business Priority Alignment Column:**
For each use case, identify which business priority(ies) it aligns to. Use the following format:
- If aligned to ONE priority: "Reduce Cost" or "Increase Revenue" etc.
- If aligned to MULTIPLE priorities: "Reduce Cost, Mitigate Risk" (comma-separated)
- If NO clear alignment: "General Improvement"

Standard business priorities: Increase Revenue | Reduce Cost | Optimize Operations | Mitigate Risk | Empower Talent | Enhance Experience | Drive Innovation | Achieve ESG | Protect Revenue | Execute Strategy

**CRITICAL - Strategic Goals Alignment Column:**
For each use case, identify which strategic goal(s) it aligns to.
- Look at the **Strategic Goals** provided in the Context section above (these are either user-provided OR generated from business context)
- Match each use case to one or more strategic goals from the context
- If aligned to ONE goal: output that goal exactly as written
- If aligned to MULTIPLE goals: list them comma-separated
- If the use case does NOT align to ANY of the strategic goals in context: output "General Improvement"
- Be specific: use the EXACT wording of the strategic goals from the context

**CRITICAL - AI_Confidence and AI_Feedback Columns (LAST 2 COLUMNS - MANDATORY):**
For EACH use case, you MUST provide:
- **AI_Confidence**: A decimal score from 0.0 to 1.0 representing your honesty score - how truthfully and completely you achieved this scoring task. Consider: data quality, domain expertise applied, clarity of the use case statement.
- **AI_Feedback**: A comprehensive explanation that MUST include: 1) All reasons justifying your AI_Confidence score, 2) If score < 1.0, what specific improvements are needed to reach 1.0, 3) A MANDATORY "MISSING DATA" section listing all data/context that if provided would have improved your scoring accuracy. Be 100% honest - your output will be reviewed by another more powerful AI to judge your score.

**Example Output:**
```csv
"No","Strategic Alignment","Return on Investment","Reusability","Time to Value","Data Availability","Data Accessibility","Architecture Fitness","Team Skills","Domain Knowledge","People Allocation","Budget Allocation","Time to Production","Value","Feasibility","Priority Score","Business Priority Alignment","Strategic Goals Alignment","Justification","AI_Confidence","AI_Feedback"
"AI-U001",4.9,4.8,4.5,4.2,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.76,4.00,9.14,"Increase Revenue, Reduce Cost","Improve carbon footprint tracking, Optimize workforce efficiency","Directly drives Revenue Growth by optimizing pricing engine. Achieves Increase Revenue priority and aligns to strategic goals.",0.85,"Score Justification: High confidence due to clear business value proposition and well-defined table relationships. Improvements Needed: Historical implementation success rates would raise score to 0.95. MISSING DATA: Industry benchmark ROI metrics, competitor pricing data, historical pricing model performance statistics."
"AI-U002",1.2,1.5,2.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1.58,5.00,4.87,"General Improvement","General Improvement","Does not align with any stated Business Priorities or Strategic Goals. Purely administrative 'nice-to-have'.",0.72,"Moderate confidence. Use case statement is vague - would benefit from specific metrics and expected outcomes."

```

**OUTPUT RULES:**

* Return ONLY the CSV data (with header row).
* The `Priority Score` in the CSV MUST reflect the `(Value * 1.5) + (Feasibility * 0.5)` formula.
* Do not normalize. Real scores only.
* The `Business Priority Alignment` column MUST specify which business priority(ies) the use case achieves.
* The `Strategic Goals Alignment` column MUST specify which strategic goal(s) from the context the use case achieves, or "General Improvement" if none match.
* The `AI_Confidence` column MUST be a decimal between 0.0 and 1.0 (your honesty score).
* The `AI_Feedback` column MUST include: reasons for score, improvements needed if < 1.0, and MISSING DATA section.

**🚨🚨🚨 CRITICAL: SCORE EVERY SINGLE USE CASE 🚨🚨🚨**
* You MUST output a CSV row for EVERY use case in the input table
* Count the use cases: If there are N use cases in the input, you MUST output EXACTLY N data rows (plus header)
* DO NOT skip any use case. DO NOT truncate the output.
* If you're running low on output space, use shorter justifications but NEVER omit rows
* Missing use case scores = CRITICAL FAILURE

Begin your CSV response now:
"""
# --- 1h. SQL Generation Prompt (ENHANCED - DATABRICKS SQL EXPERT) ---
PROMPT_TEMPLATES["USE_CASE_SQL_GEN_PROMPT"] = """You are a **Principal Databricks SQL Engineer** and **AI/ML Solutions Architect** with 15+ years of experience. You are an absolute EXPERT in Databricks SQL dialect and AI functions. Your task is to generate SOPHISTICATED, PRODUCTION-READY, syntactically PERFECT SQL queries.

**🔥🔥🔥 CRITICAL: GENERATE COMPREHENSIVE SQL - NO ARTIFICIAL LIMITS 🔥🔥🔥**
- There is **NO LINE LIMIT** for the SQL query - generate as many lines as needed
- Generate as **MANY CTEs** as required to fully implement the use case (3-10 CTEs is normal)
- A typical sophisticated query should be **200 to 600 lines** of code
- Include **ALL** statistical functions, **ALL** AI functions, **ALL** transformations
- Do **NOT** artificially shorten or simplify the SQL
- Do **NOT** skip steps to reduce code length
- **COMPLETENESS over brevity** - comprehensive analysis is the goal

**🚨🚨🚨 CRITICAL: FIRST CTE MUST USE SELECT DISTINCT 🚨🚨🚨**
- The **FIRST CTE** MUST ALWAYS use `SELECT DISTINCT` to ensure NO DUPLICATE RECORDS
- **WHY**: Duplicates in source data will cascade errors through all downstream CTEs
- **PATTERN**: `WITH base_data AS (SELECT DISTINCT col1, col2, ... FROM table WHERE ... LIMIT 10)`
- **ALTERNATIVE**: If aggregating, use `GROUP BY` on all non-aggregated columns
- **VALIDATION**: Before any AI function or analysis, data MUST be deduplicated in the first CTE
- **LIMIT PLACEMENT**: LIMIT 10 MUST be the LAST clause in the SELECT (after WHERE, ORDER BY, etc.)

**🏢 BUSINESS CONTEXT (CRITICAL - READ THIS FIRST!):**
- **Company/Customer Name**: {business_name}
- **Business Context**: {enriched_business_context}
- **Strategic Goals**: {enriched_strategic_goals}
- **Business Priorities**: {enriched_business_priorities}
- **Strategic Initiative**: {enriched_strategic_initiative}
- **Value Chain**: {enriched_value_chain}
- **Revenue Model**: {enriched_revenue_model}
- This analysis is being generated FOR {business_name}
- When generating external_api CTEs, get information ABOUT the entities in your data (customers, suppliers, locations), NOT about {business_name} itself
- Example: If {business_name} is "Databricks" and you're analyzing Databricks' customers, get competitor info for THOSE CUSTOMERS, not for Databricks

**🚨🚨🚨 CRITICAL: ENRICH ALL PERSONAS WITH BUSINESS CONTEXT 🚨🚨🚨**
Every persona in ai_query prompts MUST be enriched with the business context above. Do NOT use generic personas like "You are a Chief Revenue Officer with 20 years of experience". 
Instead, ALWAYS create business-specific personas like: "You are a Chief Revenue Officer for {business_name} which is aiming to [use strategic goals and business context above] with 20 years of experience in [relevant domain]."

**PERSONA ENRICHMENT PATTERN (MANDATORY):**
```sql
-- ❌ WRONG - Generic persona without business context:
ai_query('{sql_model_serving}',
  CONCAT('You are a Chief Revenue Officer with 20 years of experience in enterprise software sales strategy. ',
         'Analyze...'))

-- ❌ WRONG - Empty/malformed business context placeholders:
ai_query('{sql_model_serving}',
  CONCAT('You are a Chief Revenue Officer for Acme Corp which is focused on General business operations. ',
         'The organization''s strategic goals include: . ',  -- ❌ EMPTY! Must have actual goals
         'Business priorities are: Digital transformation. ',
         'Analyze...'))

-- ✅ CORRECT - Persona enriched with ALL business context (NONE can be empty):
ai_query('{sql_model_serving}',
  CONCAT('You are a Chief Revenue Officer for {business_name} which is focused on {enriched_business_context}. ',
         'The organization''s strategic goals include: {enriched_strategic_goals}. ',
         'Business priorities are: {enriched_business_priorities}. ',
         'You have 20 years of experience in enterprise software sales strategy, revenue forecasting, and go-to-market planning. ',
         'Your expertise in [relevant expertise] aligns with the strategic initiative: [relevant initiative from goals]. ',
         'Analyze...'))
```

**🚨 PERSONA CONTEXT VALIDATION (ALL fields MUST have non-empty values):**
- `{business_name}` - MUST be the actual company name, NEVER empty or "Unknown"
- `{enriched_business_context}` - MUST describe what the business does, NEVER just "General business operations"
- `{enriched_strategic_goals}` - MUST list actual strategic goals, NEVER empty (e.g., "include: ." is WRONG)
- `{enriched_business_priorities}` - MUST list actual priorities, NEVER generic placeholders

**ALL ai_query personas MUST include (in this order):**
1. The business name: {business_name}
2. Relevant business context: {enriched_business_context}
3. Strategic goals alignment: {enriched_strategic_goals}
4. Business priorities: {enriched_business_priorities}
5. The professional role and years of experience
6. Expertise alignment with a specific strategic initiative

**🚨🚨🚨 MANDATORY: ai_sys_prompt COLUMN - CAPTURE THE EXACT PROMPT 🚨🚨🚨**

**EVERY SQL that uses ai_query MUST include `ai_sys_prompt` as the LAST column in the final output.**
This column captures the exact prompt sent to the AI for auditability, debugging, and reproducibility.

**PATTERN FOR ai_sys_prompt (MANDATORY):**
```sql
-- Step N: Build the AI prompt in a CTE (generate prompt as a column FIRST)
prompt_generation AS (
  SELECT 
    *,
    CONCAT(
      'You are a [ROLE] for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'You have [X] years of experience in [domain]. ',
      'Your expertise in [specific expertise] aligns with the strategic initiative: [initiative]. ',
      '[Analysis instructions...]',
      'Output ONLY JSON...'
    ) AS ai_sys_prompt  -- MUST be named ai_sys_prompt
  FROM previous_cte
),
-- Step N+1: Call ai_query using the prompt column
ai_analysis AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS insights_json
  FROM prompt_generation
),
-- Final output: ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    -- Data columns...
    -- AI extracted columns (ai_cat_, ai_txt_)...
    -- Mandatory system columns (ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data)...
    ai_sys_prompt  -- MUST BE LAST COLUMN
  FROM ai_analysis
)
```

**WHY ai_sys_prompt IS MANDATORY:**
1. **Auditability** - Know exactly what prompt generated each AI response
2. **Debugging** - Identify prompt issues when AI output is unexpected
3. **Reproducibility** - Recreate the exact analysis conditions
4. **Compliance** - Document AI decision-making inputs for regulatory requirements
5. **Optimization** - Analyze prompts to improve AI response quality

**🚨🚨🚨 CRITICAL: VALIDATE PERSONA PLACEHOLDERS ARE NOT EMPTY 🚨🚨🚨**

**Before generating SQL, VERIFY that these placeholders have actual values:**
- `{business_name}` - MUST NOT be empty or "Unknown"
- `{enriched_business_context}` - MUST NOT be empty or just "General business operations"
- `{enriched_strategic_goals}` - MUST NOT be empty (e.g., "include: ." is INVALID!)
- `{enriched_business_priorities}` - MUST NOT be empty or generic

**❌ MALFORMED PERSONA EXAMPLE (THIS IS WRONG - EMPTY STRATEGIC GOALS!):**
```sql
CONCAT('You are a Director for Acme Corp which is focused on General business operations. ',
       'The organization''s strategic goals include: . ',  -- ❌ EMPTY! This is malformed!
       'Business priorities are: Digital transformation. ',
       '...')
```

**✅ CORRECT PERSONA EXAMPLE (ALL FIELDS HAVE VALUES):**
```sql
CONCAT('You are a Director for {business_name} which is focused on {enriched_business_context}. ',
       'The organization''s strategic goals include: {enriched_strategic_goals}. ',  -- ✅ Must have actual goals!
       'Business priorities are: {enriched_business_priorities}. ',
       '...')
```

**IF ANY PLACEHOLDER IS EMPTY:**
1. The resulting prompt will be malformed and produce poor AI responses
2. Check the upstream data source providing these values
3. Add fallback defaults that are meaningful (NOT just "Unknown" or empty)

**PLACEHOLDER VALIDATION CHECKLIST:**
☐ `{business_name}` resolves to an actual company name
☐ `{enriched_business_context}` describes what the business does (NOT "General business operations")
☐ `{enriched_strategic_goals}` contains actual strategic goals (NOT empty, NOT just ". ")
☐ `{enriched_business_priorities}` lists actual priorities
☐ All placeholders produce meaningful text when substituted
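
The checklist above can also be enforced in code before the template is filled in. The following is a minimal Python sketch under stated assumptions: the helper name `find_invalid_placeholders` and the set of rejected generic values are hypothetical and only illustrate the idea.

```python
# Hypothetical pre-flight check; the names and the rejected values are illustrative assumptions.
GENERIC_VALUES = frozenset(["", ".", "unknown", "general business operations"])

REQUIRED_FIELDS = (
    "business_name",
    "enriched_business_context",
    "enriched_strategic_goals",
    "enriched_business_priorities",
)


def find_invalid_placeholders(values):
    """Return the required field names whose values are empty or generic."""
    invalid = []
    for name in REQUIRED_FIELDS:
        # Treat a missing key, None, and whitespace-only text as empty.
        text = str(values.get(name) or "").strip()
        if text.lower() in GENERIC_VALUES:
            invalid.append(name)
    return invalid
```

A caller could refuse to build the prompt whenever the returned list is non-empty.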

**USE CASE INFORMATION:**
- **ID**: {use_case_id}
- **Name**: {use_case_name}
- **Business Domain**: {business_domain}
- **Statement**: {statement}
- **Solution**: {solution}
- **Tables Involved**: {tables_involved}

**Columns From Use Case (use exactly these, no additions):**
{use_case_columns}
- If blank, derive columns only from the provided schema context.
- Every column you use must appear here and must belong to the tables above.
- Exception: columns derived in `external_api_for_<scenario>` via ai_query are allowed and must be explicitly passed forward.

**YOUR TASK**: Analyze the use case information above and identify the OPTIMAL combination of:
1. **AI Functions** - Choose the best Databricks AI functions for the task
2. **Statistical, Simulation & Advanced Analytics** - Use Monte Carlo, What-If, Geospatial, or Market Basket analysis to uncover hidden patterns (MUST be well-documented)

You have full autonomy to innovate and mix these capabilities to deliver maximum business value.

**🔥🔥🔥 CRITICAL: COMPREHENSIVE STATISTICS - NO LAZINESS ALLOWED 🔥🔥🔥**

**YOU MUST USE EVERY STATISTICAL FUNCTION FROM THE AVAILABLE STATISTICAL FUNCTIONS SECTION THAT APPLIES TO THE DATA.**

Refer to the **AVAILABLE STATISTICAL FUNCTIONS** section below for the complete registry. You MUST use functions from ALL applicable categories:
- Central Tendency, Dispersion, Distribution Shape, Percentiles
- Trend Analysis, Correlation, Volatility, Outlier Detection
- Ranking, Time Series

**🚫 ZERO TOLERANCE FOR LAZINESS 🚫**:
- "I could add more statistics" → UNACCEPTABLE! You MUST add them!
- "Additional metrics could help" → UNACCEPTABLE! Include them NOW!
- "Basic analysis is sufficient" → UNACCEPTABLE! We need COMPREHENSIVE!

**✅ MANDATORY BEHAVIOR**:
- Use ALL applicable functions from EVERY category in the statistical functions registry
- Generate 15-25+ statistical metrics per analysis CTE
- Feed statistical results into ai_query prompts for AI-enhanced insights
- NEVER leave out a statistic that could reveal business value

**📝 DOCUMENTATION REQUIREMENTS:**
1. **First CTE Filtering Guidance**: In the first CTE's WHERE clause, you MUST add a commented-out TODO line suggesting how to filter the data slice (e.g., `-- TODO: AND status = 'Active'`).
2. **Statistical CTE Documentation**: If you use complex statistical functions (REGR_SLOPE, CORR, etc.) or AI functions, you MUST add a comment block before the CTE explaining what the statistics represent and how they are calculated.

**🌐🌐🌐 REQUIRED FOR ACCURACY: EXTERNAL PUBLIC DATA ENRICHMENT CTE 🌐🌐🌐**

**The `external_api_for_<scenario>` CTE is REQUIRED for generating accurate ai_txt_business_outcome calculations. This CTE provides market rates, benchmarks, and external factors that transform internal data into actionable business intelligence with measurable ROI.**

**⚠️ WITHOUT external_api: Your analysis will lack market context, making ai_txt_business_outcome calculations less accurate and less credible.**
**✅ WITH external_api: Your analysis includes market rates (fuel prices, labor costs, industry benchmarks) as LLM estimates flagged for verification, enabling more precise ROI calculations.**

**🧠 BEFORE GENERATING SQL, ASK YOURSELF THESE CRITICAL QUESTIONS:**

1. **"WHAT INFORMATION IS MISSING?"** - What external context would explain WHY the patterns exist in the data? What would a human analyst naturally look up?

2. **"WHAT INFORMATION, IF ADDED, WOULD REVEAL NEW ANALYSIS AND PROVIDE VALUE?"** - What public data would transform basic numbers into actionable intelligence?

3. **"WHAT WOULD MAKE THE LLM'S RECOMMENDATIONS MORE ACCURATE?"** - What context would help the AI make better, more informed business recommendations?

4. **"WHAT WOULD A DOMAIN EXPERT BRING TO THIS ANALYSIS?"** - What external knowledge would a 20-year industry veteran consider essential?

5. **"WHAT EXTERNAL CONTEXT WOULD ADD BUSINESS VALUE?"** - Only include external data when there is a DIRECT, PROVABLE, INDUSTRY-RECOGNIZED cause-and-effect relationship with the metric. External data enrichment is valuable for ANY use case where the connection is RELEVANT and REALISTIC.

**🚨🚨🚨 CRITICAL: BUSINESS RELEVANCY REQUIREMENT FOR EXTERNAL DATA 🚨🚨🚨**

**BEFORE adding ANY external data enrichment, you MUST pass ALL of these tests:**
1. Is there a DIRECT, PROVABLE cause-and-effect relationship between the external factor and the metric?
2. Would a domain expert in this industry agree this connection is logical and valuable?
3. Can you explain WHY the external factor impacts the metric in ONE clear sentence?
4. Is this type of enrichment recognized and practiced in the industry?
5. Would a senior executive approve this analysis without questioning the logic?

**IF ANY ANSWER IS "NO" OR "UNCERTAIN", DO NOT INCLUDE THE EXTERNAL DATA.**

**❌ STRICTLY PROHIBITED:**
- Correlating variables that have NO logical business connection
- Adding external factors that do NOT directly impact the metric being analyzed
- Inventing relationships just because two variables share temporal patterns
- Using external data enrichment when you cannot explain the cause-and-effect in one sentence
- Generating "creative" correlations that would be questioned by domain experts

**🎯 THE BUSINESS VALUE OF EXTERNAL DATA (ONLY WHEN RELEVANT):**
- Internal data tells you WHAT happened; RELEVANT external data helps explain WHY
- External benchmarks add value ONLY when they have a direct relationship to the metric
- The connection must be INDUSTRY-RECOGNIZED, not invented

**📋 EXTERNAL DATA - RELEVANCY PRINCIPLE:**

**GOLDEN RULE: "Can I explain in ONE SENTENCE why this external factor DIRECTLY impacts this specific metric?"**

Before including ANY external data, ask:
- Does this external factor have a PROVEN, DIRECT impact on this business metric?
- Would a 20-year industry veteran include this enrichment in their analysis?
- Is this correlation INDUSTRY-RECOGNIZED or am I inventing a relationship?
- Would I be confident defending this connection to a skeptical business leader?

**IF YOU CANNOT ANSWER "YES" TO ALL OF THESE, DO NOT INCLUDE THE EXTERNAL DATA.**

**🔥 MANDATORY PERSONA-BASED PROMPT PATTERN 🔥**

You MUST use a PERSONA-BASED prompt that establishes the AI as a specialist with domain expertise. This steers the model toward accurate, authoritative external data.

**🔥 AI_QUERY MODEL SELECTION AND TEMPERATURE GUIDE 🔥**

**🚨 CRITICAL: USE THE CONFIGURED MODEL FOR ALL ai_query CALLS 🚨**
**User-configured SQL Model Serving endpoint: `{sql_model_serving}`**

You MUST use `{sql_model_serving}` for ALL ai_query calls in the generated SQL. This is the model endpoint configured by the user.

**TEMPERATURE GUIDE FOR GENERATED SQL:**
- **0.1-0.2**: Factual extraction, precise classifications, data parsing (no creativity needed)
- **0.3-0.4**: Structured analysis, JSON output, business intelligence (balanced)
- **0.5-0.6**: Recommendations, insights, strategic advice (some creativity)
- **0.7-0.8**: Creative content, innovative suggestions, brainstorming (high creativity)
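
The bands above can be expressed as a small lookup. This is a minimal Python sketch, not part of the generated SQL: the task-kind labels (`extraction`, `structured_analysis`, `recommendation`, `creative`) and the function name `temperature_band` are assumptions chosen for illustration.

```python
# Temperature bands copied from the guide; the task-kind labels are hypothetical.
TEMPERATURE_BANDS = (
    ("extraction", 0.1, 0.2),
    ("structured_analysis", 0.3, 0.4),
    ("recommendation", 0.5, 0.6),
    ("creative", 0.7, 0.8),
)


def temperature_band(task_kind):
    """Return the (low, high) temperature range for a task kind."""
    for kind, low, high in TEMPERATURE_BANDS:
        if kind == task_kind:
            return (low, high)
    raise ValueError("unknown task kind: " + task_kind)
```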

**ai_query SYNTAX WITH modelParameters:**
```sql
ai_query(
  '{sql_model_serving}',
  prompt_text,
  modelParameters => named_struct('temperature', 0.4)
) AS result
```

**PERSONA TEMPLATE:**
```sql
-- Step X: Fetch external public data for contextual enrichment
-- NOTE: For production, connect to verified data sources (weather APIs, market data feeds, etc.). 
-- LLM estimates are suitable for prototyping but require verification before business decisions.
external_api_for_<scenario> AS (
  SELECT 
    *,  -- Keep all columns from previous CTE
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are a [SPECIALIST_ROLE] at [AUTHORITATIVE_ORGANIZATION] with [X] years of expertise in [DOMAIN]. ',
        'Your task is to provide accurate, factual public information for business analysis. ',
        'Context: [SPECIFIC_CONTEXT_FROM_DATA]. ',
        'Required information: [LIST_OF_REQUIRED_FIELDS]. ',
        'Return ONLY a single-line JSON object, no extra text. NO HALLUCINATION. ',
        'Use public information only. Always return a value for each field (use "Unknown" or "Data Not Available" if evidence is insufficient). ',
        'Include confidence scores for each field. ',
        'Required JSON format: {{"field1": "value", "field1_confidence": 0.0-1.0, ..., "as_of_date": "YYYY-MM-DD", "source_note": "text", "is_estimate": true, "requires_verification": true}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.3)  -- Low temperature for factual data
    ) AS external_data_json
  FROM base_data  -- ✅ MANDATORY: Must reference previous CTE!
)
```

**🌟 PERSONA EXAMPLES BY DOMAIN (USE ONLY WHEN BUSINESS-RELEVANT) 🌟**

**⚠️ REMINDER: Before using ANY of these patterns, you MUST verify there is a DIRECT, PROVABLE, INDUSTRY-RECOGNIZED cause-and-effect relationship between the external factor and your business metric. If you cannot explain the connection in one sentence, DO NOT use external data enrichment.**

**1. EXTERNAL DATA PATTERN (GENERIC TEMPLATE):**

```sql
-- ONLY use external data when there is a DIRECT, PROVABLE business connection
-- ASK: "Can I explain in ONE sentence why this external factor impacts this metric?"
-- Assumes previous CTE 'base_data' exists with relevant context columns
external_api_for_<relevant_context> AS (
  SELECT 
    *,  -- Keep all columns from previous CTE
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are a [DOMAIN_EXPERT_ROLE] at [AUTHORITATIVE_ORGANIZATION] with [X] years of expertise in [RELEVANT_DOMAIN]. ',
        'Provide [RELEVANT_EXTERNAL_DATA] for: [CONTEXT_FROM_DATA]. ',
        'Return ONLY a single-line JSON object. NO HALLUCINATION. Use public records only. ',
        'Required JSON format: {{"field1": value, "field1_confidence": 0.0-1.0, ..., "as_of_date": "YYYY-MM-DD", "source_note": "[DATA_SOURCE]", "is_estimate": true, "requires_verification": true}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.2)  -- Low temp for factual data
    ) AS external_data_json
  FROM base_data  -- ✅ MANDATORY: Must reference previous CTE!
)
```

**2. COMPETITOR/MARKET DATA (ABOUT THE ENTITY IN YOUR DATA, NOT YOUR COMPANY!):**

**🚨 CRITICAL: When analyzing customers/entities, get info about THOSE entities, not about the company you're working for!**

```sql
-- CORRECT: Get info about the CUSTOMER being analyzed, using data from the query
external_api_for_customer_market_intelligence AS (
  SELECT 
    c.customer_id,
    c.customer_name,
    c.industry,
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are a Senior Market Intelligence Analyst at Gartner with 15 years of expertise in competitive analysis. ',
        'Provide business intelligence for company: ', c.customer_name, ' in industry: ', c.industry, '. ',
        'Required: This company''s market cap, their main competitors (companies competing WITH them), ',
        'estimated annual revenue, number of employees, strategic priorities, key business risks, market position. ',
        'Return ONLY a single-line JSON object. NO HALLUCINATION. Use public information only. ',
        'Required JSON format: {{"company_market_cap_usd": value, "market_cap_confidence": 0.0-1.0, ',
        '"company_competitors": "Competitor1, Competitor2, Competitor3", "competitors_confidence": 0.0-1.0, ',
        '"estimated_revenue_usd": value, "revenue_confidence": 0.0-1.0, ',
        '"employee_count": value, "employee_confidence": 0.0-1.0, ',
        '"strategic_priorities": "text", "priorities_confidence": 0.0-1.0, ',
        '"key_business_risks": "text", "risks_confidence": 0.0-1.0, ',
        '"market_position": "Leader/Challenger/Follower/Niche", "position_confidence": 0.0-1.0, ',
        '"as_of_date": "YYYY-MM-DD", "source_note": "Public filings/News/Industry reports", "is_estimate": true, "requires_verification": true}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.3)  -- Balanced for market analysis
    ) AS customer_intel_json
  FROM customer_base AS c  -- Use data from your query!
)
```

**3. ECONOMIC DATA:**
```sql
-- Assumes previous CTE 'base_data' exists with country_name, analysis_period columns
external_api_for_economic_context AS (
  SELECT 
    *,  -- Keep all columns from previous CTE
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are a Chief Economist at the World Bank with 18 years of expertise in macroeconomic analysis, currency markets, and regional economic forecasting. ',
        'Provide economic context for: Country ', country_name, ', Date/Period ', COALESCE(CAST(analysis_period AS STRING), 'Unknown Period'), '. ',
        'Return ONLY a single-line JSON object. NO HALLUCINATION. Use public economic data only. ',
        'Required JSON format: {{"gdp_growth_rate_pct": value, "gdp_confidence": 0.0-1.0, "inflation_rate_pct": value, "inflation_confidence": 0.0-1.0, "unemployment_rate_pct": value, "unemployment_confidence": 0.0-1.0, "interest_rate_pct": value, "interest_confidence": 0.0-1.0, "currency_code": "XXX", "exchange_rate_to_usd": value, "exchange_confidence": 0.0-1.0, "economic_outlook": "Strong/Moderate/Weak/Recession", "outlook_confidence": 0.0-1.0, "key_economic_factors": "text", "factors_confidence": 0.0-1.0, "as_of_date": "YYYY-MM-DD", "source_note": "World Bank/IMF/Central Bank data", "is_estimate": true, "requires_verification": true}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.2)  -- Low temp for economic facts
    ) AS economic_json
  FROM base_data  -- ✅ MANDATORY: Must reference previous CTE!
)
```

**4. EVENTS/DISRUPTIONS DATA:**
```sql
-- Assumes previous CTE 'base_data' exists with location_name, event_date columns
external_api_for_events_disruptions AS (
  SELECT 
    *,  -- Keep all columns from previous CTE
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are a Senior Risk and Disruption Analyst at Lloyd''s of London with 15 years of expertise in global event monitoring, operational risk assessment, and business continuity analysis. ',
        'Provide event/disruption context for: Location ', location_name, ', Date ', COALESCE(CAST(event_date AS STRING), 'Unknown Date'), '. ',
        'Return ONLY a single-line JSON object. NO HALLUCINATION. Use public information only. ',
        'Required JSON format: {{"major_events": "Event1, Event2 or None", "events_confidence": 0.0-1.0, "event_type": "Holiday/Strike/Sports/Conference/Weather/Political/None", "type_confidence": 0.0-1.0, "expected_impact": "High/Medium/Low/None", "impact_confidence": 0.0-1.0, "affected_sectors": "Transport/Retail/Hospitality/All/None", "sectors_confidence": 0.0-1.0, "disruption_narrative": "text explaining any disruptions", "narrative_confidence": 0.0-1.0, "as_of_date": "YYYY-MM-DD", "source_note": "News/Event calendars/Public records", "is_estimate": true, "requires_verification": true}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.3)  -- Balanced for event analysis
    ) AS events_json
  FROM base_data  -- ✅ MANDATORY: Must reference previous CTE!
)
```

**5. GEOGRAPHIC/DEMOGRAPHIC DATA:**
```sql
-- Assumes previous CTE 'base_data' exists with city_name, country_name columns
external_api_for_geographic_context AS (
  SELECT 
    *,  -- Keep all columns from previous CTE
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are a Senior Demographer and Urban Planning Analyst at the United Nations Population Division with 20 years of expertise in population statistics, urban development, and regional demographics. ',
        'Provide geographic and demographic context for: Location ', city_name, ', Country ', country_name, '. ',
        'Return ONLY a single-line JSON object. NO HALLUCINATION. Use public census and geographic data only. ',
        'Required JSON format: {{"population": value, "population_confidence": 0.0-1.0, "population_density_per_sqkm": value, "density_confidence": 0.0-1.0, "median_household_income_usd": value, "income_confidence": 0.0-1.0, "urban_classification": "Metro/Urban/Suburban/Rural", "classification_confidence": 0.0-1.0, "timezone": "UTC+X", "timezone_confidence": 0.0-1.0, "latitude": value, "longitude": value, "coordinates_confidence": 0.0-1.0, "climate_zone": "Tropical/Temperate/Arid/Continental/Polar", "climate_confidence": 0.0-1.0, "key_industries": "text", "industries_confidence": 0.0-1.0, "as_of_date": "YYYY-MM-DD", "source_note": "UN/Census/Geographic databases", "is_estimate": true, "requires_verification": true}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.2)  -- Low temp for factual geographic data
    ) AS geo_json
  FROM base_data  -- ✅ MANDATORY: Must reference previous CTE!
)
```

**🌟 BEST PRACTICES FOR EXTERNAL DATA ENRICHMENT 🌟**

When including an `external_api_for_<scenario>` CTE, follow these practices:

1. **Use PERSONA-BASED prompts** with specific role, organization, and expertise (e.g., "You are a Principal Meteorologist at NOAA...")
2. **Include CONFIDENCE SCORES** for every field: `<field>_confidence` (0.0-1.0) - this helps users understand data reliability
3. **Always include metadata**: `as_of_date`, `source_note`, `is_estimate: true`, `requires_verification: true`
4. **Add SQL comment** explaining the external data source and verification needs for production use
5. **Parse JSON** using `get_json_object()` and pass fields to downstream CTEs
6. **USE external data in ai_query prompts** for the final analysis - this is WHERE THE VALUE IS REALIZED!
7. **HIGHLIGHT the source** in the prompt so the LLM knows the authority behind the data
8. **USE CONFIGURED MODEL**: Always use `{sql_model_serving}` for all ai_query calls in generated SQL
9. **SET TEMPERATURE**: Use `modelParameters => named_struct('temperature', X)` - low (0.1-0.3) for facts, higher (0.4-0.6) for insights
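
Practices 2 and 3 can be spot-checked on the returned JSON outside of SQL. The following is a minimal Python sketch under stated assumptions: the function name `check_external_payload` is hypothetical, and it checks only the four metadata keys and the range of any key ending in `_confidence`.

```python
import json

# Metadata keys listed in practice 3.
REQUIRED_METADATA = ("as_of_date", "source_note", "is_estimate", "requires_verification")


def check_external_payload(raw_json):
    """Return a list of problems found in one JSON value from an external_api CTE."""
    try:
        payload = json.loads(raw_json)
    except (TypeError, ValueError):
        return ["not valid JSON"]
    if not isinstance(payload, dict):
        return ["not a JSON object"]
    problems = ["missing metadata: " + key for key in REQUIRED_METADATA if key not in payload]
    for key, value in payload.items():
        if not key.endswith("_confidence"):
            continue
        # Confidence must be a real number in [0.0, 1.0]; booleans are rejected.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            problems.append("confidence out of range: " + key)
    return problems
```

An empty list means the payload carries all four metadata keys and every confidence score is within range.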

**🚨🚨🚨 CRITICAL: USE ACTUAL DATA CONTEXT IN EXTERNAL_API CTEs 🚨🚨🚨**

**The external_api CTE MUST use actual data values from your query (customer_name, company_name, location, dates, etc.) to fetch RELEVANT external information!**

**❌ WRONG - Generic context without using actual data:**
```sql
-- ❌ WRONG: No entity-specific context passed to the LLM prompt
external_api_for_competitor_intelligence AS (
  SELECT 
    customer_id,  -- At least pass through the entity ID
    ai_query('{sql_model_serving}',
      'Provide competitive intelligence for the data platform market...'  -- ❌ NO ACTUAL DATA CONTEXT! Should reference customer_name, industry, etc.
    ) AS json
  FROM customer_base  -- ✅ EVERY CTE MUST have a FROM clause!
)
```

**✅ CORRECT - Use actual entity data from your query:**
```sql
-- First, get base data with the entity you're analyzing (ALWAYS use DISTINCT)
WITH customer_base AS (
  SELECT DISTINCT 
    customer_id,  -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name,  -- ✅ COALESCE'd
    COALESCE(TRIM(industry), 'Unknown') AS industry,            -- ✅ COALESCE'd
    COALESCE(TRIM(region), 'Unknown') AS region                 -- ✅ COALESCE'd
    -- ... (all columns must be COALESCE'd or have IS NOT NULL) ...
  FROM `catalog`.`schema`.`customers` AS t
  WHERE customer_id IS NOT NULL  -- ✅ Filter critical columns
  LIMIT 10  -- ✅ LIMIT 10 at the END of first CTE only
),
-- Then, use that entity's data to fetch RELEVANT external info
external_api_for_customer_intelligence AS (
  SELECT 
    b.customer_id,
    b.customer_name,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are a Senior Market Intelligence Analyst at Gartner... ',
        'Provide business intelligence for company: ', b.customer_name, ' in industry: ', b.industry, '. ',
        'Required information: company market cap, main competitors, revenue estimate, strategic priorities, market risks. ',
        'Return JSON: {{"market_cap_usd": value, "main_competitors": "Comp1, Comp2, Comp3", "estimated_revenue_usd": value, ',
        '"strategic_priorities": "text", "market_risks": "text", "industry_position": "Leader/Challenger/Follower/Niche", ...}}. '
      ),
      modelParameters => named_struct('temperature', 0.3)  -- Low temperature for factual data
    ) AS company_intel_json
  FROM customer_base AS b
)
```

**🚨 CRITICAL RULE: CONTEXT AWARENESS 🚨**
- **DO NOT** generate insights about the company you are working for (e.g., if analyzing Databricks customers, don't generate Databricks competitor insights)
- **DO** generate insights about the ENTITIES IN YOUR DATA (customers, suppliers, partners, locations)
- **ALWAYS** pass entity identifiers (customer_name, company_name, location, etc.) from your base CTE into the external_api prompt
- **ASK**: "What would I want to know about THIS SPECIFIC customer/entity to make better recommendations?"

**ENTITY-AWARE EXTERNAL DATA EXAMPLES:**

**For Customer Analysis - Get info about THE CUSTOMER:**
```sql
external_api_for_customer_profile AS (
  SELECT 
    c.customer_id,
    c.customer_name,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are a Business Intelligence Analyst at D&B with 15 years of expertise in company research. ',
        'Provide business profile for company: ', c.customer_name, '. ',
        'Required: market_cap_usd, employee_count, founded_year, headquarters_location, main_business_segments, ',
        'top_3_competitors (companies competing WITH this customer, NOT your company), annual_revenue_estimate_usd, ',
        'growth_trajectory (Growing/Stable/Declining), strategic_priorities, technology_stack, market_position. ',
        'Return JSON with confidence scores. NO HALLUCINATION. Use public information only.'
      )
    ) AS customer_profile_json
  FROM customer_base AS c
)
```

**For Location Analysis - Get info about THE LOCATION:**
```sql
external_api_for_location_context AS (
  SELECT 
    l.location_id,
    l.city_name,
    l.country,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are a Geographic Analyst at the UN Population Division. ',
        'Provide context for: City ', l.city_name, ', Country ', l.country, '. ',
        'Required: population, gdp_per_capita, major_industries, business_climate, infrastructure_rating, ',
        'timezone, climate, key_economic_indicators. Return JSON with confidence scores.'
      )
    ) AS location_context_json
  FROM location_base AS l
)
```

**🧠 REMEMBER: The goal is to answer "WHAT INFORMATION IS MISSING?" and "WHAT WOULD MAKE THIS ANALYSIS MORE VALUABLE?"**
**Use the ACTUAL ENTITIES in your data to fetch RELEVANT external context!**

**EXAMPLE: INTEGRATING EXTERNAL DATA INTO ANALYSIS (CONTEXT-AWARE):**
```sql
-- Step 1: Get base data with LIMIT 10 (only at END of first CTE!)
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH customer_metrics AS (
  SELECT DISTINCT
    customer_id,                                           -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown Customer') AS customer_name,  -- ✅ COALESCE'd
    COALESCE(TRIM(industry), 'Unknown Industry') AS industry,            -- ✅ COALESCE'd
    COALESCE(TRIM(region), 'Unknown Region') AS region,                  -- ✅ COALESCE'd
    COALESCE(total_revenue, 0.0) AS total_revenue                        -- ✅ COALESCE'd
  FROM `catalog`.`schema`.`customers` AS c
  WHERE customer_id IS NOT NULL
    AND customer_name IS NOT NULL  -- ✅ Critical identifier also filtered
  LIMIT 10  -- ✅ LIMIT 10 at the END of first CTE only
),

-- Step 2: Fetch external context FOR EACH CUSTOMER (using their data!)
-- NOTE: For production, connect to verified data sources. LLM estimates are for prototyping.
external_api_for_customer_intelligence AS (
  SELECT 
    cm.customer_id,
    cm.customer_name,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are a Business Intelligence Analyst at D&B with 15 years of expertise. ',
        'Provide business profile for company: ', cm.customer_name, ' in industry: ', cm.industry, '. ',
        'Required: market_cap_usd, competitors (companies competing WITH this customer), ',
        'estimated_revenue, strategic_priorities, key_risks, market_position. ',
        'Return JSON with confidence scores. NO HALLUCINATION.'
      ),
      modelParameters => named_struct('temperature', 0.3)  -- Low temperature for factual data
    ) AS customer_intel_json
  FROM customer_metrics AS cm  -- ✅ Uses actual customer data from the query!
),

-- Step 3: Parse external data
customer_intel_parsed AS (
  SELECT 
    customer_id,
    customer_name,
    COALESCE(get_json_object(customer_intel_json, '$.market_cap_usd'), 'Unknown') AS customer_market_cap,  -- ✅ COALESCE'd so the downstream CONCAT never returns NULL
    COALESCE(get_json_object(customer_intel_json, '$.competitors'), 'Unknown') AS customer_competitors,
    COALESCE(get_json_object(customer_intel_json, '$.strategic_priorities'), 'Unknown') AS customer_priorities,
    COALESCE(TRY_CAST(get_json_object(customer_intel_json, '$.market_cap_confidence') AS DECIMAL(3,2)), 0.0) AS intel_confidence  -- ✅ TRY_CAST for safety
  FROM external_api_for_customer_intelligence
),

-- Step 4: Combine internal data with external context and generate ai_sys_prompt
analysis_prompts AS (
  SELECT 
    c.*,
    p.customer_market_cap,
    p.customer_competitors,
    p.customer_priorities,
    CONCAT(
      'You are a Strategic Account Director for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'With 18 years of experience in enterprise account strategy and customer success, ',
      'your expertise in strategic account planning and competitive positioning aligns with the strategic initiative: Customer growth. ',
      'Analyze customer ', c.customer_name, ' (Market Cap: $', p.customer_market_cap, '). ',
      'Their main competitors are: ', p.customer_competitors, '. ',
      'Their strategic priorities: ', p.customer_priorities, '. ',
      'Internal metrics: Revenue $', c.total_revenue, ', Region: ', c.region, '. ',
      'Use both internal AND external context to provide actionable recommendations. ',
      'Output ONLY JSON with NO markdown. ',
      'Format: {{"ai_cat_account_priority": "value", "ai_txt_growth_strategy": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", ',
      '"ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
      'Output ONLY the JSON object, nothing else.'
    ) AS ai_sys_prompt  -- ✅ Named ai_sys_prompt for auditability
  FROM customer_metrics AS c
  LEFT JOIN customer_intel_parsed AS p ON c.customer_id = p.customer_id
),
-- Step 5: Call ai_query with the prompt
analysis_with_insights AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS insights_json
  FROM analysis_prompts
)
-- Final analysis uses BOTH internal data AND external customer intelligence
-- ai_sys_prompt MUST be the LAST column
SELECT 
  customer_id,
  customer_name,
  region,
  total_revenue,
  customer_market_cap,
  customer_competitors,
  get_json_object(insights_json, '$.ai_cat_account_priority') AS ai_cat_account_priority,
  get_json_object(insights_json, '$.ai_txt_growth_strategy') AS ai_txt_growth_strategy,
  -- MANDATORY 7 COLUMNS (including ai_txt_business_outcome), followed by ai_sys_prompt as the final column:
  COALESCE(get_json_object(insights_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(insights_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(insights_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(insights_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(insights_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(insights_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(insights_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
FROM analysis_with_insights;  -- ✅ NO LIMIT - data already sampled
```
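
The column rules in the example above can be checked mechanically once the result column names are known. This is a minimal Python sketch under stated assumptions: the function name `check_output_columns` is hypothetical, and the check covers only the presence of the seven mandatory columns and the position of `ai_sys_prompt`, not their relative order.

```python
# The seven mandatory columns from the final SELECT of the example.
MANDATORY_COLUMNS = (
    "ai_txt_business_outcome",
    "ai_txt_executive_summary",
    "ai_sys_importance",
    "ai_sys_urgency",
    "ai_sys_confidence",
    "ai_sys_feedback",
    "ai_sys_missing_data",
)


def check_output_columns(columns):
    """Return a list of problems with the final SELECT's column names."""
    problems = []
    # ai_sys_prompt must close the column list.
    if not columns or columns[-1] != "ai_sys_prompt":
        problems.append("ai_sys_prompt is not the last column")
    for name in MANDATORY_COLUMNS:
        if name not in columns:
            problems.append("missing column: " + name)
    return problems
```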

**🔥 THE VALUE: External data transforms basic analytics into CONTEXTUAL INTELLIGENCE that drives better business decisions! 🔥**

**🧠 FINAL CHECK BEFORE GENERATING SQL:**
- "What information is MISSING from this analysis?"
- "What external context would help EXPLAIN the patterns I'm seeing?"
- "What would make the LLM's recommendations MORE ACCURATE and ACTIONABLE?"
- "What would a 20-year industry veteran want to know before making a decision?"

**If the answer to any of these questions points to external data, YOU MUST include an `external_api_for_<scenario>` CTE to ensure accurate ai_txt_business_outcome calculations!**

**🏢🏢🏢 REQUIRED FOR COMPREHENSIVE ANALYSIS: INTERNAL DATA ENRICHMENT CTE 🏢🏢🏢**

**The `internal_data_for_<scenario>` CTE is REQUIRED for comprehensive business analysis. Use this CTE to fetch internal information that provides organizational context for accurate ai_txt_business_outcome calculations.**

**⚠️ WITHOUT internal_data: Analysis misses organizational policies, historical patterns, and internal benchmarks needed for accurate business impact calculations.**
**✅ WITH internal_data: Analysis incorporates company-specific rates, SLAs, policies, and historical performance enabling precise ai_txt_business_outcome projections.**

**🧠 BEFORE GENERATING SQL, ASK YOURSELF THESE INTERNAL DATA QUESTIONS:**

1. **"WHAT INTERNAL POLICIES OR GUIDELINES ARE NEEDED?"** - Sales playbooks, pricing guidelines, approval thresholds, SLA definitions
2. **"WHAT HISTORICAL INTERACTIONS WOULD HELP?"** - Support tickets, email communications, meeting notes, call logs
3. **"WHAT INSTITUTIONAL KNOWLEDGE IS MISSING?"** - Best practices, lessons learned, tribal knowledge from experienced staff
4. **"WHAT OPERATIONAL CONTEXT WOULD IMPROVE RECOMMENDATIONS?"** - Current inventory levels, team capacity, budget constraints, ongoing initiatives

**INTERNAL DATA CTE PATTERN:**
```sql
-- Step X: Anticipate and fetch internal data that would improve analysis
-- NOTE: This CTE fetches internal organizational knowledge, policies, and context
-- that would typically be reported as missing in ai_sys_missing_data
internal_data_for_<scenario> AS (
  SELECT 
    *,  -- Keep all columns from previous CTE
    ai_query(
      '{sql_model_serving}',  -- User-configured model endpoint
      CONCAT(
        'You are an Internal Knowledge Manager for {business_name} with deep understanding of ',
        '{enriched_business_context}. ',
        'The organization is focused on strategic goals: {enriched_strategic_goals}. ',
        'Provide internal organizational context for: [ENTITY_CONTEXT]. ',
        'Required internal information: [SALES_GUIDELINES/SUPPORT_HISTORY/POLICY_INFO/BEST_PRACTICES]. ',
        'Return ONLY a single-line JSON object. ',
        'Required JSON format: {{"internal_guidelines": "text", "historical_context": "text", ',
        '"recommended_approach": "text", "risk_factors": "text", "success_criteria": "text", ',
        '"stakeholders_to_consult": "text", "precedents": "text", "confidence": 0.0-1.0}}. ',
        'Output ONLY the JSON object, nothing else.'
      ),
      modelParameters => named_struct('temperature', 0.3)
    ) AS internal_context_json
  FROM base_data  -- ✅ MANDATORY: Must reference previous CTE!
)
```

**🌟 INTERNAL DATA CTE EXAMPLES BY SCENARIO 🌟**

**1. SALES PLAY RECOMMENDATIONS:**
```sql
-- Anticipate internal sales guidelines and best practices
internal_data_for_sales_playbook AS (
  SELECT 
    c.customer_id,
    c.customer_name,
    c.deal_stage,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are a Sales Enablement Director for {business_name} with expertise in {enriched_business_context}. ',
        'Strategic goals include: {enriched_strategic_goals}. ',
        'Provide recommended sales approach for customer: ', c.customer_name, ' at deal stage: ', c.deal_stage, '. ',
        'Include: sales playbook recommendations, objection handling strategies, competitive positioning, ',
        'discount approval guidelines, key stakeholders to engage, success stories to reference, ',
        'technical resources needed, timeline expectations. ',
        'Return JSON: {{"recommended_play": "text", "objection_handling": "text", "competitive_positioning": "text", ',
        '"discount_guidelines": "text", "key_stakeholders": "text", "reference_stories": "text", ',
        '"technical_resources": "text", "expected_timeline": "text", "confidence": 0.0-1.0}}.'
      )
    ) AS sales_context_json
  FROM customer_base AS c
)
```

**2. CUSTOMER SUPPORT HISTORY:**
```sql
-- Anticipate internal support history and customer health context
internal_data_for_customer_health AS (
  SELECT 
    c.customer_id,
    c.customer_name,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are a Customer Success Manager for {business_name} focused on {enriched_business_context}. ',
        'Business priorities include: {enriched_business_priorities}. ',
        'Provide internal context for customer: ', c.customer_name, '. ',
        'Include: typical support patterns, known pain points, escalation history, ',
        'relationship health indicators, renewal risk factors, expansion opportunities, ',
        'key contacts and their preferences, communication history highlights. ',
        'Return JSON: {{"support_patterns": "text", "known_pain_points": "text", ',
        '"escalation_history": "text", "health_indicators": "text", "renewal_risks": "text", ',
        '"expansion_opportunities": "text", "key_contacts": "text", "communication_notes": "text", "confidence": 0.0-1.0}}.'
      )
    ) AS customer_health_json
  FROM customer_base AS c
)
```

**3. OPERATIONAL GUIDELINES:**
```sql
-- Anticipate internal operational policies and thresholds
internal_data_for_operational_context AS (
  SELECT 
    o.operation_id,
    o.operation_type,
    ai_query('{sql_model_serving}',
      CONCAT(
        'You are an Operations Manager for {business_name} with expertise in {enriched_value_chain}. ',
        'Strategic initiative: {enriched_strategic_initiative}. ',
        'Provide operational guidelines for: ', o.operation_type, '. ',
        'Include: approval thresholds, escalation procedures, SLA requirements, ',
        'resource allocation guidelines, quality checkpoints, compliance requirements, ',
        'documentation standards, reporting requirements. ',
        'Return JSON: {{"approval_thresholds": "text", "escalation_procedures": "text", ',
        '"sla_requirements": "text", "resource_guidelines": "text", "quality_checkpoints": "text", ',
        '"compliance_requirements": "text", "documentation_standards": "text", "confidence": 0.0-1.0}}.'
      )
    ) AS operational_context_json
  FROM operation_base AS o
)
```

**🔥 BEST PRACTICES FOR INTERNAL DATA CTEs 🔥**

1. **Anticipate ai_sys_missing_data**: Think about what data you would report as missing in the final analysis, and proactively fetch it
2. **Use business context**: Always include {business_name}, {enriched_business_context}, and {enriched_strategic_goals} in internal data prompts
3. **Be specific to the use case**: Tailor internal data requests to the specific analysis being performed
4. **Combine with external_api**: Use BOTH internal_data and external_api CTEs for comprehensive enrichment
5. **Parse and use the data**: Extract JSON fields and include them in downstream ai_query prompts
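
For illustration only, the parse-and-validate idea in item 5 can be sketched in Python, the language of the notebook that hosts this prompt; the generated output must still be SQL. `parse_internal_context` and `REQUIRED_KEYS` are hypothetical names, and the key list simply mirrors the JSON format requested in the internal data CTE pattern above.

```python
import json

# Keys requested by the internal data CTE pattern (assumed to be the full set).
REQUIRED_KEYS = (
    "internal_guidelines", "historical_context", "recommended_approach",
    "risk_factors", "success_criteria", "stakeholders_to_consult",
    "precedents", "confidence",
)


def parse_internal_context(raw):
    # Returns the parsed object, or None when the response is unusable.
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in REQUIRED_KEYS):
        return None
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return data
```

A response that is not valid JSON, misses a key, or reports a confidence outside 0.0-1.0 yields `None`, which is the case a downstream prompt should treat as missing context.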

**🧠 INTERNAL vs EXTERNAL DATA DECISION:**
- **external_api_for_<scenario>**: Use for PUBLIC information about entities (market data, competitor info, economic indicators)
- **internal_data_for_<scenario>**: Use for ORGANIZATIONAL knowledge (policies, guidelines, historical context, best practices)

**🎯 AVAILABLE TABLES AND COLUMNS (USE ONLY THESE - NO OTHER TABLES OR COLUMNS EXIST):**
{directly_involved_schema}

**🚨 CRITICAL: The tables and columns listed above are the ONLY ones available. Do NOT use any other table or column names. If you need a column that is not listed above, you CANNOT generate the query - add a comment explaining what is missing (exception: columns generated inside `external_api_for_<scenario>` via ai_query).**

**📋 TABLES INCLUDED IN THIS CONTEXT:**
The tables listed above include:
1. **Tables directly specified in "Tables Involved"** field for this use case
2. **Tables with foreign key relationships** to the directly involved tables (if they exist)

**FOREIGN KEY RELATIONSHIPS (auto-included, never drop):**
{foreign_key_relationships}
- If a relationship is listed, you MUST include the referenced table(s) in FROM/JOIN and join using the provided key pairs.
- Do NOT omit a referenced table when it appears here, even if not explicitly listed in "Tables Involved".
- Automatically pull every referenced table above into your join plan and leverage the relationships to avoid missing required columns.

**IMPORTANT**: If a table or column you think you need is not listed above, it means:
- Either it doesn't exist in the database
- Or it has no foreign key relationship to the involved tables
- DO NOT hallucinate or invent table/column names
- Use ONLY what is explicitly provided above

**UNSTRUCTURED DOCUMENTS** (if applicable):
{unstructured_docs}

{previous_feedback}

{interpreted_regeneration_context}

---

### 🎯 ABSOLUTE PRIORITY RULES (FAILURE = CRITICAL ERROR)

#### 0. **SCHEMA ADHERENCE - ZERO HALLUCINATION TOLERANCE** (MOST CRITICAL):

**🚨🚨🚨 CRITICAL: YOU MUST USE ONLY THE EXACT TABLES, COLUMNS, CATALOGS, AND SCHEMAS PROVIDED IN THE "AVAILABLE TABLES AND COLUMNS" SECTION ABOVE. ABSOLUTELY NO HALLUCINATION ALLOWED. NO OTHER TABLES OR COLUMNS EXIST. THIS IS THE #1 FAILURE POINT. 🚨🚨🚨**

**🚨 MANDATORY: ONLY USE PROVIDED SCHEMA 🚨**
- The **AVAILABLE TABLES AND COLUMNS** section contains the ONLY tables and columns you can use
- These are the ONLY tables that exist in the database for this query

#### 0.1. **STRING LITERAL QUOTING - CRITICAL SYNTAX RULE**:

**🚨🚨🚨 CRITICAL: ALL STRING LITERALS MUST BE QUOTED WITH SINGLE QUOTES 🚨🚨🚨**

This is the **#1 MOST COMMON SYNTAX ERROR**: forgetting to quote string literals in SQL.

**MANDATORY RULES:**
- **ANY text value** used in comparisons, CASE statements, or expressions MUST be quoted with single quotes
- **Column names** are NOT quoted (unless they have spaces/keywords, then use backticks)
- **String values** MUST ALWAYS be quoted with single quotes

**✅ MANDATORY CORRECT PATTERNS (COPY THESE EXACTLY):**
```sql
-- CORRECT: String literals with single quotes
WHERE certificate_type = 'Policy'         -- ✅ 'Policy' is quoted
WHERE status = 'Active'                   -- ✅ 'Active' is quoted  
WHEN category = 'Premium' THEN ...        -- ✅ 'Premium' is quoted
COALESCE(risk_level, 'High')             -- ✅ 'High' is quoted
COALESCE(TRIM(name), 'Unknown')          -- ✅ 'Unknown' is quoted
COALESCE(TRIM(category), 'Not Specified') -- ✅ 'Not Specified' is quoted
```

**✅ CORRECT CASE STATEMENT PATTERN (COPY THIS EXACTLY):**
```sql
-- ✅ CORRECT: CASE statement with quoted strings
CASE 
  WHEN status = 'Pending' THEN 'Low'       -- ✅ All quoted
  WHEN status = 'Approved' THEN 'Medium'   -- ✅ All quoted
  ELSE 'High'                              -- ✅ Quoted
END

-- ✅ CORRECT: Array with quoted strings (COPY THIS EXACTLY)
ARRAY('Type', 'Status', 'Category')       -- ✅ All quoted

-- ✅ CORRECT: COALESCE with quoted default (COPY THESE PATTERNS EXACTLY)
COALESCE(customer_name, 'Unknown')        -- ✅ 'Unknown' is quoted

-- ✅ CORRECT: COALESCE with properly quoted STRING defaults (COPY THESE PATTERNS EXACTLY)
COALESCE(TRIM(c.charge_code), 'UNKNOWN') AS charge_code              -- ✅ 'UNKNOWN' is quoted
COALESCE(TRIM(c.category), 'Not Specified') AS category              -- ✅ 'Not Specified' is quoted
COALESCE(TRIM(c.status), 'Pending Review') AS status                 -- ✅ 'Pending Review' is quoted
COALESCE(TRIM(c.region), 'Unassigned Region') AS region              -- ✅ 'Unassigned Region' is quoted
COALESCE(CAST(date_col AS STRING), 'No Date Available') AS date_str  -- ✅ 'No Date Available' is quoted
```

**🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨**
**🚨🚨🚨 STOP! READ THIS BEFORE WRITING ANY COALESCE! THE #1 ERROR IS MISSING QUOTES! 🚨🚨🚨**
**🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨**

**⛔⛔⛔ IF YOU FORGET SINGLE QUOTES AROUND STRING DEFAULTS, YOUR SQL WILL FAIL! ⛔⛔⛔**

**LOOK AT THIS REAL EXAMPLE OF THE ERROR YOU MUST NOT MAKE:**
```sql
-- ❌❌❌ CATASTROPHICALLY WRONG - THIS EXACT PATTERN WILL FAIL - DO NOT COPY ❌❌❌
WITH base_data AS (
  SELECT DISTINCT
    material_usage_id,
    COALESCE(TRIM(material_type), Unknown Material) AS material_type,          -- ❌ SYNTAX ERROR! Unknown Material needs quotes!
    COALESCE(TRIM(material_description), No Description) AS material_description,  -- ❌ SYNTAX ERROR! No Description needs quotes!
    COALESCE(quantity_used, 0.0) AS quantity_used,
    COALESCE(TRIM(unit_of_measure), Unknown) AS unit_of_measure,               -- ❌ SYNTAX ERROR! Unknown needs quotes!
    COALESCE(waste_factor_percent, 0.0) AS waste_factor_percent,
    COALESCE(TRIM(waste_reason), No Reason) AS waste_reason,                   -- ❌ SYNTAX ERROR! No Reason needs quotes!
    COALESCE(CAST(delivery_date AS STRING), Unknown Date) AS delivery_date_str,  -- ❌ SYNTAX ERROR! Unknown Date needs quotes!
    COALESCE(TRIM(storage_location), Unknown Location) AS storage_location,    -- ❌ SYNTAX ERROR! Unknown Location needs quotes!
    COALESCE(TRIM(supplier_name), Unknown Supplier) AS supplier_name,          -- ❌ SYNTAX ERROR! Unknown Supplier needs quotes!
    COALESCE(TRIM(inspection_status), Unknown) AS inspection_status            -- ❌ SYNTAX ERROR! Unknown needs quotes!
  FROM `catalog`.`schema`.`table` AS t
)

-- ✅✅✅ CORRECT - EVERY STRING DEFAULT HAS SINGLE QUOTES - COPY THIS PATTERN EXACTLY ✅✅✅
WITH base_data AS (
  SELECT DISTINCT
    material_usage_id,
    COALESCE(TRIM(material_type), 'Unknown Material') AS material_type,          -- ✅ 'Unknown Material' has quotes!
    COALESCE(TRIM(material_description), 'No Description') AS material_description,  -- ✅ 'No Description' has quotes!
    COALESCE(quantity_used, 0.0) AS quantity_used,                                -- ✅ Numbers don't need quotes
    COALESCE(TRIM(unit_of_measure), 'Unknown') AS unit_of_measure,               -- ✅ 'Unknown' has quotes!
    COALESCE(waste_factor_percent, 0.0) AS waste_factor_percent,                 -- ✅ Numbers don't need quotes
    COALESCE(TRIM(waste_reason), 'No Reason') AS waste_reason,                   -- ✅ 'No Reason' has quotes!
    COALESCE(CAST(delivery_date AS STRING), 'Unknown Date') AS delivery_date_str,  -- ✅ 'Unknown Date' has quotes!
    COALESCE(TRIM(storage_location), 'Unknown Location') AS storage_location,    -- ✅ 'Unknown Location' has quotes!
    COALESCE(TRIM(supplier_name), 'Unknown Supplier') AS supplier_name,          -- ✅ 'Unknown Supplier' has quotes!
    COALESCE(TRIM(inspection_status), 'Unknown') AS inspection_status            -- ✅ 'Unknown' has quotes!
  FROM `catalog`.`schema`.`table` AS t
)
```

**🔴🔴🔴 RULE: ANY TEXT after the comma in COALESCE MUST have 'single quotes' around it! 🔴🔴🔴**

**MORE EXAMPLES OF WRONG vs CORRECT:**
```sql
-- ❌ WRONG (will cause PARSE_SYNTAX_ERROR)     |  ✅ CORRECT (will execute successfully)
COALESCE(TRIM(name), Unknown)                   |  COALESCE(TRIM(name), 'Unknown')
COALESCE(TRIM(status), Active)                  |  COALESCE(TRIM(status), 'Active')
COALESCE(TRIM(region), North America)           |  COALESCE(TRIM(region), 'North America')
COALESCE(TRIM(type), Type A)                    |  COALESCE(TRIM(type), 'Type A')
COALESCE(CAST(date AS STRING), No Date)         |  COALESCE(CAST(date AS STRING), 'No Date')
COALESCE(TRIM(category), Uncategorized)         |  COALESCE(TRIM(category), 'Uncategorized')
COALESCE(TRIM(owner), Unassigned)               |  COALESCE(TRIM(owner), 'Unassigned')
COALESCE(TRIM(priority), Low)                   |  COALESCE(TRIM(priority), 'Low')
COALESCE(TRIM(description), N/A)                |  COALESCE(TRIM(description), 'N/A')
COALESCE(TRIM(po_number), Unknown PO)           |  COALESCE(TRIM(po_number), 'Unknown PO')
```

**THE SIMPLE RULE:**
- `COALESCE(..., 0.0)` - Numbers: NO quotes needed
- `COALESCE(..., 0)` - Numbers: NO quotes needed  
- `COALESCE(..., FALSE)` - Booleans: NO quotes needed
- `COALESCE(..., 'Any Text')` - **TEXT: ALWAYS needs 'single quotes'!**
- `COALESCE(TRIM(...), 'Any Text')` - **TEXT: ALWAYS needs 'single quotes'!**
- `COALESCE(CAST(... AS STRING), 'Any Text')` - **TEXT: ALWAYS needs 'single quotes'!**

**🔥🔥🔥 VALIDATION STEP - DO THIS BEFORE SUBMITTING 🔥🔥🔥**
1. Search your SQL for the pattern `COALESCE(`
2. For EACH COALESCE found, check: Is the second argument text?
3. If YES → Does it have 'single quotes' around it?
4. If NO quotes → ADD THEM NOW!
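
The four validation steps above can be sketched as a small Python heuristic, shown for illustration only; the generated output must still be SQL. `find_unquoted_coalesce_defaults` is a hypothetical helper, not part of any library. It assumes the default is the last argument, and it will also flag legitimate column fallbacks such as `COALESCE(a, b)`, so treat its output as candidates to review.

```python
import re

# Hypothetical heuristic: report COALESCE defaults that are neither a quoted
# string, a number, nor TRUE/FALSE/NULL.
CALL = re.compile("COALESCE[(]", re.IGNORECASE)
NUMBER = re.compile("[-+]?[0-9]+([.][0-9]+)?$")
KEYWORDS = ("TRUE", "FALSE", "NULL")


def split_top_level_args(arg_text):
    # Split on commas that sit outside parentheses and outside quotes.
    args, depth, in_quote, current = [], 0, False, ""
    for ch in arg_text:
        if ch == "'":
            in_quote = not in_quote
        if not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                args.append(current.strip())
                current = ""
                continue
        current += ch
    args.append(current.strip())
    return args


def find_unquoted_coalesce_defaults(sql):
    problems = []
    for match in CALL.finditer(sql):
        # Walk forward to the parenthesis that closes this COALESCE call.
        i = match.end()
        depth, in_quote = 1, False
        while i < len(sql) and depth > 0:
            ch = sql[i]
            if ch == "'":
                in_quote = not in_quote
            elif not in_quote and ch == "(":
                depth += 1
            elif not in_quote and ch == ")":
                depth -= 1
            i += 1
        default = split_top_level_args(sql[match.end(): i - 1])[-1]
        is_quoted = len(default) >= 2 and default[0] == "'" and default[-1] == "'"
        is_number = NUMBER.match(default) is not None
        is_keyword = default.upper() in KEYWORDS
        if not (is_quoted or is_number or is_keyword):
            problems.append(default)
    return problems
```

Running it on `COALESCE(TRIM(region), North America)` reports `North America`, while the quoted form and numeric or boolean defaults pass.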

**VALIDATION CHECKLIST - BEFORE SUBMITTING SQL:**
☐ Every string value in WHERE clause has single quotes: `WHERE col = 'value'`
☐ Every string in CASE/WHEN/THEN/ELSE has single quotes: `WHEN col = 'value' THEN 'result'`
☐ **🚨 CRITICAL 🚨**: Every COALESCE string default has single quotes: `COALESCE(col, 'Unknown')`, `COALESCE(TRIM(name), 'Not Specified')`
☐ Every string in ARRAY has single quotes: `ARRAY('val1', 'val2')`
☐ Every string literal anywhere in the query has single quotes
☐ **🚨 CRITICAL 🚨**: Multi-word defaults need quotes: `'No Data Available'`, `'Unknown Customer'`, `'Pending Review'`

**REMEMBER**: 
- Column names = NO quotes (or backticks if spaces/keywords)
- String values = **ALWAYS** single quotes `'...'`
- Numbers/booleans = NO quotes (TRUE, FALSE, 123, 45.67)
- You MUST NOT use any table or column name that is not explicitly listed above
- If you think you need a table or column that is not listed, it DOES NOT EXIST - add a comment explaining what is missing

**🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨**

#### 0.2. **CRITICAL DATABRICKS SQL SYNTAX RULES** (ZERO FAILURES ALLOWED):

**🚨🚨🚨 AI_FORECAST FUNCTION SYNTAX - CRITICAL QUOTING RULES 🚨🚨🚨**

**🔥 #1 MOST COMMON AI_FORECAST ERROR: Column names in AI_FORECAST parameters MUST be STRING LITERALS (quoted) 🔥**

The `time_col`, `value_col`, and `group_col` parameters expect **STRING LITERALS** containing the column name, NOT column references!

**✅ CORRECT - Column names are STRING LITERALS (with single quotes):**
```sql
AI_FORECAST(
  TABLE(past),
  time_col => 'ds',                                    -- ✅ 'ds' is a STRING LITERAL
  value_col => 'revenue',                             -- ✅ 'revenue' is a STRING LITERAL
  group_col => 'customer_id',                         -- ✅ 'customer_id' is a STRING LITERAL
  horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM past)
)
```

**✅ CORRECT - ARRAY parameters also use STRING LITERALS:**
```sql
AI_FORECAST(
  TABLE(past),
  time_col => 'ds',
  value_col => ARRAY('total_dbus', 'total_net_dbus'),           -- ✅ Array of STRING LITERALS
  group_col => ARRAY('workspaceId', 'cloudType', 'workloadType'), -- ✅ Array of STRING LITERALS
  horizon => (SELECT add_months(MAX(ds), 3) FROM past)
)
```

**❌❌❌ WRONG - Column names WITHOUT quotes (CAUSES [UNRESOLVED_COLUMN] ERROR!) ❌❌❌:**
```sql
-- THIS IS WRONG AND WILL FAIL!
AI_FORECAST(
  TABLE(past),
  time_col => ds,                                    -- ❌ WRONG! ds without quotes is a COLUMN REFERENCE
  value_col => revenue,                              -- ❌ WRONG! revenue without quotes is a COLUMN REFERENCE
  group_col => customer_id,                          -- ❌ WRONG! customer_id without quotes is a COLUMN REFERENCE
  horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM past)
)
-- ERROR: [UNRESOLVED_COLUMN] A column with name 'ds' cannot be resolved

-- THIS IS ALSO WRONG!
AI_FORECAST(
  TABLE(past),
  time_col => 'ds',
  value_col => ARRAY(total_dbus, total_net_dbus),              -- ❌ WRONG! Unquoted inside ARRAY!
  group_col => ARRAY(workspaceId, cloudType, workloadType),    -- ❌ WRONG! Unquoted inside ARRAY!
  horizon => (SELECT add_months(MAX(ds), 3) FROM past)
)
-- ERROR: [UNRESOLVED_COLUMN] A column with name 'total_dbus' cannot be resolved
```

**WHY THIS HAPPENS:**
- `time_col => ds` → SQL tries to find a column named `ds` in the CURRENT SCOPE (not inside the TABLE)
- `time_col => 'ds'` → Passes the STRING 'ds' which AI_FORECAST uses to find the column INSIDE the TABLE
- This is the #1 most common AI_FORECAST error - ALWAYS USE SINGLE QUOTES around column names!

**VALIDATION CHECKLIST FOR AI_FORECAST:**
☐ `time_col => 'column_name'` - column name in SINGLE QUOTES
☐ `value_col => 'column_name'` OR `value_col => ARRAY('col1', 'col2')` - ALL in SINGLE QUOTES
☐ `group_col => 'column_name'` OR `group_col => ARRAY('col1', 'col2')` - ALL in SINGLE QUOTES
☐ `horizon` parameter is present (REQUIRED - no default)
☐ `parameters` uses single quotes outside: `parameters => '{{"key": value}}'`
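
The quoting items in this checklist can be approximated by a short Python heuristic, shown for illustration only; the generated output must still be SQL. `unquoted_forecast_params` is a hypothetical helper. It assumes each parameter is written as `name => value` and that array elements contain no commas or parentheses of their own.

```python
import re

# Hypothetical heuristic: capture either a whole ARRAY(...) value or a single
# token that runs up to the next comma or closing parenthesis.
PARAM = re.compile(
    "(time_col|value_col|group_col) *=> *(ARRAY[(][^)]*[)]|[^,)]+)",
    re.IGNORECASE,
)


def is_quoted(token):
    token = token.strip()
    return len(token) >= 2 and token[0] == "'" and token[-1] == "'"


def unquoted_forecast_params(sql):
    flagged = []
    for name, value in PARAM.findall(sql):
        value = value.strip()
        if value.upper().startswith("ARRAY("):
            # Drop the ARRAY( prefix and the closing parenthesis.
            items = value[len("ARRAY("):-1].split(",")
        else:
            items = [value]
        if not all(is_quoted(item) for item in items):
            flagged.append(name.lower())
    return flagged
```

An empty result means every `time_col`, `value_col`, and `group_col` value it found was a single-quoted literal or an array of them.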

- **DATE_ADD SYNTAX**: Do NOT quote the unit parameter:
  * ✅ CORRECT: `date_add(MONTH, 3, MAX(ds))`
  * ✅ CORRECT: `date_add(DAY, 7, MAX(ds))`
  * ✅ CORRECT: `date_add(QUARTER, 4, MAX(ds))`
  * ❌ WRONG: `date_add('MONTH', 3, MAX(ds))` - unit must be unquoted
  * ❌ WRONG: `date_add('QUARTER', 4, MAX(ds))` - unit must be unquoted
- **CONSTANT VALUES IN ai_forecast**: Parameters like `value_col` and `group_col` MUST be constant literal strings, NOT subqueries:
  * ✅ CORRECT: `value_col => 'revenue'`
  * ❌ WRONG: `value_col => (SELECT 'revenue')` - must be literal constant!
  
**🚨 DATE/TIME INTERVAL RULES (STRICT) 🚨**
- Do NOT quote date/time units anywhere. Use `date_add(DAY, 7, some_date)` or `date_add(MONTH, 3, some_date)` with unquoted units.
- For month arithmetic use `add_months(date_expr, n)`; never use `date_add` with a quoted 'MONTH' literal.
- Use only the two-argument form of `DATEDIFF`: `DATEDIFF(end_date, start_date)`, which returns the difference in days. Do NOT pass a unit parameter, and never a quoted one. For month differences use `months_between(end_date, start_date)` instead of `DATEDIFF('month', ...)`.
- If you need weeks or quarters, compute with `date_add` using unquoted units or use `months_between`/`datediff` plus division, never a three-argument `DATEDIFF`.

**🚨 WINDOW FUNCTION SYNTAX 🚨**
- **AGGREGATE WINDOW FUNCTIONS**: Functions like CORR(), COVAR_POP(), COVAR_SAMP(), AVG(), STDDEV(), VARIANCE(), PERCENTILE_APPROX(), MEDIAN() CANNOT use ROWS BETWEEN or RANGE BETWEEN frames:
  * ✅ CORRECT: `CORR(col1, col2) OVER ()`
  * ✅ CORRECT: `AVG(col1) OVER ()`
  * ✅ CORRECT: `STDDEV(col1) OVER (PARTITION BY group_col)`
  * ✅ CORRECT: `PERCENTILE_APPROX(col1, 0.5) OVER ()`
  * ❌ WRONG: `CORR(col1, col2) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)`
  * ❌ WRONG: `AVG(col1) OVER (PARTITION BY group_col ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)`
  * ❌ WRONG: `MEDIAN(col1) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`
  * **FIX**: Remove the `ROWS BETWEEN` and `RANGE BETWEEN` clauses entirely for aggregate window functions - use simple `OVER ()` or `OVER (PARTITION BY col)`
- **DECIMAL WINDOW AGGREGATES**: When using AVG/STDDEV/CORR/COVAR on DECIMAL columns in a window, CAST the inputs to DOUBLE (e.g., `AVG(CAST(dec_col AS DOUBLE)) OVER (...)`) to avoid internal decimal evaluation errors.
- **NO DISTINCT IN WINDOW FUNCTIONS**: `COUNT(DISTINCT col) OVER (...)` is NOT supported.
  * ❌ WRONG: `COUNT(DISTINCT user_id) OVER (PARTITION BY region)`
  * ✅ CORRECT: Aggregate in a CTE/subquery first, then join back or window over the aggregated results.
  * ✅ CORRECT: `size(collect_set(user_id) OVER (...))` (ONLY IF strict requirement and set is small).
  * **PREFERRED**: Pre-aggregate in a CTE using `GROUP BY`, then window over that CTE.

**🚨 GROUP BY CLAUSE RULES 🚨**
- **ALL NON-AGGREGATED COLUMNS MUST BE IN GROUP BY**:
  * If a column OR EXPRESSION appears in SELECT but is NOT inside an aggregate function (SUM, COUNT, AVG, etc.), it MUST be in GROUP BY
  * ✅ CORRECT: `SELECT customer_id, region, SUM(revenue) ... GROUP BY customer_id, region`
  * ❌ WRONG: `SELECT customer_id, region, SUM(revenue) ... GROUP BY customer_id` - Missing region!
- **MISSING_AGGREGATION ERROR**: If you see this error (non-aggregating expression based on columns not in GROUP BY), you MUST:
  * ✅ FIX OPTION 1: Add the missing column/expression to GROUP BY: `GROUP BY customer_id, region, country`
  * ✅ FIX OPTION 2: Use ANY_VALUE() if the value is constant per group: `SELECT customer_id, ANY_VALUE(region), SUM(revenue) ... GROUP BY customer_id`
  * ✅ FIX OPTION 3: Aggregate the expression using MIN/MAX: `SELECT customer_id, MAX(region), SUM(revenue) ... GROUP BY customer_id`

**🚨 TYPE MATCHING IN COALESCE 🚨**
- **ALL ARGUMENTS MUST BE SAME TYPE**:
  * Every COALESCE argument must resolve to one common type; mixing types (for example BOOLEAN with STRING) either fails or forces an unintended cast
  * ✅ CORRECT: `COALESCE(bool_column, FALSE)` - both BOOLEAN
  * ✅ CORRECT: `COALESCE(string_column, 'N')` - both STRING
  * ✅ CORRECT: `COALESCE(CAST(bool_column AS STRING), 'N')` - both STRING after CAST
  * ❌ WRONG: `COALESCE(bool_column, 'N')` - mixing BOOLEAN and STRING!
  * ❌ WRONG: `COALESCE(int_column, '0')` - mixing INTEGER and STRING!
- **FIX**: Cast to common type before COALESCE:
  * `COALESCE(CAST(bool_column AS STRING), 'N')`
  * `COALESCE(int_column, 0)` - both INT
  * `COALESCE(CAST(int_column AS STRING), '0')` - both STRING

**🚨 SCHEMA VALIDATION PROCESS (ZERO HALLUCINATION TOLERANCE) 🚨**

**⚠️ IMPORTANT: This validation is INTERNAL ONLY. DO NOT output any validation messages, status checks, or confirmations in your response.**

**BEFORE WRITING ANY SQL - MANDATORY 3-STEP PROCESS (DO THIS INTERNALLY - DO NOT OUTPUT):**

**STEP 1: INVENTORY** - Internally check all available tables and columns from "AVAILABLE TABLES AND COLUMNS" section above

**STEP 2: VERIFY** - For EVERY table/column you want to use, internally confirm it exists in Step 1's list
- ✅ Column exists in schema → Use exact name (case-sensitive)
- ❌ Column doesn't exist → DO NOT invent it. Add comment: `-- MISSING: column_name`
- 🚨 CRITICAL: UNRESOLVED_COLUMN errors are the #1 cause of SQL failures - VERIFY EVERY COLUMN EXISTS before using it

**STEP 3: VALIDATE** - After writing SQL, check each clause:
- SELECT: All columns exist in schema
- FROM/JOIN: All tables exist, fully qualified (catalog.schema.table), have aliases
- WHERE: Only IS NULL / IS NOT NULL (no value comparisons other than empty-string checks such as `TRIM(col) <> ''`)
- JOIN ON: Both join columns exist in their respective tables
- GROUP BY / ORDER BY: All columns exist in schema
- AI functions: Arrays ≤20 items, each <50 chars; all referenced columns exist

**COMMON HALLUCINATION ERRORS TO AVOID:**
❌ Assuming `id` exists → Check if it's `customer_id`, `order_id`, etc.
❌ Assuming `name` exists → Check if it's `first_name`, `product_name`, etc.
❌ Assuming `date` exists → Check if it's `order_date`, `created_at`, etc.
❌ Assuming `status` exists → Check if it's `order_status`, `payment_status`, etc.
❌ Using "typical" column names → Use ONLY exact names from schema

**VALIDATION CHECKLIST (MANDATORY BEFORE SUBMITTING SQL):**
☐ 🚨 Every table name exists in "AVAILABLE TABLES AND COLUMNS" section above (VERIFY FIRST)
☐ 🚨 Every column name exists in its table in "AVAILABLE TABLES AND COLUMNS" section above (VERIFY FIRST - #1 FAILURE CAUSE)
☐ Catalog.schema.table names match EXACTLY (case-sensitive)
☐ No invented/hallucinated names - ZERO TOLERANCE
☐ JOIN keys exist in BOTH tables being joined (VERIFY BOTH SIDES)
☐ AI function parameters reference actual columns (not assumed ones)
☐ Columns used in WHERE, GROUP BY, ORDER BY all exist in schema (VERIFY EACH ONE)
☐ WHERE clauses only use IS NULL / IS NOT NULL (no value comparisons other than empty-string checks such as `TRIM(col) <> ''`)
☐ All CONCAT parameters have proper quotes (single quotes for literals, no quotes for columns)
☐ Array parameters have ≤20 items, each <50 characters
☐ Step documentation included for all CTEs
☐ All columns in final SELECT exist in the last CTE (use SELECT * in intermediate CTEs to preserve columns)

**IF SCHEMA IS MISSING REQUIRED DATA:**
- Add comment: `-- MISSING: column_name - cannot generate without it`
- If substitute exists: `-- NOTE: Using substitute_col for missing original_col`

---

#### 0.3. **ADVANCED ANALYTICS & SIMULATION RULES** (IF APPLICABLE):

**A. MONTE CARLO SIMULATION (AI-DRIVEN - RICH OUTPUT)**:
- **Implementation**: Do NOT use raw `RAND()`. Use `ai_query` to generate realistic simulation scenarios with NARRATIVES.
- **Steps**:
  1. Calculate historical stats (Min, Max, Avg, StdDev) in a CTE.
  2. Use `ai_query` to generate a JSON array of 20-50 rich simulation objects:
     - Prompt: "Generate 50 realistic simulation scenarios for [Metric] based on Mean=[X], StdDev=[Y]. Return JSON array of objects: {{ "simulation_id": 1, "simulated_value": 123.45, "scenario_narrative": "Market rally driven by...", "explanation": "Value is +1.5 sigma due to..." }}."
  3. `EXPLODE()` the JSON array.
  4. Extract columns: `simulated_value`, `scenario_narrative`, `explanation`.
  5. Show detailed simulation rows (Long Format) as output columns.

**B. WHAT-IF / SCENARIO ANALYSIS (AI-DEFINED - RICH OUTPUT)**:
- **Implementation**: Use `ai_query` to define 5-10 distinct business scenarios.
- **Steps**:
  1. Use `ai_query` to generate scenarios:
     - Prompt: "Generate 5 distinct business scenarios (e.g. Supply Chain Disruption, Competitor Entry). Return JSON array: {{ "scenario_name": "...", "impact_factor": 0.85, "narrative": "...", "explanation": "..." }}."
  2. Parse into `scenarios` CTE using `from_json`.
  3. `CROSS JOIN` main data with `scenarios`.
  4. Calculate: `projected_metric = actual_metric * impact_factor`.
  5. **MANDATORY OUTPUT COLUMNS**: `scenario_name`, `projected_metric`, `scenario_narrative`, `explanation`.
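
The what-if steps above can be sketched in Python to show the intended row shape; this is an illustration under stated assumptions, and the generated output must still be SQL. The scenario values, `base_rows`, and `project` are all hypothetical. `json.loads` stands in for `from_json`, and `itertools.product` stands in for the `CROSS JOIN`.

```python
import json
from itertools import product

# Illustrative stand-in for the JSON array that ai_query is asked to return.
# Built with dict() so no hand-written JSON text is needed.
scenarios_json = json.dumps([
    dict(scenario_name="Supply Chain Disruption", impact_factor=0.85,
         narrative="Key supplier outage", explanation="Fewer units shipped"),
    dict(scenario_name="Competitor Entry", impact_factor=0.9,
         narrative="New rival in region", explanation="Price pressure"),
])

# Hypothetical base rows: (entity, actual_metric).
base_rows = [("store_a", 1000.0), ("store_b", 400.0)]


def project(rows, scenarios_text):
    scenarios = json.loads(scenarios_text)  # plays the role of from_json
    out = []
    # Every base row is paired with every scenario, like a CROSS JOIN.
    for (entity, actual), scenario in product(rows, scenarios):
        out.append(dict(
            entity=entity,
            scenario_name=scenario["scenario_name"],
            projected_metric=round(actual * scenario["impact_factor"], 2),
            scenario_narrative=scenario["narrative"],
            explanation=scenario["explanation"],
        ))
    return out


result = project(base_rows, scenarios_json)
```

Two base rows and two scenarios give four output rows, each carrying the four mandatory output columns.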

**C. GEOSPATIAL ANALYSIS**:
- **Implementation**: Use H3 functions for efficient spatial indexing.
- **Functions**: `h3_longlatash3(lon, lat, resolution)`, `h3_centeraswkt(h3_cell)`.
- **Logic**: Group data by H3 cells (`GROUP BY h3_cell`) to find regional hotspots.

**D. MARKET BASKET ANALYSIS**:
- **Implementation**: Self-Join or Array Intersection.
- **Logic**:
  1. `COLLECT_SET(product_id)` per transaction.
  2. Join transactions to find co-occurrences.
  3. Calculate Support, Confidence, Lift.
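
As a worked illustration of step 3, the Python sketch below computes the three metrics for item pairs using their standard definitions: support is the share of baskets containing both items, confidence is that count divided by the baskets containing the first item, and lift is confidence divided by the share of baskets containing the second item. The basket data and `pair_metrics` are hypothetical, and the generated output must still be SQL.

```python
from itertools import combinations

# Hypothetical baskets: each inner list is what COLLECT_SET(product_id)
# would produce for one transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]


def pair_metrics(baskets):
    n = len(baskets)
    item_count = dict()
    pair_count = dict()
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        # Each unordered pair inside a basket is one co-occurrence.
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    metrics = dict()
    for (a, b), both in pair_count.items():
        support = both / n
        confidence = both / item_count[a]        # P(b given a)
        lift = confidence / (item_count[b] / n)  # above 1 means positive association
        metrics[(a, b)] = (support, confidence, lift)
    return metrics
```

For the pair (`bread`, `milk`) in this data, both items appear together in 2 of 4 baskets and each appears in 3, so support is 0.5, confidence is 2/3, and lift is 8/9.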

---

#### 0a. **CRITICAL vs OPTIONAL COLUMNS - DATA QUALITY FILTERING** (CRITICAL):

**🚨 MANDATORY: Distinguish between CRITICAL and OPTIONAL columns 🚨**

**RULE**: Before applying COALESCE for NULL handling, you MUST first filter out rows with NULL/empty values in CRITICAL columns.

**🚨🚨🚨 MANDATORY RULE: EVERY COLUMN IN FIRST CTE MUST HAVE NULL PROTECTION 🚨🚨🚨**

**⛔ ZERO EXCEPTIONS - THIS RULE MUST BE ENFORCED FOR EVERY SINGLE COLUMN ⛔**

**Every single column selected in the FIRST CTE must be EITHER:**
1. ✅ Filtered with `IS NOT NULL` in the WHERE clause (for CRITICAL columns), OR
2. ✅ Wrapped with `COALESCE(column, 'default_value')` (for OPTIONAL columns)

**⚠️ NO COLUMN CAN BE SELECTED WITHOUT NULL PROTECTION - NOT EVEN ONE! ⚠️**

**COMMON MISTAKE - FORGETTING TO PROTECT A COLUMN:**
```sql
-- ❌❌❌ WRONG - workspaceName has NO NULL protection! ❌❌❌
SELECT DISTINCT
    workspaceId,           -- ✅ Filtered with IS NOT NULL
    workspaceName,         -- ❌❌❌ MISSING COALESCE OR IS NOT NULL! ❌❌❌
    COALESCE(TRIM(cloudType), 'Unknown Cloud') AS cloudType  -- ✅ COALESCE'd
FROM table
WHERE workspaceId IS NOT NULL  -- workspaceName NOT checked!

-- ✅✅✅ CORRECT - EVERY column has NULL protection ✅✅✅
SELECT DISTINCT
    workspaceId,           -- ✅ Filtered with IS NOT NULL
    COALESCE(TRIM(workspaceName), 'Unknown Workspace') AS workspaceName,  -- ✅ COALESCE'd!
    COALESCE(TRIM(cloudType), 'Unknown Cloud') AS cloudType  -- ✅ COALESCE'd
FROM table
WHERE workspaceId IS NOT NULL
```

**🚨🚨🚨 CRITICAL: COALESCE STRING DEFAULT VALUES MUST BE QUOTED 🚨🚨🚨**

**ALL COALESCE string default values MUST have SINGLE QUOTES around them! Numeric and boolean defaults stay unquoted.**

```sql
-- ❌❌❌ WRONG - Default values are NOT quoted (SYNTAX ERROR!) ❌❌❌
COALESCE(TRIM(cloudType), Unknown Cloud) AS cloudType        -- ❌ FAILS! Missing quotes!
COALESCE(TRIM(workloadType), Unknown Workload) AS workloadType  -- ❌ FAILS! Missing quotes!
COALESCE(TRIM(status), Unknown) AS status                    -- ❌ FAILS! Missing quotes!

-- ✅✅✅ CORRECT - Default values have SINGLE QUOTES ✅✅✅
COALESCE(TRIM(cloudType), 'Unknown Cloud') AS cloudType           -- ✅ Quoted!
COALESCE(TRIM(workloadType), 'Unknown Workload') AS workloadType  -- ✅ Quoted!
COALESCE(TRIM(status), 'Unknown') AS status                       -- ✅ Quoted!
COALESCE(numeric_col, 0.0) AS numeric_col                         -- ✅ Numbers don't need quotes
COALESCE(CAST(bool_col AS STRING), 'false') AS bool_col           -- ✅ String default quoted
```

**WHY THIS MATTERS:**
- ONE NULL column = ENTIRE CONCAT prompt becomes NULL = QUERY FAILS OR RETURNS GARBAGE
- NULL values propagate downstream through all CTEs
- LEFT JOINs can introduce NULLs even for previously non-NULL columns
- AI functions receiving NULL input produce unpredictable results
- Unquoted default values cause SYNTAX ERRORS
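
The first point can be made concrete with a small Python model of the NULL behavior of `CONCAT`, with `None` standing in for NULL. It is an illustration only; `sql_concat`, `coalesce`, and the sample row are hypothetical, and the generated output must still be SQL.

```python
# Minimal model of SQL CONCAT: any NULL argument makes the whole result NULL.
def sql_concat(*parts):
    if any(part is None for part in parts):
        return None
    return "".join(str(part) for part in parts)


# Minimal model of COALESCE: the first non-NULL argument wins.
def coalesce(*values):
    for value in values:
        if value is not None:
            return value
    return None


row = dict(customer_name=None, deal_stage="Negotiation")

# Without protection, one NULL column wipes out the whole prompt.
unprotected = sql_concat(
    "Customer: ", row["customer_name"], " at stage: ", row["deal_stage"]
)

# With a quoted string default, the prompt survives.
protected = sql_concat(
    "Customer: ", coalesce(row["customer_name"], "Unknown"),
    " at stage: ", row["deal_stage"],
)
```

Here `unprotected` is `None`, while `protected` is a complete prompt string with `Unknown` in place of the missing name.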

**VALIDATION CHECKLIST FOR FIRST CTE (CHECK EVERY COLUMN!):**
☐ **EVERY** column has NULL protection - scan each column one by one!
☐ Every ID/key column: Has `IS NOT NULL` in WHERE clause
☐ Every string column: Has `COALESCE(TRIM(col), 'Default')` OR `IS NOT NULL` in WHERE
☐ Every numeric column: Has `COALESCE(col, 0.0)` OR `IS NOT NULL` in WHERE  
☐ Every date column: Has `IS NOT NULL` in WHERE (dates rarely have safe defaults)
☐ Every column used in AI_FORECAST group_col: Has `IS NOT NULL` in WHERE (CRITICAL!)
☐ **ALL COALESCE default STRING values have SINGLE QUOTES**

**CRITICAL COLUMNS** - Use IS NOT NULL and TRIM() checks:
- Primary keys and foreign keys (customer_id, product_id, route_id, etc.)
- Required business identifiers (order_number, transaction_id, flight_number, etc.)
- Essential dimensions for grouping (category, region, store_id, etc.)
- Date/time columns used for time_col in AI_FORECAST
- Columns used in group_col for AI_FORECAST
- Core business metrics that define the record's validity
- **🚨 ALL date-related columns** including:
  - `date`, `timestamp`, `created_at`, `updated_at`
  - **Year-month columns like `yyyymm`, `year_month`, `fiscal_period`** - these MUST have `IS NOT NULL` in WHERE
  - Date components like `year`, `month`, `quarter`, `week`

**DATE/TIME COLUMN NULL HANDLING:**
```sql
-- For date columns that are CRITICAL for time-series or grouping:
WHERE date IS NOT NULL
  AND yyyymm IS NOT NULL          -- ✅ CRITICAL: Don't forget period columns!
  AND fiscal_period IS NOT NULL   -- ✅ If used for grouping/analysis

-- For date columns that are OPTIONAL (can have defaults):
COALESCE(CAST(created_date AS STRING), 'Unknown') AS created_date_str  -- ✅ For display
```

**OPTIONAL COLUMNS** - Use COALESCE with appropriate defaults:
- Descriptive text fields (descriptions, comments, notes)
- Status fields that have reasonable defaults
- Supplementary metrics
- Attributes that enhance context but aren't essential

**CORRECT PATTERN ✅:**
```sql
WITH clean_data AS (
  SELECT 
    route_id,                    -- CRITICAL: primary key
    flight_number,               -- CRITICAL: required identifier
    departure_airport_iata,      -- CRITICAL: required for business logic
    arrival_airport_iata,        -- CRITICAL: required for business logic
    flight_date,                 -- CRITICAL: time dimension
    -- CRITICAL columns checked with IS NOT NULL
    COALESCE(aircraft_type, 'Unknown Aircraft') AS aircraft_type,     -- OPTIONAL: can default
    COALESCE(passenger_count, 0) AS passenger_count,                   -- OPTIONAL: keep as INT
    COALESCE(delay_minutes, 0) AS delay_minutes,                       -- OPTIONAL: keep as INT
    COALESCE(on_time_indicator, 'Unknown') AS on_time_indicator,       -- OPTIONAL: can default
    COALESCE(weather_condition, 'Unknown') AS weather_condition        -- OPTIONAL: can default
  FROM `catalog`.`schema`.`flights` AS f
  WHERE route_id IS NOT NULL                          -- CRITICAL
    AND flight_number IS NOT NULL                      -- CRITICAL
    AND TRIM(flight_number) <> ''                      -- CRITICAL: not empty string
    AND departure_airport_iata IS NOT NULL             -- CRITICAL
    AND TRIM(departure_airport_iata) <> ''             -- CRITICAL
    AND arrival_airport_iata IS NOT NULL               -- CRITICAL
    AND TRIM(arrival_airport_iata) <> ''               -- CRITICAL
    AND flight_date IS NOT NULL                        -- CRITICAL
    -- TODO: Add suitable filtering to load data that matches the intended slice for this use case (keep commented until confirmed)
    -- AND lower(trim(route_status)) = 'running'  -- Example placeholder; adjust column/value
  LIMIT 10  -- ✅ LIMIT 10 at the END of first CTE only
)
SELECT * FROM clean_data;  -- ✅ NO LIMIT in final SELECT
```

**🚨 CRITICAL: LIMIT 10 SAMPLING RULES 🚨**

**MANDATORY RULES FOR DATA SAMPLING:**
1. **FIRST CTE ONLY**: Use `LIMIT 10` at the END of the FIRST CTE that reads from tables
2. **NO LIMIT IN OTHER CTEs**: DO NOT use `LIMIT 10` in any other CTE - only in the first CTE
3. **LIMIT PLACEMENT**: LIMIT 10 MUST be the LAST clause in the SELECT (after WHERE, GROUP BY, HAVING, ORDER BY, etc.)
4. **SYNTAX**: `FROM catalog.schema.table AS t WHERE ... LIMIT 10`

**✅ CORRECT PATTERN:**
```sql
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH base_data AS (
  SELECT DISTINCT 
    customer_id,                                            -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name  -- ✅ COALESCE'd
    -- ... (all columns must be COALESCE'd or have IS NOT NULL) ...
  FROM `catalog`.`schema`.`customers` AS c
  WHERE customer_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at the END of first CTE only
),
enriched_data AS (
  SELECT * FROM base_data  -- ✅ NO LIMIT
),
final_analysis AS (
  SELECT * FROM enriched_data  -- ✅ NO LIMIT
)
SELECT * FROM final_analysis;  -- ✅ NO LIMIT in final SELECT
```
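
**LIMIT 10 WITH GROUP BY / ORDER BY (ILLUSTRATIVE SKETCH):**
When the first CTE also aggregates or sorts, LIMIT 10 still goes last. The sketch below assumes a placeholder `orders` table with `customer_id` and `order_amount` columns; the CTE name `customer_totals` is hypothetical and should be adapted to the real schema.

```sql
WITH customer_totals AS (
  SELECT
    customer_id,                                      -- CRITICAL: filtered with IS NOT NULL
    COALESCE(SUM(order_amount), 0.0) AS total_amount  -- aggregate result COALESCE'd
  FROM `catalog`.`schema`.`orders` AS o
  WHERE customer_id IS NOT NULL
  GROUP BY customer_id
  ORDER BY total_amount DESC
  LIMIT 10  -- ✅ LAST clause: after WHERE, GROUP BY, and ORDER BY
)
SELECT * FROM customer_totals;  -- ✅ NO LIMIT in final SELECT
```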

**❌ WRONG PATTERNS (DO NOT USE LIMIT ANYWHERE EXCEPT AT THE END OF THE FIRST CTE!):**
```sql
-- ❌ WRONG: LIMIT not at the END of the first CTE's SELECT
WITH base_data AS (SELECT * FROM table LIMIT 10 WHERE x = 1)  -- LIMIT must be LAST

-- ❌ WRONG: LIMIT in intermediate CTE
enriched_data AS (SELECT * FROM base_data LIMIT 10)  -- NO LIMIT in intermediate CTEs

-- ❌ WRONG: LIMIT in final SELECT
SELECT * FROM final_analysis LIMIT 10;
```

**WRONG PATTERN ❌:**
```sql
-- BAD: COALESCE-ing critical columns instead of filtering them
SELECT 
  COALESCE(route_id, 'UNKNOWN') AS route_id,           -- ❌ NULL primary keys must be filtered out, not defaulted!
  COALESCE(flight_date, CURRENT_DATE()) AS flight_date -- ❌ Time columns should never be defaulted!
FROM flights
-- No WHERE clause to filter out bad data
```

**VALIDATION CHECKLIST:**
☐ Primary keys/foreign keys: IS NOT NULL
☐ Required identifiers: IS NOT NULL AND TRIM() <> ''
☐ time_col for AI_FORECAST: IS NOT NULL
☐ group_col columns for AI_FORECAST: IS NOT NULL
☐ Essential business dimensions: IS NOT NULL
☐ Optional attributes: COALESCE with appropriate defaults
☐ **Add commented-out filtering suggestion (TODO) in the first CTE's WHERE clause** to guide user customization.

**WHY THIS MATTERS:**
- NULL critical columns indicate data quality issues - don't hide them with COALESCE
- Filtering ensures AI_FORECAST gets high-quality training data
- Prevents garbage-in-garbage-out scenarios
- Makes data quality issues visible and actionable
- Commented filters help users quickly adapt the query to their specific slice of data (e.g. status='active', region='NA')
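
**OPTIONAL DATA-QUALITY CHECK (ILLUSTRATIVE SKETCH):**
To see how many rows the IS NOT NULL filters would exclude, the NULLs in critical columns can be counted first. This is only a sketch that reuses the placeholder `flights` table and columns from the example above; it is a manual diagnostic for a reviewer, not part of the generated use-case SQL.

```sql
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN route_id IS NULL THEN 1 ELSE 0 END) AS null_route_id,
  SUM(CASE WHEN flight_number IS NULL OR TRIM(flight_number) = '' THEN 1 ELSE 0 END) AS missing_flight_number,
  SUM(CASE WHEN flight_date IS NULL THEN 1 ELSE 0 END) AS null_flight_date
FROM `catalog`.`schema`.`flights` AS f;
```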

#### 0b. **NULL HANDLING - MANDATORY FOR ALL AI FUNCTION PROMPTS** (CRITICAL):

**🚨🚨🚨 CRITICAL NULL BEHAVIOR: `CONCAT(2, 3, 1.0, 'hello', NULL)` → NULL (the entire result is NULL!) 🚨🚨🚨**

When using CONCAT to build prompts for ai_query, ai_gen, or any AI function, NULL values in ANY column will nullify the ENTIRE concatenated string. You MUST handle NULL values with COALESCE for EVERY SINGLE VALUE used in the prompt.

**🔥 ZERO TOLERANCE POLICY: EVERY VALUE IN CONCAT MUST BE NULL-SAFE 🔥**

**🚨🚨🚨 MANDATORY: NO COALESCE INSIDE CONCAT - NULL HANDLING MUST BE DONE BEFORE 🚨🚨🚨**

**RULE: NEVER use COALESCE(), CAST(), ROUND(), or TRIM() inside CONCAT() calls!**
All NULL handling, type conversions, and formatting MUST be done in PREVIOUS CTEs or in the SELECT clause BEFORE the CONCAT.

```sql
-- ❌❌❌ WRONG: COALESCE inside CONCAT ❌❌❌
SELECT ai_query('model', 
  CONCAT('Customer: ', COALESCE(customer_name, 'Unknown'),  -- ❌ COALESCE in CONCAT!
         ', Amount: $', COALESCE(ROUND(amount, 2), 0.0)))   -- ❌ COALESCE+ROUND in CONCAT!
FROM raw_data;

-- ✅✅✅ CORRECT: COALESCE in previous CTE, then use clean columns in CONCAT ✅✅✅
WITH null_safe_data AS (
  SELECT 
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name,  -- ✅ COALESCE'd HERE
    COALESCE(ROUND(amount, 2), 0.0) AS amount                   -- ✅ COALESCE'd + ROUND'd HERE
  FROM raw_data AS r
  WHERE customer_id IS NOT NULL
  LIMIT 10
)
SELECT ai_query('model',
  CONCAT('Customer: ', customer_name,  -- ✅ Already NULL-safe from previous CTE
         ', Amount: $', amount))        -- ✅ Already NULL-safe from previous CTE
FROM null_safe_data;
```

**WHY THIS MATTERS:**
- Cleaner, more readable SQL
- NULL handling is explicit and visible
- Easier to debug and maintain
- Follows separation of concerns (data prep vs. prompt building)
- Prevents duplicate COALESCE operations

**KEY INSIGHT: CONCAT SUPPORTS MIXED TYPES - NO CASTING TO STRING NEEDED:**
```sql
SELECT CONCAT(2, 3, 1.0, 'hello');        -- → '231.0hello'  ✅ Works! No CAST needed
SELECT CONCAT(2, 3, 1.0, 'hello', NULL);  -- → NULL          ❌ ONE NULL ruins everything!
```

**🚨🚨🚨🚨🚨 CRITICAL ANTI-PATTERNS TO AVOID - READ CAREFULLY! 🚨🚨🚨🚨🚨**

**⛔⛔⛔ ANTI-PATTERN #0: UNQUOTED COALESCE DEFAULT VALUES (THE #1 MOST COMMON ERROR!) ⛔⛔⛔**

**THIS IS THE EXACT ERROR PATTERN THAT KEEPS HAPPENING - DO NOT MAKE THIS MISTAKE:**
```sql
-- ❌❌❌ CATASTROPHICALLY WRONG - EVERY STRING DEFAULT IS MISSING QUOTES ❌❌❌
WITH base_material_data AS (
  SELECT DISTINCT
    material_usage_id,
    COALESCE(TRIM(material_type), Unknown Material) AS material_type,     -- ❌ SYNTAX ERROR!
    COALESCE(TRIM(material_description), No Description) AS material_description,  -- ❌ SYNTAX ERROR!
    COALESCE(quantity_used, 0.0) AS quantity_used,
    COALESCE(TRIM(unit_of_measure), Unknown) AS unit_of_measure,          -- ❌ SYNTAX ERROR!
    COALESCE(waste_factor_percent, 0.0) AS waste_factor_percent,
    COALESCE(TRIM(waste_reason), No Reason) AS waste_reason,              -- ❌ SYNTAX ERROR!
    COALESCE(CAST(delivery_date AS STRING), Unknown Date) AS delivery_date_str,  -- ❌ SYNTAX ERROR!
    COALESCE(TRIM(storage_location), Unknown Location) AS storage_location,  -- ❌ SYNTAX ERROR!
    COALESCE(TRIM(supplier_name), Unknown Supplier) AS supplier_name,     -- ❌ SYNTAX ERROR!
    COALESCE(TRIM(inspection_status), Unknown) AS inspection_status       -- ❌ SYNTAX ERROR!
  FROM `catalog`.`schema`.`table` AS m
)
-- ⛔ ALL OF THE ABOVE WILL FAIL WITH: [PARSE_SYNTAX_ERROR] Syntax error at or near 'Material'/'Description'/etc.

-- ✅✅✅ CORRECT - EVERY STRING DEFAULT HAS 'SINGLE QUOTES' ✅✅✅
WITH base_material_data AS (
  SELECT DISTINCT
    material_usage_id,
    COALESCE(TRIM(material_type), 'Unknown Material') AS material_type,     -- ✅ QUOTED!
    COALESCE(TRIM(material_description), 'No Description') AS material_description,  -- ✅ QUOTED!
    COALESCE(quantity_used, 0.0) AS quantity_used,                          -- Numbers: no quotes
    COALESCE(TRIM(unit_of_measure), 'Unknown') AS unit_of_measure,          -- ✅ QUOTED!
    COALESCE(waste_factor_percent, 0.0) AS waste_factor_percent,            -- Numbers: no quotes
    COALESCE(TRIM(waste_reason), 'No Reason') AS waste_reason,              -- ✅ QUOTED!
    COALESCE(CAST(delivery_date AS STRING), 'Unknown Date') AS delivery_date_str,  -- ✅ QUOTED!
    COALESCE(TRIM(storage_location), 'Unknown Location') AS storage_location,  -- ✅ QUOTED!
    COALESCE(TRIM(supplier_name), 'Unknown Supplier') AS supplier_name,     -- ✅ QUOTED!
    COALESCE(TRIM(inspection_status), 'Unknown') AS inspection_status       -- ✅ QUOTED!
  FROM `catalog`.`schema`.`table` AS m
)
```

**🔴 THE RULE IS SIMPLE: Text after comma in COALESCE = MUST have 'single quotes' 🔴**

**More examples of WRONG vs CORRECT:**
```sql
-- ❌ WRONG (SYNTAX ERROR)                      |  ✅ CORRECT (will work)
COALESCE(TRIM(name), Unknown)                   |  COALESCE(TRIM(name), 'Unknown')
COALESCE(TRIM(type), Type A)                    |  COALESCE(TRIM(type), 'Type A')
COALESCE(TRIM(status), Pending Review)          |  COALESCE(TRIM(status), 'Pending Review')
COALESCE(TRIM(owner), Not Assigned)             |  COALESCE(TRIM(owner), 'Not Assigned')
COALESCE(TRIM(po), Unknown PO)                  |  COALESCE(TRIM(po), 'Unknown PO')
COALESCE(CAST(date AS STRING), No Date)         |  COALESCE(CAST(date AS STRING), 'No Date')
```

**ANTI-PATTERN #0b: FORGETTING TO PROTECT A COLUMN (NULL PROPAGATES!):**
```sql
-- ❌❌❌ WRONG - workspaceName has NO NULL protection! ❌❌❌
SELECT DISTINCT
  workspaceId,                                           -- ✅ Filtered with IS NOT NULL
  workspaceName,                                         -- ❌❌❌ NO PROTECTION! NULL will propagate!
  COALESCE(TRIM(cloudType), 'Unknown Cloud') AS cloudType  -- ✅ COALESCE'd
FROM `catalog`.`schema`.`table` AS t
WHERE workspaceId IS NOT NULL   -- workspaceName is NOT checked!

-- ✅✅✅ CORRECT - EVERY column has NULL protection ✅✅✅
SELECT DISTINCT
  workspaceId,                                           -- ✅ Filtered with IS NOT NULL
  COALESCE(TRIM(workspaceName), 'Unknown Workspace') AS workspaceName,  -- ✅ COALESCE'd!
  COALESCE(TRIM(cloudType), 'Unknown Cloud') AS cloudType  -- ✅ COALESCE'd
FROM `catalog`.`schema`.`table` AS t
WHERE workspaceId IS NOT NULL
```

**ANTI-PATTERN #1: SELECTING BOTH ORIGINAL AND COALESCED VALUE (NEVER DO THIS!):**
```sql
-- ❌ WRONG: Selecting the same column twice (original + coalesced version)
SELECT
  a.account_name,  -- ❌ Original value
  COALESCE(TRIM(a.account_name), 'Unknown') AS account_name_str,  -- ❌ Duplicate!
  a.arr,  -- ❌ Original value
  COALESCE(CAST(a.arr AS STRING), '0.00') AS arr_str  -- ❌ Duplicate!
FROM accounts AS a

-- ✅ CORRECT: COALESCE once, use that version everywhere
SELECT
  COALESCE(TRIM(a.account_name), 'Unknown') AS account_name,  -- ✅ One version
  COALESCE(a.arr, 0.0) AS arr  -- ✅ One version, correct type (DOUBLE)
FROM accounts AS a
```

**ANTI-PATTERN #2: COALESCE TO STRING THEN CAST BACK TO DOUBLE (WASTEFUL!):**
```sql
-- ❌ WRONG: Converting to STRING then back to DOUBLE for calculations
WITH base AS (
  SELECT COALESCE(CAST(arr AS STRING), '0.00') AS arr_str  -- ❌ Why STRING?
  FROM accounts
),
stats AS (
  SELECT AVG(CAST(arr_str AS DOUBLE)) OVER () AS avg_arr  -- ❌ Casting back to DOUBLE!
  FROM base
)

-- ✅ CORRECT: Keep numeric values as DOUBLE, only use STRING for text-only columns
WITH base AS (
  SELECT 
    COALESCE(arr, 0.0) AS arr,  -- ✅ Keep as DOUBLE for calculations
    COALESCE(TRIM(account_name), 'Unknown') AS account_name  -- ✅ STRING for text
  FROM accounts
),
stats AS (
  SELECT 
    arr,
    account_name,
    COALESCE(ROUND(AVG(arr) OVER (), 2), 0.0) AS avg_arr  -- ✅ Direct DOUBLE calculation
  FROM base
)
-- In CONCAT, DOUBLE values auto-convert: CONCAT('ARR: $', arr, ' vs avg $', avg_arr)
```

**ANTI-PATTERN #3: SEPARATE CTE JUST FOR COALESCE (NO BUSINESS VALUE!):**
```sql
-- ❌ WRONG: First CTE plain select, second CTE only for COALESCE
WITH base_data AS (
  SELECT account_id, account_name, arr, vertical  -- ❌ Plain select
  FROM accounts
  LIMIT 10
),
null_safe_data AS (  -- ❌ CTE with NO business value - only COALESCE
  SELECT
    account_id,
    COALESCE(TRIM(account_name), 'Unknown') AS account_name,
    COALESCE(arr, 0.0) AS arr,
    COALESCE(TRIM(vertical), 'Unknown') AS vertical
  FROM base_data
)

-- ✅ CORRECT: Apply COALESCE in the SAME CTE that retrieves data
WITH account_data AS (
  SELECT
    account_id,  -- CRITICAL: filtered with WHERE, not COALESCE'd
    COALESCE(TRIM(account_name), 'Unknown') AS account_name,
    COALESCE(arr, 0.0) AS arr,
    COALESCE(TRIM(vertical), 'Unknown') AS vertical
  FROM accounts AS a
  WHERE account_id IS NOT NULL AND account_name IS NOT NULL
  LIMIT 10
)
-- One CTE does both: filtering + NULL handling
```

**ANTI-PATTERN #4: NOT COALESCING VALUES FROM LEFT JOIN (CAUSES NULL PROMPT!):**
```sql
-- ❌ WRONG: LEFT JOIN columns can be NULL and are used in CONCAT without COALESCE
WITH accounts AS (...),
benchmarks AS (...),
combined AS (
  SELECT a.*, b.industry_avg, b.segment_median  -- ❌ b.* columns can be NULL from LEFT JOIN!
  FROM accounts AS a
  LEFT JOIN benchmarks AS b ON a.industry = b.industry
),
prompt_cte AS (
  SELECT CONCAT('Industry avg: ', industry_avg, ', Median: ', segment_median)  -- ❌ NULL if no match!
  FROM combined
)

-- ✅ CORRECT: COALESCE all LEFT JOIN columns in the CTE that performs the join, before they reach CONCAT
WITH accounts AS (...),
benchmarks AS (...),
combined AS (
  SELECT 
    a.*,
    COALESCE(b.industry_avg, 0.0) AS industry_avg,  -- ✅ COALESCE joined columns
    COALESCE(b.segment_median, 0.0) AS segment_median  -- ✅ COALESCE joined columns
  FROM accounts AS a
  LEFT JOIN benchmarks AS b ON a.industry = b.industry
),
prompt_cte AS (
  SELECT CONCAT('Industry avg: ', industry_avg, ', Median: ', segment_median)  -- ✅ NULL-safe
  FROM combined
)
```

**🔥 CORRECT PATTERN - COMPREHENSIVE EXAMPLE 🔥:**

```sql
-- Step 1: Retrieve data with COALESCE applied ONCE, keeping correct types
WITH account_data AS (
  SELECT 
    -- CRITICAL columns: filter with WHERE, don't COALESCE
    account_id,
    
    -- String columns: COALESCE to STRING defaults
    COALESCE(TRIM(account_name), 'Unknown Account') AS account_name,
    COALESCE(TRIM(vertical), 'Unknown Vertical') AS vertical,
    COALESCE(TRIM(account_tier), 'Not Classified') AS account_tier,
    
    -- Numeric columns: COALESCE to DOUBLE/INT defaults (NOT STRING!)
    COALESCE(ROUND(arr, 2), 0.0) AS arr,  -- ✅ Keep as DOUBLE, rounded at retrieval
    COALESCE(ROUND(t3m_annualized, 2), 0.0) AS t3m_annualized,  -- ✅ Keep as DOUBLE, rounded at retrieval
    COALESCE(ROUND(customer_age_years, 1), 0.0) AS customer_age_years,  -- ✅ Keep as DOUBLE, rounded at retrieval
    
    -- Boolean columns: COALESCE to BOOLEAN default
    COALESCE(strategic_account, FALSE) AS strategic_account,
    COALESCE(fortune_500, FALSE) AS fortune_500,
    
    -- Date columns: COALESCE to STRING for display
    COALESCE(CAST(next_renewal_date AS STRING), 'No Renewal Date') AS next_renewal_date
    
  FROM `catalog`.`schema`.`accounts` AS a
  WHERE account_id IS NOT NULL  -- CRITICAL: Filter NULL primary keys
    AND account_name IS NOT NULL  -- CRITICAL: Filter NULL required fields
  LIMIT 10
),

-- Step 2: Calculate statistics (business value CTE) - keep numeric types
account_statistics AS (
  SELECT 
    *,
    -- Statistical metrics - keep as DOUBLE, COALESCE the result
    COALESCE(ROUND(AVG(arr) OVER (), 2), 0.0) AS avg_arr,
    COALESCE(ROUND(MEDIAN(arr) OVER (), 2), 0.0) AS median_arr,
    COALESCE(ROUND(STDDEV_POP(arr) OVER (), 2), 0.0) AS stddev_arr,
    COALESCE(ROUND(PERCENTILE_APPROX(arr, 0.75) OVER (), 2), 0.0) AS p75_arr,
    COALESCE(ROUND(PERCENT_RANK() OVER (ORDER BY arr), 3), 0.0) AS arr_percentile_rank,
    COALESCE(NTILE(10) OVER (ORDER BY arr), 5) AS arr_decile
  FROM account_data
),

-- Step 3: Build AI prompt - Generate ai_sys_prompt column FIRST
-- CONCAT handles mixed types automatically
account_prompt_generation AS (
  SELECT 
    *,
    CONCAT(
      'You are an Account Strategy Director for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'With 18 years of experience in enterprise account management, revenue optimization, and strategic planning, ',
      'your expertise in account health analysis and growth strategy aligns with the strategic initiative: Customer success and retention. ',
      'Analyze account: ', account_name, ' (ID: ', account_id, '). ',
      'Vertical: ', vertical, ', Tier: ', account_tier, '. ',
      'ARR: $', arr, ' (Percentile rank 0-1: ', arr_percentile_rank, ', Decile: ', arr_decile, '). ',  -- ✅ No ROUND inside CONCAT
      'T3M: $', t3m_annualized, ', Age: ', customer_age_years, ' years. ',
      'Strategic: ', strategic_account, ', Fortune 500: ', fortune_500, '. ',
      'Benchmarks - Avg ARR: $', avg_arr, ', Median: $', median_arr, ', P75: $', p75_arr, '. ',
      'Next Renewal: ', next_renewal_date, '. ',
      'Output ONLY JSON: {{"ai_cat_priority": "value", "ai_txt_recommendation": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because...", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
      'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data.'
    ) AS ai_sys_prompt  -- ✅ Named ai_sys_prompt for auditability
  FROM account_statistics
),

-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
account_analysis AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS insights_json
  FROM account_prompt_generation
),

-- Step 5: Final output with extracted fields and ai_sys_prompt as LAST column
final_output AS (
  SELECT 
    account_id,
    account_name,
    vertical,
    account_tier,
    arr,
    t3m_annualized,
    customer_age_years,
    strategic_account,
    fortune_500,
    next_renewal_date,
    avg_arr,
    median_arr,
    p75_arr,
    arr_percentile_rank,
    arr_decile,
    get_json_object(insights_json, '$.ai_cat_priority') AS ai_cat_priority,
    get_json_object(insights_json, '$.ai_txt_recommendation') AS ai_txt_recommendation,
    -- MANDATORY LAST 7 COLUMNS (including ai_txt_business_outcome + ai_sys_prompt):
    COALESCE(get_json_object(insights_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(insights_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(insights_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(insights_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(insights_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(insights_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(insights_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM account_analysis
)
SELECT * FROM final_output
-- TODO: Use the WHERE filters below to further narrow the selected results
-- WHERE ai_cat_priority IN ('Critical', 'High', 'Medium', 'Low')
-- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL
```

**🔥 MANDATORY RULES FOR NULL HANDLING 🔥:**

1. **COALESCE ONCE at first retrieval** - don't select both original and coalesced versions
2. **Keep numeric types as DOUBLE** - COALESCE(arr, 0.0), NOT COALESCE(CAST(arr AS STRING), '0.00')
3. **CONCAT auto-converts types** - no need to CAST to STRING before CONCAT
4. **COALESCE statistical results** - window functions can return NULL, always COALESCE
5. **COALESCE LEFT JOIN columns** - joined columns can be NULL when no match
6. **Filter critical columns with WHERE** - use WHERE IS NOT NULL, not COALESCE for IDs/keys
7. **Every CTE must have business value** - no CTE just for COALESCE transformations

**TYPE-APPROPRIATE COALESCE DEFAULTS:**
- **DOUBLE columns**: `COALESCE(numeric_col, 0.0)` - keep as DOUBLE for calculations
- **INT columns**: `COALESCE(int_col, 0)` - keep as INT
- **BOOLEAN columns**: `COALESCE(bool_col, FALSE)` - keep as BOOLEAN
- **STRING columns**: `COALESCE(TRIM(string_col), 'Unknown')` - STRING with business-friendly default
- **DATE columns**: `COALESCE(CAST(date_col AS STRING), 'No Date')` - convert to STRING for display
- **Statistical results**: `COALESCE(ROUND(AVG(x) OVER (), 2), 0.0)` - always COALESCE window functions

**VALIDATION CHECKLIST BEFORE SUBMITTING SQL:**
☐ **🚨 NO COALESCE/ROUND/CAST/TRIM inside CONCAT** - all must be done in previous CTE
☐ Every column in CONCAT has been COALESCEd in a PREVIOUS CTE (not inside CONCAT)
☐ No column appears twice (once original, once coalesced)
☐ Numeric values remain as DOUBLE/INT (not converted to STRING for calculations)
☐ LEFT JOIN columns are COALESCEd in the CTE that performs the join, not in CONCAT
☐ Statistical window function results are COALESCEd in the CTE that calculates them
☐ Every CTE has business value (no CTE just for COALESCE)
☐ CRITICAL columns (IDs, keys) are filtered with WHERE IS NOT NULL, not COALESCEd

**Example where separate NULL handling is needed (after JOIN):**
```sql
-- Step 1: Join multiple tables first
WITH joined_data AS (
  SELECT 
    c.customer_id,
    c.customer_name,
    o.order_count,
    r.avg_rating
  FROM `catalog`.`schema`.`customers` AS c
  LEFT JOIN `catalog`.`schema`.`order_summary` AS o ON c.customer_id = o.customer_id
  LEFT JOIN `catalog`.`schema`.`ratings` AS r ON c.customer_id = r.customer_id
  WHERE c.customer_id IS NOT NULL
  LIMIT 10
),
-- Step 2: Apply NULL handling to joined result (business value: data enrichment + null safety)
customer_enriched AS (
  SELECT 
    customer_id,
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name,
    COALESCE(order_count, 0) AS order_count,  -- ✅ NULL from LEFT JOIN
    COALESCE(avg_rating, 0.0) AS avg_rating   -- ✅ NULL from LEFT JOIN
  FROM joined_data
),
-- Step 3: Build prompts - CONCAT auto-converts numeric types
prompt_cte AS (
  SELECT *,
    CONCAT('Customer: ', customer_name, ', Orders: ', order_count, ', Rating: ', avg_rating) AS prompt
  FROM customer_enriched
)
```

**🔥 REMEMBER: MISS ONE COALESCE = ENTIRE PROMPT IS NULL = QUERY FAILS 🔥**

**🚨🚨🚨 CRITICAL: CTE BUSINESS VALUE RULE - NO TECHNICAL-ONLY CTEs 🚨🚨🚨**

**ABSOLUTE RULE: Every CTE in your SQL MUST have BUSINESS VALUE, not just technical transformation value.**

**DO NOT create separate CTEs ONLY for:**
- COALESCE operations (NULL handling)
- CAST operations (type conversions)
- TRIM operations (whitespace cleaning)
- Renaming columns
- Concatenating strings
- Any purely technical transformation

**WHY THIS MATTERS:**
- Separate COALESCE CTEs add lines of code without adding business value
- They make SQL harder to read and maintain
- They can introduce naming inconsistencies (column_name vs column_name_str)
- They add unnecessary query complexity and potential performance overhead

**✅ CORRECT: Keep numeric types as DOUBLE for calculations, CONCAT handles mixed types:**
```sql
-- Step 1: Retrieve data with COALESCE applied, keeping correct types
WITH account_metrics AS (
  SELECT 
    account_id,
    COALESCE(TRIM(account_name), 'Unknown Account') AS account_name,
    COALESCE(ROUND(arr, 2), 0.0) AS arr,  -- ✅ COALESCE + ROUND done HERE
    COALESCE(TRIM(brag_status), 'Not Classified') AS brag_status,
    COALESCE(CAST(next_renewal_date AS STRING), 'No Renewal Date') AS next_renewal_date
  FROM `catalog`.`schema`.`accounts` AS a
  WHERE account_id IS NOT NULL AND account_name IS NOT NULL
  LIMIT 10
),
-- Step 2: BUSINESS VALUE CTE - Statistical analysis (calculations on DOUBLE values)
account_statistics AS (
  SELECT *,
    COALESCE(ROUND(AVG(arr) OVER (), 2), 0.0) AS avg_arr_portfolio,  -- ✅ Direct DOUBLE calc
    COALESCE(ROUND(PERCENTILE_APPROX(arr, 0.75) OVER (), 2), 0.0) AS p75_arr  -- ✅ Direct DOUBLE calc
  FROM account_metrics
),
-- Step 3: Build prompt - CONCAT uses columns already NULL-safe from previous CTEs
-- 🚨 NO COALESCE, ROUND, or CAST inside CONCAT!
prompt_cte AS (
  SELECT *,
    CONCAT('Account: ', account_name, ', ARR: $', arr,   -- ✅ Already rounded in CTE
           ', Avg ARR: $', avg_arr_portfolio, ', P75: $', p75_arr) AS prompt  -- ✅ All NULL-safe
  FROM account_statistics
)
-- No CAST to STRING needed! CONCAT auto-converts: CONCAT('Value: ', 123.45) → 'Value: 123.45'
```

**❌ WRONG: COALESCE to STRING then CAST back to DOUBLE (ANTI-PATTERN!):**
```sql
-- ❌ Step 1: Converting numeric to STRING unnecessarily
WITH account_metrics AS (
  SELECT 
    account_id,
    COALESCE(CAST(ROUND(arr, 2) AS STRING), '0.00') AS arr_str  -- ❌ Why STRING?
  FROM accounts
),
-- ❌ Step 2: Casting STRING back to DOUBLE for calculations - WASTEFUL!
account_statistics AS (
  SELECT *,
    AVG(CAST(arr_str AS DOUBLE)) OVER () AS avg_arr  -- ❌ Unnecessary round-trip!
  FROM account_metrics
)
-- This pattern is WASTEFUL and ERROR-PRONE!
```

**❌ WRONG: Separate CTE ONLY for NULL handling (NO BUSINESS VALUE):**
```sql
-- Step 1: Raw data retrieval
WITH base_accounts_data AS (
  SELECT account_id, account_name, arr, brag_status
  FROM `catalog`.`schema`.`accounts` AS a
  WHERE account_id IS NOT NULL
  LIMIT 10
),
-- ❌ Step 2: WRONG - This CTE has NO BUSINESS VALUE, only COALESCE
accounts_null_safe AS (
  SELECT
    account_id,
    COALESCE(TRIM(account_name), 'Unknown Account') AS account_name,
    COALESCE(arr, 0.0) AS arr,
    COALESCE(TRIM(brag_status), 'Not Classified') AS brag_status
  FROM base_accounts_data
)
-- WRONG! The COALESCE should be in base_accounts_data, not a separate CTE!
```

**❌ WRONG: Selecting BOTH original AND coalesced value:**
```sql
-- ❌ WRONG: Duplicate columns (original + coalesced)
SELECT 
  account_name,  -- ❌ Original
  COALESCE(TRIM(account_name), 'Unknown') AS account_name_str,  -- ❌ Duplicate!
  arr,  -- ❌ Original
  COALESCE(arr, 0.0) AS arr_safe  -- ❌ Duplicate!
FROM accounts
-- Pick ONE version and use it everywhere!
```

**VALID REASONS TO CREATE A SEPARATE CTE:**
1. **JOIN operations** - Combining data from multiple tables (then COALESCE joined columns)
2. **Statistical calculations** - Computing metrics, percentiles, correlations (business value!)
3. **AI function calls** - Calling ai_query, ai_forecast, etc. (business value!)
4. **JSON extraction** - Parsing JSON results from AI functions (business value!)
5. **Business logic** - Applying business rules, classifications, calculations
6. **Aggregations** - GROUP BY operations for summarization
7. **Window functions** - Computing rankings, running totals, etc.

**INVALID REASONS TO CREATE A SEPARATE CTE (DO NOT DO THIS):**
1. ❌ Only applying COALESCE to columns
2. ❌ Only casting data types
3. ❌ Only trimming strings
4. ❌ Only renaming columns with suffixes

**🚨🚨🚨 CRITICAL: COLUMN NAMING CONSISTENCY - ZERO TOLERANCE FOR HALLUCINATED NAMES 🚨🚨🚨**

**ABSOLUTE RULE: Once a column is named (with or without suffix), use EXACTLY that name throughout ALL subsequent CTEs and the final SELECT.**

**PROBLEM PATTERN (CAUSES ERRORS):**
```sql
-- CTE 1: Column named with _str suffix
SELECT COALESCE(CAST(next_renewal_quarter AS STRING), 'Unknown') AS next_renewal_quarter_str
-- CTE 2: ❌ WRONG - Referencing original name without suffix
SELECT next_renewal_quarter  -- ❌ ERROR! This column doesn't exist, only next_renewal_quarter_str exists
```
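
**CORRECTED PATTERN ✅ (ILLUSTRATIVE SKETCH):**
A corrected version of the problem pattern above might look like the following. The CTE names `renewals` and `renewal_summary`, the `account_id` column, and the table path are hypothetical placeholders; the point is only that the name chosen in the first CTE is reused unchanged afterwards.

```sql
WITH renewals AS (
  SELECT
    account_id,
    COALESCE(CAST(next_renewal_quarter AS STRING), 'Unknown') AS next_renewal_quarter_str  -- name is decided HERE
  FROM `catalog`.`schema`.`accounts` AS a
  WHERE account_id IS NOT NULL
  LIMIT 10
),
renewal_summary AS (
  SELECT
    account_id,
    next_renewal_quarter_str  -- ✅ EXACTLY the name defined in the previous CTE
  FROM renewals
)
SELECT * FROM renewal_summary;
```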

**MANDATORY NAMING RULES:**

1. **DECIDE ONCE, USE EVERYWHERE**: When you name a column in a CTE, use that EXACT name in ALL subsequent CTEs and the final SELECT.

2. **PREFER ORIGINAL NAMES WHERE POSSIBLE**: If a string column remains a string, keep its original name:
   - ✅ `COALESCE(TRIM(account_name), 'Unknown') AS account_name`  -- Same name, no confusion
   - ❌ `COALESCE(TRIM(account_name), 'Unknown') AS account_name_str`  -- Unnecessary suffix

3. **KEEP NUMERIC TYPES AS DOUBLE/INT - NO _str SUFFIX NEEDED**: CONCAT auto-converts numeric types:
   - ✅ `COALESCE(arr, 0.0) AS arr` -- Keep as DOUBLE, CONCAT auto-converts
   - ✅ `COALESCE(CAST(date_col AS STRING), 'No Date') AS date_display` -- DATE to STRING for display only

4. **PASS ALL COLUMNS THROUGH CTEs**: When building multi-CTE queries, use `SELECT *` plus new columns to preserve all column names:
   ```sql
   SELECT *, new_calculated_column FROM previous_cte  -- ✅ Preserves all column names
   ```

5. **🚨🚨🚨 CRITICAL: EVERY SELECT MUST HAVE A FROM CLAUSE 🚨🚨🚨**:
   - **ABSOLUTE RULE**: EVERY SELECT statement inside a CTE MUST have a `FROM` clause referencing the previous CTE or table.
   - **NO EXCEPTIONS**: Even when using `SELECT *, ...new_columns...`, you MUST include `FROM previous_cte_name`.
   - **QUERY FAILS WITHOUT FROM**: Without a FROM clause, `*` and column references cannot be resolved, so the query will fail!
   
   ```sql
   -- ✅ CORRECT: FROM clause present
   cte_name AS (
     SELECT *, 
       COALESCE(ROUND(AVG(arr) OVER (), 2), 0.0) AS avg_arr
     FROM previous_cte  -- MANDATORY!
   )
   
   -- ❌ WRONG: Missing FROM clause (QUERY FAILS!)
   cte_name AS (
     SELECT *, 
       COALESCE(ROUND(AVG(arr) OVER (), 2), 0.0) AS avg_arr
     -- ERROR! No FROM clause - `*` and `arr` cannot be resolved!
   )
   ```

6. **VALIDATE BEFORE SUBMITTING**: Search your SQL for every column referenced in the final SELECT or ai_query prompts, and verify it exists with EXACTLY that name in the source CTE.

**ANTI-PATTERN DETECTION CHECKLIST:**
☐ **EVERY CTE SELECT has a FROM clause** - No SELECT without FROM!
☐ No column is named `foo_str` in one CTE and referenced as `foo` in another
☐ No column is named `foo` in one CTE and referenced as `foo_str` in another
☐ All columns in ai_query CONCAT exist in the immediate source CTE
☐ All columns in final SELECT exist in the last CTE
☐ Column names are consistent from definition to usage

**🔥 CLEAN CTE STRUCTURE - SEPARATE CONCERNS 🔥:**

**OPTIMIZED PATTERN: Merge NULL handling into first CTE when possible:**

1. **Data Retrieval with NULL Handling**: Apply COALESCE at time of first read from table
2. **Statistical Analysis CTE** (if needed): ONLY compute statistical metrics - no prompts
3. **Prompt Building CTE**: ONLY build prompts using CONCAT - keep this CTE clean and readable
4. **AI Function CTE**: ONLY call ai_query with the prepared prompts
5. **JSON Extraction CTE**: ONLY extract JSON fields using get_json_object

**EXAMPLE - Optimized with Merged NULL Handling:**

```sql
-- Step 1: Retrieve data with NULL handling applied immediately - keep correct types
WITH order_data_with_defaults AS (
  SELECT 
    order_id,
    COALESCE(customer_name, 'Unknown Customer') AS customer_name,
    COALESCE(order_amount, 0.0) AS order_amount,  -- ✅ Keep as DOUBLE
    COALESCE(CAST(order_date AS STRING), 'No Date') AS order_date,  -- DATE to STRING for display
    COALESCE(product_category, 'Uncategorized') AS product_category
  FROM `catalog`.`schema`.`orders` AS o
  WHERE order_id IS NOT NULL
    -- TODO: Add suitable filtering to load data that matches the business slice for this use case (keep commented until confirmed)
    -- AND lower(trim(order_status)) = 'running'  -- Example placeholder; adjust column/value
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only (after all WHERE conditions)
),
-- Step 2: Statistical analysis (if needed) - direct DOUBLE calculations
order_statistics AS (
  SELECT 
    *,
    COALESCE(ROUND(AVG(order_amount) OVER (), 2), 0.0) AS avg_order_amount,  -- ✅ Direct DOUBLE calc
    COALESCE(ROUND(STDDEV(order_amount) OVER (), 2), 0.0) AS stddev_order_amount  -- ✅ Direct DOUBLE calc
  FROM order_data_with_defaults
),
-- Step 3: Prompt building - CONCAT auto-converts DOUBLE to STRING
-- Generate ai_sys_prompt column FIRST, then pass to ai_query AND include in final output
order_analysis_prompts AS (
  SELECT 
    *,
    CONCAT('You are a Revenue Operations Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 15 years of experience in order analysis, fraud detection, and revenue optimization, ',
           'your expertise in transaction pattern analysis and anomaly detection aligns with the strategic initiative: Data-driven decision making. ',
           'Analyze order ', order_id, 
           ' from customer ', customer_name,
           ' for $', order_amount,  -- ✅ CONCAT auto-converts DOUBLE
           ' on ', order_date,
           ' in category ', product_category,
           '. Average order: $', avg_order_amount,  -- ✅ CONCAT auto-converts DOUBLE
           '. Output ONLY JSON with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_risk_level": "value", "ai_txt_recommendation": "value", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
           'MANDATORY LAST 7 FIELDS (in this exact order): ',
           '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Calculate specific savings/revenue impact with numbers from the data. Format: "[Description of impact] results in [$ amount]. Breakdown: Daily: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
           '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the business outcome numbers calculated above, ',
           '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
           '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
           '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
           '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
           '7) ai_sys_missing_data - format: "I can get higher confidence than [X]% if I can get access to [narrative about missing data like customer history, payment patterns, fraud indicators]. {{\"missing_data\": [\"customer_history\", \"payment_patterns\", \"fraud_indicators\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt  -- ✅ Named ai_sys_prompt for auditability
  FROM order_statistics
),
-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
order_insights AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS insights_json
  FROM order_analysis_prompts
)
-- Step 5: Final extraction - ONLY get_json_object with ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column for auditability
SELECT 
  order_id,
  customer_name,
  order_amount,
  get_json_object(insights_json, '$.ai_cat_risk_level') AS ai_cat_risk_level,
  get_json_object(insights_json, '$.ai_txt_recommendation') AS ai_txt_recommendation,
  -- MANDATORY LAST COLUMNS: the 7 JSON fields (starting with ai_txt_business_outcome), followed by ai_sys_prompt:
  COALESCE(get_json_object(insights_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(insights_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(insights_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(insights_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(insights_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(insights_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(insights_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
FROM order_insights;  -- ✅ NO LIMIT - data already sampled in first CTE
```

**WHY THIS OPTIMIZED PATTERN IS BETTER:**
- Fewer CTEs = a shorter query that is easier to read
- NULL handling happens at the earliest possible point (first read)
- Each remaining CTE has ONE clear responsibility
- The prompt building CTE is easy to read and debug
- Statistical calculations are separated from prompt construction
- Makes the query maintainable and understandable

**❌ WRONG - Everything mixed together:**
```sql
-- BAD: Mixing COALESCE, CAST, statistical functions, and CONCAT in one messy step
WITH messy_cte AS (
  SELECT *,
    ai_query('{sql_model_serving}',
      CONCAT('Analyze ', 
             COALESCE(customer_name, 'Unknown'),  -- ❌ COALESCE in prompt
             ' order $',
             COALESCE(CAST(order_amount AS STRING), '0'),  -- ❌ COALESCE + CAST in prompt
             ' (StdDev: ', COALESCE(CAST(stddev_order_amount AS STRING), 'N/A'), ')',  -- ❌ Complex nested logic
             '. Output JSON.')) AS insights
  FROM orders
)
SELECT * FROM messy_cte;  -- ❌ This messy pattern is wrong - don't copy it!
```

**✅ CORRECT - Optimized clean separation:**
- NULL handling merged into first CTE at time of read (`order_data_with_defaults`)
- Statistical calculations in separate CTE (`order_statistics`)
- Clean, readable prompt building in focused CTE (`order_analysis_prompts`)
- AI function call in its own CTE (`order_insights`)
- Persona instruction included in ai_query prompt

---

#### 1. **TABLE QUALIFICATION & ALIASING** (MOST CRITICAL):
- **EVERY table** in FROM, JOIN, or subquery **MUST** be fully qualified: `` `catalog`.`schema`.`table` ``
- **EVERY table MUST have an alias** immediately after the table name
- **Example CORRECT**: 
  ```sql
  FROM `catalog`.`schema`.`customer_table` AS c
  JOIN `catalog`.`schema`.`orders_table` AS o ON c.customer_id = o.customer_id
  ```
- **Example WRONG** (will be rejected):
  ```sql
  FROM customer_table  -- Missing catalog.schema AND alias
  FROM `catalog`.`schema`.`customer_table`  -- Missing alias
  ```

#### 2. **QUOTE USAGE (MOST CRITICAL - 90% OF ERRORS)**
- **String literals**: **ALWAYS** use **SINGLE QUOTES** (`'`) - NEVER double quotes (`"`)
- **Column/table names**: Use backticks (`` ` ``) for qualified names OR no quotes
- **CONCAT SYNTAX** - This is THE most common error:
  ```sql
  -- CORRECT ✅
  CONCAT('literal text ', column_name, ' more text')
  CONCAT('Customer: ', customer_id, ' at ', location_name)
  CONCAT('Analyze ownership for ', ownership_type, ' with owner ', owner_entity_name)
  
  -- WRONG ❌
  CONCAT(literal text, column_name)  -- Missing quotes on literals
  CONCAT('text', 'column_name')  -- Column name incorrectly quoted as string
  CONCAT("text", column)  -- Double quotes not allowed
  ```
  
  **MEMORY RULE**: 
  - Literal text (like "for", "to", ":", etc.) → **SINGLE QUOTE IT**: `'text'`
  - Column name (to show its VALUE) → **NO QUOTES**: `column_name`
  - NEVER quote column names as strings: `'column_name'` is WRONG

- **ARRAY SYNTAX** - Second most common error:
  ```sql
  -- CORRECT ✅
  ARRAY('item1', 'item2', 'item3')
  ai_classify(text, ARRAY('Product Quality', 'Customer Service', 'Shipping'))
  ai_extract(content, ARRAY('customer_name', 'invoice_number', 'total_amount'))
  
  -- WRONG ❌
  ARRAY(item1, item2, item3)  -- Missing single quotes
  ARRAY("item1", "item2")  -- Double quotes not allowed
  ARRAY('This is a very long category name that exceeds the fifty character limit')  -- Too long!
  ```
  
  **🚨 CRITICAL ARRAY RESTRICTIONS for ai_classify and ai_extract**:
  1. **Maximum 20 elements** - Arrays can have MAXIMUM 20 items due to Databricks limitations
     - ✅ GOOD: `ARRAY('a', 'b', 'c', ..., 't')` (20 items max)
     - ❌ BAD: `ARRAY('item1', 'item2', ..., 'item25')` (>20 items - will FAIL)
     - If you need more categories, use the most important 20 only
  
  2. **Fewer than 50 characters per item** - Each array element MUST be shorter than 50 characters
     - ✅ GOOD: `ARRAY('High Priority', 'Medium', 'Low')` (all <50 chars)
     - ✅ GOOD: `ARRAY('customer_name', 'invoice_num', 'amount')` (all <50 chars)
     - ❌ BAD: `ARRAY('High Priority Customer Service Escalation Required')` (exactly 50 chars, not <50 - will FAIL)
     - Use concise, abbreviated labels when necessary
  
  **VALIDATION CHECKLIST for ai_classify/ai_extract arrays:**
  - ✅ Total items ≤ 20
  - ✅ Each item length < 50 characters
  - ✅ All items use single quotes
  - ✅ Labels are clear but concise
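
The array checklist above can be sketched as a small pre-check. This is a minimal Python sketch, assuming the limits stated in this section (at most 20 items, each shorter than 50 characters); `label_problems` and `to_sql_array` are hypothetical helpers, not Databricks functions. Single quotes inside a label are doubled, the same escaping the example prompt uses for `organization''s`.

```python
MAX_LABELS = 20          # at most 20 items per array (limit stated in this section)
MAX_LABEL_LENGTH = 50    # each item must be shorter than 50 characters


def label_problems(labels):
    """Return a list of readable problems; an empty list means the array passes."""
    problems = []
    if len(labels) > MAX_LABELS:
        problems.append("too many items: %d > %d" % (len(labels), MAX_LABELS))
    for label in labels:
        if len(label) >= MAX_LABEL_LENGTH:
            problems.append("label too long (%d chars): %s" % (len(label), label))
    return problems


def to_sql_array(labels):
    """Render labels as a SQL ARRAY literal with single-quoted items."""
    quoted = ["'" + label.replace("'", "''") + "'" for label in labels]
    return "ARRAY(" + ", ".join(quoted) + ")"


labels = ["Product Quality", "Customer Service", "Shipping"]
assert label_problems(labels) == []
print(to_sql_array(labels))  # ARRAY('Product Quality', 'Customer Service', 'Shipping')
```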

- **AI_FORECAST SYNTAX**:
  
  **🚨🚨🚨 CRITICAL: ALL column names in time_col, value_col, group_col MUST be STRING LITERALS (in single quotes) 🚨🚨🚨**
  
  ```sql
  -- CORRECT ✅ - Basic with dynamic horizon (30 days ahead)
  -- NOTE: 'ds' and 'val' are STRING LITERALS (quoted), NOT column references!
  AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'val', horizon => (SELECT date_add(DAY, 30, MAX(ds)) FROM past))
  
  -- CORRECT ✅ - BEST PRACTICE: Dynamic horizon derived from data
  AI_FORECAST(TABLE(past), 
    time_col => 'ds', 
    value_col => 'revenue',
    horizon => (SELECT date_add(WEEK, 1, MAX(ds)) FROM past))
  
  -- CORRECT ✅ - Multiple metrics + groups
  AI_FORECAST(TABLE(past), 
    time_col => 'ds', 
    value_col => ARRAY('revenue', 'orders'),
    group_col => 'product_category',
    horizon => (SELECT date_add(DAY, 30, MAX(ds)) FROM past),
    prediction_interval_width => 0.95)
  
  -- CORRECT ✅ - With parameters for seasonality
  AI_FORECAST(TABLE(past), 
    time_col => 'ds', 
    value_col => 'sales',
    horizon => (SELECT date_add(DAY, 90, MAX(ds)) FROM past),
    parameters => '{{"weekly_order": 10, "global_floor": 0}}')
  
  -- WRONG ❌ - Missing horizon
  AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'val')  -- Missing horizon parameter (REQUIRED!)
  
  -- WRONG ❌ - Static date (bad practice)
  AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'val', horizon => '2024-12-31')  -- Use dynamic horizon instead
  
  -- WRONG ❌ - Double quotes outside parameters
  AI_FORECAST(TABLE(past), parameters => "{{"weekly_order": 10}}")  -- MUST use single quotes outside!
  
  -- WRONG ❌❌❌ - UNQUOTED COLUMN NAMES (MOST COMMON ERROR!) ❌❌❌
  AI_FORECAST(TABLE(past), time_col => ds, value_col => val)  -- ❌ FAILS! 'ds' and 'val' MUST be in quotes!
  AI_FORECAST(TABLE(past), time_col => 'ds', value_col => ARRAY(revenue, orders))  -- ❌ FAILS! revenue, orders need quotes!
  AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'val', group_col => ARRAY(customer_id, region))  -- ❌ FAILS! Unquoted!
  -- ERROR: [UNRESOLVED_COLUMN] A column with name 'ds'/'revenue'/'customer_id' cannot be resolved
  -- FIX: Use ARRAY('revenue', 'orders') and ARRAY('customer_id', 'region') with SINGLE QUOTES!
  ```
  
  **🚨 CRITICAL: parameters MUST USE SINGLE QUOTES ON THE OUTSIDE 🚨**
  - ✅ CORRECT: `parameters => '{{"weekly_order": 10, "global_floor": 0}}'` (SINGLE quotes wrap the JSON string)
  - ❌ WRONG: `parameters => "{{"weekly_order": 10, "global_floor": 0}}"` (DOUBLE quotes - THIS IS INCORRECT)
  - ❌ WRONG: `parameters => '{{'weekly_order': 10, 'global_floor': 0}}'` (Python dict style - wrong JSON format)
  - **RULE**: In SQL, string literals use SINGLE QUOTES. JSON keys/values use DOUBLE QUOTES. Combined: `'{{"key": "value"}}'`
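
One way to avoid mixing the two quote styles is to let a JSON serializer produce the inner text and add the SQL single quotes afterwards. The following is a minimal Python sketch under that assumption; `forecast_parameters_literal` is a hypothetical helper, and the setting names are the ones used in the examples above.

```python
import json


def forecast_parameters_literal(**settings):
    """Render keyword settings as a single-quoted SQL string containing JSON."""
    body = json.dumps(settings)                  # JSON keys and strings get DOUBLE quotes
    return "'" + body.replace("'", "''") + "'"   # the SQL literal gets SINGLE quotes


literal = forecast_parameters_literal(weekly_order=10, global_floor=0)
print("parameters => " + literal)
```

Because the serializer always emits double-quoted keys, the Python-dict style with single-quoted keys cannot occur in the result.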
  
  **OUTPUT COLUMNS - CRITICAL: AI_FORECAST RETURNS ONLY THESE COLUMNS**:
  
  AI_FORECAST returns a **NEW** table with ONLY these columns (ALL other columns are DROPPED):
  1. The **time column** (ds or your specified time_col name) - same type as input
  2. The **group column(s)** specified in group_col parameter - ONLY these group columns are returned
  3. The **forecast columns** for each value_col:
     - `{{value_col}}_forecast`: predicted value (DOUBLE)
     - `{{value_col}}_upper`: upper bound of prediction interval (DOUBLE)
     - `{{value_col}}_lower`: lower bound of prediction interval (DOUBLE)
  
  **🚨 CRITICAL RULE: ONLY columns in group_col are returned 🚨**
  
  **EXAMPLES FROM DOCUMENTATION:**
  
  | Input Table Columns | Arguments | Output Table Columns |
  |---------------------|-----------|---------------------|
  | ts, val | time_col='ts', value_col='val' | ts, val_forecast, val_upper, val_lower |
  | ds, val | time_col='ds', value_col='val' | ds, val_forecast, val_upper, val_lower |
  | ts, dim1, dollars | time_col='ts', value_col='dollars', group_col='dim1' | ts, dim1, dollars_forecast, dollars_upper, dollars_lower |
  | ts, dim1, dim2, dollars, users | time_col='ts', value_col=ARRAY('dollars','users'), group_col=ARRAY('dim1','dim2') | ts, dim1, dim2, dollars_forecast, dollars_upper, dollars_lower, users_forecast, users_upper, users_lower |
  
  **KEY INSIGHT**: Even if your input CTE has 20 columns, AI_FORECAST only returns the columns listed above!
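
The table above follows a simple naming rule, which can be sketched in Python to predict which columns survive the call. This is only an illustration of the rule as described in this section; `forecast_output_columns` and `as_list` are hypothetical helpers, and the column order simply mirrors the table.

```python
def as_list(value):
    """Accept either a single column name or a list of names."""
    return [value] if isinstance(value, str) else list(value)


def forecast_output_columns(time_col, value_col, group_col=None):
    """List the expected output columns: time, groups, then three columns per value."""
    columns = [time_col]
    if group_col is not None:
        columns.extend(as_list(group_col))
    for name in as_list(value_col):
        columns.extend([name + "_forecast", name + "_upper", name + "_lower"])
    return columns


print(forecast_output_columns("ts", ["dollars", "users"], ["dim1", "dim2"]))
```

Any column referenced after the forecast step that is not in this list has to come from a JOIN back to the source data.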
  
  **🚨 CRITICAL: AI_FORECAST COLUMN LIMITATION - ONLY RETURNS SPECIFIC COLUMNS 🚨**
  
  **IMPORTANT RULE**: AI_FORECAST **ONLY** returns:
  - The time column (ds or specified time_col)
  - The group column(s) specified in group_col parameter - **ONLY THESE**
  - The forecast columns ({{value_col}}_forecast, {{value_col}}_upper, {{value_col}}_lower)
  
  **ALL OTHER COLUMNS from the input table are DROPPED and NOT available in the AI_FORECAST output.**
  
  **🔥 CRITICAL MISTAKE: GROUP BY vs group_col 🔥**
  
  **COMMON ERROR**: Developers often GROUP BY multiple columns in the input CTE, but only specify ONE column in group_col.
  
  **EXAMPLE OF THE PROBLEM:**
  ```sql
  -- Step 1: GROUP BY multiple columns
  WITH historical_data AS (
    SELECT 
      airport_code,      -- Column A
      service_type,      -- Column B
      DATE_TRUNC('month', date) AS ds,
      SUM(cost) AS monthly_cost
    FROM `catalog`.`schema`.`your_table` AS t
    GROUP BY airport_code, service_type, DATE_TRUNC('month', date)  -- ⚠️ Groups by BOTH columns
    ORDER BY ds
    -- ✅ NO LIMIT in this CTE
  ),
  -- Step 2: AI_FORECAST with only ONE group_col
  forecast_results AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_data),
      time_col => 'ds',
      value_col => 'monthly_cost',
      group_col => 'airport_code',  -- ⚠️ ONLY airport_code specified!
      horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM historical_data)
    )
    -- ✅ NO LIMIT
  )
  -- Step 3: Try to SELECT service_type
  SELECT 
    airport_code,     -- ✅ This exists (in group_col)
    service_type,     -- ❌ ERROR! This doesn't exist in output!
    monthly_cost_forecast
  FROM forecast_results
  ```
  
  **WHY THIS FAILS:**
  - Input CTE grouped by airport_code AND service_type
  - But group_col only specified 'airport_code'
  - AI_FORECAST **ONLY returns columns in group_col**
  - service_type is **NOT returned** because it wasn't in group_col!
  
  **THE FIX - TWO OPTIONS:**
  
  **Option 1: Include ALL grouping dimensions in group_col (RECOMMENDED)**
  ```sql
  forecast_results AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_data),
      time_col => 'ds',
      value_col => 'monthly_cost',
      group_col => ARRAY('airport_code', 'service_type'),  -- ✅ Include BOTH columns
      horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM historical_data)
    )
    -- ✅ NO LIMIT
  )
  -- Now service_type is available!
  SELECT airport_code, service_type, monthly_cost_forecast FROM forecast_results
  ```
  
  **Option 2: Join back to original table to get missing columns**
  ```sql
  forecast_with_context AS (
    SELECT 
      f.*,
      t.service_type  -- Get service_type from JOIN
    FROM forecast_results AS f
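    LEFT JOIN `catalog`.`schema`.`your_table` AS t
      ON f.airport_code = t.airport_code
    -- ⚠️ Each forecast row repeats once per matching source row; deduplicate if one row per forecast is needed
    -- ✅ NO LIMIT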
  )
  ```
  
  **🔥 RULE: group_col MUST include ALL dimensions you want in the forecast output 🔥**
  
  **🔥 MANDATORY REQUIREMENT: group_col is NOW REQUIRED 🔥**
  
  **WHY group_col is MANDATORY:**
  - You MUST join forecast results back to original tables to get additional columns (for ai_query prompts, context, etc.)
  - Without group_col, there is NO WAY to join forecast back to original data
  - group_col serves as the JOIN key between forecast results and original table
  
  **RULE**: ALWAYS specify group_col in AI_FORECAST - it is NO LONGER OPTIONAL, even though the short syntax examples earlier in this section omit it for brevity.
  
  **CORRECT ✅**: Use entity ID columns as group_col (customer_id, product_id, route_id, store_id, etc.)
  **CORRECT ✅**: Use ARRAY() if you need multiple grouping dimensions in the output
  **WRONG ❌**: GROUP BY multiple columns but only specify one in group_col
  **WRONG ❌**: Omitting group_col when you need to reference original table columns later
  
  **SOLUTION - MANDATORY JOIN PATTERN**: Join the forecast results back to the original table using the group_col as the JOIN key.
  
  **CORRECT PATTERN ✅:** (Adapt table/column names to YOUR schema)
  ```sql
  -- Step 1: Historical data for forecasting
  -- [ADAPT: Change table/column names to match YOUR schema and industry]
  -- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
  WITH historical_entity_metrics AS (
    SELECT 
      entity_id,                                              -- CRITICAL: filtered with IS NOT NULL (used as group_col)
      COALESCE(TRIM(category_code), 'Unknown Category') AS category_code,  -- ✅ COALESCE'd
      COALESCE(TRIM(subcategory_code), 'Unknown Subcategory') AS subcategory_code,  -- ✅ COALESCE'd
      COALESCE(TRIM(entity_name), 'Unknown Entity') AS entity_name,  -- ✅ COALESCE'd
      COALESCE(TRIM(location_name), 'Unknown Location') AS location_name,  -- ✅ COALESCE'd
      COALESCE(TRIM(entity_type), 'Unknown Type') AS entity_type,  -- ✅ COALESCE'd
      DATE_TRUNC('month', activity_date) AS ds,               -- CRITICAL: filtered with IS NOT NULL
      COALESCE(SUM(metric_value), 0.0) AS total_metric        -- ✅ COALESCE'd
    FROM `catalog`.`schema`.`your_table` AS t
    WHERE activity_date >= add_months(CURRENT_DATE(), -30)  -- 30 months history for 3-month forecast (10:1 ratio)
      AND activity_date IS NOT NULL
      AND entity_id IS NOT NULL
    -- TODO: Add suitable filtering to load data that matches the operational scope for this use case (keep commented until confirmed)
    -- AND status = 'active'  -- Example placeholder; adjust column/value
    -- AND lower(trim(entity_type)) = 'primary'  -- Example placeholder; adjust column/value
    GROUP BY entity_id, category_code, subcategory_code, entity_name, location_name, entity_type, DATE_TRUNC('month', activity_date)
    ORDER BY ds
  ),
  -- Step 2: Generate forecasts (note: only returns entity_id, ds, and forecast columns)
  metric_forecast_results AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_entity_metrics),
      time_col => 'ds',
      value_col => 'total_metric',
      group_col => 'entity_id',
      horizon => (SELECT add_months(MAX(ds), 3) FROM historical_entity_metrics)  -- 3 months ahead
    )
  -- ✅ NO LIMIT in CTEs
  ),
  -- Step 3: JOIN back to original table to get entity context columns
  -- 🚨 LEFT JOIN can introduce NULLs - COALESCE all joined columns!
  forecast_with_entity_context AS (
    SELECT 
      f.*,
      COALESCE(TRIM(t.category_code), 'Unknown Category') AS category_code,  -- ✅ COALESCE'd from LEFT JOIN
      COALESCE(TRIM(t.subcategory_code), 'Unknown Subcategory') AS subcategory_code,  -- ✅ COALESCE'd
      COALESCE(TRIM(t.entity_name), 'Unknown Entity') AS entity_name,  -- ✅ COALESCE'd
      COALESCE(TRIM(t.location_name), 'Unknown Location') AS location_name,  -- ✅ COALESCE'd
      COALESCE(TRIM(t.entity_type), 'Unknown Type') AS entity_type,  -- ✅ COALESCE'd
      COALESCE(TRIM(t.additional_attribute), 'N/A') AS additional_attribute  -- ✅ COALESCE'd
    FROM metric_forecast_results AS f
    LEFT JOIN `catalog`.`schema`.`your_table` AS t
      ON f.entity_id = t.entity_id  -- JOIN on the group_col used in AI_FORECAST
    -- ✅ NO LIMIT
  )
  -- Now you have access to both forecast columns AND original entity columns
  SELECT * FROM forecast_with_entity_context;  -- ✅ NO LIMIT
  ```
  
  **WRONG PATTERN ❌ - WITHOUT group_col:**
  ```sql
  -- BAD: No group_col specified - can't join back to original table
  -- NOTE: Adding a value comparison (WHERE status = 'active') here would ALSO be forbidden
  WITH historical_data AS (
    SELECT 
      DATE_TRUNC('month', activity_date) AS ds,
      SUM(metric_value) AS total_metric
    FROM `catalog`.`schema`.`your_table` AS t
    WHERE activity_date IS NOT NULL  -- ✅ Only IS NULL/IS NOT NULL allowed
      -- ❌ WRONG: WHERE status = 'active' would violate value comparison rules
    GROUP BY DATE_TRUNC('month', activity_date)  -- ❌ No grouping by entity_id!
    ORDER BY ds
    -- ❌ NO date filtering with adaptive ratio for ai_forecast
  ),
  metric_forecast_results AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_data),
      time_col => 'ds',
      value_col => 'total_metric',
      -- ❌ NO group_col - can't join back to get entity details!
      horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM historical_data)
    )
    -- ✅ NO LIMIT
  )
  -- ❌ ERROR: Can't access entity-specific columns like category_code!
  -- No way to join back to original table without a group_col JOIN key
  SELECT 
    ds,
    total_metric_forecast,
    category_code  -- ❌ ERROR: This column doesn't exist and can't be joined!
  FROM metric_forecast_results
  ```
  
  **WRONG PATTERN ❌ - Trying to reference non-existent columns:**
  ```sql
  -- BAD: Trying to reference columns from input table directly after AI_FORECAST
  metric_forecast_results AS (
    SELECT * FROM AI_FORECAST(...)
    -- ✅ NO LIMIT
  )
  -- This will FAIL with error: [UNRESOLVED_COLUMN] category_code cannot be resolved
  SELECT 
    entity_id,
    category_code,  -- ❌ ERROR: This column doesn't exist in forecast results!
    subcategory_code,  -- ❌ ERROR: This column doesn't exist!
    total_metric_forecast
  FROM metric_forecast_results
  ```
  
  **KEY RULES:**
  1. **🔥 ALWAYS specify group_col in AI_FORECAST 🔥** - Use entity ID columns (customer_id, product_id, entity_id, etc.) so you can join back
  2. **Always join forecast results back to original table** when you need additional columns (which is almost always for ai_query)
  3. **Use the group_col as the JOIN key** (this is the column that links forecast to original data)
  4. **Join to the source table**, not the aggregated CTE (to get latest row-level details)
  5. **Use LEFT JOIN** to preserve all forecast rows even if original data is missing
  6. **Plan ahead**: If you know you'll need additional columns for ai_query prompts, add a JOIN CTE after AI_FORECAST
  
  **REMEMBER: Without group_col, you CANNOT join back to original table to get additional columns!**

---

#### 2a. **🚨🚨🚨 CRITICAL: NEVER USE AI TO EXTRACT DATA ALREADY AVAILABLE IN COLUMNS 🚨🚨🚨** (ABSOLUTE PROHIBITION)

**🔥 ZERO TOLERANCE POLICY: DO NOT ROUND-TRIP STRUCTURED DATA THROUGH AI FUNCTIONS 🔥**

**WHAT THIS MEANS:**
You must NEVER use `ai_extract`, `ai_query`, or any AI function to extract values that are ALREADY AVAILABLE as structured columns in your source tables. This is nonsensical, wasteful, and completely unacceptable.

**THE PROHIBITED PATTERN (ABSOLUTELY FORBIDDEN):**

**❌ WRONG EXAMPLE - DO NOT DO THIS:**
```sql
-- Step 1: Select data that ALREADY HAS the values in columns
WITH commodity_data AS (
  SELECT 
    commodity_id,
    commodity_name,
    un_number,              -- ❌ This value ALREADY EXISTS in a column!
    iata_dg_class,          -- ❌ This value ALREADY EXISTS in a column!
    special_handling_code   -- ❌ This value ALREADY EXISTS in a column!
  FROM `catalog`.`schema`.`commodity` AS c
  -- ✅ NO LIMIT in CTEs
),
-- Step 2: Build a prompt that EMBEDS these already-known values into text
commodity_prompts AS (
  SELECT 
    *,
    CONCAT('Extract commodity specifications from this data: ',
           'Commodity: ', commodity_name,
           ', Current UN: ', un_number,              -- ❌ Embedding known value into text!
           ', IATA Class: ', iata_dg_class,          -- ❌ Embedding known value into text!
           ', Handling Code: ', special_handling_code, -- ❌ Embedding known value into text!
           '. Extract: UN number, IATA class, handling code.') AS extraction_prompt
  FROM commodity_data
),
-- Step 3: Use AI to extract the SAME VALUES that were just embedded! ❌❌❌
extracted_specs AS (
  SELECT 
    *,
    ai_extract(extraction_prompt, 
      ARRAY('extracted_un_number',        -- ❌ Extracting un_number that we already had!
            'extracted_iata_class',       -- ❌ Extracting iata_dg_class that we already had!
            'extracted_handling_code')    -- ❌ Extracting special_handling_code that we already had!
    ) AS specifications
  FROM commodity_prompts
)
-- This is COMPLETELY NONSENSICAL - we're extracting values we already have!
SELECT 
  commodity_id,
  un_number,                                        -- We already have this!
  specifications['extracted_un_number'],            -- This is the SAME VALUE!
  iata_dg_class,                                    -- We already have this!
  specifications['extracted_iata_class'],           -- This is the SAME VALUE!
  special_handling_code,                            -- We already have this!
  specifications['extracted_handling_code']         -- This is the SAME VALUE!
FROM extracted_specs
;
```

**WHY THIS IS ABSOLUTELY UNACCEPTABLE:**
1. **Nonsensical Logic**: You're taking structured data → converting to text → using AI to extract back to structured data
2. **Wasteful**: Burning AI tokens and compute to extract values you already have
3. **Error-Prone**: AI may extract incorrectly, introducing errors where none existed
4. **Performance**: Adds unnecessary latency and cost
5. **Data Quality**: Degrades data quality by potentially introducing extraction errors

**✅ CORRECT PATTERN - USE THE COLUMNS DIRECTLY:**
```sql
-- If you already have the data in columns, JUST USE THOSE COLUMNS!
WITH commodity_data AS (
  SELECT 
    commodity_id,
    commodity_name,
    un_number,              -- ✅ Use this column directly!
    iata_dg_class,          -- ✅ Use this column directly!
    special_handling_code   -- ✅ Use this column directly!
  FROM `catalog`.`schema`.`commodity` AS c
  WHERE un_number IS NOT NULL  -- Filter for data quality
    AND iata_dg_class IS NOT NULL
    AND special_handling_code IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
)
-- Just SELECT the columns you need - NO AI extraction necessary!
SELECT 
  commodity_id,
  commodity_name,
  un_number,              -- ✅ Already have it - no extraction needed
  iata_dg_class,          -- ✅ Already have it - no extraction needed
  special_handling_code   -- ✅ Already have it - no extraction needed
FROM commodity_data
;
```

**WHEN AI EXTRACTION IS APPROPRIATE (THE ONLY VALID CASES):**

**✅ CORRECT USE CASE 1: Extract from UNSTRUCTURED TEXT where data is NOT in columns**
```sql
-- Valid: Extract from free-text notes where values are NOT in separate columns
WITH order_notes AS (
  SELECT 
    order_id,
    notes_text  -- Unstructured text like "Customer requested delivery by Friday, contact John at 555-1234"
  FROM `catalog`.`schema`.`orders` AS o
  WHERE notes_text IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
)
SELECT 
  order_id,
  ai_extract(notes_text, 
    ARRAY('delivery_date', 'contact_person', 'phone_number')  -- ✅ Extracting from unstructured text
  ) AS extracted_info
FROM order_notes
;
-- This is VALID because delivery_date, contact_person, phone_number are NOT in separate columns
```

**✅ CORRECT USE CASE 2: Extract from DOCUMENT FILES using ai_parse_document**
```sql
-- Valid: Extract from PDF/image files
WITH parsed_docs AS (
  SELECT 
    path,
    ai_parse_document(content, map('version', '2.0')) AS parsed
  FROM READ_FILES('/Volumes/catalog/schema/invoices/*.pdf', format => 'binaryFile')
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
)
SELECT 
  path,
  ai_extract(
    concat_ws('\n\n', transform(try_cast(parsed:document:elements AS ARRAY<VARIANT>),
      element -> try_cast(element:content AS STRING))),
    ARRAY('invoice_number', 'vendor_name', 'total_amount')  -- ✅ Extracting from document files
  ) AS invoice_data
FROM parsed_docs
;
-- This is VALID because the data only exists in unstructured document files
```

**🚨 MANDATORY PRE-FLIGHT CHECK BEFORE USING ai_extract OR ai_query FOR EXTRACTION:**

Before you write ANY SQL that uses AI to extract data, you MUST answer these questions:

**☐ Question 1: Does this value already exist as a column in my source table?**
- If YES → DO NOT use AI extraction, just SELECT the column directly
- If NO → Proceed to Question 2

**☐ Question 2: Am I extracting from truly unstructured data (text fields, documents)?**
- If YES → Proceed to Question 3
- If NO → DO NOT use AI extraction

**☐ Question 3: Am I building a prompt that embeds column values, then extracting those same values?**
- If YES → STOP! This is the prohibited pattern - just use the columns directly
- If NO → Proceed with AI extraction
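
The three questions above form a short decision procedure, sketched below in Python. The function and argument names are hypothetical; the sketch assumes each question has already been answered by inspecting the source table and the draft prompt.

```python
def ai_extraction_allowed(value_is_column, source_is_unstructured, prompt_round_trips_columns):
    """Apply Questions 1 to 3 in order; return True only if AI extraction is appropriate."""
    if value_is_column:               # Question 1: the value already exists as a column
        return False
    if not source_is_unstructured:    # Question 2: the source is not free text or a document
        return False
    if prompt_round_trips_columns:    # Question 3: the prompt embeds values only to extract them back
        return False
    return True


# un_number already exists as a column: use the column directly.
print(ai_extraction_allowed(True, False, True))    # False
# Contact details buried in free-text order notes: extraction is appropriate.
print(ai_extraction_allowed(False, True, False))   # True
```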

**EXAMPLES OF PROHIBITED VS ALLOWED:**

| Data Source | Column Exists? | Use AI? | Rationale |
|-------------|----------------|---------|-----------|
| `un_number` column | YES | ❌ NO | Value is already structured - use column directly |
| Free-text `description` field | NO | ✅ YES | Data is embedded in unstructured text |
| `iata_dg_class` column | YES | ❌ NO | Value is already structured - use column directly |
| PDF invoice file | NO | ✅ YES | Data only exists in document file |
| `customer_email` column | YES | ❌ NO | Value is already structured - use column directly |
| Customer review text | NO | ✅ YES | Extracting entities from unstructured review text |

**🔥 CRITICAL VALIDATION CHECKLIST (MANDATORY BEFORE SUBMITTING SQL):**

For ANY SQL that uses `ai_extract`, `ai_query`, or any AI function to extract data:

☐ I have verified that the values I'm extracting DO NOT already exist as columns in my source tables
☐ I am extracting from truly unstructured data (free-text fields, documents, notes) NOT from structured columns
☐ I am NOT building prompts that embed column values only to extract them back
☐ If the data exists in columns, I am using those columns directly instead of AI extraction
☐ My use of AI adds genuine value (extracting from unstructured sources) rather than just round-tripping data

**🚨 IF YOU VIOLATE THIS RULE, YOUR SQL WILL BE REJECTED AS NONSENSICAL AND UNACCEPTABLE 🚨**

**REMEMBER: AI is for extracting structure from unstructured data, NOT for re-extracting data that's already structured!**
  
  **🚨 CRITICAL: AI_FORECAST MANDATORY REQUIREMENTS 🚨**
  
  **1. INPUT TABLE MUST HAVE UNIQUE TIME VALUES PER GROUP**
  
  AI_FORECAST will FAIL if the time column contains duplicate values within a partition. You MUST ensure uniqueness by using GROUP BY on the time column in the CTE that prepares data for AI_FORECAST.
  
  **CORRECT PATTERN ✅ - GROUP BY time column to deduplicate:**
  ```sql
  -- Step 1: Prepare historical data with GROUP BY to ensure unique time values
  -- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
  WITH historical_route_metrics AS (
    SELECT 
      route_id,                                              -- CRITICAL: filtered with IS NOT NULL (group_col)
      DATE_TRUNC('month', flight_date) AS ds,                -- CRITICAL: filtered with IS NOT NULL (time_col)
      CAST(COALESCE(SUM(passenger_count), 0) AS DOUBLE) AS passenger_demand, -- ✅ COALESCE'd and CAST to DOUBLE (value_col)
      CAST(COALESCE(SUM(revenue), 0.0) AS DOUBLE) AS total_revenue           -- ✅ COALESCE'd and CAST to DOUBLE (value_col)
    FROM `catalog`.`schema`.`flights` AS f
    WHERE flight_date >= add_months(CURRENT_DATE(), -30)  -- 30 months history for 3-month forecast (10:1 ratio)
      AND flight_date IS NOT NULL
      AND route_id IS NOT NULL
    GROUP BY route_id, DATE_TRUNC('month', flight_date)  -- 🔥 MANDATORY: GROUP BY time column
    ORDER BY ds
  ),
  -- Step 2: Generate forecast (input is guaranteed to have unique ds values per route_id)
  demand_forecast_results AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_route_metrics),
      time_col => 'ds',
      value_col => ARRAY('passenger_demand', 'total_revenue'),
      group_col => 'route_id',
      horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM historical_route_metrics)  -- 3 months ahead
    )
  -- ✅ NO LIMIT in CTEs
  )
  SELECT * FROM demand_forecast_results;  -- ✅ NO LIMIT
  ```
  
  **WRONG PATTERN ❌ - No GROUP BY (will cause duplicate time values):**
  ```sql
  -- BAD: No GROUP BY - multiple flights per month will cause duplicate ds values
  WITH historical_route_metrics AS (
    SELECT 
      route_id,
      DATE_TRUNC('month', flight_date) AS ds,
      passenger_count AS passenger_demand  -- ❌ No aggregation!
    FROM `catalog`.`schema`.`flights` AS f
    WHERE flight_date IS NOT NULL
    -- ❌ NO GROUP BY - ds will have duplicates!
    LIMIT 1000
  )
  -- This will FAIL with: PYTHON_TVF_COLUMN_VALUES_MUST_BE_UNIQUE_WITHIN_PARTITION
  ```
  
  **MANDATORY RULES FOR AI_FORECAST INPUT:**
  - ✅ ALWAYS use GROUP BY on the time column (ds or specified time_col) to ensure uniqueness
  - ✅ Include all group_col columns in the GROUP BY clause
  - ✅ Use aggregate functions (SUM, AVG, COUNT, MAX, MIN) for value columns
  - ✅ **CAST ALL value_col columns to DOUBLE** (e.g. `CAST(revenue AS DOUBLE)`) - AI_FORECAST requires DOUBLE input
  - ✅ Validate that each (group_col, time_col) combination is unique
  - ❌ NEVER pass raw row-level data to AI_FORECAST without aggregation
  - ❌ NEVER pass STRING or DECIMAL types as value_col - ALWAYS CAST TO DOUBLE
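
The uniqueness rule can be illustrated outside SQL. The following minimal Python sketch assumes rows are `(group_col, time_col, value)` tuples; both helper names are hypothetical. It shows that raw rows contain duplicate keys and that summing per key (the equivalent of GROUP BY with SUM and a DOUBLE value) removes them.

```python
from collections import Counter

# Illustrative sketch only: rows are (group_col, time_col, value) tuples and
# both helper names are hypothetical.
def find_duplicate_keys(rows):
    # Count how often each (group_col, time_col) pair occurs.
    counts = Counter((group, ts) for group, ts, _ in rows)
    # Any pair seen more than once breaks the uniqueness requirement.
    return sorted(key for key, n in counts.items() if n > 1)


def aggregate_by_key(rows):
    # Equivalent of GROUP BY group_col, time_col with SUM(value) as a float.
    totals = dict()
    for group, ts, value in rows:
        key = (group, ts)
        totals[key] = totals.get(key, 0.0) + float(value)
    return [(group, ts, total) for (group, ts), total in sorted(totals.items())]


raw_rows = [
    ("DXB-LHR", "2024-01-01", 310.0),
    ("DXB-LHR", "2024-01-01", 295.0),  # second flight in the same month
    ("DXB-JFK", "2024-01-01", 410.0),
]
print(find_duplicate_keys(raw_rows))
print(find_duplicate_keys(aggregate_by_key(raw_rows)))
```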
  
  **2. FILTER NULL FORECASTED VALUES AFTER AI_FORECAST**
  
  AI_FORECAST may return NULL values in the forecast columns (*_forecast, *_upper, *_lower) for certain time periods. You MUST filter these out before using the forecast results.
  
  **CORRECT PATTERN ✅ - Filter NULL forecasts:**
  ```sql
  -- Step 2: Generate forecast (continues the WITH clause from Step 1)
  demand_forecast_raw AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_route_metrics),
      time_col => 'ds',
      value_col => 'passenger_demand',
      group_col => 'route_id',
      horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM historical_route_metrics)
    )
  -- ✅ NO LIMIT in CTEs
  ),
  -- Step 3: Filter out rows with NULL forecasted values
  demand_forecast_clean AS (
    SELECT *
    FROM demand_forecast_raw
    WHERE passenger_demand_forecast IS NOT NULL  -- 🔥 MANDATORY: Filter NULL forecasts
  -- ✅ NO LIMIT in CTEs
  )
  SELECT * FROM demand_forecast_clean;  -- ✅ NO LIMIT
  ```
  
  **CRITICAL: Filter ALL forecasted columns if multiple value_col specified:**
  ```sql
  -- Multiple value columns - filter NULL for ALL forecast columns
  demand_forecast_clean AS (
    SELECT *
    FROM demand_forecast_raw
    WHERE passenger_demand_forecast IS NOT NULL  -- Filter first metric
      AND total_revenue_forecast IS NOT NULL     -- Filter second metric
  -- ✅ NO LIMIT in CTEs
  )
  ```
  
  **3. COMPREHENSIVE MANDATORY REQUIREMENTS:**
  
  - **MUST**: INPUT TABLE MUST HAVE UNIQUE (group_col, time_col) COMBINATIONS - Use GROUP BY to deduplicate
  - **MUST**: Use a WHERE clause that limits history by date, sizing the window with ADAPTIVE history-to-horizon ratios based on time granularity:
    - High-frequency (minute/hour): Fixed calendar periods (7 days for minute, 4 weeks for hour)
    - Mid-frequency (day/week/month): 10:1 ratio (e.g., 40 weeks of history for a 4-week forecast)
    - Low-frequency (quarter/year): Reduced ratios (8:1 for quarter, 3-5:1 for year)
  - **MUST**: Filter NULL forecasted values using WHERE {{value_col}}_forecast IS NOT NULL after AI_FORECAST
  - **MUST**: Derive the horizon with date_add(UNIT, X, MAX(time_col)) for dynamic forecasting
  - **MUST**: UNIT must be HOUR, DAY, WEEK, MONTH, QUARTER, or YEAR WITHOUT quotes (e.g., date_add(DAY, 30, MAX(ds)))
  - **MUST**: Always specify group_col to enable joining back to original table
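
The adaptive history window can be sketched as a small lookup. This is a minimal Python illustration, not a required implementation: the function name is hypothetical, and the 5:1 ratio for yearly data is an assumption chosen from the allowed 3-5:1 range.

```python
# Illustrative sketch only: the function name and the 5:1 choice for yearly
# data (the rules allow 3-5:1) are assumptions.
FIXED_WINDOWS = dict(minute=(7, "DAY"), hour=(4, "WEEK"))
HISTORY_RATIOS = dict(day=10, week=10, month=10, quarter=8, year=5)


def history_window(granularity, horizon_periods):
    # High-frequency data uses a fixed calendar window regardless of horizon.
    if granularity in FIXED_WINDOWS:
        return FIXED_WINDOWS[granularity]
    # Otherwise scale the horizon by the ratio for that granularity.
    ratio = HISTORY_RATIOS[granularity]
    return (horizon_periods * ratio, granularity.upper())


print(history_window("week", 4))    # 4-week horizon
print(history_window("month", 3))   # 3-month horizon
print(history_window("hour", 24))   # fixed window for hourly data
```

The returned pair maps onto the WHERE filter, for example `(40, "WEEK")` corresponds to `date_add(WEEK, -40, CURRENT_DATE())`.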
  
  **COMPLETE EXAMPLE WITH ALL MANDATORY REQUIREMENTS:**
  ```sql
  -- Step 1: Prepare aggregated historical data with 10:1 ratio
  -- For 4-week horizon, use 40 weeks of history
  -- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
  WITH historical_sales_metrics AS (
    SELECT 
      product_id,                                            -- CRITICAL: filtered with IS NOT NULL (group_col)
      DATE_TRUNC('week', order_date) AS ds,                  -- CRITICAL: filtered with IS NOT NULL (time_col)
      CAST(COALESCE(SUM(quantity), 0) AS DOUBLE) AS total_quantity,    -- ✅ COALESCE'd and CAST to DOUBLE (value_col)
      CAST(COALESCE(SUM(revenue), 0.0) AS DOUBLE) AS total_revenue     -- ✅ COALESCE'd and CAST to DOUBLE (value_col)
    FROM `catalog`.`schema`.`sales` AS s
    WHERE order_date >= date_add(WEEK, -40, CURRENT_DATE())  -- 40 weeks history for 4-week forecast (10:1 ratio)
      AND order_date IS NOT NULL
      AND product_id IS NOT NULL
    GROUP BY product_id, DATE_TRUNC('week', order_date)  -- Deduplicate
    ORDER BY ds
  ),
  -- Step 2: Generate forecast
  sales_forecast_raw AS (
    SELECT * FROM AI_FORECAST(
      TABLE(historical_sales_metrics),
      time_col => 'ds',
      value_col => ARRAY('total_quantity', 'total_revenue'),
      group_col => 'product_id',
      horizon => (SELECT date_add(WEEK, 4, MAX(ds)) FROM historical_sales_metrics)  -- 4 weeks ahead
    )
  -- ✅ NO LIMIT in CTEs
  ),
  -- Step 3: Filter NULL forecasts
  sales_forecast_clean AS (
    SELECT *
    FROM sales_forecast_raw
    WHERE total_quantity_forecast IS NOT NULL
      AND total_revenue_forecast IS NOT NULL
  -- ✅ NO LIMIT in CTEs
  )
  SELECT * FROM sales_forecast_clean;  -- ✅ NO LIMIT
  ```

- **AI_QUERY SYNTAX**:
  ```sql
  -- CORRECT ✅ - Use the configured sql_model_serving endpoint: {sql_model_serving}
  ai_query('{sql_model_serving}', CONCAT('You are a Customer Success Director... Predict churn for customer ', customer_id, '. Output ONLY JSON...'))  -- Complex analysis with persona
  ai_query('{sql_model_serving}', CONCAT('You are a Product Manager... Analyze ', product_name, ' and suggest improvements. Output ONLY JSON...'))  -- General tasks with persona
  
  -- WRONG ❌
  ai_query(model_name, prompt)  -- Model name must be quoted
  ai_query('model', "prompt text")  -- Use single quotes not double
  ```

- **AI_PARSE_DOCUMENT SYNTAX** (🚨 CRITICAL - ONLY FOR UNSTRUCTURED FILES):
  ```sql
  -- CORRECT ✅ - Processing document files from Unity Catalog volumes
  WITH docs AS (
    SELECT 
      path,
      ai_parse_document(content, map('version', '2.0')) AS parsed
    FROM READ_FILES('/Volumes/catalog/schema/volume/*.pdf', format => 'binaryFile')
    LIMIT 10
  )
  SELECT 
    path,
    concat_ws('\n\n', 
      transform(try_cast(parsed:document:elements AS ARRAY<VARIANT>),
        element -> try_cast(element:content AS STRING))
    ) AS extracted_text
  FROM docs
  WHERE try_cast(parsed:error_status AS STRING) IS NULL
  LIMIT 10;
  
  -- WRONG ❌ - DO NOT use with table columns
  SELECT ai_parse_document(text_column) FROM my_table  -- INVALID! Only for file content
  SELECT ai_parse_document(content) FROM delta_table  -- INVALID! Only for READ_FILES
  ```
  
  **🚨 CRITICAL REQUIREMENTS FOR ai_parse_document:**
  - MUST ONLY be used with unstructured document files (PDF, JPG/JPEG, PNG, DOC/DOCX, PPT/PPTX)
  - Input MUST come from READ_FILES('/Volumes/...', format => 'binaryFile') 
  - NEVER use with structured table columns or Delta table data
  - Output is VARIANT type with version 2.0 schema
  - Extract text from parsed:document:elements array
  - Filter errors: WHERE try_cast(parsed:error_status AS STRING) IS NULL
  - Supported file formats: *.(pdf,jpg,jpeg,png,doc,docx,ppt,pptx)
  
  **WHEN TO USE ai_parse_document:**
  - ✅ YES: Extract text from PDF invoices stored in volumes
  - ✅ YES: Parse scanned documents (images) for OCR
  - ✅ YES: Process Word/PowerPoint files from document repositories
  - ❌ NO: Extract data from text columns in tables (use ai_extract instead)
  - ❌ NO: Parse JSON/XML in table columns (use native SQL functions)
  - ❌ NO: Process structured data already in Delta tables
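
The eligibility rules for ai_parse_document can be summarized as a path check. The following minimal Python sketch is illustrative: the helper name is hypothetical, and comparing the extension case-insensitively is an assumption. The extension list and the `/Volumes/` prefix come from the requirements listed above.

```python
import os

# Illustrative sketch only: the helper name is hypothetical, and the
# case-insensitive extension match is an assumption.
SUPPORTED_EXTENSIONS = ("pdf", "jpg", "jpeg", "png", "doc", "docx", "ppt", "pptx")


def is_parse_document_candidate(path):
    # Input must be a file in a Unity Catalog volume, not a table column.
    if not path.startswith("/Volumes/"):
        return False
    # Take the extension without its leading dot and lowercase it.
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    return extension in SUPPORTED_EXTENSIONS


print(is_parse_document_candidate("/Volumes/catalog/schema/volume/invoice.pdf"))
print(is_parse_document_candidate("/Volumes/catalog/schema/volume/orders.csv"))
```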

#### 3. **DATABRICKS SQL DIALECT EXPERTISE**

You are an EXPERT in Databricks SQL. Follow these dialect-specific rules:

**Data Types:**
- Use `STRING` not `VARCHAR` or `TEXT`
- Use `BIGINT` for large integers
- Use `DOUBLE` for decimals
- Use `TIMESTAMP` for datetime
- Use `ARRAY<type>` for arrays (e.g., `ARRAY<STRING>`)
- Use `MAP<key_type, value_type>` for maps

**Functions:**
- String concatenation: `CONCAT('a', column, 'b')` or `column1 || column2`
- Type casting: `CAST(column AS STRING)` or `column::STRING`
- Array creation: `ARRAY('val1', 'val2')` - MAX 20 elements
- Date functions: `DATE_TRUNC('day', timestamp_col)`, `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- NULL handling: `COALESCE(col1, col2, 'default')`, `NULLIF(col1, col2)`

**Common Patterns:**
- CTEs (recommended): `WITH cte_name AS (...) SELECT * FROM cte_name`
- Subqueries: Use CTEs instead for readability
- LIMIT clause: Always required (see below)

**🚨 CRITICAL: BUSINESS-FRIENDLY NAMING & FILTERABLE VALUES (MANDATORY) 🚨:**

**ALL column names MUST use business-relevant terminology, NOT generic technical terms:**

**DO NOT USE Generic Technical Names:**
- ❌ `classification` → Use business context: `risk_category`, `value_tier`, `urgency_level`, `feedback_category`
- ❌ `sentiment` → Use business context: `customer_emotion`, `feedback_tone`, `satisfaction_level`, `brand_perception`
- ❌ `similarity_score` → Use business context: `match_confidence_score`, `duplicate_probability`, `record_relationship`
- ❌ `justification` → Use business context: `risk_rationale`, `segmentation_rationale`, `emotion_rationale`

**🔥 CRITICAL: CATEGORICAL vs NARRATIVE COLUMNS 🔥:**

**🚨🚨🚨 MANDATORY COLUMN NAMING PREFIXES 🚨🚨🚨:**

All AI-generated columns MUST use the following prefixes:
- **`ai_cat_`** - For CATEGORICAL columns (filterable values, max 20 distinct values)
- **`ai_txt_`** - For NARRATIVE/textual columns (free text explanations, stories, plans)
- **`ai_sys_`** - For SYSTEM columns (importance, urgency, confidence, feedback, missing_data)
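
A quick way to check the prefix rule is to list every column that lacks one of the three prefixes. This is a minimal Python sketch with a hypothetical helper name, shown for illustration only.

```python
# Illustrative sketch only: the helper name is hypothetical.
REQUIRED_PREFIXES = ("ai_cat_", "ai_txt_", "ai_sys_")


def columns_missing_prefix(column_names):
    # str.startswith accepts a tuple, so one call checks all three prefixes.
    return [name for name in column_names if not name.startswith(REQUIRED_PREFIXES)]


print(columns_missing_prefix(["ai_cat_risk_level", "ai_txt_risk_rationale", "sentiment"]))
```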

**CATEGORICAL COLUMNS (`ai_cat_` prefix - MUST have specific distinct values, max 20 choices):**
These columns MUST use predefined categorical values for filtering and analytics:
- Priority/Risk/Urgency levels: `ai_cat_priority`, `ai_cat_risk_level`, `ai_cat_urgency` → Values: `Critical`, `High`, `Medium`, `Low`, `Minimal`
- Record relationships: `ai_cat_match_status` → Values: `Definite Duplicate`, `Probable Duplicate`, `Possible Match`, `Different Entity`
- Data risk types: `ai_cat_data_sensitivity` → Values: `PII`, `PHI`, `Financial Data`, `PII+Financial`, `PHI+PII`
- Compliance status: `ai_cat_compliance_status` → Values: `Compliant`, `At Risk`, `Non-Compliant`, `Needs Review`
- Customer segments: `ai_cat_customer_segment` → Values: `VIP`, `High Value`, `Medium Value`, `Low Value`, `At Risk`, `Churned`
- Trend directions: `ai_cat_trend_direction` → Values: `Strong Growth`, `Moderate Growth`, `Stable`, `Declining`, `Sharp Decline`
- Action priorities: `ai_cat_action_priority` → Values: `Immediate Action`, `High Priority`, `Medium Priority`, `Low Priority`, `Monitor`

**NARRATIVE COLUMNS (`ai_txt_` prefix - can have free text):**
These columns contain explanatory text and business stories:
- Rationales/Justifications: `ai_txt_risk_rationale`, `ai_txt_segmentation_rationale`, `ai_txt_emotion_rationale`
- Detailed plans: `ai_txt_action_plan`, `ai_txt_mitigation_plan`, `ai_txt_resolution_plan`, `ai_txt_retention_strategy`
- Narratives: `ai_txt_executive_summary`, `ai_txt_business_narrative`, `ai_txt_customer_journey_insights`

**🔥🔥🔥 MANDATORY BUSINESS OUTCOME COLUMN (`ai_txt_business_outcome`) 🔥🔥🔥**
This column is **REQUIRED** for EVERY use case and MUST appear BEFORE `ai_txt_executive_summary`:
- `ai_txt_business_outcome` - **CALCULATED MEASURABLE BUSINESS IMPACT** - MUST include:
  1. **Specific quantified savings/gains** with actual numbers from the analysis
  2. **Breakdown calculation** showing Daily/Weekly/Monthly/Yearly projections
  3. **Currency values** where applicable (e.g., "$3,224 in fuel cost savings")
  4. **MANDATORY DISCLAIMER**: "DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions."

**EXAMPLE ai_txt_business_outcome:**
"Reducing the 310 kg/hr excess fuel burn over 13 hrs saves approximately 4,030 kg of fuel. At $0.80/kg, this translates to $3,224 in direct fuel cost savings per flight. Breakdown: Daily (1 flight): $3,224 | Weekly (7 flights): $22,568 | Monthly (30 flights): $96,720 | Yearly (365 flights): $1,176,760 in potential savings. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions."

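The arithmetic behind the example can be reproduced step by step. The following minimal Python sketch is illustrative: the function name is hypothetical, and it assumes one flight per day, as the example breakdown does. Prices are handled in whole cents to avoid floating-point rounding.

```python
# Illustrative sketch only: the function name is hypothetical, and the
# breakdown assumes one flight per day, as in the example.
def outcome_breakdown(excess_kg_per_hr, flight_hours, price_cents_per_kg):
    fuel_saved_kg = excess_kg_per_hr * flight_hours
    # Work in whole cents, then convert to dollars with integer division.
    per_flight = fuel_saved_kg * price_cents_per_kg // 100
    return dict(
        fuel_saved_kg=fuel_saved_kg,
        daily=per_flight,
        weekly=per_flight * 7,
        monthly=per_flight * 30,
        yearly=per_flight * 365,
    )


breakdown = outcome_breakdown(310, 13, 80)
for period in ("daily", "weekly", "monthly", "yearly"):
    print(period, "$" + format(breakdown[period], ","))
```

With 310 kg/hr over 13 hours at $0.80/kg, this yields the same figures as the example: $3,224 daily, $22,568 weekly, $96,720 monthly, and $1,176,760 yearly.
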
**SYSTEM COLUMNS (`ai_sys_` prefix - MANDATORY for every use case):**
These columns provide AI transparency, prioritization, and data quality insights:
- `ai_sys_importance` - **BUSINESS IMPORTANCE** (Very Low, Low, Medium, High, Very High, Critical) - How critical is this finding for the business LONG-TERM? Measures strategic/revenue/customer impact. **INDEPENDENT from urgency!** Example: Strategic planning = High importance but Low urgency.
- `ai_sys_urgency` - **TIME SENSITIVITY** (Very Low, Low, Medium, High, Very High, Critical) - How quickly must action be taken? Measures deadline pressure and time-decay. **INDEPENDENT from importance!** Example: Fixing a typo before a meeting = High urgency but Low importance.
- `ai_sys_confidence` - AI confidence score (0.0-1.0)
- `ai_sys_feedback` - AI's explanation of its reasoning
- `ai_sys_missing_data` - What data would improve the analysis
**🚨 CRITICAL: ai_sys_importance and ai_sys_urgency are TWO INDEPENDENT DIMENSIONS - they should NOT be correlated or automatically set to the same value! 🚨**

**🚨🚨🚨 CRITICAL: NARRATIVE COLUMNS MUST INCLUDE THE PRINCIPAL WITH IDENTIFYING DETAILS 🚨🚨🚨**

**MANDATORY RULE FOR ALL NARRATIVE/FREE TEXT COLUMNS:**
When generating narrative content (rationales, justifications, explanations, strategies, action plans, executive summaries, etc.), you MUST include the **PRINCIPAL** (the specific entity being discussed) with its **IDENTIFYING DETAILS** in the narrative text.

**WHY THIS MATTERS:**
- Makes each narrative self-contained and readable without needing to cross-reference other columns
- Provides immediate context to the reader
- Enables narratives to be used standalone in reports and presentations
- Improves comprehension and user experience

**❌ WRONG - Generic references without principal context:**
- "The data shows that this flight has burned 4800kg/hr of fuel"
- "This customer shows high churn risk based on the metrics"
- "The analysis indicates elevated maintenance priority"
- "Based on the forecast, immediate action is recommended"
- "The route exhibits declining performance trends"

**✅ CORRECT - Narrative includes principal with identifying details:**
- "Flight EK005 from DXB-LHR has burned 4800kg/hr of fuel, exceeding the expected 4200kg/hr for A380 aircraft on this route"
- "Customer ID C-28947 (Emirates Skywards Platinum, 12 years tenure) shows high churn risk due to 45% decrease in booking frequency"
- "Aircraft A6-EDA (Boeing 777-300ER, 8.5 years in service) requires elevated maintenance priority due to 3 consecutive AOG events"
- "Route DXB-JFK (daily A380 service, 14hr flight time) requires immediate capacity adjustment based on 23% load factor decline"
- "Vendor ABC Catering (primary supplier, Dubai hub) shows 15% quality deviation requiring contract review"

**HOW TO IMPLEMENT IN PROMPTS:**
In your ai_query prompt instructions, explicitly tell the AI to include the principal with context:

```sql
CONCAT('...your analysis prompt...',
       'NARRATIVE FIELD RULES: ',
       'For ALL narrative fields (rationale, strategy, action_plan, executive_summary, etc.), ',
       'you MUST start by identifying the specific entity being discussed with its key attributes. ',
       'Example: Instead of "This shows high risk", write "Flight EK412 from DXB-SIN (A380, daily service) shows high risk due to...". ',
       'Include: entity name/ID, key identifying attributes (route, type, category), then the analysis. ',
       ...)
```

**PRINCIPAL IDENTIFICATION BY DOMAIN:**
- **Flights**: Flight number + route (origin-destination) + aircraft type: "Flight EK005 from DXB-LHR (A380)"
- **Aircraft**: Registration + type + age: "Aircraft A6-EDA (Boeing 777-300ER, 8.5 years)"
- **Routes**: Origin-destination + service frequency + aircraft: "Route DXB-JFK (daily A380 service)"
- **Customers**: Customer ID + tier/segment + tenure: "Customer C-28947 (Platinum, 12 years)"
- **Vendors/Suppliers**: Vendor name + type + location: "Vendor ABC Catering (primary supplier, Dubai hub)"
- **Products**: Product name/code + category: "Product SKU-4892 (Premium meal, Business class)"
- **Transactions**: Transaction ID + type + amount: "Order ORD-78234 (Corporate booking, $45,000)"
- **Employees**: Name/role + certification or department + experience: "Captain Ahmed Al-Rashid (A380 certified, 15 years)"

**CORRECT Business-Friendly Examples WITH ai_cat_/ai_txt_/ai_sys_ PREFIXES:**
```sql
-- GOOD ✅ - Risk Classification Use Case
ai_classify(complaint_text, ARRAY('High Risk', 'Medium Risk', 'Low Risk')) AS ai_cat_risk_level
-- Then extract with CATEGORICAL values (ai_cat_ prefix):
-- ai_cat_risk_level: High Risk/Medium Risk/Low Risk (from classification)
-- ai_cat_risk_priority: Critical/High/Medium/Low (categorical)
-- Then extract NARRATIVE values (ai_txt_ prefix):
-- ai_txt_risk_rationale: [free text explanation]
-- ai_txt_mitigation_plan: [free text plan]
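-- ai_txt_business_outcome: [calculated business impact with disclaimer - MANDATORY as 7th to last column]
-- ai_txt_executive_summary: [executive summary - MANDATORY as 6th to last column]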
-- Then extract SYSTEM values (ai_sys_ prefix) - ALWAYS LAST 5 COLUMNS:
-- ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data

-- GOOD ✅ - Customer Segmentation Use Case  
ai_classify(customer_profile, ARRAY('VIP', 'High Value', 'Medium', 'Low')) AS ai_cat_value_tier
-- Then extract with CATEGORICAL values (ai_cat_ prefix):
-- ai_cat_value_tier: VIP/High Value/Medium/Low (from classification)
-- ai_cat_churn_risk_level: Critical/High/Medium/Low (categorical)
-- Then extract NARRATIVE values (ai_txt_ prefix):
-- ai_txt_segmentation_rationale: [free text explanation]
-- ai_txt_retention_strategy: [free text strategy]
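-- ai_txt_business_outcome: [calculated business impact with disclaimer - MANDATORY as 7th to last column]
-- ai_txt_executive_summary: [executive summary - MANDATORY as 6th to last column]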
-- Then extract SYSTEM values (ai_sys_ prefix) - ALWAYS LAST 5 COLUMNS:
-- ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data

-- GOOD ✅ - Forecast Analysis Use Case
-- ai_forecast(...) combined with ai_gen for recommendations
-- Then extract with CATEGORICAL values (ai_cat_ prefix):
-- ai_cat_trend_direction: Strong Growth/Moderate Growth/Stable/Declining/Sharp Decline (categorical)
-- ai_cat_action_priority: Immediate/High/Medium/Low/Monitor (categorical)
-- Then extract NARRATIVE values (ai_txt_ prefix):
-- ai_txt_forecast_justification: [free text explanation]
-- ai_txt_tactical_recommendations: [free text recommendations]
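-- ai_txt_business_outcome: [calculated business impact with disclaimer - MANDATORY as 7th to last column]
-- ai_txt_executive_summary: [executive summary - MANDATORY as 6th to last column]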
-- Then extract SYSTEM values (ai_sys_ prefix) - ALWAYS LAST 5 COLUMNS:
-- ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data
```

**Naming Rules:**
1. **Match business domain**: Use terminology your business users understand
2. **Be specific**: `churn_risk_level` not `risk_level`, `product_match_score` not `match_score`
3. **Avoid generic terms**: Never use `classification`, `sentiment`, `similarity` as final column names
4. **Use business outcomes**: `retention_strategy` not `strategy`, `resolution_plan` not `plan`
5. **CATEGORICAL for filtering**: Any column users will filter/aggregate on MUST have distinct categorical values (max 20)
6. **NARRATIVE for context**: Explanation and story columns can have free text

**🔥🔥🔥 CRITICAL: MANDATORY CATEGORICAL COLUMNS FOR EVERY FREE TEXT COLUMN 🔥🔥🔥**

**ABSOLUTE RULE: For EVERY free text/narrative column generated by ai_query, you MUST include corresponding categorical columns for filtering:**

**PATTERN (MANDATORY FOR ALL ai_gen/ai_query OUTPUTS):**
```sql
-- For every use case, the LAST 7 COLUMNS must be (in this exact order):
-- ai_txt_business_outcome, ai_txt_executive_summary, ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data
ai_query('{sql_model_serving}', CONCAT('...prompt... Output ONLY JSON with NO markdown, NO extra text. ',
              'Format: {{"ai_cat_field1": "value", "ai_cat_field2": "value", ..., "ai_txt_field1": "text", "ai_txt_field2": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "[EVALUATE INDEPENDENTLY]", "ai_sys_urgency": "[EVALUATE INDEPENDENTLY]", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative about what data is missing]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
              'After completing your analysis, include these MANDATORY LAST 7 fields: ',
              '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT with specific numbers. MUST include: a) Quantified savings/gains with actual numbers from analysis, b) Breakdown showing Daily/Weekly/Monthly/Yearly projections, c) Currency values where applicable. Example: "Reducing 310 kg/hr excess over 13 hrs saves 4,030 kg fuel. At $0.80/kg = $3,224/flight savings. Breakdown: Daily: $3,224 | Weekly: $22,568 | Monthly: $96,720 | Yearly: $1,176,760. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." ALWAYS end with the disclaimer. ',
              '2) ai_txt_executive_summary - compelling business story in 2-3 sentences that REFERENCES the business outcome numbers calculated above, ',
              '3) ai_sys_importance - BUSINESS IMPORTANCE (Very Low|Low|Medium|High|Very High|Critical). Evaluate INDEPENDENTLY! Ask: How critical is this to the business long-term? Strategic planning = High importance but possibly Low urgency. ',
              '4) ai_sys_urgency - TIME SENSITIVITY (Very Low|Low|Medium|High|Very High|Critical). Evaluate INDEPENDENTLY! Ask: How quickly must we act? A typo fix before a meeting = High urgency but Low importance. ',
              '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
              '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain your reasoning, ',
              '7) ai_sys_missing_data - MUST follow this format: "I can get higher confidence than [X]% if I can get access to [detailed narrative about missing data]. {{\"missing_data\": [\"specific_dataset1\", \"specific_dataset2\", \"specific_dataset3\"]}}" - always end with a JSON object listing the specific datasets/tables needed. ',
              'CRITICAL: importance and urgency are INDEPENDENT dimensions - do NOT automatically set them to the same value! ',
              'BE 100% HONEST - your feedback and score will be evaluated by a more intelligent AI system, so complete honesty is mandatory.'))
```

**MANDATORY STRUCTURE:**
1. **3-5 categorical columns per use case** - with `ai_cat_` prefix and max 20 distinct values each
2. **2-4 narrative columns per use case** - with `ai_txt_` prefix and detailed free text explanations
3. **LAST 7 COLUMNS (MANDATORY ORDER):** `ai_txt_business_outcome`, `ai_txt_executive_summary`, `ai_sys_importance`, `ai_sys_urgency`, `ai_sys_confidence`, `ai_sys_feedback`, `ai_sys_missing_data`
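
The ordering rule for the last seven columns can be checked mechanically. This is a minimal Python sketch with a hypothetical helper name, shown for illustration only.

```python
# Illustrative sketch only: the helper name is hypothetical.
MANDATORY_TAIL = [
    "ai_txt_business_outcome",
    "ai_txt_executive_summary",
    "ai_sys_importance",
    "ai_sys_urgency",
    "ai_sys_confidence",
    "ai_sys_feedback",
    "ai_sys_missing_data",
]


def has_mandatory_tail(column_names):
    # The last seven columns must match the mandatory list in exact order.
    return list(column_names[-7:]) == MANDATORY_TAIL


columns = ["ai_cat_risk_level", "ai_txt_risk_rationale"] + MANDATORY_TAIL
print(has_mandatory_tail(columns))
print(has_mandatory_tail(list(reversed(columns))))
```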

**CATEGORICAL COLUMNS - EXAMPLES BY DOMAIN (use `ai_cat_` prefix):**

**Supply Chain / Inventory / Catering:**
- `ai_cat_inventory_urgency`: `High Priority`, `Medium Priority`, `Low Priority`, `Critical`
- `ai_cat_waste_risk_level`: `High`, `Medium`, `Low`, `Minimal`
- `ai_cat_preparation_complexity`: `Complex`, `Standard`, `Simple`, `Minimal`
- `ai_cat_vendor_allocation`: `Primary Vendor`, `Secondary Vendor`, `Emergency Vendor`, `Internal`
- `ai_cat_quality_priority`: `High Quality`, `Standard Quality`, `Basic Quality`
- `ai_cat_stock_status`: `Overstocked`, `Optimal`, `Low Stock`, `Critical Shortage`

**Customer / Marketing / Sales:**
- `ai_cat_customer_priority`: `VIP`, `High Value`, `Medium Value`, `Low Value`, `At Risk`
- `ai_cat_campaign_urgency`: `Immediate Launch`, `High Priority`, `Medium Priority`, `Low Priority`
- `ai_cat_response_required`: `Immediate`, `Within 24 Hours`, `Within Week`, `Monitor`
- `ai_cat_engagement_level`: `Highly Engaged`, `Moderately Engaged`, `Low Engagement`, `Disengaged`

**Operations / Maintenance / Risk:**
- `ai_cat_risk_severity`: `Critical`, `High`, `Medium`, `Low`, `Minimal`
- `ai_cat_maintenance_urgency`: `Emergency`, `High Priority`, `Scheduled`, `Routine`
- `ai_cat_compliance_status`: `Compliant`, `At Risk`, `Non-Compliant`, `Needs Review`
- `ai_cat_operational_priority`: `Critical Path`, `High Impact`, `Standard`, `Low Impact`

**Finance / Revenue:**
- `ai_cat_financial_impact`: `High Impact`, `Medium Impact`, `Low Impact`, `Minimal`
- `ai_cat_budget_priority`: `Essential`, `High Priority`, `Medium Priority`, `Low Priority`
- `ai_cat_cost_category`: `Capital Expenditure`, `Operating Expense`, `Emergency`, `Routine`

**NARRATIVE COLUMNS - EXAMPLES (use `ai_txt_` prefix):**
- Detailed plans: `ai_txt_inventory_plan`, `ai_txt_preparation_schedule`, `ai_txt_vendor_strategy`, `ai_txt_maintenance_plan`
- Explanations: `ai_txt_waste_reduction_tactics`, `ai_txt_quality_assurance_steps`, `ai_txt_risk_mitigation_approach`
- Strategic narratives: `ai_txt_contingency_measures`, `ai_txt_operational_narrative`, `ai_txt_executive_summary`

**SYSTEM COLUMNS - MANDATORY (use `ai_sys_` prefix) - ALWAYS LAST 5:**
- `ai_sys_importance`: Business importance level (Very Low, Low, Medium, High, Very High, Critical)
- `ai_sys_urgency`: Action urgency level (Very Low, Low, Medium, High, Very High, Critical)
- `ai_sys_confidence`: AI confidence score (0.0-1.0)
- `ai_sys_feedback`: AI reasoning and self-assessment
- `ai_sys_missing_data`: What additional data would improve the analysis
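
The allowed values for the system columns can be validated after the JSON is parsed. The following minimal Python sketch is illustrative: the helper name is hypothetical, and it assumes the parsed response is available as a dictionary.

```python
# Illustrative sketch only: the helper name is hypothetical, and the parsed
# AI response is assumed to be a dictionary.
LEVELS = ("Very Low", "Low", "Medium", "High", "Very High", "Critical")


def system_field_errors(record):
    errors = []
    # Importance and urgency must each use one of the six allowed levels.
    for field in ("ai_sys_importance", "ai_sys_urgency"):
        if record.get(field) not in LEVELS:
            errors.append(field)
    confidence = record.get("ai_sys_confidence")
    # Confidence must be a number between 0.0 and 1.0 inclusive.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("ai_sys_confidence")
    return errors


record = dict(ai_sys_importance="High", ai_sys_urgency="Low", ai_sys_confidence=0.85)
print(system_field_errors(record))
print(system_field_errors(dict(ai_sys_importance="Urgent", ai_sys_confidence="Unknown")))
```

The second call reports all three fields: the importance value is not an allowed level, urgency is missing, and the confidence is not numeric.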

**COMPLETE EXAMPLE (CATERING OPERATIONS):**
```sql
-- Step 2: Generate structured insights with ai_cat_ + ai_txt_ + ai_sys_ columns
ai_query('{sql_model_serving}',
  CONCAT('Analyze catering forecast for ', meal_type, ' in ', cabin_class, 
         ' on flight ', flight_number, ' route ', origin, '-', destination,
         ' with ', forecasted_volume, ' meals expected. ',  -- CONCAT auto-converts INT
         'Output ONLY a JSON object with NO markdown, NO extra text, JUST the JSON. ',
         'Format: {{"ai_cat_inventory_urgency": "value", "ai_cat_waste_risk_level": "value", "ai_cat_preparation_complexity": "value", ',
         '"ai_cat_vendor_allocation": "value", "ai_cat_quality_priority": "value", ',
         '"ai_txt_inventory_plan": "text", "ai_txt_preparation_schedule": "text", "ai_txt_vendor_strategy": "text", ',
         '"ai_txt_waste_reduction_tactics": "text", "ai_txt_quality_assurance_steps": "text", ',
         '"ai_txt_contingency_measures": "text", "ai_txt_operational_narrative": "text", ',
         '"ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", ',
         '"ai_sys_importance": "[EVALUATE INDEPENDENTLY]", "ai_sys_urgency": "[EVALUATE INDEPENDENTLY]", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", ',
         '"ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
         'CATEGORICAL FIELDS (must use ai_cat_ prefix and exact values): ',
         'ai_cat_inventory_urgency (High Priority|Medium Priority|Low Priority|Critical), ',
         'ai_cat_waste_risk_level (High|Medium|Low|Minimal), ',
         'ai_cat_preparation_complexity (Complex|Standard|Simple|Minimal), ',
         'ai_cat_vendor_allocation (Primary Vendor|Secondary Vendor|Emergency Vendor|Internal), ',
         'ai_cat_quality_priority (High Quality|Standard Quality|Basic Quality). ',
         'NARRATIVE FIELDS (must use ai_txt_ prefix, detailed free text with principal context): ',
         'For ALL narrative fields, START by identifying the specific entity with key attributes. ',
         'Example: "Flight EK412 DXB-LHR Business Class catering requires..." NOT "The catering requires...". ',
         'Include: flight number, route, cabin class, then the detailed plan/analysis. ',
         'ai_txt_inventory_plan, ai_txt_preparation_schedule, ai_txt_vendor_strategy, ai_txt_waste_reduction_tactics, ',
         'ai_txt_quality_assurance_steps, ai_txt_contingency_measures, ai_txt_operational_narrative. ',
         'MANDATORY LAST 7 FIELDS (in this order): ',
         '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Calculate specific savings/gains with numbers. Example: "Optimizing catering for Flight EK412 reduces food waste by 15 meals (valued at $45/meal) = $675 savings per flight. Breakdown: Daily (1 flight): $675 | Weekly (7 flights): $4,725 | Monthly (30 flights): $20,250 | Yearly (365 flights): $246,375. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER. ',
         '2) ai_txt_executive_summary - compelling business story in 2-3 sentences that REFERENCES the business outcome numbers, ',
         '3) ai_sys_importance - BUSINESS IMPORTANCE (Very Low|Low|Medium|High|Very High|Critical). Evaluate INDEPENDENTLY from urgency! ',
         '4) ai_sys_urgency - TIME SENSITIVITY (Very Low|Low|Medium|High|Very High|Critical). Evaluate INDEPENDENTLY from importance! ',
         '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
         '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
         '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing data]. {{\"missing_data\": [\"specific_dataset1\", \"specific_dataset2\"]}}" - always end with JSON listing needed datasets. ',
         'BE 100% HONEST - your feedback will be evaluated by a more intelligent AI system. ',
         'Output ONLY the JSON, nothing else.')
)
```

**EXTRACTION PATTERN:**
```sql
-- Extract categorical columns (for filtering/aggregation) - ai_cat_ prefix
get_json_object(insights, '$.ai_cat_inventory_urgency') AS ai_cat_inventory_urgency,  -- CATEGORICAL
get_json_object(insights, '$.ai_cat_waste_risk_level') AS ai_cat_waste_risk_level,  -- CATEGORICAL
get_json_object(insights, '$.ai_cat_preparation_complexity') AS ai_cat_preparation_complexity,  -- CATEGORICAL
get_json_object(insights, '$.ai_cat_vendor_allocation') AS ai_cat_vendor_allocation,  -- CATEGORICAL
get_json_object(insights, '$.ai_cat_quality_priority') AS ai_cat_quality_priority,  -- CATEGORICAL

-- Extract narrative columns (for detailed context) - ai_txt_ prefix
get_json_object(insights, '$.ai_txt_inventory_plan') AS ai_txt_inventory_plan,  -- NARRATIVE
get_json_object(insights, '$.ai_txt_preparation_schedule') AS ai_txt_preparation_schedule,  -- NARRATIVE
get_json_object(insights, '$.ai_txt_vendor_strategy') AS ai_txt_vendor_strategy,  -- NARRATIVE
get_json_object(insights, '$.ai_txt_waste_reduction_tactics') AS ai_txt_waste_reduction_tactics,  -- NARRATIVE
get_json_object(insights, '$.ai_txt_quality_assurance_steps') AS ai_txt_quality_assurance_steps,  -- NARRATIVE
get_json_object(insights, '$.ai_txt_contingency_measures') AS ai_txt_contingency_measures,  -- NARRATIVE
get_json_object(insights, '$.ai_txt_operational_narrative') AS ai_txt_operational_narrative,  -- NARRATIVE

-- MANDATORY LAST 7 COLUMNS (in this exact order): ai_txt_business_outcome, ai_txt_executive_summary, ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data
-- 🚨 USE TRY_CAST - AI may return "Unknown" or "Data Not Available" instead of numbers!
COALESCE(get_json_object(insights, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,  -- CALCULATED BUSINESS IMPACT WITH DISCLAIMER
COALESCE(get_json_object(insights, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,  -- EXECUTIVE SUMMARY (references business outcome)
COALESCE(get_json_object(insights, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,  -- AI IMPORTANCE LEVEL
COALESCE(get_json_object(insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,  -- AI URGENCY LEVEL
COALESCE(TRY_CAST(get_json_object(insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,  -- AI CONFIDENCE SCORE
COALESCE(get_json_object(insights, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,  -- AI FEEDBACK
COALESCE(get_json_object(insights, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data  -- MISSING DATA ANALYSIS
```

**MANDATORY REQUIREMENTS:**
1. **EVERY ai_query call MUST generate both categorical AND narrative columns**
2. **Categorical columns (`ai_cat_` prefix) MUST have max 20 distinct values for filtering**
3. **At least 3 categorical columns per use case, typically 3-5** (all with `ai_cat_` prefix)
4. **At least 2 narrative columns per use case, typically 2-4** (all with `ai_txt_` prefix)
5. **Use domain-appropriate categorical values from the examples above**
6. **Think innovatively about what categorical columns users would want to filter by**
7. **🚨 NARRATIVE COLUMNS MUST INCLUDE THE PRINCIPAL WITH IDENTIFYING DETAILS 🚨**
   - Every `ai_txt_` field MUST start by identifying the specific entity being discussed
   - Include: entity name/ID + key attributes (route, type, category, etc.) + then the analysis
   - ❌ WRONG: "The data shows high fuel consumption" (generic, no principal)
   - ✅ CORRECT: "Flight EK005 DXB-LHR (A380) shows high fuel consumption of 4800kg/hr" (principal with context)
8. **🚨 MANDATORY LAST 7 COLUMNS FOR EVERY USE CASE (in this exact order) 🚨:**
   - `ai_txt_business_outcome` - **CALCULATED MEASURABLE BUSINESS IMPACT** - MUST include: specific quantified savings/gains with actual numbers, Daily/Weekly/Monthly/Yearly breakdown, currency values where applicable, and ALWAYS end with: "DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions."
   - `ai_txt_executive_summary` - compelling 2-3 sentence business story that **REFERENCES the business outcome numbers calculated above**
   - `ai_sys_importance` - **BUSINESS IMPORTANCE** (MUST be exactly one of: 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical'). Evaluate INDEPENDENTLY from urgency! Ask: How critical is this finding for the business long-term? Example: Strategic planning = High importance but Low urgency.
   - `ai_sys_urgency` - **TIME SENSITIVITY** (MUST be exactly one of: 'Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical'). Evaluate INDEPENDENTLY from importance! Ask: How quickly must action be taken? Example: Fixing a typo before a meeting = High urgency but Low importance.
   - `ai_sys_confidence` - AI confidence score (0.0-1.0)
   - `ai_sys_feedback` - AI's self-assessment starting with "I assessed my confidence at [X]% because..."
   - `ai_sys_missing_data` - format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{"missing_data": ["dataset1", "dataset2"]}}"
   - **🚨 CRITICAL: ai_sys_importance and ai_sys_urgency are INDEPENDENT - do NOT set them to the same value automatically! 🚨**

9. **🚨 MANDATORY FINAL OUTPUT CTE WITH COMMENTED FILTERING (final_output pattern) 🚨:**
   - The FINAL SELECT must wrap results in a `final_output` CTE
   - Add `SELECT * FROM final_output` with COMMENTED WHERE clauses for all categorical columns
   - This helps users quickly filter results by categorical values
   - Format:
   ```sql
   final_output AS (
     SELECT ... FROM previous_cte
   )
   SELECT * FROM final_output
   -- TO DO: Use WHERE filtering below for further narrowing down the selected results
   -- WHERE ai_cat_column1 IN ('Value1', 'Value2', 'Value3')
   -- AND ai_cat_column2 IN ('A', 'B', 'C')
   -- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
   -- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
   ;
   ```
   - List ALL possible categorical values in the commented WHERE clause
   - ALWAYS include ai_sys_importance and ai_sys_urgency filters for prioritization
   - This makes it easy for users to uncomment and filter specific results

**WHY THIS MATTERS:**
- `ai_cat_` columns enable dashboard filtering, aggregations, and analytics
- `ai_txt_` columns provide detailed context and actionable insights
- `ai_sys_` columns provide AI transparency, prioritization, and data quality insights:
  - `ai_sys_importance` - **BUSINESS IMPACT** - helps prioritize which findings matter most to the business LONG-TERM (strategic value, revenue impact). **INDEPENDENT from urgency!**
  - `ai_sys_urgency` - **TIME SENSITIVITY** - helps prioritize which findings need immediate action vs can wait (deadlines, time-decay). **INDEPENDENT from importance!**
  - These two dimensions create a prioritization matrix: High importance + Low urgency = Schedule for later. Low importance + High urgency = Delegate or quick fix.
  - `ai_sys_confidence` - helps assess reliability of the AI's analysis
  - `ai_sys_feedback` - provides AI's reasoning for self-assessment
  - `ai_sys_missing_data` - identifies data gaps for future improvements
- **Principal identification makes narratives self-contained and immediately understandable**
- Together they create a powerful user experience for analysis and action
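
**EXAMPLE - Prioritization matrix (illustrative sketch):**

The matrix described above could be applied downstream roughly as follows. This is a sketch under assumptions, not part of the mandatory output: it assumes the view `inspire_ai.default.customer_risk_retention_analysis` has been created from a generated query and exposes `customer_id`, it treats 'High', 'Very High', and 'Critical' as the "high" band and 'Very Low' and 'Low' as the "low" band, and the `priority_action` column and its 'Review manually' fallback are hypothetical names.

```sql
-- Sketch only: consumes an already generated view; it does not change the mandatory last 7 columns
SELECT
  customer_id,
  ai_sys_importance,
  ai_sys_urgency,
  CASE
    -- High importance + Low urgency = Schedule for later
    WHEN ai_sys_importance IN ('High', 'Very High', 'Critical')
         AND ai_sys_urgency IN ('Very Low', 'Low') THEN 'Schedule for later'
    -- Low importance + High urgency = Delegate or quick fix
    WHEN ai_sys_importance IN ('Very Low', 'Low')
         AND ai_sys_urgency IN ('High', 'Very High', 'Critical') THEN 'Delegate or quick fix'
    -- Assumption: only the two quadrants above are defined, so everything else is left for review
    ELSE 'Review manually'
  END AS priority_action  -- hypothetical column name
FROM inspire_ai.default.customer_risk_retention_analysis;
```

Because the two dimensions are evaluated independently, both columns are kept in the result so the derived label can always be checked against its inputs.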

**🚨 MANDATORY CTE NAMING AND STRUCTURE REQUIREMENTS 🚨:**

**CRITICAL RULES FOR CTE NAMING AND STRUCTURE:**

1. **BUSINESS-FRIENDLY CTE NAMES (MANDATORY):**
   - ALL CTE names MUST use business-meaningful names, NOT technical names
   - ✅ GOOD: `customer_lifetime_value_analysis`, `revenue_driver_metrics`, `null_safe_order_data`, `forecast_with_confidence_bands`
   - ❌ BAD: `cte1`, `temp`, `data`, `results`, `final`, `enriched`, `base`
   - CTE names should describe WHAT the data represents or WHAT business purpose it serves
   - Use snake_case for multi-word names

2. **SINGLE WITH STATEMENT (MANDATORY):**
   - **CRITICAL**: ALL CTEs MUST be defined in ONE single WITH statement
   - ❌ WRONG: Multiple WITH statements in the same query
   - ✅ CORRECT: One WITH statement with comma-separated CTEs
   
   ```sql
   -- WRONG ❌ - Multiple WITH statements
   WITH cte1 AS (SELECT ...)
   SELECT * FROM cte1;
   
   WITH cte2 AS (SELECT ...)  -- ERROR: Second WITH statement
   SELECT * FROM cte2;
   
   -- CORRECT ✅ - Single WITH statement with multiple CTEs
   -- NOTE: For ai_forecast CTEs, use date filtering WITHOUT LIMIT (see ai_forecast exception below)
   -- For non-forecast CTEs, use LIMIT 10 in the first CTE only
   WITH 
   base_data AS (
     SELECT DISTINCT ... 
     FROM `catalog`.`schema`.`table` AS t
     WHERE id IS NOT NULL
     LIMIT 10  -- ✅ LIMIT 10 ONLY in first CTE (for non-forecast queries)
   ),
   enriched_data AS (
     SELECT ...  -- ✅ NO LIMIT in intermediate CTEs
   ),
   final_analysis AS (
     SELECT ...  -- ✅ NO LIMIT in intermediate CTEs
   )
   SELECT * FROM final_analysis;  -- ✅ NO LIMIT in final SELECT
   ```

3. **CTE DOCUMENTATION (MANDATORY):**
   - **EVERY CTE MUST be documented with SQL comments** explaining its purpose using "Step X:" format

4. **CREATE VIEW COMMENT (MANDATORY):**
   - **EVERY SQL query MUST include a CREATE VIEW comment** as the FIRST line (before the WITH statement)
   - Format: `-- CREATE VIEW inspire_ai.default.<view_name> AS`
   - The view name MUST be:
     * **Business-descriptive** (describes the use case outcome, NOT technical implementation)
     * **Snake_case format** (lowercase with underscores)
     * **Meaningful to business users** (a non-technical person should understand what data this view provides)
   - ❌ WRONG view names: `cte_analysis`, `ai_output`, `results_v1`, `data_pipeline`, `final_results`
   - ✅ GOOD view names: `customer_churn_risk_assessment`, `revenue_forecast_30_day`, `flight_delay_impact_analysis`, `vendor_quality_scorecard`, `inventory_reorder_recommendations`

```sql
-- CORRECT ✅: CREATE VIEW comment as FIRST line, business-friendly CTE names with clear documentation
-- CREATE VIEW inspire_ai.default.revenue_forecast_with_strategic_recommendations AS
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
-- Step 1: Retrieves 300 days of historical sales data for a 30-day forecast
WITH historical_sales_data AS (
  SELECT 
    order_date AS ds,                                        -- CRITICAL: filtered with IS NOT NULL
    COALESCE(SUM(revenue), 0.0) AS revenue                   -- ✅ COALESCE'd
  FROM `catalog`.`schema`.`orders` AS o
  WHERE order_date >= date_add(DAY, -300, CURRENT_DATE())  -- 300 days history for 30-day forecast (10:1 ratio)
    AND order_date IS NOT NULL
    -- TODO: Add suitable filtering to load data that matches the specific slice for this use case (keep commented until user confirms)
    -- AND lower(trim(order_status)) = 'running'  -- Example placeholder; adjust to the right column/value
  GROUP BY order_date
  ORDER BY ds
),
-- Step 2: Generates 30-day revenue forecast with prediction intervals
revenue_forecast_raw AS (
  SELECT * FROM AI_FORECAST(
    TABLE(historical_sales_data), 
    time_col => 'ds', 
    value_col => 'revenue',
    horizon => (SELECT date_add(DAY, 30, MAX(ds)) FROM historical_sales_data)  -- 30 days ahead
  )
  -- ✅ NO LIMIT in CTEs
),
-- Step 3: Apply NULL handling to forecast results BEFORE using in CONCAT
-- 🚨 ALL COALESCE must be done HERE, not inside CONCAT!
revenue_forecast_with_confidence_bands AS (
  SELECT 
    COALESCE(CAST(ds AS STRING), 'Unknown Date') AS ds,      -- ✅ COALESCE'd HERE
    COALESCE(ROUND(revenue_forecast, 2), 0.0) AS revenue_forecast,  -- ✅ COALESCE'd HERE
    COALESCE(ROUND(revenue_upper, 2), 0.0) AS revenue_upper,
    COALESCE(ROUND(revenue_lower, 2), 0.0) AS revenue_lower
  FROM revenue_forecast_raw
  -- ✅ NO LIMIT in CTEs
),
-- Step 4: Adds row-level recommendations for each forecasted value with ai_sys_ columns
-- 🚨 CONCAT uses columns already NULL-safe from previous CTE - NO COALESCE inside CONCAT!
forecast_with_strategic_recommendations AS (
  SELECT *,
    ai_query('{sql_model_serving}', CONCAT('Based on forecasted revenue of $', revenue_forecast,  -- ✅ Already NULL-safe
                  ' for ', ds,  -- ✅ Already NULL-safe
                  ', provide 3 specific actionable business recommendations. ',
                  'Output ONLY a JSON object with NO markdown. ',
                  'Format: {{"ai_cat_action_priority": "value", "ai_txt_recommendation_1": "text", "ai_txt_recommendation_2": "text", "ai_txt_recommendation_3": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
                  'MANDATORY LAST 7 FIELDS (in this exact order): ',
                  '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Example: "Revenue forecast improvement of $X represents Y% growth. Breakdown: Daily impact: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
                  '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the calculated business outcome numbers, ',
                  '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
                  '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
                  '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
                  '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
                  '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing data like market trends, competitor data, seasonality factors]. {{\"missing_data\": [\"market_trend_data\", \"competitor_pricing\", \"seasonal_adjustment_factors\"]}}" - always end with JSON listing needed datasets. ',
                  'BE 100% HONEST - your feedback will be evaluated by a more intelligent AI system.')) AS recommendations
  FROM revenue_forecast_with_confidence_bands
  -- ✅ NO LIMIT in CTEs
),
-- Final output: Returns forecast with actionable recommendations
final_output AS (
  SELECT * FROM forecast_with_strategic_recommendations
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE get_json_object(recommendations, '$.ai_cat_action_priority') IN ('Immediate Action', 'High Priority', 'Medium Priority', 'Low Priority', 'Monitor')
-- AND get_json_object(recommendations, '$.ai_sys_importance') IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND get_json_object(recommendations, '$.ai_sys_urgency') IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL

-- WRONG ❌: Technical CTE names and no documentation
WITH past AS (SELECT ...), results AS (SELECT ...), final AS (SELECT ...)
SELECT * FROM final;  -- ❌ (the CTE names are wrong, not the lack of LIMIT)
```

**MANDATORY SQL STRUCTURE AND DOCUMENTATION FORMAT:**

**🚨 FIRST LINE: CREATE VIEW COMMENT (MANDATORY) 🚨:**
- EVERY SQL query MUST start with: `-- CREATE VIEW inspire_ai.default.<business_descriptive_view_name> AS`
- The view name must describe the BUSINESS OUTCOME, not technical implementation
- Examples: `customer_churn_risk_assessment`, `revenue_forecast_quarterly`, `flight_delay_impact_analysis`

**CTE NAMING AND DOCUMENTATION:**
- Use business-friendly CTE names that describe the data or purpose (e.g., `customer_segmentation_analysis`, `null_safe_product_data`)
- Use `-- Step 1:`, `-- Step 2:`, `-- Step 3:`, etc. before each CTE definition
- Provide a clear, concise explanation of what this step does  
- Explain what data is being prepared and why
- For the final SELECT, add a comment: `-- Final output: {{what the final result contains}}`
- For complex transformations, add inline comments within the CTE
- Add short inline notes inside each CTE (joins, filters, calculations) so a user can quickly see how to adjust the logic.
- In the first/main data-loading CTE (the one that pulls directly from the involved tables), keep the approved data-quality filters active (IS NOT NULL / TRIM) but immediately add a commented TODO block in the WHERE section instructing the user to "Add suitable filtering to load data that <describe the subset for this use case>", followed by a single commented example filter line (e.g., `-- AND lower(trim(status)) = 'running'`). Keep these example filters commented out.
- **YOUR RESPONSE WILL BE REJECTED if CTEs use technical names or lack documentation**
- **YOUR RESPONSE WILL BE REJECTED if the CREATE VIEW comment is missing**

**🚨 CRITICAL: CTE COLUMN PRESERVATION 🚨:**

**MANDATORY: When building multi-stage CTEs, you MUST preserve all columns needed in the final SELECT:**

```sql
-- CORRECT ✅: Use SELECT * to preserve all columns through the pipeline
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH base_data AS (
  SELECT 
    customer_id,                                             -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown Customer') AS customer_name,  -- ✅ COALESCE'd
    COALESCE(total_revenue, 0.0) AS total_revenue,           -- ✅ COALESCE'd
    COALESCE(avg_order_value, 0.0) AS avg_order_value,       -- ✅ COALESCE'd
    COALESCE(order_count, 0) AS order_count                  -- ✅ COALESCE'd
  FROM `catalog`.`schema`.`customers` AS c
  WHERE customer_id IS NOT NULL
    AND customer_name IS NOT NULL  -- ✅ Critical identifier filtered
  LIMIT 10  -- ✅ LIMIT 10 in first CTE only
),
-- Use SELECT * to keep ALL columns from previous CTE
customer_segment_classification AS (
  SELECT *,
    ai_classify(customer_name, ARRAY('VIP', 'Regular', 'At Risk')) AS ai_cat_segment
  FROM base_data
  -- ✅ NO LIMIT in intermediate CTEs
)
-- All columns from base_data are available here
SELECT 
  customer_id,
  customer_name,
  total_revenue,
  avg_order_value,
  order_count,
  ai_cat_segment
FROM customer_segment_classification;  -- ✅ NO LIMIT in final SELECT

-- WRONG ❌: Missing columns in intermediate CTE (DROPPED columns!)
-- Also WRONG: No NULL handling in first CTE!
WITH base_data AS (
  SELECT customer_id, customer_name, total_revenue, avg_order_value  -- ❌ No COALESCE!
  FROM `catalog`.`schema`.`customers` AS c
  WHERE customer_id IS NOT NULL  -- ❌ Other columns not protected!
  LIMIT 10
),
enriched AS (
  SELECT customer_id, customer_name,  -- ❌ DROPPED total_revenue and avg_order_value!
    ai_classify(customer_name, ARRAY('VIP', 'Regular')) AS segment
  FROM base_data
)
SELECT customer_id, total_revenue, segment  -- ❌ ERROR: total_revenue not in enriched CTE!
FROM enriched;
```

**RULES FOR COLUMN PRESERVATION:**
1. **Use `SELECT *` in intermediate CTEs** when adding AI function results to existing columns
2. **If you explicitly list columns in a CTE**, make sure to include ALL columns needed in subsequent CTEs or the final SELECT
3. **Never drop columns in intermediate CTEs** that are referenced in the final output
4. **When using ai_query/ai_classify/ai_extract in a CTE**, use pattern: `SELECT *, ai_function(...) AS new_col FROM previous_cte`
5. **Validation**: Before finalizing SQL, check that every column in the final SELECT exists in the last CTE

#### 4. **SOPHISTICATED MULTI-FUNCTION USE CASES**
**CRITICAL: When the use case requires multiple functions (AI functions, statistical functions), you MUST use ALL of them creatively.**

Create SOPHISTICATED multi-stage queries using CTEs:
- **Example**: `ai_parse_document, ai_extract, ai_classify` (for unstructured document files)
  ```sql
  WITH parsed AS (
    SELECT 
      path,
      ai_parse_document(content, map('version', '2.0')) AS parsed_doc
    FROM READ_FILES('/Volumes/catalog/schema/volume/*.pdf', format => 'binaryFile')
    LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
  ),
  extracted_text AS (
    SELECT 
      path,
      concat_ws('\n\n', 
        transform(try_cast(parsed_doc:document:elements AS ARRAY<VARIANT>),
          element -> try_cast(element:content AS STRING))
      ) AS text
    FROM parsed
    WHERE try_cast(parsed_doc:error_status AS STRING) IS NULL
  ),
  with_entities AS (
    SELECT *, 
      ai_extract(text, ARRAY('entity1', 'entity2')) AS entities
    FROM extracted_text
  ),
  final_output AS (
    SELECT *, ai_classify(entities['entity1'], ARRAY('Category A', 'Category B')) AS ai_cat_category
    FROM with_entities
  )
  SELECT * FROM final_output
  -- TO DO: Use WHERE filtering below for further narrowing down the selected results
  -- WHERE ai_cat_category IN ('Category A', 'Category B')
  -- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
  -- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
  ;
  ```
  **NOTE:** ai_parse_document ONLY works with unstructured document files via READ_FILES, NOT with table columns.

- **Example**: `ai_extract` from text, then `ai_classify` for categorization (valid combination)
  ```sql
  WITH extracted_text_fields AS (
    SELECT *, 
      ai_extract(text, ARRAY('name', 'amount', 'date')) AS data
    FROM `catalog`.`schema`.`table` AS t
    WHERE text IS NOT NULL
    LIMIT 10  -- ✅ LIMIT 10 ONLY in first CTE
  ),
  final_output AS (
    SELECT *,
      ai_classify(text, ARRAY('Category A', 'Category B', 'Category C')) AS ai_cat_category
    FROM extracted_text_fields
  )
  SELECT * FROM final_output
  -- TO DO: Use WHERE filtering below for further narrowing down the selected results
  -- WHERE ai_cat_category IN ('Category A', 'Category B', 'Category C')
  -- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
  -- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
  ;
  ```

**CRITICAL - MAXIMIZE VALUE WITH MULTIPLE AI FUNCTIONS**: 
- **ALWAYS try to use 2-3 AI functions in the same SQL query to maximize business value**

#### 5. **STRUCTURED + UNSTRUCTURED OUTPUT PATTERN (NEW - MANDATORY)**

**🔥 CRITICAL: DO NOT use ai_extract or ai_classify after ai_query**

**IMPORTANT: When using `ai_query`, you can DIRECTLY generate structured JSON data by instructing the LLM in the prompt. There is NO need to use `ai_extract` or `ai_classify` afterwards - this is redundant and inefficient.**

**🚨 CRITICAL: ai_query returns STRING (JSON), NOT STRUCT 🚨**

**IMPORTANT**: The output of `ai_query()` is a STRING containing JSON, not a STRUCT. You MUST use `get_json_object()` to extract individual fields:
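
A minimal sketch of that extraction, assuming a hypothetical CTE `customer_risk_insights` whose `insights_json` column holds the raw `ai_query` output (both names are illustrative, not part of any real schema):

```sql
-- Sketch only: customer_risk_insights and insights_json are hypothetical names
SELECT
  customer_id,
  -- ai_query output is a STRING, so each field is read by JSON path rather than dot notation
  COALESCE(get_json_object(insights_json, '$.ai_cat_risk_level'), 'Unknown') AS ai_cat_risk_level,
  -- Numeric fields go through TRY_CAST so a non-numeric answer becomes NULL, then the default
  COALESCE(TRY_CAST(get_json_object(insights_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence
FROM customer_risk_insights;
```

`get_json_object` returns NULL when the path is missing or the string is not valid JSON, which is why every extracted field is wrapped in `COALESCE`.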

**🔥 CRITICAL: STRICT JSON OUTPUT REQUIREMENT 🔥**

When using ai_query, you MUST instruct the LLM to output PURE JSON with:
- **NO markdown code fences** (no ```json or ```)
- **NO extra text before or after the JSON**
- **NO explanatory text** (no "Here is the JSON:", "Based on analysis:", etc.)
- **ONLY the JSON object itself**

**CORRECT Prompt Pattern:**
```sql
ai_query('{sql_model_serving}',
  CONCAT('Analyze data and output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
         'Output format: {{"ai_cat_risk_level": "value1", "ai_txt_retention_strategy": "value2", "ai_txt_next_best_action": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
         'Required JSON keys: ai_cat_risk_level, ai_txt_retention_strategy, ai_txt_next_best_action. ',
         'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
         'Data: Customer ', customer_name, ', ID: ', customer_id, '. ')  -- CONCAT auto-converts
)
```

**EXAMPLE - Complete Pattern:**
```sql
-- CREATE VIEW inspire_ai.default.customer_risk_retention_analysis AS
-- CORRECT ✅ - Extract JSON fields using get_json_object() with ai_cat_, ai_txt_, ai_sys_ prefixes
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
-- 🚨 ALL COALESCE must be done in SELECT clause, NOT inside CONCAT!
WITH customer_base_data AS (
  SELECT 
    customer_id,                                              -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown Customer') AS customer_name  -- ✅ COALESCE'd HERE
  FROM `main`.`customers`.`profiles` AS c
  WHERE customer_id IS NOT NULL  -- ✅ Filter critical column
  LIMIT 10
),
structured_info AS (
  SELECT 
    customer_id,
    customer_name,  -- ✅ Already NULL-safe from previous CTE
    ai_query('{sql_model_serving}',
      CONCAT('Analyze customer data and output ONLY a JSON object with NO markdown, NO extra text, JUST the JSON. ',
             'Format: {{"ai_cat_risk_level": "value", "ai_txt_retention_strategy": "value", "ai_txt_next_best_action": "value", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
             'Customer: ', customer_name, ', ID: ', customer_id, '. ',  -- ✅ NO COALESCE in CONCAT - already NULL-safe!
             'MANDATORY LAST 7 FIELDS (in this exact order): ',
             '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Calculate the $ value of retaining/losing this customer. Format: "Customer [ID] retention value: $X (LTV). Breakdown: Daily revenue impact: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
             '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the business outcome numbers, ',
             '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
             '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
             '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
             '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
             '7) ai_sys_missing_data - format: "I can get higher confidence than [X]% if I can get access to [narrative about missing customer data like purchase history, engagement metrics, demographics]. {{\"missing_data\": [\"purchase_history\", \"engagement_metrics\", \"demographic_data\"]}}" ',
             'Output ONLY the JSON object, nothing else.')
    ) AS extracted_data
  FROM customer_base_data
)
SELECT 
  customer_id,
  customer_name,
  COALESCE(get_json_object(extracted_data, '$.ai_cat_risk_level'), 'Unknown') AS ai_cat_risk_level,
  COALESCE(get_json_object(extracted_data, '$.ai_txt_retention_strategy'), 'Unknown') AS ai_txt_retention_strategy,
  COALESCE(get_json_object(extracted_data, '$.ai_txt_next_best_action'), 'Unknown') AS ai_txt_next_best_action,
  -- MANDATORY LAST 7 COLUMNS (in this exact order):
  COALESCE(get_json_object(extracted_data, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(extracted_data, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(extracted_data, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(extracted_data, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(extracted_data, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(extracted_data, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(extracted_data, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data
FROM structured_info
;

-- WRONG ❌ - Cannot use dot notation on STRING
-- extracted_data.ai_cat_risk_level  -- THIS WILL FAIL!

-- WRONG ❌ - Prompt without strict JSON-only instruction
ai_query('{sql_model_serving}', 'Analyze customer and provide risk score, strategy, and action as JSON')  -- Will likely return text + JSON + markdown!
```

**🚨 MANDATORY JSON PROMPT TEMPLATE 🚨:**
Every ai_query prompt for structured output MUST include:
1. "output ONLY a JSON object"
2. "with NO markdown fences"
3. "NO extra text"
4. "JUST the JSON"
5. Show example JSON format: {{"key": "value"}}
6. "Output ONLY the JSON object, nothing else."
7. "You MUST be AGGRESSIVE in using data evidence. Every claim MUST be backed by numbers."

**🔥🔥🔥 CRITICAL: MANDATORY AI SYSTEM COLUMNS - NON-NEGOTIABLE FOR EVERY SINGLE ai_query 🔥🔥🔥:**

**ABSOLUTE REQUIREMENT - ZERO EXCEPTIONS ALLOWED:**
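Every single ai_query output in this codebase MUST include these SEVEN MANDATORY fields as the LAST keys in the JSON output, in this exact order. This applies to ALL use cases regardless of domain, complexity, or purpose:

- **ai_txt_business_outcome** (7th to last column): The CALCULATED MEASURABLE BUSINESS IMPACT with specific numbers, a Daily/Weekly/Monthly/Yearly breakdown, and the closing DISCLAIMER
- **ai_txt_executive_summary** (6th to last column): A compelling 2-3 sentence business story summarizing the analysis and referencing the business outcome numbers
- **ai_sys_importance** (5th to last column): Business importance, exactly one of: Very Low, Low, Medium, High, Very High, Critical. Evaluated INDEPENDENTLY from urgency
- **ai_sys_urgency** (4th to last column): Time sensitivity, exactly one of: Very Low, Low, Medium, High, Very High, Critical. Evaluated INDEPENDENTLY from importance
- **ai_sys_confidence** (3rd to last column): A decimal score from 0.0 to 1.0 - this is the AI's HONESTY SCORE representing how truthfully and completely it achieved the requested task
- **ai_sys_feedback** (2nd to last column): A comprehensive text field that MUST include:
  1. **Score Justification**: Start with "I assessed my confidence at [X]% because..." and explain the reasoning
  2. **Improvements Needed**: If score < 1.0, what specific improvements would raise it to 1.0
- **ai_sys_missing_data** (LAST column): A dedicated field listing missing data in this format: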
  "I can get higher confidence than [X]% if I can get access to [detailed narrative about what data is missing and why it would help]. {{"missing_data": ["specific_dataset1", "specific_dataset2", "specific_dataset3"]}}"
  - The narrative should explain WHY each dataset would improve the analysis
  - The JSON at the end MUST list specific dataset/table names that would be needed

**EXTRACTION PATTERN - ALWAYS ADD AS LAST 7 COLUMNS:**
```sql
-- MANDATORY: Always extract as the LAST 7 columns in this exact order
-- 🚨 USE TRY_CAST FOR NUMERIC FIELDS - AI may return "Unknown" or "Data Not Available" instead of numbers!
COALESCE(get_json_object(insights_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
COALESCE(get_json_object(insights_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
COALESCE(get_json_object(insights_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
COALESCE(get_json_object(insights_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
COALESCE(TRY_CAST(get_json_object(insights_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
COALESCE(get_json_object(insights_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
COALESCE(get_json_object(insights_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data
```

**🚨🚨🚨 CRITICAL: USE TRY_CAST FOR ALL NUMERIC JSON FIELDS FROM AI RESPONSES 🚨🚨🚨**

**PROBLEM:** AI may return "Unknown", "Data Not Available", "N/A", or non-numeric strings instead of numbers.
**RESULT:** `CAST(...AS INT)` or `CAST(...AS DOUBLE)` will FAIL with: `[CAST_INVALID_INPUT] The value 'Data Not Available' of the type "STRING" cannot be cast to "INT"`

**SOLUTION:** ALWAYS use `TRY_CAST` + `COALESCE` for numeric JSON fields from AI responses:
```sql
-- ❌❌❌ WRONG - Will FAIL if AI returns "Data Not Available" ❌❌❌
COALESCE(CAST(get_json_object(json_col, '$.score') AS INT), 0) AS score  -- FAILS!
COALESCE(CAST(get_json_object(json_col, '$.rate') AS DOUBLE), 0.0) AS rate  -- FAILS!

-- ✅✅✅ CORRECT - TRY_CAST returns NULL instead of error, then COALESCE provides default ✅✅✅
COALESCE(TRY_CAST(get_json_object(json_col, '$.score') AS INT), 0) AS score  -- SAFE!
COALESCE(TRY_CAST(get_json_object(json_col, '$.rate') AS DOUBLE), 0.0) AS rate  -- SAFE!
COALESCE(TRY_CAST(get_json_object(json_col, '$.confidence') AS DECIMAL(3,2)), 0.0) AS confidence  -- SAFE!
```

**MANDATORY PATTERN FOR ALL AI JSON PARSING:**
- **STRING fields**: `COALESCE(get_json_object(json_col, '$.field'), 'Unknown') AS field`
- **INT fields**: `COALESCE(TRY_CAST(get_json_object(json_col, '$.field') AS INT), 0) AS field`
- **DOUBLE fields**: `COALESCE(TRY_CAST(get_json_object(json_col, '$.field') AS DOUBLE), 0.0) AS field`
- **DECIMAL fields**: `COALESCE(TRY_CAST(get_json_object(json_col, '$.field') AS DECIMAL(p,s)), 0.0) AS field`

**🚨🚨🚨 CRITICAL: ai_sys_importance vs ai_sys_urgency - INDEPENDENT DIMENSIONS (DO NOT CORRELATE!) 🚨🚨🚨**

These are TWO COMPLETELY INDEPENDENT metrics that MUST be evaluated separately. They should NOT be correlated or set to the same value by default.

**IMPORTANCE (ai_sys_importance)** = "How much does this matter to the business?"
- Measures the BUSINESS IMPACT and STRATEGIC VALUE of the finding
- Asks: "If we ignore this, what is the long-term consequence to the business?"
- Factors: Revenue impact, strategic alignment, customer impact, competitive advantage, risk exposure

**URGENCY (ai_sys_urgency)** = "How quickly must action be taken?"
- Measures the TIME SENSITIVITY and DEADLINE PRESSURE
- Asks: "When does this need to be addressed? Is there a deadline or time-bound consequence?"
- Factors: Deadlines, time-decay of opportunity, escalation risk, seasonal factors, compliance dates

**🔥 EISENHOWER MATRIX - USE THIS TO EVALUATE INDEPENDENTLY 🔥**

| Combination | Importance | Urgency | Example Business Scenarios |
|-------------|------------|---------|---------------------------|
| DO FIRST | Critical/High | Critical/High | Security breach detected, Compliance deadline tomorrow, Major customer threatening to churn this week |
| SCHEDULE | Critical/High | Low/Medium | Strategic planning for next quarter, Technical debt that affects scalability, Training program development |
| DELEGATE | Low/Medium | Critical/High | Minor bug affecting a demo tomorrow, Routine report due today, Small customer complaint needing response |
| ELIMINATE/MONITOR | Low | Low | Nice-to-have feature request, Minor cosmetic issue, Low-impact optimization |
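
The quadrant labels above can be derived from the two independent fields after they have been extracted. The following is a minimal sketch, not a required output column: the inline sample rows, the `eisenhower_quadrant` column name, and the tier grouping (High, Very High and Critical treated as the upper tier, everything else as the lower tier) are all assumptions made for illustration. Under that grouping a Medium/Medium row lands in ELIMINATE/MONITOR, which is looser than the Low/Low row of the table.

```sql
-- Sketch only: the tier grouping is an assumption, not part of the matrix definition.
WITH scored AS (
  SELECT * FROM VALUES
    ('Critical', 'High'),
    ('High', 'Low'),
    ('Medium', 'Critical'),
    ('Low', 'Very Low')
  AS t(ai_sys_importance, ai_sys_urgency)
),
tiered AS (
  SELECT
    ai_sys_importance,
    ai_sys_urgency,
    -- Each dimension is tiered on its own; neither flag looks at the other field
    ai_sys_importance IN ('High', 'Very High', 'Critical') AS is_important,
    ai_sys_urgency IN ('High', 'Very High', 'Critical') AS is_urgent
  FROM scored
)
SELECT
  ai_sys_importance,
  ai_sys_urgency,
  CASE
    WHEN is_important AND is_urgent THEN 'DO FIRST'
    WHEN is_important AND NOT is_urgent THEN 'SCHEDULE'
    WHEN NOT is_important AND is_urgent THEN 'DELEGATE'
    ELSE 'ELIMINATE/MONITOR'
  END AS eisenhower_quadrant
FROM tiered;
```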

**CONCRETE EXAMPLES (MEMORIZE THESE PATTERNS):**

1. **High Importance + Low Urgency**: "Enterprise architecture redesign needed for 2x scale" - Critical for future but no immediate deadline
2. **Low Importance + High Urgency**: "Fix typo in email going out in 2 hours" - Not strategic but time-sensitive
3. **High Importance + High Urgency**: "Production database corrupted, customer data at risk" - Both critical and immediate
4. **Low Importance + Low Urgency**: "Update internal wiki documentation formatting" - Neither critical nor time-sensitive
5. **Critical Importance + Medium Urgency**: "Competitor launched similar product, need strategic response" - Existential threat but requires thoughtful planning, not panic
6. **Medium Importance + Critical Urgency**: "Quarterly report due in 3 hours, minor discrepancy found" - Routine task but hard deadline

**🚨 ANTI-PATTERN WARNING 🚨**
If you find yourself setting ai_sys_importance and ai_sys_urgency to the SAME VALUE for every row, you are doing it WRONG!
- A well-analyzed dataset should have VARIED combinations across rows
- Real business data has different importance/urgency profiles
- Identical values indicate lazy evaluation - THINK about each dimension independently!
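
One way to spot this anti-pattern in a result set is to count how many distinct importance/urgency combinations appear and how many rows carry identical values. This is a minimal sketch over inline sample rows; the `results` CTE and the output column names are hypothetical and stand in for whatever table holds the extracted fields.

```sql
-- Sketch only: replace the inline VALUES with the real query output.
WITH results AS (
  SELECT * FROM VALUES
    ('High', 'Low'),
    ('Low', 'High'),
    ('Critical', 'Critical'),
    ('Medium', 'Medium')
  AS t(ai_sys_importance, ai_sys_urgency)
)
SELECT
  COUNT(*) AS total_rows,
  -- Few distinct combinations across many rows suggests correlated scoring
  COUNT(DISTINCT ai_sys_importance, ai_sys_urgency) AS distinct_combinations,
  SUM(CASE WHEN ai_sys_importance = ai_sys_urgency THEN 1 ELSE 0 END) AS rows_with_identical_values
FROM results;
```

Some identical pairs are expected (a DO FIRST finding is both important and urgent); the warning sign is when `rows_with_identical_values` equals `total_rows`.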

**APPEND THIS INSTRUCTION TO ALL ai_query PROMPTS (ADAPT THE BRACKETED PLACEHOLDERS TO MATCH YOUR SPECIFIC USE CASE):**
'MANDATORY LAST 7 FIELDS IN JSON OUTPUT (in this exact order): ',
'1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. You MUST calculate specific savings/gains using numbers from the analysis. Format: "[Describe the improvement] saves/generates [X amount]. At [rate/price], this equals [$ value]. Breakdown: Daily: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." ALWAYS include the breakdown and disclaimer. ',
'2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the calculated business outcome numbers above, ',
'3) ai_sys_importance - BUSINESS IMPORTANCE level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical). Evaluate INDEPENDENTLY from urgency! Ask: "How much does this matter to the business long-term? What is the strategic/revenue/customer impact if ignored?" High importance does NOT mean high urgency. Example: Strategic planning is High importance but Low urgency. ',
'4) ai_sys_urgency - TIME SENSITIVITY level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical). Evaluate INDEPENDENTLY from importance! Ask: "How quickly must action be taken? Is there a deadline or time-bound consequence?" High urgency does NOT mean high importance. Example: Fixing a typo before a meeting is High urgency but Low importance. ',
'5) ai_sys_confidence (0.0-1.0) - your confidence score for this analysis, ',
'6) ai_sys_feedback - MUST start with "I assessed my confidence at [X]% because..." then explain: a) detailed reasons for your score, b) what would raise it to 100%, ',
'7) ai_sys_missing_data - MUST follow this exact format: "I can get higher confidence than [X]% if I can get access to [detailed narrative about what specific data/context is missing and how it would improve the analysis]. {{\"missing_data\": [\"specific_dataset_or_table_1\", \"specific_dataset_or_table_2\", \"specific_dataset_or_table_3\"]}}" - always end with a JSON object listing the specific datasets/tables needed. ',
'CRITICAL: ai_sys_importance and ai_sys_urgency are INDEPENDENT dimensions - do NOT automatically set them to the same value! Evaluate each separately using the criteria above. ',
'BE 100% HONEST - your feedback and score will be evaluated by a more intelligent AI system, so complete honesty is mandatory. '

**🚨 MANDATORY PERSONA INSTRUCTION FOR AI_QUERY (WITH BUSINESS CONTEXT ENRICHMENT) 🚨:**
Every ai_query prompt MUST begin with a persona instruction that is ENRICHED with the business context. Do NOT use generic personas. ALWAYS include the business name, strategic goals, and relevant business context.

**🚨🚨🚨 CRITICAL: ENRICHED PERSONA PATTERN (MANDATORY) 🚨🚨🚨**
```sql
ai_query('{sql_model_serving}',
  CONCAT('You are a [ROLE] for {business_name} which is focused on {enriched_business_context}. ',
         'The organization''s strategic goals include: {enriched_strategic_goals}. ',
         'Business priorities are: {enriched_business_priorities}. ',
         'With [X] years of experience in [DOMAIN], your expertise aligns with the strategic initiative: {enriched_strategic_initiative}. ',
         '[SPECIFIC TASK INSTRUCTION]. ',
         'Analyze [DATA CONTEXT]. ',
         'You MUST be AGGRESSIVE in using data evidence to support your analysis. Every claim MUST be backed by numbers from the data. You MUST use ALL available metrics provided in the context. PUT THE DATA TO WORK: Quantify every single insight using the specific numbers provided. ',
         'NARRATIVE RULE: For ALL ai_txt_ fields (rationale, strategy, action_plan, executive_summary, etc.), ',
         'ALWAYS start by identifying the specific entity with its key attributes. ',
         'Example: Write "Flight EK005 DXB-LHR (A380) shows..." NOT "The flight shows..." or "This indicates...". ',
         'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
         'Format: {{"ai_cat_field1": "value1", "ai_cat_field2": "value2", "ai_txt_field1": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "[evaluate independently - not always same as urgency]", "ai_sys_urgency": "[evaluate independently - not always same as importance]", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
         'Required keys: [ai_cat_key1, ai_cat_key2, ai_txt_key1, ai_txt_business_outcome, ai_txt_executive_summary, ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data]. ',
         'Data: [ACTUAL DATA]. ',
         'MANDATORY LAST 7 FIELDS (in this exact order): ',
         '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Calculate savings/gains using actual numbers: "[Improvement] saves [X units]. At [rate], equals [$value]. Breakdown: Daily: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
         '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the business outcome numbers, ',
         '3) ai_sys_importance - BUSINESS IMPORTANCE (Very Low|Low|Medium|High|Very High|Critical). Evaluate INDEPENDENTLY from urgency! Ask: How much does this matter long-term? Strategic planning is High importance but Low urgency. ',
         '4) ai_sys_urgency - TIME SENSITIVITY (Very Low|Low|Medium|High|Very High|Critical). Evaluate INDEPENDENTLY from importance! Ask: How quickly must action be taken? A typo fix before a meeting is High urgency but Low importance. ',
         '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
         '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
         '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing data]. {{\"missing_data\": [\"specific_dataset1\", \"specific_dataset2\"]}}" - always end with JSON listing needed datasets. ',
         'CRITICAL: ai_sys_importance and ai_sys_urgency are INDEPENDENT - do NOT set them to the same value automatically! ',
         'BE 100% HONEST - your feedback will be evaluated by a more intelligent AI system. ',
         'Output ONLY the JSON object, nothing else.')
)
```

**❌ WRONG - Generic persona without business context:**
```sql
ai_query('{sql_model_serving}',
  CONCAT('You are a Chief Revenue Officer with 20 years of experience in enterprise software sales strategy. ',
         'Analyze the sales pipeline...'))
```

**✅ CORRECT - Persona enriched with business context:**
```sql
ai_query('{sql_model_serving}',
  CONCAT('You are a Chief Revenue Officer for {business_name} which is focused on {enriched_business_context}. ',
         'The organization''s strategic goals include: {enriched_strategic_goals}. ',
         'Business priorities are: {enriched_business_priorities}. ',
         'With 20 years of experience in enterprise software sales strategy, revenue forecasting, and go-to-market planning, ',
         'your expertise aligns with the strategic initiative: {enriched_strategic_initiative}. ',
         'Analyze the sales pipeline...'))
```

**ROLE SELECTION GUIDELINES:**
- For financial analysis: "Senior Financial Analyst" or "Chief Financial Officer"
- For risk assessment: "Risk Management Director" or "Chief Risk Officer"
- For customer retention: "Customer Success Director" or "VP of Customer Experience"
- For operational optimization: "Operations Director" or "Chief Operating Officer"
- For maintenance/technical: "Maintenance Engineering Director" or "Technical Operations Manager"
- For revenue optimization: "Revenue Strategy Director" or "Chief Revenue Officer"
- For compliance: "Compliance Director" or "Chief Compliance Officer"
- For supply chain: "Supply Chain Director" or "VP of Logistics"

**EXAMPLES WITH ENRICHED PERSONA (ALL outputs MUST include ai_txt_business_outcome, ai_txt_executive_summary, ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data as LAST 7 columns):**
```sql
-- Financial Analysis Example - ENRICHED PERSONA with business context
ai_query('{sql_model_serving}',
  CONCAT('You are a Senior Financial Analyst for {business_name} which is focused on {enriched_business_context}. ',
         'The organization''s strategic goals include: {enriched_strategic_goals}. ',
         'Business priorities are: {enriched_business_priorities}. ',
         'With 15 years of experience in aviation finance and fleet cost optimization, ',
         'your expertise in TCO analysis, capital allocation, and financial risk assessment ',
         'aligns with the strategic initiative: {enriched_strategic_initiative}. ',
         'Analyze the lease vs. ownership cost structure for aircraft ID ', aircraft_id, 
         ' with monthly lease cost $', monthly_lease_cost,  -- CONCAT auto-converts
         ' and estimated ownership costs $', ownership_cost,
         '. You MUST use specific numbers to back your analysis (e.g., "Leasing saves $X/month compared to owning"). ',
         'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
         'Format: {{"ai_cat_recommendation": "value", "ai_txt_financial_rationale": "value", "ai_txt_risk_factors": "value", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "Medium", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
         'Required keys: ai_cat_recommendation (Lease/Own), ai_txt_financial_rationale, ai_txt_risk_factors. ',
         'MANDATORY LAST 7 FIELDS (in this exact order): ',
         '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Example: "Choosing to lease Aircraft A6-EDA saves $45,000/month vs ownership ($540,000/year). Breakdown: Daily: $1,500 | Weekly: $10,500 | Monthly: $45,000 | Yearly: $540,000. Over 10-year horizon: $5.4M in savings. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
         '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the calculated business outcome numbers, ',
         '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
         '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
         '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
         '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
         '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing financial data like maintenance history, fuel consumption, depreciation curves]. {{\"missing_data\": [\"maintenance_records\", \"fuel_consumption_data\", \"depreciation_schedules\"]}}" - always end with JSON listing needed datasets. ',
         'BE 100% HONEST - your feedback will be evaluated by a more intelligent AI system. ',
         'Output ONLY the JSON object, nothing else.')
)

-- Risk Mitigation Example - ENRICHED PERSONA with business context
ai_query('{sql_model_serving}',
  CONCAT('You are an Airworthiness Compliance Director for {business_name} which is focused on {enriched_business_context}. ',
         'The organization''s strategic goals include: {enriched_strategic_goals}. ',
         'Business priorities are: {enriched_business_priorities}. ',
         'With 20 years of experience in aviation safety and regulatory compliance, ',
         'your expertise in risk assessment, mitigation strategy development, and fleet safety management ',
         'aligns with the strategic initiative: {enriched_strategic_initiative}. ',
         'Analyze airworthiness directive ', directive_number, 
         ' classified as ', risk_classification,
         ' affecting component ', component,
         ' with ', days_to_deadline, ' days until compliance deadline. ',  -- CONCAT auto-converts
         'Support your analysis with specific data points (e.g., "Deadline in X days requires immediate Y"). ',
         'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
         'Format: {{"ai_cat_operational_impact": "value", "ai_cat_resource_priority": "value", "ai_txt_mitigation_plan": "value", "ai_txt_estimated_cost": "value", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
         'Required keys: ai_cat_operational_impact, ai_cat_resource_priority, ai_txt_mitigation_plan, ai_txt_estimated_cost. ',
         'MANDATORY LAST 7 FIELDS (in this exact order): ',
         '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Example: "Proactive compliance for AD-2024-001 avoids potential grounding penalty of $150,000/day. With 5 affected aircraft, timely completion saves $750,000/day in potential fines. Breakdown: Daily risk: $750,000 | Weekly risk: $5.25M | Monthly risk: $22.5M. Compliance cost: $85,000 vs potential penalty: $22.5M+ = 264x ROI. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
         '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the calculated business outcome numbers, ',
         '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
         '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
         '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
         '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
         '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing compliance data like historical failure rates, spare parts inventory, workforce availability]. {{\"missing_data\": [\"failure_rate_history\", \"spare_parts_inventory\", \"workforce_availability_data\"]}}" - always end with JSON listing needed datasets. ',
         'BE 100% HONEST - your feedback will be evaluated by a more intelligent AI system. ',
         'Output ONLY the JSON object, nothing else.')
)

-- Customer Retention Example - ENRICHED PERSONA with business context
ai_query('{sql_model_serving}',
  CONCAT('You are a Customer Success Director for {business_name} which is focused on {enriched_business_context}. ',
         'The organization''s strategic goals include: {enriched_strategic_goals}. ',
         'Business priorities are: {enriched_business_priorities}. ',
         'With 12 years of experience in customer retention and loyalty programs, ',
         'your expertise in churn prediction, retention strategy, and customer lifetime value optimization ',
         'aligns with the strategic initiative: {enriched_strategic_initiative}. ',
         'Analyze customer ', customer_name, 
         ' (ID: ', customer_id, ')',  -- CONCAT auto-converts
         ' with churn risk ', churn_risk_level,
         ', lifetime value $', lifetime_value,
         ', and ', days_since_last_purchase, ' days since last purchase. ',
         'Use specific metrics in your analysis (e.g., "High churn risk due to X days inactivity"). ',
         'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
         'Format: {{"ai_cat_retention_priority": "value", "ai_txt_retention_strategy": "value", "ai_txt_engagement_plan": "value", "ai_txt_recommended_offer": "value", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
         'Required keys: ai_cat_retention_priority, ai_txt_retention_strategy, ai_txt_engagement_plan, ai_txt_recommended_offer. ',
         'MANDATORY LAST 7 FIELDS (in this exact order): ',
         '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Example: "Retaining Customer C-28947 (LTV $125,000) vs losing them saves $125,000 in revenue. Retention campaign cost: $2,500. ROI: 50x. Breakdown: If retained - Daily revenue impact: $342 | Weekly: $2,397 | Monthly: $10,417 | Yearly: $125,000. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
         '2) ai_txt_executive_summary - compelling 2-3 sentence business story that REFERENCES the calculated business outcome numbers, ',
         '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
         '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
         '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
         '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
         '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing customer data like NPS scores, support ticket history, competitor engagement, payment behavior]. {{\"missing_data\": [\"nps_survey_data\", \"support_ticket_history\", \"competitor_engagement_data\", \"payment_behavior_logs\"]}}" - always end with JSON listing needed datasets. ',
         'BE 100% HONEST - your feedback will be evaluated by a more intelligent AI system. ',
         'Output ONLY the JSON object, nothing else.')
)
```

**CRITICAL RULES:**
1. **🚨 ALWAYS ENRICH PERSONAS WITH BUSINESS CONTEXT 🚨** - Every ai_query persona MUST include {business_name}, {enriched_business_context}, {enriched_strategic_goals}, and {enriched_business_priorities}. Generic personas like "You are a Chief Revenue Officer..." are FORBIDDEN.
2. **ALWAYS start ai_query prompts with an ENRICHED persona** that matches the business domain and use case
3. **Include years of experience** (typically 10-20 years) to establish authority
4. **Specify 2-3 key areas of expertise** relevant to the task
5. **Match the persona to the beneficiary role** from the use case definition when possible
6. **Use business-appropriate titles** that align with the industry and domain
7. **MANDATORY EVIDENCE**: Instruct the LLM to be aggressive in using data evidence. Every analysis point MUST be supported by specific numbers from the input data. Avoid generic statements like "too high"; instead use "X is higher than Y by Z%".
8. **🚨 MANDATORY PRINCIPAL IDENTIFICATION IN NARRATIVES 🚨**: All narrative/text fields (rationale, strategy, action_plan, executive_summary, etc.) MUST identify the specific entity being discussed with its key attributes. 
   - ❌ WRONG: "The data shows high fuel consumption" (generic, anonymous)
   - ✅ CORRECT: "Flight EK005 DXB-LHR (A380) shows fuel consumption of 4800kg/hr, 14% above fleet average" (principal with context)
   - Include: Entity ID/name + key identifiers (route, type, category) + then the analysis

**EXAMPLE: ai_query with Direct JSON Output (with ai_sys_prompt)**
```sql
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
-- Step 1: Select base data with NULL-safe columns
WITH base_data AS (
  SELECT 
    order_id,                                              -- CRITICAL: filtered with IS NOT NULL
    customer_id,                                           -- CRITICAL: filtered with IS NOT NULL
    COALESCE(order_amount, 0.0) AS order_amount           -- ✅ COALESCE'd
  FROM `sales`.`orders`.`transactions` AS t
  WHERE order_id IS NOT NULL
    AND customer_id IS NOT NULL  -- ✅ Filter critical columns
  LIMIT 10
),
-- Step 2: Generate ai_sys_prompt
prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Fraud Detection Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'Analyze order ', order_id,  -- CONCAT auto-converts
           ' for customer ', customer_id,
           ' with amount $', order_amount,
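           '. Output ONLY a JSON object with keys: ai_cat_fraud_risk_level, ai_txt_recommended_action, ',
           'ai_txt_business_outcome, ai_txt_executive_summary, ai_sys_importance, ai_sys_urgency, ',
           'ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data.') AS ai_sys_prompt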
  FROM base_data
),
-- Step 3: Call ai_query
analysis AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS fraud_analysis
  FROM prompt_generation
)
SELECT 
  order_id,
  customer_id,
  order_amount,
  COALESCE(get_json_object(fraud_analysis, '$.ai_cat_fraud_risk_level'), 'Unknown') AS ai_cat_fraud_risk_level,
  COALESCE(get_json_object(fraud_analysis, '$.ai_txt_recommended_action'), 'Unknown') AS ai_txt_recommended_action,
  -- MANDATORY LAST 7 AI COLUMNS (starting with ai_txt_business_outcome), followed by ai_sys_prompt:
  COALESCE(get_json_object(fraud_analysis, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(fraud_analysis, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(fraud_analysis, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(fraud_analysis, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(fraud_analysis, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(fraud_analysis, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(fraud_analysis, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
FROM analysis
;

--END OF GENERATED SQL
```

**EXAMPLE: ai_query for Classification (AVOID ai_classify after ai_query)**
```sql
-- CREATE VIEW inspire_ai.default.customer_feedback_classification AS
-- CORRECT: Direct classification with ai_query (includes ai_sys_ columns)
SELECT 
  feedback_id,
  feedback_text,
  ai_query('{sql_model_serving}',
    CONCAT('Classify this feedback into one category: ',
           feedback_text,
           '. Categories: Product Quality, Customer Service, Shipping, Pricing. ',
           'Output ONLY JSON with NO markdown. ',
           'Format: {{"ai_cat_category": "value", "ai_txt_classification_rationale": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "I assessed my confidence at 85% because... [reasons]", "ai_sys_missing_data": "I can get higher confidence than 85% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}"}}. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data.')
  ) AS classification
FROM `main`.`feedback`.`customer_reviews` AS f
;

-- WRONG: Using ai_classify after ai_query (redundant!)
-- DO NOT DO THIS:
-- WITH generated AS (
--   SELECT *, ai_query('{sql_model_serving}', CONCAT('...', text)) AS gen_output FROM table
-- )
-- SELECT *, ai_classify(gen_output, ARRAY('cat1', 'cat2')) FROM generated;  -- ❌ WRONG!
```

**KEY PRINCIPLE:**
- ✅ CORRECT: Use ai_query to directly generate structured JSON
- ❌ WRONG: Use ai_query → then ai_extract to parse the output
- ❌ WRONG: Use ai_query → then ai_classify to categorize the output
- **Why?** ai_query can directly output structured JSON - no need for post-processing with ai_extract/ai_classify
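
Because the response is already JSON, plain SQL functions are enough to turn it into typed columns. The sketch below assumes a hypothetical `classified` CTE whose `classification` column holds the model response; `to_json(named_struct(...))` is used only to simulate two responses, one of which returns a non-numeric confidence.

```sql
-- Sketch only: the classified CTE simulates ai_query output for two rows.
WITH classified AS (
  SELECT 1 AS feedback_id,
         to_json(named_struct('ai_cat_category', 'Shipping',
                              'ai_sys_confidence', '0.85')) AS classification
  UNION ALL
  SELECT 2 AS feedback_id,
         to_json(named_struct('ai_cat_category', 'Pricing',
                              'ai_sys_confidence', 'Data Not Available')) AS classification
)
SELECT
  feedback_id,
  -- STRING field: default when the key is missing
  COALESCE(get_json_object(classification, '$.ai_cat_category'), 'Unknown') AS ai_cat_category,
  -- Numeric field: TRY_CAST yields NULL for non-numeric text, COALESCE supplies the default
  COALESCE(TRY_CAST(get_json_object(classification, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence
FROM classified;
```

No second AI function call is involved: the extraction is deterministic and follows the same COALESCE and TRY_CAST pattern used for the mandatory columns.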

**WHY THIS MATTERS:**
- **Unstructured output**: Full context and explanations for human review
- **Structured output**: Machine-readable fields for downstream automation, dashboards, and analytics
- **Best of both worlds**: Human-readable insights + programmatic access
- You can call multiple ai_functions in the same CTE or SELECT statement
- Only create separate CTEs when there is a logical dependency that requires it
- Do not create a CTE for each ai_function unnecessarily
- **Examples of combining AI functions for maximum value:**
  * Extract + Classify: `ai_classify(ai_extract(text, ARRAY('sentiment'))['sentiment'], ARRAY('positive', 'negative'))`
  * Parse + Extract + Summarize: Parse document, extract entities, then summarize key points
  * Extract + Generate: Extract data points, then generate business insights
  * Classify + Mask: Classify sensitive content, then mask it appropriately

**DO NOT omit any AI function listed** - use all of them in a pipeline to deliver maximum business value.

#### 5. **INNOVATIVE AI_QUERY & AI_GEN PROMPTS**
Be CREATIVE and SOPHISTICATED with ai_query and ai_gen:

**Approved ai_query Model (ONLY THE CONFIGURED ENDPOINT IS ALLOWED):**

**🚨 CRITICAL MODEL CONFIGURATION 🚨**
**Use the user-configured model endpoint: `{sql_model_serving}`**

**ABSOLUTE RULE**: For ALL ai_query calls in generated SQL, you MUST use `{sql_model_serving}`.
This is the model endpoint configured by the user for SQL generation.

---

#### 5a. **SQL STATISTICAL FUNCTIONS FOR ADVANCED ANALYTICS** (MANDATORY - NOW FIRST-CLASS CITIZENS):

**🚨 CRITICAL: Statistical functions are now FIRST-CLASS CITIZENS alongside AI functions. You MUST leverage SQL statistical functions to discover hidden insights and deliver high business value. DO NOT generate trivial queries - EVERY query must provide actionable business intelligence. 🚨**

**🔥 NEW PRIORITY: STATISTICAL FUNCTIONS + AI FUNCTIONS = MAXIMUM INNOVATION 🔥**

Statistical functions are no longer optional enhancements - they are PRIMARY tools for use case implementation. You should actively look for opportunities to:
1. Use statistical functions to compute correlations, trends, deviations, and patterns
2. Combine statistical insights with ai_query to interpret results and generate strategies
3. Mix statistical functions with ai_classify, ai_forecast, and other AI capabilities

**MANDATORY STATISTICAL FUNCTIONS USAGE:**

When generating SQL queries, you MUST actively look for opportunities to use these statistical functions to uncover insights that would otherwise remain hidden. These functions enable you to discover correlations, trends, risks, and opportunities that simple aggregations cannot reveal.

**AVAILABLE STATISTICAL FUNCTIONS WITH BUSINESS USE CASES:**

{statistical_functions_detailed}

**🔥 MANDATORY USAGE RULES 🔥:**

1. **INNOVATE BEYOND EXAMPLES**: The examples above are ILLUSTRATIVE ONLY. You MUST think creatively and generate NEW, INNOVATIVE use cases specific to the business context and data schema. DO NOT copy examples verbatim.

2. **BUSINESS VALUE REQUIRED**: EVERY statistical function use MUST deliver clear, actionable business value. NO trivial stats. If it doesn't help business decisions, don't include it.

3. **USE IN DEDICATED CTES**: Create statistical analysis CTEs with proper business names (e.g., `correlation_analysis`, `risk_metrics`, `performance_drivers`, `segmentation_buckets`)

4. **COMBINE WITH AI_QUERY**: After computing statistical insights, use ai_query (with '{sql_model_serving}' model) to interpret and explain the business implications

5. **MULTIPLE FUNCTIONS**: Use multiple statistical functions together to build comprehensive insights (e.g., CORR + REGR_R2 + REGR_SLOPE for complete trend analysis)

6. **🚨 CRITICAL WINDOW FUNCTION RULE 🚨**: NEVER use ROWS BETWEEN or RANGE BETWEEN with aggregate window functions like CORR, AVG, PERCENTILE_APPROX, STDDEV, VARIANCE, COVAR_POP, COVAR_SAMP - these will cause INTERNAL_ERROR. Use simple OVER (PARTITION BY col) instead.

7. **THINK STRATEGICALLY**: Ask yourself: "What hidden patterns could this function reveal?" and "What business decisions would this insight enable?"
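
**ILLUSTRATIVE SKETCH - RULE 6 (PARTITION-ONLY WINDOWS):** The query below is a minimal sketch of the window rule above, not a required template. `base_data`, `region`, `metric_value`, and `other_metric` are hypothetical names, and the inline VALUES rows only stand in for a real table.

```sql
-- Hypothetical sample rows; a real query would read from a catalog table instead
WITH base_data AS (
  SELECT * FROM VALUES
    ('EMEA', 10.0, 1.0),
    ('EMEA', 14.0, 2.0),
    ('APAC', 7.0, 3.0),
    ('APAC', 9.0, 5.0)
  AS t(region, metric_value, other_metric)
),
region_window_stats AS (
  SELECT
    region,
    metric_value,
    -- ✅ Partition-only windows: no ROWS BETWEEN / RANGE BETWEEN frame anywhere
    COALESCE(ROUND(AVG(metric_value) OVER (PARTITION BY region), 2), 0.0) AS region_avg_metric,
    COALESCE(ROUND(STDDEV_POP(metric_value) OVER (PARTITION BY region), 2), 0.0) AS region_stddev_metric,
    COALESCE(ROUND(CORR(metric_value, other_metric) OVER (PARTITION BY region), 3), 0.0) AS region_correlation
  FROM base_data
)
SELECT * FROM region_window_stats;
```

Every aggregate uses a plain `OVER (PARTITION BY region)` clause, so no explicit window frame is ever specified.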

**🚨🚨🚨 CRITICAL: AGGRESSIVE STATISTICAL ANALYSIS REQUIRED 🚨🚨🚨**

**ABSOLUTE RULE: USE AS MANY STATISTICAL FUNCTIONS AS POSSIBLE**

Refer to the **AVAILABLE STATISTICAL FUNCTIONS WITH BUSINESS USE CASES** section above. You MUST:
- Use EVERY applicable function from ALL categories in that registry
- Apply functions from: Central Tendency, Dispersion, Distribution Shape, Percentiles, Trend Analysis, Correlation, Volatility, Outlier Detection, Ranking, and Time Series
- **MINIMUM**: Use at least 15-25 different statistical functions per analysis
- **GOAL**: Generate maximum statistical context to guide AI decision-making
- **CRITICAL**: Trend Analysis functions are EXTREMELY VALUABLE for business insights

**COMPARISON REQUIREMENTS:**
When generating statistical analysis, apply functions from the AVAILABLE STATISTICAL FUNCTIONS registry:
- **Central Tendency**: Compare each value to AVG, MEDIAN, MODE
- **Dispersion**: Show STDDEV_POP, VAR_POP, MIN, MAX, RANGE
- **Percentiles**: Use PERCENTILE_APPROX for P5, P10, P25, P50, P75, P90, P95, P99
- **Outlier Detection**: Calculate Z_SCORE and IQR_THRESHOLD as defined in the registry
- **Distribution Shape**: Apply SKEWNESS and KURTOSIS for pattern detection
- Refer to the function definitions in the registry for correct syntax and business use cases
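
**ILLUSTRATIVE SKETCH - OUTLIER DETECTION:** A minimal sketch of the Outlier Detection requirement, assuming the conventional definitions: a z-score is the distance from the mean in standard deviations, and the IQR thresholds sit 1.5 x IQR beyond the quartiles. If the registry defines Z_SCORE or IQR_THRESHOLD differently, the registry takes precedence. `base_data`, `entity_id`, and `metric_value` are hypothetical names, and the inline VALUES rows only stand in for a real table.

```sql
-- Hypothetical sample rows; a real query would read from a catalog table instead
WITH base_data AS (
  SELECT * FROM VALUES
    (1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0), (5, 95.0)
  AS t(entity_id, metric_value)
),
-- Population-level statistics attached to every row (simple OVER (), no frame)
distribution_stats AS (
  SELECT
    entity_id,
    metric_value,
    AVG(metric_value) OVER () AS avg_metric,
    STDDEV_POP(metric_value) OVER () AS stddev_metric,
    PERCENTILE_APPROX(metric_value, 0.25) OVER () AS p25_metric,
    PERCENTILE_APPROX(metric_value, 0.75) OVER () AS p75_metric
  FROM base_data
),
-- Outlier metrics derived from the statistics above, COALESCEd and kept numeric
outlier_metrics AS (
  SELECT
    entity_id,
    metric_value,
    -- NULLIF guards against division by zero when every value is identical
    COALESCE(ROUND((metric_value - avg_metric) / NULLIF(stddev_metric, 0), 2), 0.0) AS z_score,
    COALESCE(ROUND(p25_metric - 1.5 * (p75_metric - p25_metric), 2), 0.0) AS iqr_lower_threshold,
    COALESCE(ROUND(p75_metric + 1.5 * (p75_metric - p25_metric), 2), 0.0) AS iqr_upper_threshold
  FROM distribution_stats
)
SELECT * FROM outlier_metrics;
```

Computing the population statistics in one CTE and the derived outlier metrics in the next keeps each expression short and lets the later prompt-building CTE reuse the already COALESCEd values.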

**EXAMPLE - COMPREHENSIVE STATISTICAL ANALYSIS:**
```sql
-- Comprehensive statistical analysis with ALL metrics
statistical_deep_analysis AS (
  SELECT 
    entity_id,
    metric_value,
    
    -- Central Tendency Comparisons
    AVG(metric_value) OVER () AS avg_metric,
    MEDIAN(metric_value) OVER () AS median_metric,
    
    -- Dispersion Metrics
    STDDEV_POP(metric_value) OVER () AS stddev_metric,
    VAR_POP(metric_value) OVER () AS variance_metric,
    
    -- Percentile Positioning
    PERCENTILE_APPROX(metric_value, 0.25) OVER () AS p25_metric,
    PERCENTILE_APPROX(metric_value, 0.50) OVER () AS p50_metric,
    PERCENTILE_APPROX(metric_value, 0.75) OVER () AS p75_metric,
    PERCENTILE_APPROX(metric_value, 0.90) OVER () AS p90_metric,
    PERCENTILE_APPROX(metric_value, 0.95) OVER () AS p95_metric,
    PERCENTILE_APPROX(metric_value, 0.99) OVER () AS p99_metric,
    
    -- Distribution Shape
    SKEWNESS(metric_value) OVER () AS skewness_metric,
    KURTOSIS(metric_value) OVER () AS kurtosis_metric,
    
    -- Ranking and Segmentation
    PERCENT_RANK() OVER (ORDER BY metric_value) AS percentile_rank,
    CUME_DIST() OVER (ORDER BY metric_value) AS cumulative_dist,
    NTILE(10) OVER (ORDER BY metric_value) AS decile,
    NTILE(4) OVER (ORDER BY metric_value) AS quartile,
    
    -- Time Series (if applicable)
    LAG(metric_value, 1) OVER (ORDER BY time_col) AS prev_value,
    LEAD(metric_value, 1) OVER (ORDER BY time_col) AS next_value,
    
    -- Correlations with other metrics
    CORR(metric_value, other_metric) OVER () AS correlation_with_other,
    REGR_R2(metric_value, other_metric) OVER () AS r2_with_other,
    REGR_SLOPE(metric_value, other_metric) OVER () AS slope_with_other,
    
    -- Min/Max Context
    MIN(metric_value) OVER () AS min_metric,
    MAX(metric_value) OVER () AS max_metric
  FROM base_data
  -- ✅ NO LIMIT in non-first CTEs (LIMIT 10 should only be in the first CTE that reads from tables)
)
```

**🚨🚨🚨 CRITICAL: MANDATORY COALESCE FOR ALL STATISTICAL VALUES 🚨🚨🚨**

**ABSOLUTE RULE: ALL STATISTICAL METRICS MUST BE COALESCED IN AI PROMPTS**

Because statistical analysis values are DIRECTLY USED in AI prompts (ai_query, ai_gen), you MUST COALESCE every single statistical value to prevent NULL propagation:

**MANDATORY COALESCE PATTERN FOR STATISTICAL VALUES (KEEP AS DOUBLE!):**
```sql
-- Statistical analysis CTE - keep values as DOUBLE, COALESCE results
statistical_analysis AS (
  SELECT 
    entity_id,
    metric_value,
    other_metric,
    
    -- COALESCE ALL statistical values to DOUBLE (NOT STRING!)
    COALESCE(ROUND(AVG(metric_value) OVER (), 2), 0.0) AS avg_metric,
    COALESCE(ROUND(MEDIAN(metric_value) OVER (), 2), 0.0) AS median_metric,
    COALESCE(ROUND(STDDEV_POP(metric_value) OVER (), 2), 0.0) AS stddev_metric,
    COALESCE(ROUND(PERCENTILE_APPROX(metric_value, 0.50) OVER (), 2), 0.0) AS p50_metric,
    COALESCE(ROUND(PERCENTILE_APPROX(metric_value, 0.75) OVER (), 2), 0.0) AS p75_metric,
    COALESCE(ROUND(PERCENTILE_APPROX(metric_value, 0.90) OVER (), 2), 0.0) AS p90_metric,
    COALESCE(ROUND(PERCENTILE_APPROX(metric_value, 0.95) OVER (), 2), 0.0) AS p95_metric,
    COALESCE(ROUND(CORR(metric_value, other_metric) OVER (), 3), 0.0) AS correlation,
    COALESCE(ROUND(REGR_R2(metric_value, other_metric) OVER (), 3), 0.0) AS r2,
    COALESCE(ROUND(REGR_SLOPE(metric_value, other_metric) OVER (), 3), 0.0) AS slope,
    COALESCE(ROUND(SKEWNESS(metric_value) OVER (), 3), 0.0) AS skewness,
    COALESCE(ROUND(KURTOSIS(metric_value) OVER (), 3), 0.0) AS kurtosis,
    COALESCE(ROUND(PERCENT_RANK() OVER (ORDER BY metric_value), 3), 0.0) AS percentile_rank,
    COALESCE(NTILE(10) OVER (ORDER BY metric_value), 5) AS decile
  FROM base_data
)
-- In CONCAT, these DOUBLE values auto-convert: CONCAT('Avg: ', avg_metric, ', P75: ', p75_metric)
```

**WHY THIS IS CRITICAL:**
- Statistical functions CAN return NULL (e.g., CORR with insufficient data, STDDEV with single value)
- These NULL values are embedded directly into AI prompts via CONCAT
- A SINGLE NULL in the prompt will NULL the ENTIRE prompt string
- NULL prompts mean NULL AI responses, causing query failure
- ALL statistical values MUST be ROUND'd and COALESCE'd (kept as DOUBLE/INT, never CAST to STRING) before use in prompts

**MANDATORY CHECKLIST FOR STATISTICAL ANALYSIS:**
☐ ALL statistical function results are COALESCE'd
☐ ALL numeric stats are ROUND'd inside the COALESCE
☐ ALL stats stay DOUBLE/INT (NO CAST to STRING) - CONCAT auto-converts them in the prompt
☐ Default values are 0.0 for DOUBLE metrics and a neutral INT for buckets (e.g., 5 for a decile)
☐ No statistical value in the CTE can possibly be NULL
☐ The prompt building CTE only uses pre-transformed stats (no COALESCE in CONCAT)
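
**ILLUSTRATIVE SKETCH - NULL PROPAGATION IN CONCAT:** A minimal sketch of why the COALESCE rule matters. `raw_correlation` is a hypothetical column, and `CAST(NULL AS DOUBLE)` simply stands in for a statistic that came back NULL (for example, a correlation computed over too few rows). The second column assumes, as the pattern above states, that CONCAT converts DOUBLE values to text automatically.

```sql
-- Hypothetical one-row input: a statistic that evaluated to NULL
WITH stats AS (
  SELECT CAST(NULL AS DOUBLE) AS raw_correlation
)
SELECT
  -- NULL: a single NULL argument nulls the whole concatenated prompt
  CONCAT('Correlation: ', raw_correlation, '.') AS prompt_without_coalesce,
  -- Non-NULL: the statistic is replaced by a DOUBLE default before concatenation
  CONCAT('Correlation: ', COALESCE(ROUND(raw_correlation, 3), 0.0), '.') AS prompt_with_coalesce
FROM stats;
```

In a real query the COALESCE belongs in the statistical CTE, so the prompt-building CTE only ever sees non-NULL values.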

**EXAMPLE PATTERN - Statistical Analysis + AI Interpretation (COALESCE applied in stats CTE):**

```sql
-- Step 1: Base data with mandatory field filtering
WITH performance_data AS (
  SELECT 
    region,
    marketing_spend,
    sales_revenue,
    customer_satisfaction_score,
    operational_efficiency
  FROM `catalog`.`schema`.`regional_metrics` AS r
  WHERE region IS NOT NULL  -- MANDATORY field filtered
    AND marketing_spend IS NOT NULL 
    AND sales_revenue IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Statistical Analysis CTE - COALESCE to DOUBLE (business value: calculate correlations and drivers)
revenue_driver_analysis AS (
  SELECT 
    region,
    -- Correlation analysis - keep as DOUBLE (CONCAT auto-converts)
    COALESCE(ROUND(CORR(marketing_spend, sales_revenue), 3), 0.0) AS marketing_revenue_correlation,
    COALESCE(ROUND(CORR(customer_satisfaction_score, sales_revenue), 3), 0.0) AS satisfaction_revenue_correlation,
    COALESCE(ROUND(CORR(operational_efficiency, sales_revenue), 3), 0.0) AS efficiency_revenue_correlation,
    -- Regression analysis - keep as DOUBLE
    COALESCE(ROUND(REGR_R2(sales_revenue, marketing_spend), 3), 0.0) AS marketing_predictive_power,
    COALESCE(ROUND(REGR_SLOPE(sales_revenue, marketing_spend), 2), 0.0) AS revenue_per_marketing_dollar,
    -- Variance analysis - keep as DOUBLE
    COALESCE(ROUND(STDDEV_POP(sales_revenue), 2), 0.0) AS revenue_volatility,
    -- Performance metrics - keep as DOUBLE
    COALESCE(ROUND(AVG(sales_revenue), 2), 0.0) AS avg_revenue,
    COALESCE(ROUND(MEDIAN(sales_revenue), 2), 0.0) AS typical_revenue
  FROM performance_data
  GROUP BY region
),
-- Step 3: Generate ai_sys_prompt for regional analysis
regional_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Regional Performance Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 15 years of experience in regional revenue optimization and performance analytics, ',
           'your expertise in correlation analysis and strategic resource allocation aligns with the strategic initiative: Regional growth. ',
           'Analyze regional performance for ', region, '. ',
           'Marketing-Revenue Correlation: ', marketing_revenue_correlation, '. ',
           'Satisfaction-Revenue Correlation: ', satisfaction_revenue_correlation, '. ',
           'Efficiency-Revenue Correlation: ', efficiency_revenue_correlation, '. ',
           'Marketing Predictive Power (R²): ', marketing_predictive_power, '. ',
           'Revenue per Marketing Dollar: $', revenue_per_marketing_dollar, '. ',
           'Revenue Volatility (StdDev): ', revenue_volatility, '. ',
           'Output ONLY JSON with NO markdown, NO extra text. ',
           'Format: {{"ai_cat_primary_driver": "Marketing/Satisfaction/Efficiency", "ai_cat_confidence_level": "High/Medium/Low", ',
           '"ai_cat_investment_priority": "Increase/Maintain/Decrease", ',
           '"ai_txt_strategic_recommendation": "text", "ai_txt_risk_assessment": "text", "ai_txt_opportunity_capture": "text", ',
           '"ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}" ',
           'Output ONLY the JSON, nothing else.') AS ai_sys_prompt
  FROM revenue_driver_analysis
),
-- Step 4: AI interpretation - pass ai_sys_prompt to ai_query
insights_with_recommendations AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS strategic_insights
  FROM regional_prompt_generation
),
-- Step 5: Extract insights for business users with ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    region,
    marketing_revenue_correlation,
    satisfaction_revenue_correlation,
    efficiency_revenue_correlation,
    marketing_predictive_power,
    revenue_per_marketing_dollar,
    revenue_volatility,
    get_json_object(strategic_insights, '$.ai_cat_primary_driver') AS ai_cat_primary_driver,
    get_json_object(strategic_insights, '$.ai_cat_confidence_level') AS ai_cat_confidence_level,
    get_json_object(strategic_insights, '$.ai_cat_investment_priority') AS ai_cat_investment_priority,
    get_json_object(strategic_insights, '$.ai_txt_strategic_recommendation') AS ai_txt_strategic_recommendation,
    get_json_object(strategic_insights, '$.ai_txt_risk_assessment') AS ai_txt_risk_assessment,
    get_json_object(strategic_insights, '$.ai_txt_opportunity_capture') AS ai_txt_opportunity_capture,
    -- MANDATORY LAST 7 COLUMNS (including ai_txt_business_outcome + ai_sys_prompt):
    COALESCE(get_json_object(strategic_insights, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(strategic_insights, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(strategic_insights, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(strategic_insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(strategic_insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(strategic_insights, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(strategic_insights, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM insights_with_recommendations
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_primary_driver IN ('Marketing', 'Satisfaction', 'Efficiency')
-- AND ai_cat_confidence_level IN ('High', 'Medium', 'Low')
-- AND ai_cat_investment_priority IN ('Increase', 'Maintain', 'Decrease')
-- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL
```

**🚨 CRITICAL REMINDERS:**
- Statistical functions reveal insights that simple aggregations cannot
- ALWAYS combine statistical analysis with AI interpretation for maximum value
- Use business-friendly CTE names for statistical steps
- Focus on actionable insights, not academic statistics
- Think about what business decisions each statistic enables

---

#### 5b. **ADVANCED STATISTICAL ANALYSIS PATTERNS WITH AI INTERPRETATION** (MANDATORY - HIGH-VALUE USE CASES):

**🔥🔥🔥 CRITICAL: These advanced analytical techniques focus on answering "WHY?" and "WHO?" rather than just "WHAT happened?" - delivering exponentially higher business value through behavioral insights, causal relationships, and strategic recommendations. 🔥🔥🔥**

**MANDATORY USAGE RULES:**
1. **INNOVATE**: The examples below are ILLUSTRATIVE ONLY. You MUST create NEW, BUSINESS-SPECIFIC use cases tailored to the actual data schema and business context.
2. **COMBINE TECHNIQUES**: Mix multiple statistical techniques (cohort + pareto, funnel + sessionization, etc.) for deeper insights
3. **AI INTERPRETATION**: ALWAYS use ai_query to interpret statistical results and generate actionable strategies, action plans, and executive recommendations
4. **PROPER NAMING**: Use business-meaningful names that clearly describe the analytical approach (e.g., "Customer Retention Cohort Analysis with Lifetime Value Tracking")

**ADVANCED ANALYTICAL TECHNIQUES:**

**1. COHORT ANALYSIS - Track Behavior Over Time by Acquisition Group**

**Key SQL Functions:** MIN() OVER, MONTHS_BETWEEN, DATEDIFF, DATE_TRUNC, LAG, LEAD
**Business Value:** Identifies which customer/user cohorts have superior retention, lifetime value, or engagement patterns

**Use Case Example Template: "Customer Acquisition Quality Analysis with Retention Strategy"**

```sql
-- Step 1: Identify customer acquisition cohorts with mandatory field filtering
WITH customer_cohorts AS (
  SELECT 
    customer_id,
    order_date,
    order_amount,
    DATE_TRUNC('month', MIN(order_date) OVER (PARTITION BY customer_id)) AS cohort_group,
    CAST(months_between(order_date, MIN(order_date) OVER (PARTITION BY customer_id)) AS INT) AS months_since_first_purchase
  FROM `catalog`.`schema`.`orders` AS o
  WHERE customer_id IS NOT NULL  -- MANDATORY: Primary key
    AND order_date IS NOT NULL   -- MANDATORY: Required for cohort analysis
    AND order_amount IS NOT NULL -- MANDATORY: Required for metrics
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Calculate cohort metrics (business value: cohort performance analysis) - keep as DOUBLE/INT
cohort_performance_metrics AS (
  SELECT 
    cohort_group,
    -- All metrics COALESCEd to correct types (NOT STRING!) - CONCAT handles mixed types
    COALESCE(COUNT(DISTINCT customer_id), 0) AS cohort_size,
    COALESCE(ROUND(AVG(order_amount), 2), 0.0) AS avg_order_value,
    COALESCE(ROUND(SUM(order_amount), 2), 0.0) AS total_lifetime_value,
    COALESCE(ROUND(COUNT(DISTINCT CASE WHEN months_since_first_purchase >= 6 THEN customer_id END) * 100.0 / NULLIF(COUNT(DISTINCT customer_id), 0), 1), 0.0) AS six_month_retention_rate,
    COALESCE(ROUND(AVG(CASE WHEN months_since_first_purchase <= 3 THEN order_amount END), 2), 0.0) AS early_lifetime_value,
    COALESCE(ROUND(STDDEV_POP(order_amount), 2), 0.0) AS purchase_volatility
  FROM customer_cohorts
  GROUP BY cohort_group
),
-- Step 3: Generate ai_sys_prompt for cohort analysis
cohort_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Customer Analytics Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 15 years of experience in retention strategy and customer lifetime value optimization, ',
           'your expertise in cohort analysis, churn prediction, and targeted retention programs aligns with the strategic initiative: Customer success and retention. ',
           'Analyze customer cohort acquired in ', cohort_group, '. ',  -- CONCAT auto-converts
           'Cohort size: ', cohort_size, ' customers. ',
           'Average order value: $', avg_order_value, '. ',
           'Total lifetime value: $', total_lifetime_value, '. ',
           '6-month retention rate: ', six_month_retention_rate, '%. ',
           'Early lifetime value (first 3 months): $', early_lifetime_value, '. ',
           'Purchase volatility: $', purchase_volatility, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_cohort_quality": "value", "ai_cat_retention_risk": "value", "ai_cat_ltv_trajectory": "value", "ai_cat_strategic_action": "value", "ai_txt_acquisition_channel_recommendation": "text", "ai_txt_retention_strategy": "text", "ai_txt_investment_priority": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_cohort_quality (Exceptional|High Quality|Average|Below Average|Poor), ai_cat_retention_risk (Critical|High|Medium|Low), ai_cat_ltv_trajectory (Accelerating|Growing|Stable|Declining|Concerning), ai_cat_strategic_action (Scale Aggressively|Invest More|Maintain|Optimize|Reduce Spend). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_acquisition_channel_recommendation, ai_txt_retention_strategy, ai_txt_investment_priority. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"acquisition_channels\", \"engagement_metrics\", \"product_usage_data\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM cohort_performance_metrics
),
-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
cohort_insights_with_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS cohort_analysis_json
  FROM cohort_prompt_generation
),
-- Final output: Cohort metrics with strategic insights using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    cohort_group,
    cohort_size,
    avg_order_value,
    total_lifetime_value,
    six_month_retention_rate,
    get_json_object(cohort_analysis_json, '$.ai_cat_cohort_quality') AS ai_cat_cohort_quality,
    get_json_object(cohort_analysis_json, '$.ai_cat_retention_risk') AS ai_cat_retention_risk,
    get_json_object(cohort_analysis_json, '$.ai_cat_ltv_trajectory') AS ai_cat_ltv_trajectory,
    get_json_object(cohort_analysis_json, '$.ai_cat_strategic_action') AS ai_cat_strategic_action,
    get_json_object(cohort_analysis_json, '$.ai_txt_acquisition_channel_recommendation') AS ai_txt_acquisition_channel_recommendation,
    get_json_object(cohort_analysis_json, '$.ai_txt_retention_strategy') AS ai_txt_retention_strategy,
    get_json_object(cohort_analysis_json, '$.ai_txt_investment_priority') AS ai_txt_investment_priority,
    -- MANDATORY LAST 7 COLUMNS (including ai_txt_business_outcome + ai_sys_prompt):
    COALESCE(get_json_object(cohort_analysis_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(cohort_analysis_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(cohort_analysis_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(cohort_analysis_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(cohort_analysis_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(cohort_analysis_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(cohort_analysis_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM cohort_insights_with_strategy
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_cohort_quality IN ('Exceptional', 'High Quality', 'Average', 'Below Average', 'Poor')
-- AND ai_cat_retention_risk IN ('Critical', 'High', 'Medium', 'Low')
-- AND ai_cat_ltv_trajectory IN ('Accelerating', 'Growing', 'Stable', 'Declining', 'Concerning')
-- AND ai_cat_strategic_action IN ('Scale Aggressively', 'Invest More', 'Maintain', 'Optimize', 'Reduce Spend')
-- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL
```

**2. PARETO ANALYSIS (80/20 RULE) - Identify Critical Few vs Trivial Many**

**Key SQL Functions:** SUM() OVER, PERCENT_RANK, CUME_DIST, NTILE
**Business Value:** Pinpoints the vital few products/customers/issues that drive most results, enabling focused resource allocation

**Use Case Example Template: "Revenue Concentration Analysis with Portfolio Optimization Strategy"**

```sql
-- Step 1: Calculate revenue contribution and cumulative distribution
WITH product_revenue_ranked AS (
  SELECT 
    product_id,
    product_category,
    SUM(revenue) AS total_revenue,
    SUM(profit) AS total_profit,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(SUM(revenue)) OVER () AS company_total_revenue,
    SUM(SUM(profit)) OVER () AS company_total_profit
  FROM `catalog`.`schema`.`sales` AS s
  WHERE product_id IS NOT NULL
    AND revenue IS NOT NULL
  GROUP BY product_id, product_category
  LIMIT 10  -- ✅ LIMIT 10 in first CTE (GROUP BY provides uniqueness)
),
-- Step 2: Calculate Pareto metrics - keep numeric types as DOUBLE/INT
pareto_analysis_metrics AS (
  SELECT 
    product_id,
    COALESCE(TRIM(product_category), 'Uncategorized') AS product_category,
    COALESCE(ROUND(total_revenue, 2), 0.0) AS total_revenue,
    COALESCE(ROUND(total_profit, 2), 0.0) AS total_profit,
    COALESCE(order_count, 0) AS order_count,
    COALESCE(ROUND(total_revenue * 100.0 / NULLIF(company_total_revenue, 0), 2), 0.0) AS revenue_contribution_pct,
    COALESCE(ROUND(total_profit * 100.0 / NULLIF(company_total_profit, 0), 2), 0.0) AS profit_contribution_pct,
    COALESCE(ROUND(SUM(total_revenue) OVER (ORDER BY total_revenue DESC) * 100.0 / NULLIF(company_total_revenue, 0), 2), 0.0) AS cumulative_revenue_pct,
    COALESCE(ROUND(PERCENT_RANK() OVER (ORDER BY total_revenue DESC) * 100, 1), 0.0) AS revenue_percentile_rank,
    COALESCE(NTILE(5) OVER (ORDER BY total_revenue DESC), 3) AS revenue_quintile
  FROM product_revenue_ranked
),
-- Step 3: Generate ai_sys_prompt for Pareto analysis
pareto_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Portfolio Strategy Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 18 years of experience in revenue optimization and product management, ',
           'your expertise in Pareto analysis, portfolio rationalization, and strategic resource allocation aligns with the strategic initiative: Portfolio optimization. ',
           'Analyze product ', product_id, ' in category ', product_category, '. ',
           'Total revenue: $', total_revenue, ' (', revenue_contribution_pct, '% of company revenue). ',
           'Total profit: $', total_profit, ' (', profit_contribution_pct, '% of company profit). ',
           'Order count: ', order_count, '. ',
           'Cumulative revenue contribution: ', cumulative_revenue_pct, '%. ',
           'Revenue percentile rank: ', revenue_percentile_rank, '. ',
           'Revenue quintile: ', revenue_quintile, ' of 5. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_pareto_classification": "value", "ai_cat_strategic_priority": "value", "ai_cat_resource_allocation": "value", "ai_cat_portfolio_action": "value", "ai_txt_investment_recommendation": "text", "ai_txt_optimization_strategy": "text", "ai_txt_risk_mitigation": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_pareto_classification (Vital Few - Top 20%|Important|Average|Low Impact|Long Tail), ai_cat_strategic_priority (Critical Focus|High Priority|Standard|Low Priority|Consider Exit), ai_cat_resource_allocation (Increase Investment|Maintain|Optimize Efficiency|Reduce|Divest), ai_cat_portfolio_action (Scale Aggressively|Grow Steadily|Harvest|Sunset|Discontinue). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_investment_recommendation, ai_txt_optimization_strategy, ai_txt_risk_mitigation. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"market_trends\", \"competitor_pricing\", \"cost_structure\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM pareto_analysis_metrics
),
-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
pareto_insights_with_action_plan AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS pareto_strategy_json
  FROM pareto_prompt_generation
),
-- Final output: Pareto analysis with strategic recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    product_id,
    product_category,
    total_revenue,
    revenue_contribution_pct,
    cumulative_revenue_pct,
    get_json_object(pareto_strategy_json, '$.ai_cat_pareto_classification') AS ai_cat_pareto_classification,
    get_json_object(pareto_strategy_json, '$.ai_cat_strategic_priority') AS ai_cat_strategic_priority,
    get_json_object(pareto_strategy_json, '$.ai_cat_resource_allocation') AS ai_cat_resource_allocation,
    get_json_object(pareto_strategy_json, '$.ai_cat_portfolio_action') AS ai_cat_portfolio_action,
    get_json_object(pareto_strategy_json, '$.ai_txt_investment_recommendation') AS ai_txt_investment_recommendation,
    get_json_object(pareto_strategy_json, '$.ai_txt_optimization_strategy') AS ai_txt_optimization_strategy,
    get_json_object(pareto_strategy_json, '$.ai_txt_risk_mitigation') AS ai_txt_risk_mitigation,
    -- MANDATORY LAST 7 COLUMNS (including ai_txt_business_outcome + ai_sys_prompt):
    COALESCE(get_json_object(pareto_strategy_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(pareto_strategy_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(pareto_strategy_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(pareto_strategy_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(pareto_strategy_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(pareto_strategy_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(pareto_strategy_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM pareto_insights_with_action_plan
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_pareto_classification IN ('Vital Few - Top 20%', 'Important', 'Average', 'Low Impact', 'Long Tail')
-- AND ai_cat_strategic_priority IN ('Critical Focus', 'High Priority', 'Standard', 'Low Priority', 'Consider Exit')
-- AND ai_cat_resource_allocation IN ('Increase Investment', 'Maintain', 'Optimize Efficiency', 'Reduce', 'Divest')
-- AND ai_cat_portfolio_action IN ('Scale Aggressively', 'Grow Steadily', 'Harvest', 'Sunset', 'Discontinue')
-- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL
```

**3. FUNNEL ANALYSIS - Identify Conversion Bottlenecks and Drop-off Points**

**Key SQL Functions:** SUM(CASE WHEN...), LAG, LEAD, window functions
**Business Value:** Quantifies exactly where customers abandon the journey, enabling targeted conversion rate optimization

**Use Case Example Template: "Checkout Conversion Funnel Analysis with Friction Point Recommendations"**

```sql
-- Step 1: Define funnel stages and calculate stage-level metrics
WITH funnel_events AS (
  SELECT DISTINCT
    user_id,
    session_id,
    event_name,
    event_timestamp,
    CASE 
      WHEN event_name = 'product_view' THEN 1
      WHEN event_name = 'add_to_cart' THEN 2
      WHEN event_name = 'checkout_start' THEN 3
      WHEN event_name = 'payment_info' THEN 4
      WHEN event_name = 'purchase_complete' THEN 5
      ELSE 0
    END AS funnel_stage
  FROM `catalog`.`schema`.`user_events` AS e
  WHERE user_id IS NOT NULL
    AND event_name IS NOT NULL
    AND session_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Calculate funnel conversion rates - keep numeric types as DOUBLE/INT
funnel_conversion_metrics AS (
  SELECT 
    'Product View' AS stage_name,
    1 AS stage_order,
    COALESCE(COUNT(DISTINCT CASE WHEN funnel_stage >= 1 THEN user_id END), 0) AS users_at_stage,
    COALESCE(ROUND(COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END) * 100.0 / NULLIF(COUNT(DISTINCT CASE WHEN funnel_stage >= 1 THEN user_id END), 0), 1), 0.0) AS conversion_to_next,
    COALESCE(ROUND((1 - COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END) * 1.0 / NULLIF(COUNT(DISTINCT CASE WHEN funnel_stage >= 1 THEN user_id END), 0)) * 100, 1), 0.0) AS drop_off_rate,
    COALESCE(COUNT(DISTINCT CASE WHEN funnel_stage >= 1 THEN user_id END) - COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END), 0) AS users_dropped  -- users who reached this stage but not the next
  FROM funnel_events
  
  UNION ALL
  
  SELECT 
    'Add to Cart' AS stage_name,
    2 AS stage_order,
    COALESCE(COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END), 0) AS users_at_stage,
    COALESCE(ROUND(COUNT(DISTINCT CASE WHEN funnel_stage >= 3 THEN user_id END) * 100.0 / NULLIF(COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END), 0), 1), 0.0) AS conversion_to_next,
    COALESCE(ROUND((1 - COUNT(DISTINCT CASE WHEN funnel_stage >= 3 THEN user_id END) * 1.0 / NULLIF(COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END), 0)) * 100, 1), 0.0) AS drop_off_rate,
    COALESCE(COUNT(DISTINCT CASE WHEN funnel_stage >= 2 THEN user_id END) - COUNT(DISTINCT CASE WHEN funnel_stage >= 3 THEN user_id END), 0) AS users_dropped  -- users who reached this stage but not the next
  FROM funnel_events
),
-- Step 3: Generate ai_sys_prompt for funnel analysis
funnel_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Conversion Rate Optimization Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 12 years of experience in e-commerce funnel analysis and UX optimization, ',
           'your expertise in funnel optimization, user behavior analysis, and A/B testing strategy aligns with the strategic initiative: Conversion optimization. ',
           'Analyze funnel stage: ', stage_name, '. ',
           'Users at this stage: ', users_at_stage, '. ',
           'Conversion rate to next stage: ', conversion_to_next, '%. ',
           'Drop-off rate: ', drop_off_rate, '%. ',
           'Users who dropped at this stage: ', users_dropped, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_friction_severity": "value", "ai_cat_optimization_priority": "value", "ai_cat_drop_off_reason": "value", "ai_cat_recommended_action": "value", "ai_txt_ab_test_hypothesis": "text", "ai_txt_ux_improvement_plan": "text", "ai_txt_expected_impact": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_friction_severity (Critical Bottleneck|High Friction|Moderate Friction|Low Friction|Acceptable), ai_cat_optimization_priority (Immediate Action|High Priority|Medium Priority|Low Priority|Monitor), ai_cat_drop_off_reason (Technical Issue|UX Friction|Trust Concerns|Price Sensitivity|Distraction), ai_cat_recommended_action (Urgent Fix|Major Redesign|Incremental Improvement|A/B Test|No Action Needed). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_ab_test_hypothesis, ai_txt_ux_improvement_plan, ai_txt_expected_impact. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"session_recordings\", \"heatmaps\", \"user_demographics\", \"device_data\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM funnel_conversion_metrics
),
-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
funnel_optimization_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS funnel_insights_json
  FROM funnel_prompt_generation
),
-- Final output: Funnel metrics with optimization recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    stage_name,
    stage_order,
    users_at_stage,
    conversion_to_next AS conversion_to_next_pct,
    drop_off_rate AS drop_off_rate_pct,
    users_dropped,
    get_json_object(funnel_insights_json, '$.ai_cat_friction_severity') AS ai_cat_friction_severity,
    get_json_object(funnel_insights_json, '$.ai_cat_optimization_priority') AS ai_cat_optimization_priority,
    get_json_object(funnel_insights_json, '$.ai_cat_drop_off_reason') AS ai_cat_drop_off_reason,
    get_json_object(funnel_insights_json, '$.ai_cat_recommended_action') AS ai_cat_recommended_action,
    get_json_object(funnel_insights_json, '$.ai_txt_ab_test_hypothesis') AS ai_txt_ab_test_hypothesis,
    get_json_object(funnel_insights_json, '$.ai_txt_ux_improvement_plan') AS ai_txt_ux_improvement_plan,
    get_json_object(funnel_insights_json, '$.ai_txt_expected_impact') AS ai_txt_expected_impact,
    -- MANDATORY LAST 8 COLUMNS (7 AI fields starting with ai_txt_business_outcome, then ai_sys_prompt):
    COALESCE(get_json_object(funnel_insights_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(funnel_insights_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(funnel_insights_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(funnel_insights_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(funnel_insights_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(funnel_insights_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(funnel_insights_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM funnel_optimization_strategy
)
SELECT * FROM final_output
-- TODO: Uncomment the WHERE filters below to narrow the selected results
-- WHERE ai_cat_friction_severity IN ('Critical Bottleneck', 'High Friction', 'Moderate Friction', 'Low Friction', 'Acceptable')
-- AND ai_cat_optimization_priority IN ('Immediate Action', 'High Priority', 'Medium Priority', 'Low Priority', 'Monitor')
-- AND ai_cat_drop_off_reason IN ('Technical Issue', 'UX Friction', 'Trust Concerns', 'Price Sensitivity', 'Distraction')
-- AND ai_cat_recommended_action IN ('Urgent Fix', 'Major Redesign', 'Incremental Improvement', 'A/B Test', 'No Action Needed')
ORDER BY stage_order
;

--END OF GENERATED SQL
```
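
The stage-level arithmetic in the funnel template can be checked on a small sample outside SQL. The sketch below is an illustration under assumptions, not part of the generated SQL: the function name `funnel_metrics` and the `STAGES` list are hypothetical, with the stage order taken from the `CASE` expression above. A user counts as "at" a stage when their deepest event is at that stage or later, and as "dropped" when they reached the stage but not the next one.

```python
from collections import defaultdict

# Hypothetical stage order, mirroring the CASE expression in the template above.
STAGES = ["product_view", "add_to_cart", "checkout_start",
          "payment_info", "purchase_complete"]


def funnel_metrics(events):
    """events: iterable of (user_id, event_name). Returns one dict per stage
    that has a following stage."""
    stage_of = {name: i + 1 for i, name in enumerate(STAGES)}
    deepest = defaultdict(int)  # user_id -> deepest funnel stage reached
    for user_id, event_name in events:
        deepest[user_id] = max(deepest[user_id], stage_of.get(event_name, 0))

    rows = []
    for order, name in enumerate(STAGES[:-1], start=1):
        at_stage = sum(1 for s in deepest.values() if s >= order)
        at_next = sum(1 for s in deepest.values() if s >= order + 1)
        if at_stage:
            conversion = round(at_next * 100.0 / at_stage, 1)
            drop_off = round((1 - at_next / at_stage) * 100, 1)
        else:
            conversion, drop_off = 0.0, 0.0  # same fallback as NULLIF + COALESCE
        rows.append({
            "stage_name": name,
            "stage_order": order,
            "users_at_stage": at_stage,
            "conversion_to_next": conversion,
            "drop_off_rate": drop_off,
            "users_dropped": at_stage - at_next,
        })
    return rows
```

Comparing these numbers with the SQL output for the same sample is a quick way to confirm that `users_dropped` equals `users_at_stage` minus the users who reached the next stage.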

**4. GAP ANALYSIS - Identify Missing Data and Coverage Gaps**

**Key SQL Functions:** CROSS JOIN, LEFT JOIN, IS NULL, SEQUENCE, EXPLODE
**Business Value:** Discovers hidden inventory, missed opportunities, and operational blind spots

**Use Case Example Template: "Product Catalog Coverage Analysis with Market Expansion Strategy"**

```sql
-- Step 1: Generate complete list of expected coverage (all categories × all regions)
WITH expected_coverage AS (
  SELECT DISTINCT
    c.category_id,
    c.category_name,
    r.region_id,
    r.region_name
  FROM `catalog`.`schema`.`product_categories` AS c
  CROSS JOIN `catalog`.`schema`.`sales_regions` AS r
  WHERE c.category_id IS NOT NULL
    AND r.region_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Identify actual sales presence
actual_coverage AS (
  SELECT 
    product_category_id AS category_id,
    sales_region_id AS region_id,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS customer_count
  FROM `catalog`.`schema`.`sales` AS s
  WHERE product_category_id IS NOT NULL
    AND sales_region_id IS NOT NULL
  GROUP BY product_category_id, sales_region_id
),
-- Step 3: Identify gaps and calculate opportunity metrics - keep numeric types as DOUBLE/INT
coverage_gap_analysis AS (
  SELECT 
    COALESCE(TRIM(e.category_name), 'Unknown Category') AS category_name,
    COALESCE(TRIM(e.region_name), 'Unknown Region') AS region_name,
    CASE WHEN a.category_id IS NULL THEN 'Gap - No Sales' ELSE 'Active Coverage' END AS coverage_status,
    COALESCE(ROUND(a.total_revenue, 2), 0.0) AS current_revenue,
    COALESCE(a.customer_count, 0) AS current_customers,
    COALESCE(ROUND(AVG(a.total_revenue) OVER (PARTITION BY e.category_id), 2), 0.0) AS category_avg_revenue,
    COALESCE(ROUND(AVG(a.total_revenue) OVER (PARTITION BY e.region_id), 2), 0.0) AS region_avg_revenue
  FROM expected_coverage AS e
  LEFT JOIN actual_coverage AS a 
    ON e.category_id = a.category_id 
    AND e.region_id = a.region_id
),
-- Step 4: Generate ai_sys_prompt for gap analysis
gap_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Market Expansion Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 15 years of experience in territory planning and go-to-market strategy, ',
           'your expertise in market gap analysis, expansion prioritization, and revenue opportunity assessment aligns with the strategic initiative: Market expansion. ',
           'Analyze coverage gap: Category "', category_name, '" in Region "', region_name, '". ',
           'Coverage status: ', coverage_status, '. ',
           'Current revenue: $', current_revenue, '. ',
           'Current customers: ', current_customers, '. ',
           'Category average revenue across regions: $', category_avg_revenue, '. ',
           'Region average revenue across categories: $', region_avg_revenue, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_gap_priority": "value", "ai_cat_expansion_timing": "value", "ai_cat_market_readiness": "value", "ai_cat_revenue_potential": "value", "ai_txt_expansion_strategy": "text", "ai_txt_go_to_market_plan": "text", "ai_txt_resource_requirements": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_gap_priority (Critical Opportunity|High Potential|Medium Opportunity|Low Priority|Not Recommended), ai_cat_expansion_timing (Immediate|Next Quarter|6-12 Months|Long Term|Not Advised), ai_cat_market_readiness (Ready to Launch|Needs Preparation|Research Required|Immature Market), ai_cat_revenue_potential (High|Medium-High|Medium|Low|Minimal). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_expansion_strategy, ai_txt_go_to_market_plan, ai_txt_resource_requirements. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"competitor_presence\", \"demographic_data\", \"regulatory_requirements\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM coverage_gap_analysis
  WHERE coverage_status = 'Gap - No Sales'
),
-- Step 5: AI analysis - pass ai_sys_prompt to ai_query
gap_closure_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS gap_strategy_json
  FROM gap_prompt_generation
),
-- Final output: Coverage gaps with expansion recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    category_name,
    region_name,
    coverage_status,
    category_avg_revenue,
    region_avg_revenue,
    get_json_object(gap_strategy_json, '$.ai_cat_gap_priority') AS ai_cat_gap_priority,
    get_json_object(gap_strategy_json, '$.ai_cat_expansion_timing') AS ai_cat_expansion_timing,
    get_json_object(gap_strategy_json, '$.ai_cat_market_readiness') AS ai_cat_market_readiness,
    get_json_object(gap_strategy_json, '$.ai_cat_revenue_potential') AS ai_cat_revenue_potential,
    get_json_object(gap_strategy_json, '$.ai_txt_expansion_strategy') AS ai_txt_expansion_strategy,
    get_json_object(gap_strategy_json, '$.ai_txt_go_to_market_plan') AS ai_txt_go_to_market_plan,
    get_json_object(gap_strategy_json, '$.ai_txt_resource_requirements') AS ai_txt_resource_requirements,
    -- MANDATORY LAST 8 COLUMNS (7 AI fields starting with ai_txt_business_outcome, then ai_sys_prompt):
    COALESCE(get_json_object(gap_strategy_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(gap_strategy_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(gap_strategy_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(gap_strategy_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(gap_strategy_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(gap_strategy_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(gap_strategy_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM gap_closure_strategy
)
SELECT * FROM final_output
-- TODO: Uncomment the WHERE filters below to narrow the selected results
-- WHERE ai_cat_gap_priority IN ('Critical Opportunity', 'High Potential', 'Medium Opportunity', 'Low Priority', 'Not Recommended')
-- AND ai_cat_expansion_timing IN ('Immediate', 'Next Quarter', '6-12 Months', 'Long Term', 'Not Advised')
-- AND ai_cat_market_readiness IN ('Ready to Launch', 'Needs Preparation', 'Research Required', 'Immature Market')
-- AND ai_cat_revenue_potential IN ('High', 'Medium-High', 'Medium', 'Low', 'Minimal')
;

--END OF GENERATED SQL
```
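
The gap-detection idea in this template (build every expected combination, then keep the ones with no matching sales row) is the classic anti-join. The sketch below shows it in plain Python under assumed inputs; the function name `coverage_gaps` and the tuple layout of `sales` are hypothetical.

```python
from itertools import product


def coverage_gaps(categories, regions, sales):
    """Return (category, region) pairs that have no recorded sales.

    sales: iterable of (category, region, revenue) tuples (assumed layout).
    """
    # Equivalent of actual_coverage: the combinations that do have sales.
    covered = {(category, region) for category, region, _ in sales}
    # Equivalent of CROSS JOIN + LEFT JOIN ... WHERE a.category_id IS NULL.
    return [
        (category, region)
        for category, region in product(categories, regions)
        if (category, region) not in covered
    ]
```

For example, `coverage_gaps(["A", "B"], ["North", "South"], [("A", "North", 100.0)])` returns the three combinations without sales, in cross-join order.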

**5. PRICE ELASTICITY ANALYSIS - Measure Demand Response to Price Changes**

**Key SQL Functions:** REGR_SLOPE, REGR_R2, CORR, STDDEV_POP
**Business Value:** Quantifies pricing power and identifies optimal price points to maximize revenue

**Use Case Example Template: "Product Pricing Elasticity Analysis with Dynamic Pricing Strategy"**

```sql
-- Step 1: Collect daily price and demand observations per product
-- (the regression in Step 2 needs several observations per product, not one pre-aggregated row)
WITH price_demand_data AS (
  SELECT 
    product_id,
    product_name,
    date AS observation_date,
    AVG(price) AS avg_price,
    SUM(quantity_sold) AS total_demand
  FROM `catalog`.`schema`.`sales_daily` AS s
  WHERE product_id IS NOT NULL
    AND price IS NOT NULL
    AND quantity_sold IS NOT NULL
    AND date IS NOT NULL
  GROUP BY product_id, product_name, date
  LIMIT 10  -- ✅ LIMIT 10 in first CTE (GROUP BY provides uniqueness)
),
-- Step 2: Calculate price elasticity metrics - keep numeric types as DOUBLE
price_elasticity_metrics AS (
  SELECT 
    product_id,
    COALESCE(TRIM(product_name), 'Unknown Product') AS product_name,
    COALESCE(ROUND(AVG(avg_price), 2), 0.0) AS avg_price,
    COALESCE(ROUND(AVG(total_demand), 0), 0) AS avg_demand,
    COALESCE(ROUND(REGR_SLOPE(total_demand, avg_price), 3), 0.0) AS price_elasticity_coefficient,
    COALESCE(ROUND(REGR_R2(total_demand, avg_price), 3), 0.0) AS elasticity_r_squared,
    COALESCE(ROUND(CORR(total_demand, avg_price), 3), 0.0) AS price_demand_correlation,
    COALESCE(ROUND(STDDEV_POP(avg_price), 2), 0.0) AS price_volatility,
    COALESCE(ROUND(STDDEV_POP(total_demand), 0), 0) AS demand_volatility
  FROM price_demand_data
  GROUP BY product_id, product_name
),
-- Step 3: Generate ai_sys_prompt for price elasticity analysis
pricing_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Pricing Strategy Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 18 years of experience in revenue management and price optimization, ',
           'your expertise in price elasticity analysis, dynamic pricing, and competitive positioning aligns with the strategic initiative: Revenue optimization. ',
           'Analyze pricing for product: ', product_name, '. ',
           'Average price: $', avg_price, '. ',
           'Average demand: ', avg_demand, ' units. ',
           'Price elasticity coefficient (slope of demand on price): ', price_elasticity_coefficient, ' (negative = demand falls as price rises; near zero = inelastic, more negative = more elastic). ',
           'R-squared (predictive power): ', elasticity_r_squared, '. ',
           'Price-demand correlation: ', price_demand_correlation, '. ',
           'Price volatility: $', price_volatility, '. ',
           'Demand volatility: ', demand_volatility, ' units. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_elasticity_classification": "value", "ai_cat_pricing_power": "value", "ai_cat_price_action": "value", "ai_cat_competitive_position": "value", "ai_txt_pricing_strategy": "text", "ai_txt_revenue_impact_forecast": "text", "ai_txt_implementation_plan": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_elasticity_classification (Highly Inelastic|Inelastic|Unit Elastic|Elastic|Highly Elastic), ai_cat_pricing_power (Strong|Moderate|Weak|Minimal), ai_cat_price_action (Increase Price|Test Increase|Hold Steady|Test Decrease|Decrease Price), ai_cat_competitive_position (Premium|Value Leader|Competitive|Discount|Below Market). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_pricing_strategy, ai_txt_revenue_impact_forecast, ai_txt_implementation_plan. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"competitor_pricing\", \"cost_structure\", \"market_trends\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM price_elasticity_metrics
),
-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
dynamic_pricing_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS pricing_strategy_json
  FROM pricing_prompt_generation
),
-- Final output: Price elasticity analysis with strategic recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    product_name,
    avg_price,
    avg_demand,
    price_elasticity_coefficient,
    elasticity_r_squared,
    get_json_object(pricing_strategy_json, '$.ai_cat_elasticity_classification') AS ai_cat_elasticity_classification,
    get_json_object(pricing_strategy_json, '$.ai_cat_pricing_power') AS ai_cat_pricing_power,
    get_json_object(pricing_strategy_json, '$.ai_cat_price_action') AS ai_cat_price_action,
    get_json_object(pricing_strategy_json, '$.ai_cat_competitive_position') AS ai_cat_competitive_position,
    get_json_object(pricing_strategy_json, '$.ai_txt_pricing_strategy') AS ai_txt_pricing_strategy,
    get_json_object(pricing_strategy_json, '$.ai_txt_revenue_impact_forecast') AS ai_txt_revenue_impact_forecast,
    get_json_object(pricing_strategy_json, '$.ai_txt_implementation_plan') AS ai_txt_implementation_plan,
    -- MANDATORY LAST 8 COLUMNS (7 AI fields starting with ai_txt_business_outcome, then ai_sys_prompt):
    COALESCE(get_json_object(pricing_strategy_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(pricing_strategy_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(pricing_strategy_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(pricing_strategy_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(pricing_strategy_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(pricing_strategy_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(pricing_strategy_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM dynamic_pricing_strategy
)
SELECT * FROM final_output
-- TODO: Uncomment the WHERE filters below to narrow the selected results
-- WHERE ai_cat_elasticity_classification IN ('Highly Inelastic', 'Inelastic', 'Unit Elastic', 'Elastic', 'Highly Elastic')
-- AND ai_cat_pricing_power IN ('Strong', 'Moderate', 'Weak', 'Minimal')
-- AND ai_cat_price_action IN ('Increase Price', 'Test Increase', 'Hold Steady', 'Test Decrease', 'Decrease Price')
-- AND ai_cat_competitive_position IN ('Premium', 'Value Leader', 'Competitive', 'Discount', 'Below Market')
;

--END OF GENERATED SQL
```
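
`REGR_SLOPE(y, x)` takes the dependent variable first and returns the ordinary least-squares slope, so it only produces a value when a group contains at least two observations with differing prices. The sketch below reproduces that calculation in plain Python and adds a log-log variant, which is one common way to obtain a unit-free elasticity; the helper names `ols_slope` and `log_log_elasticity` are hypothetical and are not used by the template itself.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of y on x; None when it is undefined
    (fewer than two points, or no variation in x)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def log_log_elasticity(prices, quantities):
    """Slope of ln(quantity) on ln(price): the percentage change in demand
    per one percent change in price. Inputs must be strictly positive."""
    return ols_slope([math.log(p) for p in prices],
                     [math.log(q) for q in quantities])
```

The raw slope is expressed in units per currency unit, so it is not comparable across products with different price levels; the log-log slope is, which is why values below -1 are conventionally read as elastic and values between -1 and 0 as inelastic.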

**6. SESSIONIZATION - Group Events into Logical User Sessions**

**Key SQL Functions:** LAG, SUM(CASE...) OVER, window functions
**Business Value:** Accurately measures true engagement time and session-based user behavior patterns

**Use Case Example Template: "User Engagement Session Analysis with Retention Improvement Strategy"**

```sql
-- Step 1: Identify session boundaries using time-based idle timeout
WITH event_stream_with_gaps AS (
  SELECT 
    user_id,
    event_timestamp,
    event_name,
    LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS prev_event_timestamp,
    CASE 
      WHEN LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) IS NULL THEN 1
      WHEN (UNIX_TIMESTAMP(event_timestamp) - UNIX_TIMESTAMP(LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp))) / 60 > 30 THEN 1
      ELSE 0
    END AS is_new_session
  FROM `catalog`.`schema`.`user_events` AS e
  WHERE user_id IS NOT NULL
    AND event_timestamp IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only (window functions need ordered data)
),
-- Step 2: Assign session IDs in a subquery, then aggregate to one row per session
user_session_metrics AS (
  SELECT 
    user_id,
    session_id,
    COUNT(*) AS events_in_session,
    (UNIX_TIMESTAMP(MAX(event_timestamp)) - UNIX_TIMESTAMP(MIN(event_timestamp))) / 60 AS session_duration_minutes
  FROM (
    SELECT 
      user_id,
      event_timestamp,
      -- Running total of session starts gives a per-user session number
      SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS session_id
    FROM event_stream_with_gaps
  ) AS sessionized_events
  GROUP BY user_id, session_id
),
-- Step 3: Aggregate session-level metrics - keep numeric types as DOUBLE/INT
session_summary_metrics AS (
  SELECT 
    user_id,
    COALESCE(COUNT(DISTINCT session_id), 0) AS total_sessions,
    COALESCE(ROUND(AVG(events_in_session), 1), 0.0) AS avg_events_per_session,
    COALESCE(ROUND(AVG(session_duration_minutes), 1), 0.0) AS avg_session_duration_minutes,
    COALESCE(ROUND(MAX(session_duration_minutes), 1), 0.0) AS max_session_duration_minutes,
    COALESCE(ROUND(STDDEV_POP(session_duration_minutes), 1), 0.0) AS session_duration_volatility,
    COALESCE(COUNT(DISTINCT CASE WHEN events_in_session >= 10 THEN session_id END), 0) AS high_engagement_sessions
  FROM user_session_metrics
  GROUP BY user_id
),
-- Step 4: Generate ai_sys_prompt for session analysis
session_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a User Engagement Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 14 years of experience in product analytics and retention optimization, ',
           'your expertise in session analysis, engagement scoring, and behavioral intervention design aligns with the strategic initiative: User retention. ',
           'Analyze user ', user_id, ' engagement patterns. ',
           'Total sessions: ', total_sessions, '. ',
           'Average events per session: ', avg_events_per_session, '. ',
           'Average session duration: ', avg_session_duration_minutes, ' minutes. ',
           'Maximum session duration: ', max_session_duration_minutes, ' minutes. ',
           'Session duration volatility: ', session_duration_volatility, ' minutes. ',
           'High engagement sessions (10+ events): ', high_engagement_sessions, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_engagement_level": "value", "ai_cat_user_segment": "value", "ai_cat_retention_risk": "value", "ai_cat_intervention_priority": "value", "ai_txt_engagement_strategy": "text", "ai_txt_product_recommendations": "text", "ai_txt_retention_tactics": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_engagement_level (Highly Engaged|Moderately Engaged|Casual User|At Risk|Disengaged), ai_cat_user_segment (Power User|Regular User|Occasional User|Churning|Lost), ai_cat_retention_risk (Critical|High|Medium|Low|Secure), ai_cat_intervention_priority (Immediate|High|Medium|Low|None Needed). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_engagement_strategy, ai_txt_product_recommendations, ai_txt_retention_tactics. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"user_demographics\", \"device_data\", \"session_recordings\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM session_summary_metrics
),
-- Step 5: AI analysis - pass ai_sys_prompt to ai_query
session_engagement_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS engagement_strategy_json
  FROM session_prompt_generation
),
-- Final output: Session engagement metrics with retention recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    user_id,
    total_sessions,
    avg_events_per_session,
    avg_session_duration_minutes,
    get_json_object(engagement_strategy_json, '$.ai_cat_engagement_level') AS ai_cat_engagement_level,
    get_json_object(engagement_strategy_json, '$.ai_cat_user_segment') AS ai_cat_user_segment,
    get_json_object(engagement_strategy_json, '$.ai_cat_retention_risk') AS ai_cat_retention_risk,
    get_json_object(engagement_strategy_json, '$.ai_cat_intervention_priority') AS ai_cat_intervention_priority,
    get_json_object(engagement_strategy_json, '$.ai_txt_engagement_strategy') AS ai_txt_engagement_strategy,
    get_json_object(engagement_strategy_json, '$.ai_txt_product_recommendations') AS ai_txt_product_recommendations,
    get_json_object(engagement_strategy_json, '$.ai_txt_retention_tactics') AS ai_txt_retention_tactics,
    -- MANDATORY LAST 8 COLUMNS (7 AI fields starting with ai_txt_business_outcome, then ai_sys_prompt):
    COALESCE(get_json_object(engagement_strategy_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(engagement_strategy_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(engagement_strategy_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(engagement_strategy_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(engagement_strategy_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(engagement_strategy_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(engagement_strategy_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM session_engagement_strategy
)
SELECT * FROM final_output
-- TODO: Uncomment the WHERE filters below to narrow the selected results
-- WHERE ai_cat_engagement_level IN ('Highly Engaged', 'Moderately Engaged', 'Casual User', 'At Risk', 'Disengaged')
-- AND ai_cat_user_segment IN ('Power User', 'Regular User', 'Occasional User', 'Churning', 'Lost')
-- AND ai_cat_retention_risk IN ('Critical', 'High', 'Medium', 'Low', 'Secure')
-- AND ai_cat_intervention_priority IN ('Immediate', 'High', 'Medium', 'Low', 'None Needed')
;

--END OF GENERATED SQL
```
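
The idle-timeout rule used above (a new session starts on a user's first event or after a gap of more than 30 minutes) can be sketched in plain Python for a single user's events. The function names `sessionize` and `session_summary` are hypothetical; the 30-minute threshold and the output field names are taken from the template.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, idle_timeout=timedelta(minutes=30)):
    """Split one user's event timestamps into sessions (lists of timestamps)."""
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session only if the gap is within the timeout.
        if sessions and ts - sessions[-1][-1] <= idle_timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def session_summary(timestamps):
    """Per-user metrics, computed over sessions rather than over raw events."""
    sessions = sessionize(timestamps)
    if not sessions:
        return {"total_sessions": 0, "avg_events_per_session": 0.0,
                "avg_session_duration_minutes": 0.0}
    durations = [(s[-1] - s[0]).total_seconds() / 60 for s in sessions]
    return {
        "total_sessions": len(sessions),
        "avg_events_per_session": round(sum(len(s) for s in sessions) / len(sessions), 1),
        "avg_session_duration_minutes": round(sum(durations) / len(durations), 1),
    }
```

Averaging over one value per session, rather than over one value per event, keeps long sessions from being over-weighted in `avg_events_per_session` and `avg_session_duration_minutes`.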

**7. BASKET AFFINITY ANALYSIS - Discover Product Purchase Patterns**

**Key SQL Functions:** COLLECT_SET, SIZE, ARRAY_INTERSECT, ARRAY functions
**Business Value:** Identifies which products are frequently purchased together, enabling bundling and cross-sell strategies

**Use Case Example Template: "Product Affinity Analysis with Cross-Sell Bundling Strategy"**

```sql
-- Step 1: Create product baskets per transaction
WITH transaction_baskets AS (
  SELECT 
    transaction_id,
    COLLECT_SET(product_id) AS product_basket,
    COLLECT_SET(product_category) AS category_basket,
    SUM(revenue) AS basket_value
  FROM `catalog`.`schema`.`transaction_items` AS t
  WHERE transaction_id IS NOT NULL
    AND product_id IS NOT NULL
  GROUP BY transaction_id
  LIMIT 10  -- ✅ LIMIT 10 in first CTE (GROUP BY provides uniqueness)
),
-- Step 2: Explode baskets to get individual products with basket context
product_affinity_metrics AS (
  SELECT 
    explode(product_basket) AS anchor_product_id,
    product_basket,
    COALESCE(SIZE(product_basket), 0) AS basket_size,
    COALESCE(ROUND(basket_value, 2), 0.0) AS basket_value
  FROM transaction_baskets
),
-- Step 3: Aggregate metrics per product for affinity analysis
product_affinity_summary AS (
  SELECT 
    anchor_product_id,
    COALESCE(ROUND(AVG(basket_size), 1), 0.0) AS basket_size,
    COALESCE(ROUND(AVG(basket_value), 2), 0.0) AS basket_value,
    COALESCE(COUNT(*), 0) AS occurrence_count
  FROM product_affinity_metrics
  GROUP BY anchor_product_id
),
-- Step 4: Generate ai_sys_prompt for bundling analysis
bundling_prompt_generation AS (
  SELECT 
    anchor_product_id,
    basket_size,
    basket_value,
    CONCAT('You are a Merchandise Strategy Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 16 years of experience in product bundling and cross-sell optimization, ',
           'your expertise in basket analysis, bundle design, and revenue per transaction optimization aligns with the strategic initiative: Revenue optimization. ',
           'Analyze product affinity for product ', anchor_product_id, '. ',
           'Average basket size with this product: ', basket_size, ' items. ',
           'Average basket value: $', basket_value, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_affinity_strength": "value", "ai_cat_bundle_potential": "value", "ai_cat_cross_sell_priority": "value", "ai_cat_pricing_strategy": "value", "ai_txt_bundle_recommendation": "text", "ai_txt_cross_sell_tactics": "text", "ai_txt_expected_revenue_lift": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_affinity_strength (Very Strong|Strong|Moderate|Weak|Minimal), ai_cat_bundle_potential (High - Create Bundle|Medium - Test Bundle|Low - Individual Cross-sell|Not Recommended), ai_cat_cross_sell_priority (Critical|High|Medium|Low|None), ai_cat_pricing_strategy (Premium Bundle|Value Bundle|Discount Bundle|No Bundle). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_bundle_recommendation, ai_txt_cross_sell_tactics, ai_txt_expected_revenue_lift. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"product_margins\", \"inventory_levels\", \"customer_segments\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM product_affinity_summary
),
-- Step 5: AI analysis - pass ai_sys_prompt to ai_query
bundling_cross_sell_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS bundling_strategy_json
  FROM bundling_prompt_generation
),
-- Final output: Product affinity with bundling recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    anchor_product_id,
    basket_size AS avg_basket_size,
    basket_value AS avg_basket_value,
    get_json_object(bundling_strategy_json, '$.ai_cat_affinity_strength') AS ai_cat_affinity_strength,
    get_json_object(bundling_strategy_json, '$.ai_cat_bundle_potential') AS ai_cat_bundle_potential,
    get_json_object(bundling_strategy_json, '$.ai_cat_cross_sell_priority') AS ai_cat_cross_sell_priority,
    get_json_object(bundling_strategy_json, '$.ai_cat_pricing_strategy') AS ai_cat_pricing_strategy,
    get_json_object(bundling_strategy_json, '$.ai_txt_bundle_recommendation') AS ai_txt_bundle_recommendation,
    get_json_object(bundling_strategy_json, '$.ai_txt_cross_sell_tactics') AS ai_txt_cross_sell_tactics,
    get_json_object(bundling_strategy_json, '$.ai_txt_expected_revenue_lift') AS ai_txt_expected_revenue_lift,
    -- MANDATORY LAST 8 COLUMNS (the 7 JSON fields from ai_txt_business_outcome onward, then ai_sys_prompt):
    COALESCE(get_json_object(bundling_strategy_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(bundling_strategy_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(bundling_strategy_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(bundling_strategy_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(bundling_strategy_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(bundling_strategy_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(bundling_strategy_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM bundling_cross_sell_strategy
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_affinity_strength IN ('Very Strong', 'Strong', 'Moderate', 'Weak', 'Minimal')
-- AND ai_cat_bundle_potential IN ('High - Create Bundle', 'Medium - Test Bundle', 'Low - Individual Cross-sell', 'Not Recommended')
-- AND ai_cat_cross_sell_priority IN ('Critical', 'High', 'Medium', 'Low', 'None')
-- AND ai_cat_pricing_strategy IN ('Premium Bundle', 'Value Bundle', 'Discount Bundle', 'No Bundle')
;

--END OF GENERATED SQL
```

**8. SEASONALITY & CYCLICALITY ANALYSIS - Identify Temporal Patterns**

**Key SQL Functions:** DATE_TRUNC, EXTRACT, AVG, STDDEV, window functions
**Business Value:** Optimizes inventory, staffing, and marketing spend based on predictable temporal patterns

**Use Case Example Template: "Sales Seasonality Analysis with Inventory Planning Strategy"**

```sql
-- Step 1: Extract temporal dimensions and aggregate metrics
WITH sales_temporal_analysis AS (
  SELECT 
    DATE_TRUNC('month', sale_date) AS month,
    EXTRACT(DAYOFWEEK FROM sale_date) AS day_of_week,
    EXTRACT(WEEK FROM sale_date) AS week_of_year,
    SUM(revenue) AS period_revenue,
    COUNT(DISTINCT order_id) AS period_orders,
    AVG(order_value) AS avg_order_value
  FROM `catalog`.`schema`.`sales` AS s
  WHERE sale_date IS NOT NULL
    AND revenue IS NOT NULL
  GROUP BY DATE_TRUNC('month', sale_date), EXTRACT(DAYOFWEEK FROM sale_date), EXTRACT(WEEK FROM sale_date)
  LIMIT 10  -- ✅ LIMIT 10 in first CTE (GROUP BY provides uniqueness)
),
-- Step 2: Calculate seasonality metrics - keep numeric types as DOUBLE/INT
seasonality_metrics AS (
  SELECT 
    COALESCE(CAST(month AS STRING), 'Unknown Month') AS month_display,
    COALESCE(day_of_week, 0) AS day_of_week,
    COALESCE(ROUND(period_revenue, 2), 0.0) AS period_revenue,
    COALESCE(period_orders, 0) AS period_orders,
    COALESCE(ROUND(AVG(period_revenue) OVER (), 2), 0.0) AS avg_monthly_revenue,
    COALESCE(ROUND(STDDEV_POP(period_revenue) OVER (), 2), 0.0) AS revenue_volatility,
    COALESCE(ROUND((period_revenue - AVG(period_revenue) OVER ()) * 100.0 / NULLIF(AVG(period_revenue) OVER (), 0), 1), 0.0) AS variance_from_avg_pct,
    COALESCE(ROUND(PERCENT_RANK() OVER (ORDER BY period_revenue), 3), 0.0) AS revenue_percentile_rank
  FROM sales_temporal_analysis
),
-- Step 3: Generate ai_sys_prompt for seasonality analysis
seasonal_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are an Operations Planning Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 17 years of experience in demand forecasting and seasonal inventory management, ',
           'your expertise in seasonality analysis, capacity planning, and resource optimization aligns with the strategic initiative: Operational efficiency. ',
           'Analyze sales period: Month ', month_display, ', Day of Week ', day_of_week, '. ',
           'Period revenue: $', period_revenue, '. ',
           'Period orders: ', period_orders, '. ',
           'Average monthly revenue (baseline): $', avg_monthly_revenue, '. ',
           'Revenue volatility (standard deviation): $', revenue_volatility, '. ',
           'Variance from average: ', variance_from_avg_pct, '%. ',
           'Revenue percentile rank: ', revenue_percentile_rank, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_seasonal_pattern": "value", "ai_cat_demand_intensity": "value", "ai_cat_planning_action": "value", "ai_cat_resource_adjustment": "value", "ai_txt_inventory_strategy": "text", "ai_txt_staffing_recommendation": "text", "ai_txt_marketing_timing": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix, exact values): ai_cat_seasonal_pattern (Peak Season|High Season|Normal Season|Low Season|Off Season), ai_cat_demand_intensity (Extreme High|High|Moderate|Low|Very Low), ai_cat_planning_action (Aggressive Scale Up|Moderate Increase|Maintain Current|Scale Down|Minimal Operations), ai_cat_resource_adjustment (Increase 50%+|Increase 20-50%|Hold Steady|Reduce 20%|Reduce 50%+). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ai_txt_inventory_strategy, ai_txt_staffing_recommendation, ai_txt_marketing_timing. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"historical_weather\", \"event_calendars\", \"economic_indicators\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM seasonality_metrics
),
-- Step 4: AI analysis - pass ai_sys_prompt to ai_query
seasonal_planning_strategy AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS seasonal_strategy_json
  FROM seasonal_prompt_generation
),
-- Final output: Seasonality analysis with operational planning recommendations using ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    month_display AS month,
    day_of_week,
    period_revenue,
    variance_from_avg_pct,
    get_json_object(seasonal_strategy_json, '$.ai_cat_seasonal_pattern') AS ai_cat_seasonal_pattern,
    get_json_object(seasonal_strategy_json, '$.ai_cat_demand_intensity') AS ai_cat_demand_intensity,
    get_json_object(seasonal_strategy_json, '$.ai_cat_planning_action') AS ai_cat_planning_action,
    get_json_object(seasonal_strategy_json, '$.ai_cat_resource_adjustment') AS ai_cat_resource_adjustment,
    get_json_object(seasonal_strategy_json, '$.ai_txt_inventory_strategy') AS ai_txt_inventory_strategy,
    get_json_object(seasonal_strategy_json, '$.ai_txt_staffing_recommendation') AS ai_txt_staffing_recommendation,
    get_json_object(seasonal_strategy_json, '$.ai_txt_marketing_timing') AS ai_txt_marketing_timing,
    -- MANDATORY LAST 8 COLUMNS (the 7 JSON fields from ai_txt_business_outcome onward, then ai_sys_prompt):
    COALESCE(get_json_object(seasonal_strategy_json, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(seasonal_strategy_json, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(seasonal_strategy_json, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(seasonal_strategy_json, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(seasonal_strategy_json, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(seasonal_strategy_json, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(seasonal_strategy_json, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM seasonal_planning_strategy
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_seasonal_pattern IN ('Peak Season', 'High Season', 'Normal Season', 'Low Season', 'Off Season')
-- AND ai_cat_demand_intensity IN ('Extreme High', 'High', 'Moderate', 'Low', 'Very Low')
-- AND ai_cat_planning_action IN ('Aggressive Scale Up', 'Moderate Increase', 'Maintain Current', 'Scale Down', 'Minimal Operations')
-- AND ai_cat_resource_adjustment IN ('Increase 50%+', 'Increase 20-50%', 'Hold Steady', 'Reduce 20%', 'Reduce 50%+')
ORDER BY month
;

--END OF GENERATED SQL
```

**🚨🚨🚨 CRITICAL USAGE REQUIREMENTS FOR ADVANCED ANALYSIS PATTERNS 🚨🚨🚨**

1. **CUSTOMIZE TO BUSINESS CONTEXT**: The 8 examples above are TEMPLATES ONLY. You MUST adapt them to the specific:
   - Business domain (retail, healthcare, finance, manufacturing, etc.)
   - Available data schema (actual table/column names)
   - Business questions being asked
   - Strategic objectives

2. **GENERATE NOVEL USE CASES**: Ask the LLM to generate NEW use case ideas that utilize these patterns. Examples:
   - "Generate a use case that combines Cohort Analysis + Pareto Analysis to identify high-value customer segments by acquisition channel"
   - "Generate a use case that uses Funnel Analysis + Sessionization to optimize mobile app conversion rates"
   - "Generate a use case that applies Price Elasticity + Seasonality Analysis for dynamic pricing optimization"

3. **COMBINE MULTIPLE TECHNIQUES**: Many high-value use cases require combining 2-3 analytical techniques:
   - Cohort + Pareto: "Which customer cohorts drive 80% of LTV growth?"
   - Funnel + Sessionization: "How does session engagement impact conversion rates?"
   - Gap Analysis + Seasonality: "Which seasonal gaps represent the biggest missed revenue opportunities?"

4. **ALWAYS INCLUDE AI INTERPRETATION**: EVERY statistical analysis MUST be followed by ai_query to:
   - Interpret the statistical findings in business language
   - Generate strategic recommendations and action plans
   - Identify risks, opportunities, and priorities
   - Provide executive-level insights

5. **PROPER USE CASE NAMING**: Use clear, specific names that describe BOTH the analysis AND the outcome:
   - ✅ GOOD: "Customer Acquisition Quality Analysis with Retention Strategy"
   - ✅ GOOD: "Revenue Concentration Analysis with Portfolio Optimization Strategy"
   - ✅ GOOD: "Checkout Conversion Funnel Analysis with Friction Point Recommendations"
   - ❌ BAD: "Cohort Analysis" (too generic, no business outcome)
   - ❌ BAD: "Pareto Analysis of Sales" (missing strategic outcome)

6. **MANDATORY CATEGORICAL + NARRATIVE STRUCTURE**: Every ai_query output MUST include:
   - 3-5 categorical fields for filtering/dashboards (with max 20 distinct values each)
   - 2-4 narrative fields for detailed explanations
   - Persona instruction (role + years of experience + expertise areas)
   - Strict JSON output formatting

7. **NULL SAFETY REQUIRED**: ALL statistical metrics and data fields used in ai_query prompts MUST be:
   - ROUND'd (for numeric precision)
   - COALESCE'd with type-appropriate defaults (DOUBLE: 0.0, INT: 0, STRING: 'Unknown')
   - Kept as numeric types (DOUBLE/INT); CONCAT converts them to strings automatically
   - Transformed in the FIRST CTE (not inline in CONCAT)
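
A minimal sketch of requirement 7, assuming a hypothetical `customer_spend` table with `customer_id`, `customer_tier`, `total_spend`, and `order_count` columns (these names are placeholders, not taken from any real schema). Every field that reaches the prompt is rounded and COALESCE'd in the first CTE, so a single NULL cannot turn the whole CONCAT result into NULL:

```sql
-- Sketch only: `customer_spend` and its columns are hypothetical placeholders
WITH base_data AS (
  SELECT DISTINCT
    customer_id,                                           -- filtered with IS NOT NULL below
    COALESCE(customer_tier, 'Unknown') AS customer_tier,   -- STRING default
    COALESCE(ROUND(total_spend, 2), 0.0) AS total_spend,   -- DOUBLE default, rounded once here
    COALESCE(order_count, 0) AS order_count                -- INT default
  FROM `catalog`.`schema`.`customer_spend` AS c
  WHERE customer_id IS NOT NULL
  LIMIT 10  -- LIMIT 10 in first CTE only
),
prompt_generation AS (
  SELECT
    *,
    -- No ROUND/COALESCE here: the values are already NULL-safe, and CONCAT converts the numerics
    CONCAT('Customer ', customer_id, ' (tier ', customer_tier, ') spent $', total_spend,
           ' across ', order_count, ' orders.') AS ai_sys_prompt
  FROM base_data
)
SELECT * FROM prompt_generation;
```

Doing the cleanup once in the first CTE keeps the CONCAT readable and ensures that every later CTE sees the same NULL-safe values.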

**🎯 WHEN TO USE THESE PATTERNS:**

Use these advanced patterns when the use case description contains keywords like:
- "cohort", "acquisition", "retention", "lifetime value" → Cohort Analysis
- "concentration", "80/20", "top performers", "vital few" → Pareto Analysis
- "conversion", "drop-off", "funnel", "journey" → Funnel Analysis
- "coverage", "gaps", "missing", "opportunities" → Gap Analysis
- "pricing", "elasticity", "demand response" → Price Elasticity Analysis
- "session", "engagement", "time spent" → Sessionization
- "basket", "bundling", "cross-sell", "affinity" → Basket Affinity
- "seasonal", "cyclical", "temporal patterns" → Seasonality Analysis

**🔥 REMEMBER: These techniques answer "WHY?" and "WHO?" - not just "WHAT?" - delivering 10X higher business value than simple aggregations. 🔥**

---

**Creative Prompt Examples with STRICT JSON OUTPUT:**
```sql
-- Personalized recommendations (using mandatory default model)
-- NOTE: customer_name and purchase_history must be COALESCEd in the source CTE
ai_query('{sql_model_serving}', 
  CONCAT('Analyze customer ', customer_name, '''s purchase history: ', 
         purchase_history, '. ',
         'Output ONLY a JSON object with NO markdown, NO extra text. ',
         'Format: {{"recommendations": ["product1", "product2", "product3"], "reasoning": "text"}}. ',
         'Output ONLY the JSON, nothing else.'))

-- Risk assessment (using general-purpose LLM) - CONCAT auto-converts DOUBLE
-- NOTE: amount, location, behavior_score must be COALESCEd in the source CTE
ai_query('{sql_model_serving}',
  CONCAT('Assess transaction fraud risk: Amount=$', amount,  -- CONCAT auto-converts DOUBLE
         ', Location=', location, ', User behavior=', behavior_score, '. ',
         'Output ONLY a JSON object with NO markdown, NO extra text. ',
         'Format: {{"risk_level": "High/Medium/Low", "score": 0-100, "factors": "text"}}. ',
         'Output ONLY the JSON, nothing else.'))

-- Dynamic report generation with structured output (using mandatory default model)
-- NOTE: department, revenue, growth_pct must be COALESCEd in the source CTE
ai_query('{sql_model_serving}',
  CONCAT('Create executive summary for ', department, 
         ': Revenue=', revenue,  -- CONCAT auto-converts DOUBLE
         ', Growth=', growth_pct, '%. ',
         'Output ONLY a JSON object with NO markdown, NO extra text. ',
         'Format: {{"summary": "text", "key_metrics": "text", "recommendations": "text"}}. ',
         'Output ONLY the JSON, nothing else.'))
```

**🚨 REMEMBER: Every ai_query for structured data MUST include:**
- "Output ONLY a JSON object"
- "with NO markdown" or "NO markdown fences"
- "NO extra text"
- An example of the expected format (`Format: ...`)
- "Output ONLY the JSON, nothing else"

#### 6. **COMPLETE AI FUNCTION REFERENCE**

All Databricks AI functions with correct syntax:
{ai_functions_summary}

#### 7. **QUERY STRUCTURE - ABSOLUTE REQUIREMENTS**

**A. LIMIT 10 AND DISTINCT REQUIREMENTS (ABSOLUTELY CRITICAL):**
- **FIRST CTE MUST USE SELECT DISTINCT**: Always use `SELECT DISTINCT` to eliminate duplicate records (a first CTE that aggregates with `GROUP BY` is already unique per group)
- **FIRST CTE ONLY**: Use `LIMIT 10` at the END of the FIRST CTE that reads from tables
- **NO LIMIT IN OTHER CTEs**: DO NOT use `LIMIT 10` in any other CTE or in the final SELECT that follows the CTEs
- **LIMIT PLACEMENT**: `LIMIT 10` MUST be the LAST clause in the SELECT (after WHERE, GROUP BY, ORDER BY, etc.)

**EXCEPTION FOR ai_forecast**: The input CTE for ai_forecast must NOT use LIMIT. Instead, filter by date in the WHERE clause so that the forecast sees enough history (10:1 history-to-horizon ratio for mid-frequency data; the adaptive ratios for other granularities are listed in section B).

```sql
-- Standard multi-CTE pattern ✅ (DISTINCT + LIMIT 10 in first CTE only)
WITH stage1 AS (
  SELECT DISTINCT *  -- ✅ ALWAYS use DISTINCT to eliminate duplicates
  FROM `catalog`.`schema`.`table` AS t
  WHERE primary_key_col IS NOT NULL  -- Use actual column name from schema, not generic 'id'
  LIMIT 10  -- ✅ LIMIT 10 at the END of first CTE only
),
stage2 AS (
  SELECT *, ai_function(...) FROM stage1  -- ✅ NO LIMIT in other CTEs
),
stage3 AS (
  SELECT * FROM stage2  -- ✅ NO LIMIT in other CTEs
)
SELECT * FROM stage3;  -- ✅ NO LIMIT in final SELECT

-- ai_forecast exception ✅ - Use WHERE clause with adaptive ratio by granularity
-- Example 1: Monthly forecast (mid-frequency, use 10:1 ratio)
WITH past AS (
  SELECT 
    DATE_TRUNC('month', date_col) AS ds,
    SUM(value_col) AS value
  FROM `catalog`.`schema`.`table` AS t
  WHERE date_col >= date_add(MONTH, -30, CURRENT_DATE())  -- 30 months history for 3-month forecast (10:1 ratio)
    AND date_col IS NOT NULL
  GROUP BY DATE_TRUNC('month', date_col)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past), 
  time_col => 'ds',
  value_col => 'value',
  horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM past)  -- 3 months ahead
);  -- ✅ NO LIMIT

-- Example 2: Hourly forecast (high-frequency, use 4 weeks for 24-hour forecast)
WITH past AS (
  SELECT 
    DATE_TRUNC('hour', timestamp_col) AS ds,
    AVG(value_col) AS value
  FROM `catalog`.`schema`.`table` AS t
  WHERE timestamp_col >= date_add(WEEK, -4, CURRENT_TIMESTAMP())  -- 4 weeks for hourly data
    AND timestamp_col IS NOT NULL
  GROUP BY DATE_TRUNC('hour', timestamp_col)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past), 
  time_col => 'ds',
  value_col => 'value',
  horizon => (SELECT date_add(HOUR, 24, MAX(ds)) FROM past)  -- 24 hours ahead
);  -- ✅ NO LIMIT

-- Example 3: Yearly forecast (low-frequency, use 12 years for 3-year forecast)
WITH past AS (
  SELECT 
    DATE_TRUNC('year', date_col) AS ds,
    SUM(value_col) AS value
  FROM `catalog`.`schema`.`table` AS t
  WHERE date_col >= date_add(YEAR, -12, CURRENT_DATE())  -- 12 years for 3-year forecast (4:1 ratio)
    AND date_col IS NOT NULL
  GROUP BY DATE_TRUNC('year', date_col)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past), 
  time_col => 'ds',
  value_col => 'value',
  horizon => (SELECT date_add(YEAR, 3, MAX(ds)) FROM past)  -- 3 years ahead
);  -- ✅ NO LIMIT
```

**B. WHERE CLAUSE RESTRICTIONS (ABSOLUTELY CRITICAL):**

**🚨🚨🚨 ZERO TOLERANCE FOR VALUE FILTERING 🚨🚨🚨**

**YOU MUST NOT USE WHERE clauses to filter on specific values** because you don't know what values exist in the data:

**ALLOWED:**
- ✅ `WHERE column IS NULL`
- ✅ `WHERE column IS NOT NULL`
- ✅ `WHERE date_column IS NOT NULL` (ensure data exists)
- ✅ **EXCEPTION FOR ai_forecast ONLY**: Dynamic date filtering for history with adaptive ratios:
  - High-frequency (minute/hour): `WHERE ts >= date_add(DAY/WEEK, -N, CURRENT_TIMESTAMP())`
  - Mid-frequency (day/week/month): `WHERE date >= date_add(UNIT, -N*10, CURRENT_DATE())`
  - Low-frequency (quarter/year): `WHERE date >= date_add(UNIT, -N*ratio, CURRENT_DATE())` (see the reduced ratios listed below)

**ABSOLUTELY FORBIDDEN (Value Comparisons):**
- ❌ `WHERE status = 'active'` (you don't know if 'active' exists!)
- ❌ `WHERE category = 'electronics'` (you don't know if 'electronics' exists!)
- ❌ `WHERE date > '2023-01-01'` (arbitrary date filtering)
- ❌ `WHERE amount > 100` (arbitrary number filtering)
- ❌ `WHERE id = 5` (specific ID filtering)
- ❌ `WHERE name LIKE '%pattern%'` (pattern matching on unknown values)
- ❌ `WHERE region IN ('US', 'EU')` (you don't know valid regions!)
- ❌ `WHERE price BETWEEN 10 AND 100` (arbitrary range)
- ❌ ANY comparison with specific string, number, or date values

**WHY THIS IS CRITICAL:**
- You have ZERO knowledge of actual data values in tables
- Filtering on non-existent values returns empty results
- Use LIMIT instead of WHERE for controlling result size
- Let users apply their own filters after seeing the data
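
To make the contrast concrete, here is a hedged sketch against a hypothetical `orders` table (the table and column names are placeholders). Only NULL checks are active; the value filters are shown commented out purely to mark what generated SQL must not contain:

```sql
-- Sketch only: `orders` and its columns are hypothetical placeholders
SELECT DISTINCT
  order_id,
  order_status,
  order_total
FROM `catalog`.`schema`.`orders` AS o
WHERE order_id IS NOT NULL        -- allowed: NULL check
  AND order_status IS NOT NULL    -- allowed: NULL check
-- AND order_status = 'shipped'   -- forbidden: 'shipped' may not exist in the data
-- AND order_total > 100          -- forbidden: arbitrary numeric threshold
LIMIT 10;                         -- control result size with LIMIT, not value filters
```

Because nothing is filtered by value, the query returns rows whatever the actual statuses and totals are, and users can add their own filters once they have seen the data.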

**EXCEPTION FOR ai_forecast:**
- For ai_forecast input CTEs ONLY, you MUST use dynamic date filtering with ADAPTIVE ratios based on time granularity
- Ratios vary by frequency: high-frequency (minute/hour) needs fixed calendar periods, mid-frequency uses 10:1, low-frequency uses reduced ratios

**HIGH-FREQUENCY (minute/hour) - Use Fixed Calendar Periods:**
- MINUTE-level: For 1-hour forecast, use `WHERE ts >= date_add(DAY, -7, CURRENT_TIMESTAMP())` (7 days)
- HOURLY-level: For 24-hour forecast, use `WHERE ts >= date_add(WEEK, -4, CURRENT_TIMESTAMP())` (4 weeks)

**MID-FREQUENCY (day/week/month) - Use 10:1 Ratio:**
- DAILY: For 7-day forecast, use `WHERE date >= date_add(DAY, -70, CURRENT_DATE())` (70 days = 7 * 10)
- WEEKLY: For 12-week forecast, use `WHERE date >= date_add(WEEK, -120, CURRENT_DATE())` (120 weeks = 12 * 10)
- MONTHLY: For 12-month forecast, use `WHERE date >= date_add(MONTH, -120, CURRENT_DATE())` (120 months = 12 * 10), or a minimum of 36 months when 120 months of history is not available

**LOW-FREQUENCY (quarter/year) - Use Reduced Ratios:**
- QUARTERLY: For 4-quarter forecast, use `WHERE date >= date_add(QUARTER, -32, CURRENT_DATE())` (32 quarters = 4 * 8, ~8 years)
- YEARLY: For 3-year forecast, use `WHERE date >= date_add(YEAR, -12, CURRENT_DATE())` (12 years = 3 * 4) OR minimum 6 years for 2 cycles

**Rationale**: You don't have knowledge of actual data values. Only NULL checks (IS NULL / IS NOT NULL) and dynamic date filtering for ai_forecast are allowed.
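
The quarterly rule has no worked example in this section, so the following sketch applies it in the same shape as the monthly, hourly, and yearly examples in section A. Table and column names are placeholders, and the `horizon` expression simply copies the form used in those examples; treat it as an assumption to confirm against your runtime:

```sql
-- Sketch only: table and column names are placeholders
-- Quarterly forecast (low-frequency, 32 quarters of history for a 4-quarter forecast, 8:1 ratio)
WITH past AS (
  SELECT
    DATE_TRUNC('quarter', date_col) AS ds,
    SUM(value_col) AS value
  FROM `catalog`.`schema`.`table` AS t
  WHERE date_col >= date_add(QUARTER, -32, CURRENT_DATE())  -- 32 quarters = 4 * 8
    AND date_col IS NOT NULL
  GROUP BY DATE_TRUNC('quarter', date_col)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past),
  time_col => 'ds',
  value_col => 'value',
  horizon => (SELECT date_add(QUARTER, 4, MAX(ds)) FROM past)  -- 4 quarters ahead
);  -- NO LIMIT
```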

**C. TABLE QUALIFICATION:**
- Always use fully qualified names: `catalog.schema.table`
- Extract these from the "Tables Involved" field
- Use backticks if needed: `` `catalog`.`schema`.`table` ``

**D. QUERY HEADER:**
- Start with: `-- Use Case {{use_case_id}}: {{use_case_name}}`
- Add comment describing the approach
- Example:
  ```sql
  -- Use Case AI-F01-U01: Classify Customer Feedback  
  -- Uses ai_classify to categorize feedback into business-relevant topics
  ```

**E. USE CTEs FOR MULTI-STAGE PROCESSING:**

**WHEN TO USE CTEs:**
- AI function pipelines with multiple stages (forecast → classify → generate recommendations)
- Complex queries that need JOIN followed by transformations
- Queries with multiple AI functions that build on each other
- When you need intermediate results for clarity and debugging

**WHEN TO AVOID CTEs:**
- Simple single-table queries (just use SELECT ... FROM ... WHERE ... LIMIT)
- Queries with only one transformation step
- When a direct SELECT would be clearer and simpler
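
As an illustration of the no-CTE case, here is a minimal sketch with one table and one transformation. The `support_tickets` table, its columns, and the topic labels are hypothetical placeholders:

```sql
-- Use Case (hypothetical): Route Support Tickets by Topic
-- Single table, single ai_classify step: a direct SELECT is clearer than a CTE chain
SELECT DISTINCT
  ticket_id,
  ticket_text,
  ai_classify(ticket_text, ARRAY('Billing', 'Technical Issue', 'Account Access', 'Other')) AS ai_cat_ticket_topic
FROM `catalog`.`schema`.`support_tickets` AS t
WHERE ticket_id IS NOT NULL
  AND ticket_text IS NOT NULL
LIMIT 10;
```

Once a second AI function or a JOIN is added, the multi-CTE structure described next becomes the better fit.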

**🔥 CRITICAL: JOIN ALL TABLES UPFRONT IN FIRST CTE 🔥**

**OPTIMAL CTE STRUCTURE (for AI pipelines):**
1. **Step 1 (base data)**: SELECT all needed columns and JOIN all required tables together
2. **Step 2 (ai function)**: Apply AI function (ai_classify, ai_forecast, etc.) on the combined data
3. **Step 3 (enrichment)**: Use ai_gen/ai_query to add structured recommendations
4. **Step 4 (extraction)**: Extract JSON fields using get_json_object()

**CORRECT Pattern - All JOINs in First CTE:**
```sql
-- CORRECT ✅ - All JOINs done upfront
-- Step 1: Get all required data with JOINs
WITH base_data AS (
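  SELECT DISTINCT  -- ✅ DISTINCT in first CTE
    c.customer_id,
    COALESCE(c.customer_name, 'Unknown') AS customer_name,        -- ✅ COALESCE'd: used in the prompt
    COALESCE(c.customer_segment, 'Unknown') AS customer_segment,  -- ✅ COALESCE'd: used in the prompt
    o.total_orders,
    o.total_revenue,
    p.preferred_products,
    s.support_tickets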
  FROM `catalog`.`schema`.`customers` AS c
  LEFT JOIN `catalog`.`schema`.`orders_summary` AS o ON c.customer_id = o.customer_id
  LEFT JOIN `catalog`.`schema`.`preferences` AS p ON c.customer_id = p.customer_id
  LEFT JOIN `catalog`.`schema`.`support_stats` AS s ON c.customer_id = s.customer_id
  WHERE c.customer_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 in first CTE only
),
-- Step 2: Apply AI classification with ai_cat_ prefix
classified AS (
  SELECT 
    *,
    ai_classify(customer_segment, ARRAY('VIP', 'High Value', 'Medium', 'Low')) AS ai_cat_value_tier
  FROM base_data
),
-- Step 3: Generate ai_sys_prompt for recommendations
prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Customer Success Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 15 years of experience in customer retention and value optimization, ',
           'your expertise aligns with the strategic initiative: Customer success. ',
           'Analyze customer ', customer_name, ' (ID: ', customer_id, ') in segment ', customer_segment, '. ',
           'Output ONLY JSON, NO markdown, NO extra text. Format: {{"ai_txt_retention_strategy": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data.') AS ai_sys_prompt
  FROM classified
),
-- Step 4: Generate recommendations with ai_txt_, ai_sys_ columns
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS insights
  FROM prompt_generation
)
-- Step 5: Extract JSON fields with ai_cat_, ai_txt_, ai_sys_ prefixes
-- ai_sys_prompt MUST be the LAST column
SELECT 
  customer_id,
  customer_name,
  ai_cat_value_tier,
  get_json_object(insights, '$.ai_txt_retention_strategy') AS ai_txt_retention_strategy,
  -- MANDATORY LAST 8 COLUMNS (the 7 JSON fields from ai_txt_business_outcome onward, then ai_sys_prompt):
  COALESCE(get_json_object(insights, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(insights, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(insights, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(insights, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(insights, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
FROM enriched
;

--END OF GENERATED SQL
```

**WRONG Pattern - Multiple Join CTEs:**
```sql
-- WRONG ❌ - Joining tables in separate CTEs (inefficient!)
-- Step 1: Get base data
WITH base_data AS (
  SELECT * FROM `catalog`.`schema`.`customers` AS c LIMIT 10
),
-- Step 2: Join with orders (BAD - should have been in Step 1!)
with_orders AS (
  SELECT b.*, o.total_orders
  FROM base_data AS b
  LEFT JOIN `catalog`.`schema`.`orders_summary` AS o ON b.customer_id = o.customer_id
),
-- Step 3: Join with preferences (BAD - should have been in Step 1!)
with_preferences AS (
  SELECT w.*, p.preferred_products
  FROM with_orders AS w
  LEFT JOIN `catalog`.`schema`.`preferences` AS p ON w.customer_id = p.customer_id
)
SELECT * FROM with_preferences;  -- ✅ NO LIMIT
```

**MANDATORY CTE STRUCTURE RULES:**
1. **First CTE = All Data Acquisition**: Do ALL JOINs here to get complete dataset
2. **Subsequent CTEs = Transformations**: Apply AI functions, enrichments, and extractions
3. **Minimize Unnecessary CTEs**: Use CTEs for AI pipelines and complex logic, but avoid CTEs for simple single-table queries
4. **One JOIN Phase**: All JOINs should happen in the first CTE, not scattered across multiple CTEs
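
**CORRECT Pattern (illustrative sketch) - Single Join CTE:**
The WRONG pattern above could be restructured roughly as follows. This is a sketch under assumptions: `orders_summary` and `preferences` are assumed to hold one row per customer, `total_orders` is assumed numeric, `preferred_products` is assumed to be a string, and the classification labels are hypothetical.
```sql
-- CORRECT ✅ (sketch) - All JOINs in the first CTE, transformations afterwards
-- Assumption: orders_summary and preferences each hold one row per customer_id
WITH base_data AS (
  SELECT
    c.customer_id,                                                          -- CRITICAL: filtered with IS NOT NULL
    COALESCE(o.total_orders, 0) AS total_orders,                            -- ✅ COALESCE'd (LEFT JOIN can produce NULL)
    COALESCE(TRIM(p.preferred_products), 'Unknown') AS preferred_products   -- ✅ COALESCE'd (LEFT JOIN can produce NULL)
  FROM `catalog`.`schema`.`customers` AS c
  LEFT JOIN `catalog`.`schema`.`orders_summary` AS o ON c.customer_id = o.customer_id
  LEFT JOIN `catalog`.`schema`.`preferences` AS p ON c.customer_id = p.customer_id
  WHERE c.customer_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Transformation CTE: AI functions only, no further JOINs
enriched AS (
  SELECT
    *,
    ai_classify(
      CONCAT('Customer with ', total_orders, ' orders who prefers ', preferred_products),
      ARRAY('Frequent Buyer', 'Occasional Buyer', 'New Customer')  -- hypothetical labels
    ) AS ai_cat_customer_type
  FROM base_data
)
SELECT * FROM enriched;  -- ✅ NO LIMIT
```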

Prefer CTEs over nested subqueries for readability:
```sql
-- GOOD ✅
WITH base AS (...),
     enriched AS (...),
     final AS (...)
SELECT * FROM final;  -- ✅ NO LIMIT

-- AVOID ❌
SELECT * FROM (SELECT * FROM (SELECT * FROM table))
```

---

### 7. COMPLETE WORKING EXAMPLES

**Example 1: ai_classify + ai_query with Structured JSON Output (MANDATORY PATTERN)**
```sql
-- CREATE VIEW inspire_ai.default.customer_feedback_actionable_recommendations AS
-- Use Case: Classify Customer Feedback with Structured Actionable Recommendations

-- Step 1: Classify feedback into business-relevant categories
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH classified AS (
  SELECT DISTINCT
    feedback_id,                                              -- CRITICAL: filtered with IS NOT NULL
    customer_id,                                              -- CRITICAL: filtered with IS NOT NULL
    feedback_text,                                            -- CRITICAL: filtered with IS NOT NULL
    COALESCE(customer_lifetime_value, 0.0) AS customer_lifetime_value,  -- ✅ COALESCE'd
    COALESCE(purchase_count, 0) AS purchase_count,            -- ✅ COALESCE'd
    ai_classify(feedback_text, 
      ARRAY('Product Quality', 'Customer Service', 'Pricing', 'Delivery', 'Other')
    ) AS ai_cat_feedback_category
  FROM `main`.`customer_service`.`feedback` AS f
  WHERE feedback_id IS NOT NULL
    AND customer_id IS NOT NULL    -- ✅ Critical identifier filtered
    AND feedback_text IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Generate structured JSON output with ai_cat_ + ai_txt_ + ai_sys_ columns
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}',
      CONCAT('Analyze this ', ai_cat_feedback_category, ' feedback: "', feedback_text, 
             '" from a customer with $', customer_lifetime_value,  -- CONCAT auto-converts
             ' lifetime value and ', purchase_count, ' purchases. ',
             'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
             'Format: {{"ai_cat_urgency_level": "value", "ai_cat_response_priority": "value", "ai_txt_category_justification": "text", "ai_txt_resolution_plan": "text", "ai_txt_prevention_strategy": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
             'Required keys: ',
             'ai_cat_urgency_level (MUST be exactly one of: Critical/High/Medium/Low), ',
             'ai_cat_response_priority (MUST be exactly one of: Immediate Action/High Priority/Medium Priority/Low Priority), ',
             'ai_txt_category_justification (free text: why this feedback belongs to ', ai_cat_feedback_category, ', specific keywords/phrases, 1-2 sentences), ',
             'ai_txt_resolution_plan (free text: specific 3-step resolution plan with actionable steps), ',
             'ai_txt_prevention_strategy (free text: long-term strategy to prevent similar issues). ',
             'MANDATORY LAST 7 FIELDS (in this exact order): ',
             '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Example: "Resolving issue for Customer [ID] prevents $X revenue loss. Escalation cost avoided: $Y. Breakdown: Daily impact: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
             '2) ai_txt_executive_summary - compelling business story in 2-3 sentences that REFERENCES the business outcome numbers, ',
             '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
             '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
             '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
             '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
             '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing data]. {{\"missing_data\": [\"specific_dataset1\", \"specific_dataset2\"]}}" - always end with JSON listing needed datasets. ',
             'Output ONLY the JSON object, nothing else.')
    ) AS insights
  FROM classified
),
-- Final output: Mix of ai_cat_ (filterable), ai_txt_ (narrative), and ai_sys_ (system) columns
final_output AS (
  SELECT 
    feedback_id,
    customer_id,
    feedback_text,
    ai_cat_feedback_category,
    get_json_object(insights, '$.ai_cat_urgency_level') AS ai_cat_urgency_level,  -- CATEGORICAL: Critical/High/Medium/Low
    get_json_object(insights, '$.ai_cat_response_priority') AS ai_cat_response_priority,  -- CATEGORICAL: Immediate Action/High Priority/Medium Priority/Low Priority
    get_json_object(insights, '$.ai_txt_category_justification') AS ai_txt_category_justification,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_resolution_plan') AS ai_txt_resolution_plan,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_prevention_strategy') AS ai_txt_prevention_strategy,  -- NARRATIVE: Free text
    -- MANDATORY LAST 7 COLUMNS (in this exact order):
    COALESCE(get_json_object(insights, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,  -- BUSINESS OUTCOME: Calculated Impact
    COALESCE(get_json_object(insights, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,  -- NARRATIVE: Executive Summary (references business outcome)
    COALESCE(get_json_object(insights, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,  -- SYSTEM: Importance Level
    COALESCE(get_json_object(insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,  -- SYSTEM: Urgency Level
    COALESCE(TRY_CAST(get_json_object(insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,  -- SYSTEM: Confidence Score
    COALESCE(get_json_object(insights, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,  -- SYSTEM: Feedback
    COALESCE(get_json_object(insights, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data  -- SYSTEM: Missing Data
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_feedback_category IN ('Product Quality', 'Customer Service', 'Pricing', 'Delivery', 'Other')
-- AND ai_cat_urgency_level IN ('Critical', 'High', 'Medium', 'Low')
-- AND ai_cat_response_priority IN ('Immediate Action', 'High Priority', 'Medium Priority', 'Low Priority')
-- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL
```
**NOTE**: Use `get_json_object(json_column, '$.field_name')` to extract fields from ai_gen/ai_query JSON output. Use `ai_cat_` prefix for categorical columns, `ai_txt_` prefix for narrative columns, and `ai_sys_` prefix for system columns. The LAST 7 columns must ALWAYS be: ai_txt_business_outcome (calculated measurable impact with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data.
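
The fallback behavior of the system columns can be checked on a literal JSON string, without calling a model. A minimal sketch, assuming a hypothetical malformed response in which `ai_sys_confidence` is not numeric and `ai_sys_urgency` is missing (braces are doubled to match the escaping used in the other examples):
```sql
-- Sketch: extraction and fallbacks on a hypothetical model response (no AI call involved)
WITH sample AS (
  SELECT '{{"ai_cat_urgency_level": "High", "ai_sys_confidence": "not sure"}}' AS insights
)
SELECT
  get_json_object(insights, '$.ai_cat_urgency_level') AS ai_cat_urgency_level,  -- key present: value is extracted as STRING
  COALESCE(get_json_object(insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,  -- key missing: get_json_object returns NULL, COALESCE supplies the default
  COALESCE(TRY_CAST(get_json_object(insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence  -- not numeric: TRY_CAST returns NULL, COALESCE supplies 0.0
FROM sample;
```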

**Example 2: ai_classify + ai_query (Advanced Pattern - Customer Segmentation)**
```sql
-- CREATE VIEW inspire_ai.default.customer_segmentation_retention_strategy AS
-- Use Case: Customer Segmentation with Personalized Retention Strategy

-- Step 1: Gather customer metrics (first CTE), then classify customers into value-based segments (segmented CTE)
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH customer_metrics AS (
  SELECT DISTINCT
    customer_id,                                              -- CRITICAL: filtered with IS NOT NULL
    customer_name,                                            -- CRITICAL: filtered with IS NOT NULL
    COALESCE(total_revenue, 0.0) AS total_revenue,            -- ✅ COALESCE'd
    COALESCE(purchase_frequency, 0) AS purchase_frequency,    -- ✅ COALESCE'd
    COALESCE(last_purchase_days_ago, 9999) AS last_purchase_days_ago,  -- ✅ COALESCE'd (high default = churned)
    COALESCE(TRIM(product_category_preference), 'Unknown') AS product_category_preference,  -- ✅ COALESCE'd
    COALESCE(support_tickets_count, 0) AS support_tickets_count  -- ✅ COALESCE'd
  FROM `main`.`crm`.`customer_analytics` AS c
  WHERE customer_id IS NOT NULL
    AND customer_name IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
segmented AS (
  SELECT 
    *,
    ai_classify(
      'Customer with $' || CAST(total_revenue AS STRING) || ' revenue, ' || 
      CAST(purchase_frequency AS STRING) || ' purchases/year, last purchase ' || 
      CAST(last_purchase_days_ago AS STRING) || ' days ago',
      ARRAY('High Value VIP', 'High Value At-Risk', 'Medium Value Active', 
            'Medium Value Declining', 'Low Value', 'Churned')
    ) AS ai_cat_value_tier
  FROM customer_metrics
),
-- Step 2: Generate structured retention strategies with ai_cat_ + ai_txt_ + ai_sys_ columns
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}',
      CONCAT('Customer Value Tier: ', ai_cat_value_tier, 
             '. Revenue: $', total_revenue,  -- CONCAT auto-converts
             ', Frequency: ', purchase_frequency, '/year, ',
             'Last Purchase: ', last_purchase_days_ago, ' days ago, ',
             'Preferred Category: ', product_category_preference, ', ',
             'Support Tickets: ', support_tickets_count, '. ',
             'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
             'Format: {{"ai_cat_churn_risk_level": "value", "ai_cat_retention_priority": "value", "ai_cat_engagement_readiness": "value", "ai_txt_segmentation_rationale": "text", "ai_txt_retention_strategy": "text", "ai_txt_outreach_plan": "text", "ai_txt_offer_recommendations": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
             'Required keys: ',
             'ai_cat_churn_risk_level (MUST be exactly one of: Critical/High/Medium/Low/Minimal), ',
             'ai_cat_retention_priority (MUST be exactly one of: Immediate Action/High Priority/Medium Priority/Low Priority/Monitor), ',
             'ai_cat_engagement_readiness (MUST be exactly one of: Highly Receptive/Moderately Receptive/Low Receptivity/Difficult), ',
             'ai_txt_segmentation_rationale (free text: why customer is in this tier, key metrics driving classification, 1-2 sentences), ',
             'ai_txt_retention_strategy (free text: personalized retention strategy specific to this tier and customer behavior), ',
             'ai_txt_outreach_plan (free text: specific outreach approach with recommended timing and channel), ',
             'ai_txt_offer_recommendations (free text: specific recommended offers/incentives with expected impact). ',
             'MANDATORY LAST 7 FIELDS (in this exact order): ',
             '1) ai_txt_business_outcome - CALCULATED MEASURABLE BUSINESS IMPACT. Example: "Upsell opportunity for Account [ID] worth $X ARR. Win probability: Y%. Expected value: $Z. Breakdown: Daily revenue potential: $X | Weekly: $X | Monthly: $X | Yearly: $X. DISCLAIMER: All numbers are AI estimates based on available data and must be validated by domain experts before business decisions." MUST include breakdown and disclaimer. ',
             '2) ai_txt_executive_summary - executive brief for account manager in 2-3 sentences that REFERENCES the business outcome numbers, ',
             '3) ai_sys_importance - business importance level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
             '4) ai_sys_urgency - action urgency level (MUST be exactly one of: Very Low, Low, Medium, High, Very High, Critical), ',
             '5) ai_sys_confidence (0.0-1.0) - your confidence score, ',
             '6) ai_sys_feedback - start with "I assessed my confidence at [X]% because..." then explain reasoning, ',
             '7) ai_sys_missing_data - MUST follow format: "I can get higher confidence than [X]% if I can get access to [narrative about missing customer data]. {{\"missing_data\": [\"specific_dataset1\", \"specific_dataset2\"]}}" - always end with JSON listing needed datasets. ',
             'Output ONLY the JSON object, nothing else.')
    ) AS insights
  FROM segmented
),
-- Final output: Mix of ai_cat_ (filterable), ai_txt_ (narrative), and ai_sys_ (system) columns
final_output AS (
  SELECT 
    customer_id,
    customer_name,
    total_revenue,
    ai_cat_value_tier,  -- CATEGORICAL from ai_classify
    get_json_object(insights, '$.ai_cat_churn_risk_level') AS ai_cat_churn_risk_level,  -- CATEGORICAL: Critical/High/Medium/Low/Minimal
    get_json_object(insights, '$.ai_cat_retention_priority') AS ai_cat_retention_priority,  -- CATEGORICAL: Immediate Action/High Priority/Medium Priority/Low Priority/Monitor
    get_json_object(insights, '$.ai_cat_engagement_readiness') AS ai_cat_engagement_readiness,  -- CATEGORICAL: Highly Receptive/Moderately Receptive/Low Receptivity/Difficult
    get_json_object(insights, '$.ai_txt_segmentation_rationale') AS ai_txt_segmentation_rationale,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_retention_strategy') AS ai_txt_retention_strategy,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_outreach_plan') AS ai_txt_outreach_plan,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_offer_recommendations') AS ai_txt_offer_recommendations,  -- NARRATIVE: Free text
    -- MANDATORY LAST 7 COLUMNS (in this exact order):
    COALESCE(get_json_object(insights, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,  -- BUSINESS OUTCOME: Calculated Impact
    COALESCE(get_json_object(insights, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,  -- NARRATIVE: Executive Summary (references business outcome)
    COALESCE(get_json_object(insights, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,  -- SYSTEM: Importance Level
    COALESCE(get_json_object(insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,  -- SYSTEM: Urgency Level
    COALESCE(TRY_CAST(get_json_object(insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,  -- SYSTEM: Confidence Score
    COALESCE(get_json_object(insights, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,  -- SYSTEM: Feedback
    COALESCE(get_json_object(insights, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data  -- SYSTEM: Missing Data
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_value_tier IN ('High Value VIP', 'High Value At-Risk', 'Medium Value Active', 'Medium Value Declining', 'Low Value', 'Churned')
-- AND ai_cat_churn_risk_level IN ('Critical', 'High', 'Medium', 'Low', 'Minimal')
-- AND ai_cat_retention_priority IN ('Immediate Action', 'High Priority', 'Medium Priority', 'Low Priority', 'Monitor')
-- AND ai_cat_engagement_readiness IN ('Highly Receptive', 'Moderately Receptive', 'Low Receptivity', 'Difficult')
-- AND ai_sys_importance IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
-- AND ai_sys_urgency IN ('Very Low', 'Low', 'Medium', 'High', 'Very High', 'Critical')
;

--END OF GENERATED SQL
```
**NOTE**: Use `get_json_object(json_column, '$.field_name')` to extract fields from ai_gen/ai_query JSON output. Use `ai_cat_` prefix for categorical columns, `ai_txt_` prefix for narrative columns, and `ai_sys_` prefix for system columns. The LAST 7 columns must ALWAYS be: ai_txt_business_outcome (calculated measurable impact with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data.

**Example 3: ai_extract (Simple)**
```sql
-- CREATE VIEW inspire_ai.default.order_details_extracted AS
-- Use Case: Extract Order Details from Notes
-- Extracts structured data from unstructured order notes

SELECT 
  order_id,
  notes,
  ai_extract(notes, 
    ARRAY('delivery_date', 'special_instructions', 'discount_code', 'priority_level')
  ) AS extracted_data
FROM `sales`.`orders`.`order_notes` AS o
WHERE order_id IS NOT NULL
  AND notes IS NOT NULL
LIMIT 10;  -- ✅ LIMIT 10 for sampling
```
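
The struct returned by `ai_extract` can be flattened into NULL-safe columns. A minimal sketch on the same table, assuming each extracted field is a string and using the bracket access shown in Example 6; the default values are hypothetical:
```sql
-- Sketch: flatten ai_extract output into individual NULL-safe columns
WITH extracted AS (
  SELECT
    order_id,
    ai_extract(notes,
      ARRAY('delivery_date', 'special_instructions', 'discount_code', 'priority_level')
    ) AS extracted_data
  FROM `sales`.`orders`.`order_notes` AS o
  WHERE order_id IS NOT NULL
    AND notes IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
)
SELECT
  order_id,
  COALESCE(extracted_data['delivery_date'], 'Not specified') AS delivery_date,        -- hypothetical default
  COALESCE(extracted_data['special_instructions'], 'None') AS special_instructions,  -- hypothetical default
  COALESCE(extracted_data['discount_code'], 'None') AS discount_code,                -- hypothetical default
  COALESCE(extracted_data['priority_level'], 'Unknown') AS priority_level            -- hypothetical default
FROM extracted;  -- ✅ NO LIMIT
```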

**Example 4: ai_query for Business Analysis**
```sql
-- CREATE VIEW inspire_ai.default.customer_churn_risk_assessment AS
-- Use Case: Predict Customer Churn Risk
-- Uses general-purpose LLM to analyze churn risk

-- Step 1: Prepare data with COALESCE
WITH customer_data AS (
  SELECT 
    customer_id,
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name,
    COALESCE(account_age_days, 0) AS account_age_days,
    COALESCE(recent_activity_score, 0.0) AS recent_activity_score
  FROM `main`.`customer`.`profiles` AS c
  WHERE customer_id IS NOT NULL AND customer_name IS NOT NULL
  LIMIT 10
),
-- Step 2: Generate ai_sys_prompt
prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Customer Success Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 15 years of experience in retention strategy and customer success, ',
           'your expertise aligns with the strategic initiative: Customer retention. ',
           'Analyze churn risk for customer: ', customer_name, 
           ', Account age: ', account_age_days, ' days',
           ', Activity score: ', recent_activity_score, '. ',
           'Output ONLY a JSON object with NO markdown, NO extra text. ',
           'Format: {{"ai_cat_risk_level": "High/Medium/Low", "ai_txt_risk_factors": "text", "ai_txt_retention_plan": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM customer_data
),
-- Step 3: Call ai_query
analysis AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS churn_risk_assessment
  FROM prompt_generation
)
SELECT 
  customer_id, customer_name, account_age_days, recent_activity_score,
  get_json_object(churn_risk_assessment, '$.ai_cat_risk_level') AS ai_cat_risk_level,
  get_json_object(churn_risk_assessment, '$.ai_txt_risk_factors') AS ai_txt_risk_factors,
  get_json_object(churn_risk_assessment, '$.ai_txt_retention_plan') AS ai_txt_retention_plan,
  COALESCE(get_json_object(churn_risk_assessment, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(churn_risk_assessment, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(churn_risk_assessment, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(churn_risk_assessment, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(churn_risk_assessment, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(churn_risk_assessment, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(churn_risk_assessment, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN
FROM analysis;
```

**Example 4 (Variation): ai_query for Creative Content**
```sql
-- CREATE VIEW inspire_ai.default.personalized_marketing_messages AS
-- Use Case: Generate Personalized Marketing Messages
-- Creates targeted marketing content based on customer data

-- Step 1: Prepare data with COALESCE
WITH customer_segments AS (
  SELECT 
    customer_id,
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name,
    COALESCE(TRIM(preferred_product_category), 'General') AS preferred_product_category,
    COALESCE(lifetime_value, 0.0) AS lifetime_value
  FROM `main`.`marketing`.`customer_segments` AS m
  WHERE customer_id IS NOT NULL AND customer_name IS NOT NULL
  LIMIT 10
),
-- Step 2: Generate ai_sys_prompt
prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Marketing Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 12 years of experience in personalized campaigns and customer engagement, ',
           'your expertise aligns with the strategic initiative: Customer engagement. ',
           'Create a personalized marketing email for ', customer_name,
           ' who prefers ', preferred_product_category,
           ' and has a lifetime value of $', lifetime_value, '. ',
           'Include a special offer relevant to their interests. ',
           'Output ONLY a JSON object with NO markdown, NO extra text. ',
           'Format: {{"ai_cat_campaign_priority": "High/Medium/Low", "ai_txt_subject_line": "text", "ai_txt_email_body": "text", "ai_txt_offer_details": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM customer_segments
),
-- Step 3: Call ai_query
analysis AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.5)) AS personalized_message
  FROM prompt_generation
)
SELECT 
  customer_id, customer_name, preferred_product_category, lifetime_value,
  get_json_object(personalized_message, '$.ai_cat_campaign_priority') AS ai_cat_campaign_priority,
  get_json_object(personalized_message, '$.ai_txt_subject_line') AS ai_txt_subject_line,
  get_json_object(personalized_message, '$.ai_txt_email_body') AS ai_txt_email_body,
  get_json_object(personalized_message, '$.ai_txt_offer_details') AS ai_txt_offer_details,
  COALESCE(get_json_object(personalized_message, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(personalized_message, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(personalized_message, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(personalized_message, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(personalized_message, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(personalized_message, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(personalized_message, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN
FROM analysis;
```

**Example 5: ai_forecast (Time Series Forecasting)**
```sql
-- Basic pattern: historical CTE with 10:1 ratio → AI_FORECAST → results
-- For 3-month horizon, use 30 months of history (10:1 ratio)
WITH past AS (
  SELECT DATE_TRUNC('month', order_date) AS ds, SUM(total_amount) AS revenue
  FROM `sales`.`transactions`.`orders` AS o
  WHERE order_date >= date_add(MONTH, -30, CURRENT_DATE())  -- 30 months history for 3-month forecast
    AND order_date IS NOT NULL
  GROUP BY DATE_TRUNC('month', order_date)
  ORDER BY ds
)
SELECT ds, revenue_forecast, revenue_upper, revenue_lower
FROM AI_FORECAST(TABLE(past), 
     time_col => 'ds', value_col => 'revenue',
     horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM past))  -- 3 months ahead
;

--END OF GENERATED SQL
```

**Example 5 (Advanced): ai_forecast Variations**

**Pattern A: Multi-Metric Forecasting with Groups**
```sql
-- Forecast multiple metrics (revenue + orders) by category with 10:1 ratio
-- For 8-week horizon, use 80 weeks of history
WITH past AS (
  SELECT DATE_TRUNC('week', order_date) AS ds, product_category,
         SUM(total_amount) AS revenue, COUNT(DISTINCT order_id) AS order_count
  FROM `sales`.`transactions`.`orders` AS o
  WHERE order_date >= date_add(WEEK, -80, CURRENT_DATE())  -- 80 weeks history for 8-week forecast
    AND order_date IS NOT NULL
    AND product_category IS NOT NULL
  GROUP BY DATE_TRUNC('week', order_date), product_category
  ORDER BY ds, product_category
)
SELECT ds, product_category, revenue_forecast, order_count_forecast
FROM AI_FORECAST(TABLE(past), time_col => 'ds',
     value_col => ARRAY('revenue', 'order_count'), group_col => 'product_category',
     horizon => (SELECT date_add(WEEK, 8, MAX(ds)) FROM past),  -- 8 weeks ahead
     parameters => '{{"global_floor": 0}}');  -- ✅ NO LIMIT
```

**Pattern B: Advanced Seasonality Control** (Adapt table/column names to YOUR schema)
```sql
-- Control weekly/daily seasonality with Fourier orders and 10:1 ratio
-- For 60-day horizon, use 600 days of history
-- [ADAPT: Change catalog.schema.table and column names to match YOUR schema]
WITH past AS (
  SELECT DATE_TRUNC('day', activity_date) AS ds, entity_id, SUM(metric_value) AS metric
  FROM `catalog`.`schema`.`your_table` AS t 
  WHERE activity_date >= date_add(DAY, -600, CURRENT_DATE())  -- 600 days history for 60-day forecast
    AND activity_date IS NOT NULL
    AND entity_id IS NOT NULL
  GROUP BY DATE_TRUNC('day', activity_date), entity_id
  ORDER BY ds, entity_id
)
SELECT ds, entity_id, metric_forecast, metric_upper, metric_lower
FROM AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'metric',
     group_col => 'entity_id',
     horizon => (SELECT date_add(DAY, 60, MAX(ds)) FROM past),  -- 60 days ahead
     parameters => '{{"weekly_order": 10, "daily_order": 5, "global_floor": 0}}');  -- ✅ NO LIMIT
```

**Pattern C: Forecast + Classification** (Adapt table/column names to YOUR schema)
```sql
-- Forecast metric AND classify into action buckets with 10:1 ratio
-- For 14-day horizon, use 140 days of history
-- [ADAPT: Change catalog.schema.table and column names to match YOUR schema]
WITH past AS (
  SELECT DATE_TRUNC('day', activity_date) AS ds, entity_id, SUM(quantity_value) AS metric
  FROM `catalog`.`schema`.`your_table` AS t
  WHERE activity_date >= date_add(DAY, -140, CURRENT_DATE())  -- 140 days history for 14-day forecast
    AND activity_date IS NOT NULL
    AND entity_id IS NOT NULL
  GROUP BY DATE_TRUNC('day', activity_date), entity_id
  ORDER BY ds, entity_id
),
forecasted_raw AS (
  SELECT ds, entity_id, metric_forecast, metric_upper, metric_lower
  FROM AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'metric',
       group_col => 'entity_id',
       horizon => (SELECT date_add(DAY, 14, MAX(ds)) FROM past))  -- ✅ NO LIMIT
),
-- 🚨 Apply COALESCE HERE before using in CONCAT - NOT inside CONCAT!
forecasted AS (
  SELECT 
    ds,
    entity_id,
    COALESCE(ROUND(metric_forecast, 2), 0.0) AS metric_forecast,  -- ✅ COALESCE'd HERE
    COALESCE(ROUND(metric_upper, 2), 0.0) AS metric_upper,
    COALESCE(ROUND(metric_lower, 2), 0.0) AS metric_lower
  FROM forecasted_raw
)
SELECT ds, entity_id, metric_forecast,  -- ✅ Already NULL-safe
       ai_classify(CONCAT('Forecast: ', metric_forecast, ' units'),  -- ✅ NO COALESCE in CONCAT - already NULL-safe!
         ARRAY('Critical - Action Required', 'Moderate - Monitor', 'Low - Reduce', 'Uncertain')) AS action
FROM forecasted;  -- ✅ NO LIMIT
```

**Pattern D: Forecast + Recommendations (MANDATORY)** (Adapt table/column names to YOUR schema)
```sql
-- Forecast + generate actionable recommendations with 10:1 ratio
-- For 12-week horizon, use 120 weeks of history
-- [ADAPT: Change catalog.schema.table and column names to match YOUR schema]
WITH past AS (
  SELECT DATE_TRUNC('week', activity_date) AS ds, SUM(amount_value) AS metric
  FROM `catalog`.`schema`.`your_table` AS t
  WHERE activity_date >= date_add(WEEK, -120, CURRENT_DATE())  -- 120 weeks history for 12-week forecast
    AND activity_date IS NOT NULL
  GROUP BY DATE_TRUNC('week', activity_date)
  ORDER BY ds
),
forecast_results AS (
  SELECT 
    COALESCE(CAST(ds AS STRING), 'Unknown') AS ds_display,
    COALESCE(ROUND(metric_forecast, 2), 0.0) AS metric_forecast,
    COALESCE(ROUND(metric_upper, 2), 0.0) AS metric_upper,
    COALESCE(ROUND(metric_lower, 2), 0.0) AS metric_lower
  FROM AI_FORECAST(TABLE(past), time_col => 'ds', value_col => 'metric',
       horizon => (SELECT date_add(WEEK, 12, MAX(ds)) FROM past))  -- 12 weeks ahead
)
SELECT ds_display, metric_forecast,
       ai_query('{sql_model_serving}', CONCAT('Week ', ds_display, ': Forecast $',
              metric_forecast,  -- CONCAT auto-converts DOUBLE
              '. Provide 3 actionable recommendations')) AS recommendations
FROM forecast_results;  -- ✅ NO LIMIT
```

**Example 6: Multi-Function Pipeline (SOPHISTICATED - Document File Processing)**
```sql
-- Use Case: Intelligent Document Processing Pipeline
-- Parses invoice document files, extracts data, and classifies by urgency
-- CRITICAL: ai_parse_document is used ONLY with unstructured document files from Unity Catalog volumes

WITH document_files AS (
  SELECT 
    path,
    content,
    ai_parse_document(
      content,
      map('version', '2.0')
    ) AS parsed_doc
  FROM READ_FILES('/Volumes/finance/documents/invoices/*.pdf', format => 'binaryFile')
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
extracted_text AS (
  SELECT 
    path,
    concat_ws(
      '\n\n',
      transform(
        try_cast(parsed_doc:document:elements AS ARRAY<VARIANT>),
        element -> try_cast(element:content AS STRING)
      )
    ) AS full_text
  FROM document_files
  WHERE try_cast(parsed_doc:error_status AS STRING) IS NULL
  -- ✅ NO LIMIT in CTEs
),
extracted_data AS (
  SELECT 
    path,
    full_text,
    ai_extract(full_text, 
      ARRAY('vendor_name', 'invoice_number', 'total_amount', 'due_date', 'payment_terms')
    ) AS invoice_data
  FROM extracted_text
  -- ✅ NO LIMIT in intermediate CTEs
)
SELECT 
  path,
  invoice_data,
  ai_classify(
    CONCAT('Invoice from: ', invoice_data['vendor_name'], 
           ', Amount: $', invoice_data['total_amount'],
           ', Due: ', invoice_data['due_date']),
    ARRAY('Urgent - Past Due', 'High Priority', 'Normal', 'Low Priority')
  ) AS urgency_classification
FROM extracted_data;  -- ✅ NO LIMIT in final SELECT
```

**CRITICAL NOTE FOR ai_parse_document:**
- ai_parse_document MUST ONLY be used with unstructured document files (PDFs, images, Word docs, PowerPoints)
- Use READ_FILES('/Volumes/path/to/files/*.{{pdf,jpg,png,doc,docx,ppt,pptx}}', format => 'binaryFile') to load binary content
- NEVER use ai_parse_document with table columns or structured data already in Delta tables
- The output is VARIANT type with schema version 2.0 containing document elements (text, tables, figures)
- Extract text content from parsed_doc:document:elements array before using ai_extract
- Filter out errors using WHERE try_cast(parsed_doc:error_status AS STRING) IS NULL

**Example 7: JOIN with ai_query (SOPHISTICATED)**
```sql
-- Use Case: Generate Customer Insights with Purchase Context
-- Joins customer and order data for rich context in LLM prompts

-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH customer_orders AS (
  SELECT 
    c.customer_id,                                              -- CRITICAL: filtered with IS NOT NULL
    c.customer_name,                                            -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(c.customer_segment), 'Unknown Segment') AS customer_segment,  -- ✅ COALESCE'd
    COALESCE(c.lifetime_value, 0.0) AS lifetime_value,          -- ✅ COALESCE'd
    COALESCE(COUNT(o.order_id), 0) AS total_orders,             -- ✅ COALESCE'd (COUNT itself never returns NULL; kept for consistency)
    COALESCE(SUM(o.order_amount), 0.0) AS total_spent           -- ✅ COALESCE'd (LEFT JOIN can produce NULL)
  FROM `main`.`customers`.`profiles` AS c
  LEFT JOIN `main`.`sales`.`orders` AS o ON c.customer_id = o.customer_id
  WHERE c.customer_id IS NOT NULL
    AND c.customer_name IS NOT NULL
  GROUP BY c.customer_id, c.customer_name, c.customer_segment, c.lifetime_value
  LIMIT 10  -- ✅ LIMIT 10 in first CTE (GROUP BY provides uniqueness)
)
SELECT 
  customer_id,
  customer_name,
  ai_query('{sql_model_serving}',
    CONCAT('Analyze this customer profile: Name: ', customer_name,
           ', Segment: ', customer_segment,
           ', Lifetime Value: $', lifetime_value,  -- CONCAT auto-converts
           ', Total Orders: ', total_orders,
           ', Total Spent: $', total_spent,
           '. Provide actionable insights and retention strategies.')
  ) AS customer_insights
FROM customer_orders
;

--END OF GENERATED SQL
```

**Example 8: ai_mask with Structured Justification and Strategy (DATA PRIVACY COMPLIANCE)**
```sql
-- Use Case: Mask PII Data with Compliance Documentation
-- Step 1: Mask sensitive PII fields
-- Step 2: Generate structured JSON with data risk classification and compliance strategy

-- Step 1: Mask PII fields in customer records
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH masked_data AS (
  SELECT DISTINCT
    customer_id,                                              -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown') AS customer_name,  -- ✅ COALESCE'd
    COALESCE(TRIM(email), 'no-email@unknown.com') AS email,     -- ✅ COALESCE'd
    COALESCE(TRIM(phone), '000-000-0000') AS phone,             -- ✅ COALESCE'd
    COALESCE(TRIM(ssn), '000-00-0000') AS ssn,                  -- ✅ COALESCE'd
    COALESCE(TRIM(credit_card), '0000-0000-0000-0000') AS credit_card,  -- ✅ COALESCE'd
    ai_mask(COALESCE(TRIM(customer_name), 'Unknown'), ARRAY('PERSON')) AS customer_name_masked,
    ai_mask(COALESCE(TRIM(email), 'no-email@unknown.com'), ARRAY('EMAIL')) AS email_masked,
    ai_mask(COALESCE(TRIM(phone), '000-000-0000'), ARRAY('PHONE')) AS phone_masked,
    ai_mask(COALESCE(TRIM(ssn), '000-00-0000'), ARRAY('SSN')) AS ssn_masked,
    ai_mask(COALESCE(TRIM(credit_card), '0000-0000-0000-0000'), ARRAY('CREDIT_CARD')) AS credit_card_masked
  FROM `main`.`customers`.`personal_data` AS c
  WHERE customer_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Generate structured compliance documentation with categorical risk classifications
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}',
      CONCAT('Analyze masked customer data. Fields masked: name (PERSON), email (EMAIL), phone (PHONE), SSN (SSN), credit card (CREDIT_CARD). ',
             'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
             'Format: {{"data_risk_type": "value", "risk_severity": "value", "compliance_status": "value", "primary_regulation": "value", "masking_rationale": "text", "data_protection_strategy": "text", "compliance_checklist": "text", "breach_risk_reduction": "text", "data_governance_summary": "text"}}. ',
             'Required keys: ',
             'data_risk_type (MUST be exactly one of: PII/PHI/Financial Data/PII+Financial/PHI+PII/PHI+Financial/PII+PHI+Financial), ',
             'risk_severity (MUST be exactly one of: Critical/High/Medium/Low), ',
             'compliance_status (MUST be exactly one of: Fully Compliant/Compliant with Monitoring/At Risk/Non-Compliant), ',
             'primary_regulation (MUST be exactly one of: GDPR/CCPA/HIPAA/PCI-DSS/GDPR+CCPA/Multiple), ',
             'masking_rationale (free text: why these specific fields required masking for compliance and data protection), ',
             'data_protection_strategy (free text: comprehensive strategy for secure data sharing, storage, and access control), ',
             'compliance_checklist (free text: specific compliance steps completed and verification procedures), ',
             'breach_risk_reduction (free text: how masking reduces breach impact, quantify risk reduction percentage if possible), ',
             'data_governance_summary (free text: executive brief on data protection value and compliance posture in 2-3 sentences). ',
             'Output ONLY the JSON object, nothing else.')
    ) AS compliance_info
  FROM masked_data
),
-- Final output: Masked data with mix of categorical and narrative compliance columns
final_output AS (
  SELECT 
    customer_id,
    customer_name_masked,
    email_masked,
    phone_masked,
    ssn_masked,
    credit_card_masked,
    get_json_object(compliance_info, '$.data_risk_type') AS data_risk_type,  -- CATEGORICAL: PII/PHI/Financial Data/combinations
    get_json_object(compliance_info, '$.risk_severity') AS risk_severity,  -- CATEGORICAL: Critical/High/Medium/Low
    get_json_object(compliance_info, '$.compliance_status') AS compliance_status,  -- CATEGORICAL: Fully Compliant/Compliant with Monitoring/At Risk/Non-Compliant
    get_json_object(compliance_info, '$.primary_regulation') AS primary_regulation,  -- CATEGORICAL: GDPR/CCPA/HIPAA/PCI-DSS/combinations
    get_json_object(compliance_info, '$.masking_rationale') AS masking_rationale,  -- NARRATIVE: Free text
    get_json_object(compliance_info, '$.data_protection_strategy') AS data_protection_strategy,  -- NARRATIVE: Free text
    get_json_object(compliance_info, '$.compliance_checklist') AS compliance_checklist,  -- NARRATIVE: Free text
    get_json_object(compliance_info, '$.breach_risk_reduction') AS breach_risk_reduction,  -- NARRATIVE: Free text
    get_json_object(compliance_info, '$.data_governance_summary') AS data_governance_summary  -- NARRATIVE: Free text
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE data_risk_type IN ('PII', 'PHI', 'Financial Data', 'PII+Financial', 'PHI+PII', 'PHI+Financial', 'PII+PHI+Financial')
-- AND risk_severity IN ('Critical', 'High', 'Medium', 'Low')
-- AND compliance_status IN ('Fully Compliant', 'Compliant with Monitoring', 'At Risk', 'Non-Compliant')
-- AND primary_regulation IN ('GDPR', 'CCPA', 'HIPAA', 'PCI-DSS', 'GDPR+CCPA', 'Multiple')
;

--END OF GENERATED SQL
```
**NOTE**: The data_risk_type column specifies whether the data is PII, PHI, Financial Data, or a combination, helping businesses understand which regulatory requirements apply.

**Example 9: ai_analyze_sentiment with Structured Insights (CUSTOMER FEEDBACK ANALYSIS)**
```sql
-- Use Case: Customer Feedback Analysis with Actionable Response Strategy
-- Step 1: Analyze customer emotion in feedback
-- Step 2: Generate structured JSON with emotion analysis and response recommendations

-- Step 1: Analyze customer emotion in reviews
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH emotion_analysis AS (
  SELECT DISTINCT
    review_id,                                              -- CRITICAL: filtered with IS NOT NULL
    customer_id,                                            -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(product_id), 'Unknown Product') AS product_id,  -- ✅ COALESCE'd
    review_text,                                            -- CRITICAL: filtered with IS NOT NULL (used in AI)
    COALESCE(rating, 0) AS rating,                          -- ✅ COALESCE'd
    COALESCE(purchase_amount, 0.0) AS purchase_amount,      -- ✅ COALESCE'd
    ai_analyze_sentiment(review_text) AS customer_emotion
  FROM `main`.`reviews`.`customer_feedback` AS r
  WHERE review_id IS NOT NULL
    AND review_text IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Generate structured insights with categorical metrics and narrative strategies
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}',
      CONCAT('Analyze customer emotion in this review showing ', customer_emotion, ' emotion ',
             '(rating: ', rating, '/5, purchase: $', purchase_amount, '): "',  -- CONCAT auto-converts
             review_text, '". ',
             'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
             'Format: {{"account_risk_level": "value", "response_urgency": "value", "primary_pain_point": "value", "recovery_difficulty": "value", "emotion_rationale": "text", "response_playbook": "text", "recovery_actions": "text", "product_service_enhancements": "text", "customer_journey_insights": "text"}}. ',
             'Required keys: ',
             'account_risk_level (MUST be exactly one of: Critical/High/Medium/Low/Minimal), ',
             'response_urgency (MUST be exactly one of: Immediate/Within 24hrs/Within Week/Routine/Monitor), ',
             'primary_pain_point (MUST be exactly one of: Product Quality/Customer Service/Pricing/Delivery/Usability/Expectation Mismatch/Other), ',
             'recovery_difficulty (MUST be exactly one of: Easy/Moderate/Difficult/Very Difficult/Lost), ',
             'emotion_rationale (free text: why customer feels this way, specific phrases showing emotion, 1-2 sentences), ',
             'response_playbook (free text: specific playbook steps to respond to this customer with timing and approach), ',
             'recovery_actions (free text: immediate recovery actions with specific offers or gestures if negative emotion), ',
             'product_service_enhancements (free text: specific product/service improvements needed based on feedback), ',
             'customer_journey_insights (free text: executive brief on customer journey implications in 2-3 sentences). ',
             'Output ONLY the JSON object, nothing else.')
    ) AS insights
  FROM emotion_analysis
),
-- Final output: Mix of categorical (filterable) and narrative (actionable) columns
final_output AS (
  SELECT 
    review_id,
    customer_id,
    product_id,
    review_text,
    customer_emotion,  -- CATEGORICAL from ai_analyze_sentiment
    rating,
    get_json_object(insights, '$.account_risk_level') AS account_risk_level,  -- CATEGORICAL: Critical/High/Medium/Low/Minimal
    get_json_object(insights, '$.response_urgency') AS response_urgency,  -- CATEGORICAL: Immediate/Within 24hrs/Within Week/Routine/Monitor
    get_json_object(insights, '$.primary_pain_point') AS primary_pain_point,  -- CATEGORICAL: Product Quality/Service/Pricing/Delivery/etc
    get_json_object(insights, '$.recovery_difficulty') AS recovery_difficulty,  -- CATEGORICAL: Easy/Moderate/Difficult/Very Difficult/Lost
    get_json_object(insights, '$.emotion_rationale') AS emotion_rationale,  -- NARRATIVE: Free text
    get_json_object(insights, '$.response_playbook') AS response_playbook,  -- NARRATIVE: Free text
    get_json_object(insights, '$.recovery_actions') AS recovery_actions,  -- NARRATIVE: Free text
    get_json_object(insights, '$.product_service_enhancements') AS service_enhancements,  -- NARRATIVE: Free text
    get_json_object(insights, '$.customer_journey_insights') AS customer_journey_insights  -- NARRATIVE: Free text
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE account_risk_level IN ('Critical', 'High', 'Medium', 'Low', 'Minimal')
-- AND response_urgency IN ('Immediate', 'Within 24hrs', 'Within Week', 'Routine', 'Monitor')
-- AND primary_pain_point IN ('Product Quality', 'Customer Service', 'Pricing', 'Delivery', 'Usability', 'Expectation Mismatch', 'Other')
-- AND recovery_difficulty IN ('Easy', 'Moderate', 'Difficult', 'Very Difficult', 'Lost')
;

--END OF GENERATED SQL
```
**NOTE**: Use business-relevant names (e.g., customer_emotion, feedback_tone, satisfaction_level) instead of generic "sentiment". Tailor column names to your business context.

**Example 10: ai_similarity with Structured Deduplication Strategy**
```sql
-- Use Case: Customer Record Matching for Deduplication with Action Plan
-- Step 1: Calculate match confidence between customer records
-- Step 2: Generate structured JSON with match analysis and consolidation strategy

-- Step 1: Find potential duplicate customers using semantic matching
WITH match_analysis AS (
  SELECT DISTINCT
    c1.customer_id AS customer_id_1,
    c1.customer_name AS name_1,
    c1.email AS email_1,
    c1.address AS address_1,
    c2.customer_id AS customer_id_2,
    c2.customer_name AS name_2,
    c2.email AS email_2,
    c2.address AS address_2,
    ai_similarity(
      CONCAT_WS(' ', c1.customer_name, c1.email, c1.address),  -- CONCAT_WS skips NULLs; CONCAT would return NULL if any field is NULL
      CONCAT_WS(' ', c2.customer_name, c2.email, c2.address)
    ) AS match_confidence_score
  FROM `main`.`customers`.`profiles` AS c1
  CROSS JOIN `main`.`customers`.`profiles` AS c2
  WHERE c1.customer_id IS NOT NULL
    AND c2.customer_id IS NOT NULL
    AND c1.customer_id < c2.customer_id  -- Self-join deduplication (exception: comparison needed for CROSS JOIN)
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Generate structured consolidation strategy with categorical classifications
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}',
      CONCAT('Analyze potential duplicate customers with match confidence ', 
             CAST(match_confidence_score AS STRING), ' (0=different, 1=identical). ',
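             'Record A: ', COALESCE(name_1, 'Unknown'), ' (', COALESCE(email_1, 'no email'), '), ',  -- COALESCE keeps the prompt non-NULL
             'Record B: ', COALESCE(name_2, 'Unknown'), ' (', COALESCE(email_2, 'no email'), '). ',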
             'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
             'Format: {{"record_relationship": "value", "merge_priority": "value", "data_conflict_level": "value", "recommended_master": "value", "match_rationale": "text", "merge_execution_plan": "text", "data_reconciliation_steps": "text", "master_data_quality_impact": "text", "business_intelligence_value": "text"}}. ',
             'Required keys: ',
             'record_relationship (MUST be exactly one of: Definite Duplicate/Probable Duplicate/Possible Match/Different Entity), ',
             'merge_priority (MUST be exactly one of: Immediate/High/Medium/Low/Do Not Merge), ',
             'data_conflict_level (MUST be exactly one of: No Conflicts/Minor Conflicts/Moderate Conflicts/Major Conflicts), ',
             'recommended_master (MUST be exactly one of: Record A/Record B/Create New/Manual Review), ',
             'match_rationale (free text: why these are likely same entity, which fields match, confidence reasoning, 1-2 sentences), ',
             'merge_execution_plan (free text: step-by-step plan to merge records safely with data preservation strategy), ',
             'data_reconciliation_steps (free text: how to reconcile conflicting data between records with decision rules), ',
             'master_data_quality_impact (free text: how consolidation improves data quality, analytics accuracy, and business intelligence), ',
             'business_intelligence_value (free text: executive brief on clean customer data value in 2 sentences). ',
             'Output ONLY the JSON object, nothing else.')
    ) AS insights
  FROM match_analysis
  -- NOTE: Filter high-confidence matches in application layer or use threshold in ai_query prompt
  -- WHERE match_confidence_score > 0.7 would violate the no-value-comparison rule
),
-- Final output: Mix of categorical (filterable) and narrative (execution) columns
final_output AS (
  SELECT 
    customer_id_1,
    name_1,
    email_1,
    customer_id_2,
    name_2,
    email_2,
    match_confidence_score,
    get_json_object(insights, '$.record_relationship') AS record_relationship,  -- CATEGORICAL: Definite Duplicate/Probable Duplicate/Possible Match/Different Entity
    get_json_object(insights, '$.merge_priority') AS merge_priority,  -- CATEGORICAL: Immediate/High/Medium/Low/Do Not Merge
    get_json_object(insights, '$.data_conflict_level') AS data_conflict_level,  -- CATEGORICAL: No Conflicts/Minor/Moderate/Major
    get_json_object(insights, '$.recommended_master') AS recommended_master,  -- CATEGORICAL: Record A/Record B/Create New/Manual Review
    get_json_object(insights, '$.match_rationale') AS match_rationale,  -- NARRATIVE: Free text
    get_json_object(insights, '$.merge_execution_plan') AS merge_execution_plan,  -- NARRATIVE: Free text
    get_json_object(insights, '$.data_reconciliation_steps') AS data_reconciliation_steps,  -- NARRATIVE: Free text
    get_json_object(insights, '$.master_data_quality_impact') AS data_quality_impact,  -- NARRATIVE: Free text
    get_json_object(insights, '$.business_intelligence_value') AS business_intelligence_value  -- NARRATIVE: Free text
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE record_relationship IN ('Definite Duplicate', 'Probable Duplicate', 'Possible Match', 'Different Entity')
-- AND merge_priority IN ('Immediate', 'High', 'Medium', 'Low', 'Do Not Merge')
-- AND data_conflict_level IN ('No Conflicts', 'Minor Conflicts', 'Moderate Conflicts', 'Major Conflicts')
-- AND recommended_master IN ('Record A', 'Record B', 'Create New', 'Manual Review')
;

--END OF GENERATED SQL
```
**NOTE**: Use business-relevant names (e.g., match_confidence_score, record_relationship) instead of generic "similarity_score". Tailor to your domain (e.g., product_match_score for products, vendor_similarity for vendors).

**Example 11: ai_forecast with Enhanced Structured Recommendations (MANDATORY PATTERN)**
```sql
-- Use Case: Revenue Forecasting with Structured Business Strategy
-- Step 1: Prepare historical time series data
-- Step 2: Generate forecasts with prediction intervals
-- Step 3: Build the ai_sys_prompt for each forecast row
-- Step 4: Generate structured JSON with justification and strategic recommendations

-- Step 1: Prepare historical revenue data for forecasting
WITH past AS (
  SELECT 
    DATE_TRUNC('week', order_date) AS ds,
    SUM(total_amount) AS revenue
  FROM `sales`.`transactions`.`orders` AS o
  WHERE order_date >= date_add(WEEK, -120, CURRENT_DATE())  -- 120 weeks history for 12-week forecast (10:1 ratio)
    AND order_date IS NOT NULL
  GROUP BY DATE_TRUNC('week', order_date)
  ORDER BY ds
),
-- Step 2: Generate revenue forecasts with prediction intervals - COALESCE results
forecast_results AS (
  SELECT 
    COALESCE(CAST(ds AS STRING), 'Unknown') AS ds,
    COALESCE(ROUND(revenue_forecast, 2), 0.0) AS revenue_forecast,
    COALESCE(ROUND(revenue_upper, 2), 0.0) AS revenue_upper,
    COALESCE(ROUND(revenue_lower, 2), 0.0) AS revenue_lower
  FROM AI_FORECAST(
    TABLE(past), 
    time_col => 'ds', 
    value_col => 'revenue',
    horizon => (SELECT date_add(WEEK, 12, MAX(ds)) FROM past)  -- 12 weeks ahead
  )
  -- ✅ NO LIMIT in CTEs
),
-- Step 3: Generate ai_sys_prompt for forecast analysis
forecast_prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Revenue Forecasting Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 18 years of experience in revenue forecasting and financial planning, ',
           'your expertise in predictive analytics and strategic resource allocation aligns with the strategic initiative: Revenue optimization. ',
           'Analyze revenue forecast for week ', ds, ': ',  -- CONCAT auto-converts
           'Predicted: $', revenue_forecast, ', ',
           'Upper bound: $', revenue_upper, ', ',
           'Lower bound: $', revenue_lower, '. ',
           'Output ONLY a JSON object with NO markdown fences, NO extra text, JUST the JSON. ',
           'Format: {{"ai_cat_trend_direction": "value", "ai_cat_forecast_confidence": "value", "ai_cat_action_priority": "value", "ai_cat_risk_level": "value", "ai_cat_resource_allocation_need": "value", "ai_txt_forecast_justification": "text", "ai_txt_tactical_recommendations": "text", "ai_txt_strategic_initiatives": "text", "ai_txt_risk_mitigation_plan": "text", "ai_txt_upside_opportunities": "text", "ai_txt_resource_recommendations": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'CATEGORICAL FIELDS (ai_cat_ prefix): ',
           'ai_cat_trend_direction (MUST be exactly one of: Strong Growth/Moderate Growth/Stable/Moderate Decline/Sharp Decline), ',
           'ai_cat_forecast_confidence (MUST be exactly one of: Very High/High/Medium/Low/Very Low), ',
           'ai_cat_action_priority (MUST be exactly one of: Immediate Action/High Priority/Medium Priority/Low Priority/Monitor), ',
           'ai_cat_risk_level (MUST be exactly one of: Critical/High/Medium/Low/Minimal), ',
           'ai_cat_resource_allocation_need (MUST be exactly one of: Significant Increase/Moderate Increase/Maintain/Moderate Decrease/Significant Decrease). ',
           'NARRATIVE FIELDS (ai_txt_ prefix): ',
           'ai_txt_forecast_justification (free text: why this forecast level, key drivers, confidence factors, 2 sentences), ',
           'ai_txt_tactical_recommendations (free text: 3 specific tactical actions for this period with expected impact), ',
           'ai_txt_strategic_initiatives (free text: long-term strategic initiatives based on forecast trends), ',
           'ai_txt_risk_mitigation_plan (free text: specific risks if forecast not met and detailed mitigation plans), ',
           'ai_txt_upside_opportunities (free text: how to exceed forecast with specific tactics and expected ROI), ',
           'ai_txt_resource_recommendations (free text: specific staffing/inventory/capacity recommendations with quantities). ',
           'MANDATORY LAST 7 FIELDS: ai_txt_business_outcome (CALCULATED MEASURABLE IMPACT with Daily/Weekly/Monthly/Yearly breakdown and DISCLAIMER), ai_txt_executive_summary (references business outcome numbers), ai_sys_importance, ai_sys_urgency, ai_sys_confidence, ai_sys_feedback, ai_sys_missing_data. ',
           'ai_sys_missing_data format: "I can get higher confidence than [X]% if I can get access to [narrative]. {{\"missing_data\": [\"dataset1\", \"dataset2\"]}}" ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM forecast_results
  -- ✅ NO LIMIT in CTEs
),
-- Step 4: Generate structured strategic recommendations with categorical and narrative columns
enriched AS (
  SELECT 
    *,
    ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS insights
  FROM forecast_prompt_generation
  -- ✅ NO LIMIT in CTEs
),
-- Final output: Mix of categorical (ai_cat_), narrative (ai_txt_), and system (ai_sys_) columns
-- ai_sys_prompt MUST be the LAST column
final_output AS (
  SELECT 
    ds AS forecast_week,
    revenue_forecast,
    revenue_upper AS upper_bound,
    revenue_lower AS lower_bound,
    get_json_object(insights, '$.ai_cat_trend_direction') AS ai_cat_trend_direction,  -- CATEGORICAL: Strong Growth/Moderate Growth/Stable/Moderate Decline/Sharp Decline
    get_json_object(insights, '$.ai_cat_forecast_confidence') AS ai_cat_forecast_confidence,  -- CATEGORICAL: Very High/High/Medium/Low/Very Low
    get_json_object(insights, '$.ai_cat_action_priority') AS ai_cat_action_priority,  -- CATEGORICAL: Immediate Action/High Priority/Medium Priority/Low Priority/Monitor
    get_json_object(insights, '$.ai_cat_risk_level') AS ai_cat_risk_level,  -- CATEGORICAL: Critical/High/Medium/Low/Minimal
    get_json_object(insights, '$.ai_cat_resource_allocation_need') AS ai_cat_resource_allocation_need,  -- CATEGORICAL: Significant Increase/Moderate Increase/Maintain/Moderate Decrease/Significant Decrease
    get_json_object(insights, '$.ai_txt_forecast_justification') AS ai_txt_forecast_justification,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_tactical_recommendations') AS ai_txt_tactical_recommendations,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_strategic_initiatives') AS ai_txt_strategic_initiatives,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_risk_mitigation_plan') AS ai_txt_risk_mitigation_plan,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_upside_opportunities') AS ai_txt_upside_opportunities,  -- NARRATIVE: Free text
    get_json_object(insights, '$.ai_txt_resource_recommendations') AS ai_txt_resource_recommendations,  -- NARRATIVE: Free text
    -- MANDATORY LAST 8 COLUMNS (the 7 JSON fields starting with ai_txt_business_outcome, then ai_sys_prompt):
    COALESCE(get_json_object(insights, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(insights, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(insights, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(insights, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(insights, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(insights, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(insights, '$.ai_sys_missing_data'), 'No missing data identified') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN: The exact prompt used for auditability
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_trend_direction IN ('Strong Growth', 'Moderate Growth', 'Stable', 'Moderate Decline', 'Sharp Decline')
-- AND ai_cat_forecast_confidence IN ('Very High', 'High', 'Medium', 'Low', 'Very Low')
-- AND ai_cat_action_priority IN ('Immediate Action', 'High Priority', 'Medium Priority', 'Low Priority', 'Monitor')
-- AND ai_cat_risk_level IN ('Critical', 'High', 'Medium', 'Low', 'Minimal')
-- AND ai_cat_resource_allocation_need IN ('Significant Increase', 'Moderate Increase', 'Maintain', 'Moderate Decrease', 'Significant Decrease')
;

--END OF GENERATED SQL
```
**NOTE**: This pattern generates forecasts with comprehensive structured output including justification, tactical recommendations, strategic initiatives, risk mitigation, and executive narratives that tell the business story.

---

### 8. SQL IMPLEMENTATION GUIDELINES FOR AI FUNCTIONS

**🚨 CRITICAL ARRAY RESTRICTIONS:**

Both `ai_classify` and `ai_extract` accept an array of labels/entities as their second parameter. You MUST follow these restrictions:

1. **Maximum 20 items** - The array can contain at most 20 elements
   - ✅ VALID: `ARRAY('label1', 'label2', ..., 'label20')` (20 items)
   - ❌ INVALID: Arrays with 21 or more items will FAIL

2. **Fewer than 50 characters per item** - Each array element MUST be shorter than 50 characters
   - ✅ VALID: `'High Priority'` (13 chars), `'Customer Service'` (16 chars)
   - ❌ INVALID: `'High Priority Customer Service Escalation Required Immediately'` (62 chars)
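
As a pre-flight check, a candidate label list could be validated against both limits before it is pasted into `ai_classify` or `ai_extract` as a literal. This is a sketch only: `candidate_labels`, `label_set`, and the sample labels are hypothetical names used for illustration.

```sql
-- Sketch: flag label arrays that break the 20-item or 50-character limits
WITH candidate_labels AS (
  SELECT * FROM VALUES
    ('priority',   ARRAY('Urgent', 'High', 'Medium', 'Low')),
    ('escalation', ARRAY('Normal', 'High Priority Customer Service Escalation Required Immediately'))
  AS t(label_set, labels)
)
SELECT
  label_set,
  size(labels) AS label_count,
  size(labels) <= 20 AS within_item_limit,                 -- at most 20 elements
  forall(labels, l -> length(l) < 50) AS within_length_limit,  -- every label under 50 characters
  filter(labels, l -> length(l) >= 50) AS offending_labels     -- labels that need shortening
FROM candidate_labels;
```

With these sample rows, the `escalation` set should be the one reported with a non-empty `offending_labels` array.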

**BEST PRACTICES:**
- Use concise, clear labels - avoid verbose descriptions
- Abbreviate when necessary: `'Cust Svc'` instead of `'Customer Service Department'`
- For ai_extract: Use short entity names like `'invoice_num'` instead of `'invoice_number_from_document'`
- For ai_classify: Keep category names brief and meaningful
- **CRITICAL**: Always follow classification with ai_gen/ai_query for recommendations

**Examples:**
```sql
-- GOOD ✅ - Classification with recommendations (includes persona, ai_sys_ columns, and ai_sys_prompt)
WITH classified AS (
  SELECT DISTINCT 
    ticket_id,
    COALESCE(TRIM(text), 'No description') AS text,
    ai_classify(COALESCE(TRIM(text), 'No description'), ARRAY('Urgent', 'High', 'Medium', 'Low')) AS ai_cat_priority  -- classify the COALESCE'd value, not the raw column
  FROM tickets AS t
  WHERE ticket_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Generate ai_sys_prompt
prompt_generation AS (
  SELECT 
    *,
    CONCAT('You are a Support Operations Director for {business_name} which is focused on {enriched_business_context}. ',
           'The organization''s strategic goals include: {enriched_strategic_goals}. ',
           'Business priorities are: {enriched_business_priorities}. ',
           'With 12 years of experience in support operations and customer success, ',
           'your expertise aligns with the strategic initiative: Customer satisfaction. ',
           'Generate action plan for ', ai_cat_priority, ' priority ticket: ', text, '. ',
           'Output ONLY JSON with format: {{"ai_txt_action_plan": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", ',
           '"ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
           'Output ONLY the JSON object, nothing else.') AS ai_sys_prompt
  FROM classified
),
-- Step 3: Call ai_query
enriched AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS action_plan
  FROM prompt_generation
),
-- Final output with ai_sys_prompt as LAST column
final_output AS (
  SELECT 
    ticket_id,
    text,
    ai_cat_priority,
    get_json_object(action_plan, '$.ai_txt_action_plan') AS ai_txt_action_plan,
    COALESCE(get_json_object(action_plan, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(action_plan, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(action_plan, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(action_plan, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(action_plan, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(action_plan, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(action_plan, '$.ai_sys_missing_data'), 'No missing data') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN
  FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE ai_cat_priority IN ('Urgent', 'High', 'Medium', 'Low')
;

--END OF GENERATED SQL

-- INCOMPLETE ❌ - Classification without recommendations (LOW VALUE)
SELECT ai_classify(text, ARRAY('Urgent', 'High', 'Medium', 'Low')) AS priority FROM tickets;

-- BAD ❌ - Labels too long
ai_classify(text, ARRAY('Extremely Urgent Customer Service Escalation Required'))  -- Too long!
ai_extract(content, ARRAY('customer_full_legal_name_exactly_as_appears_on_document'))  -- Too long! (55 chars)
```

#### AI_FORECAST SQL Implementation Details

**🔥 CRITICAL: HISTORY-TO-HORIZON RATIO (ADAPTIVE BY TIME GRANULARITY) 🔥**

**MANDATORY REQUIREMENT**: For ai_forecast input CTEs, you MUST use dynamic date filtering with a history-to-horizon ratio appropriate to the time granularity, instead of a fixed LIMIT 1000.

**WHY ADAPTIVE RATIOS?**
- Different time granularities require different amounts of history
- Must capture at least 2 complete seasonal cycles for pattern recognition
- Balance between sufficient training data and data relevance
- High-frequency data (hourly/minute) needs more observations but shorter calendar time
- Low-frequency data (yearly) needs longer calendar time but fewer observations

**ADAPTIVE RATIO TABLE BY TIME GRANULARITY:**

| Time Granularity | Minimum History | Recommended Ratio | Reasoning |
|------------------|----------------|-------------------|-----------|
| **MINUTE** | 7 days | 7 days per hour forecast | Capture daily + weekly patterns (10,080 observations for 1 hour ahead) |
| **HOUR** | 4-6 weeks | 4 weeks per day forecast | Capture weekly + monthly patterns (672 observations for 1 day ahead) |
| **DAY** | 60-90 days | 10:1 ratio | Capture 2+ seasonal cycles (60-90 observations for 7-day forecast) |
| **WEEK** | 2-3 years | 10:1 ratio | Capture 2+ yearly cycles (104-156 observations for 12-week forecast) |
| **MONTH** | 3-5 years | 10:1 ratio (3:1 minimum) | Capture 3+ yearly cycles (36-60 observations minimum for 12-month forecast; 120 if available) |
| **QUARTER** | 5-8 years | 8:1 ratio | Capture 2+ yearly cycles (20-32 observations for 4-quarter forecast) |
| **YEAR** | 10-15 years | 3:1 to 5:1 ratio (2:1 absolute minimum) | Industry standard for annual forecasting (10-15 observations preferred) |

**IMPLEMENTATION PATTERNS BY GRANULARITY:**

**1. HIGH-FREQUENCY (MINUTE, HOUR) - Need Many Observations, Short Calendar Time:**
```sql
-- MINUTE-LEVEL: For 1-hour forecast, use 7 days of history
WHERE timestamp_col >= date_add(DAY, -7, CURRENT_TIMESTAMP())  -- 7 days = 10,080 minutes

-- HOUR-LEVEL: For 24-hour (1 day) forecast, use 4 weeks of history
WHERE timestamp_col >= date_add(WEEK, -4, CURRENT_TIMESTAMP())  -- 4 weeks = 672 hours
```

**2. MID-FREQUENCY (DAY, WEEK, MONTH) - Standard 10:1 Ratio:**
```sql
-- DAILY: For 7-day forecast, use 70 days (10 weeks) of history
WHERE date_col >= date_add(DAY, -70, CURRENT_DATE())  -- 10:1 ratio

-- WEEKLY: For 12-week forecast, use 120 weeks (~2.3 years) of history
WHERE date_col >= date_add(WEEK, -120, CURRENT_DATE())  -- 10:1 ratio

-- MONTHLY: For 12-month forecast, use 120 months (10 years) IF AVAILABLE, otherwise minimum 36 months
-- (alternatives: use exactly ONE of the two WHERE filters below)
WHERE date_col >= add_months(CURRENT_DATE(), -120)  -- 10:1 ratio (ideal)
WHERE date_col >= add_months(CURRENT_DATE(), -36)   -- 3:1 ratio (minimum acceptable)
```

**3. LOW-FREQUENCY (QUARTER, YEAR) - Reduced Ratio Due to Data Availability:**
```sql
-- QUARTERLY: For 4-quarter forecast, use 32 quarters (~8 years) of history
WHERE date_col >= date_add(QUARTER, -32, CURRENT_DATE())  -- 8:1 ratio

-- YEARLY: For 3-year forecast, use 10-15 years IF AVAILABLE, otherwise minimum 6 years
-- (alternatives: use exactly ONE of the two WHERE filters below)
WHERE date_col >= date_add(YEAR, -15, CURRENT_DATE())  -- 5:1 ratio (ideal)
WHERE date_col >= date_add(YEAR, -6, CURRENT_DATE())   -- 2:1 ratio (minimum for 2 cycles)
```
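
**ILLUSTRATIVE SKETCH (Python, for explanation only; generated output must remain SQL):** The arithmetic behind the adaptive ratios can be summarized as a small lookup. The function name `history_window`, the `RATIO_BY_UNIT` table, and the choice of 10 years as the floor for yearly data are assumptions made for this sketch, not part of the generator.

```python
# Hypothetical helper: names and structure are illustrative only.
# Multipliers for granularities where history scales with the horizon.
RATIO_BY_UNIT = dict(DAY=10, WEEK=10, MONTH=10, QUARTER=8)


def history_window(unit, horizon):
    """Return (history_unit, history_amount) for a forecast of `horizon` periods of `unit`."""
    unit = unit.upper()
    if unit == "MINUTE":
        # 7 days of history per forecast hour; horizon is given in minutes.
        hours = max(1, -(-horizon // 60))  # ceiling division
        return ("DAY", 7 * hours)
    if unit == "HOUR":
        # 4 weeks of history per forecast day; horizon is given in hours.
        days = max(1, -(-horizon // 24))  # ceiling division
        return ("WEEK", 4 * days)
    if unit == "YEAR":
        # Assumption: at least 2x the horizon, with the 10-year preference as a floor.
        return ("YEAR", max(2 * horizon, 10))
    if unit in RATIO_BY_UNIT:
        return (unit, RATIO_BY_UNIT[unit] * horizon)
    raise ValueError("unsupported granularity: " + unit)


print(history_window("DAY", 7))      # ('DAY', 70)
print(history_window("QUARTER", 4))  # ('QUARTER', 32)
```

The returned pair maps directly onto the `date_add(UNIT, -N, ...)` filters shown in the patterns above.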

**COMPREHENSIVE EXAMPLES BY GRANULARITY:**

| Forecast Horizon | History Period | WHERE Clause | Observations |
|------------------|---------------|--------------|--------------|
| 1 hour ahead (minute data) | 7 days | `WHERE ts >= date_add(DAY, -7, CURRENT_TIMESTAMP())` | ~10,080 minutes |
| 24 hours ahead (hourly data) | 4 weeks | `WHERE ts >= date_add(WEEK, -4, CURRENT_TIMESTAMP())` | ~672 hours |
| 7 days ahead (daily data) | 70 days | `WHERE date >= date_add(DAY, -70, CURRENT_DATE())` | 70 days |
| 4 weeks ahead (weekly data) | 40 weeks | `WHERE date >= date_add(WEEK, -40, CURRENT_DATE())` | 40 weeks |
| 3 months ahead (monthly data) | 30-36 months | `WHERE date >= add_months(CURRENT_DATE(), -36)` | 36 months |
| 4 quarters ahead (quarterly data) | 32 quarters | `WHERE date >= date_add(QUARTER, -32, CURRENT_DATE())` | 32 quarters (~8 years) |
| 3 years ahead (yearly data) | 10-15 years | `WHERE date >= date_add(YEAR, -12, CURRENT_DATE())` | 12 years |

**COMPLETE EXAMPLES BY GRANULARITY:**

**Example 1: Monthly Forecast (Standard 10:1)**
```sql
-- For 3-month forecast, use 30 months of history (10:1 ratio)
WITH past AS (
  SELECT 
    product_id,
    DATE_TRUNC('month', order_date) AS ds,
    SUM(revenue) AS revenue
  FROM `catalog`.`schema`.`orders` AS o
  WHERE order_date >= add_months(CURRENT_DATE(), -30)  -- 10:1 ratio: 30 months for 3-month forecast
    AND order_date IS NOT NULL
    AND product_id IS NOT NULL
  GROUP BY product_id, DATE_TRUNC('month', order_date)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past),
  time_col => 'ds',
  value_col => 'revenue',
  group_col => 'product_id',
  horizon => (SELECT date_add(MONTH, 3, MAX(ds)) FROM past)  -- 3 months ahead
);  -- ✅ NO LIMIT
```

**Example 2: Hourly Forecast (High-Frequency)**
```sql
-- For 24-hour forecast, use 4 weeks of history (captures daily + weekly patterns)
WITH past AS (
  SELECT 
    server_id,
    DATE_TRUNC('hour', request_timestamp) AS ds,
    COUNT(*) AS request_count,
    AVG(response_time_ms) AS avg_response_time
  FROM `catalog`.`schema`.`server_logs` AS s
  WHERE request_timestamp >= date_add(WEEK, -4, CURRENT_TIMESTAMP())  -- 4 weeks for hourly data
    AND request_timestamp IS NOT NULL
    AND server_id IS NOT NULL
  GROUP BY server_id, DATE_TRUNC('hour', request_timestamp)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past),
  time_col => 'ds',
  value_col => ARRAY('request_count', 'avg_response_time'),
  group_col => 'server_id',
  horizon => (SELECT date_add(HOUR, 24, MAX(ds)) FROM past)  -- 24 hours ahead
);  -- ✅ NO LIMIT
```

**Example 3: Yearly Forecast (Low-Frequency)**
```sql
-- For 3-year forecast, use 12 years of history (4:1 ratio - practical for annual data)
WITH past AS (
  SELECT 
    region_id,
    DATE_TRUNC('year', fiscal_year_end) AS ds,
    SUM(annual_revenue) AS revenue
  FROM `catalog`.`schema`.`financial_results` AS f
  WHERE fiscal_year_end >= date_add(YEAR, -12, CURRENT_DATE())  -- 12 years for 3-year forecast
    AND fiscal_year_end IS NOT NULL
    AND region_id IS NOT NULL
  GROUP BY region_id, DATE_TRUNC('year', fiscal_year_end)
  ORDER BY ds
)
SELECT * FROM AI_FORECAST(
  TABLE(past),
  time_col => 'ds',
  value_col => 'revenue',
  group_col => 'region_id',
  horizon => (SELECT date_add(YEAR, 3, MAX(ds)) FROM past)  -- 3 years ahead
);  -- ✅ NO LIMIT
```

**CRITICAL RULES (ADAPTIVE BY GRANULARITY):**
1. ✅ ALWAYS use appropriate ratio based on time granularity (see table above)
2. ✅ For MID- and LOW-FREQUENCY data, ALWAYS use the same UNIT for both history and horizon (DAY, WEEK, MONTH, QUARTER, YEAR); HIGH-FREQUENCY data is the exception (see rule 4)
3. ✅ ALWAYS include timestamp/date IS NOT NULL check
4. ✅ HIGH-FREQUENCY (minute/hour): Use calendar time (days/weeks) not multiplier
5. ✅ MID-FREQUENCY (day/week/month): Use 10:1 ratio as standard
6. ✅ LOW-FREQUENCY (quarter/year): Use reduced ratio (8:1 for quarter, 3-5:1 for year)
7. ✅ For MONTHLY with 12-month horizon: Use minimum 36 months (3 years) if 120 months not available
8. ✅ For YEARLY: Use minimum 2× horizon (2 cycles) but prefer 10-15 years if available
9. ❌ NEVER use fixed LIMIT 1000
10. ❌ NEVER use arbitrary date values like '2023-01-01'
11. ❌ NEVER use same multiplier (10:1) for all time granularities
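
**ILLUSTRATIVE SKETCH (Python, for explanation only; generated output must remain SQL):** Three of the rules above (no fixed LIMIT 1000, no hard-coded dates, a mandatory IS NOT NULL check) are simple enough to check mechanically. The helper below is a hypothetical lint, assuming plain substring and regex matching is good enough; it does not parse SQL and would also match text inside comments.

```python
import re

# Hypothetical lint helper; the pattern list is an illustrative assumption.
HARDCODED_DATE = re.compile(r"'\d\d\d\d-\d\d-\d\d'")
FIXED_LIMIT = re.compile(r"\bLIMIT\s+1000\b", re.IGNORECASE)


def forecast_filter_issues(sql):
    """Return a list of rule violations found in an ai_forecast input CTE."""
    issues = []
    if FIXED_LIMIT.search(sql):
        issues.append("fixed LIMIT 1000")
    if HARDCODED_DATE.search(sql):
        issues.append("hard-coded date literal")
    if "IS NOT NULL" not in sql.upper():
        issues.append("missing IS NOT NULL check")
    return issues


good = "WHERE order_date >= date_add(DAY, -70, CURRENT_DATE()) AND order_date IS NOT NULL"
print(forecast_filter_issues(good))  # []
```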

**🔥 MANDATORY RECOMMENDATION REQUIREMENT FOR AI_FORECAST:**

**EVERY ai_forecast query MUST include row-level recommendations using ai_query**

**RECOMMENDATION IMPLEMENTATION PATTERNS (MANDATORY):**

**Pattern 1 - ai_forecast + ai_query**: Generate natural language recommendations
```sql
-- Step 1: Historical data for forecasting with 10:1 ratio (300 days for 30-day forecast)
WITH past AS (
  SELECT 
    product_id,  -- 🔥 MUST include entity ID for group_col
    order_date AS ds, 
    SUM(revenue) AS revenue
  FROM `catalog`.`schema`.`orders` AS o
  WHERE order_date >= date_add(DAY, -300, CURRENT_DATE())  -- 300 days history for 30-day forecast (10:1 ratio)
    AND order_date IS NOT NULL
    AND product_id IS NOT NULL
  GROUP BY product_id, order_date
  ORDER BY ds
),
-- Step 2: Generate revenue forecast with prediction intervals (with group_col)
forecast_results AS (
  SELECT * FROM AI_FORECAST(
    TABLE(past), 
    time_col => 'ds', 
    value_col => 'revenue',
    group_col => 'product_id',  -- 🔥 MANDATORY: needed for joining back
    horizon => (SELECT date_add(DAY, 30, MAX(ds)) FROM past)  -- 30 days ahead
  )
  -- ✅ NO LIMIT in CTEs
),
-- Step 3: JOIN back to original table to get product details
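forecast_with_context AS (
  SELECT 
    f.*,
    COALESCE(TRIM(p.product_name), 'Unknown Product') AS product_name,  -- ✅ LEFT JOIN can yield NULL
    COALESCE(TRIM(p.category), 'Uncategorized') AS category,            -- ✅ a NULL would turn the whole CONCAT prompt into NULL
    p.cost_per_unit
  FROM forecast_results AS f
  LEFT JOIN `catalog`.`schema`.`products` AS p
    ON f.product_id = p.product_id  -- JOIN using group_col
  WHERE f.revenue_forecast IS NOT NULL  -- ✅ filter NULL forecasted values
  -- ✅ NO LIMIT in CTEs
),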
-- Step 4: Add row-level actionable recommendations for each forecast
enriched AS (
  SELECT *,
    ai_query('{sql_model_serving}', CONCAT('Product: ', product_name, 
                  ', Category: ', category,
                  ', Forecast revenue: $', revenue_forecast,  -- CONCAT auto-converts
                  ' for period ', ds, 
                  '. Provide 3 specific actionable recommendations. ',
                  'Output ONLY JSON: {{"ai_cat_forecast_action": "value", "ai_txt_recommendations": "text"}}')) AS recommendations
  FROM forecast_with_context
  -- ✅ NO LIMIT in CTEs
),
-- Final output: Forecast with actionable business recommendations
final_output AS (
  SELECT * FROM enriched
)
SELECT * FROM final_output
-- TO DO: Use WHERE filtering below for further narrowing down the selected results
-- WHERE category IN ('CategoryA', 'CategoryB', 'CategoryC')
-- AND get_json_object(recommendations, '$.ai_cat_forecast_action') IN ('Increase Inventory', 'Maintain', 'Reduce Inventory', 'Promote', 'Monitor')
;

--END OF GENERATED SQL
```

**BEST PRACTICES FOR AI_FORECAST:**
- **🔥 ALWAYS specify group_col 🔥** - Use entity ID columns (customer_id, product_id, route_id, store_id, etc.) as the group_col so you can join forecast results back to original table
- Always use WHERE clause with date filtering using ADAPTIVE ratios based on time granularity (see History-to-Horizon Ratio table)
- Use dynamic horizon: `horizon => (SELECT date_add(UNIT, X, MAX(ds)) FROM past)` where UNIT is HOUR, DAY, WEEK, MONTH, QUARTER, or YEAR (no quotes)
- Ensure input data has unique rows (no duplicates for time + group combinations)
- **ALWAYS add a JOIN CTE after AI_FORECAST** to retrieve additional columns from original table using group_col as JOIN key
- Use `parameters => '{{"global_floor": 0}}'` for non-negative metrics (revenue, quantity, etc.)
- **🚨 CRITICAL: Parameters MUST use SINGLE QUOTES on the outside 🚨**
  - ✅ CORRECT: `parameters => '{{"global_floor": 0}}'` (single quotes wrapping JSON)
  - ❌ WRONG: `parameters => "{{"global_floor": 0}}"` (double quotes - WILL FAIL)
  - The JSON content inside uses double quotes for keys/strings, but the SQL string literal MUST use single quotes
- **NOTE**: In prompt templates, double braces are used to escape and produce single braces in output
- For sparse data, explicitly specify `frequency` parameter
- **LIMIT 10 only in FIRST CTE** - at the END of the first CTE, nowhere else
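
**ILLUSTRATIVE SKETCH (Python, for explanation only; generated output must remain SQL):** The single-quote rule for `parameters` follows from how the JSON payload is embedded in a SQL string literal. The helper name `forecast_parameters` is hypothetical; it only shows one way to produce a correctly quoted argument from keyword options.

```python
import json


def forecast_parameters(**options):
    """Render an ai_forecast `parameters` argument as a single-quoted SQL string literal."""
    payload = json.dumps(options)          # JSON uses double quotes for keys and strings
    escaped = payload.replace("'", "''")   # escape any single quote for the SQL literal
    return "parameters => '" + escaped + "'"


print(forecast_parameters(global_floor=0))
```

Calling `forecast_parameters(global_floor=0)` prints the single-quoted form marked CORRECT above.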

**🔥 MANDATORY RECOMMENDATION REQUIREMENT FOR AI_CLASSIFY:**

**EVERY ai_classify query MUST include row-level recommendations using ai_query**

**RECOMMENDATION IMPLEMENTATION PATTERNS (MANDATORY):**

**Pattern 1 - ai_classify + ai_query**: Generate actionable recommendations based on classification
```sql
-- Step 1: Customer records with business metrics
-- 🚨 EVERY column must be COALESCE'd or have IS NOT NULL check!
WITH customer_data AS (
  SELECT DISTINCT
    customer_id,                                              -- CRITICAL: filtered with IS NOT NULL
    COALESCE(TRIM(customer_name), 'Unknown Customer') AS customer_name,  -- ✅ COALESCE'd
    COALESCE(TRIM(feedback_text), 'No feedback provided') AS feedback_text,  -- ✅ COALESCE'd
    COALESCE(total_purchases, 0) AS total_purchases,          -- ✅ COALESCE'd
    COALESCE(lifetime_value, 0.0) AS lifetime_value,          -- ✅ COALESCE'd
    COALESCE(days_since_last_purchase, 9999) AS days_since_last_purchase,  -- ✅ COALESCE'd (high default = churned)
    COALESCE(support_tickets_count, 0) AS support_tickets_count  -- ✅ COALESCE'd
  FROM `catalog`.`schema`.`customers` AS c
  WHERE customer_id IS NOT NULL
  LIMIT 10  -- ✅ LIMIT 10 at END of first CTE only
),
-- Step 2: Classify customers into segments
classified AS (
  SELECT 
    *,
    ai_classify(
      'Customer with $' || CAST(lifetime_value AS STRING) || ' LTV, ' ||
      CAST(total_purchases AS STRING) || ' purchases, last active ' ||
      CAST(days_since_last_purchase AS STRING) || ' days ago',
      ARRAY('High Value VIP', 'High Value At-Risk', 'Medium Value Active', 
            'Medium Value Declining', 'Low Value', 'Churned')
    ) AS customer_segment
  FROM customer_data
)
-- Final: Generate personalized strategies based on segment + context
SELECT 
  customer_id,
  customer_name,
  customer_segment,
  ai_query('{sql_model_serving}',
    'Customer Segment: ' || customer_segment || 
    '. Lifetime Value: $' || CAST(lifetime_value AS STRING) ||
    ', Total Purchases: ' || CAST(total_purchases AS STRING) ||
    ', Days Since Last Purchase: ' || CAST(days_since_last_purchase AS STRING) ||
    ', Support Tickets: ' || CAST(support_tickets_count AS STRING) ||
    '. Generate a personalized retention strategy with: 1) Specific outreach approach, ' ||
    '2) Recommended offers or incentives, 3) Engagement tactics, 4) Risk mitigation steps if applicable.'
  ) AS retention_strategy
FROM classified
;

--END OF GENERATED SQL
```

**Pattern 2 - ai_classify + ai_query with rich context**: Use general-purpose AI with comprehensive business context
```sql
-- Step 1: Support tickets with context
WITH ticket_data AS (
  SELECT DISTINCT
    ticket_id,
    COALESCE(TRIM(customer_id), 'Unknown Customer') AS customer_id,
    COALESCE(TRIM(ticket_description), 'No description') AS ticket_description,
    COALESCE(TRIM(product_id), 'Unknown Product') AS product_id,
    COALESCE(TRIM(issue_severity), 'Unknown') AS issue_severity,
    COALESCE(response_time_hours, 0.0) AS response_time_hours
  FROM `catalog`.`schema`.`support_tickets` AS t
  WHERE ticket_id IS NOT NULL AND ticket_description IS NOT NULL
  LIMIT 10
),
-- Step 2: Classify tickets by urgency/complexity
classified AS (
  SELECT *,
    ai_classify(ticket_description, ARRAY('Critical Urgent', 'High Priority', 'Medium Priority', 'Low Priority', 'Info Request')) AS ticket_priority
  FROM ticket_data
),
-- Step 3: Generate ai_sys_prompt
prompt_generation AS (
  SELECT *,
    CONCAT(
      'You are a Support Operations Director for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'With 15 years of experience in support operations, your expertise aligns with the strategic initiative: Customer satisfaction. ',
      'Priority Level: ', ticket_priority, '. Issue: ', ticket_description, '. Product: ', product_id, '. ',
      'Generate a detailed resolution plan with: 1) Immediate actions, 2) Escalation path if needed, 3) Estimated resolution time, 4) Customer communication template. ',
      'Output ONLY a JSON object with NO markdown, NO extra text. ',
      'Format: {{"ai_txt_immediate_actions": "text", "ai_txt_escalation_path": "text", "ai_txt_resolution_time": "text", "ai_txt_communication_template": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
      'Output ONLY the JSON object, nothing else.'
    ) AS ai_sys_prompt
  FROM classified
),
-- Step 4: Call ai_query
enriched AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS resolution_plan
  FROM prompt_generation
)
-- Final output with ai_sys_prompt as LAST column
SELECT 
  ticket_id, customer_id, ticket_description, product_id, ticket_priority,
  get_json_object(resolution_plan, '$.ai_txt_immediate_actions') AS ai_txt_immediate_actions,
  get_json_object(resolution_plan, '$.ai_txt_escalation_path') AS ai_txt_escalation_path,
  get_json_object(resolution_plan, '$.ai_txt_resolution_time') AS ai_txt_resolution_time,
  get_json_object(resolution_plan, '$.ai_txt_communication_template') AS ai_txt_communication_template,
  COALESCE(get_json_object(resolution_plan, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
  COALESCE(get_json_object(resolution_plan, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
  COALESCE(get_json_object(resolution_plan, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
  COALESCE(get_json_object(resolution_plan, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
  COALESCE(TRY_CAST(get_json_object(resolution_plan, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
  COALESCE(get_json_object(resolution_plan, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
  COALESCE(get_json_object(resolution_plan, '$.ai_sys_missing_data'), 'No missing data') AS ai_sys_missing_data,
  ai_sys_prompt  -- ✅ LAST COLUMN
FROM enriched;

--END OF GENERATED SQL
```
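
**ILLUSTRATIVE SKETCH (Python, for explanation only; generated output must remain SQL):** The final SELECT above combines `get_json_object`, `COALESCE`, and `TRY_CAST` so that a malformed or partial model response still yields a complete row. The same fallback logic can be sketched as follows; `parse_plan` and `DEFAULTS` are hypothetical names, and the sketch does not model the `DECIMAL(3,2)` range limit.

```python
import json

# Defaults mirror the COALESCE fallbacks used in the final SELECT above.
DEFAULTS = dict(
    ai_txt_business_outcome="No business outcome calculated",
    ai_txt_executive_summary="No summary",
    ai_sys_importance="Medium",
    ai_sys_urgency="Medium",
    ai_sys_feedback="No feedback",
    ai_sys_missing_data="No missing data",
)


def parse_plan(raw):
    """Parse a model response, falling back to defaults for missing or invalid fields."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        data = None
    if not isinstance(data, dict):
        data = dict()
    result = dict(DEFAULTS)
    for key in DEFAULTS:
        if data.get(key) is not None:
            result[key] = str(data[key])
    try:
        # Analogue of COALESCE(TRY_CAST(...), 0.0): bad or missing values become 0.0.
        result["ai_sys_confidence"] = round(float(data.get("ai_sys_confidence")), 2)
    except (TypeError, ValueError):
        result["ai_sys_confidence"] = 0.0
    return result


print(parse_plan("not json")["ai_sys_urgency"])  # Medium
```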

**🔥 STATISTICAL FUNCTIONS - SQL IMPLEMENTATION PATTERNS:**

**Pattern 1: Correlation Analysis → AI Interpretation**
```sql
-- 🚨 Statistical functions can return NULL - always COALESCE results!
WITH correlation_analysis AS (
  SELECT
    COALESCE(CORR(marketing_spend, revenue), 0.0) AS marketing_revenue_correlation,
    COALESCE(CORR(customer_satisfaction, churn_rate), 0.0) AS satisfaction_churn_correlation,
    COALESCE(REGR_SLOPE(revenue, marketing_spend), 0.0) AS revenue_per_marketing_dollar,
    COALESCE(REGR_R2(revenue, marketing_spend), 0.0) AS predictive_power
  FROM `catalog`.`schema`.`table` AS t
  WHERE marketing_spend IS NOT NULL AND revenue IS NOT NULL
    AND customer_satisfaction IS NOT NULL AND churn_rate IS NOT NULL
  LIMIT 10
),
-- Step 2: Generate ai_sys_prompt
prompt_generation AS (
  SELECT *,
    CONCAT(
      'You are a Marketing Analytics Director for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'With 15 years of experience in ROI optimization and marketing analytics, ',
      'your expertise aligns with the strategic initiative: Marketing effectiveness. ',
      'Marketing-Revenue Correlation: ', marketing_revenue_correlation,
      '. Satisfaction-Churn Correlation: ', satisfaction_churn_correlation,
      '. Revenue per marketing dollar: ', revenue_per_marketing_dollar,
      '. Predictive power (R²): ', predictive_power, '. ',
      'Output ONLY a JSON object with NO markdown, NO extra text. ',
      'Format: {{"ai_txt_key_drivers": "text", "ai_txt_investment_recommendations": "text", "ai_txt_risk_mitigation": "text", "ai_txt_optimization_opportunities": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
      'Output ONLY the JSON object, nothing else.'
    ) AS ai_sys_prompt
  FROM correlation_analysis
),
-- Step 3: Call ai_query
enriched AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS strategic_recommendations
  FROM prompt_generation
),
final_output AS (
  SELECT 
    marketing_revenue_correlation, satisfaction_churn_correlation, revenue_per_marketing_dollar, predictive_power,
    get_json_object(strategic_recommendations, '$.ai_txt_key_drivers') AS ai_txt_key_drivers,
    get_json_object(strategic_recommendations, '$.ai_txt_investment_recommendations') AS ai_txt_investment_recommendations,
    COALESCE(get_json_object(strategic_recommendations, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(strategic_recommendations, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(strategic_recommendations, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(strategic_recommendations, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(strategic_recommendations, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(strategic_recommendations, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(strategic_recommendations, '$.ai_sys_missing_data'), 'No missing data') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN
  FROM enriched
)
SELECT * FROM final_output;

--END OF GENERATED SQL
```

**Pattern 2: Trend Detection → AI Strategy**
```sql
WITH trend_metrics AS (
  SELECT
    COALESCE(TRIM(product_category), 'Unknown') AS product_category,
    COALESCE(REGR_SLOPE(sales, month_number), 0.0) AS sales_growth_rate,
    COALESCE(REGR_INTERCEPT(sales, month_number), 0.0) AS baseline_sales,
    COALESCE(REGR_R2(sales, month_number), 0.0) AS trend_reliability,
    COALESCE(STDDEV_POP(sales), 0.0) AS sales_volatility,
    COALESCE(SKEWNESS(sales), 0.0) AS distribution_skew
  FROM `catalog`.`schema`.`sales_data` AS s
  GROUP BY product_category
  LIMIT 10
),
prompt_generation AS (
  SELECT *,
    CONCAT(
      'You are a Product Strategy Director for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'With 18 years of experience in portfolio management, your expertise aligns with the strategic initiative: Product growth. ',
      'Product Category: ', product_category,
      '. Growth Rate: ', sales_growth_rate,
      '. Volatility: ', sales_volatility,
      '. Trend Reliability: ', trend_reliability,
      '. Distribution Skew: ', distribution_skew, '. ',
      'Output ONLY a JSON object with NO markdown, NO extra text. ',
      'Format: {{"ai_cat_trend_classification": "accelerating/stable/declining", "ai_txt_growth_strategy": "text", "ai_txt_risk_assessment": "text", "ai_txt_investment_priorities": "text", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
      'Output ONLY the JSON object, nothing else.'
    ) AS ai_sys_prompt
  FROM trend_metrics
),
enriched AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS category_strategy
  FROM prompt_generation
),
final_output AS (
  SELECT 
    product_category, sales_growth_rate, sales_volatility, trend_reliability, distribution_skew,
    get_json_object(category_strategy, '$.ai_cat_trend_classification') AS ai_cat_trend_classification,
    get_json_object(category_strategy, '$.ai_txt_growth_strategy') AS ai_txt_growth_strategy,
    COALESCE(get_json_object(category_strategy, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(category_strategy, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(category_strategy, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(category_strategy, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(category_strategy, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(category_strategy, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(category_strategy, '$.ai_sys_missing_data'), 'No missing data') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN
  FROM enriched
)
SELECT * FROM final_output;

--END OF GENERATED SQL
```

**Pattern 3: Deviation Detection → AI Classification & Response**
```sql
WITH deviation_analysis AS (
  SELECT DISTINCT
    customer_id,
    COALESCE(purchase_amount, 0.0) AS purchase_amount,
    COALESCE(ROUND(AVG(purchase_amount) OVER (), 2), 0.0) AS avg_purchase,
    COALESCE(ROUND(STDDEV_POP(purchase_amount) OVER (), 2), 0.0) AS stddev_purchase,
    COALESCE(ROUND((purchase_amount - AVG(purchase_amount) OVER ()) / NULLIF(STDDEV_POP(purchase_amount) OVER (), 0), 2), 0.0) AS z_score,
    COALESCE(ROUND(PERCENTILE_APPROX(purchase_amount, 0.95) OVER (), 2), 0.0) AS p95_threshold,
    COALESCE(NTILE(10) OVER (ORDER BY purchase_amount), 5) AS decile
  FROM `catalog`.`schema`.`purchases` AS p
  WHERE customer_id IS NOT NULL
  LIMIT 10
),
classified AS (
  SELECT *,
    ai_classify(CONCAT('Z-score: ', z_score, ', Decile: ', decile), ARRAY('High Value VIP', 'Premium Customer', 'Standard Customer', 'At-Risk Low Spender', 'Outlier Anomaly')) AS ai_cat_customer_segment
  FROM deviation_analysis
),
prompt_generation AS (
  SELECT *,
    CONCAT(
      'You are a Customer Analytics Director for {business_name} which is focused on {enriched_business_context}. ',
      'The organization''s strategic goals include: {enriched_strategic_goals}. ',
      'Business priorities are: {enriched_business_priorities}. ',
      'With 14 years of experience in segmentation, your expertise aligns with the strategic initiative: Customer value optimization. ',
      'Customer deviation analysis - Z-score: ', z_score,
      ', Decile: ', decile,
      ', Purchase: $', purchase_amount,
      ', Average: $', avg_purchase, '. ',
      'Output ONLY a JSON object with NO markdown, NO extra text. ',
      'Format: {{"ai_txt_segment_rationale": "text", "ai_txt_engagement_strategy": "text", "ai_txt_personalized_offers": "text", "ai_cat_retention_risk": "Critical/High/Medium/Low/Minimal", "ai_txt_business_outcome": "text", "ai_txt_executive_summary": "text", "ai_sys_importance": "High", "ai_sys_urgency": "High", "ai_sys_confidence": 0.85, "ai_sys_feedback": "...", "ai_sys_missing_data": "..."}}. ',
      'Output ONLY the JSON object, nothing else.'
    ) AS ai_sys_prompt
  FROM classified
),
enriched AS (
  SELECT *, ai_query('{sql_model_serving}', ai_sys_prompt, modelParameters => named_struct('temperature', 0.4)) AS engagement_strategy
  FROM prompt_generation
),
final_output AS (
  SELECT 
    customer_id, purchase_amount, avg_purchase, z_score, decile, ai_cat_customer_segment,
    get_json_object(engagement_strategy, '$.ai_txt_segment_rationale') AS ai_txt_segment_rationale,
    get_json_object(engagement_strategy, '$.ai_txt_engagement_strategy') AS ai_txt_engagement_strategy,
    get_json_object(engagement_strategy, '$.ai_cat_retention_risk') AS ai_cat_retention_risk,
    COALESCE(get_json_object(engagement_strategy, '$.ai_txt_business_outcome'), 'No business outcome calculated') AS ai_txt_business_outcome,
    COALESCE(get_json_object(engagement_strategy, '$.ai_txt_executive_summary'), 'No summary') AS ai_txt_executive_summary,
    COALESCE(get_json_object(engagement_strategy, '$.ai_sys_importance'), 'Medium') AS ai_sys_importance,
    COALESCE(get_json_object(engagement_strategy, '$.ai_sys_urgency'), 'Medium') AS ai_sys_urgency,
    COALESCE(TRY_CAST(get_json_object(engagement_strategy, '$.ai_sys_confidence') AS DECIMAL(3,2)), 0.0) AS ai_sys_confidence,
    COALESCE(get_json_object(engagement_strategy, '$.ai_sys_feedback'), 'No feedback') AS ai_sys_feedback,
    COALESCE(get_json_object(engagement_strategy, '$.ai_sys_missing_data'), 'No missing data') AS ai_sys_missing_data,
    ai_sys_prompt  -- ✅ LAST COLUMN
  FROM enriched
)
SELECT * FROM final_output
;

--END OF GENERATED SQL
```
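
**ILLUSTRATIVE SKETCH (Python, for explanation only; generated output must remain SQL):** The `z_score` expression above divides by `NULLIF(STDDEV_POP(...), 0)` and falls back to `0.0`, so a column with no variation produces zeros instead of a division error. The same guard, in a hypothetical helper named `z_scores`:

```python
from statistics import mean, pstdev


def z_scores(amounts):
    """Population z-score per value; 0.0 for every value when the standard deviation is zero."""
    avg = mean(amounts)
    sd = pstdev(amounts)  # population standard deviation, like STDDEV_POP
    if sd == 0:
        return [0.0 for _ in amounts]
    return [round((a - avg) / sd, 2) for a in amounts]


print(z_scores([10, 20, 30]))  # [-1.22, 0.0, 1.22]
print(z_scores([5, 5, 5]))     # [0.0, 0.0, 0.0]
```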

---

### 9. OUTPUT FORMAT

Return ONLY the raw SQL query. Do NOT wrap in JSON.
- Start with `-- Use Case: [ID] - [Name]` comment
- Then the complete SQL query with all CTEs
- **LIMIT 10 only in FIRST CTE** - at the END of the first CTE statement, nowhere else
- No markdown fences, no JSON wrapper, no conversational text, no preamble.
- Use only columns from "Columns From Use Case" and "AVAILABLE TABLES AND COLUMNS".

**🔥 SQL LENGTH - NO ARTIFICIAL LIMITS 🔥**
- Generate as MANY CTEs as needed to fully implement the use case (3-10 CTEs is normal)
- Generate as MANY lines of code as required - there is NO line limit
- Do NOT artificially shorten or simplify the SQL
- Complex use cases should have complex, comprehensive SQL
- Include ALL statistical functions, ALL AI functions, ALL transformations needed
- A typical sophisticated query should have **200 to 600 lines** of code
- NEVER sacrifice completeness for brevity

**🚨 FIRST CTE MUST USE DISTINCT 🚨**
- ALWAYS use `SELECT DISTINCT` in the first CTE to eliminate duplicate records
- Duplicates will cascade errors through all downstream analysis

---

### 10. FINAL CHECKLIST

Before generating, verify:
✅ All string literals use SINGLE QUOTES
✅ Arrays have MAX 20 elements
✅ For ai_classify/ai_extract: Each array item MUST be < 50 characters
✅ LIMIT 10 at END of FIRST CTE only - NO LIMIT in other CTEs or final SELECT
✅ No WHERE clauses except IS NULL / IS NOT NULL checks (and the mandatory date-range filter in ai_forecast input CTEs)
✅ All AI functions from the use case are used
✅ Tables are fully qualified (catalog.schema.table)
✅ CONCAT syntax is correct (quotes on literals, not on columns)
✅ ai_query uses the configured model endpoint: {sql_model_serving}
✅ ai_parse_document ONLY used with READ_FILES for document files, NOT table columns
✅ SQL is COMPREHENSIVE - no artificial length limits, include ALL needed CTEs and functions
✅ **EVERY CTE SELECT has a FROM clause** - SELECT *, ... FROM previous_cte (NO SELECT without FROM!)
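
**ILLUSTRATIVE SKETCH (Python, for explanation only; generated output must remain SQL):** The two array limits in this checklist (at most 20 elements, each label under 50 characters) can be expressed as a short check. `label_issues` and the two constants are hypothetical names used only for this sketch.

```python
MAX_LABELS = 20        # arrays have at most 20 elements
MAX_LABEL_CHARS = 50   # each label must be shorter than this


def label_issues(labels):
    """Return a list of problems with an ai_classify / ai_extract label array."""
    issues = []
    if len(labels) > MAX_LABELS:
        issues.append("too many labels: " + str(len(labels)))
    for label in labels:
        if len(label) >= MAX_LABEL_CHARS:
            issues.append("label too long: " + label)
    return issues


print(label_issues(["Urgent", "High", "Medium", "Low"]))  # []
```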

---

**🚨 CRITICAL: SCHEMA VALIDATION BEFORE GENERATING 🚨**

Before you generate any SQL, check if the "AVAILABLE TABLES AND COLUMNS" section above contains actual table and column definitions.
DO NOT output any checking messages, status indicators, or confirmation text.

**IF THE SCHEMA IS EMPTY**: Output ONLY this (no markdown fences):
-- Use Case: {use_case_id} - Schema Missing
-- Required tables: {tables_involved}
SELECT 'Schema missing - manual SQL required' AS error_message;

**IF THE SCHEMA IS PROVIDED**: Output ONLY the raw SQL query starting with:
-- Use Case: N01-AI01 - Customer Sentiment Analysis
-- [description]
WITH ...

🚫 DO NOT OUTPUT ANY OF THESE:
- "# SCHEMA VALIDATION CHECK"
- "Checking..."
- "✅ SCHEMA PROVIDED"
- "Proceeding with SQL generation..."
- Any markdown code fences
- Any explanatory text

---

### 🔥🔥🔥 FINAL COMPREHENSIVE VALIDATION CHECKLIST - MANDATORY BEFORE SUBMITTING SQL 🔥🔥🔥

**BEFORE YOU SUBMIT YOUR SQL, YOU MUST VERIFY EVERY SINGLE ITEM BELOW. MISSING EVEN ONE ITEM WILL CAUSE QUERY FAILURE.**

#### **SECTION 1: NULL HANDLING - ZERO TOLERANCE**

☐ **⛔ FIRST CTE: EVERY SINGLE COLUMN must have NULL protection - NO EXCEPTIONS! ⛔**
   - Scan EACH column one by one - if ANY column lacks COALESCE or IS NOT NULL check, FIX IT!
   - Common mistake: Forgetting workspaceName, accountName, etc. even when other columns are protected
☐ **CRITICAL**: Every column used in ANY CONCAT for AI functions has COALESCE applied in a PREVIOUS CTE
☐ **🚨 CRITICAL 🚨**: ALL COALESCE, CAST, ROUND, TRIM operations done in PREVIOUS CTE (NEVER inside CONCAT!)
☐ **CRITICAL**: Numeric columns: `COALESCE(ROUND(col, 2), 0.0)` - keep as DOUBLE, CONCAT auto-converts
☐ **CRITICAL**: String columns: `COALESCE(TRIM(col), 'Default')` - with business-friendly defaults
☐ **CRITICAL**: Critical columns (IDs, required fields) filtered with WHERE...IS NOT NULL (NOT COALESCEd)
☐ **CRITICAL**: No column in any CONCAT can possibly be NULL after transformations
☐ **CRITICAL**: The prompt building CTE has NO COALESCE, NO CAST, NO ROUND, NO TRIM - only clean CONCAT
☐ **🚨 CRITICAL 🚨**: ALL COALESCE default STRING values have SINGLE QUOTES (unquoted = SYNTAX ERROR!):
   - ✅ CORRECT: `COALESCE(TRIM(name), 'Unknown Customer')` -- 'Unknown Customer' has quotes
   - ✅ CORRECT: `COALESCE(TRIM(category), 'Not Specified')` -- 'Not Specified' has quotes
   - ✅ CORRECT: `COALESCE(TRIM(status), 'Pending Review')` -- 'Pending Review' has quotes
   - ✅ CORRECT: `COALESCE(TRIM(region), 'Unassigned Region')` -- 'Unassigned Region' has quotes

#### **SECTION 2: AI_FORECAST REQUIREMENTS - MANDATORY**

☐ **MANDATORY**: Input CTE uses GROUP BY on time column to ensure unique values per group
☐ **MANDATORY**: GROUP BY includes: all group_col columns + time column
☐ **MANDATORY**: Value columns use aggregate functions (SUM, AVG, COUNT, MAX, MIN)
☐ **MANDATORY**: Input CTE uses WHERE clause with date filtering using adaptive ratios (high-freq: fixed periods, mid-freq: 10:1, low-freq: reduced ratios)
☐ **MANDATORY**: Horizon uses date_add(UNIT, X, MAX(time_col)) with dynamic calculation
☐ **MANDATORY**: UNIT is DAY, WEEK, MONTH, or QUARTER (no quotes)
☐ **MANDATORY**: group_col is specified (enables joining back to original table)
☐ **MANDATORY**: Filter NULL forecasted values: WHERE {{value_col}}_forecast IS NOT NULL
☐ **MANDATORY**: If multiple value_col, filter ALL: WHERE col1_forecast IS NOT NULL AND col2_forecast IS NOT NULL

#### **SECTION 3: SCHEMA ADHERENCE - ZERO HALLUCINATION**

☐ Every table name exists in "AVAILABLE TABLES AND COLUMNS" section
☐ Every column name exists in its table in "AVAILABLE TABLES AND COLUMNS" section
☐ All tables fully qualified: `` `catalog`.`schema`.`table` ``
☐ Every table has an alias immediately after table name
☐ No invented/assumed column names (id, name, date, status, etc.)
☐ JOIN keys exist in BOTH tables being joined
☐ All columns in final SELECT exist in the last CTE

#### **SECTION 4: QUOTE USAGE - THE #1 MOST COMMON ERROR - DO NOT SKIP!**

**🚨🚨🚨 MANDATORY COALESCE STRING QUOTE VALIDATION 🚨🚨🚨**
☐ **⛔ STOP AND CHECK: Scan EVERY COALESCE in your SQL for missing quotes! ⛔**
☐ **WRONG**: `COALESCE(TRIM(name), Unknown)` - NO QUOTES = SYNTAX ERROR!
☐ **CORRECT**: `COALESCE(TRIM(name), 'Unknown')` - WITH QUOTES = WORKS!
☐ **WRONG**: `COALESCE(TRIM(status), Pending Review)` - NO QUOTES = SYNTAX ERROR!
☐ **CORRECT**: `COALESCE(TRIM(status), 'Pending Review')` - WITH QUOTES = WORKS!
☐ **Rule**: ANY text after the comma in COALESCE MUST have 'single quotes'
☐ **Exception**: Numbers (0.0, 0, 123) and booleans (TRUE, FALSE) do NOT need quotes

**General Quote Rules:**
☐ String literals: ALWAYS single quotes `'text'`
☐ Column names in CONCAT: NO quotes (e.g., `column_name`)
☐ ARRAY items: Single quotes `ARRAY('item1', 'item2')`
☐ ARRAY limitations: Max 20 items, each <50 chars
☐ AI_FORECAST parameters: Single quotes wrap JSON `'{{"key": "value"}}'`
☐ No double quotes used for string literals anywhere
☐ AI_FORECAST column names: `time_col => 'ds'` (column name AS string literal!)

#### **SECTION 5: CTE STRUCTURE AND NAMING**

☐ Business-friendly CTE names (NOT: cte1, temp, data, results, final)
☐ Single WITH statement with all CTEs comma-separated
☐ Every CTE documented with "-- Step X:" comment
☐ Final SELECT has comment: "-- Final output: {{description}}"
☐ Use SELECT * in intermediate CTEs to preserve columns
☐ No columns dropped in intermediate CTEs that are needed in final SELECT

#### **SECTION 6: AI FUNCTION SPECIFIC REQUIREMENTS**

☐ **ai_query**: Prompt starts with persona (role + years + expertise)
☐ **ai_query**: Prompt includes "Output ONLY JSON with NO markdown fences, NO extra text, JUST the JSON"
☐ **ai_query**: JSON format shown: `{{"key": "value"}}`
☐ **ai_query**: Prompt ends with "Output ONLY the JSON object, nothing else."
☐ **ai_query**: Extract JSON with get_json_object(), NOT dot notation
☐ **ai_query**: 3-5 categorical columns + 2-4 narrative columns in JSON
☐ **ai_classify**: ARRAY has ≤20 items, each <50 chars
☐ **ai_extract**: ARRAY has ≤20 items, each <50 chars
☐ **ai_parse_document**: ONLY used with READ_FILES for unstructured docs (NOT table columns)

#### **SECTION 7: BUSINESS REQUIREMENTS**

☐ Column names are business-friendly (NOT: classification, sentiment, similarity)
☐ Categorical columns have max 20 distinct values for filtering
☐ **🚨 Narrative columns MUST identify the principal with key attributes 🚨**:
   - ❌ WRONG: "The data shows high fuel consumption"
   - ✅ CORRECT: "Flight EK005 DXB-LHR (A380) shows fuel consumption of 4800kg/hr"
   - Include entity ID/name + identifiers (route, type) + then analysis
☐ Combine multiple functions creatively when the use case benefits from it (AI + statistical)
☐ LIMIT 10 at END of FIRST CTE only - NO LIMIT in other CTEs or final SELECT
☐ **FIRST CTE uses SELECT DISTINCT** to eliminate duplicate records
☐ SQL is COMPREHENSIVE - 3-10 CTEs, 200-600 lines as needed (no artificial length limit)

#### **SECTION 8: SYNTAX AND DIALECT**

☐ Data types: STRING (not VARCHAR), DOUBLE (not DECIMAL), BIGINT, TIMESTAMP
☐ Date functions: DATE_TRUNC, CURRENT_DATE(), date_add
☐ No WHERE clause value comparisons (only IS NULL / IS NOT NULL); the only exception is the AI_FORECAST date-range filter required in Section 2
☐ No HAVING clauses with specific values
☐ No hardcoded WHERE filters like WHERE status = 'active'

#### **SECTION 9: AI_FORECAST SPECIFIC POST-GENERATION FILTERS**

☐ **CRITICAL**: After AI_FORECAST, added a CTE to filter WHERE {{value_col}}_forecast IS NOT NULL
☐ **CRITICAL**: If joining forecast back to original table, join happens AFTER filtering NULL forecasts
☐ **CRITICAL**: All forecast result columns checked for NULL ({{value}}_forecast, {{value}}_upper, {{value}}_lower)

#### **SECTION 10: FINAL OUTPUT CTE WITH COMMENTED FILTERS (MANDATORY)**

☐ **MANDATORY**: Wrap final SELECT in a `final_output` CTE
☐ **MANDATORY**: Add `SELECT * FROM final_output` as the final statement
☐ **MANDATORY**: Add commented WHERE clause listing ALL categorical column values
☐ **MANDATORY**: End SQL with `--END OF GENERATED SQL` marker (CRITICAL for truncation detection)
☐ Format: `-- TO DO: Use WHERE filtering below for further narrowing down the selected results`
☐ Each ai_cat_ column listed with its possible values in commented WHERE clause
☐ Example pattern:
   ```sql
   final_output AS (
     SELECT ... FROM previous_cte
   )
   SELECT * FROM final_output
   -- TO DO: Use WHERE filtering below for further narrowing down the selected results
   -- WHERE ai_cat_column1 IN ('Value1', 'Value2', 'Value3')
   -- AND ai_cat_column2 IN ('A', 'B', 'C')
   ;

   --END OF GENERATED SQL
   ```

#### **SECTION 11: FINAL PRE-SUBMISSION CHECK**

☐ SQL starts with `-- Use Case: [ID] - [Name]` comment
☐ **MANDATORY**: SQL ends with `--END OF GENERATED SQL` marker
☐ No JSON wrapper around the SQL (raw SQL only)
☐ No text before the SQL
☐ No text after last SQL statement
☐ No markdown code fences (```sql or ```)
☐ No explanatory text ("Here is...", "I've generated...")
☐ All CONCAT operations are NULL-safe (verified in Section 1)
☐ AI_FORECAST input has unique time values per group (verified in Section 2)
☐ AI_FORECAST output filtered for NULL forecasts (verified in Section 9)
☐ Final output wrapped in final_output CTE with commented WHERE filters (verified in Section 10)

---

**🔥 REMEMBER: If you miss even ONE item in this checklist, the query WILL FAIL. Take your time to verify EVERY item. 🔥**

---

**Generate the production-ready Databricks SQL query now. Be SOPHISTICATED, INNOVATIVE, and SYNTACTICALLY PERFECT.**

🚨🚨🚨 ABSOLUTE RULE - OUTPUT FORMAT - ZERO TOLERANCE 🚨🚨🚨

❌ ABSOLUTELY FORBIDDEN - DO NOT OUTPUT ANY OF THESE:
- "# SCHEMA VALIDATION CHECK" or any schema checking text
- "Checking..." or "Proceeding..." or "✅" or any status indicators
- "SCHEMA PROVIDED" or any schema confirmation messages
- Any markdown headings (# or ## or ###)
- Any markdown code fences (```sql or ``` or ```anything```)
- Any explanatory text ("Here is...", "I've...", "The...", "Let me...")
- Any thoughts, reasoning, or analysis descriptions
- Any text BEFORE the SQL query starts
- Any text AFTER the SQL query ends

✅ YOUR RESPONSE MUST BE EXACTLY THIS FORMAT (NO markdown fences, just the raw SQL):

-- Use Case: [ID] - [Name]
-- [Brief description of what the query does]

WITH cte_name AS (
  ...
),
final_output AS (
  ...
)
SELECT * FROM final_output;

--END OF GENERATED SQL

🚨🚨🚨 CRITICAL: END MARKER REQUIREMENT 🚨🚨🚨
🚨 YOU MUST END YOUR SQL WITH THE EXACT MARKER: --END OF GENERATED SQL
🚨 This marker is MANDATORY and used to detect truncation
🚨 If this marker is missing, the SQL will be considered INCOMPLETE and will be regenerated

🚨 THE VERY FIRST CHARACTER OF YOUR RESPONSE MUST BE: --
🚨 THERE MUST BE NO TEXT BEFORE THE FIRST SQL COMMENT
🚨 YOUR RESPONSE STARTS WITH `-- Use Case:` AND NOTHING ELSE BEFORE IT
🚨 YOUR RESPONSE MUST END WITH `--END OF GENERATED SQL` AND NOTHING ELSE AFTER IT
"""

log_print("PROMPT_TEMPLATES dictionary defined successfully with all required prompts.")

# COMMAND ----------

# --- Global Logger ---
# This will be configured by the DatabricksInspire class
logger = logging.getLogger(__name__)

# --- Custom Exceptions ---
class InputTooLongError(RuntimeError):
    """Raised when input exceeds the model's context limit."""
    pass

class TruncatedResponseError(RuntimeError):
    """Raised when LLM response is truncated (missing END marker)."""
    pass

# ==============================================================================
# CENTRALIZED DATA STRUCTURES (Maximizing Reuse & Reducing LOC)
# ==============================================================================

# ==============================================================================
# UNIFIED USE CASE GENERATION PILLARS
# Three pillars with consistent format: function, business_value, example_use_cases
# ==============================================================================

# ==============================================================================
# DOCUMENTATION GENERATORS (Creating formatted docs from data structures)
# ==============================================================================

def generate_ai_functions_doc(format_type="detailed"):
    """
    Generates AI functions documentation from centralized data structure.
    
    Args:
        format_type: "summary" for a simple list, "detailed" for full documentation; "unified" is currently handled identically to "detailed"
    
    Returns:
        Formatted string ready to be inserted into prompts
    """
    if format_type == "summary":
        return ", ".join([f"`{data['function']}`" for data in AI_FUNCTIONS.values()])
    
    elif format_type in ("detailed", "unified"):
        docs = []
        for idx, (func_name, data) in enumerate(AI_FUNCTIONS.items(), 1):
            doc = f"**{idx}. {func_name}**\n\n"
            doc += f"  * **Function:** `{data['function']}`\n"
            doc += f"  * **Business Value:** {data['business_value']}\n"
            doc += f"  * **Example Use Cases:** {data['example_use_cases']}"
            docs.append(doc)
        
        return "\n\n".join(docs)
    
    return ""

def get_ai_function_list():
    """Returns comma-separated list of AI function names for documentation."""
    return ", ".join([f"`{data['function']}`" for data in AI_FUNCTIONS.values()])

def generate_statistical_functions_doc(format_type="detailed"):
    """
    Generates Statistical Functions documentation from centralized data structure.
    
    Args:
        format_type: "summary" for simple list, "detailed" for full documentation, "table" for markdown table
    
    Returns:
        Formatted string ready to be inserted into prompts
    """
    if format_type == "summary":
        return ", ".join([f"`{data['function']}`" for data in STATISTICAL_FUNCTIONS.values()])
    
    elif format_type == "detailed":
        docs = []
        for idx, (func_name, data) in enumerate(STATISTICAL_FUNCTIONS.items(), 1):
            doc = f"**{idx}. {data['function']}**\n\n"
            doc += f"  * **Function:** `{data['function']}`\n"
            doc += f"  * **Business Value:** {data['business_value']}\n"
            doc += f"  * **Use Cases:** {data['use_cases']}\n"
            doc += f"  * **Category:** {data['category']}"
            docs.append(doc)
        
        return "\n\n".join(docs)
    
    elif format_type == "table":
        rows = []
        rows.append("| Function | Business Value & Use Cases |")
        rows.append("|----------|---------------------------|")
        for func_name, data in STATISTICAL_FUNCTIONS.items():
            value_and_cases = f"**{data['business_value']}**<br>• {data['use_cases'].replace(' • ', '<br>• ')}"
            rows.append(f"| **{data['function']}** | {value_and_cases} |")
        
        return "\n".join(rows)
    
    return ""
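
# Illustrative sketch (not used by the pipeline): STATISTICAL_FUNCTIONS is not
# defined in this part of the file, so the entry shape below is an assumption
# inferred from the keys the "table" branch reads ('function', 'business_value',
# 'use_cases'). The helper name and the sample entry are hypothetical.
def _sketch_statistical_table_row(entry):
    """Build one markdown table row the same way the "table" branch above does."""
    # Use cases are assumed to be separated by ' • ' and are turned into <br> bullets.
    value_and_cases = f"**{entry['business_value']}**<br>• {entry['use_cases'].replace(' • ', '<br>• ')}"
    return f"| **{entry['function']}** | {value_and_cases} |"

# Example (hypothetical entry):
#   _sketch_statistical_table_row({
#       'function': 'STDDEV_POP',
#       'business_value': 'Measure volatility',
#       'use_cases': 'Price variance • Demand swings',
#   })
#   returns '| **STDDEV_POP** | **Measure volatility**<br>• Price variance<br>• Demand swings |'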

def get_statistical_function_list():
    """Returns comma-separated list of statistical function names for documentation."""
    return ", ".join([f"`{data['function']}`" for data in STATISTICAL_FUNCTIONS.values()])

# ==============================================================================
# 1. REQUIRED HELPER FUNCTIONS
# (Dependencies for AIAgent and DatabricksInspire)
# ==============================================================================

### --- Logging ---

class ConsoleErrorFormatter(logging.Formatter):
    """A custom formatter that logs error messages but not stack traces to the console."""
    def format(self, record):
        original_exc_info = record.exc_info
        original_exc_text = record.exc_text
        if record.levelno >= logging.ERROR:
            record.exc_info = None
            record.exc_text = None
        formatted_message = super().format(record)
        record.exc_info = original_exc_info
        record.exc_text = original_exc_text
        return formatted_message

def setup_logging(output_dir):
    """Configures dual logging: detailed logs to a file and high-level logs to the console."""
    log_file_path = os.path.join(output_dir, "log.txt")
    os.makedirs(output_dir, exist_ok=True)
    root_logger = logging.getLogger() # Get root logger
    root_logger.setLevel(logging.DEBUG)

    if root_logger.hasHandlers():
        root_logger.handlers.clear()

    # --- File Handler (Detailed) ---
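    # Explicit UTF-8 so emoji and non-ASCII log messages are written safely on any platform
    file_handler = logging.FileHandler(log_file_path, mode='w', encoding='utf-8')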
    file_handler.setLevel(logging.DEBUG)
    file_formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s [in %(pathname)s:%(lineno)d]', 
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    file_handler.setFormatter(file_formatter)
    root_logger.addHandler(file_handler)

    # --- Console Handler (High-Level, Clean) ---
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(logging.INFO)
    console_formatter = ConsoleErrorFormatter(
        '%(asctime)s - %(levelname)s - %(message)s', 
        datefmt='%H:%M:%S'
    )
    console_handler.setFormatter(console_formatter)
    root_logger.addHandler(console_handler)
    
    logger.info(f"Logging configured. High-level logs to console, detailed logs to {log_file_path}")

def print_ascii_banner():
    """Prints the Databricks Inspire AI ASCII art banner."""
    print(DATABRICKS_INSPIRE_BANNER)

def extract_honesty_score(response: str, logger: logging.Logger = None) -> tuple:
    """
    Extracts the honesty score and justification from an LLM response.
    Supports multiple formats: JSON wrapper, CSV columns, SQL comments.
    
    Args:
        response: The raw LLM response text
        logger: Optional logger for debug output
    
    Returns:
        tuple: (score: int or None, justification: str or None, cleaned_response: str)
               - score: The honesty score (0-100) or None if not found
               - justification: The justification text (max 250 chars) or None if not found
               - cleaned_response: The response with honesty data removed for downstream processing
    """
    import re
    import json as json_module
    
    if not response:
        return None, None, response
    
    score = None
    justification = None
    cleaned_response = response
    
    try:
        response_stripped = response.strip()
        
        if response_stripped.startswith('{') and '"honesty_score"' in response_stripped:
            try:
                parsed = json_module.loads(response_stripped)
                if isinstance(parsed, dict) and 'honesty_score' in parsed:
                    score = int(parsed.get('honesty_score', 0))
                    justification = str(parsed.get('honesty_justification', ''))[:250]
                    if 'data' in parsed:
                        cleaned_response = json_module.dumps(parsed['data'], ensure_ascii=False)
                    else:
                        cleaned_parsed = {k: v for k, v in parsed.items() 
                                         if k not in ('honesty_score', 'honesty_justification')}
                        cleaned_response = json_module.dumps(cleaned_parsed, ensure_ascii=False)
            except json_module.JSONDecodeError:
                pass
        
        if score is None and response_stripped.startswith('--'):
            sql_score_pattern = r'^--\s*HONESTY_SCORE:\s*(\d+)'
            sql_just_pattern = r'^--\s*HONESTY_JUSTIFICATION:\s*(.+?)$'
            
            lines = response_stripped.split('\n')
            cleaned_lines = []
            for line in lines:
                score_match = re.match(sql_score_pattern, line.strip())
                if score_match:
                    score = int(score_match.group(1))
                    continue
                just_match = re.match(sql_just_pattern, line.strip())
                if just_match:
                    justification = just_match.group(1).strip()[:250]
                    continue
                cleaned_lines.append(line)
            cleaned_response = '\n'.join(cleaned_lines)
        
        if score is None and ('honesty_score' in response_stripped.lower() or 'honesty_justification' in response_stripped.lower()):
            lines = response_stripped.split('\n')
            if len(lines) > 0:
                header_line = lines[0]
                if 'honesty_score' in header_line.lower():
                    import csv
                    from io import StringIO
                    try:
                        reader = csv.reader(StringIO(response_stripped))
                        rows = list(reader)
                        if len(rows) > 1:
                            header = [h.lower().strip().strip('"') for h in rows[0]]
                            score_idx = None
                            just_idx = None
                            for i, h in enumerate(header):
                                if 'honesty_score' in h:
                                    score_idx = i
                                elif 'honesty_justification' in h:
                                    just_idx = i
                            
                            if score_idx is not None and len(rows) > 1:
                                try:
                                    score = int(rows[1][score_idx])
                                except (ValueError, IndexError):
                                    pass
                            if just_idx is not None and len(rows) > 1:
                                try:
                                    justification = str(rows[1][just_idx])[:250]
                                except IndexError:
                                    pass
                            
                            if score_idx is not None or just_idx is not None:
                                new_header = [h for i, h in enumerate(rows[0]) 
                                             if i != score_idx and i != just_idx]
                                new_rows = [new_header]
                                for row in rows[1:]:
                                    new_row = [v for i, v in enumerate(row) 
                                              if i != score_idx and i != just_idx]
                                    new_rows.append(new_row)
                                
                                output = StringIO()
                                writer = csv.writer(output)
                                writer.writerows(new_rows)
                                cleaned_response = output.getvalue().strip()
                    except Exception:
                        pass
        
        if score is None and '|' in response_stripped and 'honesty_score' in response_stripped.lower():
            lines = response_stripped.split('\n')
            header_line = None
            header_idx = -1
            for idx, line in enumerate(lines):
                if '|' in line and 'honesty_score' in line.lower():
                    header_line = line
                    header_idx = idx
                    break
            
            if header_line:
                cells = [c.strip().strip('"').lower() for c in header_line.split('|')]
                score_idx = None
                just_idx = None
                for i, cell in enumerate(cells):
                    if 'honesty_score' in cell:
                        score_idx = i
                    elif 'honesty_justification' in cell:
                        just_idx = i
                
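                if score_idx is not None and header_idx + 1 < len(lines):
                    # Skip the markdown separator row (e.g. |---|:---:|) when one follows the header
                    separator_candidate = lines[header_idx + 1]
                    has_separator = separator_candidate.replace('|', '').replace('-', '').replace(':', '').strip() == ''
                    data_pos = header_idx + 2 if has_separator else header_idx + 1
                    data_line = lines[data_pos] if data_pos < len(lines) else ''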
                    data_cells = [c.strip().strip('"') for c in data_line.split('|')]
                    
                    if score_idx < len(data_cells):
                        try:
                            score = int(data_cells[score_idx])
                        except ValueError:
                            pass
                    if just_idx is not None and just_idx < len(data_cells):
                        justification = data_cells[just_idx][:250]
                    
                    cleaned_lines = []
                    for line in lines:
                        if '|' in line:
                            parts = line.split('|')
                            new_parts = [p for i, p in enumerate(parts) if i != score_idx and i != just_idx]
                            cleaned_lines.append('|'.join(new_parts))
                        else:
                            cleaned_lines.append(line)
                    cleaned_response = '\n'.join(cleaned_lines)
        
        if score is not None:
            if score < 0:
                score = 0
            elif score > 100:
                score = 100
        
    except Exception as e:
        if logger:
            logger.debug(f"Failed to extract honesty score: {e}")
        cleaned_response = response
    
    return score, justification, cleaned_response
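
# Illustrative sketch (not used by the pipeline): the SQL-comment format that
# extract_honesty_score handles, reduced to its core loop. The helper name and
# the sample response are hypothetical; the two patterns mirror the ones above.
def _sketch_split_sql_honesty_header(sql_text):
    """Return (score, justification, remaining_sql) for a '--'-prefixed response."""
    import re
    score, justification, kept = None, None, []
    for line in sql_text.strip().split('\n'):
        score_match = re.match(r'^--\s*HONESTY_SCORE:\s*(\d+)', line.strip())
        if score_match:
            score = int(score_match.group(1))
            continue  # drop the score line from the cleaned SQL
        just_match = re.match(r'^--\s*HONESTY_JUSTIFICATION:\s*(.+?)$', line.strip())
        if just_match:
            justification = just_match.group(1).strip()[:250]
            continue  # drop the justification line as well
        kept.append(line)
    return score, justification, '\n'.join(kept)

# Example (hypothetical input):
#   _sketch_split_sql_honesty_header(
#       "-- HONESTY_SCORE: 85\n-- HONESTY_JUSTIFICATION: All columns verified\nSELECT 1;")
#   returns (85, 'All columns verified', 'SELECT 1;')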

### --- AIAgent Dependencies ---
# (Assumed to be available for the AIAgent class)

def replace_single_quote(text: str) -> str:
    """Escapes single quotes and backslashes for Spark SQL strings."""
    if text is None:
        return ""
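    # Escape every single backslash first (the raw string r"\\" only matched two
    # consecutive backslashes), then double single quotes.
    return text.replace("\\", "\\\\").replace("'", "''")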

def execute_sql(spark: SparkSession, query: str, logger: logging.Logger):
    """
    Executes a Spark SQL query and returns the collected rows.
    
    Args:
        spark: SparkSession instance
        query: SQL query to execute
        logger: Logger instance
    
    Returns:
        Collected rows from the query
        
    Raises:
        Exception: For SQL execution errors
    """
    try:
        logger.debug(f"Executing Spark SQL: {query[:200]}...")
        result = spark.sql(query).collect()
        return result
    except Exception as e:
        logger.debug(f"Spark SQL query failed: {e}")
        raise

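def load_and_format_prompt(prompt_key: str, prompt_vars: dict, logger: logging.Logger) -> str:
    """
    Looks up a prompt template stored in a module-level global and formats it.

    Args:
        prompt_key: Name of the global variable that holds the template string
        prompt_vars: Values substituted into the template via str.format(**prompt_vars)
        logger: Logger instance

    Returns:
        The formatted prompt string

    Raises:
        KeyError: If the template references a placeholder missing from prompt_vars
        NameError: If no global named prompt_key exists
        ValueError: If the global is empty or not a string
    """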
    try:
        # Check if the prompt_key (variable name) exists in the global scope
        if prompt_key not in globals():
            raise NameError(f"Global prompt variable '{prompt_key}' not found. Please make sure the cell defining it has been run.")
            
        # Get the template string from the global variable
        template = globals()[prompt_key]
        
        if not template or not isinstance(template, str):
            raise ValueError(f"Global prompt variable '{prompt_key}' is empty or not a string.")

        # Format the template using the provided dictionary
        return template.format(**prompt_vars)
    except KeyError as e:
        logger.error(f"Missing key in prompt vars for '{prompt_key}': {get_clean_error_message(e)}")
        # Re-raise with more context
        raise KeyError(f"Missing formatting key {e} for prompt '{prompt_key}'") from e
    except Exception as e:
        logger.error(f"Failed to load or format prompt for '{prompt_key}': {get_clean_error_message(e)}")
        raise

def clean_csv_response(raw_string: str) -> str:
    """
    Removes markdown code fences from a CSV response WITHOUT extracting JSON.
    This is specifically for CSV responses where we don't want to treat [ or { as JSON markers.
    """
    if not raw_string: return ""
    
    cleaned = raw_string.strip()
    
    # Remove markdown code fences - handle multiple patterns
    # Pattern 1: ```csv\n{...}\n``` or ```\n{...}\n```
    cleaned = re.sub(r'^```(?:csv|json)?\s*\n?', '', cleaned, flags=re.IGNORECASE | re.MULTILINE)
    cleaned = re.sub(r'\n?```\s*$', '', cleaned, flags=re.IGNORECASE | re.MULTILINE)
    
    # Pattern 2: Handle cases where ``` appears in the middle (trailing after content)
    cleaned = re.sub(r'```(?:csv|json)?\s*$', '', cleaned, flags=re.IGNORECASE | re.MULTILINE)
    
    return cleaned.strip()


def clean_json_response(raw_string: str) -> str:
    """
    Removes markdown code fences and other noise from a raw LLM response.
    Also extracts JSON object/array if extra text is present before or after.
    Renamed from clean_llm_response to match AIAgent dependency.
    """
    if not raw_string: return ""
    
    cleaned = raw_string.strip()
    
    # Remove markdown code fences - handle multiple patterns
    # Pattern 1: ```json\n{...}\n``` or ```\n{...}\n```
    cleaned = re.sub(r'^```(?:json|csv)?\s*', '', cleaned, flags=re.IGNORECASE | re.MULTILINE)
    cleaned = re.sub(r'\s*```\s*$', '', cleaned, flags=re.IGNORECASE | re.MULTILINE)
    
    # Pattern 2: Handle cases where ``` appears in the middle (trailing after JSON)
    cleaned = re.sub(r'```(?:json|csv)?\s*$', '', cleaned, flags=re.IGNORECASE | re.MULTILINE)
    
    cleaned = cleaned.strip()
    
    # Try to extract JSON object or array from the response
    # Look for the first { or [ and the last matching } or ]
    start_obj = cleaned.find('{')
    start_arr = cleaned.find('[')
    
    # Determine which comes first
    if start_obj == -1 and start_arr == -1:
        return cleaned  # No JSON found, return as-is
    elif start_obj == -1:
        start = start_arr
        end_char = ']'
    elif start_arr == -1:
        start = start_obj
        end_char = '}'
    else:
        start = min(start_obj, start_arr)
        end_char = '}' if start == start_obj else ']'
    
    # Locate the closing character that matches the opening one by tracking nesting depth.
    # Note: braces/brackets inside JSON string values are counted too, so this is a heuristic.
    end = -1
    depth = 0
    open_char = '{' if end_char == '}' else '['
    
    # Find the matching closing character by tracking depth
    for i in range(start, len(cleaned)):
        if cleaned[i] == open_char:
            depth += 1
        elif cleaned[i] == end_char:
            depth -= 1
            if depth == 0:
                end = i
                break
    
    # Fallback to rfind if depth tracking doesn't work
    if end == -1:
        end = cleaned.rfind(end_char)
    
    if start != -1 and end != -1 and end > start:
        # Extract the JSON portion
        json_portion = cleaned[start:end+1]
        return json_portion
    
    return cleaned

def retry_with_logging(func, max_attempts=1, logger=None, fallback=None, context=""):
    """
    Generic retry wrapper for functions that may fail transiently.
    
    Args:
        func: Callable to execute (should take no arguments; use lambda if needed)
        max_attempts: Maximum number of attempts, including the first call (default: 1)
        logger: Logger instance for logging retries (optional)
        fallback: Fallback value or callable to return on failure (optional)
        context: Context string for logging (e.g., "Domain consolidation for English")
    
    Returns:
        Result of func() on success, fallback on failure (if provided), otherwise raises
    
    Raises:
        Last exception if all attempts fail and no fallback provided
    """
    last_exception = None
    for attempt in range(1, max_attempts + 1):
        try:
            if attempt > 1 and logger:
                logger.info(f"Retry attempt {attempt}/{max_attempts}{f' for {context}' if context else ''}...")
            return func()
        except Exception as e:
            last_exception = e
            if logger:
                error_msg = get_clean_error_message(e)
                if attempt == max_attempts:
                    logger.error(f"All {max_attempts} attempts failed{f' for {context}' if context else ''}: {error_msg}")
                else:
                    logger.warning(f"Attempt {attempt}/{max_attempts} failed{f' for {context}' if context else ''}: {error_msg}")
            if attempt == max_attempts:
                if fallback is not None:
                    if callable(fallback):
                        return fallback()
                    return fallback
                raise last_exception

# ==============================================================================
# CENTRALIZED UTILITY CLASSES (Code Reuse & LOC Reduction)
# ==============================================================================

class RetryHandler:
    """
    Centralized retry handler with exponential backoff and flexible error handling.
    Replaces all scattered retry logic throughout the codebase.
    """
    
    @staticmethod
    def execute_with_retry(
        func,
        max_attempts=1,
        logger=None,
        context="",
        fallback=None,
        exponential_backoff=True,
        base_delay=1.0,
        max_delay=60.0,
        retryable_errors=None,
        non_retryable_errors=None
    ):
        """
        Execute a function with retry logic and exponential backoff.
        
        Args:
            func: Callable to execute
            max_attempts: Maximum number of attempts, including the first call (default: 1)
            logger: Logger instance for tracking
            context: Context string for logging
            fallback: Fallback value on failure
            exponential_backoff: Use exponential backoff (default: True)
            base_delay: Base delay in seconds (default: 1.0)
            max_delay: Maximum delay between retries (default: 60.0)
            retryable_errors: List of error types/keywords that should be retried
            non_retryable_errors: List of error types/keywords that should NOT be retried
            
        Returns:
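            Result of func() on success. On failure, the fallback (called first if it is
            callable) when one is provided; otherwise the last exception is re-raised.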
        """
        import random  # needed for the backoff jitter below
        import time
        last_exception = None
        
        for attempt in range(1, max_attempts + 1):
            try:
                if attempt > 1 and logger:
                    logger.info(f"🔄 Retry attempt {attempt}/{max_attempts}{f' for {context}' if context else ''}...")
                return func()
            except Exception as e:
                last_exception = e
                error_str = str(e).lower()
                
                if non_retryable_errors:
                    is_non_retryable = any(
                        (isinstance(err, type) and isinstance(e, err)) or 
                        (isinstance(err, str) and err.lower() in error_str)
                        for err in non_retryable_errors
                    )
                    if is_non_retryable:
                        if logger:
                            logger.error(f"❌ Non-retryable error{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
                        if fallback is not None:
                            return fallback() if callable(fallback) else fallback
                        raise
                
                is_retryable = True
                if retryable_errors:
                    is_retryable = any(
                        (isinstance(err, type) and isinstance(e, err)) or 
                        (isinstance(err, str) and err.lower() in error_str)
                        for err in retryable_errors
                    )
                
                if not is_retryable or attempt == max_attempts:
                    if logger:
                        logger.error(f"❌ Failed after {attempt} of {max_attempts} attempts{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
                    if fallback is not None:
                        return fallback() if callable(fallback) else fallback
                    raise
                
                if exponential_backoff:
                    wait_time = min(base_delay * (2 ** (attempt - 1)), max_delay)
                    jitter = random.uniform(0, wait_time * 0.1)
                    wait_time += jitter
                else:
                    wait_time = base_delay
                
                if logger:
                    logger.warning(f"⚠️  Attempt {attempt} failed{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
                    logger.info(f"   Waiting {wait_time:.1f}s before retry...")
                
                time.sleep(wait_time)
        
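        # Only reached when max_attempts < 1, i.e. func was never called.
        if fallback is not None:
            return fallback() if callable(fallback) else fallback
        if last_exception is None:
            raise ValueError("max_attempts must be at least 1")
        raise last_exception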
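

# Illustrative sketch (not part of the original utilities).
# The wait schedule used by the retry helper above can be reproduced in isolation.
# `_example_backoff_delays` is a hypothetical name introduced only for this sketch;
# it omits the random jitter (up to 10% of each delay) so the output is deterministic.
def _example_backoff_delays(max_attempts, base_delay=1.0, max_delay=60.0):
    """Return the un-jittered delay slept after each failed attempt except the last."""
    return [
        min(base_delay * (2 ** (attempt - 1)), max_delay)
        for attempt in range(1, max_attempts)
    ]

# Usage (with the defaults assumed above): _example_backoff_delays(5) gives
# [1.0, 2.0, 4.0, 8.0]. The cap applies from the seventh failure onward,
# since 1.0 * 2**6 = 64 exceeds max_delay = 60.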


class ParallelExecutor:
    """
    Centralized parallel execution manager using ThreadPoolExecutor.
    Replaces all scattered ThreadPoolExecutor usage throughout the codebase.
    """
    
    @staticmethod
    def execute_parallel(
        tasks,
        max_workers,
        task_name="Task",
        logger=None,
        timeout_per_task=None,
        total_timeout=None,
        thread_name_prefix="Worker",
        return_exceptions=False
    ):
        """
        Execute multiple tasks in parallel using ThreadPoolExecutor.
        
        Args:
            tasks: List of callables or (callable, args) tuples
            max_workers: Maximum number of parallel workers
            task_name: Name for logging purposes
            logger: Logger instance
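            timeout_per_task: Timeout passed to future.result(). Futures yielded by
                as_completed have already finished, so this does not interrupt a slow
                task; use total_timeout to bound the overall wait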
            total_timeout: Total timeout for all tasks in seconds
            thread_name_prefix: Prefix for thread names
            return_exceptions: If True, return exceptions instead of raising them
            
        Returns:
            List of results (or exceptions if return_exceptions=True)
        """
        results = []
        exceptions = []
        
        with ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix=thread_name_prefix) as executor:
            future_to_task = {}
            for i, task in enumerate(tasks):
                if isinstance(task, tuple):
                    func, args = task
                    future = executor.submit(func, *args)
                else:
                    future = executor.submit(task)
                future_to_task[future] = i
            
            try:
                for future in concurrent.futures.as_completed(future_to_task, timeout=total_timeout):
                    task_idx = future_to_task[future]
                    try:
                        result = future.result(timeout=timeout_per_task)
                        results.append((task_idx, result))
                    except concurrent.futures.TimeoutError:
                        error_msg = f"{task_name} #{task_idx} timed out"
                        if logger:
                            logger.warning(f"⏱️  {error_msg}")
                        if return_exceptions:
                            results.append((task_idx, TimeoutError(error_msg)))
                        else:
                            exceptions.append((task_idx, TimeoutError(error_msg)))
                    except Exception as e:
                        if logger:
                            logger.warning(f"❌ {task_name} #{task_idx} failed: {get_clean_error_message(e)}")
                        if return_exceptions:
                            results.append((task_idx, e))
                        else:
                            exceptions.append((task_idx, e))
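            except concurrent.futures.TimeoutError:
                if logger:
                    logger.error(f"⏱️  Total timeout ({total_timeout}s) exceeded for {task_name}")
                if not return_exceptions:
                    raise
                # Keep the returned list aligned with the task list: record a
                # TimeoutError for every task that has not reported a result.
                finished = {idx for idx, _ in results}
                for pending_future, pending_idx in future_to_task.items():
                    if pending_idx not in finished:
                        pending_future.cancel()  # only stops tasks that have not started yet
                        results.append((pending_idx, TimeoutError(f"{task_name} #{pending_idx} timed out")))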
        
        results.sort(key=lambda x: x[0])
        
        if not return_exceptions and exceptions:
            if logger:
                logger.error(f"❌ {len(exceptions)} {task_name}(s) failed")
            raise exceptions[0][1]
        
        return [r[1] for r in results]
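

# Illustrative sketch (not part of the original utilities).
# A minimal, self-contained version of the pattern execute_parallel relies on:
# futures complete in arbitrary order, so each result is tagged with its submission
# index and sorted afterwards. `_example_ordered_parallel` is a hypothetical helper.
def _example_ordered_parallel(callables, max_workers=4):
    from concurrent.futures import ThreadPoolExecutor, as_completed

    indexed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_idx = {pool.submit(fn): i for i, fn in enumerate(callables)}
        for future in as_completed(future_to_idx):
            indexed.append((future_to_idx[future], future.result()))
    # Restore submission order before stripping the indices.
    indexed.sort(key=lambda pair: pair[0])
    return [value for _, value in indexed]

# Usage: _example_ordered_parallel([lambda: "a", lambda: "b", lambda: "c"]) returns
# ["a", "b", "c"] regardless of which task finishes first.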


class CSVParser:
    """
    Centralized CSV parsing utility with consistent error handling.
    Replaces all scattered csv.DictReader usage throughout the codebase.
    """
    
    @staticmethod
    def parse_csv_string(
        csv_data,
        logger=None,
        context="",
        quoting=csv.QUOTE_ALL,
        delimiter=',',
        skipinitialspace=True,
        expected_fields=None
    ):
        """
        Parse CSV string into list of dictionaries.
        
        Args:
            csv_data: CSV string data
            logger: Logger instance
            context: Context string for logging
            quoting: CSV quoting mode (default: QUOTE_ALL; when reading, this parses like QUOTE_MINIMAL)
            delimiter: Field delimiter (default: ',')
            skipinitialspace: Skip initial spaces (default: True)
            expected_fields: Optional list of expected field names for validation
            
        Returns:
            List of dictionaries (one per row)
        """
        if not csv_data or not csv_data.strip():
            if logger:
                logger.warning(f"⚠️  Empty CSV data{f' for {context}' if context else ''}")
            return []
        
        try:
            reader = csv.DictReader(
                io.StringIO(csv_data),
                quoting=quoting,
                delimiter=delimiter,
                skipinitialspace=skipinitialspace
            )
            rows = list(reader)
            
            if expected_fields and rows:
                actual_fields = set(rows[0].keys())
                expected_set = set(expected_fields)
                missing_fields = expected_set - actual_fields
                if missing_fields and logger:
                    logger.warning(f"⚠️  Missing expected CSV fields{f' for {context}' if context else ''}: {missing_fields}")
            
            if logger:
                logger.debug(f"✅ Parsed {len(rows)} CSV rows{f' for {context}' if context else ''}")
            
            return rows
        except Exception as e:
            if logger:
                logger.error(f"❌ CSV parsing failed{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
            return []
    
    @staticmethod
    def parse_csv_list(
        csv_data,
        logger=None,
        context="",
        quoting=csv.QUOTE_ALL,
        delimiter=',',
        quotechar='"',
        skipinitialspace=True
    ):
        """
        Parse CSV string into list of lists (for non-dictionary CSV).
        
        Args:
            csv_data: CSV string data
            logger: Logger instance
            context: Context string for logging
            quoting: CSV quoting mode (default: QUOTE_ALL; when reading, this parses like QUOTE_MINIMAL)
            delimiter: Field delimiter (default: ',')
            quotechar: Quote character (default: '"')
            skipinitialspace: Skip initial spaces (default: True)
            
        Returns:
            List of lists (one per row)
        """
        if not csv_data or not csv_data.strip():
            if logger:
                logger.warning(f"⚠️  Empty CSV data{f' for {context}' if context else ''}")
            return []
        
        try:
            reader = csv.reader(
                io.StringIO(csv_data),
                delimiter=delimiter,
                quotechar=quotechar,
                quoting=quoting,
                skipinitialspace=skipinitialspace
            )
            rows = list(reader)
            
            if logger:
                logger.debug(f"✅ Parsed {len(rows)} CSV rows{f' for {context}' if context else ''}")
            
            return rows
        except Exception as e:
            if logger:
                logger.error(f"❌ CSV parsing failed{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
            return []
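

# Illustrative sketch (not part of the original utilities).
# What the DictReader settings used by CSVParser do to a small, fully quoted input.
# `_example_parse_quoted_csv` is a hypothetical helper; it mirrors the defaults above
# (QUOTE_ALL, comma delimiter, skipinitialspace=True) without the logging.
def _example_parse_quoted_csv(csv_text):
    import csv
    import io

    reader = csv.DictReader(
        io.StringIO(csv_text),
        quoting=csv.QUOTE_ALL,
        delimiter=",",
        skipinitialspace=True,
    )
    return list(reader)

# Usage: a comma inside a quoted field stays in that field, and the space after each
# delimiter is dropped:
#   _example_parse_quoted_csv('"name", "note"\n"orders", "wide, partitioned"\n')
#   returns [{"name": "orders", "note": "wide, partitioned"}]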


class JSONParser:
    """
    Centralized JSON parsing utility with consistent error handling.
    Replaces scattered json.loads/dumps usage throughout the codebase.
    """
    
    @staticmethod
    def safe_loads(json_string, logger=None, context="", fallback=None):
        """
        Safely parse JSON string with error handling.
        
        Args:
            json_string: JSON string to parse
            logger: Logger instance
            context: Context string for logging
            fallback: Fallback value on parsing failure
            
        Returns:
            Parsed JSON object or fallback value
        """
        if not json_string:
            return fallback
        
        try:
            return json.loads(json_string)
        except json.JSONDecodeError as e:
            if logger:
                logger.warning(f"⚠️  JSON parsing failed{f' for {context}' if context else ''}: {str(e)[:100]}")
            return fallback
        except Exception as e:
            if logger:
                logger.error(f"❌ Unexpected error parsing JSON{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
            return fallback
    
    @staticmethod
    def safe_dumps(obj, logger=None, context="", fallback="{}", indent=None, separators=None):
        """
        Safely serialize object to JSON string with error handling.
        
        Args:
            obj: Object to serialize
            logger: Logger instance
            context: Context string for logging
            fallback: Fallback string on serialization failure
            indent: Indentation level (default: None)
            separators: Custom separators (default: None)
            
        Returns:
            JSON string or fallback value
        """
        try:
            if separators:
                return json.dumps(obj, indent=indent, separators=separators)
            return json.dumps(obj, indent=indent)
        except TypeError as e:
            if logger:
                logger.warning(f"⚠️  JSON serialization failed{f' for {context}' if context else ''}: {str(e)[:100]}")
            return fallback
        except Exception as e:
            if logger:
                logger.error(f"❌ Unexpected error serializing JSON{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
            return fallback


class TimeoutHandler:
    """
    Centralized timeout handling utility.
    Provides consistent timeout behavior across the codebase.
    """
    
    @staticmethod
    def execute_with_timeout(func, timeout_seconds, logger=None, context="", fallback=None):
        """
        Execute a function with a timeout.
        
        Args:
            func: Callable to execute
            timeout_seconds: Timeout in seconds
            logger: Logger instance
            context: Context string for logging
            fallback: Fallback value on timeout
            
        Returns:
            Result of func() or fallback on timeout
        """
        from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError
        
        # Deliberately not a "with" block: leaving it calls shutdown(wait=True), which
        # would block until func finishes and so defeat the timeout.
        executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="Timeout")
        future = executor.submit(func)
        try:
            return future.result(timeout=timeout_seconds)
        except FuturesTimeoutError:
            if logger:
                logger.warning(f"⏱️  Timeout ({timeout_seconds}s) exceeded{f' for {context}' if context else ''}")
            if fallback is not None:
                return fallback() if callable(fallback) else fallback
            raise TimeoutError(f"Operation timed out after {timeout_seconds}s{f' for {context}' if context else ''}")
        except Exception as e:
            if logger:
                logger.error(f"❌ Error during execution{f' for {context}' if context else ''}: {get_clean_error_message(e)}")
            raise
        finally:
            # The worker thread cannot be killed; it keeps running in the background
            # until func returns, but the caller is no longer blocked on it.
            executor.shutdown(wait=False)



# DBTITLE 1,Table Size Analyzer & Dynamic Batch Optimizer

class TableSizeInfo:
    """Metadata about a table's size and structure."""
    def __init__(self, catalog, schema, table, num_columns=0, estimated_row_count=0, size_category="unknown"):
        self.catalog = catalog
        self.schema = schema
        self.table = table
        self.num_columns = num_columns
        self.estimated_row_count = estimated_row_count
        self.size_category = size_category  # "small", "medium", "wide", "very_wide"
        self.memory_weight = self._calculate_memory_weight()
    
    def _calculate_memory_weight(self):
        """Calculate memory weight for batching decisions."""
        if self.num_columns > 1000:
            return self.num_columns * 10  # Very heavy
        elif self.num_columns > 250:
            return self.num_columns * 5   # Heavy
        elif self.num_columns > 100:
            return self.num_columns * 2   # Medium
        else:
            return self.num_columns       # Light
    
    def __repr__(self):
        return f"{self.catalog}.{self.schema}.{self.table} ({self.num_columns} cols, {self.size_category})"
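

# Illustrative sketch (not part of the original classes).
# The weight tiers above, written as a standalone function so the thresholds can be
# checked by hand. `_example_memory_weight` is a hypothetical name for this sketch.
def _example_memory_weight(num_columns):
    if num_columns > 1000:
        return num_columns * 10
    if num_columns > 250:
        return num_columns * 5
    if num_columns > 100:
        return num_columns * 2
    return num_columns

# Usage: the multiplier changes just above each threshold, so
#   [_example_memory_weight(n) for n in (100, 101, 250, 251, 1000, 1001)]
#   evaluates to [100, 202, 500, 1255, 5000, 10010]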


class TableSizeAnalyzer:
    """
    Two-pass analyzer: First pass collects table sizes, second pass uses this info for intelligent batching.
    """
    def __init__(self, spark, logger):
        self.spark = spark
        self.logger = logger
        self.size_cache = {}  # {(catalog, schema, table): TableSizeInfo}
    
    def analyze_table_sizes_batch(self, table_tuples, use_info_schema_map, max_parallelism=20):
        """
        Analyze sizes for a batch of tables efficiently using parallel queries.
        
        Args:
            table_tuples: List of (catalog, schema, table) tuples
            use_info_schema_map: Dict mapping catalog -> bool for information_schema support
            max_parallelism: Maximum number of parallel queries (default: 20)
            
        Returns:
            List of TableSizeInfo objects
        """
        results = []
        
        # Group by catalog.schema for efficient querying
        schema_groups = defaultdict(list)
        for cat, schema, table in table_tuples:
            schema_groups[(cat, schema)].append(table)
        
        # Process schemas in parallel for speed
        def analyze_schema_group(schema_key_and_tables):
            (catalog, schema), tables = schema_key_and_tables
            use_info_schema = use_info_schema_map.get(catalog, False)
            schema_results = []
            
            self.logger.debug(f"   Analyzing {len(tables)} tables in {catalog}.{schema}...")
            
            try:
                if use_info_schema:
                    # Batch query using information_schema (much faster)
                    table_list = "','".join(tables)
                    query = f"""
                        SELECT table_name, COUNT(*) as num_columns
                        FROM `{catalog}`.`information_schema`.`columns`
                        WHERE table_schema = '{schema}'
                        AND table_name IN ('{table_list}')
                        GROUP BY table_name
                    """
                    df = self.spark.sql(query)
                    column_counts = {row.table_name: row.num_columns for row in df.collect()}
                    
                    for table in tables:
                        num_cols = column_counts.get(table, 0)
                        if num_cols == 0:
                            self.logger.debug(f"No columns found for {catalog}.{schema}.{table}, skipping size analysis")
                            continue
                        
                        size_info = TableSizeInfo(
                            catalog, schema, table,
                            num_columns=num_cols,
                            size_category=self._categorize_table(num_cols)
                        )
                        schema_results.append(size_info)
                        self.size_cache[(catalog, schema, table)] = size_info
                else:
                    # Fallback: DESCRIBE each table individually (slower)
                    for table in tables:
                        num_cols = self._get_column_count_fallback(catalog, schema, table)
                        if num_cols == 0:
                            continue
                        
                        size_info = TableSizeInfo(
                            catalog, schema, table,
                            num_columns=num_cols,
                            size_category=self._categorize_table(num_cols)
                        )
                        schema_results.append(size_info)
                        self.size_cache[(catalog, schema, table)] = size_info
                        
            except Exception as e:
                self.logger.warning(f"Error analyzing tables in {catalog}.{schema}: {get_clean_error_message(e)}")
                # Fallback: assume medium size
                for table in tables:
                    size_info = TableSizeInfo(catalog, schema, table, num_columns=50, size_category="medium")
                    schema_results.append(size_info)
                    self.size_cache[(catalog, schema, table)] = size_info
            
            return schema_results
        
        # Execute schema analysis - parallel or sequential based on max_parallelism
        if max_parallelism == 1:
            # Sequential execution when nested in another thread pool
            for item in schema_groups.items():
                try:
                    schema_results = analyze_schema_group(item)
                    results.extend(schema_results)
                except Exception as e:
                    schema_key = item[0]
                    self.logger.error(f"Failed to analyze schema {schema_key}: {get_clean_error_message(e)}")
        else:
            # Parallel execution for top-level calls
            with ThreadPoolExecutor(max_workers=max_parallelism, thread_name_prefix="SchemaAnalyzer") as executor:
                futures = {executor.submit(analyze_schema_group, item): item[0] 
                          for item in schema_groups.items()}
                
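                # Bound the total wait (5 minutes per schema group). as_completed raises
                # TimeoutError from its iterator, so the handler must wrap the whole loop;
                # futures it yields are already finished and result() returns immediately.
                # Note that leaving the executor's "with" block still waits for running queries.
                total_timeout = len(futures) * 300
                try:
                    for future in concurrent.futures.as_completed(futures, timeout=total_timeout):
                        schema_key = futures[future]
                        try:
                            results.extend(future.result())
                        except Exception as e:
                            self.logger.error(f"Failed to analyze schema {schema_key}: {get_clean_error_message(e)}")
                except concurrent.futures.TimeoutError:
                    pending = [key for f, key in futures.items() if not f.done()]
                    self.logger.error(f"Schema analysis timed out after {total_timeout}s; unfinished schemas: {pending}")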
        
        return results
    
    def _get_column_count_fallback(self, catalog, schema, table):
        """Fallback method to get column count using DESCRIBE."""
        try:
            fq_table = f"`{catalog}`.`{schema}`.`{table}`"
            df = self.spark.sql(f"DESCRIBE TABLE {fq_table}")
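            # DESCRIBE lists partition columns a second time under a "# Partition Information"
            # header and may emit blank separator rows, so drop header/blank rows and
            # count distinct column names.
            count = (
                df.filter(~col("col_name").startswith("#") & (col("col_name") != ""))
                  .select("col_name")
                  .distinct()
                  .count()
            )
            return count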
        except Exception as e:
            self.logger.debug(f"Error getting column count for {catalog}.{schema}.{table}: {e}")
            return 0
    
    def _categorize_table(self, num_columns):
        """Categorize table based on column count."""
        if num_columns > 1000:
            return "very_wide"
        elif num_columns > 250:
            return "wide"
        elif num_columns > 100:
            return "medium"
        else:
            return "small"
    
    def get_cached_size(self, catalog, schema, table):
        """Get cached size info for a table."""
        return self.size_cache.get((catalog, schema, table))
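

# Illustrative sketch (not part of the original classes).
# The grouping step at the start of analyze_table_sizes_batch: tables are bucketed by
# (catalog, schema) so that one information_schema query can cover a whole schema.
# `_example_group_by_schema` is a hypothetical helper; the table names are made up.
def _example_group_by_schema(table_tuples):
    from collections import defaultdict

    groups = defaultdict(list)
    for catalog, schema, table in table_tuples:
        groups[(catalog, schema)].append(table)
    return dict(groups)

# Usage:
#   _example_group_by_schema([("main", "finance", "gl"), ("main", "finance", "ap"), ("main", "hr", "staff")])
#   returns {("main", "finance"): ["gl", "ap"], ("main", "hr"): ["staff"]}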


class DynamicBatchOptimizer:
    """
    Intelligently groups tables into batches based on their size and memory requirements.
    
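    Strategy (approximate sizes of single-category batches with the default MAX_BATCH_WEIGHT of 10000):
    - Very wide tables (>1000 cols): 1 per batch (one table already exceeds the weight limit)
    - Wide tables (251-1000 cols): 2-7 per batch
    - Medium tables (101-250 cols): 20-49 per batch
    - Small tables (<=100 cols): 100-500 per batch (capped by MAX_BATCH_SIZE)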
    """
    
    # Memory weights for batching (in arbitrary units)
    MAX_BATCH_WEIGHT = 10000  # Adjust based on cluster memory
    MIN_BATCH_SIZE = 1
    MAX_BATCH_SIZE = 500
    
    def __init__(self, logger, max_batch_weight=None):
        self.logger = logger
        self.max_batch_weight = max_batch_weight or self.MAX_BATCH_WEIGHT
    
    def create_optimized_batches(self, table_size_infos):
        """
        Create optimized batches from table size information.
        
        Args:
            table_size_infos: List of TableSizeInfo objects
            
        Returns:
            List of lists, where each inner list is a batch of (catalog, schema, table) tuples
        """
        if not table_size_infos:
            return []
        
        # Sort tables by size (largest first for better packing)
        sorted_tables = sorted(table_size_infos, key=lambda t: t.memory_weight, reverse=True)
        
        batches = []
        current_batch = []
        current_weight = 0
        
        for table_info in sorted_tables:
            table_tuple = (table_info.catalog, table_info.schema, table_info.table)
            table_weight = table_info.memory_weight
            
            # Check if adding this table would exceed batch weight
            if current_batch and (current_weight + table_weight > self.max_batch_weight or 
                                 len(current_batch) >= self.MAX_BATCH_SIZE):
                # Start new batch
                batches.append(current_batch)
                self.logger.debug(f"Created batch with {len(current_batch)} tables, weight={current_weight}")
                current_batch = []
                current_weight = 0
            
            # Add table to current batch
            current_batch.append(table_tuple)
            current_weight += table_weight
        
        # Add final batch
        if current_batch:
            batches.append(current_batch)
            self.logger.debug(f"Created final batch with {len(current_batch)} tables, weight={current_weight}")
        
        # Log batch statistics
        self._log_batch_stats(batches, table_size_infos)
        
        return batches
    
    def _log_batch_stats(self, batches, table_size_infos):
        """Log statistics about the created batches."""
        total_tables = len(table_size_infos)
        num_batches = len(batches)
        
        size_categories = defaultdict(int)
        for table_info in table_size_infos:
            size_categories[table_info.size_category] += 1
        
        avg_batch_size = total_tables / num_batches if num_batches > 0 else 0
        
        self.logger.info(f"📊 Dynamic Batch Optimization Complete:")
        self.logger.info(f"   • Total tables: {total_tables}")
        self.logger.info(f"   • Created batches: {num_batches}")
        self.logger.info(f"   • Average batch size: {avg_batch_size:.1f} tables")
        self.logger.info(f"   • Table size distribution:")
        self.logger.info(f"      - Small (<=100 cols): {size_categories['small']}")
        self.logger.info(f"      - Medium (101-250 cols): {size_categories['medium']}")
        self.logger.info(f"      - Wide (251-1000 cols): {size_categories['wide']}")
        self.logger.info(f"      - Very Wide (>1000 cols): {size_categories['very_wide']}")
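

# Illustrative sketch (not part of the original classes).
# The greedy packing in create_optimized_batches, reduced to plain integer weights.
# `_example_pack_by_weight` is a hypothetical helper; 10000 and 500 are the class defaults.
def _example_pack_by_weight(weights, max_batch_weight=10000, max_batch_size=500):
    batches, current, current_weight = [], [], 0
    for weight in sorted(weights, reverse=True):
        over_weight = current_weight + weight > max_batch_weight
        # A non-empty batch is closed when the next item would overflow it.
        if current and (over_weight or len(current) >= max_batch_size):
            batches.append(current)
            current, current_weight = [], 0
        current.append(weight)
        current_weight += weight
    if current:
        batches.append(current)
    return batches

# Usage: a 1001-column table (weight 10010) exceeds the limit alone and gets its own
# batch, two 1000-column tables (weight 5000 each) fit exactly, and the rest follow:
#   _example_pack_by_weight([5000, 10010, 202, 5000, 50])
#   returns [[10010], [5000, 5000], [202, 50]]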


class ColumnSampler:
    """
    Samples columns from very wide tables to reduce memory footprint.
    
    For tables with >250 columns, intelligently selects representative columns:
    - All primary keys, foreign keys
    - Columns with business-meaningful names
    - Sample of remaining columns
    """
    
    WIDE_TABLE_THRESHOLD = 250
    TARGET_SAMPLE_SIZE = 200
    
    def __init__(self, logger):
        self.logger = logger
    
    def should_sample(self, num_columns):
        """Determine if column sampling is needed."""
        return num_columns > self.WIDE_TABLE_THRESHOLD
    
    def sample_columns(self, column_details, table_info):
        """
        Sample columns from a wide table.
        
        Args:
            column_details: List of (catalog, schema, table, col_name, data_type, comment) tuples
            table_info: TableSizeInfo object
            
        Returns:
            Tuple of (column_details, was_sampled): the original list and False when no
            sampling is needed, otherwise the sampled list and True
        """
        if not self.should_sample(len(column_details)):
            return column_details, False  # No sampling needed
        
        self.logger.info(f"🎯 Sampling columns for wide table {table_info}: {len(column_details)} -> ~{self.TARGET_SAMPLE_SIZE} cols")
        
        # Categorize columns
        key_columns = []
        business_columns = []
        other_columns = []
        
        # Business keywords to identify important columns
        business_keywords = [
            'id', 'key', 'name', 'date', 'time', 'amount', 'total', 'count', 'quantity',
            'price', 'cost', 'revenue', 'customer', 'order', 'product', 'status',
            'type', 'category', 'description', 'address', 'email', 'phone'
        ]
        
        for col_detail in column_details:
            col_name = col_detail[3].lower()
            
            # Identify key columns (id, primary key patterns)
            if 'id' in col_name or 'key' in col_name or col_name.endswith('_pk') or col_name.endswith('_fk'):
                key_columns.append(col_detail)
            # Identify business-relevant columns
            elif any(keyword in col_name for keyword in business_keywords):
                business_columns.append(col_detail)
            else:
                other_columns.append(col_detail)
        
        # Build sampled list
        sampled = []
        
        # Always include all key columns
        sampled.extend(key_columns)
        
        # Include as many business columns as possible
        remaining_slots = self.TARGET_SAMPLE_SIZE - len(sampled)
        sampled_business = business_columns[:remaining_slots] if remaining_slots > 0 else []
        sampled.extend(sampled_business)
        
        # Fill remaining with evenly spaced sample from other columns
        remaining_slots = self.TARGET_SAMPLE_SIZE - len(sampled)
        sampled_other = []
        if remaining_slots > 0 and other_columns:
            step = max(1, len(other_columns) // remaining_slots)
            sampled_other = other_columns[::step][:remaining_slots]
            sampled.extend(sampled_other)
        
        # Report the counts actually added (remaining_slots is reused above, so it
        # cannot be used to recompute the business-column count here).
        self.logger.info(f"   ✓ Sampled: {len(key_columns)} key cols + {len(sampled_business)} business cols + "
                        f"{len(sampled_other)} other cols = {len(sampled)} total")
        
        return sampled, True  # Return sampled columns + flag indicating sampling occurred
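

# Illustrative sketch (not part of the original classes).
# The "evenly spaced" fill step of sample_columns in isolation: take every step-th
# item, then trim to the number of free slots. `_example_evenly_spaced` is hypothetical.
def _example_evenly_spaced(items, slots):
    if slots <= 0 or not items:
        return []
    step = max(1, len(items) // slots)
    return items[::step][:slots]

# Usage: 10 items into 3 slots uses step 3, so
#   _example_evenly_spaced(list(range(10)), 3) returns [0, 3, 6]
# Because the step is rounded down and the result is trimmed, the picks can stop short
# of the end of the list (here item 9 is never chosen).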


# COMMAND ----------

# DBTITLE 1,DataLoader

class DataLoader:
    # === MODIFIED: Added tables parameter + memory optimization features ===
    def __init__(self, catalogs: str, schemas: str, tables: str, logger: logging.Logger, 
                 enable_two_pass=True, enable_column_sampling=True, streaming_batch_size=1000,
                 max_parallelism=10, schema_timeout_seconds=900):
        self.spark = SparkSession.builder.getOrCreate()
        self.max_parallelism = max_parallelism  # For parallel schema discovery and column loading
        self.schema_timeout_seconds = schema_timeout_seconds  # Timeout per schema query (15 minutes)
        self.logger = logger
        self.foreign_key_graph = defaultdict(list)
        
        # === NEW: Memory optimization features ===
        self.enable_two_pass = enable_two_pass  # Enable intelligent batching based on table sizes
        self.enable_column_sampling = enable_column_sampling  # Sample columns from very wide tables
        self.streaming_batch_size = streaming_batch_size  # Number of tables to process in each streaming chunk
        
        # === NEW: Initialize optimization components ===
        self.size_analyzer = TableSizeAnalyzer(self.spark, self.logger) if enable_two_pass else None
        self.batch_optimizer = DynamicBatchOptimizer(self.logger) if enable_two_pass else None
        self.column_sampler = ColumnSampler(self.logger) if enable_column_sampling else None

        # === MODIFIED: Process schemas, catalogs, and individual tables ===
        # Use utility functions to normalize identifiers (strip backticks from user input)
        self.schemas_to_process = [s.strip() for s in schemas.split(',') if s.strip()]
        self.catalogs_to_process = [normalize_identifier(c) for c in catalogs.split(',') if c.strip()]
        self.tables_to_process = [t.strip() for t in tables.split(',') if t.strip()]
        
        # Get all unique catalogs mentioned to check capabilities
        self.catalog_capabilities = {}
        unique_catalogs = set(self.catalogs_to_process)
        for s in self.schemas_to_process:
            cat, _ = parse_two_level_name(s)
            if cat:
                unique_catalogs.add(cat)
        for t in self.tables_to_process:
            cat, _, _ = parse_three_level_name(t)
            if cat:
                unique_catalogs.add(cat)
        
        self.logger.info(f"Initializing capabilities for catalogs: {unique_catalogs}")
        for catalog in unique_catalogs:
            self.catalog_capabilities[catalog] = self._check_catalog_capability(catalog)
        self.logger.info(f"Capabilities found: {self.catalog_capabilities}")

        # === MODIFIED: Build the database queue and individual tables list ===
        self.database_queue = []
        db_set = set()
        
        # Track individual tables separately
        self.individual_tables = []  # List of (catalog, schema, table) tuples
        
        # Track explicitly provided schemas (to know which schemas should expand ALL tables)
        self.explicit_schemas_set = set()  # Set of (catalog, schema) tuples

        # 1. From explicit schemas (use parse_two_level_name for consistent normalization)
        for s in self.schemas_to_process:
            cat, db = parse_two_level_name(s)
            if cat and db:
                db_set.add((cat, db))
                self.explicit_schemas_set.add((cat, db))  # Mark as explicitly provided
            else:
                self.logger.warning(f"Skipping malformed schema name: {s}")
        
        # 2. From individual tables (use parse_three_level_name for consistent normalization)
        for t in self.tables_to_process:
            cat, db, table = parse_three_level_name(t)
            if cat and db and table:
                self.individual_tables.append((cat, db, table))
                # Also add the schema to db_set so we can process it
                db_set.add((cat, db))
            else:
                self.logger.warning(f"Skipping malformed table name: {t} (expected format: catalog.schema.table)")
        
        # 3. From catalogs (schemas from catalogs should also expand ALL tables)
        for cat in self.catalogs_to_process:
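            if not self.catalog_capabilities.get(cat, False):
                # No information_schema here: _fetch_schemas_for_catalog falls back to SHOW SCHEMAS
                # and returns an empty list (with an error log) if that fails too.
                self.logger.info(f"Catalog `{cat}` lacks information_schema support; using SHOW SCHEMAS fallback for schema discovery.")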
            self.logger.info(f"Fetching schemas for catalog: {cat}")
            schemas_in_cat = self._fetch_schemas_for_catalog(cat)
            for db in schemas_in_cat:
                db_set.add((cat, db))
                self.explicit_schemas_set.add((cat, db))  # Mark as explicit - expand ALL tables
        
        self.database_queue = sorted(list(db_set)) # Sort for deterministic order
        self.logger.info(f"Found {len(self.database_queue)} unique databases to process.")
        
        # Log explicit schemas (will expand ALL tables)
        if self.explicit_schemas_set:
            explicit_names = [f"{cat}.{db}" for cat, db in sorted(self.explicit_schemas_set)]
            self.logger.info(f"📋 Explicit schemas (will expand ALL tables): {', '.join(explicit_names)}")
        
        # Log individual tables (specific tables only)
        if self.individual_tables:
            db_stats = {}
            for cat, db, tbl in self.individual_tables:
                key = f"{cat}.{db}"
                db_stats[key] = db_stats.get(key, 0) + 1
            self.logger.info(f"Found {len(self.individual_tables)} individual tables to process across {len(db_stats)} databases:")
            for db_key, count in sorted(db_stats.items()):
                self.logger.info(f"  database {db_key}: {count} tables loaded")
        
        # === NEW: State variables for table-level batching ===
        self.all_table_tuples = []  # List of (catalog, schema, table) tuples
        self.optimized_batches = []  # List of optimized batches (two-pass mode)
        self.current_batch_idx = 0   # Current batch index (two-pass mode)
        self.current_table_idx = 0   # Current position in all_table_tuples (single-pass mode)
        self._tables_initialized = False
        self._size_analysis_complete = False

    def _check_catalog_capability(self, catalog_name: str) -> bool:
        """Return True if the catalog exposes a queryable information_schema, else False."""
        try:
            self.spark.sql(f"SELECT 1 FROM `{catalog_name}`.`information_schema`.`schemata` LIMIT 1").collect()
            return True
        except Exception:
            self.logger.info(f"Catalog `{catalog_name}` does not support information_schema. Will attempt fallback 'SHOW' commands.")
            return False # Rely on fallbacks

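    def _fetch_schemas_for_catalog(self, catalog_name: str):
        """
        List schema names in a catalog, excluding information_schema.

        Tries information_schema.schemata first, then falls back to SHOW SCHEMAS.
        Returns an empty list if both approaches fail.
        """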
        use_info_schema = self.catalog_capabilities.get(catalog_name, False)

        try:
            if not use_info_schema:
                raise Exception("Catalog does not support information_schema, using fallback.")
            
            query = f"""
                SELECT `schema_name` FROM `{catalog_name}`.`information_schema`.`schemata` 
                WHERE `schema_name` != 'information_schema' 
                ORDER BY `schema_name`
            """
            df = self.spark.sql(query)
            return [row.schema_name for row in df.collect()]
        except Exception as e_info:
            try:
                query = f"SHOW SCHEMAS IN `{catalog_name}`"
                df = self.spark.sql(query)
                col_name = "databaseName" if "databaseName" in df.columns else "namespace"
                all_schemas = [row[col_name] for row in df.orderBy(col_name).collect()]
                return [s for s in all_schemas if s != 'information_schema']
            except Exception as e_show_schemas:
                self.logger.error(f"Error listing schemas for catalog `{catalog_name}`. All fallbacks failed: {e_show_schemas}")
                return []

    # === MODIFIED: Added streaming/pagination support for memory efficiency ===
    def _fetch_tables_for_schema(self, catalog_name: str, schema_name: str, limit=None, offset=0):
        """
        Fetch tables for a schema with optional pagination.
        
        Args:
            catalog_name: Catalog name
            schema_name: Schema name
            limit: Optional limit for pagination (for very large schemas)
            offset: Offset for pagination
            
        Returns:
            List of fully qualified table names
        """
        cat_normalized = normalize_identifier(catalog_name)
        schema_normalized = normalize_identifier(schema_name)
        fq_schema = build_fqn(cat_normalized, schema_normalized)
        tables = []
        use_info_schema = self.catalog_capabilities.get(cat_normalized, False)

        try:
            if not use_info_schema:
                raise Exception("Catalog does not support information_schema, using fallback.")
            
            # Use pagination for large schemas
            limit_clause = f"LIMIT {limit} OFFSET {offset}" if limit else ""
            cat_quoted = quote_identifier(cat_normalized)
            query = f"""
                SELECT `table_name` FROM {cat_quoted}.`information_schema`.`tables`
                WHERE `table_schema` = '{schema_normalized}'
                ORDER BY `table_name`
                {limit_clause}
            """
            df = self.spark.sql(query)
            
            # For large pages, stream rows to the driver instead of collecting them all at once.
            # The LIMIT/OFFSET clause above already bounds the result, so no extra chunk loop
            # is needed (re-running limit() on the same DataFrame would return the same rows forever).
            if limit and limit > 1000:
                tables = [f"{fq_schema}.`{row.table_name}`" for row in df.toLocalIterator()]
            else:
                tables = [f"{fq_schema}.`{row.table_name}`" for row in df.collect()]
                
        except Exception:
            try:
                query = f"SHOW TABLES IN {fq_schema}"
                df = self.spark.sql(query)
                if 'isTemporary' in df.columns:
                    df = df.filter(df.isTemporary == False)
                    
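                # Apply pagination if specified: skip `offset` rows first, then take `limit` rows.
                # (limit-then-offset would return nothing once offset >= limit.)
                # DataFrame.offset requires Spark 3.4+.
                df = df.orderBy("tableName")
                if limit:
                    df = df.offset(offset).limit(limit)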
                    
                tables = [f"{fq_schema}.`{row.tableName}`" for row in df.collect()]
            except Exception as e:
                self.logger.warning(f"Error listing tables for {fq_schema}: {get_clean_error_message(e)}")
                tables = []
        
        return tables
    
    def _fetch_tables_for_schema_streaming(self, catalog_name: str, schema_name: str, chunk_size=1000):
        """
        Generator that yields tables in chunks for memory-efficient streaming.
        
        Args:
            catalog_name: Catalog name  
            schema_name: Schema name
            chunk_size: Number of tables to yield per chunk
            
        Yields:
            Lists of fully qualified table names
        """
        offset = 0
        while True:
            tables = self._fetch_tables_for_schema(catalog_name, schema_name, limit=chunk_size, offset=offset)
            if not tables:
                break
            yield tables
            if len(tables) < chunk_size:
                break  # Last chunk
            offset += chunk_size

    def _get_table_details(self, catalog: str, schema: str, table: str, apply_sampling=True):
        """
        Get table column details with optional column sampling for very wide tables.
        
        Args:
            catalog: Catalog name
            schema: Schema name
            table: Table name
            apply_sampling: Whether to apply column sampling for wide tables
            
        Returns:
            List of (catalog, schema, table, column_name, data_type, comment) tuples
        """
        details = []
        cat_normalized = normalize_identifier(catalog)
        schema_normalized = normalize_identifier(schema)
        table_normalized = normalize_identifier(table)
        fq_table_name = build_fqn(cat_normalized, schema_normalized, table_normalized)
        use_info_schema = self.catalog_capabilities.get(cat_normalized, False)
        
        try:
            if not use_info_schema:
                raise Exception("Catalog does not support information_schema, using fallback.")

            cat_quoted = quote_identifier(cat_normalized)
            query = f"""
                SELECT `table_catalog`, `table_schema`, `table_name`, `column_name`, `data_type`, `comment`
                FROM {cat_quoted}.`information_schema`.`columns`
                WHERE `table_schema` = '{schema_normalized}' AND `table_name` = '{table_normalized}'
                ORDER BY `ordinal_position`
            """
            df = self.spark.sql(query)
            for row in df.toLocalIterator():
                details.append((
                    row.table_catalog, 
                    row.table_schema, 
                    row.table_name, 
                    row.column_name, 
                    row.data_type, 
                    row.comment
                ))
            try:
                fk_rows = self._get_foreign_keys(catalog, schema, table)
                if fk_rows:
                    self.foreign_key_graph[(catalog, schema, table)] = fk_rows
            except Exception:
                pass
            if details:
                # Apply column sampling if enabled and needed
                if apply_sampling and self.column_sampler and len(details) > ColumnSampler.WIDE_TABLE_THRESHOLD:
                    table_info = self.size_analyzer.get_cached_size(catalog, schema, table) if self.size_analyzer else None
                    if not table_info:
                        table_info = TableSizeInfo(catalog, schema, table, num_columns=len(details))
                    details, _ = self.column_sampler.sample_columns(details, table_info)
                return details
            # No columns returned by information_schema: fall through to the DESCRIBE fallback
                
        except Exception:
            pass # Fallthrough to DESCRIBE
            
        try:
            query = f"DESCRIBE TABLE {fq_table_name}"
            df = self.spark.sql(query)
            for row in df.toLocalIterator():
                if row.col_name and not row.col_name.startswith('#'):
                    details.append((
                        catalog,
                        schema,
                        table,
                        row.col_name,
                        row.data_type,
                        row.comment
                    ))
            
            try:
                fk_rows = self._get_foreign_keys(catalog, schema, table)
                if fk_rows:
                    self.foreign_key_graph[(catalog, schema, table)] = fk_rows
            except Exception:
                pass
            # Apply column sampling if enabled and needed
            if apply_sampling and self.column_sampler and len(details) > ColumnSampler.WIDE_TABLE_THRESHOLD:
                table_info = self.size_analyzer.get_cached_size(catalog, schema, table) if self.size_analyzer else None
                if not table_info:
                    table_info = TableSizeInfo(catalog, schema, table, num_columns=len(details))
                details, was_sampled = self.column_sampler.sample_columns(details, table_info)
                
            return details
        except Exception as e:
            # This tool works at the METADATA level only, so permission errors are expected
            # for tables the caller cannot read; skip those silently.
            error_msg = str(e).lower()
            if not ("permission" in error_msg or "unauthorized" in error_msg or "access" in error_msg):
                # Record any other failure at debug level only
                self.logger.debug(f"Could not read metadata for {fq_table_name}: {get_clean_error_message(e)}")
            return []

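    def _get_foreign_keys(self, catalog: str, schema: str, table: str):
        """
        Return the foreign-key relations for a table, cached in self.foreign_key_graph.

        Each relation is an 8-tuple: the referencing (catalog, schema, table, column)
        followed by the referenced (catalog, schema, table, column). Returns an empty
        list when the lookup fails.
        """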
        key = (catalog, schema, table)
        if key in self.foreign_key_graph:
            return self.foreign_key_graph[key]
        
        cat_normalized = normalize_identifier(catalog)
        schema_normalized = normalize_identifier(schema)
        table_normalized = normalize_identifier(table)
        cat_quoted = quote_identifier(cat_normalized)
        
        try:
            query = f"""
                SELECT table_catalog,
                       table_schema,
                       table_name,
                       column_name,
                       referenced_table_catalog,
                       referenced_table_schema,
                       referenced_table_name,
                       referenced_column_name
                FROM {cat_quoted}.`information_schema`.`key_column_usage`
                WHERE table_schema = '{schema_normalized}'
                  AND table_name = '{table_normalized}'
                  AND referenced_table_name IS NOT NULL
            """
            df = self.spark.sql(query)
            rels = [
                (
                    row.table_catalog,
                    row.table_schema,
                    row.table_name,
                    row.column_name,
                    row.referenced_table_catalog,
                    row.referenced_table_schema,
                    row.referenced_table_name,
                    row.referenced_column_name
                )
                for row in df.collect()
            ]
            self.foreign_key_graph[key] = rels
            return rels
        except Exception as e:
            error_str = str(e).lower()
            if 'table or view not found' in error_str or 'key_column_usage' in error_str:
                self.logger.debug(f"key_column_usage table not available in catalog {cat_normalized} (older Databricks version)")
            elif 'cannot resolve' in error_str or 'referenced_table' in error_str:
                self.logger.debug("referenced_table columns not available in key_column_usage (older Databricks version)")
            else:
                self.logger.debug(f"Could not read foreign keys for {catalog}.{schema}.{table}: {get_clean_error_message(e)}")
            self.foreign_key_graph[key] = []
            return []

    def get_foreign_key_relations(self, table_tuples: set):
        """Collect the cached foreign-key relations for the given (catalog, schema, table) tuples."""
        relations = []
        for tbl in table_tuples:
            relations.extend(self.foreign_key_graph.get(tbl, []))
        return relations

    # === MODIFIED: Support both single-pass and two-pass optimized modes ===
    def getNextTables(self, batch_size=None):
        """
        Returns a batch of tables with their column details.
        Maintains state across all catalogs and databases, continuing seamlessly
        across database boundaries.
        
        Supports two modes:
        1. Single-pass mode (default): Uses batch_size parameter for simple batching
        2. Two-pass mode (enable_two_pass=True): Uses pre-optimized batches, ignores batch_size
        
        Args:
            batch_size (int): Maximum number of tables to return in this batch (single-pass mode only)
            
        Returns:
            list of tuples: Each tuple is (catalog, schema, table, column_name, data_type, comment)
            None: if all tables have been processed
        """
        
        # Initialize table list on first call
        if not self._tables_initialized:
            self.logger.info("Initializing all tables from all databases...")
            self._initialize_all_tables()
            self._tables_initialized = True
        
        # === TWO-PASS OPTIMIZED MODE ===
        if self.enable_two_pass and self._size_analysis_complete:
            # Check if we've exhausted all batches
            if self.current_batch_idx >= len(self.optimized_batches):
                return None  # Signal no more batches
            
            # Get the next optimized batch
            batch_table_tuples = self.optimized_batches[self.current_batch_idx]
            batch_num = self.current_batch_idx + 1
            
            self.logger.info(f"📦 Fetching optimized batch {batch_num}/{len(self.optimized_batches)}: "
                           f"{len(batch_table_tuples)} tables")
            
            # Fetch column details for this optimized batch in parallel
            # ADAPTIVE PARALLELISM: Fixed for metadata (DB connection limits)
            column_parallelism, reason = calculate_adaptive_parallelism(
                "column_fetch", self.max_parallelism, 
                num_items=len(batch_table_tuples),
                is_llm_operation=False, logger=self.logger
            )
            
            all_column_details = []
            
            def fetch_details_with_sampling(args):
                # Apply sampling based on enable_column_sampling flag
                return self._get_table_details(*args, apply_sampling=self.enable_column_sampling)
            
            with concurrent.futures.ThreadPoolExecutor(max_workers=column_parallelism) as executor:
                results = executor.map(fetch_details_with_sampling, batch_table_tuples)
                
                for column_list in results:
                    if column_list:
                        all_column_details.extend(column_list)
            
            self.logger.info(f"   ✓ Batch {batch_num} loaded: {len(all_column_details)} columns from {len(batch_table_tuples)} tables")
            
            # Update position for next call
            self.current_batch_idx += 1
            
            # Return the column details for this optimized batch
            return all_column_details
        
        # === SINGLE-PASS MODE (backward compatible) ===
        else:
            if batch_size is None:
                self.logger.warning("batch_size not provided in single-pass mode, using default of 25")
                batch_size = 25
                
            # Check if we've exhausted all tables
            if self.current_table_idx >= len(self.all_table_tuples):
                return None  # Signal no more tables
                
            # Get the next batch of tables
            end_idx = min(self.current_table_idx + batch_size, len(self.all_table_tuples))
            batch_table_tuples = self.all_table_tuples[self.current_table_idx:end_idx]
            
            self.logger.info(f"Fetching batch of {len(batch_table_tuples)} tables (indices {self.current_table_idx} to {end_idx-1} of {len(self.all_table_tuples)} total)")
            
            # Fetch column details for this batch of tables in parallel
            # ADAPTIVE PARALLELISM: Fixed for metadata (DB connection limits)
            column_parallelism, reason = calculate_adaptive_parallelism(
                "column_fetch", self.max_parallelism,
                num_items=len(batch_table_tuples),
                is_llm_operation=False, logger=self.logger
            )
            
            all_column_details = []
            
            def fetch_details_with_sampling(args):
                return self._get_table_details(*args, apply_sampling=self.enable_column_sampling)
            
            with concurrent.futures.ThreadPoolExecutor(max_workers=column_parallelism) as executor:
                results = executor.map(fetch_details_with_sampling, batch_table_tuples)
                
                for column_list in results:
                    if column_list:
                        all_column_details.extend(column_list)
            
            # Batch complete - high-level logging only
            
            # Update position for next call
            self.current_table_idx = end_idx
            
            # Return the column details for this batch
            return all_column_details
    
    def reset(self):
        """Reset the data loader to start from the beginning."""
        self.current_table_idx = 0
        self.current_batch_idx = 0
        self.logger.info("DataLoader reset to beginning")
    
    def _initialize_all_tables(self):
        """
        Fetches all table names from all databases in the queue and stores them
        as (catalog, schema, table) tuples for later batch processing.
        
        Supports both streaming (memory-efficient) and two-pass (size-optimized) modes.
        """
        self.logger.info(f"Discovering all tables from {len(self.database_queue)} databases...")
        
        # Check if we have specific individual tables to filter by
        individual_tables_set = set(self.individual_tables) if self.individual_tables else None
        
        # Get set of explicitly provided schemas (these should expand ALL tables, not filter)
        explicit_schemas_set = getattr(self, 'explicit_schemas_set', set())
        
        # Log discovery strategy
        if explicit_schemas_set and individual_tables_set:
            explicit_schema_names = [f"{cat}.{db}" for cat, db in explicit_schemas_set]
            self.logger.info(f"📋 Discovery strategy: {len(explicit_schemas_set)} explicit schemas will expand ALL tables: {', '.join(explicit_schema_names)}")
            self.logger.info(f"📋 {len(individual_tables_set)} individual tables will be included from other schemas")
        
        # === TWO-PASS MODE: First analyze table sizes, then create optimized batches ===
        if self.enable_two_pass:
            self.logger.info("🔄 TWO-PASS MODE ENABLED: Analyzing table sizes for intelligent batching...")
            self.logger.info(f"⚡ Using parallel schema discovery with {self.max_parallelism} workers for speed")
            
            # PASS 1: Discover all tables and analyze their sizes IN PARALLEL
            all_table_size_infos = []
            
            # Function to process a single schema in parallel
            def discover_and_analyze_schema(schema_tuple):
                catalog_name, schema_name = schema_tuple
                cat_normalized = normalize_identifier(catalog_name)
                schema_normalized = normalize_identifier(schema_name)
                db_fqn = build_fqn(cat_normalized, schema_normalized)
                self.logger.info(f"🔍 Starting discovery for {db_fqn}...")
                
                schema_table_tuples = []
                schema_key = (cat_normalized, schema_normalized)
                is_explicit_schema = schema_key in explicit_schemas_set
                
                # Check if schema has a huge number of tables (>10K) - use streaming
                try:
                    self.logger.debug(f"   Counting tables in {db_fqn}...")
                    cat_quoted = quote_identifier(cat_normalized)
                    count_query = f"""
                        SELECT COUNT(*) as cnt FROM {cat_quoted}.`information_schema`.`tables`
                        WHERE `table_schema` = '{schema_normalized}'
                    """
                    table_count = self.spark.sql(count_query).collect()[0].cnt
                    self.logger.debug(f"   {db_fqn} has {table_count} tables")
                    use_streaming = table_count > 10000
                except Exception as e:
                    # Extract clean error message without stack trace for console
                    error_msg = get_clean_error_message(e)
                    self.logger.warning(f"   Could not count tables in {db_fqn}: {error_msg}. Using non-streaming mode.")
                    self.logger.debug(f"   Full error details: {e}")  # Full trace to log file only
                    use_streaming = False  # Fallback to non-streaming
                
                if use_streaming:
                    self.logger.info(f"   Large schema detected in {db_fqn} ({table_count:,} tables). Using streaming mode...")
                    # Process in chunks
                    for table_chunk in self._fetch_tables_for_schema_streaming(cat_normalized, schema_normalized, chunk_size=self.streaming_batch_size):
                        for fq_table_name in table_chunk:
                            cat, db, tbl = parse_three_level_name(fq_table_name)
                            if not (cat and db and tbl):
                                self.logger.warning(f"Skipping malformed table name: {fq_table_name}")
                                continue
                            table_tuple = (cat, db, tbl)
                            
                            # Only filter by individual_tables if this schema was NOT explicitly provided
                            # Explicit schemas should expand ALL tables
                            if not is_explicit_schema and individual_tables_set and table_tuple not in individual_tables_set:
                                continue
                            
                            schema_table_tuples.append(table_tuple)
                else:
                    # Non-streaming mode for smaller schemas
                    table_names_fq = self._fetch_tables_for_schema(cat_normalized, schema_normalized)
                    
                    if not table_names_fq:
                        self.logger.debug(f"No tables found in {db_fqn}. Skipping.")
                        return []  # Return empty list for this schema
                    
                    for fq_table_name in table_names_fq:
                        cat, db, tbl = parse_three_level_name(fq_table_name)
                        if not (cat and db and tbl):
                            self.logger.warning(f"Skipping malformed table name: {fq_table_name}")
                            continue
                        table_tuple = (cat, db, tbl)
                        
                        # Only filter by individual_tables if this schema was NOT explicitly provided
                        # Explicit schemas should expand ALL tables
                        if not is_explicit_schema and individual_tables_set and table_tuple not in individual_tables_set:
                            continue
                        
                        schema_table_tuples.append(table_tuple)
                
                # Analyze sizes for this schema's tables
                schema_size_infos = []
                if schema_table_tuples:
                    self.logger.debug(f"   Analyzing column counts for {len(schema_table_tuples)} tables in {db_fqn}...")
                    
                    # Process in chunks to avoid memory issues
                    # IMPORTANT: Use max_parallelism=1 to avoid nested parallelism deadlock
                    # (outer loop already runs schemas in parallel with self.max_parallelism workers)
                    chunk_size = 100
                    for i in range(0, len(schema_table_tuples), chunk_size):
                        chunk = schema_table_tuples[i:i+chunk_size]
                        size_infos = self.size_analyzer.analyze_table_sizes_batch(chunk, self.catalog_capabilities, max_parallelism=1)
                        schema_size_infos.extend(size_infos)
                    
                    self.logger.info(f"   ✓ {db_fqn}: Found {len(schema_table_tuples)} tables")
                
                return schema_size_infos
            
            # ADAPTIVE PARALLELISM: Fixed for metadata (DB connection limits)
            discovery_parallelism, reason = calculate_adaptive_parallelism(
                "schema_discovery", self.max_parallelism,
                num_items=len(self.database_queue),
                is_llm_operation=False, logger=self.logger
            )
            log_adaptive_parallelism_decision("schema_discovery", discovery_parallelism, self.max_parallelism, reason)
            
            with ThreadPoolExecutor(max_workers=discovery_parallelism, thread_name_prefix="SchemaDiscovery") as executor:
                futures = {executor.submit(discover_and_analyze_schema, schema_tuple): schema_tuple 
                          for schema_tuple in self.database_queue}
                
                completed = 0
                total = len(self.database_queue)
                self.logger.info(f"      Submitted {total} schemas for parallel discovery...")
                
                # Add timeout to prevent infinite hangs
                timeout_per_schema = self.schema_timeout_seconds
                
                # as_completed raises TimeoutError from the iterator itself once the overall
                # deadline passes, so the whole loop is wrapped to keep discovery from crashing.
                try:
                    for future in concurrent.futures.as_completed(futures, timeout=timeout_per_schema * total):
                        schema_tuple = futures[future]
                        completed += 1
                        try:
                            # The future is already done here, so result() returns immediately
                            schema_size_infos = future.result()
                            all_table_size_infos.extend(schema_size_infos)
                            self.logger.info(f"      Progress: {completed}/{total} schemas analyzed, {len(all_table_size_infos)} tables discovered so far")
                        except Exception as e:
                            self.logger.error(f"Failed to analyze schema {schema_tuple}: {e}")
                except concurrent.futures.TimeoutError:
                    timeout_minutes = (timeout_per_schema * total) // 60
                    pending = [futures[f] for f in futures if not f.done()]
                    self.logger.error(f"⏱️  Timeout during schema discovery (>{timeout_minutes} min). Skipping unfinished schemas: {pending}")
                    # cancel() only stops queued work; running tasks cannot be interrupted
                    # and the executor still waits for them when the `with` block exits.
                    for f in futures:
                        f.cancel()

                # Log final completion
                if completed < total:
                    self.logger.warning(f"⚠️  Only {completed}/{total} schemas completed. {total - completed} schemas timed out or failed.")
            
            self.logger.info(f"📊 Pass 1 complete: Analyzed {len(all_table_size_infos)} tables")
            
            # Deduplicate tables (in case same table was included via both schema and individual table)
            seen_tables = set()
            unique_table_size_infos = []
            for info in all_table_size_infos:
                table_key = (info.catalog, info.schema, info.table)
                if table_key not in seen_tables:
                    seen_tables.add(table_key)
                    unique_table_size_infos.append(info)
            
            if len(unique_table_size_infos) < len(all_table_size_infos):
                duplicates_removed = len(all_table_size_infos) - len(unique_table_size_infos)
                self.logger.info(f"🔄 Deduplicated: Removed {duplicates_removed} duplicate tables, {len(unique_table_size_infos)} unique tables remaining")
                all_table_size_infos = unique_table_size_infos
            
            # PASS 2: Create optimized batches based on size analysis
            self.logger.info("🎯 Pass 2: Creating optimized batches based on table sizes...")
            self.optimized_batches = self.batch_optimizer.create_optimized_batches(all_table_size_infos)
            self._size_analysis_complete = True
            
            # Store all table tuples for reference
            self.all_table_tuples = [
                (info.catalog, info.schema, info.table) for info in all_table_size_infos
            ]
            
            self.logger.info(f"✅ Two-pass initialization complete: {len(self.all_table_tuples)} tables in {len(self.optimized_batches)} optimized batches")
        
        # === SINGLE-PASS MODE: Standard behavior (for backward compatibility) ===
        else:
            self.logger.info("📋 SINGLE-PASS MODE: Standard table discovery with parallel queries...")
            self.logger.info(f"⚡ Using {self.max_parallelism} parallel workers for schema queries")
            
            # Function to fetch tables for a single schema
            def fetch_schema_tables(schema_tuple):
                catalog_name, schema_name = schema_tuple
                cat_normalized = normalize_identifier(catalog_name)
                schema_normalized = normalize_identifier(schema_name)
                db_fqn = build_fqn(cat_normalized, schema_normalized)
                self.logger.info(f"🔍 Starting table fetch for {db_fqn}...")
                
                schema_key = (cat_normalized, schema_normalized)
                is_explicit_schema = schema_key in explicit_schemas_set
                
                # Fetch all table names for this database
                table_names_fq = self._fetch_tables_for_schema(cat_normalized, schema_normalized)
                
                schema_tuples = []
                if not table_names_fq:
                    self.logger.debug(f"No tables found in {db_fqn}. Skipping.")
                    return schema_tuples
                    
                # Parse into (cat, schema, table) tuples
                for fq_table_name in table_names_fq:
                    cat, db, tbl = parse_three_level_name(fq_table_name)
                    if not (cat and db and tbl):
                        self.logger.warning(f"Skipping malformed table name: {fq_table_name}")
                        continue
                    table_tuple = (cat, db, tbl)
                    
                    # Explicit schemas expand ALL tables; otherwise filter by individual_tables
                    if is_explicit_schema:
                        # Schema was explicitly provided - include ALL tables
                        schema_tuples.append(table_tuple)
                    elif individual_tables_set:
                        # Schema came from individual tables - only include those specific tables
                        if table_tuple in individual_tables_set:
                            schema_tuples.append(table_tuple)
                    else:
                        # No individual tables specified - include all
                        schema_tuples.append(table_tuple)
                
                self.logger.info(f"   ✓ {db_fqn}: Found {len(schema_tuples)} tables")
                return schema_tuples
            
            # ADAPTIVE PARALLELISM: Fixed for metadata (DB connection limits)
            table_discovery_parallelism, reason = calculate_adaptive_parallelism(
                "table_discovery", self.max_parallelism,
                num_items=len(self.database_queue),
                is_llm_operation=False, logger=self.logger
            )
            log_adaptive_parallelism_decision("table_discovery", table_discovery_parallelism, self.max_parallelism, reason)
            
            with ThreadPoolExecutor(max_workers=table_discovery_parallelism, thread_name_prefix="TableDiscovery") as executor:
                futures = {executor.submit(fetch_schema_tables, schema_tuple): schema_tuple 
                          for schema_tuple in self.database_queue}
                
                completed = 0
                total = len(self.database_queue)
                self.logger.info(f"      Submitted {total} schemas for parallel discovery...")
                
                # Overall deadline for the whole batch, so one stuck schema cannot stall the loop forever
                timeout_per_schema = self.schema_timeout_seconds
                
                try:
                    for future in concurrent.futures.as_completed(futures, timeout=timeout_per_schema * total):
                        schema_tuple = futures[future]
                        try:
                            # Futures yielded by as_completed() are already done, so result() does not block
                            schema_tuples = future.result()
                            self.all_table_tuples.extend(schema_tuples)
                            completed += 1
                            self.logger.info(f"      Progress: {completed}/{total} schemas processed, {len(self.all_table_tuples)} tables discovered so far")
                        except Exception as e:
                            self.logger.error(f"Failed to fetch tables for schema {schema_tuple}: {e}")
                except concurrent.futures.TimeoutError:
                    # as_completed() raises from the iterator itself once the deadline passes with futures still pending
                    pending_schemas = [s for f, s in futures.items() if not f.done()]
                    for f in futures: f.cancel()  # drops queued work; queries that are already running cannot be interrupted
                    timeout_minutes = (timeout_per_schema * total) // 60
                    self.logger.error(f"⏱️  Timeout fetching tables (>{timeout_minutes} min overall). Skipping {len(pending_schemas)} unfinished schemas: {pending_schemas}")
                
                # Log final completion (completed counts only schemas that returned successfully)
                if completed < total:
                    self.logger.warning(f"⚠️  Only {completed}/{total} schemas completed. {total - completed} schemas timed out or failed.")
            
            # Deduplicate tables (in case same table was included via both schema and individual table)
            original_count = len(self.all_table_tuples)
            self.all_table_tuples = list(dict.fromkeys(self.all_table_tuples))  # Preserves order, removes duplicates
            if len(self.all_table_tuples) < original_count:
                duplicates_removed = original_count - len(self.all_table_tuples)
                self.logger.info(f"🔄 Deduplicated: Removed {duplicates_removed} duplicate tables, {len(self.all_table_tuples)} unique tables remaining")
            
            self.logger.info(f"Table discovery complete. Found {len(self.all_table_tuples)} total tables across all databases.")
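

# Illustrative sketch (not part of the pipeline): how the overall deadline passed to
# concurrent.futures.as_completed() behaves. The helper name and the sleep-based
# workload are hypothetical; only the standard-library calls are real. The
# TimeoutError is raised by the iterator, so it has to be caught around the loop
# rather than inside the loop body.
import concurrent.futures
import time


def _sketch_collect_with_deadline(delays, deadline_seconds):
    """Run time.sleep(d) for each delay; return (finished, unfinished) lists of delays."""
    finished, unfinished = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        futures = {executor.submit(time.sleep, d): d for d in delays}
        try:
            for future in concurrent.futures.as_completed(futures, timeout=deadline_seconds):
                future.result()  # already done, so this returns immediately
                finished.append(futures[future])
        except concurrent.futures.TimeoutError:
            # Raised once the deadline passes while some futures are still pending
            unfinished = [d for f, d in futures.items() if not f.done()]
    # Leaving the "with" block still waits for running workers to finish
    return finished, unfinished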

# COMMAND ----------

# DBTITLE 1,LakeViewDashboard
# Chart Type Reference Guide:
# The generator resolves each chart field to a SQL column in two steps:
# 1. By Prefix: Uses the first column whose alias starts with the field's prefix (e.g., `category_`, `value_`).
# 2. By Position: If no alias carries that prefix, it falls back to the column's position in the SELECT list.
#
# Supported formats (optional columns in brackets `[]`):
# --------------------------------------------------------------------------------
# Bar Chart:      category_col, value_col, [group_col]
# Pie Chart:      category_col, value_col
# Line Chart:     x_axis_col, y_axis_col, [group_col]
# Area Chart:     x_axis_col, y_axis_col, [group_col]
# Scatter Plot:   x_axis_col, y_axis_col, [group_col]
# Heatmap:        x_axis_col, y_axis_col, color_col
# Combo Chart:    x_axis_col, bar_value_col, line_value_col
#
# Counter:        value_col
# Histogram:      value_col
# Funnel Chart:   stage_col, value_col
#
# Box Plot:       category_col, min_col, q1_col, median_col, q3_col, max_col
# Sankey Chart:   source_col, destination_col, value_col
# Pivot Table:    row_col, column_col, cell_value_col
# Choropleth Map: location_col, value_col
# Symbol Map:     lat_col, lon_col, [size_col], [group_col]
#
# Table:          col_1, col_2, ... (all columns are displayed)
# --------------------------------------------------------------------------------
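
# Illustrative sketch of the two-step lookup described above. The helper below is a
# hypothetical standalone function mirroring the prefix-then-position idea; the
# column names in the usage notes are invented for the example.
def _sketch_resolve_chart_column(columns, prefixes, position):
    """Return the first column starting with one of `prefixes`, else the column at `position`, else None."""
    for prefix in prefixes:
        for column in columns:
            if column.startswith(prefix):
                return column
    return columns[position] if position < len(columns) else None

# Usage notes:
#   _sketch_resolve_chart_column(["value_total", "category_region"], ("category_",), 0) returns "category_region"
#   _sketch_resolve_chart_column(["region", "total"], ("category_",), 0) returns "region" (positional fallback)
#   _sketch_resolve_chart_column(["region", "total"], ("group_",), 2) returns None (optional field absent)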


class WidgetFailedToCreate(Exception):
    """Custom exception raised when widget creation fails due to unmet requirements."""
    pass

class LakeViewDashboard:
    """A fluent API for generating Databricks Lakeview dashboards."""
    def __init__(self, name: str, logger: logging.Logger):
        self.name = name
        self.logger = logger
        self.WIDGET_WIDTH = 3
        self.WIDGET_HEIGHT = 6
        self.COLUMN_COUNT = 2
        self.dashboard = {"datasets": [], "pages": [], "uiSettings": {"theme": {"widgetHeaderAlignment": "ALIGNMENT_UNSPECIFIED"}}}
        self.current_page = None
        self.dataset_map = {}
        self.page_map = {}

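    def _find_col(self, columns, prefixes, index):
        """Resolve a chart field: first column matching a prefix (in prefix order), else the column at `index`, else None."""
        for prefix in prefixes:
            col = next((c for c in columns if c.startswith(prefix)), None)
            if col: return col
        if index < len(columns): return columns[index]
        # Nothing matched: return None and let the caller decide whether the field is optional
        return None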

    def _get_or_create_dataset(self, query, title):
        query_key = query.strip().lower()
        if query_key in self.dataset_map: return self.dataset_map[query_key]
        dataset_name = uuid.uuid4().hex[:8]
        dataset = {"name": dataset_name, "displayName": title, "queryLines": [query.strip()]}
        self.dashboard["datasets"].append(dataset)
        self.dataset_map[query_key] = dataset_name
        return dataset_name

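    def _parse_sql_columns(self, sql_query):
        """Return the output column names (aliases) of the outermost SELECT, or [] if they cannot be determined."""
        # Strip line and block comments, then collapse all whitespace to single spaces
        query = re.sub(r'--.*', '', sql_query)
        query = re.sub(r'/\*.*?\*/', '', query, flags=re.DOTALL)
        query = ' '.join(query.split())
        query_upper = query.upper()

        # Prefer the first SELECT outside parentheses so CTE bodies (WITH x AS (SELECT ...)) are skipped;
        # fall back to the first SELECT anywhere, e.g. for a fully parenthesized query
        select_pos = -1
        paren_depth = 0
        for i, char in enumerate(query):
            if char == '(': paren_depth += 1
            elif char == ')': paren_depth -= 1
            elif paren_depth == 0 and query_upper.startswith("SELECT ", i):
                select_pos = i
                break
        if select_pos == -1: select_pos = query_upper.find("SELECT ")
        if select_pos == -1: return []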

        paren_depth = 0
        main_from_pos = -1
        cursor = select_pos + 6
        while cursor < len(query):
            char = query[cursor]
            if char == '(': paren_depth += 1
            elif char == ')': paren_depth -= 1
            if paren_depth == 0 and query_upper[cursor:cursor+6] == ' FROM ':
                main_from_pos = cursor
                break
            cursor += 1

        if main_from_pos == -1: return []
        select_part = query[select_pos + 6 : main_from_pos].strip()
        if not select_part or select_part == '*': return []
            
        columns_exprs = []
        current_col_start = 0
        paren_depth = 0
        for i, char in enumerate(select_part):
            if char == '(': paren_depth += 1
            elif char == ')': paren_depth -= 1
            elif char == ',' and paren_depth == 0:
                columns_exprs.append(select_part[current_col_start:i].strip())
                current_col_start = i + 1
        columns_exprs.append(select_part[current_col_start:].strip())
        
        aliases = []
        for col_expr in columns_exprs:
            parts = re.split(r'\s+as\s+', col_expr, flags=re.IGNORECASE)
            alias = parts[-1] if len(parts) > 1 else col_expr.split('.')[-1].split()[-1]
            aliases.append(alias.strip('`"\''))
            
        return aliases
    
    def page(self, title):
        if title not in self.page_map:
            page_name = uuid.uuid4().hex[:8]
            new_page = {"name": page_name, "displayName": title, "layout": [], "pageType": "PAGE_TYPE_CANVAS", "_column_counts": [0] * self.COLUMN_COUNT}
            self.dashboard["pages"].append(new_page)
            self.page_map[title] = new_page
        self.current_page = self.page_map[title]
        return self

    def save_to_file(self, file_path):
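        # Persistence is normally handled by the external write_to_dbfs function; this helper is kept for direct file output.
        # It strips the internal "_column_counts" layout state from each page, so add widgets before saving, not after.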
        for p in self.dashboard.get("pages", []): p.pop("_column_counts", None)
        dir_path = os.path.dirname(file_path)
        if dir_path: os.makedirs(dir_path, exist_ok=True)
        with open(file_path, "w") as f: json.dump(self.dashboard, f, indent=2)
        self.logger.info(f"Dashboard successfully saved to {file_path}")
        return self

    def validate_viz(self, viz_type: str, viz_query: str, viz_title: str):
        """
        Runs a dry-run of the widget creation process to validate it.
        Raises WidgetFailedToCreate or other errors if validation fails.
        """
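        # Normalize the widget type (case, surrounding whitespace, "-" or " " separators) to a builder suffix
        viz_function_name = viz_type.strip().lower().replace("-", "_").replace(" ", "_")
        
        # 1. Check if type is supported
        try:
            spec_builder = getattr(self, f"_build_{viz_function_name}_spec")
        except AttributeError:
            raise WidgetFailedToCreate(f"Visualization type '{viz_type}' is not supported.") from None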
        
        # 2. Check if columns can be parsed
        columns = self._parse_sql_columns(viz_query)
        if not columns:
            raise WidgetFailedToCreate(f"Could not parse any columns from query, likely due to 'SELECT *'. Query: {viz_query}")
        
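        # 3. Check that the spec builder runs without error. Note that _find_col returns None
        #    instead of raising, so a missing column shows up as a None fieldName, not an exception.
        try:
            spec_builder(columns, viz_title)
        except Exception as e:
            raise WidgetFailedToCreate(f"Failed to build widget spec (e.g., missing required columns): {e}") from e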
        
        # If all checks pass, return the parsed data to avoid re-doing work
        return columns, spec_builder, viz_function_name

    def _add_chart(self, query, title, chart_type_fn, columns, spec):
        """Internal method to add a chart, assuming validation has passed."""
        if not self.current_page: self.page("Main Page")
        dataset_name = self._get_or_create_dataset(query, title)
        
        is_disaggregated = chart_type_fn not in ("histogram", "box", "table")
        fields = [{"name": c, "expression": f"`{c}`"} for c in columns]
        if chart_type_fn == "histogram":
            value_col = self._find_col(columns, ('value_',), 0) or columns[0]
            fields = [{"name": f"bin({value_col}, binWidth=10)", "expression": f"BIN_FLOOR(`{value_col}`, 10)"}, {"name": "count(*)", "expression": "COUNT(`*`)"}]
        
        widget_query = {"datasetName": dataset_name, "fields": fields, "disaggregated": is_disaggregated}
        widget = {"name": uuid.uuid4().hex[:8], "queries": [{"name": "main_query", "query": widget_query}], "spec": spec}
        
        if chart_type_fn == "choropleth_map":
            width, height = self.COLUMN_COUNT * self.WIDGET_WIDTH, self.WIDGET_HEIGHT
            max_row = max(self.current_page["_column_counts"]) if self.current_page["_column_counts"] else 0
            position = {"x": 0, "y": max_row, "width": width, "height": height}
            self.current_page["layout"].append({"widget": widget, "position": position})
            new_row_count = max_row + height
            for i in range(self.COLUMN_COUNT): self.current_page["_column_counts"][i] = new_row_count
        else:
            target_column = self.current_page["_column_counts"].index(min(self.current_page["_column_counts"]))
            position = {"x": target_column * self.WIDGET_WIDTH, "y": self.current_page["_column_counts"][target_column], "width": self.WIDGET_WIDTH, "height": self.WIDGET_HEIGHT}
            self.current_page["layout"].append({"widget": widget, "position": position})
            self.current_page["_column_counts"][target_column] += self.WIDGET_HEIGHT
        
        # Log the successful placement
        self.logger.info(f"{self.name}.{self.current_page['displayName']}: Added widget '{title}' ({chart_type_fn})")
        
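    # --- Spec builders: one _build_<type>_spec(columns, title) method per widget type ---
    def _get_display_name(self, col_name):
        """Turn an alias such as 'value_total_sales' into a label such as 'Total Sales' by dropping the prefix."""
        if col_name is None: return "N/A"
        return col_name.split("_", 1)[-1].replace("_", " ").title()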
    def _build_counter_spec(self, c, t): return {"version": 2, "widgetType": "counter", "frame": {"title": t, "showTitle": True}, "encodings": {"value": {"fieldName": self._find_col(c, ('value_',), 0), "displayName": self._get_display_name(self._find_col(c, ('value_',), 0))}}}
    def _build_table_spec(self, c, t): return {"version": 2, "widgetType": "table", "encodings": {"columns": [{"fieldName": col, "displayName": self._get_display_name(col)} for col in c]}, "frame": {"title": t, "showTitle": True}}
    def _build_bar_spec(self, c, t):
        encodings = {
            "x": {"fieldName": self._find_col(c, ('category_', 'x_'), 0), "scale": {"type": "categorical", "sort": {"by": "y-reversed"}}},
            "y": {"fieldName": self._find_col(c, ('value_', 'y_'), 1), "scale": {"type": "quantitative"}},
        }
        # The color encoding is optional: add it only when a grouping column was resolved
        color = self._find_col(c, ('group_', 'color_'), 2)
        if color: encodings["color"] = {"fieldName": color, "scale": {"type": "categorical"}}
        return {"version": 3, "widgetType": "bar", "encodings": encodings, "frame": {"title": t, "showTitle": True}}
    def _build_pie_spec(self, c, t): return {"version": 3, "widgetType": "pie", "frame": {"title": t, "showTitle": True}, "encodings": {"angle": {"fieldName": self._find_col(c, ('value_', 'angle_', 'y_'), 1), "scale": {"type": "quantitative"}}, "color": {"fieldName": self._find_col(c, ('category_', 'x_'), 0), "scale": {"type": "categorical"}}, "label": {"show": True}}}
    def _build_line_spec(self, c, t):
        encodings = {
            "x": {"fieldName": self._find_col(c, ('x_',), 0), "scale": {"type": "temporal"}},
            "y": {"fieldName": self._find_col(c, ('y_',), 1), "scale": {"type": "quantitative"}},
        }
        color = self._find_col(c, ('group_', 'color_'), 2)
        if color: encodings["color"] = {"fieldName": color, "scale": {"type": "categorical"}}
        return {"version": 3, "widgetType": "line", "encodings": encodings, "frame": {"title": t, "showTitle": True}}
    def _build_area_spec(self, c, t):
        encodings = {
            "x": {"fieldName": self._find_col(c, ('x_',), 0), "scale": {"type": "temporal"}},
            "y": {"fieldName": self._find_col(c, ('y_', 'value_'), 1), "scale": {"type": "quantitative"}},
        }
        color = self._find_col(c, ('group_', 'color_'), 2)
        if color: encodings["color"] = {"fieldName": color, "scale": {"type": "categorical"}}
        return {"version": 3, "widgetType": "area", "encodings": encodings, "frame": {"title": t, "showTitle": True}}
    def _build_scatter_spec(self, c, t):
        encodings = {
            "x": {"fieldName": self._find_col(c, ('x_',), 0), "scale": {"type": "quantitative"}},
            "y": {"fieldName": self._find_col(c, ('y_',), 1), "scale": {"type": "quantitative"}},
        }
        color = self._find_col(c, ('group_', 'color_'), 2)
        if color: encodings["color"] = {"fieldName": color, "scale": {"type": "categorical"}}
        return {"version": 3, "widgetType": "scatter", "encodings": encodings, "frame": {"title": t, "showTitle": True}}
    def _build_heatmap_spec(self, c, t): return {"version": 3, "widgetType": "heatmap", "encodings": {"x": {"fieldName": self._find_col(c, ('x_',), 0), "scale": {"type": "categorical"}}, "y": {"fieldName": self._find_col(c, ('y_',), 1), "scale": {"type": "categorical"}}, "color": {"fieldName": self._find_col(c, ('color_',), 2), "scale": {"type": "quantitative"}}}, "frame": {"title": t, "showTitle": True}}
    def _build_histogram_spec(self, c, t): v = self._find_col(c, ('value_',), 0) or c[0]; return {"version": 3, "widgetType": "histogram", "encodings": {"x": {"fieldName": f"bin({v}, binWidth=10)", "scale": {"type": "quantitative"}}, "y": {"fieldName": "count(*)", "scale": {"type": "quantitative"}}}, "frame": {"title": t, "showTitle": True}}
    def _build_box_spec(self, c, t): y = {"whiskerStart": {"fieldName": self._find_col(c, ('minimum_', 'min_'), 1)}, "boxStart": {"fieldName": self._find_col(c, ('q1_',), 2)}, "boxMid": {"fieldName": self._find_col(c, ('median_',), 3)}, "boxEnd": {"fieldName": self._find_col(c, ('q3_',), 4)}, "whiskerEnd": {"fieldName": self._find_col(c, ('maximum_', 'max_'), 5)}, "scale": {"type": "quantitative"}}; return {"version": 3, "widgetType": "box", "frame": {"title": t, "showTitle": True}, "encodings": {"x": {"fieldName": self._find_col(c, ('category_',), 0), "scale": {"type": "categorical"}}, "y": y}}
    def _build_combo_spec(self, c, t): y = {"primary": {"fields": [{"fieldName": self._find_col(c, ('bar_',), 1), "seriesType": "bar"}]}, "secondary": {"fields": [{"fieldName": self._find_col(c, ('line_',), 2), "seriesType": "line"}]}, "scale": {"type": "quantitative"}, "dualAxis": True}; return {"version": 1, "widgetType": "combo", "frame": {"title": t, "showTitle": True}, "encodings": {"x": {"fieldName": self._find_col(c, ('x_',), 0), "scale": {"type": "temporal"}}, "y": y}}
    def _build_sankey_spec(self, c, t): return {"version": 1, "widgetType": "sankey", "encodings": {"value": {"fieldName": self._find_col(c, ('value_',), 2)}, "stages": [{"fieldName": self._find_col(c, ('source_',), 0)}, {"fieldName": self._find_col(c, ('destination_',), 1)}]}, "frame": {"title": t, "showTitle": True}}
    def _build_pivot_spec(self, c, t): return {"version": 3, "widgetType": "pivot", "encodings": {"rows": [{"fieldName": self._find_col(c, ('row_',), 0)}], "columns": [{"fieldName": self._find_col(c, ('column_',), 1)}], "cell": {"type": "multi-cell", "fields": [{"fieldName": self._find_col(c, ('cell_',), 2), "cellType": "text"}]}}, "frame": {"title": t, "showTitle": True}}
    def _build_funnel_spec(self, c, t): return {"version": 3, "widgetType": "funnel", "encodings": {"x": {"fieldName": self._find_col(c, ('value_',), 1), "scale": {"type": "quantitative"}}, "y": {"fieldName": self._find_col(c, ('stage_',), 0), "scale": {"type": "categorical"}}, "label": {"show": True}}, "frame": {"title": t, "showTitle": True}}
    def _build_choropleth_map_spec(self, c, t): return {"version": 1, "widgetType": "choropleth-map", "frame": {"title": t, "showTitle": True}, "encodings": {"color": {"fieldName": self._find_col(c, ('value_',), 1), "scale": {"type": "quantitative", "colorRamp": {"mode": "scheme", "scheme": "blues"}}}, "region": {"regionType": "mapbox-v4-admin", "admin0": {"fieldName": self._find_col(c, ('location_',), 0), "type": "field", "geographicRole": "admin0-iso-3166-1-alpha-3"}}}}
    def _build_symbol_map_spec(self, c, t):
        encodings = {"coordinates": {"latitude": {"fieldName": self._find_col(c, ('lat_', 'latitude_'), 0)}, "longitude": {"fieldName": self._find_col(c, ('lon_', 'longitude_'), 1)}}}
        # Size and color are optional: add each only when a matching column was resolved
        size = self._find_col(c, ('size_', 'value_'), 2)
        color = self._find_col(c, ('color_', 'group_'), 3)
        if size: encodings["size"] = {"fieldName": size, "scale": {"type": "quantitative"}}
        if color: encodings["color"] = {"fieldName": color, "scale": {"type": "categorical"}}
        return {"version": 2, "widgetType": "symbol-map", "frame": {"title": t, "showTitle": True}, "encodings": encodings}

    def add_viz(self, viz_type: str, viz_title: str, viz_query: str):
        """
        Validates and adds a visualization to the current page.
        Returns True on success, False on "expected" failure.
        """
        try:
            # First, validate the visualization. This can raise errors.
            columns, spec_builder, viz_function_name = self.validate_viz(viz_type, viz_query, viz_title)
            
            # Re-run spec builder to get the final spec
            spec = spec_builder(columns, viz_title)

            # If validation passes, call the internal _add_chart method
            self._add_chart(viz_query, viz_title, viz_function_name, columns, spec)
            
            return True # Return True on success
        
        except (WidgetFailedToCreate, AttributeError, IndexError, TypeError) as e:
            # Catch "expected" failures (bad type, bad SQL, bad spec, NoneType errors)
            self.logger.warning(f"Skipping widget '{viz_title}' during final assembly: {e}")
            return False # Return False on "expected" failure
        except Exception as e:
            # Catch any other unexpected errors
            self.logger.error(f"Unexpected error adding widget '{viz_title}': {e}")
            return False # Return False on "unexpected" failure
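

# Illustrative sketch of the column-balancing placement that _add_chart applies to
# regular (non-map) widgets: each widget goes into whichever column is currently
# shortest. The function name is hypothetical; the defaults mirror the class
# constants COLUMN_COUNT, WIDGET_WIDTH and WIDGET_HEIGHT defined in __init__.
def _sketch_place_widgets(widget_count, column_count=2, width=3, height=6):
    """Return a list of {"x", "y", "width", "height"} positions in placement order."""
    column_heights = [0] * column_count
    positions = []
    for _ in range(widget_count):
        target = column_heights.index(min(column_heights))  # leftmost shortest column
        positions.append({"x": target * width, "y": column_heights[target], "width": width, "height": height})
        column_heights[target] += height
    return positions

# Usage note: three widgets land at (x=0, y=0), (x=3, y=0) and (x=0, y=6).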
        
PROMPT_TEMPLATES["DASHBOARDS_GEN_PROMPT"] = """
### 0. PERSONA ACTIVATION

You are a **Principal Business Intelligence Engineer** and an industry specialist with deep expertise in the `{industry}` industry, `{description}`. You are a master of designing and implementing comprehensive analytical dashboards on the Databricks platform. Your work focuses on creating actionable, multi-page dashboards that translate complex data into critical business insights through effective data visualization.

-----

### 1. DEFINITIONS

  * **Dashboard Page:** A dedicated view within the dashboard.
  * **Query:** A multi-join or single-table Databricks SQL `SELECT` statement that retrieves data for a visualization.
  * **Widget:** A data visualization component (e.g., bar chart, line chart).

-----

### 2. CONTEXT

  * **Industry:** `{industry}`
  * **Business Domain:** `{domain}` (This is given as `catalog.schema`)
  * **Target Page Name:** `{page_name}`
  * **Boolean Values Format:** `{boolean_format}`
  * **Date Format:** `{date_format}`
  * **Timestamp Format:** `{datetime_format}`

-----

### 3. CORE TASK

Your core task is to create a comprehensive set of widgets for a **single dashboard page named `{page_name}`**. You will generate a curated set of SQL queries that power various widgets. **Crucially, every query you generate MUST have a clear and significant business value**, representing critical reports, KPIs, and performance measures from the business domain.

-----

### 4. WORKFLOW & RULES

1.  **Analyze Data Model & Plan Joins:**

      * Thoroughly analyze the **database schema markdown** in **Section 9**.
      * Before writing any SQL, mentally identify potential join keys between tables. Look for columns with similar names and purposes. This planning is crucial for generating valid, insightful queries.

2.  **Design Business-Critical Widgets:**

      * All generated queries will be for a single page named `{page_name}`.
      * The total number of generated queries **must be between `{min_dashboard_visualizations}` and `{max_dashboard_visualizations}`**.
      * For every widget idea, first define the business question it answers (e.g., 'Which marketing channels are driving the most sales?'). Then, design a query to answer it.

3.  **Generate Insightful SQL Queries:**

    **MOST CRITICAL RULE: STRICT SCHEMA ADHERENCE**

      * **ZERO TOLERANCE FOR HALLUCINATION:** You **MUST NOT** invent, guess, assume, or infer any table or column names. Every identifier you use must exist in the schema.
      * **EXACT NAMES ONLY:** Every single table and column identifier in your SQL **MUST BE an exact, case-sensitive copy-paste** from the table and column names in the **schema markdown** (Section 9).
          * For tables, you **MUST** use the fully qualified, three-level name provided (e.g., `main.customer_data.accounts`).
          * For columns, you **MUST** use the exact column name provided (e.g., `creation_date_erdat`).
      * **VALID JOINS ONLY:** Every `JOIN` condition **MUST** use columns that are explicitly present in the respective tables as defined in the schema.

    **SQL Syntax & Best Practices**

      * All queries **MUST** use the Databricks SQL dialect. Use table aliases for clarity.
      * **Prioritize multi-join queries** that combine data from different tables to uncover complex insights. **Single-table queries are acceptable ONLY IF they provide a direct, high-value business metric** (e.g., a critical KPI like 'Total Active Customers' or 'Overall Defect Rate'). Simple single-table data dumps without clear business value are forbidden.
      * **DATE UNIT EXTRACTION:** Use `EXTRACT(<UNIT> FROM <DATE/TIMESTAMP COLUMN>)`, not `DATE_TRUNC`.
      * **WHERE CLAUSE FOR DASHBOARDS:** For dashboard queries showing business metrics, you MAY use WHERE clauses with known business values (e.g., `WHERE status = 'active'` for KPI counters). Note: This differs from general SQL generation where value filtering is forbidden because data values are unknown.
      * **USE SINGLE QUOTES FOR ALL SQL STRING LITERALS.** Example: `WHERE LOWER(status) = 'active'`.

4.  **Configure Widget Visualizations:**

      * Select the most appropriate widget type for each query from the list in **Section 8**.
      * Diversify widget types. Aim to use at least 80% of the available widget types.
      * Use **AT MOST** two `counter` widgets and **AT MOST** two `table` widgets.

-----

### 5. NAMING & FORMATTING CONVENTIONS

  * **SQL Column Aliases (CRITICAL):** For every widget, each SQL column alias in the `SELECT` statement **MUST START WITH** the required field prefix specified in **Section 8** (e.g., `category_tier`, `value_count`).
  * **Box Plot Pre-aggregation:** The SQL query for a `box` plot **MUST** return a `category_` column followed by pre-calculated values for `MIN()`, `APPROX_PERCENTILE(..., 0.25)`, `MEDIAN()`, `APPROX_PERCENTILE(..., 0.75)`, and `MAX()`, aliased with the prefixes `min_`, `q1_`, `median_`, `q3_`, and `max_` respectively.

-----

### 6. OUTPUT REQUIREMENTS

  * **CSV OUTPUT:** Your entire response **MUST** be a single, valid CSV file.

  * **CSV STRUCTURE:**

      * The first line **MUST** be the header row: `title,widget,query`.
      * Each subsequent line represents a single widget.
      * Fields must be properly CSV-escaped.

  * **CSV Columns:**

      * `title`: A descriptive business title for the widget.
      * `widget`: The type of the visualization widget (selected from Section 8).
      * `query`: The full Databricks SQL query for the widget.

#### Example Output

```csv
title,widget,query
'Subscriber Count by Loyalty Tier','bar','SELECT s.loyalty_tier AS category_tier, COUNT(s.profile_id) AS value_count FROM main.subscriber.profile AS s JOIN web_logs.logs.visits AS v ON s.profile_id = v.user_id GROUP BY s.loyalty_tier;'
'Total Active Customers','counter','SELECT COUNT(DISTINCT profile_id) AS value_count FROM main.subscriber.profile WHERE LOWER(status) = 'active';'
```

-----

### 7. FINAL CHECKS (META-INSTRUCTIONS)

  * **SELF-CORRECTION MANDATORY:** Before finalizing your response, perform a rigorous validation pass. For each generated query, you must:
    1.  **Check Business Value:** Ask yourself, 'What critical business question does this answer?' If the answer is not clear, discard or improve the query.
    2.  **Verify Schema Adherence:** Meticulously check every table and column name against the schema in Section 9. Ensure all identifiers are an exact copy-paste.
    3.  **Validate Syntax & Aliases:** Ensure the query is valid Databricks SQL and that the column aliases match the prefixes in Section 8.
  * **DISCARD INVALID QUERIES:** If a query fails any of these checks, you **MUST** either correct it or discard it completely. **Do not output a query that you know is invalid or lacks business value.**

-----

### 8. REFERENCE: WIDGET SPECIFICATIONS & REQUIRED FIELDS

`[ { 'chart': 'counter', 'used_for': 'Displaying a single, key performance indicator (KPI) prominently. It can optionally compare the primary value against a secondary target value.', 'fields': [ {'name': 'value', 'prefix': 'value_', 'used_for': 'A numeric column for the primary KPI to be displayed.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'target', 'prefix': 'target_', 'used_for': 'An optional numeric column representing a target value for comparison.', 'type': 'numeric', 'status': 'optional'} ] }, { 'chart': 'bar', 'used_for': 'Comparing numerical values across categories or showing metric changes over time. The layout can be configured to stack or group bars.', 'fields': [ {'name': 'x', 'prefix': 'category_', 'used_for': 'A categorical or temporal column for the axis labels.', 'type': 'categorical or temporal', 'status': 'mandatory'}, {'name': 'y', 'prefix': 'value_', 'used_for': 'A numeric column determining the length of the bars.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'group_', 'used_for': 'A categorical column to group or stack the bars.', 'type': 'categorical', 'status': 'optional'} ] }, { 'chart': 'line', 'used_for': 'Showing a trend or progression of a numerical value over a continuous interval, most commonly time.', 'fields': [ {'name': 'x', 'prefix': 'x_', 'used_for': 'A continuous column, typically a date or timestamp, for the horizontal axis.', 'type': 'temporal', 'status': 'mandatory'}, {'name': 'y', 'prefix': 'y_', 'used_for': 'A numeric column representing the value that changes over time.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'group_', 'used_for': 'An optional categorical column to plot multiple lines on the same chart.', 'type': 'categorical', 'status': 'optional'} ] }, { 'chart': 'area', 'used_for': 'Showing how a group's numeric values change over a second variable (like time), combining line and bar charts. 
The layout can be stacked.', 'fields': [ {'name': 'x', 'prefix': 'x_', 'used_for': 'A continuous column, typically a date or timestamp.', 'type': 'temporal', 'status': 'mandatory'}, {'name': 'y', 'prefix': 'y_', 'used_for': 'A numeric column representing the magnitude over time.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'group_', 'used_for': 'A categorical column to create stacked areas.', 'type': 'categorical', 'status': 'optional'} ] }, { 'chart': 'pie', 'used_for': 'Illustrating the proportional distribution or percentage share of different categories that make up a whole. Best for a small number of categories.', 'fields': [ {'name': 'color', 'prefix': 'category_', 'used_for': 'A categorical column representing the slices of the pie.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'angle', 'prefix': 'value_', 'used_for': 'A numeric column determining the size (angle) of each slice.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'scatter', 'used_for': 'Visualizing the relationship between two numerical variables. A third and fourth dimension can be added using color and marker size (creating a Bubble Chart).', 'fields': [ {'name': 'x', 'prefix': 'x_', 'used_for': 'The first numeric column for the horizontal axis.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'y', 'prefix': 'y_', 'used_for': 'The second numeric column for the vertical axis.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'group_', 'used_for': 'An optional categorical column to color-code the points.', 'type': 'categorical', 'status': 'optional'}, {'name': 'size', 'prefix': 'size_', 'used_for': 'An optional numeric column to control the marker size, turning the scatter into a bubble chart.', 'type': 'numeric', 'status': 'optional'} ] }, { 'chart': 'histogram', 'used_for': 'Representing the frequency distribution of a single numerical variable by grouping values into ranges (bins). 
The number of bins is configurable.', 'fields': [ {'name': 'value', 'prefix': 'value_', 'used_for': 'A single numeric column whose distribution is to be plotted.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'combo', 'used_for': 'Combining line and bar charts to compare measures with different scales or units over the same categories. Supports a dual Y-axis.', 'fields': [ {'name': 'x', 'prefix': 'x_', 'used_for': 'A categorical or time-based column for the shared horizontal axis.', 'type': 'categorical or temporal', 'status': 'mandatory'}, {'name': 'y_primary', 'prefix': 'bar_', 'used_for': 'The first numeric column, typically displayed as bars on the left Y-axis.', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'y_secondary', 'prefix': 'line_', 'used_for': 'The second numeric column, typically displayed as a line, optionally on a separate right Y-axis.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'choropleth-map', 'used_for': 'Visualizing geographical data by shading regions (countries, states) based on a metric. 
Requires a "Geographic role" to be configured for the location field.', 'fields': [ {'name': 'region', 'prefix': 'location_', 'used_for': 'A column containing geographic identifiers (e.g., country codes, state names).', 'type': 'string (geographic)', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'value_', 'used_for': 'A numeric column that determines the color intensity of each region.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'symbol-map', 'used_for': 'Displaying quantitative data as symbols placed at specific geographic coordinates on a map.', 'fields': [ {'name': 'latitude', 'prefix': 'lat_', 'used_for': 'A numeric column containing the latitude coordinates.', 'type': 'numeric (latitude)', 'status': 'mandatory'}, {'name': 'longitude', 'prefix': 'lon_', 'used_for': 'A numeric column containing the longitude coordinates.', 'type': 'numeric (longitude)', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'group_', 'used_for': 'An optional categorical or quantitative column to color-code the points.', 'type': 'categorical or numeric', 'status': 'optional'}, {'name': 'size', 'prefix': 'size_', 'used_for': 'An optional numeric column to control the marker size.', 'type': 'numeric', 'status': 'optional'} ] }, { 'chart': 'heatmap', 'used_for': 'Visualizing the magnitude of a metric across the intersection of two categorical variables, using color intensity in a grid.', 'fields': [ {'name': 'x', 'prefix': 'x_', 'used_for': 'A categorical column for the horizontal axis.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'y', 'prefix': 'y_', 'used_for': 'A categorical column for the vertical axis.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'color', 'prefix': 'color_', 'used_for': 'A numeric column that determines the color of the cell at the x/y intersection.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'sankey', 'used_for': 'Illustrating the flow and magnitude of data between different stages or categories.
The width of the connections is proportional to the flow quantity.', 'fields': [ {'name': 'stage', 'prefix': 'source_', 'used_for': 'A categorical column representing the source node.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'stage', 'prefix': 'destination_', 'used_for': 'A categorical column representing the destination node.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'value', 'prefix': 'value_', 'used_for': 'A numeric column representing the magnitude of the flow between nodes.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'funnel', 'used_for': 'Visualizing the progressive reduction of data as it passes through sequential stages in a process (e.g., a sales pipeline or user conversion).', 'fields': [ {'name': 'stage', 'prefix': 'stage_', 'used_for': 'A categorical column that defines the ordered stages of the funnel.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'value', 'prefix': 'value_', 'used_for': 'A numeric column representing the count or amount at each stage.', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'box', 'used_for': 'Displaying the distribution summary of numerical data through quartiles, optionally grouped by category. 
Shows median, range, and potential outliers.', 'fields': [ {'name': 'category', 'prefix': 'category_', 'used_for': 'A categorical column to create multiple box plots for comparison.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'min', 'prefix': 'min_', 'used_for': 'A numeric column for the minimum value (lower whisker).', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'q1', 'prefix': 'q1_', 'used_for': 'A numeric column for the first quartile (25th percentile).', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'median', 'prefix': 'median_', 'used_for': 'A numeric column for the median (50th percentile).', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'q3', 'prefix': 'q3_', 'used_for': 'A numeric column for the third quartile (75th percentile).', 'type': 'numeric', 'status': 'mandatory'}, {'name': 'max', 'prefix': 'max_', 'used_for': 'A numeric column for the maximum value (upper whisker).', 'type': 'numeric', 'status': 'mandatory'} ] }, { 'chart': 'table', 'used_for': 'Displaying detailed data in a grid. Offers advanced customization for columns, including reordering, conditional formatting, and special data type rendering (HTML, links, images, JSON).', 'fields': [ {'name': 'any', 'prefix': 'any', 'used_for': 'Any number of columns to be displayed.', 'type': 'any', 'status': 'mandatory'} ] }, { 'chart': 'pivot', 'used_for': 'Aggregating and reorganizing records into a cross-tabulated grid, similar to a PIVOT statement in SQL. 
Groups data by rows and columns to show aggregated values.', 'fields': [ {'name': 'row', 'prefix': 'row_', 'used_for': 'A categorical column to group data into rows.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'column', 'prefix': 'column_', 'used_for': 'A categorical column to group data into columns.', 'type': 'categorical', 'status': 'mandatory'}, {'name': 'cell', 'prefix': 'cell_', 'used_for': 'A numeric column with an aggregation (e.g., SUM, COUNT) for the intersecting cells.', 'type': 'numeric', 'status': 'mandatory'} ] } ]`

-----

### 9. DATABASE SCHEMA DEFINITION

{schema_markdown}

-----

### 10. FINAL INSTRUCTION

Begin generation of the CSV output now.

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here is...", "I've...", "Based on...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: title,widget,query,honesty_score,honesty_justification
- Include honesty columns in header and all rows
""" + HONESTY_CHECK_CSV

# --- 1i. Business vs Technical Table Filtering Prompt (NEW - CRITICAL FOR QUALITY) ---
PROMPT_TEMPLATES["FILTER_BUSINESS_TABLES_PROMPT"] = """You are a **Senior Data Architect** and **Business Domain Expert** specializing in identifying business-relevant data assets.

**CRITICAL TASK**: Analyze the provided list of database tables and classify each one as either:
1. **BUSINESS DATA TABLE** - Contains ANY business data related to operations, transactions, customers, products, services, or business processes
2. **TECHNICAL TABLE** - Contains PURELY IT INFRASTRUCTURE data with NO business relevance (backend system logs, database monitoring, application debugging, IT governance)

**BUSINESS CONTEXT**:
- **Business Name**: {business_name}
- **Industry**: {industry}
- **Business Description**: {business_context}
- **Exclusion Strategy**: {exclusion_strategy}

{additional_context_section}

**🚨 CONTEXT-AWARE CLASSIFICATION - CRITICAL 🚨**:
**The business context is CRUCIAL for classification.** Terms that are technical for most businesses may be BUSINESS DATA for companies where those terms are core to their business model.

**EXAMPLES OF CONTEXT-AWARE CLASSIFICATION**:
- **Example 1 - Databricks/Data Platform Companies**:
  - ✅ BUSINESS: `clusters`, `jobs`, `pipelines`, `workflows`, `compute`, `warehouses`, `models` (core product/service tables)
  - ❌ TECHNICAL: `cluster_logs`, `job_run_logs`, `pipeline_execution_logs`, `system_events`, `error_traces`, `debug_snapshots`
  - **Reasoning**: For Databricks, "cluster" is a billable product feature (business), but "cluster_logs" are still technical debugging data
  
- **Example 2 - Healthcare/Medical Device Companies**:
  - ✅ BUSINESS: `devices`, `sensors`, `telemetry`, `device_events`, `device_configurations` (core product data)
  - ❌ TECHNICAL: `device_firmware_logs`, `system_diagnostics`, `internal_health_checks`, `deployment_history`
  
- **Example 3 - Logistics/Transportation Companies**:
  - ✅ BUSINESS: `vehicles`, `routes`, `gps_tracking`, `driver_activity`, `vehicle_telemetry` (core operations)
  - ❌ TECHNICAL: `vehicle_diagnostic_logs`, `system_error_logs`, `app_crash_reports`, `backend_performance`

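**UNIVERSAL TECHNICAL PATTERNS** (Technical by default when the table holds system/IT data; see the EXCEPTION below and the edge cases for business logs, audits, status lookups, and snapshots):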
- **Logs & Auditing**: `*_logs`, `*_audit_trail`, `*_history`, `*_changelog`, `audit_*`, `log_*` (system/debug logs)
- **Snapshots & Backups**: `*_snapshot`, `*_backup`, `snapshot_*`, `backup_*` (database backups)
- **System Metadata**: `*_metadata`, `*_schema`, `information_schema.*`, `sys.*`, `system.*`
- **Monitoring & Health**: `*_metrics`, `*_health`, `*_status`, `*_monitoring`, `performance_*`, `monitoring_*`
- **ETL/Pipeline Internals**: `*_job_run`, `*_pipeline_execution`, `*_load_status`, `etl_*`, `pipeline_*` (orchestration)
- **Error/Debug**: `*_error`, `*_exception`, `*_debug`, `error_*`, `exception_*`, `debug_*`
- **Configuration/Settings**: `*_config`, `*_settings`, `*_parameters`, `config_*`, `settings_*` (system settings)
- **Testing/Staging**: `*_test`, `*_staging`, `*_temp`, `test_*`, `staging_*`, `temp_*`
- **ML/Platform Infra Metadata**: Model registry, pipeline/runtime configuration, training artifacts, and other platform/ops metadata tables are TECHNICAL unless the business sells the platform itself. When in doubt, treat infrastructure/metadata tables as technical and prefer tables with direct business transactions/entities.

**EXCEPTION**: If the business explicitly provides these as products (e.g., observability/monitoring company selling logs as a product), then they are business tables.

**🚨 EXCLUSION STRATEGY-SPECIFIC RULES 🚨**:

**STRATEGY: {exclusion_strategy}**

{strategy_rules}

**🚨 DATA CATEGORY CLASSIFICATION RULES - SEMANTIC DEFINITIONS 🚨**:

**You MUST classify each BUSINESS table into exactly ONE of these three categories based on SEMANTIC analysis, NOT based on score:**

---

**📊 TRANSACTIONAL DATA** (Records of business events - the "VERBS" of the business):
| Characteristic | Description |
|----------------|-------------|
| **Definition** | Records of business events, activities, and transactions that happen over time |
| **Stability** | Effectively immutable once created (append-only); updates after creation are rare |
| **Volume** | High volume, grows continuously over time |
| **Time-sensitive** | Has a PRIMARY BUSINESS timestamp column indicating WHEN THE EVENT OCCURRED |
| **Lifecycle** | Created, typically never updated (may be archived or soft-deleted) |
| **Key Question** | "Does each row represent a discrete BUSINESS EVENT that happened at a specific time?" → If YES → TRANSACTIONAL |
| **Examples** | Orders, invoices, payments, shipments, bookings, transactions, log entries, clicks, events, transfers, claims, incidents, service_requests, production_runs, quality_measurements, sensor_readings (time-series) |

**🚨 CRITICAL: TIMESTAMP COLUMN ANALYSIS 🚨**

**NOT ALL TIMESTAMP COLUMNS INDICATE TRANSACTIONAL DATA!** You must distinguish between:

**✅ TRUE TRANSACTIONAL TIMESTAMPS** (Indicate WHEN the business event OCCURRED):
| Column Pattern | Purpose | Example Tables |
|----------------|---------|----------------|
| `transaction_date`, `transaction_time` | When the transaction happened | payments, transfers |
| `order_date`, `order_time` | When the order was placed | orders, purchases |
| `event_date`, `event_time`, `event_timestamp` | When the event occurred | events, incidents |
| `created_at`, `created_date` (in event tables) | When the event record was created | logs, activities |
| `shipment_date`, `delivery_date` | When shipment/delivery occurred | shipments, deliveries |
| `measurement_time`, `reading_time` | When measurement was taken | sensor_data, quality_checks |
| `start_time`, `end_time` | Duration of an activity/event | production_runs, shifts |
| `booking_date`, `reservation_date` | When booking was made | bookings, reservations |
| `payment_date`, `invoice_date` | When financial event occurred | payments, invoices |
| `effective_date`, `posted_date` | When transaction became effective | journal_entries, postings |

**❌ HOUSEKEEPING/AUDIT TIMESTAMPS AND RELATED COLUMNS** (Do NOT indicate transactional data - they only track row maintenance):
| Column Pattern | Purpose | Found In |
|----------------|---------|----------|
| `last_updated`, `last_update`, `updated_at` | When row was last modified | ALL table types |
| `modified_date`, `modified_at`, `modify_date` | When row was modified | ALL table types |
| `last_modified`, `last_modified_date` | Audit trail for changes | ALL table types |
| `updated_by`, `modified_by` | Who made the change | ALL table types |
| `created_at`, `created_date` (in entity tables) | When entity was first created | MASTER tables |
| `record_created`, `record_updated` | ETL/DWH housekeeping | ALL table types |
| `etl_timestamp`, `load_date`, `insert_date` | Data pipeline metadata | ALL table types |
| `sync_date`, `refresh_date` | Data synchronization | ALL table types |
| `valid_from`, `valid_to` (SCD Type 2) | Slowly changing dimension tracking | MASTER tables |
| `is_active`, `is_deleted`, `deleted_at` | Soft delete flags | ALL table types |

**🔑 KEY DISTINCTION RULES**:

1. **MASTER tables CAN have timestamps** - `customer.created_at` and `customer.last_updated` do NOT make it transactional
2. **REFERENCE tables CAN have timestamps** - `country_codes.last_updated` is just housekeeping
3. **Look at the PRIMARY PURPOSE of the table, not just the presence of timestamps**
4. **Ask: "Is this timestamp the REASON the row exists, or just metadata ABOUT the row?"**
   - If timestamp IS the reason (event happened) → TRANSACTIONAL
   - If timestamp is metadata about the row → NOT transactional (check if MASTER or REFERENCE)

**Identification Rules for TRANSACTIONAL**:
1. **Primary business timestamp exists** - A timestamp that represents WHEN THE BUSINESS EVENT OCCURRED (not just housekeeping)
2. **Each row = one discrete event** - The table records WHAT HAPPENED, not WHAT EXISTS
3. **High insert frequency, low/no update frequency** - Events are typically immutable once recorded
4. **References MASTER data** - Has foreign keys to entities (customer_id, product_id, employee_id)
5. **Time-series nature** - Data grows continuously over time, rarely deleted
6. **Business meaning is temporal** - "Order #123 was placed on 2024-01-15" vs "Customer John exists"

**⚠️ COMMON MISTAKES TO AVOID**:
- ❌ `customer` table with `created_at`, `last_updated` → Still MASTER (timestamps are housekeeping)
- ❌ `product` table with `modified_date` → Still MASTER (timestamp is audit trail)
- ❌ `country_codes` with `last_updated` → Still REFERENCE (timestamp is maintenance)
- ❌ `employee` with `hire_date`, `termination_date` → Still MASTER (dates describe entity lifecycle, not events)
- ✅ `customer_orders` with `order_date` → TRANSACTIONAL (timestamp IS the event)
- ✅ `production_runs` with `run_start_time`, `run_end_time` → TRANSACTIONAL (timestamps define the event)

---

**👤 MASTER DATA** (Core business entities - the "NOUNS" of the business):
| Characteristic | Description |
|----------------|-------------|
| **Definition** | Core business entities that define WHO, WHAT, WHERE of the business |
| **Stability** | Changes infrequently but CAN be updated (low volatility) |
| **Uniqueness** | Each row represents a unique business entity with a lifecycle |
| **Shared** | Used/referenced across multiple systems and processes |
| **Lifecycle** | Created once, updated occasionally, rarely deleted (soft-delete common) |
| **Key Question** | "Does this table represent a core business ENTITY that EXISTS (not an event that HAPPENED)?" → If YES → MASTER |
| **Examples** | Customers, employees, products, vendors, suppliers, accounts, contracts, assets, locations, equipment, vehicles, patients, projects, policies, members, partners, anodes, pots, machines |

**🚨 MASTER DATA CAN HAVE TIMESTAMPS - THIS DOES NOT MAKE THEM TRANSACTIONAL 🚨**

| Common MASTER Table Columns | Purpose | Still MASTER? |
|-----------------------------|---------|---------------|
| `created_at`, `created_date` | When entity was first created | ✅ YES |
| `last_updated`, `updated_at`, `modified_date` | Housekeeping/audit | ✅ YES |
| `hire_date`, `termination_date` (employee) | Entity lifecycle dates | ✅ YES |
| `start_date`, `end_date` (contract) | Contract validity period | ✅ YES |
| `registration_date` (customer) | When customer registered | ✅ YES |
| `manufacture_date` (product/asset) | When asset was made | ✅ YES |
| `installation_date` (equipment) | When equipment installed | ✅ YES |
| `birth_date`, `join_date` | Entity attributes | ✅ YES |
| `valid_from`, `valid_to` | SCD Type 2 versioning | ✅ YES |

**KEY INSIGHT**: MASTER tables describe ENTITIES THAT EXIST. The timestamps in MASTER tables describe:
- WHEN the entity was created/modified (housekeeping)
- ATTRIBUTES of the entity (hire_date is an attribute of employee)
- NOT "an event that occurred" - the entity IS the subject, not a record of an action

**Identification Rules for MASTER**:
1. **Represents a unique business entity** - Has a natural business identifier (customer_id, employee_number, product_code)
2. **Entity-centric, not event-centric** - Describes WHAT/WHO exists, not WHAT happened
3. **Has lifecycle states** - active, inactive, suspended, terminated, pending
4. **Referenced by TRANSACTIONAL tables** - Foreign keys point TO this table (e.g., orders.customer_id → customers.id)
5. **Updates modify the entity** - Address changes, status updates, attribute corrections
6. **Would be managed by a business steward** - HR owns employees, Sales owns customers
7. **Relatively low volume** - Hundreds to millions of entities, not billions of events
8. **Timestamps are METADATA, not the primary business data** - `last_updated` is housekeeping, not business content

**Entity Lifecycle Dates vs Event Dates**:
- `employee.hire_date` = ATTRIBUTE of the employee entity → MASTER
- `employee_timesheet.work_date` = WHEN the work event occurred → TRANSACTIONAL
- `contract.start_date` = ATTRIBUTE defining contract period → MASTER  
- `contract_payment.payment_date` = WHEN payment event occurred → TRANSACTIONAL
- `equipment.installation_date` = ATTRIBUTE of equipment → MASTER
- `equipment_maintenance.maintenance_date` = WHEN maintenance event occurred → TRANSACTIONAL

---

**📋 REFERENCE DATA** (Lookup values and classifications - the "ADJECTIVES" of the business):
| Characteristic | Description |
|----------------|-------------|
| **Definition** | Lookup values, codes, and classifications used to categorize/describe other data |
| **Stability** | Very stable, changes RARELY (often governed by standards or regulations) |
| **Purpose** | Standardizes and categorizes MASTER and TRANSACTIONAL data |
| **Scope** | Often industry-wide standards, regulatory codes, or company-controlled lists |
| **Size** | Typically SMALL, finite sets of values (dozens to hundreds, rarely thousands) |
| **Key Question** | "Is this a FINITE, CONTROLLED list of codes/types/categories used to CLASSIFY other data?" → If YES → REFERENCE |
| **Examples** | Country codes, currency codes, status codes, product_categories, gender, units_of_measure, payment_types, order_status, priority_levels, industry_codes, language_codes, timezones, alloy_grades, shift_types, material_types, quality_grades |

**🚨 REFERENCE DATA CAN HAVE TIMESTAMPS - THIS DOES NOT MAKE THEM TRANSACTIONAL 🚨**

| Common REFERENCE Table Columns | Purpose | Still REFERENCE? |
|--------------------------------|---------|------------------|
| `last_updated`, `modified_date` | Housekeeping when code was edited | ✅ YES |
| `created_at` | When the code was first added | ✅ YES |
| `effective_date` | When code became valid | ✅ YES |
| `expiry_date`, `deprecated_date` | When code is no longer valid | ✅ YES |
| `valid_from`, `valid_to` | Code validity period | ✅ YES |

**Identification Rules for REFERENCE**:
1. **Small, finite, controlled set** - Typically < 1000 rows, often < 100
2. **Used for classification/categorization** - Provides dropdown values, categorizes other data
3. **Code + Description pattern** - Often has `code`, `name`, `description` columns
4. **Referenced BY other tables** - MASTER and TRANSACTIONAL tables have foreign keys TO this table
5. **Rarely changes** - Adding a new status code is a governance event, not daily operation
6. **Industry or regulatory standards** - ISO codes, regulatory classifications, standard enumerations
7. **No transactional history** - You don't track "payment type usage over time" in this table
8. **Business rules/validation** - Used to validate and constrain data entry

**REFERENCE vs MASTER Distinction**:
| Aspect | REFERENCE | MASTER |
|--------|-----------|--------|
| Row count | Typically < 1000 | Can be millions |
| Changes | Rarely (governance) | Regularly (business operations) |
| Purpose | Classify/categorize | Represent entities |
| Ownership | Usually IT/Data Governance | Business departments |
| Examples | status_types, country_codes | customers, products |

**Common REFERENCE Table Patterns**:
- `*_type`, `*_types` (payment_type, order_type, material_type)
- `*_status`, `*_statuses` (order_status, customer_status)
- `*_code`, `*_codes` (country_code, currency_code, reason_code)
- `*_category`, `*_categories` (product_category, expense_category)
- `*_grade`, `*_grades` (quality_grade, alloy_grade, credit_grade)
- `*_level`, `*_levels` (priority_level, severity_level)
- `*_class`, `*_classification` (risk_class, material_classification)
- Singular lookup names (gender, currency, country, language, timezone)

---

**🔀 QUICK DECISION FRAMEWORK** (Ask in this order):

**Step 1: Analyze the TABLE PURPOSE (not just columns)**
- What is the PRIMARY purpose of this table?
- Does it record EVENTS (things that happen) or ENTITIES (things that exist)?

**Step 2: Analyze TIMESTAMP columns carefully**
- Is the timestamp the REASON the row exists (event timestamp)?
- Or is it just HOUSEKEEPING (last_updated, modified_at)?

**Step 3: Apply this decision tree:**

```
START → Is this table a FINITE SET of codes/categories (< 1000 rows)?
        │
        ├─ YES → Does it CLASSIFY/CATEGORIZE other data? → YES → **REFERENCE**
        │        └─ NO → Might be small MASTER table
        │
        └─ NO → Does each row represent a discrete BUSINESS EVENT that HAPPENED?
                │
                ├─ YES → Does it have a PRIMARY BUSINESS TIMESTAMP (not just last_updated)?
                │        │
                │        ├─ YES → Is the table INSERT-heavy with rare/no updates? → **TRANSACTIONAL**
                │        └─ NO → Check if it's actually MASTER with event-like naming
                │
                └─ NO → Does it represent a core business ENTITY with lifecycle?
                        │
                        ├─ YES → **MASTER** (even if it has created_at, last_updated)
                        └─ NO → Re-evaluate: likely MASTER or needs more context
```

**Step 4: Validate with these questions:**

| Question | TRANSACTIONAL | MASTER | REFERENCE |
|----------|---------------|--------|-----------|
| Does each row = one event? | ✅ Yes | ❌ No | ❌ No |
| Is it INSERT-heavy, UPDATE-rare? | ✅ Yes | ❌ No (updates common) | ❌ No |
| Does it have a business event timestamp? | ✅ Yes | ❌ No (any timestamps are housekeeping or entity attributes) | ❌ No (any timestamps are housekeeping or validity dates) |
| Is it a finite controlled list? | ❌ No | ❌ No | ✅ Yes |
| Is row count typically < 1000? | ❌ No (can be billions) | ⚠️ Varies | ✅ Usually |
| Does it represent an entity with lifecycle? | ❌ No | ✅ Yes | ❌ No |
| Is it referenced BY other tables via FK? | ❌ No (it references) | ✅ Yes | ✅ Yes |

---

**⚠️ AMBIGUOUS/TRICKY CASES - DETAILED ANALYSIS ⚠️**:

These tables are commonly misclassified. Study these examples carefully:

| Table Name | Has Timestamps? | Correct Category | Reasoning |
|------------|-----------------|------------------|-----------|
| `customer` | `created_at`, `last_updated` | **MASTER** | Entity that EXISTS, timestamps are housekeeping |
| `customer_order` | `order_date`, `created_at` | **TRANSACTIONAL** | Event that HAPPENED, order_date is business timestamp |
| `employee` | `hire_date`, `last_updated`, `termination_date` | **MASTER** | Entity with lifecycle, dates are ATTRIBUTES not events |
| `employee_timesheet` | `work_date`, `clock_in`, `clock_out` | **TRANSACTIONAL** | Event recording work that HAPPENED |
| `product` | `created_at`, `launch_date`, `last_updated` | **MASTER** | Entity, launch_date is an attribute |
| `product_sale` | `sale_date`, `sale_time` | **TRANSACTIONAL** | Event of sale that HAPPENED |
| `equipment` | `installation_date`, `last_maintenance_date` | **MASTER** | Entity, dates are attributes |
| `equipment_maintenance` | `maintenance_date`, `start_time`, `end_time` | **TRANSACTIONAL** | Event of maintenance that HAPPENED |
| `contract` | `start_date`, `end_date`, `signed_date` | **MASTER** | Entity, dates define contract period (attributes) |
| `contract_payment` | `payment_date`, `due_date` | **TRANSACTIONAL** | Event of payment that HAPPENED |
| `inventory` | `last_count_date`, `last_updated` | **MASTER** | Entity (current state), dates are housekeeping |
| `inventory_movement` | `movement_date`, `transaction_time` | **TRANSACTIONAL** | Event of stock movement that HAPPENED |
| `price` / `price_list` | `effective_date`, `expiry_date` | **MASTER** or **REFERENCE** | Depends on granularity - if per-product it's MASTER, if generic codes it's REFERENCE |
| `status_type` | `created_at`, `last_updated` | **REFERENCE** | Finite codes, timestamps are housekeeping |
| `audit_log` (business) | `audit_timestamp`, `event_time` | **TRANSACTIONAL** | Business audit events that HAPPENED |
| `country` / `country_code` | `last_updated` | **REFERENCE** | Finite standard codes, timestamp is housekeeping |
| `anode` | `created_date`, `last_updated` | **MASTER** | Physical entity tracked individually |
| `anode_consumption` | `consumption_date`, `consumption_time` | **TRANSACTIONAL** | Event of anode being consumed |
| `alloy_grade` | `last_updated` | **REFERENCE** | Finite set of alloy specifications |
| `production_batch` | `batch_date`, `start_time`, `end_time` | **TRANSACTIONAL** | Event of production that HAPPENED |
| `quality_test_result` | `test_date`, `test_time` | **TRANSACTIONAL** | Event of test that HAPPENED |
| `shift` / `shift_type` | `last_updated` | **REFERENCE** | Finite codes (Day/Night/Swing) |
| `shift_schedule` | `shift_date`, `start_time` | **TRANSACTIONAL** | Specific shift occurrence that HAPPENED |

**🔑 KEY PATTERNS TO REMEMBER**:

1. **"*_log" tables**: Usually TRANSACTIONAL when they record business events over time (pure system/debug logs are TECHNICAL)
2. **"*_history" tables**: Usually TRANSACTIONAL when they record historical business events
3. **"*_type" / "*_status" / "*_code" tables**: Usually REFERENCE (lookup codes)
4. **"*_movement" / "*_transfer" / "*_transaction" tables**: Usually TRANSACTIONAL
5. **Singular entity names** (customer, product, employee): Usually MASTER
6. **Tables with lifecycle dates as ATTRIBUTES**: Usually MASTER (hire_date, start_date)
7. **Tables where each row = a point-in-time event**: TRANSACTIONAL
8. **Tables that could be a dropdown menu**: REFERENCE

---

**🏭 INDUSTRY-SPECIFIC EXAMPLES (Manufacturing/Aluminum Smelting)**:

| Table | Category | Reasoning |
|-------|----------|-----------|
| `employee` | MASTER | Core entity - workforce members with lifecycle |
| `customer` | MASTER | Core entity - business relationships |
| `product` | MASTER | Core entity - what is manufactured/sold |
| `equipment` | MASTER | Core entity - physical assets with lifecycle |
| `anode` | MASTER | Core entity - physical items tracked individually |
| `pot` | MASTER | Core entity - smelting equipment with lifecycle |
| `alloy` | REFERENCE | Finite set of alloy grades/specifications |
| `shift_type` | REFERENCE | Lookup - Day/Night/Swing shifts |
| `product_category` | REFERENCE | Lookup - Product classification codes |
| `country` | REFERENCE | Lookup - Standard country codes |
| `production_run` | TRANSACTIONAL | Event - Records production activity |
| `quality_measurement` | TRANSACTIONAL | Event - Records test results with timestamps |
| `metal_transfer` | TRANSACTIONAL | Event - Records material movements |
| `order` | TRANSACTIONAL | Event - Customer purchase events |
| `shipment` | TRANSACTIONAL | Event - Delivery events |

---

**🚨 CLASSIFICATION PRIORITY RULES 🚨**:

- **RULE #1**: Use SEMANTIC analysis first, NOT score-based inference
- **RULE #2**: If table has a primary business-event timestamp and each row records an event → **TRANSACTIONAL** (regardless of score)
- **RULE #3**: If table is finite lookup/codes → **REFERENCE** (regardless of score)
- **RULE #4**: If table represents core entities with lifecycle → **MASTER**
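- **RULE #5**: When in doubt between MASTER and TRANSACTIONAL → check whether a business-event timestamp is the REASON the row exists (housekeeping timestamps such as `last_updated` do not count)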
- **RULE #6**: When in doubt between MASTER and REFERENCE → check if it's a finite controlled list
- **RULE #7**: When in doubt overall, classify as **MASTER** (safer default for entities)

**❌ TECHNICAL TABLES** (EXCLUDE ONLY IF PURELY IT INFRASTRUCTURE):
**ONLY classify as TECHNICAL if the table is PURELY for IT TEAMS managing internal systems:**
- **IT System Logs**: Application error logs, API debug logs, backend service logs, system exception tracking
- **IT Monitoring**: Database performance metrics, server health checks, infrastructure monitoring, resource utilization (CPU, memory, disk)
- **IT Configuration**: Backend application settings, system parameters, feature flags for developers, deployment configs
- **Database Metadata**: Schema version control, database migration tracking, table/column metadata for database admins
- **IT Governance**: Data lineage for IT purposes, ETL job status, pipeline orchestration, data load tracking for data engineers
- **IT Security**: System access logs (not business user activity), IT audit trails, penetration testing results, vulnerability scans
- **Developer Tools**: Version control metadata, CI/CD pipeline logs, build artifacts, test execution logs
- **System Tables**: `information_schema`, `sys`, `system` schemas - database catalog tables
- **RULE**: Majority of columns are technical (error stack traces, system IDs, internal status codes, JSON configs for IT)
- **RULE**: Primary consumers are IT/DevOps/Data Engineering teams, NOT business users
- **RULE**: Data has ZERO business value and is ONLY used to maintain IT infrastructure

**CRITICAL EDGE CASES - DEFAULT TO BUSINESS**:
- **User Activity Logs**: If tracking business user behavior, customer actions, or business events → **BUSINESS** (TRANSACTIONAL). Only if tracking IT system access/authentication → **TECHNICAL**
- **Audit Tables**: If tracking ANY business transactions, data changes, or business events → **BUSINESS** (TRANSACTIONAL). Only if tracking IT system changes → **TECHNICAL**
- **Configuration Tables**: If storing business rules, pricing configs, product settings → **BUSINESS** (REFERENCE or MASTER). Only if storing IT system settings → **TECHNICAL**
- **Snapshot Tables**: If historical snapshots of business data for reporting/analytics → **BUSINESS** (TRANSACTIONAL). Only if database backups/snapshots → **TECHNICAL**
- **Metadata Tables**: If describing business entities, data dictionaries for business users → **BUSINESS** (REFERENCE). Only if database schema metadata → **TECHNICAL**
- **Mixed Tables**: If >10% of columns contain business data → **BUSINESS**. Otherwise → **TECHNICAL**

**EXAMPLES WITH DATA CATEGORY**:
- ✅ BUSINESS/TRANSACTIONAL: `customer_activity_log` (business event tracking with timestamps), `order_audit` (business transaction history), `device_telemetry` (IoT time-series data)
- ✅ BUSINESS/TRANSACTIONAL: `api_usage_log` (if tracking customer API usage for billing/analytics with timestamps)
- ✅ BUSINESS/MASTER: `customer` (core entity), `product` (core entity), `employee` (core entity), `equipment` (physical asset)
- ✅ BUSINESS/REFERENCE: `country_code` (lookup), `product_category` (classification), `status_type` (finite codes), `alloy_grade` (specifications)
- ❌ TECHNICAL: `application_error_log` (IT debugging), `database_query_performance` (IT monitoring), `etl_job_runs` (IT data engineering)
- ❌ TECHNICAL: `system_health_checks` (IT infrastructure), `deployment_history` (IT DevOps), `schema_migrations` (IT database admin)

**BUSINESS SCORE** (Indicates business criticality, NOT used for category inference):
For BUSINESS tables, assign a score from 1-100 indicating how business-critical the table is:
- **90-100**: High-frequency transactional tables (orders, sales, transactions, payments, production_runs)
- **80-89**: Core master data entities (customers, products, employees, equipment, assets)
- **60-79**: Important operational/transactional tables (inventory movements, appointments, schedules, quality checks)
- **40-59**: Supporting master/reference data (business configuration, business lookup tables)
- **20-39**: Static reference data (country codes, currencies, categories) - limited use case value
- **1-19**: Marginally business-relevant tables (derived/aggregate tables, borderline cases)

**⚠️ IMPORTANT**: Business Score indicates CRITICALITY, NOT data category. You MUST determine Data Category using SEMANTIC rules above, NOT by score.

For TECHNICAL tables, always use score: 0

**YOUR TASK**:
Review the list of tables below and return a **CSV** with the following columns:
- `Table Name`: Fully-qualified table name (catalog.schema.table)
- `Classification`: Either "BUSINESS" or "TECHNICAL"
- `Data Category`: For BUSINESS tables only, classify as "MASTER", "TRANSACTIONAL", or "REFERENCE". For TECHNICAL tables, use "TECHNICAL".
- `Business Score`: Integer from 0-100 (0 for TECHNICAL, 1-100 for BUSINESS based on criticality)
- `Reason`: Brief reason for classification (max 100 characters)

**CRITICAL REQUIREMENTS**:
1. You MUST classify EVERY table in the input list
2. Each table must appear in exactly ONE row
3. Use the FULL table name format: `catalog.schema.table`
4. Classification must be either "BUSINESS" or "TECHNICAL" (no other values)
5. Data Category must be one of: "MASTER", "TRANSACTIONAL", "REFERENCE", "TECHNICAL"
6. **🚨 CRITICAL**: Data Category MUST be determined using SEMANTIC rules:
   - **TRANSACTIONAL**: Records events with timestamps (orders, payments, logs, measurements)
   - **MASTER**: Core business entities with lifecycle (customers, products, employees, assets)
   - **REFERENCE**: Finite lookup codes/classifications (country_codes, status_types, categories)
   - **TECHNICAL**: For TECHNICAL tables only
7. Business Score must be an integer: 0 for TECHNICAL, 1-100 for BUSINESS (indicates criticality, NOT category)
8. When in doubt, prefer "BUSINESS" classification (safer to include than exclude)
9. When in doubt on category: Check for timestamps → TRANSACTIONAL; Check for finite codes → REFERENCE; Otherwise → MASTER

**TABLES TO CLASSIFY**:
{tables_markdown}

**OUTPUT FORMAT** (CSV ONLY - NO OTHER TEXT):
```csv
"Table Name","Classification","Data Category","Business Score","Reason"
"catalog.schema.customers","BUSINESS","MASTER","85","Core entity - unique business entities with lifecycle"
"catalog.schema.orders","BUSINESS","TRANSACTIONAL","95","Event records - timestamped business transactions"
"catalog.schema.payments","BUSINESS","TRANSACTIONAL","92","Event records - financial transaction events"
"catalog.schema.employees","BUSINESS","MASTER","82","Core entity - workforce members with lifecycle"
"catalog.schema.equipment","BUSINESS","MASTER","80","Core entity - physical assets tracked individually"
"catalog.schema.country_codes","BUSINESS","REFERENCE","25","Lookup codes - finite set of standard classifications"
"catalog.schema.product_categories","BUSINESS","REFERENCE","35","Lookup codes - finite set used to classify products"
"catalog.schema.status_types","BUSINESS","REFERENCE","30","Lookup codes - finite enumeration of status values"
"catalog.schema.alloy_grades","BUSINESS","REFERENCE","38","Lookup codes - finite set of alloy specifications"
"catalog.schema.api_logs","TECHNICAL","TECHNICAL","0","IT infrastructure - system debugging logs"
"catalog.schema.etl_job_runs","TECHNICAL","TECHNICAL","0","IT infrastructure - pipeline orchestration metadata"
```

**CRITICAL CSV FORMATTING RULES**:
1. First line MUST be the header: "Table Name","Classification","Data Category","Business Score","Reason"
2. ALL fields MUST be enclosed in double quotes (")
3. Each row must have exactly 5 fields
4. Classification must be EXACTLY "BUSINESS" or "TECHNICAL" (case-sensitive)
5. Data Category must be EXACTLY "MASTER", "TRANSACTIONAL", "REFERENCE", or "TECHNICAL"
6. Business Score must be a valid integer (0-100)
7. NO markdown code fences (```csv) - just the CSV content
8. NO explanatory text before or after the CSV

**VALIDATION CHECKLIST**:
✓ Every input table appears in exactly one CSV row
✓ Business tables are relevant to {industry} industry
✓ Technical tables are clearly system/metadata focused
✓ All TECHNICAL tables have Business Score = 0 and Data Category = "TECHNICAL"
✓ All BUSINESS tables have Business Score between 1-100
✓ Data Category is determined SEMANTICALLY (not by score):
  ✓ TRANSACTIONAL = event tables with timestamps (orders, payments, logs)
  ✓ MASTER = entity tables with lifecycle (customers, products, employees)
  ✓ REFERENCE = lookup/code tables (country_codes, status_types, categories)
✓ Output is valid CSV starting with header row

🚨 ABSOLUTE RULE - OUTPUT FORMAT 🚨:

❌ DO NOT INCLUDE:
- Any conversational or explanatory text ("Here are...", "I've...", "Based on...")
- Any thoughts or analysis descriptions

✅ OUTPUT REQUIREMENTS:
- Your response must START with: "Table Name","Classification","Data Category","Business Score","Reason","honesty_score","honesty_justification"
- Include honesty columns in header and all rows
""" + HONESTY_CHECK_CSV
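
# The output contract in the prompt above (fixed header, enumerated values, score ranges,
# one row per table) can be checked mechanically before the rows are used downstream.
# The sketch below is illustrative only: `validate_classification_csv` is a hypothetical
# helper that is not part of this notebook, and it assumes the five-column header
# without the honesty columns.

```python
import csv
import io

EXPECTED_HEADER = ["Table Name", "Classification", "Data Category", "Business Score", "Reason"]
BUSINESS_CATEGORIES = {"MASTER", "TRANSACTIONAL", "REFERENCE"}


def validate_classification_csv(csv_text, expected_tables):
    """Return a list of human-readable problems; an empty list means the CSV passed."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["missing or unexpected header row"]

    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 5 fields, got {len(row)}")
            continue
        table, classification, category, score_text, _reason = row
        if table in seen:
            problems.append(f"line {line_no}: duplicate table {table}")
        seen.add(table)
        try:
            score = int(score_text)
        except ValueError:
            problems.append(f"line {line_no}: score is not an integer")
            continue
        if classification == "TECHNICAL":
            # TECHNICAL rows must carry the TECHNICAL category and a score of 0
            if category != "TECHNICAL" or score != 0:
                problems.append(f"line {line_no}: TECHNICAL rows need category TECHNICAL and score 0")
        elif classification == "BUSINESS":
            # BUSINESS rows need one of the three business categories and a score of 1-100
            if category not in BUSINESS_CATEGORIES or not 1 <= score <= 100:
                problems.append(f"line {line_no}: BUSINESS rows need a business category and score 1-100")
        else:
            problems.append(f"line {line_no}: unknown classification {classification!r}")

    # Every input table must appear in the output
    for missing in sorted(set(expected_tables) - seen):
        problems.append(f"table not classified: {missing}")
    return problems


sample = '''"Table Name","Classification","Data Category","Business Score","Reason"
"catalog.schema.orders","BUSINESS","TRANSACTIONAL","95","Event records"
"catalog.schema.etl_job_runs","TECHNICAL","TECHNICAL","5","Pipeline metadata"
'''
expected = ["catalog.schema.orders", "catalog.schema.etl_job_runs", "catalog.schema.customers"]
for problem in validate_classification_csv(sample, expected):
    print(problem)
```

# In this sample the TECHNICAL row has a non-zero score and one input table is missing,
# so two problems are reported.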

# COMMAND ----------

# DBTITLE 1,AIAgent
# ==============================================================================
# 2.5. AI AGENT CLASS (MODIFIED)
# ==============================================================================

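class AIAgent:
    """
    Wraps Databricks `ai_query` calls for worker and judge prompts.

    Handles prompt formatting, timeouts, retries, and truncation checks, and keeps
    class-level usage statistics that are shared across all instances.
    """
    # Usage counters (measured in characters, not tokens), shared across all instances
    _total_ai_calls = 0
    _total_input_chars = 0
    _total_output_chars = 0
    # Per-prompt statistics, keyed by prompt template name
    _step_stats = defaultdict(lambda: {"calls": 0, "input_chars": 0, "output_chars": 0})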

    # === MODIFIED: Added prompt_templates parameter ===
    def __init__(self, spark, logger, worker_llm_config, judge_llm_config, prompt_templates: dict,
                 default_timeout_seconds: int = 300, max_retry_attempts: int = 1):
        self.spark = spark
        self.logger = logger
        self.worker_llm = worker_llm_config
        self.judge_llm = judge_llm_config
        self.prompt_templates = prompt_templates  # <-- NEW: Store the dictionary
        self.current_language = "English"  # Default language for context limit calculations
        self.default_timeout_seconds = default_timeout_seconds
        # Number of retries after the first attempt (global knob)
        self.max_retry_attempts = max(0, max_retry_attempts)
        
        # === NEW: Check if prompt dictionary is empty ===
        if not self.prompt_templates:
            self.logger.warning("AIAgent initialized with an empty prompt_templates dictionary.")
    
    def set_language(self, language: str):
        """
        Set the current language for context limit calculations.
        
        Args:
            language: Language name (e.g., "English", "French", "Spanish")
        """
        self.current_language = language
        # Language set - no debug logging needed

    # === NEW: Internal helper function for loading prompts ===
    def _load_and_format_prompt(self, prompt_key: str, prompt_vars: dict) -> str:
        """
        Loads a prompt template from the internal dictionary and formats it.
        """
        if prompt_key not in self.prompt_templates:
            self.logger.error(f"Prompt key '{prompt_key}' not found in prompt dictionary.")
            raise KeyError(f"Prompt key '{prompt_key}' not found in AIAgent's prompt_templates.")
        
        prompt_template = self.prompt_templates[prompt_key]
        
        try:
            return prompt_template.format(**prompt_vars)
        except KeyError as e:
            self.logger.error(f"Failed to format prompt '{prompt_key}'. Missing variable: {e}")
            raise
        except Exception as e:
            self.logger.error(f"An unexpected error occurred during prompt formatting for '{prompt_key}': {e}")
            raise
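
    # Because templates are filled with `str.format`, every `{name}` in a template is a
    # placeholder, and literal braces (for example in a JSON sample) must be doubled.
    # A minimal stand-alone sketch of the same lookup-then-format step follows; the
    # `format_prompt` function and the `DEMO_PROMPT` template are hypothetical and only
    # illustrate the behaviour.

```python
def format_prompt(templates, prompt_key, prompt_vars):
    """Stand-alone illustration of the lookup-then-format step."""
    if prompt_key not in templates:
        raise KeyError(f"Prompt key '{prompt_key}' not found.")
    return templates[prompt_key].format(**prompt_vars)


# Hypothetical template: {{ and }} produce literal braces, {industry} is substituted.
templates = {
    "DEMO_PROMPT": 'Classify tables for the {industry} industry. Reply as {{"tables": [...]}}.\n{tables_markdown}'
}

filled = format_prompt(
    templates,
    "DEMO_PROMPT",
    {"industry": "retail", "tables_markdown": "- main.sales.orders"},
)
print(filled)

# A missing variable surfaces as a KeyError that names the placeholder.
try:
    format_prompt(templates, "DEMO_PROMPT", {"industry": "retail"})
except KeyError as exc:
    print(f"missing variable: {exc}")
```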

    def _call_ai_query(self, prompt: str, prompt_name: str, response_schema=None, model_override=None, timeout_seconds=None, max_retries=None, display_name=None) -> str:
        """
        Calls Databricks AI query function with the given prompt.
        Now tracks statistics by prompt_name (actual template name) instead of step_name.
        Includes resilience for "Input is too long" errors.
        Performs pre-flight check to ensure prompt respects language-aware context limits.
        Handles throttling and timeouts with automatic retry logic.
        
        Args:
            prompt: The prompt text
            prompt_name: Name of the prompt template (for logging)
            response_schema: Optional JSON schema for structured responses
            model_override: Optional model name override
            timeout_seconds: Timeout in seconds for the LLM call (default: self.default_timeout_seconds, which is 300 unless configured otherwise)
            max_retries: Number of retries after the first attempt for throttling, timeouts, and server errors (default: self.max_retry_attempts)
            display_name: Optional display name for heartbeat logs (defaults to prompt_name)
        
        Returns:
            Raw response string from the LLM
        """
        import time
        import threading
        
        model = model_override if model_override else self.worker_llm
        
        heartbeat_name = display_name if display_name else prompt_name
        
        # Resolve timeout/retries from configurable defaults
        attempts_allowed = (max_retries if max_retries is not None else self.max_retry_attempts) + 1
        timeout_val = timeout_seconds if timeout_seconds is not None else self.default_timeout_seconds
        
        # SAFETY: Ensure timeout_val is positive and not None
        if not timeout_val or timeout_val <= 0:
            self.logger.warning(f"Timeout value '{timeout_val}' is invalid. Enforcing safety fallback of 300s.")
            timeout_val = 300

        # Try the LLM call with retries for throttling and timeouts
        for attempt in range(1, attempts_allowed + 1):
            try:
            
                if attempt > 1:
                    self.logger.info(f"🔄 [{prompt_name}] Retry attempt {attempt}/{attempts_allowed} after error...")
                
                # PRE-FLIGHT CHECK: Ensure prompt length is within language-aware context limit
                # Uses model-specific token limits from TECHNICAL_CONTEXT
                max_context_chars = get_max_context_chars(self.current_language, prompt_name)
                prompt_len = len(prompt)
                if prompt_len > max_context_chars:
                    self.logger.error(
                        f"Prompt length ({prompt_len:,} chars) exceeds max context limit ({max_context_chars:,} chars) "
                        f"for language '{self.current_language}' and prompt: {prompt_name}. This will likely fail."
                    )
                    # Raise the error immediately rather than sending to the model
                    raise InputTooLongError(
                        f"Input length: {prompt_len:,} characters exceeds context limit of {max_context_chars:,} "
                        f"for language '{self.current_language}'. "
                        f"Prompt: {prompt_name}, Model: {model}. Please batch your input."
                    )
                
                # Log a debug message if we're close to the limit (>90%)
                if prompt_len > (max_context_chars * 0.9):
                    # Prompt approaching context limit - suppress debug log
                    pass
                
                # Prepare response_format if schema is provided
                response_format_str = ""
                
                # CRITICAL FIX: Set max_tokens to prevent output truncation
                # Claude Sonnet 4 defaults to only 1000 output tokens without explicit max_tokens!
                # This caused queries exceeding ~350 lines to be truncated mid-response.
                output_token_limit = get_model_output_token_limit(prompt_name)
                # Use 90% of the model's max output tokens as a safe limit, with a minimum floor of 32000 tokens
                max_output_tokens = max(32000, int(output_token_limit * 0.9))
                
                # Log the max_tokens being used for debugging truncation issues
                self.logger.info(f"   [{prompt_name}] Setting max_tokens={max_output_tokens:,} (model limit: {output_token_limit:,})")
                
                if response_schema:
                    response_format_str = json.dumps({"type": "json_schema", "json_schema": response_schema}, separators=(',', ':')).replace("'", "''")
                    ai_query_sql = f"SELECT ai_query('{model}', '{replace_single_quote(prompt)}', responseFormat => '{response_format_str}', modelParameters => named_struct('max_tokens', {max_output_tokens})) AS ai_response"
                else:
                    ai_query_sql = f"SELECT ai_query('{model}', '{replace_single_quote(prompt)}', modelParameters => named_struct('max_tokens', {max_output_tokens})) AS ai_response"
                
                # Execute with timeout using simple Thread with watchdog pattern
                response_rows = None
                error_holder = [None]
                completed_flag = [False]
                
                def execute_query():
                    # Runs in the worker thread; hands the result or the error back to the caller
                    nonlocal response_rows
                    try:
                        error_holder[0] = None
                        response_rows = execute_sql(self.spark, ai_query_sql, self.logger)
                    except Exception as e:
                        error_holder[0] = e
                    finally:
                        completed_flag[0] = True
                
                query_thread = threading.Thread(target=execute_query, name=f"LLM_Query_{prompt_name}")
                query_thread.daemon = True
                start_time = time.time()
                query_thread.start()
                
                # Use polling with small intervals instead of single long join
                # This ensures we can detect hangs even if join() misbehaves
                poll_interval = 5  # Check every 5 seconds
                heartbeat_interval = 60  # Log progress every 60 seconds
                next_heartbeat = heartbeat_interval
                while True:
                    query_thread.join(timeout=poll_interval)
                    elapsed = time.time() - start_time
                    
                    if completed_flag[0] or not query_thread.is_alive():
                        break
                    
                    if elapsed >= timeout_val:
                        log_print(f"⏱️  [{prompt_name}] LLM call TIMED OUT after {elapsed:.1f}s (attempt {attempt}/{attempts_allowed}) - Thread still alive", level="ERROR")
                        self.logger.error(f"⏱️  [{prompt_name}] LLM call timed out after {elapsed:.1f} seconds (attempt {attempt}/{attempts_allowed})")
                        break
                    
                    # Log a heartbeat each time another heartbeat_interval has elapsed.
                    # Tracking the next due time avoids missed heartbeats when polling drifts.
                    if elapsed >= next_heartbeat:
                        log_print(f"[{heartbeat_name}] Still waiting... {elapsed:.0f}s elapsed (timeout: {timeout_val}s)")
                        next_heartbeat += heartbeat_interval
                
                if query_thread.is_alive():
                    raise Exception(f"LLM call timed out after {timeout_val} seconds")
                
                if error_holder[0] is not None:
                    raise error_holder[0]
                
                raw_response = response_rows[0].ai_response if response_rows and response_rows[0] else ""
                
                honesty_score, honesty_justification, cleaned_response = extract_honesty_score(raw_response, self.logger)
                
                if honesty_score is not None:
                    if honesty_justification:
                        self.logger.info(f"🔮✨ HONESTY CHECK [{prompt_name}] Score: {honesty_score}% | {honesty_justification} ✨🔮")
                    else:
                        self.logger.info(f"🔮✨ HONESTY CHECK [{prompt_name}] Score: {honesty_score}% ✨🔮")
                
                input_len = len(prompt)
                output_len = len(cleaned_response)

                AIAgent._total_ai_calls += 1
                AIAgent._total_input_chars += input_len
                AIAgent._total_output_chars += output_len
                
                # Track by prompt_name (actual template name) instead of step_name
                AIAgent._step_stats[prompt_name]["calls"] += 1
                AIAgent._step_stats[prompt_name]["input_chars"] += input_len
                AIAgent._step_stats[prompt_name]["output_chars"] += output_len
                
                # TRUNCATION DETECTION: Check for mandatory END marker
                # For SQL generation, we require the response to end with "--END OF GENERATED SQL"
                # NOTE: SQL FIX prompt does NOT require this marker - it just returns fixed SQL
                SQL_END_MARKER = "--END OF GENERATED SQL"
                is_sql_generation = prompt_name == "USE_CASE_SQL_GEN_PROMPT"
                
                if is_sql_generation and cleaned_response and len(cleaned_response) > 100:
                    # Check if the END marker is present
                    has_end_marker = SQL_END_MARKER in cleaned_response
                    
                    if not has_end_marker:
                        self.logger.warning(
                            f"⚠️  [{prompt_name}] SQL TRUNCATED - Missing '{SQL_END_MARKER}' marker! "
                            f"Output: {output_len:,} chars, max_tokens: {max_output_tokens:,}. "
                            f"Last 100 chars: ...{cleaned_response[-100:]}"
                        )
                        # Raise a specific error so caller can handle truncation
                        raise TruncatedResponseError(
                            f"SQL response truncated - missing '{SQL_END_MARKER}' marker. "
                            f"Output length: {output_len:,} chars. Consider reducing input context."
                        )
                    else:
                        self.logger.info(f"   [{prompt_name}] SQL complete - END marker found")
                
                # Success - return cleaned response (without honesty section)
                return cleaned_response
                
            except InputTooLongError:
                # Don't retry InputTooLongError - let caller handle with context reduction
                raise
            except TruncatedResponseError:
                # Don't retry TruncatedResponseError - let caller handle with context reduction
                raise
            except Exception as e:
                error_msg = str(e).lower()
                
                # FIRST: Check for "Input is too long" errors BEFORE checking for retryable errors
                # This includes Databricks-specific error formats
                is_input_too_long = any(keyword in error_msg for keyword in [
                    'input is too long', 'too long for requested model', 'input length',
                    'exceeds context limit', 'context window', 'token limit exceeded',
                    'maximum context length'
                ])
                
                # Also check for Databricks HTTP 400 errors with "bad_request" and input length messages
                is_databricks_input_error = (
                    ('400' in error_msg or 'bad_request' in error_msg or 'bad request' in error_msg) and
                    ('input' in error_msg or 'length' in error_msg or 'model' in error_msg)
                )
                
                if is_input_too_long or is_databricks_input_error:
                    # This is an "input too long" error - raise immediately without retry
                    self.logger.error(
                        f"❌ [{prompt_name}] Input too long error detected (will not retry): {str(e)[:300]}"
                    )
                    raise InputTooLongError(
                        f"Input length: {len(prompt)} characters exceeds model's context limit. "
                        f"Prompt: {prompt_name}, Model: {model}"
                    ) from e
                
                # Check for retryable errors (only if NOT input too long)
                is_throttling = any(keyword in error_msg for keyword in [
                    'throttl', 'rate limit', 'too many requests', 'quota', '429',
                    'resource exhausted', 'capacity', 'overload'
                ])
                is_timeout = any(keyword in error_msg for keyword in [
                    'timeout', 'timed out', 'deadline', 'time limit'
                ])
                is_server_error = any(keyword in error_msg for keyword in [
                    '500', '502', '503', '504', 'internal server', 'service unavailable',
                    'bad gateway', 'gateway timeout'
                ])
                
                is_retryable = is_throttling or is_timeout or is_server_error
                
                if is_retryable and attempt < attempts_allowed:
                    # Calculate exponential backoff wait time
                    wait_time = min(2 ** attempt * 5, 120)  # 10s, 20s, 40s... max 120s
                    
                    error_type = "Throttling" if is_throttling else ("Timeout" if is_timeout else "Server error")
                    self.logger.warning(
                        f"⚠️  [{prompt_name}] {error_type} detected (attempt {attempt}/{attempts_allowed}): {str(e)[:200]}"
                    )
                    self.logger.info(f"   Waiting {wait_time}s before retry...")
                    time.sleep(wait_time)
                    continue  # Retry
                else:
                    # Either a non-retryable error or a retryable one with no attempts left
                    if is_retryable:
                        self.logger.error(
                            f"❌ [{prompt_name}] Max retries ({attempts_allowed - 1}) exceeded for retryable error: {str(e)[:200]}"
                        )
                    
                    # Log the final failure and re-raise the original exception
                    self.logger.error(f"AI Query function failed (Prompt: {prompt_name}). Error: {e}\nModel: {model}")
                    raise
        
        # If we exit the loop without returning, all retries failed
        raise Exception(f"LLM call failed after {attempts_allowed} attempts for prompt: {prompt_name}")
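
    # The error handling above checks for "input too long" first (never retried) and only
    # then for throttling, timeout, and server errors, which are retried after an
    # exponential backoff. The sketch below isolates those two decisions. The helper names
    # are hypothetical and the keyword lists are a reduced subset of the ones used above.

```python
def backoff_seconds(attempt, base=5, cap=120):
    """Wait time before the next retry: 2**attempt * base, capped at `cap` seconds."""
    return min(2 ** attempt * base, cap)


def classify_error(message):
    """Map an error message to 'input_too_long', 'retryable', or 'fatal'."""
    text = message.lower()
    # Checked first so that a context-length failure is never retried
    if any(k in text for k in ("input is too long", "maximum context length", "token limit exceeded")):
        return "input_too_long"
    if any(k in text for k in ("throttl", "rate limit", "429", "timed out", "timeout", "503")):
        return "retryable"
    return "fatal"


print([backoff_seconds(a) for a in range(1, 6)])  # [10, 20, 40, 80, 120]
print(classify_error("HTTP 429: Too Many Requests"))  # retryable
print(classify_error("Input is too long for requested model"))  # input_too_long
```

    # With attempts numbered from 1, the first wait is 10 seconds and the cap of
    # 120 seconds is reached on the fifth attempt.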

    # === MODIFIED: Uses internal _load_and_format_prompt ===
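    def run_worker(self, step_name, worker_prompt_path, prompt_vars, response_schema, model_override=None, timeout_override=None, max_retries_override=None):
        """
        Formats a single worker prompt, calls the LLM once, and returns the response.

        `worker_prompt_path` is the key of the template in `self.prompt_templates`.
        Returns the result of `clean_json_response` when `response_schema` is given,
        otherwise the raw response text (for example CSV).
        """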
        try:
            # === MODIFIED ===
            worker_prompt = self._load_and_format_prompt(worker_prompt_path, prompt_vars)
            
            if not worker_prompt:
                raise ValueError(f"Failed to load prompt: {worker_prompt_path}")
            
            # Use model from LLM_MODEL_CONFIG if no override provided
            if model_override is None and worker_prompt_path in LLM_MODEL_CONFIG:
                model_override = LLM_MODEL_CONFIG[worker_prompt_path]
                # Model selected from config - no debug logging needed
            
            # Pass the prompt_path (template name) to _call_ai_query for tracking
            raw_response = self._call_ai_query(
                worker_prompt,
                worker_prompt_path,
                response_schema,
                model_override,
                timeout_seconds=timeout_override,
                max_retries=max_retries_override,
                display_name=step_name
            )
            
            # Return based on schema presence
            if response_schema:
                return clean_json_response(raw_response)
            else:
                # This is the path for CSV
                if not raw_response:
                    self.logger.warning(f"AI Worker for {step_name} (Raw) returned an empty response.")
                    return "" # Return empty string
                return raw_response
        except Exception as e:
            self.logger.error(f"AI Worker process failed for {step_name}: {e}")
            raise

    # --- START OF MODIFICATIONS ---

    def _deep_parse_json_values(self, data, task):
        """
        (Helper) Parses known stringified keys ('attributes', 'domains') within a data object.
        Operates on the dictionary, not the JSON string.
        """
        if not isinstance(data, dict):
            return data # Not a dict (e.g., dashboard list), can't fix

        keys_to_check = []
        if task == 'attributes':
            keys_to_check = ['attributes']
        elif task == 'domains':
            keys_to_check = ['domains']
        
        for key in keys_to_check:
            value = data.get(key)
            if isinstance(value, str):
                try:
                    data[key] = json.loads(value) # Replace string with parsed object
                except (json.JSONDecodeError, TypeError):
                    self.logger.warning(f"Failed to deep-parse stringified key '{key}' in task '{task}'.")
                    data[key] = [] # Set to empty list on failure
        return data
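
    # Models sometimes return a list as a JSON-encoded string inside an otherwise valid
    # JSON object, which is the case the helper above repairs. A minimal stand-alone
    # sketch of that repair follows; `deep_parse` is a hypothetical free function that
    # mirrors the method for a single key.

```python
import json


def deep_parse(data, key):
    """Replace a JSON-encoded string stored under `key` with the parsed object."""
    if not isinstance(data, dict):
        return data
    value = data.get(key)
    if isinstance(value, str):
        try:
            data[key] = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            data[key] = []  # same fallback as the method above
    return data


double_encoded = {"domains": '[{"domain": "Finance"}, {"domain": "Risk"}]'}
print(deep_parse(double_encoded, "domains"))

broken = {"domains": "[{not valid json"}
print(deep_parse(broken, "domains"))
```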

    # === MODIFIED: Uses internal _load_and_format_prompt ===
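    def run_worker_judge(self, step_name, worker_prompt_path, judge_prompt_path, base_prompt_vars, worker_response_schema, config, randomization_params=None, task_info_lambda=None, validation_lambda=None):
        # Use None as the default to avoid sharing one mutable dict across calls
        randomization_params = randomization_params or {}
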
        task_type, log_context = task_info_lambda(base_prompt_vars) if task_info_lambda else ("unknown task", "")
        # Worker/judge process starting - high-level logging only
        
        # --- Added Retry Loop ---
        for attempt in range(config["MAX_RETRIES"]):
            try:
                # --- Worker Generation ---
                worker_outputs = []
                for i in range(2):
                    randomized_vars = base_prompt_vars.copy()
                    for key, val in randomization_params.items():
                        base_min = int(config["PROMPT_VARIABLES"][f"min_{key}"])
                        base_max = int(config["PROMPT_VARIABLES"][f"max_{key}"])
                        new_min = base_min + random.randint(0, val)
                        randomized_vars[f"min_{key}"] = new_min
                        randomized_vars[f"max_{key}"] = max(new_min + 1, base_max + random.randint(0, val))
                    
                    # === MODIFIED ===
                    worker_prompt = self._load_and_format_prompt(worker_prompt_path, randomized_vars)
                    
                    # --- Worker call ---
                    worker_step_name = f"{step_name}_worker_{i+1}"
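                    # Pass the template name for limits and statistics; the step name is only used for heartbeat logs
                    raw_response = self._call_ai_query(worker_prompt, worker_prompt_path, worker_response_schema, display_name=worker_step_name)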
                    worker_outputs.append(clean_json_response(raw_response))

                # --- Helper Functions (Modified to use _deep_parse) ---
                def summarize_output(json_string, task):
                    try:
                        data = json.loads(json_string)
                        data = self._deep_parse_json_values(data, task) # Fix stringified values
                        if task == 'domains': return f"domains: {', '.join([d.get('domain', 'N/A') for d in data.get('domains', [])])}"
                        if task == 'attributes': return f"{len(data.get('attributes', []))} attributes"
                        if 'dashboard' in task: return f"{sum(len(p.get('queries', [])) for p in data)} dashboard queries"
                        return "summary not available"
                    except (json.JSONDecodeError, TypeError): return "invalid JSON"

                def is_response_empty(json_string, task):
                    try:
                        data = json.loads(json_string)
                        if not data: return True
                        data = self._deep_parse_json_values(data, task) # Fix stringified values
                        if task == 'domains': return not data.get('domains')
                        if task == 'attributes': return not data.get('attributes')
                        if 'dashboard' in task: return not isinstance(data, list) or not any(p.get('queries') for p in data)
                        return True
                    except (json.JSONDecodeError, TypeError):
                        return True

                def get_best_worker_output(outputs, task):
                    best_output, max_count = "", -1
                    for out_str in outputs:
                        if is_response_empty(out_str, task): continue
                        try:
                            data, count = json.loads(out_str), 0
                            data = self._deep_parse_json_values(data, task) # Fix stringified values
                            if task == 'domains': count = len(data.get('domains', []))
                            elif task == 'attributes': count = len(data.get('attributes', []))
                            elif 'dashboard' in task: count = sum(len(p.get('queries', [])) for p in data)
                            if count > max_count:
                                max_count, best_output = count, out_str
                        except (json.JSONDecodeError, TypeError): continue
                    return best_output if best_output else max(outputs, key=len, default="")
                
                # --- End Helper Functions ---

                for i, output in enumerate(worker_outputs): self.logger.debug(f"LLM Worker {i+1} suggested {log_context}: {summarize_output(output, task_type)}")
                
                best_worker_fallback = get_best_worker_output(worker_outputs, task_type)
                
                # --- START: NEW JUDGE-SKIP LOGIC ---
                skip_judge = False
                final_cleaned_json = ""
                
                if task_type == 'domains' and judge_prompt_path:
                    try:
                        data1 = json.loads(worker_outputs[0])
                        data1 = self._deep_parse_json_values(data1, task_type)
                        domains1 = set(d.get('domain') for d in data1.get('domains', []))
                        
                        data2 = json.loads(worker_outputs[1])
                        data2 = self._deep_parse_json_values(data2, task_type)
                        domains2 = set(d.get('domain') for d in data2.get('domains', []))
                        
                        if domains1 and domains1 == domains2:
                            self.logger.debug("Worker domain outputs match. Skipping judge and using worker 1 output.")
                            final_cleaned_json = worker_outputs[0]
                            skip_judge = True
                    except Exception as e:
                        self.logger.warning(f"Could not compare worker outputs for judge skip: {e}")
                # --- END: NEW JUDGE-SKIP LOGIC ---

                if not judge_prompt_path or skip_judge:
                    if not judge_prompt_path:
                        self.logger.debug("No judge prompt provided. Skipping judge and using best worker output.")
                    
                    if not final_cleaned_json:
                        final_cleaned_json = best_worker_fallback
                else:
                    judge_prompt_vars = {**base_prompt_vars, **{f'llm{i+1}_output': out for i, out in enumerate(worker_outputs)}}
                    
                    # Render the judge prompt; if it is too long, fall back to the best worker output
                    judge_prompt = self._load_and_format_prompt(judge_prompt_path, judge_prompt_vars)
                    
                    if len(judge_prompt) > 120000:
                        self.logger.debug(f"Judge prompt too long ({len(judge_prompt)} chars). Using best worker response.")
                        final_cleaned_json = best_worker_fallback
                    else:
                        judge_step_name = f"{step_name}_judge"
                        final_raw_response = self._call_ai_query(judge_prompt, judge_step_name, worker_response_schema) 
                        final_cleaned_json = clean_json_response(final_raw_response)
                        if is_response_empty(final_cleaned_json, task_type):
                            self.logger.warning(f"Judge returned a malformed or empty result for {task_type}. Rejecting and using best worker output.")
                            final_cleaned_json = best_worker_fallback
                
                try:
                    final_data = json.loads(final_cleaned_json)
                    final_data = self._deep_parse_json_values(final_data, task_type)
                    final_fixed_json_string = json.dumps(final_data)
                except (json.JSONDecodeError, TypeError):
                    self.logger.error(f"Failed to parse or fix final JSON output for {task_type}. Using raw cleaned JSON for validation.")
                    final_fixed_json_string = final_cleaned_json

                if validation_lambda:
                    validation_lambda(final_fixed_json_string, task_type)
                
                self.logger.debug(f"Final output selected {log_context} (judge may have been skipped): {summarize_output(final_fixed_json_string, task_type)}")
                return final_fixed_json_string

            except Exception as e:
                self.logger.warning(f"Attempt {attempt + 1}/{config['MAX_RETRIES']} for {step_name} {log_context} failed: {e}")
                if attempt == config["MAX_RETRIES"] - 1:
                    self.logger.error(f"FAILED to generate valid output for {step_name} {log_context} after all retries.")
                    raise e
        
        raise Exception(f"AI Worker/Judge for {step_name} failed after all retries.")

    @staticmethod
    def get_summary_report():
        """
        Generates a summary report of AI usage, grouped by prompt type.
        Statistics are tracked at class level, so the report aggregates calls
        across all AIAgent instances. Token counts are estimated as characters / 4.
        """
        report = []
        report.append("\n" + "="*70)
        report.append("--- 📊 AI Usage Summary ---")
        report.append("="*70)
        report.append(f"Total AI Calls:     {AIAgent._total_ai_calls}")
        
        # Calculate estimated tokens (chars / 4)
        input_tokens = AIAgent._total_input_chars / 4
        output_tokens = AIAgent._total_output_chars / 4
        
        report.append(f"Total Input Tokens:  ~{input_tokens:,.0f}  ({AIAgent._total_input_chars:,} chars)")
        report.append(f"Total Output Tokens: ~{output_tokens:,.0f}  ({AIAgent._total_output_chars:,} chars)")
        report.append("\n--- Prompt Type Details ---")
        
        if not AIAgent._step_stats:
            report.append("No AI calls were tracked.")
        else:
            # Group by prompt type (extract the general prompt name from step names)
            prompt_aggregates = {}
            for step_name, stats in AIAgent._step_stats.items():
                # Extract prompt type from step name (e.g., "Generate_UseCases_Batch_5" -> "Generate_UseCases")
                # Common patterns: "PromptName_Language_Batch_X", "PromptName_Batch_X", etc.
                language_suffixes = ['English', 'Arabic', 'Chinese', 'French', 'Spanish', 'German', 'Portuguese', 'Italian', 'Japanese', 'Korean']
                
                # Identify the core prompt name
                prompt_type = step_name
                if 'Batch' in step_name:
                    # Extract everything before "_Batch"
                    batch_idx = step_name.find('_Batch')
                    if batch_idx > 0:
                        prompt_type = step_name[:batch_idx]
                else:
                    # Remove the first matching language suffix. The full list is checked here,
                    # so German, Portuguese, Italian, Japanese, and Korean are stripped as well.
                    for lang in language_suffixes:
                        if f"_{lang}" in step_name:
                            prompt_type = step_name.replace(f"_{lang}", "")
                            break
                
                # Aggregate stats by prompt type
                if prompt_type not in prompt_aggregates:
                    prompt_aggregates[prompt_type] = {'calls': 0, 'input_chars': 0, 'output_chars': 0}
                
                prompt_aggregates[prompt_type]['calls'] += stats['calls']
                prompt_aggregates[prompt_type]['input_chars'] += stats['input_chars']
                prompt_aggregates[prompt_type]['output_chars'] += stats['output_chars']
            
            # Display aggregated results
            prompt_col_width = 40
            header = f"{'Prompt Type':<{prompt_col_width}} | {'Calls':<7} | {'Input Tokens':<18} | {'Output Tokens':<18}"
            report.append(header)
            report.append("-" * len(header))
            
            # Sort by number of calls (descending)
            for prompt_type, stats in sorted(prompt_aggregates.items(), key=lambda x: x[1]['calls'], reverse=True):
                input_tok = stats['input_chars'] / 4
                output_tok = stats['output_chars'] / 4
                report.append(f"{prompt_type:<{prompt_col_width}} | {stats['calls']:<7} | ~{input_tok:<16,.0f} | ~{output_tok:<16,.0f}")
        
        report.append("="*70)
        
        final_report_str = "\n".join(report)
        log_print(final_report_str)
        return final_report_str
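

# Illustrative sketch only (nothing in the notebook calls these helpers): one way
# the grouping and estimation rules described in get_summary_report could be
# expressed as standalone functions. The names `sketch_prompt_type` and
# `sketch_estimate_tokens`, and the short default language tuple, are
# hypothetical and exist only for this example.
def sketch_prompt_type(step_name, languages=("English", "Arabic", "French")):
    """Return the step name without its "_Batch..." tail or a "_<Language>" suffix."""
    batch_idx = step_name.find("_Batch")
    if batch_idx > 0:
        # "Generate_UseCases_Batch_5" keeps only the part before "_Batch"
        return step_name[:batch_idx]
    for lang in languages:
        if f"_{lang}" in step_name:
            # Strip the first language suffix that matches
            return step_name.replace(f"_{lang}", "")
    return step_name


def sketch_estimate_tokens(char_count):
    """Approximate a token count as characters divided by four (a rough heuristic)."""
    return char_count / 4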

# COMMAND ----------

# DBTITLE 1,Translations
# ==============================================================================
# 1.5. TRANSLATION SERVICE (MODIFIED)
# ==============================================================================

class TranslationService:
    """
    Handles all language translation tasks by calling an AI agent in parallel.
    Relies on an AIAgent that has been initialized with the PROMPT_TEMPLATES dictionary.
    """
    
    # English source strings for notebook, PDF, PPTX, and Excel labels
    ENGLISH_TRANSLATIONS = {
        "main_title": "Databricks Agent Bricks Use Case Generator",
        "intro": "This notebook contains AI-generated use cases based on your schemas. Below is a summary of the generated scenarios by business domain.",
        "domain": "Business Domain",
        "total": "Total Use Cases",
        "summaries": "Use Case Summaries",
        "sum_id": "ID",
        "sum_name": "Name",
        "sum_value": "Business Value",
        "sum_outcome": "Expected Outcome",
        "warning_header": "WARNING",
        "warning_body": "Do not run this notebook. It is intended for demonstration and cataloging purposes only. The SQL queries are examples and may require review before execution.",
        "disclaimer": "This content is AI-generated and for demonstration purposes only. All SQL queries are examples and must be validated for syntax and safety by a qualified engineer before being used in any production environment. Databricks is not liable for any issues arising from the use of this code.",
        "detailed_scenarios": "Use Case Details",
        "aspect": "Aspect",
        "description": "Description",
        "aspect_domain": "Business Domain",
        "type": "Type",
        "analytics_technique": "Analytics Technique",
        "primary_table": "Primary Table",
        "priority": "Priority",
        # Value translations for Type field
        "value_type_problem": "Problem",
        "value_type_risk": "Risk",
        "value_type_opportunity": "Opportunity",
        "value_type_improvement": "Improvement",
        # Value translations for Priority field
        "value_priority_ultra_high": "Ultra High",
        "value_priority_very_high": "Very High",
        "value_priority_high": "High",
        "value_priority_medium": "Medium",
        "value_priority_low": "Low",
        "value_priority_very_low": "Very Low",
        "value_priority_ultra_low": "Ultra Low",
        "statement": "Statement",
        "solution": "Solution",
        "aspect_beneficiary": "Beneficiary",
        "beneficiary": "Beneficiary",
        "aspect_sponsor": "Sponsor",
        "sponsor": "Sponsor",
        "business_priority_alignment": "Business Priority Alignment",
        "strategic_goals_alignment": "Strategic Goals Alignment",
        "subdomain": "Subdomain",
        "aspect_value": "Business Value",
        "business_value": "Business Value",
        "aspect_tables": "Tables Involved",
        "aspect_ai_function": "AI Function",
        "aspect_analytics_technique": "Analytics Technique",
        "aspect_primary_table": "Primary Table",
        "aspect_priority": "Priority",
        # Scoring field labels (for Excel)
        "strategic_alignment": "Strategic Alignment",
        "return_on_investment": "Return on Investment",
        "reusability": "Reusability",
        "time_to_value": "Time to Value",
        "data_availability": "Data Availability",
        "data_accessibility": "Data Accessibility",
        "architecture_fitness": "Architecture Fitness",
        "team_skills": "Team Skills",
        "domain_knowledge": "Domain Knowledge",
        "people_allocation": "People Allocation",
        "budget_allocation": "Budget Allocation",
        "time_to_production": "Time to Production",
        "value_score": "Value",
        "feasibility_score": "Feasibility",
        "priority_score": "Priority Score",
        "pdf_title": "Databricks Agent Bricks Strategic AI Use Cases", # MODIFIED
        "pdf_for": "For",
        "pdf_exec_summary": "Executive Summary",
        "pdf_toc_title": "Use Case Domains",
        "pdf_detailed_view": "Detailed Use Case Catalog",
        "pdf_disclaimer_title": "Disclaimer", # NEW
        "pdf_fallback_summary_p1": "This document outlines {total_cases} high-value analytical use cases identified for {business_name}. These scenarios, powered by Databricks Agent Bricks, are designed to drive significant business outcomes by leveraging your existing data assets.",
        "pdf_fallback_summary_p2": "The following pages provide a detailed breakdown of these opportunities, categorized by business domain, to help prioritize your AI initiatives.",
        "pptx_main_title": "Databricks Agent Bricks Strategic AI Use Cases", # MODIFIED
        "pptx_for": "For",
        "pptx_disclaimer_title": "Disclaimer",
        "pptx_domain_suffix": "Use Cases",
        # NEW: Query result example translations
        "example_results": "Example Results",
        "error_no_results": "Could not generate results. Check notebook: {notebook_name} and use case ID {use_case_id}",
        "input_data_original": "Input Data (Original Values)",
        "ai_generated_output": "AI-Generated Results (Output)",
        "column": "Column",
        "value": "Value",
        "executive_summary_not_available": "Executive summary not available.",
        "domain_summary_not_available": "Domain summary not available.",
        "summary_not_available": "Summary not available.",
        # Strategic Goal/Priority Alignment value translations
        "value_general_improvement": "General Improvement",
        "value_reduce_cost": "Reduce Cost",
        "value_increase_revenue": "Increase Revenue",
        "value_boost_productivity": "Boost Productivity",
        "value_mitigate_risk": "Mitigate Risk",
        "value_protect_revenue": "Protect Revenue",
        "value_align_to_regulations": "Align to Regulations",
        "value_improve_customer_experience": "Improve Customer Experience",
        "value_enable_data_driven_decisions": "Enable Data-Driven Decisions",
        "value_optimize_operations": "Optimize Operations",
        "value_empower_talent": "Empower Talent",
        "value_enhance_experience": "Enhance Experience",
        "value_drive_innovation": "Drive Innovation",
        "value_achieve_esg": "Achieve ESG",
        "value_execute_strategy": "Execute Strategy",
        # Analytics Technique value translations
        "value_forecasting": "Forecasting",
        "value_classification": "Classification",
        "value_anomaly_detection": "Anomaly Detection",
        "value_cohort_analysis": "Cohort Analysis",
        "value_segmentation": "Segmentation",
        "value_sentiment_analysis": "Sentiment Analysis",
        "value_trend_analysis": "Trend Analysis",
        "value_prescriptive_analytics": "Prescriptive Analytics",
        "value_root_cause_analysis": "Root Cause Analysis",
        "value_optimization": "Optimization",
        "value_recommendation": "Recommendation",
        "value_time_series_analysis": "Time Series Analysis",
        "value_predictive_analytics": "Predictive Analytics",
        "value_descriptive_analytics": "Descriptive Analytics"
    }

    def __init__(self, ai_agent, logger=None):
        """
        Initializes the TranslationService with an AIAgent instance.
        """
        self.ai_agent = ai_agent
        self.logger = logger or logging.getLogger(self.__class__.__name__)
        self.translation_cache = {} # Cache for UI elements

    # Complete fallback translations for ALL keys - ensures translations are 100% reliable
    TRANSLATION_FALLBACKS = {
        "Arabic": {
            "main_title": "مولد حالات استخدام Databricks Agent Bricks",
            "intro": "يحتوي هذا الدفتر على حالات استخدام تم إنشاؤها بواسطة الذكاء الاصطناعي استناداً إلى مخططاتك. فيما يلي ملخص للسيناريوهات المُنشأة حسب مجال الأعمال.",
            "domain": "مجال الأعمال",
            "total": "إجمالي حالات الاستخدام",
            "summaries": "ملخصات حالات الاستخدام",
            "sum_id": "المعرف",
            "sum_name": "الاسم",
            "sum_value": "القيمة التجارية",
            "sum_outcome": "النتيجة المتوقعة",
            "warning_header": "تحذير",
            "warning_body": "لا تقم بتشغيل هذا الدفتر. وهو مخصص للعرض التوضيحي والفهرسة فقط. استعلامات SQL هي أمثلة وقد تتطلب المراجعة قبل التنفيذ.",
            "disclaimer": "هذا المحتوى تم إنشاؤه بواسطة الذكاء الاصطناعي ولأغراض العرض التوضيحي فقط. جميع استعلامات SQL هي أمثلة ويجب التحقق من صحتها وأمانها من قبل مهندس مؤهل قبل استخدامها في أي بيئة إنتاج.",
            "detailed_scenarios": "تفاصيل حالات الاستخدام",
            "aspect": "الجانب",
            "description": "الوصف",
            "aspect_domain": "مجال الأعمال",
            "type": "النوع",
            "analytics_technique": "تقنية التحليل",
            "primary_table": "الجدول الرئيسي",
            "priority": "الأولوية",
            "value_type_problem": "مشكلة",
            "value_type_risk": "مخاطرة",
            "value_type_opportunity": "فرصة",
            "value_type_improvement": "تحسين",
            "value_priority_ultra_high": "عالية للغاية",
            "value_priority_very_high": "عالية جداً",
            "value_priority_high": "عالية",
            "value_priority_medium": "متوسطة",
            "value_priority_low": "منخفضة",
            "value_priority_very_low": "منخفضة جداً",
            "value_priority_ultra_low": "منخفضة للغاية",
            "statement": "البيان",
            "solution": "الحل",
            "aspect_beneficiary": "المستفيد",
            "beneficiary": "المستفيد",
            "aspect_sponsor": "الراعي",
            "sponsor": "الراعي",
            "business_priority_alignment": "توافق أولوية الأعمال",
            "strategic_goals_alignment": "التوافق مع الأهداف الاستراتيجية",
            "subdomain": "النطاق الفرعي",
            "aspect_value": "القيمة التجارية",
            "business_value": "القيمة التجارية",
            "aspect_tables": "الجداول المستخدمة",
            "aspect_ai_function": "وظيفة الذكاء الاصطناعي",
            "aspect_analytics_technique": "تقنية التحليل",
            "aspect_primary_table": "الجدول الرئيسي",
            "aspect_priority": "الأولوية",
            "strategic_alignment": "التوافق الاستراتيجي",
            "return_on_investment": "العائد على الاستثمار",
            "reusability": "قابلية إعادة الاستخدام",
            "time_to_value": "الوقت للقيمة",
            "data_availability": "توفر البيانات",
            "data_accessibility": "إمكانية الوصول للبيانات",
            "architecture_fitness": "ملاءمة البنية",
            "team_skills": "مهارات الفريق",
            "domain_knowledge": "المعرفة بالمجال",
            "people_allocation": "تخصيص الموارد البشرية",
            "budget_allocation": "تخصيص الميزانية",
            "time_to_production": "الوقت للإنتاج",
            "value_score": "درجة القيمة",
            "feasibility_score": "درجة الجدوى",
            "priority_score": "درجة الأولوية",
            "pdf_title": "حالات استخدام الذكاء الاصطناعي الاستراتيجية من Databricks Agent Bricks",
            "pdf_for": "لـ",
            "pdf_exec_summary": "الملخص التنفيذي",
            "pdf_toc_title": "مجالات حالات الاستخدام",
            "pdf_detailed_view": "كتالوج حالات الاستخدام التفصيلية",
            "pdf_disclaimer_title": "إخلاء المسؤولية",
            "pdf_fallback_summary_p1": "يوضح هذا المستند {total_cases} حالة استخدام تحليلية عالية القيمة تم تحديدها لـ {business_name}. هذه السيناريوهات، المدعومة بـ Databricks Agent Bricks، مصممة لتحقيق نتائج أعمال مهمة من خلال الاستفادة من أصول البيانات الحالية.",
            "pdf_fallback_summary_p2": "توفر الصفحات التالية تفصيلاً مفصلاً لهذه الفرص، مصنفة حسب مجال الأعمال، للمساعدة في تحديد أولويات مبادرات الذكاء الاصطناعي الخاصة بك.",
            "pptx_main_title": "حالات استخدام الذكاء الاصطناعي الاستراتيجية من Databricks Agent Bricks",
            "pptx_for": "لـ",
            "pptx_disclaimer_title": "إخلاء المسؤولية",
            "pptx_domain_suffix": "حالات الاستخدام",
            "example_results": "نتائج المثال",
            "error_no_results": "تعذر إنشاء النتائج، تحقق من الدفتر: {notebook_name} ومعرف حالة الاستخدام {use_case_id}",
            "input_data_original": "بيانات الإدخال (القيم الأصلية)",
            "ai_generated_output": "مخرجات الذكاء الاصطناعي",
            "column": "العمود",
            "value": "القيمة",
            "executive_summary_not_available": "الملخص التنفيذي غير متوفر.",
            "domain_summary_not_available": "ملخص المجال غير متوفر.",
            "summary_not_available": "الملخص غير متوفر.",
            "value_general_improvement": "تحسين عام",
            "value_reduce_cost": "تقليل التكلفة",
            "value_increase_revenue": "زيادة الإيرادات",
            "value_boost_productivity": "تعزيز الإنتاجية",
            "value_mitigate_risk": "تخفيف المخاطر",
            "value_protect_revenue": "حماية الإيرادات",
            "value_align_to_regulations": "الامتثال للوائح",
            "value_improve_customer_experience": "تحسين تجربة العملاء",
            "value_enable_data_driven_decisions": "تمكين القرارات المبنية على البيانات",
            "value_optimize_operations": "تحسين العمليات",
            "value_empower_talent": "تمكين المواهب",
            "value_enhance_experience": "تعزيز التجربة",
            "value_drive_innovation": "دفع الابتكار",
            "value_achieve_esg": "تحقيق ESG",
            "value_execute_strategy": "تنفيذ الاستراتيجية",
            "value_forecasting": "التنبؤ",
            "value_classification": "التصنيف",
            "value_anomaly_detection": "كشف الشذوذ",
            "value_cohort_analysis": "تحليل الأتراب",
            "value_segmentation": "التجزئة",
            "value_sentiment_analysis": "تحليل المشاعر",
            "value_trend_analysis": "تحليل الاتجاهات",
            "value_prescriptive_analytics": "التحليلات التوجيهية",
            "value_root_cause_analysis": "تحليل السبب الجذري",
            "value_optimization": "التحسين",
            "value_recommendation": "التوصية",
            "value_time_series_analysis": "تحليل السلاسل الزمنية",
            "value_predictive_analytics": "التحليلات التنبؤية",
            "value_descriptive_analytics": "التحليلات الوصفية",
        },
        "Spanish": {
            "main_title": "Generador de Casos de Uso de Databricks Agent Bricks",
            "intro": "Este cuaderno contiene casos de uso generados por IA basados en sus esquemas. A continuación se muestra un resumen de los escenarios generados por dominio de negocio.",
            "domain": "Dominio de Negocio",
            "total": "Total de Casos de Uso",
            "summaries": "Resúmenes de Casos de Uso",
            "sum_id": "ID",
            "sum_name": "Nombre",
            "sum_value": "Valor Comercial",
            "sum_outcome": "Resultado Esperado",
            "warning_header": "ADVERTENCIA",
            "warning_body": "No ejecute este cuaderno. Está destinado solo para demostración y catalogación. Las consultas SQL son ejemplos y pueden requerir revisión antes de la ejecución.",
            "disclaimer": "Este contenido es generado por IA y solo con fines de demostración. Todas las consultas SQL son ejemplos y deben ser validadas por un ingeniero calificado antes de usarse en cualquier entorno de producción.",
            "detailed_scenarios": "Detalles de Casos de Uso",
            "aspect": "Aspecto",
            "description": "Descripción",
            "aspect_domain": "Dominio de Negocio",
            "type": "Tipo",
            "analytics_technique": "Técnica de Análisis",
            "primary_table": "Tabla Principal",
            "priority": "Prioridad",
            "value_type_problem": "Problema",
            "value_type_risk": "Riesgo",
            "value_type_opportunity": "Oportunidad",
            "value_type_improvement": "Mejora",
            "value_priority_ultra_high": "Extremadamente Alta",
            "value_priority_very_high": "Muy Alta",
            "value_priority_high": "Alta",
            "value_priority_medium": "Media",
            "value_priority_low": "Baja",
            "value_priority_very_low": "Muy Baja",
            "value_priority_ultra_low": "Extremadamente Baja",
            "statement": "Declaración",
            "solution": "Solución",
            "aspect_beneficiary": "Beneficiario",
            "beneficiary": "Beneficiario",
            "aspect_sponsor": "Patrocinador",
            "sponsor": "Patrocinador",
            "business_priority_alignment": "Alineación de Prioridad Empresarial",
            "strategic_goals_alignment": "Alineación con Objetivos Estratégicos",
            "subdomain": "Subdominio",
            "aspect_value": "Valor Comercial",
            "business_value": "Valor Comercial",
            "aspect_tables": "Tablas Involucradas",
            "aspect_ai_function": "Función de IA",
            "aspect_analytics_technique": "Técnica de Análisis",
            "aspect_primary_table": "Tabla Principal",
            "aspect_priority": "Prioridad",
            "strategic_alignment": "Alineación Estratégica",
            "return_on_investment": "Retorno de Inversión",
            "reusability": "Reusabilidad",
            "time_to_value": "Tiempo al Valor",
            "data_availability": "Disponibilidad de Datos",
            "data_accessibility": "Accesibilidad de Datos",
            "architecture_fitness": "Aptitud de Arquitectura",
            "team_skills": "Habilidades del Equipo",
            "domain_knowledge": "Conocimiento del Dominio",
            "people_allocation": "Asignación de Personal",
            "budget_allocation": "Asignación de Presupuesto",
            "time_to_production": "Tiempo a Producción",
            "value_score": "Puntuación de Valor",
            "feasibility_score": "Puntuación de Viabilidad",
            "priority_score": "Puntuación de Prioridad",
            "pdf_title": "Casos de Uso de IA Estratégica de Databricks Agent Bricks",
            "pdf_for": "Para",
            "pdf_exec_summary": "Resumen Ejecutivo",
            "pdf_toc_title": "Dominios de Casos de Uso",
            "pdf_detailed_view": "Catálogo Detallado de Casos de Uso",
            "pdf_disclaimer_title": "Descargo de Responsabilidad",
            "pdf_fallback_summary_p1": "Este documento describe {total_cases} casos de uso analíticos de alto valor identificados para {business_name}.",
            "pdf_fallback_summary_p2": "Las siguientes páginas proporcionan un desglose detallado de estas oportunidades, categorizadas por dominio de negocio.",
            "pptx_main_title": "Casos de Uso de IA Estratégica de Databricks Agent Bricks",
            "pptx_for": "Para",
            "pptx_disclaimer_title": "Descargo de Responsabilidad",
            "pptx_domain_suffix": "Casos de Uso",
            "example_results": "Resultados de Ejemplo",
            "error_no_results": "No se pudieron generar resultados. Verificar Cuaderno: {notebook_name} y caso de uso {use_case_id}",
            "input_data_original": "Datos de Entrada (Valores Originales)",
            "ai_generated_output": "Resultados Generados por IA",
            "column": "Columna",
            "value": "Valor",
            "executive_summary_not_available": "Resumen ejecutivo no disponible.",
            "domain_summary_not_available": "Resumen del dominio no disponible.",
            "summary_not_available": "Resumen no disponible.",
            "value_general_improvement": "Mejora General",
            "value_reduce_cost": "Reducir Costos",
            "value_increase_revenue": "Aumentar Ingresos",
            "value_boost_productivity": "Impulsar Productividad",
            "value_mitigate_risk": "Mitigar Riesgos",
            "value_protect_revenue": "Proteger Ingresos",
            "value_align_to_regulations": "Cumplir Regulaciones",
            "value_improve_customer_experience": "Mejorar Experiencia del Cliente",
            "value_enable_data_driven_decisions": "Habilitar Decisiones Basadas en Datos",
            "value_optimize_operations": "Optimizar Operaciones",
            "value_empower_talent": "Empoderar Talento",
            "value_enhance_experience": "Mejorar Experiencia",
            "value_drive_innovation": "Impulsar Innovación",
            "value_achieve_esg": "Lograr ESG",
            "value_execute_strategy": "Ejecutar Estrategia",
            "value_forecasting": "Pronóstico",
            "value_classification": "Clasificación",
            "value_anomaly_detection": "Detección de Anomalías",
            "value_cohort_analysis": "Análisis de Cohortes",
            "value_segmentation": "Segmentación",
            "value_sentiment_analysis": "Análisis de Sentimiento",
            "value_trend_analysis": "Análisis de Tendencias",
            "value_prescriptive_analytics": "Analítica Prescriptiva",
            "value_root_cause_analysis": "Análisis de Causa Raíz",
            "value_optimization": "Optimización",
            "value_recommendation": "Recomendación",
            "value_time_series_analysis": "Análisis de Series Temporales",
            "value_predictive_analytics": "Analítica Predictiva",
            "value_descriptive_analytics": "Analítica Descriptiva",
        },
        "French": {
            "main_title": "Générateur de Cas d'Utilisation Databricks Agent Bricks",
            "intro": "Ce carnet contient des cas d'utilisation générés par l'IA basés sur vos schémas. Voici un résumé des scénarios générés par domaine d'activité.",
            "domain": "Domaine d'Activité",
            "total": "Total des Cas d'Utilisation",
            "summaries": "Résumés des Cas d'Utilisation",
            "sum_id": "ID",
            "sum_name": "Nom",
            "sum_value": "Valeur Commerciale",
            "sum_outcome": "Résultat Attendu",
            "warning_header": "AVERTISSEMENT",
            "warning_body": "N'exécutez pas ce carnet. Il est destiné uniquement à la démonstration et au catalogage. Les requêtes SQL sont des exemples et peuvent nécessiter une révision avant exécution.",
            "disclaimer": "Ce contenu est généré par l'IA et à des fins de démonstration uniquement. Toutes les requêtes SQL sont des exemples et doivent être validées par un ingénieur qualifié avant utilisation en production.",
            "detailed_scenarios": "Détails des Cas d'Utilisation",
            "aspect": "Aspect",
            "description": "Description",
            "aspect_domain": "Domaine d'Activité",
            "type": "Type",
            "analytics_technique": "Technique d'Analyse",
            "primary_table": "Table Principale",
            "priority": "Priorité",
            "value_type_problem": "Problème",
            "value_type_risk": "Risque",
            "value_type_opportunity": "Opportunité",
            "value_type_improvement": "Amélioration",
            "value_priority_ultra_high": "Extrêmement Haute",
            "value_priority_very_high": "Très Haute",
            "value_priority_high": "Haute",
            "value_priority_medium": "Moyenne",
            "value_priority_low": "Basse",
            "value_priority_very_low": "Très Basse",
            "value_priority_ultra_low": "Extrêmement Basse",
            "statement": "Énoncé",
            "solution": "Solution",
            "aspect_beneficiary": "Bénéficiaire",
            "beneficiary": "Bénéficiaire",
            "aspect_sponsor": "Sponsor",
            "sponsor": "Sponsor",
            "business_priority_alignment": "Alignement de Priorité d'Entreprise",
            "strategic_goals_alignment": "Alignement aux Objectifs Stratégiques",
            "subdomain": "Sous-domaine",
            "aspect_value": "Valeur Commerciale",
            "business_value": "Valeur Commerciale",
            "aspect_tables": "Tables Impliquées",
            "aspect_ai_function": "Fonction IA",
            "aspect_analytics_technique": "Technique d'Analyse",
            "aspect_primary_table": "Table Principale",
            "aspect_priority": "Priorité",
            "strategic_alignment": "Alignement Stratégique",
            "return_on_investment": "Retour sur Investissement",
            "reusability": "Réutilisabilité",
            "time_to_value": "Délai de Valorisation",
            "data_availability": "Disponibilité des Données",
            "data_accessibility": "Accessibilité des Données",
            "architecture_fitness": "Adéquation Architecturale",
            "team_skills": "Compétences de l'Équipe",
            "domain_knowledge": "Connaissance du Domaine",
            "people_allocation": "Allocation du Personnel",
            "budget_allocation": "Allocation Budgétaire",
            "time_to_production": "Délai de Production",
            "value_score": "Score de Valeur",
            "feasibility_score": "Score de Faisabilité",
            "priority_score": "Score de Priorité",
            "pdf_title": "Cas d'Utilisation IA Stratégiques Databricks Agent Bricks",
            "pdf_for": "Pour",
            "pdf_exec_summary": "Résumé Exécutif",
            "pdf_toc_title": "Domaines des Cas d'Utilisation",
            "pdf_detailed_view": "Catalogue Détaillé des Cas d'Utilisation",
            "pdf_disclaimer_title": "Avertissement",
            "pdf_fallback_summary_p1": "Ce document présente {total_cases} cas d'utilisation analytiques à forte valeur identifiés pour {business_name}.",
            "pdf_fallback_summary_p2": "Les pages suivantes fournissent une répartition détaillée de ces opportunités, classées par domaine d'activité.",
            "pptx_main_title": "Cas d'Utilisation IA Stratégiques Databricks Agent Bricks",
            "pptx_for": "Pour",
            "pptx_disclaimer_title": "Avertissement",
            "pptx_domain_suffix": "Cas d'Utilisation",
            "example_results": "Résultats d'Exemple",
            "error_no_results": "Impossible de générer les résultats. Vérifier le carnet : {notebook_name} et le cas d'utilisation {use_case_id}",
            "input_data_original": "Données d'Entrée (Valeurs Originales)",
            "ai_generated_output": "Résultats Générés par l'IA",
            "column": "Colonne",
            "value": "Valeur",
            "executive_summary_not_available": "Résumé exécutif non disponible.",
            "domain_summary_not_available": "Résumé du domaine non disponible.",
            "summary_not_available": "Résumé non disponible.",
            "value_general_improvement": "Amélioration Générale",
            "value_reduce_cost": "Réduire les Coûts",
            "value_increase_revenue": "Augmenter les Revenus",
            "value_boost_productivity": "Améliorer la Productivité",
            "value_mitigate_risk": "Atténuer les Risques",
            "value_protect_revenue": "Protéger les Revenus",
            "value_align_to_regulations": "Se Conformer aux Réglementations",
            "value_improve_customer_experience": "Améliorer l'Expérience Client",
            "value_enable_data_driven_decisions": "Permettre les Décisions Basées sur les Données",
            "value_optimize_operations": "Optimiser les Opérations",
            "value_empower_talent": "Autonomiser les Talents",
            "value_enhance_experience": "Améliorer l'Expérience",
            "value_drive_innovation": "Stimuler l'Innovation",
            "value_achieve_esg": "Atteindre ESG",
            "value_execute_strategy": "Exécuter la Stratégie",
            "value_forecasting": "Prévision",
            "value_classification": "Classification",
            "value_anomaly_detection": "Détection d'Anomalies",
            "value_cohort_analysis": "Analyse de Cohortes",
            "value_segmentation": "Segmentation",
            "value_sentiment_analysis": "Analyse de Sentiments",
            "value_trend_analysis": "Analyse des Tendances",
            "value_prescriptive_analytics": "Analytique Prescriptive",
            "value_root_cause_analysis": "Analyse des Causes Profondes",
            "value_optimization": "Optimisation",
            "value_recommendation": "Recommandation",
            "value_time_series_analysis": "Analyse de Séries Temporelles",
            "value_predictive_analytics": "Analytique Prédictive",
            "value_descriptive_analytics": "Analytique Descriptive",
        },
        "German": {
            "main_title": "Databricks Agent Bricks Anwendungsfall-Generator",
            "intro": "Dieses Notebook enthält KI-generierte Anwendungsfälle basierend auf Ihren Schemas. Nachfolgend finden Sie eine Zusammenfassung der generierten Szenarien nach Geschäftsbereich.",
            "domain": "Geschäftsbereich",
            "total": "Gesamtzahl der Anwendungsfälle",
            "summaries": "Zusammenfassungen der Anwendungsfälle",
            "sum_id": "ID",
            "sum_name": "Name",
            "sum_value": "Geschäftswert",
            "sum_outcome": "Erwartetes Ergebnis",
            "warning_header": "WARNUNG",
            "warning_body": "Führen Sie dieses Notebook nicht aus. Es dient nur zur Demonstration und Katalogisierung. Die SQL-Abfragen sind Beispiele und erfordern möglicherweise eine Überprüfung vor der Ausführung.",
            "disclaimer": "Dieser Inhalt wurde von KI generiert und dient nur zu Demonstrationszwecken. Alle SQL-Abfragen sind Beispiele und müssen von einem qualifizierten Ingenieur validiert werden, bevor sie in einer Produktionsumgebung verwendet werden.",
            "detailed_scenarios": "Anwendungsfall-Details",
            "aspect": "Aspekt",
            "description": "Beschreibung",
            "aspect_domain": "Geschäftsbereich",
            "type": "Typ",
            "analytics_technique": "Analysetechnik",
            "primary_table": "Haupttabelle",
            "priority": "Priorität",
            "value_type_problem": "Problem",
            "value_type_risk": "Risiko",
            "value_type_opportunity": "Chance",
            "value_type_improvement": "Verbesserung",
            "value_priority_ultra_high": "Extrem Hoch",
            "value_priority_very_high": "Sehr Hoch",
            "value_priority_high": "Hoch",
            "value_priority_medium": "Mittel",
            "value_priority_low": "Niedrig",
            "value_priority_very_low": "Sehr Niedrig",
            "value_priority_ultra_low": "Extrem Niedrig",
            "statement": "Aussage",
            "solution": "Lösung",
            "aspect_beneficiary": "Begünstigter",
            "beneficiary": "Begünstigter",
            "aspect_sponsor": "Sponsor",
            "sponsor": "Sponsor",
            "business_priority_alignment": "Ausrichtung an Geschäftsprioritäten",
            "strategic_goals_alignment": "Strategische Zielausrichtung",
            "subdomain": "Subdomäne",
            "aspect_value": "Geschäftswert",
            "business_value": "Geschäftswert",
            "aspect_tables": "Beteiligte Tabellen",
            "aspect_ai_function": "KI-Funktion",
            "aspect_analytics_technique": "Analysetechnik",
            "aspect_primary_table": "Haupttabelle",
            "aspect_priority": "Priorität",
            "strategic_alignment": "Strategische Ausrichtung",
            "return_on_investment": "Kapitalrendite",
            "reusability": "Wiederverwendbarkeit",
            "time_to_value": "Zeit bis zur Wertschöpfung",
            "data_availability": "Datenverfügbarkeit",
            "data_accessibility": "Datenzugänglichkeit",
            "architecture_fitness": "Architektureignung",
            "team_skills": "Teamfähigkeiten",
            "domain_knowledge": "Fachwissen",
            "people_allocation": "Personalzuweisung",
            "budget_allocation": "Budgetzuweisung",
            "time_to_production": "Zeit bis zur Produktion",
            "value_score": "Wertpunktzahl",
            "feasibility_score": "Machbarkeitspunktzahl",
            "priority_score": "Prioritätspunktzahl",
            "pdf_title": "Databricks Agent Bricks Strategische KI-Anwendungsfälle",
            "pdf_for": "Für",
            "pdf_exec_summary": "Zusammenfassung",
            "pdf_toc_title": "Anwendungsfall-Bereiche",
            "pdf_detailed_view": "Detaillierter Anwendungsfallkatalog",
            "pdf_disclaimer_title": "Haftungsausschluss",
            "pdf_fallback_summary_p1": "Dieses Dokument beschreibt {total_cases} hochwertige analytische Anwendungsfälle, die für {business_name} identifiziert wurden.",
            "pdf_fallback_summary_p2": "Die folgenden Seiten bieten eine detaillierte Aufschlüsselung dieser Möglichkeiten, kategorisiert nach Geschäftsbereich.",
            "pptx_main_title": "Databricks Agent Bricks Strategische KI-Anwendungsfälle",
            "pptx_for": "Für",
            "pptx_disclaimer_title": "Haftungsausschluss",
            "pptx_domain_suffix": "Anwendungsfälle",
            "example_results": "Beispielergebnisse",
            "error_no_results": "Ergebnisse konnten nicht generiert werden. Prüfen Sie das Notebook: {notebook_name} und den Anwendungsfall {use_case_id}",
            "input_data_original": "Eingabedaten (Originalwerte)",
            "ai_generated_output": "KI-generierte Ergebnisse",
            "column": "Spalte",
            "value": "Wert",
            "executive_summary_not_available": "Zusammenfassung nicht verfügbar.",
            "domain_summary_not_available": "Bereichszusammenfassung nicht verfügbar.",
            "summary_not_available": "Zusammenfassung nicht verfügbar.",
            "value_general_improvement": "Allgemeine Verbesserung",
            "value_reduce_cost": "Kosten Reduzieren",
            "value_increase_revenue": "Umsatz Steigern",
            "value_boost_productivity": "Produktivität Steigern",
            "value_mitigate_risk": "Risiken Mindern",
            "value_protect_revenue": "Umsatz Schützen",
            "value_align_to_regulations": "Vorschriften Einhalten",
            "value_improve_customer_experience": "Kundenerlebnis Verbessern",
            "value_enable_data_driven_decisions": "Datengestützte Entscheidungen Ermöglichen",
            "value_optimize_operations": "Betrieb Optimieren",
            "value_empower_talent": "Talente Fördern",
            "value_enhance_experience": "Erlebnis Verbessern",
            "value_drive_innovation": "Innovation Vorantreiben",
            "value_achieve_esg": "ESG Erreichen",
            "value_execute_strategy": "Strategie Umsetzen",
            "value_forecasting": "Prognose",
            "value_classification": "Klassifizierung",
            "value_anomaly_detection": "Anomalieerkennung",
            "value_cohort_analysis": "Kohortenanalyse",
            "value_segmentation": "Segmentierung",
            "value_sentiment_analysis": "Stimmungsanalyse",
            "value_trend_analysis": "Trendanalyse",
            "value_prescriptive_analytics": "Präskriptive Analytik",
            "value_root_cause_analysis": "Ursachenanalyse",
            "value_optimization": "Optimierung",
            "value_recommendation": "Empfehlung",
            "value_time_series_analysis": "Zeitreihenanalyse",
            "value_predictive_analytics": "Prädiktive Analytik",
            "value_descriptive_analytics": "Deskriptive Analytik",
        },
        "Portuguese": {
            "main_title": "Gerador de Casos de Uso Databricks Agent Bricks",
            "intro": "Este notebook contém casos de uso gerados por IA baseados em seus esquemas. Abaixo está um resumo dos cenários gerados por domínio de negócio.",
            "domain": "Domínio de Negócio",
            "total": "Total de Casos de Uso",
            "summaries": "Resumos de Casos de Uso",
            "sum_id": "ID",
            "sum_name": "Nome",
            "sum_value": "Valor de Negócio",
            "sum_outcome": "Resultado Esperado",
            "warning_header": "AVISO",
            "warning_body": "Não execute este notebook. É destinado apenas para demonstração e catalogação. As consultas SQL são exemplos e podem requerer revisão antes da execução.",
            "disclaimer": "Este conteúdo foi gerado por IA e é apenas para fins de demonstração. Todas as consultas SQL são exemplos e devem ser validadas por um engenheiro qualificado antes de serem usadas em produção.",
            "detailed_scenarios": "Detalhes dos Casos de Uso",
            "aspect": "Aspecto",
            "description": "Descrição",
            "aspect_domain": "Domínio de Negócio",
            "type": "Tipo",
            "analytics_technique": "Técnica de Análise",
            "primary_table": "Tabela Principal",
            "priority": "Prioridade",
            "value_type_problem": "Problema",
            "value_type_risk": "Risco",
            "value_type_opportunity": "Oportunidade",
            "value_type_improvement": "Melhoria",
            "value_priority_ultra_high": "Extremamente Alta",
            "value_priority_very_high": "Muito Alta",
            "value_priority_high": "Alta",
            "value_priority_medium": "Média",
            "value_priority_low": "Baixa",
            "value_priority_very_low": "Muito Baixa",
            "value_priority_ultra_low": "Extremamente Baixa",
            "statement": "Declaração",
            "solution": "Solução",
            "aspect_beneficiary": "Beneficiário",
            "beneficiary": "Beneficiário",
            "aspect_sponsor": "Patrocinador",
            "sponsor": "Patrocinador",
            "business_priority_alignment": "Alinhamento de Prioridade de Negócio",
            "strategic_goals_alignment": "Alinhamento com Objetivos Estratégicos",
            "subdomain": "Subdomínio",
            "aspect_value": "Valor de Negócio",
            "business_value": "Valor de Negócio",
            "aspect_tables": "Tabelas Envolvidas",
            "aspect_ai_function": "Função de IA",
            "aspect_analytics_technique": "Técnica de Análise",
            "aspect_primary_table": "Tabela Principal",
            "aspect_priority": "Prioridade",
            "strategic_alignment": "Alinhamento Estratégico",
            "return_on_investment": "Retorno sobre Investimento",
            "reusability": "Reusabilidade",
            "time_to_value": "Tempo para Valor",
            "data_availability": "Disponibilidade de Dados",
            "data_accessibility": "Acessibilidade de Dados",
            "architecture_fitness": "Adequação da Arquitetura",
            "team_skills": "Habilidades da Equipe",
            "domain_knowledge": "Conhecimento do Domínio",
            "people_allocation": "Alocação de Pessoas",
            "budget_allocation": "Alocação de Orçamento",
            "time_to_production": "Tempo para Produção",
            "value_score": "Pontuação de Valor",
            "feasibility_score": "Pontuação de Viabilidade",
            "priority_score": "Pontuação de Prioridade",
            "pdf_title": "Casos de Uso de IA Estratégica Databricks Agent Bricks",
            "pdf_for": "Para",
            "pdf_exec_summary": "Resumo Executivo",
            "pdf_toc_title": "Domínios de Casos de Uso",
            "pdf_detailed_view": "Catálogo Detalhado de Casos de Uso",
            "pdf_disclaimer_title": "Aviso Legal",
            "pdf_fallback_summary_p1": "Este documento descreve {total_cases} casos de uso analíticos de alto valor identificados para {business_name}.",
            "pdf_fallback_summary_p2": "As páginas seguintes fornecem uma análise detalhada dessas oportunidades, categorizadas por domínio de negócio.",
            "pptx_main_title": "Casos de Uso de IA Estratégica Databricks Agent Bricks",
            "pptx_for": "Para",
            "pptx_disclaimer_title": "Aviso Legal",
            "pptx_domain_suffix": "Casos de Uso",
            "example_results": "Resultados de Exemplo",
            "error_no_results": "Não foi possível gerar resultados. Verifique o notebook: {notebook_name} e o caso de uso {use_case_id}",
            "input_data_original": "Dados de Entrada (Valores Originais)",
            "ai_generated_output": "Resultados Gerados por IA",
            "column": "Coluna",
            "value": "Valor",
            "executive_summary_not_available": "Resumo executivo não disponível.",
            "domain_summary_not_available": "Resumo do domínio não disponível.",
            "summary_not_available": "Resumo não disponível.",
            "value_general_improvement": "Melhoria Geral",
            "value_reduce_cost": "Reduzir Custos",
            "value_increase_revenue": "Aumentar Receita",
            "value_boost_productivity": "Aumentar Produtividade",
            "value_mitigate_risk": "Mitigar Riscos",
            "value_protect_revenue": "Proteger Receita",
            "value_align_to_regulations": "Cumprir Regulamentos",
            "value_improve_customer_experience": "Melhorar Experiência do Cliente",
            "value_enable_data_driven_decisions": "Habilitar Decisões Baseadas em Dados",
            "value_optimize_operations": "Otimizar Operações",
            "value_empower_talent": "Capacitar Talentos",
            "value_enhance_experience": "Melhorar Experiência",
            "value_drive_innovation": "Impulsionar Inovação",
            "value_achieve_esg": "Alcançar ESG",
            "value_execute_strategy": "Executar Estratégia",
            "value_forecasting": "Previsão",
            "value_classification": "Classificação",
            "value_anomaly_detection": "Detecção de Anomalias",
            "value_cohort_analysis": "Análise de Coorte",
            "value_segmentation": "Segmentação",
            "value_sentiment_analysis": "Análise de Sentimento",
            "value_trend_analysis": "Análise de Tendências",
            "value_prescriptive_analytics": "Análise Prescritiva",
            "value_root_cause_analysis": "Análise de Causa Raiz",
            "value_optimization": "Otimização",
            "value_recommendation": "Recomendação",
            "value_time_series_analysis": "Análise de Séries Temporais",
            "value_predictive_analytics": "Análise Preditiva",
            "value_descriptive_analytics": "Análise Descritiva",
        },
        "Italian": {
            "main_title": "Generatore di Casi d'Uso Databricks Agent Bricks",
            "intro": "Questo notebook contiene casi d'uso generati dall'IA basati sui tuoi schemi. Di seguito è riportato un riepilogo degli scenari generati per dominio aziendale.",
            "domain": "Dominio Aziendale",
            "total": "Totale Casi d'Uso",
            "summaries": "Riepiloghi dei Casi d'Uso",
            "sum_id": "ID",
            "sum_name": "Nome",
            "sum_value": "Valore Aziendale",
            "sum_outcome": "Risultato Atteso",
            "warning_header": "AVVERTENZA",
            "warning_body": "Non eseguire questo notebook. È destinato solo a scopi dimostrativi e di catalogazione. Le query SQL sono esempi e potrebbero richiedere revisione prima dell'esecuzione.",
            "disclaimer": "Questo contenuto è generato dall'IA ed è fornito solo a scopo dimostrativo. Tutte le query SQL sono esempi e devono essere validate da un ingegnere qualificato prima dell'uso in produzione.",
            "detailed_scenarios": "Dettagli dei Casi d'Uso",
            "aspect": "Aspetto",
            "description": "Descrizione",
            "aspect_domain": "Dominio Aziendale",
            "type": "Tipo",
            "analytics_technique": "Tecnica di Analisi",
            "primary_table": "Tabella Principale",
            "priority": "Priorità",
            "value_type_problem": "Problema",
            "value_type_risk": "Rischio",
            "value_type_opportunity": "Opportunità",
            "value_type_improvement": "Miglioramento",
            "value_priority_ultra_high": "Estremamente Alta",
            "value_priority_very_high": "Molto Alta",
            "value_priority_high": "Alta",
            "value_priority_medium": "Media",
            "value_priority_low": "Bassa",
            "value_priority_very_low": "Molto Bassa",
            "value_priority_ultra_low": "Estremamente Bassa",
            "statement": "Dichiarazione",
            "solution": "Soluzione",
            "aspect_beneficiary": "Beneficiario",
            "beneficiary": "Beneficiario",
            "aspect_sponsor": "Sponsor",
            "sponsor": "Sponsor",
            "business_priority_alignment": "Allineamento Priorità Aziendale",
            "strategic_goals_alignment": "Allineamento agli Obiettivi Strategici",
            "subdomain": "Sottodominio",
            "aspect_value": "Valore Aziendale",
            "business_value": "Valore Aziendale",
            "aspect_tables": "Tabelle Coinvolte",
            "aspect_ai_function": "Funzione IA",
            "aspect_analytics_technique": "Tecnica di Analisi",
            "aspect_primary_table": "Tabella Principale",
            "aspect_priority": "Priorità",
            "strategic_alignment": "Allineamento Strategico",
            "return_on_investment": "Ritorno sull'Investimento",
            "reusability": "Riutilizzabilità",
            "time_to_value": "Tempo al Valore",
            "data_availability": "Disponibilità dei Dati",
            "data_accessibility": "Accessibilità dei Dati",
            "architecture_fitness": "Idoneità dell'Architettura",
            "team_skills": "Competenze del Team",
            "domain_knowledge": "Conoscenza del Dominio",
            "people_allocation": "Allocazione del Personale",
            "budget_allocation": "Allocazione del Budget",
            "time_to_production": "Tempo alla Produzione",
            "value_score": "Punteggio di Valore",
            "feasibility_score": "Punteggio di Fattibilità",
            "priority_score": "Punteggio di Priorità",
            "pdf_title": "Casi d'Uso IA Strategici Databricks Agent Bricks",
            "pdf_for": "Per",
            "pdf_exec_summary": "Riepilogo Esecutivo",
            "pdf_toc_title": "Domini dei Casi d'Uso",
            "pdf_detailed_view": "Catalogo Dettagliato dei Casi d'Uso",
            "pdf_disclaimer_title": "Disclaimer",
            "pdf_fallback_summary_p1": "Questo documento descrive {total_cases} casi d'uso analitici ad alto valore identificati per {business_name}.",
            "pdf_fallback_summary_p2": "Le pagine seguenti forniscono un'analisi dettagliata di queste opportunità, categorizzate per dominio aziendale.",
            "pptx_main_title": "Casi d'Uso IA Strategici Databricks Agent Bricks",
            "pptx_for": "Per",
            "pptx_disclaimer_title": "Disclaimer",
            "pptx_domain_suffix": "Casi d'Uso",
            "example_results": "Risultati di Esempio",
            "error_no_results": "Impossibile generare i risultati. Verificare il notebook: {notebook_name} e il caso d'uso {use_case_id}",
            "input_data_original": "Dati di Input (Valori Originali)",
            "ai_generated_output": "Risultati Generati dall'IA",
            "column": "Colonna",
            "value": "Valore",
            "executive_summary_not_available": "Riepilogo esecutivo non disponibile.",
            "domain_summary_not_available": "Riepilogo del dominio non disponibile.",
            "summary_not_available": "Riepilogo non disponibile.",
            "value_general_improvement": "Miglioramento Generale",
            "value_reduce_cost": "Ridurre i Costi",
            "value_increase_revenue": "Aumentare i Ricavi",
            "value_boost_productivity": "Aumentare la Produttività",
            "value_mitigate_risk": "Mitigare i Rischi",
            "value_protect_revenue": "Proteggere i Ricavi",
            "value_align_to_regulations": "Conformarsi alle Normative",
            "value_improve_customer_experience": "Migliorare l'Esperienza del Cliente",
            "value_enable_data_driven_decisions": "Abilitare Decisioni Basate sui Dati",
            "value_optimize_operations": "Ottimizzare le Operazioni",
            "value_empower_talent": "Valorizzare i Talenti",
            "value_enhance_experience": "Migliorare l'Esperienza",
            "value_drive_innovation": "Promuovere l'Innovazione",
            "value_achieve_esg": "Raggiungere ESG",
            "value_execute_strategy": "Eseguire la Strategia",
            "value_forecasting": "Previsione",
            "value_classification": "Classificazione",
            "value_anomaly_detection": "Rilevamento Anomalie",
            "value_cohort_analysis": "Analisi di Coorte",
            "value_segmentation": "Segmentazione",
            "value_sentiment_analysis": "Analisi del Sentimento",
            "value_trend_analysis": "Analisi delle Tendenze",
            "value_prescriptive_analytics": "Analisi Prescrittiva",
            "value_root_cause_analysis": "Analisi delle Cause Profonde",
            "value_optimization": "Ottimizzazione",
            "value_recommendation": "Raccomandazione",
            "value_time_series_analysis": "Analisi delle Serie Temporali",
            "value_predictive_analytics": "Analisi Predittiva",
            "value_descriptive_analytics": "Analisi Descrittiva",
        },
        "Chinese (Mandarin)": {
            "main_title": "Databricks Agent Bricks 用例生成器",
            "intro": "本笔记本包含基于您的架构由AI生成的用例。以下是按业务领域生成的场景摘要。",
            "domain": "业务领域",
            "total": "用例总数",
            "summaries": "用例摘要",
            "sum_id": "ID",
            "sum_name": "名称",
            "sum_value": "商业价值",
            "sum_outcome": "预期结果",
            "warning_header": "警告",
            "warning_body": "请勿运行此笔记本。它仅用于演示和编目目的。SQL查询是示例，可能需要在执行前进行审核。",
            "disclaimer": "此内容由AI生成，仅供演示目的。所有SQL查询都是示例，必须由合格工程师验证后才能在生产环境中使用。",
            "detailed_scenarios": "用例详情",
            "aspect": "方面",
            "description": "描述",
            "aspect_domain": "业务领域",
            "type": "类型",
            "analytics_technique": "分析技术",
            "primary_table": "主表",
            "priority": "优先级",
            "value_type_problem": "问题",
            "value_type_risk": "风险",
            "value_type_opportunity": "机会",
            "value_type_improvement": "改进",
            "value_priority_ultra_high": "极高",
            "value_priority_very_high": "非常高",
            "value_priority_high": "高",
            "value_priority_medium": "中等",
            "value_priority_low": "低",
            "value_priority_very_low": "非常低",
            "value_priority_ultra_low": "极低",
            "statement": "陈述",
            "solution": "解决方案",
            "aspect_beneficiary": "受益者",
            "beneficiary": "受益者",
            "aspect_sponsor": "发起人",
            "sponsor": "发起人",
            "business_priority_alignment": "业务优先级对齐",
            "strategic_goals_alignment": "战略目标对齐",
            "subdomain": "子领域",
            "aspect_value": "商业价值",
            "business_value": "商业价值",
            "aspect_tables": "涉及的表",
            "aspect_ai_function": "AI功能",
            "aspect_analytics_technique": "分析技术",
            "aspect_primary_table": "主表",
            "aspect_priority": "优先级",
            "strategic_alignment": "战略对齐",
            "return_on_investment": "投资回报率",
            "reusability": "可重用性",
            "time_to_value": "价值实现时间",
            "data_availability": "数据可用性",
            "data_accessibility": "数据可访问性",
            "architecture_fitness": "架构适配性",
            "team_skills": "团队技能",
            "domain_knowledge": "领域知识",
            "people_allocation": "人员分配",
            "budget_allocation": "预算分配",
            "time_to_production": "投产时间",
            "value_score": "价值分数",
            "feasibility_score": "可行性分数",
            "priority_score": "优先级分数",
            "pdf_title": "Databricks Agent Bricks 战略AI用例",
            "pdf_for": "为",
            "pdf_exec_summary": "执行摘要",
            "pdf_toc_title": "用例领域",
            "pdf_detailed_view": "详细用例目录",
            "pdf_disclaimer_title": "免责声明",
            "pdf_fallback_summary_p1": "本文档概述了为{business_name}识别的{total_cases}个高价值分析用例。",
            "pdf_fallback_summary_p2": "以下页面按业务领域分类提供这些机会的详细分析。",
            "pptx_main_title": "Databricks Agent Bricks 战略AI用例",
            "pptx_for": "为",
            "pptx_disclaimer_title": "免责声明",
            "pptx_domain_suffix": "用例",
            "example_results": "示例结果",
            "error_no_results": "无法生成结果。请检查笔记本：{notebook_name}和用例{use_case_id}",
            "input_data_original": "输入数据（原始值）",
            "ai_generated_output": "AI生成结果",
            "column": "列",
            "value": "值",
            "executive_summary_not_available": "执行摘要不可用。",
            "domain_summary_not_available": "领域摘要不可用。",
            "summary_not_available": "摘要不可用。",
            "value_general_improvement": "一般改进",
            "value_reduce_cost": "降低成本",
            "value_increase_revenue": "增加收入",
            "value_boost_productivity": "提高生产力",
            "value_mitigate_risk": "降低风险",
            "value_protect_revenue": "保护收入",
            "value_align_to_regulations": "符合法规",
            "value_improve_customer_experience": "改善客户体验",
            "value_enable_data_driven_decisions": "实现数据驱动决策",
            "value_optimize_operations": "优化运营",
            "value_empower_talent": "赋能人才",
            "value_enhance_experience": "提升体验",
            "value_drive_innovation": "推动创新",
            "value_achieve_esg": "实现ESG",
            "value_execute_strategy": "执行战略",
            "value_forecasting": "预测",
            "value_classification": "分类",
            "value_anomaly_detection": "异常检测",
            "value_cohort_analysis": "队列分析",
            "value_segmentation": "细分",
            "value_sentiment_analysis": "情感分析",
            "value_trend_analysis": "趋势分析",
            "value_prescriptive_analytics": "规范性分析",
            "value_root_cause_analysis": "根因分析",
            "value_optimization": "优化",
            "value_recommendation": "推荐",
            "value_time_series_analysis": "时间序列分析",
            "value_predictive_analytics": "预测分析",
            "value_descriptive_analytics": "描述性分析",
        },
        "Japanese": {
            "main_title": "Databricks Agent Bricks ユースケースジェネレーター",
            "intro": "このノートブックには、スキーマに基づいてAIが生成したユースケースが含まれています。以下は、ビジネスドメイン別に生成されたシナリオの概要です。",
            "domain": "ビジネスドメイン",
            "total": "ユースケース総数",
            "summaries": "ユースケース概要",
            "sum_id": "ID",
            "sum_name": "名前",
            "sum_value": "ビジネス価値",
            "sum_outcome": "期待される成果",
            "warning_header": "警告",
            "warning_body": "このノートブックを実行しないでください。デモンストレーションとカタログ作成のみを目的としています。SQLクエリは例であり、実行前にレビューが必要な場合があります。",
            "disclaimer": "このコンテンツはAIによって生成されており、デモンストレーション目的のみです。すべてのSQLクエリは例であり、本番環境で使用する前に資格のあるエンジニアによる検証が必要です。",
            "detailed_scenarios": "ユースケースの詳細",
            "aspect": "側面",
            "description": "説明",
            "aspect_domain": "ビジネスドメイン",
            "type": "タイプ",
            "analytics_technique": "分析技術",
            "primary_table": "主要テーブル",
            "priority": "優先度",
            "value_type_problem": "問題",
            "value_type_risk": "リスク",
            "value_type_opportunity": "機会",
            "value_type_improvement": "改善",
            "value_priority_ultra_high": "極めて高い",
            "value_priority_very_high": "非常に高い",
            "value_priority_high": "高い",
            "value_priority_medium": "中程度",
            "value_priority_low": "低い",
            "value_priority_very_low": "非常に低い",
            "value_priority_ultra_low": "極めて低い",
            "statement": "ステートメント",
            "solution": "ソリューション",
            "aspect_beneficiary": "受益者",
            "beneficiary": "受益者",
            "aspect_sponsor": "スポンサー",
            "sponsor": "スポンサー",
            "business_priority_alignment": "ビジネス優先度との整合性",
            "strategic_goals_alignment": "戦略目標との整合性",
            "subdomain": "サブドメイン",
            "aspect_value": "ビジネス価値",
            "business_value": "ビジネス価値",
            "aspect_tables": "関連テーブル",
            "aspect_ai_function": "AI機能",
            "aspect_analytics_technique": "分析技術",
            "aspect_primary_table": "主要テーブル",
            "aspect_priority": "優先度",
            "strategic_alignment": "戦略的整合性",
            "return_on_investment": "投資収益率",
            "reusability": "再利用性",
            "time_to_value": "価値実現までの時間",
            "data_availability": "データ可用性",
            "data_accessibility": "データアクセシビリティ",
            "architecture_fitness": "アーキテクチャ適合性",
            "team_skills": "チームスキル",
            "domain_knowledge": "ドメイン知識",
            "people_allocation": "人員配置",
            "budget_allocation": "予算配分",
            "time_to_production": "本番化までの時間",
            "value_score": "価値スコア",
            "feasibility_score": "実現可能性スコア",
            "priority_score": "優先度スコア",
            "pdf_title": "Databricks Agent Bricks 戦略的AIユースケース",
            "pdf_for": "対象",
            "pdf_exec_summary": "エグゼクティブサマリー",
            "pdf_toc_title": "ユースケースドメイン",
            "pdf_detailed_view": "詳細ユースケースカタログ",
            "pdf_disclaimer_title": "免責事項",
            "pdf_fallback_summary_p1": "本ドキュメントは{business_name}向けに特定された{total_cases}件の高価値分析ユースケースを概説しています。",
            "pdf_fallback_summary_p2": "以下のページでは、ビジネスドメイン別に分類されたこれらの機会の詳細な分析を提供します。",
            "pptx_main_title": "Databricks Agent Bricks 戦略的AIユースケース",
            "pptx_for": "対象",
            "pptx_disclaimer_title": "免責事項",
            "pptx_domain_suffix": "ユースケース",
            "example_results": "サンプル結果",
            "error_no_results": "結果を生成できませんでした。ノートブック: {notebook_name} とユースケース {use_case_id} を確認してください",
            "input_data_original": "入力データ（元の値）",
            "ai_generated_output": "AI生成結果",
            "column": "列",
            "value": "値",
            "executive_summary_not_available": "エグゼクティブサマリーは利用できません。",
            "domain_summary_not_available": "ドメインサマリーは利用できません。",
            "summary_not_available": "サマリーは利用できません。",
            "value_general_improvement": "一般的な改善",
            "value_reduce_cost": "コスト削減",
            "value_increase_revenue": "収益増加",
            "value_boost_productivity": "生産性向上",
            "value_mitigate_risk": "リスク軽減",
            "value_protect_revenue": "収益保護",
            "value_align_to_regulations": "規制遵守",
            "value_improve_customer_experience": "顧客体験の改善",
            "value_enable_data_driven_decisions": "データ駆動型意思決定の実現",
            "value_optimize_operations": "業務最適化",
            "value_empower_talent": "人材育成",
            "value_enhance_experience": "体験向上",
            "value_drive_innovation": "イノベーション推進",
            "value_achieve_esg": "ESG達成",
            "value_execute_strategy": "戦略実行",
            "value_forecasting": "予測",
            "value_classification": "分類",
            "value_anomaly_detection": "異常検知",
            "value_cohort_analysis": "コホート分析",
            "value_segmentation": "セグメンテーション",
            "value_sentiment_analysis": "感情分析",
            "value_trend_analysis": "トレンド分析",
            "value_prescriptive_analytics": "処方的分析",
            "value_root_cause_analysis": "根本原因分析",
            "value_optimization": "最適化",
            "value_recommendation": "レコメンデーション",
            "value_time_series_analysis": "時系列分析",
            "value_predictive_analytics": "予測分析",
            "value_descriptive_analytics": "記述的分析",
        },
        "Korean": {
            "main_title": "Databricks Agent Bricks 유스케이스 생성기",
            "intro": "이 노트북에는 스키마를 기반으로 AI가 생성한 유스케이스가 포함되어 있습니다. 아래는 비즈니스 도메인별 생성된 시나리오 요약입니다.",
            "domain": "비즈니스 도메인",
            "total": "총 유스케이스",
            "summaries": "유스케이스 요약",
            "sum_id": "ID",
            "sum_name": "이름",
            "sum_value": "비즈니스 가치",
            "sum_outcome": "예상 결과",
            "warning_header": "경고",
            "warning_body": "이 노트북을 실행하지 마십시오. 데모 및 카탈로그 작성 목적으로만 사용됩니다. SQL 쿼리는 예시이며 실행 전 검토가 필요할 수 있습니다.",
            "disclaimer": "이 콘텐츠는 AI가 생성한 것이며 데모 목적으로만 사용됩니다. 모든 SQL 쿼리는 예시이며 프로덕션 환경에서 사용하기 전에 자격을 갖춘 엔지니어의 검증이 필요합니다.",
            "detailed_scenarios": "유스케이스 세부정보",
            "aspect": "측면",
            "description": "설명",
            "aspect_domain": "비즈니스 도메인",
            "type": "유형",
            "analytics_technique": "분석 기법",
            "primary_table": "주요 테이블",
            "priority": "우선순위",
            "value_type_problem": "문제",
            "value_type_risk": "위험",
            "value_type_opportunity": "기회",
            "value_type_improvement": "개선",
            "value_priority_ultra_high": "초고",
            "value_priority_very_high": "매우 높음",
            "value_priority_high": "높음",
            "value_priority_medium": "보통",
            "value_priority_low": "낮음",
            "value_priority_very_low": "매우 낮음",
            "value_priority_ultra_low": "초저",
            "statement": "설명",
            "solution": "솔루션",
            "aspect_beneficiary": "수혜자",
            "beneficiary": "수혜자",
            "aspect_sponsor": "후원자",
            "sponsor": "후원자",
            "business_priority_alignment": "비즈니스 우선순위 정렬",
            "strategic_goals_alignment": "전략 목표 정렬",
            "subdomain": "하위 도메인",
            "aspect_value": "비즈니스 가치",
            "business_value": "비즈니스 가치",
            "aspect_tables": "관련 테이블",
            "aspect_ai_function": "AI 기능",
            "aspect_analytics_technique": "분석 기법",
            "aspect_primary_table": "주요 테이블",
            "aspect_priority": "우선순위",
            "strategic_alignment": "전략적 정렬",
            "return_on_investment": "투자 수익률",
            "reusability": "재사용성",
            "time_to_value": "가치 실현 시간",
            "data_availability": "데이터 가용성",
            "data_accessibility": "데이터 접근성",
            "architecture_fitness": "아키텍처 적합성",
            "team_skills": "팀 역량",
            "domain_knowledge": "도메인 지식",
            "people_allocation": "인력 배치",
            "budget_allocation": "예산 배분",
            "time_to_production": "프로덕션까지 시간",
            "value_score": "가치 점수",
            "feasibility_score": "실현 가능성 점수",
            "priority_score": "우선순위 점수",
            "pdf_title": "Databricks Agent Bricks 전략적 AI 유스케이스",
            "pdf_for": "대상",
            "pdf_exec_summary": "경영진 요약",
            "pdf_toc_title": "유스케이스 도메인",
            "pdf_detailed_view": "상세 유스케이스 카탈로그",
            "pdf_disclaimer_title": "면책조항",
            "pdf_fallback_summary_p1": "이 문서는 {business_name}을 위해 식별된 {total_cases}개의 고가치 분석 유스케이스를 설명합니다.",
            "pdf_fallback_summary_p2": "다음 페이지에서는 비즈니스 도메인별로 분류된 이러한 기회의 상세 분석을 제공합니다.",
            "pptx_main_title": "Databricks Agent Bricks 전략적 AI 유스케이스",
            "pptx_for": "대상",
            "pptx_disclaimer_title": "면책조항",
            "pptx_domain_suffix": "유스케이스",
            "example_results": "예시 결과",
            "error_no_results": "결과를 생성할 수 없습니다. 노트북: {notebook_name} 및 유스케이스 {use_case_id}를 확인하세요",
            "input_data_original": "입력 데이터 (원본 값)",
            "ai_generated_output": "AI 생성 결과",
            "column": "열",
            "value": "값",
            "executive_summary_not_available": "경영진 요약을 사용할 수 없습니다.",
            "domain_summary_not_available": "도메인 요약을 사용할 수 없습니다.",
            "summary_not_available": "요약을 사용할 수 없습니다.",
            "value_general_improvement": "일반 개선",
            "value_reduce_cost": "비용 절감",
            "value_increase_revenue": "수익 증대",
            "value_boost_productivity": "생산성 향상",
            "value_mitigate_risk": "위험 완화",
            "value_protect_revenue": "수익 보호",
            "value_align_to_regulations": "규정 준수",
            "value_improve_customer_experience": "고객 경험 개선",
            "value_enable_data_driven_decisions": "데이터 기반 의사결정 지원",
            "value_optimize_operations": "운영 최적화",
            "value_empower_talent": "인재 역량 강화",
            "value_enhance_experience": "경험 향상",
            "value_drive_innovation": "혁신 추진",
            "value_achieve_esg": "ESG 달성",
            "value_execute_strategy": "전략 실행",
            "value_forecasting": "예측",
            "value_classification": "분류",
            "value_anomaly_detection": "이상 탐지",
            "value_cohort_analysis": "코호트 분석",
            "value_segmentation": "세분화",
            "value_sentiment_analysis": "감정 분석",
            "value_trend_analysis": "추세 분석",
            "value_prescriptive_analytics": "처방적 분석",
            "value_root_cause_analysis": "근본 원인 분석",
            "value_optimization": "최적화",
            "value_recommendation": "추천",
            "value_time_series_analysis": "시계열 분석",
            "value_predictive_analytics": "예측 분석",
            "value_descriptive_analytics": "기술적 분석",
        },
        "Hindi": {
            "main_title": "Databricks Agent Bricks उपयोग केस जनरेटर",
            "intro": "इस नोटबुक में आपके स्कीमा के आधार पर AI-जनित उपयोग केस शामिल हैं। नीचे व्यापार डोमेन द्वारा जनित परिदृश्यों का सारांश है।",
            "domain": "व्यापार डोमेन",
            "total": "कुल उपयोग केस",
            "summaries": "उपयोग केस सारांश",
            "sum_id": "ID",
            "sum_name": "नाम",
            "sum_value": "व्यापारिक मूल्य",
            "sum_outcome": "अपेक्षित परिणाम",
            "warning_header": "चेतावनी",
            "warning_body": "इस नोटबुक को न चलाएं। यह केवल प्रदर्शन और कैटलॉगिंग उद्देश्यों के लिए है। SQL क्वेरी उदाहरण हैं और निष्पादन से पहले समीक्षा की आवश्यकता हो सकती है।",
            "disclaimer": "यह सामग्री AI-जनित है और केवल प्रदर्शन उद्देश्यों के लिए है। सभी SQL क्वेरी उदाहरण हैं और उत्पादन वातावरण में उपयोग करने से पहले एक योग्य इंजीनियर द्वारा मान्य किया जाना चाहिए।",
            "detailed_scenarios": "उपयोग केस विवरण",
            "aspect": "पहलू",
            "description": "विवरण",
            "aspect_domain": "व्यापार डोमेन",
            "type": "प्रकार",
            "analytics_technique": "विश्लेषण तकनीक",
            "primary_table": "प्राथमिक तालिका",
            "priority": "प्राथमिकता",
            "value_type_problem": "समस्या",
            "value_type_risk": "जोखिम",
            "value_type_opportunity": "अवसर",
            "value_type_improvement": "सुधार",
            "value_priority_ultra_high": "अत्यधिक उच्च",
            "value_priority_very_high": "बहुत उच्च",
            "value_priority_high": "उच्च",
            "value_priority_medium": "मध्यम",
            "value_priority_low": "कम",
            "value_priority_very_low": "बहुत कम",
            "value_priority_ultra_low": "अत्यधिक कम",
            "statement": "कथन",
            "solution": "समाधान",
            "aspect_beneficiary": "लाभार्थी",
            "beneficiary": "लाभार्थी",
            "aspect_sponsor": "प्रायोजक",
            "sponsor": "प्रायोजक",
            "business_priority_alignment": "व्यापार प्राथमिकता संरेखण",
            "strategic_goals_alignment": "रणनीतिक लक्ष्य संरेखण",
            "subdomain": "उप-डोमेन",
            "aspect_value": "व्यापारिक मूल्य",
            "business_value": "व्यापारिक मूल्य",
            "aspect_tables": "शामिल तालिकाएं",
            "aspect_ai_function": "AI फ़ंक्शन",
            "aspect_analytics_technique": "विश्लेषण तकनीक",
            "aspect_primary_table": "प्राथमिक तालिका",
            "aspect_priority": "प्राथमिकता",
            "strategic_alignment": "रणनीतिक संरेखण",
            "return_on_investment": "निवेश पर प्रतिफल",
            "reusability": "पुन: प्रयोज्यता",
            "time_to_value": "मूल्य तक समय",
            "data_availability": "डेटा उपलब्धता",
            "data_accessibility": "डेटा पहुंच",
            "architecture_fitness": "आर्किटेक्चर उपयुक्तता",
            "team_skills": "टीम कौशल",
            "domain_knowledge": "डोमेन ज्ञान",
            "people_allocation": "लोग आवंटन",
            "budget_allocation": "बजट आवंटन",
            "time_to_production": "उत्पादन तक समय",
            "value_score": "मूल्य स्कोर",
            "feasibility_score": "व्यवहार्यता स्कोर",
            "priority_score": "प्राथमिकता स्कोर",
            "pdf_title": "Databricks Agent Bricks रणनीतिक AI उपयोग केस",
            "pdf_for": "के लिए",
            "pdf_exec_summary": "कार्यकारी सारांश",
            "pdf_toc_title": "उपयोग केस डोमेन",
            "pdf_detailed_view": "विस्तृत उपयोग केस कैटलॉग",
            "pdf_disclaimer_title": "अस्वीकरण",
            "pdf_fallback_summary_p1": "यह दस्तावेज़ {business_name} के लिए पहचाने गए {total_cases} उच्च-मूल्य विश्लेषणात्मक उपयोग केस का वर्णन करता है।",
            "pdf_fallback_summary_p2": "निम्नलिखित पृष्ठ व्यापार डोमेन द्वारा वर्गीकृत इन अवसरों का विस्तृत विश्लेषण प्रदान करते हैं।",
            "pptx_main_title": "Databricks Agent Bricks रणनीतिक AI उपयोग केस",
            "pptx_for": "के लिए",
            "pptx_disclaimer_title": "अस्वीकरण",
            "pptx_domain_suffix": "उपयोग केस",
            "example_results": "उदाहरण परिणाम",
            "error_no_results": "परिणाम उत्पन्न नहीं हो सके। नोटबुक: {notebook_name} और उपयोग केस {use_case_id} जांचें",
            "input_data_original": "इनपुट डेटा (मूल मान)",
            "ai_generated_output": "AI-जनित परिणाम",
            "column": "कॉलम",
            "value": "मूल्य",
            "executive_summary_not_available": "कार्यकारी सारांश उपलब्ध नहीं है।",
            "domain_summary_not_available": "डोमेन सारांश उपलब्ध नहीं है।",
            "summary_not_available": "सारांश उपलब्ध नहीं है।",
            "value_general_improvement": "सामान्य सुधार",
            "value_reduce_cost": "लागत कम करें",
            "value_increase_revenue": "राजस्व बढ़ाएं",
            "value_boost_productivity": "उत्पादकता बढ़ाएं",
            "value_mitigate_risk": "जोखिम कम करें",
            "value_protect_revenue": "राजस्व की रक्षा करें",
            "value_align_to_regulations": "नियमों का पालन करें",
            "value_improve_customer_experience": "ग्राहक अनुभव सुधारें",
            "value_enable_data_driven_decisions": "डेटा-आधारित निर्णय सक्षम करें",
            "value_optimize_operations": "संचालन अनुकूलित करें",
            "value_empower_talent": "प्रतिभा को सशक्त बनाएं",
            "value_enhance_experience": "अनुभव बढ़ाएं",
            "value_drive_innovation": "नवाचार को बढ़ावा दें",
            "value_achieve_esg": "ESG प्राप्त करें",
            "value_execute_strategy": "रणनीति निष्पादित करें",
            "value_forecasting": "पूर्वानुमान",
            "value_classification": "वर्गीकरण",
            "value_anomaly_detection": "विसंगति पहचान",
            "value_cohort_analysis": "समूह विश्लेषण",
            "value_segmentation": "विभाजन",
            "value_sentiment_analysis": "भावना विश्लेषण",
            "value_trend_analysis": "रुझान विश्लेषण",
            "value_prescriptive_analytics": "निर्देशात्मक विश्लेषण",
            "value_root_cause_analysis": "मूल कारण विश्लेषण",
            "value_optimization": "अनुकूलन",
            "value_recommendation": "अनुशंसा",
            "value_time_series_analysis": "समय श्रृंखला विश्लेषण",
            "value_predictive_analytics": "भविष्यवाणी विश्लेषण",
            "value_descriptive_analytics": "वर्णनात्मक विश्लेषण",
        },
        "Russian": {
            "main_title": "Генератор бизнес-кейсов Databricks Agent Bricks",
            "intro": "Эта записная книжка содержит бизнес-кейсы, созданные ИИ на основе ваших схем. Ниже приведено резюме созданных сценариев по бизнес-доменам.",
            "domain": "Бизнес-домен",
            "total": "Всего бизнес-кейсов",
            "summaries": "Резюме бизнес-кейсов",
            "sum_id": "ID",
            "sum_name": "Название",
            "sum_value": "Бизнес-ценность",
            "sum_outcome": "Ожидаемый результат",
            "warning_header": "ПРЕДУПРЕЖДЕНИЕ",
            "warning_body": "Не запускайте эту записную книжку. Она предназначена только для демонстрации и каталогизации. SQL-запросы являются примерами и могут потребовать проверки перед выполнением.",
            "disclaimer": "Этот контент создан ИИ и предназначен только для демонстрационных целей. Все SQL-запросы являются примерами и должны быть проверены квалифицированным инженером перед использованием в производственной среде.",
            "detailed_scenarios": "Детали бизнес-кейсов",
            "aspect": "Аспект",
            "description": "Описание",
            "aspect_domain": "Бизнес-домен",
            "type": "Тип",
            "analytics_technique": "Аналитическая техника",
            "primary_table": "Основная таблица",
            "priority": "Приоритет",
            "value_type_problem": "Проблема",
            "value_type_risk": "Риск",
            "value_type_opportunity": "Возможность",
            "value_type_improvement": "Улучшение",
            "value_priority_ultra_high": "Крайне высокий",
            "value_priority_very_high": "Очень высокий",
            "value_priority_high": "Высокий",
            "value_priority_medium": "Средний",
            "value_priority_low": "Низкий",
            "value_priority_very_low": "Очень низкий",
            "value_priority_ultra_low": "Крайне низкий",
            "statement": "Описание",
            "solution": "Решение",
            "aspect_beneficiary": "Выгодоприобретатель",
            "beneficiary": "Выгодоприобретатель",
            "aspect_sponsor": "Спонсор",
            "sponsor": "Спонсор",
            "business_priority_alignment": "Соответствие бизнес-приоритетам",
            "strategic_goals_alignment": "Соответствие стратегическим целям",
            "subdomain": "Поддомен",
            "aspect_value": "Бизнес-ценность",
            "business_value": "Бизнес-ценность",
            "aspect_tables": "Связанные таблицы",
            "aspect_ai_function": "Функция ИИ",
            "aspect_analytics_technique": "Аналитическая техника",
            "aspect_primary_table": "Основная таблица",
            "aspect_priority": "Приоритет",
            "strategic_alignment": "Стратегическое соответствие",
            "return_on_investment": "Рентабельность инвестиций",
            "reusability": "Возможность повторного использования",
            "time_to_value": "Время до получения ценности",
            "data_availability": "Доступность данных",
            "data_accessibility": "Доступ к данным",
            "architecture_fitness": "Соответствие архитектуре",
            "team_skills": "Навыки команды",
            "domain_knowledge": "Знание предметной области",
            "people_allocation": "Распределение персонала",
            "budget_allocation": "Распределение бюджета",
            "time_to_production": "Время до внедрения",
            "value_score": "Оценка ценности",
            "feasibility_score": "Оценка реализуемости",
            "priority_score": "Оценка приоритета",
            "pdf_title": "Стратегические ИИ бизнес-кейсы Databricks Agent Bricks",
            "pdf_for": "Для",
            "pdf_exec_summary": "Резюме для руководства",
            "pdf_toc_title": "Домены бизнес-кейсов",
            "pdf_detailed_view": "Подробный каталог бизнес-кейсов",
            "pdf_disclaimer_title": "Отказ от ответственности",
            "pdf_fallback_summary_p1": "Этот документ описывает {total_cases} высокоценных аналитических бизнес-кейсов, выявленных для {business_name}.",
            "pdf_fallback_summary_p2": "На следующих страницах представлен подробный анализ этих возможностей, классифицированных по бизнес-доменам.",
            "pptx_main_title": "Стратегические ИИ бизнес-кейсы Databricks Agent Bricks",
            "pptx_for": "Для",
            "pptx_disclaimer_title": "Отказ от ответственности",
            "pptx_domain_suffix": "Бизнес-кейсы",
            "example_results": "Примеры результатов",
            "error_no_results": "Не удалось создать результаты. Проверьте записную книжку: {notebook_name} и бизнес-кейс {use_case_id}",
            "input_data_original": "Входные данные (исходные значения)",
            "ai_generated_output": "Результаты, созданные ИИ",
            "column": "Столбец",
            "value": "Значение",
            "executive_summary_not_available": "Резюме для руководства недоступно.",
            "domain_summary_not_available": "Резюме домена недоступно.",
            "summary_not_available": "Резюме недоступно.",
            "value_general_improvement": "Общее Улучшение",
            "value_reduce_cost": "Сократить Затраты",
            "value_increase_revenue": "Увеличить Доход",
            "value_boost_productivity": "Повысить Производительность",
            "value_mitigate_risk": "Снизить Риски",
            "value_protect_revenue": "Защитить Доход",
            "value_align_to_regulations": "Соответствовать Нормативам",
            "value_improve_customer_experience": "Улучшить Клиентский Опыт",
            "value_enable_data_driven_decisions": "Обеспечить Решения на Основе Данных",
            "value_optimize_operations": "Оптимизировать Операции",
            "value_empower_talent": "Развивать Таланты",
            "value_enhance_experience": "Улучшить Опыт",
            "value_drive_innovation": "Стимулировать Инновации",
            "value_achieve_esg": "Достичь ESG",
            "value_execute_strategy": "Реализовать Стратегию",
            "value_forecasting": "Прогнозирование",
            "value_classification": "Классификация",
            "value_anomaly_detection": "Обнаружение Аномалий",
            "value_cohort_analysis": "Когортный Анализ",
            "value_segmentation": "Сегментация",
            "value_sentiment_analysis": "Анализ Настроений",
            "value_trend_analysis": "Анализ Трендов",
            "value_prescriptive_analytics": "Предписывающая Аналитика",
            "value_root_cause_analysis": "Анализ Первопричин",
            "value_optimization": "Оптимизация",
            "value_recommendation": "Рекомендация",
            "value_time_series_analysis": "Анализ Временных Рядов",
            "value_predictive_analytics": "Прогнозная Аналитика",
            "value_descriptive_analytics": "Описательная Аналитика",
        },
    }

    def _apply_translation_fallbacks(self, translations: dict, target_language: str) -> dict:
        """
        Applies known fallback translations for commonly untranslated terms.
        This fixes cases where the LLM fails to translate certain terms.
        A fallback is applied whenever a key is missing, its value is empty,
        or its value still matches an English string (case-insensitive).
        Mutates and returns the same ``translations`` dict.
        """
        if target_language not in self.TRANSLATION_FALLBACKS:
            return translations
        
        fallbacks = self.TRANSLATION_FALLBACKS[target_language]
        # Lower-cased English strings, kept in a set for O(1) membership checks
        english_values_lower = {v.lower() for v in self.ENGLISH_TRANSLATIONS.values()}
        
        applied_count = 0
        for key, fallback_value in fallbacks.items():
            should_apply = False
            current_value = translations.get(key, None)
            
            # Apply fallback if: key is missing, value is empty, or value is still in English
            if key not in translations:
                should_apply = True
                self.logger.debug(f"Translation MISSING for '{key}', applying fallback: '{fallback_value}'")
            elif not current_value or (isinstance(current_value, str) and not current_value.strip()):
                should_apply = True
                self.logger.debug(f"Translation EMPTY for '{key}', applying fallback: '{fallback_value}'")
            elif isinstance(current_value, str) and current_value.lower() in english_values_lower:
                should_apply = True
                self.logger.debug(f"Translation still ENGLISH for '{key}': '{current_value}' → '{fallback_value}'")
            
            if should_apply:
                translations[key] = fallback_value
                applied_count += 1
        
        if applied_count > 0:
            self.logger.info(f"Applied {applied_count} fallback translations for {target_language}")
        
        return translations
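
    # Illustrative example for _apply_translation_fallbacks (a sketch, not a test).
    # Assumption: ENGLISH_TRANSLATIONS maps "priority" to "Priority"; the input
    # dict below is hypothetical.
    #
    #     translations = {"priority": "Priority", "solution": "", "domain": "비즈니스 도메인"}
    #     result = self._apply_translation_fallbacks(translations, "Korean")
    #
    #     result["priority"]  -> "우선순위"         (value was still English)
    #     result["solution"]  -> "솔루션"           (value was empty)
    #     result["domain"]    -> "비즈니스 도메인"  (already translated, left unchanged)
    #
    # Every other key in the Korean fallback table is missing from the input,
    # so it is added as well, and `result` is the same object as `translations`.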

    def _validate_translations(self, translations: dict, target_language: str) -> bool:
        """
        Validates that critical fields are actually translated and not left in English.
        Returns True if translations are valid, False otherwise.
        """
        # Critical keys that MUST be translated
        critical_keys = [
            "type", "subdomain", "analytics_technique", "primary_table", "priority", 
            "aspect_analytics_technique", "aspect_primary_table", "aspect_priority",
            "statement", "solution", "business_value", "beneficiary", "sponsor",
            # Spelled to match the key used in TRANSLATION_FALLBACKS ("goals", plural)
            "strategic_goals_alignment",
            "pdf_detailed_view", "pptx_domain_suffix", "domain",
            "value_type_opportunity", "value_type_problem", "value_type_risk", "value_type_improvement",
            "value_priority_very_high", "value_priority_high", "value_priority_medium", "value_priority_low", "value_priority_very_low",
            "example_results", "column", "value"
        ]
        
        # Expected English values that should NOT appear in translations
        english_values = [
            "Type", "Subdomain", "Analytics Technique", "Primary Table", "Priority",
            "Statement", "Solution", "Business Value", "Beneficiary", "Sponsor",
            "Business Priority Alignment", "AI Confidence", "AI Justification",
            "Detailed Use Case Catalog", "Use Cases", "Business Domain",
            "Opportunity", "Problem", "Risk", "Improvement",
            "Simple", "Medium", "Complex", "Very Complex",
            "High", "Very High", "Low", "Very Low",
            "Example Results", "Column", "Value"
        ]
        
        # Check if any critical key still has an English value (case-insensitive)
        english_values_lower = {v.lower() for v in english_values}
        for key in critical_keys:
            if key in translations:
                value = translations[key]
                if isinstance(value, str) and value.lower() in english_values_lower:
                    self.logger.warning(f"Translation validation FAILED for {target_language}: Key '{key}' still has English value '{value}'")
                    return False
        
        self.logger.info(f"Translation validation PASSED for {target_language}")
        return True

    def get_translations(self, target_language: str) -> dict:
        """
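        Gets translations for UI elements for a given language.
        Returns the built-in English strings for "English". Otherwise reuses a
        complete, validated cache entry, or asks the AI agent (max 2 attempts),
        applies the static fallbacks, and caches the validated result.
        Falls back to the English strings if both attempts fail.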
        """
        if target_language == "English":
            self.logger.info("Using default English UI translations.")
            return self.ENGLISH_TRANSLATIONS
        
        if target_language in self.translation_cache:
            cached = self.translation_cache[target_language]
            # Check if all keys from ENGLISH_TRANSLATIONS are present in cache
            missing_keys = set(self.ENGLISH_TRANSLATIONS.keys()) - set(cached.keys())
            if not missing_keys:
                # Validate that translations are actually in target language
                if self._validate_translations(cached, target_language):
                    self.logger.info(f"Using cached UI translations for {target_language}.")
                    return cached
                else:
                    self.logger.warning(f"Cached translations for {target_language} contain English values. Forcing re-translation...")
                    del self.translation_cache[target_language]
            else:
                self.logger.info(f"Cache for {target_language} is outdated (missing {len(missing_keys)} keys). Refreshing...")
                # Remove from cache to force re-translation
                del self.translation_cache[target_language]

        self.logger.debug(f"Calling LLM to get UI translations for {target_language}...")
        
        # Retry loop: up to 2 attempts
        for attempt in range(1, 3):
            try:
                if attempt > 1:
                    self.logger.info(f"Retrying UI translation for {target_language} (Attempt {attempt}/2)...")
                
                # Escape braces in JSON payload so they don't interfere with .format()
                json_str = json.dumps(self.ENGLISH_TRANSLATIONS, indent=2)
                # Replace { with {{ and } with }} to escape them for .format()
                json_str_escaped = json_str.replace('{', '{{').replace('}', '}}')
                
                prompt_vars = {
                    "json_payload": json_str_escaped,
                    "target_language": target_language
                }
                
                self.logger.info(f"⏳ Waiting for LLM response (translating UI to {target_language})...")
                response_raw = self.ai_agent.run_worker(
                    step_name=f"Translate_UI_{target_language}_Attempt{attempt}",
                    worker_prompt_path="KEYWORDS_TRANSLATE_PROMPT", # Use key
                    prompt_vars=prompt_vars,
                    response_schema=None
                )
                self.logger.info("✅ Received LLM response, parsing translations...")
                
                response_clean = clean_json_response(response_raw)
                translated_dict = json.loads(response_clean)
                
                final_translations = self.ENGLISH_TRANSLATIONS.copy()
                final_translations.update(translated_dict) 
                
                # Apply known fallback fixes for commonly untranslated terms
                final_translations = self._apply_translation_fallbacks(final_translations, target_language)
                
                # Validate translations before caching
                if not self._validate_translations(final_translations, target_language):
                    if attempt < 2:
                        self.logger.warning(f"Translation validation failed on attempt {attempt}. Will retry...")
                        continue  # Retry
                    else:
                        # Apply fallbacks one more time before giving up
                        final_translations = self._apply_translation_fallbacks(final_translations, target_language)
                        if self._validate_translations(final_translations, target_language):
                            self.logger.info("Fallback translations fixed the issue. Using fallbacks.")
                            self.translation_cache[target_language] = final_translations
                            return final_translations
                        self.logger.error("Translation validation failed on final attempt. Falling back to English.")
                        return self.ENGLISH_TRANSLATIONS
                
                self.logger.info(f"Successfully fetched and cached UI translations for {target_language} on attempt {attempt}.")
                self.translation_cache[target_language] = final_translations
                return final_translations

            except Exception as e:
                self.logger.error(f"Failed to get UI translations for {target_language} (Attempt {attempt}/2): {e}")
                if attempt == 2:
                    self.logger.error("All retry attempts exhausted for UI translations. Falling back to English.")
                    return self.ENGLISH_TRANSLATIONS

    def translate_use_case_list(self, english_use_cases: list, target_language: str, max_parallelism: int = 20, enable_parallelization: bool = True) -> list:
        """
        Translates the 9 key fields for a list of use case dictionaries.
        Uses a fixed batch size of 3 use cases per LLM call to avoid output truncation.
        
        Args:
            english_use_cases: List of use case dictionaries to translate
            target_language: Target language name
            max_parallelism: Maximum parallel workers (only used if enable_parallelization=True)
            enable_parallelization: If False, processes batches sequentially to avoid nested ThreadPoolExecutors
        """
        if target_language == "English":
            return english_use_cases
        
        if not english_use_cases:
            self.logger.warning("No use cases provided to translate.")
            return []

        self.logger.info(f"Starting translation for {target_language}... (parallelization: {'enabled' if enable_parallelization else 'disabled'})")
        
        # FIXED BATCH SIZE: 3 use cases per batch for ALL languages
        # This prevents LLM output truncation issues with long SQL queries (8-10K chars each)
        # Optimal balance between translation speed and reliability
        batch_size = 3
        
        self.logger.info(f"Translation batch sizing: {batch_size} use cases per batch (FIXED for all languages to prevent truncation)")
        
        batches = [english_use_cases[i:i + batch_size] for i in range(0, len(english_use_cases), batch_size)]
        translated_use_cases = []
        
        if enable_parallelization:
            # Parallel processing (used when NOT called from within another ThreadPoolExecutor)
            with ThreadPoolExecutor(max_workers=max_parallelism) as executor:
                futures = [executor.submit(self._translate_batch, batch, target_language, i) for i, batch in enumerate(batches)]
                
                # Overall timeout: (10 minutes per batch * number of batches / parallelism) plus a 5-minute buffer
                total_timeout = (len(batches) * 600) // max_parallelism + 300
                self.logger.info(f"Translation timeout set to {total_timeout}s for {len(batches)} batches")
                
                try:
                    for future in concurrent.futures.as_completed(futures, timeout=total_timeout):
                        try:
                            translated_batch = future.result(timeout=600)
                            if translated_batch:
                                translated_use_cases.extend(translated_batch)
                        except concurrent.futures.TimeoutError:
                            self.logger.error("Translation batch timed out after 10 minutes")
                        except Exception as e:
                            self.logger.error(f"A translation future failed unexpectedly: {e}")
                except concurrent.futures.TimeoutError:
                    self.logger.error(f"Overall translation timeout reached ({total_timeout}s). Proceeding with {len(translated_use_cases)} translated use cases.")
        else:
            # Sequential processing (used when called from within another ThreadPoolExecutor to avoid nesting)
            self.logger.info(f"Processing {len(batches)} translation batches sequentially to avoid nested ThreadPoolExecutors...")
            for i, batch in enumerate(batches):
                try:
                    translated_batch = self._translate_batch(batch, target_language, i)
                    if translated_batch:
                        translated_use_cases.extend(translated_batch)
                except Exception as e:
                    self.logger.error(f"Translation batch {i} failed: {e}")
        
        if len(translated_use_cases) != len(english_use_cases):
            self.logger.warning(f"Translation mismatch: expected {len(english_use_cases)} use cases, got {len(translated_use_cases)}. Some batches may have timed out or failed and were dropped from the result.")
        
        self.logger.info(f"Translation completed for {target_language}.")
        return translated_use_cases

    def _translate_batch(self, use_case_batch: list, target_language: str, batch_num: int, attempt: int = 1, split_attempt: int = 0) -> list:
        """
        Private method to translate a single batch of use cases.
        Returns the original English batch on any failure.
        Supports retry logic with automatic batch splitting on truncation (up to 3 split attempts).
        
        Args:
            use_case_batch: List of use case dictionaries to translate
            target_language: Target language for translation
            batch_num: Batch number for logging
            attempt: Current attempt number (1 or 2)
            split_attempt: Number of times batch has been split (0-3)
        """
        self.logger.debug(f"Translating batch #{batch_num} (Attempt {attempt}/2, Split {split_attempt}/3) ({len(use_case_batch)} use cases) to {target_language}...")
        
        try:
            # OPTIMIZATION: Extract SQL to reduce payload size and prevent truncation
            # SQL doesn't need to be translated, so we remove it from the batch and add it back after
            sql_mapping = {}
            batch_without_sql = []
            for uc in use_case_batch:
                uc_copy = uc.copy()
                uc_id = uc_copy.get('No', '')
                # Store SQL separately
                sql_mapping[uc_id] = uc_copy.pop('SQL', '')
                batch_without_sql.append(uc_copy)
            
            # Escape braces in JSON payload so they don't interfere with .format()
            json_str = json.dumps(batch_without_sql, indent=2)
            json_str_escaped = json_str.replace('{', '{{').replace('}', '}}')
            
            prompt_vars = {
                "json_payload": json_str_escaped,
                "target_language": target_language
            }
            
            self.logger.info(f"⏳ [Batch {batch_num}] Waiting for LLM response (translating {len(use_case_batch)} use cases to {target_language})...")
            response_raw = self.ai_agent.run_worker(
                step_name=f"Translate_UseCases_{target_language}_Batch_{batch_num}_Attempt{attempt}_Split{split_attempt}",
                worker_prompt_path="USE_CASE_TRANSLATE_PROMPT",
                prompt_vars=prompt_vars,
                response_schema=None
            )
            self.logger.info(f"✅ [Batch {batch_num}] Received LLM response, parsing translated use cases...")
            
            # Parse CSV response
            translated_batch = self._parse_translation_csv(response_raw, batch_without_sql, batch_num, target_language)
            
            # Add SQL back to translated use cases
            if translated_batch:
                for uc in translated_batch:
                    uc_id = uc.get('No', '')
                    if uc_id in sql_mapping:
                        uc['SQL'] = sql_mapping[uc_id]
            
            if translated_batch and len(translated_batch) == len(use_case_batch):
                self.logger.debug(f"Successfully translated batch #{batch_num} on attempt {attempt}, split {split_attempt}.")
                return translated_batch
            else:
                raise ValueError(f"Translation returned {len(translated_batch) if translated_batch else 0} rows, expected {len(use_case_batch)}")

        except InputTooLongError as e:
            # Handle context limit exceeded - split batch recursively
            if len(use_case_batch) <= 1:
                # Cannot split further - single use case is too large
                self.logger.error(f"Batch #{batch_num} has single use case that's too large for translation. Reverting to English.")
                return use_case_batch
            
            # Split into 2 sub-batches
            self.logger.warning(f"Batch #{batch_num} exceeds context limit ({str(e)[:100]}). Splitting into smaller sub-batches...")
            mid = len(use_case_batch) // 2
            sub_batch_1 = use_case_batch[:mid]
            sub_batch_2 = use_case_batch[mid:]
            
            # Recursively translate sub-batches (with attempt=1 for fresh retries, increment split_attempt)
            translated_1 = self._translate_batch(sub_batch_1, target_language, f"{batch_num}a", attempt=1, split_attempt=split_attempt + 1)
            translated_2 = self._translate_batch(sub_batch_2, target_language, f"{batch_num}b", attempt=1, split_attempt=split_attempt + 1)
            
            # Combine results
            combined = translated_1 + translated_2
            if len(combined) == len(use_case_batch):
                self.logger.info(f"Successfully translated batch #{batch_num} after splitting into sub-batches")
                return combined
            else:
                self.logger.warning(f"Sub-batch splitting failed for batch #{batch_num}. Reverting to English.")
                return use_case_batch
        
        except Exception as e:
            error_msg = str(e)
            is_truncation = "TRUNCATED" in error_msg
            
            self.logger.error(f"Failed to translate use case batch #{batch_num} (Attempt {attempt}/2, Split {split_attempt}/3) for {target_language}: {e}")
            
            # If truncation detected and batch has more than 1 item, try splitting it (up to 3 times)
            if is_truncation and len(use_case_batch) > 1 and split_attempt < 3:
                self.logger.warning(f"Batch #{batch_num} appears truncated. Reducing batch size (split attempt {split_attempt + 1}/3)...")
                
                # Calculate smaller batch size (split into 2 or 3 pieces depending on batch size)
                if len(use_case_batch) >= 3:
                    # Split into 3 smaller pieces for better chance of success
                    third = len(use_case_batch) // 3
                    sub_batch_1 = use_case_batch[:third]
                    sub_batch_2 = use_case_batch[third:2*third]
                    sub_batch_3 = use_case_batch[2*third:]
                    
                    # Recursively translate sub-batches with incremented split_attempt
                    translated_1 = self._translate_batch(sub_batch_1, target_language, f"{batch_num}a", attempt=1, split_attempt=split_attempt + 1)
                    translated_2 = self._translate_batch(sub_batch_2, target_language, f"{batch_num}b", attempt=1, split_attempt=split_attempt + 1)
                    translated_3 = self._translate_batch(sub_batch_3, target_language, f"{batch_num}c", attempt=1, split_attempt=split_attempt + 1)
                    
                    # Combine results
                    combined = translated_1 + translated_2 + translated_3
                else:
                    # Split into 2 pieces
                    mid = len(use_case_batch) // 2
                    sub_batch_1 = use_case_batch[:mid]
                    sub_batch_2 = use_case_batch[mid:]
                    
                    translated_1 = self._translate_batch(sub_batch_1, target_language, f"{batch_num}a", attempt=1, split_attempt=split_attempt + 1)
                    translated_2 = self._translate_batch(sub_batch_2, target_language, f"{batch_num}b", attempt=1, split_attempt=split_attempt + 1)
                    
                    combined = translated_1 + translated_2
                
                if len(combined) == len(use_case_batch):
                    self.logger.info(f"Successfully translated batch #{batch_num} after reducing batch size (split {split_attempt + 1})")
                    return combined
                else:
                    self.logger.warning(f"Batch size reduction failed for batch #{batch_num}. Trying standard retry...")
                    # Fall through to standard retry logic
            
            # Standard retry logic: Try one more time if this is the first attempt and we haven't exceeded split attempts
            if attempt < 2 and not (is_truncation and split_attempt >= 3):
                self.logger.info(f"Retrying batch #{batch_num} for {target_language} (Attempt 2/2)...")
                return self._translate_batch(use_case_batch, target_language, batch_num, attempt=2, split_attempt=split_attempt)
            else:
                # All attempts exhausted
                if is_truncation and split_attempt >= 3:
                    self.logger.error(f"Batch #{batch_num} still truncated after {split_attempt} split attempts. Proceeding with English.")
                else:
                    self.logger.warning(f"All retry attempts exhausted for batch #{batch_num}. Reverting to English for this batch.")
                return use_case_batch
    
    def _parse_translation_csv(self, csv_response: str, original_batch: list, batch_num: int, target_language: str) -> list:
        """
        Parse CSV translation response and return list of translated use case dictionaries.
        Only returns rows that match the original batch by 'No' field to prevent incorrect row counts.
        """
        import re
        
        try:
            # VALIDATION 1: Check if response is empty or too short
            if not csv_response or len(csv_response.strip()) < 100:
                raise ValueError(f"Response is empty or too short (len={len(csv_response) if csv_response else 0})")
            
            # VALIDATION 2: Check if response contains obvious non-CSV content
            # Strip leading whitespace first so startswith() is not defeated by blank lines or spaces
            first_100_chars = csv_response.lstrip()[:100].lower()
            forbidden_starts = ['here is', 'i have', 'i\'ve', 'the translation', 'below is', 'sure,', 'of course']
            if any(first_100_chars.startswith(phrase) for phrase in forbidden_starts):
                self.logger.error(f"Batch #{batch_num}: Response starts with conversational text instead of CSV header")
                raise ValueError("Response contains conversational text instead of pure CSV")
            
            # Clean response - remove markdown fences if present
            csv_clean = csv_response.strip()
            if csv_clean.startswith('```'):
                csv_clean = re.sub(r'^```[a-z]*\n', '', csv_clean)
                csv_clean = re.sub(r'\n```$', '', csv_clean)
            
            # VALIDATION 3: Check for SQL code blocks that shouldn't be there
            # Note: This validation is informational only and doesn't block processing
            sql_pattern_count = csv_clean.count('SELECT ') + csv_clean.count('WITH ') + csv_clean.count('FROM ')
            # The SQL field is removed from the payload before translation, so keyword hits should stay low (roughly the number of rows)
            expected_sql_mentions = len(original_batch)
            if sql_pattern_count > (expected_sql_mentions * 3):  # Allow some margin
                # Debug-level logging instead of warning (reduces noise)
                self.logger.debug(f"Batch #{batch_num}: Response contains {sql_pattern_count} SQL keywords (expected ~{expected_sql_mentions}). This is normal for complex queries.")
            
            # Find header line (14 columns - SQL is NOT included in translation to prevent truncation)
            # Support both quoted and unquoted headers from LLM
            # Also support legacy formats for backwards compatibility
            header_pattern_quoted = r'"No","Name","Business Domain","Subdomain","type","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Business Priority Alignment","Tables Involved","Priority"'
            header_pattern_unquoted = r'No,Name,Business Domain,Subdomain,type,Analytics Technique,Statement,Solution,Business Value,Beneficiary,Sponsor,Business Priority Alignment,Tables Involved,Priority'
            # Legacy patterns (with Complexity instead of Analytics Technique - for backwards compatibility)
            legacy_header_quoted = r'"No","Name","Business Domain","Subdomain","type","Statement","Solution","Business Value","Beneficiary","Sponsor","Business Priority Alignment","Tables Involved","Complexity","Priority"'
            legacy_header_unquoted = r'No,Name,Business Domain,Subdomain,type,Statement,Solution,Business Value,Beneficiary,Sponsor,Business Priority Alignment,Tables Involved,Complexity,Priority'
            header_match = re.search(header_pattern_quoted, csv_clean)
            if not header_match:
                header_match = re.search(header_pattern_unquoted, csv_clean)
            if not header_match:
                # Try legacy format (14 columns, with Complexity instead of Analytics Technique)
                header_match = re.search(legacy_header_quoted, csv_clean)
                if header_match:
                    self.logger.debug(f"Batch #{batch_num}: Using legacy 14-column format (Complexity instead of Analytics Technique)")
            if not header_match:
                header_match = re.search(legacy_header_unquoted, csv_clean)
                if header_match:
                    self.logger.debug(f"Batch #{batch_num}: Using legacy 14-column format (unquoted)")
            if not header_match:
                # Try to show what we found instead
                first_line = csv_clean.split('\n')[0][:200] if csv_clean else "Empty"
                self.logger.error(f"Batch #{batch_num}: CSV header not found. First line: {first_line}")
                raise ValueError(f"Could not find CSV header. First line was: {first_line}")
            
            # Extract CSV starting from header
            csv_data = csv_clean[header_match.start():]
            
            # Build a set of expected IDs from the original batch
            expected_ids = {uc.get('No', '').strip() for uc in original_batch}
            self.logger.debug(f"Expected IDs for batch #{batch_num}: {expected_ids}")
            
            # Parse CSV using centralized utility
            parsed_rows = CSVParser.parse_csv_string(
                csv_data,
                logger=self.logger,
                context=f"Batch #{batch_num}",
                quoting=csv.QUOTE_ALL
            )
            all_parsed_rows = []
            translated_rows = []
            
            for row_dict in parsed_rows:
                # Clean up field values - handle both string and list values
                cleaned_row = {}
                for k, v in row_dict.items():
                    # Ensure k is a string (handle edge cases where CSV parser returns unexpected types)
                    key = str(k) if not isinstance(k, str) else k
                    
                    # Handle different value types robustly
                    if v is None:
                        cleaned_row[key] = ""
                    elif isinstance(v, list):
                        # If value is a list (shouldn't happen but handle it), join it
                        cleaned_row[key] = ', '.join(str(item) for item in v)
                    elif isinstance(v, str):
                        cleaned_row[key] = v.strip()
                    else:
                        cleaned_row[key] = str(v)
                
                row_id = cleaned_row.get('No', '').strip()
                
                if not row_id:
                    self.logger.debug(f"Batch #{batch_num}: Skipping row with empty ID")
                    continue
                
                all_parsed_rows.append(cleaned_row)
                
                if expected_ids and row_id not in expected_ids:
                    row_id_preview = row_id[:50] if len(row_id) > 50 else row_id
                    self.logger.warning(f"Batch #{batch_num}: Skipping unexpected row with ID '{row_id_preview}' (not in batch)")
                    continue
                
                translated_rows.append(cleaned_row)
            
            # Log if we got more rows than expected
            if len(all_parsed_rows) > len(original_batch):
                self.logger.warning(f"Batch #{batch_num}: CSV response contained {len(all_parsed_rows)} rows, but only {len(translated_rows)} matched the expected batch IDs. "
                                  f"LLM may have returned extra rows - filtered to correct batch.")
            
            # Verify we got all expected rows
            translated_ids = {row.get('No', '').strip() for row in translated_rows}
            missing_ids = expected_ids - translated_ids
            if missing_ids:
                self.logger.error(f"Batch #{batch_num}: Missing translations for IDs: {missing_ids}")
                
                # Check if response appears truncated (ends mid-sentence or mid-field)
                last_100_chars = csv_response[-100:].strip() if csv_response else ""
                is_truncated = (
                    not last_100_chars.endswith('"') or  # Doesn't end with closing quote
                    '","' in last_100_chars[-20:] or  # Ends mid-field
                    len(last_100_chars) < 50  # Response is suspiciously short
                )
                
                if is_truncated:
                    self.logger.error(f"Batch #{batch_num}: Response appears TRUNCATED. Last 100 chars: ...{last_100_chars}")
                    raise ValueError(f"Translation response was TRUNCATED - missing {len(missing_ids)} rows. Reduce batch size or simplify content.")
                else:
                    raise ValueError(f"Translation missing {len(missing_ids)} expected rows: {missing_ids}")
            
            self.logger.debug(f"Parsed {len(translated_rows)} translated rows from CSV for batch #{batch_num} (filtered from {len(all_parsed_rows)} total rows)")
            return translated_rows
            
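        except Exception as e:
            self.logger.error(f"Failed to parse translation CSV for batch #{batch_num}: {e}")
            # Show snippet for debugging
            snippet = csv_response[:500] if csv_response else "Empty response"
            self.logger.error(f"CSV response snippet: {snippet}")
            # Propagate truncation errors so _translate_batch can see "TRUNCATED" and split the batch;
            # every other parse failure returns an empty list.
            if "TRUNCATED" in str(e):
                raise
            return []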
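
# ------------------------------------------------------------------------------
# ILLUSTRATIVE SKETCH (not called by the pipeline)
# The parallel branch of translate_use_case_list() extends its result list in
# completion order, so translated batches may come back in a different order
# than the input batches. The sketch below, assuming the caller wants input
# order preserved, collects as_completed() results by batch index instead.
# The names _example_collect_in_order, worker and max_workers are hypothetical
# and exist only for this illustration.
# ------------------------------------------------------------------------------
import concurrent.futures as _example_cf


def _example_collect_in_order(batches, worker, max_workers=4):
    """Run worker(batch) in parallel and return the flattened results in input order."""
    results = [None] * len(batches)
    with _example_cf.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which batch index each future belongs to
        future_to_index = {executor.submit(worker, batch): i for i, batch in enumerate(batches)}
        for future in _example_cf.as_completed(future_to_index):
            # Slot the result by index, regardless of which future finished first
            results[future_to_index[future]] = future.result()
    # Flatten; a batch whose worker returned None or [] contributes nothing
    return [item for batch_result in results for item in (batch_result or [])]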

# COMMAND ----------

# DBTITLE 1,Inspire
# ==============================================================================
# 2. MAIN APPLICATION CLASS (MODIFIED FOR STRICT ORDERING)
# ==============================================================================

# NOTE: Global dependency checks have been removed per user request.
# Dependencies will be checked and installed on-demand.

# ==============================================================================
# MEMORY-EFFICIENT STORAGE MANAGER
# ==============================================================================
class IntermediateStorageManager:
    """
    Manages file-based intermediate storage for large datasets to prevent memory explosion.
    Stores batch results on disk and provides memory-efficient iteration.
    """
    
    def __init__(self, base_path="/tmp", logger=None):
        """
        Initialize the storage manager.
        
        Args:
            base_path: Base path for temporary storage (default: /tmp)
            logger: Logger instance
        """
        self.logger = logger or logging.getLogger(self.__class__.__name__)
        self.base_path = base_path
        self.temp_dir = None
        self.batch_files = []
        self.initialized = False
        
    def initialize(self):
        """Create temporary directory for intermediate storage."""
        if not self.initialized:
            self.temp_dir = tempfile.mkdtemp(prefix="inspire_", dir=self.base_path)
            self.logger.info(f"📁 Initialized intermediate storage at: {self.temp_dir}")
            self.initialized = True
            
    def save_batch(self, batch_num, data):
        """
        Save batch data to disk.
        
        Args:
            batch_num: Batch identifier (can be int or str)
            data: List of use case dictionaries
        """
        if not self.initialized:
            self.initialize()
            
        # Handle both int and string batch_num
        if isinstance(batch_num, int):
            batch_file = os.path.join(self.temp_dir, f"batch_{batch_num:04d}.json")
        else:
            batch_file = os.path.join(self.temp_dir, f"batch_{batch_num}.json")
        with open(batch_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        
        self.batch_files.append(batch_file)
        file_size = os.path.getsize(batch_file) / (1024 * 1024)  # Size in MB
        self.logger.debug(f"💾 Saved batch {batch_num} to disk ({len(data)} use cases, {file_size:.2f} MB)")
        
    def load_batch(self, batch_file):
        """Load a single batch from disk."""
        with open(batch_file, 'r', encoding='utf-8') as f:
            return json.load(f)
            
    def iter_batches(self):
        """
        Iterator over all batches.
        Memory-efficient: loads one batch at a time.
        """
        for batch_file in self.batch_files:
            yield self.load_batch(batch_file)
            
    def iter_all_use_cases(self):
        """
        Iterator over all use cases across all batches.
        Memory-efficient: loads one batch at a time and yields individual use cases.
        """
        for batch in self.iter_batches():
            for use_case in batch:
                yield use_case
                
    def load_all_use_cases(self):
        """
        Load all use cases into memory.
        Use this only when necessary (e.g., for deduplication).
        """
        all_use_cases = []
        for batch in self.iter_batches():
            all_use_cases.extend(batch)
        return all_use_cases
        
    def get_total_count(self):
        """Get total count of use cases without loading all into memory."""
        count = 0
        for batch in self.iter_batches():
            count += len(batch)
        return count
        
    def cleanup(self):
        """Remove temporary directory and all files."""
        if self.temp_dir and os.path.exists(self.temp_dir):
            try:
                shutil.rmtree(self.temp_dir)
                self.logger.info(f"🧹 Cleaned up intermediate storage: {self.temp_dir}")
            except Exception as e:
                self.logger.warning(f"Failed to cleanup temp directory {self.temp_dir}: {e}")
            finally:
                self.temp_dir = None
                self.batch_files = []
                self.initialized = False
                
    def get_stats(self):
        """Get storage statistics."""
        if not self.temp_dir or not os.path.exists(self.temp_dir):
            return {"initialized": False}
            
        total_size = 0
        for batch_file in self.batch_files:
            if os.path.exists(batch_file):
                total_size += os.path.getsize(batch_file)
                
        return {
            "initialized": True,
            "temp_dir": self.temp_dir,
            "num_batches": len(self.batch_files),
            "total_size_mb": total_size / (1024 * 1024),
            "use_case_count": self.get_total_count()
        }
    
    def save_column_tracking(self, fqtn: str, column_names: list):
        """
        Save column tracking for a table to disk.
        
        Args:
            fqtn: Fully qualified table name (catalog.schema.table)
            column_names: List of column names that were kept
        """
        if not self.initialized:
            self.initialize()
        
        # Create column_tracking subdirectory if it doesn't exist
        tracking_dir = os.path.join(self.temp_dir, "column_tracking")
        os.makedirs(tracking_dir, exist_ok=True)
        
        # Use a safe filename (replace dots and special chars)
        safe_filename = fqtn.replace('.', '_').replace('`', '') + ".json"
        tracking_file = os.path.join(tracking_dir, safe_filename)
        
        with open(tracking_file, 'w', encoding='utf-8') as f:
            json.dump({
                "fqtn": fqtn,
                "columns": column_names,
                "timestamp": datetime.datetime.now().isoformat()
            }, f, ensure_ascii=False, indent=2)
        
        self.logger.debug(f"💾 Saved column tracking for {fqtn}: {len(column_names)} columns")
    
    def load_column_tracking(self, fqtn: str) -> list:
        """
        Load column tracking for a table from disk.
        
        Args:
            fqtn: Fully qualified table name (catalog.schema.table)
            
        Returns:
            List of column names that were kept, or None if no tracking exists
        """
        if not self.temp_dir or not os.path.exists(self.temp_dir):
            return None
        
        tracking_dir = os.path.join(self.temp_dir, "column_tracking")
        if not os.path.exists(tracking_dir):
            return None
        
        # Use a safe filename (replace dots and special chars)
        safe_filename = fqtn.replace('.', '_').replace('`', '') + ".json"
        tracking_file = os.path.join(tracking_dir, safe_filename)
        
        if not os.path.exists(tracking_file):
            return None
        
        try:
            with open(tracking_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
                return data.get("columns", None)
        except Exception as e:
            self.logger.warning(f"Failed to load column tracking for {fqtn}: {e}")
            return None
    
    def has_column_tracking(self, fqtn: str) -> bool:
        """
        Check if column tracking exists for a table.
        
        Args:
            fqtn: Fully qualified table name (catalog.schema.table)
            
        Returns:
            True if tracking exists, False otherwise
        """
        if not self.temp_dir or not os.path.exists(self.temp_dir):
            return False
        
        tracking_dir = os.path.join(self.temp_dir, "column_tracking")
        if not os.path.exists(tracking_dir):
            return False
        
        safe_filename = fqtn.replace('.', '_').replace('`', '') + ".json"
        tracking_file = os.path.join(tracking_dir, safe_filename)
        return os.path.exists(tracking_file)
    
    def save_pass1_ids(self, use_case_ids: list):
        """Save PASS 1 use case IDs to disk for memory-efficient comparison."""
        if not self.initialized:
            self.initialize()
        ids_file = os.path.join(self.temp_dir, "pass1_ids.json")
        with open(ids_file, 'w', encoding='utf-8') as f:
            json.dump(use_case_ids, f)
        self.logger.debug(f"💾 Saved {len(use_case_ids)} PASS 1 IDs to disk")
    
    def load_pass1_ids(self) -> set:
        """Load PASS 1 use case IDs from disk as a set for O(1) lookup."""
        if not self.temp_dir or not os.path.exists(self.temp_dir):
            return set()
        ids_file = os.path.join(self.temp_dir, "pass1_ids.json")
        if not os.path.exists(ids_file):
            return set()
        try:
            with open(ids_file, 'r', encoding='utf-8') as f:
                return set(json.load(f))
        except Exception as e:
            self.logger.warning(f"Failed to load PASS 1 IDs: {e}")
            return set()
    
    def save_feedback_file(self, feedback_lines: list):
        """Save feedback string to disk to avoid keeping large string in memory."""
        if not self.initialized:
            self.initialize()
        feedback_file = os.path.join(self.temp_dir, "pass1_feedback.txt")
        with open(feedback_file, 'w', encoding='utf-8') as f:
            f.write("\n".join(feedback_lines))
        self.logger.debug(f"💾 Saved feedback to disk ({len(feedback_lines)} lines)")
    
    def load_feedback_file(self) -> str:
        """Load feedback string from disk."""
        if not self.temp_dir or not os.path.exists(self.temp_dir):
            return ""
        feedback_file = os.path.join(self.temp_dir, "pass1_feedback.txt")
        if not os.path.exists(feedback_file):
            return ""
        try:
            with open(feedback_file, 'r', encoding='utf-8') as f:
                return f.read()
        except Exception as e:
            self.logger.warning(f"Failed to load feedback file: {e}")
            return ""
    
    def iter_pass1_use_cases_for_feedback(self, limit: int = 200):
        """
        Memory-efficient iterator for building feedback.
        Yields (idx, name, tables) tuples from PASS 1 batches without loading all into memory.
        """
        idx = 0
        for batch in self.iter_batches():
            for uc in batch:
                if idx >= limit:
                    return
                idx += 1
                name = str(uc.get('Name', '')).replace('|', '-')[:80]
                tables = str(uc.get('Tables Involved', '')).replace('|', '-')[:60]
                yield (idx, name, tables)
    
    def save_id_maps(self, column_id_map: dict, id_column_map: dict, table_id_map: dict, id_table_map: dict):
        """Save column/table ID maps to disk to reduce memory footprint."""
        if not self.initialized:
            self.initialize()
        maps_file = os.path.join(self.temp_dir, "id_maps.json")
        data = {
            "column_id_map": column_id_map,
            "id_column_map": id_column_map,
            "table_id_map": table_id_map,
            "id_table_map": id_table_map
        }
        with open(maps_file, 'w', encoding='utf-8') as f:
            json.dump(data, f)
        self.logger.debug(f"💾 Saved ID maps to disk (columns: {len(column_id_map)}, tables: {len(table_id_map)})")
    
    def load_id_maps(self) -> tuple:
        """Load ID maps from disk. Returns (column_id_map, id_column_map, table_id_map, id_table_map)."""
        if not self.temp_dir or not os.path.exists(self.temp_dir):
            return {}, {}, {}, {}
        maps_file = os.path.join(self.temp_dir, "id_maps.json")
        if not os.path.exists(maps_file):
            return {}, {}, {}, {}
        try:
            with open(maps_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            return (data.get("column_id_map", {}), data.get("id_column_map", {}),
                    data.get("table_id_map", {}), data.get("id_table_map", {}))
        except Exception as e:
            self.logger.warning(f"Failed to load ID maps: {e}")
            return {}, {}, {}, {}
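# --- Illustrative sketch (not part of the Inspire pipeline) ---
# save_id_maps() and load_id_maps() above persist the registries as JSON. JSON
# object keys are always strings, so an ID -> name map keyed by integers comes
# back keyed by strings after a round trip. This sketch ASSUMES integer IDs
# (the actual ID format is not shown in this section); the function name, the
# sample table names and the _restore_int_keys helper are hypothetical.
def _demo_id_map_json_round_trip():
    import json  # local import keeps the sketch self-contained

    id_table_map = {1: "main.finance.orders", 2: "main.finance.customers"}

    # Same serialization path as save_id_maps/load_id_maps, minus the file I/O
    restored = json.loads(json.dumps({"id_table_map": id_table_map}))["id_table_map"]

    def _restore_int_keys(mapping: dict) -> dict:
        """Convert string keys such as "1" back to integers."""
        return {int(key): value for key, value in mapping.items()}

    # `restored` is keyed by strings; the second value is keyed by integers again
    return restored, _restore_int_keys(restored)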

class DatabricksInspire:
    
    @staticmethod
    def _extract_primary_table(tables_involved: str) -> str:
        """
        Extracts the primary (first) table from the Tables Involved field.
        
        Args:
            tables_involved: Comma-separated string of table names
            
        Returns:
            The first table name, or 'N/A' if empty
        """
        if not tables_involved or not isinstance(tables_involved, str):
            return "N/A"
        
        # Split by comma and get the first table
        tables = [t.strip() for t in tables_involved.split(',') if t.strip()]
        if tables:
            return tables[0]
        return "N/A"

    def __init__(self, **kwargs):
        self.spark = spark
        self.dbutils = dbutils
        # Initialize workspace client with timeout configuration to prevent indefinite hangs
        from databricks.sdk.config import Config
        config = Config()
        config.retry_timeout_seconds = 300  # Stop retrying a failed API call after 5 minutes in total
        self.workspace = WorkspaceClient(config=config)  # Changed from w_client to workspace
        self.w_client = self.workspace  # Keep w_client as alias for backward compatibility

        # --- Store Widget Values ---
        self.business_name = kwargs.get("business", "gen")
        
        # --- Business Priorities (user-provided, multi-select) ---
        business_priorities_raw = kwargs.get("business_priorities", "")
        self.user_business_priorities = [goal.strip() for goal in business_priorities_raw.split(',') if goal.strip()]
        
        # --- Strategic Goals (user-provided, comma-separated) ---
        strategic_goals_raw = kwargs.get("strategic_goals", "")
        self.user_strategic_goals = [goal.strip() for goal in strategic_goals_raw.split(',') if goal.strip()]
        
        # --- Business Domains (user-provided, comma-separated) ---
        business_domains_raw = kwargs.get("business_domains", "")
        self.user_business_domains = [domain.strip() for domain in business_domains_raw.split(',') if domain.strip()]
        
        self.additional_context = ""
        
        self.catalogs_str = kwargs.get("catalogs", "")
        self.schemas_str = kwargs.get("schemas", "")
        self.tables_str = kwargs.get("tables", "")
        self.generate_choices = [opt.strip() for opt in kwargs.get("generate", "use cases").split(',') if opt.strip()]
        
        self.generation_path = kwargs.get("generation_path", "./")
        
        self.output_languages = [lang.strip() for lang in kwargs.get("output_language", "English").split(',') if lang.strip()]
        if not self.output_languages:
            self.output_languages = ["English"]
            
        raw_max_parallelism = kwargs.get("max_parallelism", None)
        if raw_max_parallelism is None or str(raw_max_parallelism).strip() == "":
            self.max_parallelism = 0
            self.auto_parallelism = True
        else:
            self.max_parallelism = min(int(raw_max_parallelism), 100)
            self.auto_parallelism = False
        self.scan_parallelism = int(kwargs.get("scan_parallelism", 5) or 5)
        if self.scan_parallelism < 1:
            self.scan_parallelism = 5
        if self.scan_parallelism > 20:
            self.scan_parallelism = 20
        self.cluster_memory_gb = int(kwargs.get("cluster_memory_gb", 32) or 32)
        self.cluster_worker_count = int(kwargs.get("cluster_worker_count", 2) or 2)
        
        # === Use unstructured data flag (from Generation Options) ===
        self.use_unstructured_data = str(kwargs.get("use_unstructured_data", "yes")).strip().lower() == "yes"
        
        # === Technical exclusion strategy (from Generation Options) ===
        raw_technical_exclusion_strategy = kwargs.get("technical_exclusion_strategy", "Aggressive")
        if not raw_technical_exclusion_strategy or str(raw_technical_exclusion_strategy).strip().lower() == "none":
            raw_technical_exclusion_strategy = "Aggressive"
        self.technical_exclusion_strategy = raw_technical_exclusion_strategy
        
        # === Operation Mode (NEW - controls main operation) ===
        self.operation_mode = kwargs.get("operation_mode", "Discover Usecases")
        
        # === SQL Code Generation Flag (from Generation Options) ===
        self.generate_sql_code = "SQL Code" in self.generate_choices
        if not self.generate_sql_code:
            log_print("ℹ️ SQL Code generation DISABLED - notebooks will have placeholder SQL")
        
        # === SQL Model Serving (model endpoint for ai_query in generated SQL) ===
        self.sql_model_serving = (kwargs.get("sql_model_serving") or "databricks-gpt-oss-120b").strip()
        if not self.sql_model_serving:
            self.sql_model_serving = "databricks-gpt-oss-120b"

        # === NEW: Global LLM timeout and retry controls ===
        # User requested explicit global timeout of 300 seconds
        self.llm_timeout_seconds = 300 
        # Base timeout for SQL generation - will be adjusted adaptively based on CTE count
        self.sql_generation_base_timeout = 180  # Base: 3 minutes
        self.sql_generation_per_cte_timeout = 30  # Add 30s per CTE
        self.sql_generation_max_timeout = 360  # Cap at 6 minutes
        # max_retry represents how many times to retry after the first attempt
        self.max_retry_attempts = max(0, int(kwargs.get("max_retry", 1)))
        
        # === NEW: Initialize merged business context ===
        self.merged_business_context = {}
        
        # === NEW: JSON file path for docs-only mode ===
        self.json_file_path = kwargs.get("json_file_path", None)

        # --- Setup Logging & Output ---
        self.sanitized_customer_name = self._sanitize_name(self.business_name)
        
        # Keep log in /tmp during execution (always writable), will copy to output dir at end
        self.local_log_output_dir = f"/tmp/{self.sanitized_customer_name}"

        resolved_generation_path = self.generation_path
        if resolved_generation_path.startswith("./"):
            try:
                logical_notebook_path = self.dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
                current_notebook_dir = os.path.dirname(logical_notebook_path)
                relative_path = resolved_generation_path[2:]
                resolved_generation_path = os.path.join(current_notebook_dir, relative_path)
                log_print(f"Resolved relative generation path to: {resolved_generation_path}")
            except Exception:
                default_path = f"/Shared/{self.sanitized_customer_name}_output"
                log_print(f"Could not determine notebook path for './'. Defaulting to {default_path}.", level="WARNING")
                resolved_generation_path = default_path

        self.base_output_dir = os.path.join(resolved_generation_path, self.sanitized_customer_name)
        self.notebook_output_dir = os.path.join(self.base_output_dir, "notebooks")
        self.docs_output_dir = os.path.join(self.base_output_dir, "docs")

        setup_logging(self.local_log_output_dir) 
        self.logger = logging.getLogger(self.__class__.__name__)
        
        # Initialize memory-efficient storage manager
        self.storage_manager = IntermediateStorageManager(base_path="/tmp", logger=self.logger)
        
        # Initialize honesty tracking for reporting what happened during processing
        self.processing_honesty = {
            'total_tables_discovered': 0,
            'total_tables_processed': 0,
            'total_batches_created': 0,
            'total_batch_splits': 0,
            'tables_with_columns_dropped': [],
            'batch_split_history': [],
            'tables_completely_processed': [],
            'tables_partially_processed': [],
            'tables_skipped': []
        }
        self.validation_timeouts_discarded = []
        
        max_parallelism_display = self.max_parallelism if not self.auto_parallelism else "auto"
        self.logger.info(f"DatabricksInspire initialized for business: {self.business_name}. Target Language(s): {self.output_languages}. Base Output Dir: {self.base_output_dir}. Notebooks Dir: {self.notebook_output_dir}. Docs Dir: {self.docs_output_dir}. Scan parallelism: {self.scan_parallelism}. Max parallelism: {max_parallelism_display}")
        
        # Skip cleanup for modes that work on existing notebooks
        is_preserve_mode = self.operation_mode in ["Re-generate SQL", "Generate Sample Result"]
        
        try:
            if is_preserve_mode:
                self.logger.info(f"ℹ️ {self.operation_mode.upper()} MODE: Skipping directory cleanup to preserve existing files")
                log_print(f"ℹ️ {self.operation_mode} mode: Preserving existing files (expected behavior)")
            else:
                self.logger.info(f"Cleaning up existing output directory...")
                try:
                    self.w_client.workspace.delete(self.base_output_dir, recursive=True)
                    self.logger.info(f"Successfully deleted existing output directory: {self.base_output_dir}")
                except Exception as delete_error:
                    self.logger.debug(f"No existing directory to delete or deletion failed: {delete_error}")
                
                self.logger.info(f"Creating fresh workspace directories...")
                self.w_client.workspace.mkdirs(self.base_output_dir)
                self.w_client.workspace.mkdirs(self.notebook_output_dir)
                self.w_client.workspace.mkdirs(self.docs_output_dir)
                self.logger.info(f"Successfully created all output directories.")
        except Exception as e:
            self.logger.error(f"Failed to create workspace directories: {e}")
        
        if 'PROMPT_TEMPLATES' not in globals():
            self.logger.critical("CRITICAL ERROR: 'PROMPT_TEMPLATES' dictionary is not defined. Please run the cell defining it.")
            raise NameError("PROMPT_TEMPLATES not found. Please run the cell that defines the prompt dictionary.")

        self.ai_agent = AIAgent(
            spark=self.spark,
            logger=self.logger,
            worker_llm_config=AI_MODEL_NAME,
            judge_llm_config=AI_MODEL_NAME,
            prompt_templates=PROMPT_TEMPLATES,
            default_timeout_seconds=self.llm_timeout_seconds,
            max_retry_attempts=self.max_retry_attempts
        )
        
        self.translation_service = TranslationService(ai_agent=self.ai_agent, logger=self.logger)
        
        # === NEW: Column Registry (Bitmap Technique) ===
        self.column_id_map = {}   # FQN -> ID
        self.id_column_map = {}   # ID -> {fqn, description}
        self.next_column_id = 1
        # === NEW: Table Registry (Bitmap Technique) ===
        self.table_id_map = {}    # Table FQN -> ID
        self.id_table_map = {}    # ID -> table FQN
        self.next_table_id = 1
        self.registry_lock = threading.Lock()
        
        self.data_loader = None
        # Skip data loader if we're in docs-only mode (JSON file path provided)
        # Note: "use cases" is always generated (removed from widget options)
        if not self.json_file_path:
            if ("PDF Catalog" in self.generate_choices or 
                "Use Cases Catalog PDF" in self.generate_choices or
                "Presentation" in self.generate_choices or
                "SQL Regeneration" not in self.generate_choices):
                
                if not (self.catalogs_str or self.schemas_str or self.tables_str):
                    self.logger.error("For 'use cases', 'PDF', or 'Presentation' you must provide at least one catalog, schema, or table.")
                else:
                    # === MEMORY-OPTIMIZED DATA LOADER ===
                    # Enable two-pass mode for intelligent batching based on table sizes
                    # Enable column sampling for tables with >250 columns
                    # Use streaming for schemas with >10K tables
                    self.data_loader = DataLoader(
                        catalogs=self.catalogs_str,
                        schemas=self.schemas_str,
                        tables=self.tables_str,
                        logger=self.logger,
                        enable_two_pass=True,           # Enable intelligent batching
                        enable_column_sampling=True,    # Sample wide tables
                        streaming_batch_size=1000,      # Stream in chunks of 1000 tables
                        max_parallelism=self.scan_parallelism,
                        schema_timeout_seconds=900      # 15 min timeout per schema
                    )
        
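    def _calculate_dynamic_parallelism(self, total_tables, total_schema_chars, safe_context_limit, base_prompt_size):
        """
        Recommends a worker count for LLM batch processing from cluster memory and schema size.

        Returns a tuple: (recommended, tables_per_batch, est_batches, avg_table_chars, max_by_memory).
        The memory ceiling allows one worker per 8 GB of pooled memory, clamped to 2..8. It is
        scaled down when tables average 12,000 or more schema characters, and it never exceeds
        the estimated number of batches. With no tables, the memory ceiling is returned as-is.

        Example: 32 GB x 2 workers, 100 tables, 1,500,000 schema characters, a context limit of
        100,000 characters and a base prompt of 10,000 characters give (6, 6, 17, 15000, 8).
        """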
        memory_pool = max(1, self.cluster_memory_gb * max(self.cluster_worker_count, 1))
        max_by_memory = max(2, min(8, int(memory_pool // 8)))
        if total_tables <= 0:
            return max_by_memory, 0, 0, 0, max_by_memory
        avg_table_chars = max(1, int(total_schema_chars / total_tables))
        available_chars = max(1, safe_context_limit - base_prompt_size)
        tables_per_batch = max(1, int(available_chars // avg_table_chars))
        est_batches = int((total_tables + tables_per_batch - 1) // tables_per_batch) if tables_per_batch else total_tables
        size_factor = 1.0
        if avg_table_chars >= 24000:
            size_factor = 0.5
        elif avg_table_chars >= 18000:
            size_factor = 0.7
        elif avg_table_chars >= 12000:
            size_factor = 0.85
        recommended = int(max_by_memory * size_factor)
        if recommended < 2:
            recommended = 2
        if recommended > max_by_memory:
            recommended = max_by_memory
        if est_batches > 0 and recommended > est_batches:
            recommended = est_batches
        if recommended < 1:
            recommended = 1
        return recommended, tables_per_batch, est_batches, avg_table_chars, max_by_memory
        
    def _get_translations(self, language: str) -> dict:
        return self.translation_service.get_translations(language)

    def _report_processing_honesty(self):
        """
        Generates and displays a detailed report of what happened during processing.
        Reports on batch splits, column drops, and whether all data was processed completely.
        """
        log_print(f"\n{'='*80}")
        log_print(f"📊 PROCESSING HONESTY REPORT")
        log_print(f"{'='*80}\n")
        
        h = self.processing_honesty
        
        log_print(f"📈 Overall Statistics:")
        log_print(f"   • Total tables discovered: {h['total_tables_discovered']}")
        log_print(f"   • Total tables processed: {h['total_tables_processed']}")
        log_print(f"   • Total batches created: {h['total_batches_created']}")
        log_print(f"   • Total batch splits performed: {h['total_batch_splits']}")
        
        if h['batch_split_history']:
            proactive_splits = [s for s in h['batch_split_history'] if s.get('split_type') == 'Proactive']
            reactive_splits = [s for s in h['batch_split_history'] if s.get('split_type') == 'Reactive']
            
            log_print(f"\n⚡ Batch Splitting Details:")
            log_print(f"   {h['total_batch_splits']} batch(es) were split to fit LLM context limits:")
            log_print(f"   • Proactive splits (detected before LLM call): {len(proactive_splits)}")
            log_print(f"   • Reactive splits (after LLM error): {len(reactive_splits)}")
            
            for idx, split in enumerate(h['batch_split_history'], 1):
                split_icon = "🔮" if split.get('split_type') == 'Proactive' else "⚡"
                log_print(f"   {idx}. {split_icon} {split.get('split_type', 'Unknown')} | Batch '{split['batch']}': {split['original_tables']} tables → {split['split_into']} sub-batches")
                log_print(f"      - Sub-batch 1: {split['sub_batch_1_tables']} tables")
                log_print(f"      - Sub-batch 2: {split['sub_batch_2_tables']} tables")
        else:
            log_print(f"\n✅ No batch splitting was required - all batches fit within LLM context limits")
        
        if h['tables_with_columns_dropped']:
            log_print(f"\n⚠️  Column Dropping Details:", level="WARNING")
            log_print(f"   {len(h['tables_with_columns_dropped'])} table(s) had columns dropped to fit LLM limits:")
            for idx, drop_info in enumerate(h['tables_with_columns_dropped'], 1):
                business_tag = "🔵 BUSINESS" if drop_info['is_business'] else "⚪ non-business"
                log_print(f"   {idx}. {business_tag} | {drop_info['table']}")
                log_print(f"      - Original columns: {drop_info['original_columns']}")
                log_print(f"      - Kept columns: {drop_info['kept_columns']} ({drop_info['drop_percentage']:.1f}% dropped)")
                log_print(f"      - Dropped columns: {drop_info['dropped_columns']}")
        else:
            log_print(f"\n✅ No columns were dropped - all tables processed with full schema")
        
        log_print(f"\n{'='*80}")
        log_print(f"🎯 HONESTY ASSESSMENT:")
        log_print(f"{'='*80}")
        
        honesty_percentage = 100.0
        issues = []
        
        if h['batch_split_history']:
            issues.append(f"• {h['total_batch_splits']} batch split(s) occurred (but all tables were still processed)")
        
        if h['tables_with_columns_dropped']:
            total_columns_dropped = sum(d['dropped_columns'] for d in h['tables_with_columns_dropped'])
            total_columns_original = sum(d['original_columns'] for d in h['tables_with_columns_dropped'])
            drop_percentage = (total_columns_dropped / total_columns_original * 100) if total_columns_original > 0 else 0
            
            issues.append(f"• {len(h['tables_with_columns_dropped'])} table(s) had columns dropped ({drop_percentage:.1f}% of columns from affected tables)")
            
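            # Penalty: half the drop percentage, capped at 30 points
            # (e.g. 20% of columns dropped -> 10 points -> score 90.0%).
            honesty_percentage -= min(drop_percentage / 2, 30)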
        
        if not issues:
            log_print(f"\n✅ 100% HONEST - All tables processed completely with all columns")
            log_print(f"   No compromises were made during processing.")
        else:
            log_print(f"\n📊 Honesty Score: {honesty_percentage:.1f}%")
            log_print(f"\n   Processing required the following compromises:")
            for issue in issues:
                log_print(f"   {issue}")
            log_print(f"\n   ℹ️  Note: Batch splits don't affect completeness - all tables are still processed.")
            log_print(f"   ⚠️  Column drops DO affect completeness - some schema details were omitted.", level="WARNING")
        
        log_print(f"\n{'='*80}\n")
    
    def _get_lang_abbr(self, language: str) -> str:
        """Returns a 2-letter abbreviation for a language."""
        lang_map = {
            "english": "en",
            "arabic": "ar",
            "french": "fr",
            "spanish": "es",
            "german": "de",
            "portuguese": "pt",
            "italian": "it",
            "japanese": "ja",
            "korean": "ko",
            "chinese": "zh"
        }
        return lang_map.get(language.lower(), language.lower()[:2])

    # === BUSINESS CONTEXT AND SPONSOR MAPPING ===
    def _get_business_context_from_llm(self) -> dict:
        """
        Calls the BUSINESS_CONTEXT_WORKER_PROMPT to get comprehensive business context.
        Returns a dict with 15 business context fields, or an empty dict if the call or parsing fails.
        """
        self.logger.info(f"🔍 Calling Business Context Worker for: {self.business_name}")
        
        try:
            # No separate industry detection is performed here; the business name
            # doubles as the industry hint passed to the prompt.
            industry = self.business_name
            
            prompt_vars = {
                "name": self.business_name,
                "industry": industry,
                "type_description": "a business entity or organization",
                "type_label": "organization"
            }
            
            self.logger.info(f"⏳ Waiting for LLM response (Business Context extraction)...")
            response_json = self.ai_agent.run_worker(
                step_name="Business_Context_Extraction",
                worker_prompt_path="BUSINESS_CONTEXT_WORKER_PROMPT",
                prompt_vars=prompt_vars,
                response_schema=None
            )
            self.logger.info(f"✅ Received LLM response, parsing business context...")
            
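            # Parse JSON response; callers expect a dict, so reject arrays or scalars
            response_clean = clean_json_response(response_json)
            context_data = json.loads(response_clean)
            if not isinstance(context_data, dict):
                raise ValueError(f"Expected a JSON object, got {type(context_data).__name__}")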
            
            self.logger.info(f"✅ Business Context Worker extracted {len(context_data)} fields")
            return context_data
            
        except Exception as e:
            self.logger.error(f"Failed to get business context from LLM: {e}")
            return {}
    
    def _merge_business_contexts(self, llm_context: dict, user_context_str: str) -> dict:
        """
        Merges LLM-generated business context with user-provided context.
        User items are stored under 'user_focus_areas' and prepended to the LLM's
        'business_units_divisions_domains_subdomains' value so they take precedence.
        
        Args:
            llm_context: Dictionary from Business Context Worker (15 fields)
            user_context_str: Comma-separated string from user (Business Context widget)
            
        Returns:
            Merged dictionary with user values taking precedence
        """
        merged_context = llm_context.copy()
        
        if not user_context_str or not user_context_str.strip():
            self.logger.info("No user-provided business context. Using LLM context only.")
            return merged_context
        
        # Parse user context - assume it's comma-separated values that override specific fields
        # For simplicity, we'll treat the entire user context as additional focus areas
        user_focus_areas = [area.strip() for area in user_context_str.split(',') if area.strip()]
        
        if user_focus_areas:
            self.logger.info(f"✅ User provided {len(user_focus_areas)} business context items - these will take precedence")
            
            # Store user focus areas separately for later use
            # We'll use them to override or extend LLM context
            merged_context['user_focus_areas'] = ', '.join(user_focus_areas)
            
            # Also prepend the user areas to the LLM-provided domains so they are used first
            domains_key = 'business_units_divisions_domains_subdomains'
            existing = merged_context.get(domains_key)
            if isinstance(existing, (list, tuple)):
                # The LLM may return a JSON array instead of a comma-separated string
                existing = ', '.join(str(item) for item in existing)
            user_areas_str = ', '.join(user_focus_areas)
            merged_context[domains_key] = f"{user_areas_str}, {existing}" if existing else user_areas_str
        
        return merged_context
    
    # === USER DOMAIN ASSIGNMENT ===
    def _assign_to_user_domains(self, use_cases: list, user_domains: list, language: str) -> list:
        """
        Assigns use cases to user-provided business domains using LLM, 
        then discovers subdomains within each domain.
        
        This method is called when user provides specific business domains.
        The user provides TOP-LEVEL DOMAINS ONLY - Inspire discovers subdomains automatically.
        
        Args:
            use_cases: List of use case dictionaries
            user_domains: List of user-provided domain names (top-level only)
            language: Output language
            
        Returns:
            List of use cases with Business Domain and Subdomain properly assigned
        """
        import io
        import csv
        from collections import defaultdict, Counter
        from concurrent.futures import ThreadPoolExecutor
        import concurrent.futures
        
        # Nothing to assign: return before logging so empty or None inputs do not raise
        if not use_cases or not user_domains:
            return use_cases
        
        self.logger.info(f"📍 Assigning {len(use_cases)} use cases to {len(user_domains)} user-provided domains...")
        self.logger.info(f"📍 User provided TOP-LEVEL domains only: {', '.join(user_domains)}")
        self.logger.info("📍 Inspire will discover SUBDOMAINS within each domain automatically")
        
        # Build a simple prompt for domain assignment
        domain_list_str = ", ".join([f'"{d}"' for d in user_domains])
        
        # Convert use cases to CSV for LLM
        output = io.StringIO()
        fieldnames = ['No', 'Name', 'type', 'Analytics Technique', 'Statement', 'Solution', 
                     'Business Value', 'Beneficiary', 'Sponsor', 'Tables Involved']
        writer = csv.DictWriter(output, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(use_cases)
        use_cases_csv = output.getvalue()
        
        # Build custom prompt for user domain assignment
        assignment_prompt = f"""You are an expert business analyst. Your task is to assign each use case to ONE of the following EXACT business domains:

**ALLOWED DOMAINS (USE EXACTLY AS PROVIDED)**: {domain_list_str}

**🚨 CRITICAL RULES 🚨**:
1. You MUST assign EVERY use case to EXACTLY ONE domain from the list above
2. You MUST use the domain names EXACTLY as provided - no modifications
3. You MUST NOT create or invent any new domain names
4. Every use case MUST have a domain assigned

**USE CASES TO ASSIGN**:
{use_cases_csv}

**OUTPUT FORMAT** (CSV only, no explanation):
```csv
use_case_id,domain
```

For each use case, output ONE row with the use case ID (the value of its `No` column) and the assigned domain.
Start your response with the CSV header line: use_case_id,domain
"""

        try:
            # Call LLM for domain assignment using _call_ai_query directly
            response_raw = self.ai_agent._call_ai_query(
                prompt=assignment_prompt,
                prompt_name=f"Assign_User_Domains_{language}",
                response_schema=None,
                display_name=f"Assign_User_Domains_{language}"
            )
            
            # Clean response
            response_clean = clean_json_response(response_raw)
            
            # Parse CSV response
            csv_rows = CSVParser.parse_csv_string(
                response_clean,
                logger=self.logger,
                context="User domain assignment"
            )
            
            # Build domain assignment map
            domain_map = {}
            for row in csv_rows:
                uc_id = (row.get('use_case_id') or '').strip()
                domain = (row.get('domain') or '').strip()
                if uc_id and domain:
                    # Validate domain is in user list
                    if domain in user_domains:
                        domain_map[uc_id] = domain
                    else:
                        # Try case-insensitive match
                        for ud in user_domains:
                            if ud.lower() == domain.lower():
                                domain_map[uc_id] = ud
                                break
                        else:
                            # Assign to first domain as fallback
                            domain_map[uc_id] = user_domains[0]
                            self.logger.warning(f"Use case {uc_id} assigned invalid domain '{domain}', defaulting to '{user_domains[0]}'")
            
            # Apply domain assignments (only domain, NOT subdomain yet)
            assigned_count = 0
            for uc in use_cases:
                # Normalize to a stripped string so it matches the keys parsed from the CSV response
                uc_id = str(uc.get('No', '')).strip()
                if uc_id in domain_map:
                    uc['Business Domain'] = domain_map[uc_id]
                    assigned_count += 1
                else:
                    # Fallback: assign to first user domain
                    uc['Business Domain'] = user_domains[0]
                    self.logger.warning(f"Use case {uc_id} not in LLM response, defaulting to '{user_domains[0]}'")
            
            self.logger.info(f"✅ Successfully assigned {assigned_count}/{len(use_cases)} use cases to user domains")
            
            # Log domain distribution
            domain_counts = Counter(uc.get('Business Domain', 'Unknown') for uc in use_cases)
            for domain, count in sorted(domain_counts.items(), key=lambda x: -x[1]):
                self.logger.info(f"   📁 {domain}: {count} use cases")
            
            # === NOW DISCOVER SUBDOMAINS WITHIN EACH DOMAIN ===
            self.logger.info(f"📍 STEP 2: Discovering subdomains within each user-provided domain...")
            
            # Group use cases by domain
            domain_usecases_map = defaultdict(list)
            for uc in use_cases:
                domain = uc.get('Business Domain', '').strip()
                if domain:
                    domain_usecases_map[domain].append(uc)
            
            # ADAPTIVE PARALLELISM: Calculate based on number of domains and use cases
            subdomain_parallelism, reason = calculate_adaptive_parallelism(
                "subdomain_detection", self.max_parallelism,
                num_items=len(use_cases),
                num_domains=len(domain_usecases_map),
                is_llm_operation=True, logger=self.logger
            )
            log_adaptive_parallelism_decision("subdomain_detection", subdomain_parallelism, self.max_parallelism, reason)
            self.logger.info(f"Processing {len(domain_usecases_map)} domains for subdomain discovery...")
            
            # Process each domain in parallel for subdomain detection
            final_use_cases_with_subdomains = []
            
            with ThreadPoolExecutor(max_workers=subdomain_parallelism, 
                                   thread_name_prefix="SubdomainDetect") as executor:
                # Submit subdomain detection for each domain
                future_to_domain = {}
                for domain_name, domain_use_cases in domain_usecases_map.items():
                    future = executor.submit(
                        self._detect_subdomains_for_domain,
                        domain_name,
                        domain_use_cases,
                        language
                    )
                    future_to_domain[future] = domain_name
                    self.logger.debug(f"✓ Submitted subdomain detection for domain '{domain_name}' ({len(domain_use_cases)} use cases)")
                
                # Collect results as they complete
                for future in concurrent.futures.as_completed(future_to_domain):
                    domain_name = future_to_domain[future]
                    try:
                        use_cases_with_subdomains = future.result()
                        if use_cases_with_subdomains:
                            self.logger.info(f"✅ Domain '{domain_name}': Subdomain discovery complete ({len(use_cases_with_subdomains)} use cases)")
                            final_use_cases_with_subdomains.extend(use_cases_with_subdomains)
                        else:
                            # CRITICAL FIX: Assign default subdomains when discovery returns empty
                            self.logger.warning(f"⚠️ Domain '{domain_name}': Subdomain discovery returned no use cases - assigning default subdomains")
                            domain_use_cases = domain_usecases_map.get(domain_name, [])
                            for uc in domain_use_cases:
                                if not uc.get('Subdomain'):
                                    uc['Subdomain'] = f"General {domain_name}"
                            final_use_cases_with_subdomains.extend(domain_use_cases)
                    except Exception as e:
                        self.logger.error(f"❌ Domain '{domain_name}': Subdomain discovery failed: {e}")
                        # Fall back to use cases without subdomains for this domain
                        domain_use_cases = domain_usecases_map.get(domain_name, [])
                        # Set default subdomain as "General [Domain]" when subdomain detection fails
                        for uc in domain_use_cases:
                            if not uc.get('Subdomain'):
                                uc['Subdomain'] = f"General {domain_name}"
                        self.logger.warning(f"Using default subdomains for domain '{domain_name}'")
                        final_use_cases_with_subdomains.extend(domain_use_cases)
            
            self.logger.info(f"✅ Domain assignment and subdomain discovery complete! {len(final_use_cases_with_subdomains)} use cases processed")
            
            # Log subdomain distribution
            subdomain_counts = Counter(uc.get('Subdomain', 'Unknown') for uc in final_use_cases_with_subdomains)
            self.logger.info("📊 Subdomain distribution:")
            for subdomain, count in sorted(subdomain_counts.items(), key=lambda x: -x[1]):
                self.logger.info(f"   📁 {subdomain}: {count} use cases")
            
            return final_use_cases_with_subdomains
            
        except Exception as e:
            self.logger.error(f"Failed to assign user domains via LLM: {e}. Using fallback distribution with subdomain discovery.")
            
            # Fallback: distribute use cases evenly across user domains
            for idx, uc in enumerate(use_cases):
                domain_idx = idx % len(user_domains)
                uc['Business Domain'] = user_domains[domain_idx]
            
            # Even on fallback, try to discover subdomains
            self.logger.info("📍 Fallback: Attempting subdomain discovery for distributed domains...")
            
            domain_usecases_map = defaultdict(list)
            for uc in use_cases:
                domain = uc.get('Business Domain', '').strip()
                if domain:
                    domain_usecases_map[domain].append(uc)
            
            final_use_cases = []
            for domain_name, domain_use_cases in domain_usecases_map.items():
                try:
                    use_cases_with_subdomains = self._detect_subdomains_for_domain(
                        domain_name, domain_use_cases, language
                    )
                    final_use_cases.extend(use_cases_with_subdomains)
                except Exception as sub_e:
                    self.logger.error(f"Subdomain detection failed for domain '{domain_name}': {sub_e}")
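                    # Last resort: set subdomain = "General [Domain]" where none is set,
                    # matching the default-assignment behavior of the parallel path
                    for uc in domain_use_cases:
                        if not uc.get('Subdomain'):
                            uc['Subdomain'] = f"General {domain_name}"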
                    final_use_cases.extend(domain_use_cases)
            
            return final_use_cases if final_use_cases else use_cases
    
    # === HELPER FUNCTIONS ===
    def _apply_domain_mapping_flat(self, all_use_cases: list, domain_mapping: dict) -> list:
        """Applies a domain mapping to a flat list[dict] of use cases."""
        modified_use_cases = []
        for uc in all_use_cases:
            original_domain = uc.get('Business Domain', 'Other')
            new_domain = domain_mapping.get(original_domain, original_domain)
            uc['Business Domain'] = new_domain
            modified_use_cases.append(uc)
        return modified_use_cases
    
    def _apply_subdomain_mapping(self, all_use_cases: list, subdomain_mapping: dict) -> list:
        """
        Applies subdomain mapping to consolidate overlapping subdomains.
        
        The subdomain_mapping contains the FINAL list of subdomains per domain.
        This function maps use cases' current subdomains to the consolidated ones
        by finding the best matching subdomain from the mapping.
        
        Args:
            all_use_cases: List of use case dictionaries
            subdomain_mapping: Dict mapping domain names to lists of consolidated subdomain names
            
        Returns:
            Modified list of use cases with updated Subdomain field
        """
        modified_use_cases = []
        
        for uc in all_use_cases:
            domain = uc.get('Business Domain', '')
            # Treat a missing or None subdomain as empty so .lower() below cannot fail
            current_subdomain = uc.get('Subdomain') or ''
            
            # If this domain has a subdomain mapping, apply it
            if domain in subdomain_mapping and isinstance(subdomain_mapping[domain], list):
                target_subdomains = subdomain_mapping[domain]
                
                # Find the best matching subdomain from the consolidated list
                # Strategy: Match by word overlap
                best_match = None
                best_overlap_score = 0
                
                current_words = set(current_subdomain.lower().split())
                
                for target_subdomain in target_subdomains:
                    target_words = set(target_subdomain.lower().split())
                    overlap = len(current_words & target_words)
                    
                    # If we have word overlap, this is a candidate
                    if overlap > best_overlap_score:
                        best_overlap_score = overlap
                        best_match = target_subdomain
                    # If no word overlap yet, check if current subdomain contains target words
                    elif best_overlap_score == 0:
                        # Check if any target word is contained in current subdomain
                        for target_word in target_words:
                            if target_word in current_subdomain.lower():
                                best_match = target_subdomain
                                break
                
                # Apply the best match if found
                if best_match:
                    uc['Subdomain'] = best_match
                    self.logger.debug(f"Mapped subdomain '{current_subdomain}' → '{best_match}' in domain '{domain}'")
            
            modified_use_cases.append(uc)
        
        return modified_use_cases
    
    def _enforce_subdomain_constraints(self, use_cases: list) -> list:
        """
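        Enforces subdomain constraints, applied per domain in this order:
        1. No overlapping subdomain names within a domain (subdomains sharing a word are merged)
        2. Each domain has between 2 and 8 subdomains (a lone subdomain is split only if it has 6+ use cases)
        3. Each subdomain has at least 3 use cases (smaller ones are merged into the largest)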
        
        Args:
            use_cases: List of use case dictionaries
            
        Returns:
            Modified list of use cases with enforced constraints
        """
        from collections import defaultdict
        
        # Group use cases by domain and subdomain
        domain_subdomain_cases = defaultdict(lambda: defaultdict(list))
        for uc in use_cases:
            domain = uc.get('Business Domain', 'Other')
            subdomain = uc.get('Subdomain', 'General')
            domain_subdomain_cases[domain][subdomain].append(uc)
        
        # Process each domain
        for domain, subdomains_dict in domain_subdomain_cases.items():
            # STEP 1: Check for overlapping subdomain names and merge them
            subdomain_list = list(subdomains_dict.keys())
            merged_mapping = {}  # Maps old subdomain -> new subdomain
            
            for subdomain in subdomain_list:
                if subdomain in merged_mapping:
                    continue  # Already processed
                
                # Find all other subdomains that share words with this one
                subdomain_words = set(subdomain.lower().split())
                merge_targets = [subdomain]
                
                for other_subdomain in subdomain_list:
                    if other_subdomain == subdomain or other_subdomain in merged_mapping:
                        continue
                    
                    other_words = set(other_subdomain.lower().split())
                    # If they share any words, merge them
                    if subdomain_words & other_words:
                        merge_targets.append(other_subdomain)
                
                # Choose the shortest name as the merged name
                if len(merge_targets) > 1:
                    merged_name = min(merge_targets, key=len)
                    for target in merge_targets:
                        merged_mapping[target] = merged_name
                    self.logger.info(f"Domain '{domain}': Merging overlapping subdomains {merge_targets} → '{merged_name}'")
                else:
                    merged_mapping[subdomain] = subdomain
            
            # Apply merging
            new_subdomains_dict = defaultdict(list)
            for old_subdomain, cases in subdomains_dict.items():
                new_subdomain = merged_mapping.get(old_subdomain, old_subdomain)
                new_subdomains_dict[new_subdomain].extend(cases)
            
            # Update use cases with merged subdomain names
            for new_subdomain, cases in new_subdomains_dict.items():
                for uc in cases:
                    uc['Subdomain'] = new_subdomain
            
            # STEP 2: Check subdomain count (min 2, max 8 per domain)
            if len(new_subdomains_dict) < 2:
                self.logger.warning(f"Domain '{domain}' has only {len(new_subdomains_dict)} subdomain(s) (<2). Creating additional subdomains...")
                # If less than 2 subdomains, split the largest subdomain into 2
                if len(new_subdomains_dict) == 1:
                    subdomain_name = list(new_subdomains_dict.keys())[0]
                    cases = new_subdomains_dict[subdomain_name]
                    if len(cases) >= 6:  # Only split if we have enough cases
                        mid = len(cases) // 2
                        del new_subdomains_dict[subdomain_name]
                        for split_name, split_cases in ((f"{subdomain_name} A", cases[:mid]), (f"{subdomain_name} B", cases[mid:])):
                            new_subdomains_dict[split_name] = split_cases
                            # Write the new name back so the split survives in the returned use cases
                            for uc in split_cases:
                                uc['Subdomain'] = split_name
                        self.logger.info(f"Domain '{domain}': Split '{subdomain_name}' into 2 subdomains to meet minimum requirement")
            
            if len(new_subdomains_dict) > 8:
                self.logger.warning(f"Domain '{domain}' has {len(new_subdomains_dict)} subdomains (>8). Merging smallest ones...")
                
                # Sort by use case count
                sorted_subdomains = sorted(new_subdomains_dict.items(), key=lambda x: len(x[1]), reverse=True)
                
                # Keep top 7 subdomains, merge the rest into the 8th
                keep_subdomains = sorted_subdomains[:7]
                merge_subdomains = sorted_subdomains[7:]
                
                # Create a merged subdomain name (use "Other Services" as catch-all - must be 2 words)
                merged_subdomain_name = "Other Services"
                merged_cases = []
                for subdomain, cases in merge_subdomains:
                    merged_cases.extend(cases)
                
                # Update use cases
                for uc in merged_cases:
                    uc['Subdomain'] = merged_subdomain_name
                
                self.logger.info(f"Domain '{domain}': Merged {len(merge_subdomains)} small subdomains into '{merged_subdomain_name}'")
                
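                # Rebuild subdomains dict; extend rather than overwrite if one of the kept
                # subdomains is already named like the catch-all
                new_subdomains_dict = dict(keep_subdomains)
                new_subdomains_dict[merged_subdomain_name] = new_subdomains_dict.get(merged_subdomain_name, []) + merged_cases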
            
            # STEP 3: Check minimum use cases per subdomain (at least 3)
            small_subdomains = [(sd, cases) for sd, cases in new_subdomains_dict.items() if len(cases) < 3]
            
            if small_subdomains:
                self.logger.warning(f"Domain '{domain}' has {len(small_subdomains)} subdomains with <3 use cases. Merging...")
                
                # Merge small subdomains into the largest subdomain
                sorted_by_size = sorted(new_subdomains_dict.items(), key=lambda x: len(x[1]), reverse=True)
                
                if len(sorted_by_size) > 0:
                    target_subdomain = sorted_by_size[0][0]
                    
                    for subdomain, cases in small_subdomains:
                        if subdomain != target_subdomain:  # Don't merge into itself
                            self.logger.info(f"Domain '{domain}': Merging subdomain '{subdomain}' ({len(cases)} cases) into '{target_subdomain}'")
                            for uc in cases:
                                uc['Subdomain'] = target_subdomain
        
        return use_cases

    def _group_use_cases_by_domain_flat(self, all_use_cases: list) -> dict:
        """
        Groups a flat list[dict] of use cases by their 'Business Domain'.
        Use cases within each domain are sorted by priority descending (Very High first).
        
        NOTE: This sorting is used for PDF/PPT/XLS outputs. Notebooks use ID-based sorting instead.
        """
        # Sort by priority first (for PDF/PPT/XLS outputs)
        all_use_cases = sorted(all_use_cases, key=self._priority_sort_key)
        grouped_by_domain = defaultdict(list)
        for uc in all_use_cases:
            domain = uc.get('Business Domain') or 'Other'
            grouped_by_domain[domain].append(uc)
        return grouped_by_domain

    def _align_translated_data(self, master_grouped_data: dict, translated_flat_list: list) -> dict:
        """
        Creates a grouped dictionary for the target language that strictly follows 
        the ordering and categorization of the master (English) data.
        """
        translated_map = {uc['No']: uc for uc in translated_flat_list}
        aligned_data = {}
        for en_domain in master_grouped_data.keys(): 
            english_cases = master_grouped_data[en_domain]
            translated_domain_name = en_domain
            if english_cases:
                first_id = english_cases[0]['No']
                if first_id in translated_map:
                    translated_domain_name = translated_map[first_id].get('Business Domain', en_domain)
            
            translated_group = []
            for en_uc in english_cases:
                uc_id = en_uc['No']
                if uc_id in translated_map:
                    translated_group.append(translated_map[uc_id])
                else:
                    translated_group.append(en_uc)
            aligned_data[translated_domain_name] = translated_group
        return aligned_data
    # === END HELPER FUNCTIONS ===


    
    def _validate_sql_locally_with_spark(self, sql_query: str, use_case_id: str) -> tuple:
        """
        Validate SQL by submitting it to the Databricks SQL Statement Execution API.
        Despite the method name, no local Spark session is used: the statement is sent to the
        configured SQL warehouse over REST with a 50s wait timeout and a one-row result limit.
        Creates a fresh workspace client for each call to avoid token expiry issues.
        
        Args:
            sql_query: The SQL query to validate
            use_case_id: Use case ID for logging
            
        Returns:
            tuple: (is_valid: bool, error_message: str or None)
        """
        try:
            import requests
            import json
            from databricks.sdk import WorkspaceClient
            from databricks.sdk.config import Config
            
            # Create a FRESH workspace client with new token for each call
            self.logger.debug(f"[{use_case_id}] 🔄 Creating fresh workspace client for API validation...")
            
            # Get current config
            original_workspace = self.workspace
            
            # Create new config from the current host and token.
            # NOTE: this copies the existing token value; it does not by itself mint a new token
            config = Config(
                host=original_workspace.config.host,
                token=original_workspace.config.token
            )
            
            # Create fresh workspace client
            fresh_workspace = WorkspaceClient(config=config)
            
            # Get fresh token and URL
            workspace_url = fresh_workspace.config.host
            token = fresh_workspace.config.token
            
            # API endpoint
            api_url = f"{workspace_url}/api/2.0/sql/statements"
            
            # Get SQL warehouse ID
            sql_warehouse_id = getattr(self, 'sql_warehouse_id', None)
            if not sql_warehouse_id:
                self.logger.debug(f"[{use_case_id}] No SQL warehouse configured for validation")
                return (True, None)
            
            # Request payload - using REST API validation pattern
            # The warehouse still runs the statement; wait_timeout bounds the wait to 50s
            # and row_limit only caps the number of rows returned
            payload = {
                "warehouse_id": sql_warehouse_id,
                "statement": sql_query,
                "wait_timeout": "50s",
                "disposition": "EXTERNAL_LINKS",
                "row_limit": 1
            }
            
            headers = {
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json"
            }
            
            self.logger.debug(f"[{use_case_id}] 🌐 Attempting SQL Warehouse API validation with fresh token...")
            
            # Make API call with fresh token
            response = requests.post(api_url, headers=headers, json=payload, timeout=300)
            
            # Check response status
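            if response.status_code == 200:
                # HTTP 200 only means the statement was accepted; SQL errors are
                # reported through status.state, so inspect it before declaring success
                status = response.json().get('status', {})
                if status.get('state') in ('FAILED', 'CANCELED'):
                    error_msg = status.get('error', {}).get('message', 'Unknown execution error')
                    self.logger.warning(f"[{use_case_id}] ❌ API validation failed: {error_msg}")
                    return (False, error_msg)
                self.logger.info(f"[{use_case_id}] ✅ API validation passed - SQL syntax is valid")
                return (True, None)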
                    
            elif response.status_code == 400:
                # 400 Bad Request indicates syntax error
                try:
                    error_data = response.json()
                    error_code = error_data.get('error_code', '')
                    error_msg = error_data.get('message', 'Syntax error')
                    
                    # Common error codes: SYNTAX_ERROR, UNRESOLVED_COLUMN, TABLE_OR_VIEW_NOT_FOUND, etc.
                    if error_code:
                        error_msg = f"[{error_code}] {error_msg}"
                    
                    self.logger.warning(f"[{use_case_id}] ❌ API validation failed: {error_msg}")
                    return (False, error_msg)
                except Exception as parse_error:
                    error_msg = f"Syntax error (status 400): {response.text[:200]}"
                    self.logger.warning(f"[{use_case_id}] ❌ API validation failed: {error_msg}")
                    return (False, error_msg)
                
            elif response.status_code in [401, 403]:
                try:
                    error_data = response.json()
                    error_detail = error_data.get('message', 'Unauthorized/Forbidden')
                except Exception:
                    error_detail = 'Unauthorized/Forbidden'
                
                self.logger.debug(
                    f"[{use_case_id}] ⚠️  API authentication failed (status {response.status_code}): {error_detail}"
                )
                return (True, None)
                
            else:
                self.logger.debug(f"[{use_case_id}] ⚠️  API returned status {response.status_code}")
                return (True, None)
            
        except Exception as e:
            # Exception during API call
            error_msg = str(e)
            self.logger.debug(f"[{use_case_id}] ❌ API validation failed (exception): {error_msg[:100]}...")
            return (False, error_msg)
    
    def _validate_sql_remotely_with_fresh_client(self, sql_query: str, use_case_id: str) -> tuple:
        """
        Validate SQL syntax using SQL Warehouse REST API (retry with fresh client).
        Uses the SQL Statement Execution API with minimal wait time and external links disposition.
        Creates a fresh workspace client for each call to avoid token expiry issues.
        
        Args:
            sql_query: The SQL query to validate
            use_case_id: Use case ID for logging
            
        Returns:
            tuple: (is_valid: bool, error_message: str or None)
        """
        try:
            import requests
            import json
            from databricks.sdk import WorkspaceClient
            from databricks.sdk.config import Config
            
            # Create a FRESH workspace client with new token (retry attempt)
            self.logger.debug(f"[{use_case_id}] 🔄 Creating fresh workspace client for retry validation...")
            
            # Get current config
            original_workspace = self.workspace
            
            # Create new config from the current host and token.
            # NOTE: this copies the existing token value; it does not by itself mint a new token
            config = Config(
                host=original_workspace.config.host,
                token=original_workspace.config.token
            )
            
            # Create fresh workspace client
            fresh_workspace = WorkspaceClient(config=config)
            
            # Get fresh token
            workspace_url = fresh_workspace.config.host
            token = fresh_workspace.config.token
            
            # API endpoint
            api_url = f"{workspace_url}/api/2.0/sql/statements"
            
            # Get SQL warehouse ID
            sql_warehouse_id = getattr(self, 'sql_warehouse_id', None)
            if not sql_warehouse_id:
                self.logger.debug(f"[{use_case_id}] No SQL warehouse configured for remote validation")
                return (True, None)
            
            # Request payload - using REST API validation pattern
            # The warehouse still runs the statement; wait_timeout bounds the wait to 50s
            # and row_limit only caps the number of rows returned
            payload = {
                "warehouse_id": sql_warehouse_id,
                "statement": sql_query,
                "wait_timeout": "50s",
                "disposition": "EXTERNAL_LINKS",
                "row_limit": 1
            }
            
            headers = {
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json"
            }
            
            self.logger.debug(f"[{use_case_id}] 🌐 Attempting remote SQL Warehouse API validation...")
            
            # Make API call with fresh token
            response = requests.post(api_url, headers=headers, json=payload, timeout=300)
            
            # Check response status
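            if response.status_code == 200:
                # HTTP 200 only means the statement was accepted; SQL errors are
                # reported through status.state, so inspect it before declaring success
                status = response.json().get('status', {})
                if status.get('state') in ('FAILED', 'CANCELED'):
                    error_msg = status.get('error', {}).get('message', 'Unknown execution error')
                    self.logger.warning(f"[{use_case_id}] ❌ Remote API validation failed: {error_msg}")
                    return (False, error_msg)
                self.logger.info(f"[{use_case_id}] ✅ Remote API validation passed - SQL syntax is valid")
                return (True, None)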
                    
            elif response.status_code == 400:
                # 400 Bad Request indicates syntax error
                try:
                    error_data = response.json()
                    error_code = error_data.get('error_code', '')
                    error_msg = error_data.get('message', 'Syntax error')
                    
                    # Common error codes: SYNTAX_ERROR, UNRESOLVED_COLUMN, TABLE_OR_VIEW_NOT_FOUND, etc.
                    if error_code:
                        error_msg = f"[{error_code}] {error_msg}"
                    
                    self.logger.warning(f"[{use_case_id}] ❌ Remote API validation failed: {error_msg}")
                    return (False, error_msg)
                except Exception as parse_error:
                    error_msg = f"Syntax error (status 400): {response.text[:200]}"
                    self.logger.warning(f"[{use_case_id}] ❌ Remote API validation failed: {error_msg}")
                    return (False, error_msg)
                
            elif response.status_code in [401, 403]:
                try:
                    error_data = response.json()
                    error_detail = error_data.get('message', 'Unauthorized/Forbidden')
                except Exception:
                    error_detail = 'Unauthorized/Forbidden'
                
                self.logger.debug(
                    f"[{use_case_id}] ⚠️  Remote API authentication failed (status {response.status_code}): {error_detail}"
                )
                return (True, None)
                
            else:
                self.logger.debug(f"[{use_case_id}] ⚠️  Remote API returned status {response.status_code}")
                return (True, None)
                
        except Exception as e:
            self.logger.debug(f"[{use_case_id}] ⚠️  Remote API validation error: {e}")
            return (True, None)
    
    def _execute_sql_for_validation(self, sql_query: str, use_case_id: str) -> tuple:
        """
        Execute SQL query with LIMIT 1 to validate both syntax AND runtime behavior.
        This catches errors that syntax-only validation misses (e.g., window function issues,
        column resolution problems, etc.).
        
        Args:
            sql_query: The SQL query to execute
            use_case_id: Use case ID for logging
            
        Returns:
            tuple: (is_valid: bool, error_message: str or None)
        """
        try:
            import requests
            import json
            import re
            from databricks.sdk import WorkspaceClient
            from databricks.sdk.config import Config
            
            # Create a fresh workspace client
            self.logger.debug(f"[{use_case_id}] 🔄 Creating fresh workspace client for execution validation...")
            
            # Get current config
            original_workspace = self.workspace
            
            # Create new config
            config = Config(
                host=original_workspace.config.host,
                token=original_workspace.config.token
            )
            
            # Create fresh workspace client
            fresh_workspace = WorkspaceClient(config=config)
            
            # Get fresh token and URL
            workspace_url = fresh_workspace.config.host
            token = fresh_workspace.config.token
            
            # API endpoint
            api_url = f"{workspace_url}/api/2.0/sql/statements"
            
            # Get SQL warehouse ID
            sql_warehouse_id = getattr(self, 'sql_warehouse_id', None)
            if not sql_warehouse_id:
                self.logger.debug(f"[{use_case_id}] No SQL warehouse configured for execution validation")
                return (True, None)
            
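            # Modify query to LIMIT 1 for fast execution.
            # Match LIMIT as a whole keyword so identifiers such as credit_limit are not mistaken for a clause
            modified_query = sql_query.rstrip().rstrip(';')
            limit_pattern = re.compile(r'\bLIMIT\s+\d+\b', re.IGNORECASE)
            if limit_pattern.search(modified_query):
                modified_query = limit_pattern.sub('LIMIT 1', modified_query)
            elif not re.search(r'\bLIMIT\b', modified_query, re.IGNORECASE):
                # No LIMIT keyword at all; a non-numeric LIMIT is left untouched (row_limit still caps output)
                modified_query += ' LIMIT 1'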
            
            # Request payload - actually execute the query (wait_timeout=50s)
            payload = {
                "warehouse_id": sql_warehouse_id,
                "statement": modified_query,
                "wait_timeout": "50s",
                "row_limit": 1
            }
            
            headers = {
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json"
            }
            
            self.logger.debug(f"[{use_case_id}] 🌐 Executing SQL for validation (LIMIT 1)...")
            
            # Make API call
            response = requests.post(api_url, headers=headers, json=payload, timeout=300)
            
            # Check response status
            if response.status_code == 200:
                # Query was submitted, check execution status
                result_data = response.json()
                
                # Check statement status
                status = result_data.get('status', {})
                state = status.get('state', '')
                
                if state == 'SUCCEEDED':
                    self.logger.info(f"[{use_case_id}] ✅ Execution validation passed - SQL executed successfully")
                    if str(getattr(self, "show_query_results_option", "")).strip().lower() == "yes":
                        try:
                            manifest = result_data.get('manifest', {})
                            if isinstance(manifest, dict):
                                schema_columns = manifest.get('schema', {}).get('columns', [])
                            else:
                                schema_obj = getattr(manifest, "schema", None)
                                schema_columns = getattr(schema_obj, "columns", []) if schema_obj else []
                            result_block = result_data.get('result', {}) if isinstance(result_data, dict) else getattr(result_data, "result", {})
                            data_array = result_block.get('data_array') or result_block.get('data', [])
                            if schema_columns and data_array:
                                columns = []
                                for col in schema_columns:
                                    if isinstance(col, dict):
                                        col_name = col.get('name')
                                    else:
                                        col_name = getattr(col, 'name', None)
                                    if col_name is not None:
                                        columns.append(col_name)
                                if columns:
                                    example_result = self._prepare_example_result(columns, schema_columns, data_array[0], use_case_id)
                                    example_result['sql'] = sql_query
                                else:
                                    example_result = {
                                        'status': 'empty',
                                        'data': [],
                                        'message': 'Query returned no columns',
                                        'sql': sql_query
                                    }
                            else:
                                example_result = {
                                    'status': 'empty',
                                    'data': [],
                                    'message': 'Query returned no results',
                                    'sql': sql_query
                                }
                            cache_dir = self._ensure_sql_results_cache_dir()
                            cache_path = os.path.join(cache_dir, f"{use_case_id}.json")
                            with open(cache_path, 'w', encoding='utf-8') as f:
                                json.dump(example_result, f, ensure_ascii=False, indent=2)
                            self.logger.info(f"[{use_case_id}] Cached validation result to {cache_path}")
                        except Exception as cache_error:
                            self.logger.debug(f"[{use_case_id}] Skipped caching validation result: {str(cache_error)[:100]}")
                    return (True, None)
                elif state in ['FAILED', 'CANCELED']:
                    # Extract error message
                    error_info = status.get('error', {})
                    error_message = error_info.get('message', 'Unknown execution error')
                    error_code = error_info.get('error_code', '')
                    
                    if error_code:
                        error_message = f"[{error_code}] {error_message}"
                    
                    self.logger.warning(f"[{use_case_id}] ❌ Execution validation failed: {error_message}")
                    return (False, error_message)
                else:
                    # Still running or pending - this shouldn't happen with 30s timeout
                    self.logger.debug(f"[{use_case_id}] Query still in state: {state}")
                    return (True, None)
                    
            elif response.status_code == 400:
                # Bad request - syntax or execution error
                try:
                    error_data = response.json()
                    error_code = error_data.get('error_code', '')
                    error_msg = error_data.get('message', 'Execution error')
                    
                    if error_code:
                        error_msg = f"[{error_code}] {error_msg}"
                    
                    self.logger.warning(f"[{use_case_id}] ❌ Execution validation failed: {error_msg}")
                    return (False, error_msg)
                except Exception:
                    # Response body was not valid JSON; fall back to the raw text
                    error_msg = f"Execution error (status 400): {response.text[:200]}"
                    self.logger.warning(f"[{use_case_id}] ❌ Execution validation failed: {error_msg}")
                    return (False, error_msg)
                
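            elif response.status_code in [401, 403]:
                # An auth failure says nothing about the SQL itself, so the query is not marked invalid
                self.logger.debug(f"[{use_case_id}] ⚠️  API authentication failed during execution validation")
                return (True, None)
                
            else:
                # Any other status is inconclusive; treat the query as valid rather than blocking it
                self.logger.debug(f"[{use_case_id}] ⚠️  API returned status {response.status_code} during execution")
                return (True, None)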
            
        except Exception as e:
            error_msg = str(e)
            self.logger.debug(f"[{use_case_id}] ❌ Execution validation failed (exception): {error_msg[:100]}...")
            return (False, error_msg)
    
    def _validate_sql_syntax_with_explain(self, sql_query: str, use_case_id: str) -> tuple:
        """
        Validation stub: execution-based validation (the LIMIT 1 run) is currently bypassed.
        
        Every query is reported as valid so that it proceeds to the Fix phase,
        which is responsible for static analysis. Execution validation runs only
        where it is requested explicitly (for example, for PDF examples).
        
        Args:
            sql_query: The SQL query to validate (currently unused)
            use_case_id: Use case ID for logging (currently unused)
            
        Returns:
            tuple: (is_valid: bool, error_message: str or None); always (True, None)
        """
        # Execution validation is skipped entirely; the Fix phase catches issues statically
        return (True, None)
    
    def _normalize_table_name(self, table_name: str) -> str:
        """Return the table name lower-cased with all backticks removed ('' for empty input)."""
        if not table_name:
            return ""
        return table_name.replace('`', '').lower()
    
    def _extract_sql_and_columns_from_response(self, sql_response: str) -> tuple:
        """
        Extract the SQL text and the list of referenced columns from an LLM response.
        
        Two response shapes are recognised after preamble lines and markdown fences are stripped:
        1. Plain SQL starting with '--', 'WITH ' or 'SELECT '; columns come from a
           'COLUMNS_USED: a, b' line if one is present.
        2. JSON (object or list of objects) with the SQL under 'sql' or 'query' and the
           columns under 'columns_used', 'columns' or 'involved_columns'.
        If neither shape yields SQL, the raw response is returned as the SQL text.
        
        Returns:
            tuple: (sql_text: str, columns_used: list)
        """
        sql_text = sql_response or ""
        columns_used = []
        if not sql_response:
            return sql_text, columns_used
        
        response_stripped = sql_response.strip()
        
        # Strip preamble lines the LLM may emit before the SQL (schema checks,
        # status lines, horizontal rules, bold headings). Each pattern blanks
        # every line it matches; the rest of the response is left untouched.
        sql_start_patterns = [
            (r'^#\s*SCHEMA.*', re.MULTILINE),  # # SCHEMA VALIDATION CHECK
            (r'^Checking.*', re.MULTILINE),    # Checking "AVAILABLE TABLES..."
            (r'^✅.*', re.MULTILINE),           # ✅ SCHEMA PROVIDED
            (r'^Proceeding.*', re.MULTILINE),  # Proceeding with SQL generation...
            (r'^---+\s*$', re.MULTILINE),      # Horizontal lines ---
            (r'^\*\*.*\*\*\s*$', re.MULTILINE), # **bold text**
        ]
        for pattern, flags in sql_start_patterns:
            response_stripped = re.sub(pattern, '', response_stripped, flags=flags)
        
        # Remove markdown code fences
        response_stripped = re.sub(r'^```sql\s*\n?', '', response_stripped, flags=re.IGNORECASE | re.MULTILINE)
        response_stripped = re.sub(r'^```\s*\n?', '', response_stripped, flags=re.MULTILINE)
        response_stripped = re.sub(r'\n?```\s*$', '', response_stripped, flags=re.MULTILINE)
        
        response_stripped = response_stripped.strip()
        
        if response_stripped.startswith('--') or response_stripped.upper().startswith(('WITH ', 'SELECT ')):
            sql_text = response_stripped
            for line in response_stripped.splitlines():
                if line.strip().upper().startswith("COLUMNS_USED"):
                    raw = line.split(":", 1)[1] if ":" in line else ""
                    cols = re.split(r'[;,]', raw)
                    columns_used = [c.strip() for c in cols if c.strip()]
                    break
            return sql_text, columns_used
        
        if response_stripped.startswith('{') or response_stripped.startswith('['):
            cleaned = clean_json_response(response_stripped)
            try:
                parsed = json.loads(cleaned)
                if isinstance(parsed, dict):
                    extracted_sql = parsed.get("sql") or parsed.get("query")
                    if extracted_sql and len(extracted_sql.strip()) > 20:
                        sql_text = extracted_sql
                    cols = parsed.get("columns_used") or parsed.get("columns") or parsed.get("involved_columns")
                    if isinstance(cols, list):
                        columns_used = cols
                elif isinstance(parsed, list):
                    for item in parsed:
                        if isinstance(item, dict):
                            if "sql" in item:
                                extracted_sql = item.get("sql")
                                if extracted_sql and len(extracted_sql.strip()) > 20:
                                    sql_text = extracted_sql
                            cols = item.get("columns_used") or item.get("columns") or item.get("involved_columns")
                            if isinstance(cols, list):
                                columns_used = cols
                                break
            except Exception:
                pass
        
        if not columns_used:
            for line in response_stripped.splitlines():
                if line.strip().upper().startswith("COLUMNS_USED"):
                    raw = line.split(":", 1)[1] if ":" in line else ""
                    cols = re.split(r'[;,]', raw)
                    columns_used = [c.strip() for c in cols if c.strip()]
                    break
        return sql_text, columns_used
    
    def _validate_columns_used(self, use_case_id: str, columns_used: list, directly_involved_tables: set, schema_index: dict, full_schema_details: list) -> tuple:
        """
        Check that every column reported by the LLM exists on a directly involved table.
        
        Each entry of columns_used is either a string or a dict with 'table'/'fq_table'
        and 'columns'/'cols'. A column is valid only if it is a four-part name
        (catalog.schema.table.column) whose table is directly involved and whose
        column is present in the schema. Comparison is case-insensitive and ignores backticks.
        
        Returns:
            tuple: (is_valid: bool, invalid: list of rejected names, normalized_used: list of all names)
        """
        normalized_tables = set(self._normalize_table_name(t) for t in directly_involved_tables)
        allowed_columns = set()
        if schema_index:
            for table_key, details in schema_index.items():
                norm_table = self._normalize_table_name(table_key)
                if norm_table in normalized_tables:
                    for detail in details:
                        (catalog, schema, table, column_name, _, _) = detail
                        allowed_columns.add(f"{catalog}.{schema}.{table}.{column_name}".lower())
        else:
            for detail in full_schema_details:
                (catalog, schema, table, column_name, _, _) = detail
                norm_table = self._normalize_table_name(f"{catalog}.{schema}.{table}")
                if norm_table in normalized_tables:
                    allowed_columns.add(f"{catalog}.{schema}.{table}.{column_name}".lower())
        normalized_used = []
        invalid = []
        def handle_column(col_value: str):
            norm = col_value.replace('`', '').strip()
            normalized_used.append(norm)
            parts = norm.split('.')
            if len(parts) != 4:
                invalid.append(norm)
                return
            table_norm = ".".join(parts[:3]).lower()
            col_norm = norm.lower()
            if table_norm not in normalized_tables or col_norm not in allowed_columns:
                invalid.append(norm)
        for col in columns_used:
            if isinstance(col, dict):
                table_val = col.get("table") or col.get("fq_table") or ""
                cols_val = col.get("columns") or col.get("cols") or []
                if isinstance(cols_val, list):
                    for c in cols_val:
                        handle_column(f"{table_val}.{c}" if table_val else str(c))
                else:
                    handle_column(str(cols_val))
            else:
                handle_column(str(col))
        is_valid = len(invalid) == 0
        return is_valid, invalid, normalized_used
    
    def _process_sql_candidate(self, use_case: dict, sql_response: str, tables_involved_str: str, directly_involved_schema: str, directly_involved_tables: set, full_schema_details: list, schema_index: dict) -> dict:
        """
        Clean an LLM SQL response and record the outcome on the use case.
        
        Sets 'SQL' and 'sql_generation_status', then 'Involved Columns' and
        'column_validation_status', and finally 'sql_validation_status',
        'sql_validation_error', 'generated' and 'validated'. Responses shorter than
        20 characters (before or after cleaning) are replaced with a placeholder query
        and marked as failed.
        
        Returns:
            The same use case dict, updated in place
        """
        use_case_id = use_case.get('No', 'UNKNOWN')
        use_case_name = use_case.get('Name', '')[:50]
        if not sql_response or len(sql_response.strip()) < 20:
            use_case['sql_generation_status'] = 'failed'
            use_case['SQL'] = (
                f"-- Use Case: {use_case_id} - {use_case_name}\n"
                f"-- Empty LLM response\n"
                f"-- Tables Involved: {tables_involved_str}\n"
                f"SELECT 'Empty LLM Response' AS error_message;\n"
                f"-- If you run the query and it is not valid, set IsValid to No and run Inspire again with 'Generate = SQL Regeneration'; Inspire will then regenerate a new query for you to validate. You can also pass special instructions in the field below:\n"
                f"--SQL Generation Instructions Begin\n"
                f"--\n"
                f"--SQL Generation Instructions End"
            )
            return use_case
        sql_text, columns_from_response = self._extract_sql_and_columns_from_response(sql_response)
        sql_clean = sql_text.strip()
        if sql_clean.startswith('```'):
            sql_clean = re.sub(r'^```[a-z]*\n', '', sql_clean)
            sql_clean = re.sub(r'\n```$', '', sql_clean)
        sql_clean = re.sub(
            r"parameters\s*=>\s*(\{[^}]+\})",
            r'parameters => "\1"',
            sql_clean
        )
        if not sql_clean or len(sql_clean.strip()) < 20:
            use_case['sql_generation_status'] = 'failed'
            use_case['SQL'] = (
                f"-- Use Case: {use_case_id} - {use_case_name}\n"
                f"-- Empty SQL after cleaning\n"
                f"-- Tables Involved: {tables_involved_str}\n"
                f"SELECT 'Empty SQL after cleaning' AS error_message;\n"
                f"-- If you run the query and it is not valid, set IsValid to No and run Inspire again with 'Generate = SQL Regeneration'; Inspire will then regenerate a new query for you to validate. You can also pass special instructions in the field below:\n"
                f"--SQL Generation Instructions Begin\n"
                f"--\n"
                f"--SQL Generation Instructions End"
            )
            return use_case
        use_case['SQL'] = sql_clean
        use_case['sql_generation_status'] = 'succeeded'
        if columns_from_response and directly_involved_tables:
            is_valid_cols, invalid_cols, normalized_cols = self._validate_columns_used(use_case_id, columns_from_response, directly_involved_tables, schema_index, full_schema_details)
            involved_cols_value = normalized_cols if normalized_cols else columns_from_response
            use_case['Involved Columns'] = ", ".join(involved_cols_value)
            if not is_valid_cols:
                use_case['column_validation_status'] = 'failed'
                use_case['sql_validation_status'] = 'failed'
                use_case['sql_validation_error'] = f"Invalid columns: {', '.join(invalid_cols)}"
                self.logger.warning(f"[{use_case_id}] Column validation failed: {', '.join(invalid_cols)}")
                use_case['SQL'] = f"-- ❌ COLUMN VALIDATION FAILED: {', '.join(invalid_cols)}\n{sql_clean}"
                return use_case
            use_case['column_validation_status'] = 'passed'
        else:
            use_case['Involved Columns'] = ", ".join(columns_from_response) if columns_from_response else ""
            use_case['column_validation_status'] = 'skipped'
        use_case['generated'] = 'Y'
        try:
            is_valid, error_msg = self._validate_sql_syntax_with_explain(sql_clean, use_case_id)
            if not is_valid and error_msg:
                use_case['sql_validation_status'] = 'failed'
                use_case['sql_validation_error'] = error_msg
                use_case['validated'] = 'N'
                use_case['SQL'] = f"-- ⚠️ VALIDATION WARNING: {error_msg[:200]}\n-- SQL may have syntax or runtime errors\n\n{sql_clean}"
            else:
                use_case['sql_validation_status'] = 'passed'
                use_case['sql_validation_error'] = None
                use_case['validated'] = 'Y'
        except Exception as validation_error:
            use_case['sql_validation_status'] = 'skipped'
            use_case['sql_validation_error'] = str(validation_error)[:200]
            use_case['validated'] = 'D'
        return use_case
    
    def _truncate_schema_columns(self, schema_text: str, max_columns_per_table: int) -> str:
        """
        Truncate schema text to limit columns per table.
        
        This is used when input context is too large - we progressively reduce
        the number of columns per table to fit within the model's context limit.
        
        Handles multiple schema formats:
        1. Plain text format: "Table: catalog.schema.table" followed by "Columns:" and "  - col_name (TYPE)"
        2. Markdown format: "### Table: name" or "**Table:**" with "- column" lines
        3. Pipe-delimited format: "| column_name | TYPE |" table rows
        
        Args:
            schema_text: The schema markdown text with table definitions
            max_columns_per_table: Maximum number of columns to keep per table
            
        Returns:
            Truncated schema text with limited columns per table
        """
        import re
        
        if not schema_text or max_columns_per_table <= 0:
            return schema_text
            
        lines = schema_text.split('\n')
        result_lines = []
        current_table = None
        current_table_name = None
        column_count = 0
        total_columns_in_table = 0
        truncated_notice_added = False
        in_columns_section = False
        
        # Table header patterns (multiple formats supported)
        table_patterns = [
            r'^Table:\s*(.+)$',                    # Plain text: "Table: catalog.schema.table"
            r'^###\s+Table:\s*(.+)$',              # Markdown H3: "### Table: name"
            r'^##\s+Table:\s*(.+)$',               # Markdown H2: "## Table: name"
            r'^\*\*Table:\*\*\s*(.+)$',            # Bold markdown: "**Table:** name"
            r'^###\s+`?([^`]+)`?\s*$',             # Markdown H3 with backticks: "### `catalog.schema.table`"
            r'^--\s*Table:\s*(.+)$',               # SQL comment: "-- Table: name"
        ]
        
        # Column definition patterns (multiple formats supported)
        column_patterns = [
            r'^\s+-\s+\w+',                        # Markdown list: "  - column_name"
            r'^\s+\*\s+\w+',                       # Markdown asterisk: "  * column_name"
            r'^\|\s*\w+\s*\|',                     # Pipe table: "| column_name |"
            r'^\s+\d+\.\s+\w+',                    # Numbered list: "  1. column_name"
            r'^\s{2,}\w+\s*[\(\:]',                # Indented with type: "    column_name (TYPE)"
            r'^\s{2,}\w+\s*$',                     # Plain indented: "    column_name"
        ]
        
        for line in lines:
            stripped = line.strip()
            
            # Check for table header
            is_table_header = False
            for pattern in table_patterns:
                match = re.match(pattern, stripped, re.IGNORECASE)
                if match:
                    # New table - reset state
                    current_table = line
                    current_table_name = match.group(1).strip() if match.groups() else stripped
                    column_count = 0
                    total_columns_in_table = 0
                    truncated_notice_added = False
                    in_columns_section = False
                    is_table_header = True
                    result_lines.append(line)
                    break
            
            if is_table_header:
                continue
                
            # Check for "Columns:" header (plain text format)
            if stripped.lower() == 'columns:':
                in_columns_section = True
                result_lines.append(line)
                continue
            
            # Check if this is a column definition
            is_column_line = False
            if current_table:
                for pattern in column_patterns:
                    if re.match(pattern, line):
                        is_column_line = True
                        break
                        
                # Additional heuristic: indented lines after "Columns:" are likely columns
                if not is_column_line and in_columns_section and line.startswith('  ') and stripped:
                    is_column_line = True
            
            if is_column_line:
                total_columns_in_table += 1
                column_count += 1
                if column_count <= max_columns_per_table:
                    result_lines.append(line)
                elif not truncated_notice_added:
                    # Add truncation notice once per table
                    indent = '  ' if line.startswith('  ') else ''
                    result_lines.append(f"{indent}- ... (schema truncated to first {max_columns_per_table} columns for context reduction)")
                    truncated_notice_added = True
            else:
                # Non-column line (empty, separator, comments, etc.)
                # Reset columns section if we hit a non-indented, non-empty line that's not a column
                if stripped and not line.startswith(' ') and in_columns_section:
                    in_columns_section = False
                result_lines.append(line)
                
        truncated_schema = '\n'.join(result_lines)
        original_len = len(schema_text)
        new_len = len(truncated_schema)
        reduction_pct = ((original_len - new_len) / original_len * 100) if original_len > 0 else 0
        
        self.logger.info(f"   Schema truncated: {original_len:,} -> {new_len:,} chars ({reduction_pct:.1f}% reduction, max {max_columns_per_table} cols/table)")
        
        return truncated_schema
    
    def _calculate_adaptive_sql_timeout(self, use_case: dict) -> int:
        """
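        Calculate an adaptive timeout for SQL generation from the query's estimated complexity.
        
        Complexity is approximated by counting occurrences of the substring 'CTE'
        in the use case's 'Technical Design' text.
        
        Formula: timeout = min(max_timeout, base_timeout + cte_count * per_cte_timeout)
        
        Examples (with base_timeout=180, per_cte_timeout=30, max_timeout=360):
            - 0-2 CTEs: 180-240 seconds
            - 3-4 CTEs: 270-300 seconds
            - 5-6 CTEs: 330-360 seconds
            - 7+ CTEs: 360 seconds (capped)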
        
        Args:
            use_case: Use case dictionary containing 'Technical Design' field
            
        Returns:
            Timeout in seconds
        """
        technical_design = use_case.get('Technical Design', '')
        cte_count = technical_design.count('CTE') if technical_design else 0
        
        adaptive_timeout = min(
            self.sql_generation_max_timeout,
            self.sql_generation_base_timeout + (cte_count * self.sql_generation_per_cte_timeout)
        )
        
        return adaptive_timeout
    
    def _generate_sql_for_use_case(self, use_case: dict, full_schema_details: list, unstructured_docs: str, schema_index: dict = None) -> dict:
        """
        Generate SQL for a single use case using a zero-shot LLM call.
        Prioritizes tables directly involved in the use case, then adds additional tables if space allows.
        
        Args:
            use_case: Use case dictionary
            full_schema_details: Full list of (catalog, schema, table, column, type, comment) tuples;
                scanned directly when no schema index is provided
            unstructured_docs: Unstructured documents markdown
            schema_index: Optional pre-built schema index for O(1) lookups
                (defaultdict mapping table_name -> [details])
            
        Returns:
            Use case dict with SQL field populated
        """
        import time
        use_case_id = use_case.get('No', 'UNKNOWN')
        use_case_name = use_case.get('Name', '')[:50]
        use_case_columns = use_case.get('Involved Columns') or use_case.get('Columns Involved') or ""
        
        try:
            start_time = time.time()
            self.logger.info(f"🔧 [{use_case_id}] Starting SQL generation...")
            
            # Extract directly involved tables from use case
            tables_involved_str = use_case.get('Tables Involved', '')
            directly_involved_tables = set()
            
            self.logger.debug(f"[{use_case_id}] Parsing tables from 'Tables Involved' field...")
            # Parse tables from "Tables Involved" field
            if tables_involved_str:
                # Handle both comma-separated and space-separated lists
                table_parts = re.split(r'[,\s]+', tables_involved_str)
                for part in table_parts:
                    part = part.strip().strip('`').strip()
                    if part and '.' in part:  # Looks like a qualified table reference
                        # Stored as given (outer backticks stripped); expected form is catalog.schema.table
                        directly_involved_tables.add(part)

            # Ensure foreign keys are loaded for all involved tables (critical for joins)
            if self.data_loader and hasattr(self.data_loader, '_get_foreign_keys'):
                for tbl in directly_involved_tables:
                    cat, sch, tbl_name = parse_three_level_name(tbl)
                    if cat and sch and tbl_name:
                        key = (cat, sch, tbl_name)
                        if key not in self.data_loader.foreign_key_graph:
                            try:
                                self.data_loader._get_foreign_keys(cat, sch, tbl_name)
                            except Exception:
                                pass  # FK loading is best-effort
            
            directly_involved_tables, fk_relationships = self._expand_tables_with_foreign_keys(directly_involved_tables)
            
            # VALIDATION: Check for critical failures (missing tables)
            validation_failed = False
            failure_reasons = []
            
            # Validate that the use case lists tables (a /Volumes path is accepted for unstructured data)
            if not tables_involved_str or tables_involved_str.strip() == "":
                validation_failed = True
                failure_reasons.append("No tables involved - every use case MUST reference at least one table")
                self.logger.error(f"❌ Use case {use_case_id}: FAILED - No tables specified in 'Tables Involved' field")
            elif not tables_involved_str.startswith('/Volumes') and not directly_involved_tables:
                validation_failed = True
                failure_reasons.append("Invalid table references - no valid fully-qualified tables found")
                self.logger.error(f"❌ Use case {use_case_id}: FAILED - Invalid table references in 'Tables Involved' field")
            
            if validation_failed:
                self.logger.error(f"{'='*80}")
                self.logger.error(f"❌ Use case {use_case_id} VALIDATION FAILED:")
                for reason in failure_reasons:
                    self.logger.error(f"  • {reason}")
                self.logger.error(f"{'='*80}")
                
                use_case['SQL'] = (
                    f"-- ❌ VALIDATION FAILED: Use case cannot be generated\n"
                    f"-- Use Case ID: {use_case_id}\n"
                    f"-- Use Case Name: {use_case_name}\n"
                    f"-- Failure Reasons:\n"
                    + "\n".join([f"--   • {reason}" for reason in failure_reasons]) + "\n"
                    f"-- \n"
                    f"-- CRITICAL: This use case was generated without required components.\n"
                    f"-- Tables Involved field must list at least one table.\n"
                    f"SELECT 'VALIDATION_FAILED' AS error_message, \n"
                    f"       '{', '.join(failure_reasons)}' AS failure_reasons;"
                )
                use_case['sql_generation_status'] = 'failed'
                return use_case
            
            
            # Build prioritized schema context
            # PRIORITY 1: Tables directly involved (MUST be included)
            directly_involved_details = []
            additional_details = []
            
            # === CHECK FOR PRE-POPULATED SCHEMA FROM REGENERATION MODE ===
            # In SQL Regeneration mode, the interpretation phase may have dynamically loaded
            # schema for tables requested by the user that aren't in the main schema index.
            # If pre-populated schema exists and contains actual schema text (not just IDs),
            # we should APPEND it to the dynamically built schema later.
            prepopulated_schema = use_case.get('directly_involved_schema', '')
            has_prepopulated_schema = prepopulated_schema and '\n' in prepopulated_schema and 'Table:' in prepopulated_schema
            if has_prepopulated_schema:
                self.logger.info(f"[{use_case_id}] Found pre-populated schema from regeneration mode ({len(prepopulated_schema)} chars)")
            
            # === PERFORMANCE OPTIMIZATION: Use schema index for O(1) lookups ===
            if schema_index:
                # Fast path: Use pre-built index for instant lookups
                self.logger.debug(f"[{use_case_id}] Using schema index for fast lookup...")
                for involved_table in directly_involved_tables:
                    # Try both with and without backticks
                    table_details = schema_index.get(involved_table, [])
                    if not table_details:
                        # Try without backticks
                        table_no_backticks = involved_table.replace('`', '')
                        table_details = schema_index.get(table_no_backticks, [])
                    directly_involved_details.extend(table_details)
                
                # Collect other tables for additional context
                all_tables_in_schema = set()
                for detail in full_schema_details[:100]:  # Only the first 100 column rows are scanned
                    (catalog, schema, table, _, _, _) = detail
                    fqtn = f"{catalog}.{schema}.{table}"
                    if fqtn not in directly_involved_tables:
                        all_tables_in_schema.add(fqtn)
                
                # Add a small sample of additional tables (limit to avoid bloat).
                # Sorting makes the selection deterministic; set iteration order is not.
                for other_table in sorted(all_tables_in_schema)[:5]:  # Max 5 additional tables
                    additional_details.extend(schema_index.get(other_table, []))
            else:
                # Slow path: Iterate through all details (legacy fallback)
                self.logger.debug(f"[{use_case_id}] Building schema context from {len(full_schema_details)} total columns (no index - slow path)...")
                for detail in full_schema_details:
                    (catalog, schema, table, column_name, data_type, comment) = detail
                    fqtn = f"{catalog}.{schema}.{table}"
                    
                    # Check if this table is directly involved
                    is_directly_involved = any(
                        fqtn == involved_table or 
                        fqtn.replace('`', '') == involved_table or
                        f"`{catalog}`.`{schema}`.`{table}`" == involved_table
                        for involved_table in directly_involved_tables
                    )
                    
                    if is_directly_involved:
                        directly_involved_details.append(detail)
                    else:
                        additional_details.append(detail)
            
            # Log schema context (special handling for volume paths)
            if tables_involved_str.startswith('/Volumes'):
                self.logger.info(f"   [{use_case_id}] Volume path use case (ai_parse_document): {tables_involved_str}")
                self.logger.info(f"   [{use_case_id}] No table schema needed (document processing)")
            else:
                # Count CTEs from Technical Design field
                technical_design = use_case.get('Technical Design', '')
                cte_count = technical_design.count('CTE') if technical_design else 0
                
                self.logger.info(f"   [{use_case_id}] {cte_count} CTEs from Technical Design, {len(directly_involved_details)} columns from directly involved tables, "
                                f"{len(additional_details)} additional columns")
            
            # VALIDATION: Check if directly involved tables were found in schema
            # BUT: Skip this check for volume path use cases (ai_parse_document)
            is_volume_path_use_case = tables_involved_str.startswith('/Volumes')
            if len(directly_involved_details) == 0 and len(directly_involved_tables) > 0 and not is_volume_path_use_case:
                self.logger.error(f"❌ Use case {use_case_id}: SCHEMA MISMATCH - Tables specified but not found in schema!")
                self.logger.error(f"   Specified tables: {directly_involved_tables}")
                
                # Try to find similar table names in schema to help debugging
                available_tables = set()
                for detail in full_schema_details:
                    (catalog, schema, table, _, _, _) = detail
                    available_tables.add(f"{catalog}.{schema}.{table}")
                
                if len(available_tables) > 0:
                    self.logger.error(f"   Available tables in schema ({len(available_tables)} total): {sorted(list(available_tables))[:10]}...")
                    
                    # Try fuzzy matching to find similar table names
                    from difflib import get_close_matches
                    for specified_table in directly_involved_tables:
                        matches = get_close_matches(specified_table, available_tables, n=3, cutoff=0.6)
                        if matches:
                            self.logger.info(f"   💡 Did you mean? {specified_table} → {matches}")
                else:
                    self.logger.error("   Schema is completely empty - no tables available!")
                    self.logger.error("   This likely means the business vs technical filter removed ALL tables!")
                
                # Mark as failed
                use_case['SQL'] = (
                    f"-- ❌ SCHEMA MISMATCH: Tables not found in schema\n"
                    f"-- Use Case ID: {use_case_id}\n"
                    f"-- Use Case Name: {use_case_name}\n"
                    f"-- Specified Tables: {', '.join(directly_involved_tables)}\n"
                    f"-- Available Tables (sample): {', '.join(sorted(list(available_tables))[:5]) if available_tables else 'NONE - Schema is empty!'}\n"
                    f"-- \n"
                    f"-- CRITICAL: The tables specified in 'Tables Involved' field do not exist in the provided schema.\n"
                    f"-- This could mean:\n"
                    f"--   1. Table names were hallucinated during use case generation\n"
                    f"--   2. Business vs technical filter removed these tables incorrectly\n"
                    f"--   3. Schema was not properly loaded from the database\n"
                    f"--   4. Table names have incorrect catalog/schema prefixes\n"
                    f"-- \n"
                    f"-- RECOMMENDATION: Check the business vs technical table filtering settings.\n"
                    f"-- If these tables contain business data, adjust the exclusion strategy.\n"
                    f"SELECT 'SCHEMA_MISMATCH' AS error_message, \n"
                    f"       '{', '.join(directly_involved_tables)}' AS missing_tables,\n"
                    f"       '{', '.join(sorted(list(available_tables))[:3]) if available_tables else 'EMPTY'}' AS available_tables_sample;"
                )
                use_case['sql_generation_status'] = 'failed'
                return use_case
            
            # Build schema context with prioritization
            # CRITICAL: Ensure we respect model-specific context limits from TECHNICAL_CONTEXT
            sql_gen_context_limit = get_max_context_chars("English", "USE_CASE_SQL_GEN_PROMPT")
            base_prompt_size = 50000  # Approximate size of base prompt template (includes AI functions, solution accelerators, etc.)
            max_schema_size = sql_gen_context_limit - base_prompt_size - 5000  # 5000 buffer for safety
            
            self.logger.debug(f"Use case {use_case_id}: Max schema size allowed: {max_schema_size:,} chars")
            self.logger.debug(f"Use case {use_case_id}: Directly involved: {len(directly_involved_details)} columns, "
                            f"Additional: {len(additional_details)} columns, "
                            f"Unstructured docs: {len(unstructured_docs):,} chars")
            
            # Apply progressive truncation strategy
            directly_involved_schema, additional_schema, final_unstructured_docs, was_truncated = self._apply_progressive_truncation(
                use_case_id,
                directly_involved_details,
                additional_details,
                unstructured_docs,
                max_schema_size,
                base_prompt_size,
                directly_involved_tables  # Pass the tables that must be preserved
            )
            
            # Update unstructured_docs if it was dropped during truncation
            if final_unstructured_docs != unstructured_docs:
                unstructured_docs = final_unstructured_docs
                if was_truncated:
                    self.logger.warning(f"Use case {use_case_id}: Unstructured documents were dropped to fit context limits")
            
            final_directly_size = len(directly_involved_schema)
            final_additional_size = len(additional_schema)
            final_schema_size = final_directly_size + final_additional_size
            self.logger.debug(f"Use case {use_case_id}: Final schema size: {final_schema_size:,} chars (directly involved: {final_directly_size:,}, additional: {final_additional_size:,}, max: {max_schema_size:,} chars)")
            
            if was_truncated:
                self.logger.info(f"Use case {use_case_id}: Progressive truncation applied successfully. Final size: {final_schema_size:,} chars")
            
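            # Start from the directly involved schema, then try to add sibling tables
            # (other business tables in the same catalog.schema) if the result still
            # fits within max_schema_size.
            available_schema_out = directly_involved_schema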
            try:
                if hasattr(self, "_business_column_details_global") and directly_involved_tables:
                    involved_schemas = set()
                    for tbl in directly_involved_tables:
                        cat, sch, _ = parse_three_level_name(tbl)
                        if cat and sch:
                            involved_schemas.add((cat, sch))
                    sibling_details = []
                    involved_plain = {normalize_identifier(t).replace('`', '') for t in directly_involved_tables}
                    for (catalog, schema, table, column_name, data_type, comment) in self._business_column_details_global:
                        if (catalog, schema) in involved_schemas:
                            fqtn_plain = f"{catalog}.{schema}.{table}"
                            if fqtn_plain not in involved_plain:
                                sibling_details.append((catalog, schema, table, column_name, data_type, comment))
                    if sibling_details:
                        sibling_schema = self._format_schema_for_prompt(sibling_details, load_column_tracking=True)
                        if sibling_schema:
                            candidate_schema = directly_involved_schema + ("\n" + sibling_schema if directly_involved_schema else sibling_schema)
                            if len(candidate_schema) <= max_schema_size:
                                available_schema_out = candidate_schema
                            else:
                                if len(directly_involved_schema) > max_schema_size:
                                    available_schema_out = ""
                                else:
                                    available_schema_out = directly_involved_schema
                if len(available_schema_out) > max_schema_size:
                    available_schema_out = ""
            except Exception as e:
                self.logger.debug(f"Use case {use_case_id}: Failed to add sibling schema: {str(e)[:120]}")
                available_schema_out = directly_involved_schema
            
            # Render de-duplicated foreign key relationships as a markdown bullet list
            if fk_relationships:
                unique_fk = sorted(set(fk_relationships))
                fk_relationships_md = "\n".join(f"- {rel}" for rel in unique_fk)
            else:
                fk_relationships_md = "None"
            
            # === MERGE PRE-POPULATED SCHEMA FROM REGENERATION MODE ===
            # If the interpretation phase dynamically loaded schema for user-requested tables,
            # append it to the available schema to prevent hallucination
            if has_prepopulated_schema:
                # Check if prepopulated schema contains tables not already in available_schema_out
                # Parse table names from prepopulated schema
                prepopulated_tables = set()
                for line in prepopulated_schema.split('\n'):
                    if line.strip().startswith('Table:'):
                        tbl_name = line.replace('Table:', '').strip()
                        prepopulated_tables.add(tbl_name)
                
                # Check which tables are NOT in the schema we already built
                # (strip backticks on both sides so quoted identifiers still match)
                schema_text_plain = available_schema_out.replace('`', '')
                new_tables = [
                    tbl for tbl in sorted(prepopulated_tables)
                    if tbl.replace('`', '') not in schema_text_plain
                ]
                
                if new_tables:
                    self.logger.info(f"[{use_case_id}] Appending dynamically loaded schema for tables: {new_tables}")
                    # Append the prepopulated schema to include user-requested tables
                    if available_schema_out:
                        available_schema_out = available_schema_out + "\n\n-- DYNAMICALLY LOADED TABLES (user-requested in regeneration instructions) --\n" + prepopulated_schema
                    else:
                        available_schema_out = prepopulated_schema
                    self.logger.info(f"[{use_case_id}] Total schema size after merge: {len(available_schema_out)} chars")
            
            # Prepare prompt variables
            # CRITICAL: To limit hallucination, the LLM only sees the schema assembled above:
            # the directly involved tables, plus sibling tables from the same schemas (when they fit)
            # and any dynamically loaded, user-requested tables. No other tables are provided.
            user_instructions = use_case.get('_user_instructions', '')
            previous_feedback = ""
            if user_instructions:
                previous_feedback = f"**USER INSTRUCTIONS (MUST FOLLOW):**\nThe user has provided the following specific instructions for generating this SQL query. You MUST follow these instructions:\n\n{user_instructions}\n"
                self.logger.info(f"[{use_case_id}] Passing SQL Generation Instructions to LLM: {user_instructions[:200]}...")
                log_print(f"   📝 [{use_case_id}] Including user SQL instructions in prompt")
            
            # Get interpreted regeneration context if present (only populated during SQL Regeneration mode)
            interpreted_regeneration_context = use_case.get('_interpreted_regeneration_context', '')
            if interpreted_regeneration_context:
                self.logger.info(f"[{use_case_id}] Including interpreted regeneration context in SQL generation prompt")
            
            # Get enriched business context from merged_business_context
            # (fall back to an empty dict if the attribute is missing or None)
            enriched_ctx = getattr(self, 'merged_business_context', None) or {}

            def _ctx_as_text(key: str, default: str) -> str:
                # Values may be stored as a string or as a list of items; normalize to a string.
                # Missing or None values fall back to the default instead of raising in join().
                value = enriched_ctx.get(key)
                if isinstance(value, str):
                    return value
                if isinstance(value, (list, tuple)):
                    return ', '.join(str(item) for item in value)
                return default

            prompt_vars = {
                "use_case_id": use_case_id,
                "use_case_name": use_case.get('Name', ''),
                "business_domain": use_case.get('Business Domain', ''),
                "statement": use_case.get('Statement', ''),
                "solution": use_case.get('Solution', ''),
                "tables_involved": tables_involved_str,
                "directly_involved_schema": available_schema_out,
                "use_case_columns": use_case_columns,
                "foreign_key_relationships": fk_relationships_md,
                "unstructured_docs": unstructured_docs,
                "previous_feedback": previous_feedback,
                "interpreted_regeneration_context": interpreted_regeneration_context,
                "ai_functions_summary": generate_ai_functions_doc("summary"),
                "statistical_functions_detailed": generate_statistical_functions_doc("table"),
                "business_name": self.business_name,  # Pass business context to SQL generation
                "sql_model_serving": self.sql_model_serving,  # User-configurable model for ai_query in generated SQL
                # Enriched business context for persona enrichment in ai_query prompts
                "enriched_business_context": enriched_ctx.get('business_context', 'General business operations'),
                "enriched_strategic_goals": _ctx_as_text('strategic_goals', 'Operational excellence'),
                "enriched_business_priorities": _ctx_as_text('business_priorities', 'Digital transformation'),
                "enriched_strategic_initiative": enriched_ctx.get('strategic_initiative', 'Data-driven decision making'),
                "enriched_value_chain": enriched_ctx.get('value_chain', 'Standard business operations'),
                "enriched_revenue_model": enriched_ctx.get('revenue_model', 'Diverse revenue streams')
            }
            use_case['_directly_involved_schema'] = available_schema_out
            use_case['_directly_involved_tables'] = list(directly_involved_tables)
            
            try:
                # Final pre-flight check: Estimate actual prompt size
                test_prompt = self.ai_agent._load_and_format_prompt("USE_CASE_SQL_GEN_PROMPT", prompt_vars)
                estimated_prompt_size = len(test_prompt)
                
                if estimated_prompt_size > sql_gen_context_limit:
                    self.logger.error(
                        f"Use case {use_case_id}: Estimated prompt size ({estimated_prompt_size:,} chars) STILL exceeds model limit ({sql_gen_context_limit:,}). "
                        f"Tables involved: {tables_involved_str}. "
                        f"Schema size: {final_schema_size:,} chars. "
                        f"Involved tables have {len(directly_involved_details)} columns total."
                    )
                    # Set error SQL and return use_case
                    use_case['SQL'] = (
                        f"-- ERROR: Context too large for AI SQL generation\n"
                        f"-- Use Case: {use_case_id}\n"
                        f"-- Tables: {tables_involved_str}\n"
                        f"-- Estimated prompt: {estimated_prompt_size:,} chars (limit: {sql_gen_context_limit:,})\n"
                        f"-- Schema size: {final_schema_size:,} chars\n"
                        f"-- Total columns: {len(directly_involved_details)}\n"
                        f"-- RESOLUTION: Manually write SQL or reduce tables/columns involved\n"
                        f"SELECT 'Context too large - manual SQL required' AS error_message;"
                    )
                    use_case['sql_generation_status'] = 'failed'
                    return use_case
                
                self.logger.debug(f"Use case {use_case_id}: Estimated prompt size: {estimated_prompt_size:,} chars (OK)")
                
                # Check if schema is empty (but allow for volume paths in ai_parse_document use cases)
                is_volume_path = tables_involved_str.startswith('/Volumes')
                if (not directly_involved_schema or directly_involved_schema.strip() == "") and not is_volume_path:
                    self.logger.error(f"Use case {use_case_id}: Schema is EMPTY! Cannot generate SQL without table definitions.")
                    self.logger.error(f"Use case {use_case_id}: Tables involved: {tables_involved_str}")
                    self.logger.error(f"Use case {use_case_id}: Directly involved tables: {directly_involved_tables}")
                    self.logger.error(f"Use case {use_case_id}: Directly involved details count: {len(directly_involved_details)}")
                    
                    # Check if tables exist in full schema
                    available_tables = set()
                    for detail in full_schema_details:
                        (catalog, schema, table, _, _, _) = detail
                        available_tables.add(f"{catalog}.{schema}.{table}")
                    
                    if available_tables:
                        self.logger.error(f"Use case {use_case_id}: Full schema has {len(available_tables)} tables available")
                        self.logger.error(f"Use case {use_case_id}: This suggests the business vs technical filter may be too aggressive")
                    else:
                        self.logger.error(f"Use case {use_case_id}: Full schema is also empty - database schema not loaded!")
                    
                    use_case['SQL'] = (
                        f"-- CRITICAL ERROR: Schema is empty\n"
                        f"-- Use Case: {use_case_id}\n"
                        f"-- Tables Involved: {tables_involved_str}\n"
                        f"-- CRITICAL: Required tables are NOT provided in the schema\n"
                        f"-- \n"
                        f"-- DIAGNOSIS:\n"
                        f"-- - Directly involved details: {len(directly_involved_details)}\n"
                        f"-- - Full schema tables: {len(available_tables)}\n"
                        f"-- \n"
                        f"-- POSSIBLE CAUSES:\n"
                        f"-- 1. Business vs technical table filter removed these tables\n"
                        f"-- 2. Tables don't exist in the database\n"
                        f"-- 3. Schema loading failed\n"
                        f"-- 4. Table names in use case are incorrect\n"
                        f"-- \n"
                        f"-- RECOMMENDATION: Review the table filtering settings and verify table names.\n"
                        f"SELECT 'Schema Empty Error' AS error_message,\n"
                        f"       {len(available_tables)} AS total_schema_tables,\n"
                        f"       '{tables_involved_str}' AS requested_tables;"
                    )
                    use_case['sql_generation_status'] = 'failed'
                    return use_case
                
                adaptive_timeout = self._calculate_adaptive_sql_timeout(use_case)
                self.logger.info(f"⏳ [{use_case_id}] Waiting for LLM response (SQL generation, timeout={adaptive_timeout}s)...")
                
                # MAIN CALL: Try with full context first (no reduction)
                sql_response = None
                needs_retry = False
                retry_error = None
                
                try:
                    sql_response = self.ai_agent.run_worker(
                        step_name=f"Generate_SQL_{use_case_id}_Wave",
                        worker_prompt_path="USE_CASE_SQL_GEN_PROMPT",
                        prompt_vars=prompt_vars,
                        response_schema=None,
                        timeout_override=adaptive_timeout,
                        max_retries_override=0
                    )
                except (InputTooLongError, TruncatedResponseError) as e:
                    needs_retry = True
                    retry_error = e
                    error_type = "Input too long" if isinstance(e, InputTooLongError) else "Response truncated"
                    self.logger.warning(f"⚠️  [{use_case_id}] {error_type} on main call - will retry with reduced context: {str(e)[:200]}")
                except Exception as e:
                    error_msg_lower = str(e).lower()
                    is_context_too_long = any(kw in error_msg_lower for kw in [
                        'input is too long', 'too long for requested model', 'input length',
                        'exceeds context limit', 'context window', 'token limit exceeded',
                        'maximum context length', 'bad_request', '400'
                    ]) and ('input' in error_msg_lower or 'length' in error_msg_lower or 'model' in error_msg_lower)
                    
                    is_timeout = any(kw in error_msg_lower for kw in ['timeout', 'timed out', 'deadline'])
                    
                    if is_context_too_long:
                        needs_retry = True
                        retry_error = InputTooLongError(str(e))
                        self.logger.warning(f"⚠️  [{use_case_id}] Context too long on main call - will retry with reduced context: {str(e)[:200]}")
                    elif is_timeout:
                        self.logger.warning(f"⏱️  Use case {use_case_id}: SQL generation timed out")
                        use_case['SQL'] = (
                            f"-- Use Case: {use_case_id} - {use_case_name}\n"
                            f"-- SQL generation timed out\n"
                            f"-- Tables Involved: {tables_involved_str}\n"
                            f"SELECT 'SQL Generation Timeout' AS error_message;\n"
                            f"--END OF GENERATED SQL"
                        )
                        use_case['sql_generation_status'] = 'timeout'
                        use_case['generated'] = 'N'
                        use_case['validated'] = 'D'
                        return use_case
                    else:
                        # Not a context or timeout error: let the outer handler deal with it
                        raise
                
                # RETRY LOOP: Only if main call failed with context/truncation error
                # Use percentage-based column reduction on retries
                if needs_retry:
                    # Progressive retry strategy with PERCENTAGE-based column reduction
                    # Retry 1: Remove additional tables, keep 100% of columns
                    # Retry 2: Keep 75% of columns per table
                    # Retry 3: Keep 50% of columns per table
                    # Retry 4: Keep 25% of columns per table
                    MAX_RETRIES = 4
                    column_reduction_percentages = [1.0, 0.75, 0.50, 0.25]  # Percentage of columns to KEEP
                    
                    # Retries truncate the directly involved schema only. Column limits use a fixed
                    # baseline of 100 columns per table (see below), so no per-schema count is needed.
                    original_schema = directly_involved_schema
                    
                    for retry_attempt in range(MAX_RETRIES):
                        try:
                            reduction_pct = column_reduction_percentages[retry_attempt]
                            
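                            if retry_attempt == 0:
                                # First retry: remove additional (sibling / dynamically loaded) tables, keep all columns.
                                # The prompt's schema variable must be reset here; otherwise this retry would
                                # resend exactly the same context as the main call.
                                self.logger.warning(f"⚠️  [{use_case_id}] Retry {retry_attempt+1}/{MAX_RETRIES}: Removing additional tables (keeping 100% columns)")
                                prompt_vars["directly_involved_schema"] = original_schema
                                prompt_vars["additional_schema"] = ""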
                            else:
                                # Subsequent retries: Also reduce columns by percentage
                                # Calculate max columns as percentage of estimated columns per table
                                # Assume average of 100 columns per table as baseline
                                max_cols_per_table = max(10, int(100 * reduction_pct))
                                self.logger.warning(f"⚠️  [{use_case_id}] Retry {retry_attempt+1}/{MAX_RETRIES}: Keeping {int(reduction_pct*100)}% columns (max {max_cols_per_table} per table)")
                                truncated_schema = self._truncate_schema_columns(original_schema, max_cols_per_table)
                                prompt_vars["directly_involved_schema"] = truncated_schema
                                prompt_vars["additional_schema"] = ""
                            
                            sql_response = self.ai_agent.run_worker(
                                step_name=f"Generate_SQL_{use_case_id}_Wave_Retry{retry_attempt+1}_{int(reduction_pct*100)}pct",
                                worker_prompt_path="USE_CASE_SQL_GEN_PROMPT",
                                prompt_vars=prompt_vars,
                                response_schema=None,
                                timeout_override=adaptive_timeout,
                                max_retries_override=0
                            )
                            
                            # Success!
                            self.logger.info(f"✅ [{use_case_id}] Retry {retry_attempt+1} succeeded with {int(reduction_pct*100)}% columns")
                            break
                            
                        except (InputTooLongError, TruncatedResponseError) as e:
                            error_type = "Input too long" if isinstance(e, InputTooLongError) else "Response truncated"
                            self.logger.warning(f"⚠️  [{use_case_id}] {error_type} on retry {retry_attempt+1}/{MAX_RETRIES}: {str(e)[:200]}")
                            
                            if retry_attempt >= MAX_RETRIES - 1:
                                # All retries exhausted
                                self.logger.error(f"❌ [{use_case_id}] All {MAX_RETRIES} retries exhausted")
                                use_case['SQL'] = (
                                    f"-- ❌ SQL GENERATION FAILED ({error_type.upper()})\n"
                                    f"-- Use Case: {use_case_id}\n"
                                    f"-- Tables Involved: {tables_involved_str}\n"
                                    f"-- Error: {error_type} even after {MAX_RETRIES} retries with progressive column reduction (100%→75%→50%→25%)\n"
                                    f"-- RESOLUTION: Manually write SQL or reduce tables/columns involved\n"
                                    f"SELECT 'Context too large - manual SQL required' AS error_message;\n"
                                    f"--END OF GENERATED SQL"
                                )
                                use_case['sql_generation_status'] = 'failed'
                                use_case['generated'] = 'N'
                                use_case['validated'] = 'D'
                                return use_case
                            continue
                            
                        except Exception as e:
                            error_msg_lower = str(e).lower()
                            is_context_too_long = any(kw in error_msg_lower for kw in [
                                'input is too long', 'too long for requested model', 'input length',
                                'exceeds context limit', 'context window', 'token limit exceeded',
                                'maximum context length', 'bad_request', '400'
                            ]) and ('input' in error_msg_lower or 'length' in error_msg_lower or 'model' in error_msg_lower)
                            
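                            if is_context_too_long:
                                self.logger.warning(f"⚠️  [{use_case_id}] Context too long on retry {retry_attempt+1}/{MAX_RETRIES}: {str(e)[:200]}")
                                if retry_attempt >= MAX_RETRIES - 1:
                                    # All retries exhausted: mark the use case the same way as the
                                    # InputTooLongError / TruncatedResponseError branch above
                                    self.logger.error(f"❌ [{use_case_id}] All {MAX_RETRIES} retries exhausted")
                                    use_case['SQL'] = (
                                        f"-- ❌ SQL GENERATION FAILED (CONTEXT TOO LARGE)\n"
                                        f"-- Use Case: {use_case_id}\n"
                                        f"-- Tables Involved: {tables_involved_str}\n"
                                        f"-- Error: Input exceeds model's context limit even after {MAX_RETRIES} retries\n"
                                        f"SELECT 'Context too large - manual SQL required' AS error_message;\n"
                                        f"--END OF GENERATED SQL"
                                    )
                                    use_case['sql_generation_status'] = 'failed'
                                    use_case['generated'] = 'N'
                                    use_case['validated'] = 'D'
                                    return use_case
                                continue
                            else:
                                # Unrelated error: let the outer handler deal with it
                                raise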

                self.logger.info(f"✅ [{use_case_id}] Received LLM response ({len(sql_response) if sql_response else 0} chars)")
                if sql_response and ("STATUS: FAILED" in sql_response or "Schema missing" in sql_response):
                    self.logger.error(f"Use case {use_case_id}: SQL generation returned FAILED status.")
                    use_case['SQL'] = (
                        f"-- Use Case: {use_case_id} - {use_case_name}\n"
                        f"-- SQL generation failed\n"
                        f"-- Tables Involved: {tables_involved_str}\n"
                        f"SELECT 'SQL Generation Failed' AS error_message;\n"
                        f"--END OF GENERATED SQL"
                    )
                    use_case['sql_generation_status'] = 'failed'
                    use_case['generated'] = 'N'
                    use_case['validated'] = 'D'
                    return use_case
            except (InputTooLongError, TruncatedResponseError) as inner_e:
                # Context/truncation errors that escaped the retry handling: record the failure and return
                self.logger.error(f"Use case {use_case_id}: Inner exception: {str(inner_e)[:200]}")
                use_case['SQL'] = (
                    f"-- ❌ SQL GENERATION FAILED\n"
                    f"-- Use Case: {use_case_id}\n"
                    f"-- Tables Involved: {tables_involved_str}\n"
                    f"-- Error: {str(inner_e)[:200]}\n"
                    f"SELECT 'SQL Generation Failed' AS error_message;\n"
                    f"--END OF GENERATED SQL"
                )
                use_case['sql_generation_status'] = 'failed'
                use_case['generated'] = 'N'
                use_case['validated'] = 'D'
                return use_case
            
            # Post-process the LLM response into the use case's final SQL
            processed_use_case = self._process_sql_candidate(
                use_case,
                sql_response,
                tables_involved_str,
                directly_involved_schema,
                directly_involved_tables,
                full_schema_details,
                schema_index
            )
            elapsed = time.time() - start_time
            self.logger.info(f"✓ SQL generated for {use_case_id} in {elapsed:.1f}s")
            return processed_use_case
            
        except Exception as e:
            import traceback
            elapsed = time.time() - start_time if 'start_time' in locals() else 0
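            full_error = str(e)
            error_msg = full_error[:200]
            # Classify on the full message so keywords past the 200-char log cutoff are not missed
            error_lower = full_error.lower()
            # Keep the tail of the traceback, which holds the raising frame and the exception itself
            stack_trace = traceback.format_exc()[-500:]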
            self.logger.error(f"✗ Failed to generate SQL for {use_case_id} after {elapsed:.1f}s: {error_msg}")
            self.logger.debug(f"Stack trace for {use_case_id}: {stack_trace}")
            
            is_timeout = any(kw in error_lower for kw in ['timeout', 'timed out', 'deadline'])
            
            if is_timeout:
                self.logger.warning(f"⏱️  Use case {use_case_id}: SQL generation timed out (outer)")
                use_case['SQL'] = (
                    f"-- Use Case: {use_case_id} - {use_case_name}\n"
                    f"-- SQL generation timed out\n"
                    f"-- Tables Involved: {tables_involved_str}\n"
                    f"SELECT 'SQL Generation Timeout' AS error_message;\n"
                    f"-- If you run the query and it was not valid, set IsValid to No and run Inspire again with 'Generate = SQL Regeneration', and Inspire will regenerate a new query for you to validate. You can also pass special instruction in below field:\n"
                    f"--SQL Generation Instructions Begin\n"
                    f"--\n"
                    f"--SQL Generation Instructions End"
                )
                use_case['sql_generation_status'] = 'timeout'
                use_case['generated'] = 'N'
                use_case['validated'] = 'D'
            else:
                use_case['SQL'] = (
                    f"-- Use Case: {use_case_id} - {use_case_name}\n"
                    f"-- SQL generation exception: {error_msg[:100]}\n"
                    f"-- Tables Involved: {tables_involved_str}\n"
                    f"SELECT 'SQL Generation Exception' AS error_message;\n"
                    f"-- If you run the query and it was not valid, set IsValid to No and run Inspire again with 'Generate = SQL Regeneration', and Inspire will regenerate a new query for you to validate. You can also pass special instruction in below field:\n"
                    f"--SQL Generation Instructions Begin\n"
                    f"--\n"
                    f"--SQL Generation Instructions End"
                )
                use_case['sql_generation_status'] = 'failed'
                use_case['generated'] = 'N'
                use_case['validated'] = 'D'
                
            return use_case

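    def _fix_sql_after_validation_failure(self, use_case: dict, full_schema_details: list, unstructured_docs_markdown: str, schema_index: dict) -> dict:
        """Ask the LLM to repair a use case's SQL after validation failed, then re-process the result.

        Uses the schema and table list cached on the use case during SQL generation
        ('_directly_involved_schema', '_directly_involved_tables'). When '_force_static_check'
        is set, the LLM is asked for a static review instead of being given an execution error.
        'unstructured_docs_markdown' is currently unused in this method.
        """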
        use_case_id = use_case.get('No', 'UNKNOWN')
        tables_involved_str = use_case.get('Tables Involved', '')
        directly_involved_schema = use_case.get('_directly_involved_schema', '')
        directly_involved_tables = set(use_case.get('_directly_involved_tables') or [])
        
        # Check if this is a "force static check" mode (no execution error)
        is_static_check = use_case.get('_force_static_check', False)
        explain_error_msg = use_case.get('sql_validation_error') or "SQL validation failed"
        
        if is_static_check:
            explain_error_msg = "Please perform a static code analysis on the query. Check for: 1) Syntax errors 2) References to columns that do not exist in the schema provided 3) Logic issues. Return the FIXED query."

        reviewer_prompt_vars = {
            "use_case_id": use_case_id,
            "use_case_name": use_case.get('Name', ''),
            "business_domain": use_case.get('Business Domain', ''),
            "statement": use_case.get('Statement', ''),
            "tables_involved": tables_involved_str,
            "directly_involved_schema": directly_involved_schema,
            "original_sql": use_case.get('SQL', ''),
            "explain_error": explain_error_msg,
            "use_case_columns": use_case.get('Involved Columns') or use_case.get('Columns Involved') or ""
        }
        adaptive_timeout = self._calculate_adaptive_sql_timeout(use_case)
        fixed_sql = self.ai_agent.run_worker(
            step_name=f"Fix_SQL_Execution_{use_case_id}_WaveRetry",
            worker_prompt_path="USE_CASE_SQL_FIX_PROMPT",
            prompt_vars=reviewer_prompt_vars,
            response_schema=None,
            timeout_override=adaptive_timeout,
            max_retries_override=self.max_retry_attempts
        )
        return self._process_sql_candidate(
            use_case,
            fixed_sql,
            tables_involved_str,
            directly_involved_schema,
            directly_involved_tables,
            full_schema_details,
            schema_index
        )

    def _run_sql_task_wrapper(self, use_case: dict, full_schema_details: list, unstructured_docs_markdown: str, schema_index: dict) -> dict:
        if use_case.get('_needs_fix'):
            return self._fix_sql_after_validation_failure(use_case, full_schema_details, unstructured_docs_markdown, schema_index)
        
        # 1. Generate Initial SQL
        uc_with_sql = self._generate_sql_for_use_case(use_case, full_schema_details, unstructured_docs_markdown, schema_index)
        
        return uc_with_sql

    def _run_sql_wave(self, wave_id: int, use_cases: list, full_schema_details: list, unstructured_docs_markdown: str, schema_index: dict, parallelism: int) -> tuple:
        use_cases_with_sql = []
        timed_out = []
        validation_failed = []
        with ThreadPoolExecutor(max_workers=parallelism, thread_name_prefix=f"SQLWave{wave_id}") as executor:
            future_to_uc = {}
            for uc in use_cases:
                future = executor.submit(self._run_sql_task_wrapper, uc, full_schema_details, unstructured_docs_markdown, schema_index)
                future_to_uc[future] = uc
            for future in concurrent.futures.as_completed(future_to_uc, timeout=None):
                uc_ref = future_to_uc[future]
                try:
                    result = future.result()
                    if result is None:
                        result = uc_ref
                    use_cases_with_sql.append(result)
                    
                    status = result.get('sql_generation_status')
                    if status == 'timeout':
                        timed_out.append(result)
                        use_case_id = result.get('No', 'UNKNOWN')
                        result['generated'] = 'N'
                        result['validated'] = 'D'
                        self.logger.warning(f"⏱️ [{use_case_id}] SQL generation timed out - marked for SQL regeneration")
                    elif status == 'failed':
                        result['generated'] = 'N'
                        result['validated'] = 'D'
                        timed_out.append(result)
                    elif result.get('sql_validation_status') == 'failed' or result.get('column_validation_status') == 'failed':
                        result['generated'] = 'Y'
                        result['validated'] = 'N'
                        validation_failed.append(result)
                    else:
                        if result.get('generated') != 'Y':
                            result['generated'] = 'Y'
                        if result.get('validated') not in ['Y', 'N']:
                            result['validated'] = 'D'
                        
                except Exception as e:
                    msg = str(e)
                    is_timeout = 'timeout' in msg.lower() or 'timed out' in msg.lower()
                    use_case_id = uc_ref.get('No', 'UNKNOWN')
                    tables_involved_str = uc_ref.get('Tables Involved', '')
                    use_case_name = uc_ref.get('Name', '')[:50]
                    
                    if is_timeout:
                        self.logger.warning(f"⏱️ [{use_case_id}] SQL generation timed out in wave executor")
                        uc_ref['SQL'] = (
                            f"-- Use Case: {use_case_id} - {use_case_name}\n"
                            f"-- SQL generation timed out\n"
                            f"-- Tables Involved: {tables_involved_str}\n"
                            f"SELECT 'SQL Generation Timeout' AS error_message;\n"
                            f"-- If you run the query and it was not valid, set IsValid to No and run Inspire again with 'Generate = SQL Regeneration', and Inspire will regenerate a new query for you to validate. You can also pass special instruction in below field:\n"
                            f"--SQL Generation Instructions Begin\n"
                            f"--\n"
                            f"--SQL Generation Instructions End"
                        )
                        uc_ref['sql_generation_status'] = 'timeout'
                    else:
                        uc_ref['sql_generation_status'] = 'failed'
                        uc_ref['SQL'] = (
                            f"-- Use Case: {use_case_id} - {use_case_name}\n"
                            f"-- SQL generation failed: {msg[:100]}\n"
                            f"-- Tables Involved: {tables_involved_str}\n"
                            f"SELECT 'SQL Generation Error' AS error_message;\n"
                            f"-- If you run the query and it was not valid, set IsValid to No and run Inspire again with 'Generate = SQL Regeneration', and Inspire will regenerate a new query for you to validate. You can also pass special instruction in below field:\n"
                            f"--SQL Generation Instructions Begin\n"
                            f"--\n"
                            f"--SQL Generation Instructions End"
                        )
                    uc_ref['generated'] = 'N'
                    uc_ref['validated'] = 'D'
                    timed_out.append(uc_ref)
                    use_cases_with_sql.append(uc_ref)
        return use_cases_with_sql, timed_out, validation_failed

    def _generate_sql_sequential(self, use_cases: list, full_schema_details: list, unstructured_docs_markdown: str) -> list:
        """
        Generate SQL for all use cases SEQUENTIALLY (no parallelism).
        Used within domain processing where the domain itself is already running in parallel.
        
        Args:
            use_cases: List of use case dictionaries  
            full_schema_details: Full list of (catalog, schema, table, column, type, comment) tuples
            unstructured_docs_markdown: Unstructured documents markdown
        
        Returns:
            list: Use cases in input order, each with its 'SQL' field set (an error placeholder on failure)
        """
        total_use_cases = len(use_cases)
        
        self.logger.info(f"Starting sequential SQL generation for {total_use_cases} use cases...")
        
        # Build schema index ONCE for fast lookup
        self.logger.debug(f"Building schema index from {len(full_schema_details)} columns...")
        schema_by_table = defaultdict(list)
        for detail in full_schema_details:
            (catalog, schema, table, column_name, data_type, comment) = detail
            fqtn = f"{catalog}.{schema}.{table}"
            fqtn_backticks = f"`{catalog}`.`{schema}`.`{table}`"
            schema_by_table[fqtn].append(detail)
            schema_by_table[fqtn_backticks].append(detail)
        
        # Process use cases sequentially
        use_cases_with_sql = []
        completed_count = 0
        failed_count = 0
        deferred_timeouts = []
        
        for idx, uc in enumerate(use_cases, 1):
            use_case_id = uc.get('No', 'UNKNOWN')
            use_case_name = uc.get('Name', '')[:40]
            
            try:
                result = self._generate_sql_for_use_case(uc, full_schema_details, unstructured_docs_markdown, schema_by_table)
                use_cases_with_sql.append(result)
                completed_count += 1
                
                if idx % 5 == 0 or idx == total_use_cases:
                    self.logger.debug(f"SQL generation progress: {idx}/{total_use_cases}")
                    
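            except Exception as e:
                failed_count += 1
                error_msg = str(e)[:150]
                self.logger.error(f"SQL generation failed for {use_case_id}: {error_msg}")
                uc['SQL'] = f"-- SQL generation failed for {use_case_id}\n-- Error: {error_msg}\nSELECT 'SQL Generation Error' as error;"
                # Record the same status fields that the wave-based path sets on failure
                uc['sql_generation_status'] = 'failed'
                uc['generated'] = 'N'
                uc['validated'] = 'D'
                use_cases_with_sql.append(uc)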
        
        # completed_count is incremented only on success, so it must not be reduced by failed_count
        success_count = completed_count
        self.logger.info(f"✅ Sequential SQL generation complete: {success_count} succeeded, {failed_count} failed")
        
        return use_cases_with_sql
    
    def _generate_sql_parallel(self, use_cases: list, full_schema_details: list, unstructured_docs_markdown: str, is_retry: bool = False) -> list:
        """
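        Generate SQL for all use cases in parallel, in up to five priority-ordered waves.
        Parallelism is chosen by calculate_adaptive_parallelism from self.max_parallelism and is
        lowered in waves 3-5, which retry use cases that timed out, errored, or failed validation.
        
        Args:
            use_cases: List of use case dictionaries
            full_schema_details: Full list of (catalog, schema, table, column, type, comment) tuples
            unstructured_docs_markdown: Unstructured documents markdown
            is_retry: Not referenced in this method body
        
        Returns:
            list: Use cases in their original order, each with SQL and status fields set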
        """
        # === CHECK IF SQL CODE GENERATION IS DISABLED ===
        if not self.generate_sql_code:
            self.logger.info("⚠️ SQL Code generation is DISABLED - using placeholder SQL for all use cases")
            log_print(f"\n⚠️ SQL Code generation DISABLED - using placeholder SQL")
            
            for uc in use_cases:
                tables_involved = uc.get('Tables Involved', 'your_table')
                first_table = tables_involved.split(',')[0].strip() if tables_involved else 'your_table'
                placeholder_sql = (
                    f"-- TODO: SQL Code generation was disabled\n"
                    f"-- To generate SQL: Run 'Re-generate SQL' operation mode\n"
                    f"-- Tables Involved: {tables_involved}\n"
                    f"SELECT * FROM {first_table} LIMIT 10;"
                )
                uc['SQL'] = placeholder_sql
                uc['generated'] = 'N'
                uc['validated'] = 'N'
            
            return use_cases
        
        total_use_cases = len(use_cases)
        total_columns = len(full_schema_details)
        avg_prompt_chars = total_columns * 50 + len(unstructured_docs_markdown)  # Estimate based on schema + docs
        
        # ADAPTIVE PARALLELISM: Calculate based on use cases, columns, and prompt size
        sql_parallelism, reason = calculate_adaptive_parallelism(
            "sql_generation", self.max_parallelism,
            num_items=total_use_cases,
            total_columns=total_columns,
            avg_prompt_chars=avg_prompt_chars,
            is_llm_operation=True, logger=self.logger
        )
        
        log_print(f"\n{'='*80}")
        log_print(f"🔄 SQL GENERATION: {total_use_cases} use cases")
        log_print(f"{'='*80}")
        log_adaptive_parallelism_decision("sql_generation", sql_parallelism, self.max_parallelism, reason)
        log_print(f"Estimated time per wave: {(total_use_cases * 5 / sql_parallelism / 60):.1f} minutes")
        log_print(f"{'='*80}\n")
        self.logger.info(f"🔧 Building schema index from {len(full_schema_details)} columns for fast lookup...")
        schema_by_table = defaultdict(list)
        for detail in full_schema_details:
            (catalog, schema, table, column_name, data_type, comment) = detail
            # Create multiple index keys for flexible matching
            fqtn = f"{catalog}.{schema}.{table}"
            fqtn_backticks = f"`{catalog}`.`{schema}`.`{table}`"
            schema_by_table[fqtn].append(detail)
            schema_by_table[fqtn_backticks].append(detail)
        self.logger.info(f"   ✓ Schema index built with {len(schema_by_table)} table entries")
        priority_order = {
            "ultra high": 0,
            "very high": 1,
            "high": 2,
            "medium": 3,
            "low": 4,
            "very low": 5,
            "ultra low": 6
        }
        def sort_backlog(items):
            return [uc for _, uc in sorted(
                enumerate(items),
                key=lambda pair: (priority_order.get(str(pair[1].get('Priority', '')).strip().lower(), len(priority_order)), pair[0])
            )]
        wave_parallelism = [
            (1, sql_parallelism),
            (2, sql_parallelism),
            (3, max(1, (sql_parallelism + 1) // 2)),
            (4, max(1, (sql_parallelism + 2) // 3)),
            (5, max(1, (sql_parallelism + 2) // 3))
        ]
        final_results = {}
        backlog = sort_backlog(use_cases)
        import time  # Imported once here rather than on every wave iteration
        for wave_id, wave_workers in wave_parallelism:
            if not backlog:
                break
            backlog = sort_backlog(backlog)
            wave_start_time = time.time()
            self.logger.info(f"🔁 Wave {wave_id}: processing {len(backlog)} use cases with parallelism {wave_workers} (priority-ordered)")
            log_print(f"   ▶️ Wave {wave_id}: {len(backlog)} use cases, parallelism {wave_workers} (priority-ordered)")
            results, timed_out, validation_failed = self._run_sql_wave(wave_id, backlog, full_schema_details, unstructured_docs_markdown, schema_by_table, wave_workers)
            
            wave_end_time = time.time()
            wave_duration = wave_end_time - wave_start_time
            
            wave_succeeded = 0
            wave_failed = 0
            for uc in results:
                if uc.get('generated') == 'Y' and uc.get('validated') in ['Y', 'D']:
                    wave_succeeded += 1
                else:
                    wave_failed += 1
            
            self.logger.info(f"📊 Wave {wave_id} Report:")
            self.logger.info(f"   • Duration: {wave_duration:.1f}s")
            self.logger.info(f"   • Start: {time.strftime('%H:%M:%S', time.localtime(wave_start_time))}")
            self.logger.info(f"   • End: {time.strftime('%H:%M:%S', time.localtime(wave_end_time))}")
            self.logger.info(f"   • Processed: {len(results)}")
            self.logger.info(f"   • Succeeded: {wave_succeeded}")
            self.logger.info(f"   • Failed: {wave_failed}")
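            # The final wave has no successor, so only announce a retry when another wave remains
            retry_note = "will retry in next wave" if wave_id < len(wave_parallelism) else "final wave, no automatic retry"
            self.logger.info(f"      - ⏱️ Timeouts/Errors ({retry_note}): {len(timed_out)}")
            self.logger.info(f"      - 🔍 Validation Failures ({retry_note}): {len(validation_failed)}")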
            
            if len(timed_out) > 0:
                if wave_id < len(wave_parallelism):
                    self.logger.info(f"   ⏱️ TIMED OUT SQL DETAILS (will retry in wave {wave_id + 1}):")
                else:
                    self.logger.info(f"   ⏱️ FAILED SQL DETAILS (run the 'Re-generate SQL' operation to regenerate):")
                for uc in timed_out:
                    uc_id = uc.get('No', 'UNKNOWN')
                    uc_name = uc.get('Name', '')[:40]
                    retry_count = uc.get('_retry_attempt', 0)
                    self.logger.info(f"      • [{uc_id}] {uc_name} → generated={uc.get('generated', 'N')}, validated={uc.get('validated', 'D')}, retries={retry_count}")
            
            log_print(f"\n================================================================================")
            log_print(f"📊 WAVE {wave_id} REPORT")
            log_print(f"================================================================================")
            log_print(f"   • Time: {time.strftime('%H:%M:%S', time.localtime(wave_start_time))} - {time.strftime('%H:%M:%S', time.localtime(wave_end_time))} ({wave_duration:.1f}s)")
            log_print(f"   • Total Processed: {len(results)}")
            log_print(f"   • ✅ Succeeded: {wave_succeeded}")
            log_print(f"   • ❌ Failed: {wave_failed}")
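            if len(timed_out) > 0:
                if wave_id < len(wave_parallelism):
                    log_print(f"      - ⏱️  Timeouts/Errors: {len(timed_out)} (will retry in wave {wave_id + 1})")
                else:
                    log_print(f"      - ⏱️  Timeouts/Errors: {len(timed_out)} (run the 'Re-generate SQL' operation to regenerate)")
            if len(validation_failed) > 0:
                if wave_id < len(wave_parallelism):
                    log_print(f"      - 🔍 Validation Failures: {len(validation_failed)} (will retry in wave {wave_id + 1})")
                else:
                    log_print(f"      - 🔍 Validation Failures: {len(validation_failed)} (run the 'Re-generate SQL' operation to regenerate)")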
            log_print(f"================================================================================\n")
            
            log_print(f"   ✅ Wave {wave_id} done in {wave_duration:.1f}s: {wave_succeeded} OK, {wave_failed} Failed")
            
            for uc in results:
                final_results[uc.get('No', 'UNKNOWN')] = uc
            
            for uc in validation_failed:
                if uc.get('_needs_fix'):
                    del uc['_needs_fix']
                else:
                    uc['_needs_fix'] = True

            # Build backlog for next wave: include BOTH validation failures AND timeouts for retry
            # Reset timeout status for timed_out items so they get retried
            for uc in timed_out:
                # Clear timeout status to allow retry in next wave
                if uc.get('sql_generation_status') == 'timeout':
                    uc['sql_generation_status'] = 'pending_retry'
                    uc['_retry_attempt'] = uc.get('_retry_attempt', 0) + 1
                    self.logger.debug(f"   🔄 [{uc.get('No', 'UNKNOWN')}] Marked for retry (attempt {uc['_retry_attempt']})")
            
            # Combine validation failures and timed out items for next wave retry
            retry_items = validation_failed + timed_out
            backlog = sort_backlog(retry_items)
            
            if len(timed_out) > 0 and wave_id < len(wave_parallelism):
                self.logger.info(f"   🔄 {len(timed_out)} timed-out or failed use cases will be retried in wave {wave_id + 1}")
                log_print(f"   🔄 {len(timed_out)} timed-out or failed use cases will be retried in wave {wave_id + 1}")
            
            completed = len(results)
            progress_pct = (completed / total_use_cases) * 100 if total_use_cases else 0
            self.logger.info(f"Wave {wave_id} completed {completed} items ({progress_pct:.1f}%) - {len(timed_out)} timeouts/errors, {len(validation_failed)} validation failures (will retry)")
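        # Anything still in the backlog has exhausted every wave; finalize its status and placeholder SQL
        for uc in backlog:
            # Timed-out items were relabeled 'pending_retry' when queued for the next wave,
            # so both labels are treated as a final timeout here
            if uc.get('sql_generation_status') in ('timeout', 'pending_retry'):
                uc['sql_generation_status'] = 'timeout'
                uc['SQL'] = (
                    f"-- SQL generation timed out for {uc.get('No', 'UNKNOWN')}\n"
                    f"-- Timeout: {self.llm_timeout_seconds} seconds (final attempt)\n"
                    f"SELECT 'SQL Generation Timeout' as error;"
                )
            elif uc.get('sql_validation_status') == 'failed' or uc.get('column_validation_status') == 'failed':
                if not uc.get('SQL'):
                    uc['SQL'] = f"-- SQL validation failed for {uc.get('No', 'UNKNOWN')}\nSELECT 'SQL Validation Failed' as error;"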
            final_results[uc.get('No', 'UNKNOWN')] = uc
        ordered_results = []
        for uc in use_cases:
            ordered_results.append(final_results.get(uc.get('No', 'UNKNOWN'), uc))
        success_count = sum(1 for uc in ordered_results if uc.get('sql_generation_status') == 'succeeded' and uc.get('sql_validation_status') != 'failed' and uc.get('column_validation_status') != 'failed')
        failed_count = len(ordered_results) - success_count
        self.logger.info(f"✅ SQL generation complete across waves: {success_count} succeeded, {failed_count} failed (Total: {len(ordered_results)}/{total_use_cases})")
        log_print(f"✅ SQL generation finished across waves: {success_count} succeeded, {failed_count} failed")
        
        # Show REST API validation statistics
        if hasattr(self, '_explain_stats'):
            stats = self._explain_stats
            self.logger.info("")
            self.logger.info("=" * 80)
            self.logger.info("📊 SQL VALIDATION SUMMARY (REST API with Fresh Clients)")
            self.logger.info("=" * 80)
            self.logger.info(f"   Total queries attempted: {stats['attempted']}")
            self.logger.info(f"   ✅ Validations succeeded: {stats['succeeded']}")
            
            # Show breakdown of primary vs retry attempts
            local_count = stats.get('local_succeeded', 0)
            remote_count = stats.get('remote_succeeded', 0)
            if local_count > 0 or remote_count > 0:
                self.logger.info(f"      ├─ 🌐 Primary attempt: {local_count}")
                self.logger.info(f"      └─ 🔄 Retry attempt: {remote_count}")
            
            self.logger.info(f"   ❌ Validations failed (syntax errors): {stats['failed']}")
            self.logger.info(f"   ⚠️  Authentication/permission errors: {stats['auth_errors']}")
            self.logger.info(f"   ⏭️  Skipped (no warehouse): {stats['skipped']}")
            
            if stats['auth_errors'] > 0:
                self.logger.info("")
                self.logger.info("   ⚠️  Note: Authentication errors prevent SQL validation.")
                self.logger.info("   Consider checking warehouse permissions.")
            
            if stats['failed'] > 0:
                self.logger.info("")
                self.logger.info(f"   ⚠️  {stats['failed']} queries had syntax errors; a fix was attempted for each.")
            
            self.logger.info("=" * 80)
            self.logger.info("")
            
            log_print(f"\n{'=' * 80}")
            log_print(f"📊 SQL VALIDATION SUMMARY (REST API with Fresh Clients)")
            log_print(f"{'=' * 80}")
            log_print(f"   ✅ Succeeded: {stats['succeeded']} (🌐 Primary: {local_count}, 🔄 Retry: {remote_count})")
            log_print(f"   ❌ Failed: {stats['failed']}")
            log_print(f"   ⚠️  Auth errors: {stats['auth_errors']}")
            log_print(f"   ⏭️  Skipped: {stats['skipped']}")
            if local_count > 0 or remote_count > 0:
                log_print(f"   💡 Using REST API validation with fresh workspace clients per call")
                log_print(f"   💡 Configuration: wait_timeout=50s, disposition=EXTERNAL_LINKS, row_limit=1")
            if stats['auth_errors'] > 0:
                log_print(f"   Note: {stats['auth_errors']} validation attempts failed due to auth errors.")
            log_print(f"{'=' * 80}\n")
        
        if len(ordered_results) < total_use_cases:
            missing = total_use_cases - len(ordered_results)
            self.logger.warning(f"⚠️ Missing {missing} use cases from SQL generation results")
        
        return ordered_results

    def _generate_sql_and_notebooks_by_domain(self, all_use_cases: list, full_schema_details: list, 
                                               unstructured_docs_markdown: str, translations: dict, 
                                               summary_dict: dict = None) -> list:
        """
        Generate SQL queries domain-by-domain, creating each notebook immediately after
        its domain's SQL generation is complete. Domains are processed in order of 
        use case count (smallest first) to enable quick testing during demos.
        
        Args:
            all_use_cases: All use cases to process
            full_schema_details: Schema details for SQL generation
            unstructured_docs_markdown: Unstructured documentation
            translations: Translation dictionary
            summary_dict: Optional domain summaries
            
        Returns:
            list: All use cases with SQL generated
        """
        import time
        
        if not all_use_cases:
            self.logger.warning("No use cases provided for domain-by-domain SQL generation")
            return []
        
        # === CHECK IF SQL CODE GENERATION IS DISABLED ===
        if not self.generate_sql_code:
            self.logger.info("⚠️ SQL Code generation is DISABLED - using placeholder SQL for all use cases")
            log_print(f"\n{'='*80}")
            log_print(f"⚠️ SQL CODE GENERATION DISABLED")
            log_print(f"{'='*80}")
            log_print(f"Notebooks will be generated with placeholder SQL.")
            log_print(f"To generate SQL, set regenerate_sql:Yes in notebook and run 'Re-generate SQL' mode.")
            log_print(f"{'='*80}\n")
            
            # Set placeholder SQL for all use cases
            for uc in all_use_cases:
                tables_involved = uc.get('Tables Involved', 'your_table')
                first_table = tables_involved.split(',')[0].strip() if tables_involved else 'your_table'
                placeholder_sql = (
                    f"-- TODO: SQL Code generation was disabled\n"
                    f"-- To generate SQL: Run 'Re-generate SQL' operation mode\n"
                    f"-- Tables Involved: {tables_involved}\n"
                    f"SELECT * FROM {first_table} LIMIT 10;"
                )
                uc['SQL'] = placeholder_sql
                uc['generated'] = 'N'
                uc['validated'] = 'N'  # Mark as needing regeneration (regenerate_sql:Yes)
            
            # Still need to create notebooks - proceed to notebook assembly
            grouped_by_domain = self._group_use_cases_by_domain_flat(all_use_cases)
            
            # Build domain prefix map (same logic as normal flow)
            domain_prefix_map = {}
            for domain, use_cases in grouped_by_domain.items():
                if use_cases:
                    first_id = use_cases[0].get('No', 'N99-ZZ99')
                    try:
                        prefix = first_id.split('-')[0]
                        prefix_num = int(prefix[1:])
                        domain_prefix_map[domain] = (prefix_num, prefix)
                    except (ValueError, IndexError):
                        domain_prefix_map[domain] = (999, 'N99')
            
            sorted_domains = sorted(grouped_by_domain.keys(), 
                                   key=lambda d: len(grouped_by_domain[d]))
            
            # Create notebooks for all domains (with placeholder SQL)
            for domain_idx, domain_name in enumerate(sorted_domains, start=1):
                domain_use_cases = grouped_by_domain[domain_name]
                actual_prefix = domain_prefix_map.get(domain_name, (domain_idx, f"N{domain_idx:02d}"))[1]
                sanitized_domain = self._sanitize_name(domain_name)
                notebook_name = f"{actual_prefix}-{sanitized_domain}"
                
                log_print(f"📓 Creating notebook: {notebook_name} ({len(domain_use_cases)} use cases with placeholder SQL)")
                
                domain_summary = summary_dict.get(domain_name) if summary_dict else None
                sorted_cases = sorted(domain_use_cases, key=self._natural_sort_key)
                
                try:
                    self._assemble_notebook_for_db(
                        db_name=domain_name, use_cases=sorted_cases, translations=translations,
                        db_prefix=actual_prefix, filename_override=notebook_name, domain_summary=domain_summary
                    )
                    log_print(f"   ✅ {notebook_name}.ipynb created")
                except Exception as e:
                    self.logger.error(f"Failed to create notebook for domain '{domain_name}': {e}")
                    log_print(f"   ❌ Failed to create notebook: {str(e)[:100]}")
            
            log_print(f"\n✅ All notebooks created with placeholder SQL")
            log_print(f"   📌 To generate SQL: Set regenerate_sql:Yes in notebooks and run 'Re-generate SQL' mode")
            
            return all_use_cases
        
        grouped_by_domain = self._group_use_cases_by_domain_flat(all_use_cases)
        
        # CRITICAL FIX: Extract domain PREFIX from USE CASE IDs - this ensures notebook names match use case IDs
        # The ID assignment phase already determined domain prefixes (N01, N02, etc.).
        # We sort by USE CASE COUNT (smallest first for quick testing) but use PREFIX from IDs for notebook naming.
        domain_prefix_map = {}  # domain_name -> (prefix_number, prefix_string)
        for domain, use_cases in grouped_by_domain.items():
            if use_cases:
                # Extract prefix from first use case ID (e.g., "N15-AI01" -> "N15" -> 15)
                first_id = use_cases[0].get('No', 'N99-ZZ99')
                try:
                    prefix = first_id.split('-')[0]  # "N15"
                    prefix_num = int(prefix[1:])  # 15
                    domain_prefix_map[domain] = (prefix_num, prefix)
                except (ValueError, IndexError):
                    domain_prefix_map[domain] = (999, 'N99')  # Fallback for malformed IDs
        
        # Sort domains by USE CASE COUNT (smallest first) - enables quick testing of smaller domains
        # NOTE: Notebook prefix comes from use case IDs (domain_prefix_map), NOT from this sort order
        sorted_domains = sorted(grouped_by_domain.keys(), 
                               key=lambda d: len(grouped_by_domain[d]))
        
        # For logging, still calculate impact scores
        domain_impact_scores = {domain: self._calculate_domain_impact_score(use_cases) 
                               for domain, use_cases in grouped_by_domain.items()}
        
        total_domains = len(sorted_domains)
        total_use_cases = len(all_use_cases)
        
        log_print(f"\n{'='*80}")
        log_print(f"🎯 DOMAIN-BY-DOMAIN SQL GENERATION & NOTEBOOK CREATION")
        log_print(f"{'='*80}")
        log_print(f"📊 Total: {total_use_cases} use cases across {total_domains} domains")
        log_print(f"📋 Processing order (smallest domains first for quick testing):")
        for idx, domain in enumerate(sorted_domains, 1):
            uc_count = len(grouped_by_domain[domain])
            prefix = domain_prefix_map.get(domain, (idx, f"N{idx:02d}"))[1]
            impact = domain_impact_scores[domain]
            log_print(f"   {idx}. {prefix}-{domain}: {uc_count} use cases (impact: {impact:.1f})")
        log_print(f"{'='*80}\n")
        
        self.logger.info(f"Starting domain-by-domain SQL generation for {total_use_cases} use cases across {total_domains} domains")
        self.logger.info(f"Domain order (by size): {[(len(grouped_by_domain[d]), domain_prefix_map.get(d, (0, 'N00'))[1], d) for d in sorted_domains]}")
        
        self.logger.info(f"🔧 Building schema index from {len(full_schema_details)} columns for fast lookup...")
        schema_by_table = defaultdict(list)
        for detail in full_schema_details:
            (catalog, schema, table, column_name, data_type, comment) = detail
            fqtn = f"{catalog}.{schema}.{table}"
            fqtn_backticks = f"`{catalog}`.`{schema}`.`{table}`"
            schema_by_table[fqtn].append(detail)
            schema_by_table[fqtn_backticks].append(detail)
        self.logger.info(f"   ✓ Schema index built with {len(schema_by_table)} lookup keys (plain and backtick-quoted name per table)")
        
        final_results_map = {}
        notebooks_created = []
        overall_start_time = time.time()
        cumulative_use_cases_done = 0
        
        for domain_idx, domain_name in enumerate(sorted_domains, start=1):
            domain_use_cases = grouped_by_domain[domain_name]
            domain_uc_count = len(domain_use_cases)
            # Get the actual prefix from use case IDs (matches notebook name)
            actual_prefix = domain_prefix_map.get(domain_name, (domain_idx, f"N{domain_idx:02d}"))[1]
            
            log_print(f"\n{'='*80}")
            log_print(f"🏢 DOMAIN {domain_idx}/{total_domains}: {domain_name.upper()} (Notebook: {actual_prefix})")
            log_print(f"{'='*80}")
            log_print(f"   📊 Use cases in this domain: {domain_uc_count}")
            log_print(f"   🔄 Progress: {cumulative_use_cases_done}/{total_use_cases} use cases completed so far")
            
            domain_start_time = time.time()
            self.logger.info(f"\n🏢 [{domain_idx}/{total_domains}] Starting domain: {domain_name} ({actual_prefix}, {domain_uc_count} use cases)")
            
            log_print(f"\n   📝 PHASE 1: Generating SQL for {domain_uc_count} use cases (wave pattern)...")
            
            # ADAPTIVE PARALLELISM: Calculate based on domain use cases and schema size
            sql_parallelism, reason = calculate_adaptive_parallelism(
                "sql_generation", self.max_parallelism,
                num_items=domain_uc_count,
                total_columns=len(full_schema_details),
                avg_prompt_chars=len(full_schema_details) * 50,
                is_llm_operation=True, logger=self.logger
            )
            log_adaptive_parallelism_decision("sql_generation", sql_parallelism, self.max_parallelism, reason)
            
            priority_order = {
                "ultra high": 0, "very high": 1, "high": 2, "medium": 3,
                "low": 4, "very low": 5, "ultra low": 6
            }
            
            def sort_backlog(items):
                return [uc for _, uc in sorted(
                    enumerate(items),
                    key=lambda pair: (priority_order.get(str(pair[1].get('Priority', '')).strip().lower(), len(priority_order)), pair[0])
                )]
            
            wave_parallelism = [
                (1, sql_parallelism),
                (2, sql_parallelism),
                (3, max(1, (sql_parallelism + 1) // 2)),
                (4, max(1, (sql_parallelism + 2) // 3)),
                (5, max(1, (sql_parallelism + 2) // 3))
            ]
            
            domain_final_results = {}
            backlog = sort_backlog(domain_use_cases)
            
            for wave_id, wave_workers in wave_parallelism:
                if not backlog:
                    break
                backlog = sort_backlog(backlog)
                wave_start_time = time.time()
                
                self.logger.info(f"   🔁 Wave {wave_id}: processing {len(backlog)} use cases with parallelism {wave_workers}")
                log_print(f"      ▶️ Wave {wave_id}: {len(backlog)} use cases, parallelism {wave_workers}")
                
                results, timed_out, validation_failed = self._run_sql_wave(
                    wave_id, backlog, full_schema_details, unstructured_docs_markdown, schema_by_table, wave_workers
                )
                
                wave_end_time = time.time()
                wave_duration = wave_end_time - wave_start_time
                
                wave_succeeded = sum(1 for uc in results if uc.get('generated') == 'Y' and uc.get('validated') in ['Y', 'D'])
                wave_failed = len(results) - wave_succeeded
                
                log_print(f"      ✅ Wave {wave_id} done in {wave_duration:.1f}s: {wave_succeeded} OK, {wave_failed} Failed")
                
                for uc in results:
                    domain_final_results[uc.get('No', 'UNKNOWN')] = uc
                
                for uc in validation_failed:
                    if uc.get('_needs_fix'):
                        del uc['_needs_fix']
                    else:
                        uc['_needs_fix'] = True
                
                for uc in timed_out:
                    if uc.get('sql_generation_status') == 'timeout':
                        uc['sql_generation_status'] = 'pending_retry'
                        uc['_retry_attempt'] = uc.get('_retry_attempt', 0) + 1
                
                retry_items = validation_failed + timed_out
                backlog = sort_backlog(retry_items)
            
            for uc in backlog:
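                # Items that timed out in the final wave were already relabelled
                # 'pending_retry' above, so treat both statuses as a final timeout.
                if uc.get('sql_generation_status') in ('timeout', 'pending_retry'):
                    uc['sql_generation_status'] = 'timeout'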
                    uc['SQL'] = (
                        f"-- SQL generation timed out for {uc.get('No', 'UNKNOWN')}\n"
                        f"-- Timeout: {self.llm_timeout_seconds} seconds (final attempt)\n"
                        f"SELECT 'SQL Generation Timeout' as error;"
                    )
                elif uc.get('sql_validation_status') == 'failed' or uc.get('column_validation_status') == 'failed':
                    if not uc.get('SQL'):
                        uc['SQL'] = f"-- SQL validation failed for {uc.get('No', 'UNKNOWN')}\nSELECT 'SQL Validation Failed' as error;"
                domain_final_results[uc.get('No', 'UNKNOWN')] = uc
            
            domain_ordered_results = []
            for uc in domain_use_cases:
                domain_ordered_results.append(domain_final_results.get(uc.get('No', 'UNKNOWN'), uc))
            
            for uc in domain_ordered_results:
                final_results_map[uc.get('No', 'UNKNOWN')] = uc
            
            domain_success = sum(1 for uc in domain_ordered_results 
                               if uc.get('sql_generation_status') == 'succeeded' 
                               and uc.get('sql_validation_status') != 'failed' 
                               and uc.get('column_validation_status') != 'failed')
            domain_failed = len(domain_ordered_results) - domain_success
            
            sql_duration = time.time() - domain_start_time
            log_print(f"\n   ✅ SQL Generation Complete: {domain_success} succeeded, {domain_failed} failed ({sql_duration:.1f}s)")
            
            log_print(f"\n   📓 PHASE 2: Creating notebook for domain '{domain_name}'...")
            
            notebook_start_time = time.time()
            sorted_cases = sorted(domain_ordered_results, key=self._natural_sort_key)
            # The notebook prefix comes from the use case IDs (domain_prefix_map), not the loop index,
            # so that N15-AI01 use cases land in N15-xxx.ipynb rather than N06-xxx.ipynb.
            domain_prefix = actual_prefix
            notebook_name = f"{domain_prefix}-{self._sanitize_name(domain_name)}"
            
            domain_summary = None
            if summary_dict:
                domain_summary = summary_dict.get(domain_name, None)
            
            try:
                self._assemble_notebook_for_db(
                    db_name=domain_name, use_cases=sorted_cases, translations=translations,
                    db_prefix=domain_prefix, filename_override=notebook_name, domain_summary=domain_summary
                )
                notebooks_created.append((domain_idx, notebook_name, True))
                notebook_duration = time.time() - notebook_start_time
                
                log_print(f"\n{'*'*80}")
                log_print(f"🎉 DOMAIN '{domain_name.upper()}' COMPLETE!")
                log_print(f"{'*'*80}")
                log_print(f"   📓 Notebook: {notebook_name}.ipynb")
                log_print(f"   📊 Use cases: {domain_uc_count} ({domain_success} SQL OK, {domain_failed} SQL Failed)")
                total_domain_time = time.time() - domain_start_time
                log_print(f"   ⏱️  Total time: {total_domain_time:.1f}s (SQL: {sql_duration:.1f}s, Notebook: {notebook_duration:.1f}s)")
                log_print(f"   ✅ READY FOR TESTING!")
                log_print(f"{'*'*80}\n")
                
                self.logger.info(f"🎉 [{domain_idx}/{total_domains}] Domain '{domain_name}' notebook '{notebook_name}.ipynb' READY FOR INSPECTION")
                
            except Exception as e:
                notebooks_created.append((domain_idx, notebook_name, False))
                self.logger.error(f"❌ [{domain_idx}/{total_domains}] Failed to create notebook for domain '{domain_name}': {e}")
                log_print(f"   ❌ Notebook creation failed: {str(e)[:100]}")
            
            cumulative_use_cases_done += domain_uc_count
            
            remaining_domains = total_domains - domain_idx
            if remaining_domains > 0:
                avg_time_per_uc = (time.time() - overall_start_time) / cumulative_use_cases_done if cumulative_use_cases_done > 0 else 0
                remaining_ucs = total_use_cases - cumulative_use_cases_done
                eta_seconds = avg_time_per_uc * remaining_ucs
                log_print(f"   📈 Progress: {cumulative_use_cases_done}/{total_use_cases} use cases ({cumulative_use_cases_done*100//total_use_cases}%)")
                log_print(f"   ⏳ Estimated time remaining: {eta_seconds/60:.1f} minutes ({remaining_domains} domains left)")
        
        overall_duration = time.time() - overall_start_time
        
        log_print(f"\n{'='*80}")
        log_print(f"🏁 ALL DOMAINS PROCESSED")
        log_print(f"{'='*80}")
        log_print(f"   📊 Total use cases: {total_use_cases}")
        log_print(f"   📓 Notebooks created: {sum(1 for _, _, success in notebooks_created if success)}/{total_domains}")
        log_print(f"   ⏱️  Total time: {overall_duration/60:.1f} minutes")
        log_print(f"\n   📓 Notebooks ready for testing:")
        for idx, name, success in notebooks_created:
            status = "✅" if success else "❌"
            log_print(f"      {status} {name}.ipynb")
        log_print(f"{'='*80}\n")
        
        notebooks_ok = sum(1 for _, _, success in notebooks_created if success)
        self.logger.info(f"✅ Domain-by-domain SQL generation complete: {total_use_cases} use cases, {notebooks_ok}/{total_domains} notebooks created in {overall_duration:.1f}s")
        
        ordered_results = []
        for uc in all_use_cases:
            ordered_results.append(final_results_map.get(uc.get('No', 'UNKNOWN'), uc))
        
        return ordered_results

    def _deduplicate_use_cases(self, all_use_cases: list) -> list:
        """
        Run a single LLM pass that deduplicates ALL use cases globally.
        
        The LLM sees each use case's ID, name, truncated business value, and tables,
        and returns the IDs to keep.
        
        Deduplication criteria:
        1. Semantic similarity of names
        2. Duplicate or trivial use cases
        3. Low business value relative to the industry/business context
        4. Insufficient distinctiveness
        
        Falls back to domain-level parallel deduplication when the prompt would exceed the
        model's context limit, and returns the input list unchanged if deduplication fails.
        """
        self.logger.info(f"Starting AGGRESSIVE global deduplication for {len(all_use_cases)} use cases...")
        self.logger.info(f"Deduplication will analyze: Name similarity + Business Value + Distinctiveness")
        
        if len(all_use_cases) < 2:
            self.logger.debug("Skipping deduplication, not enough use cases to compare.")
            return all_use_cases
        
        try:
            # Create markdown table with ID, Name, and Business Value
            md_parts = ["| ID | Name | Business Value | Tables |\n|---|---|---|---|\n"]
            for uc in all_use_cases:
                name = str(uc.get('Name', '')).replace('|', r'\|')
                business_value = str(uc.get('Business Value', ''))[:100].replace('|', r'\|')
                tables = str(uc.get('Tables Involved', ''))[:50].replace('|', r'\|')
                md_parts.append(f"| {uc['No']} | {name} | {business_value} | {tables} |\n")
            use_case_markdown = "".join(md_parts)
            
            self.logger.debug(f"Created deduplication markdown table with {len(all_use_cases)} use cases")
        except Exception as e:
            self.logger.error(f"Failed to create markdown for deduplication: {e}")
            return all_use_cases
        
        try:
            # Check if the use_case_markdown might exceed context limits (using model-specific limits)
            review_context_limit = get_max_context_chars("English", "REVIEW_USE_CASES_PROMPT")
            markdown_size = len(use_case_markdown)
            prompt_template = self.ai_agent.prompt_templates.get("REVIEW_USE_CASES_PROMPT", "")
            estimated_prompt_size = len(prompt_template) + markdown_size + 1000  # +1000 for other vars
            
            if estimated_prompt_size > review_context_limit:
                self.logger.warning(
                    f"Deduplication prompt size ({estimated_prompt_size:,} chars) exceeds model limit ({review_context_limit:,}). "
                    f"Falling back to domain-level parallel deduplication..."
                )
                return self._deduplicate_use_cases_by_domain_parallel(all_use_cases)
            
            prompt_vars = {
                "use_case_markdown": use_case_markdown,
                "total_count": len(all_use_cases)
            }
            
            self.logger.info(f"⏳ Waiting for LLM response (deduplicating {len(all_use_cases)} use cases)...")
            
            response_raw = self.ai_agent.run_worker(
                step_name="Deduplicate_Use_Cases",
                worker_prompt_path="REVIEW_USE_CASES_PROMPT",
                prompt_vars=prompt_vars,
                response_schema=None
            )
            
            self.logger.info(f"✅ Received LLM response, parsing deduplication results...")
            
            # Clean response (remove markdown fences if present)
            response_clean = clean_json_response(response_raw)
            
            # Parse the CSV response using the centralized utility
            try:
                csv_rows = CSVParser.parse_csv_string(
                    response_clean,
                    logger=self.logger,
                    context="Deduplication"
                )
                ids_to_keep = []
                
                for row in csv_rows:
                    # Skip rows with a missing or empty 'use_case_id' value
                    uc_id = (row.get('use_case_id') or '').strip()
                    if uc_id:
                        ids_to_keep.append(uc_id)
                
                if not ids_to_keep:
                    raise ValueError("CSV contains no use case IDs")
                
                ids_to_keep_set = set(ids_to_keep)
                    
            except Exception as csv_err:
                self.logger.error(f"CSV parsing failed: {csv_err}. Raw response (first 500 chars): {response_raw[:500]}")
                raise
            
            # Keep only the use cases whose IDs the LLM returned. Counts are taken from the
            # filtered list so that unknown IDs in the response cannot skew the statistics.
            unique_use_cases = [uc for uc in all_use_cases if uc['No'] in ids_to_keep_set]
            if not unique_use_cases:
                raise ValueError("None of the returned IDs match a known use case")
            
            removed_count = len(all_use_cases) - len(unique_use_cases)
            removal_pct = (removed_count / len(all_use_cases)) * 100 if all_use_cases else 0
            
            self.logger.info(f"Deduplication complete: Retained {len(unique_use_cases)} use cases, removed {removed_count} ({removal_pct:.1f}%)")
            
            # Log count of removed use cases per domain
            if removed_count > 0:
                removed_use_cases = [uc for uc in all_use_cases if uc['No'] not in ids_to_keep_set]
                
                # Group removed use cases by domain
                from collections import defaultdict
                domain_removal_counts = defaultdict(int)
                for uc in removed_use_cases:
                    domain = uc.get('Business Domain', 'Unknown')
                    domain_removal_counts[domain] += 1
                
                # Log counts per domain
                self.logger.info("Removed use cases by domain:")
                for domain in sorted(domain_removal_counts.keys()):
                    count = domain_removal_counts[domain]
                    self.logger.info(f"  - {domain}: {count} use case(s) removed")
            
            return unique_use_cases
            
        except Exception as e:
            self.logger.error(f"Global use case deduplication failed: {e}. Proceeding with the full list of use cases.")
            return all_use_cases

    def _deduplicate_use_cases_by_domain_parallel(self, all_use_cases: list) -> list:
        """
        Deduplicate use cases at domain level in parallel using scores for intelligent selection.
        
        Args:
            all_use_cases: List of all use case dictionaries (should be scored first)
            
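        Returns:
            List of deduplicated use cases, grouped by domain in completion order
            (not the input order). Domains that fail or time out are kept in full.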
        """
        from collections import defaultdict
        from concurrent.futures import ThreadPoolExecutor, as_completed
        import concurrent.futures
        import time
        
        self.logger.info(f"🔄 Starting intelligent domain-level deduplication for {len(all_use_cases)} scored use cases...")
        
        # Group use cases by domain
        domain_use_cases = defaultdict(list)
        for uc in all_use_cases:
            domain = uc.get('Business Domain', 'Unknown')
            domain_use_cases[domain].append(uc)
        
        self.logger.info(f"📊 Grouped use cases into {len(domain_use_cases)} domains")
        
        # Deduplicate each domain in parallel
        deduplicated_results = []
        
        def dedupe_domain(domain_name, domain_ucs):
            """Deduplicate a single domain's use cases."""
            try:
                self.logger.info(f"[{domain_name}] Deduplicating {len(domain_ucs)} use cases...")
                
                # Create markdown table for this domain including scores
                md_parts = ["| ID | Name | Business Value | Tables | ROI | Strat. Align |\n|---|---|---|---|---|---|\n"]
                for uc in domain_ucs:
                    name = str(uc.get('Name', '')).replace('|', r'\|')
                    business_value = str(uc.get('Business Value', ''))[:100].replace('|', r'\|')
                    tables = str(uc.get('Tables Involved', ''))[:50].replace('|', r'\|')
                    roi = str(uc.get('Return on Investment', 'N/A'))
                    strat_align = str(uc.get('Strategic Alignment', 'N/A'))
                    
                    md_parts.append(f"| {uc['No']} | {name} | {business_value} | {tables} | {roi} | {strat_align} |\n")
                use_case_markdown = "".join(md_parts)
                
                # Check size (using model-specific limits from TECHNICAL_CONTEXT)
                review_context_limit = get_max_context_chars("English", "REVIEW_USE_CASES_PROMPT")
                prompt_template = self.ai_agent.prompt_templates.get("REVIEW_USE_CASES_PROMPT", "")
                estimated_size = len(prompt_template) + len(use_case_markdown) + 1000
                
                if estimated_size > review_context_limit:
                    self.logger.warning(f"[{domain_name}] Domain still too large ({estimated_size:,} chars). Keeping all {len(domain_ucs)} use cases without deduplication.")
                    return domain_ucs
                
                # Append explicit instructions about score-based selection to the markdown context
                context_notes = """
                **CRITICAL DEDUPLICATION RULES**:
                1. If two use cases are DUPLICATES (same intent/logic):
                   - Keep the one with higher 'ROI' and 'Strat. Align'.
                   - If scores are similar, keep the one with better detail.
                2. If two use cases use IDENTICAL tables and have similar logic -> Treat as DUPLICATE.
                3. If two use cases are similar but use DIFFERENT tables -> KEEP BOTH.
                   - In this case, mark them as distinct.
                   - Ensure they have IDENTICAL scores if logic is same.
                   - Add note to Justification: "Very Similar to [Other_ID]"
                
                **HIGH PRIORITY**:
                - Pay special attention to complex, high-value use cases that involve "Root Cause Analysis", "Predictive", "Optimization", "Anomaly Detection", or "Forecasting".
                - DO NOT REMOVE high-value use cases unless they are EXACT duplicates.
                """
                
                prompt_vars = {
                    "use_case_markdown": use_case_markdown + "\n" + context_notes,
                    "total_count": len(domain_ucs)
                }
                
                self.logger.info(f"⏳ [{domain_name}] Waiting for LLM response...")
                
                response_raw = self.ai_agent.run_worker(
                    step_name=f"Deduplicate_Domain_{domain_name}",
                    worker_prompt_path="REVIEW_USE_CASES_PROMPT",
                    prompt_vars=prompt_vars,
                    response_schema=None
                )
                
                # Parse response
                response_clean = clean_json_response(response_raw)
                
                # Parse CSV using centralized utility
                csv_rows = CSVParser.parse_csv_string(
                    response_clean,
                    logger=self.logger,
                    context=f"Domain deduplication for {domain_name}"
                )
                ids_to_keep = set()
                
                for row in csv_rows:
                    uc_id = (row.get('use_case_id') or '').strip()
                    if uc_id:
                        ids_to_keep.add(uc_id)
                
                if not ids_to_keep:
                    self.logger.warning(f"[{domain_name}] No IDs returned. Keeping all use cases.")
                    return domain_ucs
                
                deduplicated = [uc for uc in domain_ucs if uc['No'] in ids_to_keep]
                
                # NOTE: Rule 3 in context_notes ("same logic, different tables") asks for synced
                # scores and a justification note, but REVIEW_USE_CASES_PROMPT typically returns
                # only a list of IDs. Without the LLM reporting which pairs match, scores cannot
                # be synced here, so only the returned ID list is applied for now.
                
                removed = len(domain_ucs) - len(deduplicated)
                
                self.logger.info(f"✅ [{domain_name}] Retained {len(deduplicated)} use cases, removed {removed}")
                
                return deduplicated
                
            except Exception as e:
                self.logger.error(f"[{domain_name}] Deduplication failed: {e}. Keeping all {len(domain_ucs)} use cases.")
                return domain_ucs
        
        # ADAPTIVE PARALLELISM: Calculate based on domains and total use cases
        total_use_cases = sum(len(ucs) for ucs in domain_use_cases.values())
        num_domains = len(domain_use_cases)
        
        dedup_parallelism, reason = calculate_adaptive_parallelism(
            "deduplication", self.max_parallelism,
            num_items=total_use_cases,
            num_domains=num_domains,
            is_llm_operation=True, logger=self.logger
        )
        
        # Overall timeout: 5 minutes per domain, spread across the parallel workers, plus a 5-minute buffer
        overall_timeout = (num_domains * 300 // dedup_parallelism) + 300
        
        log_print(f"\n{'='*80}")
        log_print(f"🔄 DEDUPLICATION: Processing {num_domains} domains ({total_use_cases} total use cases)")
        log_print(f"{'='*80}")
        log_adaptive_parallelism_decision("deduplication", dedup_parallelism, self.max_parallelism, reason)
        log_print(f"Overall timeout: {overall_timeout}s ({overall_timeout//60} min)")
        log_print(f"{'='*80}\n")
        
        with ThreadPoolExecutor(max_workers=dedup_parallelism, thread_name_prefix="DomainDedupe") as executor:
            future_to_domain = {}
            for domain, domain_ucs in domain_use_cases.items():
                future = executor.submit(dedupe_domain, domain, domain_ucs)
                future_to_domain[future] = domain
            
            completed_count = 0
            completed_domains = set()
            start_time = time.time()
            
            try:
                for future in as_completed(future_to_domain, timeout=overall_timeout):
                    domain = future_to_domain[future]
                    elapsed = time.time() - start_time
                    try:
                        domain_deduplicated = future.result(timeout=30)
                        deduplicated_results.extend(domain_deduplicated)
                        completed_count += 1
                        completed_domains.add(domain)
                        log_print(f"[Deduplication] ✓ Domain {completed_count}/{len(domain_use_cases)} complete: {domain} ({elapsed:.1f}s elapsed)")
                    except concurrent.futures.TimeoutError:
                        self.logger.error(f"[{domain}] Result collection timed out - keeping original use cases")
                        deduplicated_results.extend(domain_use_cases.get(domain, []))
                        completed_domains.add(domain)
                    except Exception as e:
                        self.logger.error(f"[{domain}] Failed to collect results: {e} - keeping original use cases")
                        deduplicated_results.extend(domain_use_cases.get(domain, []))
                        completed_domains.add(domain)
            except concurrent.futures.TimeoutError:
                self.logger.error(f"⚠️  Overall deduplication timeout reached ({overall_timeout}s). {completed_count}/{len(domain_use_cases)} domains completed.")
                log_print(f"[Deduplication] ⚠️  TIMEOUT - keeping original use cases for incomplete domains", level="WARNING")
                for domain, domain_ucs in domain_use_cases.items():
                    if domain not in completed_domains:
                        self.logger.warning(f"[{domain}] Timed out - keeping all {len(domain_ucs)} original use cases")
                        deduplicated_results.extend(domain_ucs)
        
        total_removed = len(all_use_cases) - len(deduplicated_results)
        removal_pct = (total_removed / len(all_use_cases)) * 100 if all_use_cases else 0
        
        self.logger.info(f"✅ Domain-level deduplication complete: Retained {len(deduplicated_results)} use cases, removed {total_removed} ({removal_pct:.1f}%)")
        
        return deduplicated_results

    def _score_use_cases_global(self, all_use_cases: list, business_context: str = "",
                                strategic_goals: list = None, business_priorities: list = None,
                                strategic_initiative: str = "", value_chain: str = "",
                                revenue_model: str = "") -> list:
        """
        Try to score ALL use cases together in one prompt using minimal context (ID, Name, Statement, Business Value).
        Returns None on failure so callers can fall back to domain-based scoring.
        """
        if not all_use_cases:
            return all_use_cases

        try:
            md_parts = ["| No | Name | Statement | Business Value |\n|---|---|---|---|\n"]
            pipe_escape = r'\|'
            for uc in all_use_cases:
                no_val = str(uc.get('No', '')).replace('|', pipe_escape)
                name_val = str(uc.get('Name', '')).replace('|', pipe_escape)
                stmt_val = str(uc.get('Statement', '')).replace('|', pipe_escape)
                bv_val = str(uc.get('Business Value', '')).replace('|', pipe_escape)
                md_parts.append(f"| {no_val} | {name_val} | {stmt_val} | {bv_val} |\n")
            use_case_markdown = "".join(md_parts)

            prompt_vars = {
                "use_case_markdown": use_case_markdown,
                "business_context": business_context or "General business operations",
                "strategic_goals": "\n".join([f"- {goal}" for goal in (strategic_goals or [])]) or "- Improve operational efficiency",
                "business_priorities": "\n".join([f"- {priority}" for priority in (business_priorities or [])]) or "- Optimize costs",
                "strategic_initiative": strategic_initiative or "Data-driven transformation program",
                "value_chain": value_chain or "Standard operations",
                "revenue_model": revenue_model or "Products and services"
            }

            self.logger.info(f"⏳ Attempting GLOBAL scoring for {len(all_use_cases)} use cases (minimal fields to fit context)")
            response_raw = self.ai_agent.run_worker(
                step_name="Score_All_Use_Cases_Global",
                worker_prompt_path="SCORE_USE_CASES_PROMPT",
                prompt_vars=prompt_vars,
                response_schema=None
            )

            response_clean = clean_json_response(response_raw)
            scoring_data = CSVParser.parse_csv_string(
                response_clean,
                logger=self.logger,
                context="GLOBAL scoring"
            )

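            # Index scores by use case ID, trimming whitespace so IDs from the LLM match reliably
            scoring_by_no = {str(s.get('No') or '').strip(): s for s in scoring_data}
            scored_use_cases = []
            for uc in all_use_cases:
                no = uc.get('No')
                scores = scoring_by_no.get(str(no or '').strip(), {})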
                uc_copy = uc.copy()
                for key, value in scores.items():
                    if key != 'No':
                        uc_copy[key] = value

                try:
                    strategic_alignment = float(uc_copy.get('Strategic Alignment', 3.5))
                    roi = float(uc_copy.get('Return on Investment', 3.5))
                    reusability = float(uc_copy.get('Reusability', 3.5))
                    time_to_value = float(uc_copy.get('Time to Value', 3.5))

                    data_availability = float(uc_copy.get('Data Availability', 3.5))
                    data_accessibility = float(uc_copy.get('Data Accessibility', 3.5))
                    architecture_fitness = float(uc_copy.get('Architecture Fitness', 3.5))
                    team_skills = float(uc_copy.get('Team Skills', 3.5))
                    domain_knowledge = float(uc_copy.get('Domain Knowledge', 3.5))
                    people_allocation = float(uc_copy.get('People Allocation', 3.5))
                    budget_allocation = float(uc_copy.get('Budget Allocation', 3.5))
                    time_to_production = float(uc_copy.get('Time to Production', 3.5))

                    value_score = (
                        (roi * 0.60)
                        + (strategic_alignment * 0.25)
                        + (time_to_value * 0.075)
                        + (reusability * 0.075)
                    )

                    feasibility_inputs = [
                        data_availability,
                        data_accessibility,
                        architecture_fitness,
                        team_skills,
                        domain_knowledge,
                        people_allocation,
                        budget_allocation,
                        time_to_production,
                    ]
                    feasibility_score = sum(feasibility_inputs) / len(feasibility_inputs)

                    priority_score = (value_score * 1.5) + (feasibility_score * 0.5)

                    uc_copy['Value'] = round(value_score, 2)
                    uc_copy['Feasibility'] = round(feasibility_score, 2)
                    uc_copy['Priority Score'] = round(priority_score, 2)
                    
                    # Ensure AI_Confidence and AI_Feedback are set with defaults if missing.
                    # Compare against None/'' so a legitimate confidence of 0 is not overwritten.
                    if uc_copy.get('AI_Confidence') in (None, ''):
                        uc_copy['AI_Confidence'] = 0.5
                    if uc_copy.get('AI_Feedback') in (None, ''):
                        uc_copy['AI_Feedback'] = 'No feedback provided by AI scoring.'

                    if priority_score >= 9.5:
                        priority_label = "Ultra High"
                    elif priority_score >= 8.5:
                        priority_label = "Very High"
                    elif priority_score >= 7.5:
                        priority_label = "High"
                    elif priority_score >= 5.5:
                        priority_label = "Medium"
                    elif priority_score >= 4.5:
                        priority_label = "Low"
                    elif priority_score >= 2.5:
                        priority_label = "Very Low"
                    else:
                        priority_label = "Ultra Low"
                    uc_copy['Priority'] = priority_label
                except Exception:
                    uc_copy['Priority'] = uc_copy.get('Priority', 'Medium')
                    uc_copy['AI_Confidence'] = uc_copy.get('AI_Confidence', 0.5)
                    uc_copy['AI_Feedback'] = uc_copy.get('AI_Feedback', 'Scoring exception occurred.')
                scored_use_cases.append(uc_copy)

            self.logger.info(f"✅ GLOBAL scoring succeeded for {len(scored_use_cases)} use cases")
            return scored_use_cases
        except Exception as e:
            self.logger.error(f"Global scoring failed, will fall back to domain-based scoring: {e}")
            return None

    def _score_per_domain_parallel(self, all_use_cases: list, business_context: str = "",
                                   strategic_goals: list = None, business_priorities: list = None,
                                   strategic_initiative: str = "", value_chain: str = "",
                                   revenue_model: str = "") -> list:
        """
        Score use cases per domain in parallel (Phase 1).
        
        Use cases are grouped by 'Business Domain' and each group is scored in its own
        worker thread; the adaptive parallelism setting caps how many domains run at once.
        Scores are then normalized across domains, and all use cases from domains that
        finished scoring are returned for SQL generation.
        
        STABILITY FIX: Uses adaptive parallelism to prevent LLM API rate limiting 
        and adds heartbeat + total timeout to prevent hangs.
        """
        import time
        from collections import defaultdict
        from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeoutError
        
        # Group use cases by domain first to calculate adaptive parallelism
        domain_groups = defaultdict(list)
        for uc in all_use_cases:
            domain = uc.get('Business Domain', 'Other')
            domain_groups[domain].append(uc)
        
        num_use_cases = len(all_use_cases)
        num_domains = len(domain_groups)
        
        # ADAPTIVE PARALLELISM: Calculate based on use cases and domains
        scoring_parallelism, reason = calculate_adaptive_parallelism(
            "scoring", self.max_parallelism,
            num_items=num_use_cases,
            num_domains=num_domains,
            is_llm_operation=True, logger=self.logger
        )
        
        # Dynamic timeouts based on workload
        # More use cases = more time needed
        base_timeout_per_uc = 30  # seconds per use case
        TOTAL_SCORING_TIMEOUT = max(1800, min(3600, num_use_cases * base_timeout_per_uc))  # 30-60 min
        HEARTBEAT_INTERVAL = 60  # Log progress every 60 seconds
        # max(1, num_domains) avoids a ZeroDivisionError when the input list is empty
        PER_DOMAIN_TIMEOUT = max(600, min(1200, (num_use_cases // max(1, num_domains)) * 60))  # 10-20 min per domain
        
        self.logger.info(f"📊 PHASE 1: Scoring {num_use_cases} use cases per domain in parallel")
        self.logger.info(f"📊 Grouped into {num_domains} domains")
        
        log_print(f"\n{'='*80}")
        log_print(f"📊 PHASE 1: SCORING PER DOMAIN (PARALLEL)")
        log_print(f"{'='*80}")
        log_print(f"Total use cases: {num_use_cases}")
        log_print(f"Total domains: {num_domains}")
        log_adaptive_parallelism_decision("scoring", scoring_parallelism, self.max_parallelism, reason)
        log_print(f"Total timeout: {TOTAL_SCORING_TIMEOUT}s ({TOTAL_SCORING_TIMEOUT//60} min)")
        log_print(f"{'='*80}\n")
        
        for domain, use_cases in sorted(domain_groups.items()):
            self.logger.info(f"   - {domain}: {len(use_cases)} use cases")
        
        def score_domain(domain_name, domain_use_cases):
            """Score one domain's use cases"""
            try:
                self.logger.info(f"📊 [{domain_name}] Scoring {len(domain_use_cases)} use cases...")
                
                scored = self._score_use_cases(
                    domain_use_cases,
                    business_context=business_context,
                    strategic_goals=strategic_goals,
                    business_priorities=business_priorities,
                    strategic_initiative=strategic_initiative,
                    value_chain=value_chain,
                    revenue_model=revenue_model
                )
                
                self.logger.info(f"✅ [{domain_name}] Scoring complete")
                return (domain_name, scored)
                
            except Exception as e:
                self.logger.error(f"❌ [{domain_name}] Scoring failed: {e}")
                # Return with default scores - fix bug where 'Pending' priority was kept
                for uc in domain_use_cases:
                    if uc.get('Priority') in (None, '', 'Pending'):
                        uc['Priority'] = 'Medium'
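                        # 3.5 * 1.5 + 3.5 * 0.5 = 7.0, which matches the 'Medium' band (5.5 to 7.5)
                        # and the default used in _score_use_cases; 5.0 would fall in the 'Low' band
                        uc['Priority Score'] = 7.0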
                        uc['Value'] = 3.5
                        uc['Feasibility'] = 3.5
                return (domain_name, domain_use_cases)
        
        # Score all domains in parallel with reduced parallelism
        all_scored_use_cases = []
        
        with ThreadPoolExecutor(max_workers=scoring_parallelism, thread_name_prefix="DomainScoring") as executor:
            future_to_domain = {}
            for domain, domain_use_cases in domain_groups.items():
                future = executor.submit(score_domain, domain, domain_use_cases)
                future_to_domain[future] = domain
            
            completed_domains = 0
            failed_domains = 0
            total_domains = len(domain_groups)
            start_time = time.time()
            last_heartbeat = start_time
            
            # SOLUTION 1: Add total timeout to as_completed to prevent infinite hangs
            try:
                for future in as_completed(future_to_domain, timeout=TOTAL_SCORING_TIMEOUT):
                    domain = future_to_domain[future]
                    current_time = time.time()
                    
                    # SOLUTION 1: Heartbeat logging to show progress
                    if current_time - last_heartbeat >= HEARTBEAT_INTERVAL:
                        elapsed = current_time - start_time
                        pending = total_domains - completed_domains - failed_domains
                        log_print(f"⏳ Scoring progress: {completed_domains}/{total_domains} done, {pending} pending ({elapsed:.0f}s elapsed)")
                        self.logger.info(f"⏳ Scoring heartbeat: {completed_domains}/{total_domains} domains complete, {pending} pending")
                        last_heartbeat = current_time
                    
                    try:
                        domain_name, scored_use_cases = future.result(timeout=PER_DOMAIN_TIMEOUT)
                        all_scored_use_cases.extend(scored_use_cases)
                        completed_domains += 1
                        
                        log_print(f"✓ Scored domain {completed_domains}/{total_domains}: {domain_name} ({len(scored_use_cases)} use cases)")
                        self.logger.info(f"✓ Domain {completed_domains}/{total_domains} scoring complete: {domain_name}")
                        
                    except FuturesTimeoutError:
                        failed_domains += 1
                        self.logger.error(f"❌ [{domain}] Scoring timed out after {PER_DOMAIN_TIMEOUT}s")
                        log_print(f"✗ Timeout domain {completed_domains + failed_domains}/{total_domains}: {domain}", level="ERROR")
                    except Exception as e:
                        failed_domains += 1
                        self.logger.error(f"❌ [{domain}] Failed to collect scoring results: {e}")
                        log_print(f"✗ Failed domain {completed_domains + failed_domains}/{total_domains}: {domain}", level="ERROR")
                        
            except FuturesTimeoutError:
                # SOLUTION 1: Handle total timeout - proceed with what we have
                elapsed = time.time() - start_time
                pending = total_domains - completed_domains - failed_domains
                self.logger.error(f"⚠️ TOTAL SCORING TIMEOUT reached after {elapsed:.0f}s. {completed_domains}/{total_domains} domains completed, {pending} still pending.")
                log_print(f"⚠️ SCORING TIMEOUT: {completed_domains}/{total_domains} domains completed after {elapsed:.0f}s", level="WARNING")
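                log_print(f"   Proceeding with {len(all_scored_use_cases)} scored use cases", level="WARNING")
                # Cancel domains that have not started yet. Running domains cannot be interrupted,
                # and the executor's context-manager exit still waits for them to finish.
                for pending_future in future_to_domain:
                    pending_future.cancel()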
        
        log_print(f"\n{'='*80}")
        log_print(f"✅ PHASE 1 COMPLETE: {completed_domains}/{total_domains} DOMAINS SCORED")
        log_print(f"{'='*80}")
        log_print(f"Total scored use cases: {len(all_scored_use_cases)}")
        log_print(f"{'='*80}\n")
        
        self.logger.info(f"✅ Phase 1 complete: {len(all_scored_use_cases)} use cases scored across all domains")
        
        normalized_use_cases = self._normalize_priority_scores(all_scored_use_cases)
        self.logger.info(f"✅ Phase 1 normalized across {len(normalized_use_cases)} use cases")
        
        return normalized_use_cases

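    def _normalize_priority_scores(self, use_cases: list) -> list:
        """
        Rescale scores so the highest Priority Score lands on a random target in [9.5, 9.95].
        
        The same scale factor is applied to 'Priority Score', 'Value' and 'Feasibility'. Each
        result is capped at the target and rounded to 2 decimals, and the 'Priority' label is
        recomputed from the scaled Priority Score. Missing or non-numeric scores count as 0.
        
        Use cases are modified in place. The input is returned unchanged when it is empty
        or when no Priority Score is positive.
        """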
        if not use_cases:
            return use_cases
        
        scores = []
        for uc in use_cases:
            try:
                scores.append(float(uc.get('Priority Score', 0)))
            except (TypeError, ValueError):
                continue
        
        max_score = max(scores) if scores else 0.0
        if max_score <= 0:
            return use_cases
        
        target_max = random.uniform(9.5, 9.95)
        scale = target_max / max_score
        
        for uc in use_cases:
            try:
                priority_score = float(uc.get('Priority Score', 0))
            except (TypeError, ValueError):
                priority_score = 0.0
            try:
                value_score = float(uc.get('Value', 0))
            except (TypeError, ValueError):
                value_score = 0.0
            try:
                feasibility_score = float(uc.get('Feasibility', 0))
            except (TypeError, ValueError):
                feasibility_score = 0.0
            
            priority_scaled = min(priority_score * scale, target_max)
            value_scaled = min(value_score * scale, target_max)
            feasibility_scaled = min(feasibility_score * scale, target_max)
            
            uc['Priority Score'] = round(priority_scaled, 2)
            uc['Value'] = round(value_scaled, 2)
            uc['Feasibility'] = round(feasibility_scaled, 2)
            
            if priority_scaled >= 9.5:
                priority_label = "Ultra High"
            elif priority_scaled >= 8.5:
                priority_label = "Very High"
            elif priority_scaled >= 7.5:
                priority_label = "High"
            elif priority_scaled >= 5.5:
                priority_label = "Medium"
            elif priority_scaled >= 4.5:
                priority_label = "Low"
            elif priority_scaled >= 2.5:
                priority_label = "Very Low"
            else:
                priority_label = "Ultra Low"
            
            uc['Priority'] = priority_label
        
        self.logger.info(f"Applied priority normalization scale factor {scale:.2f} with target max {target_max:.2f}")
        return use_cases
    
    def _score_use_cases(self, all_use_cases: list, business_context: str = "", strategic_goals: list = None,
                         business_priorities: list = None, strategic_initiative: str = "",
                         value_chain: str = "", revenue_model: str = "") -> list:
        """
        Calls an LLM to score use cases across 13 different factors.
        
        🚨 NEW SCORING STRATEGY (PRIORITIZED):
        1. PRIMARY APPROACH: Score ALL use cases in ONE prompt (preferred for consistency)
        2. FALLBACK APPROACH: If all use cases don't fit in one prompt, score by domain in parallel
        
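        🚨 WEIGHTING: Priority Score = (Value × 1.5) + (Feasibility × 0.5)
        - Value = (ROI × 0.60) + (Strategic Alignment × 0.25) + (Time to Value × 0.075) + (Reusability × 0.075)
        - Feasibility = unweighted mean of the 8 feasibility factors (Data Availability, Data Accessibility,
          Architecture Fitness, Team Skills, Domain Knowledge, People Allocation, Budget Allocation, Time to Production)
        - Example: with every factor at the 3.5 default, Value = 3.5, Feasibility = 3.5,
          Priority Score = (3.5 × 1.5) + (3.5 × 0.5) = 7.0, which maps to "Medium"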
        
        🚨 NEW: If use cases exceed 2048, implements 2-pass scoring:
        - Pass 1: Score all use cases and select top 2048 by priority
        - Pass 2: Re-score the top 2048 for final ranking
        
        Args:
            all_use_cases: List of use case dictionaries to score
            business_context: Business context description
            strategic_goals: List of strategic goals
            business_priorities: List of business priorities
            strategic_initiative: Description of strategic initiative
            value_chain: Description of value chain
            revenue_model: Description of revenue model
            
        Returns:
            List of scored use cases (top 2048 if input >2048, otherwise all)
        """
        self.logger.info(f"Starting LLM-based scoring for {len(all_use_cases)} use cases...")
        
        if not all_use_cases:
            self.logger.warning("No use cases to score.")
            return all_use_cases
        
        # Check if we need 2-pass scoring
        needs_two_pass = len(all_use_cases) > 2048
        
        if needs_two_pass:
            self.logger.warning(f"⚠️ Use case count ({len(all_use_cases)}) exceeds 2048. Implementing 2-PASS SCORING process...")
            log_print(f"\n{'='*80}", level="WARNING")
            log_print(f"⚠️  LARGE USE CASE SET DETECTED: {len(all_use_cases)} use cases", level="WARNING")
            log_print(f"{'='*80}")
            log_print(f"Implementing 2-PASS SCORING to select top 2048 use cases:")
            log_print(f"  PASS 1: Score all {len(all_use_cases)} use cases → Select top 2048")
            log_print(f"  PASS 2: Re-score top 2048 for final ranking")
            log_print(f"{'='*80}\n")
        
        # Format business context variables for the prompt
        strategic_goals_text = "\n".join([f"- {goal}" for goal in (strategic_goals or [])])
        business_priorities_text = "\n".join([f"- {priority}" for priority in (business_priorities or [])])
        
        domain_name = all_use_cases[0].get('Business Domain', 'Domain') if all_use_cases else "Domain"
        
        try:
            # Create markdown table for this domain (minimal fields to fit context)
            md_parts = ["| No | Name | Business Value |\n|---|---|---|\n"]
            for uc in all_use_cases:
                no = str(uc.get('No', '')).replace('|', r'\|')
                name = str(uc.get('Name', '')).replace('|', r'\|')
                business_value = str(uc.get('Business Value', '')).replace('|', r'\|')
                md_parts.append(f"| {no} | {name} | {business_value} |\n")
            use_case_markdown = "".join(md_parts)
            
            prompt_vars = {
                "use_case_markdown": use_case_markdown,
                "business_context": business_context or "General business operations",
                "strategic_goals": strategic_goals_text or "- Maximize operational efficiency\n- Improve customer satisfaction",
                "business_priorities": business_priorities_text or "- Digital transformation\n- Cost optimization",
                "strategic_initiative": strategic_initiative or "Data-driven transformation program",
                "value_chain": value_chain or "Standard business operations",
                "revenue_model": revenue_model or "Product and service sales"
            }
            
            self.logger.info(f"⏳ [{domain_name}] Waiting for LLM response (scoring {len(all_use_cases)} use cases)...")
            response_raw = self.ai_agent.run_worker(
                step_name=f"Score_Use_Cases_{domain_name}",
                worker_prompt_path="SCORE_USE_CASES_PROMPT",
                prompt_vars=prompt_vars,
                response_schema=None
            )
            self.logger.info(f"✅ [{domain_name}] Received LLM response, processing results...")
            
            # Parse CSV response using centralized utility
            response_clean = clean_json_response(response_raw)
            scoring_data = CSVParser.parse_csv_string(
                response_clean,
                logger=self.logger,
                context=f"Scoring for domain {domain_name}"
            )
            
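            # Skip rows without a 'No' value: a single malformed row would otherwise raise KeyError
            # and send every use case down the default-scoring path
            scoring_map = {item['No']: item for item in scoring_data if item.get('No')}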
            
            self.logger.info(f"Received scoring for {len(scoring_map)} use cases from LLM for domain '{domain_name}'")
            
            # Check for missing scores and retry with progressive batch splitting
            missing_ids = [uc['No'] for uc in all_use_cases if uc['No'] not in scoring_map]
            MAX_RETRY_ROUNDS = 3  # Maximum number of retry rounds
            BATCH_SIZE_FOR_RETRY = 15  # Split into smaller batches for reliability
            
            retry_round = 0
            while missing_ids and retry_round < MAX_RETRY_ROUNDS:
                retry_round += 1
                missing_ucs = [uc for uc in all_use_cases if uc['No'] in missing_ids]
                
                # If many missing, split into smaller batches for better success rate
                if len(missing_ucs) > BATCH_SIZE_FOR_RETRY:
                    self.logger.warning(f"⚠️ [{domain_name}] Round {retry_round}: {len(missing_ids)} use cases missing scores. Splitting into batches of {BATCH_SIZE_FOR_RETRY}...")
                    batches = [missing_ucs[i:i + BATCH_SIZE_FOR_RETRY] for i in range(0, len(missing_ucs), BATCH_SIZE_FOR_RETRY)]
                else:
                    self.logger.warning(f"⚠️ [{domain_name}] Round {retry_round}: Retrying {len(missing_ids)} missing use cases...")
                    batches = [missing_ucs]
                
                for batch_idx, batch_ucs in enumerate(batches):
                    # Create a smaller prompt for this batch
                    retry_md_parts = ["| No | Name | Business Value |\n|---|---|---|\n"]
                    for uc in batch_ucs:
                        no = str(uc.get('No', '')).replace('|', r'\|')
                        name = str(uc.get('Name', '')).replace('|', r'\|')
                        business_value = str(uc.get('Business Value', '')).replace('|', r'\|')
                        retry_md_parts.append(f"| {no} | {name} | {business_value} |\n")
                    retry_use_case_markdown = "".join(retry_md_parts)
                    
                    retry_prompt_vars = {
                        "use_case_markdown": retry_use_case_markdown,
                        "business_context": business_context or "General business operations",
                        "strategic_goals": strategic_goals_text or "- Maximize operational efficiency\n- Improve customer satisfaction",
                        "business_priorities": business_priorities_text or "- Digital transformation\n- Cost optimization",
                        "strategic_initiative": strategic_initiative or "Data-driven transformation program",
                        "value_chain": value_chain or "Standard business operations",
                        "revenue_model": revenue_model or "Product and service sales"
                    }
                    
                    try:
                        batch_label = f"Batch {batch_idx+1}/{len(batches)}" if len(batches) > 1 else ""
                        self.logger.info(f"⏳ [{domain_name}] Round {retry_round} {batch_label}: Scoring {len(batch_ucs)} use cases...")
                        retry_response_raw = self.ai_agent.run_worker(
                            step_name=f"Score_Use_Cases_{domain_name}_Retry{retry_round}_B{batch_idx+1}",
                            worker_prompt_path="SCORE_USE_CASES_PROMPT",
                            prompt_vars=retry_prompt_vars,
                            response_schema=None
                        )
                        retry_response_clean = clean_json_response(retry_response_raw)
                        retry_scoring_data = CSVParser.parse_csv_string(
                            retry_response_clean,
                            logger=self.logger,
                            context=f"Retry scoring for domain {domain_name} round {retry_round} batch {batch_idx+1}"
                        )
                        
                        # Merge retry results into main scoring map
                        new_scores = 0
                        for item in retry_scoring_data:
                            if item.get('No') and item['No'] not in scoring_map:
                                scoring_map[item['No']] = item
                                new_scores += 1
                        
                        if new_scores > 0:
                            self.logger.info(f"✅ [{domain_name}] Round {retry_round} {batch_label}: Got {new_scores} new scores (total: {len(scoring_map)})")
                    except Exception as retry_err:
                        self.logger.warning(f"[{domain_name}] Round {retry_round} {batch_label} failed: {str(retry_err)[:100]}")
                
                # Update missing_ids for next round
                missing_ids = [uc['No'] for uc in all_use_cases if uc['No'] not in scoring_map]
                if missing_ids:
                    self.logger.info(f"[{domain_name}] After round {retry_round}: Still missing {len(missing_ids)} scores")
            
            if missing_ids:
                self.logger.warning(f"⚠️ [{domain_name}] After {MAX_RETRY_ROUNDS} retry rounds, {len(missing_ids)} use cases still missing scores. Using defaults.")
            
            # Add scoring data to use cases and compute Value/Feasibility/Priority in code
            def _to_float(raw, default=3.5):
                """Coerce an LLM-provided score to float; use the default for blank or malformed values."""
                try:
                    return float(raw)
                except (TypeError, ValueError):
                    return default

            scored_use_cases = []
            for uc in all_use_cases:
                uc_id = uc['No']
                if uc_id in scoring_map:
                    scores = scoring_map[uc_id]
                    
                    # One bad cell falls back to 3.5 instead of aborting scoring for the whole batch
                    uc['Strategic Alignment'] = _to_float(scores.get('Strategic Alignment', 3.5))
                    uc['Return on Investment'] = _to_float(scores.get('Return on Investment', 3.5))
                    uc['Reusability'] = _to_float(scores.get('Reusability', 3.5))
                    uc['Time to Value'] = _to_float(scores.get('Time to Value', 3.5))
                    uc['Data Availability'] = _to_float(scores.get('Data Availability', 3.5))
                    uc['Data Accessibility'] = _to_float(scores.get('Data Accessibility', 3.5))
                    uc['Architecture Fitness'] = _to_float(scores.get('Architecture Fitness', 3.5))
                    uc['Team Skills'] = _to_float(scores.get('Team Skills', 3.5))
                    uc['Domain Knowledge'] = _to_float(scores.get('Domain Knowledge', 3.5))
                    uc['People Allocation'] = _to_float(scores.get('People Allocation', 3.5))
                    uc['Budget Allocation'] = _to_float(scores.get('Budget Allocation', 3.5))
                    uc['Time to Production'] = _to_float(scores.get('Time to Production', 3.5))
                    uc['Business Priority Alignment'] = scores.get('Business Priority Alignment', 'General Improvement')
                    uc['Strategic Goals Alignment'] = scores.get('Strategic Goals Alignment', 'General Improvement')

                    value_score = (
                        (uc['Return on Investment'] * 0.60)
                        + (uc['Strategic Alignment'] * 0.25)
                        + (uc['Time to Value'] * 0.075)
                        + (uc['Reusability'] * 0.075)
                    )

                    feasibility_inputs = [
                        uc['Data Availability'],
                        uc['Data Accessibility'],
                        uc['Architecture Fitness'],
                        uc['Team Skills'],
                        uc['Domain Knowledge'],
                        uc['People Allocation'],
                        uc['Budget Allocation'],
                        uc['Time to Production'],
                    ]
                    feasibility_score = sum(feasibility_inputs) / len(feasibility_inputs)

                    priority_score = (value_score * 1.5) + (feasibility_score * 0.5)

                    uc['Value'] = round(value_score, 2)
                    uc['Feasibility'] = round(feasibility_score, 2)
                    uc['Priority Score'] = round(priority_score, 2)
                    
                    if 'Justification' in scores:
                        uc['Justification'] = scores.get('Justification', '')
                    
                    uc['AI_Confidence'] = scores.get('AI_Confidence', 0.5)
                    uc['AI_Feedback'] = scores.get('AI_Feedback', 'No feedback provided by AI scoring.')
                    
                    if priority_score >= 9.5:
                        priority_label = "Ultra High"
                    elif priority_score >= 8.5:
                        priority_label = "Very High"
                    elif priority_score >= 7.5:
                        priority_label = "High"
                    elif priority_score >= 5.5:
                        priority_label = "Medium"
                    elif priority_score >= 4.5:
                        priority_label = "Low"
                    elif priority_score >= 2.5:
                        priority_label = "Very Low"
                    else:
                        priority_label = "Ultra Low"
                    uc['Priority'] = priority_label
                    
                    self.logger.debug(f"Scored {uc_id}: Value={uc['Value']}, Feasibility={uc['Feasibility']}, Priority Score={priority_score}, Priority={priority_label}")
                else:
                    self.logger.warning(f"No scoring data received for use case {uc_id}, using defaults")
                    uc['Strategic Alignment'] = 3.5
                    uc['Return on Investment'] = 3.5
                    uc['Reusability'] = 3.5
                    uc['Time to Value'] = 3.5
                    uc['Data Availability'] = 3.5
                    uc['Data Accessibility'] = 3.5
                    uc['Architecture Fitness'] = 3.5
                    uc['Team Skills'] = 3.5
                    uc['Domain Knowledge'] = 3.5
                    uc['People Allocation'] = 3.5
                    uc['Budget Allocation'] = 3.5
                    uc['Time to Production'] = 3.5
                    uc['Value'] = 3.5
                    uc['Feasibility'] = 3.5
                    uc['Priority Score'] = 7.0
                    uc['Priority'] = "Medium"
                    uc['AI_Confidence'] = 0.5
                    uc['AI_Feedback'] = 'Default scoring applied - no LLM scoring data received.'
                
                scored_use_cases.append(uc)
            
            self.logger.info(f"Pass 1 scoring complete for {len(scored_use_cases)} use cases in domain '{domain_name}'")
            
            if needs_two_pass and len(scored_use_cases) > 2048:
                self.logger.warning(f"🔄 Starting PASS 2: Selecting top 2048 use cases and re-scoring...")
                log_print(f"\n{'='*80}")
                log_print(f"🔄 PASS 1 COMPLETE: {len(scored_use_cases)} use cases scored")
                log_print(f"{'='*80}")
                log_print(f"Selecting top 2048 use cases by Priority Score for PASS 2...")
                
                scored_use_cases.sort(key=lambda x: x.get('Priority Score', 0), reverse=True)
                top_2048 = scored_use_cases[:2048]
                excluded_count = len(scored_use_cases) - 2048
                
                self.logger.info(f"Selected top 2048 use cases. Excluded {excluded_count} lower-priority use cases.")
                log_print(f"✓ Selected top 2048 use cases")
                log_print(f"✗ Excluded {excluded_count} lower-priority use cases")
                log_print(f"\nStarting PASS 2: Re-scoring top 2048 use cases for final ranking...")
                log_print(f"{'='*80}\n")
                
                final_scored = self._score_use_cases(
                    top_2048,
                    business_context=business_context,
                    strategic_goals=strategic_goals,
                    business_priorities=business_priorities,
                    strategic_initiative=strategic_initiative,
                    value_chain=value_chain,
                    revenue_model=revenue_model
                )
                
                self.logger.info(f"✅ PASS 2 complete. Final set: {len(final_scored)} use cases")
                log_print(f"\n{'='*80}")
                log_print(f"✅ 2-PASS SCORING COMPLETE")
                log_print(f"{'='*80}")
                log_print(f"Final use case count: {len(final_scored)}")
                log_print(f"{'='*80}\n")
                
                return final_scored
            
            self.logger.info(f"Scoring complete for {len(scored_use_cases)} use cases in domain '{domain_name}'")
            return scored_use_cases
            
        except Exception as e:
            self.logger.error(f"Use case scoring failed: {e}. Proceeding without LLM scoring.")
            for uc in all_use_cases:
                uc['Strategic Alignment'] = 3.5
                uc['Return on Investment'] = 3.5
                uc['Reusability'] = 3.5
                uc['Time to Value'] = 3.5
                uc['Data Availability'] = 3.5
                uc['Data Accessibility'] = 3.5
                uc['Architecture Fitness'] = 3.5
                uc['Team Skills'] = 3.5
                uc['Domain Knowledge'] = 3.5
                uc['People Allocation'] = 3.5
                uc['Budget Allocation'] = 3.5
                uc['Time to Production'] = 3.5
                uc['Value'] = 3.5
                uc['Feasibility'] = 3.5
                uc['Priority Score'] = 7.0
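                uc['Priority'] = "Medium"
                # Match the per-use-case default path, which also sets the AI fields
                uc.setdefault('AI_Confidence', 0.5)
                uc.setdefault('AI_Feedback', 'Default scoring applied - LLM scoring failed.')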
            return all_use_cases

    def _translate_and_prepare_language_pack(self, lang: str, flat_english_use_cases: list, english_grouped_data: dict, business_name: str) -> tuple:
        """
        A single-function wrapper to run all data-gathering for a language.
        Designed to be run in a ThreadPoolExecutor.
        """
        lang_abbr = None  # Pre-bind: the except branch returns it even if the lookup below never ran
        try:
            # Set language for context limit calculations
            self.ai_agent.set_language(lang)
            
            self.logger.info(f"Starting translation & summary pack for {lang}...")
            lang_abbr = self._get_lang_abbr(lang)
            lang_translations = self.translation_service.get_translations(lang)
            # Disable parallelization to avoid nested ThreadPoolExecutors (this function is already called in parallel)
            lang_use_cases_translated = self.translation_service.translate_use_case_list([uc.copy() for uc in flat_english_use_cases], lang, max_parallelism=self.max_parallelism, enable_parallelization=False)
            lang_grouped_data = self._align_translated_data(english_grouped_data, lang_use_cases_translated)
            (lang_summary_dict, transliterated_name) = self._get_salesy_summary(lang_grouped_data, business_name, lang, lang_translations)
            
            self.logger.debug(f"Successfully processed all data for {lang}.")
            return (lang, lang_abbr, lang_translations, lang_grouped_data, lang_summary_dict, transliterated_name)
        except Exception as e:
            self.logger.error(f"Failed to process translation artifacts for {lang}: {e}")
            import traceback
            self.logger.error(f"Full traceback for {lang}: {traceback.format_exc()}")
            return (lang, lang_abbr, None, None, None, None) # Return Nones to signal failure

    def _generate_documents_for_all_languages(self, final_consolidated_use_cases: list, english_grouped_data: dict = None, summary_dict: dict = None, languages: list = None, skip_excel_langs: list = None):
        """
        Generates PDF/PPTX documents for all target languages, plus the English-only
        Markdown, CSV, and Excel catalogs.
        Can be called from normal path (after use case generation) or docs-only path (from JSON).
        
        Args:
            final_consolidated_use_cases: List of use cases (flat)
            english_grouped_data: Optional, will be computed if not provided
            summary_dict: Optional, will be computed if not provided
            languages: Optional list of languages to generate (default: self.output_languages)
            skip_excel_langs: Optional list of languages whose Excel catalog should be skipped
                (for example, because it was already generated)
        """
        target_languages = languages if languages else self.output_languages
        skip_excel_langs = set(skip_excel_langs or [])
        self.logger.info(f"--- Starting Document Generation for Languages: {target_languages} ---")
        
        # Install all required dependencies BEFORE starting translations so that
        # missing packages surface early. Failures are not fatal: generation still
        # proceeds and falls back to the .md/.csv artifacts.
        self.logger.info("Checking and installing required dependencies before starting translations...")
        dependencies_ok = True
        
        if "PDF Catalog" in self.generate_choices or "Use Cases Catalog PDF" in self.generate_choices:
            self.logger.info("Checking PDF dependencies (weasyprint)...")
            try:
                import weasyprint
                self.logger.info("✓ PDF package (weasyprint) already installed.")
            except ImportError:
                self.logger.info("PDF package (weasyprint) not found. Installing...")
                try:
                    import subprocess, sys
                    subprocess.check_call([sys.executable, "-m", "pip", "install", "weasyprint"])
                    import weasyprint
                    self.logger.info("✓ PDF package (weasyprint) installed successfully.")
                except Exception as e:
                    self.logger.error(f"✗ Failed to install PDF dependencies: {e}")
                    dependencies_ok = False
        
        if "Presentation" in self.generate_choices:
            self.logger.info("Checking PPTX dependencies (python-pptx)...")
            try:
                import pptx
                self.logger.info("✓ PPTX package (python-pptx) already installed.")
            except ImportError:
                self.logger.info("PPTX package (python-pptx) not found. Installing...")
                try:
                    import subprocess, sys
                    subprocess.check_call([sys.executable, "-m", "pip", "install", "python-pptx"])
                    import pptx
                    self.logger.info("✓ PPTX package (python-pptx) installed successfully.")
                except Exception as e:
                    self.logger.error(f"✗ Failed to install PPTX dependencies: {e}")
                    dependencies_ok = False
        
        # Always check Excel dependencies as they're used for all artifact generation
        self.logger.info("Checking Excel dependencies (pandas, openpyxl)...")
        try:
            import pandas, openpyxl
            self.logger.info("✓ Excel packages (pandas, openpyxl) already installed.")
        except ImportError:
            self.logger.info("Excel packages not found. Installing...")
            try:
                import subprocess, sys
                subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas", "openpyxl"])
                import pandas, openpyxl
                self.logger.info("✓ Excel packages (pandas, openpyxl) installed successfully.")
            except Exception as e:
                self.logger.error(f"✗ Failed to install Excel dependencies: {e}")
                dependencies_ok = False
        
        if not dependencies_ok:
            self.logger.warning("⚠️ Some dependencies failed to install. PDF/PPTX/Excel generation may fail, but .md and .csv files will still be generated.")
        
        self.logger.info("✓ Proceeding with translations and artifact generation (fallback to .md/.csv if needed)...")
        
        # Prepare English data if not provided
        flat_english_use_cases = final_consolidated_use_cases
        if english_grouped_data is None:
            _unsorted_grouped = self._group_use_cases_by_domain_flat(flat_english_use_cases)
            english_grouped_data = {k: _unsorted_grouped[k] for k in sorted(_unsorted_grouped.keys())}
        
        # Get English translations
        english_translations = self.translation_service.get_translations("English")
        
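        # Get summary if not provided.
        # NOTE: the English branch of the language loop below calls _get_salesy_summary
        # again instead of reusing this value, so the result computed here is not
        # consumed later in this method.
        if summary_dict is None:
            (summary_dict, transliterated_name) = self._get_salesy_summary(english_grouped_data, self.business_name, "English", english_translations)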
        
        # STEP 1: Run translations in parallel (no nesting)
        # ADAPTIVE PARALLELISM: Calculate based on languages and use cases
        num_languages = len([l for l in target_languages if l != "English"])
        num_use_cases = len(flat_english_use_cases)
        
        translation_parallelism, reason = calculate_adaptive_parallelism(
            "translation", self.max_parallelism,
            num_items=num_use_cases,
            num_domains=num_languages,  # Each language is like a domain
            is_llm_operation=True, logger=self.logger
        )
        log_adaptive_parallelism_decision("translation", translation_parallelism, self.max_parallelism, reason)
        
        translation_futures = []
        translation_results = {}
        
        with ThreadPoolExecutor(max_workers=translation_parallelism, thread_name_prefix="Translator") as translation_executor:
            for lang in target_languages:
                if lang == "English":
                    # No translation needed for English
                    self.logger.info("Preparing English artifacts (no translation needed).")
                    (summary_dict_en, transliterated_name_en) = self._get_salesy_summary(english_grouped_data, self.business_name, "English", english_translations)
                    lang_abbr = self._get_lang_abbr("English")
                    translation_results["English"] = ("English", lang_abbr, english_translations, english_grouped_data, summary_dict_en, transliterated_name_en)
                    continue

                self.logger.info(f"Submitting translation & summary pack job for {lang}...")
                f = translation_executor.submit(
                    self._translate_and_prepare_language_pack,
                    lang, flat_english_use_cases, english_grouped_data, self.business_name
                )
                translation_futures.append((f, lang))

            self.logger.info(f"Waiting for {len(translation_futures)} language packs to complete...")
            # Add timeout: 30 minutes per language pack
            total_timeout = len(translation_futures) * 1800
            self.logger.info(f"Language pack processing timeout set to {total_timeout}s ({total_timeout//60} minutes)")
            
            # Waiting sequentially with a per-future timeout already bounds the total
            # wait to total_timeout, and the inner handler catches every TimeoutError,
            # so no separate overall timeout handler is needed.
            for future, lang in translation_futures:
                try:
                    result = future.result(timeout=1800)
                    (lang, lang_abbr, lang_translations, lang_grouped_data, lang_summary_dict, transliterated_name) = result
                    
                    if lang_translations is None:
                        self.logger.warning(f"Skipping artifact generation for {lang} due to translation/summary failure.")
                        continue

                    self.logger.info(f"Translation pack for {lang} complete.")
                    translation_results[lang] = result
                except concurrent.futures.TimeoutError:
                    self.logger.error(f"Language pack processing timed out after 30 minutes for {lang}")
                except Exception as e:
                    self.logger.error(f"Language pack processing failed for {lang}: {e}")
        
        # STEP 2: Run artifact writing in parallel (separate, not nested)
        # ADAPTIVE PARALLELISM: Calculate based on number of artifacts to write
        num_artifacts = len(translation_results) * 3  # Roughly: PDF, PPTX, Excel per language
        
        writing_parallelism, reason = calculate_adaptive_parallelism(
            "artifact_writing", self.max_parallelism,
            num_items=num_artifacts,
            is_llm_operation=False, logger=self.logger
        )
        log_adaptive_parallelism_decision("artifact_writing", writing_parallelism, self.max_parallelism, reason)
        self.logger.info(f"Translations complete. Starting artifact generation for {len(translation_results)} languages...")
        
        with ThreadPoolExecutor(max_workers=writing_parallelism, thread_name_prefix="Writer") as writer_executor:
            writing_futures = []
            
            for lang, result in translation_results.items():
                (lang, lang_abbr, lang_translations, lang_grouped_data, lang_summary_dict, transliterated_name) = result
                
                self.logger.info(f"Submitting writing jobs for {lang}...")
                
                # ALWAYS generate .md and .csv files for English (fallback artifacts)
                if lang == "English":
                    f = writer_executor.submit(self._generate_markdown_catalog, lang, lang_abbr, lang_grouped_data, lang_summary_dict, transliterated_name)
                    writing_futures.append((f, f"{lang} Markdown"))
                    f = writer_executor.submit(self._generate_csv_catalog, lang, lang_abbr, lang_grouped_data)
                    writing_futures.append((f, f"{lang} CSV"))
                
                if "PDF Catalog" in self.generate_choices or "Use Cases Catalog PDF" in self.generate_choices:
                    f = writer_executor.submit(self.generate_catalog_pdf, lang, lang_abbr, lang_translations, lang_summary_dict, lang_grouped_data, transliterated_name)
                    writing_futures.append((f, f"{lang} PDF"))
                if "Presentation" in self.generate_choices:
                    f = writer_executor.submit(self.generate_presentation_pptx, lang, lang_abbr, lang_translations, lang_summary_dict, lang_grouped_data, transliterated_name)
                    writing_futures.append((f, f"{lang} PPTX"))
                if lang == "English" and lang not in skip_excel_langs:
                    f = writer_executor.submit(self._generate_use_case_excel, lang, lang_abbr, lang_grouped_data)
                    writing_futures.append((f, f"{lang} Excel"))
                elif lang == "English":
                    self.logger.info(f"Skipping Excel generation for {lang} (already generated).")
                else:
                    self.logger.info(f"Skipping Excel generation for {lang} (English only).")
            
            # Wait for all writing jobs to complete
            for future, job_name in writing_futures:
                try:
                    future.result(timeout=600)
                    self.logger.info(f"✓ {job_name} completed")
                except Exception as e:
                    self.logger.error(f"✗ {job_name} failed: {e}")
        
        self.logger.info("All artifact writing jobs completed.")

    def run(self):
        self.logger.info(f"Starting tasks: {self.generate_choices}, Operation Mode: {self.operation_mode}")
        
        if 'PROMPT_TEMPLATES' not in globals():
            self.logger.critical("CRITICAL ERROR: 'PROMPT_TEMPLATES' dictionary is not defined. Please run the cell defining it.")
            log_print("CRITICAL ERROR: 'PROMPT_TEMPLATES' dictionary is not defined. Please run the cell defining it.", level="CRITICAL")
            return
        
        if self.auto_parallelism and self.max_parallelism <= 0:
            (recommended, tables_per_batch, est_batches, avg_table_chars, max_by_memory) = self._calculate_dynamic_parallelism(
                0,
                0,
                0,
                0
            )
            self.max_parallelism = recommended
            self.logger.info(f"Dynamic parallelism set to {self.max_parallelism} (memory_cap={max_by_memory})")
            log_print(f"✅ Dynamic parallelism: {self.max_parallelism}")

        english_translations = self.translation_service.get_translations("English")
        final_consolidated_use_cases = []
        
        # === OPERATION MODE ROUTING ===
        if self.operation_mode == "Re-generate SQL":
            self.logger.info("🔧 RE-GENERATE SQL MODE: Regenerating failed SQL queries")
            log_print(f"\n{'='*80}")
            log_print(f"🔧 RE-GENERATE SQL MODE")
            log_print(f"{'='*80}")
            log_print(f"ℹ️  This mode regenerates SQL and samples for flagged use cases in existing notebooks")
            
            try:
                self._run_queries_fixing_mode()
                return
            except Exception as e:
                self.logger.critical(f"Failed to run Re-generate SQL mode: {e}")
                log_print(f"❌ Error in Re-generate SQL mode: {e}", level="ERROR")
                import traceback
                traceback.print_exc()
                return
        
        if self.operation_mode == "Generate Sample Result":
            self.logger.info("📊 GENERATE SAMPLE RESULT MODE: Executing SQL and generating sample outputs")
            log_print(f"\n{'='*80}")
            log_print(f"📊 GENERATE SAMPLE RESULT MODE")
            log_print(f"{'='*80}")
            log_print(f"ℹ️  This mode executes SQL for use cases with generate_sample_result:Yes and generates sample outputs")
            
            try:
                self._run_generate_sample_result_mode()
                return
            except Exception as e:
                self.logger.critical(f"Failed to run Generate Sample Result mode: {e}")
                log_print(f"❌ Error in Generate Sample Result mode: {e}", level="ERROR")
                import traceback
                traceback.print_exc()
                return
        
        if self.json_file_path:
            self.logger.info(f"🚀 JSON MODE: Loading use cases from JSON file: {self.json_file_path}")
            log_print(f"\n{'='*80}")
            log_print(f"🚀 JSON MODE ACTIVATED")
            log_print(f"{'='*80}")
            log_print(f"📁 JSON File: {self.json_file_path}")
            log_print(f"📋 Languages: {', '.join(self.output_languages)}")
            log_print(f"\n⚠️  SKIPPING:", level="WARNING")
            log_print(f"   ❌ Use Case Generation (using existing data from JSON)")
            log_print(f"\n✅ GENERATING:")
            log_print(f"   📓 Notebooks (always generated)")
            log_print(f"   📄 PDF Catalogs" if "PDF Catalog" in self.generate_choices or "Use Cases Catalog PDF" in self.generate_choices else "   (PDF generation not selected)")
            log_print(f"   📊 Presentations" if "Presentation" in self.generate_choices else "   (Presentation generation not selected)")
            log_print(f"{'='*80}\n")
            
            try:
                # Load the JSON catalog
                (final_consolidated_use_cases, summary_dict, english_grouped_data) = self._load_usecases_catalog_json(self.json_file_path)
                
                # 1. Generate Notebooks (always generated)
                if final_consolidated_use_cases:
                    self.logger.info("Starting notebook generation from JSON data...")
                    self.assemble_use_case_notebooks(final_consolidated_use_cases, english_translations, summary_dict)
                else:
                    self.logger.warning("No use cases found in JSON, skipping notebook creation.")
                
                # 2. Generate documents for all languages
                if "PDF Catalog" in self.generate_choices or "Presentation" in self.generate_choices or "Use Cases Catalog PDF" in self.generate_choices:
                    self._generate_documents_for_all_languages(
                        final_consolidated_use_cases,
                        english_grouped_data=english_grouped_data,
                        summary_dict=summary_dict
                    )
                
                # Report table inclusion/exclusion statistics
                if final_consolidated_use_cases:
                    self._report_table_statistics(final_consolidated_use_cases)
                
                # Upload log file and show summary
                self.logger.info(f"✅ All artifacts for {self.business_name} generated successfully from JSON")
                self.logger.info("Uploading log file...")
                self._upload_log_file()
                AIAgent.get_summary_report()
                
                # Final success message
                log_print(f"\n{'='*80}")
                log_print(f"✅ SUCCESS: All artifacts generated successfully from JSON")
                log_print(f"{'='*80}\n")
                return
                
            except Exception as e:
                self.logger.critical(f"Failed to process in docs-only mode: {e}")
                log_print(f"❌ Error: Failed to process in docs-only mode: {e}", level="ERROR")
                AIAgent.get_summary_report()
                return
        
        # === NORMAL PATH: Generate use cases from metadata ===
        
        # === NEW: Call Business Context Worker first ===
        self.logger.info("=" * 80)
        self.logger.info("🚀 STEP 1: EXTRACTING BUSINESS CONTEXT, STRATEGIC GOALS, AND PRIORITIES")
        self.logger.info("=" * 80)
        
        # Prepare user context string early - now uses business_domains instead of use_cases_focus
        user_domains_str = ', '.join(self.user_business_domains) if self.user_business_domains else ''
        
        # Extract business context
        llm_business_context = self._get_business_context_from_llm()
        self.logger.info("✅ Business context extraction completed")
        
        # Merge with user-provided domains (user takes precedence)
        merged_business_context = self._merge_business_contexts(llm_business_context, user_domains_str)
        
        # Handle user-provided strategic goals (hard focus)
        if self.user_strategic_goals:
            self.logger.info(f"✅ User provided {len(self.user_strategic_goals)} strategic goals - ONLY these will be used")
            merged_business_context["strategic_goals"] = self.user_strategic_goals
            self.logger.info(f"   Strategic Goals: {', '.join(self.user_strategic_goals)}")
        else:
            self.logger.info("ℹ️ No user strategic goals provided - will generate goals based on business context")
            llm_goals = merged_business_context.get("strategic_goals", [])
            if isinstance(llm_goals, str):
                llm_goals = [g.strip() for g in llm_goals.split(",") if g.strip()]
            merged_business_context["strategic_goals"] = llm_goals
        
        # Handle user-provided business priorities
        if self.user_business_priorities:
            self.logger.info(f"✅ User provided {len(self.user_business_priorities)} business priorities - these will drive ranking")
            merged_business_context["business_priorities"] = self.user_business_priorities
            self.logger.info(f"   Business Priorities: {', '.join(self.user_business_priorities)}")
        
        # Handle user-provided business domains
        if self.user_business_domains:
            self.logger.info(f"✅ User provided {len(self.user_business_domains)} business domains - use cases will be aligned to these domains ONLY")
            merged_business_context["user_business_domains"] = self.user_business_domains
            self.logger.info(f"   Business Domains: {', '.join(self.user_business_domains)}")
        else:
            self.logger.info("ℹ️ No user business domains provided - domains will be inferred from data")
        
        self.logger.info("✅ Business context extracted and merged.")
        self.logger.info("=" * 80)
        
        # Store merged context for use in prompt generation
        self.merged_business_context = merged_business_context
        
        # Extract business context fields for prompts
        ctx_business_context = merged_business_context.get("business_context", "")
        ctx_strategic_goals = merged_business_context.get("strategic_goals", [])
        # Handle if strategic_goals is a string (comma separated) or list
        if isinstance(ctx_strategic_goals, str):
            ctx_strategic_goals = [s.strip() for s in ctx_strategic_goals.split(",") if s.strip()]
        
        ctx_business_priorities = merged_business_context.get("business_priorities", [])
        if isinstance(ctx_business_priorities, str):
            ctx_business_priorities = [s.strip() for s in ctx_business_priorities.split(",") if s.strip()]
        ctx_strategic_initiative = merged_business_context.get("strategic_initiative", "")
        ctx_value_chain = merged_business_context.get("value_chain", "")
        ctx_revenue_model = merged_business_context.get("revenue_model", "")
        
        
        try:
            if self.data_loader:
                self.logger.debug("Data loader found. Starting batched table processing with MAX_CONTEXT_CHARS management...")
                
                # Defaults used when unstructured generation is disabled or no tables are found
                unstructured_docs_markdown = ""
                strategic_goals = ctx_strategic_goals
                business_context = ""
                business_priorities = []
                strategic_initiative = ""
                value_chain = ""
                revenue_model = ""

                # Generate unstructured document list first (needs schema overview)
                # Check if unstructured data is enabled
                if self.use_unstructured_data:
                    # We'll use a sampling approach or generate it once with limited schema
                    self.logger.info("Generating unstructured document list and extracting business context...")
                    # Get a small sample for unstructured doc generation
                    sample_columns = self.data_loader.getNextTables(self.scan_parallelism)
                    if sample_columns:
                        sample_columns = self._augment_columns_with_foreign_keys(sample_columns)
                        sample_schema_markdown = self._format_schema_for_prompt(sample_columns)
                        business_context_dict = self._generate_unstructured_docs(sample_schema_markdown)
                        
                        # Extract individual variables from the returned dict
                        unstructured_docs_markdown = business_context_dict.get("unstructured_docs_markdown", "")
                        strategic_goals = business_context_dict.get("strategic_goals", [])
                        business_context = business_context_dict.get("business_context", "")
                        business_priorities = business_context_dict.get("business_priorities", [])
                        strategic_initiative = business_context_dict.get("strategic_initiative", "")
                        value_chain = business_context_dict.get("value_chain", "")
                        revenue_model = business_context_dict.get("revenue_model", "")
                        
                        # Reset the data loader state to start from beginning
                        self.data_loader.current_table_idx = 0
                    else:
                        self.logger.warning("No tables found for unstructured doc generation")
                else:
                    self.logger.info("Unstructured data generation is disabled. Skipping...")

                # Now process tables in batches with model-specific context limits from TECHNICAL_CONTEXT
                use_case_gen_context_limit = get_max_context_chars("English", "BASE_USE_CASE_GEN_PROMPT")
                safe_context_limit = get_safe_context_limit("English", buffer_percent=0.9, prompt_name="BASE_USE_CASE_GEN_PROMPT")
                batch_num = 1
                accumulated_columns = []
                accumulated_schema_size = 0
                batches_to_process = []  # List of (batch_num, column_details) tuples
                
                # Get the base prompt template size estimate
                base_prompt_template = PROMPT_TEMPLATES.get("BASE_USE_CASE_GEN_PROMPT", "")
                base_template_size = len(base_prompt_template) + len(unstructured_docs_markdown)
                base_prompt_size = base_template_size + 2000
                self.logger.debug(f"Base prompt template size: {base_template_size} chars, context limit: {use_case_gen_context_limit}")
                
                # CRITICAL FIX: Keep pulling tables until we fill the model's context limit
                self.logger.info("Collecting batches for parallel processing (MAXIMIZING context utilization)...")
                while True:
                    # Keep pulling table batches until we hit the context limit
                    while True:
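                        batch_columns = self.data_loader.getNextTables(self.scan_parallelism)
                        
                        if not batch_columns:
                            # No more tables available. An empty result is normalized to
                            # None so the end-of-data check after this loop also fires
                            # and the outer loop cannot spin on an empty list.
                            batch_columns = None
                            break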

                        batch_columns = self._augment_columns_with_foreign_keys(batch_columns)
                        
                        batch_schema_size = self._estimate_schema_markdown_size(batch_columns)
                        estimated_prompt_size = base_prompt_size + accumulated_schema_size + batch_schema_size
                        
                        self.logger.debug(f"Batch {batch_num}: Got {len(batch_columns)} columns. "
                                       f"Total accumulated: {len(accumulated_columns) + len(batch_columns)} columns. "
                                       f"Estimated prompt size: {estimated_prompt_size}/{safe_context_limit} chars")
                        
                        if estimated_prompt_size > safe_context_limit:
                            if not accumulated_columns:
                                self.logger.warning(f"Initial table pull exceeds context limit ({estimated_prompt_size} chars). Splitting into sub-batches.")
                                split_batches = self._split_columns_to_fit_context(batch_columns, base_prompt_size, safe_context_limit)
                                for idx, split_columns in enumerate(split_batches, start=1):
                                    table_count = len({(c[0], c[1], c[2]) for c in split_columns})
                                    self.logger.warning(f"   Sub-batch {idx}: {table_count} tables ({len(split_columns)} columns)")
                                    batches_to_process.append((batch_num, split_columns))
                                    batch_num += 1
                                accumulated_columns = []
                                accumulated_schema_size = 0
                                break
                            else:
                                # Would exceed limit - save current accumulator as a batch
                                self.logger.info(f"Context limit reached. Saving batch {batch_num} with {len(accumulated_columns)} columns ({base_prompt_size + accumulated_schema_size} chars)")
                                batches_to_process.append((batch_num, accumulated_columns))
                                batch_num += 1
                                
                                # Start new accumulator with the current batch
                                accumulated_columns = batch_columns
                                accumulated_schema_size = batch_schema_size
                                break
                        else:
                            # Fits - keep accumulating and PULL MORE TABLES
                            accumulated_columns.extend(batch_columns)
                            accumulated_schema_size += batch_schema_size
                            # Continue inner loop to pull more tables
                    
                    # Check if we're done (no more tables)
                    if batch_columns is None:
                        # Add any remaining accumulated columns to batches
                        if accumulated_columns:
                            self.logger.info(f"Adding final batch {batch_num} with {len(accumulated_columns)} columns")
                            batches_to_process.append((batch_num, accumulated_columns))
                        break
                
                if not batches_to_process:
                    self.logger.warning("No batches to process. Skipping all generation.")
                    return
                
                total_tables = len({(c, s, t) for _, cols in batches_to_process for (c, s, t, _, _, _) in cols})
                tables_per_call = self._determine_tables_per_call(total_tables)
                if tables_per_call > 0:
                    adjusted_batches = []
                    next_batch_num = 1
                    for _, batch_columns in batches_to_process:
                        grouped = self._split_by_table_limit(batch_columns, tables_per_call)
                        for group in grouped:
                            estimated_prompt_size = base_prompt_size + self._estimate_schema_markdown_size(group)
                            if estimated_prompt_size > safe_context_limit:
                                singles = self._split_by_table_limit(group, 1)
                                for single in singles:
                                    adjusted_batches.append((next_batch_num, single))
                                    next_batch_num += 1
                            else:
                                adjusted_batches.append((next_batch_num, group))
                                next_batch_num += 1
                    batches_to_process = adjusted_batches
                    self.logger.info(f"Table-based batching: {tables_per_call} tables per call, {len(batches_to_process)} batches")
                
                # === NEW: Filter business vs technical tables ===
                self.logger.info("🔍 Filtering tables into BUSINESS vs TECHNICAL categories...")
                log_print(f"\n{'='*80}")
                log_print(f"🔍 FILTERING TABLES: Business Data vs Technical/Metadata")
                log_print(f"{'='*80}\n")
                
                # Collect all columns from all batches for filtering
                all_batch_columns = []
                for batch_num, batch_columns in batches_to_process:
                    all_batch_columns.extend(batch_columns)
                
                if self.auto_parallelism:
                    total_tables = len({(c, s, t) for (c, s, t, _, _, _) in all_batch_columns})
                    total_schema_chars = self._estimate_schema_markdown_size(all_batch_columns)
                    (recommended, tables_per_batch, est_batches, avg_table_chars, max_by_memory) = self._calculate_dynamic_parallelism(
                        total_tables,
                        total_schema_chars,
                        safe_context_limit,
                        base_prompt_size
                    )
                    self.max_parallelism = recommended
                    self.logger.info(f"Dynamic parallelism set to {self.max_parallelism} (tables={total_tables}, avg_table_chars={avg_table_chars}, tables_per_batch={tables_per_batch}, est_batches={est_batches}, memory_cap={max_by_memory})")
                    log_print(f"✅ Dynamic parallelism: {self.max_parallelism} (tables={total_tables}, est_batches={est_batches})")
                
                # Derive a coarse industry hint from the business context, if available
                industry = ""
                if 'business_context' in locals() and business_context:
                    # Simplification: treat the first line of the context as the industry descriptor
                    industry = business_context.split('\n')[0]
                
                # Filter tables
                (business_details, technical_details, business_tables, technical_tables, business_scores, data_category_map, master_tables_set, transactional_tables_set, reference_tables_set) = self._filter_business_tables(
                    all_batch_columns,
                    business_context=business_context if 'business_context' in locals() else "",
                    industry=industry,
                    exclusion_strategy=self.technical_exclusion_strategy
                )
                
                # === MEMORY OPTIMIZATION: Clear all_batch_columns after filtering ===
                del all_batch_columns
                gc.collect()
                self.logger.debug("🧹 Cleared all_batch_columns from memory")
                
                self._business_column_details_global = business_details
                self.global_table_names = {f"{c}.{s}.{t}" for (c, s, t, _, _, _) in business_details}
                
                # Store business_scores for later use in truncation
                self.business_scores = business_scores
                self.data_category_map = data_category_map
                
                if not business_details:
                    self.logger.warning("No business tables found after filtering. All tables were classified as technical/metadata.")
                    log_print("⚠️ No business tables found. All tables appear to be technical/metadata.", level="WARNING")
                    return
                
                self.logger.info(f"✅ Filtering complete: Proceeding with {len(business_tables)} business tables, "
                               f"excluding {len(technical_tables)} technical/metadata tables, "
                               f"excluding {len(reference_tables_set)} reference tables")
                
                master_details = []
                transactional_details = []
                for detail in business_details:
                    (catalog, schema, table, _, _, _) = detail
                    fqtn = f"{catalog}.{schema}.{table}"
                    if fqtn in transactional_tables_set:
                        transactional_details.append(detail)
                    else:
                        master_details.append(detail)
                
                master_tables_count = len(master_tables_set)
                tables_per_call = self._determine_tables_per_call(master_tables_count)
                
                adjusted_batches = []
                next_batch_num = 1
                
                if master_details:
                    grouped_master = self._split_by_table_limit(master_details, tables_per_call)
                    for group in grouped_master:
                        estimated_prompt_size = base_prompt_size + self._estimate_schema_markdown_size(group)
                        if estimated_prompt_size > safe_context_limit:
                            singles = self._split_by_table_limit(group, 1)
                            for single in singles:
                                adjusted_batches.append((next_batch_num, single))
                                next_batch_num += 1
                        else:
                            adjusted_batches.append((next_batch_num, group))
                            next_batch_num += 1
                
                if transactional_details:
                    grouped_tx = self._split_by_table_limit(transactional_details, 1)
                    for group in grouped_tx:
                        adjusted_batches.append((next_batch_num, group))
                        next_batch_num += 1
                
                augmented_batches = []
                for batch_num, batch_columns in adjusted_batches:
                    augmented = self._augment_columns_with_related_tables(batch_columns)
                    augmented_batches.append((batch_num, augmented))
                batches_to_process = augmented_batches
                log_print(f"✅ Business tables: {len(business_tables)}")
                log_print(f"   📊 Master Data tables: {len(master_tables_set)}")
                log_print(f"   📈 Transactional tables: {len(transactional_tables_set)}")
                log_print(f"🟡 Reference tables (excluded): {len(reference_tables_set)}")
                log_print(f"❌ Technical tables (excluded): {len(technical_tables)}")
                log_print(f"{'='*80}\n")

                # Update honesty tracking
                self.processing_honesty['total_tables_discovered'] = len(business_tables) + len(technical_tables) + len(reference_tables_set)
                self.processing_honesty['total_tables_processed'] = len(business_tables)
                self.processing_honesty['total_batches_created'] = len(batches_to_process)

                # Process each batch once (deduplication handles redundancy)
                # MEMORY OPTIMIZATION: Use file-based storage instead of keeping everything in memory
                # ADAPTIVE PARALLELISM: Calculate based on batches, tables and columns
                total_batch_columns = sum(len(cols) for _, cols in batches_to_process)
                # Rough size estimate at ~100 chars per column; note this is the total across all batches, not a per-prompt mean
                avg_prompt_chars = total_batch_columns * 100
                
                batch_parallelism, reason = calculate_adaptive_parallelism(
                    "use_case_generation", self.max_parallelism,
                    num_items=len(batches_to_process),
                    total_columns=total_batch_columns,
                    avg_prompt_chars=avg_prompt_chars,
                    is_llm_operation=True, logger=self.logger
                )
                
                log_print(f"\n{'='*80}")
                log_print(f"🔄 USE CASE GENERATION: SERIAL ENSEMBLE (2 PASSES)")
                log_print(f"{'='*80}")
                log_print(f"📋 PASS 1: Generate initial use cases from {len(batches_to_process)} batch(es)")
                log_print(f"📋 PASS 2: Generate NEW use cases not in PASS 1 (with feedback)")
                log_print("💾 Using file-based intermediate storage to prevent memory explosion")
                log_print(f"{'='*80}\n")
                
                self.storage_manager.initialize()
                
                # === PASS 1: Generate initial use cases (parallel within pass) ===
                self.logger.info("🔄 PASS 1: Generating initial use cases...")
                log_print(f"\n{'='*60}")
                log_print(f"🔄 PASS 1: Initial Use Case Generation")
                log_print(f"{'='*60}")
                
                with ThreadPoolExecutor(max_workers=batch_parallelism, thread_name_prefix="Pass1Batch") as executor:
                    future_to_batch = {}
                    for batch_num, column_details in batches_to_process:
                        unique_batch_id = f"P1_{batch_num}"
                        future = executor.submit(
                            self._process_batch_with_retry,
                            column_details,
                            unique_batch_id,
                            unstructured_docs_markdown,
                            strategic_goals,
                            business_context if 'business_context' in locals() else "",
                            business_priorities if 'business_priorities' in locals() else "",
                            strategic_initiative if 'strategic_initiative' in locals() else "",
                            value_chain if 'value_chain' in locals() else "",
                            revenue_model if 'revenue_model' in locals() else "",
                            3,  # max_attempts
                            ""  # No feedback for PASS 1
                        )
                        future_to_batch[future] = unique_batch_id
                        self.logger.info(f"✓ [PASS 1] Submitted batch {batch_num}")
                    
                    total_submissions = len(batches_to_process)
                    batches_completed = 0
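                    # Budget 900s per batch, spread across the workers actually used, plus a 600s grace period.
                    # batch_parallelism (not self.max_parallelism) is the pool size; max(1, ...) guards against division by zero.
                    total_timeout = (total_submissions * 900) // max(1, batch_parallelism) + 600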
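                    # Illustrative arithmetic (hypothetical numbers): 12 submissions at a parallelism of 4
                    # give (12 * 900) // 4 + 600 = 2700 + 600 = 3300 seconds for the whole pass.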
                    
                    try:
                        for future in concurrent.futures.as_completed(future_to_batch, timeout=total_timeout):
                            unique_batch_id = future_to_batch[future]
                            try:
                                use_cases = future.result(timeout=900)
                                if use_cases:
                                    self.storage_manager.save_batch(unique_batch_id, use_cases)
                                    batches_completed += 1
                                    self.logger.info(f"✓ [PASS 1] Batch {unique_batch_id}: {len(use_cases)} use cases ({batches_completed}/{total_submissions})")
                                    log_print(f"✓ [PASS 1] Batch complete ({batches_completed}/{total_submissions})")
                            except Exception as e:
                                self.logger.error(f"❌ [PASS 1] Batch {unique_batch_id} failed: {e}")
                    except concurrent.futures.TimeoutError:
                        self.logger.error(f"⚠️ [PASS 1] Timeout after {total_timeout}s. Proceeding with {batches_completed}/{total_submissions} completed.")
                
                # === MEMORY OPTIMIZATION: Save PASS 1 results to disk immediately ===
                # Count use cases and save IDs without loading all into memory
                pass1_count = self.storage_manager.get_total_count()
                self.logger.info(f"✅ PASS 1 complete: Generated {pass1_count} use cases")
                log_print(f"✅ PASS 1 complete: {pass1_count} use cases generated")
                
                # Save PASS 1 IDs to disk for later comparison (memory-efficient)
                pass1_ids = []
                for batch in self.storage_manager.iter_batches():
                    for uc in batch:
                        pass1_ids.append(uc.get('No', ''))
                self.storage_manager.save_pass1_ids(pass1_ids)
                del pass1_ids  # Free memory immediately
                
                # Force garbage collection after PASS 1
                gc.collect()
                self.logger.debug("🧹 Memory cleanup after PASS 1")
                
                # === PASS 2: Generate NEW use cases with feedback (TRANSACTIONAL TABLES ONLY) ===
                # Create transactional-only batches for PASS 2
                transactional_batches = []
                for batch_num, column_details in batches_to_process:
                    tx_columns = [col for col in column_details 
                                  if f"{col[0]}.{col[1]}.{col[2]}" in transactional_tables_set]
                    if tx_columns:
                        transactional_batches.append((batch_num, tx_columns))
                
                if pass1_count > 0 and transactional_batches:
                    self.logger.info("🔄 PASS 2: Generating NEW use cases from TRANSACTIONAL TABLES (with PASS 1 feedback)...")
                    log_print(f"\n{'='*60}")
                    log_print(f"🔄 PASS 2: Ensemble - TRANSACTIONAL TABLES ONLY")
                    log_print(f"{'='*60}")
                    log_print(f"📋 Feedback: {pass1_count} use cases from PASS 1")
                    log_print(f"📊 Transactional batches: {len(transactional_batches)} (focusing on event/transaction data)")
                    log_print(f"🎯 Goal: Find NEW use cases from transactional data NOT covered in PASS 1")
                    
                    # === MEMORY OPTIMIZATION: Build feedback iteratively and save to disk ===
                    feedback_lines = ["**🚀 PASS 2: YOUR MISSION IS TO FIND EVEN MORE VALUE! 🚀**\n"]
                    feedback_lines.append("Pass 1 generated some use cases, but there is MUCH MORE VALUE to extract from the transactional data!")
                    feedback_lines.append("Your job is to find DIFFERENT, COMPLEMENTARY use cases that Pass 1 missed.\n")
                    feedback_lines.append("**Already generated (reference only - avoid exact duplicates, but EXPLORE VARIATIONS):**")
                    feedback_lines.append("| No | Name | Tables Involved |")
                    feedback_lines.append("|---|---|---|")
                    
                    # Use memory-efficient iterator (doesn't load all at once)
                    feedback_count = 0
                    for idx, name, tables in self.storage_manager.iter_pass1_use_cases_for_feedback(limit=200):
                        feedback_lines.append(f"| {idx} | {name} | {tables} |")
                        feedback_count += 1
                    
                    # Report the remainder based on the rows actually listed, not the nominal limit
                    if pass1_count > feedback_count:
                        feedback_lines.append(f"\n*... and {pass1_count - feedback_count} more use cases (not shown)*")
                    
                    feedback_lines.append("\n**🔥 PASS 2 MISSION: FIND HIGH-VALUE USE CASES THAT PASS 1 MISSED 🔥**")
                    feedback_lines.append("")
                    feedback_lines.append("**VALUE-FOCUSED EXPLORATION:**")
                    feedback_lines.append("Pass 1 covered some ground, but transactional data often hides MASSIVE business value.")
                    feedback_lines.append("Your job is to find HIGH-ROI use cases that Pass 1 missed - NOT to generate filler.")
                    feedback_lines.append("")
                    feedback_lines.append("**WHAT TO LOOK FOR (only if they deliver REAL business value):**")
                    feedback_lines.append("- 💰 **Revenue opportunities**: Patterns that could increase sales, reduce churn, optimize pricing")
                    feedback_lines.append("- 🛡️ **Risk signals**: Fraud patterns, compliance risks, operational anomalies worth preventing")
                    feedback_lines.append("- 📊 **Strategic insights**: Cross-table relationships that reveal hidden business drivers")
                    feedback_lines.append("- ⚡ **Efficiency gains**: Process bottlenecks, resource optimization opportunities")
                    feedback_lines.append("")
                    feedback_lines.append("**QUALITY RULES:**")
                    feedback_lines.append("- ❌ Avoid duplicates of Pass 1 use cases (check the table above)")
                    feedback_lines.append("- ❌ Do NOT generate low-value filler just to add more use cases")
                    feedback_lines.append("- ✅ Variations that add DISTINCT business value ARE encouraged")
                    feedback_lines.append("- ✅ Different business angles on same data = valuable if ROI is clear")
                    feedback_lines.append("- ✅ Cross-table joins often unlock the highest value - explore these")
                    feedback_lines.append("")
                    feedback_lines.append("**SELF-CHECK**: For each use case ask: 'Would a CFO fund this? Does it impact revenue or reduce costs?'")
                    
                    # Save feedback to disk and clear from memory
                    self.storage_manager.save_feedback_file(feedback_lines)
                    previous_use_cases_feedback = "\n".join(feedback_lines)
                    del feedback_lines  # Free memory
                    gc.collect()
                    
                    with ThreadPoolExecutor(max_workers=batch_parallelism, thread_name_prefix="Pass2Batch") as executor:
                        future_to_batch = {}
                        for batch_num, column_details in transactional_batches:
                            unique_batch_id = f"P2_{batch_num}"
                            future = executor.submit(
                                self._process_batch_with_retry,
                                column_details,
                                unique_batch_id,
                                unstructured_docs_markdown,
                                strategic_goals,
                                business_context if 'business_context' in locals() else "",
                                business_priorities if 'business_priorities' in locals() else "",
                                strategic_initiative if 'strategic_initiative' in locals() else "",
                                value_chain if 'value_chain' in locals() else "",
                                revenue_model if 'revenue_model' in locals() else "",
                                3,  # max_attempts
                                previous_use_cases_feedback  # Include feedback from PASS 1
                            )
                            future_to_batch[future] = unique_batch_id
                            self.logger.info(f"✓ [PASS 2] Submitted transactional batch {batch_num}")
                        
                        batches_completed = 0
                        pass2_submissions = len(transactional_batches)
                        try:
                            for future in concurrent.futures.as_completed(future_to_batch, timeout=total_timeout):
                                unique_batch_id = future_to_batch[future]
                                try:
                                    use_cases = future.result(timeout=900)
                                    if use_cases:
                                        self.storage_manager.save_batch(unique_batch_id, use_cases)
                                        batches_completed += 1
                                        self.logger.info(f"✓ [PASS 2] Batch {unique_batch_id}: {len(use_cases)} NEW use cases ({batches_completed}/{pass2_submissions})")
                                        log_print(f"✓ [PASS 2] Batch complete ({batches_completed}/{pass2_submissions})")
                                except Exception as e:
                                    self.logger.error(f"❌ [PASS 2] Batch {unique_batch_id} failed: {e}")
                        except concurrent.futures.TimeoutError:
                            self.logger.error(f"⚠️ [PASS 2] Timeout. Proceeding with {batches_completed}/{pass2_submissions} completed.")
                    
                    # === MEMORY OPTIMIZATION: Count PASS 2 results without loading all into memory ===
                    total_after_pass2 = self.storage_manager.get_total_count()
                    pass2_new_count = total_after_pass2 - pass1_count
                    self.logger.info(f"✅ PASS 2 complete: Generated {pass2_new_count} additional NEW use cases from transactional tables")
                    log_print(f"✅ PASS 2 complete: {pass2_new_count} NEW use cases generated")
                    
                    # Clean up PASS 2 variables to free memory
                    del previous_use_cases_feedback
                    del transactional_batches
                    gc.collect()
                    self.logger.debug("🧹 Memory cleanup after PASS 2")
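                else:
                    # PASS 2 is skipped either because there is nothing transactional to revisit
                    # or because PASS 1 produced no use cases to build feedback from
                    skip_reason = "No transactional tables found for ensemble pass" if not transactional_batches else "PASS 1 produced no use cases"
                    self.logger.info(f"⚠️ PASS 2 skipped: {skip_reason}")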
                
                log_print(f"\n{'='*60}")
                log_print(f"✅ SERIAL ENSEMBLE COMPLETE")
                log_print(f"{'='*60}")
                
                # Check if any batches were successfully processed
                storage_stats = self.storage_manager.get_stats()
                if storage_stats['num_batches'] == 0:
                    self.logger.warning("No use cases were generated from any batch. Skipping all generation.")
                    self.storage_manager.cleanup()  # Cleanup if no use cases generated
                    return
                
                self.logger.info(f"📊 Batch processing complete. Storage stats: {storage_stats['num_batches']} batches, "
                               f"{storage_stats['use_case_count']} use cases, {storage_stats['total_size_mb']:.2f} MB on disk")
                
                # MEMORY OPTIMIZATION: Load use cases from disk for deduplication only when needed
                self.logger.info("Loading use cases from disk for deduplication...")
                all_use_cases = self.storage_manager.load_all_use_cases()
                
                # Filter out use cases without valid tables (before deduplication)
                # Keep use cases that have either:
                # 1. Valid table references (non-empty and not just catalog/schema prefix)
                # 2. Volume paths (for unstructured data use cases)
                pre_filter_count = len(all_use_cases)
                all_use_cases = [
                    uc for uc in all_use_cases 
                    if (uc.get('Tables Involved', '').strip() and (
                        uc.get('Tables Involved', '').startswith('/Volumes') or 
                        '.' in uc.get('Tables Involved', '')  # Has at least one dot (catalog.schema.table format)
                    ))
                ]
                filtered_count = pre_filter_count - len(all_use_cases)
                if filtered_count > 0:
                    self.logger.warning(f"⚠️ Filtered out {filtered_count} use cases without valid tables before deduplication")

                # Deduplication is deferred: use cases are clustered and scored first, then deduplicated
                # per domain in parallel (no global deduplication pass, to maximize parallelization)
                self.logger.debug(f"Total use cases generated (pre-deduplication): {len(all_use_cases)}")
                self.logger.info("🔄 Deferring deduplication until after domain clustering and scoring (domain-level, parallel)...")
                unique_use_cases = all_use_cases
                
                # === RESTRUCTURED: Prepare all columns for SQL generation FIRST ===
                all_columns_for_sql = []
                for batch_num, batch_columns in batches_to_process:
                    for detail in batch_columns:
                        (catalog, schema, table, _, _, _) = detail
                        fqtn = f"{catalog}.{schema}.{table}"
                        if reference_tables_set and fqtn in reference_tables_set:
                            continue
                        all_columns_for_sql.append(detail)
                
                # === MEMORY OPTIMIZATION: Clear batches_to_process after extracting columns ===
                del batches_to_process
                gc.collect()
                self.logger.debug("🧹 Cleared batches_to_process from memory")
                
                # === NEW: Catch uncovered tables BEFORE clustering/scoring/SQL ===
                self.logger.info("🔍 Checking table coverage before clustering and scoring...")
                catchall_rounds = 3
                for round_idx in range(catchall_rounds):
                    pre_scoring_retry = self._retry_missing_table_coverage(
                        unique_use_cases,
                        all_columns_for_sql,
                        unstructured_docs_markdown,
                        strategic_goals,
                        include_business_catchall=True
                    )
                    if not pre_scoring_retry:
                        if round_idx == 0:
                            self.logger.info("✅ All tables covered by initial use cases")
                        else:
                            self.logger.info(f"✅ All tables covered after catch-all pass {round_idx}")
                        break
                    self.logger.info(f"✅ Generated {len(pre_scoring_retry)} additional use cases for uncovered tables (pass {round_idx + 1})")
                    pre_scoring_retry = [
                        uc for uc in pre_scoring_retry
                        if uc.get('Tables Involved', '').strip() and not uc.get('Tables Involved', '').startswith('/Volumes')
                    ]
                    self.logger.info(f"✅ After filtering, {len(pre_scoring_retry)} pre-scoring retry use cases have valid tables")
                    if not pre_scoring_retry:
                        continue
                    unique_use_cases.extend(pre_scoring_retry)
                    self.logger.info(f"📊 Total use cases after pre-scoring coverage pass {round_idx + 1}: {len(unique_use_cases)}")
                else:
                    self.logger.info("⚠️ Catch-all reached maximum retries")

                # === PHASE 1: Cluster domains/subdomains FIRST (before scoring) ===
                self.logger.info(f"📊 PHASE 1: Clustering {len(unique_use_cases)} use cases (not yet deduplicated) into domains and subdomains...")
                clustered_use_cases = self._cluster_domains_and_subdomains(unique_use_cases, "English")
                
                # === PHASE 2: Score per domain in parallel (before deduplication) ===
                # We need scores to make intelligent decisions during deduplication (keep highest ROI)
                self.logger.info("🔄 PHASE 2: Scoring use cases per domain in parallel...")
                unique_use_cases_scored = self._score_per_domain_parallel(
                    clustered_use_cases, # Use clustered cases here (not deduped yet)
                    business_context=ctx_business_context,
                    strategic_goals=ctx_strategic_goals,
                    business_priorities=ctx_business_priorities,
                    strategic_initiative=ctx_strategic_initiative,
                    value_chain=ctx_value_chain,
                    revenue_model=ctx_revenue_model
                )
                self.logger.info("✅ PHASE 2 complete: All use cases scored")
                
                # === PHASE 3: Intelligent Deduplication (Using Scores) ===
                # Now that we have scores, deduplicate by keeping the highest ROI/Strategic Alignment
                self.logger.info("🔄 PHASE 3: Starting INTELLIGENT domain-level deduplication (using scores)...")
                final_deduplicated_use_cases = self._deduplicate_use_cases_by_domain_parallel(unique_use_cases_scored)
                
                final_consolidated_use_cases = final_deduplicated_use_cases
                
                # Re-number use case IDs to match the domain-based notebook prefixes (N01, N02, etc.)
                self.logger.debug("Re-numbering use case IDs to match domain-based notebook prefixes...")
                grouped_by_domain = self._group_use_cases_by_domain_flat(final_consolidated_use_cases)
                
                # Sort domains by impact (most impactful first)
                domain_impact_scores = {domain: self._calculate_domain_impact_score(use_cases) 
                                       for domain, use_cases in grouped_by_domain.items()}
                sorted_domain_names = sorted(grouped_by_domain.keys(), 
                                            key=lambda d: domain_impact_scores[d], 
                                            reverse=True)
                
                renumbered_use_cases = []
                domain_source_counters = defaultdict(lambda: defaultdict(int))
                for domain_idx, domain_name in enumerate(sorted_domain_names):
                    domain_use_cases = grouped_by_domain[domain_name]
                    domain_prefix = f"N{domain_idx+1:02d}"  # Two-digit domain prefix
                    
                    # Use cases are already sorted by priority from _group_use_cases_by_domain_flat
                    for uc_idx, uc in enumerate(domain_use_cases, start=1):
                        old_id = uc.get('No', '')
                        source_flag = 'AI' if uc.get('_source') == 'AI' else 'ST'
                        domain_source_counters[domain_name][source_flag] += 1
                        seq_num = domain_source_counters[domain_name][source_flag]
                        new_id = f"{domain_prefix}-{source_flag}{seq_num:02d}"
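                        # Illustrative example (hypothetical values): the third AI-sourced use case in the
                        # highest-impact domain gets the ID "N01-AI03"; any other source uses the "ST" flag instead.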
                        uc['No'] = new_id
                        
                        # Update SQL comment if present
                        if 'SQL' in uc and uc['SQL'] and old_id:
                            uc['SQL'] = uc['SQL'].replace(f"-- Use Case ID: {old_id}", f"-- Use Case ID: {new_id}")
                        
                        renumbered_use_cases.append(uc)
                
                self.logger.debug(f"Re-numbered {len(renumbered_use_cases)} use cases to match notebook prefixes.")
                
                # === NEW: Filter out any use cases with "Pending" priority (safety check) ===
                pending_use_cases = [uc for uc in renumbered_use_cases if uc.get('Priority') == 'Pending']
                if pending_use_cases:
                    self.logger.warning(f"⚠️ Found {len(pending_use_cases)} use cases with 'Pending' priority - these will be filtered out")
                    for uc in pending_use_cases[:5]:  # Log first 5 for debugging
                        self.logger.warning(f"  - {uc.get('No', 'N/A')}: {uc.get('Name', 'N/A')}")
                    renumbered_use_cases = [uc for uc in renumbered_use_cases if uc.get('Priority') != 'Pending']
                    self.logger.info(f"✅ Filtered to {len(renumbered_use_cases)} scored use cases (removed {len(pending_use_cases)} pending)")
                
                # === QUALITY FILTER: Volume-Based Filtering ===
                # Priorities: Ultra High (6), Very High (5), High (4), Medium (3), Low (2), Very Low (1), Ultra Low (0)
                # Rules:
                # 1. If total > 200: Drop <= Medium (Keep High+)
                # 2. If total > 100: Drop <= Low (Keep Medium+)
                # 3. If total > 50: Drop <= Ultra Low (Keep Very Low+)
                
                total_count = len(renumbered_use_cases)
                quality_priority_map = {
                    "Ultra High": 6, "Very High": 5, "High": 4, 
                    "Medium": 3, "Low": 2, "Very Low": 1, "Ultra Low": 0, "Pending": -1
                }
                
                min_priority_threshold = 0 # Default: Keep everything (except Pending)
                filter_reason = "Base (All)"
                
                if total_count > 200:
                    min_priority_threshold = 4 # Keep High (4) and above
                    filter_reason = "Volume > 200 (High+ only)"
                elif total_count > 100:
                    min_priority_threshold = 3 # Keep Medium (3) and above
                    filter_reason = "Volume > 100 (Medium+ only)"
                elif total_count > 50:
                    min_priority_threshold = 1 # Keep Very Low (1) and above (Drop Ultra Low 0)
                    filter_reason = "Volume > 50 (Very Low+ only)"
                
                filtered_use_cases = [
                    uc for uc in renumbered_use_cases 
                    if quality_priority_map.get(uc.get('Priority', 'Medium'), 3) >= min_priority_threshold
                ]
                
                dropped_count = total_count - len(filtered_use_cases)
                if dropped_count > 0:
                    self.logger.info(f"📉 VOLUME FILTER: {filter_reason}. Dropped {dropped_count} low-priority use cases.")
                    log_print(f"\n📉 VOLUME FILTER: {filter_reason}. Focusing on {len(filtered_use_cases)} higher-quality use cases (dropped {dropped_count}).")
                    
                    # === RE-NORMALIZATION: Re-score relative to the new highest ===
                    self.logger.info("🔄 Re-normalizing scores for filtered set...")
                    final_consolidated_use_cases = self._normalize_priority_scores(filtered_use_cases)
                else:
                    self.logger.info(f"✅ Volume filter applied ({filter_reason}), but no use cases were dropped.")
                    final_consolidated_use_cases = renumbered_use_cases
                
            else:
                self.logger.warning("No data loader. Skipping use case, PDF, and Presentation generation.")
                return
        except Exception as e:
            self.logger.critical(f"A critical error occurred during English generation: {e}")
            self.storage_manager.cleanup()  # Cleanup on error
            AIAgent.get_summary_report()
            return

        english_grouped_data = self._group_use_cases_by_domain_flat(final_consolidated_use_cases) if final_consolidated_use_cases else {}
        summary_dict = None

        # ALWAYS Generate English Excel before SQL generation
        if final_consolidated_use_cases:
            self.logger.info("Generating English Excel before SQL generation...")
            lang_abbr_en = self._get_lang_abbr("English")
            try:
                self._generate_use_case_excel("English", lang_abbr_en, english_grouped_data)
            except Exception as e:
                self.logger.error(f"Failed to generate English Excel before SQL: {e}")

        
        # === PHASE 2: DOMAIN-BY-DOMAIN SQL GENERATION & NOTEBOOK CREATION ===
        # Generate summary_dict BEFORE domain-by-domain processing (needed for notebook creation)
        if final_consolidated_use_cases and not self.json_file_path:
            if summary_dict is None:
                self.logger.info("Generating executive summaries for notebooks...")
                (summary_dict, _) = self._get_salesy_summary(english_grouped_data, self.business_name, "English", english_translations)
        
        # === START DOCUMENTATION GENERATION IN PARALLEL WITH SQL ===
        # Launch PDF/PPTX generation in background thread while SQL proceeds
        doc_generation_future = None
        doc_generation_executor = None
        remaining_langs = [lang for lang in self.output_languages if lang != "English"]
        target_langs = ["English"] + remaining_langs if "English" in self.output_languages else remaining_langs
        
        if final_consolidated_use_cases and target_langs and ("PDF Catalog" in self.generate_choices or "Presentation" in self.generate_choices):
            self.logger.info("🚀 Starting PDF/PPTX documentation generation in parallel with SQL...")
            log_print(f"📄 Documentation generation starting in background (languages: {', '.join(target_langs)})")
            doc_generation_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1, thread_name_prefix="DocGen")
            doc_generation_future = doc_generation_executor.submit(
                self._generate_documents_for_all_languages,
                final_consolidated_use_cases,
                english_grouped_data,
                summary_dict,
                target_langs,
                ["English"]  # skip_excel_langs - already generated
            )
        
        self.logger.info("🔄 PHASE 2: Domain-by-domain SQL generation & notebook creation...")
        log_print(f"\n{'='*80}")
        log_print("🔧 PHASE 2: DOMAIN-BY-DOMAIN SQL & NOTEBOOKS")
        log_print(f"{'='*80}")
        log_print(f"Total use cases: {len(final_consolidated_use_cases)}")
        log_print(f"Parallel workers: {self.max_parallelism}")
        log_print("Strategy: Generate SQL per domain → Create notebook → Move to next domain")
        log_print("Order: Smallest domains first (for quick demo testing)")
        log_print(f"{'='*80}\n")
        
        # Domain-by-domain: Generate SQL and create notebook for each domain immediately
        # Domains are processed in order of use case count (smallest first) for quick testing
        if final_consolidated_use_cases and not self.json_file_path:
            final_consolidated_use_cases = self._generate_sql_and_notebooks_by_domain(
                final_consolidated_use_cases,
                all_columns_for_sql,
                unstructured_docs_markdown,
                english_translations,
                summary_dict
            )
        else:
            # Fallback to parallel SQL generation if JSON mode (no notebooks needed)
            final_consolidated_use_cases = self._generate_sql_parallel(
                final_consolidated_use_cases,
                all_columns_for_sql,
                unstructured_docs_markdown
            )
        self.logger.info("✅ Phase 2 complete: All domains processed (SQL + Notebooks)")
        
        # === WAIT FOR DOCUMENTATION GENERATION TO COMPLETE ===
        if doc_generation_future:
            try:
                self.logger.info("⏳ Waiting for documentation generation to complete...")
                doc_generation_future.result(timeout=1800)  # 30 minute timeout
                self.logger.info("✅ Documentation generation completed")
                log_print("✅ Documentation generation completed")
            except concurrent.futures.TimeoutError:
                self.logger.warning("⚠️ Documentation generation timed out after 30 minutes")
                log_print("⚠️ Documentation generation timed out", level="WARNING")
            except Exception as e:
                self.logger.error(f"❌ Documentation generation failed: {e}")
                log_print(f"❌ Documentation generation failed: {str(e)[:100]}", level="ERROR")
            finally:
                if doc_generation_executor:
                    doc_generation_executor.shutdown(wait=False)

        # === POPULATE PRIMARY TABLE (Analytics Technique comes from LLM during use case generation) ===
        if final_consolidated_use_cases:
            for uc in final_consolidated_use_cases:
                # Analytics Technique is now generated by LLM during use case creation
                # Only set a default if missing (for legacy use cases)
                if not uc.get('Analytics Technique') or uc.get('Analytics Technique') == 'N/A':
                    uc['Analytics Technique'] = 'AI Analysis'  # Default fallback
                
                # Extract Primary Table from Tables Involved
                tables_involved = uc.get('Tables Involved', '')
                uc['Primary Table'] = self._extract_primary_table(tables_involved)
            
            self.logger.info(f"✅ Populated Primary Table for {len(final_consolidated_use_cases)} use cases")

        # === SAVE JSON CATALOG (POST-SQL) ===
        if final_consolidated_use_cases and not self.json_file_path:
            self.logger.info("Saving JSON Catalog with generated SQL and columns...")
            summary_dict = self._save_usecases_catalog_json(final_consolidated_use_cases, english_translations, summary_dict)
        
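        # Note: PDF/PPTX generation is normally started in a background thread before Phase 2.
        # This synchronous fallback runs only when no background job was launched. Because it
        # re-checks the same language and output-choice conditions, it acts as a safety net
        # (e.g., if the use case list only became non-empty during Phase 2), not a second pass.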
        if final_consolidated_use_cases and not doc_generation_future:
            remaining_langs = [lang for lang in self.output_languages if lang != "English"]
            target_langs = ["English"] + remaining_langs if "English" in self.output_languages else remaining_langs
            if target_langs and ("PDF Catalog" in self.generate_choices or "Presentation" in self.generate_choices):
                if summary_dict is None:
                    (summary_dict, _) = self._get_salesy_summary(english_grouped_data, self.business_name, "English", english_translations)
                self.logger.info("Generating PDFs/Presentations and translations...")
                self._generate_documents_for_all_languages(
                    final_consolidated_use_cases,
                    english_grouped_data=english_grouped_data,
                    summary_dict=summary_dict,
                    languages=target_langs,
                    skip_excel_langs=["English"]
                )
        
        # Report table inclusion/exclusion statistics
        if final_consolidated_use_cases and self.data_loader:
            self._report_table_statistics(final_consolidated_use_cases)
        
        # Cleanup intermediate storage
        self.storage_manager.cleanup()
        
        # Upload log file and show summary BEFORE final success message
        self.logger.info(f"✅ All use cases for {self.business_name} generated successfully")
        self.logger.info("Uploading log file...")
        self._upload_log_file()
        
        # Show processing honesty report
        self._report_processing_honesty()
        
        # Show AI usage summary
        AIAgent.get_summary_report()
        
        # Final success message with green checkmark - THIS MUST BE THE LAST OUTPUT
        log_print(f"✅ All use cases for {self.business_name} generated successfully")
            

    def _generate_unstructured_docs(self, combined_schema_markdown: str) -> dict:
        """
        Generates the unstructured document list using the static fallback approach.
        
        Always returns a dict, even when no schema markdown is provided or generation
        fails; in those cases `unstructured_docs_markdown` is an empty string.
        
        Returns:
            dict: Dictionary containing:
                - unstructured_docs_markdown: str
                - strategic_goals: list
                - business_context: str
                - business_priorities: list
                - strategic_initiative: str
                - value_chain: str
                - revenue_model: str
        """
        # Default context; only unstructured_docs_markdown is filled in below
        fallback_context = {
            "unstructured_docs_markdown": "",
            "strategic_goals": [],
            "business_context": "General business operations",
            "business_priorities": [],
            "strategic_initiative": "Data-driven transformation",
            "value_chain": "Standard business operations",
            "revenue_model": "Product and service sales"
        }
        
        if not combined_schema_markdown:
            self.logger.warning("No schema markdown provided, cannot generate unstructured docs list.")
            # Return the dict shape promised by the signature rather than an empty string
            return fallback_context
        
        try:
            self.logger.info("Generating fallback unstructured document list...")
            fallback_context["unstructured_docs_markdown"] = self._generate_fallback_unstructured_docs()
        except Exception as e:
            self.logger.error(f"Failed to generate unstructured document list: {e}. Proceeding with empty list.")
        
        return fallback_context
    
    def _generate_fallback_unstructured_docs(self) -> str:
        """Fallback: Generate a minimal set of generic documents."""
        return """| "Document Name" | "Description" | "Type" | "Extracted Entities" | "File Path" |
|---|---|---|---|---|
| "Business Invoices" | "PDF invoices from vendors" | "PDF" | "vendor_name, invoice_number, amount, date" | "/Volumes/finance/invoices/" |
| "Customer Emails" | "Email correspondence with customers" | "DOCX" | "customer_name, subject, date, sentiment" | "/Volumes/communications/emails/" |
| "Product Images" | "Product photography and diagrams" | "JPG" | "product_id, image_type" | "/Volumes/products/images/" |
"""

    # === MODIFIED: _get_salesy_summary (Req 2) ===
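    def _get_salesy_summary(self, grouped_data: dict, business_name: str, language: str, translations: dict) -> tuple:
        """
        Ask the LLM for an executive summary plus per-domain summaries and parse the CSV reply.
        
        Args:
            grouped_data: Mapping of domain name to its list of use cases
            business_name: Name used in the prompt and as the default transliterated name
            language: Output language for the summaries
            translations: Translation strings; supplies the fallback summary text
            
        Returns:
            tuple: (summary_dict, transliterated_name). summary_dict maps each row's Type value
            (e.g. "Executive") to its summary text. transliterated_name stays equal to
            business_name unless a 3-column response supplies one on the "Executive" row.
            If the LLM call or parsing fails, static fallback text is returned instead.
        """
        self.logger.debug(f"Calling LLM for executive and domain summaries in {language}...")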
        t = translations
        summary_dict = {}
        transliterated_name = business_name # Default
        try:
            domain_list_for_prompt = "\n".join([f"- {domain}" for domain in grouped_data.keys()])
            total_cases = sum(len(cases) for cases in grouped_data.values())
            prompt_vars = {
                "business_name": business_name, "total_cases": str(total_cases),
                "domain_list": domain_list_for_prompt, "output_language": language
            }
            summary_csv_raw = self.ai_agent.run_worker(
                step_name=f"Executive_Summary_{language}",
                worker_prompt_path="SUMMARY_GEN_PROMPT",
                prompt_vars=prompt_vars, response_schema=None
            )
            self.logger.info(f"LLM summaries (CSV) received for {language}.")
            
            # === ROBUST CSV PARSING (Req 2) ===
            # Support both quoted and unquoted headers from LLM
            header_3_col_quoted = '"Type","Summary","TransliteratedBusinessName"'
            header_3_col_unquoted = 'Type,Summary,TransliteratedBusinessName'
            header_2_col_quoted = '"Type","Summary"'
            header_2_col_unquoted = 'Type,Summary'
            
            header_start_index = summary_csv_raw.find(header_3_col_quoted)
            is_3_col = True
            
            if header_start_index == -1:
                header_start_index = summary_csv_raw.find(header_3_col_unquoted)
            
            if header_start_index == -1:
                self.logger.warning(f"Could not find 3-column CSV header in LLM response for {language}. Attempting 2-col parse.")
                is_3_col = False
                header_start_index = summary_csv_raw.find(header_2_col_quoted)
                if header_start_index == -1:
                    header_start_index = summary_csv_raw.find(header_2_col_unquoted)
                if header_start_index == -1:
                    self.logger.error(f"Could not find 2-column or 3-column CSV header. Aborting summary parse. Response: {summary_csv_raw[:200]}")
                    raise ValueError(f"Could not parse summary CSV for {language}")

            self.logger.info(f"Found CSV header at index {header_start_index}. Parsing as {3 if is_3_col else 2}-column.")
            summary_csv_clean = summary_csv_raw[header_start_index:]
            # ==================================

            # Use centralized CSV parser
            csv_rows = CSVParser.parse_csv_list(
                summary_csv_clean,
                logger=self.logger,
                context="Domain summary",
                delimiter=',',
                quotechar='"',
                quoting=csv.QUOTE_ALL,
                skipinitialspace=True
            )
            if csv_rows:
                header = csv_rows[0]  # First row is header
                csv_reader = csv_rows[1:]  # Rest are data rows
            else:
                csv_reader = []
            
            if is_3_col:
                for row in csv_reader:
                    # Handle rows that may span multiple lines or have embedded content
                    if len(row) >= 3:
                        row_type = row[0].strip()
                        summary_dict[row_type] = row[1].strip()
                        if row_type == "Executive" and row[2].strip():
                            transliterated_name = row[2].strip()
                    elif len(row) > 0:
                        # Partial row (fewer than 3 columns): log at debug level and skip it
                        self.logger.debug(f"Partial 3-col row (len={len(row)}): {row[:1]}")
            else: # 2-col fallback
                for row in csv_reader:
                    if len(row) >= 2:
                        summary_dict[row[0].strip()] = row[1].strip()
                    elif len(row) > 0:
                        self.logger.debug(f"Partial 2-col row (len={len(row)}): {row[:1]}")
            transliterated_name = business_name if not is_3_col else transliterated_name
            
            if "Executive" not in summary_dict:
                self.logger.error("Failed to parse 'Executive' summary from LLM response.")
                raise ValueError("Missing 'Executive' summary")
                
            self.logger.info(f"Successfully parsed {len(summary_dict)} summaries for {language}. Transliterated name: {transliterated_name}")
            return summary_dict, transliterated_name
            
        except Exception as e:
            self.logger.error(f"LLM summary generation failed for {language}: {e}. Using default text.")
            fallback_dict = {}
            total_cases_fallback = sum(len(cases) for cases in grouped_data.values())
            p1 = t['pdf_fallback_summary_p1'].format(total_cases=total_cases_fallback, business_name=business_name)
            p2 = t['pdf_fallback_summary_p2']
            fallback_dict["Executive"] = f"<p>{p1}</p><p>{p2}</p>"
            for domain in grouped_data.keys():
                fallback_dict[domain] = "<p>Summary generation failed. This domain's key responsibilities and opportunities have been identified for AI transformation.</p>"
            return fallback_dict, business_name # Return default name

    def _merge_small_domains(self, use_cases: list, min_cases_per_domain: int = 4) -> list:
        """
        Merges business domains that have fewer than min_cases_per_domain use cases.
        Provides domain counts to LLM and asks for better merged domain names.
        Default minimum is now 4 use cases per domain.
        """
        if not use_cases:
            return use_cases
        
        # Count use cases per domain
        domain_counts = defaultdict(int)
        for uc in use_cases:
            domain = uc.get('Business Domain', 'Other')
            domain_counts[domain] += 1
        
        # Identify small domains
        small_domains = {domain: count for domain, count in domain_counts.items() if count < min_cases_per_domain}
        
        if not small_domains:
            self.logger.info(f"All domains have at least {min_cases_per_domain} use cases. No merging needed.")
            return use_cases
        
        self.logger.debug(f"Found {len(small_domains)} domains with fewer than {min_cases_per_domain} use cases: {list(small_domains.keys())}")
        
        # Build domain info for LLM including counts
        domain_info_lines = []
        for domain, count in sorted(domain_counts.items(), key=lambda x: x[1], reverse=True):
            status = "✓ OK" if count >= min_cases_per_domain else "❌ TOO SMALL (needs merging)"
            domain_info_lines.append(f"  - {domain}: {count} use cases {status}")
        
        domain_info_str = "\n".join(domain_info_lines)
        
        # Create merge prompt with domain counts
        merge_prompt = f"""You are a business domain expert. Analyze this list of business domains and their use case counts:

{domain_info_str}

**CRITICAL REQUIREMENT**: Each domain MUST have at least {min_cases_per_domain} use cases or it will be merged into a larger, related domain.

**Your Task**:
1. Identify domains with fewer than {min_cases_per_domain} use cases (marked with ❌)
2. For each small domain, determine which larger domain (marked with ✓) it should be merged into based on semantic similarity
3. Come up with a BETTER, more comprehensive name for merged domains that reflects the combined scope
4. If merging creates a new combined domain, ensure the name is professional and encompasses both areas

**Output Format** (honesty-wrapped JSON):
Your response MUST be wrapped: {{{{"honesty_score": XX, "honesty_justification": "...", "data": <your_mapping>}}}}

Example of wrapped output:
{{{{
  "honesty_score": 95,
  "honesty_justification": "Merged domains based on semantic similarity with high confidence.",
  "data": {{{{
    "Billing": "Finance",
    "Helpdesk": "Service"
  }}}}
}}}}

**Rules**:
1. Domains with < {min_cases_per_domain} use cases MUST be merged
2. Only merge semantically related domains
3. Prefer descriptive, professional domain names
4. If creating a new combined name, make it comprehensive
5. Domains with ≥ {min_cases_per_domain} can stay as-is UNLESS you have a significantly better name
6. Domain naming rules STILL APPLY: every domain name must be EXACTLY ONE WORD (no spaces), unique core word, industry-specific

Start your response with: {{{{"honesty_score":
""" + HONESTY_CHECK_JSON
        
        try:
            # Register the prompt under a temporary key so run_worker can look it up
            merge_prompt_key = "DOMAINS_MERGER_PROMPT"
            self.ai_agent.prompt_templates[merge_prompt_key] = merge_prompt
            
            try:
                response = self.ai_agent.run_worker(
                    step_name="Merge_Small_Domains",
                    worker_prompt_path=merge_prompt_key,
                    prompt_vars={},
                    response_schema=None
                )
            finally:
                # Always remove the temporary prompt template, even if the LLM call fails
                self.ai_agent.prompt_templates.pop(merge_prompt_key, None)
            
            merge_mapping = json.loads(clean_json_response(response))
            
            # Extract data from honesty-wrapped response if present
            if isinstance(merge_mapping, dict) and 'data' in merge_mapping:
                merge_mapping = merge_mapping['data']
            
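            # Ensure merge_mapping is a dict, then keep only string -> non-empty string entries
            if not isinstance(merge_mapping, dict):
                self.logger.warning(f"Domain merge response is not a dict: {type(merge_mapping)}. Skipping merge.")
                return use_cases
            merge_mapping = {
                k: v.strip() for k, v in merge_mapping.items()
                if isinstance(k, str) and isinstance(v, str) and v.strip()
            }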
            
            # Resolve transitive merges (e.g., A->B, B->C  =>  A->C, B->C)
            # This handles cases where a domain is renamed, and another domain is merged into the OLD name
            # or where multiple merges happen in a chain.
            for _ in range(5):  # Max depth 5 to prevent infinite loops
                updated = False
                for key, val in merge_mapping.items():
                    if val in merge_mapping and merge_mapping[val] != val:
                        # Update target to the final destination
                        merge_mapping[key] = merge_mapping[val]
                        updated = True
                if not updated:
                    break
            
            # Apply merge mapping to use cases
            merged_count = 0
            for uc in use_cases:
                old_domain = uc.get('Business Domain', 'Other')
                if old_domain in merge_mapping:
                    new_domain = merge_mapping[old_domain]
                    if old_domain != new_domain:
                        self.logger.info(f"Merging '{old_domain}' ({domain_counts[old_domain]} cases) → '{new_domain}'")
                        uc['Business Domain'] = new_domain
                        merged_count += 1
            
            self.logger.debug(f"Domain merging complete. {merged_count} use cases reassigned.")
            
            # Verify all domains now have at least min_cases_per_domain
            final_counts = defaultdict(int)
            for uc in use_cases:
                final_counts[uc.get('Business Domain', 'Other')] += 1
            
            remaining_small = [d for d, count in final_counts.items() if count < min_cases_per_domain]
            if remaining_small:
                self.logger.warning(f"WARNING: {len(remaining_small)} domains still have fewer than {min_cases_per_domain} use cases: {remaining_small}")
            else:
                self.logger.debug(f"SUCCESS: All domains now have at least {min_cases_per_domain} use cases.")
            
            return use_cases
            
        except Exception as e:
            self.logger.error(f"Failed to merge domains: {e}")
            return use_cases  # Return original if merge fails

    def _cluster_domains_and_subdomains(self, use_cases: list, language: str) -> list:
        """
        Cluster use cases into appropriate domains and subdomains using LLM TWO-STEP approach:
        Step 1: Assign domains to all use cases
        Step 2: For each domain, assign subdomains in parallel
        
        Args:
            use_cases: List of use case dictionaries
            language: Output language
            
        Returns:
            List of use cases with updated domains and subdomains
        """
        from collections import defaultdict
        import io
        import csv
        from concurrent.futures import ThreadPoolExecutor
        import concurrent.futures
        
        self.logger.info(f"🎯 Starting TWO-STEP domain/subdomain clustering for {len(use_cases)} use cases...")
        
        if not use_cases:
            return use_cases
        
        # === USER-PROVIDED BUSINESS DOMAINS ENFORCEMENT ===
        # If user provided business domains, we MUST force all use cases to align to those domains only
        if self.user_business_domains:
            self.logger.info(f"🚨 USER-PROVIDED DOMAINS DETECTED: Forcing use cases to align ONLY to: {', '.join(self.user_business_domains)}")
            log_print(f"\n🚨 USER-PROVIDED DOMAINS: Aligning all use cases to: {', '.join(self.user_business_domains)}")
            
            # Use LLM to intelligently assign use cases to user-provided domains
            return self._assign_to_user_domains(use_cases, self.user_business_domains, language)
        
        # === NOTE: Removed legacy fallback - always use two-step approach ===
        # The two-step approach (DOMAIN_FINDER_PROMPT + SUBDOMAIN_DETECTOR_PROMPT) 
        # provides better quality and more consistent results than the old single-step approach
        if len(use_cases) > 250:
            self.logger.info(
                f"📊 Large use case set detected ({len(use_cases)} use cases). "
                f"Using two-step clustering with parallel subdomain detection for optimal quality."
            )
        
        # === STEP 1: DOMAIN DETECTION ===
        
        try:
            # Convert use cases to CSV for LLM (without Business Domain and Subdomain since those will be detected)
            # use_cases is guaranteed non-empty here because of the early return above
            output = io.StringIO()
            fieldnames = ['No', 'Name', 'type', 'Analytics Technique', 'Statement', 'Solution', 
                         'Business Value', 'Beneficiary', 'Sponsor', 
                         'Tables Involved']
            writer = csv.DictWriter(output, fieldnames=fieldnames, extrasaction='ignore')
            writer.writeheader()
            writer.writerows(use_cases)
            use_cases_csv = output.getvalue()
            
            # Check context size for domain detection (uses model-specific limits from TECHNICAL_CONTEXT)
            prompt_template = self.ai_agent.prompt_templates.get("DOMAIN_FINDER_PROMPT", "")
            estimated_size = len(prompt_template) + len(use_cases_csv) + 1000
            MAX_CONTEXT_CHARS = get_max_context_chars(language, "DOMAIN_FINDER_PROMPT")
            
            if estimated_size > MAX_CONTEXT_CHARS:
                # === BATCHED DOMAIN DETECTION: Process in smaller chunks ===
                self.logger.warning(
                    f"Domain detection prompt size ({estimated_size:,} chars) exceeds MAX_CONTEXT_CHARS ({MAX_CONTEXT_CHARS:,}). "
                    f"Using BATCHED domain detection to process {len(use_cases)} use cases in smaller chunks."
                )
                
                # Calculate batch size based on available context space
                prompt_overhead = len(prompt_template) + 5000  # Buffer for prompt template + response
                available_chars = MAX_CONTEXT_CHARS - prompt_overhead
                
                # Estimate chars per use case from the CSV
                chars_per_use_case = len(use_cases_csv) / len(use_cases) if use_cases else 500
                # Fill only 70% of the estimated capacity (30% safety margin), with a floor of 50 per batch
                batch_size = max(50, int(available_chars / chars_per_use_case * 0.7))
                
                self.logger.info(f"📦 BATCHED DOMAIN DETECTION: Processing {len(use_cases)} use cases in batches of ~{batch_size}")
                
                # Process use cases in batches
                batched_domain_assignments = []
                all_discovered_domains = set()
                
                for batch_idx in range(0, len(use_cases), batch_size):
                    batch_use_cases = use_cases[batch_idx:batch_idx + batch_size]
                    batch_num = (batch_idx // batch_size) + 1
                    total_batches = (len(use_cases) + batch_size - 1) // batch_size
                    
                    self.logger.info(f"📍 BATCH {batch_num}/{total_batches}: Processing {len(batch_use_cases)} use cases for domain detection...")
                    
                    try:
                        # Create CSV for this batch
                        batch_output = io.StringIO()
                        batch_writer = csv.DictWriter(batch_output, fieldnames=fieldnames, extrasaction='ignore')
                        batch_writer.writeheader()
                        batch_writer.writerows(batch_use_cases)
                        batch_csv = batch_output.getvalue()
                        
                        # Include previously discovered domains as context for consistency
                        domain_context = ""
                        if all_discovered_domains:
                            domain_context = f"\n\n**PREVIOUSLY DISCOVERED DOMAINS (reuse these where appropriate):**\n{', '.join(sorted(all_discovered_domains))}\n"
                        
                        batch_prompt_vars = {
                            "use_cases_csv": batch_csv,
                            "output_language": language,
                            "business_name": self.business_name,
                            "industries": ", ".join(self.industries) if hasattr(self, 'industries') and self.industries else "General Business",
                            # Fall back to the default when business_context is missing, None, or empty
                            "business_context": (getattr(self, 'business_context', None) or "General business operations") + domain_context,
                            "previous_violations": ""
                        }
                        
                        # Call LLM for this batch
                        batch_response = self.ai_agent.run_worker(
                            step_name=f"Detect_Domains_Batch{batch_num}_{language}",
                            worker_prompt_path="DOMAIN_FINDER_PROMPT",
                            prompt_vars=batch_prompt_vars,
                            response_schema=None
                        )
                        
                        if batch_response and batch_response.strip():
                            batch_response_clean = clean_json_response(batch_response)
                            batch_csv_rows = CSVParser.parse_csv_string(
                                batch_response_clean,
                                logger=self.logger,
                                context=f"Domain detection batch {batch_num}"
                            )
                            
                            # Apply domain assignments to batch use cases
                            batch_domain_map = {}
                            for row in batch_csv_rows:
                                uc_id_raw = row.get('use_case_id', '') or ''
                                domain_raw = row.get('domain', '') or ''
                                uc_id = uc_id_raw.strip() if isinstance(uc_id_raw, str) else str(uc_id_raw).strip()
                                domain = domain_raw.strip() if isinstance(domain_raw, str) else str(domain_raw).strip()
                                if uc_id and domain:
                                    batch_domain_map[uc_id] = domain
                                    all_discovered_domains.add(domain)
                            
                            # Apply to batch use cases. Use cases missing from the LLM response fall back
                            # to the most common domain in this batch, or 'Uncategorized' if none was returned.
                            if batch_domain_map:
                                from collections import Counter
                                fallback_domain = Counter(batch_domain_map.values()).most_common(1)[0][0]
                            else:
                                fallback_domain = 'Uncategorized'
                            for uc in batch_use_cases:
                                uc_copy = uc.copy()
                                # Normalize the ID like the parsed CSV keys (stripped strings) so lookups match
                                uc_no = uc_copy.get('No')
                                uc_id = '' if uc_no is None else str(uc_no).strip()
                                uc_copy['Business Domain'] = batch_domain_map.get(uc_id, fallback_domain)
                                uc_copy['Subdomain'] = ''
                                batched_domain_assignments.append(uc_copy)
                            
                            self.logger.info(f"✅ BATCH {batch_num}/{total_batches}: Assigned domains to {len(batch_use_cases)} use cases. Discovered domains so far: {len(all_discovered_domains)}")
                        else:
                            # Fallback for failed batch
                            self.logger.warning(f"⚠️ BATCH {batch_num}: Empty response, using fallback domains")
                            for uc in batch_use_cases:
                                uc_copy = uc.copy()
                                uc_copy['Business Domain'] = 'Uncategorized'
                                uc_copy['Subdomain'] = ''
                                batched_domain_assignments.append(uc_copy)
                                
                    except Exception as batch_err:
                        self.logger.error(f"❌ BATCH {batch_num} failed: {batch_err}. Using fallback domains for this batch.")
                        for uc in batch_use_cases:
                            uc_copy = uc.copy()
                            uc_copy['Business Domain'] = 'Uncategorized'
                            uc_copy['Subdomain'] = ''
                            batched_domain_assignments.append(uc_copy)
                
                self.logger.info(f"📦 BATCHED DOMAIN DETECTION COMPLETE: {len(batched_domain_assignments)} use cases assigned to {len(all_discovered_domains)} domains")
                
                # Use the batched results; domain merging (Step 1.5) consolidates
                # small/similar domains across batches
                domain_assignments = batched_domain_assignments
                
                # This branch returns before the single-prompt domain detection below,
                # so Steps 1.5 and 2 are run here
                # === STEP 1.5: DOMAIN MERGING (MERGE SMALL/SIMILAR DOMAINS) ===
                self.logger.info("📍 STEP 1.5: Merging small/similar domains from batched detection...")
                domain_assignments = self._merge_small_domains(domain_assignments, min_cases_per_domain=4)
                
                # === STEP 2: SUBDOMAIN DETECTION (PARALLEL FOR EACH DOMAIN) ===
                self.logger.info("📍 STEP 2: Detecting subdomains for each domain in parallel...")
                
                # Group use cases by domain
                domain_usecases_map = defaultdict(list)
                for uc in domain_assignments:
                    domain = uc.get('Business Domain', '').strip()
                    if domain:
                        domain_usecases_map[domain].append(uc)
                
                # ADAPTIVE PARALLELISM: Calculate based on domains and use cases
                subdomain_parallelism, reason = calculate_adaptive_parallelism(
                    "subdomain_detection", self.max_parallelism,
                    num_items=len(domain_assignments),
                    num_domains=len(domain_usecases_map),
                    is_llm_operation=True, logger=self.logger
                )
                log_adaptive_parallelism_decision("subdomain_detection", subdomain_parallelism, self.max_parallelism, reason)
                self.logger.info(f"Processing {len(domain_usecases_map)} domains for subdomain detection...")
                
                # Process each domain in parallel
                final_use_cases_with_subdomains = []
                
                with ThreadPoolExecutor(max_workers=subdomain_parallelism, 
                                       thread_name_prefix="SubdomainDetect") as executor:
                    future_to_domain = {}
                    for domain_name, domain_use_cases in domain_usecases_map.items():
                        future = executor.submit(
                            self._detect_subdomains_for_domain,
                            domain_name,
                            domain_use_cases,
                            language
                        )
                        future_to_domain[future] = domain_name
                    
                    for future in concurrent.futures.as_completed(future_to_domain):
                        domain_name = future_to_domain[future]
                        try:
                            use_cases_with_subdomains = future.result()
                        except Exception as e:
                            self.logger.error(f"❌ Domain '{domain_name}': Subdomain detection failed: {e}")
                            use_cases_with_subdomains = None
                        if use_cases_with_subdomains:
                            final_use_cases_with_subdomains.extend(use_cases_with_subdomains)
                        else:
                            # Keep the domain's use cases and give them a default subdomain,
                            # matching the fallback used by the single-prompt path
                            domain_use_cases = domain_usecases_map.get(domain_name, [])
                            for uc in domain_use_cases:
                                if not uc.get('Subdomain'):
                                    uc['Subdomain'] = f"General {domain_name}"
                            final_use_cases_with_subdomains.extend(domain_use_cases)
                
                self.logger.info(f"✅ BATCHED two-step clustering complete! {len(final_use_cases_with_subdomains)} use cases processed across {len(domain_usecases_map)} domains")
                return final_use_cases_with_subdomains
            
            # Warn if prompt is very large (might cause slowness)
            if estimated_size > MAX_CONTEXT_CHARS * 0.7:
                self.logger.warning(
                    f"⚠️ Domain detection prompt is very large ({estimated_size:,} chars, {len(use_cases)} use cases). "
                    "This may take 2-5 minutes to process. Please be patient..."
                )
                log_print(f"\n⏳ Processing {len(use_cases)} use cases for domain detection...")
                log_print(f"   Prompt size: {estimated_size:,} characters")
                log_print("   This may take 2-5 minutes. Please wait...\n")
            
            # Call LLM for domain detection (respect global retry cap)
            max_attempts = (getattr(self, "max_retry_attempts", 1) or 0) + 1
            prompt_vars = {
                "use_cases_csv": use_cases_csv,
                "output_language": language,
                "business_name": self.business_name,
                "industries": ", ".join(self.industries) if hasattr(self, 'industries') and self.industries else "General Business",
                "business_context": getattr(self, 'business_context', "General business operations"),
                "previous_violations": ""  # Will be populated on retry attempts
            }
            
            domain_assignments = None
            for attempt in range(1, max_attempts + 1):
                try:
                    self.logger.info(f"📍 STEP 1: Detecting domains for all {len(use_cases)} use cases ({estimated_size:,} chars) - attempt {attempt}/{max_attempts}")
                    
                    # perf_counter is monotonic, so the elapsed time is immune to wall-clock adjustments
                    import time
                    start_time = time.perf_counter()
                    
                    response_raw = self.ai_agent.run_worker(
                        step_name=f"Detect_Domains_{language}_Attempt{attempt}",
                        worker_prompt_path="DOMAIN_FINDER_PROMPT",
                        prompt_vars=prompt_vars,
                        response_schema=None
                    )
                    
                    elapsed_time = time.perf_counter() - start_time
                    self.logger.info(f"✅ LLM response received in {elapsed_time:.1f} seconds")
                    
                    # Validate response is not empty
                    if not response_raw or len(response_raw.strip()) == 0:
                        raise ValueError("LLM returned empty response")
                    
                    # Clean response (remove markdown fences if present)
                    response_clean = clean_json_response(response_raw)
                    
                    # Validate cleaned response
                    if not response_clean or len(response_clean.strip()) == 0:
                        raise ValueError("Cleaned response is empty")
                    
                    # Parse CSV using centralized utility
                    try:
                        csv_rows = CSVParser.parse_csv_string(
                            response_clean,
                            logger=self.logger,
                            context="Domain detection"
                        )
                        domain_assignment_map = {}
                        row_count = 0
                        
                        for row in csv_rows:
                            row_count += 1
                            # Cell values may be None or non-string; normalize both columns to stripped strings
                            uc_id_raw = row.get('use_case_id', '') or ''
                            domain_raw = row.get('domain', '') or ''
                            uc_id = uc_id_raw.strip() if isinstance(uc_id_raw, str) else str(uc_id_raw).strip()
                            domain = domain_raw.strip() if isinstance(domain_raw, str) else str(domain_raw).strip()
                            
                            if uc_id and domain:
                                domain_assignment_map[uc_id] = domain
                        
                        if row_count == 0:
                            raise ValueError("CSV has no data rows")
                        
                        if not domain_assignment_map:
                            raise ValueError("No valid domain assignments found in CSV")
                            
                    except Exception as csv_err:
                        self.logger.error(f"CSV parsing failed. Raw response (first 500 chars): {response_raw[:500]}")
                        self.logger.error(f"Cleaned response (first 500 chars): {response_clean[:500]}")
                        raise ValueError(f"Failed to parse CSV: {csv_err}") from csv_err
                    
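                    # Apply domain assignments to use cases
                    domain_assigned_use_cases = []
                    unassigned_ids = []
                    for uc in use_cases:
                        uc_copy = uc.copy()
                        # Normalize the ID like the parsed CSV keys (stripped strings) so lookups match
                        uc_no = uc_copy.get('No')
                        uc_id = '' if uc_no is None else str(uc_no).strip()
                        if uc_id in domain_assignment_map:
                            uc_copy['Business Domain'] = domain_assignment_map[uc_id]
                            # Clear subdomain for now (will be assigned in step 2)
                            uc_copy['Subdomain'] = ''
                        else:
                            unassigned_ids.append(uc_id)
                        domain_assigned_use_cases.append(uc_copy)
                    
                    # Surface use cases the LLM skipped instead of dropping them silently
                    if unassigned_ids:
                        self.logger.warning(
                            f"⚠️ {len(unassigned_ids)} use case(s) received no domain from the LLM "
                            f"(first IDs: {', '.join(unassigned_ids[:5])})"
                        )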
                    
                    # Validate domain assignments
                    violations = []
                    
                    # Group by domain
                    domain_usecases = defaultdict(list)
                    
                    for uc in domain_assigned_use_cases:
                        domain = uc.get('Business Domain', '').strip()
                        if domain:
                            domain_usecases[domain].append(uc)
                    
                    # Separate HARD violations (blocking) from SOFT warnings (acceptable)
                    hard_violations = []
                    soft_warnings = []
                    
                    # Validate domain count (3-25) - HARD LIMIT on maximum
                    total_domains = len(domain_usecases)
                    if total_domains < 3:
                        hard_violations.append(f"Only {total_domains} domains, minimum required: 3")
                    if total_domains > 25:
                        hard_violations.append(f"🚨 CRITICAL: {total_domains} domains exceeds MAXIMUM 25 (HARD LIMIT) - THIS IS UNACCEPTABLE")
                    
                    # Validate domain naming (must be exactly 1 word) - HARD REQUIREMENT
                    # Also flag names that differ only by letter case (e.g. 'Finance' vs 'finance')
                    domain_words = {}
                    for domain in domain_usecases:
                        word_count = len(domain.split())
                        if word_count != 1:
                            hard_violations.append(f"Domain '{domain}' has {word_count} word(s), must be exactly 1 word")
                        else:
                            # Track lowercase names to catch case-insensitive duplicates
                            domain_word = domain.lower().strip()
                            if domain_word in domain_words:
                                hard_violations.append(f"Domains share word '{domain_word}': '{domain_words[domain_word]}' and '{domain}' - merge these domains")
                            else:
                                domain_words[domain_word] = domain
                    
                    # Validate use cases per domain (4-80): the minimum is SOFT for small datasets
                    # (<50 use cases) and HARD otherwise; the maximum is always HARD
                    total_use_cases = len(use_cases)
                    small_domains = []  # Track domains with <4 use cases
                    for domain, ucs in domain_usecases.items():
                        count = len(ucs)
                        if count < 4:
                            # Only soft warning if total use cases is small
                            if total_use_cases < 50:
                                soft_warnings.append(f"Domain '{domain}' has {count} use case(s) (acceptable for small dataset)")
                            else:
                                hard_violations.append(f"Domain '{domain}' has only {count} use case(s), minimum required: 4")
                                small_domains.append((domain, count))
                        elif count > 80:
                            hard_violations.append(f"Domain '{domain}' has {count} use case(s), maximum allowed: 80")
                    
                    violations = hard_violations  # Only hard violations cause retries
                    
                    # Log soft warnings (informational only, doesn't block)
                    if soft_warnings:
                        self.logger.info(f"ℹ️ Soft warnings (acceptable for small datasets): {'; '.join(soft_warnings[:3])}")
                    
                    # If no HARD violations, domain assignment is successful!
                    if not violations:
                        self.logger.info(f"✅ Domain detection successful on attempt {attempt}! Created {total_domains} domains")
                        if soft_warnings:
                            self.logger.info(f"ℹ️ Note: {len(soft_warnings)} domains have <4 use cases (acceptable for dataset of {total_use_cases} total use cases)")
                        domain_assignments = domain_assigned_use_cases
                        break  # Exit retry loop
                    else:
                        self.logger.warning(f"⚠️ Domain detection attempt {attempt} has {len(violations)} HARD violations")
                        if attempt == max_attempts:
                            self.logger.error(f"❌ Max attempts reached with {len(violations)} HARD violations - This should not happen!")
                            self.logger.error(f"HARD Violations: {'; '.join(violations[:5])}")
                            # Still use it but log as error
                            domain_assignments = domain_assigned_use_cases
                            break
                        else:
                            self.logger.info(f"Retrying domain detection (attempt {attempt + 1}/{max_attempts})...")
                            
                            # Prepare violation summary with actionable suggestions for next attempt
                            violation_summary = "\n\n**🚨 PREVIOUS ATTEMPT VIOLATIONS - YOU MUST FIX THESE 🚨**:\n"
                            violation_summary += "\n".join([f"- {v}" for v in violations])
                            
                            # Add special guidance for small domains (only if it's a hard violation)
                            if small_domains and total_use_cases >= 50:
                                violation_summary += "\n\n**💡 SUGGESTION FOR SMALL DOMAINS**:\n"
                                violation_summary += "Domains with fewer than 4 use cases should be merged into larger related domains.\n"
                                for small_domain, count in small_domains:
                                    violation_summary += f"  - '{small_domain}' ({count} use cases) → Merge with a related domain\n"
                            
                            # Emphasize the HARD LIMIT on maximum domains
                            if total_domains > 25:
                                violation_summary += "\n\n**🚨 CRITICAL: MAXIMUM 25 DOMAINS IS AN ABSOLUTE HARD LIMIT 🚨**\n"
                                violation_summary += f"You created {total_domains} domains. You MUST reduce to 25 or fewer.\n"
                                violation_summary += "Merge related domains together to stay within the limit.\n"
                            
                            # Update prompt_vars for next attempt
                            prompt_vars['previous_violations'] = violation_summary
                            continue
                
                except Exception as e:
                    self.logger.error(f"Domain detection attempt {attempt} failed: {e}")
                    if attempt == max_attempts:
                        self.logger.error("Max attempts reached. Using DEFAULT domains as fallback...")
                        # FALLBACK: Assign default domains "Domain1", "Domain2", etc.
                        domain_assignments = self._assign_default_domains(use_cases)
                        self.logger.warning(f"✅ Fallback complete: Assigned {len(set(uc.get('Business Domain', '') for uc in domain_assignments))} default domains to {len(domain_assignments)} use cases")
                        break
            
            # Check if domain detection was successful
            if not domain_assignments:
                self.logger.error("Domain detection failed. Using DEFAULT domains as fallback...")
                # FALLBACK: Assign default domains "Domain1", "Domain2", etc.
                domain_assignments = self._assign_default_domains(use_cases)
                self.logger.warning(f"✅ Fallback complete: Assigned {len(set(uc.get('Business Domain', '') for uc in domain_assignments))} default domains to {len(domain_assignments)} use cases")
            
            # === STEP 1.5: DOMAIN MERGING (MERGE SMALL DOMAINS) ===
            self.logger.info("📍 STEP 1.5: Merging small domains (if any)...")
            domain_assignments = self._merge_small_domains(domain_assignments, min_cases_per_domain=4)
            
            # === STEP 2: SUBDOMAIN DETECTION (PARALLEL FOR EACH DOMAIN) ===
            self.logger.info("📍 STEP 2: Detecting subdomains for each domain in parallel...")
            
            # Group use cases by domain
            domain_usecases_map = defaultdict(list)
            for uc in domain_assignments:
                domain = uc.get('Business Domain', '').strip()
                if domain:
                    domain_usecases_map[domain].append(uc)
            
            # ADAPTIVE PARALLELISM: Calculate based on domains and use cases
            subdomain_parallelism, reason = calculate_adaptive_parallelism(
                "subdomain_detection", self.max_parallelism,
                num_items=len(domain_assignments),
                num_domains=len(domain_usecases_map),
                is_llm_operation=True, logger=self.logger
            )
            log_adaptive_parallelism_decision("subdomain_detection", subdomain_parallelism, self.max_parallelism, reason)
            self.logger.info(f"Processing {len(domain_usecases_map)} domains for subdomain detection...")
            
            # Process each domain in parallel
            final_use_cases_with_subdomains = []
            
            with ThreadPoolExecutor(max_workers=subdomain_parallelism, 
                                   thread_name_prefix="SubdomainDetect") as executor:
                # Submit subdomain detection for each domain
                future_to_domain = {}
                for domain_name, domain_use_cases in domain_usecases_map.items():
                    future = executor.submit(
                        self._detect_subdomains_for_domain,
                        domain_name,
                        domain_use_cases,
                        language
                    )
                    future_to_domain[future] = domain_name
                    self.logger.debug(f"✓ Submitted subdomain detection for domain '{domain_name}' ({len(domain_use_cases)} use cases)")
                
                # Collect results as they complete
                for future in concurrent.futures.as_completed(future_to_domain):
                    domain_name = future_to_domain[future]
                    try:
                        use_cases_with_subdomains = future.result()
                        if use_cases_with_subdomains:
                            self.logger.info(f"✅ Domain '{domain_name}': Subdomain detection complete ({len(use_cases_with_subdomains)} use cases)")
                            final_use_cases_with_subdomains.extend(use_cases_with_subdomains)
                        else:
                            # CRITICAL FIX: Assign default subdomains when detection returns empty
                            self.logger.warning(f"⚠️ Domain '{domain_name}': Subdomain detection returned no use cases - assigning default subdomains")
                            domain_use_cases = domain_usecases_map.get(domain_name, [])
                            for uc in domain_use_cases:
                                if not uc.get('Subdomain'):
                                    uc['Subdomain'] = f"General {domain_name}"
                            final_use_cases_with_subdomains.extend(domain_use_cases)
                    except Exception as e:
                        self.logger.error(f"❌ Domain '{domain_name}': Subdomain detection failed: {e}")
                        # CRITICAL FIX: Assign default subdomains on exception
                        domain_use_cases = domain_usecases_map.get(domain_name, [])
                        self.logger.warning(f"Using default subdomains for domain '{domain_name}'")
                        for uc in domain_use_cases:
                            if not uc.get('Subdomain'):
                                uc['Subdomain'] = f"General {domain_name}"
                        final_use_cases_with_subdomains.extend(domain_use_cases)
            
            self.logger.info(f"✅ Two-step clustering complete! {len(final_use_cases_with_subdomains)} use cases processed")
            return final_use_cases_with_subdomains
        
        except Exception as e:
            self.logger.error(f"Domain/subdomain clustering failed with error: {e}. Using DEFAULT domains/subdomains as fallback...")
            # FALLBACK: Assign default domains and subdomains on any error
            fallback_use_cases = self._assign_default_domains(use_cases)
            # Also assign default subdomains for each domain
            domain_map = {}
            for uc in fallback_use_cases:
                domain = uc.get('Business Domain', 'Domain1')
                domain_map.setdefault(domain, []).append(uc)
            
            final_use_cases = []
            for domain_name, domain_ucs in domain_map.items():
                subdomained_ucs = self._assign_default_subdomains(domain_ucs, domain_name)
                final_use_cases.extend(subdomained_ucs)
            
            self.logger.warning(f"✅ Complete fallback applied: {len(final_use_cases)} use cases with default domains and subdomains")
            return final_use_cases

    def _assign_default_domains(self, use_cases: list) -> list:
        """
        FALLBACK: Assign default domains when LLM domain detection fails.
        Distributes copies of the use cases round-robin across 5 default domains: Domain1, Domain2, Domain3, Domain4, Domain5.
        
        Args:
            use_cases: List of use case dictionaries
            
        Returns:
            List of use cases with default 'Business Domain' assigned
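        
        Example:
        A minimal standalone sketch of the same round-robin rule, for illustration only.
        `round_robin_domains` and the `UC-n` IDs are hypothetical and not part of this class.
        
        ```python
        def round_robin_domains(use_cases, domains):
            # Copy each use case and tag it with domains[i % len(domains)].
            tagged = []
            for i, uc in enumerate(use_cases):
                uc_copy = dict(uc)
                uc_copy['Business Domain'] = domains[i % len(domains)]
                uc_copy['Subdomain'] = ''
                tagged.append(uc_copy)
            return tagged
        
        defaults = ["Domain1", "Domain2", "Domain3", "Domain4", "Domain5"]
        sample = [{"No": f"UC-{n}"} for n in range(1, 8)]
        tagged = round_robin_domains(sample, defaults)
        print([uc['Business Domain'] for uc in tagged])
        ```
        
        With 7 use cases the sixth and seventh wrap around to Domain1 and Domain2,
        and the input dictionaries are left unmodified.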
        """
        if not use_cases:
            return use_cases
        
        # Create 5 default domains (single-word names as required by validation)
        default_domains = ["Domain1", "Domain2", "Domain3", "Domain4", "Domain5"]
        
        # Distribute use cases evenly across domains
        result = []
        for i, uc in enumerate(use_cases):
            uc_copy = uc.copy()
            domain_idx = i % len(default_domains)
            uc_copy['Business Domain'] = default_domains[domain_idx]
            uc_copy['Subdomain'] = ''  # Will be assigned in subdomain step
            result.append(uc_copy)
        
        self.logger.info(f"📋 Default domain assignment: Distributed {len(result)} use cases across {len(default_domains)} domains")
        
        # Log distribution
        domain_counts = {}
        for uc in result:
            d = uc.get('Business Domain', '')
            domain_counts[d] = domain_counts.get(d, 0) + 1
        for domain, count in sorted(domain_counts.items()):
            self.logger.debug(f"   - {domain}: {count} use cases")
        
        return result

    def _assign_default_subdomains(self, use_cases: list, domain_name: str) -> list:
        """
        FALLBACK: Assign default subdomains when LLM subdomain detection fails.
        Distributes use cases round-robin across 2-5 default subdomains: "Sub Domain1", "Sub Domain2", etc.
        Aims for at least 2 use cases per subdomain; with fewer than 4 use cases this cannot be met,
        because at least 2 subdomain names are always created.
        
        Args:
            use_cases: List of use case dictionaries
            domain_name: Name of the domain (for logging)
            
        Returns:
            List of use cases with default 'Subdomain' assigned
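        
        Example:
        A minimal standalone sketch of the sizing rule and the resulting bucket sizes,
        for illustration only. `default_subdomain_names` and `bucket_sizes` are
        hypothetical helpers and not part of this class.
        
        ```python
        def default_subdomain_names(num_use_cases):
            # Same sizing rule as the method body: between 2 and 5 subdomains.
            count = max(2, min(5, num_use_cases // 2))
            return [f"Sub Domain{i}" for i in range(1, count + 1)]
        
        def bucket_sizes(num_use_cases):
            # Round-robin: use case i goes to subdomain i % count.
            names = default_subdomain_names(num_use_cases)
            sizes = [0] * len(names)
            for i in range(num_use_cases):
                sizes[i % len(names)] += 1
            return sizes
        
        for n in (3, 4, 9, 20):
            print(n, default_subdomain_names(n), bucket_sizes(n))
        ```
        
        For 3 use cases the rule still creates 2 subdomains, so one of them holds a
        single use case; from 4 use cases upward every subdomain holds at least 2.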
        """
        if not use_cases:
            return use_cases
        
        # Calculate the number of subdomains (always 2-5). The target is at least 2 use cases
        # per subdomain (num_use_cases // 2), but the floor of 2 subdomains takes precedence,
        # so a domain with fewer than 4 use cases can end up with a 1-use-case subdomain.
        num_use_cases = len(use_cases)
        max_subdomains = max(2, min(5, num_use_cases // 2))
        
        # Create default subdomains (2-word names as required by validation)
        default_subdomains = [f"Sub Domain{i}" for i in range(1, max_subdomains + 1)]
        
        # Distribute use cases evenly across subdomains
        result = []
        for i, uc in enumerate(use_cases):
            uc_copy = uc.copy()
            subdomain_idx = i % len(default_subdomains)
            uc_copy['Subdomain'] = default_subdomains[subdomain_idx]
            result.append(uc_copy)
        
        self.logger.info(f"📋 [{domain_name}] Default subdomain assignment: Distributed {len(result)} use cases across {len(default_subdomains)} subdomains")
        
        return result

    def _detect_subdomains_for_domain(self, domain_name: str, use_cases: list, language: str) -> list:
        """
        Detect subdomains for a single domain's use cases using LLM.
        This method is designed to be called in parallel for each domain.
        
        Args:
            domain_name: Name of the domain
            use_cases: List of use case dictionaries belonging to this domain
            language: Output language
            
        Returns:
            List of use cases with updated subdomains
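        
        Example:
        A minimal standalone sketch of the CSV serialization and the prompt-size check
        performed before calling the LLM, for illustration only. `use_cases_to_csv`,
        `needs_splitting`, the sample row, and the limits passed in are hypothetical;
        the real limit comes from `get_max_context_chars`.
        
        ```python
        import csv
        import io
        
        def use_cases_to_csv(use_cases, fieldnames):
            # Serialize only the listed columns; keys not in fieldnames are dropped.
            buffer = io.StringIO()
            writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
            writer.writeheader()
            writer.writerows(use_cases)
            return buffer.getvalue()
        
        def needs_splitting(prompt_template, use_cases_csv, max_chars, overhead=1000):
            # Same estimate as the method body: template + CSV + fixed overhead.
            return len(prompt_template) + len(use_cases_csv) + overhead > max_chars
        
        rows = [{"No": "UC-1", "Name": "Churn forecast", "Score": 9}]
        csv_text = use_cases_to_csv(rows, ["No", "Name"])
        print(csv_text.splitlines())
        print(needs_splitting("template", csv_text, max_chars=500))
        ```
        
        The `Score` key is silently dropped because of `extrasaction='ignore'`, and the
        fixed 1000-character overhead alone pushes this tiny prompt past a 500-character limit.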
        """
        from collections import defaultdict
        import io
        import csv
        
        log_prefix = f"[Domain: {domain_name}]"
        self.logger.debug(f"{log_prefix} Starting subdomain detection for {len(use_cases)} use cases...")
        
        if not use_cases:
            return use_cases
        
        try:
            # Convert use cases to CSV for LLM (Business Domain is set, Subdomain will be detected)
            output = io.StringIO()
            fieldnames = ['No', 'Name', 'type', 'Analytics Technique', 'Statement', 'Solution', 
                         'Business Value', 'Beneficiary', 'Sponsor', 
                         'Tables Involved']
            writer = csv.DictWriter(output, fieldnames=fieldnames, extrasaction='ignore')
            writer.writeheader()
            writer.writerows(use_cases)
            use_cases_csv = output.getvalue()
            
            # Check context size (uses model-specific limits from TECHNICAL_CONTEXT)
            prompt_template = self.ai_agent.prompt_templates.get("SUBDOMAIN_DETECTOR_PROMPT", "")
            estimated_size = len(prompt_template) + len(use_cases_csv) + 1000
            MAX_CONTEXT_CHARS = get_max_context_chars(language, "SUBDOMAIN_DETECTOR_PROMPT")
            
            if estimated_size > MAX_CONTEXT_CHARS:
                self.logger.warning(
                    f"{log_prefix} Subdomain prompt size ({estimated_size:,} chars) exceeds MAX_CONTEXT_CHARS ({MAX_CONTEXT_CHARS:,}). "
                    f"Splitting into smaller batches for subdomain detection."
                )
                # CONTEXT SPLITTING: Split use cases into smaller batches and process each
                return self._detect_subdomains_with_context_splitting(
                    domain_name, use_cases, language, prompt_template, MAX_CONTEXT_CHARS
                )
            
            # Call LLM for subdomain detection (respect global retry cap)
            max_attempts = (getattr(self, "max_retry_attempts", 1) or 0) + 1
            prompt_vars = {
                "domain_name": domain_name,
                "use_cases_csv": use_cases_csv,
                "output_language": language,
                "business_name": self.business_name,
                "industries": ", ".join(self.industries) if hasattr(self, 'industries') and self.industries else "General Business",
                "business_context": getattr(self, 'business_context', "General business operations"),
                "previous_violations": ""
            }
            
            for attempt in range(1, max_attempts + 1):
                try:
                    self.logger.debug(f"{log_prefix} Subdomain detection attempt {attempt}/{max_attempts}...")
                    
                    response_raw = self.ai_agent.run_worker(
                        step_name=f"Detect_Subdomains_{domain_name}_{language}_Attempt{attempt}",
                        worker_prompt_path="SUBDOMAIN_DETECTOR_PROMPT",
                        prompt_vars=prompt_vars,
                        response_schema=None,
                        timeout_override=self.llm_timeout_seconds  # Explicit timeout
                    )
                    
                    # Validate response
                    if not response_raw or len(response_raw.strip()) == 0:
                        raise ValueError("LLM returned empty response")
                    
                    # Clean response (remove markdown fences if present)
                    response_clean = clean_json_response(response_raw)
                    if not response_clean or len(response_clean.strip()) == 0:
                        raise ValueError("Cleaned response is empty")
                    
                    # Parse CSV using centralized utility
                    try:
                        csv_rows = CSVParser.parse_csv_string(
                            response_clean,
                            logger=self.logger,
                            context="Subdomain detection"
                        )
                        subdomain_assignment_map = {}
                        row_count = 0
                        
                        for row in csv_rows:
                            row_count += 1
                            # Handle both possible column names - also handle None values
                            uc_id_raw = row.get('use_case_id', '') or ''
                            subdomain_raw = row.get('subdomain', '') or ''
                            uc_id = uc_id_raw.strip() if isinstance(uc_id_raw, str) else str(uc_id_raw).strip()
                            subdomain = subdomain_raw.strip() if isinstance(subdomain_raw, str) else str(subdomain_raw).strip()
                            
                            if uc_id and subdomain:
                                subdomain_assignment_map[uc_id] = subdomain
                        
                        if row_count == 0:
                            raise ValueError("CSV has no data rows")
                        
                        if not subdomain_assignment_map:
                            raise ValueError("No valid subdomain assignments found in CSV")
                            
                    except Exception as csv_err:
                        self.logger.error(f"{log_prefix} CSV parsing failed. Raw response (first 500 chars): {response_raw[:500]}")
                        self.logger.error(f"{log_prefix} Cleaned response (first 500 chars): {response_clean[:500]}")
                        # Chain the original exception so the root cause stays in the traceback
                        raise ValueError(f"Failed to parse CSV: {csv_err}") from csv_err
                    
                    # Apply subdomain assignments
                    subdomain_assigned_use_cases = []
                    for uc in use_cases:
                        uc_copy = uc.copy()
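                        # Normalize the ID the same way the CSV keys were normalized (stringified + stripped)
                        uc_id = str(uc_copy.get('No') or '').strip()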
                        if uc_id in subdomain_assignment_map:
                            uc_copy['Subdomain'] = subdomain_assignment_map[uc_id]
                        subdomain_assigned_use_cases.append(uc_copy)
                    
                    # Validate subdomain assignments
                    violations = []
                    
                    # Group by subdomain
                    subdomain_usecases = defaultdict(list)
                    for uc in subdomain_assigned_use_cases:
                        # Subdomain may be missing or None on the incoming use case; coerce before stripping
                        subdomain = str(uc.get('Subdomain') or '').strip()
                        if subdomain:
                            subdomain_usecases[subdomain].append(uc)
                    
                    # Validate subdomain count (2-10)
                    total_subdomains = len(subdomain_usecases)
                    if total_subdomains < 2:
                        violations.append(f"Only {total_subdomains} subdomain(s), minimum required: 2")
                    if total_subdomains > 10:
                        violations.append(f"Too many subdomains: {total_subdomains}, maximum allowed: 10")
                    
                    # Validate subdomain naming (must be 2 words)
                    for subdomain in subdomain_usecases.keys():
                        word_count = len(subdomain.split())
                        if word_count != 2:
                            violations.append(f"Subdomain '{subdomain}' has {word_count} word(s), must be exactly 2 words")
                    
                    # Validate use cases per subdomain (minimum 2)
                    for subdomain, ucs in subdomain_usecases.items():
                        count = len(ucs)
                        if count < 2:
                            violations.append(f"Subdomain '{subdomain}' has only {count} use case(s), minimum required: 2")
                    
                    # If no violations, success!
                    if not violations:
                        self.logger.debug(f"{log_prefix} ✅ Subdomain detection successful! Created {total_subdomains} subdomains")
                        return subdomain_assigned_use_cases
                    else:
                        self.logger.warning(f"{log_prefix} ⚠️ Subdomain detection attempt {attempt} has {len(violations)} violations")
                        if attempt == max_attempts:
                            self.logger.warning(f"{log_prefix} Max attempts reached. Using subdomain assignments despite {len(violations)} violations")
                            self.logger.warning(f"{log_prefix} Violations: {'; '.join(violations[:3])}")
                            return subdomain_assigned_use_cases
                        else:
                            # Prepare violation summary for retry
                            violation_summary = "\n\n**🚨 PREVIOUS ATTEMPT VIOLATIONS - YOU MUST FIX THESE 🚨**:\n"
                            violation_summary += "\n".join([f"- {v}" for v in violations])
                            prompt_vars['previous_violations'] = violation_summary
                            continue
                
                except Exception as e:
                    self.logger.error(f"{log_prefix} Subdomain detection attempt {attempt} failed: {e}")
                    if attempt == max_attempts:
                        self.logger.error(f"{log_prefix} Max attempts reached. Using DEFAULT subdomains as fallback...")
                        # FALLBACK: Assign default subdomains "Sub Domain 1", "Sub Domain 2", etc.
                        fallback_use_cases = self._assign_default_subdomains(use_cases, domain_name)
                        self.logger.warning(f"{log_prefix} ✅ Fallback complete: Assigned default subdomains to {len(fallback_use_cases)} use cases")
                        return fallback_use_cases
            
            # If we reach here without returning, use fallback
            self.logger.warning(f"{log_prefix} No subdomain assignments made. Using DEFAULT subdomains as fallback...")
            return self._assign_default_subdomains(use_cases, domain_name)
            
        except Exception as e:
            self.logger.error(f"{log_prefix} Subdomain detection failed with error: {e}. Using DEFAULT subdomains as fallback...")
            # FALLBACK: Assign default subdomains on any error
            return self._assign_default_subdomains(use_cases, domain_name)

    def _detect_subdomains_with_context_splitting(self, domain_name: str, use_cases: list, language: str, prompt_template: str, max_context_chars: int) -> list:
        """
        Handle subdomain detection when context exceeds max size by splitting into batches.
        Processes each batch separately and merges results.
        
        Args:
            domain_name: Name of the domain
            use_cases: List of use case dictionaries
            language: Output language
            prompt_template: The prompt template string
            max_context_chars: Maximum context size in characters
            
        Returns:
            List of use cases with assigned subdomains
        """
        import io
        import csv
        from collections import defaultdict
        
        log_prefix = f"[Domain: {domain_name}][Context Split]"
        
        # Calculate how many use cases can fit in one batch
        # Estimate average use case size in CSV format
        sample_output = io.StringIO()
        sample_fieldnames = ['No', 'Name', 'type', 'Analytics Technique', 'Statement', 'Solution', 
                            'Business Value', 'Beneficiary', 'Sponsor', 'Tables Involved']
        sample_writer = csv.DictWriter(sample_output, fieldnames=sample_fieldnames, extrasaction='ignore')
        sample_writer.writeheader()
        if use_cases:
            sample_writer.writerow(use_cases[0])
        # Per-use-case size estimate: CSV header plus the first row (a slight overestimate, which keeps batches conservative)
        avg_use_case_size = len(sample_output.getvalue())
        
        # Calculate available space for use cases (subtract prompt template and buffer)
        available_chars = max_context_chars - len(prompt_template) - 2000  # 2000 char buffer
        batch_size = max(2, available_chars // max(1, avg_use_case_size))  # At least 2 use cases per batch
        
        self.logger.info(f"{log_prefix} Splitting {len(use_cases)} use cases into batches of ~{batch_size}")
        
        # Split use cases into batches
        batches = []
        for i in range(0, len(use_cases), batch_size):
            batches.append(use_cases[i:i + batch_size])
        
        self.logger.info(f"{log_prefix} Created {len(batches)} batches for subdomain detection")
        
        # Process each batch and collect subdomain assignments
        all_subdomain_assignments = {}  # uc_id -> subdomain
        
        for batch_idx, batch in enumerate(batches, start=1):
            self.logger.info(f"{log_prefix} Processing batch {batch_idx}/{len(batches)} ({len(batch)} use cases)")
            
            try:
                # Convert batch to CSV
                batch_output = io.StringIO()
                batch_writer = csv.DictWriter(batch_output, fieldnames=sample_fieldnames, extrasaction='ignore')
                batch_writer.writeheader()
                batch_writer.writerows(batch)
                batch_csv = batch_output.getvalue()
                
                # Call LLM for this batch
                max_attempts = (getattr(self, "max_retry_attempts", 1) or 0) + 1
                prompt_vars = {
                    "domain_name": domain_name,
                    "use_cases_csv": batch_csv,
                    "output_language": language,
                    "business_name": self.business_name,
                    "industries": ", ".join(self.industries) if hasattr(self, 'industries') and self.industries else "General Business",
                    "business_context": getattr(self, 'business_context', "General business operations"),
                    "previous_violations": ""
                }
                
                batch_success = False
                for attempt in range(1, max_attempts + 1):
                    try:
                        response_raw = self.ai_agent.run_worker(
                            step_name=f"Detect_Subdomains_{domain_name}_Batch{batch_idx}_{language}_Attempt{attempt}",
                            worker_prompt_path="SUBDOMAIN_DETECTOR_PROMPT",
                            prompt_vars=prompt_vars,
                            response_schema=None,
                            timeout_override=self.llm_timeout_seconds
                        )
                        
                        if not response_raw or len(response_raw.strip()) == 0:
                            raise ValueError("LLM returned empty response")
                        
                        response_clean = clean_json_response(response_raw)
                        if not response_clean or len(response_clean.strip()) == 0:
                            raise ValueError("Cleaned response is empty")
                        
                        # Parse CSV response
                        csv_rows = CSVParser.parse_csv_string(
                            response_clean,
                            logger=self.logger,
                            context=f"Subdomain detection batch {batch_idx}"
                        )
                        
                        # Extract subdomain assignments
                        for row in csv_rows:
                            uc_id_raw = row.get('use_case_id', '') or ''
                            subdomain_raw = row.get('subdomain', '') or ''
                            uc_id = str(uc_id_raw).strip()
                            subdomain = str(subdomain_raw).strip()
                            if uc_id and subdomain:
                                all_subdomain_assignments[uc_id] = subdomain
                        
                        batch_success = True
                        self.logger.info(f"{log_prefix} Batch {batch_idx} completed successfully")
                        break
                        
                    except Exception as attempt_err:
                        self.logger.warning(f"{log_prefix} Batch {batch_idx} attempt {attempt} failed: {attempt_err}")
                        if attempt == max_attempts:
                            self.logger.error(f"{log_prefix} Batch {batch_idx} failed after {max_attempts} attempts")
                
                if not batch_success:
                    # Assign default subdomains for failed batch
                    for uc in batch:
                        uc_id = str(uc.get('No', '')).strip()
                        if uc_id and uc_id not in all_subdomain_assignments:
                            all_subdomain_assignments[uc_id] = f"General {domain_name}"
                            
            except Exception as batch_err:
                self.logger.error(f"{log_prefix} Batch {batch_idx} processing error: {batch_err}")
                for uc in batch:
                    uc_id = str(uc.get('No', '')).strip()
                    if uc_id and uc_id not in all_subdomain_assignments:
                        all_subdomain_assignments[uc_id] = f"General {domain_name}"
        
        # Apply subdomain assignments to use cases
        result = []
        for uc in use_cases:
            uc_copy = uc.copy()
            uc_id = str(uc.get('No', '')).strip()
            subdomain = all_subdomain_assignments.get(uc_id, f"General {domain_name}")
            uc_copy['Subdomain'] = subdomain
            result.append(uc_copy)
        
        # Log statistics
        subdomain_counts = defaultdict(int)
        for uc in result:
            subdomain_counts[uc.get('Subdomain', 'Unknown')] += 1
        
        self.logger.info(f"{log_prefix} Context splitting complete: {len(result)} use cases assigned to {len(subdomain_counts)} subdomains")
        for subdomain, count in sorted(subdomain_counts.items(), key=lambda x: -x[1]):
            self.logger.debug(f"{log_prefix}   - {subdomain}: {count} use cases")
        
        return result

    def _report_table_statistics(self, use_cases: list):
        """
        Reports statistics on table inclusion/exclusion based on generated use cases.
        Compares tables used in use cases vs. total available tables.
        """
        try:
            # Get all available tables from the data loader
            all_available_tables = set()
            if self.data_loader and hasattr(self.data_loader, 'db_details_cache'):
                for (catalog, schema, table, _, _, _) in self.data_loader.db_details_cache:
                    fqtn = f"`{catalog}`.`{schema}`.`{table}`"
                    all_available_tables.add(fqtn)
            
            total_available = len(all_available_tables)
            
            if total_available == 0:
                self.logger.warning("No available tables found in data loader cache. Skipping table statistics report.")
                return
            
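            # Extract all unique tables used in use cases.
            # Names are normalized (backticks removed, lower-cased) so that
            # `catalog`.`schema`.`table` and catalog.schema.table compare equal.
            def _normalize_table_name(name) -> str:
                return str(name).replace('`', '').strip().lower()

            normalized_available = {_normalize_table_name(t) for t in all_available_tables}
            tables_used = set()
            for uc in use_cases:
                tables_involved = uc.get('Tables Involved', '')
                if tables_involved:
                    # Accept either a comma-separated string or an already-split collection
                    if isinstance(tables_involved, (list, tuple, set)):
                        table_names = tables_involved
                    else:
                        table_names = str(tables_involved).split(',')
                    for table in table_names:
                        table = _normalize_table_name(table)
                        if table:
                            tables_used.add(table)

            # Count only tables that exist in the cache so the excluded count cannot go negative
            total_included = len(tables_used & normalized_available)
            total_excluded = total_available - total_included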
            
            # Calculate percentages
            pct_included = (total_included / total_available * 100) if total_available > 0 else 0
            pct_excluded = (total_excluded / total_available * 100) if total_available > 0 else 0
            
            log_print("\n" + "="*80)
            log_print("TABLE UTILIZATION REPORT")
            log_print("="*80)
            log_print(f"Total Available Tables: {total_available}")
            log_print(f"Tables Included in Use Cases: {total_included} ({pct_included:.1f}%)")
            log_print(f"Tables Excluded (No Business Value): {total_excluded} ({pct_excluded:.1f}%)")
            log_print("="*80 + "\n")
            
            # Log it as well
            self.logger.info(f"Table Statistics: {total_included}/{total_available} tables included ({pct_included:.1f}%), {total_excluded} excluded ({pct_excluded:.1f}%)")
            
        except Exception as e:
            self.logger.error(f"Failed to generate table statistics report: {e}")
    
    def _priority_sort_key(self, use_case: dict) -> tuple:
        """
        Returns a tuple for sorting use cases by priority (descending).
        Very High -> High -> Medium -> Low -> Very Low
        """
        priority_map = {
            "Very High": 0,
            "High": 1,
            "Medium": 2,
            "Low": 3,
            "Very Low": 4,
            "N/A": 5
        }
        priority_label = use_case.get('Priority', 'N/A')
        priority_order = priority_map.get(priority_label, 5)
        # Secondary sort by use case number for stability
        use_case_no = use_case.get('No', 'N-999-AI')
        return (priority_order, use_case_no)
    
    def _natural_sort_key(self, use_case: dict) -> tuple:
        """
        Natural sort key for use case IDs to ensure proper ordering.
        
        Handles IDs like: N01-AI01, N01-AI02, ..., N01-AI10, N01-AI11, N01-ST01, etc.
        Standard string sort would incorrectly order: AI1, AI10, AI11, AI2, AI3...
        Natural sort correctly orders: AI1, AI2, AI3, ..., AI10, AI11...
        
        Returns tuple: (domain_num, source_type, sequence_num, original_id)
        Example: "N01-AI05" -> (1, 'AI', 5, 'N01-AI05')
        """
        import re
        # Coerce to str so a missing, None, or numeric ID cannot break the regex matching below
        use_case_id = str(use_case.get('No') or 'N99-ZZ999')
        
        # Parse ID format: N##-XX## (e.g., N01-AI05, N02-ST10)
        match = re.match(r'N(\d+)-([A-Z]+)(\d+)', use_case_id)
        if match:
            domain_num = int(match.group(1))
            source_type = match.group(2)  # AI or ST
            seq_num = int(match.group(3))
            # Sort AI before ST (alphabetically)
            return (domain_num, source_type, seq_num, use_case_id)
        
        # Fallback: Try to parse older formats like AI-XXX-U## or just extract numbers
        # Pattern: AI-XXX-U## or ST-XXX-U##
        match2 = re.match(r'(AI|ST)-([A-Z0-9_]+)-U(\d+)', use_case_id)
        if match2:
            source_type = match2.group(1)
            seq_num = int(match2.group(3))
            return (99, source_type, seq_num, use_case_id)
        
        # Final fallback: extract any numbers from the ID for some ordering
        numbers = re.findall(r'\d+', use_case_id)
        if numbers:
            # Use first number as domain, last number as sequence
            return (int(numbers[0]), 'ZZ', int(numbers[-1]), use_case_id)
        
        # No numbers found - sort at the end
        return (999, 'ZZ', 999, use_case_id)
    
    def _calculate_domain_impact_score(self, use_cases: list) -> float:
        """
        Calculates the average priority score for a domain to determine its impact.
        Higher score = more impactful domain.
        """
        if not use_cases:
            return 0.0
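        # Priority Score may be stored as a string or int, or be None/unparseable;
        # fall back to the neutral default (5.0) instead of raising.
        def _to_score(value) -> float:
            try:
                return float(value)
            except (TypeError, ValueError):
                return 5.0

        total_score = sum(_to_score(uc.get('Priority Score', 5.0)) for uc in use_cases)
        return total_score / len(use_cases)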

    def assemble_use_case_notebooks(self, all_use_cases: list, translations: dict, summary_dict: dict = None):
        self.logger.debug("--- Starting Use Case Notebook Assembly (English) ---")
        if not all_use_cases:
            self.logger.warning("No use cases were provided for notebook assembly.")
            return
        
        def get_prefix_for_group(use_cases_list):
            if not use_cases_list: return "N00"
            try:
                first_id = use_cases_list[0]['No']
                return first_id.split('-')[0]
            except Exception: return "N00"

        # === MODIFIED: Always use business domain grouping ===
        self.logger.info("Assembling one notebook for each business domain...")
        grouped_by_domain = self._group_use_cases_by_domain_flat(all_use_cases)
        
        # CRITICAL FIX: Extract domain PREFIX from USE CASE IDs - ensures notebook names match use case IDs
        # Sort by USE CASE COUNT (smallest first) for quick testing, but use PREFIX from IDs for notebook naming
        domain_prefix_info = {}
        for domain, use_cases in grouped_by_domain.items():
            prefix = get_prefix_for_group(use_cases)
            try:
                prefix_num = int(prefix[1:]) if prefix.startswith('N') else 99
            except ValueError:
                prefix_num = 99
            domain_prefix_info[domain] = (prefix_num, prefix)
        
        # Sort domains by USE CASE COUNT (smallest first) - enables quick testing of smaller domains
        # NOTE: Notebook prefix comes from use case IDs (domain_prefix_info), NOT from this sort order
        sorted_domain_names = sorted(grouped_by_domain.keys(), 
                                    key=lambda d: len(grouped_by_domain[d]))
        
        total_domains = len(sorted_domain_names)
        self.logger.info(f"📚 Generating {total_domains} notebooks for {len(all_use_cases)} use cases IN PARALLEL...")
        log_print(f"📚 Generating {total_domains} notebooks in parallel (max {self.max_parallelism} concurrent)...")
        
        # Process all domains in parallel using ThreadPoolExecutor
        def create_notebook_for_domain(domain_data):
            """Helper function to create a notebook for a single domain."""
            i, domain_name = domain_data
            domain_use_cases = grouped_by_domain[domain_name]
            # Sort use cases by ID using natural sort (AI01, AI02, ..., AI10, AI11 - not AI1, AI10, AI11, AI2)
            # Note: PDF/PPT/XLS are sorted by priority descending via _group_use_cases_by_domain_flat
            sorted_cases = sorted(domain_use_cases, key=self._natural_sort_key)
            # CRITICAL FIX: Use prefix from use case IDs, not from loop index
            # This ensures N15-AI01 use cases go into N15-xxx.ipynb, not N06-xxx.ipynb
            domain_prefix = domain_prefix_info.get(domain_name, (i, f"N{i:02d}"))[1]
            notebook_name = f"{domain_prefix}-{self._sanitize_name(domain_name)}"
            
            self.logger.info(f"📝 [{i}/{total_domains}] Assembling notebook '{notebook_name}' ({len(sorted_cases)} use cases)...")
            
            # Get domain executive summary if available
            domain_summary = None
            if summary_dict:
                domain_summary = summary_dict.get(domain_name, None)
            
            try:
                self._assemble_notebook_for_db(
                    db_name=domain_name, use_cases=sorted_cases, translations=translations, 
                    db_prefix=domain_prefix, filename_override=notebook_name, domain_summary=domain_summary
                )
                self.logger.info(f"✅ [{i}/{total_domains}] Notebook '{notebook_name}' completed successfully")
                return (i, notebook_name, True)
            except Exception as e:
                self.logger.error(f"❌ [{i}/{total_domains}] Notebook '{notebook_name}' failed: {e}")
                return (i, notebook_name, False)
        
        # ADAPTIVE PARALLELISM: Calculate based on number of domains and use cases
        total_use_cases = sum(len(grouped_by_domain[d]) for d in sorted_domain_names)
        
        notebook_parallelism, reason = calculate_adaptive_parallelism(
            "notebook_generation", self.max_parallelism,
            num_items=total_use_cases,
            num_domains=total_domains,
            is_llm_operation=False, logger=self.logger
        )
        log_adaptive_parallelism_decision("notebook_generation", notebook_parallelism, self.max_parallelism, reason)
        
        with ThreadPoolExecutor(max_workers=notebook_parallelism, thread_name_prefix="NotebookGen") as executor:
            # Submit all notebook generation jobs
            domain_data_list = [(i, domain_name) for i, domain_name in enumerate(sorted_domain_names, start=1)]
            futures = [executor.submit(create_notebook_for_domain, domain_data) for domain_data in domain_data_list]
            
            # Collect results
            completed_count = 0
            failed_count = 0
            for future in concurrent.futures.as_completed(futures):
                try:
                    i, notebook_name, success = future.result()
                    if success:
                        completed_count += 1
                        log_print(f"   ✅ Notebook {i}/{total_domains} completed: {notebook_name}")
                    else:
                        failed_count += 1
                        log_print(f"   ❌ Notebook {i}/{total_domains} failed: {notebook_name}")
                except Exception as e:
                    failed_count += 1
                    self.logger.error(f"Notebook generation future failed: {e}")
        
        self.logger.info(f"✅ Notebook generation complete: {completed_count} succeeded, {failed_count} failed")
        log_print(f"✅ All {total_domains} notebooks processed: {completed_count} succeeded, {failed_count} failed")

    # === NEW: Helper method to determine if use case should show examples ===
    def should_show_example_for_use_case(self, use_case_dict: dict, domain_index: int) -> bool:
        """
        Determines if a use case should show example results based on the configured option.
        Simplified to Yes/No: If "Yes", show ALL use cases. If "No", show none.
        
        Args:
            use_case_dict: The use case dictionary (kept for compatibility)
            domain_index: The index of the use case within its domain (kept for compatibility)
            
        Returns:
            bool: True only if the option is "Yes" AND a SQL Warehouse ID has been resolved; False otherwise
        """
        return str(getattr(self, "show_query_results_option", "")).strip().lower() == "yes" and bool(self.sql_warehouse_id)
    
    def _ensure_sql_results_cache_dir(self):
        if getattr(self, "sql_results_cache_dir", None):
            return self.sql_results_cache_dir
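        # Keep only filesystem-safe characters so names containing '/', ':' etc. cannot break the path
        safe_business_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in str(self.business_name))
        cache_dir = os.path.join(tempfile.gettempdir(), f"inspire_sql_cache_{safe_business_name}")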
        os.makedirs(cache_dir, exist_ok=True)
        self.sql_results_cache_dir = cache_dir
        return cache_dir

    def _prepare_example_result(self, columns, schema_columns, row_data, use_case_id):
        original_columns = []
        free_text_columns = []
        llm_columns_with_nulls = []
        llm_keywords = ['plan', 'strategy', 'narrative', 'rationale', 'recommendation',
                       'tactics', 'steps', 'measures', 'summary', 'brief', 'assessment',
                       'analysis', 'insights', 'guidance', 'approach', 'methodology']
        column_values = {}
        for i, col_name in enumerate(columns):
            if i < len(row_data):
                column_values[col_name] = row_data[i]
            else:
                column_values[col_name] = None
        for col_schema in schema_columns:
            col_name = getattr(col_schema, "name", None)
            if isinstance(col_schema, dict):
                col_name = col_schema.get("name", col_name)
                col_type_raw = col_schema.get("type_name") or col_schema.get("type_text")
            else:
                col_type_raw = getattr(col_schema, "type_name", None)
            col_type = str(col_type_raw) if col_type_raw else 'STRING'
            if not col_name:
                continue
            value = column_values.get(col_name)
            is_string = 'STRING' in col_type.upper() or 'CHAR' in col_type.upper() or col_type.upper() == 'BINARY'
            is_llm_column = any(keyword in col_name.lower() for keyword in llm_keywords)
            if is_string:
                str_value = str(value) if value is not None else None
                if is_llm_column and (value is None or str_value in ['None', '', 'null', 'NULL']):
                    llm_columns_with_nulls.append(col_name)
                if str_value and len(str_value) > 100:
                    free_text_columns.append(col_name)
                elif is_llm_column and str_value:
                    free_text_columns.append(col_name)
                else:
                    original_columns.append(col_name)
            else:
                original_columns.append(col_name)
        if llm_columns_with_nulls:
            return {
                'status': 'requires_modification',
                'data': [],
                'message': f"Query requires user modification - null values in LLM columns: {', '.join(llm_columns_with_nulls)}"
            }
        if not free_text_columns:
            return {
                'status': 'no_free_text',
                'data': [],
                'message': 'No free text columns found in results'
            }
        selected_original = original_columns[:5]
        selected_free_text = free_text_columns[:5]
        selected_columns = selected_original + selected_free_text
        transposed_data = []
        for col in selected_columns:
            value = column_values.get(col)
            col_type = 'free_text' if col in free_text_columns else 'original'
            # None becomes 'N/A'; every other value is rendered via str()
            formatted_value = 'N/A' if value is None else str(value)
            transposed_data.append({
                'column': col,
                'value': formatted_value,
                'type': col_type
            })
        return {
            'status': 'success',
            'data': transposed_data,
            'message': f'Successfully retrieved example with {len(selected_original)} original columns and {len(selected_free_text)} free text columns'
        }
    
    # === NEW: Query Execution for Example Results ===
    def execute_query_for_example(self, sql_query: str, use_case_id: str) -> dict:
        """
        Executes a query with LIMIT 1 using SQL Warehouse, transposes columns to rows, and prepares example results.
        
        Args:
            sql_query: The SQL query to execute (will be modified to LIMIT 1)
            use_case_id: The use case identifier for logging
            
        Returns:
            dict with 'status', 'data' (list of {column, value, type} dicts), and 'message'
        """
        # Both the SQL Warehouse name and its resolved ID are required
        if not self.sql_warehouse_name or not self.sql_warehouse_id:
            self.logger.warning(f"SQL Warehouse not resolved - skipping example results for use case {use_case_id}")
            return {
                'status': 'error',
                'data': [],
                'message': 'SQL Warehouse not configured'
            }
        
        try:
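            # Force the query to return a single row. Every "LIMIT <n>" clause
            # (including those in subqueries) is rewritten to LIMIT 1; if none
            # exists, LIMIT 1 is appended. Detection uses the same regex as the
            # rewrite so identifiers such as credit_limit do not suppress the append.
            import re
            limit_pattern = re.compile(r'\bLIMIT\s+\d+', flags=re.IGNORECASE)
            if limit_pattern.search(sql_query):
                modified_query = limit_pattern.sub('LIMIT 1', sql_query)
            else:
                # The newline keeps the clause out of any trailing "--" comment
                modified_query = sql_query.rstrip().rstrip(';').rstrip() + '\nLIMIT 1'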
            
            self.logger.info(f"Executing query for example results (use case {use_case_id}) on warehouse: {self.sql_warehouse_name}...")
            
            from databricks.sdk.service import sql as sql_service
            warehouse_id = self.sql_warehouse_id
            
            # Execute query using statement execution API
            try:
                statement = self.w_client.statement_execution.execute_statement(
                    warehouse_id=warehouse_id,
                    statement=modified_query,
                    wait_timeout="50s"
                )
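            except Exception as e:
                self.logger.warning(f"Query execution failed for use case {use_case_id}: {str(e)[:100]}")
                return {
                    'status': 'error',
                    'data': [],
                    # Keep a truncated error text so callers (SQL fixing, timeout detection) can act on it
                    'message': f'Query execution failed: {str(e)[:200]}'
                }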
            
            # Check if query succeeded
            if statement.status.state != sql_service.StatementState.SUCCEEDED:
                # Extract ONLY the error message, not the full SQL query (for detailed log file)
                if statement.status.error:
                    full_error = statement.status.error.message
                    # Split on "== SQL ==" to get only error part
                    if "== SQL ==" in full_error:
                        clean_error = full_error.split("== SQL ==")[0].strip()
                    else:
                        # Take only first 500 chars if no SQL marker
                        clean_error = full_error[:500]
                    error_msg = clean_error
                else:
                    error_msg = f"Query failed with state {statement.status.state}"
                # Log detailed error to file only (DEBUG level), not to console
                self.logger.debug(f"SQL execution error for use case {use_case_id}: {error_msg}")
                return {
                    'status': 'error',
                    'data': [],
                    'message': error_msg
                }
            
            # Check if we have results
            if not statement.result or not statement.result.data_array:
                return {
                    'status': 'empty',
                    'data': [],
                    'message': 'Query returned no results'
                }
            
            # Get column names from manifest
            if not statement.manifest or not statement.manifest.schema or not statement.manifest.schema.columns:
                self.logger.warning(f"Query result has no schema for use case {use_case_id}")
                return {
                    'status': 'error',
                    'data': [],
                    'message': 'Query result has no schema'
                }
            
            # Column names come from the manifest; the non-empty result check above
            # guarantees that data_array has a first row.
            columns = [col.name for col in statement.manifest.schema.columns]
            row_data = statement.result.data_array[0]  # First row
            
            return self._prepare_example_result(columns, statement.manifest.schema.columns, row_data, use_case_id)
            
        except Exception as e:
            import traceback
            self.logger.warning(f"Could not execute query for example results (use case {use_case_id}): {str(e)[:200]}")
            self.logger.debug(f"Full traceback for {use_case_id}: {traceback.format_exc()}")
            return {
                'status': 'error',
                'data': [],
                'message': f'Query execution error: {str(e)[:100]}'
            }

    def execute_sql_with_fixing(self, use_case: dict, directly_involved_schema: str = "") -> dict:
        """
        Execute SQL for a use case with automatic error fixing and regeneration.
        Strategy:
        1. Execute original SQL
        2. If fails, try to fix it once
        3. If still fails, regenerate completely new SQL from scratch
        4. Execute new SQL
        5. If fails, try to fix it once
        6. If still fails, proceed with comment for user to review
        
        Args:
            use_case: Use case dictionary with SQL field
            directly_involved_schema: Optional schema details for SQL fixing (can be empty)
            
        Returns:
            dict with 'status', 'data', 'message', and 'sql' (potentially fixed/regenerated SQL)
        """
        use_case_id = use_case.get('No', 'Unknown')
        sql_query = use_case.get('SQL', '')
        tables_involved_str = use_case.get('Tables Involved', '')
        
        if not sql_query:
            return {
                'status': 'error',
                'data': [],
                'message': 'No SQL provided',
                'sql': sql_query
            }
        
        # STEP 1: Try to execute original SQL
        exec_result = self.execute_query_for_example(sql_query, use_case_id)
        
        # If successful, return immediately
        if exec_result['status'] != 'error':
            exec_result['sql'] = sql_query
            return exec_result
        
        # STEP 2: Execution failed - try to fix it once
        error_message = exec_result.get('message', 'Unknown error')
        concise_error = error_message.split(" - Error: ")[-1] if " - Error: " in error_message else error_message[:300]
        
        self.logger.warning(f"Use Case {use_case_id} failed to execute - attempting to fix")
        
        # Prepare prompt for SQL Syntax Reviewer
        reviewer_prompt_vars = {
            "use_case_id": use_case_id,
            "use_case_name": use_case.get('Name', ''),
            "business_domain": use_case.get('Business Domain', ''),
            "statement": use_case.get('Statement', ''),
            "tables_involved": tables_involved_str,
            "directly_involved_schema": directly_involved_schema,
            "original_sql": sql_query,
            "explain_error": concise_error,
            "use_case_columns": use_case.get('Involved Columns') or use_case.get('Columns Involved') or ""
        }
        
        adaptive_timeout = self._calculate_adaptive_sql_timeout(use_case)
        
        try:
            fixed_sql = self.ai_agent.run_worker(
                step_name=f"Fix_SQL_Execution_{use_case_id}_Attempt1",
                worker_prompt_path="USE_CASE_SQL_FIX_PROMPT",
                prompt_vars=reviewer_prompt_vars,
                response_schema=None,
                timeout_override=adaptive_timeout,
                max_retries_override=self.max_retry_attempts
            )
            
            # Test fixed SQL
            exec_result_fixed = self.execute_query_for_example(fixed_sql, use_case_id)
            
            if exec_result_fixed['status'] != 'error':
                self.logger.info(f"Use case {use_case_id} SQL fixed successfully on first fix attempt")
                exec_result_fixed['sql'] = fixed_sql
                use_case['SQL'] = fixed_sql
                return exec_result_fixed
            
            # Fix failed - capture the error and fall through to regeneration (STEP 3 below)
            self.logger.warning(f"Use Case {use_case_id} fix failed - regenerating SQL from scratch")
            error_message_fixed = exec_result_fixed.get('message', 'Unknown error')
            concise_error_fixed = error_message_fixed.split(" - Error: ")[-1] if " - Error: " in error_message_fixed else error_message_fixed[:300]
            
        except Exception as fix_e:
            self.logger.error(f"Use case {use_case_id} SQL fix attempt failed: {str(fix_e)[:100]}")
            self.logger.warning(f"Use Case {use_case_id} - regenerating SQL from scratch")
            concise_error_fixed = str(fix_e)[:300]
        
        # STEP 3: Regenerate SQL from scratch
        try:
            self.logger.info(f"Regenerating SQL for Use Case {use_case_id}")
            
            # Get enriched business context from merged_business_context
            enriched_ctx = getattr(self, 'merged_business_context', {})
            
            # Build the prompt variables expected by USE_CASE_SQL_GEN_PROMPT
            regeneration_prompt_vars = {
                "use_case_id": use_case_id,
                "use_case_name": use_case.get('Name', ''),
                "business_domain": use_case.get('Business Domain', ''),
                "subdomain": use_case.get('Subdomain', ''),
                "type": use_case.get('type', ''),
                "statement": use_case.get('Statement', ''),
                "solution": use_case.get('Solution', ''),
                "tables_involved": tables_involved_str,
                "directly_involved_schema": directly_involved_schema,
                "use_case_columns": use_case.get('Involved Columns') or use_case.get('Columns Involved') or "",
                "foreign_key_relationships": "None",
                "unstructured_docs": "",  # Not available during execution/validation, use empty string
                "previous_feedback": "",  # Required by USE_CASE_SQL_GEN_PROMPT
                "ai_functions_summary": generate_ai_functions_doc("summary"),  # Required by USE_CASE_SQL_GEN_PROMPT
                "statistical_functions_detailed": generate_statistical_functions_doc("detailed"),
                "previous_sql": sql_query,
                "previous_error": concise_error_fixed,
                "business_name": self.business_name,
                "sql_model_serving": self.sql_model_serving,
                # Enriched business context for persona enrichment in ai_query prompts
                "enriched_business_context": enriched_ctx.get('business_context', 'General business operations'),
                "enriched_strategic_goals": enriched_ctx.get('strategic_goals', 'Operational excellence and customer satisfaction') if isinstance(enriched_ctx.get('strategic_goals'), str) else ', '.join(enriched_ctx.get('strategic_goals', ['Operational excellence'])),
                "enriched_business_priorities": enriched_ctx.get('business_priorities', 'Digital transformation and cost optimization') if isinstance(enriched_ctx.get('business_priorities'), str) else ', '.join(enriched_ctx.get('business_priorities', ['Digital transformation'])),
                "enriched_strategic_initiative": enriched_ctx.get('strategic_initiative', 'Data-driven decision making'),
                "enriched_value_chain": enriched_ctx.get('value_chain', 'Standard business operations'),
                "enriched_revenue_model": enriched_ctx.get('revenue_model', 'Diverse revenue streams'),
                "interpreted_regeneration_context": ""  # Required by USE_CASE_SQL_GEN_PROMPT
            }
            
            regenerated_sql = self.ai_agent.run_worker(
                step_name=f"Regenerate_SQL_{use_case_id}",
                worker_prompt_path="USE_CASE_SQL_GEN_PROMPT",
                prompt_vars=regeneration_prompt_vars,
                response_schema=None,
                timeout_override=adaptive_timeout,
                max_retries_override=self.max_retry_attempts
            )
            
            # STEP 4: Execute regenerated SQL
            exec_result_regen = self.execute_query_for_example(regenerated_sql, use_case_id)
            
            if exec_result_regen['status'] != 'error':
                self.logger.info(f"Use case {use_case_id} SQL regenerated and executed successfully")
                exec_result_regen['sql'] = regenerated_sql
                use_case['SQL'] = regenerated_sql
                return exec_result_regen
            
            # STEP 5: Regenerated SQL failed - try to fix it once
            self.logger.warning(f"Use Case {use_case_id} regenerated SQL failed - attempting final fix")
            error_message_regen = exec_result_regen.get('message', 'Unknown error')
            concise_error_regen = error_message_regen.split(" - Error: ")[-1] if " - Error: " in error_message_regen else error_message_regen[:300]
            
            reviewer_prompt_vars_final = {
                "use_case_id": use_case_id,
                "use_case_name": use_case.get('Name', ''),
                "business_domain": use_case.get('Business Domain', ''),
                "statement": use_case.get('Statement', ''),
                "tables_involved": tables_involved_str,
                "directly_involved_schema": directly_involved_schema,
                "original_sql": regenerated_sql,
                "explain_error": concise_error_regen,
                "use_case_columns": use_case.get('Involved Columns') or use_case.get('Columns Involved') or ""
            }
            
            final_fixed_sql = self.ai_agent.run_worker(
                step_name=f"Fix_SQL_Execution_{use_case_id}_FinalAttempt",
                worker_prompt_path="USE_CASE_SQL_FIX_PROMPT",
                prompt_vars=reviewer_prompt_vars_final,
                response_schema=None,
                timeout_override=adaptive_timeout,
                max_retries_override=self.max_retry_attempts
            )
            
            # Test final fixed SQL
            exec_result_final = self.execute_query_for_example(final_fixed_sql, use_case_id)
            
            if exec_result_final['status'] != 'error':
                self.logger.info(f"Use case {use_case_id} SQL fixed successfully on final attempt")
                exec_result_final['sql'] = final_fixed_sql
                use_case['SQL'] = final_fixed_sql
                return exec_result_final
            
            # STEP 6: All attempts failed - return with user review comment
            self.logger.error(f"Use case {use_case_id} - All SQL validation attempts failed. User needs to review.")
            error_message_final = exec_result_final.get('message', 'Unknown error')
            
            # Add comment to SQL for user review
            commented_sql = f"-- ⚠️ USER REVIEW REQUIRED: SQL has syntax errors that could not be auto-fixed\n-- Last error: {error_message_final[:200]}\n\n{final_fixed_sql}"
            
            use_case['SQL'] = commented_sql
            return {
                'status': 'error',
                'data': [],
                'message': f'⚠️ USER REVIEW REQUIRED: SQL validation failed after multiple attempts. Error: {error_message_final[:200]}',
                'sql': commented_sql
            }
            
        except Exception as regen_e:
            self.logger.error(f"Use case {use_case_id} SQL regeneration failed: {str(regen_e)[:100]}")
            
            # Add comment to original SQL for user review
            commented_sql = f"-- ⚠️ USER REVIEW REQUIRED: SQL has syntax errors and regeneration failed\n-- Error: {str(regen_e)[:200]}\n\n{sql_query}"
            
            use_case['SQL'] = commented_sql
            return {
                'status': 'error',
                'data': [],
                'message': f'⚠️ USER REVIEW REQUIRED: SQL validation and regeneration failed. Error: {str(regen_e)[:200]}',
                'sql': commented_sql
            }

    def _validate_and_cache_sql_results_parallel(self, use_cases: list):
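        """
        Validate SQL queries and cache results to disk in parallel.
        This avoids running queries sequentially during PDF generation.
        Use cases that still time out after one retry are removed from
        ``use_cases`` in place.
        
        Args:
            use_cases: List of use case dictionaries with SQL field
            
        Returns:
            dict with 'success_count', 'error_count', 'status_map' and 'message_map',
            or None when validation is skipped (no SQL Warehouse, or nothing to validate)
        """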
        if not self.sql_warehouse_name or not self.sql_warehouse_id:
            self.logger.info("SQL Warehouse not configured or ID not resolved; skipping SQL validation and caching.")
            return
        cache_dir = self._ensure_sql_results_cache_dir()
        
        self.logger.info(f"📁 SQL results will be cached to: {cache_dir}")
        status_map = {}
        message_map = {}

        use_cases_to_validate = []
        uc_lookup = {}
        reused_cached = 0
        show_results_enabled = str(getattr(self, "show_query_results_option", "")).strip().lower() == "yes"
        if show_results_enabled:
            # Single pass to check cache and identify what needs validation
            for idx, uc in enumerate(use_cases, start=1):
                if uc.get('SQL') and self.should_show_example_for_use_case(uc, idx):
                    use_case_id = uc.get('No', f'UC-{idx}')
                    cache_file = os.path.join(cache_dir, f"{use_case_id}.json")
                    cached_result = None
                    
                    if os.path.exists(cache_file):
                        try:
                            with open(cache_file, 'r', encoding='utf-8') as f:
                                cached_result = json.load(f)
                        except Exception as read_err:
                            self.logger.debug(f"Failed to read cached result for {use_case_id}: {str(read_err)[:100]}")

                    # Check if cache is valid and matches current SQL
                    if isinstance(cached_result, dict) and (cached_result.get('sql') or '').strip() == uc.get('SQL', '').strip() and cached_result.get('status') != 'error':
                        status_map[use_case_id] = 'success'
                        message_map[use_case_id] = cached_result.get('message', '')
                        reused_cached += 1
                        self.logger.debug(f"Using cached result for {use_case_id}")
                    else:
                        # Needs validation
                        use_cases_to_validate.append((idx, uc))
                        uc_lookup[use_case_id] = uc
                        if not cached_result:
                            self.logger.warning(f"No cached result found for {use_case_id}; expected from validation run.")
        
        if not use_cases_to_validate:
            self.logger.info(f"No new SQL validations required; reused {reused_cached} cached results.")
            return
        
        total_queries = len(use_cases_to_validate)
        self.logger.info(f"🔄 Validating and caching {total_queries} SQL queries in parallel (max {self.max_parallelism} concurrent)... Reused {reused_cached} cached results.")
        log_print(f"🔄 Validating and caching {total_queries} SQL queries in parallel (max {self.max_parallelism} concurrent)...")
        log_print(f"   ⏱️  Each query takes ~5-10 seconds. Estimated time: {(total_queries * 7 / self.max_parallelism / 60):.1f} minutes")
        
        # Define worker function for parallel execution
        def validate_and_cache_worker(idx_uc_tuple):
            idx, uc = idx_uc_tuple
            use_case_id = uc.get('No', f'UC-{idx}')
            cache_file = os.path.join(cache_dir, f"{use_case_id}.json")
            
            try:
                cached_result = None
                if os.path.exists(cache_file):
                    try:
                        with open(cache_file, 'r', encoding='utf-8') as f:
                            cached_result = json.load(f)
                    except Exception as read_err:
                        self.logger.debug(f"Failed to read cached result for {use_case_id}: {str(read_err)[:100]}")
                if cached_result:
                    cached_sql = cached_result.get('sql')
                    if cached_sql and cached_sql.strip() == uc.get('SQL', '').strip() and cached_result.get('status') != 'error':
                        status_value = 'success'
                        status_map[use_case_id] = status_value
                        message_map[use_case_id] = cached_result.get('message', '')
                        self.logger.info(f"Using cached SQL result for {use_case_id}; skipping re-execution.")
                        return (use_case_id, status_value, cached_result.get('message', ''))
                # Execute SQL with automatic fixing
                result = self.execute_sql_with_fixing(uc)
                status_value = 'success' if result['status'] != 'error' else 'error'
                status_map[use_case_id] = status_value
                message_map[use_case_id] = result.get('message', '')
                
                # Cache result to disk
                with open(cache_file, 'w', encoding='utf-8') as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                
                return (use_case_id, status_value, result.get('message', ''))
            except Exception as e:
                self.logger.error(f"Failed to validate/cache {use_case_id}: {str(e)[:100]}")
                # Cache error result
                error_result = {
                    'status': 'error',
                    'data': [],
                    'message': f'Validation failed: {str(e)[:100]}',
                    'sql': uc.get('SQL', '')
                }
                with open(cache_file, 'w', encoding='utf-8') as f:
                    json.dump(error_result, f, ensure_ascii=False, indent=2)
                status_map[use_case_id] = 'error'
                message_map[use_case_id] = str(e)
                return (use_case_id, 'error', str(e))
        
        # ADAPTIVE PARALLELISM: Calculate based on number of queries to validate
        validation_parallelism, reason = calculate_adaptive_parallelism(
            "sql_validation", self.max_parallelism,
            num_items=len(use_cases_to_validate),
            is_llm_operation=False, logger=self.logger
        )
        log_adaptive_parallelism_decision("sql_validation", validation_parallelism, self.max_parallelism, reason)
        
        with ThreadPoolExecutor(max_workers=validation_parallelism, thread_name_prefix="SQLValidator") as executor:
            futures = {executor.submit(validate_and_cache_worker, idx_uc): idx_uc for idx_uc in use_cases_to_validate}
            
            completed = 0
            for future in as_completed(futures):
                completed += 1
                try:
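                    # as_completed() only yields finished futures, so result() returns immediately;
                    # the timeout here is a safeguard and does not bound query runtime.
                    use_case_id, status, message = future.result(timeout=180)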
                    
                    # Progress logging every 10% or every 5 queries
                    if completed % max(1, total_queries // 10) == 0 or completed % 5 == 0:
                        self.logger.info(f"📊 SQL validation progress: {completed}/{total_queries} ({100*completed/total_queries:.1f}%)")
                        log_print(f"   ✓ {completed}/{total_queries} queries validated ({100*completed/total_queries:.1f}%)")
                except Exception as e:
                    self.logger.error(f"Query validation task failed: {str(e)[:100]}")
        
        timeout_ids = [
            uc_id for uc_id, msg in message_map.items()
            if msg and ('timeout' in msg.lower() or 'time out' in msg.lower())
        ]
        
        if timeout_ids:
            retry_workers = max(1, self.max_parallelism // 2)
            self.logger.warning(f"⚠️  Retrying {len(timeout_ids)} timeout failures with reduced parallelism ({retry_workers})...")
            with ThreadPoolExecutor(max_workers=retry_workers, thread_name_prefix="SQLValidatorRetry") as executor:
                retry_futures = {
                    executor.submit(validate_and_cache_worker, (idx, uc)): uc.get('No', f'UC-{idx}')
                    for idx, uc in use_cases_to_validate
                    if uc.get('No', f'UC-{idx}') in timeout_ids
                }
                for future in as_completed(retry_futures):
                    try:
                        future.result(timeout=180)
                    except Exception as e:
                        self.logger.error(f"Retry validation task failed: {str(e)[:100]}")
            
            timeout_ids = [
                uc_id for uc_id, msg in message_map.items()
                if msg and ('timeout' in msg.lower() or 'time out' in msg.lower()) and status_map.get(uc_id) == 'error'
            ]
        
        discard_ids = timeout_ids
        if discard_ids:
            self.validation_timeouts_discarded = discard_ids
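            self.logger.warning(f"Discarding {len(discard_ids)} use cases due to repeated timeouts: {', '.join(map(str, discard_ids))}")
            # Use the same fallback ID scheme (UC-<position>) that was used when the IDs were assigned
            use_cases[:] = [
                uc for idx, uc in enumerate(use_cases, start=1)
                if uc.get('No', f'UC-{idx}') not in discard_ids
            ]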
        
        success_count = sum(1 for status in status_map.values() if status == 'success')
        error_count = sum(1 for status in status_map.values() if status != 'success')
        total_final = success_count + error_count
        self.logger.info(f"✅ SQL validation complete: {success_count} succeeded, {error_count} failed/errored (Total: {total_final})")
        log_print(f"✅ SQL validation complete: {success_count} succeeded, {error_count} failed")
        log_print(f"📁 Results cached to: {cache_dir}")
        self.sql_validation_status_map = status_map
        self.sql_validation_error_ids = [uc_id for uc_id, status in status_map.items() if status != 'success']
        return {
            "success_count": success_count,
            "error_count": error_count,
            "status_map": status_map,
            "message_map": message_map
        }

    def _get_cached_sql_result(self, use_case_id: str) -> dict:
        """
        Retrieve cached SQL result from disk.
        
        Args:
            use_case_id: Use case ID
            
        Returns:
            Cached result dict or error dict if not found
        """
        if not hasattr(self, 'sql_results_cache_dir'):
            return {
                'status': 'error',
                'data': [],
                'message': 'Cache not initialized'
            }
        
        cache_file = os.path.join(self.sql_results_cache_dir, f"{use_case_id}.json")
        
        if not os.path.exists(cache_file):
            return {
                'status': 'error',
                'data': [],
                'message': 'Cached result not found'
            }
        
        try:
            with open(cache_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        except Exception as e:
            self.logger.error(f"Failed to read cached result for {use_case_id}: {str(e)[:100]}")
            return {
                'status': 'error',
                'data': [],
                'message': f'Failed to read cache: {str(e)[:100]}'
            }

    # === MODIFIED: PDF Generation (Req 1, 2, 3) ===
    def generate_catalog_pdf(self, language: str, lang_abbr: str, translations: dict, summary_dict: dict, grouped_data: dict, transliterated_name: str):
        self.logger.info(f"--- Starting PDF Catalog Generation for {language} ---")
        
        t = translations
        is_rtl = (language == "Arabic")
        
        def _install_dependencies(logger_instance) -> bool:
            try:
                import weasyprint
                logger_instance.info("PDF package (weasyprint) already installed.")
                return True
            except ImportError:
                logger_instance.info("Installing required PDF package (weasyprint)...")
                try: 
                    subprocess.check_call([sys.executable, "-m", "pip", "install", "weasyprint"])
                    import weasyprint
                    logger_instance.info("Successfully installed weasyprint.")
                    return True
                except Exception as e: 
                    logger_instance.error(f"Failed to install weasyprint: {e}")
                    print("ERROR: Failed to install 'weasyprint'. PDF generation cannot continue.", file=sys.stderr)
                    return False

        def _build_html(grouped_data: dict, summary_dict: dict, business_name: str, translations: dict, is_rtl: bool) -> str:
            self.logger.info(f"Building HTML for PDF ({language})...")
            t = translations
            now = datetime.datetime.now().strftime("%Y-%m-%d")
            direction = "rtl" if is_rtl else "ltr"
            align = "right" if is_rtl else "left"

            def e(text):
                """HTML-escape any value for safe embedding in the document."""
                return html.escape(str(text))

            # === MODIFIED: CSS WITH FIXES (Request #13) ===
            # 1. @import rules MUST be at the very top (before @font-face)
            # 2. Removed unsupported properties: box-shadow, text-shadow
            # 3. Font warnings will be suppressed in WeasyPrint config
            css = f"""
            /* @import rules MUST be first in CSS */
            @import url('https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+Devanagari:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+Arabic:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+JP:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+KR:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+Tamil:wght@300;400;700&display=swap');
            @import url('https://fonts.googleapis.com/css2?family=Noto+Sans+Thai:wght@300;400;700&display=swap');
           
            /* Font-face declarations with local fallbacks */
            @font-face {{
                font-family: 'NotoSansDevanagari';
                src: local('Noto Sans Devanagari'), local('Lohit Devanagari'), local('Mangal');
                font-weight: normal;
                font-style: normal;
            }}
            @font-face {{
                font-family: 'NotoSansArabic';
                src: local('Noto Sans Arabic'), local('Arial Unicode MS'), local('DejaVu Sans');
                font-weight: normal;
                font-style: normal;
            }}
            @font-face {{
                font-family: 'NotoSansCJK';
                src: local('Noto Sans CJK'), local('Microsoft YaHei'), local('SimSun'), local('MS Gothic');
                font-weight: normal;
                font-style: normal;
            }}
            @font-face {{
                font-family: 'NotoSansJP';
                src: local('Noto Sans JP'), local('Yu Gothic'), local('MS Gothic'), local('Meiryo');
                font-weight: normal;
                font-style: normal;
            }}
            @font-face {{
                font-family: 'NotoSansKR';
                src: local('Noto Sans KR'), local('Malgun Gothic'), local('Gulim');
                font-weight: normal;
                font-style: normal;
            }}
            
            @page {{
                size: A4; margin: 2.5cm;
                @bottom-left {{
                    content: 'Databricks Inspire AI';
                    font-family: 'Roboto', 'Noto Sans Devanagari', 'Noto Sans Arabic', 'Noto Sans SC', 'Noto Sans JP', 'Noto Sans KR', 'Noto Sans Tamil', sans-serif; font-size: 9pt; color: #555;
                }}
                @bottom-right {{
                    content: '{now}';
                    font-family: 'Roboto', 'Noto Sans Devanagari', 'Noto Sans Arabic', 'Noto Sans SC', 'Noto Sans JP', 'Noto Sans KR', 'Noto Sans Tamil', sans-serif; font-size: 9pt; color: #555;
                }}
            }}
            html {{ counter-reset: page-counter; }}
            body {{ 
                font-family: 'Roboto', 'Noto Sans Devanagari', 'Noto Sans Arabic', 'Noto Sans SC', 'Noto Sans JP', 'Noto Sans KR', 'Noto Sans Tamil', 'Noto Sans Thai', sans-serif; color: #333; line-height: 1.6; 
                direction: {direction}; text-align: {align}; 
            }}
            h1, h2, h3 {{ font-weight: 700; color: #003366; margin-bottom: 0.5em; text-align: {align}; }}
            
            /* Page counter logic */
            h1.page-title {{ font-size: 24pt; counter-increment: page-counter; }}
            h2.page-title {{ font-size: 22pt; counter-increment: page-counter; }}
            h2.domain-header {{ font-size: 18pt; counter-increment: none; }}
            
            h3 {{ color: #0B579E; font-size: 14pt; border-bottom: 2px solid #FF6900; padding-bottom: 5px; }}
            p {{ margin-bottom: 1.2em; text-align: {align}; }}
            table {{ width: 100%; border-collapse: collapse; margin-bottom: 2em; page-break-inside: auto; }}
            th, td {{ border: 1px solid #ddd; padding: 10px; text-align: {align}; word-wrap: break-word; }}
            th {{ background-color: #003366; color: white; font-weight: 700; }}
            tr:nth-child(even) {{ background-color: #f9f9f9; }}
            tr {{ page-break-inside: avoid; }}
            
            /* Enhanced cover page with gradient */
            .cover-page {{ 
                display: flex; flex-direction: column; justify-content: space-between; align-items: center; 
                height: 24cm;
                background: linear-gradient(135deg, #001a33 0%, #003366 50%, #004d99 100%);
                color: white; text-align: center; page-break-after: always; 
                counter-increment: none;
                position: relative;
                overflow: visible;
                padding: 2cm 0;
            }}
            /* Geometric decorative elements */
            .cover-page::before {{
                content: '';
                position: absolute;
                width: 450px;
                height: 450px;
                background: radial-gradient(circle, rgba(255, 105, 0, 0.15) 0%, rgba(0, 188, 212, 0.1) 100%);
                border-radius: 50%;
                top: -200px;
                left: -200px;
                z-index: 0;
            }}
            .cover-page::after {{
                content: '';
                position: absolute;
                width: 350px;
                height: 350px;
                background: radial-gradient(circle, rgba(156, 39, 176, 0.12) 0%, rgba(76, 175, 80, 0.08) 100%);
                border-radius: 50%;
                bottom: -150px;
                right: -150px;
                z-index: 0;
            }}
            .cover-box-1 {{ text-align: center; z-index: 1; position: relative; }}
            .cover-box-2 {{ 
                text-align: center; 
                margin-top: 2em; 
                max-width: 85%; 
                margin-left: auto; 
                margin-right: auto; 
                z-index: 1; 
                position: relative;
                padding: 2.5em;
                background: linear-gradient(135deg, rgba(255, 255, 255, 0.08) 0%, rgba(255, 105, 0, 0.05) 100%);
                border-radius: 20px;
                border: 2px solid rgba(255, 105, 0, 0.4);
                /* REMOVED: box-shadow (unsupported by WeasyPrint) */
            }}
            .cover-page h1 {{ 
                font-size: 3.2em; 
                color: white; 
                margin: 0; 
                text-align: center; 
                counter-increment: none;
                /* REMOVED: text-shadow (unsupported by WeasyPrint) */
                letter-spacing: 2px;
            }}
            .cover-page h2 {{ 
                font-size: 2.8em; 
                color: #FF9944; 
                font-weight: 400; 
                margin: 0.5em 0; 
                text-align: center; 
                counter-increment: none;
                /* REMOVED: text-shadow (unsupported by WeasyPrint) */
            }}
            .cover-page p {{ 
                font-size: 1.3em; 
                font-weight: 300; 
                margin-top: 1.5em; 
                text-align: center; 
                color: #e8e8e8;
                letter-spacing: 1px;
            }}
            
            .page-break {{ page-break-after: always; }}
            .exec-summary {{ page-break-after: always; position: relative; }}
            .exec-summary p {{ font-size: 1.1em; }}
            /* Add decorative triangle to executive summary */
            .exec-summary::before {{
                content: '';
                position: absolute;
                top: 0;
                right: 0;
                width: 0;
                height: 0;
                border-style: solid;
                border-width: 0 80px 80px 0;
                border-color: transparent #00BCD4 transparent transparent;
                opacity: 0.15;
            }}
            
            .toc-page {{ page-break-after: always; position: relative; }}
            .toc-table a {{ text-decoration: none; color: #003366; font-weight: bold; }}
            .toc-table .page-ref {{ float: right; color: #555; }}
            .toc-table .page-ref::before {{ content: target-counter(attr(href), page-counter); }}
            /* Add colorful accent to TOC */
            .toc-page::after {{
                content: '';
                position: absolute;
                bottom: 20px;
                left: 0;
                width: 6px;
                height: 200px;
                background: linear-gradient(180deg, #FF6900 0%, #00BCD4 50%, #9C27B0 100%);
                border-radius: 3px;
            }}

            /* Enhanced domain summary pages */
            .domain-summary-page {{
                page-break-before: always; page-break-after: always;
                background: linear-gradient(135deg, #fafafa 0%, #f5f5f5 100%); 
                border: 2px solid #e0e0e0;
                padding: 2cm; border-radius: 12px;
                position: relative;
                overflow: hidden;
            }}
            /* Colorful corner decorations for domain summaries */
            .domain-summary-page::before {{
                content: '';
                position: absolute;
                top: -30px;
                right: -30px;
                width: 120px;
                height: 120px;
                background: radial-gradient(circle, rgba(0, 188, 212, 0.2) 0%, transparent 70%);
                border-radius: 50%;
            }}
            .domain-summary-page::after {{
                content: '';
                position: absolute;
                bottom: -40px;
                left: -40px;
                width: 150px;
                height: 150px;
                background: radial-gradient(circle, rgba(156, 39, 176, 0.15) 0%, transparent 70%);
                border-radius: 50%;
            }}
            .domain-summary-page h2 {{
                border-bottom: 4px solid transparent;
                border-image: linear-gradient(90deg, #FF6900 0%, #00BCD4 50%, #9C27B0 100%) 1;
                padding-bottom: 12px;
                position: relative;
                z-index: 1;
            }}
            .domain-summary-page p {{
                font-size: 12pt; line-height: 1.8;
                position: relative;
                z-index: 1;
            }}
            .domain-header {{ 
                page-break-before: avoid;
                border-bottom: 3px solid #003366; padding-bottom: 10px;
                background: linear-gradient(90deg, rgba(0, 51, 102, 0.05) 0%, transparent 100%);
                padding-left: 10px;
            }}
            .domain-count {{ font-size: 1.2em; color: #555; font-weight: 400; text-align: {align}; margin-top: -0.5em; margin-bottom: 1.5em; }}
            /* Enhanced use case blocks */
            .use-case-block {{
                page-break-inside: avoid; margin-bottom: 1.2em;
                background: linear-gradient(135deg, #ffffff 0%, #fefefe 100%);
                border: 1px solid #e8e8e8; padding: 14px; border-radius: 8px;
                {'border-right: 4px solid transparent;' if is_rtl else 'border-left: 4px solid transparent;'}
                border-image: linear-gradient(180deg, #FF6900 0%, #00BCD4 100%) 1;
                /* REMOVED: box-shadow (unsupported by WeasyPrint) */
                position: relative;
            }}
            .use-case-block.page-break-after {{
                page-break-after: always;
            }}
            /* Add small decorative element to use case blocks */
            .use-case-block::before {{
                content: '';
                position: absolute;
                top: 8px;
                {'left: 8px;' if is_rtl else 'right: 8px;'}
                width: 6px;
                height: 6px;
                background: #00BCD4;
                border-radius: 50%;
                opacity: 0.6;
            }}
            .use-case-block h3 {{ margin-bottom: 0.5em; font-size: 13pt; }}
            .use-case-block p {{ margin-bottom: 0.5em; font-size: 11pt; line-height: 1.4; }}
            .disclaimer {{ font-size: 0.9em; color: #555; border-top: 1px solid #ddd; padding-top: 1em; margin-top: 2em; }}
            """
            html_parts = [f"<html><head><meta charset='UTF-8'><style>{css}</style></head><body>"]
            
            # Req 1: Use transliterated_name
            h1_text = e(t["pdf_title"]); h2_text = e(business_name); p_text = now
            # For Arabic, we want to keep the text centered regardless of RTL
            h2_style = ''  # Always center, no dir attribute needed
            p_style = 'dir="ltr"' if is_rtl else ''
            
            # Modified cover page structure: single centered box with title, date, and business name
            html_parts.append('<div class="cover-page">')
            html_parts.append(f'<div class="cover-box-1">')
            html_parts.append(f'<h1>{h1_text}</h1>')
            html_parts.append(f'<p {p_style}>{p_text}</p>')
            # Business name now in the same box, centered after title and date
            html_parts.append(f'<h2 {h2_style} style="margin-top: 2em;">{h2_text}</h2>')
            html_parts.append('</div>')
            html_parts.append('</div>')
            
            summary_text = summary_dict.get('Executive', f'<p>{e(t["executive_summary_not_available"])}</p>')
            # Req 6: Use translated disclaimer text directly
            disclaimer_text = t["disclaimer"]
            html_parts.append(f'<div class="exec-summary"><h1 class="page-title">{e(t["pdf_exec_summary"])}</h1>{summary_text}<div class="disclaimer"><strong>{e(t["pdf_disclaimer_title"])}:</strong> {e(disclaimer_text)}</div></div>')
            
            # Req 3 & 5: TOC
            html_parts.append(f'<div class="toc-page">')
            html_parts.append(f'<h1 class="page-title">{e(t["pdf_toc_title"])}</h1>')
            html_parts.append(f"<table class='toc-table'><tr><th>{e(t['domain'])}</th><th>{e(t['total'])}</th></tr>")
            toc_rows = []; domain_id_map = {}
            for i, (domain, domain_use_cases) in enumerate(grouped_data.items()):
                domain_slug = f"domain-{i}"
                domain_id_map[domain] = domain_slug
                toc_rows.append(f"<tr><td><a href='#{domain_slug}'>{e(domain)}</a></td><td>{len(domain_use_cases)}</td></tr>")
            html_parts.extend(toc_rows)
            html_parts.append("</table>")
            html_parts.append('</div>')
            
            for domain, domain_use_cases in grouped_data.items():
                domain_slug = domain_id_map[domain]
                domain_summary_html = summary_dict.get(domain, f"<p>{e(t['domain_summary_not_available'])}</p>")
                
                # Req 2 & 3: Domain Summary Page
                html_parts.append(f'<div class="domain-summary-page">')
                html_parts.append(f'<h2 class="page-title" id="{domain_slug}">{e(domain)}</h2>') 
                html_parts.append(domain_summary_html) # Req 2
                html_parts.append(f'</div>')
                
                html_parts.append(f'<h2 class="domain-header">{e(domain)} - {e(t["pdf_detailed_view"])}</h2>')
                html_parts.append(f'<p class="domain-count">{len(domain_use_cases)} {e(t["pptx_domain_suffix"])}</p>')
                
                # Sort use cases: successful SQL results first, then by Priority Score descending
                def get_use_case_sort_key(uc):
                    use_case_id = uc.get('No', 'Unknown')
                    # Treat a missing cache entry as "no result" instead of failing on None
                    example_result = self._get_cached_sql_result(use_case_id) or {}
                    has_success = 1 if example_result.get('status') == 'success' else 0
                    # Convert inside the try so non-numeric values (e.g. 'N/A' or None) fall back to 0
                    try:
                        priority_score = float(uc.get('Priority Score', 0))
                    except (ValueError, TypeError):
                        priority_score = 0.0
                    return (-has_success, -priority_score)  # Negative for descending order
                
                sorted_domain_use_cases = sorted(domain_use_cases, key=get_use_case_sort_key)
                
                # Helper function to translate field values
                def translate_pdf_value(value):
                    """Translate Type and Priority values for PDF"""
                    if not value or value == 'N/A':
                        return value
                    
                    value_key_map = {
                        'Problem': 'value_type_problem', 'Risk': 'value_type_risk',
                        'Opportunity': 'value_type_opportunity', 'Improvement': 'value_type_improvement',
                        'Ultra High': 'value_priority_ultra_high', 'Very High': 'value_priority_very_high',
                        'High': 'value_priority_high', 'Medium': 'value_priority_medium',
                        'Low': 'value_priority_low', 'Very Low': 'value_priority_very_low',
                        'Ultra Low': 'value_priority_ultra_low'
                    }
                    translation_key = value_key_map.get(value)
                    return t.get(translation_key, value) if translation_key else value
                
                def translate_strategic_value(value):
                    """Translate Strategic Goals and Business Priority alignment values"""
                    if not value or value == 'N/A':
                        return value
                    
                    strategic_key_map = {
                        'General Improvement': 'value_general_improvement',
                        'Reduce Cost': 'value_reduce_cost',
                        'Increase Revenue': 'value_increase_revenue',
                        'Boost Productivity': 'value_boost_productivity',
                        'Mitigate Risk': 'value_mitigate_risk',
                        'Protect Revenue': 'value_protect_revenue',
                        'Align to Regulations': 'value_align_to_regulations',
                        'Improve Customer Experience': 'value_improve_customer_experience',
                        'Enable Data-Driven Decisions': 'value_enable_data_driven_decisions',
                        'Optimize Operations': 'value_optimize_operations',
                        'Empower Talent': 'value_empower_talent',
                        'Enhance Experience': 'value_enhance_experience',
                        'Drive Innovation': 'value_drive_innovation',
                        'Achieve ESG': 'value_achieve_esg',
                        'Execute Strategy': 'value_execute_strategy',
                    }
                    
                    # Handle comma-separated values
                    if ',' in value:
                        parts = [p.strip() for p in value.split(',')]
                        translated_parts = []
                        for part in parts:
                            key = strategic_key_map.get(part)
                            translated_parts.append(t.get(key, part) if key else part)
                        return ', '.join(translated_parts)
                    
                    translation_key = strategic_key_map.get(value)
                    return t.get(translation_key, value) if translation_key else value
                
                def translate_analytics_technique(value):
                    """Translate Analytics Technique values with inline fallback translations"""
                    if not value or value == 'N/A':
                        return value
                    
                    analytics_key_map = {
                        'Forecasting': 'value_forecasting',
                        'Classification': 'value_classification',
                        'Anomaly Detection': 'value_anomaly_detection',
                        'Cohort Analysis': 'value_cohort_analysis',
                        'Segmentation': 'value_segmentation',
                        'Sentiment Analysis': 'value_sentiment_analysis',
                        'Trend Analysis': 'value_trend_analysis',
                        'Prescriptive Analytics': 'value_prescriptive_analytics',
                        'Root Cause Analysis': 'value_root_cause_analysis',
                        'Optimization': 'value_optimization',
                        'Recommendation': 'value_recommendation',
                        'Time Series Analysis': 'value_time_series_analysis',
                        'Predictive Analytics': 'value_predictive_analytics',
                        'Descriptive Analytics': 'value_descriptive_analytics',
                    }
                    
                    analytics_fallbacks = {
                        'Chinese (Mandarin)': {'Forecasting': '预测', 'Classification': '分类', 'Anomaly Detection': '异常检测', 'Cohort Analysis': '队列分析', 'Segmentation': '细分', 'Sentiment Analysis': '情感分析', 'Trend Analysis': '趋势分析', 'Prescriptive Analytics': '规范性分析', 'Root Cause Analysis': '根因分析', 'Optimization': '优化', 'Recommendation': '推荐', 'Time Series Analysis': '时间序列分析', 'Predictive Analytics': '预测分析', 'Descriptive Analytics': '描述性分析'},
                        'Arabic': {'Forecasting': 'التنبؤ', 'Classification': 'التصنيف', 'Anomaly Detection': 'كشف الشذوذ', 'Cohort Analysis': 'تحليل الأتراب', 'Segmentation': 'التجزئة', 'Sentiment Analysis': 'تحليل المشاعر', 'Trend Analysis': 'تحليل الاتجاهات', 'Prescriptive Analytics': 'التحليلات التوجيهية', 'Root Cause Analysis': 'تحليل السبب الجذري', 'Optimization': 'التحسين', 'Recommendation': 'التوصية', 'Time Series Analysis': 'تحليل السلاسل الزمنية', 'Predictive Analytics': 'التحليلات التنبؤية', 'Descriptive Analytics': 'التحليلات الوصفية'},
                        'Spanish': {'Forecasting': 'Pronóstico', 'Classification': 'Clasificación', 'Anomaly Detection': 'Detección de Anomalías', 'Cohort Analysis': 'Análisis de Cohortes', 'Segmentation': 'Segmentación', 'Sentiment Analysis': 'Análisis de Sentimiento', 'Trend Analysis': 'Análisis de Tendencias', 'Prescriptive Analytics': 'Analítica Prescriptiva', 'Root Cause Analysis': 'Análisis de Causa Raíz', 'Optimization': 'Optimización', 'Recommendation': 'Recomendación', 'Time Series Analysis': 'Análisis de Series Temporales', 'Predictive Analytics': 'Analítica Predictiva', 'Descriptive Analytics': 'Analítica Descriptiva'},
                        'French': {'Forecasting': 'Prévision', 'Classification': 'Classification', 'Anomaly Detection': 'Détection d\'Anomalies', 'Cohort Analysis': 'Analyse de Cohortes', 'Segmentation': 'Segmentation', 'Sentiment Analysis': 'Analyse de Sentiments', 'Trend Analysis': 'Analyse des Tendances', 'Prescriptive Analytics': 'Analytique Prescriptive', 'Root Cause Analysis': 'Analyse des Causes Profondes', 'Optimization': 'Optimisation', 'Recommendation': 'Recommandation', 'Time Series Analysis': 'Analyse de Séries Temporelles', 'Predictive Analytics': 'Analytique Prédictive', 'Descriptive Analytics': 'Analytique Descriptive'},
                        'German': {'Forecasting': 'Vorhersage', 'Classification': 'Klassifikation', 'Anomaly Detection': 'Anomalieerkennung', 'Cohort Analysis': 'Kohortenanalyse', 'Segmentation': 'Segmentierung', 'Sentiment Analysis': 'Stimmungsanalyse', 'Trend Analysis': 'Trendanalyse', 'Prescriptive Analytics': 'Präskriptive Analytik', 'Root Cause Analysis': 'Ursachenanalyse', 'Optimization': 'Optimierung', 'Recommendation': 'Empfehlung', 'Time Series Analysis': 'Zeitreihenanalyse', 'Predictive Analytics': 'Prädiktive Analytik', 'Descriptive Analytics': 'Deskriptive Analytik'},
                        'Portuguese': {'Forecasting': 'Previsão', 'Classification': 'Classificação', 'Anomaly Detection': 'Detecção de Anomalias', 'Cohort Analysis': 'Análise de Coorte', 'Segmentation': 'Segmentação', 'Sentiment Analysis': 'Análise de Sentimento', 'Trend Analysis': 'Análise de Tendências', 'Prescriptive Analytics': 'Análise Prescritiva', 'Root Cause Analysis': 'Análise de Causa Raiz', 'Optimization': 'Otimização', 'Recommendation': 'Recomendação', 'Time Series Analysis': 'Análise de Séries Temporais', 'Predictive Analytics': 'Análise Preditiva', 'Descriptive Analytics': 'Análise Descritiva'},
                        'Italian': {'Forecasting': 'Previsione', 'Classification': 'Classificazione', 'Anomaly Detection': 'Rilevamento Anomalie', 'Cohort Analysis': 'Analisi di Coorte', 'Segmentation': 'Segmentazione', 'Sentiment Analysis': 'Analisi del Sentimento', 'Trend Analysis': 'Analisi delle Tendenze', 'Prescriptive Analytics': 'Analisi Prescrittiva', 'Root Cause Analysis': 'Analisi delle Cause Profonde', 'Optimization': 'Ottimizzazione', 'Recommendation': 'Raccomandazione', 'Time Series Analysis': 'Analisi delle Serie Temporali', 'Predictive Analytics': 'Analisi Predittiva', 'Descriptive Analytics': 'Analisi Descrittiva'},
                        'Japanese': {'Forecasting': '予測', 'Classification': '分類', 'Anomaly Detection': '異常検出', 'Cohort Analysis': 'コホート分析', 'Segmentation': 'セグメンテーション', 'Sentiment Analysis': '感情分析', 'Trend Analysis': 'トレンド分析', 'Prescriptive Analytics': '処方的分析', 'Root Cause Analysis': '根本原因分析', 'Optimization': '最適化', 'Recommendation': 'レコメンデーション', 'Time Series Analysis': '時系列分析', 'Predictive Analytics': '予測分析', 'Descriptive Analytics': '記述分析'},
                        'Korean': {'Forecasting': '예측', 'Classification': '분류', 'Anomaly Detection': '이상 탐지', 'Cohort Analysis': '코호트 분석', 'Segmentation': '세분화', 'Sentiment Analysis': '감정 분석', 'Trend Analysis': '추세 분석', 'Prescriptive Analytics': '처방적 분석', 'Root Cause Analysis': '근본 원인 분석', 'Optimization': '최적화', 'Recommendation': '추천', 'Time Series Analysis': '시계열 분석', 'Predictive Analytics': '예측 분석', 'Descriptive Analytics': '기술 분석'},
                        'Hindi': {'Forecasting': 'पूर्वानुमान', 'Classification': 'वर्गीकरण', 'Anomaly Detection': 'विसंगति पता लगाना', 'Cohort Analysis': 'समूह विश्लेषण', 'Segmentation': 'विभाजन', 'Sentiment Analysis': 'भावना विश्लेषण', 'Trend Analysis': 'रुझान विश्लेषण', 'Prescriptive Analytics': 'निर्देशात्मक विश्लेषण', 'Root Cause Analysis': 'मूल कारण विश्लेषण', 'Optimization': 'अनुकूलन', 'Recommendation': 'सिफारिश', 'Time Series Analysis': 'समय श्रृंखला विश्लेषण', 'Predictive Analytics': 'भविष्य कथन विश्लेषण', 'Descriptive Analytics': 'वर्णनात्मक विश्लेषण'},
                        'Russian': {'Forecasting': 'Прогнозирование', 'Classification': 'Классификация', 'Anomaly Detection': 'Обнаружение Аномалий', 'Cohort Analysis': 'Когортный Анализ', 'Segmentation': 'Сегментация', 'Sentiment Analysis': 'Анализ Настроений', 'Trend Analysis': 'Анализ Трендов', 'Prescriptive Analytics': 'Предписывающая Аналитика', 'Root Cause Analysis': 'Анализ Первопричин', 'Optimization': 'Оптимизация', 'Recommendation': 'Рекомендация', 'Time Series Analysis': 'Анализ Временных Рядов', 'Predictive Analytics': 'Предиктивная Аналитика', 'Descriptive Analytics': 'Описательная Аналитика'},
                    }
                    
                    translation_key = analytics_key_map.get(value)
                    translated = t.get(translation_key, None) if translation_key else None
                    if translated and translated != value:
                        return translated
                    if language in analytics_fallbacks and value in analytics_fallbacks[language]:
                        return analytics_fallbacks[language][value]
                    return value
                
                for idx, uc in enumerate(sorted_domain_use_cases, start=1):
                    # Add page-break-after class to every 2nd use case (2, 4, 6, 8, etc.)
                    page_break_class = ' page-break-after' if idx % 2 == 0 else ''
                    html_parts.append(f'<div class="use-case-block{page_break_class}">')
                    html_parts.append(f"<h3>{e(uc['No'])}: {e(uc['Name'])}</h3>")
                    # Add header line with Subdomain, Type, Analytics Technique, and Priority (with translations)
                    subdomain_val = e(uc.get('Subdomain', 'N/A'))
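                    # Other fields use capitalized keys; accept 'Type' and fall back to the original 'type'
                    type_val = e(translate_pdf_value(uc.get('Type', uc.get('type', 'N/A'))))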
                    analytics_technique_val = e(translate_analytics_technique(uc.get('Analytics Technique', 'N/A')))
                    priority_val = e(translate_pdf_value(uc.get('Priority', 'N/A')))
                    html_parts.append(f"<p style='font-weight: bold; color: #0066cc;'>{e(t['subdomain'])}: {subdomain_val} | {e(t['type'])}: {type_val} | {e(t.get('analytics_technique', 'Analytics Technique'))}: {analytics_technique_val} | {e(t['priority'])}: {priority_val}</p>")
                    html_parts.append(f"<p><strong>{e(t['statement'])}:</strong> {e(uc.get('Statement', 'N/A'))}</p>")
                    html_parts.append(f"<p><strong>{e(t['solution'])}:</strong> {e(uc.get('Solution', 'N/A'))}</p>")
                    html_parts.append(f"<p><strong>{e(t['business_value'])}:</strong> {e(uc.get('Business Value', 'N/A'))}</p>")
                    html_parts.append(f"<p><strong>{e(t['beneficiary'])}:</strong> {e(uc.get('Beneficiary', 'N/A'))}</p>")
                    html_parts.append(f"<p><strong>{e(t['sponsor'])}:</strong> {e(uc.get('Sponsor', 'N/A'))}</p>")
                    html_parts.append(f"<p><strong>{e(t.get('business_priority_alignment', 'Business Priority Alignment'))}:</strong> {e(translate_strategic_value(uc.get('Business Priority Alignment', 'General Improvement')))}</p>")
                    html_parts.append(f"<p><strong>{e(t.get('strategic_goals_alignment', 'Strategic Goals Alignment'))}:</strong> {e(translate_strategic_value(uc.get('Strategic Goals Alignment', 'General Improvement')))}</p>")
                    
                    html_parts.append('</div>')
            html_parts.append("</body></html>")
            return "".join(html_parts)

        def _save_pdf(html_content: str, workspace_path: str, logger_instance):
            import weasyprint
            import logging
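            # FontConfiguration lives in weasyprint.text.fonts in newer WeasyPrint releases;
            # try that location first, then the legacy weasyprint.fonts module.
            try:
                from weasyprint.text.fonts import FontConfiguration
            except ImportError:
                try:
                    from weasyprint.fonts import FontConfiguration
                except ImportError:
                    FontConfiguration = None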
            local_pdf_path = None
            try:
                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file: local_pdf_path = tmp_file.name
                logger_instance.info(f"Generating PDF at local temp path: {local_pdf_path}")
                font_config = FontConfiguration() if FontConfiguration else None
                
                # Suppress font-face warnings from WeasyPrint
                weasyprint_logger = logging.getLogger('weasyprint')
                original_level = weasyprint_logger.level
                weasyprint_logger.setLevel(logging.ERROR)  # Only show errors, suppress warnings
                
                try:
                    weasyprint.HTML(string=html_content).write_pdf(local_pdf_path, font_config=font_config)
                finally:
                    weasyprint_logger.setLevel(original_level)  # Restore original level
                with open(local_pdf_path, "rb") as f: pdf_data = f.read()
                if not pdf_data: raise ValueError("Generated PDF file is empty.")
                logger_instance.info(f"Uploading PDF to workspace path: {workspace_path}")
                pdf_data_b64 = base64.b64encode(pdf_data).decode()
                self.w_client.workspace.import_(path=workspace_path, content=pdf_data_b64, format=workspace.ImportFormat.AUTO, overwrite=True)
                abs_path = self.w_client.workspace.get_status(workspace_path).path
                logger_instance.info(f"Success! PDF Catalog uploaded to: {abs_path}")
                log_print(f"Success! PDF Catalog ({language}) generated: {abs_path}")
            except Exception as e:
                import traceback
                logger_instance.critical(f"Failed to generate and save PDF for {language}: {e}")
                logger_instance.critical(f"Full traceback: {traceback.format_exc()}")
            finally:
                if local_pdf_path and os.path.exists(local_pdf_path): os.remove(local_pdf_path)

        # --- Main execution logic for generate_catalog_pdf ---
        try:
            if not _install_dependencies(self.logger):
                self.logger.error("Skipping PDF generation due to missing weasyprint dependency.")
                return
                
            if not grouped_data:
                self.logger.warning(f"No use cases provided to generate_catalog_pdf for {language}. Skipping.")
                return
            final_html = _build_html(grouped_data, summary_dict, transliterated_name, t, is_rtl)
            pdf_workspace_path = os.path.join(self.docs_output_dir, f"{self.business_name}-dbx_inspire_{lang_abbr}.pdf")
            _save_pdf(final_html, pdf_workspace_path, self.logger)
        except Exception as exc:  # named exc so it cannot shadow the e() escape helper used by _build_html
            self.logger.critical(f"An error occurred during PDF generation for {language}: {exc}")

    # === MODIFIED: PPTX Generation (Req 1, 2, 4, 5, 6) ===
    def generate_presentation_pptx(self, language: str, lang_abbr: str, translations: dict, summary_dict: dict, grouped_data: dict, transliterated_name: str):
        self.logger.info(f"--- Starting PPTX Presentation Generation for {language} ---")
        
        t = translations
        is_rtl = (language == "Arabic")

        def _install_pptx_dependencies(logger_instance) -> bool:
            import importlib
            try:
                import pptx  # imported only to probe availability
                logger_instance.info("PPTX package (python-pptx) already installed.")
                return True
            except ImportError:
                logger_instance.info("Installing required PPTX package (python-pptx)...")
                try:
                    subprocess.check_call([sys.executable, "-m", "pip", "install", "python-pptx"])
                    importlib.invalidate_caches()  # let the import system see the freshly installed package
                    import pptx
                    logger_instance.info("Successfully installed python-pptx.")
                    return True
                except Exception as exc:
                    logger_instance.error(f"Failed to install python-pptx: {exc}")
                    print("ERROR: Failed to install 'python-pptx'. Presentation generation cannot continue.", file=sys.stderr)
                    return False

        # === MODIFIED: _build_presentation (Req 1, 3, 4, 5) ===
        def _build_presentation(grouped_data: dict, summary_dict: dict, business_name: str, translations: dict, workspace_path: str, logger_instance, is_rtl: bool):
            try:
                from pptx import Presentation
                from pptx.util import Inches, Pt, Cm
                from pptx.dml.color import RGBColor
                from pptx.enum.text import PP_ALIGN, MSO_ANCHOR
                from pptx.enum.shapes import MSO_SHAPE
            except ImportError as e:
                logger_instance.error(f"FATAL: python-pptx import failed inside _build_presentation: {e}. Aborting PPTX generation.")
                return
            
            DATABRICKS_BLUE = RGBColor(0, 51, 102); DATABRICKS_ORANGE = RGBColor(255, 105, 0); TEXT_COLOR = RGBColor(0x33, 0x33, 0x33)
            LIGHT_GRAY = RGBColor(0xFA, 0xFA, 0xFA); WHITE_COLOR = RGBColor(0xFF, 0xFF, 0xFF); FOOTER_COLOR = RGBColor(0x88, 0x88, 0x88)
            # Modern color palette - vibrant and futuristic
            TEAL_ACCENT = RGBColor(0, 188, 212); PURPLE_ACCENT = RGBColor(156, 39, 176); GREEN_ACCENT = RGBColor(76, 175, 80)
            
            prs = Presentation(); prs.slide_width = Cm(33.867); prs.slide_height = Cm(19.05)
            # === PPTX FIX (Req 1): Lengths are integer EMUs; floor division keeps the margin an int ===
            CONTENT_WIDTH_CM = Cm(30.5)
            LEFT_MARGIN_CM = (prs.slide_width - CONTENT_WIDTH_CM) // 2
            
            align = PP_ALIGN.RIGHT if is_rtl else PP_ALIGN.LEFT
            title_align = PP_ALIGN.CENTER

            SLIDE_LAYOUT_TITLE = prs.slide_layouts[0]; SLIDE_LAYOUT_TITLE_AND_CONTENT = prs.slide_layouts[1]; SLIDE_LAYOUT_BLANK = prs.slide_layouts[6]
            def set_font_color(run, color=TEXT_COLOR): run.font.color.rgb = color

            now = datetime.datetime.now().strftime("%Y-%m-%d")
            footer_text = f"Databricks Inspire AI  |  {now}"
            def add_footer(slide):
                try:
                    left = int(LEFT_MARGIN_CM); width = int(CONTENT_WIDTH_CM); top = int(Cm(18.2)); height = int(Cm(0.8))
                    txBox = slide.shapes.add_textbox(left, top, width, height)
                    p = txBox.text_frame.paragraphs[0]; p.text = footer_text
                    p.font.size = Pt(10); p.font.color.rgb = FOOTER_COLOR
                    p.alignment = PP_ALIGN.CENTER if is_rtl else PP_ALIGN.RIGHT
                except Exception as e:
                    logger_instance.warning(f"Failed to add footer to slide: {e}")

            # Enhanced title slide with decorative shapes
            logger_instance.info("Building Title Slide...") 
            slide = prs.slides.add_slide(SLIDE_LAYOUT_TITLE); slide.background.fill.solid(); slide.background.fill.fore_color.rgb = DATABRICKS_BLUE
            
            # Add decorative circular shapes for visual interest
            circle1 = slide.shapes.add_shape(MSO_SHAPE.OVAL, int(Cm(28)), int(Cm(1)), int(Cm(4)), int(Cm(4)))
            circle1.fill.solid(); circle1.fill.fore_color.rgb = TEAL_ACCENT; circle1.line.fill.background()
            circle1.fill.transparency = 0.3
            
            circle2 = slide.shapes.add_shape(MSO_SHAPE.OVAL, int(Cm(1)), int(Cm(14)), int(Cm(5)), int(Cm(5)))
            circle2.fill.solid(); circle2.fill.fore_color.rgb = PURPLE_ACCENT; circle2.line.fill.background()
            circle2.fill.transparency = 0.2
            
            # Additional decorative elements for futuristic feel
            circle3 = slide.shapes.add_shape(MSO_SHAPE.OVAL, int(Cm(30)), int(Cm(16)), int(Cm(3)), int(Cm(3)))
            circle3.fill.solid(); circle3.fill.fore_color.rgb = GREEN_ACCENT; circle3.line.fill.background()
            circle3.fill.transparency = 0.4
            
            # Add small accent circles
            accent_circle1 = slide.shapes.add_shape(MSO_SHAPE.OVAL, int(Cm(3)), int(Cm(3)), int(Cm(1.5)), int(Cm(1.5)))
            accent_circle1.fill.solid(); accent_circle1.fill.fore_color.rgb = DATABRICKS_ORANGE; accent_circle1.line.fill.background()
            accent_circle1.fill.transparency = 0.5
            
            accent_circle2 = slide.shapes.add_shape(MSO_SHAPE.OVAL, int(Cm(29)), int(Cm(8)), int(Cm(2)), int(Cm(2)))
            accent_circle2.fill.solid(); accent_circle2.fill.fore_color.rgb = TEAL_ACCENT; accent_circle2.line.fill.background()
            accent_circle2.fill.transparency = 0.6
            
            txBox_top = slide.shapes.add_textbox(int(LEFT_MARGIN_CM), int(Cm(2.0)), int(CONTENT_WIDTH_CM), int(Cm(5.0)))
            tf_top = txBox_top.text_frame; tf_top.vertical_anchor = MSO_ANCHOR.MIDDLE
            p_top1 = tf_top.paragraphs[0]; p_top1.text = t['pptx_main_title']
            p_top1.font.color.rgb = WHITE_COLOR; p_top1.font.size = Pt(44); p_top1.alignment = title_align # Req 3: Font size reduced
            p_top2 = tf_top.add_paragraph(); p_top2.text = now
            p_top2.font.color.rgb = RGBColor(0xCC, 0xCC, 0xCC); p_top2.font.size = Pt(20); p_top2.alignment = title_align
            
            txBox_bottom = slide.shapes.add_textbox(int(LEFT_MARGIN_CM), int(Cm(10.0)), int(CONTENT_WIDTH_CM), int(Cm(5.0)))
            tf_bottom = txBox_bottom.text_frame; tf_bottom.vertical_anchor = MSO_ANCHOR.MIDDLE
            p_bottom = tf_bottom.paragraphs[0]; p_bottom.text = f"{t['pptx_for']} {business_name}" # Req 1: Use transliterated name
            p_bottom.font.color.rgb = DATABRICKS_ORANGE; p_bottom.font.size = Pt(32); p_bottom.alignment = title_align
            
            # Remove the unused layout placeholders; ignore layouts that do not have them
            try: slide.shapes.title.element.getparent().remove(slide.shapes.title.element)
            except Exception: pass
            try: slide.placeholders[1].element.getparent().remove(slide.placeholders[1].element)
            except Exception: pass
            add_footer(slide)

            # Enhanced Executive Summary with colorful accents
            logger_instance.info("Building Executive Summary Slide...")
            slide = prs.slides.add_slide(SLIDE_LAYOUT_TITLE_AND_CONTENT); slide.background.fill.solid(); slide.background.fill.fore_color.rgb = LIGHT_GRAY
            
            # Add gradient accent bar (simulated with two rectangles)
            accent_bar1 = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, 0, 0, int(Cm(0.8)), int(prs.slide_height))
            accent_bar1.fill.solid(); accent_bar1.fill.fore_color.rgb = DATABRICKS_ORANGE; accent_bar1.line.fill.background()
            
            accent_bar2 = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, int(Cm(0.8)), 0, int(Cm(0.7)), int(prs.slide_height))
            accent_bar2.fill.solid(); accent_bar2.fill.fore_color.rgb = TEAL_ACCENT; accent_bar2.line.fill.background()
            accent_bar2.fill.transparency = 0.5
            
            title = slide.shapes.title; title.left, title.width = int(LEFT_MARGIN_CM), int(CONTENT_WIDTH_CM)
            title.top = int(Cm(1.0)); title.height = int(Cm(2.5))
            title.text = t['pdf_exec_summary']
            p = title.text_frame.paragraphs[0]; p.font.color.rgb = DATABRICKS_BLUE; p.font.size = Pt(36); p.alignment = align
            
            content_placeholder = slide.placeholders[1]; content_placeholder.left, content_placeholder.width = int(LEFT_MARGIN_CM), int(Cm(30.5))
            content_placeholder.top = int(Cm(1.0) + Cm(2.5))
            content_placeholder.height = int(Cm(14.1))
            content_frame = content_placeholder.text_frame; content_frame.clear(); content_frame.word_wrap = True
            
            summary_text = summary_dict.get('Executive', t['summary_not_available'])
            summary_text = re.sub(r'</p>|<p>', ' ', summary_text); summary_text = re.sub(r'<[^>]+>', '', summary_text).strip()
            
            # Split into sentences and create bullet points for each statement
            sentences = re.split(r'(?<=[.!?])\s+', summary_text)
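            first_paragraph = True
            for sentence in sentences:
                sentence = sentence.strip()
                if not sentence: continue
                # Reuse the empty paragraph left by clear() for the first bullet to avoid a blank line at the top
                p = content_frame.paragraphs[0] if first_paragraph else content_frame.add_paragraph(); first_paragraph = False
                p.text = sentence; p.font.size = Pt(18)
                p.level = 0; p.alignment = align; p.space_after = Pt(8)
                set_font_color(p.runs[0], TEXT_COLOR)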
            
            p = content_frame.add_paragraph(); p.space_before = Pt(24); p.alignment = align
            disclaimer_text = t["disclaimer"] # Req 6: Use translated disclaimer
            run_label = p.add_run(); run_label.text = f"{t['pptx_disclaimer_title']}: "; run_label.font.bold = True; run_label.font.size = Pt(14)
            set_font_color(run_label, DATABRICKS_BLUE)
            run_value = p.add_run(); run_value.text = disclaimer_text; run_value.font.size = Pt(14); set_font_color(run_value, TEXT_COLOR)
            add_footer(slide)

            # === MODIFIED: PPTX TOC (Req 4, 5) ===
            logger_instance.info("Building Table of Contents Slide(s)...")
            rows = list(grouped_data.items())
            max_rows_per_slide = 10
            num_slides = (len(rows) + max_rows_per_slide - 1) // max_rows_per_slide
            for i in range(num_slides):
                slide = prs.slides.add_slide(SLIDE_LAYOUT_TITLE_AND_CONTENT)
                slide.background.fill.solid(); slide.background.fill.fore_color.rgb = LIGHT_GRAY
                
                # Colorful multi-bar accent
                accent_bar1 = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, 0, 0, int(Cm(0.5)), int(prs.slide_height))
                accent_bar1.fill.solid(); accent_bar1.fill.fore_color.rgb = DATABRICKS_ORANGE; accent_bar1.line.fill.background()
                
                accent_bar2 = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, int(Cm(0.5)), 0, int(Cm(0.5)), int(prs.slide_height))
                accent_bar2.fill.solid(); accent_bar2.fill.fore_color.rgb = TEAL_ACCENT; accent_bar2.line.fill.background()
                
                accent_bar3 = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, int(Cm(1.0)), 0, int(Cm(0.5)), int(prs.slide_height))
                accent_bar3.fill.solid(); accent_bar3.fill.fore_color.rgb = PURPLE_ACCENT; accent_bar3.line.fill.background()
                
                title = slide.shapes.title; title.left, title.width = int(LEFT_MARGIN_CM), int(CONTENT_WIDTH_CM)
                title.top = int(Cm(1.0)); title.height = int(Cm(2.5))
                title.text = t['pdf_toc_title']
                if num_slides > 1: title.text += f" ({i+1}/{num_slides})"
                p = title.text_frame.paragraphs[0]; p.font.color.rgb = DATABRICKS_BLUE; p.font.size = Pt(36); p.alignment = align
                
                chunk = rows[i*max_rows_per_slide : (i+1)*max_rows_per_slide]
                num_table_rows = len(chunk) + 1; num_table_cols = 2
                table_left = int(LEFT_MARGIN_CM); table_top = int(Cm(4.0)); table_width = int(CONTENT_WIDTH_CM); table_height = int(Cm(14.0))
                table_shape = slide.shapes.add_table(num_table_rows, num_table_cols, table_left, table_top, table_width, table_height)
                table = table_shape.table
                table.columns[0].width = int(Cm(22.0)); table.columns[1].width = int(Cm(8.5))
                table.horz_banding = False; table.first_row = True
                
                table.cell(0, 0).text = t['domain']; table.cell(0, 1).text = t['total']
                
                for c_idx in range(num_table_cols):
                    cell = table.cell(0, c_idx); cell.fill.solid(); cell.fill.fore_color.rgb = DATABRICKS_BLUE
                    para = cell.text_frame.paragraphs[0]; para.font.color.rgb = WHITE_COLOR; para.font.bold = True; para.font.size = Pt(18); para.alignment = align
                    cell.vertical_anchor = MSO_ANCHOR.MIDDLE
                for r_idx, (domain, domain_use_cases) in enumerate(chunk):
                    table.cell(r_idx + 1, 0).text = domain
                    table.cell(r_idx + 1, 1).text = str(len(domain_use_cases))
                    for c_idx in range(num_table_cols):
                        cell = table.cell(r_idx + 1, c_idx); para = cell.text_frame.paragraphs[0]
                        para.font.color.rgb = TEXT_COLOR; para.font.size = Pt(16); para.alignment = align
                        cell.vertical_anchor = MSO_ANCHOR.MIDDLE
                # The table replaces the content placeholder; ignore layouts that do not have one
                try: slide.placeholders[1].element.getparent().remove(slide.placeholders[1].element)
                except Exception: pass
                add_footer(slide)
            
            for domain, domain_use_cases in grouped_data.items():
                slide = prs.slides.add_slide(SLIDE_LAYOUT_BLANK); slide.background.fill.solid(); slide.background.fill.fore_color.rgb = DATABRICKS_BLUE
                txBox = slide.shapes.add_textbox(int(LEFT_MARGIN_CM), int(Cm(1.0)), int(CONTENT_WIDTH_CM), int(Cm(17.05)))
                tf = txBox.text_frame; tf.vertical_anchor = MSO_ANCHOR.MIDDLE
                p = tf.paragraphs[0]; p.text = str(domain); p.alignment = align
                p.font.color.rgb = DATABRICKS_ORANGE; p.font.size = Pt(44); p.font.bold = True
                # A "\n" inside paragraph text does not start a new paragraph, so add the count line explicitly
                p2 = tf.add_paragraph(); p2.text = f"{len(domain_use_cases)} {t['pptx_domain_suffix']}"
                p2.font.color.rgb = WHITE_COLOR; p2.font.size = Pt(32); p2.font.bold = False; p2.alignment = align
                add_footer(slide)

                # --- Domain Summary Slide (MODIFIED: Req 1, 2) ---
                slide = prs.slides.add_slide(SLIDE_LAYOUT_TITLE_AND_CONTENT); slide.background.fill.solid(); slide.background.fill.fore_color.rgb = LIGHT_GRAY
                accent_bar = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, 0, 0, int(Cm(1.5)), int(prs.slide_height))
                accent_bar.fill.solid(); accent_bar.fill.fore_color.rgb = DATABRICKS_ORANGE; accent_bar.line.fill.background()
                
                title = slide.shapes.title; title.left, title.width = int(LEFT_MARGIN_CM), int(CONTENT_WIDTH_CM)
                title.top = int(Cm(1.0)); title.height = int(Cm(2.5))
                title.text = domain # Req 1: Remove ": Summary"
                
                p = title.text_frame.paragraphs[0]; p.font.color.rgb = DATABRICKS_BLUE; p.font.size = Pt(36); p.alignment = align
                
                content_placeholder = slide.placeholders[1]; content_placeholder.left, content_placeholder.width = int(LEFT_MARGIN_CM), int(Cm(30.5))
                content_placeholder.top = int(Cm(1.0) + Cm(2.5)); content_placeholder.height = int(Cm(14.1))
                content_frame = content_placeholder.text_frame; content_frame.clear(); content_frame.word_wrap = True
                
                domain_summary_text = summary_dict.get(domain, t['domain_summary_not_available'])
                domain_summary_text = re.sub(r'</p>|<p>', ' ', domain_summary_text); domain_summary_text = re.sub(r'<[^>]+>', '', domain_summary_text).strip()
                
                # Split into sentences and create bullet points for each statement
                sentences = re.split(r'(?<=[.!?])\s+', domain_summary_text)
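                first_paragraph = True
                for sentence in sentences:
                    sentence = sentence.strip()
                    if not sentence: continue
                    # Reuse the empty paragraph left by clear() for the first bullet to avoid a blank line at the top
                    p = content_frame.paragraphs[0] if first_paragraph else content_frame.add_paragraph(); first_paragraph = False
                    p.text = sentence; p.font.size = Pt(18)
                    p.level = 0; p.alignment = align; p.space_after = Pt(8)
                    set_font_color(p.runs[0], TEXT_COLOR)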
                add_footer(slide)
                
                # Helper function to translate field values for PPTX
                def translate_pptx_value(value):
                    """Translate Type and Priority values for PPTX"""
                    if not value or value == 'N/A':
                        return value
                    
                    value_key_map = {
                        'Problem': 'value_type_problem', 'Risk': 'value_type_risk',
                        'Opportunity': 'value_type_opportunity', 'Improvement': 'value_type_improvement',
                        'Ultra High': 'value_priority_ultra_high', 'Very High': 'value_priority_very_high',
                        'High': 'value_priority_high', 'Medium': 'value_priority_medium',
                        'Low': 'value_priority_low', 'Very Low': 'value_priority_very_low',
                        'Ultra Low': 'value_priority_ultra_low'
                    }
                    translation_key = value_key_map.get(value)
                    return t.get(translation_key, value) if translation_key else value
                
                def translate_strategic_pptx_value(value):
                    """Translate Strategic Goals and Business Priority alignment values"""
                    if not value or value == 'N/A':
                        return value
                    
                    strategic_key_map = {
                        'General Improvement': 'value_general_improvement',
                        'Reduce Cost': 'value_reduce_cost',
                        'Increase Revenue': 'value_increase_revenue',
                        'Boost Productivity': 'value_boost_productivity',
                        'Mitigate Risk': 'value_mitigate_risk',
                        'Protect Revenue': 'value_protect_revenue',
                        'Align to Regulations': 'value_align_to_regulations',
                        'Improve Customer Experience': 'value_improve_customer_experience',
                        'Enable Data-Driven Decisions': 'value_enable_data_driven_decisions',
                        'Optimize Operations': 'value_optimize_operations',
                        'Empower Talent': 'value_empower_talent',
                        'Enhance Experience': 'value_enhance_experience',
                        'Drive Innovation': 'value_drive_innovation',
                        'Achieve ESG': 'value_achieve_esg',
                        'Execute Strategy': 'value_execute_strategy',
                    }
                    
                    # Handle comma-separated values
                    if ',' in str(value):
                        parts = [p.strip() for p in str(value).split(',')]
                        translated_parts = []
                        for part in parts:
                            key = strategic_key_map.get(part)
                            translated_parts.append(t.get(key, part) if key else part)
                        return ', '.join(translated_parts)
                    
                    translation_key = strategic_key_map.get(value)
                    return t.get(translation_key, value) if translation_key else value
                
                def translate_analytics_pptx_value(value):
                    """Translate Analytics Technique values for PPTX with inline fallback translations"""
                    if not value or value == 'N/A':
                        return value
                    
                    analytics_key_map = {
                        'Forecasting': 'value_forecasting',
                        'Classification': 'value_classification',
                        'Anomaly Detection': 'value_anomaly_detection',
                        'Cohort Analysis': 'value_cohort_analysis',
                        'Segmentation': 'value_segmentation',
                        'Sentiment Analysis': 'value_sentiment_analysis',
                        'Trend Analysis': 'value_trend_analysis',
                        'Prescriptive Analytics': 'value_prescriptive_analytics',
                        'Root Cause Analysis': 'value_root_cause_analysis',
                        'Optimization': 'value_optimization',
                        'Recommendation': 'value_recommendation',
                        'Time Series Analysis': 'value_time_series_analysis',
                        'Predictive Analytics': 'value_predictive_analytics',
                        'Descriptive Analytics': 'value_descriptive_analytics',
                    }
                    
                    analytics_fallbacks = {
                        'Chinese (Mandarin)': {'Forecasting': '预测', 'Classification': '分类', 'Anomaly Detection': '异常检测', 'Cohort Analysis': '队列分析', 'Segmentation': '细分', 'Sentiment Analysis': '情感分析', 'Trend Analysis': '趋势分析', 'Prescriptive Analytics': '规范性分析', 'Root Cause Analysis': '根因分析', 'Optimization': '优化', 'Recommendation': '推荐', 'Time Series Analysis': '时间序列分析', 'Predictive Analytics': '预测分析', 'Descriptive Analytics': '描述性分析'},
                        'Arabic': {'Forecasting': 'التنبؤ', 'Classification': 'التصنيف', 'Anomaly Detection': 'كشف الشذوذ', 'Cohort Analysis': 'تحليل الأتراب', 'Segmentation': 'التجزئة', 'Sentiment Analysis': 'تحليل المشاعر', 'Trend Analysis': 'تحليل الاتجاهات', 'Prescriptive Analytics': 'التحليلات التوجيهية', 'Root Cause Analysis': 'تحليل السبب الجذري', 'Optimization': 'التحسين', 'Recommendation': 'التوصية', 'Time Series Analysis': 'تحليل السلاسل الزمنية', 'Predictive Analytics': 'التحليلات التنبؤية', 'Descriptive Analytics': 'التحليلات الوصفية'},
                        'Spanish': {'Forecasting': 'Pronóstico', 'Classification': 'Clasificación', 'Anomaly Detection': 'Detección de Anomalías', 'Cohort Analysis': 'Análisis de Cohortes', 'Segmentation': 'Segmentación', 'Sentiment Analysis': 'Análisis de Sentimiento', 'Trend Analysis': 'Análisis de Tendencias', 'Prescriptive Analytics': 'Analítica Prescriptiva', 'Root Cause Analysis': 'Análisis de Causa Raíz', 'Optimization': 'Optimización', 'Recommendation': 'Recomendación', 'Time Series Analysis': 'Análisis de Series Temporales', 'Predictive Analytics': 'Analítica Predictiva', 'Descriptive Analytics': 'Analítica Descriptiva'},
                        'French': {'Forecasting': 'Prévision', 'Classification': 'Classification', 'Anomaly Detection': 'Détection d\'Anomalies', 'Cohort Analysis': 'Analyse de Cohortes', 'Segmentation': 'Segmentation', 'Sentiment Analysis': 'Analyse de Sentiments', 'Trend Analysis': 'Analyse des Tendances', 'Prescriptive Analytics': 'Analytique Prescriptive', 'Root Cause Analysis': 'Analyse des Causes Profondes', 'Optimization': 'Optimisation', 'Recommendation': 'Recommandation', 'Time Series Analysis': 'Analyse de Séries Temporelles', 'Predictive Analytics': 'Analytique Prédictive', 'Descriptive Analytics': 'Analytique Descriptive'},
                        'German': {'Forecasting': 'Vorhersage', 'Classification': 'Klassifikation', 'Anomaly Detection': 'Anomalieerkennung', 'Cohort Analysis': 'Kohortenanalyse', 'Segmentation': 'Segmentierung', 'Sentiment Analysis': 'Stimmungsanalyse', 'Trend Analysis': 'Trendanalyse', 'Prescriptive Analytics': 'Präskriptive Analytik', 'Root Cause Analysis': 'Ursachenanalyse', 'Optimization': 'Optimierung', 'Recommendation': 'Empfehlung', 'Time Series Analysis': 'Zeitreihenanalyse', 'Predictive Analytics': 'Prädiktive Analytik', 'Descriptive Analytics': 'Deskriptive Analytik'},
                        'Portuguese': {'Forecasting': 'Previsão', 'Classification': 'Classificação', 'Anomaly Detection': 'Detecção de Anomalias', 'Cohort Analysis': 'Análise de Coorte', 'Segmentation': 'Segmentação', 'Sentiment Analysis': 'Análise de Sentimento', 'Trend Analysis': 'Análise de Tendências', 'Prescriptive Analytics': 'Análise Prescritiva', 'Root Cause Analysis': 'Análise de Causa Raiz', 'Optimization': 'Otimização', 'Recommendation': 'Recomendação', 'Time Series Analysis': 'Análise de Séries Temporais', 'Predictive Analytics': 'Análise Preditiva', 'Descriptive Analytics': 'Análise Descritiva'},
                        'Italian': {'Forecasting': 'Previsione', 'Classification': 'Classificazione', 'Anomaly Detection': 'Rilevamento Anomalie', 'Cohort Analysis': 'Analisi di Coorte', 'Segmentation': 'Segmentazione', 'Sentiment Analysis': 'Analisi del Sentimento', 'Trend Analysis': 'Analisi delle Tendenze', 'Prescriptive Analytics': 'Analisi Prescrittiva', 'Root Cause Analysis': 'Analisi delle Cause Profonde', 'Optimization': 'Ottimizzazione', 'Recommendation': 'Raccomandazione', 'Time Series Analysis': 'Analisi delle Serie Temporali', 'Predictive Analytics': 'Analisi Predittiva', 'Descriptive Analytics': 'Analisi Descrittiva'},
                        'Japanese': {'Forecasting': '予測', 'Classification': '分類', 'Anomaly Detection': '異常検出', 'Cohort Analysis': 'コホート分析', 'Segmentation': 'セグメンテーション', 'Sentiment Analysis': '感情分析', 'Trend Analysis': 'トレンド分析', 'Prescriptive Analytics': '処方的分析', 'Root Cause Analysis': '根本原因分析', 'Optimization': '最適化', 'Recommendation': 'レコメンデーション', 'Time Series Analysis': '時系列分析', 'Predictive Analytics': '予測分析', 'Descriptive Analytics': '記述分析'},
                        'Korean': {'Forecasting': '예측', 'Classification': '분류', 'Anomaly Detection': '이상 탐지', 'Cohort Analysis': '코호트 분석', 'Segmentation': '세분화', 'Sentiment Analysis': '감정 분석', 'Trend Analysis': '추세 분석', 'Prescriptive Analytics': '처방적 분석', 'Root Cause Analysis': '근본 원인 분석', 'Optimization': '최적화', 'Recommendation': '추천', 'Time Series Analysis': '시계열 분석', 'Predictive Analytics': '예측 분석', 'Descriptive Analytics': '기술 분석'},
                        'Hindi': {'Forecasting': 'पूर्वानुमान', 'Classification': 'वर्गीकरण', 'Anomaly Detection': 'विसंगति पता लगाना', 'Cohort Analysis': 'समूह विश्लेषण', 'Segmentation': 'विभाजन', 'Sentiment Analysis': 'भावना विश्लेषण', 'Trend Analysis': 'रुझान विश्लेषण', 'Prescriptive Analytics': 'निर्देशात्मक विश्लेषण', 'Root Cause Analysis': 'मूल कारण विश्लेषण', 'Optimization': 'अनुकूलन', 'Recommendation': 'सिफारिश', 'Time Series Analysis': 'समय श्रृंखला विश्लेषण', 'Predictive Analytics': 'भविष्य कथन विश्लेषण', 'Descriptive Analytics': 'वर्णनात्मक विश्लेषण'},
                        'Russian': {'Forecasting': 'Прогнозирование', 'Classification': 'Классификация', 'Anomaly Detection': 'Обнаружение Аномалий', 'Cohort Analysis': 'Когортный Анализ', 'Segmentation': 'Сегментация', 'Sentiment Analysis': 'Анализ Настроений', 'Trend Analysis': 'Анализ Трендов', 'Prescriptive Analytics': 'Предписывающая Аналитика', 'Root Cause Analysis': 'Анализ Первопричин', 'Optimization': 'Оптимизация', 'Recommendation': 'Рекомендация', 'Time Series Analysis': 'Анализ Временных Рядов', 'Predictive Analytics': 'Предиктивная Аналитика', 'Descriptive Analytics': 'Описательная Аналитика'},
                    }
                    
                    translation_key = analytics_key_map.get(value)
                    translated = t.get(translation_key, None) if translation_key else None
                    if translated and translated != value:
                        return translated
                    if language in analytics_fallbacks and value in analytics_fallbacks[language]:
                        return analytics_fallbacks[language][value]
                    return value
                
                for uc in domain_use_cases:
                    slide = prs.slides.add_slide(SLIDE_LAYOUT_TITLE_AND_CONTENT); slide.background.fill.solid(); slide.background.fill.fore_color.rgb = WHITE_COLOR
                    accent_bar = slide.shapes.add_shape(MSO_SHAPE.RECTANGLE, 0, 0, int(Cm(1.5)), int(prs.slide_height))
                    accent_bar.fill.solid(); accent_bar.fill.fore_color.rgb = DATABRICKS_ORANGE; accent_bar.line.fill.background()
                    
                    title = slide.shapes.title; title.left, title.width = int(LEFT_MARGIN_CM), int(CONTENT_WIDTH_CM)
                    title.top = int(Cm(1.0)); title.height = int(Cm(2.5))
                    title.text = f"{uc['No']}: {uc['Name']}"
                    p = title.text_frame.paragraphs[0]; p.font.color.rgb = DATABRICKS_BLUE; p.font.size = Pt(32); p.alignment = align
                    
                    content_placeholder = slide.placeholders[1]; content_placeholder.left, content_placeholder.width = int(LEFT_MARGIN_CM), int(Cm(30.5))
                    content_placeholder.top = int(Cm(4.1)); content_placeholder.height = int(Cm(14.1))
                    content_frame = content_placeholder.text_frame; content_frame.clear(); content_frame.word_wrap = True
                    
                    # Add header line with Subdomain, Type, Analytics Technique, and Priority (with translations)
                    # Use the first paragraph instead of adding a new one to avoid empty line at top
                    header_p = content_frame.paragraphs[0]; header_p.level = 0; header_p.alignment = align
                    subdomain_val = uc.get('Subdomain', 'N/A')
                    type_val = translate_pptx_value(uc.get('type', 'N/A'))
                    analytics_technique_val = translate_analytics_pptx_value(uc.get('Analytics Technique', 'N/A'))
                    priority_val = translate_pptx_value(uc.get('Priority', 'N/A'))
                    header_text = f"{t['subdomain']}: {subdomain_val} | {t['type']}: {type_val} | {t.get('analytics_technique', 'Analytics Technique')}: {analytics_technique_val} | {t['priority']}: {priority_val}"
                    header_run = header_p.add_run(); header_run.text = header_text; header_run.font.bold = True; header_run.font.size = Pt(22)
                    set_font_color(header_run, DATABRICKS_ORANGE)
                    header_p.space_after = Pt(16)
                    
                    def add_detail_line(frame, label_key, value, align, is_first=False):
                        p = frame.add_paragraph(); p.level = 0; p.alignment = align
                        if not is_first: p.space_before = Pt(12)
                        run_label = p.add_run(); run_label.text = f"{t[label_key]}: "; run_label.font.bold = True; run_label.font.size = Pt(20)
                        set_font_color(run_label, DATABRICKS_BLUE)
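                        # Run text must be a string; coerce non-string values and treat None as 'N/A'
                        run_value = p.add_run(); run_value.text = 'N/A' if value is None else str(value); run_value.font.size = Pt(20)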
                        set_font_color(run_value, TEXT_COLOR)
                    # Type is already shown in header line above, no need to repeat it
                    add_detail_line(content_frame, 'statement', uc.get('Statement', 'N/A'), align, is_first=True)
                    add_detail_line(content_frame, 'solution', uc.get('Solution', 'N/A'), align)
                    add_detail_line(content_frame, 'business_value', uc.get('Business Value', 'N/A'), align)
                    add_detail_line(content_frame, 'beneficiary', uc.get('Beneficiary', 'N/A'), align)
                    add_detail_line(content_frame, 'sponsor', uc.get('Sponsor', 'N/A'), align)
                    # Add Business Priority Alignment
                    priority_alignment_label = t.get('business_priority_alignment', 'Business Priority Alignment')
                    p = content_frame.add_paragraph(); p.level = 0; p.alignment = align; p.space_before = Pt(12)
                    run_label = p.add_run(); run_label.text = f"{priority_alignment_label}: "; run_label.font.bold = True; run_label.font.size = Pt(20)
                    set_font_color(run_label, DATABRICKS_BLUE)
                    run_value = p.add_run(); run_value.text = translate_strategic_pptx_value(uc.get('Business Priority Alignment', 'General Improvement')); run_value.font.size = Pt(20)
                    set_font_color(run_value, TEXT_COLOR)
                    # Add Strategic Goals Alignment
                    strategic_goals_label = t.get('strategic_goals_alignment', 'Strategic Goals Alignment')
                    p = content_frame.add_paragraph(); p.level = 0; p.alignment = align; p.space_before = Pt(8)
                    run_label = p.add_run(); run_label.text = f"{strategic_goals_label}: "; run_label.font.bold = True; run_label.font.size = Pt(20)
                    set_font_color(run_label, DATABRICKS_BLUE)
                    run_value = p.add_run(); run_value.text = translate_strategic_pptx_value(uc.get('Strategic Goals Alignment', 'General Improvement')); run_value.font.size = Pt(20)
                    set_font_color(run_value, TEXT_COLOR)
                    # Analytics Technique is already shown in the header line above, no need to repeat it
                    add_footer(slide)
            
            local_pptx_path = None
            try:
                with tempfile.NamedTemporaryFile(delete=False, suffix=".pptx") as tmp_file: local_pptx_path = tmp_file.name
                prs.save(local_pptx_path)
                logger_instance.info(f"Presentation saved locally to {local_pptx_path}")
                _save_pptx(local_pptx_path, workspace_path, logger_instance)
            except Exception as e: logger_instance.error(f"Failed to save or upload PPTX: {e}")
            finally:
                if local_pptx_path and os.path.exists(local_pptx_path): os.remove(local_pptx_path)

        def _save_pptx(local_pptx_path: str, workspace_path: str, logger_instance):
            try:
                with open(local_pptx_path, "rb") as f: pptx_data = f.read()
                if not pptx_data: raise ValueError("Generated PPTX file is empty.")
                logger_instance.info(f"Uploading PPTX to workspace path: {workspace_path}")
                pptx_data_b64 = base64.b64encode(pptx_data).decode()
                self.w_client.workspace.import_(path=workspace_path, content=pptx_data_b64, format=workspace.ImportFormat.AUTO, overwrite=True)
                abs_path = self.w_client.workspace.get_status(workspace_path).path
                logger_instance.info(f"Success! Presentation uploaded to: {abs_path}")
                log_print(f"Success! Presentation ({language}) generated: {abs_path}")
            except Exception as e: logger_instance.critical(f"Failed to save and upload PPTX: {e}")

        # --- Main execution logic for generate_presentation_pptx ---
        try:
            if not _install_pptx_dependencies(self.logger):
                self.logger.error("Skipping PPTX generation due to missing python-pptx dependency.")
                return

            if not grouped_data:
                self.logger.warning(f"No use cases provided to generate_presentation_pptx for {language}. Skipping.")
                return
            pptx_workspace_path = os.path.join(self.docs_output_dir, f"{self.business_name}-dbx_inspire_{lang_abbr}.pptx")
            _build_presentation(grouped_data, summary_dict, transliterated_name, t, pptx_workspace_path, self.logger, is_rtl)
        except Exception as e:
            self.logger.critical(f"An error occurred during PPTX generation for {language}: {e}")
    
    def _install_excel_dependencies(self, logger_instance) -> bool:
        """Installs xlsxwriter if not present."""
        try:
            import xlsxwriter
            logger_instance.info("Excel package (xlsxwriter) already installed.")
            return True
        except ImportError:
            logger_instance.info("Installing required Excel package: xlsxwriter...")
            try:
                subprocess.check_call([sys.executable, "-m", "pip", "install", "xlsxwriter"])
                import xlsxwriter
                logger_instance.info("Successfully installed xlsxwriter.")
                return True
            except Exception as e:
                logger_instance.error(f"Failed to install xlsxwriter: {e}")
                print("ERROR: Failed to install 'xlsxwriter'. Excel generation cannot continue.", file=sys.stderr)
                return False

    def _save_excel(self, local_excel_path: str, workspace_path: str, logger_instance, language: str):
        """Uploads a locally generated Excel file to the Databricks workspace."""
        try:
            with open(local_excel_path, "rb") as f: excel_data = f.read()
            if not excel_data: raise ValueError("Generated Excel file is empty.")
            logger_instance.info(f"Uploading Excel to workspace path: {workspace_path}")
            excel_data_b64 = base64.b64encode(excel_data).decode()
            self.w_client.workspace.import_(
                path=workspace_path, content=excel_data_b64,
                format=workspace.ImportFormat.AUTO, overwrite=True
            )
            abs_path = self.w_client.workspace.get_status(workspace_path).path
            logger_instance.info(f"Success! Excel Catalog uploaded to: {abs_path}")
            log_print(f"Success! Excel Catalog ({language}) generated: {abs_path}")
        except Exception as e:
            logger_instance.critical(f"Failed to save and upload Excel: {e}")

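    def _generate_use_case_excel(self, language: str, lang_abbr: str, grouped_data: dict):
        """
        Builds the Excel use case catalog with XlsxWriter and uploads it to the workspace.
        Generated for English only; rows are sorted by Priority Score, highest first.
        """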
        warnings.filterwarnings('ignore', module='xlsxwriter')
        # Only generate Excel for English
        if language != "English":
            self.logger.info(f"Skipping Excel generation for {language} (only English Excel is generated).")
            return
        
        self.logger.info(f"--- Starting Excel Catalog Generation with XlsxWriter for {language} ---")
        
        local_excel_path = None
        try:
            if not self._install_excel_dependencies(self.logger):
                self.logger.error("Skipping Excel generation due to missing xlsxwriter dependency.")
                return

            import xlsxwriter
            
            # Prepare data
            data_rows = []
            
            def safe_str(value, default='N/A'):
                """Safely convert value to string, handling None/empty."""
                if value is None or (isinstance(value, str) and not value.strip()):
                    return default
                return str(value)
            
            for domain, use_cases in grouped_data.items():
                for uc in use_cases:
                    data_rows.append([
                        safe_str(uc.get('No'), 'N/A'),                                 # 0 - ID (A)
                        safe_str(uc.get('Business Domain'), 'N/A'),                    # 1 - Business Domain (B)
                        safe_str(uc.get('Subdomain'), 'N/A'),                          # 2 - Subdomain (C)
                        safe_str(uc.get('Name'), 'N/A'),                               # 3 - Use Case (D)
                        safe_str(uc.get('type'), 'N/A'),                               # 4 - Type (E)
                        safe_str(uc.get('Analytics Technique'), 'N/A'),                # 5 - Analytics Technique (F)
                        safe_str(uc.get('Business Priority Alignment'), 'General Improvement'),  # 6 - Business Priority Alignment (G)
                        safe_str(uc.get('Strategic Goals Alignment'), 'General Improvement'),    # 7 - Strategic Goals Alignment (H)
                        safe_str(uc.get('Priority'), 'N/A'),                           # 8 - Priority (I)
                        safe_str(uc.get('Statement'), 'N/A'),                          # 9 - Statement (J)
                        safe_str(uc.get('Solution'), 'N/A'),                           # 10 - Solution (K)
                        safe_str(uc.get('Business Value'), 'N/A'),                     # 11 - Business Value (L)
                        safe_str(uc.get('Beneficiary'), 'N/A'),                        # 12 - Beneficiary (M)
                        safe_str(uc.get('Sponsor'), 'N/A'),                            # 13 - Sponsor (N)
                        safe_str(uc.get('Tables Involved'), 'N/A'),                    # 14 - Tables Involved (O)
                        uc.get('Strategic Alignment', 0),                              # 15 - Strategic Alignment (P)
                        uc.get('Return on Investment', 0),                             # 16 - ROI (Q)
                        uc.get('Reusability', 0),                                      # 17 - Reusability (R)
                        uc.get('Time to Value', 0),                                    # 18 - Time to Value (S)
                        uc.get('Data Availability', 0),                                # 19 - Data Availability (T)
                        uc.get('Data Accessibility', 0),                               # 20 - Data Accessibility (U)
                        uc.get('Architecture Fitness', 0),                             # 21 - Architecture Fitness (V)
                        uc.get('Team Skills', 0),                                      # 22 - Team Skills (W)
                        uc.get('Domain Knowledge', 0),                                 # 23 - Domain Knowledge (X)
                        uc.get('People Allocation', 0),                                # 24 - People Allocation (Y)
                        uc.get('Budget Allocation', 0),                                # 25 - Budget Allocation (Z)
                        uc.get('Time to Production', 0),                               # 26 - Time to Production (AA)
                        uc.get('Value', 0),                                            # 27 - Value Score (AB)
                        uc.get('Feasibility', 0),                                      # 28 - Feasibility Score (AC)
                        uc.get('Priority Score', 0),                                   # 29 - Priority Score (AD)
                        safe_str(uc.get('Justification'), 'N/A')                       # 30 - Justification (AE)
                    ])
            
            # Sort by Priority Score descending.
            # Numeric strings (e.g. "4.5") are parsed; non-numeric values sort as 0.
            priority_score_idx = 29  # Priority Score is column AD (index 29)
            def _priority_sort_key(row):
                try:
                    return float(row[priority_score_idx])
                except (TypeError, ValueError):
                    return 0.0
            data_rows.sort(key=_priority_sort_key, reverse=True)
            
            if not data_rows:
                self.logger.warning(f"No data to write to Excel for {language}. Skipping.")
                return
            
            # Create Excel file
            excel_file_name = f"{self.business_name}-dbx_inspire.xlsx"
            with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as tmp_file:
                local_excel_path = tmp_file.name
            
            self.logger.info(f"Creating Excel file at {local_excel_path}")
            workbook = xlsxwriter.Workbook(local_excel_path, {'strings_to_numbers': False})
            worksheet = workbook.add_worksheet('Use Cases')
            
            # Modern Business Color Palette
            PRIMARY = '#2C3E50'      # Deep Slate
            SECONDARY = '#E74C3C'    # Vibrant Coral  
            ACCENT = '#3498DB'       # Bright Blue
            BACKGROUND = '#ECF0F1'   # Soft Grey
            TEXT = '#2C3E50'         # Dark Grey
            
            # Define cell formats
            header_format = workbook.add_format({
                'bold': True,
                'font_color': 'white',
                'bg_color': PRIMARY,
                'border': 1,
                'align': 'center',
                'valign': 'vcenter',
                'text_wrap': True,
                'font_size': 11
            })
            
            cell_format = workbook.add_format({
                'border': 1,
                'align': 'left',
                'valign': 'top',
                'text_wrap': True,
                'font_size': 10
            })
            
            numeric_format = workbook.add_format({
                'border': 1,
                'align': 'center',
                'valign': 'vcenter',
                'num_format': '0.00',
                'font_size': 10
            })
            
            # Column headers with both alignment columns
            # A: ID, B: Business Domain, C: Subdomain, D: Use Case, E: Type, F: Analytics Technique
            # G: Business Priority Alignment, H: Strategic Goals Alignment, I: Priority
            # J: Statement, K: Solution, L: Business Value, M: Beneficiary, N: Sponsor, O: Tables Involved
            # P: Strategic Alignment, Q: ROI, R: Reusability, S: Time to Value
            # T: Data Availability, U: Data Accessibility, V: Architecture Fitness, W: Team Skills
            # X: Domain Knowledge, Y: People Allocation, Z: Budget Allocation, AA: Time to Production
            # AB: Value Score, AC: Feasibility Score, AD: Priority Score, AE: Justification
            headers = [
                "ID", "Business Domain", "Subdomain", "Use Case", "Type", "Analytics Technique",
                "Business Priority Alignment", "Strategic Goals Alignment", "Priority",
                "Statement", "Solution", "Business Value", "Beneficiary", "Sponsor", "Tables Involved",
                "Strategic Alignment", "ROI", "Reusability", "Time to Value",
                "Data Availability", "Data Accessibility", "Architecture Fitness", "Team Skills", 
                "Domain Knowledge", "People Allocation", "Budget Allocation", "Time to Production",
                "Value Score",
                "Feasibility Score",
                "Priority Score",
                "Justification"
            ]
            
            # Write headers
            for col_num, header in enumerate(headers):
                worksheet.write(0, col_num, header, header_format)
            
            # Write data rows
            numeric_start = 15  # Strategic Alignment score (column P, index 15)
            numeric_end = 29    # Priority Score (column AD, index 29)
            for row_num, row_data in enumerate(data_rows, start=1):
                for col_num, cell_data in enumerate(row_data):
                    # Use numeric format for score columns
                    if numeric_start <= col_num <= numeric_end:
                        try:
                            numeric_value = float(cell_data) if cell_data not in ['N/A', '', None] else 0
                            worksheet.write_number(row_num, col_num, numeric_value, numeric_format)
                        except (ValueError, TypeError):
                            # Non-numeric score: write the raw value as text
                            worksheet.write(row_num, col_num, cell_data, cell_format)
                    else:
                        worksheet.write(row_num, col_num, cell_data, cell_format)
            
            # Auto-fit column widths to content
            MAX_COLUMN_WIDTH = 255  # Excel's maximum column width, in characters
            for col_num in range(len(headers)):
                # Calculate max width for this column
                max_width = len(str(headers[col_num]))
                for row_data in data_rows:
                    if col_num < len(row_data):
                        cell_value = str(row_data[col_num])
                        max_width = max(max_width, len(cell_value))
                # Fit the longest value, capped at Excel's limit (longer text wraps via cell_format)
                column_width = min(max_width + 2, MAX_COLUMN_WIDTH)
                worksheet.set_column(col_num, col_num, column_width)
            
            # Freeze top row
            worksheet.freeze_panes(1, 0)
            
            # Convert data range to native Excel Table
            last_row = len(data_rows)
            last_col = len(headers) - 1
            worksheet.add_table(0, 0, last_row, last_col, {
                'name': 'UseCaseTable',
                'style': 'Table Style Medium 9',
                'columns': [{'header': h} for h in headers]
            })
            
            # Add conditional formatting - Data Bars for the scoring columns (Priority Score gets a color scale below)
            # Column indices based on headers array (with Analytics Technique at F, Tables Involved at O):
            # 15: Strategic Alignment (P), 16: ROI (Q), 17: Reusability (R), 18: Time to Value (S)
            # 19: Data Availability (T), 20: Data Accessibility (U), 21: Architecture Fitness (V)
            # 22: Team Skills (W), 23: Domain Knowledge (X), 24: People Allocation (Y)
            # 25: Budget Allocation (Z), 26: Time to Production (AA), 27: Value Score (AB)
            # 28: Feasibility Score (AC), 29: Priority Score (AD)
            scoring_columns = [
                (15, '#4472C4'),  # Strategic Alignment (P)
                (16, '#ED7D31'),  # ROI (Q)
                (17, '#A5A5A5'),  # Reusability (R)
                (18, '#FFC000'),  # Time to Value (S)
                (19, '#5B9BD5'),  # Data Availability (T)
                (20, '#70AD47'),  # Data Accessibility (U)
                (21, '#264478'),  # Architecture Fitness (V)
                (22, '#9E480E'),  # Team Skills (W)
                (23, '#636363'),  # Domain Knowledge (X)
                (24, '#997300'),  # People Allocation (Y)
                (25, '#255E91'),  # Budget Allocation (Z)
                (26, '#43682B'),  # Time to Production (AA)
                (27, ACCENT),     # Value Score (AB)
                (28, SECONDARY),  # Feasibility Score (AC)
            ]
            
            for col_idx, bar_color in scoring_columns:
                worksheet.conditional_format(1, col_idx, last_row, col_idx, {
                    'type': 'data_bar',
                    'bar_color': bar_color,
                    'bar_only': False
                })
            
            # Add conditional formatting - Color Scale for Priority Score (column 29 = Column AD)
            # Headers: ..., 27: Value Score (AB), 28: Feasibility Score (AC), 29: Priority Score (AD), 30: Justification (AE)
            worksheet.conditional_format(1, 29, last_row, 29, {
                'type': '3_color_scale',
                'min_color': '#F8696B',   # Red for low priority
                'mid_color': '#FFEB84',   # Yellow for medium priority
                'max_color': '#63BE7B'    # Green for high priority
            })
            
            # Graph sheet removed per user request - no longer needed
            
            workbook.close()
            self.logger.info("Excel file created successfully with native tables")
            
            workspace_excel_path = os.path.join(self.docs_output_dir, excel_file_name)
            self._save_excel(local_excel_path, workspace_excel_path, self.logger, language)
        except Exception as e:
            self.logger.critical(f"An error occurred during Excel generation for {language}: {e}")
            raise
        finally:
            if local_excel_path and os.path.exists(local_excel_path):
                os.remove(local_excel_path)

    def _generate_markdown_catalog(self, language: str, lang_abbr: str, grouped_data: dict, summary_dict: dict, transliterated_name: str):
        """
        Generates a Markdown catalog file containing all use case information.
        Generated for English only; serves as a fallback in case PDF generation fails.
        """
        if language != "English":
            self.logger.info(f"Skipping Markdown generation for {language} (only English Markdown is generated).")
            return
        
        self.logger.info(f"--- Starting Markdown Catalog Generation for {language} ---")
        
        try:
            # Build markdown content
            md_content = []
            
            # Header
            md_content.append(f"# {self.business_name} - Databricks Inspire AI Use Cases Catalog\n")
            # Trailing blank line keeps "Generated" and "Business Name" as separate paragraphs
            md_content.append(f"**Generated:** {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
            md_content.append(f"**Business Name:** {transliterated_name or self.business_name}\n")
            
            # Summary section if available
            if summary_dict:
                md_content.append("\n## Executive Summary\n")
                if summary_dict.get('executive_summary'):
                    md_content.append(f"{summary_dict.get('executive_summary')}\n")
                if summary_dict.get('key_findings'):
                    md_content.append(f"\n**Key Findings:** {summary_dict.get('key_findings')}\n")
                if summary_dict.get('recommendations'):
                    md_content.append(f"\n**Recommendations:** {summary_dict.get('recommendations')}\n")
            
            # Stats
            total_use_cases = sum(len(ucs) for ucs in grouped_data.values())
            md_content.append("\n## Overview\n")
            md_content.append(f"- **Total Use Cases:** {total_use_cases}\n")
            md_content.append(f"- **Business Domains:** {len(grouped_data)}\n")
            
            # Use cases by domain
            md_content.append("\n## Use Cases by Business Domain\n")
            
            for domain_name, use_cases in grouped_data.items():
                md_content.append(f"\n### {domain_name}\n")
                md_content.append(f"*{len(use_cases)} use cases*\n")
                
                for uc in use_cases:
                    uc_id = uc.get('No', 'N/A')
                    uc_name = uc.get('Name', 'Unnamed')
                    uc_priority = uc.get('Priority', 'Medium')
                    uc_type = uc.get('type', 'N/A')
                    uc_statement = uc.get('Statement', 'N/A')
                    uc_solution = uc.get('Solution', 'N/A')
                    uc_value = uc.get('Business Value', 'N/A')
                    uc_beneficiary = uc.get('Beneficiary', 'N/A')
                    uc_sponsor = uc.get('Sponsor', 'N/A')
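                    # Fall back to 'Tables Involved' (the key used by the Excel catalog and table validation)
                    uc_tables = uc.get('Primary Table') or uc.get('Tables Involved', 'N/A')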
                    uc_priority_score = uc.get('Priority Score', 0)
                    uc_justification = uc.get('Justification', 'N/A')
                    uc_subdomain = uc.get('Subdomain', 'N/A')
                    uc_analytics = uc.get('Analytics Technique', 'N/A')
                    uc_bp_alignment = uc.get('Business Priority Alignment', 'N/A')
                    uc_sg_alignment = uc.get('Strategic Goals Alignment', 'N/A')
                    
                    md_content.append(f"\n#### {uc_id}: {uc_name}\n")
                    md_content.append(f"- **Subdomain:** {uc_subdomain}\n")
                    md_content.append(f"- **Type:** {uc_type}\n")
                    md_content.append(f"- **Priority:** {uc_priority} (Score: {uc_priority_score})\n")
                    md_content.append(f"- **Analytics Technique:** {uc_analytics}\n")
                    md_content.append(f"- **Business Priority Alignment:** {uc_bp_alignment}\n")
                    md_content.append(f"- **Strategic Goals Alignment:** {uc_sg_alignment}\n")
                    md_content.append(f"\n**Problem Statement:**\n{uc_statement}\n")
                    md_content.append(f"\n**Solution:**\n{uc_solution}\n")
                    md_content.append(f"\n**Business Value:**\n{uc_value}\n")
                    # Leading newlines keep each field in its own Markdown paragraph
                    md_content.append(f"\n**Beneficiary:** {uc_beneficiary}\n")
                    md_content.append(f"\n**Sponsor:** {uc_sponsor}\n")
                    md_content.append(f"\n**Primary Table:** {uc_tables}\n")
                    if uc_justification and uc_justification != 'N/A':
                        md_content.append(f"\n**Justification:**\n{uc_justification}\n")
                    md_content.append("\n---\n")
            
            # Save to workspace
            md_file_name = f"{self.business_name}-dbx_inspire.md"
            workspace_md_path = os.path.join(self.docs_output_dir, md_file_name)
            md_text = ''.join(md_content)
            
            # Upload to workspace
            md_data_b64 = base64.b64encode(md_text.encode('utf-8')).decode()
            self.w_client.workspace.import_(
                path=workspace_md_path, content=md_data_b64,
                format=workspace.ImportFormat.AUTO, overwrite=True
            )
            abs_path = self.w_client.workspace.get_status(workspace_md_path).path
            self.logger.info(f"Success! Markdown Catalog uploaded to: {abs_path}")
            log_print(f"Success! Markdown Catalog ({language}) generated: {abs_path}")
        except Exception as e:
            self.logger.error(f"Failed to generate Markdown catalog for {language}: {e}")
            import traceback
            self.logger.error(f"Full traceback: {traceback.format_exc()}")

    def _generate_csv_catalog(self, language: str, lang_abbr: str, grouped_data: dict):
        """
        Generates a CSV catalog file containing all use case information.
        Generated for English only; serves as a fallback in case Excel generation fails.
        """
        if language != "English":
            self.logger.info(f"Skipping CSV generation for {language} (only English CSV is generated).")
            return
        
        self.logger.info(f"--- Starting CSV Catalog Generation for {language} ---")
        
        try:
            # Prepare data rows
            data_rows = []
            headers = [
                "ID", "Business Domain", "Subdomain", "Use Case", "Type", "Analytics Technique",
                "Business Priority Alignment", "Strategic Goals Alignment", "Priority",
                "Statement", "Solution", "Business Value", "Beneficiary", "Sponsor", "Primary Table",
                "Strategic Alignment", "ROI", "Reusability", "Time to Value",
                "Data Availability", "Data Accessibility", "Architecture Fitness", "Team Skills", 
                "Domain Knowledge", "People Allocation", "Budget Allocation", "Time to Production",
                "Value Score", "Feasibility Score", "Priority Score", "Justification"
            ]
            
            for domain, use_cases in grouped_data.items():
                for uc in use_cases:
                    data_rows.append([
                        uc.get('No', 'N/A'),
                        uc.get('Business Domain', 'N/A'),
                        uc.get('Subdomain', 'N/A'),
                        uc.get('Name', 'N/A'),
                        uc.get('type', 'N/A'),
                        uc.get('Analytics Technique', 'N/A'),
                        uc.get('Business Priority Alignment', 'General Improvement'),
                        uc.get('Strategic Goals Alignment', 'General Improvement'),
                        uc.get('Priority', 'N/A'),
                        uc.get('Statement', 'N/A'),
                        uc.get('Solution', 'N/A'),
                        uc.get('Business Value', 'N/A'),
                        uc.get('Beneficiary', 'N/A'),
                        uc.get('Sponsor', 'N/A'),
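                        # Fall back to 'Tables Involved' (the key used by the Excel catalog and table validation)
                        uc.get('Primary Table') or uc.get('Tables Involved', 'N/A'),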
                        uc.get('Strategic Alignment', 0),
                        uc.get('Return on Investment', 0),
                        uc.get('Reusability', 0),
                        uc.get('Time to Value', 0),
                        uc.get('Data Availability', 0),
                        uc.get('Data Accessibility', 0),
                        uc.get('Architecture Fitness', 0),
                        uc.get('Team Skills', 0),
                        uc.get('Domain Knowledge', 0),
                        uc.get('People Allocation', 0),
                        uc.get('Budget Allocation', 0),
                        uc.get('Time to Production', 0),
                        uc.get('Value', 0),
                        uc.get('Feasibility', 0),
                        uc.get('Priority Score', 0),
                        uc.get('Justification', 'N/A')
                    ])
            
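            # Sort by Priority Score (index 29) descending.
            # Numeric strings (e.g. "4.5") are parsed; non-numeric values sort as 0.
            def _priority_sort_key(row):
                try:
                    return float(row[29])
                except (TypeError, ValueError):
                    return 0.0
            data_rows.sort(key=_priority_sort_key, reverse=True)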
            
            if not data_rows:
                self.logger.warning(f"No data to write to CSV for {language}. Skipping.")
                return
            
            # Build CSV content
            output = io.StringIO()
            writer = csv.writer(output, quoting=csv.QUOTE_ALL)
            writer.writerow(headers)
            for row in data_rows:
                writer.writerow(row)
            csv_content = output.getvalue()
            
            # Save to workspace
            csv_file_name = f"{self.business_name}-dbx_inspire.csv"
            workspace_csv_path = os.path.join(self.docs_output_dir, csv_file_name)
            
            # Upload to workspace
            csv_data_b64 = base64.b64encode(csv_content.encode('utf-8')).decode()
            self.w_client.workspace.import_(
                path=workspace_csv_path, content=csv_data_b64,
                format=workspace.ImportFormat.AUTO, overwrite=True
            )
            abs_path = self.w_client.workspace.get_status(workspace_csv_path).path
            self.logger.info(f"Success! CSV Catalog uploaded to: {abs_path}")
            log_print(f"Success! CSV Catalog ({language}) generated: {abs_path}")
        except Exception as e:
            self.logger.error(f"Failed to generate CSV catalog for {language}: {e}")
            import traceback
            self.logger.error(f"Full traceback: {traceback.format_exc()}")

    def _validate_use_case_tables(self, parsed_rows: list, full_schema_details: list, log_prefix: str) -> tuple:
        """
        Validate that all tables referenced in 'Tables Involved' field actually exist in the schema.
        This catches LLM hallucinations where it invents table names.
        
        Returns:
            tuple: (is_valid: bool, hallucinated_use_cases: list, valid_use_cases: list)
        """
        import re
        
        # Build set of available tables from schema
        available_tables = set()
        for detail in full_schema_details:
            (catalog, schema, table, _, _, _) = detail
            available_tables.add(f"{catalog}.{schema}.{table}")
            available_tables.add(f"`{catalog}`.`{schema}`.`{table}`")
            available_tables.add(f"{catalog}.{schema}.{table}".lower())
        global_tables = getattr(self, "global_table_names", set())
        for tbl in global_tables:
            available_tables.add(tbl)
            available_tables.add(tbl.lower())
        
        hallucinated_use_cases = []
        valid_use_cases = []
        
        for row in parsed_rows:
            # Guard against a None or non-string value before stripping
            tables_involved_str = str(row.get('Tables Involved') or '').strip()
            use_case_id = row.get('No', 'Unknown')
            use_case_name = row.get('Name', 'Unknown')
            
            # Skip volume paths (for ai_parse_document use cases)
            if tables_involved_str.startswith('/Volumes'):
                valid_use_cases.append(row)
                continue
            
            # An empty 'Tables Involved' field is treated as invalid and flagged
            if not tables_involved_str:
                hallucinated_use_cases.append(row)
                row['hallucination_reason'] = "No tables specified"
                continue
            
            # Extract table names from comma-separated list
            # Use simple approach: strip backticks, split by comma, parse each table
            table_matches = []
            for table_str in tables_involved_str.split(','):
                table_str = table_str.strip()
                if not table_str:
                    continue
                cat, sch, tbl = parse_three_level_name(table_str)
                if cat and sch and tbl:
                    table_matches.append((cat, sch, tbl))
            
            if not table_matches:
                hallucinated_use_cases.append(row)
                row['hallucination_reason'] = f"Invalid table format: {tables_involved_str}"
                self.logger.warning(f"{log_prefix} Use case {use_case_id}: Invalid table format: {tables_involved_str}")
                continue
            
            # Check if all tables exist in schema
            all_tables_found = True
            missing_tables = []
            for match in table_matches:
                catalog, schema, table = match
                # Try multiple formats (normalized names for comparison)
                table_formats = [
                    f"{catalog}.{schema}.{table}",
                    build_fqn(catalog, schema, table),
                    f"{catalog}.{schema}.{table}".lower()
                ]
                
                found = any(fmt in available_tables for fmt in table_formats)
                if not found:
                    all_tables_found = False
                    missing_tables.append(f"{catalog}.{schema}.{table}")
            
            if not all_tables_found:
                hallucinated_use_cases.append(row)
                row['hallucination_reason'] = f"Tables not found in schema: {', '.join(missing_tables)}"
                self.logger.warning(f"{log_prefix} Use case {use_case_id}: Hallucinated tables: {', '.join(missing_tables)}")
            else:
                reference_tables = {k.lower() for k, v in getattr(self, "data_category_map", {}).items() if v == "REFERENCE"}
                if reference_tables:
                    non_reference_found = False
                    for match in table_matches:
                        catalog, schema, table = match
                        fqtn = f"{catalog}.{schema}.{table}".lower()
                        if fqtn not in reference_tables:
                            non_reference_found = True
                            break
                    if not non_reference_found:
                        hallucinated_use_cases.append(row)
                        row['hallucination_reason'] = "Reference-only use case"
                        self.logger.warning(f"{log_prefix} Use case {use_case_id}: Reference-only tables in use case")
                        continue
                valid_use_cases.append(row)
        
        is_valid = len(hallucinated_use_cases) == 0
        
        if not is_valid:
            self.logger.warning(f"{log_prefix} ⚠️ Found {len(hallucinated_use_cases)} use cases with hallucinated/missing tables out of {len(parsed_rows)} total")
            self.logger.warning(f"{log_prefix}    Valid use cases: {len(valid_use_cases)}")
            self.logger.warning(f"{log_prefix}    Hallucinated use cases: {len(hallucinated_use_cases)}")
        else:
            self.logger.info(f"{log_prefix} ✓ Table validation passed: All {len(valid_use_cases)} use cases reference existing tables")
        
        return (is_valid, hallucinated_use_cases, valid_use_cases)
    
    def _validate_subdomain_rules(self, parsed_rows: list, log_prefix: str) -> tuple:
        """
        Validate subdomain rules (silent - no individual violation logging):
        1. Each domain must have at least 2 subdomains
        2. Each subdomain must have at least 3 use cases
        
        Returns:
            tuple: (is_valid: bool, violations: list, corrected_rows: list)
        """
        from collections import defaultdict
        
        # Group by domain and subdomain
        domain_subdomains = defaultdict(set)
        subdomain_usecases = defaultdict(list)
        
        for row in parsed_rows:
            domain = row.get('Business Domain', '').strip()
            subdomain = row.get('Subdomain', '').strip()
            
            if domain and subdomain:
                domain_subdomains[domain].add(subdomain)
                subdomain_usecases[f"{domain}::{subdomain}"].append(row)
        
        violations = []
        
        # Check Rule 1: Each domain must have at least 2 subdomains
        for domain, subdomains in domain_subdomains.items():
            if len(subdomains) < 2:
                violations.append(f"Domain '{domain}' has only {len(subdomains)} subdomain(s). Minimum required: 2")
        
        # Check Rule 2: Each subdomain must have at least 3 use cases
        for key, use_cases in subdomain_usecases.items():
            domain, subdomain = key.split('::', 1)
            if len(use_cases) < 3:
                violations.append(f"Subdomain '{subdomain}' in domain '{domain}' has only {len(use_cases)} use case(s). Minimum required: 3")
        
        # Don't log individual violations - they'll be fixed in consolidation
        if violations:
            self.logger.debug(f"{log_prefix} Found {len(violations)} subdomain violations (will be fixed in domain consolidation)")
            return (False, violations, parsed_rows)
        
        self.logger.info(f"{log_prefix} ✓ Subdomain validation passed: All domains have ≥2 subdomains, all subdomains have ≥3 use cases")
        return (True, [], parsed_rows)
    
    def _parse_llm_csv_response(self, llm_response: str, log_prefix: str) -> list:
        self.logger.info(f"{log_prefix} Starting robust 11-column CSV parsing (SQL and scoring metrics will be assigned separately)...")
        parsed_rows = []
        
        try:
            # Clean response - remove markdown fences if present
            csv_clean = llm_response.strip()
            if csv_clean.startswith('```'):
                csv_clean = re.sub(r'^```[a-z]*\n', '', csv_clean)
                csv_clean = re.sub(r'\n```$', '', csv_clean)
            
            # Find header line (11 columns - Business Domain, Subdomain, SQL, and scoring columns will be calculated in code)
            # Support both quoted and unquoted headers from LLM, with case-insensitive matching
            # Analytics Technique is now generated by LLM as column 4
            # Column 5 MUST be "Statement" (not "Opportunity" or any other name)
            header_pattern_quoted = r'"No","Name","[Tt]ype","Analytics Technique","Statement","Solution","Business Value","Beneficiary","Sponsor","Tables Involved","Technical Design"'
            header_pattern_unquoted = r'No,Name,[Tt]ype,Analytics Technique,Statement,Solution,Business Value,Beneficiary,Sponsor,Tables Involved,Technical Design'
            header_match = re.search(header_pattern_quoted, csv_clean, re.IGNORECASE)
            if not header_match:
                header_match = re.search(header_pattern_unquoted, csv_clean, re.IGNORECASE)
            
            # Fallback: If LLM incorrectly used "Opportunity" instead of "Statement", fix it
            if not header_match:
                # Check if LLM used wrong column name
                if '"Opportunity"' in csv_clean or ',Opportunity,' in csv_clean:
                    self.logger.warning(f"{log_prefix} LLM incorrectly used 'Opportunity' instead of 'Statement' - auto-correcting...")
                    csv_clean = csv_clean.replace('"Opportunity"', '"Statement"')
                    csv_clean = csv_clean.replace(',Opportunity,', ',Statement,')
                    # Try matching again after correction
                    header_match = re.search(header_pattern_quoted, csv_clean, re.IGNORECASE)
                    if not header_match:
                        header_match = re.search(header_pattern_unquoted, csv_clean, re.IGNORECASE)
            
            # Final fallback: Try simpler pattern if exact match still fails
            if not header_match:
                simpler_pattern = r'(?:"No"|No)\s*,\s*(?:"Name"|Name)\s*,\s*(?:"[Tt]ype"|[Tt]ype)'
                header_match = re.search(simpler_pattern, csv_clean, re.IGNORECASE)
                if header_match:
                    self.logger.warning(f"{log_prefix} Using fallback CSV header detection (found simplified pattern)")
            
            if not header_match:
                # Log first 500 chars of response to help debug
                preview = csv_clean[:500] if csv_clean else "(empty)"
                self.logger.error(f"{log_prefix} Could not find CSV header in LLM response. Response preview: {preview}")
                return []
            
            # Extract CSV starting from header
            csv_data = csv_clean[header_match.start():]
            
            # Use centralized CSV parser for robust parsing
            csv_rows = CSVParser.parse_csv_string(
                csv_data,
                logger=self.logger,
                context=log_prefix,
                quoting=csv.QUOTE_ALL
            )
            
            for row_dict in csv_rows:
                try:
                    # Defensive helper to safely get and strip values
                    def safe_get(d, key):
                        value = d.get(key)
                        if value is None:
                            return ''
                        if isinstance(value, str):
                            return value.strip()
                        return str(value).strip()
                    
                    # Extract and validate use case number (with null safety)
                    use_case_no = safe_get(row_dict, 'No')
                    valid_no = bool(use_case_no)  # Accept any non-empty ID
                    if not valid_no:
                        self.logger.warning(f"{log_prefix} Skipping row with invalid No field: {use_case_no}")
                        continue
                    
                    # Extract Analytics Technique from LLM response (with fallback)
                    analytics_technique = safe_get(row_dict, 'Analytics Technique')
                    if not analytics_technique or analytics_technique == 'N/A':
                        analytics_technique = 'AI Analysis'  # Default fallback
                    
                    # Helper to safely parse float scores
                    def safe_float(d, key, default=3.0):
                        try:
                            value = safe_get(d, key)
                            if not value:
                                return default
                            return float(value)
                        except (ValueError, TypeError):
                            self.logger.warning(f"{log_prefix} Invalid float value for {key}: {value}, using default {default}")
                            return default
                    
                    # Scoring will be added by LLM scoring step after deduplication
                    # Initialize with placeholder values that will be replaced
                    strategic_alignment = 0.0
                    return_on_investment = 0.0
                    reusability = 0.0
                    time_to_value = 0.0
                    data_availability = 0.0
                    data_accessibility = 0.0
                    architecture_fitness = 0.0
                    team_skills = 0.0
                    domain_knowledge = 0.0
                    people_allocation = 0.0
                    budget_allocation = 0.0
                    time_to_production = 0.0
                    value_score = 0.0
                    feasibility_score = 0.0
                    priority_score = 0.0
                    priority_label = "Pending"
                    
                    # Build row dictionary with all fields (SQL, Business Domain, and Subdomain will be added later)
                    # Using safe_get to handle None values and type conversions
                    # Column name MUST be "Statement" (auto-corrected above if LLM used wrong name)
                    statement_value = safe_get(row_dict, 'Statement')
                    
                    row = {
                        "No": use_case_no,
                        "Name": safe_get(row_dict, 'Name'),
                        "Business Domain": "",  # Will be set during domain clustering
                        "Subdomain": "",  # Will be set during subdomain clustering
                        "type": safe_get(row_dict, 'type') or safe_get(row_dict, 'Type'),  # Header match is case-insensitive, so accept either casing
                        "Analytics Technique": analytics_technique,  # From LLM response
                        "Statement": statement_value,
                        "Solution": safe_get(row_dict, 'Solution'),
                        "Business Value": safe_get(row_dict, 'Business Value'),
                        "Beneficiary": safe_get(row_dict, 'Beneficiary'),
                        "Sponsor": safe_get(row_dict, 'Sponsor'),
                        "Tables Involved": safe_get(row_dict, 'Tables Involved'),
                        "Technical Design": safe_get(row_dict, 'Technical Design'),
                        "SQL": "",  # Will be generated in parallel later
                        # Scoring columns (only for Excel)
                        "Strategic Alignment": strategic_alignment,
                        "Return on Investment": return_on_investment,
                        "Reusability": reusability,
                        "Time to Value": time_to_value,
                        "Data Availability": data_availability,
                        "Data Accessibility": data_accessibility,
                        "Architecture Fitness": architecture_fitness,
                        "Team Skills": team_skills,
                        "Domain Knowledge": domain_knowledge,
                        "People Allocation": people_allocation,
                        "Budget Allocation": budget_allocation,
                        "Time to Production": time_to_production,
                        # Calculated fields
                        "Value": round(value_score, 2),
                        "Feasibility": round(feasibility_score, 2),
                        "Priority Score": round(priority_score, 2),
                        "Priority": priority_label
                    }
                    
                    # Validate row has minimum required fields
                    if not row['Name'] or not statement_value:
                        self.logger.warning(f"{log_prefix} Skipping row #{use_case_no}: Missing required fields (Name or Statement)")
                        continue
                    
                    self.logger.debug(f"{log_prefix} Parsed Scenario #{row['No']}: {row['Name']} [Analytics Technique: {analytics_technique}]")
                    
                    parsed_rows.append(row)
                    
                except Exception as e:
                    # Log error with sanitized row data (limit length to avoid huge logs)
                    try:
                        row_summary = {k: str(v)[:100] for k, v in row_dict.items()} if isinstance(row_dict, dict) else str(row_dict)[:200]
                        self.logger.error(f"{log_prefix} Error processing CSV row: {e}. Row summary: {row_summary}")
                    except Exception:
                        self.logger.error(f"{log_prefix} Error processing CSV row: {e}. Could not serialize row data.")
                    continue
                    
        except Exception as e:
            self.logger.error(f"{log_prefix} Failed to parse LLM CSV response: {e}")
            # Show snippet for debugging
            snippet = llm_response[:500] if llm_response else "Empty response"
            self.logger.error(f"{log_prefix} Response snippet: {snippet}")
            return []
        
        # NOTE: Post-processing for naming conventions removed since AI Function field no longer exists
        # The LLM will now innovate and choose functions during SQL generation
        
        self.logger.info(f"{log_prefix} Robust parsing complete. Found {len(parsed_rows)} rows.")
        return parsed_rows


    def _retry_missing_table_coverage(self, use_cases: list, all_columns: list, unstructured_docs_markdown: str, strategic_goals: list = None, include_business_catchall: bool = False) -> list:
        """
        Retry use case generation for tables that have no use cases.
        Each table can be retried up to 2 times maximum.
        
        Args:
            use_cases: List of existing use case dictionaries
            all_columns: List of all column details (catalog, schema, table, column, type, comment)
            unstructured_docs_markdown: Markdown for unstructured documents
            strategic_goals: List of strategic goals
            include_business_catchall: If True, also include BUSINESS tables that were never involved in any use cases (catch-all mode)
            
        Returns:
            List of newly generated use cases for missing tables
        """
        from collections import defaultdict
        
        # Extract all tables from column details
        all_tables = set()
        table_columns = defaultdict(list)
        for col_tuple in all_columns:
            catalog, schema, table, column, col_type, comment = col_tuple
            fq_table = f"{catalog}.{schema}.{table}"
            all_tables.add(fq_table)
            table_columns[fq_table].append(col_tuple)
        
        # Extract tables referenced by existing use cases (an empty field or a /Volumes path contributes no tables)
        tables_with_use_cases = set()
        for uc in use_cases:
            tables_str = uc.get('Tables Involved', '')
            if tables_str and not tables_str.startswith('/Volumes'):
                for table in tables_str.split(','):
                    # Remove all backticks so `cat`.`sch`.`tbl` compares equal to cat.sch.tbl
                    table = table.strip().replace('`', '')
                    if table:
                        tables_with_use_cases.add(table)
        
        # Find tables without use cases
        missing_tables = all_tables - tables_with_use_cases
        
        # === CATCH-ALL: Include BUSINESS tables that were never involved in any use cases ===
        if include_business_catchall and hasattr(self, 'business_scores'):
            self.logger.info("🔍 CATCH-ALL MODE: Checking for BUSINESS tables that were never involved in use cases...")
            
            # Get all BUSINESS tables that were classified
            all_business_tables = {fqtn for fqtn, score in self.business_scores.items() if score > 0}
            
            # Find BUSINESS tables that were never involved in ANY use case (even those with empty tables)
            unused_business_tables = all_business_tables - tables_with_use_cases
            
            # Filter to only include tables that are in all_columns (have column details available)
            unused_business_tables = unused_business_tables.intersection(all_tables)
            
            if unused_business_tables:
                self.logger.warning(f"⚠️ Found {len(unused_business_tables)} BUSINESS tables that were never involved in any use cases")
                
                # Add them to missing_tables for retry
                missing_tables = missing_tables.union(unused_business_tables)
                
                # Show sample
                unused_sample = sorted(list(unused_business_tables))[:10]
                self.logger.info(f"📋 Sample unused BUSINESS tables: {', '.join(unused_sample)}{'...' if len(unused_business_tables) > 10 else ''}")
            else:
                self.logger.info("✅ All BUSINESS tables have been involved in use cases")
        
        if not missing_tables:
            self.logger.info("✅ All tables have at least one use case - no retry needed")
            return []
        
        coverage_percentage = ((len(all_tables) - len(missing_tables)) / len(all_tables)) * 100 if all_tables else 0
        self.logger.warning(f"⚠️ Found {len(missing_tables)} tables without use cases (out of {len(all_tables)} total tables - {coverage_percentage:.1f}% coverage)")
        
        # Show sample of missing tables
        missing_sample = sorted(list(missing_tables))[:10]
        self.logger.info(f"📋 Sample missing tables: {', '.join(missing_sample)}{'...' if len(missing_tables) > 10 else ''}")
        
        # Provide actionable insights
        if len(missing_tables) > len(all_tables) * 0.5:
            self.logger.warning("⚠️ More than 50% of tables lack use cases. Consider:")
            self.logger.warning("   - Checking if LLM is generating use cases for all tables")
            self.logger.warning("   - Verifying table names match between schema and use case generation")
            self.logger.warning("   - Reviewing business vs technical table filtering")
        
        # Track retry attempts per table (max 2 attempts)
        if not hasattr(self, '_table_retry_counts'):
            self._table_retry_counts = defaultdict(int)
        
        # Filter tables that haven't exceeded retry limit
        tables_to_retry = []
        for table in missing_tables:
            if self._table_retry_counts[table] < 2:
                tables_to_retry.append(table)
                self._table_retry_counts[table] += 1
            else:
                self.logger.warning(f"⚠️ Table {table} has been retried 2 times already - skipping")
        
        if not tables_to_retry:
            self.logger.info("No tables eligible for retry (all have reached 2 attempts)")
            return []
        
        self.logger.info(f"🔄 Retrying use case generation for {len(tables_to_retry)} tables...")
        
        # Group tables into batches (max 50 tables per batch to avoid context overflow)
        max_tables_per_batch = 50
        retry_batches = []
        for i in range(0, len(tables_to_retry), max_tables_per_batch):
            batch_tables = tables_to_retry[i:i+max_tables_per_batch]
            batch_columns = []
            for table in batch_tables:
                batch_columns.extend(table_columns[table])
            retry_batches.append((batch_tables, batch_columns))
        
        self.logger.info(f"📦 Created {len(retry_batches)} retry batch(es) for {len(tables_to_retry)} tables")
        
        # Process retry batches IN PARALLEL using centralized ParallelExecutor
        all_retry_use_cases = []
        
        # ADAPTIVE PARALLELISM: Calculate based on retry batches and columns
        total_retry_columns = sum(len(cols) for _, cols in retry_batches)
        
        retry_parallelism, reason = calculate_adaptive_parallelism(
            "use_case_generation", self.max_parallelism,
            num_items=len(retry_batches),
            total_columns=total_retry_columns,
            avg_prompt_chars=total_retry_columns * 100,
            is_llm_operation=True, logger=self.logger
        )
        log_adaptive_parallelism_decision("use_case_generation", retry_parallelism, self.max_parallelism, reason)
        
        self.logger.info(f"🔄 Processing {len(retry_batches)} retry batch(es) in parallel...")
        
        # Prepare tasks for parallel execution
        tasks = []
        for batch_idx, (batch_tables, batch_columns) in enumerate(retry_batches, 1):
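            # Tasks are (callable, positional_args) tuples, so the five context parameters between
            # strategic_goals and max_attempts are passed as their defaults. Without them the
            # trailing 2 would bind to business_context instead of max_attempts.
            task = (
                self._process_batch_with_retry,
                (batch_columns, f"RETRY_{batch_idx}", unstructured_docs_markdown, strategic_goals, "", "", "", "", "", 2)
            )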
            tasks.append(task)
            self.logger.info(f"✓ Prepared retry batch {batch_idx}/{len(retry_batches)} ({len(batch_tables)} tables)")
        
        # Execute in parallel with centralized utility
        results = ParallelExecutor.execute_parallel(
            tasks=tasks,
            max_workers=retry_parallelism,
            task_name="Retry Batch",
            logger=self.logger,
            thread_name_prefix="RetryBatch",
            return_exceptions=True
        )
        
        # Collect successful results
        for batch_idx, result in enumerate(results, 1):
            if isinstance(result, Exception):
                self.logger.error(f"❌ Retry batch {batch_idx} failed: {result}")
                continue
            if result:
                self.logger.info(f"✅ Retry batch {batch_idx}: Generated {len(result)} use cases")
                all_retry_use_cases.extend(result)
            else:
                self.logger.warning(f"⚠️ Retry batch {batch_idx}: No use cases generated")
        
        if all_retry_use_cases:
            self.logger.info(f"✅ Retry complete: Generated {len(all_retry_use_cases)} additional use cases")
        else:
            self.logger.warning("⚠️ Retry complete: No additional use cases generated")
        
        return all_retry_use_cases

    def _collect_pending_results(self, current_results: list) -> list:
        """
        Collect results from current batch plus any pending sub-batch results.
        
        Args:
            current_results: Results from current batch
            
        Returns:
            Combined list of current + pending results
        """
        if hasattr(self, '_pending_sub_batch_results') and self._pending_sub_batch_results:
            all_results = current_results + self._pending_sub_batch_results
            # Clear pending results after collecting
            self._pending_sub_batch_results = []
            return all_results
        return current_results
    
    def _process_batch_with_retry(self, column_details: list, batch_num, unstructured_docs_markdown: str, strategic_goals: list = None, business_context: str = "", business_priorities: str = "", strategic_initiative: str = "", value_chain: str = "", revenue_model: str = "", max_attempts: int = 3, previous_use_cases_feedback: str = "") -> list:
        """
        Process a batch of column details to generate use cases with retry logic.
        Automatically splits context if input is too long for the model.
        
        STRATEGY: When tables don't fit in context:
        1. Split tables across multiple sub-batches (NEVER drop business tables)
        2. Process ALL sub-batches recursively
        3. Track which columns are kept from each table (saved to disk, not memory)
        4. Column tracking is loaded from disk during SQL generation
        
        Args:
            column_details: List of column tuples (catalog, schema, table, column, type, comment)
            batch_num: Batch number for logging and prefixing (can be int or str)
            unstructured_docs_markdown: Unstructured documents markdown
            strategic_goals: List of strategic goals for the business (used for Strategic Alignment scoring)
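            business_context, business_priorities, strategic_initiative, value_chain, revenue_model:
                Optional business context values substituted into the generation prompt
            max_attempts: Maximum number of attempts (default 3)
            previous_use_cases_feedback: Optional feedback text substituted into the generation prompt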
            
        Returns:
            List of use case dictionaries (includes results from sub-batches)
        """
        log_prefix = f"[Batch {batch_num}]"
        
        # === NEW: Register columns and tables for Bitmap ID generation ===
        with self.registry_lock:
            for col_tuple in column_details:
                # col_tuple: (catalog, schema, table, column, type, comment)
                fqn = f"{col_tuple[0]}.{col_tuple[1]}.{col_tuple[2]}.{col_tuple[3]}"
                table_fqn = f"{col_tuple[0]}.{col_tuple[1]}.{col_tuple[2]}"
                
                # Register table if not already registered
                if table_fqn not in self.table_id_map:
                    table_id = str(self.next_table_id)
                    self.next_table_id += 1
                    self.table_id_map[table_fqn] = table_id
                    self.id_table_map[table_id] = table_fqn
                
                if fqn not in self.column_id_map:
                    col_id = str(self.next_column_id)
                    self.next_column_id += 1
                    self.column_id_map[fqn] = col_id
                    
                    # Create description (Type + Comment)
                    desc = f"{col_tuple[4]}"
                    if col_tuple[5]:
                        desc += f" - {col_tuple[5]}"
                    
                    self.id_column_map[col_id] = {
                        "fqn": fqn,
                        "description": desc
                    }
        
        current_column_details = column_details

        prompt_template = self.ai_agent.prompt_templates.get("BASE_USE_CASE_GEN_PROMPT", "")
        safe_limit = get_safe_context_limit(language="English", buffer_percent=0.9, prompt_name="BASE_USE_CASE_GEN_PROMPT")
        if strategic_goals and len(strategic_goals) > 0:
            strategic_goals_text = "\n".join([f"- {goal}" for goal in strategic_goals[:10]])
        else:
            strategic_goals_text = "- Maximize operational efficiency\n- Improve customer satisfaction\n- Reduce operational costs\n- Drive revenue growth\n- Ensure compliance and risk management"
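        # Accept either a list or a comma-separated string; slicing a str directly would yield single characters
        if isinstance(business_priorities, str):
            business_priorities = [p.strip() for p in business_priorities.split(",") if p.strip()]
        if business_priorities:
            business_priorities_text = "\n".join([f"- {priority}" for priority in business_priorities[:10]])
        else:
            business_priorities_text = "- None"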
        if self.user_strategic_goals:
            goals_text = "\n".join([f"- {goal}" for goal in self.user_strategic_goals])
            additional_context_section = f"""**STRATEGIC GOALS (HIGHEST PRIORITY)**:

The user provided Strategic Goals that MUST be followed during generation.

**STRATEGIC GOALS:**
{goals_text}

**REQUIREMENTS**:
- Generate ONLY use cases that align with these Strategic Goals.
- Generate EVERY possible use case that aligns with these Strategic Goals. Do not omit any valid use case.
- Do NOT cap the number of use cases; completeness is mandatory.
- Use semantic understanding of the goals; do NOT apply rigid keyword rules.
- Do NOT generate use cases outside these goals."""
        else:
            additional_context_section = "*(No Strategic Goals provided by user - proceed with standard business analysis)*"
        if self.user_business_domains:
            domains_list = ", ".join([f'"{domain}"' for domain in self.user_business_domains])
            focus_areas_instruction = f"""  - **🚨 CRITICAL - USER-SPECIFIED BUSINESS DOMAINS 🚨**: You MUST assign use cases ONLY to the following business domains: {domains_list}. 
   * These are the ONLY valid Business Domain values - DO NOT invent new domains.
   * ALL use cases MUST be categorized into one of these exact domains.
   * DO NOT create any domain that is not in this list.
   * DO NOT modify, abbreviate, or expand these domain names - use them EXACTLY as provided.
   * The Business Domain field MUST exactly match one of these domains."""
        else:
            focus_areas_instruction = ""
        ai_functions_summary = generate_ai_functions_doc("summary")
        ai_functions_detailed = generate_ai_functions_doc("detailed")
        statistical_functions_detailed = generate_statistical_functions_doc("detailed")
        base_prompt_size = len(prompt_template) + len(unstructured_docs_markdown) + len(business_context) + len(business_priorities_text) + len(strategic_initiative) + len(value_chain) + len(revenue_model) + len(strategic_goals_text) + len(additional_context_section) + len(focus_areas_instruction) + len(ai_functions_summary) + len(ai_functions_detailed) + len(statistical_functions_detailed) + len(previous_use_cases_feedback) + 1000
        
        # Count tables by fully qualified name so same-named tables in different schemas are not merged
        tables_in_call = sorted({f"{c[0]}.{c[1]}.{c[2]}" for c in column_details})
        self.logger.info(f"{log_prefix} Starting batch processing with {len(column_details)} columns from {len(tables_in_call)} tables")
        tables_in_call_str = ", ".join(tables_in_call)
        self.logger.info(f"{log_prefix} Tables in call ({len(tables_in_call)}): {tables_in_call_str}")
        log_print(f"{log_prefix} Tables in call ({len(tables_in_call)}): {tables_in_call_str}")
        
        for attempt in range(1, max_attempts + 1):
            try:
                if attempt > 1:
                    self.logger.info(f"{log_prefix} Retry attempt {attempt}/{max_attempts}...")

                estimated_schema_size = self._estimate_schema_markdown_size(current_column_details)
                estimated_prompt_size = base_prompt_size + estimated_schema_size
                if estimated_prompt_size > safe_limit:
                    raise InputTooLongError(
                        f"Proactive split: Input length {estimated_prompt_size:,} characters exceeds "
                        f"safe limit of {safe_limit:,} (with 10% buffer)"
                    )

                self.logger.debug(f"{log_prefix} Formatting schema for prompt...")
                schema_markdown = self._format_schema_for_prompt(current_column_details)
                if not schema_markdown:
                    self.logger.warning(f"{log_prefix} Produced no schema markdown. Skipping.")
                    return []
                
                fk_relationships_text = "None"
                try:
                    if self.data_loader and getattr(self.data_loader, "foreign_key_graph", None):
                        batch_tables = {(c[0], c[1], c[2]) for c in current_column_details}
                        fk_relations = self.data_loader.get_foreign_key_relations(batch_tables)
                        if fk_relations:
                            rel_lines = []
                            for rel in fk_relations:
                                src = f"{rel[0]}.{rel[1]}.{rel[2]}.{rel[3]}"
                                ref_catalog = rel[4] or rel[0]
                                ref_schema = rel[5] or rel[1]
                                tgt = f"{ref_catalog}.{ref_schema}.{rel[6]}.{rel[7]}"
                                rel_lines.append(f"{src} -> {tgt}")
                            if rel_lines:
                                fk_relationships_text = "\n".join(sorted(set(rel_lines)))
                except Exception as fk_err:
                    self.logger.debug(f"{log_prefix} Failed to gather FK relationships: {str(fk_err)[:100]}")
                
                prompt_vars = {
                    "schema_markdown": schema_markdown,
                    "foreign_key_relationships": fk_relationships_text,
                    "unstructured_documents_markdown": unstructured_docs_markdown,
                    "business_context": business_context,
                    "business_priorities": business_priorities_text,
                    "strategic_initiative": strategic_initiative,
                    "value_chain": value_chain,
                    "revenue_model": revenue_model,
                    "strategic_goals": strategic_goals_text,
                    "additional_context_section": additional_context_section,
                    "ai_functions_summary": ai_functions_summary,
                    "ai_functions_detailed": ai_functions_detailed,
                    "statistical_functions_detailed": statistical_functions_detailed,
                    "focus_areas_instruction": focus_areas_instruction,
                    "previous_use_cases_feedback": previous_use_cases_feedback
                }
                
                # PROACTIVE CHECK: Estimate prompt size and split BEFORE attempting LLM call
                # This saves time by not waiting for LLM failures
                try:
                    estimated_prompt_size = base_prompt_size + len(schema_markdown) + len(fk_relationships_text)
                    
                    if estimated_prompt_size > safe_limit:
                        # Prompt exceeds safe limit - proactively split WITHOUT attempting LLM call
                        self.logger.warning(
                            f"{log_prefix} PROACTIVE SPLIT: Prompt size ({estimated_prompt_size:,} chars) exceeds "
                            f"safe limit ({safe_limit:,} chars with 10% buffer). Splitting batch without attempting LLM call."
                        )
                        
                        # Raise InputTooLongError to trigger the split logic below
                        raise InputTooLongError(
                            f"Proactive split: Input length {estimated_prompt_size:,} characters exceeds "
                            f"safe limit of {safe_limit:,} (with 10% buffer)"
                        )
                    elif estimated_prompt_size > (safe_limit * 0.8):
                        # Safe to proceed - log when approaching the limit (>80% of safe limit)
                        self.logger.info(
                            f"{log_prefix} Prompt size: {estimated_prompt_size:,} chars "
                            f"({(estimated_prompt_size/safe_limit)*100:.1f}% of safe limit)"
                        )
                
                except InputTooLongError:
                    # Re-raise to trigger split logic
                    raise
                except Exception as e:
                    # Any other error in estimation - log and continue to actual LLM call
                    self.logger.debug(f"{log_prefix} Proactive size check failed: {e}")
                
                # === PARALLEL EXECUTION: Send batch to BOTH AI and STATS prompts ===
                self.logger.info(f"⏳ {log_prefix} Sending batch to BOTH AI-focused and STATS-focused prompts in parallel...")
                
                from concurrent.futures import ThreadPoolExecutor, as_completed
                
                def call_prompt(prompt_name, step_suffix):
                    """Helper function to call a specific prompt."""
                    self.logger.info(f"⏳ {log_prefix} [{prompt_name}] Waiting for LLM response (may take 3-5 min)...")
                    response = self.ai_agent.run_worker(
                        step_name=f"Batch_{batch_num}_{step_suffix}", 
                        worker_prompt_path=prompt_name,
                        prompt_vars=prompt_vars,
                        response_schema=None
                    )
                    self.logger.info(f"✅ {log_prefix} [{prompt_name}] Received LLM response")
                    return prompt_name, response
                
                # Execute both prompts in parallel
                with ThreadPoolExecutor(max_workers=2, thread_name_prefix="PromptCall") as executor:
                    futures = {
                        executor.submit(call_prompt, "AI_USE_CASE_GEN_PROMPT", "AI"): "AI",
                        executor.submit(call_prompt, "STATS_USE_CASE_GEN_PROMPT", "STATS"): "STATS"
                    }
                    
                    ai_response_raw = None
                    stats_response_raw = None
                    
                    for future in as_completed(futures):
                        try:
                            prompt_name, response = future.result()
                            if prompt_name == "AI_USE_CASE_GEN_PROMPT":
                                ai_response_raw = response
                            elif prompt_name == "STATS_USE_CASE_GEN_PROMPT":
                                stats_response_raw = response
                        except Exception as e:
                            prompt_type = futures[future]
                            self.logger.error(f"❌ {log_prefix} [{prompt_type}] Prompt call failed: {e}")
                            raise
                
                # Parse both responses
                self.logger.info(f"✅ {log_prefix} Received both responses, parsing CSVs...")
                
                # Use clean_csv_response (NOT clean_json_response) to avoid extracting JSON from CSV
                ai_response_clean = clean_csv_response(ai_response_raw) if ai_response_raw else ""
                stats_response_clean = clean_csv_response(stats_response_raw) if stats_response_raw else ""
                
                ai_parsed_rows = self._parse_llm_csv_response(ai_response_clean, f"{log_prefix}[AI]") if ai_response_clean else []
                stats_parsed_rows = self._parse_llm_csv_response(stats_response_clean, f"{log_prefix}[STATS]") if stats_response_clean else []
                
                # Mark source for each use case (AI vs STATS)
                for row in ai_parsed_rows:
                    row['_source'] = 'AI'
                for row in stats_parsed_rows:
                    row['_source'] = 'STATS'
                
                # Merge results from both prompts
                parsed_rows = ai_parsed_rows + stats_parsed_rows
                self.logger.info(f"✅ {log_prefix} Merged results: {len(ai_parsed_rows)} AI use cases + {len(stats_parsed_rows)} STATS use cases = {len(parsed_rows)} total")
                
                if not parsed_rows:
                    raise Exception("LLM returned no use cases")
                
                # CRITICAL: Validate that tables referenced in use cases actually exist in schema
                # This catches LLM hallucinations where it invents non-existent table names
                validation_schema = getattr(self, "_business_column_details_global", current_column_details)
                is_tables_valid, hallucinated_use_cases, valid_use_cases = self._validate_use_case_tables(
                    parsed_rows, validation_schema, log_prefix
                )
                
                if not is_tables_valid:
                    hallucinated_count = len(hallucinated_use_cases)
                    hallucination_rate = hallucinated_count / len(parsed_rows) * 100
                    valid_count = len(valid_use_cases)
                    
                    if valid_count == 0:
                        self.logger.warning(f"{log_prefix} ⚠️ Table hallucination detected: {hallucinated_count}/{len(parsed_rows)} use cases ({hallucination_rate:.1f}%)")
                        for i, uc in enumerate(hallucinated_use_cases[:3]):
                            self.logger.warning(f"{log_prefix}    Example {i+1}: {uc.get('No')}: {uc.get('Name')} - {uc.get('hallucination_reason')}")
                        if attempt < max_attempts:
                            self.logger.warning(f"{log_prefix}    Retrying batch (attempt {attempt + 1}/{max_attempts}) because no valid use cases were returned")
                            continue
                        self.logger.error(f"{log_prefix} ❌ No valid use cases after {max_attempts} attempts due to hallucinated tables")
                        return self._collect_pending_results([])
                    
                    self.logger.warning(f"{log_prefix} ⚠️ Table hallucination detected: {hallucinated_count}/{len(parsed_rows)} use cases ({hallucination_rate:.1f}%). Dropping hallucinated use cases and continuing with {valid_count} valid use cases.")
                    for i, uc in enumerate(hallucinated_use_cases[:3]):
                        self.logger.warning(f"{log_prefix}    Example {i+1}: {uc.get('No')}: {uc.get('Name')} - {uc.get('hallucination_reason')}")
                    parsed_rows = valid_use_cases
                
                # Re-number use cases with the batch prefix and update the ID comment embedded in each SQL
                # Handle both int and string batch_num (for retry batches)
                if isinstance(batch_num, int):
                    batch_prefix = f"{batch_num:02d}"  # Zero-padded two-digit batch number
                else:
                    batch_prefix = str(batch_num)  # Already formatted (e.g., "RETRY_1")
                
                for row in parsed_rows:
                    try:
                        original_id = row['No']
                        use_case_num = original_id.split('-')[-1]
                        # Use F for AI-sourced, S for STATS-sourced
                        source_prefix = 'F' if row.get('_source') == 'AI' else 'S'
                        new_id = f"AI-{source_prefix}{batch_prefix}-U{use_case_num}"
                        row['No'] = new_id
                        row['batch'] = batch_num
                        if 'SQL' in row and row['SQL']:
                            # Update use case ID in SQL comment
                            if original_id in row['SQL']:
                                row['SQL'] = row['SQL'].replace(f"-- Use Case ID: {original_id}", f"-- Use Case ID: {new_id}")
                    except Exception as e:
                        self.logger.warning(f"{log_prefix} Failed to re-number row: {e}")
                        row['batch'] = batch_num
                
                self.logger.debug(f"{log_prefix} Successfully processed {len(parsed_rows)} use cases on attempt {attempt}")
                
                # Print top 5 use cases from AI and top 5 from STATS (total 10)
                ai_cases = [uc for uc in parsed_rows if uc.get('_source') == 'AI']
                stats_cases = [uc for uc in parsed_rows if uc.get('_source') == 'STATS']
                
                log_print(f"\n{'='*80}")
                log_print(f"📊 TOP USE CASES FROM {log_prefix} (for early quality review):")
                log_print(f"{'='*80}\n")
                
                if ai_cases:
                    log_print(f"🤖 Top {min(5, len(ai_cases))} AI-focused use cases:")
                    for use_case in ai_cases[:5]:
                        log_print(f"   {use_case.get('No', 'N/A')}: {use_case.get('Name', 'N/A')}")
                    log_print("")
                
                if stats_cases:
                    log_print(f"📊 Top {min(5, len(stats_cases))} STATS-focused use cases:")
                    for use_case in stats_cases[:5]:
                        log_print(f"   {use_case.get('No', 'N/A')}: {use_case.get('Name', 'N/A')}")
                    log_print("")
                
                log_print(f"{'='*80}\n")
                
                # Validate subdomain rules as a silent check only: the response is accepted
                # regardless of the outcome, because the domain fixer corrects violations later.
                _is_valid, _violations, _corrected_rows = self._validate_subdomain_rules(parsed_rows, log_prefix)
                # Return this batch's rows together with any pending sub-batch results
                return self._collect_pending_results(parsed_rows)
                
            except InputTooLongError as e:
                # Handle "input too long" by intelligently splitting the batch
                # NEVER DROP BUSINESS TABLES - keep splitting until they fit
                
                # Group columns by table
                table_to_columns = {}
                for col in current_column_details:
                    table_key = (col[0], col[1], col[2])  # (catalog, schema, table)
                    if table_key not in table_to_columns:
                        table_to_columns[table_key] = []
                    table_to_columns[table_key].append(col)
                
                num_tables = len(table_to_columns)
                
                # Get business scores to check if tables are marked as business
                business_scores = getattr(self, 'business_scores', {})
                
                def is_business_table(table_key):
                    """Check if a table is marked as business (score > 0)."""
                    fqtn = f"{table_key[0]}.{table_key[1]}.{table_key[2]}"
                    return business_scores.get(fqtn, 0) > 0
                
                if num_tables > 1:
                    tables_list = list(table_to_columns.keys())
                    reference_tables_set = {k for k, v in getattr(self, "data_category_map", {}).items() if v == "REFERENCE"}
                    filtered_tables = []
                    for table_key in tables_list:
                        fqtn = f"{table_key[0]}.{table_key[1]}.{table_key[2]}"
                        if reference_tables_set and fqtn in reference_tables_set:
                            continue
                        filtered_tables.append(table_key)
                    if not filtered_tables:
                        self.logger.warning(f"{log_prefix} Input too long ({str(e)}). Only reference tables present; skipping use case generation.")
                        return self._collect_pending_results([])
                    self.logger.warning(f"{log_prefix} Input too long ({str(e)}). Falling back to single-table calls for {len(filtered_tables)} of {num_tables} tables (reference tables skipped).")
                    self.processing_honesty['total_batch_splits'] += 1
                    split_type = "Proactive" if "Proactive split" in str(e) else "Reactive"
                    split_info = {
                        'batch': batch_num,
                        'original_tables': num_tables,
                        'split_into': len(filtered_tables),
                        'sub_batch_1_tables': 1,
                        'sub_batch_2_tables': 1,
                        'reason': 'Input too long for LLM',
                        'split_type': split_type
                    }
                    self.processing_honesty['batch_split_history'].append(split_info)
                    current_column_details = table_to_columns[filtered_tables[0]]
                    current_column_details = self._augment_columns_with_related_tables(current_column_details)
                    if not hasattr(self, '_pending_sub_batch_results'):
                        self._pending_sub_batch_results = []
                    for idx, table_key in enumerate(filtered_tables[1:], start=2):
                        single_cols = table_to_columns[table_key]
                        single_cols = self._augment_columns_with_related_tables(single_cols)
                        single_batch_id = f"{batch_num}_T{idx}"
                        single_use_cases = self._process_batch_with_retry(
                            single_cols,
                            single_batch_id,
                            unstructured_docs_markdown,
                            strategic_goals,
                            max_attempts
                        )
                        if single_use_cases:
                            self._pending_sub_batch_results.extend(single_use_cases)
                            self.logger.info(f"{log_prefix} Single-table sub-batch {single_batch_id} generated {len(single_use_cases)} use cases")
                    continue
                    
                elif num_tables == 1:
                    # Single table is too big: try dropping columns
                    table_key = list(table_to_columns.keys())[0]
                    table_columns = table_to_columns[table_key]
                    table_is_business = is_business_table(table_key)
                    fqtn = f"{table_key[0]}.{table_key[1]}.{table_key[2]}"
                    
                    if len(table_columns) > 500:
                        keep_count = 500
                    else:
                        keep_count = len(table_columns) - 100
                    if keep_count < 5:
                        if table_is_business:
                            # Keep at most 5 columns, but never more than the table actually has
                            keep_count = min(5, len(table_columns))
                        else:
                            self.logger.error(f"{log_prefix} Input too long even with minimal columns ({len(table_columns)} columns from non-business table {table_key[2]}). Dropping this table.")
                            return self._collect_pending_results([])
                    current_column_details = table_columns[:keep_count]
                    
                    kept_columns = [col[3] for col in current_column_details]  # col[3] is column name
                    self.storage_manager.save_column_tracking(fqtn, kept_columns)
                    
                    dropped_count = len(table_columns) - keep_count
                    drop_info = {
                        'table': fqtn,
                        'original_columns': len(table_columns),
                        'kept_columns': keep_count,
                        'dropped_columns': dropped_count,
                        'drop_percentage': (dropped_count / len(table_columns)) * 100,
                        'is_business': table_is_business
                    }
                    if fqtn not in [t['table'] for t in self.processing_honesty['tables_with_columns_dropped']]:
                        self.processing_honesty['tables_with_columns_dropped'].append(drop_info)
                    else:
                        # Update existing entry with new drop info
                        for idx, existing in enumerate(self.processing_honesty['tables_with_columns_dropped']):
                            if existing['table'] == fqtn:
                                self.processing_honesty['tables_with_columns_dropped'][idx] = drop_info
                                break
                    
                    table_type = "BUSINESS" if table_is_business else "non-business"
                    self.logger.warning(f"{log_prefix} Input too long ({str(e)}). Single {table_type} table {table_key[2]} is too large. Dropping columns from {len(table_columns)} to {keep_count} columns and retrying...")
                    self.logger.info(f"{log_prefix} Saved column tracking for {fqtn}: {len(kept_columns)} columns ({', '.join(kept_columns[:5])}{'...' if len(kept_columns) > 5 else ''})")
                    
                    continue
                    
                else:
                    # No tables? This shouldn't happen
                    self.logger.error(f"{log_prefix} Input too long but no tables found. Cannot process.")
                    return self._collect_pending_results([])
                
            except Exception as e:
                if attempt < max_attempts:
                    self.logger.warning(f"{log_prefix} Attempt {attempt} failed: {e}. Retrying...")
                else:
                    self.logger.error(f"{log_prefix} All {max_attempts} attempts failed: {e}")
                    return self._collect_pending_results([])
        
        # If we exhaust all attempts without success, still return any pending results
        return self._collect_pending_results([])

    def _assemble_notebook_for_db(self, db_name: str, use_cases: list, translations: dict, db_prefix: str, filename_override: str = None, domain_summary: str = None):
        self.logger.debug(f"--- Assembling notebook for: {db_name} (English) ---")
        if not use_cases:
            self.logger.warning(f"No use cases provided for {db_name}. Skipping notebook creation.")
            return
        t = translations
        grouped_by_domain = defaultdict(list)
        for uc in use_cases:
            grouped_by_domain[uc.get('Business Domain') or 'Other'].append(uc)
        
        # === Add top title cell ===
        final_cells = []
        title_cell_source = [
            f"# {t['pdf_title']}\n\n",
            f"## For {self.business_name}: {db_name}\n\n"
        ]
        title_cell = {
            "cell_type": "markdown", 
            "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}}, 
            "source": title_cell_source
        }
        final_cells.append(title_cell)
        
        # === Add disclaimer cell ===
        generation_timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        disclaimer_cell_source = [
            f"*Generated by Databricks Inspire AI on {generation_timestamp}*\n\n",
            "**Disclaimer:** All SQL queries are examples and must be validated for syntax and safety by a qualified engineer before being used in any production environment. Databricks is not liable for any issues arising from the use of this code.\n\n",
            "---\n"
        ]
        disclaimer_cell = {
            "cell_type": "markdown",
            "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}},
            "source": disclaimer_cell_source
        }
        final_cells.append(disclaimer_cell)
        
        # === Add domain executive summary if available ===
        if domain_summary:
            # Clean HTML tags from summary for better notebook display
            import re
            clean_summary = re.sub(r'<[^>]+>', '', domain_summary)
            # Put each sentence in its own paragraph. Split only after sentence-ending
            # punctuation followed by whitespace, so decimals such as "3.5" stay intact.
            sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', clean_summary) if s.strip()]
            formatted_summary = '\n\n'.join(sentences)
            
            summary_cell_source = [
                f"{formatted_summary}\n\n",
                "---\n\n"
            ]
            summary_cell = {
                "cell_type": "markdown", 
                "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}}, 
                "source": summary_cell_source
            }
            final_cells.append(summary_cell)
            self.logger.info(f"Added domain overview cell for '{db_name}'")
        
        # Use Cases Summaries section - split by subdomain
        
        # Generate summary tables with Priority column, grouped by subdomain
        first_section = True
        for domain, domain_use_cases in sorted(grouped_by_domain.items()):
            self.logger.debug(f"Assembling domain summary tables: '{domain}' with {len(domain_use_cases)} use cases.")
            
            # Group use cases by subdomain within this domain
            subdomain_groups = defaultdict(list)
            for uc in domain_use_cases:
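                # Fall back to 'General' for missing, None, or empty subdomains so sorting never mixes None with str
                subdomain = uc.get('Subdomain') or 'General'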
                subdomain_groups[subdomain].append(uc)
            
            # Create a table for each subdomain
            for subdomain, subdomain_use_cases in sorted(subdomain_groups.items()):
                self.logger.debug(f"  - Subdomain '{subdomain}': {len(subdomain_use_cases)} use cases")
                
                if first_section:
                    # First table includes the main header
                    header_source = [
                        f"## {t['summaries']}\n\n",
                        f"### {subdomain}\n\n",
                        f"| {t['sum_id']} | {t['sum_name']} | {t['priority']} | {t['sum_value']} |\n",
                        "|---|---|---|---|\n"
                    ]
                    first_section = False
                else:
                    # Subsequent tables just have subdomain header
                    header_source = [
                        f"\n### {subdomain}\n\n",
                        f"| {t['sum_id']} | {t['sum_name']} | {t['priority']} | {t['sum_value']} |\n",
                        "|---|---|---|---|\n"
                    ]
                
                subdomain_header_cell = {
                    "cell_type": "markdown",
                    "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}},
                    "source": header_source
                }
                # Sort use cases within subdomain using natural sort (AI01, AI02, ..., AI10)
                sorted_subdomain_use_cases = sorted(subdomain_use_cases, key=self._natural_sort_key)
                toc_entries = [f"| {uc['No']} | {uc['Name']} | {uc.get('Priority', 'N/A')} | {uc['Business Value']} |\n" for uc in sorted_subdomain_use_cases]
                subdomain_header_cell["source"].extend(toc_entries)
                final_cells.append(subdomain_header_cell)
        
        # Req 6: Use translated disclaimer
        disclaimer_text = t["disclaimer"]
        disclaimer_html = f'<div style="background-color:#FFF3CD; color:#664D03; border: 1px solid #FFECB5; padding:10px; border-radius:5px; margin-top:10px;"><b>Disclaimer:</b> {disclaimer_text}</div>'
        final_cells.extend([
            {"cell_type": "markdown", "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}}, "source": [disclaimer_html]},
            {"cell_type": "markdown", "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}}, "source": [f"<hr>\n\n# {t['detailed_scenarios']}\n"]}
        ])
        
        # Sort use cases by use case ID using natural sort (AI01, AI02, ..., AI10 - not AI1, AI10, AI2)
        use_cases_sorted = sorted(use_cases, key=self._natural_sort_key)
        self.logger.debug(f"Sorted {len(use_cases_sorted)} use cases by ID (natural order)")
        
        # Validate SQL before adding to notebook; use cases without usable SQL are excluded
        excluded_count = 0
        for use_case in use_cases_sorted:
            use_case_id = use_case.get('No', 'UNKNOWN')
            # Treat a missing or None SQL field as empty
            sql_content = (use_case.get('SQL') or '').strip()
            
            # Check first whether the SQL field contains only a priority value (shouldn't happen),
            # so the log names the specific problem before the generic length check below
            if sql_content.upper() in ['LOW', 'MEDIUM', 'HIGH']:
                self.logger.error(f"EXCLUDING use case {use_case_id}: SQL field contains only priority value '{sql_content}'")
                excluded_count += 1
                continue
            
            # CRITICAL: Exclude use cases whose SQL is empty or too short to be a real query
            if len(sql_content) < 20:
                self.logger.error(f"EXCLUDING use case {use_case_id}: SQL field is empty or too short (len={len(sql_content)})")
                excluded_count += 1
                continue
            
            numbered_title = f"{use_case['No']}: {use_case['Name']}"
            
            # Helper function to translate field values
            def translate_value(field_name, value):
                """Translate Type and Priority values"""
                if not value or value == 'N/A':
                    return value
                
                # Map English values to translation keys
                value_key_map = {
                    # Type values
                    'Problem': 'value_type_problem',
                    'Risk': 'value_type_risk',
                    'Opportunity': 'value_type_opportunity',
                    'Improvement': 'value_type_improvement',
                    # Priority values
                    'Very High': 'value_priority_very_high',
                    'High': 'value_priority_high',
                    'Low': 'value_priority_low',
                    'Very Low': 'value_priority_very_low'
                }
                
                # Check if Medium needs special handling based on field
                if value == 'Medium':
                    if field_name in ['type', 'Type']:
                        return t.get('value_type_medium', value)
                    elif field_name in ['priority', 'Priority', 'aspect_priority']:
                        return t.get('value_priority_medium', value)
                
                # Get translation or return original value
                translation_key = value_key_map.get(value)
                return t.get(translation_key, value) if translation_key else value
            
            def safe_notebook_str(val):
                """Handle None/empty values for notebook display."""
                if val is None or (isinstance(val, str) and not val.strip()):
                    return 'N/A'
                return str(val)
            
            combined_source = [
                f"### {numbered_title}\n\n",
                f"| {t['aspect']} | {t['description']} |\n", "|---|---|\n",
                f"| **{t['subdomain']}** | {safe_notebook_str(use_case.get('Subdomain'))} |\n",
                f"| **{t['type']}** | {translate_value('type', use_case.get('type', 'N/A'))} |\n",
                f"| **{t.get('analytics_technique', 'Analytics Technique')}** | {safe_notebook_str(use_case.get('Analytics Technique'))} |\n",
                f"| **{t['priority']}** | {translate_value('priority', use_case.get('Priority', 'N/A'))} |\n",
                f"| **{t.get('primary_table', 'Primary Table')}** | {safe_notebook_str(use_case.get('Primary Table'))} |\n",
                f"| **{t['statement']}** | {safe_notebook_str(use_case.get('Statement'))} |\n",
                f"| **{t['solution']}** | {safe_notebook_str(use_case.get('Solution'))} |\n",
                f"| **{t['aspect_value']}** | {safe_notebook_str(use_case.get('Business Value'))} |\n",
                f"| **{t['aspect_tables']}** | {safe_notebook_str(use_case.get('Tables Involved'))} |\n"
            ]
            details_cell = {"cell_type": "markdown", "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}}, "source": combined_source}
            # Use SQL directly without any formatting/correction, but add Inspire header at the top
            use_case_id = use_case.get('No', 'UNKNOWN')
            use_case_name = use_case.get('Name', '')
            # generate_sample_result:No initially, user sets to Yes to generate sample output
            # regenerate_sql:No initially, user sets to Yes to regenerate SQL
            inspire_header = f"--Use Case: {use_case_id} - {use_case_name}\n--generate_sample_result:No\n--regenerate_sql:No\n"
            inspire_instructions_block = "/**Regeneration Instruction Start\n\nRegeneration Instruction End**/\n\n"
            sql_lines = use_case['SQL'].split('\n')
            # Strip LLM-generated header comments to avoid duplication (our header already has use case info)
            sql_lines_clean = []
            skip_header = True  # Stays True until the first line that is kept
            for line in sql_lines:
                line_stripped = line.strip().lower()
                if skip_header and line_stripped.startswith(('-- use case', '--use case')):
                    continue
                if skip_header and line_stripped.startswith('--') and not line_stripped.startswith(('-- step', '--step')):
                    # Skip leading descriptive comments, but keep any that mention SQL structure keywords
                    if len(line_stripped) > 2 and not any(kw in line_stripped for kw in ['with', 'select', 'cte', 'step']):
                        continue
                # The first kept line ends header stripping; every later line is preserved verbatim
                skip_header = False
                sql_lines_clean.append(line)
            sql_with_header = [inspire_header, inspire_instructions_block] + [line + '\n' for line in sql_lines_clean]
            code_cell = {"cell_type": "code", "execution_count": 0, "outputs": [], "metadata": {"application/vnd.databricks.v1+cell": {"nuid": str(uuid.uuid4())}}, "source": sql_with_header}
            final_cells.extend([details_cell, code_cell])
        
        if excluded_count > 0:
            self.logger.warning(f"NOTEBOOK {db_name}: Excluded {excluded_count} use case(s) with invalid/missing SQL")
        
        if not filename_override:
            self.logger.warning(f"No filename_override provided for notebook assembly '{db_name}'. Defaulting.")
            notebook_name_sanitized = f"{db_prefix}_{self._sanitize_name(db_name)}"
        else:
            notebook_name_sanitized = filename_override

        gen_dir = self.notebook_output_dir
        
        # Databricks SQL notebook metadata
        notebook_metadata = {
            "application/vnd.databricks.v1+notebook": {
                "computePreferences": None,
                "dashboards": [],
                "environmentMetadata": {
                    "base_environment": "",
                    "environment_version": "4"
                },
                "inputWidgetPreferences": None,
                "language": "sql",
                "notebookMetadata": {
                    "pythonIndentUnit": 2
                },
                "notebookName": notebook_name_sanitized,
                "widgets": {}
            },
            "language_info": {
                "name": "sql"
            }
        }
        
        final_notebook_obj = { "notebook_content": { "cells": final_cells, "metadata": notebook_metadata, "nbformat": 4, "nbformat_minor": 0 } }
        
        # Retry logic for workspace import with exponential backoff (respect global retry setting)
        import time  # Imported once, before the loop, so both the try and except branches can use it
        max_retries = (getattr(self, "max_retry_attempts", 1) or 0) + 1
        retry_delay = 2  # Base delay in seconds; doubled after each failed attempt

        for attempt in range(1, max_retries + 1):
            try:
                self.logger.info(f"Importing notebook '{notebook_name_sanitized}.ipynb' (attempt {attempt}/{max_retries})...")
                notebook_full_path = os.path.join(gen_dir, f"{notebook_name_sanitized}.ipynb")
                notebook_content_str = json.dumps(final_notebook_obj["notebook_content"], indent=2)

                # Time the import call so the elapsed duration can be logged
                start_time = time.time()
                self.w_client.workspace.import_(
                    path=notebook_full_path, overwrite=True, format=workspace.ImportFormat.JUPYTER,
                    content=base64.b64encode(notebook_content_str.encode('utf-8')).decode(),
                )
                elapsed_time = time.time() - start_time

                self.logger.info(f"✓ Notebook '{notebook_name_sanitized}.ipynb' imported successfully in {elapsed_time:.1f}s")
                abs_path = self.w_client.workspace.get_status(notebook_full_path).path
                self.logger.info(f"Notebook is located at: {abs_path}")
                log_print(f"   ✓ Notebook saved to: {abs_path}")
                break  # Success - exit retry loop

            except Exception as err:
                self.logger.warning(f"Attempt {attempt}/{max_retries} failed: {err}")

                if attempt < max_retries:
                    # Exponential backoff: 2s, 4s, 8s, ...
                    wait_time = retry_delay * (2 ** (attempt - 1))
                    self.logger.info(f"Retrying in {wait_time} seconds...")
                    time.sleep(wait_time)
                else:
                    # Final attempt failed
                    self.logger.error(f"❌ Failed to import notebook '{notebook_name_sanitized}' after {max_retries} attempts")
                    self.logger.error(f"Error details: {err}")
                    log_print(f"   ❌ ERROR: Failed to create notebook '{notebook_name_sanitized}': {err}", file=sys.stderr)
                    # Re-raise to allow caller to handle
                    raise

    def _filter_business_tables(self, db_details: list, business_context: str = "", industry: str = "", exclusion_strategy: str = "Medium") -> tuple:
        """
        Filters tables into business-relevant vs technical/metadata tables using LLM.
        RESPECTS MAX_CONTEXT_CHARS by batching tables if needed.
        Implements RECURSIVE BATCHING: If a batch is too large, splits it and tries again recursively.
        
        Args:
            db_details: List of (catalog, schema, table, column, type, comment) tuples
            business_context: Business context description
            industry: Industry classification
            exclusion_strategy: Technical exclusion strategy ("None", "Aggressive", "Medium", "Low")
            
        Returns:
            Tuple of (business_tables_details, technical_tables_details, business_table_names, technical_table_names,
            business_scores_dict, data_category_map, master_tables_set, transactional_tables_set, reference_tables_set)
        """
        if not db_details:
            # Return an empty value for each of the nine elements documented above
            return ([], [], set(), set(), {}, {}, set(), set(), set())
        
        include_technical = exclusion_strategy == "None"
        
        # Get additional context from instance (user-provided instructions)
        additional_context = getattr(self, 'additional_context', '') or ''
        
        # Call the main recursive processing logic
        return self._filter_business_tables_with_batching(
            db_details=db_details,
            business_context=business_context,
            industry=industry,
            exclusion_strategy=exclusion_strategy,
            additional_context=additional_context,
            include_technical=include_technical
        )
    
    def _process_filter_batch_recursive(self, batch_tables: list, batch_idx: int, total_batches: int, 
                                       business_name: str, industry: str, business_context: str,
                                       exclusion_strategy: str, strategy_rule_text: str,
                                       additional_context_section: str = "",
                                       depth: int = 0, max_depth: int = 10) -> dict:
        """
        Recursively process a batch of tables for filtering with automatic sub-batching if context limit exceeded.
        
        Args:
            batch_tables: List of table names to classify
            batch_idx: Batch label for logging (int for top-level batches; str such as '1a' or 'RETRY_1' for sub-batches and retries)
            total_batches: Total number of batches
            business_name: Business name
            industry: Industry classification
            business_context: Business context description
            exclusion_strategy: Exclusion strategy
            strategy_rule_text: Strategy-specific rules text
            additional_context_section: User-provided filtering instructions (formatted)
            depth: Current recursion depth
            max_depth: Maximum recursion depth to prevent infinite loops
            
        Returns:
            Dict mapping table_name -> (classification, business_score, data_category)
        """
        if depth > max_depth:
            self.logger.error(f"[Batch {batch_idx}] Max recursion depth ({max_depth}) reached. Defaulting {len(batch_tables)} tables to BUSINESS.")
            return {table.replace('`', ''): ('BUSINESS', 50, 'MASTER') for table in batch_tables}
        
        depth_prefix = "  " * depth  # Indent based on depth for readability
        # Fix batch naming: use "Split" suffix instead of excessive "_SUB" suffixes
        if depth > 0:
            log_prefix = f"[Batch {batch_idx}-Split{depth}]"
        else:
            log_prefix = f"[Batch {batch_idx}]"
        
        self.logger.info(f"{depth_prefix}{log_prefix} Processing {len(batch_tables)} tables (depth={depth})...")
        
        # Create markdown table list for this batch
        tables_markdown = "\n".join([f"- {table}" for table in batch_tables])
        
        # Verify this batch's prompt size
        prompt_vars = {
            "business_name": business_name,
            "industry": industry,
            "business_context": business_context,
            "exclusion_strategy": exclusion_strategy,
            "strategy_rules": strategy_rule_text,
            "additional_context_section": additional_context_section,
            "tables_markdown": tables_markdown
        }
        
        # Format the prompt to check actual size (using model-specific limits from TECHNICAL_CONTEXT)
        filter_context_limit = get_max_context_chars("English", "FILTER_BUSINESS_TABLES_PROMPT")
        test_prompt = self.ai_agent._load_and_format_prompt("FILTER_BUSINESS_TABLES_PROMPT", prompt_vars)
        actual_prompt_size = len(test_prompt)
        
        # Check if batch is too large
        if actual_prompt_size > filter_context_limit:
            # Need to split this batch
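            if len(batch_tables) <= 1:
                # Cannot split further - the prompt for a single table already exceeds the limit (very rare case)
                self.logger.error(
                    f"{depth_prefix}{log_prefix} Prompt for a single table exceeds the context limit ({actual_prompt_size:,} chars). "
                    f"Defaulting to BUSINESS."
                )
                # A comprehension (rather than batch_tables[0]) also handles an empty batch without raising IndexError
                return {table.replace('`', ''): ('BUSINESS', 50, 'MASTER') for table in batch_tables}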
            
            # Split into 2 sub-batches
            mid_point = len(batch_tables) // 2
            first_half = batch_tables[:mid_point]
            second_half = batch_tables[mid_point:]
            
            self.logger.warning(
                f"{depth_prefix}{log_prefix} Batch too large ({actual_prompt_size:,} chars > {filter_context_limit:,} limit). "
                f"Splitting {len(batch_tables)} tables into 2 sub-batches: {len(first_half)} + {len(second_half)} tables"
            )
            
            # Process both halves recursively
            results = {}
            
            # Process first half
            first_results = self._process_filter_batch_recursive(
                batch_tables=first_half,
                batch_idx=f"{batch_idx}a",
                total_batches=total_batches,
                business_name=business_name,
                industry=industry,
                business_context=business_context,
                exclusion_strategy=exclusion_strategy,
                strategy_rule_text=strategy_rule_text,
                additional_context_section=additional_context_section,
                depth=depth + 1,
                max_depth=max_depth
            )
            results.update(first_results)
            
            # Process second half
            second_results = self._process_filter_batch_recursive(
                batch_tables=second_half,
                batch_idx=f"{batch_idx}b",
                total_batches=total_batches,
                business_name=business_name,
                industry=industry,
                business_context=business_context,
                exclusion_strategy=exclusion_strategy,
                strategy_rule_text=strategy_rule_text,
                additional_context_section=additional_context_section,
                depth=depth + 1,
                max_depth=max_depth
            )
            results.update(second_results)
            
            self.logger.info(f"{depth_prefix}{log_prefix} Sub-batches complete. Total: {len(results)} tables classified")
            return results
        
        # Batch size is OK - process it with retry logic
        self.logger.debug(f"{depth_prefix}{log_prefix} Batch size OK ({actual_prompt_size:,} chars)")
        
        try:
            # Retry on CSV parsing errors (respect global retry setting)
            max_retries = (getattr(self, "max_retry_attempts", 1) or 0) + 1
            for attempt in range(1, max_retries + 1):
                try:
                    # Call LLM to classify tables in this batch
                    attempt_suffix = f" (attempt {attempt}/{max_retries})" if attempt > 1 else ""
                    self.logger.info(f"{depth_prefix}⏳ {log_prefix} Waiting for LLM response{attempt_suffix} (filtering {len(batch_tables)} tables into BUSINESS vs TECHNICAL)...")
                    response_raw = self.ai_agent.run_worker(
                        step_name=f"Filter_Business_Tables_Batch_{batch_idx}_Depth{depth}_Attempt{attempt}",
                        worker_prompt_path="FILTER_BUSINESS_TABLES_PROMPT",
                        prompt_vars=prompt_vars,
                        response_schema=None
                    )
                    self.logger.info(f"{depth_prefix}✅ {log_prefix} Received LLM response, parsing classifications...")
                    
                    # Parse CSV response using centralized utility
                    response_clean = clean_json_response(response_raw)
                    
                    csv_rows = CSVParser.parse_csv_string(
                        response_clean,
                        logger=self.logger,
                        context=log_prefix
                    )
                    
                    results = {}
                    row_count = 0
                    for row in csv_rows:
                        row_count += 1
                        # Safely get values with fallback to empty string if None
                        table_name = (row.get('Table Name') or '').strip().replace('`', '')
                        classification = (row.get('Classification') or '').strip().upper()
                        data_category = (row.get('Data Category') or '').strip().upper()
                        business_score_str = (row.get('Business Score') or '50').strip()
                        
                        # Skip invalid rows (empty table name or very short names that are likely parsing errors)
                        if not table_name or len(table_name) < 3:
                            self.logger.warning(f"{depth_prefix}Skipping invalid row with table name '{table_name}' (too short or empty)")
                            continue
                        
                        # Parse business score (accepts integer or decimal strings such as '85' or '85.0')
                        try:
                            business_score = int(float(business_score_str))
                            business_score = max(0, min(100, business_score))  # Clamp to 0-100
                        except (ValueError, TypeError, OverflowError):
                            business_score = 50  # Default to medium score if parsing fails
                            self.logger.warning(f"{depth_prefix}Table {table_name}: Invalid business score '{business_score_str}', using default 50")
                        
                        if classification == 'BUSINESS':
                            if data_category not in ('MASTER', 'REFERENCE', 'TRANSACTIONAL'):
                                self.logger.warning(f"{depth_prefix}Table {table_name}: LLM returned invalid data_category '{data_category}', defaulting to MASTER")
                                data_category = 'MASTER'
                            results[table_name] = ('BUSINESS', business_score, data_category)
                        elif classification == 'TECHNICAL':
                            results[table_name] = ('TECHNICAL', 0, 'TECHNICAL')
                        else:
                            self.logger.warning(f"{depth_prefix}Table {table_name} has unclear classification '{classification}', defaulting to BUSINESS/MASTER")
                            if data_category not in ('MASTER', 'REFERENCE', 'TRANSACTIONAL'):
                                data_category = 'MASTER'
                            results[table_name] = ('BUSINESS', business_score, data_category)
                    
                    # Validate we got results
                    if not results or row_count == 0:
                        raise ValueError(f"CSV parsing returned no results (row_count={row_count})")
                    
                    self.logger.info(f"{depth_prefix}✅ {log_prefix} Complete: {len(results)} tables classified")
                    return results
                    
                except (ValueError, AttributeError, KeyError) as parse_error:
                    # CSV parsing errors - retry
                    if attempt < max_retries:
                        self.logger.warning(f"{depth_prefix}{log_prefix} CSV parsing error on attempt {attempt}/{max_retries}: {parse_error}. Retrying...")
                        continue
                    else:
                        self.logger.error(f"{depth_prefix}{log_prefix} Failed to parse CSV after {max_retries} attempts: {parse_error}. Defaulting batch tables to BUSINESS.")
                        return {table.replace('`', ''): ('BUSINESS', 50, 'MASTER') for table in batch_tables}
            
        except InputTooLongError as e:
            # Even though we pre-checked, the LLM still rejected it. Split more aggressively.
            if len(batch_tables) <= 1:
                self.logger.error(
                    f"{depth_prefix}{log_prefix} LLM rejected even after pre-check. Single table too large. "
                    f"Defaulting to BUSINESS. Error: {str(e)[:200]}"
                )
                # A comprehension (rather than batch_tables[0]) also handles an empty batch without raising IndexError
                return {table.replace('`', ''): ('BUSINESS', 50, 'MASTER') for table in batch_tables}
            
            # Split into 2 sub-batches
            mid_point = len(batch_tables) // 2
            first_half = batch_tables[:mid_point]
            second_half = batch_tables[mid_point:]
            
            self.logger.warning(
                f"{depth_prefix}{log_prefix} LLM rejected batch (InputTooLongError). "
                f"Splitting {len(batch_tables)} tables into 2 sub-batches: {len(first_half)} + {len(second_half)} tables"
            )
            
            # Process both halves recursively
            results = {}
            first_results = self._process_filter_batch_recursive(
                batch_tables=first_half,
                batch_idx=f"{batch_idx}a",
                total_batches=total_batches,
                business_name=business_name,
                industry=industry,
                business_context=business_context,
                exclusion_strategy=exclusion_strategy,
                strategy_rule_text=strategy_rule_text,
                additional_context_section=additional_context_section,
                depth=depth + 1,
                max_depth=max_depth
            )
            results.update(first_results)
            
            second_results = self._process_filter_batch_recursive(
                batch_tables=second_half,
                batch_idx=f"{batch_idx}b",
                total_batches=total_batches,
                business_name=business_name,
                industry=industry,
                business_context=business_context,
                exclusion_strategy=exclusion_strategy,
                strategy_rule_text=strategy_rule_text,
                additional_context_section=additional_context_section,
                depth=depth + 1,
                max_depth=max_depth
            )
            results.update(second_results)
            
            return results
            
        except Exception as batch_error:
            self.logger.error(f"{depth_prefix}{log_prefix} Failed to process batch: {batch_error}. Defaulting batch tables to BUSINESS.")
            # Default all tables in this batch to business with medium score
            return {table.replace('`', ''): ('BUSINESS', 50, 'MASTER') for table in batch_tables}
    
    def _filter_business_tables_with_batching(self, db_details: list, business_context: str, 
                                             industry: str, exclusion_strategy: str,
                                             additional_context: str = "", include_technical: bool = False) -> tuple:
        """
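        Main filtering logic with batching support and recursive sub-batching.

        Builds the strategy rules and the optional user-instruction section, splits the unique
        tables into batches sized from the prompt's context limit, classifies the batches in
        parallel, and retries any tables that come back unclassified.

        Returns:
            The tuple documented in `_filter_business_tables`, which returns this method's result unchanged.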
        """
        try:
            # === LOG USER ADDITIONAL CONTEXT INTERPRETATION ===
            additional_context_section = ""
            if additional_context:
                self.logger.info("=" * 80)
                self.logger.info("🔍 INTERPRETING USER ADDITIONAL CONTEXT FOR TABLE FILTERING")
                self.logger.info("=" * 80)
                self.logger.info(f"📋 User Instructions: {additional_context[:500]}{'...' if len(additional_context) > 500 else ''}")
                
                # Log interpretation of common patterns
                context_lower = additional_context.lower()
                
                # Check for database/catalog exclusions
                if 'ignore' in context_lower or 'exclude' in context_lower or 'skip' in context_lower:
                    self.logger.info("🎯 DETECTED: User wants to EXCLUDE/IGNORE certain data")
                
                # Check for business entity references
                business_entities = ['customer', 'subscriber', 'client', 'user', 'member', 'patient', 'employee', 
                                   'vendor', 'supplier', 'partner', 'order', 'transaction', 'product', 'inventory']
                detected_entities = [e for e in business_entities if e in context_lower]
                if detected_entities:
                    self.logger.info(f"🎯 DETECTED BUSINESS ENTITIES: {', '.join(detected_entities)}")
                    self.logger.info("   → LLM will apply SEMANTIC understanding (e.g., 'subscriber' = 'customer')")
                
                # Check for database/catalog references
                if 'database' in context_lower or 'catalog' in context_lower or 'schema' in context_lower:
                    self.logger.info("🎯 DETECTED: Database/Catalog/Schema level filtering instructions")
                
                self.logger.info("=" * 80)
                
                # Build the additional context section for the prompt
                additional_context_section = f"""
**🚨🚨🚨 HIGHEST PRIORITY: USER-PROVIDED FILTERING INSTRUCTIONS 🚨🚨🚨**

**⛔ YOU MUST FOLLOW THESE USER INSTRUCTIONS - THEY OVERRIDE ALL OTHER RULES ⛔**

{additional_context}

**CRITICAL INTERPRETATION RULES**:
1. **Database/Catalog Exclusions**: If user says "ignore database X" or "exclude catalog Y", ALL tables from that database/catalog MUST be classified as TECHNICAL (filtered out)
2. **Business Entity Semantics**: Apply BUSINESS UNDERSTANDING to user instructions:
   - If user says "ignore customers", also ignore tables named: subscribers, clients, users, members, accounts, patrons, buyers
   - If user says "ignore orders", also ignore tables named: transactions, purchases, bookings, reservations, sales
   - If user says "ignore products", also ignore tables named: items, inventory, merchandise, goods, SKUs, catalog
   - **USE YOUR BUSINESS DOMAIN KNOWLEDGE** to identify semantically equivalent entities
3. **Explicit Overrides**: User instructions ALWAYS override the default classification rules
4. **When in doubt**: If a table MIGHT relate to something user wants to ignore, classify it as TECHNICAL

**EXAMPLE INTERPRETATIONS**:
- User: "ignore everything from database legacy_crm" → ALL tables with `legacy_crm` in their path → TECHNICAL
- User: "ignore all customer data" → Tables: customers, subscribers, clients, users, members, accounts → ALL TECHNICAL
- User: "focus only on finance" → Tables NOT related to finance/accounting → TECHNICAL
- User: "exclude marketing tables" → Tables: campaigns, leads, prospects, marketing_*, ads_* → ALL TECHNICAL

**⚠️ FAILURE TO FOLLOW USER INSTRUCTIONS = ENTIRE OUTPUT REJECTED ⚠️**
"""
            else:
                additional_context_section = "*(No additional user filtering instructions provided)*"
            
            # Define strategy-specific rules
            strategy_rules = {
                "Aggressive": """**AGGRESSIVE FILTERING**:
- Apply STRICT interpretation of technical patterns
- Err on the side of EXCLUDING borderline tables
- Any table with >50% technical-looking columns → TECHNICAL
- Aggressively filter logs, metrics, configurations, snapshots, audits
- **Exception**: Only include if clearly core to business operations AND business name indicates relevance""",
                
                "Medium": """**MEDIUM FILTERING** (BALANCED APPROACH):
- Apply MODERATE interpretation of technical patterns
- Balance between business value and technical overhead
- Tables with >70% technical columns → TECHNICAL
- Filter obvious technical tables but preserve borderline cases
- When moderately uncertain, prefer BUSINESS classification""",
                
                "Low": """**LOW FILTERING** (PERMISSIVE APPROACH):
- Apply LENIENT interpretation of technical patterns
- Err on the side of INCLUDING borderline tables
- Only filter tables that are >90% pure IT infrastructure
- Preserve any table with even minor business relevance
- When uncertain, default to BUSINESS classification
- Include operational logs/audits that may contain business insights"""
            }
            
            strategy_rule_text = strategy_rules.get(exclusion_strategy, strategy_rules["Medium"])
            
            # Extract unique tables
            unique_tables = set()
            for (catalog, schema, table, column_name, data_type, comment) in db_details:
                fqtn = f"`{catalog}`.`{schema}`.`{table}`"
                unique_tables.add(fqtn)
            
            sorted_tables = sorted(unique_tables)
            total_tables = len(sorted_tables)
            
            self.logger.info(f"Filtering {total_tables} tables into business vs technical categories with '{exclusion_strategy}' strategy...")
            
            # Get the prompt template to estimate base size (using model-specific limits from TECHNICAL_CONTEXT)
            filter_tables_context_limit = get_max_context_chars("English", "FILTER_BUSINESS_TABLES_PROMPT")
            prompt_template = self.ai_agent.prompt_templates.get("FILTER_BUSINESS_TABLES_PROMPT", "")
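            # Include the user-instruction section: it is part of every batch prompt and can be well over 1,000 chars
            base_prompt_size = (
                len(prompt_template) + len(self.business_name) + len(industry or "General Business")
                + len(business_context or "General business operations") + len(strategy_rule_text)
                + len(additional_context_section)
            )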
            
            # Reserve 20% of model's context limit for safety margin
            available_chars = int(filter_tables_context_limit * 0.8) - base_prompt_size
            
            # Estimate chars per table (table name + markdown formatting) ~60 chars average
            estimated_chars_per_table = 60
            # Floor of 100 tables per batch; if that overshoots the limit, _process_filter_batch_recursive splits the batch
            max_tables_per_batch = max(100, available_chars // estimated_chars_per_table)
            
            self.logger.info(f"Base prompt size: {base_prompt_size} chars, Available for tables: {available_chars} chars")
            self.logger.info(f"Estimated max tables per batch: {max_tables_per_batch}")
            
            # Determine if batching is needed
            if total_tables <= max_tables_per_batch:
                # Can process all tables in one batch
                self.logger.info("All tables fit in one batch. Processing...")
                batches = [sorted_tables]
            else:
                # Need to batch
                num_batches = (total_tables + max_tables_per_batch - 1) // max_tables_per_batch
                self.logger.info(f"⚠️ Tables exceed single batch capacity. Splitting into {num_batches} batches...")
                batches = [sorted_tables[i:i+max_tables_per_batch] for i in range(0, total_tables, max_tables_per_batch)]
            
            # Process each batch with recursive splitting if needed (IN PARALLEL)
            business_tables_set = set()
            technical_tables_set = set()
            master_tables_set = set()
            transactional_tables_set = set()
            reference_tables_set = set()
            data_category_map = {}
            business_scores = {}
            
            # ADAPTIVE PARALLELISM: Calculate based on batches and tables
            classification_parallelism, reason = calculate_adaptive_parallelism(
                "domain_clustering", self.max_parallelism,
                num_items=total_tables,
                num_domains=len(batches),
                is_llm_operation=True, logger=self.logger
            )
            log_adaptive_parallelism_decision("domain_clustering", classification_parallelism, self.max_parallelism, reason)
            
            self.logger.info(f"Processing {len(batches)} classification batch(es) in parallel...")
            
            # Prepare tasks for parallel execution
            tasks = []
            for batch_idx, batch_tables in enumerate(batches, 1):
                task = (
                    self._process_filter_batch_recursive,
                    (batch_tables, batch_idx, len(batches), self.business_name,
                     industry or "General Business", business_context or "General business operations",
                     exclusion_strategy, strategy_rule_text, additional_context_section, 0, 10)
                )
                tasks.append(task)
            
            # Execute in parallel with centralized utility
            results = ParallelExecutor.execute_parallel(
                tasks=tasks,
                max_workers=classification_parallelism,
                task_name="Classification Batch",
                logger=self.logger,
                thread_name_prefix="FilterBatch",
                return_exceptions=True
            )
            
            # Merge results from all batches
            for batch_idx, batch_results in enumerate(results, 1):
                if isinstance(batch_results, Exception):
                    # Tables from a failed batch stay unclassified and are picked up by the retry pass below
                    self.logger.error(f"❌ Classification batch {batch_idx} failed: {batch_results}")
                    continue
                
                # Merge results
                for table_name, (classification, business_score, data_category) in batch_results.items():
                    if classification == 'BUSINESS':
                        if data_category == 'REFERENCE':
                            reference_tables_set.add(table_name)
                            data_category_map[table_name] = 'REFERENCE'
                        elif data_category == 'TRANSACTIONAL':
                            transactional_tables_set.add(table_name)
                            business_tables_set.add(table_name)
                            data_category_map[table_name] = 'TRANSACTIONAL'
                        else:
                            master_tables_set.add(table_name)
                            business_tables_set.add(table_name)
                            data_category_map[table_name] = 'MASTER'
                        business_scores[table_name] = business_score
                    elif classification == 'TECHNICAL':
                        technical_tables_set.add(table_name)
                        data_category_map[table_name] = 'TECHNICAL'
                        business_scores[table_name] = 0
                        if include_technical:
                            master_tables_set.add(table_name)
                            business_tables_set.add(table_name)
                
                self.logger.info(f"✅ Classification batch {batch_idx}/{len(batches)} completed")
            
            # === RETRY UNCLASSIFIED TABLES (UP TO 2 RETRIES) ===
            # Find tables that were not classified
            all_unique_tables = set()
            for detail in db_details:
                (catalog, schema, table, _, _, _) = detail
                fqtn = f"{catalog}.{schema}.{table}"
                all_unique_tables.add(fqtn)
            
            unclassified_tables = all_unique_tables - business_tables_set - technical_tables_set - reference_tables_set
            
            # Retry up to 2 times for unclassified tables
            for retry_attempt in range(1, 3):  # 2 retry attempts
                if not unclassified_tables:
                    break
                    
                self.logger.warning(f"🔄 RETRY {retry_attempt}/2: Found {len(unclassified_tables)} unclassified tables. Retrying classification...")
                
                # Prepare retry batch with table names (with backticks for compatibility)
                retry_batch_tables = [f"`{t.replace('.', '`.`')}`" for t in unclassified_tables]
                
                try:
                    # Retry classification for unclassified tables
                    retry_results = self._process_filter_batch_recursive(
                        batch_tables=retry_batch_tables,
                        batch_idx=f"RETRY_{retry_attempt}",
                        total_batches=1,
                        business_name=self.business_name,
                        industry=industry or "General Business",
                        business_context=business_context or "General business operations",
                        exclusion_strategy=exclusion_strategy,
                        strategy_rule_text=strategy_rule_text,
                        additional_context_section=additional_context_section,
                        max_depth=10
                    )
                    
                    # Merge retry results
                    for table_name, (classification, business_score, data_category) in retry_results.items():
                        if classification == 'BUSINESS':
                            if data_category == 'REFERENCE':
                                reference_tables_set.add(table_name)
                                data_category_map[table_name] = 'REFERENCE'
                            elif data_category == 'TRANSACTIONAL':
                                transactional_tables_set.add(table_name)
                                business_tables_set.add(table_name)
                                data_category_map[table_name] = 'TRANSACTIONAL'
                            else:
                                master_tables_set.add(table_name)
                                business_tables_set.add(table_name)
                                data_category_map[table_name] = 'MASTER'
                            business_scores[table_name] = business_score
                            self.logger.info(f"✅ Retry {retry_attempt} successful: {table_name} classified as BUSINESS")
                        elif classification == 'TECHNICAL':
                            technical_tables_set.add(table_name)
                            data_category_map[table_name] = 'TECHNICAL'
                            business_scores[table_name] = 0
                            if include_technical:
                                master_tables_set.add(table_name)
                                business_tables_set.add(table_name)
                            self.logger.info(f"✅ Retry {retry_attempt} successful: {table_name} classified as TECHNICAL")
                    
                    # Update unclassified list for next retry
                    unclassified_tables = all_unique_tables - business_tables_set - technical_tables_set - reference_tables_set
                    
                except Exception as retry_error:
                    self.logger.error(f"❌ Retry {retry_attempt} classification failed: {retry_error}")
            
            # After 2 retries, default remaining unclassified tables to BUSINESS + MASTER
            if unclassified_tables:
                self.logger.warning(f"⚠️ {len(unclassified_tables)} tables still unclassified after 2 retries. Defaulting to BUSINESS + MASTER.")
                for table_fqtn in unclassified_tables:
                    business_tables_set.add(table_fqtn)
                    master_tables_set.add(table_fqtn)
                    business_scores[table_fqtn] = 50  # Default medium score
                    data_category_map[table_fqtn] = 'MASTER'
                    self.logger.info(f"📋 Defaulted to BUSINESS/MASTER: {table_fqtn}")
            
            # Split db_details into business and technical
            business_details = []
            technical_details = []
            unclassified_tables_logged = set()  # Track tables already logged to avoid duplicates
            
            for detail in db_details:
                (catalog, schema, table, column_name, data_type, comment) = detail
                fqtn = f"{catalog}.{schema}.{table}"
                
                if fqtn in reference_tables_set:
                    # Reference tables are kept out of both detail lists
                    continue
                elif fqtn in business_tables_set:
                    business_details.append(detail)
                elif fqtn in technical_tables_set:
                    technical_details.append(detail)
                else:
                    # Safety net: unclassified tables were already defaulted to BUSINESS/MASTER above,
                    # so this branch should not be reached; if it is, assume business (safer to include)
                    if fqtn not in unclassified_tables_logged:
                        self.logger.warning(f"Table {fqtn} not classified by LLM after retry, defaulting to BUSINESS")
                        unclassified_tables_logged.add(fqtn)
                    business_details.append(detail)
                    business_scores[fqtn] = 50  # Default medium score
                    master_tables_set.add(fqtn)
                    data_category_map[fqtn] = 'MASTER'
                    business_tables_set.add(fqtn)
            
            self.logger.info(f"✅ Filtering complete: {len(business_tables_set)} business tables, {len(technical_tables_set)} technical tables, {len(reference_tables_set)} reference tables")
            # Only log technical tables (top 10), not business tables
            if technical_tables_set:
                technical_list = sorted(technical_tables_set)[:10]
                more_indicator = f" (showing 10 of {len(technical_tables_set)})" if len(technical_tables_set) > 10 else ""
                self.logger.info(f"Technical tables excluded{more_indicator}: {', '.join(technical_list)}")
            else:
                self.logger.debug("No technical tables to exclude.")
            
            # Log sample business scores
            if business_scores:
                sample_scores = sorted(business_scores.items(), key=lambda x: x[1], reverse=True)[:5]
                self.logger.info(f"Top business tables by score: {', '.join([f'{t}({s})' for t, s in sample_scores])}")
            
            return (business_details, technical_details, business_tables_set, technical_tables_set, business_scores, data_category_map, master_tables_set, transactional_tables_set, reference_tables_set)
            
        except Exception as e:
            self.logger.error(f"Failed to filter business vs technical tables: {e}. Proceeding with all tables.")
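            # On error, treat every table as BUSINESS/MASTER with the default score,
            # mirroring the defaults applied to unclassified tables above
            all_tables = set()
            default_scores = {}
            default_categories = {}
            for (catalog, schema, table, _, _, _) in db_details:
                fqtn = f"{catalog}.{schema}.{table}"
                all_tables.add(fqtn)
                default_scores[fqtn] = 50  # Default medium score
                default_categories[fqtn] = 'MASTER'
            # Return a separate set for master tables so the two results do not alias each other
            return (db_details, [], all_tables, set(), default_scores, default_categories, set(all_tables), set(), set())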

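    def _estimate_schema_markdown_size(self, db_details: list) -> int:
        """
        Approximate the character length of the markdown that
        _format_schema_for_prompt would emit for db_details, without building the string.
        """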
        if not db_details:
            return 0
        header_len = len("| column | type | column_description |\n| --- | --- | --- |")
        total = 0
        seen = set()
        for (catalog, schema, table, column_name, data_type, comment) in db_details:
            fqtn = f"`{catalog}`.`{schema}`.`{table}`"
            if fqtn not in seen:
                seen.add(fqtn)
                total += len("### ") + len(fqtn) + 1
                total += header_len + 1
            col = column_name or ""
            dtype = data_type or "unknown"
            desc = comment or ""
            total += len(col) + len(dtype) + len(desc) + 11
        total += len(seen)
        return total

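    def _split_columns_to_fit_context(self, column_details: list, base_prompt_size: int, context_limit: int) -> list:
        """
        Greedily pack whole tables into batches so that base_prompt_size plus the
        estimated schema size stays within context_limit.

        Tables are never split: a single table larger than the limit gets a batch
        of its own. Returns a list of column-detail lists, one per batch.
        """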
        if not column_details:
            return []
        table_columns = {}
        table_order = []
        for col in column_details:
            key = (col[0], col[1], col[2])
            if key not in table_columns:
                table_columns[key] = []
                table_order.append(key)
            table_columns[key].append(col)
        table_sizes = {}
        for key in table_order:
            table_sizes[key] = self._estimate_schema_markdown_size(table_columns[key])
        batches = []
        current_tables = []
        current_size = base_prompt_size
        for key in table_order:
            table_size = table_sizes[key]
            if current_tables and (current_size + table_size) > context_limit:
                batches.append(current_tables)
                current_tables = []
                current_size = base_prompt_size
            current_tables.append(key)
            current_size += table_size
        if current_tables:
            batches.append(current_tables)
        column_batches = []
        for table_keys in batches:
            batch_cols = []
            for key in table_keys:
                batch_cols.extend(table_columns[key])
            column_batches.append(batch_cols)
        return column_batches

    def _determine_tables_per_call(self, total_tables: int) -> int:
        """Pick how many tables to send per call: 1 for fewer than 50 tables, up to 5 for 400 or more."""
        if total_tables < 50:
            return 1
        if total_tables < 100:
            return 2
        if total_tables < 200:
            return 3
        if total_tables < 400:
            return 4
        return 5

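    def _split_by_table_limit(self, column_details: list, max_tables: int) -> list:
        """
        Split column details into batches of at most max_tables tables each,
        preserving first-seen table order. Returns [] if max_tables <= 0.
        """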
        if not column_details or max_tables <= 0:
            return []
        table_columns = {}
        table_order = []
        for col in column_details:
            key = (col[0], col[1], col[2])
            if key not in table_columns:
                table_columns[key] = []
                table_order.append(key)
            table_columns[key].append(col)
        batches = []
        current_tables = []
        for key in table_order:
            current_tables.append(key)
            if len(current_tables) >= max_tables:
                batch_cols = []
                for table_key in current_tables:
                    batch_cols.extend(table_columns[table_key])
                batches.append(batch_cols)
                current_tables = []
        if current_tables:
            batch_cols = []
            for table_key in current_tables:
                batch_cols.extend(table_columns[table_key])
            batches.append(batch_cols)
        return batches

    # === MODIFIED: _format_schema_for_prompt ===
    def _format_schema_for_prompt(self, db_details: list, load_column_tracking: bool = False) -> str:
        """
        Format schema for prompt, respecting column tracking from context splitting.
        
        If columns were dropped due to context limits, only include the tracked columns.
        Column tracking is loaded from disk on-demand, not kept in memory.
        
        Args:
            db_details: List of column tuples (catalog, schema, table, column, type, comment)
            load_column_tracking: If True, load column tracking from disk for SQL generation
        
        Returns:
            Formatted schema markdown
        """
        if not db_details: return ""
        tables = defaultdict(list)
        
        # Load column tracking from disk only when needed (SQL generation)
        column_tracking_cache = {}
        if load_column_tracking:
            # Get unique tables from db_details
            unique_tables = set()
            for (catalog, schema, table, _, _, _) in db_details:
                fqtn_plain = f"{catalog}.{schema}.{table}"
                unique_tables.add(fqtn_plain)
            
            # Load column tracking for tables that have it
            for fqtn in unique_tables:
                if self.storage_manager.has_column_tracking(fqtn):
                    tracked_cols = self.storage_manager.load_column_tracking(fqtn)
                    if tracked_cols:
                        column_tracking_cache[fqtn] = tracked_cols
        
        for (catalog, schema, table, column_name, data_type, comment) in db_details:
            fqtn = f"`{catalog}`.`{schema}`.`{table}`"
            fqtn_plain = f"{catalog}.{schema}.{table}"
            
            # Check if this table has column tracking (meaning columns were dropped)
            if fqtn_plain in column_tracking_cache:
                tracked_columns = column_tracking_cache[fqtn_plain]
                # Only include columns that were tracked (kept during splitting)
                if column_name in tracked_columns:
                    tables[fqtn].append((column_name, data_type or 'unknown', comment or ''))
            else:
                # No tracking for this table - include all columns
                tables[fqtn].append((column_name, data_type or 'unknown', comment or ''))
        
        markdown_parts = []
        for fqtn, columns in tables.items():
            if not columns:
                continue  # Skip tables with no columns (shouldn't happen but safety check)
            
            table_header = f"### {fqtn}"
            header = "| column | type | column_description |\n| --- | --- | --- |"
            rows = "\n".join([f"| {col} | {dtype} | {desc} |" for col, dtype, desc in columns])
            
            # Add a note if columns were dropped
            fqtn_plain = fqtn.replace('`', '')
            if fqtn_plain in column_tracking_cache:
                note = f"<!-- NOTE: Table has {len(columns)} columns (subset used for context fitting) -->"
                markdown_parts.append(f"{table_header}\n{note}\n{header}\n{rows}\n")
            else:
                markdown_parts.append(f"{table_header}\n{header}\n{rows}\n")
        
        return "\n".join(markdown_parts)

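    def _augment_columns_with_foreign_keys(self, column_details: list) -> list:
        """
        Append the columns of tables referenced by foreign keys from the tables
        already in column_details (one hop only).

        Each relation is read as rel[4], rel[5], rel[6] = referenced catalog, schema, table;
        a missing catalog or schema falls back to that of the referencing table.
        """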
        if not column_details or not self.data_loader or not getattr(self.data_loader, "foreign_key_graph", None):
            return column_details
        tables_present = {(c, s, t) for (c, s, t, _, _, _) in column_details}
        additional = []
        for base_table in list(tables_present):
            relations = self.data_loader.foreign_key_graph.get(base_table, [])
            for rel in relations:
                ref_catalog = rel[4] or base_table[0]
                ref_schema = rel[5] or base_table[1]
                ref_table = rel[6]
                ref_tuple = (ref_catalog, ref_schema, ref_table)
                if ref_tuple in tables_present:
                    continue
                try:
                    ref_cols = self.data_loader._get_table_details(
                        ref_catalog,
                        ref_schema,
                        ref_table,
                        apply_sampling=self.data_loader.enable_column_sampling
                    )
                    if ref_cols:
                        additional.extend(ref_cols)
                        tables_present.add(ref_tuple)
                        self.logger.debug(f"Auto-added foreign key table {ref_tuple} to batch context")
                except Exception as e:
                    self.logger.debug(f"Could not load foreign key table {ref_tuple}: {str(e)[:80]}")
        if additional:
            return column_details + additional
        return column_details

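    def _get_reverse_foreign_key_graph(self):
        """
        Build and cache a map from each referenced table to the foreign key relations
        that point at it.

        The cache is rebuilt only when the number of source tables in
        foreign_key_graph changes; relations added under an existing key are not detected.
        """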
        if not self.data_loader or not getattr(self.data_loader, "foreign_key_graph", None):
            return {}
        graph_size = len(self.data_loader.foreign_key_graph)
        cached_size = getattr(self, "_reverse_foreign_key_graph_size", -1)
        if hasattr(self, "_reverse_foreign_key_graph") and cached_size == graph_size:
            return self._reverse_foreign_key_graph
        reverse_graph = defaultdict(list)
        for src_key, rels in self.data_loader.foreign_key_graph.items():
            for rel in rels:
                ref_catalog = rel[4] or src_key[0]
                ref_schema = rel[5] or src_key[1]
                ref_table = rel[6]
                ref_key = (ref_catalog, ref_schema, ref_table)
                reverse_graph[ref_key].append(rel)
        self._reverse_foreign_key_graph = reverse_graph
        self._reverse_foreign_key_graph_size = graph_size
        return reverse_graph

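    def _augment_columns_with_reference_tables(self, column_details: list, reference_tables_set: set) -> list:
        """
        Append the columns of reference tables linked by a foreign key, in either
        direction, to a table already in column_details.

        Only tables whose plain name (catalog.schema.table) is in reference_tables_set are added.
        """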
        if not column_details or not self.data_loader or not getattr(self.data_loader, "foreign_key_graph", None):
            return column_details
        if not reference_tables_set:
            return column_details
        tables_present = {(c, s, t) for (c, s, t, _, _, _) in column_details}
        additional = []
        reverse_graph = self._get_reverse_foreign_key_graph()
        for base_table in list(tables_present):
            relations = self.data_loader.foreign_key_graph.get(base_table, [])
            for rel in relations:
                ref_catalog = rel[4] or base_table[0]
                ref_schema = rel[5] or base_table[1]
                ref_table = rel[6]
                ref_fqtn = f"{ref_catalog}.{ref_schema}.{ref_table}"
                if ref_fqtn not in reference_tables_set:
                    continue
                ref_tuple = (ref_catalog, ref_schema, ref_table)
                if ref_tuple in tables_present:
                    continue
                try:
                    ref_cols = self.data_loader._get_table_details(
                        ref_catalog,
                        ref_schema,
                        ref_table,
                        apply_sampling=self.data_loader.enable_column_sampling
                    )
                    if ref_cols:
                        additional.extend(ref_cols)
                        tables_present.add(ref_tuple)
                except Exception as e:
                    self.logger.debug(f"Could not load reference table {ref_tuple}: {str(e)[:80]}")
            reverse_relations = reverse_graph.get(base_table, [])
            for rel in reverse_relations:
                src_catalog = rel[0]
                src_schema = rel[1]
                src_table = rel[2]
                src_fqtn = f"{src_catalog}.{src_schema}.{src_table}"
                if src_fqtn not in reference_tables_set:
                    continue
                src_tuple = (src_catalog, src_schema, src_table)
                if src_tuple in tables_present:
                    continue
                try:
                    ref_cols = self.data_loader._get_table_details(
                        src_catalog,
                        src_schema,
                        src_table,
                        apply_sampling=self.data_loader.enable_column_sampling
                    )
                    if ref_cols:
                        additional.extend(ref_cols)
                        tables_present.add(src_tuple)
                except Exception as e:
                    self.logger.debug(f"Could not load reference table {src_tuple}: {str(e)[:80]}")
        if additional:
            return column_details + additional
        return column_details

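    def _augment_columns_with_related_tables(self, column_details: list) -> list:
        """
        Append the columns of tables linked by a foreign key, in either direction,
        to a table already in column_details.

        When self.data_category_map exists, only tables it lists with a
        non-TECHNICAL category are added.
        """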
        if not column_details or not self.data_loader or not getattr(self.data_loader, "foreign_key_graph", None):
            return column_details
        allowed_tables = None
        if hasattr(self, "data_category_map"):
            allowed_tables = {k for k, v in self.data_category_map.items() if v != "TECHNICAL"}
        tables_present = {(c, s, t) for (c, s, t, _, _, _) in column_details}
        additional = []
        reverse_graph = self._get_reverse_foreign_key_graph()
        for base_table in list(tables_present):
            relations = self.data_loader.foreign_key_graph.get(base_table, [])
            for rel in relations:
                ref_catalog = rel[4] or base_table[0]
                ref_schema = rel[5] or base_table[1]
                ref_table = rel[6]
                ref_fqtn = f"{ref_catalog}.{ref_schema}.{ref_table}"
                if allowed_tables is not None and ref_fqtn not in allowed_tables:
                    continue
                ref_tuple = (ref_catalog, ref_schema, ref_table)
                if ref_tuple in tables_present:
                    continue
                try:
                    ref_cols = self.data_loader._get_table_details(
                        ref_catalog,
                        ref_schema,
                        ref_table,
                        apply_sampling=self.data_loader.enable_column_sampling
                    )
                    if ref_cols:
                        additional.extend(ref_cols)
                        tables_present.add(ref_tuple)
                except Exception as e:
                    self.logger.debug(f"Could not load related table {ref_tuple}: {str(e)[:80]}")
            reverse_relations = reverse_graph.get(base_table, [])
            for rel in reverse_relations:
                src_catalog = rel[0]
                src_schema = rel[1]
                src_table = rel[2]
                src_fqtn = f"{src_catalog}.{src_schema}.{src_table}"
                if allowed_tables is not None and src_fqtn not in allowed_tables:
                    continue
                src_tuple = (src_catalog, src_schema, src_table)
                if src_tuple in tables_present:
                    continue
                try:
                    ref_cols = self.data_loader._get_table_details(
                        src_catalog,
                        src_schema,
                        src_table,
                        apply_sampling=self.data_loader.enable_column_sampling
                    )
                    if ref_cols:
                        additional.extend(ref_cols)
                        tables_present.add(src_tuple)
                except Exception as e:
                    self.logger.debug(f"Could not load related table {src_tuple}: {str(e)[:80]}")
        if additional:
            return column_details + additional
        return column_details

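    def _expand_tables_with_foreign_keys(self, tables: set) -> tuple:
        """
        Return (expanded, relationships): the input table names plus every table they
        reference via foreign keys, and readable "source.column → referenced.column" strings.
        """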
        if not tables or not self.data_loader or not getattr(self.data_loader, "foreign_key_graph", None):
            return tables, []
        expanded = set()
        relationships = []
        for tbl in tables:
            expanded.add(tbl)
            cat, sch, tbl_name = parse_three_level_name(tbl)
            if not (cat and sch and tbl_name):
                continue
            key = (cat, sch, tbl_name)
            for rel in self.data_loader.foreign_key_graph.get(key, []):
                ref_catalog = rel[4] or cat
                ref_schema = rel[5] or sch
                ref_table = rel[6]
                ref_str = f"{ref_catalog}.{ref_schema}.{ref_table}"
                expanded.add(ref_str)
                relationships.append(f"{cat}.{sch}.{tbl_name}.{rel[3]} → {ref_catalog}.{ref_schema}.{ref_table}.{rel[7]}")
        return expanded, relationships

    def _apply_progressive_truncation(self, use_case_id: str, directly_involved_details: list, 
                                      additional_details: list, unstructured_docs: str, 
                                      max_schema_size: int, base_prompt_size: int,
                                      directly_involved_tables: set = None) -> tuple:
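        """
        Apply progressive truncation to fit the schema within context limits.

        Strategy:
        1. Drop unstructured documents and check whether the schema fits.
        2. Drop the additional (not directly involved) tables and truncate every
           directly involved table with more than 250 columns to its first 250 columns.
        3. If it still does not fit, send the request to the LLM anyway;
           it might succeed despite exceeding the limit.
        4. If SQL generation fails, the calling code should:
           a. Switch model from Opus to Sonnet and retry
           b. Last resort: Remove this use case and fetch the highest-rated deduplicated use case

        Args:
            use_case_id: Use case identifier, used in log messages only
            directly_involved_details: Column tuples for tables the query uses directly
            additional_details: Column tuples for supplementary tables (dropped from step 2 onward)
            unstructured_docs: Unstructured document text (dropped in step 1)
            max_schema_size: Character budget for the formatted schema
            base_prompt_size: Size of the prompt without the schema (currently unused in this method)
            directly_involved_tables: Set of table names that are directly involved in the query

        Returns:
            tuple: (directly_involved_schema, additional_schema, final_unstructured_docs, was_truncated)
        """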
        from collections import defaultdict
        
        # Initialize directly_involved_tables if not provided
        if directly_involved_tables is None:
            directly_involved_tables = set()
        
        # Get business scores (default to 50 if not available)
        business_scores = getattr(self, 'business_scores', {})
        
        def is_table_directly_involved(table_name: str) -> bool:
            """Check if a table is directly involved in the query (must be protected)."""
            if not directly_involved_tables:
                return False
            
            # Normalize table name for comparison (remove backticks)
            clean_table = table_name.replace('`', '')
            
            # Check against all directly involved tables
            for involved_table in directly_involved_tables:
                clean_involved = involved_table.replace('`', '')
                if clean_table == clean_involved:
                    return True
            return False
        
        def group_columns_by_table(details):
            """Group column details by table."""
            tables = defaultdict(list)
            for detail in details:
                catalog, schema, table = detail[0], detail[1], detail[2]
                fqtn = f"{catalog}.{schema}.{table}"
                tables[fqtn].append(detail)
            return tables
        
        def rebuild_details_from_tables(tables_dict):
            """Flatten table dict back to detail list."""
            details = []
            for table_name in sorted(tables_dict.keys()):
                details.extend(tables_dict[table_name])
            return details
        
        def get_table_score(table_name):
            """Get business score for a table (0-100)."""
            # Normalize table name (remove backticks)
            clean_name = table_name.replace('`', '')
            return business_scores.get(clean_name, 50)  # Default to 50 if not found
        
        def calculate_total_size(directly_inv_details, additional_det, unstructured):
            """Calculate total size of schema context."""
            # Load column tracking for SQL generation
            directly_schema = self._format_schema_for_prompt(directly_inv_details, load_column_tracking=True)
            additional_schema = ""
            if additional_det:
                additional_schema = self._format_schema_for_prompt(additional_det, load_column_tracking=True)
            return len(directly_schema) + len(additional_schema) + len(unstructured)
        
        # Initial check
        total_size = calculate_total_size(directly_involved_details, additional_details, unstructured_docs)
        target_size = max_schema_size + len(unstructured_docs)  # Total allowed
        
        if total_size <= target_size:
            # No truncation needed
            directly_schema = self._format_schema_for_prompt(directly_involved_details, load_column_tracking=True)
            additional_schema = ""
            if additional_details:
                additional_schema = self._format_schema_for_prompt(additional_details, load_column_tracking=True)
            return (directly_schema, additional_schema, unstructured_docs, False)
        
        self.logger.warning(f"Use case {use_case_id}: Schema exceeds limit. Starting progressive truncation...")
        
        # STEP 1: Drop unstructured documents
        self.logger.info(f"Use case {use_case_id}: Step 1 - Dropping unstructured documents")
        unstructured_docs_truncated = ""
        total_size = calculate_total_size(directly_involved_details, additional_details, unstructured_docs_truncated)
        
        if total_size <= target_size:
            self.logger.info(f"Use case {use_case_id}: Fits after dropping unstructured docs (size: {total_size:,} chars)")
            directly_schema = self._format_schema_for_prompt(directly_involved_details, load_column_tracking=True)
            additional_schema = ""
            if additional_details:
                additional_schema = self._format_schema_for_prompt(additional_details, load_column_tracking=True)
            return (directly_schema, additional_schema, unstructured_docs_truncated, True)
        
        # STEP 2: Truncate tables with >250 columns to 250 columns
        # NOTE: Only directly involved tables are kept from here on (additional tables are dropped);
        # every one of them that exceeds 250 columns is truncated to its first 250 columns
        self.logger.info(f"Use case {use_case_id}: Step 2 - Truncating tables >250 columns to 250 columns")
        
        all_tables = group_columns_by_table(directly_involved_details)
        truncated_count = 0
        
        for table_name, columns in list(all_tables.items()):
            if len(columns) > 250:
                all_tables[table_name] = columns[:250]
                truncated_count += 1
                self.logger.debug(f"Use case {use_case_id}: Truncated {table_name} from {len(columns)} to 250 columns")
        
        directly_involved_details = rebuild_details_from_tables(all_tables)
        total_size = calculate_total_size(directly_involved_details, [], unstructured_docs_truncated)
        
        if total_size <= target_size:
            self.logger.info(f"Use case {use_case_id}: Fits after truncating {truncated_count} tables to 250 columns (size: {total_size:,} chars)")
            directly_schema = self._format_schema_for_prompt(directly_involved_details, load_column_tracking=True)
            additional_schema = ""  # No additional tables left
            return (directly_schema, additional_schema, unstructured_docs_truncated, True)
        
        # STEP 3: Take your chances anyway - send the request to the LLM
        # The LLM might succeed despite exceeding the context limit
        # If it fails, the calling code should:
        #   a. Switch model from Opus to Sonnet and retry
        #   b. Last resort: Remove this use case and fetch highest-rated deduplicated use case
        self.logger.warning(f"Use case {use_case_id}: Context still exceeds limit after truncation (size: {total_size:,} chars, limit: {target_size:,} chars)")
        self.logger.warning(f"Use case {use_case_id}: Taking chances anyway - sending request to LLM (it might succeed)")
        
        directly_schema = self._format_schema_for_prompt(directly_involved_details, load_column_tracking=True)
        additional_schema = ""  # No additional tables
        return (directly_schema, additional_schema, unstructured_docs_truncated, True)

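    def _sanitize_name(self, name: str) -> str:
        """Lowercase name and collapse every run of characters outside [a-z0-9] into one underscore; returns "_" if nothing remains."""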
        if not name: return "_"
        s = re.sub(r'[^a-z0-9_]', '_', str(name).lower())
        s = re.sub(r'_+', '_', s).strip('_')
        return s or "_"

    def _save_usecases_catalog_json(self, final_consolidated_use_cases: list, english_translations: dict, summary_dict: dict = None) -> dict:
        """
        Saves the usecases_catalog.json file with all the content for later doc generation.
        Returns the summary_dict used (computed if not provided).
        
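        JSON Structure:
        {
            "business_name": "...",
            "title": "...",
            "executive_summary": "...",
            "domains_summary": "domain1: X use cases, domain2: Y use cases, ...",
            "column_registry": {"<column_id>": "<column FQN>, <description>"},
            "table_registry": {"<table_id>": "<table FQN>"},
            "metadata": {"catalogs": [...], "schemas": [...], "generation_path": "..."},
            "domains": [
                {
                    "domain_name": "...",
                    "summary": "...",
                    "use_cases": [
                        {id, title, type, statement, etc...}
                    ]
                }
            ]
        }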
        """
        try:
            self.logger.info("Generating JSON Catalog...")
            
            # Group use cases by domain
            flat_english_use_cases = final_consolidated_use_cases
            _unsorted_grouped = self._group_use_cases_by_domain_flat(flat_english_use_cases)
            english_grouped_data = {k: _unsorted_grouped[k] for k in sorted(_unsorted_grouped.keys())}
            
            # Get summary if not provided
            if summary_dict is None:
                (summary_dict, transliterated_name) = self._get_salesy_summary(english_grouped_data, self.business_name, "English", english_translations)
            
            # Build domain summary string
            domain_counts = []
            for domain_name, use_cases in english_grouped_data.items():
                domain_counts.append(f"{domain_name}: {len(use_cases)} use cases")
            domains_summary = ", ".join(domain_counts)
            
            # Build the column registry: column ID -> "FQN, Description"
            import copy
            column_registry = {
                col_id: f"{info['fqn']}, {info['description']}"
                for col_id, info in self.id_column_map.items()
            }

            # Build the table registry: table ID -> table FQN
            table_registry = dict(self.id_table_map)

            # Create deep copy of use cases to modify for JSON output
            json_english_grouped_data = copy.deepcopy(english_grouped_data)

            # Pre-compute table -> column IDs map for faster lookup
            table_to_col_ids = defaultdict(list)
            for fqn, cid in self.column_id_map.items():
                # fqn is catalog.schema.table.column
                parts = fqn.split('.')
                if len(parts) >= 2:
                    table_fqn = ".".join(parts[:-1])
                    table_to_col_ids[table_fqn].append(cid)

            # Replace column names with IDs in the JSON copy
            for domain in json_english_grouped_data:
                for uc in json_english_grouped_data[domain]:
                    # 1. Replace column names with registry IDs.
                    #    "Columns Involved" and "Involved Columns" share the same lookup logic.
                    for col_key in ("Columns Involved", "Involved Columns"):
                        cols_value = uc.get(col_key, "")
                        if not cols_value:
                            continue
                        # Split by comma or newline
                        col_names = [c.strip() for c in re.split(r'[,\n]', cols_value) if c.strip()]
                        col_ids = []
                        for name in col_names:
                            # a. Exact FQN match
                            if name in self.column_id_map:
                                col_ids.append(self.column_id_map[name])
                                continue
                            # b. Suffix match: first column whose FQN ends with ".<name>"
                            matched_id = next((cid for fqn, cid in self.column_id_map.items() if fqn.endswith(f".{name}")), None)
                            # c. Substring match (risky, but better than nothing for IDs)
                            if matched_id is None:
                                matched_id = next((cid for fqn, cid in self.column_id_map.items() if name in fqn), None)
                            # d. Fallback: keep the original name if it is not in the registry
                            col_ids.append(matched_id if matched_id is not None else name)

                        uc[col_key] = ", ".join(col_ids)

                    # 2. Process directly_involved_schema (Full table schemas -> IDs)
                    # Requirement: directly_involved_schema should only have comma separated list of column ids from the registry
                    involved_tables = uc.get('_directly_involved_tables', [])
                    schema_col_ids = set()
                    table_ids = []
                    if involved_tables:
                        for table in involved_tables:
                            # table is usually FQN from discovery
                            if table in table_to_col_ids:
                                schema_col_ids.update(table_to_col_ids[table])
                            # Convert table name to table ID
                            if table in self.table_id_map:
                                table_ids.append(self.table_id_map[table])
                    
                    # Always overwrite directly_involved_schema with a comma-separated ID string
                    # (empty if no columns matched). Numeric IDs sort numerically; others sort last.
                    uc['directly_involved_schema'] = ", ".join(sorted(schema_col_ids, key=lambda x: int(x) if x.isdigit() else float('inf')))

                    # Convert directly_involved_tables to a comma-separated table ID string
                    uc['directly_involved_tables'] = ", ".join(sorted(table_ids, key=lambda x: int(x) if x.isdigit() else float('inf')))

                    # Remove old underscore-prefixed keys, plus generated/validated,
                    # which are now stored in notebook cells rather than in the JSON
                    for stale_key in ('_directly_involved_schema', '_directly_involved_tables', 'generated', 'validated'):
                        uc.pop(stale_key, None)
            
            # Build the JSON structure
            catalog_json = {
                "business_name": self.business_name,
                "title": f"{self.business_name} Use Cases Catalog",
                "executive_summary": summary_dict.get("Executive", ""),
                "domains_summary": domains_summary,
                "column_registry": column_registry,
                "table_registry": table_registry,
                "metadata": {
                    "catalogs": [c.strip() for c in self.catalogs_str.split(',') if c.strip()],
                    "schemas": [s.strip() for s in self.schemas_str.split(',') if s.strip()],
                    "generation_path": self.generation_path
                },
                "domains": []
            }
            
            # Add each domain from the ID-ified data
            for domain_name, use_cases in json_english_grouped_data.items():
                domain_obj = {
                    "domain_name": domain_name,
                    "summary": summary_dict.get(domain_name, f"Domain: {domain_name} with {len(use_cases)} use cases"),
                    "use_cases": use_cases
                }
                catalog_json["domains"].append(domain_obj)
            
            # Save to workspace
            json_path = os.path.join(self.docs_output_dir, f"{self.business_name}-dbx_inspire.json")
            self.logger.info(f"Saving JSON Catalog to: {json_path}")
            
            json_content = json.dumps(catalog_json, indent=2, ensure_ascii=False)
            json_data_b64 = base64.b64encode(json_content.encode('utf-8')).decode()
            
            self.w_client.workspace.import_(
                path=json_path,
                content=json_data_b64,
                format=workspace.ImportFormat.AUTO,
                overwrite=True
            )
            
            self.logger.info(f"✅ JSON Catalog saved successfully to {json_path}")
            log_print(f"✅ JSON Catalog saved to: {json_path}")
            
            return summary_dict
            
        except Exception as e:
            self.logger.error(f"Failed to save JSON Catalog: {e}")
            return summary_dict

    def _run_queries_fixing_mode(self):
        """
        SQL Regeneration Mode: Scans notebooks for failed SQL and regenerates them.
        
        This method:
        1. Scans all notebooks in the notebook output directory
        2. Finds SQL cells marked "regenerate_sql: Yes" (or the legacy "Regenerate: Yes") in the Inspire header
        3. Loads the JSON file to get schema info for regeneration
        4. Regenerates SQL using the WAVE PATTERN (same as normal generation)
        5. Updates the notebook cells with new SQL and "Regenerate:No"
        
        NOTEBOOKS are the source of truth for SQL status.
        
        Can be run multiple times until no failed queries remain.
        """
        import json
        import base64
        import re
        from collections import defaultdict
        from databricks.sdk.service import workspace
        
        log_print(f"\n📓 SCANNING NOTEBOOKS FOR FAILED SQL...")
        self.logger.info(f"Scanning notebooks in: {self.notebook_output_dir}")
        
        json_file_path = os.path.join(self.docs_output_dir, f"{self.business_name}-dbx_inspire.json")
        
        log_print(f"📁 JSON File (for schema): {json_file_path}")
        self.logger.info(f"Loading schema from: {json_file_path}")
        
        try:
            file_info = self.w_client.workspace.export(path=json_file_path, format=workspace.ExportFormat.AUTO)
            json_content = base64.b64decode(file_info.content).decode('utf-8')
            catalog_json = json.loads(json_content)
        except Exception as e:
            self.logger.error(f"Failed to load JSON file: {e}")
            log_print(f"❌ Failed to load JSON file: {e}", level="ERROR")
            return
        
        json_business_name = catalog_json.get("business_name", None)
        if json_business_name:
            self.business_name = json_business_name
            self.logger.info(f"Using business name from JSON: '{json_business_name}'")
        
        json_generation_path = catalog_json.get("metadata", {}).get("generation_path", None)
        if json_generation_path:
            self.generation_path = json_generation_path
            self.logger.info(f"Using generation path from JSON: '{json_generation_path}'")
        
        column_registry = catalog_json.get("column_registry", {})
        if not column_registry:
            self.logger.error("No column_registry in JSON. Cannot regenerate SQL.")
            log_print("❌ No column_registry found in JSON file.", level="ERROR")
            return
        
        self.logger.info(f"Building schema from JSON column_registry ({len(column_registry)} columns)...")
        log_print(f"📊 Building schema from JSON ({len(column_registry)} columns - NO database discovery)")
        
        id_to_info = {}
        full_schema_details = []
        schema_by_table = defaultdict(list)
        
        for cid, val in column_registry.items():
            parts = val.split(",", 1)
            fqn = parts[0].strip()
            description = parts[1].strip() if len(parts) > 1 else ""
            
            fqn_parts = fqn.split(".")
            if len(fqn_parts) >= 4:
                catalog = fqn_parts[0].strip('`')
                schema = fqn_parts[1].strip('`')
                table = fqn_parts[2].strip('`')
                column_name = ".".join(fqn_parts[3:]).strip('`')
            elif len(fqn_parts) == 3:
                catalog = ""
                schema = fqn_parts[0].strip('`')
                table = fqn_parts[1].strip('`')
                column_name = fqn_parts[2].strip('`')
            else:
                continue
            
            desc_parts = description.split(" - ", 1)
            # str.split always returns at least one element, so fall back on an empty value instead
            data_type = desc_parts[0].strip() or "STRING"
            comment = desc_parts[1].strip() if len(desc_parts) > 1 else ""
            
            detail = (catalog, schema, table, column_name, data_type, comment)
            full_schema_details.append(detail)
            
            fqtn = f"{catalog}.{schema}.{table}" if catalog else f"{schema}.{table}"
            fqtn_backticks = f"`{catalog}`.`{schema}`.`{table}`" if catalog else f"`{schema}`.`{table}`"
            schema_by_table[fqtn].append(detail)
            schema_by_table[fqtn_backticks].append(detail)
            
            id_to_info[cid] = {
                'fqn': fqn,
                'catalog': catalog,
                'schema': schema,
                'table': table,
                'column': column_name,
                'data_type': data_type,
                'comment': comment
            }
        
        self.logger.info(f"Rebuilt schema: {len(full_schema_details)} columns across {len(schema_by_table)//2} tables")
        
        # === CRITICAL FIX: Initialize lightweight DataLoader for dynamic table loading ===
        # This enables loading tables requested by users in regeneration instructions
        # that weren't in the original generation (e.g., "join with table X")
        if self.data_loader is None:
            self.logger.info("🔧 Initializing DataLoader for dynamic table loading in SQL Regeneration mode...")
            log_print(f"   📥 Enabling dynamic table loading for user-requested tables")
            try:
                # Extract unique catalogs from the existing schema
                existing_catalogs = set()
                for detail in full_schema_details:
                    if detail[0]:  # catalog name
                        existing_catalogs.add(detail[0])
                catalogs_str = ",".join(sorted(existing_catalogs)) if existing_catalogs else ""
                
                self.data_loader = DataLoader(
                    catalogs=catalogs_str,
                    schemas="",  # Don't restrict schemas - allow any table lookup
                    tables="",   # Don't restrict tables - allow any table lookup
                    logger=self.logger,
                    enable_two_pass=False,  # No bulk discovery needed
                    enable_column_sampling=False,  # Get full column list for requested tables
                    streaming_batch_size=100,
                    max_parallelism=self.scan_parallelism,
                    schema_timeout_seconds=300  # 5 min timeout for individual table lookups
                )
                self.logger.info(f"✅ DataLoader initialized for catalogs: {catalogs_str if catalogs_str else '(all)'}")
            except Exception as e:
                self.logger.warning(f"⚠️ Could not initialize DataLoader for dynamic loading: {e}")
                self.logger.warning("   User-requested tables not in JSON will cause hallucinated columns!")
                log_print(f"   ⚠️ Dynamic table loading unavailable - user-requested tables may have hallucinated columns", level="WARNING")
        
        # Build lookup from JSON use cases for schema info
        all_use_cases = []
        use_case_lookup = {}
        domain_lookup = {}
        
        for domain_idx, domain_obj in enumerate(catalog_json.get("domains", []), start=1):
            domain_name = domain_obj.get("domain_name", "General Operations")
            use_cases = domain_obj.get("use_cases", [])
            
            for uc in use_cases:
                schema_ids = uc.get("directly_involved_schema", "") or uc.get("_directly_involved_schema", "")
                if schema_ids:
                    schema_col_ids = [p.strip() for p in schema_ids.split(",")]
                    schema_lines = []
                    tables_seen = set()
                    
                    for cid in schema_col_ids:
                        if cid in id_to_info:
                            info = id_to_info[cid]
                            fqtn = f"{info['catalog']}.{info['schema']}.{info['table']}" if info['catalog'] else f"{info['schema']}.{info['table']}"
                            
                            if fqtn not in tables_seen:
                                if tables_seen:
                                    schema_lines.append("")
                                schema_lines.append(f"Table: {fqtn}")
                                schema_lines.append("Columns:")
                                tables_seen.add(fqtn)
                            
                            col_desc = f"  - {info['column']} ({info['data_type']})"
                            if info['comment']:
                                col_desc += f": {info['comment']}"
                            schema_lines.append(col_desc)
                    
                    uc["directly_involved_schema"] = "\n".join(schema_lines)
                
                uc['Business Domain'] = domain_name
                all_use_cases.append(uc)
                uc_id = uc.get('No', '')
                use_case_lookup[uc_id] = uc
                # CRITICAL FIX: Extract domain prefix from use case ID, not from JSON domain order
                # Use case IDs like N15-AI01 must match their notebook N15-xxx.ipynb
                try:
                    domain_prefix = uc_id.split('-')[0] if uc_id and '-' in uc_id else f"N{domain_idx:02d}"
                except Exception:
                    domain_prefix = f"N{domain_idx:02d}"
                domain_lookup[uc_id] = (domain_name, domain_prefix)
        
        # === SCAN NOTEBOOKS FOR FAILED SQL ===
        failed_use_cases = []
        notebook_status = {}  # uc_id -> (notebook_path, regenerate_sql, user_instructions)
        
        # Log what use case IDs are in the lookup for debugging
        self.logger.info(f"Use case IDs in lookup from JSON: {list(use_case_lookup.keys())[:20]}...")  # First 20
        
        log_print(f"\n🔍 Scanning notebooks for SQL status...")
        
        try:
            notebook_list = list(self.w_client.workspace.list(self.notebook_output_dir))
        except Exception as e:
            self.logger.error(f"Failed to list notebooks in {self.notebook_output_dir}: {e}")
            log_print(f"❌ Failed to list notebooks: {e}", level="ERROR")
            return
        
        # Match use case ID from header and find regenerate_sql status anywhere in the cell
        # Also support legacy Regenerate: format for backwards compatibility
        use_case_id_pattern = re.compile(r'--Use Case:\s*([A-Za-z0-9_-]+)\s*-', re.IGNORECASE)
        regenerate_sql_pattern = re.compile(r'(?:regenerate_sql|Regenerate):\s*(Yes|No)', re.IGNORECASE)
        instructions_pattern = re.compile(r'/\*\*Regeneration Instruction Start\s*(.*?)\s*Regeneration Instruction End\*\*/', re.DOTALL)
        
        for item in notebook_list:
            if not item.path.endswith('.ipynb'):
                continue
            
            try:
                file_info = self.w_client.workspace.export(path=item.path, format=workspace.ExportFormat.JUPYTER)
                notebook_json_str = base64.b64decode(file_info.content).decode('utf-8')
                notebook_json = json.loads(notebook_json_str)
                
                for cell in notebook_json.get('cells', []):
                    if cell.get('cell_type') != 'code':
                        continue
                    
                    source = cell.get('source', [])
                    if isinstance(source, list):
                        cell_content = ''.join(source)
                    else:
                        cell_content = source
                    
                    # Find use case ID from header
                    uc_match = use_case_id_pattern.search(cell_content)
                    if uc_match:
                        uc_id = uc_match.group(1).strip()
                        
                        # Find regenerate_sql status anywhere in the cell
                        regen_match = regenerate_sql_pattern.search(cell_content)
                        regenerate_sql = regen_match.group(1) if regen_match else 'No'
                        
                        user_instructions = ""
                        instructions_match = instructions_pattern.search(cell_content)
                        if instructions_match:
                            user_instructions = instructions_match.group(1).strip()
                        
                        needs_regenerate = (regenerate_sql.lower() == 'yes')
                        notebook_status[uc_id] = (item.path, regenerate_sql, user_instructions)
                        in_lookup = uc_id in use_case_lookup
                        self.logger.info(f"[{uc_id}] Found: regenerate_sql={regenerate_sql}, needs_regenerate={needs_regenerate}, in_lookup={in_lookup}")
                        
                        if needs_regenerate and not in_lookup:
                            self.logger.warning(f"[{uc_id}] SKIPPED: Use case needs regeneration but NOT found in JSON lookup!")
                        
                        if needs_regenerate and in_lookup:
                            uc = use_case_lookup[uc_id].copy()
                            uc['_notebook_path'] = item.path
                            uc['_notebook_regenerate'] = regenerate_sql
                            uc['_user_instructions'] = user_instructions
                            failed_use_cases.append(uc)
                            if user_instructions:
                                self.logger.info(f"[{uc_id}] Found in notebook with regenerate_sql={regenerate_sql}, User Instructions: {user_instructions[:100]}... -> NEEDS REGENERATION")
                                log_print(f"   📝 [{uc_id}] regenerate_sql:Yes with instructions: {user_instructions[:80]}...")
                            else:
                                self.logger.info(f"[{uc_id}] Found in notebook with regenerate_sql={regenerate_sql} -> NEEDS REGENERATION")
                                log_print(f"   🔄 [{uc_id}] regenerate_sql:Yes")
                        
            except Exception as e:
                # Log at warning level: an unparsed notebook means its failed SQL cells are silently skipped
                self.logger.warning(f"Could not parse notebook {item.path}: {e}")
                continue
        
        log_print(f"   • Notebooks scanned: {len(notebook_list)}")
        log_print(f"   • Use cases found in notebooks: {len(notebook_status)}")
        
        total_use_cases = len(all_use_cases)
        failed_count = len(failed_use_cases)
        
        log_print(f"\n📊 USE CASE STATISTICS:")
        log_print(f"   • Total Use Cases: {total_use_cases}")
        log_print(f"   • Failed/Missing SQL: {failed_count}")
        log_print(f"   • Valid SQL: {total_use_cases - failed_count}")
        
        if failed_count == 0:
            log_print(f"\n✅ SUCCESS: All {total_use_cases} use cases have valid SQL!")
            log_print(f"   No queries need to be fixed.")
            self.logger.info(f"No failed queries found. All {total_use_cases} use cases have valid SQL.")
            return
        
        log_print(f"\n🔧 REGENERATING SQL FOR {failed_count} FAILED USE CASES (using wave pattern):")
        for uc in failed_use_cases:
            uc_id = uc.get('No', 'UNKNOWN')
            uc_name = uc.get('Name', '')
            user_instructions = uc.get('_user_instructions', '')
            if user_instructions:
                log_print(f"   • [{uc_id}] {uc_name} (with user instructions)")
            else:
                log_print(f"   • [{uc_id}] {uc_name}")
        
        # === NEW: INTERPRET USER INSTRUCTIONS BEFORE SQL GENERATION (PARALLEL) ===
        # For use cases with user instructions, run them through the interpretation prompt first
        use_cases_with_instructions = [uc for uc in failed_use_cases if uc.get('_user_instructions', '').strip()]
        
        if use_cases_with_instructions:
            log_print(f"\n🔍 INTERPRETING USER INSTRUCTIONS FOR {len(use_cases_with_instructions)} USE CASES (parallel, max {self.max_parallelism} workers)...")
            self.logger.info(f"Interpreting user instructions for {len(use_cases_with_instructions)} use cases in parallel...")
            
            # Build available tables registry (done once, shared by all workers).
            # Reuse id_to_info so that 3-part FQNs (schema.table.column) are parsed the same
            # way as above instead of being mistaken for catalog.schema.table.
            available_tables = set()
            for info in id_to_info.values():
                if info['catalog']:
                    fqtn = f"{info['catalog']}.{info['schema']}.{info['table']}"
                else:
                    fqtn = f"{info['schema']}.{info['table']}"
                available_tables.add(fqtn)
            
            available_tables_list = sorted(list(available_tables))
            available_tables_registry = "\n".join([f"- {t}" for t in available_tables_list])
            
            def interpret_single_use_case(uc: dict) -> dict:
                """Interpret user instructions for a single use case. Returns the updated use case."""
                uc_id = uc.get('No', 'UNKNOWN')
                user_instructions = uc.get('_user_instructions', '')
                previous_sql = uc.get('SQL', '')
                
                self.logger.info(f"[{uc_id}] Interpreting user instructions: {user_instructions[:100]}...")
                
                interpret_prompt_vars = {
                    "use_case_id": uc_id,
                    "use_case_name": uc.get('Name', ''),
                    "business_domain": uc.get('Business Domain', ''),
                    "statement": uc.get('Statement', ''),
                    "solution": uc.get('Solution', ''),
                    "original_tables_involved": uc.get('Tables Involved', ''),
                    "previous_sql": previous_sql if previous_sql else "(No previous SQL)",
                    "available_tables_registry": available_tables_registry,
                    "user_regeneration_instructions": user_instructions
                }
                
                try:
                    interpretation_response = self.ai_agent.run_worker(
                        step_name=f"Interpret_Instructions_{uc_id}",
                        worker_prompt_path="INTERPRET_USER_SQL_REGENERATION_PROMPT",
                        prompt_vars=interpret_prompt_vars,
                        response_schema=None
                    )
                    
                    # Parse the JSON response
                    interpretation_response_clean = clean_json_response(interpretation_response)
                    interpretation_json = json.loads(interpretation_response_clean)
                    
                    # Extract interpreted information
                    interpretation_summary = interpretation_json.get('interpretation_summary', '')
                    tables_to_add = interpretation_json.get('tables_to_add', [])
                    tables_to_remove = interpretation_json.get('tables_to_remove', [])
                    final_tables_involved = interpretation_json.get('final_tables_involved', [])
                    new_tables_need_loading = interpretation_json.get('new_tables_need_loading', False)
                    technical_design_instructions = interpretation_json.get('technical_design_instructions', '')
                    special_requirements = interpretation_json.get('special_requirements', '')
                    
                    self.logger.info(f"[{uc_id}] Interpretation: {interpretation_summary[:100]}...")
                    self.logger.info(f"[{uc_id}] Tables to add: {tables_to_add}")
                    self.logger.info(f"[{uc_id}] Tables to remove: {tables_to_remove}")
                    self.logger.info(f"[{uc_id}] Final tables: {final_tables_involved}")
                    
                    # Update the use case's Tables Involved if interpretation changed them
                    if final_tables_involved:
                        uc['Tables Involved'] = ', '.join(final_tables_involved)
                        self.logger.info(f"[{uc_id}] Updated Tables Involved to: {uc['Tables Involved']}")
                    
                    # Determine which tables need schema loaded - check FINAL tables, not just tables_to_add
                    # The LLM may return tables_to_add=[] but have new tables in final_tables_involved
                    # schema_by_table stores each table under two keys (plain and backticked), hence // 2
                    self.logger.info(f"[{uc_id}] Checking {len(final_tables_involved)} tables against registry ({len(schema_by_table)//2} tables)...")
                    tables_needing_schema = []
                    for tbl in final_tables_involved:
                        tbl_clean = tbl.replace('`', '').strip()
                        tbl_variants = [tbl_clean, f"`{tbl_clean.replace('.', '`.`')}`"]
                        found_in_registry = any(v in schema_by_table for v in tbl_variants)
                        self.logger.debug(f"[{uc_id}] Table '{tbl_clean}' - in registry: {found_in_registry}")
                        if not found_in_registry:
                            tables_needing_schema.append(tbl_clean)
                            self.logger.info(f"[{uc_id}] 🔍 Table '{tbl_clean}' NOT in registry - will load from database")
                    
                    # Also include any explicitly listed tables_to_add
                    for tbl in tables_to_add:
                        tbl_clean = tbl.replace('`', '').strip()
                        if tbl_clean not in tables_needing_schema:
                            tables_needing_schema.append(tbl_clean)
                    
                    self.logger.info(f"[{uc_id}] Tables needing dynamic load: {len(tables_needing_schema)} - {tables_needing_schema}")
                    
                    if tables_needing_schema:
                        self.logger.info(f"[{uc_id}] Loading schema for {len(tables_needing_schema)} tables not in registry: {tables_needing_schema}")
                        additional_schema_lines = []
                        tables_loaded = []
                        tables_not_found = []
                        
                        for tbl_name in tables_needing_schema:
                            # Find columns for this table in schema_by_table
                            table_variants = [tbl_name, f"`{tbl_name.replace('.', '`.`')}`"]
                            found = False
                            for variant in table_variants:
                                if variant in schema_by_table:
                                    additional_schema_lines.append(f"\nTable: {tbl_name}")
                                    additional_schema_lines.append("Columns:")
                                    for detail in schema_by_table[variant]:
                                        cat, sch_name, tbl_nm, column_name, data_type, comment = detail
                                        col_desc = f"  - {column_name} ({data_type})"
                                        if comment:
                                            col_desc += f": {comment}"
                                        additional_schema_lines.append(col_desc)
                                    tables_loaded.append(tbl_name)
                                    found = True
                                    break
                            
                            if not found:
                                # Table not in registry - try to load dynamically from database
                                self.logger.info(f"[{uc_id}] 📥 Loading table '{tbl_name}' from database (not in JSON registry)...")
                                log_print(f"   📥 [{uc_id}] Loading schema for: {tbl_name}")
                                try:
                                    tbl_parts = tbl_name.replace('`', '').split('.')
                                    if len(tbl_parts) == 3:
                                        cat, sch_name, tbl_nm = tbl_parts
                                        self.logger.info(f"[{uc_id}] Parsed table: catalog={cat}, schema={sch_name}, table={tbl_nm}")
                                        # Dynamic loading requires a DataLoader instance on this object
                                        if getattr(self, 'data_loader', None) is not None:
                                            self.logger.info(f"[{uc_id}] DataLoader available - calling _get_table_details...")
                                            dynamic_details = self.data_loader._get_table_details(cat, sch_name, tbl_nm, apply_sampling=False)
                                            if dynamic_details:
                                                self.logger.info(f"[{uc_id}] ✅ Dynamically loaded {len(dynamic_details)} columns for table '{tbl_name}'")
                                                additional_schema_lines.append(f"\nTable: {tbl_name}")
                                                additional_schema_lines.append("Columns:")
                                                for detail in dynamic_details:
                                                    d_cat, d_sch, d_tbl, column_name, data_type, comment = detail
                                                    col_desc = f"  - {column_name} ({data_type})"
                                                    if comment:
                                                        col_desc += f": {comment}"
                                                    additional_schema_lines.append(col_desc)
                                                # Store for later merge into full_schema_details
                                                if '_dynamic_column_details' not in uc:
                                                    uc['_dynamic_column_details'] = []
                                                uc['_dynamic_column_details'].extend(dynamic_details)
                                                # Also add to schema_by_table for future lookups
                                                schema_by_table[tbl_name] = dynamic_details
                                                schema_by_table[f"`{cat}`.`{sch_name}`.`{tbl_nm}`"] = dynamic_details
                                                tables_loaded.append(tbl_name)
                                                found = True
                                            else:
                                                self.logger.warning(f"[{uc_id}] Table '{tbl_name}' exists but returned no columns")
                                                tables_not_found.append(tbl_name)
                                        else:
                                            self.logger.warning(f"[{uc_id}] DataLoader not available for dynamic table loading")
                                            tables_not_found.append(tbl_name)
                                    else:
                                        self.logger.warning(f"[{uc_id}] Invalid table name format '{tbl_name}' - expected catalog.schema.table")
                                        tables_not_found.append(tbl_name)
                                except Exception as e:
                                    self.logger.warning(f"[{uc_id}] Failed to load table '{tbl_name}' dynamically: {e}")
                                    tables_not_found.append(tbl_name)
                        
                        if additional_schema_lines:
                            # `or ''` also covers the case where the key exists but holds None
                            existing_schema = uc.get('directly_involved_schema') or ''
                            uc['directly_involved_schema'] = existing_schema + "\n" + "\n".join(additional_schema_lines)
                            self.logger.info(f"[{uc_id}] Successfully loaded schema for tables: {tables_loaded}")
                        
                        if tables_not_found:
                            self.logger.error(f"[{uc_id}] ⚠️ CRITICAL: Could not load schema for tables: {tables_not_found}. SQL generation may produce invalid column names!")
                            log_print(f"   ❌ [{uc_id}] FAILED to load tables: {', '.join(tables_not_found)}", level="ERROR")
                            uc['_tables_not_found'] = tables_not_found
                    
                    # Build the interpreted regeneration context for the SQL generation prompt
                    tables_not_found_warning = ""
                    if tables_not_found:
                        tables_not_found_warning = f"""
**⛔⛔⛔ CRITICAL WARNING: SCHEMA NOT AVAILABLE FOR REQUESTED TABLES ⛔⛔⛔**

The following tables requested by the user could NOT be loaded from Unity Catalog:
{', '.join(tables_not_found)}

**YOU MUST NOT USE THESE TABLES** - their schema is not available.
DO NOT hallucinate column names for these tables.
Instead, generate SQL using ONLY the tables that have schema available below.
If the user's request REQUIRES these tables and cannot be fulfilled without them,
generate a comment explaining which tables are missing.

"""
                    interpreted_context = f"""
**🔥🔥🔥 REGENERATION MODE - USER INSTRUCTIONS INTERPRETED 🔥🔥🔥**

The user has provided regeneration instructions which have been interpreted into the following technical requirements:
{tables_not_found_warning}
**INTERPRETATION SUMMARY:**
{interpretation_summary}

**FINAL TABLES TO USE:**
{', '.join(final_tables_involved) if final_tables_involved else uc.get('Tables Involved', '')}

**TECHNICAL DESIGN INSTRUCTIONS (MUST FOLLOW):**
{technical_design_instructions}

**SPECIAL REQUIREMENTS:**
{special_requirements if special_requirements else 'None'}

**🚨 CRITICAL: You MUST follow the technical design instructions above. They take precedence over the default solution approach. 🚨**
"""
                    uc['_interpreted_regeneration_context'] = interpreted_context
                    uc['_interpretation_status'] = 'success'
                    
                except Exception as e:
                    self.logger.warning(f"[{uc_id}] Failed to interpret user instructions: {e}. Proceeding with raw instructions.")
                    # Fallback: use raw instructions in a simpler format
                    uc['_interpreted_regeneration_context'] = f"""
**🔥 REGENERATION MODE - USER INSTRUCTIONS 🔥**

The user has provided the following instructions for regenerating this SQL query. You MUST follow these instructions:

**USER INSTRUCTIONS:**
{user_instructions}

**🚨 CRITICAL: Follow the user's instructions above. They take precedence over the default solution approach. 🚨**
"""
                    uc['_interpretation_status'] = 'fallback'
                
                return uc
            
            # Execute interpretation in parallel using ThreadPoolExecutor
            # ADAPTIVE PARALLELISM: Calculate based on use cases to interpret
            from concurrent.futures import ThreadPoolExecutor, as_completed
            
            interpretation_parallelism, reason = calculate_adaptive_parallelism(
                "sql_generation", self.max_parallelism,
                num_items=len(use_cases_with_instructions),
                is_llm_operation=True, logger=self.logger
            )
            log_adaptive_parallelism_decision("sql_generation", interpretation_parallelism, self.max_parallelism, reason)
            
            interpretation_results = {}
            with ThreadPoolExecutor(max_workers=interpretation_parallelism, thread_name_prefix="InterpretInstr") as executor:
                future_to_uc = {executor.submit(interpret_single_use_case, uc): uc.get('No', 'UNKNOWN') for uc in use_cases_with_instructions}
                
                completed_count = 0
                for future in as_completed(future_to_uc):
                    uc_id = future_to_uc[future]
                    try:
                        # as_completed() only yields finished futures, so result() returns immediately
                        # and a timeout here would never fire.
                        updated_uc = future.result()
                        interpretation_results[uc_id] = updated_uc
                        completed_count += 1
                        status = updated_uc.get('_interpretation_status', 'unknown')
                        if status == 'success':
                            log_print(f"   ✅ [{uc_id}] Interpretation complete ({completed_count}/{len(use_cases_with_instructions)})")
                        else:
                            log_print(f"   ⚠️ [{uc_id}] Using fallback instructions ({completed_count}/{len(use_cases_with_instructions)})")
                    except Exception as e:
                        self.logger.error(f"[{uc_id}] Interpretation failed with error: {e}")
                        log_print(f"   ❌ [{uc_id}] Interpretation error: {str(e)[:50]}")
                        # Find the original use case and apply fallback
                        for uc in use_cases_with_instructions:
                            if uc.get('No') == uc_id:
                                user_instr = uc.get('_user_instructions', '')
                                uc['_interpreted_regeneration_context'] = f"""
**🔥 REGENERATION MODE - USER INSTRUCTIONS 🔥**

The user has provided the following instructions for regenerating this SQL query. You MUST follow these instructions:

**USER INSTRUCTIONS:**
{user_instr}

**🚨 CRITICAL: Follow the user's instructions above. They take precedence over the default solution approach. 🚨**
"""
                                uc['_interpretation_status'] = 'fallback'
                                interpretation_results[uc_id] = uc
                                break
            
            # Update the failed_use_cases list with interpreted results
            for i, uc in enumerate(failed_use_cases):
                uc_id = uc.get('No', 'UNKNOWN')
                if uc_id in interpretation_results:
                    failed_use_cases[i] = interpretation_results[uc_id]
            
            log_print(f"   ✅ All {len(use_cases_with_instructions)} interpretations complete")
        
        # Merge dynamically loaded column details into full_schema_details for SQL generation
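        dynamic_columns_added = 0
        dynamic_tables_added = set()
        # Snapshot the tables that already have schema before merging. The check is per table, so it
        # must not be updated while a table's columns are still being appended; otherwise only the
        # first column of every dynamically loaded table would be merged.
        existing_tables = {f"{d[0]}.{d[1]}.{d[2]}" for d in full_schema_details}
        for uc in failed_use_cases:
            dynamic_details = uc.get('_dynamic_column_details', [])
            if not dynamic_details:
                continue
            tables_merged_from_uc = set()
            for detail in dynamic_details:
                fqtn = f"{detail[0]}.{detail[1]}.{detail[2]}"
                if fqtn in existing_tables:
                    continue
                full_schema_details.append(detail)
                dynamic_columns_added += 1
                dynamic_tables_added.add(fqtn)
                tables_merged_from_uc.add(fqtn)
            # Mark these tables as present so another use case does not merge them again
            existing_tables.update(tables_merged_from_uc)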
        
        if dynamic_columns_added > 0:
            self.logger.info(f"📥 Merged {dynamic_columns_added} dynamically loaded columns from {len(dynamic_tables_added)} tables into schema")
            for tbl in sorted(dynamic_tables_added):
                col_count = sum(1 for d in full_schema_details if f"{d[0]}.{d[1]}.{d[2]}" == tbl)
                self.logger.info(f"   📥 {tbl}: {col_count} columns loaded")
        
        self.logger.info(f"Starting SQL regeneration using WAVE PATTERN for {failed_count} failed use cases...")
        
        regenerated_use_cases = self._generate_sql_parallel(
            failed_use_cases, 
            full_schema_details, 
            ""
        )
        
        regenerated_count = 0
        still_failed_count = 0
        succeeded_list = []
        failed_list = []
        
        for result in regenerated_use_cases:
            uc_id = result.get('No', 'UNKNOWN')
            uc_name = result.get('Name', '')
            gen_status = result.get('generated', 'N')
            val_status = result.get('validated', 'D')
            
            is_success = gen_status == 'Y' and val_status in ['Y', 'D']
            if is_success:
                regenerated_count += 1
                succeeded_list.append((uc_id, uc_name, gen_status, val_status))
                self.logger.info(f"[{uc_id}] SQL regenerated successfully (generated={gen_status}, validated={val_status})")
            else:
                still_failed_count += 1
                failed_list.append((uc_id, uc_name, gen_status, val_status))
                self.logger.warning(f"[{uc_id}] SQL regeneration failed (generated={gen_status}, validated={val_status})")
            
            for orig_uc in all_use_cases:
                if orig_uc.get('No') == uc_id:
                    orig_uc['SQL'] = result.get('SQL', '')
                    orig_uc['sql_generation_status'] = result.get('sql_generation_status', '')
                    orig_uc['sql_validation_status'] = result.get('sql_validation_status', '')
                    orig_uc['generated'] = gen_status
                    orig_uc['validated'] = val_status
                    break
        
        log_print(f"\n{'='*80}")
        log_print(f"📊 QUERIES REGENERATION - FINAL REPORT")
        log_print(f"{'='*80}")
        log_print(f"\n📈 SUMMARY:")
        log_print(f"   • Total attempted: {len(regenerated_use_cases)}")
        log_print(f"   • ✅ Successfully regenerated: {regenerated_count}")
        log_print(f"   • ❌ Still failed: {still_failed_count}")
        
        if succeeded_list:
            log_print(f"\n✅ SUCCEEDED ({len(succeeded_list)}):")
            for uc_id, uc_name, gen, val in succeeded_list:
                log_print(f"   • [{uc_id}] {uc_name} (generated={gen}, validated={val})")
        
        if failed_list:
            log_print(f"\n❌ STILL FAILED ({len(failed_list)}):")
            for uc_id, uc_name, gen, val in failed_list:
                log_print(f"   • [{uc_id}] {uc_name} (generated={gen}, validated={val})")
        
        log_print(f"\n{'='*80}")
        
        # Update JSON with new SQL (but NOT generated/validated - those are in notebooks now)
        self.logger.info("Updating JSON file with regenerated SQL...")
        
        for domain_obj in catalog_json.get("domains", []):
            for uc in domain_obj.get("use_cases", []):
                uc_id = uc.get('No', '')
                for result in regenerated_use_cases:
                    if result.get('No') == uc_id:
                        uc['SQL'] = result.get('SQL', '')
                        # Remove generated/validated from JSON - they are tracked in the notebooks
                        uc.pop('generated', None)
                        uc.pop('validated', None)
                        break
        
        try:
            updated_json = json.dumps(catalog_json, indent=2, ensure_ascii=False)
            import_content = base64.b64encode(updated_json.encode('utf-8')).decode('utf-8')
            self.w_client.workspace.import_(
                path=json_file_path,
                content=import_content,
                format=workspace.ImportFormat.AUTO,
                overwrite=True
            )
            log_print(f"\n✅ JSON file updated: {json_file_path}")
            self.logger.info(f"JSON file updated successfully: {json_file_path}")
        except Exception as e:
            self.logger.error(f"Failed to update JSON file: {e}")
            log_print(f"⚠️ Failed to update JSON file: {e}", level="WARNING")
        
        # Update notebook cells with new SQL and reset header
        # FIX: Batch updates by notebook to avoid lost updates from eventual consistency
        # When multiple cells in the same notebook are updated sequentially, each load/save
        # cycle can hit stale cache, causing later saves to overwrite earlier updates.
        # Solution: Group by notebook, load ONCE, update ALL cells, save ONCE.
        self.logger.info("Updating notebook cells with regenerated SQL (batched by notebook)...")
        log_print(f"\n📓 UPDATING NOTEBOOKS (batched to prevent lost updates)...")
        
        # Group use cases by notebook path
        notebook_updates = {}  # notebook_path -> list of (use_case, domain_name, domain_prefix)
        for result in regenerated_use_cases:
            uc_id = result.get('No', 'UNKNOWN')
            gen_status = result.get('generated', 'N')
            val_status = result.get('validated', 'D')
            
            if gen_status == 'Y':
                domain_info = domain_lookup.get(uc_id, ('General', 'N01'))
                domain_name, domain_prefix = domain_info
                result['generated'] = 'Y'
                result['validated'] = 'Y' if val_status == 'Y' else ('Unknown' if val_status == 'D' else 'N')
                
                # Determine notebook path
                uc_prefix_match = re.match(r'^(N\d+)', uc_id)
                actual_prefix = uc_prefix_match.group(1) if uc_prefix_match else domain_prefix
                sanitized_domain = self._sanitize_name(domain_name)
                notebook_name = f"{actual_prefix}-{sanitized_domain}"
                notebook_path = os.path.join(self.notebook_output_dir, f"{notebook_name}.ipynb")
                
                if notebook_path not in notebook_updates:
                    notebook_updates[notebook_path] = []
                notebook_updates[notebook_path].append((result, domain_name, domain_prefix))
        
        # Update each notebook once with all its cells
        notebooks_updated = 0
        cells_updated = 0
        for notebook_path, updates in notebook_updates.items():
            batch_result = self._update_notebook_cells_batched(notebook_path, updates)
            if batch_result > 0:
                notebooks_updated += 1
                cells_updated += batch_result
        
        self.logger.info(f"Batched notebook update complete: {cells_updated} cells across {notebooks_updated} notebooks")
        
        log_print(f"\n{'='*80}")
        log_print(f"✅ SQL REGENERATION COMPLETE")
        log_print(f"{'='*80}")
        log_print(f"   • SQL Regenerated: {regenerated_count}/{failed_count}")
        log_print(f"   • Still Failed: {still_failed_count}")
        log_print(f"   • Notebooks Updated: {notebooks_updated}")
        
        if still_failed_count > 0:
            log_print(f"\n⚠️ {still_failed_count} use cases still have failed SQL.", level="WARNING")
            log_print(f"   Run the 'Re-generate SQL' operation again to retry these.")
        else:
            log_print(f"\n🎉 All use cases now have valid SQL!")
        
        log_print(f"{'='*80}\n")
        
        # === ALSO GENERATE SAMPLES IF generate_sample_result:Yes FOUND ===
        # Re-generate SQL mode also handles sample generation (but not vice versa)
        log_print(f"\n🔍 Checking for generate_sample_result:Yes flags...")
        self.logger.info("Re-generate SQL mode: Also checking for sample generation requests...")
        try:
            self._run_generate_sample_result_mode(called_from_sql_regen=True)
        except Exception as e:
            self.logger.warning(f"Sample generation after SQL regeneration failed: {e}")
            log_print(f"⚠️ Sample generation encountered an issue: {e}", level="WARNING")
        
        self._upload_log_file()
        AIAgent.get_summary_report()

    def _update_notebook_cells_batched(self, notebook_path: str, updates: list) -> int:
        """
        Update multiple cells in a single notebook with one load/save cycle.
        
        This prevents the lost update problem caused by eventual consistency in the
        Databricks workspace API. When updating sequentially (load-update-save for each cell),
        subsequent loads may return stale cached data, causing updates to be lost.
        
        Args:
            notebook_path: Path to the notebook file
            updates: List of tuples: (use_case_dict, domain_name, domain_prefix)
            
        Returns:
            Number of cells successfully updated (0 if the notebook could not be loaded or saved)
        """
        import base64
        import json
        from databricks.sdk.service import workspace
        
        notebook_name = os.path.basename(notebook_path).replace('.ipynb', '')
        self.logger.info(f"[BATCH] Loading notebook: {notebook_path} to update {len(updates)} cells")
        
        try:
            file_info = self.w_client.workspace.export(path=notebook_path, format=workspace.ExportFormat.JUPYTER)
            notebook_json_str = base64.b64decode(file_info.content).decode('utf-8')
            notebook_json = json.loads(notebook_json_str)
            self.logger.info(f"[BATCH] Successfully loaded notebook with {len(notebook_json.get('cells', []))} cells")
        except Exception as e:
            self.logger.warning(f"[BATCH] Could not load notebook {notebook_path}: {e}")
            return 0
        
        cells = notebook_json.get('cells', [])
        # The negative lookahead stops an ID such as "N01-AI1" from also matching "N01-AI10"
        inspire_header_pattern_template = r'--Use Case:\s*{}(?!\w)'
        generate_sample_pattern = re.compile(r'generate_sample_result:\s*(Yes|No)', re.IGNORECASE)
        
        cells_updated = 0
        
        for use_case, domain_name, domain_prefix in updates:
            uc_id = use_case.get('No', 'UNKNOWN')
            sql_raw = use_case.get('SQL', '')
            use_case_name = use_case.get('Name', '')
            user_instructions = use_case.get('_user_instructions', '')
            
            sql_lines = sql_raw.split('\n')
            sql_lines_clean = []
            skip_header = True
            for line in sql_lines:
                line_stripped = line.strip().lower()
                if skip_header and (line_stripped.startswith('-- use case') or line_stripped.startswith('--use case')):
                    continue
                if skip_header and line_stripped.startswith('--') and not line_stripped.startswith('-- step') and not line_stripped.startswith('--step'):
                    if len(line_stripped) > 2 and not any(kw in line_stripped for kw in ['with', 'select', 'cte', 'step']):
                        continue
                skip_header = False
                sql_lines_clean.append(line)
            sql = '\n'.join(sql_lines_clean)
            
            # Treat whitespace-only SQL as empty so a cell is never overwritten with a blank query
            if not sql.strip():
                self.logger.warning(f"[{uc_id}] No SQL content to update")
                continue
            
            if user_instructions:
                inspire_instructions_block = f"/**Regeneration Instruction Start\n{user_instructions}\nRegeneration Instruction End**/\n\n"
            else:
                inspire_instructions_block = "/**Regeneration Instruction Start\n\nRegeneration Instruction End**/\n\n"
            
            inspire_header_pattern = re.compile(inspire_header_pattern_template.format(re.escape(uc_id)))
            
            cell_found = False
            for cell in cells:
                if cell.get('cell_type') != 'code':
                    continue
                
                source = cell.get('source', [])
                if isinstance(source, list):
                    cell_content = ''.join(source)
                else:
                    cell_content = source
                
                if inspire_header_pattern.search(cell_content):
                    sample_match = generate_sample_pattern.search(cell_content)
                    existing_sample_result = sample_match.group(1) if sample_match else 'No'
                    
                    updated_header = f"--Use Case: {uc_id} - {use_case_name}\n--generate_sample_result:{existing_sample_result}\n--regenerate_sql:No\n"
                    new_cell_content = updated_header + inspire_instructions_block + sql + "\n"
                    cell['source'] = [new_cell_content]
                    cell_found = True
                    cells_updated += 1
                    self.logger.info(f"[{uc_id}] Cell updated in batch (regenerate_sql:No, generate_sample_result:{existing_sample_result})")
                    break
            
            if not cell_found:
                self.logger.warning(f"[{uc_id}] Could not find matching cell in {notebook_path}")
        
        if cells_updated > 0:
            try:
                updated_notebook_str = json.dumps(notebook_json, indent=2)
                import_content = base64.b64encode(updated_notebook_str.encode('utf-8')).decode('utf-8')
                self.w_client.workspace.import_(
                    path=notebook_path,
                    content=import_content,
                    format=workspace.ImportFormat.JUPYTER,
                    overwrite=True
                )
                self.logger.info(f"[BATCH] Saved notebook {notebook_path} with {cells_updated} updated cells")
                log_print(f"   ✅ [{notebook_name}] {cells_updated} cells updated (batched save)")
            except Exception as e:
                self.logger.error(f"[BATCH] Failed to save notebook {notebook_path}: {e}")
                return 0
        
        return cells_updated

    def _update_notebook_sql_cell_with_header(self, use_case: dict, domain_name: str, domain_prefix: str) -> bool:
        """
        Updates the SQL cell in an existing notebook with regenerated SQL and Inspire header.
        
        The header format is:
        --Use Case: <ID> - <Name>
        --generate_sample_result:<Yes|No>  (the existing value is preserved)
        --regenerate_sql:No
        /**Regeneration Instruction Start ... Regeneration Instruction End**/
        
        Returns True if the notebook was successfully updated.
        """
        import base64
        import json
        from databricks.sdk.service import workspace
        
        uc_id = use_case.get('No', 'UNKNOWN')
        sql_raw = use_case.get('SQL', '')
        
        # Strip LLM-generated header lines to avoid duplication (our header already has use case info)
        sql_lines = sql_raw.split('\n')
        sql_lines_clean = []
        skip_header = True
        for line in sql_lines:
            line_stripped = line.strip().lower()
            if skip_header and (line_stripped.startswith('-- use case') or line_stripped.startswith('--use case')):
                continue
            if skip_header and line_stripped.startswith('--') and not line_stripped.startswith('-- step') and not line_stripped.startswith('--step'):
                if len(line_stripped) > 2 and not any(kw in line_stripped for kw in ['with', 'select', 'cte', 'step']):
                    continue
            skip_header = False
            sql_lines_clean.append(line)
        sql = '\n'.join(sql_lines_clean)
        
        # Treat whitespace-only SQL as empty so the cell is never overwritten with a blank query
        if not sql.strip():
            self.logger.warning(f"[{uc_id}] No SQL content to update")
            return False
        
        uc_prefix_match = re.match(r'^(N\d+)', uc_id)
        if uc_prefix_match:
            actual_prefix = uc_prefix_match.group(1)
        else:
            actual_prefix = domain_prefix
        
        sanitized_domain = self._sanitize_name(domain_name)
        notebook_name = f"{actual_prefix}-{sanitized_domain}"
        notebook_path = os.path.join(self.notebook_output_dir, f"{notebook_name}.ipynb")
        
        self.logger.info(f"[{uc_id}] Attempting to update notebook: {notebook_path}")
        
        try:
            file_info = self.w_client.workspace.export(path=notebook_path, format=workspace.ExportFormat.JUPYTER)
            notebook_json_str = base64.b64decode(file_info.content).decode('utf-8')
            notebook_json = json.loads(notebook_json_str)
            self.logger.info(f"[{uc_id}] Successfully loaded notebook with {len(notebook_json.get('cells', []))} cells")
        except Exception as e:
            self.logger.warning(f"[{uc_id}] Could not find notebook for domain '{domain_name}' at {notebook_path}: {e}")
            return False
        
        cells = notebook_json.get('cells', [])
        cell_updated = False
        
        use_case_name = use_case.get('Name', '')
        
        # Preserve user instructions if they were passed (from the notebook scan)
        user_instructions = use_case.get('_user_instructions', '')
        if user_instructions:
            inspire_instructions_block = f"/**Regeneration Instruction Start\n{user_instructions}\nRegeneration Instruction End**/\n\n"
        else:
            inspire_instructions_block = "/**Regeneration Instruction Start\n\nRegeneration Instruction End**/\n\n"
        
        code_cells_count = 0
        # The negative lookahead stops an ID such as "N01-AI1" from also matching "N01-AI10"
        inspire_header_pattern = re.compile(r'--Use Case:\s*' + re.escape(uc_id) + r'(?!\w)')
        generate_sample_pattern = re.compile(r'generate_sample_result:\s*(Yes|No)', re.IGNORECASE)
        
        for cell in cells:
            if cell.get('cell_type') != 'code':
                continue
            
            code_cells_count += 1
            source = cell.get('source', [])
            if isinstance(source, list):
                cell_content = ''.join(source)
            else:
                cell_content = source
            
            # Match on Inspire header with this use case ID
            if inspire_header_pattern.search(cell_content):
                # Preserve existing generate_sample_result value
                sample_match = generate_sample_pattern.search(cell_content)
                existing_sample_result = sample_match.group(1) if sample_match else 'No'
                
                # Build header with regenerate_sql:No (just regenerated) but preserve generate_sample_result
                updated_header = f"--Use Case: {uc_id} - {use_case_name}\n--generate_sample_result:{existing_sample_result}\n--regenerate_sql:No\n"
                
                # Build new cell content with updated header and instructions block
                new_cell_content = updated_header + inspire_instructions_block + sql + "\n"
                cell['source'] = [new_cell_content]
                cell_updated = True
                self.logger.info(f"[{uc_id}] Found matching cell by Inspire header, updating SQL content (regenerate_sql:No, generate_sample_result:{existing_sample_result})")
                break
        
        if not cell_updated:
            self.logger.warning(f"[{uc_id}] Could not find Inspire header in notebook {notebook_path} (searched {code_cells_count} code cells)")
            return False
        
        try:
            updated_notebook_str = json.dumps(notebook_json, indent=2)
            import_content = base64.b64encode(updated_notebook_str.encode('utf-8')).decode('utf-8')
            self.w_client.workspace.import_(
                path=notebook_path,
                content=import_content,
                format=workspace.ImportFormat.JUPYTER,
                overwrite=True
            )
            self.logger.info(f"Updated notebook cell for [{uc_id}] in {notebook_path}")
            log_print(f"   ✅ [{uc_id}] Updated in {notebook_name}")
            return True
        except Exception as e:
            self.logger.error(f"Failed to update notebook for [{uc_id}]: {e}")
            return False

    def _extract_clean_sql_error(self, error: Exception) -> str:
        """
        Extract clean error message from SQL execution exception.
        Removes JVM stack traces and internal details, keeping only the core error message.
        
        Args:
            error: The exception raised during SQL execution
            
        Returns:
            str: Clean, concise error message suitable for AI fix prompt
        """
        error_str = str(error)
        
        if 'JVM stacktrace:' in error_str:
            error_str = error_str.split('JVM stacktrace:')[0].strip()
        
        if 'SQLSTATE:' in error_str:
            sqlstate_idx = error_str.find('SQLSTATE:')
            semicolon_after = error_str.find(';', sqlstate_idx)
            if semicolon_after != -1:
                error_str = error_str[:semicolon_after + 20] if semicolon_after + 20 < len(error_str) else error_str[:semicolon_after + 1]
        
        lines = error_str.split('\n')
        clean_lines = []
        for line in lines:
            line_stripped = line.strip()
            if line_stripped.startswith("'") and ('+- ' in line_stripped or '   ' in line_stripped):
                continue
            if 'at org.apache' in line or 'at com.databricks' in line or 'at scala.' in line:
                continue
            if line_stripped:
                clean_lines.append(line_stripped)
        
        clean_error = ' '.join(clean_lines[:5])
        
        if len(clean_error) > 500:
            clean_error = clean_error[:500] + '...'
        
        return clean_error

    def _fix_sql_with_retry(self, uc_id: str, uc_name: str, sql: str, error_msg: str, 
                            use_case_lookup: dict, schema_lookup: dict, max_retries: int = 2) -> tuple:
        """
        Attempt to fix SQL using AI and retry execution.
        
        Args:
            uc_id: Use case ID
            uc_name: Use case name
            sql: Original SQL query that failed
            error_msg: Clean error message from execution failure
            use_case_lookup: Dictionary mapping use case IDs to their details
            schema_lookup: Dictionary mapping table names to their column schemas
            max_retries: Maximum number of fix attempts (default: 2)
            
        Returns:
            tuple: (success: bool, fixed_sql: str or None, result_pdf: pandas DataFrame or None)
        """
        current_sql = sql
        current_error = error_msg
        
        for attempt in range(1, max_retries + 1):
            log_print(f"   🔧 [{uc_id}] Fix attempt {attempt}/{max_retries}...")
            
            use_case_details = use_case_lookup.get(uc_id, {})
            
            tables_involved = use_case_details.get('Tables Involved', '')
            directly_involved_schema = ""
            if tables_involved and schema_lookup:
                schema_parts = []
                for tbl in tables_involved.split(','):
                    tbl = tbl.strip()
                    if tbl in schema_lookup:
                        schema_parts.append(f"-- Table: {tbl}\n{schema_lookup[tbl]}")
                directly_involved_schema = '\n\n'.join(schema_parts) if schema_parts else ""
            
            fix_prompt_vars = {
                "use_case_id": uc_id,
                "use_case_name": uc_name,
                "business_domain": use_case_details.get('Business Domain', ''),
                "statement": use_case_details.get('Statement', ''),
                "tables_involved": tables_involved,
                "directly_involved_schema": directly_involved_schema,
                "original_sql": current_sql,
                "explain_error": current_error,
                "use_case_columns": use_case_details.get('Involved Columns', '') or use_case_details.get('Columns Involved', '') or ""
            }
            
            # Reset per attempt so the except block can tell whether the AI returned a new query
            fixed_sql = None
            try:
                fixed_sql = self.ai_agent.run_worker(
                    step_name=f"Fix_SQL_Sample_{uc_id}_Attempt{attempt}",
                    worker_prompt_path="USE_CASE_SQL_FIX_PROMPT",
                    prompt_vars=fix_prompt_vars,
                    response_schema=None,
                    timeout_override=120,
                    max_retries_override=2
                )
                
                if not fixed_sql or fixed_sql.strip() == current_sql.strip():
                    log_print(f"   ⚠️ [{uc_id}] Fix returned empty or unchanged SQL, skipping execution", level="WARNING")
                    continue
                
                df = self.spark.sql(fixed_sql)
                pdf = df.toPandas()
                
                log_print(f"   ✅ [{uc_id}] SQL fixed successfully on attempt {attempt}")
                return (True, fixed_sql, pdf)
                
            except Exception as fix_error:
                current_error = self._extract_clean_sql_error(fix_error)
                # Feed the latest failing query into the next attempt, if one was produced
                if fixed_sql:
                    current_sql = fixed_sql
                log_print(f"   ❌ [{uc_id}] Fix attempt {attempt} failed: {current_error[:80]}...", level="ERROR")
        
        return (False, None, None)

    def _load_json_for_sample_fixing(self) -> tuple:
        """
        Load JSON catalog to get use case details and schema for SQL fixing.
        
        Returns:
            tuple: (use_case_lookup: dict, schema_lookup: dict)
        """
        from collections import defaultdict
        
        json_file_path = os.path.join(self.docs_output_dir, f"{self.business_name}-dbx_inspire.json")
        use_case_lookup = {}
        schema_lookup = {}
        
        try:
            file_info = self.w_client.workspace.export(path=json_file_path, format=workspace.ExportFormat.AUTO)
            json_content = base64.b64decode(file_info.content).decode('utf-8')
            catalog_json = json.loads(json_content)
            
            domains_data = catalog_json.get("domains", {})
            # "domains" may be a dict keyed by domain name or a plain list of domain objects
            if isinstance(domains_data, dict):
                domain_entries = list(domains_data.values())
            elif isinstance(domains_data, list):
                domain_entries = domains_data
            else:
                domain_entries = []
            
            for domain_data in domain_entries:
                if isinstance(domain_data, dict):
                    for uc in domain_data.get("use_cases", []):
                        uc_id = uc.get('No', '')
                        if uc_id:
                            use_case_lookup[uc_id] = uc
            
            column_registry = catalog_json.get("column_registry", {})
            schema_by_table = defaultdict(list)
            
            for cid, val in column_registry.items():
                parts = val.split(",", 1)
                fqn = parts[0].strip()
                description = parts[1].strip() if len(parts) > 1 else ""
                
                fqn_parts = fqn.split('.')
                if len(fqn_parts) >= 3:
                    table_fqn = '.'.join(fqn_parts[:-1])
                    col_name = fqn_parts[-1]
                    schema_by_table[table_fqn].append(f"  - {col_name}: {description[:100]}" if description else f"  - {col_name}")
            
            for tbl, cols in schema_by_table.items():
                schema_lookup[tbl] = '\n'.join(cols)
            
            self.logger.info(f"Loaded {len(use_case_lookup)} use cases and {len(schema_lookup)} table schemas for SQL fixing")
            
        except Exception as e:
            self.logger.warning(f"Could not load JSON for SQL fixing: {e}")
        
        return (use_case_lookup, schema_lookup)

    def _update_notebook_cell_with_fixed_sql(self, notebook_path: str, uc_id: str, fixed_sql: str) -> bool:
        """
        Update a notebook cell with fixed SQL after successful fix.
        
        Args:
            notebook_path: Path to the notebook file
            uc_id: Use case ID to find the cell
            fixed_sql: The fixed SQL to replace the original
            
        Returns:
            bool: True if update was successful
        """
        try:
            file_info = self.w_client.workspace.export(path=notebook_path, format=workspace.ExportFormat.JUPYTER)
            notebook_json_str = base64.b64decode(file_info.content).decode('utf-8')
            notebook_json = json.loads(notebook_json_str)
            
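            # Negative lookahead prevents a prefix match (e.g. ID "UC-1" matching the "UC-10" cell)
            inspire_header_pattern = re.compile(r'--Use Case:\s*' + re.escape(uc_id) + r'(?![A-Za-z0-9_-])')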
            
            for cell in notebook_json.get('cells', []):
                if cell.get('cell_type') != 'code':
                    continue
                
                source = cell.get('source', [])
                if isinstance(source, list):
                    cell_content = ''.join(source)
                else:
                    cell_content = source
                
                if inspire_header_pattern.search(cell_content):
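                    # Keep leading `--` lines and the entire /** ... **/ block, including
                    # blank and free-text lines inside it, so the comment is never left unclosed.
                    header_lines = []
                    in_block = False
                    for line in cell_content.split('\n'):
                        stripped = line.strip()
                        if in_block or stripped.startswith('/*'):
                            header_lines.append(line)
                            in_block = '*/' not in stripped
                        elif stripped.startswith('--') or not stripped:
                            header_lines.append(line)
                        else:
                            break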
                    
                    new_cell_content = '\n'.join(header_lines) + '\n' + fixed_sql + '\n'
                    cell['source'] = [new_cell_content]
                    
                    updated_notebook_str = json.dumps(notebook_json, indent=2)
                    import_content = base64.b64encode(updated_notebook_str.encode('utf-8')).decode('utf-8')
                    self.w_client.workspace.import_(
                        path=notebook_path,
                        content=import_content,
                        format=workspace.ImportFormat.JUPYTER,
                        overwrite=True
                    )
                    
                    self.logger.info(f"Updated notebook cell for [{uc_id}] with fixed SQL")
                    return True
            
            return False
            
        except Exception as e:
            self.logger.error(f"Failed to update notebook for [{uc_id}]: {e}")
            return False

    def _run_generate_sample_result_mode(self, called_from_sql_regen: bool = False):
        """
        Generate Sample Result mode: scan notebooks for generate_sample_result:Yes and execute the SQL.
        
        This method:
        1. Scans all notebooks in the notebook output directory
        2. Finds SQL cells with "generate_sample_result:Yes" in the Inspire header
        3. Executes each SQL query with the SparkSession
        4. If execution fails, attempts to fix the SQL with AI (up to 2 attempts)
        5. Collects up to 10 rows per use case, preferring rows where
           ai_sys_importance=High AND ai_sys_urgency=High
        6. Generates one Excel workbook per notebook with one sheet per use case ID (transposed)
        7. Generates an MD file with all results
        8. Saves outputs under the "sample" folder of the base output directory
        
        Args:
            called_from_sql_regen: If True, this is being called from Re-generate SQL mode
                                  (no samples found is INFO, not WARNING)
        """
        import re
        import json
        import base64
        import pandas as pd
        
        # Try to import openpyxl, install if needed, fall back to markdown-only if unavailable
        excel_available = False
        try:
            from openpyxl import Workbook
            from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
            from openpyxl.utils.dataframe import dataframe_to_rows
            excel_available = True
        except ImportError:
            self.logger.info("openpyxl not found, attempting to install...")
            try:
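                import subprocess
                import sys
                # Use the running interpreter's pip so the package lands in this environment
                subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'openpyxl', '-q'])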
                from openpyxl import Workbook
                from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
                from openpyxl.utils.dataframe import dataframe_to_rows
                excel_available = True
                self.logger.info("Successfully installed openpyxl")
            except Exception as install_err:
                self.logger.warning(f"Could not install openpyxl: {install_err}. Will generate markdown only.")
                log_print(f"⚠️ openpyxl unavailable - sample results will be generated as markdown only", level="WARNING")
        
        if not called_from_sql_regen:
            log_print(f"\n📊 GENERATE SAMPLE RESULT MODE")
            log_print(f"{'='*80}")
        log_print(f"Scanning notebooks for generate_sample_result:Yes...")
        
        sample_output_dir = os.path.join(self.base_output_dir, "sample")
        excel_output_dir = os.path.join(sample_output_dir, "excel")
        markdown_output_dir = os.path.join(sample_output_dir, "markdown")
        
        for dir_path in [sample_output_dir, excel_output_dir, markdown_output_dir]:
            try:
                self.w_client.workspace.mkdirs(dir_path)
                self.logger.info(f"Created directory: {dir_path}")
            except Exception as e:
                self.logger.debug(f"Directory may already exist: {dir_path}: {e}")
        
        # Scan notebooks
        try:
            notebook_list = list(self.w_client.workspace.list(self.notebook_output_dir))
        except Exception as e:
            self.logger.error(f"Failed to list notebooks in {self.notebook_output_dir}: {e}")
            log_print(f"❌ Failed to list notebooks: {e}", level="ERROR")
            return
        
        # Patterns for parsing notebook cells
        use_case_id_pattern = re.compile(r'--Use Case:\s*([A-Za-z0-9_-]+)\s*-\s*(.+?)$', re.MULTILINE)
        generate_sample_pattern = re.compile(r'generate_sample_result:\s*(Yes|No)', re.IGNORECASE)
        
        # Collect all use cases that need sample generation
        notebooks_with_samples = {}  # notebook_path -> list of (use_case_id, use_case_name, sql, markdown_table)
        
        for item in notebook_list:
            if not item.path.endswith('.ipynb'):
                continue
            
            notebook_samples = []
            notebook_name = os.path.basename(item.path).replace('.ipynb', '')
            
            try:
                file_info = self.w_client.workspace.export(path=item.path, format=workspace.ExportFormat.JUPYTER)
                notebook_json_str = base64.b64decode(file_info.content).decode('utf-8')
                notebook_json = json.loads(notebook_json_str)
                
                # Track markdown table for each use case (from preceding markdown cell)
                current_markdown_table = None
                
                for cell in notebook_json.get('cells', []):
                    if cell.get('cell_type') == 'markdown':
                        source = cell.get('source', [])
                        if isinstance(source, list):
                            current_markdown_table = ''.join(source)
                        else:
                            current_markdown_table = source
                        continue
                    
                    if cell.get('cell_type') != 'code':
                        continue
                    
                    source = cell.get('source', [])
                    if isinstance(source, list):
                        cell_content = ''.join(source)
                    else:
                        cell_content = source
                    
                    # Check for generate_sample_result:Yes
                    sample_match = generate_sample_pattern.search(cell_content)
                    if sample_match and sample_match.group(1).lower() == 'yes':
                        # Extract use case ID and name
                        uc_match = use_case_id_pattern.search(cell_content)
                        if uc_match:
                            uc_id = uc_match.group(1).strip()
                            uc_name = uc_match.group(2).strip()
                            
                            # Extract SQL: drop `--` lines and the whole leading /** ... **/
                            # regeneration block, including free-text instructions inside it
                            sql_lines = []
                            in_sql = False
                            in_block = False
                            for line in cell_content.split('\n'):
                                line_stripped = line.strip()
                                if not in_sql and (in_block or line_stripped.startswith('/*')):
                                    in_block = '*/' not in line_stripped
                                    continue
                                if line_stripped.startswith('--'):
                                    continue
                                if line_stripped:
                                    in_sql = True
                                if in_sql:
                                    sql_lines.append(line)
                            
                            sql = '\n'.join(sql_lines).strip()
                            if sql:
                                notebook_samples.append((uc_id, uc_name, sql, current_markdown_table))
                                log_print(f"   📝 Found [{uc_id}] in {notebook_name}")
                
            except Exception as e:
                self.logger.debug(f"Could not parse notebook {item.path}: {e}")
                continue
            
            if notebook_samples:
                notebooks_with_samples[item.path] = notebook_samples
        
        if not notebooks_with_samples:
            if called_from_sql_regen:
                log_print(f"ℹ️ No use cases with generate_sample_result:Yes found (sample generation skipped)")
            else:
                log_print(f"\n⚠️ No SQL cells with generate_sample_result:Yes found in any notebook", level="WARNING")
            return
        
        total_samples = sum(len(v) for v in notebooks_with_samples.values())
        log_print(f"\n📊 Found {total_samples} use cases to sample across {len(notebooks_with_samples)} notebooks")
        
        use_case_lookup, schema_lookup = self._load_json_for_sample_fixing()
        
        from concurrent.futures import ThreadPoolExecutor, as_completed
        import threading
        import gc
        
        WAVE_SIZE = 4
        MAX_CONCURRENT_FIXES = 2
        fix_semaphore = threading.Semaphore(MAX_CONCURRENT_FIXES)
        
        sample_parallelism = min(WAVE_SIZE, max(2, self.max_parallelism // 2))
        
        log_print(f"\n{'='*80}")
        log_print(f"🔄 MEMORY-EFFICIENT SAMPLE EXECUTION: {total_samples} samples")
        log_print(f"{'='*80}")
        log_print(f"🔧 [SAMPLE_EXECUTION] Wave size: {WAVE_SIZE}, Workers per wave: {sample_parallelism}")
        log_print(f"   └─ Processing notebook-by-notebook with immediate memory release")
        log_print(f"   └─ Max concurrent SQL fixes: {MAX_CONCURRENT_FIXES} (backpressure)")
        log_print(f"{'='*80}\n")
        
        def extract_transposed_data(pdf, uc_id: str, max_records: int = 10) -> list:
            """Extract transposed data for up to max_records rows.
            
            Returns list of records, where each record is a list of {Column, Value} dicts.
            Prioritizes high importance/urgency rows if available.
            Note: ai_sys_* columns are excluded from output as they are for internal use only.
            Preserves numeric types to avoid 'Number stored as text' Excel errors.
            """
            import numpy as np
            
            selected_rows = []
            total_rows = len(pdf)
            
            if 'ai_sys_importance' in pdf.columns and 'ai_sys_urgency' in pdf.columns:
                try:
                    high_priority = pdf[
                        (pdf['ai_sys_importance'].str.lower() == 'high') & 
                        (pdf['ai_sys_urgency'].str.lower() == 'high')
                    ]
                    if not high_priority.empty:
                        for i in range(min(len(high_priority), max_records)):
                            selected_rows.append(high_priority.iloc[i])
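                except Exception as filter_err:
                    # Non-string priority columns cannot be filtered; fall back to the first rows
                    self.logger.debug(f"[{uc_id}] Could not filter high-priority rows: {filter_err}")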
            
            # Fill remaining slots from the top of the result set, skipping rows already
            # selected as high priority. Scan all rows (not only the first few) so that
            # skipped duplicates do not leave slots empty.
            if len(selected_rows) < max_records:
                for i in range(total_rows):
                    if len(selected_rows) >= max_records:
                        break
                    row = pdf.iloc[i]
                    if not any(existing.equals(row) for existing in selected_rows):
                        selected_rows.append(row)
            
            if not selected_rows:
                selected_rows = [pdf.iloc[0]]
            
            all_records = []
            for row_idx, selected_row in enumerate(selected_rows):
                transposed_data = []
                for col_name in selected_row.index:
                    col_name_str = str(col_name)
                    if col_name_str.startswith('ai_sys_'):
                        continue
                    
                    raw_value = selected_row[col_name]
                    
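                    # Check bool first: bool is a subclass of int, so the int branch would swallow it
                    if isinstance(raw_value, (bool, np.bool_)):
                        value = bool(raw_value)
                    elif pd.api.types.is_scalar(raw_value) and pd.isna(raw_value):
                        # Scalar guard: pd.isna on array-like values returns an array
                        value = 'N/A'
                    elif isinstance(raw_value, (int, np.integer)):
                        value = int(raw_value)
                    elif isinstance(raw_value, (float, np.floating)):
                        # is_integer() is False for inf, avoiding the OverflowError from int(inf)
                        if float(raw_value).is_integer():
                            value = int(raw_value)
                        else:
                            value = round(float(raw_value), 6)
                    else:
                        value = str(raw_value)[:500]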
                    
                    transposed_data.append({'Column': col_name_str, 'Value': value})
                
                all_records.append(transposed_data)
            
            self.logger.debug(f"[{uc_id}] Extracted {len(all_records)} records (of {total_rows} total)")
            return all_records
        
        def parse_markdown_table_to_info(markdown_text: str) -> list:
            """Parse markdown table to extract use case information rows.
            
            Expected format from notebooks:
            | Aspect | Description |
            |---|---|
            | **Subdomain** | Value |
            | **Statement** | Value |
            
            Returns list of tuples: [(field_name, field_value), ...]
            """
            if not markdown_text:
                return []
            
            info_rows = []
            lines = markdown_text.strip().split('\n')
            skip_headers = {'aspect', 'description', 'field', 'attribute', 'column', 'name', 'value'}
            
            for line in lines:
                line = line.strip()
                if not line.startswith('|'):
                    continue
                if '|---|' in line or '| --- |' in line or line.replace(' ', '').replace('|', '').replace('-', '') == '':
                    continue
                
                parts = [p.strip() for p in line.split('|')]
                parts = [p for p in parts if p]
                
                if len(parts) >= 2:
                    field_name = parts[0].replace('**', '').strip()
                    field_value = parts[1].replace('**', '').strip()
                    
                    if field_name and field_name.lower() not in skip_headers:
                        if field_value:
                            info_rows.append((field_name, field_value))
            
            return info_rows
        
        def execute_single_sample_memory_safe(sample_info: dict) -> dict:
            """Execute SQL and return only lightweight transposed data (no DataFrame retention)."""
            uc_id = sample_info['uc_id']
            uc_name = sample_info['uc_name']
            sql = sample_info['sql']
            notebook_path = sample_info['notebook_path']
            markdown_table = sample_info['markdown_table']
            
            result = {
                'uc_id': uc_id,
                'uc_name': uc_name,
                'notebook_path': notebook_path,
                'markdown_table': markdown_table,
                'success': False,
                'transposed_data': None,
                'final_sql': sql,
                'sql_was_fixed': False,
                'error': None
            }
            
            pdf = None
            try:
                df = self.spark.sql(sql)
                pdf = df.toPandas()
                del df
                
                if pdf is not None and not pdf.empty:
                    result['transposed_data'] = extract_transposed_data(pdf, uc_id)
                    result['success'] = True
                    self.logger.info(f"[{uc_id}] SQL executed successfully ({len(pdf)} rows)")
                else:
                    result['error'] = "No results returned"
                    self.logger.warning(f"[{uc_id}] SQL returned empty result")
                    
            except Exception as e:
                clean_error = self._extract_clean_sql_error(e)
                self.logger.error(f"[{uc_id}] SQL execution failed: {clean_error[:100]}")
                
                with fix_semaphore:
                    success, fixed_sql, fixed_pdf = self._fix_sql_with_retry(
                        uc_id=uc_id,
                        uc_name=uc_name,
                        sql=sql,
                        error_msg=clean_error,
                        use_case_lookup=use_case_lookup,
                        schema_lookup=schema_lookup,
                        max_retries=2
                    )
                    
                    if success and fixed_pdf is not None:
                        result['transposed_data'] = extract_transposed_data(fixed_pdf, uc_id)
                        result['success'] = True
                        result['final_sql'] = fixed_sql
                        result['sql_was_fixed'] = True
                        del fixed_pdf
                        
                        if self._update_notebook_cell_with_fixed_sql(notebook_path, uc_id, fixed_sql):
                            self.logger.info(f"[{uc_id}] Notebook updated with fixed SQL")
                    else:
                        result['error'] = f"SQL fix failed after 2 attempts: {clean_error[:100]}"
            finally:
                if pdf is not None:
                    del pdf
            
            return result
        
        global_succeeded = 0
        global_failed = 0
        global_fixed = 0
        global_processed = 0
        all_results_for_md = []
        
        # Style definitions for Excel (only if available)
        if excel_available:
            header_font = Font(bold=True, color="FFFFFF", size=11)
            header_fill = PatternFill(start_color="0066CC", end_color="0066CC", fill_type="solid")
            column_font = Font(bold=True, size=10)
            value_font = Font(size=10)
            thin_border = Border(
                left=Side(style='thin'), right=Side(style='thin'),
                top=Side(style='thin'), bottom=Side(style='thin')
            )
        
        for notebook_path, samples in notebooks_with_samples.items():
            notebook_name = os.path.basename(notebook_path).replace('.ipynb', '')
            notebook_sample_count = len(samples)
            log_print(f"\n📓 Processing notebook: {notebook_name} ({notebook_sample_count} samples)")
            
            # Only create workbook if Excel is available
            wb = Workbook() if excel_available else None
            if excel_available:
                wb.remove(wb.active)
            
            notebook_succeeded = 0
            notebook_failed = 0
            
            sample_list = [
                {'notebook_path': notebook_path, 'uc_id': uc_id, 'uc_name': uc_name, 
                 'sql': sql, 'markdown_table': markdown_table}
                for uc_id, uc_name, sql, markdown_table in samples
            ]
            
            for wave_start in range(0, len(sample_list), WAVE_SIZE):
                wave_samples = sample_list[wave_start:wave_start + WAVE_SIZE]
                wave_num = (wave_start // WAVE_SIZE) + 1
                total_waves = (len(sample_list) + WAVE_SIZE - 1) // WAVE_SIZE
                
                self.logger.info(f"[{notebook_name}] Wave {wave_num}/{total_waves}: Processing {len(wave_samples)} samples")
                
                with ThreadPoolExecutor(max_workers=sample_parallelism, thread_name_prefix=f"Wave{wave_num}") as executor:
                    futures = {executor.submit(execute_single_sample_memory_safe, s): s['uc_id'] for s in wave_samples}
                    
                    for future in as_completed(futures):
                        uc_id = futures[future]
                        try:
                            result = future.result(timeout=300)
                            global_processed += 1
                            
                            if result['success'] and result['transposed_data']:
                                global_succeeded += 1
                                notebook_succeeded += 1
                                sql_was_fixed = result.get('sql_was_fixed', False)
                                if sql_was_fixed:
                                    global_fixed += 1
                                
                                transposed_data = result['transposed_data']
                                
                                # Excel sheet creation (only if openpyxl available)
                                if excel_available and wb is not None:
                                    sheet_name = uc_id[:31]
                                    ws = wb.create_sheet(title=sheet_name)
                                    
                                    right_align = Alignment(horizontal='right')
                                    left_align_wrap = Alignment(horizontal='left', wrap_text=True)
                                    info_fill = PatternFill(start_color="E6F3FF", end_color="E6F3FF", fill_type="solid")
                                    
                                    ws['A1'] = f"Use Case: {uc_id}"
                                    ws['A1'].font = Font(bold=True, size=14)
                                    ws['A2'] = f"Name: {result['uc_name']}"
                                    ws['A2'].font = Font(bold=True, size=12)
                                    ws.merge_cells('A1:B1')
                                    ws.merge_cells('A2:B2')
                                    
                                    current_row = 4
                                    
                                    uc_info = parse_markdown_table_to_info(result.get('markdown_table', ''))
                                    if uc_info:
                                        ws.cell(row=current_row, column=1, value='USE CASE INFORMATION')
                                        ws.cell(row=current_row, column=1).font = Font(bold=True, size=11, color="0066CC")
                                        ws.merge_cells(f'A{current_row}:B{current_row}')
                                        current_row += 1
                                        
                                        for field_name, field_value in uc_info:
                                            ws.cell(row=current_row, column=1, value=field_name)
                                            ws.cell(row=current_row, column=2, value=field_value)
                                            ws.cell(row=current_row, column=1).font = Font(bold=True, size=10)
                                            ws.cell(row=current_row, column=1).alignment = right_align
                                            ws.cell(row=current_row, column=1).fill = info_fill
                                            ws.cell(row=current_row, column=1).border = thin_border
                                            ws.cell(row=current_row, column=2).font = Font(size=10)
                                            ws.cell(row=current_row, column=2).alignment = left_align_wrap
                                            ws.cell(row=current_row, column=2).fill = info_fill
                                            ws.cell(row=current_row, column=2).border = thin_border
                                            current_row += 1
                                        
                                        current_row += 1
                                    
                                    all_records = transposed_data
                                    num_records = len(all_records)
                                    
                                    ws.cell(row=current_row, column=1, value=f'SAMPLE DATA ({num_records} record{"s" if num_records > 1 else ""})')
                                    ws.cell(row=current_row, column=1).font = Font(bold=True, size=11, color="0066CC")
                                    ws.merge_cells(f'A{current_row}:B{current_row}')
                                    current_row += 1
                                    
                                    separator_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")
                                    
                                    for record_idx, record_data in enumerate(all_records):
                                        if record_idx > 0:
                                            ws.cell(row=current_row, column=1, value=f'--- Record {record_idx + 1} ---')
                                            ws.cell(row=current_row, column=1).font = Font(bold=True, size=10, italic=True)
                                            ws.cell(row=current_row, column=1).fill = separator_fill
                                            ws.cell(row=current_row, column=2).fill = separator_fill
                                            ws.merge_cells(f'A{current_row}:B{current_row}')
                                            current_row += 1
                                        else:
                                            ws.cell(row=current_row, column=1, value='Column')
                                            ws.cell(row=current_row, column=2, value='Value')
                                            ws.cell(row=current_row, column=1).font = header_font
                                            ws.cell(row=current_row, column=2).font = header_font
                                            ws.cell(row=current_row, column=1).fill = header_fill
                                            ws.cell(row=current_row, column=2).fill = header_fill
                                            ws.cell(row=current_row, column=1).border = thin_border
                                            ws.cell(row=current_row, column=2).border = thin_border
                                            ws.cell(row=current_row, column=1).alignment = right_align
                                            current_row += 1
                                        
                                        for row_data in record_data:
                                            col_name = row_data['Column']
                                            if col_name.startswith('ai_sys_'):
                                                continue
                                            ws.cell(row=current_row, column=1, value=col_name)
                                            ws.cell(row=current_row, column=2, value=row_data['Value'])
                                            ws.cell(row=current_row, column=1).font = column_font
                                            ws.cell(row=current_row, column=1).border = thin_border
                                            ws.cell(row=current_row, column=1).alignment = right_align
                                            ws.cell(row=current_row, column=2).font = value_font
                                            ws.cell(row=current_row, column=2).border = thin_border
                                            ws.cell(row=current_row, column=2).alignment = left_align_wrap
                                            current_row += 1
                                    
                                    ws.column_dimensions['A'].width = 35
                                    ws.column_dimensions['B'].width = 80
                                
                                # Markdown data collection (always, regardless of Excel availability)
                                all_results_for_md.append({
                                    'uc_id': uc_id,
                                    'uc_name': result['uc_name'],
                                    'notebook_path': notebook_path,
                                    'markdown_table': result.get('markdown_table'),
                                    'result_data': transposed_data
                                })
                                
                                status = "[FIXED]" if sql_was_fixed else ""
                                log_print(f"   ✅ [{uc_id}] OK {status} ({global_processed}/{total_samples})")
                                
                                del transposed_data
                                del result['transposed_data']
                            else:
                                global_failed += 1
                                notebook_failed += 1
                                # 'error' may be present but None; coerce to str before slicing
                                log_print(f"   ❌ [{uc_id}] Failed: {str(result.get('error') or 'Unknown')[:50]} ({global_processed}/{total_samples})", level="ERROR")
                                
                        except Exception as e:
                            global_processed += 1
                            global_failed += 1
                            notebook_failed += 1
                            log_print(f"   ❌ [{uc_id}] Exception: {str(e)[:50]} ({global_processed}/{total_samples})", level="ERROR")
                
                gc.collect()
                self.logger.debug(f"[{notebook_name}] Wave {wave_num} complete, memory released")
            
            # Save Excel workbook (only if available and has sheets)
            if excel_available and wb is not None and wb.sheetnames:
                excel_local_path = None
                try:
                    import tempfile
                    with tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx') as tmp:
                        excel_local_path = tmp.name
                    
                    wb.save(excel_local_path)
                    
                    with open(excel_local_path, 'rb') as f:
                        excel_content = f.read()
                    
                    excel_workspace_path = os.path.join(excel_output_dir, f"{notebook_name}_samples.xlsx")
                    excel_b64 = base64.b64encode(excel_content).decode('utf-8')
                    self.w_client.workspace.import_(
                        path=excel_workspace_path,
                        content=excel_b64,
                        format=workspace.ImportFormat.AUTO,
                        overwrite=True
                    )
                    
                    log_print(f"   📊 Saved: {notebook_name}_samples.xlsx ({notebook_succeeded} sheets)")
                    del excel_content
                    
                except Exception as e:
                    self.logger.error(f"Failed to save Excel for {notebook_name}: {e}")
                    log_print(f"   ❌ Failed to save Excel: {str(e)[:80]}", level="ERROR")
                finally:
                    # Always remove the local temp file, even if saving or uploading failed
                    if excel_local_path and os.path.exists(excel_local_path):
                        os.remove(excel_local_path)
            
            if wb is not None:
                del wb
            gc.collect()
            
            log_print(f"   📈 Notebook complete: {notebook_succeeded} OK, {notebook_failed} failed")
        
        log_print(f"\n{'='*80}")
        log_print(f"📊 SAMPLE EXECUTION COMPLETE")
        log_print(f"   • Total: {total_samples}")
        log_print(f"   • ✅ Succeeded: {global_succeeded} ({global_fixed} fixed)")
        log_print(f"   • ❌ Failed: {global_failed}")
        log_print(f"{'='*80}\n")
        
        md_files_created = 0
        if all_results_for_md:
            md_by_notebook = {}
            for result in all_results_for_md:
                nb_path = result.get('notebook_path', 'unknown')
                nb_name = os.path.basename(nb_path).replace('.ipynb', '') if nb_path else 'unknown'
                if nb_name not in md_by_notebook:
                    md_by_notebook[nb_name] = []
                md_by_notebook[nb_name].append(result)
            
            for notebook_name, results in md_by_notebook.items():
                md_content = f"# Sample Results: {notebook_name}\n\n"
                md_content += f"**Business:** {self.business_name}\n\n"
                md_content += f"**Generated:** {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n"
                md_content += "---\n\n"
                
                for result in results:
                    md_content += f"## {result['uc_id']}: {result['uc_name']}\n\n"
                    
                    if result.get('markdown_table'):
                        md_content += "### Use Case Information\n\n"
                        md_content += result['markdown_table'] + "\n\n"
                    
                    all_records = result['result_data']
                    num_records = len(all_records) if isinstance(all_records, list) and all_records and isinstance(all_records[0], list) else 1
                    
                    if not isinstance(all_records[0], list):
                        all_records = [all_records]
                    
                    md_content += f"### Sample Result ({num_records} record{'s' if num_records > 1 else ''})\n\n"
                    
                    for record_idx, record_data in enumerate(all_records):
                        if record_idx > 0:
                            md_content += f"\n**--- Record {record_idx + 1} ---**\n\n"
                        
                        md_content += "| Column | Value |\n"
                        md_content += "|-------:|-------|\n"
                        
                        for row in record_data:
                            col_name = row['Column']
                            if col_name.startswith('ai_sys_'):
                                continue
                            col = col_name.replace('|', '\\|')
                            val = str(row['Value'])[:200].replace('|', '\\|').replace('\n', ' ')
                            md_content += f"| {col} | {val} |\n"
                    
                    md_content += "\n---\n\n"
                
                try:
                    md_workspace_path = os.path.join(markdown_output_dir, f"{notebook_name}_samples.md")
                    md_b64 = base64.b64encode(md_content.encode('utf-8')).decode('utf-8')
                    self.w_client.workspace.import_(
                        path=md_workspace_path,
                        content=md_b64,
                        format=workspace.ImportFormat.AUTO,
                        overwrite=True
                    )
                    md_files_created += 1
                    log_print(f"   📝 Saved: {notebook_name}_samples.md ({len(results)} use cases)")
                except Exception as e:
                    self.logger.error(f"Failed to save MD file for {notebook_name}: {e}")
                    log_print(f"   ❌ Failed to save MD: {notebook_name}_samples.md: {str(e)[:60]}", level="ERROR")
        
        log_print(f"\n{'='*80}")
        log_print(f"✅ GENERATE SAMPLE RESULT MODE COMPLETE")
        log_print(f"   📁 Output folder: {sample_output_dir}")
        if excel_available:
            log_print(f"   📁 Excel files: {excel_output_dir}")
            log_print(f"   📊 Excel files created: {len(notebooks_with_samples)}")
        else:
            log_print(f"   ⚠️ Excel files: Skipped (openpyxl unavailable)")
        log_print(f"   📁 Markdown files: {markdown_output_dir}")
        log_print(f"   📝 Markdown files created: {md_files_created}")
        log_print(f"   📋 Total use cases sampled: {len(all_results_for_md)}")
        log_print(f"{'='*80}\n")

    def _load_usecases_catalog_json(self, json_file_path: str) -> tuple:
        """
        Loads the JSON Catalog file for docs-only mode.
        
        Returns:
            tuple: (final_consolidated_use_cases, summary_dict, english_grouped_data)
        """
        try:
            self.logger.info(f"Loading JSON Catalog from: {json_file_path}")
            
            # Read from workspace
            file_info = self.w_client.workspace.export(path=json_file_path, format=workspace.ExportFormat.AUTO)
            json_content = base64.b64decode(file_info.content).decode('utf-8')
            catalog_json = json.loads(json_content)
            
            self.logger.info(f"✅ Successfully loaded JSON Catalog")
            
            # === NEW: Load business name from JSON, fallback to widget value ===
            json_business_name = catalog_json.get("business_name", None)
            if json_business_name:
                old_business_name = self.business_name
                self.business_name = json_business_name
                self.logger.info(f"Using business name from JSON: '{json_business_name}' (widget value '{old_business_name}' ignored)")
                log_print(f"📌 Using business name from JSON: '{json_business_name}'")
            else:
                self.logger.info(f"Business name not found in JSON. Using widget value: '{self.business_name}'")
                log_print(f"📌 Using business name from widget: '{self.business_name}'")
            
            # Extract data - summary_dict should match _get_salesy_summary format
            # It uses "Executive" as key for executive summary, and domain names for domain summaries
            summary_dict = {
                "Executive": catalog_json.get("executive_summary", "")
            }
            
            # === NEW: Restore Column Names from IDs ===
            column_registry = catalog_json.get("column_registry", {})
            # Parse registry: ID -> FQN
            id_to_fqn = {}
            for cid, val in column_registry.items():
                # Value format: "fqn, description"
                # Split only on first comma to separate FQN from description
                parts = val.split(",", 1)
                id_to_fqn[cid] = parts[0].strip()
            
            # Reconstruct grouped data and flat list
            english_grouped_data = {}
            final_consolidated_use_cases = []
            
            for domain_obj in catalog_json.get("domains", []):
                domain_name = domain_obj.get("domain_name", "General Operations")
                use_cases = domain_obj.get("use_cases", [])
                
                # Restore column names in use cases
                for uc in use_cases:
                    cols_involved = uc.get("Columns Involved", "")
                    if cols_involved:
                        parts = [p.strip() for p in cols_involved.split(",")]
                        restored_names = []
                        for p in parts:
                            if p in id_to_fqn:
                                restored_names.append(id_to_fqn[p])
                            else:
                                restored_names.append(p)
                        uc["Columns Involved"] = ", ".join(restored_names)

                domain_summary = domain_obj.get("summary", "")
                
                english_grouped_data[domain_name] = use_cases
                final_consolidated_use_cases.extend(use_cases)
                
                # Add domain summary to summary_dict
                summary_dict[domain_name] = domain_summary
            
            self.logger.info(f"Loaded {len(final_consolidated_use_cases)} use cases from {len(english_grouped_data)} domains")
            
            # === NEW: Filter out any use cases with "Pending" priority (safety check) ===
            pending_use_cases = [uc for uc in final_consolidated_use_cases if uc.get('Priority') == 'Pending']
            if pending_use_cases:
                self.logger.warning(f"⚠️ Found {len(pending_use_cases)} use cases with 'Pending' priority in JSON - these will be filtered out")
                for uc in pending_use_cases[:5]:  # Log first 5 for debugging
                    self.logger.warning(f"  - {uc.get('No', 'N/A')}: {uc.get('Name', 'N/A')}")
                
                # Filter from flat list
                final_consolidated_use_cases = [uc for uc in final_consolidated_use_cases if uc.get('Priority') != 'Pending']
                
                # Also filter from grouped data
                for domain_name in list(english_grouped_data.keys()):
                    english_grouped_data[domain_name] = [uc for uc in english_grouped_data[domain_name] if uc.get('Priority') != 'Pending']
                
                self.logger.info(f"✅ Filtered to {len(final_consolidated_use_cases)} scored use cases (removed {len(pending_use_cases)} pending)")
            
            return (final_consolidated_use_cases, summary_dict, english_grouped_data)
            
        except Exception as e:
            self.logger.error(f"Failed to load JSON Catalog: {e}")
            raise

    def _upload_log_file(self):
        log_file_path = None
        try:
            for handler in logging.root.handlers:
                if isinstance(handler, logging.FileHandler):
                    log_file_path = handler.baseFilename
                    break
            if not log_file_path:
                self.logger.warning("Could not find FileHandler to upload log file.")
                return
            if not os.path.exists(log_file_path):
                self.logger.warning(f"Log file not found at expected path: {log_file_path}")
                return
            self.logger.info(f"Reading log file from: {log_file_path}")
            with open(log_file_path, "rb") as f:
                log_data = f.read()
            if not log_data:
                self.logger.warning("Log file is empty. Skipping upload.")
                return
            
            # Copy log file to base output directory for easy access
            output_log_path = os.path.join(self.base_output_dir, "log.txt")
            try:
                if self.base_output_dir.startswith(("/tmp/", "/dbfs/")):
                    os.makedirs(self.base_output_dir, exist_ok=True)
                    shutil.copy2(log_file_path, output_log_path)
                    self.logger.info(f"✅ Log file copied to output directory: {output_log_path}")
                    log_print(f"✅ Log file available at: {output_log_path}")
                else:
                    self.logger.info(f"Skipping local copy of log file (non-local path): {output_log_path}")
            except Exception as copy_error:
                self.logger.warning(f"Failed to copy log file to output directory: {copy_error}")
            
            # Also upload to workspace for Databricks UI access
            workspace_log_path = os.path.join(self.docs_output_dir, "generation_log.txt")
            
            self.logger.info(f"Uploading log file to workspace: {workspace_log_path}")
            log_data_b64 = base64.b64encode(log_data).decode()
            self.w_client.workspace.import_(
                path=workspace_log_path, content=log_data_b64,
                format=workspace.ImportFormat.AUTO, overwrite=True
            )
            abs_path = self.w_client.workspace.get_status(workspace_log_path).path
            self.logger.info(f"Successfully uploaded log file to workspace: {abs_path}")
            log_print(f"✅ Log file also uploaded to workspace: {abs_path}")
        except Exception as e:
            self.logger.error(f"Failed to upload log file: {e}")
            if log_file_path:
                self.logger.error(f"Log file was at: {log_file_path}")

# COMMAND ----------

# DBTITLE 1,Main
# ==============================================================================
# 4. MAIN EXECUTION METHOD
# ==============================================================================

def main():
    """
    Main function to read widget values, validate inputs,
    and run the DatabricksInspire class.
    
    *** IMPORTANT ***
    Run the `create_widgets()` cell first and fill in the UI values
    BEFORE running this main() function.
    """
    
    print_ascii_banner()

    # --- 1. Get Widget Values ---
    
    # --- Business Name ---
    business_name = dbutils.widgets.get("00_business_name")
    
    # --- UC Metadata ---
    catalogs_and_schemas_str = dbutils.widgets.get("01_uc_metadata")
    
    # --- Operation Mode ---
    operation_mode = dbutils.widgets.get("02_operation")
    log_print(f"🎯 Operation Mode: {operation_mode}")
    
    # --- Business Domains ---
    business_domains_str = dbutils.widgets.get("03_business_domains")
    
    # --- Business Priorities (multi-select) ---
    business_priorities_str = dbutils.widgets.get("04_business_priorities")
    
    # --- Strategic Goals ---
    strategic_goals_str = dbutils.widgets.get("05_strategic_goals")
    
    # Check if this is a JSON file path (docs-only mode)
    json_file_path = None
    catalogs_list = []
    schemas_list = []
    tables_list = []
    
    if catalogs_and_schemas_str:
        catalogs_and_schemas_str = catalogs_and_schemas_str.strip()
        # Check if it's a JSON file path (starts with /)
        if catalogs_and_schemas_str.startswith('/'):
            json_file_path = catalogs_and_schemas_str
            log_print(f"Detected JSON file path: {json_file_path}")
            log_print("Running in DOCS-ONLY mode: Will skip use case generation and notebook generation.")
        else:
            # Parse catalogs, schemas, and tables from the merged widget
            for item in catalogs_and_schemas_str.split(','):
                item = item.strip()
                if not item:
                    continue
                dot_count = item.count('.')
                if dot_count == 2:
                    # Fully qualified table (catalog.schema.table)
                    tables_list.append(item)
                elif dot_count == 1:
                    # Fully qualified schema (catalog.schema)
                    schemas_list.append(item)
                elif dot_count == 0:
                    # Catalog only
                    catalogs_list.append(item)
                else:
                    # Invalid format - log warning
                    log_print(f"Invalid metadata format '{item}' - expected 0, 1, or 2 dots", level="WARNING")
    
    catalogs_str = ','.join(catalogs_list)
    schemas_str = ','.join(schemas_list)
    tables_str = ','.join(tables_list)
    
    # --- Generation Options ---
    generate_str = dbutils.widgets.get("06_generation_options")
    # Force "use cases" to be included always
    if generate_str:
        if "use cases" not in generate_str:
            generate_str += ", use cases"
    else:
        generate_str = "use cases"
    
    # Parse generation options for special flags
    generate_options_list = [opt.strip() for opt in generate_str.split(',') if opt.strip()]
    
    # Extract special options from generation options
    use_unstructured_data = "Unstructured Data Usecases" in generate_options_list
    technical_exclusion_strategy = "Aggressive"
    
    # Set use_unstructured_data_str based on Unstructured Data Usecases selection
    use_unstructured_data_str = "yes" if use_unstructured_data else "no"
    
    # --- Generation Path ---
    generation_path = dbutils.widgets.get("07_generation_path")
    
    # --- Documents Languages (multiselect) ---
    output_language_str = dbutils.widgets.get("08_documents_languages") 
    
    # --- AI Model (model endpoint for ai_query in generated SQL) ---
    sql_model_serving = dbutils.widgets.get("09_ai_model")
    if not sql_model_serving or not sql_model_serving.strip():
        sql_model_serving = "databricks-gpt-oss-120b"

    # ============================================================================
    # --- 2. VALIDATE ALL WIDGET VALUES (FAIL FAST BEFORE ANY PROCESSING) ---
    log_print("=" * 80)
    log_print("🔍 VALIDATING WIDGET INPUTS...")
    log_print("=" * 80)
    
    validation_errors = []
    
    # Validate Business Name first
    if not business_name:
        validation_errors.append("❌ 'Business Name' (00_business_name) is REQUIRED")
    else:
        log_print(f"✅ Business Name: '{business_name}'")
    
    # Validate Operation mode
    valid_operations = ["Discover Usecases", "Re-generate SQL", "Generate Sample Result"]
    if operation_mode not in valid_operations:
        validation_errors.append(f"❌ 'Operation' (02_operation) must be one of: {', '.join(valid_operations)}")
    else:
        log_print(f"✅ Operation: '{operation_mode}'")
    
    # AUTO-ENABLE SQL Code generation for "Re-generate SQL" mode (regardless of checkbox)
    if operation_mode == "Re-generate SQL" and "SQL Code" not in generate_options_list:
        generate_options_list.append("SQL Code")
        generate_str = ", ".join(generate_options_list)
        log_print(f"ℹ️ Auto-enabled 'SQL Code' for Re-generate SQL mode")
    
    # Log Business Priorities (optional)
    if business_priorities_str:
        log_print(f"✅ Business Priorities: '{business_priorities_str}'")
    else:
        log_print(f"ℹ️ Business Priorities: Not provided")
    
    # Log Business Domains (optional)
    if business_domains_str:
        log_print(f"✅ Business Domains: '{business_domains_str}'")
    else:
        log_print(f"ℹ️ Business Domains: Not provided (domains will be inferred from data)")
    
    # Log Strategic Goals (optional but HIGHEST PRIORITY when provided)
    if strategic_goals_str:
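        # Only append an ellipsis when the goals text is actually truncated
        goals_preview = strategic_goals_str[:100] + ("..." if len(strategic_goals_str) > 100 else "")
        log_print(f"✅ Strategic Goals: '{goals_preview}' (HIGHEST PRIORITY)")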
    else:
        log_print(f"ℹ️ Strategic Goals: Not provided")
    
    # UC Metadata validation depends on operation mode
    if not json_file_path:
        if (operation_mode == "Discover Usecases" and 
            not catalogs_str and not schemas_str and not tables_str):
            validation_errors.append("❌ 'UC Metadata' (01_uc_metadata) is REQUIRED when discovering use cases")
        elif operation_mode in ["Re-generate SQL", "Generate Sample Result"]:
            # These modes work on existing notebooks, UC Metadata not required
            log_print(f"ℹ️ UC Metadata: Not required for '{operation_mode}' mode")
        else:
            n_catalogs, n_schemas, n_tables = len(catalogs_list), len(schemas_list), len(tables_list)
            log_print(f"✅ UC Metadata provided: catalogs={n_catalogs}, schemas={n_schemas}, tables={n_tables}")
    else:
        log_print(f"✅ Docs-only mode: Using JSON file '{json_file_path}'")
    
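    # Defensive guard only: "use cases" is always appended to generate_str above,
    # so this branch is not expected to trigger.
    if not generate_str: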
        validation_errors.append("❌ 'Generation Options' (06_generation_options) is REQUIRED - select at least one option")
    else:
        log_print(f"✅ Generation Options: {generate_str}")
    
    if not generation_path:
        validation_errors.append("❌ 'Generation Path' (07_generation_path) is REQUIRED")
    else:
        log_print(f"✅ Generation Path: '{generation_path}'")
    
    
    # Language is only REQUIRED for PDF/Presentation artifacts, optional for notebooks-only
    requires_language = ("PDF Catalog" in generate_str or 
                        "Presentation" in generate_str or 
                        "Use Cases Catalog PDF" in generate_str)
    
    if requires_language:
        if not output_language_str:
            validation_errors.append("❌ 'Documents Languages' (08_documents_languages) is REQUIRED when generating PDF or Presentation")
        else:
            languages = [lang.strip() for lang in output_language_str.split(',') if lang.strip()]
            log_print(f"✅ Documents Languages: {', '.join(languages)}")
    else:
        # Default to English for notebooks-only mode (no PDF/Presentation)
        if not output_language_str:
            output_language_str = "English"
            languages = ["English"]
            log_print(f"ℹ️ Documents Languages: Not required (no PDF/Presentation selected), defaulting to English")
        else:
            languages = [lang.strip() for lang in output_language_str.split(',') if lang.strip()]
            log_print(f"ℹ️ Documents Languages: {', '.join(languages)} (optional for notebooks-only)")
    
    # Log derived options
    generate_sql_code = "SQL Code" in generate_options_list
    log_print(f"ℹ️ SQL Code Generation: {'Enabled' if generate_sql_code else 'DISABLED (notebooks will have placeholder SQL)'}")
    log_print(f"ℹ️ Unstructured Data Usecases: {'Enabled' if use_unstructured_data else 'Disabled'}")
    log_print("ℹ️ Technical table filtering: Aggressive (mandatory)")
    if generate_sql_code:
        log_print(f"✅ AI Model: '{sql_model_serving}' (for ai_query in generated SQL)")
    
    if validation_errors:
        import sys as _sys
        error_count = len(validation_errors)
        error_summary = "\n".join(validation_errors)
        
        log_print("=" * 80, level="ERROR")
        log_print(f"❌ VALIDATION FAILED - {error_count} ERROR(S) FOUND:", level="ERROR")
        log_print("=" * 80, level="ERROR")
        for error in validation_errors:
            log_print(error, level="ERROR")
        log_print("=" * 80, level="ERROR")
        
        print(f"\n{'='*80}\n❌ VALIDATION ERRORS ({error_count}):\n{error_summary}\n{'='*80}\n", file=_sys.stderr, flush=True)
        _sys.stdout.flush()
        _sys.stderr.flush()
        
        exit_msg = f"Validation failed with {error_count} error(s):\n{error_summary}"
        dbutils.notebook.exit(exit_msg)
    
    log_print("=" * 80)
    log_print("✅ ALL VALIDATIONS PASSED - Starting generation...")
    log_print("=" * 80)

    # --- 3. Pack values and Run ---
    
    widget_values = {
        "business": business_name,
        "operation_mode": operation_mode,
        "strategic_goals": strategic_goals_str,
        "business_priorities": business_priorities_str,
        "business_domains": business_domains_str,
        "catalogs": catalogs_str,
        "schemas": schemas_str,
        "tables": tables_str,
        "generate": generate_str,
        "generation_path": generation_path,
        "output_language": output_language_str,
        "use_unstructured_data": use_unstructured_data_str,
        "technical_exclusion_strategy": technical_exclusion_strategy,
        "sql_model_serving": sql_model_serving,
        "json_file_path": json_file_path
    }

    try:
        inspirer = DatabricksInspire(**widget_values)
        inspirer.run()
    except NameError as ne:
        # A NameError naming one of these means an earlier notebook cell was not run.
        required_names = ('DataLoader', 'AIAgent', 'PROMPT_TEMPLATES',
                          'DatabricksInspire', 'setup_logging', 'TranslationService')
        if any(name in str(ne) for name in required_names):
            print(f"ERROR: A required class, function, or variable is missing: {ne}", file=sys.stderr)
            print("Please ensure `setup_logging`, `DataLoader`, `AIAgent`, `PROMPT_TEMPLATES`, `TranslationService`, and `DatabricksInspire` are defined in preceding cells.", file=sys.stderr)
        else:
            raise
    except Exception as e:
        print(f"An unexpected error occurred: {e}", file=sys.stderr)
        # exc_info=True attaches the traceback so the failure can be diagnosed from the log.
        logging.getLogger("main").critical("Main execution failed", exc_info=True)
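
The `NameError` branch above detects a missing definition only after the run has started. A minimal sketch of an up-front check follows, assuming the same six names are the ones earlier cells must define. The helper `find_missing_definitions` and the constant `REQUIRED_NAMES` are hypothetical and not part of the notebook; in a notebook the namespace passed in would typically be `globals()`.

```python
from typing import Iterable, List, Mapping

# Names the launcher expects earlier cells to have defined (taken from the
# NameError handler above; the constant itself is a hypothetical addition).
REQUIRED_NAMES = (
    "setup_logging", "DataLoader", "AIAgent",
    "PROMPT_TEMPLATES", "TranslationService", "DatabricksInspire",
)


def find_missing_definitions(required: Iterable[str],
                             namespace: Mapping[str, object]) -> List[str]:
    """Return the required names absent from `namespace`, in the given order."""
    return [name for name in required if name not in namespace]


# Demo with a stand-in namespace; a real call would pass globals().
demo_namespace = {"setup_logging": print, "DataLoader": object, "PROMPT_TEMPLATES": {}}
missing = find_missing_definitions(REQUIRED_NAMES, demo_namespace)
if missing:
    print("Missing definitions: " + ", ".join(missing))
```

Checking membership in the namespace avoids matching on the text of an exception message, which can also match unrelated `NameError`s that happen to mention one of these names.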

01:19:04 - INFO - Creating widgets (retaining existing values)...
01:19:04 - INFO - ✅ Widgets created successfully.
01:19:04 - INFO - 
01:19:04 - INFO - >>> Fill in the widget values at the TOP of this notebook, then run main().
01:19:06 - INFO - PROMPT_TEMPLATES dictionary defined successfully with all required prompts.
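
The launcher cell above collects every validation problem before stopping, rather than failing on the first one. A minimal sketch of that collect-then-report pattern follows, under stated assumptions: the helpers `parse_languages` and `collect_validation_errors` are hypothetical, the error messages are illustrative, and only the two widgets marked *Required* in the configuration table (Business Name and UC Metadata) are checked. Falling back to `English` for an empty language list is also an assumption, modeled on the widget default.

```python
from typing import Dict, List


def parse_languages(raw: str, default: str = "English") -> List[str]:
    """Split a comma-separated widget value, dropping blank entries.

    Falls back to [default] when nothing usable remains (assumed behavior).
    """
    languages = [part.strip() for part in raw.split(",") if part.strip()]
    return languages or [default]


def collect_validation_errors(values: Dict[str, str]) -> List[str]:
    """Accumulate all problems so the user sees them in a single pass."""
    errors: List[str] = []
    if not values.get("business", "").strip():
        errors.append("Business Name is required.")
    metadata_keys = ("catalogs", "schemas", "tables", "json_file_path")
    if not any(values.get(key, "").strip() for key in metadata_keys):
        errors.append("UC Metadata is required (catalogs, schemas, tables, or a JSON path).")
    return errors


# Demo: one valid configuration and one with both required values missing.
print(parse_languages(" English, , Arabic ,"))
print(collect_validation_errors({"business": "Procore", "schemas": "main.finance"}))
print(collect_validation_errors({"business": " "}))
```

Returning a list instead of raising lets the caller decide how to report the errors, for example by logging each one and then calling `dbutils.notebook.exit` as the cell above does.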


In [0]:
if __name__ == "__main__":
    logging.getLogger("py4j").setLevel(logging.ERROR)
    main()


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ____        _        _          _      _                             ┃
┃   |  _ \  __ _| |_ __ _| |__  _ __(_) ___| | _____                      ┃
┃   | | | |/ _` | __/ _` | '_ \| '__| |/ __| |/ / __|                     ┃
┃   | |_| | (_| | || (_| | |_) | |  | | (__|   <\__ \                     ┃
┃   |____/ \__,_|\__\__,_|_.__/|_|  |_|\___|_|\_\___/                     ┃
┃       ___                      _                  _    ___              ┃
┃      |_ _| _ __   ___  _ __   (_) _ __  ___      / \  |_ _|             ┃
┃       | | | '_ \ / __|| '_ \  | || '__|/ _ \    / _ \  | |              ┃
┃       | | | | | |\__ \| |_) | | || |  |  __/   / ___ \ | |              ┃
┃      |___||_| |_||___/| .__/  |_||_|   \___|  /_/   \_\___|             ┃
┃                       |_|                                               ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

01:19:07 -


[notice] A new release of pip is available: 25.0.1 -> 26.0.1
[notice] To update, run: pip install --upgrade pip


01:25:18 - INFO - Success! Excel Catalog uploaded to: /Users/amr.ali@databricks.com/inspire/inspire_gen/procore/docs/procore-dbx_inspire.xlsx
01:25:18 - INFO - Success! Excel Catalog (English) generated: /Users/amr.ali@databricks.com/inspire/inspire_gen/procore/docs/procore-dbx_inspire.xlsx
01:25:18 - INFO - Generating executive summaries for notebooks...
01:25:18 - INFO -    [SUMMARY_GEN_PROMPT] Setting max_tokens=115,200 (model limit: 128,000)
01:25:41 - INFO - 🔮✨ HONESTY CHECK [SUMMARY_GEN_PROMPT] Score: 72% | Generic summaries lacking specific Procore context or construction industry depth. No actual use case details provided. Formulaic structure without deep strategic insight or differentiation. ✨🔮
01:25:41 - INFO - LLM summaries (CSV) received for English.
01:25:41 - INFO - Found CSV header at index 0. Parsing as 3-column.
01:25:41 - INFO - Successfully parsed 6 summaries for English. Transliterated name: Procore
01:25:41 - INFO - 🚀 Starting PDF/PPTX documentation generation in p

01:25:46 - INFO - ✓ PDF package (weasyprint) installed successfully.
01:25:46 - INFO - Checking Excel dependencies (pandas, openpyxl)...
01:25:46 - INFO - Excel packages not found. Installing...
Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5

01:25:48 - INFO - ✓ Excel packages (pandas, openpyxl) installed successfully.
01:25:48 - INFO - ✓ Proceeding with translations and artifact generation (fallback to .md/.csv if needed)...
01:25:48 - INFO - Using default English UI translations.
01:25:48 - INFO - 🔧 [TRANSLATION] Parallelism = 4 (from max=1) | calculated=4 based on: 23 items (medium) + translation LLM calls
01:25:48 - INFO - 🔧 [TRANSLATION] Workers: 4 (max=1)
01:25:48 - INFO -    └─ Reason: calculated=4 based on: 23 items (medium) + translation LLM calls
01:25:48 - INFO - Preparing English artifacts (no translation needed).
01:25:48 - INFO -    [SUMMARY_GEN_PROMPT] Setting max_tokens=115,200 (model limit: 128,000)
01:26:16 - INFO - 🔮✨ HONESTY CHECK [SUMMARY_GEN_PROMPT] Score: 72% | Generic summaries lacking specific Procore context or construction industry depth. No actual use case details provided. Formulaic structure without deep strategic insight or differentiation. ✨🔮
01:26:16 - INFO - LLM summaries (CSV) received for