# 2.08: GenAI-Assisted Performance Evaluation

## OPENING: YOU EARNED THE RIGHT TO USE GENAI

We've built the baseline system. We've run cross-validation. We've computed metrics. We've built a scoreboard. We've run portfolio diagnostics. We've done forensic investigation.

**Now we can use GenAI.**

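Here's why the order matters: early in a forecasting project, GenAI is dangerous. You don't have grounding. You don't have evidence. A fluent answer and a correct answer look the same when there is nothing measured to check them against.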

But now? **Now we have grounding.**

We have the forecast database from cross-validation. We have the scoreboard. We have diagnostics. We have forensic findings. We have artifacts that represent measured truth.

GenAI doesn't create truth. **It compresses truth you already measured.**

**Rule: If GenAI contradicts the scoreboard, GenAI loses.**

## SETUP: Load Dependencies and Data

In [None]:
import pandas as pd
import numpy as np
import json
from pathlib import Path

# For GenAI integration (example using OpenAI or similar)
# pip install openai
# from openai import OpenAI

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

In [None]:
# Load evidence artifacts from previous modules
scoreboard = pd.read_csv('path/to/scoreboard.csv')
diagnostics_summary = pd.read_csv('path/to/diagnostics_summary.csv')
forensic_findings = pd.read_csv('forensic_findings_for_module_3.csv')

print(f"Loaded scoreboard with {len(scoreboard)} models")
print(f"Loaded diagnostics summary with {len(diagnostics_summary)} findings")
print(f"Loaded forensic findings with {len(forensic_findings)} SKUs")
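The cells that follow read `model_name` and `wmape` from the scoreboard and treat its first row as the best model. A minimal sketch of a pre-flight check for that, assuming lower wMAPE is better; `check_artifact` and the toy scoreboard are illustrative names and made-up values, not artifacts from earlier modules:

```python
import pandas as pd

def check_artifact(df, required_columns, name):
    """Raise early if an evidence artifact lacks columns used downstream."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"{name} is missing columns: {missing}")
    return df

# Toy scoreboard standing in for the real artifact (values are made up)
toy_scoreboard = pd.DataFrame({
    'model_name': ['ModelA', 'ModelB', 'ModelC'],
    'wmape': [0.31, 0.22, 0.27],
})

check_artifact(toy_scoreboard, ['model_name', 'wmape'], 'scoreboard')

# Sort best-first so that .iloc[0] really is the champion (lower wMAPE is better)
toy_scoreboard = toy_scoreboard.sort_values('wmape').reset_index(drop=True)
print(toy_scoreboard.iloc[0]['model_name'])  # ModelB
```

The same check could be run on the diagnostics summary with `['segment', 'action']`, the two columns the prompt builder reads.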

---
## SECTION 1: THE GENAI GOVERNANCE CONTRACT

### Four Rules (The Guardrails)

Before we start prompting GenAI, we need rules. These aren't suggestions; they're hard boundaries.

**Rule 1: Evidence In, Insight Out**
- GenAI only interprets data we measured
- No speculation about unmeasured scenarios

**Rule 2: Scoreboard Veto**
- If GenAI's interpretation contradicts the metrics, the metrics win
- GenAI can suggest *why* a pattern exists, but can't deny that it exists

**Rule 3: Chain-of-Custody for Insights**
- Every GenAI output must be traceable back to a data artifact
- "Model X is better because..." must cite the scoreboard

**Rule 4: Human Authority**
- GenAI generates hypotheses, not decisions
- The human makes the final call
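Rules 3 and 4 can be made mechanical by attaching a citation to every claim. The following is one possible sketch, not a prescribed schema: `InsightRecord`, `KNOWN_ARTIFACTS`, and `is_traceable` are hypothetical names, and the artifact names mirror the three files loaded in the setup.

```python
from dataclasses import dataclass

# Artifact names mirror the files loaded in SETUP (assumed, for illustration)
KNOWN_ARTIFACTS = {'scoreboard', 'diagnostics_summary', 'forensic_findings'}

@dataclass(frozen=True)
class InsightRecord:
    claim: str
    artifact: str   # which evidence artifact backs the claim
    locator: str    # row/column reference inside that artifact
    decided_by: str = 'PENDING_HUMAN_REVIEW'  # Rule 4: a human signs off

def is_traceable(record):
    """Rule 3: a claim counts only if it cites a known artifact and a location in it."""
    return record.artifact in KNOWN_ARTIFACTS and bool(record.locator.strip())

cited = InsightRecord("Model X has the lowest wMAPE", "scoreboard", "row 0, column wmape")
uncited = InsightRecord("Demand will double next year", "", "")

print(is_traceable(cited))    # True
print(is_traceable(uncited))  # False
```

An uncited claim is not necessarily wrong; under Rule 1 it is simply out of scope until something measured backs it.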

---
## SECTION 2: PROMPT ENGINEERING FOR FORECASTING DIAGNOSTICS

In [None]:
def create_genai_context_prompt(scoreboard_df, diagnostics_df, forensic_df):
    """
    Create a grounded context prompt for GenAI analysis
    Follows Rule 1 & 3: Evidence In, Chain-of-Custody

    Assumes scoreboard_df is sorted best-first (row 0 is the best model).
    """
    has_scores = len(scoreboard_df) > 0
    has_diagnostics = len(diagnostics_df) > 0

    # Resolve values before formatting so empty inputs fall back to 'N/A'
    best_model = scoreboard_df.iloc[0]['model_name'] if has_scores else 'N/A'
    best_wmape = f"{scoreboard_df.iloc[0]['wmape']:.2%}" if has_scores else 'N/A'
    n_segments = diagnostics_df['segment'].nunique() if has_diagnostics else 'N/A'
    primary_action = diagnostics_df.iloc[0]['action'] if has_diagnostics else 'N/A'

    # Build evidence summary
    scoreboard_summary = f"""
    Scoreboard Summary (Evidence):
    - Best model: {best_model}
    - Best wMAPE: {best_wmape}
    - Model count: {len(scoreboard_df)}
    """

    diagnostics_summary = f"""
    Diagnostics Summary (Evidence):
    - High-error segments identified: {n_segments}
    - Primary failure mode: {primary_action}
    """

    # The prompt promises forensic evidence, so include what was loaded
    forensic_summary = f"""
    Forensic Summary (Evidence):
    - SKUs with forensic findings: {len(forensic_df)}
    """

    prompt = f"""
    You are analyzing forecasting model performance. Your role is to:
    1. Interpret measured evidence (scoreboard, diagnostics, forensics)
    2. Identify patterns that suggest specific fixes
    3. Generate hypotheses about what features or techniques could improve performance

    CRITICAL: Base all interpretations on the evidence below. If evidence contradicts your interpretation, the evidence wins.

    {scoreboard_summary}

    {diagnostics_summary}

    {forensic_summary}

    Given this evidence, please:
    1. Summarize the current model performance status
    2. Identify the top 3 actionable insights for improvement
    3. For each insight, suggest specific features or techniques
    4. Prioritize by impact potential
    """

    return prompt

# Generate context prompt
context_prompt = create_genai_context_prompt(scoreboard, diagnostics_summary, forensic_findings)
print("Context Prompt Generated:")
print(context_prompt[:500] + "...")

In [None]:
# Example: Mock GenAI response (in real scenario, call actual API)
def mock_genai_analysis(context):
    """
    Mock GenAI response for demonstration
    In production, replace with actual API call to OpenAI/Claude/etc
    """
    analysis = {
        'status': 'Baseline system is performing adequately',
        'top_insights': [
            {
                'insight': 'Seasonal patterns are not fully captured',
                'evidence': 'Residual analysis shows 52-week autocorrelation',
                'recommendation': 'Add explicit seasonal indicators (month, quarter, day-of-week)',
                'impact': 'HIGH',
                'effort': 'LOW'
            },
            {
                'insight': 'Model misses promotional spikes',
                'evidence': 'Forensic investigation identified holiday/promo alignment with errors',
                'recommendation': 'Create promotional calendar feature and include in model',
                'impact': 'MEDIUM',
                'effort': 'MEDIUM'
            },
            {
                'insight': 'Variance is high in long-tail items',
                'evidence': 'Diagnostics show error concentration in low-volume SKUs',
                'recommendation': 'Use ensemble approach or hybrid policy (forecast + inventory rule)',
                'impact': 'MEDIUM',
                'effort': 'HIGH'
            }
        ]
    }
    return analysis

# Run mock analysis
genai_analysis = mock_genai_analysis(context_prompt)
print("\nGenAI Analysis Results:")
print(f"Status: {genai_analysis['status']}")
print(f"\nTop Insights:")
for i, insight in enumerate(genai_analysis['top_insights'], 1):
    print(f"\n{i}. {insight['insight']}")
    print(f"   Evidence: {insight['evidence']}")
    print(f"   Recommendation: {insight['recommendation']}")
    print(f"   Impact: {insight['impact']} | Effort: {insight['effort']}")
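When the mock is swapped for a real model, the reply arrives as text rather than a ready-made dictionary. One way to keep the rest of the notebook unchanged is to parse and validate that text before anything downstream reads it. This is a sketch under assumptions: the model is asked to reply in JSON with the same keys the mock uses, and `parse_genai_response`, `run_genai_analysis`, and `fake_model` are hypothetical helpers. The provider-specific API call is deliberately left out; it would live inside whatever function is passed as `call_model`.

```python
import json

REQUIRED_INSIGHT_KEYS = {'insight', 'evidence', 'recommendation', 'impact', 'effort'}
ALLOWED_LEVELS = {'HIGH', 'MEDIUM', 'LOW'}

def parse_genai_response(raw_text):
    """Parse a JSON reply and reject insights that break the expected schema."""
    data = json.loads(raw_text)  # raises a ValueError subclass on non-JSON replies
    for item in data.get('top_insights', []):
        missing = REQUIRED_INSIGHT_KEYS - item.keys()
        if missing:
            raise ValueError(f"Insight missing keys: {sorted(missing)}")
        if item['impact'] not in ALLOWED_LEVELS or item['effort'] not in ALLOWED_LEVELS:
            raise ValueError(f"Unexpected impact/effort level in: {item['insight']}")
    return data

def run_genai_analysis(prompt, call_model):
    """call_model is any function str -> str; the real API call goes there."""
    return parse_genai_response(call_model(prompt))

def fake_model(prompt):
    # Stand-in for a real API call, returning a canned JSON reply
    return json.dumps({
        'status': 'Baseline system is performing adequately',
        'top_insights': [{
            'insight': 'Seasonal patterns are not fully captured',
            'evidence': 'Residual analysis shows 52-week autocorrelation',
            'recommendation': 'Add explicit seasonal indicators',
            'impact': 'HIGH',
            'effort': 'LOW',
        }],
    })

result = run_genai_analysis("evidence goes here", fake_model)
print(result['top_insights'][0]['impact'])  # HIGH
```

Rejecting a malformed reply outright, rather than patching it, keeps the chain of custody intact: nothing enters the backlog unless it has an `evidence` field to check.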

---
## SECTION 3: HYPOTHESIS GENERATION FROM EVIDENCE

In [None]:
def validate_genai_hypothesis(hypothesis, scoreboard_df, tolerance=0.05):
    """
    Validate GenAI hypothesis against measured evidence
    Rule 2 & 3: Scoreboard Veto + Chain-of-Custody
    """
    
    validation_result = {
        'hypothesis': hypothesis,
        'is_grounded': False,
        'supporting_evidence': [],
        'contradicting_evidence': [],
        'confidence': 0.0
    }
    
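    # Placeholder check: this branch only confirms that a scoreboard exists and
    # cites its top row. It does not test the seasonal claim itself; a real check
    # would compare residuals or segment-level metrics. The 0.7 confidence is a
    # hard-coded illustration, not a computed value.
    if 'seasonal' in hypothesis.lower():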
        if len(scoreboard_df) > 0:
            validation_result['supporting_evidence'].append(
                f"Scoreboard shows {scoreboard_df.iloc[0]['model_name']} has wMAPE of {scoreboard_df.iloc[0]['wmape']:.2%}"
            )
            validation_result['is_grounded'] = True
            validation_result['confidence'] = 0.7
    
    return validation_result

# Validate each GenAI hypothesis
validated_hypotheses = []
for insight in genai_analysis['top_insights']:
    validation = validate_genai_hypothesis(insight['insight'], scoreboard)
    validated_hypotheses.append(validation)

print("\nHypothesis Validation (Rule 2: Scoreboard Veto):")
for i, val in enumerate(validated_hypotheses, 1):
    print(f"\n{i}. {val['hypothesis']}")
    print(f"   Grounded in Evidence: {val['is_grounded']}")
    print(f"   Confidence: {val['confidence']:.0%}")
    if val['supporting_evidence']:
        print(f"   Supporting Evidence: {val['supporting_evidence'][0]}")
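The seasonal insight in the mock cites "52-week autocorrelation" in the residuals, which is a claim that can be measured directly. A minimal sketch of such a check, assuming weekly residuals are available as a series; the residuals below are synthetic, and both the function name `seasonal_claim_is_grounded` and the 0.3 threshold are illustrative choices rather than values from the course:

```python
import numpy as np
import pandas as pd

def seasonal_claim_is_grounded(residuals, lag=52, threshold=0.3):
    """Return (grounded, autocorrelation) for a weekly residual series."""
    acf = pd.Series(residuals).autocorr(lag=lag)
    return bool(acf > threshold), float(acf)

rng = np.random.default_rng(0)
weeks = np.arange(208)  # four years of weekly residuals

# Synthetic residuals: one series with a yearly cycle left in, one pure noise
seasonal_residuals = np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 0.3, size=weeks.size)
noise_residuals = rng.normal(0, 1.0, size=weeks.size)

seasonal_result = seasonal_claim_is_grounded(seasonal_residuals)
noise_result = seasonal_claim_is_grounded(noise_residuals)

print(f"Seasonal residuals: grounded={seasonal_result[0]}, lag-52 autocorr={seasonal_result[1]:.2f}")
print(f"Noise residuals:    grounded={noise_result[0]}, lag-52 autocorr={noise_result[1]:.2f}")
```

A check like this turns `is_grounded` into a measured result: the hypothesis passes because the residuals show the pattern, not because the word "seasonal" appears in the text.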

---
## SECTION 4: ACTIONABLE BACKLOG GENERATION

In [None]:
def generate_feature_backlog_from_genai(genai_insights, validated_hypotheses):
    """
    Convert GenAI insights + validated hypotheses into actionable backlog
    Rule 4: GenAI generates hypotheses, humans prioritize
    """
    
    backlog = []
    
    for insight, validation in zip(genai_insights['top_insights'], validated_hypotheses):
        if validation['is_grounded']:  # Only include grounded insights
            backlog_item = {
                'feature': insight['recommendation'],
                'insight': insight['insight'],
                'evidence': validation['supporting_evidence'],
                'impact_potential': insight['impact'],
                'implementation_effort': insight['effort'],
                'priority_score': calculate_priority(insight['impact'], insight['effort']),
                'status': 'PENDING_REVIEW',
                'module': 'Module 3: Feature Engineering'
            }
            backlog.append(backlog_item)
    
    return pd.DataFrame(backlog)

def calculate_priority(impact, effort):
    """
    Simple priority score: impact / effort
    HIGH=3, MEDIUM=2, LOW=1 on both scales, so high impact
    with low effort gets the highest score.
    """
    levels = {'HIGH': 3, 'MEDIUM': 2, 'LOW': 1}
    impact_val = levels.get(impact, 1)  # unknown impact is treated as LOW
    effort_val = levels.get(effort, 3)  # unknown effort is treated as HIGH
    return impact_val / effort_val

# Generate backlog
feature_backlog = generate_feature_backlog_from_genai(genai_analysis, validated_hypotheses)

print("\nFeature Engineering Backlog (Rule 4: Human Authority):")
if feature_backlog.empty:
    # No insight passed validation, so there are no columns to sort or display
    print("No grounded insights; nothing to prioritize.")
else:
    feature_backlog = feature_backlog.sort_values('priority_score', ascending=False)
    print(feature_backlog[['feature', 'insight', 'impact_potential', 'implementation_effort', 'priority_score']].to_string(index=False))

# Save backlog for Module 3
feature_backlog.to_csv('genai_feature_backlog.csv', index=False)
print("\n✓ Feature backlog saved to 'genai_feature_backlog.csv'")
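To make the impact / effort arithmetic concrete, here is a small worked example using the three mock insights. It assumes both scales map HIGH=3, MEDIUM=2, LOW=1, so dividing by effort penalizes expensive work; `priority_score` is a stand-alone illustration, not a replacement for the notebook's function.

```python
LEVELS = {'HIGH': 3, 'MEDIUM': 2, 'LOW': 1}

def priority_score(impact, effort):
    """Impact divided by effort, both on the HIGH=3 / MEDIUM=2 / LOW=1 scale."""
    return LEVELS[impact] / LEVELS[effort]

# (feature, impact, effort) taken from the mock GenAI insights
mock_items = [
    ('Seasonal indicators', 'HIGH', 'LOW'),
    ('Promotional calendar', 'MEDIUM', 'MEDIUM'),
    ('Ensemble / hybrid policy', 'MEDIUM', 'HIGH'),
]

ranked = sorted(mock_items, key=lambda item: priority_score(item[1], item[2]), reverse=True)
for name, impact, effort in ranked:
    print(f"{name}: {priority_score(impact, effort):.2f}")
# Seasonal indicators: 3.00
# Promotional calendar: 1.00
# Ensemble / hybrid policy: 0.67
```

The score is only an ordering aid. Under Rule 4 a human can still reorder the backlog.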

---
## SECTION 5: WHEN GENAI CONTRADICTS EVIDENCE (RULE 2)

In [None]:
def check_scoreboard_veto(genai_claim, scoreboard_df):
    """
    Check if GenAI claim contradicts scoreboard evidence
    Rule 2: Scoreboard Veto
    """
    
    veto_result = {
        'claim': genai_claim,
        'contradicts_evidence': False,
        'veto_reason': None,
        'evidence_override': None
    }
    
    # Example: GenAI names a "best" model that the scoreboard does not rank first.
    # Assumes scoreboard_df is sorted best-first. Matching on the word "best"
    # (not the exact phrase "best model") also catches "best choice", "best option".
    if 'best' in genai_claim.lower():
        if len(scoreboard_df) > 0:
            actual_best = str(scoreboard_df.iloc[0]['model_name'])
            if actual_best.lower() not in genai_claim.lower():
                veto_result['contradicts_evidence'] = True
                veto_result['veto_reason'] = f"Scoreboard shows {actual_best} is actually the best"
                veto_result['evidence_override'] = f"Use {actual_best} instead"
    
    return veto_result

# Example veto check
example_claim = "Model Theta seems like the best choice"
veto = check_scoreboard_veto(example_claim, scoreboard)

print(f"Claim: {veto['claim']}")
print(f"Contradicts Evidence: {veto['contradicts_evidence']}")
if veto['contradicts_evidence']:
    print(f"Veto: {veto['veto_reason']}")
    print(f"Evidence Override: {veto['evidence_override']}")
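A veto check is easier to trust when it does not depend on row order. The sketch below looks up the champion by the metric itself and flags only the models a claim actually names. It assumes lower wMAPE is better and that model names appear verbatim in the claim; `models_claimed_best`, `vetoed_models`, and the toy scoreboard values are hypothetical.

```python
import re
import pandas as pd

def models_claimed_best(claim, scoreboard_df):
    """Return scoreboard model names that appear as whole words in a 'best' claim."""
    text = claim.lower()
    if 'best' not in text:
        return []
    return [
        name for name in scoreboard_df['model_name']
        if re.search(rf"\b{re.escape(str(name).lower())}\b", text)
    ]

def vetoed_models(claim, scoreboard_df, metric='wmape'):
    """Return claimed-best models that do not hold the lowest value of `metric`."""
    champion = scoreboard_df.loc[scoreboard_df[metric].idxmin(), 'model_name']
    return [name for name in models_claimed_best(claim, scoreboard_df) if name != champion]

# Toy scoreboard, deliberately not sorted (values are made up)
toy = pd.DataFrame({'model_name': ['Theta', 'ETS', 'Naive'], 'wmape': [0.27, 0.22, 0.35]})

print(vetoed_models("Theta seems like the best choice", toy))  # ['Theta']
print(vetoed_models("ETS is the best model", toy))             # []
```

Keyword matching remains crude: a claim such as "Theta is not the best" would still be flagged, so a veto is a prompt for human review under Rule 4 rather than an automatic rejection.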

---
## SECTION 6: DELIVERABLE - EXECUTIVE SUMMARY

In [None]:
# Generate executive summary combining evidence + GenAI interpretation
executive_summary = f"""
FORECASTING SYSTEM EVALUATION SUMMARY
{'='*70}

CURRENT STATE (Evidence-Based)
- Champion Model: {scoreboard.iloc[0]['model_name'] if len(scoreboard) > 0 else 'N/A'}
- Portfolio wMAPE: {format(scoreboard.iloc[0]['wmape'], '.2%') if len(scoreboard) > 0 else 'N/A'}
- Assessment Status: Ready for diagnostics review

KEY FINDINGS (Grounded Analysis)
"""

for i, (insight, validation) in enumerate(
        zip(genai_analysis['top_insights'][:3], validated_hypotheses), 1):
    # Flag each finding so ungrounded hypotheses are not presented as evidence
    grounded = 'Yes' if validation['is_grounded'] else 'No (hypothesis only)'
    executive_summary += f"""

{i}. {insight['insight']}
   Recommendation: {insight['recommendation']}
   Priority: {insight['impact']} Impact / {insight['effort']} Effort
   Grounded in evidence: {grounded}
"""

executive_summary += f"""

NEXT STEPS (Module 3)
- Review recommended features
- Implement highest priority items
- Validate improvements via cross-validation

GOVERNANCE NOTES
- Only findings marked as grounded are backed by measured evidence; the rest are hypotheses
- Scoreboard remains final authority for model selection
- Feature priorities should be reviewed by human decision-maker
{'='*70}
"""

print(executive_summary)

# Save summary
with open('executive_summary_2.08.txt', 'w') as f:
    f.write(executive_summary)

print("\n✓ Executive summary saved to 'executive_summary_2.08.txt'")