# Day 2, Session 4 - Lab: Build a Prompt Management System

## Goal: Build a Production Prompt Management System

You'll create a comprehensive prompt system that:
- Adapts to different invoice formats
- Guarantees structured outputs
- Handles validation failures gracefully
- Tracks performance metrics
- Enables A/B testing of prompt variants

Together, these pieces make your invoice agent more reliable and easier to maintain.

**Time Allocation: 50 minutes**
- Task 1: Define Structured Output Models (8 min)
- Task 2: Create Base Extraction Prompts (10 min)
- Task 3: Implement Conditional Prompts (8 min)
- Task 4: Add Instructor Integration (10 min)
- Task 5: Implement Prompt Caching (5 min)
- Task 6: Build Testing Framework (5 min)
- Task 7: Add Versioning System (4 min)

In [None]:
# Global configuration - your course instructor will provide these values
OLLAMA_URL = "http://XX.XX.XX.XX"  # Course server IP (port 80)
API_TOKEN = "YOUR_TOKEN_HERE"      # Instructor provides token
MODEL = "qwen3:8b"                  # Default model on server

In [None]:
!pip install langchain instructor pydantic jinja2 requests pandas

In [None]:
from langchain.prompts import PromptTemplate, FewShotPromptTemplate
from instructor import patch
from pydantic import BaseModel, Field, validator
from typing import List, Optional, Dict, Any
import json
from datetime import datetime
from functools import lru_cache
import hashlib
import requests
import time
import pandas as pd
from collections import defaultdict

In [None]:
# Download Invoice Dataset
import requests
import zipfile
import io

dropbox_url = "https://www.dropbox.com/scl/fo/m9hyfmvi78snwv0nh34mo/AMEXxwXMLAOeve-_yj12ck8?rlkey=urinkikgiuven0fro7r4x5rcu&st=hv3of7g7&dl=1"

print(f"Downloading data from: {dropbox_url}")

try:
    response = requests.get(dropbox_url)
    response.raise_for_status()

    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        z.extractall("downloaded_images")

    print("✅ Downloaded and extracted images to 'downloaded_images' folder.")

except requests.exceptions.RequestException as e:
    print(f"❌ Error downloading the file: {e}")
except zipfile.BadZipFile:
    print("❌ Error: The downloaded file is not a valid zip file.")
except Exception as e:
    print(f"❌ An unexpected error occurred: {e}")

## Task 1: Define Structured Output Models (8 minutes)

### Create Pydantic Models for Invoice Data

Define strict schemas that will be enforced on LLM outputs.
Include validation rules and default values.

**TODO Instructions:**
1. Complete the LineItem validator to ensure total = quantity × unit_price
2. Add more validators to InvoiceExtraction for business rules
3. Test your models with sample data
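
The arithmetic behind these validators can be sketched without Pydantic. The helper names below (`line_total_ok`, `invoice_total_errors`) and the 0.01 tolerance are illustrative assumptions, not part of the lab scaffold; each check could be called from the matching `@validator` and turned into a `ValueError` there.

```python
import math

TOLERANCE = 0.01  # assumed rounding tolerance, in currency units


def line_total_ok(quantity: float, unit_price: float, total: float) -> bool:
    """True when total matches quantity * unit_price within TOLERANCE."""
    return math.isclose(total, quantity * unit_price, abs_tol=TOLERANCE)


def invoice_total_errors(line_totals, subtotal, tax_rate, tax_amount, total_amount):
    """Return a list of rule violations (empty when the invoice is consistent)."""
    errors = []
    if not math.isclose(sum(line_totals), subtotal, abs_tol=TOLERANCE):
        errors.append("subtotal does not match the sum of line item totals")
    if not math.isclose(subtotal * tax_rate, tax_amount, abs_tol=TOLERANCE):
        errors.append("tax_amount does not match subtotal * tax_rate")
    if not math.isclose(subtotal + tax_amount, total_amount, abs_tol=TOLERANCE):
        errors.append("total_amount does not match subtotal + tax_amount")
    return errors


# A consistent invoice produces no errors; a wrong grand total produces one.
print(line_total_ok(2.0, 100.0, 200.0))
print(invoice_total_errors([200.0], 200.0, 0.08, 16.0, 216.0))
print(invoice_total_errors([200.0], 200.0, 0.08, 16.0, 300.0))
```

Note that the mock extractions used later in this lab pass `line_items=[]` together with a non-zero `subtotal`, so a strict subtotal check like this one would reject them until the mocks include matching line items.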

In [None]:
class LineItem(BaseModel):
    """Individual invoice line item"""
    description: str = Field(description="Item description")
    quantity: float = Field(gt=0, description="Quantity ordered")
    unit_price: float = Field(ge=0, description="Price per unit")
    total: float = Field(ge=0, description="Line total")
    
    @validator('total')
    def validate_total(cls, v, values):
        """Ensure total = quantity * unit_price"""
        # TODO: Implement validation
        # if 'quantity' in values and 'unit_price' in values:
        #     expected = values['quantity'] * values['unit_price']
        #     if abs(v - expected) > 0.01:  # Allow small rounding differences
        #         raise ValueError(f"Total {v} doesn't match quantity {values['quantity']} × unit_price {values['unit_price']} = {expected}")
        # return v
        
        print("TODO: Complete total validation")
        return v

class InvoiceExtraction(BaseModel):
    """Complete invoice data structure"""
    # `pattern` is the Pydantic v2 name for the v1 `regex` argument
    invoice_number: str = Field(pattern=r'^[A-Z0-9\-]+$')
    vendor_name: str = Field(min_length=2)
    invoice_date: datetime
    due_date: Optional[datetime] = None
    
    line_items: List[LineItem]
    
    subtotal: float = Field(ge=0)
    tax_rate: float = Field(ge=0, le=1)
    tax_amount: float = Field(ge=0)
    total_amount: float = Field(ge=0)
    
    currency: str = Field(pattern=r'^[A-Z]{3}$', default="USD")
    confidence_score: float = Field(ge=0, le=1)
    
    # TODO: Add more validators
    @validator('tax_amount')
    def validate_tax_calculation(cls, v, values):
        """Validate tax calculation"""
        # TODO: Check if tax_amount = subtotal × tax_rate
        print("TODO: Implement tax validation")
        return v
    
    @validator('total_amount')
    def validate_total_amount(cls, v, values):
        """Validate total = subtotal + tax"""
        # TODO: Check if total_amount = subtotal + tax_amount
        print("TODO: Implement total amount validation")
        return v
    
    @validator('subtotal')
    def validate_subtotal_matches_line_items(cls, v, values):
        """Validate subtotal matches sum of line items"""
        # TODO: Check if subtotal = sum of all line item totals
        print("TODO: Implement subtotal validation")
        return v

# Test your models
print("🧪 Testing Pydantic models...")

# TODO: Create test data and validate
test_line_item = {
    "description": "Test Product",
    "quantity": 2.0,
    "unit_price": 100.0,
    "total": 200.0
}

try:
    item = LineItem(**test_line_item)
    print(f"✅ LineItem validation passed: {item.description}")
except Exception as e:
    print(f"❌ LineItem validation failed: {e}")

print("\n📝 TODO Summary for Task 1:")
print("1. Complete validate_total() in LineItem")
print("2. Implement tax_amount validation")
print("3. Add total_amount validation")
print("4. Validate subtotal matches line items sum")
print("5. Test with complete invoice data")

## Task 2: Create Base Extraction Prompts (10 minutes)

### Build LangChain Prompt Templates

Create reusable, parameterized prompts with variables
for dynamic content and few-shot examples.

**TODO Instructions:**
1. Complete the few-shot examples with realistic invoice data
2. Create specialized prompts for different document types
3. Test prompt rendering with sample data
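
One detail to watch when adding JSON examples: in the LangChain versions this sketch assumes, `FewShotPromptTemplate` renders each example and then formats the assembled prompt a second time, so literal braces in an example output need to be doubled. The code below imitates that two-pass behavior with plain `str.format`; `render_few_shot` and `escape_braces` are hypothetical helpers written for illustration, not LangChain functions.

```python
PREFIX = "Extract invoice data following these examples:"
SUFFIX = "Now extract from:\n{invoice_text}"
EXAMPLE_TEMPLATE = "Input: {input}\nOutput: {output}"


def render_few_shot(prefix, examples, suffix, separator="\n\n", **variables):
    """Imitate a two-pass few-shot render: examples first, then the whole prompt."""
    example_strings = [EXAMPLE_TEMPLATE.format(**example) for example in examples]
    combined = separator.join([prefix, *example_strings, suffix])
    # Second pass: any single brace left in `combined` is read as a placeholder.
    return combined.format(**variables)


def escape_braces(text: str) -> str:
    """Double the braces so they survive the second formatting pass."""
    return text.replace("{", "{{").replace("}", "}}")


raw_output = '{"invoice_number": "INV-001", "total_amount": 1200.0}'
escaped_examples = [
    {"input": "Invoice #INV-001, Total: $1200", "output": escape_braces(raw_output)}
]

prompt = render_few_shot(
    PREFIX, escaped_examples, SUFFIX, invoice_text="Invoice #TEST-001, Total: $500"
)
print(prompt)
```

Passing `raw_output` without escaping makes the second pass treat `"invoice_number"` as a missing variable and raise an error, which is the failure to look for if the few-shot test cell reports a problem.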

In [None]:
# Basic extraction prompt
basic_extraction_prompt = PromptTemplate(
    input_variables=["invoice_text", "output_format"],
    template="""Extract invoice information from the following text:

{invoice_text}

Return the data in this format:
{output_format}

Be precise and extract only what you can clearly identify."""
)

# Few-shot prompt with examples
examples = [
    {
        "input": "Invoice #INV-001 from Acme Corp, Date: 2024-01-15, Item: Laptop $1200, Total: $1200",
        # Braces are doubled because FewShotPromptTemplate formats the assembled prompt a second time
        "output": '{{"invoice_number": "INV-001", "vendor_name": "Acme Corp", "total_amount": 1200.0}}'
    },
    # TODO: Add 2-3 more examples (keep the doubled braces around JSON)
    # {
    #     "input": "Invoice #ABC-123 from TechSupplies Inc...",
    #     "output": '{{"invoice_number": "ABC-123", ...}}'
    # },
]

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=PromptTemplate(
        input_variables=["input", "output"],
        template="Input: {input}\nOutput: {output}"
    ),
    prefix="Extract invoice data following these examples:",
    suffix="Now extract from:\n{invoice_text}",
    input_variables=["invoice_text"]
)

# Chain-of-thought prompt
cot_extraction_prompt = PromptTemplate(
    input_variables=["invoice_text"],
    template="""Let's extract invoice data step by step.

Step 1: Identify the vendor
Step 2: Find invoice number and date
Step 3: Extract line items
Step 4: Calculate totals

Invoice text:
{invoice_text}

Now perform each step:"""
)

# TODO: Create specialized prompts
table_specific_prompt = PromptTemplate(
    input_variables=["invoice_text"],
    template="""This invoice contains complex tables. Extract data carefully:

Focus on:
- Table headers to identify columns
- Row-by-row data extraction
- Proper alignment of quantities and prices

Invoice:
{invoice_text}

Extract table data systematically:"""
)

handwritten_prompt = PromptTemplate(
    input_variables=["invoice_text"],
    template="""This appears to be handwritten or poor quality text. Extract carefully:

- Some characters may be unclear
- Use context to infer missing information
- Mark uncertain extractions with lower confidence

Text:
{invoice_text}

Extract what you can determine:"""
)

# Test prompt rendering
print("🧪 Testing prompt templates...")

sample_invoice = "Invoice #TEST-001 from Sample Corp, Date: 2024-01-15, Total: $500.00"
sample_format = "JSON with invoice_number, vendor_name, total_amount"

# Test basic prompt
try:
    rendered = basic_extraction_prompt.format(
        invoice_text=sample_invoice,
        output_format=sample_format
    )
    print(f"✅ Basic prompt rendered ({len(rendered)} chars)")
    print(f"   Preview: {rendered[:100]}...")
except Exception as e:
    print(f"❌ Basic prompt failed: {e}")

# Test few-shot prompt
try:
    rendered = few_shot_prompt.format(invoice_text=sample_invoice)
    print(f"✅ Few-shot prompt rendered ({len(rendered)} chars)")
    print(f"   Examples included: {len(examples)}")
except Exception as e:
    print(f"❌ Few-shot prompt failed: {e}")

# Test CoT prompt
try:
    rendered = cot_extraction_prompt.format(invoice_text=sample_invoice)
    print(f"✅ Chain-of-thought prompt rendered ({len(rendered)} chars)")
except Exception as e:
    print(f"❌ CoT prompt failed: {e}")

print("\n📝 TODO Summary for Task 2:")
print("1. Add 2-3 more realistic examples to few_shot_prompt")
print("2. Create error-specific enhancement prompts")
print("3. Add prompts for multi-language invoices")
print("4. Test all prompts with various invoice formats")
print("5. Measure prompt token usage for optimization")

## Task 3: Implement Conditional Prompts (8 minutes)

### Create Document-Specific Prompts

Different document types and qualities need different approaches.
Build a routing system to select optimal prompts.

**TODO Instructions:**
1. Complete the document analysis logic
2. Implement prompt selection based on characteristics
3. Add error-based prompt enhancement
4. Test with different invoice types
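
The analysis and routing steps above can be sketched with the standard library alone. The heuristics here are assumptions chosen for illustration: a table is detected only when at least two lines contain two or more `|` or tab separators (keywords such as "total" appear on nearly every invoice, so they are used for the quality score instead), and the thresholds 0.3 and 0.7 mirror the commented-out hints in the scaffold.

```python
QUALITY_KEYWORDS = ("invoice", "date", "total", "vendor")


def analyze(text: str) -> dict:
    """Collect simple characteristics used for routing."""
    lowered = text.lower()
    table_rows = [
        line for line in text.splitlines()
        if line.count("|") >= 2 or line.count("\t") >= 2
    ]
    found = sum(1 for keyword in QUALITY_KEYWORDS if keyword in lowered)
    return {
        "length": len(text),
        "has_tables": len(table_rows) >= 2,
        "quality_score": found / len(QUALITY_KEYWORDS),
    }


def choose_prompt_key(characteristics: dict) -> str:
    """Map characteristics to one of the prompt keys used by PromptSelector."""
    if characteristics["quality_score"] < 0.3:
        return "poor_quality"
    if characteristics["has_tables"]:
        return "complex_table"
    if characteristics["quality_score"] < 0.7:
        return "medium_quality"
    return "high_quality"


samples = [
    "Invoice #1 from vendor Acme, date 2024-01-15, total $500",
    "Invoice date total\nitem | qty | price\nLaptop | 1 | 1200",
    "sc@nned t3xt w1th n0 cl3ar f1elds",
]
for sample in samples:
    print(choose_prompt_key(analyze(sample)))
```

Checking the lowest quality band first means a badly scanned table still goes to the step-by-step prompt rather than the table prompt; swap the order if your data suggests otherwise.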

In [None]:
class PromptSelector:
    """Select optimal prompt based on document characteristics"""
    
    def __init__(self):
        self.prompts = {
            'high_quality': basic_extraction_prompt,
            'medium_quality': few_shot_prompt,
            'poor_quality': cot_extraction_prompt,
            'complex_table': table_specific_prompt,
            'handwritten': handwritten_prompt
        }
        self.usage_stats = defaultdict(int)
    
    def analyze_document(self, text: str) -> Dict[str, Any]:
        """
        Analyze document characteristics
        TODO: Implement analysis logic
        - Check text length
        - Detect tables
        - Assess quality indicators
        """
        characteristics = {
            'length': len(text),
            'has_tables': False,
            'quality_score': 1.0,
            'language': 'en',
            'complexity': 'simple'
        }
        
        # TODO: Implement real analysis
        # Check for table indicators
        # table_indicators = ['|', '\t', 'qty', 'quantity', 'price', 'total']
        # characteristics['has_tables'] = any(indicator in text.lower() for indicator in table_indicators)
        
        # Assess quality based on text patterns
        # quality_indicators = ['invoice', 'date', 'total', 'vendor']
        # found_indicators = sum(1 for indicator in quality_indicators if indicator in text.lower())
        # characteristics['quality_score'] = found_indicators / len(quality_indicators)
        
        # Detect complexity
        # if len(text) > 1000:
        #     characteristics['complexity'] = 'complex'
        # elif characteristics['has_tables']:
        #     characteristics['complexity'] = 'medium'
        
        print(f"TODO: Complete document analysis - currently returning defaults")
        return characteristics
    
    def select_prompt(self, text: str) -> PromptTemplate:
        """
        Choose best prompt for document
        TODO: Implement selection logic
        """
        characteristics = self.analyze_document(text)
        
        # TODO: Route to appropriate prompt based on characteristics
        selected_prompt_key = 'high_quality'  # Default
        
        # if characteristics['quality_score'] < 0.3:
        #     selected_prompt_key = 'poor_quality'
        # elif characteristics['has_tables']:
        #     selected_prompt_key = 'complex_table'
        # elif characteristics['quality_score'] < 0.7:
        #     selected_prompt_key = 'medium_quality'
        # else:
        #     selected_prompt_key = 'high_quality'
        
        self.usage_stats[selected_prompt_key] += 1
        
        print(f"Selected prompt: {selected_prompt_key} (TODO: implement real selection logic)")
        return self.prompts[selected_prompt_key]
    
    def get_usage_stats(self) -> Dict[str, int]:
        """Get prompt usage statistics"""
        return dict(self.usage_stats)

def enhance_prompt_for_errors(base_prompt: PromptTemplate, 
                              errors: List[str]) -> PromptTemplate:
    """
    Add error-specific instructions to prompt
    TODO: Implement enhancement logic
    """
    enhancement_instructions = []
    
    # TODO: Add error-specific enhancements
    # if "missing_vendor" in errors:
    #     enhancement_instructions.append(
    #         "IMPORTANT: Look carefully for company name, vendor, or seller information."
    #     )
    # 
    # if "invalid_totals" in errors:
    #     enhancement_instructions.append(
    #         "IMPORTANT: Verify all calculations. Check that line totals = quantity × unit_price."
    #     )
    # 
    # if "date_parsing" in errors:
    #     enhancement_instructions.append(
    #         "IMPORTANT: Convert dates to YYYY-MM-DD format. Look for date patterns like MM/DD/YYYY."
    #     )
    
    if enhancement_instructions:
        enhanced_template = "\n".join(enhancement_instructions) + "\n\n" + base_prompt.template
        return PromptTemplate(
            input_variables=base_prompt.input_variables,
            template=enhanced_template
        )
    
    print("TODO: Implement error-specific prompt enhancements")
    return base_prompt

# Test the prompt selector
print("🧪 Testing prompt selection system...")

prompt_selector = PromptSelector()

# Test with different document types
test_documents = [
    ("Simple invoice: ABC Corp, $500 total", "simple"),
    ("Complex table with multiple items and calculations...", "complex"),
    ("Poorly scanned text with missing characters...", "poor_quality")
]

for doc_text, doc_type in test_documents:
    print(f"\n📄 Testing {doc_type} document:")
    characteristics = prompt_selector.analyze_document(doc_text)
    selected_prompt = prompt_selector.select_prompt(doc_text)
    print(f"   Characteristics: {characteristics}")
    print(f"   Selected prompt type: {type(selected_prompt).__name__}")

# Test error enhancement
print(f"\n🔧 Testing error-based enhancement:")
sample_errors = ["missing_vendor", "invalid_totals"]
enhanced_prompt = enhance_prompt_for_errors(basic_extraction_prompt, sample_errors)
print(f"   Enhanced prompt length: {len(enhanced_prompt.template)} chars")

# Show usage statistics
print(f"\n📊 Prompt usage statistics:")
for prompt_type, count in prompt_selector.get_usage_stats().items():
    print(f"   {prompt_type}: {count} uses")

print("\n📝 TODO Summary for Task 3:")
print("1. Complete analyze_document() with real quality assessment")
print("2. Implement smart prompt selection in select_prompt()")
print("3. Add error-specific enhancements in enhance_prompt_for_errors()")
print("4. Add support for multi-language document detection")
print("5. Test with real invoice samples of varying quality")

## Task 4: Add Instructor Integration (10 minutes)

### Guaranteed Structured Output with Instructor

Use the Instructor library to ensure LLM outputs match
your Pydantic schemas, with automatic retry on failure.

**TODO Instructions:**
1. Implement the LLM calling logic
2. Add JSON parsing and Pydantic validation
3. Implement retry logic with prompt enhancement
4. Test with sample invoice data
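
The parse, validate, and retry steps can be sketched by hand before wiring in the real server. This is a simplified stand-in for what the Instructor library automates, not its API: `extract_json_object`, `extract_with_feedback`, and `make_fake_llm` are hypothetical helpers, the "validation" is only a required-field check, and the LLM is a canned stub so the cell runs offline.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Pull the outermost {...} span out of a chatty LLM reply and parse it."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in response")
    return json.loads(raw[start:end + 1])


def extract_with_feedback(llm, prompt, required, max_attempts=3):
    """Retry extraction, appending the previous error to the next prompt."""
    errors = []
    for _ in range(max_attempts):
        full_prompt = prompt
        if errors:
            full_prompt += "\n\nPrevious attempt failed: " + errors[-1]
        try:
            data = extract_json_object(llm(full_prompt))
            missing = [name for name in required if name not in data]
            if missing:
                raise ValueError(f"Missing fields: {missing}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            errors.append(str(exc))
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts: {errors[-1]}")


def make_fake_llm(replies):
    """Return a stub LLM that serves canned replies and records each prompt."""
    calls = []
    pending = list(replies)

    def fake_llm(prompt):
        calls.append(prompt)
        return pending.pop(0)

    return fake_llm, calls


fake_llm, calls = make_fake_llm([
    "I could not find an invoice.",
    'Result: {"invoice_number": "INV-001", "vendor_name": "Acme Corp"}',
])
result = extract_with_feedback(
    fake_llm, "Extract invoice data as JSON.", required=("invoice_number", "vendor_name")
)
print(result)
print(len(calls))
```

In the lab, `call_llm` would take the place of the stub and `InvoiceExtraction(**data)` would replace the required-field check, with Pydantic's error text fed back into the next attempt.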

In [None]:
def call_llm(prompt: str) -> str:
    """Call the course LLM server"""
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json"
    }
    
    data = {
        "model": MODEL,
        "prompt": prompt
    }
    
    try:
        response = requests.post(
            f"{OLLAMA_URL}/think",
            headers=headers,
            json=data
        )
        
        if response.status_code == 200:
            return response.json().get('response', '')
        else:
            raise Exception(f"HTTP {response.status_code}: {response.text}")
            
    except Exception as e:
        raise Exception(f"LLM call failed: {str(e)}")

def extract_with_retry(text: str, max_attempts: int = 3) -> InvoiceExtraction:
    """
    Extract invoice data with validation and retry
    TODO: Implement extraction with Instructor
    """
    extraction_errors = []
    
    for attempt in range(max_attempts):
        print(f"\n🔄 Extraction attempt {attempt + 1}/{max_attempts}")
        
        try:
            # Select appropriate prompt
            prompt_template = prompt_selector.select_prompt(text)
            
            # Enhance prompt if we have previous errors
            if extraction_errors:
                prompt_template = enhance_prompt_for_errors(prompt_template, extraction_errors)
            
            # Format prompt
            if 'output_format' in prompt_template.input_variables:
                schema_desc = "JSON matching InvoiceExtraction schema with all required fields"
                prompt = prompt_template.format(invoice_text=text, output_format=schema_desc)
            else:
                prompt = prompt_template.format(invoice_text=text)
            
            print(f"   📝 Using prompt type: {type(prompt_template).__name__}")
            print(f"   📏 Prompt length: {len(prompt)} chars")
            
            # Call LLM
            print(f"   🤖 Calling LLM...")
            raw_response = call_llm(prompt)
            print(f"   📄 Response length: {len(raw_response)} chars")
            
            # TODO: Parse and validate with Pydantic
            # Extract JSON from response
            # json_start = raw_response.find('{')
            # json_end = raw_response.rfind('}') + 1
            # 
            # if json_start == -1 or json_end == 0:
            #     raise ValueError("No JSON found in response")
            # 
            # json_text = raw_response[json_start:json_end]
            # raw_data = json.loads(json_text)
            # 
            # # Create InvoiceExtraction instance
            # extraction = InvoiceExtraction(**raw_data)
            # 
            # print(f"   ✅ Extraction successful!")
            # print(f"      Vendor: {extraction.vendor_name}")
            # print(f"      Invoice #: {extraction.invoice_number}")
            # print(f"      Total: {extraction.total_amount} {extraction.currency}")
            # 
            # return extraction
            
            # TODO: For now, create mock extraction for testing
            mock_extraction = InvoiceExtraction(
                invoice_number="TEST-001",
                vendor_name="Mock Vendor",
                invoice_date=datetime.now(),
                line_items=[],
                subtotal=100.0,
                tax_rate=0.08,
                tax_amount=8.0,
                total_amount=108.0,
                confidence_score=0.85
            )
            
            print(f"   ✅ Mock extraction created (TODO: implement real parsing)")
            return mock_extraction
            
        except Exception as e:
            error_msg = str(e)
            extraction_errors.append(error_msg)
            print(f"   ❌ Attempt {attempt + 1} failed: {error_msg[:100]}...")
            
            # TODO: Handle validation errors
            # if attempt < max_attempts - 1:
            #     # Analyze error type for prompt enhancement
            #     if "vendor_name" in error_msg:
            #         extraction_errors.append("missing_vendor")
            #     if "total" in error_msg or "calculation" in error_msg:
            #         extraction_errors.append("invalid_totals")
            #     if "date" in error_msg:
            #         extraction_errors.append("date_parsing")
            
            if attempt == max_attempts - 1:
                print(f"   💥 All attempts failed. Last error: {error_msg}")
                raise Exception(f"Extraction failed after {max_attempts} attempts: {error_msg}")

def batch_extract(texts: List[str]) -> List[InvoiceExtraction]:
    """Extract from multiple invoices"""
    results = []
    
    for i, text in enumerate(texts):
        print(f"\n📄 Processing invoice {i+1}/{len(texts)}")
        try:
            extraction = extract_with_retry(text, max_attempts=2)
            results.append(extraction)
            print(f"   ✅ Success: {extraction.vendor_name} - {extraction.total_amount}")
        except Exception as e:
            print(f"   ❌ Failed: {str(e)[:100]}...")
            results.append(None)
    
    # Guard against an empty input list to avoid dividing by zero
    success_rate = sum(1 for r in results if r is not None) / len(results) if results else 0.0
    print(f"\n📊 Batch processing complete: {success_rate:.1%} success rate")
    
    return results

# Test the extraction system
print("🧪 Testing extraction with retry system...")

sample_invoices = [
    "Invoice #INV-001 from TechCorp, Date: 2024-01-15, Total: $1500.00",
    "Complex invoice with multiple line items and tax calculations...",
    "Poorly formatted invoice text that might cause parsing errors..."
]

# Test single extraction
print(f"\n🔍 Testing single extraction:")
try:
    result = extract_with_retry(sample_invoices[0])
    print(f"✅ Single extraction successful")
except Exception as e:
    print(f"❌ Single extraction failed: {e}")

# Test batch extraction
print(f"\n📦 Testing batch extraction:")
batch_results = batch_extract(sample_invoices[:2])  # Test with first 2

print("\n📝 TODO Summary for Task 4:")
print("1. Implement JSON parsing from LLM response")
print("2. Add Pydantic validation with proper error handling")
print("3. Enhance error classification for better prompt adaptation")
print("4. Add confidence scoring based on validation success")
print("5. Implement fallback strategies for persistent failures")

## Task 5: Implement Prompt Caching (5 minutes)

### Cache Rendered Prompts for Performance

Avoid re-rendering identical prompts and cache
LLM responses for identical inputs.

**TODO Instructions:**
1. Implement cache key generation
2. Add thread-safe prompt rendering
3. Implement response caching with TTL
4. Test cache performance
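
Key generation and TTL expiry can be sketched separately from the full `PromptCache` class. The names `make_cache_key` and `TTLCache` are hypothetical, the variables are assumed to be JSON-serializable, and this minimal version is not thread-safe (the lab class adds a lock for that). A digest from `hashlib` is used because Python's built-in `hash()` on strings is randomized per process, so its values cannot be reused across sessions.

```python
import hashlib
import json
import time


def make_cache_key(template: str, **variables) -> str:
    """Build a stable key from the template and its variables."""
    payload = json.dumps({"template": template, "variables": variables}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """Minimal response cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (stored_at, value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # drop the expired entry
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock(), value)


now = [0.0]  # fake clock, advanced manually
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
key = make_cache_key("Extract: {invoice_text}", invoice_text="Invoice #TEST-001")

cache.put(key, "cached reply")
print(cache.get(key))   # still fresh
now[0] = 61.0
print(cache.get(key))   # expired after 61 simulated seconds
```

Sorting the keys before hashing means the same variables passed in a different order map to the same cache entry.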

In [None]:
import threading
from functools import lru_cache
import time

class PromptCache:
    """Thread-safe prompt and response caching"""
    
    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._response_cache = {}
        self._cache_times = {}
        self._lock = threading.RLock()
        self.hits = 0
        self.misses = 0
        self.ttl_seconds = 3600  # 1 hour TTL
        
    def _hash_inputs(self, template: str, **kwargs) -> str:
        """Create cache key from inputs"""
        # TODO: Implement hashing
        # content = f"{template}{sorted(kwargs.items())}"
        # return hashlib.md5(content.encode()).hexdigest()
        
        # Simple implementation for now
        content = f"{template}{sorted(kwargs.items())}"
        return str(hash(content))
    
    @lru_cache(maxsize=100)
    def render_prompt(self, template_str: str, **kwargs) -> str:
        """
        Cache rendered prompts
        TODO: Thread-safe implementation
        """
        # TODO: Create PromptTemplate and render
        # template = PromptTemplate.from_template(template_str)
        # return template.format(**kwargs)
        
        # Simple string formatting for now
        try:
            return template_str.format(**kwargs)
        except KeyError as e:
            print(f"TODO: Fix template rendering - missing variable {e}")
            return template_str
    
    def get_cached_response(self, prompt_hash: str) -> Optional[str]:
        """Retrieve cached LLM response if available"""
        with self._lock:
            # TODO: Implement cache retrieval with TTL check
            if prompt_hash in self._response_cache:
                cache_time = self._cache_times.get(prompt_hash, 0)
                if time.time() - cache_time < self.ttl_seconds:
                    self.hits += 1
                    return self._response_cache[prompt_hash]
                else:
                    # Expired cache entry
                    del self._response_cache[prompt_hash]
                    del self._cache_times[prompt_hash]
            
            self.misses += 1
            return None
    
    def cache_response(self, prompt_hash: str, response: str):
        """Store LLM response in cache"""
        with self._lock:
            # TODO: Implement cache storage with TTL
            # Evict only when adding a new key to a full cache
            if prompt_hash not in self._response_cache and len(self._response_cache) >= self.max_size:
                # Remove oldest entry
                oldest_key = min(self._cache_times.keys(), 
                               key=lambda k: self._cache_times[k])
                del self._response_cache[oldest_key]
                del self._cache_times[oldest_key]
            
            self._response_cache[prompt_hash] = response
            self._cache_times[prompt_hash] = time.time()
    
    def get_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        with self._lock:
            total_requests = self.hits + self.misses
            hit_rate = self.hits / total_requests if total_requests > 0 else 0
            
            return {
                'hits': self.hits,
                'misses': self.misses,
                'hit_rate': hit_rate,
                'cache_size': len(self._response_cache),
                'max_size': self.max_size
            }
    
    def clear_cache(self):
        """Clear all cached data"""
        with self._lock:
            self._response_cache.clear()
            self._cache_times.clear()
            self.hits = 0
            self.misses = 0

# Enhanced extraction function with caching
def extract_with_caching(text: str, cache: PromptCache) -> InvoiceExtraction:
    """Extract with prompt and response caching"""
    
    # Generate cache key
    prompt_template = prompt_selector.select_prompt(text)
    cache_key = cache._hash_inputs(prompt_template.template, invoice_text=text)
    
    print(f"🔍 Checking cache for key: {cache_key[:16]}...")
    
    # Check for cached response
    cached_response = cache.get_cached_response(cache_key)
    
    if cached_response is not None:
        print(f"   💾 Cache hit! Using cached response")
        # TODO: Parse cached response to InvoiceExtraction
        # For now, return mock data
        return InvoiceExtraction(
            invoice_number="CACHED-001",
            vendor_name="Cached Vendor",
            invoice_date=datetime.now(),
            line_items=[],
            subtotal=100.0,
            tax_rate=0.08,
            tax_amount=8.0,
            total_amount=108.0,
            confidence_score=0.90
        )
    else:
        print(f"   🚫 Cache miss - calling LLM")
        
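        # Render prompt (with caching). output_format is supplied because the
        # default basic_extraction_prompt template requires it; str.format
        # ignores it for templates that do not use it.
        prompt = cache.render_prompt(
            prompt_template.template,
            invoice_text=text,
            output_format="JSON matching InvoiceExtraction schema with all required fields"
        )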
        
        # Call LLM
        response = call_llm(prompt)
        
        # Cache the response
        cache.cache_response(cache_key, response)
        
        # TODO: Parse response to InvoiceExtraction
        return InvoiceExtraction(
            invoice_number="FRESH-001",
            vendor_name="Fresh Vendor",
            invoice_date=datetime.now(),
            line_items=[],
            subtotal=200.0,
            tax_rate=0.08,
            tax_amount=16.0,
            total_amount=216.0,
            confidence_score=0.85
        )

# Test caching system
print("🧪 Testing prompt caching system...")

cache = PromptCache(max_size=50)

test_texts = [
    "Invoice #TEST-001 from Company A, Total: $500",
    "Invoice #TEST-002 from Company B, Total: $750",
    "Invoice #TEST-001 from Company A, Total: $500",  # Duplicate for cache test
]

print(f"\n📄 Processing {len(test_texts)} invoices with caching:")

for i, text in enumerate(test_texts):
    print(f"\n   Invoice {i+1}: {text[:30]}...")
    start_time = time.time()
    
    try:
        result = extract_with_caching(text, cache)
        processing_time = time.time() - start_time
        print(f"      ✅ Processed in {processing_time:.3f}s - {result.vendor_name}")
    except Exception as e:
        print(f"      ❌ Failed: {str(e)[:50]}...")

# Show cache statistics
stats = cache.get_stats()
print(f"\n📊 Cache Performance:")
print(f"   Cache hits: {stats['hits']}")
print(f"   Cache misses: {stats['misses']}")
print(f"   Hit rate: {stats['hit_rate']:.1%}")
print(f"   Cache utilization: {stats['cache_size']}/{stats['max_size']}")

if stats['hit_rate'] > 0:
    print(f"   ✅ Cache is working - {stats['hit_rate']:.1%} of requests served from cache")
else:
    print(f"   ⚠️ No cache hits yet - need more duplicate requests to see benefits")

print("\n📝 TODO Summary for Task 5:")
print("1. Implement proper JSON response parsing for cached responses")
print("2. Add cache warming strategies for common invoice types")
print("3. Implement cache persistence across sessions")
print("4. Add cache invalidation strategies")
print("5. Optimize cache key generation for better hit rates")

## Task 6: Build Testing Framework (5 minutes)

### Test and Track Prompt Performance

Create a framework to measure prompt effectiveness
and track metrics over time.

**TODO Instructions:**
1. Implement test case storage and management
2. Create prompt testing logic with accuracy metrics
3. Build A/B testing comparison framework
4. Add performance tracking
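
Field-level accuracy and a simple A/B comparison can be sketched on plain dictionaries. The helpers `field_accuracy` and `compare_variants` are hypothetical, the 0.01 numeric tolerance is an assumption, and comparing means is only a rough signal: with a handful of test cases the difference may not be meaningful.

```python
from statistics import mean


def field_accuracy(extracted: dict, expected: dict, tolerance: float = 0.01) -> float:
    """Fraction of expected fields that the extraction got right."""
    if not expected:
        return 0.0
    correct = 0
    for name, want in expected.items():
        got = extracted.get(name)
        if isinstance(want, (int, float)) and isinstance(got, (int, float)):
            correct += abs(got - want) <= tolerance  # numeric fields allow rounding
        else:
            correct += got == want                   # other fields must match exactly
    return correct / len(expected)


def compare_variants(scores_a, scores_b) -> dict:
    """Compare two prompt variants by mean accuracy over the same test cases."""
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    winner = "A" if mean_a > mean_b else "B" if mean_b > mean_a else "tie"
    return {"mean_a": mean_a, "mean_b": mean_b, "winner": winner}


expected = {"invoice_number": "INV-001", "vendor_name": "Acme Corp", "total_amount": 1200.0}
variant_a = [{"invoice_number": "INV-001", "vendor_name": "Acme Corp", "total_amount": 1200.0}]
variant_b = [{"invoice_number": "INV-001", "vendor_name": "ACME", "total_amount": 1200.004}]

scores_a = [field_accuracy(item, expected) for item in variant_a]
scores_b = [field_accuracy(item, expected) for item in variant_b]
comparison = compare_variants(scores_a, scores_b)
print(comparison)
```

To use this with the lab models, convert each `InvoiceExtraction` to a dictionary first and pick the fields you want to score.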

In [None]:
class PromptTester:
    """Test prompts against known invoice samples"""
    
    def __init__(self):
        self.test_cases = []
        self.metrics = {
            'accuracy': [],
            'token_usage': [],
            'extraction_time': [],
            'retry_count': []
        }
    
    def add_test_case(self, invoice_text: str, expected: InvoiceExtraction):
        """Add test case with ground truth"""
        # TODO: Store test case
        test_case = {
            'id': len(self.test_cases),
            'text': invoice_text,
            'expected': expected,
            'created_at': datetime.now()
        }
        self.test_cases.append(test_case)
        print(f"✅ Added test case {test_case['id']}: {expected.vendor_name}")
    
    def calculate_accuracy(self, extracted: InvoiceExtraction, 
                          expected: InvoiceExtraction) -> float:
        """Calculate accuracy score between extracted and expected data"""
        # TODO: Implement sophisticated accuracy calculation
        scores = []
        
        # Check key fields
        # if extracted.invoice_number == expected.invoice_number:
        #     scores.append(1.0)
        # else:
        #     scores.append(0.0)
        # 
        # if extracted.vendor_name == expected.vendor_name:
        #     scores.append(1.0)
        # else:
        #     scores.append(0.0)
        # 
        # # Check total amount (allow small differences)
        # if abs(extracted.total_amount - expected.total_amount) < 0.01:
        #     scores.append(1.0)
        # else:
        #     scores.append(0.0)
        # 
        # return sum(scores) / len(scores) if scores else 0.0
        
        # Mock accuracy calculation for testing
        import random
        return random.uniform(0.7, 0.95)
    
    def test_prompt(self, prompt: PromptTemplate) -> Dict[str, Any]:
        """
        Test prompt against all test cases
        TODO: Implement testing logic
        - Run extraction
        - Compare with expected
        - Calculate accuracy metrics
        - Track token usage
        """
        if not self.test_cases:
            print("⚠️ No test cases available. Add test cases first.")
            return {}
        
        print(f"🧪 Testing prompt against {len(self.test_cases)} test cases...")
        
        results = {
            'prompt_type': type(prompt).__name__,
            'test_cases_count': len(self.test_cases),
            'accuracy_scores': [],
            'token_usage': [],
            'processing_times': [],
            'success_count': 0
        }
        
        for i, test_case in enumerate(self.test_cases):
            print(f"   Test {i+1}/{len(self.test_cases)}: ", end="")
            
            start_time = time.time()
            
            try:
                # TODO: Run actual extraction
                # extracted = extract_with_retry(test_case['text'])
                # accuracy = self.calculate_accuracy(extracted, test_case['expected'])
                
                # Mock extraction for testing
                extracted = InvoiceExtraction(
                    invoice_number="TEST-001",
                    vendor_name="Test Vendor",
                    invoice_date=datetime.now(),
                    line_items=[],
                    subtotal=100.0,
                    tax_rate=0.08,
                    tax_amount=8.0,
                    total_amount=108.0,
                    confidence_score=0.85
                )
                
                accuracy = self.calculate_accuracy(extracted, test_case['expected'])
                processing_time = time.time() - start_time
                
                results['accuracy_scores'].append(accuracy)
                results['token_usage'].append(300)  # Mock token count
                results['processing_times'].append(processing_time)
                results['success_count'] += 1
                
                print(f"✅ {accuracy:.1%}")
                
            except Exception as e:
                print(f"❌ Failed: {e}")
                results['accuracy_scores'].append(0.0)
                results['token_usage'].append(200)  # Mock token count for failed attempts
                results['processing_times'].append(time.time() - start_time)
        
        # Calculate summary metrics
        results['avg_accuracy'] = sum(results['accuracy_scores']) / len(results['accuracy_scores'])
        results['avg_tokens'] = sum(results['token_usage']) / len(results['token_usage'])
        results['avg_time'] = sum(results['processing_times']) / len(results['processing_times'])
        results['success_rate'] = results['success_count'] / len(self.test_cases)
        
        print(f"\n   📊 Results: {results['avg_accuracy']:.1%} accuracy, {results['avg_tokens']:.0f} tokens avg")
        
        return results
    
    def compare_prompts(self, prompts: List[PromptTemplate]) -> pd.DataFrame:
        """
        A/B test multiple prompts
        TODO: Create comparison matrix
        """
        print(f"🏁 A/B testing {len(prompts)} prompt variants...")
        
        results = []
        for i, prompt in enumerate(prompts):
            print(f"\n🔍 Testing prompt variant {i+1}/{len(prompts)}")
            metrics = self.test_prompt(prompt)
            
            if metrics:  # Only add if we got results
                results.append({
                    'Prompt': f"Variant_{i+1}",
                    'Accuracy': f"{metrics['avg_accuracy']:.1%}",
                    'Success_Rate': f"{metrics['success_rate']:.1%}",
                    'Avg_Tokens': f"{metrics['avg_tokens']:.0f}",
                    'Avg_Time': f"{metrics['avg_time']:.2f}s",
                    'Test_Cases': metrics['test_cases_count']
                })
        
        if results:
            df = pd.DataFrame(results)
            print(f"\n📈 A/B Test Comparison:")
            print(df.to_string(index=False))
            return df
        else:
            print(f"❌ No valid results to compare")
            return pd.DataFrame()
    
    def add_sample_test_cases(self):
        """Add some sample test cases for demonstration"""
        sample_cases = [
            {
                'text': "Invoice #INV-001 from TechCorp, Date: 2024-01-15, Total: $1500.00",
                'expected': InvoiceExtraction(
                    invoice_number="INV-001",
                    vendor_name="TechCorp",
                    invoice_date=datetime(2024, 1, 15),
                    line_items=[],
                    subtotal=1388.89,
                    tax_rate=0.08,
                    tax_amount=111.11,
                    total_amount=1500.00,
                    confidence_score=0.95
                )
            },
            {
                'text': "Invoice ABC-123 from Supplies Inc, Total: $750",
                'expected': InvoiceExtraction(
                    invoice_number="ABC-123",
                    vendor_name="Supplies Inc",
                    invoice_date=datetime.now(),  # Placeholder: the sample text has no date, so skip this field when scoring
                    line_items=[],
                    subtotal=694.44,
                    tax_rate=0.08,
                    tax_amount=55.56,
                    total_amount=750.00,
                    confidence_score=0.90
                )
            }
        ]
        
        for case in sample_cases:
            self.add_test_case(case['text'], case['expected'])

# Test the testing framework
print("🧪 Testing the prompt testing framework...")

tester = PromptTester()

# Add sample test cases
print(f"\n📋 Adding sample test cases:")
tester.add_sample_test_cases()

# Test individual prompts
print(f"\n🔍 Testing individual prompts:")
basic_results = tester.test_prompt(basic_extraction_prompt)

# Compare multiple prompts
print(f"\n🏁 Comparing multiple prompts:")
prompts_to_test = [
    basic_extraction_prompt,
    few_shot_prompt,
    cot_extraction_prompt
]

comparison_df = tester.compare_prompts(prompts_to_test)

if not comparison_df.empty:
    print(f"\n🏆 Best performing prompt characteristics:")
    # Find best accuracy
    best_accuracy_idx = comparison_df['Accuracy'].str.rstrip('%').astype(float).idxmax()
    best_prompt = comparison_df.loc[best_accuracy_idx]
    print(f"   Highest accuracy: {best_prompt['Prompt']} ({best_prompt['Accuracy']})")

print("\n📝 TODO Summary for Task 6:")
print("1. Implement real accuracy calculation comparing extracted vs expected")
print("2. Add more sophisticated metrics (F1 score, field-level accuracy)")
print("3. Create automated test case generation from real invoices")
print("4. Add regression testing for prompt changes")
print("5. Implement continuous performance monitoring")

## Task 7: Add Versioning System (4 minutes)

### Version Control for Prompts

Track prompt evolution and enable rollback.

**TODO Instructions:**
1. Implement prompt version storage and metadata
2. Create prompt registry with active version management
3. Add rollback capabilities
4. Test versioning workflow
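
Before rolling back, it often helps to see what changed between two versions. The following is a minimal sketch using the standard-library `difflib`; the `diff_prompt_versions` helper and the sample templates are hypothetical. With the registry you would pass each version's `template.template` string.

```python
import difflib
from typing import List


def diff_prompt_versions(old_template: str, new_template: str,
                         old_label: str = "v1.0.0",
                         new_label: str = "v1.1.0") -> List[str]:
    """Return unified-diff lines describing how a prompt template changed."""
    return list(difflib.unified_diff(
        old_template.splitlines(),
        new_template.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",  # input lines carry no newlines, so emit none either
    ))


# Hypothetical templates, for illustration only
old = "Extract the invoice number and total.\nReturn JSON."
new = "Extract the invoice number, vendor, and total.\nReturn JSON."
for line in diff_prompt_versions(old, new):
    print(line)
```

An empty result means the two templates are identical, which is a cheap check to run before registering a "new" version.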

In [None]:
class PromptVersion:
    """Versioned prompt with metadata"""
    
    def __init__(self, template: PromptTemplate, version: str, metadata: Dict[str, Any]):
        self.template = template
        self.version = version
        self.created_at = datetime.now()
        self.metadata = metadata
        self.performance_stats = {
            'success_rate': 0.0,
            'avg_tokens': 0,
            'avg_accuracy': 0.0,
            'usage_count': 0
        }
    
    def update_stats(self, success_rate: float, avg_tokens: int, accuracy: float = 0.0):
        """Track performance over time"""
        # TODO: Update statistics with weighted averaging
        self.performance_stats['usage_count'] += 1
        
        # Cumulative (running) mean over all updates for now
        count = self.performance_stats['usage_count']
        
        self.performance_stats['success_rate'] = (
            (self.performance_stats['success_rate'] * (count - 1) + success_rate) / count
        )
        
        self.performance_stats['avg_tokens'] = int(
            (self.performance_stats['avg_tokens'] * (count - 1) + avg_tokens) / count
        )
        
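        if accuracy > 0:
            # Average accuracy only over updates that reported one, so that
            # updates without an accuracy value do not drag the mean down
            self._accuracy_count = getattr(self, '_accuracy_count', 0) + 1
            n = self._accuracy_count
            self.performance_stats['avg_accuracy'] = (
                (self.performance_stats['avg_accuracy'] * (n - 1) + accuracy) / n
            )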
    
    def get_summary(self) -> Dict[str, Any]:
        """Get version summary"""
        return {
            'version': self.version,
            'created_at': self.created_at.isoformat(),
            'metadata': self.metadata,
            'performance': self.performance_stats,
            'template_length': len(self.template.template)
        }

class PromptRegistry:
    """Manage prompt versions"""
    
    def __init__(self):
        self.versions = {}  # name -> {version -> PromptVersion}
        self.active_versions = {}  # name -> version
    
    def register(self, name: str, prompt: PromptTemplate, 
                version: str = "1.0.0", metadata: Optional[Dict[str, Any]] = None):
        """Register new prompt version"""
        # TODO: Store versioned prompt
        if metadata is None:
            metadata = {}
        
        # Initialize name if not exists
        if name not in self.versions:
            self.versions[name] = {}
        
        # Create version
        prompt_version = PromptVersion(prompt, version, metadata)
        self.versions[name][version] = prompt_version
        
        # Set as active if first version or explicitly requested
        if name not in self.active_versions or metadata.get('set_active', False):
            self.active_versions[name] = version
        
        print(f"✅ Registered {name} v{version} (active: {self.active_versions[name]})")
    
    def get_active(self, name: str) -> Optional[PromptTemplate]:
        """Get current active version"""
        # TODO: Return active prompt
        if name in self.active_versions:
            active_version = self.active_versions[name]
            if name in self.versions and active_version in self.versions[name]:
                return self.versions[name][active_version].template
        
        print(f"⚠️ No active version found for {name}")
        return None
    
    def get_version(self, name: str, version: str) -> Optional[PromptTemplate]:
        """Get specific version"""
        if name in self.versions and version in self.versions[name]:
            return self.versions[name][version].template
        return None
    
    def rollback(self, name: str, version: str):
        """Rollback to previous version"""
        # TODO: Implement rollback
        if name in self.versions and version in self.versions[name]:
            old_version = self.active_versions.get(name, 'none')
            self.active_versions[name] = version
            print(f"🔄 Rolled back {name} from v{old_version} to v{version}")
            return True
        else:
            print(f"❌ Cannot rollback {name} to v{version} - version not found")
            return False
    
    def list_versions(self, name: str) -> List[Dict[str, Any]]:
        """List all versions for a prompt"""
        if name not in self.versions:
            return []
        
        versions_info = []
        for version, prompt_version in self.versions[name].items():
            info = prompt_version.get_summary()
            info['is_active'] = (version == self.active_versions.get(name))
            versions_info.append(info)
        
        # Sort by creation time
        versions_info.sort(key=lambda x: x['created_at'], reverse=True)
        return versions_info
    
    def update_performance(self, name: str, version: str, 
                          success_rate: float, avg_tokens: int, accuracy: float = 0.0):
        """Update performance stats for a version"""
        if name in self.versions and version in self.versions[name]:
            self.versions[name][version].update_stats(success_rate, avg_tokens, accuracy)
    
    def get_best_performing_version(self, name: str, metric: str = 'avg_accuracy') -> Optional[str]:
        """Find best performing version by metric"""
        if name not in self.versions:
            return None
        
        best_version = None
        best_score = -1
        
        for version, prompt_version in self.versions[name].items():
            if prompt_version.performance_stats['usage_count'] > 0:
                score = prompt_version.performance_stats.get(metric, 0)
                if score > best_score:
                    best_score = score
                    best_version = version
        
        return best_version

# Test the versioning system
print("🧪 Testing prompt versioning system...")

registry = PromptRegistry()

# Register initial versions
print(f"\n📝 Registering prompt versions:")

registry.register(
    name="invoice_extraction",
    prompt=basic_extraction_prompt,
    version="1.0.0",
    metadata={"description": "Basic extraction prompt", "author": "system"}
)

registry.register(
    name="invoice_extraction",
    prompt=few_shot_prompt,
    version="1.1.0",
    metadata={"description": "Added few-shot examples", "author": "developer"}
)

registry.register(
    name="invoice_extraction",
    prompt=cot_extraction_prompt,
    version="2.0.0",
    metadata={"description": "Chain of thought reasoning", "author": "researcher", "set_active": True}
)

# Test getting active version
print(f"\n🎯 Testing active version retrieval:")
active_prompt = registry.get_active("invoice_extraction")
if active_prompt:
    print(f"✅ Got active prompt ({len(active_prompt.template)} chars)")
else:
    print(f"❌ Failed to get active prompt")

# Simulate performance updates
print(f"\n📊 Simulating performance data:")
registry.update_performance("invoice_extraction", "1.0.0", 0.75, 200, 0.80)
registry.update_performance("invoice_extraction", "1.1.0", 0.85, 250, 0.88)
registry.update_performance("invoice_extraction", "2.0.0", 0.80, 300, 0.85)

# List all versions
print(f"\n📋 All versions of 'invoice_extraction':")
versions = registry.list_versions("invoice_extraction")
for version_info in versions:
    active_indicator = "🟢" if version_info['is_active'] else "⚪"
    print(f"   {active_indicator} v{version_info['version']} - {version_info['metadata']['description']}")
    print(f"      Performance: {version_info['performance']['avg_accuracy']:.1%} accuracy, {version_info['performance']['avg_tokens']} tokens")

# Find best performing version
print(f"\n🏆 Finding best performing version:")
best_version = registry.get_best_performing_version("invoice_extraction", "avg_accuracy")
if best_version:
    print(f"   Best performing: v{best_version}")
    
    # Rollback to best version if it's not active
    current_active = registry.active_versions["invoice_extraction"]
    if best_version != current_active:
        print(f"   🔄 Rolling back to best performing version...")
        registry.rollback("invoice_extraction", best_version)

# Test rollback functionality
print(f"\n🔄 Testing rollback functionality:")
success = registry.rollback("invoice_extraction", "1.0.0")
if success:
    print(f"   ✅ Successfully rolled back to v{registry.active_versions['invoice_extraction']}")
    # Restore the best-performing version (if one was found)
    if best_version:
        registry.rollback("invoice_extraction", best_version)

print("\n📝 TODO Summary for Task 7:")
print("1. Add automatic version numbering (semantic versioning)")
print("2. Implement version diffing to see changes between versions")
print("3. Add version tags and branch management")
print("4. Implement automated performance-based version promotion")
print("5. Add version backup and restore from external storage")

## Lab Summary and Assessment

### What You've Built

Congratulations! You've created a comprehensive prompt management system with:

1. **Structured Output Models** - Pydantic schemas with validation
2. **Template System** - Reusable prompts with variables and examples
3. **Conditional Logic** - Smart prompt selection based on document characteristics
4. **Validation & Retry** - Instructor integration with automatic retries
5. **Performance Caching** - Thread-safe caching for efficiency
6. **Testing Framework** - A/B testing and performance metrics
7. **Version Control** - Prompt versioning with rollback capabilities

### Assessment Criteria

**Students succeed if they:**
- ✅ Create working Pydantic models with validation
- ✅ Implement multiple prompt templates with variables
- ✅ Build conditional prompt selection system
- ✅ Integrate Instructor for structured outputs
- ✅ Add functional caching mechanism
- ✅ Create testing framework with metrics

### Self-Assessment Questions

1. **Model Validation**: Do your Pydantic models catch invalid data effectively?
2. **Template Flexibility**: Can you easily add new prompt variants?
3. **Smart Selection**: Does the system choose appropriate prompts for different documents?
4. **Performance**: Is caching improving response times?
5. **Quality Assurance**: Can you measure and compare prompt effectiveness?
6. **Maintainability**: Can you safely deploy new prompt versions?

### Common Issues and Solutions

**Issue: Pydantic validation too strict**
- Solution: Add Optional fields and reasonable defaults

**Issue: Prompt templates not rendering**
- Solution: Check variable names match exactly

**Issue: Cache growing too large**
- Solution: Implement LRU eviction or TTL

**Issue: Instructor retry loops forever**
- Solution: Cap the number of retries (for example via Instructor's `max_retries` argument) and add fallback logic
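
For the cache-growth issue above, LRU eviction and a TTL can be combined in one small class. This is a sketch under stated assumptions, not the cache built in Task 5: the `BoundedTTLCache` name and its default limits are hypothetical, and the class is not thread-safe (wrap calls in a `threading.Lock` if several threads share it).

```python
import time
from collections import OrderedDict
from typing import Any, Optional


class BoundedTTLCache:
    """Small cache with least-recently-used eviction and per-entry expiry."""

    def __init__(self, max_size: int = 128, ttl_seconds: float = 300.0):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._store: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            # Entry expired: drop it and report a miss
            del self._store[key]
            return None
        # Mark as most recently used
        self._store.move_to_end(key)
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (value, time.monotonic())
        self._store.move_to_end(key)
        # Evict least recently used entries once over capacity
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)

    def __len__(self) -> int:
        return len(self._store)
```

Keying the cache on a hash of the prompt text plus the invoice text keeps the keys short while still separating results from different prompt versions.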

### Key Learning Points

- **Structured outputs prevent downstream errors**
- **Templates make prompts reusable and testable**
- **Caching avoids repeated LLM calls for identical inputs, cutting latency and cost**
- **Validation should happen at extraction time**
- **Versioning enables safe experimentation**
- **Testing reveals which prompts work best**

### Production Deployment Checklist

Before deploying this system:

1. **Complete TODOs**: Implement all the placeholder logic
2. **Add Monitoring**: Track prompt performance in production
3. **Error Handling**: Add comprehensive error handling and logging
4. **Security**: Validate all inputs and sanitize outputs
5. **Scaling**: Test with concurrent users and large document volumes
6. **Backup**: Implement prompt version backup and recovery
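
For the backup item, one simple approach is to snapshot the template text of every version to a JSON file. The sketch below assumes the registry contents have already been flattened into a `{name: {version: template text}}` dictionary; the `backup_prompts` and `restore_prompts` helpers, the file name, and the sample templates are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

# name -> {version -> template text}
PromptBackup = Dict[str, Dict[str, str]]


def backup_prompts(templates: PromptBackup, path: Path) -> None:
    """Write all prompt versions to a JSON file."""
    path.write_text(json.dumps(templates, indent=2, sort_keys=True), encoding="utf-8")


def restore_prompts(path: Path) -> PromptBackup:
    """Read prompt versions back from a JSON backup file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Round-trip check in a temporary directory (illustrative data)
snapshot = {
    "invoice_extraction": {
        "1.0.0": "Extract the invoice fields.",
        "1.1.0": "Extract the invoice fields step by step.",
    }
}
with tempfile.TemporaryDirectory() as tmp:
    backup_file = Path(tmp) / "prompts_backup.json"
    backup_prompts(snapshot, backup_file)
    restored = restore_prompts(backup_file)

print(restored == snapshot)
```

Storing plain template strings rather than pickled objects keeps the backup readable and independent of the installed LangChain version.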

### Next Steps

To advance this system:

1. **AI-Powered Enhancement**: Use ML to automatically improve prompts based on failures
2. **Multi-Modal Support**: Add support for image and table extraction
3. **Domain Adaptation**: Create industry-specific prompt libraries
4. **Real-Time Learning**: Implement online learning from user corrections
5. **Integration**: Connect with document management and ERP systems

**Excellent work!** You now have the foundation of a prompt management system. Once the TODOs and the deployment checklist above are complete, it will make your invoice processing more reliable, easier to maintain, and easier to improve over time.