# Week 16: Creative & Marketing Content Evaluation
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand how evaluating creative writing and marketing copy differs from evaluating factual answers
2. Define criteria: brand voice alignment, clarity, call-to-action strength
3. Design and apply a simple rubric (1-5) for each dimension
4. Use the LLM-as-Judge pattern to assign multi-dimensional scores
5. Analyze patterns in high-scoring vs. low-scoring content

---

## 🧠 What is Creative Content Evaluation?

### The Challenge

Unlike factual QA, marketing content has no single "correct" answer:

| Aspect | Factual QA | Creative Content |
|--------|------------|------------------|
| Correct answer | Yes, one | No, many possible |
| Evaluation | Objective | Subjective |
| Metrics | Accuracy | Voice, clarity, persuasion |
| Success | Match reference | Achieve desired effect |

### Our Approach: Multi-Dimensional Scoring

We evaluate creative content on three key dimensions:
1. **Brand Voice Alignment** - Does it sound like our brand?
2. **Clarity** - Is it easy to understand?
3. **Call-to-Action Strength** - Does it motivate action?
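
One way to carry the three dimension scores around is a small record type that checks the 1-5 range and derives the overall score as an unweighted mean. This is a minimal sketch; the `ContentScores` name and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentScores:
    """Hypothetical container for the three rubric dimensions (1-5 each)."""

    brand_voice: int
    clarity: int
    cta_strength: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-5 rubric scale
        for name in ("brand_voice", "clarity", "cta_strength"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def overall(self) -> float:
        # Unweighted mean of the three dimensions
        return (self.brand_voice + self.clarity + self.cta_strength) / 3


scores = ContentScores(brand_voice=4, clarity=5, cta_strength=3)
print(f"Overall: {scores.overall:.2f}/5")  # Overall: 4.00/5
```

Keeping the three scores separate, rather than storing only the mean, makes it possible to see which dimension pulled a piece of content down.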

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import json
import sys
from typing import Any, Dict, List

# Third-party imports
import numpy as np

# Add the current directory to the import path (useful when running in Colab)
sys.path.insert(0, '.')

# For data display
try:
    from IPython.display import display, HTML
except ImportError:
    display = print

print("✅ Setup complete!")
print(f"   NumPy version: {np.__version__}")

---

## 📋 Step 2: Define the Evaluation Rubric

In [None]:
# Define the creative content evaluation rubric
RUBRIC = {
    "brand_voice": {
        "name": "Brand Voice Alignment",
        "description": "Does the content match the specified brand voice?",
        "scores": {
            5: "Perfectly on-brand; indistinguishable from best brand examples",
            4: "Mostly on-brand; minor adjustments needed",
            3: "Neutral; not distinctly on or off-brand",
            2: "Mostly off-brand; needs substantial revision",
            1: "Completely off-brand; unusable",
        },
    },
    "clarity": {
        "name": "Clarity",
        "description": "How easy is the content to understand?",
        "scores": {
            5: "Crystal clear; immediately understandable",
            4: "Mostly clear; minor ambiguities",
            3: "Understandable but with some unclear sections",
            2: "Difficult to understand",
            1: "Incomprehensible",
        },
    },
    "cta_strength": {
        "name": "Call-to-Action Strength",
        "description": "How effectively does it motivate action?",
        "scores": {
            5: "Exceptional; compelling and impossible to ignore",
            4: "Strong; clear and motivating",
            3: "Adequate; present but not compelling",
            2: "Weak; vague or easily ignored",
            1: "No effective CTA",
        },
    },
}

# Display the rubric
print("📋 Creative Content Evaluation Rubric")
print("=" * 70)

for dim_key, dim in RUBRIC.items():
    print(f"\n{dim['name'].upper()}")
    print(f"   {dim['description']}")
    print()
    for score, description in sorted(dim['scores'].items(), reverse=True):
        print(f"   {score}: {description}")
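
The same level descriptions are needed again in the judge prompt, so one option is to render them from the rubric dictionary instead of retyping them. The sketch below assumes a dictionary shaped like `RUBRIC`; `render_rubric_section` and the small `example_rubric` are hypothetical names used only for illustration.

```python
from typing import Any, Dict


def render_rubric_section(rubric: Dict[str, Dict[str, Any]]) -> str:
    """Render a rubric dict as numbered markdown sections for a prompt."""
    lines = []
    for index, dim in enumerate(rubric.values(), start=1):
        lines.append(f"### {index}. {dim['name']} (1-5)")
        lines.append(dim["description"])
        # List the levels from best to worst
        for score in sorted(dim["scores"], reverse=True):
            lines.append(f"- {score}: {dim['scores'][score]}")
        lines.append("")
    return "\n".join(lines).rstrip()


# Tiny stand-in rubric with the same shape as RUBRIC (illustrative only)
example_rubric = {
    "clarity": {
        "name": "Clarity",
        "description": "How easy is the content to understand?",
        "scores": {5: "Crystal clear", 3: "Partly unclear", 1: "Incomprehensible"},
    },
}

rendered = render_rubric_section(example_rubric)
print(rendered)
```

Generating the prompt text from one source keeps the displayed rubric and the judge's instructions from drifting apart when a level description changes.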

---

## 🏢 Step 3: Define Brand Guidelines

In [None]:
# Define comprehensive brand guidelines
BRAND_GUIDELINES = """
Brand Name: TechFlow
Industry: SaaS / Productivity Software
Target Audience: Busy professionals and knowledge workers

Voice Characteristics:
- Tone: Professional yet approachable, confident but not arrogant
- Style: Helpful, empowering, forward-thinking
- Personality: Like a smart colleague who makes complex things simple

Core Values:
- Efficiency: We value people's time
- Innovation: We embrace new solutions
- Simplicity: We make things easy

Vocabulary Guidelines:
- USE: action verbs, specific benefits, relatable scenarios
- AVOID: jargon, buzzwords, overly technical language

Writing Style:
- Short sentences, active voice
- Benefit-focused, not feature-focused
- Conversational but professional

CTA Style:
- Direct and action-oriented
- Highlight value, not just the action
- Create gentle urgency when appropriate
"""

print("✅ Brand guidelines defined!")
print()
print("Brand: TechFlow")
print("Industry: SaaS / Productivity Software")
print("Target: Busy professionals")

---

## 📝 Step 4: Create Test Content Samples

In [None]:
# Sample marketing content for evaluation
CONTENT_SAMPLES = [
    {
        "id": "content_001",
        "type": "Product Description",
        "product": "Smart Inbox Feature",
        "content": """
Stop drowning in emails. TechFlow's smart inbox prioritizes what matters, 
so you can focus on work that moves the needle. Join 50,000+ professionals 
who've reclaimed 2 hours every day. Start your free trial now.
        """.strip(),
        "expected_quality": "high",
    },
    {
        "id": "content_002",
        "type": "Product Description",
        "product": "Task Management",
        "content": """
Our cutting-edge, enterprise-grade solution leverages advanced AI/ML 
capabilities to optimize your task management paradigm through 
synergistic workflow automation. Request a demo to learn more.
        """.strip(),
        "expected_quality": "poor_voice",
    },
    {
        "id": "content_003",
        "type": "Product Description",
        "product": "Calendar Sync",
        "content": """
TechFlow helps you manage your calendar better. It's a good tool for 
staying organized. Many people like it. You might want to 
consider trying it someday.
        """.strip(),
        "expected_quality": "weak_cta",
    },
    {
        "id": "content_004",
        "type": "Email Subject Line",
        "product": "Weekly Digest",
        "content": "Your productivity just leveled up: see what's new in TechFlow",
        "expected_quality": "high",
    },
    {
        "id": "content_005",
        "type": "Landing Page Hero",
        "product": "TechFlow Platform",
        "content": """
Work smarter, not harder.

TechFlow brings your email, calendar, and tasks into one place. 
Less switching, more doing. See the difference in your first week.

Try free for 14 days →
        """.strip(),
        "expected_quality": "high",
    },
    {
        "id": "content_006",
        "type": "Product Description",
        "product": "Analytics Dashboard",
        "content": """
BUY NOW!!! BEST PRODUCTIVITY APP EVER!!! 
Amazing analytics! Super powerful! Everyone needs this!
LIMITED TIME OFFER - 90% OFF TODAY ONLY!!!
CLICK HERE IMMEDIATELY!!! DON'T MISS OUT!!!
        """.strip(),
        "expected_quality": "poor_all",
    },
]

print(f"📝 Created {len(CONTENT_SAMPLES)} content samples for evaluation")
print()
for sample in CONTENT_SAMPLES:
    print(f"   • {sample['id']}: {sample['product']} ({sample['type']})")

---

## 🤖 Step 5: Define the LLM-as-Judge System Prompt

In [None]:
# System prompt for creative content evaluation
CREATIVE_JUDGE_SYSTEM_PROMPT = """You are an expert marketing content evaluator. Your task is to evaluate AI-generated creative and marketing copy across three dimensions.

## Evaluation Dimensions

### 1. Brand Voice Alignment (1-5)
Does the content match the specified brand voice?
- 5: Perfectly on-brand, indistinguishable from best brand examples
- 4: Mostly on-brand with minor adjustments needed
- 3: Neutral, not distinctly on or off-brand
- 2: Mostly off-brand, needs substantial revision
- 1: Completely off-brand, unusable

### 2. Clarity (1-5)
How easy is the content to understand?
- 5: Crystal clear, immediately understandable
- 4: Mostly clear with minor ambiguities
- 3: Understandable but with some unclear sections
- 2: Difficult to understand
- 1: Incomprehensible

### 3. Call-to-Action Strength (1-5)
How effectively does it motivate action?
- 5: Exceptional, compelling and impossible to ignore
- 4: Strong, clear and motivating
- 3: Adequate, present but not compelling
- 2: Weak, vague or easily ignored
- 1: No effective CTA

## Instructions

1. Read the brand guidelines carefully
2. Evaluate the content against each dimension
3. Provide specific examples from the content to justify each score
4. Be objective and consistent

Respond ONLY with a valid JSON object in this exact format:
{
    "brand_voice_score": <int 1-5>,
    "brand_voice_rationale": "<specific justification>",
    "clarity_score": <int 1-5>,
    "clarity_rationale": "<specific justification>",
    "cta_score": <int 1-5>,
    "cta_rationale": "<specific justification>",
    "overall_score": <float, average of three scores>,
    "summary": "<2-3 sentence overall assessment>"
}

Do not include any other text before or after the JSON object."""

print("✅ LLM-as-Judge system prompt defined!")
print(f"   Prompt length: {len(CREATIVE_JUDGE_SYSTEM_PROMPT)} characters")
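
The JSON contract in the prompt can be checked mechanically before any score is trusted. The following is a minimal sketch that assumes the key names from the prompt above; `validate_judge_output` is a hypothetical helper, not part of the notebook's classes.

```python
from typing import Any, Dict, List

SCORE_KEYS = ("brand_voice_score", "clarity_score", "cta_score")
TEXT_KEYS = ("brand_voice_rationale", "clarity_rationale", "cta_rationale", "summary")


def validate_judge_output(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the output is valid."""
    problems = []
    for key in SCORE_KEYS:
        value = result.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{key}: expected an integer, got {value!r}")
        elif not 1 <= value <= 5:
            problems.append(f"{key}: {value} is outside the 1-5 scale")
    for key in TEXT_KEYS:
        value = result.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: expected a non-empty string")
    return problems


# Hand-written example with three deliberate defects
bad_output = {
    "brand_voice_score": 7,          # out of range
    "brand_voice_rationale": "On-brand tone.",
    "clarity_score": "4",            # string instead of int
    "clarity_rationale": "Short sentences.",
    "cta_score": 3,
    "cta_rationale": "CTA is present.",
    "summary": "",                   # empty
}
for problem in validate_judge_output(bad_output):
    print(problem)
```

Reporting problems instead of silently clamping or defaulting makes it easier to notice when a judge model stops following the output format.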

---

## 🧪 Step 6: Implement the CreativeContentJudge Class

In [None]:
class CreativeContentJudge:
    """
    A judge that evaluates creative and marketing content using an LLM.
    
    This class implements the LLM-as-Judge pattern for multi-dimensional
    evaluation of marketing copy, assessing brand voice, clarity, and
    call-to-action effectiveness.
    """
    
    def __init__(
        self,
        client: Any,
        model: str = "gpt-4o-mini",
        system_prompt: str = CREATIVE_JUDGE_SYSTEM_PROMPT,
    ):
        """
        Initialize the CreativeContentJudge.
        
        Args:
            client: An LLM client with chat.completions.create method
            model: Model identifier for evaluation
            system_prompt: System prompt for the judge
        """
        self.client = client
        self.model = model
        self.system_prompt = system_prompt
    
    def evaluate(
        self,
        content: str,
        brand_guidelines: str,
        target_audience: str = "",
        desired_action: str = "",
    ) -> Dict[str, Any]:
        """
        Evaluate marketing content across all dimensions.
        
        Args:
            content: The marketing copy to evaluate
            brand_guidelines: Description of brand voice and style
            target_audience: Who the content is for
            desired_action: What action should readers take
            
        Returns:
            Dictionary with scores and rationales for each dimension
        """
        # Construct the evaluation prompt
        user_message = f"""## Brand Guidelines
{brand_guidelines}

## Target Audience
{target_audience if target_audience else "General audience"}

## Desired Action
{desired_action if desired_action else "Not specified"}

## Content to Evaluate
{content}

Please evaluate this content according to the rubric."""

        # Call the LLM for evaluation
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message},
            ],
            temperature=0.0,
        )
        
        # Parse the response; some clients return None for empty content
        response_text = (response.choices[0].message.content or "").strip()
        return self._parse_response(response_text)
    
    def _parse_response(self, response_text: str) -> Dict[str, Any]:
        """Parse the LLM response into a structured result."""
        try:
            result = json.loads(response_text)
            
            # Validate and clamp scores
            for key in ["brand_voice_score", "clarity_score", "cta_score"]:
                if key in result:
                    result[key] = max(1, min(5, int(result[key])))
            
            # Recalculate overall score
            scores = [
                result.get("brand_voice_score", 3),
                result.get("clarity_score", 3),
                result.get("cta_score", 3),
            ]
            result["overall_score"] = sum(scores) / len(scores)
            
            return result
            
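        except (ValueError, TypeError, AttributeError) as e:
            # Covers json.JSONDecodeError (a ValueError subclass), scores that
            # int() cannot convert, and responses that are not JSON objects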
            return {
                "brand_voice_score": 0,
                "brand_voice_rationale": f"Parse error: {e}",
                "clarity_score": 0,
                "clarity_rationale": f"Parse error: {e}",
                "cta_score": 0,
                "cta_rationale": f"Parse error: {e}",
                "overall_score": 0.0,
                "summary": f"Failed to parse response: {response_text[:100]}",
            }
    
    def evaluate_batch(
        self,
        contents: List[str],
        brand_guidelines: str,
        target_audience: str = "",
        desired_action: str = "",
    ) -> List[Dict[str, Any]]:
        """
        Evaluate multiple pieces of content.
        
        Args:
            contents: List of marketing copy to evaluate
            brand_guidelines: Description of brand voice and style
            target_audience: Who the content is for
            desired_action: What action should readers take
            
        Returns:
            List of evaluation results
        """
        results = []
        for content in contents:
            result = self.evaluate(
                content=content,
                brand_guidelines=brand_guidelines,
                target_audience=target_audience,
                desired_action=desired_action,
            )
            results.append(result)
        return results
    
    def compute_aggregate_metrics(
        self,
        results: List[Dict[str, Any]],
    ) -> Dict[str, float]:
        """
        Compute aggregate metrics across multiple evaluations.
        
        Args:
            results: List of evaluation results
            
        Returns:
            Dictionary with average scores and evaluation counts
        """
        if not results:
            return {}
        
        brand_scores = [r.get("brand_voice_score", 0) for r in results if r.get("brand_voice_score", 0) > 0]
        clarity_scores = [r.get("clarity_score", 0) for r in results if r.get("clarity_score", 0) > 0]
        cta_scores = [r.get("cta_score", 0) for r in results if r.get("cta_score", 0) > 0]
        overall_scores = [r.get("overall_score", 0) for r in results if r.get("overall_score", 0) > 0]
        
        return {
            "avg_brand_voice": np.mean(brand_scores) if brand_scores else 0.0,
            "avg_clarity": np.mean(clarity_scores) if clarity_scores else 0.0,
            "avg_cta_strength": np.mean(cta_scores) if cta_scores else 0.0,
            "avg_overall": np.mean(overall_scores) if overall_scores else 0.0,
            "total_evaluated": len(results),
            "valid_evaluations": len([r for r in results if r.get("overall_score", 0) > 0]),
        }


print("✅ CreativeContentJudge class defined!")
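
Real models sometimes wrap the JSON in a markdown code fence or add a sentence around it even when told not to, and `json.loads` rejects that input. Below is a minimal sketch of a more tolerant extraction step, assuming the response contains a single top-level JSON object; `extract_json_object` is a hypothetical helper that could run before `_parse_response`.

```python
import json
from typing import Any, Dict


def extract_json_object(text: str) -> Dict[str, Any]:
    """Parse the span from the first '{' to the last '}' as a JSON object."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in response")
    # json.JSONDecodeError is a ValueError, so callers can catch ValueError
    return json.loads(text[start:end + 1])


# Illustrative response with extra text around the JSON
wrapped = 'Here is my evaluation:\n{"clarity_score": 4, "summary": "Clear copy."}\nHope this helps.'
parsed = extract_json_object(wrapped)
print(parsed["clarity_score"])  # 4
```

This approach is deliberately simple: it will fail on responses that contain several separate JSON objects, which is usually the desired behavior for a judge that is supposed to return exactly one.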

---

## 🧪 Step 7: Create a Mock LLM Client for Demonstration

In [None]:
class MockCreativeJudgeClient:
    """
    Mock LLM client for demonstration purposes.
    
    Simulates LLM responses based on content analysis heuristics.
    In production, use an actual LLM (GPT-4, Claude, etc.)
    """
    
    def __init__(self):
        self.chat = self._MockChat()
    
    class _MockChat:
        def __init__(self):
            self.completions = self._MockCompletions()
        
        class _MockCompletions:
            def create(self, model: str, messages: list, temperature: float = 0.0):
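                # Take the user message and keep only the section after the
                # "## Content to Evaluate" heading, so the heuristics below do
                # not match words in the brand guidelines
                user_msg = next((m["content"] for m in messages if m["role"] == "user"), "")
                user_msg = user_msg.split("## Content to Evaluate")[-1]
                
                # Simple heuristics for scoring
                content_lower = user_msg.lower()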
                
                # Brand voice scoring
                if any(word in content_lower for word in ["synergistic", "paradigm", "leverage", "enterprise-grade"]):
                    brand_voice = 2
                    brand_rationale = "Uses jargon and buzzwords that don't match the approachable brand voice."
                elif any(word in content_lower for word in ["buy now!!!", "limited time", "don't miss out"]):
                    brand_voice = 1
                    brand_rationale = "Aggressive promotional tone contradicts the professional, approachable brand voice."
                elif any(word in content_lower for word in ["might want to", "consider", "someday"]):
                    brand_voice = 3
                    brand_rationale = "Neutral tone, lacks the confident and empowering voice expected."
                else:
                    brand_voice = 4
                    brand_rationale = "Good match with professional yet approachable tone. Uses benefit-focused language."
                
                # Clarity scoring
                if "paradigm" in content_lower or "synergistic" in content_lower:
                    clarity = 2
                    clarity_rationale = "Technical jargon and complex sentence structures make it difficult to understand."
                elif "!!!" in user_msg:
                    clarity = 3
                    clarity_rationale = "Excessive punctuation creates noise, but message is somewhat understandable."
                else:
                    clarity = 5
                    clarity_rationale = "Clear, concise language with logical flow. Easy to understand immediately."
                
                # CTA scoring
                if "start your free trial" in content_lower or "try free" in content_lower:
                    cta = 5
                    cta_rationale = "Strong, specific CTA with clear value proposition and low barrier to action."
                elif "request a demo" in content_lower:
                    cta = 4
                    cta_rationale = "Clear call-to-action, though could emphasize value more."
                elif "might want to" in content_lower or "consider" in content_lower:
                    cta = 2
                    cta_rationale = "Weak, vague language that fails to motivate action. No urgency."
                elif "click here immediately" in content_lower:
                    cta = 2
                    cta_rationale = "Aggressive CTA that may turn off the target professional audience."
                else:
                    cta = 4
                    cta_rationale = "Solid CTA that motivates action appropriately for the brand."
                
                overall = (brand_voice + clarity + cta) / 3
                
                # Generate summary
                if overall >= 4:
                    summary = "Strong content that aligns well with brand guidelines. Ready for publication with minor adjustments."
                elif overall >= 3:
                    summary = "Adequate content with room for improvement. Needs revision to better match brand voice or strengthen CTA."
                else:
                    summary = "Content requires significant revision. Major issues with brand alignment, clarity, or call-to-action."
                
                response_json = json.dumps({
                    "brand_voice_score": brand_voice,
                    "brand_voice_rationale": brand_rationale,
                    "clarity_score": clarity,
                    "clarity_rationale": clarity_rationale,
                    "cta_score": cta,
                    "cta_rationale": cta_rationale,
                    "overall_score": overall,
                    "summary": summary,
                })
                
                class MockChoice:
                    class MockMessage:
                        def __init__(self, content):
                            self.content = content
                    
                    def __init__(self, content):
                        self.message = self.MockMessage(content)
                
                class MockResponse:
                    def __init__(self, content):
                        self.choices = [MockChoice(content)]
                
                return MockResponse(response_json)


# Create mock client
mock_client = MockCreativeJudgeClient()

print("✅ Mock LLM client created!")
print("   Note: Using mock for demonstration.")
print("   In production, use GPT-4, Claude, or similar.")

---

## 🏃 Step 8: Run Creative Content Evaluation

In [None]:
# Create evaluator
judge = CreativeContentJudge(mock_client)

# Run evaluation on all samples
print("🔄 Running Creative Content Evaluation...")
print("=" * 70)

all_results = []
for sample in CONTENT_SAMPLES:
    result = judge.evaluate(
        content=sample["content"],
        brand_guidelines=BRAND_GUIDELINES,
        target_audience="Busy professionals and knowledge workers",
        desired_action="Sign up for free trial or explore features",
    )
    
    # Add metadata
    result["id"] = sample["id"]
    result["product"] = sample["product"]
    result["type"] = sample["type"]
    result["expected_quality"] = sample["expected_quality"]
    all_results.append(result)
    
    # Display results
    overall_status = "✅" if result["overall_score"] >= 4 else "⚠️" if result["overall_score"] >= 3 else "❌"
    
    print(f"\n{'='*60}")
    print(f"{sample['product']} ({sample['id']})")
    print(f"Type: {sample['type']} | Expected: {sample['expected_quality']}")
    print(f"{'='*60}")
    print(f"\nContent: {sample['content'][:80]}...")
    print(f"\nScores:")
    print(f"   Brand Voice:  {result['brand_voice_score']}/5")
    print(f"   Clarity:      {result['clarity_score']}/5")
    print(f"   CTA Strength: {result['cta_score']}/5")
    print(f"   Overall:      {result['overall_score']:.1f}/5 {overall_status}")
    print(f"\nSummary: {result['summary']}")

---

## 📊 Step 9: Compute and Display Aggregate Metrics

In [None]:
# Compute aggregate metrics
metrics = judge.compute_aggregate_metrics(all_results)

print("📊 Aggregate Creative Content Evaluation Metrics")
print("=" * 70)
print(f"")
print(f"Total Evaluated: {metrics['total_evaluated']}")
print(f"Valid Evaluations: {metrics['valid_evaluations']}")
print(f"")
print(f"Average Scores:")
print(f"   Brand Voice:  {metrics['avg_brand_voice']:.1f}/5")
print(f"   Clarity:      {metrics['avg_clarity']:.1f}/5")
print(f"   CTA Strength: {metrics['avg_cta_strength']:.1f}/5")
print(f"   Overall:      {metrics['avg_overall']:.1f}/5")

---

## 📋 Step 10: Generate Summary Table

In [None]:
print("📋 Evaluation Summary Table")
print("=" * 100)
print(f"{'#':<3} {'Product':<25} {'Type':<20} {'Voice':<6} {'Clarity':<8} {'CTA':<5} {'Overall':<8}")
print("-" * 100)

for i, r in enumerate(all_results, 1):
    status = "✅" if r["overall_score"] >= 4 else "⚠️" if r["overall_score"] >= 3 else "❌"
    product_short = r["product"][:23] + ".." if len(r["product"]) > 25 else r["product"]
    type_short = r["type"][:18] + ".." if len(r["type"]) > 20 else r["type"]
    
    print(f"{i:<3} {product_short:<25} {type_short:<20} {r['brand_voice_score']}/5   {r['clarity_score']}/5     {r['cta_score']}/5  {r['overall_score']:.1f}/5 {status}")

print("-" * 100)
print(f"")
print(f"Summary: {metrics['avg_overall']:.1f}/5 average overall score")

---

## 🔍 Step 11: Analyze High vs. Low Scoring Content

In [None]:
# Sort results by overall score
sorted_results = sorted(all_results, key=lambda x: x["overall_score"], reverse=True)

print("🔍 Content Quality Analysis")
print("=" * 70)

# Identify best and worst
print("\n📈 HIGHEST SCORING CONTENT")
print("-" * 50)
best = sorted_results[0]
print(f"Product: {best['product']}")
print(f"Overall Score: {best['overall_score']:.1f}/5")
print(f"")
print(f"Brand Voice ({best['brand_voice_score']}/5): {best['brand_voice_rationale']}")
print(f"Clarity ({best['clarity_score']}/5): {best['clarity_rationale']}")
print(f"CTA ({best['cta_score']}/5): {best['cta_rationale']}")

print("\n📉 LOWEST SCORING CONTENT")
print("-" * 50)
worst = sorted_results[-1]
print(f"Product: {worst['product']}")
print(f"Overall Score: {worst['overall_score']:.1f}/5")
print(f"")
print(f"Brand Voice ({worst['brand_voice_score']}/5): {worst['brand_voice_rationale']}")
print(f"Clarity ({worst['clarity_score']}/5): {worst['clarity_rationale']}")
print(f"CTA ({worst['cta_score']}/5): {worst['cta_rationale']}")
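
Since every sample carries an `expected_quality` label, the judge's scores can also be checked against those expectations. The sketch below uses one possible reading of the labels (`high` means every dimension at 4 or above, `poor_voice` means brand voice below 4, and so on); that mapping and the `matches_expectation` helper are assumptions, not something the notebook defines.

```python
from typing import Any, Dict


def matches_expectation(result: Dict[str, Any]) -> bool:
    """Check one result against an assumed meaning of its expected_quality label."""
    voice = result["brand_voice_score"]
    clarity = result["clarity_score"]
    cta = result["cta_score"]
    label = result["expected_quality"]
    if label == "high":
        return min(voice, clarity, cta) >= 4
    if label == "poor_voice":
        return voice < 4
    if label == "weak_cta":
        return cta < 4
    if label == "poor_all":
        return max(voice, clarity, cta) < 4
    raise ValueError(f"Unknown label: {label}")


# Two hand-written results in the same shape as entries of all_results
examples = [
    {"id": "a", "expected_quality": "high",
     "brand_voice_score": 4, "clarity_score": 5, "cta_score": 5},
    {"id": "b", "expected_quality": "weak_cta",
     "brand_voice_score": 3, "clarity_score": 5, "cta_score": 4},
]
mismatches = [r["id"] for r in examples if not matches_expectation(r)]
print(f"Mismatched samples: {mismatches}")  # Mismatched samples: ['b']
```

A mismatch does not say which side is wrong; it flags the sample for a human to decide whether the label or the judge needs adjusting.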

---

## 📊 Step 12: Score Distribution by Dimension

In [None]:
print("📊 Score Distribution by Dimension")
print("=" * 70)

# Analyze each dimension
dimensions = [
    ("Brand Voice", "brand_voice_score"),
    ("Clarity", "clarity_score"),
    ("CTA Strength", "cta_score"),
]

for dim_name, dim_key in dimensions:
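    scores = [r[dim_key] for r in all_results if r[dim_key] > 0]
    if not scores:
        # Every evaluation failed to parse; min() and max() would raise
        print(f"\n{dim_name.upper()}: no valid scores")
        continue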
    
    print(f"\n{dim_name.upper()}")
    print(f"   Average: {np.mean(scores):.1f}/5")
    print(f"   Std Dev: {np.std(scores):.2f}")
    print(f"   Min: {min(scores)}/5")
    print(f"   Max: {max(scores)}/5")
    
    # Distribution
    print(f"   Distribution:")
    for score in range(5, 0, -1):
        count = sum(1 for s in scores if s == score)
        bar = "█" * count
        print(f"      {score}/5: {bar} ({count})")

---

## 🎯 Step 13: Identify Improvement Areas

In [None]:
print("🎯 Improvement Areas Analysis")
print("=" * 70)

# Find content needing improvement for each dimension
print("\n🔴 Content Needing Brand Voice Improvement (Score < 4):")
brand_issues = [r for r in all_results if r["brand_voice_score"] < 4]
if brand_issues:
    for r in brand_issues:
        print(f"   • {r['product']}: {r['brand_voice_score']}/5 - {r['brand_voice_rationale'][:60]}...")
else:
    print("   ✅ All content meets brand voice standards")

print("\n🔴 Content Needing Clarity Improvement (Score < 4):")
clarity_issues = [r for r in all_results if r["clarity_score"] < 4]
if clarity_issues:
    for r in clarity_issues:
        print(f"   • {r['product']}: {r['clarity_score']}/5 - {r['clarity_rationale'][:60]}...")
else:
    print("   ✅ All content meets clarity standards")

print("\n🔴 Content Needing CTA Improvement (Score < 4):")
cta_issues = [r for r in all_results if r["cta_score"] < 4]
if cta_issues:
    for r in cta_issues:
        print(f"   • {r['product']}: {r['cta_score']}/5 - {r['cta_rationale'][:60]}...")
else:
    print("   ✅ All content meets CTA standards")
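
The three per-dimension checks in this cell follow the same pattern, so they can be folded into one loop. This is a minimal sketch that assumes results shaped like the entries of `all_results`; `find_weak_items` is a hypothetical helper, and its default threshold of 4 mirrors the cutoff used in this step.

```python
from typing import Any, Dict, List

DIMENSIONS = {
    "Brand Voice": "brand_voice_score",
    "Clarity": "clarity_score",
    "CTA": "cta_score",
}


def find_weak_items(results: List[Dict[str, Any]], threshold: int = 4) -> Dict[str, List[str]]:
    """Map each dimension name to the products scoring below the threshold."""
    return {
        name: [r["product"] for r in results if r[key] < threshold]
        for name, key in DIMENSIONS.items()
    }


# Hand-written results for illustration
sample_results = [
    {"product": "Smart Inbox", "brand_voice_score": 4, "clarity_score": 5, "cta_score": 5},
    {"product": "Task Management", "brand_voice_score": 2, "clarity_score": 2, "cta_score": 4},
]
weak = find_weak_items(sample_results)
for name, products in weak.items():
    print(f"{name}: {products or 'all content meets the standard'}")
```

Adding a fourth dimension then only requires a new entry in `DIMENSIONS` rather than another copied block.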

---

## 🧪 Step 14: Test Custom Content

In [None]:
# Test with custom content
print("🧪 Test Custom Content")
print("=" * 70)

custom_content = """
Finally, a calendar that works the way you think.

TechFlow Calendar learns your preferences, suggests optimal meeting times, 
and protects your focus time automatically. Because your best work happens 
when you're not fighting your schedule.

See it in action: book a 10-minute demo.
""".strip()

print(f"\nCustom Content:")
print(f"-" * 50)
print(custom_content)
print(f"-" * 50)

# Evaluate
result = judge.evaluate(
    content=custom_content,
    brand_guidelines=BRAND_GUIDELINES,
    target_audience="Busy professionals",
    desired_action="Book a demo",
)

print(f"\nEvaluation Results:")
print(f"   Brand Voice:  {result['brand_voice_score']}/5 - {result['brand_voice_rationale']}")
print(f"   Clarity:      {result['clarity_score']}/5 - {result['clarity_rationale']}")
print(f"   CTA Strength: {result['cta_score']}/5 - {result['cta_rationale']}")
print(f"   Overall:      {result['overall_score']:.1f}/5")
print(f"\nSummary: {result['summary']}")

---

## 🔄 Step 15: Compare Content Variants (A/B Testing)

In [None]:
def compare_variants(
    judge: CreativeContentJudge,
    variant_a: str,
    variant_b: str,
    brand_guidelines: str,
) -> Dict[str, Any]:
    """
    Compare two content variants for A/B testing.
    """
    result_a = judge.evaluate(variant_a, brand_guidelines)
    result_b = judge.evaluate(variant_b, brand_guidelines)
    
    # Determine winner for each dimension
    comparisons = {}
    for dim, key in [("Brand Voice", "brand_voice_score"), ("Clarity", "clarity_score"), ("CTA", "cta_score")]:
        a_score = result_a[key]
        b_score = result_b[key]
        winner = "A" if a_score > b_score else "B" if b_score > a_score else "Tie"
        comparisons[dim] = {"a": a_score, "b": b_score, "winner": winner}
    
    # Overall winner
    a_wins = sum(1 for c in comparisons.values() if c["winner"] == "A")
    b_wins = sum(1 for c in comparisons.values() if c["winner"] == "B")
    overall_winner = "A" if a_wins > b_wins else "B" if b_wins > a_wins else "Tie"
    
    return {
        "result_a": result_a,
        "result_b": result_b,
        "comparisons": comparisons,
        "overall_winner": overall_winner,
    }


# Test variants
variant_a = "Start your free trial today and see why 50,000+ professionals trust TechFlow."
variant_b = "Join TechFlow now. It might help you be more productive someday."

print("🔄 A/B Test Comparison")
print("=" * 70)
print(f"\nVariant A: {variant_a}")
print(f"Variant B: {variant_b}")

comparison = compare_variants(judge, variant_a, variant_b, BRAND_GUIDELINES)

print(f"\nResults:")
print(f"-" * 50)
for dim, data in comparison["comparisons"].items():
    print(f"   {dim}: A={data['a']}/5, B={data['b']}/5 → Winner: {data['winner']}")

print(f"\n🏆 Overall Winner: Variant {comparison['overall_winner']}")
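
Counting dimension wins can report a tie even when one variant is clearly stronger on average, for example when A wins one dimension by three points and B wins another by one. One possible tie-breaker, sketched here under the assumption that all three dimensions matter equally, is to fall back to the mean score difference; `pick_winner` and its `margin` parameter are hypothetical.

```python
from typing import Dict

KEYS = ("brand_voice_score", "clarity_score", "cta_score")


def pick_winner(result_a: Dict[str, float], result_b: Dict[str, float], margin: float = 0.0) -> str:
    """Pick a winner by dimension wins, then by mean score if wins are tied."""
    a_wins = sum(result_a[k] > result_b[k] for k in KEYS)
    b_wins = sum(result_b[k] > result_a[k] for k in KEYS)
    if a_wins != b_wins:
        return "A" if a_wins > b_wins else "B"
    # Tied on wins: compare mean scores, treating small gaps as a tie
    diff = sum(result_a[k] - result_b[k] for k in KEYS) / len(KEYS)
    if abs(diff) <= margin:
        return "Tie"
    return "A" if diff > 0 else "B"


# Hand-written scores: one win each, but A's win is much larger
a = {"brand_voice_score": 5, "clarity_score": 4, "cta_score": 4}
b = {"brand_voice_score": 2, "clarity_score": 5, "cta_score": 4}
print(pick_winner(a, b))  # A
```

Setting `margin` above zero avoids declaring a winner on differences smaller than the judge's own run-to-run noise, which would need to be measured separately.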

---

## 📚 Summary

In this notebook, you learned how to:

1. **Define evaluation criteria** for creative and marketing content
2. **Design a multi-dimensional rubric** (1-5 scale for each dimension)
3. **Implement the LLM-as-Judge pattern** for creative content
4. **Analyze content quality** across brand voice, clarity, and CTA
5. **Compare content variants** for A/B testing

### Key Takeaways

1. Creative content requires multi-dimensional evaluation
2. Brand voice alignment is critical for consistent messaging
3. Clarity and CTA strength can directly affect conversion
4. LLM judges can scale evaluation but should be calibrated with human review

### Next Steps

1. **Calibrate the judge** with human-rated examples
2. **Add engagement prediction** based on content characteristics
3. **Integrate with A/B testing** to validate scores against real performance
4. **Build a content scoring dashboard** for marketing teams
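
For the calibration step, a simple starting point is to compare judge scores with human ratings of the same items. The sketch below computes the exact-agreement rate, the within-one-point rate, and the mean absolute error; the `agreement_report` name and the example ratings are made up for illustration and are not results from this notebook.

```python
from typing import Dict, Sequence


def agreement_report(judge: Sequence[int], human: Sequence[int]) -> Dict[str, float]:
    """Summarize how closely judge scores track human scores on a 1-5 scale."""
    if len(judge) != len(human) or not judge:
        raise ValueError("Need two non-empty score lists of equal length")
    diffs = [abs(j - h) for j, h in zip(judge, human)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }


# Illustrative ratings only (not measured data)
judge_scores = [4, 2, 3, 5]
human_scores = [4, 3, 3, 3]
report = agreement_report(judge_scores, human_scores)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Running this per dimension, rather than on the overall score, shows whether the judge is miscalibrated on brand voice, clarity, or CTA strength specifically.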

---

## ✔ Knowledge Mastery Checklist

Before moving to Week 17-18, ensure you can check all boxes:

- [ ] I understand what creative and marketing content evaluation involves
- [ ] I can define evaluation criteria: brand voice, clarity, CTA strength
- [ ] I can design and apply a 1-5 rubric for each dimension
- [ ] I understand how to implement the LLM-as-Judge pattern for multi-dimensional scoring
- [ ] I can interpret evaluation results and identify patterns
- [ ] I understand the limitations of automated creative evaluation
- [ ] I can provide actionable recommendations based on evaluation results

---

**Week 16 Complete!**

*Next: Week 17-18, Full System Architecture & Capstone*