# Prompt Evolution with SCOPE

Welcome to this tutorial on **automatic prompt optimization**!

## What You'll Learn

In this notebook, you'll discover how AI systems can automatically improve their own prompts through observation and learning. We'll use a simple information extraction task to demonstrate:

- 📚 How SCOPE (Self-Correcting Optimal Prompt Evolution) works
- 🔄 How prompts evolve automatically over time
- 📊 How to measure improvement through iterative learning
- 🎯 Real-world application with LangChain

## Context

Most LLM applications use **static prompts** - written once, they never change or improve, no matter how the model performs. But what if your AI could learn from experience and automatically optimize its own instructions? That's exactly what SCOPE enables.

Think of it like a student who:
1. Completes a task
2. Reviews what went well and what didn't
3. Updates their approach for next time
4. Gets better with each attempt

SCOPE does this automatically, with no manual prompt engineering required!

## About This Tutorial

This notebook uses a **simple information extraction task** to teach SCOPE fundamentals. The same principles apply to complex systems - in fact, this repository includes a production research assistant with:
- 🎯 5 agents learning simultaneously
- 🎓 Source quality assessment (academic vs blog detection)
- 📈 +31% quality improvement in just 5 iterations

We start simple here so you can focus on **how SCOPE works**, then you can explore the advanced features!

## ⚠️ Important: Using the Right Environment

This notebook requires the project environment.

**Before running:** Activate the project `.venv`:
```bash
source .venv/bin/activate  # In project root
```

**In VS Code/Cursor:** The `.venv` is auto-detected - just open and run!

**Verify below:**

In [None]:
import sys
from pathlib import Path

print("🐍 Python:", sys.executable)
print("📁 Directory:", Path.cwd())

# Check if using project .venv
if ".venv" in sys.executable:
    print("\n✅ Correct! Using project .venv")
else:
    print("\n⚠️  Not using project .venv")
    print("   Run: source .venv/bin/activate")
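
The substring check above only works when the environment directory is literally named `.venv` and that name appears in the interpreter path. As a hedged alternative, the sketch below compares `sys.prefix` with `sys.base_prefix`, which differ inside a standard `venv`-style environment. The helper names `in_virtualenv` and `env_matches` are illustrative and not part of this project.

```python
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def env_matches(expected_dir: str = ".venv") -> bool:
    """Return True when the active environment directory is named expected_dir."""
    return in_virtualenv() and Path(sys.prefix).name == expected_dir


print("In a virtual environment:", in_virtualenv())
print("Environment directory:", Path(sys.prefix).name)
print("Looks like the project .venv:", env_matches())
```

This check is independent of how the interpreter path is spelled, but it still cannot tell two different `.venv` directories apart, so treat it as a sanity check only.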

## Setup

First, let's install the required packages and set up our environment.

In [None]:
%%capture --no-stderr
%pip install --quiet -U langchain_openai langchain_core scope-optimizer

In [None]:
import os, getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("OPENAI_API_KEY")

## The Concept: Before and After

Let's understand what we're building:

### Without SCOPE (Traditional Approach)
```
Static Prompt → LLM → Output
     ↓
  Never changes!
```

### With SCOPE (Evolving Approach)
```
Initial Prompt → LLM → Output
      ↓                  ↓
      ↓          SCOPE Observes
      ↓                  ↓
      ↓          Learns Patterns
      ↓                  ↓
Improved Prompt ← Updates Rules
```

The magic happens in the learning loop - SCOPE observes what works and what doesn't, then automatically generates improvement rules!
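
To make the loop concrete, here is a toy sketch of the observe, learn, update cycle. It does not use SCOPE or an LLM: `toy_llm` and `toy_observer` are hypothetical stand-ins invented for illustration, and the real system replaces both with model calls.

```python
from typing import Callable, List, Optional, Tuple


def toy_llm(prompt: str, text: str) -> str:
    """Pretend model: lowercases its output only when the prompt asks for it."""
    return text.lower() if "lowercase" in prompt else text


def toy_observer(output: str) -> Optional[str]:
    """Pretend observer: proposes a rule when the output contains uppercase."""
    if any(ch.isupper() for ch in output):
        return "Convert all extracted emails to lowercase."
    return None


def evolving_loop(
    base_prompt: str,
    tasks: List[str],
    llm: Callable[[str, str], str],
    observer: Callable[[str], Optional[str]],
) -> Tuple[List[str], List[str]]:
    """Run each task with the current prompt, then fold new rules back in."""
    rules: List[str] = []
    outputs: List[str] = []
    for text in tasks:
        # The prompt is rebuilt on every step from the base plus learned rules.
        prompt = base_prompt + "".join(f"\n- {rule}" for rule in rules)
        output = llm(prompt, text)
        outputs.append(output)
        rule = observer(output)
        if rule and rule not in rules:
            rules.append(rule)
    return outputs, rules


outputs, rules = evolving_loop(
    "Extract emails.",
    ["SUPPORT@COMPANY.COM", "HR@BUSINESS.ORG"],
    toy_llm,
    toy_observer,
)
print(outputs)  # ['SUPPORT@COMPANY.COM', 'hr@business.org']
print(rules)    # ['Convert all extracted emails to lowercase.']
```

The first task runs with the bare prompt and triggers a rule; the second task already benefits from it. A static prompt would have produced uppercase output both times.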

## Our Simple Task: Information Extraction

We'll use information extraction as our example because:
- ✅ It's easy to understand
- ✅ Results are measurable
- ✅ Improvements are visible

We'll ask the AI to extract information like emails, names, and phone numbers from text. As it completes tasks, SCOPE will learn how to do this better.

## Step 1: The Base Prompt (Before Learning)

Let's start with a simple, basic prompt. This is what the AI begins with - no optimization yet.

In [42]:
BASE_EXTRACTION_PROMPT = """You are an information extraction specialist.
Extract the requested information from the provided text.

## Core Instructions:
- Extract only what is requested
- If information is not present, respond with "Not found"
"""

print("📋 Base Prompt (intentionally simple):")
print(BASE_EXTRACTION_PROMPT)
print("\n💡 This basic prompt will allow SCOPE to discover best practices through observation!")

📋 Base Prompt (intentionally simple):
You are an information extraction specialist.
Extract the requested information from the provided text.

## Core Instructions:
- Extract only what is requested
- If information is not present, respond with "Not found"


💡 This basic prompt will allow SCOPE to discover best practices through observation!


This prompt is intentionally basic! It doesn't specify:
- How to handle email case (should `SUPPORT@COMPANY.COM` be lowercase?)
- Whether to remove brackets/formatting from emails (`<INFO@HELP.NET>`)
- How to standardize dates (`12/25/2024` vs `January 1st 2025` vs `2025-02-14`)
- Which phone number format to use (`(555)123.4567` vs `555 987 6543`)
- How to ensure consistency when the same instruction is repeated

**This is where SCOPE comes in!** When the model handles similar tasks inconsistently, SCOPE observes the patterns and automatically learns formatting rules to standardize outputs.
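
For reference, the kind of normalization the base prompt leaves unspecified can be written down deterministically. The sketch below is not how SCOPE works (SCOPE learns natural-language rules for the prompt); it only illustrates the target behavior for emails and dates. The function names are hypothetical, `%m/%d/%Y` assumes US-style month-first dates, and `%B` assumes English month names.

```python
import re
from datetime import datetime

# Formats seen in this tutorial's sample texts (month-first is an assumption).
_DATE_FORMATS = ("%m/%d/%Y", "%B %d %Y", "%Y-%m-%d")


def normalize_email(raw: str) -> str:
    """Strip surrounding angle brackets and whitespace, then lowercase."""
    return raw.strip().strip("<>").lower()


def normalize_date(raw: str) -> str:
    """Convert a date in one of the known formats to YYYY-MM-DD."""
    # Drop ordinal suffixes such as "1st" or "15th" before parsing.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


print([normalize_email(e) for e in ["SUPPORT@COMPANY.COM", "sales@test.org", "<INFO@HELP.NET>"]])
# ['support@company.com', 'sales@test.org', 'info@help.net']
print([normalize_date(d) for d in ["12/25/2024", "January 1st 2025", "2025-02-14"]])
# ['2024-12-25', '2025-01-01', '2025-02-14']
```

Hand-written normalizers like these are handy as a reference when judging whether the evolved prompt produces the formats you expect.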

### 💡 What Makes This Demo Work?

The tasks use **repeated similar instructions** to expose formatting inconsistencies:

**Task 1 & 2: Email Extraction (Same Task, Different Data)**
- `SUPPORT@COMPANY.COM` vs `sales@test.org` → Mixed case needs normalization
- `<INFO@HELP.NET>` vs plain text → Inconsistent formatting
- SCOPE will learn: "Always lowercase emails, remove brackets"

**Task 3 & 4: Date Extraction (Proven Pattern!)**
- `12/25/2024` vs `January 1st 2025` vs `2025-02-14` → Mixed formats
- `March 15th 2025` vs `04/20/2025` vs `2025-05-30` → More variety
- SCOPE will learn: "Standardize all dates to YYYY-MM-DD"

**Task 5: Phone Numbers**
- `1-555-CALL-NOW`, `(555)123.4567`, `555 987 6543` → Messy formats
- SCOPE will learn: "Clean and standardize phone formats"
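
As a point of comparison for the phone task, here is a minimal sketch of one possible target format, `(NNN) NNN-NNNN`. The format choice and the helper name `normalize_phone` are assumptions for illustration. Vanity numbers such as `1-555-CALL-NOW` are deliberately not decoded and come back as `None`.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Format a 10-digit US-style number as (NNN) NNN-NNNN, else return None."""
    digits = re.sub(r"\D", "", raw)
    # Drop a leading US country code if present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


for sample in ["(555)123.4567", "555 987 6543", "1-555-CALL-NOW"]:
    print(sample, "->", normalize_phone(sample))
# (555)123.4567 -> (555) 123-4567
# 555 987 6543 -> (555) 987-6543
# 1-555-CALL-NOW -> None
```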

**Why this strategy works:**
- **Repetition**: Same instruction twice forces consistency decisions
- **Visible input inconsistencies**: Mixed formats that clearly need fixing
- **First run**: Model handles each case separately (inconsistent)
- **SCOPE observes**: Identifies the pattern across similar tasks
- **Second run**: Applies learned rules consistently!

The key: **Repeated task types** make inconsistencies obvious and improvements measurable.

## Step 2: Define Extraction Tasks

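Let's define five extraction tasks. Tasks 1-2 and Tasks 3-4 deliberately repeat the same instruction on different data so SCOPE can spot inconsistent formatting; Task 5 adds messy phone numbers: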

In [43]:
EXTRACTION_TASKS = [
    # Task 1: Emails - force the model to handle inconsistent data first
    {
        "instruction": "Extract all email addresses in a clean list",
        "text": "Reach us at: SUPPORT@COMPANY.COM, Sales: sales@test.org, or Info <INFO@HELP.NET>"
    },
    
    # Task 2: Multiple extractions - test if model maintains learned patterns
    {
        "instruction": "Extract all email addresses in a clean list", 
        "text": "Team: Alice.Brown@TECH.COM, bob.smith@startup.io, Contact: HR@BUSINESS.ORG"
    },
    
    # Task 3: Dates in mixed formats - should be standardized
    {
        "instruction": "Extract dates",
        "text": "Important dates: 12/25/2024, January 1st 2025, and 2025-02-14"
    },
    
    # Task 4: Second date task - test consistency
    {
        "instruction": "Extract dates",
        "text": "Deadlines: March 15th 2025, 04/20/2025, and 2025-05-30"
    },
    
    # Task 5: Phone numbers with messy formatting
    {
        "instruction": "Extract phone numbers",
        "text": "Contact: 1-555-CALL-NOW (555-2255), office (555)123.4567, or mobile: 555 987 6543"
    },
]

print(f"✅ Created {len(EXTRACTION_TASKS)} extraction tasks designed for visible learning")
print("\nKey improvements:")
print("  • Task 1 & 2: Same instruction, different data → tests consistency")
print("  • Task 3 & 4: Dates with mixed formats → proven to standardize")
print("  • Task 5: Messy phone formats → should clean up")
print("  • Uppercase emails (SUPPORT@COMPANY.COM) → should normalize")
print("\n💡 These tasks expose inconsistencies that SCOPE can learn to fix!")

✅ Created 5 extraction tasks designed for visible learning

Key improvements:
  • Task 1 & 2: Same instruction, different data → tests consistency
  • Task 3 & 4: Dates with mixed formats → proven to standardize
  • Task 5: Messy phone formats → should clean up
  • Uppercase emails (SUPPORT@COMPANY.COM) → should normalize

💡 These tasks expose inconsistencies that SCOPE can learn to fix!


## Step 3: Set Up LangChain

Now let's create our LangChain chat model. We'll use GPT-4o for high-quality extractions.

In [44]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Initialize the chat model
llm = ChatOpenAI(model="gpt-4o", temperature=0)

print("✅ LangChain ChatOpenAI initialized")
print("   Model: gpt-4o")
print("   Temperature: 0 (deterministic)")

✅ LangChain ChatOpenAI initialized
   Model: gpt-4o
   Temperature: 0 (deterministic)


## Step 4: Initialize SCOPE

Here's where the magic begins! SCOPE will observe each task completion and learn improvement patterns.

In [46]:
from scope import SCOPEOptimizer
from scope.models import create_openai_model

# Create SCOPE's model (for analyzing and learning)
scope_model = create_openai_model(
    model="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"]
)

# Initialize SCOPE optimizer
optimizer = SCOPEOptimizer(
    synthesizer_model=scope_model,
    exp_path="./scope_data",  # Where to save learned rules
    enable_quality_analysis=True,  # Analyze quality after each task
    quality_analysis_frequency=1,  # Check every task
    synthesis_mode="efficiency",  # Fast learning mode
    store_history=True  # Keep learning history
)

print("✅ SCOPE Optimizer initialized")
print("   📊 Quality analysis: Enabled")
print("   💾 Learning history: Stored")
print("   ⚡ Mode: Efficiency (fast learning)")

✅ SCOPE Optimizer initialized
   📊 Quality analysis: Enabled
   💾 Learning history: Stored
   ⚡ Mode: Efficiency (fast learning)


### What do these parameters mean?

- **enable_quality_analysis**: After each task, SCOPE analyzes if the output could be better
- **quality_analysis_frequency**: How often to check (1 = every task)
- **synthesis_mode**: "efficiency" favors fast learning; "thoroughness" runs a deeper, 7-dimension analysis
- **store_history**: Keeps a record of all learning events
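
To switch modes without retyping the whole call, the keyword arguments can be kept in a small helper. This is only a sketch: it reuses the parameter names and values from the `SCOPEOptimizer` call shown earlier in this notebook, accepts just the two mode names mentioned in this tutorial, and the helper name `scope_kwargs` is hypothetical.

```python
from typing import Any, Dict

# The two synthesis modes named in this tutorial.
KNOWN_MODES = ("efficiency", "thoroughness")


def scope_kwargs(mode: str = "efficiency") -> Dict[str, Any]:
    """Build keyword arguments mirroring the optimizer call in this notebook."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"Unknown synthesis mode: {mode!r}")
    return {
        "exp_path": "./scope_data",       # where learned rules are saved
        "enable_quality_analysis": True,  # analyze quality after tasks
        "quality_analysis_frequency": 1,  # check every task
        "synthesis_mode": mode,
        "store_history": True,            # keep learning history
    }


print(scope_kwargs("thoroughness")["synthesis_mode"])  # thoroughness

# Usage with the objects defined in this notebook:
# optimizer = SCOPEOptimizer(synthesizer_model=scope_model, **scope_kwargs("thoroughness"))
```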

### Why "efficiency" mode for this tutorial?

For this educational tutorial, we use **"efficiency"** mode because:
- ✅ Faster learning (good for quick demonstrations)
- ✅ Clear, straightforward rules
- ✅ Perfect for understanding SCOPE fundamentals

The research assistant (`main.py`) uses **"thoroughness"** mode for production:
- 🎯 More detailed analysis (7 dimensions)
- 🎯 Higher quality rules
- 🎯 Better for complex, multi-agent systems

## Step 5: First Run - Observe Learning

Let's run through our tasks and watch SCOPE learn in real-time!

In [47]:
import asyncio

async def extract_with_scope(instruction, text, task_id):
    """Extract information and let SCOPE observe."""
    
    # Get current prompt (starts with base, evolves over time)
    strategic_rules = optimizer.get_strategic_rules_for_agent("info_extractor")
    current_prompt = BASE_EXTRACTION_PROMPT
    if strategic_rules:
        current_prompt += f"\n\n## Strategic Guidelines (Learned):\n{strategic_rules}"
    
    # Create messages
    messages = [
        SystemMessage(content=current_prompt),
        HumanMessage(content=f"{instruction}\n\nText: {text}")
    ]
    
    # Get response from LLM (async call, so the event loop is not blocked)
    response = await llm.ainvoke(messages)
    output = response.content
    
    # Let SCOPE observe and learn
    result = await optimizer.on_step_complete(
        agent_name="info_extractor",
        agent_role="Information Extraction Specialist",
        task=f"{instruction} | Text: {text}",
        model_output=output,
        observations=f"Extracted from: '{text[:50]}...'",
        error=None,
        current_system_prompt=current_prompt,
        task_id=task_id
    )
    
    return output, result

# Run the tasks
print("🚀 BEFORE LEARNING (Initial Run)\n")
print("=" * 70)

learning_events = []
first_run_outputs = []  # Store outputs for comparison

for i, task in enumerate(EXTRACTION_TASKS, 1):
    print(f"\n📝 Task {i}/{len(EXTRACTION_TASKS)}")
    print(f"Instruction: {task['instruction']}")
    print(f"Text: {task['text']}")

    # Run extraction
    output, learning_result = await extract_with_scope(
        task['instruction'],
        task['text'],
        f"task_{i}"
    )

    first_run_outputs.append(output)  # Store for comparison
    print(f"\n✓ Output: {output}")

    # Check if SCOPE learned something
    if learning_result:
        guideline, guideline_type = learning_result
        learning_events.append({"task": i, "type": guideline_type, "rule": guideline})
        print(f"\n📚 SCOPE LEARNED ({guideline_type.upper()}):")
        print(f"   {guideline[:120]}...")

    print("\n" + "-" * 70)

print(f"\n✅ Completed {len(EXTRACTION_TASKS)} tasks")
print(f"📚 SCOPE learning events: {len(learning_events)}")

🚀 BEFORE LEARNING (Initial Run)


📝 Task 1/5
Instruction: Extract all email addresses in a clean list
Text: Reach us at: SUPPORT@COMPANY.COM, Sales: sales@test.org, or Info <INFO@HELP.NET>

✓ Output: - SUPPORT@COMPANY.COM
- sales@test.org
- INFO@HELP.NET

📚 SCOPE LEARNED (STRATEGIC):
   Convert all extracted emails to lowercase to maintain consistency....

----------------------------------------------------------------------

📝 Task 2/5
Instruction: Extract all email addresses in a clean list
Text: Team: Alice.Brown@TECH.COM, bob.smith@startup.io, Contact: HR@BUSINESS.ORG

✓ Output: - alice.brown@tech.com
- bob.smith@startup.io
- hr@business.org

----------------------------------------------------------------------

📝 Task 3/5
Instruction: Extract dates
Text: Important dates: 12/25/2024, January 1st 2025, and 2025-02-14

✓ Output: 12/25/2024, January 1st 2025, 2025-02-14

📚 SCOPE LEARNED (STRATEGIC):
   Always use a consistent date format, such as YYYY-MM-DD, 

In [48]:
print("📊 Learning Summary\n")
print("=" * 70)

if learning_events:
    print(f"\nTotal learning events: {len(learning_events)}\n")
    
    for event in learning_events:
        print(f"Task {event['task']} - {event['type'].upper()}:")
        print(f"  {event['rule'][:100]}...")
        print()
else:
    print("No learning events recorded.")

# Get the complete evolved prompt
strategic_rules = optimizer.get_strategic_rules_for_agent("info_extractor")
evolved_prompt = BASE_EXTRACTION_PROMPT
if strategic_rules:
    evolved_prompt += f"\n\n## Strategic Guidelines (Learned):\n{strategic_rules}"

print("\n" + "=" * 70)
print("EVOLVED PROMPT (After Learning)")
print("=" * 70)
print(evolved_prompt)

📊 Learning Summary


Total learning events: 4

Task 1 - STRATEGIC:
  Convert all extracted emails to lowercase to maintain consistency....

Task 3 - STRATEGIC:
  Always use a consistent date format, such as YYYY-MM-DD, for all extracted dates....

Task 4 - STRATEGIC:
  Always verify and clearly delimit output to ensure extracted dates align precisely with the expected...

Task 5 - STRATEGIC:
  Normalize extracted phone numbers to a consistent format, e.g., (NNN) NNN-NNNN....


EVOLVED PROMPT (After Learning)
You are an information extraction specialist.
Extract the requested information from the provided text.

## Core Instructions:
- Extract only what is requested
- If information is not present, respond with "Not found"


## Strategic Guidelines (Learned):

## Strategic Guidelines (Learned Best Practices):
These are high-confidence rules learned from previous tasks:

### Data Validation:
- Convert all extracted emails to lowercase to maintain consistency.
- Always use a consistent

## Step 7: Compare Before and After

Now let's run the same tasks again with the evolved prompt!

**What to watch for:**
- Are outputs more normalized (lowercase emails, consistent formatting)?
- Do we see fewer learning events (meaning the prompt is already better)?
- Can you spot visible improvements in the outputs?

Let's find out:

In [49]:
print("🔄 AFTER LEARNING (Second Run with Evolved Prompt)\n")
print("=" * 70)

second_run_learning = []
second_run_outputs = []  # Store outputs for comparison

for i, task in enumerate(EXTRACTION_TASKS, 1):
    print(f"\n📝 Task {i}/{len(EXTRACTION_TASKS)}")

    # Run extraction
    output, learning_result = await extract_with_scope(
        task['instruction'],
        task['text'],
        f"task_{i}_round2"
    )

    second_run_outputs.append(output)
    print(f"✓ Output: {output}")

    if learning_result:
        second_run_learning.append(learning_result)
        print("📚 New learning event")
    else:
        print("✓ No new learning needed (prompt already optimized!)")

print("\n" + "=" * 70)
print("📊 SIDE-BY-SIDE COMPARISON")
print("=" * 70)

# Compare outputs for each task
improvements_found = False
for i, task in enumerate(EXTRACTION_TASKS):
    before = first_run_outputs[i]
    after = second_run_outputs[i]

    if before != after:
        improvements_found = True
        print(f"\n📝 Task {i+1}: {task['instruction']}")
        print(f"   Text: {task['text'][:60]}...")
        print(f"\n   ❌ Before: {before}")
        print(f"   ✅ After:  {after}")
        print("   💡 Improvement: Output is now more consistent/normalized")

if not improvements_found:
    print("\n⚠️  Outputs are identical - learning is happening but not visible in final outputs.")
    print("This suggests the tasks may need adjustment to show clearer improvements.")

print("\n" + "=" * 70)
print("📈 LEARNING METRICS")
print("=" * 70)
print(f"\n1st Run: {len(learning_events)} learning events")
print(f"2nd Run: {len(second_run_learning)} learning events")
print(f"\nImprovement: {len(learning_events) - len(second_run_learning)} fewer learning events needed!")
print("\n💡 Fewer learning events means the prompt is better optimized!")

🔄 AFTER LEARNING (Second Run with Evolved Prompt)


📝 Task 1/5
✓ Output: support@company.com  
sales@test.org  
info@help.net
📚 New learning event

📝 Task 2/5
✓ Output: alice.brown@tech.com  
bob.smith@startup.io  
hr@business.org
✓ No new learning needed (prompt already optimized!)

📝 Task 3/5
✓ Output: 2024-12-25, 2025-01-01, 2025-02-14
📚 New learning event

📝 Task 4/5
✓ Output: 2025-03-15, 2025-04-20, 2025-05-30
✓ No new learning needed (prompt already optimized!)

📝 Task 5/5
✓ Output: (555) 123-4567  
555-987-6543
✓ No new learning needed (prompt already optimized!)

📊 SIDE-BY-SIDE COMPARISON

📝 Task 1: Extract all email addresses in a clean list
   Text: Reach us at: SUPPORT@COMPANY.COM, Sales: sales@test.org, or ...

   ❌ Before: - SUPPORT@COMPANY.COM
- sales@test.org
- INFO@HELP.NET
   ✅ After:  support@company.com  
sales@test.org  
info@help.net
   💡 Improvement: Output is now more consistent/normalized

📝 Task 2: Extract

## Understanding the Results

What just happened?

### First Run (BEFORE Learning)
- Started with a basic, generic prompt
- Outputs may have inconsistencies (mixed case, uneven formatting, etc.)
- SCOPE observed the outputs and identified improvement patterns
- Generated strategic rules to address issues

### Second Run (AFTER Learning)
- Used the evolved prompt with learned strategic rules
- Outputs should be more normalized and consistent
- SCOPE found fewer issues to fix (2 new learning events, down from 4 in the first run)
- The prompt is closer to optimized, although SCOPE still proposed refinements
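
One detail worth noticing in the recorded evolved prompt: two "Strategic Guidelines" headings appear back to back, which suggests the rules text returned by `get_strategic_rules_for_agent` already carries its own heading. Assuming that is the case, a small helper can avoid the duplicate. `compose_prompt` is a hypothetical name, not part of SCOPE.

```python
def compose_prompt(
    base_prompt: str,
    learned_rules: str,
    heading: str = "## Strategic Guidelines (Learned):",
) -> str:
    """Append learned rules to the base prompt without duplicating headings."""
    rules = (learned_rules or "").strip()
    if not rules:
        return base_prompt
    # If the rules text already opens with a Markdown heading, keep it as-is.
    if rules.startswith("#"):
        return f"{base_prompt.rstrip()}\n\n{rules}\n"
    return f"{base_prompt.rstrip()}\n\n{heading}\n{rules}\n"


rules_with_heading = (
    "## Strategic Guidelines (Learned Best Practices):\n"
    "- Convert all extracted emails to lowercase to maintain consistency."
)
prompt = compose_prompt("You are an information extraction specialist.", rules_with_heading)
print(prompt.count("Strategic Guidelines"))  # 1
```

Whether the duplicate heading affects model behavior is not something this notebook measures; removing it mainly keeps the evolved prompt easier to read.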

### Key Insights

**Look for visible improvements:**
- ✅ **Email normalization** (Tasks 1 & 2): `SUPPORT@COMPANY.COM` → `support@company.com`, brackets removed
- ✅ **Date standardization** (Tasks 3 & 4): `January 1st 2025` → `2025-01-01`, consistent YYYY-MM-DD
- ✅ **Phone cleaning** (Task 5): `(555)123.4567` → consistent format, cleaned up
- ✅ **Consistency across similar tasks**: Same instruction = same formatting style

**Metric validation:**
- **Fewer learning events** = Prompt is already better optimized
- **Visible output improvements** = Rules are actually working

If outputs look identical, the tasks may need adjustment to better demonstrate learning!
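
Comparing outputs by eye works for five tasks, but a programmatic check scales better. The sketch below applies two simple format checks to outputs recorded earlier in this notebook. The regular expressions and function names are illustrative assumptions, not part of SCOPE, and the email pattern is intentionally loose.

```python
import re

# Loose patterns, good enough for the sample data in this tutorial.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def emails_are_lowercase(output: str) -> bool:
    """True when at least one email is found and all of them are lowercase."""
    emails = EMAIL_RE.findall(output)
    return bool(emails) and all(e == e.lower() for e in emails)


def iso_date_count(output: str) -> int:
    """Count dates already in YYYY-MM-DD form."""
    return len(ISO_DATE_RE.findall(output))


# Task 1 outputs from the first and second runs above.
first_run = "- SUPPORT@COMPANY.COM\n- sales@test.org\n- INFO@HELP.NET"
second_run = "support@company.com  \nsales@test.org  \ninfo@help.net"
print(emails_are_lowercase(first_run))   # False
print(emails_are_lowercase(second_run))  # True

# Task 3 outputs from the first and second runs above.
print(iso_date_count("12/25/2024, January 1st 2025, 2025-02-14"))  # 1
print(iso_date_count("2024-12-25, 2025-01-01, 2025-02-14"))        # 3
```

Checks like these give a second metric alongside the learning-event count, one that looks directly at output format.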

## Key Takeaways

1. **Automatic Optimization**: SCOPE improves prompts without manual engineering
2. **Observable Learning**: You can see what SCOPE learns in real-time
3. **Measurable Results**: Fewer learning events = better prompts
4. **Persistent Memory**: Rules are saved and reused across runs
5. **LangChain Integration**: Works seamlessly with existing LangChain code

## Try It Yourself!

Experiment with:
- Different extraction tasks
- More iterations (run 10-15 tasks)
- Other domains (classification, summarization, etc.)
- Different models

The more tasks SCOPE observes, the more patterns it has to learn from!
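
As a starting point for your own experiments, the sketch below defines two extra tasks in the same `instruction`/`text` shape used by `EXTRACTION_TASKS`, and checks that at least one instruction is repeated, mirroring the paired-task design of this tutorial. The URL task data and the helper names are made up for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical extra tasks: same instruction twice, different messy data.
EXTRA_TASKS: List[Dict[str, str]] = [
    {
        "instruction": "Extract all URLs in a clean list",
        "text": "Docs: HTTPS://Example.com/Guide, blog at http://example.org/post?id=1",
    },
    {
        "instruction": "Extract all URLs in a clean list",
        "text": "Mirror: <https://EXAMPLE.net/files>, status page example.com/status",
    },
]


def validate_tasks(tasks: List[Dict[str, str]]) -> int:
    """Check each task has the required keys; return the number of tasks."""
    for index, task in enumerate(tasks, 1):
        missing = {"instruction", "text"} - task.keys()
        if missing:
            raise ValueError(f"Task {index} is missing keys: {sorted(missing)}")
    return len(tasks)


def repeated_instructions(tasks: List[Dict[str, str]]) -> List[str]:
    """Return instructions that occur more than once, in first-seen order."""
    counts = Counter(task["instruction"] for task in tasks)
    return [instruction for instruction, n in counts.items() if n > 1]


print(validate_tasks(EXTRA_TASKS))         # 2
print(repeated_instructions(EXTRA_TASKS))  # ['Extract all URLs in a clean list']
```

Tasks validated this way can be passed to `extract_with_scope` exactly like the original five.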

## Next Steps

Ready to see more? Try these demos:

### 1. Simple Demo (Command Line)
Run the extraction demo from your terminal:
```bash
# Single run (~2 min)
python simple_demo.py

# See learning over 10 iterations (~20 min)
python simple_compare.py --iterations 10
```

### 2. Research Assistant (Production Example)
See SCOPE optimize a real multi-agent research system:
```bash
# Full research assistant with 5 learning agents
python main.py

# Compare learning over 10 iterations (~50 min)
python compare_scope_impact.py --iterations 10
```

**What's different in the research assistant?**
- 🎯 **5 agents learning**: Questions, web search, Wikipedia, writing, coordination
- 🎓 **Source quality assessment**: Academic vs blog detection (0-10 scoring)
- ⚡ **Thoroughness mode**: 7-dimension analysis for higher quality rules
- 📈 **Proven results**: +31% quality improvement in 5 iterations

### 3. Documentation
- **Architecture**: `docs/SCOPE_ARCHITECTURE.md` - See the 5-agent learning pipeline
- **Implementation**: `docs/IMPLEMENTATION_GUIDE.md` - Complete usage guide
- **Source Quality**: `docs/SOURCE_QUALITY_LEARNING.md` - How academic detection works

### 4. Custom Integration
Add SCOPE to your own LangChain applications following the pattern shown in this notebook!