# Demo 4: The Evolving Dev Team

## Concept: Self-Evolving Agent Prompts

In this advanced demo, we build a team of AI developers that **evolve their own system prompts** to become better programmers. This is meta-learning: agents that improve their own "personality" and strategies through competition and self-reflection.

### What We'll See

1. Two developer agents with different philosophies compete to solve coding problems
2. A Tech Lead evaluates solutions by actually running tests
3. After feedback, each agent rewrites their own system prompt
4. In Round 2, we see if their evolved prompts lead to better performance

### The Key Insight

The agents don't just solve problems; they also reflect on their performance and rewrite their own instructions. We'll observe:
- Self-reflection capabilities of LLMs
- Prompt optimization through competition
- Emergent strategies we didn't explicitly program

---

### Related Papers

- **Self-Refine**: Iterative Refinement with Self-Feedback  
  [arXiv:2303.17651](https://arxiv.org/abs/2303.17651)

- **OPRO**: Large Language Models as Optimizers  
  [arXiv:2309.03409](https://arxiv.org/abs/2309.03409)

- **Constitutional AI**: Harmlessness from AI Feedback  
  [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)

## Setup

You have three options for the LLM provider:

### Option 1: OpenAI (Recommended)
1. Get an API key from [platform.openai.com](https://platform.openai.com)
2. Add secret `OPENAI_API_KEY` in Colab Secrets (key icon in sidebar)

### Option 2: Google Gemini (FREE)
1. Get a free API key from [Google AI Studio](https://aistudio.google.com/apikey)
2. Add secret `GEMINI_API_KEY` in Colab Secrets
3. In the next cell, comment out the OpenAI section and uncomment the Gemini section

### Option 3: Groq (FREE - Very Fast)
1. Get a free API key from [console.groq.com](https://console.groq.com)
2. Add secret `GROQ_API_KEY` in Colab Secrets
3. In the next cell, comment out the OpenAI section and uncomment the Groq section
4. Change the model name in the API calls from `gpt-4o-mini` to `openai/gpt-oss-20b`, as noted in the setup cell
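
Because the model name is hard-coded in several API calls later in the notebook, switching providers means editing each one. One way to avoid that, sketched here as an assumption rather than something the notebook does, is to keep the per-provider settings in one place. The `PROVIDERS` table, `provider_settings` helper, and `MODEL` constant are hypothetical names; the model strings, secret names, and Groq base URL are the ones used in the setup cell.

```python
# Hypothetical helper: one table recording what each provider needs.
# Values are copied from the setup cell; the structure itself is an assumption.
PROVIDERS = {
    "openai": {"model": "gpt-4o-mini", "secret": "OPENAI_API_KEY", "base_url": None},
    "gemini": {"model": "gemini-1.5-flash", "secret": "GEMINI_API_KEY", "base_url": None},
    "groq": {
        "model": "openai/gpt-oss-20b",
        "secret": "GROQ_API_KEY",
        "base_url": "https://api.groq.com/openai/v1",
    },
}


def provider_settings(name: str) -> dict:
    """Return the settings for a provider, failing loudly on a typo."""
    try:
        return PROVIDERS[name.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown provider {name!r}; choose from {sorted(PROVIDERS)}"
        ) from None


settings = provider_settings("groq")
MODEL = settings["model"]  # could be passed as model=MODEL in later API calls
print(MODEL)
```

With a constant like `MODEL`, the later calls could pass `model=MODEL` instead of a literal string, so changing providers would touch only this one cell.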

In [None]:
# ============================================================
# OPTION 1: OpenAI (Recommended - requires API key with credits)
# ============================================================
!pip install openai -q

from google.colab import userdata
from openai import OpenAI
import time
import traceback
from typing import Callable
from functools import wraps

client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

print("Setup complete! Using OpenAI")

# ============================================================
# OPTION 2: Google Gemini (FREE - uncomment below, comment above)
# ============================================================
# !pip install google-generativeai -q
#
# from google.colab import userdata
# import google.generativeai as genai
# import time
# import traceback
# from typing import Callable
# from functools import wraps
#
# genai.configure(api_key=userdata.get('GEMINI_API_KEY'))
#
# # Wrapper class to make Gemini API compatible with OpenAI-style calls
# class GeminiClient:
#     def __init__(self):
#         self._model = genai.GenerativeModel('gemini-1.5-flash')
#
#     class _Completions:
#         def __init__(self, model):
#             self._model = model
#
#         def create(self, model=None, messages=None, temperature=0.7, **kwargs):
#             prompt_parts = []
#             for msg in messages:
#                 role = msg.get('role', 'user')
#                 content = msg.get('content', '')
#                 if role == 'system':
#                     prompt_parts.append(f"Instructions: {content}")
#                 else:
#                     prompt_parts.append(content)
#
#             prompt = "\n\n".join(prompt_parts)
#
#             response = self._model.generate_content(
#                 prompt,
#                 generation_config=genai.GenerationConfig(temperature=temperature)
#             )
#
#             class Message:
#                 def __init__(self, text):
#                     self.content = text
#
#             class Choice:
#                 def __init__(self, text):
#                     self.message = Message(text)
#
#             class Response:
#                 def __init__(self, text):
#                     self.choices = [Choice(text)]
#
#             return Response(response.text)
#
#     @property
#     def chat(self):
#         class Chat:
#             def __init__(chat_self):
#                 chat_self.completions = GeminiClient._Completions(self._model)
#         return Chat()
#
# client = GeminiClient()
#
# print("Setup complete! Using Google Gemini (FREE)")

# ============================================================
# OPTION 3: Groq (FREE - very fast, uncomment below, comment above)
# ============================================================
# !pip install openai -q
#
# from google.colab import userdata
# from openai import OpenAI
# import time
# import traceback
# from typing import Callable
# from functools import wraps
#
# client = OpenAI(
#     api_key=userdata.get('GROQ_API_KEY'),
#     base_url="https://api.groq.com/openai/v1"
# )
#
# # IMPORTANT: When using Groq, change the model in API calls from
# # "gpt-4o-mini" to "openai/gpt-oss-20b"
#
# print("Setup complete! Using Groq (FREE)")

## The Coding Problems

We define two problems. The first ships with explicit test cases; the second (a decorator) is checked by custom test logic inside the Tech Lead. In both cases the agents' solutions are actually executed to verify correctness.
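
For orientation, here is a hand-written baseline for the first problem. It is one possible solution, not the answer the agents are expected to produce, and it assumes that two equal numbers at different positions (for example `[2, 2]` with target 4) count as a valid pair, which the problem statement leaves open.

```python
def find_pairs(numbers: list[int], target: int) -> list[tuple[int, int]]:
    """Return the unique pairs (smaller, larger) from `numbers` that sum to `target`."""
    seen: set[int] = set()                # values visited so far
    pairs: set[tuple[int, int]] = set()   # a set removes duplicate pairs for us
    for n in numbers:
        complement = target - n
        if complement in seen:
            # Store the smaller value first, as the requirements ask
            pairs.add((min(n, complement), max(n, complement)))
        seen.add(n)
    return sorted(pairs)


print(find_pairs([1, 2, 3, 4, 5], 6))  # [(1, 5), (2, 4)]
print(find_pairs([1, 1, 2, 3], 4))     # [(1, 3)]
```

This single-pass approach runs in O(n) time for the scan plus the final sort of the (usually small) result.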

In [None]:
# Problem 1: Find pairs that sum to a target
PROBLEM_1 = {
    "name": "find_pairs",
    "description": """Write a Python function called `find_pairs` that finds all unique pairs of numbers in a list that sum to a target value.

Function signature:
def find_pairs(numbers: list[int], target: int) -> list[tuple[int, int]]

Requirements:
- Return a list of tuples, each containing two numbers that sum to target
- Each pair should only appear once (no duplicates)
- The smaller number should come first in each tuple
- Handle edge cases: empty list, no valid pairs, duplicate numbers

Examples:
- find_pairs([1, 2, 3, 4, 5], 6) -> [(1, 5), (2, 4)]
- find_pairs([1, 1, 2, 3], 4) -> [(1, 3)]
- find_pairs([], 5) -> []
""",
    "test_cases": [
        {
            "input": {"numbers": [1, 2, 3, 4, 5], "target": 6},
            "expected": [(1, 5), (2, 4)],
            "name": "basic case"
        },
        {
            "input": {"numbers": [1, 1, 2, 3], "target": 4},
            "expected": [(1, 3)],
            "name": "with duplicates"
        },
        {
            "input": {"numbers": [], "target": 5},
            "expected": [],
            "name": "empty list"
        },
        {
            "input": {"numbers": [1, 2, 3], "target": 10},
            "expected": [],
            "name": "no valid pairs"
        },
    ]
}

# Problem 2: Retry decorator
PROBLEM_2 = {
    "name": "retry",
    "description": """Write a Python decorator called `retry` that retries a function when it raises an exception.

Function signature:
def retry(max_attempts: int = 3, delay: float = 0.1)

Requirements:
- The decorator should retry the function up to max_attempts times
- It should wait delay seconds between each retry
- If the function succeeds, return its result immediately
- If all attempts fail, raise the last exception
- The decorator should work with functions that have any arguments

Example usage:
@retry(max_attempts=3, delay=0.1)
def flaky_function():
    # might fail sometimes
    pass
""",
    "test_cases": []  # We'll use custom test logic for the decorator
}

print("Problems defined!")
print(f"Problem 1: {PROBLEM_1['name']}")
print(f"Problem 2: {PROBLEM_2['name']}")

## The Developer Agent

Each developer has:
- A name and initial system prompt (their "personality")
- The ability to solve problems
- The ability to evolve their own prompt based on feedback
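
Models often wrap code in markdown fences even when asked not to, so the solving step has to strip them before the code can be executed. The sketch below shows one regex-based way to do that; `extract_code` and `FENCE` are hypothetical names, and this is an alternative illustration rather than the exact logic used in the class that follows.

```python
import re

# The fence marker is built programmatically so this example can itself
# live inside a fenced block.
FENCE = "`" * 3

# Opening fence, optional language tag, newline, then a lazy capture
# up to the next closing fence.
_BLOCK = re.compile(
    re.escape(FENCE) + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(reply: str) -> str:
    """Return the first fenced block in `reply`, or the whole reply if none."""
    match = _BLOCK.search(reply)
    return (match.group(1) if match else reply).strip()


reply = "Here you go:\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE + "\nDone."
print(extract_code(reply))
```

Matching any language tag (not only `python`) avoids leaving a stray tag such as `py` at the top of the extracted code.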

In [None]:
class Developer:
    """
    An AI developer agent with an evolvable system prompt.
    """

    def __init__(self, name: str, initial_prompt: str):
        self.name = name
        self.system_prompt = initial_prompt
        self.prompt_history = [initial_prompt]
        self.score_history = []
        self.solutions = []

    def solve(self, problem: dict) -> str:
        """
        Generate a solution for the given problem using current system prompt.

        Returns the code as a string.
        """
        user_message = f"""Solve this programming problem:

{problem['description']}

Return ONLY the Python code, no explanations or markdown."""

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message}
            ],
            temperature=0.7
        )

        code = response.choices[0].message.content

        # Clean up code block markers if present
        if "```python" in code:
            code = code.split("```python")[1].split("```")[0]
        elif "```" in code:
            code = code.split("```")[1].split("```")[0]

        code = code.strip()
        self.solutions.append(code)
        return code

    def evolve_prompt(self, feedback: str, my_score: int,
                      opponent_score: int, winning_code: str = None):
        """
        Rewrite own system prompt based on feedback.
        This is the self-evolution mechanism!
        """
        evolution_message = f"""You are an AI developer agent reflecting on your performance.

Your current system prompt is:
---
{self.system_prompt}
---

In the last coding challenge:
- Your score: {my_score}/100
- Opponent's score: {opponent_score}/100
- Tech Lead feedback: {feedback}
"""

        if winning_code and opponent_score > my_score:
            evolution_message += f"""\n- The winning solution was:
```python
{winning_code}
```
"""

        evolution_message += """
Based on this feedback, rewrite your system prompt to perform better in future coding challenges.

Guidelines:
- Focus on specific, actionable improvements
- Learn from what worked in the winning solution (if you lost)
- Keep the prompt concise (under 150 words)
- Maintain your core identity but address weaknesses

Return ONLY the new system prompt text, nothing else."""

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": evolution_message}],
            temperature=0.7
        )

        new_prompt = response.choices[0].message.content.strip()

        # Remove quotes if the model wrapped the prompt in them
        if new_prompt.startswith('"') and new_prompt.endswith('"'):
            new_prompt = new_prompt[1:-1]

        self.system_prompt = new_prompt
        self.prompt_history.append(new_prompt)

        return new_prompt

print("Developer class defined!")

## The Tech Lead (Evaluator)

The Tech Lead:
- Executes submitted code against test cases
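- Scores solutions out of 100: correctness from the test pass rate (40), plus LLM-judged performance (20), readability (20), Pythonic style (10), and edge-case handling (10)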
- Provides actionable feedback
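
The core of "executing submitted code" is small: load the code string into a fresh namespace with `exec`, look up the expected function, and call it on each test case. The sketch below is a simplified, generic version of that idea; `run_cases` and the `double` submission are hypothetical and exist only for illustration. Note that `exec` runs arbitrary code, so this is only reasonable in a disposable environment such as a Colab session.

```python
def run_cases(code: str, func_name: str, cases: list[dict]) -> dict:
    """Execute `code` in a fresh namespace and call `func_name` on each case."""
    results = {"passed": 0, "failed": 0, "details": []}
    namespace: dict = {}
    try:
        # WARNING: runs untrusted code; acceptable only in a throwaway sandbox
        exec(code, namespace)
    except Exception as exc:
        results["details"].append(f"ERROR: could not load code: {exc}")
        return results

    func = namespace.get(func_name)
    if not callable(func):
        results["details"].append(f"ERROR: {func_name!r} is not defined")
        return results

    for case in cases:
        try:
            ok = func(**case["input"]) == case["expected"]
        except Exception as exc:
            ok = False
            results["details"].append(f"ERROR: {case['name']}: {exc}")
        results["passed" if ok else "failed"] += 1
    return results


submission = "def double(x):\n    return x * 2\n"
cases = [
    {"name": "positive", "input": {"x": 2}, "expected": 4},
    {"name": "zero", "input": {"x": 0}, "expected": 0},
    {"name": "wrong on purpose", "input": {"x": 3}, "expected": 7},
]
report = run_cases(submission, "double", cases)
print(report["passed"], report["failed"])  # 2 1
```

Using a fresh dictionary as the namespace keeps each submission from seeing or overwriting names defined by the notebook itself.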

In [None]:
class TechLead:
    """
    Evaluates developer solutions by running tests and providing scores.
    """

    def __init__(self):
        self.criteria = [
            ("correctness", 40),
            ("performance", 20),
            ("readability", 20),
            ("pythonic", 10),
            ("edge_cases", 10)
        ]

    def run_tests_problem1(self, code: str, test_cases: list) -> dict:
        """
        Execute code for Problem 1 (find_pairs) and run test cases.
        """
        results = {
            "passed": 0,
            "failed": 0,
            "errors": [],
            "details": []
        }

        # Create execution namespace
        namespace = {}

        try:
            exec(code, namespace)
        except Exception as e:
            results["errors"].append(f"Code compilation error: {e}")
            return results

        if "find_pairs" not in namespace:
            results["errors"].append("Function 'find_pairs' not found")
            return results

        func = namespace["find_pairs"]

        for tc in test_cases:
            try:
                result = func(**tc["input"])
                # Normalize result for comparison (sort tuples and list)
                result_normalized = sorted([tuple(sorted(p)) for p in result])
                expected_normalized = sorted([tuple(sorted(p)) for p in tc["expected"]])

                if result_normalized == expected_normalized:
                    results["passed"] += 1
                    results["details"].append(f"PASS: {tc['name']}")
                else:
                    results["failed"] += 1
                    results["details"].append(
                        f"FAIL: {tc['name']} - Expected {tc['expected']}, got {result}"
                    )
            except Exception as e:
                results["failed"] += 1
                results["errors"].append(f"Runtime error in {tc['name']}: {e}")

        return results

    def run_tests_problem2(self, code: str) -> dict:
        """
        Execute code for Problem 2 (retry decorator) and run test cases.
        """
        results = {
            "passed": 0,
            "failed": 0,
            "errors": [],
            "details": []
        }

        # Create execution namespace with `time` and `wraps` pre-imported
        namespace = {"time": time, "wraps": wraps}

        try:
            exec(code, namespace)
        except Exception as e:
            results["errors"].append(f"Code compilation error: {e}")
            return results

        if "retry" not in namespace:
            results["errors"].append("Decorator 'retry' not found")
            return results

        retry_decorator = namespace["retry"]

        # Test 1: Function that succeeds immediately
        try:
            @retry_decorator(max_attempts=3, delay=0.01)
            def always_works():
                return "success"

            if always_works() == "success":
                results["passed"] += 1
                results["details"].append("PASS: Function that succeeds immediately")
            else:
                results["failed"] += 1
                results["details"].append("FAIL: Wrong return value for successful function")
        except Exception as e:
            results["failed"] += 1
            results["errors"].append(f"Test 1 error: {e}")

        # Test 2: Function that fails then succeeds
        try:
            call_count = [0]

            @retry_decorator(max_attempts=3, delay=0.01)
            def fails_twice():
                call_count[0] += 1
                if call_count[0] < 3:
                    raise ValueError("Not yet!")
                return "finally!"

            result = fails_twice()
            if result == "finally!" and call_count[0] == 3:
                results["passed"] += 1
                results["details"].append("PASS: Function that fails then succeeds")
            else:
                results["failed"] += 1
                results["details"].append(f"FAIL: Expected 3 calls, got {call_count[0]}")
        except Exception as e:
            results["failed"] += 1
            results["errors"].append(f"Test 2 error: {e}")

        # Test 3: Function that always fails
        try:
            @retry_decorator(max_attempts=2, delay=0.01)
            def always_fails():
                raise RuntimeError("I always fail!")

            try:
                always_fails()
                results["failed"] += 1
                results["details"].append("FAIL: Should have raised exception")
            except RuntimeError:
                results["passed"] += 1
                results["details"].append("PASS: Raises exception after max attempts")
        except Exception as e:
            results["failed"] += 1
            results["errors"].append(f"Test 3 error: {e}")

        # Test 4: Preserves function arguments
        try:
            @retry_decorator(max_attempts=2, delay=0.01)
            def add(a, b):
                return a + b

            if add(2, 3) == 5:
                results["passed"] += 1
                results["details"].append("PASS: Preserves function arguments")
            else:
                results["failed"] += 1
                results["details"].append("FAIL: Wrong result with arguments")
        except Exception as e:
            results["failed"] += 1
            results["errors"].append(f"Test 4 error: {e}")

        return results

    def evaluate(self, code: str, problem: dict) -> dict:
        """
        Full evaluation: run tests and score based on multiple criteria.
        """
        # Run appropriate tests
        if problem["name"] == "find_pairs":
            test_results = self.run_tests_problem1(code, problem["test_cases"])
            total_tests = len(problem["test_cases"])
        else:  # retry decorator
            test_results = self.run_tests_problem2(code)
            total_tests = 4

        # Calculate scores
        scores = {}

        # Correctness (40 pts) - based on test pass rate
        if total_tests > 0:
            pass_rate = test_results["passed"] / total_tests
            scores["correctness"] = int(40 * pass_rate)
        else:
            scores["correctness"] = 0

        # Use LLM to evaluate other criteria
        style_prompt = f"""Evaluate this Python code on a scale of 0-10 for each criterion.

Code:
```python
{code}
```

Criteria:
1. Performance (time/space complexity, efficient algorithms)
2. Readability (clear variable names, good structure, comments if needed)
3. Pythonic (uses Python idioms, list comprehensions where appropriate, type hints)
4. Edge Cases (handles empty inputs, None values, boundary conditions)

Return ONLY four numbers separated by commas, nothing else.
Example: 8,7,6,9"""

        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": style_prompt}],
                temperature=0
            )

            style_scores = response.choices[0].message.content.strip()
            parts = [int(x.strip()) for x in style_scores.split(",")]

            scores["performance"] = min(20, parts[0] * 2)
            scores["readability"] = min(20, parts[1] * 2)
            scores["pythonic"] = min(10, parts[2])
            scores["edge_cases"] = min(10, parts[3])
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            # Fallback scores if LLM evaluation fails
            scores["performance"] = 10
            scores["readability"] = 10
            scores["pythonic"] = 5
            scores["edge_cases"] = 5

        total_score = sum(scores.values())

        return {
            "scores": scores,
            "total": total_score,
            "test_results": test_results
        }

    def generate_feedback(self, dev_name: str, evaluation: dict, code: str) -> str:
        """
        Generate actionable feedback for a developer.
        """
        prompt = f"""As a Tech Lead, provide brief, actionable feedback for {dev_name}.

Their code:
```python
{code}
```

Scores:
- Correctness: {evaluation['scores']['correctness']}/40
- Performance: {evaluation['scores']['performance']}/20
- Readability: {evaluation['scores']['readability']}/20
- Pythonic: {evaluation['scores']['pythonic']}/10
- Edge Cases: {evaluation['scores']['edge_cases']}/10
- Total: {evaluation['total']}/100

Test results: {evaluation['test_results']['passed']} passed, {evaluation['test_results']['failed']} failed
Errors: {evaluation['test_results']['errors']}

Give 2-3 specific suggestions for improvement. Be direct and technical.
Keep response under 100 words."""

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5
        )

        return response.choices[0].message.content.strip()

print("TechLead class defined!")

## The Arena

The DevTeamArena orchestrates the competition between developers, including evaluation and the evolution phase.

In [None]:
class DevTeamArena:
    """
    Orchestrates the competition between developer agents.
    """

    def __init__(self):
        # Initialize developers with distinct personalities
        # Named after computing pioneers: Ada Lovelace & Alan Turing
        self.ada = Developer(
            name="Ada",
            initial_prompt="You are a Python developer who values performance and efficiency above all. Write fast, optimized code."
        )

        self.turing = Developer(
            name="Turing",
            initial_prompt="You are a Python developer who values clean, readable code. Write elegant, maintainable solutions."
        )

        self.tech_lead = TechLead()
        self.problems = [PROBLEM_1, PROBLEM_2]
        self.round_results = []

    def display_solution(self, dev: Developer, code: str, evaluation: dict):
        """Pretty print a solution and its evaluation."""
        print(f"\n{dev.name.upper()}'s Solution:")
        print("-" * 60)
        print(code)
        print("-" * 60)
        print(f"Tests: {evaluation['test_results']['passed']} passed, "
              f"{evaluation['test_results']['failed']} failed")
        if evaluation['test_results']['errors']:
            print(f"Errors: {evaluation['test_results']['errors'][:2]}")
        print(f"\nScore Breakdown:")
        for criterion, score in evaluation['scores'].items():
            max_score = dict(self.tech_lead.criteria)[criterion]  # max points as defined in TechLead
            print(f"  {criterion}: {score}/{max_score}")
        print(f"  TOTAL: {evaluation['total']}/100")

    def run_round(self, round_num: int):
        """Execute one round of competition."""
        problem = self.problems[round_num - 1]

        print("\n" + "=" * 70)
        print(f"ROUND {round_num}: {'BASELINE' if round_num == 1 else 'EVOLVED'}")
        print("=" * 70)
        print(f"\nProblem: {problem['name']}")
        print("-" * 70)

        # Get solutions
        print("\nGenerating solutions...")
        ada_code = self.ada.solve(problem)
        turing_code = self.turing.solve(problem)

        # Evaluate solutions
        print("Evaluating solutions...")
        ada_eval = self.tech_lead.evaluate(ada_code, problem)
        turing_eval = self.tech_lead.evaluate(turing_code, problem)

        # Display results
        self.display_solution(self.ada, ada_code, ada_eval)
        self.display_solution(self.turing, turing_code, turing_eval)

        # Determine winner
        if ada_eval['total'] > turing_eval['total']:
            winner, winner_code = self.ada, ada_code
        elif turing_eval['total'] > ada_eval['total']:
            winner, winner_code = self.turing, turing_code
        else:
            # Tie: no winner, so neither agent is shown the other's code
            winner, winner_code = None, None

        print("\n" + "-" * 70)
        if winner:
            print(f"WINNER: {winner.name} ({ada_eval['total']} vs {turing_eval['total']})")
        else:
            print(f"TIE! ({ada_eval['total']} vs {turing_eval['total']})")

        # Generate feedback
        print("\nTech Lead Feedback:")
        ada_feedback = self.tech_lead.generate_feedback("Ada", ada_eval, ada_code)
        turing_feedback = self.tech_lead.generate_feedback("Turing", turing_eval, turing_code)

        print(f"\nTo Ada: {ada_feedback}")
        print(f"\nTo Turing: {turing_feedback}")

        # Store results
        self.round_results.append({
            "round": round_num,
            "ada": {"code": ada_code, "eval": ada_eval, "feedback": ada_feedback},
            "turing": {"code": turing_code, "eval": turing_eval, "feedback": turing_feedback},
            "winner": winner.name if winner else "Tie"
        })

        # Record scores
        self.ada.score_history.append(ada_eval['total'])
        self.turing.score_history.append(turing_eval['total'])

        return {
            "ada": {"eval": ada_eval, "feedback": ada_feedback, "code": ada_code},
            "turing": {"eval": turing_eval, "feedback": turing_feedback, "code": turing_code},
            "winner_code": winner_code
        }

    def run_evolution(self, round_results: dict):
        """Have both agents evolve their prompts based on feedback."""
        print("\n" + "=" * 70)
        print("PROMPT EVOLUTION PHASE")
        print("=" * 70)

        print("\nAgents are reflecting on their performance...")

        # Ada evolves
        ada_new = self.ada.evolve_prompt(
            feedback=round_results["ada"]["feedback"],
            my_score=round_results["ada"]["eval"]["total"],
            opponent_score=round_results["turing"]["eval"]["total"],
            winning_code=round_results["winner_code"] if round_results["turing"]["eval"]["total"] > round_results["ada"]["eval"]["total"] else None
        )

        # Turing evolves
        turing_new = self.turing.evolve_prompt(
            feedback=round_results["turing"]["feedback"],
            my_score=round_results["turing"]["eval"]["total"],
            opponent_score=round_results["ada"]["eval"]["total"],
            winning_code=round_results["winner_code"] if round_results["ada"]["eval"]["total"] > round_results["turing"]["eval"]["total"] else None
        )

        print("\n" + "-" * 70)
        print("ADA's Prompt Evolution:")
        print("-" * 70)
        print(f"OLD: {self.ada.prompt_history[-2]}")  # prompt in use before this evolution step
        print(f"\nNEW: {ada_new}")

        print("\n" + "-" * 70)
        print("TURING's Prompt Evolution:")
        print("-" * 70)
        print(f"OLD: {self.turing.prompt_history[-2]}")  # prompt in use before this evolution step
        print(f"\nNEW: {turing_new}")

    def show_summary(self):
        """Display the evolution summary."""
        print("\n" + "=" * 70)
        print("EVOLUTION SUMMARY")
        print("=" * 70)

        print("\nScore Progression:")
        print("-" * 54)  # 54 = total width of the four columns below
        print(f"{'':15} {'Round 1':>12} {'Round 2':>12} {'Change':>12}")
        print("-" * 54)

        ada_change = self.ada.score_history[1] - self.ada.score_history[0] if len(self.ada.score_history) > 1 else 0
        turing_change = self.turing.score_history[1] - self.turing.score_history[0] if len(self.turing.score_history) > 1 else 0

        ada_change_str = f"+{ada_change}" if ada_change > 0 else str(ada_change)
        turing_change_str = f"+{turing_change}" if turing_change > 0 else str(turing_change)

        print(f"{'Ada':15} {self.ada.score_history[0]:>12} {self.ada.score_history[1] if len(self.ada.score_history) > 1 else 'N/A':>12} {ada_change_str:>12}")
        print(f"{'Turing':15} {self.turing.score_history[0]:>12} {self.turing.score_history[1] if len(self.turing.score_history) > 1 else 'N/A':>12} {turing_change_str:>12}")

        print("\n" + "-" * 70)
        print("Key Observations:")

        if ada_change > 0 and turing_change > 0:
            print("- Both agents improved after self-reflection!")
        elif ada_change > turing_change:
            print("- Ada showed more improvement after evolution")
        elif turing_change > ada_change:
            print("- Turing showed more improvement after evolution")

        bigger_improver = "Ada" if ada_change > turing_change else "Turing"
        if abs(ada_change - turing_change) > 5:
            print(f"- {bigger_improver} learned more from the feedback")

        print("\nPrompt Evolution Highlights:")
        print(f"- Ada went from {len(self.ada.prompt_history[0])} to {len(self.ada.prompt_history[-1])} chars")
        print(f"- Turing went from {len(self.turing.prompt_history[0])} to {len(self.turing.prompt_history[-1])} chars")

print("DevTeamArena class defined!")

## Round 1: Baseline Performance

Let's run the first round with the agents' initial prompts.

In [None]:
# Create the arena
arena = DevTeamArena()

# Run Round 1
round1_results = arena.run_round(1)

## Evolution Phase

Now each agent reflects on their performance and rewrites their system prompt. This is the key innovation: **the agents modify their own instructions** based on what they learned.

In [None]:
# Run the evolution phase
arena.run_evolution(round1_results)

## Round 2: Evolved Performance

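Now let's test the evolved prompts on a new problem. Did the self-reflection help? Because Round 2 uses a different problem (`retry`) than Round 1 (`find_pairs`), any score change reflects both the new prompt and the new problem's difficulty, so treat the comparison as suggestive rather than conclusive.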
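
As a point of comparison for whatever the agents produce in this round, here is a hand-written baseline for the `retry` problem. It is one possible solution under the stated requirements and assumes `max_attempts` is at least 1; the `flaky` function and `calls` list are only there to exercise it.

```python
import time
from functools import wraps


def retry(max_attempts: int = 3, delay: float = 0.1):
    """Decorator factory: retry the wrapped function when it raises."""
    def decorator(func):
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)  # success: return immediately
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: re-raise the last exception
                    time.sleep(delay)  # wait before the next try
        return wrapper
    return decorator


calls = []


@retry(max_attempts=3, delay=0.01)
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("not yet")
    return "ok"


print(flaky(), len(calls))  # ok 3
```

Sleeping only between attempts, not after the final failure, keeps the total delay at `(max_attempts - 1) * delay` in the worst case.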

In [None]:
# Run Round 2 with evolved prompts
round2_results = arena.run_round(2)

## Final Summary

Let's compare how the agents performed before and after evolution.

In [None]:
# Show the evolution summary
arena.show_summary()

## Viewing the Full Prompt History

Let's see the complete evolution of each agent's system prompt.

In [None]:
print("COMPLETE PROMPT HISTORY")
print("=" * 70)

print("\nADA's Journey:")
print("-" * 70)
for i, prompt in enumerate(arena.ada.prompt_history):
    label = "Initial" if i == 0 else f"After Round {i}"
    score = arena.ada.score_history[i] if i < len(arena.ada.score_history) else "N/A"
    print(f"\n[{label}] (Score: {score}/100)")
    print(prompt)

print("\n" + "=" * 70)
print("\nTURING's Journey:")
print("-" * 70)
for i, prompt in enumerate(arena.turing.prompt_history):
    label = "Initial" if i == 0 else f"After Round {i}"
    score = arena.turing.score_history[i] if i < len(arena.turing.score_history) else "N/A"
    print(f"\n[{label}] (Score: {score}/100)")
    print(prompt)

## Key Takeaways

### What We Observed

1. **Self-Reflection Works**: Agents that analyze their mistakes can improve their own instructions.

2. **Competition Drives Improvement**: The losing agent is shown the winning solution during reflection, so it can borrow from the winner's approach.

3. **Emergent Strategies**: The evolved prompts often contain insights we didn't explicitly teach:
   - "Always check edge cases first"
   - "Balance elegance with correctness"
   - "Add type hints and docstrings"

4. **Convergence**: Despite starting with opposite philosophies, agents often converge toward similar best practices.

### The Meta Pattern

```
Prompt -> Behavior -> Evaluation -> Reflection -> Better Prompt
```

This is the same pattern as:
- Demo 1 (code -> error -> fix)
- Demo 2 (solution -> fitness -> evolution)
- Demo 3 (capability gap -> tool creation)

But applied to the agent's own instructions!
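
The loop can be sketched without any LLM at all. In the toy version below, every piece is a deterministic stand-in: `REQUIRED` is a made-up rubric, `evaluate` scores the prompt directly (standing in for scoring the code a model would write under that prompt), and `reflect` simply appends whatever the feedback says is missing. All names are hypothetical; the point is only the shape of the loop.

```python
# Toy rubric, purely illustrative
REQUIRED = ["handle edge cases", "add type hints", "prefer O(n) algorithms"]


def evaluate(prompt: str) -> tuple[int, str]:
    """Toy evaluator: score = number of rubric items the prompt mentions."""
    missing = [rule for rule in REQUIRED if rule not in prompt]
    score = len(REQUIRED) - len(missing)
    feedback = f"Missing: {missing[0]}" if missing else "Nothing missing"
    return score, feedback


def reflect(prompt: str, feedback: str) -> str:
    """Toy reflection: fold the first missing item into the prompt."""
    prefix = "Missing: "
    if feedback.startswith(prefix):
        return f"{prompt} Always {feedback[len(prefix):]}."
    return prompt


def evolve(prompt: str, rounds: int) -> list[tuple[str, int]]:
    """Run evaluate -> reflect for `rounds` iterations, recording each step."""
    history = []
    for _ in range(rounds):
        score, feedback = evaluate(prompt)
        history.append((prompt, score))
        prompt = reflect(prompt, feedback)
    return history


for prompt, score in evolve("You are a Python developer.", rounds=4):
    print(score, prompt)
```

In the notebook, the evaluator and the reflection step are both LLM calls, which is what makes the outcome non-deterministic and the evolved prompts hard to predict.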

### Real-World Applications

- **OPRO (Google)**: Optimizing prompts for math problems
- **Constitutional AI**: Improving AI behavior through self-critique
- **Automated Prompt Engineering**: Finding optimal prompts for specific tasks

### Limitations

- Requires good evaluation metrics
- May overfit to specific test cases
- Evolution is bounded by the LLM's capabilities
- Need safeguards against "prompt collapse"
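
One simple safeguard, under the assumption that "prompt collapse" includes rewrites that come back empty, truncated, or bloated, is to validate each evolved prompt and keep the previous one if the check fails. The helper below is a hypothetical sketch: the 150-word ceiling mirrors the limit requested in the evolution message, while `accept_evolved_prompt` and the `min_words` floor are illustrative choices.

```python
def accept_evolved_prompt(old: str, new: str,
                          max_words: int = 150, min_words: int = 5) -> str:
    """Return the cleaned new prompt if it passes sanity checks, else `old`."""
    # Strip whitespace and any quotes the model wrapped around the prompt
    candidate = new.strip().strip('"').strip()
    n_words = len(candidate.split())
    if n_words < min_words or n_words > max_words:
        return old  # reject: too short to be useful or over the word budget
    return candidate


old = "You are a Python developer who values clean, readable code."
print(accept_evolved_prompt(old, ""))  # falls back to the old prompt
print(accept_evolved_prompt(old, '"Write tested, typed, readable Python code."'))
```

A length check catches only the crudest failures; semantic drift (a prompt that is well-formed but no longer about coding) would need a stronger test, such as re-running a fixed benchmark before accepting the change.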

---

**Congratulations!** You've completed all four demos of self-improving code.

From simple bug fixing to agents that rewrite their own instructions, we've explored the frontier of adaptive software. The future is code that learns, evolves, and grows.