# DSPy Tutorial: The Evaluator-Optimizer Pattern

## What is DSPy?

DSPy (Declarative Self-improving Python) is a framework for programming with language models (LMs) that allows you to:
- Build modular AI programs using composable modules
- Automatically optimize prompts and few-shot examples
- Evaluate and improve your programs systematically

## What We'll Build

In this tutorial, we'll create a joke-telling AI that:
1. Generates jokes about any topic
2. Learns what makes jokes funny through optimization
3. Gets progressively better at telling jokes

## The Evaluator-Optimizer Pattern

In this tutorial, we'll use the evaluator-optimizer pattern, a workflow that pushes a program's output toward our requirements through iterative refinement. Here's how it works:

1. **Generation**: An LLM performs the task (generating jokes in our case)
2. **Evaluation**: A second LLM evaluates if the result meets our criteria (checking if jokes are funny)
3. **Refinement**: If needed, the process repeats with adjustments until all requirements are met

This pattern is particularly useful for:
- Ensuring consistent quality in generated content
- Incorporating synthetic feedback to improve outputs
- Systematically optimizing prompts and examples
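
The three steps above can be sketched as a plain-Python loop. This is a minimal illustration, not DSPy's implementation: `evaluator_optimizer`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for the LLM calls so the sketch runs offline.

```python
from typing import Callable, Optional, Tuple


def evaluator_optimizer(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate, evaluate, and retry with feedback until accepted or out of rounds."""
    feedback = None
    candidate = ""
    for round_number in range(1, max_rounds + 1):
        candidate = generate(task, feedback)      # 1. Generation
        accepted, feedback = evaluate(candidate)  # 2. Evaluation
        if accepted:
            return candidate, round_number
        # 3. Refinement: loop again, passing the feedback to the generator
    return candidate, max_rounds


# Hypothetical stand-ins for the two LLM calls (assumptions for illustration).
def toy_generate(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return f"A joke about {task}."
    return f"A joke about {task} with a punchline!"


def toy_evaluate(candidate: str) -> Tuple[bool, str]:
    if candidate.endswith("!"):
        return True, "ok"
    return False, "Add a punchline."


best, rounds = evaluator_optimizer(toy_generate, toy_evaluate, "python")
print(best, rounds)
```

In a real program both callables would wrap LM calls, and `max_rounds` caps the cost when the evaluator never accepts a candidate.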

## Tutorial Structure

1. **Setup**: Install DSPy and configure language models
2. **Basic Programs**: Create simple AI programs with signatures
3. **Chain of Thought**: Add reasoning to improve outputs
4. **Modular Programs**: Build reusable components
5. **Evaluation**: Create metrics to measure performance
6. **Optimization**: Automatically improve prompts

Let's get started! 🎉

## 1. Setup and Installation

First, let's install DSPy:

In [1]:
# Install DSPy quietly (-q flag suppresses verbose output)
!pip install dspy -q

### Installation Complete! 

DSPy has been successfully installed. The `-q` flag was used to suppress verbose installation output, keeping our notebook clean.

## 2. Configure Language Models

Now let's import DSPy and set up our language model. We'll use OpenAI's GPT-4o-mini as our primary model:

In [42]:
# Import necessary libraries
import dspy
import os
from dotenv import load_dotenv

# Load environment variables from .env file (contains API keys)
load_dotenv()

# Initialize the OpenAI language model
# - "openai/gpt-4o-mini" specifies the model to use
# - api_key is loaded from environment variable for security
openai_lm = dspy.LM("openai/gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))

# Test the language model with a simple query
openai_lm("Hello")

['Hello! How can I assist you today?']

### Language Model Configured!

The OpenAI language model responded with a greeting! This confirms:
- Your API key is correctly loaded from the .env file
- The connection to OpenAI is working
- DSPy's LM class successfully wraps the OpenAI API

Notice that calling the LM directly returns a list of strings, which leaves room for multiple completions when you request more than one.

## 3. Creating Your First DSPy Program

DSPy programs are built using **signatures** - simple declarations of what goes in and what comes out. Let's create a basic joke generator:

In [45]:
# Configure DSPy to use our language model globally
dspy.configure(lm=openai_lm)

# Create a basic program using a signature string
# Format: 'input -> output'
basic_joke_program = dspy.Predict('topic -> joke')

# Generate a joke about Python
result = basic_joke_program(topic="python")
print(result)

Prediction(
    joke='Why do Python programmers prefer dark mode? Because light attracts bugs!'
)


### Your First DSPy Program Works!

The program generated a programming joke! Notice:
- DSPy automatically created a prompt from the signature `'topic -> joke'`
- The output is wrapped in a `Prediction` object with the field name `joke`
- No manual prompt engineering was needed - DSPy handled it all
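
To make the signature format concrete, here is a small sketch of how a string such as `'topic -> joke'` can be split into input and output field names. `parse_signature` is a hypothetical helper written for illustration only; DSPy's own parser does more, such as keeping the type annotations that this sketch discards.

```python
def parse_signature(signature: str):
    """Split 'a, b -> c' into input and output field names (illustrative only)."""
    inputs_part, outputs_part = signature.split("->")

    def field_names(part: str):
        # Drop optional type annotations such as 'rating: int'.
        return [
            field.split(":")[0].strip()
            for field in part.split(",")
            if field.strip()
        ]

    return field_names(inputs_part), field_names(outputs_part)


inputs, outputs = parse_signature("topic -> joke")
print(inputs, outputs)
```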

### Understanding What Happened

Let's inspect the actual prompt DSPy sent to the language model:

In [46]:
# Inspect the last interaction with the language model
# This shows the system prompt and user message DSPy generated
openai_lm.inspect_history(n=1)

[2025-08-06T17:21:13.190779]

System message:

Your input fields are:
1. `topic` (str):
Your output fields are:
1. `joke` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## topic ## ]]
{topic}

[[ ## joke ## ]]
{joke}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `topic`, produce the fields `joke`.


User message:

[[ ## topic ## ]]
python

Respond with the corresponding output fields, starting with the field `[[ ## joke ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## joke ## ]]
Why do Python programmers prefer dark mode? Because light attracts bugs!

[[ ## completed ## ]]

### Behind the Scenes

DSPy automatically generated a structured prompt! It:
- Created a system message explaining the input/output fields
- Formatted the user's input with special markers `[[ ## topic ## ]]`
- Instructed the model to respond with the output field `[[ ## joke ## ]]`
- Added completion markers to ensure proper parsing

This structured approach makes outputs reliable and easy to parse.
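
The marker format is simple enough to parse with a regular expression. The sketch below illustrates the idea, assuming the response follows the layout shown above; `parse_fields` is a hypothetical helper, not DSPy's actual parser.

```python
import re

# Matches markers such as [[ ## joke ## ]] and captures the field name.
FIELD_MARKER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def parse_fields(response: str) -> dict:
    """Collect the text following each [[ ## name ## ]] marker (illustrative only)."""
    parts = FIELD_MARKER.split(response)
    # parts looks like [prefix, name1, text1, name2, text2, ...]
    fields = {}
    for name, text in zip(parts[1::2], parts[2::2]):
        if name != "completed":  # the completion marker carries no value
            fields[name] = text.strip()
    return fields


response = """[[ ## joke ## ]]
Why do Python programmers prefer dark mode? Because light attracts bugs!

[[ ## completed ## ]]"""

print(parse_fields(response))
```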

## 4. Creating More Detailed Signatures

For better control, we can create signatures with descriptions and custom instructions:

In [49]:
# Define custom instructions for our joke generator
instructions = """Tell a funny joke about the topic"""

# Define input and output fields with descriptions
fields = {
    # Input field with description
    "topic": (str, dspy.InputField(desc="The topic of the joke")),
    
    # Output field with description  
    "joke": (str, dspy.OutputField(desc="The joke that is being told")),
}

# Create a signature programmatically
joke_signature = dspy.make_signature(
    signature_name="Joke",
    instructions=instructions,
    signature=fields
)

# Create a program with our custom signature
detailed_joke_program = dspy.Predict(joke_signature)

# Test it out
output = detailed_joke_program(topic="python")
print(output.joke)

Why do Python programmers prefer dark mode? Because light attracts bugs!


### Custom Signatures Work!

The joke generator now uses our custom instructions and field descriptions. This gives you more control over:
- The instructions sent to the model
- Descriptions for each input/output field
- The overall behavior of your program

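The model generated the same joke as before, even though the prompt now carries our custom instructions. For a short, common topic, models often converge on the same well-known joke.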

## 5. Using Different Language Models

DSPy makes it easy to switch between different language models. Let's try Google's Gemini:

In [50]:
# Initialize Google's Gemini model
gemini_lm = dspy.LM("gemini/gemini-2.0-flash", api_key=os.getenv("GEMINI_API_KEY"))

# Use a context manager to temporarily switch models
# This doesn't change the global configuration
with dspy.context(lm=gemini_lm):
    output = detailed_joke_program(topic="python")
    print(output.joke)

Why did the Python script need therapy?

Because it had too many deeply nested if-else statements and couldn't handle the indentation!


### Model Switching Success!

Gemini generated a different joke from GPT-4o-mini! Key points:
- The `dspy.context()` manager temporarily switches models
- Your original configuration remains unchanged after the context
- This is useful for comparing different models or using specialized models for specific tasks
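
The temporary-override behavior is a general Python pattern. As a rough sketch of the idea (not DSPy's internals), a context manager can set a value on entry and restore the previous one on exit. `use_lm`, `active_lm`, and `_current_lm` are hypothetical names, and plain strings stand in for LM objects.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical global setting standing in for the configured LM.
_current_lm = contextvars.ContextVar("current_lm", default="openai/gpt-4o-mini")


@contextmanager
def use_lm(name: str):
    """Temporarily override the active model name, restoring it on exit."""
    token = _current_lm.set(name)
    try:
        yield
    finally:
        # Runs even if the body raises, so the override never leaks.
        _current_lm.reset(token)


def active_lm() -> str:
    return _current_lm.get()


before = active_lm()
with use_lm("gemini/gemini-2.0-flash"):
    inside = active_lm()
after = active_lm()
print(before, inside, after)
```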

## 6. Chain of Thought Reasoning

DSPy modules can automatically add reasoning steps (or apply other common prompt engineering techniques) to improve output quality. Let's create a joke generator that explains its thinking:

In [51]:
# Create a Chain of Thought program
# This automatically adds a 'reasoning' field before the output
cot_joke_program = dspy.ChainOfThought(joke_signature)

# Switch the global default LM to Gemini (this stays in effect for later cells)
dspy.configure(lm=gemini_lm)

# Generate a joke with reasoning
output = cot_joke_program(topic="python")
print(f"Reasoning: {output.reasoning}")
print(f"Joke: {output.joke}")

Reasoning: The joke should be about a common frustration or misconception related to Python programming. I'll focus on the indentation sensitivity of Python, which is a frequent source of errors for beginners.
Joke: Why do Python programmers get paid so much?

Because they have to be right all the time... about their indentation!


### DSPy Has Other Powerful Reasoning Modules!

DSPy ships several built-in modules, including:
- `Predict` makes a direct prediction with no added reasoning
- `ChainOfThought` adds a reasoning step before the output
- `ReAct` adds tool use to make your program agentic

These modules let you choose how much reasoning and transparency you want from your models. Other modules exist, and you can also write your own.

## 7. Building Modular Programs

DSPy allows you to create reusable modules. Let's build a more sophisticated joke generator:

In [52]:
# Define a signature using class syntax (alternative to make_signature)
class JokeSignature(dspy.Signature):
    """Tell a funny joke about the topic"""
    topic: str = dspy.InputField(desc="The topic of the joke")
    joke: str = dspy.OutputField(desc="The joke that is being told")

# Create a reusable module
class JokeModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # Initialize the chain of thought predictor with our joke signature
        self.joke_generator = dspy.ChainOfThought(JokeSignature)

    def forward(self, topic: str) -> str:
        # Generate a joke about the topic
        prediction = self.joke_generator(topic=topic)
        return prediction.joke
    
# Instantiate our module
joke_module = JokeModule()

# Test it
output = joke_module(topic="python")
print(output)

Why do Python programmers get paid so much?

Because they have to be right all the time... about their indentation!


### Modular Programming with DSPy

We've created a reusable `JokeModule` that:
- Uses the cleaner class-based signature syntax
- Encapsulates the joke generation logic
- Returns just the joke string (not the full Prediction object)
- Can be easily integrated into larger applications

The module pattern is useful for building complex AI systems with multiple components. A module can contain any business logic that is valid Python, and you can combine multiple modules as needed.

## 8. Creating a Dataset for Evaluation

To optimize our programs, we need data to evaluate performance. Let's create a dataset of funny and unfunny jokes:

In [53]:
# Import random for shuffling
import random
random.seed(69)  # Set seed for reproducibility

# Dataset of professional comedian jokes (labeled as funny)
# Source: Various famous comedians
# https://inews.co.uk/light-relief/jokes/ricky-gervais-jokes-best-golden-globes-2020-host-controversial-funniest-the-office-135797
# https://www.blackpoolgrand.co.uk/funniest-jokes-one-liners/
# https://www.vulture.com/2018/01/dave-chappelle-bird-revelation-equanimity-best-jokes.html
# https://www.scotsman.com/heritage-and-retro/heritage/billy-connollys-best-jokes-80-of-the-big-yins-funniest-jokes-and-one-liners-4458332
# https://inews.co.uk/light-relief/jokes/funny-jokes-110-funniest-best-one-liners-192413

funny_jokes = [
    {"topic": "Fishing", "joke": "Give a man a fish, and he'll probably follow you home expecting more fish.", "comedian": "Ricky Gervais"},
    {"topic": "Family", "joke": "Where there's a will – there's a relative!", "comedian": "Ricky Gervais"},
    {"topic": "Holidays", "joke": "1st of December, World Aids Day….I don't think it'll ever take off like Christmas.", "comedian": "Ricky Gervais"},
    {"topic": "Drinking", "joke": "I like a drink as much as the next man. Unless the next man is Mel Gibson.", "comedian": "Ricky Gervais"},
    {"topic": "Celebrity", "joke": "It's gonna be a night of partying and heavy drinking. Or as Charlie calls it: breakfast.", "comedian": "Ricky Gervais"},
    {"topic": "Movies", "joke": "It seems like everything this year was three-dimensional, except the characters in The Tourist.", "comedian": "Ricky Gervais"},
    {"topic": "Religion", "joke": "You won't burn in hell. But be nice anyway.", "comedian": "Ricky Gervais"},
    {"topic": "Inspiration", "joke": "My greatest hero is Nelson Mandela. What a man. Incarcerated for 25 years, he was released in 1990 and he hasn't reoffended. I think he's going straight, which shows you prison does work.", "comedian": "Ricky Gervais"},
    {"topic": "Philosophy", "joke": "Remember, when you are dead, you do not know you are dead. It is only painful for others. The same applies when you are stupid.", "comedian": "Ricky Gervais"},
    {"topic": "Life", "joke": "Mondays are fine. It's your life that sucks.", "comedian": "Ricky Gervais"},
    {"topic": "Religion", "joke": "Remember, if you don't sin, then Jesus died for nothing.", "comedian": "Ricky Gervais"},
    {"topic": "Activism", "joke": "I could solve the world's problems if I… cared.", "comedian": "Ricky Gervais"},
    {"topic": "Identity", "joke": "I can have a go at the French cause I'm half French half English with a stupid name like Gervais. No I am, I'm half French half English and um I've got qualities of both, French and English which is good, so um… I am crap in bed but at least I've got bad breath.", "comedian": "Ricky Gervais"},
    {"topic": "Military", "joke": "Do commandos not wear pants? They must wear pants, don't they?", "comedian": "Ricky Gervais"},
    {"topic": "Equality", "joke": "Same sex marriage is not a gay privilege, it's equal rights. Privilege would be something like gay people not paying taxes. Like churches don't.", "comedian": "Ricky Gervais"},
    {"topic": "Folklore", "joke": "I've never worked out what the moral of Humpty Dumpty is. I can only think of: Don't sit on a wall, if you're an egg.", "comedian": "Ricky Gervais"},
    {"topic": "Employment", "joke": "Avoid employing unlucky people – throw half of the pile of CVs in the bin without reading them.", "comedian": "Ricky Gervais"},
    {"topic": "Awards", "joke": "For any of you who don't know, the Golden Globes are just like the Oscars, but without all that esteem. The Golden Globes are to the Oscars what Kim Kardashian is to Kate Middleton. A bit louder, a bit trashier, a bit drunker, and more easily bought.", "comedian": "Ricky Gervais"},
    {"topic": "Workplace", "joke": "If your boss is getting you down, look at him through the prongs of a fork and imagine him in jail.", "comedian": "Ricky Gervais"},
    {"topic": "Humor", "joke": "I can't find someone funny whom I don't like. Hitler told great jokes.", "comedian": "Ricky Gervais"},
    {"topic": "Culture", "joke": "America champions the underdog. We champion the underdog until he's not the underdog anymore, and he annoys us.", "comedian": "Ricky Gervais"},
    {"topic": "Betrayal", "joke": "You have to be 100% behind someone, before you can stab them in the back.", "comedian": "Ricky Gervais"},
    {"topic": "Health", "joke": "Remember, being healthy is basically dying as slowly as possible.", "comedian": "Ricky Gervais"},
    {"topic": "Atheism", "joke": "I'd like to thank God for making me an atheist.", "comedian": "Ricky Gervais"},
    {"topic": "Music Industry", "joke": "Piracy doesn't kill music, boy bands do.", "comedian": "Ricky Gervais"},
    {"topic": "Wealth", "joke": "My wealth and happiness would suggest that God definitely does love me. If he existed of course. Which he doesn't.", "comedian": "Ricky Gervais"},
    {"topic": "Social Media", "joke": "Following someone on Twitter and asking them to tweet about something else is like stalking someone and asking them to go a different route.", "comedian": "Ricky Gervais"},
    {"topic": "Fame", "joke": "Please don't worship me. I'm just an ordinary guy, with lots of followers trying to spread my message. Sort of like Jesus Christ I guess.", "comedian": "Ricky Gervais"},
    {"topic": "Technology", "joke": "iPhones are Barbie Dolls for grown men. You carry them round, dress them up in little outfits, accessorise, & get a new one every year.", "comedian": "Ricky Gervais"},
    {"topic": "Generosity", "joke": "Give a man a fish, and he'll probably follow you home expecting more fish.", "comedian": "Ricky Gervais"},
    {"topic": "Environment", "joke": "It seems to be true, particularly in middle America, that those most militant about using up fossil fuels, don't actually believe in fossils", "comedian": "Ricky Gervais"},
    {"topic": "Drinking", "joke": "My father drank so heavily, when he blew on the birthday cake he lit the candles.", "comedian": "Les Dawson"},
    {"topic": "Police", "joke": "I was in my car driving back from work. A police officer pulled me over and knocked on my window. I said, 'One minute I'm on the phone.'", "comedian": "Alan Carr"},
    {"topic": "Overthinking", "joke": "I worry about ridiculous things, you know, how does a guy who drives a snowplough get to work in the morning… that can keep me awake for days.", "comedian": "Billy Connolly"},
    {"topic": "Relationships", "joke": "I used to go out with a giraffe. Used to take it to the pictures and that. You'd always get some bloke complaining that he couldn't see the screen.", "comedian": "Paul Merton"},
    {"topic": "Music", "joke": "Here's a picture of me with REM. That's me in the corner.", "comedian": "Milton Jones"},
    {"topic": "Optimism", "joke": "People say 'Bill, are you an optimist?' And I say, 'I hope so.'", "comedian": "Bill Bailey"},
    {"topic": "Customer Service", "joke": "I rang up British Telecom and said: 'I want to report a nuisance caller.' He said: 'Not you again.'", "comedian": "Tim Vine"},
    {"topic": "Obesity", "joke": "Life is like a box of chocolates. It doesn't last long if you're fat.", "comedian": "Joe Lycett"},
    {"topic": "Religion", "joke": "We weren't very religious. On Hanukkah, my mother had our menorah on a dimmer.", "comedian": "Richard Lewis"},
    {"topic": "Beauty", "joke": "My girlfriend is absolutely beautiful. Body like a Greek statue – completely pale, no arms.", "comedian": "Phil Wang"},
    {"topic": "Weather", "joke": "Normally you have news, weather and travel. But not on snow day. On a snow day, the news is weather is travel.", "comedian": "Michael McIntyre"},
    {"topic": "Personal Improvement", "joke": "I bought myself some glasses. My observational comedy improved.", "comedian": "Sara Pascoe"},
    {"topic": "Sports", "joke": "If I was an Olympic athlete, I'd rather come in last than win the silver medal. You win the gold, you feel good. You win the bronze, you think, 'at least I got something.' But you win that silver, that's like, 'Congratulations, you almost won! Of all the losers, you came in first! You're the number one loser! No one lost ahead of you!'", "comedian": "Jerry Seinfeld"},
    {"topic": "Identity", "joke": "My star sign is Pyrex. I was a test-tube baby.", "comedian": "Billy Connolly"},
    {"topic": "Marriage", "joke": "I always take my wife morning tea in my pyjamas. But is she grateful? No, she says she'd rather have it in a cup.", "comedian": "Eric Morecambe"},
    {"topic": "Shopping", "joke": "A man walks into a chemist's and says, 'Can I have a bar of soap, please?' The chemist says, 'Do you want it scented?' And the man says, 'No, I'll take it with me now.'", "comedian": "Ronnie Barker"},
    {"topic": "Crime", "joke": "Crime in multi-storey car parks. That is wrong on so many different levels.", "comedian": "Tim Vine"},
    {"topic": "Social Class", "joke": "You know you're working class when your TV is bigger than your bookcase.", "comedian": "Rob Beckett"},
    {"topic": "Animals", "joke": "Owls haven't got necks, have they? An owl is essentially a one-piece unit.", "comedian": "Ross Noble"},
    {"topic": "Fashion", "joke": "If you arrive fashionably late in Crocs, you're just late.", "comedian": "Joel Dommett"},
    {"topic": "Technology", "joke": "My phone will ring at 2am and my wife'll look at me and go, \"Who's that calling at this time?\" I say, \"I don't know. If I knew that we wouldn't need the bloody phone.\"", "comedian": "Lee Evans"},
    {"topic": "Philosophy", "joke": "I doubt there's a heaven; I think the people from hell have probably bought it for a timeshare.", "comedian": "Victoria Wood"},
    {"topic": "Fitness", "joke": "I said to the gym instructor: \"Can you teach me to do the splits?\", He said: \"How flexible are you?\", I said: \"I can't make Tuesdays.\"", "comedian": "Tommy Cooper"},
    {"topic": "Insurance", "joke": "Do Transformers get car, or life insurance?", "comedian": "Russell Howard"},
    {"topic": "Police", "joke": "Alright lads, a giant fly is attacking the police station. I've called the SWAT team!", "comedian": "Greg Davies"},
    {"topic": "Healthcare", "joke": "A good rule to remember for life is that when it comes to plastic surgery and sushi, never be attracted by a bargain.", "comedian": "Graham Norton"},
    {"topic": "Animals", "joke": "Two monkeys were getting into the bath. One said: 'Oo, oo, oo, aah aah aah.' The other replied: 'Well, put some cold in it then.'", "comedian": "Harry Hill"},
    {"topic": "Suburban Life", "joke": "My parents did just well enough so I could grow up poor around white people. When Nas and them used to talk about the projects, I used to get jealous. It sounded fun. Everybody in the projects was poor, and that's fair. But if you were poor in Silver Spring, nigga, it felt like it was only happening to you.", "comedian": "Dave Chappelle"},
    {"topic": "Cultural Identity", "joke": "What is Rachel willing to do, so that we blacks believe that she believes she is actually one of us? Bitch, are you willing to put a lien on your house so that you can invest in a mixtape that probably won't work out?", "comedian": "Dave Chappelle"},
    {"topic": "Aging", "joke": "I don't like looking at my dick anymore. My dick looks distinguished. It's old, an old-looking dick. It's got salt-and-pepper hair all around it. My dick looks like Morgan Freeman in the '90s.", "comedian": "Dave Chappelle"},
    {"topic": "Fatherhood", "joke": "This motherfucker calls me up in the middle of the night. It was one o'clock in the morning and he goes, 'Dad, don't be mad […] I'm at a party and my designated driver had too much to drink. Me and friends need you to come pick us up.' I said, 'Jesus Christ, it's one o'clock in the morning. Nigga, I am shit-faced!'", "comedian": "Dave Chappelle"},
    {"topic": "Political Commentary", "joke": "Eight years later, I'm pulling up to the polls again. This time, I'm driving a brand-new Porsche because the Obama years were very good to me […] I walked up and saw a long, long line of dusty white people […] I stood with them in line, like all us Americans are required to do in a democracy. Nobody skips the line to vote. And I listened to them say naïve, poor white people things.", "comedian": "Dave Chappelle"},
    {"topic": "Leadership", "joke": "This motherfucker [Donald Trump] grabbed the podium and he goes, 'You don't know how scary the things I read in my briefings are.' Holy shit, man, you ain't supposed to tell us that, bro!", "comedian": "Dave Chappelle"},
    {"topic": "Religious Satire", "joke": "I respect everybody's beliefs, except Amish people. They are the only ones I can say clearly, 'Their God is wrong.' The speed limit is 75 miles an hour in Ohio, and one lane of traffic is blocked by a goddamned horse and buggy?", "comedian": "Dave Chappelle"},
    {"topic": "Hollywood", "joke": "You think I go to a Hollywood meeting with all them white people by myself? I bring my nigga Mac Mittens from the streets […] He's not even qualified to listen to these meetings, he just makes me feel good.", "comedian": "Dave Chappelle"},
    {"topic": "Comedy Culture", "joke": "The tough part of being a comedian and knowing the motherfucker is, everybody comes up to me like, 'Did you know? Did you know what Louis was doing?' No, bitch, I did not know.", "comedian": "Dave Chappelle"},
    {"topic": "National Identity", "joke": "I could kill every white person in America at one time. You know how I'd do it? Just wait for the Super Bowl, and right when they sing the National Anthem, I'd have O.J. Simpson walk to the 50-yard line with them bad knees.", "comedian": "Dave Chappelle"},
    {"topic": "Gender Relations", "joke": "I used to do shows for drug dealers that wanted to clean their money up. One time I did a real good set, and these motherfuckers called me into the back room. They gave me $25,000 in cash […] I jumped on the subway and started heading towards Brooklyn at one o'clock in the morning.", "comedian": "Dave Chappelle"},
    {"topic": "Scottish Heritage", "joke": "Scottish-Americans tell you that if you want to identify tartans, it's easy – you simply look under the kilt, and if it's a quarter-pounder, you know it's a McDonald's.", "comedian": "Billy Connolly"},
    {"topic": "Judgement", "joke": "Before you judge a man, walk a mile in his shoes. After that who cares? He's a mile away and you've got his shoes!", "comedian": "Billy Connolly"},
    {"topic": "Weather", "joke": "I hate all those weathermen, too, who tell you that rain is bad weather. There's no such thing as bad weather, just the wrong clothing, so get yourself a sexy raincoat and live a little.", "comedian": "Billy Connolly"},
    {"topic": "Film Industry", "joke": "I'm a huge film star, but you have to hurry to the movies because I usually die in the first 15 f***ing minutes. I'm the only guy I know who died in a f***ing Muppet Movie.", "comedian": "Billy Connolly"},
    {"topic": "Appearance", "joke": "I always look skint. When I buy a Big Issue, people take it out of my hand and give me a pound.", "comedian": "Billy Connolly"},
    {"topic": "Sex Therapy", "joke": "One sex therapist claims that the most effective way to arouse your man is to spend 10 minutes licking his ears. Personally, I think it's bollocks.", "comedian": "Billy Connolly"},
    {"topic": "Cinema", "joke": "When people say while watching a film 'did you see that? No tosser, I paid ten quid to come to the cinema and stare at the f***ing floor.", "comedian": "Billy Connolly"},
    {"topic": "Aeroplane Comfort", "joke": "I get claustrophobic easily and I don't get why aeroplane toilets don't f***ing have windows. I mean it's not as if anyone can f***ing see in. Unless of course you are the most determined pervert in the world.", "comedian": "Billy Connolly"},
    {"topic": "Astrology", "joke": "My star sign is Pyrex. I was a test-tube baby.", "comedian": "Billy Connolly"},
    {"topic": "Parenting", "joke": "Don't buy one of those baby intercoms. Babies pretend to be dead. They're bastards, and they do it on purpose.", "comedian": "Billy Connolly"},
    {"topic": "Common Sayings", "joke": "Why do people say 'Oh you want to have your cake and eat it too?' Dead right! What good is a cake if you can't eat it?", "comedian": "Billy Connolly"},
    {"topic": "Life Perception", "joke": "When people say 'life is short'. What the f***? Life is the longest damn thing anyone ever f***ing does! What can you do that's longer?", "comedian": "Billy Connolly"},
    {"topic": "Dating", "joke": "I like a woman with a head on her shoulders. I hate necks.", "comedian": "Steve Martin"},
    {"topic": "Growing Up", "joke": "I have a lot of growing up to do. I realised that the other day inside my fort.", "comedian": "Zach Galifianakis"},
    {"topic": "Employment", "joke": "I used to work at McDonald's making minimum wage. You know what that means when someone pays you minimum wage? You know what your boss was trying to say? 'Hey, if I could pay you less, I would, but it's against the law.'", "comedian": "Chris Rock"},
    {"topic": "Love", "joke": "Love is like a fart. If you have to force it it's probably s***.", "comedian": "Stephen K. Amos"},
    {"topic": "Convenience", "joke": "I like an escalator because an escalator can never break. It can only become stairs. There would never be an 'Escalator Temporarily Out of Order' sign, only 'Escalator Temporarily Stairs'.", "comedian": "Mitch Hedberg"},
    {"topic": "Sports", "joke": "If I was an Olympic athlete, I'd rather come in last than win the silver medal. You win the gold, you feel good. You win the bronze, you think, 'at least I got something.' But you win that silver, that's like, 'Congratulations, you almost won! Of all the losers, you came in first! You're the number one loser! No one lost ahead of you!'", "comedian": "Jerry Seinfeld"},
    {"topic": "Religion", "joke": "We weren't very religious. On Hanukkah, my mother had our menorah on a dimmer.", "comedian": "Richard Lewis"},
    {"topic": "Beauty", "joke": "My girlfriend is absolutely beautiful. Body like a Greek statue – completely pale, no arms.", "comedian": "Phil Wang"},
    {"topic": "Creation", "joke": "If God had written the Bible, the first line should have been 'It's round.'", "comedian": "Eddie Izzard"},
    {"topic": "Self-Improvement", "joke": "I bought myself some glasses. My observational comedy improved.", "comedian": "Sara Pascoe"},
    {"topic": "Politics", "joke": "Trump's nothing like Hitler. There's no way he could write a book.", "comedian": "Frankie Boyle"},
    {"topic": "Social Class", "joke": "You know you're working class when your TV is bigger than your book case.", "comedian": "Rob Beckett"},
    {"topic": "Conflict", "joke": "Most of my life is spent avoiding conflict. I hardly ever visit Syria.", "comedian": "Alex Horne"},
    {"topic": "Relaxation", "joke": "A spa hotel? It's like a normal hotel, only in reception there's a picture of a pebble.", "comedian": "Rhod Gilbert"},
    {"topic": "Health", "joke": "Life is like a box of chocolates. It doesn't last long if you're fat.", "comedian": "Joe Lycett"},
    {"topic": "Career", "joke": "My Dad said, always leave them wanting more. Ironically, that's how he lost his job in disaster relief.", "comedian": "Mark Watson"},
    {"topic": "Memory", "joke": "Apparently smoking cannabis can affect your short term memory. Well if that's true, what do you think smoking cannabis does?", "comedian": "Mickey P Kerr"},
    {"topic": "Philosophy", "joke": "How many philosophers does it take to change a lightbulb?…. none. They're not really into that sort of thing. If it's that dark, light a candle.", "comedian": "Phil Cornwell"},
    {"topic": "Marriage", "joke": "The first time I met my wife, I knew she was a keeper. She was wearing massive gloves.", "comedian": "Alun Cochrane"},
    {"topic": "Childhood", "joke": "As a kid I was made to walk the plank. We couldn't afford a dog.", "comedian": "Gary Delaney"},
    {"topic": "Misunderstanding", "joke": "Two fish in a tank. One says: 'How do you drive this thing?'", "comedian": "Peter Kay"},
    {"topic": "Entertainment", "joke": "I saw a documentary on how ships are kept together. Riveting!", "comedian": "Stewart Francis"},
    {"topic": "Music", "joke": "People who like trance music are very persistent. They don't techno for an answer.", "comedian": "Joel Dommett"},
    {"topic": "Dating", "joke": "I used to go out with a giraffe. Used to take it to the pictures and that. You'd always get some bloke complaining that he couldn't see the screen. It's a giraffe, mate. What do you expect? 'Well he can take his hat off for a start!'", "comedian": "Paul Merton"},
    {"topic": "Weather", "joke": "Normally you have news, weather and travel. But not on snow day. On a snow day, news is weather is travel.", "comedian": "Michael McIntyre"},
    {"topic": "Music", "joke": "Here's a picture of me with REM. That's me in the corner.", "comedian": "Milton Jones"},
    {"topic": "Sarcasm", "joke": "Someone showed me a photograph of my local MP the other day. 'Would you buy a second-hand car from this man?' they asked. 'Would you buy a second-hand car?' I replied.", "comedian": "Miles Jupp"},
    {"topic": "Culture", "joke": "With stand-up in Britain, what you have to do is bloody swearing. In Germany, we don't have to swear. Reason being, things work.", "comedian": "Henning Wehn"},
    {"topic": "Learning", "joke": "I'm learning the hokey cokey. Not all of it. But – I've got the ins and outs.", "comedian": "Iain Stirling"},
    {"topic": "Identity", "joke": "Roses are red, violets are blue, I'm a schizophrenic, and so am I.", "comedian": "Billy Connolly"},
    {"topic": "Parenting", "joke": "My mother told me, you don't have to put anything in your mouth you don't want to. Then she made me eat broccoli, which felt like double standards.", "comedian": "Sarah Millican"},
    {"topic": "Vengeance", "joke": "My therapist says I have a preoccupation with vengeance. We'll see about that.", "comedian": "Stewart Francis"},
    {"topic": "Family", "joke": "I'm sure wherever my Dad is, he's looking down on us. He's not dead, just very condescending.", "comedian": "Jack Whitehall"},
    {"topic": "Marriage", "joke": "'What's a couple?' I asked my mum. She said, 'Two or three'. Which probably explains why her marriage collapsed.", "comedian": "Josie Long"},
    {"topic": "Injury", "joke": "The easiest time to add insult to injury is when you're signing somebody's cast.", "comedian": "Demetri Martin"},
    {"topic": "Communication", "joke": "I was in my car driving back from work. A police officer pulled me over and knocked on my window. I said, 'One minute I'm on the phone.'", "comedian": "Alan Carr"},
    {"topic": "Afterlife", "joke": "I doubt there's a heaven; I think the people from hell have probably bought it for a timeshare.", "comedian": "Victoria Wood"},
    {"topic": "Flexibility", "joke": "I said to the gym instructor: 'Can you teach me to do the splits?' He said: 'How flexible are you?' I said: 'I can't make Tuesdays.'", "comedian": "Tommy Cooper"},
    {"topic": "Misunderstanding", "joke": "A man walks into a chemist's and says, 'Can I have a bar of soap, please?' The chemist says, 'Do you want it scented?' And the man says, 'No, I'll take it with me now.'", "comedian": "Ronnie Barker"},
    {"topic": "Humor", "joke": "It's really hard to define 'virtue signalling', as I was saying the other day to some of my Muslim friends over a fair-trade coffee in our local feminist bookshop.", "comedian": "Lucy Porter"},
    {"topic": "Creation", "joke": "If we were truly created by God, then why do we still occasionally bite the insides of our own mouths?", "comedian": "Dara Ó Briain"},
    {"topic": "Insurance", "joke": "Do Transformers get car, or life insurance?", "comedian": "Russell Howard"},
    {"topic": "Emergency", "joke": "Alright lads, a giant fly is attacking the police station. I've called the SWAT team!", "comedian": "Greg Davies"},
    {"topic": "Consumerism", "joke": "A good rule to remember for life is that when it comes to plastic surgery and sushi, never be attracted by a bargain.", "comedian": "Graham Norton"},
    {"topic": "Family", "joke": "My father drank so heavily, when he blew on the birthday cake he lit the candles.", "comedian": "Les Dawson"},
    {"topic": "Therapy", "joke": "I've been feeling suicidal so my therapist suggested I do CBT. Now I can ride a motorbike, how's that going to help?", "comedian": "Eric Lampaert"},
]

# Dataset of generic, unfunny jokes (labeled as not funny)
unfunny_jokes = [
    {"topic": "Science", "joke": "Why don't scientists trust atoms? Because they make up everything."},
    {"topic": "Field", "joke": "Why did the scarecrow win an award? Because he was outstanding in his field."},
    {"topic": "Animals", "joke": "Why do cows have hooves instead of feet? Because they lactose."},
    {"topic": "Food", "joke": "What do you call fake spaghetti? An impasta."},
    {"topic": "Animals", "joke": "How does a penguin build its house? Igloos it together."},
    {"topic": "Halloween", "joke": "What do you get when you cross a snowman and a vampire? Frostbite."},
    {"topic": "Books", "joke": "Why was the math book sad? It had too many problems."},
    {"topic": "Food", "joke": "What do you call cheese that isn't yours? Nacho cheese."},
    {"topic": "Skeletons", "joke": "Why don't skeletons fight each other? They don't have the guts."},
    {"topic": "Walls", "joke": "What did one wall say to the other wall? I'll meet you at the corner."},
    {"topic": "Transportation", "joke": "Why did the bicycle fall over? It was two-tired."},
    {"topic": "Animals", "joke": "What do you call a bear with no teeth? A gummy bear."},
    {"topic": "Gym", "joke": "Why don't some couples go to the gym? Because some relationships don't work out."},
    {"topic": "Factories", "joke": "What do you call a factory that makes good products? A satisfactory."},
    {"topic": "Golf", "joke": "Why did the golfer bring an extra pair of pants? In case he got a hole in one."},
    {"topic": "Cleaning", "joke": "What did the janitor say when he jumped out of the closet? Supplies!"},
    {"topic": "Animals", "joke": "What do you call a fish with no eyes? Fsh."},
    {"topic": "Charity", "joke": "Why don't oysters donate to charity? Because they are shellfish."},
    {"topic": "Food", "joke": "What did the grape do when it got stepped on? Nothing but let out a little wine."},
    {"topic": "Animals", "joke": "Why was the big cat disqualified from the race? Because it was a cheetah."},
    {"topic": "Fashion", "joke": "What do you call a belt made of watches? A waist of time."},
    {"topic": "Body", "joke": "Why can't your nose be 12 inches long? Because then it would be a foot."},
    {"topic": "Sports", "joke": "Why don't some fish play basketball? Because they are afraid of the net."},
    {"topic": "Animals", "joke": "What do you call a pile of cats? A meowtain."},
    {"topic": "Coffee", "joke": "Why did the coffee file a police report? It got mugged."},
    {"topic": "Weather", "joke": "Why did the stadium get hot after the game? All the fans left."},
    {"topic": "Plates", "joke": "What did one plate say to the other plate? Lunch is on me."},
    {"topic": "Space", "joke": "How do you organize a space party? You planet."},
    {"topic": "Food", "joke": "Why don't eggs tell jokes? They'd crack each other up."},
    {"topic": "Halloween", "joke": "How does a vampire start a letter? Tomb it may concern."},
    {"topic": "Technology", "joke": "Why did the computer go to the doctor? It had a virus."},
    {"topic": "Boomerangs", "joke": "What do you call a boomerang that doesn't come back? A stick."},
    {"topic": "Ghosts", "joke": "Why are ghosts bad at lying? Because you can see right through them."},
    {"topic": "Animals", "joke": "What do you get when you cross a sheep and a kangaroo? A woolly jumper."},
    {"topic": "Food", "joke": "Why did the tomato turn red? Because it saw the salad dressing."},
    {"topic": "School", "joke": "Why did the math teacher take off points? Because the student's answer was too square."},
    {"topic": "Birds", "joke": "Why do seagulls fly over the ocean? Because if they flew over the bay, they'd be bagels."},
    {"topic": "Food", "joke": "Why was the baby strawberry crying? Because its parents were in a jam."},
    {"topic": "Technology", "joke": "What do you call a droid that takes the long way around? R2 detour."},
    {"topic": "Fashion", "joke": "Why did the scarecrow get promoted? He was outstanding in his field."},
    {"topic": "Fashion", "joke": "What did one hat say to the other hat? You stay here, I'll go on ahead."},
    {"topic": "Fashion", "joke": "Why was the belt arrested? It held up a pair of pants."},
    {"topic": "Animals", "joke": "What do you call an alligator in a vest? An investigator."},
    {"topic": "Animals", "joke": "Why don't you see elephants hiding in trees? Because they're so good at it."},
    {"topic": "Books", "joke": "Why did the math book look sad? Because it had too many problems."},
    {"topic": "Bees", "joke": "Why do bees have sticky hair? Because they use honeycombs."},
    {"topic": "Music", "joke": "Why did the chicken join a band? Because it had the drumsticks."},
    {"topic": "Animals", "joke": "How do you catch a squirrel? Climb a tree and act like a nut."},
    {"topic": "Technology", "joke": "Why was the computer cold? It left its Windows open."},
    {"topic": "Animals", "joke": "What do you call a magic dog? A labracadabrador."},
    {"topic": "Sports", "joke": "Why don't some fish play basketball? Because they're afraid of the net."},
    {"topic": "Oceans", "joke": "What did one ocean say to the other ocean? Nothing, they just waved."},
    {"topic": "Dogs", "joke": "Why did the cowboy get a dachshund? Because he wanted to get a long little doggie."},
    {"topic": "Snowmen", "joke": "What do you call a snowman with a six-pack? An abdominal snowman."},
    {"topic": "Food", "joke": "Why did the tomato turn red? Because it saw the salad dressing."},
    {"topic": "Animals", "joke": "How does a penguin build its house? Igloos it together."},
    {"topic": "Golf", "joke": "Why did the golfer bring extra pants? In case he got a hole in one."},
    {"topic": "Animals", "joke": "What do you call an alligator in a vest? An investigator."},
    {"topic": "Fashion", "joke": "Why do cows wear bells? Because their horns don't work."},
    {"topic": "Field", "joke": "Why did the scarecrow become a successful neurosurgeon? Because he was outstanding in his field."},
    {"topic": "Cleaning", "joke": "What did the janitor say when he jumped out of the closet? Supplies!"},
    {"topic": "Science", "joke": "Why don't scientists trust atoms? Because they make up everything."},
    {"topic": "Skeletons", "joke": "Why did the skeleton go to the party alone? He had no body to go with him."},
    {"topic": "Transportation", "joke": "Why did the bicycle fall over? It was two-tired."},
    {"topic": "Technology", "joke": "Why did the computer go to the doctor? It had a virus."},
    {"topic": "Food", "joke": "What did the grape do when it got stepped on? Nothing but let out a little wine."},
    {"topic": "Ghosts", "joke": "Why do ghosts like elevators? Because it lifts their spirits."},
    {"topic": "Science", "joke": "Why can't you trust an atom? Because they make up everything."},
    {"topic": "Food", "joke": "What do you call fake spaghetti? An impasta."},
    {"topic": "Cleaning", "joke": "How do you make a tissue dance? Put a little boogie in it."},
    {"topic": "Charity", "joke": "Why don't oysters donate to charity? Because they are shellfish."},
    {"topic": "Boomerangs", "joke": "What do you call a boomerang that doesn't come back? A stick."},
    {"topic": "Books", "joke": "Why did the math book look sad? Because it had too many problems."},
    {"topic": "Skeletons", "joke": "Why don't skeletons fight each other? They don't have the guts."},
    {"topic": "Walls", "joke": "What did one wall say to the other wall? I'll meet you at the corner."},
    {"topic": "Animals", "joke": "What do you call a bear with no teeth? A gummy bear."},
    {"topic": "Plates", "joke": "What did one plate say to the other plate? Lunch is on me."},
    {"topic": "Space", "joke": "How do you organize a space party? You planet."},
    {"topic": "Food", "joke": "Why don't eggs tell jokes? They'd crack each other up."},
    {"topic": "Halloween", "joke": "How does a vampire start a letter? Tomb it may concern."},
    {"topic": "Coffee", "joke": "Why did the coffee file a police report? It got mugged."},
    {"topic": "Golf", "joke": "Why did the golfer bring an extra pair of pants? In case he got a hole in one."},
    {"topic": "Animals", "joke": "What do you call a fish with no eyes? Fsh."},
    {"topic": "Food", "joke": "Why did the tomato turn red? Because it saw the salad dressing."},
    {"topic": "Birds", "joke": "Why don't seagulls fly over the bay? Because then they'd be bagels."},
    {"topic": "Food", "joke": "Why do cows have hooves instead of feet? Because they lactose."},
    {"topic": "Sports", "joke": "Why don't some fish play basketball? Because they're afraid of the net."},
    {"topic": "Field", "joke": "Why did the scarecrow win an award? Because he was outstanding in his field."},
    {"topic": "Food", "joke": "What do you call cheese that isn't yours? Nacho cheese."},
    {"topic": "Transportation", "joke": "Why did the bicycle fall over? It was two-tired."},
    {"topic": "Animals", "joke": "How does a penguin build its house? Igloos it together."},
    {"topic": "Animals", "joke": "What do you call a pile of cats? A meowtain."},
    {"topic": "Fashion", "joke": "What did one hat say to the other hat? You stay here, I'll go on ahead."},
    {"topic": "Animals", "joke": "What do you call an alligator in a vest? An investigator."},
    {"topic": "Charity", "joke": "Why don't oysters donate to charity? Because they are shellfish."},
    {"topic": "Food", "joke": "What did the grape do when it got stepped on? Nothing but let out a little wine."},
    {"topic": "Golf", "joke": "Why did the golfer bring an extra pair of pants? In case he got a hole in one."},
    {"topic": "Food", "joke": "Why was the baby strawberry crying? Because its parents were in a jam."},
    {"topic": "Factories", "joke": "What do you call a factory that makes good products? A satisfactory."},
    {"topic": "Skeletons", "joke": "Why don't skeletons fight each other? They don't have the guts."},
    {"topic": "Animals", "joke": "What do you call a fish with no eyes? Fsh."},
    {"topic": "Gym", "joke": "Why don't some couples go to the gym? Because some relationships don't work out."},
    {"topic": "Field", "joke": "Why did the scarecrow win an award? Because he was outstanding in his field."},
    {"topic": "Food", "joke": "What do you call fake spaghetti? An impasta."},
    {"topic": "Halloween", "joke": "How does a vampire start a letter? Tomb it may concern."},
    {"topic": "Technology", "joke": "Why did the computer go to the doctor? It had a virus."},
    {"topic": "Boomerangs", "joke": "What do you call a boomerang that doesn't come back? A stick."},
    {"topic": "Food", "joke": "Why did the tomato turn red? Because it saw the salad dressing."},
    {"topic": "Birds", "joke": "Why do seagulls fly over the ocean? Because if they flew over the bay, they'd be bagels."},
    {"topic": "Food", "joke": "Why was the baby strawberry crying? Because its parents were in a jam."},
    {"topic": "Technology", "joke": "What do you call a droid that takes the long way around? R2 detour."},
    {"topic": "Fashion", "joke": "Why did the scarecrow get promoted? He was outstanding in his field."},
    {"topic": "Fashion", "joke": "What did one hat say to the other hat? You stay here, I'll go on ahead."},
    {"topic": "Fashion", "joke": "Why was the belt arrested? It held up a pair of pants."},
    {"topic": "Animals", "joke": "What do you call an alligator in a vest? An investigator."},
    {"topic": "Animals", "joke": "Why don't you see elephants hiding in trees? Because they're so good at it."},
    {"topic": "Books", "joke": "Why did the math book look sad? Because it had too many problems."},
    {"topic": "Bees", "joke": "Why do bees have sticky hair? Because they use honeycombs."},
    {"topic": "Music", "joke": "Why did the chicken join a band? Because it had the drumsticks."},
    {"topic": "Animals", "joke": "How do you catch a squirrel? Climb a tree and act like a nut."},
    {"topic": "Technology", "joke": "Why was the computer cold? It left its Windows open."},
    {"topic": "Animals", "joke": "What do you call a magic dog? A labracadabrador."},
    {"topic": "Sports", "joke": "Why don't some fish play basketball? Because they're afraid of the net."},
    {"topic": "Oceans", "joke": "What did one ocean say to the other ocean? Nothing, they just waved."},
    {"topic": "Dogs", "joke": "Why did the cowboy get a dachshund? Because he wanted to get a long little doggie."},
    {"topic": "Snowmen", "joke": "What do you call a snowman with a six-pack? An abdominal snowman."},
    {"topic": "Food", "joke": "Why did the tomato turn red? Because it saw the salad dressing."}
]

# Convert to DSPy format
dataset = []

# Process funny jokes
for row in funny_jokes:
    topic, joke = row["topic"], row["joke"]
    # Create DSPy Example with labels
    dataset.append(dspy.Example(topic=topic, joke=joke, funny=True).with_inputs("topic", "joke"))

# Process unfunny jokes  
for row in unfunny_jokes:
    topic, joke = row["topic"], row["joke"]
    dataset.append(dspy.Example(topic=topic, joke=joke, funny=False).with_inputs("topic", "joke"))

# Shuffle the dataset
random.shuffle(dataset)

# Split into 60% training, 20% validation, 20% dev
num_items = len(dataset)
train_index = int(0.6 * num_items)
val_index = int(0.8 * num_items)

trainset = dataset[:train_index]
valset = dataset[train_index:val_index]
devset = dataset[val_index:]

print(f"Training set size: {len(trainset)}")
print(f"Validation set size: {len(valset)}")
print(f"Development set size: {len(devset)}")

Training set size: 152
Validation set size: 51
Development set size: 51


### Dataset Created Successfully!

We've built a dataset of 254 jokes:
- 127 professional comedian jokes labeled as funny
- 127 generic dad jokes labeled as not funny
- A shuffled 60/20/20 split: 152 training examples, 51 validation examples, and 51 development examples (the holdout set used for evaluation)
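
One caveat worth checking: the unfunny list above repeats several jokes verbatim (the atoms joke, for example, appears more than once), so after shuffling, an identical joke can land in both the training and development sets. The sketch below is a minimal, hypothetical helper (`split_without_leakage` is not part of DSPy and is not what this notebook ran). It deduplicates on joke text before a seeded 60/20/20 split and then checks that no joke text is shared between splits. Deduplicating the real lists would change the counts reported here.

```python
import random

def split_without_leakage(rows, seed=0, train_frac=0.6, val_frac=0.2):
    """Deduplicate rows by joke text, shuffle with a fixed seed, then split."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = row["joke"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(unique_rows)

    n = len(unique_rows)
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return unique_rows[:train_end], unique_rows[train_end:val_end], unique_rows[val_end:]

# Toy rows standing in for the real lists (the first two are exact duplicates).
toy_rows = [
    {"topic": "Science", "joke": "Why don't scientists trust atoms? Because they make up everything."},
    {"topic": "Science", "joke": "Why don't scientists trust atoms? Because they make up everything."},
    {"topic": "Food", "joke": "What do you call fake spaghetti? An impasta."},
    {"topic": "Space", "joke": "How do you organize a space party? You planet."},
    {"topic": "Coffee", "joke": "Why did the coffee file a police report? It got mugged."},
    {"topic": "Animals", "joke": "What do you call a pile of cats? A meowtain."},
]

train, val, dev = split_without_leakage(toy_rows, seed=0)
train_jokes = {r["joke"] for r in train}
dev_jokes = {r["joke"] for r in dev}
print(len(train), len(val), len(dev))
print("Shared between train and dev:", train_jokes & dev_jokes)
```

Using a local `random.Random(seed)` rather than the module-level `random.shuffle` keeps the split stable across notebook re-runs.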

The `with_inputs()` method tells DSPy which fields are inputs (`topic` and `joke`); the remaining field, `funny`, is treated as the label.

### Examining Our Data

Let's look at an example from our training set:

In [54]:
# Look at an example from our training set
print(trainset[0])
print(f"\nTopic: {trainset[0].topic}")
print(f"Joke: {trainset[0].joke}")
print(f"Funny: {trainset[0].funny}")

Example({'topic': 'Books', 'joke': 'Why did the math book look sad? Because it had too many problems.', 'funny': False}) (input_keys={'topic', 'joke'})

Topic: Books
Joke: Why did the math book look sad? Because it had too many problems.
Funny: False


## 9. Creating a Joke Judge

Let's create a program that can judge whether a joke is funny or not:

In [56]:
# Create a joke judge with Chain of Thought reasoning
# Input: topic and joke, Output: funny (boolean)
joke_judge = dspy.ChainOfThought('topic, joke -> funny: bool')

# Test on our first training example
result = joke_judge(topic=trainset[0].topic, joke=trainset[0].joke)
print(f"Judge's reasoning: {result.reasoning}")
print(f"\nJudge says funny: {result.funny}")
print(f"Ground truth: {trainset[0].funny}")

Judge's reasoning: The joke plays on the double meaning of the word "problems." In the context of a math book, "problems" refers to mathematical exercises. However, "problems" can also refer to difficulties or sources of sadness. The joke is funny because it personifies the math book and attributes its sadness to the abundance of mathematical problems it contains.

Judge says funny: True
Ground truth: False


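Using an inline DSPy signature, we pass in the topic and joke and get back a boolean verdict on whether the joke is funny. This judge will serve as the evaluation metric for our joke generator (the LLM-as-a-judge technique). Note that on this first example the unoptimized judge disagrees with the label: it says the joke is funny, while the ground truth is not funny.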

## 10. Evaluating Our Judge

Now let's evaluate how well our judge performs on the development set (`devset`), which we hold out from optimization:

In [58]:
# Import evaluation tools
from dspy.evaluate import Evaluate

# Define our evaluation metric
def exact_match(example, pred, trace=None):
    """Check if the predicted 'funny' label matches the ground truth.

    DSPy calls metrics as metric(example, prediction, trace).
    """
    return pred.funny == example.funny

# Create an evaluator
evaluate = Evaluate(
    metric=exact_match, 
    devset=devset, # the optimized judge hasn't seen this data yet
    num_threads=8,  # Run evaluations in parallel
    display_progress=True,
    display_table=5  # Show first 5 results
)

# Evaluate our basic judge
basic_judge_score = evaluate(joke_judge)
print(f"\nBasic judge accuracy: {basic_judge_score}%")

Average Metric: 26.00 / 51 (51.0%): 100%|██████████| 51/51 [00:00<00:00, 4329.71it/s]

2025/08/06 17:39:29 INFO dspy.evaluate.evaluate: Average Metric: 26 / 51 (51.0%)





|  | topic | joke | example_funny | reasoning | pred_funny | exact_match |
|---|---|---|---|---|---|---|
| 0 | Field | Why did the scarecrow become a successful neurosurgeon? Because he... | False | The joke plays on the double meaning of the word "field." It refer... | True |  |
| 1 | Ghosts | Why are ghosts bad at lying? Because you can see right through them. | False | The joke plays on the literal transparency of ghosts and the figur... | True |  |
| 2 | Marriage | The first time I met my wife, I knew she was a keeper. She was wea... | True | The joke is funny because it sets up an expectation of a romantic ... | True | ✔️ [True] |
| 3 | Wealth | My wealth and happiness would suggest that God definitely does lov... | True | The joke plays on the common saying that wealth and happiness are ... | True | ✔️ [True] |
| 4 | Social Media | Following someone on Twitter and asking them to tweet about someth... | True | The joke uses an analogy to highlight the absurdity of asking some... | True | ✔️ [True] |



Basic judge accuracy: 50.98%


### Basic Judge Performance: 51% Accuracy

The unoptimized judge performs at chance level (about 50% on our balanced dataset). Looking at the results:
- It rates most jokes as funny, including the dad jokes
- It correctly identifies the professional jokes as funny
- Its errors are therefore mostly false positives: simple puns that are labeled as not funny

This shows why we need optimization: the basic judge is too generous!
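
To see how "too generous" shows up in numbers, it helps to break accuracy down into a confusion matrix. The following is a small sketch on made-up labels (the `labels` and `predictions` lists are illustrative, not the notebook's actual outputs). A judge that says "funny" to almost everything scores barely above chance on a balanced dataset, and all of its errors are false positives.

```python
from collections import Counter

def confusion_counts(predictions, labels):
    """Count true/false positives and negatives for boolean labels."""
    counts = Counter()
    for pred, gold in zip(predictions, labels):
        if pred and gold:
            counts["tp"] += 1
        elif pred and not gold:
            counts["fp"] += 1
        elif not pred and gold:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Illustrative data: 5 funny and 5 unfunny jokes; the judge says "funny" 9 times.
labels = [True] * 5 + [False] * 5
predictions = [True] * 5 + [True, True, True, True, False]

counts = confusion_counts(predictions, labels)
accuracy = (counts["tp"] + counts["tn"]) / len(labels)
print(dict(counts))
print(f"accuracy = {accuracy:.0%}")
```

Running the same breakdown over the real `devset` predictions would show whether the judge's mistakes are concentrated on the dad jokes, as the table above suggests.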

## 11. Optimizing with Bootstrap Few-Shot

DSPy can automatically optimize programs by finding the best few-shot examples:

In [59]:
# Import the Bootstrap optimizer
from dspy.teleprompt import BootstrapFewShot

# Create optimizer
bootstrap_optimizer = BootstrapFewShot(metric=exact_match)

# Compile (optimize) our judge with the training data
print("Optimizing judge with Bootstrap Few-Shot...")
bootstrap_optimized_judge = bootstrap_optimizer.compile(
    joke_judge, 
    trainset=trainset
)

# Test on the same example
result = bootstrap_optimized_judge(topic=trainset[0].topic, joke=trainset[0].joke)
print(f"\nOptimized judge says funny: {result.funny}")
print(f"Ground truth: {trainset[0].funny}")

Optimizing judge with Bootstrap Few-Shot...


  3%|▎         | 4/152 [00:00<00:00, 701.86it/s]

Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.

Optimized judge says funny: False
Ground truth: False





### Bootstrap Optimization Complete!

The optimizer:
- Automatically found 4 good examples from the training set
- These examples will be used as few-shot demonstrations
- The optimized judge now correctly labels our test joke as not funny, which the basic judge got wrong

Bootstrap works by finding examples where the base program succeeds and using those as demonstrations.
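
The selection idea can be sketched in plain Python. This is a conceptual illustration only, not DSPy's implementation: `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical names, and the real optimizer records full LM traces rather than plain dictionaries. The early stop mirrors the log line above, where bootstrapping ended once 4 traces had been collected.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Run the program on training examples and keep the ones it gets right."""
    demos = []
    for example in trainset:
        prediction = program(example["joke"])
        if metric(example, prediction):
            # A passing run becomes a demonstration for the few-shot prompt.
            demos.append({**example, "predicted_funny": prediction})
        if len(demos) >= max_demos:
            break  # stop early once enough demonstrations are collected
    return demos

def toy_program(joke):
    """Stand-in for the LM judge: calls question-style jokes unfunny."""
    return not joke.startswith(("Why", "What", "How"))

def toy_metric(example, prediction):
    """Pass when the prediction matches the label."""
    return prediction == example["funny"]

toy_trainset = [
    {"joke": "Why did the bicycle fall over? It was two-tired.", "funny": False},
    {"joke": "Here's a picture of me with REM. That's me in the corner.", "funny": True},
    {"joke": "How do you organize a space party? You planet.", "funny": False},
    {"joke": "Two fish in a tank. One says: 'How do you drive this thing?'", "funny": True},
    {"joke": "What do you call fake spaghetti? An impasta.", "funny": False},
]

demos = bootstrap_demos(toy_program, toy_trainset, toy_metric, max_demos=4)
print(f"Collected {len(demos)} demonstrations")
```

Because only passing runs are kept, the quality of the demonstrations depends on the metric: a metric that accepts wrong answers would bootstrap wrong examples.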

### Evaluating the Optimized Judge

In [60]:
# Evaluate the optimized judge
bootstrap_judge_score = evaluate(bootstrap_optimized_judge)
print(f"\nBootstrap optimized judge accuracy: {bootstrap_judge_score}%")
print(f"Relative improvement: {(bootstrap_judge_score - basic_judge_score) / basic_judge_score:.1%}")

Average Metric: 47.00 / 51 (92.2%): 100%|██████████| 51/51 [00:00<00:00, 4264.63it/s] 

2025/08/06 17:45:48 INFO dspy.evaluate.evaluate: Average Metric: 47 / 51 (92.2%)


|  | topic | joke | example_funny | reasoning | pred_funny | exact_match |
|---|---|---|---|---|---|---|
| 0 | Field | Why did the scarecrow become a successful neurosurgeon? Because he... | False | The joke is a pun based on the double meaning of the word "field."... | True |  |
| 1 | Ghosts | Why are ghosts bad at lying? Because you can see right through them. | False | The joke is a pun based on the literal transparency of ghosts and ... | False | ✔️ [True] |
| 2 | Marriage | The first time I met my wife, I knew she was a keeper. She was wea... | True | The joke is funny because it sets up an expectation of a romantic ... | True | ✔️ [True] |
| 3 | Wealth | My wealth and happiness would suggest that God definitely does lov... | True | The joke is based on irony and a contradiction. The speaker claims... | True | ✔️ [True] |
| 4 | Social Media | Following someone on Twitter and asking them to tweet about someth... | True | The joke uses an analogy to highlight the absurdity of trying to c... | True | ✔️ [True] |


Bootstrap optimized judge accuracy: 92.16%
Relative improvement: 80.8%


### Bootstrap Optimization Improved Accuracy to 92%!

The optimized judge shows significant improvement:
- From ~51% (chance level) to ~92% accuracy on the development set
- A relative improvement of about 81% (41 percentage points) from just adding few-shot examples
- The judge is still too generous with some puns, but it's usable now

The few-shot examples help the model distinguish professional jokes from dad jokes. The judge now agrees with our human labels often enough to serve as a metric for the joke generator.
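
Absolute and relative gains are easy to mix up, so here is the arithmetic spelled out from the evaluation counts reported above (26 and 47 correct out of 51). The variable names are chosen for this illustration only.

```python
dev_size = 51
basic_correct = 26       # from the basic judge evaluation
optimized_correct = 47   # from the bootstrap-optimized judge evaluation

basic_acc = 100 * basic_correct / dev_size
optimized_acc = 100 * optimized_correct / dev_size

absolute_gain = optimized_acc - basic_acc        # in percentage points
relative_gain = 100 * absolute_gain / basic_acc  # as a percent of the baseline

print(f"Basic accuracy:     {basic_acc:.2f}%")
print(f"Optimized accuracy: {optimized_acc:.2f}%")
print(f"Absolute gain:      {absolute_gain:.1f} percentage points")
print(f"Relative gain:      {relative_gain:.1f}%")
```

With only 51 development examples, each joke is worth about 2 percentage points, so small differences between runs should be read with caution.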

## 12. Using the Judge to Optimize Joke Generation

Now let's use our optimized judge to create a better joke generator:

In [62]:
# Define a metric that uses our judge to score generated jokes
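def judge_score(example, pred, trace=None):
    """Score a generated joke using our optimized judge.

    DSPy calls metrics as metric(example, prediction, trace).
    """
    # Accept either DSPy objects or plain strings (as in the manual check below)
    topic = getattr(example, "topic", example)
    joke = getattr(pred, "joke", pred)

    # Use the judge to evaluate the generated joke
    judge_result = bootstrap_optimized_judge(topic=topic, joke=joke)

    # Return 1.0 if judge thinks it's funny, 0.0 otherwise
    return 1.0 if judge_result.funny else 0.0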

# Create a dataset of topics from training data
topic_trainset = [
    dspy.Example(topic=example.topic, joke=example.joke).with_inputs("topic")
    for example in trainset if example.funny
]

# It doesn't matter what we put in the joke field, we're only using the topic
topic_valset = [
    dspy.Example(topic=example.topic, joke=example.joke).with_inputs("topic")
    for example in valset
]

# This is just a holdout set of fresh topics to do the final evaluation on
topic_devset = [
    dspy.Example(topic=example.topic, joke=example.joke).with_inputs("topic")
    for example in devset
]

topic_trainset[0]

Example({'topic': 'Suburban Life', 'joke': "My parents did just well enough so I could grow up poor around white people. When Nas and them used to talk about the projects, I used to get jealous. It sounded fun. Everybody in the projects was poor, and that's fair. But if you were poor in Silver Spring, nigga, it felt like it was only happening to you."}) (input_keys={'topic'})

In [63]:
# Check whether the judge rates our joke as funny

example_topic = topic_trainset[0].topic
example_joke = topic_trainset[0].joke

example_score = judge_score(example_topic, example_joke)

print(f"Topic: {example_topic}")
print(f"Joke: {example_joke}")

print(f"Score: {example_score}")

Topic: Suburban Life
Joke: My parents did just well enough so I could grow up poor around white people. When Nas and them used to talk about the projects, I used to get jealous. It sounded fun. Everybody in the projects was poor, and that's fair. But if you were poor in Silver Spring, nigga, it felt like it was only happening to you.
Score: 1.0


In [64]:
# First evaluate the baseline joke generator before MIPRO optimization
print("Evaluating baseline joke generator on dev set...")

# Run evaluation using the judge_score metric on the original joke_module
evaluate = Evaluate(
    metric=judge_score, 
    devset=topic_devset,
    num_threads=8,  # Run evaluations in parallel
    display_progress=True,
    display_table=10  # Show first 10 results
)

# Evaluate the baseline joke generator
baseline_results = evaluate(joke_module)

# Print the results
print("\nBaseline evaluation complete!")

Evaluating baseline joke generator on dev set...
Average Metric: 20.00 / 51 (39.2%): 100%|██████████| 51/51 [00:00<00:00, 4641.43it/s]

2025/08/06 17:49:02 INFO dspy.evaluate.evaluate: Average Metric: 20.0 / 51 (39.2%)





|  | topic | joke | prediction | judge_score |
|---|---|---|---|---|
| 0 | Field | Why did the scarecrow become a successful neurosurgeon? Because he... | Why did the scarecrow win an award?\n\nBecause he was outstanding ... | ✔️ [1.000] |
| 1 | Ghosts | Why are ghosts bad at lying? Because you can see right through them. | Why did the ghost cross the road?\n\nTo get to the other sheet! |  |
| 2 | Marriage | The first time I met my wife, I knew she was a keeper. She was wea... | A man is sitting at home when he hears the doorbell ring. He opens... | ✔️ [1.000] |
| 3 | Wealth | My wealth and happiness would suggest that God definitely does lov... | They say money can't buy happiness, but it's a lot more comfortabl... | ✔️ [1.000] |
| 4 | Social Media | Following someone on Twitter and asking them to tweet about someth... | I unfollowed my gym on social media. I wasn't getting results. |  |
| 5 | Fitness | I said to the gym instructor: "Can you teach me to do the splits?"... | I'm on a seafood diet. I see food, and I eat it. Especially after ... |  |
| 6 | Field | Why did the scarecrow win an award? Because he was outstanding in ... | Why did the scarecrow win an award?\n\nBecause he was outstanding ... |  |
| 7 | Fashion | Why do cows wear bells? Because their horns don't work. | Why did the fashion model get fired from the grocery store? Becaus... | ✔️ [1.000] |
| 8 | Ghosts | Why do ghosts like elevators? Because it lifts their spirits. | Why did the ghost cross the road?\n\nTo get to the other sheet! |  |
| 9 | Animals | What do you call an alligator in a vest? An investigator. | Why don't scientists trust atoms?\n\nBecause they make up everything! |  |



Baseline evaluation complete!


## 13. Advanced Optimization with MIPRO

MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) can optimize the instructions themselves, not just the few-shot examples:
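
Before running it, the general idea of searching over instructions and demonstration sets together can be sketched as a toy. Everything here is hypothetical: `search_prompt_configs`, the candidate lists, and `toy_score` are stand-ins invented for illustration, and an exhaustive grid search replaces the optimizer's actual search procedure, which this sketch does not attempt to reproduce. In a real run the candidates are proposed by an LM, and the score would be the average `judge_score` on the validation set.

```python
import itertools

def search_prompt_configs(instructions, demo_sets, score_fn):
    """Score every (instruction, demo set) pair and return the best one."""
    best_config, best_score = None, float("-inf")
    for instruction, demos in itertools.product(instructions, demo_sets):
        score = score_fn(instruction, demos)
        if score > best_score:
            best_config, best_score = (instruction, demos), score
    return best_config, best_score

# Hypothetical candidates for illustration.
instructions = [
    "Tell a joke about the topic.",
    "Write a joke about the topic with a surprising final line.",
]
demo_sets = [
    (),  # zero-shot: no demonstrations
    ("Here's a picture of me with REM. That's me in the corner.",),
]

def toy_score(instruction, demos):
    """Stand-in for 'average judge score on the validation set'."""
    score = 0.4
    if "surprising" in instruction:
        score += 0.2
    score += 0.1 * len(demos)
    return score

best_config, best_score = search_prompt_configs(instructions, demo_sets, toy_score)
print(best_config[0])
print(f"{len(best_config[1])} demo(s), toy score {best_score:.2f}")
```

A full grid quickly becomes expensive when every score requires LM calls. The run log further down reports 27 trials over 18 few-shot candidates and 9 instruction candidates, so the optimizer evidently evaluates only a subset of the combinations.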

In [65]:
# Import MIPRO optimizer
from dspy.teleprompt import MIPROv2

# Create MIPRO optimizer using the "heavy" auto preset
mipro_optimizer = MIPROv2(
    metric=judge_score,
    num_threads=8,
    max_bootstrapped_demos=4,  # Include some bootstrapped examples (generated by the model)
    max_labeled_demos=4,      # Include some labeled examples (from our training data)  
    auto="heavy",             # Heavy preset: the most thorough (and slowest) search
    seed=69,
    init_temperature=1
)

# Optimize with MIPRO
print("Optimizing joke generator with MIPRO...")
print("MIPRO will optimize:")
print("- The instruction text")
print("- Select the best few-shot examples")
print("- Balance between different optimization strategies")

mipro_optimized_joke_program = mipro_optimizer.compile(
    joke_module,
    trainset=topic_trainset, # giving the model jokes to learn from
    valset=topic_valset, # we're giving it fresh topics to test against
    requires_permission_to_run=False,
    seed=69
)

print("\nMIPRO optimization complete!")

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING HEAVY AUTO RUN SETTINGS:
num_trials: 27
minibatch: True
num_fewshot_candidates: 18
num_instruct_candidates: 9
valset size: 51

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=18 sets of demonstrations...


Optimizing joke generator with MIPRO...
MIPRO will optimize:
- The instruction text
- Select the best few-shot examples
- Balance between different optimization strategies
Bootstrapping set 1/18
Bootstrapping set 2/18
Bootstrapping set 3/18


  8%|▊         | 6/74 [00:00<00:00, 589.68it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 4/18


  3%|▎         | 2/74 [00:00<00:00, 479.82it/s]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/18


  1%|▏         | 1/74 [00:00<00:00, 475.98it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/18


  1%|▏         | 1/74 [00:00<00:00, 484.95it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 7/18


  1%|▏         | 1/74 [00:00<00:00, 429.00it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/18


  7%|▋         | 5/74 [00:00<00:00, 548.17it/s]


Bootstrapped 3 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/18


  3%|▎         | 2/74 [00:00<00:00, 531.63it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/18


  8%|▊         | 6/74 [00:00<00:00, 595.06it/s]


Bootstrapped 3 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 11/18


  9%|▉         | 7/74 [00:00<00:00, 587.68it/s]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Bootstrapping set 12/18


  3%|▎         | 2/74 [00:00<00:00, 483.66it/s]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 13/18


 11%|█         | 8/74 [00:00<00:00, 576.78it/s]


Bootstrapped 4 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
Bootstrapping set 14/18


  3%|▎         | 2/74 [00:00<00:00, 515.62it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 15/18


  8%|▊         | 6/74 [00:00<00:00, 609.25it/s]


Bootstrapped 4 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 16/18


  9%|▉         | 7/74 [00:00<00:00, 609.92it/s]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Bootstrapping set 17/18


  5%|▌         | 4/74 [00:00<00:00, 503.20it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 18/18


  1%|▏         | 1/74 [00:00<00:00, 474.84it/s]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=9 instructions...



Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Tell a funny joke about the topic

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Generate a humorous joke related to the specified topic, suitable for a general adult audience. Be mindful of potentially sensitive content.

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 2: You are a world-class comedian. Your task is to generate a joke based on a given topic. The jokes should be original, funny, and appropriate for a general adult audience. Avoid offensive or discriminatory humor. Aim for a joke that elicits laughter or amusement through clever wordplay, surprising twists, or relatable observations. Provide a joke that's a maximum of three sentences. After the joke, briefly explain the humor behind it (one sentence).

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: 3: You are a wo

Average Metric: 23.00 / 51 (45.1%): 100%|██████████| 51/51 [00:00<00:00, 4419.71it/s]

2025/08/06 17:50:32 INFO dspy.evaluate.evaluate: Average Metric: 23.0 / 51 (45.1%)
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 45.1

2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 34 - Minibatch ==



Average Metric: 11.00 / 35 (31.4%): 100%|██████████| 35/35 [00:00<00:00, 4711.34it/s]

2025/08/06 17:50:32 INFO dspy.evaluate.evaluate: Average Metric: 11.0 / 35 (31.4%)
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 31.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2'].





2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 34 - Minibatch ==


Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:00<00:00, 4811.09it/s]

2025/08/06 17:50:32 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 16'].
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 34 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:00<00:00, 4287.53it/s]

2025/08/06 17:50:32 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 1'].
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 34 - Minibatch ==



Average Metric: 13.00 / 35 (37.1%): 100%|██████████| 35/35 [00:00<00:00, 4694.17it/s]

2025/08/06 17:50:32 INFO dspy.evaluate.evaluate: Average Metric: 13.0 / 35 (37.1%)
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 15'].
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1]
2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 34 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:00<00:00, 4632.10it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 1'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 34 - Full Evaluation =====
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 45.71) from minibatch trials...



Average Metric: 22.00 / 51 (43.1%): 100%|██████████| 51/51 [00:00<00:00, 4534.57it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 22.0 / 51 (43.1%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 34 - Minibatch ==



Average Metric: 13.00 / 35 (37.1%): 100%|██████████| 35/35 [00:00<00:00, 4013.80it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 13.0 / 35 (37.1%)





2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 12'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 34 - Minibatch ==


Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:00<00:00, 4407.90it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 5'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 34 - Minibatch ==



Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:00<00:00, 4719.52it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 5'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 34 - Minibatch ==



Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:00<00:00, 4761.93it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 16'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 34 - Minibatch ==



Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:00<00:00, 4327.47it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 35 (48.6%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 1'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 45.1


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 34 - Full Evaluation =====
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 47.14) from minibatch trials...



Average Metric: 25.00 / 51 (49.0%): 100%|██████████| 51/51 [00:00<00:00, 4696.56it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 25.0 / 51 (49.0%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 49.02





2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 49.02
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 34 - Minibatch ==


Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:00<00:00, 5210.69it/s]


2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 19.0 / 35 (54.3%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 49.02


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 34 - Minibatch ==


Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:00<00:00, 4947.78it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 18.0 / 35 (51.4%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 49.02


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 34 - Minibatch ==



Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:00<00:00, 4145.39it/s]

2025/08/06 17:50:33 INFO dspy.evaluate.evaluate: Average Metric: 18.0 / 35 (51.4%)
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 6'].
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02]
2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 49.02


2025/08/06 17:50:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 34 - Minibatch ==



Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:00<00:00, 4916.46it/s]

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 18.0 / 35 (51.4%)
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 49.02


2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 34 - Minibatch ==



Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:00<00:00, 4262.50it/s]

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 49.02


2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 34 - Full Evaluation =====
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 52.38333333333333) from minibatch trials...



Average Metric: 26.00 / 51 (51.0%): 100%|██████████| 51/51 [00:00<00:00, 4287.11it/s]

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 26.0 / 51 (51.0%)
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 50.98





2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 34 - Minibatch ==


Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:00<00:00, 4808.56it/s]

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 35 (48.6%)





2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 34 - Minibatch ==


Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:00<00:00, 725.25it/s] 

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 19.0 / 35 (54.3%)
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 34 - Minibatch ==



Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:00<00:00, 4002.20it/s]

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 35 (48.6%)
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 14'].
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 34 - Minibatch ==



Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:00<00:00, 4165.74it/s]

2025/08/06 17:50:34 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98]
2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 34 - Minibatch ==



Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:00<00:00, 4119.68it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 19.0 / 35 (54.3%)
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 34 - Full Evaluation =====
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...



Average Metric: 26.00 / 51 (51.0%): 100%|██████████| 51/51 [00:00<00:00, 3332.49it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 26.0 / 51 (51.0%)
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98]





2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 34 - Minibatch ==


Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:00<00:00, 3433.45it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 18.0 / 35 (51.4%)





2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 10'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 34 - Minibatch ==


Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:00<00:00, 3445.54it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)





2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 8'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43, 45.71]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 28 / 34 - Minibatch ==


Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:00<00:00, 2058.83it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 7'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43, 45.71, 42.86]





2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 29 / 34 - Minibatch ==


Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:00<00:00, 3608.31it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 35 (48.6%)
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 11'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43, 45.71, 42.86, 48.57]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 30 / 34 - Minibatch ==



Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [00:00<00:00, 3263.03it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 20.0 / 35 (57.1%)





2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 17'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43, 45.71, 42.86, 48.57, 57.14]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 31 / 34 - Full Evaluation =====
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 57.14) from minibatch trials...


Average Metric: 26.00 / 51 (51.0%): 100%|██████████| 51/51 [00:00<00:00, 3039.70it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 26.0 / 51 (51.0%)
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 32 / 34 - Minibatch ==



Average Metric: 21.00 / 35 (60.0%): 100%|██████████| 35/35 [00:00<00:00, 2391.82it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 21.0 / 35 (60.0%)
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 17'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43, 45.71, 42.86, 48.57, 57.14, 60.0]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 33 / 34 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:00<00:00, 1990.36it/s]

2025/08/06 17:50:35 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)





2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 14'].
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [31.43, 45.71, 45.71, 37.14, 45.71, 37.14, 45.71, 42.86, 42.86, 48.57, 54.29, 51.43, 51.43, 51.43, 42.86, 48.57, 54.29, 48.57, 42.86, 54.29, 51.43, 45.71, 42.86, 48.57, 57.14, 60.0, 45.71]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98, 50.98]
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.98


2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 34 / 34 - Full Evaluation =====
2025/08/06 17:50:35 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 60.0) from minibatch trials...


Average Metric: 32.00 / 51 (62.7%): 100%|██████████| 51/51 [00:00<00:00, 4818.65it/s]

2025/08/06 17:50:36 INFO dspy.evaluate.evaluate: Average Metric: 32.0 / 51 (62.7%)
2025/08/06 17:50:36 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 62.75
2025/08/06 17:50:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [45.1, 43.14, 49.02, 50.98, 50.98, 50.98, 62.75]
2025/08/06 17:50:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.75
2025/08/06 17:50:36 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/06 17:50:36 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 62.75!




MIPRO optimization complete!
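
The log above alternates minibatch trials with occasional full evaluations of the "next top averaging program". As a rough illustration of that bookkeeping (a sketch reconstructed from the logged numbers, not DSPy's actual implementation), the snippet below averages the minibatch scores per instruction/few-shot combination for trials 2 through 18 and picks the highest average. The variable names are hypothetical; the scores are copied from the log.

```python
from collections import defaultdict
from statistics import mean

# (instruction index, few-shot set index, minibatch score), copied from the
# minibatch trials logged before the Trial 19 full evaluation.
minibatch_trials = [
    (1, 2, 31.43), (3, 16, 45.71), (3, 1, 45.71), (1, 15, 37.14),
    (6, 1, 45.71), (3, 12, 37.14), (5, 5, 45.71), (4, 5, 42.86),
    (3, 16, 42.86), (3, 1, 48.57), (2, 8, 54.29), (2, 8, 51.43),
    (2, 6, 51.43), (2, 8, 51.43), (2, 4, 42.86),
]

# Group scores by (instruction, few-shot set) combination.
scores_by_combo = defaultdict(list)
for instruction, fewshot_set, score in minibatch_trials:
    scores_by_combo[(instruction, fewshot_set)].append(score)

# Average each combination's minibatch scores and pick the highest average.
averages = {combo: mean(scores) for combo, scores in scores_by_combo.items()}
best = max(averages, key=averages.get)

print(f"Top averaging combination: instruction {best[0]}, few-shot set {best[1]}")
print(f"Average minibatch score: {averages[best]:.2f}")
```

The selected combination (Instruction 2 with Few-Shot Set 8) averages about 52.38, which matches the "Avg Score" reported at Trial 19. The word "next" in the log suggests that combinations which already received a full evaluation are skipped; this sketch does not model that.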


In [66]:
# How did we do on topics the optimizer hadn't seen?
optimized_results = evaluate(mipro_optimized_joke_program)

Average Metric: 25.00 / 51 (49.0%): 100%|██████████| 51/51 [00:01<00:00, 45.13it/s]  

2025/08/06 17:52:36 INFO dspy.evaluate.evaluate: Average Metric: 25.0 / 51 (49.0%)





| | topic | joke | prediction | judge_score |
|---|---|---|---|---|
| 0 | Field | Why did the scarecrow become a successful neurosurgeon? Because he... | I went to a field the other day. It was a minefield. That's why th... | ✔️ [1.000] |
| 1 | Ghosts | Why are ghosts bad at lying? Because you can see right through them. | Why did the ghost break up with his girlfriend? She ghosted him. | |
| 2 | Marriage | The first time I met my wife, I knew she was a keeper. She was wea... | My wife told me to take the spider out instead of killing it. We w... | ✔️ [1.000] |
| 3 | Wealth | My wealth and happiness would suggest that God definitely does lov... | A wealthy man is complaining to his friend, "Last year, I spent $5... | ✔️ [1.000] |
| 4 | Social Media | Following someone on Twitter and asking them to tweet about someth... | I unfollowed the baker on social media. All his posts were just ri... | |
| 5 | Fitness | I said to the gym instructor: "Can you teach me to do the splits?"... | I told my personal trainer I wanted to get in shape for summer. No... | |
| 6 | Field | Why did the scarecrow win an award? Because he was outstanding in ... | I went to a field the other day. It was a minefield. That's why th... | ✔️ [1.000] |
| 7 | Fashion | Why do cows wear bells? Because their horns don't work. | I tried to put on my skinny jeans, but I think I pulled a muscle..... | |
| 8 | Ghosts | Why do ghosts like elevators? Because it lifts their spirits. | Why did the ghost break up with his girlfriend? She ghosted him. | |
| 9 | Animals | What do you call an alligator in a vest? An investigator. | Why did the chicken cross the road? To get away from his crippling... | ✔️ [1.000] |


### MIPRO Optimization Results

The MIPRO-optimized joke generator scored 49.0% (25/51) on the validation set, compared to the baseline model's 39.2% (20/51). That is an improvement of 9.8 percentage points.

Key observations:
- Baseline model: 20.0/51 (39.2%) accuracy
- MIPRO optimized: 25.0/51 (49.0%) accuracy
- Improvement: +9.8 percentage points (about 25% relative to the baseline)

The optimized program performs better on topics the optimizer hadn't seen, which suggests MIPRO produced instructions and examples that generalize beyond its own evaluation set. The held-out score is still well below the 62.75% reported during optimization, so part of that gain was specific to the examples the optimizer scored against.
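The difference between percentage points and relative improvement is easy to mix up, so here is a small sketch of the arithmetic. It assumes the counts printed in the evaluation logs (20/51 for the baseline, 25/51 for the MIPRO-optimized program); the variable names are illustrative only.

```python
# Counts taken from the evaluation logs in this tutorial.
baseline_correct = 20   # baseline: 20 / 51
optimized_correct = 25  # MIPRO-optimized: 25 / 51
total = 51

baseline_acc = baseline_correct / total
optimized_acc = optimized_correct / total

# Absolute gain, expressed in percentage points.
gain_pp = (optimized_acc - baseline_acc) * 100
# Relative gain, expressed as a percentage of the baseline score.
relative_gain = (optimized_acc - baseline_acc) / baseline_acc * 100

print(f"baseline:  {baseline_acc:.1%}")
print(f"optimized: {optimized_acc:.1%}")
print(f"gain: {gain_pp:.1f} pp absolute, {relative_gain:.0f}% relative")
```

With only 51 examples, each joke is worth about 2 percentage points, so small differences between runs should be read with caution.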



### Examining MIPRO's Optimized Instructions

In [71]:
# Extract and display the optimized prompt
# MIPRO modifies the instructions to improve performance
prompt = {
  name: dspy.ChatAdapter().format(
    p.signature,
    demos=p.demos, 
    inputs={k: f"{{{k}}}" for k in p.signature.input_fields},
  )
  for name, p in mipro_optimized_joke_program.named_predictors()
}['joke_generator.predict']

# Show the optimized instructions
for i, message in enumerate(prompt):
    role = message['role']
    content = message['content']
    
    print(f"{'='*50}")
    print(f"MESSAGE {i+1}: {role.upper()}")
    print(f"{'='*50}")
    print(content)
    print()

MESSAGE 1: SYSTEM
Your input fields are:
1. `topic` (str): The topic of the joke
Your output fields are:
1. `reasoning` (str): 
2. `joke` (str): The joke that is being told
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## topic ## ]]
{topic}

[[ ## reasoning ## ]]
{reasoning}

[[ ## joke ## ]]
{joke}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Generate a humorous joke related to the specified topic, suitable for a general adult audience. Be mindful of potentially sensitive content.

MESSAGE 2: USER
This is an example of the task, though some input or output fields are not supplied.

[[ ## topic ## ]]
Religious Satire

MESSAGE 3: ASSISTANT
[[ ## reasoning ## ]]
Not supplied for this particular example. 

[[ ## joke ## ]]
I respect everybody's beliefs, except Amish people. They are the only ones I can say clearly, 'Their God is wrong.' The speed limit is 75 miles an hour in Ohio, and one lane of 
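
The `{topic}` placeholder in the rendered prompt comes from the `inputs={k: f"{{{k}}}" ...}` argument in the cell above. The sketch below shows how that f-string behaves on its own. The `input_fields` list is a hypothetical stand-in for `p.signature.input_fields`, and no DSPy call is involved.

```python
# Hypothetical field names standing in for `p.signature.input_fields`.
input_fields = ["topic"]

# In an f-string, "{{" and "}}" are literal braces and "{k}" is substituted,
# so f"{{{k}}}" yields the field name wrapped in one pair of braces.
placeholders = {k: f"{{{k}}}" for k in input_fields}
print(placeholders)

# The resulting text works as a template that str.format can fill in later.
template = f"[[ ## topic ## ]]\n{placeholders['topic']}"
print(template.format(topic="Coffee"))
```

Passing placeholders instead of real values is what lets the cell print the prompt skeleton without committing to a specific topic.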

In [72]:
prompt

[{'role': 'system',
  'content': 'Your input fields are:\n1. `topic` (str): The topic of the joke\nYour output fields are:\n1. `reasoning` (str): \n2. `joke` (str): The joke that is being told\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## topic ## ]]\n{topic}\n\n[[ ## reasoning ## ]]\n{reasoning}\n\n[[ ## joke ## ]]\n{joke}\n\n[[ ## completed ## ]]\nIn adhering to this structure, your objective is: \n        Generate a humorous joke related to the specified topic, suitable for a general adult audience. Be mindful of potentially sensitive content.'},
 {'role': 'user',
  'content': 'This is an example of the task, though some input or output fields are not supplied.\n\n[[ ## topic ## ]]\nReligious Satire'},
 {'role': 'assistant',
  'content': "[[ ## reasoning ## ]]\nNot supplied for this particular example. \n\n[[ ## joke ## ]]\nI respect everybody's beliefs, except Amish people. They are the only ones I can say clearly, 'Their

In [68]:
# Save optimized program
mipro_optimized_joke_program.save("./mipro_optimized_joke_program/", save_program=True)
print("Saved optimized joke generator!")

Saved optimized joke generator!


### Programs Saved Successfully!

The optimized program has been saved to a directory:
- `./mipro_optimized_joke_program/` - Contains the MIPRO-optimized joke generator

Because `save_program=True` was passed, this directory holds the whole program, including the optimized prompts and few-shot examples, so it can be reloaded with `dspy.load`.

In [69]:
# Load the saved program
loaded_joke_program = dspy.load("./mipro_optimized_joke_program/")

# Generate a joke about programming
result = loaded_joke_program(topic="Python") 
print(f"\nJoke about programming:\n{result}")


Joke about programming:
Why did the programmer quit his job? Because he didn't get arrays! (Inheritance)


## 14. Comparing Our Joke Generators

Let's test our joke generators side by side:

In [70]:
# Test all our joke generators
test_topics = ["Python", "Coffee", "Exercise"]

for topic in test_topics:
    print(f"\n{'='*50}")
    print(f"TOPIC: {topic}")
    print(f"{'='*50}")
    
    # Basic joke generator
    basic_result = basic_joke_program(topic=topic)
    print(f"\nBASIC: {basic_result.joke}")
    
    # MIPRO optimized
    mipro_result = mipro_optimized_joke_program(topic=topic)
    print(f"\nMIPRO: {mipro_result}")


TOPIC: Python

BASIC: Why do Python programmers prefer dark mode?

Because light attracts bugs!

MIPRO: Why did the programmer quit his job? Because he didn't get arrays! (Inheritance)

TOPIC: Coffee

BASIC: Why did the coffee go to the police?

It got mugged!

MIPRO: I like my coffee how I like myself: dark, bitter, and too hot for you.

TOPIC: Exercise

BASIC: Why did the bicycle fall over? Because it was two tired!

MIPRO: I hate when I lose my motivation to exercise. It's like, where do these extra 10 pounds keep coming from?


### Joke Quality Comparison

Notice the progression in joke quality:

**Basic Generator:**
- Simple puns and wordplay
- Generic dad joke style
- No personality or edge

**MIPRO Optimized:**
- Mix of styles - sometimes puns, sometimes observational
- More relatable and human
- Better understanding of context

The optimization process steered the generator toward the style of the professional comedians' jokes in the training data and away from generic dad jokes.

## Summary: What We Learned

In this tutorial, we explored DSPy's key concepts:

### 1. **Signatures** - Simple Input/Output Declarations
```python
basic_joke_program = dspy.Predict('topic -> joke')
```

### 2. **Chain of Thought** - Automatic Reasoning
```python
cot_joke_program = dspy.ChainOfThought(joke_signature)
```

### 3. **Modules** - Reusable Components
```python
class JokeModule(dspy.Module):
    def forward(self, topic: str) -> str:
        # Your logic here
        ...
```

### 4. **Evaluation** - Systematic Performance Measurement
```python
evaluate = Evaluate(metric=exact_match, devset=valset)
```
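
As a rough mental model, the `Average Metric: 25.0 / 51 (49.0%)` lines in the logs are a sum of per-example metric scores divided by the number of examples. The sketch below reproduces that bookkeeping in plain Python. It is not DSPy's implementation: `toy_program`, `toy_exact_match`, and `average_metric` are hypothetical stand-ins, and no language model is called.

```python
from types import SimpleNamespace


def toy_exact_match(example, pred, trace=None):
    """Toy metric: 1.0 if the predicted joke equals the reference joke."""
    return float(example.joke == pred.joke)


def toy_program(topic):
    """Stand-in for a DSPy program; returns canned jokes instead of calling an LM."""
    canned = {"Coffee": "It got mugged!"}
    return SimpleNamespace(joke=canned.get(topic, ""))


def average_metric(program, devset, metric):
    """Run the program on every example and aggregate the metric scores."""
    scores = [metric(ex, program(topic=ex.topic)) for ex in devset]
    return sum(scores), len(scores), 100 * sum(scores) / len(scores)


devset = [
    SimpleNamespace(topic="Coffee", joke="It got mugged!"),
    SimpleNamespace(topic="Ghosts", joke="You can see right through them."),
]

total, n, pct = average_metric(toy_program, devset, toy_exact_match)
print(f"Average Metric: {total} / {n} ({pct:.1f}%)")
```

Exact match is rarely a good fit for jokes, which is why the tutorial scores them with a judge instead; the aggregation step stays the same either way.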

### 5. **Optimization** - Automatic Prompt Improvement
- **Bootstrap**: Selects effective few-shot examples
- **MIPRO**: Optimizes both instructions and examples
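
The idea behind bootstrapping few-shot examples can be sketched without DSPy: run the program on training inputs, score each output with the metric, and keep the passing input/output pairs as demonstrations. The code below is a simplified illustration under that assumption, not the library's algorithm. `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical names.

```python
from types import SimpleNamespace


def bootstrap_demos(program, trainset, metric, max_demos=2):
    """Collect input/output pairs whose output passes the metric."""
    demos = []
    for ex in trainset:
        pred = program(topic=ex.topic)
        if metric(ex, pred):
            demos.append({"topic": ex.topic, "joke": pred.joke})
        if len(demos) >= max_demos:
            break  # stop once enough demonstrations are collected
    return demos


def toy_program(topic):
    """Stand-in for an LM-backed program; some topics produce no joke."""
    canned = {"Coffee": "It got mugged!", "Fitness": "It was two tired!"}
    return SimpleNamespace(joke=canned.get(topic, ""))


def toy_metric(example, pred, trace=None):
    """Toy pass/fail check: accept any non-empty joke."""
    return bool(pred.joke)


trainset = [SimpleNamespace(topic=t) for t in ["Coffee", "Ghosts", "Fitness"]]
demos = bootstrap_demos(toy_program, trainset, toy_metric)
print(demos)
```

The selected pairs play the role of the few-shot examples that appear as USER/ASSISTANT messages in the optimized prompt.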

### Key Takeaways

1. **No Manual Prompt Engineering**: DSPy handles prompt formatting automatically
2. **Data-Driven Optimization**: Use labeled data to improve performance systematically
3. **Modular Design**: Build complex AI systems from simple, reusable components
4. **Automatic Improvement**: Optimizers can find prompts that score higher on your metric without hand-tuning

### Next Steps

- Try different signatures and modules for your use case
- Create custom evaluation metrics
- Experiment with different optimizers
- Build multi-step AI pipelines with multiple modules
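
As a starting point for a custom metric, here is a minimal sketch. DSPy metrics are plain functions that take `(example, pred, trace=None)`; everything else here is an assumption for illustration, including the scoring rule and the name `short_on_topic_joke`. `SimpleNamespace` stands in for DSPy's example and prediction objects so the snippet runs without the library.

```python
from types import SimpleNamespace


def short_on_topic_joke(example, pred, trace=None):
    """Hypothetical metric: reward jokes that mention the topic and stay short."""
    joke = pred.joke.lower()
    on_topic = example.topic.lower() in joke
    concise = len(joke.split()) <= 30
    score = (on_topic + concise) / 2.0
    # During optimization a non-None trace is passed; a common convention is
    # to return a strict pass/fail there and a graded score otherwise.
    if trace is not None:
        return score == 1.0
    return score


example = SimpleNamespace(topic="Coffee")
good = SimpleNamespace(
    joke="I like my coffee how I like myself: dark, bitter, and too hot for you."
)
off_topic = SimpleNamespace(joke="Why did the bicycle fall over?")

print(short_on_topic_joke(example, good))
print(short_on_topic_joke(example, off_topic))
print(short_on_topic_joke(example, off_topic, trace=[]))
```

A heuristic like this is cheap to run, but it measures topicality and length rather than humor, so it is better suited as a filter alongside a judge than as a replacement for one.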

Happy building with DSPy! 🚀