# DSPy Introduction

## Table of Contents

- What you will learn
- Core Concepts
- Building Blocks
- Recommended Workflow
- Examples
- Roadmap
- References


## What you will learn

1. The core concepts of DSPy
2. The building blocks of DSPy
3. How to use DSPy in practice, with a recommended workflow
4. How to trace DSPy internals
5. The future roadmap of DSPy

## Core Concepts

## Building Blocks

In [26]:
import dspy
from dotenv import load_dotenv

load_dotenv()

True

### Language Models

Notes:
- Earlier versions of DSPy shipped dozens of separate clients for different LM providers. These are deprecated and will be removed in DSPy 2.6. Starting from 2.5, use `dspy.LM` instead, which uses LiteLLM under the hood.
- Inspecting history
- Adapters
    - DSPy 2.5 introduces **Adapters** as a layer between Signatures and LMs, responsible for formatting these pieces (Signature I/O fields, instructions, and examples) as well as generating and parsing the outputs.
- Using `dspy.configure` and `dspy.context` is thread-safe!
- By default, LM calls in DSPy are cached: if you repeat the same call, you get the same output. You can turn caching off by passing `cache=False` when constructing the `dspy.LM` object.
- Any OpenAI-compatible endpoint is easy to set up with an `openai/` prefix as well.
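
To make the Adapter idea more concrete, here is a minimal pure-Python sketch of the two directions an adapter handles: rendering fields into the `[[ ## field ## ]]` layout that appears in the `lm.history` output later in this section, and parsing a completion back into named fields. The helpers `format_fields` and `parse_fields` are hypothetical names written for illustration; this is not DSPy's adapter implementation.

```python
import re

# Hypothetical helpers illustrating the idea; not DSPy's actual adapter code.
FIELD_HEADER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def format_fields(fields: dict) -> str:
    """Render name/value pairs in the `[[ ## name ## ]]` layout."""
    return "\n\n".join(f"[[ ## {name} ## ]]\n{value}" for name, value in fields.items())


def parse_fields(completion: str) -> dict:
    """Split a completion into fields, ignoring the `completed` end marker."""
    parts = FIELD_HEADER.split(completion)
    # parts = [text_before_first_header, name1, body1, name2, body2, ...]
    names, bodies = parts[1::2], parts[2::2]
    return {n: b.strip() for n, b in zip(names, bodies) if n != "completed"}


completion = (
    "[[ ## reasoning ## ]]\nThe question lacks context.\n\n"
    "[[ ## answer ## ]]\nUnknown\n\n"
    "[[ ## completed ## ]]"
)
parsed = parse_fields(completion)
print(parsed["answer"])
```

Keeping formatting and parsing in one layer is what lets the same Signature be reused across LMs without touching the prompt text by hand.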



References:
- documentation: https://dspy-docs.vercel.app/building-blocks/1-language_models/
- source code: https://github.com/stanfordnlp/dspy/blob/main/dspy/clients/lm.py


#### Examples

setting up LLM

In [27]:
lm = dspy.LM(model="gpt-4o-mini")
dspy.configure(lm=lm)

directly calling the LLM (not recommended)

In [28]:
lm("hello!")

['Hello! How can I assist you today?']

In [29]:
# for chat LLMs
lm(messages=[{"role": "system", "content": "You are a helpful assistant."},
             {"role": "user", "content": "What is 2+2?"}])

['2 + 2 equals 4.']

using the llm with DSPy signatures and modules

In [30]:
# Define a module (ChainOfThought) and assign it a signature (return an answer, given a question).
qa = dspy.ChainOfThought('question -> answer')

# Run with the default LM configured with `dspy.configure` above.
response = qa(question="How many floors are in the castle David Gregory inherited?")
print(response.answer)

Insufficient information to determine the number of floors in the castle David Gregory inherited.


using multiple LLMs at once

In [31]:
# Run with the default LM configured above, i.e. GPT-4o-mini
response = qa(question="How many floors are in the castle David Gregory inherited?")
print('gpt-4o-mini:', response.answer)

gpt_4o = dspy.LM(model='gpt-4o', max_tokens=300)

# Run with GPT-4o instead
with dspy.context(lm=gpt_4o):
    response = qa(question="How many floors are in the castle David Gregory inherited?")
    print('gpt-4o:', response.answer)

gpt-4o-mini: Insufficient information to determine the number of floors in the castle David Gregory inherited.
gpt-4o: Unknown


configuring llm attributes

In [32]:
gpt_4o_mini = dspy.LM(
	'gpt-4o-mini',
	temperature=0.9,
	max_tokens=3000,
	stop=None,
	cache=False
)

using locally hosted LLMs

In [33]:
ollama_port = 11434 
ollama_url = f"http://localhost:{ollama_port}"
ollama_llm = dspy.LM(model="ollama/llama3.2:1b", api_base=ollama_url)

inspecting llm output and usage metadata

In [34]:
len(lm.history)

4

In [35]:
for k, v in lm.history[-1].items():
    print(f"{k}: {v}")

prompt: None
messages: [{'role': 'system', 'content': 'Your input fields are:\n1. `question` (str)\n\nYour output fields are:\n1. `reasoning` (str)\n2. `answer` (str)\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## question ## ]]\n{question}\n\n[[ ## reasoning ## ]]\n{reasoning}\n\n[[ ## answer ## ]]\n{answer}\n\n[[ ## completed ## ]]\n\nIn adhering to this structure, your objective is: \n        Given the fields `question`, produce the fields `answer`.'}, {'role': 'user', 'content': '[[ ## question ## ]]\nHow many floors are in the castle David Gregory inherited?\n\nRespond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.'}]
kwargs: {'temperature': 0.0, 'max_tokens': 1000}
response: ModelResponse(id='chatcmpl-AP1gFYUpR7PAotKStTcGdeP89I0dM', choices=[Choices(finish_reason='stop', index=0, message=Messa

#### Creating custom LLM class (Advanced)

Creating a custom LM class is quite straightforward in DSPy. You can inherit from the dspy.LM class or create a new class with a similar interface. You'll need to implement/override these three methods:

- `__init__`: Initialize the LM with the given model and other keyword arguments.
- `__call__`: Call the LM with the given input prompt and return a list of string outputs.
- `inspect_history`: The history of interactions with the LM. This is optional but is needed by some optimizers in DSPy.

```python
import os
import dspy
import google.generativeai as genai

class GeminiLM(dspy.LM):
    def __init__(self, model, api_key=None, endpoint=None, **kwargs):
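        # Prefer an explicit api_key; otherwise fall back to the environment
        # (os.environ[...] would raise KeyError when the variable is unset).
        genai.configure(api_key=api_key or os.environ.get("GEMINI_API_KEY"))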

        self.endpoint = endpoint
        self.history = []

        super().__init__(model, **kwargs)
        self.model = genai.GenerativeModel(model)

    def __call__(self, prompt=None, messages=None, **kwargs):
        # Flatten chat messages into one text prompt; keep a plain `prompt` as-is.
        if messages is not None:
            prompt = '\n\n'.join([x['content'] for x in messages] + ['BEGIN RESPONSE:'])

        completions = self.model.generate_content(prompt)
        self.history.append({"prompt": prompt, "completions": completions})

        # Must return a list of strings
        return [completions.candidates[0].content.parts[0].text]

    def inspect_history(self):
        for interaction in self.history:
            print(f"Prompt: {interaction['prompt']} -> Completions: {interaction['completions']}")

lm = GeminiLM("gemini-1.5-flash", temperature=0)
dspy.configure(lm=lm)

qa = dspy.ChainOfThought("question->answer")
qa(question="What is the capital of France?")
```

#### TODO: Structured LLM output with Adapters (Advanced)

### Signatures

Notes:
- inline-based signature prompt creation
![](https://dspy-docs.vercel.app/deep-dive/signature/img/prompt_creation.png)
- class-based signature prompt creation
![](https://dspy-docs.vercel.app/deep-dive/signature/img/class_based_prompt_creation.png)

References:
- documentation
    - https://dspy-docs.vercel.app/building-blocks/2-signatures/
    - https://dspy-docs.vercel.app/deep-dive/signature/understanding-signatures/
    - https://dspy-docs.vercel.app/deep-dive/signature/executing-signatures/
- source code: https://github.com/stanfordnlp/dspy/tree/main/dspy/signatures


When we assign tasks to LMs in DSPy, we specify the behavior we need as a Signature.

A signature is a declarative specification of input/output behavior of a DSPy module. Signatures allow you to tell the LM what it needs to do, rather than specify how we should ask the LM to do it.

You're probably familiar with function signatures, which specify the input and output arguments and their types. DSPy signatures are similar, but the differences are that:

- While typical function signatures just describe things, DSPy Signatures define and control the behavior of modules.
- The field names matter in DSPy Signatures. You express semantic roles in plain English: a question is different from an answer, a sql_query is different from python_code.

Why should I use a DSPy Signature?

tl;dr For modular and clean code, in which LM calls can be optimized into high-quality prompts (or automatic finetunes).

Long Answer: Most people coerce LMs to do tasks by hacking long, brittle prompts. Or by collecting/generating data for fine-tuning.

Writing signatures is far more modular, adaptive, and reproducible than hacking at prompts or finetunes. The DSPy compiler will figure out how to build a highly-optimized prompt for your LM (or finetune your small LM) for your signature, on your data, and within your pipeline. In many cases, we found that compiling leads to better prompts than humans write. Not because DSPy optimizers are more creative than humans, but simply because they can try more things and tune the metrics directly.

#### Inline DSPy Signatures

Signatures can be defined as a short string, with argument names that define semantic roles for inputs/outputs.

1. `Question Answering: "question -> answer"`
2. `Sentiment Classification: "sentence -> sentiment"`
3. `Summarization: "document -> summary"`

Your signatures can also have multiple input/output fields.

1. `Retrieval-Augmented Question Answering: "context, question -> answer"`
2. `Multiple-Choice Question Answering with Reasoning: "question, choices -> reasoning, selection"`

Tip: For fields, any valid variable names work! Field names should be semantically meaningful, but start simple and don't prematurely optimize keywords! Leave that kind of hacking to the DSPy compiler. For example, for summarization, it's probably fine to say "document -> summary", "text -> gist", or "long_context -> tldr".
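
To show how little structure an inline signature carries, the following sketch splits a signature string into its input and output field names. `split_signature` is a hypothetical helper written for illustration; it is not DSPy's parser, and it handles only plain field names.

```python
def split_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c, d' into (['a', 'b'], ['c', 'd'])."""
    inputs, sep, outputs = signature.partition("->")
    if not sep:
        raise ValueError("expected 'inputs -> outputs'")
    input_names = [name.strip() for name in inputs.split(",") if name.strip()]
    output_names = [name.strip() for name in outputs.split(",") if name.strip()]
    # Field names must be valid variable names, as the tip above notes.
    for name in input_names + output_names:
        if not name.isidentifier():
            raise ValueError(f"invalid field name: {name!r}")
    return input_names, output_names


print(split_signature("context, question -> answer"))
```

The field names are the only semantic signal in the string, which is why `question -> answer` and `document -> summary` describe different tasks despite having the same shape.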

Notes:
- Many DSPy modules (except `dspy.Predict`) return auxiliary information by expanding your signature under the hood. For example, `dspy.ChainOfThought` also adds a `reasoning` field that holds the LM's reasoning before it generates the output `summary`.

In [37]:
sentence = "it's a charming and often affecting journey."  # example from the SST-2 dataset.

classify = dspy.Predict('sentence -> sentiment')
classify(sentence=sentence).sentiment

'positive'

In [38]:
# Example from the XSum dataset.
document = """The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."""

summarize = dspy.ChainOfThought('document -> summary')
response = summarize(document=document)

print(response.summary)

Lee, a 21-year-old footballer, made seven appearances for the Hammers, scoring once in a Europa League match. He had loan spells at Blackpool and Colchester United, scoring twice for Colchester, but could not prevent their relegation. His contract details with the Tykes remain undisclosed.


#### Class-based DSPy Signatures

For some advanced tasks, you need more verbose signatures. This is typically to:

1. Clarify something about the nature of the task (expressed below as a docstring).
2. Supply hints on the nature of an input field, expressed as a `desc` keyword argument for `dspy.InputField`.
3. Supply constraints on an output field, expressed as a `desc` keyword argument for `dspy.OutputField`.

Tips:
- There's nothing wrong with specifying your requests to the LM more clearly. Class-based Signatures help you with that. However, don't prematurely tune the keywords of your signature by hand. The DSPy optimizers will likely do a better job (and will transfer better across LMs).
- How `Predict` works:
    - https://dspy-docs.vercel.app/deep-dive/signature/executing-signatures/#how-predict-works
    - source code: https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/predict.py

In [39]:
class Emotion(dspy.Signature):
    """Classify emotion among sadness, joy, love, anger, fear, surprise."""

    sentence = dspy.InputField()
    sentiment = dspy.OutputField()

sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"  # from dair-ai/emotion

classify = dspy.Predict(Emotion)
classify(sentence=sentence)

Prediction(
    sentiment='fear'
)

#### Using signatures to build modules & compiling them

While signatures are convenient for prototyping with structured inputs/outputs, that's not the main reason to use them!

You should compose multiple signatures into bigger DSPy modules and compile these modules into optimized prompts and finetunes.

### Modules

Notes:
- A DSPy module is a building block for programs that use LMs.
    - Each built-in module abstracts a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any DSPy Signature.
    - A DSPy module has learnable parameters (i.e., the little pieces comprising the prompt and the LM weights) and can be invoked (called) to process inputs and return outputs.
    - Multiple modules can be composed into bigger modules (programs). DSPy modules are inspired directly by NN modules in PyTorch, but applied to LM programs.
- What other DSPy modules are there? How can I use them?
    - The others are very similar. They mainly change the internal behavior with which your signature is implemented!
        - `dspy.Predict`: Basic predictor. Does not modify the signature. Handles the key forms of learning (i.e., storing the instructions and demonstrations and updates to the LM).
        - `dspy.ChainOfThought`: Teaches the LM to think step-by-step before committing to the signature's response.
        - `dspy.ProgramOfThought`: Teaches the LM to output code, whose execution results will dictate the response.
        - `dspy.ReAct`: An agent that can use tools to implement the given signature.
        - `dspy.MultiChainComparison`: Can compare multiple outputs from ChainOfThought to produce a final prediction.
    - We also have some function-style modules:
        - `dspy.majority`: Can do basic voting to return the most popular response from a set of predictions.
- How do I compose multiple modules into a bigger program?
    - DSPy is just Python code that uses modules in any control flow you like. (There's some magic internally at `compile` time to trace your LM calls.)
    - This means you can call the modules freely, with no special abstractions for chaining calls.
    - This is basically PyTorch's design approach for define-by-run / dynamic computation graphs. Refer to the intro tutorials for examples.
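
As a rough illustration of what a function-style module such as `dspy.majority` does with several completions, here is a pure-Python voting sketch. The `normalize` and `majority_answer` helpers are hypothetical stand-ins, and DSPy's own normalization rules may differ.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def majority_answer(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    counts = Counter(normalize(a) for a in answers)
    top = counts.most_common(1)[0][0]
    # Return the first original string that matches the winning normalized form.
    return next(a for a in answers if normalize(a) == top)


print(majority_answer(["Paris", "paris ", "Lyon"]))
```

Voting like this works best when the outputs are short and comparable, such as labels or short-form answers; long-form outputs rarely repeat verbatim.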

References:
- documentation:
    - https://dspy-docs.vercel.app/building-blocks/3-modules/
    - https://dspy-docs.vercel.app/deep-dive/modules/guide
- source code:
    - https://github.com/stanfordnlp/dspy/blob/main/dspy/primitives/module.py
    - https://github.com/stanfordnlp/dspy/tree/main/dspy/predict

In [49]:
question = "What's something great about the ColBERT retrieval model?"

# 1) Declare with a signature, and pass some config.
classify = dspy.ChainOfThought('question -> answer', n=5)

# 2) Call with input argument.
response = classify(question=question)

# 3) Access the outputs.
response.completions.answer

['One great aspect of the ColBERT retrieval model is its ability to efficiently combine dense and sparse retrieval approaches, allowing for high accuracy in document retrieval while maintaining fast performance through late interaction.',
 'One great aspect of the ColBERT retrieval model is its use of late interaction, which allows it to efficiently combine the strengths of dense and traditional retrieval methods, resulting in fast and accurate search performance.',
 'One great aspect of the ColBERT retrieval model is its efficient use of contextual embeddings, enabling high-quality information retrieval while maintaining fast performance through a late interaction mechanism.',
 'One great thing about the ColBERT retrieval model is its ability to efficiently combine dense and sparse retrieval techniques, allowing for fast and accurate document retrieval through late interaction mechanisms.',
 'One great aspect of the ColBERT retrieval model is its ability to combine efficiency with hig

In [50]:
response

Prediction(
    reasoning='The ColBERT retrieval model stands out because it efficiently combines the strengths of dense and sparse retrieval methods. It uses late interaction for fast retrieval while preserving the contextual information from dense embeddings. This allows it to achieve high accuracy and relevance in document retrieval tasks while maintaining speed, making it suitable for large-scale applications. Additionally, its ability to leverage pre-trained language models enhances its performance in understanding and retrieving relevant documents based on complex queries.',
    answer='One great aspect of the ColBERT retrieval model is its ability to efficiently combine dense and sparse retrieval approaches, allowing for high accuracy in document retrieval while maintaining fast performance through late interaction.',
    completions=Completions(...)
) (4 completions omitted)

In [51]:
response.completions

Completions(
    reasoning=['The ColBERT retrieval model stands out because it efficiently combines the strengths of dense and sparse retrieval methods. It uses late interaction for fast retrieval while preserving the contextual information from dense embeddings. This allows it to achieve high accuracy and relevance in document retrieval tasks while maintaining speed, making it suitable for large-scale applications. Additionally, its ability to leverage pre-trained language models enhances its performance in understanding and retrieving relevant documents based on complex queries.', 'The ColBERT retrieval model is notable for its ability to efficiently combine the benefits of both traditional dense retrieval and modern transformer-based architectures. One of the great features of ColBERT is its use of late interaction, allowing it to process queries and documents separately and then interact them at retrieval time. This approach significantly speeds up the retrieval process while maint

### Data

Notes:
- DSPy is a machine learning framework, so working in it involves training sets, development sets, and test sets.
    - For each example in your data, we typically distinguish between three types of values: the inputs, the intermediate labels, and the final label. You can use DSPy effectively without any intermediate or final labels, but you will need at least a few example inputs.
- How much data do I need and how do I collect data for my task?
    - Concretely, you can use DSPy optimizers usefully with as few as 10 example inputs, but having 50-100 examples (or even better, 300-500 examples) goes a long way.
    - How can you get examples like these? If your task is extremely unusual, please invest in preparing ~10 examples by hand. Often, depending on your metric below, you just need inputs and not labels, so it's not that hard.
    - However, chances are that your task is not actually that unique. You can almost always find somewhat adjacent datasets on, say, HuggingFace datasets or other forms of data that you can leverage here.
    - If there is data whose license is permissive enough, we suggest you use it. Otherwise, you can also start using/deploying/demoing your system and collect some initial data that way.
- DSPy `Example` objects
    - The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set.
    - DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special sub-class of Example.
    - When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual datapoints will be of type `Example`.
- Loading Dataset from sources
    - One of the most convenient ways to import datasets in DSPy is `DataLoader`. First create a `DataLoader` object, then call one of its utilities to load datasets in different formats:
        - `DataLoader().from_csv(...)`
        - `DataLoader().from_json(...)`
        - `DataLoader().from_parquet(...)`
        - `DataLoader().from_pandas(...)`
        - `DataLoader().from_huggingface(...)`

References:
- documentation:
    - https://dspy-docs.vercel.app/building-blocks/4-data/
    - https://dspy-docs.vercel.app/deep-dive/data-handling/built-in-datasets/
    - https://dspy-docs.vercel.app/deep-dive/data-handling/loading-custom-data/
- source code:
    - https://github.com/stanfordnlp/dspy/blob/main/dspy/primitives/example.py
    - https://github.com/stanfordnlp/dspy/blob/main/dspy/datasets/dataset.py
    - https://github.com/stanfordnlp/dspy/blob/main/dspy/datasets/dataloader.py

In [52]:
qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")

print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.


In [53]:
# Single Input.
print(qa_pair.with_inputs("question"))

# Multiple Inputs; be careful about marking your labels as inputs unless you mean it.
print(qa_pair.with_inputs("question", "answer"))

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys={'question'})
Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys={'question', 'answer'})


In [54]:
article_summary = dspy.Example(article= "This is an article.", summary= "This is a summary.").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example object with Input fields only:", input_key_only)
print("Example object with Non-Input fields only:", non_input_key_only)

Example object with Input fields only: Example({'article': 'This is an article.'}) (input_keys={'article'})
Example object with Non-Input fields only: Example({'summary': 'This is a summary.'}) (input_keys=None)


In [55]:
from dspy.datasets import DataLoader

dl = DataLoader()

blog_alpaca = dl.from_huggingface(
    "intertwine-expel/expel-blog",
    input_keys=("title",)
)

Downloading data: 100%|██████████| 233/233 [00:07<00:00, 32.58files/s]
Generating train split: 100%|██████████| 233/233 [00:00<00:00, 2683.53 examples/s]


In [56]:
blog_alpaca

  Example({'title': '2023 Great eXpeltations report: top six findings', 'url': 'https://expel.com/blog/2023-great-expeltations-report-top-six-findings/', 'date': 'Jan 31, 2023', 'contents': 'Subscribe × EXPEL BLOG 2023 Great eXpeltations report: top six findings Security operations · 2 MIN READ · BEN BRIGIDA · JAN 31, 2023 · TAGS: MDR Bad news: 2022 was a big year in cybersecurity. Good news: We stopped a lot of attacks. Better news: We sure learned a lot, didn’t we? We just released our Great eXpeltations annual report, which details the major trends we saw in the security operations center (SOC) last year…and what you can do about them this year. You can grab your copy now , and here’s a taste of what you’ll find. Top findings from the Great eXpeltations report 1: Business email compromise (BEC) accounted for half of all incidents, and remains the top threat facing our customers. This finding is consistent with what we saw in 2021. Key numbers: Of the BEC attempts we identified: more

#### Advanced: Inside DSPy's `Dataset` class

![](https://dspy-docs.vercel.app/deep-dive/data-handling/img/data-loading.png)

```python
import random

from datasets import load_dataset

from dspy.datasets.dataset import Dataset


class HotPotQA(Dataset):
    def __init__(self, *args, only_hard_examples=True, keep_details='dev_titles', unofficial_dev=True, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        assert only_hard_examples, "Care must be taken when adding support for easy examples." \
                                   "Dev must be all hard to match official dev, but training can be flexible."
        
        hf_official_train = load_dataset("hotpot_qa", 'fullwiki', split='train', trust_remote_code=True)
        hf_official_dev = load_dataset("hotpot_qa", 'fullwiki', split='validation', trust_remote_code=True)

        official_train = []
        for raw_example in hf_official_train:
            if raw_example['level'] == 'hard':
                if keep_details is True:
                    keys = ['id', 'question', 'answer', 'type', 'supporting_facts', 'context']
                elif keep_details == 'dev_titles':
                    keys = ['question', 'answer', 'supporting_facts']
                else:
                    keys = ['question', 'answer']

                example = {k: raw_example[k] for k in keys}
                
                if 'supporting_facts' in example:
                    example['gold_titles'] = set(example['supporting_facts']['title'])
                    del example['supporting_facts']

                official_train.append(example)

        rng = random.Random(0)
        rng.shuffle(official_train)

        self._train = official_train[:len(official_train)*75//100]

        if unofficial_dev:
            self._dev = official_train[len(official_train)*75//100:]
        else:
            self._dev = None

        for example in self._train:
            if keep_details == 'dev_titles':
                del example['gold_titles']
        
        test = []
        for raw_example in hf_official_dev:
            assert raw_example['level'] == 'hard'
            example = {k: raw_example[k] for k in ['id', 'question', 'answer', 'type', 'supporting_facts']}
            if 'supporting_facts' in example:
                example['gold_titles'] = set(example['supporting_facts']['title'])
                del example['supporting_facts']
            test.append(example)

        # Store the official dev examples as the test split.
        self._test = test

```
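
The shuffle-and-split step in the class above can be isolated into a small pure-Python sketch. It mirrors the seeded `random.Random(0)` shuffle and the 75/25 integer split used there; `split_train_dev` is a hypothetical helper name, not part of DSPy.

```python
import random


def split_train_dev(examples: list, train_percent: int = 75, seed: int = 0):
    """Shuffle deterministically, then cut into train and dev portions."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


train, dev = split_train_dev(list(range(10)))
print(len(train), len(dev))
```

Using a dedicated `random.Random(seed)` instance rather than the global RNG keeps the split reproducible regardless of what other code does with `random`.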

### Metrics

Notes:
- What is a metric and how do I define a metric for my task?
    - A metric is just a function that will take examples from your data and the output of your system and return a score that quantifies how good the output is.
    - For simple tasks, this could be just "accuracy" or "exact match" or "F1 score". This may be the case for simple classification or short-form QA tasks. However, for most applications, your system will output long-form outputs. There, your metric should probably be a smaller DSPy program that checks multiple properties of the output (quite possibly using AI feedback from LMs).

References:
- documentation: https://dspy-docs.vercel.app/building-blocks/5-metrics/
- source code: https://github.com/stanfordnlp/dspy/tree/main/dspy/evaluate

#### Simple Metrics

In [57]:
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
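
Exact match can be too strict for short-form QA, where a correct answer may differ by an article or a word. A token-level F1 metric is a common softer alternative. The sketch below keeps the same `(example, pred, trace=None)` signature and uses `types.SimpleNamespace` objects as stand-ins for `dspy.Example` and `dspy.Prediction`, so it runs without an LM; `token_f1` is a hypothetical name.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1(example, pred, trace=None) -> float:
    """Token-overlap F1 between the gold answer and the predicted answer."""
    gold = example.answer.lower().split()
    guess = pred.answer.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gold) & Counter(guess)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


example = SimpleNamespace(answer="the Eiffel Tower")
pred = SimpleNamespace(answer="Eiffel Tower")
score = token_f1(example, pred)
print(round(score, 2))
```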

#### Evaluation

```python
from dspy.evaluate import Evaluate

# Set up the evaluator, which can be re-used in your code.
evaluator = Evaluate(devset=YOUR_DEVSET, num_threads=1, display_progress=True, display_table=5)

# Launch evaluation.
evaluator(YOUR_PROGRAM, metric=YOUR_METRIC)
```

#### Intermediate: Using AI feedback for your metric

For most applications, your system will output long-form outputs, so your metric should check multiple dimensions of the output using AI feedback from LMs.

```python
# Define the signature for automatic assessments.
class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")


# Use `dspy.LM` here; `dspy.OpenAI` is one of the deprecated provider clients.
gpt4T = dspy.LM('gpt-4-1106-preview', max_tokens=1000)

def metric(gold, pred, trace=None):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    with dspy.context(lm=gpt4T):
        correct =  dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
        engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

    correct, engaging = [m.assessment_answer.lower() == 'yes' for m in [correct, engaging]]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    if trace is not None: return score >= 2
    return score / 2.0
```

#### Advanced: Using a DSPy program as your metric

If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) your metric itself. That's usually easy because the output of the metric is usually a simple value (e.g., a score out of 5) so the metric's metric is easy to define and optimize by collecting a few examples.

When your metric is used during evaluation runs, DSPy will not try to track the steps of your program.

But during compiling (optimization), DSPy will trace your LM calls. The trace will contain inputs/outputs to each DSPy predictor and you can leverage that to validate intermediate steps for optimization.

### Optimizers (formerly Teleprompters)

Notes:
- A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics you specify, like accuracy.
    - There are many built-in optimizers in DSPy, which apply vastly different strategies. A typical DSPy optimizer takes three things:
        - Your DSPy program. This may be a single module (e.g., `dspy.Predict`) or a complex multi-module program.
        - Your metric. This is a function that evaluates the output of your program, and assigns it a score (higher is better).
        - A few training inputs. This may be very small (i.e., only 5 or 10 examples) and incomplete (only inputs to your program, without any labels).
- What does a DSPy Optimizer tune? How does it tune them?
    - DSPy programs consist of multiple calls to LMs, stacked together as DSPy modules. Each DSPy module has internal parameters of three kinds: (1) the LM weights, (2) the instructions, and (3) demonstrations of the input/output behavior.
    - Given a metric, DSPy can optimize all of these three with multi-stage optimization algorithms. These can combine gradient descent (for LM weights) and discrete LM-driven optimization, i.e. for crafting/updating instructions and for creating/validating demonstrations. DSPy Demonstrations are like few-shot examples, but they're far more powerful. They can be created from scratch, given your program, and their creation and selection can be optimized in many effective ways.
    - In many cases, we found that compiling leads to better prompts than human writing. Not because DSPy optimizers are more creative than humans, but simply because they can try more things, much more systematically, and tune the metrics directly.
- Which optimizer should I use?
    - Ultimately, finding the ‘right’ optimizer to use & the best configuration for your task will require experimentation. Success in DSPy is still an iterative process - getting the best performance on your task will require you to explore and iterate.
    - That being said, here's the general guidance on getting started:
        - If you have very few examples (around 10), start with `BootstrapFewShot`.
        - If you have more data (50 examples or more), try `BootstrapFewShotWithRandomSearch`.
        - If you prefer to do instruction optimization only (i.e. you want to keep your prompt 0-shot), use `MIPROv2` configured for 0-shot optimization.
        - If you’re willing to use more inference calls to perform longer optimization runs (e.g. 40 trials or more), and have enough data (e.g. 200 examples or more to prevent overfitting) then try `MIPROv2`.
        - If you have been able to use one of these with a large LM (e.g., 7B parameters or above) and need a very efficient program, finetune a small LM for your task with `BootstrapFinetune`.
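
The guidance above can be condensed into a small decision helper. This is only a rough encoding of the thresholds stated in the list (about 10, 50, and 200 examples); `suggest_optimizer` and its parameter names are hypothetical, and it deliberately leaves out `BootstrapFinetune`, which is a follow-up step rather than a starting point.

```python
def suggest_optimizer(num_examples: int,
                      zero_shot_only: bool = False,
                      long_runs_ok: bool = False) -> str:
    """Map the rough guidance to an optimizer name; a heuristic, not a rule."""
    if zero_shot_only:
        return "MIPROv2 (0-shot)"
    if long_runs_ok and num_examples >= 200:
        return "MIPROv2"
    if num_examples >= 50:
        return "BootstrapFewShotWithRandomSearch"
    return "BootstrapFewShot"


print(suggest_optimizer(10))
print(suggest_optimizer(300, long_runs_ok=True))
```

Treat the result as a starting point for experimentation; the best optimizer and configuration still depend on the task.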


References:
- documentation: https://dspy-docs.vercel.app/building-blocks/6-optimizers/
- source code: https://github.com/stanfordnlp/dspy/tree/main/dspy/teleprompt

#### Automatic Few-Shot Learning

1. `LabeledFewShot`: Simply constructs few-shot examples (demos) from provided labeled input and output data points. Requires `k` (the number of examples for the prompt) and a `trainset` from which to randomly select `k` examples.

2. `BootstrapFewShot`: Uses a teacher module (which defaults to your program) to generate complete demonstrations for every stage of your program, along with labeled examples in `trainset`. Parameters include `max_labeled_demos` (the number of demonstrations randomly selected from the trainset) and `max_bootstrapped_demos` (the number of additional examples generated by the teacher). The bootstrapping process uses the metric to validate demonstrations, including only those that pass the metric in the "compiled" prompt. Advanced: for harder tasks, the teacher can be a different DSPy program with a compatible structure.

3. `BootstrapFewShotWithRandomSearch`: Applies `BootstrapFewShot` several times with random search over the generated demonstrations and selects the best program found. Parameters mirror those of `BootstrapFewShot`, plus `num_candidate_programs`, the number of random programs evaluated during optimization. The candidates include the uncompiled program, the `LabeledFewShot`-optimized program, the `BootstrapFewShot`-compiled program with unshuffled examples, and `num_candidate_programs` `BootstrapFewShot`-compiled programs with randomized example sets.

4. `KNNFewShot`: Uses the k-Nearest Neighbors algorithm to find the training examples nearest to a given input example. These nearest-neighbor demonstrations are then used as the trainset for the `BootstrapFewShot` optimization process.
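
To make the bootstrapping idea concrete, here is a toy sketch in plain Python. It is not DSPy's implementation: `bootstrap_demos`, `toy_teacher`, and `exact_match` are hypothetical names, the teacher is an ordinary function standing in for an LM-backed program, and drawing labeled demos only from examples that were not bootstrapped is an assumption of this sketch.

```python
import random


def bootstrap_demos(teacher, trainset, metric,
                    max_bootstrapped_demos=4, max_labeled_demos=2, seed=0):
    """Toy version of demo bootstrapping; not DSPy's implementation."""
    bootstrapped, used = [], set()
    for idx, example in enumerate(trainset):
        if len(bootstrapped) >= max_bootstrapped_demos:
            break
        prediction = teacher(example["question"])
        # Keep only teacher outputs that the metric accepts.
        if metric(example, prediction):
            bootstrapped.append({"question": example["question"], "answer": prediction})
            used.add(idx)
    # Fill in with labeled examples that were not bootstrapped.
    remaining = [ex for i, ex in enumerate(trainset) if i not in used]
    rng = random.Random(seed)
    labeled = rng.sample(remaining, min(max_labeled_demos, len(remaining)))
    return bootstrapped, labeled


trainset = [
    {"question": "1+1", "answer": "2"},
    {"question": "2+2", "answer": "4"},
    {"question": "3+3", "answer": "6"},
]


def toy_teacher(question):
    # Stand-in for an LM call: deliberately wrong on "3+3".
    return {"1+1": "2", "2+2": "4", "3+3": "7"}[question]


def exact_match(example, prediction):
    return example["answer"] == prediction


demos, labeled = bootstrap_demos(toy_teacher, trainset, exact_match)
print(demos)    # teacher outputs that passed the metric
print(labeled)  # labeled examples used to fill the remaining slots
```

The wrong teacher answer for "3+3" is filtered out by the metric, which is the role the metric plays during bootstrapping.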

#### Automatic Instruction Optimization

1. `COPRO`: Generates and refines new instructions for each step, and optimizes them with coordinate ascent (hill-climbing using the metric function and the trainset). Parameters include `depth`, the number of prompt-improvement iterations the optimizer runs.

2. `MIPROv2`: Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules.
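
The hill-climbing idea behind instruction optimization can be sketched for a single step as follows. This is a toy illustration, not COPRO itself: `hill_climb_instruction`, `toy_propose`, and `toy_score` are hypothetical, the proposer is a fixed rule rather than an LM, and the score is a keyword count rather than a metric evaluated on a trainset.

```python
def hill_climb_instruction(initial, propose, score, depth=3):
    """Greedy search over instructions for a single step (toy sketch)."""
    best, best_score = initial, score(initial)
    for _ in range(depth):
        improved = False
        for candidate in propose(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
        if not improved:
            # No candidate beat the current instruction; stop early.
            break
    return best, best_score


HINTS = ["Think step by step.", "Answer concisely.", "Cite the context."]


def toy_propose(instruction):
    # Stand-in for an LM proposing rewrites: append one unused hint.
    return [f"{instruction} {hint}" for hint in HINTS if hint not in instruction]


def toy_score(instruction):
    # Stand-in for evaluating the program on a trainset with a metric.
    return sum(hint in instruction for hint in HINTS)


best, best_score = hill_climb_instruction(
    "Answer the question.", toy_propose, toy_score, depth=2
)
print(best, best_score)
```

Here `depth` plays the same role as described above: each iteration proposes new candidates from the current best instruction and keeps any that score higher.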

#### Automatic Finetuning

1. `BootstrapFinetune`: Distills a prompt-based DSPy program into weight updates (for smaller LMs). The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.


#### Program Transformations

1. `Ensemble`: Ensembles a set of DSPy programs and either uses the full set or randomly samples a subset into a single program.
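
As a rough illustration of ensembling, the sketch below combines plain Python callables and reduces their outputs by majority vote. `make_ensemble` and `majority` are hypothetical helpers written for this example and do not reproduce DSPy's `Ensemble` API.

```python
import random
from collections import Counter


def make_ensemble(programs, reduce_fn=None, size=None, seed=0):
    """Combine several callables into one (toy sketch, not DSPy's Ensemble)."""
    rng = random.Random(seed)

    def ensembled(*args, **kwargs):
        # Use the full set, or a random subset of `size` programs.
        chosen = rng.sample(programs, size) if size else programs
        outputs = [program(*args, **kwargs) for program in chosen]
        return reduce_fn(outputs) if reduce_fn else outputs

    return ensembled


def majority(outputs):
    # Most frequent output wins.
    return Counter(outputs).most_common(1)[0][0]


programs = [lambda q: "Paris", lambda q: "Paris", lambda q: "Lyon"]
vote = make_ensemble(programs, reduce_fn=majority)
print(vote("Capital of France?"))
```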


#### Saving and loading optimizer output


Saving a program: The resulting file is in plain-text JSON format. It contains all the parameters and steps in the source program. You can always read it and see what the optimizer generated.

Pass `save_field_meta=True` to additionally save the list of fields with the keys `name`, `field_type`, `description`, and `prefix`: `optimized_program.save(YOUR_SAVE_PATH, save_field_meta=True)`.

```python
optimized_program.save(YOUR_SAVE_PATH)
```

Loading a program:

```python
loaded_program = YOUR_PROGRAM_CLASS()
loaded_program.load(path=YOUR_SAVE_PATH)
```
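
Because the saved file is plain JSON, it can be inspected with the standard library. The sketch below assumes only that the top level of the file is a JSON object; `summarize_saved_program` is a hypothetical helper, and the stand-in file it reads uses made-up keys that do not reflect the real layout of a DSPy save file.

```python
import json
import os
import tempfile


def summarize_saved_program(path):
    """Return the top-level keys of a saved program file and their value types."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return {key: type(value).__name__ for key, value in state.items()}


# Stand-in file so the sketch runs on its own; the keys are made up
# and do not reflect the real layout of a DSPy save file.
stand_in = {"step_one": {"demos": []}, "step_two": {"demos": []}}
path = os.path.join(tempfile.mkdtemp(), "program.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(stand_in, f, indent=2)

print(summarize_saved_program(path))
```

In practice, point `summarize_saved_program` at `YOUR_SAVE_PATH` instead of the stand-in file.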

### Assertions

Notes:
- What are DSPy Assertions, and why use them?
    - Despite the growth of techniques like fine-tuning or “prompt engineering”, these approaches are extremely tedious and rely on heavy, manual hand-waving to guide the LMs in adhering to specific constraints. Even DSPy's modularity of programming prompting pipelines lacks mechanisms to effectively and automatically enforce these constraints.
    - To address this, we introduce DSPy Assertions, a feature within the DSPy framework designed to automate the enforcement of computational constraints on LMs. DSPy Assertions empower developers to guide LMs towards desired outcomes with minimal manual intervention, enhancing the reliability, predictability, and correctness of LM outputs.
- `dspy.Assert` and `dspy.Suggest` API
    - when a constraint is not met:
        - Backtracking Mechanism: An under-the-hood backtracking is initiated, offering the model a chance to self-refine and proceed. This is done through dynamic signature modification.
        - Dynamic Signature Modification: DSPy internally modifies your program’s Signature by adding the following fields:
            - Past Output: your model's past output that did not pass the validation_fn
            - Instruction: your user-defined feedback message on what went wrong and what possibly to fix
        - If the error continues past `max_backtracking_attempts`, `dspy.Assert` halts the pipeline execution and alerts you with a `dspy.AssertionError`. This ensures your program doesn't continue executing with “bad” LM behavior and immediately highlights sample failure outputs for user assessment.
    - `dspy.Suggest` vs. `dspy.Assert`:
        - `dspy.Suggest`, on the other hand, offers a softer approach. It keeps the same retry backtracking as `dspy.Assert` but serves as a gentle nudge. If the model outputs cannot pass the constraints after `max_backtracking_attempts`, `dspy.Suggest` logs the persistent failure and continues executing the program on the rest of the data. This ensures the LM pipeline works in a "best-effort" manner without halting execution.
        - `dspy.Suggest` statements are best used as "helpers" during the evaluation phase, offering guidance and potential corrections without halting the pipeline.
        - `dspy.Assert` statements are recommended during the development stage as "checkers" to ensure the LM behaves as expected, providing a robust mechanism for identifying and addressing errors early in the development cycle.
- If you are doing a comparative evaluation of the effect of assertions, it is recommended to define the program with assertions separately from your original program. If not, feel free to set Assertions away!
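
The retry behaviour described in these notes can be sketched in plain Python. This is a toy model under stated assumptions, not DSPy's implementation: `run_with_backtracking`, `ToyAssertionError`, and `toy_generate` are hypothetical, and the "LM" is a function that only shortens its output once it receives feedback.

```python
import logging


class ToyAssertionError(Exception):
    """Stand-in for the error raised when a hard constraint keeps failing."""


def run_with_backtracking(generate, validate, feedback, max_backtracks=2, hard=True):
    past_output, instruction = None, None
    for _ in range(max_backtracks + 1):
        # On retries, the failed output and the feedback message are passed back in.
        output = generate(past_output=past_output, instruction=instruction)
        if validate(output):
            return output
        past_output, instruction = output, feedback
    if hard:
        # Assert-like behaviour: halt the pipeline.
        raise ToyAssertionError(feedback)
    # Suggest-like behaviour: log the failure and continue with the last output.
    logging.warning("Constraint still failing after retries: %s", feedback)
    return output


def toy_generate(past_output=None, instruction=None):
    # Stand-in for an LM call that only shortens its query after feedback.
    return "short query" if instruction else "a very long query " * 10


result = run_with_backtracking(
    toy_generate,
    validate=lambda query: len(query) <= 100,
    feedback="Query should be short and less than 100 characters",
)
print(result)
```

With `hard=True` the helper mirrors the halting behaviour described for `dspy.Assert`; with `hard=False` it mirrors the best-effort behaviour described for `dspy.Suggest`.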

References:
- documentation: https://dspy-docs.vercel.app/deep-dive/assertions/
- source code: https://github.com/stanfordnlp/dspy/blob/main/dspy/primitives/assertions.py

#### Use Case: Including Assertions in DSPy Programs

```python
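# GenerateSearchQuery, GenerateAnswer, validate_query_distinction_local,
# all_queries_distinct, and deduplicate are assumed to be defined elsewhere.
class SimplifiedBaleenAssertions(dspy.Module):
    def __init__(self, passages_per_hop=2, max_hops=2):
        super().__init__()
        # One query generator per hop.
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
        # Counts forward passes in which every generated query was distinct.
        self.passed_suggestions = 0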

    def forward(self, question):
        context = []
        prev_queries = [question]

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query

            dspy.Suggest(
                len(query) <= 100,
                "Query should be short and less than 100 characters",
                target_module=self.generate_query
            )

            dspy.Suggest(
                validate_query_distinction_local(prev_queries, query),
                "Query should be distinct from: "
                + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
                target_module=self.generate_query
            )

            prev_queries.append(query)
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        if all_queries_distinct(prev_queries):
            self.passed_suggestions += 1

        pred = self.generate_answer(context=context, question=question)
        pred = dspy.Prediction(context=context, answer=pred.answer)
        return pred
```
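
The module above calls three helpers that this section does not define. The following are hypothetical stand-ins so the control flow can be followed end to end; they assume exact matching after lowercasing and whitespace normalization, whereas the helpers used with the original program may rely on fuzzier matching.

```python
def deduplicate(items):
    """Drop repeated items while keeping first-seen order."""
    return list(dict.fromkeys(items))


def _normalize(text):
    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def validate_query_distinction_local(previous_queries, query):
    """True if `query` does not repeat any earlier query after normalizing."""
    return _normalize(query) not in {_normalize(q) for q in previous_queries}


def all_queries_distinct(queries):
    """True if no two queries are the same after normalizing."""
    normalized = [_normalize(q) for q in queries]
    return len(normalized) == len(set(normalized))


print(deduplicate(["a", "b", "a"]))
print(validate_query_distinction_local(["Who wrote Hamlet?"], "who wrote  hamlet?"))
```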

```python
import functools

from dspy.primitives.assertions import assert_transform_module, backtrack_handler

# Wrap the program so that failed assertions/suggestions trigger backtracking.
baleen_with_assertions = assert_transform_module(SimplifiedBaleenAssertions(), backtrack_handler)

# backtrack_handler is parameterized over a few settings for the backtracking mechanism.
# To change the maximum number of retry attempts, you can do:
baleen_with_assertions_retry_once = assert_transform_module(
    SimplifiedBaleenAssertions(),
    functools.partial(backtrack_handler, max_backtracks=1),
)
```

#### Assertion-Driven Optimizations

DSPy Assertions work with the optimizations that DSPy offers, particularly `BootstrapFewShotWithRandomSearch`, in two settings:

- Compilation with Assertions: This includes assertion-driven example bootstrapping and counterexample bootstrapping during compilation. The teacher model for bootstrapping few-shot demonstrations can use DSPy Assertions to produce robust bootstrapped examples for the student model to learn from during inference. In this setting, the student model does not perform assertion-aware optimizations (backtracking and retry) during inference.
- Compilation + Inference with Assertions: This includes assertion-driven optimizations in both compilation and inference. The teacher model offers assertion-driven examples, and the student can further optimize with assertions of its own at inference time.


```python
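from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Assumes `baleen` (the program without assertions), `baleen_with_assertions`,
# `trainset`, `devset`, `max_bootstrapped_demos`, and the metric are defined earlier.
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=validate_context_and_answer_and_hops,
    max_bootstrapped_demos=max_bootstrapped_demos,
    num_candidate_programs=6,
)

# Compilation with Assertions: only the teacher uses assertions.
compiled_with_assertions_baleen = teleprompter.compile(
    student=baleen, teacher=baleen_with_assertions, trainset=trainset, valset=devset
)

# Compilation + Inference with Assertions: teacher and student both use assertions.
compiled_baleen_with_assertions = teleprompter.compile(
    student=baleen_with_assertions, teacher=baleen_with_assertions, trainset=trainset, valset=devset
)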
```

## Recommended Workflow

### 1. Define your task

### 2. Define your pipeline

### 3. Explore a few examples

### 4. Define your data

### 5. Define your metric

### 6. Collect preliminary "zero-shot" evaluations

### 7. Compile with a DSPy optimizer

### 8. Iterate

## Examples

## Roadmap

## References

- Documentation: https://dspy-docs.vercel.app/intro/
- DSPy cheatsheet: https://dspy-docs.vercel.app/cheatsheet
- GitHub: https://github.com/stanfordnlp/dspy
- Introduction by Author
    - Video: https://www.youtube.com/live/JEMYuzrKLUw?si=iwAzhwobN52zgIZ_
    - Slides: https://llmagents-learning.org/slides/dspy_lec.pdf
- Papers
    - [DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines](https://arxiv.org/abs/2310.03714)
    - [DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines](https://arxiv.org/abs/2312.13382)
    - [Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together](https://arxiv.org/abs/2407.10930)
    - [Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs](https://arxiv.org/abs/2406.11695)