In [1]:
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

In [2]:
import ujson
from dspy.utils import download

# Download question--answer pairs from the RAG-QA Arena "Tech" dataset.
# Uncomment the call below on the first run, or whenever the file is not yet on disk.
# download(
#     "https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl"
# )

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

In [3]:
data[0]

{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}

In [4]:
import dspy

data = [dspy.Example(**d).with_inputs("question") for d in data]

# Let's pick an `example` here from the data.
example = data[2]
example

Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})

In [5]:
example.keys()

['question', 'response', 'gold_doc_ids']
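
The `with_inputs("question")` call above marks `question` as the only input, so the remaining keys act as labels. A minimal plain-Python sketch of that split follows. It is an analogue for illustration, not DSPy's implementation, and `split_fields` and the shortened `record` are hypothetical.

```python
# Plain-Python analogue of the input/label split (not DSPy's own code).
def split_fields(record, input_keys):
    """Separate a record into input fields and label fields."""
    inputs = {k: v for k, v in record.items() if k in input_keys}
    labels = {k: v for k, v in record.items() if k not in input_keys}
    return inputs, labels


# Hypothetical record with the same keys as the examples above; the
# response text is a placeholder, not the dataset's actual answer.
record = {
    "question": "why are my text messages coming up as maybe?",
    "response": "(gold answer text)",
    "gold_doc_ids": [3956, 3957, 8034],
}

inputs, labels = split_fields(record, {"question"})
print(sorted(inputs))
print(sorted(labels))
```

Only the input fields are passed to the program at prediction time; the label fields are kept for the metric.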

In [6]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)
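
The split above is reproducible because the shuffle uses a dedicated `random.Random(0)` instance rather than the global generator. A small self-contained sketch of the same pattern on toy data follows; `split_dataset` and the integer list are illustrative names, not part of the notebook.

```python
import random


def split_dataset(items, seed=0, sizes=(200, 300, 500)):
    """Shuffle a copy of `items` with a fixed seed, then slice train/dev/test."""
    items = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(items)
    n_train, n_dev, n_test = sizes
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test


toy = list(range(1200))  # stand-in for the loaded examples
train, dev, test = split_dataset(toy)
again = split_dataset(toy)  # same seed, so the same split
print(len(train), len(dev), len(test))
```

Any items beyond the first 1,000 after shuffling are simply left unused, as in the cell above.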

In [7]:
cot = dspy.ChainOfThought("question -> response")
cot(question="should curly braces appear on their own line?")

Prediction(
    reasoning='The placement of curly braces on their own line is largely a matter of coding style and conventions. In many programming languages, such as Java, C#, and JavaScript, it is common to place opening curly braces on the same line as the statement that precedes them, while closing curly braces are often placed on a new line. This style is known as "K&R style." However, some developers prefer to place both opening and closing curly braces on their own lines for better readability, especially in languages like Python or when following certain style guides like the Allman style. Ultimately, the decision should be based on team conventions, readability, and personal preference.',
    response="Curly braces can either appear on their own line or on the same line as the preceding statement, depending on the coding style you choose to follow. It's important to be consistent with whichever style you adopt, and to consider team conventions and readability when making your 

In [8]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 Your text messages are showing up as "maybe" likely because the recipient's phone does not recognize your number, possibly because you are not saved in their contacts. This feature is designed to help users identify potential spam or unknown senders. You might want to ask the recipient to save your number or check their messaging app settings for more information.

Semantic F1 Score: 0.40


In [9]:
dspy.inspect_history(n=1)





[2025-07-13T15:29:16.535330]

System message:

Your input fields are:
1. `question` (str): 
2. `ground_truth` (str): 
3. `system_response` (str):
Your output fields are:
1. `reasoning` (str): 
2. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth
3. `system_response_key_ideas` (str): enumeration of key ideas in the system response
4. `discussion` (str): discussion of the overlap between ground truth and system response
5. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response
6. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## ground_truth ## ]]
{ground_truth}

[[ ## system_response ## ]]
{system_response}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ground_truth_key_ideas ## ]]
{ground_truth_key_ideas}

[[ ## system_response_key_ideas
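
The prompt above asks the judge model for a `recall` and a `precision` value. A sketch of how two such values could be combined into a single F1 score follows, assuming the usual harmonic mean. The notebook does not show the metric's internals, so treat `f1_score` here as an illustration rather than the library's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, chosen only to show the arithmetic.
print(f"{f1_score(precision=1.0, recall=0.25):.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a response that is precise but covers little of the gold answer still scores low.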

In [10]:
# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(
    devset=devset, metric=metric, num_threads=24, display_progress=True, display_table=2
)

# Evaluate the Chain-of-Thought program.
evaluate(cot)

Average Metric: 126.02 / 300 (42.0%): 100%|██████████| 300/300 [01:30<00:00,  3.30it/s]

2025/07/13 15:33:23 INFO dspy.evaluate.evaluate: Average Metric: 126.01753608119975 / 300 (42.0%)





Unnamed: 0,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],"C and C++ are both powerful programming languages, but they serve ...","Use C when you need low-level programming, performance, and simpli...",✔️ [0.500]
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",Storing images in a Git repository can be problematic for several ...,It is generally not advisable to store images directly in a Git re...,✔️ [0.308]


EvaluationResult(score=42.01, results=<list of 300 results>)
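
The summary line `Average Metric: 126.02 / 300 (42.0%)` is consistent with a plain mean of per-example scores scaled to a percentage. A minimal sketch of that arithmetic follows; `average_metric` is a hypothetical helper, and the two toy scores are the values shown in the table rows above.

```python
def average_metric(scores):
    """Return the summed score and the mean as a percentage (2 decimals)."""
    total = sum(scores)
    percent = round(100 * total / len(scores), 2)
    return total, percent


# The two per-example scores displayed in the table above.
total, percent = average_metric([0.500, 0.308])
print(percent)

# The logged total over 300 dev examples reproduces the reported score.
print(round(100 * 126.01753608119975 / 300, 2))
```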

In [11]:
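max_characters = 6000  # truncate very long documents (beyond roughly the 99th percentile by length)
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

# Assumes ragqa_arena_tech_corpus.jsonl is already in the working directory.
with open("ragqa_arena_tech_corpus.jsonl") as f: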
    corpus = [ujson.loads(line)["text"][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(
    embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve
)

Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings


In [12]:
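class RAG(dspy.Module):
    def __init__(self):
        # Answer generator: reads the retrieved context plus the question.
        self.respond = dspy.ChainOfThought("context, question -> response")

    def forward(self, question):
        # Retrieve the top-k passages with the `search` retriever defined above.
        context = search(question).passages
        return self.respond(context=context, question=question)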

In [13]:
rag = RAG()
rag(question="what are high memory and low memory on linux?")

Prediction(
    reasoning="High memory and low memory in Linux refer to two distinct segments of the kernel's memory space. Low memory is the portion of memory that the kernel can access directly and is statically mapped at boot time, allowing for efficient access. High memory, on the other hand, is not permanently mapped in the kernel's address space and requires special handling (like mapping and unmapping) for the kernel to access it. This distinction is crucial for managing memory in a 32-bit architecture, where the kernel needs to access more memory than it can directly map. High memory is typically used for temporary data buffers, while low memory is used for kernel operations.",
    response="In Linux, high memory refers to the segment of memory that is not permanently mapped in the kernel's address space, requiring the kernel to map it temporarily for access. Low memory, conversely, is the portion that the kernel can access directly and is statically mapped at boot time. This s

In [14]:
dspy.inspect_history()





[2025-07-13T15:39:07.901851]

System message:

Your input fields are:
1. `context` (str): 
2. `question` (str):
Your output fields are:
1. `reasoning` (str): 
2. `response` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.»
[2] «HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x0000
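
In the prompt above, the list of retrieved passages appears as numbered entries wrapped in guillemets. The sketch below reproduces that rendering for a list of strings. It mirrors only what is visible in the output; `format_passages` is a hypothetical helper, not the formatting code DSPy itself uses.

```python
def format_passages(passages):
    """Render passages as '[1] «...»' lines, numbered from 1."""
    return "\n".join(f"[{i}] «{p}»" for i, p in enumerate(passages, start=1))


# Placeholder passages for illustration.
print(format_passages(["first passage", "second passage"]))
```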

In [15]:
evaluate(RAG())

Average Metric: 164.89 / 300 (55.0%): 100%|██████████| 300/300 [01:55<00:00,  2.60it/s]

2025/07/13 15:41:16 INFO dspy.evaluate.evaluate: Average Metric: 164.88770316639904 / 300 (55.0%)





Unnamed: 0,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],C should be used over C++ primarily in scenarios where simplicity ...,Use C over C++ when working on embedded systems or projects that r...,
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",Storing images in a Git repository can be problematic due to Git's...,While it is technically possible to store images in a Git reposito...,✔️ [0.500]


EvaluationResult(score=54.96, results=<list of 300 results>)

In [16]:
tp = dspy.MIPROv2(
    metric=metric, auto="medium", num_threads=24
)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(
    RAG(),
    trainset=trainset,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    requires_permission_to_run=False,
)

2025/07/13 15:41:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: True
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 160

2025/07/13 15:41:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/13 15:41:43 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/07/13 15:41:43 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...


Bootstrapping set 1/12
Bootstrapping set 2/12
Bootstrapping set 3/12


 12%|█▎        | 5/40 [00:43<05:07,  8.79s/it]


Bootstrapped 2 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 4/12


 15%|█▌        | 6/40 [00:50<04:47,  8.47s/it]


Bootstrapped 1 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 5/12


  5%|▌         | 2/40 [00:16<05:19,  8.40s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 6/12


  8%|▊         | 3/40 [00:25<05:20,  8.66s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 7/12


  5%|▌         | 2/40 [00:16<05:19,  8.40s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 8/12


 12%|█▎        | 5/40 [00:38<04:28,  7.66s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/12


  5%|▌         | 2/40 [00:20<06:31, 10.30s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/12


  5%|▌         | 2/40 [00:15<05:03,  7.98s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 11/12


  2%|▎         | 1/40 [00:09<06:03,  9.32s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/12


  8%|▊         | 3/40 [00:33<06:48, 11.04s/it]
2025/07/13 15:46:15 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/07/13 15:46:15 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.


2025/07/13 15:46:26 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/07/13 15:46:56 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/07/13 15:46:56 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `context`, `question`, produce the fields `response`.

2025/07/13 15:46:56 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Imagine you are a technical support agent in a high-pressure environment where users urgently need assistance with their macOS and shell scripting issues. Your task is to provide clear, actionable solutions based on the context provided. Given the fields `context` and `question`, produce a comprehensive `response` that not only answers the question but also explains the reasoning behind your answer step by step, ensuring that the user can understand and apply the solution effectively.

2025/07/13 15:46:56 INFO dspy.teleprompt.mipro_optimizer_v2: 2: You are a technical expert in macOS and shell scrip

Average Metric: 90.85 / 160 (56.8%): 100%|██████████| 160/160 [01:21<00:00,  1.96it/s]

2025/07/13 15:48:18 INFO dspy.evaluate.evaluate: Average Metric: 90.85100066174321 / 160 (56.8%)
2025/07/13 15:48:18 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 56.78

2025/07/13 15:48:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 23 - Minibatch ==



Average Metric: 21.75 / 35 (62.1%): 100%|██████████| 35/35 [00:21<00:00,  1.61it/s]

2025/07/13 15:48:40 INFO dspy.evaluate.evaluate: Average Metric: 21.746202270134518 / 35 (62.1%)
2025/07/13 15:48:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.13 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:48:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13]
2025/07/13 15:48:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78]
2025/07/13 15:48:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.78


2025/07/13 15:48:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 23 - Minibatch ==



Average Metric: 18.92 / 35 (54.0%): 100%|██████████| 35/35 [00:48<00:00,  1.38s/it]

2025/07/13 15:49:29 INFO dspy.evaluate.evaluate: Average Metric: 18.91570953788717 / 35 (54.0%)
2025/07/13 15:49:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.04 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 2'].
2025/07/13 15:49:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04]
2025/07/13 15:49:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78]
2025/07/13 15:49:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.78


2025/07/13 15:49:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 23 - Minibatch ==



Average Metric: 22.11 / 35 (63.2%): 100%|██████████| 35/35 [00:19<00:00,  1.78it/s]

2025/07/13 15:49:48 INFO dspy.evaluate.evaluate: Average Metric: 22.1125846187646 / 35 (63.2%)
2025/07/13 15:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 63.18 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18]
2025/07/13 15:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78]
2025/07/13 15:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.78


2025/07/13 15:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 23 - Minibatch ==



Average Metric: 21.81 / 35 (62.3%): 100%|██████████| 35/35 [00:21<00:00,  1.60it/s]

2025/07/13 15:50:10 INFO dspy.evaluate.evaluate: Average Metric: 21.81279328156445 / 35 (62.3%)
2025/07/13 15:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.32 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/07/13 15:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32]
2025/07/13 15:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78]
2025/07/13 15:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.78


2025/07/13 15:50:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 23 - Minibatch ==



Average Metric: 20.26 / 35 (57.9%): 100%|██████████| 35/35 [00:28<00:00,  1.24it/s]

2025/07/13 15:50:38 INFO dspy.evaluate.evaluate: Average Metric: 20.263546886058187 / 35 (57.9%)
2025/07/13 15:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.9 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/07/13 15:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9]
2025/07/13 15:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78]
2025/07/13 15:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.78


2025/07/13 15:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 23 - Full Evaluation =====
2025/07/13 15:50:38 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 63.18) from minibatch trials...



Average Metric: 95.74 / 160 (59.8%): 100%|██████████| 160/160 [01:23<00:00,  1.91it/s]

2025/07/13 15:52:02 INFO dspy.evaluate.evaluate: Average Metric: 95.73659981271513 / 160 (59.8%)
2025/07/13 15:52:02 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 59.84
2025/07/13 15:52:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84]
2025/07/13 15:52:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84
2025/07/13 15:52:02 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 15:52:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 23 - Minibatch ==



Average Metric: 22.25 / 35 (63.6%): 100%|██████████| 35/35 [00:20<00:00,  1.70it/s]

2025/07/13 15:52:23 INFO dspy.evaluate.evaluate: Average Metric: 22.249721355623993 / 35 (63.6%)
2025/07/13 15:52:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 63.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:52:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57]
2025/07/13 15:52:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84]
2025/07/13 15:52:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:52:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 23 - Minibatch ==



Average Metric: 21.41 / 35 (61.2%): 100%|██████████| 35/35 [00:22<00:00,  1.54it/s]

2025/07/13 15:52:46 INFO dspy.evaluate.evaluate: Average Metric: 21.41353242436359 / 35 (61.2%)
2025/07/13 15:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.18 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/07/13 15:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18]
2025/07/13 15:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84]
2025/07/13 15:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:52:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 23 - Minibatch ==



Average Metric: 19.86 / 35 (56.7%): 100%|██████████| 35/35 [00:21<00:00,  1.63it/s]

2025/07/13 15:53:07 INFO dspy.evaluate.evaluate: Average Metric: 19.85791772226103 / 35 (56.7%)
2025/07/13 15:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.74 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 3'].
2025/07/13 15:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74]
2025/07/13 15:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84]
2025/07/13 15:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:53:07 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 23 - Minibatch ==



Average Metric: 20.94 / 35 (59.8%): 100%|██████████| 35/35 [00:23<00:00,  1.48it/s]

2025/07/13 15:53:31 INFO dspy.evaluate.evaluate: Average Metric: 20.937048132799298 / 35 (59.8%)
2025/07/13 15:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.82 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 11'].
2025/07/13 15:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82]
2025/07/13 15:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84]
2025/07/13 15:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:53:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 23 - Minibatch ==



Average Metric: 21.08 / 35 (60.2%): 100%|██████████| 35/35 [00:01<00:00, 19.62it/s]

2025/07/13 15:53:33 INFO dspy.evaluate.evaluate: Average Metric: 21.075892800359945 / 35 (60.2%)
2025/07/13 15:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.22 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22]
2025/07/13 15:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84]
2025/07/13 15:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 23 - Full Evaluation =====
2025/07/13 15:53:33 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 63.57) from minibatch trials...



Average Metric: 95.28 / 160 (59.6%): 100%|██████████| 160/160 [00:50<00:00,  3.17it/s]

2025/07/13 15:54:23 INFO dspy.evaluate.evaluate: Average Metric: 95.28297398123125 / 160 (59.6%)
2025/07/13 15:54:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55]
2025/07/13 15:54:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84
2025/07/13 15:54:23 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 15:54:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 23 - Minibatch ==



Average Metric: 20.55 / 35 (58.7%): 100%|██████████| 35/35 [00:07<00:00,  4.58it/s]

2025/07/13 15:54:31 INFO dspy.evaluate.evaluate: Average Metric: 20.55293935939425 / 35 (58.7%)
2025/07/13 15:54:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.72 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:54:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72]
2025/07/13 15:54:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55]
2025/07/13 15:54:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:54:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 23 - Minibatch ==



Average Metric: 22.67 / 35 (64.8%): 100%|██████████| 35/35 [00:22<00:00,  1.57it/s]

2025/07/13 15:54:53 INFO dspy.evaluate.evaluate: Average Metric: 22.66652710762648 / 35 (64.8%)





2025/07/13 15:54:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 64.76 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:54:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76]
2025/07/13 15:54:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55]
2025/07/13 15:54:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:54:54 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 23 - Minibatch ==


Average Metric: 21.30 / 35 (60.9%): 100%|██████████| 35/35 [00:18<00:00,  1.86it/s]

2025/07/13 15:55:12 INFO dspy.evaluate.evaluate: Average Metric: 21.303550638625808 / 35 (60.9%)
2025/07/13 15:55:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.87 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 15:55:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76, 60.87]
2025/07/13 15:55:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55]
2025/07/13 15:55:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:55:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 23 - Minibatch ==



Average Metric: 19.18 / 35 (54.8%): 100%|██████████| 35/35 [00:22<00:00,  1.55it/s]

2025/07/13 15:55:35 INFO dspy.evaluate.evaluate: Average Metric: 19.181659308797148 / 35 (54.8%)
2025/07/13 15:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.8 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 8'].
2025/07/13 15:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76, 60.87, 54.8]
2025/07/13 15:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55]
2025/07/13 15:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 23 - Minibatch ==



Average Metric: 22.53 / 35 (64.4%): 100%|██████████| 35/35 [00:59<00:00,  1.71s/it]

2025/07/13 15:56:35 INFO dspy.evaluate.evaluate: Average Metric: 22.528202400450105 / 35 (64.4%)
2025/07/13 15:56:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 64.37 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 10'].
2025/07/13 15:56:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76, 60.87, 54.8, 64.37]
2025/07/13 15:56:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55]
2025/07/13 15:56:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 59.84


2025/07/13 15:56:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 23 - Full Evaluation =====
2025/07/13 15:56:35 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 64.37) from minibatch trials...



Average Metric: 96.74 / 160 (60.5%): 100%|██████████| 160/160 [00:56<00:00,  2.84it/s]

2025/07/13 15:57:31 INFO dspy.evaluate.evaluate: Average Metric: 96.74258105497474 / 160 (60.5%)
2025/07/13 15:57:31 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 60.46





2025/07/13 15:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55, 60.46]
2025/07/13 15:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 60.46
2025/07/13 15:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 15:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 23 - Minibatch ==


Average Metric: 19.81 / 35 (56.6%): 100%|██████████| 35/35 [00:18<00:00,  1.89it/s]

2025/07/13 15:57:50 INFO dspy.evaluate.evaluate: Average Metric: 19.805352243503528 / 35 (56.6%)
2025/07/13 15:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.59 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 9'].
2025/07/13 15:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76, 60.87, 54.8, 64.37, 56.59]
2025/07/13 15:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55, 60.46]
2025/07/13 15:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 60.46


2025/07/13 15:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 23 - Minibatch ==



Average Metric: 20.02 / 35 (57.2%): 100%|██████████| 35/35 [00:01<00:00, 18.35it/s]

2025/07/13 15:57:52 INFO dspy.evaluate.evaluate: Average Metric: 20.018292496614276 / 35 (57.2%)
2025/07/13 15:57:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.2 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 10'].
2025/07/13 15:57:52 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76, 60.87, 54.8, 64.37, 56.59, 57.2]
2025/07/13 15:57:52 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55, 60.46]
2025/07/13 15:57:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 60.46


2025/07/13 15:57:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 23 - Minibatch ==



Average Metric: 20.99 / 35 (60.0%): 100%|██████████| 35/35 [00:18<00:00,  1.90it/s]

2025/07/13 15:58:11 INFO dspy.evaluate.evaluate: Average Metric: 20.990445514288535 / 35 (60.0%)
2025/07/13 15:58:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.97 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 10'].
2025/07/13 15:58:11 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.13, 54.04, 63.18, 62.32, 57.9, 63.57, 61.18, 56.74, 59.82, 60.22, 58.72, 64.76, 60.87, 54.8, 64.37, 56.59, 57.2, 59.97]
2025/07/13 15:58:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55, 60.46]
2025/07/13 15:58:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 60.46


2025/07/13 15:58:11 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 23 / 23 - Full Evaluation =====
2025/07/13 15:58:11 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 62.815) from minibatch trials...



Average Metric: 93.73 / 160 (58.6%): 100%|██████████| 160/160 [00:56<00:00,  2.83it/s]

2025/07/13 15:59:07 INFO dspy.evaluate.evaluate: Average Metric: 93.72821567828288 / 160 (58.6%)
2025/07/13 15:59:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.78, 59.84, 59.55, 60.46, 58.58]
2025/07/13 15:59:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 60.46
2025/07/13 15:59:07 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 15:59:07 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 60.46!
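
The "Doing full eval on next top averaging program" lines can be read as follows: minibatch scores are averaged per (instruction, few-shot set) combination, and the best combination not yet fully evaluated is promoted to a full evaluation. The sketch below replays that reading on the minibatch scores copied from the log. It is an interpretation of the log, not the optimizer's source, and `next_full_eval_candidate` is a hypothetical name.

```python
from collections import defaultdict

# (instruction index, few-shot set index, minibatch score), copied from the log.
minibatch_trials = [
    (1, 6, 62.13), (4, 2, 54.04), (0, 6, 63.18), (2, 4, 62.32), (3, 5, 57.9),
    (4, 6, 63.57), (5, 1, 61.18), (3, 3, 56.74), (4, 11, 59.82), (0, 6, 60.22),
    (4, 6, 58.72), (2, 6, 64.76), (2, 6, 60.87), (2, 8, 54.8), (5, 10, 64.37),
    (5, 9, 56.59), (5, 10, 57.2), (3, 10, 59.97),
]


def next_full_eval_candidate(trials, already_evaluated):
    """Pick the not-yet-evaluated combination with the best average score."""
    scores = defaultdict(list)
    for instruction, fewshot, score in trials:
        scores[(instruction, fewshot)].append(score)
    averages = {
        combo: sum(vals) / len(vals)
        for combo, vals in scores.items()
        if combo not in already_evaluated
    }
    best = max(averages, key=averages.get)
    return best, averages[best]


# Combinations inferred to have been promoted at trials 7, 13 and 19.
evaluated = {(0, 6), (4, 6), (5, 10)}
combo, avg = next_full_eval_candidate(minibatch_trials, evaluated)
print(combo, round(avg, 3))
```

Under this reading, the candidate for the final full evaluation is Instruction 2 with Few-Shot Set 6, whose two minibatch scores average to the 62.815 reported at trial 23.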





In [17]:
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)

You are correct; cmd+tab does not work on hidden or minimized windows. The Command + Tab shortcut only allows you to switch to applications that are currently open and in focus. If an application is minimized, you need to first restore it or use specific commands to manage minimized windows.


In [18]:
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)

The Command + Tab shortcut on macOS is intended for switching between applications that are currently open and visible. It does not work with minimized or hidden windows because these applications are not actively running in the foreground. When an application is minimized, it is not considered "open" in the same sense as those displayed on the screen. To access a minimized application, you can either click its icon in the Dock or use a different method, such as the Mission Control feature, to view all open windows, including minimized ones.


In [19]:
dspy.inspect_history(n=2)





[2025-07-13T16:00:06.235507]

System message:

Your input fields are:
1. `context` (str): 
2. `question` (str):
Your output fields are:
1. `reasoning` (str): 
2. `response` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «If you hold command and quickly tap tab you will cycle between your two most recently used applications without bringing up the heads up display. If you want to use a third party app Witch has options to ignore minimized windows.»
[2] «Try this: On your Mac, Navigate to System Preferences Go to Mission Control Uncheck When switching to an application, switch to a

In [20]:
cost = sum(
    [x["cost"] for x in lm.history if x["cost"] is not None]
)  # in USD, as calculated by LiteLLM for certain providers
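
The cell above sums the `cost` field over the call history and skips entries where no cost was recorded. A self-contained sketch of the same filter-and-sum pattern follows, using a hypothetical `toy_history` list of dictionaries in place of `lm.history`.

```python
def total_cost(history):
    """Sum the 'cost' values, ignoring entries with no recorded cost."""
    return sum(entry["cost"] for entry in history if entry.get("cost") is not None)


# Hypothetical history: the middle entry has no recorded cost.
toy_history = [{"cost": 0.25}, {"cost": None}, {"cost": 0.5}]
print(f"${total_cost(toy_history):.2f}")
```

Filtering out `None` before summing avoids a `TypeError` from adding `None` to a float.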

In [21]:
optimized_rag.save("optimized_rag.json")

loaded_rag = RAG()
loaded_rag.load("optimized_rag.json")

loaded_rag(question="cmd+tab does not work on hidden or minimized windows")

Prediction(
    reasoning='The Command + Tab shortcut on macOS is designed to switch between applications that are currently open and visible on the screen. However, it does not work with minimized or hidden windows because these applications are not actively running in the foreground. When an application is minimized, it is not considered "open" in the same way as an application that is currently displayed on the screen. Therefore, the Command + Tab function will not cycle through minimized applications. To switch to a minimized application, you would typically need to either click on its icon in the Dock or use a different keyboard shortcut that specifically targets minimized windows.',
    response='The Command + Tab shortcut on macOS is intended for switching between applications that are currently open and visible. It does not work with minimized or hidden windows because these applications are not actively running in the foreground. When an application is minimized, it is not con