# DSPy RAG Example with Azure OpenAI

## Setup

In [None]:
import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import dspy
import ujson
import random
import dspy.evaluate
import dspy.retrievers

In [None]:
# Load environment variables from .env file
load_dotenv()

True

## Set Up Tracing with MLflow

In [2]:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy-Azure-Eval-Example")

# Enable tracing.
mlflow.dspy.autolog()

## Set Up the LM Configuration

In [65]:
token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

lm = dspy.LM(
    model="azure/gpt-4.1",
    base_url=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_ad_token_provider=token_provider,
)
dspy.configure(lm=lm)

In [66]:
# Test the LM configuration.
lm("Say hello to the world!")

['Hello, world! 🌍']

## Start with DSPy modules

In [67]:
qa = dspy.Predict("question: str -> response: str")
response = qa(question="What are high memory and low memory on Linux?")

print(response)

Prediction(
    response='On Linux, "high memory" and "low memory" refer to different regions of physical RAM as seen by the kernel, especially on 32-bit systems:\n\n- **Low Memory**: This is the portion of physical RAM that the Linux kernel can directly address and map into its own address space. On 32-bit x86 systems, this is typically the first ~896 MB of RAM (from 0x00000000 to 0x37FFFFFF). The kernel and most kernel data structures reside here, and it is directly accessible by the kernel at all times.\n\n- **High Memory**: This is any physical RAM above the low memory threshold (above ~896 MB on 32-bit x86). The kernel cannot directly map all of high memory into its address space due to address space limitations. Instead, it must temporarily map high memory pages when it needs to access them. High memory is used primarily for user-space processes and file caches.\n\nOn 64-bit systems, the distinction is largely irrelevant because the kernel can address much more memory directly, s

In [68]:
cot = dspy.ChainOfThought("question: str -> response: str")
cot(question="Should curly braces appear on the same line as the function name in C#?")

Prediction(
    reasoning='In C#, the placement of curly braces is a matter of style, but the most widely accepted convention (as recommended by Microsoft and used in most C# codebases) is to place the opening curly brace on a new line after the function declaration. This is known as the "Allman style." However, some developers may use the "K&R style," where the opening brace appears on the same line as the function name, but this is less common in C#.',
    response='No, in C#, the standard convention is to place the opening curly brace on a new line after the function name, not on the same line.'
)

## Evaluate with Examples

In [None]:
! wget https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl -O ragqa_arena_tech_examples.jsonl

--2025-07-13 13:18:05--  https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl
Resolving huggingface.co (huggingface.co)... 3.163.189.114, 3.163.189.90, 3.163.189.37, ...
Connecting to huggingface.co (huggingface.co)|3.163.189.114|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: /api/resolve-cache/models/dspy/cache/ef6a5e72a98b47cef31574a400fea8fe149559a3/ragqa_arena_tech_examples.jsonl?%2Fdspy%2Fcache%2Fresolve%2Fmain%2Fragqa_arena_tech_examples.jsonl=&etag=%2236aea6cafdd206141e4ce7a897b5c38f0be9d785%22 [following]
--2025-07-13 13:18:05--  https://huggingface.co/api/resolve-cache/models/dspy/cache/ef6a5e72a98b47cef31574a400fea8fe149559a3/ragqa_arena_tech_examples.jsonl?%2Fdspy%2Fcache%2Fresolve%2Fmain%2Fragqa_arena_tech_examples.jsonl=&etag=%2236aea6cafdd206141e4ce7a897b5c38f0be9d785%22
Reusing existing connection to huggingface.co:443.
HTTP request sent, awaiting response... 200 OK
Length: 1522638 (1.5M) [text/plain]

In [69]:
with open("ragqa_arena_tech_examples.jsonl", "r") as f:
    data = [ujson.loads(line) for line in f]

data[0]

{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}

In [70]:
data = [dspy.Example(**d).with_inputs("question") for d in data]
example = data[2]
example

Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})

In [71]:
random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]
len(trainset), len(devset), len(testset)

(200, 300, 500)
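The split above depends on a seeded shuffle, so the same examples land in the same split on every run. A minimal sketch of that idea on toy data follows; `split_dataset` and `toy_data` are hypothetical names used only for illustration, and the sizes mirror the ones in the cell above.

```python
import random


def split_dataset(items, seed=0, sizes=(200, 300, 500)):
    """Shuffle a copy of `items` with a fixed seed, then cut it into train/dev/test slices."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded RNG: same order on every run
    train_end = sizes[0]
    dev_end = train_end + sizes[1]
    test_end = dev_end + sizes[2]
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:test_end]


toy_data = range(1000)  # stand-in for the list of dspy.Example objects
train, dev, test = split_dataset(toy_data)
print((len(train), len(dev), len(test)))  # (200, 300, 500)
```

Because the slices are taken from one shuffled list, no example can appear in more than one split.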

In [72]:
metric = dspy.evaluate.SemanticF1(decompositional=True)

pred = cot(**example.inputs())

score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 Your text messages are showing up as "Maybe: [Name]" because your phone is trying to guess the identity of the sender based on information it finds in your emails, messages, or other apps. This happens when the sender's number isn't saved in your contacts, or if the contact information is incomplete. To stop seeing "Maybe," save the sender's number to your contacts with the correct name.

Semantic F1 Score: 0.40
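To read a score like the one above, it helps to recall how an F1 value combines precision and recall. The sketch below assumes the metric uses the standard harmonic-mean formula once precision and recall have been estimated; the input values are hypothetical and are not the actual precision and recall behind this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, with both clamped to [0, 1]."""
    precision = max(0.0, min(1.0, precision))
    recall = max(0.0, min(1.0, recall))
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing matches
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs: half of the predicted claims are supported by the gold
# answer, and one third of the gold claims are covered by the prediction.
print(f"{f1_score(0.5, 1 / 3):.2f}")  # 0.40
```

The harmonic mean is pulled toward the lower of the two values, so a response that is precise but misses much of the gold answer still scores low.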


In [22]:
# Define the evaluator.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24, display_progress=True, display_table=2)

In [None]:
# Evaluate the Chain-of-Thought program.
evaluate(cot)

Average Metric: 128.09 / 300 (42.7%): 100%|██████████| 300/300 [01:49<00:00,  2.74it/s]

2025/07/13 13:33:20 INFO dspy.evaluate.evaluate: Average Metric: 128.08507795163482 / 300 (42.7%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | The choice between C and C++ depends on the requirements of the pr... | Use C over C++ when: - You need maximum portability and minimal ru... | ✔️ [0.154] |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | Storing images in a git repository is a nuanced topic. Git is opti... | In general, it's best to avoid storing large or numerous images di... | ✔️ [0.444] |


42.7

## Custom DSPy Module

In [26]:
! wget https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl -O ragqa_arena_tech_corpus.jsonl

--2025-07-13 13:38:47--  https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl
Resolving huggingface.co (huggingface.co)... 108.138.246.67, 108.138.246.79, 108.138.246.85, ...
Connecting to huggingface.co (huggingface.co)|108.138.246.67|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/3c/d1/3cd17edd9d00b6cb87d90d566d09ca0a0d2242d61c9c36b9e3c6c82c41a34479/a5d0eff9ad3fece29c97d59ccf24ae80cd6317f8781c218ff355be922056fe7d?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27ragqa_arena_tech_corpus.jsonl%3B+filename%3D%22ragqa_arena_tech_corpus.jsonl%22%3B&Expires=1752442728&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc1MjQ0MjcyOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzNjL2QxLzNjZDE3ZWRkOWQwMGI2Y2I4N2Q5MGQ1NjZkMDljYTBhMGQyMjQyZDYxYzljMzZiOWUzYzZjODJjNDFhMzQ0NzkvYTVkMGVmZjlhZDNmZWNlMjljOTdkNTljY2YyNGFlODBjZDYzMTdmODc4MWMyMThmZjM1N

In [None]:
max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

# Same credential scope as the LM cell; redefined so this cell can run on its own.
token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
# This fetches a single token up front, so it will expire in long-running sessions.
azure_ad_token = token_provider()

embedder = dspy.Embedder(
    'azure/text-embedding-3-small',
    api_base=os.environ["AZURE_OPENAI_EMBEDDING_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_EMBEDDING_API_VERSION"],
    azure_ad_token=azure_ad_token,
    dimensions=512,
)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings
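Conceptually, embedding search ranks corpus vectors by similarity to the query vector and returns the top `k`. The sketch below shows that idea with brute-force cosine similarity on toy 2-dimensional vectors. It is not how `dspy.retrievers.Embeddings` is implemented (the log line above indicates it trains a FAISS index for a corpus of this size); `cosine`, `top_k`, and `toy_corpus` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, corpus_vecs, k):
    """Return the indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # stand-ins for document embeddings
print(top_k([1.0, 0.1], toy_corpus, k=2))  # [0, 2]
```

Brute force compares the query against every document, which is why an approximate index becomes attractive once the corpus reaches tens of thousands of vectors.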


In [41]:
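class RAG(dspy.Module):
    """Retrieve passages for a question, then answer with chain-of-thought."""

    def __init__(self):
        super().__init__()
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        # `search` is the module-level retriever defined in the previous cell.
        context = search(question).passages
        return self.respond(context=context, question=question)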

In [None]:
rag = RAG()
rag(question="What are high memory and low memory on Linux?")

Prediction(
    reasoning="The context provides several explanations about high memory (highmem) and low memory (lowmem) in Linux, especially in the context of 32-bit systems. Low memory refers to the portion of physical memory that is permanently mapped into the kernel's address space, making it directly accessible to the kernel. High memory, on the other hand, is the portion of physical memory that is not permanently mapped and requires special handling (such as temporary mapping) for the kernel to access it. This distinction arises because, on 32-bit systems, the kernel's address space is limited (typically 1GB out of 4GB), so not all physical memory can be mapped at once. High memory is generally used for user-space applications and the page cache, while low memory is critical for kernel operations and data structures. The kernel must use functions like kmap and kunmap to access high memory pages.",
    response='On Linux, especially on 32-bit systems, "low memory" refers to the po

In [43]:
evaluate(RAG())

Average Metric: 168.15 / 300 (56.1%): 100%|██████████| 300/300 [02:07<00:00,  2.36it/s]

2025/07/13 14:15:11 INFO dspy.evaluate.evaluate: Average Metric: 168.1502347173502 / 300 (56.1%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | The context provides several perspectives on when to use C versus ... | Use C over C++ when: - You are working in embedded systems or on p... | ✔️ [0.429] |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | The context provides a range of opinions and technical considerati... | Images can be stored in a Git repository, especially if they are e... | ✔️ [0.500] |


56.05
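The final number printed by each evaluation run appears to be the summed per-example score divided by the number of examples, expressed as a percentage and rounded to two decimals. The sketch below checks that reading against the totals logged by the two runs above; `average_metric_percent` is a hypothetical helper, not a DSPy function.

```python
def average_metric_percent(total_score: float, num_examples: int) -> float:
    """Convert a summed metric into a percentage rounded to two decimals."""
    return round(100 * total_score / num_examples, 2)


# Totals copied from the two evaluation logs above (300 dev examples each).
cot_score = average_metric_percent(128.08507795163482, 300)
rag_score = average_metric_percent(168.1502347173502, 300)
print(cot_score, rag_score)  # 42.7 56.05
```

On this dev set, adding retrieval raises the average by roughly 13 percentage points over plain chain-of-thought.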

## Optimize using `MIPROv2`

In [44]:
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=2, max_labeled_demos=2,
                           requires_permission_to_run=False)

2025/07/13 14:17:23 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: True
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 160

2025/07/13 14:17:23 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/13 14:17:23 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/07/13 14:17:23 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...


Bootstrapping set 1/12
Bootstrapping set 2/12
Bootstrapping set 3/12


 10%|█         | 4/40 [00:41<06:10, 10.30s/it]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/12


 20%|██        | 8/40 [01:07<04:29,  8.43s/it]


Bootstrapped 1 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
Bootstrapping set 5/12


  2%|▎         | 1/40 [00:12<07:48, 12.01s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/12


 10%|█         | 4/40 [00:36<05:27,  9.10s/it]


Bootstrapped 1 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 7/12


  5%|▌         | 2/40 [00:18<05:55,  9.37s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 8/12


 12%|█▎        | 5/40 [00:39<04:34,  7.85s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/12


  8%|▊         | 3/40 [00:26<05:21,  8.70s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 10/12


  8%|▊         | 3/40 [00:27<05:35,  9.06s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/12


  8%|▊         | 3/40 [00:22<04:42,  7.63s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 12/12


  5%|▌         | 2/40 [00:18<05:47,  9.14s/it]

Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.

2025/07/13 14:22:33 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/07/13 14:22:33 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


2025/07/13 14:22:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/07/13 14:23:38 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/07/13 14:23:38 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `context`, `question`, produce the fields `response`.

2025/07/13 14:23:38 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are assisting an IT helpdesk technician who must resolve a critical technical support ticket for a high-priority client. Given a set of relevant `context` passages and a specific technical `question`, provide a step-by-step `reasoning` process that clearly explains how you arrive at your solution, and then deliver a precise, actionable `response` that the technician can immediately relay to the client. Ensure your explanation is transparent, references evidence from the context, and anticipates potential pitfalls or variations based on system differences. Accuracy and clarity are essential, as i

Average Metric: 88.24 / 160 (55.1%): 100%|██████████| 160/160 [01:12<00:00,  2.21it/s]

2025/07/13 14:24:50 INFO dspy.evaluate.evaluate: Average Metric: 88.23730972680185 / 160 (55.1%)
2025/07/13 14:24:50 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 55.15

2025/07/13 14:24:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 23 - Minibatch ==



Average Metric: 21.17 / 35 (60.5%): 100%|██████████| 35/35 [00:19<00:00,  1.78it/s]

2025/07/13 14:25:10 INFO dspy.evaluate.evaluate: Average Metric: 21.174969797245613 / 35 (60.5%)
2025/07/13 14:25:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.5 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 14:25:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5]
2025/07/13 14:25:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15]
2025/07/13 14:25:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 55.15


2025/07/13 14:25:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 23 - Minibatch ==



Average Metric: 17.28 / 35 (49.4%): 100%|██████████| 35/35 [00:20<00:00,  1.74it/s]

2025/07/13 14:25:30 INFO dspy.evaluate.evaluate: Average Metric: 17.280008746862478 / 35 (49.4%)
2025/07/13 14:25:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 49.37 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 2'].
2025/07/13 14:25:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37]
2025/07/13 14:25:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15]
2025/07/13 14:25:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 55.15


2025/07/13 14:25:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 23 - Minibatch ==



Average Metric: 19.76 / 35 (56.4%): 100%|██████████| 35/35 [00:20<00:00,  1.75it/s]

2025/07/13 14:25:50 INFO dspy.evaluate.evaluate: Average Metric: 19.75616726060295 / 35 (56.4%)
2025/07/13 14:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.45 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 14:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45]
2025/07/13 14:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15]
2025/07/13 14:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 55.15


2025/07/13 14:25:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 23 - Minibatch ==



Average Metric: 19.84 / 35 (56.7%): 100%|██████████| 35/35 [00:18<00:00,  1.85it/s]

2025/07/13 14:26:09 INFO dspy.evaluate.evaluate: Average Metric: 19.842719811150737 / 35 (56.7%)
2025/07/13 14:26:09 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.69 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/07/13 14:26:09 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69]
2025/07/13 14:26:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15]
2025/07/13 14:26:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 55.15


2025/07/13 14:26:09 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 23 - Minibatch ==



Average Metric: 18.95 / 35 (54.1%): 100%|██████████| 35/35 [00:20<00:00,  1.70it/s]

2025/07/13 14:26:30 INFO dspy.evaluate.evaluate: Average Metric: 18.945122809075336 / 35 (54.1%)
2025/07/13 14:26:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.13 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/07/13 14:26:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13]
2025/07/13 14:26:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15]
2025/07/13 14:26:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 55.15


2025/07/13 14:26:30 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 23 - Full Evaluation =====
2025/07/13 14:26:30 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 60.5) from minibatch trials...



Average Metric: 90.02 / 160 (56.3%): 100%|██████████| 160/160 [00:57<00:00,  2.80it/s]

2025/07/13 14:27:27 INFO dspy.evaluate.evaluate: Average Metric: 90.01972201578808 / 160 (56.3%)
2025/07/13 14:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 56.26
2025/07/13 14:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26]
2025/07/13 14:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.26
2025/07/13 14:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 14:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 23 - Minibatch ==



Average Metric: 20.07 / 35 (57.3%): 100%|██████████| 35/35 [00:20<00:00,  1.74it/s]

2025/07/13 14:27:47 INFO dspy.evaluate.evaluate: Average Metric: 20.070874741662752 / 35 (57.3%)
2025/07/13 14:27:47 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.35 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 14:27:47 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35]
2025/07/13 14:27:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26]
2025/07/13 14:27:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.26


2025/07/13 14:27:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 23 - Minibatch ==



Average Metric: 20.72 / 35 (59.2%): 100%|██████████| 35/35 [00:21<00:00,  1.66it/s]

2025/07/13 14:28:08 INFO dspy.evaluate.evaluate: Average Metric: 20.724994833319 / 35 (59.2%)
2025/07/13 14:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.21 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/07/13 14:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21]
2025/07/13 14:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26]
2025/07/13 14:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.26


2025/07/13 14:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 23 - Minibatch ==



Average Metric: 19.13 / 35 (54.7%): 100%|██████████| 35/35 [00:18<00:00,  1.93it/s]

2025/07/13 14:28:26 INFO dspy.evaluate.evaluate: Average Metric: 19.131384373992788 / 35 (54.7%)





2025/07/13 14:28:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.66 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 3'].
2025/07/13 14:28:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66]
2025/07/13 14:28:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26]
2025/07/13 14:28:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.26


2025/07/13 14:28:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 23 - Minibatch ==


Average Metric: 18.17 / 35 (51.9%): 100%|██████████| 35/35 [00:01<00:00, 26.86it/s]

2025/07/13 14:28:28 INFO dspy.evaluate.evaluate: Average Metric: 18.172819512197155 / 35 (51.9%)
2025/07/13 14:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.92 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 14:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92]
2025/07/13 14:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26]
2025/07/13 14:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.26


2025/07/13 14:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 23 - Minibatch ==



Average Metric: 19.09 / 35 (54.6%): 100%|██████████| 35/35 [00:15<00:00,  2.24it/s]

2025/07/13 14:28:43 INFO dspy.evaluate.evaluate: Average Metric: 19.094702248888762 / 35 (54.6%)
2025/07/13 14:28:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.56 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/07/13 14:28:43 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56]
2025/07/13 14:28:43 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26]
2025/07/13 14:28:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.26


2025/07/13 14:28:43 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 23 - Full Evaluation =====
2025/07/13 14:28:43 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 57.35) from minibatch trials...



Average Metric: 91.86 / 160 (57.4%): 100%|██████████| 160/160 [00:53<00:00,  3.02it/s]

2025/07/13 14:29:37 INFO dspy.evaluate.evaluate: Average Metric: 91.85627952838095 / 160 (57.4%)
2025/07/13 14:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 57.41
2025/07/13 14:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41]
2025/07/13 14:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41
2025/07/13 14:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 14:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 23 - Minibatch ==



Average Metric: 17.80 / 35 (50.9%): 100%|██████████| 35/35 [00:18<00:00,  1.88it/s]

2025/07/13 14:29:55 INFO dspy.evaluate.evaluate: Average Metric: 17.803539893046057 / 35 (50.9%)
2025/07/13 14:29:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.87 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 10'].
2025/07/13 14:29:55 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87]
2025/07/13 14:29:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41]
2025/07/13 14:29:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:29:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 23 - Minibatch ==



Average Metric: 20.27 / 35 (57.9%): 100%|██████████| 35/35 [00:19<00:00,  1.79it/s]

2025/07/13 14:30:15 INFO dspy.evaluate.evaluate: Average Metric: 20.268202712170698 / 35 (57.9%)
2025/07/13 14:30:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.91 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 9'].





2025/07/13 14:30:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91]
2025/07/13 14:30:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41]
2025/07/13 14:30:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:30:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 23 - Minibatch ==


Average Metric: 19.43 / 35 (55.5%): 100%|██████████| 35/35 [00:22<00:00,  1.59it/s]

2025/07/13 14:30:37 INFO dspy.evaluate.evaluate: Average Metric: 19.42702844049844 / 35 (55.5%)
2025/07/13 14:30:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 55.51 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 7'].
2025/07/13 14:30:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91, 55.51]
2025/07/13 14:30:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41]
2025/07/13 14:30:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:30:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 23 - Minibatch ==



Average Metric: 19.51 / 35 (55.8%): 100%|██████████| 35/35 [00:20<00:00,  1.72it/s]

2025/07/13 14:30:57 INFO dspy.evaluate.evaluate: Average Metric: 19.51391465719073 / 35 (55.8%)
2025/07/13 14:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 55.75 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 8'].
2025/07/13 14:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91, 55.51, 55.75]
2025/07/13 14:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41]
2025/07/13 14:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 23 - Minibatch ==



Average Metric: 20.53 / 35 (58.7%): 100%|██████████| 35/35 [00:21<00:00,  1.60it/s]

2025/07/13 14:31:19 INFO dspy.evaluate.evaluate: Average Metric: 20.530047965509066 / 35 (58.7%)
2025/07/13 14:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.66 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 9'].
2025/07/13 14:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91, 55.51, 55.75, 58.66]
2025/07/13 14:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41]
2025/07/13 14:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 23 - Full Evaluation =====
2025/07/13 14:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 58.66) from minibatch trials...



Average Metric: 90.20 / 160 (56.4%): 100%|██████████| 160/160 [00:55<00:00,  2.90it/s]

2025/07/13 14:32:14 INFO dspy.evaluate.evaluate: Average Metric: 90.20462290821547 / 160 (56.4%)
2025/07/13 14:32:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41, 56.38]
2025/07/13 14:32:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41
2025/07/13 14:32:14 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 14:32:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 23 - Minibatch ==



Average Metric: 20.09 / 35 (57.4%): 100%|██████████| 35/35 [00:21<00:00,  1.63it/s]

2025/07/13 14:32:36 INFO dspy.evaluate.evaluate: Average Metric: 20.089864909446547 / 35 (57.4%)
2025/07/13 14:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.4 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 6'].
2025/07/13 14:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91, 55.51, 55.75, 58.66, 57.4]
2025/07/13 14:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41, 56.38]
2025/07/13 14:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:32:36 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 23 - Minibatch ==



Average Metric: 19.60 / 35 (56.0%): 100%|██████████| 35/35 [00:20<00:00,  1.71it/s]

2025/07/13 14:32:56 INFO dspy.evaluate.evaluate: Average Metric: 19.59549286044601 / 35 (56.0%)
2025/07/13 14:32:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 55.99 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/07/13 14:32:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91, 55.51, 55.75, 58.66, 57.4, 55.99]
2025/07/13 14:32:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41, 56.38]
2025/07/13 14:32:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:32:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 23 - Minibatch ==



Average Metric: 22.80 / 35 (65.1%): 100%|██████████| 35/35 [00:20<00:00,  1.69it/s]

2025/07/13 14:33:17 INFO dspy.evaluate.evaluate: Average Metric: 22.796277534287974 / 35 (65.1%)
2025/07/13 14:33:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.13 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 7'].
2025/07/13 14:33:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [60.5, 49.37, 56.45, 56.69, 54.13, 57.35, 59.21, 54.66, 51.92, 54.56, 50.87, 57.91, 55.51, 55.75, 58.66, 57.4, 55.99, 65.13]
2025/07/13 14:33:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41, 56.38]
2025/07/13 14:33:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 57.41


2025/07/13 14:33:17 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 23 / 23 - Full Evaluation =====
2025/07/13 14:33:17 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 65.13) from minibatch trials...



Average Metric: 93.15 / 160 (58.2%): 100%|██████████| 160/160 [00:51<00:00,  3.09it/s]

2025/07/13 14:34:09 INFO dspy.evaluate.evaluate: Average Metric: 93.14748450501101 / 160 (58.2%)
2025/07/13 14:34:09 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 58.22
2025/07/13 14:34:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [55.15, 56.26, 57.41, 56.38, 58.22]
2025/07/13 14:34:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.22
2025/07/13 14:34:09 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/13 14:34:09 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 58.22!
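
The log above alternates between cheap minibatch trials (35 examples) and occasional full evaluations (160 examples) of the candidate with the highest average minibatch score. The selection step can be sketched roughly as follows. This is a simplified illustration, not DSPy's actual implementation: the function name `next_candidate_for_full_eval` and the tuple-based data layout are assumptions made for this example, and only the three minibatch results visible in trials 20 to 22 are used.

```python
from collections import defaultdict
from statistics import mean


def next_candidate_for_full_eval(minibatch_results, already_evaluated):
    """Pick the candidate with the best average minibatch score.

    Hypothetical helper for illustration only; it is not part of DSPy.
    minibatch_results: iterable of (candidate, score) pairs.
    already_evaluated: set of candidates that already had a full eval.
    """
    scores_by_candidate = defaultdict(list)
    for candidate, score in minibatch_results:
        scores_by_candidate[candidate].append(score)

    # Average each candidate's minibatch scores, skipping candidates
    # that have already been evaluated on the full validation set.
    averages = {
        candidate: mean(scores)
        for candidate, scores in scores_by_candidate.items()
        if candidate not in already_evaluated
    }
    if not averages:
        return None

    best = max(averages, key=averages.get)
    return best, averages[best]


# Minibatch results copied from trials 20, 21, and 22 in the log above.
minibatch_results = [
    (("Instruction 2", "Few-Shot Set 6"), 57.4),
    (("Instruction 0", "Few-Shot Set 1"), 55.99),
    (("Instruction 5", "Few-Shot Set 7"), 65.13),
]

best = next_candidate_for_full_eval(minibatch_results, already_evaluated=set())
print(best)
```

With these three results the sketch selects the `Instruction 5` / `Few-Shot Set 7` combination, which matches the average score (65.13) that the log reports for the program promoted to full evaluation in trial 23. Note that the full evaluation then scored 58.22, well below the minibatch score, which is why a single high minibatch result is not treated as final.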





In [45]:
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)

By default, cmd+tab on macOS cycles through open applications, but it does not always bring minimized or hidden windows to the foreground. If you try to switch to a minimized window using cmd+tab, the app may become active, but its window will remain minimized. To work around this, you can use third-party apps like Witch (as mentioned in [1]) that offer more control over window management, including options to ignore minimized windows. Additionally, you can adjust Mission Control settings (see [2]) to change how spaces and windows are handled when switching applications. However, the default behavior is that cmd+tab does not restore minimized or hidden windows.


In [46]:
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)

By default, cmd+tab on macOS switches between applications but does not restore or focus minimized or hidden windows. If you switch to an app whose windows are minimized, the app becomes active, but its windows remain minimized. To change this behavior, you can try:
- Going to System Preferences > Mission Control and unchecking "When switching to an application, switch to a Space with open windows for the application."
- Using third-party utilities like Witch for more advanced window management.
Alternatively, you can manually restore minimized windows by clicking their icons in the Dock.


In [47]:
evaluate(optimized_rag)

Average Metric: 165.61 / 300 (55.2%): 100%|██████████| 300/300 [01:58<00:00,  2.53it/s]

2025/07/13 14:41:09 INFO dspy.evaluate.evaluate: Average Metric: 165.60628529485274 / 300 (55.2%)





index,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],The context provides several perspectives on when to use C versus ...,Use C over C++ when: - You are working in embedded systems or envi...,✔️ [0.400]
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",The context provides several perspectives on storing images in a g...,"Images can be stored in a git repository, especially if they are a...",✔️ [0.727]


55.2

In [49]:
cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])  # in USD, as calculated by LiteLLM for certain providers
print(cost)

24.464128
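
The cost cell above sums the `cost` field over every recorded call and skips entries where it is `None`. The same aggregation can be tried without an LM by running it on a plain list of dictionaries. In this minimal sketch the helper name `total_cost` and the sample records are hypothetical. Treating a missing `cost` key the same as `None` is an extra safeguard added here, not something the original cell does.

```python
def total_cost(history):
    """Sum the 'cost' field (in USD) over history records.

    Hypothetical helper for illustration. Entries whose cost is None,
    or that have no 'cost' key at all, are skipped.
    """
    return sum(
        entry["cost"] for entry in history if entry.get("cost") is not None
    )


# Made-up records standing in for lm.history; these are not real billing data.
sample_history = [
    {"cost": 0.01},
    {"cost": None},  # provider did not report a cost for this call
    {},              # record without a 'cost' key
    {"cost": 0.02},
]

total = total_cost(sample_history)
print(f"{total:.2f} USD across {len(sample_history)} calls")
```

Because `lm.history` accumulates every call made through the LM object, the figure printed above likely covers the optimization run and the evaluations together rather than a single step.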


In [50]:
optimized_rag.save("optimized_rag.json")

loaded_rag = RAG()
loaded_rag.load("optimized_rag.json")

loaded_rag(question="cmd+tab does not work on hidden or minimized windows")

Prediction(
    reasoning='The context provides several insights into how macOS handles application switching with cmd+tab, especially regarding minimized or hidden windows. Source [3] explains that when you use cmd+tab to switch to an application, if that application\'s windows are minimized, they do not automatically become visible or focused. You must switch to another app and let it take focus first before you can bring up a minimized window. Additionally, both the Command and Option keys must be pressed on the same side of the keyboard for certain behaviors. Source [2] suggests adjusting a setting in System Preferences > Mission Control by unchecking "When switching to an application, switch to a Space with open windows for the application," which may affect how cmd+tab interacts with minimized windows. Source [1] mentions that cmd+tab cycles between the most recently used applications, but does not bring up the heads-up display if done quickly, and third-party apps like Witch can