In [2]:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")

<Experiment: artifact_location='mlflow-artifacts:/374362034103955121', creation_time=1741686562632, experiment_id='374362034103955121', last_update_time=1741686562632, lifecycle_stage='active', name='DSPy', tags={}>

In [3]:
mlflow.dspy.autolog()



In [4]:
import dspy
import os
from dotenv import load_dotenv
from IPython.display import Image, display

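# Read environment variables (e.g. the OpenAI API key) from a local .env file.
load_dotenv()

# Use gpt-4o-mini as the default language model for all DSPy modules.
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)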

In [5]:
response = lm("What is the capital of France?")
print(response)

['The capital of France is Paris.']


In [6]:
qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")
print(response.response)

lm.inspect_history(n=1)

In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of how the kernel manages memory for processes.

- **Low Memory**: This typically refers to the memory that is directly accessible by the kernel and can be used by processes without any special handling. In a 32-bit system, this is usually the first 896 MB of RAM (though this can vary based on the architecture and configuration). Low memory is used for kernel data structures and for user processes that require direct access to memory.

- **High Memory**: This refers to memory that is above the addressable range of the kernel in a 32-bit system. Since the kernel can only directly access a limited amount of memory (due to the 4 GB address space limitation), any memory above this limit is considered high memory. Processes can still use this memory, but the kernel must use special mechanisms (like paging) to access it. High memory is more relevant in systems

In [7]:
cot = dspy.ChainOfThought('question -> response')
cot(question="should curly braces appear on their own line?")

Prediction(
    reasoning='The placement of curly braces on their own line is largely a matter of coding style and conventions. In some programming languages and style guides, such as the Allman style, it is common to place opening curly braces on a new line to enhance readability. This can help visually separate blocks of code. In contrast, other styles, like K&R style, place the opening brace on the same line as the control statement, which can save vertical space. Ultimately, whether curly braces should appear on their own line depends on the coding standards being followed in a particular project or organization.',
    response="Curly braces can appear on their own line depending on the coding style you are following. If you are using a style that emphasizes readability, such as the Allman style, then yes, they should be on their own line. If you are following a more compact style like K&R, then they would typically be on the same line. It's best to adhere to the conventions of the

In [8]:
import ujson
from dspy.utils import download

# Download question--answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

In [9]:
# Inspect one datapoint.
data[0]

{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}

In [10]:
data = [dspy.Example(**d).with_inputs('question') for d in data]

# Let's pick an `example` here from the data.
example = data[2]
example

Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})

In [11]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)
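The split above is reproducible because the shuffle uses a dedicated `random.Random(0)` generator rather than the global one. A minimal sketch of the same idea on stand-in data follows; `split_dataset` and `toy_data` are hypothetical names used only for illustration.

```python
import random


def split_dataset(items, sizes, seed=0):
    """Shuffle a copy of `items` with a fixed seed and cut it into consecutive slices."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits


toy_data = list(range(1000))  # stand-in for the loaded examples
trainset_toy, devset_toy, testset_toy = split_dataset(toy_data, [200, 300, 500])
print(len(trainset_toy), len(devset_toy), len(testset_toy))
```

Because the seed is fixed, calling `split_dataset` again on the same input yields the same three slices, so scores stay comparable across runs.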

In [12]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 Your text messages are showing up as "maybe" because the recipient's messaging app is unsure about the sender's identity. This can occur if you're not in their contacts or if the app is trying to filter out potential spam. To resolve this, you can ask the recipient to save your number or check their messaging app settings.

Semantic F1 Score: 0.33
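An F1 score is the harmonic mean of precision and recall. The sketch below assumes that the semantic variant works the same way, with precision and recall judged by a language model instead of counted from token overlap. The input values are hypothetical and are not the judgments behind the 0.33 above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    # Clamp to [0, 1] in case a judge returns a slightly out-of-range value.
    precision = max(0.0, min(1.0, precision))
    recall = max(0.0, min(1.0, recall))
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical judgments: a quarter of the predicted claims are supported by the
# gold response, and half of the gold claims are covered by the prediction.
print(f"{f1_score(precision=0.25, recall=0.5):.2f}")
```

The harmonic mean penalizes imbalance: a long answer that covers everything but adds many unsupported claims scores low, as does a short, correct, but incomplete one.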


In [13]:
# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)

# Evaluate the Chain-of-Thought program.
evaluate(cot)

Average Metric: 123.61 / 300 (41.2%): 100%|██████████| 300/300 [02:18<00:00,  2.17it/s]

2025/03/11 21:16:49 INFO dspy.evaluate.evaluate: Average Metric: 123.61108373468522 / 300 (41.2%)





index,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],"C and C++ are both powerful programming languages, but they serve ...","Use C when you need low-level system programming, performance, and...",✔️ [0.286]
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",Storing images in a Git repository can be problematic for several ...,"No, images should generally not be stored in a Git repository due ...",✔️ [0.286]


41.2

In [14]:
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

Downloading 'ragqa_arena_tech_corpus.jsonl'...


In [16]:
max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings
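Conceptually, the retriever embeds the query and returns the `k` corpus entries whose embeddings are most similar to it. The sketch below illustrates that step with an exhaustive cosine-similarity search over hand-written toy vectors. It is a stand-in for the idea only: the log above shows that the real index is an approximate FAISS index, and `toy_corpus`, `toy_vectors`, and `top_k` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus_vecs, k):
    """Return the indices of the k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]


toy_corpus = ["kernel memory", "git images", "mpls routing"]
toy_vectors = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]  # made-up embeddings
query = [0.9, 0.2, 0.0]  # made-up query embedding

hits = top_k(query, toy_vectors, k=2)
print([toy_corpus[i] for i in hits])
```

Exhaustive search is exact but scales linearly with corpus size, which is why a partitioned index is attractive for tens of thousands of documents.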


In [17]:
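class RAG(dspy.Module):
    def __init__(self):
        # Generate a response (with intermediate reasoning) from retrieved context and the question.
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        # Retrieve the top-k passages for the question, then answer grounded in them.
        context = search(question).passages
        return self.respond(context=context, question=question)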

In [18]:
rag = RAG()
rag(question="what are high memory and low memory on linux?")

Prediction(
    reasoning="High memory and low memory in Linux refer to two distinct segments of the kernel's memory space. Low memory is the portion of memory that the kernel can access directly and is always mapped in the kernel's address space. This allows the kernel to access it without any additional steps. High memory, on the other hand, is not permanently mapped in the kernel's address space, meaning that the kernel must map it temporarily when it needs to access it. This distinction is crucial for managing memory efficiently, especially in a 32-bit architecture where the total addressable memory is limited. The kernel can access low memory directly, while high memory is used for temporary data buffers and requires specific functions (like kmap) to access.",
    response="In Linux, low memory refers to the segment of memory that the kernel can access directly and is always mapped in the kernel's address space. High memory, however, is a segment that is not permanently mapped, me

In [19]:
dspy.inspect_history()





[2025-03-11T21:40:37.919814]

System message:

Your input fields are:
1. `context` (str)
2. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.»
[2] «HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x0000000
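In the prompt above, the list of retrieved passages appears as numbered entries wrapped in guillemets. The helper below is a hypothetical sketch that reproduces that display format for a list of strings. It mimics what the history shows and is not taken from the DSPy source.

```python
def format_passages(passages):
    """Render passages as '[1] «...»' lines, one per passage, numbered from 1."""
    return "\n".join(f"[{i}] «{p}»" for i, p in enumerate(passages, start=1))


print(format_passages(["first retrieved passage", "second retrieved passage"]))
```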

In [20]:
evaluate(RAG())

Average Metric: 95.87 / 175 (54.8%):  58%|█████▊    | 175/300 [01:24<00:47,  2.65it/s]

2025/03/11 21:43:53 ERROR dspy.utils.parallelizer: Error for Example({'question': 'apples open letter - they cant or wont backdoor ios?', 'response': 'In certain cases involving specific hardware, it has been stated that manipulating the device\'s firmware is feasible. \nA data forensics company called Cellebrite has reportedly developed the ability to unlock and access data on various iPhone models, hinting at the existence of a vulnerability they can exploit. \nThe notion that this manipulation may be actionable has some support, indicated by Apple\'s own implications in their statements, although they acknowledge the security risks of fulfilling such requests. \nThere is suspicion over whether Apple will comply with creating a system that bypasses security features due to potential risks and their apparent objection to such demands. \nApple has a financial motive in portraying the iPhone as unhackable even by the FBI, and may also be attempting to capitalize on anti-government views

Average Metric: 95.87 / 175 (54.8%):  59%|█████▊    | 176/300 [01:26<02:00,  1.03it/s]

2025/03/11 21:43:53 ERROR dspy.utils.parallelizer: Error for Example({'question': 'why isnt udp with reliability (implemented at application layer) a substitute of tcp?', 'response': "Both protocols are designed with certain guarantees and raw TCP has more guarantees than raw UDP.  \nUDP with reliability can indeed be a substitute for TCP. \nHowever, in general, TCP is pretty optimal so it's not worth going to all the extra work and adding a (large) chance of adding bugs and security flaws.", 'gold_doc_ids': [3453, 3455, 3457]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 199304, Requested 2395. Please try again in 509ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}. Set `provide_traceback=True

Average Metric: 96.12 / 176 (54.6%):  59%|█████▉    | 178/300 [01:27<01:23,  1.47it/s]

2025/03/11 21:43:55 ERROR dspy.utils.parallelizer: Error for Example({'question': 'override itunes media keys (play, pause, etc) for spotify?', 'response': 'NoMitsu has created a one-click installer that patches the remote control daemon, the app responsible for managing the media keys. \nThe app BeardedSpice is does an excellent job and offers a convenient way to ensure your media keys prioritize Spotify, while also working with numerous other media players.', 'gold_doc_ids': [7104, 1433, 3439]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 198281, Requested 2301. Please try again in 174ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}. Set `provide_traceback=True` for traceback.


Average Metric: 96.12 / 176 (54.6%):  60%|█████▉    | 179/300 [01:28<01:35,  1.26it/s]

2025/03/11 21:43:56 ERROR dspy.utils.parallelizer: Error for Example({'question': 'why 2des is not secure?', 'response': 'Two-DES does not provide the expected increase in security due to its susceptibility to certain types of attacks, specifically the "meet-in-the-middle attack". \nDespite employing 112 key bits, 2DES only achieves a security level equivalent to about 257 bits rather than 2112. . \nConversely, although 3DES uses 168 key bits, it effectively offers a security level of 2112 bits, which is considered sufficiently secure for most applications. \nThe reason 3DES is preferred to 2DES is that 2DES doesn\'t actually give that much more security than DES.', 'gold_doc_ids': [8228, 8229]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 199514, Requested 3078. Please try again in 

Average Metric: 97.12 / 178 (54.6%):  60%|██████    | 181/300 [01:30<01:43,  1.15it/s]

2025/03/11 21:43:57 ERROR dspy.utils.parallelizer: Error for Example({'question': 'how to search for android apps without ads?', 'response': 'You could consider using Yalp or Aurora, a fork of Yalp available on F-droid, which offers multiple search filters to enhance your experience. \nThe Playsearch website also offers you the option to search for apps with specific criteria without the need to install anything. \nAppFilter, a new website currently in beta, offers a platform to search for apps applying filters for numerous features and is free to use.', 'gold_doc_ids': [658, 659, 661]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 199180, Requested 3151. Please try again in 699ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 

Average Metric: 97.12 / 178 (54.6%):  92%|█████████▏| 277/300 [01:30<00:07,  3.06it/s]






Exception: Execution cancelled due to errors or interruption.
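The run above was cancelled after repeated HTTP 429 rate-limit errors. Lowering `num_threads` is one way to stay under the limit; another common pattern is to retry a failed call after an exponentially growing delay. The sketch below shows that pattern in plain Python. `RateLimitError`, `call_with_backoff`, and `flaky_call` are hypothetical stand-ins, not part of DSPy or any provider SDK.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call `fn`, retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Double the delay on each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)


# Demo: a call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429: try again later")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter keeps the demo instantaneous while leaving the default behaviour (a real `time.sleep`) unchanged.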

In [21]:
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=2, max_labeled_demos=2,
                           requires_permission_to_run=False)

2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 19
valset size: 160

2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


 12%|█▎        | 5/40 [01:01<07:07, 12.23s/it]


Bootstrapped 2 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 4/19


 15%|█▌        | 6/40 [01:03<06:01, 10.62s/it]


Bootstrapped 1 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 5/19


  5%|▌         | 2/40 [00:29<09:27, 14.92s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 6/19


  8%|▊         | 3/40 [00:38<08:00, 12.98s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 7/19


  5%|▌         | 2/40 [00:23<07:25, 11.72s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 8/19


 12%|█▎        | 5/40 [00:52<06:07, 10.51s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/19


  5%|▌         | 2/40 [00:19<06:17,  9.94s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/19


  8%|▊         | 3/40 [00:28<05:52,  9.53s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/19


 12%|█▎        | 5/40 [00:50<05:52, 10.06s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 12/19


  8%|▊         | 3/40 [00:33<06:47, 11.02s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 13/19


  8%|▊         | 3/40 [00:26<05:23,  8.75s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 14/19


  2%|▎         | 1/40 [00:11<07:41, 11.82s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 15/19


  2%|▎         | 1/40 [00:12<07:59, 12.29s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 16/19


  5%|▌         | 2/40 [00:18<05:54,  9.33s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 17/19


 10%|█         | 4/40 [00:42<06:22, 10.62s/it]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 18/19


  5%|▌         | 2/40 [00:16<05:07,  8.08s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 19/19


  8%|▊         | 3/40 [00:29<06:01,  9.78s/it]
2025/03/11 21:55:08 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/03/11 21:55:08 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.


2025/03/11 21:55:24 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/03/11 21:58:12 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/03/11 21:58:12 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `context`, `question`, produce the fields `response`.

2025/03/11 21:58:12 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are an expert technical support agent. Given the fields `context` (which provides detailed information about a specific technical topic) and `question` (which is a user inquiry related to that context), produce a comprehensive and well-reasoned `response`. Make sure to synthesize the information from the context to answer the question accurately and informatively.

2025/03/11 21:58:12 INFO dspy.teleprompt.mipro_optimizer_v2: 2: You are a technical support expert. Given the fields `context`, `question`, produce the fields `response` by synthesizing the information in the context to provide a clear an

Average Metric: 46.20 / 83 (55.7%):  52%|█████▏    | 83/160 [00:40<00:43,  1.76it/s]

2025/03/11 21:58:52 ERROR dspy.utils.parallelizer: Error for Example({'question': 'are there any famous one-man-army programmers?', 'response': 'Bill Joy is notable for writing vi, among other programs, including csh, rlogin, rsh, and rcp.  \nChris Sawyer was instrumental in creating RollerCoaster Tycoon, contributing to the game\'s programming, which included a physics engine, and writing it in assembly language.  \nOren Eini, also known as Ayende Rahien, is the author behind Rhino Mocks and is known for writing elegant code.  \nGuido van Rossum created Python.  \nRichard M. Stallman, often referred to as RMS, created Emacs and gcc and is recognized for maintaining pace with commercial Lisp machine code.  \nThe first version of Lisp can be attributed to John McCarthy.  \nJamie Zawinski is associated with one of computer science\'s most epic stories.  \nDerek Smart was known for his involvement with Battlecruiser 3000AD and his penchant for flame wars.  \nSteve Wozniak was essentially 

Average Metric: 89.80 / 159 (56.5%): 100%|██████████| 160/160 [01:21<00:00,  1.97it/s]

2025/03/11 21:59:33 INFO dspy.evaluate.evaluate: Average Metric: 89.80463743529265 / 160 (56.1%)
2025/03/11 21:59:33 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 56.13
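The progress bar and the final report above disagree (56.5% versus 56.1%) because one example raised an error. Judging from the log lines alone, the progress bar divides by the number of examples scored so far, while the final report divides by the full set, so a failed example effectively counts as zero. The arithmetic can be checked directly; `percent` is a hypothetical helper.

```python
def percent(total_score: float, n_examples: int) -> float:
    """Average score as a percentage, rounded to one decimal place."""
    return round(100 * total_score / n_examples, 1)


total_score = 89.80463743529265  # sum of per-example scores, copied from the log
scored, attempted = 159, 160     # one of the 160 examples raised an error

print(percent(total_score, scored))     # what the progress bar shows
print(percent(total_score, attempted))  # what the final report shows
```

When comparing runs, it is worth checking that both were scored over the same denominator, since failures quietly lower the reported average.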

2025/03/11 21:59:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 28 - Minibatch ==



Average Metric: 13.71 / 25 (54.8%): 100%|██████████| 25/25 [00:19<00:00,  1.28it/s]

2025/03/11 21:59:53 INFO dspy.evaluate.evaluate: Average Metric: 13.711728636155076 / 25 (54.8%)
2025/03/11 21:59:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.85 on minibatch of size 25 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/03/11 21:59:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85]
2025/03/11 21:59:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 21:59:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 21:59:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 28 - Minibatch ==



Average Metric: 15.70 / 25 (62.8%): 100%|██████████| 25/25 [01:19<00:00,  3.18s/it]

2025/03/11 22:01:12 INFO dspy.evaluate.evaluate: Average Metric: 15.695510680063343 / 25 (62.8%)
2025/03/11 22:01:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.78 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/03/11 22:01:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78]
2025/03/11 22:01:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:01:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:01:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 28 - Minibatch ==



Average Metric: 16.17 / 25 (64.7%): 100%|██████████| 25/25 [00:23<00:00,  1.05it/s]

2025/03/11 22:01:36 INFO dspy.evaluate.evaluate: Average Metric: 16.167429041650053 / 25 (64.7%)
2025/03/11 22:01:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 64.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:01:36 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67]
2025/03/11 22:01:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:01:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:01:36 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 28 - Minibatch ==



Average Metric: 15.47 / 25 (61.9%): 100%|██████████| 25/25 [00:36<00:00,  1.47s/it]

2025/03/11 22:02:13 INFO dspy.evaluate.evaluate: Average Metric: 15.467186012785435 / 25 (61.9%)
2025/03/11 22:02:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.87 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2'].
2025/03/11 22:02:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87]
2025/03/11 22:02:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:02:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:02:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 28 - Minibatch ==



Average Metric: 14.84 / 25 (59.3%): 100%|██████████| 25/25 [00:24<00:00,  1.01it/s]

2025/03/11 22:02:38 INFO dspy.evaluate.evaluate: Average Metric: 14.83623920870716 / 25 (59.3%)
2025/03/11 22:02:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.34 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:02:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34]
2025/03/11 22:02:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:02:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:02:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 28 - Minibatch ==



Average Metric: 14.64 / 25 (58.6%): 100%|██████████| 25/25 [00:27<00:00,  1.11s/it]

2025/03/11 22:03:05 INFO dspy.evaluate.evaluate: Average Metric: 14.638857501567125 / 25 (58.6%)
2025/03/11 22:03:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.56 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1'].
2025/03/11 22:03:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56]
2025/03/11 22:03:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:03:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:03:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 28 - Minibatch ==



Average Metric: 14.88 / 25 (59.5%): 100%|██████████| 25/25 [00:20<00:00,  1.19it/s]

2025/03/11 22:03:26 INFO dspy.evaluate.evaluate: Average Metric: 14.883243248831121 / 25 (59.5%)
2025/03/11 22:03:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.53 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 12'].
2025/03/11 22:03:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53]
2025/03/11 22:03:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:03:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:03:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 28 - Minibatch ==



Average Metric: 15.27 / 24 (63.6%):  96%|█████████▌| 24/25 [00:44<00:07,  7.29s/it]

2025/03/11 22:04:18 ERROR dspy.utils.parallelizer: Error for Example({'question': 'are there any famous one-man-army programmers?', 'response': 'Bill Joy is notable for writing vi, among other programs, including csh, rlogin, rsh, and rcp.  \nChris Sawyer was instrumental in creating RollerCoaster Tycoon, contributing to the game\'s programming, which included a physics engine, and writing it in assembly language.  \nOren Eini, also known as Ayende Rahien, is the author behind Rhino Mocks and is known for writing elegant code.  \nGuido van Rossum created Python.  \nRichard M. Stallman, often referred to as RMS, created Emacs and gcc and is recognized for maintaining pace with commercial Lisp machine code.  \nThe first version of Lisp can be attributed to John McCarthy.  \nJamie Zawinski is associated with one of computer science\'s most epic stories.  \nDerek Smart was known for his involvement with Battlecruiser 3000AD and his penchant for flame wars.  \nSteve Wozniak was essentially 

Average Metric: 15.27 / 24 (63.6%): 100%|██████████| 25/25 [00:51<00:00,  2.06s/it]

2025/03/11 22:04:18 INFO dspy.evaluate.evaluate: Average Metric: 15.271497394831991 / 25 (61.1%)
2025/03/11 22:04:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.09 on minibatch of size 25 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 13'].
2025/03/11 22:04:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09]
2025/03/11 22:04:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:04:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:04:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 28 - Minibatch ==



Average Metric: 14.09 / 25 (56.4%): 100%|██████████| 25/25 [00:26<00:00,  1.06s/it]

2025/03/11 22:04:44 INFO dspy.evaluate.evaluate: Average Metric: 14.092884408712298 / 25 (56.4%)
2025/03/11 22:04:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.37 on minibatch of size 25 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4'].
2025/03/11 22:04:44 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37]
2025/03/11 22:04:44 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13]
2025/03/11 22:04:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.13


2025/03/11 22:04:44 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 28 - Full Evaluation =====
2025/03/11 22:04:44 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 64.67) from minibatch trials...



Average Metric: 97.79 / 160 (61.1%): 100%|██████████| 160/160 [01:29<00:00,  1.78it/s]

2025/03/11 22:06:14 INFO dspy.evaluate.evaluate: Average Metric: 97.79287481178888 / 160 (61.1%)
2025/03/11 22:06:14 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 61.12
2025/03/11 22:06:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:06:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12
2025/03/11 22:06:14 INFO dspy.teleprompt.mipro_optimizer_v2: 
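The log message for Trial 11 says the full evaluation runs on the "next top averaging program" from the minibatch trials. The sketch below illustrates only that selection step, using the scores of trials 2 to 10 copied from the log. How the optimizer proposes each combination is not shown, and `next_full_eval_candidate` is a hypothetical name.

```python
from collections import defaultdict

# ((instruction index, few-shot set index), minibatch score) for trials 2-10 above.
minibatch_log = [
    ((12, 7), 54.85), ((10, 7), 62.78), ((7, 18), 64.67),
    ((15, 2), 61.87), ((8, 18), 59.34), ((7, 1), 58.56),
    ((7, 12), 59.53), ((11, 13), 61.09), ((5, 4), 56.37),
]


def next_full_eval_candidate(log, already_evaluated=()):
    """Pick the combination with the highest average minibatch score."""
    scores = defaultdict(list)
    for combo, score in log:
        scores[combo].append(score)
    averages = {combo: sum(vals) / len(vals)
                for combo, vals in scores.items()
                if combo not in already_evaluated}
    best = max(averages, key=averages.get)
    return best, averages[best]


combo, avg = next_full_eval_candidate(minibatch_log)
print(combo, avg)
```

The selected combination (Instruction 7 with Few-Shot Set 18, average 64.67) matches the log. Its full-evaluation score of 61.12 is lower than the minibatch score, which is a reminder that 25-example minibatches are noisy estimates.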

2025/03/11 22:06:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 28 - Minibatch ==



Average Metric: 15.80 / 25 (63.2%): 100%|██████████| 25/25 [01:18<00:00,  3.15s/it]

2025/03/11 22:07:33 INFO dspy.evaluate.evaluate: Average Metric: 15.79697364365773 / 25 (63.2%)
2025/03/11 22:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 63.19 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 8'].
2025/03/11 22:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19]
2025/03/11 22:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:07:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 13 / 28 - Minibatch ==



Average Metric: 14.78 / 25 (59.1%): 100%|██████████| 25/25 [00:24<00:00,  1.01it/s]

2025/03/11 22:07:58 INFO dspy.evaluate.evaluate: Average Metric: 14.77719197206137 / 25 (59.1%)
2025/03/11 22:07:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.11 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 8'].
2025/03/11 22:07:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11]
2025/03/11 22:07:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:07:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:07:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 28 - Minibatch ==



Average Metric: 15.25 / 25 (61.0%): 100%|██████████| 25/25 [00:21<00:00,  1.15it/s]

2025/03/11 22:08:20 INFO dspy.evaluate.evaluate: Average Metric: 15.253148075549364 / 25 (61.0%)
2025/03/11 22:08:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.01 on minibatch of size 25 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 9'].
2025/03/11 22:08:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01]
2025/03/11 22:08:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:08:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:08:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 28 - Minibatch ==



Average Metric: 14.54 / 25 (58.2%): 100%|██████████| 25/25 [00:20<00:00,  1.22it/s]

2025/03/11 22:08:40 INFO dspy.evaluate.evaluate: Average Metric: 14.539452172279168 / 25 (58.2%)
2025/03/11 22:08:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.16 on minibatch of size 25 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:08:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16]
2025/03/11 22:08:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:08:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:08:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 28 - Minibatch ==



Average Metric: 14.33 / 25 (57.3%): 100%|██████████| 25/25 [00:19<00:00,  1.29it/s]

2025/03/11 22:08:59 INFO dspy.evaluate.evaluate: Average Metric: 14.329899837617274 / 25 (57.3%)
2025/03/11 22:08:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.32 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 8'].
2025/03/11 22:08:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32]
2025/03/11 22:08:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:08:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:09:00 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 28 - Minibatch ==



Average Metric: 14.36 / 25 (57.4%): 100%|██████████| 25/25 [01:32<00:00,  3.72s/it]

2025/03/11 22:10:33 INFO dspy.evaluate.evaluate: Average Metric: 14.361461896335845 / 25 (57.4%)
2025/03/11 22:10:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.45 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 5'].
2025/03/11 22:10:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45]
2025/03/11 22:10:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:10:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:10:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 28 - Minibatch ==



Average Metric: 16.63 / 25 (66.5%): 100%|██████████| 25/25 [00:08<00:00,  2.90it/s]

2025/03/11 22:10:41 INFO dspy.evaluate.evaluate: Average Metric: 16.63074181389143 / 25 (66.5%)
2025/03/11 22:10:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 66.52 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:10:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52]
2025/03/11 22:10:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:10:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:10:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 19 / 28 - Minibatch ==



Average Metric: 15.93 / 25 (63.7%): 100%|██████████| 25/25 [00:22<00:00,  1.12it/s]

2025/03/11 22:11:04 INFO dspy.evaluate.evaluate: Average Metric: 15.930546299825908 / 25 (63.7%)
2025/03/11 22:11:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 63.72 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/03/11 22:11:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72]
2025/03/11 22:11:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:11:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:11:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 28 - Minibatch ==



Average Metric: 15.12 / 25 (60.5%): 100%|██████████| 25/25 [00:20<00:00,  1.23it/s]

2025/03/11 22:11:24 INFO dspy.evaluate.evaluate: Average Metric: 15.115268746478558 / 25 (60.5%)
2025/03/11 22:11:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.46 on minibatch of size 25 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 3'].
2025/03/11 22:11:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46]
2025/03/11 22:11:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12]
2025/03/11 22:11:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.12


2025/03/11 22:11:24 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 21 / 28 - Full Evaluation =====
2025/03/11 22:11:24 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 63.72) from minibatch trials...



Average Metric: 97.83 / 160 (61.1%): 100%|██████████| 160/160 [01:18<00:00,  2.03it/s]

2025/03/11 22:12:43 INFO dspy.evaluate.evaluate: Average Metric: 97.82914508233239 / 160 (61.1%)
2025/03/11 22:12:43 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 61.14
2025/03/11 22:12:43 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:12:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14
2025/03/11 22:12:43 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/11 22:12:43 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 28 - Minibatch ==



Average Metric: 16.61 / 25 (66.5%): 100%|██████████| 25/25 [00:01<00:00, 23.95it/s]

2025/03/11 22:12:44 INFO dspy.evaluate.evaluate: Average Metric: 16.61435917326007 / 25 (66.5%)
2025/03/11 22:12:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 66.46 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:12:44 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46, 66.46]
2025/03/11 22:12:44 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:12:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14


2025/03/11 22:12:44 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 28 - Minibatch ==



Average Metric: 15.18 / 25 (60.7%): 100%|██████████| 25/25 [00:20<00:00,  1.21it/s]

2025/03/11 22:13:04 INFO dspy.evaluate.evaluate: Average Metric: 15.181797803956204 / 25 (60.7%)
2025/03/11 22:13:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.73 on minibatch of size 25 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:13:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46, 66.46, 60.73]
2025/03/11 22:13:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:13:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14


2025/03/11 22:13:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 28 - Minibatch ==



Average Metric: 14.96 / 25 (59.8%): 100%|██████████| 25/25 [00:22<00:00,  1.13it/s]

2025/03/11 22:13:27 INFO dspy.evaluate.evaluate: Average Metric: 14.957747537315127 / 25 (59.8%)
2025/03/11 22:13:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.83 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:13:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46, 66.46, 60.73, 59.83]
2025/03/11 22:13:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:13:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14


2025/03/11 22:13:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 25 / 28 - Minibatch ==



Average Metric: 14.40 / 25 (57.6%): 100%|██████████| 25/25 [00:22<00:00,  1.12it/s]

2025/03/11 22:13:49 INFO dspy.evaluate.evaluate: Average Metric: 14.40463462874835 / 25 (57.6%)
2025/03/11 22:13:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.62 on minibatch of size 25 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 14'].
2025/03/11 22:13:49 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46, 66.46, 60.73, 59.83, 57.62]
2025/03/11 22:13:49 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:13:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14


2025/03/11 22:13:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 28 - Minibatch ==



Average Metric: 15.21 / 25 (60.8%): 100%|██████████| 25/25 [00:29<00:00,  1.16s/it]

2025/03/11 22:14:18 INFO dspy.evaluate.evaluate: Average Metric: 15.207612683736725 / 25 (60.8%)
2025/03/11 22:14:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.83 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 16'].
2025/03/11 22:14:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46, 66.46, 60.73, 59.83, 57.62, 60.83]
2025/03/11 22:14:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:14:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14


2025/03/11 22:14:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 28 - Minibatch ==



Average Metric: 16.44 / 25 (65.8%): 100%|██████████| 25/25 [00:01<00:00, 21.82it/s]

2025/03/11 22:14:19 INFO dspy.evaluate.evaluate: Average Metric: 16.4383194776683 / 25 (65.8%)
2025/03/11 22:14:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.75 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/03/11 22:14:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [54.85, 62.78, 64.67, 61.87, 59.34, 58.56, 59.53, 61.09, 56.37, 63.19, 59.11, 61.01, 58.16, 57.32, 57.45, 66.52, 63.72, 60.46, 66.46, 60.73, 59.83, 57.62, 60.83, 65.75]
2025/03/11 22:14:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14]
2025/03/11 22:14:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.14


2025/03/11 22:14:19 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 28 / 28 - Full Evaluation =====
2025/03/11 22:14:19 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 63.19) from minibatch trials...



Average Metric: 98.36 / 160 (61.5%): 100%|██████████| 160/160 [01:27<00:00,  1.82it/s]

2025/03/11 22:15:47 INFO dspy.evaluate.evaluate: Average Metric: 98.36471488348802 / 160 (61.5%)
2025/03/11 22:15:47 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 61.48
2025/03/11 22:15:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.13, 61.12, 61.14, 61.48]
2025/03/11 22:15:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 61.48
2025/03/11 22:15:47 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/11 22:15:47 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 61.48!
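
The log alternates cheap minibatch trials with occasional full evaluations on the "next top averaging program". The sketch below is one plausible reading of that selection step, not DSPy's actual implementation: group minibatch scores by parameter combination, average them, and pick the best combination that has not yet had a full evaluation. The function name `next_full_eval_candidate` and the contents of `already_evaluated` are assumptions made for illustration; the scores are copied from trials 12 to 20 above.

```python
from collections import defaultdict
from statistics import mean


def next_full_eval_candidate(minibatch_trials, already_evaluated):
    """Return (combo, avg_score) for the best-averaging combination that has
    not been fully evaluated yet, or None if every combination has been."""
    scores_by_combo = defaultdict(list)
    for combo, score in minibatch_trials:
        scores_by_combo[combo].append(score)

    # Average each combination's minibatch scores, skipping evaluated ones.
    averages = {
        combo: mean(scores)
        for combo, scores in scores_by_combo.items()
        if combo not in already_evaluated
    }
    if not averages:
        return None
    best = max(averages, key=averages.get)
    return best, averages[best]


# (instruction, few-shot set) -> minibatch score, from trials 12-20 of the log.
trials = [
    (("Instruction 10", "Few-Shot Set 8"), 63.19),
    (("Instruction 17", "Few-Shot Set 8"), 59.11),
    (("Instruction 13", "Few-Shot Set 9"), 61.01),
    (("Instruction 9", "Few-Shot Set 18"), 58.16),
    (("Instruction 2", "Few-Shot Set 8"), 57.32),
    (("Instruction 10", "Few-Shot Set 5"), 57.45),
    (("Instruction 7", "Few-Shot Set 18"), 66.52),
    (("Instruction 1", "Few-Shot Set 6"), 63.72),
    (("Instruction 18", "Few-Shot Set 3"), 60.46),
]

# ASSUMPTION: the log does not say which combination trial 11 evaluated;
# this set is a guess used only to exercise the "skip evaluated" branch.
already_evaluated = {("Instruction 7", "Few-Shot Set 18")}

candidate = next_full_eval_candidate(trials, already_evaluated)
print(candidate)
```

Under that guess the sketch selects Instruction 1 with Few-Shot Set 6 at an average of 63.72, the same average the log reports for Trial 21. With an empty `already_evaluated` set it would pick Instruction 7 with Few-Shot Set 18 instead.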





In [22]:
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)

You are correct that cmd+tab does not work on hidden or minimized windows. To switch back to a minimized app, you must first switch to another application and let it take focus. This means you cannot directly cycle back to a minimized window using cmd+tab without first selecting a different app.


In [23]:
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)

The Command + Tab shortcut on macOS is designed to switch between open applications, but it does not activate or bring back minimized windows directly. When you use Command + Tab, it cycles through the applications that are currently open, but if an application is minimized, it will not be restored until you select it from the application switcher.

To manage this behavior, you can try the following:

1. **Adjust Mission Control Settings**: Go to System Preferences > Mission Control and uncheck the option "When switching to an application, switch to a Space with open windows for the application." This may help in switching applications more effectively.

2. **Use Option Key**: When you are cycling through applications with Command + Tab, you can hold the Option key while selecting an application. This allows you to switch focus to another app before returning to a minimized one.

3. **Third-Party Applications**: Consider using third-party applications like Witch, which can provide more

In [24]:
evaluate(optimized_rag)

Average Metric: 189.83 / 300 (63.3%): 100%|██████████| 300/300 [03:29<00:00,  1.43it/s]

2025/03/11 22:39:09 INFO dspy.evaluate.evaluate: Average Metric: 189.82949313136845 / 300 (63.3%)





Unnamed: 0,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],The context provides insights into the strengths and weaknesses of...,You should consider using C over C++ in the following scenarios: 1...,✔️ [0.400]
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",The context discusses the pros and cons of storing images in a Git...,Storing images in a Git repository can be problematic due to Git's...,✔️ [0.500]


63.28

In [26]:
cost = sum(x['cost'] for x in lm.history if x['cost'] is not None)  # in USD, as calculated by LiteLLM for certain providers
cost

1.7683434000000022
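
Because the cost is only reported for certain providers, a total on its own can hide how many calls were actually priced. The following is a minimal sketch, assuming each history entry is a dict that may carry a `cost` key; `summarize_costs` and `sample_history` are hypothetical names with made-up values, not part of DSPy or this notebook's data.

```python
def summarize_costs(history):
    """Count calls, count calls with a known cost, and total the known costs."""
    # .get() tolerates entries that have no "cost" key at all.
    costs = [entry.get("cost") for entry in history]
    known = [c for c in costs if c is not None]
    return {
        "calls": len(costs),
        "priced_calls": len(known),
        "total_usd": sum(known),
    }


# Hypothetical entries standing in for lm.history; values are made up.
sample_history = [{"cost": 0.0012}, {"cost": None}, {"cost": 0.0008}, {}]

summary = summarize_costs(sample_history)
print(summary)
```

In the notebook the same helper could be called as `summarize_costs(lm.history)`; if `priced_calls` is well below `calls`, the reported total is a lower bound.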

In [27]:
optimized_rag.save("optimized_rag.json")

loaded_rag = RAG()
loaded_rag.load("optimized_rag.json")

loaded_rag(question="cmd+tab does not work on hidden or minimized windows")

Prediction(
    reasoning='The context discusses the behavior of the Command + Tab shortcut on macOS, particularly in relation to switching between applications and handling minimized or hidden windows. It explains that Command + Tab allows users to cycle through applications but does not activate minimized windows directly. The context also provides alternative methods and settings adjustments that can help manage application switching more effectively.',
    response='The Command + Tab shortcut on macOS is designed to switch between open applications, but it does not activate or bring back minimized windows directly. When you use Command + Tab, it cycles through the applications that are currently open, but if an application is minimized, it will not be restored until you select it from the application switcher.\n\nTo manage this behavior, you can try the following:\n\n1. **Adjust Mission Control Settings**: Go to System Preferences > Mission Control and uncheck the option "When swit