In [2]:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")

<Experiment: artifact_location='mlflow-artifacts:/374362034103955121', creation_time=1741686562632, experiment_id='374362034103955121', last_update_time=1741686562632, lifecycle_stage='active', name='DSPy', tags={}>

In [3]:
mlflow.dspy.autolog()



In [4]:
import dspy
import os
from dotenv import load_dotenv
from IPython.display import Image, display

load_dotenv()

lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

In [5]:
response = lm("What is the capital of France?")
print(response)

['The capital of France is Paris.']


In [6]:
qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")
print(response.response)

lm.inspect_history(n=1)

In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of how the kernel manages memory for processes.

- **Low Memory**: This typically refers to the memory that is directly accessible by the kernel and can be used by processes without any special handling. In a 32-bit system, this is usually the first 896 MB of RAM (though this can vary based on the architecture and configuration). Low memory is used for kernel data structures and for user processes that require direct access to memory.

- **High Memory**: This refers to memory that is above the addressable range of the kernel in a 32-bit system. Since the kernel can only directly access a limited amount of memory (due to the 4 GB address space limitation), any memory above this limit is considered high memory. Processes can still use this memory, but the kernel must use special mechanisms (like paging) to access it. High memory is more relevant in systems

In [7]:
cot = dspy.ChainOfThought('question -> response')
cot(question="should curly braces appear on their own line?")

Prediction(
    reasoning='The placement of curly braces on their own line is largely a matter of coding style and conventions. In some programming languages and style guides, such as the Allman style, it is common to place opening curly braces on a new line to enhance readability. This can help visually separate blocks of code. In contrast, other styles, like K&R style, place the opening brace on the same line as the control statement, which can save vertical space. Ultimately, whether curly braces should appear on their own line depends on the coding standards being followed in a particular project or organization.',
    response="Curly braces can appear on their own line depending on the coding style you are following. If you are using a style that emphasizes readability, such as the Allman style, then yes, they should be on their own line. If you are following a more compact style like K&R, then they would typically be on the same line. It's best to adhere to the conventions of the

In [8]:
import ujson
from dspy.utils import download

# Download question-answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

In [9]:
# Inspect one datapoint.
data[0]

{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}

In [10]:
data = [dspy.Example(**d).with_inputs('question') for d in data]

# Let's pick an `example` here from the data.
example = data[2]
example

Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})

In [11]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)
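The cell above shuffles with a fixed seed before slicing, so the three splits come out the same on every run. A minimal sketch of that pattern on toy data follows; `split_dataset` is a hypothetical helper written for illustration and is not part of DSPy.

```python
import random

def split_dataset(items, sizes, seed=0):
    """Shuffle a copy of `items` with a fixed seed, then cut consecutive slices."""
    shuffled = list(items)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for size in sizes:
        splits.append(shuffled[start:start + size])
        start += size
    return splits

toy = list(range(1000))  # stand-in for the list of examples
train, dev, test = split_dataset(toy, sizes=(200, 300, 500))
print(len(train), len(dev), len(test))
```

Because the generator is seeded locally with `random.Random(seed)`, the split does not depend on, or disturb, the global random state.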

In [12]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 Your text messages are showing up as "maybe" because the recipient's messaging app is unsure about the sender's identity. This can occur if you're not in their contacts or if the app is trying to filter out potential spam. To resolve this, you can ask the recipient to save your number or check their messaging app settings.

Semantic F1 Score: 0.33
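As far as I understand it, `SemanticF1` asks an LM to judge how much of the gold response the prediction covers (recall) and how much of the prediction is supported by the gold response (precision), then combines the two as a harmonic mean. The sketch below shows only that last arithmetic step. The precision and recall values are hypothetical and are not the ones behind the 0.33 above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, with both clamped to [0, 1]."""
    precision = max(0.0, min(1.0, precision))
    recall = max(0.0, min(1.0, recall))
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical judgments: a quarter of the prediction is supported,
# and half of the gold response is covered.
print(f"{f1_score(0.25, 0.5):.2f}")
```

The harmonic mean stays low unless both components are high, so a long answer that covers the gold response but adds unsupported claims is still penalized.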


In [13]:
# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)

# Evaluate the Chain-of-Thought program.
evaluate(cot)

Average Metric: 123.61 / 300 (41.2%): 100%|██████████| 300/300 [02:18<00:00,  2.17it/s]

2025/03/11 21:16:49 INFO dspy.evaluate.evaluate: Average Metric: 123.61108373468522 / 300 (41.2%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 |
|---|---|---|---|---|---|---|
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | C and C++ are both powerful programming languages, but they serve ... | Use C when you need low-level system programming, performance, and... | ✔️ [0.286] |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | Storing images in a Git repository can be problematic for several ... | No, images should generally not be stored in a Git repository due ... | ✔️ [0.286] |


41.2
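The reported figure is the sum of per-example scores divided by the number of examples, expressed as a percentage. A rough sketch of such an evaluation loop follows, using only the standard library. The program, metric, and data are toy stand-ins, and `evaluate_program` is a hypothetical name, not how `dspy.Evaluate` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_program(program, devset, metric, num_threads=4):
    """Score every example in parallel and return the mean as a percentage."""
    def score_one(example):
        prediction = program(example["question"])
        return metric(example, prediction)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, devset))
    return round(100 * sum(scores) / len(scores), 1)

# Toy stand-ins (hypothetical): a lookup-table "program" and an exact-match metric.
canned_answers = {"capital of France?": "Paris", "2 + 2?": "5"}
toy_devset = [
    {"question": "capital of France?", "response": "Paris"},
    {"question": "2 + 2?", "response": "4"},
]

def toy_program(question):
    return canned_answers[question]

def exact_match(example, prediction):
    return float(example["response"] == prediction)

toy_score = evaluate_program(toy_program, toy_devset, exact_match)
print(toy_score)

# The same arithmetic applied to the totals logged above.
print(round(100 * 123.61 / 300, 1))
```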

In [14]:
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

Downloading 'ragqa_arena_tech_corpus.jsonl'...


In [16]:
max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings
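Conceptually, the retriever embeds the query and returns the `k` corpus entries whose embeddings are most similar to it. The log line above indicates that a partitioned FAISS index is used here, which makes the search approximate. The sketch below is the exact brute-force version on hand-written 2-dimensional vectors, with no embedding model involved; `cosine` and `top_k` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, corpus_vecs, k):
    """Return the indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (made up for illustration).
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], toy_corpus, k=2))
```

Brute force compares the query against every document, which is fine for a few thousand vectors; an index trades a little recall for speed on larger corpora.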


In [17]:
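class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Generate a response (with intermediate reasoning) from retrieved context.
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        # Retrieve the top-k passages for the question, then answer from them.
        context = search(question).passages
        return self.respond(context=context, question=question)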

In [18]:
rag = RAG()
rag(question="what are high memory and low memory on linux?")

Prediction(
    reasoning="High memory and low memory in Linux refer to two distinct segments of the kernel's memory space. Low memory is the portion of memory that the kernel can access directly and is always mapped in the kernel's address space. This allows the kernel to access it without any additional steps. High memory, on the other hand, is not permanently mapped in the kernel's address space, meaning that the kernel must map it temporarily when it needs to access it. This distinction is crucial for managing memory efficiently, especially in a 32-bit architecture where the total addressable memory is limited. The kernel can access low memory directly, while high memory is used for temporary data buffers and requires specific functions (like kmap) to access.",
    response="In Linux, low memory refers to the segment of memory that the kernel can access directly and is always mapped in the kernel's address space. High memory, however, is a segment that is not permanently mapped, me

In [19]:
dspy.inspect_history()





[2025-03-11T21:40:37.919814]

System message:

Your input fields are:
1. `context` (str)
2. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.»
[2] «HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x0000000
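The system message above asks the model to emit each output field under a `[[ ## name ## ]]` header and to finish with `[[ ## completed ## ]]`. A reply in that layout can be split back into fields with a small parser. The one below is an illustrative sketch, not DSPy's own parsing code, and the sample reply is invented.

```python
import re

# Matches a header line such as "[[ ## response ## ]]".
FIELD_HEADER = re.compile(r"^\[\[ ## (\w+) ## \]\]\s*$")

def parse_fields(text):
    """Split a reply into {field_name: value}, dropping the 'completed' marker."""
    fields, current = {}, None
    for line in text.splitlines():
        match = FIELD_HEADER.match(line)
        if match:
            current = match.group(1)
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return {
        name: "\n".join(lines).strip()
        for name, lines in fields.items()
        if name != "completed"
    }

sample_reply = """[[ ## reasoning ## ]]
Low memory is always mapped.

[[ ## response ## ]]
Low memory is directly accessible.

[[ ## completed ## ]]"""

print(parse_fields(sample_reply))
```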

In [20]:
evaluate(RAG())

Average Metric: 95.87 / 175 (54.8%):  58%|█████▊    | 175/300 [01:24<00:47,  2.65it/s]

2025/03/11 21:43:53 ERROR dspy.utils.parallelizer: Error for Example({'question': 'apples open letter - they cant or wont backdoor ios?', 'response': 'In certain cases involving specific hardware, it has been stated that manipulating the device\'s firmware is feasible. \nA data forensics company called Cellebrite has reportedly developed the ability to unlock and access data on various iPhone models, hinting at the existence of a vulnerability they can exploit. \nThe notion that this manipulation may be actionable has some support, indicated by Apple\'s own implications in their statements, although they acknowledge the security risks of fulfilling such requests. \nThere is suspicion over whether Apple will comply with creating a system that bypasses security features due to potential risks and their apparent objection to such demands. \nApple has a financial motive in portraying the iPhone as unhackable even by the FBI, and may also be attempting to capitalize on anti-government views

Average Metric: 95.87 / 175 (54.8%):  59%|█████▊    | 176/300 [01:26<02:00,  1.03it/s]

2025/03/11 21:43:53 ERROR dspy.utils.parallelizer: Error for Example({'question': 'why isnt udp with reliability (implemented at application layer) a substitute of tcp?', 'response': "Both protocols are designed with certain guarantees and raw TCP has more guarantees than raw UDP.  \nUDP with reliability can indeed be a substitute for TCP. \nHowever, in general, TCP is pretty optimal so it's not worth going to all the extra work and adding a (large) chance of adding bugs and security flaws.", 'gold_doc_ids': [3453, 3455, 3457]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 199304, Requested 2395. Please try again in 509ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}. Set `provide_traceback=True

Average Metric: 96.12 / 176 (54.6%):  59%|█████▉    | 178/300 [01:27<01:23,  1.47it/s]

2025/03/11 21:43:55 ERROR dspy.utils.parallelizer: Error for Example({'question': 'override itunes media keys (play, pause, etc) for spotify?', 'response': 'NoMitsu has created a one-click installer that patches the remote control daemon, the app responsible for managing the media keys. \nThe app BeardedSpice is does an excellent job and offers a convenient way to ensure your media keys prioritize Spotify, while also working with numerous other media players.', 'gold_doc_ids': [7104, 1433, 3439]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 198281, Requested 2301. Please try again in 174ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}. Set `provide_traceback=True` for traceback.


Average Metric: 96.12 / 176 (54.6%):  60%|█████▉    | 179/300 [01:28<01:35,  1.26it/s]

2025/03/11 21:43:56 ERROR dspy.utils.parallelizer: Error for Example({'question': 'why 2des is not secure?', 'response': 'Two-DES does not provide the expected increase in security due to its susceptibility to certain types of attacks, specifically the "meet-in-the-middle attack". \nDespite employing 112 key bits, 2DES only achieves a security level equivalent to about 257 bits rather than 2112. . \nConversely, although 3DES uses 168 key bits, it effectively offers a security level of 2112 bits, which is considered sufficiently secure for most applications. \nThe reason 3DES is preferred to 2DES is that 2DES doesn\'t actually give that much more security than DES.', 'gold_doc_ids': [8228, 8229]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 199514, Requested 3078. Please try again in 

Average Metric: 97.12 / 178 (54.6%):  60%|██████    | 181/300 [01:30<01:43,  1.15it/s]

2025/03/11 21:43:57 ERROR dspy.utils.parallelizer: Error for Example({'question': 'how to search for android apps without ads?', 'response': 'You could consider using Yalp or Aurora, a fork of Yalp available on F-droid, which offers multiple search filters to enhance your experience. \nThe Playsearch website also offers you the option to search for apps with specific criteria without the need to install anything. \nAppFilter, a new website currently in beta, offers a platform to search for apps applying filters for numerous features and is free to use.', 'gold_doc_ids': [658, 659, 661]}) (input_keys={'question'}): litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-iVeCJplfMaAzsC1tlwHrn5Bd on tokens per min (TPM): Limit 200000, Used 199180, Requested 3151. Please try again in 699ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 

Average Metric: 97.12 / 178 (54.6%):  92%|█████████▏| 277/300 [01:30<00:07,  3.06it/s]






Exception: Execution cancelled due to errors or interruption.
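The run above was cancelled after repeated HTTP 429 responses: 24 concurrent threads pushed token usage past the 200,000 tokens-per-minute limit quoted in the error. Lowering `num_threads` is the simplest remedy. Another common one is to retry rate-limited calls with exponential backoff, sketched below with the standard library only. `RateLimited` and `call_with_backoff` are hypothetical names for illustration; the real exception in the log is `litellm.RateLimitError`.

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 error (hypothetical)."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the wait each time and add jitter so threads do not retry in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

# Toy call that is rate-limited twice and then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited("429")
    return "ok"

# Skip real sleeping in this demonstration.
result = call_with_backoff(flaky, sleep=lambda seconds: None)
print(result, calls["n"])
```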

In [None]:
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=2, max_labeled_demos=2,
                           requires_permission_to_run=False)

2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 19
valset size: 160

2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/03/11 21:45:50 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


 12%|█▎        | 5/40 [01:01<07:07, 12.23s/it]


Bootstrapped 2 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 4/19


 15%|█▌        | 6/40 [01:03<06:01, 10.62s/it]


Bootstrapped 1 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 5/19


  5%|▌         | 2/40 [00:29<09:27, 14.92s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 6/19


  8%|▊         | 3/40 [00:38<08:00, 12.98s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 7/19


  5%|▌         | 2/40 [00:23<07:25, 11.72s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 8/19


 12%|█▎        | 5/40 [00:52<06:07, 10.51s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/19


  5%|▌         | 2/40 [00:19<06:17,  9.94s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/19


  8%|▊         | 3/40 [00:28<05:52,  9.53s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/19


  0%|          | 0/40 [00:00<?, ?it/s]