# Tutorial: Retrieval-Augmented Generation (RAG)

Let's walk through a quick example of **basic question answering** with and without **retrieval-augmented generation** (RAG) in DSPy. Specifically, let's build **a system for answering Tech questions**, e.g. about Linux or iPhone apps.

Install the latest DSPy via `pip install -U dspy` and follow along. If you're looking instead for a conceptual overview of DSPy, this [recent lecture](https://www.youtube.com/live/JEMYuzrKLUw) is a good place to start.


## Configuring the DSPy environment.

Let's tell DSPy that we will use a local `mistral:7b` model served by Ollama in our modules. You can easily swap this out for [other providers or local models](https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb).


In [2]:
import dspy

lm = dspy.LM(model='ollama/mistral:7b', base_url='http://localhost:11434/', num_ctx=34000)
dspy.configure(lm=lm)

## Exploring some basic DSPy Modules.

You can always prompt the LM directly via `lm(prompt="prompt")` or `lm(messages=[...])`. However, DSPy gives you `Modules` as a better way to define your LM functions.

The simplest module is `dspy.Predict`. It takes a [DSPy Signature](/building-blocks/2-signatures), i.e. a structured input/output schema, and gives you back a callable function for the behavior you specified. Let's use the "in-line" notation for signatures to declare a module that takes a `question` (of type `str`) as input and produces a `response` as an output.

In [3]:
qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")

print(response.response)

High memory in Linux is referred to as RAM (Random Access Memory). It is a type of volatile memory that stores data temporarily while the system is running. Low memory, on the other hand, is when the available RAM becomes insufficient for the current workload, causing the system to slow down or become unresponsive.


Notice how the variable names we specified in the signature defined our input and output argument names and their roles.
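To make the in-line notation concrete, here is a minimal sketch of how a string such as `'question: str -> response: str'` could be split into named input and output fields. It is not DSPy's parser: `parse_inline_signature` is a hypothetical helper, it assumes unannotated fields default to `str`, and it does not handle types that contain commas (e.g. `dict[str, int]`).

```python
def parse_inline_signature(spec: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split 'a: T, b -> c: U' into ({input: type}, {output: type}).

    Hypothetical helper for illustration only; not DSPy's implementation.
    """
    def parse_fields(part: str) -> dict[str, str]:
        fields = {}
        for item in part.split(","):
            name, _, type_name = item.partition(":")
            # Assumption: a field without an annotation is treated as a string.
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    inputs, outputs = spec.split("->")
    return parse_fields(inputs), parse_fields(outputs)


inputs, outputs = parse_inline_signature("question: str -> response: str")
print(inputs, outputs)
```

The field names on each side of `->` are what become the keyword arguments of the call and the attributes of the returned prediction.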

Now, what did DSPy do to build this `qa` module? Nothing fancy in this example, yet. The module passed your signature, LM, and inputs to an [Adapter](/building-blocks/1-language_models#structured-lm-output-with-adapters), which is a layer that handles structuring the inputs and parsing structured outputs to fit your signature.

Let's see it directly. You can easily inspect the last `n` prompts sent by DSPy.


In [4]:
dspy.inspect_history(n=1)

[2024-11-28T10:19:02.636933]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        Given the fields `question`, produce the fields `response`.


User message:

[[ ## question ## ]]
what are high memory and low memory on linux?

Respond with the corresponding output fields, starting with the field `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## response ## ]]
High memory in Linux is referred to as RAM (Random Access Memory). It is a type of volatile memory that stores data temporarily while the system is running. Low memory, on the other hand, is when the available RAM becomes insufficient for 
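The prompt above suggests what an adapter has to do: wrap each input in a `[[ ## name ## ]]` block and then read the output fields back out of the completion. The sketch below imitates that round trip using only the markers visible in the history. `format_user_message` and `parse_completion` are hypothetical names, and the real adapter does considerably more (type parsing, demos, system message).

```python
import re


def format_user_message(inputs: dict[str, str], first_output: str) -> str:
    """Render inputs in the `[[ ## field ## ]]` layout seen in the history above."""
    blocks = [f"[[ ## {name} ## ]]\n{value}" for name, value in inputs.items()]
    instruction = (
        "Respond with the corresponding output fields, starting with the field "
        f"`[[ ## {first_output} ## ]]`, and then ending with the marker for "
        "`[[ ## completed ## ]]`."
    )
    return "\n\n".join(blocks + [instruction])


def parse_completion(text: str) -> dict[str, str]:
    """Extract {field: value} pairs from a completion that uses the same markers."""
    # With one capture group, re.split yields [prefix, name1, body1, name2, body2, ...].
    parts = re.split(r"\[\[ ## (\w+) ## \]\]", text)
    fields = {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}
    fields.pop("completed", None)  # the end marker is not an output field
    return fields


message = format_user_message({"question": "what is swap space?"}, "response")
completion = "[[ ## response ## ]]\nSwap is disk space used as overflow for RAM.\n\n[[ ## completed ## ]]"
print(parse_completion(completion))
```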

DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.

For example, `dspy.ChainOfThought` is an easy way to elicit `reasoning` out of your LM before it commits to the outputs requested in your signature.

In the example below, we'll omit `str` types (as the default type is string). You should feel free to experiment with other fields and types, e.g. try `topics: list[str]` or `is_realistic: bool`.


In [5]:
cot = dspy.ChainOfThought('question -> response')
cot(question="should curly braces appear on their own line?")

Prediction(
    reasoning='The question asks whether curly braces should appear on their own line. In markdown language, which is used in this interaction, curly braces are typically not used on their own line but rather to denote specific syntax or groupings. However, if the user intends for the curly braces to be displayed as literal characters, they can be placed on their own line without any special formatting.',
    response='Curly braces can appear on their own line in markdown, but they are typically used for syntax or groupings and do not need to be on a separate line unless the user intends them to be displayed as literal characters.'
)



Interestingly, asking for reasoning can make the output `response` shorter in this case. Is this a good thing or a bad thing? It depends on what you need: there's no free lunch, but DSPy gives you the tools to experiment with different strategies extremely quickly.

By the way, `dspy.ChainOfThought` is implemented in DSPy, using `dspy.Predict`. This is a good place to `dspy.inspect_history` if you're curious.


## Using DSPy well involves evaluation and iterative development.

You already know a lot about DSPy at this point. If all you want is quick scripting, this much of DSPy already enables a lot. Sprinkling DSPy signatures and modules into your Python control flow is a pretty ergonomic way to just get stuff done with LMs.

That said, you're likely here because you want to build a high-quality system and improve it over time. The way to do that in DSPy is to iterate fast by evaluating the quality of your system and using DSPy's powerful tools, e.g. [Optimizers](/building-blocks/6-optimizers). You can learn about the [appropriate development cycle in DSPy here](/building-blocks/solving_your_task).

## Manipulating Examples in DSPy.

To measure the quality of your DSPy system, you need (1) a bunch of input values, like `question`s for example, and (2) a `metric` that can score the quality of an output from your system. Metrics vary widely. Some metrics need ground-truth labels of ideal outputs, e.g. for classification or question answering. Other metrics are self-supervised, e.g. checking faithfulness or lack of hallucination, perhaps using a DSPy program as a judge of these qualities.
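As a rough illustration of a label-based metric, the sketch below scores a prediction by how many expected keywords it mentions. It follows the common DSPy convention of a function taking `(example, pred, trace=None)`, but the `keywords` field, the `keyword_recall` name, and the `SimpleNamespace` stand-ins are assumptions made for this example only.

```python
from types import SimpleNamespace


def keyword_recall(example, pred, trace=None):
    """Toy metric: fraction of the example's gold keywords found in the response."""
    response = pred.response.lower()
    hits = [kw for kw in example.keywords if kw.lower() in response]
    return len(hits) / len(example.keywords)


# Stand-ins for a labeled example and a system prediction (hypothetical fields).
example = SimpleNamespace(question="what is swap space?", keywords=["disk", "memory"])
pred = SimpleNamespace(response="Swap is disk space used when memory runs out.")

score = keyword_recall(example, pred)
print(score)
```

A metric like this is cheap but brittle for long answers, which is why the tutorial turns to an LM-judged metric later on.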

Let's load a dataset of questions and their (pretty long) gold answers. Since we started this notebook with the goal of building **a system for answering Tech questions**, we obtained a bunch of StackExchange-based questions and their correct answers from the [RAG-QA Arena](https://arxiv.org/abs/2407.13998) dataset. (Learn more about the [development cycle](/building-blocks/solving_your_task) if you don't have data for your task.)



In [6]:
import ujson
from dspy.utils import download

# Download question--answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

In [7]:
# Inspect one datapoint.
data[0]

{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}



Given a simple dict like this, let's create a list of `dspy.Example`s, which is the datatype that carries training (or test) datapoints in DSPy.

When you build a `dspy.Example`, you should generally specify `.with_inputs("field1", "field2", ...)` to indicate which fields are inputs. The other fields are treated as labels or metadata.


In [8]:
data = [dspy.Example(**d).with_inputs('question') for d in data]

# Let's pick an `example` here from the data.
example = data[2]
example

Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})
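Conceptually, `.with_inputs('question')` partitions each record into inputs and everything else. The plain-dict sketch below shows that partition; it does not use `dspy.Example`, and the `response` text is shortened from the datapoint shown earlier.

```python
record = {
    "question": "why igp is used in mpls?",
    "response": "An IGP exchanges routing prefixes between gateways/routers.",  # shortened
    "gold_doc_ids": [2822, 2823],
}
input_keys = {"question"}

# Fields named as inputs are what the program receives...
inputs = {k: v for k, v in record.items() if k in input_keys}
# ...and the remaining fields act as labels or metadata for metrics.
labels = {k: v for k, v in record.items() if k not in input_keys}

print(inputs)
print(sorted(labels))
```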


Now, let's divide the data into:

- Training (and with it Validation) set:
    - These are the splits you typically give to DSPy optimizers.
    - Optimizers typically learn directly from the training examples and check their progress using the validation examples.
    - It's good to have 30--300 examples for training and validation each.
    - For prompt optimizers in particular, it's often better to pass _more_ validation than training.
    - Below, we'll use 200 in total. MIPROv2 will split them into 20% training and 80% validation if you don't pass a valset.

- Development and Test sets: The rest, typically on the order of 30--1000, can be used for:
    - development (i.e., you can inspect them as you iterate on your system) and
    - testing (final held-out evaluation).


In [9]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)
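For the 200-example `trainset`, the 20%/80% split mentioned above works out to 40 training and 160 validation examples, which matches the `valset size: 160` line in the optimizer log later in this tutorial. The sketch below shows only that arithmetic. Which examples end up in which part is an assumption here (the first 20% go to training), and the optimizer's own splitting logic may differ.

```python
examples = list(range(200))  # stand-in for the 200 examples in `trainset`

val_size = int(len(examples) * 0.80)
cutoff = len(examples) - val_size

# Assumption: the leading slice is used for training, the rest for validation.
train_part, val_part = examples[:cutoff], examples[cutoff:]

print(len(train_part), len(val_part))
```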

## Evaluation in DSPy.

What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response _cover_ all key facts in the gold response? And the other way around, how well does the system response _avoid saying things_ that aren't in the gold response?

That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/main/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
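The two questions above correspond to recall and precision, and F1 is their harmonic mean. The sketch below is the standard formula, with a guard for the case where both values are zero. It is meant as a reminder of the arithmetic, not as a copy of the `SemanticF1` source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A response that covers every gold fact (recall 1.0) but is half off-topic
# (precision 0.5) scores well below the midpoint between the two values.
print(round(f1_score(precision=0.5, recall=1.0), 3))
```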

In [10]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 It seems like your text messages are coming up as "maybe" because there might be a problem with your network connection, the messaging app, or the message content itself. I would recommend checking your internet connection, updating the messaging app if possible, and ensuring that the message content is not corrupted. If the issue persists, you may want to contact your service provider for further assistance.

Semantic F1 Score: 0.00



The final DSPy module call above actually happens inside `metric`. You might be curious how it measured the semantic F1 for this example.



In [11]:
dspy.inspect_history(n=1)

[2024-11-28T10:19:31.437141]

System message:

Your input fields are:
1. `question` (str)
2. `ground_truth` (str)
3. `system_response` (str)

Your output fields are:
1. `reasoning` (str)
2. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth
3. `system_response_key_ideas` (str): enumeration of key ideas in the system response
4. `discussion` (str): discussion of the overlap between ground truth and system response
5. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response
6. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## ground_truth ## ]]
{ground_truth}

[[ ## system_response ## ]]
{system_response}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ground_truth_key_ideas ## ]]
{ground_truth_key_ideas}

[[ ## system_response_key_ideas ## ]

For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.

In [12]:
# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=1,
                         display_progress=True, display_table=2)

# Evaluate the Chain-of-Thought program.
eval_result = evaluate(cot)

Average Metric: 8.66 / 300 (2.9%):   7%|████████████▍                                                                                                                                                                              | 20/300 [09:13<2:11:54, 28.27s/it]

2024/11/28 10:30:45 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'preserve bash history in multiple terminal windows', 'response': 'To preserve bash history in multiple terminal windows, first, add two lines to your ~/.bashrc: \n```shopt -s histappend PROMPT_COMMAND="history -a;history -c;history -r;$PROMPT_COMMAND"```. \nAdditionally, you can use the `history -a` command to append the current session\'s history to the histfile and then use `history -r` on other terminals to read the histfile . \nYou can edit your BASH prompt to run "history -a" and "history -r": \n```savePS1=$PS1 (in case you mess something up, which is almost guaranteed) PS1=$savePS1`history -a;history -r```. \nAnother approach involves writing a script to set a history file per session or task: \n```# write existing history to the old file history -a # set new historyfile export HISTFILE="$1" export HISET=$1```. \nYou can also map the following to a hotkey: \n```history | grep -v history

Average Metric: 36.09 / 300 (12.0%):  27%|█████████████████████████████████████████████████▉                                                                                                                                       | 81/300 [38:41<1:35:41, 26.22s/it]

2024/11/28 10:59:16 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'passing named arguments to shell scripts', 'response': "The syntax closest to the example provided is `p_out='/some/path' arg_1='5' my_script`.  \nIf you're comfortable with using single-letter arguments, you can utilize `my_script -p '/some/path' -a5` with the built-in getopts in bash.  \nHowever, with zsh, you would use zparseopts instead.", 'gold_doc_ids': [7794, 7797, 7842]}) (input_keys={'question'}): litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_adapter.py", line 78, in parse
    fields[k] = parse_value(v, signature.output_fields[k].annotation) if _parse_values else v
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_adapter.py", line 164, in 

Average Metric: 105.98 / 300 (35.3%):  79%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                      | 237/300 [1:52:49<36:30, 34.77s/it]

2024/11/28 12:13:51 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'how do i remove the first 300 million lines from a 700 gb txt file on a system with 1 tb disk space?', 'response': 'To modify a file and remove initial \'n\' lines, one approach uses the `dd` command in conjunction with other utilities like `truncate` without using a temporary file and there is no size limit, but it is dangerous since there is no track of progress, and any error leaves you with a broken file. \nThe `losetup` command offers an alternative to the `dd` method but presents similar dangers. \nAnother method involves compressing the file with: \'gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz\'. \nThe `fallocate()` system call is mentioned as a viable option on specific filesystems such as ext4 or xfs. \nUsing tools like `hexpeek` or a custom program to find the 300-millionth newline, written in languages such as C, Perl, or Python offers additional strategies li

Average Metric: 110.28 / 300 (36.8%):  83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                               | 248/300 [1:59:04<24:44, 28.54s/it]

2024/11/28 12:19:47 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'what should a website operator do about the heartbleed openssl exploit?', 'response': 'This requires more than just new certificates (or rather, new key pairs) for every affected server, as it also requires: patching affected systems to OpenSSL 1.0.1g, revocation of the old keypairs that were just superseded, changing all passwords, and invalidating all session keys and cookies.', 'gold_doc_ids': [7962]}) (input_keys={'question'}): litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_adapter.py", line 78, in parse
    fields[k] = parse_value(v, signature.output_fields[k].annotation) if _parse_values else v
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_

Average Metric: 132.57 / 300 (44.2%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [2:25:12<00:00, 29.04s/it]

2024/11/28 12:44:43 INFO dspy.evaluate.evaluate: Average Metric: 132.56774533781308 / 300 (44.2%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 | response |
|---|---|---|---|---|---|---|---|
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | C is a lower-level programming language that provides less abstrac... | Use C when you need low-level control over memory management, perf... | ✔️ [0.724] | |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | Storing images in a Git repository is generally not recommended du... | No, images should not be stored in a Git repository. It is better ... | ✔️ [0.224] | |


In [13]:
eval_result

44.19
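The "simple loop" alternative mentioned earlier amounts to scoring every example and averaging. The sketch below does that with toy stand-ins (`toy_program`, `toy_metric`, and `toy_devset` are hypothetical), and then checks that the reported 44.19 is the logged metric sum divided by the 300 dev examples, expressed as a percentage.

```python
def average_score(program, dataset, metric) -> float:
    """Run `program` on each example, score it, and return the mean as a percentage."""
    scores = [metric(example, program(example["question"])) for example in dataset]
    return round(100 * sum(scores) / len(scores), 2)


# Toy stand-ins so the loop can run without an LM.
toy_devset = [
    {"question": "2+2?", "answer": "4"},
    {"question": "capital of France?", "answer": "Paris"},
]


def toy_program(question: str) -> str:
    return "4"


def toy_metric(example: dict, prediction: str) -> float:
    return float(example["answer"] == prediction)


result = average_score(toy_program, toy_devset, toy_metric)

# The same arithmetic applied to the logged total from the run above.
reported = round(100 * 132.56774533781308 / 300, 2)
print(result, reported)
```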

So far, we built a very simple chain-of-thought module for question answering and evaluated it on a small dataset.

Can we do better? In the rest of this guide, we will build a retrieval-augmented generation (RAG) program in DSPy for the same task. We'll see how this can boost the score substantially, then we'll use one of the DSPy Optimizers to _compile_ our RAG program to higher-quality prompts, raising our scores even more.

## Basic Retrieval-Augmented Generation (RAG).

First, let's download the corpus data that we will use for RAG search. An older version of this tutorial used the full (650,000 document) corpus. To make this very fast and cheap to run, we've downsampled the corpus to just 28,000 documents.

In [14]:
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

## Set up your system's retriever.

As far as DSPy is concerned, you can plug in any Python code for calling tools or retrievers. Here, we'll just use an Ollama-served embedding model (`mxbai-embed-large`) and do top-K search locally, just for convenience.

**Note:** The step below will require that you either do `pip install -U faiss-cpu` or pass `brute_force_threshold=30_000` to `dspy.retrievers.Embeddings` to avoid faiss.

In [15]:
# %pip install -U faiss-cpu  # or faiss-gpu if you have a GPU

**Note:** LiteLLM's Ollama wrapper runs a separate async job, which causes everything to crash. To avoid the crash, we need to install `nest_asyncio` and apply it.

In [16]:
import nest_asyncio
import litellm

litellm.drop_params = True

nest_asyncio.apply()

In [17]:
max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

embedder = dspy.Embedder('ollama/mxbai-embed-large', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 1024-dim embeddings
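Conceptually, top-K embedding search ranks every document vector by its similarity to the query vector and keeps the best `k`. The brute-force sketch below shows this with cosine similarity on hand-made 3-dimensional vectors. The vectors and the `cosine` and `top_k` helpers are illustrative assumptions; the retriever above uses real embeddings and, for a corpus of this size, an approximate FAISS index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], corpus_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings": documents 0 and 1 point roughly the same way as the query.
corpus_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]

best = top_k(query_vec, corpus_vecs, k=2)
print(best)
```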



## Build your first RAG Module.

Earlier in this tutorial, we looked at individual DSPy modules in isolation, e.g. `dspy.Predict("question -> answer")`.

What if we want to build a DSPy _program_ that has multiple steps? The syntax below with `dspy.Module` allows you to connect a few pieces together, in this case, our retriever and a generation module, so the whole system can be optimized.

Concretely, in the `__init__` method, you declare any sub-module you'll need, which in this case is just a `dspy.ChainOfThought('context, question -> response')` module that takes retrieved context, a question, and produces a response. In the `forward` method, you simply express any Python control flow you like, possibly using your modules. In this case, we first invoke the `search` function defined earlier and then invoke the `self.respond` ChainOfThought module.


In [18]:
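class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Sub-module: generates a response from the retrieved context and the question.
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        # Retrieve the top-k passages with the `search` retriever defined earlier.
        context = search(question).passages
        return self.respond(context=context, question=question)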


Let's use the RAG module.


In [19]:
rag = RAG()
rag(question="what are high memory and low memory on linux?")

Prediction(
    reasoning='The context provided explains that High Memory (HIGHEMEM) is a range of kernel memory space, used for single-time data buffers. It is not memory you access but rather a place where you put what you want to access. Low Memory (LOWMEM), on the other hand, is statically mapped at the kernel boot time and is always accessible within the kernel.',
    response='High Memory (HIGHEMEM) and Low Memory (LOWMEM) are two segments of memory space in a Linux kernel. High Memory is used for single-time data buffers, while Low Memory is statically mapped at the kernel boot time and can be directly accessed within the kernel.'
)

In [20]:
dspy.inspect_history()

[2024-11-28T13:17:28.232225]

System message:

Your input fields are:
1. `context` (str)
2. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «This is relevant to the Linux kernel; Im not sure how any Unix kernel handles this. The High Memory is the segment of memory that user-space programs can address. It cannot touch Low Memory. Low Memory is the segment of memory that the Linux kernel can address directly. If the kernel must access High Memory, it has to map it into its own address space first. There w

Earlier with a CoT module, we got around 44% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?

In [21]:
evaluate(RAG())

Average Metric: 0.76 / 300 (0.3%):   1%|█▎                                                                                                                                                                                          | 2/300 [01:07<2:52:52, 34.81s/it]

2024/11/28 13:20:08 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'how to chmod without /usr/bin/chmod?', 'response': "Run the loader directly, and pass it the command you want to run: `/lib/ld-linux.so /bin/chmod +x /bin/chmod`. \nThe exact path might change, especially on a 64-bit system, so version is named something like `/lib64/ld-linux-x86-64.so.2.` \nAlternatively, if busybox is installed, you can execute `busybox chmod +x /bin/chmod`. \nAnother hack is the command: `mv /bin/chmod /bin/chmod.orig cp -a /bin/chown /bin/chmod`. \nYou can also prepare another executable file, copy chmod over it to maintain executable permissions, `$ cp /bin/ls chmod $ cp /bin/chmod`, or use the install utility to do this with permission settings in one step; the command for this would be: `$ install -m a+x /bin/chmod . $ ./chmod # executes'`. \nPiping contents into an already executable file is another option (`cp /usr/bin/executable_file ~/executable_file cat /usr/bin/c

Average Metric: 3.16 / 300 (1.1%):   3%|█████                                                                                                                                                                                       | 8/300 [05:29<2:52:51, 35.52s/it]

2024/11/28 13:24:35 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'why is my message not delivered imessage?', 'response': '"Not Delivered" implies an error of some kind, perhaps because the recipient\'s phone battery is low/dead, the recipient\'s phone is out of wifi range, the recipient has you on block, or the recipient\'s phone is off, whereas if nothing is displayed that just means the message is queued up to an inactive phone, or backed-up in transit over the network.', 'gold_doc_ids': [7645, 4712]}) (input_keys={'question'}): litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_adapter.py", line 78, in parse
    fields[k] = parse_value(v, signature.output_fields[k].annotation) if _parse_values else v
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.

Average Metric: 127.65 / 300 (42.5%):  79%|██████████████████████████████████████████████████████████████████████████████████████████                        | 237/300 [2:24:04<44:28, 42.36s/it]                                                                     

2024/11/28 15:43:36 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'how do i remove the first 300 million lines from a 700 gb txt file on a system with 1 tb disk space?', 'response': 'To modify a file and remove initial \'n\' lines, one approach uses the `dd` command in conjunction with other utilities like `truncate` without using a temporary file and there is no size limit, but it is dangerous since there is no track of progress, and any error leaves you with a broken file. \nThe `losetup` command offers an alternative to the `dd` method but presents similar dangers. \nAnother method involves compressing the file with: \'gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz\'. \nThe `fallocate()` system call is mentioned as a viable option on specific filesystems such as ext4 or xfs. \nUsing tools like `hexpeek` or a custom program to find the 300-millionth newline, written in languages such as C, Perl, or Python offers additional strategies li

Average Metric: 162.18 / 300 (54.1%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [3:05:20<00:00, 37.07s/it]

2024/11/28 16:22:48 INFO dspy.evaluate.evaluate: Average Metric: 162.17509656200917 / 300 (54.1%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 | response |
|---|---|---|---|---|---|---|---|
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | C is typically used in embedded systems where using C++ might be o... | You should use C in embedded systems where using C++ might be over... | ✔️ [0.284] | |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | Storing images in a Git repository can be tricky due to its nature... | While it is possible to store images in a Git repository, it may n... | ✔️ [0.480] | |


54.06

## Using a DSPy Optimizer to improve your RAG prompt.

Off the shelf, our `RAG` module scores 54%. What are our options to make it stronger? One of the various choices DSPy offers is optimizing the prompts in our pipeline.

If there are many sub-modules in your program, all of them will be optimized together. In this case, there's only one: `self.respond = dspy.ChainOfThought('context, question -> response')`
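To build intuition for what "optimizing the prompt" means, the sketch below picks, from a handful of candidate instructions, the one with the highest average score on a validation set. This is a deliberately simplified stand-in: real optimizers such as MIPROv2 also bootstrap few-shot demonstrations and search far more cleverly. Every name here (`pick_best_instruction`, `toy_score`, the candidates) is hypothetical.

```python
def pick_best_instruction(candidates, valset, score_fn):
    """Return the candidate instruction with the highest mean validation score."""
    def mean_score(instruction: str) -> float:
        return sum(score_fn(instruction, ex) for ex in valset) / len(valset)

    return max(candidates, key=mean_score)


# Toy setup: an instruction "works" on an example when its mention of the
# context matches whether the example actually needs the context.
candidates = ["Answer briefly.", "Answer using the context.", "Answer."]
valset = [{"needs_context": True}, {"needs_context": True}, {"needs_context": False}]


def toy_score(instruction: str, example: dict) -> float:
    return float(example["needs_context"] == ("context" in instruction))


best_instruction = pick_best_instruction(candidates, valset, toy_score)
print(best_instruction)
```

In a real run, the scoring function would call the program with the candidate prompt and apply the metric, which is where the cost and time of optimization come from.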

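Let's set up and use DSPy's [MIPRO (v2) optimizer](/deep-dive/optimizers/miprov2). With a hosted API model, a run like the one below costs around $1.50 (for the `medium` auto setting) and may take some 20-30 minutes depending on your number of threads. With a local Ollama model, as configured in this notebook, there is no API charge, but expect a much longer run: the two `devset` evaluations above took roughly 2.5 and 3 hours on a single thread.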

In [None]:
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=2)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=2, max_labeled_demos=2,
                           requires_permission_to_run=False)

2024/11/28 16:22:48 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 19
valset size: 160

2024/11/28 16:22:48 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2024/11/28 16:22:48 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2024/11/28 16:22:48 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


  8%|████████████████▉                                                                                                                                                                                                                 | 3/40 [02:03<25:26, 41.27s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 4/19


  5%|███████████▎                                                                                                                                                                                                                      | 2/40 [01:22<25:58, 41.03s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/19


  5%|███████████▎                                                                                                                                                                                                                      | 2/40 [01:37<31:01, 48.98s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 6/19


  0%|                                                                                                                                                                                                                                          | 0/40 [00:00<?, ?it/s]2024/11/28 16:31:45 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'what are your favorite mac os x keyboard shortcuts?', 'response': "Ctrl+F2 allows you to access menu options using your keyboard, similar to the Windows Alt+Space shortcut.  \nUsing ⌘+E, you can place selected text into the search clipboard, and then utilize the combination of ⌘+G to find the next instances, making search and replace operations quick and convenient.  \nIf you wish to enable a high contrast theme, the shortcut is ctrl+⌥+⌘+8, and to cycle through an app's open windows, you can use ⌘+` (backtick).  \nFor taking screenshots, ⌘+⇧+4 will save a selective screenshot to the desktop, and ⌘+⇧+ctrl+4 will save it

Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 7/19


  0%|                                                                                                                                                                                                                                          | 0/40 [00:00<?, ?it/s]2024/11/28 16:34:38 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'is there a way to dynamically refresh the less command?', 'response': 'Within the Unix \'less\' program, using \'F\' or \'Shift+F\' initiates a \'follow\' function akin to the \'tail -f\' command, allowing updates to display as they occur.  \nIf you prefer to use \'tail -f\' directly, you can pipe your command to it, though this will restrict you from scrolling through your output. \nAlternatively, typing \'G\' approaches similar functionality but on a one-time basis, which may be advantageous in particular network environments.  \nIf you\'re accessing \'less\' via a command line, you can also use \'less +F\' followed 

Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 8/19


 12%|████████████████████████████▎                                                                                                                                                                                                     | 5/40 [03:00<21:04, 36.14s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/19


  8%|████████████████▉                                                                                                                                                                                                                 | 3/40 [01:56<23:55, 38.81s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 10/19


  8%|████████████████▉                                                                                                                                                                                                                 | 3/40 [02:01<24:54, 40.40s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/19


  5%|███████████▎                                                                                                                                                                                                                      | 2/40 [01:16<24:09, 38.14s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 12/19


  2%|█████▋                                                                                                                                                                                                                            | 1/40 [00:34<22:43, 34.97s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 13/19


  5%|███████████▎                                                                                                                                                                                                                      | 2/40 [01:07<21:22, 33.75s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 14/19


  5%|███████████▎                                                                                                                                                                                                                      | 2/40 [01:23<26:28, 41.80s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 15/19


 20%|█████████████████████████████████████████████▏                                                                                                                                                                                    | 8/40 [06:15<25:00, 46.90s/it]


Bootstrapped 1 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
Bootstrapping set 16/19


  8%|████████████████▉                                                                                                                                                                                                                 | 3/40 [01:47<21:32, 34.92s/it]2024/11/28 16:58:21 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'what are your favorite mac os x keyboard shortcuts?', 'response': "Ctrl+F2 allows you to access menu options using your keyboard, similar to the Windows Alt+Space shortcut.  \nUsing ⌘+E, you can place selected text into the search clipboard, and then utilize the combination of ⌘+G to find the next instances, making search and replace operations quick and convenient.  \nIf you wish to enable a high contrast theme, the shortcut is ctrl+⌥+⌘+8, and to cycle through an app's open windows, you can use ⌘+` (backtick).  \nFor taking screenshots, ⌘+⇧+4 will save a selective screenshot to the desktop, and ⌘+⇧+ctrl+4 will save it

Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 17/19


 10%|██████████████████████▌                                                                                                                                                                                                           | 4/40 [02:56<26:24, 44.02s/it]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 18/19


  5%|███████████▎                                                                                                                                                                                                                      | 2/40 [01:21<25:56, 40.97s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 19/19


  0%|                                                                                                                                                                                                                                          | 0/40 [00:00<?, ?it/s]2024/11/28 17:05:09 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'what to do if caught in a physical pentest?', 'response': 'When engaging in a Red Team Engagement, if you are caught by security, ensure you follow the police\'s orders. \nThey will verify your Permission to Attack and contact the individual responsible for authorizing your activity.  \nClaiming to be a "security researcher" or attempting to flee the scene are inadvisable strategies, as these could lead to further complications and misunderstandings with security personnel.', 'gold_doc_ids': [5693]}) (input_keys={'question'}) with module = Predict(StringSignature(question, ground_truth, system_response -> reasoning, gr

Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


2024/11/28 17:08:55 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2024/11/28 17:22:38 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2024/11/28 17:22:38 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `context`, `question`, produce the fields `response`.

2024/11/28 17:22:38 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Generate a concise instruction that prompts the Language Model to suggest troubleshooting solutions or tips for common issues encountered when using Mac OS X, based on the given context.

2024/11/28 17:22:38 INFO dspy.teleprompt.mipro_optimizer_v2: 2: As an assistant knowledgeable about Mac OS X system issues and solutions, answer questions related to the given context about where certain files or settings are stored on a Mac. In this case, you have been asked: "Where does the mac store account pictures?"

In order to provide an accurate response, follow these steps:
1. Read through the provided context

Average Metric: 30.17 / 55 (54.9%):  34%|███████████████████████████████████████████████████████████████▉                                                                                                                          | 55/160 [31:51<1:43:59, 59.42s/it]

2024/11/28 17:54:46 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'tmux status-bar corrupted after catting a binary file, how to reset?', 'response': 'You may have to invoke stty sane at the command line which will reset the in-kernel TTY drivers to some set of default values that usually let you proceed. \nIf the TTY drivers are messed up enough, you may have to type stty sane "blind", that is, without on-screen feedback.', 'gold_doc_ids': [7893]}) (input_keys={'question'}): litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_adapter.py", line 78, in parse
    fields[k] = parse_value(v, signature.output_fields[k].annotation) if _parse_values else v
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alazarchuk/Projects/dspy-examples/.venv/lib/python3.12/site-packages/dspy/adapters/chat_adapter.py", line 164,

Average Metric: 30.17 / 55 (54.9%):  35%|█████████████████████████████████████████████████████████████████                                                                                                                         | 56/160 [32:08<1:20:43, 46.57s/it]

The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button. It can easily overfit your training set and fail to generalize to a held-out set, which makes it essential to iteratively validate your programs.
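
One way to make this validation concrete is to score a program on both the training set and a held-out set, then compare the two averages. The sketch below is a minimal illustration in plain Python. The helper name `generalization_report` and all score lists are hypothetical placeholders, not DSPy APIs or results from this notebook; in practice the per-example scores would come from your metric.

```python
from statistics import mean

def generalization_report(train_scores, heldout_scores):
    """Summarize how well training-set quality carries over to held-out data.

    Both arguments are lists of per-example metric scores in [0, 1].
    """
    train_avg = mean(train_scores)
    heldout_avg = mean(heldout_scores)
    return {
        "train": train_avg,
        "heldout": heldout_avg,
        # A large positive gap suggests the program overfit the training set.
        "gap": train_avg - heldout_avg,
    }

# Made-up scores, purely for illustration.
baseline = generalization_report([0.5, 0.25, 0.75], [0.5, 0.5, 0.5])
optimized = generalization_report([1.0, 0.75, 1.0], [0.5, 0.75, 0.5])

# Trust the improvement only if it also shows up on held-out data.
heldout_gain = optimized["heldout"] - baseline["heldout"]
print(f"held-out gain: {heldout_gain:+.2f}, optimized gap: {optimized['gap']:+.2f}")
```

If the held-out gain is small while the gap is large, the optimizer has probably fit the training examples more than the task.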

Let's check on an example here, asking the same question to the baseline `rag = RAG()` program, which was not optimized, and to the `optimized_rag = MIPROv2(..)(..)` program, after prompt optimization.

In [None]:
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)

In [None]:
dspy.inspect_history(n=2)

In [None]:
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)

In [None]:
dspy.inspect_history(n=2)

You can use `dspy.inspect_history(n=2)` to view the RAG prompt [before optimization](https://gist.github.com/okhat/5d04648f2226e72e66e26a8cb1456ee4) and [after optimization](https://gist.github.com/okhat/79405b8889b4b07da577ee19f1a3479a).

Concretely, in one of the runs of this notebook, the optimized prompt does the following (note that it may be different on a later rerun).

1. Constructs the following instruction,
```text
Using the provided `context` and `question`, analyze the information step by step to generate a comprehensive and informative `response`. Ensure that the response clearly explains the concepts involved, highlights key distinctions, and addresses any complexities noted in the context.
```

2. And includes two fully worked-out RAG examples with synthetic reasoning and answers, e.g. `how to transfer whatsapp voice message to computer?`.

Let's now evaluate on the overall devset.

In [None]:
evaluate(optimized_rag)

## Keeping an eye on cost.

DSPy lets you track what your programs spend on LM calls. Here is how to sum up the cost recorded so far.

In [None]:
cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])  # in USD, as calculated by LiteLLM for certain providers
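
The one-liner above skips entries whose `cost` is `None`, so it can be useful to also know how many calls had no recorded price (this may well be the case for a local model, though that is an assumption). A minimal sketch of the same bookkeeping as a reusable helper follows. The function name `summarize_cost` and the sample history are hypothetical; the only thing assumed about each entry is the `cost` key used above.

```python
def summarize_cost(history):
    """Return (total_cost_usd, n_priced, n_unpriced) for a list of call records."""
    costs = [entry.get("cost") for entry in history]
    priced = [c for c in costs if c is not None]
    return sum(priced), len(priced), len(costs) - len(priced)

# Hypothetical records shaped like the entries in lm.history (only `cost` is used).
sample_history = [{"cost": 0.002}, {"cost": None}, {"cost": 0.0015}]

total, n_priced, n_unpriced = summarize_cost(sample_history)
print(f"${total:.4f} over {n_priced} priced calls ({n_unpriced} without a recorded cost)")
```

With a real program, you would pass `lm.history` instead of `sample_history`.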

## Saving and loading.

The optimized program has a pretty simple structure on the inside. Feel free to explore it.

Here, we'll save `optimized_rag` so we can load it again later without having to optimize from scratch.

In [None]:
optimized_rag.save("optimized_rag.json")

loaded_rag = RAG()
loaded_rag.load("optimized_rag.json")

loaded_rag(question="cmd+tab does not work on hidden or minimized windows")


## What's next?

Improving from around 42% to approximately 61% on this task, in terms of `SemanticF1`, was pretty easy.

But DSPy gives you paths to continue iterating on the quality of your system, and we have barely scratched the surface.

In general, you have the following tools:

1. Explore better system architectures for your program, e.g. what if we ask the LM to generate search queries for the retriever? See, e.g., the [STORM pipeline](https://arxiv.org/abs/2402.14207) built in DSPy.
2. Explore different [prompt optimizers](https://arxiv.org/abs/2406.11695) or [weight optimizers](https://arxiv.org/abs/2407.10930). See the **[Optimizers Docs](/building-blocks/6-optimizers)**.
3. Scale inference-time compute using DSPy Optimizers, e.g. this [notebook](https://github.com/stanfordnlp/dspy/blob/main/examples/agents/multi_agent.ipynb).
4. Cut cost by distilling to a smaller LM, via prompt or weight optimization, e.g. [this notebook](https://github.com/stanfordnlp/dspy/blob/main/examples/nli/scone/scone.ipynb).

How do you decide which ones to proceed with first?

The first step is to look at your system outputs, which will allow you to identify the sources of lower performance, if any. While doing all of this, make sure you continue to refine your metric, e.g. by optimizing against your judgments, and to collect more (or more realistic) data, e.g. from related domains or from putting a demo of your system in front of users.
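
As a starting point for looking at outputs, you can sort evaluated examples by their metric score and read the lowest-scoring ones first. The following is a minimal sketch in plain Python. The `worst_examples` helper, the record fields, and the scores are hypothetical and only illustrate the idea.

```python
def worst_examples(records, k=2):
    """Return the k records with the lowest metric score, lowest first."""
    return sorted(records, key=lambda r: r["score"])[:k]

# Hypothetical evaluation records: a question plus the score your metric assigned.
records = [
    {"question": "q1", "score": 0.9},
    {"question": "q2", "score": 0.1},
    {"question": "q3", "score": 0.4},
]

for r in worst_examples(records):
    print(f"{r['score']:.2f}  {r['question']}")
```

Reading the failures in this order tends to show quickly whether the problem lies in retrieval, in the generated response, or in the metric itself.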