## Purpose
I have read a lot about the capabilities of the DSPy library from Stanford, and I am trying to understand them. Let's start with https://github.com/stanfordnlp/dspy/blob/main/intro.ipynb. My comments are highlighted in <span style="color:red">**red**</span>.

In [2]:
import dspy
import os
from IPython.display import display, HTML
!pip show dspy-ai | grep Version

Version: 2.4.9



### 1] Getting Started
<span style="color:red">We are using Groq inference because it is much cheaper and faster than OpenAI.</span>


In [3]:
lm = dspy.GROQ(api_key=os.getenv('GROQ_API_KEY'), model='llama3-8b-8192')
# this is the retriever given in the DSPy documentation
retriever = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=lm, rm=retriever)

In [4]:
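# helper functions to print in color
import html

def color_html(s, color='black'):
    '''returns the string wrapped in an HTML span with the given text color'''
    # escape HTML special characters first so prompts render literally
    s = html.escape(s).replace('\n', '<br>')
    return '<span style="color:{}">{}</span>'.format(color, s)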

def cprint(s, color='black'):
    '''prints a string with the given color'''
    display(HTML(color_html(s, color)))

def print_prompt(lm, idx=-1):
    cprint(lm.history[idx]['prompt'], color='blue')

def print_response(lm, idx=-1):
    cprint(lm.history[idx]['response'], color='green')

Whatever the task, the general workflow is:

1. **Collect a little bit of data.** Define examples of the inputs and outputs of your program (e.g., questions and their answers). This could just be a handful of quick examples you wrote down. If large datasets exist, the more the merrier!
1. **Write your program.** Define the modules (i.e., sub-tasks) of your program and the way they should interact together to solve your task.
1. **Define some validation logic.** What makes for a good run of your program? Maybe the answers need to have a certain length or stick to a particular format? Specify the logic that checks that.
1. **Compile!** Ask **DSPy** to _compile_ your program using your data. The compiler will use your data and validation logic to optimize your program (e.g., prompts and modules) so it's efficient and effective! <span style="color:red">We will attempt to understand what this means. How are prompts and modules optimized?</span>
1. **Iterate.** Repeat the process by improving your data, program, validation, or by using more advanced features of the **DSPy** compiler.



### 2] Task Examples

In [5]:

from dspy.datasets import HotPotQA
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# tell DSPy that the question field is the one we want to use (from all the fields in the dataset)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)



(20, 50)

We just loaded `trainset` (20 examples) and `devset` (50 examples). Let's look at some examples from the trainset.


In [6]:
# select a train example and a dev example
train_example = trainset[0]
dev_example = devset[18]

print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt
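To make the `with_inputs('question')` call used for `trainset` more concrete: it marks which fields are fed to the program, and the remaining fields act as labels. The sketch below mimics that split with a hypothetical `ToyExample` class. It is a stand-in for illustration under that assumption, not DSPy's `Example` implementation.

```python
# Toy stand-in for illustration only; ToyExample is hypothetical, not a DSPy class.
class ToyExample:
    def __init__(self, **fields):
        self.fields = dict(fields)
        self.input_keys = set()

    def with_inputs(self, *keys):
        # Return a copy that remembers which fields are inputs.
        copy = ToyExample(**self.fields)
        copy.input_keys = set(keys)
        return copy

    def inputs(self):
        # Fields that would be passed to the program.
        return {k: v for k, v in self.fields.items() if k in self.input_keys}

    def labels(self):
        # Fields held back for the metric.
        return {k: v for k, v in self.fields.items() if k not in self.input_keys}


example = ToyExample(
    question="At My Window was released by which American singer-songwriter?",
    answer="John Townes Van Zandt",
).with_inputs("question")

print(example.inputs())
print(example.labels())
```

Only `question` ends up in `inputs()`, so the gold `answer` is never shown to the program and stays available for validation.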


In [7]:
class BasicQA(dspy.Signature):
    '''Answers questions with short factoid answers'''
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# define predictor
generate_answer = dspy.Predict(BasicQA)
# call the predictor on the particular input
pred = generate_answer(question=dev_example.question)

# print the input and the prediction
print(f"Question: {dev_example.question}")
print(f"Prediction: {pred.answer}")


Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Prediction: Robert Irvine


In [10]:
lm.history[-1]

{'prompt': "Answers questions with short factoid answers\n\n---\n\nFollow the following format.\n\nQuestion: ${question}\nReasoning: Let's think step by step in order to ${produce the answer}. We ...\nAnswer: often between 1 and 5 words\n\n---\n\nQuestion: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?\nReasoning: Let's think step by step in order to",
 'response': "Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?\nReasoning: Let's think step by step in order to figure out the nationality of the chef. He is an American chef and restaurateur, and the show is set in the United States. Therefore...\nAnswer: American",
 'kwargs': {'temperature': 0.0,
  'max_tokens': 150,
  'top_p': 1,
  'frequency_penalty': 0,
  'presence_penalty': 0,
  'n': 1,
  'model': 'llama3-8b-8192',
  'messages': [{'role': 'user',
    'content': "Answers questions with short factoid answers\n\n---\n\nFollow the following form

Let's now look at the <span style="color:blue">prompt</span> and the <span style="color:green">response</span>.



In [8]:
print_prompt(lm)
print_response(lm)

<span style="color:red"> Let us use chain-of-thought. We can do that by creating a ChainOfThought class from any signature. </span>

In [9]:
# Define the predictor. Notice we're just changing the class. The signature BasicQA is unchanged.
generate_answer_with_cot = dspy.ChainOfThought(BasicQA)

# call the predictor on the particular input
pred = generate_answer_with_cot(question=dev_example.question)

print(f"Question: {dev_example.question}")
print(f"Thought: {pred.rationale.split('.', 1)[1].strip()}")
print(f"Predicted Answer: {pred.answer}")


Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Thought: He is an American chef and restaurateur, and the show is set in the United States. Therefore...
Predicted Answer: American



Let's now look at the <span style="color:blue">prompt</span> and the <span style="color:green">response</span> for <span style="color:red">ChainOfThought</span>.

In [None]:
print(f"number of LLM calls till now - {len(lm.history)}")
print_prompt(lm)
print_response(lm)

number of LLM calls till now - 2


<span style="color:red">The `Reasoning` field is added to the prompt by the `dspy.ChainOfThought` class (see its source).
<br><br> How can a single prompt produce completions for multiple output fields (say `Reasoning` and `Answer`)?
<br> It does not need anything special: the prompt deliberately stops after `Reasoning: Let's think step by step in order to`, and the LM continues from there,
<br> generating the rest of `Reasoning` followed by `Answer` in one completion.</span>

If the LLM can only generate a plain-text completion, how are the output fields of the `Prediction` object populated?
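One plausible answer is that the completion is split on the field prefixes (`Reasoning:`, `Answer:`) that the prompt format establishes. The sketch below illustrates that idea on the response printed earlier. `build_prompt` and `parse_completion` are hypothetical helpers written for this note; I have not checked how DSPy itself implements this step.

```python
import re


def build_prompt(instructions, question):
    # The prompt ends with an open "Reasoning:" prefix so the LM continues from there.
    return (
        f"{instructions}\n\n---\n\n"
        f"Question: {question}\n"
        "Reasoning: Let's think step by step in order to"
    )


def parse_completion(completion, fields=("Reasoning", "Answer")):
    """Pull each 'Field: value' segment out of a plain-text completion."""
    parsed = {}
    for i, name in enumerate(fields):
        # Capture text after this prefix, up to the next field prefix or the end.
        following = "|".join(re.escape(f) for f in fields[i + 1:])
        stop = rf"(?=\n(?:{following}):)" if following else r"\Z"
        match = re.search(rf"{re.escape(name)}:\s*(.*?){stop}", completion, re.DOTALL)
        parsed[name.lower()] = match.group(1).strip() if match else None
    return parsed


response = (
    "Question: What is the nationality of the chef and restaurateur featured in "
    "Restaurant: Impossible?\n"
    "Reasoning: Let's think step by step in order to figure out the nationality of "
    "the chef. He is an American chef and restaurateur, and the show is set in the "
    "United States. Therefore...\n"
    "Answer: American"
)

fields = parse_completion(response)
print(fields["answer"])
print(fields["reasoning"])
```

Each parsed segment can then be stored under its field name, which is all a prediction object needs in order to expose `pred.answer` and `pred.rationale`.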

### 3] Using the retriever model


In [None]:
retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(dev_example.question).passages

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')

1] Restaurant: Impossible | Restaurant: Impossible is an American reality television series, featuring chef and restaurateur Robert Irvine, that aired on Food Network from 2011 to 2016. 

2] Jean Joho | Jean Joho is a French-American chef and restaurateur. He is chef/proprietor of Everest in Chicago (founded in 1986), Paris Club Bistro & Bar and Studio Paris in Chicago, The Eiffel Tower Restaurant in Las Vegas, and Brasserie JO in Boston. 

3] List of Restaurant: Impossible episodes | This is the list of the episodes for the American cooking and reality television series "Restaurant Impossible", produced by Food Network. The premise of the series is that within two days and on a budget of $10,000, celebrity chef Robert Irvine renovates a failing American restaurant with the goal of helping to restore it to profitability and prominence. Irvine is assisted by a designer (usually Taniya Nayak, Cheryl Torrenueva, or Lynn Keagan, but sometimes Vanessa De Leon, Krista Watterworth, Yvette Ire

### 4] Program 1: Basic Retrieval-Augmented Generation (“RAG”)

Let's define our first complete program for this task. We'll build a retrieval-augmented pipeline for answer generation.

Given a question, we'll search for the top-3 passages in Wikipedia and then feed them as context for answer generation.

Let's start by defining this signature: `context, question --> answer`.

In [None]:
class GenerateAnswer(dspy.Signature):
    """Answers questions with short factoid answers"""

    context = dspy.InputField(desc='may contain relevant facts')
    question = dspy.InputField()
    answer = dspy.OutputField(desc='often between 1 and 5 words')

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Having defined this program, let's now **compile** it. Compiling a program will update the parameters stored in each module. In our setting, this is primarily in the form of collecting and selecting good demonstrations for inclusion in your prompt(s).

Compiling depends on three things:

1. **A training set.** We'll just use our 20 question–answer examples from `trainset` above.
1. **A metric for validation.** We'll define a quick `validate_context_and_answer` that checks that the predicted answer is correct. It'll also check that the retrieved context does actually contain that answer.
1. **A specific teleprompter.** The **DSPy** compiler includes a number of **teleprompters** that can optimize your programs.
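To make the validation metric in the list above less abstract, here is a minimal pure-Python sketch of the two checks it combines: exact match on the answer, and presence of the gold answer in the retrieved passages. The normalization is a simplified SQuAD-style one that I am assuming for illustration; DSPy's own helpers may differ in detail. The sample passage is hand-written.

```python
import re
import string


def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold_answer, predicted_answer):
    return normalize(gold_answer) == normalize(predicted_answer)


def passage_match(gold_answer, passages):
    # True if any retrieved passage contains the normalized gold answer.
    gold = normalize(gold_answer)
    return any(gold in normalize(p) for p in passages)


def validate(gold_answer, predicted_answer, passages):
    # Both checks must pass, mirroring `answer_EM and answer_PM`.
    return exact_match(gold_answer, predicted_answer) and passage_match(gold_answer, passages)


# Hand-written sample passage, for illustration only.
passages = ["David Gregory (physician) | He inherited Kinnairdy Castle in 1664."]
print(validate("Kinnairdy Castle", "kinnairdy castle.", passages))
print(validate("Kinnairdy Castle", "Edinburgh Castle", passages))
```

Requiring both checks filters out answers that are correct by luck: the prediction must match and the retrieved context must actually support it.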

**Teleprompters:** Teleprompters are powerful optimizers that can take any program and learn to bootstrap and select effective prompts for its modules. Hence the name, which means "prompting at a distance".

Different teleprompters offer various tradeoffs in terms of how much they optimize cost versus quality, etc. We will use a simple default `BootstrapFewShot` in this notebook.


_If you're into analogies, you could think of this as your training data, your loss function, and your optimizer in a standard DNN supervised learning setup. Whereas SGD is a basic optimizer, there are more sophisticated (and more expensive!) ones like Adam or RMSProp._

In [None]:
from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer, metric_threshold=None)

# Compile!
compiled_rag = teleprompter.compile(student=RAG(), trainset=trainset)

 55%|█████▌    | 11/20 [01:01<00:49,  5.55s/it]


<span style="color:red"> What does compiling actually do?
<br> - Here we want the `student` (the RAG program) to learn from samples in the trainset.
<br> - Copies of the student and the teacher are made with the `reset_copy()` method, which deep-copies the module and resets its parameters.
<br> - Since no teacher is passed, the teacher here is a copy of the student compiled with `LabeledFewShot`, which simply selects k (default=16) samples from the trainset as demos.
<br> - Setting `metric_threshold=None` does not appear to switch the metric off: as far as I can tell from the source, a bootstrapped trace is kept whenever `validate_context_and_answer` returns a truthy value, and the threshold only matters for numeric metrics. The progress bar stopping at 11/20 suggests that bootstrapping ends early once enough passing traces have been collected.
<br> - Let us now look at the result of the compilation below. The student RAG module ends up with 16 demos, which appear to be a few bootstrapped ones (marked `'augmented': True`) plus plain labeled examples from the trainset.
</span>
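The bootstrapping idea can be sketched in a few lines of plain Python. This is a toy version under my own assumptions, not the library's code: `bootstrap_few_shot`, `toy_teacher`, and `toy_metric` are hypothetical names, and the limits of 4 bootstrapped and 16 total demos are what I believe the `BootstrapFewShot` defaults to be.

```python
import random


def bootstrap_few_shot(teacher, trainset, metric, max_bootstrapped=4, max_labeled=16, seed=0):
    """Toy version of the bootstrapping idea; not the library's implementation."""
    bootstrapped, used = [], set()
    for idx, example in enumerate(trainset):
        if len(bootstrapped) >= max_bootstrapped:
            break  # enough passing traces collected, so stop early
        prediction = teacher(example["question"])
        if metric(example, prediction):
            # Keep the teacher's full output (including reasoning) as a demo.
            bootstrapped.append({**example, **prediction, "augmented": True})
            used.add(idx)

    # Fill the remaining slots with plain labeled examples.
    remaining = [ex for idx, ex in enumerate(trainset) if idx not in used]
    k = min(max_labeled - len(bootstrapped), len(remaining))
    labeled = random.Random(seed).sample(remaining, k)
    return bootstrapped + labeled


trainset = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(20)]


def toy_teacher(question):
    # Hypothetical teacher that answers only even-numbered questions correctly.
    i = int(question[1:])
    return {"rationale": "toy reasoning", "answer": f"a{i}" if i % 2 == 0 else "wrong"}


def toy_metric(example, prediction):
    return example["answer"] == prediction["answer"]


demos = bootstrap_few_shot(toy_teacher, trainset, toy_metric)
print(len(demos), sum(d.get("augmented", False) for d in demos))  # 16 4
```

In this toy run the loop stops after the seventh example, once four traces have passed the metric, which is the same kind of early stop the progress bar above hints at.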

In [None]:

compiled_rag.save('rag_model.json')
compiled_rag.dump_state()

{'retrieve': {'k': 3},
 'generate_answer': {'lm': None,
  'traces': [],
  'train': [],
  'demos': [Example({'augmented': True, 'context': ['Tae Kwon Do Times | Tae Kwon Do Times is a magazine devoted to the martial art of taekwondo, and is published in the United States of America. While the title suggests that it focuses on taekwondo exclusively, the magazine also covers other Korean martial arts. "Tae Kwon Do Times" has published articles by a wide range of authors, including He-Young Kimm, Thomas Kurz, Scott Shaw, and Mark Van Schuyver.', "Kwon Tae-man | Kwon Tae-man (born 1941) was an early Korean hapkido practitioner and a pioneer of the art, first in Korea and then in the United States. He formed one of the earliest dojang's for hapkido in the United States in Torrance, California, and has been featured in many magazine articles promoting the art.", 'Hee Il Cho | Cho Hee Il (born October 13, 1940) is a prominent Korean-American master of taekwondo, holding the rank of 9th "dan" i


<span style="color:red">Let us now use the compiled model and see what's happening. Because the prompt is large, the llama3 8b model is unable to give the right answer; it ignores the instructions. So let us use the 70b model instead. </span>

In [None]:
lm = dspy.GROQ(api_key=os.getenv('GROQ_API_KEY'), model='llama3-70b-8192')
dspy.settings.configure(lm=lm, rm=retriever)

compiled_rag = RAG()
compiled_rag.load('rag_model.json')

my_question = "What castle did David Gregory inherit?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

print_prompt(lm)
print_response(lm)


Question: What castle did David Gregory inherit?
Predicted Answer: Kinnairdy Castle
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'Gregory Tarchaneiotes | Gregory Tarchaneiotes (Greek: Γρηγόριος Ταρχανειώτης , Italian: "Gregorio Tracanioto" or "Tracamoto" ) was a "protospatharius" and the long-reigning catepan of Italy from 998 t...', 'David Gregory (mathematician) | David Gregory (originally spelt Gregorie) FRS (? 1659 – 10 October 1708) was a Scottish mathematician and astronomer. He was professor of mathematics at the University ...']


<span style="color:red">TODO: How did the prompt use 12 examples and 6 contexts?</span>

### 5] Program 2: Multi-Hop Search (“Baleen”)
From exploring the harder questions in the training/dev sets, it becomes clear that a single search query is often not enough for this task. For instance, consider a question that asks about the birth city of the writer of "Right Back At It Again". A search query identifies the author correctly as "Jeremy McKinnon", but it does not reveal where he was born.

The standard approach for this challenge in the retrieval-augmented NLP literature is to build multi-hop search systems, like GoldEn (Qi et al., 2019) and Baleen (Khattab et al., 2021). These systems read the retrieved results and then generate additional queries to gather additional information if necessary. Using **DSPy**, we can easily simulate such systems in a few lines of code.


We'll still use the `GenerateAnswer` signature from the RAG implementation above. All we need now is a **signature** for the "hop" behavior: taking some partial context and a question, generate a search query to find missing information.

In [None]:
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

from dsp.utils import deduplicate

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question):
        context = []
        
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

As we can see, the `__init__` method defines a few key sub-modules:

- **generate_query**: For each hop, we will have one `dspy.ChainOfThought` predictor with the `GenerateSearchQuery` signature.
- **retrieve**: This module will do the actual search, using the generated queries.
- **generate_answer**: This `dspy.ChainOfThought` module will be used after all the search steps. It uses the `GenerateAnswer` signature to actually produce an answer.

The `forward` method uses these sub-modules in simple control flow.

1. First, we'll loop up to `self.max_hops` times.
1. In each iteration, we'll generate a search query using the predictor at `self.generate_query[hop]`.
1. We'll retrieve the top-k passages using that query.
1. We'll add the (deduplicated) passages to our accumulator of `context`.
1. After the loop, we'll use `self.generate_answer` to produce an answer.
1. We'll return a prediction with the retrieved `context` and predicted `answer`.
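The accumulation in step 4 is worth a small illustration. The sketch below assumes that `deduplicate` keeps the first occurrence of each passage and preserves order; `dedupe_keep_order` is a hypothetical stand-in written for this note, and the passage strings are placeholders.

```python
def dedupe_keep_order(items):
    # A dict preserves insertion order, so the first occurrence of each item wins.
    return list(dict.fromkeys(items))


context = ["passage A", "passage B"]      # passages gathered in hop 1
new_passages = ["passage B", "passage C"]  # hop 2 retrieves one repeat
context = dedupe_keep_order(context + new_passages)
print(context)  # ['passage A', 'passage B', 'passage C']
```

Keeping order matters here because earlier hops usually hold the passages that the later queries were built on.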

##### Inspect the zero-shot version of the Baleen program

We will also compile this program shortly. But, before that, we can try it out in a "zero-shot" setting (i.e., without any compilation).

Using a program in a zero-shot (uncompiled) setting doesn't mean that quality will be bad. It just means that we're bottlenecked directly by the reliability of the underlying LM to understand our sub-tasks from minimal instructions.

This is often just fine when using the most expensive/powerful models (e.g., GPT-4) on the easiest and most standard tasks (e.g., answering simple questions about popular entities).

However, a zero-shot approach quickly falls short for more specialized tasks, for novel domains/settings, and for more efficient (or open) models. **DSPy** can help you in all of these settings.

In [None]:
my_question = "How many storeys are in the castle that David Gregory inherited?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: How many storeys are in the castle that David Gregory inherited?
Predicted Answer: Context: David Gregory inherited Kinnairdy Castle in 1664.

Question: How many storeys are in the castle that David Gregory inherited?

Reasoning: Let's think step by step in order to find the number of storeys in Kinnairdy Castle. We know that David Gregory inherited Kinnairdy Castle, and according to the context, Kinn
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'Gregory Tarchaneiotes | Gregory Tarchaneiotes (Greek: Γρηγόριος Ταρχανειώτης , Italian: "Gregorio Tracanioto" or "Tracamoto" ) was a "protospatharius" and the long-reigning catepan of Italy from 998 t...', 'Gregory Parsloe-Parsloe | Sir Gregory Parsloe-Parsloe, 7th Baronet is a fictional character from the Blandings stories of P. G. Wodehouse

Let's inspect the last **three** calls to the LM (i.e., generating the first hop's query, generating the second hop's query, and generating the answer).

In [None]:
print_prompt(lm, -3)
print_response(lm, -3)

In [None]:
print_prompt(lm, -2)
print_response(lm, -2)

In [None]:
print_prompt(lm, -1)
print_response(lm, -1)

##### Compiling the Baleen program

Now is the time to compile our multi-hop (`SimplifiedBaleen`) program.

We will first define our validation logic, which will simply require that:

- The predicted answer matches the gold answer.
- The retrieved context contains the gold answer.
- None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
- None of the generated queries is roughly repeated (i.e., none is within 0.8 or higher F1 score of earlier queries).

In [None]:
def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred): return False
    if not dspy.evaluate.answer_passage_match(example, pred): return False

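    # `trace` is None outside of compilation (e.g. during evaluation), so fall back to an empty list.
    hops = [example.question] + [outputs.query for *_, outputs in (trace or []) if 'query' in outputs]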

    if max([len(h) for h in hops]) > 100: return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True
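The "roughly repeated" check compares each query with the earlier ones using an F1 score. A minimal token-level F1 sketch is shown below. It assumes whitespace tokenization and lowercasing only, so it is a simplification of whatever normalization `answer_exact_match_str` applies; `token_f1` and `is_near_duplicate` are hypothetical helper names.

```python
from collections import Counter


def token_f1(prediction, reference):
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts tokens shared by both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def is_near_duplicate(query, earlier_queries, threshold=0.8):
    # A query is "roughly repeated" if it is too similar to any earlier one.
    return any(token_f1(query, earlier) >= threshold for earlier in earlier_queries)


earlier = ["David Gregory castle inherited"]
print(is_near_duplicate("castle David Gregory inherited", earlier))
print(is_near_duplicate("Kinnairdy Castle number of storeys", earlier))
```

The first query only reorders the same tokens and is flagged, while the second shares a single token with the earlier query and passes.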

In [None]:
lm = dspy.GROQ(api_key=os.getenv('GROQ_API_KEY'), model='llama3-70b-8192')
dspy.settings.configure(lm=lm, rm=retriever)

teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=trainset)


100%|██████████| 20/20 [13:25<00:00, 40.28s/it]


In [None]:
compiled_baleen.save('baleen_model.json')
compiled_baleen.dump_state()

{'generate_query[0]': {'lm': None,
  'traces': [],
  'train': [],
  'demos': [Example({'augmented': True, 'context': [], 'question': 'Tombstone stared an actor born May 17, 1955 known as who?', 'rationale': "Here's the completed response:\n\nContext: N/A\n\nQuestion: Tombstone starred an actor born May 17, 1955 known as who?\n\nReasoning: Let's think step by step in order to find the answer. We know the actor's birthdate, May 17, 1955, and the movie they starred in, Tombstone. We can use this information to search for the actor's name.", 'query': '"Tombstone movie cast born May 17 1955"'}) (input_keys=None),
   Example({'question': 'Which is taller, the Empire State Building or the Bank of America Tower?', 'answer': 'The Empire State Building'}) (input_keys=None),
   Example({'question': 'Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?', 'answer': 'Rosario Dawson'}) (input_keys=None),
   Example({'question': 'Samantha Cr

In [42]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)

compiled_rag_retrieval_score = evaluate_on_hotpotqa(compiled_rag, metric=gold_passages_retrieved)


# Evaluate the `compiled_rag` program with the `answer_exact_match` metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_hotpotqa(compiled_rag, metric=metric)
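As a quick illustration of the retrieval metric above: passages come back as `Title | text`, so the check reduces to a set-inclusion test on titles. The sketch below uses plain lowercasing instead of `dspy.evaluate.normalize_text`, and the gold title sets are made up for the example.

```python
def titles_from_passages(passages):
    # Passages are formatted as "Title | text"; the title is the part before " | ".
    return {p.split(" | ")[0].lower() for p in passages}


def all_gold_titles_found(gold_titles, passages):
    return {t.lower() for t in gold_titles}.issubset(titles_from_passages(passages))


passages = [
    "Restaurant: Impossible | Restaurant: Impossible is an American reality television series ...",
    "Jean Joho | Jean Joho is a French-American chef and restaurateur. ...",
]

# Hypothetical gold title sets, for illustration only.
print(all_gold_titles_found({"Restaurant: Impossible"}, passages))
print(all_gold_titles_found({"Restaurant: Impossible", "Robert Irvine"}, passages))
```

The second call fails because one gold title is missing from the retrieved set, which is the situation multi-hop retrieval is meant to fix.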