# DSPy - Advanced Prompt Engineering

1. Breakout Room #1:
  - Task 1: Dependencies
  - Task 2: Loading Our Model
  - Task 3: Loading Our Data
  - Task 4: Setting Our Signature
  - Task 5: Creating a Predictor
  - Task 6: Making a Chain, I mean...Module
  - Task 7: Evaluate
  - Task 8: Program Optimization
2. Breakout Room #2:
  - Task 1: Defining Application
  - Task 2: Hyper-Parameters and Data
  - Task 3: Signature And Module Creation
  - Task 4: Evaluating Our LongFormQA Module
  - Task 5: Adding Assertions

---

In the following notebook, we'll explore an introduction to DSPy and what it can do in just a few lines of code!

# 🤝 Breakout Room #1

## Task 1: Dependencies

We'll start by installing DSPy, `nltk` (for later) and including our OpenAI API key.

In [1]:
!pip install -qU dspy-ai nltk

DSPy can leverage OpenAI's models under the hood, and still provide an advantage - in order to do so, however, we'll need to provide an OpenAI API Key!

In [1]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

## Task 2: Loading Our Model

Now we can set up our OpenAI language model - which we'll use throughout the remaining cells in the notebook.

In [2]:
from dspy import LM

llm = LM(model='openai/gpt-3.5-turbo')

Similar to other libraries, we can call the LLM directly with a string to get a response!

In [3]:
llm("What is the square root of pi?")

['The square root of pi is approximately 1.77245385091.']

We'll also call `settings.configure` with our OpenAI model in the `lm` (Language Model) field, so that DSPy has a default LM to use whenever we don't specify one when calling our DSPy `Predictors`.

In [3]:
import dspy

dspy.settings.configure(lm=llm)

## Task 3: Load Our Data

We're going to be using a dataset that provides a number of example sentences, along with a rating that indicates their "dopeness" level.

In [4]:
from datasets import load_dataset

dataset = load_dataset("llm-wizard/dope_or_nope_v2")

We have a total of 99 rows of data, and will be splitting that into a `trainset` and a `valset` - for training and evaluation.

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Rating', 'Fire Emojis'],
        num_rows: 99
    })
})

Due to the nature of the dataset, we'll need to shuffle it to ensure our labels are not clumped together and our `valset` is reasonably representative of our `trainset`.

In [5]:
dataset = dataset.shuffle(seed=42)

We'll move our `Dataset` into the expected format in DSPy which is the [`Example`](https://dspy-docs.vercel.app/docs/deep-dive/data-handling/examples)!


Our examples will have two keys:

- `sentence`, our input sentence to be rated
- `rating`, our rating label

We'll specify our input as `sentence` to properly leverage the DSPy framework.

In [6]:
from dspy import Example

trainset = []

for row in dataset["train"].select(range(0,len(dataset["train"])-10)):
  trainset.append(Example(sentence=row["Sentence"], rating=row["Rating"]).with_inputs("sentence"))

len(trainset)

89

We'll repeat the same process for our `valset` as well.

In [7]:
valset = []

for row in dataset["train"].select(range(len(trainset),len(dataset["train"]))):
  valset.append(Example(sentence=row["Sentence"], rating=row["Rating"]).with_inputs("sentence"))

len(valset)

10
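
The split above relies only on index arithmetic: after shuffling, the first `n - 10` rows become the `trainset` and the last 10 become the `valset`. Here is a minimal sketch of the same idea in plain Python. It assumes a list of 99 placeholder rows, and uses the standard library's `random` module as a stand-in for `Dataset.shuffle`, so the resulting order will not match the notebook's.

```python
import random

# Placeholder rows standing in for the 99 dataset rows (hypothetical data).
rows = [{"Sentence": f"sentence {i}", "Rating": i % 5} for i in range(99)]

# Shuffle with a fixed seed so the split is reproducible.
random.Random(42).shuffle(rows)

holdout = 10
train_rows = rows[: len(rows) - holdout]   # first 89 rows
val_rows = rows[len(rows) - holdout :]     # last 10 rows

# The two slices are disjoint and together cover every row.
print(len(train_rows), len(val_rows))  # 89 10
```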

Let's take a peek at an example from our `trainset` and `valset`!

In [8]:
train_example = trainset[0]
print(f"Sentence: {train_example.sentence}")
print(f"Label: {train_example.rating}")

Sentence: The results were satisfactory.
Label: 0


In [12]:
valset_example = valset[0]
print(f"Sentence: {valset_example.sentence}")
print(f"Label: {valset_example.rating}")

Sentence: This is top tier.
Label: 4


## Task 4: Setting Our Signature

The first foundational unit in DSPy is the `Signature`.

In a sense, a `Signature` can be thought of as both a prompt, as well as metadata about that prompt.

Going beyond just a simple `SystemMessage`, as seen in other frameworks, the `Signature` helps DSPy validate datatypes, create examples, and more.

> NOTE: DSPy's [documentation](https://dspy-docs.vercel.app/docs/deep-dive/signature/understanding-signatures#what-is-a-signature) goes into more detail about what exactly a `Signature` is.

In [13]:
from dspy import Signature, InputField, OutputField

class DopeOrNopeSignature(Signature):
  """Rate a sentence from 0 to 4 on a dopeness scale"""
  sentence: str = InputField()
  rating: int = OutputField()

## Task 5: Creating a Predictor

Now that we have our `Signature`, we can build a `Predictor` that leverages it.

A `Predictor`, in the simplest terms, is what calls the LLM using our signature. Importantly, the `Predictor` knows how to leverage our signature to call the LLM. From DSPy's documentation, one of the most interesting parts of a `Predictor` is that it can *learn* to become better at the desired task!

Let's take a look at our `TypedPredictor` below to see more.

In [14]:
from dspy.functional import TypedPredictor

generate_label = TypedPredictor(DopeOrNopeSignature)

In [15]:
generate_label

TypedPredictor(DopeOrNopeSignature(sentence -> rating
    instructions='Rate a sentence from 0 to 4 on a dopeness scale'
    sentence = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Sentence:', 'desc': '${sentence}'})
    rating = Field(annotation=int required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Rating:', 'desc': '${rating}'})
))

In [16]:
label_prediction = generate_label(sentence=valset_example.sentence)
print(f"Sentence: {valset_example.sentence}")
print(f"Prediction: {label_prediction}")

Sentence: This is top tier.
Prediction: Prediction(
    rating=4
)


We can, at any time, check our LLM's outputs through `inspect_history`.

In [17]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
This is top tier.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Response:

[[ ## rating ## ]]
4
[[ ## completed ## ]]

Notice how, without our input - the `TypedPredictor` has included format instructions to the LLM to help ensure our returned data resembles what we desire.

Let's look at another example of a `Predictor` - this time with Chain of Thought.

In order to use this - we don't have to do anything with our `Signature`! We can leave it exactly as is - and allow the `Predictor` to adapt to it.

> NOTE: We won't be using this predictor going forward - this is just to showcase the ease of using another `Predictor` with a `Signature`.

In [18]:
from dspy.functional import TypedChainOfThought

generate_label_with_chain_of_thought = TypedChainOfThought(DopeOrNopeSignature)

label_prediction = generate_label_with_chain_of_thought(sentence=valset_example.sentence)

In [19]:
print(f"Sentence: {valset_example.sentence}")
print(f"Reasoning: {label_prediction.reasoning}")
print(f"Ground Truth Label: {valset_example.rating}")
print(f"Prediction: {label_prediction.rating}")

Sentence: This is top tier.
Reasoning: I would rate this sentence as a 4 because it conveys a high level of excellence or superiority.
Ground Truth Label: 4
Prediction: 4


We can, again, check our LLM's history to see what the actual prompt/response is.


In [20]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `reasoning` (str): ${produce the rating}. We ...
2. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## reasoning ## ]]
{reasoning}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
This is top tier.

Respond with the corresponding output fields, starting with the field `reasoning`, then `rating`, and then ending with the marker for `completed`.


Response:

[[ ## reasoning ## ]]
I would rate this sentence as a 4 because it conveys a high level of excellence or superiority.

[[ ## rating ## ]]
4

[[ ## completed ## ]]

## Task 6: Making a Chain, I mean...Module.

Now that we have our `TypedPredictor`, we can create a `Module`!

A `Module` is useful because it allows us to interact with the `Predictor` and `Signature` in a way that DSPy can leverage for optimization.

This helps the DSPy framework determine paths through your program - and helps during the `compilation` or optimization steps (formerly `teleprompting`).

> NOTE: You might notice this looks strikingly similar to PyTorch, and this is by design!

In [21]:
from dspy import Module, Prediction

class DopeOrNopeStudent(Module):
  def __init__(self):
    super().__init__()

    self.generate_rating = TypedPredictor(DopeOrNopeSignature)

  def forward(self, sentence):
    prediction = self.generate_rating(sentence=sentence)
    return Prediction(rating=prediction.rating)

## Task 7: Evaluate

As with any good framework, DSPy has the ability to `Evaluate` - we can leverage this to determine how our current DSPy "program" (our `Module` in this case) operates.

> NOTE: DSPy's "program" can be loosely compared to a "chain" from the popular LLM framework LangChain.

In [22]:
from dspy.evaluate.evaluate import Evaluate

evaluate_fewshot = Evaluate(devset=valset, num_threads=1, display_progress=True, display_table=10)

def exact_match_metric(answer, pred, trace=None):
  return answer.rating == pred.rating

evaluate_fewshot(DopeOrNopeStudent(), metric=exact_match_metric)

Average Metric: 5 / 10  (50.0): 100%|██████████| 10/10 [00:00<00:00, 225.72it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,This is top tier.,4,4,✔️ [True]
1,Big mood.,3,3,✔️ [True]
2,The presentation was outstanding.,1,4,
3,I'm living my best life.,4,3,
4,"Sksksksk, that's hilarious.",3,3,✔️ [True]
5,The report is comprehensive.,1,3,
6,This is next level.,4,4,✔️ [True]
7,The meeting was productive.,1,3,
8,The analysis was insightful.,1,3,
9,I stan a legend.,3,3,✔️ [True]


50.0
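
The reported score is simply the percentage of `valset` examples for which the metric returns `True`. As a quick check, here is that arithmetic in plain Python, using the label and prediction columns from the table above. The helper name `percent_correct` is made up for this sketch and is not part of DSPy.

```python
# Ratings copied from the evaluation table above.
example_ratings = [4, 3, 1, 4, 3, 1, 4, 1, 1, 3]
pred_ratings = [4, 3, 4, 3, 3, 3, 4, 3, 3, 3]

def percent_correct(labels, preds):
    """Percentage of positions where the prediction equals the label."""
    matches = sum(label == pred for label, pred in zip(labels, preds))
    return 100.0 * matches / len(labels)

print(percent_correct(example_ratings, pred_ratings))  # 50.0
```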

#### ❓Question #1:

Does DSPy lend itself to more complex less exactly defined evaluations? Provide reasoning for your answer.

<font color="blue">Yes, since evaluation metrics can be defined in custom functions, DSPy absolutely lends itself to less exactly defined metrics. The main requirements for a metric function are its inputs (example, prediction, trace) and its output (a bool, int, or float). In between, we can do whatever we want, including calling an LLM to evaluate some more nuanced or vague aspect of the output (as one example, imagine evaluating "human readability" or a metric like "helpfulness"). As long as the output of our eval can be translated to a numeric value, DSPy lends itself to a very wide range of sophisticated evaluation techniques. The main risk in defining such vague metrics, as with all LLM-based evaluations, is introducing the LLM's biases and limitations into the evaluation process.</font>
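
As a small illustration of a less exact metric, the sketch below gives partial credit for near misses while keeping the `(example, pred, trace=None)` shape used by `exact_match_metric`. The name `near_match_metric` and the 0.5 partial-credit value are arbitrary choices for this example, and `SimpleNamespace` objects stand in for DSPy's `Example` and `Prediction`.

```python
from types import SimpleNamespace

def near_match_metric(example, pred, trace=None):
    """Return 1.0 for an exact match, 0.5 if off by one, else 0.0."""
    distance = abs(int(example.rating) - int(pred.rating))
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.5
    return 0.0

# Stand-ins for a labeled example and two predictions.
gold = SimpleNamespace(sentence="This is next level.", rating=4)
print(near_match_metric(gold, SimpleNamespace(rating=4)))  # 1.0
print(near_match_metric(gold, SimpleNamespace(rating=3)))  # 0.5
```

A function with this shape should be usable wherever `exact_match_metric` is passed as `metric=`.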

## Task 8: Program Optimization (the Artist Formerly Known as Teleprompting)

Optimization is the crux of the DSPy framework - it is what allows it to operate at a level beyond traditional prompt engineering.

At a high level, optimization is a way for the DSPy framework to take the program, a training set, and a metric - and make changes/tweaks to our program to improve our metric on our dataset.

Let's get started with the `LabeledFewShot` optimizer.

The `LabeledFewShot` optimizer very simply provides a sample of the `trainset` as few-shot examples!
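
Conceptually, this amounts to drawing `k` labeled examples and showing them to the model ahead of the real input. Below is a rough sketch of that idea in plain Python. It is not DSPy's actual implementation: the function name `sample_demos`, the toy data, and the seed are assumptions made for illustration.

```python
import random

def sample_demos(trainset, k, seed=0):
    """Pick up to k labeled examples to use as few-shot demonstrations."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))

# Toy labeled data in the same (sentence, rating) shape as the notebook's.
toy_trainset = [
    {"sentence": "The results were satisfactory.", "rating": 0},
    {"sentence": "This is top tier.", "rating": 4},
    {"sentence": "Big mood.", "rating": 3},
    {"sentence": "The meeting was productive.", "rating": 1},
    {"sentence": "I stan a legend.", "rating": 3},
]

demos = sample_demos(toy_trainset, k=4)
for demo in demos:
    print(f"Sentence: {demo['sentence']} -> Rating: {demo['rating']}")
```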

In [23]:
from dspy.teleprompt import LabeledFewShot

labeled_fewshot_optimizer = LabeledFewShot(k=4)

Once we define our optimizer, we can compile our program!

In [24]:
compiled_dspy = labeled_fewshot_optimizer.compile(student=DopeOrNopeStudent(), trainset=trainset)

Let's evaluate!

In [25]:
evaluate_fewshot(compiled_dspy, metric=exact_match_metric)

Average Metric: 4 / 10  (40.0): 100%|██████████| 10/10 [00:00<00:00, 105.40it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,This is top tier.,4,4,✔️ [True]
1,Big mood.,3,3,✔️ [True]
2,The presentation was outstanding.,1,3,
3,I'm living my best life.,4,3,
4,"Sksksksk, that's hilarious.",3,3,✔️ [True]
5,The report is comprehensive.,1,2,
6,This is next level.,4,3,
7,The meeting was productive.,1,3,
8,The analysis was insightful.,1,3,
9,I stan a legend.,3,3,✔️ [True]


40.0

As you can see, labeled few-shot examples are not a guaranteed win: in this run, our `valset` score actually dropped from 50.0 to 40.0.

Let's try another optimizer - this time: [`BootstrapFewShot`](https://dspy-docs.vercel.app/docs/deep-dive/teleprompter/bootstrap-fewshot).

The key thing to note is that this optimizer works even with very few examples - it uses the LLM itself to generate new examples!

In [26]:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=4, max_labeled_demos=12)

compiled_dspy_BOOTSTRAP = optimizer.compile(student=DopeOrNopeStudent(), trainset=trainset)

  6%|▌         | 5/89 [00:00<00:00, 84.25it/s]

Bootstrapped 4 full traces after 6 examples in round 0.





#### 🏗️ Activity #1:

Outline how `BootstrapFewShot` works "under the hood" in natural language or create a diagram of the workflow.

<font color="blue">`BootstrapFewShot` is a DSPy optimizer that improves a program by selecting few-shot demonstrations for its predictors. It does not change the language model's weights.

Here's a breakdown of its workflow:

**Initialization:**
- The optimizer takes a student program, a `trainset`, and a metric.
- It prepares a teacher program (by default a copy of the student), which can itself be seeded with labeled examples from the `trainset`.

**Bootstrapping:**
1. Run the teacher: For each training example, the teacher program is executed and the inputs and outputs of each predictor call are recorded as a trace.
1. Evaluate performance: The final prediction is scored with the defined metric.
1. Keep passing traces: If the metric passes, the traced predictor calls are kept as bootstrapped demonstrations; otherwise the trace is discarded.

**Stopping:**
- Bootstrapping stops once `max_bootstrapped_demos` passing traces have been collected or the `trainset` is exhausted (the log above shows 4 full traces after 6 examples).

**Compile:**
- Each predictor in the student receives the bootstrapped demonstrations, topped up with raw labeled examples from the `trainset` up to `max_labeled_demos`. These demonstrations are then included in the prompt on every call.</font>
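
The core idea behind `BootstrapFewShot` (keep teacher outputs that pass the metric, then top up with labeled examples) can be sketched in plain Python. This is a simplified illustration rather than DSPy's implementation: `bootstrap_demos`, the `teacher` callable, and the toy data are all hypothetical, and a real trace records every predictor call rather than a single output.

```python
def bootstrap_demos(trainset, teacher, metric, max_bootstrapped=4, max_labeled=12):
    """Collect demos: teacher outputs that pass the metric, then raw labels."""
    bootstrapped, used = [], set()
    for idx, example in enumerate(trainset):
        if len(bootstrapped) >= max_bootstrapped:
            break
        prediction = teacher(example["sentence"])
        # Keep the teacher's output only when the metric accepts it.
        if metric(example, prediction):
            bootstrapped.append({"sentence": example["sentence"],
                                 "rating": prediction, "augmented": True})
            used.add(idx)
    # Top up with labeled examples that were not already bootstrapped.
    labeled = [ex for i, ex in enumerate(trainset) if i not in used]
    room = max(0, max_labeled - len(bootstrapped))
    return bootstrapped + labeled[:room]

# Toy teacher that always answers 3, and an exact-match metric.
toy_trainset = [{"sentence": f"s{i}", "rating": i % 4} for i in range(20)]
demos = bootstrap_demos(toy_trainset, teacher=lambda s: 3,
                        metric=lambda ex, pred: ex["rating"] == pred)
print(len(demos), sum(d.get("augmented", False) for d in demos))  # 12 4
```

The printed counts mirror the shape of the notebook's result: a handful of bootstrapped demonstrations plus labeled examples filling the remaining slots.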

Let's finally evaluate!

In [27]:
evaluate_fewshot(compiled_dspy_BOOTSTRAP, metric=exact_match_metric)

Average Metric: 8 / 10  (80.0): 100%|██████████| 10/10 [00:00<00:00, 203.13it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,This is top tier.,4,4,✔️ [True]
1,Big mood.,3,3,✔️ [True]
2,The presentation was outstanding.,1,3,
3,I'm living my best life.,4,4,✔️ [True]
4,"Sksksksk, that's hilarious.",3,3,✔️ [True]
5,The report is comprehensive.,1,1,✔️ [True]
6,This is next level.,4,4,✔️ [True]
7,The meeting was productive.,1,1,✔️ [True]
8,The analysis was insightful.,1,2,
9,I stan a legend.,3,3,✔️ [True]


80.0

We can see that this optimization helps our program score 30 points higher than the unoptimized baseline (80.0 versus 50.0) on our evaluation!

In [28]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
This tea is piping hot.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Assistant message:

[[ ## rating ## ]]
4

[[ ## completed ## ]]


User message:

[[ ## sentence ## ]]
Your professionalism is appreciated.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Assistant message:

[[ ## rating ## ]]
1

[[ ##

In [29]:
for name, parameter in compiled_dspy_BOOTSTRAP.named_parameters():
  print(f"Parameter {name}: Num Examples: {len(parameter.demos)}, {parameter.demos[0]}")
  print()

Parameter generate_rating.predictor: Num Examples: 12, Example({'augmented': True, 'sentence': 'This tea is piping hot.', 'rating': '4'}) (input_keys=None)



# 🤝 Breakout Room #2

## Task 1: Defining Application

In this breakout room, we'll be using DSPy to optimize a Multi-Hop QA module with `Assertions`.

So what is a "Multi-Hop QA module"?

Well - going beyond naive RAG retrieval, Multi-Hop QA lets us build applications suited to questions that may require multiple "hops" to answer.

For instance: "Who is the top goal scorer that has ever played on the Winnipeg Jets, and what years did he play for the Winnipeg Jets?"

You can see that there are two "hops" required to respond correctly:

1. Who is the top goal scorer for the Winnipeg Jets?
2. What years did X player play for the Winnipeg Jets?

While this is a toy example, the idea is the same across complexity: Questions that take more than one step of reasoning to answer.

Let's grab some data, set up some hyper-parameters, and then get to implementation!

## Task 2: Hyper-Parameters and Data

We'll use a hosted ColBERTv2 index of Wikipedia 2017 abstracts as our retrieval system for this example.

We'll also use `GPT-4o-Mini` as our LM to keep things light and inexpensive as we'll be sending quite a few LLM calls.

In [9]:
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm=colbertv2_wiki17_abstracts)
lm_openai_four_mini = dspy.LM(model='openai/gpt-4o-mini', max_tokens=500)
dspy.settings.configure(lm=lm_openai_four_mini, trace=[], temperature=0.7)

We'll be using the [`HotPotQA`](https://hotpotqa.github.io/) dataset, which contains multi-hop QA pairs with supporting context and is based on Wikipedia (for compatibility with our Retriever system).

In [10]:
from dspy.datasets import HotPotQA

dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0, keep_details=True)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

We can look at a few examples:

In [11]:
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")
print(f"Relevant Wikipedia Titles: {train_example.gold_titles}")

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt
Relevant Wikipedia Titles: {'At My Window (album)', 'Townes Van Zandt'}


In [12]:
dev_example = devset[18]
print(f"Question: {dev_example.question}")
print(f"Answer: {dev_example.answer}")
print(f"Relevant Wikipedia Titles: {dev_example.gold_titles}")

Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Answer: English
Relevant Wikipedia Titles: {'Restaurant: Impossible', 'Robert Irvine'}


## Task 3: Signature and Module Creation

As we learned above - the bread and butter for DSPy is the `Signature` and `Module`, so we'll create each below.

For our `Signatures`, things are fairly straightforward. We need to:

1. Create a `Signature` that will allow us to generate sub-questions.
2. Create a `Signature` that will provide citations for our responses.

In [13]:
from dsp.utils import deduplicate

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

class GenerateCitedParagraph(dspy.Signature):
    """Generate a paragraph with citations."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    paragraph = dspy.OutputField(desc="includes citations")

Our `Module` is a bit more complex than what we've seen before - so let's walk through what's happening inside of it. We're going to concern ourselves with the `forward` method - as that is where the logic of our `Module` is contained.

In the `forward` method we:

1. Create an empty list of contexts.
2. For each `hop` in our `max_hops` (by default, it will be 2) we:
  - Generate a new `query` using our `GenerateSearchQuery` with a `ChainOfThought` predictor.
  - Retrieve a number (default 3) of `passages` based on that new `query`.
  - Add unique (non-present) `passages` into our `context` list.
3. Take all that `context` and our original `question` and generate a cited paragraph and use it to predict an answer.

In [14]:
class LongFormQA(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_cited_paragraph = dspy.ChainOfThought(GenerateCitedParagraph)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_cited_paragraph(context=context, question=question)
        pred = dspy.Prediction(context=context, paragraph=pred.paragraph)
        return pred
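
To see how `context` grows across hops without calling an LLM or a retriever, here is a toy version of the same loop. The `fake_query` and `fake_retrieve` functions and the passages are invented stand-ins, and `dedupe` is a local order-preserving helper written for this sketch rather than the `deduplicate` imported above.

```python
def dedupe(items):
    """Remove duplicates while preserving first-seen order."""
    return list(dict.fromkeys(items))

# Invented passages in the "title | text" shape used by the retriever.
CORPUS = {
    "top scorer": ["Player X | Player X is the top scorer.", "Jets | A hockey team."],
    "player x years": ["Player X | Player X is the top scorer.", "Player X career | Played 1990-1995."],
}

def fake_query(context, question):
    # First hop asks about the scorer; later hops follow up on what was found.
    return "top scorer" if not context else "player x years"

def fake_retrieve(query, k=3):
    return CORPUS[query][:k]

def multi_hop_context(question, max_hops=2):
    context = []
    for _ in range(max_hops):
        query = fake_query(context, question)
        context = dedupe(context + fake_retrieve(query))
    return context

context = multi_hop_context("Who is the top scorer and when did he play?")
print(len(context))  # 3 unique passages after two hops
```

The second hop retrieves one passage that is already present, so only one new passage is added.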

Next, we'll need a way to evaluate how we're doing!

## Task 4: Evaluating our LongFormQA Module.

Now we'd like to evaluate our module. To do so, we'll need a number of helper functions, which are defined below.

#### Utility Functions for Citation Checking

In [15]:
import nltk
import regex as re

from nltk.tokenize import sent_tokenize
nltk.download('punkt')

def extract_text_by_citation(paragraph):
    citation_regex = re.compile(r'(.*?)(\[\d+\]\.)', re.DOTALL)
    parts_with_citation = citation_regex.findall(paragraph)
    citation_dict = {}
    for part, citation in parts_with_citation:
        part = part.strip()
        citation_num = re.search(r'\[(\d+)\]\.', citation).group(1)
        citation_dict.setdefault(str(int(citation_num) - 1), []).append(part)
    return citation_dict

def correct_citation_format(paragraph):
    modified_sentences = []
    sentences = sent_tokenize(paragraph)
    for sentence in sentences:
        modified_sentences.append(sentence)
    citation_regex = re.compile(r'\[\d+\]\.')
    i = 0
    if len(modified_sentences) == 1:
      has_citation = bool(citation_regex.search(modified_sentences[i]))
    while i < len(modified_sentences):
      if len(modified_sentences[i:i+2]) == 2:
        sentence_group = " ".join(modified_sentences[i:i+2])
        has_citation = bool(citation_regex.search(sentence_group))
        if not has_citation:
            return False
        i += 2 if has_citation and i+1 < len(modified_sentences) and citation_regex.search(modified_sentences[i+1]) else 1
      else:
        return True
    return True

def has_citations(paragraph):
    return bool(re.search(r'\[\d+\]\.', paragraph))

def citations_check(paragraph):
    return has_citations(paragraph) and correct_citation_format(paragraph)

[nltk_data] Downloading package punkt to /Users/Angela/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Checking Citation Faithfulness

We will create a number of useful metrics for our pipeline, including "Faithfulness" as well as a number of more traditional metrics.

In [41]:
class CheckCitationFaithfulness(dspy.Signature):
    """Verify that the text is based on the provided context."""
    context = dspy.InputField(desc="may contain relevant facts")
    text = dspy.InputField(desc="between 1 to 2 sentences")
    faithfulness = dspy.OutputField(desc="boolean indicating if text is faithful to context")

def citation_faithfulness(example, pred, trace):
    paragraph, context = pred.paragraph, pred.context
    citation_dict = extract_text_by_citation(paragraph)
    if not citation_dict:
        return False, None
    context_dict = {str(i): context[i].split(' | ')[1] for i in range(len(context))}
    faithfulness_results = []
    unfaithful_citations = []
    check_citation_faithfulness = dspy.ChainOfThought(CheckCitationFaithfulness)
    for citation_num, texts in citation_dict.items():
        if citation_num not in context_dict:
            continue
        current_context = context_dict[citation_num]
        for text in texts:
            try:
                result = check_citation_faithfulness(context=current_context, text=text)
                is_faithful = result.faithfulness.lower() == 'true'
                faithfulness_results.append(is_faithful)
                if not is_faithful:
                    unfaithful_citations.append({'paragraph': paragraph, 'text': text, 'context': current_context})
            except ValueError as e:
                faithfulness_results.append(False)
                unfaithful_citations.append({'paragraph': paragraph, 'text': text, 'error': str(e)})
    final_faithfulness = all(faithfulness_results)
    print(f"faithfulness: unfaithful_citations={unfaithful_citations}")
    
    if not faithfulness_results:
        return False, None
    return final_faithfulness, unfaithful_citations

#### ❓Question #2:

How is faithfulness being determined here? How is this different from Ragas Faithfulness?

<font color="blue">The Ragas faithfulness metric and the faithfulness check above share similar goals but differ in approach and complexity.

**Approach:**

Ragas evaluates the factual consistency of the generated answer by checking whether the claims made in the answer can be inferred from the retrieved context, assigning a quantitative score based on this analysis.

`CheckCitationFaithfulness`, on the other hand, uses an LLM call to verify whether each cited text segment in a paragraph is supported by the specific passage it cites, producing a boolean result (faithful or not). This method is more citation-specific, focusing on whether each citation aligns with the referenced context.

**Output:**

Ragas provides a numeric score ranging from 0 to 1, providing a range of "faithfulness" levels.

`citation_faithfulness` returns a boolean indicating whether the text is faithful or not based on citations, alongside details on any unfaithful citations. Because every cited segment must pass, it's more difficult to achieve a score of 1 on the faithfulness metric above than it is to achieve a score of >0 on the Ragas metric.</font>

Next, we can create a number of useful metrics that rely on more traditional evaluations, like Precision, Recall, and "does this contain the answer".

In [42]:
from dsp.utils import normalize_text

def extract_cited_titles_from_paragraph(paragraph, context):
    cited_indices = [int(m.group(1)) for m in re.finditer(r'\[(\d+)\]\.', paragraph)]
    cited_indices = [index - 1 for index in cited_indices if index <= len(context)]
    cited_titles = [context[index].split(' | ')[0] for index in cited_indices]
    return cited_titles

def calculate_recall(example, pred, trace=None):
    gold_titles = set(example['gold_titles'])
    found_cited_titles = set(extract_cited_titles_from_paragraph(pred.paragraph, pred.context))
    print(f"calculate_recall: found_cited_titles={found_cited_titles}")
    intersection = gold_titles.intersection(found_cited_titles)
    recall = len(intersection) / len(gold_titles) if gold_titles else 0
    return recall

def calculate_precision(example, pred, trace=None):
    gold_titles = set(example['gold_titles'])
    found_cited_titles = set(extract_cited_titles_from_paragraph(pred.paragraph, pred.context))
    print(f"calculate_precision: found_cited_titles={found_cited_titles}")
    intersection = gold_titles.intersection(found_cited_titles)
    precision = len(intersection) / len(found_cited_titles) if found_cited_titles else 0
    return precision

def answer_correctness(example, pred, trace=None):
    assert hasattr(example, 'answer'), "Example does not have 'answer'."
    normalized_context = normalize_text(pred.paragraph)
    if isinstance(example.answer, str):
        gold_answers = [example.answer]
    elif isinstance(example.answer, list):
        gold_answers = example.answer
    else:
        raise ValueError("'example.answer' is not string or list.")
    return 1 if any(normalize_text(answer) in normalized_context for answer in gold_answers) else 0
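
As a worked example of the set arithmetic behind `calculate_recall` and `calculate_precision`, suppose a paragraph cites only one of the two gold titles. The cited set below is invented; the gold titles are taken from the training example shown earlier.

```python
gold_titles = {"At My Window (album)", "Townes Van Zandt"}
cited_titles = {"Townes Van Zandt"}  # hypothetical citations

overlap = gold_titles & cited_titles

# Recall: share of gold titles that were cited.
recall = len(overlap) / len(gold_titles) if gold_titles else 0
# Precision: share of cited titles that are gold titles.
precision = len(overlap) / len(cited_titles) if cited_titles else 0

print(recall, precision)  # 0.5 1.0
```

Everything cited is relevant (precision 1.0), but half of the relevant titles were never cited (recall 0.5).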

### Creating the Evaluation Function

In essence, all this function does is call all the created metrics above and sum/average them.

In [43]:
from tqdm import tqdm

def evaluate(module, verbose=False):
    correctness_values = []
    recall_values = []
    precision_values = []
    citation_faithfulness_values = []
    for i in tqdm(range(len(devset[:20]))):
        example = devset[i]
        try:
            pred = module(question=example.question)
            if verbose:
                print(f"Example {i+1}:")
                print(f"Question: {example.question}")
                print(f"Prediction: {pred.paragraph}")
                print(f"Correct Answer: {example.answer}")
                print(f"Gold Titles: {example.gold_titles}")

            correctness_values.append(answer_correctness(example, pred))
            citation_faithfulness_score, _ = citation_faithfulness(None, pred, None)
            citation_faithfulness_values.append(citation_faithfulness_score)
            recall = calculate_recall(example, pred)
            precision = calculate_precision(example, pred)
            recall_values.append(recall)
            precision_values.append(precision)

            if verbose:
                print(f"correctness: {answer_correctness(example,pred)}, faithfulness: {citation_faithfulness_score}, precision: {precision}, recall: {recall}\n\n")
        except Exception as e:
            print(f"Failed generation with error: {e}")

    average_correctness = sum(correctness_values) / len(devset[:20]) if correctness_values else 0
    average_recall = sum(recall_values) / len(devset[:20]) if recall_values else 0
    average_precision = sum(precision_values) / len(devset[:20]) if precision_values else 0
    average_citation_faithfulness = sum(citation_faithfulness_values) / len(devset[:20]) if citation_faithfulness_values else 0

    print(f"\nAverage Correctness: {average_correctness}")
    print(f"Average Recall: {average_recall}")
    print(f"Average Precision: {average_precision}")
    print(f"Average Citation Faithfulness: {average_citation_faithfulness}")

### Evaluating our LongFormQA Module

Finally, we can evaluate our module!

In [44]:
longformqa = LongFormQA()
evaluate(longformqa, verbose=True)

100%|██████████| 20/20 [00:00<00:00, 800.60it/s]

Example 1:
Question: Are both Cangzhou and Qionghai in the Hebei province of China?
Prediction: Cangzhou is a prefecture-level city in eastern Hebei province, with a significant population and proximity to major cities like Tianjin and Beijing (Cangzhou, 2010). In contrast, Qionghai is a county-level city located in Hainan province, which is geographically distinct from Hebei, situated on the southern coast of China (Qionghai, 2010). Thus, while Cangzhou is indeed in Hebei, Qionghai belongs to Hainan province, making the assertion that both are in Hebei incorrect.
Correct Answer: no
Gold Titles: {'Qionghai', 'Cangzhou'}
calculate_recall: found_cited_titles=set()
calculate_precision: found_cited_titles=set()
correctness: 0, faithfulness: False, precision: 0, recall: 0.0


Example 2:
Question: Who conducts the draft in which Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017-18 season?
Prediction: Marc-André Fleury was drafted to the Vegas Golden Knights during the 20




This did surprisingly poorly on `Recall`, `Precision` and `Citation Faithfulness`.

❓Question #3:

Why did our `Module` do surprisingly poorly on `Recall`, `Precision` and `Citation Faithfulness`?

> HINT: The name `LongFormQA` should provide a fairly big hint.

<font color="blue">Above, I added some print statements to the `evaluate()` function so that we can see the intermediate results. They show that the system fails to extract the citations present in the text. The `extract_cited_titles_from_paragraph` function expects citations in the form `[number]`, but the generated text contains parenthetical citations like `(Cangzhou, 2010)`. Since the program can't "see" these citations, `found_cited_titles` comes back empty, which results in recall and precision scores of zero and a faithfulness result of `False`.</font>
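
The mismatch can be reproduced in isolation. The sketch below assumes a bracketed-number pattern similar to what `extract_cited_titles_from_paragraph` looks for; the helper name `find_bracket_citations` and the regular expression are illustrative assumptions, not the notebook's actual implementation.

```python
import re

# Hypothetical stand-in for a bracket-style citation pattern, e.g. "[1]" or "[12]".
BRACKET_CITATION = re.compile(r"\[(\d+)\]")

def find_bracket_citations(paragraph: str) -> list[int]:
    """Return the citation indices written as [n] in the paragraph."""
    return [int(n) for n in BRACKET_CITATION.findall(paragraph)]

parenthetical = "Cangzhou is a prefecture-level city in eastern Hebei province (Cangzhou, 2010)."
bracketed = "Cangzhou is a prefecture-level city in eastern Hebei province [2]."

# The parenthetical style contains no [n] markers, so nothing is extracted.
print(find_bracket_citations(parenthetical))
print(find_bracket_citations(bracketed))
```

With no indices extracted, there are no titles to compare against the gold titles, which is consistent with the empty `found_cited_titles` sets in the output above.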

## Task 5: Adding `Assertions`

DSPy comes equipped with two extremely useful features: `Assertions` and `Suggestions`.

Let's take a look at what each one does:

1. `dspy.Assert` - this is a hard rule that must be followed; if it's not followed, an exception will be raised.
2. `dspy.Suggest` - this is a looser rule, or guiding principle; it will not raise an exception if the rule isn't met, but it will try to ensure the suggestion is met.
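
The difference between the two can be sketched in plain Python. The functions `hard_rule` and `soft_rule` below are hypothetical stand-ins that only mimic the final outcome of a failed check; they are not DSPy's implementation, which, as far as we understand it, also retries the target module with feedback before giving up.

```python
import logging

logger = logging.getLogger("rules_sketch")

def hard_rule(condition: bool, message: str) -> None:
    """Assert-like: a failed check stops the program with an exception."""
    if not condition:
        raise AssertionError(message)

def soft_rule(condition: bool, message: str) -> bool:
    """Suggest-like: a failed check is logged, and execution continues."""
    if not condition:
        logger.warning("Suggestion not met: %s", message)
    return condition

paragraph = "Qionghai is a county-level city in Hainan province."
has_citation = "[" in paragraph  # deliberately crude check for this sketch
message = "Add citations in 'text... [x].' format."

# The soft rule reports the failure but lets the program carry on.
soft_result = soft_rule(has_citation, message)

# The hard rule raises, so the caller has to handle the exception.
try:
    hard_rule(has_citation, message)
    hard_raised = False
except AssertionError:
    hard_raised = True
```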

Let's improve our `Module` with some `dspy.Suggest`s!


In [45]:
class LongFormQAWithAssertions(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_cited_paragraph = dspy.ChainOfThought(GenerateCitedParagraph)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_cited_paragraph(context=context, question=question)
        pred = dspy.Prediction(context=context, paragraph=pred.paragraph)
        # Soft check 1: every 1-2 sentences should carry a citation.
        dspy.Suggest(citations_check(pred.paragraph), "Make sure every 1-2 sentences has citations. If any 1-2 sentences lack citations, add them in 'text... [x].' format.", target_module=self.generate_cited_paragraph)
        # Soft check 2: every cited sentence should be supported by its cited context.
        _, unfaithful_outputs = citation_faithfulness(None, pred, None)
        if unfaithful_outputs:
            unfaithful_pairs = [(output['text'], output['context']) for output in unfaithful_outputs]
            for _, unfaithful_context in unfaithful_pairs:
                # The condition is always False inside this branch, so each unfaithful pair triggers a suggestion.
                dspy.Suggest(len(unfaithful_pairs) == 0, f"Make sure your output is based on the following context: '{unfaithful_context}'.", target_module=self.generate_cited_paragraph)
        return pred

#### 🏗️ Activity #2:

Write out the above flow in natural language or using a drawing program.

<font color="blue">

**Initialization:**

The model is initialized with parameters for retrieving passages and defining how many hops (iterations) to perform. Key components include a query generator, passage retriever, and cited paragraph generator.

**Forward Function:**

1. The model performs iterative hops to generate queries and retrieve relevant passages, gradually building up the context for the question and deduplicating the retrieved passages along the way.
2. After all hops, a cited paragraph is generated based on the collected context.
3. Next, the function conducts citation and faithfulness checks:
   * Citation Check: The generated paragraph is checked to ensure citations appear every 1-2 sentences. If not, a suggestion is made to revise the output, providing soft feedback without stopping the process.
   * Faithfulness Check: The `citation_faithfulness` function checks whether the output accurately reflects the retrieved context. If discrepancies are found, suggestions target the specific areas that need to align better with the context, guiding further refinement.
4. Completion: The process ends when either the suggestions have been incorporated or the output passes all checks, producing a final, faithful, and well-cited paragraph.
</font>


<font color="black">What is the key advantage provided by using `dspy.Suggest`?</font>
<font color="blue">The key advantage of using `dspy.Suggest` is that it provides soft feedback during program execution without interrupting the flow. This allows the model to identify and offer actionable improvements (such as adding missing citations or ensuring faithfulness) while continuing its operation, enabling iterative refinement. Unlike assertions, which enforce strict conditions that may halt execution, suggestions guide the model to adjust outputs and enhance quality incrementally, making the development process more flexible and adaptive.
Above, we saw low performance metrics because the module did not follow the expected citation format, so adding `Suggest` statements is likely to help significantly.</font>
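
The refinement loop described above can be sketched without DSPy. Everything here is a toy under stated assumptions: `toy_generate` stands in for the language model, `has_bracket_citation` for `citations_check`, and `max_retries` for a backtracking budget. None of these names come from DSPy.

```python
import re
from typing import Callable, Optional

def has_bracket_citation(text: str) -> bool:
    """Toy check: does the text contain at least one [n] citation?"""
    return re.search(r"\[\d+\]", text) is not None

def toy_generate(question: str, feedback: Optional[str] = None) -> str:
    """Toy 'model': it only cites in [n] format once it has received feedback."""
    answer = "Qionghai is a county-level city in Hainan province"
    return f"{answer} [1]." if feedback else f"{answer} (Qionghai, 2010)."

def generate_with_suggestion(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], bool],
    message: str,
    max_retries: int = 2,
) -> tuple[str, int]:
    """Retry generation with feedback; return the last output and the retries used."""
    output = generate(question, None)
    retries = 0
    while not check(output) and retries < max_retries:
        retries += 1
        output = generate(question, message)  # feed the failed rule back in
    # Soft rule: the output is returned even if the check still fails.
    return output, retries

output, retries = generate_with_suggestion(
    "Are both Cangzhou and Qionghai in the Hebei province of China?",
    toy_generate,
    has_bracket_citation,
    "Add citations in 'text... [x].' format.",
)
print(output, retries)
```

The design point is the last line of `generate_with_suggestion`: a soft rule never blocks the return value, so a full evaluation run can finish even when some examples never satisfy the check.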

In [46]:
from dspy.primitives.assertions import assert_transform_module, backtrack_handler
from dspy.predict import Retry

longformqa_with_assertions = assert_transform_module(LongFormQAWithAssertions().map_named_predictors(Retry), backtrack_handler)
evaluate(longformqa_with_assertions)

  5%|▌         | 1/20 [00:14<04:41, 14.79s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'Cangzhou is a prefecture-level city in eastern Hebei province, with a significant population and proximity to major cities like Tianjin and Beijing (Cangzhou, 2010) [2]. In contrast, Qionghai is a county-level city located in Hainan province, which is geographically distinct from Hebei, situated on the southern coast of China (Qionghai, 2010) [4]. Thus, while Cangzhou is indeed in Hebei, Qionghai belongs to Hainan province, making the assertion that both are in Hebei incorrect [1][2][4].', 'text': 'In contrast, Qionghai is a county-level city located in Hainan province, which is geographically distinct from Hebei, situated on the southern coast of China (Qionghai, 2010)', 'context': 'Qionghai () is one of the seven county-level cities of Hainan province, China. Although called a "city", Qionghai refers to a large land area in Hainan - an area which was once a county. Within this area is the main city, Qionghai City. It is located in th

 10%|█         | 2/20 [00:23<03:22, 11.28s/it]

faithfulness: unfaithful_citations=[]
faithfulness: unfaithful_citations=[]
calculate_recall: found_cited_titles={'2017 NHL Expansion Draft'}
calculate_precision: found_cited_titles={'2017 NHL Expansion Draft'}


 15%|█▌        | 3/20 [00:32<02:50, 10.05s/it]

faithfulness: unfaithful_citations=[]
faithfulness: unfaithful_citations=[]
calculate_recall: found_cited_titles={'2006–07 Detroit Red Wings season', 'Steve Yzerman'}
calculate_precision: found_cited_titles={'2006–07 Detroit Red Wings season', 'Steve Yzerman'}
faithfulness: unfaithful_citations=[{'paragraph': "Crichton Collegiate Church is located near Crichton Castle, which is positioned at the head of the River Tyne in Midlothian, Scotland [1]. This proximity suggests that the River Tyne is also near the church itself, enhancing the historical and geographical significance of the area [2]. The relationship between these landmarks highlights the interconnectedness of the region's cultural heritage [1][2].", 'text': 'Crichton Collegiate Church is located near Crichton Castle, which is positioned at the head of the River Tyne in Midlothian, Scotland', 'context': "Crichton Collegiate Church is situated about 0.6 mi south west of the hamlet of Crichton in Midlothian, Scotland. Crichton it

 20%|██        | 4/20 [00:48<03:18, 12.39s/it]

faithfulness: unfaithful_citations=[{'paragraph': "Crichton Collegiate Church is situated about 0.6 miles southwest of the hamlet of Crichton in Midlothian, Scotland [1]. Its proximity to Crichton Castle, which lies at the head of the River Tyne, suggests that the River Tyne is also near the church [2]. This connection enhances the historical and geographical significance of the area, illustrating the interconnectedness of the region's cultural heritage [1][2].", 'text': 'Its proximity to Crichton Castle, which lies at the head of the River Tyne, suggests that the River Tyne is also near the church', 'context': 'Crichton Castle is a ruined castle situated at the head of the River Tyne, near the village of Crichton, Midlothian, Scotland. The castle lies two miles south of the village of Pathhead, and the same distance east of Gorebridge, at . A mile to the south-west is Borthwick Castle.'}, {'paragraph': "Crichton Collegiate Church is situated about 0.6 miles southwest of the hamlet of 

 25%|██▌       | 5/20 [01:13<04:17, 17.16s/it]

faithfulness: unfaithful_citations=[{'paragraph': "Æthelweard, who died in 920 or 922, was the younger son of King Alfred the Great and his wife Ealhswith [1]. Alfred, reigning from 871 to 899, is celebrated for his defense of England against Viking invasions and his role in unifying the Anglo-Saxon kingdoms [1]. As a member of Alfred's royal family, Æthelweard's lineage is significant in the context of English history during a period of considerable turmoil and transformation [1]. Thus, the English king associated with Æthelweard is indeed Alfred the Great, whose reign was pivotal for the future of England [1].", 'text': 'Alfred, reigning from 871 to 899, is celebrated for his defense of England against Viking invasions and his role in unifying the Anglo-Saxon kingdoms', 'context': 'Æthelweard (d. 920 or 922) was the younger son of King Alfred the Great and Ealhswith.'}]
faithfulness: unfaithful_citations=[{'paragraph': "Æthelweard, who died in 920 or 922, was the younger son of King 

 30%|███       | 6/20 [01:26<03:40, 15.78s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'The Newark Airport Exchange is situated at the northern edge of Newark Liberty International Airport (EWR), which is operated by the Port Authority of New York and New Jersey (Newark Liberty International Airport) [1]. This airport serves as the primary airport for New Jersey and is jointly owned by the cities of Elizabeth and Newark, with Newark being the most populous city in the state [1]. The Port Authority manages the operations of this significant transportation hub, ensuring connectivity and services for travelers [1].', 'text': 'The Newark Airport Exchange is situated at the northern edge of Newark Liberty International Airport (EWR), which is operated by the Port Authority of New York and New Jersey (Newark Liberty International Airport)', 'context': 'Newark Liberty International Airport (IATA: EWR, ICAO: KEWR, FAA LID: EWR) , originally Newark Metropolitan Airport and later Newark International Airport, is the primary airport

 35%|███▌      | 7/20 [01:43<03:26, 15.91s/it]

calculate_recall: found_cited_titles=set()
calculate_precision: found_cited_titles=set()


 40%|████      | 8/20 [01:50<02:39, 13.28s/it]

faithfulness: unfaithful_citations=[]
faithfulness: unfaithful_citations=[]
calculate_recall: found_cited_titles={'William R. Fairchild International Airport', 'Chico Municipal Airport'}
calculate_precision: found_cited_titles={'William R. Fairchild International Airport', 'Chico Municipal Airport'}


 45%|████▌     | 9/20 [01:58<02:06, 11.54s/it]

calculate_recall: found_cited_titles=set()
calculate_precision: found_cited_titles=set()
faithfulness: unfaithful_citations=[{'paragraph': 'The Afghan Whigs, an American alternative rock band from Cincinnati, Ohio, originally active from 1986 to 2001, reformed in 2012 and have since released two albums: "Do to the Beast" in 2014 and "In Spades" in 2017 [4][6]. In contrast, the British band Gene disbanded in 2004 and reformed in 2011, but they have not produced new music since their reunion [1]. Thus, The Afghan Whigs have been more prolific in their post-reformation activities compared to Gene [4][6].', 'text': 'Thus, The Afghan Whigs have been more prolific in their post-reformation activities compared to Gene [4]', 'context': 'Do to the Beast is the seventh studio album by American alternative rock band The Afghan Whigs, their first in 16 years. It was released on April 15, 2014 on Sub Pop Records, the same label that released their albums "Up in It" and "Congregation".'}, {'paragrap

 50%|█████     | 10/20 [02:15<02:13, 13.32s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'The Afghan Whigs, an American alternative rock band from Cincinnati, Ohio, originally active from 1986 to 2001, reformed in 2012 and have since released their seventh studio album, "Do to the Beast," on April 15, 2014, marking their first release in 16 years [6]. In contrast, the British band Gene disbanded in 2004 and reformed in 2011, but they have not produced any new music since their reunion [1]. Thus, The Afghan Whigs have been more prolific in their post-reformation activities compared to Gene, showcasing their continued relevance in the rock scene [6].', 'text': 'Thus, The Afghan Whigs have been more prolific in their post-reformation activities compared to Gene, showcasing their continued relevance in the rock scene', 'context': 'Do to the Beast is the seventh studio album by American alternative rock band The Afghan Whigs, their first in 16 years. It was released on April 15, 2014 on Sub Pop Records, the same label that relea

 55%|█████▌    | 11/20 [02:23<01:43, 11.53s/it]

faithfulness: unfaithful_citations=[]
faithfulness: unfaithful_citations=[]
calculate_recall: found_cited_titles={'Eruption of Mount Vesuvius in 79', 'Plinian eruption'}
calculate_precision: found_cited_titles={'Eruption of Mount Vesuvius in 79', 'Plinian eruption'}


 60%|██████    | 12/20 [02:36<01:36, 12.04s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'The 72nd Field Artillery Brigade, located at Joint Base McGuire-Dix-Lakehurst, New Jersey, operates as a subordinate unit of the First United States Army, which is recognized as the oldest and longest established field army in the United States Army, having been formed during World War I [4]. This brigade is tasked with training selected Army Reserve and National Guard units along the East Coast, highlighting its significant role within the historical context of the U.S. Army [1]. Thus, the 72nd Field Artillery Brigade is part of the oldest established field army in the United States [4].', 'text': 'The 72nd Field Artillery Brigade, located at Joint Base McGuire-Dix-Lakehurst, New Jersey, operates as a subordinate unit of the First United States Army, which is recognized as the oldest and longest established field army in the United States Army, having been formed during World War I', 'context': 'The First Army is the oldest and longes

 65%|██████▌   | 13/20 [03:01<01:50, 15.85s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'The historical records regarding Stanisław Kiszka do not provide explicit details about whether he was paid for his services by the Royal Treasury [1]. Kiszka, a prominent noble from the Grand Duchy of Lithuania, held significant military and administrative positions, including Great Hetman and Grand Marshal of Lithuania, roles that typically would involve state support or compensation [1]. His successful defense of Smolensk during the Second Muscovite–Lithuanian War further underscores his importance, suggesting that he likely received some form of remuneration for his contributions to the state [1]. However, specific records of such payments are not mentioned in the available context [1]. Therefore, while it is plausible that he was compensated, the lack of direct evidence prevents a definitive conclusion [1].', 'text': 'His successful defense of Smolensk during the Second Muscovite–Lithuanian War further underscores his importance, 

 70%|███████   | 14/20 [03:09<01:21, 13.63s/it]

calculate_recall: found_cited_titles=set()
calculate_precision: found_cited_titles=set()


 75%|███████▌  | 15/20 [03:18<01:00, 12.17s/it]

calculate_recall: found_cited_titles=set()
calculate_precision: found_cited_titles=set()


 80%|████████  | 16/20 [03:25<00:42, 10.66s/it]

faithfulness: unfaithful_citations=[]
faithfulness: unfaithful_citations=[]
calculate_recall: found_cited_titles={'Marche', 'Pollenza'}
calculate_precision: found_cited_titles={'Marche', 'Pollenza'}
faithfulness: unfaithful_citations=[{'paragraph': 'William Hughes Miller was born in Kosciusko, Mississippi, a city that had a population of 7,402 according to the 2010 census [4]. This city serves as the county seat of Attala County, which is named after a fictional Native American heroine from an early-19th-century novel [5]. The population figure highlights the size of the city where Miller was born, providing insight into his early environment [4].', 'text': 'The population figure highlights the size of the city where Miller was born, providing insight into his early environment', 'context': 'Kosciusko is a city in Attala County, Mississippi, United States. The population was 7,402 at the 2010 census. It is the county seat of Attala County.'}]


 85%|████████▌ | 17/20 [03:40<00:36, 12.01s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'William Hughes Miller was born in Kosciusko, Mississippi, a city that had a population of 7,402 according to the 2010 census [4]. This city is notable as it serves as the county seat of Attala County, which is named after a fictional Native American heroine from an early-19th-century novel [5]. The population figure underscores the size of the city where Miller was born, providing insight into his early environment [4].', 'text': 'The population figure underscores the size of the city where Miller was born, providing insight into his early environment', 'context': 'Kosciusko is a city in Attala County, Mississippi, United States. The population was 7,402 at the 2010 census. It is the county seat of Attala County.'}]
faithfulness: unfaithful_citations=[{'paragraph': 'William Hughes Miller was born in Kosciusko, Mississippi, a city that had a population of 7,402 according to the 2010 census [4]. This city is notable as it serves as the c

 90%|█████████ | 18/20 [03:48<00:21, 10.66s/it]

faithfulness: unfaithful_citations=[]
faithfulness: unfaithful_citations=[]
calculate_recall: found_cited_titles={'Gallatin School of Individualized Study'}
calculate_precision: found_cited_titles={'Gallatin School of Individualized Study'}
faithfulness: unfaithful_citations=[{'paragraph': 'Robert Irvine, the celebrity chef and restaurateur known for his role in the reality television series "Restaurant: Impossible," is of English nationality [3]. Born on September 24, 1965, in England, Irvine has made a significant impact on the culinary scene in the United States through his various television appearances and his work in helping struggling restaurants regain their footing [1], [2]. His expertise and guidance have been instrumental in revitalizing numerous establishments across the country [2].', 'text': 'Born on September 24, 1965, in England, Irvine has made a significant impact on the culinary scene in the United States through his various television appearances and his work in hel

 95%|█████████▌| 19/20 [04:07<00:13, 13.33s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'Robert Irvine, the celebrity chef and restaurateur known for his role in the reality television series "Restaurant: Impossible," is of English nationality [3]. The show focuses on Irvine\'s efforts to renovate failing American restaurants within a tight budget and timeframe, aiming to restore them to profitability and prominence [2]. His expertise is evident as he assesses the problems of each restaurant, creates a new decor plan, oversees cleaning, reduces menu size, and trains the staff as needed [2]. Through his various television appearances, Irvine has made a significant impact on the culinary scene in the United States [1]. His guidance has been instrumental in revitalizing numerous establishments across the country [2].', 'text': 'Through his various television appearances, Irvine has made a significant impact on the culinary scene in the United States', 'context': 'Restaurant: Impossible is an American reality television series

100%|██████████| 20/20 [04:29<00:00, 13.48s/it]

faithfulness: unfaithful_citations=[{'paragraph': 'The American actor who plays an Eastside drug lord known for preferring peaceful solutions to business disputes is Robert F. Chew [4]. He is best known for portraying the character Joseph Stewart, commonly referred to as "Proposition Joe," in the HBO drama series "The Wire" [4]. Proposition Joe is characterized by his amiable demeanor and his preference for negotiation over violence in the dangerous world of drug trade, often using his trademark phrase, "I\'ve got a proposition for you" [1]. Chew\'s portrayal of this character added significant depth, making him a memorable figure in the series [4].', 'text': 'The American actor who plays an Eastside drug lord known for preferring peaceful solutions to business disputes is Robert F. Chew', 'context': 'Robert Frederick Chew (December 28, 1960 – January 17, 2013) was an American actor. He was best known for portraying drug kingpin Proposition Joe on the HBO drama series "The Wire".'}, {'




In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

longformqa = LongFormQAWithAssertions()
teleprompter = BootstrapFewShotWithRandomSearch(metric=answer_correctness, max_bootstrapped_demos=2, num_candidate_programs=6)
cited_longformqa_student_teacher = teleprompter.compile(
    student=assert_transform_module(LongFormQAWithAssertions().map_named_predictors(Retry), backtrack_handler),
    teacher=assert_transform_module(LongFormQAWithAssertions().map_named_predictors(Retry), backtrack_handler),
    trainset=trainset,
    valset=devset[:25],
)
evaluate(cited_longformqa_student_teacher)