<img src="../../docs/images/DSPy8.png" alt="DSPy8 Image" height="120"/>

### Multi-Agent DSPy Programs: Bootstrapping & Aggregating Multiple `ReAct` Agents

This is a quick (somewhat advanced) example of DSPy. Suppose you're given a hard QA task and an agent architecture (`dspy.ReAct`). How do you get high scores without tinkering with prompts?

There are many ways, but this notebook shows one complex strategy that DSPy makes near-trivial to achieve: we'll automatically bootstrap five different highly-effective prompts for ReAct, then optimize an aggregator that combines their powers.

As is usually the case with DSPy, the code to do this is probably shorter than describing it in English, so let's jump right into that.

### 0) TLDR.

We'll build a ReAct agent in DSPy that scores 30% accuracy on a retrieval-based question answering task.

Then, we'll optimize it with `BootstrapFewShotWithRandomSearch` to get 46% accuracy.

Then, we'll build a multi-agent aggregator over five different optimized versions of the agent.

Our unoptimized aggregator will score 26%. It doesn't understand the task. Hence, we'll optimize the aggregator too.

We'll end up with an optimized multi-agent system that scores a whopping 60% accuracy on the same task.

The core portion of the code to do this fits into 10 lines of DSPy, but we'll sprinkle some short explanations below.

### 1) Setting Up.

We'll configure the language model (GPT-3.5) and the retrieval model (ColBERTv2 over Wikipedia).

In [None]:
import dspy
from dspy.evaluate import Evaluate
from dspy.datasets.hotpotqa import HotPotQA
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

gpt3 = dspy.OpenAI('gpt-3.5-turbo', max_tokens=4000)
colbert = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.configure(lm=gpt3, rm=colbert)

### 2) Loading some data.

We'll load 150 examples for training (`trainset`), 50 examples for validation & optimization (`valset`), and 300 examples for evaluation (`devset`).

In [7]:
dataset = HotPotQA(train_seed=1, train_size=200, eval_seed=2023, dev_size=300, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train[0:150]]
valset = [x.with_inputs('question') for x in dataset.train[150:200]]
devset = [x.with_inputs('question') for x in dataset.dev]

# show an example datapoint; it's just a question-answer pair
trainset[0]

Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})

### 3) ReAct Agent.

Our agent will just be a DSPy ReAct agent that takes a `question` and outputs the `answer` by using a ColBERTv2 retrieval tool.

In [None]:
agent = dspy.ReAct("question -> answer", tools=[dspy.Retrieve(k=1)])

Let's evaluate this **unoptimized** ReAct agent on the `devset`.

In [None]:
# Set up an evaluator on all 300 examples of the devset.
config = dict(num_threads=8, display_progress=True, display_table=5)
evaluate = Evaluate(devset=devset, metric=dspy.evaluate.answer_exact_match, **config)

evaluate(agent)
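
To make the metric concrete, here is a minimal sketch of the kind of check an exact-match metric performs. It assumes SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the helper names `normalize_text`, `exact_match`, and `accuracy` are hypothetical, and `dspy.evaluate.answer_exact_match` may differ in its details.

```python
import re
import string

def normalize_text(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True when the two strings are identical after normalization."""
    return normalize_text(prediction) == normalize_text(gold)

def accuracy(predictions, golds) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

Under this assumption, a partially correct answer such as "Townes Van Zandt" gets no credit against "John Townes Van Zandt", which is part of why absolute scores on this task look low.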

### 4) Optimized ReAct.

Let's use DSPy's simple `BootstrapFewShotWithRandomSearch` optimizer to create successful examples of the ReAct program and attempt to optimize the prompts using those constructed examples. In the future, we could try more sophisticated DSPy optimizers too, like `MIPRO`.

We'll bootstrap 20 programs that way. Examples will be bootstrapped starting from the `trainset` and optimized over our tiny `valset`. We'll evaluate later on the `devset`.

In [None]:
config = dict(max_bootstrapped_demos=2, max_labeled_demos=0, num_candidate_programs=20, num_threads=8)
tp = BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match, **config)
optimized_react = tp.compile(agent, trainset=trainset, valset=valset)
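
Conceptually, the optimizer above bootstraps demonstrations from the program's own successful runs, then tries many random demo subsets and ranks them on the validation set. The sketch below is a heavily simplified, LM-free illustration of that idea, not DSPy's implementation: `toy_program`, `toy_metric`, `bootstrap_demos`, and `random_search` are hypothetical names, and a "program" here is just a callable over a lookup table.

```python
import random

# Hypothetical stand-ins so the sketch runs without an LM: a "program" is any
# callable taking (question, demos), and the metric is plain string equality.
KNOWN_ANSWERS = {"q1": "a1", "q2": "a2", "q3": "a3"}

def toy_program(question, demos):
    # Prefer an answer copied from a matching demo, else fall back to the lookup.
    for demo in demos:
        if demo["question"] == question:
            return demo["answer"]
    return KNOWN_ANSWERS.get(question, "unknown")

def toy_metric(example, prediction):
    return prediction == example["answer"]

def bootstrap_demos(program, trainset, metric):
    """Run the program on each training example; keep outputs that pass the metric."""
    demos = []
    for example in trainset:
        prediction = program(example["question"], demos=[])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos

def random_search(program, trainset, valset, metric,
                  num_candidates=20, max_demos=2, seed=0):
    """Sample demo subsets, score each on valset, return candidates best-first."""
    rng = random.Random(seed)
    pool = bootstrap_demos(program, trainset, metric)
    candidates = []
    for _ in range(num_candidates):
        demos = rng.sample(pool, k=min(max_demos, len(pool)))
        hits = sum(metric(ex, program(ex["question"], demos=demos)) for ex in valset)
        candidates.append((hits / len(valset), demos))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates

trainset = [{"question": q, "answer": a} for q, a in
            [("q1", "a1"), ("q2", "a2"), ("q4", "a4")]]
valset = [{"question": "q3", "answer": "a3"}, {"question": "q5", "answer": "a5"}]
candidates = random_search(toy_program, trainset, valset, toy_metric, num_candidates=5)
top_two = [demos for _, demos in candidates[:2]]
```

Because the search keeps every scored candidate rather than only the winner, the runners-up remain available afterwards, which is what lets this notebook reuse several of them as separate agents.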

In [None]:
evaluate(optimized_react)


### 5) Zero-Shot Aggregator.

Let's now extract the best five bootstrapped ReAct programs. We'll build a simple DSPy aggregator that runs all of them then produces a final answer.

In [None]:
from dsp.utils import flatten, deduplicate

# the best-performing five ReAct programs from the optimization process
AGENTS = [x[-1] for x in optimized_react.candidate_programs[:5]]

class Aggregator(dspy.Module):
	def __init__(self, temperature=0.0):
		super().__init__()
		self.aggregate = dspy.ChainOfThought('context, question -> answer')
		self.temperature = temperature

	def forward(self, question):
		# Run all five agents at the configured temperature, then extract and deduplicate their observed contexts
		with dspy.context(lm=gpt3.copy(temperature=self.temperature)):
			preds = [agent(question=question) for agent in AGENTS]
			context = deduplicate(flatten([flatten(p.observations) for p in preds]))

		# Run the aggregation step to produce a final answer
		return self.aggregate(context=context, question=question)


Let's quickly evaluate the aggregator prior to optimization.

In [None]:
aggregator = Aggregator()
evaluate(aggregator)
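
The context-building step in the aggregator is easy to picture in isolation. Below is a small sketch of what `flatten` and `deduplicate` are assumed to do here: merge the per-agent passage lists into one list, then drop repeats while keeping first-seen order. These are illustrative re-implementations, not the `dsp.utils` source, and the passages are made up.

```python
def flatten(nested):
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]."""
    return [item for sublist in nested for item in sublist]

def deduplicate(items):
    """Drop repeated items while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

# Hypothetical passages retrieved by three different agents for one question.
per_agent_passages = [
    ["passage A", "passage B"],
    ["passage B", "passage C"],
    ["passage A"],
]
context = deduplicate(flatten(per_agent_passages))
print(context)  # ['passage A', 'passage B', 'passage C']
```

Deduplicating matters because agents bootstrapped from the same task often retrieve overlapping passages, and repeating them would only pad the aggregator's prompt.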

### 6) Optimized Aggregator.

In [None]:
kwargs = dict(max_bootstrapped_demos=2, max_labeled_demos=6, num_candidate_programs=10, num_threads=8)
tp = BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match, **kwargs)
optimized_aggregator = tp.compile(aggregator, trainset=trainset, valset=valset)

In [None]:
optimized_aggregator2 = optimized_aggregator.deepcopy()
optimized_aggregator2.temperature = 0.7

evaluate(optimized_aggregator2)

### 7) Conclusion.

Normally, we like to release notebooks with pre-computed caches and to inspect the prompts with `gpt3.inspect_history` to explore the behavior of optimization. See the intro notebook (or any of the Colab notebooks on the README) for such annotated examples!

To keep the current release super quick, Omar will extend this notebook into an annotated version if there's significant interest.

### 8) Post-Conclusion Note.

With a little bit of syntactic sugar, the main code in this notebook could be as short as 10 lines excluding whitespace:

```python
agent = dspy.ReAct("question -> answer", tools=[dspy.Retrieve(k=1)])

optimizer = BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match)
optimized_react = optimizer.compile(agent, trainset=trainset, valset=valset)

class Aggregator(dspy.Module):
	def __init__(self):
		self.aggregate = dspy.ChainOfThought('context, question -> answer')

	def forward(self, question):
		preds = [agent(question=question) for agent in optimized_react.best_programs[:5]]
		return self.aggregate(context=deduplicate(flatten([p.observations for p in preds])), question=question)

optimized_aggregator = optimizer.compile(Aggregator(), trainset=trainset, valset=valset)

# Use it!
optimized_aggregator(question="How many storeys are in the castle that David Gregory inherited?")
```