<img src="../../docs/images/DSPy8.png" alt="DSPy7 Image" height="120"/>

### Multi-Agent DSPy Programs: Bootstrapping & Aggregating Multiple `ReAct` Agents

This is a quick (somewhat advanced) example of DSPy. Given a hard QA task and an agent architecture (`dspy.ReAct`), how do you get high scores without tinkering with prompts?

There are many ways, but this notebook shows one complex strategy that DSPy makes near-trivial to achieve: we'll automatically bootstrap five different highly-effective prompts for ReAct, then optimize an aggregator that combines their powers.

As is usually the case with DSPy, the code to do this is probably shorter than describing it in English, so let's jump right into that.

### 0) TLDR.

We'll build a ReAct agent in DSPy that scores 30% accuracy on a retrieval-based question answering task.

Then, we'll optimize it with `BootstrapFewShotWithRandomSearch` to get 46% accuracy.

Then, we'll build a multi-agent aggregator over five different optimized versions of the agent.

Our unoptimized aggregator will score 26%. It doesn't understand the task. Hence, we'll optimize the aggregator too.

We'll end up with an optimized multi-agent system that scores a whopping 60% accuracy on the same task.

The core portion of the code to do this can be fit into 10 lines of DSPy, but we'll sprinkle some short explanations below.
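
To make the "aggregator" idea concrete before we get to the DSPy version, here is a minimal sketch of the simplest possible aggregator: a majority vote over the answers of several agents. This is a swapped-in baseline for illustration, not the notebook's method (the aggregator here is an LM-based module that gets optimized). The names `normalize`, `majority_vote`, and the candidate answers are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(normalize(a) for a in answers)
    best = max(counts.values())
    # Return the original (un-normalized) spelling of the winning answer.
    return next(a for a in answers if counts[normalize(a)] == best)


# Hypothetical outputs from five agents for one question.
candidates = [
    "John Townes Van Zandt",
    "Townes Van Zandt",
    "john townes van zandt",
    "Bob Dylan",
    "John Townes  Van Zandt",
]
print(majority_vote(candidates))
```

A vote like this cannot recover when most agents are wrong in the same way, which is one reason to prefer an aggregator that can read the candidates and be optimized.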

### 1) Setting Up.

We'll configure the language model (GPT-3.5) and the retrieval model (ColBERTv2 over Wikipedia).

In [1]:
from dspy.evaluate import Evaluate
from dspy.datasets.hotpotqa import HotPotQA
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from experiment_project.utils.initial.util import init_sys_env
from experiment_project.utils.files.read import read_yaml
import dspy

init_sys_env(proxy_url='http://192.168.31.215:10890')
secret_env_file = '/mnt/c/Users/chenzi/Desktop/project/experiment_project/env_secret_config.yaml'

api_configs = read_yaml(secret_env_file)
model_config = api_configs.get('openai')

turbo = dspy.OpenAI(model=model_config.get('model'), max_tokens=520, api_key=model_config.get('api_key'))
colbert = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.configure(lm=turbo, rm=colbert)
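
The cell above reads its model name and API key through project-specific helpers (`init_sys_env`, `read_yaml`) and a local YAML file, which other readers won't have. As a hedged alternative, here is a small sketch that collects the same two settings from environment variables. `load_openai_config` and the `OPENAI_MODEL` variable name are assumptions made for this sketch, and the `gpt-3.5-turbo` default is only a guess at the GPT-3.5 model meant above.

```python
import os


def load_openai_config(env=os.environ) -> dict:
    """Collect OpenAI settings from environment variables.

    OPENAI_MODEL is a hypothetical variable name chosen for this sketch;
    the default model name is an assumption, not taken from the notebook.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("set OPENAI_API_KEY before running the notebook")
    return {
        "model": env.get("OPENAI_MODEL", "gpt-3.5-turbo"),
        "api_key": api_key,
    }


# Demo with an explicit mapping so the sketch runs without real credentials.
cfg = load_openai_config({"OPENAI_API_KEY": "sk-placeholder"})
print(cfg["model"])
```

The returned dict can stand in for `model_config` in the cell above, since that cell only calls `.get('model')` and `.get('api_key')` on it.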

### 2) Loading some data.
We'll load 150 examples for training (`trainset`), 50 examples for validation & optimization (`valset`), and 300 examples for evaluation (`devset`).

In [2]:
dataset = HotPotQA(train_seed=1, train_size=200, eval_seed=2023, dev_size=300, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train[0:150]]
valset = [x.with_inputs('question') for x in dataset.train[150:200]]
devset = [x.with_inputs('question') for x in dataset.dev]

# show an example datapoint; it's just a question-answer pair
trainset[0]

Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})
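
The split above slices 200 training examples into 150 for `trainset` and 50 for `valset`, and `with_inputs('question')` marks which field is the input (leaving `answer` as the label). The sketch below mimics that behaviour with a toy class so the mechanics are visible without downloading HotPotQA. `ToyExample` is a hypothetical stand-in, not DSPy's `Example` class.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExample:
    """Hypothetical stand-in for a DSPy example: a dict of fields plus input keys."""
    fields: dict
    input_keys: frozenset = field(default_factory=frozenset)

    def with_inputs(self, *keys: str) -> "ToyExample":
        # Return a copy; the original example is left untouched.
        return ToyExample(self.fields, frozenset(keys))

    def inputs(self) -> dict:
        return {k: v for k, v in self.fields.items() if k in self.input_keys}

    def labels(self) -> dict:
        return {k: v for k, v in self.fields.items() if k not in self.input_keys}


raw = [ToyExample({"question": f"q{i}", "answer": f"a{i}"}) for i in range(200)]
trainset = [x.with_inputs("question") for x in raw[0:150]]
valset = [x.with_inputs("question") for x in raw[150:200]]

print(len(trainset), len(valset))
print(trainset[0].inputs())
print(trainset[0].labels())
```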

### 3) ReAct Agent.

Our agent will just be a DSPy ReAct agent that takes a `question` and outputs the `answer` by using a ColBERTv2 retrieval tool.

In [3]:
agent = dspy.ReAct("question -> answer", tools=[dspy.Retrieve(k=1)])
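
For readers new to ReAct, the loop behind an agent like this alternates between picking an action (for example, a search) and reading the tool's observation, until it decides to finish. The sketch below shows that control flow with a scripted policy in place of the language model and a one-document dictionary in place of ColBERTv2. It is a conceptual illustration, not how `dspy.ReAct` is implemented; `react_loop`, `search`, and `scripted_policy` are hypothetical names.

```python
from typing import Callable


def react_loop(question: str,
               policy: Callable[[str, list], tuple[str, str]],
               tools: dict[str, Callable[[str], str]],
               max_iters: int = 5) -> str:
    """Alternate between choosing an action and observing a tool result.

    `policy` stands in for the language model: given the question and the
    trajectory so far, it returns (action_name, argument).
    """
    trajectory = []
    for _ in range(max_iters):
        action, arg = policy(question, trajectory)
        if action == "Finish":
            return arg
        observation = tools[action](arg)
        trajectory.append((action, arg, observation))
    return ""  # gave up without an answer


def search(query: str) -> str:
    """Hypothetical one-document retriever."""
    corpus = {"At My Window": "At My Window is an album by Townes Van Zandt."}
    return corpus.get(query, "no result")


def scripted_policy(question, trajectory):
    """Search once, then answer with the name found in the observation."""
    if not trajectory:
        return "Search", "At My Window"
    last_observation = trajectory[-1][2]
    return "Finish", last_observation.split(" by ")[-1].rstrip(".")


print(react_loop("At My Window was released by whom?",
                 scripted_policy, {"Search": search}))
```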

Let's evaluate this **unoptimized** ReAct agent on the `devset`.
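
The evaluation uses an exact-match metric, so it helps to know roughly what "exact" means. The sketch below shows a common SQuAD-style normalization (lowercase, strip punctuation and English articles, collapse whitespace) followed by string equality. It is an approximation for intuition; `dspy.evaluate.answer_exact_match` may differ in its details, and `normalize_text` and `exact_match` are names made up for this sketch.

```python
import re
import string


def normalize_text(s: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize_text(prediction) == normalize_text(gold)


print(exact_match("The Twilight Zone Tower of Terror.", "twilight zone tower of terror"))  # True
print(exact_match("Townes Van Zandt", "John Townes Van Zandt"))  # False
```

Note how strict this is: a correct but shorter name counts as a miss, which partly explains why scores on this task look low.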

In [5]:
# Set up an evaluator on the 300 examples of the devset.
config = dict(num_threads=10, display_progress=True, display_table=5)
evaluate = Evaluate(devset=devset, metric=dspy.evaluate.answer_exact_match, **config)

evaluate(agent)

Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)


Average Metric: 10.0 / 109  (9.2):  36%|███▋      | 109/300 [06:11<10:50,  3.40s/it]

Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)




ValueError: not enough values to unpack (expected 2, got 1)
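
The `not enough values to unpack (expected 2, got 1)` message is what Python raises when a two-name assignment receives a single value. One plausible source, which the log above does not confirm, is code that splits an action string such as `Search[query]` on `[` and assumes the bracket is always present. The sketch below reproduces the message and shows a more tolerant parser; `parse_action` is a hypothetical helper, not part of DSPy.

```python
def parse_action(action: str) -> tuple[str, str]:
    """Split 'Name[argument]' into (name, argument), tolerating a missing bracket."""
    first_line = action.strip().split("\n")[0]
    name, sep, rest = first_line.partition("[")
    if not sep:
        # No '[' at all: treat the whole line as the action name.
        return name.strip(), ""
    # Drop the closing bracket (and anything after it) if there is one.
    return name.strip(), rest.rsplit("]", 1)[0]


# The fragile version fails with the same message as the log:
try:
    name, arg = "Finish".split("[", 1)
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1)

print(parse_action("Search[At My Window]"))
print(parse_action("Finish"))
```

If the failure does come from malformed model output, it is also worth checking whether `max_tokens` or the proxy setup is truncating responses before changing any parsing code.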

### 4) Optimized ReAct.

Let's use DSPy's simple `BootstrapFewShotWithRandomSearch` optimizer to create successful examples of the ReAct program and attempt to optimize the prompts using those constructed examples. In the future, we could try more sophisticated DSPy optimizers too, like `MIPRO`.

We'll bootstrap 20 programs that way. Examples will be bootstrapped starting from the `trainset` and optimized over our tiny `valset`. We'll evaluate later on the `devset`.
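
The idea described above can be sketched in plain Python: run the current program on shuffled training examples, keep the ones it gets right as demonstrations, build a candidate from each demo set, and keep the candidate that scores best on the validation set. Everything below is a toy under stated assumptions: `teacher`, `make_student`, `bootstrap`, and `random_search` are hypothetical names, and the rule that "demos help on questions of the same length" is an artificial stand-in for how few-shot examples help a real LM. It is not DSPy's implementation.

```python
import random


def metric(example, prediction):
    return example["answer"] == prediction


def teacher(question):
    """Stand-in for the unoptimized program: right only on even-length questions."""
    return question.upper() if len(question) % 2 == 0 else ""


def make_student(demos):
    """A 'prompted' program: it handles any question length seen in its demos."""
    seen_lengths = {len(d["question"]) for d in demos}
    return lambda q: q.upper() if len(q) in seen_lengths else ""


def bootstrap(trainset, max_demos, rng):
    """Keep the first `max_demos` shuffled examples that the teacher gets right."""
    pool = trainset[:]
    rng.shuffle(pool)
    demos = []
    for ex in pool:
        if metric(ex, teacher(ex["question"])):
            demos.append(ex)
        if len(demos) == max_demos:
            break
    return demos


def random_search(trainset, valset, num_candidates, max_demos, seed=0):
    """Try several bootstrapped demo sets and keep the best one on valset."""
    best_score, best_demos = -1.0, []
    for i in range(num_candidates):
        demos = bootstrap(trainset, max_demos, random.Random(seed + i))
        student = make_student(demos)
        score = sum(metric(ex, student(ex["question"])) for ex in valset) / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos


def make(words):
    return [{"question": w, "answer": w.upper()} for w in words]


toy_train = make(["ab", "cd", "abcd", "efgh", "abcdef", "abc"])
toy_val = make(["xy", "wxyz", "uvwxyz", "zz"])
score, demos = random_search(toy_train, toy_val, num_candidates=20, max_demos=2)
print(score, [d["question"] for d in demos])
```

Selecting on a small validation set, as here and in the notebook, can overfit to that set, which is why the final numbers are reported on the separate `devset`.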

In [7]:
# BootstrapFewShotWithRandomSearch builds on BootstrapFewShot, adding settings for the random search over candidate programs.
# max_bootstrapped_demos=2: use at most 2 bootstrapped demos; max_labeled_demos=0: use no labeled demos; num_candidate_programs=20: evaluate 20 candidate programs during the random search.
config = dict(max_bootstrapped_demos=2, max_labeled_demos=0, num_candidate_programs=20, num_threads=32)

tp = BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match, **config)
optimized_react = tp.compile(agent, trainset=trainset, valset=valset)
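
The log printed by `compile` includes lines such as `Average of max per entry across top K scores`. One reading (an interpretation, not something this notebook states) is: rank the candidate programs by validation score, take the best K, and for each validation example record the best score any of those K programs achieved; the reported number is the average of those per-example maxima. In other words, it estimates how well a perfect chooser over the top K programs could do, which is the motivation for aggregating several agents. The sketch below computes that quantity on made-up scores; `avg_of_max_per_entry` and the data are hypothetical.

```python
def avg_of_max_per_entry(per_program_subscores, k):
    """Average, over examples, of the best score among the top-k programs.

    `per_program_subscores` holds one list of per-example scores per program.
    Programs are ranked by their own mean score (ties keep input order).
    """
    ranked = sorted(per_program_subscores, key=lambda s: sum(s) / len(s), reverse=True)
    top = ranked[:k]
    # zip(*top) walks the examples column by column across the chosen programs.
    return sum(max(col) for col in zip(*top)) / len(top[0])


# Hypothetical per-example scores for three programs on four validation examples.
subscores = [
    [1, 0, 1, 0],  # mean 0.50
    [0, 1, 1, 0],  # mean 0.50
    [0, 0, 0, 1],  # mean 0.25
]
print(avg_of_max_per_entry(subscores, 1))  # 0.5
print(avg_of_max_per_entry(subscores, 2))  # 0.75
print(avg_of_max_per_entry(subscores, 3))  # 1.0
```

When this number is well above the best single score, the programs are making different mistakes, and there is room for an aggregator to improve on any one of them.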

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 20 candidate sets.
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)



Average Metric: 8.0 / 50  (16.0%)
Score: 16.0 for set: [0, 0, 0, 0, 0]
New best score: 16.0 for seed -3
Scores so far: [16.0]
Best score: 16.0
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)



Average Metric: 8.0 / 50  (16.0%)
Score: 16.0 for set: [0, 0, 0, 0, 0]
Scores so far: [16.0, 16.0]
Best score: 16.0



Bootstrapped 2 full traces after 12 examples in round 0.



Average Metric: 22 / 50  (44.0%)
Score: 44.0 for set: [2, 2, 1, 1, 1]
New best score: 44.0 for seed -1
Scores so far: [16.0, 16.0, 44.0]
Best score: 44.0
Average of max per entry across top 1 scores: 0.44
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.64
Average of max per entry across top 5 scores: 0.64
Average of max per entry across top 8 scores: 0.64
Average of max per entry across top 9999 scores: 0.64



Bootstrapped 2 full traces after 11 examples in round 0.
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)



Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Average Metric: 24.0 / 50  (48.0%)
Score: 48.0 for set: [2, 2, 0, 0, 0]
New best score: 48.0 for seed 0
Scores so far: [16.0, 16.0, 44.0, 48.0]
Best score: 48.0
Average of max per entry across top 1 scores: 0.48
Average of max per entry across top 2 scores: 0.72
Average of max per entry across top 3 scores: 0.76
Average of max per entry across top 5 scores: 0.82
Average of max per entry across top 8 scores: 0.82
Average of max per entry across top 9999 scores: 0.82



Bootstrapped 1 full traces after 5 examples in round 0.



Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)



Average Metric: 22.0 / 50  (44.0%)
Score: 44.0 for set: [1, 1, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0]
Best score: 48.0
Average of max per entry across top 1 scores: 0.48
Average of max per entry across top 2 scores: 0.72
Average of max per entry across top 3 scores: 0.8
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.88
Average of max per entry across top 9999 scores: 0.88



Bootstrapped 1 full traces after 8 examples in round 0.



Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [1, 1, 0, 0, 0]
New best score: 50.0 for seed 2
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.84
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92



Bootstrapped 1 full traces after 7 examples in round 0.



Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)



Average Metric: 20.0 / 50  (40.0%)
Score: 40.0 for set: [1, 1, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.84
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.96
Average of max per entry across top 9999 scores: 0.96



Bootstrapped 1 full traces after 12 examples in round 0.



Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)



Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)



Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)



Error for example in dev set: 		 Request timed out.



Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)



Error for example in dev set: 		 Connection error.



Error for example in dev set: 		 Connection error.



Average Metric: 21.0 / 50  (42.0%)
Score: 42.0 for set: [1, 1, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.84
Average of max per entry across top 5 scores: 0.96
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 0.98



Failed to run or to evaluate example Example({'question': 'What ride at Disney was inspired by an old TV show and also inspired a made for TV movie on the Disney channel?', 'answer': 'The Twilight Zone Tower of Terror'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10).



Bootstrapped 2 full traces after 20 examples in round 0.



Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Average Metric: 25.0 / 50  (50.0%)
Score: 50.0 for set: [2, 2, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.72
Average of max per entry across top 3 scores: 0.82
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 0.98



Bootstrapped 1 full traces after 8 examples in round 0.



Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Average Metric: 25.0 / 50  (50.0%)
Score: 50.0 for set: [1, 1, 0, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.72
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0



Failed to run or to evaluate example Example({'question': 'Are both Ralph Saenz and Roddy Woomble credited as writers?', 'answer': 'no'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to not enough values to unpack (expected 2, got 1).



Failed to run or to evaluate example Example({'question': 'Was The Devil and Max Devlin or Do Dooni Chaar released first?', 'answer': 'The Devil and Max Devlin'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10).





Bootstrapped 2 full traces after 24 examples in round 0.




Average Metric: 26 / 50  (52.0%)
Score: 52.0 for set: [2, 2, 1, 0, 0]
New best score: 52.0 for seed 7
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.8
Average of max per entry across top 3 scores: 0.8
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0





Bootstrapped 1 full traces after 7 examples in round 0.




Average Metric: 17 / 50  (34.0%)
Score: 34.0 for set: [1, 1, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.8
Average of max per entry across top 3 scores: 0.8
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0





Bootstrapped 2 full traces after 12 examples in round 0.




Average Metric: 26 / 50  (52.0%)
Score: 52.0 for set: [2, 2, 2, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.84
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


  4%|▍         | 6/150 [00:22<09:04,  3.78s/it]


Bootstrapped 1 full traces after 7 examples in round 0.


Average Metric: 21.0 / 48  (43.8):  94%|█████████▍| 47/50 [00:36<00:06,  2.13s/it]

Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)


Average Metric: 21.0 / 50  (42.0): 100%|██████████| 50/50 [00:47<00:00,  1.06it/s]


Average Metric: 21.0 / 50  (42.0%)
Score: 42.0 for set: [1, 1, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.84
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


  2%|▏         | 3/150 [00:15<12:24,  5.06s/it]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 12.0 / 28  (42.9):  56%|█████▌    | 28/50 [00:29<00:14,  1.48it/s]

Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)


Average Metric: 14.0 / 32  (43.8):  64%|██████▍   | 32/50 [00:34<00:19,  1.09s/it]

Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)


Average Metric: 14.0 / 37  (37.8):  74%|███████▍  | 37/50 [00:39<00:08,  1.46it/s]

Error for example in dev set: 		 HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10)


Average Metric: 20.0 / 50  (40.0): 100%|██████████| 50/50 [00:59<00:00,  1.19s/it]


Average Metric: 20.0 / 50  (40.0%)
Score: 40.0 for set: [2, 2, 2, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0, 40.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.84
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


 19%|█▉        | 29/150 [02:39<12:50,  6.37s/it]

Failed to run or to evaluate example Example({'question': 'Who supervises the subordinate that occupies the majority of Chatan, Japan?', 'answer': 'United States Pacific Command'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10).


 25%|██▌       | 38/150 [03:15<09:35,  5.14s/it]


Bootstrapped 2 full traces after 39 examples in round 0.


Average Metric: 26 / 50  (52.0): 100%|██████████| 50/50 [00:54<00:00,  1.09s/it]


Average Metric: 26 / 50  (52.0%)
Score: 52.0 for set: [2, 2, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0, 40.0, 52.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.96
Average of max per entry across top 9999 scores: 1.0


  0%|          | 0/150 [00:00<?, ?it/s]

Failed to run or to evaluate example Example({'question': 'Are both Ralph Saenz and Roddy Woomble credited as writers?', 'answer': 'no'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to not enough values to unpack (expected 2, got 1).


  9%|▉         | 14/150 [01:18<12:43,  5.61s/it]


Bootstrapped 2 full traces after 15 examples in round 0.


Average Metric: 23 / 50  (46.0): 100%|██████████| 50/50 [00:31<00:00,  1.56it/s]


Average Metric: 23 / 50  (46.0%)
Score: 46.0 for set: [2, 2, 1, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0, 40.0, 52.0, 46.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


  2%|▏         | 3/150 [00:07<06:07,  2.50s/it]


Bootstrapped 1 full traces after 4 examples in round 0.
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)


Average Metric: 25.0 / 50  (50.0): 100%|██████████| 50/50 [00:00<00:00, 1551.33it/s]


Average Metric: 25.0 / 50  (50.0%)
Score: 50.0 for set: [1, 1, 0, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0, 40.0, 52.0, 46.0, 50.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


  1%|          | 1/150 [00:09<24:08,  9.72s/it]

Failed to run or to evaluate example Example({'question': 'Are both Ralph Saenz and Roddy Woomble credited as writers?', 'answer': 'no'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to not enough values to unpack (expected 2, got 1).


 14%|█▍        | 21/150 [00:24<02:28,  1.15s/it]


Bootstrapped 1 full traces after 22 examples in round 0.


Average Metric: 24 / 50  (48.0): 100%|██████████| 50/50 [00:47<00:00,  1.05it/s]


Average Metric: 24 / 50  (48.0%)
Score: 48.0 for set: [1, 1, 0, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0, 40.0, 52.0, 46.0, 50.0, 48.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


 11%|█▏        | 17/150 [01:03<09:10,  4.14s/it]

Failed to run or to evaluate example Example({'question': 'The Mercurial Vapor is a football boot  endorsed by many players such as a Brazilian professional footballer who plays for what national team? ', 'answer': 'Brazil'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to HTTPConnectionPool(host='192.168.31.215', port=10890): Read timed out. (read timeout=10).


 19%|█▉        | 29/150 [01:43<08:10,  4.05s/it]

Failed to run or to evaluate example Example({'question': 'Are both Ralph Saenz and Roddy Woomble credited as writers?', 'answer': 'no'}) (input_keys={'question'}) with <function answer_exact_match at 0x7fa075303eb0> due to not enough values to unpack (expected 2, got 1).


 36%|███▌      | 54/150 [02:31<04:29,  2.80s/it]


Bootstrapped 2 full traces after 55 examples in round 0.


Average Metric: 21 / 50  (42.0): 100%|██████████| 50/50 [00:43<00:00,  1.16it/s]


Average Metric: 21 / 50  (42.0%)
Score: 42.0 for set: [2, 2, 2, 0, 0]
Scores so far: [16.0, 16.0, 44.0, 48.0, 44.0, 50.0, 40.0, 42.0, 50.0, 50.0, 52.0, 34.0, 52.0, 42.0, 40.0, 52.0, 46.0, 50.0, 48.0, 42.0]
Best score: 52.0
Average of max per entry across top 1 scores: 0.52
Average of max per entry across top 2 scores: 0.74
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.88
Average of max per entry across top 8 scores: 0.98
Average of max per entry across top 9999 scores: 1.0


  3%|▎         | 5/150 [00:12<06:05,  2.52s/it]


Bootstrapped 2 full traces after 6 examples in round 0.


Average Metric: 7.0 / 15  (46.7):  28%|██▊       | 14/50 [00:40<01:46,  2.96s/it]

Error for example in dev set: 		 Request timed out.
Error for example in dev set: 		 Request timed out.


Average Metric: 7.0 / 16  (43.8):  32%|███▏      | 16/50 [00:42<01:09,  2.03s/it]

Error for example in dev set: 		 Request timed out.


Average Metric: 7.0 / 20  (35.0):  40%|████      | 20/50 [00:53<01:17,  2.57s/it]

Error for example in dev set: 		 Connection error.


Average Metric: 7.0 / 21  (33.3):  42%|████▏     | 21/50 [00:55<01:08,  2.36s/it]

Error for example in dev set: 		 Connection error.


Average Metric: 7.0 / 22  (31.8):  44%|████▍     | 22/50 [00:58<01:08,  2.45s/it]

Error for example in dev set: 		 Request timed out.


Average Metric: 8.0 / 25  (32.0):  48%|████▊     | 24/50 [00:59<00:52,  2.02s/it]

Error for example in dev set: 		 Request timed out.
Error for example in dev set: 		 Request timed out.


Average Metric: 8.0 / 26  (30.8):  52%|█████▏    | 26/50 [00:59<00:22,  1.06it/s]

Error for example in dev set: 		 Request timed out.


APITimeoutError: Request timed out.
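
The "Average of max per entry across top k scores" lines in the log above appear to report an oracle-style number: for each validation example, take the best result among the k highest-scoring candidate programs, then average over examples. The sketch below reproduces that reading with made-up data. The function name and the sample scores are hypothetical, and this is an interpretation of the log rather than the optimizer's own code.

```python
def avg_of_max_per_entry(subscores_per_program, scores, k):
    """Fraction of examples solved by at least one of the top-k programs."""
    # Rank programs by overall score, best first, and keep the top k.
    ranked = sorted(zip(scores, subscores_per_program), key=lambda pair: pair[0], reverse=True)
    top_k = [subscores for _, subscores in ranked[:k]]
    # For each example, take the best result across those programs, then average.
    best_per_example = [max(column) for column in zip(*top_k)]
    return sum(best_per_example) / len(best_per_example)

# Hypothetical per-example results (1 = correct) for three programs on four examples.
subscores = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
scores = [50.0, 50.0, 50.0]

print(avg_of_max_per_entry(subscores, scores, k=1))  # 0.5
print(avg_of_max_per_entry(subscores, scores, k=2))  # 0.75
```

When this number climbs well above the best single score, the candidate programs are getting different examples right, which is the motivation for aggregating several of them in the next section.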

In [None]:
evaluate(optimized_react)

### 5) Zero-Shot Aggregator.

Let's now extract the best five bootstrapped ReAct programs. We'll build a simple DSPy aggregator that runs all of them and then produces a final answer.

In [7]:
from dsp.utils import flatten, deduplicate

# the best-performing five ReAct programs from the optimization process
AGENTS = [x[-1] for x in optimized_react.candidate_programs[:5]]

class Aggregator(dspy.Module):
	"""Runs every agent in AGENTS, pools their observations, and answers with a ChainOfThought step."""

	def __init__(self, temperature=0.0):
		super().__init__()
		# ChainOfThought predictor that turns the pooled context into a final answer
		self.aggregate = dspy.ChainOfThought('context, question -> answer')
		# Sampling temperature used when running the individual agents
		self.temperature = temperature

	def forward(self, question):
		# Run all five agents at the configured temperature, then extract and deduplicate their observed contexts
		with dspy.context(lm=turbo.copy(temperature=self.temperature)):
			preds = [agent(question=question) for agent in AGENTS]
			context = deduplicate(flatten([flatten(p.observations) for p in preds]))

		# Run the aggregation step to produce a final answer
		return self.aggregate(context=context, question=question)
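
To make the context-building line in `forward` concrete, here is a minimal sketch with plain-Python stand-ins for `flatten` and `deduplicate`. It assumes each agent's `observations` is a list of steps and each step is a list of passages. The sample passages are hypothetical, and the helpers illustrate the behavior relied on here rather than reproducing the library's code.

```python
def flatten(list_of_lists):
    """Collapse one level of nesting."""
    return [item for sublist in list_of_lists for item in sublist]

def deduplicate(items):
    """Drop repeats while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

# Hypothetical observations for two agents: one inner list of passages per step.
agent_observations = [
    [["passage A", "passage B"], ["passage C"]],
    [["passage B"], ["passage D", "passage A"]],
]

# Same shape as the expression in Aggregator.forward.
context = deduplicate(flatten([flatten(obs) for obs in agent_observations]))
print(context)  # ['passage A', 'passage B', 'passage C', 'passage D']
```

Keeping first-seen order means passages found by earlier agents appear first in the context handed to the aggregation step.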

Let's quickly evaluate the aggregator prior to optimization.

In [8]:
aggregator = Aggregator()
evaluate(aggregator)

Average Metric: 78 / 300  (26.0): 100%|██████████| 300/300 [00:06<00:00, 45.38it/s]


index,question,example_answer,gold_titles,rationale,pred_answer,answer_exact_match
0,Are both Cangzhou and Qionghai in the Hebei province of China?,no,"{'Cangzhou', 'Qionghai'}",determine if both Cangzhou and Qionghai are in the Hebei province of China. We need to carefully analyze the information provided in the context to...,"No, only Cangzhou is in the Hebei province of China. Qionghai is located in Hainan province.",False
1,Who conducts the draft in which Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017-18 season?,National Hockey League,"{'2017 NHL Expansion Draft', '2017–18 Pittsburgh Penguins season'}","produce the answer. We know that Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017-18 season. Looking at the context provided, we...","The 2017 NHL Expansion Draft conducted by the National Hockey League filled the roster of the Vegas Golden Knights, including selecting Marc-Andre Fleury for the...",False
2,"The Wings entered a new era, following the retirement of which Canadian retired professional ice hockey player and current general manager of the Tampa Bay...",Steve Yzerman,"{'2006–07 Detroit Red Wings season', 'Steve Yzerman'}",identify the retired Canadian professional ice hockey player and current general manager of the Tampa Bay Lightning of the National Hockey League (NHL) whose retirement...,Steve Yzerman,✔️ [True]
3,What river is near the Crichton Collegiate Church?,the River Tyne,"{'Crichton Collegiate Church', 'Crichton Castle'}","identify the river near the Crichton Collegiate Church. We know that the church is situated in Midlothian, Scotland, and the River Esk flows through Midlothian...",The River Esk,False
4,In the 10th Century A.D. Ealhswith had a son called Æthelweard by which English king?,King Alfred the Great,"{'Ealhswith', 'Æthelweard (son of Alfred)'}","produce the answer. We know from the context that Ealhswith had a son named Æthelweard in the 10th century A.D. Now, looking at the information...",King Alfred the Great,✔️ [True]


26.0
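
The rows above hint at why the zero-shot aggregator scores so low: several predictions contain the right information but are phrased as full sentences, and an exact-match metric gives those no credit. The sketch below is a simplified stand-in for that kind of metric. The `normalize` and `exact_match` helpers are hypothetical and are not the implementation of `dspy.evaluate.answer_exact_match`.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)

verbose = ("No, only Cangzhou is in the Hebei province of China. "
           "Qionghai is located in Hainan province.")
print(exact_match(verbose, "no"))                        # False
print(exact_match("no", "no"))                           # True
print(exact_match("The River Esk", "the River Tyne"))    # False
```

Under a metric like this, few-shot demonstrations that show short answers can matter as much as retrieving the right passages.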

### 6) Optimized Aggregator.

In [9]:
kwargs = dict(max_bootstrapped_demos=2, max_labeled_demos=6, num_candidate_programs=10, num_threads=8)
tp = BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match, **kwargs)
optimized_aggregator = tp.compile(aggregator, trainset=trainset, valset=valset)

Average Metric: 16 / 50  (32.0): 100%|██████████| 50/50 [00:00<00:00, 153.98it/s]
Average Metric: 27 / 50  (54.0): 100%|██████████| 50/50 [00:00<00:00, 82.75it/s]
  3%|▎         | 4/150 [00:00<00:03, 45.32it/s]
Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:00<00:00, 156.28it/s]
  1%|▏         | 2/150 [00:00<00:03, 39.99it/s]
Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:00<00:00, 162.26it/s]
  1%|          | 1/150 [00:00<00:02, 51.23it/s]
Average Metric: 26 / 50  (52.0): 100%|██████████| 50/50 [00:00<00:00, 158.64it/s]
  1%|          | 1/150 [00:00<00:00, 155.47it/s]
Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:00<00:00, 159.96it/s]
  1%|          | 1/150 [00:00<00:04, 31.56it/s]
Average Metric: 27 / 50  (54.0): 100%|██████████| 50/50 [00:00<00:00, 143.11it/s]
  1%|          | 1/150 [00:00<00:03, 43.19it/s]
Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:00<00:00, 163.95it/s]
  1%|▏         | 2/150 [00:00<00:04, 31.94it/s]
Average 

In [10]:
optimized_aggregator2 = optimized_aggregator.deepcopy()
optimized_aggregator2.temperature = 0.7

evaluate(optimized_aggregator2)

Average Metric: 180 / 300  (60.0): 100%|██████████| 300/300 [00:07<00:00, 42.10it/s]


index,question,example_answer,gold_titles,rationale,pred_answer,answer_exact_match
0,Are both Cangzhou and Qionghai in the Hebei province of China?,no,"{'Cangzhou', 'Qionghai'}","produce the answer. From the context, we know that Cangzhou is a prefecture-level city in eastern Hebei province, while Qionghai is one of the seven...",no,✔️ [True]
1,Who conducts the draft in which Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017-18 season?,National Hockey League,"{'2017 NHL Expansion Draft', '2017–18 Pittsburgh Penguins season'}","produce the answer. From the context, we know that Marc-Andre Fleury was drafted to the Vegas Golden Knights for the 2017-18 season. The draft that...",National Hockey League,✔️ [True]
2,"The Wings entered a new era, following the retirement of which Canadian retired professional ice hockey player and current general manager of the Tampa Bay...",Steve Yzerman,"{'2006–07 Detroit Red Wings season', 'Steve Yzerman'}",produce the answer. We know from the context that Steve Yzerman is a Canadian retired professional ice hockey player and the current general manager of...,Steve Yzerman,✔️ [True]
3,What river is near the Crichton Collegiate Church?,the River Tyne,"{'Crichton Collegiate Church', 'Crichton Castle'}","produce the answer. We know that Crichton Collegiate Church is located in Midlothian, Scotland, near the hamlet of Crichton. Since it is close to Edinburgh,...",River Esk,False
4,In the 10th Century A.D. Ealhswith had a son called Æthelweard by which English king?,King Alfred the Great,"{'Ealhswith', 'Æthelweard (son of Alfred)'}","produce the answer. From the context, we know that Ealhswith was the wife of King Alfred the Great. Therefore, in the 10th Century A.D., Ealhswith...",King Alfred the Great,✔️ [True]


60.0