# Using GPT-4 to bootstrap few-shot CoT demonstrations for GPT-3.5

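The [ScoNe (Scoped Negation) benchmark](https://aclanthology.org/2023.acl-short.154/) of She et al. (2023) stress-tests models on their ability to reason about negation. In the original paper, the `text-davinci-002` and `text-davinci-003` models were basically at chance on the hardest ScoNe categories. This notebook starts with a very simple Chain-of-Thought (CoT) module for ScoNe. Using this simple program, `gpt-3.5-turbo` is at chance on the "one scoping negation" category (one of the two hardest in ScoNe). We found that bootstrapping demonstrations can help, but that `turbo` struggles to create good demonstrations that include CoT steps. When we instead use `gpt4-turbo` only to create these demonstrations (which involves fewer than 50 calls to that model), `turbo` often reaches 85-90% accuracy. **This is a single compilation step using `dspy.BootstrapFewShotWithRandomSearch`.**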

## Setup

In [1]:
import glob  # find files whose paths match a pattern
import os  # path handling and environment variables
import pandas as pd  # data loading and analysis
import random  # seeded shuffling

import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

In [2]:
import os

# Point DSP_NOTEBOOK_CACHEDIR at a "cache" folder in the current directory:
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join('.', 'cache')

In [3]:
# We'll rely on turbo for everything except bootstrapping CoT demos:
turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')

# Make turbo the default language model in the DSPy settings:
dspy.settings.configure(lm=turbo)

In [4]:
# GPT-4 is used only to bootstrap CoT demos:
gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=350, model_type='chat')

In [5]:
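# Setting this to True reruns the bootstrapping process. When it is False,
# the existing demonstrations are used, but turbo is still used to evaluate
# the zero-shot and full programs.
RUN_FROM_SCRATCH = False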

## ScoNe

In [6]:
# Clone the ScoNe repository:
!git clone https://github.com/selenashe/ScoNe.git

Cloning into 'ScoNe'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 77 (delta 42), reused 42 (delta 20), pack-reused 0
Receiving objects: 100% (77/77), 116.25 KiB | 1.21 MiB/s, done.
Resolving deltas: 100% (42/42), done.


### Data loader

In [7]:
import pandas as pd
import os
import glob

def load_scone(dirname):
    dfs = []
    for filename in glob.glob(dirname + "/*.csv"):
        df = pd.read_csv(filename, index_col=0)
        df['category'] = os.path.basename(filename).replace(".csv", "")
        dfs.append(df)
    data_df = pd.concat(dfs)

    def as_example(row):
        # The 'one_scoped' file is from an earlier dataset, MoNLI, and
        # so is formatted a bit differently:
        suffix = '' if row['category'] == 'one_scoped' else '_edited'
        # Reformat the hypothesis to be an embedded clause in a question:
        hkey = 'sentence2' + suffix
        question = row[hkey][0].lower() + row[hkey][1:].strip(".")
        question = f"Can we logically conclude for sure that {question}?"
        # Binary task formulation:
        label = "Yes" if row['gold_label' + suffix] == 'entailment' else "No"
        return dspy.Example({
            "context": row['sentence1' + suffix],
            "question": question,
            "answer": label,
            "category": row['category']
        }).with_inputs("context", "question")

    return list(data_df.apply(as_example, axis=1).values)
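
The two string transformations inside the loader can be tried in isolation. The sketch below is a minimal standalone version; the helper names `hypothesis_to_question` and `label_to_answer` and the sample sentence are made up for illustration and are not part of the notebook or of ScoNe.

```python
def hypothesis_to_question(hypothesis: str) -> str:
    """Embed an NLI hypothesis as a clause inside a yes/no question."""
    # Lowercase the first character and strip periods from the ends of the rest.
    clause = hypothesis[0].lower() + hypothesis[1:].strip(".")
    return f"Can we logically conclude for sure that {clause}?"


def label_to_answer(gold_label: str) -> str:
    """Map 'entailment' to Yes and every other label to No."""
    return "Yes" if gold_label == "entailment" else "No"


# Hypothetical hypothesis sentence, used only to show the rewrite.
example_hypothesis = "The boy is not walking in the city."
print(hypothesis_to_question(example_hypothesis))
print(label_to_answer("entailment"), label_to_answer("neutral"))
```

Turning the hypothesis into a question gives the model a natural Yes/No target, which is what the exact-match metric later compares against.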

### Train and dev samples

In [8]:
import random

# Load the training data:
all_train = load_scone("ScoNe/scone_nli/train")

# Fix the random seed, then shuffle the examples:
random.seed(1)
random.shuffle(all_train)

# The first 200 shuffled examples are train, the next 50 are dev:
train, dev = all_train[:200], all_train[200:250]

len(train), len(dev)

(200, 50)
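
The seeded shuffle is what makes this split reproducible across runs. A minimal sketch of the same idea follows, assuming a stand-in list of integers instead of ScoNe examples; `split_train_dev` is a hypothetical helper, and it uses a local `random.Random` instance rather than the global `random.seed` call in the cell above.

```python
import random


def split_train_dev(items, seed=1, n_train=200, n_dev=50):
    """Shuffle a copy of `items` with a fixed seed, then slice train and dev."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    return shuffled[:n_train], shuffled[n_train:n_train + n_dev]


pool = list(range(1000))  # stand-in for the loaded examples
train_a, dev_a = split_train_dev(pool)
train_b, dev_b = split_train_dev(pool)

print(len(train_a), len(dev_a))
print(train_a == train_b and dev_a == dev_b)  # same seed, same split
```

Because the two slices come from one shuffled list, train and dev never share an example.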

### Test

In [9]:
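import random

random.seed(1)

# Load the test set:
test = load_scone(dirname="ScoNe/scone_nli/test")

# We're developing a system for the full ScoNe benchmark, but for now we
# evaluate only on one of the hardest and most informative ScoNe
# categories: examples with a single negation that plays a crucial role
# in the reasoning.
test = [ex for ex in test if ex.category == "one_scoped"]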

In [10]:
import pandas as pd

# Count how often each gold answer occurs in the test set:
pd.Series([ex.answer for ex in test]).value_counts()

No     100
Yes    100
dtype: int64
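
With 100 `Yes` and 100 `No` labels, a predictor that ignores its input scores exactly 50%, which is why 50% counts as chance on this category. The sketch below checks that arithmetic with plain Python; the `accuracy` helper is illustrative and is not the notebook's metric.

```python
from collections import Counter

# Label counts as reported above for the "one_scoped" test category.
gold = ["No"] * 100 + ["Yes"] * 100
print(Counter(gold))


def accuracy(predictions, gold_labels):
    """Percentage of predictions that exactly match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)


# Two constant predictors that never look at the input.
always_yes = ["Yes"] * len(gold)
always_no = ["No"] * len(gold)
print(accuracy(always_yes, gold), accuracy(always_no, gold))
```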

## Evaluation tools

In [11]:
import dspy

# Use exact-match accuracy as the ScoNe metric:
scone_accuracy = dspy.evaluate.metrics.answer_exact_match
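
A DSPy metric is a callable that receives a gold example, a prediction, and an optional trace. To show the shape of such a function, here is a simplified stand-in. `simple_exact_match` is hypothetical and is not DSPy's `answer_exact_match`, whose text normalization may differ; `SimpleNamespace` objects stand in for real examples and predictions.

```python
from types import SimpleNamespace


def simple_exact_match(example, pred, trace=None):
    """Return True when the predicted answer equals the gold answer,
    ignoring case and surrounding whitespace or periods."""
    def normalize(text):
        return text.strip().strip(".").strip().lower()

    return normalize(example.answer) == normalize(pred.answer)


gold_example = SimpleNamespace(answer="No")
print(simple_exact_match(gold_example, SimpleNamespace(answer=" no.")))
print(simple_exact_match(gold_example, SimpleNamespace(answer="Yes")))
```

Any function with this signature could be passed as `metric` to the evaluator or the optimizer in place of the built-in one.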

In [12]:
# Create an evaluator that scores a program on the test set:
evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)

## Zero-shot CoT

In [13]:
# Signature describing the task: two inputs (context, question) and a Yes/No answer.
class ScoNeSignature(dspy.Signature):
    ("""You are given some context (a premise) and a question (a hypothesis). """
    """You must indicate with Yes/No answer whether we can logically """
    """conclude the hypothesis from the premise.""")

    # The premise:
    context = dspy.InputField()
    # The hypothesis, phrased as a question:
    question = dspy.InputField()
    # The predicted label:
    answer = dspy.OutputField(desc="Yes or No")

In [14]:
# A minimal module that answers ScoNe questions with chain-of-thought reasoning.
class ScoNeCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought adds a reasoning step before the answer field:
        self.generate_answer = dspy.ChainOfThought(ScoNeSignature)

    def forward(self, context, question):
        # Run the predictor on the inputs and return its prediction:
        return self.generate_answer(context=context, question=question)

In [15]:
# Instantiate the uncompiled (zero-shot) program:
cot_zeroshot = ScoNeCoT()

In [16]:
# Evaluate the zero-shot program with the scone_accuracy metric:
evaluator(cot_zeroshot, metric=scone_accuracy)

Average Metric: 100 / 200  (50.0): 100%|█████████████████████████| 200/200 [00:00<00:00, 733.75it/s]

Average Metric: 100 / 200  (50.0%)





50.0

## Optimized few-shot with bootstrapped demonstrations

In [17]:
# Create the BootstrapFewShotWithRandomSearch optimizer:
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,  # at most 8 bootstrapped demos
    max_labeled_demos=8,  # at most 8 labeled demos
    num_candidate_programs=10,  # number of candidate programs
    num_threads=8,  # number of threads
    metric=scone_accuracy,  # evaluation metric
    teacher_settings=dict(lm=gpt4T),  # the teacher uses the gpt4T language model
)

Going to sample between 1 and 8 traces per predictor.
Will attempt to train 10 candidate sets.


In [18]:
if RUN_FROM_SCRATCH:
    # Compile cot_zeroshot with the optimizer, using train for
    # bootstrapping and dev for validation:
    cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)
else:
    # Otherwise, create a fresh ScoNeCoT program and load the saved
    # demonstrations from "scone-cot_fewshot-turbo-gpt4-demos.json":
    cot_fewshot = ScoNeCoT()
    cot_fewshot.load("scone-cot_fewshot-turbo-gpt4-demos.json")

Average Metric: 24 / 50  (48.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1096.32it/s]


Average Metric: 24 / 50  (48.0%)
Score: 48.0 for set: [0]
New best score: 48.0 for seed -3
Scores so far: [48.0]
Best score: 48.0


Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1034.71it/s]


Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [8]
New best score: 50.0 for seed -2
Scores so far: [48.0, 50.0]
Best score: 50.0


  6%|███▎                                                         | 11/200 [00:00<00:00, 899.26it/s]


Bootstrapped 8 full traces after 12 examples in round 0.


Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1225.04it/s]


Average Metric: 27 / 50  (54.0%)
Score: 54.0 for set: [8]
New best score: 54.0 for seed -1
Scores so far: [48.0, 50.0, 54.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.76
Average of max per entry across top 5 scores: 0.76
Average of max per entry across top 8 scores: 0.76
Average of max per entry across top 9999 scores: 0.76


  4%|██▊                                                           | 9/200 [00:00<00:00, 815.06it/s]


Bootstrapped 7 full traces after 10 examples in round 0.


Average Metric: 37 / 50  (74.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 884.47it/s]


Average Metric: 37 / 50  (74.0%)
Score: 74.0 for set: [8]
New best score: 74.0 for seed 0
Scores so far: [48.0, 50.0, 54.0, 74.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.78
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92


  2%|█▏                                                            | 4/200 [00:00<00:00, 309.09it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1111.93it/s]


Average Metric: 28 / 50  (56.0%)
Score: 56.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.8
Average of max per entry across top 3 scores: 0.82
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92


  0%|▎                                                             | 1/200 [00:00<00:00, 712.23it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 31 / 50  (62.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1043.32it/s]


Average Metric: 31 / 50  (62.0%)
Score: 62.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.86
Average of max per entry across top 3 scores: 0.9
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.94
Average of max per entry across top 9999 scores: 0.94


  2%|█▏                                                            | 4/200 [00:00<00:00, 837.65it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 23 / 50  (46.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1104.00it/s]


Average Metric: 23 / 50  (46.0%)
Score: 46.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.86
Average of max per entry across top 3 scores: 0.9
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.96
Average of max per entry across top 9999 scores: 0.96


  2%|█▏                                                            | 4/200 [00:00<00:00, 802.55it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 34 / 50  (68.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1116.66it/s]


Average Metric: 34 / 50  (68.0%)
Score: 68.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  2%|█▌                                                            | 5/200 [00:00<00:00, 855.28it/s]


Bootstrapped 5 full traces after 6 examples in round 0.


Average Metric: 30 / 50  (60.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1148.03it/s]


Average Metric: 30 / 50  (60.0%)
Score: 60.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  1%|▌                                                             | 2/200 [00:00<00:00, 723.34it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1109.09it/s]


Average Metric: 27 / 50  (54.0%)
Score: 54.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  3%|█▊                                                            | 6/200 [00:00<00:00, 828.15it/s]


Bootstrapped 6 full traces after 7 examples in round 0.


Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1036.51it/s]


Average Metric: 28 / 50  (56.0%)
Score: 56.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  2%|█▌                                                            | 5/200 [00:00<00:00, 790.78it/s]


Bootstrapped 4 full traces after 6 examples in round 0.


Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1128.36it/s]


Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  4%|██▍                                                           | 8/200 [00:00<00:00, 845.75it/s]


Bootstrapped 8 full traces after 9 examples in round 0.


Average Metric: 31 / 50  (62.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 921.83it/s]

Average Metric: 31 / 50  (62.0%)
Score: 62.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0, 62.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
13 candidate programs found.
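
The log ends with 13 candidates even though `num_candidate_programs=10`. The seeds in the log start at -3, so three candidates are scored before seed 0, giving 10 + 3 = 13. The sketch below replays the final selection from the logged dev scores; the seed numbering is read off the log, and the selection rule shown (keep the highest dev score) is an interpretation of the "Best score" lines rather than DSPy's actual code.

```python
# Validation scores copied from the final "Scores so far" line of the log.
scores = [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0,
          68.0, 60.0, 54.0, 56.0, 50.0, 62.0]
first_seed = -3  # the log reports the first candidate under seed -3

# Pair each score with its seed and keep the highest-scoring candidate.
by_seed = {first_seed + i: score for i, score in enumerate(scores)}
best_seed = max(by_seed, key=by_seed.get)

print(len(scores))
print(best_seed, by_seed[best_seed])
```

The dev scores range from 46.0 to 74.0 across candidates, so the choice of demonstrations matters a great deal here, and searching over several sampled sets is what finds the strong one.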





In [19]:
# Evaluate the few-shot program with the scone_accuracy metric:
evaluator(cot_fewshot, metric=scone_accuracy)

Average Metric: 171 / 200  (85.5): 100%|█████████████████████████| 200/200 [00:00<00:00, 557.50it/s]

Average Metric: 171 / 200  (85.5%)





85.5

In [20]:
# Save the compiled program as JSON:
cot_fewshot.save("scone-cot_fewshot-turbo-gpt4-demos.json")

## Example prompt with prediction

In [21]:
# Show the most recent prompt and completion sent to turbo:
turbo.inspect_history(n=1)





You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Yes or No

---

Context: It is not true that there is not a single person walking in the city.

Question: Can we logically conclude for sure that it is not true that there is not a single celebrity walking in the city?

Reasoning: Let's think step by step in order to produce the answer. We know that the double negative in the context implies that there is at least one person walking in the city. However, the context does not provide any information about the status or occupation of the person walking in the city. Therefore, we cannot logically conclude that the person walking in the city is a celebrity.

Answer: No

---

Context: the boy, not girl