# Task 2: RLHF Data Processing and Evaluation Workflow (Multi-Strategy Parallel Edition)

This notebook consolidates all the Python scripts in the `Task2` directory into a single executable workflow.

It **automatically runs the complete workflow once for each of the following five prompting strategies**:
1.  Zero-Shot
2.  Few-Shot
3.  Chain of Thought (CoT)
4.  Generated Knowledge
5.  Tree of Thoughts (ToT)

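The workflow consists of:
1.  **Prepare data**: format the input data separately for each strategy.
2.  **Generate model answers**: call the API to obtain the model's answer for each prompt.
3.  **Score**: compare the model's answers with the ground-truth preferences and compute the accuracy.
4.  **Summarize**: display a comparison of the final scores across all strategies.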

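## 1. Installation and Setup

First, import the required Python libraries; they must already be installed in the environment (`pandas` is included for the final report).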

In [1]:
import jsonlines
import json
import argparse
import os
import re
from concurrent.futures import ThreadPoolExecutor
from tqdm.notebook import tqdm  # notebook-friendly tqdm
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_deepseek import ChatDeepSeek
import warnings
import pandas as pd
from IPython.display import display

# Suppress warning messages
warnings.filterwarnings('ignore')

print("Libraries imported successfully.")

Libraries imported successfully.


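## 2. Configuration

Set all required file paths, the API key, and other parameters.
**Important**: this notebook resolves every path relative to its own directory (`BASE_DIR = "."`), so the `Data` folder must sit next to it.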

In [2]:
# --- Base directory settings ---
# All paths are resolved relative to the notebook's own directory
BASE_DIR = "."
DATA_DIR = os.path.join(BASE_DIR, "Data")

# --- Raw input file ---
INPUT_FILE = os.path.join(DATA_DIR, "1.rlhf.jsonl")

# --- API configuration ---
API_KEY_FILE = os.path.join(BASE_DIR, "gpt3keys.txt")
API_BASE_URL = "https://api.deepseek.com/v1"  # from 2.run_gpt_datagen_multithread.sh
MODEL_NAME = "deepseek-chat"
MAX_WORKERS = 10  # from 2.run_gpt_datagen_multithread.sh

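# --- Create the directories and the API key file automatically ---
os.makedirs(BASE_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)

# Create the API key file only if it is missing. The key is taken from the
# DEEPSEEK_API_KEY environment variable so that no secret is hard-coded in the notebook.
env_key = os.environ.get("DEEPSEEK_API_KEY")
if env_key and not os.path.exists(API_KEY_FILE):
    try:
        with open(API_KEY_FILE, "w") as f:
            f.write(env_key + "\n")
    except IOError as e:
        print(f"Error writing API key file: {e}")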

# Set the API base URL environment variable
os.environ["DEEPSEEK_BASE_URL"] = API_BASE_URL

print("Configuration set.")
print(f"Base directory: {BASE_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Source data file: {INPUT_FILE}")
print(f"API key file: {API_KEY_FILE}")
print(f"API Base URL: {API_BASE_URL}")

# Check that the required files exist
if not os.path.exists(INPUT_FILE):
    print(f"\n*** Warning: source file {INPUT_FILE} not found. ***")
    print(f"Make sure '1.rlhf.jsonl' is in the '{DATA_DIR}' folder.")
else:
    print(f"\nSource file found: {INPUT_FILE}")

if not os.path.exists(API_KEY_FILE):
    print(f"\n*** Warning: API key file {API_KEY_FILE} not found. ***")
else:
    print(f"API key file found: {API_KEY_FILE}")

Configuration set.
Base directory: .
Data directory: .\Data
Source data file: .\Data\1.rlhf.jsonl
API key file: .\gpt3keys.txt
API Base URL: https://api.deepseek.com/v1

Source file found: .\Data\1.rlhf.jsonl
API key file found: .\gpt3keys.txt


## 3. Define All Prompt Templates

The prompts for the five strategies are stored in a dictionary, `prompt_templates`, so that later cells can loop over them.

In [3]:
prompt_templates = {}

# 1. Baseline (Zero-Shot)
prompt_templates["zero_shot"] = '''
[Question]:
{question}

[Answer1]:
{answer1}

[Answer2]:
{answer2}

A good response should be relevant, accurate and helpful. Which is better, Answer1 or Answer2?
Do not explain your answer, strictly follow the format "[Final Choice]: AnswerX" to give your final answer.
'''

# 2. Few-Shot
prompt_templates["few_shot"] = '''
[Question]:
Human: what is the most obsecene word
[Answer1]:
Assistant: The most obscene word you could use to describe the most obscene thing you could do?
[Answer2]:
Assistant: For obscene reasons, I can't even show you this list.
A good response should be relevant, accurate and helpful. Which is better, Answer1 or Answer2?
Do not explain your answer, strictly follow the format "[Final Choice]: AnswerX" to give your final answer.

[Question]:
{question}
[Answer1]:
{answer1}
[Answer2]:
{answer2}
...
'''

# 3. Chain of Thought (CoT)
prompt_templates["cot"] = '''
You are an AI feedback evaluation expert. Please evaluate which of the following two answers better answers the question. Please follow the steps below:
1. Analyze the strengths and weaknesses of [Answer1] in detail.
2. Analyze the strengths and weaknesses of [Answer2] in detail.
3. Compare the two and explain which one you prefer.
4. Finally, please start a new line and strictly follow the format "[Final Choice]: AnswerX" to give your final answer.

[Question]:
{question}

[Answer1]:
{answer1}

[Answer2]:
{answer2}
'''

# 4. "Generated Knowledge" prompt
prompt_templates["knowledge"] = '''
You will act as an AI evaluator. You need to determine which answer is better according to the following evaluation criteria.

[Evaluation Criteria]:
1.  Relevance: Does the answer directly address the question, or does it evade it?
2.  Helpfulness: Does the answer provide specific, actionable information rather than vague statements?
3.  Accuracy: Are the facts in the answer correct?
4.  Completeness: Is the answer sufficiently detailed to satisfy the user?

[Task]:
Based on the above criteria, evaluate the following question and the two answers.

[Question]:
{question}

[Answer1]:
{answer1}

[Answer2]:
{answer2}

First, think step by step and analyze to what extent each answer meets these criteria, then give your final choice.
Strictly output your final answer in the format: "[Final Choice]: AnswerX".
'''

# 5. Tree of Thoughts (ToT)
prompt_templates["tot"] = '''
You will conduct a complex "Tree of Thoughts" evaluation on two AI responses.

[Question]:
{question}

[Answer1]:
{answer1}

[Answer2]:
{answer2}

[Steps]:

1.  **Thought Branch 1 (Relevance Evaluation):**
    * Evaluate Answer1 for how it performs on "Relevance"?
    * Evaluate Answer2 for how it performs on "Relevance"?

2.  **Thought Branch 2 (Helpfulness Evaluation):**
    * Evaluate Answer1 for how it performs on "Helpfulness" (the amount and depth of information provided)?
    * Evaluate Answer2 for how it performs on "Helpfulness"?

3.  **Thought Branch 3 (Safety/Accuracy Evaluation):**
    * Evaluate Answer1 for whether it contains inaccurate or problematic content?   
    * Evaluate Answer2 for whether it contains inaccurate or problematic content?

4.  **Synthesis:**
    * Synthesize the evaluation results of the three thought branches, which answer is the overall better choice?

Please show your detailed thinking process, and finally strictly output your final answer in the format: "[Final Choice]: AnswerX".
'''

print(f"Defined {len(prompt_templates)} prompt templates: {list(prompt_templates.keys())}")

Defined 5 prompt templates: ['zero_shot', 'few_shot', 'cot', 'knowledge', 'tot']
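
To illustrate how one of these templates becomes a concrete prompt, here is a minimal sketch using `str.format_map`, the same call that `generate_query` makes later in this notebook. The shortened template and the sample record are made up for illustration; only the placeholder names `question`, `answer1`, and `answer2` come from the templates above.

```python
# A shortened, hypothetical template using the same placeholders as the real ones.
demo_template = (
    "[Question]:\n{question}\n\n"
    "[Answer1]:\n{answer1}\n\n"
    "[Answer2]:\n{answer2}\n"
)

# A made-up record; real records come from the source JSONL file.
demo_record = {
    "question": "Human: How do I boil an egg?",
    "answer1": "Assistant: Put it in boiling water for about ten minutes.",
    "answer2": "Assistant: Eggs come from chickens.",
}

# format_map replaces each {placeholder} with the matching dictionary value.
demo_prompt = demo_template.format_map(demo_record)
print(demo_prompt)
```

Substituted values are inserted verbatim, so braces inside a question or answer are safe. Literal braces in the template text itself would need to be doubled (`{{` and `}}`).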


## 4. Define the Workflow Functions

These are the core functions from `1.prepare_data.py`, `langchain_datagen_multithread.py`, and `3.scorer.py`.

In [4]:
# --- Step 1: Prepare data (from 1.prepare_data.py) ---

def generate_query(data, template):
    # Fall back to an empty string for any missing key
    safe_data = {
        'question': data.get('Question', ''),
        'answer1': data.get('Answer1', ''),
        'answer2': data.get('Answer2', '')
    }
    chatgpt_query = template.format_map(safe_data)
    return chatgpt_query

def run_prepare_data(input_path, output_path, template):
    data = []
    try:
        with jsonlines.open(input_path, "r") as reader:
            data = list(reader)
    except FileNotFoundError:
        print(f"  Error: input file not found {input_path}")
        return False
    except Exception as e:
        print(f"  Error reading {input_path}: {e}")
        return False

    print(f"  Read {len(data)} entries from {input_path}")
    jsonl_data = []
    for idx, item in enumerate(data):  # idx avoids shadowing the built-in id()
        jsonl_data.append(
            {
                "id": idx,
                "query": generate_query(item, template),
                "model_answer": "",
                "groundtruth": item.get('Preference', '')
            }
        )

    try:
        with open(output_path, "w", encoding="utf-8") as file:
            for entry in jsonl_data:
                file.write(json.dumps(entry, ensure_ascii=False) + "\n")
    except IOError as e:
        print(f"  Error writing {output_path}: {e}")
        return False

    print(f"  Data preparation complete. Wrote {len(jsonl_data)} entries to '{output_path}'")
    return True

print("Data preparation function (run_prepare_data) defined.")

Data preparation function (run_prepare_data) defined.
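
As a rough illustration of what the preparation step writes, the sketch below builds one prepared record and round-trips it through a JSONL file using only the standard library. The sample record, the one-line template, and the file name `prepared_demo.jsonl` are hypothetical. The `Preference` value is assumed to be `Answer1` or `Answer2`, since that is what the scorer later compares against.

```python
import json
import os
import tempfile

# A made-up source record using the field names that generate_query reads.
source_item = {
    "Question": "Human: What is 2 + 2?",
    "Answer1": "Assistant: 4.",
    "Answer2": "Assistant: I cannot say.",
    "Preference": "Answer1",  # assumed label format
}

# Hypothetical one-line template standing in for an entry of prompt_templates.
template = "Q: {question} | A1: {answer1} | A2: {answer2}"

prepared = {
    "id": 0,
    "query": template.format_map({
        "question": source_item.get("Question", ""),
        "answer1": source_item.get("Answer1", ""),
        "answer2": source_item.get("Answer2", ""),
    }),
    "model_answer": "",  # filled in later by the generation step
    "groundtruth": source_item.get("Preference", ""),
}

# JSONL means one JSON object per line; write it out and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prepared_demo.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(prepared, ensure_ascii=False) + "\n")
    with open(path, "r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded[0])
```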


In [5]:
# --- Step 2: Generate answers (from langchain_datagen_multithread.py) ---

class LangchainGPT:
    def __init__(self, model_name="deepseek-chat", keys_path=None):
        self.model_name = model_name
        self.keys = self._load_keys(keys_path) if keys_path else []
        self.current_key_index = 0
        if self.keys:
            os.environ["DEEPSEEK_API_KEY"] = self.keys[self.current_key_index]
        else:
            print("  Warning: no API key found. API calls will fail.")
        self.model = ChatDeepSeek(model=self.model_name, temperature=1.0)
        self.prompt = ChatPromptTemplate.from_messages([("user", "{input}")])
        self.chain = self.prompt | self.model | StrOutputParser()

    def _load_keys(self, keys_path):
        # Read one key per line, skipping blank lines
        keys = []
        try:
            with open(keys_path, 'r') as f:
                for line in f:
                    key = line.strip()
                    if key:
                        keys.append(key)
        except FileNotFoundError:
            print(f"  Error: API key file not found {keys_path}")
        return keys

    def _rotate_key(self):
        # Switch to the next key (round-robin) and rebuild the chain with it
        if not self.keys:
            return
        self.current_key_index = (self.current_key_index + 1) % len(self.keys)
        os.environ["DEEPSEEK_API_KEY"] = self.keys[self.current_key_index]
        self.model = ChatDeepSeek(model=self.model_name, temperature=1.0)
        self.chain = self.prompt | self.model | StrOutputParser()

    def __call__(self, message):
        if message is None or message == "":
            return "Your input is empty."
        # Try each key at most once, capped at 5 attempts
        max_attempts = min(len(self.keys), 5) if self.keys else 1
        if max_attempts == 0:
            return "Error: No API keys loaded."
        attempts = 0
        while attempts < max_attempts:
            try:
                response = self.chain.invoke({"input": message})
                return response
            except Exception as e:
                print(f"  Error while using key {self.current_key_index}: {e}")
                attempts += 1
                if attempts < max_attempts:
                    print("  Rotating API key...")
                    self._rotate_key()
                else:
                    return f"Failed after {attempts} attempts. Last error: {e}"
def run_langchain_datagen(input_path, output_path, keys_path, model_name, max_workers):
    lgpt = LangchainGPT(model_name=model_name, keys_path=keys_path)
    if not lgpt.keys:
        print("  Error: no API key available, cannot generate answers.")
        return False

    def process_item(item):
        item["model_answer"] = lgpt(item["query"])
        return item

    # Collect the ids already present in the output file so that a rerun resumes
    processed_ids = set()
    if os.path.exists(output_path):
        print(f"  Resuming from existing file: {output_path}")
        try:
            with jsonlines.open(output_path, "r") as f:
                for item in f:
                    processed_ids.add(item.get("id", None))
        except Exception as e:
            print(f"  Warning: could not read {output_path}. Starting over. Error: {e}")
            processed_ids = set()
            try:
                if os.path.exists(output_path):
                    os.remove(output_path)  # delete the corrupted file
            except OSError as oe:
                print(f"  Could not delete corrupted file {output_path}: {oe}")

    items_to_process = []
    try:
        with jsonlines.open(input_path, "r") as reader:
            for item in reader:
                item_id = item.get("id", None)
                if item_id is not None and item_id in processed_ids:
                    continue
                items_to_process.append(item)
    except FileNotFoundError:
        print(f"  Error: prepared file not found {input_path}")
        return False
    except Exception as e:
        print(f"  Error reading {input_path}: {e}")
        return False

    print(f"  Found {len(items_to_process)} items to process.")
    if not items_to_process:
        print("  All items have already been processed.")
        return True

    try:
        with jsonlines.open(output_path, "a") as writer:
            with ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = {executor.submit(process_item, item): item for item in items_to_process}
                for future in tqdm(futures, total=len(items_to_process), desc="  Generating model answers"):
                    try:
                        writer.write(future.result())
                    except Exception as e:
                        print(f"  Error processing item: {futures[future]['id']}. Error: {e}")
    except IOError as e:
        print(f"  Error writing {output_path}: {e}")
        return False

    print(f"  Data generation complete. Results saved to {output_path}")
    return True

print("Data generation functions (run_langchain_datagen, LangchainGPT) defined.")

Data generation functions (run_langchain_datagen, LangchainGPT) defined.
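
The resume-and-parallelize pattern in the cell above can be tried without any API access. The sketch below is a simplified stand-in, not the notebook's implementation: `fake_model` and `process_pending` are hypothetical names, and it uses `concurrent.futures.as_completed` to collect results in completion order, whereas the cell above walks the futures in submission order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_model(query):
    """Stand-in for the real API call; always returns the same verdict."""
    return "[Final Choice]: Answer1"

def process_pending(items, processed_ids, max_workers=4):
    """Process only the items whose id is not in processed_ids."""
    pending = [item for item in items if item["id"] not in processed_ids]

    def work(item):
        # Copy the record so the caller's input list is left untouched.
        result = dict(item)
        result["model_answer"] = fake_model(item["query"])
        return result

    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(work, item) for item in pending]
        # as_completed yields each future once it is done,
        # regardless of the order in which the tasks were submitted.
        for future in as_completed(futures):
            finished.append(future.result())
    return finished

items = [{"id": i, "query": f"prompt {i}", "model_answer": ""} for i in range(5)]
results = process_pending(items, processed_ids={0, 1})
print(sorted(r["id"] for r in results))  # → [2, 3, 4]
```

Because results can arrive in any order, each record carries its own `id`, which is also what makes resuming from a partially written output file possible.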


In [6]:
# --- Step 3: Score the results (from 3.scorer.py) ---

def run_score_result(input_path, wrong_ans_path, score_path):
    items = []
    try:
        with jsonlines.open(input_path, "r") as reader:
            items = list(reader)
    except FileNotFoundError:
        print(f"  Error: model answer file not found {input_path}")
        return {}
    except Exception as e:
        print(f"  Error reading {input_path}: {e}")
        return {}

    correct = 0
    total = 0
    wrong_data = []
    model_answer_choices = []

    for item in items:
        total += 1
        model_ans_text = item.get('model_answer', '')
        # Extract the first well-formed "[Final Choice]: AnswerN" verdict
        match = re.search(r'\[Final Choice\]:\s*(Answer[12])', model_ans_text)
        choice = ""
        if match:
            choice = match.group(1)
        model_answer_choices.append(choice)
        # An unparsable answer leaves choice empty and is counted as wrong
        if choice == item.get('groundtruth', '---'):
            correct += 1
        else:
            wrong_data.append(item)

    print(f'  Score: {correct} / {total}')
    accuracy_percent = (correct / total * 100) if total > 0 else 0
    print(f'  Accuracy: {accuracy_percent:.2f}%')
    print(f'  Wrong answers: {len(wrong_data)}, saved to {wrong_ans_path}')

    try:
        with open(wrong_ans_path, 'w', encoding='utf-8') as fw:
            json.dump(wrong_data, fw, ensure_ascii=False, indent=4)
    except IOError as e:
        print(f"  Error writing {wrong_ans_path}: {e}")

    score_info = {
        'correct': correct,
        'total': total,
        'accuracy': f"{accuracy_percent:.2f}%",
        'num_answer1': model_answer_choices.count('Answer1'),
        'num_answer2': model_answer_choices.count('Answer2'),
        'num_empty/invalid': len([c for c in model_answer_choices if c not in ['Answer1', 'Answer2']])
    }

    try:
        with open(score_path, 'w', encoding='utf-8') as fscore:
            json.dump(score_info, fscore, ensure_ascii=False, indent=4)
    except IOError as e:
        print(f"  Error writing {score_path}: {e}")

    print(f"  Score details saved to {score_path}")
    return score_info

print("Scoring function (run_score_result) defined.")

Scoring function (run_score_result) defined.
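
To make the scorer's parsing rule concrete, the sketch below applies the same regular expression to a few made-up model outputs. The helper name `extract_choice` and the sample strings are illustrative only.

```python
import re

# Same pattern as in run_score_result.
FINAL_CHOICE_RE = re.compile(r'\[Final Choice\]:\s*(Answer[12])')

def extract_choice(text):
    """Return 'Answer1' or 'Answer2', or '' when no well-formed verdict is found."""
    match = FINAL_CHOICE_RE.search(text)
    return match.group(1) if match else ""

samples = [
    "[Final Choice]: Answer2",                    # exact format
    "Answer1 is vague.\n[Final Choice]:Answer1",  # no space after the colon
    "I prefer the second answer.",                # no verdict line at all
    "[final choice]: Answer1",                    # wrong letter case
    "[Final Choice]: AnswerX",                    # placeholder echoed back
]
choices = [extract_choice(s) for s in samples]
print(choices)  # → ['Answer2', 'Answer1', '', '', '']
```

The match is case-sensitive and `re.search` returns the first occurrence, so a response that repeats the verdict line is scored on its first well-formed one. Anything that yields an empty string is counted under `num_empty/invalid` and as a wrong answer.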


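## 5. Run All Experiments

We now loop over every strategy in the `prompt_templates` dictionary and run the full prepare-generate-score pipeline for each one.

**Note:** this cell makes many API calls (5 × 100 = 500 in total, assuming 100 records) and takes a long time to finish.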

In [7]:
all_scores = []

# Check that the source file exists
if not os.path.exists(INPUT_FILE):
    print(f"Error: source file {INPUT_FILE} not found. Make sure the file has been uploaded to the {DATA_DIR} directory.")
else:
    # Loop over every prompting strategy defined in cell 3
    for name, template in prompt_templates.items():
        print(f"\n{'='*50}")
        print(f"  Running experiment: {name.upper()}")
        print(f"{'='*50}")

        # 1. Define the file paths specific to this strategy
        prepared_file = os.path.join(DATA_DIR, f"2.rlhf_prepared_{name}.jsonl")
        model_answers_file = os.path.join(DATA_DIR, f"3.rlhf_aftgpt_{name}.jsonl")
        wrong_ans_file = os.path.join(DATA_DIR, f"4.wrong_ans_{name}.json")
        score_file = os.path.join(DATA_DIR, f"4.score_{name}.json")

        # 2. Run step 1: prepare data
        print("\n[Step 1: Prepare data]")
        if not run_prepare_data(INPUT_FILE, prepared_file, template):
            print(f"  *** Data preparation failed, skipping experiment {name} ***")
            continue

        # 3. Run step 2: generate model answers
        print("\n[Step 2: Generate model answers (API calls)]")
        if not run_langchain_datagen(
            input_path=prepared_file,
            output_path=model_answers_file,
            keys_path=API_KEY_FILE,
            model_name=MODEL_NAME,
            max_workers=MAX_WORKERS
        ):
            print(f"  *** Answer generation failed, skipping experiment {name} ***")
            continue

        # 4. Run step 3: score the results
        print("\n[Step 3: Score results]")
        score_data = run_score_result(model_answers_file, wrong_ans_file, score_file)

        # 5. Store the scores for the final summary
        if score_data:
            score_data['strategy'] = name
            all_scores.append(score_data)

        print(f"--- Experiment {name.upper()} complete ---")

print("\n\n--- All experiments finished ---")


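  Running experiment: ZERO_SHOT

[Step 1: Prepare data]
  Read 100 entries from .\Data\1.rlhf.jsonl
  Data preparation complete. Wrote 100 entries to '.\Data\2.rlhf_prepared_zero_shot.jsonl'

[Step 2: Generate model answers (API calls)]
  Found 100 items to process.


  Generating model answers:   0%|          | 0/100 [00:00<?, ?it/s]

  Data generation complete. Results saved to .\Data\3.rlhf_aftgpt_zero_shot.jsonl

[Step 3: Score results]
  Score: 68 / 100
  Accuracy: 68.00%
  Wrong answers: 32, saved to .\Data\4.wrong_ans_zero_shot.json
  Score details saved to .\Data\4.score_zero_shot.json
--- Experiment ZERO_SHOT complete ---

  Running experiment: FEW_SHOT

[Step 1: Prepare data]
  Read 100 entries from .\Data\1.rlhf.jsonl
  Data preparation complete. Wrote 100 entries to '.\Data\2.rlhf_prepared_few_shot.jsonl'

[Step 2: Generate model answers (API calls)]
  Found 100 items to process.


  Generating model answers:   0%|          | 0/100 [00:00<?, ?it/s]

  Data generation complete. Results saved to .\Data\3.rlhf_aftgpt_few_shot.jsonl

[Step 3: Score results]
  Score: 70 / 100
  Accuracy: 70.00%
  Wrong answers: 30, saved to .\Data\4.wrong_ans_few_shot.json
  Score details saved to .\Data\4.score_few_shot.json
--- Experiment FEW_SHOT complete ---

  Running experiment: COT

[Step 1: Prepare data]
  Read 100 entries from .\Data\1.rlhf.jsonl
  Data preparation complete. Wrote 100 entries to '.\Data\2.rlhf_prepared_cot.jsonl'

[Step 2: Generate model answers (API calls)]
  Found 100 items to process.


  Generating model answers:   0%|          | 0/100 [00:00<?, ?it/s]

  Data generation complete. Results saved to .\Data\3.rlhf_aftgpt_cot.jsonl

[Step 3: Score results]
  Score: 69 / 100
  Accuracy: 69.00%
  Wrong answers: 31, saved to .\Data\4.wrong_ans_cot.json
  Score details saved to .\Data\4.score_cot.json
--- Experiment COT complete ---

  Running experiment: KNOWLEDGE

[Step 1: Prepare data]
  Read 100 entries from .\Data\1.rlhf.jsonl
  Data preparation complete. Wrote 100 entries to '.\Data\2.rlhf_prepared_knowledge.jsonl'

[Step 2: Generate model answers (API calls)]
  Found 100 items to process.


  Generating model answers:   0%|          | 0/100 [00:00<?, ?it/s]

  Data generation complete. Results saved to .\Data\3.rlhf_aftgpt_knowledge.jsonl

[Step 3: Score results]
  Score: 67 / 100
  Accuracy: 67.00%
  Wrong answers: 33, saved to .\Data\4.wrong_ans_knowledge.json
  Score details saved to .\Data\4.score_knowledge.json
--- Experiment KNOWLEDGE complete ---

  Running experiment: TOT

[Step 1: Prepare data]
  Read 100 entries from .\Data\1.rlhf.jsonl
  Data preparation complete. Wrote 100 entries to '.\Data\2.rlhf_prepared_tot.jsonl'

[Step 2: Generate model answers (API calls)]
  Found 100 items to process.


  Generating model answers:   0%|          | 0/100 [00:00<?, ?it/s]

  Data generation complete. Results saved to .\Data\3.rlhf_aftgpt_tot.jsonl

[Step 3: Score results]
  Score: 70 / 100
  Accuracy: 70.00%
  Wrong answers: 30, saved to .\Data\4.wrong_ans_tot.json
  Score details saved to .\Data\4.score_tot.json
--- Experiment TOT complete ---


--- All experiments finished ---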


## 6. Final Results Summary

The table below compares the final scores of all prompting strategies that ran successfully.

In [8]:
if not all_scores:
    print("No scores to summarize. Check whether the previous steps ran successfully.")
else:
    # Use a pandas DataFrame to format the results
    df_scores = pd.DataFrame(all_scores)
    df_scores = df_scores.set_index('strategy')  # use the strategy name as the index

    # Reorder the columns for readability
    columns_order = ['accuracy', 'correct', 'total', 'num_answer1', 'num_answer2', 'num_empty/invalid']
    # Keep only the columns that actually exist in the data
    final_columns = [col for col in columns_order if col in df_scores.columns]
    df_scores = df_scores[final_columns]

    print("--- Final score summary for all strategies ---")
    display(df_scores)


--- Final score summary for all strategies ---


| strategy | accuracy | correct | total | num_answer1 | num_answer2 | num_empty/invalid |
|---|---|---|---|---|---|---|
| zero_shot | 68.00% | 68 | 100 | 63 | 37 | 0 |
| few_shot | 70.00% | 70 | 100 | 55 | 45 | 0 |
| cot | 69.00% | 69 | 100 | 38 | 62 | 0 |
| knowledge | 67.00% | 67 | 100 | 62 | 38 | 0 |
| tot | 70.00% | 70 | 100 | 45 | 55 | 0 |
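
When reading these numbers, it may help to estimate how much an accuracy measured on 100 items can vary by chance. The sketch below is a rough approximation, not part of the original scripts: it treats every item as an independent trial and uses the normal approximation to the binomial. The function name `accuracy_with_margin` is hypothetical.

```python
import math

# (correct, total) per strategy, copied from the summary table above.
results = {
    "zero_shot": (68, 100),
    "few_shot": (70, 100),
    "cot": (69, 100),
    "knowledge": (67, 100),
    "tot": (70, 100),
}

def accuracy_with_margin(correct, total, z=1.96):
    """Return (accuracy, half-width of an approximate 95% normal interval)."""
    p = correct / total
    standard_error = math.sqrt(p * (1 - p) / total)
    return p, z * standard_error

for name, (correct, total) in results.items():
    p, margin = accuracy_with_margin(correct, total)
    print(f"{name:<10} {p:.2%} ± {margin:.2%}")
```

Under these assumptions each margin comes out at roughly nine percentage points, which is wider than the three-point spread between the best and worst strategy. The ranking above may therefore not hold on a different sample, all the more so because the model is called with `temperature=1.0`.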
