<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Evaluating Instruction Responses Using the OpenAI API

- This notebook uses OpenAI's GPT-4 API to evaluate responses by instruction-finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:



```python
{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
    "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
    "model 2 response": "\nThe atomic number of helium is 3."    # <-- Response by a 2nd LLM
},
```

In [1]:
# pip install -r requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 1.30.3
tqdm version: 4.66.2


## Test OpenAI API

- First, let's test if the OpenAI API is correctly set up
- If you don't have an account yet, you need to create one at https://platform.openai.com/
- Note that you will also have to transfer some funds to your account as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)
- Running the experiments and creating the ~200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing
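- As a rough sanity check on the cost figure above, the per-evaluation cost can be worked out directly. The linear extrapolation is an assumption (similar prompt and response lengths, unchanged pricing), and `estimate_cost` is a hypothetical helper that is not part of the notebook:

```python
# Figures quoted above: about $0.26 for roughly 200 evaluations.
TOTAL_COST_USD = 0.26
NUM_EVALUATIONS = 200

cost_per_evaluation = TOTAL_COST_USD / NUM_EVALUATIONS


def estimate_cost(num_entries, num_models=2):
    """Linearly extrapolate the cost of scoring `num_entries` for each model.

    Assumption: every evaluation costs about the same as in the run above.
    """
    return num_entries * num_models * cost_per_evaluation


print(f"Approx. cost per evaluation: ${cost_per_evaluation:.4f}")
print(f"Approx. cost for 1,000 entries and 2 models: ${estimate_cost(1000):.2f}")
```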

- Next, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys
- Make sure not to share this key with anyone
- Add this secret key (`"sk-..."`) to the `config.json` file in this folder
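- The `config.json` file is expected to hold a single JSON object with the key stored under `OPENAI_API_KEY`. A minimal sketch of reading it with a few sanity checks is shown here; `load_api_key` is a hypothetical helper (not part of the notebook), and treating the literal `"sk-..."` as "not yet filled in" is an assumption:

```python
import json
from pathlib import Path


def load_api_key(config_path="config.json", key_name="OPENAI_API_KEY"):
    """Read the API key from a JSON file shaped like {"OPENAI_API_KEY": "sk-..."}."""
    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(f"Expected a config file at {path.resolve()}")

    with open(path, "r") as config_file:
        config = json.load(config_file)

    api_key = config.get(key_name, "")
    # The literal placeholder "sk-..." means the file was never edited.
    if not api_key or api_key == "sk-...":
        raise ValueError(f"Set a real key under '{key_name}' in {path}")
    return api_key
```

- Failing early with a clear message avoids a less obvious authentication error on the first API call.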

In [3]:
import json
from openai import OpenAI

# Load API key from a JSON file.
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)

- Let's try the API with a simple example to make sure it works as intended:

In [4]:
def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=123,
    )
    return response.choices[0].message.content


prompt = "Respond with 'hello world' if you got this message."
run_chatgpt(prompt, client)

'hello world'

## Load JSON Entries

- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:

In [5]:
json_file = "eval-example-data.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))

Number of entries: 100


- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'model 1 response'` and `'model 2 response'`):

In [6]:
json_data[0]

{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',
 'input': '',
 'output': 'The hypotenuse of the triangle is 10 cm.',
 'model 1 response': '\nThe hypotenuse of the triangle is 3 cm.',
 'model 2 response': '\nThe hypotenuse of the triangle is 12 cm.'}
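
- Because every entry is sent to a paid API, it can be worth checking the structure shown above before scoring. The following is a minimal sketch under that assumption; `find_incomplete_entries` and the example entries are hypothetical and not part of the notebook or its dataset:

```python
REQUIRED_KEYS = ("instruction", "input", "output",
                 "model 1 response", "model 2 response")


def find_incomplete_entries(entries, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for entries lacking any required key."""
    problems = []
    for idx, entry in enumerate(entries):
        missing = [key for key in required_keys if key not in entry]
        if missing:
            problems.append((idx, missing))
    return problems


# Made-up entries for illustration; in the notebook, pass `json_data` instead.
example_entries = [
    {"instruction": "Name a primary color.", "input": "", "output": "Red.",
     "model 1 response": "\nRed.", "model 2 response": "\nBlue."},
    {"instruction": "Name a primary color.", "input": "", "output": "Red.",
     "model 1 response": "\nRed."},  # "model 2 response" is missing here
]
print(find_incomplete_entries(example_entries))
```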

- Below is a small utility function that formats the input for visualization purposes later:

In [7]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # Only add the "### Input" section when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
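
- To see what the formatted prompt looks like, the sketch below restates `format_input` so that it runs on its own and applies it to a made-up entry (not from the dataset) with a non-empty `input` field. When `input` is empty, the `### Input:` section is omitted:

```python
def format_input(entry):
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Only add the "### Input" section when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


# Hypothetical entry for illustration only
example_entry = {
    "instruction": "Convert the sentence to passive voice.",
    "input": "The chef cooked the meal.",
}
print(format_input(example_entry))
```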

- Now, let's try the OpenAI API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):

In [8]:
for entry in json_data[:5]:
    prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model 1 response']}`"
              f" on a scale from 0 to 100, where 100 is the best score. "
              )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model 1 response"])
    print("\nScore:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")


Dataset response:
>> The hypotenuse of the triangle is 10 cm.

Model response:
>> 
The hypotenuse of the triangle is 3 cm.

Score:
>> The model response "The hypotenuse of the triangle is 3 cm." is incorrect. The correct calculation of the hypotenuse for a right triangle with legs of 6 cm and 8 cm can be found using the Pythagorean theorem, which states that the square of the hypotenuse (c) is equal to the sum of the squares of the other two sides (a and b). Mathematically, this is expressed as:

\[ c = \sqrt{a^2 + b^2} \]
\[ c = \sqrt{6^2 + 8^2} \]
\[ c = \sqrt{36 + 64} \]
\[ c = \sqrt{100} \]
\[ c = 10 \text{ cm} \]

The correct answer should be 10 cm. The response given as 3 cm is not only incorrect but also significantly off from the correct value. This error could lead to misunderstandings or incorrect applications in practical scenarios where precise measurements are crucial.

Given the scale from 0 to 100, where 100 is the best score, the response would score very low due to it

- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:

In [9]:
from tqdm import tqdm


def generate_model_scores(json_data, json_key, client):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the number only."
        )
        score = run_chatgpt(prompt, client)
        try:
            scores.append(int(score))
        except ValueError:
            continue

    return scores
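
- Note that `int(score)` raises a `ValueError` for replies such as `"Score: 85"` or `"85."`, and those entries are then skipped silently. A more lenient parser is sketched below; `parse_score` is a hypothetical helper that is not part of the notebook, and taking the first integer in the reply is a heuristic assumption:

```python
import re


def parse_score(text, low=0, high=100):
    """Return the first integer found in `text` if it lies in [low, high], else None."""
    match = re.search(r"\d+", text)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


for reply in ["85", "85.", "Score: 85", "85/100", "N/A"]:
    print(repr(reply), "->", parse_score(reply))
```

- Returning `None` instead of dropping the entry makes it possible to count and inspect the replies that could not be parsed.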

- Please note that the response scores may vary because OpenAI's GPT models are not deterministic despite setting `temperature=0.0` and a fixed `seed`
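- One way to get a feel for this run-to-run variation is to repeat the scoring a few times and look at the spread of the average score. The sketch below is an assumption about how that could be done, not part of the notebook; `summarize_repeated_runs` is a hypothetical helper, and the stand-in scorer uses made-up numbers so that the example runs without API access:

```python
from statistics import mean, stdev


def summarize_repeated_runs(score_fn, num_runs=3):
    """Call `score_fn()` `num_runs` times and summarize the per-run average scores.

    `score_fn` is any zero-argument callable returning a non-empty list of scores.
    Returns (mean of the run averages, sample standard deviation of the run averages).
    """
    run_averages = [mean(score_fn()) for _ in range(num_runs)]
    spread = stdev(run_averages) if num_runs > 1 else 0.0
    return mean(run_averages), spread


# Stand-in scorer with made-up numbers; each call yields one "run" of scores.
fake_runs = iter([[70, 80], [72, 78], [74, 82]])
avg, spread = summarize_repeated_runs(lambda: next(fake_runs), num_runs=3)
print(f"Mean of run averages: {avg:.2f} (std. dev. {spread:.2f})")
```

- With the real API, `score_fn` could wrap a call to `generate_model_scores`; keep in mind that every repeated run incurs the full API cost again.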

- Let's now apply this evaluation to the whole dataset and compute the average score of each model:

In [11]:
from pathlib import Path

for model in ("model 1 response", "model 2 response"):

    scores = generate_model_scores(json_data, model, client)
    print(f"\n{model}")
    print(f"Number of scores: {len(scores)} of {len(json_data)}")
    print(f"Average score: {sum(scores)/len(scores):.2f}\n")

    # Optionally save the scores
    save_path = Path("scores") / f"gpt4-{model.replace(' ', '-')}.json"
    save_path.parent.mkdir(parents=True, exist_ok=True)  # create "scores/" if it is missing
    with open(save_path, "w") as file:
        json.dump(scores, file)

Scoring entries: 100%|████████████████████████| 100/100 [01:03<00:00,  1.56it/s]



model 1 response
Number of scores: 100 of 100
Average score: 74.09



Scoring entries: 100%|████████████████████████| 100/100 [01:06<00:00,  1.50it/s]


model 2 response
Number of scores: 100 of 100
Average score: 56.57






- Based on the evaluation above, we can say that the 1st model is substantially better than the 2nd model (average score of 74.09 versus 56.57)
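- To judge whether a gap between two average scores is larger than what resampling noise would produce, one option is a bootstrap interval for the difference in means. This is a minimal sketch of an unpaired percentile bootstrap, not a method used in the notebook; `bootstrap_mean_diff` is a hypothetical helper, and the score lists below are made up for illustration:

```python
import random
from statistics import mean


def bootstrap_mean_diff(scores_a, scores_b, n_resamples=2000, seed=123):
    """Return the observed mean difference (a - b) and a 95% percentile interval."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each list independently, with replacement
        sample_a = rng.choices(scores_a, k=len(scores_a))
        sample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(sample_a) - mean(sample_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return mean(scores_a) - mean(scores_b), (lower, upper)


# Made-up scores for illustration only; in the notebook, use the two lists
# returned by generate_model_scores.
scores_1 = [90, 85, 70, 95, 60, 80, 75, 88]
scores_2 = [60, 55, 80, 40, 65, 50, 70, 45]
diff, (lower, upper) = bootstrap_mean_diff(scores_1, scores_2)
print(f"Mean difference: {diff:.2f} (95% interval: {lower:.2f} to {upper:.2f})")
```

- If the interval excludes 0, the difference is unlikely to be an artifact of which entries happened to be sampled; it does not account for the variability of the GPT-4 scores themselves.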