<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Create "Passive Voice" Entries for an Instruction Dataset

- This notebook uses OpenAI's GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below

```python
{  
   'instruction': 'Identify the verb in the following sentence',
   'input': 'The cat sleeps on the couch.',
   'output': 'The verb in the sentence is "sleeps."',
   'output_2': 'The sentence is "sleeps."'   #  <---- Newly created entry
}  
```

In [1]:
# pip install -r requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
       ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 1.30.3
tqdm version: 4.65.0


## Test OpenAI API

- First, let's test whether the OpenAI API is set up correctly
- If you don't have an account yet, you need to create one at https://platform.openai.com/
- Note that you will also have to transfer some funds to your account, as the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)
- Creating the ~200 passive voice entries using the code in this notebook costs about $0.13 (13 cents)
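
- As a rough sketch, the cost figure above can be turned into a per-entry estimate and scaled to another dataset size; this assumes the cost grows linearly with the number of entries and that pricing is unchanged, and the 1,000-entry size is a hypothetical value chosen only for illustration

```python
# Figures from the text above: ~200 entries cost about $0.13 in total
total_cost_usd = 0.13
num_entries = 200

# Average cost per converted entry
cost_per_entry = total_cost_usd / num_entries

# Hypothetical larger dataset size (an assumption, not from the notebook)
hypothetical_entries = 1_000
estimated_cost = cost_per_entry * hypothetical_entries

print(f"Cost per entry: ${cost_per_entry:.5f}")
print(f"Estimated cost for {hypothetical_entries} entries: ${estimated_cost:.2f}")
```

- Actual costs depend on the length of each entry and on current API pricing, so this is only an order-of-magnitude guide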

- Next, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys
- Make sure not to share this key with anyone
- Add this secret key (`"sk-..."`) to the `config.json` file in this folder
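
- A minimal sketch of a slightly more defensive way to read the key is shown below; the helper name `load_api_key`, the environment-variable fallback, and the placeholder check are assumptions added for illustration, while the `"OPENAI_API_KEY"` field name matches the one this notebook reads from `config.json`

```python
import json
import os


def load_api_key(path="config.json", env_var="OPENAI_API_KEY"):
    """Return the API key from a JSON config file, else from an environment variable."""
    if os.path.exists(path):
        with open(path, "r") as config_file:
            config = json.load(config_file)
        key = config.get("OPENAI_API_KEY", "")
    else:
        # Fallback is an assumption of this sketch, not part of the notebook
        key = os.environ.get(env_var, "")

    # Reject an empty value or the unedited "sk-..." placeholder
    if not key or key == "sk-...":
        raise ValueError(
            "No usable API key found; edit config.json or set the environment variable."
        )
    return key
```

- Failing early with a clear message can be easier to debug than an authentication error returned by the first API call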

In [3]:
import json
from openai import OpenAI

# Load API key from a JSON file.
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)

- First, let's try the API with a simple example to make sure it works as intended:

In [4]:
def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


# Prepare input
sentence = "I ate breakfast"
prompt = f"Convert the following sentence to passive voice: '{sentence}'"
run_chatgpt(prompt, client)

'Breakfast was eaten by me.'
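
- To exercise the `run_chatgpt` call path without spending API credits, one option is a small stand-in client; the sketch below repeats `run_chatgpt` so it is self-contained, and `FakeCompletions` and `fake_client` are hypothetical names that only mimic the attribute chain `client.chat.completions.create(...)` and the `response.choices[0].message.content` shape used above

```python
from types import SimpleNamespace


def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    # Same function as in the notebook, repeated so this sketch is self-contained
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


class FakeCompletions:
    """Hypothetical stand-in for client.chat.completions (no network access)."""

    def create(self, model, messages, temperature):
        # Echo the user prompt back so the call path can be verified
        reply = f"[{model}] {messages[0]['content']}"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


# Mimic the attribute chain client.chat.completions.create(...)
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))

print(run_chatgpt("Convert the following sentence to passive voice: 'I ate breakfast'", fake_client))
```

- The stand-in does not convert anything to passive voice; it only confirms that the prompt and model name reach the client as expected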

## Create JSON Entries

- Next, we load the file we want to modify:

In [5]:
import json

json_file = "instruction-examples.json"

with open(json_file, "r") as file:
    json_data = json.load(file)
    
print("Number of entries:", len(json_data))

Number of entries: 200


- We first try the OpenAI chat API on a small sample to ensure that it works correctly:

In [6]:
for entry in json_data[:5]:
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    
    print("\nInput:")
    print(">>", text)
    print("\nOutput:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")


Input:
>> The verb in the sentence is "sleeps."

Output:
>> The sentence is "sleeps."

-------------------------

Input:
>> The plural form of "goose" is "geese."

Output:
>> The plural form of "goose" is referred to as "geese."

-------------------------

Input:
>> The three primary colors are red, blue, and yellow.

Output:
>> Red, blue, and yellow are considered the three primary colors.

-------------------------

Input:
>> They had finished the game.

Output:
>> The game had been finished by them.

-------------------------

Input:
>> The abbreviation for "Doctor of Philosophy" is Ph.D.

Output:
>> The abbreviation "Ph.D." is used for "Doctor of Philosophy".

-------------------------
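
- Since the same prompt string is reused in several cells, it could be factored into a small helper; `PASSIVE_VOICE_INSTRUCTION` and `build_passive_prompt` are hypothetical names introduced for this sketch, and the instruction text is copied from the cell above

```python
PASSIVE_VOICE_INSTRUCTION = (
    "Without adding any response or explanation, "
    "convert the following text to passive voice:"
)


def build_passive_prompt(text):
    """Hypothetical helper: combine the fixed instruction with one entry's text."""
    return f"{PASSIVE_VOICE_INSTRUCTION} {text}"


print(build_passive_prompt("They had finished the game."))
```

- Keeping the instruction in one place means a later wording change applies to every cell that builds the prompt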


- Let's now extend the code to add the generated entries to `json_data` and add a progress bar:

In [7]:
from tqdm import tqdm  # a progress bar tool


for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.23it/s]


- One more time, let's make sure that the new entries (`"output_2"`) look OK

In [8]:
json_data[0]

{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',
 'input': '',
 'output': 'The verb in the sentence is "sleeps."',
 'output_2': 'The sentence is "sleeps."'}
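
- Beyond inspecting a single entry by eye, a simple automated check can flag entries that were obviously not converted; the sketch below is a heuristic under stated assumptions (the function name `find_suspicious_entries` is hypothetical, and the second and third sample entries are made up for illustration)

```python
def find_suspicious_entries(entries):
    """Return indices of entries whose "output_2" is missing, empty, or unchanged."""
    flagged = []
    for i, entry in enumerate(entries):
        original = entry.get("output", "").strip()
        converted = entry.get("output_2", "").strip()
        if not converted or converted == original:
            flagged.append(i)
    return flagged


sample = [
    {"output": "They had finished the game.", "output_2": "The game had been finished by them."},
    {"output": "The sky is blue.", "output_2": "The sky is blue."},  # unchanged (illustrative)
    {"output": "She wrote a letter."},  # missing "output_2" (illustrative)
]
print(find_suspicious_entries(sample))
```

- This check does not catch conversions that change the meaning, such as the first entry above, where "The verb in the sentence" became "The sentence"; those still need manual review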

- Finally, if everything above looks OK, let's run the conversion to passive voice on our entire JSON dataset (this takes about 3 minutes):

In [9]:
for i, entry in tqdm(enumerate(json_data), total=len(json_data)):
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00,  1.12s/it]
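
- The loop above also re-converts the five entries processed earlier, and an interrupted run would have to start over; a minimal sketch of a loop that skips entries already containing `"output_2"` is shown below, where `add_passive_outputs` and its `convert_fn` parameter are hypothetical names and a stub stands in for the API call so the example runs offline

```python
def add_passive_outputs(entries, convert_fn):
    """Fill in "output_2" for entries that lack it; return how many were converted."""
    converted = 0
    for entry in entries:
        if entry.get("output_2"):
            continue  # already converted in an earlier run
        entry["output_2"] = convert_fn(entry["output"])
        converted += 1
    return converted


# Offline demonstration with a stub converter instead of the OpenAI API
demo = [
    {"output": "first", "output_2": "already done"},
    {"output": "second"},
]
print(add_passive_outputs(demo, lambda text: text.upper()))
print(demo)
```

- With the notebook's objects, `convert_fn` could wrap the existing call, for example `lambda text: run_chatgpt(f"Without adding any response or explanation, convert the following text to passive voice: {text}", client)`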


- After the conversion is completed, we save the file:

In [10]:
new_json_file = json_file.replace(".json", "-modified.json")


with open(new_json_file, "w") as file:
    json.dump(json_data, file, indent=4)  # "indent" for pretty-printing
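
- As an optional sanity check, the saved file can be reloaded to confirm that every entry received an `"output_2"` field; the helper name `count_missing_output_2` is hypothetical and the sketch assumes the file contains a JSON list of dictionaries, as in this notebook

```python
import json


def count_missing_output_2(path):
    """Load a saved JSON list and count entries without a non-empty "output_2"."""
    with open(path, "r") as file:
        data = json.load(file)
    return sum(1 for entry in data if not entry.get("output_2"))
```

- For example, `count_missing_output_2(new_json_file)` should return `0` if the conversion loop finished for all entries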