<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Chapter 7 Exercise solutions

## Exercise 7.1: Changing prompt styles

Suppose we have the following data entry:

```json
{
  "instruction": "Identify the correct spelling of the following word.",
  "input": "Ocassion",
  "output": "The correct spelling is 'Occasion.'"
}
```

In the main chapter, we formatted it according to the Alpaca-style prompt template:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
```

In this exercise, we now use the Phi-3 prompt template instead, which formats the data entry as follows:

```
<|user|>
Identify the correct spelling of the following word: 'Ocassion'

<|assistant|>
The correct spelling is 'Occasion'.
```

Note that this prompt template is substantially shorter, which reduces the runtime and hardware requirements for finetuning the LLM and generating text since the input prompts are shorter.
To make this change, we update the `format_input` function as follows:

In [1]:
def format_input(entry):
    instruction_text = (
        f"<|user|>\n{entry['instruction']}"
    )

    input_text = f"\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

Let's make sure that it works as intended by applying it to two input samples, one with and one without content in the `'input'` field:

In [2]:
sample_data = [
    {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}, 
    {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}
]

print(format_input(sample_data[0]))
print()
print(format_input(sample_data[1]))

<|user|>
Identify the correct spelling of the following word.
Ocassion

<|user|>
What is an antonym of 'complicated'?
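
To get a rough sense of how much shorter the Phi-3-style prompt is, the following minimal sketch compares the character counts of both formats for the same entry. The helper names `format_input_alpaca` and `format_input_phi3` are hypothetical and used only for this illustration; character counts are merely a proxy, since the actual savings depend on the tokenizer.

```python
def format_input_alpaca(entry):
    # Alpaca-style prompt as used in the main chapter
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


def format_input_phi3(entry):
    # Phi-3-style prompt from this exercise
    instruction_text = f"<|user|>\n{entry['instruction']}"
    input_text = f"\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

alpaca_prompt = format_input_alpaca(entry)
phi3_prompt = format_input_phi3(entry)

print("Alpaca-style prompt characters:", len(alpaca_prompt))
print("Phi-3-style prompt characters:", len(phi3_prompt))
```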


Next, we also update the `InstructionDataset` class to use the `<|assistant|>` prompt template for the response:

```python
import tiktoken
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:

            ###################################################################
            # NEW: Use the Phi-3-style `format_input` and adjust the response text template
            instruction_plus_input = format_input(entry)
            response_text = f"\n<|assistant|>:\n{entry['output']}"
            ###################################################################
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)


tokenizer = tiktoken.get_encoding("gpt2")
```

Lastly, we also have to update the way we extract the generated response when we collect the test set responses:

```python
from tqdm import tqdm

# `generate`, `text_to_token_ids`, `token_ids_to_text`, `model`, `device`,
# `BASE_CONFIG`, and `test_data` are defined in the main chapter 7 code
for i, entry in tqdm(enumerate(test_data), total=len(test_data)):

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    # New: Adjust ###Response -> <|assistant|>
    response_text = generated_text[len(input_text):].replace("<|assistant|>:", "").strip()

    test_data[i]["model_response"] = response_text
```
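
The response extraction step can be illustrated in isolation. The sketch below uses a made-up `generated_text` string in place of real model output (an assumption for illustration) and shows how slicing off the prompt and removing the `<|assistant|>:` marker leaves only the response.

```python
# Hypothetical prompt and model output, used only to illustrate the string handling
input_text = "<|user|>\nWhat is an antonym of 'complicated'?"
generated_text = (
    input_text
    + "\n<|assistant|>:\nAn antonym of 'complicated' is 'simple'."
)

# Drop the prompt, then remove the assistant marker and surrounding whitespace
response_text = (
    generated_text[len(input_text):]
    .replace("<|assistant|>:", "")
    .strip()
)

print(response_text)
```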

For your convenience, the exercise solution is implemented in the [exercise_experiments.py](exercise_experiments.py) script, which you can run as follows:

```bash
python exercise_experiments.py --exercise_solution phi3_prompt
```

Output:

```
matplotlib version: 3.7.1
tiktoken version: 0.7.0
torch version: 2.3.0+cu121
tqdm version: 4.66.4
tensorflow version: 2.15.0
--------------------------------------------------
Training set length: 935
Validation set length: 55
Test set length: 110
--------------------------------------------------
Device: cuda
--------------------------------------------------
...
Loaded model: gpt2-medium (355M)
--------------------------------------------------
Initial losses
   Training loss: 3.71630220413208
   Validation loss: 3.6440994262695314
Ep 1 (Step 000000): Train loss 2.633, Val loss 2.622
...
Ep 2 (Step 000230): Train loss 0.424, Val loss 0.928
<|user|> Convert the active sentence to passive: 'The chef cooks the meal every day.' <|assistant|>: The meal is prepared every day by the chef....
Training completed in 1.50 minutes.
Plot saved as loss-plot-phi3-prompt.pdf
--------------------------------------------------
Generating responses
100% 110/110 [00:11<00:00,  9.27it/s]
Responses saved as instruction-data-with-response-phi3-prompt.json
Model saved as gpt2-medium355M-sft-phi3-prompt.pth
```

For comparison, you can run the original chapter 7 finetuning code via `python exercise_experiments.py --exercise_solution baseline`.

Note that on an Nvidia L4 GPU, the code above, using the Phi-3 prompt template, takes 1.50 minutes to run. In comparison, the Alpaca-style template takes 1.80 minutes to run. So, the Phi-3 template is approximately 17% faster since it results in shorter model inputs.
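
The percentage above follows from the relative reduction in runtime. A minimal sketch of the arithmetic, where `runtime_reduction` is a hypothetical helper name introduced only for this illustration:

```python
def runtime_reduction(baseline_minutes, new_minutes):
    # Relative reduction in runtime, expressed as a percentage of the baseline
    return (baseline_minutes - new_minutes) / baseline_minutes * 100


reduction = runtime_reduction(baseline_minutes=1.80, new_minutes=1.50)
print(f"Runtime reduction: {reduction:.1f}%")
```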

Let's take a look at some of the responses to make sure they have been formatted correctly:

```json
    {
        "instruction": "Rewrite the sentence using a simile.",
        "input": "The car is very fast.",
        "output": "The car is as fast as lightning.",
        "model_response": "The car is as fast as a cheetah."
    },
    {
        "instruction": "What type of cloud is typically associated with thunderstorms?",
        "input": "",
        "output": "The type of cloud typically associated with thunderstorms is cumulonimbus.",
        "model_response": "The type of cloud associated with thunderstorms is a cumulus cloud."
    },
    {
        "instruction": "Name the author of 'Pride and Prejudice'.",
        "input": "",
        "output": "Jane Austen.",
        "model_response": "The author of 'Pride and Prejudice' is Jane Austen."
    },
```

We can evaluate the performance using the Ollama Llama 3 method, which, for your convenience, is also implemented in the `ollama_evaluate.py` script that we can run as follows:

```bash
python ollama_evaluate.py --file_path instruction-data-with-response-phi3-prompt.json
```

Output:

```
Ollama running: True
Scoring entries: 100%|████████████████████████| 110/110 [01:08<00:00,  1.60it/s]
Number of scores: 110 of 110
Average score: 48.87
```

The score is close to 50, which is in the same ballpark as the score we previously achieved with the Alpaca-style prompts.

There is no inherent advantage or rationale why the Phi prompt style should be better, but it can be more concise and efficient, except for the caveat mentioned in the *Tip* section below.

#### Tip: Considering special tokens

- Note that the Phi-3 prompt template contains special tokens such as `<|user|>` and `<|assistant|>`, which can be suboptimal for the GPT-2 tokenizer
- While the GPT-2 tokenizer recognizes `<|endoftext|>` as a special token (encoded into token ID 50256), it is inefficient at handling other special tokens, such as the aforementioned ones
- For instance, `<|user|>` is encoded into 5 individual token IDs (27, 91, 7220, 91, 29), which is very inefficient
- We could add `<|user|>` as a new special token in `tiktoken` (special tokens are then permitted during encoding via the `allowed_special` argument), but please keep in mind that the GPT-2 vocabulary would not be able to handle it without additional modification
- If you are curious about how a tokenizer and LLM can be extended to handle special tokens, please see the [extend-tiktoken.ipynb](../../ch05/09_extending-tokenizers/extend-tiktoken.ipynb) bonus materials (note that this is not required here but is just an interesting bonus consideration for curious readers)
- Furthermore, we can hypothesize that models that support these special tokens of a prompt template via their vocabulary may perform more efficiently and better overall

&nbsp;
## Exercise 7.2: Instruction and input masking

To mask out the instructions as shown in the following figure, we need to make slight modifications to the `InstructionDataset` class and `custom_collate_fn`.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/mask-instructions.webp" width=600px>

In [4]:
# This `format_input` function is copied from the original chapter 7 code

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

We can modify the `InstructionDataset` class to collect the lengths of the instructions, which we will use in the collate function to locate the instruction content positions in the targets, as follows:

In [5]:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        ##########################################################################################
        # New: Separate list for instruction lengths
        self.instruction_lengths = []
        ##########################################################################################
        
        self.encoded_texts = []
        
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

            ##########################################################################################
            # New: collect instruction lengths
            instruction_length = len(tokenizer.encode(instruction_plus_input))
            self.instruction_lengths.append(instruction_length)
            ##########################################################################################
            
    def __getitem__(self, index):
        # New: return both instruction lengths and texts separately
        return self.instruction_lengths[index], self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

Next, we update the `custom_collate_fn`, where each element of `batch` is now a tuple containing `(instruction_length, item)` instead of just `item` due to the changes in the `InstructionDataset` dataset. In addition, we now mask the corresponding instruction tokens in the target ID list.

In [7]:
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for instruction_length, item in batch)   # New: each batch element is now a tuple

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for instruction_length, item in batch:  # New: each batch element is now a tuple
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        # Replace all but the first padding tokens in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        ##########################################################################################
        # New: Mask all input and instruction tokens in the targets
        targets[:instruction_length-1] = ignore_index
        ##########################################################################################
        
        # Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor
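
The masking logic can also be traced on a toy example without PyTorch. The sketch below is a simplified, list-based stand-in for `custom_collate_fn` that handles a single sequence; the function name `build_targets` and the toy token IDs are hypothetical and only meant to illustrate the shifting and masking steps.

```python
def build_targets(item, instruction_length, max_length,
                  pad_token_id=50256, ignore_index=-100):
    # Append <|endoftext|> and pad to the (batch) maximum length
    padded = item + [pad_token_id]
    padded += [pad_token_id] * (max_length - len(padded))

    inputs = padded[:-1]   # drop the last token for the inputs
    targets = padded[1:]   # shift by one position for the targets

    # Keep the first padding token in the targets, ignore the remaining ones
    pad_positions = [i for i, t in enumerate(targets) if t == pad_token_id]
    for i in pad_positions[1:]:
        targets[i] = ignore_index

    # Mask the instruction tokens; the targets are shifted by one,
    # so the instruction occupies the first instruction_length - 1 positions
    for i in range(instruction_length - 1):
        targets[i] = ignore_index

    return inputs, targets


# Toy sequence: 3 instruction tokens (1, 2, 3) followed by 2 response tokens (4, 5)
inputs, targets = build_targets([1, 2, 3, 4, 5], instruction_length=3, max_length=8)
print(inputs)   # [1, 2, 3, 4, 5, 50256, 50256]
print(targets)  # [-100, -100, 4, 5, 50256, -100, -100]
```

In this toy case, only the response tokens `4` and `5` and the first end-of-text token contribute to the loss.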

Let's try it out on some sample data below:

In [8]:
sample_data = [
    {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."},
    {'instruction': 'Sort the following list in alphabetical order.', 'input': 'Zebra, Elephant, Crocodile', 'output': 'Crocodile, Elephant, Zebra'},
    {'instruction': 'Arrange the given numbers in descending order.', 'input': '5, 12, 8, 3, 15', 'output': '15, 12, 8, 5, 3.'}
]

In [9]:
from torch.utils.data import DataLoader

train_dataset = InstructionDataset(sample_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=len(sample_data),
    collate_fn=custom_collate_fn,
    num_workers=0
)

In [10]:
print("Train loader:")
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)

Train loader:
torch.Size([3, 64]) torch.Size([3, 64])


In [11]:
print("Inputs:\n", inputs[1])
print("\n\nTargets:\n", targets[1])

Inputs:
 tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 42758,   262,  1708,  1351,   287, 24830,
          605,  1502,    13,   198,   198, 21017, 23412,    25,   198,    57,
        37052,    11, 42651,    11,  9325, 19815,   576,   198,   198, 21017,
        18261,    25,   198,    34, 12204,   375,   576,    11, 42651,    11,
         1168, 37052, 50256, 50256])


Targets:
 tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,   198,   198, 21017, 18261,
           25,   198,    34, 12204,   375,   576,    11, 42651,    11,  1168,
        37052, 50256,  -100,  -100])

As we can see based on the `targets` tensor, both the instruction and padding tokens are now masked using the -100 placeholder tokens.
Let's decode the inputs just to make sure that they look correct:

In [12]:
print(tokenizer.decode(list(inputs[1])))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Sort the following list in alphabetical order.

### Input:
Zebra, Elephant, Crocodile

### Response:
Crocodile, Elephant, Zebra<|endoftext|><|endoftext|>


Next, let's decode the non-masked target token IDs:

In [13]:
non_masked_targets = targets[1][targets[1] != -100]

print(tokenizer.decode(list(non_masked_targets)))



### Response:
Crocodile, Elephant, Zebra<|endoftext|>


As shown above, the non-masked target tokens exclude the `"Instruction"` and `"Input"` fields, as intended. Now, we can run the modified code to see how well the LLM performs when finetuned using this masking strategy.

For your convenience, you can use the `exercise_experiments.py` code to run a comparison as follows:

```bash
python exercise_experiments.py --exercise_solution mask_instructions
```

Output:

```
matplotlib version: 3.7.1
tiktoken version: 0.7.0
torch version: 2.3.0+cu121
tqdm version: 4.66.4
tensorflow version: 2.15.0
--------------------------------------------------
Training set length: 935
Validation set length: 55
Test set length: 110
--------------------------------------------------
Device: cuda
--------------------------------------------------
...
Loaded model: gpt2-medium (355M)
--------------------------------------------------
Initial losses
   Training loss: 2.280539035797119
   Validation loss: 2.262560224533081
Ep 1 (Step 000000): Train loss 1.636, Val loss 1.620
...
Ep 2 (Step 000230): Train loss 0.143, Val loss 0.727
...
Training completed in 1.77 minutes.
Plot saved as loss-plot-mask-instructions.pdf
--------------------------------------------------
Generating responses
100% 110/110 [02:10<00:00,  1.19s/it]
Responses saved as instruction-data-with-response-mask-instructions.json
Model saved as gpt2-medium355M-sft-mask-instructions.pth
```

Next, let's evaluate the performance of the resulting LLM:

```bash
python ollama_evaluate.py --file_path instruction-data-with-response-mask-instructions.json
```

```
Ollama running: True
Scoring entries: 100%|██████████████████████████████████████████████████████████████████████████████████████| 110/110 [01:23<00:00,  1.31it/s]
Number of scores: 110 of 110
Average score: 47.73
```

As we can see based on the scores, the instruction masking does perform slightly worse, which is consistent with the observation in the "Instruction Tuning With Loss Over Instructions" paper (https://arxiv.org/abs/2405.14394).

&nbsp;
## Exercise 7.3: Finetuning on the original Alpaca dataset

To finetune the model on the original Stanford Alpaca dataset ([https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca)), you just need to change the file URL from

```python
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"
```

to

```python
url = "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json"
```

Note that the dataset contains 52k entries (roughly 50x more than in chapter 7), and the entries are longer than the ones we worked with in chapter 7.
Thus, it's highly recommended that the training be run on a GPU.
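
The dataset sizes reported in the training logs of this notebook are consistent with an 85%/10%/5% training/test/validation split. The split percentages and the total entry counts used below (the sums of the logged split sizes) are assumptions made for this illustration, and `split_sizes` is a hypothetical helper:

```python
def split_sizes(num_entries, train_pct=85, test_pct=10):
    # Training and test portions are rounded down; the remainder is used for validation
    train = num_entries * train_pct // 100
    test = num_entries * test_pct // 100
    val = num_entries - train - test
    return train, val, test


# 1,100 entries: the chapter 7 instruction dataset (935 + 55 + 110)
print(split_sizes(1100))
# 52,002 entries: the Alpaca dataset (44,201 + 2,601 + 5,200)
print(split_sizes(52002))
```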

If you encounter out-of-memory errors, consider reducing the batch size from 8 to 4, 2, or 1. In addition to lowering the batch size, you may also want to consider lowering the `allowed_max_length` from 1024 to 512 or 256.

For your convenience, you can use the `exercise_experiments.py` code to finetune the model on the 52k Alpaca dataset with a batch size of 4 and an `allowed_max_length` of 512 as follows:

```bash
python exercise_experiments.py --exercise_solution alpaca_52k
```

```
matplotlib version: 3.7.1
tiktoken version: 0.7.0
torch version: 2.3.0+cu121
tqdm version: 4.66.4
tensorflow version: 2.15.0
--------------------------------------------------
Training set length: 44201
Validation set length: 2601
Test set length: 5200
--------------------------------------------------
Device: cuda
--------------------------------------------------
...
Loaded model: gpt2-medium (355M)
--------------------------------------------------
Initial losses
   Training loss: 3.3681655883789063
   Validation loss: 3.4122894287109373
Ep 1 (Step 000000): Train loss 2.477, Val loss 2.750
...
Ep 2 (Step 022095): Train loss 0.761, Val loss 1.557
...
Training completed in 196.38 minutes.
Plot saved as loss-plot-alpaca52k.pdf
--------------------------------------------------
Generating responses
100% 5200/5200 [2:56:33<00:00,  2.04s/it]
Responses saved as instruction-data-with-response-alpaca52k.json
Model saved as gpt2-medium355M-sft-alpaca52k.pth
```

Below are a few examples from the Alpaca dataset, including the generated model responses:

```json
[
    {
        "instruction": "Edit the following sentence to increase readability: \"He made a huge effort and was so successful.\"",
        "input": "",
        "output": "He exerted a tremendous effort, and thus enjoyed great success.",
        "model_response": "He put in an immense effort and was rewarded with success."
    },
    {
        "instruction": "Rewrite the following sentence to make it more concise: \"I was displeased with the result of the experiment that I conducted.\"",
        "input": "",
        "output": "I was unhappy with my experiment's outcome.",
        "model_response": "I was displeased with the results of the experiment."
    },
    {
        "instruction": "How can we build a more efficient GPT model?",
        "input": "",
        "output": "We can build a more efficient GPT model by optimizing the architecture of the model, using smaller model sizes and training with fewer parameters. We can also leverage techniques such as knowledge distillation, transfer learning, dynamic sparsity and hybrid computing to further improve the efficiency of the model.",
        "model_response": "Building a more efficient GPT model requires careful planning and optimization. First, it is important to identify the target language and the context in which the model is used. Then, it is important to select the appropriate model architecture, such as backpropagation, hyperparameters, and hyperparameters. Finally, it is important to select the appropriate model weights and optimizers, such as backpropagation, hyperparameters, and hyperparameters."
    }
]
```

Finally, we can evaluate the finetuned LLM using the [ollama_evaluate.py](ollama_evaluate.py) utility script:

```bash
python ollama_evaluate.py --file_path instruction-data-with-response-alpaca52k.json
```

```
Scoring entries: 100%|████████████████████| 5200/5200 [1:07:52<00:00, 1.28it/s]
Number of scores: 5188 of 5200
Average score: 48.16
```

The score is slightly lower than the score we obtained on the dataset we used in this chapter. However, note that the Alpaca test set contains more diverse and partly more challenging instructions than the dataset we used in the main chapter.

## Exercise 7.4: Parameter-efficient finetuning with LoRA

To instruction finetune the model using LoRA, use the relevant classes and functions from appendix E:

```python
from appendix_E import LoRALayer, LinearWithLoRA, replace_linear_with_lora
```

Next, add the following lines of code below the model loading code in section 7.5:


```python
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before: {total_params:,}")

# Freeze all original model weights
for param in model.parameters():
    param.requires_grad = False

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after: {total_params:,}")

# Replace the linear layers with LoRA layers; only the LoRA weights are trainable
replace_linear_with_lora(model, rank=16, alpha=16)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable LoRA parameters: {total_params:,}")
model.to(device)
```
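
The number of trainable LoRA parameters reported in the log output further below can be reproduced with a short back-of-the-envelope calculation. The sketch assumes the GPT-2 medium configuration (embedding dimension 1024, 24 transformer blocks, vocabulary size 50,257, and a feed-forward expansion factor of 4) and assumes that every linear layer, including the output head, receives a LoRA adapter consisting of two matrices with shapes `in_dim x rank` and `rank x out_dim`. The helper `lora_params` is hypothetical:

```python
def lora_params(in_dim, out_dim, rank):
    # One LoRA adapter adds matrices A (in_dim x rank) and B (rank x out_dim)
    return in_dim * rank + rank * out_dim


emb_dim, n_layers, vocab_size, rank = 1024, 24, 50257, 16

# Per transformer block: query, key, value, and output projections
attention = 4 * lora_params(emb_dim, emb_dim, rank)
# Per transformer block: two feed-forward layers (1024 -> 4096 -> 1024)
feed_forward = (
    lora_params(emb_dim, 4 * emb_dim, rank)
    + lora_params(4 * emb_dim, emb_dim, rank)
)
# Output head that maps embeddings to vocabulary logits
out_head = lora_params(emb_dim, vocab_size, rank)

total = n_layers * (attention + feed_forward) + out_head
print(f"Total trainable LoRA parameters: {total:,}")
```

Under these assumptions, the total is 7,898,384, which is roughly 2% of the 406,286,336 parameters that were trainable before freezing.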

For your convenience, you can use the `exercise_experiments.py` code to finetune the model, using LoRA with rank 16 and alpha 16, as follows:

```bash
python exercise_experiments.py --exercise_solution lora
```

Output:

```
matplotlib version: 3.7.1
tiktoken version: 0.7.0
torch version: 2.3.0+cu121
tqdm version: 4.66.4
tensorflow version: 2.15.0
--------------------------------------------------
Training set length: 935
Validation set length: 55
Test set length: 110
--------------------------------------------------
Device: cuda
--------------------------------------------------
File already exists and is up-to-date: gpt2/355M/checkpoint
File already exists and is up-to-date: gpt2/355M/encoder.json
File already exists and is up-to-date: gpt2/355M/hparams.json
File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/355M/model.ckpt.index
File already exists and is up-to-date: gpt2/355M/model.ckpt.meta
File already exists and is up-to-date: gpt2/355M/vocab.bpe
Loaded model: gpt2-medium (355M)
--------------------------------------------------
Total trainable parameters before: 406,286,336
Total trainable parameters after: 0
Total trainable LoRA parameters: 7,898,384
Initial losses
   Training loss: 3.7684114456176756
   Validation loss: 3.7619335651397705
Ep 1 (Step 000000): Train loss 2.509, Val loss 2.519
...
Ep 2 (Step 000230): Train loss 0.308, Val loss 0.652
...
--------------------------------------------------
Generating responses
100% 110/110 [01:52<00:00,  1.03s/it]
Responses saved as instruction-data-with-response-lora.json
Model saved as gpt2-medium355M-sft-lora.pth
```

For comparison, you can run the original chapter 7 finetuning code via `python exercise_experiments.py --exercise_solution baseline`.

Note that on an Nvidia L4 GPU, the code above, using LoRA, takes 1.30 minutes to run. In comparison, the baseline takes 1.80 minutes to run. So, LoRA is approximately 28% faster.

We can evaluate the performance using the Ollama Llama 3 method, which, for your convenience, is also implemented in the `ollama_evaluate.py` script that we can run as follows:

```bash
python ollama_evaluate.py --file_path instruction-data-with-response-lora.json
```

Output:

```
Ollama running: True
Scoring entries: 100%|████████████████████████| 110/110 [01:13<00:00,  1.50it/s]
Number of scores: 110 of 110
Average score: 50.23
```

The score is around 50, which is in the same ballpark as the original model.