<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Generating an Instruction Dataset via Llama 3 and Ollama

- This notebook uses an 8-billion-parameter Llama 3 model through ollama to generate a synthetic dataset using the "hack" proposed in the "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" paper ([https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464))

- The generated dataset will be an instruction dataset with "instruction" and "output" fields, similar to what can be found in Alpaca:


```python
{
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2.",
},
```
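
- As a minimal sketch (not part of the original notebook), the entry format above can be checked with a small helper before entries are added to a dataset. The name `is_valid_entry` is a hypothetical choice; it only assumes that both fields should be non-empty strings:

```python
def is_valid_entry(entry):
    """Return True if entry has non-empty string 'instruction' and 'output' fields."""
    if not isinstance(entry, dict):
        return False
    for key in ("instruction", "output"):
        value = entry.get(key)
        # Reject missing keys, non-string values, and whitespace-only strings
        if not isinstance(value, str) or not value.strip():
            return False
    return True


sample = {
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2.",
}

print(is_valid_entry(sample))
print(is_valid_entry({"instruction": "Only an instruction"}))
```

- The first call should report a valid entry, and the second should not because the "output" field is missing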

- The code doesn't require a GPU and runs on a laptop (it was tested on an M3 MacBook Air)

*Note that the instruction datasets created here are for educational purposes. However, it is the user's responsibility to ensure that their use adheres to the terms of the relevant licensing agreements for Meta AI's Llama 3.*

In [1]:
from importlib.metadata import version

pkgs = [
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.4


## Installing Ollama and Downloading Llama 3

- Ollama is an application to run LLMs efficiently
- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency
- Note that it is a tool for using LLMs to generate text (inference), not for training or finetuning LLMs
- Prior to running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the "Download" button and downloading the ollama application for your operating system)

- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say "yes"
- Linux users can use the installation command provided on the ollama website

- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1">
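
- As a rough sketch (not part of the original notebook), one way to check from Python whether the ollama server is up is to send a plain HTTP request to it. The helper name `ollama_is_reachable` is hypothetical, the port 11434 is taken from the API URL used later in this notebook, and the assumption that the server answers a request to its root path with status 200 should be verified against your ollama version:

```python
import urllib.error
import urllib.request


def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


if ollama_is_reachable():
    print("Ollama appears to be running")
else:
    print("Ollama is not reachable; start the application or run `ollama serve`")
```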


- With the ollama application or `ollama serve` running, execute the following command in a different terminal to try out the 8-billion-parameter Llama 3 model (the model, which takes up 4.7 GB of storage space, will be downloaded automatically the first time you execute this command)

```bash
# 8B model
ollama run llama3
```


The output looks as follows:

```
$ ollama run llama3
pulling manifest 
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         
pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success 
```

- Note that `llama3` refers to the instruction-finetuned 8-billion-parameter Llama 3 model

- Alternatively, you can also use the larger 70-billion-parameter Llama 3 model, if your machine supports it, by replacing `llama3` with `llama3:70b`

- After the download has completed, you will see a command line prompt that allows you to chat with the model

- Try a prompt like "What do llamas eat?", which should return an output similar to the following:

```
>>> What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered 
stomach and eat plants that are high in fiber. In the wild, llamas 
typically feed on:
1. Grasses: They love to graze on various types of grasses, including tall 
grasses, wheat, oats, and barley.
```

- You can end this session using the input `/bye`

## Using Ollama's REST API

- Now, an alternative way to interact with the model is through its REST API, using the Python function below
- Before you run the next cells in this notebook, make sure that ollama is still running, as described above, via
  - `ollama serve` in a terminal
  - the ollama application
- Next, run the following code cell to query the model

- First, let's try the API with a simple example to make sure it works as intended:

In [2]:
import urllib.request
import json

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat", role="user"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "seed": 123,        # fixed random seed
        "temperature": 1.,   # sampling temperature
        "top_p": 1,         
        "messages": [
            {"role": role, "content": prompt}
        ]
    }

    # Convert the dictionary to a JSON-formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding the necessary headers
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the streamed response line by line
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data
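
- To illustrate how the loop in `query_model` assembles its return value, here is a small offline sketch (not part of the original notebook). It assumes the server streams one JSON object per line, each carrying a piece of the reply under `["message"]["content"]`; the helper name `join_streamed_chunks` and the sample lines, including the `"done"` field, are made up for illustration

```python
import json


def join_streamed_chunks(lines):
    """Concatenate the message content from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


# Hypothetical chunks, shaped like the lines that query_model reads
fake_stream = [
    '{"message": {"role": "assistant", "content": "Llamas "}, "done": false}',
    '{"message": {"role": "assistant", "content": "eat grass."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(join_streamed_chunks(fake_stream))  # prints "Llamas eat grass."
```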

In [3]:
result = query_model("What do Llamas eat?")
print(result)

Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: They enjoy eating hay, such as alfalfa or timothy hay, which provides them with fiber, protein, and other essential nutrients.
3. Grains: Llamas may eat grains like oats, barley, or corn as a supplement to their diet.
4. Leaves: They will also munch on leaves from trees and shrubs, including clover, alfalfa, and various types of leaves.
5. Fruits and vegetables: In the wild, llamas might eat fruits and vegetables that grow in their natural habitat, such as apples, carrots, or potatoes.

In general, a llama's diet should consist of:

* 50% grasses and hay
* 20% grains (like oats or corn)
* 10% leaves and other plant material
* 5% fruits and vegetables (as treats)

It's essential to provide llamas with a balanced diet that meets their nutritional n

## Extract Instructions

- Now, let's use the "hack" proposed in the paper: we provide the empty prompt template `"<|begin_of_text|><|start_header_id|>user<|end_header_id|>"`, which will cause the instruction-finetuned Llama 3 model to generate an instruction
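
- To see why this can work, it helps to compare the empty template with a complete user turn. The sketch below is not part of the original notebook; the helper `format_llama3_user_turn` is hypothetical and reflects my understanding of the Llama 3 chat format, so treat the exact token layout as an assumption. The point is that the empty template is exactly the prefix that normally precedes a user message, so a model continuing from it is prompted to fill in the user's text

```python
def format_llama3_user_turn(content):
    """Format a single user message in the (assumed) Llama 3 chat layout."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{content}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# The empty template from the text above
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"

full_prompt = format_llama3_user_turn("What do llamas eat?")

# The empty template is a prefix of a full user turn
print(full_prompt.startswith(pre_query))  # prints True
```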

In [4]:
def extract_instruction(text):
    # Return the first non-blank line of the model's reply
    for content in text.split("\n"):
        content = content.strip()
        if content:
            return content

In [5]:
query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"

result = query_model(query, role="assistant")
instruction = extract_instruction(result)
print(instruction)

I am trying to find a way to make my child's birthday party more special and unique. What are some creative ideas you have?


- As we can see above, surprisingly, the model indeed generated an instruction

## Generate Responses

- Now, the next step is to create the corresponding response, which can be done by simply passing the instruction as input
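
- The extraction step keeps only the first non-blank line of the reply, which means it can come back empty if the model returns nothing usable. The following sketch (not part of the original notebook) uses a hypothetical stand-alone helper, `extract_first_line`, to show that behavior on made-up replies

```python
def extract_first_line(text):
    """Return the first non-blank line of text, or None if there is none."""
    for line in text.split("\n"):
        line = line.strip()
        if line:
            return line
    return None


# Made-up reply with a leading blank line and a trailing second line
reply = "\n  How do I bake sourdough bread?  \nHere is a follow-up line."

print(extract_first_line(reply))   # prints "How do I bake sourdough bread?"
print(extract_first_line("\n\n"))  # prints None
```

- When generating many samples, a `None` result like the second one could simply be skipped before querying the model for a response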

In [6]:
response = query_model(instruction, role="user")
print(response)

What an exciting question! I'd be delighted to help you come up with some creative and unique ideas to make your child's birthday party truly special!

Here are a few ideas to get you started:

1. **Themed Scavenger Hunt**: Plan a scavenger hunt based on the birthday child's favorite theme (e.g., superheroes, animals, or princesses). Hide clues and challenges throughout the party area, leading up to a final surprise.
2. **DIY Crafts Station**: Set up a craft station where kids can create their own party favors, such as customized t-shirts, crowns, or jewelry. This activity encourages creativity and makes for a memorable keepsake.
3. **Mystery Box Challenge**: Fill mystery boxes with different textures, smells, and sounds. Have the kids guess what's inside each box without looking. This game promotes problem-solving and teamwork.
4. **Indoor Camping Adventure**: Set up a cozy indoor "camping" area with sleeping bags, flashlights, and s'mores-making stations. Kids can enjoy a camping exp

## Generate Dataset

- We can scale up this approach to an arbitrary number of data samples (you may want to apply some optional filtering by length or quality, e.g., using another LLM to rate the generated data)
- Below, we generate 5 synthetic instruction-response pairs, which takes about 3 minutes on an M3 MacBook Air
- (To generate a dataset suitable for instruction finetuning, we would want to increase this to at least 1k to 50k and perhaps run it on a GPU to generate the examples in a more timely fashion)
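
- The optional length filtering mentioned above could look roughly like the following sketch, which is not part of the original notebook. The function name `filter_entries` and the thresholds `min_chars` and `max_chars` are hypothetical choices; the sketch also drops entries with empty outputs and case-insensitive duplicate instructions, which is an additional assumption about what "filtering" should cover

```python
def filter_entries(entries, min_chars=10, max_chars=2000):
    """Keep entries with a reasonably sized, unique instruction and a non-empty output."""
    seen = set()
    kept = []
    for entry in entries:
        instruction = (entry.get("instruction") or "").strip()
        output = (entry.get("output") or "").strip()

        # Drop instructions that are too short or too long
        if not (min_chars <= len(instruction) <= max_chars):
            continue
        # Drop entries without a response
        if not output:
            continue
        # Drop repeated instructions (case-insensitive)
        key = instruction.lower()
        if key in seen:
            continue

        seen.add(key)
        kept.append({"instruction": instruction, "output": output})
    return kept


# Made-up candidates for illustration
candidates = [
    {"instruction": "What is the atomic number of helium?",
     "output": "The atomic number of helium is 2."},
    {"instruction": "what is the atomic number of helium?", "output": "2"},  # duplicate
    {"instruction": "Hi", "output": "Hello!"},                               # too short
    {"instruction": "Explain photosynthesis briefly.", "output": ""},        # no output
]

print(len(filter_entries(candidates)))  # prints 1
```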

**Tip**

- You can generate even higher-quality responses by changing `model="llama3"` to `model="llama3:70b"`; however, this will require more computational resources

In [7]:
from tqdm import tqdm

dataset_size = 5
dataset = []

for i in tqdm(range(dataset_size)):

    result = query_model(query, role="assistant")
    instruction = extract_instruction(result)
    response = query_model(instruction, role="user")
    entry = {
        "instruction": instruction,
        "output": response
    }
    dataset.append(entry)

100%|█████████████████████████████████████████████| 5/5 [02:37<00:00, 31.41s/it]


In [8]:
with open("instruction-data-llama3-7b.json", "w") as file:
    json.dump(dataset, file, indent=4)

In [9]:
!cat instruction-data-llama3-7b.json

[
    {
        "instruction": "What is the significance of the number 7 in various cultures and religions?",
        "output": "The number 7 has been a significant and recurring theme across many cultures and religions, often imbuing it with special meaning and symbolism. Here are some examples:\n\n1. **Numerology**: In numerology, the number 7 is considered sacred and mystical, associated with spiritual awakening, introspection, and enlightenment.\n2. **Judaism**: The Torah has seven days of creation, seven weeks in the wilderness, and seven years of rest (Sabbatical year). Seven is also a symbol of completion or perfection.\n3. **Christianity**: In Christianity, there are seven deadly sins, seven virtues, and seven sacraments. Jesus was said to have spoken seven sermons, and the number 7 appears in various biblical accounts, such as the seven days of creation and the seven angels who appear before God.\n4. **Islam**: In Islamic tradition, there are seven heavens, seven earths, and s