<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp?1" width="100px"></a>
</td>
</tr>
</table>

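# Generating an Instruction Dataset via Llama 3 and Ollama

- This notebook uses an 8-billion-parameter Llama 3 model through ollama to generate a synthetic dataset, using the "hack" proposed in the "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" paper ([https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464))

- The generated dataset will be an instruction dataset with "instruction" and "output" fields, similar to what can be found in Alpaca: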


```python
{
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2.",
},
```

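- This code doesn't require a GPU and runs on a laptop (it was tested on an M3 MacBook Air)

*Note that the instruction datasets created here are for educational purposes. However, it is the user's responsibility to ensure that their use adheres to the terms of the relevant licensing agreements with Meta AI's Llama 3.*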
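
- As a minimal sketch of the record format described above, the check below tests whether an entry carries non-empty "instruction" and "output" strings; the helper name `is_valid_entry` is hypothetical and not part of the original notebook

```python
def is_valid_entry(entry):
    """Return True if `entry` is a dict with non-empty string
    "instruction" and "output" fields (hypothetical helper)."""
    if not isinstance(entry, dict):
        return False
    for key in ("instruction", "output"):
        value = entry.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


example = {
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2.",
}

print(is_valid_entry(example))
print(is_valid_entry({"instruction": "Only an instruction"}))
```

- A check like this could be applied to every generated entry before the dataset is saved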

In [1]:
from importlib.metadata import version

pkgs = [
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.4


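## Installing Ollama and Downloading Llama 3

- Ollama is an application for running LLMs efficiently
- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency
- Note that it is a tool for generating text with LLMs (inference), not for training or finetuning LLMs
- Before running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, click the "Download" button and download the ollama application for your operating system)

- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, answer "yes"
- Linux users can use the installation command provided on the ollama website

- In general, before we can use ollama from the command line, we have to either start the ollama application or run `ollama serve` in a separate terminal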

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1">
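
- One hedged way to confirm from Python that the server is reachable is to send a plain HTTP request to its base URL; the sketch below assumes the default address `http://localhost:11434` (the same host and port used by the REST API call later in this notebook), and the helper name `is_ollama_running` is hypothetical

```python
import http.client
import urllib.request


def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at `base_url`.

    This only checks reachability; it does not verify that a model is loaded.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        return False


if not is_ollama_running():
    print("Ollama does not seem to be running; start the app or run `ollama serve`.")
```

- The check uses only the standard library, so it can run before any third-party package is imported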


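- With the ollama application or `ollama serve` running, open a different terminal and execute the following command to try out the 8-billion-parameter Llama 3 model (the model takes up 4.7 GB of storage space and is downloaded automatically the first time you execute this command)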

```bash
# 8B model
ollama run llama3
```


The output looks as follows:

```
$ ollama run llama3
pulling manifest 
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         
pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success 
```

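- Note that `llama3` refers to the instruction-finetuned 8-billion-parameter Llama 3 model

- Alternatively, if your machine supports it, you can use the larger 70-billion-parameter Llama 3 model by replacing `llama3` with `llama3:70b`

- After the download has completed, you will see a command line prompt that allows you to chat with the model

- Try a prompt like "What do llamas eat?", which should return output similar to the following: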

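```
>>> What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered
stomach and eat plants that are high in fiber. In the wild, llamas
typically feed on:
1. Grasses: They love to graze on various types of grasses, including tall
grasses, wheat, oats, and barley.
```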

- You can end this session by entering `/bye`

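## Using Ollama's REST API

- An alternative way to interact with the model is through its REST API from Python, using the function below
- Before running the next cells in this notebook, make sure that ollama is still running, as described above, via either
  - `ollama serve` in a terminal
  - the ollama application
- Next, run the following code cell to query the model

- First, try the API with a simple example to make sure it works as intended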

In [2]:
import json
import requests

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat", role="user"):
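    # Create the data payload as a dictionary
    # NOTE: Ollama's API docs list sampling parameters (seed, temperature, top_p)
    # under an "options" key; as top-level keys they may be ignored by the server
    data = {
        "model": model,
        "seed": 123,        # fixed seed, intended for reproducibility
        "temperature": 1.,  # sampling temperature; 1.0 does not make outputs deterministic
        "top_p": 1,
        "messages": [
            {"role": role, "content": prompt}
        ]
    }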

    # Send the POST request
    with requests.post(url, json=data, stream=True, timeout=30) as r:
        r.raise_for_status()
        response_data = ""
        for line in r.iter_lines(decode_unicode=True):
            if not line:
                continue
            response_json = json.loads(line)
            if "message" in response_json:
                response_data += response_json["message"]["content"]

    return response_data

In [3]:
result = query_model("What do Llamas eat?")
print(result)

Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: They enjoy eating hay, such as alfalfa or timothy hay, which provides them with fiber, protein, and other essential nutrients.
3. Grains: Llamas may eat grains like oats, barley, or corn as a supplement to their diet.
4. Leaves: They will also munch on leaves from trees and shrubs, including clover, alfalfa, and various types of leaves.
5. Fruits and vegetables: In the wild, llamas might eat fruits and vegetables that grow in their natural habitat, such as apples, carrots, or potatoes.

In general, a llama's diet should consist of:

* 50% grasses and hay
* 20% grains (like oats or corn)
* 10% leaves and other plant material
* 5% fruits and vegetables (as treats)

It's essential to provide llamas with a balanced diet that meets their nutritional n
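
- To illustrate how `query_model` assembles its return value, the sketch below applies the same line-by-line JSON parsing to a few hand-written lines instead of a live HTTP stream; the sample lines (including the `"done"` field) are invented for illustration, and the helper name `collect_streamed_content` is hypothetical

```python
import json


def collect_streamed_content(lines):
    """Join the "message" -> "content" pieces from an iterable of JSON lines."""
    pieces = []
    for line in lines:
        if not line:
            continue  # skip empty lines, as query_model does
        chunk = json.loads(line)
        if "message" in chunk:
            pieces.append(chunk["message"]["content"])
    return "".join(pieces)


# Made-up sample lines shaped like the chunks that query_model expects
sample_lines = [
    '{"message": {"role": "assistant", "content": "Llamas "}, "done": false}',
    "",
    '{"message": {"role": "assistant", "content": "eat grass."}, "done": false}',
    '{"done": true}',
]

print(collect_streamed_content(sample_lines))
```

- Lines without a "message" key are skipped, which mirrors the `if "message" in response_json` check in `query_model`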

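## Extracting Instructions

- Now, we use the "hack" proposed in the paper: we provide the empty prompt template `"<|begin_of_text|><|start_header_id|>user<|end_header_id|>"` as the prompt, which causes the instruction-finetuned Llama 3 model to generate an instruction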
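
- To see why the empty template can elicit an instruction, it may help to compare it with a completed user turn; the sketch below assumes the Llama 3 chat layout, in which a message follows the role header after two newlines and ends with `<|eot_id|>`, and the helper names are hypothetical

```python
BEGIN = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"  # assumed end-of-turn token of the Llama 3 chat format


def pre_query_template(role="user"):
    """Header that opens a turn for `role` but contains no message yet."""
    return f"{BEGIN}{START_HEADER}{role}{END_HEADER}"


def full_user_turn(instruction):
    """A completed user turn followed by an opened assistant turn."""
    return (
        f"{pre_query_template('user')}\n\n{instruction}{END_OF_TURN}"
        f"{START_HEADER}assistant{END_HEADER}\n\n"
    )


print(pre_query_template())
print(repr(full_user_turn("What do llamas eat?")))
```

- In the empty template, the text stops right where a user message would normally begin, so the model is left to continue with something that looks like a user instruction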

In [4]:
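def extract_instruction(text):
    # Return the first non-blank line of the model output (the generated instruction)
    for content in text.split("\n"):
        content = content.strip()
        if content:
            return content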

In [5]:
query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"

result = query_model(query, role="assistant")
instruction = extract_instruction(result)
print(instruction)

I am trying to find a way to make my child's birthday party more special and unique. What are some creative ideas you have?


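- As shown above, surprisingly, the model did generate an instruction

## Generating Responses

- The next step is to create the corresponding response, which can be done by simply passing the instruction as input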

In [6]:
response = query_model(instruction, role="user")
print(response)

What an exciting question! I'd be delighted to help you come up with some creative and unique ideas to make your child's birthday party truly special!

Here are a few ideas to get you started:

1. **Themed Scavenger Hunt**: Plan a scavenger hunt based on the birthday child's favorite theme (e.g., superheroes, animals, or princesses). Hide clues and challenges throughout the party area, leading up to a final surprise.
2. **DIY Crafts Station**: Set up a craft station where kids can create their own party favors, such as customized t-shirts, crowns, or jewelry. This activity encourages creativity and makes for a memorable keepsake.
3. **Mystery Box Challenge**: Fill mystery boxes with different textures, smells, and sounds. Have the kids guess what's inside each box without looking. This game promotes problem-solving and teamwork.
4. **Indoor Camping Adventure**: Set up a cozy indoor "camping" area with sleeping bags, flashlights, and s'mores-making stations. Kids can enjoy a camping exp

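## Generating the Dataset

- We can extend this approach to an arbitrary number of data samples (you may want to apply some optional filtering for length or quality, for example by using another LLM to rate the generated data)
- Below, we generate 5 synthetic instruction-response pairs, which takes about 3 minutes on an M3 MacBook Air
- (To generate a dataset suitable for instruction finetuning, increase this to somewhere between 1k and 50k examples, and perhaps run it on a GPU to generate the examples in a more timely fashion)

**Tip**

- You can generate higher-quality responses by changing `model="llama3"` to `model="llama3:70b"`; however, this requires more computational resources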
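
- The optional length filtering mentioned above could look like the sketch below, which also drops entries with empty outputs and duplicate instructions; the thresholds `min_chars` and `max_chars` and the helper name `filter_entries` are illustrative assumptions, not values from the paper or the notebook

```python
def filter_entries(entries, min_chars=10, max_chars=2000):
    """Drop entries with too short, too long, or duplicate instructions,
    as well as entries whose output is empty (hypothetical helper)."""
    seen = set()
    kept = []
    for entry in entries:
        instruction = (entry.get("instruction") or "").strip()
        output = (entry.get("output") or "").strip()
        if not (min_chars <= len(instruction) <= max_chars):
            continue
        if not output:
            continue
        key = instruction.lower()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append({"instruction": instruction, "output": output})
    return kept


raw = [
    {"instruction": "What do llamas eat?", "output": "Mostly grasses and hay."},
    {"instruction": "what do llamas eat?", "output": "Grass."},  # duplicate
    {"instruction": "Hi", "output": "Hello!"},                   # too short
    {"instruction": "Explain photosynthesis.", "output": ""},    # empty output
]

filtered = filter_entries(raw)
print(len(filtered))
```

- Quality scoring with another LLM would be a separate step; this sketch covers only the cheap, rule-based checks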

In [7]:
from tqdm import tqdm

dataset_size = 5
dataset = []

for i in tqdm(range(dataset_size)):

    result = query_model(query, role="assistant")
    instruction = extract_instruction(result)
    response = query_model(instruction, role="user")
    entry = {
        "instruction": instruction,
        "output": response
    }
    dataset.append(entry)

100%|█████████████████████████████████████████████| 5/5 [02:37<00:00, 31.41s/it]


In [8]:
with open("instruction-data-llama3-7b.json", "w") as file:
    json.dump(dataset, file, indent=4)

In [9]:
!cat instruction-data-llama3-7b.json

[
    {
        "instruction": "What is the significance of the number 7 in various cultures and religions?",
        "output": "The number 7 has been a significant and recurring theme across many cultures and religions, often imbuing it with special meaning and symbolism. Here are some examples:\n\n1. **Numerology**: In numerology, the number 7 is considered sacred and mystical, associated with spiritual awakening, introspection, and enlightenment.\n2. **Judaism**: The Torah has seven days of creation, seven weeks in the wilderness, and seven years of rest (Sabbatical year). Seven is also a symbol of completion or perfection.\n3. **Christianity**: In Christianity, there are seven deadly sins, seven virtues, and seven sacraments. Jesus was said to have spoken seven sermons, and the number 7 appears in various biblical accounts, such as the seven days of creation and the seven angels who appear before God.\n4. **Islam**: In Islamic tradition, there are seven heavens, seven earths, and s