# Chapter 6 - Prompt Engineering
Improving the quality of generated output through prompt engineering


In [1]:
# %%capture
# !pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"
# !CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

In [2]:
import os
os.environ["HF_HOME"] = "/openbayes/home/huggingface"


In [3]:
!echo $HF_HOME

/openbayes/home/huggingface


## 6.1 Loading the Model

In [7]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


model_name = "Qwen/Qwen2.5-7B-Instruct"
# model_name = "Qwen/Qwen2.5-0.5B-Instruct"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
# A simple prompt (Chinese for "Tell a joke about a cat and a dog.")
messages = [
    {"role": "user", "content": "讲一个猫和狗有关的笑话."}
]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

当然可以，这里有一个关于猫和狗的小笑话：

有一天，猫和狗决定去参加一场才艺大赛。猫非常自信地展示了自己的抓老鼠技能，而狗则兴奋地表演了接飞盘。

最后，评委宣布了结果：猫获得了第二名，而狗获得了第一名。

猫疑惑地问狗：“为什么我拿了第二名？”

狗得意地说：“因为我是唯一一个没有把飞盘还给主人的。”

这个笑话虽然简单，但通过对比猫和狗的特点，展现了一种轻松幽默的氛围。希望你喜欢！


In [12]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
讲一个猫和狗有关的笑话.<|im_end|>
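
The printed template follows the ChatML layout: each turn is wrapped in `<|im_start|>{role}` and `<|im_end|>`, and a default system turn is prepended when none is given. The sketch below rebuilds that layout by hand to make the structure explicit. It is an approximation for illustration only, not the tokenizer's actual template; `to_chatml`, `DEFAULT_SYSTEM`, and the English demo message are hypothetical. The real `apply_chat_template` also accepts `add_generation_prompt=True`, which appends the opening of an assistant turn.

```python
# Hand-written approximation of the ChatML layout printed above.
# This is NOT the tokenizer's real template; names here are hypothetical.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def to_chatml(messages, add_generation_prompt=False, default_system=DEFAULT_SYSTEM):
    # Prepend the default system turn when the caller did not supply one
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": default_system}] + list(messages)
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [{"role": "user", "content": "Tell a joke about a cat and a dog."}]
demo = to_chatml(demo_messages)
demo_with_generation_prompt = to_chatml(demo_messages, add_generation_prompt=True)
print(demo_with_generation_prompt)
```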



In [13]:
# Using a high temperature
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

当然可以，这里有一个关于猫和狗的小笑话：

为什么猫不喜欢上网？

因为它们总是被“狗狗视频”占用时间！


In [14]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

当然可以！这里有一个关于猫和狗的笑话：

为什么猫会嫉妒狗狗收到的生日卡上写着“最好的朋友”？

因为猫想，如果它也是“最好的朋友”，那就没人再把它当成“沙发的一部分”了！
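
To make the two sampling parameters concrete, here is a minimal sketch on made-up logits, assuming the usual definitions: temperature divides the logits before the softmax, and top_p keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold. The function names and the toy values are illustrative assumptions, not the pipeline's internals. Note that `top_p=1` keeps every token, so it applies no nucleus filtering at all.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    # Keep the most likely tokens until their cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(toy_logits, temperature=0.5)
hot = softmax_with_temperature(toy_logits, temperature=2.0)
print("temperature=0.5:", [round(p, 3) for p in cold])
print("temperature=2.0:", [round(p, 3) for p in hot])
print("top_p=0.7 keeps token indices:", sorted(top_p_filter([0.5, 0.3, 0.2], top_p=0.7)))
```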


## 6.2 Prompt Engineering

There are plenty of prompt engineering resources online. In my view, Andrew Ng's course materials at https://github.com/Kevin-free/chatgpt-prompt-engineering-for-developers cover most of what you need.

## 6.3 Advanced Prompt Engineering


### 6.3.1 A Complex, Detailed Prompt

In [24]:
# Text to summarize which we stole from https://jalammar.github.io/illustrated-transformer/ ;)
text = """In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
"""

# Prompt components
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"
tone = "The tone should be professional and clear.\n"
# text = "MY TEXT TO SUMMARIZE"  # Replace with your own text to summarize
data = f"Text to summarize: {text}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data
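
Since the point of splitting the prompt into components is to remove and add pieces and compare outputs, a small helper can make that ablation less error-prone. This is a sketch under assumptions: `build_prompt` and the shortened component texts are hypothetical, and it relies on Python dictionaries preserving insertion order.

```python
def build_prompt(components, skip=()):
    """Concatenate prompt components in insertion order, omitting any named in `skip`."""
    return "".join(text for name, text in components.items() if name not in skip)


# Shortened stand-ins for the components defined in the cell above
components = {
    "persona": "You are an expert in Large Language models.\n",
    "instruction": "Summarize the key findings of the paper provided.\n",
    "tone": "The tone should be professional and clear.\n",
    "data": "Text to summarize: <your text here>",
}

full = build_prompt(components)
no_persona = build_prompt(components, skip={"persona"})
print(no_persona)
```

Passing each variant to the model in turn shows how much a given component contributes to the final output.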

In [25]:
messages = [
    {"role": "user", "content": query}
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.
Summarize the key findings of the paper provided.
Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.
Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.
The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.
The tone should be professional and clear.
Text to summarize: In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attent

In [26]:
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

### Summary of Key Method

- **Model Overview**: The Transformer is a model that uses attention mechanisms to enhance the speed of training compared to traditional neural machine translation models.
- **Architecture**:
  - **Encoding Component**: Consists of a stack of identical encoders, each with two sub-layers: a self-attention layer and a feed-forward neural network.
  - **Decoding Component**: Consists of a stack of decoders, each also with two sub-layers: a self-attention layer and a feed-forward neural network.
  - **Connections**: Encoders and decoders are connected, allowing the decoder to attend to relevant parts of the encoded input.

- **Embedding**: Each input word is converted into a vector of size 512 using an embedding algorithm.
- **Parallelization**: The model's architecture allows for parallel processing, making it highly efficient for training and inference.

### Main Results

The Transformer model significantly improves upon existing neural machine translation syst

### 6.3.2 In-Context Learning: Providing Examples

In [27]:
# Use a single example of using the made-up word in a sentence
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:<|im_end|>
<|im_start|>assistant
I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.<|im_end|>
<|im_start|>user
To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:<|im_end|>



In [28]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

The warrior screegged the air with his sword, testing its balance and weight before the battle began.
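
The one-shot prompt above is an alternating list of user and assistant turns followed by the real query. When there are several examples, the same structure can be generated from pairs. A minimal sketch, where `few_shot_messages` is a hypothetical helper rather than part of any library:

```python
def few_shot_messages(examples, query):
    """Turn (user_text, assistant_text) pairs plus a final query into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model is expected to answer
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. "
        "An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    ),
]
query = (
    "To 'screeg' something is to swing a sword at it. "
    "An example of a sentence that uses the word screeg is:"
)
messages = few_shot_messages(examples, query)
print([m["role"] for m in messages])
```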


### 6.3.3 Chain Prompting: Breaking Up the Problem


In [29]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

Certainly! Here's a name and slogan for a chatbot that leverages Large Language Models (LLMs):

**Name:** LinguaBot

**Slogan:** "Unleash the Power of Language with LinguaBot"

This name and slogan aim to highlight the chatbot's ability to handle and generate human-like language through advanced LLM technology.


In [30]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]
outputs = pipe(sales_prompt)
sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

Introducing LinguaBot: Unleash the Power of Language with LinguaBot! Experience unparalleled conversation and language generation with our cutting-edge LLM technology.
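
The two cells above form a chain: the output of the first prompt is interpolated into the second. The pattern generalizes to any number of steps. The sketch below assumes a `generate` callable that maps a prompt string to a response string; `run_chain` and `fake_generate` are hypothetical names, and the stand-in generator exists only so the example runs without a model.

```python
def run_chain(generate, templates, initial_input):
    """Feed each step's output into the next template's {previous} slot."""
    outputs = []
    previous = initial_input
    for template in templates:
        prompt = template.format(previous=previous)
        previous = generate(prompt)
        outputs.append(previous)
    return outputs


def fake_generate(prompt):
    # Stand-in for a real model call
    return f"[response to: {prompt}]"


steps = [
    "Create a name and slogan for {previous}.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]
chain_outputs = run_chain(fake_generate, steps, "a chatbot that leverages LLMs")
for step_output in chain_outputs:
    print(step_output)
```

With the pipeline from this notebook, `generate` could be written as `lambda p: pipe([{"role": "user", "content": p}])[0]["generated_text"]`.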


## 6.4 Reasoning with Generative Models


### 6.4.1 Chain-of-Thought (CoT): Think Before Answering


In [31]:
# Answering without explicit reasoning
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Run generative model
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])

Let's break it down step by step:

1. The cafeteria started with 25 apples.
2. They used 20 apples to make lunch. So, we subtract 20 from 25:
   \[
   25 - 20 = 5
   \]
   Now, they have 5 apples left.
3. They then bought 6 more apples. So, we add 6 to the 5 apples they have left:
   \[
   5 + 6 = 11
   \]

Therefore, the cafeteria now has 11 apples.


In [32]:
# Answering with chain-of-thought
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(cot_prompt)
print(outputs[0]["generated_text"])

Let's break it down step by step:

1. The cafeteria started with 23 apples.
2. They used 20 apples to make lunch. So, we subtract 20 from 23:
   \[
   23 - 20 = 3
   \]
   After making lunch, they have 3 apples left.
3. They then bought 6 more apples. So, we add 6 to the 3 apples they have left:
   \[
   3 + 6 = 9
   \]

Therefore, the cafeteria now has 9 apples.


### 6.4.2 Zero-shot Chain-of-Thought


In [33]:
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

Sure, let's break this down step-by-step:

1. **Initial number of apples**: The cafeteria started with 23 apples.
2. **Apples used for lunch**: They used 20 apples to make lunch. So, we subtract 20 from the initial 23 apples:
   \[
   23 - 20 = 3
   \]
   After making lunch, they have 3 apples left.
3. **Apples bought**: They then bought 6 more apples. We add these 6 apples to the 3 apples they had left:
   \[
   3 + 6 = 9
   \]

So, after using some apples for lunch and buying more, the cafeteria now has 9 apples.
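
Step-by-step responses are longer than a bare answer, so comparing them against a known result needs a little parsing. The sketch below uses a simple heuristic, taking the last number in the response as the final answer. That assumption holds for the outputs shown in this section but is not guaranteed in general; `extract_final_number` is a hypothetical helper.

```python
import re


def extract_final_number(text):
    """Return the last integer or decimal in `text`, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    if not matches:
        return None
    value = float(matches[-1])
    return int(value) if value.is_integer() else value


# Tail of the response printed above
response = "3 + 6 = 9\n\nSo, after using some apples for lunch and buying more, the cafeteria now has 9 apples."
expected = 23 - 20 + 6  # the arithmetic stated in the question
answer = extract_final_number(response)
print("model answer:", answer, "| matches expected:", answer == expected)
```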


### 6.4.3 Tree-of-Thought: Exploring Intermediate Steps
> It is widely speculated that OpenAI o1 uses a tree-structured chain-of-thought process.


In [34]:
# Zero-shot Tree-of-Thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

In [35]:
# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

Certainly! Let's break down the problem step-by-step with three experts, each sharing one step of their thinking before moving on to the next.

### Expert 1: Initial Count and Usage

**Step 1:** The cafeteria started with 23 apples.
- **Expert 1's Thought:** "We need to account for the apples that were used first."

### Expert 2: Apples Used

**Step 2:** They used 20 apples to make lunch.
- **Expert 2's Thought:** "After using 20 apples, we need to subtract this from the initial count."

### Expert 3: Remaining Apples After Usage

**Step 3:** Calculate the remaining apples after usage.
- **Expert 3's Thought:** "23 - 20 = 3 apples left."

### Expert 1: New Purchase

**Step 4:** They bought 6 more apples.
- **Expert 1's Thought:** "Now we need to add these new apples to the remaining ones."

### Expert 2: Total Apples After Purchase

**Step 5:** Calculate the total number of apples after the purchase.
- **Expert 2's Thought:** "3 + 6 = 9 apples in total."

### Expert 3: Final Count

**S

## 6.5 Output Verification

### 6.5.1 Providing Examples

In [36]:
# Zero-shot learning: Providing no examples
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]

# Generate the output
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])

Certainly! Below is a character profile for an RPG game in JSON format:

```json
{
  "characterName": "Aelion",
  "race": "Elven Mage",
  "class": "Arcane Sorcerer",
  "level": 5,
  "alignment": "Chaotic Good",
  "strength": 12,
  "dexterity": 16,
  "constitution": 14,
  "intelligence": 20,
  "wisdom": 18,
  "charisma": 17,
  "hitPoints": 35,
  "experiencePoints": 1200,
  "equipment": [
    {
      "item": "Staff of Arcane Power",
      "description": "A staff that amplifies the user's magical abilities."
    },
    {
      "item": "Cloak of Elvenkind",
      "description": "A shimmering cloak that grants the wearer the ability to blend into shadows."
    }
  ],
  "spells": [
    {
      "name": "Fireball",
      "level": 3,
      "description": "Summons a ball of fire that explodes on impact, dealing damage to all enemies within a radius."
    },
    {
      "name": "Invisibility",
      "level": 2,
      "description": "Makes the caster invisible for a short duration, allowing them t

In [37]:
# One-shot learning: Providing an example of the output structure
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

{
  "description": "A skilled archer with a keen eye and steady hand, often seen roving the wilderness in search of prey or enemies to hunt down from afar.",
  "name": "Eira Windwhisper",
  "armor": "Leather Vest",
  "weapon": "Longbow, Quiver of Arrows"
}
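
Showing the model a format makes the desired structure likely, not guaranteed, so it is worth checking the response before using it. A minimal sketch of such a check, assuming the four keys from the template above are exactly the ones required; `validate_profile` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = {"description", "name", "armor", "weapon"}


def validate_profile(raw):
    """Return (profile, problems); profile is None when `raw` is not a JSON object."""
    try:
        profile = json.loads(raw)
    except json.JSONDecodeError as err:
        return None, [f"invalid JSON: {err.msg}"]
    if not isinstance(profile, dict):
        return None, ["top-level value is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - set(profile)
    extra = set(profile) - REQUIRED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return profile, problems


good = '{"description": "A skilled archer.", "name": "Eira Windwhisper", "armor": "Leather Vest", "weapon": "Longbow"}'
truncated = '{"description": "A skilled archer.", "name": "Eira'

good_profile, good_problems = validate_profile(good)
bad_profile, bad_problems = validate_profile(truncated)
print(good_problems)
print(bad_problems)
```

A failed check can trigger a retry or a fallback instead of passing malformed output downstream.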


### 6.5.2 Grammar: Constrained Sampling


In [38]:
import gc
import torch
del model, tokenizer, pipe

# Flush memory
gc.collect()
torch.cuda.empty_cache()

In [41]:
from llama_cpp.llama import Llama

# Load Qwen2.5-0.5B-Instruct (GGUF, fp16)
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct-GGUF",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

qwen2.5-0.5b-instruct-fp16.gguf:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

In [42]:
# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]


In [43]:
import json

# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "name": "Gandalf",
    "description": "The most powerful and feared wizard of Middle-earth, Gandalf is known for his wisdom, wit, and magical abilities.",
    "background": [
        {
            "name": "Origin",
            "description": "Born into a noble family in the Kingdom of Gondor"
        },
        {
            "name": "Career",
            "description": "Travelled to Middle-earth as a wizard"
        },
        {
            "name": "Powers",
            "description": "Divine Inspiration, Magical Healing"
        }
    ],
    "characteristics": [
        {
            "name": "Wisdom",
            "description": "Gandalf is known for his deep understanding of the world and its creatures. He has a keen eye for detail and can see things that others miss."
        },
        {
            "name": "Courage",
            "description": "Gandalf is a brave and fearless wizard who never backs down from a challenge or threat."
        }
    ],
    "achievements": [
     