### completion vs. chat

```python
prompt = f"Q: {question}\nA:"

# You can also try few-shot prompting by providing a few examples
prompt = f"""
Q: What is the capital of Spain?
A: Madrid

Q: What is the capital of Germany?
A: Berlin

Q: {question}
A:
"""
```
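
The few-shot prompt above can also be assembled from a list of examples. The following is a minimal sketch, not part of the original notebook: `build_few_shot_prompt` and the sample question are hypothetical names chosen for illustration. It ends the prompt exactly at `A:` with no trailing newline, so the completion model continues on the answer line.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs as Q:/A: blocks, ending with an open 'A:'."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block has no answer: the model is expected to fill it in.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [
    ("What is the capital of Spain?", "Madrid"),
    ("What is the capital of Germany?", "Berlin"),
]
few_shot_prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(few_shot_prompt)
```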

```python
prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
prompt += "<|im_start|>assistant\n"  # the model starts generating from here
```
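
The same ChatML layout generalizes to any list of role/content messages. The sketch below is a hand-rolled illustration under that assumption; `build_chatml_prompt` is a hypothetical helper, not a library function. In practice the tokenizer's own `apply_chat_template` method is the usual way to render a chat prompt, and a model's real template may add details (such as a default system prompt) that this sketch does not reproduce.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render role/content messages in the <|im_start|>...<|im_end|> layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so that generation starts inside it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
chatml_prompt = build_chatml_prompt(messages)
print(chatml_prompt)
```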

In [1]:
from transformers import AutoTokenizer



In [2]:
base_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B')
instruct_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')

print(base_tokenizer.chat_template)

In [3]:
def make_prefix(numbers, target, template_type):
    # NOTE: also need to change reward_score/countdown.py
    if template_type == 'base':
        # Follows DeepSeek-R1-Zero; this plain-text format works for any base model.
        prefix = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>"""
    elif template_type == 'qwen-instruct':
        # This ChatML format works for Qwen Instruct models.
        prefix = f"""<|im_start|>system\nYou are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>\n<|im_start|>user\n Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>\n<|im_start|>assistant\nLet me solve this step by step.\n<think>"""
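    else:
        # Fail loudly instead of hitting an unbound `prefix` below.
        raise ValueError(f"unknown template_type: {template_type!r}")
    return prefix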

In [4]:
numbers = [ 44, 19, 35 ]
target = 99

In [5]:
base_prompt = make_prefix(numbers, target, 'base')
print(base_prompt)

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>


In [6]:
instruct_prompt = make_prefix(numbers, target, 'qwen-instruct')
print(instruct_prompt)

<|im_start|>system
You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>
<|im_start|>user
 Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>


### base model inference

In [7]:
from vllm import LLM, SamplingParams



In [8]:
sampling_params = SamplingParams(
    temperature=0.6, 
    max_tokens=1024
)
base_llm = LLM(model='Qwen/Qwen2.5-3B', max_model_len=1024)

INFO 04-07 21:53:52 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='Qwen/Qwen2.5-3B', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-3B, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_c

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.42it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.70it/s]


INFO 04-07 21:53:57 model_runner.py:1071] Loading model weights took 5.7915 GB





INFO 04-07 21:53:58 gpu_executor.py:122] # GPU blocks: 25179, # CPU blocks: 7281
INFO 04-07 21:53:58 gpu_executor.py:126] Maximum concurrency for 1024 tokens per request: 393.42x
INFO 04-07 21:54:00 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-07 21:54:00 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-07 21:54:10 model_runner.py:1530] Graph capturing finished in 10 secs.


In [9]:
base_resp = base_llm.generate(base_prompt, sampling_params)[0]
print(base_resp.outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.02s/it, est. speed input: 139.76 toks/s, output: 111.22 toks/s]

 We need to use the numbers 44, 19, and 35 exactly once to create an equation that equals 99. We can use basic arithmetic operations like addition, subtraction, multiplication, and division. Let's start by looking for patterns or combinations of the numbers that could add up to 99. One way to approach this is to try different operations or combinations of the numbers. </think>
The final answer is: <answer> 44 + 35 + 19 = 99 </answer>
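
The response above is well formed, but the arithmetic is off: 44 + 35 + 19 is 98, not 99. A check along the following lines catches this. It is a minimal sketch and not the `reward_score/countdown.py` mentioned in `make_prefix`; the names `extract_answer`, `safe_eval`, and `check_countdown` are hypothetical, and the rule that each number must appear exactly once is an assumption taken from the prompt wording.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_answer(text):
    """Return the contents of the last <answer>...</answer> pair, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic by walking the AST, without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def check_countdown(text, numbers, target):
    """True if the answer uses each number exactly once and equals target."""
    answer = extract_answer(text)
    if answer is None:
        return False
    expr = answer.split("=")[0].strip()  # drop a trailing "= 99", if any
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return False
    try:
        value = safe_eval(expr)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6


response = "... </think>\nThe final answer is: <answer> 44 + 35 + 19 = 99 </answer>"
print(check_countdown(response, [44, 19, 35], 99))  # False: the expression equals 98
```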





In [10]:
# The prompt is kept as it was run; "captail" is a typo for "capital".
test_resp = base_llm.generate('The captail of China is', sampling_params)[0]
print(test_resp.outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.93s/it, est. speed input: 1.53 toks/s, output: 115.86 toks/s]

 Beijing.____
A. The capital of China is Beijing.
B. Beijing is the capital of China.
C. The capital of China is Beijing.
D. Beijing is the capital of China.
Answer:
D

The most abundant element in the Earth's crust is ____
A. Oxygen
B. Silicon
C. Aluminum
D. Iron
Answer:
A

Which of the following explanations of the emphasized words in the sentences is incorrect?
A. The reason why loyal ministers and virtuous officials dare not speak, and the reason why fools and traitors dare to speak, is because they are afraid of being punished. Punishment: Punishment.
B. If you want to know the truth, I will tell you. Know: Understand.
C. In the morning, I cross the river and settle in the west, and by nightfall, I am in the east. Cross: Cross.
D. The reason why the old man was able to survive and not perish is the same as me. Pity: Like.
Answer:
A

The starting point of human life is ____
A. Fertilized egg
B. Embryo
C. Infant
D. Newborn
Answer:
A

The solution set for the inequality x^{2}-2x-3>0 
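
The base model answers the prompt and then keeps going with unrelated exam questions until it reaches `max_tokens`, because nothing tells it where the turn ends. One common mitigation is to cut the completion at stop strings; vLLM's `SamplingParams` accepts a `stop` list for this. The pure-Python sketch below only illustrates the effect on the text above; `truncate_at_stop` is a hypothetical helper, and the choice of stop strings is an assumption.

```python
def truncate_at_stop(text, stop_strings):
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = " Beijing.____\nA. The capital of China is Beijing.\nB. Beijing is the capital of China."
print(repr(truncate_at_stop(raw, ["\n", "____"])))  # ' Beijing.'
```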




In [11]:
test_resp = base_llm.generate('My name is', sampling_params)[0]
print(test_resp.outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.51s/it, est. speed input: 0.35 toks/s, output: 120.17 toks/s]

 Tom. I am a student. I am in Class Two, Grade Eight. This is my friend, Jack. He is a student, too. He is in Class One, Grade Eight. My Chinese teacher is Mr. Zhang. He is a good teacher. He likes us very much. My English teacher is Miss. Wang. She is very young. She is good with us. She likes us, too. We like them. 根据短文内容，判断正误（正确的写"正确"，错误的写"错误"）。 (1). 2. Miss. Wang is a good Chinese teacher. (2). 3. Tom is in Class Two, Grade Eight. (3). 4. Mr. Zhang is Tom's English teacher. (4). 5. Jack and Tom are in the same class. (5). 1. Jack is a student, too.

【小题1】错误 【小题2】正确 【小题3】正确 【小题4】错误 【小题5】错误

根据汉语意思完成句子。 【 1 】 这个房间是用空气新鲜的木材做的。 This room is made of ___________. 【 2 】 我们必须阻止人们在森林里砍伐树木。 We must _______________ people from cutting down trees in the forest. 【 3 】 请不要把纸屑扔在地板上。 Please don't ___________ the paper on the floor. 【 4 】 环保对我们来说非常重要。 It is ___________ for us to protect the environment. 【 5 】 为了保护我们美丽的地球，我们不能乱扔垃圾。 We can't ___________ rubbish because we must protect our beautiful e




In [12]:
test_resp = base_llm.generate('Long long ago, there', sampling_params)[0]
print(test_resp.outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.47s/it, est. speed input: 0.59 toks/s, output: 120.40 toks/s]

 was a little girl who loved to play in the house. She picked up everything. She put it away, and then she picked it up again. She put it away, and then she picked it up again. Finally, her mother said, "I'm going to put a sign on the door. Then you won't be able to come in any more." "What sign, Mom?" "It'll say, 'Out of Order'," said her mother. "Oh," said the little girl. Then she went and hid under the bed. A few minutes later, her mother called her, "Come in here." The little girl came out from under the bed. "What's wrong, Mom?" "I put the sign on the door," said her mother, "and I can't open it." 【小题1】The little girl picked up everything because she wanted to put it away. 【小题2】The little girl put it away because her mother asked her to do so. 【小题3】The little girl was very angry with her mother. 【小题4】The mother didn't want to play with the little girl. 【小题5】The mother could not open the door because the sign was on it. 【小题1】T 【小题2】F 【小题3】T 【小题4】T 【小题5】T

阅读下面的文章，完成后面题目。 《红楼梦》中女性形




In [13]:
test_resp = base_llm.generate(instruct_prompt, sampling_params)[0]
print(test_resp.outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.38s/it, est. speed input: 18.56 toks/s, output: 120.33 toks/s]

First, I need to find a way to use the numbers 35 and 19 to get close to 99. I can start by adding 35 and 19, which gives me 54. Then, I can subtract 54 from 99, which gives me 45. Now, I need to find a way to get from 45 to 44. I can subtract 45 by 1, which gives me -1. But that doesn't work because I can't use -1 as a number in my equation. So, I need to find another way to get from 45 to 44. I can divide 45 by 1.1, which gives me 40.90909090909091. Then, I can subtract 40.90909090909091 by 0.9090909090909091, which gives me 40. Now, I need to find a way to get from 40 to 44. I can multiply 40 by 1.1, which gives me 44. But that doesn't work because I can't use 1.1 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, w




### instruct model inference

In [14]:
# instruct_llm = LLM(model='Qwen/Qwen2.5-3B-Instruct', max_model_len=1024)
# instruct_resp = instruct_llm.generate(instruct_prompt, sampling_params)[0]
# print(instruct_resp.outputs[0].text)