In [1]:
from datasets import load_dataset

dataset = load_dataset("cimec/lambada", split="test")

# Print the first 3 records
for i in range(3):
    print(f"\nSample {i+1}:")
    for k, v in dataset[i].items():
        print(f"{k}: {v}")





Sample 1:
text: in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about signs
domain: None

Sample 2:
text: give me a minute to change and i 'll meet you at the docks . '' she 'd forced those words through her teeth . `` no need to change . we wo n't be that long . '' shane gripped her arm and started leading her to the dock . `` i can make it there on my own , shane
domain: None

Sample 3:
text: `` only one source i know of that would be likely to cough up enough money to finance a phony sleep research facility and pay people big bucks to solve crimes in their dreams , '' farrell concluded dryly . `` what can i say ? '' ellis unfolded his arms and widened his hands . `` your tax dollars at w

In [2]:
import torch
from datasets import load_dataset
from unsloth import FastLanguageModel
from tqdm import tqdm

# === Generic evaluation function ===
def evaluate_lambada(model, tokenizer, dataset, name="Model", max_tokens=1):
    model.eval()
    correct = 0
    total = 0
    wrong_cases = []

    for example in tqdm(dataset, desc=f"Evaluating {name}"):
        full_text = example["text"].strip()
        if not full_text or " " not in full_text:
            continue  # skip malformed samples

        prompt = full_text.rsplit(" ", 1)[0].strip() + " "
        target = full_text.rsplit(" ", 1)[1].strip()

        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=False,
                temperature=0.0,
                top_p=1.0,
                pad_token_id=tokenizer.eos_token_id,
            )
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        predicted = generated_text[len(prompt):].strip().split(" ")[0]  # first generated word
        if predicted.lower() == target.lower():
            correct += 1
        else:
            if len(wrong_cases) < 10:
                wrong_cases.append((prompt, predicted, target))
        total += 1

    accuracy = correct / total if total > 0 else 0.0
    print(f"\n✅ {name} Accuracy: {accuracy:.2%} ({correct}/{total})")
    print("\n❌ Example wrong predictions:")
    for p, pred, tgt in wrong_cases:
        print(f"Prompt: {p}\n→ Predicted: {pred} | Target: {tgt}\n")



# === Load the LAMBADA dataset ===
dataset = load_dataset("cimec/lambada", split="test")
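
The slice `generated_text[len(prompt):]` in `evaluate_lambada` assumes that decoding reproduces the prompt character for character. That is not guaranteed, since a tokenizer round trip can change spacing, and a shifted offset may be one reason for fragmentary predictions such as `urch`. A minimal sketch of an alternative, assuming the usual decoder-only behaviour where `generate` returns the prompt ids followed by the new ids: slice by token count before decoding. The helper name `new_token_ids` is hypothetical, and plain lists stand in for tensors.

```python
def new_token_ids(output_ids, prompt_len):
    """Return only the ids generated after the prompt.

    Assumes output_ids is the prompt ids followed by the newly generated ids.
    """
    if prompt_len > len(output_ids):
        raise ValueError("prompt_len exceeds output length")
    return output_ids[prompt_len:]


# Toy ids standing in for tokenizer output (the values are made up).
prompt_ids = [101, 7592, 2088]
output_ids = prompt_ids + [2003]
print(new_token_ids(output_ids, len(prompt_ids)))
```

With real tensors the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode(..., skip_special_tokens=True)`.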




🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!




INFO 04-19 19:08:06 __init__.py:207] Automatically detected platform cuda.


Unsloth: Will load grpo500_phi14b_model_500eps_newreward as a legacy tokenizer.
Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.3.19 patched 40 layers with 0 QKV layers, 0 O layers and 40 MLP layers.
Evaluating GRPO Fine-tuned Model: 100%|██████████| 5153/5153 [23:43<00:00,  3.62it/s]

✅ GRPO Fine-tuned Model Accuracy: 5.76% (297/5153)

❌ Example wrong predictions:
Prompt: in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about 
→ Predicted: rome | Target: signs

Prompt: give me a minute to change and i 'll meet you at the docks . '' she 'd forced those words through her teeth . `` no need to change . we wo n't be that long . '' shane gripped her arm and started leading her to the dock . `` i can make it there on my own , 
→ Predicted: rome | Target: shane

Prompt: helen 's heart broke a little in the face of miss mabel 's selfless courage . she thought that because she was old , her life was of less value than the others ' . for all helen knew , miss mabel had a lot more years to live than she did . `` not going to happen , '' replied 
→ Predicted: ellen | Target: helen

Prompt: preston had been the last person to wear those chains , and i knew what i 'd see and feel if they were slipped onto my skin-the reaper 's unending hatred of me . i 'd felt enough of that emotion already in the amphitheater . i did n't want to feel anymore . `` do n't put those on me , '' i whispered . `` please . '' sergei looked at me , surprised by my low , raspy please , but he put down the 
→ Predicted: urch | Target: chains

Prompt: she knew that basha was a decent young man , that he was pretty sweet and friendly with her . jawen knew they had a bit of a history , but she thought that this time she would get along better with him , that she could overlook those problems . they kissed , and she knew that she liked basha , but then hastin interfered . she was so angry that she immediately said , once they were out of earshot of basha , `` you do n't mean anything to me anymore , 
→ Predicted: 0 | Target: hastin

Prompt: he heard rhinna speak `` the queen wants you in her carriage . '' tom spoke `` no , i 'm not going in some asylum . '' ran was seen standing next to him spoke `` it 's just for a private talk with you that 's all . '' tom groaned and went inside the carriage to sit down next to the 
→ Predicted: 2 | Target: queen

Prompt: there was no way he would come here on his own . he ordered a cup of coffee , and then we just sat in silence . `` so , '' aidan finally said , `` how 's it going ? '' i laughed . `` not much has changed since the last time i saw you . '' `` ya know , you eat here a lot , '' said 
→ Predicted: ellen | Target: aidan

Prompt: `` why ? '' `` i would have thought you 'd find him rather dry , '' she said . `` i do n't know about that , '' said gabriel . `` he was a great craftsman , '' said heather . `` that he was , '' said flannery . `` and polish , to boot , '' said 
→ Predicted: ellen | Target: gabriel

Prompt: both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen , and i was glad to relax against the tree 's rough , brittle bark and begin my breakfast of buttery , toasted bread and fresh fruit . even the water was tasty , it was so clean and cold . it almost made up for the lack of 
→ Predicted: iced | Target: coffee

Prompt: escorting drunk humans out of the bar is different from going up against a tiger-wildcat who eats raw steak for breakfast and is dying for a fight . '' `` i bet he could win with just his breath , '' ronan said . sean chuckled . `` take it seriously , ronan . these guys are seasoned . if marquez has a champion , it means he 's won a good share of the 
→ Predicted: 100 | Target: fights

Unsloth: Switching from Unsloth dynamic quant to normal quant since
we do not yet support fast inference for unsloth/phi-4-unsloth-bnb-4bit
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 4060 Ti. Num GPUs = 1. Max memory: 15.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: vLLM loading unsloth/phi-4-bnb-4bit with actual GPU utilization = 12.81%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 16.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 256. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 0.0 GB. Also swap space = 2 GB.
INFO 04-19 10:24:25 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection'], 'llm_int8_threshold': 6.0}
INFO 04-19 10:24:25 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='unsloth/phi-4-bnb-4bit', speculative_config=None, tokenizer='unsloth/phi-4-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=256, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/phi-4-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":128}, use_cached_outputs=False, 
INFO 04-19 10:24:26 model_runner.py:1110] Starting to load model unsloth/phi-4-bnb-4bit...
Unsloth: Retrying vLLM to process 96 sequences and 256 tokens in tandem.
Error:
CUDA driver error: out of memory
INFO 04-19 10:24:28 config.py:549] This model supports multiple tasks: {'embed', 'classify', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection'], 'llm_int8_threshold': 6.0}
INFO 04-19 10:24:28 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='unsloth/phi-4-bnb-4bit', speculative_config=None, tokenizer='unsloth/phi-4-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=256, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/phi-4-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":96}, use_cached_outputs=False, 
INFO 04-19 10:24:28 model_runner.py:1110] Starting to load model unsloth/phi-4-bnb-4bit...

In [3]:
# === Load your trained GRPO LoRA model ===
# finetuned_model_path = "grpo500_phi14b_model_500eps_newreward"
# model_finetuned, tokenizer_finetuned = FastLanguageModel.from_pretrained(
#     model_name=finetuned_model_path,
#     max_seq_length=2048,
#     load_in_4bit=True,
#     fast_inference=True,
#     max_lora_rank=64,
#     gpu_memory_utilization=0.7,
#     # trust_remote_code=True
# )
# FastLanguageModel.for_inference(model_finetuned)

# === Evaluate the fine-tuned model ===
# evaluate_lambada(model_finetuned, tokenizer_finetuned, dataset, name="GRPO Fine-tuned Model")




In [4]:
# === Load the original base model unsloth/Phi-4 ===
# model_base, tokenizer_base = FastLanguageModel.from_pretrained(
#     model_name="unsloth/Phi-4",
#     max_seq_length=2048,
#     load_in_4bit=True,
#     fast_inference=True,
#     max_lora_rank=64,
#     gpu_memory_utilization=0.7,
 
# )
# FastLanguageModel.for_inference(model_base)

# # === Evaluate the base model ===
# evaluate_lambada(model_base, tokenizer_base, dataset, name="Base Model (unsloth/Phi-4)")

✅ Base Model (unsloth/Phi-4) Accuracy: 5.84% (301/5153)

❌ Example wrong predictions:
Prompt: in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about 
→ Predicted: rome | Target: signs

Prompt: give me a minute to change and i 'll meet you at the docks . '' she 'd forced those words through her teeth . `` no need to change . we wo n't be that long . '' shane gripped her arm and started leading her to the dock . `` i can make it there on my own , 
→ Predicted: elli | Target: shane

Prompt: helen 's heart broke a little in the face of miss mabel 's selfless courage . she thought that because she was old , her life was of less value than the others ' . for all helen knew , miss mabel had a lot more years to live than she did . `` not going to happen , '' replied 
→ Predicted: ellen | Target: helen

Prompt: preston had been the last person to wear those chains , and i knew what i 'd see and feel if they were slipped onto my skin-the reaper 's unending hatred of me . i 'd felt enough of that emotion already in the amphitheater . i did n't want to feel anymore . `` do n't put those on me , '' i whispered . `` please . '' sergei looked at me , surprised by my low , raspy please , but he put down the 
→ Predicted: urch | Target: chains

Prompt: she knew that basha was a decent young man , that he was pretty sweet and friendly with her . jawen knew they had a bit of a history , but she thought that this time she would get along better with him , that she could overlook those problems . they kissed , and she knew that she liked basha , but then hastin interfered . she was so angry that she immediately said , once they were out of earshot of basha , `` you do n't mean anything to me anymore , 
→ Predicted: 0 | Target: hastin

Prompt: he heard rhinna speak `` the queen wants you in her carriage . '' tom spoke `` no , i 'm not going in some asylum . '' ran was seen standing next to him spoke `` it 's just for a private talk with you that 's all . '' tom groaned and went inside the carriage to sit down next to the 
→ Predicted: 2 | Target: queen

Prompt: there was no way he would come here on his own . he ordered a cup of coffee , and then we just sat in silence . `` so , '' aidan finally said , `` how 's it going ? '' i laughed . `` not much has changed since the last time i saw you . '' `` ya know , you eat here a lot , '' said 
→ Predicted: ellen | Target: aidan

Prompt: `` why ? '' `` i would have thought you 'd find him rather dry , '' she said . `` i do n't know about that , '' said gabriel . `` he was a great craftsman , '' said heather . `` that he was , '' said flannery . `` and polish , to boot , '' said 
→ Predicted: ellen | Target: gabriel

Prompt: both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen , and i was glad to relax against the tree 's rough , brittle bark and begin my breakfast of buttery , toasted bread and fresh fruit . even the water was tasty , it was so clean and cold . it almost made up for the lack of 
→ Predicted: iced | Target: coffee

Prompt: escorting drunk humans out of the bar is different from going up against a tiger-wildcat who eats raw steak for breakfast and is dying for a fight . '' `` i bet he could win with just his breath , '' ronan said . sean chuckled . `` take it seriously , ronan . these guys are seasoned . if marquez has a champion , it means he 's won a good share of the 
→ Predicted: 100 | Target: fights

# With the prompt added

In [5]:
from sentence_transformers import SentenceTransformer, util

# Load the sentence-embedding model (all-MiniLM-L6-v2 recommended: lightweight and stable)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def is_semantically_similar(pred, target, threshold=0.63):
    pred_sent = f"The final word is {pred}."
    tgt_sent = f"The final word is {target}."
    embeddings = embedder.encode([pred_sent, tgt_sent], convert_to_tensor=True)
    score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold
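
`is_semantically_similar` reduces to a cosine similarity between two embedding vectors, compared against a threshold. The sketch below shows that computation on toy vectors in plain Python. The vectors and the helper names `cosine_similarity` and `passes_threshold` are illustrative assumptions, not output of `all-MiniLM-L6-v2`. Because both sentences share the template "The final word is", their embeddings may start out fairly close, so the threshold (0.63 here) probably needs tuning against pairs with known labels.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def passes_threshold(a, b, threshold=0.63):
    """Mirror the notebook's decision rule: similar if score >= threshold."""
    return cosine_similarity(a, b) >= threshold


# Made-up 3-dimensional "embeddings" for illustration only.
pred_vec = [0.9, 0.1, 0.3]
target_vec = [0.8, 0.2, 0.4]
unrelated_vec = [-0.2, 0.9, -0.4]

print(passes_threshold(pred_vec, target_vec))
print(passes_threshold(pred_vec, unrelated_vec))
```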


In [6]:
import torch
import re
from tqdm import tqdm
from datasets import load_dataset
import string

# System prompt used during training
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
Your reasoning here...
</reasoning>
<answer>
The final word (only one word, no explanation)
</answer>"""

def apply_chat_prompt(tokenizer, context_text: str) -> str:
    return tokenizer.apply_chat_template([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Please read the following text and predict its final word:\n\n{context_text}"}
    ], tokenize=False, add_generation_prompt=True)

def extract_answer_from_xml(output: str) -> str:
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", output, re.DOTALL)
    if not matches:
        return ""
    
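    # Take the last <answer>, strip punctuation, keep only the first word
    tokens = matches[-1].split()
    if not tokens:
        return ""  # empty <answer> block; avoids an IndexError on [0]
    word = tokens[0].strip(string.punctuation)
    
    # Optional: filter common collapse tokens. The decoded text still contains the
    # system prompt, so its template <answer> (first word "The") is dropped here too.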
    if word.lower() in {"the", "a", "an", "it"}:
        return ""
    
    return word

def evaluate_lambada_structured(model, tokenizer, dataset, name="Model"):
    model.eval()
    strict_correct = 0
    soft_correct = 0
    total = 0
    wrong_cases = []

    for example in tqdm(dataset, desc=f"Evaluating {name}"):
        full_text = example["text"].strip()
        if not full_text or " " not in full_text:
            continue
        prompt_raw = full_text.rsplit(" ", 1)[0].strip()
        target = full_text.rsplit(" ", 1)[1].strip()

        # Build the prompt
        chat_prompt = apply_chat_prompt(tokenizer, prompt_raw)
        inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=False,
                temperature=0.0,
                top_p=1.0,
                pad_token_id=tokenizer.eos_token_id,
            )

        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        predicted = extract_answer_from_xml(generated_text)
        total += 1

        if predicted.lower() == target.lower():
            strict_correct += 1
        elif is_semantically_similar(predicted.lower(), target.lower()):
            soft_correct += 1
        else:
            if len(wrong_cases) < 10:
                wrong_cases.append((prompt_raw, predicted, target))

    # Print the evaluation results
    print(f"\n✅ {name} Strict Accuracy: {strict_correct / total:.2%} ({strict_correct}/{total})")
    print(f"🤝 {name} Semantic Accuracy: {(strict_correct + soft_correct) / total:.2%} "
          f"({strict_correct + soft_correct}/{total})")

    print("\n❌ Example wrong predictions:")
    for p, pred, tgt in wrong_cases:
        print(f"Context: {p}\n→ Predicted: {pred} | Target: {tgt}\n")

# def evaluate_lambada_structured(model, tokenizer, dataset, name="Model"):
#     model.eval()
#     correct = 0
#     total = 0
#     wrong_cases = []

#     for example in tqdm(dataset, desc=f"Evaluating {name}"):
#         full_text = example["text"].strip()
#         if not full_text or " " not in full_text:
#             continue
#         prompt_raw = full_text.rsplit(" ", 1)[0].strip()
#         target = full_text.rsplit(" ", 1)[1].strip()

#         # Build the prompt
#         chat_prompt = apply_chat_prompt(tokenizer, prompt_raw)
#         inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")

#         with torch.no_grad():
#             outputs = model.generate(
#                 **inputs,
#                 max_new_tokens=256,
#                 do_sample=False,
#                 temperature=0.0,
#                 top_p=1.0,
#                 pad_token_id=tokenizer.eos_token_id,
#             )

#         generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
#         predicted = extract_answer_from_xml(generated_text)

#         if predicted.lower() == target.lower():
#             correct += 1
#         else:
#             if len(wrong_cases) < 10:
#                 wrong_cases.append((prompt_raw, predicted, target))
#         total += 1

#     accuracy = correct / total
#     print(f"\n✅ {name} Accuracy: {accuracy:.2%} ({correct}/{total})")
#     print("\n❌ Example wrong predictions:")
#     for p, pred, tgt in wrong_cases:
#         print(f"Context: {p}\n→ Predicted: {pred} | Target: {tgt}\n")
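
The behaviour of `extract_answer_from_xml` on edge cases is easier to see on a few hand-written strings. The sketch below is a standalone variant, not the notebook's exact function: the name `extract_last_answer` and the sample strings are assumptions, and the collapse-token filter is left out. It returns an empty string for an empty `<answer>` block rather than indexing into an empty list.

```python
import re
import string

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_last_answer(output: str) -> str:
    """Return the first word of the last <answer> block, or '' if none."""
    matches = ANSWER_RE.findall(output)
    if not matches:
        return ""
    tokens = matches[-1].split()
    if not tokens:
        return ""  # the block was present but empty
    return tokens[0].strip(string.punctuation)


# Hand-written model outputs (hypothetical, for illustration only).
samples = [
    "<reasoning>she lacks a policy</reasoning>\n<answer>\ninsurance.\n</answer>",
    "<answer>first</answer> ... <answer>second guess</answer>",
    "<answer>   </answer>",
    "no tags at all",
]
for text in samples:
    print(repr(extract_last_answer(text)))
```

The four samples cover trailing punctuation, multiple `<answer>` blocks (the last one wins), an empty block, and missing tags.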


In [7]:
# Load the LAMBADA dataset
#dataset = load_dataset("cimec/lambada", split="test")
dataset = load_dataset("cimec/lambada", split="test[:5%]")

In [8]:
from unsloth import FastLanguageModel




# Load the GRPO fine-tuned model
# model_finetuned, tokenizer_finetuned = FastLanguageModel.from_pretrained(
#     model_name="grpo500_phi14b_model_500eps_newreward",
#     max_seq_length=2048,
#     load_in_4bit=True,
#     fast_inference=True,
#     max_lora_rank=64,
#     gpu_memory_utilization=0.7,
    
# )
# FastLanguageModel.for_inference(model_finetuned)

# evaluate_lambada_structured(model_finetuned, tokenizer_finetuned, dataset, name="GRPO Fine-tuned Model")



Reward function modified only (25% of the questions):

✅ GRPO Fine-tuned Model Accuracy: 5.76% (297/5153)

✅ Base Model (unsloth/Phi-4) Accuracy: 5.84% (301/5153)

After improving the answer-extraction function (5% of the questions):

✅ GRPO Fine-tuned Model Accuracy: 33.33% (86/258)

✅ Base Model (unsloth/Phi-4) Accuracy: 34.11% (88/258)

After adding the semantic answer-matching function (3% of the questions):


After adding the semantic answer-matching function (5% of the questions):

✅ GRPO Fine-tuned Model Strict Accuracy: 33.33% (86/258)
🤝 GRPO Fine-tuned Model Semantic Accuracy: 46.12% (119/258)

✅ Base Model (unsloth/Phi-4) Strict Accuracy: 34.11% (88/258)
🤝 Base Model (unsloth/Phi-4) Semantic Accuracy: 45.74% (118/258)
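
The strict accuracies on the 5% subset differ by two examples (86 vs 88 out of 258). As a rough sanity check that is not part of the original notebook, the sketch below attaches a normal-approximation 95% interval to each figure. The helper name `accuracy_with_se` is hypothetical, and the approximation assumes the examples are independent.

```python
import math


def accuracy_with_se(correct, total):
    """Return (accuracy, standard error) under a binomial model."""
    p = correct / total
    se = math.sqrt(p * (1.0 - p) / total)
    return p, se


# Counts copied from the results above.
results = [
    ("GRPO strict", 86, 258),
    ("Base strict", 88, 258),
    ("GRPO semantic", 119, 258),
    ("Base semantic", 118, 258),
]
for name, correct, total in results:
    p, se = accuracy_with_se(correct, total)
    print(f"{name}: {p:.2%} +/- {1.96 * se:.2%}")
```

On 258 examples the half-width comes out at several percentage points, so a gap of one or two examples between the models is well inside the noise.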


✅ GRPO Fine-tuned Model Accuracy: 33.33% (86/258)

❌ Example wrong predictions:
Context: in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about
→ Predicted: danger | Target: signs

Context: give me a minute to change and i 'll meet you at the docks . '' she 'd forced those words through her teeth . `` no need to change . we wo n't be that long . '' shane gripped her arm and started leading her to the dock . `` i can make it there on my own ,
→ Predicted: alone | Target: shane

Context: `` only one source i know of that would be likely to cough up enough money to finance a phony sleep research facility and pay people big bucks to solve crimes in their dreams , '' farrell concluded dryly . `` what can i say ? '' ellis unfolded his arms and widened his hands . `` your tax dollars at work . '' before farrell could respond , leila 's voice rose from inside the house . `` no insurance ? '' she wailed . `` what do you mean you do n't have any
→ Predicted: coverage | Target: insurance

Context: helen 's heart broke a little in the face of miss mabel 's selfless courage . she thought that because she was old , her life was of less value than the others ' . for all helen knew , miss mabel had a lot more years to live than she did . `` not going to happen , '' replied
→ Predicted: that | Target: helen

Context: there was no way he would come here on his own . he ordered a cup of coffee , and then we just sat in silence . `` so , '' aidan finally said , `` how 's it going ? '' i laughed . `` not much has changed since the last time i saw you . '' `` ya know , you eat here a lot , '' said
→ Predicted: regular | Target: aidan

Context: `` why ? '' `` i would have thought you 'd find him rather dry , '' she said . `` i do n't know about that , '' said gabriel . `` he was a great craftsman , '' said heather . `` that he was , '' said flannery . `` and polish , to boot , '' said
→ Predicted: O'Connor | Target: gabriel

Context: escorting drunk humans out of the bar is different from going up against a tiger-wildcat who eats raw steak for breakfast and is dying for a fight . '' `` i bet he could win with just his breath , '' ronan said . sean chuckled . `` take it seriously , ronan . these guys are seasoned . if marquez has a champion , it means he 's won a good share of the
→ Predicted: titles | Target: fights

Context: i was so happy to see him that i almost sobbed his name . eli stiffened and let out a hiss . `` mohiri ! '' fear crept into his voice , and my dazed mind wondered what on earth scared a vampire . nikolas chuckled , and i felt a tremor run through my captor . `` i see there is no need for
→ Predicted: Mohiri | Target: introductions

Context: `` come on , baby girl , '' mary jo said , scooping up the toy , then bending to retrieve her daughter . `` let me change your diaper and put you down for a couple of hours . '' `` she sleeps that long ? '' `` almost every afternoon . she still takes a morning nap , too , but she 'll outgrow those pretty soon . '' lori knew she had a lot to learn about
→ Predicted: parenting | Target: babies

Context: i would really like to have some time with my mom . '' lucien groaned inwardly . how could he say no to that ? julia was grieving for her mate and at a loss on how to deal with her daughter . now jaeden actually wanted to speak to her mom ... he would be an ogre if he said no . he gave a quick nod and ignored the grateful smile she threw him before her and her mother ambled out of the room
→ Predicted: room | Target: together

In [9]:
model_finetuned.print_trainable_parameters()


NameError: name 'model_finetuned' is not defined

In [None]:
import torch
import re

import string

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
Your reasoning here...
</reasoning>
<answer>
The final word (only one word, no explanation)
</answer>"""

def apply_chat_prompt(tokenizer, context_text: str) -> str:
    return tokenizer.apply_chat_template([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Please read the following text and predict its final word:\n\n{context_text}"}
    ], tokenize=False, add_generation_prompt=True)

def extract_answer_from_xml(output: str) -> str:
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", output, re.DOTALL)
    if not matches:
        return ""
    
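    # Take the last <answer>, strip punctuation, keep only the first word
    tokens = matches[-1].split()
    if not tokens:
        return ""  # empty <answer> block; avoids an IndexError on [0]
    word = tokens[0].strip(string.punctuation)
    
    # Optional: filter common collapse tokens. The decoded text still contains the
    # system prompt, so its template <answer> (first word "The") is dropped here too.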
    if word.lower() in {"the", "a", "an", "it"}:
        return ""
    
    return word

# def is_semantically_similar(pred, target, threshold=0.75):
#     embeddings = embedder.encode([pred, target], convert_to_tensor=True)
#     score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
#     print(f"🔍 Sim({pred}, {target}) = {score:.3f}")
#     return score >= threshold

def is_semantically_similar(pred, target, threshold=0.7):
    pred_sent = f"The final word is {pred}."
    tgt_sent = f"The final word is {target}."
    embeddings = embedder.encode([pred_sent, tgt_sent], convert_to_tensor=True)
    score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
    print(f"🔍 Sim({pred}, {target}) = {score:.3f}")
    return score >= threshold



def test_one_lambada_sample(model, tokenizer, example, max_new_tokens=256):
    model.eval()

    full_text = example["text"].strip()
    if not full_text or " " not in full_text:
        print("⚠️ Invalid sample.")
        return

    prompt_raw = full_text.rsplit(" ", 1)[0].strip()
    target = full_text.rsplit(" ", 1)[1].strip()

    # === Build the ChatML prompt ===
    chat_prompt = apply_chat_prompt(tokenizer, prompt_raw)
    print("🔹 System + User Prompt:\n", chat_prompt)

    # === Generate with the model ===
    inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=0.0,
            top_p=1.0,
            pad_token_id=tokenizer.eos_token_id,
        )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("\n🔸 Raw Generated Output:\n", generated_text)

    # === Parse the <answer> section ===
    predicted = extract_answer_from_xml(generated_text)
    print("\n✅ Extracted <answer>:", predicted)
    print("🎯 Target:", target)
    print(is_semantically_similar(predicted.lower(), target.lower(), 0.6))

    if predicted.lower() == target.lower():
        print("\n🎉 Prediction CORRECT ✅")
    else:
        print("\n❌ Prediction WRONG ❌")


In [None]:


# Test the 3rd sample with the GRPO fine-tuned model
test_one_lambada_sample(model_finetuned, tokenizer_finetuned, dataset[2])


🔹 System + User Prompt:
 <|im_start|>system<|im_sep|>Respond in the following format:
<reasoning>
Your reasoning here...
</reasoning>
<answer>
The final word (only one word, no explanation)
</answer><|im_end|><|im_start|>user<|im_sep|>Please read the following text and predict its final word:

`` only one source i know of that would be likely to cough up enough money to finance a phony sleep research facility and pay people big bucks to solve crimes in their dreams , '' farrell concluded dryly . `` what can i say ? '' ellis unfolded his arms and widened his hands . `` your tax dollars at work . '' before farrell could respond , leila 's voice rose from inside the house . `` no insurance ? '' she wailed . `` what do you mean you do n't have any<|im_end|><|im_start|>assistant<|im_sep|>

🔸 Raw Generated Output:
 <|im_start|> system <|im_sep|> Respond in the following format:
<reasoning>
Your reasoning here...
</reasoning>
<answer>
The final word (only one word, no explanation)
</answer> <

In [10]:
# Load the base model
model_base, tokenizer_base = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4",
    max_seq_length=2048,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=64,
    gpu_memory_utilization=0.7,
    
)
FastLanguageModel.for_inference(model_base)
evaluate_lambada_structured(model_base, tokenizer_base, dataset, name="Base Model (unsloth/Phi-4)")


Unsloth: Switching from Unsloth dynamic quant to normal quant since
we do not yet support fast inference for unsloth/phi-4-unsloth-bnb-4bit
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 4060 Ti. Num GPUs = 1. Max memory: 15.996 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: vLLM loading unsloth/phi-4-bnb-4bit with actual GPU utilization = 64.6%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 16.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill



INFO 04-19 19:08:24 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 04-19 19:08:24 weight_utils.py:254] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.25s/it]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.43it/s]


INFO 04-19 19:08:28 model_runner.py:1115] Loading model weights took 8.4920 GB
INFO 04-19 19:08:28 punica_selector.py:18] Using PunicaWrapperGPU.





INFO 04-19 19:08:29 worker.py:267] Memory profiling takes 1.18 seconds
INFO 04-19 19:08:29 worker.py:267] the current vLLM instance can use total_gpu_memory (16.00GiB) x gpu_memory_utilization (0.65) = 10.33GiB
INFO 04-19 19:08:29 worker.py:267] model weights take 8.49GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 1.34GiB.
INFO 04-19 19:08:30 executor_base.py:111] # cuda blocks: 439, # CPU blocks: 983
INFO 04-19 19:08:30 executor_base.py:116] Maximum concurrency for 256 tokens per request: 27.44x
INFO 04-19 19:08:30 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as nee

Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:18<00:00,  1.03it/s]

INFO 04-19 19:08:48 model_runner.py:1562] Graph capturing finished in 18 secs, took 0.65 GiB
INFO 04-19 19:08:48 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 20.32 seconds



Evaluating Base Model (unsloth/Phi-4): 100%|██████████| 258/258 [40:21<00:00,  9.38s/it]


✅ Base Model (unsloth/Phi-4) Strict Accuracy: 34.11% (88/258)
🤝 Base Model (unsloth/Phi-4) Semantic Accuracy: 45.74% (118/258)

❌ Example wrong predictions:
Context: in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about
→ Predicted: him | Target: signs

Context: give me a minute to change and i 'll meet you at the docks . '' she 'd forced those words through her teeth . `` no need to change . we wo n't be that long . '' shane gripped her arm and started leading her to the dock . `` i can make it there on my own ,
→ Predicted: alone | Target: shane

Context: helen 's heart broke a little in the face of miss mabel 's selfless courage . she thought that because she was old , her life was of le


