In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%%html
<style>
.cell-output-ipywidget-background {
    background-color: transparent !important;
}
:root {
    --jp-widgets-color: var(--vscode-editor-foreground);
    --jp-widgets-font-size: var(--vscode-editor-font-size);
}  
</style>

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096  # Can increase for longer reasoning traces
lora_rank = 32  # Higher rank adds adapter capacity, but costs memory and speed

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.6,  # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Any integer > 0; suggested values: 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # Enable long context finetuning
    random_state=3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-13 23:06:48 __init__.py:190] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.10: Fast Llama patching. Transformers: 4.48.3. vLLM: 0.7.2.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.109 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.54%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.11 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 226.
Unsloth: vLLM's KV Cache can us



INFO 03-13 23:07:08 weight_utils.py:252] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-13 23:07:10 model_runner.py:1115] Loading model weights took 5.5976 GB
INFO 03-13 23:07:10 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-13 23:07:12 worker.py:267] Memory profiling takes 1.43 seconds
INFO 03-13 23:07:12 worker.py:267] the current vLLM instance can use total_gpu_memory (79.11GiB) x gpu_memory_utilization (0.60) = 47.11GiB
INFO 03-13 23:07:12 worker.py:267] model weights take 5.60GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 1.09GiB; the rest of the memory reserved for KV Cache is 40.26GiB.
INFO 03-13 23:07:12 executor_base.py:110] # CUDA blocks: 20613, # CPU blocks: 3072
INFO 03-13 23:07:12 executor_base.py:115] Maximum concurrency for 4096 tokens per request: 80.52x
INFO 03-13 23:07:15 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error 

Capturing CUDA graph shapes: 100%|██████████| 32/32 [00:25<00:00,  1.28it/s]

INFO 03-13 23:07:40 model_runner.py:1562] Graph capturing finished in 25 secs, took 4.17 GiB
INFO 03-13 23:07:40 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 30.20 seconds
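
The memory figures in the vLLM log above can be cross-checked with a little arithmetic. The sketch below copies its inputs from the log lines. The 16-token KV-cache block size is an assumption (vLLM's default), since the log does not state it.

```python
# Figures copied from the vLLM log above, in GiB.
budget = 47.11           # total_gpu_memory x gpu_memory_utilization
weights = 5.60           # model weights
non_torch = 0.15         # non_torch_memory
activation_peak = 1.09   # PyTorch activation peak memory

# Whatever remains of the budget is reserved for the KV cache.
kv_cache = budget - weights - non_torch - activation_peak
print(f"KV cache: {kv_cache:.2f} GiB")  # the log reports 40.26 GiB from unrounded inputs

# Assumption: 16 tokens per KV-cache block (vLLM's default block size).
block_size = 16
cuda_blocks = 20613
max_seq_length = 4096

# Number of full-length requests whose KV cache fits on the GPU at once.
concurrency = cuda_blocks * block_size / max_seq_length
print(f"Max concurrency: {concurrency:.2f}x")
```

Under that block-size assumption the concurrency works out to the 80.52x the log reports, which suggests lowering `gpu_memory_utilization` trades directly against how many completions vLLM can generate in parallel.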



Unsloth 2025.3.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
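
The adapter size follows from the LoRA settings above: a rank-r adapter on a projection with `in` inputs and `out` outputs adds r x (in + out) parameters. The sketch below counts them for the seven target modules across 32 layers. The layer widths (hidden size 4096, key/value projection width 1024, MLP width 14336) are assumptions taken from the published Llama 3.1 8B configuration, not from this notebook.

```python
rank = 32
num_layers = 32

# Assumed Llama 3.1 8B widths (not stated in the notebook).
hidden, kv_dim, intermediate = 4096, 1024, 14336

# (in_features, out_features) of each adapted projection.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter holds A (rank x in) and B (out x rank): rank * (in + out) parameters.
per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable parameters")
```

The result agrees with the `Trainable parameters = 83,886,080` line that the trainer prints in cell `In [7]`.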


In [4]:
from datasets import Dataset
from lib.temporal_clue import get_temporal_clue_puzzles
import random


puzzles = get_temporal_clue_puzzles()[128:]
random.seed(42)
random.shuffle(puzzles)

dataset = Dataset.from_list(puzzles).map(
    lambda x: {
        "prompt": [
            {"role": "user", "content": x["prompt"]},
        ],
        "answer": x["solution"],
    }
)

Map:   0%|          | 0/2860 [00:00<?, ? examples/s]
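
To make the mapping step concrete, here is a minimal pure-Python sketch of what the lambda does to one record. `Dataset.map` merges the returned dictionary into the row, so `prompt` is overwritten with a one-message chat and `answer` is added alongside the original `solution`. The sample record and the helper name `to_grpo_example` are hypothetical.

```python
def to_grpo_example(puzzle: dict) -> dict:
    """Mirror the lambda passed to Dataset.map for a single record."""
    update = {
        # Wrap the raw prompt text as a single user turn in chat format.
        "prompt": [{"role": "user", "content": puzzle["prompt"]}],
        # Expose the solution under the name the reward function receives.
        "answer": puzzle["solution"],
    }
    # Dataset.map merges the returned keys into the existing row.
    return {**puzzle, **update}


# Hypothetical record; real ones come from get_temporal_clue_puzzles().
puzzle = {"prompt": "Who murdered Mr. Boddy?", "solution": {"A": "Miss Scarlet"}}
example = to_grpo_example(puzzle)
print(example["prompt"])
print(example["answer"])
```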

In [5]:
import re


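def _reward_func(response, answer) -> float:
    """Return the fraction of answer keys whose last "<key>. <value>" occurrence is correct."""
    if not answer:
        return 0.0  # Avoid dividing by zero on an empty answer key
    num_correct = 0
    for key, value in answer.items():
        if value is None:
            continue
        # Escape the key so any regex metacharacters in it are matched literally.
        matches = re.findall(rf"{re.escape(key)}\. ([A-Za-z \.:-]+)", response)
        # Score only the last occurrence, i.e. the model's final answer.
        if matches and matches[-1].strip().lower() == value.lower():
            num_correct += 1
    return num_correct / len(answer)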


def reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Score every completion in the batch against its puzzle's answer key."""
    responses = [completion[0]["content"] for completion in completions]
    # Log the first question, answer, and response of the batch for inspection.
    q = prompts[0][-1]["content"]
    print(
        "-" * 20,
        f"Question:\n{q}",
        f"\nAnswer:\n{answer[0]}",
        f"\nResponse:\n{responses[0]}",
    )
    return [_reward_func(r, a) for r, a in zip(responses, answer)]
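
A few hand-made responses show how the scoring rule behaves. The block below uses a standalone copy of the rule so it runs on its own. The answer key and responses are hypothetical; the real keys come from each puzzle's `solution` field.

```python
import re


def score(response: str, answer: dict) -> float:
    """Standalone copy of the rule: the last "<key>. <value>" occurrence wins."""
    num_correct = 0
    for key, value in answer.items():
        matches = re.findall(rf"{re.escape(key)}\. ([A-Za-z \.:-]+)", response)
        if matches and value is not None and matches[-1].strip().lower() == value.lower():
            num_correct += 1
    return num_correct / len(answer)


# Hypothetical answer key.
answer = {"A": "Miss Scarlet", "B": "Rope"}

# One answer per line: A is right, B is wrong.
one_per_line = score("A. Miss Scarlet\nB. Knife", answer)

# Both answers on one line: the capture for A runs on through "B. Rope",
# because spaces, periods, and letters are all inside the character class.
same_line = score("A. Miss Scarlet B. Rope", answer)

# A revised answer: only the last "A." occurrence is compared, case-insensitively.
revised = score("A. Mr. Green\nB. Rope\nA. miss scarlet", answer)

print(one_per_line, same_line, revised)
```

The second case is worth noting: the pattern only stops at characters outside `[A-Za-z \.:-]`, such as a newline, so a model gets full credit only when each answer ends its own line.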

In [6]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 2048

training_args = GRPOConfig(
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    beta=0.0,
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=15,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps=250,
    save_steps=250,
    max_grad_norm=0.1,
    report_to="none",  # Can use Weights & Biases
    output_dir="outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 15
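
With `num_generations=15`, each prompt yields a group of 15 completions, and GRPO judges each one relative to its group rather than in absolute terms. The sketch below shows that idea under stated assumptions: rewards are normalized by the group mean and the sample standard deviation, with a small epsilon for stability. It illustrates the technique and is not TRL's exact code, which may differ in details.

```python
from statistics import mean, stdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize one prompt's rewards by the group mean and standard deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation; needs at least two rewards
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three completions of the same puzzle.
advantages = group_advantages([0.0, 0.5, 1.0])
print(advantages)

# If every completion earns the same reward, no completion is preferred.
flat = group_advantages([0.25] * 15)
print(flat[:3])
```

Because the advantages are centered on the group mean, a group in which all completions score alike contributes no learning signal, which is one reason a graded reward such as the per-key fraction above is useful.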


In [7]:
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=reward_func,  # type: ignore
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,860 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 15 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (15 x 1 x 1) = 15
 "-____-"     Trainable parameters = 83,886,080/4,712,566,784 (1.78% trained)


-------------------- Question:
On a dark winter night, wealthy and enigmatic Mr. John Q. Boddy hosted a small, but lavish, dinner party for some of his closest associates. However, the night ended in tragedy when Mr. Boddy was found dead in one of the rooms of Tudor Mansion in the early hours of the morning. The following persons of interest have been identified as suspects:

• Sgt. Gray
• Monsieur Brunette
• Madame Rose
• Professor Plum
• Miss Scarlet
• Mrs. White
• Miss Peach
• Colonel Mustard
• Mr. Green

And the following weapons were found on the premises:

• Revolver
• Rope
• Lead Pipe
• Wrench
• Horseshoe
• Candlestick

The murder could only have occured in one of the following rooms:

01. Courtyard
02. Library
03. Conservatory
04. Studio
05. Lounge
06. Drawing Room
07. Cloak Room
08. Hall
09. Kitchen
10. Dining Room
11. Gazebo
12. Ballroom
13. Fountain
14. Trophy Room
15. Carriage House

The rooms are laid out as follows:

  NN NN NN NN  
W 01|02|03|04 E
W 05|06|07|08 E
W 09|10

Step  Training Loss  reward    reward_std  completion_length  kl        rewards / reward_func
1     0.0            0.133333  0.129099    858.93335          0.0       0.133333
2     0.0            0.083333  0.102062    1244.333374        0.0       0.083333
3     0.0            0.116667  0.120144    332.733337         0.00056   0.116667
4     0.0            0.208333  0.102062    900.866699         0.000571  0.208333
5     0.0            0.3       0.169031    1377.733398        0.000544  0.3
6     0.0            0.066667  0.114434    1349.800049        0.000468  0.066667
7     0.0            0.158333  0.120144    1247.333374        0.000489  0.158333
8     0.0            0.216667  0.129099    1250.800049        0.000492  0.216667
9     0.0            0.083333  0.077152    1131.133423        0.00062   0.083333
10    0.0            0.191667  0.188193    1181.666748        0.00043   0.191667


-------------------- Question:
On a dark winter night, wealthy and enigmatic Mr. John Q. Boddy hosted a small, but lavish, dinner party for some of his closest associates. However, the night ended in tragedy when Mr. Boddy was found dead in one of the rooms of Tudor Mansion in the early hours of the morning. The following persons of interest have been identified as suspects:

• Mr. Green
• Miss Scarlet
• Madame Rose
• Mrs. Peacock
• Sgt. Gray
• Mrs. White
• Monsieur Brunette
• Colonel Mustard
• Professor Plum
• Miss Peach

And the following weapons were found on the premises:

• Revolver
• Rope
• Poison
• Knife
• Candlestick
• Wrench
• Lead Pipe

The murder could only have occured in one of the following rooms:

01. Library
02. Ballroom
03. Cloak Room
04. Study
05. Carriage House
06. Lounge
07. Conservatory
08. Hall
09. Drawing Room
10. Kitchen
11. Trophy Room

The rooms are laid out as follows:

  NN NN NN NN  
W 01|02|03|04 E
W 05|06|07|08 E
W 09|10|11|- E
  SS SS SS SS  

The exact 

TrainOutput(global_step=250, training_loss=1.0140968795405159e-07, metrics={'train_runtime': 8543.5364, 'train_samples_per_second': 0.439, 'train_steps_per_second': 0.029, 'total_flos': 0.0, 'train_loss': 1.0140968795405159e-07})
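
The throughput numbers in `TrainOutput` can be reproduced from the runtime and step count. The sketch below assumes that `train_samples_per_second` counts individual completions, with 15 per step as shown in the trainer banner. That reading is an inference from the arithmetic, not something the output states.

```python
# Values copied from the TrainOutput above and the trainer banner.
train_runtime = 8543.5364      # seconds
steps = 250
completions_per_step = 15      # total batch size from the banner

steps_per_second = steps / train_runtime
# Assumption: "samples" here means generated completions, not prompts.
samples_per_second = steps * completions_per_step / train_runtime

print(f"steps/s:   {steps_per_second:.3f}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"runtime:   {train_runtime / 3600:.2f} hours")
print(f"per step:  {train_runtime / steps:.1f} seconds")
```

Both rates round to the reported 0.029 and 0.439, so the run spent roughly 34 seconds on each step of 15 completions.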

In [14]:
160_000/60/60/60

0.7407407407407407