In [1]:
from unsloth import FastLanguageModel
max_seq_length = 1024 
lora_rank = 64 

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, 
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, 
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], 
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", 
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-21 18:20:23 [__init__.py:239] Automatically detected platform cuda.


2025-04-21 18:20:23,479	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.2.
   \\   /|    NVIDIA GeForce RTX 4070 Ti. Num GPUs = 1. Max memory: 11.994 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/Qwen2.5-1.5B-Instruct with actual GPU utilization = 44.81%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 11.99 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.36 GB. Also swap space = 2 GB.
INFO 04-21 18:20:28 [config.py:585] This model supports multiple tasks: {'classify', 'reward', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-21 18:20:28 [arg_utils.p

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.62it/s]


INFO 04-21 18:20:30 [loader.py:447] Loading weights took 0.43 seconds
INFO 04-21 18:20:30 [punica_selector.py:18] Using PunicaWrapperGPU.





INFO 04-21 18:20:30 [model_runner.py:1146] Model loading took 3.0237 GB and 0.923579 seconds
INFO 04-21 18:20:31 [worker.py:267] Memory profiling takes 0.87 seconds
INFO 04-21 18:20:31 [worker.py:267] the current vLLM instance can use total_gpu_memory (11.99GiB) x gpu_memory_utilization (0.45) = 5.37GiB
INFO 04-21 18:20:31 [worker.py:267] model weights take 3.02GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 0.87GiB; the rest of the memory reserved for KV Cache is 1.43GiB.
INFO 04-21 18:20:32 [executor_base.py:111] # cuda blocks: 3351, # CPU blocks: 4681
INFO 04-21 18:20:32 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 52.36x
INFO 04-21 18:20:32 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasi

Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:14<00:00,  1.60it/s]

INFO 04-21 18:20:46 [model_runner.py:1570] Graph capturing finished in 14 secs, took 0.21 GiB
INFO 04-21 18:20:46 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 15.94 seconds



Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
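The vLLM memory figures in the log above can be cross-checked with a little arithmetic. The sketch below only reuses numbers printed in the log. The KV-cache block size of 16 tokens is an assumption (it does not appear in the log), chosen because it reproduces the reported concurrency.

```python
# Figures copied from the vLLM log above (GiB).
total_gpu_memory = 11.99
actual_utilization = 0.4481   # "actual GPU utilization = 44.81%"
model_weights = 3.02
non_torch_memory = 0.04
activation_peak = 0.87

# Memory vLLM is allowed to use, and what remains for the KV cache.
budget = total_gpu_memory * actual_utilization
kv_cache = budget - model_weights - non_torch_memory - activation_peak
print(f"vLLM budget:       {budget:.2f} GiB")
print(f"Left for KV cache: {kv_cache:.2f} GiB")

# ASSUMPTION: 16 tokens per KV-cache block (not printed in the log).
block_size_tokens = 16
num_gpu_blocks = 3351         # "# cuda blocks: 3351"
max_seq_length = 1024
concurrency = num_gpu_blocks * block_size_tokens / max_seq_length
print(f"Max concurrency at {max_seq_length} tokens: {concurrency:.2f}x")
```

The budget comes out at about 5.37 GiB and the concurrency at 52.36x, matching the log. The KV-cache remainder lands within rounding of the logged 1.43 GiB.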


### Data Prep
<a name="Data"></a>

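The data prep and reward function below follow the structure of [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb)'s GRPO example, adapted here to systems of linear equations in `x` and `y` with JSON-formatted answers. You are free to create your own!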
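As a rough guide to what the loader in the next cell expects, here is a minimal sketch of the `data.json` shape and of how an answer dictionary becomes the ground-truth string. The sample entries and the helper name `to_answer_string` are hypothetical; the real file is generated elsewhere and its prompts are not shown in this notebook.

```python
import json

# Hypothetical entries: each has a prompt string and an answer dict.
sample_entries = [
    {"prompt": "<problem statement 1>", "answer": {"x": 21, "y": 24}},
    {"prompt": "<problem statement 2>", "answer": {"y": -12.0, "x": 13}},
]

def to_answer_string(solution):
    """Sort keys and print integral floats as ints, e.g. 'x=13, y=-12'."""
    parts = []
    for key, value in sorted(solution.items()):
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        parts.append(f"{key}={value}")
    return ", ".join(parts)

# Round-trip through JSON to mimic reading the file from disk.
for entry in json.loads(json.dumps(sample_entries)):
    print(to_answer_string(entry["answer"]))
```

Sorting the keys and collapsing `-12.0` to `-12` is what lets the reward function compare answers as plain strings.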

In [2]:
import json
import os
from datasets import Dataset
from typing import Optional, Union, List, Dict, Any

# --- Helper Functions ---

def format_solution_string(solution_dict: Dict[str, Union[int, float]]) -> str:
    """Formats a solution dictionary into a standardized string."""

    items = sorted(solution_dict.items())

    formatted_items = []
    for k, v in items:
        if isinstance(v, float) and v.is_integer():
            v = int(v)
        formatted_items.append(f"{k}={v}")
    return ", ".join(formatted_items)

def load_data_from_json(json_path="./data.json") -> Dataset:
    """Loads prompts and answers from the specified JSON file."""
    print(f"Attempting to load data from: {json_path}")

    if not os.path.exists(json_path):
        raise FileNotFoundError(f"Cannot find data file at {json_path}. Please ensure data.py has run and generated the file.")

    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        print(f"Error: Could not decode JSON from {json_path}. Error: {e}")
        raise
    except Exception as e:
        print(f"Error reading file {json_path}: {e}")
        raise

    processed_data = []
    for entry in data:

        full_prompt_text = entry['prompt']
        actual_solution_dict = entry['answer']

        # Format the ground-truth solution dict into a standardized string
        try:
            answer_text = format_solution_string(actual_solution_dict)
        except Exception as e:
            print(f"Skipping entry due to error formatting solution {actual_solution_dict}: {e}")
            continue

        processed_data.append({
            # Prompt structure: Only user role with the full prompt content
            'prompt': [
                {'role': 'user', 'content': full_prompt_text}
            ],
            'answer': answer_text # Ground-truth answer string (e.g., "x=1, y=2")
        })


    prompts = [item['prompt'] for item in processed_data]
    answers = [item['answer'] for item in processed_data]
    dataset = Dataset.from_dict({'prompt': prompts, 'answer': answers})
    print(f"Loaded {len(dataset)} samples from {json_path}")

    return dataset

def extract_json_solution(text: Optional[str]) -> Optional[Dict[str, Union[int, float]]]:
    """Parses the JSON response and extracts the solution dictionary."""
    if not text:
        return None
    try:
        # Expects the response to be a bare JSON object; any wrapper text
        # (e.g. a Markdown code fence) makes json.loads fail and returns None
        data = json.loads(text)
        if isinstance(data, dict) and \
           'solution' in data and \
           isinstance(data['solution'], dict) and \
           'x' in data['solution'] and \
           'y' in data['solution'] and \
           isinstance(data['solution']['x'], (int, float)) and \
           isinstance(data['solution']['y'], (int, float)):

            # Optional: Convert float integers to int for consistency
            solution = data['solution']
            if isinstance(solution['x'], float) and solution['x'].is_integer():
                solution['x'] = int(solution['x'])
            if isinstance(solution['y'], float) and solution['y'].is_integer():
                solution['y'] = int(solution['y'])
            return solution
        else:
            print(f"Warning: JSON structure mismatch in response. Expected 'solution' dict with 'x', 'y'. Response sample: {text[:150]}...")
            return None
    except json.JSONDecodeError:
        print(f"Warning: Could not decode JSON from response: {text[:150]}...")
        return None
    except Exception as e:
        print(f"Warning: Unexpected error parsing JSON solution: {e}. Response sample: {text[:150]}...")
        return None

# --- Reward Function (Modified for JSON) ---
def correctness_reward_func(prompts: List[List[Dict[str, str]]],
                            completions: List[List[Dict[str, str]]],
                            answer: List[str], # List of ground truth strings (e.g., ["x=1, y=2", ...])
                            **kwargs) -> List[float]:
    """
    Calculates reward based on whether the parsed JSON solution matches the ground truth answer string.
    """
    rewards = []
    # `answer` is the list of ground truth strings from the dataset
    ground_truths_normalized = ["".join(gt.split()) for gt in answer] # Normalize GT (remove spaces)

    # Process completions
    for i, completion_list in enumerate(completions):
        if not completion_list: # Handle empty completions if they occur
            print("Warning: Received empty completion list.")
            rewards.append(0.0)
            continue

        response_text = completion_list[0].get('content') # Get the model's response text
        extracted_solution_dict = extract_json_solution(response_text) # Parse JSON

        if extracted_solution_dict:
            # Format the extracted solution into the standard string
            extracted_answer_str = format_solution_string(extracted_solution_dict)
            extracted_answer_normalized = "".join(extracted_answer_str.split()) # Normalize extracted

            # Compare normalized strings
            is_correct = (extracted_answer_normalized == ground_truths_normalized[i])
            reward = 1.0 if is_correct else 0.0
            rewards.append(reward)

            # Debug print for the first item in the batch
            if i == 0:
                q = prompts[i][-1]['content'] # Get user question/prompt
                print('-'*20)
                # print(f"Question:\n{q}") # Prompt can be very long, maybe skip printing it fully
                print(f"Ground Truth: {answer[i]}")
                print(f"Response:\n{response_text}")
                print(f"Extracted Solution Dict: {extracted_solution_dict}")
                print(f"Formatted Extracted Answer: {extracted_answer_str}")
                print(f"Normalized Extracted: '{extracted_answer_normalized}'")
                print(f"Normalized Ground Truth: '{ground_truths_normalized[i]}'")
                print(f"Match: {is_correct}, Reward: {reward}")
                print('-'*20)
        else:
            # If JSON parsing or extraction failed
            rewards.append(0.0)
            if i == 0: # Debug print for the first item
                 print('-'*20)
                 print(f"Ground Truth: {answer[i]}")
                 print(f"Response (failed parse):\n{response_text}")
                 print("Match: False (Parse Failure), Reward: 0.0")
                 print('-'*20)

    return rewards

# --- Load the Dataset ---
# Ensure the path to data.json is correct
dataset_path = "./data.json"
dataset = load_data_from_json(dataset_path)

print(f"\nFinal dataset size after filtering: {len(dataset)}")
if len(dataset) == 0:
    raise ValueError("Dataset is empty after loading and filtering. Cannot train.")

split_dataset = dataset.train_test_split(test_size=50, seed=42) 
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

print(f"\nDataset split into:")
print(f"  Training set size: {len(train_dataset)}")
print(f"  Evaluation set size: {len(eval_dataset)}")



Attempting to load data from: ./data.json
Loaded 1074 samples from ./data.json

Final dataset size after filtering: 1074

Dataset split into:
  Training set size: 1024
  Evaluation set size: 50


<a name="Train"></a>
### Train the model

Now set up the GRPO trainer and its configuration.

In [3]:
from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported 


training_args = GRPOConfig(
    use_vllm = True,
    learning_rate = 5e-5,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 5,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1, 
    gradient_accumulation_steps = 4,
    num_generations = 8, 
    max_prompt_length = 256,  
    max_completion_length = 1024, 
    max_steps = 500,
    save_steps = 250, 
    max_grad_norm = 0.5,
    report_to = "none",
    output_dir = "outputs",
)



Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
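A quick sanity check of the configuration above may help when tuning it. This is a sketch under one stated assumption: that every prompt contributes exactly `num_generations` completions to the effective batch. The numbers are copied from the `GRPOConfig` cell, the Unsloth notice, and the dataset split.

```python
# Values from the config above and the Unsloth notice.
num_generations = 8
per_device_train_batch_size = 8   # raised from 1 by Unsloth, per the notice
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 500
num_train_prompts = 1024

completions_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
# ASSUMPTION: each prompt is sampled num_generations times per step.
prompts_per_step = completions_per_step // num_generations
approx_epochs = max_steps * prompts_per_step / num_train_prompts
print(f"Completions per optimizer step: {completions_per_step}")
print(f"Distinct prompts per step:      {prompts_per_step}")
print(f"Approximate epochs:             {approx_epochs:.2f}")

# Sequence-length budget: prompt and completion share max_seq_length.
max_seq_length = 1024
max_prompt_length = 256
max_completion_length = 1024
worst_case_tokens = max_prompt_length + max_completion_length
overflow = worst_case_tokens - max_seq_length
print(f"Worst-case tokens: {worst_case_tokens} (over budget by {overflow})")
```

Under that assumption the run covers roughly two passes over the 1,024 training prompts. The length check also shows that a full-length prompt plus a full-length completion cannot fit in `max_seq_length = 1024`, which is one plausible source of the truncation warning in the training log; setting `max_completion_length` to `max_seq_length - max_prompt_length` (768) would be one way to keep the two consistent.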


In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        correctness_reward_func, 
    ],
    args = training_args,
    train_dataset = train_dataset, 
    eval_dataset = eval_dataset
)

print("\nGRPOTrainer initialized successfully with custom data and reward function.")
print(f"Using max_prompt_length: {training_args.max_prompt_length}")
print(f"Using max_completion_length: {training_args.max_completion_length}")

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,024 | Num Epochs = 2 | Total steps = 500
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 73,859,072/1,617,573,376 (4.57% trained)


{
  "reasoning": "First, we multiply Equation 1 by 5 and Equation 2 by 23 to eliminate y. This gives us equations: 115x + 25y = 3015, -115x - ...
--------------------
Ground Truth: x=21, y=24
Response (failed parse):
```json
{
  "reasoning": "First, we multiply Equation 1 by 5 and Equation 2 by 23 to eliminate y. This gives us equations: 115x + 25y = 3015, -115x - 391y = -11799. Adding these two equations eliminates y and gives us a new equation in x: -366y = -8784. Dividing both sides by -366 gives us y = 24. Substituting y = 24 into Equation 1 gives us 23x + 5(24) = 603. Simplifying this gives us 23x = 483. Dividing both sides by 23 gives us x = 21. Therefore, x = 21 and y = 24.",
  "solution": {
    "x": 21,
    "y": 24
  }
}
```
Match: False (Parse Failure), Reward: 0.0
--------------------
{
  "reasoning": "To solve this system of equations, we can use the method of substitution or elimination. Here, we'll use the elimination met...
{
  "reasoning": "To solve the system of linear 

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / correctness_reward_func


--------------------
Ground Truth: x=13, y=-12
Response:
{
  "reasoning": "First, we solve for y. Given the system of equations, we can express y from Equation 2: -23y = -23 + 23x, solving for y gives: y = 1 - x. We substitute this expression for y in Equation 1: -4x + 30(1 - x) = -412. Simplifying this results in: -4x + 30 - 30x = -412, or -34x = -442. Solving for x: x = -442 / -34 = 13. Then, substituting this value into the equation for y: y = 1 - 13 = -12. Hence, the solution for the system is x = 13, y = -12.",
  "solution": {
    "x": 13,
    "y": -12
  }
}
Extracted Solution Dict: {'x': 13, 'y': -12}
Formatted Extracted Answer: x=13, y=-12
Normalized Extracted: 'x=13,y=-12'
Normalized Ground Truth: 'x=13,y=-12'
Match: True, Reward: 1.0
--------------------
{
  "reasoning": "To solve the system of linear equations, we first manipulate the equations to eliminate one of the variables. For equations ...
{
  "reasoning": "To solve the system of equations, we can use the method of sub

Unsloth: Input IDs of length 1025 > the model's max sequence length of 1024.
We shall truncate it ourselves. It's imperative if you correct this issue first.


{
  "reasoning": "To solve the system of linear equations, we can use the elimination method. First, we find a common multiple of the coeffici...
--------------------
Ground Truth: x=18, y=-30
Response (failed parse):
```json
{
  "reasoning": "To solve the system of linear equations, we can use the elimination method. First, we find a common multiple of the coefficients of x or y to eliminate one variable. Here, we can eliminate x by making its coefficients in both equations equal. Multiply Equation 1 by 27 and Equation 2 by 26 to get: 732x - 432y = 25146; -654x + 208y = -18576. Adding these two equations, we eliminate x: 0x - 224y = 6570. Dividing by -224, we get y = -29.328125. Substituting y = -29.328125 into Equation 1, we get 26x - 16(-29.328125) = 948. Solving for x, we get x = 74.444444. Therefore, the solution is x = 74.444444 and y = -29.328125.",
  "solution": {
    "x": 74.444444,
    "y": -29.328125
  }
}
```
Match: False (Parse Failure), Reward: 0.0
--------------------
{
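Two of the debug blocks in the training log above show responses wrapped in a Markdown `json` code fence being scored as parse failures, including one whose solution (x=21, y=24) matches the ground truth. If fenced answers should count, a fence-tolerant pre-processing step could run before `json.loads`. The helper below is a hypothetical sketch, not part of the notebook's reward function.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built programmatically

# Matches a response that consists entirely of one fenced block.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n\s*" + FENCE + r"\s*$",
    re.DOTALL,
)

def strip_code_fence(text: Optional[str]) -> Optional[str]:
    """Return the body of a fully fenced response, else the stripped text."""
    if not text:
        return text
    match = _FENCED.match(text)
    return match.group(1) if match else text.strip()

fenced = FENCE + 'json\n{"solution": {"x": 21, "y": 24}}\n' + FENCE
print(json.loads(strip_code_fence(fenced))["solution"])
```

Whether to accept fenced output is a design choice. Scoring it as a failure, as the notebook does, pushes the model toward emitting bare JSON; stripping the fence rewards correct arithmetic regardless of formatting.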


In [None]:
model.save_lora("grpo_saved_lora")

Now we load the saved LoRA adapter and evaluate it on the held-out split:

In [None]:
from tqdm import tqdm
from vllm import SamplingParams 
import time 

print("\n" + "="*30)
print("  Starting Manual Evaluation")
print("="*30 + "\n")


print("Attempting to load trained LoRA adapter...")
try:
    lora_adapter_eval = model.load_lora("grpo_saved_lora")
    print("Successfully loaded LoRA adapter 'grpo_saved_lora' for evaluation.")
    use_lora = True
except Exception as e:
    print(f"Warning: Could not load LoRA adapter 'grpo_saved_lora'. Evaluating model's current state (might be base model or last training state). Error: {e}")
    lora_adapter_eval = None
    use_lora = False

model.eval()
print("Model set to evaluation mode.")


print(f"Preparing {len(eval_dataset)} evaluation samples...")
eval_prompts_content = [item['prompt'][0]['content'] for item in eval_dataset]
eval_answers_gt = [item['answer'] for item in eval_dataset] 


eval_prompts_formatted = []
for p_content in eval_prompts_content:
    formatted_p = tokenizer.apply_chat_template(
        [{'role': 'user', 'content': p_content}],
        tokenize=False,
        add_generation_prompt=True 
    )
    eval_prompts_formatted.append(formatted_p)
print("Evaluation prompts formatted.")

eval_sampling_params = SamplingParams(
    temperature=0.1,       # low temperature for evaluation; the logged run below reports temp=0.1
    top_p=0.95, 
    top_k=64,          
    max_tokens=training_args.max_completion_length
)
print(f"Using SamplingParams: temp={eval_sampling_params.temperature}, top_p={eval_sampling_params.top_p}, max_tokens={eval_sampling_params.max_tokens}")

generated_outputs_text = []
print(f"\nGenerating responses for {len(eval_prompts_formatted)} evaluation samples...")
start_time_gen = time.time()


for prompt_text in tqdm(eval_prompts_formatted, desc="Generating"):

    outputs = model.fast_generate(
        prompt_text,
        sampling_params=eval_sampling_params,
        lora_request=lora_adapter_eval if use_lora else None, 
    )

    generated_text = outputs[0].outputs[0].text
    generated_outputs_text.append(generated_text)

end_time_gen = time.time()
print(f"Generation finished in {end_time_gen - start_time_gen:.2f} seconds.")


correct_count = 0
parse_failures = 0
incorrect_samples = [] 

print("\nCalculating accuracy...")
start_time_eval = time.time()

for i, response_text in enumerate(tqdm(generated_outputs_text, desc="Evaluating")):
    ground_truth_str = eval_answers_gt[i]
    extracted_solution_dict = extract_json_solution(response_text) 

    is_correct = False
    formatted_extracted_answer = "N/A (Parse Failure)"

    if extracted_solution_dict:
        try:
            formatted_extracted_answer = format_solution_string(extracted_solution_dict)
            extracted_normalized = "".join(formatted_extracted_answer.split())
            gt_normalized = "".join(ground_truth_str.split())

            if extracted_normalized == gt_normalized:
                is_correct = True
                correct_count += 1
        except Exception as e:
            print(f"Warning: Error formatting extracted solution for sample {i}: {extracted_solution_dict}. Error: {e}")
            pass 
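    else:
        parse_failures += 1

    # Record misses so the "Example Incorrect/Failed Samples" report has data
    if not is_correct:
        incorrect_samples.append({
            'index': i,
            'prompt': eval_prompts_content[i][:100],
            'ground_truth': ground_truth_str,
            'generated_response': response_text,
            'parse_failed': extracted_solution_dict is None,
            'parsed_solution': extracted_solution_dict,
            'formatted_extracted': formatted_extracted_answer,
        })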

end_time_eval = time.time()
print(f"Evaluation calculation finished in {end_time_eval - start_time_eval:.2f} seconds.")

total_samples = len(eval_dataset)
accuracy = correct_count / total_samples if total_samples > 0 else 0

print("\n" + "="*30)
print("    Evaluation Results")
print("="*30)
print(f"Total evaluation samples: {total_samples}")
print(f"Correct predictions:    {correct_count}")
print(f"Incorrect predictions:  {total_samples - correct_count}")
print(f"  - Parse failures:     {parse_failures}")
print(f"Accuracy:               {accuracy:.4f} ({accuracy*100:.2f}%)")
print("="*30)


if incorrect_samples:
    print("\n--- Example Incorrect/Failed Samples ---")
    num_to_show = min(3, len(incorrect_samples)) 
    for k in range(num_to_show):
        sample = incorrect_samples[k]
        print(f"\nSample Index: {sample['index']}")
        print(f"  Prompt (start): {sample['prompt']}")
        print(f"  Ground Truth:   {sample['ground_truth']}")
        print(f"  Generated:      {sample['generated_response']}")
        if sample['parse_failed']:
            print("  Result:         Parse Failure")
        else:
            print(f"  Parsed Dict:    {sample['parsed_solution']}")
            print(f"  Formatted Ext:  {sample['formatted_extracted']}")
            print("  Result:         Incorrect Match")
    print("-" * 35)



  Starting Manual Evaluation

Attempting to load trained LoRA adapter...
Successfully loaded LoRA adapter 'grpo_saved_lora' for evaluation.
Model set to evaluation mode.
Preparing 50 evaluation samples...
Evaluation prompts formatted.
Using SamplingParams: temp=0.1, top_p=0.95, max_tokens=1024

Generating responses for 50 evaluation samples...


Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.62s/it, est. speed input: 33.38 toks/s, output: 95.46 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.66s/it, est. speed input: 131.60 toks/s, output: 111.68 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.11s/it, est. speed input: 196.31 toks/s, output: 109.46 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.57s/it, est. speed input: 138.12 toks/s, output: 110.62 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.11s/it, est. speed input: 43.83 toks/s, output: 111.33 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.44s/it, est. speed input: 64.51 toks/s, output: 110.71 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.98s/it, est. speed input: 44.98 toks/s, output: 114.25 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.35s/it, est. speed input: 67.46 toks/s, output: 113.13 toks/s]
Processed prompts: 100%|██████████| 1/

Generation finished in 191.34 seconds.

Calculating accuracy...


Evaluating: 100%|██████████| 50/50 [00:00<00:00, 119905.77it/s]

  "reasoning": "To solve the system of linear equations, we can use the method of substitution or elimination. Here, we will use the elimination met...
  "reasoning": "To solve the system of linear equations, we can use the method of substitution or elimination. Here, we will use the elimination met...
  "reasoning": "To solve the system of linear equations, we can use the method of substitution or elimination. Here, we'll use the elimination metho...
  "reasoning": "To solve the system of linear equations, we can use the method of substitution or elimination. Here, we will use the elimination met...
  "reasoning": "To solve the system of linear equations, we can use the method of substitution or elimination. Here, we will use the elimination met...
  "reasoning": "To solve the system of linear equations, we can use the method of substitution or elimination. Here, we will use the elimination met...
Evaluation calculation finished in 0.00 seconds.

    Evaluation Results
Total evaluatio


