### Installation

In [2]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [3]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Unsloth

Load `Llama 3.2 3B Instruct` and set the parameters.

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-30 12:39:20 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.0. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 6.

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

INFO 03-30 12:39:44 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-30 12:39:44 [cuda.py:288] Using XFormers backend.
INFO 03-30 12:39:45 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-30 12:39:45 [model_runner.py:1110] Starting to load model unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit...
INFO 03-30 12:39:45 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-30 12:39:46 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

INFO 03-30 12:40:06 [weight_utils.py:281] Time spent downloading weights for unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit: 19.760553 seconds
INFO 03-30 12:40:06 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 03-30 12:40:09 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-30 12:40:10 [model_runner.py:1146] Model loading took 2.3498 GB and 23.977246 seconds
INFO 03-30 12:40:20 [worker.py:267] Memory profiling takes 9.87 seconds
INFO 03-30 12:40:20 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 03-30 12:40:20 [worker.py:267] model weights take 2.35GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.89GiB; the rest of the memory reserved for KV Cache is 5.49GiB.
INFO 03-30 12:40:20 [executor_base.py:111] # cuda blocks: 3212, # CPU blocks: 1170
INFO 03-30 12:40:20 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 50.19x
INFO 03-30 12:40:23 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If o

Capturing CUDA graph shapes: 100%|██████████| 27/27 [01:06<00:00,  2.45s/it]

INFO 03-30 12:41:29 [model_runner.py:1570] Graph capturing finished in 66 secs, took 0.48 GiB
INFO 03-30 12:41:29 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 79.80 seconds






Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
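The LoRA settings above determine how many parameters are trained. As a rough check, the sketch below counts the adapter weights for `lora_rank = 32` across the seven target modules. The layer dimensions (hidden size 3072, MLP size 8192, key/value width 1024) are assumptions about Llama 3.2 3B and are not stated anywhere in this notebook; `lora_param_count` is a hypothetical helper.

```python
# Assumed Llama 3.2 3B dimensions (not stated in this notebook).
HIDDEN = 3072          # model width
INTERMEDIATE = 8192    # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 head dim
NUM_LAYERS = 28        # matches "patched 28 layers" in the log above
LORA_RANK = 32         # lora_rank used in this notebook

# (in_features, out_features) for each adapted projection
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes, num_layers):
    # Each adapter adds two matrices: A (rank x in) and B (out x rank).
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(LORA_RANK, projections, NUM_LAYERS)
print(f"{trainable:,}")
```

Under these assumptions the count comes to 48,627,712, which agrees with the `Trainable parameters` figure in the trainer banner further down.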


In [6]:
# prompt: mount google drive

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [10]:
import pandas as pd
train_df = pd.read_json('/content/drive/MyDrive/nlp/PROJECT/math_reasoning_dataset.jsonl', lines=True)
train_df.head()

| | instruction | output | original_answer |
|---|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in Apr... | <think>Okay, let's see. Natalia sold clips in ... | Natalia sold 48/2 = <<48/2=24>>24 clips in May... |
| 1 | Weng earns $12 an hour for babysitting. Yester... | <think>Okay, let's see. Weng earns $12 per hou... | Weng earns 12/60 = $<<12/60=0.2>>0.2 per minut... |
| 2 | Betty is saving money for a new wallet which c... | <think>Okay, let's see. Betty needs a wallet t... | In the beginning, Betty has only 100 / 2 = $<<... |
| 3 | Julie is reading a 120-page book. Yesterday, s... | <think>Okay, let's see. Julie has a 120-page b... | Maila read 12 x 2 = <<12*2=24>>24 pages today.... |
| 4 | James writes a 3-page letter to 2 different fr... | <think>Okay, let's see. James writes a 3-page ... | He writes each friend 3*2=<<3*2=6>>6 pages a w... |


In [11]:
train_df["instruction"][0]

'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?'

In [12]:
from datasets import Dataset
import re
SYSTEM_PROMPT = """
You are a math teacher who writes step-by-step reasoning in Python code.

For each problem:
1. Think through the problem step by step and enclose your reasoning within <think></think> tags.
2. Then, break down the question into sub-steps using Python.
3. Use comments (#) to explain each step in the code.
4. Do all calculations using variables.
5. Assign the final result to a variable named `answer`.

Make sure the final line of your code is: `answer = ...` with the correct value.

Structure your response like this:

<think>
Your reasoning here
</think>
<code>:
Your Python code here
</code>
"""
def extract_think_content(text: str) -> str:
    """Extract content within <think> tags."""
    think_match = re.search(r'<think>\s*(.+?)\s*</think>', text, re.DOTALL)
    return think_match.group(1).strip() if think_match else ""

def extract_code_answer(code_text: str) -> str:
    """Extract the value assigned to 'answer' from the last line within <code>:...</code>."""
    code_match = re.search(r'<code>:(.*?)<\/code>', code_text, re.DOTALL)
    if not code_match:
        return ""
    code = code_match.group(1).strip()
    lines = code.split('\n')
    for line in reversed(lines):
        if line.strip().startswith('answer ='):
            # Split on the first '=' only, so expressions that contain '=' stay intact
            return line.split('=', 1)[1].strip()
    return ""

def prepare_grpo_dataset(df: pd.DataFrame) -> Dataset:
    data = []
    for _, row in df.iterrows():
        question = row['instruction']
        deepseek_output = row['output']  # This is the ground truth

        # Extract components from DeepSeek output
        think_content = extract_think_content(deepseek_output)
        correct_answer = extract_code_answer(deepseek_output)

        data.append({
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': question}
            ],
            'think_content': think_content,  # Ground truth reasoning
            'answer': correct_answer         # Ground truth final answer
        })
    return Dataset.from_list(data)

grpo_dataset = prepare_grpo_dataset(train_df)
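One thing to watch in the cell above: `extract_code_answer` returns the text to the right of `answer =`. If a reference solution ends with something like `answer = april + may`, the stored ground truth is an expression rather than a number, and a later `float(...)` conversion will fail. Whether the dataset contains such lines is not visible here, so treat the following as a hedged alternative: a hypothetical helper, `evaluate_reference_answer`, that runs the reference code block and returns the numeric value of `answer`. It assumes the dataset text is trusted, since `exec` runs arbitrary code.

```python
import re

def evaluate_reference_answer(reference_text: str):
    """Run the reference <code>: block and return `answer` as a float, or None."""
    match = re.search(r"<code>:?(.*?)</code>", reference_text, re.DOTALL)
    if not match:
        return None
    namespace = {}
    try:
        # Only do this for trusted dataset text: exec runs arbitrary code.
        exec(match.group(1).strip(), namespace)
        return float(namespace["answer"])
    except Exception:
        # Missing `answer`, non-numeric value, or a runtime error in the code
        return None

sample = """<think>48 in April, half as many in May.</think>
<code>:
april = 48
may = april / 2
answer = april + may
</code>"""
print(evaluate_reference_answer(sample))
```

For the sample this yields a number (72.0), whereas the string-based extraction would return `april + may`.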

In [13]:
# prompt: explore the grpo_dataset

# Print some info
print(grpo_dataset)
print(grpo_dataset[0])
print(grpo_dataset[0]['prompt'])
print(grpo_dataset[0]['answer'])

# Print the first few examples
for i in range(min(5, len(grpo_dataset))):
  print(f"Example {i+1}:")
  print(grpo_dataset[i])
  print("-" * 20)


Dataset({
    features: ['prompt', 'think_content', 'answer'],
    num_rows: 350
})
{'prompt': [{'content': '\nYou are a math teacher who writes step-by-step reasoning in Python code.\n\nFor each problem:\n1. Think through the problem step by step and enclose your reasoning within <think></think> tags.\n2. Then, break down the question into sub-steps using Python.\n3. Use comments (#) to explain each step in the code.\n4. Do all calculations using variables.\n5. Assign the final result to a variable named `answer`.\n\nMake sure the final line of your code is: `answer = ...` with the correct value.\n\nStructure your response like this:\n\n<think>\nYour reasoning here\n</think>\n<code>:\nYour Python code here\n</code>\n', 'role': 'system'}, {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}], 'think_content': "Okay, let's see. Natalia sold clips in April

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [14]:
max_prompt_length = 512

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
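The batch-size message above affects how much of the dataset the run covers. A small arithmetic sketch, assuming the batch size counts completions (so each step holds `num_generations` completions of the same prompt); the variable names are illustrative:

```python
num_examples = 350        # rows in grpo_dataset
num_generations = 6       # completions sampled per prompt
per_device_batch = 6      # after Unsloth raises it to num_generations
grad_accum = 1            # gradient_accumulation_steps
max_steps = 250

# Assumption: the batch size counts completions, so each optimizer step
# covers (batch * grad_accum) / num_generations unique prompts.
prompts_per_step = per_device_batch * grad_accum // num_generations
prompts_seen = prompts_per_step * max_steps
coverage = prompts_seen / num_examples

print(prompts_per_step, prompts_seen, round(coverage, 2))
```

Under that assumption, 250 steps visit 250 of the 350 prompts, so the run stops short of a full pass over the data. Raising `max_steps` or switching to `num_train_epochs = 1` would cover everything.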


In [32]:
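# Keep the path in its own variable so it does not overwrite the `grpo_dataset` Dataset built above
dataset_path = '/content/drive/MyDrive/nlp/PROJECT/math_reasoning_dataset.jsonl'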

In [55]:
# import re
# import traceback
# import textwrap
# # Helper function to execute code and get the 'answer' variable
# def execute_code(code_text: str) -> tuple[float | None, bool]:
#     # Try to find <code> with optional colon: <code>:? ... </code>
#     code_match = re.search(r'<code>:?(.*?)<\/code>', code_text, re.DOTALL)
#     if code_match:
#         code = code_match.group(1).strip()
#     else:
#         # Try to find triple backticks: ```python ... ``` or ``` ... ```
#         code_match = re.search(r'```(?:python)?(.*?)```', code_text, re.DOTALL)
#         if code_match:
#             code = code_match.group(1).strip()
#         else:
#             print("No code block found in:", code_text)
#             return None, False

#     # Dedent the code to remove leading whitespace
#     code = textwrap.dedent(code).strip()
#     print("Extracted and dedented code:\n", code)

#     # Execute the code in a local namespace
#     local_vars = {}
#     try:
#         exec(code, {}, local_vars)
#         answer = local_vars.get('answer', None)
#         print("Executed successfully, answer =", answer)
#         return answer, True
#     except Exception as e:
#         print(f"Execution error: {e}")
#         return None, False

# # 1. Correctness Reward (Updated for Python code)
# def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
#     responses = [completion[0]['content'] for completion in completions]
#     q = prompts[0][-1]['content']
#     rewards = []
#     for response in responses:
#         student_answer, success = execute_code(response)
#         correct_answer = float(answer[0]) if isinstance(answer[0], str) and answer[0].replace('.', '').isdigit() else answer[0]
#         if success and student_answer is not None and isinstance(student_answer, (int, float)):
#             if abs(student_answer - correct_answer) < 1e-5:
#                 rewards.append(2.0)
#             else:
#                 rewards.append(0.0)
#         else:
#             rewards.append(0.0)
#         print('-' * 20)
#         print(f"Question:\n{q}")
#         print(f"Correct Answer:\n{correct_answer}")
#         print(f"Response:\n{response}")
#         print(f"Extracted Answer:\n{student_answer}")
#     return rewards

# # 2. Format Reward (Updated for <code>:...</code>)
# def strict_format_reward_func(completions, **kwargs) -> list[float]:
#     responses = [completion[0]["content"] for completion in completions]
#     rewards = []
#     for r in responses:
#         code_match = re.search(r'<code>:?(.*?)<\/code>', r, re.DOTALL)
#         if code_match:
#             code = code_match.group(1).strip()
#             if any(line.strip().startswith('answer =') for line in code.split('\n')):
#                 rewards.append(0.5)
#                 print(f"Strict format passed: Found 'answer =' in:\n{code}")
#             else:
#                 rewards.append(0.0)
#                 print(f"Strict format failed: No 'answer =' in:\n{code}")
#         else:
#             rewards.append(0.0)
#             print(f"Strict format failed: No code block in:\n{r}")
#     return rewards

# # 3. Soft Format Reward (Updated)
# def soft_format_reward_func(completions, **kwargs) -> list[float]:
#     responses = [completion[0]["content"] for completion in completions]
#     pattern = r"<code>:?.*?</code>"
#     rewards = []
#     for r in responses:
#         if re.search(pattern, r, re.DOTALL):
#             rewards.append(0.25)
#             print(f"Soft format passed:\n{r}")
#         else:
#             rewards.append(0.0)
#             print(f"Soft format failed:\n{r}")
#     return rewards

# # 4. Integer Reward (Optional, if answers are numeric)
# def int_reward_func(completions, **kwargs) -> list[float]:
#     rewards = []
#     for completion in completions:
#         response = completion[0]['content']
#         answer_val, success = execute_code(response)
#         # Check if answer_val is an int or a float that represents an integer
#         if success and isinstance(answer_val, int):
#             rewards.append(0.5)
#         elif success and isinstance(answer_val, float) and answer_val.is_integer():
#             rewards.append(0.5)
#         else:
#             rewards.append(0.0)
#         print(f"Int reward check: answer_val={answer_val}, success={success}")
#     return rewards

# # 5. Comment Reward (New, to enforce Python comments)
# def comment_reward_func(completions, **kwargs) -> list[float]:
#     rewards = []
#     for completion in completions:
#         response = completion[0]['content']
#         code_match = re.search(r'<code>:(.*?)<\/code>', response, re.DOTALL)
#         if not code_match:
#             rewards.append(0.0)
#             continue
#         code = code_match.group(1).strip()
#         comment_lines = len([line for line in code.split('\n') if line.strip().startswith('#')])
#         rewards.append(min(0.5, comment_lines * 0.1))  # 0.1 per comment, max 0.5
#     return rewards

# # 6. Replace xmlcount_reward_func with Variable Usage Reward
# def variable_reward_func(completions, **kwargs) -> list[float]:
#     rewards = []
#     for completion in completions:
#         response = completion[0]['content']
#         code_match = re.search(r'<code>:(.*?)<\/code>', response, re.DOTALL)
#         if not code_match:
#             rewards.append(0.0)
#             continue
#         code = code_match.group(1).strip()
#         assignments = len(re.findall(r'^\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*=', code, re.MULTILINE))
#         rewards.append(0.5 if assignments >= 2 else 0.0)  # Reward for 2+ variables
#     return rewards




In [18]:
import re
import traceback
import textwrap
# Helper function to execute code and get the 'answer' variable
def execute_code(code_text: str) -> tuple[float | None, bool]:
    # Try to find <code> with optional colon: <code>:? ... </code>
    code_match = re.search(r'<code>:?(.*?)<\/code>', code_text, re.DOTALL)
    if code_match:
        code = code_match.group(1).strip()
    else:
        # Try triple backticks: ```python ... ``` or ``` ... ```
        code_match = re.search(r'```(?:python)?(.*?)```', code_text, re.DOTALL)
        if code_match:
            code = code_match.group(1).strip()
        else:
            print("No code block found in:", code_text)
            return None, False

    # Dedent the code to remove leading whitespace
    code = textwrap.dedent(code).strip()
    print("Extracted and dedented code:\n", code)

    # Execute the code in a single namespace so that functions defined in the
    # generated code can see its top-level variables.
    # WARNING: exec runs untrusted model output; use a sandbox outside of demos.
    namespace = {}
    try:
        exec(code, namespace)
        answer = namespace.get('answer', None)
        if answer is None:
            print("Warning: 'answer' variable not assigned in code")
        else:
            print("Executed successfully, answer =", answer)
        return answer, True
    except Exception:
        print(f"Execution error: {traceback.format_exc()}")
        return None, False

# 1. Correctness Reward (Updated for Python code)
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """
    Reward if the student's code produces the same answer as DeepSeek's code.
    """
    responses = [completion[0]['content'] for completion in completions]
    rewards = []
    # Pair each completion with its own reference answer instead of always using answer[0]
    for response, reference in zip(responses, answer):
        student_answer, success = execute_code(response)
        # Convert DeepSeek's answer to float; an empty or non-numeric reference
        # earns no reward rather than being treated as 0.0
        try:
            correct_answer = float(reference)
        except (ValueError, TypeError):
            print(f"Invalid correct_answer: {reference}")
            rewards.append(0.0)
            continue

        # Check if student_answer matches correct_answer
        if success and student_answer is not None:
            try:
                student_answer_float = float(student_answer)
                if abs(student_answer_float - correct_answer) < 1e-5:
                    rewards.append(2.0)
                else:
                    rewards.append(0.0)
            except (ValueError, TypeError):
                print(f"Invalid student_answer: {student_answer}")
                rewards.append(0.0)
        else:
            rewards.append(0.0)
    return rewards

# 2. Format Reward (Updated for <code>:...</code>)
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward if the response follows <think>...</think><code>:...answer = ...</code>.
    """
    pattern = r'^<think>.*?</think>\s*<code>:\n.*?\nanswer = .*\n</code>$'
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

# 3. Soft Format Reward (Updated)
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward if <think> and <code> tags are present.
    """
    pattern = r'<think>.*?</think>.*?<code>:.*?</code>'
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.search(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

# 4. Integer Reward (Optional, if answers are numeric)
def int_reward_func(completions, **kwargs) -> list[float]:
    rewards = []
    for completion in completions:
        response = completion[0]['content']
        answer_val, success = execute_code(response)
        if success and answer_val is not None and isinstance(answer_val, (int, float)) and float(answer_val).is_integer():
            rewards.append(0.5)
        else:
            rewards.append(0.0)
        print(f"Int reward check: answer_val={answer_val}, success={success}")
    return rewards

# 5. Comment Reward (New, to enforce Python comments)
def comment_reward_func(completions, **kwargs) -> list[float]:
    rewards = []
    for completion in completions:
        response = completion[0]['content']
        code_match = re.search(r'<code>:(.*?)<\/code>', response, re.DOTALL)
        if not code_match:
            rewards.append(0.0)
            continue
        code = code_match.group(1).strip()
        comment_lines = len([line for line in code.split('\n') if line.strip().startswith('#')])
        rewards.append(min(0.5, comment_lines * 0.1))  # 0.1 per comment, max 0.5
    return rewards

# 6. Variable Usage Reward (replaces xmlcount_reward_func)
def variable_reward_func(completions, **kwargs) -> list[float]:
    rewards = []
    for completion in completions:
        response = completion[0]['content']
        code_match = re.search(r'<code>:(.*?)<\/code>', response, re.DOTALL)
        if not code_match:
            rewards.append(0.0)
            continue
        code = code_match.group(1).strip()
        assignments = len(re.findall(r'^\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*=', code, re.MULTILINE))
        rewards.append(0.5 if assignments >= 2 else 0.0)  # Reward for 2+ variables
    return rewards

def think_content_length_reward(completions, **kwargs) -> list[float]:
    """
    Reward based on word count in <think> tags, up to 1.0.
    """
    rewards = []
    for completion in completions:
        response = completion[0]['content']
        think_match = re.search(r'<think>\s*(.+?)\s*</think>', response, re.DOTALL)
        if think_match:
            think_text = think_match.group(1).strip()
            word_count = len(think_text.split())
            reward = min(1.0, word_count * 0.05)  # 0.05 per word, max 1.0 (20 words)
            rewards.append(reward)
        else:
            rewards.append(0.0)
    return rewards


def think_similarity_reward_func(completions, think_content, **kwargs) -> list[float]:
    """
    Reward if student's <think> content is similar to DeepSeek's <think> content.
    - Jaccard similarity is doubled and capped, so 0.5 or higher earns 1.0.
    """
    rewards = []
    # Pair each completion with its own reference reasoning instead of always using think_content[0]
    for completion, reference in zip(completions, think_content):
        correct_think_words = set(reference.lower().split())
        response = completion[0]['content']
        think_match = re.search(r'<think>\s*(.+?)\s*</think>', response, re.DOTALL)
        if not think_match:
            rewards.append(0.0)
            continue
        student_think = think_match.group(1).strip().lower()
        student_words = set(student_think.split())

        # Jaccard similarity
        intersection = len(correct_think_words & student_words)
        union = len(correct_think_words | student_words)
        similarity = intersection / union if union > 0 else 0.0
        rewards.append(min(1.0, similarity * 2.0))  # Scale to max 1.0
    return rewards
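To make the similarity reward concrete, here is a minimal standalone sketch of the same word-level Jaccard computation. `jaccard_reward` is a hypothetical name and the two sentences are made up for illustration.

```python
def jaccard_reward(reference: str, candidate: str) -> float:
    """Word-level Jaccard similarity, doubled and capped at 1.0."""
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    union = ref_words | cand_words
    if not union:
        return 0.0
    similarity = len(ref_words & cand_words) / len(union)
    return min(1.0, similarity * 2.0)

reference = "add april and may sales"
candidate = "add april and june totals"

# 3 shared words out of 7 distinct words, doubled
print(jaccard_reward(reference, candidate))
# Identical text hits the cap
print(jaccard_reward(reference, reference))
```

Because the text is split on whitespace only, `may.` and `may` count as different words. Stripping punctuation before comparing would make the score less sensitive to formatting.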


In [None]:
prompts = [[{"content": "James joins a football team... How many points did he beat the old record by?"}]]
completions = [[{"content": """<code>
# Step 1: Calculate the total points from touchdowns
touchdowns = 4
games = 15
points_per_touchdown = 6
total_touchdowns = touchdowns * games
total_touchdown_points = total_touchdowns * points_per_touchdown

# Step 2: Calculate the total points from 2-point conversions
conversions = 6
points_per_conversion = 2
total_conversion_points = conversions * points_per_conversion

# Step 3: Calculate the total points
total_points = total_touchdown_points + total_conversion_points

# Step 4: Calculate the difference from the old record
old_record = 300
difference = total_points - old_record

# Step 5: Get the final result
answer = difference
</code>"""}]]
answer = ["72.0"]

print("Testing correctness_reward_func:")
rewards = correctness_reward_func(prompts, completions, answer)
print("Correctness Reward:", rewards)

print("\nTesting strict_format_reward_func:")
rewards = strict_format_reward_func(completions)
print("Strict Format Reward:", rewards)

print("\nTesting soft_format_reward_func:")
rewards = soft_format_reward_func(completions)
print("Soft Format Reward:", rewards)

print("\nTesting int_reward_func:")
rewards = int_reward_func(completions)
print("Int Reward:", rewards)
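The test completion above uses `<code>` without a colon and has no `<think>` block, so both format rewards should come out as 0.0 for it. The sketch below copies the two patterns from the reward functions and checks them against a response that does follow the requested layout. `format_flags` and the sample strings are illustrative only.

```python
import re

# Patterns copied from strict_format_reward_func and soft_format_reward_func
STRICT = r'^<think>.*?</think>\s*<code>:\n.*?\nanswer = .*\n</code>$'
SOFT = r'<think>.*?</think>.*?<code>:.*?</code>'

def format_flags(text: str) -> tuple[bool, bool]:
    """Return (matches strict pattern, matches soft pattern)."""
    strict = re.match(STRICT, text, re.DOTALL) is not None
    soft = re.search(SOFT, text, re.DOTALL) is not None
    return strict, soft

conforming = (
    "<think>\nAdd the two months.\n</think>\n"
    "<code>:\n# April plus May\ntotal = 48 + 24\nanswer = total\n</code>"
)
no_colon = "<code>\nanswer = 72\n</code>"

print(format_flags(conforming))  # (True, True)
print(format_flags(no_colon))    # (False, False)
```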

In [19]:
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        correctness_reward_func,
        strict_format_reward_func,
        soft_format_reward_func,
        int_reward_func,
        comment_reward_func,
        variable_reward_func,
        think_content_length_reward,
        think_similarity_reward_func,
    ],
    args=training_args,
    train_dataset=grpo_dataset
)
trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 350 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 48,627,712/3,000,000,000 (1.62% trained)


Extracted and dedented code:
 # Given cost on Monday
monday_cost = 18

# Calculate the multiplier for the weekend price
weekend_multiplier = 1 + (50 / 100)  # 50% increase

# Calculate the cost on Saturday
saturday_cost = monday_cost / weekend_multiplier

# Print the result
print("The cost on Saturday would be: ", saturday_cost)
The cost on Saturday would be:  12.0
Extracted and dedented code:
 # Define the original price of the haircut
original_price = 18

# Calculate the price increase on Sunday
price_increase = original_price * 0.5  # 50% increase

# Calculate the price on Sunday
sunday_price = original_price + price_increase

# Assign the result to the answer variable
answer = sunday_price
Executed successfully, answer = 27.0
Extracted and dedented code:
 # Original price of the haircut on a weekday
monday_price = 18
# Calculate the multiplier for the weekend price (100% + 50% = 150%)
weekend_multiplier = 1.5

# Calculate the original price of the haircut on a weekday
origin_price 

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / correctness_reward_func,rewards / strict_format_reward_func,rewards / soft_format_reward_func,rewards / int_reward_func,rewards / comment_reward_func,rewards / variable_reward_func,rewards / think_content_length_reward,rewards / think_similarity_reward_func
1,-0.0,1.356621,0.725201,196.666672,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.833333,0.273288
2,0.0,1.492394,0.37224,257.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,1.0,0.40906
3,0.0,1.736032,0.26471,348.0,5e-06,0.0,0.0,0.0,0.333333,0.0,0.0,1.0,0.402699
4,0.0,1.539409,0.281601,327.166687,4e-06,0.0,0.0,0.0,0.333333,0.0,0.0,1.0,0.206075
5,0.0,1.409263,0.507722,178.833344,5e-06,0.0,0.0,0.0,0.25,0.0,0.0,0.833333,0.32593
6,0.0,1.620632,0.596022,229.333344,4e-06,0.0,0.0,0.0,0.416667,0.0,0.0,0.833333,0.370631
7,0.0,1.634709,0.226409,159.0,6e-06,0.0,0.0,0.0,0.333333,0.0,0.0,1.0,0.301376
8,0.0,1.874786,0.103837,244.833344,5e-06,0.0,0.0,0.0,0.5,0.0,0.0,1.0,0.374786
9,0.0,1.392287,0.749472,208.5,6e-06,0.0,0.0,0.0,0.166667,0.0,0.0,0.833333,0.392287
10,0.0,1.720844,0.665856,331.833344,4e-06,0.0,0.0,0.083333,0.083333,0.066667,0.083333,1.0,0.404178
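The `reward` column appears to be the plain sum of the per-function columns. The sketch below checks this for step 1 and compares it with the highest total the eight functions can return. The maxima are read from the function definitions above; equal weighting of the functions is an assumption.

```python
# Step 1 of the log above, one value per reward function
step_1 = {
    "correctness": 0.0,
    "strict_format": 0.0,
    "soft_format": 0.0,
    "int": 0.25,
    "comment": 0.0,
    "variable": 0.0,
    "think_length": 0.833333,
    "think_similarity": 0.273288,
}

# Maximum each function can return, read from the definitions above
max_reward = {
    "correctness": 2.0,
    "strict_format": 0.5,
    "soft_format": 0.5,
    "int": 0.5,
    "comment": 0.5,
    "variable": 0.5,
    "think_length": 1.0,
    "think_similarity": 1.0,
}

total = sum(step_1.values())
ceiling = sum(max_reward.values())
print(round(total, 6), ceiling)
```

The step 1 total matches the logged 1.356621, against a ceiling of 6.5. In these first ten steps nearly all of the reward comes from the two `<think>` rewards, while the correctness and format columns stay at or near zero.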


Streaming output truncated to the last 5000 lines.
Executed successfully, answer = 35.0
Int reward check: answer_val=35.0, success=True
Extracted and dedented code:
 # Import the math module for the ceiling function
import math

# Define variables
episodes = 20
minutes_per_episode = 30
total_minutes = episodes * minutes_per_episode

# Convert total minutes into hours
hours_per_day = total_minutes / 60

# Calculate the total time needed
total_time = hours_per_day * 5

# Divide the total time by the number of days to find out the time to watch per day
time_to_watch_per_day = total_time / 5

# Assign the final result to a variable named `answer`
answer = time_to_watch_per_day
Executed successfully, answer = 10.0
Extracted and dedented code:
 # Calculate the total minutes needed
total_minutes = 20 * 30

# Convert the total minutes to hours
total_hours = total_minutes / 60

# Calculate the hours John needs to watch per day
hours_per_day = total_hours / 5

# Assign the result t

TrainOutput(global_step=250, training_loss=0.0010167121298637128, metrics={'train_runtime': 10342.7546, 'train_samples_per_second': 0.145, 'train_steps_per_second': 0.024, 'total_flos': 0.0, 'train_loss': 0.0010167121298637128})

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [31]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

print(output)

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.86s/it, est. speed input: 15.36 toks/s, output: 12.63 toks/s]

To find the total cost, first calculate the cost of the full-priced glasses and the cost of the discounted glasses, then add them together.

Full-priced glasses: 16 * 5 = 80 dollars
Discounted glasses: 10 * 5 * 0.60 = 30 dollars

Total cost: 80 + 30 = 110 dollars
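For reference, the base model's total of 110 dollars does not follow from the problem statement. A quick check, reading "every second glass" as glasses 2, 4, ..., 16 (an interpretation, not something the notebook spells out):

```python
glass_price = 5.0
discount_factor = 0.6      # every second glass costs 60% of the full price
num_glasses = 16

discounted = num_glasses // 2            # glasses 2, 4, ..., 16
full_price = num_glasses - discounted    # glasses 1, 3, ..., 15

answer = full_price * glass_price + discounted * glass_price * discount_factor
print(answer)
```

Under this reading, eight glasses cost $5 and eight cost $3, for a total of 64 dollars.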





And now with the LoRA we just trained with GRPO. We save the LoRA first:

In [20]:
model.save_lora("/content/drive/MyDrive/nlp/PROJECT/grpo_saved_lora")

Now we load the LoRA and test:

In [30]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.4,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("/content/drive/MyDrive/nlp/PROJECT/grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:10<00:00, 10.84s/it, est. speed input: 21.14 toks/s, output: 27.51 toks/s]


'<think>\nTo solve this problem, we need to calculate the cost of the first glass and the cost of the second glass, and then add them together. Since the cost of every second glass is 60% of the price of the first glass, we need to find 60% of the cost of the first glass. We can do this by multiplying the cost of the first glass by 0.6.\n</think>\n<code>:\n# Cost of the first glass\ncost_first_glass = 5\n# Calculate the cost of the second glass (60% of the cost of the first glass)\ncost_second_glass = cost_first_glass * 0.6\n# Calculate the total number of glasses Kylar wants to buy\ntotal_glasses = 16\n# Calculate the total cost of the glasses\ntotal_cost = cost_first_glass + cost_second_glass + (cost_first_glass * (total_glasses - 2) * 0.6)\n# Calculate the cost of the remaining glasses (every second glass is 60% of the price of the first glass)\nremaining_glasses = total_glasses - 2\n# Calculate the cost of the remaining glasses\nremaining_cost = remaining_glasses * cost_first_glass