# Fine-tune DeepSeek-R1-Distill-Qwen-1.5B with GRPO

In this notebook we demonstrate how to use the GRPOTrainer from Hugging Face to fine-tune DeepSeek-R1-Distill-Qwen-1.5B, following the references at the bottom. I have also shared a notebook with the packages currently needed to run this one. I am new to this, so comments and corrections are welcome!

This notebook uses three reward functions and a prompt from the DeepSeek paper (https://arxiv.org/abs/2501.12948). My reward functions:

* Formatting, so that there is a marked <think></think> section, a summary of the solution, then the final answer in \boxed{}.
* Accuracy of the result
* The Levenshtein similarity ratio between the summarized solution (the text after </think>) and the reference solution from the dataset, called solution_quality.

<b>Before training (100 evaluation problems):</b>

on deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with 3179 test samples

* formatting 0.675
* accuracy 0.475
* solution_quality 0.2648481333776768


<b>After training:</b>

* formatting 0.6225
* accuracy 0.5425
* solution_quality 0.26916747222129206
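
A relative increase in accuracy of over 10% (from 0.475 to 0.5425, about 6.75 percentage points)! WOW! Formatting dropped slightly and solution_quality barely moved.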
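
To make the size of the change explicit, here is a small sketch that computes both the absolute and the relative change for each metric. The numbers are copied from the lists above; the helper name `changes` is only for illustration.

```python
# Scores copied from the before/after lists above.
before = {"formatting": 0.675, "accuracy": 0.475, "solution_quality": 0.2648481333776768}
after = {"formatting": 0.6225, "accuracy": 0.5425, "solution_quality": 0.26916747222129206}

def changes(before, after):
    """Return {metric: (absolute change, relative change)} for each metric."""
    return {k: (after[k] - before[k], (after[k] - before[k]) / before[k]) for k in before}

for name, (abs_change, rel_change) in changes(before, after).items():
    print(f"{name}: {abs_change:+.4f} absolute, {rel_change:+.1%} relative")
```

For accuracy this works out to roughly 6.75 percentage points absolute, or about 14% relative to the starting value.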


<b>Next steps:</b>

I would of course like to get a properly done training and evaluation set, a cluster of GPUs, and see how far we can take this.
Contact me if you want to team up!

Disclaimers:
The loss reported while the trainer runs is not what you would expect if you are used to supervised fine-tuning. In the GRPOTrainer the reported loss sits at (or very close to) zero, but don't be afraid, the model is learning! I have the printer callback log the gradient norm so you can see that there is in fact a gradient.
This is a demonstration notebook. To make it fit in memory and run in a reasonable amount of time, I have selected a simpler and smaller dataset, shortened the output sequences, used the smallest possible model, and set the number of generations used in training to 4. I have also used PEFT to decrease the size.
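
One way to see why the reported loss can sit at zero while the gradient does not: GRPO scores each completion relative to the other completions generated for the same prompt. Below is a minimal sketch, assuming the advantage is the reward minus the group mean, divided by the group standard deviation. The `eps` value and the example rewards are illustrative assumptions, not values taken from this run.

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of generations for a single prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical total rewards for 4 completions of one prompt.
rewards = [2.3, 1.1, 0.4, 1.9]
advantages = group_advantages(rewards)

print(advantages)
print(sum(advantages) / len(advantages))  # ~0 up to floating-point error
```

Because the advantages in a group average to zero, a loss value built from them can average to zero as well, even though each individual completion is still pushed up or down.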


<b>References</b>

* https://www.kaggle.com/models/deepseek-ai/deepseek-r1
* https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/discussion/557197
* https://huggingface.co/docs/trl/main/en/grpo_trainer


In [None]:
!python -m pip install --no-index -v --find-links=/kaggle/input/aimo-packages/offline_packages trl --pre
!python -m pip install --no-index -v --find-links=/kaggle/input/aimo-packages/offline_packages levenshtein
!python -m pip install --no-index -v --find-links=/kaggle/input/aimo-packages/offline_packages -U bitsandbytes

In [1]:
import os
from huggingface_hub import login
import wandb

login(os.getenv('hf_api_wtoken'))
wandb.login(key=os.getenv('wandb_api_token'))
run = wandb.init(
    project='Qwen-1.5B-grpo-imo', 
    job_type="training", 
    anonymous="allow"
)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/alfred/.netrc
[34m[1mwandb[0m: Currently logged in as: [33malfredcs[0m ([33malfredcs_team[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [2]:
from datasets import load_dataset,Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import GRPOConfig, GRPOTrainer

import datetime

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig,
    PrinterCallback,
)
from tqdm import tqdm
import torch
import time
import transformers
import pandas as pd
import numpy as np

from Levenshtein import ratio as levenshtein_ratio
transformers.set_seed(42)

[2025-02-06 18:33:46,964] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)



INFO 02-06 18:33:47 __init__.py:183] Automatically detected platform cuda.


In [55]:
class CFG:
    MAX_TRAIN = 1000
    MAX_TOKENS = 4096
    NUM_GENERATIONS = 4
    USE_PEFT = True
    BATCH_SIZE=1
    MAX_STEPS = 80
    
    BETA = 0.04
    LR = 1.e-5
    
    model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'
    splitter = '<｜Assistant｜>'
    
    step_count=10
    DEBUG = False

In [56]:
import re

def extract_boxed_text(text):
    pattern = r'oxed{(.*?)}'
    matches = re.findall(pattern, text)
    if not matches:
        return ""
    for match in matches[::-1]:
        if match != "":
            return match
    return ""
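
A few quick checks make the behaviour of this helper concrete. The sketch below repeats the same regex so it can run on its own; the example strings are made up for illustration. Note the last case: with nested braces the non-greedy pattern stops at the first closing brace, which is harmless here because only plain integer answers are kept.

```python
import re

def extract_boxed_text(text):
    # Same logic as the cell above: last non-empty \boxed{...} match, else "".
    matches = re.findall(r'oxed{(.*?)}', text)
    for match in reversed(matches):
        if match != "":
            return match
    return ""

print(extract_boxed_text(r"So John plays for $\boxed{3}$ days."))     # 3
print(extract_boxed_text(r"First \boxed{12}, corrected: \boxed{7}"))  # 7
print(repr(extract_boxed_text("no box here")))                        # ''
# Nested braces are cut at the first closing brace:
print(extract_boxed_text(r"\boxed{\frac{1}{2}}"))                     # \frac{1
```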

In [57]:
df = pd.read_parquet('~/data/imo/math_problems.parquet')
df = df.reset_index().rename({'index': 'id'}, axis=1)
df['answer'] = df['solution'].map(extract_boxed_text)

def is_valid_answer(s):
    try:
        if float(s) == int(s):
            i = int(s)
            return 0<=i<1000
        else:
            return False
    except ValueError:
        return False
    
mask = df['answer'].map(is_valid_answer)
df = df[mask]
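
To illustrate which answers survive this filter, here is a standalone sketch of the same check applied to a handful of made-up candidate strings. Only strings that parse directly as an integer in the range 0 to 999 pass; a value such as "3.0" is rejected because `int("3.0")` raises `ValueError`.

```python
def is_valid_answer(s):
    # Same check as above: integer-valued string in [0, 1000).
    try:
        if float(s) == int(s):
            return 0 <= int(s) < 1000
        return False
    except ValueError:
        return False

for candidate in ["3", "999", "1000", "-1", "3.0", "\\frac{1}{2}", ""]:
    print(repr(candidate), is_valid_answer(candidate))
```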

In [58]:
df = df.iloc[:CFG.MAX_TRAIN]
dataset = Dataset.from_pandas(df)
dataset = dataset.train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'problem', 'solution', 'answer', '__index_level_0__'],
        num_rows: 900
    })
    test: Dataset({
        features: ['id', 'problem', 'solution', 'answer', '__index_level_0__'],
        num_rows: 100
    })
})

In [59]:
len(dataset['train'])

900

In [60]:
def create_prompt(sample):
    question = sample['problem']
    chat = [{"role": "system", "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it.  The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>"},
            {"role": "user", "content": question + ' Return final answer within \\boxed{}, after taking modulo 1000.'},]
    sample['prompt'] = tokenizer.apply_chat_template(
            conversation=chat,
            tokenize=False,
            add_generation_prompt=True
        )
    return sample

In [61]:
dataset['train'][1]

{'id': 849,
 'problem': 'John can play 200 beats per minute. If he plays 2 hours a day for a certain number of days, he plays 72000 beats. How many days does John play?',
 'solution': "First, let's calculate how many beats John plays in one day. \n\nIf he can play 200 beats per minute and he plays for 2 hours a day, we need to convert the hours to minutes because the beat rate is given per minute. There are 60 minutes in an hour, so 2 hours is 2 * 60 = 120 minutes.\n\nNow, if he plays 200 beats per minute for 120 minutes, the total number of beats he plays in one day is:\n200 beats/minute * 120 minutes = 24000 beats/day\n\nNow we know that John plays 24000 beats in one day. We are given that he plays a total of 72000 beats. To find out how many days he plays, we divide the total number of beats by the number of beats he plays in one day:\n\n72000 beats / 24000 beats/day = 3 days\n\nSo, John plays for $\\boxed{3}$  days.",
 'answer': '3',
 '__index_level_0__': 849}

In [62]:
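# Reward functions used for training and evaluation:
# - formatting: a <think>...</think> block followed by a \boxed{} answer
# - accuracy: the last non-empty \boxed{} answer must equal the dataset answer
# - similarity: the text after </think> is compared with the reference solution (next cell)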

import re

def format_reward_func(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>.*?oxed{(.*?)}.*?$"
    matches = [re.match(pattern, content, re.DOTALL) for content in completions]
    return [1.0 if match else 0.0 for match in matches]


def accuracy_reward_func(completions, answer, **kwargs):
    # Regular expression to capture content inside \boxed{}
    contents = [extract_boxed_text(completion) for completion in completions]
    # Reward 1 if the content is the same as the ground truth, 0 otherwise
    return [1.0 if c == str(gt) else 0.0 for c, gt in zip(contents, answer)]
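
The following sketch shows how the formatting and accuracy rewards score a few hand-written completions. It re-implements the two checks under the stand-in names `format_reward`, `last_boxed` and `accuracy_reward` so that it runs on its own; the completions are invented for illustration.

```python
import re

def format_reward(completions):
    # 1.0 when the completion starts with <think>...</think> and later contains \boxed{...}.
    pattern = r"^<think>.*?</think>.*?oxed{(.*?)}.*?$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]

def last_boxed(text):
    # Last non-empty \boxed{...} content, or "" when there is none.
    matches = [m for m in re.findall(r"oxed{(.*?)}", text) if m != ""]
    return matches[-1] if matches else ""

def accuracy_reward(completions, answer):
    return [1.0 if last_boxed(c) == str(gt) else 0.0 for c, gt in zip(completions, answer)]

completions = [
    "<think>2 + 2 = 4</think>\nAdding gives \\boxed{4}.",  # well formed, correct
    "The answer is \\boxed{4}.",                            # correct, no <think> block
    "<think>2 + 2 = 5</think>\nSo \\boxed{5}.",             # well formed, wrong
]
answer = ["4", "4", "4"]

print(format_reward(completions))            # [1.0, 0.0, 1.0]
print(accuracy_reward(completions, answer))  # [1.0, 1.0, 0.0]
```

The two rewards are independent: a completion can earn the accuracy reward without the formatting reward, and the other way round.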

In [63]:
def levenshtein_reward_func(completions, solution, **kwargs):
    res = []
    for completion, sol in zip(completions, solution):
        if '</think>' in completion:
            t = completion.split('</think>')[-1]
            res.append(levenshtein_ratio(t, sol))
        else:
            res.append(0.0)
    return res

In [64]:
device_map = 'auto'
if CFG.USE_PEFT:
    compute_dtype = torch.float16
    bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=False,
        )
    original_model = AutoModelForCausalLM.from_pretrained(CFG.model_name, 
                                                          device_map=device_map,
                                                          quantization_config=bnb_config,
                                                          trust_remote_code=True)
else:
    original_model = AutoModelForCausalLM.from_pretrained(CFG.model_name, 
                                                          device_map=device_map,
                                                          trust_remote_code=True)

In [65]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model_name,trust_remote_code=True,padding_side="left")

In [66]:
dataset = dataset.map(create_prompt)#, batched=True)

In [51]:
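def gen(model, text, max_tokens):
    # `text` is one prompt or a list of identical prompts (equal length, so no padding is needed).
    model_input = tokenizer(text, return_tensors='pt').to(model.device)
    model.eval()
    with torch.no_grad():
        # Use the pad token's id here; pad_token_type_id is the token *type* (segment) id.
        tok = model.generate(**model_input, max_new_tokens=max_tokens, pad_token_id=tokenizer.pad_token_id)
        outputs = []
        for i in range(len(tok)):
            res = tokenizer.decode(tok[i], skip_special_tokens=True)
            # Keep only the text after the assistant marker, i.e. the completion.
            output = res.split(CFG.splitter)[-1]
            outputs.append(output)
        return outputs[0] if len(outputs) == 1 else outputs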

In [67]:
def evaluate_rewards(model, dataset, reward_functions: dict[str, callable], max_tokens: int, num_generations: int):
    completions = []
    other_info = []
    for example in tqdm(dataset):
        txt = example['prompt']
        kw = {k: v for k, v in example.items() if k not in {'prompt', 'completion'}}
        for _ in range(num_generations):
            other_info.append(kw)
            
        completion = gen(model, [txt]*num_generations, max_tokens)
        if isinstance(completion, str):
            completions.append(completion)
        else:
            completions += completion
        
    kwargs = {k: [d[k] for d in other_info] for k in other_info[0].keys()}
    res = {}
    for nm, reward_func in reward_functions.items():
        v = reward_func(completions=completions, **kwargs)
        print(nm, np.mean(v))
        res[nm] = np.mean(v)
    return res
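
The bookkeeping in `evaluate_rewards` is easier to follow on toy data. Each example's extra columns are repeated once per generation and then transposed into one list per column, so that entry i of every list lines up with completion i. A minimal sketch of just that step, using the hypothetical helper name `collate_columns` and made-up examples:

```python
def collate_columns(examples, num_generations):
    """Repeat each example's extra columns once per generation, then transpose to lists."""
    other_info = []
    for example in examples:
        kw = {k: v for k, v in example.items() if k not in {"prompt", "completion"}}
        other_info.extend([kw] * num_generations)
    return {k: [d[k] for d in other_info] for k in other_info[0]}

examples = [
    {"prompt": "p0", "answer": "3", "solution": "... \\boxed{3}"},
    {"prompt": "p1", "answer": "7", "solution": "... \\boxed{7}"},
]
kwargs = collate_columns(examples, num_generations=2)
print(kwargs["answer"])  # ['3', '3', '7', '7']
```

This is why each reward function can simply `zip` the completions with `answer` or `solution`.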

In [68]:
reward_functions = {'formatting': format_reward_func, 'accuracy': accuracy_reward_func, 'solution_quality': levenshtein_reward_func}

### Evaluate the original model

In [54]:
if not CFG.DEBUG:
    original_rewards = evaluate_rewards(model=original_model, dataset=dataset['test'], reward_functions=reward_functions, max_tokens=CFG.MAX_TOKENS, num_generations=CFG.NUM_GENERATIONS)

  0%|          | 9/3179 [22:27<131:52:30, 149.76s/it]


KeyboardInterrupt: 

### Configure GRPO trainer

In [69]:
dtstr = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
output_directory=f"./DEEPSEEK-GRPO-{dtstr}"


training_args = GRPOConfig(
    output_dir=output_directory,
    
    learning_rate=CFG.LR,
    
    per_device_train_batch_size=CFG.BATCH_SIZE,
    
    gradient_accumulation_steps=1,
    max_steps=CFG.MAX_STEPS,
    
    max_completion_length=CFG.MAX_TOKENS,  #8192
    num_generations=CFG.NUM_GENERATIONS,
    beta=CFG.BETA,
    
    logging_steps=CFG.step_count,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=CFG.step_count,
#     eval_strategy="steps",
#     eval_steps=CFG.step_count,
#     do_eval=True,
    # gradient_checkpointing=True,  # Will crash the whole thing
    report_to="wandb", #"none"
    overwrite_output_dir=True,
)

# Will typically use the AdamW optimizer

In [70]:
if CFG.USE_PEFT:
    peft_config = LoraConfig(
        r=32, #Rank
        lora_alpha=32,
        target_modules=[
            'q_proj',
            'k_proj',
            'v_proj',
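            'dense',  # note: Qwen2 layers likely have no module named 'dense' (the attention output projection is 'o_proj'), so this entry probably matches nothing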
        ],
        bias="none",
        lora_dropout=0.05,  # Conventional
        task_type="CAUSAL_LM",
    )
    trainer = GRPOTrainer(
        model=original_model,
        reward_funcs=list(reward_functions.values()),
        args=training_args,
        train_dataset=dataset['train'],
        peft_config=peft_config,
        callbacks=[PrinterCallback()]
    )
else:
    trainer = GRPOTrainer(
        model=original_model,
        reward_funcs=list(reward_functions.values()),
        args=training_args,
        train_dataset=dataset['train'],
        callbacks=[PrinterCallback()]
    )

In [None]:
trainer.train()

  output = module._old_forward(*args, **kwargs)


Step,Training Loss
10,0.0
20,0.0
30,0.0
40,-0.0
50,0.0
60,-0.0


{'loss': 0.0, 'grad_norm': 0.02567768283188343, 'learning_rate': 8.750000000000001e-06, 'completion_length': 988.6, 'rewards/format_reward_func': 0.925, 'rewards/accuracy_reward_func': 0.65, 'rewards/levenshtein_reward_func': 0.39275863766670227, 'reward': 1.9677586317062379, 'reward_std': 0.49106792388483883, 'kl': -5.328655242919922e-06, 'epoch': 0.011111111111111112}


{'loss': 0.0, 'grad_norm': 0.013751871883869171, 'learning_rate': 7.500000000000001e-06, 'completion_length': 1932.95, 'rewards/format_reward_func': 0.725, 'rewards/accuracy_reward_func': 0.675, 'rewards/levenshtein_reward_func': 0.32710439562797544, 'reward': 1.727104389667511, 'reward_std': 0.7703705318272114, 'kl': -6.204843521118164e-06, 'epoch': 0.022222222222222223}


{'loss': 0.0, 'grad_norm': 0.034051552414894104, 'learning_rate': 6.25e-06, 'completion_length': 1927.7, 'rewards/format_reward_func': 0.75, 'rewards/accuracy_reward_func': 0.55, 'rewards/levenshtein_reward_func': 0.3563418656587601, 'reward': 1.6563418865203858, 'reward_std': 0.2885350169613957, 'kl': -6.633996963500976e-06, 'epoch': 0.03333333333333333}


{'loss': -0.0, 'grad_norm': 0.0, 'learning_rate': 5e-06, 'completion_length': 2418.1, 'rewards/format_reward_func': 0.575, 'rewards/accuracy_reward_func': 0.55, 'rewards/levenshtein_reward_func': 0.28815700188279153, 'reward': 1.413156995177269, 'reward_std': 0.709516017511487, 'kl': -6.16908073425293e-06, 'epoch': 0.044444444444444446}


{'loss': 0.0, 'grad_norm': 0.04182285815477371, 'learning_rate': 3.7500000000000005e-06, 'completion_length': 2537.875, 'rewards/format_reward_func': 0.475, 'rewards/accuracy_reward_func': 0.425, 'rewards/levenshtein_reward_func': 0.22036247700452805, 'reward': 1.1203624725341796, 'reward_std': 0.4122450739145279, 'kl': -7.528066635131836e-06, 'epoch': 0.05555555555555555}


{'loss': -0.0, 'grad_norm': 0.03483168035745621, 'learning_rate': 2.5e-06, 'completion_length': 2591.525, 'rewards/format_reward_func': 0.65, 'rewards/accuracy_reward_func': 0.4, 'rewards/levenshtein_reward_func': 0.2755273155868053, 'reward': 1.325527310371399, 'reward_std': 0.4981116403825581, 'kl': -7.593631744384766e-06, 'epoch': 0.06666666666666667}
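
Two things in these log entries can be checked by hand. The logged learning rates are consistent with a linear decay from `CFG.LR` to zero over `CFG.MAX_STEPS` with no warmup (an assumption about the scheduler, inferred from the numbers), and the logged `reward` matches the sum of the three per-function rewards. A small sketch using values copied from the logs above:

```python
LR, MAX_STEPS = 1e-5, 80  # values from CFG

def linear_lr(step, base_lr=LR, total_steps=MAX_STEPS):
    """Learning rate under a linear decay to zero with no warmup (assumed schedule)."""
    return base_lr * (1 - step / total_steps)

# Learning rates as logged at each logging step.
logged_lr = {10: 8.750000000000001e-06, 20: 7.500000000000001e-06, 30: 6.25e-06,
             40: 5e-06, 50: 3.7500000000000005e-06, 60: 2.5e-06}
for step, lr in logged_lr.items():
    print(step, lr, linear_lr(step))

# First log entry: compare the total reward with the sum of its three components.
components = [0.925, 0.65, 0.39275863766670227]
print(sum(components), "vs logged", 1.9677586317062379)
```

Since the total is a plain sum, the two 0/1 rewards (formatting and accuracy) carry more weight than the similarity score, which stays between about 0.2 and 0.4 in these logs.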



In [None]:
if CFG.USE_PEFT:
    print('Loading trained model')
    CHKPT = CFG.MAX_STEPS
    adapter_model_name = f'{output_directory}/checkpoint-{CHKPT}/'
    new_model = PeftModel.from_pretrained(original_model, adapter_model_name)
else:
    new_model = original_model

### Evaluate fine tuned model

In [None]:
rewards = evaluate_rewards(model=new_model, dataset=dataset['test'], reward_functions=reward_functions, max_tokens=CFG.MAX_TOKENS, num_generations=CFG.NUM_GENERATIONS)
rewards