# Baseline GRPO and SFT Finetuning

In this notebook, we finetune a 4-bit quantized Qwen2.5-1.5B-Instruct model with QLoRA (via PEFT) on a training dataset built from the GTA benchmark. The process produces two models: SFT_SFT_model and SFT_GRPO_model.

Training is carried out in two phases:  
1- **Phase 1:** Both models are first trained on the same half of the training dataset using supervised finetuning (SFT).  
2- **Phase 2:** The models are then finetuned on the remaining half of the training dataset using different methods: SFT_SFT_model is trained with SFT, while SFT_GRPO_model is finetuned with Group Relative Policy Optimization (GRPO).
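
To make the "group relative" part of GRPO concrete, the following is a minimal sketch of how the rewards of several completions sampled from the same prompt can be turned into advantages: each reward is centered on the group mean and scaled by the group standard deviation. It illustrates the idea only and is not the TRL implementation; the function name `group_relative_advantages` and the `eps` constant are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Center and scale the rewards of one group of completions.

    `rewards` holds one scalar reward per completion sampled from the same
    prompt. `eps` (an assumed constant) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three completions for one prompt, scored on a -2 / -1 / +1 scale
advantages = group_relative_advantages([1, -1, -2])
print([round(a, 3) for a in advantages])

# A group with identical rewards yields zero advantages (no learning signal)
print(group_relative_advantages([1, 1, 1]))
```

Because advantages are relative to the group, a prompt only contributes to learning when its sampled completions receive different rewards.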

For a detailed discussion of the experimental objectives and analysis, see the main.ipynb notebook.

In [None]:
!git clone --branch feat/experiment1 --single-branch https://github.com/YassKa71/AI_agents_mini_project.git
%cd AI_agents_mini_project


Cloning into 'AI_agents_mini_project'...
remote: Enumerating objects: 66, done.
remote: Counting objects: 100% (66/66), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 66 (delta 19), reused 63 (delta 16), pack-reused 0 (from 0)
Receiving objects: 100% (66/66), 921.80 KiB | 3.91 MiB/s, done.
Resolving deltas: 100% (19/19), done.
/content/AI_agents_mini_project


In [None]:
!pip install -e .

Obtaining file:///content/AI_agents_mini_project
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting bitsandbytes>=0.49.1 (from ai_agents_mini_project==0.1.0)
  Downloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting datasets>=4.5.0 (from ai_agents_mini_project==0.1.0)
  Downloading datasets-4.5.0-py3-none-any.whl.metadata (19 kB)
Collecting pandas>=2.3.3 (from ai_agents_mini_project==0.1.0)
  Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting peft>=0.18.1 (from ai_agents_mini_project==0.1.0)
  Downloading peft-0.18.1-py3-none-any.whl.metadata (14 kB)


In [None]:
# IMPORTANT: restart the Colab session after installing the project.

# importing needed modules
import torch
import json
import re
import ast

import pandas as pd
import numpy as np

from research.ai_agents.environment.evaluation import StaticEnvironment
from research.utils.utils import SLM, parse
from research.paths import RESEARCH_REPO_ROOT
from datasets import load_dataset
from trl import SFTTrainer
from trl import GRPOTrainer, GRPOConfig
from transformers import TrainingArguments
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft import PeftModel
from tqdm import tqdm
from google.colab import files

  the file: research\datasets\environment_datasets\toolmeta.json.


## Environment and Dataset

The GTA benchmark consists of 209 tool-agent tasks, each annotated with the correct goal-reaching sequence of thoughts and actions. For each task, the agent is given a specific set of tools and inputs and must answer a question or complete a task in a small number of steps, using only the provided tools.

The reasoning process needed to reach the goal can be relatively complex for small language models (SLMs): the agent must first gather the relevant information with different tools and then combine the results correctly.

This benchmark is a suitable experimental framework for our "mini-experiment" on SLM-based AI agents: it fits our limited computational resources, and configuring the environment with it is easy and time-efficient.

#### Clear GTA Dataset

In [None]:
clear_gta_path = RESEARCH_REPO_ROOT / "datasets" / "environment_datasets" / "clear_gta.csv"
clear_gta_df = pd.read_csv(clear_gta_path)
clear_gta_df.head()

Unnamed: 0.1,Unnamed: 0,input,instruction,tools,final_answer,tool_calls
0,0,"[{'image_path': 'image/image_1.jpg', 'image_de...",How much should I pay for the beer on the tabl...,"[{""name"": ""OCR"", ""description"": ""This tool can...",[['12']],"{""CountGivenObject"": [{""arguments"": {""image"": ..."
1,1,"[{'image_path': 'image/image_3.jpg', 'image_de...",I want to buy a PS5 for each child in the phot...,"[{""name"": ""Calculator"", ""description"": ""A calc...",[['1919.96']],"{""CountGivenObject"": [{""arguments"": {""image"": ..."
2,2,"[{'image_path': 'image/image_7.jpg', 'image_de...",How many cups of water do the people in the p...,"[{""name"": ""Calculator"", ""description"": ""A calc...",[['27']],"{""OCR"": [{""arguments"": {""image"": ""image/image_..."
3,3,"[{'image_path': 'image/image_9.jpg', 'image_de...",I need to prepare twelve servings of this dis...,"[{""name"": ""OCR"", ""description"": ""This tool can...",[['2']],"{""OCR"": [{""arguments"": {""image"": ""image/image_..."
4,4,"[{'image_path': 'image/image_11.jpg', 'image_d...",I want to buy a dog toy for each dog in the p...,"[{""name"": ""Calculator"", ""description"": ""A calc...",[['79.96']],"{""OCR"": [{""arguments"": {""image"": ""image/image_..."
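
The columns in the preview above are stored as strings. A minimal sketch of how they could be decoded is shown below, assuming (from the quoting visible in the preview) that `input` and `final_answer` use Python-literal syntax while `tools` is JSON. The `row` dictionary is a hypothetical, shortened stand-in for one CSV row, and the tool description text is invented for the example.

```python
import ast
import json

# Hypothetical, shortened cell values that mimic the preview above
row = {
    "input": "[{'image_path': 'image/image_1.jpg'}]",
    "tools": '[{"name": "OCR", "description": "Reads text from an image."}]',
    "final_answer": "[['12']]",
}

inputs = ast.literal_eval(row["input"])                # Python-literal syntax
tools = json.loads(row["tools"])                       # JSON syntax
final_answer = ast.literal_eval(row["final_answer"])   # nested list of strings

print(inputs[0]["image_path"])
print([tool["name"] for tool in tools])
print(final_answer[0][0])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for cells of this kind.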


#### Finetuning Dataset

In [None]:
# Load dataset
finetuning_dataset_path = RESEARCH_REPO_ROOT / "datasets" / "environment_datasets" / "finetuning_dataset.jsonl"
dataset = load_dataset("json", data_files={"train": str(finetuning_dataset_path)})
ds = dataset["train"]
dataset_df = ds.select(range(5)).to_pandas()
print("Dataset overview:")
dataset_df.head()

Generating train split: 0 examples [00:00, ? examples/s]

Dataset overview:


Unnamed: 0,prompt,completion,task_id
0,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: Th...",0
1,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: No...",0
2,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: No...",0
3,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Action: {'n...",0
4,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: To...",1


In [None]:
GROUP_COL = "task_id"
SEED = 42

# split by unique group ids, then select rows by indices
def split_by_group(ds, group_col, test_size, seed):
    # Get all group ids
    group_values = ds[group_col]
    unique_groups = np.array(sorted(set(group_values)))

    rng = np.random.default_rng(seed)
    rng.shuffle(unique_groups)

    n_test_groups = int(round(len(unique_groups) * test_size))
    test_groups = set(unique_groups[:n_test_groups])
    train_groups = set(unique_groups[n_test_groups:])

    # Build row indices for each split
    train_idx = [i for i, g in enumerate(group_values) if g in train_groups]
    test_idx  = [i for i, g in enumerate(group_values) if g in test_groups]

    return ds.select(train_idx), ds.select(test_idx)


# Split into train/test with disjoint task_ids
train_dataset, test_dataset = split_by_group(
    ds, GROUP_COL, test_size=0.4, seed=SEED
)

# Split train into SFT vs GRPO with disjoint task_ids
sft_train_dataset, grpo_train_dataset = split_by_group(
    train_dataset, GROUP_COL, test_size=0.5, seed=SEED
)

# Display lengths
print(f"Train dataset size: {len(train_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")
print(f"SFT train dataset size: {len(sft_train_dataset)}")
print(f"GRPO train dataset size: {len(grpo_train_dataset)}")

# Display task ids for each split
sft_group_values = sft_train_dataset[GROUP_COL]
sft_tasks = np.array(sorted(set(sft_group_values)))
print(f"\n SFT task ids: \n{sft_tasks}")

grpo_group_values = grpo_train_dataset[GROUP_COL]
grpo_tasks = np.array(sorted(set(grpo_group_values)))
print(f"\n GRPO task ids: \n{grpo_tasks}")

train_group_values = train_dataset[GROUP_COL]
train_tasks = np.array(sorted(set(train_group_values)))
print(f"\n All train task ids: \n{train_tasks}")

test_group_values = test_dataset[GROUP_COL]
test_tasks = np.array(sorted(set(test_group_values)))
print(f"\n All test task ids: \n{test_tasks}")

Train dataset size: 406
Test dataset size: 266
SFT train dataset size: 203
GRPO train dataset size: 203

 SFT task ids: 
['1' '11' '112' '114' '118' '119' '125' '126' '13' '134' '135' '141' '142'
 '146' '15' '151' '152' '153' '156' '157' '160' '162' '164' '165' '169'
 '172' '173' '180' '181' '182' '188' '190' '193' '196' '197' '198' '199'
 '202' '205' '212' '216' '219' '221' '224' '225' '227' '27' '33' '37' '38'
 '4' '40' '46' '54' '55' '56' '58' '63' '64' '7' '77' '78' '8' '80' '86'
 '90' '91' '94' '98']

 GRPO task ids: 
['0' '102' '103' '107' '108' '110' '111' '113' '115' '117' '129' '130'
 '133' '136' '137' '138' '139' '140' '150' '154' '155' '167' '168' '175'
 '177' '18' '186' '191' '20' '200' '203' '206' '208' '209' '210' '214'
 '22' '220' '222' '223' '226' '23' '24' '25' '3' '30' '34' '36' '42' '43'
 '50' '51' '52' '57' '59' '66' '67' '68' '69' '70' '71' '72' '74' '82'
 '83' '84' '85' '97']

 All train task ids: 
['0' '1' '102' '103' '107' '108' '11' '110' '111' '112' '113' '114
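
Since every split is made by task id, no task should appear in more than one subset. The sketch below checks this property on toy data. It mirrors the logic of `split_by_group` but uses the standard-library `random` module instead of NumPy, so it will not reproduce the exact task assignment shown above; `split_groups` and the toy ids are assumptions made for the example.

```python
import random


def split_groups(group_values, test_size, seed):
    """Return (train_idx, test_idx) so that each group lands in one split only."""
    unique_groups = sorted(set(group_values))
    random.Random(seed).shuffle(unique_groups)
    n_test = int(round(len(unique_groups) * test_size))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(group_values) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_values) if g in test_groups]
    return train_idx, test_idx


# Toy data: 10 tasks with 3 rows (steps) each, labelled with string ids
task_ids = [str(t) for t in range(10) for _ in range(3)]
train_idx, test_idx = split_groups(task_ids, test_size=0.4, seed=42)

train_tasks = {task_ids[i] for i in train_idx}
test_tasks = {task_ids[i] for i in test_idx}

assert train_tasks.isdisjoint(test_tasks)               # no task leaks across splits
assert len(train_idx) + len(test_idx) == len(task_ids)  # every row is assigned once
print(len(train_tasks), len(test_tasks))
```

The same `isdisjoint` check can be applied to the `sft_tasks`, `grpo_tasks` and `test_tasks` arrays computed in the cell above.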

## **Phase 1:** Supervised finetuning on 50% of the training dataset

In [None]:
def load_base(model_name):
    # 4-bit quantization config (QLoRA)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float16,
    )
    base.config.use_cache = False
    base = prepare_model_for_kbit_training(base)
    return base

In [None]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# load quantized model
model = load_base(model_name)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Attach LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training arguments
training_args = TrainingArguments(
    output_dir="./out_sft",
    per_device_train_batch_size=3,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=5,
    logging_steps=10,
    save_strategy="no",
    save_total_limit=2,
    bf16=torch.cuda.is_available(),
    fp16=not torch.cuda.is_available(),
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
)

trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820
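
The trainable-parameter count printed above can be checked by hand: for each targeted linear layer, LoRA adds two matrices with `r * (in_features + out_features)` parameters in total. The sketch below does this arithmetic. The architecture values (hidden size, MLP width, key/value dimension, number of layers) are assumptions about Qwen2.5-1.5B-Instruct that are not stated in this notebook and should be checked against the model's `config.json`.

```python
# Assumed architecture values for Qwen2.5-1.5B-Instruct. They are not stated
# in this notebook; check them against the model's config.json.
HIDDEN = 1536         # hidden_size
INTERMEDIATE = 8960   # intermediate_size (MLP width)
KV_DIM = 256          # num_key_value_heads * head_dim = 2 * 128
NUM_LAYERS = 28       # num_hidden_layers
RANK = 16             # r in the LoraConfig above

# (in_features, out_features) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
all_params = 1_562_179_072  # "all params" value reported in the output above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the computed count equals the 18,464,768 trainable parameters reported by `print_trainable_parameters()`.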


In [None]:
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=sft_train_dataset,
)
trainer.train()
trainer.save_model("./out_sft/final")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
  return fn(*args, **kwargs)


Step,Training Loss
10,1.0611
20,0.7564
30,0.5831
40,0.566


In [None]:
!zip -r ./out_sft.zip ./out_sft
files.download('./out_sft.zip')

  adding: out_sft/ (stored 0%)
  adding: out_sft/final/ (stored 0%)
  adding: out_sft/final/README.md (deflated 65%)
  adding: out_sft/final/merges.txt (deflated 57%)
  adding: out_sft/final/special_tokens_map.json (deflated 69%)
  adding: out_sft/final/vocab.json (deflated 61%)
  adding: out_sft/final/added_tokens.json (deflated 67%)
  adding: out_sft/final/adapter_config.json (deflated 58%)
  adding: out_sft/final/tokenizer.json (deflated 81%)
  adding: out_sft/final/chat_template.jinja (deflated 71%)
  adding: out_sft/final/adapter_model.safetensors (deflated 22%)
  adding: out_sft/final/training_args.bin (deflated 52%)
  adding: out_sft/final/tokenizer_config.json (deflated 89%)

## **Phase 2:** Finetuning on the remaining half of the training dataset

In [None]:
!unzip ./out_sft.zip

Archive:  ./out_sft.zip
   creating: out_sft/
   creating: out_sft/final/
  inflating: out_sft/final/README.md  
  inflating: out_sft/final/merges.txt  
  inflating: out_sft/final/special_tokens_map.json  
  inflating: out_sft/final/vocab.json  
  inflating: out_sft/final/added_tokens.json  
  inflating: out_sft/final/adapter_config.json  
  inflating: out_sft/final/tokenizer.json  
  inflating: out_sft/final/chat_template.jinja  
  inflating: out_sft/final/adapter_model.safetensors  
  inflating: out_sft/final/training_args.bin  
  inflating: out_sft/final/tokenizer_config.json  


### GRPO

In [None]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# load sft model for grpo training
base_for_grpo = load_base(model_name)
sft_grpo_model = PeftModel.from_pretrained(base_for_grpo, "./out_sft/final", is_trainable=True)

sft_grpo_model.print_trainable_parameters()

trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820


In [None]:
# Training Data
training_info = {
    "prompts": [],
    "completions": [],
    "rewards": [],
    "step": []
}

# slm for evaluation
slm = SLM(model_name)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


`torch_dtype` is deprecated! Use `dtype` instead!


In [None]:
def reward_func(prompts, completions, trainer_state, task_id, **kwargs):
    """Reward function that assigns the following scores to actions:
    - parsing error: -2
    - already tried action: -2
    - unknown or missing tool / arguments: -2
    - wrong action: -1
    - correct action: +1
    """
    rewards= []
    for prompt, completion, t_id in zip(prompts, completions, task_id):
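      # reward for missing or already tried action
      completion = completion[0]["content"]
      if "Action:" not in completion:
          rewards.append(-2)
          continue
      completion_action = completion.split("Action:")[1].strip()
      # prompt is a list of chat messages, so search each message's content
      if any(completion_action in message["content"] for message in prompt):
          rewards.append(-2)
          continue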

      # environment initiation
      sample = clear_gta_df.iloc[int(t_id)]
      environment = StaticEnvironment(sample, slm)

      # errors
      errors = ["ValueError", "TypeError"]

      # run environment on the action and assign a reward
      try:
          parsed_output = parse(completion)
          _, _ = parsed_output["name"], parsed_output["arguments"]
          for argument in parsed_output["arguments"]:
              _, _ = argument["name"], argument["value"]
          observation = environment.run(parsed_output)

          if any([error in observation for error in errors]):
            # reward for unknown or missing tool / arguments
            rewards.append(-2)
          elif "wrong direction." in observation:
            # reward for wrong action
            rewards.append(-1)
          else:
            # reward for correct action
            rewards.append(1)

      except (json.JSONDecodeError, SyntaxError, ValueError, KeyError, IndexError, TypeError):
          # reward for parsing error
          rewards.append(-2)

    # save training data
    for prompt, completion, reward, t_id in zip(prompts, completions, rewards, task_id):
      training_info["prompts"].append(prompt)
      training_info["completions"].append(completion)
      training_info["rewards"].append(reward)
      training_info["step"].append(trainer_state.global_step)

    return rewards
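
The reward logic above has two parts: extracting the action from the completion text, and mapping the environment's observation to a score. The sketch below isolates both so they can be tried without the environment. `parse_action` is a hypothetical stand-in for the repository's `parse` helper (whose exact behaviour is not shown in this notebook), and the observation strings passed to `score_observation` are invented for the example.

```python
import ast


def parse_action(completion):
    """Extract the action dict that follows the 'Action:' marker.

    Hypothetical stand-in for the repository's `parse` helper.
    """
    action_text = completion.split("Action:", 1)[1].strip()
    action = ast.literal_eval(action_text)
    if not isinstance(action, dict):
        raise ValueError("action must be a dict")
    return action


def score_observation(observation):
    """Map an environment observation string to the reward scheme above."""
    if "ValueError" in observation or "TypeError" in observation:
        return -2  # unknown or missing tool / arguments
    if "wrong direction." in observation:
        return -1  # valid call, but not the expected step
    return 1       # correct action


completion = (
    "Thought: compute the share of women.\n"
    "Action:\n"
    "{'name': 'Calculator', 'arguments': "
    "[{'name': 'expression', 'value': '5 / 9 * 100'}]}"
)
action = parse_action(completion)
print(action["name"], action["arguments"][0]["value"])
print(score_observation("This is a wrong direction."))
```

Keeping the scoring rule in a small pure function like this makes it straightforward to unit-test the reward scheme separately from the environment.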

In [None]:
grpo_args = GRPOConfig(
    output_dir="./out_sft_grpo",
    per_device_train_batch_size=3,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    logging_steps=10,
    save_steps=1,
    num_generations=3,
    max_completion_length=750,
    temperature=1.0,
    top_p=1.0,
    beta=0.0,
    num_iterations=1,
    epsilon=0.2,
    loss_type="dapo",
    remove_unused_columns=False,
    report_to="none",
    num_train_epochs=5,
)

trainer = GRPOTrainer(
    model=sft_grpo_model,
    args=grpo_args,
    reward_funcs=reward_func,
    train_dataset=grpo_train_dataset,
)

trainer.train()
trainer.save_model("./out_sft_grpo/final_adapter")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
10,0.0697
20,0.0263
30,0.1213
40,0.0053
50,0.0878
60,0.0952
70,0.0383
80,0.09
90,0.0193
100,0.1652




In [None]:
training_data_df = pd.DataFrame(training_info)

In [None]:
training_data_df.head()

Unnamed: 0,prompts,completions,rewards,step
0,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: Wi...",1,0
1,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: No...",1,0
2,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: No...",-1,0
3,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: No...",1,0
4,"[{'role': 'system', 'content': 'You are an exp...","[{'role': 'assistant', 'content': 'Thought: I ...",1,0
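
The GRPO training loss is hard to interpret on its own, so the rewards logged in `training_info` are a more direct signal of progress. The sketch below computes the mean reward per training step on a toy log; the values in `log` are made up and only the key names match `training_info`.

```python
from collections import defaultdict

# Toy log with the same keys as `training_info` (values are made up)
log = {
    "rewards": [1, 1, -1, 1, -2, 1, 1, 1, -1],
    "step":    [0, 0, 0, 1, 1, 1, 2, 2, 2],
}

# Group the rewards by the trainer step at which they were logged
per_step = defaultdict(list)
for reward, step in zip(log["rewards"], log["step"]):
    per_step[step].append(reward)

mean_reward = {step: sum(r) / len(r) for step, r in sorted(per_step.items())}
for step, value in mean_reward.items():
    print(f"step {step}: mean reward {value:+.2f}")
```

On the real data, `training_data_df.groupby("step")["rewards"].mean()` gives the same aggregation with pandas.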


In [None]:
training_data_df.to_csv("./data_grpo_sft.csv")

In [None]:
!zip -r ./out_sft_grpo.zip ./out_sft_grpo

  adding: out_sft_grpo/ (stored 0%)
  adding: out_sft_grpo/README.md (deflated 47%)
  adding: out_sft_grpo/checkpoint-217/ (stored 0%)
  adding: out_sft_grpo/checkpoint-217/README.md (deflated 65%)
  adding: out_sft_grpo/checkpoint-217/trainer_state.json (deflated 84%)
  adding: out_sft_grpo/checkpoint-217/merges.txt (deflated 57%)
  adding: out_sft_grpo/checkpoint-217/scheduler.pt (deflated 61%)
  adding: out_sft_grpo/checkpoint-217/special_tokens_map.json (deflated 69%)
  adding: out_sft_grpo/checkpoint-217/vocab.json (deflated 61%)
  adding: out_sft_grpo/checkpoint-217/rng_state.pth (deflated 26%)
  adding: out_sft_grpo/checkpoint-217/added_tokens.json (deflated 67%)
  adding: out_sft_grpo/checkpoint-217/adapter_config.json (deflated 58%)
  adding: out_sft_grpo/checkpoint-217/tokenizer.json (deflated 81%)
  adding: out_sft_grpo/checkpoint-217/chat_template.jinja (deflated 71%)
  adding: out_sft_grpo/checkpoint-217/optimizer.pt (deflated 24%)
  adding: out_sft_grpo/checkpoint-217/ada

In [None]:
files.download('./out_sft_grpo.zip')
files.download('./data_grpo_sft.csv')

### SFT

In [None]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# load the Phase 1 SFT adapter for the second SFT phase
base_for_sft = load_base(model_name)
sft_sft_model = PeftModel.from_pretrained(base_for_sft, "./out_sft/final", is_trainable=True)

sft_sft_model.print_trainable_parameters()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820


In [None]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


# Training arguments
training_args = TrainingArguments(
    output_dir="./out_sft_sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    num_train_epochs=5,
    logging_steps=10,
    save_steps=1,
    report_to="none",
)

trainer = SFTTrainer(
    model=sft_sft_model,
    args=training_args,
    train_dataset=grpo_train_dataset,
)
trainer.train()
trainer.save_model("./out_sft_sft/final")




Tokenizing train dataset:   0%|          | 0/203 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/203 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
  return fn(*args, **kwargs)


Step,Training Loss
10,0.617
20,0.765
30,0.642
40,0.5753
50,0.5706
60,0.4201
70,0.5811
80,0.6579
90,0.5992
100,0.5204



In [None]:
!zip -r ./out_sft_sft.zip ./out_sft_sft
files.download('./out_sft_sft.zip')

  adding: out_sft_sft/ (stored 0%)
  adding: out_sft_sft/README.md (deflated 43%)
  adding: out_sft_sft/checkpoint-217/ (stored 0%)
  adding: out_sft_sft/checkpoint-217/README.md (deflated 65%)
  adding: out_sft_sft/checkpoint-217/trainer_state.json (deflated 75%)
  adding: out_sft_sft/checkpoint-217/merges.txt (deflated 57%)
  adding: out_sft_sft/checkpoint-217/scheduler.pt (deflated 61%)
  adding: out_sft_sft/checkpoint-217/special_tokens_map.json (deflated 69%)
  adding: out_sft_sft/checkpoint-217/vocab.json (deflated 61%)
  adding: out_sft_sft/checkpoint-217/rng_state.pth (deflated 26%)
  adding: out_sft_sft/checkpoint-217/added_tokens.json (deflated 67%)
  adding: out_sft_sft/checkpoint-217/adapter_config.json (deflated 58%)
  adding: out_sft_sft/checkpoint-217/tokenizer.json (deflated 81%)
  adding: out_sft_sft/checkpoint-217/chat_template.jinja (deflated 71%)
  adding: out_sft_sft/checkpoint-217/optimizer.pt (deflated 23%)
  adding: out_sft_sft/checkpoint-217/adapter_model.safet



## Test

In [None]:
# slm for evaluation
slm = SLM(model_name)

`torch_dtype` is deprecated! Use `dtype` instead!


In [None]:
def reward_tests(prompts, completions, task_id):
    """Score test completions with the same scheme as `reward_func`:
    - parsing error: -2
    - already tried action: -2
    - unknown or missing tool / arguments: -2
    - wrong action: -1
    - correct action: +1

    Returns a dict with the task ids, prompts, completions and rewards.
    """
    # Testing data
    testing_info = {
        "task_id": [],
        "prompts": [],
        "completions": [],
        "rewards": [],
    }

    rewards= []
    for prompt, completion, t_id in tqdm(zip(prompts, completions, task_id),  desc="Rewarding completions", unit="completion"):
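      # reward for missing or already tried action
      if "Action:" not in completion:
          rewards.append(-2)
          continue
      completion_action = completion.split("Action:")[1].strip()
      # prompt is a list of chat messages, so search each message's content
      if any(completion_action in message["content"] for message in prompt):
          rewards.append(-2)
          continue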

      # environment initiation
      sample = clear_gta_df.iloc[int(t_id)]
      environment = StaticEnvironment(sample, slm)

      # errors
      errors = ["ValueError", "TypeError"]

      # run environment on the action and assign a reward
      try:
          parsed_output = parse(completion)
          _, _ = parsed_output["name"], parsed_output["arguments"]
          for argument in parsed_output["arguments"]:
              _, _ = argument["name"], argument["value"]
          observation = environment.run(parsed_output)

          if any([error in observation for error in errors]):
            # reward for unknown or missing tool / arguments
            rewards.append(-2)
          elif "wrong direction." in observation:
            # reward for wrong action
            rewards.append(-1)
          else:
            # reward for correct action
            rewards.append(1)

      except (json.JSONDecodeError, SyntaxError, ValueError, KeyError, IndexError, TypeError):
          # reward for parsing error
          rewards.append(-2)

    # save testing data
    for prompt, completion, reward, t_id in zip(prompts, completions, rewards, task_id):
      testing_info["task_id"].append(t_id)
      testing_info["prompts"].append(prompt)
      testing_info["completions"].append(completion)
      testing_info["rewards"].append(reward)

    return testing_info

In [None]:
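def generate_completion(model, prompt):
  """Greedily generate one completion for a chat-formatted prompt."""
  # 1) Render the chat messages into a single prompt string
  text = tokenizer.apply_chat_template(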
      prompt,
      tokenize=False,
      add_generation_prompt=True
  )

  # 2) Tokenize to tensors
  inputs = tokenizer(
      text,
      return_tensors="pt"
  ).to(model.device)

  # 3) Generate
  with torch.no_grad():
      outputs = model.generate(
          **inputs,
          do_sample=False,
          max_new_tokens=750,
      )

  # 4) Decode only the newly generated tokens (recommended)
  gen_ids = outputs[0, inputs["input_ids"].shape[-1]:]
  return tokenizer.decode(gen_ids, skip_special_tokens=True)

In [None]:
def test(model, outpath):
  """Generate completions for the test set, score them, and save the results."""
  prompts = list(test_dataset["prompt"])
  task_ids = list(test_dataset["task_id"])
  completions = []
  for prompt in tqdm(prompts, desc="Generating completions", unit="prompt"):
    completion = generate_completion(model, prompt)
    completions.append(completion)
  testing_info = reward_tests(prompts, completions, task_ids)
  testing_data_df = pd.DataFrame(testing_info)
  testing_data_df.to_csv(outpath, sep="|")


### Testing SFT model

In [None]:
# base_for_grpo = load_base()
# grpo_model = PeftModel.from_pretrained(base_for_grpo, "/content/out_grpo_half/final_adapter").eval()

base_for_sft = load_base(model_name)
sft_model  = PeftModel.from_pretrained(base_for_sft, "./out_sft_sft/final").eval()


In [None]:
test(sft_model, "./test_sft_sft.csv")

Generating completions: 100%|██████████| 266/266 [37:23<00:00,  8.44s/prompt]
Rewarding completions: 266completion [00:11, 22.61completion/s]


In [None]:
files.download('./test_sft_sft.csv')

In [None]:
df_model_sft = pd.read_csv("./test_sft_sft.csv", sep="|")

### Testing GRPO model

In [None]:
!unzip /content/out_grpo_sft1.zip

Archive:  /content/out_grpo_sft1.zip
   creating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/README.md  
   creating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/README.md  
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/trainer_state.json  
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/merges.txt  
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/scheduler.pt  
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/special_tokens_map.json  
  inflating: content/AI_agents_mini_project/AI_agents_mini_project/out_grpo_half/checkpoint-118/vocab.json  
  inflating: co

In [None]:
ADAPTER_DIR = "/content/out_grpo_half/final_adapter"
grpo_model = PeftModel.from_pretrained(model, ADAPTER_DIR)



In [None]:
test(grpo_model, "/content/test_grpo_second_half.csv")

Generating completions: 100%|██████████| 50/50 [06:25<00:00,  7.71s/prompt]
Rewarding completions: 50completion [00:02, 20.10completion/s]


In [None]:
df_model_grpo = pd.read_csv("/content/test_grpo_second_half.csv", sep="|")

In [None]:
df1 = df_model_sft[["task_id", "prompts", "completions", "rewards"]].rename(columns={"rewards": "reward_1", "completions": "completion_1"})
df2 = df_model_grpo[["task_id", "prompts","completions", "rewards"]].rename(columns={"rewards": "reward_2", "completions": "completion_2"})

df = df1.merge(df2, on=["prompts", "task_id"], how="inner")
df.head()

Unnamed: 0,task_id,prompts,completion_1,reward_1,completion_2,reward_2
0,2,"[{'role': 'system', 'content': 'You are an exp...",Thought:\nSince we don't know which image cont...,-2,Thought:\nSince we don't know which image cont...,-1
1,2,"[{'role': 'system', 'content': 'You are an exp...",Thought:\nNow let's describe the second image ...,-2,Thought:\nNow that we know the contents of bot...,1
2,2,"[{'role': 'system', 'content': 'You are an exp...","Action:\n{'name': 'OCR', 'arguments': [{'name'...",-2,"Action:\n{'name': 'OCR', 'arguments': [{'name'...",-2
3,2,"[{'role': 'system', 'content': 'You are an exp...","Action:\n{'name': 'FinalAnswer', 'arguments': ...",1,"Action:\n{'name': 'FinalAnswer', 'arguments': ...",1
4,2,"[{'role': 'system', 'content': 'You are an exp...","Action:\n{'name': 'FinalAnswer', 'arguments': ...",1,"Action:\n{'name': 'FinalAnswer', 'arguments': ...",1


In [None]:
print(df['completion_1'].iloc[34])
print(df['completion_2'].iloc[34])

Action:
{'name': 'FinalAnswer', 'arguments': [{'name': 'answer', 'value': "[['88.9']]"}]}
Thought:
Now that I know the number of women (5) and the total number of people (9), I can calculate the percentage of women by dividing the number of women by the total number of people and then multiplying by 100.

Let's calculate the percentage.


Action:
{'name': 'Calculator', 'arguments': [{'name': 'expression', 'value': '5 / 9 * 100'}]}


In [None]:
total_reward_model1 = df["reward_1"].sum()
total_reward_model2 = df["reward_2"].sum()
print(total_reward_model1, total_reward_model2)

-9 8


In [None]:
df["reward_diff"] = df["reward_2"] - df["reward_1"]

# reward_diff < 0 means model 2 (GRPO) scored lower than model 1 (SFT) on that row
row_indices_model2_worse = df.index[df["reward_diff"] < 0].tolist()
print(row_indices_model2_worse)

[17, 21, 32, 34]
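
Beyond total reward, it can help to count on how many prompts each model does better and how often each model produces a correct step. The sketch below does this on made-up paired rewards; the lists `reward_1` and `reward_2` stand in for the `reward_1` and `reward_2` columns of `df`, and the `outcome` helper is an assumption made for the example.

```python
from collections import Counter

# Made-up paired rewards for the same prompts (model 1 vs model 2)
reward_1 = [-2, -2, -2, 1, 1, 1]
reward_2 = [-1, 1, -2, 1, 1, -1]


def outcome(r1, r2):
    """Label one paired comparison from the point of view of model 2."""
    if r2 > r1:
        return "model 2 better"
    if r2 < r1:
        return "model 2 worse"
    return "tie"


counts = Counter(outcome(a, b) for a, b in zip(reward_1, reward_2))
print(dict(counts))

# Share of steps judged correct (reward == 1) for each model
acc_1 = sum(r == 1 for r in reward_1) / len(reward_1)
acc_2 = sum(r == 1 for r in reward_2) / len(reward_2)
print(acc_1, acc_2)
```

Reporting wins, ties and losses alongside the summed reward separates "better on many prompts" from "much better on a few prompts".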
