---
title: "Log GRPO Completions to 🤗 Datasets"
description: "Log GRPO training completions from trl to a 🤗 Dataset repo for easy analysis"
author: "Daniel van Strien"
date: "2025-02-20"
categories: ["huggingface", "trl", "datasets"]
image: https://raw.githubusercontent.com/davanstrien/blog/refs/heads/main/posts/2025/grpo/assets/illustration.jpg 
twitter-card:
  title: "GRPO Log Completions to 🤗 Datasets"
  description: "Log completions during GRPO training to a 🤗 Dataset repo for easy analysis"
  image: https://github.com/davanstrien/blog/blob/main/posts/2025/grpo/assets/whale.png?raw=true 
  card-style: summary_large_image
open-graph:
  title: "GRPO Log Completions to 🤗 Datasets"
  description: "Log completions during GRPO training to a 🤗 Dataset repo for easy analysis"
  image: https://github.com/davanstrien/blog/blob/main/posts/2025/grpo/assets/whale.png?raw=true
toc-depth: 3
toc: true
---


During GRPO training, it can be useful to stare at your completions and try to understand how the different reward functions are behaving. This notebook shows an experimental branch of TRL that pushes completions to a 🤗 Dataset repo.
 
You can also play with this in Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g?usp=sharing)


This results in a dataset that looks like this:

In [27]:
#| code-fold: true
from IPython.display import HTML

HTML('''
<iframe
  src="https://huggingface.co/datasets/davanstrien/test-logs/embed/viewer/default/train"
  frameborder="0"
  width="100%"
  height="560px">
</iframe>
''')

Once you have this data in a HF repo, you can go on and work with it using whatever tools you prefer.

To try this out, install from this fork of TRL.

**NOTE** this is an experiment so don't expect everything to work super well!

In [2]:
# | output: false
!pip install git+https://github.com/davanstrien/trl.git@log-data
!pip install polars hvplot altair --upgrade


This is basically the same as the example code in the [GRPO docs](https://huggingface.co/docs/trl/main/en/grpo_trainer) in TRL. I just add some extra reward functions so we can see what the outputs for multiple rewards look like. 

The main things we add are:
- `log_completions=True`
- `log_completions_hub_repo='davanstrien/test-logs'`

The first option will enable the logging of completions (this goes to WandB too) and the second option will push the completions to a 🤗 Dataset repo. 

**Note** at the moment we don't overwrite the dataset if it already exists on the Hub. 
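Since an existing dataset isn't overwritten, one way to keep runs from mixing is to log each run to a fresh repo. Below is a minimal sketch of that idea. `unique_repo_id` is a hypothetical helper (it is not part of TRL or the fork), and the timestamp suffix is just one possible naming convention.

```python
from datetime import datetime, timezone
from typing import Optional


def unique_repo_id(namespace: str, base_name: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: build a per-run repo id such as 'user/test-logs-20250220-120000'."""
    # Default to the current UTC time so two runs started at different times get different names.
    now = now or datetime.now(timezone.utc)
    return f"{namespace}/{base_name}-{now.strftime('%Y%m%d-%H%M%S')}"


# Example: a value that could be passed as `log_completions_hub_repo`.
repo_id = unique_repo_id("davanstrien", "test-logs")
print(repo_id)
```

The resulting string only contains letters, digits, and hyphens after the namespace, so it should be usable as a Hub repo name.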

In [3]:
# | output: false
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

def reward_shouting(completions: list[str], **kwargs) -> list[float]:
    """Reward text completions where all alphabetic words are in uppercase letters."""
    results = []

    for completion in completions:
        words = completion.split()
        all_uppercase = True

        for word in words:
            # Extract only alphabetic characters
            alpha_only = ''.join(char for char in word if char.isalpha())

            # Skip empty strings or strings with no alphabetic characters
            if not alpha_only:
                continue

            # Check if the alphabetic part is uppercase
            if not alpha_only.isupper():
                all_uppercase = False
                break

        results.append(50.0 if all_uppercase else 0.0)

    return results

def reward_emojis(completions: list[str], **kwargs) -> list[float]:
    """Reward text completions that contain emojis, with extra points for 🤗."""
    results = []

    for completion in completions:
        # Base score - rough emoji check: any character with a code point above 127000
        has_any_emoji = any(ord(char) > 127000 for char in completion)
        base_score = 10.0 if has_any_emoji else 0.0

        # Bonus points for 🤗 (hugging face emoji)
        hugging_face_count = completion.count('🤗')
        bonus_score = hugging_face_count * 5.0

        # Total score
        total_score = base_score + bonus_score
        results.append(total_score)

    return results


training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO",
                           logging_steps=1,
                           per_device_train_batch_size=4,
                           per_device_eval_batch_size=4,
                           num_generations=2,
                           log_completions=True, 
                           max_steps=300,
                           log_completions_hub_repo='davanstrien/test-logs') # repo to push completions to
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[reward_len, reward_shouting, reward_emojis],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()



Step,Training Loss
1,0.0
2,0.1226
3,0.1111
4,0.0001
5,0.0001
6,0.0002
7,0.0001
8,0.1522
9,0.0003
10,0.0003



TrainOutput(global_step=300, training_loss=0.18083132882757733, metrics={'train_runtime': 2155.0442, 'train_samples_per_second': 0.557, 'train_steps_per_second': 0.139, 'total_flos': 0.0, 'train_loss': 0.18083132882757733})
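A run like this takes a while (about 36 minutes here, going by `train_runtime`), so it can be worth sanity-checking the reward functions on a few hand-written completions first. The sketch below restates the three reward functions in compact form, intended to behave the same as the versions above, and scores three made-up sample strings. The `samples` list and the `scores` dict are illustrative names, not part of TRL.

```python
def reward_len(completions, **kwargs):
    # Closer to 20 characters is better; 0 is the maximum score.
    return [-abs(20 - len(c)) for c in completions]


def reward_shouting(completions, **kwargs):
    results = []
    for c in completions:
        # Keep only the alphabetic part of each word and drop words with no letters.
        words = ["".join(ch for ch in w if ch.isalpha()) for w in c.split()]
        words = [w for w in words if w]
        results.append(50.0 if all(w.isupper() for w in words) else 0.0)
    return results


def reward_emojis(completions, **kwargs):
    # 10 points for any high code point character, plus 5 per 🤗.
    return [
        (10.0 if any(ord(ch) > 127000 for ch in c) else 0.0) + 5.0 * c.count("🤗")
        for c in completions
    ]


# Made-up completions chosen to exercise each reward.
samples = ["HELLO WORLD", "hello 🤗", "12345"]
scores = {f.__name__: f(samples) for f in (reward_len, reward_shouting, reward_emojis)}

for name, values in scores.items():
    print(name, values)
# reward_len [-9, -13, -15]
# reward_shouting [50.0, 0.0, 50.0]
# reward_emojis [0.0, 15.0, 0.0]
```

Note the third sample: a completion with no alphabetic characters at all still gets the full shouting reward, because there is no lowercase word to disqualify it. That is the kind of edge case that is much cheaper to spot here than in the training logs.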

## Visualizing the data

Once we've finished training, we can work with the completions dataset using Polars, Pandas, etc. This can give us more insight into how the different reward functions are behaving and how we might modify our training process.


In [1]:
import polars as pl

# Login using e.g. `huggingface-cli login` to access this dataset
df = pl.read_parquet("hf://datasets/davanstrien/test-logs/**/*.parquet")
df.describe()

We can already see that we need to do something different if we want more shouting. We can see that the emojis reward is triggered sometimes but not that often. 

We can also do things like plot the length reward over time.

In [26]:
import polars as pl
import altair as alt

df_sorted = df.sort("step")
df_with_avg = df_sorted.with_columns(
    pl.col("reward_reward_len")
    .rolling_mean(window_size=10, min_samples=1)
    .alias("rolling_avg")
)

# Now create separate charts and combine them
scatter_chart = (
    alt.Chart(df_with_avg)
    .mark_circle(opacity=0.6)
    .encode(
        x=alt.X("step:Q", title="Step"),
        y=alt.Y("reward_reward_len:Q", title="Length Reward Score"),
        tooltip=["step", "reward_reward_len"],
    )
)

line_chart = (
    alt.Chart(df_with_avg)
    .mark_line(color="purple", size=2)
    .encode(x="step:Q", y="rolling_avg:Q")
)

combined_chart = (
    (scatter_chart + line_chart)
    .properties(
        width=800,
        height=400,
        title="Length Rewards Over Time with 10-step Moving Average",
    )
    .configure_axis(labelFontSize=12, titleFontSize=14)
)

In [13]:
combined_chart

### Create a heatmap showing when different reward types contribute most

In [22]:

heatmap_data = df.with_columns([
    pl.col('step').cast(pl.Int32) // 10 * 10
]).group_by('step').agg([
    pl.mean('reward_reward_len').alias('avg_len_reward'),
    pl.mean('reward_reward_shouting').alias('avg_shouting_reward'),
    pl.mean('reward_reward_emojis').alias('avg_emoji_reward')
])

heatmap_long = heatmap_data.unpivot(
    index=['step'],
    on=['avg_len_reward', 'avg_shouting_reward', 'avg_emoji_reward'],
    variable_name='reward_type',
    value_name='value'
)

heatmap = alt.Chart(heatmap_long).mark_rect().encode(
    x='step:O',
    y='reward_type:N',
    color=alt.Color('value:Q', scale=alt.Scale(scheme='viridis'))
).properties(
    width=800,
    height=200,
    title='Reward Types Intensity Over Time'
)
heatmap

### Create histograms to see distribution of rewards


In [25]:
histogram = alt.Chart(df).transform_fold(
    ['reward', 'reward_reward_len', 'reward_reward_shouting', 'reward_reward_emojis'],
    as_=['Reward Type', 'Value']
).mark_bar(opacity=0.7).encode(
    alt.X('Value:Q', bin=alt.Bin(maxbins=30)),
    alt.Y('count():Q'),
    alt.Color('Reward Type:N'),
    alt.Row('Reward Type:N')
).properties(
    width=600,
    height=150,
    title='Distribution of Different Reward Types'
)
histogram

## Is this a good idea?

This was hacked together in an hour or so, but I'm curious whether other people think this could be useful.

I think opening up the completions logs more widely could allow for more analysis and more insights into how training is progressing even for those without the compute resources to run training. 