### Reward Model with TRL RewardTrainer
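This notebook trains a reward model with the TRL `RewardTrainer` (v0.17.0) on HellaSwag-style chat data. The model is `bert-base-uncased` with a single-output classification head, fine-tuned through LoRA adapters.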


In [1]:
import random
from pathlib import Path

from datasets import Dataset
from shared_models import HellaSwagEntry

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model
from trl import RewardTrainer, RewardConfig

#### Data Collection

In [2]:
DATA_PATH = Path("../data/hellaswag_format/personal_chat_sessions_train_hellaswag.jsonl")

def load_jsonl_pydantic(path):
    """Yield HellaSwagEntry objects parsed with Pydantic."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            yield HellaSwagEntry.model_validate_json(line)
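
`HellaSwagEntry` is imported from a local `shared_models` module that is not shown in this notebook. As a rough guide to the record shape, here is a standard-library stand-in with only the fields the later cells read (`context`, `ending0` to `ending4`, `label`). The class name `HellaSwagRecord`, the helper `parse_line`, and the sample values are hypothetical; the real class is a Pydantic model and may define more fields or stricter validation.

```python
import json
from dataclasses import dataclass


@dataclass
class HellaSwagRecord:
    """Hypothetical stand-in for HellaSwagEntry (fields inferred from usage)."""
    context: str
    ending0: str
    ending1: str
    ending2: str
    ending3: str
    ending4: str
    label: int  # index (0-4) of the correct ending


def parse_line(line: str) -> HellaSwagRecord:
    """Parse one JSONL line, keeping only the fields listed above."""
    raw = json.loads(line)
    names = HellaSwagRecord.__dataclass_fields__
    return HellaSwagRecord(**{name: raw[name] for name in names})


# Made-up record, only to show the expected keys.
sample_line = json.dumps({
    "context": "A: are we still on for lunch?",
    "ending0": "B: yes, see you at noon",
    "ending1": "B: the train was late",
    "ending2": "B: I prefer the blue one",
    "ending3": "B: it rained all week",
    "ending4": "B: thanks for the book",
    "label": 0,
})
record = parse_line(sample_line)
print(record.label, record.ending0)
```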

In [3]:
# Build pairwise examples
pairs = []
for ex in load_jsonl_pydantic(DATA_PATH):
    endings = [ex.ending0, ex.ending1, ex.ending2, ex.ending3, ex.ending4]
    pos_id = ex.label
    neg_id = random.choice([i for i in range(5) if i != pos_id])

    pos_txt, neg_txt = endings[pos_id].strip(), endings[neg_id].strip()
    context = ex.context.strip()

    # Randomly order A/B; label 1 means the first response is the correct one
    if random.random() < 0.5:
        first, second, lbl = pos_txt, neg_txt, 1
    else:
        first, second, lbl = neg_txt, pos_txt, 0

    pairs.append({
        "context": context,
        "first_resp": first,
        "second_resp": second,
        "label": lbl
    })
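
The cell above draws from Python's global `random` state, which this notebook never seeds, so the sampled negatives can differ between runs even though the train/test split uses `seed=42`. A minimal sketch of one way to make the sampling repeatable, assuming a dedicated `random.Random` instance is acceptable. The helper `pick_negative` and the seed value are illustrative choices, not part of the notebook.

```python
import random


def pick_negative(rng: random.Random, num_endings: int, pos_id: int) -> int:
    """Return the index of a random ending other than the correct one."""
    return rng.choice([i for i in range(num_endings) if i != pos_id])


# Two generators with the same seed produce the same sequence of negatives.
rng_a = random.Random(42)
rng_b = random.Random(42)
draws_a = [pick_negative(rng_a, 5, pos_id=2) for _ in range(10)]
draws_b = [pick_negative(rng_b, 5, pos_id=2) for _ in range(10)]

print(draws_a == draws_b)
```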

In [4]:
# Create HF Dataset and split
dataset = Dataset.from_list(pairs)
train_test = dataset.train_test_split(test_size=0.1, seed=42)

#### Prepare for RewardTrainer
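Convert each pair to the `"chosen"` / `"rejected"` columns that `RewardTrainer` expects. The `context` column is dropped in this step, so the model scores each ending on its own.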

In [5]:
def map_to_reward(examples):
    chosen, rejected = [], []
    for lbl, a, b in zip(examples["label"], examples["first_resp"], examples["second_resp"]):
        if lbl == 1:
            chosen.append(a)
            rejected.append(b)
        else:
            chosen.append(b)
            rejected.append(a)
    return {"chosen": chosen, "rejected": rejected}

rm_dataset = train_test.map(
    map_to_reward,
    batched=True,
    remove_columns=train_test["train"].column_names,
)
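
To make the conversion concrete, the sketch below runs the same `map_to_reward` logic on a made-up two-row batch in the column layout that `Dataset.map(batched=True)` passes in. It also shows a hypothetical variant, `map_to_reward_with_context`, that prepends the context to both responses. The variant, its newline separator, and the toy strings are assumptions for illustration; the notebook itself trains on the endings alone.

```python
def map_to_reward(examples):
    """Same logic as the cell above: route each response by its label."""
    chosen, rejected = [], []
    for lbl, a, b in zip(examples["label"], examples["first_resp"], examples["second_resp"]):
        if lbl == 1:
            chosen.append(a)
            rejected.append(b)
        else:
            chosen.append(b)
            rejected.append(a)
    return {"chosen": chosen, "rejected": rejected}


def map_to_reward_with_context(examples, sep="\n"):
    """Hypothetical variant: prepend the context to both responses."""
    out = map_to_reward(examples)
    return {
        key: [f"{ctx}{sep}{resp}" for ctx, resp in zip(examples["context"], out[key])]
        for key in ("chosen", "rejected")
    }


# Toy batch: row 0 has the correct response first, row 1 has it second.
batch = {
    "context": ["ctx one", "ctx two"],
    "first_resp": ["good A", "bad B"],
    "second_resp": ["bad A", "good B"],
    "label": [1, 0],
}

plain = map_to_reward(batch)
with_ctx = map_to_reward_with_context(batch)
print(plain)
print(with_ctx["chosen"])
```

Because the label only records which slot holds the correct ending, the mapping undoes the random A/B ordering from the earlier cell.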


#### Model & Tokenizer

In [6]:
model_ckpt = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
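if tokenizer.pad_token is None:
    # Fallback for tokenizers without a pad token; BERT already defines [PAD], so this branch is skipped here.
    tokenizer.pad_token = tokenizer.eos_token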


In [7]:
# Single-scalar head: num_labels=1 makes the classifier emit one reward score per sequence
model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=1,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

#### LoRA Setup

In [9]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=4,
    lora_alpha=32,
    target_modules=["query", "value"],
    lora_dropout=0.05,
)
model = get_peft_model(model, peft_config)
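
For a sense of scale, the LoRA settings above can be turned into numbers. This sketch assumes the standard BERT-base shape (12 encoder layers, hidden size 768), the usual LoRA parameterisation of two rank-`r` matrices per target module, and the default `lora_alpha / r` scaling. It counts adapter weights only and ignores the classification head.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


hidden_size = 768        # BERT-base hidden size (assumed)
num_layers = 12          # BERT-base encoder layers (assumed)
targets_per_layer = 2    # "query" and "value"
rank, lora_alpha = 4, 32

per_module = lora_param_count(hidden_size, hidden_size, rank)
total_adapter_params = per_module * targets_per_layer * num_layers
scaling = lora_alpha / rank

print(per_module, total_adapter_params, scaling)  # 6144 147456 8.0
```

In a live session, `model.print_trainable_parameters()` reports the actual trainable count, which also includes any modules PEFT keeps trainable for the sequence-classification task.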

#### Training Configuration

In [10]:
training_args = RewardConfig(
    output_dir="../data/models/reward_model_ckpts",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    max_length=128,
    disable_dropout=False,  # keep dropout active during training
)

#### Initialize & Run RewardTrainer

In [11]:
trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=rm_dataset["train"],
    eval_dataset=rm_dataset["test"],
    processing_class=tokenizer,
    peft_config=peft_config,  # likely redundant: the model was already wrapped by get_peft_model above
)


[Output: `Map`, `Map`, and `Filter` passes over 20,053 training examples and 2,229 test examples]

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [12]:
print("Baseline:", trainer.evaluate())



Baseline: {'eval_loss': 0.7329193949699402, 'eval_model_preparation_time': 0.0039, 'eval_accuracy': 0.2591093117408907, 'eval_runtime': 45.0688, 'eval_samples_per_second': 49.391, 'eval_steps_per_second': 1.553}
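
For reading these numbers: `RewardTrainer` optimises a pairwise logistic loss, `-log(sigmoid(r_chosen - r_rejected))`, averaged over pairs, and the reported accuracy is roughly the share of pairs where the chosen response scores higher. The plain-Python sketch below computes both quantities on made-up reward values. The function names are illustrative, and the trainer's optional margin and reward-centering terms are left out because this notebook does not configure them.

```python
import math


def pairwise_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all pairs."""
    losses = [
        math.log1p(math.exp(-(c - r)))  # equals -log(sigmoid(c - r))
        for c, r in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs in which the chosen response scores strictly higher."""
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)


# A scorer that cannot separate the pair (equal scores) gets loss ln 2.
untrained = pairwise_loss([0.0, 0.0], [0.0, 0.0])

# Made-up scores with a clear margin give a loss close to zero.
separated_loss = pairwise_loss([4.0, 5.0], [-4.0, -3.0])
separated_acc = pairwise_accuracy([4.0, 5.0], [-4.0, -3.0])

print(round(untrained, 4), separated_acc)  # 0.6931 1.0
```

The baseline `eval_loss` of about 0.733 sits near ln 2 ≈ 0.693, the value an uninformative scorer would produce.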




In [13]:
trainer.train()

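| Epoch | Training Loss | Validation Loss | Model Preparation Time | Accuracy |
|-------|---------------|-----------------|------------------------|----------|
| 1     | 0.0289        | 0.019819        | 0.0039                 | 0.994602 |
| 2     | 0.0131        | 0.012975        | 0.0039                 | 0.995502 |
| 3     | 0.0144        | 0.012009        | 0.0039                 | 0.995951 |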








TrainOutput(global_step=3753, training_loss=0.052317084613050936, metrics={'train_runtime': 2782.2375, 'train_samples_per_second': 21.576, 'train_steps_per_second': 1.349, 'total_flos': 0.0, 'train_loss': 0.052317084613050936, 'epoch': 3.0})
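
The step count in `TrainOutput` can be cross-checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these is set explicitly in `RewardConfig` above, so they are assumptions about the run.

```python
import math

global_step = 3753
num_epochs = 3
batch_size = 16              # per_device_train_batch_size
pairs_before_filter = 20053  # training pairs passed to the trainer

steps_per_epoch = global_step // num_epochs
# With the last partial batch kept, n pairs need ceil(n / batch_size) steps,
# so steps_per_epoch pins n to a window of batch_size values.
max_pairs = steps_per_epoch * batch_size
min_pairs = (steps_per_epoch - 1) * batch_size + 1

print(steps_per_epoch, min_pairs, max_pairs)  # 1251 20001 20016
print(math.ceil(pairs_before_filter / batch_size))  # 1254
```

Under those assumptions, between 20,001 and 20,016 pairs were used per epoch, slightly fewer than the 20,053 supplied. That gap is consistent with `RewardTrainer` dropping pairs whose tokenized length exceeds `max_length`, though this is an inference from the numbers rather than something the logs state.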

In [14]:
print("Final:", trainer.evaluate())

Final: {'eval_loss': 0.012009366415441036, 'eval_model_preparation_time': 0.0039, 'eval_accuracy': 0.9959514170040485, 'eval_runtime': 56.725, 'eval_samples_per_second': 39.242, 'eval_steps_per_second': 1.234, 'epoch': 3.0}


