# Reward Model with Pairwise Preference Training

## Introduction

In Reinforcement Learning from Human Feedback (RLHF), a reward model is trained to predict which of two responses to a prompt a human would prefer. The model maps a prompt-response pair to a scalar score, with higher scores for preferred responses and lower scores for less-preferred ones. Training minimizes a pairwise loss that penalizes the model whenever it fails to rank the preferred response above the other.

### Pairwise Preference Setup

Given:
- A prompt $x$
- A preferred (good) response $y_{good}$
- A less preferred (bad) response $y_{bad}$

We train a reward model $r_\theta$, parameterized by $\theta$, to assign scores such that:

$$
r_\theta(x, y_{good}) > r_\theta(x, y_{bad})
$$

### Pairwise Loss Function

The loss function used for training is:

$$
\text{Loss} = -\log(\sigma(r_\theta(x, y_{good}) - r_\theta(x, y_{bad})))
$$

where $\sigma(z)$ is the sigmoid function defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
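
As background that the text above does not spell out, this loss is commonly read as the negative log-likelihood of a Bradley-Terry preference model, in which the probability that $y_{good}$ is preferred depends only on the score difference:

$$
P(y_{good} \succ y_{bad} \mid x) = \sigma(r_\theta(x, y_{good}) - r_\theta(x, y_{bad})) = \frac{e^{r_\theta(x, y_{good})}}{e^{r_\theta(x, y_{good})} + e^{r_\theta(x, y_{bad})}}
$$

Under this reading, $\text{Loss} = -\log P(y_{good} \succ y_{bad} \mid x)$, so minimizing the loss maximizes the likelihood of the observed preference. Because only the difference of scores enters, adding the same constant to every score leaves the loss unchanged.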

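The intuition is:
- If $r_\theta(x, y_{good})$ exceeds $r_\theta(x, y_{bad})$ by a wide margin, the sigmoid is close to 1 and the loss approaches 0.
- If the two scores are equal, the loss is $\log 2 \approx 0.693$.
- If the bad response scores higher, the loss grows roughly linearly with the gap, pushing the model to reverse the ranking.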
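
To make the intuition concrete, here is a minimal sketch that evaluates the loss for a few score pairs using only the standard library. The function name `pairwise_loss` and the score values are illustrative assumptions, not part of the implementation that follows.

```python
import math


def pairwise_loss(r_good: float, r_bad: float) -> float:
    """-log(sigmoid(r_good - r_bad)), written as a numerically stable softplus."""
    margin = r_good - r_bad
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp()
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical score pairs, chosen only to show the trend
for r_good, r_bad in [(3.0, -1.0), (1.0, 0.0), (0.0, 0.0), (-1.0, 2.0)]:
    print(f"margin={r_good - r_bad:+.1f}  loss={pairwise_loss(r_good, r_bad):.4f}")
```

The loss shrinks as the margin grows, equals $\log 2$ at a margin of zero, and is close to the size of the gap when the ranking is badly wrong.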

## PyTorch Implementation

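Below is a minimal PyTorch implementation of this training approach. It uses a pretrained DistilBERT encoder from Hugging Face `transformers` with a linear reward head on top, and runs a few gradient steps on a single example pair.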

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class RewardModel(nn.Module):
    """
    A simple transformer-based reward model for pairwise preference training.
    """

    def __init__(self, model_name="distilbert-base-uncased"):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        self.reward_head = nn.Linear(self.transformer.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        transformer_output = self.transformer(input_ids=input_ids,
                                              attention_mask=attention_mask)
        pooled_output = transformer_output.last_hidden_state[:, 0]  # [CLS] token
        reward = self.reward_head(pooled_output)
        return reward.squeeze(-1)


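def compute_pairwise_loss(model, tokenizer, prompt, good_response, bad_response, device):
    """
    Compute the pairwise ranking loss for one (prompt, good, bad) triple.

    Returns the loss tensor and the two scores as Python floats.
    """

    # Tokenize each (prompt, response) pair as a two-segment input
    good_input = tokenizer(prompt, good_response, return_tensors='pt', truncation=True, padding=True).to(device)
    bad_input = tokenizer(prompt, bad_response, return_tensors='pt', truncation=True, padding=True).to(device)

    # Compute scores. Pass only the tensors forward() accepts, since some
    # tokenizers also return token_type_ids, which forward() would reject.
    good_score = model(good_input["input_ids"], good_input["attention_mask"])
    bad_score = model(bad_input["input_ids"], bad_input["attention_mask"])

    # Pairwise loss: -log(sigmoid(r_good - r_bad))
    loss = -F.logsigmoid(good_score - bad_score).mean()

    # .item() assumes a single pair (batch size 1)
    return loss, good_score.item(), bad_score.item()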


# Example usage:
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    reward_model = RewardModel().to(device)

    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-5)

    # Example data
    prompt = "What's the capital of France?"
    good_response = "The capital of France is Paris."
    bad_response = "The capital of France is London."

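    reward_model.train()  # enables dropout, so scores vary slightly between passes
    epochs = 5  # each "epoch" here is one gradient step on the single example pair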

    for epoch in range(epochs):
        optimizer.zero_grad()

        loss, good_score, bad_score = compute_pairwise_loss(
            reward_model, tokenizer, prompt, good_response, bad_response, device
        )

        loss.backward()
        optimizer.step()

        print(f"Epoch {epoch + 1}/{epochs} - Loss: {loss.item():.4f}, Good Score: {good_score:.4f}, Bad Score: {bad_score:.4f}")
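
The example above trains on a single pair. In practice one would usually compute the loss over a batch of pairs and track how often the preferred response is ranked first. The following is a minimal sketch under that assumption. It works on precomputed score tensors rather than the transformer, and both the function name `batched_pairwise_loss` and the score values are hypothetical.

```python
import torch
import torch.nn.functional as F


def batched_pairwise_loss(good_scores: torch.Tensor, bad_scores: torch.Tensor):
    """Mean pairwise loss and ranking accuracy for a batch of score pairs.

    Both inputs have shape (batch,); index i of each refers to the same prompt.
    """
    margins = good_scores - bad_scores
    loss = -F.logsigmoid(margins).mean()
    # Fraction of pairs where the preferred response scores strictly higher
    accuracy = (margins > 0).float().mean()
    return loss, accuracy


# Hypothetical scores standing in for reward model outputs on four pairs
good_scores = torch.tensor([2.0, 0.5, -0.3, 1.2], requires_grad=True)
bad_scores = torch.tensor([0.0, 1.0, -1.0, 1.2])

loss, accuracy = batched_pairwise_loss(good_scores, bad_scores)
loss.backward()

print(f"loss={loss.item():.4f}  accuracy={accuracy.item():.2f}")
# Gradient w.r.t. each good score is -sigmoid(-margin) / batch_size:
# largest in magnitude for the misranked pair, smallest for the widest margin
print(good_scores.grad)
```

The gradient shows why the loss focuses learning on hard pairs: a pair that is already ranked correctly by a wide margin contributes almost nothing, while a misranked pair contributes the most.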
