### Ungraded Lab: Reinforcement Learning from Human Feedback (RLHF)

Last updated: April 15, 2025

---

#### Source: ChatGPT created this lab from this prompt:

*Create a lab assignment that demonstrates RLHF. Write and show the python code. Make sure the code works successfully.*

---

#### Background

RLHF trains models (like ChatGPT) to align their behavior with human preferences rather than relying only on predefined rewards.

We'll simulate RLHF in a basic environment where an agent must learn which three-token sequence over a fixed vocabulary is the most preferred.

---

#### Code

#### Import Packages

In [None]:
import numpy as np
from collections import defaultdict

#### Configs

In [None]:
# Set a fixed vocabulary
VOCAB = ["A", "B", "C"]
TARGET_STRING = ["A", "B", "C"]  # This is the "ideal" sequence from a human preference view

epochs = 10
n_pairs = 10

#### Calculate preferences

In [None]:
def generate_random_sequence(length=3):
    return np.random.choice(VOCAB, size=length).tolist()

# Preference function simulating human feedback
def human_preference(seq1, seq2):
    # Score each sequence by the number of positions that match the target
    score1 = sum(a == b for a, b in zip(seq1, TARGET_STRING))
    score2 = sum(a == b for a, b in zip(seq2, TARGET_STRING))
    if score1 > score2:
        return 0  # Prefer seq1
    else:
        return 1  # Prefer seq2 (ties also land here)
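
A few hand-picked pairs make the behavior of `human_preference()` concrete. The sketch below repeats the function so it runs on its own; the three test pairs are illustrative choices, not part of the original lab.

```python
TARGET_STRING = ["A", "B", "C"]

def human_preference(seq1, seq2):
    # Count positions where each sequence matches the target
    score1 = sum(a == b for a, b in zip(seq1, TARGET_STRING))
    score2 = sum(a == b for a, b in zip(seq2, TARGET_STRING))
    return 0 if score1 > score2 else 1

# seq1 matches the target exactly, seq2 matches nowhere -> prefer seq1
clear_win = human_preference(["A", "B", "C"], ["C", "A", "B"])
# seq1 has one match, seq2 has two -> prefer seq2
clear_loss = human_preference(["A", "A", "A"], ["A", "B", "A"])
# identical sequences tie, and a tie falls through to "prefer seq2"
tie = human_preference(["A", "A", "C"], ["A", "A", "C"])

print(clear_win, clear_loss, tie)  # 0 1 1
```

The tie case is worth noting: the function never reports "no preference", so equal scores are always labeled as a win for the second sequence.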


#### Simulate human preferences over random pairs

In [None]:
def collect_preferences(n_pairs):
    preferences = []
    for _ in range(n_pairs):
        s1 = generate_random_sequence()
        s2 = generate_random_sequence()
        pref = human_preference(s1, s2)
        preferences.append((s1, s2, pref))
    return preferences

preferences = collect_preferences(n_pairs)
print("Sample preference:", preferences[0])

Sample preference: (['A', 'A', 'C'], ['A', 'A', 'C'], 1)
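
The sample above pairs two identical sequences yet still records a preference of 1, because ties are always labeled as a win for the second sequence. To gauge how common ties are, the sketch below enumerates every ordered pair of length-3 sequences and counts them. The exhaustive enumeration and the helper name `match_count` are assumptions made for illustration; the lab itself samples pairs at random.

```python
from itertools import product

VOCAB = ["A", "B", "C"]
TARGET_STRING = ["A", "B", "C"]

def match_count(seq):
    # Number of positions where seq agrees with the target
    return sum(a == b for a, b in zip(seq, TARGET_STRING))

all_sequences = list(product(VOCAB, repeat=3))      # 27 sequences
all_pairs = list(product(all_sequences, repeat=2))  # 729 ordered pairs
ties = sum(match_count(s1) == match_count(s2) for s1, s2 in all_pairs)

print(f"{ties} of {len(all_pairs)} pairs are ties ({ties / len(all_pairs):.1%})")
# 245 of 729 pairs are ties (33.6%)
```

So about a third of uniformly sampled pairs carry a label that reflects the tie rule rather than a real difference in quality, which is one source of noise in the preference data.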


#### Train a Reward Model from Preferences


In [None]:
# Reward model: assign a reward value to each token at each position
reward_table = defaultdict(lambda: np.random.randn())

def sequence_reward(seq):
    return sum(reward_table[(i, token)] for i, token in enumerate(seq))

# Train the reward model using preference data
def train_reward_model(preferences, epochs=5, lr=0.01):
    for epoch in range(epochs):
        for s1, s2, pref in preferences:
            r1 = sequence_reward(s1)
            r2 = sequence_reward(s2)

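            # Logistic (Bradley-Terry) model: probability that s1 is preferred
            prob1 = np.exp(r1) / (np.exp(r1) + np.exp(r2))
            # Gradient of the log-likelihood of the observed label with respect to r1
            grad = (1 - prob1) if pref == 0 else -prob1

            # explanation:
            # if pref = 0 (human prefers s1)
            # but prob1 is near 0 (extreme case), we incur a large loss. grad is near 1 and pushes prob1 closer to 1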

            # Update reward table
            for i, token in enumerate(s1):
                reward_table[(i, token)] += lr * grad
            for i, token in enumerate(s2):
                reward_table[(i, token)] -= lr * grad

            # explanation
            # if grad > 0, we increase the reward for s1 and decrease it for s2
            # if grad < 0, we decrease the reward for s1 and increase it for s2
            # example: if s1 has 'A' in the first position, s2 has 'B', and the human prefers s1,
            # the reward model should learn this preference and give a high prob1
            # we compute grad of the log-likelihood and nudge the reward table entries:
            # value of reward_table[0,'A'] goes up
            # value of reward_table[0,'B'] goes down
            # a token shared by s1 and s2 at the same position gets both updates, which cancel

# Pass the configured number of epochs (otherwise the default of 5 is used)
train_reward_model(preferences, epochs=epochs)
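
The update in `train_reward_model()` can be read as one step of gradient ascent on the log-likelihood of a Bradley-Terry (logistic) preference model. The symbols $r_1$, $r_2$, $p_1$, $y$, and $\eta$ are introduced here for illustration and mirror `r1`, `r2`, `prob1`, `pref`, and `lr` in the code.

$$
p_1 = P(s_1 \succ s_2) = \frac{e^{r_1}}{e^{r_1} + e^{r_2}} = \sigma(r_1 - r_2),
\qquad
\ell = \begin{cases} \log p_1 & y = 0 \\ \log(1 - p_1) & y = 1 \end{cases}
$$

$$
\frac{\partial \ell}{\partial r_1} = \begin{cases} 1 - p_1 & y = 0 \\ -p_1 & y = 1 \end{cases},
\qquad
\frac{\partial \ell}{\partial r_2} = -\frac{\partial \ell}{\partial r_1}
$$

Because $r_1$ and $r_2$ are sums of table entries, each entry used by `s1` moves by $\eta \, \partial \ell / \partial r_1$ and each entry used by `s2` moves by the opposite amount.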

#### Learn preferred sequence

In [None]:
def policy_sample():
    # Greedy decoding: pick the highest-reward token at each position of reward_table
    best_seq = []
    for i in range(len(TARGET_STRING)):
        best_token = max(VOCAB, key=lambda token: reward_table[(i, token)])
        best_seq.append(best_token)
    return best_seq

final_sequence = policy_sample()
print("Learned preferred sequence:", final_sequence)

Learned preferred sequence: ['C', 'C', 'A']
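
The sequence printed above does not match `TARGET_STRING`. With only 10 pairs, a learning rate of 0.01, and randomly initialized rewards, the initial noise in `reward_table` can easily outweigh what the model learns. As a sanity check that the method itself can recover the target, here is a minimal sketch that makes two assumptions not in the original lab: rewards start at zero, and training uses all 729 ordered pairs instead of a random sample. The function name `train_on_all_pairs` and the values `epochs=5` and `lr=0.05` are illustrative choices.

```python
from itertools import product
import numpy as np

VOCAB = ["A", "B", "C"]
TARGET_STRING = ["A", "B", "C"]

def human_preference(seq1, seq2):
    score1 = sum(a == b for a, b in zip(seq1, TARGET_STRING))
    score2 = sum(a == b for a, b in zip(seq2, TARGET_STRING))
    return 0 if score1 > score2 else 1  # ties go to seq2, as in the lab

def train_on_all_pairs(epochs=5, lr=0.05):
    # Assumption: zero-initialized rewards instead of random ones
    table = {(i, t): 0.0 for i in range(len(TARGET_STRING)) for t in VOCAB}
    sequences = list(product(VOCAB, repeat=len(TARGET_STRING)))
    # Assumption: every ordered pair is used, so the result is deterministic
    pairs = [(s1, s2, human_preference(s1, s2)) for s1 in sequences for s2 in sequences]
    for _ in range(epochs):
        for s1, s2, pref in pairs:
            r1 = sum(table[(i, t)] for i, t in enumerate(s1))
            r2 = sum(table[(i, t)] for i, t in enumerate(s2))
            # Same value as exp(r1) / (exp(r1) + exp(r2)), written as a sigmoid
            prob1 = 1.0 / (1.0 + np.exp(r2 - r1))
            grad = (1 - prob1) if pref == 0 else -prob1
            for i, t in enumerate(s1):
                table[(i, t)] += lr * grad
            for i, t in enumerate(s2):
                table[(i, t)] -= lr * grad
    return table

table = train_on_all_pairs()
# Greedy decode: highest-reward token at each position
learned = [max(VOCAB, key=lambda t: table[(i, t)]) for i in range(len(TARGET_STRING))]
print("Learned preferred sequence:", learned)
```

Under these assumptions the greedy decode should recover the target, which suggests the mismatch in the lab run comes from too little data and the random initialization rather than from the update rule.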


---

#### Tasks

1) Test that `human_preference()` works properly

2) Call the function `collect_preferences()` with parameter `n_pairs=10`  
Does the output make sense?

3) Let's see if the "humans" can teach the model.  
Re-run `collect_preferences()` and `train_reward_model()`, adjusting `n_pairs` and `epochs` as needed, until the learned sequence matches `TARGET_STRING = ["A", "B", "C"]`

4) Review `train_reward_model()` to understand what it's doing