# Learn trlX in 60 Minutes - API Tutorial

## 1. Introduction

**trlX** (Transformer Reinforcement Learning X) is a distributed training framework designed to fine-tune Large Language Models (LLMs) using Reinforcement Learning (RL). It is particularly known for scaling RLHF (Reinforcement Learning from Human Feedback) to large models.

In this tutorial, we will explore the core API of `trlx` by setting up a simple **PPO (Proximal Policy Optimization)** training loop. We will also show how to configure **DPO (Direct Preference Optimization)** with the custom extensions in this project.

**Goal**: Understand how to configure, train, and use a model with `trlx`.

## 2. Setup and Installation

First, ensure `trlx` and its dependencies are installed. If you are running this in a new environment, uncomment the line below.

In [1]:
# !pip install trlx

import trlx
import torch
import os

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cpu":
    print("WARNING: Training on CPU will be extremely slow. This tutorial is best run with a GPU.")


Using device: cpu


## 3. Configuration (`TRLConfig`)

`trlx` uses a configuration object to manage the many hyperparameters involved in RL training. The `TRLConfig` object controls:
- **Model**: Which model to load (e.g., `gpt2`).
- **Train**: Batch size, sequence length, epochs.
- **Method**: Algorithm-specific settings (e.g., PPO clip range, chunk size).

We will manually construct the configuration to ensure full control.

In [2]:
from trlx.data.configs import TRLConfig, TrainConfig, ModelConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig
from trlx.models.modeling_ppo import PPOConfig

# 1. Model Settings
model_config = ModelConfig(
    model_path="gpt2",  # Using GPT-2 (small) for demonstration
    model_arch_type="causal"
)

# 2. Tokenizer Settings
tokenizer_config = TokenizerConfig(
    tokenizer_path="gpt2",
    padding_side="left"
)

# 3. Training Settings
train_config = TrainConfig(
    total_steps=10,
    seq_length=128,
    epochs=1,
    batch_size=4,
    checkpoint_interval=100,
    eval_interval=100,
    pipeline="PPOPipeline",
    trainer="AcceleratePPOTrainer"
)

# 4. PPO Method Settings
method_config = PPOConfig(
    name="PPOConfig",
    num_rollouts=128,
    chunk_size=4,
    ppo_epochs=1,
    init_kl_coef=0.1,
    target=6.0,
    horizon=10000,
    gamma=1.0,
    lam=0.95,
    cliprange=0.2,
    cliprange_value=0.2,
    vf_coef=1.0,
    scale_reward="ignored",
    ref_mean=None,
    ref_std=None,
    cliprange_reward=10.0,
    gen_kwargs={"max_new_tokens": 40}
)

# 5. Optimizer & Scheduler
optimizer_config = OptimizerConfig(name="adamw", kwargs={"lr": 1.0e-5, "betas": [0.9, 0.95], "eps": 1.0e-8, "weight_decay": 1.0e-6})
scheduler_config = SchedulerConfig(name="cosine_annealing", kwargs={"T_max": 10000, "eta_min": 1.0e-5})

# Combine into TRLConfig
config = TRLConfig(
    model=model_config,
    tokenizer=tokenizer_config,
    train=train_config,
    method=method_config,
    optimizer=optimizer_config,
    scheduler=scheduler_config
)

print("Configuration ready.")

Configuration ready.
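The `init_kl_coef`, `target`, and `horizon` values in the PPO settings feed an adaptive KL coefficient. The sketch below follows the proportional update rule from Ziegler et al. (2019), which trlx is understood to use; the class name and the sample KL values are illustrative assumptions, not trlx's own API.

```python
class AdaptiveKLSketch:
    """Illustrative adaptive KL coefficient (not the trlx class itself)."""

    def __init__(self, init_kl_coef: float, target: float, horizon: int):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> float:
        # Relative error between observed and target KL, clipped to +/-20%
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # Raise the coefficient when KL is above target, lower it when below
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


ctl = AdaptiveKLSketch(init_kl_coef=0.1, target=6.0, horizon=10000)
high = ctl.update(current_kl=12.0, n_steps=128)  # KL above target: coefficient grows
low = ctl.update(current_kl=1.0, n_steps=128)    # KL below target: coefficient shrinks
print(high, low)
```

With a `horizon` of 10000 the coefficient moves slowly, so over a 10-step run it stays very close to `init_kl_coef`.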


## 4. The Reward Function

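In Reinforcement Learning from Human Feedback (RLHF), the **Reward Function** acts as the judge: it evaluates the text generated by the model and assigns each sample a score. (It should not be confused with PPO's critic, which is the learned value function.)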

For this tutorial, we will define a simple **heuristic reward function**. We want the model to generate text containing the word **"cat"**.

- **Input**: A list of generated strings.
- **Output**: A list of scalar rewards (floats).

In [3]:
def reward_fn(samples, **kwargs):
    """
    Assigns a reward based on the number of times 'cat' appears in the generated text.
    """
    rewards = []
    for sample in samples:
        # Calculate reward: +1.0 for every occurrence of the substring 'cat'
        # (case-insensitive, so words such as 'category' also count)
        count = sample.lower().count("cat")
        rewards.append(float(count))
    return rewards

# Test the reward function
test_samples = ["I love my cat.", "Dogs are great.", "The cat sat on the cat."]
print(f"Test Rewards: {reward_fn(test_samples)}")

Test Rewards: [1.0, 0.0, 2.0]
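Because `reward_fn` counts substrings, text such as "category" also earns reward. If that is not wanted, a whole-word variant is one option. The sketch below is a hypothetical alternative (the name `reward_fn_whole_word` and the regex are assumptions, not part of the tutorial's project) that compares both scoring rules on the same inputs.

```python
import re

# Match "cat" or "cats" as standalone words only (illustrative pattern)
CAT_WORD = re.compile(r"\bcats?\b", re.IGNORECASE)


def reward_fn_whole_word(samples, **kwargs):
    """Hypothetical variant: +1.0 per whole-word mention of 'cat' or 'cats'."""
    return [float(len(CAT_WORD.findall(sample))) for sample in samples]


checks = ["I love my cat.", "Concatenate the category list.", "Two cats, one CAT."]
substring_scores = [float(s.lower().count("cat")) for s in checks]
whole_word_scores = reward_fn_whole_word(checks)
print(substring_scores)   # substring rule rewards "Concatenate" and "category"
print(whole_word_scores)  # whole-word rule does not
```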


## 5. The Training Loop (`trlx.train`)

The `trlx.train` function is the main entry point. It handles:
1.  Loading the model and tokenizer.
2.  Generating samples (rollouts).
3.  Calculating rewards using your function.
4.  Updating the model using PPO.
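Step 4 relies on PPO's clipped surrogate objective, whose clip width is the `cliprange=0.2` setting from the configuration. The following is a minimal pure-Python sketch for a single token; it illustrates the standard formula rather than trlx's batched implementation, and the function name is hypothetical.

```python
import math


def ppo_clipped_loss(logp_new, logp_old, advantage, cliprange=0.2):
    """Per-token PPO policy loss (negated clipped surrogate objective)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new / pi_old
    clipped = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
    # Take the pessimistic (smaller) objective, then negate to get a loss
    return -min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with positive advantage: the gain is capped at 1.2 * advantage
capped = ppo_clipped_loss(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with negative advantage: no cap, the full penalty applies
uncapped = ppo_clipped_loss(math.log(1.5), 0.0, advantage=-1.0)
print(capped, uncapped)
```

The asymmetry is the point of the clip: the policy gains nothing from moving further than the clip range in a favorable direction, but is still fully penalized for unfavorable moves.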

We provide a list of **prompts** to kickstart the generation.

In [4]:
# Prompts to start generation
prompts = [
    "My favorite animal is",
    "The quick brown fox",
    "I saw a",
    "Once upon a time",
    "The pet store had"
]

# Evaluation prompts (to see progress)
eval_prompts = ["My favorite animal is"]

if torch.cuda.is_available():
    print("Starting training... (This may take a few minutes)")
    trainer = trlx.train(
        reward_fn=reward_fn,
        prompts=prompts,
        eval_prompts=eval_prompts,
        config=config
    )
    print("Training complete!")
else:
    print("Skipping actual training call because no GPU is detected.")
    print("In a GPU environment, 'trlx.train' would execute the PPO loop.")

Skipping actual training call because no GPU is detected.
In a GPU environment, 'trlx.train' would execute the PPO loop.
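To get a feel for how much work the settings above imply, the arithmetic can be sketched as follows. This assumes that rollouts are generated in chunks of `chunk_size`, that each rollout phase is optimized in minibatches of `batch_size` for `ppo_epochs` passes, and that `total_steps` caps the number of updates; the helper name is hypothetical and the exact scheduling inside trlx may differ.

```python
import math


def ppo_schedule(num_rollouts, chunk_size, batch_size, ppo_epochs, total_steps):
    """Rough step counts for the first rollout phase (assumed scheduling)."""
    generation_chunks = math.ceil(num_rollouts / chunk_size)
    updates_per_phase = math.ceil(num_rollouts / batch_size) * ppo_epochs
    return {
        "generation_chunks": generation_chunks,
        "updates_per_phase": updates_per_phase,
        # total_steps caps the run, so it may stop inside the first phase
        "updates_run_in_first_phase": min(updates_per_phase, total_steps),
    }


schedule = ppo_schedule(num_rollouts=128, chunk_size=4, batch_size=4,
                        ppo_epochs=1, total_steps=10)
print(schedule)
```

Under these assumptions the run generates 128 rollouts but performs only 10 updates, which is enough for a smoke test and too little to expect a visible change in behavior.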


## 6. Inference and Verification

Once training finishes, the returned `trainer` object wraps the fine-tuned model. We can use it to generate text and check whether the model learned the rewarded behavior (mentioning "cat").

In [5]:
if torch.cuda.is_available():
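    # trainer.generate expects tokenized inputs rather than raw strings
    inputs = trainer.tokenizer(["My favorite animal is"], return_tensors="pt")
    output_ids = trainer.generate(**inputs, max_new_tokens=20)
    # Decode the generated token ids back into text
    output = trainer.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print("Generated:", output)
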
    # Check if it learned
    if "cat" in output[0].lower():
        print("Success! The model mentioned 'cat'.")
    else:
        print("The model didn't mention 'cat'. It might need more training steps.")
else:
    print("Skipping inference.")

Skipping inference.
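A single generation is a noisy signal, so it can help to compare how often "cat" appears across several samples before and after training. The helper below is a hypothetical sketch, and the example strings are made up for illustration rather than taken from a real model.

```python
def mention_rate(texts, keyword="cat"):
    """Fraction of texts that contain the keyword (case-insensitive substring)."""
    if not texts:
        return 0.0
    hits = sum(keyword.lower() in text.lower() for text in texts)
    return hits / len(texts)


# Made-up generations for illustration only, not real model output
before = ["My favorite animal is the dog.", "My favorite animal is a horse."]
after = ["My favorite animal is the cat.", "My favorite animal is a dog."]
print(mention_rate(before), mention_rate(after))
```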


## 7. Advanced: Direct Preference Optimization (DPO)

This project extends `trlx` to support DPO, which is often simpler to tune and more stable than PPO because it needs no rollouts and no separate reward function. Instead, DPO learns directly from a dataset of preferred and rejected responses.

Below is an example of how to configure and run DPO using the custom `AccelerateDPOTrainer` included in this project.

In [6]:
try:
    from trlx_custom.trainer.accelerate_dpo_trainer import DPOConfig
    
    # Initialize DPO Configuration
    dpo_config = DPOConfig(
        beta=0.1,  # Scales the implicit reward; higher values keep the policy closer to the reference model
        gen_kwargs={"max_new_tokens": 64}
    )
    
    # In a real scenario, you would pass a dataset with 'chosen' and 'rejected' columns
    # trainer = trlx.train(
    #     model_path="gpt2",
    #     config=dpo_config,
    #     samples=...,
    #     rewards=... # In DPO, rewards are implicit in the preference pairs
    # )
    
    print("DPO Configuration loaded successfully.")
    print("To run DPO, you would call trlx.train() with this config and a preference dataset.")
    
except ImportError:
    print("Could not import DPOConfig. Ensure you are running this notebook from the project root.")

DPO Configuration loaded successfully.
To run DPO, you would call trlx.train() with this config and a preference dataset.
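To make the role of `beta` and the preference pairs concrete, here is a minimal sketch of the DPO loss for one pair, following the formulation in the DPO paper. It is not the code of this project's `AccelerateDPOTrainer`; the function name, the example record, and the log-probability values are all assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # Implicit rewards: how much more the policy likes each response than the reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)


# Hypothetical preference record; the field names are illustrative
pair = {"prompt": "My favorite animal is", "chosen": " the cat.", "rejected": " the dog."}

at_init = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference: loss is log(2)
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)  # policy favors the chosen response
print(at_init, improved)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is why no explicit reward function is needed.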


## 8. Summary

In this tutorial, we covered:
1.  **Configuration**: How to set up `TRLConfig` for PPO.
2.  **Reward Function**: How to define a custom Python function to guide the model.
3.  **Training**: How to launch the training loop with `trlx.train`.
4.  **DPO**: How to configure the custom Direct Preference Optimization trainer.

These pieces form the foundation of RLHF. In real-world scenarios, the simple `reward_fn` is replaced by a **Reward Model** (another neural network) trained on human preferences, as seen in the main project.
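One way to picture that swap is a small adapter that turns any batch scorer into a function with the same `reward_fn(samples, **kwargs)` shape used in this tutorial. The sketch below is an assumption-laden illustration: `make_reward_fn` and `toy_scorer` are hypothetical names, and the toy scorer merely stands in for a learned reward model.

```python
def make_reward_fn(score_batch, batch_size=8):
    """Wrap a batch scorer into the reward_fn(samples, **kwargs) shape used above."""
    def reward_fn(samples, **kwargs):
        rewards = []
        # Score in fixed-size batches, as a neural reward model usually requires
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            rewards.extend(float(score) for score in score_batch(batch))
        return rewards
    return reward_fn


def toy_scorer(texts):
    # Stand-in for a learned reward model: longer texts score higher
    return [len(text.split()) / 10.0 for text in texts]


model_reward_fn = make_reward_fn(toy_scorer, batch_size=2)
print(model_reward_fn(["a b c", "one two", "x"]))
```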