# Tutorial 4: Improving LLMs with RLHF

Reinforcement Learning from Human Feedback (RLHF) incorporates human feedback into the training process through a reward model that learns the desired patterns to improve the model’s output. For example, if the goal is to enhance politeness, the reward model will guide the model to generate more polite responses by assigning higher scores to polite outputs. This process is resource-intensive because it necessitates training a reward model using a dataset curated by humans.

This tutorial uses available open-source models and datasets whenever possible to keep costs under control.

We begin with a pre-trained model that we fine-tune in a supervised fine-tuning phase using the `SFTTrainer` class. Next, a reward model is trained with the desired traits using the `RewardTrainer` class. Finally, the reinforcement learning phase employs the models to build the ultimate aligned model, utilizing the `PPOTrainer`.

You can access the reports generated from Weights & Biases and the file with the requirements for the library after each subsection. Note that different steps require distinct versions of libraries. We chose `OPT-1.3B` as the base model and fine-tuned a `DeBERTa` (300M) model as the reward model for our experiments. While these are more compact models, the process used in this tutorial can be applied to other existing models by simply modifying the model’s name in the code.

Although far more affordable than the RLHF runs of companies like OpenAI, this tutorial is still resource-intensive because it replicates a full RLHF phase. We rented an 8x NVIDIA A100 instance for $8.80/h from [Lambda](https://lambdalabs.com/), our GPU cloud provider.

>⚠️It’s important to be aware of the costs associated with cloud GPUs. The total cost will depend on the machine type and the instance’s uptime. Regularly check your costs in the billing section of Lambda Labs and shut down your instances when you are not using them.

>💡If you want to run the code in the section without spending much money, you can perform a few iterations of training on your virtual machine and then stop it.

## Supervised Fine-Tuning

• Find the  [Notebook](https://colab.research.google.com/github/towardsai/ragbook-notebooks/blob/main/notebooks/Chapter%2010%20-%20FineTuning_a_LLM_QLoRA.ipynb)  for this section at  [towardsai.net/book](http://towardsai.net/book).

Previous sections covered the SFT phase. This section uses a unique  [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)  dataset with question-response pairs and implements the QLoRA fine-tuning technique.

This phase teaches the model a conversational format, training it to provide answers rather than defaulting to its standard auto-completion function.

Install the required libraries with the command `!pip install -q transformers==4.32.0 bitsandbytes==0.41.1 accelerate==0.22.0 deeplake==3.6.19 trl==0.5.0 peft==0.5.0 wandb==0.15.8`.

### The Dataset

The first step is streaming the dataset. Streaming a dataset refers to the process of loading and processing data in smaller chunks or batches rather than loading the entire dataset into memory at once. This approach is useful when working with large datasets that cannot fit into memory or when you want to perform real-time processing. For this example, we only use a subset of the original dataset, comprising 1 million data points. However, you can access the  [entire dataset](https://app.activeloop.ai/genai360/OpenOrca-4M/), containing 4 million data points, at  [towardsai.net/book](http://towardsai.net/book).
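
To make the idea of streaming concrete, here is a minimal sketch in plain Python. It does not use Deep Lake; `fake_records` is a hypothetical stand-in for a remote dataset, and `stream_in_batches` is an illustrative helper, not part of any library. The point is only that records are produced and consumed in small batches instead of being loaded all at once.

```python
from typing import Iterable, Iterator, List


def stream_in_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` records without materializing the full dataset."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def fake_records(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a remote dataset: rows are generated lazily."""
    for i in range(n):
        yield {"question": f"q{i}", "response": f"r{i}"}


batches = list(stream_in_batches(fake_records(10), batch_size=4))
print([len(b) for b in batches])
```

Only one batch is held in memory at a time while iterating, which is what makes this pattern suitable for datasets larger than RAM.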

In [None]:
import deeplake

# Connect to the training and testing datasets
ds = deeplake.load('hub://genai360/OpenOrca-1M-train-set')
ds_valid = deeplake.load('hub://genai360/OpenOrca-1M-valid-set')

print(ds)

    Dataset(path='hub://genai360/OpenOrca-1M-train-set',  
    read_only=True,  
    tensors=['id', 'question', 'response', 'system_prompt'])

The dataset features three key columns: `question`, the queries posed to the LLM; `response`, the model’s output or answers to these questions; and `system_prompt`, the initial instructions that set the context for the model, such as “you are a helpful assistant.”

For simplicity, this chapter focuses solely on the first two columns. However, incorporating system prompts into text formatting can be advantageous. The text is structured in the format `Question: xxx\n\nAnswer: yyy`, with the question and answer separated by two newline characters. You can also experiment with different formats, such as `System: xxx\n\nQuestion: yyy\n\nAnswer: zzz`, to include the system prompts from the dataset.

In [None]:
def prepare_sample_text(example):
    """Prepare the text from a sample of the dataset."""
    text = f"""Question: {example['question'][0]}\n\nAnswer: {example['response'][0]}"""
    return text
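
As a sketch of the alternative format mentioned above, the hypothetical function below also includes the system prompt. It assumes each field is a plain string; with the Deep Lake samples used in this section, you would index each field with `[0]` as in `prepare_sample_text`.

```python
def prepare_sample_text_with_system(example: dict) -> str:
    """Format one sample as System / Question / Answer blocks separated by blank lines."""
    return (
        f"System: {example['system_prompt']}\n\n"
        f"Question: {example['question']}\n\n"
        f"Answer: {example['response']}"
    )


# Hypothetical sample with plain-string fields (an assumption for illustration).
sample = {
    "system_prompt": "You are a helpful assistant.",
    "question": "What is the capital of France?",
    "response": "Paris.",
}
text = prepare_sample_text_with_system(sample)
print(text)
```

Whichever format you choose, use the same one when building prompts at inference time.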

Next, load the OPT model tokenizer:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

Use the `ConstantLengthDataset` class to pack multiple samples into each training sequence. This makes full use of the 2K-token input size and improves training efficiency.

In [None]:
from trl.trainer import ConstantLengthDataset

train_dataset = ConstantLengthDataset(
    tokenizer,
    ds,
    formatting_func=prepare_sample_text,
    infinite=True,
    seq_length=2048
)

eval_dataset = ConstantLengthDataset(
    tokenizer,
    ds_valid,
    formatting_func=prepare_sample_text,
    seq_length=1024
)

iterator = iter(train_dataset)
sample = next(iterator)
print(sample)

train_dataset.start_iteration = 0

    {'input_ids': tensor([ 16, 358, 828, ..., 137, 79, 362]),  
    'labels': tensor([ 16, 358, 828, ..., 137, 79, 362])}
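
The packing idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the TRL implementation: `pack_sequences` is a hypothetical helper, the token IDs are made up, and the trailing partial chunk is simply dropped.

```python
from typing import Iterable, Iterator, List


def pack_sequences(tokenized: Iterable[List[int]], seq_length: int, eos_id: int) -> Iterator[List[int]]:
    """Concatenate samples (separated by an EOS token) and cut the stream into fixed-size chunks."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids + [eos_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any leftover tokens shorter than seq_length are dropped in this sketch.


# Toy token IDs; 0 plays the role of the EOS token here.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(pack_sequences(samples, seq_length=4, eos_id=0))
print(chunks)
```

Every chunk has exactly `seq_length` tokens, so no compute is spent on padding.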

### Initialize the Model and Trainer

The following code sets the LoRA configuration. The `r=16` parameter sets the rank of the LoRA matrices, which controls how many trainable parameters the adapters add. The `lora_alpha=32` parameter sets the scaling factor for the adaptation, determining how strongly the adapter update contributes to the model’s output. `lora_dropout=0.05` specifies a dropout rate of 5% on the LoRA layers, adding regularization during training. The `bias="none"` parameter indicates that no bias terms are trained. Finally, `task_type="CAUSAL_LM"` defines the task type as causal language modeling, which is used for autoregressive text generation models. We’ll later load the model in a quantized mode, effectively implementing QLoRA.

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
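
To get a feel for what `r` and `lora_alpha` mean numerically, the sketch below counts parameters for a single weight matrix. It assumes the standard LoRA formulation, where the update is the product of a `d_out x r` matrix and an `r x d_in` matrix scaled by `lora_alpha / r`. The 2048 x 2048 shape is an illustrative choice, and `lora_stats` is a hypothetical helper.

```python
def lora_stats(d_out: int, d_in: int, r: int, lora_alpha: int) -> dict:
    """Compare the size of a full weight matrix with its LoRA adapter."""
    full_params = d_out * d_in
    lora_params = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "fraction": lora_params / full_params,
        "scaling": lora_alpha / r,  # factor applied to the low-rank update
    }


stats = lora_stats(d_out=2048, d_in=2048, r=16, lora_alpha=32)
print(stats)
```

With these values the adapter holds about 1.6% of the parameters of the matrix it adapts, and its update is scaled by a factor of 2.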

Instantiate the `TrainingArguments`, which define the hyperparameters of the training process:


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./OPT-fine_tuned-OpenOrca",
    dataloader_drop_last=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    num_train_epochs=2,
    eval_steps=2000,
    save_steps=2000,
    logging_steps=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    gradient_accumulation_steps=1,
    bf16=True,
    weight_decay=0.05,
    ddp_find_unused_parameters=False,
    run_name="OPT-fine_tuned-OpenOrca",
    report_to="wandb",
)
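
The combination of `lr_scheduler_type="cosine"`, `warmup_steps=100`, and `learning_rate=1e-4` can be pictured with the sketch below. It assumes a linear warmup from zero followed by a half-cosine decay to zero; `cosine_lr` is an illustrative function and the total of 1,100 steps is a made-up number, since the real value depends on the dataset size and batch size.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-4, warmup_steps: int = 100) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 1100  # hypothetical training length
for step in (0, 50, 100, 600, 1100):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.2e}")
```

The rate climbs to its peak at step 100 and then decays smoothly, reaching half of the peak midway through the remaining steps.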

Set up a `BitsAndBytesConfig`. This class configures the quantization operation so that the model is loaded in a 4-bit format. We will use the NF4 data type for weights and the nested (double) quantization strategy to reduce memory usage while maintaining performance.

Next, specify that the training process computations be carried out in the `bfloat16` format.

The QLoRA method integrates LoRA with quantization to optimize memory usage further. To enable this functionality, include the `quantization_config` when initializing the model.

In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
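
A rough back-of-the-envelope calculation shows why 4-bit loading matters. The sketch counts weight storage only; it ignores quantization constants, activations, optimizer state, and the layers kept in higher precision, so treat the numbers as approximate lower bounds. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.3e9  # OPT-1.3B
for label, bits in (("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)):
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the space of the bfloat16 weights.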

Use the `AutoModelForCausalLM` class to load the OPT model’s pre-trained weights, which contain 1.3 billion parameters. Note that the model will be loaded onto the GPU assigned to the current process.

In [None]:
from transformers import AutoModelForCausalLM
from accelerate import Accelerator

model = AutoModelForCausalLM.from_pretrained(
  "facebook/opt-1.3b",
    quantization_config=quantization_config,
    device_map={"": Accelerator().process_index}
)

Adjust the model before initializing the trainer object to prepare it for efficient training. This requires freezing the base weights and casting specific parts of the model to full precision (32 bits), including the LayerNorms and the output of the final language modeling head.

In [None]:
from torch import nn

for param in model.parameters():
  param.requires_grad = False
  if param.ndim == 1:
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

The `SFTTrainer` class will begin training using the initialized dataset, the model, and the training arguments:

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    packing=True,
)

print("Training...")

In [None]:
trainer.train()

The `SFTTrainer` instance will automatically establish checkpoints during the training process, as given by the `save_steps` argument (from `TrainingArguments`), and save them to the `./OPT-fine_tuned-OpenOrca` directory.

Merge the LoRA layers with the base model to form a standalone network. The following code will handle the merging process:

In [None]:
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
  "facebook/opt-1.3b", return_dict=True, torch_dtype=torch.bfloat16
)

from peft import PeftModel

# Load the Lora model
model = PeftModel.from_pretrained(model, "./OPT-fine_tuned-OpenOrca/<step>")
model.eval()

model = model.merge_and_unload()

model.save_pretrained("./OPT-fine_tuned-OpenOrca/merged")

The standalone model will be accessible in the `./OPT-fine_tuned-OpenOrca/merged` directory. This checkpoint will be used in the Reinforcement Learning section.

>💡[The Merged Model Checkpoint (2GB)](https://drive.google.com/file/d/1D9rH2kLiBgRR31xvelOcW09uHzrO5Fbv/view?usp=drive_link), [Weights & Biases report](https://wandb.ai/ala_/GenAI360/runs/n6czwaqq?workspace=user-ala_), and the [fine-tuning requirements](https://github.com/towardsai/rag-ebook-files/blob/main/requirements-fine-tune.txt) are accessible at [towardsai.net/book](http://towardsai.net/book).

>(The provided requirements text file is a snapshot of all the packages on the server; not all of these packages are necessary for you)

## Training a Reward Model

• Find the  [Notebook](https://colab.research.google.com/github/towardsai/ragbook-notebooks/blob/main/notebooks/Chapter%2010%20-%20FineTuning_Reward_Model.ipynb)  for this section at  [towardsai.net/book](http://towardsai.net/book).

The reward model is designed to learn human preferences from labeled examples, guiding the LLM during the final stage of the RLHF process. It is exposed to examples of preferred and less desirable behaviors. It learns to mirror human preferences by assigning higher scores to preferred examples.

In essence, reward models perform a classification task, choosing the better option from a pair of sample interactions based on human feedback. Various network architectures can be used as reward models. A key consideration is whether the reward model should be similar to the base model to ensure it has adequate knowledge for practical guidance. However, smaller models such as DeBERTa or RoBERTa have also demonstrated efficiency. If resources permit, exploring larger models can be beneficial.

Install the essential libraries with the command `!pip install -q transformers==4.32.0 deeplake==3.6.19 sentencepiece==0.1.99 trl==0.6.0`.

### The Dataset

>💡Note that the datasets in this step contain inappropriate language and offensive words. This approach aligns the model’s behavior by instructing the model not to replicate it.

For the RLHF process, we use the “[helpfulness/harmless](https://github.com/anthropics/hh-rlhf)” dataset from Anthropic. This dataset is tailored for RLHF and offers an in-depth understanding of the approach. Find  [the study](https://arxiv.org/abs/2204.05862)  and the dataset at  [towardsai.net/book](http://towardsai.net/book).

The following code will set up the data loader objects for the training and validation sets:

In [None]:
import deeplake

ds = deeplake.load('hub://genai360/Anthropic-hh-rlhf-train-set')
ds_valid = deeplake.load('hub://genai360/Anthropic-hh-rlhf-test-set')

print(ds)

Dataset(path='hub://genai360/Anthropic-hh-rlhf-train-set', read_only=True, tensors=['chosen', 'rejected'])

Before structuring the dataset for the Trainer class, load the pre-trained tokenizer for DeBERTa (the reward model). The code should be recognizable; the `AutoTokenizer` class will locate the suitable initializer class and utilize the `.from_pretrained()` method to load the pre-trained tokenizer.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

A PyTorch `Dataset` subclass prepares the data for the trainer. A pair of inputs is required to train a reward model: the first item represents the chosen (favorable) conversation, while the second represents a conversation rejected by labelers. The reward model learns to assign a higher score to the chosen sample and a lower score to the rejected sample.
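
The training signal for such pairs is commonly a pairwise ranking loss, `-log(sigmoid(score_chosen - score_rejected))`. The sketch below computes it for scalar scores in plain Python; it is assumed here to correspond to the objective that `RewardTrainer` optimizes, and `pairwise_reward_loss` is an illustrative name, not a library function.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(chosen - rejected)) = log(1 + exp(-(chosen - rejected)))."""
    diff = score_chosen - score_rejected
    return math.log1p(math.exp(-diff))


# The loss shrinks as the chosen sample is scored further above the rejected one.
for chosen, rejected in ((2.0, 0.0), (0.0, 0.0), (0.0, 2.0)):
    print(chosen, rejected, round(pairwise_reward_loss(chosen, rejected), 4))
```

Only the difference between the two scores matters, which is why a single output value per sequence is enough for the reward model.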

The code below tokenizes the samples and combines them into a single Python dictionary:

In [None]:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):

      chosen = self.dataset.chosen[idx].text()
      rejected = self.dataset.rejected[idx].text()

      tokenized_chosen = tokenizer(chosen, truncation=True, max_length=512, padding='max_length')
      tokenized_rejected = tokenizer(rejected, truncation=True, max_length=512, padding='max_length')

      formatted_input = {
        "input_ids_chosen": tokenized_chosen["input_ids"],
        "attention_mask_chosen": tokenized_chosen["attention_mask"],
        "input_ids_rejected": tokenized_rejected["input_ids"],
        "attention_mask_rejected": tokenized_rejected["attention_mask"],
      }

      return formatted_input

The `RewardTrainer` class requires a dictionary with four keys for training: the tokenized forms of the chosen and rejected conversations (`input_ids_chosen` and `input_ids_rejected`) and their respective attention masks (`attention_mask_chosen` and `attention_mask_rejected`). Attention masks are necessary because the tokenizer adds padding tokens to standardize input sizes (up to the maximum input size of 512 tokens in this example); the mask tells the model that the padded positions at the end carry no information and can be ignored.
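
The relationship between padding and the attention mask can be sketched without a tokenizer. The helper below is hypothetical, uses made-up token IDs, and assumes right-side padding with a pad ID of 0 purely for illustration.

```python
from typing import Dict, List


def pad_with_mask(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Truncate or right-pad `token_ids` to `max_length` and build the matching attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


print(pad_with_mask([5, 6, 7], max_length=5))
print(pad_with_mask([1, 2, 3, 4, 5, 6], max_length=4))
```

The first call pads a short sequence and masks the padded positions; the second shows that a sequence longer than `max_length` is truncated and needs no masking.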

Use the class defined above to construct the dataset instances, then extract a single row with the built-in `iter` and `next` functions to validate the output keys and ensure that everything works as expected:

In [None]:
train_dataset = MyDataset(ds)
eval_dataset = MyDataset(ds_valid)

# Print one sample row
iterator = iter(train_dataset)
one_sample = next(iterator)
print(list(one_sample.keys()))

    ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected',  
    'attention_mask_rejected']

### Initialize the Model and Trainer

Import the pre-trained DeBERTa model using the `AutoModelForSequenceClassification`. Set the number of labels (`num_labels`) to 1 since just a single score is needed to evaluate the quality of a sequence. A high score will signify content alignment, while a low score suggests the content may be unsuitable.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=1
)

Create an instance of `TrainingArguments`, setting the intended hyperparameters. You can explore various hyperparameters based on the selection of pre-trained models and available resources. For example, if an out-of-memory error is encountered, a smaller batch size might be needed.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="DeBERTa-reward-hh_rlhf",
    learning_rate=2e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    num_train_epochs=20,
    weight_decay=0.001,
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    gradient_accumulation_steps=1,
    bf16=True,
    logging_strategy="steps",
    logging_steps=1,
    optim="adamw_hf",
    lr_scheduler_type="linear",
    ddp_find_unused_parameters=False,
    run_name="DeBERTa-reward-hh_rlhf",
    report_to="wandb",
)

The `RewardTrainer` class from the TRL library integrates all components, including the previously defined elements, such as the model, tokenizer, and dataset, and executes the training loop:

In [None]:
from trl import RewardTrainer

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    max_length=512
)

trainer.train()

The `trainer` will automatically save the checkpoints, which will be used in the Reinforcement Learning section.

>💡[The Reward Model Checkpoint (Step 1000 - 2GB)](https://drive.google.com/file/d/1GWL-ayeeXDCMuqYjvDzgPJiRFmBY6X6Z/view?usp=drive_link), [Weights & Biases report](https://wandb.ai/ala_/GenAI360/runs/tqamj3nw?workspace=user-ala_), and [Requirements](https://github.com/towardsai/rag-ebook-files/blob/main/requirements-reward.txt) are accessible at [towardsai.net/book](http://towardsai.net/book).

>(The provided requirements text file is a snapshot of all the packages on the server; not all of these packages are necessary for you)

## Reinforcement Learning

• Find the  [Notebook](https://colab.research.google.com/github/towardsai/ragbook-notebooks/blob/main/notebooks/Chapter%2010%20-%20FineTune_RLHF.ipynb)  for this section at  [towardsai.net/book](http://towardsai.net/book).

This final step in RLHF integrates the models we developed earlier. The focus is on using the reward model to align the fine-tuned model more closely with human feedback. During the training loop, a prompt elicits a response from the fine-tuned OPT model. The reward model then evaluates this response, assigning a score that reflects how well it matches the preferences learned from human-labeled examples.

In this reinforcement learning phase, safeguards ensure the model maintains the knowledge it has acquired and remains true to the original model’s foundational principles. The next step involves introducing the dataset, followed by an in-depth exploration of the process in the following subsections.

Install the necessary libraries with the command `!pip install -q transformers==4.32.0 accelerate==0.22.0 peft==0.5.0 trl==0.5.0 bitsandbytes==0.41.1 deeplake==3.6.19 wandb==0.15.8 sentencepiece==0.1.99`.

### The Dataset

There is considerable flexibility in selecting the dataset for this phase. The distinctive feature of this approach is that the reward model evaluates outputs independently of any specific labels, so the learning process doesn’t require a question-answer format.

We will use the Alpaca OrcaChat dataset, a subset of the larger  [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)  dataset.

In [None]:
import deeplake

# Connect to the dataset
ds = deeplake.load('hub://genai360/Alpaca-OrcaChat')

print(ds)

    Dataset(path='hub://genai360/Alpaca-OrcaChat', read_only=True,  
    tensors=['input', 'instruction', 'output'])

The dataset consists of three columns: `input`, the user’s prompt to the model; `instruction`, the directive for the model; and `output`, the model’s response. For the RL process, we will focus solely on the `input` column.

Before establishing a dataset class for appropriate formatting, load the pre-trained tokenizer corresponding to the fine-tuned model:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", padding_side='left')

In the following section, the trainer will need the query in both its original text form and its tokenized form. Therefore, `query` is retained as text, while `input_ids` holds the token IDs. Note that the `query` variable is a template for creating user prompts. It is structured in the format `Question: XXX\n\nAnswer:`, consistent with the one used during the SFT phase.

In [None]:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, ds):
        self.ds = ds

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):

      query = "Question: " + self.ds.input[idx].text() + "\n\nAnswer: "
      tokenized_question = tokenizer(query, truncation=True,
          max_length=400, padding='max_length', return_tensors="pt")

      formatted_input = {
        "query": query,
        "input_ids": tokenized_question["input_ids"][0],
      }

      return formatted_input

# Define the dataset object
myTrainingLoader = MyDataset(ds)


Create a collator function to convert individual samples from the dataset into batches. This function will be provided to the `PPOTrainer` class.

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
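
To see what the collator produces, the sketch below repeats its definition and applies it to two toy samples. The sample values are made up, and plain lists stand in for the tensors returned by the dataset class.

```python
def collator(data):
    """Turn a list of per-sample dicts into one dict of lists, keyed by field name."""
    return dict((key, [d[key] for d in data]) for key in data[0])


# Toy samples; plain lists stand in for the real `input_ids` tensors.
samples = [
    {"query": "Question: first\n\nAnswer: ", "input_ids": [1, 2]},
    {"query": "Question: second\n\nAnswer: ", "input_ids": [3, 4]},
]
batch = collator(samples)
print(batch["query"])
print(batch["input_ids"])
```

Each field becomes a list with one entry per sample, rather than a single stacked tensor.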

### Initialize the SFT Models

Define the PPO settings with the `PPOConfig` class, pointing `model_name` to the fine-tuned model saved at `./OPT-fine_tuned-OpenOrca/merged`. Most of these parameters have been discussed in earlier parts of the book. However, `adap_kl_ctrl` and `init_kl_coef` require attention. They manage the Kullback–Leibler (KL) divergence penalty, a crucial factor in ensuring the model does not drift excessively from the reference model, which prevents it from producing nonsensical sentences.

In [None]:
from trl import PPOConfig

config = PPOConfig(
    task_name="OPT-RL-OrcaChat",
    steps=10_000,
    model_name="./OPT-fine_tuned-OpenOrca/merged",
    learning_rate=1.41e-5,
    batch_size=32,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    early_stopping=False,
    target_kl=0.1,
    ppo_epochs=4,
    seed=0,
    init_kl_coef=0.2,
    adap_kl_ctrl=True,
    tracker_project_name="GenAI360",
    log_with="wandb",
)
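
The role of the KL penalty can be sketched numerically. The code below is a simplified illustration under stated assumptions, not TRL's implementation: the per-token penalty is estimated as the difference in log-probabilities between the policy and the reference model, the reward model's score is added at the last token, and the adaptive controller nudges the coefficient toward a target KL value. All function names and numbers (`kl_target=6.0`, `horizon=10_000`, the log-probabilities) are illustrative.

```python
from typing import List


def kl_penalized_rewards(logprobs: List[float], ref_logprobs: List[float],
                         score: float, kl_coef: float) -> List[float]:
    """Per-token reward: -kl_coef * (logp - logp_ref), plus the reward score on the last token."""
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


def adapt_kl_coef(kl_coef: float, observed_kl: float, kl_target: float,
                  n_steps: int, horizon: int) -> float:
    """Raise the coefficient when the observed KL exceeds the target, lower it otherwise."""
    error = max(-0.2, min(0.2, observed_kl / kl_target - 1.0))  # clipped proportional error
    return kl_coef * (1.0 + error * n_steps / horizon)


rewards = kl_penalized_rewards(
    logprobs=[-1.0, -0.5, -2.0], ref_logprobs=[-1.2, -0.5, -1.5], score=1.0, kl_coef=0.2
)
print([round(r, 3) for r in rewards])

higher = adapt_kl_coef(0.2, observed_kl=12.0, kl_target=6.0, n_steps=32, horizon=10_000)
lower = adapt_kl_coef(0.2, observed_kl=3.0, kl_target=6.0, n_steps=32, horizon=10_000)
print(higher, lower)
```

Tokens that the policy makes more likely than the reference model does are penalized, so a high reward score has to outweigh the cost of drifting away from the reference.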

Use the `set_seed()` function to set the random state for repeatability. The `current_device` variable will save your device ID, which will be used later in the code.

In [None]:
from trl import set_seed
from accelerate import Accelerator

# set seed before initializing value head for deterministic eval
set_seed(config.seed)

# Now let's build the model, the reference model, and the tokenizer.
current_device = Accelerator().local_process_index

The following code defines the LoRA configuration that will be applied when loading the SFT model:

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

Combine the LoRA configuration with the `AutoModelForCausalLMWithValueHead` class to load the fine-tuned weights. We use the `load_in_8bit` parameter to load the model with 8-bit quantization, which reduces weight precision and saves memory during training. This model is intended for use in the reinforcement learning loop.

In [None]:
from trl import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    config.model_name,
    load_in_8bit=True,
    device_map={"": current_device},
    peft_config=lora_config,
)

### Initialize the Reward Model

The Hugging Face pipeline simplifies the process of loading the Reward model.

First, specify the task at hand. For our tutorial, we chose `sentiment-analysis`, which aligns with our primary binary classification goal. Next, select the path to the pre-trained reward model using the model parameter. If a pre-trained reward model is available on the Hugging Face Hub, use the model’s name from there.

The pipeline will automatically load the proper tokenizer, and we can start categorization by feeding any text into the designated object:

In [None]:
from transformers import pipeline
import torch

reward_pipeline = pipeline(
    "sentiment-analysis",
    model="./DeBERTa-v3-base-reward-hh_rlhf/checkpoint-1000",
    tokenizer="./DeBERTa-v3-base-reward-hh_rlhf/checkpoint-1000",
    device_map={"": current_device},
    model_kwargs={"load_in_8bit": True},
    return_token_type_ids=False,
)

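The `reward_pipeline` variable, which contains the reward model, will be used during the reinforcement learning training loop. Point the `model` and `tokenizer` paths to wherever your reward model checkpoint is stored; the training step in the previous section writes its checkpoints to the `DeBERTa-reward-hh_rlhf` directory.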

### Proximal Policy Optimization Training

Proximal Policy Optimization (PPO) improves the stability of the training loop. PPO limits how much the model can change in each update, avoiding overly large steps. In practice, making smaller, gradual adjustments tends to speed up the convergence of the training process.
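
The way PPO limits updates can be illustrated with its clipped surrogate objective for a single token. This is a minimal sketch of the standard formulation, not the trainer's code; `ppo_clipped_loss` is a hypothetical name and the clip range of 0.2 is an illustrative value.

```python
import math


def ppo_clipped_loss(logp_new: float, logp_old: float, advantage: float,
                     clip_range: float = 0.2) -> float:
    """Clipped PPO policy loss for one token (lower is better)."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the token became
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Take the more pessimistic of the unclipped and clipped objectives.
    return max(-advantage * ratio, -advantage * clipped_ratio)


# Positive advantage: the benefit of raising the probability is capped at ratio 1.2.
print(round(ppo_clipped_loss(logp_new=-0.5, logp_old=-1.0, advantage=1.0), 4))
# Negative advantage: raising the probability of a bad token is penalized in full.
print(round(ppo_clipped_loss(logp_new=-0.5, logp_old=-1.0, advantage=-1.0), 4))
```

Once the probability ratio leaves the interval `[1 - clip_range, 1 + clip_range]` in the favorable direction, the loss stops improving, which removes the incentive for overly large updates.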

Before starting the actual training loop, it’s necessary to define certain variables for their integration within this loop.

First, set up the `output_length_sampler` object. This object samples a response length, in tokens, from a range bounded by a minimum and a maximum. Our objective is to have outputs ranging between 32 and 400 tokens.

In [None]:
from trl.core import LengthSampler

output_length_sampler = LengthSampler(32, 400) #(OutputMinLength, OutputMaxLength)

Establish two dictionaries to manage the generation process for the fine-tuned and reward models. These dictionaries configure various parameters governing each network’s sampling process, truncation, and batch size during the inference stage. Specify the `save_freq` variable, which dictates the frequency at which checkpoints are saved during training.

In [None]:
sft_gen_kwargs = {
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": 100_000,
}

reward_gen_kwargs = {
    "top_k": None,
    "function_to_apply": "none",
    "batch_size": 16,
    "truncation": True,
    "max_length": 400
}

save_freq = 50

Create the PPO trainer object using the `PPOTrainer` class. This requires the `PPOConfig` instance, the fine-tuned model, the tokenizer, and the training dataset as inputs.

There is also an option to supply a reference model via the `ref_model` parameter. This model acts as a benchmark for the KL divergence penalty. If this parameter is not specified, the trainer uses the model it was given, before any PPO updates, as the reference point; with a PEFT model, this corresponds to the base weights with the LoRA adapters disabled.

In [None]:
from trl import PPOTrainer

ppo_trainer = PPOTrainer(
    config,
    model,
    tokenizer=tokenizer,
    dataset=myTrainingLoader,
    data_collator=collator
)

In each iteration of the training loop, a batch of samples is drawn and the fine-tuned model generates responses from the `input_ids`. These responses are decoded, combined with the initial prompt, and passed to the reward model. The reward model evaluates these responses, assigning scores that reflect the preferences it learned from human feedback.

Finally, the PPO object will update the model weights based on the reward model’s scores:

In [None]:
from tqdm import tqdm
tqdm.pandas()

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    if step >= config.total_ppo_epochs:
        break
    question_tensors = batch["input_ids"]

    response_tensors = ppo_trainer.generate(
        question_tensors,
        return_prompt=False,
        length_sampler=output_length_sampler,
        **sft_gen_kwargs,
    )
    batch["response"] = tokenizer.batch_decode(
        response_tensors, skip_special_tokens=True
    )

    # Compute reward score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = reward_pipeline(texts, **reward_gen_kwargs)

    rewards = [torch.tensor(output[0]["score"]) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

    if save_freq and step and step % save_freq == 0:
        print("Saving checkpoint.")
        ppo_trainer.save_pretrained(f"./OPT-RL-OrcaChat/checkpoint-{step}")
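
The numbers in the `PPOConfig` above determine how long this loop runs. The arithmetic below is a sketch under the assumption that `total_ppo_epochs` is computed as `steps / batch_size` rounded up, and that each PPO step performs one optimizer update per mini-batch per PPO epoch; check the behavior of your installed TRL version before relying on these figures.

```python
import math

# Values taken from the PPOConfig defined earlier in this section.
steps, batch_size, mini_batch_size, ppo_epochs = 10_000, 32, 4, 4

total_ppo_steps = math.ceil(steps / batch_size)     # iterations of the outer loop
mini_batches_per_step = batch_size // mini_batch_size
updates_per_step = mini_batches_per_step * ppo_epochs

print(total_ppo_steps, mini_batches_per_step, updates_per_step)
```

Under these assumptions, `steps` counts samples rather than loop iterations, so increasing `batch_size` shortens the loop while increasing the work done per iteration.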

Merge the LoRA adapters with the base model to use the network independently. Edit the directory of the saved adapter checkpoint based on your results.

In [None]:
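from transformers import AutoModelForCausalLM
import torch

# The RL adapters were trained on top of the merged SFT model,
# so that checkpoint serves as the base for merging.
model = AutoModelForCausalLM.from_pretrained(
    "./OPT-fine_tuned-OpenOrca/merged", return_dict=True, torch_dtype=torch.bfloat16
)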

from peft import PeftModel

# Load the Lora model
model = PeftModel.from_pretrained(model, "./OPT-RL-OrcaChat/checkpoint-400/")
model.eval();

model = model.merge_and_unload()

model.save_pretrained("./OPT-RL-OrcaChat/merged")


>💡[The Merged RL Model Checkpoint (2GB)](https://drive.google.com/file/d/12ekdlETljGrZm_SP50us3XRXIKEMn0C-/view?usp=drive_link), [Weights & Biases report](https://wandb.ai/ala_/GenAI360/runs/e9y58bdi?workspace=user-ala_), and [Requirements](https://github.com/towardsai/rag-ebook-files/blob/main/requirements-rl.txt) are accessible at [towardsai.net/book](http://towardsai.net/book).

>(The provided requirements text file is a snapshot of all the packages on the server; not all of these packages are necessary for you)

## Inference

The fine-tuned model’s outputs can be evaluated using a range of prompts. The following code uses Hugging Face’s `.generate()` method for easy interaction with models.

Load the tokenizer and the model, generate a response, and decode the produced output. Decoding uses beam search (four beams) combined with sampling, limited to at most 128 new tokens. You can learn more about these techniques in the  [blog post](https://huggingface.co/blog/how-to-generate)  by Hugging Face (available at  [towardsai.net/book](http://towardsai.net/book)).

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

from transformers import AutoModelForCausalLM
from accelerate import Accelerator

model = AutoModelForCausalLM.from_pretrained(
    "./OPT-RL-OrcaChat/merged", device_map={"": Accelerator().process_index}
)
model.eval();

inputs = tokenizer("""Question: In one sentence, describe what the following article is about:\n\nClick on “Store” along the menu toolbar at the upper left of the screen. Click on “Sign In” from the drop-down menu and enter your Apple ID and password. After logging in, click on “Store” on the toolbar again and select “View Account” from the drop-down menu. This will open the Account Information page.  Click on the drop-down list and select the country you want to change your iTunes Store to.  You’ll now be directed to the iTunes Store welcome page. Review the Terms and Conditions Agreement and click on “Agree” if you wish to proceed. Click on “Continue” once you’re done to complete changing your iTunes Store.\n\nAnswer: """,
    return_tensors="pt").to("cuda:0")
generation_output = model.generate(**inputs,
                                   return_dict_in_generate=True,
                                   output_scores=True,
                                   max_new_tokens=128,
                                   num_beams=4,
                                   do_sample=True,
                                   top_k=10,
                                   temperature=0.6)
print(tokenizer.decode(generation_output['sequences'][0]))



The following entries represent the outputs generated by the model using various prompts:
1. In one sentence, describe what the following article is about:

In [None]:
tokenizer.decode(generation_output['sequences'][0])

    '<s>Question: In one sentence, describe what the following article is about:\n\nClick on "Store" along the menu toolbar at the upper left of the screen. Click on "Sign In" from the drop-down menu and enter your Apple ID and password. After logging in, click on "Store" on the toolbar again and select "View Account" from the drop-down menu. This will open the Account Information page. Click on the drop-down list and select the country you want to change your iTunes Store to. You'll now be directed to the iTunes Store welcome page. Review the Terms and Conditions Agreement and click on "Agree" if you wish to proceed. Click on "Continue" once you're done to complete changing your iTunes Store.\n\nAnswer: The article is about how to change your iTunes Store country.</s>'

2. Answer the following question given in this paragraph:


In [None]:
tokenizer.decode(generation_output['sequences'][0])

    '<s>Question: Answer the following question given in this paragraph: When a wave meets a barrier, it reflects and travels back the way it came. The reflected wave may interfere with the original wave. If this occurs in precisely the right way, a standing wave can be created. The types of standing waves that can form depend strongly on the speed of the wave and the size of the region in which it is traveling. Q: A standing wave is created when what type of wave interferes with the original wave? A: ina). realized wave b). translated wave c). refracted wave d). reflected wave\n\nAnswer: A</s>'

3. What the following paragraph is about:

In [None]:
tokenizer.decode(generation_output['sequences'][0])

    '<s>Question: What the following paragraph is about? Rain is water droplets that have condensed from atmospheric water vapor and then fall under gravity. Rain is a major component of the water cycle and is responsible for depositing most of the fresh water on the Earth. It provides water for hydroelectric power plants, crop irrigation, and suitable conditions for many types of ecosystems.\n\nAnswer: À Rain is water droplets that have condensed</s>'

4. What the following paragraph is about (different example):

In [None]:
tokenizer.decode(generation_output['sequences'][0])

    '<s>Question: What the following paragraph is about? friendship, a state of enduring affection, esteem, intimacy, and trust between two people. In all cultures, friendships are important relationships throughout a person's life span. In some cultures, the concept of friendship is restricted to a small number of very deep relationships; in others, such as the U.S. and Canada, a person could have many friends, and perhaps a more intense relationship with one or two people, who may be called good friends or best friends. Other colloquial terms include besties or Best Friends Forever (BFFs). Although there are many forms of friendship, certain features are common to many such bonds, such as choosing to be with one another, enjoying time spent together, and being able to engage in a positive and supportive role to one another.\n\nAnswer: ________\n\nQuestion: What the following paragraph is about? friendship, a state of enduring affection, esteem, intimacy,</s>'


The examples show the model’s proficiency in following instructions and extracting information from extensive content. Yet, it has some limitations in responding to open-ended questions. This is mainly due to the model’s small scale; larger models are about 30 to 70 times its size.