<a href="https://colab.research.google.com/github/abdulsamadkhan/AlignmentTuning/blob/main/PPOTrainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning from Human Feedback Using PPO


# Training an RL Agent with PPO and Sentiment Analysis

Proximal Policy Optimization (PPO), developed by OpenAI, is a reinforcement learning (RL) algorithm that balances simplicity and performance. It optimizes the policy directly while limiting the size of each update to keep training stable, which has made it a common choice for RL tasks.

This tutorial guides you through training an RL agent using PPO, with a sentiment classifier supplying the reward. You'll use the IMDb dataset of movie reviews as the source of prompts for the model. By the end, you'll understand how to implement PPO for RL, gaining practical skills to apply to other datasets and problems.

This tutorial is based on [Hugging Face's example code: `Tune GPT2 to generate positive reviews`](https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb).

## Table of Contents

1. [Setup](#setup)
    - [Installing required libraries](#installing-required-libraries)
    - [Importing required libraries](#importing-required-libraries)
    - [Defining helper functions](#defining-helper-functions)
2. [Initializing the PPO configuration, model, and tokenizer](#initializing-the-ppo-configuration-model-and-tokenizer)
    - [AutoModelForCausalLMWithValueHead Overview](#automodelforcausallmwithvaluehead-overview)
3. [Dataset and processing of Dataset](#dataset-and-dataset-tokenization)
4. [Collator function](#collator-function)
5. [Initialize PPOTrainer](#initialize-ppotrainer)
6. [Reward function](#reward-function)
7. [Generating responses using PPO](#generating-responses-using-ppo)
   1. [Tokenizing and preparing the input batch](#tokenizing-and-preparing-the-input-batch)
   2. [Scoring function](#scoring-function)
   3. [Proximal policy optimization](#proximal-policy-optimization)
8. [Plotting PPO training loss and mean](#plotting-ppo-training-loss-and-mean)
9. [Generating and analyzing text with PPO and reference models](#generating-and-analyzing-text-with-ppo-and-reference-models)
10. [Comparing PPO and reference models](#comparing-ppo-and-reference-models)


----


## Setup


### Installing required libraries




In [None]:
!pip install torch torchtext
!pip install  datasets==3.2.0
!pip install  trl==0.11
!pip install transformers==4.43.4
!pip install  nltk==3.9.1 rouge_score==0.1.2


### Importing required libraries

_It is recommended that you import all required libraries in one place (here):_


In [None]:
import torch
from tqdm import tqdm
import pandas as pd

tqdm.pandas()

from transformers import pipeline, AutoTokenizer,AutoModelForCausalLM
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
import os

import tarfile
import pickle
import json
import matplotlib.pyplot as plt


### Defining helper functions


In [None]:
def save_to_json(data, file_path):
    """
    Save a dictionary to a JSON file.

    Args:
        data (dict): The dictionary to save.
        file_path (str): The path to the JSON file.
    """
    with open(file_path, 'w') as json_file:
        json.dump(data, json_file, indent=4)
    print(f"Data successfully saved to {file_path}")


def load_from_json(file_path):
    """
    Load data from a JSON file.

    Args:
        file_path (str): The path to the JSON file.

    Returns:
        dict: The data loaded from the JSON file.
    """
    with open(file_path, 'r') as json_file:
        data = json.load(json_file)
    return data


This function adds padding to a 1D PyTorch tensor so that its length matches a given target length.
If padding is needed, it appends the specified `pad_token_id`; otherwise, it returns the original tensor.


In [None]:
def pad_sequence_to_length(tensor, length, pad_token_id):
    padding_length = length - tensor.size(0)
    if padding_length > 0:
        padding = torch.full((padding_length,), pad_token_id, dtype=torch.long, device=tensor.device)
        return torch.cat((tensor, padding))
    return tensor


t = torch.tensor([1, 2, 3])
padded = pad_sequence_to_length(t, 5, pad_token_id=0)
print(padded)






Pads each tensor in the list to the maximum length using `pad_token_id`, then ensures the total number of tensors equals `batch_size`.
 If there are fewer tensors, it adds padding-only tensors; if more, it truncates to match the batch size.


In [None]:
def pad_list_to_batch_size(tensors, batch_size, pad_token_id):
    max_length = max(t.size(0) for t in tensors)
    padded_tensors = [pad_sequence_to_length(t, max_length, pad_token_id) for t in tensors]

    # Add additional padding-only tensors if needed
    while len(padded_tensors) < batch_size:
        padded_tensors.append(torch.full((max_length,), pad_token_id, dtype=torch.long, device=tensors[0].device))

    return padded_tensors[:batch_size]


tensors = [torch.tensor([1, 2, 3]), torch.tensor([3])]
batch = pad_list_to_batch_size(tensors, batch_size=4, pad_token_id=0)
for b in batch:
    print(b)

In [None]:
def print_ppo_stats(stats, related_to_objective=False):
    print("PPO Training Statistics\n")

    if related_to_objective:
        print("Objective Statistics:")
        print(f"  KL Divergence (objective/kl): {stats['objective/kl']}")
        print(f"  KL Coefficient (objective/kl_coef): {stats['objective/kl_coef']}")
        print(f"  Entropy (objective/entropy): {stats['objective/entropy']}\n")

        print("PPO Losses (Related to Minimizing Objective Function):")
        print(f"  Policy Loss (ppo/loss/policy): {stats['ppo/loss/policy']}")
        print(f"  Value Loss (ppo/loss/value): {stats['ppo/loss/value']}")
        print(f"  Total Loss (ppo/loss/total): {stats['ppo/loss/total']}\n")

        print("PPO Policy Statistics:")
        print(f"  Policy Entropy (ppo/policy/entropy): {stats['ppo/policy/entropy']}")
        print(f"  Approx KL (ppo/policy/approxkl): {stats['ppo/policy/approxkl']}")
        print(f"  Clip Fraction (ppo/policy/clipfrac): {stats['ppo/policy/clipfrac']}\n")
    else:
        print("Reward and Value Function Estimation:")
        print(f"  Mean Non-Score Reward (ppo/mean_non_score_reward): {stats['ppo/mean_non_score_reward']}")
        print(f"  Mean Scores (ppo/mean_scores): {stats['ppo/mean_scores']}")
        print(f"  Std Scores (ppo/std_scores): {stats['ppo/std_scores']}")
        print(f"  Value Prediction (ppo/val/vpred): {stats['ppo/val/vpred']}")
        print(f"  Value Prediction Error (ppo/val/error): {stats['ppo/val/error']}")
        print(f"  Value Prediction Variance (ppo/val/var): {stats['ppo/val/var']}")
        print(f"  Value Prediction Mean (ppo/val/mean): {stats['ppo/val/mean']}")
        print(f"  Explained Variance (ppo/val/var_explained): {stats['ppo/val/var_explained']}\n")

    print("Token Lengths:")
    print(f"  Queries Length Mean (tokens/queries_len_mean): {stats['tokens/queries_len_mean']}")
    print(f"  Responses Length Mean (tokens/responses_len_mean): {stats['tokens/responses_len_mean']}\n")

    print("Time Statistics:")
    print(f"  Total Time (time/ppo/total): {stats['time/ppo/total']} seconds\n")

# Example usage follows the first `ppo_trainer.step` call later in the notebook, once `stats` exists

# 2. Initializing the PPO configuration, model, and tokenizer


The `PPOConfig` class is used to specify the model and learning rate for the PPO training. In this case, the model is `"lvwerra/gpt2-imdb"` and the learning rate is set to `1.41e-5`.


In [None]:
config = PPOConfig(
    model_name="lvwerra/gpt2-imdb",
    learning_rate=1.41e-5)

`config.model_name` refers to the specific model identifier used in the configuration for loading the pretrained model.

In [None]:
config.model_name

The `sent_kwargs` dictionary contains parameters for the sentiment analysis pipeline, specifying that all scores should be returned.

In [None]:
sent_kwargs = {
    "top_k": None,                   # Return scores for every label, not just the top one
    "function_to_apply": "none",     # Return raw logits (no softmax or sigmoid)
    "batch_size": 2                  # Process inputs in batches of size 2
}



## AutoModelForCausalLMWithValueHead Overview
The `AutoModelForCausalLMWithValueHead` class is used to load the pretrained GPT-2 model with a value head for PPO training.

This model **simultaneously** performs two tasks using a shared transformer backbone:

### 1. Next Token Prediction
- **Function**: Predicts the next token in a sequence, like a standard causal language model (e.g., GPT-2).
- **Process**:
  - Input tokens are processed through the transformer’s layers.
  - Outputs logits (unnormalized scores) over the vocabulary for the next token; a softmax turns them into probabilities.
- **Component**: Language model head.

### 2. Value Score Prediction
- **Function**: Produces scalar value scores representing the estimated value or expected reward for the sequence so far.
- **Process**:
  - Uses the transformer’s hidden states (encoded representations of all input tokens).
  - Passes the hidden states through an additional **value head** (a linear layer) to output one scalar per token position.
- **Component**: Value head.
- **Use Case**: Critical for reinforcement learning tasks like RLHF (Reinforcement Learning from Human Feedback) or PPO (Proximal Policy Optimization).

## How It Works
- **Shared Computation**: Both tasks leverage the same transformer layers, processing input tokens in a single forward pass.
- **Dual Outputs**:
  - Language model head generates next-token logits.
  - Value head produces a value score for the sequence.
- **Efficiency**: The model efficiently handles text generation and value estimation, making it ideal for RLHF fine-tuning.
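
The shared-backbone idea described above can be illustrated with a toy, pure-Python sketch. It is not TRL's implementation: the function name `two_head_forward`, the weights, and the tiny dimensions are assumptions chosen only to show that both heads read the same hidden states and that each produces one output per token position.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_head_forward(hidden_states, lm_head_rows, value_head_weights):
    """Apply a language-model head and a value head to the same hidden states.

    hidden_states: one vector per input token (the shared backbone output).
    lm_head_rows: one weight vector per vocabulary entry.
    value_head_weights: a single weight vector for the scalar value head.
    """
    logits = [[dot(h, row) for row in lm_head_rows] for h in hidden_states]
    values = [dot(h, value_head_weights) for h in hidden_states]
    return logits, values


# Toy numbers (hypothetical): 2 tokens, hidden size 2, vocabulary of 3 entries.
hidden = [[1.0, 0.0], [0.5, 0.5]]
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value_head = [2.0, -1.0]

logits, values = two_head_forward(hidden, lm_head, value_head)
print(logits)  # one row of vocabulary scores per token
print(values)  # one scalar value estimate per token
```

In the real model the backbone is GPT-2 and the heads are learned layers, but the data flow is the same: one pass over the tokens, two sets of outputs.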

---
The `AutoTokenizer` class loads the tokenizer that matches a pretrained model.  
In this case, the tokenizer's padding token is set to the end-of-sequence (EOS) token.



In [None]:
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

During PPO training, the model is updated. Additionally, a reference model is employed to stabilize training by incorporating the Kullback-Leibler (KL) divergence between the current policy and the reference policy. The KL divergence serves as a regularization term to prevent excessive deviation from the reference policy.

In [None]:
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
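
To make the role of the reference model concrete, here is a minimal sketch of a KL-penalized reward. It assumes the common RLHF shaping in which each generated token is penalized by `kl_coef * (log p_policy - log p_ref)` and the scalar score is added on the final token. The function name `kl_penalized_rewards` and the coefficient `0.2` are illustrative assumptions, not values taken from this notebook.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, score, kl_coef=0.2):
    """Combine a per-token KL penalty with a sequence-level score.

    policy_logprobs / ref_logprobs: log-probabilities that the policy and the
    reference model assign to each generated token.
    score: scalar reward for the whole response (for example, a sentiment score).
    """
    # Penalize tokens that the policy finds more likely than the reference does.
    rewards = [
        -kl_coef * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    # The sequence-level score is credited to the last generated token.
    rewards[-1] += score
    return rewards


# Hypothetical log-probabilities for a two-token response.
example = kl_penalized_rewards([-1.0, -2.0], [-1.5, -2.0], score=1.0)
print(example)
```

When the policy and the reference agree on a token, the penalty for that token is zero; the further the policy drifts, the more reward it gives up.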

# 3. Dataset and processing of Dataset
The **IMDb dataset** contains 50,000 labeled movie reviews, each positive or negative, split evenly into training and test sets. The `datasets` library's `load_dataset` function loads the 25,000-review "train" split used here.


In [None]:
dataset_name = "imdb"
ds = load_dataset(dataset_name, split="train")
N = 3
for i in range(N):
    print(f"Text: {ds[i]['text']}")
    print(f"Label: {ds[i]['label']}")

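The `LengthSampler` in the TRL (Transformer Reinforcement Learning) library is a utility that returns a random integer length each time it is called. Here it varies the number of tokens kept from each review; later it also varies the length of the generated responses.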






In [None]:

# Define minimum and maximum input text lengths
min_text_length = 2
max_text_length = 8

# Create a LengthSampler to randomly choose input sizes in the given range
input_size = LengthSampler(min_text_length, max_text_length)

# Example usage
print(input_size())  # Outputs a random integer from 2 to 7 (the upper bound is exclusive)
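
For intuition, the sampler's behavior can be reproduced in a few lines of plain Python. This sketch assumes the sampler draws uniformly from `range(min_value, max_value)`, which is how I understand `trl.core.LengthSampler` to work in the 0.11 release; it is worth checking against your installed version. `SimpleLengthSampler` is a hypothetical stand-in, not part of TRL.

```python
import random


class SimpleLengthSampler:
    """Hypothetical stand-in: samples an integer from [min_value, max_value)."""

    def __init__(self, min_value, max_value):
        # range() excludes max_value, so the upper bound is never returned.
        self.values = list(range(min_value, max_value))

    def __call__(self):
        return random.choice(self.values)


sampler = SimpleLengthSampler(2, 8)
samples = {sampler() for _ in range(1000)}
print(sorted(samples))  # the distinct lengths observed over 1000 draws
```

Under this assumption, `max_text_length = 8` means queries are at most 7 tokens long.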


The function `prepare_text_dataset` loads a text dataset, tokenizes each sample with a randomly chosen input length, and adds a decoded `"query"` field. It formats the result for PyTorch, making it ready for model training or fine-tuning.


In [None]:

def prepare_text_dataset(cfg, dataset_name="imdb", min_text_len=2, max_text_len=8, tokenizer=None):
    # Use the tokenizer passed in, or load the one matching the model in the config
    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained(cfg.model_name)

    # Set padding token to the end-of-sequence token (needed for consistent padding)
    tokenizer.pad_token = tokenizer.eos_token

    # Load the specified dataset split (default: IMDb training data)
    dataset = load_dataset(dataset_name, split="train")

    # Create a LengthSampler to sample random input lengths between min and max
    length_sampler = LengthSampler(min_text_len, max_text_len)

    # Define a function to tokenize each sample
    def tokenize_function(example):
        # Tokenize and truncate the input text to a random length
        example["input_ids"] = tokenizer.encode(example["text"])[: length_sampler()]

        # Decode the tokens back to text to store as "query"
        example["query"] = tokenizer.decode(example["input_ids"])
        return example

    # Apply the tokenize_function to each example in the dataset
    dataset = dataset.map(tokenize_function, batched=False)

    # Set dataset format to PyTorch tensors
    dataset.set_format(type="torch")

    return dataset


Create the dataset object


In [None]:
dataset = prepare_text_dataset(config)

Print the first 5 data points from the dataset to get an idea of the input data used for PPO training.

In [None]:
for i, sample in enumerate(dataset):
  if i >= 5:
    break
  print(f"Sample {i+1}:")
  print(f"Review: {sample['text']}")
  print(f"Input IDs: {sample['input_ids']}")
  print(f"Query: {sample['query']}")
  print("-" * 50)

# 4. Collator function
🔄 The collator function organizes data into batches for the PPOTrainer by grouping corresponding features from each sample together. ✅ It ensures the input is in the correct format for training.

In the example below, two samples are passed to the collator; check its output for the keys `'input_ids'`, `'query'`, and `'text'`.


In [None]:
def collator(samples):
    return {field: [sample[field] for sample in samples] for field in samples[0]}


examples = [
    {'input_ids': [10, 20, 30], 'query': "hello world", 'text': "This product works great!"},
    {'input_ids': [40, 50, 60], 'query': "test phrase", 'text': "I had a fantastic experience."}
]

batched_output = collator(examples)
batched_output



# 5. Initialize PPOTrainer

**Proximal Policy Optimization (PPO)** is a reinforcement learning algorithm well-suited for fine-tuning generative models like chatbots. It addresses common training challenges such as producing stable, coherent, and contextually appropriate responses.

#### 🧠 Why PPO for Chatbots?
- PPO enhances traditional policy gradient methods by using a **clipped objective function**, which leads to **stable and gradual policy updates**.
- This reduces high variance and instability, helping to avoid erratic or inconsistent chatbot behavior.
- It maintains a balance between **exploring new responses** and **exploiting known good ones**, thanks to its trust region strategy.
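
The clipped objective mentioned above can be sketched for a single sample in plain Python. This is a minimal illustration of the standard PPO clipped surrogate, not the trainer's code; the function name `clipped_surrogate` and the clip range `0.2` are assumptions for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (a quantity to maximize).

    logp_new / logp_old: log-probability of the action under the updated and
    the old policy. advantage: how much better the action was than expected.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes the action twice as likely as the old one did.
print(clipped_surrogate(math.log(2.0), 0.0, advantage=1.0))   # gain is capped by the clip
print(clipped_surrogate(math.log(2.0), 0.0, advantage=-1.0))  # penalty is not capped
```

With a positive advantage the objective stops growing once the probability ratio exceeds `1 + clip_eps`, which is what keeps individual updates small.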

#### 🚀 Role of the PPO Trainer
- **Collects** dialogue samples during interaction.
- **Optimizes** the policy (chatbot behavior) based on those samples.
- **Manages** the underlying neural networks during training.

This results in a more **robust and reliable** chatbot that responds in a helpful, safe, and aligned manner.

---

### ✅ Let’s initialize the `PPOTrainer` with the specified configuration and components:

- `config`: Configuration settings for PPO training, such as learning rate and model name
- `model`: The primary model to be fine-tuned using PPO
- `ref_model`: The reference model, which is not updated and is used to compute the KL penalty
- `tokenizer`: Tokenizer corresponding to the model, used for processing input text
- `dataset`: Dataset to be used for training, providing the input data for the model
- `data_collator`: Data collator to handle batching and formatting of the input data


In [None]:
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)


# 6. The Reward Function

In **Reinforcement Learning with PPO (Proximal Policy Optimization)**, a **reward function** plays a crucial role by providing feedback on the quality of actions taken by the model. For a **generative chatbot**, this means evaluating how good or appropriate its responses are.

---

### 💡 Using Sentiment Analysis as a Reward Signal

One simple yet effective approach is to use a **sentiment analysis pipeline** as the reward function:

- The chatbot's generated response is analyzed using a sentiment classifier.
- A **reward score** is assigned based on the **positivity or negativity** of the sentiment.
- The PPO Trainer uses this reward to fine-tune the model, encouraging it to generate **more positive and engaging** responses over time.

---




### 🛠️ Initialize Sentiment Analysis Pipeline

First, we set up a **sentiment analysis pipeline** using a **pretrained model** fine-tuned on IMDB reviews. This model is capable of analyzing input text and predicting sentiment with confidence scores for **positive** and **negative** classes.



In [None]:
from transformers import pipeline

# Load sentiment analysis model
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="lvwerra/distilbert-imdb",
)


Because `sent_kwargs` sets `function_to_apply` to `"none"`, each `score` is a raw logit rather than a probability. A higher score for the **"POSITIVE"** class yields a higher reward, while a lower (possibly negative) score yields a lower reward, encouraging the model to generate text that the classifier rates as more positive.


In [None]:
text = "this movie was really bad!!"
sentiment_pipe(text, **sent_kwargs)
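
Since the pipeline is configured to return raw logits, it can help to see how they map to probabilities. The sketch below applies a softmax to one pipeline result; the helper name `logits_to_probs` and the numbers in `sample_output` are made up for illustration and are not real model output.

```python
import math


def logits_to_probs(pipe_output):
    """Convert a list of {'label', 'score'} logits into label -> probability."""
    # Subtract the maximum logit first for numerical stability.
    max_logit = max(item["score"] for item in pipe_output)
    exps = {item["label"]: math.exp(item["score"] - max_logit) for item in pipe_output}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Illustrative logits for a negative review (hypothetical values).
sample_output = [
    {"label": "NEGATIVE", "score": 2.3},
    {"label": "POSITIVE", "score": -2.7},
]
probs = logits_to_probs(sample_output)
print(probs)
```

The notebook keeps the logits as rewards because they are unbounded, so very positive texts are rewarded more strongly than a probability near 1.0 would allow.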

# 7. Generating responses using PPO

## Tokenizing and preparing the input batch
This section illustrates the process of generating responses using the PPO (Proximal Policy Optimization) Trainer. It includes tokenizing the input, preparing the training batch, generating model responses, and decoding the output tokens into human-readable text.

The code below:
- retrieves a batch of data from the PPO Trainer's dataloader,
- prints the keys of the batch, and
- selects the first two entries for processing.


In [None]:
batch = next(iter(ppo_trainer.dataloader))
# The batch contains label, input_ids, and query
print(batch.keys())
# Take the first two samples in the batch
batch = {key: batch[key][0:2] for key in batch}
batch

Initialize a list named `response_tensors` to store the model-generated responses for scoring. The code below extracts the input_ids from the batch and assigns them to `query_tensors`. These tensors represent the tokenized input sequences, referred to as "query tensors" because they serve as the initial queries that the model will process to generate responses in the following steps.

In [None]:
response_tensors = []
query_tensors =  batch["input_ids"]
query_tensors

The code below defines a lambda function `get_text` that takes a list of response tensors (`response`) and decodes each tensor into readable text using the tokenizer. The `squeeze()` method is applied to remove any singleton dimensions from the tensor before decoding. This allows you to view the original input queries in their human-readable form.


In [None]:
get_text = lambda response:''.join([tokenizer.decode(r.squeeze()) for r in response])
get_text(query_tensors)

The dictionary `generation_kwargs` defines the parameters used for generating sequences from the Language Model (LLM). These parameters control the behavior and diversity of the generated output:

- `"min_length": -1` – No minimum length is enforced for the generated text.
- `"top_k": 0.0` – Disables top-k filtering, allowing sampling from all tokens.
- `"top_p": 1.0` – Disables nucleus (top-p) sampling, using the full probability distribution.
- `"do_sample": True` – Enables sampling to allow diverse and varied outputs.
- `"pad_token_id": 50256` – Uses GPT-2's end-of-sequence token ID (50256) as the padding token, matching `tokenizer.pad_token = tokenizer.eos_token` above.


In [None]:
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": 50256,
}
generation_kwargs
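
To see why `top_k: 0.0` and `top_p: 1.0` amount to sampling from the full distribution, here is a simplified pure-Python sketch of top-k and nucleus filtering. It is an illustration of the idea, not the `transformers` implementation; `filter_probs` and the toy probabilities are assumptions.

```python
def filter_probs(probs, top_k=0, top_p=1.0):
    """Apply top-k, then top-p filtering, and renormalize what is kept."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[: int(top_k)]  # keep only the k most likely tokens
    mass = sum(probs[i] for i in ranked)

    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:  # smallest set reaching the top-p mass
            break

    total = sum(probs[i] for i in kept)
    kept_set = set(kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]


toy = [0.5, 0.3, 0.15, 0.05]
print(filter_probs(toy))               # nothing is filtered with the notebook's settings
print(filter_probs(toy, top_k=2))      # only the two most likely tokens survive
print(filter_probs(toy, top_p=0.7))    # smallest set covering at least 70% of the mass
```

Leaving both filters disabled keeps exploration as wide as possible, which suits PPO since the KL penalty already restrains the policy.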

The `output_length_sampler` is initialized using `LengthSampler(output_min_length, output_max_length)`. Each call samples an output length for the next generated sequence from the specified range. Varying the length exposes the policy to both short and longer continuations during training instead of a single fixed response length.


In [None]:
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)
gen_len = output_length_sampler()
generation_kwargs["max_new_tokens"] = gen_len

Now, let's process a single sample using PPO. Begin by extracting the first query tensor from the input batch. Then, generate a response for this query using the PPO trainer along with the specified generation parameters (`generation_kwargs`). The generated response tensor is stored in the variable `response`.


In [None]:
query=query_tensors[0]
response = ppo_trainer.generate(query, **generation_kwargs)
print("query:",get_text(query))
print("response:", get_text(response))

Finally, append the generated tokens to the `response_tensors` list. The `squeeze()` method removes any single-dimensional entries from the shape of the tensor, and the slicing `[-gen_len:]` ensures that only the newly generated tokens are included, excluding any tokens that were part of the original input.


In [None]:
response_tensors.append(response.squeeze()[-gen_len:])
print("newly generated tokens from response:", get_text(response_tensors[-1:]))

Repeat the process for the second sample. This section generates a response for a given query, decodes the relevant part, and appends it to the `response_tensors` list.


In [None]:
query = query_tensors[1]
gen_len = output_length_sampler()
generation_kwargs["max_new_tokens"] = gen_len
response = ppo_trainer.generate(query, **generation_kwargs)
print("query:", get_text(query))
print("response output:", get_text(response))
# Keep only the newly generated tokens and show the entry just added
response_tensors.append(response.squeeze()[-gen_len:])
print("newly generated tokens from response:", get_text(response_tensors[-1:]))

Convert each tensor in `response_tensors` into human-readable text and store it in the `batch` dictionary under the key `response`.


In [None]:
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
batch["response"]

The batch now contains both `response` and `query` keys.


In [None]:
batch

## Scoring function

Next, prepare the text data for sentiment analysis, which serves as the reward function in this PPO setup. The sentiment classifier evaluates the tone of each generated text, and its score becomes the reward signal.

To build the classifier input, concatenate each query with its generated response.



In [None]:
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
texts

The sentiment scores (`pipe_outputs`) can be used as feedback to update the policy.


In [None]:
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
pipe_outputs

These scores are used to evaluate the quality or relevance of the generated responses, reflecting the model's confidence in their likelihood of being positive. The scores are extracted from the `pipe_outputs` list, where each element contains a list of scores corresponding to the model's output.

The code iterates over the `pipe_outputs` list, extracts the score from each output, converts it into a tensor, and stores it in the `rewards` list. These scores serve as a reward signal, representing the model's assessment of how likely the generated responses are to be positive.



In [None]:
positive_scores = [
    item["score"]
    for output in pipe_outputs
    for item in output
    if item["label"] == "POSITIVE"
]
rewards = [torch.tensor(score) for score in positive_scores]
rewards
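
The list comprehension above silently drops a text if the requested label is missing from its output, which would leave fewer rewards than responses. A slightly more defensive variant is sketched below on mocked pipeline output; `extract_label_scores` and the scores in `mock_outputs` are hypothetical and only illustrate the shape of the data.

```python
def extract_label_scores(pipe_outputs, label="POSITIVE"):
    """Return one score per text for the given label, or raise if one is missing."""
    scores = []
    for idx, output in enumerate(pipe_outputs):
        matches = [item["score"] for item in output if item["label"] == label]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one {label!r} score for text {idx}, found {len(matches)}"
            )
        scores.append(matches[0])
    return scores


# Mocked output for two texts, shaped like the pipeline result with top_k=None.
mock_outputs = [
    [{"label": "NEGATIVE", "score": 2.1}, {"label": "POSITIVE", "score": -2.4}],
    [{"label": "POSITIVE", "score": 1.8}, {"label": "NEGATIVE", "score": -1.5}],
]
print(extract_label_scores(mock_outputs))
```

Each returned score can then be wrapped in `torch.tensor(...)` exactly as in the cell above.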

## Proximal policy optimization
A single PPO step performs one update of the policy. It takes the following inputs:

- A list of query tensors
- A list of response tensors
- A list of score (reward) tensors


In [None]:
print("query:", get_text(query_tensors))
print("\n")
print("response:", get_text(response_tensors))

`ppo_trainer.step` expects exactly `config.batch_size` queries, responses, and rewards. Since only two samples have been processed so far, pad all three lists to the batch size just to run one step of the PPO trainer.

In [None]:
batch_size = config.batch_size
pad_token_id = tokenizer.pad_token_id

query_tensors = pad_list_to_batch_size(query_tensors, batch_size, pad_token_id)

response_tensors = pad_list_to_batch_size(response_tensors, batch_size, pad_token_id)
# Use float zeros so the padded rewards match the dtype of the real scores
rewards = rewards + [torch.tensor(0.0) for _ in range(batch_size - len(rewards))]



Invoke the PPO step method to update the model using the PPO algorithm with `query_tensors`, `response_tensors`, and `rewards`. It calculates policy and value function losses, computes gradients, and updates policy network parameters to enhance the policy. The method constrains policy updates to prevent large shifts, a key feature of PPO.
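
Before calling the step method, a quick sanity check on the three input lists can save a confusing error. The helper below is a hypothetical convenience, not part of TRL; it only verifies that each list has the expected number of items.

```python
def check_step_inputs(queries, responses, rewards, batch_size):
    """Raise ValueError unless all three lists contain exactly batch_size items."""
    sizes = {
        "queries": len(queries),
        "responses": len(responses),
        "rewards": len(rewards),
    }
    wrong = {name: n for name, n in sizes.items() if n != batch_size}
    if wrong:
        raise ValueError(f"expected {batch_size} items per list, got {wrong}")


# Plain lists stand in for the tensors used in the notebook.
check_step_inputs([1, 2, 3], [4, 5, 6], [0.1, 0.2, 0.3], batch_size=3)
print("inputs are consistent")
```

In the notebook this would be called as `check_step_inputs(query_tensors, response_tensors, rewards, batch_size)` right before the step.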

In [None]:
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)


The `stats` dictionary contains various statistics from the PPO training step, which can be printed using the `print_ppo_stats` function. The function groups its keys into two categories:

- **Optimizing the PPO objective** (`related_to_objective=True`): Statistics related to updating the model parameters, such as the KL divergence, policy loss, and value loss.
- **Reward and value estimation** (default): Metrics such as the mean and standard deviation of the scores and the value head's predictions.

In [None]:
print_ppo_stats(stats, related_to_objective = True)

In [None]:
print_ppo_stats(stats)

In [None]:
all_stats = []

Set `sentiment` to `"NEGATIVE"` to reward negative responses or to `"POSITIVE"` to reward positive ones.


In [None]:
sentiment = "POSITIVE"

# Training Loop for PPO with Sentiment Analysis

This code implements a training loop for the PPO (Proximal Policy Optimization) algorithm integrated with sentiment analysis. The loop processes batches from the `ppo_trainer` dataloader, executing these steps:

1. **Extract Query Tensors**:
   - Input IDs (query tensors) are retrieved from the batch.

2. **Generate Responses**:
   - Responses are generated for each query tensor using `ppo_trainer.generate` with specified `generation_kwargs`.
   - Generated responses are decoded and stored in the batch under the `response` key.

3. **Compute Sentiment Scores**:
   - Queries and responses are concatenated to form text data.
   - Sentiment analysis is applied to compute sentiment scores.
   - Scores are converted to tensors and stored in the `rewards` list.

4. **Run PPO Step**:
   - The `ppo_trainer.step` method updates the model using `query_tensors`, `response_tensors`, and `rewards`.
   - It computes policy and value function losses, calculates gradients, and updates policy network parameters.
   - Policy updates are constrained to prevent large shifts, a core PPO feature.

5. **Logging Statistics**:
   - Training step statistics are logged and appended to the `all_stats` list.

In [None]:
# Each iteration processes one batch (one PPO step); `epoch` here is the batch index
for epoch, batch in enumerate(tqdm(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]
    print(f"epoch {epoch}")

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    positive_scores = [
           item["score"]
           for output in pipe_outputs
           for item in output
           if item["label"] == sentiment
           ]
    rewards = [torch.tensor(score) for score in positive_scores]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

    all_stats.append(stats)

In [None]:
# Save the model

model_dir = "ppo-good"
os.makedirs(model_dir, exist_ok=True)

# Save model configuration and weights
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)

# 8. Plotting PPO training loss and mean

Here's a precise description of the plotting process:

1.  **Data Extraction**:

    * Collect the total loss from each entry of the `all_stats` list under the key `'ppo/loss/total'` and assign the resulting list to the variable `loss_values`.
    * Collect the mean reward from each entry of the `all_stats` list under the key `'ppo/mean_scores'` and assign the resulting list to the variable `reward_values`.

2.  **Loss Visualization**:

    * Generate a line plot where the x-axis represents the training step (the index into `all_stats`, labeled "Epoch" in the code) and the y-axis represents the `loss_values`.

3.  **Reward Visualization**:

    * Generate a separate line plot where the x-axis represents the training step and the y-axis represents the `reward_values`.

4.  **Plot Display**:

    * Use `plt.tight_layout()` to adjust the spacing between the generated plots for better readability.
    * Use `plt.show()` to display the resulting loss and reward plots.

In [None]:
loss_values = [stat['ppo/loss/total'] for stat in all_stats]
reward_values = [stat['ppo/mean_scores'] for stat in all_stats]

# Plotting the loss
plt.figure(figsize=(12, 6))
plt.subplot(2, 1, 1)
plt.plot(loss_values, label='Total Loss', color='b')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('PPO Training Loss over Time')
plt.legend()
plt.grid(True)

# Plotting the rewards
plt.subplot(2, 1, 2)
plt.plot(reward_values, label='Mean Reward', color='g')
plt.xlabel('Epoch')
plt.ylabel('Reward')
plt.title('PPO Mean Reward over Time')
plt.legend()
plt.grid(True)

# Show the plots
plt.tight_layout()
plt.show()
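
Per-step loss and reward curves from PPO are usually noisy, so a moving average can make the trend easier to read. The helper below is a small stdlib-only sketch; `moving_average` and the window size are assumptions, and the input list is toy data rather than real training output.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            averaged.append(running / window)
    return averaged


toy_rewards = [1.0, 2.0, 3.0, 4.0]
print(moving_average(toy_rewards, window=2))
```

The smoothed series is `window - 1` points shorter than the input; it could be plotted with `plt.plot(moving_average(reward_values, 10))` alongside the raw curve.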

# 9. Generating and analyzing text with PPO and reference models

**Text generation function**:
- `generate_some_text(input_text, my_model)`: Tokenizes input text, generates a response, and decodes it.


In [None]:
# Check if CUDA is available, otherwise use the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gen_kwargs = {"min_length": -1, "max_new_tokens": 20, "top_k": 0.0, "top_p": 1.0, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}


def generate_some_text(input_text, my_model):
    # Tokenize the input text
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(device)
    generated_ids = my_model.generate(input_ids, **gen_kwargs)

    # Decode the generated text
    generated_text_ = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

    return generated_text_

**Generate text with PPO model**:
- Generate text using the PPO-trained model.


In [None]:
input_text = "Once upon a time in a land far"
generated_text=generate_some_text(input_text,model)
generated_text

**Sentiment Analysis**:
- Analyze the sentiment of the generated text using `sentiment_pipe`.


In [None]:
pipe_outputs = sentiment_pipe(generated_text, **sent_kwargs)
pipe_outputs

**Generate text with reference model**:
- Generate text using the reference model.


In [None]:
generated_text = generate_some_text(input_text,ref_model)
generated_text

# 10. Comparing PPO and Reference Models

This process outlines the comparison of text generation between a PPO-trained model and a reference model:

1.  **Generation Parameters**:
    * Define `gen_kwargs`: A dictionary containing parameters for text generation (e.g., `min_length`, `top_k`, `top_p`, `do_sample`).

2.  **Prepare Batch**:
    * Sample a batch of size `bs` from the dataset.
    * Extract the query tensors from the sampled batch.

3.  **Generate Responses**:
    * For each query tensor in the batch:
        * Generate a response using the reference model with the defined `gen_kwargs`.
        * Generate a response using the PPO model with the defined `gen_kwargs`.

4.  **Decode Responses**:
    * Convert the generated response tensors from both models into human-readable text strings.

5.  **Compute Sentiment Scores**:
    * For each query-response pair (before and after PPO training):
        * Concatenate the original query and the generated response to form a complete text.
        * Use `sentiment_pipe` to compute a sentiment score for the concatenated text.

6.  **Store Results**:
    * Store the original queries, the generated responses from both models, and their corresponding sentiment scores in a dictionary named `game_data`.
    * Convert the `game_data` dictionary into a Pandas DataFrame for easier analysis and return this DataFrame.

In [None]:
def compare_models_on_dataset(model, ref_model, dataset, tokenizer, sentiment_pipe, sent_kwargs, device, output_length_sampler):
    gen_kwargs = {
        "min_length": -1,
        "top_k": 0.0,
        "top_p": 1.0,
        "do_sample": True,
        "pad_token_id": tokenizer.eos_token_id
    }

    bs = 16
    game_data = dict()
    dataset.set_format("pandas")
    df_batch = dataset[:].sample(bs)
    game_data["query"] = df_batch["query"].tolist()
    query_tensors = df_batch["input_ids"].tolist()

    response_tensors_ref, response_tensors = [], []

    # Get maximum position embeddings for both models
    max_position_embeddings_ref = ref_model.config.max_position_embeddings
    max_position_embeddings_model = model.config.max_position_embeddings

    for i in range(bs):
        gen_len = output_length_sampler()

        # Convert query tensors to input IDs
        input_ids = torch.tensor(query_tensors[i]).unsqueeze(dim=0).to(device)

        # ********** Process for ref_model **********
        total_length_ref = input_ids.shape[-1] + gen_len
        if total_length_ref > max_position_embeddings_ref:
            # Truncate input_ids to fit within the max length
            max_input_length_ref = max_position_embeddings_ref - gen_len
            input_ids_ref = input_ids[:, -max_input_length_ref:]
            total_length_ref = input_ids_ref.shape[-1] + gen_len
        else:
            input_ids_ref = input_ids

        # Generate from the (possibly truncated) prompt and keep only the response tokens
        output = ref_model.generate(
            input_ids_ref,
            max_new_tokens=gen_len,
            **gen_kwargs
        ).squeeze()[-gen_len:]
        response_tensors_ref.append(output)

        # ********** Process for model **********
        total_length_model = input_ids.shape[-1] + gen_len
        if total_length_model > max_position_embeddings_model:
            max_input_length_model = max_position_embeddings_model - gen_len
            input_ids_model = input_ids[:, -max_input_length_model:]
            total_length_model = input_ids_model.shape[-1] + gen_len
        else:
            input_ids_model = input_ids

        # Generate from the (possibly truncated) prompt and keep only the response tokens
        output = model.generate(
            input_ids_model,
            max_new_tokens=gen_len,
            **gen_kwargs
        ).squeeze()[-gen_len:]
        response_tensors.append(output)

    game_data["response (before)"] = [tokenizer.decode(response_tensors_ref[i]) for i in range(bs)]
    game_data["response (after)"] = [tokenizer.decode(response_tensors[i]) for i in range(bs)]

    texts_before = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
    game_data["rewards (before)"] = [output[1]["score"] for output in sentiment_pipe(texts_before, **sent_kwargs)]

    texts_after = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
    game_data["rewards (after)"] = [output[1]["score"] for output in sentiment_pipe(texts_after, **sent_kwargs)]

    df_results = pd.DataFrame(game_data)
    return df_results
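
The truncation step inside `compare_models_on_dataset` keeps the most recent prompt tokens so that the prompt plus the response fits within the model's maximum number of positions. A minimal, framework-free sketch of that idea is shown below; the helper name `left_truncate` and the plain-list inputs are illustrative assumptions, not part of the notebook's code.

```python
def left_truncate(token_ids, gen_len, max_positions):
    """Keep the last tokens of a prompt so that prompt + gen_len fits in max_positions.

    Hypothetical helper for illustration; operates on plain Python lists.
    """
    if gen_len >= max_positions:
        raise ValueError("gen_len must be smaller than max_positions")
    max_input_len = max_positions - gen_len
    if len(token_ids) <= max_input_len:
        return list(token_ids)  # already fits, nothing to drop
    return list(token_ids[-max_input_len:])  # drop the oldest tokens


prompt = list(range(10))  # stand-in for 10 token IDs
print(left_truncate(prompt, gen_len=4, max_positions=8))  # keeps the last 4 IDs
```

Dropping tokens from the left preserves the text closest to where generation starts, which is usually the most relevant context for the continuation.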

In [None]:
df_results = compare_models_on_dataset(model, ref_model, dataset, tokenizer, sentiment_pipe, sent_kwargs, device, output_length_sampler)
df_results
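
To compare the two models at a glance, the per-sample rewards can be aggregated. The sketch below uses only the standard library and a hypothetical helper, `summarize_rewards`; the numbers in the usage example are placeholders for illustration, not results from this notebook.

```python
from statistics import mean, median


def summarize_rewards(before, after):
    """Return mean and median reward for each model plus the mean improvement."""
    if len(before) != len(after) or not before:
        raise ValueError("expected two non-empty lists of equal length")
    return {
        "mean (before)": mean(before),
        "mean (after)": mean(after),
        "median (before)": median(before),
        "median (after)": median(after),
        # Average per-sample change from the reference model to the PPO model
        "mean improvement": mean([a - b for a, b in zip(after, before)]),
    }


# Placeholder values for illustration only
summary = summarize_rewards([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
print(summary)
```

With the DataFrame returned above, the inputs would be `df_results["rewards (before)"].tolist()` and `df_results["rewards (after)"].tolist()`.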



You can also run PPO training with the target sentiment set to NEGATIVE, which shows how the model behaves when negative sentiment scores are rewarded instead of positive ones. As before, the training loop generates responses, computes sentiment scores, updates the model, and logs the statistics for each epoch.
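
One way to switch the target sentiment is to select the reward by label rather than by position in the classifier output. The sketch below assumes that, for each text, the classifier returns a list of `{"label": ..., "score": ...}` dictionaries, which is the shape a text-classification pipeline produces when all class scores are requested. The helper `score_for_label` and the sample scores are hypothetical.

```python
def score_for_label(pipe_output, target_label="NEGATIVE"):
    """Return the score whose label matches target_label for one classified text."""
    for item in pipe_output:
        if item["label"] == target_label:
            return item["score"]
    raise KeyError(f"label {target_label!r} not found in classifier output")


# Made-up classifier output for a single text, for illustration only
example_output = [
    {"label": "NEGATIVE", "score": 2.5},
    {"label": "POSITIVE", "score": -2.0},
]
print(score_for_label(example_output, "NEGATIVE"))
print(score_for_label(example_output, "POSITIVE"))
```

In the training loop, the rewards could then be built from `score_for_label(output, "NEGATIVE")` for each pipeline output, instead of indexing a fixed position such as `output[1]["score"]`.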
