<a href="https://colab.research.google.com/github/coralie-sorbet/Enhancing-LLM-with-human-feedback/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **End-to-End Reinforcement Learning with Human Feedback: Reward Modeling and PPO Training**

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



## **Load Libraries and configuration of the models**


In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, pipeline
from trl import (
    AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer,
    RewardConfig, RewardTrainer, setup_chat_format, ModelConfig
)
from torch.amp import autocast, GradScaler
from datasets import load_dataset

In [11]:
# Set model name and device (use GPU if available)
model_name = "gpt2"  # Base model name
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize model configuration
model_config = ModelConfig(model_name_or_path=model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_config.model_name_or_path,
    trust_remote_code=model_config.trust_remote_code,
    use_fast=True
)
tokenizer.pad_token = tokenizer.eos_token  # Ensure EOS token is used for padding

# Load models for different purposes
model = AutoModelForSequenceClassification.from_pretrained(
    model_config.model_name_or_path, num_labels=1, trust_remote_code=model_config.trust_remote_code
)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)
value_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device)
policy_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
ref_policy_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Data Preparation**

In [7]:
# Load and preprocess the ultrafeedback_binarized dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized")

# Data preparation: Truncate long sequences
def tokenize_function(examples):
    return tokenizer(
        [str(x) for x in examples["chosen"]],
        truncation=True,
        padding="max_length",
        max_length=min(tokenizer.model_max_length, 200), #Reduced length to 200 for CUDA memory purposes for the PPO training
        return_tensors="pt" # Added return_tensors to return PyTorch tensors
    )

# Tokenize dataset
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
)

README.md:   0%|          | 0.00/643 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/131M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/2.14M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/62135 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/62135 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [6]:
print(dataset) # Shows the features of the original dataset

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 62135
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 1000
    })
})


In [7]:
print(tokenized_dataset)  # Shows the features of the tokenized dataset

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected', 'input_ids', 'attention_mask'],
        num_rows: 62135
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})


## **Train the Reward Model**

In [None]:
# Set up chat format for the tokenizer and model
if tokenizer.chat_template is None:
        model, tokenizer = setup_chat_format(model, tokenizer)

# Configure training arguments for the reward model
reward_config = RewardConfig(
    output_dir="drive/MyDrive/M2 D3S/Math of DL/Project",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    logging_steps=25,
    eval_strategy="steps",
    eval_steps=50,
    remove_unused_columns=False,  # Required for `RewardDataCollatorWithPadding`
    dataset_num_proc=4,
    report_to="none"
)

# Initialize the RewardTrainer
trainer = RewardTrainer(
        args=reward_config,
        model=model,
        tokenizer=tokenizer,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset['test'],
    )

# Train the reward model
print("Training the reward model...")
trainer.train()

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Map:   0%|          | 0/62135 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/62135 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1063 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1942 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1096 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2164 > 1024). Running this sequence through the model will result in indexing errors


Filter (num_proc=4):   0%|          | 0/62135 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1187 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2495 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1100 > 1024). Running this sequence through the model will result in indexing errors


Filter (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Training the reward model...


Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss,Accuracy
50,0.8189,0.693147,0.687273
100,0.7078,0.693147,0.72
150,0.7011,0.693147,0.696364
200,0.6897,0.693147,0.723636
250,0.7064,0.693147,0.723636
300,0.7002,0.693147,0.710909
350,0.7027,0.693147,0.716364
400,0.6933,0.693147,0.752727
450,0.7026,0.693147,0.765455
500,0.6987,0.693147,0.750909










































































































































































TrainOutput(global_step=4214, training_loss=0.696694164067917, metrics={'train_runtime': 8777.7748, 'train_samples_per_second': 3.841, 'train_steps_per_second': 0.48, 'total_flos': 0.0, 'train_loss': 0.696694164067917, 'epoch': 1.0})

In [None]:
metrics = trainer.evaluate()
print("\n***** Evaluation Metrics *****")
for key, value in metrics.items():
    print(f"{key}: {value}")
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)


***** Evaluation Metrics *****
eval_loss: 0.6931471824645996
eval_accuracy: 0.8618181818181818
eval_runtime: 33.4792
eval_samples_per_second: 16.428
eval_steps_per_second: 2.061
epoch: 1.0
***** eval metrics *****
  epoch                   =        1.0
  eval_accuracy           =     0.8618
  eval_loss               =     0.6931
  eval_runtime            = 0:00:33.47
  eval_samples_per_second =     16.428
  eval_steps_per_second   =      2.061


In [None]:
# Save the model
print("Reward model trained and saved.")
reward_model.save_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/reward_model")
tokenizer.save_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/reward_model")

Reward model trained and saved.


('drive/MyDrive/M2 D3S/Math of DL/Project/reward_model/tokenizer_config.json',
 'drive/MyDrive/M2 D3S/Math of DL/Project/reward_model/special_tokens_map.json',
 'drive/MyDrive/M2 D3S/Math of DL/Project/reward_model/vocab.json',
 'drive/MyDrive/M2 D3S/Math of DL/Project/reward_model/merges.txt',
 'drive/MyDrive/M2 D3S/Math of DL/Project/reward_model/added_tokens.json',
 'drive/MyDrive/M2 D3S/Math of DL/Project/reward_model/tokenizer.json')

**Reward Model Performance**

The reward model achieved the following evaluation results:

- **`eval_loss`**: **0.6931**  
  This represents the cross-entropy loss during evaluation. A lower loss indicates better performance, but the ideal value depends on the specific task and dataset.

- **`eval_accuracy`**: **0.8618**  
  The model correctly classified **86.18%** of the evaluation samples. While this is a good starting point, there's still room for improvement.

- **`eval_runtime`**, **`eval_samples_per_second`**, **`eval_steps_per_second`**  
  These metrics relate to the **efficiency** of the evaluation process and are not directly tied to the model's performance.

---

**Potential Improvements with Proximal Policy Optimization (PPO)**


1. **Increase the Number of Training Epochs**  
   - The current configuration uses only **1 training epoch** for both the reward model and PPO.  
   - Increasing the number of epochs can allow the model to learn more **complex patterns** in the data.  
   - Use **early stopping** to avoid overfitting and excessive training times.

2. **Fine-Tune the PPO Configuration**  
   - Adjust the parameters of the PPO configuration, such as:
     - **`learning_rate`**
     - **`batch_size`**
     - **`mini_batch_size`**
   - Experiment with different values to identify the optimal settings for your specific task and dataset.  
   - Use a **learning rate scheduler** to dynamically adjust the learning rate during training.

3. **Increase the Training Data**  
   - Adding more training data can significantly improve model performance.  
   - Use techniques like **data augmentation** to expand the size and diversity of your training dataset.  

4. **Experiment with Different Reward Functions**  
   - The choice of reward function has a major impact on PPO's performance.  
   - Consider experimenting with reward functions that incorporate:
     - **Fluency**
     - **Coherence**
     - **Factual accuracy**  

---

## Reasoning Behind Improvements

- **Increasing the Number of Epochs**: Improves accuracy but comes at the cost of longer training times.
- **Fine-Tuning PPO Parameters**: Requires trial and error with careful monitoring of model performance.
- **Augmenting Training Data**: Effective only if the data is relevant, clean, and free of bias.
- **Modifying Reward Functions**: Significantly affects training outcomes and must align with the **downstream application** of your model (e.g., chatbot).


## **Train the Policy Model with PPO**

In [98]:
ppo_config = PPOConfig(
    num_train_epochs=1, # Setting to 1 epoch to reduce time run even if not the best precision
    gradient_accumulation_steps=2,  # Reduced gradient accumulation steps for memory purposes
    batch_size=1,
    mini_batch_size=1,
    learning_rate=1.41e-5,
    output_dir="drive/MyDrive/M2 D3S/Math of DL/Project/PPO_results",
    logging_steps=25,
    eval_strategy="steps",
    eval_steps=50,
    report_to="none"
)

# Use a smaller train dataset subset for testing
train_dataset = tokenized_dataset["train"].select(range(int(len(tokenized_dataset["train"]) * 0.1))) #Reduced test set to decrease the run time
train_dataset = train_dataset.with_format("torch", columns=['input_ids', 'attention_mask'])

eval_dataset = tokenized_dataset["test"].map(tokenize_function, batched=True, num_proc=4)
eval_dataset = eval_dataset.with_format("torch", columns=['input_ids', 'attention_mask'])

trainer = PPOTrainer(
    ppo_config,
    reward_model=reward_model,
    tokenizer=tokenizer,
    policy=policy_model,
    ref_policy=ref_policy_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    value_model=value_model,
)
trainer.train()

  return func(*args, **kwargs)
  return func(*args, **kwargs)
  return func(*args, **kwargs)


In [None]:
# Save the optimized PPO model
policy_model.save_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/ppo_optimized_policy")
print("Optimized PPO model saved.")

## Testing Models on Unseen Texts

### Reward Model

In [7]:
# Load the trained reward model and tokenizer
reward_model = AutoModelForSequenceClassification.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/reward_model")
tokenizer = AutoTokenizer.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/reward_model")
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.pad_token_id


# Test inputs and responses
prompt = "What is artificial intelligence?"
responses = [
    "AI is the simulation of human intelligence in machines.",
    "AI is a field of engineering.",
]

# Tokenize and score
inputs = tokenizer([prompt] * len(responses), responses, return_tensors="pt", padding=True, truncation=True)
# Check for and handle out-of-vocabulary tokens before passing to the model
inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)  # Clamp IDs within valid range

scores = reward_model(**inputs).logits.squeeze()

# Print scores
for i, response in enumerate(responses):
    print(f"Response: {response}\nScore: {scores[i].item()}\n")

Response: AI is the simulation of human intelligence in machines.
Score: 0.2086963653564453

Response: AI is a field of engineering.
Score: -2.861525774002075



### PPO model

In [9]:
# Load the trained policy model and tokenizer
policy_model = AutoModelForCausalLM.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/ppo_optimized_policy")
generation_pipeline = pipeline("text-generation",
                               model=policy_model,
                               tokenizer=tokenizer,
                               device=0 if torch.cuda.is_available() else -1)

diverse_outputs = generation_pipeline("What is deep learning?", max_length=50, num_return_sequences=5)
# Calculate and print rewards for diverse_outputs
for i, output in enumerate(diverse_outputs):
    text = output['generated_text']
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

    score = reward_model(**inputs).logits[0].item()

    print(f"Response: {text}\nScore: {score}\n")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Response: What is deep learning?

Deep learning is a technique used to develop neural networks in a way that will help you find and understand a large set of tasks that will help you make decisions easier. If you are willing to do something simple like choose
Score: -0.6190184354782104

Response: What is deep learning?

Deep learning refers to learning algorithms that take into account multiple variables to determine the input of one or more items.

More on Deep Learning

The topic below describes the basic concepts of deep learning in Python and
Score: -0.05082559585571289

Response: What is deep learning?

Deep learning refers to the work done on AI using deep learning algorithms.

Deep Learning Training

Advanced Techniques

Understanding neural network technology

Nurturing a career based solely on how well you perform
Score: -0.09765625

Response: What is deep learning? Why does it work?

Deep neural networks

What was Deep Blue's favorite game?

Deep Learning has become part of p