# **End-to-End Reinforcement Learning with Human Feedback: Reward Modeling and PPO Testing on Unseen Texts**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



## **Load Libraries**


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, pipeline
# from trl import (
#     AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer,
#     RewardConfig, RewardTrainer, setup_chat_format, ModelConfig
# )
# from datasets import load_dataset

## Testing Models on Unseen Texts

### Reward Model

In [4]:
# Load the trained reward model and its tokenizer from Google Drive
reward_model_path = "drive/MyDrive/M2 D3S/Math of DL/Project/reward_model"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_path)
tokenizer = AutoTokenizer.from_pretrained(reward_model_path)
# Reuse EOS as the padding token so that batched inputs can be padded
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.pad_token_id


# Test inputs and responses
prompt = "What is artificial intelligence?"
responses = [
    "AI is the simulation of human intelligence in machines.",
    "AI is a field of engineering.",
]

# Tokenize each (prompt, response) pair and score it
inputs = tokenizer([prompt] * len(responses), responses, return_tensors="pt", padding=True, truncation=True)
# Workaround: clamp token IDs into [0, vocab_size - 1] so the embedding lookup cannot fail.
# Note that this silently remaps any out-of-range ID instead of reporting it.
inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

# Inference only, so skip gradient tracking
with torch.no_grad():
    scores = reward_model(**inputs).logits.squeeze(-1)

# Print scores
for i, response in enumerate(responses):
    print(f"Response: {response}\nScore: {scores[i].item()}\n")

Response: AI is the simulation of human intelligence in machines.
Score: 0.2086963653564453

Response: AI is a field of engineering.
Score: -2.861525774002075
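
One way to read the gap between the two scores is as a preference probability. The sketch below assumes the reward model was trained with a pairwise Bradley-Terry style loss, as TRL's `RewardTrainer` uses; the training code is not shown in this notebook, so treat that as an assumption. The helper name `preference_probability` is hypothetical and introduced only for illustration.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that A is preferred over B under a Bradley-Terry model.

    Assumes the scores are raw reward-model logits, so that
    P(A preferred over B) = sigmoid(score_a - score_b).
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Scores copied from the cell output above
score_simulation = 0.2086963653564453
score_engineering = -2.861525774002075

p = preference_probability(score_simulation, score_engineering)
print(f"P(first response preferred) = {p:.3f}")
```

Under that assumption, the score gap of about 3.07 corresponds to the first response being preferred with a probability of roughly 0.96. Only the difference between scores matters here; the absolute values carry no meaning on their own.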



### PPO Model

In [6]:
# Load the PPO-optimized policy model; the tokenizer loaded above is reused
policy_model = AutoModelForCausalLM.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/ppo_optimized_policy")
generation_pipeline = pipeline("text-generation",
                               model=policy_model,
                               tokenizer=tokenizer,
                               device=0 if torch.cuda.is_available() else -1)

# Sample five continuations; max_length=50 counts the prompt tokens as well
diverse_outputs = generation_pipeline("What is deep learning?", max_length=50, do_sample=True, num_return_sequences=5)
# Score each generated text (prompt plus continuation) with the reward model
for output in diverse_outputs:
    text = output['generated_text']
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Same clamping workaround as in the reward model test
    inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        score = reward_model(**inputs).logits[0].item()

    print(f"Response: {text}\nScore: {score}\n")

Response: What is deep learning?

It's an industry term. It refers to the ability to classify information among a collection of discrete neural networks. Deep learning is currently only available in Google's cloud services.

A user might need time to build
Score: -0.28140246868133545

Response: What is deep learning?

Deep learning is the process by which a processor is trained that it learns a single thing. It can then use the same training to look back at how it has learned that technique.

Deep learning is similar to
Score: 0.23582708835601807

Response: What is deep learning?

What is visualization?

What is deep neural network?

What is a self-learning algorithm

What is learning

What is a self-training algorithm

A key example is to write
Score: -0.5195984840393066

Response: What is deep learning? And why do people prefer it to Big Data?

Well, some say deep learning is just a way of making predictions about the world. Then many people try to build the algorithm on top of it to