# **End-to-End Reinforcement Learning with Human Feedback: Reward Modeling and PPO Testing on Unseen Texts**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



## **Load Libraries**


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, pipeline

## Testing Models on Unseen Texts

### Reward Model

In [3]:
# Load the trained reward model and tokenizer
reward_model = AutoModelForSequenceClassification.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/reward_model")
tokenizer = AutoTokenizer.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/tokenizer")
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.pad_token_id


# Test inputs and responses
prompt = "What is artificial intelligence?"
responses = [
    "AI is the simulation of human intelligence in machines.",
    "AI is a field of engineering.",
]

# Tokenize each (prompt, response) pair and score it with the reward model
inputs = tokenizer([prompt] * len(responses), responses, return_tensors="pt", padding=True, truncation=True)
# Defensive guard: clamp token IDs to the tokenizer's base vocabulary range [0, vocab_size - 1]
inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

# Inference only, so skip gradient tracking; squeeze(-1) drops the single-logit dimension
with torch.no_grad():
    scores = reward_model(**inputs).logits.squeeze(-1)

# Print scores
for i, response in enumerate(responses):
    print(f"Response: {response}\nScore: {scores[i].item()}\n")

Response: AI is the simulation of human intelligence in machines.
Score: 0.2086963653564453

Response: AI is a field of engineering.
Score: 0.6473867893218994



In [4]:
# Test inputs and responses
prompt = "Explain deep learning."
responses = [
    "Deep learning uses neural networks to learn patterns.",
    "It processes data through multiple layers."
]

# Tokenize each (prompt, response) pair and score it with the reward model
inputs = tokenizer([prompt] * len(responses), responses, return_tensors="pt", padding=True, truncation=True)
# Defensive guard: clamp token IDs to the tokenizer's base vocabulary range [0, vocab_size - 1]
inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

# Inference only, so skip gradient tracking; squeeze(-1) drops the single-logit dimension
with torch.no_grad():
    scores = reward_model(**inputs).logits.squeeze(-1)

# Print scores
for i, response in enumerate(responses):
    print(f"Response: {response}\nScore: {scores[i].item()}\n")

Response: Deep learning uses neural networks to learn patterns.
Score: 0.07615280151367188

Response: It processes data through multiple layers.
Score: -0.652172327041626
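
The raw scores above are easier to read as pairwise preferences. The following is a minimal sketch, assuming the reward model was trained with a Bradley-Terry style pairwise objective (the notebook does not state the training loss, so this is an assumption), under which the probability that one response is preferred over another is the sigmoid of the score difference. The helper `preference_probability` and the `pairs` dictionary are illustrative names, and the scores are rounded copies of the printed outputs.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """P(response a is preferred over b) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# (first response score, second response score), rounded from the outputs above
pairs = {
    "What is artificial intelligence?": (0.2087, 0.6474),
    "Explain deep learning.": (0.0762, -0.6522),
}

# Assumption: Bradley-Terry interpretation of the reward scores
probabilities = {
    prompt: preference_probability(first, second)
    for prompt, (first, second) in pairs.items()
}

for prompt, p in probabilities.items():
    print(f"{prompt} P(first response preferred) = {p:.3f}")
```

Under this reading, the gap between two scores matters rather than their absolute values, so a score of 0.21 versus 0.65 expresses a mild preference for the second response, not a judgment that either response is good in isolation.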



### PPO Model

In [5]:
# Load the trained policy model and tokenizer
policy_model = AutoModelForCausalLM.from_pretrained("drive/MyDrive/M2 D3S/Math of DL/Project/ppo_optimized")

generation_pipeline = pipeline("text-generation",
                               model=policy_model,
                               tokenizer=tokenizer,
                               device=0 if torch.cuda.is_available() else -1)

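# Sample five continuations; do_sample=True makes the sampling explicit rather than relying on the saved model config
diverse_outputs = generation_pipeline("What is artificial intelligence?", num_return_sequences=5, do_sample=True)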

# Score each generated sample with the reward model and print it
for output in diverse_outputs:
    text = output['generated_text']  # the prompt followed by the sampled continuation
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Same defensive clamp as in the reward-model cells
    inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        score = reward_model(**inputs).logits[0].item()

    print(f"Response: {text}\nScore: {score}\n")

Device set to use cpu


Response: What is artificial intelligence?

What is the science behind it?

I'm sure that there's a lot to learn about it, but I cannot say it definitively yet, and it has to do primarily with a mathematical approach (and I
Score: 0.6635737419128418

Response: What is artificial intelligence? Why would it exist?"

"I have met someone before who can tell you that it must do something interesting or interesting and is just like an expert analyst. I'm thinking of him. I know he's not an
Score: -2.110313892364502

Response: What is artificial intelligence?

Curiously smart.

This is a far cry from most artificial intelligence technologies.

In fact, there are many technologies that I'm not sure of that have made it into the public domain, but
Score: -0.4357433319091797

Response: What is artificial intelligence? It is very strange when you start with that exact thing that it believes is impossible, that it's impossible, that it exists. If you think about it, I know I'm not making any claim

In [None]:
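# Sample five continuations; do_sample=True makes the sampling explicit rather than relying on the saved model config
diverse_outputs = generation_pipeline("Explain deep learning.", num_return_sequences=5, do_sample=True)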

# Score each generated sample with the reward model and print it
for output in diverse_outputs:
    text = output['generated_text']  # the prompt followed by the sampled continuation
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Same defensive clamp as in the reward-model cells
    inputs['input_ids'] = inputs['input_ids'].clamp(0, tokenizer.vocab_size - 1)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        score = reward_model(**inputs).logits[0].item()

    print(f"Response: {text}\nScore: {score}\n")

Response: Explain deep learning. The reason you're doing this is we have to understand how we can combine the above code and then learn a new algorithm for what we haven't done before. You probably know this already, but you don't want to think
Score: -0.10906124114990234

Response: Explain deep learning.

It is a bit unclear if the app will work on all Android devices, but there is at least one source app on the App Store that works with either Android devices of your choice.

That is probably something
Score: -0.7012643814086914

Response: Explain deep learning. We don't know if this data was drawn during the project or during any other project — it's just that it took long enough, maybe even tens of hours.

What is the potential cost of doing this and,
Score: -0.38426971435546875

Response: Explain deep learning.

It needs good neural networks and algorithms that are not dependent on any algorithm of any algorithm of any algorithm of anything whatsoever. But with a very low price.
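
The per-sample rewards printed for the PPO model can be condensed into one summary per prompt. The sketch below uses only the scores that actually appear in the saved outputs; samples whose score is cut off in the output are omitted rather than guessed. The names `printed_scores` and `summary` are illustrative and not part of the original notebook.

```python
from statistics import mean

# Reward scores copied from the printed outputs above; samples without a
# visible score in the saved output are left out.
printed_scores = {
    "What is artificial intelligence?": [
        0.6635737419128418, -2.110313892364502, -0.4357433319091797,
    ],
    "Explain deep learning.": [
        -0.10906124114990234, -0.7012643814086914, -0.38426971435546875,
    ],
}

# Count, mean, best, and worst reward for each prompt
summary = {
    prompt: {
        "n": len(scores),
        "mean": mean(scores),
        "best": max(scores),
        "worst": min(scores),
    }
    for prompt, scores in printed_scores.items()
}

for prompt, stats in summary.items():
    print(f"{prompt} n={stats['n']} mean={stats['mean']:.3f} "
          f"best={stats['best']:.3f} worst={stats['worst']:.3f}")
```

With only three scored samples per prompt, these averages are noisy and should be read as a rough indication rather than a measurement of policy quality.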

