<font color=red>**Danger zone:**</font> you'll be fine-tuning a model to generate positive, negative or even toxic reviews. We'll be doing this for fun, but this is also the technique for [review bombing](https://en.wikipedia.org/wiki/Review_bomb), bot farms on social media and other less than dignified stuff. It is ultimately your decision how you apply this knowledge, but before you choose, ask yourself: is this why you chose to learn ML?


# LLM Alignment with Reinforcement Learning from Human Feedback (RLHF)



In this homework, you're gonna fine-tune a language model with reinforcement learning to make it generate bad (or good) reviews.

To perform RL-based fine-tuning, we'll use a new (in this course) library called [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl). TRL implements the main reinforcement learning components of RLHF: reward modeling and fine-tuning with PPO.

![img](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/TRL-readme.png)

## Stage 0: load model

To see how TRL works, we'll use it to align GPT-2 on the IMDB dataset to generate negative movie reviews. In fact, __it's your choice whether you want positive or negative reviews__; however, I recommend focusing on negative ones, since the effect of RLHF is easier to see.

But before you choose, let's take a look at the baseline model: a GPT-2 fine-tuned on generating arbitrary movie reviews.

In [1]:
import torch
import transformers
import datasets
import trl

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_model = transformers.AutoModelForCausalLM.from_pretrained("lvwerra/gpt2-imdb", device_map=device)



In [8]:
inputs = main_tokenizer("The movie", return_tensors='pt').to(device)
generated_ids = main_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print("\nGenerated text:", main_tokenizer.decode(generated_ids.flatten().cpu().numpy().tolist()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated text: The movie is as I've seen it, if you can believe what I've heard in comments made recently, it is a complete failure. I wish that this movie had never been made. I think this was a great example to the world of acting that I


If you run this cell a couple of times, you'll see that the model generates positive, negative and neutral reviews in some proportion. What we're gonna do next is teach the model to generate more positive (or negative) reviews.

Similarly to InstructGPT, we're gonna do that in 2 stages:
- **train a reward model** to assign higher values to positive (or negative) reviews
- fine-tune the language model to **maximize that reward using [proximal policy optimization](https://openai.com/research/openai-baselines-ppo)**



## Stage 1: train a reward model

First, we'll train a BERT-like model as our reward model. We'll generate synthetic pairwise rankings to emulate human rankings.

__Q:__ why do I need a reward model? Can I just use a pre-trained sentiment classifier? <br> __A:__ Yes, you can, but that only works for movie reviews. This homework will teach you how to do RLHF for any kind of objective.



In [9]:
# We'll be fine-tuning a small BERT-like model for now. Please try other models for the main assignment.
reward_model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", device_map=device)
reward_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


__Note that__ the reward model has a separate tokenizer, different from the main model. They don't need to be the same for RLHF fine-tuning.

In [32]:
from torch.utils.data import Dataset

class IMDBPairwiseDataset(Dataset):
	""" 
	A dataset of all possible pairs of chosen and rejected texts for TRL reward training format.

	This dataset is designed to facilitate the training of a reward model by providing pairs of
	texts where one is preferred (chosen) and the other is not (rejected). Each sample in the dataset
	is a dictionary containing tokenized input IDs and attention masks for both the chosen and rejected
	texts.

	Parameters:
	imdb: The labeled dataset to build (chosen, rejected) pairs from
	tokenizer: The tokenizer used to preprocess the texts
	accepted_label (int): The label that indicates a chosen text. Texts with this label are considered
						  preferred, while others are considered rejected.

	Methods:
	__len__(): Returns the total number of possible pairs of chosen and rejected texts.
	__getitem__(index): Returns a dictionary containing tokenized inputs for a specific pair of chosen
						and rejected texts.
	"""
	
	def __init__(self, imdb, tokenizer, accepted_label):
		super().__init__()
		self.tokenizer = tokenizer
		self.chosen_texts = imdb.filter(lambda x: x["label"] == accepted_label)["text"]
		self.rejected_texts = imdb.filter(lambda x: x["label"] != accepted_label)["text"]

		assert self.chosen_texts, f"no texts with label {accepted_label}"
		# print(f"Found {len(self.chosen_texts)} chosen and {len(self.rejected_texts)} rejected texts, {len(self)} pairs")

		self.column_names = [
			'input_ids_chosen', 'attention_mask_chosen',
			'input_ids_rejected', 'attention_mask_rejected'
		]

	def __len__(self):
		return len(self.chosen_texts) * len(self.rejected_texts)  # all pairs

	def __getitem__(self, index: int):
		chosen_text = self.chosen_texts[(index - 1) // len(self.rejected_texts)]
		rejected_text = self.rejected_texts[(index - 1) % len(self.rejected_texts)]

		chosen_inputs = self.tokenizer(chosen_text, return_tensors='pt', truncation=True)
		rejected_inputs = self.tokenizer(rejected_text, return_tensors='pt', truncation=True)
		
		return dict(
			input_ids_chosen=chosen_inputs['input_ids'][0],
			attention_mask_chosen=chosen_inputs['attention_mask'][0],
			input_ids_rejected=rejected_inputs['input_ids'][0],
			attention_mask_rejected=rejected_inputs['attention_mask'][0],
		)

In [33]:
TARGET_LABEL = 0 # negative reviews
imdb = datasets.load_dataset("imdb", split='train')
reward_data = IMDBPairwiseDataset(imdb, reward_tokenizer, accepted_label=TARGET_LABEL)

sample = reward_data[31337]
print('CHOSEN:', reward_tokenizer.decode(sample['input_ids_chosen']))
print('REJECTED:', reward_tokenizer.decode(sample['input_ids_rejected']))

CHOSEN: [CLS] If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story. < br / > < br / > One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives ( unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film ). < br / > < br / > One might better spend one's time staring out a window at a tree growing. < br / > < br / > [SEP]
REJECTED: [CLS] Well, I come from Bulgaria where it's almost impossible to have a tornado but my imagination tells me to be " very, very afraid "!!! This guy ( Devon Sawa ) has done a great job with this movie! I don't know exactly how old he was but he didn't act like a child ( WELL DONE )! Now about the tornado - it wasn't very realistic but frightens you! If you want to have a nice time in front of the telly - this is the 

We'll be using `trl.RewardTrainer` - a special case of `transformers.Trainer`.

![img](https://i.imgur.com/2JzNAPs.png)

Note that the model itself does not score pairs: it processes the chosen ($y_w$) and rejected ($y_l$) samples independently. To minimize this loss, the reward model needs to score the chosen sample higher than the rejected one. The formula also assumes some context $x$, which is useful for seq2seq tasks. In our case of movie reviews, $x$ is empty.
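To make the loss concrete, here is a minimal pure-Python sketch. It assumes the figure shows the usual InstructGPT-style pairwise loss, $-\log \sigma(r(x, y_w) - r(x, y_l))$; the function name `pairwise_reward_loss` is made up for this illustration and is not part of TRL.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_w - r_l)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); pick the branch that never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Each reward is computed independently; only their difference enters the loss.
print(pairwise_reward_loss(5.0, -5.0))   # chosen scored much higher: loss near 0
print(pairwise_reward_loss(0.0, 0.0))    # model cannot tell the pair apart: log(2)
print(pairwise_reward_loss(-5.0, 5.0))   # wrong order: large loss
```

Because only the difference of the two scores matters, the absolute scale of the rewards is not fixed by this loss, which is why your reward model may produce different absolute values from run to run.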

In [34]:
training_args = trl.RewardConfig(  # like transformers.TrainingArguments
	output_dir="reward_model",
	per_device_train_batch_size=32,
	gradient_accumulation_steps=1,
	learning_rate=1.41e-5,
	max_steps=1_000,              # note: training may need more than 1k steps
	logging_steps=50,
	gradient_checkpointing=True,  # reduce memory usage but train ~30% slower
	gradient_checkpointing_kwargs={"use_reentrant": False},
	fp16=True,                    # disable this on CPU or on very old GPUs
	report_to='none',
	# you may add any other hyperparameters that you found useful
)

trainer = trl.RewardTrainer(
	model=reward_model,
	args=training_args,
	tokenizer=reward_tokenizer,
	train_dataset=reward_data,
	peft_config=None,  # optionally, you may tune with LoRA, prompt-tuning, etc
)

trainer.train()

  5%|▌         | 50/1000 [00:41<10:52,  1.45it/s] 

{'loss': 0.5332, 'learning_rate': 1.34232e-05, 'epoch': 0.0}


 10%|█         | 100/1000 [01:16<12:16,  1.22it/s]

{'loss': 0.1951, 'learning_rate': 1.2718200000000001e-05, 'epoch': 0.0}


 15%|█▌        | 150/1000 [01:50<09:40,  1.46it/s]

{'loss': 0.1565, 'learning_rate': 1.20132e-05, 'epoch': 0.0}


 20%|██        | 200/1000 [02:24<09:08,  1.46it/s]

{'loss': 0.1237, 'learning_rate': 1.1308200000000001e-05, 'epoch': 0.0}


 25%|██▌       | 250/1000 [02:59<08:32,  1.46it/s]

{'loss': 0.103, 'learning_rate': 1.06032e-05, 'epoch': 0.0}


 30%|███       | 300/1000 [03:33<07:59,  1.46it/s]

{'loss': 0.1051, 'learning_rate': 9.8982e-06, 'epoch': 0.0}


 35%|███▌      | 350/1000 [04:07<07:24,  1.46it/s]

{'loss': 0.0977, 'learning_rate': 9.1932e-06, 'epoch': 0.0}


 40%|████      | 400/1000 [04:41<06:51,  1.46it/s]

{'loss': 0.0822, 'learning_rate': 8.4882e-06, 'epoch': 0.0}


 45%|████▌     | 450/1000 [05:16<06:15,  1.46it/s]

{'loss': 0.0722, 'learning_rate': 7.7832e-06, 'epoch': 0.0}


 50%|█████     | 500/1000 [05:50<05:42,  1.46it/s]

{'loss': 0.0827, 'learning_rate': 7.0782e-06, 'epoch': 0.0}


 55%|█████▌    | 550/1000 [06:26<05:08,  1.46it/s]

{'loss': 0.0799, 'learning_rate': 6.3732e-06, 'epoch': 0.0}


 60%|██████    | 600/1000 [07:00<04:34,  1.46it/s]

{'loss': 0.0831, 'learning_rate': 5.6682e-06, 'epoch': 0.0}


 65%|██████▌   | 650/1000 [07:34<03:59,  1.46it/s]

{'loss': 0.0706, 'learning_rate': 4.9632e-06, 'epoch': 0.0}


 70%|███████   | 700/1000 [08:09<03:25,  1.46it/s]

{'loss': 0.0586, 'learning_rate': 4.2582e-06, 'epoch': 0.0}


 75%|███████▌  | 750/1000 [08:43<02:52,  1.45it/s]

{'loss': 0.0722, 'learning_rate': 3.5532e-06, 'epoch': 0.0}


 80%|████████  | 800/1000 [09:17<02:17,  1.46it/s]

{'loss': 0.0816, 'learning_rate': 2.8482e-06, 'epoch': 0.0}


 85%|████████▌ | 850/1000 [09:52<01:42,  1.46it/s]

{'loss': 0.0648, 'learning_rate': 2.1432e-06, 'epoch': 0.0}


 90%|█████████ | 900/1000 [10:26<01:08,  1.46it/s]

{'loss': 0.0619, 'learning_rate': 1.4382e-06, 'epoch': 0.0}


 95%|█████████▌| 950/1000 [11:00<00:33,  1.48it/s]

{'loss': 0.0651, 'learning_rate': 7.332e-07, 'epoch': 0.0}


100%|██████████| 1000/1000 [11:34<00:00,  1.48it/s]

{'loss': 0.0573, 'learning_rate': 2.82e-08, 'epoch': 0.0}


100%|██████████| 1000/1000 [11:37<00:00,  1.43it/s]

{'train_runtime': 697.7943, 'train_samples_per_second': 45.859, 'train_steps_per_second': 1.433, 'train_loss': 0.11233135890960694, 'epoch': 0.0}





TrainOutput(global_step=1000, training_loss=0.11233135890960694, metrics={'train_runtime': 697.7943, 'train_samples_per_second': 45.859, 'train_steps_per_second': 1.433, 'train_loss': 0.11233135890960694, 'epoch': 0.0})

In [35]:
reward_model.gradient_checkpointing_disable()
reward_model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Sanity-check the reward model

Let's check how our reward model performs.

__Your task__ is to measure how often your reward model ranks a pair of (chosen and rejected) reviews correctly. Please measure this separately for the train data (`imdb`) and a separate test set loaded below.

In [36]:

for sample_index in 45, 16000:
	print('TEXT:', imdb[sample_index]['text'])
	inputs = reward_tokenizer(imdb[sample_index]['text'], truncation=True, return_tensors='pt').to(device)
	with torch.no_grad():
		reward = reward_model(**inputs).logits[0, 0].item()
		print("REWARD:", reward)
	print('LABEL:', imdb[sample_index]['label'])
	print()

# note: your reward model may produce different absolute rewards.
# This is fine as long as the rewards are ordered correctly (most of the time)

TEXT: This movie sucked. It really was a waste of my life. The acting was atrocious, the plot completely implausible. Long, long story short, these people get "terrorized" by this pathetic "crazed killer", but completely fail to fight back in any manner. And this is after they take a raft on a camping trip, with no gear, and show up at a campsite that is already assembled and completely stocked with food and clothes and the daughters headphones. Additionally, after their boat goes missing, they panic that they're stuck in the woods, but then the daughters boyfriend just shows up and they apparently never consider that they could just hike out of the woods like he did to get to them. Like I said, this movie sucks. A complete joke. Don't let your girlfriend talk you into watching it.
REWARD: 5.078125
LABEL: 0

TEXT: Good: Engaging cinematic firefights, great presentation, vehicles are actually fun to drive, fairly appealing multiplayer, faithful to the movie, and the list goes on.<br /><

First of all, let's implement the `compute_reward` function. Note that we use plaintext reviews because the main model uses a different tokenizer from the reward model.

In [None]:
from torch import Tensor, no_grad

def compute_reward(reward_model, reward_tokenizer, texts: list[str], device='cpu') -> Tensor:
	"""
	Compute the reward scores for a list of texts using a specified reward model and tokenizer.

	Parameters:
	reward_model: The model used to compute the reward scores
	reward_tokenizer: The tokenizer for reward_model
	texts (list[str]): A list of text strings for which the reward scores are to be computed.
	device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

	Returns:
	torch.Tensor: A tensor containing the reward scores for each input text. The scores are extracted
				  from the logits of the reward model.

	Example:
	>>> compute_reward(my_reward_model, my_reward_tokenizer, ["text1", "text2"])
	tensor([ 5.1836, -4.8438], device='cpu')
	"""
	
	reward_model.to(device)
	inputs = reward_tokenizer(texts, truncation=True, padding=True, return_tensors='pt').to(device)
	
	with no_grad():
		reward = reward_model(**inputs).logits[:, 0]
	
	return reward

In [70]:
rewards = compute_reward(reward_model, reward_tokenizer, [imdb[45]['text'], imdb[16000]['text']], device=device)
print(rewards)
assert rewards[0] > rewards[1]
assert rewards[0] > 0
assert rewards[1] < 0

tensor([ 5.0742, -4.7891], device='cuda:0')


In [87]:
from tqdm.auto import tqdm

def eval_reward_model(reward_model, reward_tokenizer, test_dataset, target_label, device='cpu'):
	"""
	Evaluate the performance of a reward model by comparing reward scores for chosen and rejected reviews. 

	This function selects reviews from a test dataset based on a target label and evaluates the reward model's
	ability to assign higher scores to chosen reviews compared to rejected ones. Pairs are scored one at a time
	(no batching), so the evaluation is simple but not fast.
	Note that reward scores are compared on corresponding chosen and rejected reviews: 
		chosen_reviews[0] vs rejected_reviews[0], 
		chosen_reviews[1] vs rejected_reviews[1],
		etc.

	Parameters:
	reward_model: The model used to compute the reward scores
	reward_tokenizer: The tokenizer for reward_model
	test_dataset: The test dataset with 'text' and 'label' columns
	target_label (0 or 1): The label used to select chosen reviews. Reviews with this label are considered chosen,
				  while others are considered rejected.
	device (str, optional): The device on which the computation should be performed. Default is 'cpu'.

	Returns:
	float: The accuracy of the reward model, calculated as the proportion of times the model assigns a higher
		   reward score to the chosen review compared to the rejected review.

	Example:
	>>> accuracy = eval_reward_model(my_reward_model, my_reward_tokenizer, test_data, target_label=1)
	>>> print(f"Model accuracy: {accuracy:.2%}")
	"""

	chosen_reviews = test_dataset.filter(lambda x: x['label'] == target_label)['text']
	rejected_reviews = test_dataset.filter(lambda x: x['label'] != target_label)['text']

	if len(chosen_reviews) != len(rejected_reviews):
		min_len = min(len(chosen_reviews), len(rejected_reviews))
		chosen_reviews = chosen_reviews[:min_len]
		rejected_reviews = rejected_reviews[:min_len]

	assert len(chosen_reviews) == len(rejected_reviews)

	correct_answers_count = 0
	for chosen_review, rejected_review in tqdm(zip(chosen_reviews, rejected_reviews)):

		chosen_ids = reward_tokenizer(chosen_review, truncation=True, padding=True, return_tensors='pt').to(device)
		rejected_ids = reward_tokenizer(rejected_review, truncation=True, padding=True, return_tensors='pt').to(device)

		with torch.no_grad():
			chosen_reward = reward_model(**chosen_ids).logits[0, 0].item()
			rejected_reward = reward_model(**rejected_ids).logits[0, 0].item()
		
		if chosen_reward > rejected_reward:
			correct_answers_count += 1

	return correct_answers_count / len(chosen_reviews)

In [88]:
imdb_test = datasets.load_dataset("imdb", split='test')

test_accuracy = eval_reward_model(
	reward_model,
	reward_tokenizer,
	imdb_test,
	target_label=TARGET_LABEL,
	device=device,
)

print('test accuracy: {}'.format(test_accuracy))
assert test_accuracy > 0.94

Filter: 100%|██████████| 25000/25000 [00:00<00:00, 320059.58 examples/s]
Filter: 100%|██████████| 25000/25000 [00:00<00:00, 411892.76 examples/s]
12500it [02:48, 74.32it/s]

test accuracy: 0.96928





### Reward-guided generation (1 point)

If you did everything right, by now you should have a decent reward model. Before we use it for reinforcement learning, let's see if we can align model samples without any training.

To do so, you can use reward-guided inference: __generate N=16 samples, then select the one with the highest reward__ (according to your reward model).

For this problem, it's on you to demonstrate whether or not your code works. Find at least 5 neutral prompts such as "This movie is" (...), generate samples, rank them based on reward and show which samples get the highest reward.
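The selection rule itself fits in a few lines. The sketch below is a toy illustration: `best_of_n`, `toy_generate` and `toy_score` are hypothetical stand-ins (a random ending picker and a keyword counter), not the GPT-2 and reward models used in this notebook.

```python
import random

def best_of_n(generate, score, prompt, n=16):
    """Sample n candidates for `prompt` and return (best_text, best_score)."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [score(text) for text in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]

# Hypothetical stand-ins for the language model and the reward model
rng = random.Random(0)
ENDINGS = ["a masterpiece.", "fine, I guess.", "a waste of time.", "awful and boring."]
NEGATIVE_WORDS = {"waste", "awful", "boring"}

def toy_generate(prompt):
    return f"{prompt} {rng.choice(ENDINGS)}"

def toy_score(text):
    words = text.replace(".", "").replace(",", "").split()
    return sum(word in NEGATIVE_WORDS for word in words)

for prompt in ["This movie is", "The movie was"]:
    text, reward = best_of_n(toy_generate, toy_score, prompt)
    print(f"reward={reward}: {text}")
```

With real models, `generate` is the expensive part, so you would sample all N candidates in one batched call instead of a Python loop.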

Note: it is faster to generate samples in parallel, rather than sequentially, as follows:




In [89]:
prompts = ["This movie is", "The movie was", "I want to say that film", "Well, the movie was", "The movie was really"]

In [None]:
main_tokenizer.pad_token = main_tokenizer.eos_token
main_tokenizer.padding_side = 'left'

In [None]:
inputs = main_tokenizer(prompts, return_tensors='pt', padding=True, truncation=True).to(device)
for candidate in main_model.generate(**inputs, max_new_tokens=50, do_sample=True):
	print("Sample:", main_tokenizer.decode(candidate.flatten().cpu().numpy().tolist(), skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample: This movie is about a woman who accidentally kills her sister's dog. She is in love with a man with whom she dated for at first only 20 seconds. She wants revenge and she has an affair - but as a result of what happened to her, she breaks
Sample: The movie was not exactly a movie. It's more like a play. At the end of a role played by Marv Levy, I actually laughed out loud. I was not sure if he was crying or laughing in that one scene. It could have been funny
Sample: I want to say that film has been really great, though. The story was also good in certain aspects; for example, I love Cemetera, she is great in her role of the daughter of a wealthy French woman, but we have to be careful not to forget
Sample: Well, the movie was okay enough (it stars the legendary "Vladimir") but I still thought it was just plain bad! Not to mention that no one who knew who The Master of Evil was was ever introduced to it in the theaters..and I was wrong!
Sample: The movie was really good, as the

In [115]:
def generate_with_reward_guidance(
		main_model, main_tokenizer,
		reward_model, reward_tokenizer,
		N=16,
		device='cpu',
		prompt=None,
	):
	"""
	Generate text samples using a main model and select the best sample based on a reward model's guidance.

	This function generates multiple text samples from a main model, evaluates each sample using a reward model,
	and returns the sample with the highest reward score. The process is guided by the reward model to select
	the most desirable output.

	Parameters:
	main_model: The language model used to generate text samples.
	main_tokenizer: The tokenizer for main_model
	reward_model: The model used to compute reward scores for the generated samples.
	reward_tokenizer: The tokenizer for reward_model
	N (int, optional): The number of text samples to generate. Default is 16.
	device (str, optional): The device on which the computation should be performed. Default is 'cpu'.
	prompt (str, optional): The text to continue. If None (default), samples are generated unconditionally.

	Returns:
	str: The generated text sample with the highest reward score.
	"""
	main_model.to(device)
	sampling_kwargs = dict(
		max_new_tokens=100,
		do_sample=True,
		num_return_sequences=N,
		pad_token_id=main_tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
	)
	if prompt is None:
		generated_ids = main_model.generate(**sampling_kwargs)
	else:
		# condition generation on the prompt and pass its attention mask explicitly
		inputs = main_tokenizer(prompt, return_tensors='pt').to(device)
		generated_ids = main_model.generate(**inputs, **sampling_kwargs)
	
	text_samples = main_tokenizer.batch_decode(
		generated_ids,
		skip_special_tokens=True
		)
	
	reward_model.to(device)
	inputs = reward_tokenizer(
		text_samples,
		truncation=True, padding=True, return_tensors='pt').to(device)
	
	with no_grad():
		reward = reward_model(**inputs).logits[:, 0]
	
	return text_samples[reward.argmax().item()]

In [116]:
generate_with_reward_guidance(
	main_model, main_tokenizer,
	reward_model, reward_tokenizer,
	device=device,
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


'Sigh. We all know how many things can suck in the first half: a crappy car chase, a movie that doesn\'t know what can go wrong, a plot that takes ages to figure. It\'s even worse than a "good movie". And there\'s the fact that all this movie needed is a small cameo and maybe an extra script from a couple of Hollywood stars who play big characters. The plot lines make no sense and take over 90% of the movie. It makes you wonder what'

## Stage 2: fine-tune the main model with RL


Now, we will optimize GPT2 to produce negative IMDB movie reviews using the reward model you trained above.

Unlike supervised fine-tuning, RL allows the model to generate its own sentences on each training step. Then, it calculates the reward of those specific sentences, and finally, updates the model to increase the probability of sentences with high reward.

Thus, each RLHF training step consists of three stages: __Rollout__, __Evaluation__ and __Update__.

<div style="text-align: center">
<img src='https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gpt2_bert_training.png' width='600'>
</div>

The update stage depends on the specific RL algorithm. We'll be using Proximal Policy Optimization, or [PPO](https://arxiv.org/abs/1707.06347), similarly to what was used for InstructGPT.
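For intuition, here is a minimal sketch of the clipped surrogate objective from the PPO paper, for a single token. The function name is made up for this illustration, and `clip_range=0.2` is the value suggested in the paper; the actual trainer also adds a value-function loss and a KL penalty, which are left out here.

```python
import math

def ppo_clipped_objective(logprob_new, logprob_old, advantage, clip_range=0.2):
    """Clipped surrogate objective for one action (token); higher is better."""
    # probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # take the more pessimistic of the two estimates
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (advantage > 0) whose probability already grew by 50%:
# the gain is capped at 1 + clip_range, so there is no incentive to push further.
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))

# Bad action (advantage < 0) whose probability grew by 50%:
# no clipping, so the full penalty applies.
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0))
```

The clipping is what makes the update "proximal": once the new policy has moved far enough from the one that generated the samples, the objective stops rewarding further movement in that direction.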

Before we run those 3 stages, however, we need to create a dataset of "queries" - partial reviews in our case.

In [117]:
# Note: this code is specific to IMDB; you will need to re-write it for other tasks
imdb_for_rlhf = imdb.filter(lambda row: len(row['text']) > 200, batched=False)
imdb_for_rlhf = imdb_for_rlhf.remove_columns(['label'])
sample_length = trl.core.LengthSampler(2, 8)  # use the first 2-8 tokens as query

def select_query_and_tokenize(sample):
	query_ids = main_tokenizer.encode(sample["text"])[: sample_length()]
	sample["query"] = main_tokenizer.decode(query_ids)  # query is the only required column
	sample["input_ids"] = query_ids  # to avoid re-tokenizing later
	return sample  # we do not need the rest - it will be generated by the model

imdb_for_rlhf = imdb_for_rlhf.map(select_query_and_tokenize, batched=False)
imdb_for_rlhf.set_format(type="torch")

Filter: 100%|██████████| 25000/25000 [00:00<00:00, 368186.24 examples/s]
Map:   0%|          | 0/24895 [00:00<?, ? examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors
Map: 100%|██████████| 24895/24895 [00:17<00:00, 1446.92 examples/s]


Finally, we move to RL training. In this tutorial, we'll train LoRA adapters and not the full model.

In [118]:
import peft
peft_config = peft.LoraConfig(
	task_type=peft.TaskType.CAUSAL_LM, r=32, lora_alpha=32, lora_dropout=0.0, inference_mode=False
)

# reload main model as AutoModelForCausalLMWithValueHead - with an extra head needed for PPO
main_tokenizer = transformers.AutoTokenizer.from_pretrained("lvwerra/gpt2-imdb")
main_tokenizer.pad_token = main_tokenizer.eos_token

main_model = trl.AutoModelForCausalLMWithValueHead.from_pretrained("lvwerra/gpt2-imdb", device_map=device)
main_model = peft.get_peft_model(main_model, peft_config, adapter_name='default')
main_model.print_trainable_parameters()



trainable params: 1,179,648 || all params: 125,620,225 || trainable%: 0.9390589771670923
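The numbers printed above can be checked by hand. The sketch below is a back-of-the-envelope calculation under assumptions that are not stated in this notebook: that LoRA is attached only to the fused query/key/value projection (768 to 3 x 768) in each of the 12 GPT-2 blocks, that the base GPT-2 has 124,439,808 parameters, and that the value head is a single linear layer from 768 to 1. The helper `lora_param_count` is made up for this illustration.

```python
def lora_param_count(fan_in: int, fan_out: int, rank: int) -> int:
    """Parameters of one LoRA adapter: A is (rank x fan_in), B is (fan_out x rank)."""
    return rank * fan_in + fan_out * rank

hidden_size = 768   # GPT-2 small
num_layers = 12
rank = 32           # r from the LoraConfig above

# Assumption: one adapter per block, on the fused q/k/v projection
trainable = num_layers * lora_param_count(hidden_size, 3 * hidden_size, rank)
print(f"trainable params: {trainable:,}")   # 1,179,648

base_gpt2 = 124_439_808                     # assumed size of the frozen base model
value_head = hidden_size + 1                # assumed linear 768 -> 1 with bias
total = base_gpt2 + trainable + value_head
print(f"all params: {total:,}")             # 125,620,225
print(f"trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions both counts match the printed output, which suggests the adapters cover under 1% of the weights while the rest of the model stays frozen.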


Same as before, TRL has a special type of trainer that minimizes a PPO-specific pseudo-loss. You can read more on this trainer [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).

In [119]:
training_args = trl.PPOConfig(
	model_name=main_model.config._name_or_path,
	gradient_accumulation_steps=1,
	learning_rate=1.41e-5,
	batch_size=64,
	ppo_epochs=4,                 # PPO performs this many updates per training batch
)

ppo_trainer = trl.PPOTrainer(
	training_args, model=main_model.model, tokenizer=main_tokenizer,
	dataset=imdb_for_rlhf, data_collator=lambda data: dict((key, [d[key] for d in data]) for key in data[0])
)  # note: we pass main_model.model because PPOTrainer checks for one of several supported model types ...
# ... main_model.model is a model with adapters, which is supported. main_model itself is a wrapper that is not supported

In [121]:
from tqdm.auto import tqdm

max_steps = 50   # can be insufficient for some tasks - watch your learning curves
generation_kwargs = dict(
		min_length=-1, max_new_tokens=128, do_sample=True, top_k=0, top_p=1.0, pad_token_id=main_tokenizer.eos_token_id)
#                                  ^-- task-specific parameter!

average_reward = 0
gamma = 0.7

with tqdm(enumerate(ppo_trainer.dataloader), total=max_steps) as progressbar:
	# note: ppo_trainer.dataloader is just a regular dataloader of queries, no RL-specific magic :)
	for epoch, batch in progressbar:
		if epoch >= max_steps:
			break

		# Rollout stage: generate continuations from batch queries using main_model
		response_tensors = ppo_trainer.generate(batch['input_ids'], **generation_kwargs)
		# ^-- list of tensors of token ids from main model tokenizer

		# de-tokenize responses to strings (since reward model uses a different tokenizer)
		batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
		# note: response_tensors already contain query tokens, so we don't need to add queries manually.
		# This may not be true for other tasks: check this manually by viewing batch["response"] and batch["query"]


		# Evaluation stage - rewards for batch['response']
		rewards = compute_reward(reward_model, reward_tokenizer, batch["response"], device=device)

		# Update stage
		stats = ppo_trainer.step(batch['input_ids'], response_tensors, list(rewards.split(1)))
		stats['rewards/mean'] = rewards.mean().item()   # compute mean rewards for batch
		average_reward = gamma * average_reward + (1 - gamma) * stats['rewards/mean']

		print("-" * 30, 'STEP', epoch, '-' * 30)
		print(f'rewards/mean:\t{stats["rewards/mean"]:.9f}\t<---- average reward over this batch (higher=better, noisy)')
		print(f'rewards/moving_avg:\t{average_reward:.9f}\t<---- moving average reward (higher=better, less noisy)')
		print(f'ppo/returns/mean:\t{stats["ppo/returns/mean"]:.9f}\t<---- model-estimated average discounted reward')
		print(f'objective/kl:\t{stats["objective/kl"]:.9f}\t<---- how far we are from the original model (regularizer)')
		print()

		ppo_trainer.log_stats(stats, batch, list(rewards.split(1)))

  0%|          | 0/50 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  2%|▏         | 1/50 [00:36<29:55, 36.64s/it]

------------------------------ STEP 0 ------------------------------
rewards/mean:	0.184813499	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.055444050	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.421159863	<---- model-estimated average discounted reward
objective/kl:	0.000000000	<---- how far we are from the original model (regularizer)



  4%|▍         | 2/50 [01:13<29:38, 37.06s/it]

------------------------------ STEP 1 ------------------------------
rewards/mean:	0.408388615	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.161327419	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.383026958	<---- model-estimated average discounted reward
objective/kl:	0.069897503	<---- how far we are from the original model (regularizer)



  6%|▌         | 3/50 [01:51<29:09, 37.23s/it]

------------------------------ STEP 2 ------------------------------
rewards/mean:	-0.085626602	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.087241213	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.404487967	<---- model-estimated average discounted reward
objective/kl:	0.110206619	<---- how far we are from the original model (regularizer)



  8%|▊         | 4/50 [02:28<28:27, 37.13s/it]

------------------------------ STEP 3 ------------------------------
rewards/mean:	-0.300073624	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	-0.028953238	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.477114022	<---- model-estimated average discounted reward
objective/kl:	0.616876841	<---- how far we are from the original model (regularizer)



 10%|█         | 5/50 [03:04<27:38, 36.85s/it]

------------------------------ STEP 4 ------------------------------
rewards/mean:	-0.052922249	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	-0.036143941	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.448472112	<---- model-estimated average discounted reward
objective/kl:	1.427753329	<---- how far we are from the original model (regularizer)



 12%|█▏        | 6/50 [03:40<26:49, 36.58s/it]

------------------------------ STEP 5 ------------------------------
rewards/mean:	0.346815109	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.078743774	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.303512335	<---- model-estimated average discounted reward
objective/kl:	2.009539127	<---- how far we are from the original model (regularizer)



 14%|█▍        | 7/50 [04:16<25:57, 36.22s/it]

------------------------------ STEP 6 ------------------------------
rewards/mean:	1.154198647	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.401380236	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.084377229	<---- model-estimated average discounted reward
objective/kl:	2.812479496	<---- how far we are from the original model (regularizer)



 16%|█▌        | 8/50 [04:55<25:59, 37.12s/it]

------------------------------ STEP 7 ------------------------------
rewards/mean:	0.881374359	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.545378473	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.006371658	<---- model-estimated average discounted reward
objective/kl:	3.780475616	<---- how far we are from the original model (regularizer)



 18%|█▊        | 9/50 [05:33<25:40, 37.57s/it]

------------------------------ STEP 8 ------------------------------
rewards/mean:	0.463375092	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.520777458	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	-0.021193445	<---- model-estimated average discounted reward
objective/kl:	5.312567711	<---- how far we are from the original model (regularizer)



 20%|██        | 10/50 [06:10<24:54, 37.36s/it]

------------------------------ STEP 9 ------------------------------
rewards/mean:	0.787911415	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.600917645	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.078692898	<---- model-estimated average discounted reward
objective/kl:	5.806142330	<---- how far we are from the original model (regularizer)



 22%|██▏       | 11/50 [06:47<24:07, 37.12s/it]

------------------------------ STEP 10 ------------------------------
rewards/mean:	0.334899068	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.521112072	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.039771907	<---- model-estimated average discounted reward
objective/kl:	6.280764580	<---- how far we are from the original model (regularizer)



 24%|██▍       | 12/50 [07:24<23:31, 37.15s/it]

------------------------------ STEP 11 ------------------------------
rewards/mean:	1.217950821	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.730163697	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.096000120	<---- model-estimated average discounted reward
objective/kl:	8.058010101	<---- how far we are from the original model (regularizer)



 26%|██▌       | 13/50 [08:01<22:50, 37.04s/it]

------------------------------ STEP 12 ------------------------------
rewards/mean:	1.296674728	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.900117006	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.181105554	<---- model-estimated average discounted reward
objective/kl:	8.411641121	<---- how far we are from the original model (regularizer)



 28%|██▊       | 14/50 [08:38<22:12, 37.01s/it]

------------------------------ STEP 13 ------------------------------
rewards/mean:	1.005467534	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	0.931722165	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.207273066	<---- model-estimated average discounted reward
objective/kl:	7.628366947	<---- how far we are from the original model (regularizer)



 30%|███       | 15/50 [09:14<21:21, 36.61s/it]

------------------------------ STEP 14 ------------------------------
rewards/mean:	1.545151711	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.115751028	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.357464492	<---- model-estimated average discounted reward
objective/kl:	8.351938248	<---- how far we are from the original model (regularizer)



 32%|███▏      | 16/50 [09:50<20:42, 36.54s/it]

------------------------------ STEP 15 ------------------------------
rewards/mean:	1.138256073	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.122502542	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.235509828	<---- model-estimated average discounted reward
objective/kl:	9.184993744	<---- how far we are from the original model (regularizer)



 34%|███▍      | 17/50 [10:26<20:05, 36.53s/it]

------------------------------ STEP 16 ------------------------------
rewards/mean:	0.975373983	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.078363974	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.233827069	<---- model-estimated average discounted reward
objective/kl:	8.465982437	<---- how far we are from the original model (regularizer)



 36%|███▌      | 18/50 [11:07<20:04, 37.65s/it]

------------------------------ STEP 17 ------------------------------
rewards/mean:	1.262933731	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.133734901	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.322418392	<---- model-estimated average discounted reward
objective/kl:	7.753006458	<---- how far we are from the original model (regularizer)



 38%|███▊      | 19/50 [11:44<19:22, 37.49s/it]

------------------------------ STEP 18 ------------------------------
rewards/mean:	1.718078613	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.309038015	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.380621910	<---- model-estimated average discounted reward
objective/kl:	8.846158981	<---- how far we are from the original model (regularizer)



 40%|████      | 20/50 [12:20<18:32, 37.08s/it]

------------------------------ STEP 19 ------------------------------
rewards/mean:	1.872340202	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.478028671	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.506123424	<---- model-estimated average discounted reward
objective/kl:	8.042571068	<---- how far we are from the original model (regularizer)



 42%|████▏     | 21/50 [12:57<17:55, 37.08s/it]

------------------------------ STEP 20 ------------------------------
rewards/mean:	1.371367216	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.446030235	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.545753360	<---- model-estimated average discounted reward
objective/kl:	7.333931446	<---- how far we are from the original model (regularizer)



 44%|████▍     | 22/50 [13:33<17:06, 36.67s/it]

------------------------------ STEP 21 ------------------------------
rewards/mean:	2.155541420	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.658883590	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.674237370	<---- model-estimated average discounted reward
objective/kl:	9.028120041	<---- how far we are from the original model (regularizer)



 46%|████▌     | 23/50 [14:09<16:25, 36.52s/it]

------------------------------ STEP 22 ------------------------------
rewards/mean:	2.335096359	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	1.861747421	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.852169514	<---- model-estimated average discounted reward
objective/kl:	9.251592636	<---- how far we are from the original model (regularizer)



 48%|████▊     | 24/50 [14:44<15:42, 36.27s/it]

------------------------------ STEP 23 ------------------------------
rewards/mean:	2.365423203	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.012850155	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.946076751	<---- model-estimated average discounted reward
objective/kl:	10.439592361	<---- how far we are from the original model (regularizer)



 50%|█████     | 25/50 [15:19<14:53, 35.75s/it]

------------------------------ STEP 24 ------------------------------
rewards/mean:	2.512689114	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.162801843	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.128260136	<---- model-estimated average discounted reward
objective/kl:	9.519665718	<---- how far we are from the original model (regularizer)



 52%|█████▏    | 26/50 [15:53<14:04, 35.17s/it]

------------------------------ STEP 25 ------------------------------
rewards/mean:	2.036746979	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.124985384	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.923455477	<---- model-estimated average discounted reward
objective/kl:	10.514331818	<---- how far we are from the original model (regularizer)



 54%|█████▍    | 27/50 [16:29<13:32, 35.33s/it]

------------------------------ STEP 26 ------------------------------
rewards/mean:	2.507891655	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.239857265	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.984710634	<---- model-estimated average discounted reward
objective/kl:	10.754348755	<---- how far we are from the original model (regularizer)



 56%|█████▌    | 28/50 [17:08<13:26, 36.65s/it]

------------------------------ STEP 27 ------------------------------
rewards/mean:	2.476023436	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.310707116	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.041142941	<---- model-estimated average discounted reward
objective/kl:	11.956239700	<---- how far we are from the original model (regularizer)



 58%|█████▊    | 29/50 [17:43<12:36, 36.01s/it]

------------------------------ STEP 28 ------------------------------
rewards/mean:	2.333978176	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.317688434	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	0.968672752	<---- model-estimated average discounted reward
objective/kl:	8.873231888	<---- how far we are from the original model (regularizer)



 60%|██████    | 30/50 [18:19<11:59, 35.99s/it]

------------------------------ STEP 29 ------------------------------
rewards/mean:	2.526615143	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.380366447	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.252208948	<---- model-estimated average discounted reward
objective/kl:	10.057155609	<---- how far we are from the original model (regularizer)



 62%|██████▏   | 31/50 [18:54<11:18, 35.72s/it]

------------------------------ STEP 30 ------------------------------
rewards/mean:	3.062608957	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.585039200	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.464666605	<---- model-estimated average discounted reward
objective/kl:	11.300116539	<---- how far we are from the original model (regularizer)



 64%|██████▍   | 32/50 [19:30<10:43, 35.73s/it]

------------------------------ STEP 31 ------------------------------
rewards/mean:	2.322861671	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.506385941	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.462383986	<---- model-estimated average discounted reward
objective/kl:	10.169392586	<---- how far we are from the original model (regularizer)



 66%|██████▌   | 33/50 [20:05<10:06, 35.71s/it]

------------------------------ STEP 32 ------------------------------
rewards/mean:	2.927136421	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.632611085	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.551594496	<---- model-estimated average discounted reward
objective/kl:	12.427952766	<---- how far we are from the original model (regularizer)



 68%|██████▊   | 34/50 [20:38<09:19, 34.95s/it]

------------------------------ STEP 33 ------------------------------
rewards/mean:	2.977771759	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.736159287	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.585646868	<---- model-estimated average discounted reward
objective/kl:	10.313037872	<---- how far we are from the original model (regularizer)



 70%|███████   | 35/50 [21:12<08:37, 34.51s/it]

------------------------------ STEP 34 ------------------------------
rewards/mean:	2.639501572	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.707161973	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.515711546	<---- model-estimated average discounted reward
objective/kl:	12.010259628	<---- how far we are from the original model (regularizer)



 72%|███████▏  | 36/50 [21:46<08:01, 34.36s/it]

------------------------------ STEP 35 ------------------------------
rewards/mean:	3.018028259	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.800421859	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.647488356	<---- model-estimated average discounted reward
objective/kl:	14.720853806	<---- how far we are from the original model (regularizer)



 74%|███████▍  | 37/50 [22:18<07:16, 33.56s/it]

------------------------------ STEP 36 ------------------------------
rewards/mean:	3.111459732	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	2.893733221	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.681805968	<---- model-estimated average discounted reward
objective/kl:	13.077782631	<---- how far we are from the original model (regularizer)



 76%|███████▌  | 38/50 [22:53<06:48, 34.05s/it]

------------------------------ STEP 37 ------------------------------
rewards/mean:	3.252457619	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.001350540	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.781169415	<---- model-estimated average discounted reward
objective/kl:	13.318946838	<---- how far we are from the original model (regularizer)



 78%|███████▊  | 39/50 [23:26<06:13, 33.92s/it]

------------------------------ STEP 38 ------------------------------
rewards/mean:	3.181152344	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.055291081	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.850888252	<---- model-estimated average discounted reward
objective/kl:	14.168682098	<---- how far we are from the original model (regularizer)



 80%|████████  | 40/50 [24:06<05:57, 35.75s/it]

------------------------------ STEP 39 ------------------------------
rewards/mean:	3.101176262	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.069056635	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.878402948	<---- model-estimated average discounted reward
objective/kl:	13.739465714	<---- how far we are from the original model (regularizer)



 82%|████████▏ | 41/50 [24:46<05:31, 36.83s/it]

------------------------------ STEP 40 ------------------------------
rewards/mean:	3.492445946	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.196073429	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.861994147	<---- model-estimated average discounted reward
objective/kl:	14.708334923	<---- how far we are from the original model (regularizer)



 84%|████████▍ | 42/50 [25:23<04:55, 37.00s/it]

------------------------------ STEP 41 ------------------------------
rewards/mean:	3.222915173	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.204125952	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.902049065	<---- model-estimated average discounted reward
objective/kl:	14.603998184	<---- how far we are from the original model (regularizer)



 86%|████████▌ | 43/50 [26:02<04:23, 37.65s/it]

------------------------------ STEP 42 ------------------------------
rewards/mean:	3.172294617	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.194576551	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.869485855	<---- model-estimated average discounted reward
objective/kl:	13.483745575	<---- how far we are from the original model (regularizer)



 88%|████████▊ | 44/50 [26:38<03:42, 37.05s/it]

------------------------------ STEP 43 ------------------------------
rewards/mean:	2.957304955	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.123395072	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.709170938	<---- model-estimated average discounted reward
objective/kl:	14.789612770	<---- how far we are from the original model (regularizer)



 90%|█████████ | 45/50 [27:10<02:57, 35.50s/it]

------------------------------ STEP 44 ------------------------------
rewards/mean:	3.133522987	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.126433447	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.681799173	<---- model-estimated average discounted reward
objective/kl:	12.233568192	<---- how far we are from the original model (regularizer)



 92%|█████████▏| 46/50 [27:43<02:19, 34.89s/it]

------------------------------ STEP 45 ------------------------------
rewards/mean:	3.211109161	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.151836161	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.926575422	<---- model-estimated average discounted reward
objective/kl:	12.679422379	<---- how far we are from the original model (regularizer)



 94%|█████████▍| 47/50 [28:19<01:45, 35.16s/it]

------------------------------ STEP 46 ------------------------------
rewards/mean:	3.836676121	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.357288149	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	2.115183830	<---- model-estimated average discounted reward
objective/kl:	14.401512146	<---- how far we are from the original model (regularizer)



 96%|█████████▌| 48/50 [28:58<01:12, 36.32s/it]

------------------------------ STEP 47 ------------------------------
rewards/mean:	3.096509933	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.279054684	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.950516462	<---- model-estimated average discounted reward
objective/kl:	12.884355545	<---- how far we are from the original model (regularizer)



 98%|█████████▊| 49/50 [29:32<00:35, 35.55s/it]

------------------------------ STEP 48 ------------------------------
rewards/mean:	3.509265184	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.348117834	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	2.106852770	<---- model-estimated average discounted reward
objective/kl:	15.362712860	<---- how far we are from the original model (regularizer)



100%|██████████| 50/50 [30:03<00:00, 36.07s/it]

------------------------------ STEP 49 ------------------------------
rewards/mean:	3.398410797	<---- average reward over this batch (higher=better, noisy)
rewards/moving_avg:	3.363205723	<---- moving average reward (higher=better, less noisy)
ppo/returns/mean:	1.960811734	<---- model-estimated average discounted reward
objective/kl:	15.823799133	<---- how far we are from the original model (regularizer)
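
To help read the `objective/kl` column in the log, here is a minimal plain-Python sketch of how such a KL estimate and the KL-penalized per-token rewards can be computed. It rests on two assumptions: the estimate is the per-response sum of `log p_policy - log p_ref` over generated tokens, averaged over the batch, and the sequence-level score is added on the last token only. This is how TRL's PPO trainer is generally understood to behave by default, but nothing in this notebook confirms it. The helpers `sequence_kl` and `shaped_rewards`, the coefficient `beta`, and all numbers are hypothetical.

```python
# Hypothetical helpers for illustration; not part of TRL or this notebook.

def sequence_kl(policy_logprobs, ref_logprobs):
    """Sampled KL estimate for one response: sum of per-token log-prob differences."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def shaped_rewards(policy_logprobs, ref_logprobs, score, beta):
    """Per-token rewards: -beta * KL_t on every token, plus the score on the last token."""
    rewards = [-beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# Toy log-probabilities of the sampled tokens under the tuned policy
# and under the frozen reference model (two responses of different length)
policy = [[-1.0, -0.5, -0.25], [-2.0, -1.0]]
reference = [[-1.5, -1.0, -0.25], [-2.0, -3.0]]

kls = [sequence_kl(p, r) for p, r in zip(policy, reference)]
mean_kl = sum(kls) / len(kls)
print("per-sequence KL:", kls)
print("batch mean KL:", mean_kl)
print("shaped rewards:", shaped_rewards(policy[0], reference[0], score=2.0, beta=0.5))
```

Under these assumptions, every token on which the tuned policy is more confident than the reference costs a little reward. So when `rewards/mean` keeps rising while `objective/kl` climbs (it reaches about 15.8 at step 49), it is worth reading the generated samples to check that they are still fluent.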






In [122]:
assert average_reward > 2, f"moving-average reward is {average_reward:.3f}; expected > 2 after training"
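
The `rewards/moving_avg` values in the training log follow the exponential moving average computed in the training loop. As a sanity check, the sketch below replays the first four logged `rewards/mean` values. It assumes `gamma = 0.7` and a starting average of 0. Both are inferred from the logged numbers (the first moving average is 0.3 times the first batch reward), not copied from the cell that defines them. The variable `ema` is a hypothetical stand-in, so running this does not overwrite `average_reward`.

```python
# Assumed values, inferred from the log rather than copied from the notebook
gamma = 0.7
ema = 0.0

# (rewards/mean, rewards/moving_avg) pairs copied from steps 0-3 of the training log
logged = [
    (0.184813499, 0.055444050),
    (0.408388615, 0.161327419),
    (-0.085626602, 0.087241213),
    (-0.300073624, -0.028953238),
]

replayed = []
for batch_mean, _ in logged:
    # Same update rule as in the training loop
    ema = gamma * ema + (1 - gamma) * batch_mean
    replayed.append(ema)

for step, (value, (_, expected)) in enumerate(zip(replayed, logged)):
    print(f"step {step}: replayed={value:.9f} logged={expected:.9f}")
```

With `gamma = 0.7`, a single batch contributes only 30% of the current value, so the threshold of 2 is unlikely to be crossed by one lucky batch.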

Now test your PPO-tuned model by sampling a few continuations of the same prompt:

In [124]:
# Encode the prompt once and reuse it for 5 independent samples
prompt_ids = main_tokenizer.encode("The movie was", return_tensors='pt').to(device)[0]
inputs = [prompt_ids for _ in range(5)]

# Sample continuations from the PPO-tuned policy and decode them to text
response_tensors = ppo_trainer.generate(inputs, **generation_kwargs)
batch["response"] = [main_tokenizer.decode(response.squeeze()) for response in response_tensors]
for sample in batch["response"]:
	print(f'Sample: {sample}')

Sample: The movie was made by not being highschooled with the teen zombie flick, it actually made you wish you had been in it, as though it were eye candy...with a bit of self pity and misreading the art.<|endoftext|>
Sample: The movie was a remake of The Matrix - but IMDb has static images instead of static images: the anti-film take it out is junk. There is a sign published saying there is bad remake.<|endoftext|>
Sample: The movie was just called'movie', 'poster'. And the direction. o laurels make it look bad. The sales people could not have sell it. The 80's movies were just wooden from what I have seen in the past 8 years, the only movie that ever made it was called 'Do Not Miss Me'. And that movie with Hillary cyber scam exposed to people and corporations' is nothing comparison, and that movie is awful. This is the worst movie movie in garbage every time.<|endoftext|>
Sample: The movie was better than the film. It got everything superfluier than the film. For it gets to the very 