#### Elad Prager
<br>

### Final project in the class: Information-theoretic analysis of neural language models
#### Reichman University, Fall 2022-2023

# Part 1

## Summarization Quality

Masked language modeling is the task of hiding certain words in a sentence and predicting suitable replacements for them. In the first stage of the project, I implemented this task on a given document and hypothesized that prefixing the document with its summary improves performance on the concatenated summary and document compared with the document alone. The rationale is that the summary, being a concise representation of the document, provides additional context and therefore facilitates more accurate predictions for the masked words. Moreover, the more effectively a summary captures the essence of the document, the more accurate the model's predictions are expected to be. If this hypothesis holds, masked-word prediction accuracy could serve as a new metric for evaluating the quality of summaries.

### Initial imports

In [None]:
!pip install wandb
!pip install datasets
!pip install transformers
!pip install trl

In [None]:
import torch
import wandb
import time
import os
from tqdm import tqdm
import numpy as np
import pandas as pd
from random import choices
import matplotlib.pyplot as plt
tqdm.pandas()
from datasets import load_dataset
from transformers import GPT2Tokenizer, AutoModelForSequenceClassification, PegasusTokenizer, PegasusForConditionalGeneration, BertTokenizer, BertForMaskedLM, BertModel, pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForTokenClassification
from trl.gpt2 import GPT2HeadWithValueModel, respond_to_batch
from trl.ppo import PPOTrainer
from trl.core import build_bert_batch_from_txt
from pprint import pprint
import random
from operator import itemgetter
import spacy
from random import randrange
from statistics import mean

In [None]:
from google.colab import drive
from IPython.display import Image
drive.mount('/content/drive')

### Dataset

In [None]:
dataset_comparison = load_dataset("openai/summarize_from_feedback", 'comparisons')

In [None]:
len(dataset_comparison['train'])

In [None]:
len(dataset_comparison['validation'])

### Comparison dataset - Dataset overview

In [None]:
dataset_comparison['train']['info'][1]

In [None]:
dataset_comparison['train']['summaries'][1]

In [None]:
dataset_comparison['train']['choice'][1]

### Comparison dataset - Get documents and summaries groups

In [None]:
documents = [features['info']['post'] for features in dataset_comparison['train']]
doc_length = [len(x)for x in documents]

summaries_A = [features['summaries'][0]['text'] if features['choice']==0 else features['summaries'][1]['text'] for features in dataset_comparison['train']]
summaries_B = [features['summaries'][1]['text'] if features['choice']==0 else features['summaries'][0]['text'] for features in dataset_comparison['train']]

Following the hypothesis, if a summary captures the essence of the document effectively, as the human-preferred summaries in the "summaries_A" group are expected to, the model's predictions should be more accurate than with the "summaries_B" group. Furthermore, performance on the concatenated summaries and documents is expected to be better than on the documents alone.

Let's sort the data by document length.

In [None]:
sorted_zip = [list(x) for x in zip(*sorted(zip(doc_length, documents, summaries_A, summaries_B), key=itemgetter(0)))]
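
The `zip` / `sorted` / `zip(*...)` idiom above keeps the parallel lists aligned while ordering them by document length. A minimal sketch on toy data (the strings and the `toy_*` names below are made up for illustration):

```python
from operator import itemgetter

# Toy stand-ins for the real documents and summaries (hypothetical data).
toy_documents = ["a much longer document", "short", "medium text"]
toy_summaries_A = ["sum A long", "sum A short", "sum A medium"]
toy_summaries_B = ["sum B long", "sum B short", "sum B medium"]
toy_lengths = [len(d) for d in toy_documents]

# Zip the lists into rows, sort the rows by length, then unzip back into columns.
rows = sorted(zip(toy_lengths, toy_documents, toy_summaries_A, toy_summaries_B),
              key=itemgetter(0))
sorted_lengths, sorted_docs, sorted_A, sorted_B = (list(col) for col in zip(*rows))

print(sorted_docs)
print(sorted_A)
```

Each summary stays attached to its document because whole rows are moved together.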

In [None]:
documents = sorted_zip[1][50:]
summaries_A = sorted_zip[2][50:]
summaries_B = sorted_zip[3][50:]

In [None]:
len_documents = [len(x) for x in documents]
plt.plot(len_documents)
plt.title('documents length')
plt.show()

In [None]:
print(f'documents len: {len(documents)}')
print(f'summaries_A len: {len(summaries_A)}')
print(f'summaries_B len: {len(summaries_B)}')

### Mask document

In this process, a random word in the given document will be obscured, with the constraint that the masked word shall not be:
1.   a stop word
2.   the initial word of the document
3.   a numeric value.

Of course, this method can be further improved by implementing additional logical constraints.





In [None]:
nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words
print(stop_words)
print(f'num of stop words: {len(stop_words)}')

In [None]:
def get_masked_documents(documents):
  """Mask one random word per document; return the masked texts and the masked words."""
  documents_masked_words = []
  documents_masked = []

  for document in documents:
    document_words = document.split()
    valid_for_masking = False

    # Resample until the word is not a stop word, not the first word and not numeric.
    # Note: this loops forever if the document contains no valid word.
    while not valid_for_masking:
      mask_index = randrange(len(document_words))
      mask_word = document_words[mask_index].lower()
      valid_for_masking = mask_word not in stop_words and mask_index != 0 and not mask_word.isnumeric()

    document_words[mask_index] = '[MASK]'

    documents_masked.append(" ".join(document_words))
    documents_masked_words.append(mask_word)
  return documents_masked, documents_masked_words
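
One possible additional constraint, not part of the function above, is to strip surrounding punctuation from the candidate word so that the stored target (e.g. `great` rather than `great,`) has a chance of matching a single predicted token. The sketch below is a hypothetical variant: `pick_mask_index`, the tiny `TOY_STOP_WORDS` set and the `max_tries` limit are illustrative assumptions. The retry limit also avoids looping forever on a document with no valid word.

```python
import random
import string

# Tiny illustrative stop-word set (assumption; the notebook uses spaCy's list).
TOY_STOP_WORDS = {"the", "a", "was", "and", "it", "i"}

def pick_mask_index(words, stop_words=TOY_STOP_WORDS, max_tries=100, rng=random):
    """Return (index, cleaned_word) for a maskable word, or None if none is found."""
    if not words:
        return None
    for _ in range(max_tries):
        index = rng.randrange(len(words))
        cleaned = words[index].lower().strip(string.punctuation)
        is_valid = (
            index != 0                      # not the first word of the document
            and cleaned != ""               # not pure punctuation
            and cleaned not in stop_words   # not a stop word
            and not cleaned.isnumeric()     # not a numeric value
        )
        if is_valid:
            return index, cleaned
    return None

words = "The movie was great, and I watched it 3 times.".split()
result = pick_mask_index(words, rng=random.Random(0))
if result is not None:
    index, target = result
    masked = words[:index] + ["[MASK]"] + words[index + 1:]
    print(" ".join(masked), "->", target)
```

Returning `None` instead of retrying indefinitely lets the caller decide whether to skip such a document.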

In [None]:
documents_masked, documents_masked_words  = get_masked_documents(documents)

In [None]:
print(f'documents_masked len: {len(documents_masked)}')
print(f'documents_masked_words len: {len(documents_masked_words)}')

### Comparison dataset - Concatenate summaries with documents

In [None]:
summary_A_concat_document = []
summary_B_concat_document = []

for i, document in enumerate(documents_masked):
  summary_A_concat_document.append(summaries_A[i] + '\n' + document)
  summary_B_concat_document.append(summaries_B[i] + '\n' + document)

In [None]:
print(f'summary_A_concat_document len: {len(summary_A_concat_document)}')
print(f'summary_B_concat_document len: {len(summary_B_concat_document)}')

### Comparison dataset - Sanity check

In [None]:
pprint(documents[25])

In [None]:
pprint(documents_masked[25])

In [None]:
documents_masked_words[25]

In [None]:
summaries_A[25]

In [None]:
pprint(summary_A_concat_document[25])

In [None]:
pprint(summary_B_concat_document[25])

### load pipeline

In [None]:
unmasker = pipeline('fill-mask', "bert-base-uncased")

### Comparison dataset - Fill masks in 3 methods (document only, summary_A prefix, summary_B prefix)

In [None]:
documents_fill_mask_words = []

document_fill_mask = [unmasker(x) for x in documents_masked[:2500]]
for document in document_fill_mask:

  max_mask = max([mask['score'] for mask in document])
  filled_token = [mask['token_str'].replace(" ", "") for mask in document if mask['score'] == max_mask][0]

  documents_fill_mask_words.append(filled_token)
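
The fill-mask cells in this section repeat the same selection logic. A minimal sketch of a shared helper, assuming the usual `fill-mask` pipeline output for a single `[MASK]` (a list of candidate dicts with `score` and `token_str` keys). The helper name `top_prediction` and the mock candidates are illustrative and not part of the original notebook:

```python
def top_prediction(candidates):
    """Return the token string of the highest-scoring fill-mask candidate."""
    best = max(candidates, key=lambda candidate: candidate["score"])
    return best["token_str"].replace(" ", "")

# Mock pipeline output for one masked document (made-up scores).
mock_candidates = [
    {"score": 0.12, "token_str": "film"},
    {"score": 0.61, "token_str": "movie"},
    {"score": 0.05, "token_str": "show"},
]

print(top_prediction(mock_candidates))  # prints: movie
```

With such a helper, each of the three runs reduces to a single list comprehension over the pipeline outputs.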

In [None]:
summaries_A_fill_mask_words = []

summary_A_fill_mask = [unmasker(x) for x in summary_A_concat_document[:2500]]
for document in summary_A_fill_mask:

  max_mask = max([mask['score'] for mask in document])
  filled_token = [mask['token_str'].replace(" ", "") for mask in document if mask['score'] == max_mask][0]

  summaries_A_fill_mask_words.append(filled_token)

In [None]:
summaries_B_fill_mask_words = []

summary_B_fill_mask = [unmasker(x) for x in summary_B_concat_document[:2500]]
for document in summary_B_fill_mask:

  max_mask = max([mask['score'] for mask in document])
  filled_token = [mask['token_str'].replace(" ", "") for mask in document if mask['score'] == max_mask][0]

  summaries_B_fill_mask_words.append(filled_token)

### Comparison dataset - Evaluation

In [None]:
df = pd.DataFrame({
    'summaries_A': summaries_A_fill_mask_words[:30], 
    'summaries_B': summaries_B_fill_mask_words[:30],
    'documents': documents_fill_mask_words[:30],
    'original_word': documents_masked_words[:30]
})
df

In [None]:
def get_accuracy(document):
  accuracies = []

  for i, w in enumerate(document):
    is_correct = 0
    if w == documents_masked_words[i]:
      is_correct = 1
    accuracies.append(is_correct)
  mean_accuracy = mean(accuracies)
  return mean_accuracy

In [None]:
acc_summaries_A = get_accuracy(summaries_A_fill_mask_words)
print(f'concat_summary_A: {acc_summaries_A*100}%')

acc_summaries_B = get_accuracy(summaries_B_fill_mask_words)
print(f'concat_summary_B: {acc_summaries_B*100}%')

acc_baseline = get_accuracy(documents_fill_mask_words)
print(f'masked_document_only: {acc_baseline*100}%')
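
If the hypothesis holds, the gap between the accuracy with a summary prefix and the document-only baseline could act as the summary-quality score. A minimal sketch on made-up predictions follows. `accuracy` and `summary_gain` are hypothetical helper names, and this definition of the gain is one possible formalization rather than the notebook's method:

```python
def accuracy(predictions, targets):
    """Fraction of predictions that exactly match the masked target words."""
    pairs = list(zip(predictions, targets))
    return sum(pred == target for pred, target in pairs) / len(pairs)

def summary_gain(with_summary, baseline, targets):
    """Accuracy improvement from prefixing the summary (can be negative)."""
    return accuracy(with_summary, targets) - accuracy(baseline, targets)

# Made-up data for illustration only.
targets = ["movie", "dog", "school", "car"]
baseline_preds = ["movie", "cat", "house", "car"]   # 2 of 4 correct
summary_preds = ["movie", "dog", "house", "car"]    # 3 of 4 correct

print(accuracy(baseline_preds, targets))                     # 0.5
print(summary_gain(summary_preds, baseline_preds, targets))  # 0.25
```

A positive gain means the summary helped the model recover the masked word, and a gain near zero means it added little beyond the document itself.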

In [None]:
x = ['summaries_A','summaries_B','baseline']
y = [acc_summaries_A, acc_summaries_B, acc_baseline]
plt.ylim(0.29, 0.32)
plt.bar(x, y)

### Axis dataset

In [None]:
dataset_axis = load_dataset("openai/summarize_from_feedback", 'axis')

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure10.PNG', width="600")

In [None]:
len(dataset_axis['validation'])

In [None]:
len(dataset_axis['test'])

### Axis dataset - Dataset overview

In [None]:
dataset_axis['validation']['info'][100]

In [None]:
dataset_axis['validation']['summary'][100]

In [None]:
dataset_axis['validation']['info'][101]

In [None]:
dataset_axis['validation']['summary'][101]

### Axis dataset - Get documents and summaries

In [None]:
axis_documents = [features['post'] for features in dataset_axis['validation']['info']]
axis_doc_length = [len(x)for x in axis_documents]

overall_7 = [features['text'] if features['axes']['overall']==7 else None for features in dataset_axis['validation']['summary']]
overall_6 = [features['text'] if features['axes']['overall']==6 else None for features in dataset_axis['validation']['summary']]
overall_5 = [features['text'] if features['axes']['overall']==5 else None for features in dataset_axis['validation']['summary']]
overall_4 = [features['text'] if features['axes']['overall']==4 else None for features in dataset_axis['validation']['summary']]
overall_3 = [features['text'] if features['axes']['overall']==3 else None for features in dataset_axis['validation']['summary']]
overall_2 = [features['text'] if features['axes']['overall']==2 else None for features in dataset_axis['validation']['summary']]
overall_1 = [features['text'] if features['axes']['overall']==1 else None for features in dataset_axis['validation']['summary']]

Let's sort the data by document length.

In [None]:
axis_sorted_zip = [list(x) for x in zip(*sorted(zip(axis_doc_length, axis_documents, overall_7, overall_6, overall_5, overall_4, overall_3, overall_2, overall_1), key=itemgetter(0)))]

In [None]:
axis_documents = axis_sorted_zip[1][50:]
overall_7 = axis_sorted_zip[2][50:]
overall_6 = axis_sorted_zip[3][50:]
overall_5 = axis_sorted_zip[4][50:]
overall_4 = axis_sorted_zip[5][50:]
overall_3 = axis_sorted_zip[6][50:]
overall_2 = axis_sorted_zip[7][50:]
overall_1 = axis_sorted_zip[8][50:]

In [None]:
len_axis_documents = [len(x) for x in axis_documents]
plt.plot(len_axis_documents)
plt.title('documents length')
plt.show()

In [None]:
print(f'axis_documents: {len(axis_documents)}')
print(f'overall_7: {len(overall_7)}')
print(f'overall_6: {len(overall_6)}')
print(f'overall_5: {len(overall_5)}')
print(f'overall_4: {len(overall_4)}')
print(f'overall_3: {len(overall_3)}')
print(f'overall_2: {len(overall_2)}')
print(f'overall_1: {len(overall_1)}')

In [None]:
axis_documents_masked, axis_documents_masked_words  = get_masked_documents(axis_documents)

In [None]:
print(f'axis_documents_masked len: {len(axis_documents_masked)}')
print(f'axis_documents_masked_words len: {len(axis_documents_masked_words)}')

### Axis dataset - Concatenate summaries with documents

In [None]:
overall_7_concat_document = []
overall_6_concat_document = []
overall_5_concat_document = []
overall_4_concat_document = []
overall_3_concat_document = []
overall_2_concat_document = []
overall_1_concat_document = []

for i, axis_document in enumerate(axis_documents_masked):
  overall_7_concat_document.append(overall_7[i] + '\n' + axis_document) if overall_7[i] != None else overall_7_concat_document.append(None)
  overall_6_concat_document.append(overall_6[i] + '\n' + axis_document) if overall_6[i] != None else overall_6_concat_document.append(None)
  overall_5_concat_document.append(overall_5[i] + '\n' + axis_document) if overall_5[i] != None else overall_5_concat_document.append(None)
  overall_4_concat_document.append(overall_4[i] + '\n' + axis_document) if overall_4[i] != None else overall_4_concat_document.append(None)
  overall_3_concat_document.append(overall_3[i] + '\n' + axis_document) if overall_3[i] != None else overall_3_concat_document.append(None)
  overall_2_concat_document.append(overall_2[i] + '\n' + axis_document) if overall_2[i] != None else overall_2_concat_document.append(None)
  overall_1_concat_document.append(overall_1[i] + '\n' + axis_document) if overall_1[i] != None else overall_1_concat_document.append(None)
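
The seven near-identical lines above apply one rule: prefix the summary when it exists, otherwise keep a `None` placeholder so the indices stay aligned with the documents. A minimal sketch of that rule as a helper (`prefix_with_summary` and the toy data are hypothetical names introduced for illustration):

```python
def prefix_with_summary(summary, masked_document):
    """Prepend the summary to the masked document, or return None if there is no summary."""
    if summary is None:
        return None
    return summary + "\n" + masked_document

# Toy data: the second document has no summary with the requested rating.
toy_summaries = ["Short recap.", None]
toy_documents = ["A [MASK] document.", "Another [MASK] one."]

concatenated = [prefix_with_summary(s, d) for s, d in zip(toy_summaries, toy_documents)]
print(concatenated)
```

Using `is None` rather than `!= None` is the idiomatic identity check in Python.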

In [None]:
print(f'overall_7_concat_document len: {len(overall_7_concat_document)}')
print(f'overall_6_concat_document len: {len(overall_6_concat_document)}')
print(f'overall_5_concat_document len: {len(overall_5_concat_document)}')
print(f'overall_4_concat_document len: {len(overall_4_concat_document)}')
print(f'overall_3_concat_document len: {len(overall_3_concat_document)}')
print(f'overall_2_concat_document len: {len(overall_2_concat_document)}')
print(f'overall_1_concat_document len: {len(overall_1_concat_document)}')

### Axis dataset - Sanity check

In [None]:
axis_documents_masked_words[18]

In [None]:
pprint(axis_documents[18])

In [None]:
overall_7[18]

In [None]:
pprint(overall_7_concat_document[18])

In [None]:
overall_7_concat_document_valid = [x for x in overall_7_concat_document[:2500] if x != None]
overall_6_concat_document_valid = [x for x in overall_6_concat_document[:2500] if x != None]
overall_5_concat_document_valid = [x for x in overall_5_concat_document[:2500] if x != None]
overall_4_concat_document_valid = [x for x in overall_4_concat_document[:2500] if x != None]
overall_3_concat_document_valid = [x for x in overall_3_concat_document[:2500] if x != None]
overall_2_concat_document_valid = [x for x in overall_2_concat_document[:2500] if x != None]
overall_1_concat_document_valid = [x for x in overall_1_concat_document[:2500] if x != None]

print(f'overall_7_valid_summaries length: {len(overall_7_concat_document_valid)}')
print(f'overall_6_valid_summaries length: {len(overall_6_concat_document_valid)}')
print(f'overall_5_valid_summaries length: {len(overall_5_concat_document_valid)}')
print(f'overall_4_valid_summaries length: {len(overall_4_concat_document_valid)}')
print(f'overall_3_valid_summaries length: {len(overall_3_concat_document_valid)}')
print(f'overall_2_valid_summaries length: {len(overall_2_concat_document_valid)}')
print(f'overall_1_valid_summaries length: {len(overall_1_concat_document_valid)}')

Because the number of valid summaries differs considerably between the groups, it is more appropriate to compare groups of similar size. The "overall_7" and "overall_4" groups were therefore selected for further experimentation.<br><br>Following the hypothesis, if a summary captures the essence of the document more effectively, as in the "overall_7" group, the model's predictions are expected to be more accurate than with the "overall_4" group. Furthermore, performance on the concatenated summaries and documents is expected to be better than on the documents alone.

### Axis dataset - Fill masks in 3 methods (document only, overall_7 prefix, overall_4 prefix)

In [None]:
axis_documents_fill_mask_words = []

axis_documents_fill_mask = [unmasker(x) for x in axis_documents_masked[:500]]
for document in axis_documents_fill_mask:

  max_mask = max([mask['score'] for mask in document])
  filled_token = [mask['token_str'].replace(" ", "") for mask in document if mask['score'] == max_mask][0]

  axis_documents_fill_mask_words.append(filled_token)

In [None]:
overall_7_fill_mask_words = []

overall_7_fill_mask = [unmasker(x) if x != None else None for x in overall_7_concat_document[:2500]]
for document in overall_7_fill_mask:
  if document != None:
    max_mask = max([mask['score'] for mask in document])
    filled_token = [mask['token_str'].replace(" ", "") for mask in document if mask['score'] == max_mask][0]
  else:
    filled_token = None
  overall_7_fill_mask_words.append(filled_token)

In [None]:
overall_4_fill_mask_words = []

overall_4_fill_mask = [unmasker(x) if x != None else None for x in overall_4_concat_document[:2500]]
for document in overall_4_fill_mask:
  if document != None:
    max_mask = max([mask['score'] for mask in document])
    filled_token = [mask['token_str'].replace(" ", "") for mask in document if mask['score'] == max_mask][0]
  else:
    filled_token = None
  overall_4_fill_mask_words.append(filled_token)

### Axis dataset - Evaluation

In [None]:
df = pd.DataFrame({
    'overall_4': overall_4_fill_mask_words[:30],
    'overall_7': overall_7_fill_mask_words[:30],
    'axis_documents': axis_documents_fill_mask_words[:30],
    'original_word': axis_documents_masked_words[:30]
})
df

In [None]:
def axis_get_accuracy(document):
  accuracies = []

  for i, w in enumerate(document):
    if w == None:
      continue
    is_correct = 0
    if w == axis_documents_masked_words[i]:
      is_correct = 1
    accuracies.append(is_correct)
  mean_accuracy = mean(accuracies)
  return mean_accuracy

In [None]:
acc_overall_7 = axis_get_accuracy(overall_7_fill_mask_words)
print(f'concat_overall_7: {acc_overall_7*100}%')

acc_overall_4 = axis_get_accuracy(overall_4_fill_mask_words)
print(f'concat_overall_4: {acc_overall_4*100}%')

acc_baseline = axis_get_accuracy(axis_documents_fill_mask_words)
print(f'masked_document_only: {acc_baseline*100}%')

In [None]:
x = ['overall_7','overall_4','baseline']
y = [acc_overall_7, acc_overall_4, acc_baseline]
plt.ylim(0.25, 0.32)
plt.bar(x, y)

### Future Work

* Try different approaches for masking words
* Compare to known evaluation techniques (ROUGE, etc.)
* Experiment on the entire comparison dataset
* Experiment on several new datasets
* Experiment with question answering

# Part 2

## Controlled Text Generation

Controlled text generation is the task of creating natural language text that **follows specific input or constraints**, such as a particular **style, tone**, or specific words or **phrases**. It can be useful in applications such as content **generation for social media or marketing**. **Recent advances** in machine learning, such as large language models and **reinforcement learning** algorithms, have led to significant progress in the field: these models learn from large amounts of data and generate text that is more human-like and natural than previous methods.

### Learning to summarize from human feedback

**The paper** "Learning to summarize from human feedback" presents a method of training a machine learning model to generate text summaries by **combining supervised learning with reinforcement learning**. The model is trained on a dataset of text documents and their corresponding human-written summaries. The model generates a summary, which is evaluated by a human annotator who **provides a reward signal** indicating how well the summary captures the main points of the document. This **feedback is used by the model to adjust its internal parameters** and generate better summaries in the future. The authors show that their approach generates high-quality summaries and **outperforms other methods** that don't use human feedback. Additionally, the model **can learn from a small amount of feedback** and can generate summaries for unseen documents.

### Pipeline

In [1]:
Image(filename='/content/drive/MyDrive/language_models/figure2.PNG', width="1000") 


### Collect human feedback

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure5.png', width="300") 

https://openaipublic.blob.core.windows.net/summarize-from-feedback/website/index.html#/tldr_comparisons

**Dataset:** The authors used a dataset of **3 million posts from the website reddit.com**, along with summaries of the posts written by the original poster. **They filtered the dataset** to ensure quality, using a whitelist of subreddits that are **understandable** to the general population and including only posts whose human-written summaries contain between 24 and 48 tokens, to minimize the potential effect of **summary length** on quality. The **final filtered dataset contains 123,169 posts**, of which 5% were held out as a validation set. The paper refers to this dataset as "TL;DR".

**Labelers:** The paper describes a **known problem**: a mismatch between the quality the researchers intended and what the human labelers actually evaluated, which resulted in model-generated summaries that were **high-quality according to labelers but low-quality according to researchers**. To improve the quality of the human data, the authors established a hands-on relationship with the labelers: they gave **detailed instructions**, answered questions, and provided **regular feedback**. The researchers trained labelers to ensure high agreement with their own judgments and continuously monitored the agreement between labelers and researchers throughout the project. As a result, **they obtained high agreement between labelers and researchers**: on a subset of comparison tasks, labelers agreed with researchers 77% ± 2% of the time, while researchers agreed with each other 73% ± 4% of the time.

**Policies:** The authors generate summaries with several different summarization policies. These include:

* **Title:** This policy involves using the title of the text as a summary, which is a common practice in summarization as the title is often thought to convey the main idea of the text.

* **Lead-2:** This policy involves using the first 2 sentences of the text as a summary. These two sentences are often used as a lead in news articles and are thought to convey the main idea of the text.

* **Reference Summary:** This policy involves using the human-written summary of the text as the summary. These are summaries written by the original poster of the text which are labeled as 'TL;DR' in the dataset.

* **Pretrain-only:** This policy uses a transformer model that is only pretrained on a large text corpus to autoregressively predict the next token. These models serve as 'zero-shot' baselines: the context is padded with examples of high-quality summaries from the dataset.

* **Supervised learning:** This policy involves fine-tuning the transformer model with supervised learning to predict summaries from the filtered TL;DR dataset. These models are then used to sample initial summaries for collecting comparisons and as baselines for evaluation.

### Train reward model

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure6.png', width="300") 

**Reward Model (RM):** The RM is a machine learning model that uses supervised learning to predict which of two given summaries is better. **The input to the RM is a post and the two summaries judged by the human, and the output is a scalar value for each summary** that represents its predicted quality. The RM starts from the model fine-tuned on the filtered dataset of posts and human-written summaries, a randomly initialized linear head is added on top, and the result is trained to predict which summary is better as judged by a human. The trained RM is later used as the reward signal when training the human feedback policy, so that the policy generates outputs that humans judge to be of higher quality.

**Loss Function:** The reward model was trained to predict which summary $y \in \{y_0, y_1\}$ is better as judged by a human, given a post $x$. If the summary preferred by the human is $y_i$, the RM loss is:<br><br>**$\text{loss}(r_\theta) = -\mathbb{E}_{(x, y_0, y_1, i) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_i) - r_\theta(x, y_{1-i})\right)\right)\right]$**<br><br>where $r_\theta(x, y)$ is the scalar output of the reward model for post $x$ and summary $y$ with parameters $\theta$, and $D$ is the dataset of human judgments. In order to **minimize the loss** we would want to **maximize the difference between the rewards** assigned to the two summaries, with the **preferred summary receiving a higher reward** and the other receiving a lower reward.
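
To make the behaviour of this loss concrete, here is a minimal sketch for single comparisons in plain Python. The function name `pairwise_rm_loss` and the example reward values are made up for illustration. A real reward model would compute the rewards with a neural network and average the loss over a batch.

```python
import math

def pairwise_rm_loss(reward_preferred, reward_other):
    """-log(sigmoid(r(x, y_i) - r(x, y_{1-i}))) for one human comparison.

    Written as softplus(-margin) so that large margins do not overflow.
    """
    margin = reward_preferred - reward_other
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Preferred summary scored clearly higher: small loss.
print(pairwise_rm_loss(2.0, -1.0))
# Both summaries scored the same: loss equals log(2).
print(pairwise_rm_loss(0.0, 0.0))
# Preferred summary scored lower: large loss.
print(pairwise_rm_loss(-1.0, 2.0))
```

The loss only depends on the difference between the two rewards, so the model is free to choose the absolute scale of its scores.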

### Train policy with PPO

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure7.png', width="300") 

**RL Overview:** In reinforcement learning, an **agent learns by interacting with an environment**, in order to **maximize a reward signal**. At each step, the agent receives an observation of the state of the environment and takes an action. The action results in a change in the state of the environment, and the agent receives a reward signal. The goal of the agent is to **learn a policy that maximizes the expected cumulative reward over time**. The agent learns through **trial and error**, using feedback from the environment in the form of the reward signal. It adjusts its actions based on this feedback, in order to increase the reward signal it receives. This process is called learning by reinforcement, because the agent is **reinforced with a positive signal** when it takes actions that lead to good outcomes.
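
The interaction loop described above can be sketched with a toy environment. The example below is an illustrative assumption, not taken from the paper: a two-armed bandit (an environment with a single state) and an epsilon-greedy agent, chosen only because they fit in a few lines. `run_bandit` and its parameters are hypothetical names.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy bandit; returns value estimates and action counts."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running mean reward per action
    for _ in range(steps):
        # Policy: mostly exploit the best-looking action, sometimes explore.
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))
        else:
            action = max(range(len(values)), key=lambda a: values[a])
        # Environment: give a reward of 1 with the action's hidden probability.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Learning: move the estimate for this action toward the observed reward.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = run_bandit([0.2, 0.8])
print("estimated values:", values)
print("times each action was chosen:", counts)
```

Through trial and error the agent ends up choosing the action with the higher reward probability most of the time, which is the "reinforcement" described in the text.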

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure12.PNG', width="600") 

Policy example:

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure13.PNG', width="400")

**Agent Example:** https://www.youtube.com/watch?v=hJLmXezsjcg

**Play Atari Games:** In the paper "Playing Atari with Deep Reinforcement Learning", the authors present a method for training a machine learning model to play Atari video games using deep reinforcement learning. They demonstrate that a deep neural network can learn to play a variety of Atari games by interacting with the game environment and receiving feedback in the form of a **reward signal based on the game score**. The authors show that their approach is effective and that the model is able to **learn directly from raw pixel inputs and a scalar reward signal**. This work was an important step in the development of deep reinforcement learning and has had a significant impact on the field.

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure4.PNG', width="600") 

**Proximal Policy Optimization:** PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance. It has several key characteristics. Some of these include:

* **On-policy learning:** PPO is an on-policy algorithm, which means that it uses the current policy to generate the data the agent learns from. This tends to make training stable, although on-policy methods are generally less sample-efficient than off-policy ones, because fresh data must be collected after the policy changes.

* **Discrete and continuous action spaces:** PPO works in environments with continuous action spaces, such as robotic control, as well as discrete ones, such as game playing or choosing the next token. It addresses the difficulty of optimizing the policy directly by optimizing a surrogate objective function instead of the true objective.

* **Clipped Surrogate Objective:** In PPO, the policy update uses a "clipped" objective function, which prevents the update step from moving too far away from the previous policy. This makes the update more stable and helps avoid overshooting the optimal policy.

* **Trust Region Method:** PPO is motivated by trust region methods: each update is kept within a region close to the previous policy. Instead of solving a constrained optimization problem, PPO approximates this constraint through the clipped objective or a KL penalty. This helps prevent large updates that could make the policy worse.

* **Adaptive KL Penalty:** A variant of PPO adds a penalty term to the loss function for the KL divergence between the current policy and the old one, with a coefficient that is adapted during training. This prevents the policy from shifting too far from the old one, which makes training more stable.

* **Value Function:** PPO uses a separate value function that estimates the expected return of each state. The value estimates are used to compute advantages, that is, how much better an action turned out than expected, which reduces the variance of the policy update.

Information regarding the PPO **loss function** can be found in: https://huggingface.co/blog/deep-rl-ppo
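
As a small illustration of the clipped surrogate objective mentioned in the list above, the sketch below evaluates it for single samples. Here `ratio` stands for the probability ratio between the new and the old policy for the taken action, and `advantage` for the estimated advantage. The function name and the example numbers are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, cliprange=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0), policy already moved far toward it: the gain is capped.
print(clipped_surrogate(1.5, 2.0))
# Good action, policy moved away from it: no clipping, the full penalty applies.
print(clipped_surrogate(0.5, 2.0))
# Bad action (A < 0), policy already moved away from it: the gain is capped.
print(clipped_surrogate(0.5, -2.0))
# Bad action, policy moved toward it: no clipping, the full penalty applies.
print(clipped_surrogate(1.5, -2.0))
```

Taking the minimum makes the objective pessimistic: moving further in a direction that already looks good earns no extra credit once the ratio leaves the interval [1 - eps, 1 + eps], while mistakes are never hidden by the clipping.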

### Results

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure9.PNG', width="600")

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure0.png', width="500")

From the charts, it can be inferred that the results of the "human feedback" model are superior to those of all other models.

### ChatGPT

It is worth noting that ChatGPT employs similar techniques, including training a reward model, ranking outputs by a labeler, and utilizing the Reinforcement Learning PPO algorithm to enhance the policy, resulting in improved output.

In [None]:
Image(filename='/content/drive/MyDrive/language_models/figure8.PNG', width="1000")

## Generate controlled sentiment reviews

### hyperparameters

In [None]:
config = {
    "lm_name": "lvwerra/gpt2-imdb",
    "ref_lm_name": "lvwerra/gpt2-imdb",
    "cls_model_name": "lvwerra/distilbert-imdb",
    "tk_name": "gpt2",
    "steps": 25600,
    "batch_size": 128,
    "forward_batch_size": 8,
    "ppo_epochs": 4,   
    "txt_in_len": 5,
    "txt_out_len": 20,
    "lr": 1.41e-5,
    "init_kl_coef":0.2,
    "target": 6,
    "horizon":10000,
    "gamma":1,
    "lam":0.95,
    "cliprange": .2,
    "cliprange_value":.2,
    "vf_coef":.1, 
    "seed": 1,
}
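
The `init_kl_coef`, `target` and `horizon` entries relate to the adaptive KL penalty described earlier. The sketch below shows one common update rule for such a coefficient, the proportional controller described in "Fine-Tuning Language Models from Human Preferences" (Ziegler et al.). Whether the installed `trl` version applies exactly this rule is an assumption, and the class name `AdaptiveKLCoefficient` is made up for illustration.

```python
class AdaptiveKLCoefficient:
    """Scale the KL penalty up when the measured KL exceeds the target, down otherwise."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped so a single batch cannot change the coefficient much.
        proportional_error = max(-0.2, min(current_kl / self.target - 1.0, 0.2))
        self.value *= 1.0 + proportional_error * n_steps / self.horizon
        return self.value

# Values taken from the config above; the KL measurements are made up.
kl_ctl = AdaptiveKLCoefficient(init_kl_coef=0.2, target=6, horizon=10000)
print(kl_ctl.update(current_kl=12.0, n_steps=128))  # KL above target: coefficient grows
print(kl_ctl.update(current_kl=3.0, n_steps=128))   # KL below target: coefficient shrinks
```

A larger `horizon` makes the coefficient react more slowly to deviations from the target.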

In [None]:
np.random.seed(config['seed'])

In [None]:
import wandb
wandb.init(name='long-response', project='gpt2-ctrl', config=config)

### IMDB Dataset

In [None]:
# load imdb with datasets
ds = load_dataset('imdb', split='train')
ds = ds.rename_columns({'text': 'review', 'label': 'sentiment'})
ds.set_format('pandas')
df = ds[:]

# make sure the comments are long enough
df = df.loc[df['review'].str.len() > 500].copy()

# make sure comments are not too long
df['review'] = df['review'].apply(lambda x: x[:1000])

df.head()

### Sentiment model

In [None]:
sentiment_model = AutoModelForSequenceClassification.from_pretrained(config["cls_model_name"])
sentiment_tokenizer = AutoTokenizer.from_pretrained(config["cls_model_name"])

In [None]:
text = 'this movie was really bad!!'
output = sentiment_model.forward(sentiment_tokenizer.encode(text, return_tensors="pt"))
output

In [None]:
text = 'this movie was really good!!'
output = sentiment_model.forward(sentiment_tokenizer.encode(text, return_tensors="pt"))
output

In [None]:
text = 'this movie was a documentary'
output = sentiment_model.forward(sentiment_tokenizer.encode(text, return_tensors="pt"))
output

The resulting reward signal:

In [None]:
output[0][:, 1]

### GPT model and tokenizer

In [None]:
gpt2_model = GPT2HeadWithValueModel.from_pretrained(config['lm_name'])
gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained(config['ref_lm_name'])
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(config['tk_name'])

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
_ = gpt2_model.to(device)
_ = sentiment_model.to(device)
_ = gpt2_model_ref.to(device)

In [None]:
wandb.watch(gpt2_model, log='all')

In [None]:
df['tokens'] = df['review'].progress_apply(lambda x: gpt2_tokenizer.encode(' '+x, return_tensors="pt").to(device)[0, :config['txt_in_len']])

In [None]:
df['query'] = df['tokens'].progress_apply(lambda x: gpt2_tokenizer.decode(x))

### Define sentiment tasks

In [None]:
ctrl_str = ['[negative]', '[neutral]', '[positive]']

ctrl_tokens = dict((s, gpt2_tokenizer.encode(s, return_tensors="pt").squeeze().to(device)) for s in ctrl_str)

In [None]:
ctrl_tokens

### Loss function

In [None]:
def pos_logit_to_reward(logit, task):
    """
    Take the positive sentiment logit and scale it for the task.
        task [negative]: reward = -logit
        task [neutral]: reward = -2*abs(logit)+4
        task [positive]: reward = logit
    """
    # Work on a copy so the caller's tensor is not modified in place.
    reward = logit.clone()
    for i in range(len(reward)):
        if task[i]=='[negative]':
            reward[i] = -logit[i]
        elif task[i]=='[neutral]':
            reward[i] = -2*torch.abs(logit[i])+4
        elif task[i]=='[positive]':
            pass
        else:
            raise ValueError("task has to be in ['[negative]', '[neutral]', '[positive]']!")
    return reward

Example inputs and outputs for the `pos_logit_to_reward` function:

In [None]:
print(ctrl_str)

In [None]:
pos_logit_to_reward(torch.Tensor([4,4,4]), ctrl_str)

For the negative task and a positive-sentiment logit of 4, the reward is -4 (negative reward).<br>For the neutral task and a positive-sentiment logit of 4, the reward is `-2*abs(4)+4 = -4` (negative reward).<br>For the positive task and a positive-sentiment logit of 4, the reward is 4 (positive reward), as the sentiment is aligned with the task.

In [None]:
pos_logit_to_reward(torch.Tensor([-4,-4,-4]), ctrl_str)

For the negative task and a positive-sentiment logit of -4, the reward is 4 (positive reward), as the sentiment is aligned with the task.<br>For the neutral task and a positive-sentiment logit of -4, the reward is `-2*abs(-4)+4 = -4` (negative reward).<br>For the positive task and a positive-sentiment logit of -4, the reward is -4 (negative reward).

In [None]:
pos_logit_to_reward(torch.Tensor([0, 0, 0]), ctrl_str)

For the negative task and a neutral logit of 0, the reward is 0.<br>For the neutral task and a neutral logit of 0, the reward is `-2*abs(0)+4 = 4` (positive reward), as the sentiment is aligned with the task.<br>For the positive task and a neutral logit of 0, the reward is 0.

<br>Note that this reward scheme is only one possible choice; it can be modified and studied further in future research.
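
The three cases above can be reproduced with a minimal sketch that works on plain Python floats. `reward_for_task` is a hypothetical helper written for illustration, not part of the notebook; it applies the same scaling as the docstring of `pos_logit_to_reward`, but returns a new value instead of modifying its input.

```python
def reward_for_task(pos_logit, task):
    """Scale a positive-sentiment logit into a reward for the given control task."""
    if task == '[negative]':
        return -pos_logit               # reward negative sentiment
    if task == '[neutral]':
        return -2 * abs(pos_logit) + 4  # peak of 4 at logit 0, decreasing on both sides
    if task == '[positive]':
        return pos_logit                # reward positive sentiment
    raise ValueError(f"unknown task: {task!r}")

tasks = ['[negative]', '[neutral]', '[positive]']
for logit in (4.0, -4.0, 0.0):
    # One row per example logit, one reward per task.
    print(logit, [reward_for_task(logit, t) for t in tasks])
```

Under this scaling the neutral reward is positive only while the absolute logit stays below 2, which is one of the design choices that a different reward scheme could revisit.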

### Model Training

The training loop consists of the following steps:

* Get a batch of queries and create random controls
* Get the query responses from the policy
* Join query and responses and tokenize for BERT analysis
* Get sentiments for query/responses from BERT
* Optimize policy with PPO using the (query, response, reward) triplet
* Log all the training statistics
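
The data flow of the steps above can be sketched without any model, assuming toy stand-ins for the policy and the sentiment classifier. `toy_generate`, `toy_positive_logit`, `toy_reward`, `chunks` and `collect_step` are all hypothetical names introduced for illustration; the actual PPO update and the logging are left out.

```python
import random

CTRL_STR = ['[negative]', '[neutral]', '[positive]']

def chunks(items, size):
    """Split a list into consecutive forward batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def toy_generate(prompt):
    # Hypothetical stand-in for the policy (GPT-2): returns a canned continuation.
    return ' it was fine.'

def toy_positive_logit(text):
    # Hypothetical stand-in for the sentiment classifier: counts cue words.
    return float(text.count('good') - text.count('bad'))

def toy_reward(logit, task):
    # Same scaling as described in the docstring of pos_logit_to_reward.
    if task == '[negative]':
        return -logit
    if task == '[neutral]':
        return -2 * abs(logit) + 4
    return logit

def collect_step(queries, batch_size, fbs, rng):
    batch = rng.sample(queries, batch_size)            # sample a batch of queries
    tasks = rng.choices(CTRL_STR, k=batch_size)        # draw a random control per query
    prompts = [t + q for t, q in zip(tasks, batch)]    # prefix each query with its control
    responses = []
    for chunk in chunks(prompts, fbs):                 # generate one forward batch at a time
        responses.extend(toy_generate(p) for p in chunk)
    texts = [q + r for q, r in zip(batch, responses)]  # score query + response, without the prefix
    logits = [toy_positive_logit(t) for t in texts]
    rewards = [toy_reward(l, t) for l, t in zip(logits, tasks)]
    return list(zip(prompts, responses, rewards))      # triplets that a PPO step would consume

queries = [' a good film', ' a bad film', ' a film', ' good good']
triplets = collect_step(queries, batch_size=4, fbs=2, rng=random.Random(0))
for prompt, response, reward in triplets:
    print(repr(prompt), repr(response), reward)
```

Unlike `range(int(config['batch_size']/fbs))` in the notebook loop, `chunks` also keeps a trailing partial batch when the batch size is not a multiple of the forward batch size.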

In [None]:
ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, gpt2_tokenizer, **config)
fbs = config['forward_batch_size']

for epoch in tqdm(range(int(np.ceil(config["steps"]/config['batch_size'])))):
    torch.cuda.empty_cache()
    logs = dict()
    game_data = dict()
    timing = dict()
    t0 = time.time()
    
    #### get a batch from the dataset and annotate tasks
    df_batch = df.sample(config['batch_size'])
    task_list = choices(ctrl_str, k=config['batch_size'])
    task_tensors = torch.stack([ctrl_tokens[t] for t in task_list])
    query_list = df_batch['query'].tolist()
    game_data['query'] = [t+q for t,q in zip(task_list, query_list)]
    
    query_tensors = torch.stack(df_batch['tokens'].tolist())
    query_tensors = torch.cat((task_tensors, query_tensors), axis=1)
    
    #### get response from gpt2
    t = time.time()
    response_tensors = []
    for i in range(int(config['batch_size']/fbs)):
        response  = respond_to_batch(gpt2_model, query_tensors[i*fbs:(i+1)*fbs],
                                     txt_len=config['txt_out_len'])
        response_tensors.append(response)
    response_tensors = torch.cat(response_tensors)
    game_data['response'] = [gpt2_tokenizer.decode(response_tensors[i, :]) for i in range(config['batch_size'])]
    timing['time/get_response'] = time.time()-t

    #### tokenize text for sentiment analysis
    t = time.time()
    texts = [q + r for q,r in zip(query_list, game_data['response'])]
    sentiment_inputs, attention_masks = build_bert_batch_from_txt(texts, sentiment_tokenizer, device)    
    timing['time/build_input_sentiment'] = time.time()-t

    #### get sentiment score
    t = time.time()
    pos_logits = []
    for i in range(int(config['batch_size']/fbs)):
        res = sentiment_model.forward(sentiment_inputs[i*fbs:(i+1)*fbs],
                                      attention_masks[i*fbs:(i+1)*fbs])[0][:, 1].detach()
        pos_logits.append(res)
    rewards = pos_logit_to_reward(torch.cat(pos_logits), task_list)
    timing['time/get_sentiment_preds'] = time.time()-t

    #### Run PPO training 
    t = time.time()
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    timing['time/optimization'] = time.time()-t
     
    #### Log everything
    timing['time/epoch'] = time.time()-t0
    table_rows = [list(r) for r in zip(game_data['query'], game_data['response'], rewards.cpu().tolist())]
    logs.update({'game_log':wandb.Table(
        columns=['query', 'response', 'reward'],
        rows=table_rows)})
    logs.update(timing)
    logs.update(stats)
    logs['env/reward_mean'] = torch.mean(rewards).cpu().numpy()
    logs['env/reward_std'] = torch.std(rewards).cpu().numpy()
    logs['env/reward_dist'] = rewards.cpu().numpy()
    for ctrl_s in ctrl_str:
        key = 'env/reward_'+ctrl_s.strip('[]')
        logs[key] = np.mean([r for r, t in zip(logs['env/reward_dist'], task_list) if t==ctrl_s])
    wandb.log(logs)

In [None]:
from IPython.display import Image  # needed if not already imported earlier in the notebook
Image(filename='/content/drive/MyDrive/language_models/figure11.PNG', width="600")

In [None]:
for ctrl_s in ctrl_str:
    plt.hist([r for r, t in zip(logs['env/reward_dist'], task_list) if t==ctrl_s],
             density=True,
             alpha=0.5,
             label=ctrl_s)
plt.legend(loc='best')
plt.title('reward distribution')
plt.grid(True)
plt.show()

### Results Overview

In [None]:
#### get a batch from the dataset
bs = 32
game_data = dict()
df_batch = df.sample(bs)
query_list = df_batch['query'].tolist()
game_data['query'] = query_list
for ctrl in ctrl_str:
    task_list = [ctrl] * bs
    task_tensors = torch.stack([ctrl_tokens[t] for t in task_list])

    query_tensors = torch.stack(df_batch['tokens'].tolist())
    query_tensors = torch.cat((task_tensors, query_tensors), axis=1)

    #### get response from gpt2
    response_tensors  = respond_to_batch(gpt2_model, query_tensors, txt_len=config['txt_out_len'])
    game_data['response ' + ctrl] = [gpt2_tokenizer.decode(response_tensors[i, :]) for i in range(bs)]

    #### sentiment analysis of query/response pairs
    texts = [q + r for q,r in zip(game_data['query'], game_data['response ' + ctrl])]
    sentiment_inputs, attention_masks = build_bert_batch_from_txt(texts, sentiment_tokenizer, device)    
    rewards = sentiment_model.forward(sentiment_inputs, attention_masks)[0][:, 1].detach()
    game_data['rewards ' + ctrl] = pos_logit_to_reward(rewards, task_list).cpu().numpy()

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results

In [None]:
input_string = '[negative] The movie'
input_tokens = gpt2_tokenizer.encode(input_string, return_tensors="pt").to(device)

response_tensors = respond_to_batch(gpt2_model, input_tokens, txt_len=config['txt_out_len'])
response_strings = gpt2_tokenizer.decode(response_tensors[0, :])
response_strings

In [None]:
input_string = '[neutral] The movie'
input_tokens = gpt2_tokenizer.encode(input_string, return_tensors="pt").to(device)

response_tensors = respond_to_batch(gpt2_model, input_tokens, txt_len=config['txt_out_len'])
response_strings = gpt2_tokenizer.decode(response_tensors[0, :])
response_strings

In [None]:
input_string = '[positive] The movie'
input_tokens = gpt2_tokenizer.encode(input_string, return_tensors="pt").to(device)

response_tensors = respond_to_batch(gpt2_model, input_tokens, txt_len=config['txt_out_len'])
response_strings = gpt2_tokenizer.decode(response_tensors[0, :])
response_strings

As can be observed, the model generated reviews consistent with the assigned task, which illustrates the concept of controlled text generation.

### Future Work

* Try different reward systems
* Further evaluate the results with established metrics
* Experiment on several new datasets
* Experiment with more exotic tasks
* Further explore the capability of ChatGPT to generalize to an infinite number of tasks based on the user input





## References

https://arxiv.org/pdf/2009.01325.pdf

https://openai.com/blog/learning-to-summarize-with-human-feedback/

https://machinelearning.co.il/8530/learning-to-summarize/

https://openai.com/blog/summarizing-books/

https://sh-tsang.medium.com/review-learning-to-summarize-from-human-feedback-d5bb11e4c1c5

https://openai.com/blog/instruction-following/#moon

https://openaipublic.blob.core.windows.net/summarize-from-feedback/website/index.html#/tldr_comparisons

https://openai.com/blog/openai-baselines-ppo/

https://huggingface.co/blog/deep-rl-ppo

https://openai.com/blog/chatgpt/

https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx

https://openai.com/blog/instruction-following/

https://github.com/openai/summarize-from-feedback/blob/master/model_card.md
