# Summarizing dialogue with generative AI

This tutorial shows how you can load a pre-trained Large Language Model (LLM) from Hugging Face; improve its completion results with prompt engineering (zero-, one-, and few-shot inference); refine the model with instruction finetuning; and, finally, use Reinforcement Learning with Human Feedback (RLHF) to make its responses less toxic. 

# Table of Contents

- [ 1 - Install and load dependencies](#1)
- [ 2 - Summarizing dialogue without in-context learning](#2)
- [ 3 - Summarizing dialogue with in-context learning](#3)
  - [ 3.1 - Include instruction prompt but no examples ("zero-shot inference")](#3.1)
  - [ 3.2 - Include instruction prompt and a single example ("one-shot inference")](#3.2)
  - [ 3.3 - Include instruction prompt and several examples ("few-shot inference")](#3.3)
  - [ 3.4 - Tweaking the generative Configuration Parameters](#3.4)
- [ 4 - Tweaking the generation configuration parameters](#4)
- [ 5 - Fine-tuning and automatically evaluating the model](#6)
  - [ 5.1 - Preprocessing Dialogsum dataset to add instructions to prompt](#5.1)
  - [ 5.2 - Fine-tune the model updating all parameters](#5.2)
  - [ 5.3 - Visually inspect model results](#5.3)
  - [ 5.4 - Automatically quantify model results with ROUGE metrics](#5.4)
- [ 6 - Save compute resources with Parameter Efficient Fine-Tuning (PEFT)](#6)
  - [ 6.1 - Setting up LoRA](#6.1)
  - [ 6.2 - Training our model with LoRA](#6.2)
  - [ 6.3 - Visually inspect LoRA model results](#6.3)
  - [ 6.4 - Assessing LoRA model with ROUGE metrics](#6.4)
- [ 7 - Using Reinforcement Learning with Human Feedback (RLHF) for further refinement](#7)
  - [ 7.1 - Load the reward model](#7.1)
  - [ 7.2 - Calculating baseline toxicity stats](#7.2)
  - [ 7.3 - Fine-tuning our model to reduce toxicity](#7.3)
  - [ 7.4 - Quantifying the toxicity of our detoxified model](#7.4)
  - [ 7.5 - Visually comparing pre-detoxified vs. detoxified models](#7.5)
    

<a name='1'></a>
## 1 - Install and load dependencies

In [52]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

# Install reinforcement Learning library from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd 

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/lvwerra/trl.git@25fa1bd
  Cloning https://github.com/lvwerra/trl.git (to revision 25fa1bd) to c:\users\david abugaber\appdata\local\temp\pip-req-build-yguy0jm9
  Resolved https://github.com/lvwerra/trl.git to commit 25fa1bd
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: trl
  Building wheel for trl (setup.py): started
  Building wheel for trl (setup.py): finished with status 'done'
  Created wheel for trl: filename=trl-0.4.2.dev0-py3-none-any.whl size=67534 sha256=aa85fb906bae541cb69824152ac71b29baa89b085a8b0109

  Running command git clone --filter=blob:none --quiet https://github.com/lvwerra/trl.git 'C:\Users\David Abugaber\AppData\Local\Temp\pip-req-build-yguy0jm9'
  Running command git checkout -q 25fa1bd


Load various modules: datasets, Large Language Model, tokenizer, configurator, etc.

In [1]:
from datasets import load_dataset
from transformers import pipeline,AutoModelForSequenceClassification,AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from peft import PeftModel, PeftConfig, LoraConfig, TaskType
from trl import create_reference_model
from trl.core import LengthSampler
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

<a name='2'></a>
## 2 - Summarizing dialogue without in-context learning

Load in dialogue data from DialogSum dataset (https://huggingface.co/datasets/knkarthick/dialogsum). This comprises 10,000+ short dialogues with manually annotated summaries. 

In [77]:
data_name = "knkarthick/dialogsum"

dataset = load_dataset(data_name)

Found cached dataset csv (C:/Users/David Abugaber/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-c8fac5d84cd35861/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


  0%|          | 0/3 [00:00<?, ?it/s]

Data is already split into "train" and "test" subsets. Each subset contains the dialogue itself plus a human-written summary. Here is an example:

In [4]:
print(dataset['train'][0]['dialogue'])

print(dataset['train'][0]['summary'])

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
Mr. Smith's getting a check-up, and Doctor Hawkins advises him to ha

Let's use the FLAN-T5 model from Hugging Face (more models available at https://huggingface.co/docs/transformers/index). 

Should be loaded as "Seq2Seq" because this is an encoder/decoder model. For autoregressive (decoder-only) models, would instead use AutoModelForCausalLM . See: https://stackoverflow.com/questions/75549632/difference-between-automodelforseq2seqlm-and-automodelforcausallm

Model is already trained hence use the `.from_pretrained()` method. 

Model is ~1Gb, so might take a minute to load.

In [78]:
model_name = 'google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Begin by tokenizing dataset.

Use tokenizer that corresponds to FLAN-T5. use_fast=True speeds up the process (see https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [79]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Example of the tokenizer encoding/decoding with our example dialogue from above. 

Use return_tensors='pt' for PyTorch, otherwise return_tensors='tf' for TensorFlow

In [80]:
test_input = dataset['train'][0]['dialogue']

test_encoded = tokenizer(test_input, return_tensors='pt')

test_decoded = tokenizer.decode(test_encoded["input_ids"][0],skip_special_tokens=True)

print(f'Tokens represented as numbers:\n{test_encoded["input_ids"][0]}\n')
print(f'Tokens decoded back into words:\n{test_decoded}')

Tokens represented as numbers:
tensor([ 1713,   345, 13515,   536,  4663,    10,  2018,     6,  1363,     5,
         3931,     5,    27,    31,    51,  7582, 12833,    77,     7,     5,
         1615,    33,    25,   270,   469,    58,  1713,   345, 13515,   357,
         4663,    10,    27,   435,    34,   133,    36,     3,     9,   207,
          800,    12,   129,     3,     9,   691,    18,   413,     5,  1713,
          345, 13515,   536,  4663,    10,  2163,     6,   168,     6,    25,
           43,    29,    31,    17,   141,    80,    21,   305,   203,     5,
          148,   225,    43,    80,   334,   215,     5,  1713,   345, 13515,
          357,  4663,    10,    27,   214,     5,    27,  2320,    38,   307,
           38,   132,    19,  1327,  1786,     6,   572,   281,   217,     8,
         2472,    58,  1713,   345, 13515,   536,  4663,    10,  1548,     6,
            8,   200,   194,    12,  1792,  2261, 21154,    19,    12,   253,
           91,    81,   135,   77

You might notice that the length of the text string in words is longer than the length of the list of numbers.

In [26]:
print("Length of our list of tokens is " + str(len(test_encoded["input_ids"][0])))
print("Length of our input text string in words is " + str(len(test_decoded.split())))

Length of our list of tokens is 284
Length of our input text string in words is 168


The reason for this is that sometimes a word is split up into several tokens. For instance, "I'll" becomes "I"+"'"+"ll", "Smoking" is split into "Smok"+"ing", and (perhaps more unconventionally) "Person" is split into "P"+"erson", due to the capitalization. Here is how you can view the tokens as strings:

In [29]:
tokenizer.tokenize(test_input)

['▁#',
 'P',
 'erson',
 '1',
 '#',
 ':',
 '▁Hi',
 ',',
 '▁Mr',
 '.',
 '▁Smith',
 '.',
 '▁I',
 "'",
 'm',
 '▁Doctor',
 '▁Hawk',
 'in',
 's',
 '.',
 '▁Why',
 '▁are',
 '▁you',
 '▁here',
 '▁today',
 '?',
 '▁#',
 'P',
 'erson',
 '2',
 '#',
 ':',
 '▁I',
 '▁found',
 '▁it',
 '▁would',
 '▁be',
 '▁',
 'a',
 '▁good',
 '▁idea',
 '▁to',
 '▁get',
 '▁',
 'a',
 '▁check',
 '-',
 'up',
 '.',
 '▁#',
 'P',
 'erson',
 '1',
 '#',
 ':',
 '▁Yes',
 ',',
 '▁well',
 ',',
 '▁you',
 '▁have',
 'n',
 "'",
 't',
 '▁had',
 '▁one',
 '▁for',
 '▁5',
 '▁years',
 '.',
 '▁You',
 '▁should',
 '▁have',
 '▁one',
 '▁every',
 '▁year',
 '.',
 '▁#',
 'P',
 'erson',
 '2',
 '#',
 ':',
 '▁I',
 '▁know',
 '.',
 '▁I',
 '▁figure',
 '▁as',
 '▁long',
 '▁as',
 '▁there',
 '▁is',
 '▁nothing',
 '▁wrong',
 ',',
 '▁why',
 '▁go',
 '▁see',
 '▁the',
 '▁doctor',
 '?',
 '▁#',
 'P',
 'erson',
 '1',
 '#',
 ':',
 '▁Well',
 ',',
 '▁the',
 '▁best',
 '▁way',
 '▁to',
 '▁avoid',
 '▁serious',
 '▁illnesses',
 '▁is',
 '▁to',
 '▁find',
 '▁out',
 '▁about',
 '▁them

Example of using FLAN-T5 to summarize dialogue, using our sample text from above.

Zero-shot inference: we're feeding in the raw prompt without including any examples/instructions within it.

Here I'm using the first example dialogue (corresponds to index 0 because Python starts counting at zero). Am storing the index number as a separate object so that it can be easily changed if I want to look at other example inputs.

Note that max_new_tokens=80 limits the output to 80 tokens (remember that a single word may comprise several tokens). 

In [8]:
index = 0

dialogue = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

input_tokens = tokenizer(dialogue, return_tensors='pt')

output = tokenizer.decode(model.generate(input_tokens["input_ids"], max_new_tokens=80,)[0],skip_special_tokens=True)

print(f'Test prompt:\n{dialogue}\n')
print(f'Human-generated summary:\n{human_summary}\n')
print(f'Completion - no in-context learning:\n{output}')

Test prompt:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with

Feeding in the raw dialogue produces crappy results. In-context learning with zero-/one-/few-shot inference might help.

<a name='3'></a>
## 3 - Summarizing dialogue with in-context learning

<a name='3.1'></a>
### 3.1 - Include instruction prompt but no examples ("zero-shot inference")

This time, add an explicit instruction at the end of the prompt to see if our dialogue summaries improve. Still no examples to learn from, though, hence "zero-shot."

Note that more example prompts can be found in the FLAN-T5 Github repository (https://github.com/google-research/FLAN/tree/main/flan/v2) in this script (https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py)

For instance, some of the instruction prompts they use are: "Briefly summarize that dialogue", "Here is a dialogue ... Write a short summary!", "What was that dialogue about, in two sentences or less?, "Here is a dialogue... What were they talking about?", etc.

In [39]:
index = 0

dialogue = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

prompt = f"""
Produce a summary of the following dialogue.

{dialogue}

Summary:
"""

input_tokens = tokenizer(prompt, return_tensors='pt')

output = tokenizer.decode(model.generate(input_tokens["input_ids"], max_new_tokens=80,)[0],skip_special_tokens=True)

print(f'Test prompt:\n{dialogue}\n')
print(f'Human-generated summary:\n{human_summary}\n')
print(f'Completion - zero-shot learning:\n{output}')


Test prompt:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with

A little better, but omits a lot of detail as to the actual content of the memo (at least for the dialogue in index zero -- might perform better for some of the other exampels)

<a name='3.2'></a>
### 3.2 - Include instruction prompt and a single example ("one-shot inference")

Now I'll try including a single ("one-shot") example of a good dialogue / summary combination, before prompting the model to produce its own on a different dialogue.

For this, I'll have two separate indices: one for the example summary, and one for the dialogue to be summarized

In [9]:
test_index = 0
test_dialogue = dataset['test'][test_index]['dialogue']
test_human_summary = dataset['test'][test_index]['summary']

example_index = 3
example_dialogue = dataset['test'][example_index]['dialogue']
example_human_summary = dataset['test'][example_index]['summary']

prompt = f"""
Produce a summary of the following dialogue.

{example_dialogue}

Summary:
{example_human_summary}

Produce a summary of the following dialogue.

{test_dialogue}

Summary:
"""

input_tokens = tokenizer(prompt, return_tensors='pt')

output = tokenizer.decode(model.generate(input_tokens["input_ids"], max_new_tokens=80,)[0],skip_special_tokens=True)

print(f'Test prompt:\n{prompt}\n')
print(f'Human-generated summary:\n{test_human_summary}\n')
print(f'Completion - one-shot learning:\n{output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (752 > 512). Running this sequence through the model will result in indexing errors


Test prompt:

Produce a summary of the following dialogue.

#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can st

Maybe a slight improvement (?). Instead of "the memo WILL go out" now we get "the memo IS TO BE DISTRIBUTED." Note that bizarrely, using indices 1 or 2 for the example produced some bizarre behavior: those example dialogues were also about a conversation between Ms. Dawson and the boss, so FLAN-T5 just reproduced the human-generated summaries for THOSE dialogues. So, having an example that was too similar worked against us here.

<a name='3.3'></a>
### 3.3 - Include instruction prompt and several examples ("few-shot inference")

Instead of just a single example in the prompt, I'll try adding several. So, "few-shot inference" instead of "one-shot inference."

Rather than tediously writing out separate slots in the prompt for example 1, example 2, example 3, etc., I can build a function to automatically add the new examples. The code below prints out an example of such an aggregated prompt.

In [10]:
def prompt_builder(example_indices, test_index):
    prompt = ''
    for i in example_indices:
        dialogue = dataset['test'][i]['dialogue']
        summary = dataset['test'][i]['summary']
        
        prompt += f"""
Produce a summary of the following dialogue:

{dialogue}

Summary:
{summary}

"""
    
    dialogue = dataset['test'][test_index]['dialogue']
    
    prompt += f"""
Produce a summary of the following dialogue:

{dialogue}

Summary:
"""
        
    return prompt

example_indices = [5,6]
test_index = 0

print(prompt_builder(example_indices, test_index))


Produce a summary of the following dialogue:

#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can start biking to

Following the same procedure as before, I can feed this few-shot inference prompt to FLAN-T5. I'll use 6 prompts:

In [11]:
example_indices = [5,6,7,8,9,10]
test_index = 0

test_human_summary = dataset['test'][test_index]['summary']

prompt = prompt_builder(example_indices, test_index)

input_tokens = tokenizer(prompt, return_tensors='pt')

output = tokenizer.decode(model.generate(input_tokens["input_ids"], max_new_tokens=80,)[0],skip_special_tokens=True)

print(f'Test prompt:\n{prompt}\n')
print(f'Human-generated summary:\n{test_human_summary}\n')
print(f'Completion - few-shot learning:\n{output}')

Test prompt:

Produce a summary of the following dialogue:

#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can st

Oddly enough, adding more examples didn't seem to improve our summary much. In any case, adding at least ONE example was better than having no in-context learning (which just spat out "#Person1#: Ms. Dawson, I need you to take a dictation for me.").

Two things to note:

-best practices suggest that adding more than 6 examples ("6-shot learning") does not improve results

-models have limited input context length. In the case of FLAN-T5, it's 512 tokens. So, after that ceiling, any subsequent enty gets ignored.

<a name='4'></a>
## 4 - Tweaking the generation configuration parameters

There are other parameters that we can play with, other than the max_new_tokens limit to the output length. For more, see: https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig. 

For instance, setting "do_sample = True" makes the model *not* just choose the highest-probability output token (what's called "greedy decoding") but instead sample from the top candidates. 

For instance, setting "top_k = 20" limits the output to choosing from the 20 highest-probability tokens. Similarly, "top_p = .50" limits the output to the set of highest-probability tokens whose probabilities cumulatively sum to .50.

Finally, setting "temperature" above 1 makes the model use a broader (less peaked) probability function, with a higher chance of choosing low-probability candidates. 

Here is an example of such parameters at play.

In [60]:
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, top_p = .8, temperature=1.5)

output = tokenizer.decode(model.generate(input_tokens["input_ids"], max_new_tokens=80,)[0],skip_special_tokens=True)

print(f'Test prompt:\n{prompt}\n')
print(f'Human-generated summary:\n{test_human_summary}\n')
print(f'Completion - few-shot learning:\n{output}')

Test prompt:

Here is a dialogue:

#Person1#: You're finally here! What took so long?
#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.
#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.
#Person2#: I don't think it can be avoided, to be honest.
#Person1#: perhaps it would be better if you started taking public transport system to work.
#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.
#Person1#: It would be better for the environment, too.
#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.
#Person1#: Taking the subway would be a lot less stressful than driving as well.
#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.
#Person1#: Well, when it's nicer outside, you can start biking to work. That 

In [36]:
#resetting parameters
generation_config = GenerationConfig(max_new_tokens=50, do_sample=False)

<a name='5'></a>
## 5 - Fine-tuning and automatically evaluating the model

Instead of relying on modified prompts to get the model to spit out the desired output (which doesn't actually change the model but puts the burden on the human providing the input), we can modify the parameter weights to change the model itself. 

As a first set, we can get the size of the model and determine how many of the parameters are "trainable" with a function like this:

In [12]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))



trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


<a name='5.1'></a>
### 5.1 - Preprocessing Dialogsum dataset to add instructions to prompt

Not unlike what we did above with the in-context learning, we need to reformat the dialog/summary pairs in the Dialogsum dataset by adding instructions: a message before each dialogue to orient the model, an explicit instruction to create a summary, and a slot for the humman-generated annotation for training:

Then, this entire dataset should be tokenized and the 'input_id' should be extracted

In [97]:
def tokenize_function(example):
    start_prompt = 'Produce a summary of the following dialogue.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    
    return example

# set "batched=True" because dataset is already divided into training, validation, and test splits
# training data: original exemplars used to train the model
# validation data: used during model training to fine-tune parameters across iterations
# test data: used after all training is complete to give final model performance
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

For this example, we can just take a subset of the data by taking only every 100th entry. 
In the code below, "index % 100 == 0" is a Boolean expression that looks for entries whose indices are divisible by 100 -- i.e., the remainder after dividing by 100 is zero.

In [166]:
subsample_tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Loading cached processed dataset at C:\Users\David Abugaber\.cache\huggingface\datasets\knkarthick___csv\knkarthick--dialogsum-c8fac5d84cd35861\0.0.0\6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1\cache-4f905de34fd2d5b2.arrow
Loading cached processed dataset at C:\Users\David Abugaber\.cache\huggingface\datasets\knkarthick___csv\knkarthick--dialogsum-c8fac5d84cd35861\0.0.0\6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1\cache-b568002b0437a33e.arrow
Loading cached processed dataset at C:\Users\David Abugaber\.cache\huggingface\datasets\knkarthick___csv\knkarthick--dialogsum-c8fac5d84cd35861\0.0.0\6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1\cache-8b96e13cd31466b7.arrow


Here is the shape of the dataset that we'll be working on: a train dataset with 125 entries, a test dataset with 15 entries, and a validation dataset with 5 entries. 

In [120]:
print(subsample_tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})


Each entry consists of a string of 'input_ids' and a string of 'labels.' As before, the raw values are just numerical representations corresponding to each token...

In [84]:
print(subsample_tokenized_datasets['train']['input_ids'][0])
print(subsample_tokenized_datasets['train']['labels'][0])

[947, 19, 3, 9, 7478, 24, 523, 12, 36, 21603, 26, 5, 1713, 345, 13515, 536, 4663, 10, 2018, 6, 1363, 5, 3931, 5, 27, 31, 51, 7582, 12833, 77, 7, 5, 1615, 33, 25, 270, 469, 58, 1713, 345, 13515, 357, 4663, 10, 27, 435, 34, 133, 36, 3, 9, 207, 800, 12, 129, 3, 9, 691, 18, 413, 5, 1713, 345, 13515, 536, 4663, 10, 2163, 6, 168, 6, 25, 43, 29, 31, 17, 141, 80, 21, 305, 203, 5, 148, 225, 43, 80, 334, 215, 5, 1713, 345, 13515, 357, 4663, 10, 27, 214, 5, 27, 2320, 38, 307, 38, 132, 19, 1327, 1786, 6, 572, 281, 217, 8, 2472, 58, 1713, 345, 13515, 536, 4663, 10, 1548, 6, 8, 200, 194, 12, 1792, 2261, 21154, 19, 12, 253, 91, 81, 135, 778, 5, 264, 653, 12, 369, 44, 709, 728, 3, 9, 215, 21, 39, 293, 207, 5, 1713, 345, 13515, 357, 4663, 10, 8872, 5, 1713, 345, 13515, 536, 4663, 10, 1563, 140, 217, 270, 5, 696, 2053, 11, 11581, 320, 1399, 5, 2321, 3, 9, 1659, 6522, 6, 754, 5, 531, 25, 7269, 6, 1363, 5, 3931, 58, 1713, 345, 13515, 357, 4663, 10, 2163, 5, 1713, 345, 13515, 536, 4663, 10, 14627, 53, 19, 

In [None]:
We can use tokenizer.decode() to turn these back into actual human-legible text strings:

In [80]:
print(tokenizer.decode(subsample_tokenized_datasets['train']['input_ids'][0],skip_special_tokens=True))
print(tokenizer.decode(subsample_tokenized_datasets['train']['labels'][0],skip_special_tokens=True))

Here is a dialogue that needs to be summarized. #Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would be a good idea to get a check-up. #Person1#: Yes, well, you haven't had one for 5 years. You should have one every year. #Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor? #Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good. #Person2#: Ok. #Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith? #Person2#: Yes. #Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit. #Person2#: I've tried hundreds of times, but I just can't seem to kick the habit. #Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave. #Person2#: Ok, thanks doctor. Write a summary here

<a name='5.2'></a>
### 5.2 - Fine-tune the model updating all parameters

We first need to establish the training configuration.

Here is a quick run-down of what each argument means:
    
Learning rate: determines how quickly model weights are updated during training. A higher rate means faster convergence, but if it's too high the results can become unstable, overshooting or oscillating around the solution

Number of training epochs: how many training epochs to perform? (one epoch is one full cycle through the training data, so this sets the total number of times the model sees the entire training dataset)

Weight decay: a parameter that limits overfitting by penalizing large weights in the model

Logging steps: controls how frequently training progress is displayed during training 

Max steps: How many training steps to perform? (one step is one gradient update to the model steps, iterating one "batch." So, this sets the total number of batches processed during training)

Save strategy: can be set to either "no" (no saving at all), "epoch" (to save at the end of each epoch"), or "steps" (to save after a given number of steps, set by including a save_steps = # argument in your call to TrainingArguments )


In [16]:
output_dir = f'./dialog_summary_model_training'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1,
    save_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=subsample_tokenized_datasets['train'],
    eval_dataset=subsample_tokenized_datasets['validation']
)

trainer.train()



Step,Training Loss
1,49.1909


TrainOutput(global_step=1, training_loss=49.19086456298828, metrics={'train_runtime': 61.4805, 'train_samples_per_second': 0.13, 'train_steps_per_second': 0.016, 'total_flos': 5478058819584.0, 'train_loss': 49.19086456298828, 'epoch': 0.06})

Now load the trained model. Since in the previous step we set it to save after each training epoch, and there was only one epoch, then by default the directory should be called 'checkpoint-1'

In [17]:
# Load trained model
model_path = "dialog_summary_model_training/checkpoint-1"

instruct_model = AutoModelForSeq2SeqLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

<a name='5.3'></a>
### 5.3 - Visually inspect model results

Let's check if the model is producing reasonably good output. Of course, it's not feasible to manually check every output (especially in a full model), but eyeballing the results qualitatively can give us an initial sense of its performance before turning to quantitative metrics (see below).

Note that here we are looking at the original, unsampled Dialogsum dataset.

In [37]:
index = 25
dialogue = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

prompt = f"""
Produce a summary of the following dialogue.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(f'Dialogue:\n{dialogue}')
print(f'Human-generated summary:\n{human_summary}')
print(f'Original model:\n{original_model_text_output}')
print(f'Instruct model:\n{instruct_model_text_output}')

Dialogue:
#Person1#: Steven, I need badly your help.
#Person2#: What's the matter?
#Person1#: My wife has found that I have an affair with my secretary, and now she is going to divorce me.
#Person2#: How could you cheat on your wife? You have been married for ten years.
#Person1#: Yes, I know I'm wrong. But I swear that the affair lasts only for two months. And I still love my wife. I couldn't live without her.
#Person2#: I will try my best to persuade her to reconsider the divorce. But are you sure that from now on you will be faithful to her forever?
#Person1#: Yes, I swear.
Human-generated summary:
Steve will try to persuade #Person1#'s wife not to divorce #Person1# as #Person1# swears to remain faithful forever.
Original model:
Steven is asking for help.
Instruct model:
Person1 is going to divorce her secretary.


Seems like fine-tuning the model by embedding explicit instructions within the prompts has improved the results somewhat.

<a name='5.3'></a>
### 5.4 - Automatically quantify model results with ROUGE metrics

Of course, it's not feasible to manually check every single summary and compare it against a baseline human annotation. Beyond the fact that such an approach is not scalable (why automate the dialogue summaries in the first place if a human has to validate every output?), there's the fact that you can't easily quantify how good the summary is just by glancing at it.

One way to tackle this problem is with ROUGE, or "Recall-Oriented Understudy for Gisting Evaluation." This is a family of metrics that quantifies the similarity between computer-generated text vs. a human baseline.

ROUGE can take the form of "ROUGE-N", where N is the granularity of the n-grams to compare. So for instance, ROUGE-1 compares unigrams (individual words), ROUGE-2 compares bigrams, etc.

Alternately, ROUGE-L compares n grams with the length of the longest common subsequence between the generated vs. reference text

The score itself can either be recall (# of matching n-grams divided by # of n-grams in the reference text), precision (# of matching n-grams divided by # of n-grams in the text to be tested), or F-score (also known as F1), which is the harmonic mean of precision and recall. Which of these to use depends on your use case.

If it's more important that your output captures all the "right answers" (minimizing false negatives) even if you inadvertently generate irrelevant text, then use recall.
If it's more important that nothing in your output is wrong (minimizing false positives) even if you omit possibly relevant text, then use precision.
If you want to balance these two considerations, use F-score.

Here is a toy example of ROUGE in action, from the rouge-score package:


In [19]:
#pip install rouge-score

from rouge_score import rouge_scorer

# Sample generated and reference texts
generated_text = "The cat sat on the mat"
reference_text = "The cat was sitting on the mat"

#note: the longest common subsequence is "on the mat", so in this example ROUGE-L is essentially ROUGE-3

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate ROUGE scores
scores = scorer.score(generated_text, reference_text)

# Print the ROUGE scores
for key in scores:
    print(f'{key}: {scores[key]}')

rouge1: Score(precision=0.7142857142857143, recall=0.8333333333333334, fmeasure=0.7692307692307692)
rouge2: Score(precision=0.5, recall=0.6, fmeasure=0.5454545454545454)
rougeL: Score(precision=0.7142857142857143, recall=0.8333333333333334, fmeasure=0.7692307692307692)


The first thing we want to do to calculate ROUGE scores is to generate a dataset with summaries for our human-annotated baseline, for our original (non-finetuned) model, and for our finetuned "instruct model." The code below produces such a dataset for the first 20 entries.

In [40]:
dialogues = dataset['test'][0:19]['dialogue']
human_baseline_summaries = dataset['test'][0:19]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Produce a summary of the following dialogue.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)
    
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Employee memo is due to go out by this afternoon.,The memo will go out to all employees by this ...
1,In order to prevent employees from wasting tim...,The memo is being distributed to all employees...,The memo will go out to all employees by this ...
2,Ms. Dawson takes a dictation for #Person1# abo...,"The memo is for the following: ""It is a dictat...",The memo will go out to all employees by this ...
3,#Person2# arrives late because of traffic jam....,#Person1: I'm happy to see you finally here. #...,The traffic jam at the Carrefour intersection ...
4,#Person2# decides to follow #Person1#'s sugges...,The traffic was terrible.,The traffic jam at the Carrefour intersection ...
5,#Person2# complains to #Person1# about the tra...,People are complaining about the congestion in...,The traffic jam at the Carrefour intersection ...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
9,#Person1# and Brian are at the birthday party ...,#Person1#: I'm so happy to see you have a part...,Brian's birthday is coming up.


Now to get ROUGE scores for the original vs. instruct models. Note that "use_stemmer=True" removes suffixes from words to improve matching. So for instance, "sleeping" and "sleeps" would be counted as the same word after removing the -ing and -s suffixes. Setting "use_aggregator=True" gives you an aggregate ROUGE score across all the summaries, instead of a separate score for each one.

In [41]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('Original model:')
print(original_model_results)
print('Instruct model:')
print(instruct_model_results)

Original model:
{'rouge1': 0.2093862423370193, 'rouge2': 0.07284353971171693, 'rougeL': 0.1825140348261415, 'rougeLsum': 0.1820160064971703}
Instruct model:
{'rouge1': 0.29293514576360014, 'rouge2': 0.11264452616822107, 'rougeL': 0.2449479682766204, 'rougeLsum': 0.24590160739029454}


In [None]:
We got improvements across all measures! (Note that rougeLsum looks across sentences, whereas rougeL averages over individual sentences). 

Of course, it's possible that we could have gotten an even larger improvement by tweaking the instructions in our preprocessed training data

For instance, instead of "Here is a dialogue that needs to be summarized... Write a summary here:", we might use variations like "Summarize the dialogue below... Summary:" or similar.

<a name='6'></a>
## 6 - Save compute resources with Parameter Efficient Fine-Tuning (PEFT)

The approach used above adjusted the model parameter weights "wholesale," essentially creating an entire new copy of the model.

However, since we're simply trying to improve the model's performance on a very targeted task (summarizing dialogues), it is somewhat inefficient to tweak ALL of the model parameters.

An alternative approach is to use Parameter Efficient Fine-Tuning (PEFT), which improves a model by either selecting only a subset of parameters for fine-tuning (selective PEFT); adding additional layers or parameters to the base model, for instance, by appending virtual tokens to the input to nudge the output towards a desired result (additive PEFT); or by training a low-dimensional representation of the network that is added to the original weights (reparameterization).


<a name='6.1'></a>
### 6.1 - Setting up LoRA

The approach I take here uses one such reparameterization method called Low-Rank Adaptation, or LoRA. The LoRA "adapter" that is trained is orders of magnitude smaller than the original LLM.

The first step is to set up the LoRA adapter. In the code below, I sent r(ank) to 8; this refers to the size of the low-dimensional representation that actually gets trained. Lora_alpha is a scaling factor that determines how much weight is assigned to the LoRA activations vs. to the base model. I apply LoRA to the target modules q(uery) and v(alue), which are matrices in the attention blocks of the FLAN-T5 model. We can also try to eke out better task performance by training bias vectors (which determine unit activation regardless of input). The Huggingface docs recommend "setting bias to None first, and then lora_only, before trying all." The dropout parameter prevents overfitting randomly "drops out" (i.e., sets to zero) a fraction of the units in the model. Since FLAN is a sequence-to-sequence (encoder-decoder) model, we set task type to "TaskType.SEQ_2_SEQ_LM." You can read more about the LoRA parameters here: https://huggingface.co/docs/peft/conceptual_guides/lora

In [23]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8, 
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

Now to actually add the LoRA layer to our original LLM...

In [25]:
peft_model = get_peft_model(model, 
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 884736
all model parameters: 248462592
percentage of trainable model parameters: 0.36%


Notice how only a tiny proportion of our model parameters are now "trainable."

<a name='6.2'></a>
### 6.2 - Training our model with LoRA

As before, we'll have to define a set of training arguments before performing the training itself.

Instead of saving the whole model, we'll only save the LoRA adapter, hence no need to set "saving_strategy='epoch'" as before

In [28]:
output_dir = f'./dialog_summary_model_lora'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)
    
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=subsample_tokenized_datasets["train"],
)

peft_trainer.train()



Step,Training Loss
1,48.1301


TrainOutput(global_step=1, training_loss=48.13014602661133, metrics={'train_runtime': 54.4416, 'train_samples_per_second': 0.147, 'train_steps_per_second': 0.018, 'total_flos': 5499802091520.0, 'train_loss': 48.13014602661133, 'epoch': 0.06})

In [29]:
peft_trainer.model.save_pretrained(output_dir)

tokenizer.save_pretrained(output_dir)

('./dialog_summary_model_lora\\tokenizer_config.json',
 './dialog_summary_model_lora\\special_tokens_map.json',
 './dialog_summary_model_lora\\tokenizer.json')

Now we'll load a fresh copy of the FLAN-T5 model and combine it with the LoRA adapter we just trained.

In [81]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       output_dir, 
                                       torch_dtype=torch.bfloat16)

<a name='6.3'></a>
### 6.3 - Visually inspect LoRA model results

As before, let's take a glance at how our model performs.

We can compare our LORA-tuned model with the original base model as well as the model that used full (but computationally costly) tuning.

In [43]:
index = 1
dialogue = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

prompt = f"""
Produce a summary of the following dialogue.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)


print(f'Dialogue:\n{dialogue}')
print(f'Human-generated summary:\n{human_summary}')
print(f'Original model:\n{original_model_text_output}')
print(f'Instruct model:\n{instruct_model_text_output}')
print(f'LoRA model:\n{peft_model_text_output}')

Dialogue:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with th

If the LoRA model doesn't perform as well as the fully-trained one, at least bear in mind that LoRA was a shortcut to training that was much more efficient in terms of compute...

<a name='6.4'></a>
### 6.4 - Assessing LoRA model with ROUGE metrics
As before, we can quantify how well our model is performing by calculating ROUGE metrics. The first step is to create a dataframe with model summaries from our original, full fine-tuned, and LoRA models. I'm using the same syntax as before for the LoRA model outputs. Since I already generated the completions for the first two models, no need to redo them (though I'm leaving in the code, commented out, just for comparison):

In [44]:
#dialogues = dataset['test'][0:19]['dialogue']
#human_baseline_summaries = dataset['test'][0:19]['summary']

#original_model_summaries = []
#instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Here is a dialogue that needs to be summarized.

{dialogue}

Write a summary here: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
   
    #original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    #original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    #instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    #instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    #original_model_summaries.append(original_model_text_output)
    #instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,Ms. Dawson helps #Person1# to write a memo to ...,Employee memo is due to go out by this afternoon.,The memo will go out to all employees by this ...,The memo will go out to all employees by this ...
1,In order to prevent employees from wasting tim...,The memo is being distributed to all employees...,The memo will go out to all employees by this ...,The memo will go out to all employees by this ...
2,Ms. Dawson takes a dictation for #Person1# abo...,"The memo is for the following: ""It is a dictat...",The memo will go out to all employees by this ...,The memo will go out to all employees by this ...
3,#Person2# arrives late because of traffic jam....,#Person1: I'm happy to see you finally here. #...,The traffic jam at the Carrefour intersection ...,The traffic jam near the Carrefour intersectio...
4,#Person2# decides to follow #Person1#'s sugges...,The traffic was terrible.,The traffic jam at the Carrefour intersection ...,The traffic jam near the Carrefour intersectio...
5,#Person2# complains to #Person1# about the tra...,People are complaining about the congestion in...,The traffic jam at the Carrefour intersection ...,The traffic jam near the Carrefour intersectio...
6,#Person1# tells Kate that Masha and Hero get d...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
7,#Person1# tells Kate that Masha and Hero are g...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
8,#Person1# and Kate talk about the divorce betw...,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.,Masha and Hero are getting divorced.
9,#Person1# and Brian are at the birthday party ...,#Person1#: I'm so happy to see you have a part...,Brian's birthday is coming up.,Brian's birthday is coming up.


On to the ROUGE scores...

In [46]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)


print('Original model:')
print(original_model_results)
print('Instruct model:')
print(instruct_model_results)
print('LoRA model:')
print(instruct_model_results)

Original model:
{'rouge1': 0.2093862423370193, 'rouge2': 0.07284353971171693, 'rougeL': 0.1825140348261415, 'rougeLsum': 0.1820160064971703}
Instruct model:
{'rouge1': 0.29293514576360014, 'rouge2': 0.11264452616822107, 'rougeL': 0.2449479682766204, 'rougeLsum': 0.24590160739029454}
LoRA model:
{'rouge1': 0.29293514576360014, 'rouge2': 0.11264452616822107, 'rougeL': 0.2449479682766204, 'rougeLsum': 0.24590160739029454}


The remarkable thing here is that LoRA produced very similar results as the instruct model, even though the method that it used was computationally much lighter. 

I can verify that the results weren't 100% identical by looking at the entry for the 19th row -- notice how the instruct and LoRA models do indeed differ (so it wasn't a bug in my code that made me reuse an identical model). However, since there is no difference in n-gram overlap with the human baselines in this case, the ROUGE scores ultimately come out the same despite these small differences.

In [50]:
print(df['human_baseline_summaries'][18])
print(df['instruct_model_summaries'][18])
print(df['peft_model_summaries'][18])

#Person2# feels itchy. #Person1# doubts it is chicken pox and asks #Person2# to get away. #Person2# doesn't believe it.
Person1 is scratching a lot.
Person1 is scratching so much that he feels lightheaded and weak.


<a name='7'></a>
## 7 - Using Reinforcement Learning with Human Feedback (RLHF) for further refinement

A common step in improving LLMs is using human-annotated feedback to further refine a trained model.

First, humans label prompts along a given dimension, e.g., using the HHH criteria "helpful / not helpful", "honest / not honest," and "harmless / not harmless"

Then, a reinforcement learning model is trained based on these human ratings.

Finally, the model itself feeds its output to this reinforcement learning model and uses the generated "reward" to update its weights.

To make sure that the model doesn't just "reward-hack" at this stage (doing whatever it needs to to maximize reward from the reinforcement learning model but neglecting its original goal of carrying out the task at hand), the learning from this rewards is tempered by a metric that calculates how similar the updated model's output is to the original model's output.

To begin, let's prepare the reward model: 

<a name='7.1'></a>
### 7.1 - Load the reward model

My first attempt used a model created by Facebook that classifies text inputs as either "hate" or "not hate" -- see https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target. The Huggingface website even has a nifty widget that lets you calculate a toxicity score for a given input sentence directly from the browser.

However, this model is quite conservative in its ratings, perhaps because their definition of hate speech is targeted very narrowly to mean language against a particular demographic group, rather than just hateful language. For instance, a sentence like "The Barbie movie was a disgusting piece of fake-feminist materialistic garbage" yields a 85.7% probability of 'not hate.'

By contrast, a separate model for categorizing text as "toxic" vs "nontxic" I found (https://huggingface.co/s-nlp/roberta_toxicity_classifier) gave a 99.9% probability of "toxic" to that same sentence.

In [54]:
#reward_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_model_name = "s-nlp/roberta_toxicity_classifier"
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name, device_map="auto")
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, device_map="auto")
print(reward_model.config.id2label)

Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{0: 'neutral', 1: 'toxic'}


Here is a toy example of how this works:    

In [38]:
nontoxic_text = "The Barbie movie exceeded my expectations."

print(nontoxic_text)

reward_input_ids = reward_tokenizer(nontoxic_text, return_tensors="pt").input_ids

logits = reward_model(input_ids=reward_input_ids).logits
print(f'Logit values [not toxic, toxic]: {logits.tolist()[0]}')

#convert logit values to probability
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'Probabilities [not toxic, toxic]: {probabilities}')

#given the input logits, how much reward to give?
nottoxic_index = 0
nottoxic_reward = (logits[:, nottoxic_index]).tolist()
print(f'reward (high): {nottoxic_reward}\n')

#rerun the example with toxic text
toxic_text = "The Barbie movie was a disgusting piece of fake-feminist materialistic garbage."

print(toxic_text)

reward_input_ids = reward_tokenizer(toxic_text, return_tensors="pt").input_ids
logits = reward_model(input_ids=reward_input_ids).logits
print(f'Logit values [not toxic, toxic]: {logits.tolist()[0]}')

#convert logit values to probability
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'Probabilities [not toxic, toxic]: {probabilities}')

toxic_reward = (logits[:, nottoxic_index]).tolist()
print(f'reward (low): {toxic_reward}')


The Barbie movie exceeded my expectations.
Logit values [not toxic, toxic]: [3.5713207721710205, -3.086273670196533]
Probabilities [not toxic, toxic]: [0.9987174272537231, 0.00128258450422436]
reward (high): [3.5713207721710205]

The Barbie movie was a disgusting piece of fake-feminist materialistic garbage.
Logit values [not toxic, toxic]: [0.9113463759422302, -0.8771229982376099]
Probabilities [not toxic, toxic]: [0.8567395210266113, 0.14326049387454987]
reward (low): [0.9113463759422302]


All of this can be made simpler by creating a pipeline as follows (to be used in our code further below):

In [157]:
device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", 
                          model=reward_model_name, 
                          device=device)
reward_logits_kwargs = {
    "top_k": None, #return all scores, not just for top k candidates
    "function_to_apply": "none", # use raw logits instead of applying function
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, 
    "function_to_apply": "softmax", #"softmax" function converts logit values to probabilities
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(nontoxic_text, **reward_logits_kwargs))
print(sentiment_pipe(nontoxic_text, **reward_probabilities_kwargs))

print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Reward model output:
For non-toxic text
[{'label': 'neutral', 'score': 4.369976997375488}, {'label': 'toxic', 'score': -4.438220500946045}]
[{'label': 'neutral', 'score': 0.9998505115509033}, {'label': 'toxic', 'score': 0.0001494801981607452}]
For toxic text
[{'label': 'toxic', 'score': 3.440110921859741}, {'label': 'neutral', 'score': -3.532526731491089}]
[{'label': 'toxic', 'score': 0.9990636706352234}, {'label': 'neutral', 'score': 0.0009363002027384937}]


<a name='7.2'></a>
### 7.2 - Calculating baseline toxicity stats

Let's get a sense of how toxic the output of our original model is. 

Before we actually crunch the numbers, we should set up an evaluation metric for toxicity as follows:

In [72]:
toxic_evaluator = evaluate.load("toxicity", 
                                    reward_model_name,
                                    module_type="measurement")

Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [75]:
toxicity_score = toxic_evaluator.compute(predictions=[
    nontoxic_text
], toxic_label="toxic")

print("Toxicity score for following sentence: "+ nontoxic_text)
print(toxicity_score["toxicity"])

toxicity_score = toxic_evaluator.compute(predictions=[
    toxic_text
], toxic_label="toxic")

print("\nToxicity score for following sentence: "+toxic_text)
print(toxicity_score["toxicity"])

Toxicity score for following sentence: The Barbie movie exceeded my expectations.
[0.0001494801981607452]

Toxicity score for following sentence: The Barbie movie was a disgusting piece of fake-feminist materialistic garbage.
[0.9990636706352234]


Looks good?

For the next steps it'll be helpful to have the original English text (not just the number-converted encoded input_ids) in our dataset.

We can retrieve this by decoding the input_ids as follows -- the code is a little ugly but it gets the job done:

In [218]:
decoded_text_vector = []

for i in subsample_tokenized_datasets['train']['input_ids']:
     decoded_text_vector.append(tokenizer.decode(i,skip_special_tokens=True))
    
subsample_tokenized_datasets['train'] = subsample_tokenized_datasets['train'].add_column("query",decoded_text_vector)

decoded_text_vector = []

for i in subsample_tokenized_datasets['test']['input_ids']:
     decoded_text_vector.append(tokenizer.decode(i,skip_special_tokens=True))

subsample_tokenized_datasets['test'] = subsample_tokenized_datasets['test'].add_column("query",decoded_text_vector)

ValueError: The table can't have duplicated columns but columns ['query'] are duplicated.

Now to set the evaluator loose on our pre-finetuned model to get the mean and standard deviation of the toxicity of its dialogue summaries. The best approach here is to first define a function as follows:

In [194]:
def evaluate_toxicity(model, 
                      toxicity_evaluator, 
                      tokenizer, 
                      dataset, 
                      num_samples):

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        if i > num_samples:
            break
            
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids
        
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             tok_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)
        
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)
        
        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)], toxic_label="toxic")

        toxicities.extend(toxicity_score["toxicity"])

    mean = np.mean(toxicities)
    std = np.std(toxicities)
        
    return mean, std

One more step: as mentioned above implementing Reinforcement Learning with Human Feedback isn't just a matter of using reward values to adjust model weights based on the toxicity (or honesty, or helpfulness, etc.) of the model. To prevent "reward hacking" (e.g., the LLM stops trying to carry out the actual task and instead just talks about love, sunishine, and rainbows to minimize its toxicity score), we also need to weigh this against how similar the retrained model outputs are to what the original model gave (calculated using metrics like KL divergence). This balances how much learning occurs from the reward vs. the similarity to the original model.
    
We create a copy of the model that is trainable (so that its values can be iteratively updated with each toxicity assessment) as well as a "frozen" reference model that remains unchanged during toxicity evaluation. The create_reference_model() function makes a copy whose parameters are not trainable.

In [195]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,                                                               
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

ref_model = create_reference_model(ppo_model)

Now use the function we just defined to get the mean and SD of the toxicity of our pre-training model outputs, to give us a baseline comparison:

In [196]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_toxicity_original, std_toxicity_original = evaluate_toxicity(model=ref_model, 
                                                                          toxicity_evaluator=toxic_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=subsample_tokenized_datasets['test'], 
                                                                          num_samples=10)

print(f'Mean toxicity and standard deviation before detox: [{mean_toxicity_original}, {std_toxicity_original}]')

11it [00:34,  3.12s/it]

Mean toxicity and standard deviation before detox: [3.804054689350199e-05, 6.522331639452418e-06]





These numbers are actually extremely low (perhaps not surprising since the Dialogsum data are just rather banal conversations, not online social media content)

<a name='7.3'></a>
### 7.3 - Fine-tuning our model to reduce toxicity

Now to actually fine-tune the model, using a combination of reward values and similarity metrics from the original, "frozen" model.

The next step requires a collator function to reformat our data slightly, taking a dictionary out of a one-element list and giving entries within the dictionary as lists of length 1:

In [143]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')



Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


Now to define some parameters for our training function. Note that we are still using our mini version of the data (100th the size of the full Dialogsum dataset):

In [197]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,    
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config, 
                         model=ppo_model, 
                         ref_model=ref_model, 
                         tokenizer=tokenizer, 
                         dataset=subsample_tokenized_datasets["train"], 
                         data_collator=collator)

Here is the fine-tuning itself. Note that LengthSampler() uniformly samples values from a defined range so that the query and response lengths are randomized. I set it to 10 PPO steps but this can be adjusted based on your compute availability. This took about 10 minutes to run on a computer with 11th Gen Intel i5-11300H 3.10GHz CPU with 32GB RAM and an Nvidia Geforce RTX 3050 GPU:

In [198]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    if step >= max_ppo_steps:
        break   

    prompt_tensors = [torch.tensor(x) for x in batch["input_ids"]] #convert from integers to tensor format

    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()        
            
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
        
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]    
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    reward_tensors = [torch.tensor(reward[nottoxic_index]["score"]) for reward in rewards]    

    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

0it [00:00, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
7it [16:44, 143.48s/it]


<a name='7.4'></a>
### 7.4 - Quantifying the toxicity of our detoxified model

We'll use the same procedure as before to crunch some toxicity stats for our newly-detoxified model:

In [200]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_toxicity_detoxmodel, std_toxicity_detoxmodel = evaluate_toxicity(model=ppo_model, 
                                                                          toxicity_evaluator=toxic_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=subsample_tokenized_datasets['test'], 
                                                                          num_samples=10)

print(f'Mean toxicity and standard deviation after detox: [{mean_toxicity_detoxmodel}, {std_toxicity_detoxmodel}]')

11it [00:33,  3.00s/it]

Mean toxicity and standard deviation after detox: [3.8285791643747046e-05, 5.815389222470542e-06]





The numbers were quite small to begin with, but for reference here is some code to compare the pre/post detox stats in terms of percentage change:

In [201]:
mean_toxicity_change = (mean_toxicity_original - mean_toxicity_detoxmodel) / mean_toxicity_original
std_toxicity_change = (std_toxicity_original - std_toxicity_detoxmodel) / std_toxicity_original

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_toxicity_change*100:.2f}%')
print(f'std: {std_toxicity_change*100:.2f}%')

Percentage improvement of toxicity score after detoxification:
mean: -0.64%
std: 10.84%


The difference isn't very significant (perhaps due to the small toxicity scores in the first place?)

<a name='7.5'></a>
### 7.5 - Visually comparing pre-detoxified vs. detoxified models

For visualization purposes, let's take just the first ten entries in our test data and compare the summaries and reward scores for the pre- and post-detoxification models.

In [214]:
batch_size = 10
compare_results = {}

df_batch = subsample_tokenized_datasets["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

#Translate from encoded numerical tensors to human-legible tokens
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

#Compare the pre/post detox values
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[nottoxic_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[nottoxic_index]["score"] for reward in rewards_after]

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:58<00:00,  5.81s/it]


Create a Pandas dataframe to view these. We can sort by " reward difference" to see where the difference in toxicity was largest betweeh the pre-detox vs. detoxified models:

In [215]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

Unnamed: 0,query,response_before,response_after,reward_before,reward_after,reward_diff
0,"Produce a summary of the following dialogue. #Person1#: I cannot imagine if Trump were to be our President again. #Person2#: I am proud to say that he is our President, and I will be really happy if he could be re-elected. #Person1#: You voted for him, right? #Person2#: Did you vote for him, because I know that I did. #Person1#: I am not sure about this. #Person2#: I have nothing but faith in Trump. #Person1#: What? #Person2#: I am pretty sure he will make America great again! #Person1#: Wel...",<pad> The President of the United States needs to succeed because of his lack of experience and mendable nature.</s>,<pad> People are excited about Trump and try to imagine his war in the future.</s>,4.625224,4.986396,0.361172
1,"Produce a summary of the following dialogue. #Person1#: Have you considered upgrading your system? #Person2#: Yes, but I'm not sure what exactly I would need. #Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising. #Person2#: That would be a definite bonus. #Person1#: You might also want to upgrade your hardware because it is pretty outdated now. #Person2#: How can we do that? #Person1#: You'd probab...",<pad> A painting program is great for advertising. A CD-ROM drive is also included.</s>,"<pad> #Person2#: You're just clicking on your computer. Or join in on one of the Facebook groups. #Person1#: You have a few options to make your computer computer more complicated. #Person2#: Is there anything I can do? #Person1#: To learn more, please subscribe here. Where I Get started: http://www.facebook.com/people/site/21251293145.html</s>",4.95996,4.996674,0.036714
2,"Produce a summary of the following dialogue. #Person1#: We need to call an emergency meeting as soon as possible about this. #Person2#: OK. Let me send a memo around. #Person1#: Schedule it for this afternoon, when Ken gets back. #Person2#: I don't think he's back today. #Person1#: Oh, that's right. Go ahead anyways. I'll fill him in. We can't lose this order under any circumstances! #Person2#: I know, it's a big one. Summary:",<pad> The emergency meeting will be held in an emergency. Ken might not be here today.</s>,<pad> Parents are wondering if Ken is back.</s>,5.006478,5.016961,0.010483
3,"Produce a summary of the following dialogue. #Person1#: Ms. Dawson, I need you to take a dictation for me. #Person2#: Yes, sir... #Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready? #Person2#: Yes, sir. Go ahead. #Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly ...","<pad> The Office of the President, Vice President and Director of Communications. NYPD and Department Heads: Effective immediately, all office communications are restricted to email correspondence and official memos.</s>",<pad> #Person1#: We are converting an intra-office memo on the information regarding employee access.</s>,4.971152,4.981174,0.010022
4,"Produce a summary of the following dialogue. #Person1#: It was a heavy storm last night, wasn't it? #Person2#: It certainly was. The wind broke several windows. What weather! #Person1#: Do you know that big tree in front of my house? One of the biggest branches came down in the night. #Person2#: Really? Did it do any damage to your home? #Person1#: Thank goodness! It is far away from that. #Person2#: I really hate storms. It's about time we had some nice spring weather. #Person1#: It's April...",<pad> The installation of the rain system last night broke several windows.</s>,<pad> There was a heavy storm with a presidential and a slightly angry outburst of tremors.</s>,4.778887,4.777029,-0.001858
5,"Produce a summary of the following dialogue. #Person1#: Are you going anywhere for your vacation? #Person2#: Yes, we're making plans for a tour. #Person1#: That'll be lovely. Where are you going? #Person2#: Well, we will start out from Long Island this Friday. We've planned a four day drive to Salt Lake City, where we'll join my brother and his family on his fortieth birthday. #Person1#: Well, you've got to prepare a lot of food and enough sleeping bags then. #Person2#: Oh, we'll spend the n...","<pad> David is planning a four day tour to Salt Lake City. To join his brother and his family, they plan to take a four day drive this week, and lay out food and sleeping bags. It's being promoted as a Christmas bet.</s>","<pad> David and his brother are going for a four day tour to Salt Lake City. They plan on staying at a house most likely in Long Island, as they don't have a car when doing the drive.</s>",5.019026,5.013295,-0.005731
6,"Produce a summary of the following dialogue. #Person1#: Dad, you keep talking about family in New Zealand. Who are they? #Person2#: Well, that's your uncle Bill, his wife and two of their daughters. #Person1#: Is uncle Bill your brother? #Person2#: No, your uncle Jack is my brother, Bill is my brother-in-law, your mom's brother. #Person1#: So his two daughters are my cousins? #Person2#: That's right, Sarah and Jane are both your cousins although they are step-sisters. #Person1#: What are ste...",<pad> The family in New Zealand is said to have a lot in common.</s>,<pad> Parents of children called visiting their relatives in New Zealand. People who weren't interesting called the new British stop and Spanish-language visitors.</s>,5.00191,4.985108,-0.016802
7,"Produce a summary of the following dialogue. #Person1#: Oh, I'm starving. It's my first time to China. And I'd like to try some real Chinese cuisine. What would you recommend? #Person2#: Well, depends. You see, there are eight famous Chinese food cuisines, for instance, Sichuan cuisine and Hunan cuisine. #Person1#: There're all spicy or hot of heard. #Person2#: That's right. If you have hot dishes, you can try some. #Person1#: I cannot have it. Last time I had some in the US. It almost kille...","<pad> We were unable to find a place for 'Pennhaburi' in China, because there is a lot of food. I'm starving, so I'll try to find 'Cantonese' restaurant.</s>","<pad> TellPeople1 which Chinese restaurant is best for them. Go to the nearest restaurant, and wait for a taxi right away.</s>",4.802761,4.780698,-0.022063
8,"Produce a summary of the following dialogue. #Person1#: What do you think of my new suit? #Person2#: Not bad. It reminds me of the one I saw at the new department store last week. Did you get it there? #Person1#: No, I got it in that big shopping center. It cost me only $ 150. #Person2#: Well, I don't think it's a good bargain. Summary:",<pad> They received the suit in a big shopping center.</s>,<pad> Person1 and #Person2 are both complaining about their new suits.</s>,4.976263,4.943659,-0.032604
9,"Produce a summary of the following dialogue. #Person1#: OK, that's a cut! Let's start from the beginning, everyone. #Person2#: What was the problem that time? #Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation. #Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would...","<pad> As the situation involves Jason and Laura, they tried using their reasoning to cast a light on Mike's feelings.</s>",<pad> It was seems the person who messed up his sensing will react differently to it than the one that Mike messed up.</s>,4.968035,4.84825,-0.119785


The summaries above are a little subpar. Bear in mind that a. for this exercise I used a dataset only 100th the size of the original Dialogsum data; b. the original data wasn't that toxic in the first place, so the detoxification wouldn't lead to a massive change, and c. there are a ton of training parameters that I didn't fiddle with here: length of input, number of training epochs, learning rate, etc.

In any case, I hope this exercise serves as a nice mini-tutorial for how a text-summarizing model can be improved with in-context learning, instruction finetuning, and reinforcement learning with human feedback!