# Mini-ChatGPT

This tutorial assumes the reader has some basic machine learning (ML) and natural language processing (NLP) knowledge, but is new to training large language models (aka foundation models, aka transformer neural networks) using reinforcement learning. If you know what a loss (aka cost, aka objective) function is and how a token (aka input_id) is mapped to a vector (aka embedding), you should be good.

If not, here is a really great place to start: https://lena-voita.github.io/nlp_course/language_modeling.html

Special thanks to Leandro von Werra's https://github.com/lvwerra/trl repo.

The diagram below is from https://openai.com/blog/chatgpt/. ChatGPT is an improved InstructGPT.

<img src='https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagram.svg'>

#### Figure 1

### Step 1

Supervised Fine-Tuning (SFT) is the process of taking a model already pre-trained on a very large corpus of data and re-training it carefully, so as not to erase the knowledge the model gained from that pre-training (see [catastrophic forgetting](https://github.com/clam004/intro_continual_learning)), but rather to nudge the model to perform better in a more specific or narrow domain represented by a much smaller fine-tuning dataset.

There are already plenty of resources on Supervised Fine-Tuning (SFT), aka fine-tuning. Some have impressively good explanations but mistakenly claim to help you build a mini-ChatGPT. I will not cover SFT; instead I will point you to some of my favorite resources.

Here is an explanatory video by Hugging Face on decoder, or autoregressive, transformers like GPT: https://youtu.be/d_ixlCubqQw

Here is a great blog explaining language modeling and the cross entropy loss function: https://lena-voita.github.io/nlp_course/language_modeling.html. I would start here if you have done ML but not deep learning for NLP before.

Fine-tuning GPT-like models: https://huggingface.co/course/chapter7/6?fw=pt and [Guide to fine-tuning Text Generation models: GPT-2, GPT-Neo and T5](https://towardsdatascience.com/guide-to-fine-tuning-text-generation-models-gpt-2-gpt-neo-and-t5-dc5de6b3bc5e)

ChatGPT was likely fine-tuned on a dataset of instructions and examples of following them.
 
### Step 2

Fine-tuning usually means training for only a small number of epochs with a much smaller learning rate, while still using the same cross entropy loss function used in pre-training. But this way of learning is a bit limited. Can you tell why?

<img src="https://lena-voita.github.io/resources/lectures/lang_models/neural/one_step_loss_intuition-min.png">

#### Figure 2

This example shows how training with cross entropy rewards or penalizes the model based only on how much probability mass it assigns to "cat", just this one way of completing the sequence. But we know that in reality, when generating responses to instructions, there isn't one right way to do it. Even in the cases when there is one right answer, there are many ways to be good and many ways to be bad.
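
To make that limitation concrete, here is a minimal sketch in plain Python of the cross entropy loss for a single prediction step. Only the target "cat" comes from Figure 2; the four-word vocabulary and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the single target token."""
    return -math.log(softmax(logits)[target_index])

# Hypothetical vocabulary; the target is "cat" as in Figure 2
vocab = ["cat", "kitten", "dog", "the"]
target = vocab.index("cat")

# Two hypothetical models give "cat" the same logit but disagree on the rest
logits_a = [2.0, 1.5, 0.5, -1.0]  # second choice is "kitten", a sensible alternative
logits_b = [2.0, -1.0, 0.5, 1.5]  # second choice is "the", a poor alternative

loss_a = cross_entropy(logits_a, target)
loss_b = cross_entropy(logits_b, target)
print(loss_a, loss_b)
```

Both losses come out the same (up to floating point), because the loss only looks at the probability of "cat". The model that prefers a reasonable alternative gets no credit over the one that prefers a poor one.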

To keep this tutorial lightweight, intuitive, and able to fit on commonly available compute, we have simplified the overall strategy in 2 major ways:

A. ChatGPT and InstructGPT are very specially trained with reinforcement learning (RL), but when you generalize, ChatGPT is just taking some prompt, aka input text (a phrase or instructions), and generating an output, aka a continuation of that text (a response or answer). It takes a lot more compute and memory to represent these long instructions and long answers in a transformer, so instead we simplify the input text to the start of a movie review (the first few words or subwords) and simplify the output to a continuation of those first few words.

B. The reward modeling has been simplified compared to InstructGPT (https://openai.com/blog/instruction-following/), the model that ChatGPT is a scaled-up and improved version of.

Instead of training a reward model on human rankings of output text, which was done to make the reward signal more stable and reliable, we use the more direct tactic of taking the logits from a classifier (like BERT) as the reward signal: "positive means do more like this, negative means do less like this", where positive means the text reads like a review of a movie the reviewer thinks is good (a positive review).

### Step 3

Given our limited time, I am going to focus on giving you what is hard to find. That is, good explanations of the parts that are important yet usually glossed over, or that are usually explained in a much more jargon-heavy, field-specific, technical or mathematical manner.

That means we will be focusing on step 3 in Figure 1. Namely, how reinforcement learning, something we have so far seen in the popular media mostly applied to computer games, can instead be applied to natural language, that is, to the generation of sequences of discrete tokens.

In [1]:
%load_ext autoreload
%autoreload 2

In [27]:
from tqdm.notebook import tqdm
tqdm.pandas()

import torch

from minichatgpt.experiments.imdb import config, sent_kwargs
from minichatgpt.processdata.build_dataset import build_dataset
from minichatgpt import PPOTrainer, LengthSampler, Lab

In [5]:
lab = Lab(config)

dataset = lab.build_dataset(dataset_name="imdb",input_min_text_length=2,input_max_text_length=8)

new_policy, old_policy, tokenizer = lab.init_policies_tokenizer()

lab.set_generation_config(output_min_length=4,output_max_length=16,pad_token_id=tokenizer.eos_token_id)

ppo_trainer = lab.init_ppo_trainer(config, new_policy, old_policy, tokenizer, dataset)

reward_model = lab.init_reward_model()

#### A new prompt is sampled from the dataset

In the cells below, the text in our samples is represented by the token IDs in `input_ids`, and each token is either a word or a subword. Roughly speaking, the average token represents a word or subword that is around 3 characters long: 'can', 'ed', 'con', 'tion', etc. Because of the arguments `input_min_text_length=2, input_max_text_length=8` passed to `lab.build_dataset()` above, you will find a random assortment of sample lengths in each batch, but none shorter than 2 tokens or longer than 8 tokens.
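
As a rough sketch of what this kind of truncation could look like, the snippet below cuts a tokenized review down to a randomly chosen prompt length. The helper names `sample_length` and `truncate_prompt` are hypothetical and are not part of `minichatgpt`. Treating the upper bound as inclusive is also an assumption, since the library's exact behavior is not shown here.

```python
import random

def sample_length(min_len, max_len, rng=random):
    """Pick a length uniformly between min_len and max_len (inclusive, by assumption)."""
    return rng.randint(min_len, max_len)

def truncate_prompt(token_ids, min_len=2, max_len=8, rng=random):
    """Keep only the first few tokens of a review to use as the prompt."""
    return token_ids[: sample_length(min_len, max_len, rng)]

rng = random.Random(0)              # seeded so the sketch is repeatable
review_ids = list(range(100, 130))  # stand-in for a tokenized 30-token review
prompts = [truncate_prompt(review_ids, rng=rng) for _ in range(5)]
print([len(p) for p in prompts])
```

Each prompt is a prefix of the same review, and only its length varies from sample to sample.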

In reinforcement learning, the agent acts in an environment, like a computer game, and takes actions, like hitting the jump button or moving to the left, which may later result in more or less reward, like points in a game. Here the environment is the random assortment of samples, which the model can continue in many ways. The action is how the model chooses to continue those samples, aka prompts.
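
For a language model, an action is really a series of smaller choices, one token at a time. The probability the policy gives to a whole continuation is the product of the per-token probabilities, or equivalently the sum of the per-token log-probabilities. Here is a minimal plain-Python sketch of that bookkeeping. The logits and token ids are made-up numbers, and a real policy would produce tensors over a vocabulary of tens of thousands of tokens.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one generation step."""
    m = max(logits)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def sequence_logprob(step_logits, chosen_ids):
    """Sum the log-probabilities of the tokens actually chosen at each step."""
    return sum(
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, chosen_ids)
    )

# Toy policy output: 3 generation steps over a 4-token vocabulary
step_logits = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # uniform: every token has probability 1/4
]
chosen_ids = [0, 1, 3]  # the token the policy picked at each step

logp = sequence_logprob(step_logits, chosen_ids)
print(logp, math.exp(logp))
```

These per-token log-probabilities are the quantities a policy-gradient method adjusts: it raises them for continuations that earned a high reward and lowers them for continuations that earned a low one.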

In [11]:
for batch_step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    
    query_tensors = batch['input_ids']
    
    break
    
print('the part of each batch are: ', batch.keys())
print('-'*50)
print('each batch has ', len(batch['query']), 'samples')
print('-'*50)
print('here are some samples examples ', batch['query'])
print('-'*50)
print('here are the token ids of those examples ', batch['input_ids'])

the part of each batch are:  dict_keys(['label', 'input_ids', 'query'])
--------------------------------------------------
each batch has  4 samples
--------------------------------------------------
here are some samples examples  ['Firstly let me say that I didn', 'clara bow', 'I have seen', 'I cant describe how terrible this']
--------------------------------------------------
here are the token ids of those examples  [tensor([49709,  1309,   502,   910,   326,   314,  1422]), tensor([ 565, 3301, 9563]), tensor([  40,  423, 1775]), tensor([   40, 18548,  6901,   703,  7818,   428])]


#### The policy generates an output

In [12]:
#### Get response from gpt2
response_tensors = []
for query in query_tensors:
    gen_len = lab.output_length_sampler()
    lab.generation_kwargs["max_new_tokens"] = gen_len
    response = ppo_trainer.generate(query, **lab.generation_kwargs)
    response_tensors.append(response.squeeze()[-gen_len:])
    
response_tensors

[tensor([  470,  1607,   597, 17774,   393, 18078,  7188]),
 tensor([ 1424,    11,   290,   607,   366,  5886, 12728,     1,  5585,   318,
          2818]),
 tensor([  867,   922,  6918,  7924,   416,   475,   973,   845, 31363,   287,
         10514, 47226,  1114]),
 tensor([ 3807,   373,    13, 21326,   484,   389,   319])]

In [25]:
batch['response'] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

concat_text = [q + " >>> " + r for q, r in zip(batch['query'], batch['response'])]

concat_text

["Firstly let me say that I didn >>> 't expect any entertaining or memorable moments",
 'clara bow >>> els, and her "Overkill" appearance is perfect',
 'I have seen >>>  many good movies directed by but used very unrealistic inpiration :-) For',
 'I cant describe how terrible this >>>  movie was. Whenever they are on']

I used `>>>` to separate the prompt (left side), which is sampled from the dataset, from the text generated by the policy (right side).

In [23]:
lab.reward_model('this movie was really bad!!',**sent_kwargs)



[[{'label': 'NEGATIVE', 'score': 2.335048198699951},
  {'label': 'POSITIVE', 'score': -2.726576089859009}]]
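
The scores above are not probabilities (one of them is negative), so they appear to be raw logits. As a small sketch of how to read them, the snippet below applies a softmax to the two numbers copied from the output above. The helper name `softmax_dict` is made up for this example.

```python
import math

# Scores copied from the reward model output above
scores = {"NEGATIVE": 2.335048198699951, "POSITIVE": -2.726576089859009}

def softmax_dict(scores):
    """Turn a dict of logits into a dict of probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

probs = softmax_dict(scores)
print(probs)
```

Read this way, the classifier puts more than 99% of its probability on NEGATIVE for this text. One likely reason to use the raw logit as the reward is that it is unbounded, while a probability flattens out near 0 and 1.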

#### The reward model calculates a reward for the output

In [28]:
#### Compute sentiment score
texts = [q + r for q,r in zip(batch['query'], batch['response'])]
pipe_outputs = lab.reward_model(texts, **sent_kwargs)
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

rewards



[tensor(-1.2059), tensor(2.3504), tensor(-1.3959), tensor(-2.1491)]

#### The reward is used to update the policy using PPO

PPO stands for proximal policy optimization. It is the reinforcement learning step, and it can be done with a single call, `stats = ppo_trainer.step(query_tensors, response_tensors, rewards)`. But we are here to learn, so we will go through PPO as it applies to sequences of discrete tokens, step by step.
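
As a preview, here is a minimal PyTorch sketch of the clipped surrogate objective at the core of PPO, applied to per-token log-probabilities. This is the textbook form of the objective, not a copy of what `ppo_trainer.step` does internally, which may include further terms such as a value loss or a KL penalty. The function name `ppo_clip_loss`, the `clip_eps=0.2` default, and all the numbers are assumptions for illustration.

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective, negated so it can be minimized."""
    # How much more (or less) likely each token is under the updated policy
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to move the ratio far from 1
    return -torch.min(unclipped, clipped).mean()

# Made-up per-token values for a 3-token response
old_logprobs = torch.tensor([-1.0, -2.0, -0.5])
new_logprobs = torch.tensor([-0.5, -2.0, -0.9])
advantages = torch.tensor([1.0, 1.0, -1.0])

loss = ppo_clip_loss(new_logprobs, old_logprobs, advantages)
print(loss)
```

The first token's ratio is about 1.65, so it is clipped to 1.2: the policy gets no extra credit for moving that far in a single update. This is the "proximal" part of the name.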