# Explanation

GPT-2 and GPT-3 broke many records in a variety of language modeling tasks and captured a lot of attention in the deep learning and broader technical communities - but they still didn't initially break through into the mainstream in the way that these models have now.

The models were still difficult to use in practical scenarios - by default GPT-3 worked best as a few-shot learner, meaning it usually needed to be prompted with examples of how to respond in order to generate a useful response.

This kept the model from being useful to most people, for whom this mode of interaction was impractical.

**Reinforcement learning from human feedback** (RLHF) was the last step needed to make these models practically useful in most contexts.

[Early work in RLHF for LLMs](https://arxiv.org/pdf/1909.08593) demonstrated the efficacy of collecting human preference labels over model outputs to get the models to fit a more practically useful response format.

Then, InstructGPT applied these techniques to fine-tune GPT-3 models of several sizes. This had surprising results - when surveyed, labelers preferred the responses of the 1.3B parameter InstructGPT model over the responses of the 175B parameter GPT-3, despite the InstructGPT model having more than 100x fewer parameters.

The introduction of the instruct series of GPT models is arguably what led to the wave of mainstream AI adoption we're in today - ChatGPT was released as a sibling model to InstructGPT, and the obviously practical assistant format made the tool immediately catch on.

The success of InstructGPT was an initial indication of the effect and leverage of good fine-tuning with specific taste - despite all previous language models also using the pre-training and fine-tuning pattern, the fine-tune of InstructGPT was clearly far preferable to anything tried in the past.

### Approach

The fine-tuning approach taken by InstructGPT had 3 important steps.

First, they started by creating a human-generated dataset of good output behaviors for an assistant model. These are pairs of prompts a human might give the model, each with an ideal response written by a human labeler. Because this dataset has such a large impact on the model's behavior, the human labelers were carefully vetted by OpenAI.

They used this dataset to fine-tune the GPT model after pre-training, making its responses initially resemble those of an assistant.
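As a rough illustration of this supervised step, the sketch below computes a next-token cross-entropy loss on a single demonstration. Averaging the loss over the response tokens only (via `response_mask`) is a common convention and an assumption here, not something taken from the paper; the function names are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over one position's vocabulary scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def sft_loss(logits, target_ids, response_mask):
    """Mean negative log-likelihood of the labeler-written response tokens.

    logits: one list of vocabulary scores per position
    target_ids: the token id that actually came next at each position
    response_mask: True where the target token belongs to the response
                   (assumption: prompt tokens are excluded from the loss)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, target_ids, response_mask):
        if keep:
            total -= log_softmax(scores)[target]
            count += 1
    return total / count
```

Minimizing this loss pushes the model to assign higher probability to the demonstrated response given the prompt.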

Next, they collected comparisons between different outputs generated by the actual model for the same prompt. These comparisons were labeled by humans, who selected the preferable response.
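In the InstructGPT paper, labelers rank several responses (between 4 and 9) to the same prompt, and each ranking is expanded into every possible pairwise comparison. A minimal sketch of that expansion is below; the function name and tuple layout are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand one labeler ranking into (prompt, preferred, rejected) tuples.

    ranked_responses must be ordered from best to worst. Because
    combinations() preserves input order, the first element of each
    pair is always the higher-ranked response.
    """
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]
```

A ranking of K responses yields K choose 2 comparisons, so a single ranking of 4 responses produces 6 training pairs.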

This creates a comparison dataset, which is used to train a reward model that predicts which outputs humans would prefer.
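The reward model is trained with a pairwise loss: for each comparison, it should assign a higher scalar score to the preferred response than to the rejected one. The sketch below implements the loss term `-log(sigmoid(r_preferred - r_rejected))` on plain floats. The function names are hypothetical, and in practice the scores would come from a neural network rather than being passed in directly.

```python
import math

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably.

    Uses the identity -log(sigmoid(d)) = softplus(-d)
                                       = max(-d, 0) + log(1 + exp(-|d|)).
    """
    diff = reward_preferred - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def reward_model_loss(scored_pairs):
    """Average pairwise loss over (preferred_score, rejected_score) tuples."""
    return sum(pairwise_loss(w, l) for w, l in scored_pairs) / len(scored_pairs)
```

The loss is small when the preferred response scores well above the rejected one, and grows roughly linearly when the ordering is reversed.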

Finally, this reward model is used as the reward function, and the fine-tuned model is further optimized to maximize that reward using proximal policy optimization (PPO).
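Two pieces of this step can be sketched in a few lines. The InstructGPT paper adds a KL penalty against the supervised model so the policy does not drift too far while chasing reward, and PPO itself optimizes a clipped surrogate objective. The code below is a simplified illustration, not the paper's implementation: the penalty is applied at the sequence level rather than per token, the function names are hypothetical, and the default values of `beta` and `clip_eps` are illustrative assumptions.

```python
def kl_penalized_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.02):
    """Reward model score minus a penalty for drifting from the SFT model.

    logprobs_policy / logprobs_sft: per-token log-probabilities of the
    sampled response under the current policy and the supervised model.
    beta is an illustrative coefficient (assumption).
    """
    log_ratio = sum(logprobs_policy) - sum(logprobs_sft)
    return rm_score - beta * log_ratio

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate for one action.

    ratio: pi_new(action) / pi_old(action)
    advantage: estimate of how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Clipping removes the incentive to move the probability ratio outside `[1 - clip_eps, 1 + clip_eps]` in a single update, which keeps each policy step small.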

This combination of manually created human datasets and jointly created data (synthetic outputs from the model, with human labels to inject preferences) makes up a robust RLHF process that produces far more preferable responses.

A similar process now happens at scale: some modern language model products occasionally present users with comparisons between responses, collecting a massive crowd-sourced dataset for fine-tuning.

### Further Reading

Below, I've included my notes for the InstructGPT paper. Given that many of the interesting RLHF papers build on each other or discuss relatively intuitive concepts (aside from PPO, which has a bit more complexity), I haven't taken notes on them - but for anyone curious to read more I'd recommend the following papers:

- [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/pdf/1909.08593)
- [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/pdf/2204.05862)
- [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347)

# My Notes

## 📜 [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155)

> Making language models bigger does not inherently make them better at following a user’s intent.

> In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.

> In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation.

> Our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

> We want language models to be _helpful_, _honest_, and _harmless_.

> We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 to follow a broad class of written instructions. This technique uses human preferences as a reward signal to fine-tune our models.

Using RLHF to align the model with human preferences via fine-tuning.

> We first hire a team of 40 contractors to label our data, based on their performance on a screening test.

This is still an extremely manual and human-involved process.

The procedure is:

1. Collect a dataset of human-written desired output behaviors, along with some labeler-written prompts, and use it to fine-tune GPT-3 with supervised learning.
2. Next, collect a dataset of human-labeled comparisons between outputs, then train a reward model (RM) to predict which outputs labelers prefer.
3. Then use the RM as a reward function and fine-tune the model to maximize this reward using PPO.

> This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”. We call the resulting models InstructGPT.

### Methods and experimental details

> Step 1: Collect demonstration data, and train a supervised policy

> Step 2: Collect comparison data, and train a reward model

> Step 3: Optimize a policy against the reward model using PPO

### Results

**1. Results on the API distribution**

> Labelers significantly prefer InstructGPT outputs over outputs from GPT-3

> Our models generalize to the preferences of “held-out” labelers that did not produce any training data.

> Public NLP datasets are not reflective of how our language models are used.

**2. Results on public NLP datasets**

> InstructGPT models show improvements in truthfulness over GPT-3

> We can minimize performance regressions on public NLP datasets by modifying our fine-tuning procedure.

**3. Qualitative Results**

> InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.

> InstructGPT still makes simple mistakes.

### Discussion

**1. Implications for alignment research**

> The cost of increasing model alignment is modest relative to pre-training.

> We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in.

> We were able to mitigate most of the performance degradations introduced by our fine-tuning.

> We’ve validated alignment techniques from research in the real world.
