# Reinforcement Learning from Human Feedback (RLHF)

RLHF was one of the hottest topics of 2023, with several Large Language Models (LLMs), including OpenAI's ChatGPT and DeepMind's Sparrow, using it. It consists of three steps that combine supervised learning (finetuning) and reinforcement learning.

![RLHF Overview diagram from OpenAI's ChatGPT blogpost](assets/ChatGPT_Diagram.svg)
Source: https://openai.com/blog/chatgpt

## Step 1: Supervised Fine Tuning (SFT)

Here the LLM is finetuned on a prompt dataset in which each example pairs a prompt with a human-written answer. This is the same finetuning that we perform in the `07-next-steps-fine-tuning` notebook.
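To make the objective concrete, here is a minimal sketch of the loss that supervised finetuning typically minimizes: the average negative log-likelihood of the human-written answer tokens. The function name `sft_loss` and its inputs are hypothetical, and computing the loss only on answer tokens (not on the prompt) is a common choice that we assume here; the notebook may do it differently.

```python
import math


def sft_loss(token_probs, is_answer_token):
    """Mean negative log-likelihood over the answer tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_answer_token: True where the token belongs to the human-written answer
    (assumption: prompt tokens are excluded from the loss).
    """
    losses = [
        -math.log(prob)
        for prob, keep in zip(token_probs, is_answer_token)
        if keep
    ]
    return sum(losses) / len(losses)


# Two prompt tokens (ignored) followed by two answer tokens.
example_loss = sft_loss([0.9, 0.8, 0.5, 0.5], [False, False, True, True])
```

The lower this value, the more probability the model puts on the human-written answer.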

## Step 2: Training a Reward Model (RM)

This step involves training a "reward model": a model trained via supervised learning to assign a score to a response to a prompt. To train it, we collect several answers for every prompt and have humans rank them relative to one another. These rankings teach the model what a "good" answer to a given prompt looks like.
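As an illustration of how rankings can become a training signal, the sketch below splits a ranking into (preferred, rejected) pairs and scores each pair with a pairwise logistic loss, `-log(sigmoid(score_preferred - score_rejected))`. This loss is a common choice for reward models, but it is an assumption here, not something this tutorial prescribes. The function names are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_answers):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one humans ranked higher.
    return list(combinations(ranked_answers, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Pairwise logistic loss: -log(sigmoid(preferred - rejected))."""
    diff = score_preferred - score_rejected
    # log(1 + exp(-diff)) is algebraically equal to -log(sigmoid(diff)).
    return math.log1p(math.exp(-diff))


pairs = ranking_to_pairs(["answer A", "answer B", "answer C"])
```

The loss is small when the reward model scores the preferred answer well above the rejected one, and large when it gets the order wrong.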

## Step 3: Reinforcement Learning from Human Feedback (RLHF/Step-3)

Note: This step is often just referred to as "step 3", since the overall technique is also called RLHF.

The third step uses reinforcement learning to further train the model from step 1 (sometimes called the policy or actor model). Specifically, it uses the Proximal Policy Optimization (PPO) algorithm. At a high level, you can think of this as a student (the actor model) attempting to answer prompts while a teacher (the reward model) scores each answer. The student then tries to maximize the score that the teacher gives out.
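For a feel of what PPO optimizes, here is a minimal sketch of its clipped surrogate objective for a single action. The names and the default `clip_eps=0.2` are illustrative assumptions. A real RLHF setup also needs advantage estimation from the reward model's scores and usually a penalty for drifting too far from the step 1 model, neither of which is shown.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one action (to be maximized).

    logp_new / logp_old: log-probability of the action under the updated
    and the previous policy. advantage: how much better than expected the
    action turned out (derived from the reward signal).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the policy
    # further than clip_eps away from the previous one in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy doubles the probability of a well-rewarded answer,
# but the objective only credits a 1.2x change.
capped = ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0)
```

In the student and teacher picture, the clipping keeps the student from overreacting to any single score.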

## Resources

The explanation above is only a high-level overview, and RLHF is not easy to grasp from a summary alone. We have a more detailed guide in the [Georgian AI Library](https://github.com/georgian-io/GAL/tree/main/Short%20Tutorials/Reinforcement%20Learning%20from%20Human%20Feedback%20(RLHF)), and we recommend going through it for more information. If you have any questions, please reach out to us during office hours!

We also include steps to quickly get started in running these models as well as some information on the compute power you may need.

### Further Reading

* [Rank Responses to Align Language Models with Human Feedback (RRHF)](https://github.com/GanjinZero/RRHF): An alternative to RLHF that is in active development. Its authors claim performance comparable to RLHF-based models with a simpler training paradigm.
* [Weights & Biases RLHF tutorial](https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2): A tutorial on training a summarization model via RLHF.
* [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat): DeepSpeed is a deep learning optimization library. It also offers a relatively easy way to get started with RLHF.
* [Hugging Face's StackLLaMA](https://huggingface.co/blog/stackllama): An instructive blog post on training LLaMA with RLHF.