# Tutorial 4: Improving LLMs with RLHF

Reinforcement Learning from Human Feedback (RLHF) incorporates human feedback into the training process through a reward model, which learns which outputs humans prefer and is then used to steer the language model toward them. For example, if the goal is to enhance politeness, the reward model will guide the model to generate more polite responses by assigning higher scores to polite outputs. This process is resource-intensive because it requires training a reward model on a dataset curated by humans.
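
To make "assigning higher scores to polite outputs" concrete, here is a minimal sketch of a pairwise preference loss that is commonly used to train reward models. It is an assumption that this formulation matches the loss used later in the tutorial; the function name and the example scores are purely illustrative.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    (e.g. more polite) response above the rejected one, and large otherwise.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores: the polite reply is ranked correctly in the first call
# and incorrectly in the second.
good_ranking = pairwise_reward_loss(score_chosen=2.0, score_rejected=-1.0)
bad_ranking = pairwise_reward_loss(score_chosen=-1.0, score_rejected=2.0)
print(f"correct ranking loss: {good_ranking:.4f}")
print(f"wrong ranking loss:   {bad_ranking:.4f}")
```

Minimizing this loss over many human-labeled pairs pushes the reward model to score preferred responses higher, which is the signal the reinforcement learning phase later optimizes against.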

This tutorial uses open-source models and datasets wherever possible to keep costs down.

We begin with a pre-trained model that we fine-tune in a supervised fine-tuning phase using the `SFTTrainer` class. Next, a reward model is trained to recognize the desired traits using the `RewardTrainer` class. Finally, the reinforcement learning phase uses both models to produce the final aligned model with the `PPOTrainer` class.

You can access the Weights & Biases reports and the library requirements file after each subsection. Note that different steps require different library versions. We chose `OPT-1.3B` as the base model and fine-tuned a `DeBERTa` (300M) model as the reward model for our experiments. Although these are relatively small models, the process used in this tutorial can be applied to other existing models by simply changing the model’s name in the code.
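
The three phases and model choices described above can be summarized in a small planning sketch. This is not training code: the `Phase` dataclass, the `PIPELINE` list, and the `describe` helper are hypothetical names used only for illustration, and the assumption that the reinforcement learning phase starts from the supervised fine-tuned model is ours. The model names are labels, not exact checkpoint identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    trainer: str      # trainer class named in the text
    starts_from: str  # model(s) the phase begins with
    produces: str     # artifact handed to the next phase


# Change these labels to plan the same pipeline around other models.
BASE_MODEL = "OPT-1.3B"
REWARD_BASE = "DeBERTa (300M)"

PIPELINE = [
    Phase("supervised fine-tuning", "SFTTrainer", BASE_MODEL, "SFT model"),
    Phase("reward modeling", "RewardTrainer", REWARD_BASE, "reward model"),
    # Assumption: PPO starts from the SFT model and is scored by the reward model.
    Phase("reinforcement learning", "PPOTrainer", "SFT model + reward model", "aligned model"),
]


def describe(pipeline):
    """Return one human-readable line per phase, in execution order."""
    return [
        f"{i}. {p.name}: {p.trainer} on {p.starts_from} -> {p.produces}"
        for i, p in enumerate(pipeline, start=1)
    ]


for line in describe(PIPELINE):
    print(line)
```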

Although far more affordable than what companies like OpenAI do, this tutorial is still resource-intensive because it replicates an RLHF phase. We rented an 8x NVIDIA A100 instance for $8.80/h and used [Lambda](https://lambdalabs.com/) as our GPU cloud provider.
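
To budget a run before launching it, a rough estimate from the hourly rate quoted above can help. The sketch below assumes simple per-hour billing with no storage or data-transfer charges; the function names and the example durations are hypothetical, and actual billing rules may differ.

```python
HOURLY_RATE_USD = 8.80   # 8x NVIDIA A100 instance price quoted in the text
GPUS_PER_INSTANCE = 8


def estimate_cost(hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Estimate the instance cost in USD for a given uptime in hours."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(hours * hourly_rate, 2)


def per_gpu_hourly_rate(hourly_rate: float = HOURLY_RATE_USD,
                        gpus: int = GPUS_PER_INSTANCE) -> float:
    """Return the effective hourly price of a single GPU in USD."""
    return round(hourly_rate / gpus, 2)


# Hypothetical durations, chosen only to show how cost scales with uptime.
for hours in (0.5, 2.5, 10):
    print(f"{hours:>5} h -> ${estimate_cost(hours):.2f}")
print(f"per GPU: ${per_gpu_hourly_rate():.2f}/h")
```

Because cost scales linearly with uptime, an instance left idle overnight costs as much as one that is training.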

>⚠️It’s important to be aware of the costs associated with cloud GPUs. The total cost will depend on the machine type and the instance’s uptime. Regularly check your costs in the billing section of Lambda Labs and shut down your instances when you are not using them.

>💡If you want to run the code in this section without spending much money, you can run just a few training iterations on your virtual machine and then stop the instance.