# Direct Preference Optimization

List of main relevant papers:
* https://arxiv.org/pdf/2305.18290

List of main relevant blogs / book chapters:
* https://allam.vercel.app/post/dpo/
* https://github.com/rasbt/LLMs-from-scratch/tree/main/ch07/04_preference-tuning-with-dpo
* https://huggingface.co/blog/pref-tuning
* https://argilla.io/blog/mantisnlp-rlhf-part-11
* https://medium.com/mantisnlp/finetuning-an-llm-rlhf-and-alternatives-part-iii-d1dcb1792968

List of main relevant tools: 
* https://github.com/argilla-io/argilla

# 1 - Introduction

For most applications, it’s crucial for LLMs to generate text that is contextually consistent and aligned with the intended task and the user's intent. This includes developing LLMs that are safe, aligned, and unbiased, as well as LLMs that generate syntactically and functionally correct code even though the training data contains incorrect code. Pre-training alone does not guarantee any specific model behavior. This is where [Reinforcement Learning From Human Feedback (RLHF)](https://huggingface.co/blog/rlhf) becomes vital.

RLHF fine-tunes an LLM to maximize the reward assigned by a separate reward model, which is itself trained on human evaluators' feedback on a set of generated samples. The technique is widely used for aligning LLMs. However, it has practical drawbacks: it requires training an additional reward model, sampling from the LLM during fine-tuning, and a reinforcement learning loop that is complex and often unstable.

Direct Preference Optimization (DPO) has emerged as a promising alternative for aligning LLMs to human or AI preferences. Unlike traditional alignment methods, which are based on reinforcement learning, DPO recasts the alignment formulation as a simple loss function that can be optimized directly on a dataset of preferences $(x, y_w, y_l)$ where $x$ is a prompt and $(y_w, y_l)$ are the preferred and dispreferred responses.

<table>
    <tr>
        <td><img src="./images_2/dpo_vs_rlhf.png" width="900"/></td>
    </tr>
</table>
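
To make the "simple loss function" concrete, here is a minimal sketch of the DPO objective from the paper listed above, computed for a single preference triple $(x, y_w, y_l)$. It assumes details that are not spelled out in this section: a frozen reference model, a temperature `beta`, and inputs given as summed token log-probabilities of each response. The function name `dpo_loss` and its argument names are illustrative, not taken from any library.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Each argument is log p(y | x) summed over the response tokens, under
    either the model being trained ("policy") or a frozen reference model.
    `beta` (an assumed default) controls how far the policy may drift from
    the reference.
    """
    # Implicit reward margin: how much more the policy prefers y_w over y_l
    # than the reference model does.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy shifted toward the preferred response: the loss goes down.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$. Raising the policy's log-probability of $y_w$ relative to $y_l$ lowers the loss, with no reward model or sampling involved.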


# 2 - How does RLHF work?

Given a pre-trained LLM, RLHF works in three steps:

1. Generate a set of samples using the LLM from a dataset of prompts.
2. Have human evaluators rate the samples, then train a separate reward model on the samples and their ratings.
3. Fine-tune the LLM using the reward model as a reward signal.

<table>
    <tr>
        <td><img src="./images_1/rlhf.png" width="900"/></td>
    </tr>
</table>
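
The three steps above can be sketched with a deliberately tiny toy. Everything in it is an assumption made for illustration: the "LLM" is a softmax policy over three fixed candidate responses, the ratings are hypothetical, the "reward model" is just the mean rating per response, and step 3 uses a plain exact policy-gradient update. Production RLHF pipelines typically use a learned neural reward model and PPO with a KL penalty instead.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-in for an LLM: a softmax policy over fixed candidate responses.
candidates = ["helpful answer", "vague answer", "rude answer"]
logits = [0.0, 0.0, 0.0]

# Step 1: generate samples from the current policy.
rng = random.Random(0)
samples = rng.choices(range(len(candidates)), weights=softmax(logits), k=30)

# Step 2: evaluators rate the samples (hypothetical ratings), then a reward
# model is fit on the (sample, rating) pairs. Here: mean rating per response.
human_rating = {0: 1.0, 1: 0.3, 2: -1.0}
ratings = [(i, human_rating[i]) for i in samples]


def fit_reward_model(ratings, n_candidates):
    sums = [0.0] * n_candidates
    counts = [0] * n_candidates
    for idx, rating in ratings:
        sums[idx] += rating
        counts[idx] += 1
    # Responses that were never sampled get a neutral reward of 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


reward = fit_reward_model(ratings, len(candidates))


# Step 3: fine-tune the policy to maximize expected reward.
def policy_gradient_step(logits, reward, lr=0.5):
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, reward))
    # Gradient of expected reward w.r.t. logit i is p_i * (r_i - baseline).
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, reward)]


initial_probs = softmax(logits)
for _ in range(50):
    logits = policy_gradient_step(logits, reward)
final_probs = softmax(logits)
```

After the updates, the policy puts more probability on the response the reward model scores highest, which is the behavior step 3 is meant to produce.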


```python
import pandas as pd
```