# Use Preference Data with DPO to train a QA model

You have probably turned to Google at least once for the answer to a very specific question. Chances are that Google gave you a response in bold letters ("How much does the Eiffel Tower weigh?" - "10,100 tons") or only highlighted some text in the results ("What was the original color of the Statue of Liberty?"). Either way, Google used an algorithm that does the answer finding for you and saved you from reading a bunch of pages or using Ctrl+F. It does not verify the reliability of the pages or of the information it finds, but it does come up with an exact answer to the question (at the time of this post, it incorrectly highlights "blue-green" for the Statue of Liberty). This task, finding the exact answer to a given question in a piece of text, is called extractive question answering, and it is one of the main components of many QA and LLM systems today. In this blog post, we will see how we can use Argilla to create an end-to-end pipeline for extractive QA and train a model for it using Direct Preference Optimization.

Here are the steps we will follow:

- Create a dataset for extractive QA
- Annotate the dataset with Argilla

...

## Introduction

**Question answering (QA)** tasks fall mainly into two categories: extractive QA and generative QA. Generative QA (or abstractive QA) is the task where the QA system generates a human-like, natural language answer to a question. For this, a generative QA system uses a retriever-generator architecture instead of the retriever-reader architecture employed by extractive QA. As it requires a deeper understanding of the text as well as natural language generation, generative models have yet to catch up with extractive ones in terms of performance. However, as it offers a more sophisticated pipeline and output, generative QA is likely to have much more to offer in the future.

On the other hand, the task we have just seen above is an example of extractive QA, where a model finds the exact span within a text that will be used as the answer to the given question. Formally, each example is a tuple *(q, c, a)* of question, context, and answer, and the training objective is to minimize the sum of *-log(Pstart)* and *-log(Pend)*, where *Pstart* and *Pend* are the probabilities the model assigns to the correct start and end indices of the answer span.
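
As a rough numeric illustration of this objective, the sketch below computes the span loss for a toy five-token context. The logits and gold indices are made-up values, and `softmax` and `span_loss` are hypothetical helper names written for this example, not part of any library.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_loss(start_logits: List[float], end_logits: List[float],
              gold_start: int, gold_end: int) -> float:
    """Sum of the negative log-likelihoods of the gold start and end indices."""
    p_start = softmax(start_logits)[gold_start]
    p_end = softmax(end_logits)[gold_end]
    return -math.log(p_start) - math.log(p_end)


# Toy scores over a five-token context (illustrative values only).
start_logits = [0.1, 2.0, 0.3, -1.0, 0.0]
end_logits = [0.0, 0.2, 1.5, 0.1, -0.5]

# Loss when the gold span is the one the model scores highest (tokens 1..2).
favored = span_loss(start_logits, end_logits, gold_start=1, gold_end=2)
# Loss when the gold span is one the model scores poorly (tokens 3..4).
disfavored = span_loss(start_logits, end_logits, gold_start=3, gold_end=4)

print(f"gold span favored by the model:    {favored:.3f}")
print(f"gold span disfavored by the model: {disfavored:.3f}")
```

The loss is small when the model puts most of its probability on the gold start and end positions, and grows as that probability shrinks.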

### DPO

To further improve a QA model, we will be employing **Direct Preference Optimization (DPO)** as the training method. DPO is a relatively new training method that optimizes the model directly on preference data. This is where DPO gains a practical advantage over older methods that use human feedback. Since the introduction of [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347) by OpenAI in 2017, human feedback has been used in reinforcement learning training loops, with the requirement of a reward model that has to be trained separately. This reward model is used to calculate the reward for the model's output, and the model is trained to maximize that reward. The main drawbacks of this approach are therefore the extra data and time needed to train the reward model, on top of the complexity of the algorithm itself. DPO, on the other hand, does not require a reward model: the model is trained to make the preferred response more likely than the rejected one, where the preferences come from human feedback in our case. As the authors of DPO put it [in the paper](https://arxiv.org/abs/2305.18290), this new approach is at least as good as the previous ones in terms of performance and it is much more practical.
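
To make the idea concrete, here is a minimal sketch of the DPO loss from the paper for a single (prompt, chosen, rejected) triple, written in plain Python. The log-probabilities are made-up numbers, `beta=0.1` is only an illustrative value, and `dpo_loss` is a hypothetical helper rather than the `trl` implementation.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference model: the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy shifted toward the chosen response: the loss goes down.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
# Policy shifted toward the rejected response: the loss goes up.
worsened = dpo_loss(-6.0, -6.0, -5.0, -7.0)

print(f"baseline={baseline:.4f} improved={improved:.4f} worsened={worsened:.4f}")
```

Note that no reward model appears anywhere: the only inputs are the log-probabilities of the two responses under the model being trained and under a frozen reference model, which is typically the SFT model.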

One important caveat in DPO is that you need to train an **SFT** (supervised fine-tuning) model first. This is primarily because the training data for DPO should be in-distribution, which means that the preference data should come from the same distribution as the outputs of the model being optimized. One thing to keep in mind is that we will use two separate datasets: **demonstration data** for SFT, and **preference data** (comparison data) for DPO. For more information about the data and a general overview of RLHF, you can refer to our [RLHF data page](../../../conceptual_guides/llm/rlhf.md) and this great [RLHF post](https://huyenchip.com/2023/05/02/rlhf.html) by Chip Huyen.

Argilla offers all the necessary tools from the start to the end of such a pipeline. We will use Argilla to annotate both the demonstration dataset and the preference dataset, and use `ArgillaTrainer` to train the SFT model and the DPO model for our QA task. `ArgillaTrainer` offers a smooth integration with `trl` (Transformer Reinforcement Learning), a library from Hugging Face for training transformer models with reinforcement learning. We will use the `SFTTrainer` and `DPOTrainer` classes from `trl` via `ArgillaTrainer`. Let us start by installing the required libraries and importing the necessary modules.

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

## Install Dependencies

Let us install the dependencies first.

In [None]:
%pip install argilla transformers datasets trl

And then import the necessary modules.

In [63]:
from typing import Any, Dict

import argilla as rg
from argilla.feedback import ArgillaTrainer, TrainingTask
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    GenerationConfig,
    GPT2LMHeadModel,
    pipeline,
)



Initialize the Argilla client with the `init` function. If you are running Argilla on HF Spaces, you can change `api_url` to your Spaces URL.

In [None]:
rg.init(
    api_url="http://localhost:6900",
    api_key="owner.apikey",
    workspace="admin",
)

## The SFT Model

As stated above, we will first train an SFT model. Let us create our dataset for this model.

### Demonstration Data

To train the SFT model, we need data made of question, context, and answer triplets. The important point here is that the data should be high-quality to ensure that the fine-tuned model will be able to produce high-quality answers. In a possible scenario, you might want to create a dataset with questions and contexts, and then have it annotated by annotators to obtain high-quality answers. We will cover this procedure end-to-end in this blog post. However, for the sake of simplicity, we will use a dataset that is already annotated. The dataset for our demonstration data is [squad](https://huggingface.co/datasets/squad), which is a widely used dataset for extractive QA. Let us first download the dataset and create our `demonstration_dataset`.

In [65]:
dataset_hf = load_dataset("squad", split="train").shard(num_shards=1000, index=5)

Let us create our `FeedbackDataset` and add the data items from `squad`. To create a `FeedbackDataset`, we will use the task templates from Argilla, which make the process much easier for any NLP task. You can find more information about the task templates [here](../../../practical_guides/create_dataset.md#task-templates).

In [66]:
dataset = rg.FeedbackDataset.for_question_answering(use_markdown=True)

`for_question_answering` has returned a `FeedbackDataset` with the fields `question` and `context`, along with a question named `answer` for our annotators to fill in.

### Add Suggestions

To help our annotators and make the annotation process faster, we can add suggestions to our dataset. Suggestions are model predictions for our data items that will be shown in the Argilla UI during the annotation process. This step is optional, but depending on your project, it can save you a lot of time. You can use any model of your preference to generate predictions for your dataset. We will be using `distilbert-base-uncased-distilled-squad` for demonstration purposes here. We can utilize the `pipeline` function from `transformers` to make things easier.

In [67]:
question_answerer = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

Let us create the records from our dataset by also adding suggestions to each item.

In [68]:
records = [
    rg.FeedbackRecord(
        fields={
            "question": item["question"],
            "context": item["context"],
        },
        suggestions=[
            {"question_name": "answer",
            "value": question_answerer(question=item["question"], context=item["context"])["answer"]},
        ]
    ) for item in dataset_hf
]

And add the records to our dataset.

In [None]:
dataset.add_records(records)

We can now upload our dataset to Argilla for our annotators to annotate. They will annotate each item by writing the answer span in the `answer` question, using the model suggestions as hints if you have opted to add them. If you would like to have more control over the annotation process and use other features, you can refer to our [Argilla UI](../../../reference/webapp/features.md) page for more information.

In [None]:
remote_dataset = dataset.push_to_argilla(name="demonstration_data_squad")

### Train the SFT Model

After our annotation work is done, we can download our annotated dataset. Note that the dataset returned by the `from_argilla` function is a remote dataset object, meaning that any change you make is directly reflected on the remote dataset.

In [None]:
annotated_dataset_from_argilla = rg.FeedbackDataset.from_argilla("demonstration_data_squad")

However, as we stated above, we will be using `squad`, which is already annotated, for this tutorial. If you would like to follow a similar path, you can do so by creating the dataset as seen below.

In [71]:
annotated_dataset = rg.FeedbackDataset.for_question_answering(use_markdown=True)

# for each item in dataset_hf, get question, context, and answer
records = [
    rg.FeedbackRecord(
        fields={
            "question": item["question"],
            "context": item["context"],
            "answer": item["answers"]["text"][0],
        },
    ) for item in dataset_hf
]

# add records to annotated_dataset
annotated_dataset.add_records(records)

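# Note: for a dataset annotated in Argilla, the answer is read from the responses,
# e.g. item.responses[0].values["answer"].value. In this tutorial path, the answer
# comes from `squad` and is stored as the "answer" field (item["answer"]) instead.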

After having our dataset ready and tidy, we can start training. For training, we first need to define our model that will be the base model for the SFT model.

In [None]:
model_sft = "sshleifer/tiny-gpt2"  # model name on the Hub; ArgillaTrainer takes a name or path and loads it

Like any other trainer, the SFT trainer requires the data to be in a specific format before it is fed into the model. We will accomplish this with a custom `formatting_func` that will be passed to the `ArgillaTrainer`. Let us define the template for the `formatting_func`.

In [72]:
# Prompt template used both for SFT training and for inference
template = """\
### Question: {question}\n
### Context: {context}\n
### Answer: {answer}"""

Note that `formatting_func` returns a `str`.

In [None]:
def formatting_func(example: Dict[str, Any]) -> str:
    return template.format(
        question=example["question"],
        context=example["context"],
        answer=example["answer"],
    )
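
To see the kind of string the SFT trainer would receive, the sketch below renders the template for a single made-up example. It repeats the template and `formatting_func` from above so that it runs on its own; the example values are invented for illustration.

```python
from typing import Any, Dict

template = """\
### Question: {question}\n
### Context: {context}\n
### Answer: {answer}"""


def formatting_func(example: Dict[str, Any]) -> str:
    return template.format(
        question=example["question"],
        context=example["context"],
        answer=example["answer"],
    )


# A made-up example with the same keys as the records above.
example = {
    "question": "How much does the Eiffel Tower weigh?",
    "context": "The Eiffel Tower weighs 10,100 tons.",
    "answer": "10,100 tons",
}

text = formatting_func(example)
print(text)
```

Each `\n` escape in the template is followed by a real line break, so the three sections end up separated by a blank line in the rendered text.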

We can now pass the formatting function to our `TrainingTask` from Argilla, which defines how the data should be processed and formatted. We do this by employing the corresponding task according to the model we are training.

In [73]:
task = TrainingTask.for_supervised_fine_tuning(formatting_func=formatting_func)

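Let us preview how the dataset is prepared for training. We store the result under a new name so that `annotated_dataset` remains a `FeedbackDataset`, which is what `ArgillaTrainer` expects.

In [74]:
training_data_sft = annotated_dataset.prepare_for_training(
    framework="trl",
    task=task
)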

Create the `ArgillaTrainer` class for our model. We will be using [SFTTrainer](https://huggingface.co/docs/trl/main/en/sft_trainer) from `trl`. Having passed the corresponding `TrainingTask`, Argilla will automatically create and train the model for us.

In [None]:
trainer_sft = ArgillaTrainer(
    dataset=annotated_dataset,
    task=task,
    framework="trl",
    train_size=0.8,
    model=model_sft,
)

Let us train the model.

In [None]:
trainer_sft.train(output_dir="sft_model")

### Inference with the SFT Model

We can now use this model for inference.

In [None]:
def generate(model_id: str, question: str, context: str = "") -> str:
    model = GPT2LMHeadModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = template.format(
        question=question,
        context=context,
        answer="",
    ).strip()

    encoding = tokenizer([inputs], return_tensors="pt")
    outputs = model.generate(
        **encoding,
        generation_config=GenerationConfig(
            max_new_tokens=32,
            min_new_tokens=12,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        ),
    )
    return tokenizer.decode(outputs[0])
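
Because `generate` decodes the whole sequence, its return value contains the prompt as well as the completion, so the answer still has to be cut out of it. The sketch below shows one way to do that with plain string operations; `extract_answer` is a hypothetical helper and the decoded text is a made-up example shaped like the template above.

```python
def extract_answer(generated_text: str, marker: str = "### Answer:") -> str:
    """Return the text after the first answer marker, up to the next '###' header."""
    if marker not in generated_text:
        return ""  # the template was not echoed back as expected
    completion = generated_text.split(marker, 1)[1]
    return completion.split("###", 1)[0].strip()


# A made-up decoded output: the prompt, the answer, and some trailing text.
decoded = (
    "### Question: How much does the Eiffel Tower weigh?\n\n"
    "### Context: The Eiffel Tower weighs 10,100 tons.\n\n"
    "### Answer: 10,100 tons\n\n### Question: What"
)

answer = extract_answer(decoded)
print(answer)
```

Returning an empty string when the marker is missing avoids an `IndexError` on outputs that do not follow the template.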

## The DPO Model

Up to this point, we have created a demonstration dataset (and annotated it if you have chosen to do so) and trained an SFT model on it. Before creating the second dataset, which is the comparison dataset, we would ideally run another annotation round to obtain its data items. This new dataset consists of question and context pairs along with two responses to each question: one is a human response to the question-context pair and the other is the prediction we obtain from the SFT model. If you opt to have gold answers for your comparison dataset, you can follow the [annotation](#demonstration-data) section above again to collect them.

For demonstration and simplicity purposes, we will again use a dataset that is already annotated, `squad_v2`. We will use the question and context columns from this dataset as well as the answers column, which will serve as our (kind of) gold answers. We will also use the SFT model to create the second answer for each item in the dataset.

### Comparison Data

The comparison dataset pairs each question and context with two responses: a human answer and the SFT model's prediction. We then have this dataset annotated to obtain the preference data, with annotators choosing one of the answers and rejecting the other. This preference data will be used to train the DPO model.

Let us download the dataset from Hugging Face. Since `squad_v2` also contains unanswerable questions, we use `filter` to eliminate the data items that do not have an answer in the `answers` column.

In [None]:
dataset_hf_dpo = load_dataset("squad_v2", split="train").shard(num_shards=1000, index=5)

In [133]:
dataset_hf_dpo = dataset_hf_dpo.filter(lambda example: example["answers"]["text"] != [])

And create the `FeedbackDataset`. Note that Argilla again offers us a template that we can directly use to set up a DPO pipeline. As we have 2 responses that the annotators will choose from, we will set `number_of_responses` to 2. Additionally, since our dataset has a context, we will set `context` to `True`.

In [None]:
dataset_dpo = rg.FeedbackDataset.for_direct_preference_optimization(number_of_responses=2, context=True)

In [None]:
records = [
    rg.FeedbackRecord(
        fields={
            "prompt": item["question"],
            "context": item["context"],
            # response1: the human-written answer from squad_v2
            "response1": item["answers"]["text"][0],
            # response2: the SFT model's answer, cut out of the generated text
            "response2": generate("sft_model", item["question"], item["context"])
            .split("### Answer:")[1]
            .split("###")[0]
            .strip(),
        },
    ) for item in dataset_hf_dpo
]

dataset_dpo.add_records(records)

### Annotation of Comparison Data

Now we can get the annotations from our annotators. Let us upload it to Argilla for our annotators to annotate.

In [None]:
remote_dataset_dpo = dataset_dpo.push_to_argilla(name="squad_to_be_annotated_dpo")

### Train the DPO Model

In [None]:
annotated_dataset_dpo_from_argilla = rg.FeedbackDataset.from_argilla("squad_to_be_annotated_dpo")

Now, we can train our DPO model.

In [171]:
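from typing import Tuple

# Returns (prompt, chosen, rejected). Here response1 (the squad_v2 answer) is
# always treated as chosen and response2 (the SFT output) as rejected. With
# annotated data, decide chosen and rejected from the annotators' preference.
def formatting_func_dpo(example: Dict[str, Any]) -> Tuple[str, str, str]:
    return (
        f"Question: {example['prompt']} Context: {example['context']}",
        example["response1"],
        example["response2"],
    )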
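
If you work with annotated preference data instead, chosen and rejected have to be derived from the annotators' votes. The sketch below shows one possible way, using a majority vote. It assumes that each sample exposes the votes under a `preference` key as a list of dictionaries with a `value` of `"response1"` or `"response2"`; that shape and the helper name `pick_chosen_rejected` are assumptions made for illustration, so check what your own dataset actually passes to the formatting function.

```python
from collections import Counter
from typing import Any, Dict, Optional, Tuple


def pick_chosen_rejected(sample: Dict[str, Any]) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) by majority vote, or None if there are no votes."""
    votes = [annotation["value"] for annotation in sample.get("preference", [])]
    if not votes:
        return None  # nothing annotated yet, so the sample cannot be used
    winner = Counter(votes).most_common(1)[0][0]  # "response1" or "response2"
    loser = "response2" if winner == "response1" else "response1"
    prompt = f"Question: {sample['prompt']} Context: {sample['context']}"
    return prompt, sample[winner], sample[loser]


# A made-up sample with three annotator votes (assumed structure).
sample = {
    "prompt": "How much does the Eiffel Tower weigh?",
    "context": "The Eiffel Tower weighs 10,100 tons.",
    "response1": "10,100 tons",
    "response2": "The Eiffel Tower",
    "preference": [
        {"value": "response1"},
        {"value": "response1"},
        {"value": "response2"},
    ],
}

prompt, chosen, rejected = pick_chosen_rejected(sample)
print(chosen, "|", rejected)
```

Samples without any votes return `None` here, which lets the caller skip them rather than guess a preference.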


In [174]:
task = TrainingTask.for_direct_preference_optimization(formatting_func=formatting_func_dpo)

In [None]:

trainer = ArgillaTrainer(
    dataset=dataset_dpo,
    task=task,
    framework="trl",
    model="sft_model",
)

trainer.update_config(
    learning_rate=2e-2,
)


In [None]:
trainer.train(output_dir="dpo_model")