# Train a QA model with Argilla

At some point, you have probably needed an answer to a very specific question and turned to Google. Chances are that Google gave you a response in bold letters ("How much does the Eiffel Tower weigh?" - "10,100 tons") or highlighted some text in the results ("What was the original color of the Statue of Liberty?"). Either way, Google used an algorithm that did the answer-finding job for you and saved you from reading a bunch of pages or using Ctrl+F. Although it does not verify the reliability of the pages or of the information found, it managed to come up with an exact answer to the question (at the time of this post, it incorrectly highlights "blue-green" for the Statue of Liberty). This task, finding the exact answer to a given question within a piece of text, is called extractive question answering, and it is one of the main components of many QA and LLM systems today. In this blog post, we will see how we can use Argilla to create an end-to-end pipeline for extractive QA.

Here are the steps we will follow:

- Create a dataset for extractive QA
- Annotate the dataset with Argilla

...

## Introduction

**Question answering (QA)** tasks are mainly divided into two types: extractive QA and generative QA. Generative QA (or abstractive QA) is the task where the QA system generates a human-like, natural-language answer to a question. For this, a generative QA system uses a retriever-generator architecture instead of the retriever-reader one employed by extractive QA. As it requires a deeper understanding of the text as well as natural language generation, generative models have yet to catch up with extractive ones in terms of performance as of today. However, as it offers a more sophisticated pipeline and output, it will have much more to offer in the future.

On the other hand, the task we have just seen above was an example of extractive QA, where a model finds the exact span within a text that will be used as the answer to the given question. Formally, each example is a tuple *(q, c, a)* of question, context, and answer, and the objective of training is to minimize the loss *-log(Pstart) - log(Pend)*, where *Pstart* and *Pend* are the probabilities the model assigns to the correct start and end indices of the answer span.
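
To make the objective concrete, here is a minimal sketch in plain Python that computes the span loss for a single example as the sum of the two negative log-probabilities. The function names `softmax` and `span_loss` and the toy logits are illustrative assumptions, not part of Argilla or `transformers`; library implementations typically work on batches and may average the two terms instead of summing them.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_loss(start_logits, end_logits, start_index, end_index):
    """Negative log-likelihood of the gold start and end positions."""
    p_start = softmax(start_logits)[start_index]
    p_end = softmax(end_logits)[end_index]
    return -math.log(p_start) - math.log(p_end)


# Toy example: 5 tokens, the gold answer spans tokens 1 to 2.
start_logits = [0.1, 3.0, 0.2, 0.0, -1.0]
end_logits = [0.0, 0.5, 2.5, 0.1, -0.5]
loss = span_loss(start_logits, end_logits, start_index=1, end_index=2)
print(f"loss = {loss:.3f}")
```

The loss shrinks as the model puts more probability mass on the correct start and end tokens, and grows when it prefers other positions.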

Argilla offers all the necessary tools from the start to the end of such a pipeline. We will use `Argilla` to annotate our dataset and use `ArgillaTrainer` to train the QA model. `ArgillaTrainer` offers a smooth integration with `transformers`. ...  Let us first start by installing the required libraries and importing the necessary modules.

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

## Install Dependencies

Let us install the dependencies first.

In [None]:
%pip install argilla transformers datasets evaluate

And then import the necessary modules.

In [6]:
import argilla as rg
import re
from datasets import load_dataset, Dataset
from argilla.feedback import ArgillaTrainer, TrainingTask
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer, DefaultDataCollator
import torch

Initialize the Argilla client with the `init` function. If you are running Argilla on HF Spaces, you can change `api_url` to your Spaces URL.

In [None]:
rg.init(
    api_url="http://localhost:6900",
    api_key="argilla.apikey"
)

## Create the Dataset

As the first step in our QA pipeline, we need a dataset annotated by our annotators. To that end, we will create a dataset in which each record contains a question and a context to search for the answer in. Our annotators will then construct the answers by extracting them from the context. For this tutorial, we will use the [squad](https://huggingface.co/datasets/squad) dataset, a popular dataset for extractive QA. We will first ignore the existing answers and load only the question-context pairs from `squad` into Argilla to showcase the annotation process. We will use the `datasets` library to download the dataset.

In [8]:
dataset_hf = load_dataset("squad", split="train").shard(num_shards=10000, index=55)

Let us have a look at the data we have before starting the annotation process.

In [9]:
dataset_hf[0]

{'id': '5733b1da4776f41900661069',
 'title': 'University_of_Notre_Dame',
 'context': "In 1882, Albert Zahm (John Zahm's brother) built an early wind tunnel used to compare lift to drag of aeronautical models. Around 1899, Professor Jerome Green became the first American to send a wireless message. In 1931, Father Julius Nieuwland performed early work on basic reactions that was used to create neoprene. Study of nuclear physics at the university began with the building of a nuclear accelerator in 1936, and continues now partly through a partnership in the Joint Institute for Nuclear Astrophysics.",
 'question': 'Which professor sent the first wireless message in the USA?',
 'answers': {'text': ['Professor Jerome Green'], 'answer_start': [136]}}

`squad` is a dataset of question-context-answer triples pulled from Wikipedia articles. As seen in the `title` field above, each data item comes from a specific Wikipedia article. The `context` field contains a passage from the article, `question` contains the question to be answered, and `answers` contains the answer span within the context, which has already been annotated by humans. The `answer_start` is the character index at which the answer starts within the given context. We will ignore the `answers` field for now and only use the `context` and `question` fields.
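
As a quick illustration of how `answer_start` relates to the text, the snippet below slices an excerpt of the context shown above with the stored offset. The excerpt is copied by hand and covers only the first sentences of the record; the variable names are illustrative.

```python
# Excerpt of the `context` shown above (first sentences only).
context = (
    "In 1882, Albert Zahm (John Zahm's brother) built an early wind tunnel "
    "used to compare lift to drag of aeronautical models. Around 1899, "
    "Professor Jerome Green became the first American to send a wireless message."
)
answers = {"text": ["Professor Jerome Green"], "answer_start": [136]}

start = answers["answer_start"][0]
end = start + len(answers["text"][0])  # exclusive end index

# Slicing the context with the character offsets recovers the answer text.
recovered = context[start:end]
print(recovered)
```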

### Create FeedbackDataset

Let us create our `FeedbackDataset` and add the data items from `squad`. To create a `FeedbackDataset`, we will use one of the task templates from Argilla, which make the process much easier for common NLP tasks. You can find more information about the task templates [here](../../../practical_guides/create_dataset.md#task-templates).

In [10]:
dataset = rg.FeedbackDataset.for_question_answering()

This method has just created the basic QA task template for us with `context` and `question` fields along with the `answer` question which will be used by the annotators to construct the answer.

To make the annotation project a lot easier to manage, we can add a `metadata_property` to our dataset. This will allow us to filter and sort the records by their metadata. You can find more information on metadata properties [here](../../../practical_guides/create_dataset.md#metadata-properties).

In [None]:
dataset.add_metadata_property(rg.TermsMetadataProperty(name="groups", title="Annotation Groups", values=["Group1", "Group2", "Group3"], visible_for_annotators=False))
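
The `groups` property only becomes useful once each record carries a value for it. As a rough sketch, and assuming you want to spread the records evenly over the three groups, the metadata dictionaries could be built as below. The helper name `assign_groups` is hypothetical; each resulting dictionary is meant to be passed as the `metadata` argument of `rg.FeedbackRecord`, which we assume is available in your Argilla version.

```python
from itertools import cycle, islice

GROUPS = ["Group1", "Group2", "Group3"]


def assign_groups(n_records, groups=GROUPS):
    """Return one metadata dict per record, cycling through the groups."""
    return [{"groups": group} for group in islice(cycle(groups), n_records)]


metadata_per_record = assign_groups(5)
print(metadata_per_record)
```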

Now that we have our dataset ready, we can add the data items from `squad` to our dataset as `records` by adding suggestions for each one as well.

### Add Suggestions

To help our annotators and make the annotation process faster, we can add suggestions to our dataset. Suggestions are model predictions for our data items that will be shown in the Argilla UI during the annotation process. Although optional, they can save you a lot of time, depending on your project. You can use any model of your preference to generate predictions for your dataset. We will be using `deepset/electra-base-squad2` for demonstration purposes here. We can utilize the `pipeline` function from `transformers` to make things easier.

In [12]:
question_answerer = pipeline("question-answering", model="deepset/electra-base-squad2")

Let us create the records from our dataset by also adding suggestions to each item.

In [13]:
records = [
    rg.FeedbackRecord(
        fields={
            "question": item["question"],
            "context": item["context"],
        },
        suggestions=[
            {"question_name": "answer",
            "value": question_answerer(question=item["question"], context=item["context"])["answer"]},
        ]
    ) for item in dataset_hf
]
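
The pipeline also returns a confidence `score` for each prediction. A possible variation, sketched below, keeps that score next to the suggested value so that low-confidence suggestions can be spotted or dropped. The helper `to_suggestion` and the `min_score` threshold are hypothetical, and the `score` and `agent` keys are assumed to be accepted by your Argilla version; the prediction dictionary is written by hand in the shape the `question-answering` pipeline returns.

```python
def to_suggestion(prediction, agent, min_score=0.0):
    """Build a suggestion dict from a QA pipeline prediction.

    Returns None when the prediction is below `min_score`, so that the
    record can be created without a suggestion.
    """
    if prediction["score"] < min_score:
        return None
    return {
        "question_name": "answer",
        "value": prediction["answer"],
        "score": round(prediction["score"], 4),
        "agent": agent,  # name of the model that produced the suggestion
    }


# A hand-written prediction in the same shape the pipeline returns.
prediction = {"score": 0.91, "start": 136, "end": 158, "answer": "Professor Jerome Green"}
suggestion = to_suggestion(prediction, agent="deepset/electra-base-squad2", min_score=0.5)
print(suggestion)
```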

And add the records to our dataset.

In [14]:
dataset.add_records(records)

We can now upload our dataset to Argilla for our annotators to annotate. They will annotate each item by writing the answer span in the `answer` field, using the model suggestions as hints if you have opted for them. If you would like to have more control over the annotation process and use other features, you can refer to our [Argilla UI](../../../reference/webapp/features.md) page for more information.

In [None]:
remote_dataset = dataset.push_to_argilla(name="demonstration_data_squad_gpt2", workspace="argilla")

## Train the Model

After our annotation work is done, we can download our annotated dataset. Note that the dataset returned by the `from_argilla` function is a remote dataset object, meaning that any change you make is directly reflected on the remote dataset.

In [16]:
annotated_dataset_from_argilla = rg.FeedbackDataset.from_argilla("demonstration_data_squad_gpt2", workspace="argilla")

We can then set the format of our dataset as `datasets` to convert it into a datasets object.

In [17]:
annotated_dataset = annotated_dataset_from_argilla.format_as("datasets")

The annotators gave their responses as text pieces and we need to convert them into the start and end indices of the answer span.

In [18]:
def find_span(answer, context):
    # Escape the answer so that characters such as "(" or "?" are matched literally.
    match = re.search(re.escape(answer), context)
    if match is None:
        return None
    # Character offsets of the first occurrence; the end index is exclusive.
    return match.start(), match.end()
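
When answers are matched as regular expressions without escaping, characters such as parentheses or question marks in an annotator's answer can break the match. An alternative that sidesteps the issue entirely is a plain substring search. The sketch below, with the hypothetical name `find_span_literal`, shows one way to do it on a sentence taken from the record above.

```python
def find_span_literal(answer, context):
    """Return (start, end) character offsets of `answer` in `context`, or None."""
    start = context.find(answer)  # literal search, no regex semantics
    if start == -1:
        return None
    return start, start + len(answer)


context = "In 1882, Albert Zahm (John Zahm's brother) built an early wind tunnel."
span = find_span_literal("(John Zahm's brother)", context)
print(span)
```

Like the regex version, this only returns the first occurrence, so an answer that appears several times in the context will always be mapped to its first position.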

By using the `find_span` function, we will create the `answer_start` and `answer_end` fields for our dataset.

In [19]:
def create_answer_columns(dataset):
    answer_starts = []
    answer_ends = []
    for item in dataset:
        # Use the first annotator response as the gold answer.
        answer = item["answer"][0]["value"]
        span = find_span(answer, item["context"])
        if span is None:
            # Fail early with a clear message instead of an unpacking error.
            raise ValueError(f"Answer {answer!r} was not found in its context.")
        answer_starts.append(span[0])
        answer_ends.append(span[1])
    dataset = dataset.add_column("answer_start", answer_starts)
    dataset = dataset.add_column("answer_end", answer_ends)
    return dataset

annotated_dataset = create_answer_columns(annotated_dataset)

We can now tokenize the `annotated_dataset` by mapping with our tokenizer function.

In [20]:
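# The tokenizer must exist before mapping; it matches the model fine-tuned below.
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    inputs = tokenizer(
        examples["question"],
        examples["context"],
        max_length=512,
        truncation="only_second",
        padding="max_length",
    )
    # The model expects token positions, so convert the character offsets of each batch item.
    start_positions, end_positions = [], []
    for i, (start, end) in enumerate(zip(examples["answer_start"], examples["answer_end"])):
        start_token = inputs.char_to_token(i, start, sequence_index=1)
        end_token = inputs.char_to_token(i, end - 1, sequence_index=1)
        # Fall back to the first token if the answer was truncated away.
        start_positions.append(start_token if start_token is not None else 0)
        end_positions.append(end_token if end_token is not None else 0)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs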


In [None]:
tokenized_dataset = annotated_dataset.map(tokenize_function, batched=True).select_columns(["input_ids", "attention_mask", "start_positions", "end_positions"])
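
Keep in mind that QA models in `transformers` expect `start_positions` and `end_positions` as token indices, whereas `answer_start` and `answer_end` are character offsets. Fast tokenizers can return an offset mapping that bridges the two. The conversion can be illustrated in plain Python with a toy whitespace tokenizer; `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers written for this sketch, not library functions.

```python
import re


def whitespace_offsets(text):
    """Toy tokenizer: return (start, end) character offsets of each word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, char_start, char_end):
    """Map a character span [char_start, char_end) to inclusive token indices."""
    token_start = token_end = None
    for index, (tok_start, tok_end) in enumerate(offsets):
        if token_start is None and tok_end > char_start:
            token_start = index  # first token that reaches past the span start
        if tok_start < char_end:
            token_end = index  # last token that begins before the span end
    return token_start, token_end


context = "Venezuela is known for its natural beauty."
offsets = whitespace_offsets(context)
token_span = char_span_to_token_span(offsets, 27, 41)
print(token_span)
```

A real subword tokenizer splits the text differently and prepends the question and special tokens, so the actual indices will differ, but the mapping logic is the same.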

Let us load the model, split the dataset into training and validation sets, and fine-tune the model.

In [None]:
#TO BE REPLACED BY ARGILLATRAINER

model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No GPU available, using CPU instead.")

# Hold out 20% of the records for validation and train on the rest.
split_dataset = tokenized_dataset.train_test_split(test_size=0.2)

train_dataset = split_dataset['train']
val_dataset = split_dataset['test']
train_dataset.set_format("torch",device=device)
val_dataset.set_format("torch",device=device)

data_collator = DefaultDataCollator()

training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    save_strategy="epoch",
    evaluation_strategy = "epoch",
    learning_rate=0.000005,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=5,
    weight_decay=0.001,
    push_to_hub=False,
    logging_strategy="epoch",
    logging_dir='./logs',
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,  # a lower validation loss means a better model
    dataloader_pin_memory=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

trainer.train()

## Inference

Now that we have our model trained, we can use it to find the answer span for a given question and context. We can use the `pipeline` function from `transformers` to make things easier. It will give us the answer as well as the start and end indices of the answer span.

In [23]:
qa_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=device
)

We just need to feed the pipeline the question and the context to get the answer.

In [27]:
qa_pipeline(question="For what is Venezuela famous?", context="Venezuela is known for its natural beauty.")

{'score': 0.6772550344467163,
 'start': 27,
 'end': 41,
 'answer': 'natural beauty'}
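
The `start` and `end` values are character offsets into the context, with `end` being exclusive. As a quick check, using the output above copied by hand into a dictionary, slicing the context with these offsets recovers the answer text.

```python
context = "Venezuela is known for its natural beauty."
# Output copied from the pipeline call above.
result = {"score": 0.6772550344467163, "start": 27, "end": 41, "answer": "natural beauty"}

# Slice the context with the returned character offsets.
extracted = context[result["start"]:result["end"]]
print(extracted == result["answer"])
```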