# 🌠 Improving RAG by optimizing retrieval and reranking models

In this tutorial, we will show how to improve a RAG pipeline by optimizing its retrieval and reranking models. For this purpose, we will use the `ArgillaTrainer` to fine-tune a `bi-encoder` and a `cross-encoder` on a dataset of similar sentences. We will then show how to use the fine-tuned models to improve the RAG pipeline.

We will follow these steps:

* 📝 Choose the right dataset for sentence similarity
* 📩 Upload the dataset to Argilla and work in the `Argilla UI`
* 💫 Fine-tune the `bi-encoder` and `cross-encoder`
* 🌌 Evaluate the fine-tuned models


## Introduction

**LLMs** are a reality in our day-to-day lives. They are used in search engines, chatbots, and question answering systems. However, they are not perfect. They often produce responses that are not relevant, accurate, or verifiable. To solve this problem, RAG (Retrieval-Augmented Generation) was introduced.

**RAG** is a framework that improves the quality of responses by combining a pre-trained LLM with a retrieval model. The retrieval model fetches relevant information from a knowledge base (the web or your own documents), which makes the responses more trustworthy for the user. In addition, RAG addresses common LLM drawbacks: it can provide up-to-date and domain-specific data (even citing its sources), and it is more efficient and affordable because no model needs to be retrained from scratch.

In order to optimize the retrieval model, a **sentence similarity model** can be used. It improves the accuracy and relevance of the retrieved information by capturing the user's intent: the text is transformed into embeddings (vectors representing its semantic information), and the similarity between those embeddings is computed so that the meaning of the input text can be 'understood'.
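
As a rough illustration of the idea above, the following sketch compares hand-made toy vectors with cosine similarity. The vectors and the `cosine_similarity` helper are assumptions made for illustration only; a real sentence similarity model would produce embeddings with hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up, not produced by any model).
query = [0.9, 0.1, 0.0]
close_doc = [0.8, 0.2, 0.1]
far_doc = [0.0, 0.2, 0.9]

print(cosine_similarity(query, close_doc))  # higher: vectors point the same way
print(cosine_similarity(query, far_doc))    # lower: vectors point elsewhere
```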

In this tutorial, we will fine-tune a sentence similarity model as a bi-encoder (faster but less accurate) and as a cross-encoder (slower but more accurate). The **bi-encoder** creates sentence embeddings for the data and the query, and then compares them by computing the similarity between the vectors. The **cross-encoder** does not use sentence embeddings; instead, it classifies each pair of texts and outputs a value between 0 and 1 indicating the similarity between them. The image below shows how both can work together.

<img src="" alt="Bi-encoder and cross-encoder for RAG" style="width: 1100px;">
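
The two-stage flow described above (a fast bi-encoder pass over the whole collection, followed by a slower cross-encoder pass over a shortlist) can be sketched with stand-in scorers. Everything in this sketch is hypothetical: `embed` builds a word-count vector rather than a neural embedding, and `pair_score` is a token-overlap ratio rather than a trained cross-encoder. Only the control flow is meant to mirror a retrieve-and-rerank setup.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: one vector per text, computed independently."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, doc) pair jointly."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_and_rerank(query: str, docs: List[str], top_k: int = 2) -> List[Tuple[str, float]]:
    # Stage 1 (bi-encoder): document vectors could be precomputed once and reused.
    doc_vectors = [embed(d) for d in docs]
    query_vector = embed(query)
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query_vector, doc_vectors[i]), reverse=True)
    shortlist = [docs[i] for i in ranked[:top_k]]
    # Stage 2 (cross-encoder): only the shortlist is scored pair by pair.
    reranked = [(d, pair_score(query, d)) for d in shortlist]
    return sorted(reranked, key=lambda item: item[1], reverse=True)


docs = [
    "argilla is a data labelling tool",
    "the weather is sunny today",
    "argilla helps label data for nlp",
]
for doc, score in retrieve_and_rerank("what is argilla", docs):
    print(f"{score:.3f}  {doc}")
```

The design point is that the expensive pairwise scorer only ever sees `top_k` candidates, so its cost does not grow with the size of the collection.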

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:


**Deploy Argilla on Hugging Face Spaces**: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).


**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.html). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip
    
This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter Notebook tool of your choice.
</div>

## Set up the environment

To complete this tutorial, you will need to install the Argilla client and a few third-party libraries using `pip`:

In [None]:
%pip install argilla -qqq
%pip install datasets
%pip install sentence-transformers
%pip install "farm-haystack[faiss]"

Let's make the needed imports:

In [28]:
import argilla as rg
from argilla.feedback import TrainingTask
from argilla.feedback import ArgillaTrainer

import random

from datasets import load_dataset

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the `URL` and `API_KEY`:

In [2]:
# Replace api_url with your Hugging Face Spaces URL if you are using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="http://localhost:6900", 
    api_key="owner.apikey",
    workspace="admin"
)
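
If you prefer not to hard-code credentials in the notebook, one option is to read them from environment variables and fall back to the quickstart defaults shown above. This is only a sketch: the `argilla_settings` helper is hypothetical, and the variable names `ARGILLA_API_URL`, `ARGILLA_API_KEY`, and `ARGILLA_WORKSPACE` are an assumption about how you name them in your environment.

```python
import os


def argilla_settings() -> dict:
    """Collect connection settings, falling back to the quickstart defaults."""
    return {
        "api_url": os.environ.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": os.environ.get("ARGILLA_API_KEY", "owner.apikey"),
        "workspace": os.environ.get("ARGILLA_WORKSPACE", "admin"),
    }


settings = argilla_settings()
print(settings["api_url"])
# The settings can then be passed on as: rg.init(**settings)
```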

## The Dataset

Here, we will use the [Sentence Compression](https://huggingface.co/datasets/embedding-data/sentence-compression) dataset, which is composed of 180,000 pairs of equivalent sentences (uncompressed and compressed sentences from news articles). In our example, we will reduce the number of samples to 500, add the guidelines and questions needed to work in the Argilla UI, and upload the configured dataset to Argilla.

In this case, the dataset is composed of positive pairs, that is, pairs of similar sentences. Other dataset types are also valid: datasets prepared for NLI like [snli](https://huggingface.co/datasets/snli), datasets with a label for each sentence like [trec](https://huggingface.co/datasets/trec), datasets with triplets of sentences like [QQP_triplets](https://huggingface.co/datasets/embedding-data/QQP_triplets), etc. You can find more information about the different types of datasets [here](https://huggingface.co/blog/how-to-train-sentence-transformers).


### Retrieving the data from a vector search index

In [None]:
# import logging

# logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
# logging.getLogger("haystack").setLevel(logging.INFO)

In [None]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

In [None]:
from haystack.utils import clean_wiki_text, convert_files_to_docs
doc_dir = "data"
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

INFO:haystack.utils.preprocessing:Converting data/argilla_cloud.txt


In [None]:
document_store.write_documents(docs)

Writing Documents: 10000it [00:00, 412196.35it/s]       


In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)

document_store.update_embeddings(retriever)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1
INFO:haystack.document_stores.faiss:Updating embeddings for 1 docs...
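
The `"Flat"` index factory string used above refers to an exact (brute-force) index: every stored vector is compared with the query. A toy, library-free sketch of that behaviour, assuming dot-product scoring, could look as follows. `FlatIndex` and its payload strings are hypothetical and unrelated to the FAISS or Haystack APIs.

```python
from typing import List, Tuple


class FlatIndex:
    """Toy exact-search index: stores every vector and scans all of them per query."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []
        self.payloads: List[str] = []

    def add(self, vector: List[float], payload: str) -> None:
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query: List[float], top_k: int = 1) -> List[Tuple[str, float]]:
        # Dot product between the query and every stored vector.
        scores = [sum(q * v for q, v in zip(query, vec)) for vec in self.vectors]
        best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [(self.payloads[i], scores[i]) for i in best]


index = FlatIndex()
index.add([1.0, 0.0], "paragraph about pricing")
index.add([0.0, 1.0], "paragraph about security")
print(index.search([0.9, 0.2], top_k=1))
```

Exhaustive search is exact but scales linearly with the number of stored vectors, which is acceptable for a small document collection like the one in this tutorial.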


In [None]:
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

rag_prompt = PromptTemplate(
    prompt="""Synthesize a comprehensive answer from the following text for the given question.
                             Provide a clear and concise response that summarizes the key points and information presented in the text.
                             Your answer should be in your own words and be no longer than 50 words.
                             \n\n Related text: {join(documents)} \n\n Question: {query} \n\n Answer:""",
    output_parser=AnswerParser(),
)

# prompt_node = PromptNode(
#     model_name_or_path="text-davinci-003", api_key=openai_api_key, default_prompt_template=rag_prompt
# )
prompt_node = PromptNode(model_name_or_path="google/flan-t5-large", default_prompt_template=rag_prompt)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0



In [None]:
from haystack.pipelines import Pipeline

pipe = Pipeline()
pipe.add_node(component=retriever, name="retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])

In [None]:

dataset = load_dataset("argilla/cloud_assistant_questions")


In [None]:
output = pipe.run(query="What does Argilla?")

print(output["answers"][0].answer)


Token indices sequence length is longer than the specified maximum sequence length for this model (2438 > 512). Running this sequence through the model will result in indexing errors


Argilla Cloud is a fully managed SaaS solution for data curation and labelling.
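
The warning in the output above indicates that the filled-in prompt (2438 tokens) was longer than the 512 tokens the model accepts. The library-free sketch below mimics what the prompt template does and shows how capping the number of retrieved documents shortens the prompt. `build_rag_prompt` and `max_docs` are hypothetical names, the paragraphs are placeholders, and the word count is only a crude proxy for the token count.

```python
from typing import List


def build_rag_prompt(query: str, documents: List[str], max_docs: int = 3) -> str:
    """Fill a RAG-style prompt, keeping only the first `max_docs` retrieved texts."""
    related_text = " ".join(documents[:max_docs])
    return (
        "Synthesize a comprehensive answer from the following text for the given question.\n\n"
        f"Related text: {related_text}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


# Placeholder paragraphs standing in for retrieved documents.
retrieved = [f"Paragraph {i} about Argilla Cloud." for i in range(10)]
short_prompt = build_rag_prompt("What is Argilla Cloud?", retrieved, max_docs=2)
long_prompt = build_rag_prompt("What is Argilla Cloud?", retrieved, max_docs=10)
print(len(short_prompt.split()), len(long_prompt.split()))
```

In the pipeline above, the analogous setting should be the retriever's `top_k`, which Haystack 1.x accepts at query time, for example `pipe.run(query=..., params={"retriever": {"top_k": 3}})`.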


### Preparing the data for Argilla UI

In [31]:
# Load the first 500 records of the original dataset from the HF Hub
hf_dataset = load_dataset("embedding-data/sentence-compression", split='train[0:500]')

In [32]:
# Create the configuration for our feedback dataset
dataset = rg.FeedbackDataset(
    guidelines="Please rate how similar the two sentences are. If you think the sentences are not similar, please provide a correction to sentence-2.",
    fields=[
        rg.TextField(name="sentence-1", required=True),
        rg.TextField(name="sentence-2", required=True),
    ],
    questions=[
        rg.LabelQuestion(
            name="sentence_similarity",
            title="How would you rate the similarity of both sentences?",
            # Use a list (not a set) so the label order is deterministic
            labels=["Not-similar", "Missing-information", "Similar"],
            required=True,
            visible_labels=None
        ),
        rg.TextQuestion(
            name="corrected-sentence-2",
            title="Provide a correction to the sentence-2 if not similar to sentence-1:",
            required=False,
            use_markdown=True
        )
    ]
)

# Load the hf dataset into Argilla
records = [rg.FeedbackRecord(fields={"sentence-1": record["set"][0], "sentence-2": record["set"][1]}) for record in hf_dataset]

# Add records to the dataset
dataset.add_records(records)

In [34]:
# Publish the dataset in the Argilla UI
dataset = dataset.push_to_argilla(name="sentence-compression-small", workspace="admin")

Pushing records to Argilla...: 100%|██████████| 16/16 [00:02<00:00,  5.94it/s]
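
The list comprehension above relies on each record holding exactly two sentences under the `set` key. A slightly more defensive variant of that mapping is sketched below. The `to_fields` helper is hypothetical and the sample sentences are made up for illustration.

```python
from typing import Dict, List


def to_fields(record: Dict[str, List[str]]) -> Dict[str, str]:
    """Map one {"set": [uncompressed, compressed]} record to the two text fields."""
    sentences = record["set"]
    if len(sentences) != 2:
        raise ValueError(f"Expected a pair of sentences, got {len(sentences)}")
    return {"sentence-1": sentences[0].strip(), "sentence-2": sentences[1].strip()}


# Made-up record with the same shape as the dataset rows.
sample = {"set": ["  The mayor announced on Tuesday that the bridge will reopen.  ", "Bridge will reopen."]}
print(to_fields(sample))
```

The result can be passed directly as `rg.FeedbackRecord(fields=to_fields(record))`.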


### Working in the Argilla UI

Now, you can go to the Argilla UI to explore the dataset and annotate the samples.

<img src="../../docs/_source/_static/tutorials/fine-tuning-sentencesimilarity-rag/argilla_ui_sentence_similarity.png" alt="Working in the Argilla UI" style="width: 1100px;">

## Fine-tuning the sentence similarity model

### Preparing the data for fine-tuning

In [35]:
# Load the dataset from Argilla
dataset = rg.FeedbackDataset.from_argilla("sentence-compression-small", workspace="admin")

In [38]:
# Define the training task
task = TrainingTask.for_sentence_similarity(
    texts=[dataset.field_by_name("sentence-1"), dataset.field_by_name("sentence-2")]
)

### Fine-tuning a bi-encoder

In [43]:
trainer_bi = ArgillaTrainer(
    dataset=dataset,
    task=task,
    framework="sentence-transformers",
    framework_kwargs={"cross_encoder": False}
)
trainer_bi.train(output_dir="my_bi_sentence_transformer_model")

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

### Fine-tuning a cross-encoder

In [44]:
trainer_cross = ArgillaTrainer(
    dataset=dataset,
    task=task,
    framework="sentence-transformers",
    framework_kwargs={"cross_encoder": True}
)
trainer_cross.train(output_dir="my_cross_sentence_transformer_model")

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

## Inferring with the fine-tuned models

In [48]:
# Predict using the bi-encoder model

trainer_bi.predict(
    [
        "Machine learning is so easy.",
        ["Deep learning is so straightforward.", "This is so difficult, like rocket science.", "I can't believe how much I struggled with this."]
    ]
)

[0.7785709, 0.45876068, 0.29062104]

In [49]:
# Predict using the cross-encoder model
trainer_cross.predict(
    [
        "Machine learning is so easy.",
        ["Deep learning is so straightforward.", "This is so difficult, like rocket science.", "I can't believe how much I struggled with this."]
    ]
)

[2.196114, -6.267204, -10.252579]
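
The cross-encoder scores above fall outside the 0 to 1 range, which suggests they are unnormalized. Assuming they are raw logits (an assumption, not something the output confirms), a sigmoid maps them to values between 0 and 1 without changing the ranking. The sketch below reuses the scores printed above and checks that both models order the candidates in the same way. The `sigmoid` and `rank` helpers are illustrative.

```python
import math
from typing import List

candidates = [
    "Deep learning is so straightforward.",
    "This is so difficult, like rocket science.",
    "I can't believe how much I struggled with this.",
]
bi_scores = [0.7785709, 0.45876068, 0.29062104]       # copied from the bi-encoder output
cross_scores = [2.196114, -6.267204, -10.252579]      # copied from the cross-encoder output


def sigmoid(x: float) -> float:
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rank(scores: List[float]) -> List[int]:
    """Indices of the candidates, best score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


cross_probs = [sigmoid(s) for s in cross_scores]
for i in rank(cross_scores):
    print(f"{cross_probs[i]:.4f}  {candidates[i]}")

# True when both models agree on the ordering of the candidates.
print(rank(bi_scores) == rank(cross_scores))
```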