# 🤗 End-to-end distilabel example with Inference Endpoints and Notus

[distilabel](https://github.com/argilla-io/distilabel) is an AI Feedback (AIF) framework that generates and labels datasets using LLMs and can be applied to many different use cases. Implemented with robustness, efficiency and scalability in mind, it allows anyone to build their own synthetic datasets for many different scenarios. This tutorial shows an end-to-end example in which we will create a model that is an expert in the new AI Act, to which we can pose different types of questions and requests.

The LLM that we will fine-tune for this is [Notus 7B](https://argilla.io/blog/notus7b/), a fine-tuned version of Zephyr 7B that uses Direct Preference Optimization (DPO) and AIF techniques to outperform its foundation model in several benchmarks, and is completely open-source.

## Introduction

Let's start by installing the required dependencies to run distilabel, Argilla and the rest of the packages used in the tutorial.

In [1]:
# %pip install argilla distilabel farm-haystack
# %pip install "distilabel[hf-inference-endpoints]"

### Running Argilla

For this tutorial, you can use Argilla to visualize and annotate the different datasets created by distilabel. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

### Import dependencies

The main dependencies for this tutorial are distilabel, for creating the synthetic datasets, and Argilla, for visualizing and annotating these datasets and for fine-tuning our model. The package [Haystack](https://haystack.deepset.ai/) is used to create batches from the original PDF document we want to build our datasets from.

In [2]:
import os
from typing import Dict, Union, Tuple, List

import argilla as rg
from argilla.feedback import TrainingTask
from argilla.feedback import ArgillaTrainer

from distilabel.llm import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline, pipeline
from distilabel.tasks import Llama2TextGenerationTask, SelfInstructTask, Prompt

from datasets import Dataset
from haystack.nodes import PDFToTextConverter, PreProcessor

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:

In [3]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="https://ignacioct-argilla.hf.space",
    api_key="owner.apikey",
    workspace="admin"
)


Additionally, we need to provide our Hugging Face and OpenAI access tokens. To later instantiate an `InferenceEndpointsLLM` object, we also need to pass the HF Inference Endpoint name and the HF namespace as parameters. A convenient way to supply all of these is through environment variables.

In [4]:
os.environ["HF_TOKEN"] = ""
os.environ["HF_INFERENCE_ENDPOINT_NAME"] = "aws-notus-7b-v1-3184"
os.environ["HF_NAMESPACE"] = "argilla"
os.environ["OPENAI_API_KEY"] = ""
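
Since the two tokens are left empty in the cell above, it can be handy to check which variables still need a value before calling any endpoint. The helper below is a minimal sketch and not part of distilabel or Argilla; `missing_env_vars` and `example_env` are hypothetical names introduced only for illustration.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or set to an empty string."""
    return [name for name in names if not env.get(name)]


# Variable names taken from the cell above.
REQUIRED = ["HF_TOKEN", "HF_INFERENCE_ENDPOINT_NAME", "HF_NAMESPACE", "OPENAI_API_KEY"]

# Hypothetical mapping that mirrors the cell above (tokens left empty).
example_env = {
    "HF_TOKEN": "",
    "HF_INFERENCE_ENDPOINT_NAME": "aws-notus-7b-v1-3184",
    "HF_NAMESPACE": "argilla",
    "OPENAI_API_KEY": "",
}

print(missing_env_vars(REQUIRED, example_env))  # ['HF_TOKEN', 'OPENAI_API_KEY']
```

Calling `missing_env_vars(REQUIRED)` without a second argument checks the real process environment instead.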

## Setting up an inference endpoint with Notus

To kickstart this tutorial, let's see how to set up an endpoint for our Notus model. A Hugging Face Inference Endpoint is a service provided by Hugging Face that allows you to deploy and host your machine learning models for inference. This way, we'll have faster inference times, as these models will not run on our personal machines but on Hugging Face servers. The endpoint of choice has a [Notus 7B instance](https://ui.endpoints.huggingface.co/argilla/endpoints/aws-notus-7b-v1-4052) running.

Let's see a quick example of how to use an inference endpoint. We have prepared a simple `Llama2QuestionAnsweringTask` to ask questions to the model, much as we would talk to an LLM through a chatbot.

In [5]:
class Llama2QuestionAnsweringTask(Llama2TextGenerationTask):
    def generate_prompt(self, question: str) -> str:
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=question,
        ).format_as("llama2")  # type: ignore

    def parse_output(self, output: str) -> Dict[str, str]:
        return {"answer": output.strip()}

    def input_args_names(self) -> list[str]:
        return ["question"]

    def output_args_names(self) -> list[str]:
        return ["answer"]
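
To get a feeling for what `format_as("llama2")` produces, here is a rough sketch of the Llama 2 chat layout written as a plain function. It is an approximation for illustration only: `llama2_prompt` is a hypothetical helper, and the exact whitespace emitted by distilabel may differ.

```python
def llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Approximate Llama 2 chat layout: system block first, then the user turn."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


prompt = llama2_prompt(
    system_prompt="You are a helpful assistant.",
    user_message="What's the second most populated city in Denmark?",
)
print(prompt)
```

The model is expected to continue the text after `[/INST]`, which is why `parse_output` only needs to strip the raw completion.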

The `llm` is an object of the `InferenceEndpointsLLM` class, and through it we can start generating answers to questions using the `llm.generate()` method.

In [6]:
llm = InferenceEndpointsLLM(
    endpoint_name=os.getenv("HF_INFERENCE_ENDPOINT_NAME"),  # type: ignore
    endpoint_namespace=os.getenv("HF_NAMESPACE"),  # type: ignore
    token=os.getenv("HF_TOKEN") or None,
    task=Llama2QuestionAnsweringTask(),
)

In [7]:
generation = llm.generate([{"question": "What's the second most populated city in Denmark?"}])
generation[0][0]["parsed_output"]["answer"]

'The second most populated city in Denmark is Aarhus, with a population of around 340,000 people. It is located on the east coast of Jutland, and is known for its vibrant cultural scene, beautiful beaches, and historic landmarks. Aarhus is also home to Aarhus University, one of the largest universities in Scandinavia.'

The endpoint is working! We now can do inference through the Inference Endpoint.
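
The indexing `generation[0][0]["parsed_output"]["answer"]` suggests a nested result: one list per input, one entry per generation, each holding a `parsed_output` dictionary. The sketch below uses mock data with that shape (an assumption inferred only from the indexing above, with made-up values) to show how all answers could be collected at once; `collect_answers` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Mock result with the assumed nesting: inputs -> generations -> output dict.
mock_generation: List[List[Dict[str, Any]]] = [
    [{"parsed_output": {"answer": "Aarhus"}}],
    [{"parsed_output": {"answer": "Copenhagen"}}],
]


def collect_answers(generation: List[List[Dict[str, Any]]]) -> List[List[str]]:
    """Keep the nesting but reduce every output dict to its parsed answer."""
    return [
        [output["parsed_output"]["answer"] for output in per_input]
        for per_input in generation
    ]


print(collect_answers(mock_generation))  # [['Aarhus'], ['Copenhagen']]
```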

## Downloading the AI Act PDF document

Since we want a model that is an expert on the new AI Act proposed by the European Union, we first need to download the PDF document itself.

In [8]:
!wget https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf

--2024-01-02 13:48:34--  https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
Resolving artificialintelligenceact.eu (artificialintelligenceact.eu)... 173.255.227.216
Connecting to artificialintelligenceact.eu (artificialintelligenceact.eu)|173.255.227.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1351521 (1,3M) [application/pdf]
Saving to: ‘The-AI-Act.pdf.1’


2024-01-02 13:48:35 (2,35 MB/s) - ‘The-AI-Act.pdf.1’ saved [1351521/1351521]



Once we have it in our working directory, we can use Haystack's converter and preprocessing features to extract the textual data, clean it and divide it into batches. Afterwards, these batches will be used to start creating synthetic instructions.

In [9]:
converter = PDFToTextConverter(
    remove_numeric_tables=True,
    valid_languages=["en"]
)

doc = converter.convert(file_path="The-AI-Act.pdf", meta=None)[0]

pdftotext version 4.04 [www.xpdfreader.com]
Copyright 1996-2022 Glyph & Cog, LLC


In [10]:
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process([doc])
print(f"Number of input documents: 1\nNumber of output documents: {len(docs)}")


Number of input documents: 1
Number of output documents: 355


Let's take a look at the batches:

In [11]:
inputs = [doc.content for doc in docs]
inputs[0:5]

['EN EN\nEUROPEAN\nCOMMISSION\nProposal for a\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\nLEGISLATIVE ACTS\x0cEN\nEXPLANATORY MEMORANDUM\n1. CONTEXT OF THE PROPOSAL\n1.1. Reasons for and objectives of the proposal\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\nsocietal benefits across the entire spectrum of industries and social activities. By improving\nprediction, optimising operations and resource allocation, and personalising service delivery,\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\nand provide key competitive advantages to companies and the European economy. ',
 'Such\nac

The document has been correctly batched, from one big document into 355 strings, each roughly 150 words long at most (the split is by word and respects sentence boundaries). This list of strings can now be used as input to generate an instruction dataset using `distilabel`.
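
To make the splitting idea concrete, the following is a simplified, standard-library sketch of word-based chunking that keeps sentences whole. It is not Haystack's implementation: `chunk_by_words` is a hypothetical helper and the sentence splitting is deliberately naive.

```python
import re
from typing import List


def chunk_by_words(text: str, max_words: int = 150) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole, so a chunk can
    exceed the limit in that case.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would pass the limit.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine. Ten."
for chunk in chunk_by_words(sample, max_words=5):
    print(repr(chunk))
```

With `max_words=5`, the sample is split into three chunks, and the last two short sentences share one chunk because together they stay under the limit.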

## Generating instructions with SelfInstructTask

With our Inference Endpoint up and running, we should be able to generate instructions with distilabel. These instructions, produced by the LLM through our endpoint, will form an instruction dataset. For this example, we use only a subset of the batches generated in the section above, to keep the run time low.

In [12]:
instructions_dataset = Dataset.from_dict({
    "input": inputs[0:50]
})

instructions_dataset

Dataset({
    features: ['input'],
    num_rows: 50
})

With the `SelfInstructTask` class we can generate a Self-Instruct specification for building the prompts, as done in the [Self-Instruct paper](https://arxiv.org/abs/2212.10560). `distilabel` will start from human-made input, in this case the batches we created from the AI Act PDF, and it will generate instructions based on it. These instructions can then be reviewed using Argilla to keep the best ones.

An application description can be passed as a parameter to specify the behaviour of the model; we want a model capable of answering our questions about the AI Act.


In [13]:
instructions_task = SelfInstructTask(
    application_description="An assistant that can answer questions about the AI Act made by the European Union."
)

Let's now define a generator, passing the `SelfInstructTask` object, and create a `Pipeline` object.

In [14]:
instructions_generator = InferenceEndpointsLLM(
    endpoint_name=os.getenv("HF_INFERENCE_ENDPOINT_NAME"),  # type: ignore
    endpoint_namespace=os.getenv("HF_NAMESPACE"),  # type: ignore
    token=os.getenv("HF_TOKEN") or None,
    task=instructions_task,
)

instructions_pipeline = Pipeline(
    generator=instructions_generator
)

Our pipeline is ready to be used to generate instructions. Let's do it!

In [15]:
generated_instructions = instructions_pipeline.generate(dataset=instructions_dataset, num_generations=1, batch_size=8)
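
As a rough mental model of what `batch_size=8` means for our 50 inputs, the sketch below slices a list into consecutive batches. This only illustrates the arithmetic and is not distilabel's internal code; `batched` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


rows = [f"chunk-{i}" for i in range(50)]  # stand-in for the 50 PDF batches
batches = list(batched(rows, batch_size=8))
print(len(batches), len(batches[-1]))  # 7 2
```

Six full batches of 8 cover 48 inputs, so the seventh batch carries the remaining 2.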


Our pipeline has successfully generated instructions given the topics and the behaviour passed as input. Let's gather all those instructions and see how they look.

In [16]:
instructions = []
for generations in generated_instructions["instructions"]:
    for generation in generations:
        instructions.extend(generation)

print(f"Number of generated instructions: {len(instructions)}")

for instruction in instructions[:5]:
    print(instruction)

Number of generated instructions: 178
What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?
How can artificial intelligence improve prediction, optimise operations and resource allocation, and personalise service delivery?
What benefits can artificial intelligence bring to the European economy and society as a whole?
How can the use of artificial intelligence support socially and environmentally beneficial outcomes?
What are the high-impact sectors that require AI action according to the AI Act by the European Union?


These initial instructions form our instruction dataset. Following the human-in-the-loop approach, we should push the instructions to Argilla to visualize them and rank them in terms of quality. Those annotations would improve data quality and, in turn, the performance of the final model. Nevertheless, this step is optional.
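
The nested loop used to gather the instructions can also be written with `itertools.chain`, which makes it easy to count how many instructions each input produced. The sketch below runs on mock data with the same nesting (rows, then generations, then instructions); the values are made up for illustration.

```python
from itertools import chain
from statistics import mean

# Mock column: one entry per input row, each holding one generation,
# which is itself a list of instructions.
mock_column = [
    [["Q1", "Q2", "Q3", "Q4"]],
    [["Q5", "Q6", "Q7"]],
    [["Q8", "Q9", "Q10", "Q11"]],
]

# Flatten two levels: rows -> generations -> instructions.
flat = list(chain.from_iterable(chain.from_iterable(mock_column)))

# Number of instructions obtained for each input row.
per_row = [sum(len(generation) for generation in generations) for generations in mock_column]

print(len(flat), per_row, round(mean(per_row), 2))  # 11 [4, 3, 4] 3.67
```

Applied to the real run above, the same count gives 178 instructions for 50 inputs, i.e. about 3.56 per input.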

### Pushing the instruction dataset to Argilla to visualize and annotate

Let's take a quick look at the instructions generated by `SelfInstructTask`.

In [17]:
generated_instructions[0]

{'input': 'EN EN\nEUROPEAN\nCOMMISSION\nProposal for a\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\nLEGISLATIVE ACTS\x0cEN\nEXPLANATORY MEMORANDUM\n1. CONTEXT OF THE PROPOSAL\n1.1. Reasons for and objectives of the proposal\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\nsocietal benefits across the entire spectrum of industries and social activities. By improving\nprediction, optimising operations and resource allocation, and personalising service delivery,\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\nand provide key competitive advantages to companies and the European economy. ',
 

For each input, i.e., each batch of the AI Act PDF file, we have a generator prompt with general guidelines on how to behave, as well as the application description parameter. Around 4 instructions have been generated per input (178 in total for the 50 inputs).

Now it's the perfect time to upload the instruction dataset to Argilla, review it and manually annotate it. 

In [18]:
instructions_rg_dataset = generated_instructions.to_argilla()
instructions_rg_dataset.push_to_argilla(name="notus_AI_instructions")

Output()

RemoteFeedbackDataset(
   id=fbdc2ae7-ed9f-4aac-a9b0-6a9e59eaaa79
   name=notus_AI_instructions
   workspace=Workspace(id=29538109-004d-4be3-affc-a12606f51636, name=admin, inserted_at=2024-01-02 09:45:26.334713, updated_at=2024-01-02 09:45:26.334713)
   url=https://ignacioct-argilla.hf.space/dataset/fbdc2ae7-ed9f-4aac-a9b0-6a9e59eaaa79/annotation-mode
   fields=[RemoteTextField(id=UUID('26ad18e8-641b-4587-b4a0-b911a23776df'), client=None, name='input', title='input', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('09abdc47-f694-4ef5-8b4d-9d36be7d859c'), client=None, name='instruction', title='instruction', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('a53a2832-14b4-4bcf-8e5c-483cc844416c'), client=None, name='instruction-rating', title='How would you rate the generated instruction?', description=None, required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])]
   guidelines=None
   metadata_properties=[

In the Argilla UI, each input-instruction pair is shown as its own record and can be annotated individually.

![](../assets/tutorials-assets/instrucion_dataset_ui.png)

## Generate a Preference Dataset using an UltraFeedback text quality task

Once we have our instruction dataset, we are going to create a preference dataset through the UltraFeedback text quality task. This is a type of NLP task used to evaluate the quality of generated text; our goal is to provide detailed feedback on the quality of the generated text, beyond a binary label.

The `pipeline()` function allows us to create a `Pipeline` instance with the provided LLMs for a given task, which is useful whenever you want to use a pre-defined or custom `Pipeline`. We will specify our task and subtask, the generator we want to use (in this case, one based on a `Llama2TextGenerationTask`) and our OpenAI API key.

In [19]:
preference_pipeline = pipeline(
    "preference",
    "text-quality",
    generator=InferenceEndpointsLLM(
        endpoint_name=os.getenv("HF_INFERENCE_ENDPOINT_NAME"),  # type: ignore
        endpoint_namespace=os.getenv("HF_NAMESPACE", None),
        task=Llama2TextGenerationTask(),
        max_new_tokens=256,
        num_threads=2,
        temperature=0.3,
    ),
    max_new_tokens=256,
    num_threads=2,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    temperature=0.0,
)

We also need to retrieve our instruction dataset from Argilla, as it will be the input of this pipeline.

In [20]:
remote_dataset = rg.FeedbackDataset.from_argilla("notus_AI_instructions", workspace="admin")
instructions_dataset = remote_dataset.pull(max_records=100) # get first 100 records

instructions_dataset = instructions_dataset.format_as("datasets")
instructions_dataset

Dataset({
    features: ['input', 'instruction', 'instruction-rating', 'instruction-rating-suggestion', 'instruction-rating-suggestion-metadata', 'external_id', 'metadata'],
    num_rows: 100
})

In [21]:
instructions_dataset[0]

{'input': 'EN EN\nEUROPEAN\nCOMMISSION\nProposal for a\nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL\nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE\n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION\nLEGISLATIVE ACTS\x0cEN\nEXPLANATORY MEMORANDUM\n1. CONTEXT OF THE PROPOSAL\n1.1. Reasons for and objectives of the proposal\nThis explanatory memorandum accompanies the proposal for a Regulation laying down\nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence\n(AI) is a fast evolving family of technologies that can bring a wide array of economic and\nsocietal benefits across the entire spectrum of industries and social activities. By improving\nprediction, optimising operations and resource allocation, and personalising service delivery,\nthe use of artificial intelligence can support socially and environmentally beneficial outcomes\nand provide key competitive advantages to companies and the European economy.',
 '

Before generating text from our instructions, we need to adjust the dataset a little. From the previous section, we still have our old input, the batches from the PDF. We rename that column to `context` and make the generated instructions the new `input`.

In [22]:
# rename_column takes the old and new name; rename_columns expects a mapping instead
instructions_dataset = instructions_dataset.rename_column("input", "context")

instructions_dataset = instructions_dataset.rename_column("instruction", "input")
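
The order of the two renames matters: `input` has to be freed up before `instruction` can take its place. The sketch below shows this on a plain dictionary standing in for one dataset row; `rename_key` is a hypothetical helper, and the refusal to overwrite an existing name is an assumption modelled on how dataset columns are usually handled.

```python
from typing import Dict


def rename_key(row: Dict[str, str], old: str, new: str) -> Dict[str, str]:
    """Return a copy of row with key old renamed to new, refusing to overwrite."""
    if new in row:
        raise ValueError(f"column {new!r} already exists")
    return {(new if key == old else key): value for key, value in row.items()}


row = {"input": "<PDF chunk>", "instruction": "<generated question>"}
row = rename_key(row, "input", "context")      # free the name 'input' first
row = rename_key(row, "instruction", "input")  # now the instruction becomes the input
print(row)  # {'context': '<PDF chunk>', 'input': '<generated question>'}
```

Swapping the two calls would raise a `ValueError` in this sketch, because `input` would still be taken when the instruction column is renamed.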

Now, let's build the preference dataset by running the pipeline we just created on the instructions, which now sit in the `input` column.

In [23]:
preference_dataset = preference_pipeline.generate(
    instructions_dataset,  # type: ignore
    num_generations=2,
    batch_size=8,
    enable_checkpoints=True,
    display_progress_bar=True,
)


Let's take a look at an instance of the preference dataset:

In [24]:
preference_dataset[0]

{'input': 'What are the reasons for and objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence?',
 'instruction-rating': [],
 'instruction-rating-suggestion': None,
 'instruction-rating-suggestion-metadata': {'agent': None,
  'score': None,
  'type': None},
 'external_id': None,
 'metadata': '{"length-input": 964, "length-instruction": 129}',
 'generation_model': ['argilla/notus-7b-v1', 'argilla/notus-7b-v1'],
 'generation_prompt': ["<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false inf

### Upload the preference dataset to Argilla to annotate.

Once our preference dataset has been correctly generated, the Argilla UI is the best tool at our disposal to visualize and annotate it. As with the instruction dataset, we just have to convert it to an Argilla Feedback Dataset and push it to Argilla.

In [25]:
# Uploading the Preference Dataset
preference_rg_dataset = preference_dataset.to_argilla()
preference_rg_dataset.push_to_argilla(name="notus_AI_preference")

Output()

RemoteFeedbackDataset(
   id=e0e38f48-e730-4406-9b7f-8d1b52de1919
   name=notus_AI_preference
   workspace=Workspace(id=29538109-004d-4be3-affc-a12606f51636, name=admin, inserted_at=2024-01-02 09:45:26.334713, updated_at=2024-01-02 09:45:26.334713)
   url=https://ignacioct-argilla.hf.space/dataset/e0e38f48-e730-4406-9b7f-8d1b52de1919/annotation-mode
   fields=[RemoteTextField(id=UUID('2c4b99d8-9b61-4210-ad46-37bffe47d451'), client=None, name='input', title='input', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('063e3325-e2ee-422d-bd56-fc2fce82329a'), client=None, name='generations-1', title='generations-1', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('659ccca7-3caa-4604-a0be-ec99dbf86068'), client=None, name='generations-2', title='generations-2', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('b296bfb4-9440-4f9e-b924-b96e75d9ad3f'), client=None, name='generations-1-rating', title="What

In the Argilla UI, we can see the input (an instruction), and the two generations that the LLM created out of it.

![](../assets/tutorials-assets/preference_dataset_ui.png)

## Fine-tuning our model using the preference dataset

In [None]:
preference_rg_dataset = rg.FeedbackDataset.from_argilla("notus_AI_preference", workspace="admin")

In [None]:
# Adaptation from LlamaIndex's TEXT_QA_PROMPT_TMPL_MSGS[1].content
user_message_prompt ="""Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge but keeping your assistant style and objective to answer questions about the AI Act, answer the query.
Query: {query_str}
Answer:
"""

# Same system prompt that Distilabel appends by default to guide the model's behaviour.
system_prompt = """
You are an expert prompt writer, writing the best and most diverse prompts for a variety of tasks. 
You are given a task description and a set of instructions for how to write the prompts for an specific AI application.
"""

In [None]:
def formatting_func(sample: dict) -> Union[Tuple[str, str, str, str], List[Tuple[str, str, str, str]]]:
    from uuid import uuid4
    if sample["generations"]:
        chat = str(uuid4())
        user_message = user_message_prompt.format(context_str=sample["context"], query_str=sample["input"])

        # We need to choose one of the two generations made. We can pick the one with the highest rating and, in case of draw, the first one.
        answer = ""

        if sample["rating"][0] < sample["rating"][1]:
            answer = sample["generations"][1]
        else:
            answer = sample["generations"][0]

        return [
            (chat, "0", "system", system_prompt),
            (chat, "1", "user", user_message),
            (chat, "2", "assistant", answer)
        ]

task = TrainingTask.for_chat_completion(formatting_func=formatting_func)
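
The selection rule inside `formatting_func` (highest rating wins, the first generation wins a draw) can be isolated and tried on a mock sample. The sketch below is a standalone illustration: `best_generation` is a hypothetical helper, and the sample only mirrors the keys that `formatting_func` reads, with made-up values.

```python
from typing import Sequence


def best_generation(generations: Sequence[str], ratings: Sequence[float]) -> str:
    """Return the highest-rated generation; ties go to the earliest one."""
    # max() keeps the first maximal index it meets, which gives the tie rule.
    best_index = max(range(len(generations)), key=lambda i: ratings[i])
    return generations[best_index]


# Mock sample with the keys used by formatting_func (values are invented).
sample = {
    "input": "What is the goal of the AI Act?",
    "context": "<PDF chunk>",
    "generations": ["A short answer.", "A more detailed answer."],
    "rating": [3.0, 4.5],
}

print(best_generation(sample["generations"], sample["rating"]))  # A more detailed answer.
print(best_generation(["first", "second"], [4.0, 4.0]))          # first
```

Unlike the two-way comparison above, this version also works when more than two generations are requested.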

In [None]:
trainer = ArgillaTrainer(
    dataset=preference_rg_dataset,
    task=task,
    framework="openai",
)
trainer.train(output_dir="notus_preference_finetuned")