docs: add tutorial argilla haystack integration #4597

Closed
Changes from 4 commits
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -19,7 +19,7 @@ These are the section headers that we use:
## [1.24.0](https://github.com/argilla-io/argilla/compare/v1.23.0...v1.24.0)

>[!NOTE]
> This release does not contain any new features, but it includes a major change in the `argilla-server` dependency.
> This release does not contain any new features, but it includes a major change in the `argilla-server` dependency.
> The package is using the `argilla-server` dependency defined [here](https://github.com/argilla-io/argilla-server). ([#4537](https://github.com/argilla-io/argilla/pull/4537))

### Changed
68 changes: 41 additions & 27 deletions README.md
@@ -9,9 +9,6 @@
<a href="https://pypi.org/project/argilla/">
<img alt="CI" src="https://img.shields.io/pypi/v/argilla.svg?style=flat-square&logo=pypi&logoColor=white">
</a>
<!--a href="https://anaconda.org/conda-forge/rubrix">
<img alt="CI" src="https://img.shields.io/conda/vn/conda-forge/rubrix?logo=anaconda&style=flat&color=orange">
</!a-->
<img alt="Codecov" src="https://codecov.io/gh/argilla-io/argilla/branch/main/graph/badge.svg?token=VDVR29VOMG"/>
<a href="https://pepy.tech/project/argilla">
<img alt="CI" src="https://static.pepy.tech/personalized-badge/argilla?period=month&units=international_system&left_color=grey&right_color=blue&left_text=pypi%20downloads/month">
@@ -20,11 +17,6 @@
<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-sm.svg" />
</a>
</p>

<h2 align="center">Open-source feedback layer for LLMs</h2>
<br>


<p align="center">
<a href="https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g">
<img src="https://img.shields.io/badge/JOIN US ON SLACK-4A154B?style=for-the-badge&logo=slack&logoColor=white" />
@@ -37,42 +29,64 @@
</a>
</p>

<br>
<h3 align="center">Work on data together, make your AI better!</h3>

<h3>
<p align="center">
<a href="https://docs.argilla.io">📄 Documentation</a> <span> | </span>
<a href="#-quickstart">🚀 Quickstart</a> <span> | </span>
<a href="#-cheatsheet">🎼 Cheatsheet</a> <span> | </span>
<a href="#-project-architecture">🛠️ Architecture</a> <span> | </span>
<a href="https://demo.argilla.io/sign-in?auth=ZGVtbzoxMjM0NTY3OA%3D%3D">🛝 Demo</a> <span> | </span>
<a href="https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html#%F0%9F%91%A9%F0%9F%8F%BD%E2%80%8D%F0%9F%9A%80-Argilla-on-Hugging-Face-Spaces">🚀 Deploy</a> <span> | </span>
<a href="#-contribute">👨‍💻 Features</a>
<a href="#-contribute">🤝 Contribute</a>
</p>
</h3>

## What is Argilla?
Argilla is a **collaboration platform for AI engineers and domain experts**. We focus on quality, time-to-value, and ownership.

Argilla is an open-source platform for data-centric LLM development. It integrates human and model feedback loops for continuous LLM refinement and oversight.
If you just want to get started, great!

With Argilla's Python SDK and adaptable UI, you can create human and model-in-the-loop workflows for:
1. 🛝 Try Argilla in our [demo environment](https://demo.argilla.io/sign-in?auth=ZGVtbzoxMjM0NTY3OA%3D%3D).

* Supervised fine-tuning
* Preference tuning (RLHF, DPO, RLAIF, and more)
* Small, specialized NLP models
* Scalable evaluation.
2. 🚀 Deploy Argilla for free using [three clicks](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html#%F0%9F%91%A9%F0%9F%8F%BD%E2%80%8D%F0%9F%9A%80-Argilla-on-Hugging-Face-Spaces).

## 🚀 Quickstart
3. 👨‍💻 Explore our [unique features](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html#%F0%9F%91%A9%F0%9F%8F%BD%E2%80%8D%F0%9F%9A%80-Argilla-on-Hugging-Face-Spaces).

There are different options to get started:
4. 📺 Watch our [demo video](https://www.youtube.com/watch?v=FlJ6hrBB2bU).

1. Take a look at our [quickstart page](https://docs.argilla.io/en/latest/getting_started/quickstart.html) 🚀
5. 🏘️ Attend our [online community meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB).

2. Start contributing by looking at our [contributor guidelines](##🤝-contribute) 🤝
Want to know more? Read our [documentation](https://docs.argilla.io/).

3. Skip some steps with our [cheatsheet](##🎼-cheatsheet) 🎼
## Why use Argilla?

## 🎼 Cheatsheet
We designed Argilla to help you create the **highest quality AI with the least effort**. Here are some of the benefits we offer:

This cheatsheet is a quick reference to the most common commands and workflows. For more detailed information, please refer to our [documentation](https://docs.argilla.io/en/latest/getting_started/quickstart.html).
<details>
<summary>Improve your AI output quality through data quality.</summary>
<p>
Compute is expensive and output quality is important. By focusing on data you can tackle the root cause of both of these problems.
</p>
</details>
<details>
<summary>Reduce the time-to-value for AI projects with engaging data interaction.</summary>
</details>
<details>
<summary>Take control by owning your data and models.</summary>
</details>

## What can you build with Argilla?

Argilla is a tool for building high-quality datasets, with a focus on NLP and LLMs. Our community uses Argilla to create amazing open-source [datasets](https://huggingface.co/datasets?other=argilla) and [models](https://huggingface.co/models?other=distilabel) on Hugging Face. We also lead by example:

- Our [UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) and the [Notus](https://huggingface.co/argilla/notus-7b-v1) and [Notux](https://huggingface.co/argilla/notux-8x7b-v1) models, where we improved benchmark results and empirical human judgment for the Mistral and Mixtral models with cleaner data.
- Our [Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B), where we managed to improve model performance by filtering out 50% of the original dataset.

Additionally, AI experts and domain experts from companies like [the Red Cross](https://510.global/), [Loris.ai](https://loris.ai/) and [Prolific](https://www.prolific.com/) use Argilla to improve the quality and efficiency of their AI projects. They shared their experiences with our community in our [online community meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB).

- AI for good: [the Red Cross presentation](https://youtu.be/ZsCqrAhzkFU?feature=shared) showcases how their team collaborates by classifying and redirecting requests from refugees of the Ukrainian crisis to streamline the support processes of the Red Cross.
- Customer support: [Loris showed](https://youtu.be/jWrtgf2w4VU?feature=shared) how their AI team uses unsupervised and few-shot contrastive learning to quickly validate and gain labelled samples for a large number of multi-label classifiers.
- Research studies: [Prolific](https://youtu.be/ePDlhIxnuAs?feature=shared) is actively distributing data collection projects among its annotating workforce. They do this through an integration with our platform.

## 🚀 Quickstart

<details>
<summary><a href="https://docs.argilla.io/en/latest/getting_started/installation/deployments/docker.html">pip install argilla</a></summary>
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ In this guide, you'll learn to deploy your own Argilla app and use it for data l

## Your first Argilla Space

In this section, you'll learn to deploy an Argilla Space and use it for data annotation and training a sentiment classifier with [SetFit](https://github.com/huggingface/setfit/), an amazing few-shot learning library.
In this section, you'll learn to deploy an Argilla Space and use it for human feedback collection.

### Deploy Argilla on Spaces

@@ -57,84 +57,66 @@ Once Argilla is running, you can use the UI with the Direct URL. This URL gives

If everything goes well, you are ready to use the Argilla Python client from an IDE such as Colab, Jupyter, or VS Code.

If you want a quick step-by-step example, keep reading. If you want an end-to-end tutorial, go to this [tutorial and use Colab or Jupyter](https://docs.argilla.io/en/latest/tutorials/notebooks/training-textclassification-setfit-fewshot.html).
If you want a quick step-by-step example, keep reading. If you prefer an end-to-end tutorial, go to this [tutorial and use Colab or Jupyter](/getting_started/quickstart_workflow_feedback.ipynb).

First, we need to pip install `datasets` and `argilla` on Colab or your local machine:
First, we need to pip install `argilla` on Colab or your local machine:

```bash
pip install datasets argilla
pip install argilla -U
```

Then, you can read the example dataset using the `datasets` library. This dataset is a CSV file uploaded to the Hub using the drag-and-drop feature.

```python
from datasets import load_dataset

dataset = load_dataset("dvilasuero/banking_app", split="train").shuffle()
```

You can create your first dataset by logging it into Argilla using your endpoint URL:
Then, you can connect to Argilla using your endpoint URL.

```python
import os

import argilla as rg

# if you connect to your public app endpoint (uses default API key)
# If you connect to your public app endpoint (uses default API key)
rg.init(api_url="[your_space_url]", api_key="admin.apikey")

# if you connect to your private app endpoint (uses default API key)
# If you connect to your private app endpoint (uses default API key)
rg.init(api_url="[your_space_url]", api_key="admin.apikey", extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"})

# transform dataset into Argilla's format and log it
rg.log(rg.read_datasets(dataset, task="TextClassification"), name="bankingapp_sentiment")
```
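
The private-Space call above passes a Hugging Face token as a bearer header. A minimal sketch of assembling the connection arguments in one place, assuming the token is stored in an `HF_TOKEN` environment variable as in the snippet above. The helper name `build_init_kwargs` is hypothetical and not part of the Argilla client:

```python
import os


def build_init_kwargs(api_url, api_key="admin.apikey", env=None):
    """Return keyword arguments for rg.init (hypothetical helper).

    If HF_TOKEN is set, add the bearer header used for a private Space;
    otherwise return only the arguments used for a public Space.
    """
    env = os.environ if env is None else env
    kwargs = {"api_url": api_url, "api_key": api_key}
    token = env.get("HF_TOKEN")
    if token:
        kwargs["extra_headers"] = {"Authorization": f"Bearer {token}"}
    return kwargs


# Usage (requires the argilla package):
#   import argilla as rg
#   rg.init(**build_init_kwargs("[your_space_url]"))
```

Keeping the header logic in one function means the same script can target a public or a private Space without edits, depending only on whether the token is defined.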

Congrats! You now have a dataset available from the Argilla UI to start browsing and labeling. In the code above, we've used one of the many integrations with Hugging Face libraries, which let you read hundreds of datasets available on the Hub.

### Data labeling and model training
Now let's create a dataset with two labels ("sadness" and "joy"). Don't forget to replace `<your-workspace>` with the name of the workspace where the dataset will be created.

At this point, you can label your data directly using your Argilla Space and read the training data to train your model of choice.
> To check your workspaces, go to "My settings" on the UI. If you need to create a new one, consult the [docs](/getting_started/installation/configurations/workspace_management.md).
> Here, we are using a task template, see the docs to [create a fully custom dataset](/practical_guides/create_update_dataset/create_dataset.md).

```python
# this will read our current dataset and turn it into a clean dataset for training
dataset = rg.load("bankingapp_sentiment").prepare_for_training()
dataset = rg.FeedbackDataset.for_text_classification(
labels=["sadness", "joy"],
multi_label=False,
use_markdown=True,
guidelines=None,
metadata_properties=None,
vectors_settings=None,
)
dataset.push_to_argilla(name="my-first-dataset", workspace="<your-workspace>")
```

You can also get the full dataset and push it to the Hub for reproducibility and versioning:

```python
# save full argilla dataset for reproducibility
rg.load("bankingapp_sentiment").to_datasets().push_to_hub("bankingapp_sentiment")
```
Now, we will add the records. Create a list with the records you want to add and ensure that you match the fields with the ones specified in the previous step.

Finally, this is how you can train a SetFit model using data from your Argilla Space:
> You can also use `pandas` or `load_dataset` to [read an existing dataset and create records from it](/practical_guides/create_update_dataset/records.md#add-records).

```python
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

# Create train test split
dataset = dataset.train_test_split()

# Load SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
model=model,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
loss_class=CosineSimilarityLoss,
batch_size=8,
num_iterations=20,
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
records = [
rg.FeedbackRecord(
fields={
"text": "I am so happy today",
},
),
rg.FeedbackRecord(
fields={
"text": "I feel sad today",
},
)
]
dataset.add_records(records)
```
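
If the texts already live in a CSV file, the `fields` dictionaries can be generated rather than typed by hand. A minimal sketch using only the standard library, assuming a column named `text` that matches the dataset field. The function name `csv_to_fields` and the sample data are hypothetical:

```python
import csv
import io


def csv_to_fields(csv_text, text_column="text"):
    """Turn CSV rows into `fields` dictionaries, skipping empty texts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = []
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if text:
            fields.append({"text": text})
    return fields


sample = "text\nI am so happy today\n   \nI feel sad today\n"
fields = csv_to_fields(sample)

# Each dictionary can then be wrapped as rg.FeedbackRecord(fields=...)
# and passed to dataset.add_records, as in the block above.
```

Filtering out blank rows before creating records avoids pushing empty items that annotators would have to discard.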

As a next step, you can check the [Argilla Tutorials](https://docs.argilla.io/en/latest/tutorials/tutorials.html) section. All the tutorials can be run using Colab or local Jupyter Notebooks, so you can start building datasets with Argilla and Spaces!
Congrats! You now have a dataset available in Argilla to start browsing and labeling.

As a next step, you can check the [Argilla Tutorials](/tutorials_and_integrations/tutorials/tutorials.md) section. All the tutorials can be run using Colab or local Jupyter Notebooks, so you can start building datasets with Argilla and Spaces!

## Feedback and support

@@ -190,15 +172,18 @@ Additionally, the `LOAD_DATASETS` will let you configure the sample datasets tha
2. `full`: Load all the sample datasets for NLP tasks (TokenClassification, TextClassification, Text2Text)
3. `none`: No datasets being loaded.

## Setting up HF Authentication
## Setting up sign in with Hugging Face

From version `1.23.0` you can enable Hugging Face authentication for your Argilla Space. This feature allows you to give access to your Argilla Space to users that are logged in to the Hugging Face Hub.

```{note}
This feature is especially useful for public crowdsourcing projects. If you would like more control over who can log in to the Space, you can set this up on a private Space so that only members of your organization can sign in. Alternatively, you may want to [create users](/getting_started/installation/configurations/user_management.md#create-a-user) and use their credentials instead.
```
```{warning}
To work with stable datasets and keep all contributions, we highly recommend using the persistent storage layer offered by Hugging Face. For more information, see the ["Setting up persistent storage"](#setting-up-persistent-storage) section.
```

To enable this feature, you will first need to [create an OAuth App in Hugging Face](https://huggingface.co/docs/hub/oauth#creating-an-oauth-app). To do that, go to your user settings in Hugging Face and select *Connected Apps* > *Create App*. Once inside, choose a name for your app and complete the form with the following information:
To set up the sign-in page, you first need to [create an OAuth App in Hugging Face](https://huggingface.co/docs/hub/oauth#creating-an-oauth-app). To do that, go to your user settings in Hugging Face and select *Connected Apps* > *Create App*. Once inside, choose a name for your app and complete the form with the following information:

* **Homepage URL:** [Your Argilla Space Direct URL](/getting_started/installation/deployments/huggingface-spaces.md#your-argilla-space-url).
* **Logo URL:** `[Your Argilla Space Direct URL]/favicon.ico`
@@ -210,7 +195,7 @@ This will create a Client ID and an App Secret that you will need to add as vari
1. **Name:** `OAUTH2_HUGGINGFACE_CLIENT_ID` - **Value:** [Your Client ID]
2. **Name:** `OAUTH2_HUGGINGFACE_CLIENT_SECRET` - **Value:** [Your App Secret]
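
Before rebuilding the Space, it can help to confirm that both variables are actually defined. A minimal sketch that checks a mapping such as `os.environ` for the two names listed above. The function name `missing_oauth_variables` is hypothetical, and this check is not part of Argilla itself:

```python
import os

# The two variable names come from the list above.
REQUIRED_OAUTH_VARIABLES = (
    "OAUTH2_HUGGINGFACE_CLIENT_ID",
    "OAUTH2_HUGGINGFACE_CLIENT_SECRET",
)


def missing_oauth_variables(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_OAUTH_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_oauth_variables()
    if missing:
        print("Missing OAuth variables:", ", ".join(missing))
    else:
        print("Both OAuth variables are set.")
```

An empty value is treated the same as a missing one, since an empty Client ID or App Secret would not be usable for sign-in.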

Alternatively, you can provide the environment variables in the `.oauth.yaml` file like so:
Finally, you need to edit the `.oauth.yaml` file located in the Files page of your Space (see below for what this file looks like). Once you have merged the change, go back to the *Settings* to do a *Factory rebuild*. Once the Space has restarted, you and your collaborators can sign in to your Space using their Hugging Face accounts.

```yaml
# This attribute will enable or disable the Hugging Face authentication
@@ -238,11 +223,3 @@ providers:
allowed_workspaces:
- name: admin
```

```{warning}
Be aware that the `.oauth.yaml` file is public in the case of public Spaces, and may be accessible to other members of your organization if it is a private Space.

Therefore, we recommend setting these variables as environment secrets.
```

Now check that the `enabled` parameter is set to `true` in your `.oauth.yaml` file and go back to the *Settings* to do a *Factory rebuild*. Once the Space has restarted, you and your collaborators can sign in to your Space using their Hugging Face accounts.
Original file line number Diff line number Diff line change
@@ -30,6 +30,11 @@ Add text descriptives to your metadata to simplify the data annotation and filte

Add semantic representations to your records using vector embeddings to simplify the data annotation and search process.
```
```{grid-item-card} Haystack: Monitoring LLMs for Agents
:link: use_argilla_callback_in_haystack-v1.html

Learn how to use Argilla to monitor LLMs with Haystack Agents.
```
````

```{toctree}
@@ -40,4 +45,5 @@ process_documents_with_unstructured
monitor_endpoints with_fastapi
add_text_descriptives_as_metadata
add_sentence_transformers_embeddings_as_vectors
use_argilla_callback_in_haystack-v1
```