docs: use FeedbackDataset in HF example (#4805)

# Description


![screencapture-127-0-0-1-8000-getting-started-installation-deployments-huggingface-spaces-html-2024-05-08-14_03_04](https://github.com/argilla-io/argilla/assets/127759186/e118e0b2-5689-45e6-bc08-a744fd32dde3)

Closes #4740

**Type of change**

(Remember to title the PR according to the type of change)

- [ ] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [ ] `sphinx-autobuild` (read [Developer
Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation)
for more details)

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)
sdiazlor committed May 13, 2024
1 parent 535526d commit a422e27
Showing 1 changed file with 43 additions and 59 deletions.
Once Argilla is running, you can use the UI with the Direct URL. This URL gives

### Create your first dataset

If everything goes well, you are ready to use the Argilla Python client from an IDE such as Colab, Jupyter, or VS Code.

If you want a quick step-by-step example, keep reading. If you want an end-to-end tutorial, go to this [tutorial and use Colab or Jupyter](https://docs.argilla.io/en/latest/tutorials/notebooks/training-textclassification-setfit-fewshot.html).

To create your first dataset, you need to pip install `argilla` on Colab or your local machine:

```bash
pip install argilla
```

Then, connect to your Argilla HF Space. Get the `api_url` as mentioned above and copy the `api_key` from "My settings" in the UI:

```python
import argilla as rg

import os  # needed to read the HF_TOKEN environment variable for private Spaces

# If you connect to your public HF Space
rg.init(
    api_url="[your_space_url]",
    api_key="admin.apikey",  # this is the default API key; don't change it if you didn't set one up during the Space creation
)

# If you connect to your private HF Space
rg.init(
    api_url="[your_space_url]",
    api_key="admin.apikey",  # this is the default API key; don't change it if you didn't set one up during the Space creation
    extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
)
```
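The `api_url` is the Space Direct URL mentioned earlier. As a rough sketch (the pattern below is an assumption, so always copy the actual URL from your Space rather than constructing it), the Direct URL is typically derived from the owner and Space name:

```python
# Assumption: HF Spaces usually expose a Direct URL of the form
# https://<owner>-<space-name>.hf.space. The owner and Space name below
# are hypothetical; verify the real URL from your Space on the Hub.
owner, space_name = "my-username", "my-argilla-space"
api_url = f"https://{owner}-{space_name}.hf.space"
print(api_url)
```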

Now, create a dataset for text classification. We'll use a task template; check the [docs](../../../practical_guides/create_update_dataset/create_dataset.md) to create a custom dataset instead. Indicate the workspace where the dataset will be created; you can check the available workspaces in "My settings" (UI).

```python
dataset = rg.FeedbackDataset.for_text_classification(
    labels=["sadness", "joy"],
    multi_label=False,
    use_markdown=True,
    guidelines=None,
    metadata_properties=None,
    vectors_settings=None,
)

# Create the dataset to be visualized in the UI (uses default workspace)
dataset.push_to_argilla(name="my-first-dataset", workspace="admin")
```
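For reference, the task template above roughly maps to one text field plus one label question. The dict below is only an illustrative sketch of that shape (the field and question names are assumptions), not the actual Argilla object; inspect `dataset.fields` and `dataset.questions` on the real object to see the exact schema:

```python
# Illustrative only: approximate shape produced by the text-classification template.
# Field and question names here are assumptions, not the library's guaranteed schema.
template_sketch = {
    "fields": [{"name": "text", "use_markdown": True}],
    "questions": [
        {"name": "label", "type": "label_selection", "labels": ["sadness", "joy"]}
    ],
}
print(template_sketch["questions"][0]["labels"])
```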

To add records, create a list with the records you want to add, matching the fields with the ones specified before. You can also use pandas or `load_dataset` to read an existing dataset and create records from it.

```python
records = [
    rg.FeedbackRecord(
        fields={
            "text": "I am so happy today",
        },
    ),
    rg.FeedbackRecord(
        fields={
            "text": "I feel sad today",
        },
    ),
]
dataset.add_records(records)
```
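If your texts already live in a list (or a pandas column), the per-record `fields` payloads can be built programmatically before wrapping them in `rg.FeedbackRecord`. A minimal stdlib sketch, using hypothetical example texts:

```python
texts = ["I am so happy today", "I feel sad today"]  # hypothetical examples

# Build one fields payload per text; the keys must match the dataset's
# field names (a single "text" field for the template used above)
payloads = [{"text": t} for t in texts]

# With argilla installed, wrap and add:
# records = [rg.FeedbackRecord(fields=p) for p in payloads]
# dataset.add_records(records)
print(payloads)
```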

Congrats! You now have a dataset available from the Argilla UI to start browsing and labeling. Once annotated, you can also easily push it back to the Hub.

```python
dataset = rg.FeedbackDataset.from_argilla("my-first-dataset", workspace="admin")
dataset.push_to_huggingface("my-repo/my-first-dataset")
```

As a next step, you can check the [Argilla Tutorials](https://docs.argilla.io/en/latest/tutorials/tutorials.html) section. All the tutorials can be run using Colab or local Jupyter Notebooks, so you can start building datasets with Argilla and Spaces!