# 🚤 Speed up data labelling with Sentence Transformer embeddings

In this tutorial, you'll learn to use Sentence Transformer embeddings and similarity search to make data labelling significantly faster. It will walk you through the following steps:


- 💾 use Sentence Transformers to generate embeddings of banking customer requests
- 🙃 upload the dataset into Argilla for data labelling
- 🏷 use the similarity search feature to efficiently find and label batches of semantically related examples

<img src="../../_static/tutorials/labelling-textclassification-sentence-transformers-semantic/4.png" alt="Similarity search" style="width: 1100px;">

## Introduction

In this tutorial, we'll use the power of embeddings to make data labelling (and curation) more efficient. The idea of exploiting embeddings for labelling is not new, and there are several cool, standalone libraries to label data using embeddings.

Since `1.2.0`, Argilla gives you a way to leverage embedding-based similarity together with all other workflows already provided: search-based bulk labelling, programmatic labelling using search queries, model pre-annotation, and human-in-the-loop workflows. This also means you can combine keyword search and filters with this new similarity search feature. All of this comes without any vendor or model lock-in: you can use any embedding or encoding method, including but not limited to `Sentence Transformers`, `OpenAI`, or `Co:here`.

If you want more detail, check the [Semantic similarity deep-dive](../../guides/features/semantic-search.ipynb); this tutorial covers the basics to get you started.

Let's do it!

## Setup

First, you need to install and run Argilla, and make sure you're running the right version of [Elasticsearch or Opensearch](../../guides/features/semantic-search.ipynb). Then, you'll need a few third-party libraries that can be installed via `pip`:

In [None]:
%pip install datasets==2.8.0 sentence-transformers==2.2.2  -qqq  

## 💾 Downloading and embedding your dataset

The code below will load the banking customer requests dataset from the Hub, encode the `text` field, and create the `vectors` field which will contain only one key (`mini-lm-sentence-transformers`). For the purposes of labelling the dataset from scratch, it will also remove the `label` field, which contains original intent labels.

In [None]:
from sentence_transformers import SentenceTransformer

from datasets import load_dataset

# Load a small, fast Sentence Transformers model (runs on CPU)
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

# Load the test split of the banking77 customer requests dataset
dataset = load_dataset("banking77", split="test")

# Encode text field using batched computation
dataset = dataset.map(lambda batch: {"vectors": encoder.encode(batch["text"])}, batch_size=32, batched=True)


# Remove the original label column, because you'll be labelling the dataset yourself
dataset = dataset.remove_columns("label")

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"mini-lm-sentence-transformers": r["vectors"]}}
)

Our dataset now contains a `vectors` field with the embedding vector generated by the sentence transformer model.

In [10]:
dataset.to_pandas().head()

| | text | vectors |
|---|---|---|
| 0 | How do I locate my card? | {'mini-lm-sentence-transformers': [-0.01016708... |
| 1 | I still have not received my new card, I order... | {'mini-lm-sentence-transformers': [-0.04284123... |
| 2 | I ordered a card but it has not arrived. Help ... | {'mini-lm-sentence-transformers': [-0.03365558... |
| 3 | Is there a way to know when my card will arrive? | {'mini-lm-sentence-transformers': [0.012195908... |
| 4 | My card has not arrived yet. | {'mini-lm-sentence-transformers': [-0.04361863... |
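
Before uploading, it can be worth checking that every record carries the expected vector layout: a `vectors` dictionary with a single named vector of the right size. The sketch below runs such a check on plain Python dicts. `check_vectors` is a hypothetical helper written for this illustration (it is not part of Argilla or `datasets`), and the expected size of 384 is the embedding dimension of `all-MiniLM-L6-v2`.

```python
from typing import List, Mapping

VECTOR_NAME = "mini-lm-sentence-transformers"  # key used in this tutorial
EXPECTED_DIM = 384  # embedding size of all-MiniLM-L6-v2


def check_vectors(record: Mapping, name: str = VECTOR_NAME, dim: int = EXPECTED_DIM) -> List[str]:
    """Return a list of problems found in one record (an empty list means OK)."""
    vectors = record.get("vectors")
    if not isinstance(vectors, dict):
        return ["'vectors' is missing or not a dict"]
    problems = []
    if list(vectors) != [name]:
        problems.append(f"expected exactly one key {name!r}, got {list(vectors)}")
    vector = vectors.get(name)
    if vector is not None and len(vector) != dim:
        problems.append(f"expected {dim} dimensions, got {len(vector)}")
    return problems


# Toy records standing in for rows of `dataset` (values are placeholders)
good = {"text": "How do I locate my card?", "vectors": {VECTOR_NAME: [0.0] * 384}}
bad = {"text": "My card has not arrived yet.", "vectors": {VECTOR_NAME: [0.0] * 10}}

print(check_vectors(good))  # []
print(check_vectors(bad))   # ['expected 384 dimensions, got 10']
```

On the real dataset, an expression such as `all(not check_vectors(r) for r in dataset)` should apply the same check to every row, assuming rows are yielded as dictionaries.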


## 🙃 Upload dataset into Argilla

The original `banking77` dataset is an intent classification dataset with 77 fine-grained labels (`lost_card`, `card_arrival`, etc.). To keep this tutorial simple, we define a simplified labelling scheme with higher-level classes: `["change_details", "card", "atm", "top_up", "balance", "transfer", "exchange_rate", "pin"]`.

Let's define the dataset settings, configure the dataset, and upload our dataset with vectors.

In [None]:
import argilla as rg

rg_ds = rg.DatasetForTextClassification.from_datasets(dataset)

# Setting for the label scheme
settings = rg.TextClassificationSettings(label_schema=["change_details", "card", "atm", "top_up", "balance", "transfer", "exchange_rate", "pin"])

rg.configure_dataset(name="banking77-topics", settings=settings)

rg.log(
    name="banking77-topics",
    records=rg_ds,
    chunk_size=50,
)

## 🏷 Bulk labelling with the `find similar` action

Now that our `banking77-topics` dataset is available in the Argilla UI, we can start annotating our data using semantic similarity search. The workflow is as follows:

1. Label a record (e.g., "Change my information" with the label `change_details`) and then click on Find similar on the top-right of your record.
2. As a result, you'll get a list of the most similar records, sorted by similarity in descending order.
3. You can now review the records and assign either the `change_details` label or any other. For our use case, we see that most of the suggested records fall into the same category.


Let's see it step-by-step:

### Label a record
Using the hand-labelling mode, you can label a record like the one below:

![labelling-textclassification-sentence-transformers-semantic](../../_static/tutorials/labelling-textclassification-sentence-transformers-semantic/6.png)

Now, if you want to find records that are semantically similar to this one, or even duplicates of it, you can use the Find similar button.

### Find similar

As a result, you'll get a list of the 50 most similar records.

<div class="alert alert-info">

Note
    
Remember that you can combine this similarity search with the other search features: keywords, the query string DSL, and filters. For example, if you have filters enabled, the find similar action will return the most similar records from the filtered subset.
    
</div>

![labelling-textclassification-sentence-transformers-semantic](../../_static/tutorials/labelling-textclassification-sentence-transformers-semantic/4.png)

As you can see, the model effectively captures similar meaning even when records share no explicit words: e.g., `details` vs `information`.
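
Conceptually, a Find similar action is a nearest-neighbour ranking over the stored vectors, optionally restricted to the records that match the active filters. Argilla delegates the actual search to Elasticsearch or OpenSearch, so the sketch below is only an illustration of the idea with NumPy, not Argilla's implementation. `find_similar` and its `mask` parameter are hypothetical names, and the use of cosine similarity is an assumption.

```python
import numpy as np


def find_similar(query, vectors, k=50, mask=None):
    """Return indices of the k vectors most similar to `query`, best first.

    `mask` is an optional boolean sequence; when given, only rows where it is
    True are candidates (mimicking an active filter). Assumes non-zero vectors.
    """
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms
    sims = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    if mask is not None:
        # Filtered-out rows get -inf so they can be dropped after sorting
        sims = np.where(np.asarray(mask, dtype=bool), sims, -np.inf)
    order = np.argsort(-sims)  # descending similarity
    order = order[np.isfinite(sims[order])]  # drop filtered-out rows
    return order[:k].tolist()


# Toy 2-d "embeddings"; the real ones in this tutorial have 384 dimensions
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

print(find_similar([1.0, 0.0], toy, k=2))  # [0, 1]
print(find_similar([1.0, 0.0], toy, k=2, mask=[False, True, True, True]))  # [1, 2]
```

The second call shows the effect described in the note: with a filter active, the ranking is computed only over the records that pass the filter.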

### Review records
At this point, you can label the records one by one, or scroll down to review them before using the bulk-labelling button at the top of the records list.


![labelling-textclassification-sentence-transformers-semantic](../../_static/tutorials/labelling-textclassification-sentence-transformers-semantic/3.png)

### Bulk label

For this tutorial, our labels are sufficiently well-separated for the embeddings to group records that fall under the same topic. So in this case, it is safe to use the bulk labelling feature directly, effectively labelling 50 semantically similar examples after a quick review.

<div class="alert alert-warning">

Warning

For other use cases, you might need to be more careful and combine this feature with search queries and filters. For quick experimentation, you can also accept that you'll make some labelling errors and then use tools like `cleanlab` to detect them.
</div>

![labelling-textclassification-sentence-transformers-semantic](../../_static/tutorials/labelling-textclassification-sentence-transformers-semantic/2.png)

![labelling-textclassification-sentence-transformers-semantic](../../_static/tutorials/labelling-textclassification-sentence-transformers-semantic/1.png)


## Summary

In this tutorial, you learned to use similarity search for data labelling with Argilla by using Sentence Transformers to embed your raw data.

## Next steps

If you want to continue learning Argilla:

🙋‍♀️ Join the [Argilla Slack community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g)!

⭐ Star the Argilla [GitHub repo](https://github.com/argilla-io/argilla) to stay updated.

📚 [Argilla documentation](https://docs.argilla.io) for more guides and tutorials.