# Quickstart
Getting started with Argilla is easy! Let's see some examples for different NLP tasks.

## Setup


```bash
pip install "argilla[server]" datasets
```

If you don’t have Elasticsearch (ES) running, make sure you have Docker installed and run:

```bash
docker run -d --name elasticsearch-for-argilla -p 9200:9200 -p 9300:9300 -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
```

<div class="alert alert-info">

Note

Check the [setup and installation section](setup-and-installation) for further options and configurations regarding Elasticsearch.

</div>

Then simply run:
```bash
python -m argilla
```
<div class="alert alert-info">

Note

The most common error message after this step is related to the Elasticsearch instance not running. Make sure your Elasticsearch instance is running on http://localhost:9200/. If you already have an Elasticsearch instance or cluster, you can point the server to its URL using environment variables.

</div>
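
If the server fails to start, it can help to confirm that Elasticsearch actually answers on the expected URL. The following is a minimal sketch using only the Python standard library; the helper name `is_reachable` is hypothetical and not part of Argilla, and the URL is the default mentioned in the note above.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status (e.g. a secured cluster)
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, ...
        return False


if __name__ == "__main__":
    es_url = "http://localhost:9200/"
    status = "reachable" if is_reachable(es_url) else "NOT reachable"
    print(f"Elasticsearch at {es_url} is {status}")
```

If the check reports that Elasticsearch is not reachable, verify that the Docker container is running before starting the Argilla server again.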

With the server running, you can log your first datasets from Python:

```python
import pandas as pd
import argilla as rg
from datasets import load_dataset

# load dataset from the hub
dataset = load_dataset("argilla/gutenberg_spacy-ner", split="train")

# read in dataset, assuming it's a dataset for token classification
dataset = rg.read_datasets(dataset, task="TokenClassification")

# log the dataset to the Argilla web app
rg.log(dataset, "gutenberg_spacy-ner")

# load dataset from json
my_dataframe = pd.read_json(
    "https://raw.githubusercontent.com/argilla-io/datasets/main/sst-sentimentclassification.json")

# convert pandas dataframe to DatasetForTextClassification
dataset = rg.DatasetForTextClassification.from_pandas(my_dataframe)

# log the dataset to the Argilla web app
rg.log(dataset, name="sst-sentimentclassification")
```


🎉 You can now access the Argilla UI by pointing your browser at [http://localhost:6900/](http://localhost:6900/). **The default username and password are** `argilla` **and** `1234`.

## Upload data

The main component of the Argilla data model is called a record. A dataset in Argilla is a collection of these records. 
Records can be of different types depending on the currently supported tasks:

 1. `TextClassificationRecord`
 2. `TokenClassificationRecord`
 3. `Text2TextRecord`
 
The most critical attributes of a record that are common to all types are:

 - `text`: The input text of the record (Required);
 - `annotation`: Annotate your record in a task-specific manner (Optional);
 - `prediction`: Add task-specific model predictions to the record (Optional);
 - `metadata`: Add some arbitrary metadata to the record (Optional);
 
In Argilla, records are created programmatically using the [client library](../reference/python/python_client.rst) within a Python script, a [Jupyter notebook](https://jupyter.org/), or another IDE.


Let's see how to create and upload a basic record to the Argilla web app (make sure Argilla is already installed on your machine as described in the [setup guide](installation/installation.md)):

```python
import argilla as rg

# Create a basic text classification record
record = rg.TextClassificationRecord(text="Hello world, this is me!")

# Upload (log) the record to the Argilla web app
rg.log(record, "my_first_record")
```

Now you can access the *"my_first_record"* dataset in the Argilla web app and look at your first record. 

However, most of the time, you will have your data in some file format, like TXT, CSV, or JSON. 
Argilla relies on two well-known Python libraries to read these files: [pandas](https://pandas.pydata.org/) and [datasets](https://huggingface.co/docs/datasets/index). 
After reading the files with one of those libraries, Argilla provides shortcuts to create your records automatically.

Let us look at a few examples for each of the record types.
**As mentioned earlier, you choose the record type depending on the task you want to tackle.**

### 1. Text classification

In this example, we will read a [CSV file](https://www.kaggle.com/datasets/databar/10k-snapchat-reviews) from Kaggle that contains reviews for the Snapchat app.
The underlying task here could be to classify the reviews by their sentiment.

Let us read the file with [pandas](https://pandas.pydata.org/):

<div class="alert alert-info">

Note
    
If the file is too big to fit in memory, try using the [datasets library](https://huggingface.co/docs/datasets/index), which does not need to load the whole file into memory, as shown in the next section.
    
</div>

```python
import pandas as pd

# Read the CSV file into a pandas DataFrame
dataframe = pd.read_csv("Snapchat_app_store_reviews.csv")
```

and have a quick look at the first three rows of the resulting [pandas DataFrame](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html):

```python
dataframe.head(3)
```

| Unnamed: 0.1 | Unnamed: 0 | userName | rating | review | isEdited | date | title |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Savvanananahhh | 4 | For the most part I quite enjoy Snapchat it’s ... | False | 10/4/20 6:01 | Performance issues |
| 1 | 1 | Idek 9-101112 | 3 | I’m sorry to say it, but something is definite... | False | 10/14/20 2:13 | What happened? |
| 2 | 2 | William Quintana | 3 | Snapchat update ruined my story organization! ... | False | 7/31/20 19:54 | STORY ORGANIZATION RUINED! |


We will choose the _review_ column as input text for our records.
So that Argilla knows which column holds the input text, we have to rename the corresponding column to _text_.

```python
# Rename the 'review' column to 'text'
dataframe = dataframe.rename(columns={"review": "text"})
```

We can now read this `DataFrame` with Argilla, which will automatically create the records and put them in an [Argilla Dataset](../guides/features/datasets.ipynb).

```python
import argilla as rg

# Read DataFrame into an Argilla Dataset
dataset_rg = rg.read_pandas(dataframe, task="TextClassification")
```

We will upload this dataset to the web app and give it the name *snapchat_reviews*:

```python
# Upload (log) the Dataset to the web app
rg.log(dataset_rg, "snapchat_reviews")
```

![Screenshot of the uploaded snapchat reviews](../_static/reference/webapp/explore-text-classification.png)

### 2. Token classification

We will use German reviews of organic coffees in a [CSV file](https://www.kaggle.com/datasets/mldado/german-online-reviewsratings-of-organic-coffee) for this example. 
The underlying task here could be to extract all attributes of an organic coffee.

This time, let's read the file with [datasets](https://huggingface.co/docs/datasets/index).

```python
from datasets import Dataset

# Read the csv file
dataset = Dataset.from_csv("kaffee_reviews.csv")
```

and have a quick look at the first three rows of the resulting [datasets `Dataset`](https://huggingface.co/docs/datasets/access):

```python
# The best way to visualize a Dataset is actually via pandas
dataset.select(range(3)).to_pandas()
```

| Unnamed: 0.1 | Unnamed: 0 | brand | rating | review |
|---|---|---|---|---|
| 0 | 0 | GEPA Kaffee | 5 | Wenn ich Bohnenkaffee trinke (auf Arbeit trink... |
| 1 | 1 | GEPA Kaffee | 5 | Für mich ist dieser Kaffee ideal. Die Grundvor... |
| 2 | 2 | GEPA Kaffee | 5 | Ich persönlich bin insbesondere von dem Geschm... |


We will choose the _review_ column as input text for our records.
So that Argilla knows which column holds the input text, we have to rename the corresponding column to _text_.

```python
dataset = dataset.rename_column("review", "text")
```

In contrast to the other types, token classification records need the input text **and** the corresponding tokens. 
So let us tokenize our input text in a small helper function and add the tokens to a new column called _tokens_. 

<div class="alert alert-info">

Note

We will use [spaCy](https://spacy.io/) to tokenize the text, but you can use whatever library you prefer.
    
</div>

```python
import spacy

# Load a German spaCy model to tokenize our text
nlp = spacy.load("de_core_news_sm")

# Define our tokenize function
def tokenize(row):
    tokens = [token.text for token in nlp(row["text"])]
    return {"tokens": tokens}

# Map the tokenize function to our dataset
dataset = dataset.map(tokenize)
```
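
As the note above says, any tokenizer can be used here. The following is a minimal sketch of a drop-in alternative that uses only the standard library: a regular expression that splits text into runs of word characters and single punctuation marks. This is an illustrative assumption, not what Argilla or spaCy do internally, and it is far cruder than a linguistic tokenizer.

```python
import re

# One or more word characters, or a single non-space, non-word character (punctuation)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(row):
    """Split row["text"] into word and punctuation tokens."""
    return {"tokens": TOKEN_PATTERN.findall(row["text"])}


print(tokenize({"text": "Für mich ist dieser Kaffee ideal."}))
```

Because the function has the same signature as the spaCy-based helper, it can be passed to `dataset.map` in the same way.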

Let us have a quick look at our extended `Dataset`:

```python
dataset.select(range(3)).to_pandas()
```

| Unnamed: 0.1 | Unnamed: 0 | brand | rating | text | tokens |
|---|---|---|---|---|---|
| 0 | 0 | GEPA Kaffee | 5 | Wenn ich Bohnenkaffee trinke (auf Arbeit trink... | [Wenn, ich, Bohnenkaffee, trinke, (, auf, Arbe... |
| 1 | 1 | GEPA Kaffee | 5 | Für mich ist dieser Kaffee ideal. Die Grundvor... | [Für, mich, ist, dieser, Kaffee, ideal, ., Die... |
| 2 | 2 | GEPA Kaffee | 5 | Ich persönlich bin insbesondere von dem Geschm... | [Ich, persönlich, bin, insbesondere, von, dem,... |


We can now read this `Dataset` with Argilla, which will automatically create the records and put them in an [Argilla Dataset](../guides/features/datasets.ipynb).

```python
import argilla as rg

# Read Dataset into an Argilla Dataset
dataset_rg = rg.read_datasets(dataset, task="TokenClassification")
```

We will upload this dataset to the web app and give it the name `coffee-reviews`:

```python
# Log the dataset to the Argilla web app
rg.log(dataset_rg, "coffee-reviews")
```

![Screenshot of the uploaded coffee reviews](../_static/reference/webapp/features-annotate.png)

### 3. Text2Text

In this example, we will use English sentences from the European Center for Disease Prevention and Control available at the [Hugging Face Hub](https://huggingface.co/datasets/europa_ecdc_tm). 
The underlying task here could be to translate the sentences into other European languages.

Let us load the data with [datasets](https://huggingface.co/docs/datasets/index) from the [Hub](https://huggingface.co/datasets).

```python
from datasets import load_dataset

# Load the Dataset from the Hugging Face Hub and extract the train split
dataset = load_dataset("europa_ecdc_tm", "en2fr", split="train")
```

and have a quick look at the first row of the resulting [datasets `Dataset`](https://huggingface.co/docs/datasets/access):

```python
dataset[0]
```

```
{'translation': {'en': 'Vaccination against hepatitis C is not yet available.',
  'fr': 'Aucune vaccination contre l’hépatite C n’est encore disponible.'}}
```

We can see that the English sentences are nested in a dictionary inside the _translation_ column. 
To extract the phrases into a new _text_ column, we will write a quick helper function and [map](https://huggingface.co/docs/datasets/process#map) the whole `Dataset` with it.

```python
# Define our helper extract function
def extract(row):
    return {"text": row["translation"]["en"]}

# Map the extract function to our dataset
dataset = dataset.map(extract)
```

Let us have a quick look at our extended `Dataset`:

```python
dataset[0]
```

```
{'translation': {'en': 'Vaccination against hepatitis C is not yet available.',
  'fr': 'Aucune vaccination contre l’hépatite C n’est encore disponible.'},
 'text': 'Vaccination against hepatitis C is not yet available.'}
```

We can now read this `Dataset` with Argilla, which will automatically create the records and put them in an [Argilla Dataset](../guides/features/datasets.ipynb).

```python
import argilla as rg

# Read Dataset into an Argilla Dataset
dataset_rg = rg.read_datasets(dataset, task="Text2Text")
```

We will upload this dataset to the web app and give it the name `ecdc_en`:

```python
# Log the dataset to the Argilla web app
rg.log(dataset_rg, "ecdc_en")
```

![Screenshot of the uploaded English phrases.](../_static/reference/webapp/explore-text2text.png)

## Label datasets

Argilla provides several ways to label your data. Using Argilla's UI, you can mix and match the following options:

1. Manually labeling each record using the specialized interface for each task type;
2. Leveraging a user-provided model and validating its predictions;
3. Defining heuristic rules to produce "noisy labels" which can then be combined with weak supervision;

Each way has its pros and cons, and the best match largely depends on your individual use case.


### 1. Manual labeling

![Manual annotations of a sentiment classification task](../_static/reference/webapp/features-metrics.png)

The straightforward approach of manual annotations might be necessary if you do not have a suitable model for your use case or cannot come up with good heuristic rules for your dataset. 
It can also be a good approach if you have a large annotation workforce at your disposal or require few but unbiased and high-quality labels.

Argilla tries to make this relatively cumbersome approach as painless as possible. 
Via an intuitive and adaptive UI, its exhaustive search and filter functionalities, and bulk annotation capabilities, Argilla turns the manual annotation process into an efficient option.  

Look at our dedicated [feature reference](../reference/webapp/features.md) for a detailed and illustrative guide on manually annotating your dataset with Argilla.

### 2. Validating predictions

![Validate predictions for a token classification dataset](../_static/reference/webapp/features-validation.png)

Nowadays, many pre-trained or zero-shot models are available online via model repositories like the Hugging Face Hub. 
Most of the time, you probably will find a model that already suits your specific dataset task to some degree. 
In Argilla, you can pre-annotate your data by including predictions from these models in your records.
Assuming that the model works reasonably well on your dataset, you can filter for records with high prediction scores and validate the predictions.
In this way, you will rapidly annotate part of your data and reduce the overall annotation effort.
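
In the web app, this filtering is done through the UI. To illustrate the underlying idea, here is a minimal sketch in plain Python, where dictionaries serve as hypothetical stand-ins for records and each prediction is a list of `(label, score)` pairs. The helper names and the threshold of 0.9 are assumptions chosen for illustration.

```python
def top_prediction(record):
    """Return the (label, score) pair with the highest score."""
    return max(record["prediction"], key=lambda pair: pair[1])


def confident_records(records, threshold=0.9):
    """Keep only records whose best prediction score reaches the threshold."""
    return [r for r in records if top_prediction(r)[1] >= threshold]


# Hypothetical pre-annotated records
records = [
    {"text": "Great app, works perfectly.", "prediction": [("positive", 0.97), ("negative", 0.03)]},
    {"text": "It is fine, I guess.", "prediction": [("positive", 0.55), ("negative", 0.45)]},
    {"text": "Crashes every time I open it.", "prediction": [("negative", 0.93), ("positive", 0.07)]},
]

# Candidates for quick validation: the model is confident about these
for record in confident_records(records):
    label, score = top_prediction(record)
    print(f"{label} ({score:.2f}): {record['text']}")
```

Records below the threshold are the ones where a careful manual look is most likely to pay off.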

One downside of this approach is that your annotations will be subject to the possible biases and mistakes of the pre-trained model.
When guided by pre-trained models, it is common to see human annotators get influenced by them.
Therefore, it is advisable to avoid pre-annotations when building a rigorous test set for the final model evaluation.

Check the [introduction tutorial](../tutorials//notebooks/labelling-tokenclassification-spacy-pretrained.ipynb) to learn how to add predictions to the records.
Our [feature reference](../reference/webapp/features.md) also includes a detailed guide on validating predictions in the Argilla web app.

### 3. Defining rules (weak labeling)

![Defining a rule for a multi-label text classification task.](../_static/reference/webapp/features-weak-labelling.png)

Another approach to annotating your data is to define heuristic rules tailored to your dataset. 
For example, let us assume you want to classify news articles into the categories of *Finance*, *Sports*, and *Culture*. 
In this case, a reasonable rule would be to label all articles that include the word "stock" as *Finance*. 

Rules can get arbitrarily complex and can also include the record's metadata. 
One downside of this approach is that it might be challenging to come up with working heuristic rules for some datasets.
Furthermore, rules are rarely 100% precise and often conflict with each other. These noisy labels can be cleaned up using weak supervision and label models, or something as simple as majority voting. It is usually a trade-off between the amount of annotated data and the quality of the labels.
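
To make the idea of noisy labels and majority voting concrete, here is a minimal sketch in plain Python. It does not use Argilla's rule API; the keyword rules, the helper names, and the choice to abstain on ties are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword rules: each maps a keyword to a label
RULES = {
    "stock": "Finance",
    "market": "Finance",
    "match": "Sports",
    "exhibition": "Culture",
}


def apply_rules(text):
    """Return the labels voted for by every rule whose keyword occurs in the text."""
    words = text.lower().split()
    return [label for keyword, label in RULES.items() if keyword in words]


def majority_vote(votes):
    """Return the most frequent label, or None if there are no votes or a tie."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # conflicting rules, abstain
    return ranked[0][0]


articles = [
    "The stock market rallied today",
    "The final match moved the stock price of the club",
    "A new exhibition opens downtown",
]
for article in articles:
    print(majority_vote(apply_rules(article)), "<-", article)
```

The second article shows a conflict: one rule votes *Finance* and another votes *Sports*, so the sketch abstains rather than guess. Label models used in weak supervision resolve such conflicts in a more principled way.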

Check [our guide](../guides/techniques/weak_supervision.ipynb) for an extensive introduction to weak supervision with Argilla. 
Also, check the [feature reference](../reference/webapp/features.md) for the Define rules mode of the web app and our [various tutorials](../tutorials/techniques/weak_supervision.md) to see practical examples of weak supervision workflows. 

## How to prepare your data for training

Once you have uploaded and annotated your dataset in Argilla, you are ready to prepare it for training a model. Most NLP models today are trained via [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation. 

### Manual extraction

The exact data format for training a model depends on your [training framework](#how-to-train-a-model) and the task you are tackling (text classification, token classification, etc.). Argilla is framework agnostic; you can always manually extract from the records what you need for the training. 

The extraction happens using the [client library](../reference/python/python_client.rst) within a Python script, a Jupyter notebook, or another IDE. First, we have to load the annotated dataset from the Argilla UI:

```python
import argilla as rg

dataset = rg.load("my_annotated_dataset")
```

<div class="alert alert-info">

Note
    
If you follow a weak supervision approach, the steps are slightly different. 
We refer you to our [weak supervision guide](../guides/techniques/weak_supervision.ipynb) for a complete workflow.
    
</div>

Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with a [sklearn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that takes as input a text and outputs a label. 

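```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Example pipeline (an assumption for illustration; any sklearn estimator works)
sklearn_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Save the inputs and labels in Python lists
inputs, labels = [], []

# Iterate over the records in the dataset
for record in dataset:
    # We only want records with annotations
    if record.annotation:
        inputs.append(record.text)
        labels.append(record.annotation)

# Train the model
sklearn_pipeline.fit(inputs, labels)
```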

### Automatic extraction

For a few frameworks and tasks, Argilla provides a convenient method to automatically extract training examples in a suitable format from a dataset. 

For example, if you want to train a [transformers](https://huggingface.co/docs/transformers/index) model for text classification, you can load an annotated dataset for text classification and call the `prepare_for_training()` method:

```python
dataset = rg.load("my_annotated_dataset")

dataset_for_training = dataset.prepare_for_training()
```

With the returned `dataset_for_training`, you can continue following the steps to [fine-tune a pre-trained model](https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model) with the [transformers library](https://huggingface.co/docs/transformers/index). 

Check the dedicated [dataset guide](../guides/features/datasets.ipynb#prepare-dataset-for-training) for more examples of the `prepare_for_training()` method.

## How to train a model

Argilla helps you to create and curate training data. **It is not a framework for training a model.** You can use Argilla alongside other excellent open-source frameworks that focus on developing and training NLP models.

Here we list three of the most commonly used open-source libraries, but many more are available and may be more suited for your specific use case:

 - [transformers](https://huggingface.co/docs/transformers/index): This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;
 - [spaCy](https://spacy.io/): This library also comes with pre-trained models built into a pipeline tackling multiple tasks simultaneously. Since it is purely an NLP library, it comes with many more NLP features than just model training;
 - [scikit-learn](https://scikit-learn.org/stable/): This de facto standard library is a powerful Swiss Army knife for machine learning with some NLP support. Usually, its NLP models do not match the performance of transformers or spaCy models, but give it a try if you want to train a lightweight model quickly;
 