# Working with data

In this chapter you will learn how to manipulate intent classification data with AutoIntent.

In [1]:
import importlib.resources as ires

import datasets

from autointent import Dataset


In [2]:
datasets.logging.disable_progress_bar()  # disable tqdm outputs

## Creating a dataset

The first thing to consider is your data. Collect a set of labeled utterances and save it as a JSON file with the following schema:

```json
{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        "..."
    ],
    "test": [
        {
            "utterance": "Hi!",
            "label": 0
        },
        "..."
    ]
}
```

Note:
- For a multilabel dataset, the `label` field should be a list of integers representing the corresponding class labels.
- The test split is optional. If it is omitted, a portion of the training split is allocated for testing by default.
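
As a minimal sketch of the schema above, the following builds a small dataset in plain Python and writes it to disk with the standard `json` module. The "Goodbye!" and "Hello and goodbye!" samples, the file name `my_intents.json`, and the temporary output directory are made up for illustration; only the `train`/`test` keys and the `utterance`/`label` fields come from the schema.

```python
import json
import tempfile
from pathlib import Path

# Single-label samples: `label` is one integer.
data = {
    "train": [
        {"utterance": "Hello!", "label": 0},
        {"utterance": "Goodbye!", "label": 1},  # illustrative sample
    ],
    # The "test" key is optional and may be left out entirely.
    "test": [
        {"utterance": "Hi!", "label": 0},
    ],
}

# Multilabel variant: `label` is a list of integers, as the note describes.
multilabel_sample = {"utterance": "Hello and goodbye!", "label": [0, 1]}

# Hypothetical output location; replace it with your own path.
out_path = Path(tempfile.mkdtemp()) / "my_intents.json"
out_path.write_text(json.dumps(data, indent=4, ensure_ascii=False), encoding="utf-8")

# Round-trip check: the file parses back to the same structure.
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```

Passing `ensure_ascii=False` keeps non-English utterances readable in the file instead of escaping them.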

## Loading a dataset

After you have converted your labeled data into JSON, you can load it into AutoIntent as a [Dataset](../autoapi/autointent/Dataset.html#autointent.Dataset). To demonstrate this, we will load a sample dataset that ships with the AutoIntent library.

In [3]:
path_to_dataset = ires.files("tests.assets.data").joinpath("clinc_subset.json")
dataset = Dataset.from_json(path_to_dataset)

Note: to load your own data, change the `path_to_dataset` variable so that it points to your JSON file.
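
Before pointing `path_to_dataset` at your own file, it can help to check the parsed JSON against the schema described earlier. The helper below is a hypothetical, standard-library-only sketch and is not part of AutoIntent. It assumes that `train` is required, and it checks only what the schema states, so anything beyond that (for example, how OOS samples are represented) is outside its scope and may be reported as a problem.

```python
import json


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_valid_label(label):
    # Either one integer (single-label) or a list of integers (multilabel).
    return _is_int(label) or (
        isinstance(label, list) and all(_is_int(item) for item in label)
    )


def find_schema_problems(data):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper for illustration, not an AutoIntent function.
    """
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    problems = []
    if "train" not in data:  # assumption: the train split is required
        problems.append("missing 'train' split")
    for split in ("train", "test"):
        for i, sample in enumerate(data.get(split, [])):
            where = f"{split}[{i}]"
            if not isinstance(sample, dict):
                problems.append(f"{where}: sample must be an object")
                continue
            if not isinstance(sample.get("utterance"), str):
                problems.append(f"{where}: 'utterance' must be a string")
            if not _is_valid_label(sample.get("label")):
                problems.append(f"{where}: 'label' must be an int or a list of ints")
    return problems


good = json.loads('{"train": [{"utterance": "Hello!", "label": 0}]}')
bad = json.loads('{"test": [{"utterance": "Hi!", "label": "greeting"}]}')

print(find_schema_problems(good))
print(find_schema_problems(bad))
```

For a file on disk, parse it first with `json.loads(Path(path_to_dataset).read_text(encoding="utf-8"))` and pass the result to the helper.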

### Accessing dataset splits

The [Dataset](../autoapi/autointent/Dataset.html#autointent.Dataset) class organizes your data as a dictionary that maps split names to [datasets.Dataset](https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset) objects.
Besides the splits from your JSON file, an `oos` key may be added after initialization if out-of-scope (OOS) samples are provided.

In [4]:
dataset["train"]

Dataset({
    features: ['utterance', 'label'],
    num_rows: 37
})

### Working with dataset splits

Each split in the [Dataset](../autoapi/autointent/Dataset.html#autointent.Dataset) class is an instance of [datasets.Dataset](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset),
so you can work with it as you would with any other Hugging Face dataset.

In [5]:
dataset["train"][:5]  # get first 5 train samples

{'utterance': ['can i make a reservation for redrobin',
  'is it possible to make a reservation at redrobin',
  'does redrobin take reservations',
  'are reservations taken at redrobin',
  'does redrobin do reservations'],
 'label': [0, 0, 0, 0, 0]}
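
Slicing a split returns a dictionary of columns rather than a list of samples. If per-sample records are more convenient, the column-oriented batch can be regrouped with plain Python. This is a small sketch that uses the output above as literal input; the names `batch`, `rows`, and `label_counts` are illustrative.

```python
from collections import Counter

# Column-oriented batch, copied from the output above.
batch = {
    "utterance": [
        "can i make a reservation for redrobin",
        "is it possible to make a reservation at redrobin",
        "does redrobin take reservations",
        "are reservations taken at redrobin",
        "does redrobin do reservations",
    ],
    "label": [0, 0, 0, 0, 0],
}

# Regroup the dict of columns into one dict per sample.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

# Count how many samples carry each label in this batch.
label_counts = Counter(batch["label"])

print(rows[0])
print(label_counts)
```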

### Saving a dataset

To share your dataset on the Hugging Face Hub, use the [push_to_hub](../autoapi/autointent/Dataset.html#autointent.Dataset.push_to_hub) method.

In [6]:
# dataset.push_to_hub("<repo_id>")

Note: make sure you are logged in first, for example by running `huggingface-cli login`.

## See Also

- Next chapter of the user guide "Using modules": [02_modules](../user_guides/user_guides.python_api.02_modules.py)