# Label Studio Corpus Onboarding

This notebook demonstrates how to onboard tasks or projects from Label Studio onto the aiXplain platform as a corpus.

## Label Studio Data

Label Studio organizes data into projects, and the projects you can access depend on the Label Studio authentication API key you use. Each project contains multiple tasks, which may hold different data types. This method handles audio and text data for both tasks and projects on Label Studio. All you need is a Label Studio task ID or project ID and the corresponding Label Studio authentication API key.

### Label Studio Authentication API key

To get your Label Studio API key, log in to Label Studio and open your [account page](https://label.aixplain.com/user/account). Your API key is shown in the Access Token field.

### Task or project ID

To find the task ID or project ID of the data you want to onboard, navigate to the desired project from the [Label Studio homepage](https://label.aixplain.com/). The project ID appears in the URL in the format:

`https://label.aixplain.com/projects/PROJECT_ID/...`

When you open any task in the project, the task ID appears in the URL in the format:

`https://label.aixplain.com/projects/PROJECT_ID/...&task=TASK_ID`
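The two URL patterns above can also be read programmatically. The following is a minimal sketch, assuming the project ID is the path segment right after `projects` and the task ID is the `task` query parameter. `extract_ids` is a hypothetical helper, not part of the aiXplain SDK, and the example URL (including the `/data` segment and `tab=1`) is made up for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def extract_ids(url: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (project_id, task_id) parsed from a Label Studio URL (hypothetical helper)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]

    # Project ID: the numeric path segment that follows "projects", if any
    project_id = None
    if "projects" in segments:
        idx = segments.index("projects")
        if idx + 1 < len(segments) and segments[idx + 1].isdigit():
            project_id = int(segments[idx + 1])

    # Task ID: the numeric value of the "task" query parameter, if any
    task_values = parse_qs(parsed.query).get("task", [])
    task_id = int(task_values[0]) if task_values and task_values[0].isdigit() else None
    return project_id, task_id


# Illustrative URL only; copy the real one from your browser's address bar
print(extract_ids("https://label.aixplain.com/projects/1218/data?tab=1&task=42"))
```

Either value comes back as `None` when the URL does not contain it, so the same helper works for project pages and task pages.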

## API Keys

Add your aiXplain `TEAM_API_KEY` as an environment variable in your system. For details, refer to the [Team API Key Guide](https://github.com/aixplain/aiXplain/blob/main/docs/user/api_setup.md).

#### Linux or macOS
```bash
export TEAM_API_KEY=YOUR_API_KEY
```
#### Windows
```bat
set TEAM_API_KEY=YOUR_API_KEY
```
#### Jupyter Notebook
```
%env TEAM_API_KEY=YOUR_API_KEY
```

Similarly, add your Label Studio API key as the `LABELSTUDIO_KEY` environment variable. Since we're using a Jupyter notebook here, you can paste both your aiXplain `TEAM_API_KEY` and your Label Studio `LABELSTUDIO_KEY` into the cell below.

In [None]:
%env TEAM_API_KEY=YOUR_API_KEY
%env LABELSTUDIO_KEY=YOUR_LABELSTUDIO_KEY
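
Before moving on, it can help to confirm that both variables are actually visible to Python. This is a minimal sketch using only the standard library; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, not part of the aiXplain SDK.

```python
import os
from typing import List, Mapping

# The two variable names used in this notebook
REQUIRED_KEYS = ("TEAM_API_KEY", "LABELSTUDIO_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Note that this only checks for presence; a placeholder such as `YOUR_API_KEY` would still pass.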

## Libraries

Import the necessary libraries to onboard the corpus.

In [None]:
from aixplain.factories import LabelStudioFactory, CorpusFactory

## LabelStudioFactory

### LabelStudioFactory.create()

The main method in the class is `LabelStudioFactory.create()`, which creates a Label Studio asset based on the `LabelStudioData` module. The resulting object has the following attributes:

- `id` (Text): Label Studio task/project ID.
- `name` (Text): CSV file name.
- `description` (Text): Description of the Label Studio object.
- `dtypes` (Dict): Data types of each field in the Label Studio object.
- `labelstudio_tasks` (Optional[List[Text]]): List of task IDs if the object is a project. Default is None.

`LabelStudioFactory.create()` accepts the following parameters when creating the Label Studio object:

- `task_id` (Optional[int]): Label Studio task ID to be retrieved. Default is None.
- `project_id` (Optional[int]): Label Studio project ID to be retrieved. You must specify exactly one of `task_id` or `project_id`. Default is None.
- `columns_to_drop` (Optional[List[str]]): List of column names to drop from the dataset. Default is None.
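
The rule for `task_id` and `project_id` can be checked before calling the factory. The sketch below mirrors that rule as described in the parameter list; `check_ids` is a hypothetical helper, and whether the SDK performs the same check internally is an assumption.

```python
from typing import Optional


def check_ids(task_id: Optional[int] = None, project_id: Optional[int] = None) -> str:
    """Return which kind of ID was supplied, or raise if none or both were given."""
    # Both None or both set: the comparison is True in either case
    if (task_id is None) == (project_id is None):
        raise ValueError("Specify exactly one of task_id or project_id.")
    return "task" if task_id is not None else "project"


print(check_ids(project_id=1218))
```

Running a check like this first gives a clear error message without making a request to Label Studio.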


In [None]:
labelstudio_object = LabelStudioFactory.create(
    project_id = 1218
)

## Onboard as a Corpus

In the previous step, we created a `LabelStudioData` instance that holds the attributes of the retrieved project or task data from Label Studio, along with the name of the CSV file used to store the data. We can automatically onboard this asset as a corpus to the aiXplain platform using the `LabelStudioData.onboard_as_corpus()` method.

First, we need to specify the parameters of the `LabelStudioData.onboard_as_corpus()` method:

- `corpus_name` (str): The name of the corpus to be onboarded on the platform.
- `language` (Optional[List[str]]): List of language codes to add to metadata. Default is `None`.
- `corpus_description` (Optional[str]): Description of the corpus to be onboarded. Default is `None`.

For the language selection, the supported languages are:

In [None]:
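from aixplain.enums import Language

# Map each base language code (entries without a dialect) to its name
langs = {}
for lang in Language:
    if lang.value['dialect'] == '':
        langs[lang.value['language']] = lang.name

langs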

{'af': 'Afrikaans',
 'sq': 'Albanian',
 'hy': 'Armenian',
 'asm': 'Assamese',
 'bn': 'Bangla',
 'eu': 'Basque',
 'bg': 'Bulgarian',
 'ca': 'Catalan',
 'ceb': 'Cebuano',
 'co': 'Corsican',
 'cs': 'Czech',
 'dv': 'Divehi',
 'eo': 'Esperanto',
 'et': 'Estonian',
 'fj': 'Fijian',
 'fi': 'Finnish',
 'de': 'German',
 'el': 'Greek',
 'ht': 'Haitian',
 'ha': 'Hausa',
 'haw': 'Hawaiian',
 'isl': 'Icelandic',
 'ig': 'Igbo',
 'id': 'Indonesian',
 'iu': 'Inuktitut',
 'ga': 'Irish',
 'ja': 'Japanese',
 'kk': 'Kazakh',
 'rw': 'Kinyarwanda',
 'tlh': 'Klingon',
 'ko': 'Korean',
 'ku': 'Kurdish',
 'ky': 'Kyrgyz',
 'lb': 'Luxembourgish',
 'mk': 'Macedonian',
 'mt': 'Maltese',
 'mi': 'Maori',
 'mn': 'Mongolian',
 'am': 'Amharic',
 'or': 'Odia',
 'ro': 'Romanian',
 'ar': 'Arabic',
 'az': 'Azerbaijani',
 'ba': 'Bashkir',
 'bs': 'Bosnian',
 'be': 'Belarusian',
 'my': 'Burmese',
 'da': 'Danish',
 'ny': 'Nyanja',
 'ps': 'Pashto',
 'pl': 'Polish',
 'pa': 'Punjabi',
 'oto': 'Queretaro',
 'ru': 'Russian',
 'sr':

Now, we proceed to specify the parameters.

In [None]:
# Corpus Name to be Onboarded
corpus_name = 'test-6'

# Language
language = ['ar']

# Corpus Description
corpus_description = "This is a sample corpus for testing."
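
Since `language` takes a list of codes, a quick check against the supported codes can catch typos before onboarding. This is a minimal sketch: `unknown_codes` is a hypothetical helper, and `sample_langs` holds only a few entries copied from the listing above. In the notebook you would pass the full `langs` dictionary instead.

```python
from typing import Dict, Iterable, List

# A few entries copied from the listing above; the real `langs` dict is larger
sample_langs: Dict[str, str] = {
    "ar": "Arabic",
    "de": "German",
    "ja": "Japanese",
    "ko": "Korean",
}


def unknown_codes(codes: Iterable[str], supported: Dict[str, str]) -> List[str]:
    """Return the codes that do not appear as keys in `supported`."""
    return [code for code in codes if code not in supported]


# "xx" is a deliberately invalid code used for illustration
print(unknown_codes(["ar", "xx"], sample_langs))
```

An empty result means every requested code was found among the supported keys.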

The last step is to onboard the Label Studio data onto the aiXplain platform. This is done by calling the `LabelStudioData.onboard_as_corpus()` method with the predefined parameters.

In [None]:
payload = labelstudio_object.onboard_as_corpus(
    corpus_name = corpus_name,
    language = language,
    corpus_description = corpus_description
)

print(payload)

After it finishes onboarding, you can check the corpus using the `CorpusFactory.get()` method.

In [None]:
corpus = CorpusFactory.get(payload["asset_id"])
corpus.to_dict()