# Label Studio Corpus Onboarding

This notebook demonstrates how to automatically onboard tasks or projects from Label Studio into the aiXplain platform as a corpus.

## Label Studio Data

The Label Studio platform contains different projects, and the ones you can access depend on the Label Studio authentication API key you use. Each project contains multiple tasks, which may hold different data types. This method is designed to handle audio and text data for either a single Label Studio task or a whole project. All you need is the Label Studio project or task ID and the corresponding Label Studio authentication API key.

### Label Studio authentication API key

To get your Label Studio API key, log in to Label Studio and open your [account page](https://label.aixplain.com/user/account). The API key is listed under the Access Token field.

### Task or project ID

To get the task or project ID of the data you want to onboard, navigate to the desired project from the [Label Studio homepage](https://label.aixplain.com/). The project ID appears in the URL, in the format:

`https://label.aixplain.com/projects/PROJECT_ID/...`

If you open any task in the project, the task ID also appears in the URL, in the format:

`https://label.aixplain.com/projects/PROJECT_ID/...&task=TASK_ID`
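
If you prefer not to read the IDs off the address bar by hand, the two URL formats above can be parsed programmatically. The following is a minimal sketch using only the Python standard library. The helper name `parse_label_studio_url` and the exact example URL (including its `data?tab=1` part) are assumptions made for illustration, and the IDs are only sample values.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def parse_label_studio_url(url: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (project_id, task_id) parsed from a Label Studio URL.

    Hypothetical helper, not part of the aiXplain SDK.
    """
    parsed = urlparse(url)
    # The project ID is the numeric path segment right after "/projects/".
    match = re.search(r"/projects/(\d+)", parsed.path)
    project_id = int(match.group(1)) if match else None
    # The task ID, when a task is open, is the "task" query parameter.
    task_values = parse_qs(parsed.query).get("task")
    task_id = int(task_values[0]) if task_values else None
    return project_id, task_id


# Assumed URL shape with illustrative IDs.
example_url = "https://label.aixplain.com/projects/1218/data?tab=1&task=339594"
print(parse_label_studio_url(example_url))
```

Either value comes back as `None` when the corresponding part is missing from the URL, for example when no task is open.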

## aiXplain TEAM_API_KEY

Ideally, to onboard a corpus using the SDK, the data needs to be stored in a CSV file, which is then passed to the SDK. You also need an aiXplain `TEAM_API_KEY`. For details, refer to the [Team API Key Guide](https://github.com/aixplain/aiXplain/blob/main/docs/user/api_setup.md).

Once you have the API key, add it as an environment variable on your system.

#### Linux or macOS
```bash
export TEAM_API_KEY=YOUR_API_KEY
```
#### Windows
```bat
set TEAM_API_KEY=YOUR_API_KEY
```
#### Jupyter Notebook
```
%env TEAM_API_KEY=YOUR_API_KEY
```

Similarly, specify the Label Studio key as `LABELSTUDIO_KEY` in the same way. Since we're using a Jupyter Notebook here, paste your aiXplain `TEAM_API_KEY` and your Label Studio `LABELSTUDIO_KEY` into the cell below.

In [None]:
%env TEAM_API_KEY=YOUR_API_KEY
%env LABELSTUDIO_KEY=YOUR_LABELSTUDIO_KEY
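
Before moving on, it can help to confirm that both variables are actually set and no longer hold the placeholder values. The check below is a small sketch using only the standard library. The helper name `missing_keys` and the `YOUR_` placeholder test are assumptions made for this example, not part of the aiXplain SDK.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("TEAM_API_KEY", "LABELSTUDIO_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset, empty, or still placeholders."""
    return [
        name
        for name in REQUIRED_KEYS
        if not env.get(name) or env[name].startswith("YOUR_")
    ]


missing = missing_keys()
if missing:
    print("Please set: " + ", ".join(missing))
else:
    print("Both keys are set.")
```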

## Import

Let's now import the modules needed to onboard the corpus.

In [12]:
from aixplain.factories import LabelStudioFactory, CorpusFactory
from aixplain.enums import Language
import json

## LabelStudioFactory Methods 

### LabelStudioFactory.get_task_data(task_id)

This `LabelStudioFactory` method retrieves a task's data from Label Studio as bytes. Other class methods then process these bytes to extract the data. An example of using `LabelStudioFactory.get_task_data()` is as follows:

In [None]:
task_data = LabelStudioFactory.get_task_data(339594)
print(task_data)

### LabelStudioFactory.extract_data(dict)

As we can see, the bytes retrieved from `LabelStudioFactory.get_task_data()` are not easily interpretable and need further processing. `LabelStudioFactory.extract_data()` does exactly that: it processes the retrieved data to extract the meaningful fields. First, however, we have to deserialize `task_data` using `json.loads()`. An example is as follows:

In [None]:
extracted_data = LabelStudioFactory.extract_data(json.loads(task_data))
print(extracted_data)
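
To make the deserialization step concrete, the sketch below shows `json.loads()` turning a bytes payload into a Python dictionary. The payload is entirely made up for illustration; the real structure returned by Label Studio will differ.

```python
import json

# Hypothetical payload; the real task data from Label Studio has a different structure.
raw = b'{"id": 1, "data": {"text": "hello"}}'

# json.loads() accepts bytes as well as str and returns plain Python objects.
task = json.loads(raw)

print(type(task).__name__)
print(task["data"]["text"])
```

After this step the data is an ordinary dictionary, which is the form passed to `LabelStudioFactory.extract_data()` in the cell above.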

### LabelStudioFactory.get_all_tasks_per_project(project_id)

As shown above, `LabelStudioFactory.extract_data()` successfully extracts data from a single task. Next, we want to extract data from a project that contains multiple tasks. Two methods were developed for that. The first is `LabelStudioFactory.get_all_tasks_per_project()`, which retrieves all the task IDs present in a project and returns them as a list. An example is as follows:

In [None]:
print(LabelStudioFactory.get_all_tasks_per_project(1218))

### LabelStudioFactory.extract_project_data(project_id)

After retrieving all the task IDs from a project, we need to extract the data from each of these tasks and store the results in a list. `LabelStudioFactory.extract_project_data()` utilizes the three methods above to retrieve data from all tasks within a project. An example is as follows:

In [None]:
project_data = LabelStudioFactory.extract_project_data(1218)
print(project_data)
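
Conceptually, the project-level extraction chains the three earlier steps: list the task IDs, fetch each task's bytes, then deserialize and extract. The sketch below illustrates that flow with stand-in callables and made-up data so that it runs without network access. It is not the SDK's implementation, and `collect_project_data` and its parameters are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


def collect_project_data(
    project_id: int,
    list_task_ids: Callable[[int], List[int]],
    fetch_task: Callable[[int], bytes],
    extract: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Fetch, deserialize, and extract every task in a project (illustrative only)."""
    records = []
    for task_id in list_task_ids(project_id):
        raw = fetch_task(task_id)  # bytes, as in get_task_data()
        records.append(extract(json.loads(raw)))  # deserialize, then extract
    return records


# Stand-ins with made-up data, replacing the real Label Studio calls.
fake_tasks = {1: b'{"id": 1, "text": "a"}', 2: b'{"id": 2, "text": "b"}'}
result = collect_project_data(
    project_id=0,
    list_task_ids=lambda _pid: sorted(fake_tasks),
    fetch_task=lambda tid: fake_tasks[tid],
    extract=lambda task: {"task_id": task["id"], "text": task["text"]},
)
print(result)
```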

### LabelStudioFactory.save_to_csv()

To onboard a corpus to aiXplain, there has to be a CSV file that contains the data. The `LabelStudioFactory.save_to_csv()` method processes audio and text data from Label Studio into a `pd.DataFrame()` and stores it in a CSV file for use during onboarding. It returns the file name, the dtypes of the columns, and the processed DataFrame. The parameters of the function are as follows:

- `task_id` (Optional[int]): LabelStudio task ID to be retrieved. Default is None.
- `project_id` (Optional[int]): LabelStudio project ID to be retrieved. Default is None.
- `columns_to_drop` (Optional[List[str]]): List of column names to drop from the dataset. Default is None.

In [None]:
filename, dtypes, df = LabelStudioFactory.save_to_csv(project_id = 1218)
print('The file name is: {}\nThe dtypes is: {}'.format(filename, dtypes))
display(df)

### LabelStudioFactory.create()

The last method in the class is `LabelStudioFactory.create()`, which onboards the Label Studio task or project data as a corpus into the aiXplain platform. It uses all the class methods together to perform the onboarding process. The rest of the notebook demonstrates how to use it.

## Label Studio and Corpus Info

First, we need to specify the parameters of the `LabelStudioFactory.create()` function. Here's the list of parameters:

- `corpus_name` (str): The name of the corpus to be onboarded on the platform.
- `task_id` (Optional[int]): LabelStudio task ID to be retrieved. Default is `None`.
- `project_id` (Optional[int]): LabelStudio project ID to be retrieved. You must specify exactly one of `task_id` or `project_id`: not both, and not neither. Default is `None`.
- `columns_to_drop` (Optional[List[str]]): List of column names to drop from the dataset before onboarding. Default is `None`.
- `language` (Optional[List[str]]): List of language codes to add to the metadata. Default is `None`.
- `corpus_description` (Optional[str]): Description of the corpus to be onboarded. Default is `None`.
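
The rule that exactly one of `task_id` or `project_id` must be given can be checked up front, before any call is made. The following is a minimal sketch of such a check. The helper name `check_ids` is hypothetical, and this is not necessarily how the SDK validates its arguments.

```python
from typing import Optional


def check_ids(task_id: Optional[int] = None, project_id: Optional[int] = None) -> str:
    """Return which ID is in use, or raise if neither or both are given."""
    # The two "is None" tests are equal only when both IDs are set or both are missing.
    if (task_id is None) == (project_id is None):
        raise ValueError("Specify exactly one of task_id or project_id.")
    return "task" if task_id is not None else "project"


print(check_ids(project_id=1218))
```

Comparing against `None` explicitly, rather than relying on truthiness, avoids treating an ID of `0` as missing.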

For the language selection, the supported languages are:

In [13]:
langs = {}
for l in list(Language):
    if l.value['dialect'] == '':
        langs[l.value['language']] = l.name
        
langs

{'af': 'Afrikaans',
 'sq': 'Albanian',
 'hy': 'Armenian',
 'asm': 'Assamese',
 'bn': 'Bangla',
 'eu': 'Basque',
 'bg': 'Bulgarian',
 'ca': 'Catalan',
 'ceb': 'Cebuano',
 'co': 'Corsican',
 'cs': 'Czech',
 'dv': 'Divehi',
 'eo': 'Esperanto',
 'et': 'Estonian',
 'fj': 'Fijian',
 'fi': 'Finnish',
 'de': 'German',
 'el': 'Greek',
 'ht': 'Haitian',
 'ha': 'Hausa',
 'haw': 'Hawaiian',
 'isl': 'Icelandic',
 'ig': 'Igbo',
 'id': 'Indonesian',
 'iu': 'Inuktitut',
 'ga': 'Irish',
 'ja': 'Japanese',
 'kk': 'Kazakh',
 'rw': 'Kinyarwanda',
 'tlh': 'Klingon',
 'ko': 'Korean',
 'ku': 'Kurdish',
 'ky': 'Kyrgyz',
 'lb': 'Luxembourgish',
 'mk': 'Macedonian',
 'mt': 'Maltese',
 'mi': 'Maori',
 'mn': 'Mongolian',
 'am': 'Amharic',
 'or': 'Odia',
 'ro': 'Romanian',
 'ar': 'Arabic',
 'az': 'Azerbaijani',
 'ba': 'Bashkir',
 'bs': 'Bosnian',
 'be': 'Belarusian',
 'my': 'Burmese',
 'da': 'Danish',
 'ny': 'Nyanja',
 'ps': 'Pashto',
 'pl': 'Polish',
 'pa': 'Punjabi',
 'oto': 'Queretaro',
 'ru': 'Russian',
 'sr':

Now, we proceed to specify the parameters.

In [17]:
# Corpus Name to be Onboarded
corpus_name = 'test-6'

# Label Studio Project ID
project_id = 1218

# Language
language = ['ar']

# Corpus Description
corpus_description = "This is a sample corpus for testing."

## Corpus Onboarding

The last step is to onboard the Label Studio data to the aiXplain platform. This is done by calling the `LabelStudioFactory.create()` function with the parameters defined above.

In [None]:
payload = LabelStudioFactory.create(
    corpus_name = corpus_name,
    project_id = project_id,
    language = language,
    corpus_description = corpus_description
)

print(payload)

After it finishes onboarding, you can check the corpus using the `CorpusFactory.get()` method.

In [None]:
corpus = CorpusFactory.get(payload["asset_id"])
corpus.to_dict()