# Label Studio Corpus Onboarding

This notebook demonstrates how to auto-onboard tasks or projects from Label Studio into the aiXplain platform as a corpus.

## Label Studio Data

Label Studio hosts multiple projects, and the projects you can access depend on the Label Studio authentication API key you use. Each project contains multiple tasks, which may hold different data types. The onboarding method is intended to handle all audio and text data for either a single Label Studio task or an entire project. All you need is the Label Studio project or task ID and the corresponding Label Studio authentication API key.

### Label Studio authentication API key

To get your Label Studio API key, log in to Label Studio and open your [account page](https://label.aixplain.com/user/account). The API key is listed under the Access Token field.

### Task or project ID

To get the task or project ID of the data you want to onboard, navigate to the desired project from the [Label Studio homepage](https://label.aixplain.com/). The project ID appears in the URL at this position:

`https://label.aixplain.com/projects/PROJECT_ID/...`

If you open any task in the project, the task ID appears in the URL as the `task` parameter:

`https://label.aixplain.com/projects/PROJECT_ID/...&task=TASK_ID`
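If you prefer not to read the IDs off the address bar by hand, they can be pulled out of the URL programmatically. The sketch below is a minimal illustration using only the Python standard library; the helper name `extract_ids` and the example URL (including the `data?tab=7` part and the task ID `42`) are made up for illustration and are not part of Label Studio or the aiXplain SDK.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def extract_ids(url: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (project_id, task_id) found in a Label Studio URL.

    Hypothetical helper: either value is None when it is not present.
    """
    parsed = urlparse(url)

    # The project ID is the path segment that follows "projects".
    segments = [s for s in parsed.path.split("/") if s]
    project_id = None
    if "projects" in segments:
        idx = segments.index("projects")
        if idx + 1 < len(segments) and segments[idx + 1].isdigit():
            project_id = int(segments[idx + 1])

    # The task ID is the value of the "task" query parameter, if any.
    task_values = parse_qs(parsed.query).get("task")
    task_id = None
    if task_values and task_values[0].isdigit():
        task_id = int(task_values[0])

    return project_id, task_id


# Illustrative URL only; the middle part of a real URL may look different.
print(extract_ids("https://label.aixplain.com/projects/1218/data?tab=7&task=42"))
# → (1218, 42)
```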

## aiXplain TEAM_API_KEY

To onboard a corpus using the SDK, the data normally needs to be described in a CSV file, which is then passed to the SDK. You also need an aiXplain `TEAM_API_KEY`. For details, refer to the [Team API Key Guide](https://github.com/aixplain/aiXplain/blob/main/docs/user/api_setup.md).

Once you have the API key, add it as an environment variable on your system.

#### Linux or macOS
```bash
export TEAM_API_KEY=YOUR_API_KEY
```
#### Windows
```bat
set TEAM_API_KEY=YOUR_API_KEY
```
#### Jupyter Notebook
```
%env TEAM_API_KEY=YOUR_API_KEY
```
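Whichever method you use, it can help to confirm that the variable is actually visible to Python before going further. The following is a small sketch using only the standard library; `require_env` is a hypothetical helper, not an aiXplain SDK function, and it treats the `YOUR_API_KEY` placeholder from the commands above as "not set".

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is unusable.

    Hypothetical helper for this notebook, not part of the aiXplain SDK.
    """
    value = os.environ.get(name, "").strip()
    # An empty value or the unedited placeholder both count as missing.
    if not value or value == "YOUR_API_KEY":
        raise RuntimeError(f"{name} is not set; define it before importing the SDK.")
    return value


try:
    require_env("TEAM_API_KEY")
    print("TEAM_API_KEY is set.")
except RuntimeError as err:
    print(err)
```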

Since we're using a Jupyter notebook in this case, copy and paste your aiXplain `TEAM_API_KEY` into the cell below.

In [None]:
%env TEAM_API_KEY=YOUR_API_KEY

## Import

Let's now import the modules needed to onboard the corpus.

In [2]:
import aixplain.processes.data_onboarding.labelstudio_onboard as labelstudio_onboard 
from aixplain.factories import CorpusFactory

## Label Studio and Corpus Info

Now, we need to specify the parameters of the `labelstudio_onboard.auto_onboard()` function. Here's the list of parameters for the function:

- `labelstudio_key` (str): The authentication API key from Label Studio.
- `data_name` (str): The data name of the corpus to be onboarded on the platform.
- `task_id` (Optional[int]): Label Studio task ID to be retrieved. Default is `None`.
- `project_id` (Optional[int]): Label Studio project ID to be retrieved. Default is `None`. Exactly one of `task_id` and `project_id` must be specified: not both, and not neither.
- `columns_to_drop` (Optional[List[str]]): List of column names to drop from the dataset before onboarding. Default is `None`.
- `onboard` (Optional[bool]): Whether the user wants to perform the onboarding process. Default is `False`.
- `data_description` (Optional[str]): Data description of the corpus to be onboarded. Default is `None`.
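The rule that exactly one of `task_id` and `project_id` must be given can be checked up front, before any request is made. The sketch below only illustrates that rule; `check_id_arguments` is a hypothetical helper, and it is not a description of how `auto_onboard()` validates its arguments internally.

```python
from typing import Optional


def check_id_arguments(task_id: Optional[int] = None,
                       project_id: Optional[int] = None) -> str:
    """Return "task" or "project" depending on which ID was supplied.

    Hypothetical helper: raises ValueError when both or neither are given.
    """
    # The two "is None" checks are equal exactly when both IDs are
    # missing or both are present, which are the two invalid cases.
    if (task_id is None) == (project_id is None):
        raise ValueError("Specify exactly one of task_id or project_id.")
    return "task" if task_id is not None else "project"


print(check_id_arguments(project_id=1218))
# → project
```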

In [3]:
# Label Studio API Key
labelstudio_key = "YOUR_LABELSTUDIO_KEY"

# Corpus Name to be Onboarded
data_name = 'test-4'

# Label Studio Project ID
project_id = 1218

# Proceed to Onboard
onboard = True

# Corpus Description
data_description = "This is a sample corpus for testing."
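
The cell above does not use `columns_to_drop` and sets `onboard` to `True`. As a hedged illustration of the remaining parameters, the sketch below collects a variant of the same arguments in a dictionary with `onboard=False` and two columns dropped. The column names `start` and `end`, and the name `test-4-dry-run`, are assumptions chosen for illustration; what `auto_onboard()` returns when `onboard` is `False` is not documented here, so the call itself is left commented out.

```python
# Parameter names documented for labelstudio_onboard.auto_onboard().
DOCUMENTED_PARAMETERS = {
    "labelstudio_key", "data_name", "task_id", "project_id",
    "columns_to_drop", "onboard", "data_description",
}

# Illustrative variant of the arguments (values are assumptions).
dry_run_args = {
    "labelstudio_key": "YOUR_LABELSTUDIO_KEY",
    "data_name": "test-4-dry-run",        # hypothetical corpus name
    "project_id": 1218,
    "columns_to_drop": ["start", "end"],  # assumed column names
    "onboard": False,                     # do not perform the onboarding step
    "data_description": "This is a sample corpus for testing.",
}

# Guard against typos in the argument names before calling the SDK.
unknown = set(dry_run_args) - DOCUMENTED_PARAMETERS
assert not unknown, f"Unknown parameters: {unknown}"

# With the SDK imported, the call would look like this:
# payload = labelstudio_onboard.auto_onboard(**dry_run_args)
```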

## Corpus Onboarding

The last step is to onboard the Label Studio data to the aiXplain platform. This is done by calling the `labelstudio_onboard.auto_onboard()` function with the predefined parameters.

In [4]:
payload = labelstudio_onboard.auto_onboard(
    labelstudio_key = labelstudio_key,
    data_name = data_name,
    project_id = project_id,
    onboard = onboard,
    data_description = data_description
)

print(payload)

Extracting data...
Continuing without dropping any columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2363 entries, 0 to 2362
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   segment_transcription  2363 non-null   object 
 1   Status                 2363 non-null   object 
 2   start                  2363 non-null   float64
 3   end                    2363 non-null   float64
 4   audio                  2363 non-null   object 
dtypes: float64(2), object(3)
memory usage: 92.4+ KB
Data extracted successfully, and saved to a CSV file.
Here's the data info of your corpus:
 None
Proceeding to onboard the corpus to aiXplain platform...
Creating MetaData for column: segment_transcription.
Creating MetaData for column: Status.
Creating MetaData for column: audio.


 Corpus onboarding progress:   0%|                                                               | 0/3 [00:00<?, ?it/s]
 Data "segment_transcription" onboarding progress:   0%|                                         | 0/1 [00:00<?, ?it/s]
 File onboarding progress:   0%|                                                              | 0/2363 [00:00<?, ?it/s]
 File onboarding progress:  42%|█████████████████████▏                            | 1000/2363 [00:02<00:03, 358.77it/s]
 File onboarding progress:  85%|██████████████████████████████████████████▎       | 2000/2363 [00:06<00:01, 327.86it/s]
 Data "segment_transcription" onboarding progress: 100%|█████████████████████████████████| 1/1 [00:06<00:00,  6.09s/it]
 Corpus onboarding progress:  33%|██████████████████▎                                    | 1/3 [00:07<00:15,  7.89s/it]
 Da

After it finishes onboarding, you can check the corpus using the `CorpusFactory.get()` method.

In [5]:
corpus = CorpusFactory.get(payload["asset_id"])
corpus.to_dict()

{'id': '64db8f058d8b03f65b5bcc5c',
 'name': 'test-4',
 'description': 'This is a sample corpus for testing.',
 'supplier': 'aiXplain',
 'version': '1.0',
 'license': <License.MIT: '620ba3a83e2fa95c500b429d'>,
 'privacy': <Privacy.PRIVATE: 'Private'>,
 'onboard_status': <OnboardStatus.ONBOARDED: 'onboarded'>,
 'functions': [],
 'tags': [],
 'data': [<aixplain.modules.data.Data at 0x25c20b75190>,
  <aixplain.modules.data.Data at 0x25c20b75ca0>,
  <aixplain.modules.data.Data at 0x25c20b757c0>],
 'length': 2362,
 'kwargs': {}}
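
The dictionary above mixes plain values with enum members and `Data` objects, which makes it awkward to log or compare. The sketch below shows one way to flatten such a dictionary into plain values. It assumes the SDK's `License`, `Privacy` and `OnboardStatus` values are standard `enum.Enum` members, as their printed form suggests; `summarize_corpus_dict` and the stand-in `OnboardStatus` class are hypothetical and defined here only so the example runs without the SDK.

```python
from enum import Enum
from typing import Any, Dict


def summarize_corpus_dict(info: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a corpus dictionary with only plain values.

    Hypothetical helper: enum members become their values and the
    "data" list is replaced by a "data_count" entry.
    """
    summary: Dict[str, Any] = {}
    for key, value in info.items():
        if isinstance(value, Enum):
            summary[key] = value.value
        elif key == "data":
            summary["data_count"] = len(value)
        else:
            summary[key] = value
    return summary


class OnboardStatus(Enum):
    """Stand-in for the SDK enum, used only in this example."""
    ONBOARDED = "onboarded"


example = {
    "name": "test-4",
    "onboard_status": OnboardStatus.ONBOARDED,
    "data": [object(), object(), object()],  # placeholders for Data objects
    "length": 2362,
}
print(summarize_corpus_dict(example))
# → {'name': 'test-4', 'onboard_status': 'onboarded', 'data_count': 3, 'length': 2362}
```

With the SDK available, the same helper could be applied to `corpus.to_dict()` directly, under the assumption stated above.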