# Corpus Onboarding

This notebook demonstrates how to onboard a corpus onto the aiXplain platform using its SDK.

## Data

To onboard a corpus using the SDK, the data must be described in a CSV file, which is then passed to the SDK.

Our example corpus consists of 20 English audio segments with their corresponding transcriptions. Since all segments come from the same recording, the column `audio` contains the link to the original audio, whereas the columns `audio_start_time` and `audio_end_time` contain the start and end times, in seconds, of each segment within that audio. If you already have the audio split into segments, you can reference each segment directly in the `audio` column and drop the `audio_start_time` and `audio_end_time` columns. The segment transcriptions are stored in the `text` column of the CSV file, as can be seen below.

In [1]:
import pandas as pd

upload_file = "data.csv"
data = pd.read_csv(upload_file)
data

Unnamed: 0,audio,text,audio_start_time,audio_end_time
0,https://aixplain-platform-assets.s3.amazonaws....,Welcome to another episode of Explain using di...,0.9,6.56
1,https://aixplain-platform-assets.s3.amazonaws....,Discover allows you to use natural language in...,7.53,15.12
2,https://aixplain-platform-assets.s3.amazonaws....,In this demo I'm going to focus on an Arabic t...,15.93,20.29
3,https://aixplain-platform-assets.s3.amazonaws....,We can see here that there are currently 4 pro...,21.6,25.62
4,https://aixplain-platform-assets.s3.amazonaws....,We can click on the information icon to review...,26.75,30.05
5,https://aixplain-platform-assets.s3.amazonaws....,"In this case, we can see that the provider of ...",31.5,37.32
6,https://aixplain-platform-assets.s3.amazonaws....,"I could enable this model from here, but first...",38.56,45.01
7,https://aixplain-platform-assets.s3.amazonaws....,I'm going to select one model from each of the...,46.32,49.69
8,https://aixplain-platform-assets.s3.amazonaws....,You'll notice that our benchmarking function a...,50.6,55.69
9,https://aixplain-platform-assets.s3.amazonaws....,All I need to do is provide an Arabic data sam...,57.06,60.53
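Because each row points to a segment of a longer audio, it can be worth checking that every start time precedes its end time before onboarding. The following is a minimal sketch using only the standard library; the helper `find_invalid_segments` and the sample rows (including the placeholder URL) are illustrative assumptions and not part of the aiXplain SDK.

```python
import csv
import io

# Hypothetical sample mirroring the layout above; the URL is a placeholder.
SAMPLE_CSV = """,audio,text,audio_start_time,audio_end_time
0,https://example.com/audio.mp3,First segment,0.9,6.56
1,https://example.com/audio.mp3,Second segment,7.53,15.12
"""


def find_invalid_segments(csv_file):
    """Return the row numbers whose start time is negative or not before the end time."""
    invalid = []
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        start = float(row["audio_start_time"])
        end = float(row["audio_end_time"])
        if start < 0 or start >= end:
            invalid.append(row_number)
    return invalid


print(find_invalid_segments(io.StringIO(SAMPLE_CSV)))
```

To run the same check on the real file, one could pass `open("data.csv", newline="")` instead of the in-memory sample.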


## Import

Let's now import the necessary classes to onboard the corpus.

In [2]:
from aixplain.enums import DataType, Language, License, StorageType
from aixplain.factories import CorpusFactory
from aixplain.modules import MetaData

## Metadata

Besides the CSV file, the SDK needs a schema describing the data to be onboarded, including:

1. Data Name
2. Data Type: Audio, Text, Image, Video, Label, etc.
3. Storage Type: whether the data is stored directly in the CSV (Text), in a local file (File), or at a public link (URL)
4. Start Column (optional): the column that holds the beginning of the segment in the original file
5. End Column (optional): the column that holds the end of the segment in the original file
6. Languages (optional): the languages present in the data

Let's instantiate the metadata for the audios:

In [3]:
audio_meta = MetaData(
    name="audio", 
    dtype="audio", 
    storage_type="url", 
    start_column="audio_start_time", 
    end_column="audio_end_time", 
    languages=[Language.English_UNITED_STATES]
)

Now for the text...

(Note that enumerations can be used instead of strings to specify these values.)

In [4]:
text_meta = MetaData(
    name="text", 
    dtype=DataType.TEXT, 
    storage_type=StorageType.TEXT, 
    languages=[Language.English_UNITED_STATES]
)

Let's add the metadata into a schema list...

In [5]:
schema = [audio_meta, text_meta]
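Since every schema entry refers to CSV columns by name, a quick consistency check before onboarding can catch typos early. The sketch below is an assumption-laden illustration: it uses plain dictionaries as stand-ins for the `MetaData` objects, and the helper `missing_columns` is hypothetical rather than an SDK function.

```python
import csv
import io


def missing_columns(csv_file, schema):
    """Return the column names referenced by `schema` that the CSV header lacks."""
    header = set(next(csv.reader(csv_file)))
    required = []
    for entry in schema:
        required.append(entry["name"])
        # Start and end columns are optional, so only check them when set.
        for key in ("start_column", "end_column"):
            if entry.get(key):
                required.append(entry[key])
    return [column for column in required if column not in header]


# Plain dictionaries standing in for the MetaData objects defined above.
schema_as_dicts = [
    {"name": "audio", "start_column": "audio_start_time", "end_column": "audio_end_time"},
    {"name": "text"},
]
header_only = io.StringIO(",audio,text,audio_start_time,audio_end_time\n")
print(missing_columns(header_only, schema_as_dicts))
```

An empty list means every referenced column was found in the header.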

Finally, we can call the `create` method to onboard the data, specifying the name, description, license, path to the content file, and schema.

Note that the response contains a Corpus ID together with the status of the onboarding process.

In [6]:
payload = CorpusFactory.create(
    name="corpus_onboarding_demonstration",
    description="This corpus contain 20 English audios with their corresponding transcriptions.",
    license=License.MIT,
    content_path=upload_file,
    schema=schema
)
print(payload)

 Corpus onboarding progress:   0%|          | 0/2 [00:00<?, ?it/s]
 Corpus onboarding progress:  50%|█████     | 1/2 [00:00<00:00,  3.50it/s]
 Corpus onboarding progress: 100%|██████████| 2/2 [00:00<00:00,  3.62it/s]


{'status': 'onboarding', 'corpus_id': '6423006df108b80012a4d392'}


You can then check the corpus using the `get` method.

In [7]:
corpus = CorpusFactory.get(payload["corpus_id"])
corpus.to_dict()

{'id': '6423006df108b80012a4d392',
 'name': 'corpus_onboarding_demonstration',
 'description': 'This corpus contain 20 English audios with their corresponding transcriptions.',
 'supplier': 'aiXplain',
 'version': '1.0',
 'license': None,
 'privacy': <Privacy.PRIVATE: 'Private'>,
 'onboard_status': <OnboardStatus.ONBOARDING: 'onboarding'>,
 'functions': [],
 'tags': [],
 'data': [<aixtend.modules.data.Data at 0x7ff21a693640>,
  <aixtend.modules.data.Data at 0x7ff21a693e80>],
 'kwargs': {}}
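The output above still reports the status `onboarding`, so in practice one may want to poll until it changes. The following is a generic sketch, not an SDK feature: `wait_until_onboarded` and its parameters are hypothetical, the status lookup is injected as a callable, and the final status value `"onboarded"` used in the demonstration is an assumption (only `"onboarding"` appears in the output above).

```python
import time


def wait_until_onboarded(fetch_status, interval_seconds=5.0, max_attempts=60):
    """Poll `fetch_status` until it returns something other than "onboarding".

    `fetch_status` is any zero-argument callable returning the status string.
    Raises TimeoutError if the status never changes within `max_attempts` polls.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status != "onboarding":
            return status
        time.sleep(interval_seconds)
    raise TimeoutError("corpus is still onboarding")


# Stand-in for a real status lookup; "onboarded" is an assumed final value.
fake_statuses = iter(["onboarding", "onboarding", "onboarded"])
print(wait_until_onboarded(lambda: next(fake_statuses), interval_seconds=0))
```

With the SDK, the callable could plausibly wrap `CorpusFactory.get(...)` and read the `onboard_status` field shown in the dictionary above, though the exact attribute access should be verified against the SDK version in use.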