# Dataset Onboarding

This notebook demonstrates how to onboard a dataset into the aiXplain platform using its SDK.

If you followed the tutorial on onboarding a corpus into our platform [here](/docs/samples/corpus_onboarding/corpus_onboarding.ipynb), you may be wondering what the difference is between a **dataset** and a **corpus**. In short, a corpus is a general-purpose collection of data that can be used for exploration or when you do not yet know exactly how you will use it. You can use the data in a corpus as input to models and pipelines on our platform, and to derive new data and datasets. For instance, say you have a collection of English audio files that you would like to transcribe and translate into another language. If you build a pipeline in aiXplain to do this, you can group the audio files into a corpus and feed them directly to the pipeline instead of calling the pipeline once per file.

A dataset, in contrast, is a representative sample of a specific phenomenon for a specific AI task. In aiXplain, datasets are created for each AI function we support, such as Translation, Speech Recognition and Topic Classification. When you onboard a dataset in aiXplain, you must specify not only an AI function but also which data serve as the input and output of that function. Let's see this in practice!

## Credentials

To use the aiXplain SDK, you must be registered on our platform and have an API key. The step-by-step instructions are described [here](/docs/user/api_setup.md).

In [1]:
import os

os.environ["TEAM_API_KEY"] = "YOUR_TEAM_API_KEY_HERE"
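
Before moving on, it can help to fail fast when the key was never filled in. The snippet below is only a sketch: `require_api_key` is a hypothetical helper, not part of the aiXplain SDK, and it simply assumes the key is read from the `TEAM_API_KEY` environment variable as in the cell above.

```python
import os


def require_api_key(var_name: str = "TEAM_API_KEY") -> str:
    """Return the key stored in the environment, or raise a clear error.

    Hypothetical helper (not an SDK function): it rejects an unset or empty
    variable as well as the placeholder string used in this notebook.
    """
    key = os.environ.get(var_name, "").strip()
    if not key or key == "YOUR_TEAM_API_KEY_HERE":
        raise RuntimeError(f"Set the {var_name} environment variable before using the SDK.")
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key right away instead of at the first request.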

## Data

In this example we will show how to onboard a dataset to be used in the Speech Recognition task. To onboard it, the data needs to be described in a CSV file, which is then passed to the SDK.

Our example dataset consists of 20 English audio segments with their corresponding transcriptions, which will be used as input and output data, respectively. Since the segments come from the same recording, the column `audio` contains the link to the original audio, whereas the columns `audio_start_time` and `audio_end_time` contain the start and end times, in seconds, of each segment within that audio. If you already have the audio split into separate segment files, you can list those files in the `audio` column and drop the `audio_start_time` and `audio_end_time` columns. The segment transcriptions are stored in the `text` column of the CSV file, as can be seen below.
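
Because every row points into the same audio through its start and end times, a quick sanity check on those two columns can catch mistakes before onboarding. The following is a minimal sketch using only the standard library; `find_bad_segments` is a hypothetical helper, and the sample rows and the `example.com` URL are made up for illustration.

```python
import csv
import io


def find_bad_segments(rows, start_col="audio_start_time", end_col="audio_end_time"):
    """Return indices of rows whose bounds are missing, non-numeric, negative, or out of order."""
    bad = []
    for i, row in enumerate(rows):
        try:
            start, end = float(row[start_col]), float(row[end_col])
        except (KeyError, TypeError, ValueError):
            bad.append(i)  # missing column or value that is not a number
            continue
        if start < 0 or end <= start:
            bad.append(i)  # segment must start at >= 0 and end after it starts
    return bad


# Illustrative data only; the URL and texts are placeholders.
sample_csv = """audio,text,audio_start_time,audio_end_time
https://example.com/talk.wav,first segment,0.9,6.56
https://example.com/talk.wav,second segment,7.53,15.12
https://example.com/talk.wav,broken segment,20.0,18.5
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(find_bad_segments(rows))  # [2]
```

The same function also accepts the rows of a pandas DataFrame via `data.to_dict("records")`.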

An important aspect of AI datasets is the split into train, validation and test sets. Although it is not required, you may add a column to your CSV, like the `split` column here, to partition your dataset into these sets. Just make sure the values in this column are restricted to **TRAIN**, **VALIDATION** or **TEST**.
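
The constraint on the split values is easy to check ahead of time. This is a small sketch, not SDK code: `invalid_split_values` is a hypothetical helper, and it assumes the values must match the upper-case spellings exactly, which may be stricter than what the platform actually requires.

```python
ALLOWED_SPLITS = {"TRAIN", "VALIDATION", "TEST"}


def invalid_split_values(rows, split_col="split"):
    """Return the set of values found in the split column that are not allowed."""
    return {row.get(split_col) for row in rows} - ALLOWED_SPLITS


# Illustrative rows; "train" and "DEV" are deliberately wrong.
rows = [{"split": "TRAIN"}, {"split": "train"}, {"split": "DEV"}, {"split": "TEST"}]
print(sorted(invalid_split_values(rows)))  # ['DEV', 'train']
```

An empty result means every row carries one of the three expected labels.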

In [2]:
import pandas as pd

upload_file = "data.csv"
data = pd.read_csv(upload_file)
data

Unnamed: 0,audio,text,audio_start_time,audio_end_time,split
0,https://aixplain-platform-assets.s3.amazonaws....,Welcome to another episode of Explain using di...,0.9,6.56,TRAIN
1,https://aixplain-platform-assets.s3.amazonaws....,Discover allows you to use natural language in...,7.53,15.12,TRAIN
2,https://aixplain-platform-assets.s3.amazonaws....,In this demo I'm going to focus on an Arabic t...,15.93,20.29,TRAIN
3,https://aixplain-platform-assets.s3.amazonaws....,We can see here that there are currently 4 pro...,21.6,25.62,TRAIN
4,https://aixplain-platform-assets.s3.amazonaws....,We can click on the information icon to review...,26.75,30.05,TRAIN
5,https://aixplain-platform-assets.s3.amazonaws....,"In this case, we can see that the provider of ...",31.5,37.32,TRAIN
6,https://aixplain-platform-assets.s3.amazonaws....,"I could enable this model from here, but first...",38.56,45.01,TRAIN
7,https://aixplain-platform-assets.s3.amazonaws....,I'm going to select one model from each of the...,46.32,49.69,TRAIN
8,https://aixplain-platform-assets.s3.amazonaws....,You'll notice that our benchmarking function a...,50.6,55.69,TRAIN
9,https://aixplain-platform-assets.s3.amazonaws....,All I need to do is provide an Arabic data sam...,57.06,60.53,TRAIN


## Import

Let's now import the necessary classes to onboard the dataset.

In [3]:
from aixplain.enums import DataType, DataSubtype, Function, Language, License, StorageType
from aixplain.factories import DatasetFactory
from aixplain.modules import MetaData

## Metadata

Besides the CSV file, the SDK needs a schema that describes the input and output data to be onboarded, including:

1. Data Name
2. Data Type: Audio, Text, Image, Video, Label, etc.
3. Storage Type: whether the data is stored directly in the CSV (Text), in a local file (File) or at a public link (URL)
4. Start Column (optional): the column that holds the beginning of the segment in the original file
5. End Column (optional): the column that holds the end of the segment in the original file
6. Languages (optional): the languages present in the data

Let's instantiate the metadata for the audios:

In [4]:
audio_meta = MetaData(
    name="audio", 
    dtype="audio", 
    storage_type="url", 
    start_column="audio_start_time", 
    end_column="audio_end_time", 
    languages=[Language.English_UNITED_STATES]
)

Now for the text...

(See how we can use enumerations instead of strings to specify some information)

In [5]:
text_meta = MetaData(
    name="text", 
    dtype=DataType.TEXT, 
    storage_type=StorageType.TEXT, 
    languages=[Language.English_UNITED_STATES]
)

Since we do have split information, let's also define metadata for it...

In [6]:
split_meta = MetaData(
    name="split", 
    dtype=DataType.LABEL, 
    dsubtype=DataSubtype.SPLIT,
    storage_type=StorageType.TEXT, 
)

Let's now create the schemas for the input and output data of the dataset. Since this is a speech recognition dataset, the audios will be set as the input and the texts as the output data.

In [7]:
input_schema = [audio_meta]
output_schema = [text_meta]
meta_schema = [split_meta]
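
In this example, each `name`, `start_column` and `end_column` in the metadata matches a column header of the CSV. Assuming that is how the names are meant to line up, a check like the sketch below can catch a typo before onboarding. It uses plain dictionaries as stand-ins for the `MetaData` objects so that it runs without the SDK; `missing_columns` is a hypothetical helper.

```python
def missing_columns(csv_columns, schema_entries):
    """Return column names referenced by the schema that are absent from the CSV header."""
    available = set(csv_columns)
    missing = []
    for entry in schema_entries:
        for key in ("name", "start_column", "end_column"):
            value = entry.get(key)
            if value is not None and value not in available:
                missing.append(value)
    return missing


# Column headers of the example CSV and dictionary stand-ins for the metadata above.
csv_columns = ["audio", "text", "audio_start_time", "audio_end_time", "split"]
schema = [
    {"name": "audio", "start_column": "audio_start_time", "end_column": "audio_end_time"},
    {"name": "text"},
    {"name": "split"},
]
print(missing_columns(csv_columns, schema))  # []
```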

Finally, we can call the `create` method to onboard the dataset, specifying the name, description, license, function, path to the content file and schemas.

Note that the response contains the ID of the new dataset together with the status of the onboarding process.

In [8]:
payload = DatasetFactory.create(
    name="dataset_onboarding_demo",
    description="This speech recognition dataset contains 20 English audios with their corresponding transcriptions.",
    license=License.MIT,
    function=Function.SPEECH_RECOGNITION,
    content_path=upload_file,
    input_schema=input_schema,
    output_schema=output_schema,
    metadata_schema=meta_schema
)
print(payload)

 Dataset's inputs onboard progress:   0%|          | 0/1 [00:00<?, ?it/s]
 Dataset's inputs onboard progress: 100%|██████████| 1/1 [00:00<00:00,  3.05it/s]
 Dataset's outputs onboard progress:   0%|          | 0/1 [00:00<?, ?it/s]
 Dataset's outputs onboard progress: 100%|██████████| 1/1 [00:00<00:00,  3.64it/s]
 Dataset's hypotheses onboard progress: 0it [00:00, ?it/s]
 Dataset's meta onboard progress:   0%|          | 0/1 [00:00<?, ?it/s]
 Dataset's meta onboard progress: 100%|██████████| 1/1 [00:00<00:00,  4.11it/s]


{'status': 'onboarding', 'asset_id': '645990e95235bd001267e130'}


You can then check the dataset using the `get` method.

In [9]:
dataset = DatasetFactory.get(payload["asset_id"])
dataset.__dict__

{'id': '645990e95235bd001267e130',
 'name': 'dataset_onboarding_demo',
 'description': 'This speech recognition dataset contains 20 English audios with their corresponding transcriptions.',
 'supplier': 'aiXplain',
 'version': '1.0',
 'license': <License.MIT: '620ba3a83e2fa95c500b429d'>,
 'privacy': <Privacy.PRIVATE: 'Private'>,
 'onboard_status': <OnboardStatus.ONBOARDED: 'onboarded'>,
 'function': <Function.SPEECH_RECOGNITION: 'speech-recognition'>,
 'source_data': {'audio': <aixplain.modules.data.Data at 0x7f6e28ff0520>},
 'target_data': {'text': [<aixplain.modules.data.Data at 0x7f6e28fbd220>]},
 'hypotheses': {},
 'metadata': {'split': <aixplain.modules.data.Data at 0x7f6e28ff0430>},
 'tags': [],
 'length': 19,
 'kwargs': {}}
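
The `create` call above returned the status `onboarding`, while the later `get` shows `onboarded`, so a script may need to wait in between. The sketch below is one way to poll under stated assumptions: `wait_until_onboarded` is a hypothetical helper, not an SDK function, and it takes any callable that returns the current status string so that it can be tried without a connection to the platform.

```python
import time


def wait_until_onboarded(fetch_status, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll fetch_status() until the status is no longer 'onboarding'.

    Returns the first status other than 'onboarding' (for example 'onboarded').
    Raises TimeoutError if the status has not changed after `timeout` seconds
    of waiting.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status != "onboarding":
            return status
        if waited >= timeout:
            raise TimeoutError(f"Still onboarding after {waited:.0f} seconds")
        sleep(interval)
        waited += interval


# Possible use with the objects from this notebook, assuming `onboard_status`
# is the enum shown in the output above:
# wait_until_onboarded(lambda: DatasetFactory.get(payload["asset_id"]).onboard_status.value)
```

Passing the status lookup as a callable keeps the waiting logic independent of the SDK, and the returned value should still be checked, since statuses other than `onboarded` are not covered in this notebook.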