# Image Label Detection Dataset Onboarding

This notebook demonstrates how to onboard a dataset with label data onto the aiXplain platform using its SDK.

## Credentials

To use the aiXplain SDK, you must be registered on our platform and have an API key. The step-by-step instructions are described [here](/docs/user/api_setup.md).

In [1]:
import os

os.environ["TEAM_API_KEY"] = "YOUR_TEAM_API_KEY_HERE"
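A key pasted into a notebook is easy to leak when the notebook is shared, and a forgotten placeholder only fails later, at the first SDK call. The following is a minimal sketch of an early check; the helper name `ensure_team_api_key` is hypothetical and not part of the aiXplain SDK.

```python
import os


def ensure_team_api_key(env=os.environ, placeholder="YOUR_TEAM_API_KEY_HERE"):
    """Return the value of TEAM_API_KEY, or raise if it is unset or still the placeholder.

    Hypothetical helper for illustration; it is not part of the aiXplain SDK.
    """
    key = env.get("TEAM_API_KEY", "").strip()
    if not key or key == placeholder:
        raise RuntimeError("Set TEAM_API_KEY to a valid aiXplain team API key.")
    return key
```

Calling `ensure_team_api_key()` right after setting the variable surfaces a missing key before any onboarding work starts.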

## Data

In this example we will show how to onboard a sample dataset of images and their corresponding labels. To onboard it, the data needs to be described in a CSV file, which will be fed to the SDK.

Label data should be stored as one or more elements in a JSON file, each following one of these structures:

```json
{
    "data": "TEXT_AUDIO_LABEL",
    "boundingBox": {
        "start": 0, // start character
        "end": 0 // end character
    }
}

{
    "data": "AUDIO_LABEL",
    "boundingBox": {
        "start": 0, // start second
        "end": 0 // end second
    }
}

{
    "data": "IMAGE_LABEL",
    "boundingBox": {
        "top": 0, // top percentage of the image
        "bottom": 0, // bottom percentage of the image
        "left": 0, // left percentage of the image
        "right": 0 // right percentage of the image
    }
}

{
    "data": "VIDEO_LABEL",
    "boundingBox": {
        "start": 0, // start second
        "end": 0, // end second
        "top": 0, // top percentage of the image
        "bottom": 0, // bottom percentage of the image
        "left": 0, // left percentage of the image
        "right": 0 // right percentage of the image
    }
}
```
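As an illustration of the image structure, the sketch below checks that a label entry carries the expected fields and serializes it. The helper `check_image_label`, the label text `"cat"`, and the numbers are all made up for the example, and storing several labels as a JSON list is an assumption rather than something the structures above confirm.

```python
import json

# Keys of the image bounding box, taken from the IMAGE_LABEL structure above.
IMAGE_BOX_KEYS = {"top", "bottom", "left", "right"}


def check_image_label(entry):
    """Raise ValueError if the entry lacks the fields of the image label structure."""
    if "data" not in entry or "boundingBox" not in entry:
        raise ValueError("label needs 'data' and 'boundingBox' fields")
    missing = IMAGE_BOX_KEYS - set(entry["boundingBox"])
    if missing:
        raise ValueError(f"boundingBox is missing: {sorted(missing)}")
    return entry


# Hypothetical label: the name "cat" and the percentages are invented.
label = {
    "data": "cat",
    "boundingBox": {"top": 10, "bottom": 60, "left": 25, "right": 80},
}

# Assumption: multiple labels for one image are stored as a JSON list.
label_json = json.dumps([check_image_label(label)], indent=4)
print(label_json)
```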

In [2]:
import pandas as pd

upload_file = "corpus/index.csv"
data = pd.read_csv(upload_file)
data

Unnamed: 0.1,Unnamed: 0,images,labels
0,0,corpus/images/1.jpg,corpus/labels/1.json
1,1,corpus/images/2.png,corpus/labels/2.json
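Each row of the index points to one image file and one label file, so a broken path is cheaper to catch before onboarding starts. The sketch below lists referenced files that do not exist. The helper `find_missing_files` is hypothetical, and it assumes the paths in the CSV are relative to the working directory (or to `base_dir` if given), which the sample above suggests but does not state.

```python
import csv
from pathlib import Path


def find_missing_files(index_path, columns=("images", "labels"), base_dir="."):
    """Return the paths listed in the given CSV columns that do not exist on disk.

    Hypothetical helper; assumes paths in the CSV are relative to base_dir.
    """
    missing = []
    with open(index_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for column in columns:
                path = Path(base_dir) / row[column]
                if not path.is_file():
                    missing.append(str(path))
    return missing
```

For the sample above, `find_missing_files("corpus/index.csv")` would return an empty list when every image and label file is in place.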


## Import

Let's now import the necessary classes to onboard the corpus.

In [3]:
from aixplain.enums import DataType, DataSubtype, Function, Language, License, StorageType
from aixplain.factories import DatasetFactory
from aixplain.modules import MetaData

## Metadata

Besides the CSV file, the SDK needs a schema that describes the input and output data to be onboarded, including:

1. Data Name
2. Data Type: Audio, Text, Image, Video, Label, etc.
3. Storage Type: whether the data is stored in the CSV itself (Text), in a local file (File), or at a public link (URL)
4. Start Column (optional): the column that holds the start of the segment in the original file
5. End Column (optional): the column that holds the end of the segment in the original file
6. Languages (optional): the languages present in the data

Let's instantiate the metadata for the images:

In [4]:
image_meta = MetaData(
    name="images", 
    dtype="image", 
    storage_type="file", 
)

Now for the labels...

(Note that enumerations can be used instead of strings to specify some of this information.)

In [5]:
label_meta = MetaData(
    name="labels", 
    dtype=DataType.LABEL, 
    storage_type=StorageType.FILE,
)

Let's now create the schemas for the input and output data of the dataset. Since this is an image label detection dataset, the images will be set as the input and the labels as the output data.

In [6]:
input_schema = [image_meta]
output_schema = [label_meta]

Finally, we can call the `create` method to onboard the dataset, specifying the name, description, license, function, path to the content file, and schemas.

Note that a Dataset ID is returned in the response, together with the status of the onboarding process.

In [7]:
payload = DatasetFactory.create(
    name="dataset_onboarding_demo",
    description="This is an image label detection corpus",
    license=License.MIT,
    function=Function.IMAGE_LABEL_DETECTION,
    content_path=upload_file,
    input_schema=input_schema,
    output_schema=output_schema
)
print(payload)

 Dataset's inputs onboard progress:   0%|          | 0/1 [00:00<?, ?it/s]
 Dataset's inputs onboard progress: 100%|██████████| 1/1 [00:06<00:00,  6.71s/it]
 Dataset's outputs onboard progress:   0%|          | 0/1 [00:00<?, ?it/s]
 Dataset's outputs onboard progress: 100%|██████████| 1/1 [00:02<00:00,  2.51s/it]
 Dataset's hypotheses onboard progress: 0it [00:00, ?it/s]
 Dataset's meta onboard progress: 0it [00:00, ?it/s]


{'status': 'onboarding', 'asset_id': '6615453db2166233fe1ab291'}


You can then check the dataset using the `get` method.

In [8]:
dataset = DatasetFactory.get(payload["asset_id"])
dataset.__dict__

INFO:root:Start service for GET Dataset  - https://dev-platform-api.aixplain.com/sdk/datasets/6615453db2166233fe1ab291/overview - {'Authorization': 'Token 9136c08bf02b5552885b9f2a5e0fae517d81ff2fa6fe7084a3adb655c4aa7215', 'Content-Type': 'application/json'}


{'id': '6615453db2166233fe1ab291',
 'name': 'dataset_onboarding_demo',
 'description': 'This is an image label detection corpus',
 'supplier': 'aiXplain',
 'version': '1.0',
 'license': <License.MIT: '620ba3a83e2fa95c500b429d'>,
 'privacy': <Privacy.PRIVATE: 'Private'>,
 'cost': 0,
 'onboard_status': <OnboardStatus.ONBOARDING: 'onboarding'>,
 'function': <Function.IMAGE_LABEL_DETECTION: 'image-label-detection'>,
 'source_data': {'images': <aixplain.modules.data.Data at 0x117d50810>},
 'target_data': {'labels': [<aixplain.modules.data.Data at 0x117d3f690>]},
 'hypotheses': {},
 'metadata': {},
 'tags': [],
 'length': None,
 'kwargs': {}}
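The output above still reports `onboarding`, so the dataset is not ready to use yet. One way to wait for it is to poll the status. The sketch below is deliberately generic: `wait_for_onboarding` is a hypothetical helper that takes any callable returning the current status as a string, so it does not depend on SDK details beyond the `'onboarding'` value shown above.

```python
import time


def wait_for_onboarding(fetch_status, poll_seconds=10.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll fetch_status() until it stops returning 'onboarding' or the timeout expires.

    Hypothetical helper; returns the last status seen, which is still
    'onboarding' if the timeout was reached.
    """
    waited = 0.0
    status = fetch_status()
    while status == "onboarding" and waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        status = fetch_status()
    return status
```

With the SDK, the callable could plausibly be `lambda: DatasetFactory.get(payload["asset_id"]).onboard_status.value`, since `onboard_status` appears in the dictionary above as an enumeration whose value is `'onboarding'`. The names of the final statuses are not shown in this notebook, so check the returned value rather than assuming a specific one.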