# <div class="alert alert-block alert-info" style="border-width:4px">SBrain DataSet APIs Tutorial </div>

### NOTE: This is a sample notebook. Please make a copy of it for yourself and try it out.

<a id='top'></a>
This tutorial covers the following:
- [Create DataSet](#create_dataset)

- [Search And Retrieve Existing DataSets](#search_datasets)

- [Search And Retrieve DataSet Versions](#search_versions)

- [Transforming DataSet Versions To Generate New DataSet Versions](#trans)
    - [Defining and Saving Transformations](#define_trans)
    - [Testing Set of Transformations on Single Image](#test_trans)
    - [Running Transformation Job](#run_trans)
    - [Reusing Transformations](#reuse_trans)
    
- [Creating Training/Test/Validation Splits](#splits)

- [Searching And Retrieving DataSet Splits](#search_splits)


We are using a subset of the [CIFAR-10 dataset](https://en.wikipedia.org/wiki/CIFAR-10), which contains pictures of animals, airplanes, ships and so on.

Let's first import all the packages that you will need during this tutorial.
- sbrain.dataset is the SBrain package with the abstractions necessary to use SBrain's DataSet functionality.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- cv2, i.e. [OpenCV](https://opencv.org), is a popular computer vision library.

In [None]:
from sbrain.dataset import DataSetImageClassification,DataSetVersion,DataSetSplit
from sbrain.dataset import DataSetStatus,JobStatus,DataSetSplitStatus,DataSetVersionStatus
from sbrain.dataset import Transformation,TransformationSet
import numpy as np
import cv2
import uuid
import time
from IPython.display import clear_output

#### **_NOTE_**: 
Please set the username you used to log in to the SBrain UI in the following cell.
This value is passed to the search APIs to limit the results to the assets created
by the current user. It's just for illustration purposes.

You can also search for and reuse other users' assets by passing their usernames to the search APIs.


In [None]:
user_name = "admin"

<a id='create_dataset'></a>
# _Create DataSet_
<div align="right"><a href="#top">BackToTheTop</a></div>

**_sbrain.dataset.DataSetImageClassification_** is an abstraction which supports creating and handling of image dataset for classification model training. 

The DataSetImageClassification constructor takes the **_name_** of the dataset as input.

The **_DataSetImageClassification.create()_** method takes the following parameters:

- **description** : a short description of the dataset
- **source_archive_path** : the path to the folder containing the images and labels
- **classes** : [optional] a dict with the class names in the dataset as keys and class ids as values
- **collection_date** : date on which the data was collected, as a string in **_mm-dd-yyyy_** format
- **image_iterator** : function returning an iterator over the paths of the images in the archive
- **label_iterator** : function returning an iterator in which each element is a tuple (image name, class id)
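
Since **collection_date** must follow the **_mm-dd-yyyy_** format, it can help to check the string before submitting the job. The following is a minimal sketch using only the standard library; `validate_collection_date` is a hypothetical helper and not part of the SBrain API.

```python
from datetime import datetime


def validate_collection_date(date_str):
    """Return date_str unchanged if it parses as mm-dd-yyyy, else raise ValueError."""
    # Hypothetical helper for illustration; not part of sbrain.dataset.
    datetime.strptime(date_str, "%m-%d-%Y")
    return date_str


print(validate_collection_date("07-25-2018"))
```

A string in another layout, such as "2018-07-25", raises `ValueError` instead of being returned.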


In [None]:
import time
def unique_id():
    return str(int(time.time()))
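
Because `unique_id()` has one-second resolution, two names built with the same prefix within the same second would be identical. If that matters in your setup, a random component can be appended. This is only a sketch: `unique_suffix` is a hypothetical alternative and is not used elsewhere in this notebook.

```python
import time
import uuid


def unique_suffix():
    # Hypothetical alternative to unique_id():
    # timestamp plus 8 random hex characters from a UUID4.
    return "{}-{}".format(int(time.time()), uuid.uuid4().hex[:8])


print("flip-{}".format(unique_suffix()))
```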

In [None]:
# defining classes

classes = {
                'airplane': 0,
                'automobile':1,
                'bird': 2,
                'cat': 3,
                'deer': 4,
                'dog': 5,
                'frog': 6,
                'horse': 7,
                'ship': 8,
                'truck': 9
            }

In [None]:
# defining iterator to get image file paths

def iterator_images(data_root_path):
    import glob
    files = glob.glob("{}/*.*".format(data_root_path))
    return iter(files)

# defining iterator to get tuples (image_name, class_id) e.g. (xyz.jpeg,1)
def iterator_labels(data_root_path):
    import glob
    files = glob.glob("{}/*.*".format(data_root_path))
    labels = []
    classes = {
                'airplane': 0,
                'automobile':1,
                'bird': 2,
                'cat': 3,
                'deer': 4,
                'dog': 5,
                'frog': 6,
                'horse': 7,
                'ship': 8,
                'truck': 9
            }
    for f in files:
        # file names look like "<id>_<class>.<ext>", e.g. "11102_horse.png"
        img_name = f.split('/')[-1]
        lbl_str = img_name[img_name.index('_')+1:img_name.index('.')]
        lbl_id = classes[lbl_str]
        labels.append((img_name, lbl_id))
    return iter(labels)
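
The label iterator above relies on file names of the form `<id>_<class>.<ext>`. The parsing step can be tried in isolation with the sketch below, which uses the sample file name from later in this tutorial; `parse_label` and `CLASSES` are names introduced here for illustration only.

```python
import os

CLASSES = {
    'airplane': 0, 'automobile': 1, 'bird': 2, 'cat': 3, 'deer': 4,
    'dog': 5, 'frog': 6, 'horse': 7, 'ship': 8, 'truck': 9,
}


def parse_label(path, classes=CLASSES):
    """Return (image_name, class_id) for a path like '.../11102_horse.png'."""
    img_name = os.path.basename(path)
    # The class name sits between the first '_' and the first '.'.
    lbl_str = img_name[img_name.index('_') + 1:img_name.index('.')]
    return img_name, classes[lbl_str]


print(parse_label("cifar10_small/11102_horse.png"))  # ('11102_horse.png', 7)
```

A file whose name has no underscore raises `ValueError`, and one whose class name is not in the dict raises `KeyError`, so stray files in the archive folder surface early.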


#### DataSetImageClassification.create() returns a DataSetExtractionJob object
The job object can be used to track the progress of DataSet creation.

job.get_dataset() returns a DataSet object that is a handle to the newly created dataset.


#### NOTE:
The **job.cancel()** API can be used at any time to cancel the dataset creation job.

In [None]:
# creating dataset

dataset_name = "cifar10-small-{}".format(unique_id())

job = DataSetImageClassification(name=dataset_name).create(
    description = "Dataset with subset images from cifar 10 dataset",
    source_archive_path = "shared-dir/sample-notebooks/demo-data/cifar10_small",
    classes=classes,
    collection_date="07-25-2018",
    image_iterator=iterator_images,
    label_iterator=iterator_labels
)

#Check Job Status

while job.status != JobStatus.COMPLETE.value and job.status != JobStatus.FAILED.value:
    clear_output(wait=True)
    job = job.get_status()
    time.sleep(2)
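
The loop above polls until the job reaches a terminal state, with no upper bound on the waiting time. A generic polling helper with a timeout could look like the sketch below. It is independent of SBrain: `wait_for` is a hypothetical helper, and the state strings used in the demo are placeholders rather than actual `JobStatus` values.

```python
import time


def wait_for(poll, terminal_states, interval=2.0, timeout=600.0):
    """Call poll() until it returns a terminal state or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        state = poll()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "still in state {!r} after {} s".format(state, timeout))
        time.sleep(interval)


# Stand-in for a job: reports "running" twice, then "complete".
states = iter(["running", "running", "complete"])
print(wait_for(lambda: next(states), {"complete", "failed"}, interval=0.01))
```

With a real job, `poll` would be a small function that fetches the current status from the job object.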

In [None]:
ds = job.get_dataset()

<a id='search_datasets'></a>
# _Search And Retrieve Existing DataSets_ 
<div align="right"><a href="#top">BackToTheTop</a></div>

The **_DataSetImageClassification.search()_** API searches for existing datasets using
- **author** : name of the user who created the dataset, and/or
- **name** : name of the dataset, and/or
- **description** : keywords in the description
<br>DataSets which partially match any of the given parameters are returned.

**_DataSetImageClassification.lookup()_** can be used to retrieve a dataset object by dataset name. The name needs to match exactly.
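
The difference between the two calls can be pictured with a small in-memory sketch. It does not use SBrain; the records, `search` and `lookup` below are hypothetical, and treating "partially match" as a substring test is an assumption made for illustration.

```python
records = [
    {"name": "cifar10-small-1", "author": "admin",
     "description": "subset of cifar 10"},
    {"name": "mnist-digits", "author": "alice",
     "description": "handwritten digits"},
]


def search(records, **criteria):
    # A record is returned if ANY given field contains the search text.
    return [r for r in records
            if any(text in r[field] for field, text in criteria.items())]


def lookup(records, name):
    # Exact-name match; returns None when no record has that name.
    return next((r for r in records if r["name"] == name), None)


print([r["name"] for r in search(records, name="cifar", author="alice")])
print(lookup(records, "cifar10"))
```

The search returns both records because each matches one of the criteria, while the lookup with an incomplete name finds nothing.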

In [None]:
# list all the datasets which belong to a user, for example
# datasets can also be searched by partial names or descriptions

DataSetImageClassification.search( author = user_name )

In [None]:
# lookup returns a DataSetImageClassification object that can be used to search versions. 
ds = DataSetImageClassification.lookup(dataset_name)

<a id='search_versions'></a>
# _Search And Retrieve DataSet Versions_


**_DataSetImageClassification.search_versions()_** can be used to look up and list all dataset versions derived from the given dataset, using
- **version_author** : name of the user who created the version, and/or
- **version_name** : name of the version, and/or
- **version_description** : keywords in the version description.
<br>DataSet Versions which partially match any of the given parameters are returned.

**_DataSetImageClassification.version()_** can be used to retrieve a particular **_DataSetVersion_** by name.
<div align="right"><a href="#top">BackToTheTop</a></div>

In [None]:
# list all the versions of the dataset which satisfy the given search criteria
# this will print out the details of those versions

ds.search_versions(version_author = user_name)

In [None]:
DataSetVersion.search(author= user_name, dataset_name=dataset_name)

In [None]:
# following method will return a DataSetVersion object 
# which can be used to invoke transformations or creating splits
# Version "v1" is created by default when dataset is created

ds_version = ds.version("v1")

<a id='trans'></a>
# _Transforming DataSet Versions To Generate New DataSet Versions_
<div align="right"><a href="#top">BackToTheTop</a></div>

<a id='define_trans'></a>
SBrain provides **_sbrain.dataset.Transformation_** abstraction that can be used to run transformations on a DataSetVersion at scale. 

## Defining and Saving Transformations
Define custom transformations as one or more classes, each of which inherits from the **_Transformation_** class as shown below.

<font color="red">IMPORTANT:</font>
1. The custom transformation needs to extend the sbrain.dataset.Transformation class.
2. It must provide a 'process' method which takes in an image as a numpy array and returns the transformed image as a numpy array.
3. NOTE: Right now, transformations support only the OpenCV library.
<div align="right"><a href="#top">BackToTheTop</a></div>

In the following example, the user has written two different transformations: one to flip and one to resize the image.



In [None]:
class Flip(Transformation):
    def __init__(self, name):
        super().__init__(name)

    def process(self, arr_in):
        # flipCode=1 flips the image horizontally (around the vertical axis)
        flipped_image = cv2.flip(arr_in, 1)
        return flipped_image
    
    
class Resize(Transformation):
    def __init__(self, name, **args):
        super().__init__(name)
        self.height = args["height"]
        self.width = args["width"]
        

    def process(self, arr_in):
        resized_img = cv2.resize(arr_in, (self.width, self.height))
        return resized_img
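
Note that `cv2.resize` takes its target size as `(width, height)`, whereas an image array's shape is `(height, width, channels)`. The pure-Python sketch below illustrates that convention, using nested lists in place of numpy arrays and nearest-neighbour sampling (OpenCV's default interpolation is bilinear). `resize_nearest` is a hypothetical stand-in, not an OpenCV function.

```python
def resize_nearest(img, size):
    """Resize a nested-list image; size is (width, height), as in cv2.resize."""
    width, height = size
    src_h, src_w = len(img), len(img[0])
    # Each output pixel copies the source pixel at the scaled-down index.
    return [[img[r * src_h // height][c * src_w // width]
             for c in range(width)]
            for r in range(height)]


img = [[1, 2],
       [3, 4]]                      # 2 rows (height) x 2 columns (width)
out = resize_nearest(img, (4, 2))   # width=4, height=2
print(len(out), len(out[0]))        # 2 4
print(out)                          # [[1, 1, 2, 2], [3, 3, 4, 4]]
```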

**_Transformation.create()_**, inherited from the base class, saves an instance of the transformation in SBrain. The saved transformation can later be reused by overriding its parameters.

In [None]:
# initialize and create the transformations
# NOTE : unless the create method is called, the transformation is not saved in the SBrain system and cannot be used
#        in transformation jobs
flip_transformation_name = "flip-{}".format(unique_id())
flip = Flip(name=flip_transformation_name).create(author=user_name,
                                        description="flipping the image")


resize_transformation_name = "resize-{}".format(unique_id())
resize = Resize(name=resize_transformation_name, height=100, width=100).create(author=user_name,
                                        description="resizing the image")

<a id='test_trans'></a>
## Testing Set of Transformations on Single Image
<div align="right"><a href="#top">BackToTheTop</a></div>

The example below applies two transformations, resize and flip. In practice, multiple transformations can be chained by calling dataset_version.transform(transformation_obj_1).transform(transformation_obj_2) and so on.
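
Conceptually, chaining feeds the output of each transformation's `process` method into the next one, so the order of the chain can matter. The sketch below shows this with two toy classes operating on nested lists. `HFlip`, `CropLeft` and `apply_chain` are hypothetical names, and this is not how SBrain executes a chain internally.

```python
class HFlip:
    def process(self, arr_in):
        # Horizontal flip: reverse every row.
        return [row[::-1] for row in arr_in]


class CropLeft:
    def __init__(self, width):
        self.width = width

    def process(self, arr_in):
        # Keep only the leftmost `width` columns.
        return [row[:self.width] for row in arr_in]


def apply_chain(img, transformations):
    # Apply each transformation in order, passing the result along.
    for t in transformations:
        img = t.process(img)
    return img


img = [[1, 2, 3]]
print(apply_chain(img, [HFlip(), CropLeft(2)]))  # [[3, 2]]
print(apply_chain(img, [CropLeft(2), HFlip()]))  # [[2, 1]]
```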

**_TransformationSet.apply_to_file()_** can be used to test the transformations on a single file to see the results, as shown below:


In [None]:
input_img = "../demo-data/cifar10_small/11102_horse.png"
output_img = "../../11102_horse_transformed-{}.png".format(unique_id())
TransformationSet.apply_to_file(src_path=input_img,
                                des_path=output_img,
                                transformations_set=[resize,flip])

In [None]:
from matplotlib import pyplot as plt
from skimage import io
%matplotlib inline

plt.subplot(1, 2, 1)
plt.title('before')
before = io.imread(input_img)
plt.imshow(before)

plt.subplot(1, 2, 2)
plt.title('after')
after = io.imread(output_img)
plt.imshow(after)






Once the transformation has been written and saved to your satisfaction, it can be used to run a transformation job on the entire dataset.


<a id='run_trans'></a>
## Running Transformation Job

**_DataSetVersion.transform()_** takes a custom transformation object and returns a **_TransformationSet_** object.

Multiple transform() calls on the **_TransformationSet_** can be used to chain together the required transformations.

**_TransformationSet.run()_** deploys the transformation job on the cluster.

It takes the following args:
- **num_workers**: number of workers to parallelize the transformation job
- **target_version** : name of the resultant version
- **cores**: number of cores to allocate for each worker. 
- **memory** : memory to be allocated for each worker.

It returns:
 - **_TransformationJob_** object which is the handle to the job running on the cluster, and can be used to check status of the job.
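
How SBrain distributes the work across the cluster is not described in this tutorial. As a rough local analogy for what **num_workers** controls, the sketch below processes a list of items with a fixed-size pool of worker threads from the standard library; `run_parallel` is a hypothetical helper and the file names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(items, process, num_workers=2):
    """Apply process() to every item using num_workers threads."""
    # pool.map preserves input order, so results line up with items.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(process, items))


print(run_parallel(["a.png", "b.png", "c.png"], str.upper, num_workers=2))
```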

<div align="right"><a href="#top">BackToTheTop</a></div>

**TransformationJob.get_status()** can be used to retrieve the current status of the job.

**TransformationJob.cancel()** can be used to cancel the job.

<font color="red">IMPORTANT:</font> Please make sure the DataSetVersion which is being transformed has status 'Complete'. You can see the status in the search result or DataSetVersion.status.


#### NOTE: 
With the sample dataset, the transformation job takes around 2-3 minutes.


In [None]:
# the following code launches a transformation job on the cluster
# and returns a TransformationJob object
dataset_new_version_name = "flip-resized-100-100-{}".format(unique_id())
tj = ds_version.transform(flip).transform(resize).run(target_version=dataset_new_version_name, num_workers=2)

# Check job status; stop polling once the job completes or fails
terminal_states = {JobStatus.COMPLETE.value.lower(), JobStatus.FAILED.value.lower()}
status = tj.get_status().lower()
while status not in terminal_states:
    clear_output(wait=True)
    time.sleep(2)
    status = tj.get_status().lower()

In [None]:
# get the handle to the new version after transformation 
ds_version_transformed = ds.version( version_name =  dataset_new_version_name)

Let's look at a sample of images from the versions before and after transformation.

In [None]:
from matplotlib import pyplot as plt
from skimage import io
%matplotlib inline

rows = 8
cols = 2
plt.figure(1, figsize=(40,50))
iter_before = ds_version.get_iterator()
iter_after = ds_version_transformed.get_iterator()
iter_before.reset_iterator()
iter_after.reset_iterator()
imgs_before = [img for img,lbl in iter_before.next_batch(8)]
imgs_after = [img for img,lbl in iter_after.next_batch(8)] 
imgs_before = sorted(imgs_before)
imgs_after = sorted(imgs_after)
for i in range(8):
    # left column: original image
    plt.subplot(rows, cols, 2 * i + 1)
    plt.title('before')
    plt.imshow(io.imread(imgs_before[i]))
    # right column: transformed image
    plt.subplot(rows, cols, 2 * i + 2)
    plt.title('after')
    plt.imshow(io.imread(imgs_after[i]))

<a id='reuse_trans'></a>
## Reusing Transformations

The same transformation object, resize, which was initially created with the 'height' and 'width' parameters set to 100, can be reused to perform another transformation, for example by overriding height and width with 300.
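
One way to picture parameter overriding is a method that returns a modified copy and leaves the saved definition untouched. Whether SBrain's `override()` copies or mutates the object is not stated in this tutorial, so the sketch below is an assumption; `ResizeParams` is a hypothetical stand-in class.

```python
import copy


class ResizeParams:
    # Hypothetical stand-in; not the sbrain.dataset.Transformation class.
    def __init__(self, name, height, width):
        self.name = name
        self.height = height
        self.width = width

    def override(self, **params):
        # Return a modified copy so the original keeps its defaults.
        clone = copy.copy(self)
        for key, value in params.items():
            if not hasattr(clone, key):
                raise AttributeError("unknown parameter: {}".format(key))
            setattr(clone, key, value)
        return clone


base = ResizeParams("resize", height=100, width=100)
big = base.override(height=300, width=300)
print(base.height, base.width, big.height, big.width)  # 100 100 300 300
```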

<div align="right"><a href="#top">BackToTheTop</a></div>

In [None]:
flip_resize_300_name = "flip-resize-300-{}".format(unique_id())
tj2 = ds_version.transform(resize.override(height=300,width=300)).run(num_workers=1, target_version=flip_resize_300_name)
#Check job status
status = tj2.get_status()
while status != JobStatus.COMPLETE.value and status != JobStatus.FAILED.value:
    clear_output(wait=True)
    status = tj2.get_status()
    time.sleep(2)

In [None]:
# get the handle to the new version after transformation 
ds_version_transformed_300 = ds.version( version_name =  flip_resize_300_name)

Let's check the images again to compare the version resized to 100x100 with the version resized to 300x300.

In [None]:
from matplotlib import pyplot as plt
from skimage import io
%matplotlib inline

rows = 8
cols = 2
plt.figure(1, figsize=(40,50))
imgs_before = [img for img,lbl in ds_version_transformed.get_iterator().next_batch(8)]
imgs_after = [img for img,lbl in ds_version_transformed_300.get_iterator().next_batch(8)] 
imgs_before = sorted(imgs_before)
imgs_after = sorted(imgs_after)
for i in range(8):
    # left column: version resized to 100x100
    plt.subplot(rows, cols, 2 * i + 1)
    plt.title('resized to 100x100')
    plt.imshow(io.imread(imgs_before[i]))
    # right column: version resized to 300x300
    plt.subplot(rows, cols, 2 * i + 2)
    plt.title('resized to 300x300')
    plt.imshow(io.imread(imgs_after[i]))

<a id="splits"></a>
# Creating Training/Test/Validation Splits

Use **_DataSetVersion.create_data_split()_** to create train/test/validation splits. 

It takes:
- **split_name** : Name of the split being created.
- **split_percentages**: An array of ints specifying the train/test/validation split percentages in the respective order.
- **author**: author of the split
- **description** : some description about the split
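
To make the meaning of **split_percentages** concrete, the sketch below partitions a list of items by percentage using only the standard library. It illustrates the arithmetic and is not SBrain's implementation; whether SBrain shuffles or stratifies by class is not stated here, and `split_by_percentages` is a hypothetical helper.

```python
import random


def split_by_percentages(items, percentages, seed=0):
    """Shuffle items and cut them into consecutive splits by percentage."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    items = list(items)
    random.Random(seed).shuffle(items)
    splits, start = [], 0
    for i, pct in enumerate(percentages):
        # The last split takes the remainder so rounding drops no item.
        if i == len(percentages) - 1:
            end = len(items)
        else:
            end = start + len(items) * pct // 100
        splits.append(items[start:end])
        start = end
    return splits


train, test, val = split_by_percentages(range(10), [70, 20, 10])
print(len(train), len(test), len(val))  # 7 2 1
```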

<font color="red">IMPORTANT:</font> Please make sure the DataSetVersion which is being used for split has status 'Complete'. You can see the status in the search result or DataSetVersion.status.

It returns:
- **_DataSetSplitJob_** object


**DataSetSplitJob.get_status()** can be used to check the status of the dataset split job.

**DataSetSplitJob.cancel()** can be used to cancel the dataset split job.


#### NOTE: 
With the sample dataset, creating the split takes about 1-2 minutes.
<div align="right"><a href="#top">BackToTheTop</a></div>

In [None]:
dataset_split_name = "split-70-20-10-{}".format(unique_id())
split_job = ds_version_transformed.create_data_split(split_name = dataset_split_name,
                                                 split_percentages = [70,20,10],
                                                 description = "split by 70-20-10")

#Check job status
while split_job.status != JobStatus.COMPLETE.value and split_job.status != JobStatus.FAILED.value:
    clear_output(wait=True)
    split_job = split_job.get_status()
    time.sleep(2)

#### DataSetVersion.create_data_split() returns a DataSetSplitJob object
The job object can be used to track the progress of DataSetSplit creation.

job.get_dataset_split() returns a DataSetSplit object that is a handle to the newly created dataset split.

In [None]:
split = split_job.get_dataset_split()

<a id='search_splits'></a>

# Searching And Retrieving DataSet Splits

**DataSetVersion.search_splits()** can be used to search the DataSetSplits created from that particular DataSetVersion.

Args:
- **split_name**: name of the split.
- **split_author** : author of the split.
- **split_description** : description of the split.

**DataSetVersion.split()** which takes split_name as an argument can be used to retrieve **DataSetSplit** with exact name. 
<div align="right"><a href="#top">BackToTheTop</a></div>

In [None]:
# list all splits for a particular dataset version which match the search criteria
ds_version_transformed.search_splits(split_author = user_name)

In [None]:
# get a split object for a particular dataset version which can be used in training models
split_obj = ds_version_transformed.split(split_name = dataset_split_name)

NOTE: The **_DataSetSplit_** object is used for training models.

## **_<font color="green">Congratulations! You have completed the tutorial successfully.</font>_**
 