<a href="https://colab.research.google.com/github/activeloopai/examples/blob/main/colabs/Data_Processing_Using_Parallel_Computing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Data Processing Using Parallel Computing***

#### [Step 7](https://docs.activeloop.ai/getting-started/parallel-computing) in the [Getting Started Guide](https://docs.activeloop.ai/getting-started) highlights how `hub.compute` can be used to rapidly upload datasets. This tutorial goes further and shows how parallel computing can be used to process datasets.

## Install Hub

In [None]:
from IPython.display import clear_output
!pip3 install hub
clear_output()

In [None]:
# IMPORTANT - Please restart your Colab runtime after installing Hub!
# This is a Colab-specific issue that prevents PIL from working properly.
import os
os.kill(os.getpid(), 9)

## Transformations on New Datasets

Computer vision applications often require users to process and transform their data as part of their workflows. For example, you may apply perspective transforms, resize images, or adjust their coloring. In this example, a flipped version of the MNIST dataset is created, which may be useful for training a model that identifies text in scenes where the camera orientation is unknown.

The first step to creating a flipped version of the MNIST dataset is to define a function that will flip the dataset images.

In [None]:
import hub
from PIL import Image
import numpy as np

@hub.compute
def flip_vertical(sample_in, sample_out):
    # The first two arguments are always the default arguments:
    #     1st argument: an element of the input iterable (list, dataset, array, ...)
    #     2nd argument: the dataset sample that the outputs are appended to
    
    # Append the label and image to the output sample
    sample_out.labels.append(sample_in.labels.numpy())
    sample_out.images.append(np.flip(sample_in.images.numpy(), axis = 0))
    
    return sample_out
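
To make the flip direction concrete, here is a minimal NumPy-only sketch on a toy 2x3 array (the values are made up for illustration). With `axis=0`, `np.flip` reverses the order of the rows, which turns an image upside down; `axis=1` would mirror it left to right instead.

```python
import numpy as np

# Toy "image" with 2 rows and 3 columns (illustrative values only)
image = np.array([[1, 2, 3],
                  [4, 5, 6]], dtype=np.uint8)

# axis=0 reverses the row order: the image is turned upside down
flipped = np.flip(image, axis=0)

# axis=1 reverses the column order: the image is mirrored left to right
mirrored = np.flip(image, axis=1)

print(flipped)
print(mirrored)
```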

Next, the existing MNIST dataset is loaded, and `hub.like` is used to create an empty dataset with the same tensor structure.

In [None]:
ds_mnist = hub.load('hub://activeloop/mnist-train')

# overwrite=True makes this cell re-runnable
ds_mnist_flipped = hub.like('./mnist_flipped', ds_mnist, overwrite = True)

Finally, the flipping operation is evaluated for the first 100 elements in the input dataset `ds_mnist`, and the result is automatically stored in `ds_mnist_flipped`.

In [None]:
flip_vertical().eval(ds_mnist[0:100], ds_mnist_flipped, num_workers = 2)
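
As a rough mental model of what `num_workers = 2` implies, the sketch below spreads the same kind of per-sample work across two workers using Python's standard `concurrent.futures` module. It is only an illustration under assumptions: `flip_sample` and the synthetic 28x28 arrays are hypothetical stand-ins, and Hub's own scheduling and storage logic is not shown and may work differently.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def flip_sample(image):
    """Hypothetical per-sample transform: flip one image upside down."""
    return np.flip(image, axis=0)


# 100 synthetic 28x28 grayscale images standing in for the dataset slice
samples = [
    (np.arange(28 * 28).reshape(28, 28) + i).astype(np.uint8)
    for i in range(100)
]

# Two workers share the samples; map() returns results in input order
with ThreadPoolExecutor(max_workers=2) as pool:
    flipped_samples = list(pool.map(flip_sample, samples))

print(len(flipped_samples))
```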

Let's compare an original image with its flipped counterpart:

In [None]:
Image.fromarray(ds_mnist.images[0].numpy())

In [None]:
Image.fromarray(ds_mnist_flipped.images[0].numpy())

## Transformations on Existing Datasets

In the previous example, a new dataset was created while performing a transformation. In this example, a transformation is used to modify an existing dataset. 

First, download and unzip the small classification dataset below, called `animals`.

In [None]:
# Download dataset
from IPython.display import clear_output
!wget https://firebasestorage.googleapis.com/v0/b/gitbook-28427.appspot.com/o/assets%2F-M_MXHpa1Cq7qojD2u_r%2F-MbI7YlHiBJg6Fg-HsOf%2F-MbIUlXZn7EYdgDNncOI%2Fanimals.zip?alt=media&token=c491c2cb-7f8b-4b23-9617-a843d38ac611
clear_output()

In [None]:
# Unzip to './animals' folder
!unzip -qq /content/assets%2F-M_MXHpa1Cq7qojD2u_r%2F-MbI7YlHiBJg6Fg-HsOf%2F-MbIUlXZn7EYdgDNncOI%2Fanimals.zip?alt=media

Next, use `hub.ingest` to automatically convert this image classification dataset into Hub format and save it in `./animals_hub`.

In [None]:
ds = hub.ingest('./animals', './animals_hub') # Creates the dataset
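
The sketch below illustrates the kind of folder layout that image classification datasets commonly use, assuming one subfolder per class. It builds a throwaway directory with hypothetical class and file names, then derives integer labels from the folder names. Whether `hub.ingest` assigns labels in exactly this order is an assumption, so inspect the ingested dataset to confirm.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: one subfolder per class, images inside each subfolder
layout = {"cats": ["a.jpg", "b.jpg"], "dogs": ["c.jpg"]}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "animals_example"
    for class_name, file_names in layout.items():
        class_dir = root / class_name
        class_dir.mkdir(parents=True)
        for file_name in file_names:
            (class_dir / file_name).touch()  # empty placeholder files

    # Derive class names from the subfolders and map each one to an integer
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    label_of = {name: index for index, name in enumerate(class_names)}

    # Pair every file with the label of the folder that contains it
    samples = sorted(
        (path.name, label_of[path.parent.name])
        for path in root.glob("*/*.jpg")
    )

print(class_names)
print(samples)
```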

The first image in the dataset is a picture of a cat:

In [None]:
Image.fromarray(ds.images[0].numpy())

The images in the dataset can now be flipped by evaluating the `flip_vertical()` transformation function from the previous example. If a second dataset is not specified as an input to `.eval()`, the transformation is applied in place to the input dataset.

In [None]:
flip_vertical().eval(ds, num_workers = 2)

The picture of the cat is now flipped:

In [None]:
Image.fromarray(ds.images[0].numpy())
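
The difference between the two ways of calling `.eval()` used so far can be pictured with plain Python lists. In this sketch, `apply_transform` is a hypothetical helper and not part of Hub: when no output is given it overwrites the input samples, otherwise it appends to the output and leaves the input untouched.

```python
import numpy as np


def flip_rows(image):
    """Flip one image upside down."""
    return np.flip(image, axis=0)


def apply_transform(fn, data_in, data_out=None):
    """Hypothetical helper mimicking the two evaluation modes."""
    if data_out is None:
        # No output given: replace the input samples in place
        data_in[:] = [fn(sample) for sample in data_in]
        return data_in
    # Output given: append results there and leave the input as it was
    data_out.extend(fn(sample) for sample in data_in)
    return data_out


source = [np.array([[1, 2], [3, 4]])]
target = []

apply_transform(flip_rows, source, target)   # separate output
source_unchanged = source[0].tolist()

apply_transform(flip_rows, source)           # in place
source_after_in_place = source[0].tolist()

print(source_unchanged, source_after_in_place)
```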

## Dataset Processing Pipelines

To modularize your dataset processing, it is often helpful to create functions for specific data processing tasks and combine them into pipelines that transform your data end-to-end. This example builds a pipeline from the `flip_vertical` function above and the `resize` function below.

In [None]:
@hub.compute
def resize(sample_in, sample_out, new_size):
    # The first two arguments are always the default arguments:
    #     1st argument: an element of the input iterable (list, dataset, array, ...)
    #     2nd argument: the dataset sample that the outputs are appended to
    # The third argument is the required size of the output images
    
    # Append the label and image to the output sample
    sample_out.labels.append(sample_in.labels.numpy())
    sample_out.images.append(np.array(Image.fromarray(sample_in.images.numpy()).resize(new_size)))
    
    return sample_out
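
One detail worth keeping in mind when choosing `new_size`: PIL's `Image.resize` takes the size as `(width, height)`, whereas NumPy array shapes are ordered `(height, width)`. The sketch below uses a blank 28x28 array as a stand-in for an MNIST image and a deliberately non-square target size to make the ordering visible.

```python
import numpy as np
from PIL import Image

# Blank 28x28 grayscale array standing in for an MNIST image
image = np.zeros((28, 28), dtype=np.uint8)

# Request width=64, height=32 from PIL
resized = np.array(Image.fromarray(image).resize((64, 32)))

# NumPy reports (height, width), so the order appears swapped
resized_shape = resized.shape
print(resized_shape)
```

With a square size such as `(64, 64)` the ordering makes no difference.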

Functions decorated using `hub.compute` can be easily combined into pipelines using `hub.compose`. Required arguments for the functions must be passed into the pipeline in this step:

In [None]:
pipeline = hub.compose([flip_vertical(), resize(new_size = (64,64))])
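
Conceptually, a pipeline applies its steps in order, feeding the output of one step into the next. The following sketch shows that idea in plain Python; `compose_steps`, `flip_rows`, and `crop` are hypothetical names chosen for illustration, and this is not how `hub.compose` is implemented. Extra arguments are bound up front with `functools.partial`, mirroring how `new_size` is supplied when the pipeline is built.

```python
from functools import partial, reduce

import numpy as np


def flip_rows(image):
    """Step 1: flip the image upside down."""
    return np.flip(image, axis=0)


def crop(image, new_size):
    """Step 2 (simple stand-in for resizing): keep the top-left corner."""
    height, width = new_size
    return image[:height, :width]


def compose_steps(steps):
    """Hypothetical helper: chain single-argument steps left to right."""
    def run(sample):
        return reduce(lambda value, step: step(value), steps, sample)
    return run


# Bind the extra argument when the pipeline is defined
toy_pipeline = compose_steps([flip_rows, partial(crop, new_size=(2, 2))])

image = np.arange(12).reshape(3, 4)
result = toy_pipeline(image)
print(result)
```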

Just like in the single-function example above, the input and output datasets are created first, and the pipeline is evaluated for the first 100 elements in the input dataset `ds_mnist`. The result is automatically stored in `ds_mnist_pipe`.

In [None]:
# overwrite=True makes this cell re-runnable
ds_mnist_pipe = hub.like('./mnist_pipeline', ds_mnist, overwrite = True)

In [None]:
pipeline.eval(ds_mnist[0:100], ds_mnist_pipe, num_workers = 2)

Let's check out the processed images:

In [None]:
Image.fromarray(ds_mnist.images[0].numpy())

In [None]:
Image.fromarray(ds_mnist_pipe.images[0].numpy())