## Introduction


TensorFlow's `tf.data` module exposes a fast API for building data pipelines that feed ML models. To use it, we convert our data to a `tf.data.Dataset`; the easiest way is `tf.data.Dataset.from_tensor_slices(tuple_of_numpy_arrays)`.
While this API is easy to use, it has the drawback of repeatedly converting NumPy arrays to tensors.

To alleviate this issue, we use TFRecords and store the data in `.tfrecord` files. TFRecords help, but note that going through `tf.train.Example` is not the most performant option, so we store tensors directly instead.
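
As an illustration of what "storing tensors directly" can look like, here is a minimal sketch that writes each array as one serialized tensor per record and reads it back. It is not fastnet's implementation (the internals of `fn.get_cifar10` are not shown in this notebook); the helper names `write_tensors` and `read_tensors`, the toy data, and the two-file layout are assumptions made for the example, and it assumes TensorFlow 2.x with eager execution.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Toy stand-in data (hypothetical): 8 random 32x32 RGB images and labels.
images = np.random.randint(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
labels = np.arange(8, dtype=np.int64)


def write_tensors(path, array):
    """Write each item of `array` as one serialized tensor per record."""
    with tf.io.TFRecordWriter(path) as writer:
        for item in array:
            writer.write(tf.io.serialize_tensor(item).numpy())


def read_tensors(path, dtype, shape):
    """Read records back as tensors, with no tf.train.Example parsing."""
    def parse(record):
        tensor = tf.io.parse_tensor(record, out_type=dtype)
        return tf.ensure_shape(tensor, shape)  # restore the static shape
    return tf.data.TFRecordDataset(path).map(parse)


tmp_dir = tempfile.mkdtemp()
image_path = os.path.join(tmp_dir, "images.tfrecord")
label_path = os.path.join(tmp_dir, "labels.tfrecord")
write_tensors(image_path, images)
write_tensors(label_path, labels)

# Pair images with labels again and batch them.
dataset = tf.data.Dataset.zip((
    read_tensors(image_path, tf.uint8, (32, 32, 3)),
    read_tensors(label_path, tf.int64, ()),
)).batch(4)

restored_images = np.concatenate([x.numpy() for x, _ in dataset])
restored_labels = np.concatenate([y.numpy() for _, y in dataset])
```

The point of the design is that the read path is a single `tf.io.parse_tensor` call per record, rather than a feature-dictionary parse followed by a decode step.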

A second performance bottleneck arises when image augmentations are applied to images one at a time rather than in batches. To alleviate this, we provide batched versions of a few augmentations so that augmentation is fast.
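
To make the idea of a batched augmentation concrete, the sketch below flips each image in a batch left-right with probability 0.5, using a single vectorized operation instead of a per-image call. It is an illustration only and not the code behind `fn.get_hflip_aug`; the function name `batch_random_hflip` is hypothetical, and the broadcasting form of `tf.where` assumes TensorFlow 2.x.

```python
import tensorflow as tf


def batch_random_hflip(images):
    """Flip each image of an [N, H, W, C] batch left-right with p = 0.5."""
    batch = tf.shape(images)[0]
    # One random decision per image, shaped to broadcast over H, W and C.
    flip = tf.random.uniform([batch, 1, 1, 1]) < 0.5
    flipped = tf.reverse(images, axis=[2])  # reverse the width axis
    return tf.where(flip, flipped, images)


# Usage: batch first, then augment the whole batch in one map call.
images = tf.random.uniform([8, 32, 32, 3])
dataset = (
    tf.data.Dataset.from_tensor_slices(images)
    .batch(4)
    .map(batch_random_hflip)
)
shapes = [tuple(batch.shape) for batch in dataset]
```

Because the random mask has one entry per image, every image in the batch still gets its own independent flip decision; only the dispatch overhead is shared.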

**We make two performance improvements:**
1. Storing tensors directly in `.tfrecord` files instead of using `tf.train.Example`
2. Adding batched image augmentations

**Results (1 Epoch with 3 Augmentations)**:

| **Method**                                | **Batched** | **Time (mean ± std. dev.)** |
|-------------------------------------------|-------------|-----------------------------|
| `tf.data.Dataset.from_tensor_slices`      | False     | 37.3 s ± 1.42 s     |
| `tf.data.Dataset.from_tensor_slices`      | True      | 9.83 s ± 569 ms     |
| `tf.train.Example`                        | False     | 39.6 s ± 2.66 s     |
| `tf.train.Example`                        | True      | 8.25 s ± 510 ms     |
| Storing Tensors Directly                  | False     | 38.7 s ± 4.41 s     |
| **Storing Tensors Directly (Our Method)** | **True**  | **7.68 s ± 763 ms** |
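
For a quick read of the table, the snippet below computes how much faster the batched run is than the unbatched one for each storage method. It uses only the mean times from the table and ignores the standard deviations, so treat the ratios as rough; the dictionary keys are labels chosen for this example.

```python
# Mean wall-clock times in seconds, copied from the results table above.
times = {
    "from_tensor_slices": {"unbatched": 37.3, "batched": 9.83},
    "tf.train.Example": {"unbatched": 39.6, "batched": 8.25},
    "direct tensors": {"unbatched": 38.7, "batched": 7.68},
}

# Ratio of unbatched to batched time for each storage method.
speedups = {
    method: t["unbatched"] / t["batched"] for method, t in times.items()
}

for method, speedup in speedups.items():
    print(f"{method}: {speedup:.1f}x faster when batched")
```

On these means, batching gives roughly a 3.8x to 5.0x speedup, which is much larger than the differences between the storage methods themselves.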

**Machine Details**
```
MacBook Pro (Retina, 13-inch, Early 2015)

  Model Name:	MacBook Pro
  Model Identifier:	MacBookPro12,1
  Processor Name:	Intel Core i7
  Processor Speed:	3.1 GHz
  Number of Processors:	1
  Total Number of Cores:	2
  L2 Cache (per Core):	256 KB
  L3 Cache:	4 MB
  Memory:	16 GB

```

In [1]:
import copy
import gc
import math
import os
import time
from importlib import reload

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.backend as K
from IPython.core.interactiveshell import InteractiveShell
from joblib import Parallel, delayed
from tensorflow.keras.backend import clear_session
from tqdm import tqdm_notebook as tqdm

import fastnet as fn

%matplotlib inline
%config InlineBackend.figure_format='retina'

# Show the value of every top-level expression in a cell, not just the last one.
InteractiveShell.ast_node_interactivity = "all"

print(np.__version__)


1.16.2


In [2]:
random_pad_crop = fn.get_first_argument_transformer(fn.get_random_pad_crop(4,4,32,32))
cutout = fn.get_first_argument_transformer(fn.get_cutout_eraser(-1.0,1.0))
hflip = fn.get_first_argument_transformer(fn.get_hflip_aug())

transformations = fn.combine_transformers(random_pad_crop,hflip,cutout)

In [3]:
# Use half of the logical cores; fall back to 1 if the count is unavailable.
jobs = max(1, (os.cpu_count() or 2) // 2)

## Benchmark 1: Reading from NumPy
Reading with `tf.data.Dataset.from_tensor_slices(numpy_arrays)`.

### No Batch

In [None]:
%%timeit
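# Note: loading CIFAR-10 and building the dataset are part of the timed cell.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
batch_size = 1  # batches of one image, i.e. augmentations run image by image

def mapper(x, y):
    # Cast the uint8 images to float32; labels pass through unchanged.
    return tf.cast(x, tf.float32), y

train = train.map(mapper).batch(batch_size)
imgs = 50000
batches = imgs // batch_size
cf10_ex = []
# Apply the combined augmentations and iterate over one epoch.
for x, y in train.map(transformations).take(batches):
    cf10_ex.append(x.shape)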

# 37.3 s ± 1.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

### Batch

In [None]:
%%timeit
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
batch_size = 512

def mapper(x,y):
    return tf.cast(x,tf.float32),y

train = train.map(mapper).batch(batch_size)
imgs = 50000
batches = imgs//batch_size
cf10_ex = []
for x,y in train.map(transformations).take(batches):
    cf10_ex.append(x.shape)

# 9.83 s ± 569 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

## Benchmark 2: `tf.train.Example`, Batch vs No Batch

### No Batch

In [4]:
%%timeit
batch_size = 1
imgs = 50000
batches = imgs//batch_size
cf10_ex = []
train,test = fn.get_cifar10_examples("cifar10",batch_size)
for x,y in train.map(transformations).take(batches):
    cf10_ex.append(x.shape)

39.6 s ± 2.66 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Batch

In [None]:
%%timeit
batch_size = 512
imgs = 50000
batches = imgs//batch_size
cf10_ex = []
train,test = fn.get_cifar10_examples("cifar10",batch_size)
for x,y in train.map(transformations).take(batches):
    cf10_ex.append(x.shape)

## Benchmark 3: Direct Tensor Storage (No Batch)

In [5]:
%%timeit
batch_size = 1
imgs = 50000
batches = imgs//batch_size
cf10_ex = []
train,test = fn.get_cifar10("cifar10",batch_size)
for x,y in train.map(transformations).take(batches):
    cf10_ex.append(x.shape)

38.7 s ± 4.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Benchmark 4: Direct Tensor Storage (Batch)

In [8]:
%%timeit
batch_size = 512
imgs = 50000
batches = imgs//batch_size
cf10_ex = []
train,test = fn.get_cifar10("cifar10",batch_size)
for x,y in train.map(transformations).take(batches):
    cf10_ex.append(tf.shape(x))

9.31 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The same benchmark, with the three augmentations applied as separate `map` calls instead of one combined transformation:


In [7]:
%%timeit
batch_size = 512
imgs = 50000
batches = imgs//batch_size
cf10_ex = []
train,test = fn.get_cifar10("cifar10",batch_size)
for x,y in train.map(random_pad_crop).map(cutout).map(hflip).take(batches):
    cf10_ex.append(tf.shape(x))

8.2 s ± 856 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
