# Map & Cache functionality and Eager vs Graph execution

This is not full coverage of tf.data.Dataset functionality or an in-depth treatment of Eager vs Graph processing.  I combined these topics because any function you pass to “map” runs in Graph mode, so it is beneficial to understand some of the differences as you apply the concepts.

I created this after spending several weeks thinking I had a code problem, when my actual problem was that I did not understand how “map”, “cache” and Graph/Eager execution work together.

tf.data Datasets are relatively new, and at first I wondered why I should migrate my code to use them.  One word: speed.  There is a great notebook that compares Keras generators with datasets.  You can load and run it yourself; the bottom line is that being able to cache and prefetch data and to apply tensor operations helps training run faster and simplifies the data pipeline.  Here are links to two notebooks: the first uses the Flowers dataset to show the speed differences, the second goes into more detail on parallelism:

https://www.tensorflow.org/tutorials/load_data/images

https://www.tensorflow.org/guide/data_performance

Eager execution:

If Eager execution is new to you, this is a simple notebook that covers the basics.

https://colab.research.google.com/github/zaidalyafeai/Notebooks/blob/master/Eager_Execution_Enabled.ipynb


### Setup for using Google Drive and normal includes

The notebook uses TensorFlow 2.x.  (Eager execution is enabled by default and we use the newer versions of tf.data.)

I use Notebooks with Colab and on my local workstation, so I need to separate some logic to make it easier to run in both locations.

I considered deleting the local logic and making a Colab-only version, but that is not "real world."  You usually have multiple environments, and this shows how I accommodate them; you might need something different.
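One possible alternative to setting a flag by hand is to detect the environment at runtime. This is only a sketch and not what the cells below do: `running_in_colab` and `LIB_PATHS` are hypothetical names, and the check simply asks whether the `google.colab` package can be found.

```python
import importlib.util

def running_in_colab() -> bool:
    """Return True when the google.colab package can be found (hypothetical helper)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # The parent "google" package is missing or has no usable spec
        return False

# Paths copied from this notebook; change them for your own environment
LIB_PATHS = {
    True: "/content/drive/My Drive/GitHub/MachineLearning/lib",
    False: "/Users/john/Documents/GitHub/MachineLearning/lib",
}

in_colab = running_in_colab()
lib_path = LIB_PATHS[in_colab]
print("Colab:", in_colab, "lib path:", lib_path)
```

The chosen path could then be passed to `sys.path.insert(0, lib_path)` in the same way the setup cell does.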

In [None]:
#"""
# Google Colab specific stuff....
from google.colab import drive
drive.mount('/content/drive')

import os
!ls "/content/drive/My Drive"

USING_COLLAB = True
# Force use of the 2.x version of TensorFlow
%tensorflow_version 2.x
#"""

In [None]:
# Setup sys.path to find MachineLearning lib directory

# Check if "USING_COLLAB" is defined, if yes, then we are using Colab, otherwise set to False
try: USING_COLLAB
except NameError: USING_COLLAB = False

%load_ext autoreload
%autoreload 2

# set path env var
import sys
if "MachineLearning" in sys.path[0]:
    pass
else:
    print(sys.path)
    if USING_COLLAB:
        sys.path.insert(0, '/content/drive/My Drive/GitHub/MachineLearning/lib') ##### CHANGE FOR SPECIFIC ENVIRONMENT
    else:
        sys.path.insert(0, '/Users/john/Documents/GitHub/MachineLearning/lib')  ##### CHANGE FOR SPECIFIC ENVIRONMENT
    
    print(sys.path)

In [None]:
# Normal includes...

from __future__ import absolute_import, division, print_function, unicode_literals

import os, sys, random, warnings, time, copy, csv, gc
import numpy as np 

import IPython.display as display
from PIL import Image

import matplotlib.pyplot as plt
%matplotlib inline

import cv2
from tqdm import tqdm_notebook, tnrange, tqdm
import pandas as pd

import tensorflow as tf
print(tf.__version__)

# This allows the runtime to decide how best to optimize CPU/GPU usage
AUTOTUNE = tf.data.experimental.AUTOTUNE

from TrainingUtils import *

#warnings.filterwarnings("ignore", category=DeprecationWarning)
#warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", "(Possibly )?corrupt EXIF data", UserWarning)

## General Setup

- Create a dictionary wrapped by a class for global values.  This is how I manage global vars in my notebooks.
- Load a couple of images that will be used to create a very simple dataset



In [None]:
# Set root directory path to data
if USING_COLLAB:
    ROOT_PATH = "/content/drive/My Drive/GitHub/MachineLearning/9-LibTest/Data"  ##### CHANGE FOR SPECIFIC ENVIRONMENT
else:
    ROOT_PATH = "/Users/john/Documents/GitHub/MachineLearning/9-LibTest/Data"  ##### CHANGE FOR SPECIFIC ENVIRONMENT
        
# Establish global dictionary
parms = GlobalParms(ROOT_PATH=ROOT_PATH,
                    TRAIN_DIR="CatDogLabeledVerySmall", 
                    NUM_CLASSES=2,
                    IMAGE_ROWS=256,
                    IMAGE_COLS=256,
                    IMAGE_CHANNELS=3,
                    BATCH_SIZE=4,
                    IMAGE_EXT=".jpg")

parms.print_contents()

In [None]:
# Create path list and class list    
images_list, sub_directories = load_file_names_labeled_subdir_Util(parms.TRAIN_PATH, 
                                                                   parms.IMAGE_EXT)

# Reduce the number of images from 12 to 2, makes it easier to show augmentation
del images_list[1:7]
del images_list[2:7]

images_list_len = len(images_list)
print("Number of images: ", images_list_len)

# Set the class names.
parms.set_class_names(sub_directories)
print("Classes: {}  Labels: {}  {}".format(parms.NUM_CLASSES, len(parms.CLASS_NAMES), parms.CLASS_NAMES) )


In [None]:
# Using the path, show the images that will be used
for image_path in images_list[:2]:
    print(image_path)
    display.display(Image.open(str(image_path)))

## Build an input pipeline

In [None]:
# Simple helper method to display batches of images with labels....        
def show_batch(image_batch, label_batch, number_to_show=25, r=5, c=5):
    show_number = min(number_to_show, parms.BATCH_SIZE)

    if show_number < 8: #if small number, then change row, col and figure size
        if parms.IMAGE_COLS > 64 or parms.IMAGE_ROWS > 64:
            plt.figure(figsize=(25,25)) 
        else:
            plt.figure(figsize=(10,10))  
        r = 4
        c = 2 
    else:
        plt.figure(figsize=(10,10))  

    if show_number == 1:
        image_batch = np.expand_dims(image_batch, axis=0)
        label_batch = np.expand_dims(label_batch, axis=0)

    for n in range(show_number):
        ax = plt.subplot(r,c,n+1)
        plt.imshow(tf.keras.preprocessing.image.array_to_img(image_batch[n]))
        plt.title(parms.CLASS_NAMES[np.argmax(label_batch[n])])
        plt.axis('off')


In [None]:
# Return a label based on the path of the image
def get_label(file_path: tf.Tensor) -> tf.Tensor:
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second to last is the class-directory
    return parts[-2] == parms.CLASS_NAMES

# Decode the image, convert to float, normalize by 255 and resize
def decode_img(image: tf.Tensor) -> tf.Tensor:
    # convert the compressed string to a 3D uint8 tensor
    image = tf.image.decode_jpeg(image, channels=parms.IMAGE_CHANNELS)
    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    image = tf.image.convert_image_dtype(image, parms.IMAGE_DTYPE)
    # resize the image to the desired size.
    return tf.image.resize(image, [parms.IMAGE_ROWS, parms.IMAGE_COLS])

# Perform any augmentations on just the "dog" image
def image_aug(image: tf.Tensor, file_path: tf.Tensor) -> tf.Tensor:
    # do any augmentations
    parts = tf.strings.split(file_path, os.path.sep)
    if parts[-2] == "01-Dog":
      #image = tf.image.rot90(image, 2)
      image = tf.image.random_flip_up_down(image)

    image = tf.clip_by_value(image, 0, 1)  # always clip back to 0, 1 before returning
    return image

# Method that will take a path and return the image and label
def process_path_eager(file_path: tf.Tensor) -> tf.Tensor:
    print("Eager (False), in Graph mode: ", tf.executing_eagerly() )

    label = get_label(file_path)
    # load the raw data from the file as a string
    image = tf.io.read_file(file_path)
    image = decode_img(image)
    # add any augmentations
    image = image_aug(image, file_path)
    return image, label

def process_path(file_path: tf.Tensor) -> tf.Tensor:
    label = get_label(file_path)
    # load the raw data from the file as a string
    image = tf.io.read_file(file_path)
    image = decode_img(image)
    # add any augmentations
    image = image_aug(image, file_path)
    return image, label

In [None]:
# Normal dataset input pipeline
def prepare_for_training(dataset, cache=True, shuffle_buffer_size=1000):

    # use `.cache(filename)` to cache preprocessing work for datasets that don't
    # fit in memory.
    if cache:
        if isinstance(cache, str):
            dataset = dataset.cache(cache) # Use a file to cache the images
        else:
            dataset = dataset.cache()  # Use memory to cache the images

    # Repeat forever
    dataset = dataset.repeat()

    # set the batch size
    dataset = dataset.batch(parms.BATCH_SIZE)

    return dataset

### Create dataset and normal mappings (positive case)

Pipeline Flow:

create dataset -> map "process_path_eager" -> repeat forever -> batch

This will also show when we are in Eager vs Graph execution.  Basic rule: any function passed to "map" is traced and run in Graph mode, even though the notebook code around it runs eagerly.  In Graph mode you are not able to evaluate a tensor, so using a method like ".numpy()" will not work.  All processing must be tensor based, or be wrapped in tf.py_function.
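A minimal sketch of that rule, separate from the image pipeline. The names `report_mode`, `eager_double` and `seen` are made up for this illustration; the dataset is just the numbers 0 to 2.

```python
import tensorflow as tf

seen = {}  # records which execution mode each function observed

def report_mode(x):
    # Runs while map() traces the function into a graph, so eager is off here
    seen["map"] = tf.executing_eagerly()
    return x * 2

def eager_double(x):
    # Called through tf.py_function, so x is an eager tensor and .numpy() works
    seen["py_function"] = tf.executing_eagerly()
    return x.numpy() * 2

ds = tf.data.Dataset.range(3)
graph_ds = ds.map(report_mode)
eager_ds = ds.map(lambda x: tf.py_function(eager_double, [x], tf.int64))

graph_values = [int(v.numpy()) for v in graph_ds]
eager_values = [int(v.numpy()) for v in eager_ds]
print(seen, graph_values, eager_values)
```

Both datasets yield the same values; the only difference is where the Python code runs. Calling `x.numpy()` inside `report_mode` would fail, because there `x` is a symbolic tensor.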


In [None]:
# Create Dataset from list of images
full_dataset = tf.data.Dataset.from_tensor_slices(np.array(images_list))

# Verify image paths were loaded and save one path for later in "some_image"
for f in full_dataset.take(2):
    some_image = f.numpy().decode("utf-8")
    print(f.numpy())
    
print("Some Image: ", some_image)

In [None]:
# Show we are in Eager execution
print("Outside of map, Eager (True): ", tf.executing_eagerly() )

# map training images to processing, includes any augmentation
full_dataset = full_dataset.map(process_path_eager, num_parallel_calls=AUTOTUNE)

# Verify the mapping worked
for image, label in full_dataset.take(1):
    print("Image shape: ", image.numpy().shape, np.max(image.numpy()), np.min(image.numpy()))
    print("Label: ", label.numpy())
    
# Ready to be used for training, cache is turned off
full_dataset = prepare_for_training(full_dataset, cache=False)


In [None]:
# Show the images, execute this cell multiple times to see the "dog" image, the up/down flip should be applied randomly
# Execute at least 4 times...
image_batch, label_batch = next(iter(full_dataset))
show_batch(image_batch.numpy(), label_batch.numpy())

### Add memory caching

Pipeline Flow:

create dataset -> map "process_path" -> cache -> repeat forever -> batch

Since cache is after the map that has image augmentation, the augmentation will only be done once and then saved in cache.
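The same effect can be seen without images. In this sketch `add_noise` is a hypothetical stand-in for a random augmentation; because the cache comes after it, the random values from the first full pass are what every later pass returns.

```python
import tensorflow as tf

def add_noise(x):
    # Stand-in for a random augmentation such as random_flip_up_down
    return tf.cast(x, tf.float32) + tf.random.uniform([])

# map -> cache: the random result of the first full pass is stored
ds = tf.data.Dataset.range(4).map(add_noise).cache()

first_pass = [float(v.numpy()) for v in ds]   # computes and fills the cache
second_pass = [float(v.numpy()) for v in ds]  # served from the cache
print(first_pass == second_pass)
```

Removing `.cache()` from the pipeline should make the two passes differ, since the random op would then run again on every pass.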


In [None]:
# Create Dataset from list of images
full_dataset = tf.data.Dataset.from_tensor_slices(np.array(images_list))

# map training images to processing, includes any augmentation
full_dataset = full_dataset.map(process_path, num_parallel_calls=AUTOTUNE)
    
# Ready to be used for training, cache is turned on
full_dataset = prepare_for_training(full_dataset, cache=True)


In [None]:
# Show the images, execute this cell multiple times, "dog" image will not change
# Execute at least 4 times...
image_batch, label_batch = next(iter(full_dataset))
show_batch(image_batch.numpy(), label_batch.numpy())

### Add file caching

Same impact as memory cache.

Pipeline Flow:

create dataset -> map "process_path" -> cache -> repeat forever -> batch

Since cache is after the map that has image augmentation, the augmentation will only be done once and then saved in cache.
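A small sketch of file caching on a toy dataset. The temporary directory and the `demo.tfcache` name are assumptions made for this example, and `add_noise` is again a hypothetical stand-in for a random augmentation.

```python
import glob
import os
import tempfile

import tensorflow as tf

def add_noise(x):
    # Stand-in for a random augmentation
    return tf.cast(x, tf.float32) + tf.random.uniform([])

cache_dir = tempfile.mkdtemp()
cache_prefix = os.path.join(cache_dir, "demo.tfcache")  # hypothetical file name

ds = tf.data.Dataset.range(4).map(add_noise).cache(cache_prefix)

first_pass = [float(v.numpy()) for v in ds]   # computes and writes the cache files
second_pass = [float(v.numpy()) for v in ds]  # read back from the cache files
cache_files = sorted(os.path.basename(p) for p in glob.glob(cache_prefix + "*"))
print(first_pass == second_pass, cache_files)
```

The string passed to `cache()` is used as a file name prefix, so expect to find one or more files starting with that name once the first pass has finished.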

In [None]:
# Create Dataset from list of images
full_dataset = tf.data.Dataset.from_tensor_slices(np.array(images_list))

# map training images to processing, includes any augmentation
full_dataset = full_dataset.map(process_path, num_parallel_calls=AUTOTUNE)
    
# Ready to be used for training, cache is turned on
full_dataset = prepare_for_training(full_dataset, cache="./play.tfcache")

In [None]:
# Show the images, execute this cell multiple times, "dog" image will not change
# Execute at least 4 times...
image_batch, label_batch = next(iter(full_dataset))
show_batch(image_batch.numpy(), label_batch.numpy())

### New Cache order, add Blur and random augmentation

Unlike the previous two pipelines, the cache now sits between two maps, so only the work done before the cache is saved.

Pipeline Flow:

create dataset -> map "process_path_blur" -> cache -> map "process_image_aug" -> repeat forever -> batch

Since cache is after the map that has the Blur augmentation, the Blur will only be done once and then saved in cache.  random_flip_up_down is mapped after the cache, so it is re-applied to every image (cat and dog) each time the data is read.
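Here is a toy version of that ordering, with numbers instead of images. `expensive_step` (standing in for the Blur) and `random_step` (standing in for the flip) are hypothetical names; the counter shows that the step before the cache runs only during the first pass.

```python
import tensorflow as tf

calls = {"expensive": 0}

def _count_and_scale(v):
    # Runs eagerly via tf.py_function, so we can count real executions
    calls["expensive"] += 1
    return v * 10

def expensive_step(x):
    # Stand-in for the one-time Blur step that sits before the cache
    return tf.py_function(_count_and_scale, [x], tf.int64)

def random_step(x):
    # Stand-in for the random flip that sits after the cache
    return tf.cast(x, tf.float32) + tf.random.uniform([])

ds = tf.data.Dataset.range(3).map(expensive_step).cache().map(random_step)

pass_one = [float(v.numpy()) for v in ds]
pass_two = [float(v.numpy()) for v in ds]
print(calls["expensive"], pass_one, pass_two)
```

After two passes over three elements the counter is 3, not 6, while the random part is recomputed on each pass.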

In [None]:
# These two methods take "prepare_for_training" and separate it into two methods

def cache_dataset(dataset, cache=True):
    if cache:
        if isinstance(cache, str):
            dataset = dataset.cache(cache)
        else:
            dataset = dataset.cache()
    return dataset


def prepare_for_training_no_cache(dataset, shuffle_buffer_size=1000):

    # Repeat forever
    dataset = dataset.repeat()

    # set the batch size
    dataset = dataset.batch(parms.BATCH_SIZE)

    return dataset

In [None]:
from skimage.filters import gaussian

def image_blur(image):
    if bool(np.random.choice([0, 1], p=[0, 1.0])):  # change p values as needed; p=[0, 1.0] is always True
        sigma_max = 4.0
        sigma = random.uniform(3., sigma_max)  # change range or remove if you want a fixed sigma value
        # This runs eagerly inside tf.py_function, so non-TensorFlow code such as skimage can be used
        image = tf.image.convert_image_dtype(image, dtype=tf.int32)
        # Note: scikit-image 0.19+ takes channel_axis=-1 in place of multichannel=True
        image = gaussian(image, sigma=sigma, multichannel=True)
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    return image

def process_path_blur(file_path: tf.Tensor) -> tf.Tensor:
    label = get_label(file_path)
    # load the raw data from the file as a string
    image = tf.io.read_file(file_path)
    image = decode_img(image)

    #######################################################
    # Blur using tf.py_function
    im_shape = image.shape
    [image,] = tf.py_function(image_blur, [image], [tf.float32])  #parms must be tensors
    image.set_shape(im_shape)
    #######################################################
    return image, label

def process_image_aug(image: tf.Tensor, label: tf.Tensor) -> tf.Tensor:
    # add any augmentations
    image = tf.image.random_flip_up_down(image)
    return image, label

In [None]:
# Create Dataset from list of images
full_dataset = tf.data.Dataset.from_tensor_slices(np.array(images_list))

# map training images to processing, includes any augmentation
full_dataset = full_dataset.map(process_path_blur, num_parallel_calls=AUTOTUNE)

# apply cache
full_dataset = cache_dataset(full_dataset, cache=True) # Saved in memory, same as using a temp file for cache

# map the random augmentation
full_dataset = full_dataset.map(process_image_aug, num_parallel_calls=AUTOTUNE)

# Ready to be used for training
full_dataset = prepare_for_training_no_cache(full_dataset)

In [None]:
# Show the images, execute this cell multiple times
# All images will have "Blur" applied
# "cat" or "dog" image will change as augmentation is applied after cache
# Execute at least 4 times...
image_batch, label_batch = next(iter(full_dataset))
show_batch(image_batch.numpy(), label_batch.numpy())

### Final Thoughts.....

https://stackoverflow.com/questions/50519343/how-to-cache-data-during-the-first-epoch-correctly-tensorflow-dataset

The implementation of the Dataset.cache() transformation is fairly simple: it builds up a list of the elements that pass through it as you iterate completely over it the first time, and it returns elements from that list on subsequent attempts to iterate over it. If the first pass only performs a partial pass over the data then the list is incomplete, and TensorFlow doesn't try to use the cached data, because it doesn't know whether the remaining elements will be needed, and in general it might need to reprocess all the preceding elements to compute the remaining elements.
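The behaviour described in that answer can be modelled in a few lines of plain Python. This is a toy model only, not TensorFlow's implementation; `SketchCache` and `source` are hypothetical names.

```python
class SketchCache:
    """Toy model: keep the elements only after one complete pass."""

    def __init__(self, source_factory):
        self._source_factory = source_factory  # callable returning a fresh iterator
        self._complete = None                  # filled after a full pass

    def __iter__(self):
        if self._complete is not None:
            # A full pass happened earlier: replay the stored elements
            yield from self._complete
            return
        staged = []
        for item in self._source_factory():
            staged.append(item)
            yield item
        # Only reached when the consumer exhausted the source
        self._complete = staged

produced = []  # every element the source actually had to compute

def source():
    for i in range(3):
        produced.append(i)
        yield i

cache = SketchCache(source)
partial = next(iter(cache))  # partial pass: nothing is kept
full = list(cache)           # full pass: recomputes everything, then stores it
again = list(cache)          # replayed from the stored list
print(partial, full, again, len(produced))
```

The source computes four elements in total: one for the abandoned partial pass and three for the full pass. The final pass computes nothing.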


https://www.wouterbulten.nl/blog/tech/data-augmentation-using-tensorflow-data-dataset/

