Book 3

In [2]:
import sys
print(sys.executable)

/media/HHD_2TB/baurai/aditya_vm/bin/python3.6


In [10]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from functools import partial
import tensorflow as tf

print("Python version used = ", sys.version)
print("Numpy version = ", np.__version__)
print("Tensorflow version = ", tf.__version__)

Python version used =  3.6.9 (default, Oct  9 2020, 08:58:16) 
[GCC 6.3.0 20170516]
Numpy version =  1.19.5
Tensorflow version =  2.4.0


In [4]:
config = {
    "IMAGE_SIZE" : (512, 512),
    "AUTOTUNE" : tf.data.experimental.AUTOTUNE,
    "BATCH_SIZE" : 16
}

In [5]:
training_filenames = tf.io.gfile.glob("tfrec_binary_classification/train*.tfrec")
validation_filenames = tf.io.gfile.glob("tfrec_binary_classification/val*.tfrec")
testing_filenames = tf.io.gfile.glob("tfrec_binary_classification/test*.tfrec")

print("Number of training TFRec files = ", len(training_filenames))
print("Number of validation TFRec files = ", len(validation_filenames))
print("Number of testing TFRec files = ", len(testing_filenames))

Number of training TFRec files =  94
Number of validation TFRec files =  21
Number of testing TFRec files =  21


# Define The Loading Pipeline : 

* **load_dataset** : This function is called with the training, validation, or testing filenames. If the dataset is unlabeled, pass labeled = False; the ordered argument controls whether records are read in their original order. For training, **we disable ordering to speed up reading**. The function returns a dataset of **(image, label)** pairs **if labeled = True** was passed, otherwise **(image, image_name)** pairs. How it does that is shown later in the notebook.
* **image_augmentation** : With the images loaded into the dataset, we define and apply an image_augmentation function that transforms them before they are used further.
* **repeat( )** : Repeats the dataset so that each original element is seen count times. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
* **shuffle( )** : Randomly shuffles the elements of the dataset. Internally, the dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing each selected element with a new one. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required. *For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements. Once an element is selected, its space in the buffer is taken by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer*.
* **batch( )** : Groups consecutive elements into batches of BATCH_SIZE.
* **prefetch( )** : Most dataset input pipelines should end with a call to prefetch. **This allows later elements to be prepared while the current element is being processed. This often improves latency and throughput, at the cost of using additional memory to store prefetched elements**.
* **cache( )** : The first time the dataset is iterated over, its elements are cached either in the specified file or in memory. Subsequent iterations use the cached data. When caching to a file, the cached data persists across runs, so in a later run even the first iteration reads from the cache file.
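
The buffer mechanism described for shuffle( ) can be illustrated without TensorFlow. The sketch below is a plain-Python simulation of the idea, not TensorFlow's actual implementation; the function name buffered_shuffle and its seed argument are made up for this illustration.

```python
import random

def buffered_shuffle(elements, buffer_size, seed=None):
    """Yield elements in shuffled order using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    source = iter(elements)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element and put the next source element in its slot.
    for item in source:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # Source exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a buffer of 3, the first element emitted can only be 0, 1, or 2, which is why a small buffer gives only a local shuffle and a buffer as large as the dataset is needed for a uniform one.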

In [7]:
def get_training_dataset() : 
    dataset = load_dataset(training_filenames, labeled = True, ordered = False)
    dataset = dataset.map(image_augmentation, num_parallel_calls = config["AUTOTUNE"])
    dataset = dataset.repeat()
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(config["BATCH_SIZE"])
    dataset = dataset.prefetch(config["AUTOTUNE"])
    return dataset

In [8]:
def get_validation_dataset() :
    dataset = load_dataset(validation_filenames, labeled = True, ordered = True)
    dataset = dataset.batch(config["BATCH_SIZE"])
    dataset = dataset.cache()
    dataset = dataset.prefetch(config["AUTOTUNE"])
    return dataset

In [9]:
def get_testing_dataset() : 
    dataset = load_dataset(testing_filenames, labeled = True, ordered = True)
    dataset = dataset.batch(config["BATCH_SIZE"])
    dataset = dataset.cache()
    dataset = dataset.prefetch(config["AUTOTUNE"])
    return dataset
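
The call order used in the three functions above can be tried on a toy dataset. This is a minimal sketch that uses the integers 0 to 9 in place of images; repeat(2) is an assumption made here so that the example terminates, whereas the training pipeline repeats indefinitely.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Toy stand-in for the image dataset: the integers 0..9.
toy = tf.data.Dataset.range(10)

# Training-style chain: repeat, shuffle, batch, prefetch.
train_like = toy.repeat(2).shuffle(4).batch(4).prefetch(AUTOTUNE)
train_batches = [batch.tolist() for batch in train_like.as_numpy_iterator()]

# Evaluation-style chain: batch, cache, prefetch (element order is preserved).
eval_like = toy.batch(4).cache().prefetch(AUTOTUNE)
eval_batches = [batch.tolist() for batch in eval_like.as_numpy_iterator()]

print(train_batches)
print(eval_batches)
```

Because repeat comes before shuffle in the training chain, elements from consecutive passes over the data can be mixed inside the shuffle buffer.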

# Structuring Functions :
These functions will help in overall structuring of the pipeline.

Default values of the arguments :

* **ordered as False** - For high performance we read from multiple files in parallel and do not require the records to arrive in a deterministic order.

* **labeled as True** - The choice of default is arbitrary. Pass False when the dataset has no labels.

*Role of partial( ) functools :*

partial( ) takes a function together with some of its arguments and returns an object (called a partial object). **When invoked, this object behaves like the original function with those arguments already filled in**, so the caller only supplies the remaining ones. This matters here because map( ) passes only the serialized example to the function it is given.

So, when we have labels, the partial object is created with labeled = True. When we don't, it is created with labeled = False.
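
A small standalone illustration of partial( ) may help. The function read_record below is a hypothetical stand-in for a parser that takes an extra flag; it is not the notebook's read_tfrecord.

```python
from functools import partial

def read_record(example, labeled):
    """Hypothetical parser: returns a different pair depending on the flag."""
    if labeled:
        return example, "label"
    return example, "image_name"

# Pre-fill the keyword argument; the result takes only the example.
parse_labeled = partial(read_record, labeled=True)
parse_unlabeled = partial(read_record, labeled=False)

# map() supplies one argument per call, which is all the partial object needs.
results = list(map(parse_labeled, ["a", "b"]))
print(results)
print(parse_unlabeled("x"))
```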

In [11]:
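def load_dataset(filenames, labeled = True, ordered = False):
    ignore_order = tf.data.Options()
    if not ordered : 
        ignore_order.experimental_deterministic = False # disable order
    # read from multiple TFRecord files in parallel
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads = config["AUTOTUNE"])
    dataset = dataset.with_options(ignore_order) # apply the options, otherwise they have no effect
    # parse each serialized example into (image, label) or (image, image_name)
    dataset = dataset.map(partial(read_tfrecord, labeled = labeled), num_parallel_calls = config["AUTOTUNE"])
    return dataset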

The following functions are designed to match the feature mapping stored in our TFRecord files.

In [13]:
def read_tfrecord(example, labeled) : 
    if labeled == True : 
        tfrecord_format = {
            "image" : tf.io.FixedLenFeature([], tf.string),
            "label" : tf.io.FixedLenFeature([], tf.int64)
        }
    else:
        tfrecord_format = {
            "image" : tf.io.FixedLenFeature([], tf.string),
            "image_name" : tf.io.FixedLenFeature([], tf.string)
        }
    example = tf.io.parse_single_example(example, tfrecord_format)
    image = decode_image(example["image"])
    
    if labeled == True : 
        label = tf.cast(example["label"], tf.int32)
        return image, label
    else:
        image_name = example["image_name"]
        return image, image_name
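
To see how such a feature mapping lines up with a serialized record, one example can be written and parsed back in memory. This is a sketch under assumptions: the 8x8 black image and the label value 1 are invented for the illustration, and only the feature names "image" and "label" come from the notebook.

```python
import tensorflow as tf

# Hypothetical tiny image standing in for a real sample.
pixels = tf.zeros([8, 8, 3], dtype=tf.uint8)
jpeg_bytes = tf.io.encode_jpeg(pixels).numpy()

# Build one serialized example with the same feature names as the labeled format.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))
serialized = example.SerializeToString()

# Parse it back with a matching feature description.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, feature_spec)
decoded = tf.image.decode_jpeg(parsed["image"], channels=3)
label = int(parsed["label"].numpy())
print(decoded.shape, label)
```

If a name or dtype in the feature description does not match what was written, parsing fails, which is why the description has to follow the mapping used when the TFRecord files were created.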

In [15]:
def decode_image(image) : 
    image = tf.image.decode_jpeg(image, channels = 3) # decode JPEG-encoded image to uint8 tensor
    image = tf.cast(image, tf.float32) 
    image = image/255.0 
    image = tf.reshape(image, [config["IMAGE_SIZE"][0], config["IMAGE_SIZE"][1], 3]) # reshapes tensor structure
    return image

In [None]:
def image_augmentation(image, label) : 
    image = tf.image.resize(image, config["IMAGE_SIZE"]) # resize to the configured (height, width)