
In [2]:
!nvidia-smi

Wed May 19 21:05:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Learning how to build data pipelines with `tf.data`

The `tf.data` API helps us build complex input pipelines from simple, reusable pieces.

For example, the pipeline
- for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
- for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.

The `tf.data` API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The `tf.data` API introduces a `tf.data.Dataset` abstraction that represents a sequence of elements, in which each element consists of one or more components. 

For example, in an image pipeline, an element might be a single training example, with a **pair of tensor components representing the image and its label.**

**The two distinct ways to create a dataset**: 
- A data **source** constructs a `Dataset` from data stored in memory or in one or more files. 
- A data **transformation** constructs a dataset from one or more `tf.data.Dataset` objects.
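
As a minimal sketch of the two routes listed above (the values and variable names here are made up for illustration), a source builds the first `Dataset` and each transformation returns a new one derived from it:

```python
import tensorflow as tf

# Source: build a Dataset from values held in memory.
source = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])

# Transformations: each call returns a new Dataset derived from the previous one.
doubled = source.map(lambda x: x * 2)
batched = doubled.batch(2)

# Collect the batches as plain Python lists so they are easy to inspect.
result = [batch.numpy().tolist() for batch in batched]
print(result)
```

Because transformations return datasets, they can be chained in any order that makes sense for the data.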


## Basic Mechanics 

- To create an input pipeline, we must start with a data source. 

- (In-memory data) For example, to construct a `Dataset` from data in memory (NumPy arrays, Python lists, etc.) we can use `tf.data.Dataset.from_tensors()` or `tf.data.Dataset.from_tensor_slices()`.
- (TFRecord file) If the input data is stored in a TFRecord format, we can then use `tf.data.TFRecordDataset()`
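
The TFRecord route can be sketched as a small round trip. This is only an illustration: the file name, the `"value"` feature key, and the integers written are all hypothetical, and the file goes to a temporary directory.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical file name, written to a temporary directory for this sketch.
record_path = os.path.join(tempfile.mkdtemp(), "example.tfrecord")

# Write three serialized tf.train.Example records, each holding one integer.
with tf.io.TFRecordWriter(record_path) as writer:
    for value in [8, 3, 0]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
        }))
        writer.write(example.SerializeToString())

# Read the file back; each element is a serialized record that must be parsed.
feature_spec = {"value": tf.io.FixedLenFeature([], tf.int64)}
records = tf.data.TFRecordDataset(record_path)
parsed = records.map(lambda rec: tf.io.parse_single_example(rec, feature_spec))

values = [int(item["value"].numpy()) for item in parsed]
print(values)
```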

> The `Dataset` object is a Python iterable (we can loop through). 

In [3]:
# Importing the things we need 
import tensorflow as tf
import pathlib 
import os
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np

In [4]:
# Creating a dummy data and using tf.data.Dataset.from_tensor_slices()

dum_list = [8 , 3, 0 , 8 , 2 , 1]
dataset = tf.data.Dataset.from_tensor_slices(dum_list)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

In [7]:
# Iterating and looking at what's inside the dataset we created 
for elem in dataset:
  print(elem.numpy())

8
3
0
8
2
1


In [8]:
# Trying out a real dataset (MNIST) 

(train_data , train_labels) , (test_data , test_labels) = tf.keras.datasets.mnist.load_data()

# Printing out the shapes of our mnist dataset 
train_data.shape , train_labels.shape , test_data.shape , test_labels.shape

((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))

Loading our data using `tf.data` to create a TensorSliceDataset object for our train data 

In [9]:
# Turning our train data into TensorSliceDataset object 
train_dataset_slices = tf.data.Dataset.from_tensor_slices((train_data , train_labels))

train_dataset_slices

<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.uint8, tf.uint8)>

Cool! Now we have packed our train images and labels into one whole Dataset. 

To view the labels, see: https://stackoverflow.com/questions/64132847/how-to-iterate-over-tensorslicedataset-object-in-tensorflow

In [10]:
train_dataset_slices.element_spec

(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None),
 TensorSpec(shape=(), dtype=tf.uint8, name=None))

Let's try the same, but this time with `tf.data.Dataset.from_tensors()`

In [11]:
# Using tf.data.Dataset.from_tensors() 

train_data_tensors = tf.data.Dataset.from_tensors((train_data , train_labels))

train_data_tensors

<TensorDataset shapes: ((60000, 28, 28), (60000,)), types: (tf.uint8, tf.uint8)>

In [12]:
# Looking into our dataset 
train_data_tensors.element_spec

(TensorSpec(shape=(60000, 28, 28), dtype=tf.uint8, name=None),
 TensorSpec(shape=(60000,), dtype=tf.uint8, name=None))
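
The two element specs point at the difference: `from_tensor_slices()` slices along the first axis and yields one element per example, while `from_tensors()` wraps the whole input as a single element. A small sketch with made-up arrays (5 blank "images" instead of the 60000 MNIST ones) makes the element counts explicit:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for train_data / train_labels, kept tiny on purpose.
images = np.zeros((5, 28, 28), dtype=np.uint8)
labels = np.arange(5, dtype=np.uint8)

sliced = tf.data.Dataset.from_tensor_slices((images, labels))
whole = tf.data.Dataset.from_tensors((images, labels))

# cardinality() reports how many elements each dataset contains.
n_sliced = int(sliced.cardinality().numpy())
n_whole = int(whole.cardinality().numpy())
print(n_sliced, n_whole)
```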

In [13]:
train_data_tensors.list_files

<function tensorflow.python.data.ops.dataset_ops.DatasetV2.list_files>

Now let's use `tf.data.Dataset.from_generator()`, which will help us create a `Dataset` object from a data generator object. 

Useful links
- [Converting ImageDatasetGenerator into dataset object](https://stackoverflow.com/questions/54606302/tf-data-dataset-from-tf-keras-preprocessing-image-imagedatagenerator-flow-from-d)
- [How to use during fit function](https://stackoverflow.com/questions/52636127/how-to-use-keras-generator-with-tf-data-api)
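
Before moving to a Keras generator, here is a minimal sketch of `from_generator()` with a plain Python generator. The generator, its shapes, and the random data are hypothetical; `output_signature` is used here to describe what each yielded batch looks like.

```python
import numpy as np
import tensorflow as tf

def batch_generator():
    # Hypothetical generator: yields 3 batches of 4 random "images" with one-hot labels.
    rng = np.random.default_rng(0)
    for _ in range(3):
        images = rng.random((4, 8, 8, 3)).astype(np.float32)
        labels = np.eye(2, dtype=np.float32)[rng.integers(0, 2, size=4)]
        yield images, labels

# The callable is passed, not called, so the dataset can restart the generator.
dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 8, 8, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
    ),
)

shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dataset]
print(shapes)
```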


In [14]:
# Loading in the cats and dogs dataset 

# data's url 
_URL = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'

# Extracting from the path
path_to_zip = tf.keras.utils.get_file('cats_and_dogs.zip' , origin = _URL , extract = True)
PATH = os.path.join(os.path.dirname(path_to_zip) , 'cats_and_dogs_filtered')


In [15]:
# What's inside PATH? 

os.listdir(PATH)

['train', 'vectorize.py', 'validation']

In [16]:
# Now setting up our train and validation directory (for images)
train_dir = os.path.join(PATH , 'train')
valid_dir = os.path.join(PATH , 'validation')

In [17]:
# What's inside our train_dir 
os.listdir(train_dir)

['cats', 'dogs']

In [18]:
# Looking into the cats folder 
os.listdir(f'{train_dir}/cats')[:10]

['cat.710.jpg',
 'cat.204.jpg',
 'cat.879.jpg',
 'cat.100.jpg',
 'cat.802.jpg',
 'cat.218.jpg',
 'cat.455.jpg',
 'cat.442.jpg',
 'cat.916.jpg',
 'cat.91.jpg']

In [129]:
# Using ImageDataGenerator 
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale = 1/255.)

# Getting the images from our directory and resizing them
train_gen = train_datagen.flow_from_directory(train_dir)

# For Validation 
valid_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale = 1/255.)
valid_gen  = valid_datagen.flow_from_directory(valid_dir)

Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.


In [130]:
train_gen.labels

array([0, 0, 0, ..., 1, 1, 1], dtype=int32)

In [131]:
# Gotta inspect our train_gen and collect some info that may help us in converting to Dataset object 

print(f'Target size of images: {train_gen.target_size}')
print(f'Number of classes: {train_gen.num_classes}')
print(f'Getting the class indices: {train_gen.class_indices}')

Target size of images: (256, 256)
Number of classes: 2
Getting the class indices: {'cats': 0, 'dogs': 1}


Alright! Now is the big game of converting our generator to Dataset. 

In [132]:
train_dataset_gen = tf.data.Dataset.from_generator(
    lambda: train_gen , 
    output_types = (tf.float32 , tf.int64), 
    output_shapes = ([None, 256, 256 ,3] , [None , 2])
)

valid_dataset_gen = tf.data.Dataset.from_generator(
    lambda: valid_gen, 
    output_types = (tf.float32 , tf.int64), 
    output_shapes = ([None , 256 , 256 , 3] , [None , 2])

)

train_dataset_gen  , valid_dataset_gen

(<FlatMapDataset shapes: ((None, 256, 256, 3), (None, 2)), types: (tf.float32, tf.int64)>,
 <FlatMapDataset shapes: ((None, 256, 256, 3), (None, 2)), types: (tf.float32, tf.int64)>)

In [133]:
train_dataset_gen.take(1)

<TakeDataset shapes: ((None, 256, 256, 3), (None, 2)), types: (tf.float32, tf.int64)>

In [134]:
# Just creating a simple model 
from tensorflow.keras import layers

inputs = layers.Input(shape = (160 , 160 , 3) , name = 'Input layer')

x = layers.Conv2D(3 , 2 , padding = 'same' , activation ='relu')(inputs)
x = layers.MaxPooling2D(3 , padding = 'same')(x)
#x = layers.BatchNormalization()(x)
x = layers.Conv2D(3 , 2 , padding = 'same' , activation ='relu')(x)
#x = layers.MaxPooling2D(3 , padding = 'same')(x)
x = layers.Dense(128 , activation= 'relu')(x)
x = layers.Conv2D(3 , 2 , padding = 'same' , activation ='relu')(x)
#x = layers.MaxPooling2D(3 , padding = 'same')(x)
x = layers.Dense(128 , activation= 'relu')(x)

outputs = layers.Dense(2 , activation = 'softmax' , name = 'Output_layer')(x)

# Packing into a model 
model = tf.keras.Model(inputs , outputs)

In [135]:
# Model summary 
model.summary()

Model: "model_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input layer (InputLayer)     [(None, 160, 160, 3)]     0         
_________________________________________________________________
conv2d_24 (Conv2D)           (None, 160, 160, 3)       39        
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 54, 54, 3)         0         
_________________________________________________________________
conv2d_25 (Conv2D)           (None, 54, 54, 3)         39        
_________________________________________________________________
dense_18 (Dense)             (None, 54, 54, 128)       512       
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 54, 54, 3)         1539      
_________________________________________________________________
dense_19 (Dense)             (None, 54, 54, 128)       512

In [136]:
model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy() , 
              optimizer = tf.keras.optimizers.Adam(), 
              metrics = ['accuracy'])

In [None]:
model.fit(train_dataset_gen , 
          epochs = 5)

In [111]:
train_dataset_gen.element_spec

(TensorSpec(shape=(None, 256, 256, 3), dtype=tf.float32, name=None),
 TensorSpec(shape=(None, 2), dtype=tf.int64, name=None))

Extracting images and labels from our dataset object. 

Useful link: https://stackoverflow.com/questions/56226621/how-to-extract-data-labels-back-from-tensorflow-dataset

In [125]:
# Extracting images and labels from our dataset object
for images , labels in train_dataset_gen.take(1):
   sample_images = images 
   sample_labels = labels




In [113]:
len(sample_images) , len(sample_labels)

(32, 32)

In [61]:
# Checking the image 
sample_images[:1]

<tf.Tensor: shape=(1, 256, 256, 3), dtype=float32, numpy=
array([[[[0.12156864, 0.16862746, 0.21568629],
         [0.12156864, 0.16862746, 0.21568629],
         [0.12156864, 0.16862746, 0.21568629],
         ...,
         [0.5529412 , 0.5176471 , 0.45882356],
         [0.5529412 , 0.5176471 , 0.45882356],
         [0.5647059 , 0.5294118 , 0.47058827]],

        [[0.1137255 , 0.16078432, 0.20784315],
         [0.10980393, 0.15686275, 0.20392159],
         [0.10980393, 0.15686275, 0.20392159],
         ...,
         [0.5529412 , 0.5176471 , 0.45882356],
         [0.5529412 , 0.5176471 , 0.45882356],
         [0.56078434, 0.5254902 , 0.4666667 ]],

        [[0.1137255 , 0.16078432, 0.20784315],
         [0.10980393, 0.15686275, 0.20392159],
         [0.10980393, 0.15686275, 0.20392159],
         ...,
         [0.5529412 , 0.5176471 , 0.45882356],
         [0.5529412 , 0.5176471 , 0.45882356],
         [0.56078434, 0.5254902 , 0.4666667 ]],

        ...,

        [[0.8313726 , 0.7607844 , 

In [42]:
# Checking our labels 
sample_labels[:10]

<tf.Tensor: shape=(10, 2), dtype=int32, numpy=
array([[0, 1],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0]], dtype=int32)>

In [None]:
# Applying the same on the whole dataset 
#for images , labels in train_dataset_gen.take(-1):
#  train_images = images
#  train_labels = labels

# train_images , train_labels = tuple(zip(*train_dataset_gen))

# The for loop takes *infinitely* long because the Keras generator never stops yielding batches 
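
A Keras image generator keeps looping over the data, so a dataset built from it has no end and `take(-1)` never finishes. One way around this is to bound the iteration with `take(n)`. The sketch below uses a hypothetical endless generator as a stand-in; with the real generator, `len(train_gen)` would give the number of batches in one pass.

```python
import itertools
import numpy as np
import tensorflow as tf

def endless_batches():
    # Stand-in for a Keras generator: never raises StopIteration.
    for step in itertools.count():
        yield np.full((2, 4), step, dtype=np.float32)

dataset = tf.data.Dataset.from_generator(
    endless_batches,
    output_signature=tf.TensorSpec(shape=(None, 4), dtype=tf.float32),
)

# Assumed number of batches in one pass; with the Keras generator: len(train_gen).
steps_per_epoch = 3

# take() bounds the otherwise endless stream, so the loop terminates.
batches = [batch.numpy() for batch in dataset.take(steps_per_epoch)]
all_rows = np.concatenate(batches, axis=0)
print(all_rows.shape)
```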

In [137]:
def preprocess_func(image , label):
  image = tf.image.resize(image , [224 , 224])
  return tf.cast(image , tf.float32) , label

In [138]:
# Map preprocess function to train and valid 
train_dataset_gen = train_dataset_gen.map(map_func=preprocess_func , num_parallel_calls=tf.data.AUTOTUNE)
#train_dataset_gen = train_dataset_gen.shuffle(buffer_size = 1000).batch(batch_size = 32).prefetch(buffer_size = tf.data.AUTOTUNE)

valid_dataset_gen = valid_dataset_gen.map(map_func=preprocess_func , num_parallel_calls=tf.data.AUTOTUNE)
#valid_dataset_gen = valid_dataset_gen.batch(batch_size = 32).prefetch(buffer_size = tf.data.AUTOTUNE)
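
The commented-out lines chain `shuffle`, `batch`, and `prefetch` after `map`. A sketch of that chain on a small per-example dataset is shown below; the blank images and labels are made up. Note that the generator-backed datasets in this notebook already yield whole batches, so calling `.batch()` on them again would add an extra leading dimension.

```python
import tensorflow as tf

def preprocess(image, label):
    # Resize one image to 224x224; tf.image.resize returns float32.
    image = tf.image.resize(image, [224, 224])
    return tf.cast(image, tf.float32), label

# Hypothetical per-example data: 6 blank images with one-hot labels.
images = tf.zeros((6, 256, 256, 3), dtype=tf.uint8)
labels = tf.one_hot([0, 1, 0, 1, 1, 0], depth=2)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=6)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

batch_shapes = [tuple(x.shape) for x, _ in dataset]
print(batch_shapes)
```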



In [139]:
train_dataset_gen , valid_dataset_gen

(<ParallelMapDataset shapes: ((None, 224, 224, 3), (None, 2)), types: (tf.float32, tf.int64)>,
 <ParallelMapDataset shapes: ((None, 224, 224, 3), (None, 2)), types: (tf.float32, tf.int64)>)

In [None]:
model.fit(train_dataset_gen , 
          epochs = 5)

In [70]:
train_dataset_gen.class_names

AttributeError: ignored

In [140]:
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

# Create base model
input_shape = (256, 256, 3)
base_model = tf.keras.applications.EfficientNetB0(include_top=False)
base_model.trainable = False # freeze base model layers

# Create Functional model 
inputs = layers.Input(shape=input_shape, name="input_layer")
# Note: EfficientNetBX models have rescaling built-in but if your model didn't you could have a layer like below
# x = preprocessing.Rescaling(1./255)(x)
x = base_model(inputs, training=False) # set base_model to inference mode only
x = layers.GlobalAveragePooling2D(name="pooling_layer")(x)
#x = layers.Dense(2)(x) # want one output neuron per class 
# Separate activation of output layer so we can output float32 activations
outputs = layers.Dense(2, activation="softmax")(x) 
model = tf.keras.Model(inputs, outputs)

# Compile the model
model.compile(loss="sparse_categorical_crossentropy", # Use sparse_categorical_crossentropy when labels are *not* one-hot
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

In [141]:
model.summary()

Model: "model_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (InputLayer)     [(None, 256, 256, 3)]     0         
_________________________________________________________________
efficientnetb0 (Functional)  (None, None, None, 1280)  4049571   
_________________________________________________________________
pooling_layer (GlobalAverage (None, 1280)              0         
_________________________________________________________________
dense_20 (Dense)             (None, 2)                 2562      
Total params: 4,052,133
Trainable params: 2,562
Non-trainable params: 4,049,571
_________________________________________________________________


In [142]:
# Compile the model 
model.compile(loss = tf.keras.losses.CategoricalCrossentropy() , 
              optimizer = tf.keras.optimizers.Adam() , 
              metrics = ['accuracy'])

In [143]:
# Fit the model 
history = model.fit(train_dataset_gen , 
                    epochs = 3 , 
                    steps_per_epoch = len(train_dataset_gen) , 
                    validation_data = valid_dataset_gen , 
                    validation_steps = int(0.15 * len(valid_dataset_gen)) 
                    )

TypeError: ignored
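
The `TypeError` most likely comes from `len(train_dataset_gen)`: a dataset built with `from_generator()` has unknown cardinality, so `len()` cannot be computed. The sketch below reproduces this with a hypothetical three-batch generator and shows one possible workaround, asserting the cardinality when the batch count is known from elsewhere. With the Keras generators, `len(train_gen)` and `len(valid_gen)` give the number of batches and could be passed as `steps_per_epoch` and `validation_steps` instead.

```python
import numpy as np
import tensorflow as tf

def three_batches():
    # Hypothetical finite generator: 3 batches of shape (2, 4).
    for _ in range(3):
        yield np.zeros((2, 4), dtype=np.float32)

dataset = tf.data.Dataset.from_generator(
    three_batches,
    output_signature=tf.TensorSpec(shape=(2, 4), dtype=tf.float32),
)

# tf.data cannot know how many items a Python generator will yield.
is_unknown = int(dataset.cardinality().numpy()) == tf.data.UNKNOWN_CARDINALITY

try:
    len(dataset)
    len_failed = False
except TypeError:
    len_failed = True

# If the count is known from elsewhere, it can be attached to the dataset.
sized = dataset.apply(tf.data.experimental.assert_cardinality(3))
n_batches = len(sized)
print(is_unknown, len_failed, n_batches)
```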