# **LargeDataset & Plot loss/accuracy & Augmentation**


```
#1. Dog vs Cat image classification
#2. 4 convolution layers with 32,64,128 and 128 convolutions
#3. train for 100 epochs to graph of loss and accuracy
```
Data

The 2,000 images used in this exercise are excerpted from the "Dogs vs. Cats" dataset available on Kaggle, which contains 25,000 images. Here, we use a subset of the full dataset to decrease training time for educational purposes.


**Before Augmentation**

In [1]:
import os
import zipfile
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import RMSprop
#data
!wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
    -O /tmp/cats_and_dogs_filtered.zip

#extract zip to the base directory
zip = zipfile.ZipFile('/tmp/cats_and_dogs_filtered.zip','r')
zip.extractall('/tmp')
zip.close()

#directory with train & validation data
train_dir = os.path.join('/tmp/cats_and_dogs_filtered','train')
validation_dir = os.path.join('/tmp/cats_and_dogs_filtered','validation')

"""
train_cats_dir = os.path.join(train_dir, 'cats')
train_dogs_dir = os.path.join(train_dir, 'dogs')
validation_cats_dir = os.path.join(validation_dir, 'cats')
validation_dogs_dir = os.path.join(validation_dir, 'dogs')

train_cat_fnames = os.listdir(train_cats_dir)
print(train_cat_fnames[:10])

train_dog_fnames = os.listdir(train_dogs_dir)
train_dog_fnames.sort()
print(train_dog_fnames[:10])

print('total training cat images:', len(os.listdir(train_cats_dir)))
print('total training dog images:', len(os.listdir(train_dogs_dir)))
print('total validation cat images:', len(os.listdir(validation_cats_dir)))
print('total validation dog images:', len(os.listdir(validation_dogs_dir)))
"""
#model
model = tf.keras.models.Sequential([ 
    # First convolution extracts 32 filters that are 3x3                                
    # image was resized to 150x150 pixels in preprocessing with 3 color channels
    # Convolution is followed by max-pooling layer with a 2x2 window
    tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),

    # Second convolution extracts 64 filters that are 3x3
    # Convolution is followed by max-pooling layer with a 2x2 window
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # Third convolution extracts 128 filters that are 3x3
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # Fourth convolution extracts 128 filters that are 3x3
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # Flatten feature map to a 1-dim so we can add fully connected layers
    tf.keras.layers.Flatten(),
    # Create a fully connected layer with ReLU activation and 512 hidden units
    tf.keras.layers.Dense(512, activation='relu'),
    # Create output layer with a single node and sigmoid activation
    tf.keras.layers.Dense(1, activation='sigmoid')
])

#compile
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.0004),
              metrics=['accuracy'])

#image rescale
train_datagen = ImageDataGenerator(rescale=1/255)
validation_datagen = ImageDataGenerator(rescale=1/255)

#flow images in batches of 20 using train_datagen/validation_datagen generator
train_generator = train_datagen.flow_from_directory(
        train_dir, 
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
validation_generator = validation_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

#fit model
history = model.fit_generator(
      train_generator,
      steps_per_epoch=100,  # = len(train)/batch_size = 2000images/20
      epochs=100,
      validation_data=validation_generator,
      validation_steps=50,  # = len(train)/batch_size = 1000images/20
      verbose=2)

--2020-08-10 12:32:41--  https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.124.128, 172.217.212.128, 172.217.214.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.124.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68606236 (65M) [application/zip]
Saving to: ‘/tmp/cats_and_dogs_filtered.zip’


2020-08-10 12:32:42 (56.5 MB/s) - ‘/tmp/cats_and_dogs_filtered.zip’ saved [68606236/68606236]

Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/100
100/100 - 105s - loss: 0.6968 - accuracy: 0.5325 - val_loss: 0.7088 - val_accuracy: 0.5010
Epoch 2/100
100/100 - 104s - loss: 0.6504 - accuracy: 0.6260 - val_loss: 0.6230 - val_accuracy: 0.6560
Epoch 3/100
100/100 - 106s - loss: 0.5763 - accuracy: 0.6935 - val_loss: 0.5867 - val_accuracy:

KeyboardInterrupt: ignored

 The convolution layers reduce the size of the feature maps by a bit due to padding, and each pooling layer halves the feature map.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Parameters for our graph; we'll output images in a 4x4 configuration
nrows = 4
ncols = 4

# Index for iterating over images
pic_index = 0


# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols * 4, nrows * 4)

pic_index += 8
next_cat_pix = [os.path.join(train_cats_dir, fname) 
                for fname in train_cat_fnames[pic_index-8:pic_index]]
next_dog_pix = [os.path.join(train_dogs_dir, fname) 
                for fname in train_dog_fnames[pic_index-8:pic_index]]

for i, img_path in enumerate(next_cat_pix+next_dog_pix):
  # Set up subplot; subplot indices start at 1
  sp = plt.subplot(nrows, ncols, i + 1)
  sp.axis('Off') # Don't show axes (or gridlines)

  img = mpimg.imread(img_path)
  plt.imshow(img)

plt.show()

In [None]:
#Plot of accuracy and loss
import matplotlib.pyplot as plt

acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
loss = history.history["loss"]
val_loss = history.history["val_loss"]

epochs = range(len(acc))

# Plot the training and validation accuracy per epoch
plt.plot(epochs, acc)
plt.plot(epochs, val_acc)
plt.title("Training and validation accuracy")
plt.figure()

# Plot the training and validation loss per epoch
plt.plot(epochs, loss)
plt.plot(epochs, val_loss)
plt.title("Training and validation loss")
plt.figure()

Training accuracy is over 90% and the validation accuracy is near 75%. It is a clear evidence of overfitting. 

- What does `flow_from_directory()` give you?
    - The ability to easily load images, to be able to pick the size of images, to automatically label images based on their directory name.
- If my image is sized 150x150 and I pass a 3x3 convolution over it, what size is the resulting image?
    - 148x148 -> 150-3+1 = 148
- If my data is is sized 75x75 and I use Pooling of size 2x2, what size will the resulting image be?
    - 75x75 (halved)
- If you want to view the history of training, how can you access it?
    - Create a `history` variable and assign it to the return of `model.fit()`
- What's the name of the API that allows you to inspect the impact of convolutions on the images?
    - The model.layers API
- When checking the result graphs, the loss levelled out at about .75 after 2 epochs, but the accuracy climbed close to 1.0 after 15 epochs, what's the significance of this?
    - There was no point training after 2 epochs, as we overfit the training data (you want the loss to keep going down).
- Why is the validation accuracy a better indicator of model performance than training accuracy?
    - The validation accuracy is based on images that the model hasn't been trained with, and thus a better indicator of how the model will perform with new images.
- Why is overfitting more likely to occur on smaller datasets?
    - Because there's less likelihood of all possible features being encountered in the training process.

In [None]:
# Updated to do image augmentation
train_datagen = ImageDataGenerator(
      rotation_range=40,
      width_shift_range=0.2,
      height_shift_range=0.2,
      shear_range=0.2,
      zoom_range=0.2,
      horizontal_flip=True,
      fill_mode='nearest')

* `rotation_range` is a value in degrees (0–180), a range within which to randomly rotate pictures.
* `width_shift` and height_shift are ranges (as a fraction of total width or height) within which to randomly translate pictures vertically or horizontally.
* `shear_range` is for randomly applying shearing transformations.
* `zoom_range` is for randomly zooming inside pictures.
* `horizontal_flip` is for randomly flipping half of the images horizontally. This is relevant when there are no assumptions of horizontal assymmetry (e.g. real-world pictures).
* `fill_mode` is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.

# **Data Augmentation**
Manipulating an image to increase the size of your datset and help reduce the overfitting of the model.


1.   Flipping
2.   Rotating
3.   Cropping
4.   Skewing



*    Overfitting: one of the central problem in machine learning. It happens when training set are not generalize well to unseen data.

**After Aumentation Applied**

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
    -O /tmp/cats_and_dogs_filtered.zip

zip = zipfile.ZipFile('/tmp/cats_and_dogs_filtered.zip','r')
zip.extractall('/tmp')
zip.close()

#directory with train & validation data
train_dir = os.path.join('/tmp/cats_and_dogs_filtered','train')
validation_dir = os.path.join('/tmp/cats_and_dogs_filtered','validation')

#model
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

#compile
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.0004),
              metrics=['accuracy'])
################################################################################
#image rescale
train_datagen = ImageDataGenerator(
      rescale=1./255,
      rotation_range=40,
      width_shift_range=0.2,
      height_shift_range=0.2,
      shear_range=0.2,
      zoom_range=0.2,
      horizontal_flip=True,
      fill_mode='nearest')
validation_datagen = ImageDataGenerator(rescale=1/255)

#flow images in batches of 20 using train_datagen/validation_datagen generator
train_generator = train_datagen.flow_from_directory(
        train_dir, 
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
validation_generator = validation_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
###########################################################################
#fit model
history = model.fit_generator(
      train_generator,
      steps_per_epoch=100,  # = len(train)/batch_size = 2000images/20
      epochs=100,
      validation_data=validation_generator,
      validation_steps=50,  # = len(train)/batch_size = 1000images/20
      verbose=2)

In [None]:
#Plot of accuracy and loss
import matplotlib.pyplot as plt

acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
loss = history.history["loss"]
val_loss = history.history["val_loss"]

epochs = range(len(acc))

# Plot the training and validation accuracy per epoch
plt.plot(epochs, acc)
plt.plot(epochs, val_acc)
plt.title("Training and validation accuracy")
plt.figure()

# Plot the training and validation loss per epoch
plt.plot(epochs, loss)
plt.plot(epochs, val_loss)
plt.title("Training and validation loss")
plt.figure()

**Visualizing Intermediate Representations**

visualize how an input gets transformed as it goes through the convnet.

In [None]:
import numpy as np
import random
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Let's define a new Model that will take an image as input, and will output
# intermediate representations for all layers in the previous model after
# the first.
successive_outputs = [layer.output for layer in model.layers[1:]]
visualization_model = Model(img_input, successive_outputs)

# Let's prepare a random input image of a cat or dog from the training set.
cat_img_files = [os.path.join(train_cats_dir, f) for f in train_cat_fnames]
dog_img_files = [os.path.join(train_dogs_dir, f) for f in train_dog_fnames]
img_path = random.choice(cat_img_files + dog_img_files)

img = load_img(img_path, target_size=(150, 150))  # this is a PIL image
x = img_to_array(img)  # Numpy array with shape (150, 150, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 150, 150, 3)

# Rescale by 1/255
x /= 255

# Let's run our image through our network, thus obtaining all
# intermediate representations for this image.
successive_feature_maps = visualization_model.predict(x)

# These are the names of the layers, so can have them as part of our plot
layer_names = [layer.name for layer in model.layers]

# Now let's display our representations
for layer_name, feature_map in zip(layer_names, successive_feature_maps):
  if len(feature_map.shape) == 4:
    # Just do this for the conv / maxpool layers, not the fully-connected layers
    n_features = feature_map.shape[-1]  # number of features in feature map
    # The feature map has shape (1, size, size, n_features)
    size = feature_map.shape[1]
    # We will tile our images in this matrix
    display_grid = np.zeros((size, size * n_features))
    for i in range(n_features):
      # Postprocess the feature to make it visually palatable
      x = feature_map[0, :, :, i]
      x -= x.mean()
      x /= x.std()
      x *= 64
      x += 128
      x = np.clip(x, 0, 255).astype('uint8')
      # We'll tile each filter into this big horizontal grid
      display_grid[:, i * size : (i + 1) * size] = x
    # Display the grid
    scale = 20. / n_features
    plt.figure(figsize=(scale * n_features, scale))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')

As you can see, we are overfitting like it's getting out of fashion. Our training accuracy (in blue) gets close to 100% (!) while our validation accuracy (in green) stalls as 70%. Our validation loss reaches its minimum after only five epochs.

Since we have a relatively small number of training examples (2000), overfitting should be our number one concern. Overfitting happens when a model exposed to too few examples learns patterns that do not generalize to new data, i.e. when the model starts using irrelevant features for making predictions. For instance, if you, as a human, only see three images of people who are lumberjacks, and three images of people who are sailors, and among them the only person wearing a cap is a lumberjack, you might start thinking that wearing a cap is a sign of being a lumberjack as opposed to a sailor. You would then make a pretty lousy lumberjack/sailor classifier.

Overfitting is the central problem in machine learning: given that we are fitting the parameters of our model to a given dataset, how can we make sure that the representations learned by the model will be applicable to data never seen before? How do we avoid learning things that are specific to the training data?

In the next exercise, we'll look at ways to prevent overfitting in the cat vs. dog classification model.

If overfitting continues, add dropout for further regularization

- How do you use Image Augmentation in TensorFlow?
    - By passing parameters to the `ImageDataGenerator`.
- If your training data only has people facing left, but you want to classify people facing right, how could you prevent overfitting?
    - Use the `horizontal_flip` parameter passed to `ImageDataGenerator`
- Why is training with augmentation slower?
    - Because the image processing (flipping, shifting, cropping images) takes cycles to complete.
- What does the `fill_mode` parameter of `ImageDataGenerator` do?
    - It attempts to recreate lost pixels after a transformation like a shear.
- When using Image Augmentation with the `ImageDataGenerator`, what happens to your raw image data on disk?
    - Nothing, all augmentation is done in-memory.
- How does Image Augmentation help solve overfitting?
    - It manipulates the training set to generate more scenarios for features in the images. Effectively increasing the size of your training set.
- When using Image Augmentation my training gets...
    - Slower. But is more robust to overfitting.
- Using Image Augmentation effectively simulates having a larger data set for training. True or False?
    - True.