# CNN 8 - Transfer Learning
- Dataset:
    - https://www.kaggle.com/shaunthesheep/microsoft-catsvsdogs-dataset
- The dataset isn't deep-learning-compatible by default, here's how to preprocess it:

    
**What you should know by now:**
- How to preprocess image data
- How to load image data from a directory
- What convolution, pooling, and fully-connected layers are
- Categorical vs. binary classification
- What data augmentation is and why it's useful

**Let's start**
- We'll import the libraries first:

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import tensorflow as tf

- We'll have to load training and validation data from different directories throughout the notebook
- The best practice is to declare a function for that
- The function will also apply data augmentation to the training dataset:

In [2]:
def init_data(train_dir: str, valid_dir: str) -> tuple:
    train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1/255.0,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest'
    )
    valid_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        rescale=1/255.0
    )
    
    train_data = train_datagen.flow_from_directory(
        directory=train_dir,
        target_size=(224, 224),
        class_mode='categorical',
        batch_size=64,
        seed=42
    )
    valid_data = valid_datagen.flow_from_directory(
        directory=valid_dir,
        target_size=(224, 224),
        class_mode='categorical',
        batch_size=64,
        seed=42
    )
    
    return train_data, valid_data
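
- To make the generator settings more concrete, here's a minimal NumPy sketch of what `rescale`, `horizontal_flip`, and `class_mode='categorical'` amount to
- The tiny image and the `class_indices` mapping are made up for illustration; the real generator applies the flip randomly and derives the class order from the alphanumeric order of the subfolder names:

```python
import numpy as np

# Hypothetical 2x3 RGB "image" with raw 8-bit pixel values (0-238)
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3) * 14

# rescale=1/255.0 maps the pixel values into the [0, 1] range
rescaled = image * (1 / 255.0)

# horizontal_flip mirrors the image left-to-right (reverses the width axis)
flipped = rescaled[:, ::-1, :]

# class_mode='categorical' yields one-hot labels: cat -> [1, 0], dog -> [0, 1]
class_indices = {'cat': 0, 'dog': 1}  # assumed mapping for folders `cat` and `dog`
labels = np.eye(len(class_indices))[[class_indices['cat'], class_indices['dog']]]

print(rescaled.min(), rescaled.max())
print(labels)
```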

- Let's now load our dogs and cats dataset:

In [3]:
train_data, valid_data = init_data(
    train_dir='data/train/', 
    valid_dir='data/validation/'
)

Found 20030 images belonging to 2 classes.
Found 2488 images belonging to 2 classes.


<br>

## Transfer Learning in TensorFlow
- With transfer learning, we load a large pretrained model without its top classification layer
- That way, we can freeze the learned weights and add only an output layer that matches our problem
- For example, most pretrained models were trained on the ImageNet dataset, which has 1000 classes
    - We only have two classes (cat and dog), so we'll need to specify that
- We'll also add a couple of additional layers to prevent overfitting:

In [4]:
def build_transfer_learning_model(base_model):
    # `base_model` stands for the pretrained model
    # Freeze its layers so that training doesn't overwrite the learned weights
    for layer in base_model.layers:
        layer.trainable = False
        
    # Declare a sequential model that combines the base model with custom layers
    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(rate=0.2),
        tf.keras.layers.Dense(units=2, activation='softmax')
    ])
    
    # Compile the model
    model.compile(
        loss='categorical_crossentropy',
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy']
    )
    
    return model
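
- A rough way to see what freezing buys us is to count parameters by hand
- The sketch below assumes the standard VGG16 convolutional stack (13 convolutions with 3x3 kernels) as the base model and the custom head from `build_transfer_learning_model()`; the variable names are only illustrative:

```python
# (in_channels, out_channels) of the 13 VGG16 3x3 convolutions
vgg16_convs = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

# Each 3x3 convolution has 3*3*c_in*c_out weights plus c_out biases
frozen_params = sum(3 * 3 * c_in * c_out + c_out for c_in, c_out in vgg16_convs)

features = 512  # channels left after GlobalAveragePooling2D

# BatchNormalization: gamma and beta are trainable, moving mean/variance are not
bn_trainable = 2 * features
bn_non_trainable = 2 * features

# Dense(2): one weight per feature per class, plus one bias per class
dense_params = features * 2 + 2

# Dropout has no parameters
trainable_params = bn_trainable + dense_params

print(frozen_params)     # 14714688
print(trainable_params)  # 2050
```

- Under these assumptions, only about two thousand of the roughly 14.7 million parameters are updated during training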

In [5]:
# Let's use a simple and well-known architecture - VGG16
from tensorflow.keras.applications.vgg16 import VGG16

# We'll specify it as a base model
# `include_top=False` means we don't want the top classification layer
# Specify the `input_shape` to match our image size
# `weights='imagenet'` loads the weights pretrained on ImageNet
vgg_model = build_transfer_learning_model(
    base_model=VGG16(include_top=False, input_shape=(224, 224, 3), weights='imagenet')
)

# Train the model for 10 epochs
vgg_hist = vgg_model.fit(
    train_data,
    validation_data=valid_data,
    epochs=10
)

Metal device set to: Apple M1 Pro
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


- We got amazing accuracy right from the start!
- We couldn't surpass 77% accuracy on the validation set with the custom architecture, and we're at 93% with the VGG16 model
- The beauty of transfer learning isn't only that it yields highly accurate models; you can also train with less data, since the pretrained layers have already learned general image features

<br>

## Transfer Learning on a 20 times smaller subset
- We want to see if reducing the dataset size negatively affects the predictive power
- To do so, we'll create a new directory structure for training and validation images:

In [6]:
import random
import pathlib
import shutil

random.seed(42)


dir_data = pathlib.Path.cwd().joinpath('data_small')
dir_train = dir_data.joinpath('train')
dir_valid = dir_data.joinpath('validation')

for cls in ['cat', 'dog']:
    # `parents=True` also creates `data_small/` and the split folder if missing
    # `exist_ok=True` makes the cell safe to re-run
    dir_train.joinpath(cls).mkdir(parents=True, exist_ok=True)
    dir_valid.joinpath(cls).mkdir(parents=True, exist_ok=True)

- Here's the directory structure printed:

In [9]:
!ls -R data_small | grep ":$" | sed -e 's/:$//' -e 's/[^-][^\/]*\//--/g' -e 's/^/   /' -e 's/-/|/'

   |-train
   |---cat
   |---dog
   |-validation
   |---cat
   |---dog


- Now, we'll copy only a sample of images to the new folders
- We'll declare a `copy_sample()` function which takes `n` images from the `src_folder` and copies them to the `tgt_folder`
- `n` defaults to 500, which is a pretty small number:

In [10]:
def copy_sample(src_folder: pathlib.Path, tgt_folder: pathlib.Path, n: int = 500) -> None:
    # Randomly pick `n` files from the source folder
    imgs = random.sample(list(src_folder.iterdir()), n)

    for img in imgs:
        # `img.name` is the file name without the parent path, on any OS
        shutil.copy(
            src=img,
            dst=tgt_folder.joinpath(img.name)
        )
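
- One thing to keep in mind: `random.sample()` raises a `ValueError` when `n` is larger than the number of files in the source folder
- Here's a minimal sketch of a more forgiving variant; `copy_sample_safe()` is a hypothetical helper, not part of the notebook, and the demo runs on throwaway files in a temporary directory:

```python
import pathlib
import random
import shutil
import tempfile


def copy_sample_safe(src_folder: pathlib.Path, tgt_folder: pathlib.Path, n: int = 500) -> int:
    """Copy up to `n` random files and return how many were copied."""
    # Sort first so that a fixed random seed always picks the same files
    files = sorted(p for p in src_folder.iterdir() if p.is_file())
    # Cap `n` at the number of available files to avoid a ValueError
    imgs = random.sample(files, min(n, len(files)))
    tgt_folder.mkdir(parents=True, exist_ok=True)
    for img in imgs:
        shutil.copy(src=img, dst=tgt_folder / img.name)
    return len(imgs)


# Demo: the source folder holds only 5 files, but we ask for 10
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp, 'src')
    tgt = pathlib.Path(tmp, 'tgt')
    src.mkdir()
    for i in range(5):
        src.joinpath(f'{i}.jpg').write_bytes(b'fake')

    copied = copy_sample_safe(src, tgt, n=10)
    n_in_target = len(list(tgt.iterdir()))

print(copied, n_in_target)
```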

- Let's now copy the training and validation images
- For the validation set, we'll copy only 100 images per class

In [11]:
# Train - cat
copy_sample(
    src_folder=pathlib.Path.cwd().joinpath('data/train/cat/'), 
    tgt_folder=pathlib.Path.cwd().joinpath('data_small/train/cat/'), 
)

# Train - dog
copy_sample(
    src_folder=pathlib.Path.cwd().joinpath('data/train/dog/'), 
    tgt_folder=pathlib.Path.cwd().joinpath('data_small/train/dog/'), 
)

# Valid - cat
copy_sample(
    src_folder=pathlib.Path.cwd().joinpath('data/validation/cat/'), 
    tgt_folder=pathlib.Path.cwd().joinpath('data_small/validation/cat/'),
    n=100
)

# Valid - dog
copy_sample(
    src_folder=pathlib.Path.cwd().joinpath('data/validation/dog/'), 
    tgt_folder=pathlib.Path.cwd().joinpath('data_small/validation/dog/'),
    n=100
)

- Let's count the number of files in each folder to verify the images were copied successfully:

In [12]:
!ls data_small/train/cat/ | wc -l

     500


In [13]:
!ls data_small/validation/cat/ | wc -l

     100


In [14]:
!ls data_small/train/dog/ | wc -l

     500


In [15]:
!ls data_small/validation/dog/ | wc -l

     100
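
- The `ls | wc -l` commands above need a Unix-like shell
- As a portable alternative, the counts could be collected with `pathlib`; `count_images()` is a hypothetical helper that assumes the `root/split/class` layout used in this notebook, demonstrated here on a throwaway directory tree:

```python
import pathlib
import tempfile


def count_images(root: pathlib.Path) -> dict:
    """Return {'split/class': file_count} for a root/split/class layout."""
    counts = {}
    for class_dir in sorted(root.glob('*/*')):
        if class_dir.is_dir():
            key = f'{class_dir.parent.name}/{class_dir.name}'
            counts[key] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts


# Demo on a small temporary tree
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for split, cls, n_files in [('train', 'cat', 3), ('validation', 'dog', 1)]:
        folder = root / split / cls
        folder.mkdir(parents=True)
        for i in range(n_files):
            (folder / f'{i}.jpg').write_bytes(b'fake')

    counts = count_images(root)

print(counts)
```

- In the notebook, the call would be `count_images(pathlib.Path('data_small'))`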


- Now use `init_data()` to load in the images again:

In [6]:
train_data, valid_data = init_data(
    train_dir='data_small/train/', 
    valid_dir='data_small/validation/'
)

Found 1000 images belonging to 2 classes.
Found 200 images belonging to 2 classes.


- There's a total of 1000 training images
- It will be interesting to see if we can get a decent model out of a dataset this small
- The model architecture is the same, but we'll train for more epochs because the smaller dataset gives fewer weight updates per epoch
    - We can also afford to train for longer, since each epoch takes less time:

In [8]:
vgg_model = build_transfer_learning_model(
    base_model=VGG16(include_top=False, input_shape=(224, 224, 3), weights='imagenet')
)

vgg_hist = vgg_model.fit(
    train_data,
    validation_data=valid_data,
    epochs=20
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
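
- The reduced cost per epoch can be estimated from the number of batches alone
- The sketch below uses the image counts and batch size from this notebook; it only counts weight-update steps and says nothing about the actual time per step, which depends on the hardware:

```python
import math

batch_size = 64
full_train, small_train = 20030, 1000  # training images reported by the loader

# Batches per epoch, counting the last partial batch
full_steps = math.ceil(full_train / batch_size)
small_steps = math.ceil(small_train / batch_size)

# Total weight-update steps over the whole training run
full_total = full_steps * 10    # full dataset, 10 epochs
small_total = small_steps * 20  # small dataset, 20 epochs

print(full_steps, small_steps)  # 313 16
print(full_total, small_total)  # 3130 320
```

- So even with twice as many epochs, the small model sees roughly a tenth of the update steps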


- It looks like we got roughly the same validation accuracy as with the model trained on the full training set of about 20K images, which is amazing!

**Homework:**
- Use both models to predict the entire test set directory
- How do the accuracies compare?
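
- As a starting point, here's a minimal sketch of how accuracy can be computed from softmax outputs
- The probabilities and labels below are made up; in practice, they would come from something like `model.predict()` on a test generator created with `shuffle=False`, so that the predictions line up with the generator's `classes` attribute:

```python
import numpy as np

# Hypothetical softmax outputs for 4 images; columns are [cat, dog]
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])

# Hypothetical true class indices (0 = cat, 1 = dog)
true_labels = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability
predicted = probs.argmax(axis=1)

# Accuracy is the share of predictions that match the true labels
accuracy = (predicted == true_labels).mean()

print(predicted)  # [0 1 0 1]
print(accuracy)   # 0.75
```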