In [26]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
    #!pip install --ignore-installed Pillow==9.0.0
    !pip install tf-keras-vis
    !pip install Pillow==9.0.0
    #!pip install -U git+https://github.com/keisen/tf-keras-vis.git@4a90becb02ed3d44825300fcb807dd58157787ba
    from tensorflow.keras.preprocessing.image import load_img
    from tf_keras_vis.utils.scores import CategoricalScore

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import keras
from keras.datasets import fashion_mnist, cifar10
from keras.layers import Dense, Flatten, Normalization, Dropout, Conv2D, MaxPooling2D, RandomFlip, RandomRotation, RandomZoom, BatchNormalization, Activation, InputLayer
from keras.models import Sequential
from keras.losses import SparseCategoricalCrossentropy, CategoricalCrossentropy
from keras.callbacks import EarlyStopping
from keras import utils as np_utils
from keras import utils
import os
from keras.preprocessing.image import ImageDataGenerator

import matplotlib as mpl
import matplotlib.pyplot as plt
import datetime
#import PIL

!rm -rf logs/VGG/*

In [27]:
class FilterLoggingCallback(tf.keras.callbacks.Callback):
    def __init__(self, log_dir):
        super().__init__()
        self.log_dir = log_dir
        self.file_writer = tf.summary.create_file_writer(log_dir)

    def on_epoch_end(self, epoch, logs=None):
        # Get the model
        model = self.model

        # Get the first and last convolutional layers
        conv_layers = [layer for layer in model.layers if isinstance(layer, tf.keras.layers.Conv2D)]
        first_layer, last_layer = conv_layers[0], conv_layers[-1]

        for i, layer in enumerate([first_layer, last_layer]):
            # Get the filters of the current layer
            filters = layer.weights[0].numpy()

            # Normalize the filters
            filters_min, filters_max = filters.min(), filters.max()
            filters = (filters - filters_min) / (filters_max - filters_min)

            # Write the filters to TensorBoard
            with self.file_writer.as_default():
                max_filters = min(8, filters.shape[-1])
                #for j in range(filters.shape[-1]):
                for j in range(max_filters):
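                    # One filter has shape (kh, kw, in_channels). tf.summary.image expects
                    # (batch, height, width, channels), so log each input-channel slice
                    # of the filter as its own small grayscale image.
                    kernel = np.transpose(filters[:, :, :, j], (2, 0, 1))[..., np.newaxis]
                    tf.summary.image(f"Layer {i} Filter {j}", kernel, step=epoch, max_outputs=3)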

# Usage:
# callback = FilterLoggingCallback(log_dir="/path/to/log_dir")
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[callback])
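
As a rough illustration of the shapes the callback has to deal with (a NumPy-only sketch, not part of the callback itself): a `Conv2D` kernel is stored as `(height, width, in_channels, out_channels)`, while `tf.summary.image` wants `(batch, height, width, channels)`. The helper name `kernel_to_image_batch` and the random kernel are made up for this example.

```python
import numpy as np

def kernel_to_image_batch(kernel, filter_index):
    """Turn one filter of a conv kernel into a batch of grayscale images.

    kernel has shape (kh, kw, in_channels, out_channels). The result has
    shape (in_channels, kh, kw, 1): one small image per input channel.
    """
    # Min-max normalize to [0, 1]; guard against a constant kernel.
    k_min, k_max = kernel.min(), kernel.max()
    scaled = (kernel - k_min) / max(k_max - k_min, 1e-12)

    single = scaled[:, :, :, filter_index]          # (kh, kw, in_channels)
    return np.transpose(single, (2, 0, 1))[..., np.newaxis]

# Hypothetical kernel with the same shape as VGG16's first conv layer.
rng = np.random.default_rng(0)
demo_kernel = rng.normal(size=(3, 3, 3, 64))
demo_images = kernel_to_image_batch(demo_kernel, filter_index=0)
print(demo_images.shape)  # (3, 3, 3, 1)
```

For deeper layers `in_channels` is large (512 at the end of VGG16), so in practice only the first few slices are worth logging.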

# Transfer Learning

### Feature Extraction and Classification

A key concept in transfer learning is separating the feature extraction done by the convolutional layers from the classification done by the fully connected layers.
<ul>
<li> The convolutional layers find features in the image, i.e. the output at the end of the convolutional layers is a set of image-y features. 
<li> The fully connected layers take those features and classify the thing. 
</ul>

The idea behind this is that we allow someone (like Google) to train their fancy network on a bunch of fast computers, using millions and millions of images. These networks get very good at extracting features from images. 

When using these models we take those convolutional layers and slap on our own classifier at the end, so the pretrained convolutional layers extract a bunch of features with their massive amount of training, then we use those features to predict our data!

### Tensorboard Up-Front

We'll also launch TensorBoard prior to doing any training. Pay attention to the log locations in each callback: we can nest the logs in folders, then use the names and TensorBoard's regex search to monitor each run as it progresses. 
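
To make the folder idea concrete, here is a small sketch of how nested log paths can be built and filtered with a regular expression. The helper `make_log_dir` and the fixed timestamp are assumptions for the example; the folder names mirror the ones used in the callbacks in this notebook.

```python
import re
from datetime import datetime

def make_log_dir(model_name, stage, when=None):
    """Build a nested log path such as logs/VGG/initial/20240101-120000."""
    when = when or datetime.now()
    return f"logs/{model_name}/{stage}/{when.strftime('%Y%m%d-%H%M%S')}"

stamp = datetime(2024, 1, 1, 12, 0, 0)  # fixed time so the example is repeatable
runs = [
    make_log_dir("VGG", "initial", stamp),
    make_log_dir("VGG", "fine_tune", stamp),
    make_log_dir("50", "initial", stamp),
]

# The same kind of pattern can be typed into TensorBoard's run filter box.
pattern = re.compile(r"VGG/(initial|fine_tune)")
vgg_runs = [run for run in runs if pattern.search(run)]
print(vgg_runs)
```

Keeping the model name and training stage in the path means one pattern can pull up just the runs you want to compare.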

In [28]:
%load_ext tensorboard
%tensorboard --logdir=logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6007 (pid 47549), started 2:24:26 ago. (Use '!kill 47549' to kill it.)

### Download Model

There are several pretrained models available for us to use. VGG16 is one developed for image recognition. The name stands for "Visual Geometry Group", the group of researchers at the University of Oxford who developed it, and the '16' refers to the 16 layers with weights in the architecture (13 convolutional and 3 fully connected). The model reached roughly 93% top-5 accuracy on the ImageNet test that we mentioned a couple of weeks ago. 

![VGG16](images/vgg16.png "VGG16" )

#### Slice Convolutional Layers from Classifier

When downloading the model we specify that we don't want the top - that's the classification part. When we remove the top we also allow the model to adapt to the shape of our images, so we specify the input size as well.

In [29]:
from keras.applications.vgg16 import VGG16
from keras.layers import Input
from keras.models import Model
from keras.applications.vgg16 import preprocess_input, decode_predictions

### Preprocessing Data

Our VGG16 model comes with a preprocessing function that prepares the data the way the model expects. VGG16 was trained on images with a different color encoding (BGR channel order, with each channel zero-centered on the ImageNet mean and no rescaling), so we should prepare our data the same way to get good results. 
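
For intuition, the sketch below reproduces in plain NumPy what that preprocessing roughly does for channels-last RGB images. It is an approximation for illustration, not the library's code; the function name `vgg16_style_preprocess` is made up, and in the notebook we still call the real `preprocess_input`.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by the "caffe"-style
# preprocessing that VGG16 was trained with.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(rgb_images):
    """Approximate vgg16.preprocess_input for channels-last RGB input in [0, 255]."""
    x = np.asarray(rgb_images, dtype=np.float32)
    x = x[..., ::-1]              # reorder channels: RGB -> BGR
    return x - IMAGENET_BGR_MEAN  # zero-center each channel; no rescaling

# One pure-red pixel: R=255, G=0, B=0.
red_pixel = np.array([[[[255.0, 0.0, 0.0]]]])
processed = vgg16_style_preprocess(red_pixel)
print(processed)
```

Note that the output is no longer in the 0-255 range, which is why preprocessed images look odd if you try to plot them directly.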

In [30]:
import pathlib
from keras.applications.vgg16 import preprocess_input

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)

epochs = 7
batch_size = 96 # 128 was OK for me here with 15GB GPU RAM T4 GPU
img_height = 224
img_width = 224
img_depth = 3

train_ds_orig = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds_orig = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

class_names = train_ds_orig.class_names
print(class_names)

def preprocess(images, labels):
  return tf.keras.applications.vgg16.preprocess_input(images), labels

train_ds = train_ds_orig.map(preprocess)
val_ds = val_ds_orig.map(preprocess)


Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']


#### Add on New Classifier

If we look at a summary of the model without its top, we can see that the last layer we have is a MaxPool layer. When making our own CNN this is the last layer before we add in the "normal" stuff for making predictions, and the same applies here. We need to flatten the data, then use dense layers and an output layer to classify the predictions. 

We end up with the pretrained parts finding features in images, and the custom part classifying images based on those features. If we think back to the concept of a convolutional network, the convolutional layers do the true heavy lifting in allowing us to do things like classify images: they take in the raw images and transform them into a set of features contained in each image. This ability to turn images into predictive features is the key - important parts of images like edges, corners, contrast, etc. are generic, and our borrowed model is excellent at finding these features in images. Our predictions are unique, so we tweak the training of our model to make predictions for our data, into our classes - all based on the features that the borrowed model found! 

### Make Model

We take the model without the top, set the input image size, and then add our own classifier. Loading the model is simple, there are just a few things to specify:
<ul>
<li> weights="imagenet" - tells the model to use the weights from its imagenet training. This is what brings the "smarts", so we want it. 
<li> include_top=False - tells the model to not bring over the classifier bits that we want to replace. 
<li> input_shape - the model is trained on specific data sizes (224x224x3). We can repurpose it by changing the input size. 
</ul>

We also set the VGG model that we download to be not trainable. We don't want to overwrite the training that already exists from the original training run. What we want to train are the final dense parts we added to classify our specific scenario. All the weights in the convolutional layers are kept the same, as they have been developed through large amounts of training; the weights in the fully connected layers will be trained, resulting in a model that combines the "sight" of the pretrained model with the context of what we are trying to classify. In the cell below we add the VGG layers to our model one at a time, so each one is listed in the summary; if we instead pass the whole base model in as a single item (as we do later on), the VGG bits show up as though they are one layer, and for training purposes that makes sense. Either way, the "trainable params" listing in the summary shows that the large number of weights in the VGG section we are borrowing are not trainable - that's the smart part of the model. 

<b>Note:</b> I think the "top" label is a bit misleading, as it isn't really the top, it is the part at the end that shows at the bottom of a summary. 
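
To get a feel for the numbers in the summary, here is a back-of-the-envelope sketch that counts the parameters by hand, assuming the standard VGG16 layout (3x3 convolutions in five blocks) and the classifier head used in this notebook (512, 256 and 5 units on 224x224x3 input). The helper names are made up for the example.

```python
def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square Conv2D kernel."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units

# VGG16 convolutional blocks as (number of conv layers, filters per layer).
vgg16_blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

frozen = 0
channels = 3  # RGB input
for n_layers, filters in vgg16_blocks:
    for _ in range(n_layers):
        frozen += conv_params(3, channels, filters)
        channels = filters

# Five 2x2 max-pools shrink 224x224 to 7x7, so Flatten yields 7*7*512 values.
flat = 7 * 7 * 512
trainable = dense_params(flat, 512) + dense_params(512, 256) + dense_params(256, 5)

print(f"frozen (VGG16 base): {frozen:,}")
print(f"trainable (new head): {trainable:,}")
```

Most of the trainable weights sit in the first dense layer, because it connects to every value coming out of the flattened feature maps.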

In [31]:
## Start with an empty model
model = Sequential()

## Loading VGG16 model
base_model_orig = VGG16(weights="imagenet", include_top=False, input_shape=(img_height, img_width, img_depth))
base_model_orig.trainable = False ## Not trainable weights
# Add the VGG layers one at a time so they appear individually in model.layers
for layer in base_model_orig.layers:
    model.add(layer)

# Add Dense Stuff
flatten_layer = Flatten()
dense_layer_1 = Dense(512, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
drop_layer_1 = Dropout(.2)
dense_layer_2 = Dense(256, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
prediction_layer = Dense(5)

model.add(flatten_layer)
model.add(dense_layer_1)
model.add(drop_layer_1)
model.add(dense_layer_2)
model.add(prediction_layer)

model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0         
                                                                 
 block3_conv1 (Conv2D)       (None, 56, 56, 256)      

#### Compile and Train

Once the new Frankenstein model is built we finish the training process as we normally would. The only difference is that here the weights of the VGG part of the model are not being adjusted during the backpropagation steps; only the weights in the layers that we added at the end are. For many, if not most, applications, this approach of adapting a pretrained model will give the best real world results. Unless you happen to live in a data centre, you probably lack both the data and the processing capacity to train any model from scratch to be as good as those that we can download. 

In [32]:
# Model
model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(),  
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")],
    run_eagerly=False
)

time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "logs/VGG/initial/" + time_stamp
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, write_graph=True, write_images=True)
stopping_callback = EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True, mode="max")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="weights/VGG/initial/"+time_stamp+"model.keras", save_best_only=True, monitor='val_accuracy', mode='max')

filter_callback = FilterLoggingCallback(log_dir="logs/VGG/initial/"+time_stamp)

model.fit(train_ds,
            epochs=epochs,
            verbose=1,
            validation_data=val_ds,
            callbacks=[tensorboard_callback, stopping_callback, checkpoint_callback, filter_callback])

Epoch 1/7

ValueError: Tensor  must have rank 4.  Received rank 3, shape (3, 3, 3)

### Fine Tune Models

Lastly, we can adapt the entire model to our data. We'll unfreeze the original model, and then train the model again. The key addition here is that we set the learning rate to be extremely low (here it is 2 orders of magnitude smaller than the default) so the model doesn't totally rewrite all of the weights while training; rather, it will only change a little bit - fine tuning its predictions to the actual data! Here the original convolutional layers are trainable, and the weights will be adjusted during training, but we dial the learning rate way down so that our changes only impact the model a little bit. This is a greater degree of fine tuning than we get when we lock the VGG layers, but it is still mainly relying on the previous training of the VGG model.

The end result is a model that can take advantage of all of the training that the original model received before we downloaded it. That ability of extracting features from images is then reapplied to our data for making predictions based on the features identified in the original model. Finally we take the entire model and just gently train it to be a little more suited to our data. The best of all worlds!
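
To see why the learning rate matters so much here, the sketch below works out the size of the very first Adam update for a single weight, using the textbook Adam formulas and Keras's default hyperparameters. The gradient value is made up, and real training involves many steps, so treat this as intuition only.

```python
import math

def first_adam_step(gradient, learning_rate, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """Size of the very first Adam update for one weight, starting from zero moments."""
    m = (1 - beta_1) * gradient            # first-moment estimate
    v = (1 - beta_2) * gradient ** 2       # second-moment estimate
    m_hat = m / (1 - beta_1)               # bias correction at step 1
    v_hat = v / (1 - beta_2)
    return learning_rate * m_hat / (math.sqrt(v_hat) + eps)

gradient = 0.5  # hypothetical gradient for a single weight
default_step = first_adam_step(gradient, learning_rate=1e-3)
fine_tune_step = first_adam_step(gradient, learning_rate=1e-5)

print(default_step, fine_tune_step, default_step / fine_tune_step)
```

The update is roughly the learning rate itself, so dropping from 1e-3 to 1e-5 makes each nudge to the pretrained weights about 100 times smaller.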

In [None]:
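# Save an independent copy of the above model for the next test.
# (A plain `copy_model = model` would only create a second name for the same object.)
copy_model = keras.models.clone_model(model)
copy_model.set_weights(model.get_weights())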

#base_model.trainable = True
for layer in base_model_orig.layers:
    layer.trainable = True
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(1e-5),  # Low learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
)

time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "logs/VGG/fine_tune/" + time_stamp
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, write_graph=False, write_images=False)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="weights/VGG/fine_tune/"+time_stamp+"model.keras", save_best_only=True, monitor='val_accuracy', mode='max')

model.fit(train_ds, epochs=epochs, validation_data=val_ds, verbose=1, callbacks=[tensorboard_callback, stopping_callback, checkpoint_callback])

### Transfer + Fine Tuning Results

Yay, that's probably pretty accurate! In initial testing with 1 epoch, I got results around 80% before the fine tuning, and over 85% after the fine tuning. That's with 1 epoch! Other runs where we allow it to tune more tend to be even better - allowing 5 epochs of training + 5 epochs of fine tuning, my validation accuracy was around 90% and the training accuracy was nearing 100% - we could likely do even better with more aggressive regularization. 

This will likely be a great approach for something like image recognition!
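
One practical detail when using the trained model: the final `Dense(5)` layer has no activation and the loss is built with `from_logits=True`, so the model outputs raw logits rather than probabilities. The sketch below shows one way to turn logits into class names; the logit values are made up and stand in for the output of `model.predict(...)`.

```python
import numpy as np

class_names = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

def softmax(logits):
    """Convert rows of raw logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits standing in for model.predict(...) on two images.
logits = np.array([[2.0, 0.1, -1.0, 0.3, 0.0],
                   [-0.5, 0.2, 0.1, 0.0, 3.0]])

probabilities = softmax(logits)
predicted = [class_names[i] for i in probabilities.argmax(axis=-1)]
print(predicted)  # ['daisy', 'tulips']
```

If you only need the predicted class, taking the argmax of the logits directly gives the same answer, since softmax does not change the ordering.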

### With Augmentation

<b>Note:</b> Some of the details here are explained more in the next workbook. In short, there are some layers that apply random transformations to the images for augmentation. There's a function that applies them conditionally, as we don't want to augment the validation data.

I also set up a switch to allow for more epochs, so we can take advantage of the augmented data. I'm only going to run that when using a very fast processor on Colab with a large batch size; on a normal GPU it will take too long. The early stopping is still pretty aggressive, so we should be fine in most cases.
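
As a tiny illustration of the "apply only when training" idea, here is a NumPy sketch of a random flip. It is a simplified stand-in for what a layer like `RandomFlip` does, not its actual implementation; the function name and the toy image are made up.

```python
import numpy as np

def random_flip(image, rng, training=True):
    """Randomly flip a (height, width, channels) image; do nothing outside training."""
    if not training:
        return image
    if rng.random() < 0.5:
        image = image[:, ::-1, :]   # horizontal flip (mirror left-right)
    if rng.random() < 0.5:
        image = image[::-1, :, :]   # vertical flip (mirror top-bottom)
    return image

rng = np.random.default_rng(123)
image = np.arange(2 * 2 * 3).reshape(2, 2, 3)

augmented = random_flip(image, rng, training=True)
untouched = random_flip(image, rng, training=False)
print(augmented.shape, np.array_equal(untouched, image))
```

The pixels are only rearranged, never changed, so the label stays valid while the model sees a slightly different picture each epoch.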

In [None]:
data_augmentation = tf.keras.Sequential([
    RandomFlip("horizontal_and_vertical"),
    RandomZoom(0.2),
    RandomRotation(0.2),
])

AUTOTUNE = tf.data.AUTOTUNE

def prepare(ds, shuffle=False, augment=False):
  # train_ds and val_ds already come batched from image_dataset_from_directory,
  # so we shuffle whole batches here and do not batch a second time.
  if shuffle:
    ds = ds.shuffle(1000)

  # Use data augmentation only on the training set.
  if augment:
    ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y),
                num_parallel_calls=AUTOTUNE)

  # Use buffered prefetching on all datasets.
  return ds.prefetch(buffer_size=AUTOTUNE)

train_ds_aug = prepare(train_ds, shuffle=True, augment=True)
val_ds_aug = prepare(val_ds)


In [None]:
LONG_TRAINING = False

augment_epochs = epochs
if LONG_TRAINING:
    augment_epochs = 100


time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "logs/VGG/augment_more_epochs/" + time_stamp
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, write_graph=False, write_images=False)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="weights/VGG/augment_more_epochs/"+time_stamp+"model.keras", save_best_only=True, monitor='val_accuracy', mode='max')

# Train on the augmented training set; the validation data is left un-augmented.
model.fit(train_ds_aug,
            epochs=augment_epochs,
            verbose=1,
            validation_data=val_ds_aug,
            callbacks=[tensorboard_callback, stopping_callback, checkpoint_callback])

### Where is the Model Looking?

One of the things that we may wonder is how our models make decisions, or what they are looking at to do so. A tool that can help illustrate that is called a saliency map. A saliency map shows a visual representation of which parts of the image are impacting the final prediction the most. While the CNN process is largely a black box, this is one way to gain a little insight into what is going on. 

#### Visualize Focus



In [None]:
if IN_COLAB:
    from tensorflow.keras.preprocessing.image import load_img
    from tf_keras_vis.utils.scores import CategoricalScore
    
    # Image titles
    image_titles = ['daisy', 'roses', 'tulips']
    score = CategoricalScore([0, 2, 4])

    # Load images and Convert them to a Numpy array
    img1 = load_img('/root/.keras/datasets/flower_photos/daisy/100080576_f52e8ee070_n.jpg', target_size=(224, 224))
    img2 = load_img('/root/.keras/datasets/flower_photos/roses/10894627425_ec76bbc757_n.jpg', target_size=(224, 224))
    #img3 = load_img('/root/.keras/datasets/flower_photos/tulips/10128546863_8de70c610d.jpg', target_size=(224, 224))
    img3 = load_img("/root/.keras/datasets/flower_photos/tulips/12764617214_12211c6a0c_m.jpg", target_size=(224, 224))
    images = np.asarray([np.array(img1), np.array(img2), np.array(img3)])

    # Preparing input data for VGG16
    X = preprocess_input(images)

    # Rendering
    f, ax = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
    for i, title in enumerate(image_titles):
        ax[i].set_title(title, fontsize=16)
        ax[i].imshow(images[i])
        ax[i].axis('off')
    plt.tight_layout()
    plt.show()

#### Show Saliency

The bright spots in the image are the areas that the model is focusing on to make its prediction. The darker areas are not as important. We can think of this as a rough approximation of feature importance from something like a tree, only in 2D. 

Using a saliency map in detail to tune our models goes beyond the scope of what we are going to do, but it does allow us to get at least some insight. The most direct thing that we can do is figure out which parts of images the model relies on to do its job; this can help us to understand what the model is looking for. For something like image recognition, this could lead you to think about how the images are processed - for example, it is very common to snip out parts of larger images, usually the center, for use in a predictive model. A saliency map could show us whether we are capturing the important parts or whether we should modify that image prep process. If we were considering the padding decision, this could also give us an idea of whether the edges matter for the model's predictions. If the most important parts of the image are the "thing", then we are likely doing well; if the most important parts are background or the periphery, then we may need to change some things. 
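
For intuition about what such a map contains, here is a toy NumPy sketch of a basic ("vanilla") saliency calculation: take the gradient of the class score with respect to each input pixel, keep the largest absolute value across the color channels, and rescale to 0-1. The toy model is a simple weighted sum so its gradient is known exactly; the library used in the next cell works on the real network and may differ in its details.

```python
import numpy as np

def vanilla_saliency(gradient):
    """Collapse an input gradient of shape (h, w, channels) into an (h, w) map in [0, 1]."""
    strength = np.abs(gradient).max(axis=-1)     # strongest channel per pixel
    span = strength.max() - strength.min()
    return (strength - strength.min()) / max(span, 1e-12)

# Toy "model": score = sum(weights * image), so d(score)/d(image) = weights.
weights = np.zeros((4, 4, 3))
weights[1:3, 1:3, :] = [0.5, -2.0, 1.0]   # only the centre pixels affect the score

saliency_map = vanilla_saliency(weights)
print(saliency_map)
```

Here the bright (high-valued) region is the centre of the image, which is exactly where the toy model's score is sensitive to the input.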

In [None]:
if IN_COLAB:
    from tf_keras_vis.utils.model_modifiers import ReplaceToLinear
    replace2linear = ReplaceToLinear()
    from tensorflow.keras import backend as K
    from tf_keras_vis.saliency import Saliency
    # from tf_keras_vis.utils import normalize

    # Create Saliency object.
    saliency = Saliency(model, model_modifier=replace2linear, clone=True)

    # Generate saliency map
    saliency_map = saliency(score, X)
    # Generate saliency map with smoothing that reduce noise by adding noise
    saliency_map = saliency(score,
                            X,
                            smooth_samples=20, # The number of calculating gradients iterations.
                            smooth_noise=0.20) # noise spread level.

    # Render
    f, ax = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
    for i, title in enumerate(image_titles):
        ax[i].set_title(title, fontsize=14)
        ax[i].imshow(saliency_map[i], cmap='jet')
        ax[i].axis('off')
    plt.tight_layout()
    plt.show()

## More Drastic Retraining

If we are extra ambitious we can also potentially slice the model even deeper, and take smaller portions to mix with our own models. The farther "into" the model you slice, the more of the original training will be removed and the more the model will learn from our training data. If done, this is a balancing act - we want to keep all of the smarts that the model has gotten from the original training, while getting the benefits of adaptation to our data. 

This is something that is hard to just eyeball - splicing parts of models together to create something that is actually superior likely requires a lot of experimentation, a solid understanding of the problem you're addressing, and some domain knowledge. For something like this adaptation of the VGG model, we'd probably start with some idea of what the model was weak at, build an understanding of what types of features it was extracting along the way, and insert our own layers where we think it would be most beneficial. 

<b>Note:</b> the farther you go with this, the less likely it is that you'll make something better. Semi-arbitrarily retraining parts of a model that were (usually, for things you downloaded) trained on a very large dataset, often over many epochs, is likely to be a losing proposition.

In [None]:
## Loading VGG16 model
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(img_height, img_width, img_depth))
#base_model.trainable = False ## Not trainable weights
base_model.summary()

##### Freeze the First 12 Layers

We will set the first 12 layers to be frozen, and leave the rest open to be trained. 
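
To see where that cut lands, the sketch below lists the standard VGG16 layer names (without the top) and applies the same `[:12]` slice. The list is written out by hand for illustration; the first entry stands in for the input layer, whose real name varies between sessions.

```python
# Layer names of VGG16 with include_top=False, in order. The first entry is the
# input layer, whose real name varies between sessions ("input_1", "input_2", ...).
vgg16_layers = [
    "input",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
]

frozen_layers = vgg16_layers[:12]      # same slice as base_model.layers[:12]
trainable_layers = vgg16_layers[12:]

print("last frozen layer:", frozen_layers[-1])
print("trainable conv layers:", [n for n in trainable_layers if "conv" in n])
```

So the first three blocks and the first convolution of block 4 stay frozen, while the remaining five convolutional layers, which hold most of the weights, are open to retraining.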

In [None]:
for layer in base_model.layers[:12]:
    layer.trainable = False
base_model.summary()

#### More Retraining

Now we have larger portions of the model that can be trained. We will be losing some of the pretrained knowledge, replacing it with the training coming from our data. If we look at the trainable params above, there are a bunch that are trainable and a bunch that aren't.

We are playing with fire here! Taking away more and more of the "smart" model will be risky for actual performance; we are pretty likely to make things worse as we go farther and farther into removing the old training. 

In [None]:
# Add Dense Stuff
flatten_layer = Flatten()
dense_layer_1 = Dense(512, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
dense_layer_2 = Dense(256, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
prediction_layer = Dense(5)

model = Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    #dense_layer_2,
    prediction_layer
])

model.summary()

In [None]:
# Model
model.compile(
    optimizer=tf.keras.optimizers.Adam(), 
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
)
            
time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "logs/VGG/drastic/" + time_stamp
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, write_graph=False, write_images=False)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="weights/VGG/drastic/"+time_stamp+"model.keras", save_best_only=True, monitor='val_accuracy', mode='max')

model.fit(train_ds,
            epochs=epochs,
            verbose=1,
            validation_data=val_ds,
            callbacks=[tensorboard_callback, stopping_callback, checkpoint_callback])

#### Results

We likely see worse results when retraining more of the model, which is to be expected. In general, replacing the classifier, possibly followed by some low learning rate fine tuning, is the best solution for most cases like this.

## Exercise - ResNet50

This is another pretrained network, containing 50 layers. We can use this one similarly to the last. Try to use transfer learning along with some of your added layers to predict. 

In [None]:
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

def preprocess50(images, labels):
  return tf.keras.applications.resnet50.preprocess_input(images), labels

batch_size = 16

train_ds_orig = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds_orig = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

class_names = train_ds_orig.class_names
print(class_names)

# The ResNet50 preprocessing function (preprocess50, defined above) is used below;
# the VGG16 version is not needed in this section.

train_ds = train_ds_orig.map(preprocess50)
val_ds = val_ds_orig.map(preprocess50)

In [None]:
# Make Model
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(img_height, img_width, img_depth))
base_model.trainable = False ## Not trainable weights

# Add Dense Stuff
flatten_layer = Flatten()
dense_layer_1 = Dense(512, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
dense_layer_2 = Dense(256, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
dense_layer_3 = Dense(96, activation='relu', kernel_regularizer='l2', bias_regularizer='l2')
prediction_layer = Dense(5)

model = Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    dense_layer_2,
    dense_layer_3,
    prediction_layer
])

model.summary()

##### Train New Classifier

Train model with new classifier. 

In [None]:
# Model
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
            optimizer="adam", 
            metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")])
            
# Fresh timestamp so the logs and the saved weights for this run share a name
time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "logs/50/initial/" + time_stamp
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="weights/50/initial/"+time_stamp+"model.keras", save_best_only=True, monitor='val_accuracy', mode='max')

model.fit(train_ds,
            epochs=epochs,
            verbose=1,
            validation_data=val_ds,
            callbacks=[tensorboard_callback, stopping_callback, checkpoint_callback])

##### Attempt Retraining Entire Model to Fine Tune

We can attempt to unlock the model and retrain in fine tuning. 

In [None]:
base_model.trainable = True
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-6),  # Low learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
)

# Fresh timestamp so the logs and the saved weights for this run share a name
time_stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = "logs/50/fine_tune/" + time_stamp
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="weights/50/fine_tune/"+time_stamp+"model.keras", save_best_only=True, monitor='val_accuracy', mode='max')

model.fit(train_ds, epochs=epochs, validation_data=val_ds, verbose=1, callbacks=[tensorboard_callback, stopping_callback, checkpoint_callback])

### Transfer Learning Conclusion

Transfer learning is common, especially when working with things like images. Pretrained models that have seen millions upon millions of images get very good at "understanding" what is in an image, or extracting important features from those images. This basic ability to "see" image data carries over between the different types of image tasks that we may want to do. For image data, natural language, audio, or video, it is likely that one of these large models will be more capable of extracting features from the data than we could ever hope to be from scratch. Since the basics of "seeing a thing" or "reading a sentence" are the same no matter the specific application, the ability to process data that our pretrained models have can be repurposed to our specific ends. 

We can see lots of scenarios in the real world where people are adapting image recognition models trained by Google to do things like recognize objects in their home security system, or language models like the GPT family being adapted to better understand domain specific language. We'll likely see more of this, as the benefits of training on massive amounts of data are hard, if not impossible, to replicate. 

## Fine-Tuning and Large Models

One rapidly expanding application of fine-tuning is to customize large models, most notably large language models like ChatGPT. These models are massive, and the training process is incredibly resource intensive - to the point where training a model initially can cost millions of dollars' worth of processing time and electricity and require GPUs that are well beyond what is available to a consumer, or at least a consumer who doesn't want to spend $10,000+ on their graphics card. Customizing these types of models is a useful application of fine-tuning, as it is the only realistic approach to repurposing or targeting a model that is too large for us to train from scratch. 

There are several fine-tuning methodologies and processes that are emerging as LLMs become more common, and more are being created all the time. One such approach is LoRA (Low-Rank Adaptation).

### LoRA

LoRA is an approach to fine-tuning a large model that aims to be more efficient by leaving the original weights frozen and training only a small set of additional weights that adjust the model's behavior for our purposes. We can think of this with an analogy with different types of speech. For example, if an engineer is speaking about the structural dynamics of a bridge, and a doctor is speaking about treating a patient's cancer, they will each sound very different, even though they are both speaking English. In this example the English part of the model does not change between the two types of speech, but the specific details that make it "engineering-talk" versus "doctor-talk" do change. LoRA keeps the shared part fixed and captures the part that needs to change as a pair of small, low-rank matrices added alongside the original weight matrices; the fine-tuning process then trains only those small matrices, resulting in far fewer weights that need to be adjusted and a far less demanding fine-tuning process. This can allow models that are large, advanced, and complex to be fine-tuned to specific applications in less time and potentially even on consumer GPUs, something that is generally not possible with the most advanced models. 
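
Mechanically, a LoRA layer leaves the pretrained weight matrix untouched and trains two much smaller matrices whose product is added to it. The NumPy sketch below shows the idea for a single layer; the sizes and variable names are illustrative and not taken from any real model or library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8   # illustrative sizes, not taken from any real model

W = rng.normal(size=(d_out, d_in))        # pretrained weights: frozen
A = rng.normal(size=(rank, d_in)) * 0.01  # small trainable matrix
B = np.zeros((d_out, rank))               # trainable, starts at zero

def lora_forward(x):
    """Frozen layer output plus the low-rank correction B @ A."""
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
# With B at zero the adapted layer behaves exactly like the original one.
unchanged_at_start = np.allclose(lora_forward(x), x @ W.T)

full_params = W.size
lora_params = A.size + B.size
print(unchanged_at_start, full_params, lora_params)
```

In this toy layer only 8,192 values would be trained instead of 262,144, and the saving grows with the size of the layer.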

Other fine-tuning approaches do similar things: they reduce the amount of work done in the tuning process by only training a minimal set of weights, instead of all of them. 