# Model compression

Deep neural networks are the state-of-the-art models for many tasks, especially those involving unstructured data such as images and text. One issue with deep learning models is that they are heavily overparameterized, which makes them difficult to deploy on devices with limited memory. Compressing a neural network without sacrificing its performance is an active research topic in machine learning. Parameter pruning, quantization, and knowledge distillation are among the widely used approaches for making neural networks compact. In this notebook, we demonstrate a knowledge distillation method and try to transfer the knowledge learned by a ResNet152 into a comparatively smaller ResNet50 model.

## Knowledge Distillation

Magnitude pruning is one of the popular methods used for model compression. I have demonstrated one such pruning technique in another [notebook](https://www.kaggle.com/code/neerajmohan/magnitude-weight-pruning-on-resnet50). Though very simple to implement, pruning methods have the limitation that the original model and the pruned model have almost the same architecture, so the actual compression relies on the hardware's ability to exploit sparse parameters. The main advantage of knowledge distillation is that the knowledge of a trained model of any architecture can be transferred to a model of any other architecture.
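
The point about the unchanged architecture can be illustrated with a toy sketch in plain Python. The helper `magnitude_prune` and the weight values are hypothetical and are not taken from the linked notebook; real pruning tools operate on framework tensors rather than nested lists.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of a 2-D weight list.

    Weights tied with the pruning threshold are zeroed as well.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to drop
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # largest magnitude that still gets pruned
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Made-up 2x3 weight matrix, used only for illustration.
weights = [[0.8, -0.05, 0.3],
           [-0.02, 0.6, -0.4]]

pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
print("shape before:", (len(weights), len(weights[0])))
print("shape after: ", (len(pruned), len(pruned[0])))
```

The pruned matrix has the same shape as the original. Stored densely it takes the same amount of memory, which is why the savings depend on sparse-aware storage or hardware.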

The idea of knowledge distillation is to use a larger model, known as the teacher, to guide the training of a smaller one, called the student. This approach can be helpful when:

1) There are specific hardware constraints when deploying a machine learning model <br>
2) New state-of-the-art models are developed very often and we want to try to improve the performance of the deployed model without affecting the hardware requirements

A simple but popular knowledge distillation approach was proposed by Hinton et al. in the paper [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531).

For a classification problem, neural networks typically produce class probabilities using a "softmax" output layer that converts the logits $z_c$ into probabilities $q_c$.
Let there be $C$ classes in total. For each class $c$, the probability $q_c$ is calculated by comparing $z_c$ with the other logits:

$$q_c = \frac{\exp(z_c / T)}{\sum_{i=1}^{C} \exp(z_i / T)}$$
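
To make the formula concrete, here is a minimal sketch in plain Python that evaluates it at a few temperatures. The logits are made-up values, not outputs of any model in this notebook, and `softmax_with_temperature` is a helper name introduced only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum before exponentiating improves numerical
    # stability and does not change the result.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for six classes, chosen only for illustration.
example_logits = [6.0, 2.0, 1.0, 0.5, 0.0, -1.0]

for T in (1, 3, 40):
    probs = softmax_with_temperature(example_logits, T)
    print(f"T={T:>2}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Comparing the printed rows shows the largest probability shrinking as $T$ grows, while the ranking of the classes stays the same.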

Normally, the temperature $T$ is set to 1. A higher temperature produces a softer probability distribution over the classes. One way to transfer knowledge from the trained teacher to the student is to train the student with an additional loss, called the distillation loss, alongside the usual cross-entropy loss. The distillation loss measures the difference between the student's and the teacher's predictions after both have been softened with the same temperature.
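
Written out, the combined objective can be sketched as follows. The symbols $z^{s}$ and $z^{t}$ (student and teacher logits), $y$ (ground-truth label), $\alpha$ (weight on the cross-entropy term) and $\sigma$ (softmax) are notation introduced here for illustration. The choice of KL divergence for the distillation term matches the loss configured later in this notebook.

$$\mathcal{L} = \alpha \, \mathrm{CE}\big(y, \sigma(z^{s})\big) + (1 - \alpha) \, \mathrm{KL}\big(\sigma(z^{t} / T) \,\|\, \sigma(z^{s} / T)\big)$$

The paper linked above additionally recommends multiplying the soft-target term by $T^2$ when it is combined with the hard-label loss, because the gradients from the soft targets scale as $1/T^2$. The implementation in this notebook does not apply that factor.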


In this notebook, we use a fine-tuned ResNet152 as the teacher and a ResNet50 as the student. We then compare the distilled ResNet50 student with a ResNet50 fine-tuned in the usual way.

# Problem statement

This is a multiclass image classification problem. The data contains images from 6 categories: 'buildings', 'forest', 'glacier', 'mountain', 'sea' and 'street'. The aim is to develop a machine learning model that correctly classifies an input image into one of these categories.

In this notebook, we try to train a ResNet50 model under the guidance of a trained ResNet152 model, without losing much of the expressivity of the latter.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
#import keras

## Load data

In [None]:
#data-path
train_dir = "../input/intel-image-classification/seg_train/seg_train/"
test_dir = "../input/intel-image-classification/seg_test/seg_test/"

#data-configs
batch_size = 32
img_height = 150
img_width = 150

In [None]:
# Load train data (80% of the training folder; the remaining 20% is not used in this notebook)
train_ds = tf.keras.utils.image_dataset_from_directory(
  train_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

# Load test data. The whole held-out folder is used: passing
# `validation_split=0.2` with `subset="validation"` here would keep only
# 20% of the test images.
test_ds = tf.keras.utils.image_dataset_from_directory(
  test_dir,
  image_size=(img_height, img_width),
  batch_size=batch_size)

### A glimpse of train data

In [None]:
class_names = train_ds.class_names
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")


## Training the Teacher model - using Transfer learning

Ref: https://keras.io/guides/transfer_learning/

In [None]:
base_model = keras.applications.ResNet152(
    weights='imagenet',  # Load weights pre-trained on ImageNet.
    input_shape=(img_height, img_width, 3),
    include_top=False)  # Do not include the ImageNet classifier at the top.
base_model.trainable = False
inputs = keras.Input(shape=(img_height, img_width, 3))
# Run the frozen base model in inference mode by passing `training=False`,
# so that its BatchNormalization layers use their stored moving statistics.
x = base_model(inputs, training=False)
# Convert features of shape `base_model.output_shape[1:]` to vectors
x = keras.layers.GlobalAveragePooling2D()(x)
# Dense classifier that outputs one logit per class (6 classes)
outputs = keras.layers.Dense(6)(x)
model = keras.Model(inputs, outputs)
model.summary()

In [None]:
model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

epochs = 20
model.fit(train_ds, epochs=epochs)


In [None]:
# Evaluate the teacher on the test data; `results` is [loss, accuracy]
results = model.evaluate(test_ds)
print(f"Test accuracy with trained teacher model: {results[1]*100:.2f} %")

# Knowledge Distillation

Code reference : https://colab.research.google.com/drive/1Vo5rFF5JyHdJGFW88io4t5QS6q1klYPD?usp=sharing#scrollTo=Jxg7IWuJLB9g

In [None]:

class Distiller(keras.Model):
    def __init__(self, student, teacher):
        super(Distiller, self).__init__()
        self.teacher = teacher
        self.student = student

    def compile(
        self,
        optimizer,
        metrics,
        student_loss_fn,
        distillation_loss_fn,
        alpha=0.1,
        temperature=3,
    ):
        """ Configure the distiller.

        Args:
            optimizer: Keras optimizer for the student weights
            metrics: Keras metrics for evaluation
            student_loss_fn: Loss function of difference between student
                predictions and ground-truth
            distillation_loss_fn: Loss function of difference between soft
                student predictions and soft teacher predictions
            alpha: weight to student_loss_fn and 1-alpha to distillation_loss_fn
            temperature: Temperature for softening probability distributions.
                Larger temperature gives softer distributions.
        """
        super(Distiller, self).compile(optimizer=optimizer, metrics=metrics)
        self.student_loss_fn = student_loss_fn
        self.distillation_loss_fn = distillation_loss_fn
        self.alpha = alpha
        self.temperature = temperature

    def train_step(self, data):
        # Unpack data
        x, y = data

        # Forward pass of teacher
        teacher_predictions = self.teacher(x, training=False)

        with tf.GradientTape() as tape:
            # Forward pass of student
            student_predictions = self.student(x, training=True)

            # Compute losses
            student_loss = self.student_loss_fn(y, student_predictions)
            distillation_loss = self.distillation_loss_fn(
                tf.nn.softmax(teacher_predictions / self.temperature, axis=1),
                tf.nn.softmax(student_predictions / self.temperature, axis=1),
            )
            loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss

        # Compute gradients
        trainable_vars = self.student.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update the metrics configured in `compile()`.
        self.compiled_metrics.update_state(y, student_predictions)

        # Return a dict of performance
        results = {m.name: m.result() for m in self.metrics}
        results.update(
            {"student_loss": student_loss, "distillation_loss": distillation_loss}
        )
        return results

    def test_step(self, data):
        # Unpack the data
        x, y = data

        # Compute predictions
        y_prediction = self.student(x, training=False)

        # Calculate the loss
        student_loss = self.student_loss_fn(y, y_prediction)

        # Update the metrics.
        self.compiled_metrics.update_state(y, y_prediction)

        # Return a dict of performance
        results = {m.name: m.result() for m in self.metrics}
        results.update({"student_loss": student_loss})
        return results
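
The arithmetic inside `train_step` can be followed on a single example with a small plain-Python sketch. The logits below are made-up numbers, and the helper names (`softmax`, `cross_entropy`, `kl_divergence`, `distillation_objective`) are introduced only for illustration. The sketch ignores batching and the numerical clipping that Keras losses apply.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_class, logits):
    """Sparse categorical cross entropy computed from raw logits."""
    return -math.log(softmax(logits)[true_class])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distillation_objective(true_class, student_logits, teacher_logits,
                           alpha=0.1, temperature=3.0):
    """alpha * hard-label loss + (1 - alpha) * soft-label (distillation) loss."""
    student_loss = cross_entropy(true_class, student_logits)
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distillation_loss = kl_divergence(soft_teacher, soft_student)
    return alpha * student_loss + (1 - alpha) * distillation_loss


# Made-up logits for one image whose true class index is 0.
teacher_logits = [8.0, 1.0, 0.5, 3.0, 0.0, -1.0]
student_logits = [2.0, 1.5, 0.0, 0.5, 0.2, -0.3]

loss = distillation_objective(0, student_logits, teacher_logits)
print(f"combined loss: {loss:.4f}")
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy term remains.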


In [None]:
student_base_model = keras.applications.ResNet50(
    weights='imagenet',  # Load weights pre-trained on ImageNet.
    input_shape=(img_height, img_width, 3),
    include_top=False)  # Do not include the ImageNet classifier at the top.
student_base_model.trainable = True  # Fine-tune all the weights of the model
inputs_student = keras.Input(shape=(img_height, img_width, 3))
# No explicit `training` argument here: the base model then follows the mode of
# the outer call (training mode in `fit`, inference mode in `evaluate`).
# Hard-coding `training=True` would keep BatchNormalization on batch
# statistics at test time as well.
x_student = student_base_model(inputs_student)
# Convert features of shape `student_base_model.output_shape[1:]` to vectors
x_student = keras.layers.GlobalAveragePooling2D()(x_student)
# Dense classifier that outputs one logit per class (6 classes)
outputs_student = keras.layers.Dense(6)(x_student)
student = keras.Model(inputs_student, outputs_student)
student.summary()

In [None]:
# Initialize and compile distiller
distiller = Distiller(student=student, teacher=model)
distiller.compile(
    optimizer=keras.optimizers.Adam(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    student_loss_fn=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=keras.losses.KLDivergence(),
    alpha=0.05,
    temperature=40,
)

# Distill teacher to student
distiller.fit(train_ds, epochs=10)

In [None]:
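# Evaluate student on test dataset. `return_dict=True` lets us look up the
# accuracy by metric name instead of relying on its position in a list.
results = distiller.evaluate(test_ds, return_dict=True)
print(f"Test accuracy of the student model, with distilled knowledge from teacher model: {results['sparse_categorical_accuracy']*100:.2f} %")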

We can observe that the student model performs comparably to the teacher model.

Now, we will check the performance of the student architecture (ResNet50 in this case) when it is trained without any knowledge distillation.

In [None]:
student_scratch_base_model = keras.applications.ResNet50(
    weights='imagenet',  # Load weights pre-trained on ImageNet.
    input_shape=(img_height, img_width, 3),
    include_top=False)  # Do not include the ImageNet classifier at the top.
student_scratch_base_model.trainable = True  # Fine-tune all the weights of the model
inputs_scratch = keras.Input(shape=(img_height, img_width, 3))
# No explicit `training` argument here: the base model then follows the mode of
# the outer call (training mode in `fit`, inference mode in `evaluate`).
# Hard-coding `training=True` would keep BatchNormalization on batch
# statistics at test time as well.
x_scratch = student_scratch_base_model(inputs_scratch)
# Convert features of shape `student_scratch_base_model.output_shape[1:]` to vectors
x_scratch = keras.layers.GlobalAveragePooling2D()(x_scratch)
# Dense classifier that outputs one logit per class (6 classes)
outputs_scratch = keras.layers.Dense(6)(x_scratch)
student_scratch = keras.Model(inputs_scratch, outputs_scratch)
student_scratch.summary()

In [None]:
# Compile the baseline student with the usual cross-entropy loss
student_scratch.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Fine-tune and evaluate the baseline student (ImageNet-initialised, no distillation).
student_scratch.fit(train_ds, epochs=10)
result = student_scratch.evaluate(test_ds)  # result = [loss, accuracy]


In [None]:
print(f"Test accuracy of the student model, when fine-tuned without any knowledge distillation: {result[1]*100:.2f} %")

We can observe that, for the same number of epochs and the same architecture, the student trained with knowledge distillation performs better than the one trained without it.

# End notes

1) The notebook demonstrates a simple knowledge distillation technique <br>
2) The method aims to train a comparatively smaller ResNet50 model that incorporates the knowledge of a mighty ResNet152 model