# **Model Optimization and Performance Measurement**

Deep learning model optimization refers to the process of improving the performance, efficiency, and general characteristics of a deep learning model. Optimization is crucial for ensuring that the model performs well, uses computational resources efficiently, and meets problem-specific requirements.

Furthermore, since deep learning models are used in web applications, mobile devices, and edge devices, we want to compress the models without reducing the quality of the original models.


# **Why should models be optimized?**

There are several main ways model optimization can help with application development.

## **Size reduction**

Some forms of optimization can be used to reduce the size of a model. Smaller models have the following benefits:

- Smaller storage size
- Smaller download size
- Less memory usage

Quantization can reduce the size of a model in all of these cases, potentially at the expense of some accuracy. Pruning and clustering can reduce the size of a model for download by making it more easily compressible.

## **Latency reduction**

Latency is the amount of time it takes to run a single inference with a given model. Some forms of optimization can reduce the amount of computation required to run inference using a model, resulting in lower latency. Latency can also have an impact on power consumption.

Currently, quantization can be used to reduce latency by simplifying the calculations that occur during inference, potentially at the expense of some accuracy.
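As a rough illustration, single-inference latency can be estimated by timing repeated calls after a few warm-up runs and taking the median. The helper `measure_latency_ms` and the stand-in workload `fake_inference` are hypothetical names used only for this sketch; in practice the timed call would be a real model invocation.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=3, repeats=20):
    """Return the median wall-clock time of run_once() in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        run_once()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_inference():
    # Hypothetical stand-in for a single model inference call.
    return sum(i * i for i in range(10_000))


latency_ms = measure_latency_ms(fake_inference)
print(f"Median latency: {latency_ms:.3f} ms")
```

The median is used rather than the mean so that an occasional slow run (for example, caused by other processes) has less influence on the result.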

# **Types of optimization methods**

## **Quantization**

Quantization works by reducing the precision of the numbers used to represent a model's parameters, which by default are 32-bit floating point numbers. This results in a smaller model size and faster computation.
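The idea can be sketched with NumPy alone. The example below applies a simple affine mapping from float32 to int8 on random stand-in weights; it is an illustration of the principle under these assumptions, not the exact scheme a particular converter uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for a layer's float32 weights (assumption for this sketch).
weights = rng.normal(0.0, 0.1, size=1000).astype(np.float32)

# Affine quantization: map the range [w_min, w_max] onto the int8 range.
w_min, w_max = float(weights.min()), float(weights.max())
scale = (w_max - w_min) / 255.0
zero_point = int(round(-128 - w_min / scale))

quantized = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
# Map the int8 values back to floats to see how much precision was lost.
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print("float32 bytes:", weights.nbytes)    # 4 bytes per value
print("int8 bytes:   ", quantized.nbytes)  # 1 byte per value
print("max abs error:", float(np.abs(weights - dequantized).max()))
```

Storage drops by a factor of four, and the reconstruction error stays on the order of one quantization step (`scale`), which is the accuracy cost mentioned above.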

## **Pruning**

Pruning works by removing parameters within a model that have only a minor impact on its predictions. Pruned models are the same size on disk, and have the same runtime latency, but can be compressed more effectively. This makes pruning a useful technique for reducing model download size.
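A minimal NumPy sketch of magnitude-based pruning, assuming random stand-in weights and a 50% sparsity target, shows both effects: the raw array size is unchanged, but the pruned array compresses better.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for a dense layer's weights (assumption for this sketch).
weights = rng.normal(size=(256, 256)).astype(np.float32)

# Magnitude pruning: zero out the half of the weights with the smallest |value|.
threshold = np.median(np.abs(weights))
pruned = weights.copy()
pruned[np.abs(weights) < threshold] = 0.0

sparsity = float(np.mean(pruned == 0))
dense_bytes = len(zlib.compress(weights.tobytes()))
pruned_bytes = len(zlib.compress(pruned.tobytes()))

print(f"sparsity: {sparsity:.2%}")
print(f"raw size (both arrays): {weights.nbytes} bytes")
print(f"compressed dense:  {dense_bytes} bytes")
print(f"compressed pruned: {pruned_bytes} bytes")
```

The zeros are still stored explicitly, which is why the uncompressed size does not change; the gain appears only once a general-purpose compressor is applied.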


## **Clustering**

Clustering works by grouping the weights of each layer in a model into a predefined number of clusters, then sharing the centroid values for the weights belonging to each individual cluster. This reduces the number of unique weight values in a model, thus reducing its complexity.
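The following sketch illustrates weight sharing with a plain one-dimensional k-means loop in NumPy. The random weights, the evenly spaced initial centroids, and the fixed number of iterations are simplifying assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for one layer's weights (assumption for this sketch).
weights = rng.normal(size=1000).astype(np.float32)
n_clusters = 8

# Initialise centroids evenly between the smallest and largest weight.
centroids = np.linspace(float(weights.min()), float(weights.max()), n_clusters)

for _ in range(10):  # a few k-means (Lloyd) iterations
    # Assign every weight to its nearest centroid.
    assignment = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
    # Move each centroid to the mean of the weights assigned to it.
    for k in range(n_clusters):
        members = weights[assignment == k]
        if members.size:
            centroids[k] = members.mean()

# Final assignment, then replace every weight with its shared centroid value.
assignment = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
clustered = centroids[assignment].astype(np.float32)

print("unique values before:", np.unique(weights).size)
print("unique values after: ", np.unique(clustered).size)
```

After clustering, the layer can be described by a small table of centroid values plus one index per weight, which is what makes the result easier to compress.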

### ***The main purpose of this technique is to minimise size and boost computing speed.***







# **Our approaches**

## **1. Fashion MNIST dataset and baseline model development**
## **2. Model Optimization**
## **- Method 1: Model compression using optimization methods based on three papers**
## **- Method 2: Optimization using OpenVINO**
## **3. Performance Analysis (CPU, GPU)**
## **- Analyze the performance using VTune**

# **1. Fashion MNIST dataset and baseline model development**

In [1]:
%%time
# Install the required libraries
! pip install -q tensorflow-model-optimization tensorflow opencv-python pandas numpy matplotlib

CPU times: user 45.2 ms, sys: 9.76 ms, total: 55 ms
Wall time: 6.64 s


In [2]:
%%time
# Import the required libraries

import tensorflow as tf
import pandas as pd
import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt

import tempfile
import zipfile
import os


CPU times: user 3.61 s, sys: 433 ms, total: 4.04 s
Wall time: 5.49 s


In [3]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print("GPU available. Using GPU.")
    # Set GPU memory growth to avoid allocating all memory at once
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    print("No GPU available. Using CPU.")

GPU available. Using GPU.


## **Baseline model development using the Fashion MNIST dataset (classification problem)**

In [4]:
%%time
# Load the Fashion MNIST dataset
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Print the size of the dataset

print(f"Training images: {train_images.shape},Training labels: {train_labels.shape}, Test_images: {test_images.shape},Test_labels: {test_labels.shape}")
print("#"*100)

# Normalize the input image so that each pixel value is between 0 to 1.
train_images = train_images / 255.0
test_images  = test_images / 255.0

model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(28, 28)),
  tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
  tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 3),
                         activation=tf.nn.relu),
  tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

model.summary()
print("#"*100)
# Train the Fashion MNIST classification model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
    train_images,
    train_labels,
    validation_split=0.1,
    epochs=10
)
model.save("baseline_model.h5")


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
Training images: (60000, 28, 28),Training labels: (60000,), Test_images: (10000, 28, 28),Test_labels: (10000,)
####################################################################################################
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 reshape (Reshape)           (None, 28, 28, 1)         0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 12)        120       
    



## **Baseline model evaluation and storage**

In [5]:
%%time
_, baseline_model_accuracy = model.evaluate(
    test_images, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy)

print("#"*100)

_, keras_file = tempfile.mkstemp('.h5')
print('Saving model to: ', keras_file)
tf.keras.models.save_model(model, keras_file, include_optimizer=False)

Baseline test accuracy: 0.8952000141143799
####################################################################################################
Saving model to:  /tmp/tmpkjcqnfrc.h5
CPU times: user 787 ms, sys: 97.8 ms, total: 885 ms
Wall time: 785 ms




# **2. Model Optimization**

# **2.1. Method 1: Model compression using optimization methods based on three papers**

## **Prune and fine-tune the model to 50% sparsity**

In [6]:
%%time
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0, frequency=100)
  }

callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep()
]

pruned_model = prune_low_magnitude(model, **pruning_params)

# Use smaller learning rate for fine-tuning
opt = tf.keras.optimizers.Adam(learning_rate=1e-5)

pruned_model.compile(
  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
  optimizer=opt,
  metrics=['accuracy'])

pruned_model.summary()
print("#"*100)

# Fine-tune model
pruned_model.fit(
  train_images,
  train_labels,
  epochs=3,
  validation_split=0.1,
  callbacks=callbacks)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 prune_low_magnitude_reshap  (None, 28, 28, 1)         1         
 e (PruneLowMagnitude)                                           
                                                                 
 prune_low_magnitude_conv2d  (None, 26, 26, 12)        230       
  (PruneLowMagnitude)                                            
                                                                 
 prune_low_magnitude_max_po  (None, 13, 13, 12)        1         
 oling2d (PruneLowMagnitude                                      
 )                                                               
                                                                 
 prune_low_magnitude_flatte  (None, 2028)              1         
 n (PruneLowMagnitude)                                           
                                                        

<keras.src.callbacks.History at 0x7a6423a5aec0>

In [7]:
# Helper function to calculate and print the sparsity of the model's kernel weights.


def print_model_weights_sparsity(model):

    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Wrapper):
            weights = layer.trainable_weights
        else:
            weights = layer.weights
        for weight in weights:
            if "kernel" not in weight.name or "centroid" in weight.name:
                continue
            values = weight.numpy()
            weight_size = values.size
            zero_num = np.count_nonzero(values == 0)
            print(
                f"{weight.name}: {zero_num/weight_size:.2%} sparsity ",
                f"({zero_num}/{weight_size})",
            )

In [8]:
%%time
# Check the sparsity of the pruned model
stripped_pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

print_model_weights_sparsity(stripped_pruned_model)

stripped_pruned_model_copy = tf.keras.models.clone_model(stripped_pruned_model)
stripped_pruned_model_copy.set_weights(stripped_pruned_model.get_weights())

conv2d/kernel:0: 50.00% sparsity  (54/108)
dense/kernel:0: 50.00% sparsity  (10140/20280)
CPU times: user 100 ms, sys: 1.82 ms, total: 102 ms
Wall time: 104 ms


## **Apply clustering and sparsity-preserving clustering, and check the effect on model sparsity in each case**

In [9]:
%%time
# Clustering
cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInitialization = tfmot.clustering.keras.CentroidInitialization

clustering_params = {
  'number_of_clusters': 8,
  'cluster_centroids_init': CentroidInitialization.KMEANS_PLUS_PLUS
}

clustered_model = cluster_weights(stripped_pruned_model, **clustering_params)

clustered_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

print('Train clustering model:')
clustered_model.fit(train_images, train_labels,epochs=10, validation_split=0.1)


stripped_pruned_model.save("stripped_pruned_model_clustered.h5")


Train clustering model:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10




CPU times: user 1min 29s, sys: 6.53 s, total: 1min 35s
Wall time: 1min 31s


In [10]:
%%time
# Sparsity preserving clustering
from tensorflow_model_optimization.python.core.clustering.keras.experimental import (
    cluster,
)

cluster_weights = cluster.cluster_weights

clustering_params = {
  'number_of_clusters': 8,
  'cluster_centroids_init': CentroidInitialization.KMEANS_PLUS_PLUS,
  'preserve_sparsity': True
}

sparsity_clustered_model = cluster_weights(stripped_pruned_model_copy, **clustering_params)

sparsity_clustered_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

print('Train sparsity preserving clustering model:')
sparsity_clustered_model.fit(train_images, train_labels,epochs=10, validation_split=0.1)

Train sparsity preserving clustering model:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 1min 39s, sys: 8 s, total: 1min 47s
Wall time: 2min 23s


<keras.src.callbacks.History at 0x7a642ed34b50>

In [11]:
%%time
# Check sparsity
print("Clustered Model sparsity:\n")
print_model_weights_sparsity(clustered_model)
print("\nSparsity preserved clustered Model sparsity:\n")
print_model_weights_sparsity(sparsity_clustered_model)

Clustered Model sparsity:

conv2d/kernel:0: 0.00% sparsity  (0/108)
dense/kernel:0: 0.00% sparsity  (0/20280)

Sparsity preserved clustered Model sparsity:

conv2d/kernel:0: 50.00% sparsity  (54/108)
dense/kernel:0: 50.00% sparsity  (10140/20280)
CPU times: user 8.21 ms, sys: 2.03 ms, total: 10.2 ms
Wall time: 9.26 ms
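The difference between the two results can be reproduced in a small NumPy sketch. It uses random stand-in weights and fixed, evenly spaced centroids instead of trained ones, so it is a simplified illustration of the idea and not how `tensorflow_model_optimization` implements `preserve_sparsity`. The helper `snap_to_centroids` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
# Prune by magnitude so that half of the weights are exactly zero.
weights[np.abs(weights) < np.median(np.abs(weights))] = 0.0
mask = weights != 0


def snap_to_centroids(values, centroids):
    """Replace each value with its nearest centroid."""
    nearest = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids[nearest]


centroids = np.linspace(weights.min(), weights.max(), 8)

# Plain clustering: zeros are treated like any other value, so they are pulled
# to the nearest centroid, which is generally not exactly zero.
plain = snap_to_centroids(weights, centroids)

# Sparsity-preserving variant: re-apply the pruning mask after clustering
# so that the pruned weights stay exactly zero.
preserved = np.where(mask, snap_to_centroids(weights, centroids), 0.0)

print(f"sparsity before clustering:     {np.mean(weights == 0):.2%}")
print(f"plain clustering:               {np.mean(plain == 0):.2%}")
print(f"sparsity-preserving clustering: {np.mean(preserved == 0):.2%}")
```

As in the notebook output above, plain clustering loses the zeros introduced by pruning, while the masked variant keeps them.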


In [12]:
%%time
def get_gzipped_model_size(file):
  # Returns the size of the zip-compressed (DEFLATE) model file in kilobytes.

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(file)

  return os.path.getsize(zipped_file)/1000

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


## **Comparison of model sizes**

In [13]:
%%time
# Clustered model
clustered_model_file = 'clustered_model.h5'

# Save the model.
clustered_model.save(clustered_model_file)

#Sparsity Preserve Clustered model
sparsity_clustered_model_file = 'sparsity_clustered_model.h5'

# Save the model.
sparsity_clustered_model.save(sparsity_clustered_model_file)

base_model_file = 'base_model.h5'
model.save(base_model_file)


CPU times: user 56.1 ms, sys: 2.84 ms, total: 59 ms
Wall time: 60.3 ms




## **Create a TFLite model by combining sparsity-preserving weight clustering with post-training quantization**

In [14]:
%%time
stripped_sparsity_clustered_model = tfmot.clustering.keras.strip_clustering(sparsity_clustered_model)

converter = tf.lite.TFLiteConverter.from_keras_model(stripped_sparsity_clustered_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
sparsity_clustered_quant_model = converter.convert()

_, pruned_and_clustered_tflite_file = tempfile.mkstemp('.tflite')

with open(pruned_and_clustered_tflite_file, 'wb') as f:
  f.write(sparsity_clustered_quant_model)


CPU times: user 1.02 s, sys: 39.5 ms, total: 1.06 s
Wall time: 1.09 s


In [15]:
%%time
print("Base Model size: ", get_gzipped_model_size(base_model_file), ' KB')
print("Clustered Model size: ", get_gzipped_model_size(clustered_model_file), ' KB')
print("Sparsity preserved clustered Model size: ", get_gzipped_model_size(sparsity_clustered_model_file), ' KB')
print("Sparsity preserved clustered and quantized TFLite model size:",
       get_gzipped_model_size(pruned_and_clustered_tflite_file), ' KB')

Base Model size:  156.828  KB
Clustered Model size:  249.873  KB
Sparsity preserved clustered Model size:  149.135  KB
Sparsity preserved clustered and quantized TFLite model size: 8.33  KB
CPU times: user 50.3 ms, sys: 3.86 ms, total: 54.1 ms
Wall time: 55.1 ms


In [16]:
# Helper function: run a TFLite interpreter over the test set and return its accuracy.
def eval_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for i, test_image in enumerate(test_images):
    if i % 1000 == 0:
      print(f"Evaluated on {i} results so far.")
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the class with the
    # highest score (the model outputs logits, not probabilities).
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  print('\n')
  # Compare prediction results with ground truth labels to calculate accuracy.
  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()
  return accuracy

## **Evaluation of the pruned and clustered Keras model and the TFLite model**

In [17]:
%%time
# Keras model evaluation
stripped_sparsity_clustered_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
_, sparsity_clustered_keras_accuracy = stripped_sparsity_clustered_model.evaluate(
    test_images, test_labels, verbose=0)

# TFLite model evaluation
interpreter = tf.lite.Interpreter(pruned_and_clustered_tflite_file)
interpreter.allocate_tensors()

sparsity_clustered_tflite_accuracy = eval_model(interpreter)

print('Baseline model accuracy:', baseline_model_accuracy)
print('Pruned and clustered Keras model accuracy:', sparsity_clustered_keras_accuracy)
print('Pruned, clustered and quantized TFLite model accuracy:', sparsity_clustered_tflite_accuracy)

Evaluated on 0 results so far.
Evaluated on 1000 results so far.
Evaluated on 2000 results so far.
Evaluated on 3000 results so far.
Evaluated on 4000 results so far.
Evaluated on 5000 results so far.
Evaluated on 6000 results so far.
Evaluated on 7000 results so far.
Evaluated on 8000 results so far.
Evaluated on 9000 results so far.


Baseline model accuracy: 0.8952000141143799
Pruned and clustered Keras model accuracy: 0.891700029373169
Pruned, clustered and quantized TFLite model accuracy: 0.8921
CPU times: user 2.04 s, sys: 104 ms, total: 2.14 s
Wall time: 2.08 s
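To put the reported numbers side by side, the short calculation below derives the size ratio and the accuracy drop of the TFLite model relative to the baseline. All four values are copied from the outputs above; the variable names are chosen for this sketch only.

```python
# Values copied from the notebook outputs above.
base_size_kb = 156.828    # compressed baseline model
tflite_size_kb = 8.33     # compressed pruned, clustered and quantized TFLite model
baseline_acc = 0.8952000141143799
tflite_acc = 0.8921

size_ratio = base_size_kb / tflite_size_kb
accuracy_drop_points = (baseline_acc - tflite_acc) * 100

print(f"TFLite model is {size_ratio:.1f}x smaller than the baseline")
print(f"Accuracy drop: {accuracy_drop_points:.2f} percentage points")
```

In this run the compressed TFLite file is roughly 19 times smaller than the compressed baseline, at a cost of about 0.3 percentage points of test accuracy.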
