# Lab 5: Practical Evaluation of Machine Learning Models

In this lab, we will use quantization to optimize the neural network previously trained on CIFAR-10, and then compare the original and optimized models in terms of energy, latency, and memory.

## Learning Objectives

- LO1: Learn how to measure the performance of an ML model
- LO2: Optimize the ML model for the CIFAR-10 dataset in order to improve the latency and reduce energy consumption
- LO3: Project-Based Lab: Deep learning processing competition
  - Select a dataset suitable for ARM Cortex-M processors
  - Implement deep learning processing


## Preparation

In [1]:
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D, BatchNormalization, Activation

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

import numpy as np
import random, tempfile, zipfile

from PIL import Image

import os

### Dataset loading

In [2]:
# Load data from TF Keras
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# CIFAR10 class names
class_names = ['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']
num_classes = len(class_names)

### Data preprocessing
Here we normalise the images to the range [0, 1], which is good practice in deep learning.

In [3]:
# Normalize pixel values to be between 0 and 1
x_train = x_train.astype(np.float32)/255
x_test = x_test.astype(np.float32)/255

# Convert class vectors to binary class matrices.
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

# Print arrays shape
print('x_train shape:', x_train.shape)
print('y_train shape:', y_train.shape)
print('x_test shape:', x_test.shape)
print('y_test shape:', y_test.shape)


x_train shape: (50000, 32, 32, 3)
y_train shape: (50000, 10)
x_test shape: (10000, 32, 32, 3)
y_test shape: (10000, 10)


### Previous model loading

In [4]:
# Load keras model
path_models = "./Data/models/"
path_keras_model = path_models+"custom_cifar10_model.h5"

model = tf.keras.models.load_model(path_keras_model)

# Score trained model.
scores = model.evaluate(x_test, y_test, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])

Test loss: 0.5469570755958557
Test accuracy: 0.8270999789237976


## Quantisation
In TinyML applications, one of the most critical bottlenecks is the heavily constrained MCU memory, which makes quantisation an important tool in any TinyML engineering project. The idea is to shrink the full-precision network while maintaining the model's performance. Here we quantise the 32-bit full-precision network that we trained previously down to an int8 model. We do so by first converting the Keras model into a TFLite model (without quantisation) and then applying post-training quantisation using TFLite's built-in converter.

Note the _representative dataset_: it is essential for the converter to find the right scaling for the model, so that quantisation has minimal impact on the model's performance.


_Theoretically any quantisation level is possible, yet int8 is the lowest supported level in TensorFlow Lite._
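To make the role of the representative dataset more concrete, here is a minimal NumPy sketch of affine int8 quantisation. It is not the TFLite converter's calibration code. The helper names `affine_quant_params`, `quantize`, and `dequantize` are hypothetical, and the four `samples` values are made-up stand-ins for calibration data. The sketch assumes the scale is derived from the observed minimum and maximum, and that the range is widened to include 0.0 so that zero stays exactly representable.

```python
import numpy as np

def affine_quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Return (scale, zero_point) mapping [x_min, x_max] onto [q_min, q_max]."""
    # Assumption: widen the range so that 0.0 maps to an exact integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map real values to int8: round to the nearest step, then clip."""
    q = np.round(np.asarray(x) / scale) + zero_point
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Inverse mapping: real_value = (int8_value - zero_point) * scale."""
    return (np.asarray(q, dtype=np.float32) - zero_point) * scale

# Hypothetical calibration samples standing in for a representative dataset.
samples = np.array([0.0, 0.2, 0.6, 1.0], dtype=np.float32)
scale, zero_point = affine_quant_params(float(samples.min()), float(samples.max()))

q = quantize(samples, scale, zero_point)
recovered = dequantize(q, scale, zero_point)
print("scale:", scale, "zero_point:", zero_point)
print("int8 values:", q)
print("max round-trip error:", np.max(np.abs(recovered - samples)))
```

With this scheme the round-trip error of any value inside the calibrated range is at most half a quantisation step (`scale / 2`). This is why a representative range matters: a range that is too wide wastes resolution, while one that is too narrow clips real activations.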

### TFLite Converter

In [5]:
# Convert using integer-only quantization.
# The resulting model uses integer data for its input and output tensors,
# so it is compatible with integer-only hardware.

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(x_train).batch(1).take(100):
    yield [input_value]

path_models = "./Data/models/"
path_keras_model = path_models+"custom_cifar10_model.h5"
model = tf.keras.models.load_model(path_keras_model)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT] #converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to int8 (APIs added in r2.3)
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quantInt = converter.convert()

INFO:tensorflow:Assets written to: /var/folders/sz/tdwqz8lx113_mx5z2m_966rw0000gn/T/tmp4b_13li5/assets
2024-07-10 15:20:48.463067: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:364] Ignored output_format.
2024-07-10 15:20:48.463104: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:367] Ignored drop_control_dependency.
2024-07-10 15:20:48.463318: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /var/folders/sz/tdwqz8lx113_mx5z2m_966rw0000gn/T/tmp4b_13li5
2024-07-10 15:20:48.465377: I tensorflow/cc/saved_model/reader.cc:91] Reading meta graph with tags { serve }
2024-07-10 15:20:48.465385: I tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /var/folders/sz/tdwqz8lx113_mx5z2m_966rw0000gn/T/tmp4b_13li5
2024-07-10 15:20:48.469786: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2024-07-10 15:20:48.471163: I tensorflow/cc/saved_model/load

In [6]:
path_tflite_model = path_models+"custom_cifar10_model_quantInt.tflite"

# Save the quantized int model:
with open(path_tflite_model, 'wb') as f:
    f.write(tflite_model_quantInt)

### Inspecting the quantised model
Let's take a closer look at the quantised model. The cell below shows that the model's input and output types are in fact int8. It also prints the quantisation parameters (scale and zero point) that the TFLite converter derived from the representative dataset. These are needed to map the int8 inputs and outputs back to real values.

From the [TFLite documentation](https://www.tensorflow.org/lite/performance/quantization_spec):

_real_value = (int8_value - zero_point) * scale_

In [7]:
# Input and output details
tflite_interpreter = tf.lite.Interpreter(model_path=path_tflite_model)
input_details = tflite_interpreter.get_input_details()
output_details = tflite_interpreter.get_output_details()

print("== Input details ==")
print("Name:", input_details[0]['name'])
print("Shape:", input_details[0]['shape'])
print("Type:", input_details[0]['dtype'])
print("quantisation scale {}, zero_point {}".format(input_details[0]['quantization'][0], input_details[0]['quantization'][1]))

print("\n== Output details ==")
print("Name:", output_details[0]['name'])
print("Shape:", output_details[0]['shape'])
print("Type:", output_details[0]['dtype'])
print("quantisation scale {}, zero_point {}".format(output_details[0]['quantization'][0], output_details[0]['quantization'][1]))

== Input details ==
Name: serving_default_conv2d_42_input:0
Shape: [ 1 32 32  3]
Type: <class 'numpy.int8'>
quantisation scale 0.003921568859368563, zero_point -128

== Output details ==
Name: StatefulPartitionedCall:0
Shape: [ 1 10]
Type: <class 'numpy.int8'>
quantisation scale 0.00390625, zero_point -128
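As a worked example of the formula above, the following sketch plugs in the scale and zero-point values printed in this output. The helper names `to_int8` and `to_real` are hypothetical, and rounding to the nearest integer before the cast is an assumption of this sketch rather than something the notebook does.

```python
import numpy as np

# Values copied from the printed input/output details above.
INPUT_SCALE, INPUT_ZERO_POINT = 0.003921568859368563, -128
OUTPUT_SCALE, OUTPUT_ZERO_POINT = 0.00390625, -128

def to_int8(real_values, scale, zero_point):
    """Quantise real values: int8_value = round(real_value / scale) + zero_point."""
    q = np.round(np.asarray(real_values, dtype=np.float64) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def to_real(int8_values, scale, zero_point):
    """Dequantise: real_value = (int8_value - zero_point) * scale."""
    return (np.asarray(int8_values, dtype=np.float64) - zero_point) * scale

# Normalised pixel intensities (black, mid-grey, white).
pixels = np.array([0.0, 0.5, 1.0])
print(to_int8(pixels, INPUT_SCALE, INPUT_ZERO_POINT))       # [-128   -1  127]

# Raw int8 scores as they would come out of the interpreter.
raw_scores = np.array([-128, 0, 127], dtype=np.int8)
print(to_real(raw_scores, OUTPUT_SCALE, OUTPUT_ZERO_POINT)) # 0.0, 0.5, 0.99609375
```

The input scale is approximately 1/255, so the normalised pixel range [0, 1] covers the full int8 range. The output scale of 1/256 with a zero point of -128 maps int8 scores onto [0, 1), which is consistent with a probability-like output such as softmax.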


### Benefit and drawback of the quantisation
In engineering there is no free lunch. While the benefit of quantisation is a smaller model, it comes at the cost of reduced numerical precision due to the int8 representation. The following cells look at the model size reduction and the accuracy degradation. The TinyML engineer must then decide whether the trade-off is worth it.

In [8]:
def get_gzipped_model_size(file):
    # Returns the size of the zip-compressed (DEFLATE) model, in bytes.
    fd, zipped_file = tempfile.mkstemp('.zip')
    os.close(fd)  # mkstemp returns an open file descriptor; close it to avoid a leak
    with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
        f.write(file)
    return os.path.getsize(zipped_file)

fp_size = get_gzipped_model_size(path_keras_model)
quant_size = get_gzipped_model_size(path_tflite_model)
print('Size of Full Precision Model: {} Bytes'.format(fp_size))
print('Size of quantised Model: {} Bytes'.format(quant_size))
print('Size reduction factor: {} times'.format(fp_size/quant_size))

Size of Full Precision Model: 1175226 Bytes
Size of quantised Model: 99463 Bytes
Size reduction factor: 11.815710364658214 times
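The numbers above are compressed sizes. On a microcontroller the model is typically stored uncompressed in flash, so it can be useful to look at the raw file size as well. The sketch below is one possible way to report both. The helper name `raw_and_gzipped_size` is hypothetical, and the demo runs on a throwaway file of zeros rather than on the lab's models. Also note that a Keras `.h5` file may contain more than the weights (for example optimizer state), which can inflate the reduction factor beyond the roughly 4x expected from going from 32-bit to 8-bit values alone.

```python
import gzip
import os
import tempfile

def raw_and_gzipped_size(path):
    """Return (raw_bytes, gzipped_bytes) for the file at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    # Compress in memory so that no temporary archive is left on disk.
    return len(data), len(gzip.compress(data))

# Demo on a throwaway file; in the lab you would pass path_keras_model
# and path_tflite_model instead.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(bytes(10_000))  # 10,000 zero bytes, which compress very well
    demo_path = tmp.name

raw, packed = raw_and_gzipped_size(demo_path)
os.remove(demo_path)
print("raw: {} bytes, gzipped: {} bytes".format(raw, packed))
```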


In [9]:
# We evaluate on only a fraction of the test set for time reasons
num_samples = len(x_test) // 100
predictions = np.zeros((num_samples,), dtype=int)
input_scale, input_zero_point = input_details[0]["quantization"]

# Tensors only need to be allocated once, before the first inference
tflite_interpreter.allocate_tensors()
for i in range(num_samples):
    # Quantise the float image to int8 using the input scale and zero point
    val_batch = x_test[i] / input_scale + input_zero_point
    val_batch = np.expand_dims(val_batch, axis=0).astype(input_details[0]["dtype"])
    tflite_interpreter.set_tensor(input_details[0]['index'], val_batch)
    tflite_interpreter.invoke()

    output = tflite_interpreter.get_tensor(output_details[0]['index'])
    predictions[i] = output.argmax()

# Fraction of the evaluated samples that were classified correctly
num_correct = np.sum(predictions == np.argmax(y_test[:num_samples], axis=1))
accuracy_score = num_correct / num_samples

# Note: the float32 baseline below is scored on the full test set,
# while the int8 model above is scored on the first num_samples images only
full_precision_model = tf.keras.models.load_model(path_keras_model)
score = full_precision_model.evaluate(x_test, y_test, verbose=0)

print("Accuracy of quantized to int8 model is {}%".format(accuracy_score*100))
print("Compared to float32 accuracy of {}%".format(score[1]*100))
print("We have a change of {}%".format((accuracy_score-score[1])*100))

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


Accuracy of quantized to int8 model is 85.0%
Compared to float32 accuracy of 82.70999789237976%
We have a change of 2.290002107620237%
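The apparent accuracy gain should be read with care: the int8 model was scored on only 100 test images, whereas the float32 model was scored on all 10,000. A rough way to judge whether the difference is meaningful is the binomial standard error of each accuracy estimate. The sketch below uses the numbers printed above. The helper name `accuracy_standard_error` is hypothetical, and treating each prediction as an independent Bernoulli trial is a simplifying assumption.

```python
import math

def accuracy_standard_error(accuracy, num_samples):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)

# Accuracies and sample counts taken from the output above.
int8_acc, n_int8 = 0.85, 100
float_acc, n_float = 0.8271, 10000

se_int8 = accuracy_standard_error(int8_acc, n_int8)
se_float = accuracy_standard_error(float_acc, n_float)
difference = int8_acc - float_acc

print("int8:    {:.4f} +/- {:.4f}".format(int8_acc, se_int8))
print("float32: {:.4f} +/- {:.4f}".format(float_acc, se_float))
print("difference: {:.4f}".format(difference))
```

With 100 samples the standard error is about 3.6 percentage points, which is larger than the observed difference of about 2.3 points. The result is therefore compatible with quantisation having little or no effect on accuracy, but it does not show that quantisation improved the model. Evaluating both models on the same, larger subset would give a fairer comparison.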


## Code Credit
Largely contributed by:
- author = "Pau Danilo Email: danilo.pau@st.com, Carra Alessandro"
- copyright = "Copyright (c) 2018, STMicroelectronics"
- license = "CC BY-NC-SA 3.0 IT - https://creativecommons.org/licenses/by-nc-sa/3.0/"

In [10]:
(10*.5) + (2*200) + (10*.5)

410.0

In [11]:
30 + 350

380

In [12]:
(.25 + 0.005) / 255

0.001

In [14]:
-0.005 / 0.001

-5.0

In [15]:
def quantize(x):
    return (x / 0.001) + 5

In [16]:
quantize(-0.005)

0.0

In [17]:
quantize(.25)

255.0

In [18]:
quantize(0.1237)

128.7