
In [1]:
! pip install -q tensorflow
! pip install -q tensorflow-model-optimization


[?25l[K     |█▍                              | 10 kB 37.3 MB/s eta 0:00:01[K     |██▊                             | 20 kB 20.3 MB/s eta 0:00:01[K     |████▏                           | 30 kB 27.0 MB/s eta 0:00:01[K     |█████▌                          | 40 kB 14.6 MB/s eta 0:00:01[K     |██████▉                         | 51 kB 13.9 MB/s eta 0:00:01[K     |████████▎                       | 61 kB 16.2 MB/s eta 0:00:01[K     |█████████▋                      | 71 kB 15.4 MB/s eta 0:00:01[K     |███████████                     | 81 kB 16.9 MB/s eta 0:00:01[K     |████████████▍                   | 92 kB 16.4 MB/s eta 0:00:01[K     |█████████████▊                  | 102 kB 15.8 MB/s eta 0:00:01[K     |███████████████                 | 112 kB 15.8 MB/s eta 0:00:01[K     |████████████████▌               | 122 kB 15.8 MB/s eta 0:00:01[K     |█████████████████▉              | 133 kB 15.8 MB/s eta 0:00:01[K     |███████████████████▏            | 143 kB 15.8 MB/s eta 0:

In [5]:
import tempfile
import os

import tensorflow as tf

from tensorflow import keras

from time import perf_counter

from statistics import mean

## Train a model for MNIST without quantization aware training

In [3]:
# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input images so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture.
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_split=0.1,
)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


<keras.callbacks.History at 0x7fd8c00af220>

In [11]:
# Time 10 full passes of model.evaluate() over the test set.
inference_time = []
for _ in range(10):
  start = perf_counter()
  model.evaluate(test_images, test_labels)
  stop = perf_counter()
  inference_time.append(stop - start)

for elapsed in inference_time:
  print("Inference Time Diff = ", elapsed)

print("Mean Time Diff = ", mean(inference_time))

Inference Time Diff =  0.9762926920000154
Inference Time Diff =  0.7951482319999741
Inference Time Diff =  1.3283438380000234
Inference Time Diff =  1.3256258449999905
Inference Time Diff =  0.7859067210000603
Inference Time Diff =  1.3223132669999131
Inference Time Diff =  1.332721081999921
Inference Time Diff =  1.3391013430000385
Inference Time Diff =  0.8047558839999738
Inference Time Diff =  1.3227648769999405
Mean Time Diff =  1.133297378099985
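
The individual timings above range from roughly 0.79 s to 1.34 s, so the mean alone can be misleading. As a minimal sketch (the helper name `time_callable` and its parameters are hypothetical, not part of this notebook), a timing loop can discard a warm-up call and report the median next to the mean:

```python
from statistics import mean, median
from time import perf_counter


def time_callable(fn, repeats=10, warmup=1):
    """Time fn() `repeats` times after `warmup` untimed calls.

    Hypothetical helper for illustration only.
    """
    for _ in range(warmup):
        fn()  # untimed call so one-off setup cost is not measured

    durations = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        durations.append(perf_counter() - start)

    return {
        "runs": durations,
        "mean": mean(durations),
        "median": median(durations),
        "min": min(durations),
    }


# Stand-in workload; in this notebook the callable could instead be
# lambda: model.evaluate(test_images, test_labels, verbose=0)
stats = time_callable(lambda: sum(i * i for i in range(10_000)), repeats=5)
print("Mean time   =", stats["mean"])
print("Median time =", stats["median"])
```

The median is less sensitive than the mean to the occasional slow run, which may make the comparison with the TFLite timings later in the notebook easier to read.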


## Clone and fine-tune pre-trained model with quantization aware training


### Define the model

You will apply quantization aware training to the whole model and see this in the model summary. All layers are now prefixed by "quant".

Note that the resulting model is quantization aware but not quantized (e.g. the weights are float32 instead of int8). The following sections show how to create a quantized model from the quantization aware one.

In the [comprehensive guide](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide.md), you can see how to quantize only some layers to improve model accuracy.
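
To make "quantization aware but not quantized" concrete, the idea can be sketched as rounding each value to an 8-bit grid and immediately converting it back to float, so the model sees the rounding error during training while its tensors stay in floating point. The function below is a hypothetical illustration in plain Python, assuming a simple per-tensor min/max scheme; it is not the code that `tfmot` runs internally.

```python
def fake_quantize(values, num_bits=8):
    """Quantize then immediately dequantize a list of floats.

    Hypothetical illustration of the "fake quantization" idea.
    """
    qmin, qmax = 0, 2 ** num_bits - 1            # 0..255 for 8 bits
    lo = min(min(values), 0.0)                   # range must include 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0     # fall back to 1.0 if range is empty
    zero_point = round(qmin - lo / scale)        # integer that represents 0.0

    result = []
    for v in values:
        q = round(v / scale) + zero_point        # snap to the integer grid
        q = max(qmin, min(qmax, q))              # clamp to the representable range
        result.append((q - zero_point) * scale)  # convert back to float
    return result


weights = [-0.51, -0.02, 0.0, 0.13, 0.74]
simulated = fake_quantize(weights)
max_error = max(abs(w - s) for w, s in zip(weights, simulated))
print("simulated:", simulated)
print("max rounding error:", max_error)
```

The outputs are still ordinary floats, which is why the quantization aware model can be trained and evaluated like the baseline; only the later conversion step stores actual integers.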

In [12]:
import tensorflow_model_optimization as tfmot

quantize_model = tfmot.quantization.keras.quantize_model

# q_aware stands for quantization aware.
q_aware_model = quantize_model(model)

# `quantize_model` requires a recompile.
q_aware_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

q_aware_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 quantize_layer (QuantizeLay  (None, 28, 28)           3         
 er)                                                             
                                                                 
 quant_reshape (QuantizeWrap  (None, 28, 28, 1)        1         
 perV2)                                                          
                                                                 
 quant_conv2d (QuantizeWrapp  (None, 26, 26, 12)       147       
 erV2)                                                           
                                                                 
 quant_max_pooling2d (Quanti  (None, 13, 13, 12)       1         
 zeWrapperV2)                                                    
                                                                 
 quant_flatten (QuantizeWrap  (None, 2028)             1

### Train and evaluate the model against baseline

To demonstrate fine-tuning after training the baseline model for just one epoch, fine-tune it with quantization aware training on a subset of the training data.

In [13]:
train_images_subset = train_images[0:1000] # out of 60000
train_labels_subset = train_labels[0:1000]

q_aware_model.fit(train_images_subset, train_labels_subset,
                  batch_size=500, epochs=1, validation_split=0.1)



<keras.callbacks.History at 0x7fd83c0d1640>

For this example, there is minimal to no loss in test accuracy after quantization aware training compared to the baseline (0.9619 versus 0.9598 in the run below).

In [14]:
_, baseline_model_accuracy = model.evaluate(
    test_images, test_labels, verbose=0)

_, q_aware_model_accuracy = q_aware_model.evaluate(
   test_images, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy)
print('Quant test accuracy:', q_aware_model_accuracy)

Baseline test accuracy: 0.9598000049591064
Quant test accuracy: 0.961899995803833


## Create quantized model for TFLite backend

After conversion, you have an actually quantized model with int8 weights and uint8 activations.

In [15]:
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

quantized_tflite_model = converter.convert()



## See persistence of accuracy from TF to TFLite

Define a helper function to evaluate the TF Lite model on the test dataset.

In [19]:
import numpy as np

def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for i, test_image in enumerate(test_images):
    #if i % 1000 == 0:
    #  print('Evaluated on {n} results so far.'.format(n=i))
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  #print('\n')
  # Compare prediction results with ground truth labels to calculate accuracy.
  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()
  return accuracy

You evaluate the quantized model and see that the accuracy from TensorFlow persists to the TFLite backend.

In [20]:
interpreter = tf.lite.Interpreter(model_content=quantized_tflite_model)
interpreter.allocate_tensors()

# Time 10 full passes of the TFLite interpreter over the test set.
inference_time = []
for _ in range(10):
  start = perf_counter()
  test_accuracy = evaluate_model(interpreter)
  stop = perf_counter()
  inference_time.append(stop - start)

for elapsed in inference_time:
  print("Inference Time Diff = ", elapsed)

print("Mean Time Diff = ", mean(inference_time))

print('Quant TFLite test_accuracy:', test_accuracy)
print('Quant TF test accuracy:', q_aware_model_accuracy)

Inference Time Diff =  0.6393711969999458
Inference Time Diff =  0.6264271659999849
Inference Time Diff =  0.6129909750000024
Inference Time Diff =  0.6002308560000529
Inference Time Diff =  0.6136381140000822
Inference Time Diff =  0.6001069740000275
Inference Time Diff =  0.5736646400000609
Inference Time Diff =  0.5830028789999915
Inference Time Diff =  0.6523949529999982
Inference Time Diff =  0.6023555749999332
Mean Time Diff =  0.6104183329000079
Quant TFLite test_accuracy: 0.9619
Quant TF test accuracy: 0.961899995803833


## See 4x smaller model from quantization

You create a float TFLite model and then see that the quantized TFLite model
is considerably smaller: up to 4x in theory (float32 to int8), and about 3.4x
in the run below.

In [21]:
# Create float TFLite model.
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite_model = float_converter.convert()

# Measure sizes of models.
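float_fd, float_file = tempfile.mkstemp('.tflite')
quant_fd, quant_file = tempfile.mkstemp('.tflite')
# mkstemp returns an open file descriptor; close it, since the files are
# reopened by path below.
os.close(float_fd)
os.close(quant_fd)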

with open(quant_file, 'wb') as f:
  f.write(quantized_tflite_model)

with open(float_file, 'wb') as f:
  f.write(float_tflite_model)

print("Float model in Mb:", os.path.getsize(float_file) / float(2**20))
print("Quantized model in Mb:", os.path.getsize(quant_file) / float(2**20))



Float model in Mb: 0.08080291748046875
Quantized model in Mb: 0.02371978759765625
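
The two sizes printed above can be turned into an actual ratio and compared against the parameter count of the model defined earlier. The sketch below is plain arithmetic; the bytes-per-parameter figures are a simplifying assumption (4 bytes per float32 parameter, 1 byte per int8 parameter), and some tensors, such as biases, may be stored at a different width.

```python
# Sizes reported by the cell above (computed there as bytes / 2**20).
float_size = 0.08080291748046875
quant_size = 0.02371978759765625

float_bytes = round(float_size * 2**20)
quant_bytes = round(quant_size * 2**20)
ratio = float_bytes / quant_bytes

# Parameter count of the baseline model:
# Conv2D: 3x3 kernel * 1 input channel * 12 filters + 12 biases
# Dense:  13*13*12 = 2028 inputs * 10 units + 10 biases
conv_params = 3 * 3 * 1 * 12 + 12
dense_params = 13 * 13 * 12 * 10 + 10
total_params = conv_params + dense_params

# Assumed payload sizes (see the note above about per-parameter width).
float_payload = total_params * 4
quant_payload = total_params * 1

print(f"size ratio: {ratio:.2f}x")
print(f"parameters: {total_params}")
print(f"float file bytes beyond weights: {float_bytes - float_payload}")
print(f"quant file bytes beyond weights: {quant_bytes - quant_payload}")
```

Under these assumptions, both files carry a few kilobytes beyond the raw weights (graph structure, metadata, and for the quantized model the quantization parameters are likely contributors). For a model this small, that fixed overhead is a plausible reason the measured ratio stays below 4x.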
