**Post-training quantization and quantized inference**

Post-training quantization is a conversion technique that can **reduce model size** while also **improving CPU and hardware accelerator latency**, with little **degradation** in model **accuracy**. You can apply it to an already-trained **float** TensorFlow model when you convert it to TensorFlow Lite format. Once a model is **fully quantized**, you can deploy it to platforms like the **Coral Edge TPU** or run it using the **NNAPI** **delegate**.

Install the latest **tf-nightly** for quantization and TFLite inference.

In [0]:
!pip install tf-nightly

In [0]:
# Mount Google Drive first so that the archives below are accessible
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!unzip /content/drive/My\ Drive/portrait256/portrait256.zip

The following diagram shows the various methods for **post-training quantization** available in TensorFlow Lite.

![TfliteQuantization](https://www.tensorflow.org/lite/performance/images/optimization.jpg)

Here we will use **full integer quantization** (INT8) with a representative dataset. Use the **training dataset** as the representative dataset for this purpose.

In [0]:
!unzip /content/drive/My\ Drive/portrait_mix/EG1800.zip

In [0]:
# Import libraries
import cv2, sys, time
import numpy as np
import tensorflow as tf
from PIL import Image
import timeit
from matplotlib import pyplot as plt

To convert a float32 model to quantized INT8 format, there are two approaches: **quantization-aware training** and **post-training quantization**. Here we use the latter to convert our model to **INT8** format. We need to quantize both the weights and the activations to make use of accelerators like the **Coral Edge TPU or NNAPI**. We will use the training dataset as a **representative dataset** to measure the dynamic range of activations and inputs.
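
To make the role of the dynamic range concrete, the sketch below shows the affine mapping that INT8 quantization relies on: a real value `r` is approximated as `scale * (q - zero_point)`, with `scale` and `zero_point` derived from an observed minimum and maximum. The helper names (`affine_params`, `quantize`, `dequantize`) are hypothetical, and this is a simplified NumPy illustration, not the converter's actual calibration logic.

```python
import numpy as np

QMIN, QMAX = -128, 127  # INT8 range

def affine_params(rmin, rmax):
    """Derive (scale, zero_point) from an observed float range."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, 0  # degenerate range: every observed value was zero
    scale = (rmax - rmin) / (QMAX - QMIN)
    zero_point = int(round(QMIN - rmin / scale))
    return scale, zero_point

def quantize(r, scale, zero_point):
    """Map float values to INT8 using the affine parameters."""
    q = np.round(np.asarray(r, dtype=np.float32) / scale) + zero_point
    return np.clip(q, QMIN, QMAX).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map INT8 values back to (approximate) float values."""
    return scale * (np.asarray(q, dtype=np.float32) - zero_point)

# Example: activations observed in [0, 1], such as a sigmoid mask
scale, zero_point = affine_params(0.0, 1.0)
x = np.array([0.0, 0.25, 1.0], dtype=np.float32)
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
```

For values inside the observed range, the round-trip error stays within half a quantization step (`scale / 2`), which is why a representative dataset that covers the real input distribution matters.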

**Note:** TF 2.0/Keras currently does not support quantization aware training.

In [0]:
# Configure dataset directories
IMGDS='/content/EG1800/imgs'
MSKDS='/content/EG1800/msks'

# Use image data generator to load the dataset
image_generator = tf.keras.preprocessing.image.ImageDataGenerator()
seed=7

train_image_generator = image_generator.flow_from_directory(
    IMGDS,
    batch_size=8,
    shuffle=True,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode=None,
    seed=seed)

train_mask_generator = image_generator.flow_from_directory(
    MSKDS,
    batch_size=8,
    shuffle=True,
    target_size=(224, 224),
    color_mode="grayscale",
    class_mode=None,
    seed=seed)

# Normalize the input image
def normalize(imgOri, scale=1, mean=[103.94, 116.78, 123.68], val=[0.017, 0.017, 0.017]):
    img = np.array(imgOri.copy(), np.float32)/scale
    return (img - mean) * val

Use the Keras **data generator** to load the dataset and apply data augmentation (as in training) to generate the inputs for the quantization process. Configure the TFLite converter for **full integer quantization** and convert the model to **INT8 TFLite** format. The inputs and outputs of the saved model will then use the **INT8 range: -128 to 127**.

In [0]:
# Load the dataset with data augmentations
def representative_dataset_gen():
  for _ in range(100):
    imgs=next(train_image_generator)
    msks=next(train_mask_generator)
    for i in range(len(imgs)):
      img=normalize(cv2.cvtColor(imgs[i],cv2.COLOR_RGB2BGR))
      msk=cv2.normalize(cv2.blur(msks[i], (5,5)), 0, 1, cv2.NORM_MINMAX) # partial augmentation
      # Stack image and prior mask into one 4-channel sample of batch size 1
      sample=np.float32(np.dstack([img,msk]).reshape(1,224,224,4))
      yield [sample]

# Convert the model to INT8 format
converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file('portrait_video_flattened.h5')
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen

# Save the quantized model
tflite_quant_model = converter.convert()
with open("/content/portrait_video_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)

Test the new quantized model on a new **video file** with the help of OpenCV. Make sure to convert the float input to the **INT8 range** before inference and the INT8 output to **float32** after quantized inference. Save the outputs using the OpenCV video writer and **compare** them with the original results.
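
One way to do these conversions is to read the scale and zero point that the interpreter reports for each tensor, instead of rescaling by the minimum and maximum of each frame. The sketch below assumes that the `'quantization'` entry of `get_input_details()` and `get_output_details()` is a `(scale, zero_point)` pair. The helper names and the example parameter values are hypothetical, and plain dicts stand in for the real detail entries so that the snippet runs without a model.

```python
import numpy as np

def quantize_input(x, detail):
    """Map float data to the INT8 values expected by a quantized input tensor."""
    scale, zero_point = detail['quantization']
    q = np.round(np.asarray(x, dtype=np.float32) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_output(q, detail):
    """Map INT8 output values back to float32."""
    scale, zero_point = detail['quantization']
    return (np.asarray(q, dtype=np.float32) - zero_point) * scale

# Stand-ins for input_details[0] and output_details[0]; the numbers are made up.
fake_input_detail = {'quantization': (0.05, 3)}
fake_output_detail = {'quantization': (1.0 / 255.0, -128)}

x = np.array([-1.0, 0.0, 2.0], dtype=np.float32)
q_in = quantize_input(x, fake_input_detail)
mask = dequantize_output(np.array([-128, 127], dtype=np.int8), fake_output_detail)
```

The input and output tensors need not share the same scale and zero point, so when a previous output is reused as the prior channel it may be safer to dequantize it with the output parameters and requantize it with the input parameters.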

In [0]:
# Normalize the input image
def normalize(imgOri, scale=1, mean=[103.94, 116.78, 123.68], val=[0.017, 0.017, 0.017]):
    img = np.array(imgOri.copy(), np.float32)/scale
    norm=(img - mean) * val
    out=cv2.normalize(src=norm, dst=None, alpha=-128, beta=127, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8S)
    return out
 
# Alpha blend frame with background
def blend(frame, alpha):
    background = np.zeros(frame.shape) + [255, 255, 255]
    alphargb = cv2.cvtColor(alpha, cv2.COLOR_GRAY2BGR)
    result = np.uint8(frame * alphargb + background * (1-alphargb))
    return frame, alphargb*255, result

# Initialize tflite-interpreter
interpreter = tf.lite.Interpreter(model_path="portrait_video_quant.tflite") # File written by the conversion cell
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape'][1:3]


# Initialize video capturer
videofile = 'portrait_lady.mp4'
cap = cv2.VideoCapture(videofile)  
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))  
videoWriter = cv2.VideoWriter('result.mp4', cv2.VideoWriter_fourcc(*'MJPG'), 20, size)  
 
# Initialize frame counter 
cnt = 1

while True:
       
    # Read the BGR frames 
    success, frame = cap.read()
    if not(success):
      break
    image=Image.fromarray(frame)
    
    
    # Resize the image
    image = image.resize(input_shape, Image.LANCZOS) # Image.ANTIALIAS was removed in Pillow 10
    image=np.asarray(image)

    
    # Normalize the input
    image = normalize(image)
  
    # Choose prior mask
    if cnt == 1:
        prior = np.full((224, 224, 1),-128,dtype=np.int8) # first frame
    else:
        prior = pred_video
    
    # Add prior as fourth channel
    image=np.dstack([image,prior])
    prepimg = image[np.newaxis, :, :, :]
    

    # Invoke interpreter for inference
    interpreter.set_tensor(input_details[0]['index'], np.array(prepimg, dtype=np.int8))
    interpreter.invoke()
    outputs = interpreter.get_tensor(output_details[0]['index'])
    outputs = outputs.reshape(224,224,1)
  
    # Save output to feed subsequent inputs
    pred_video = outputs

    # Process the output and perform alpha blending
    outputs = outputs.astype('float32')  
    outputs = cv2.resize(outputs, size)
    outputs= cv2.normalize(src=outputs, dst=None, alpha=0.0, beta=1.0, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)
    _,_,outputs=blend(frame, outputs)
        
    # Write the output frame
    videoWriter.write(outputs)
    
    # Print the frame count
    cnt += 1
    if cnt % 100 == 0:
        print ("cnt: ", cnt)

# When everything done, release the capturer
print("Conversion successful!")
videoWriter.release()
cap.release()

Older versions of **TensorFlow** (1.x) do not seem to support converting **transpose convolutions** to their quantized versions using the current approach. Compared to the original float model, the output of the quantized model looks **less accurate**. The result may improve slightly with **full data augmentation** and/or **quantization-aware training**. Unfortunately, the fully quantized model could not be converted to **Coral Edge TPU** format due to unsupported layers and compatibility issues, but it may work with the latest **NNAPI** delegate on Android.
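
A simple way to put a number on this accuracy gap is to compare the masks from the float and quantized models frame by frame. The sketch below computes the intersection-over-union (IoU) between two masks after thresholding. The function name, the 0.5 threshold, and the toy arrays are illustrative assumptions and not part of the original pipeline.

```python
import numpy as np

def mask_iou(mask_a, mask_b, threshold=0.5):
    """IoU of two soft masks (values in [0, 1]) after binarizing at `threshold`."""
    a = np.asarray(mask_a, dtype=np.float32) >= threshold
    b = np.asarray(mask_b, dtype=np.float32) >= threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks are empty, treat them as identical
    return float(np.logical_and(a, b).sum() / union)

# Toy 2x2 masks standing in for the float and quantized model outputs
float_mask = np.array([[0.9, 0.8], [0.1, 0.0]], dtype=np.float32)
quant_mask = np.array([[1.0, 0.4], [0.2, 0.0]], dtype=np.float32)
iou = mask_iou(float_mask, quant_mask)
```

Averaging this score over the frames of a test video gives a rough single-number comparison between the two models.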

**Prisma-net: Quantization & TPU Conversion**

Let's quantize our old **prisma-net** model and see if it can be successfully converted to **TPU format**. First, initialize the **data generator** for loading the dataset used for **training** prisma-net.

In [0]:
# prisma data generator
IMGDS="/content/portrait256/images"

data_gen_args = dict(rescale=1./255,
                     width_shift_range=0.2,
                     height_shift_range=0.2,
                     zoom_range=0.2,
                     horizontal_flip=True,
                     validation_split=0.2
                    )

image_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**data_gen_args)

# Provide the same seed and keyword arguments to the fit and flow methods
seed = 1

train_image_generator = image_datagen.flow_from_directory(
    IMGDS,
    batch_size=8,
    shuffle=True,
    subset='training',
    color_mode="rgb",
    class_mode=None,
    seed=seed)

Load the **Keras model**, configure the **TFLite converter**, and start the quantization process using the **representative dataset**.

In [0]:
# Load the dataset with data augmentations
def representative_dataset_gen():
  for _ in range(200):
    imgs=next(train_image_generator)
    for i in range(len(imgs)):
      # Yield one float32 sample of batch size 1 per calibration step
      sample=np.float32(imgs[i]).reshape(1,256,256,3)
      yield [sample]

# Convert the model to INT8 format
converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file('prisma-net-15-0.08.hdf5')
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen

# Save the quantized model
tflite_quant_model = converter.convert()
with open("/content/prisma_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)

**Test** the quantized model on a portrait image.

In [0]:
# Read the input image and normalize it
img=Image.open('/content/kid8.jpg').resize((256,256),Image.BICUBIC)
img=np.array(img)
norm=(img.astype(np.int16)-128).astype(np.int8) # Shift UINT8 [0,255] to INT8 [-128,127]
prepimg=norm.reshape(1,256,256,3)

# Initialize tflite-interpreter
interpreter = tf.lite.Interpreter(model_path="prisma_quant.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape'][1:3]

# Invoke interpreter for inference
interpreter.set_tensor(input_details[0]['index'], np.array(prepimg, dtype=np.int8))
interpreter.invoke()
outputs = interpreter.get_tensor(output_details[0]['index'])
outputs = outputs.reshape(256,256,1)

# Plot the output results
result=np.uint8(outputs.astype(np.int16)+128) # Denormalize to UINT8 (widen first to avoid INT8 overflow)
result=np.uint8((result/255.0)*img) # Crop the image with mask
plt.imshow(result)

Install the **Coral Edge TPU compiler** on the host system.

In [0]:
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
!echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
!sudo apt-get update
!sudo apt-get install edgetpu

Convert the quantized TFLite model to **TPU format**.

In [0]:
!edgetpu_compiler prisma_quant.tflite


Here is the **output** of the conversion process:

```
Edge TPU Compiler version 2.0.291256449

Model compiled successfully in 757 ms.

Input model: prisma_quant.tflite
Input size: 1.15MiB
Output model: prisma_quant_edgetpu.tflite
Output size: 1.31MiB
On-chip memory available for caching model parameters: 7.43MiB
On-chip memory used for caching model parameters: 1.13MiB
Off-chip memory used for streaming uncached model parameters: 64.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 158
Operation log: prisma_quant_edgetpu.log
See the operation log file for individual operation details.
```



To run the model on a Coral TPU, we need to install the **Edge TPU Runtime and TensorFlow Lite libraries**. Install the appropriate versions by following the **Coral** documentation:

1. **Edge TPU Runtime**: https://coral.ai/docs/accelerator/get-started/#1-install-the-edge-tpu-runtime
2. **TensorFlow Lite Library**: https://coral.ai/docs/accelerator/get-started/#2-install-the-tensorflow-lite-library

Finally, connect the **Edge TPU** to the host PC (USB 3.0) and **verify** the installation by running the demo **classification** example provided in the documentation. To run **inference** on the model, we can use the same TFLite API (or the Edge TPU Python API). There is only one major **change** in the code compared to normal TFLite inference, i.e. the **experimental_delegates** argument.

In [0]:
import numpy as np
from PIL import Image
import tflite_runtime.interpreter as tflite
from imageio import imsave
import time

# Read the input image and normalize it
img=Image.open('kid8.jpg').resize((256,256),Image.BICUBIC)
img=np.array(img)
norm=(img.astype(np.int16)-128).astype(np.int8) # Shift UINT8 [0,255] to INT8 [-128,127]
prepimg=norm.reshape(1,256,256,3)

# Initialize tflite-interpreter
interpreter = tflite.Interpreter(model_path="prisma_quant_edgetpu.tflite", experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')]) # Use 'tf.lite' on recent tf versions
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape'][1:3]

# Invoke interpreter for inference
interpreter.set_tensor(input_details[0]['index'], np.array(prepimg, dtype=np.int8))

# Measure execution time
for _ in range(5):
    start = time.perf_counter()
    interpreter.invoke()
    inference_time = time.perf_counter() - start
    outputs = interpreter.get_tensor(output_details[0]['index'])
    print('%.1fms' % (inference_time * 1000))


outputs = outputs.reshape(256,256,1)

# Plot the output results
result=np.uint8(outputs.astype(np.int16)+128) # Denormalize to UINT8 (widen first to avoid INT8 overflow)
result=np.uint8((result/255.0)*img) # Crop the image with mask
imsave("segout.png",result)

**Results:**
```
Tflite Quantized CPU: 4357.0 ms
Tflite Quantized TPU:  12.2 ms
```
The **CPU** (i7-3632QM CPU @ 2.20GHz) might be using a **single core** for inference. But even if we include other possible overheads, this roughly **350x speed-up** seems to be worth the effort. Besides, the Edge TPU consumes **20 times less power** than the CPU.
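
For reference, the figures above give 4357.0 / 12.2 ≈ 357. When repeating such measurements, it can help to discard a few warm-up runs (the first inference on the Edge TPU also includes loading the model onto the device) and to report the median instead of a single run. The helpers below are a hypothetical sketch that times any callable; `interpreter.invoke` could be passed in place of the dummy workload.

```python
import statistics
import time

def benchmark_ms(fn, warmup=2, runs=10):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

def speedup(baseline_ms, accelerated_ms):
    """Ratio of baseline latency to accelerated latency."""
    return baseline_ms / accelerated_ms

# Figures reported in the results above
ratio = speedup(4357.0, 12.2)

# Dummy workload standing in for interpreter.invoke
median_ms = benchmark_ms(lambda: sum(range(10000)))
print('speed-up: %.0fx, dummy median: %.3f ms' % (ratio, median_ms))
```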

**Note:** The official documentation only recommends **TF 1.x** with the Edge TPU. There is still some confusion regarding input types, i.e. **UINT8 vs INT8**. However, the model was **successfully converted** and it seems to give **correct output** on the device during inference.