**Post-training quantization and quantized inference**

Post-training quantization is a conversion technique that can **reduce model size** while also **improving CPU and hardware accelerator latency**, with little **degradation** in model **accuracy**. You can apply it to an already-trained **float** TensorFlow model when you convert it to TensorFlow Lite format. Once a model is **fully quantized**, you can deploy it to platforms like the **Coral Edge TPU** or run it using the **NNAPI** **delegate**.

Install the latest **tf-nightly** for quantization and TFLite inference.

In [0]:
!pip install tf-nightly

In [0]:
from google.colab import drive
drive.mount('/content/drive')

The following diagram shows the various methods for **post-training quantization** available in TensorFlow Lite.

![TfliteQuantization](https://www.tensorflow.org/lite/performance/images/optimization.jpg)

Here we will use **full integer quantization** (INT8) with a representative dataset. Use the **training dataset** as the representative dataset for this purpose.

In [0]:
!unzip /content/drive/My\ Drive/portrait_mix/EG1800.zip

In [0]:
# Import libraries
import tensorflow as tf
import cv2, sys, time
import numpy as np
from PIL import Image
import timeit

To convert a float32 model to quantized int8 format, there are two approaches: **quantization aware training** and **post-training quantization**. Here we use the latter to convert our model to **INT8** format. Both the weights and the activations of every layer must be quantized to make use of accelerators like the **Coral TPU or NNAPI**. We will use the training dataset as a **representative dataset** to measure the dynamic range of activations and inputs.
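
To make the idea of "measuring the dynamic range" concrete, here is a minimal NumPy sketch of affine int8 quantization, where a real value is approximated as `(q - zero_point) * scale`. The helper names (`int8_quant_params`, `quantize`, `dequantize`) and the sample values are illustrative assumptions, not part of the TFLite API; the converter performs the equivalent calibration internally.

```python
import numpy as np

def int8_quant_params(x_min, x_max):
    """Derive (scale, zero_point) for an asymmetric int8 mapping of [x_min, x_max]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0  # degenerate all-zero range
    scale = (x_max - x_min) / 255.0  # 255 steps between -128 and 127
    zero_point = int(round(-128 - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    """Map float values to int8, saturating at the ends of the range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical activations observed while running the representative dataset
activations = np.array([-1.0, 0.0, 0.5, 3.0], dtype=np.float32)
scale, zero_point = int8_quant_params(float(activations.min()), float(activations.max()))
q = quantize(activations, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

The round trip loses at most half a quantization step per value, which is why a representative dataset that covers the real activation range matters: a range that is too wide wastes resolution, and one that is too narrow saturates.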

**Note:** TF 2.0/Keras currently does not support quantization aware training.

In [0]:
# Configure dataset directories
IMGDS='/content/EG1800/imgs'
MSKDS='/content/EG1800/msks'

# Use image data generator to load the dataset
image_generator = tf.keras.preprocessing.image.ImageDataGenerator()
seed=7

train_image_generator = image_generator.flow_from_directory(
    IMGDS,
    batch_size=8,
    shuffle=True,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode=None,
    seed=seed)

train_mask_generator = image_generator.flow_from_directory(
    MSKDS,
    batch_size=8,
    shuffle=True,
    target_size=(224, 224),
    color_mode="grayscale",
    class_mode=None,
    seed=seed)

# Normalize the input image
def normalize(imgOri, scale=1, mean=[103.94, 116.78, 123.68], val=[0.017, 0.017, 0.017]):
    img = np.array(imgOri.copy(), np.float32)/scale
    return (img - mean) * val

Use the Keras **data generator** to load the dataset and apply data augmentation (as in training) to generate the inputs for the quantization process. Configure the TFLite converter for **full integer quantization** and convert the model to **INT8 tflite** format. The saved model will then operate in the **INT8 range: -128 to 127**.

In [0]:
# Load the dataset with data augmentations
def representative_dataset_gen():
  for _ in range(100):
    imgs=next(train_image_generator)
    msks=next(train_mask_generator)
    for i in range(len(imgs)):
      img=normalize(cv2.cvtColor(imgs[i],cv2.COLOR_RGB2BGR))
      # Blur the mask and rescale it to [0, 1] (partial augmentation)
      msk=cv2.normalize(src=cv2.blur(msks[i], (5,5)), dst=None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
      # Stack image and mask into one 4-channel sample with a batch dimension
      sample=np.float32(np.dstack([img,msk]).reshape(1,224,224,4))

      yield [sample]

# Convert the model to INT8 format
converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file('portrait_video_flattened.h5')
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen

# Save the quantized model
tflite_quant_model = converter.convert()
with open("/content/portrait_video_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
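
The converter calls the representative dataset function with no arguments and expects each yielded item to be a list holding one float32 array per model input, batch dimension included. The sketch below checks that contract before a long conversion run. It uses random data purely as a stand-in so that it runs on its own; `random_representative_dataset` and `check_representative_dataset` are hypothetical helpers, and real calibration should use the training images as above.

```python
import numpy as np

def random_representative_dataset(num_samples=4, shape=(1, 224, 224, 4), seed=0):
    """Stand-in generator: one list per calibration step, one array per model input."""
    rng = np.random.default_rng(seed)
    for _ in range(num_samples):
        yield [rng.random(shape, dtype=np.float32)]

def check_representative_dataset(gen_fn, expected_shape):
    """Consume the generator, validate every sample, and return the sample count."""
    count = 0
    for sample in gen_fn():
        if not isinstance(sample, list) or len(sample) != 1:
            raise ValueError("each item must be a list with one array per model input")
        if sample[0].dtype != np.float32:
            raise ValueError("calibration inputs must be float32")
        if sample[0].shape != expected_shape:
            raise ValueError(f"unexpected shape {sample[0].shape}")
        count += 1
    return count

num_checked = check_representative_dataset(random_representative_dataset, (1, 224, 224, 4))
```

Passing `representative_dataset_gen` instead of the random stand-in would run the same checks on the real calibration samples.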

Test the quantized model on a new **video file** with the help of OpenCV. Make sure to convert the float input to the **INT8 range** before inference and the INT8 output to **float32** after quantized inference. Save the outputs using the OpenCV video writer and **compare** them with the original results.

In [0]:
# Normalize the input image and rescale it to the INT8 range
def normalize(imgOri, scale=1, mean=[103.94, 116.78, 123.68], val=[0.017, 0.017, 0.017]):
    img = np.array(imgOri.copy(), np.float32)/scale
    norm=(img - mean) * val
    out=cv2.normalize(src=norm, dst=None, alpha=-128, beta=127, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8S)
    return out
 
# Alpha blend frame with background
def blend(frame, alpha):
    background = np.zeros(frame.shape) + [255, 255, 255]
    alphargb = cv2.cvtColor(alpha, cv2.COLOR_GRAY2BGR)
    result = np.uint8(frame * alphargb + background * (1-alphargb))
    return frame, alphargb*255, result

# Initialize tflite-interpreter with the quantized model saved above
interpreter = tf.lite.Interpreter(model_path="/content/portrait_video_quant.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = tuple(int(d) for d in input_details[0]['shape'][1:3])


# Initialize video capturer
videofile = 'portrait_lady.mp4'
cap = cv2.VideoCapture(videofile)  
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))  
videoWriter = cv2.VideoWriter('result.mp4', cv2.VideoWriter_fourcc(*'MJPG'), 20, size)  
 
# Initialize frame counter 
cnt = 1

while True:
       
    # Read the BGR frames 
    success, frame = cap.read()
    if not(success):
      break
    image=Image.fromarray(frame)

    # Resize the image
    image= image.resize(input_shape, Image.LANCZOS)
    image=np.asarray(image)

    
    # Normalize the input
    image = normalize(image)
  
    # Choose prior mask
    if cnt == 1:
        prior = np.full((224, 224, 1),-128,dtype=np.int8) # first frame
    else:
        prior = pred_video
    
    # Add prior as fourth channel
    image=np.dstack([image,prior])
    prepimg = image[np.newaxis, :, :, :]
    

    # Invoke interpreter for inference
    interpreter.set_tensor(input_details[0]['index'], np.array(prepimg, dtype=np.int8))
    interpreter.invoke()
    outputs = interpreter.get_tensor(output_details[0]['index'])
    outputs = outputs.reshape(224,224,1)
  
    # Save output to feed subsequent inputs
    pred_video = outputs

    # Process the output and perform alpha blending
    outputs = outputs.astype('float32')  
    outputs = cv2.resize(outputs, size)
    outputs= cv2.normalize(src=outputs, dst=None, alpha=0.0, beta=1.0, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)
    _,_,outputs=blend(frame, outputs)
        
    # Write the output frame
    videoWriter.write(outputs)
    
    # Print the frame count
    cnt += 1
    if cnt % 100 == 0:
        print ("cnt: ", cnt)

# When everything done, release the capturer
print("Conversion successful!")
videoWriter.release()
cap.release()
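
The loop above maps each frame to int8 with a per-frame min-max rescale, which does not necessarily match the scale and zero point stored in the model. As an alternative sketch, the `'quantization'` entry of the dictionaries returned by `get_input_details()` and `get_output_details()` holds a `(scale, zero_point)` pair that can be applied directly. The two helpers and the `fake_*` dictionaries below are hypothetical, with made-up values so the example runs without a model.

```python
import numpy as np

def quantize_input(x_float, input_detail):
    """Convert a float array to the integer type and range the model input expects."""
    scale, zero_point = input_detail['quantization']
    info = np.iinfo(input_detail['dtype'])
    q = np.round(x_float / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(input_detail['dtype'])

def dequantize_output(y_quant, output_detail):
    """Convert an integer model output back to float32 values."""
    scale, zero_point = output_detail['quantization']
    return (y_quant.astype(np.float32) - zero_point) * scale

# Hypothetical stand-ins for input_details[0] and output_details[0]
fake_input_detail = {'dtype': np.int8, 'quantization': (0.02, -10)}
fake_output_detail = {'dtype': np.int8, 'quantization': (1.0 / 255.0, -128)}

x = np.array([0.0, 1.0, -2.36, 100.0], dtype=np.float32)
xq = quantize_input(x, fake_input_detail)
y = dequantize_output(np.array([-128, 127], dtype=np.int8), fake_output_detail)
```

With a real interpreter, the float-normalized frame and prior mask would go through `quantize_input(..., input_details[0])`, and the predicted mask through `dequantize_output(..., output_details[0])`, assuming the model reports a nonzero scale for both tensors.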

Older versions of **TensorFlow** (1.x) do not seem to support converting quantized **transpose convolutions** with the current approach. Compared to the original float model, the output of the quantized model looks **less accurate**. The result may improve slightly with **full data augmentation** and/or **quantization aware training**. Unfortunately, the fully quantized model could not be converted to the **Coral TPU** format due to unsupported layers and compatibility issues, but it may work with the latest **NNAPI** delegate on Android.
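
One way to put a number on "less accurate" is to compare the masks predicted by the float and quantized models on the same frames. The sketch below computes intersection-over-union and mean absolute error between two masks in the `[0, 1]` range. The function names, the 0.5 threshold, and the toy masks are assumptions for illustration; no measured results from this notebook are implied.

```python
import numpy as np

def mask_iou(mask_a, mask_b, threshold=0.5):
    """Intersection-over-union of two soft masks after thresholding."""
    a = mask_a >= threshold
    b = mask_b >= threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(a, b).sum() / union)

def mean_abs_error(mask_a, mask_b):
    """Average per-pixel absolute difference between two masks."""
    diff = mask_a.astype(np.float32) - mask_b.astype(np.float32)
    return float(np.mean(np.abs(diff)))

# Toy 4x4 masks standing in for float-model and quantized-model outputs
float_mask = np.zeros((4, 4), dtype=np.float32)
float_mask[1:3, 1:3] = 1.0
quant_mask = np.zeros((4, 4), dtype=np.float32)
quant_mask[1:3, 1:4] = 1.0

iou = mask_iou(float_mask, quant_mask)
mae = mean_abs_error(float_mask, quant_mask)
```

Averaging these two metrics over a held-out set of frames would give a simple baseline for judging whether further augmentation or quantization aware training actually helps.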