<center><h1>(Fake) Quantization from scratch</h1></center>

### Prerequisites

In [None]:
%pylab inline

In [None]:
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from copy import deepcopy
import requests
from PIL import Image
import ast
import math
from matplotlib import pyplot as plt
import time
import os

#### Define functions to

* load an image
* preprocess it for inference
* get the top predicted label from the model output

In [None]:
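def load_image(url_or_path):
    # Accept both http:// and https:// URLs; anything else is treated as a local path
    if url_or_path.startswith(("http://", "https://")):
        img = Image.open(requests.get(url_or_path, stream=True).raw)
    else:
        img = Image.open(url_or_path)
    # Force 3 channels so the ImageNet normalization also works for RGBA or grayscale files
    return img.convert("RGB")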


def preprocess_image(img):
    IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD = ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)
        ])
    img = transform(img).unsqueeze(0)
    return img


imgnet_idx_to_label = requests.get("https://gist.githubusercontent.com/yrevar/942d3a0ac09ec9e5eb3a/raw/238f720ff059c1f82f368259d1ca4ffa5dd8f9f5/imagenet1000_clsidx_to_labels.txt")
imgnet_idx_to_label = ast.literal_eval(imgnet_idx_to_label.text)

def logits_to_label(outp):
    outp = F.softmax(outp, dim=1)
    score, idx = torch.topk(outp, 1)
    idx.squeeze_()
    predicted_label = imgnet_idx_to_label[idx.item()]
    return predicted_label, score.squeeze().item()


def get_predictions(model, img_path):
    img = load_image(img_path)
    img = preprocess_image(img)
    logits = model(img)
    label, score = logits_to_label(logits)
    print(label, '(', score, ')')

### Load a PyTorch pretrained ResNet model

ResNets are image classification models.

torchvision provides ResNets pretrained on the ImageNet dataset.

In [None]:
# Load the model
resnet = models.resnet18(pretrained=True)

# Set the model up for inference
resnet.eval()
resnet.requires_grad_(False)

Quick test that the model inference works correctly

In [None]:
# Test on an image of a wolf
wolf_img_url = "https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png"
get_predictions(resnet, wolf_img_url)

### Model perf profiling

Quantization aims to make the model smaller and faster. Let's write some functions to measure the model's size and its inference latency.

In [None]:
def profile(model, input):
    print_size_of_model(model)
    module_latency(model, input)

Write a function to print the size of the model. 

HINT: `os.path.getsize()` returns the size of a file in bytes

In [None]:
def print_size_of_model(model):
    
    # TODO

In [None]:
# Execute this cell to see the solution
!wget https://raw.githubusercontent.com/suraj813/da-content/main/quantization_workshop/code/101/sizeof.py
%pycat sizeof.py
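
If the solution file is not reachable, the following is one possible implementation, not necessarily the one in `sizeof.py`. It assumes that "size of the model" means the on-disk size of the serialized `state_dict`, and it returns the byte count in addition to printing it (the return value is an addition of this sketch). The `torch.nn.Linear(512, 1000)` model is a hypothetical stand-in used only for the demo.

```python
import os
import tempfile

import torch


def print_size_of_model(model):
    """Serialize the model's state_dict to a temp file and report its size."""
    # A temporary directory avoids leaving a stray file behind.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model_state.pt")
        torch.save(model.state_dict(), path)
        size_bytes = os.path.getsize(path)
    print(f"Size: {size_bytes / 1e6:.2f} MB")
    return size_bytes


# Hypothetical stand-in model: 1000 * 512 weights + 1000 biases stored in FP32
demo_size = print_size_of_model(torch.nn.Linear(512, 1000))
```

The reported size is slightly larger than 4 bytes per parameter because the file also stores metadata.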

Write a function to measure model inference time.

HINT: Use `torch.inference_mode` when running a model purely for predictions

In [None]:
def module_latency(model, input, num_tests=10):
    
    with torch.inference_mode():

        # TODO
        

In [None]:
# Execute this cell to see the solution
!wget https://raw.githubusercontent.com/suraj813/da-content/main/quantization_workshop/code/101/latency.py
%pycat latency.py
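
As with the size function, here is a hedged sketch of a latency measurement in case `latency.py` is unavailable. It assumes that the mean wall-clock time over `num_tests` forward passes is the metric of interest, and it adds one untimed warm-up run (a choice of this sketch, not something the exercise asks for). The linear layer and random input are hypothetical demo values.

```python
import time

import torch


def module_latency(model, input, num_tests=10):
    """Return the mean wall-clock time (in seconds) of one forward pass."""
    with torch.inference_mode():
        model(input)  # warm-up run, not timed
        start = time.perf_counter()
        for _ in range(num_tests):
            model(input)
        elapsed = time.perf_counter() - start
    mean_latency = elapsed / num_tests
    print(f"Average latency: {mean_latency * 1e3:.3f} ms")
    return mean_latency


# Hypothetical stand-in model and input
demo_latency = module_latency(torch.nn.Linear(512, 1000), torch.randn(1, 512))
```

Timings from a handful of runs are noisy, so treat the number as a rough comparison between models rather than a benchmark.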

Profile the Resnet18 model

In [None]:
# TODO

# 1. load an image
# 2. preprocess image
# 3. profile resnet inference on the image

In [None]:
# Execute this cell to see the solution
!wget https://raw.githubusercontent.com/suraj813/qnt_workshop/master/static/code/101/prof.py
%pycat prof.py

In this exercise, we're going to quantize only the last layer of the ResNet.

The last layer is a linear module. It is also known as the classifier because it maps an image's features to one score (logit) per label; a softmax then turns these scores into probabilities.

First, let's separate the classifier from the rest of the model.

In [None]:
# Display the resnet model layers
print(resnet)

In [None]:
# Extract the classifier 
fp32_fc = deepcopy(resnet.fc)

# "Remove" the classifier from resnet by replacing it with a no-op module.
resnet.fc = torch.nn.Identity()

Making sure everything still works...

In [None]:
model = torch.nn.Sequential(resnet, fp32_fc)
get_predictions(model, wolf_img_url)
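
The same split can be checked without downloading any weights. The sketch below uses a hypothetical `TinyNet` (not part of the workshop) whose last layer is also called `fc`, replaces that layer with `torch.nn.Identity`, and confirms that chaining the truncated model with the extracted layer reproduces the original output.

```python
from copy import deepcopy

import torch


class TinyNet(torch.nn.Module):
    """Hypothetical two-stage model: a feature extractor followed by `fc`."""

    def __init__(self):
        super().__init__()
        self.features = torch.nn.Linear(8, 4)
        self.fc = torch.nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))


torch.manual_seed(0)
net = TinyNet().eval()
x = torch.randn(2, 8)

with torch.inference_mode():
    expected = net(x)                # output of the intact model

    head = deepcopy(net.fc)          # keep a copy of the classifier
    net.fc = torch.nn.Identity()     # the model now returns raw features
    recombined = torch.nn.Sequential(net, head)
    actual = recombined(x)

print(torch.allclose(expected, actual))
```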

---

### Attempt 1: Use `round` as mapping function

Quantization mapping functions also include naive functions like `round`. 

Make a copy of the FP32 classifier and round its weight and bias tensors

In [None]:
rounded_fc = deepcopy(fp32_fc)
rounded_fc.weight = torch.nn.Parameter(torch.round(rounded_fc.weight), requires_grad=False)
rounded_fc.bias = torch.nn.Parameter(torch.round(rounded_fc.bias), requires_grad=False)

Sounds too good to be true?

In [None]:
model = torch.nn.Sequential(resnet, rounded_fc)
get_predictions(model, wolf_img_url)

You already knew [this wouldn't work](https://en.wikipedia.org/wiki/There_ain%27t_no_such_thing_as_a_free_lunch), but it's good to clarify exactly why.

Let's see what the weights' values are...

In [None]:
from matplotlib import pyplot as plt
_, _, _ = plt.hist(fp32_fc.weight.detach().flatten(), density=True, bins=100)
plt.show()

#TODO

This failed because...

In [None]:
# Execute this cell to see the solution
!wget https://raw.githubusercontent.com/suraj813/qnt_workshop/master/static/code/101/roundfail.txt
%pycat roundfail.txt
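
A likely explanation, offered here as an assumption rather than as the workshop's official answer: if almost all weights lie strictly between -0.5 and 0.5, `round` maps them to 0 and the layer loses the information it learned. The sketch below uses synthetic weights (normally distributed with standard deviation 0.05, a hypothetical stand-in for the real histogram) to show the effect.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for a classifier weight matrix (hypothetical values):
# small magnitudes, far inside the interval (-0.5, 0.5).
weights = 0.05 * torch.randn(1000, 512)
rounded = torch.round(weights)

zero_fraction = (rounded == 0).float().mean().item()
print(f"largest |weight|: {weights.abs().max().item():.3f}")
print(f"fraction rounded to zero: {zero_fraction:.3f}")
```

Comparing `fp32_fc.weight.abs().max()` against 0.5 in the notebook would show whether the real weights behave the same way.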

---

### Attempt 2: Scale + Round as mapping function

This time, we rescale the parameters into an appropriate output range before rounding. 

##### Output range

* The output range defines the min and max values in the quantized space.
* The range depends on the quantization precision. 

HINT: The range of a signed 8-bit integer is [-2^7, 2^7 - 1]

In [None]:
def get_output_range(bits):
    # TODO

print("For 16-bit quantization, the quantized range is ", get_output_range(16))
print("For 8-bit quantization, the quantized range is ", get_output_range(8))
print("For 3-bit quantization, the quantized range is ", get_output_range(3))
print("For 2-bit quantization, the quantized range is ", get_output_range(2))

In [None]:
# Execute this cell to see the solution
!wget https://raw.githubusercontent.com/suraj813/qnt_workshop/master/static/code/101/output_range.py
%pycat output_range.py
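
Generalizing the hint to any bit width gives a one-line function. This is a minimal sketch that assumes signed two's-complement integers, as in the 8-bit hint; the input check is an addition of this sketch.

```python
def get_output_range(bits):
    """Return (min, max) of a signed two's-complement integer with `bits` bits."""
    if bits < 1:
        raise ValueError("bits must be a positive integer")
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for bits in (16, 8, 3, 2):
    print(f"For {bits}-bit quantization, the quantized range is", get_output_range(bits))
```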

In this exercise, we're going to use 8-bit quantization, so the output range we scale our parameters into is [-128, 127].

#### Moving from FP32 to INT8

<img src="./img/scaling.png" width="600" />

Generally speaking, what we're doing here is an affine transformation from 32-bit space to 8-bit space.

Affine transformations have the form `y = Ax + B`.

The two parameters for this transformation are:
* The scaling factor `S`
* The zero-point `Z`

So our transformation looks like `Q(x) = round(x/S + Z)`, which is the affine form with `A = 1/S` and `B = Z`, followed by rounding.
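
One common way to choose the two parameters (an assumption here, since the text leaves the choice to the exercise) is to require that the ends of the input range $[x_{\min}, x_{\max}]$ land on the ends of the output range $[q_{\min}, q_{\max}]$:

$$
\begin{aligned}
\frac{x_{\min}}{S} + Z &= q_{\min}, \qquad \frac{x_{\max}}{S} + Z = q_{\max} \\
\Rightarrow\quad S &= \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}}, \qquad Z = \operatorname{round}\!\left(q_{\min} - \frac{x_{\min}}{S}\right)
\end{aligned}
$$

Inverting the mapping gives the approximate reconstruction $\hat{x} = S\,(Q(x) - Z)$, which differs from $x$ only by the rounding error.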

In [None]:
def get_quantization_params(input_range, output_range):
    # TODO
    return S, Z


def quantize(x, S, Z):
    # TODO
    return x_q


def dequantize(x_q, S, Z):
    # TODO
    return x


def quantize_int8(x):
    input_range = x.min(), x.max()
    output_range = get_output_range(8)
    S, Z = get_quantization_params(input_range, output_range)
    x_q = quantize(x, S, Z)
    return x_q, S, Z

In [None]:
# Execute this cell to see the solution
!wget https://raw.githubusercontent.com/suraj813/qnt_workshop/master/static/code/101/qparams.py
%pycat qparams.py
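
For readers without access to `qparams.py`, here is a hedged sketch of the three helpers. It assumes that `S` is the ratio of the input-range width to the output-range width and that `Z` is chosen so the input minimum maps to the output minimum; the workshop's own solution may differ. The optional `output_range` argument of `quantize` is an addition of this sketch, used to clamp values that rounding pushes just outside the range.

```python
import torch


def get_quantization_params(input_range, output_range):
    min_val, max_val = (float(v) for v in input_range)
    q_min, q_max = output_range
    # Note: a constant tensor (max_val == min_val) would give S == 0; not handled here.
    S = (max_val - min_val) / (q_max - q_min)
    Z = round(q_min - min_val / S)
    return S, Z


def quantize(x, S, Z, output_range=(-128, 127)):
    # Q(x) = round(x/S + Z), clamped to the output range
    x_q = torch.round(x / S + Z)
    return torch.clamp(x_q, *output_range)


def dequantize(x_q, S, Z):
    # Approximate inverse of quantize
    return S * (x_q - Z)


# Round-trip demo on random data (hypothetical input)
torch.manual_seed(0)
x = torch.randn(1000)
S, Z = get_quantization_params((x.min(), x.max()), (-128, 127))
x_q = quantize(x, S, Z)
x_hat = dequantize(x_q, S, Z)
max_err = (x - x_hat).abs().max().item()
print(f"S = {S:.5f}, Z = {Z}, max round-trip error = {max_err:.5f}")
```

Note that `x_q` is still a floating-point tensor that merely holds integer values, which is what makes this "fake" quantization.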

Now we have all the functions we need to quantize our classifier.

Like before, we quantize each parameter in the layer (`weight` and `bias` in this case).

We will also quantize the incoming features to the layer.

In [3]:
def quantize_inputs(features):
    # `features` is the output of resnet(img); the callers below compute it before calling this
    X_q, S_x, Z_x = quantize_int8(features)
    return (X_q, S_x, Z_x)


def quantize_classifier(clf):
    W_q, S_w, Z_w = quantize_int8(clf.weight)
    b_q, S_b, Z_b = quantize_int8(clf.bias)
    return (W_q, S_w, Z_w), (b_q, S_b, Z_b)

In PyTorch, quantized operators run in specialized backends like FBGEMM and QNNPACK.

FBGEMM is an open source linear algebra library from Meta for reduced-precision DL inference.

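We can simulate the INT8 Linear module by first dequantizing everything back to FP32 and then running the matrix multiply in floating point. Because the arithmetic never leaves FP32, this is only a simulated ("fake") quantization.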

In [None]:
def int8_linear_sim(quantized_input, quantized_weights, quantized_bias):
    X = dequantize(*quantized_input)
    W = dequantize(*quantized_weights)
    b = dequantize(*quantized_bias)
    return b + X @ W.T
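
To see why dequantize-then-multiply is a reasonable stand-in, note that with `x ≈ S(x_q - Z)` the product factors into an integer matrix multiply followed by a single rescale. The sketch below checks this numerically. All scales, zero-points and tensor values are hypothetical, the bias is left out for brevity, and this is not a description of how FBGEMM or QNNPACK are implemented.

```python
import torch

torch.manual_seed(0)

# Hypothetical quantized tensors: int64 tensors holding values in the int8 range
X_q = torch.randint(-128, 128, (1, 4))
W_q = torch.randint(-128, 128, (3, 4))
S_x, Z_x = 0.05, 3      # hypothetical input scale and zero-point
S_w, Z_w = 0.002, -7    # hypothetical weight scale and zero-point

# Path 1: dequantize first, then multiply in floating point (what the simulation does)
X = S_x * (X_q - Z_x).float()
W = S_w * (W_q - Z_w).float()
out_sim = X @ W.T

# Path 2: multiply and accumulate in integers, then rescale once at the end
acc = (X_q - Z_x) @ (W_q - Z_w).T
out_int = (S_x * S_w) * acc.float()

print(out_sim)
print(out_int)
```

Both paths agree up to floating-point rounding, so the simulation reproduces the quantization error without needing an integer kernel.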

We're ready to run our "fake-quantized" classifier

In [None]:
# still in FP32 space...
img = preprocess_image(load_image(wolf_img_url))
features = resnet(img)

In [None]:
# Now we move to INT8

inputs_q = quantize_inputs(features)
weights_q, bias_q = quantize_classifier(fp32_fc)
logits_q =  int8_linear_sim(inputs_q, weights_q, bias_q)

Great! Compare this with FP32 logits

In [None]:
logits = fp32_fc(features)


print("Non-Quantized output:\n", logits[:, :10], "\n")
print("Quantized output:\n", logits_q[:, :10], "\n")

quantization_error = (logits_q - logits).mean()
print("Quantization error = ", quantization_error)
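
One caveat about this metric: a signed mean lets positive and negative errors cancel. The sketch below uses made-up logits (hypothetical values, not model outputs) to show how the signed mean can be close to zero while every individual logit is off by about 0.4. The mean absolute error and the maximum absolute error are two alternatives that do not cancel.

```python
import torch

# Hypothetical reference and approximate logits
reference = torch.tensor([[2.0, -1.0, 0.5, 3.0]])
approx = torch.tensor([[2.4, -1.4, 0.9, 2.6]])

diff = approx - reference          # roughly [0.4, -0.4, 0.4, -0.4]
signed_mean = diff.mean().item()   # errors cancel
mean_abs = diff.abs().mean().item()
max_abs = diff.abs().max().item()

print(f"signed mean error:   {signed_mean:.6f}")
print(f"mean absolute error: {mean_abs:.6f}")
print(f"max absolute error:  {max_abs:.6f}")
```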

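The mean quantization error is pretty sizable at 1e-3. Keep in mind that this is a signed mean, so positive and negative errors partly cancel and individual logits can be off by more.

Eyeballing the outputs, the logits from the quantized and non-quantized layers seem fairly different too.

Let's see how far off the quantized predictions are...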

In [None]:
# Non-quantized predictions
logits_to_label(logits)

In [None]:
# Quantized predictions

logits_to_label(logits_q)

Let's try more images

In [None]:
img_url = "./img/swan-3299528_1280.jpeg"
# img_url = "https://static.scientificamerican.com/sciam/cache/file/32665E6F-8D90-4567-9769D59E11DB7F26_source.jpg"
# img_url = "https://media.newyorker.com/photos/5dfab39dde5fcf00086aec77/1:1/w_1706,h_1706,c_limit/Lane-Cats.jpg"


# FP32
img = preprocess_image(load_image(img_url))
features = resnet(img)

logits = fp32_fc(features)

# INT8
inputs_q = quantize_inputs(features)
logits_q =  int8_linear_sim(inputs_q, weights_q, bias_q)


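# Compare predictions
# Only the last expression of a cell is displayed automatically, so print both results explicitly
print("Non-Quantized prediction:")
print(logits_to_label(logits))
print()
print("Quantized prediction:")
print(logits_to_label(logits_q))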