# Quantization tutorial

This tutorial shows how to do post-training static quantization, as well as illustrating two more advanced techniques - per-channel quantization and quantization-aware training - to further improve the model’s accuracy. The task is to classify MNIST digits with a simple LeNet architecture. It is also hosted on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ptzMOHcU5IrtWaSjHvxGYsX6BozcxgF6?usp=sharing)


This is a minimalistic tutorial that gives you a starting point for quantization in PyTorch. For the theory and a more in-depth explanation of what is actually happening, I recommend [Quantizing deep convolutional networks for efficient inference: A whitepaper](https://arxiv.org/abs/1806.08342).

The tutorial is heavily adapted from: https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html

### Initial Setup

Before starting, we import the required libraries, load the MNIST dataset, and train a simple convolutional neural network (CNN) to classify it.

In [None]:
!pip3 install torch==1.5.0 torchvision==0.6.0
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import os
from torch.utils.data import DataLoader
import torch.quantization
from torch.quantization import QuantStub, DeQuantStub

Load training and test data from the MNIST dataset and apply a normalizing transformation.
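
As a quick check of what the normalizing transformation does: `ToTensor()` scales pixel values to the range [0, 1], and `Normalize((0.5,), (0.5,))` then computes `(x - mean) / std`, which maps that range to [-1, 1]. The `normalize` helper below is a hypothetical plain-Python stand-in for illustration, not the torchvision implementation.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply (x - mean) / std to one pixel value already scaled to [0, 1]."""
    return (pixel - mean) / std

# Raw 8-bit pixel values: black, mid-grey, white.
for raw in (0, 128, 255):
    scaled = raw / 255.0  # what ToTensor() does to the value range
    print(raw, "->", round(normalize(scaled), 4))
```

The darkest pixel ends up at -1.0 and the brightest at 1.0, so the inputs are roughly centered around zero.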



In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))])

trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64,
                                          shuffle=True, num_workers=16, pin_memory=True)

testset = torchvision.datasets.MNIST(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64,
                                         shuffle=False, num_workers=16, pin_memory=True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz


Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz


Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz



Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz


Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw
Processing...
Done!



Define some helper functions and classes that help us to track the statistics and accuracy with respect to the train/test data.

In [None]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

def accuracy(output, target):
    """ Computes the top 1 accuracy """
    with torch.no_grad():
        batch_size = target.size(0)

        _, pred = output.topk(1, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        # Count the correct top-1 predictions in this batch
        correct_one = correct[:1].view(-1).float().sum(0, keepdim=True)
        return correct_one.mul_(100.0 / batch_size).item()

def print_size_of_model(model):
    """ Prints the real size of the model """
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

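def load_model(quantized_model, model):
    """ Copies the trained weights into a model meant for quantization """
    # Move the trained model to the CPU before reading its weights,
    # because the quantized model is evaluated on the CPU
    model = model.to('cpu')
    state_dict = model.state_dict()
    quantized_model.load_state_dict(state_dict)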

def fuse_modules(model):
    """ Fuse together convolutions/linear layers and ReLU """
    torch.quantization.fuse_modules(model, [['conv1', 'relu1'],
                                            ['conv2', 'relu2'],
                                            ['fc1', 'relu3'],
                                            ['fc2', 'relu4']], inplace=True)




Define a simple CNN that classifies MNIST images.




In [None]:
class Net(nn.Module):
    def __init__(self, q=False):
        # q=True adds the quant/dequant stubs needed for quantization;
        # q=False gives the plain floating-point network
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, bias=False)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5, bias=False)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(256, 120, bias=False)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84, bias=False)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10, bias=False)
        self.q = q
        if q:
          self.quant = QuantStub()
          self.dequant = DeQuantStub()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.q:
          x = self.quant(x)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        # Be careful to use reshape here instead of view
        x = x.reshape(x.shape[0], -1)
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.fc2(x)
        x = self.relu4(x)
        x = self.fc3(x)
        if self.q:
          x = self.dequant(x)
        return x

In [None]:
net = Net(q=False).cuda()
print_size_of_model(net)

Tesla T4 with CUDA capability sm_75 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the Tesla T4 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



Size (MB): 0.178947


Train this CNN on the training dataset (this may take a few moments).

In [None]:
def train(model: nn.Module, dataloader: DataLoader, cuda=False, q=False):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    model.train()
    for epoch in range(20):  # loop over the dataset multiple times

        running_loss = AverageMeter('loss')
        acc = AverageMeter('train_acc')
        for i, data in enumerate(dataloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            if cuda:
              inputs = inputs.cuda()
              labels = labels.cuda()

            # zero the parameter gradients
            optimizer.zero_grad()

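            # From the fourth epoch on (epoch index 3), freeze the observers
            # so the quantizer parameters (scale, zero-point) stop changing
            if epoch>=3 and q:
              model.apply(torch.quantization.disable_observer)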

            # forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss.update(loss.item(), outputs.shape[0])
            acc.update(accuracy(outputs, labels), outputs.shape[0])
            if i % 100 == 0:    # print every 100 mini-batches
                print('[%d, %5d] ' %
                    (epoch + 1, i + 1), running_loss, acc)
    print('Finished Training')


def test(model: nn.Module, dataloader: DataLoader, cuda=False) -> float:
    correct = 0
    total = 0
    model.eval()
    with torch.no_grad():
        for data in dataloader:
            inputs, labels = data

            if cuda:
              inputs = inputs.cuda()
              labels = labels.cuda()

            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return 100 * correct / total

In [None]:
train(net, trainloader, cuda=True)

[1,     1]  loss 2.304688 (2.304688) train_acc 7.812500 (7.812500)
[1,   101]  loss 2.298546 (2.301924) train_acc 14.062500 (11.154084)
[1,   201]  loss 2.298169 (2.300080) train_acc 18.750000 (12.927550)
[1,   301]  loss 2.287714 (2.298172) train_acc 18.750000 (14.150748)
[1,   401]  loss 2.284940 (2.295719) train_acc 18.750000 (15.410692)
[1,   501]  loss 2.268452 (2.292275) train_acc 31.250000 (16.869386)
[1,   601]  loss 2.245095 (2.287118) train_acc 43.750000 (18.711002)
[1,   701]  loss 2.186713 (2.278292) train_acc 45.312500 (21.143902)
[1,   801]  loss 1.973244 (2.257265) train_acc 64.062500 (24.475265)
[1,   901]  loss 1.177035 (2.186206) train_acc 71.875000 (28.799598)
[2,     1]  loss 0.966906 (0.966906) train_acc 75.000000 (75.000000)
[2,   101]  loss 0.651747 (0.683141) train_acc 71.875000 (79.146040)
[2,   201]  loss 0.314332 (0.600203) train_acc 92.187500 (81.825249)
[2,   301]  loss 0.536951 (0.552717) train_acc 87.500000 (83.284884)
[2,   401]  loss 0.487067 (0.516860)

Now that the CNN has been trained, let's test it on our test dataset.

In [None]:
score = test(net, testloader, cuda=True)
print('Accuracy of the network on the test images: {}% - FP32'.format(score))

Accuracy of the network on the test images: 98.65% - FP32


### Post-training quantization

Define a new quantized network architecture that also contains the quantization and dequantization stubs, which mark where tensors are converted to and from their quantized form at the start and at the end of the network.

Next, we’ll “fuse modules”; this can make the model faster by saving on memory access and can also improve numerical accuracy. While fusion can be used with any model, it is especially common with quantized models.
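
One plausible way to see why fusion can help numerical accuracy is sketched below in plain Python. It assumes a simple 8-bit affine quantizer (the hypothetical helpers `qparams` and `roundtrip`, not PyTorch functions). If the layer output is quantized before the ReLU, half of the 256 levels are spent on negative values that the ReLU discards. If the two are fused, only the non-negative output range has to be covered, so the step size is smaller.

```python
def qparams(lo, hi, qmax=255):
    """Scale and zero point for an 8-bit affine quantizer covering [lo, hi]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / qmax
    zero_point = max(0, min(qmax, round(-lo / scale)))
    return scale, zero_point

def roundtrip(x, scale, zero_point, qmax=255):
    """Quantize x to an integer level and map it back to a float."""
    q = max(0, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

def relu(x):
    return max(0.0, x)

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical layer outputs before the activation, spread over [-4, 4].
pre_activations = [-4.0 + 0.01 * i for i in range(801)]
exact = [relu(y) for y in pre_activations]

# Unfused: quantize the layer output over [-4, 4], then apply ReLU.
s1, z1 = qparams(min(pre_activations), max(pre_activations))
unfused = [relu(roundtrip(y, s1, z1)) for y in pre_activations]

# Fused: only the post-ReLU range [0, 4] has to be represented.
s2, z2 = qparams(min(exact), max(exact))
fused = [roundtrip(v, s2, z2) for v in exact]

unfused_err = mean_abs_error(unfused, exact)
fused_err = mean_abs_error(fused, exact)
print("unfused step:", s1, "error:", unfused_err)
print("fused step:  ", s2, "error:", fused_err)
```

This only models the rounding effect; the speed benefit of fusion comes from avoiding an extra pass over memory and is not shown here.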

In [None]:
qnet = Net(q=True)
load_model(qnet, net)
fuse_modules(qnet)

In [None]:
print_size_of_model(qnet)
score = test(qnet, testloader, cuda=False)
print('Accuracy of the fused network on the test images: {}% - FP32'.format(score))

Size (MB): 0.179144
Accuracy of the fused network on the test images: 98.65% - FP32


Post-training static quantization involves not just converting the weights from float to int, as in dynamic quantization, but also an additional calibration step: batches of data are first fed through the network to compute the resulting distributions of the different activations (specifically, this is done by inserting observer modules at different points that record this data). These distributions are then used to determine how each activation should be quantized at inference time (a simple technique would be to divide the entire range of activations into 256 levels). Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats, and then back to ints, between every operation, resulting in a significant speed-up.
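
The "divide the observed range into 256 levels" idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the PyTorch observer code: the helper names `choose_qparams`, `quantize`, and `dequantize` are hypothetical, and the activation values are made up.

```python
def choose_qparams(min_val, max_val, qmin=0, qmax=255):
    """Derive scale and zero point from an observed activation range."""
    # Make sure 0.0 lies inside the range so it is represented exactly.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to one of the 256 integer levels."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map an integer level back to its float value."""
    return (q - zero_point) * scale

# "Calibration": observe the range of some example activations.
activations = [-1.2, -0.3, 0.0, 0.7, 2.5, 3.1]
scale, zero_point = choose_qparams(min(activations), max(activations))

for x in activations:
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", round(dequantize(q, scale, zero_point), 4))
```

Each value is recovered to within half a quantization step, and the integer levels can be passed directly to the next quantized operation.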

In [None]:
qnet.qconfig = torch.quantization.default_qconfig
print(qnet.qconfig)
torch.quantization.prepare(qnet, inplace=True)
print('Post Training Quantization Prepare: Inserting Observers')
print('\n Conv1: After observer insertion \n\n', qnet.conv1)

test(qnet, trainloader, cuda=False)
print('Post Training Quantization: Calibration done')
torch.quantization.convert(qnet, inplace=True)
print('Post Training Quantization: Convert done')
print('\n Conv1: After fusion and quantization \n\n', qnet.conv1)
print("Size of model after quantization")
print_size_of_model(qnet)

QConfig(activation=functools.partial(<class 'torch.quantization.observer.MinMaxObserver'>, reduce_range=True), weight=functools.partial(<class 'torch.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
Post Training Quantization Prepare: Inserting Observers

 Conv1: After observer insertion 

 ConvReLU2d(
  (0): Conv2d(
    1, 6, kernel_size=(5, 5), stride=(1, 1), bias=False
    (activation_post_process): MinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
  (1): ReLU(
    (activation_post_process): MinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
)
Post Training Quantization: Calibration done
Post Training Quantization: Convert done

 Conv1: After fusion and quantization 

 QuantizedConvReLU2d(1, 6, kernel_size=(5, 5), stride=(1, 1), scale=0.07420612871646881, zero_point=0, bias=False)
Size of model after quantization
Size (MB): 0.050052
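
As a rough sanity check on the reported sizes, the number of weights can be counted by hand from the layer shapes in the `Net` class above (all layers use `bias=False`). The sketch assumes 4 bytes per FP32 weight and 1 byte per INT8 weight and ignores any file overhead.

```python
# Weight shapes taken from the Net definition (out, in, kernel_h, kernel_w).
layer_shapes = {
    "conv1": (6, 1, 5, 5),
    "conv2": (16, 6, 5, 5),
    "fc1": (120, 256),
    "fc2": (84, 120),
    "fc3": (10, 84),
}

def num_params(shape):
    """Number of elements in a tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

total = sum(num_params(s) for s in layer_shapes.values())
fp32_mb = total * 4 / 1e6  # 4 bytes per float32 weight
int8_mb = total * 1 / 1e6  # 1 byte per int8 weight

print("parameters:", total)
print("FP32 weights (MB):", fp32_mb)
print("INT8 weights (MB):", int8_mb)
```

The estimates are slightly below the measured file sizes of 0.178947 MB and 0.050052 MB. The gap is presumably serialization overhead plus, for the quantized model, the stored scales and zero points, although that breakdown is an assumption rather than something measured here.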


In [None]:
score = test(qnet, testloader, cuda=False)
print('Accuracy of the fused and quantized network on the test images: {}% - INT8'.format(score))

Accuracy of the fused and quantized network on the test images: 98.67% - INT8


We can also define a custom quantization configuration that replaces the default observers: instead of quantizing with respect to the absolute min/max, we use a moving average of the observed min/max, which is less sensitive to outliers and will hopefully generalize better.
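
The difference between the two observer styles can be illustrated with a small plain-Python sketch. The classes `MinMaxRange` and `MovingAverageRange` are hypothetical stand-ins, not the PyTorch observers; the update rule `value += c * (new - value)` with `c = 0.01` is assumed here as a typical exponential moving average.

```python
class MinMaxRange:
    """Tracks the global min/max over all batches seen so far."""
    def __init__(self):
        self.min_val = None
        self.max_val = None

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        self.min_val = lo if self.min_val is None else min(self.min_val, lo)
        self.max_val = hi if self.max_val is None else max(self.max_val, hi)

class MovingAverageRange(MinMaxRange):
    """Tracks an exponential moving average of the per-batch min/max."""
    def __init__(self, averaging_constant=0.01):
        super().__init__()
        self.c = averaging_constant

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        if self.min_val is None:
            # The first batch initializes the running range.
            self.min_val, self.max_val = lo, hi
        else:
            self.min_val += self.c * (lo - self.min_val)
            self.max_val += self.c * (hi - self.max_val)

# 101 made-up batches; the one in the middle contains an outlier of 40.0.
normal = [-1.0, 0.2, 1.0]
batches = [normal] * 50 + [[-1.0, 0.0, 40.0]] + [normal] * 50

plain, averaged = MinMaxRange(), MovingAverageRange()
for batch in batches:
    plain.observe(batch)
    averaged.observe(batch)

print("min/max range:       ", plain.min_val, plain.max_val)
print("moving-average range:", averaged.min_val, round(averaged.max_val, 4))
```

A single outlier stretches the plain min/max range to 40, which wastes most quantization levels, while the moving average stays close to the typical range.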

In [None]:
from torch.quantization.observer import MovingAverageMinMaxObserver

qnet = Net(q=True)
load_model(qnet, net)
fuse_modules(qnet)

qnet.qconfig = torch.quantization.QConfig(
                                      activation=MovingAverageMinMaxObserver.with_args(reduce_range=True),
                                      weight=MovingAverageMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
print(qnet.qconfig)
torch.quantization.prepare(qnet, inplace=True)
print('Post Training Quantization Prepare: Inserting Observers')
print('\n Conv1: After observer insertion \n\n', qnet.conv1)

test(qnet, trainloader, cuda=False)
print('Post Training Quantization: Calibration done')
torch.quantization.convert(qnet, inplace=True)
print('Post Training Quantization: Convert done')
print('\n Conv1: After fusion and quantization \n\n', qnet.conv1)
print("Size of model after quantization")
print_size_of_model(qnet)
score = test(qnet, testloader, cuda=False)
print('Accuracy of the fused and quantized network on the test images: {}% - INT8'.format(score))

QConfig(activation=functools.partial(<class 'torch.quantization.observer.MovingAverageMinMaxObserver'>, reduce_range=True), weight=functools.partial(<class 'torch.quantization.observer.MovingAverageMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
Post Training Quantization Prepare: Inserting Observers

 Conv1: After observer insertion 

 ConvReLU2d(
  (0): Conv2d(
    1, 6, kernel_size=(5, 5), stride=(1, 1), bias=False
    (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
  (1): ReLU(
    (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
)
Post Training Quantization: Calibration done
Post Training Quantization: Convert done

 Conv1: After fusion and quantization 

 QuantizedConvReLU2d(1, 6, kernel_size=(5, 5), stride=(1, 1), scale=0.07174129039049149, zero_point=0, bias=False)
Size of model after quantization
Size (MB): 0.050052
Accuracy of the fused and quantized 

In addition, we can often improve accuracy simply by using a different quantization configuration. We repeat the same exercise with the recommended configuration for quantizing for x86 architectures. This configuration does the following:

- Quantizes weights on a per-channel basis.
- Uses a histogram observer that collects a histogram of activations and then picks quantization parameters intended to minimize the quantization error.
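
To illustrate why per-channel weight quantization can help, here is a minimal plain-Python sketch. It assumes symmetric int8 quantization and two made-up output channels with very different weight magnitudes; `symmetric_quant_error` is a hypothetical helper, not a PyTorch function.

```python
def symmetric_quant_error(values, max_abs, qmax=127):
    """Mean absolute round-trip error for symmetric int8 quantization."""
    scale = max_abs / qmax
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        total += abs(q * scale - v)
    return total / len(values)

# Two hypothetical output channels with very different weight magnitudes.
weights = {
    "channel_0": [0.9, -0.75, 0.4, -0.1],
    "channel_1": [0.012, -0.009, 0.004, -0.001],
}

# Per-tensor: a single scale derived from the largest weight overall.
tensor_max = max(abs(v) for ch in weights.values() for v in ch)

errors = {}
for name, ch in weights.items():
    per_tensor = symmetric_quant_error(ch, tensor_max)
    # Per-channel: each channel gets a scale from its own largest weight.
    per_channel = symmetric_quant_error(ch, max(abs(v) for v in ch))
    errors[name] = (per_tensor, per_channel)
    print(name, "per-tensor:", per_tensor, "per-channel:", per_channel)
```

With a shared scale, the small weights in `channel_1` collapse onto just a few integer levels; with their own scale they use the full int8 range. The price is one scale per channel instead of one per tensor, which is consistent with the slightly larger model file reported for this configuration.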

In [None]:
qnet = Net(q=True)
load_model(qnet, net)
fuse_modules(qnet)

In [None]:
qnet.qconfig = torch.quantization.get_default_qconfig('fbgemm')
print(qnet.qconfig)

torch.quantization.prepare(qnet, inplace=True)
test(qnet, trainloader, cuda=False)
torch.quantization.convert(qnet, inplace=True)
print("Size of model after quantization")
print_size_of_model(qnet)

QConfig(activation=functools.partial(<class 'torch.quantization.observer.HistogramObserver'>, reduce_range=True), weight=functools.partial(<class 'torch.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
Size of model after quantization
Size (MB): 0.056182


In [None]:
score = test(qnet, testloader, cuda=False)
print('Accuracy of the fused and quantized network on the test images: {}% - INT8'.format(score))

Accuracy of the fused and quantized network on the test images: 98.64% - INT8


### Quantization-aware training

Quantization-aware training (QAT) is the quantization method that typically results in the highest accuracy. With QAT, all weights and activations are “fake quantized” during both the forward and backward passes of training: that is, float values are rounded to mimic int8 values, but all computations are still done with floating point numbers.
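
"Fake quantization" of a single value can be sketched as a quantize step followed immediately by a dequantize step. The function below is a hypothetical plain-Python illustration with made-up `scale` and `zero_point` values; it covers only the forward rounding, not how gradients are handled in the backward pass.

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then immediately dequantize; the result is still a float."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))  # clamp to the representable int8 levels
    return (q - zero_point) * scale

scale, zero_point = 0.02, 128

for x in (-0.513, 0.0, 0.337, 1.0, 9.9):
    print(x, "->", fake_quantize(x, scale, zero_point))
```

The outputs are ordinary floats, but they are restricted to the grid of values that int8 can represent, so training sees the same rounding and clamping that will occur at inference.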

In [None]:
qnet = Net(q=True)
fuse_modules(qnet)
qnet.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(qnet, inplace=True)
print('\n Conv1: After fusion and quantization \n\n', qnet.conv1)
qnet=qnet.cuda()


 Conv1: After fusion and quantization 

 ConvReLU2d(
  1, 6, kernel_size=(5, 5), stride=(1, 1), bias=False
  (activation_post_process): FakeQuantize(
    fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
    (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
  (weight_fake_quant): FakeQuantize(
    fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
    (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
)


In [None]:
train(qnet, trainloader, cuda=True)

[1,     1]  loss 2.301113 (2.301113) train_acc 10.937500 (10.937500)
[1,   101]  loss 2.304500 (2.300252) train_acc 17.187500 (14.511139)
[1,   201]  loss 2.280855 (2.295559) train_acc 25.000000 (17.918221)
[1,   301]  loss 2.280437 (2.289621) train_acc 18.750000 (20.681063)
[1,   401]  loss 2.225562 (2.280254) train_acc 28.125000 (22.919264)
[1,   501]  loss 2.135546 (2.262463) train_acc 32.812500 (24.120509)
[1,   601]  loss 1.859752 (2.219802) train_acc 50.000000 (26.526102)
[1,   701]  loss 1.159023 (2.118334) train_acc 76.562500 (31.361448)
[1,   801]  loss 0.852159 (1.974637) train_acc 70.312500 (36.641698)
[1,   901]  loss 0.576404 (1.831879) train_acc 82.812500 (41.374168)
[2,     1]  loss 0.672451 (0.672451) train_acc 79.687500 (79.687500)
[2,   101]  loss 0.583598 (0.543110) train_acc 78.125000 (83.539604)
[2,   201]  loss 0.255931 (0.519815) train_acc 92.187500 (84.227301)
[2,   301]  loss 0.608585 (0.494296) train_acc 82.812500 (84.930440)
[2,   401]  loss 0.388425 (0.47748

In [None]:
qnet = qnet.cpu()
torch.quantization.convert(qnet, inplace=True)
print("Size of model after quantization")
print_size_of_model(qnet)

score = test(qnet, testloader, cuda=False)
print('Accuracy of the fused and quantized network (trained quantized) on the test images: {}% - INT8'.format(score))

Size of model after quantization
Size (MB): 0.056182
Accuracy of the fused and quantized network (trained quantized) on the test images: 98.36% - INT8


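Training a quantized model with high accuracy requires accurate modeling of numerics at inference. For quantization-aware training, we can therefore modify the training loop to freeze the quantizer parameters (scale and zero-point) after the first few epochs and then only fine-tune the weights. In the `train` function above, passing `q=True` does this by disabling the observers once `epoch >= 3`.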
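
The effect of freezing the observers can be sketched with a toy module in plain Python. `ToyFakeQuant` is a hypothetical stand-in with an `observer_enabled` flag, not the PyTorch `FakeQuantize` class, and the batches are made up so that the observed range keeps growing.

```python
class ToyFakeQuant:
    """Hypothetical fake-quantize module whose observer can be switched off."""
    def __init__(self):
        self.observer_enabled = True
        self.max_abs = 0.0
        self.scale = 1.0

    def forward(self, values):
        if self.observer_enabled:
            # Update the observed range and recompute the scale.
            self.max_abs = max(self.max_abs, max(abs(v) for v in values))
            self.scale = self.max_abs / 127 if self.max_abs > 0 else 1.0
        # Round to the int8 grid defined by the current scale.
        return [max(-127, min(127, round(v / self.scale))) * self.scale
                for v in values]

fq = ToyFakeQuant()
scales = []
for epoch in range(6):
    if epoch >= 3:
        # Same condition as in the training loop: freeze the observer.
        fq.observer_enabled = False
    batch = [0.5 * (epoch + 1), -0.25]  # the range grows every epoch
    fq.forward(batch)
    scales.append(fq.scale)

print(scales)
```

The scale changes during the first three epochs and then stays fixed, so the remaining epochs adapt the weights to a stable quantization grid.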

In [None]:
qnet = Net(q=True)
fuse_modules(qnet)
qnet.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(qnet, inplace=True)
qnet = qnet.cuda()
train(qnet, trainloader, cuda=True, q=True)
qnet = qnet.cpu()
torch.quantization.convert(qnet, inplace=True)
print("Size of model after quantization")
print_size_of_model(qnet)

score = test(qnet, testloader, cuda=False)
print('Accuracy of the fused and quantized network (trained quantized) on the test images: {}% - INT8'.format(score))

[1,     1]  loss 2.302550 (2.302550) train_acc 7.812500 (7.812500)
[1,   101]  loss 2.297554 (2.300715) train_acc 20.312500 (13.845916)
[1,   201]  loss 2.282641 (2.297055) train_acc 34.375000 (18.135883)
[1,   301]  loss 2.270876 (2.292123) train_acc 39.062500 (22.809385)
[1,   401]  loss 2.262033 (2.285715) train_acc 37.500000 (26.683292)
[1,   501]  loss 2.202892 (2.275584) train_acc 53.125000 (30.417290)
[1,   601]  loss 2.071962 (2.256968) train_acc 45.312500 (33.236273)
[1,   701]  loss 1.763640 (2.214525) train_acc 59.375000 (35.529601)
[1,   801]  loss 1.114516 (2.123895) train_acc 76.562500 (39.054697)
[1,   901]  loss 0.700801 (1.992948) train_acc 85.937500 (43.215871)
[2,     1]  loss 0.738226 (0.738226) train_acc 84.375000 (84.375000)
[2,   101]  loss 0.517096 (0.564659) train_acc 85.937500 (84.730817)
[2,   201]  loss 0.456042 (0.522940) train_acc 87.500000 (85.432214)
[2,   301]  loss 0.316576 (0.485950) train_acc 89.062500 (86.399502)
[2,   401]  loss 0.328412 (0.451746)