https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html

ImageNet Dataset Preparation
1. Download the following compressed files from https://image-net.org/challenges/LSVRC/2012/2012-downloads.php (requires an account login).
    1. Development kit (Task 1 & 2). 2.5MB.
    2. Training images (Task 1 & 2). 138GB.
    3. Validation images (all tasks). 6.3GB.
    4. Test images (all tasks). 13GB. (optional)
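2. Place the downloaded files in a single directory and load them with torchvision.datasets.ImageNet; see https://pytorch.org/vision/stable/_modules/torchvision/datasets/imagenet.html#ImageNet.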
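
Before the first run it can help to confirm that the archives are where the loader will look for them. The sketch below is only a pre-flight check: `missing_imagenet_archives` is a hypothetical helper (not part of torchvision), and the archive file names are an assumption about what the torchvision loader looks for, so verify them against your installed version.

```python
import os
import tempfile

# Assumed archive names (not stated in this document); check them against
# the torchvision version you have installed.
REQUIRED_ARCHIVES = {
    "devkit": "ILSVRC2012_devkit_t12.tar.gz",
    "train": "ILSVRC2012_img_train.tar",
    "val": "ILSVRC2012_img_val.tar",
}

def missing_imagenet_archives(root, splits=("train", "val")):
    """Return the archive names that are not present in `root`."""
    needed = ["devkit", *splits]
    return [
        REQUIRED_ARCHIVES[key]
        for key in needed
        if not os.path.isfile(os.path.join(root, REQUIRED_ARCHIVES[key]))
    ]

# Usage on an empty temporary directory: every archive is reported missing.
with tempfile.TemporaryDirectory() as root:
    print(missing_imagenet_archives(root))
```

An empty result means the directory can be passed as the root argument of torchvision.datasets.ImageNet.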

EfficientNet-B7 Pretrained Model
1. The pretrained model, efficientnet_b7_lukemelas-dcc49843.pth, is downloaded from https://download.pytorch.org/models/efficientnet_b7_lukemelas-dcc49843.pth.
2. The definition of EfficientNet-B7 can be found at https://github.com/pytorch/vision/blob/main/torchvision/models/efficientnet.py.

Problems faced
1. 
```bash
AssertionError: did not find fuser method for: (<class 'torch.nn.modules.conv.Conv2d'>, <class 'torch.nn.modules.batchnorm.BatchNorm2d'>, <class 'torch.nn.modules.activation.SiLU'>) 
```
2.
```bash
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
```
3.
```bash
NotImplementedError: Could not run 'aten::empty.memory_format' with arguments from the 'QuantizedCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty.memory_format' is only available for these backends: [CPU, CUDA, Meta, MkldnnCPU, SparseCPU, SparseCUDA, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
```
4.
```bash
NotImplementedError: Could not run 'aten::empty_strided' with arguments from the 'QuantizedCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, CUDA, Meta, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
```
5.
```bash
NotImplementedError: Could not run 'aten::add.out' with arguments from the 'QuantizedCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::add.out' is only available for these backends: [CPU, CUDA, Meta, MkldnnCPU, SparseCPU, SparseCUDA, SparseCsrCPU, SparseCsrCUDA, BackendSelect, Python, Named, Conjugate, Negative, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, UNKNOWN_TENSOR_TYPE_ID, Autocast, Batched, VmapMode].
```
6.
```bash
    213             result = self.stochastic_depth(result)
    214             result = self.dequant(result)
--> 215             result += input
    216             result = self.quant(result)
    217         return result
RuntimeError: promoteTypes with quantized numbers is not handled yet; figure out what the correct rules should be, offending types: Float QUInt8
```
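The first five problems can be solved by wrapping the affected operation with QuantStub() and DeQuantStub(), so that each operator receives the tensor type (float or quantized) it supports.
The last problem can be solved by replacing the in-place addition with nn.quantized.FloatFunctional().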
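
The two fixes can be sketched on a toy block. `ResidualSiLUBlock` is a hypothetical module made up for illustration, not the EfficientNet code used later; it only shows where the stubs and the FloatFunctional go. The check runs in float mode (before prepare/convert), where the stubs pass tensors through unchanged.

```python
import torch
import torch.nn as nn
from torch.nn.quantized import FloatFunctional
from torch.quantization import QuantStub, DeQuantStub

class ResidualSiLUBlock(nn.Module):
    """Hypothetical toy block, not the EfficientNet code from this notebook."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.dequant = DeQuantStub()       # leave the quantized domain before SiLU
        self.act = nn.SiLU()               # runs on float tensors
        self.quant = QuantStub()           # re-enter the quantized domain
        self.skip_add = FloatFunctional()  # replaces `+` / `+=` on the skip path

    def forward(self, x):
        out = self.conv(x)
        out = self.dequant(out)
        out = self.act(out)
        out = self.quant(out)
        return self.skip_add.add(out, x)

block = ResidualSiLUBlock(4).eval()
x = torch.randn(2, 4, 8, 8)
# Before prepare/convert the stubs are pass-throughs and FloatFunctional.add
# is an ordinary float addition, so the block matches the plain float math.
reference = torch.nn.functional.silu(block.conv(x)) + x
print(torch.allclose(block(x), reference))
```

After torch.quantization.convert, the same stubs become real Quantize/DeQuantize modules, and the FloatFunctional becomes a quantized add with its own scale and zero point.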

In [1]:
from config import *
import numpy as np
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets
import torchvision.transforms as transforms
import os
import time
import sys
import torch.quantization

# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.quantization'
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Specify random seed for repeatable results
torch.manual_seed(191009)

<torch._C.Generator at 0x7fab5194c948>

# 1. Model Architecture

In [2]:
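import copy
import math
from functools import partial
from typing import List
from torch.hub import load_state_dict_from_url
from torch.quantization import QuantStub, DeQuantStub
from torchvision.ops import StochasticDepth

# Checkpoint used by _efficientnet() when pretrained=True
model_urls = {"efficientnet_b7": "https://download.pytorch.org/models/efficientnet_b7_lukemelas-dcc49843.pth"}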

def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

# https://github.com/pytorch/vision/blob/main/torchvision/ops/misc.py
class ConvBNAct(nn.Sequential):
    def __init__(self, 
                 in_planes, 
                 out_planes, 
                 kernel_size = 3, 
                 stride = 1, 
                 groups = 1, 
                 norm_layer = nn.BatchNorm2d, 
                 activation_layer = nn.ReLU, 
                 dequant = None):
        padding = (kernel_size - 1) // 2
        super(ConvBNAct, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups = groups, bias = False),
            norm_layer(out_planes),
            # The activation is not fused: dequantize, apply it in float, then requantize
            dequant,
            activation_layer(inplace=False),
            QuantStub(),
        )

class ConvBN(nn.Sequential):
    def __init__(self, 
                 in_planes, 
                 out_planes, 
                 kernel_size = 3, 
                 stride = 1, 
                 groups = 1, 
                 norm_layer = nn.BatchNorm2d, 
                 dequant = None):
        padding = (kernel_size - 1) // 2
        super(ConvBN, self).__init__(
            # QuantStub(),
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            norm_layer(out_planes),
            # dequant,
        )

# https://github.com/pytorch/vision/blob/main/torchvision/ops/misc.py
# https://github.com/pytorch/vision/blob/main/torchvision/models/quantization/mobilenetv3.py
class SqueezeExcitation(torch.nn.Module):
    def __init__(
        self, 
        input_channels, 
        squeeze_channels, 
        activation = torch.nn.ReLU, 
        scale_activation = torch.nn.Sigmoid, 
        dequant = None,
    ):
        super(SqueezeExcitation, self).__init__()
        self.avgpool = torch.nn.AdaptiveAvgPool2d(1)
        self.fc1 = torch.nn.Conv2d(input_channels, squeeze_channels, 1)
        self.fc2 = torch.nn.Conv2d(squeeze_channels, input_channels, 1)
        self.quant1 = QuantStub()
        self.quant2 = QuantStub()
        self.quant3 = QuantStub()
        self.dequant = dequant
        self.activation = activation()
        self.scale_activation = scale_activation()

    def _scale(self, input):
        scale = self.avgpool(input)
        scale = self.fc1(scale)
        # https://github.com/pytorch/pytorch/issues/34583
        # https://discuss.pytorch.org/t/quantization-how-to-quantize-model-which-include-not-support-to-quantize-layer/76528
        # https://pytorch.org/tutorials/recipes/fuse.html
        scale = self.dequant(scale)
        scale = self.activation(scale)
        scale = self.quant1(scale)
        scale = self.fc2(scale)
        scale = self.dequant(scale)
        scale = self.scale_activation(scale)
        scale = self.quant2(scale)
        return scale

    def forward(self, input):
        scale = self._scale(input)
        x = self.dequant(scale) * self.dequant(input)
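        # NOTE: quant2 is reused here (it also observes the sigmoid output in _scale),
        # so both tensors share one observer; quant3 is defined but never called.
        x = self.quant2(x)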
        return x

# https://github.com/pytorch/vision/blob/main/torchvision/models/efficientnet.py
class MBConvConfig:
    # Stores information listed at Table 1 of the EfficientNet paper
    def __init__(
        self,
        expand_ratio: float,
        kernel: int,
        stride: int,
        input_channels: int,
        out_channels: int,
        num_layers: int,
        width_mult: float,
        depth_mult: float,
    ) -> None:
        self.expand_ratio = expand_ratio
        self.kernel = kernel
        self.stride = stride
        self.input_channels = self.adjust_channels(input_channels, width_mult)
        self.out_channels = self.adjust_channels(out_channels, width_mult)
        self.num_layers = self.adjust_depth(num_layers, depth_mult)

    def __repr__(self) -> str:
        s = self.__class__.__name__ + "("
        s += "expand_ratio={expand_ratio}"
        s += ", kernel={kernel}"
        s += ", stride={stride}"
        s += ", input_channels={input_channels}"
        s += ", out_channels={out_channels}"
        s += ", num_layers={num_layers}"
        s += ")"
        return s.format(**self.__dict__)

    @staticmethod
    def adjust_channels(channels: int, width_mult: float) -> int:
        return _make_divisible(channels * width_mult, 8)

    @staticmethod
    def adjust_depth(num_layers: int, depth_mult: float):
        return int(math.ceil(num_layers * depth_mult))

# https://paperswithcode.com/method/inverted-residual-block
# https://towardsdatascience.com/mobilenetv2-inverted-residuals-and-linear-bottlenecks-8a4362f4ffd5
class MBConv(nn.Module):
    def __init__(
        self,
        cnf: MBConvConfig,
        stochastic_depth_prob: float,
        norm_layer,
        se_layer = SqueezeExcitation,
        dequant = None,
    ) -> None:
        super(MBConv, self).__init__()

        if not (1 <= cnf.stride <= 2):
            raise ValueError("illegal stride value")

        self.use_res_connect = cnf.stride == 1 and cnf.input_channels == cnf.out_channels

        layers = []
        # activation_layer = nn.ReLU
        activation_layer = nn.SiLU

        # expand
        expanded_channels = cnf.adjust_channels(cnf.input_channels, cnf.expand_ratio)
        if expanded_channels != cnf.input_channels:
            layers.append(
                ConvBNAct(cnf.input_channels, 
                          expanded_channels, 
                          kernel_size = 1, 
                          norm_layer = norm_layer, 
                          activation_layer = activation_layer, 
                          dequant = dequant)
            )

        # depthwise
        layers.append(
            ConvBNAct(
                expanded_channels,
                expanded_channels,
                kernel_size = cnf.kernel,
                stride = cnf.stride,
                groups = expanded_channels,
                norm_layer = norm_layer,
                activation_layer = activation_layer,
                dequant = dequant,
            )
        )

        # squeeze and excitation
        squeeze_channels = max(1, cnf.input_channels // 4)
        layers.append(se_layer(expanded_channels, squeeze_channels, activation=partial(activation_layer, inplace=True), dequant=dequant))

        # project
        layers.append(
            ConvBN(
                expanded_channels, cnf.out_channels, kernel_size = 1, norm_layer = norm_layer, dequant = dequant,
            )
        )

        self.block = nn.Sequential(*layers)
        self.stochastic_depth = StochasticDepth(stochastic_depth_prob, "row")
        self.quant = QuantStub()
        self.dequant = dequant
        self.out_channels = cnf.out_channels
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, input):
        result = self.block(input)
        if self.use_res_connect:
            result = self.stochastic_depth(result)
            result = self.skip_add.add(result, input)
        return result


class EfficientNet(nn.Module):
    def __init__(
        self,
        inverted_residual_setting,
        dropout: float,
        stochastic_depth_prob: float = 0.2,
        num_classes: int = 1000,
        norm_layer = None,
    ) -> None:
        """
        EfficientNet main class
        Args:
            inverted_residual_setting (List[MBConvConfig]): Network structure
            dropout (float): The dropout probability
            stochastic_depth_prob (float): The stochastic depth probability
            num_classes (int): Number of classes
        """
        super(EfficientNet, self).__init__()

        block = MBConv
        # activation_layer = nn.ReLU
        activation_layer = nn.SiLU

        if norm_layer is None:
            norm_layer = nn.BatchNorm2d

        layers = []
        
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

        # building first layer
        firstconv_output_channels = inverted_residual_setting[0].input_channels
        layers.append(
            ConvBNAct(
                3, 
                firstconv_output_channels, 
                kernel_size = 3, 
                stride = 2, 
                norm_layer = norm_layer, 
                activation_layer = activation_layer,
                dequant = self.dequant,
            )
        )

        # building inverted residual blocks
        total_stage_blocks = sum(cnf.num_layers for cnf in inverted_residual_setting)
        stage_block_id = 0
        for cnf in inverted_residual_setting:
            stage: List[nn.Module] = []
            for _ in range(cnf.num_layers):
                # copy to avoid modifications. shallow copy is enough
                block_cnf = copy.copy(cnf)

                # overwrite info if not the first conv in the stage
                if stage:
                    block_cnf.input_channels = block_cnf.out_channels
                    block_cnf.stride = 1

                # adjust stochastic depth probability based on the depth of the stage block
                sd_prob = stochastic_depth_prob * float(stage_block_id) / total_stage_blocks

                stage.append(block(block_cnf, sd_prob, norm_layer, dequant = self.dequant))
                stage_block_id += 1

            layers.append(nn.Sequential(*stage))

        # building last several layers
        lastconv_input_channels = inverted_residual_setting[-1].out_channels
        lastconv_output_channels = 4 * lastconv_input_channels
        layers.append(
            ConvBNAct(
                lastconv_input_channels,
                lastconv_output_channels,
                kernel_size = 1,
                norm_layer = norm_layer,
                activation_layer = activation_layer,
                dequant = self.dequant,
            )
        )

        self.features = nn.Sequential(*layers)
        
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout, inplace=True),
            nn.Linear(lastconv_output_channels, num_classes),
        )

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                init_range = 1.0 / math.sqrt(m.out_features)
                nn.init.uniform_(m.weight, -init_range, init_range)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x):
        x = self.features(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)

        x = self.classifier(x)

        return x

    def forward(self, x):
        x = self.quant(x)
        x = self._forward_impl(x)
        x = self.dequant(x)
        return x
    
    # Fuse Conv+BN pairs prior to quantization. Conv+BN+SiLU cannot be fused
    # (there is no fuser method for SiLU), so the activation stays a separate module.
    # This operation does not change the numerics
    def fuse_model(self):
        for m in self.modules():
            if type(m) == ConvBNAct:
                # https://github.com/pytorch/pytorch/issues/41534
                # https://github.com/pytorch/pytorch/blob/0c77bd7c0bbd4d6e50a5f3ce7b4debbee85d7963/torch/quantization/fuse_modules.py#L106
                # torch.quantization.fuse_modules(m, ['0', '1', '2'], inplace=True)
                torch.quantization.fuse_modules(m, ['0', '1'], inplace=True)
            if type(m) == ConvBN:
                torch.quantization.fuse_modules(m, ['0', '1'], inplace=True)
            if type(m) == MBConv:
                for idx in range(len(m.block)):
                    if type(m.block[idx]) == nn.Conv2d:
                        torch.quantization.fuse_modules(m.block, [str(idx), str(idx + 1)], inplace=True)


def _efficientnet(
    arch: str,
    width_mult: float,
    depth_mult: float,
    dropout: float,
    pretrained: bool,
    progress: bool,
    norm_layer,
) -> EfficientNet:
    bneck_conf = partial(MBConvConfig, width_mult=width_mult, depth_mult=depth_mult)
    inverted_residual_setting = [
        bneck_conf(1, 3, 1, 32, 16, 1),
        bneck_conf(6, 3, 2, 16, 24, 2),
        bneck_conf(6, 5, 2, 24, 40, 2),
        bneck_conf(6, 3, 2, 40, 80, 3),
        bneck_conf(6, 5, 1, 80, 112, 3),
        bneck_conf(6, 5, 2, 112, 192, 4),
        bneck_conf(6, 3, 1, 192, 320, 1),
    ]
    model = EfficientNet(inverted_residual_setting, dropout, norm_layer=norm_layer)
    if pretrained:
        if model_urls.get(arch, None) is None:
            raise ValueError(f"No checkpoint is available for model type {arch}")
        state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
        model.load_state_dict(state_dict)
    return model

def efficientnet_b7(pretrained: bool = False, progress: bool = True) -> EfficientNet:
    """
    Constructs an EfficientNet B7 architecture from
    `"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" <https://arxiv.org/abs/1905.11946>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _efficientnet(
        "efficientnet_b7",
        2.0,
        3.1,
        0.5,
        pretrained,
        progress,
        norm_layer=partial(nn.BatchNorm2d, eps=0.001, momentum=0.01),
    )
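
To see what the width and depth multipliers passed by efficientnet_b7() do to the seven base stages, the arithmetic can be replayed outside the model. This is a standalone sketch: `make_divisible` repeats the rounding rule of `_make_divisible` so the block runs on its own, and `BASE_STAGES` copies only the channel and layer columns of the `bneck_conf` rows.

```python
import math

def make_divisible(v, divisor=8, min_value=None):
    # Same rounding rule as `_make_divisible` in the cell above.
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

WIDTH_MULT, DEPTH_MULT = 2.0, 3.1  # values passed by efficientnet_b7()
# (input_channels, out_channels, num_layers) of the seven base stages
BASE_STAGES = [(32, 16, 1), (16, 24, 2), (24, 40, 2), (40, 80, 3),
               (80, 112, 3), (112, 192, 4), (192, 320, 1)]

scaled = [
    (make_divisible(c_in * WIDTH_MULT),
     make_divisible(c_out * WIDTH_MULT),
     int(math.ceil(n * DEPTH_MULT)))
    for c_in, c_out, n in BASE_STAGES
]
total_blocks = sum(n for _, _, n in scaled)

# Stochastic depth probability grows linearly with the block index,
# as in the loop that builds the inverted residual blocks.
sd_probs = [0.2 * i / total_blocks for i in range(total_blocks)]

for stage in scaled:
    print(stage)
print("total blocks:", total_blocks)
print("last sd_prob: %.4f" % sd_probs[-1])
```

The first scaled stage has 64 input and 32 output channels, which is the shape of the first inverted residual block, features[1][0].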

# 2. Helper Functions

In [3]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


def evaluate(model, criterion, data_loader, neval_batches, device='cpu'):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            image, target = image.to(device), target.to(device)
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            print('.', end = '')
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
            if cnt >= neval_batches:
                return top1, top5

    return top1, top5

def load_model(model_file):
    model = efficientnet_b7(pretrained=False)
    state_dict = torch.load(model_file)
    model.load_state_dict(state_dict)
    model.to('cpu')
    return model

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')
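
As a quick illustration of how AverageMeter weights each batch by its size, the sketch below feeds it two made-up batch accuracies. The class is repeated in compact form so the example runs on its own; the numbers are illustrative and not measured results.

```python
class AverageMeter(object):
    """Same meter as in the cell above, repeated so the example runs alone."""
    def __init__(self, name, fmt=':f'):
        self.name, self.fmt = name, fmt
        self.val = self.avg = self.sum = self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

top1 = AverageMeter('Acc@1', ':6.2f')
top1.update(80.0, n=50)   # a batch of 50 images at 80% accuracy
top1.update(100.0, n=30)  # a smaller batch of 30 images at 100% accuracy
print(top1)               # latest value, then the running weighted average
```

The running average is (80 * 50 + 100 * 30) / 80 = 87.5, not the unweighted mean of 90.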

# 3. Define Dataset and Data Loaders
## ImageNet Data

In [4]:
def prepare_data_loaders(data_path):

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
           data_path, split="train", transform=transforms.Compose([
               transforms.RandomResizedCrop(224),
               transforms.RandomHorizontalFlip(),
               transforms.ToTensor(),
               normalize,
           ]))
    dataset_test = torchvision.datasets.ImageNet(
          data_path, split="val", transform=transforms.Compose([
              transforms.Resize(256),
              transforms.CenterCrop(224),
              transforms.ToTensor(),
              normalize,
          ]))

    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)

    return data_loader, data_loader_test

## Pre-Trained EfficientNet-B7

In [5]:
train_batch_size = 30
eval_batch_size = 50

data_loader, data_loader_test = prepare_data_loaders(data_path)
criterion = nn.CrossEntropyLoss().to(device)
float_model = load_model(saved_model_dir + efficient_float_model_file).to(device)

# Next, we'll "fuse modules"; this can both make the model faster by saving on memory access
# while also improving numerical accuracy. While this can be used with any model, this is
# especially common with quantized models.

print('\n Inverted Residual Block: Before fusion \n\n', float_model.features[1][0].block)
float_model.eval()

# Fuses modules
float_model.fuse_model()

# Note fusion of Conv+BN; SiLU is left as a separate module
print('\n Inverted Residual Block: After fusion\n\n', float_model.features[1][0].block)


 Inverted Residual Block: Before fusion 

 Sequential(
  (0): ConvBNAct(
    (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=64, bias=False)
    (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
    (2): DeQuantStub()
    (3): SiLU()
    (4): QuantStub()
  )
  (1): SqueezeExcitation(
    (avgpool): AdaptiveAvgPool2d(output_size=1)
    (fc1): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
    (fc2): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
    (quant1): QuantStub()
    (quant2): QuantStub()
    (quant3): QuantStub()
    (dequant): DeQuantStub()
    (activation): SiLU(inplace=True)
    (scale_activation): Sigmoid()
  )
  (2): ConvBN(
    (0): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  )
)

 Inverted Residual Block: After fusion

 Sequential(
  (0): ConvBNAct(
    (0): Conv2d(64, 64, kernel_size=(3, 3)
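
The claim that Conv+BN fusion does not change the numerics can be checked by hand on a single scalar weight. This is a minimal sketch with made-up values: `fold_bn` and `conv_then_bn` are hypothetical helpers, and only eps=0.001 is taken from the norm_layer configuration above.

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta into w', b'."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta

def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-3):
    """Unfused reference: a 1x1 'convolution' followed by eval-mode batch norm."""
    return gamma * (w * x + b - mean) / math.sqrt(var + eps) + beta

# Illustrative scalar values, not taken from the pretrained checkpoint.
params = dict(gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w, b = 0.8, 0.0
w_f, b_f = fold_bn(w, b, **params)

for x in (-2.0, 0.0, 3.5):
    unfused = conv_then_bn(x, w, b, **params)
    fused = w_f * x + b_f
    print(x, unfused, fused)
```

The fused layer carries a bias even when the original convolution has bias=False, because the batch-norm shift is absorbed into it.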

## Baseline Accuracy

In [6]:
%%time

num_eval_batches = 1000

print("Size of baseline model")
print_size_of_model(float_model)

top1, top5 = evaluate(float_model, criterion, data_loader_test, neval_batches=num_eval_batches, device=device)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
torch.jit.save(torch.jit.script(float_model), saved_model_dir + efficient_scripted_float_model_file)

Size of baseline model
Size (MB): 265.030073
...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

In [7]:
%%time

num_calibration_batches = 32

myModel = load_model(saved_model_dir + efficient_float_model_file).to('cpu')
myModel.eval()

# Fuse Conv and BN (SiLU stays unfused)
myModel.fuse_model()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
myModel.qconfig = torch.quantization.default_qconfig
print(myModel.qconfig)
torch.quantization.prepare(myModel, inplace=True)

# Calibrate first
print('Post Training Quantization Prepare: Inserting Observers')
print('\n Inverted Residual Block:After observer insertion \n\n', myModel.features[1][0].block)

# Calibrate with the training set
evaluate(myModel, criterion, data_loader, neval_batches=num_calibration_batches)
print('Post Training Quantization: Calibration done')

# Convert to quantized model
torch.quantization.convert(myModel, inplace=True)
print('Post Training Quantization: Convert done')
print('\n Inverted Residual Block: After fusion and quantization, note fused modules: \n\n', myModel.features[1][0].block)

print("Size of model after quantization")
print_size_of_model(myModel)

top1, top5 = evaluate(myModel, criterion, data_loader_test, neval_batches=num_eval_batches)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))

QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})


  reduce_range will be deprecated in a future release of PyTorch."


Post Training Quantization Prepare: Inserting Observers

 Inverted Residual Block:After observer insertion 

 Sequential(
  (0): ConvBNAct(
    (0): Conv2d(
      64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=64
      (activation_post_process): MinMaxObserver(min_val=inf, max_val=-inf)
    )
    (1): Identity()
    (2): DeQuantStub()
    (3): SiLU()
    (4): QuantStub(
      (activation_post_process): MinMaxObserver(min_val=inf, max_val=-inf)
    )
  )
  (1): SqueezeExcitation(
    (avgpool): AdaptiveAvgPool2d(output_size=1)
    (fc1): Conv2d(
      64, 16, kernel_size=(1, 1), stride=(1, 1)
      (activation_post_process): MinMaxObserver(min_val=inf, max_val=-inf)
    )
    (fc2): Conv2d(
      16, 64, kernel_size=(1, 1), stride=(1, 1)
      (activation_post_process): MinMaxObserver(min_val=inf, max_val=-inf)
    )
    (quant1): QuantStub(
      (activation_post_process): MinMaxObserver(min_val=inf, max_val=-inf)
    )
    (quant2): QuantStub(
      (activation_post

  "Returning default values."


Post Training Quantization: Convert done

 Inverted Residual Block: After fusion and quantization, note fused modules: 

 Sequential(
  (0): ConvBNAct(
    (0): QuantizedConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.6064728498458862, zero_point=71, padding=(1, 1), groups=64)
    (1): Identity()
    (2): DeQuantize()
    (3): SiLU()
    (4): Quantize(scale=tensor([0.2677]), zero_point=tensor([1]), dtype=torch.quint8)
  )
  (1): SqueezeExcitation(
    (avgpool): AdaptiveAvgPool2d(output_size=1)
    (fc1): QuantizedConv2d(64, 16, kernel_size=(1, 1), stride=(1, 1), scale=0.11322791874408722, zero_point=41)
    (fc2): QuantizedConv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), scale=0.07366853207349777, zero_point=51)
    (quant1): Quantize(scale=tensor([0.0793]), zero_point=tensor([4]), dtype=torch.quint8)
    (quant2): Quantize(scale=tensor([0.2489]), zero_point=tensor([1]), dtype=torch.quint8)
    (quant3): Quantize(scale=tensor([1.]), zero_point=tensor([0]), dtype=torch.quint
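
The scale and zero_point printed for each Quantize and QuantizedConv2d module define an affine mapping between floats and 8-bit integers. The following is a simplified pure-Python sketch of that mapping for the per-tensor case, under the assumption of plain min/max range estimation; the helper names are hypothetical and the real observers handle more edge cases.

```python
def choose_qparams(min_val, max_val, qmin=0, qmax=255):
    """Per-tensor affine parameters from an observed [min, max] range."""
    # The range is widened to include 0 so that 0.0 is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = max((max_val - min_val) / (qmax - qmin), 1e-12)
    zero_point = qmin - round(min_val / scale)
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to the nearest integer level, clamped to [qmin, qmax]."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    """Map an integer level back to its float value."""
    return (q - zero_point) * scale

scale, zp = choose_qparams(-1.0, 3.0)
print("scale:", scale, "zero_point:", zp)
for x in (-1.0, -0.3, 0.5, 2.9, 10.0):
    q = quantize(x, scale, zp)
    print(x, "->", q, "->", dequantize(q, scale, zp))
```

Values inside the observed range come back with an error of at most half a scale step, while 10.0 saturates at the top level. Passing qmax=127 roughly mimics the reduce_range=True setting shown in the printed QConfig.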

In [8]:
%%time

per_channel_quantized_model = load_model(saved_model_dir + efficient_float_model_file)
per_channel_quantized_model.eval()
per_channel_quantized_model.fuse_model()
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
print(per_channel_quantized_model.qconfig)

torch.quantization.prepare(per_channel_quantized_model, inplace=True)
evaluate(per_channel_quantized_model, criterion, data_loader, num_calibration_batches)
torch.quantization.convert(per_channel_quantized_model, inplace=True)
top1, top5 = evaluate(per_channel_quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
print('Evaluation accuracy on %d images, %2.2f'%(num_eval_batches * eval_batch_size, top1.avg))
torch.jit.save(torch.jit.script(per_channel_quantized_model), saved_model_dir + efficient_scripted_quantized_model_file)

QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
................................

  src_bin_begin // dst_bin_width, 0, self.dst_nbins - 1
  src_bin_end // dst_bin_width, 0, self.dst_nbins - 1
  Returning default scale and zero point "


........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
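
The 'fbgemm' QConfig printed above swaps the per-tensor weight observer for PerChannelMinMaxObserver. A rough way to see why this can help is to compare the rounding error when two output channels with very different weight magnitudes share one scale versus each getting its own. The sketch uses made-up weights and a simplified symmetric scheme (scale = max |w| / 127), which is an assumption rather than the exact formula PyTorch uses.

```python
def symmetric_scale(values, qmax=127):
    """Scale for symmetric int8 quantization of a list of weights."""
    return max(abs(v) for v in values) / qmax

def fake_quantize(values, scale, qmax=127):
    """Quantize to the int8 grid and immediately dequantize."""
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def max_error(values, scale):
    """Largest absolute rounding error over a list of weights."""
    return max(abs(v - q) for v, q in zip(values, fake_quantize(values, scale)))

# Two illustrative output channels with very different weight magnitudes.
channels = [[0.010, -0.020, 0.015], [1.50, -2.00, 0.75]]

per_tensor_scale = symmetric_scale([v for ch in channels for v in ch])
per_channel_scales = [symmetric_scale(ch) for ch in channels]

for ch, s in zip(channels, per_channel_scales):
    print("per-tensor error:", max_error(ch, per_tensor_scale),
          "per-channel error:", max_error(ch, s))
```

The large channel dictates the shared scale, so the small channel's weights collapse onto one or two integer levels; with its own scale the error on that channel drops by roughly two orders of magnitude.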

In [9]:
def train_one_epoch(model, criterion, optimizer, data_loader, device, ntrain_batches):
    model.train()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    avgloss = AverageMeter('Loss', ':1.5f')

    cnt = 0
    for image, target in data_loader:
        print('.', end='')  # progress marker, one dot per batch
        cnt += 1
        image, target = image.to(device), target.to(device)
        output = model(image)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        top1.update(acc1[0], image.size(0))
        top5.update(acc5[0], image.size(0))
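        # NOTE: `loss` is stored as a tensor, so the running average keeps its
        # grad_fn (visible in the printed output); loss.item() would store a float.
        avgloss.update(loss, image.size(0))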
        if cnt >= ntrain_batches:
            print('Loss', avgloss.avg)

            print('Training: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
                  .format(top1=top1, top5=top5))
            return

    print('Full imagenet train set:  * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
          .format(top1=top1, top5=top5))
    return

In [10]:
qat_model = load_model(saved_model_dir + efficient_float_model_file)
qat_model.fuse_model()

optimizer = torch.optim.SGD(qat_model.parameters(), lr = 0.0001)
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

In [11]:
torch.quantization.prepare_qat(qat_model, inplace=True)
print('Inverted Residual Block: After preparation for QAT, note fake-quantization modules \n', qat_model.features[1][0].block)

Inverted Residual Block: After preparation for QAT, note fake-quantization modules 
 Sequential(
  (0): ConvBNAct(
    (0): ConvBn2d(
      64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=64, bias=False
      (bn): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
      (weight_fake_quant): FusedMovingAvgObsFakeQuantize(
        fake_quant_enabled=tensor([1]), observer_enabled=tensor([1]), scale=tensor([1.]), zero_point=tensor([0], dtype=torch.int32), dtype=torch.qint8, quant_min=-128, quant_max=127, qscheme=torch.per_channel_symmetric, reduce_range=False
        (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
      )
      (activation_post_process): FusedMovingAvgObsFakeQuantize(
        fake_quant_enabled=tensor([1]), observer_enabled=tensor([1]), scale=tensor([1.]), zero_point=tensor([0], dtype=torch.int32), dtype=torch.quint8, quant_min=0, quant_max=127, qscheme=torch.per_tenso
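
The fake-quantization modules shown above simulate int8 rounding during training while keeping values in floating point. A minimal sketch of that quantize-then-dequantize step for a single value follows, assuming a per-tensor affine scheme with the `quant_min=0`, `quant_max=127` range from the printout. The function name `fake_quantize` is hypothetical, and the gradient handling of the real modules is not modelled.

```python
def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=127):
    """Round x onto the integer grid, clamp it, and map it back to a float."""
    q = round(x / scale) + zero_point
    q = min(quant_max, max(quant_min, q))
    return (q - zero_point) * scale


# Values inside the range snap to the nearest grid point;
# values outside it saturate at the ends of the range.
print(fake_quantize(0.26, scale=0.1, zero_point=0))
print(fake_quantize(100.0, scale=0.1, zero_point=0))
print(fake_quantize(-1.0, scale=0.1, zero_point=0))
```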

In [12]:
%%time

num_train_batches = 20

# QAT is slow and needs several epochs of training.
# Train for 8 epochs; each call prints the training loss and accuracy.
for nepoch in range(8):
    train_one_epoch(qat_model, criterion, optimizer, data_loader, torch.device('cpu'), num_train_batches)
    if nepoch > 3:
        # Freeze quantizer parameters
        qat_model.apply(torch.quantization.disable_observer)
    if nepoch > 2:
        # Freeze batch norm mean and variance estimates
        qat_model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)

# Check the accuracy after training
quantized_model = torch.quantization.convert(qat_model.eval(), inplace=False)
quantized_model.eval()
top1, top5 = evaluate(quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
print('After training: evaluation accuracy on %d images, %2.2f' % (num_eval_batches * eval_batch_size, top1.avg))

....................Loss tensor(4.0579, grad_fn=<DivBackward0>)
Training: * Acc@1 32.000 Acc@5 53.167
....................Loss tensor(2.8443, grad_fn=<DivBackward0>)
Training: * Acc@1 45.000 Acc@5 69.667
....................Loss tensor(2.5352, grad_fn=<DivBackward0>)
Training: * Acc@1 52.167 Acc@5 74.000
....................Loss tensor(2.6714, grad_fn=<DivBackward0>)
Training: * Acc@1 49.167 Acc@5 70.833
....................Loss tensor(1.8900, grad_fn=<DivBackward0>)
Training: * Acc@1 60.167 Acc@5 82.000
....................Loss tensor(1.6780, grad_fn=<DivBackward0>)
Training: * Acc@1 64.833 Acc@5 85.667
....................Loss tensor(1.7182, grad_fn=<DivBackward0>)
Training: * Acc@1 64.333 Acc@5 85.000
....................Loss tensor(1.6748, grad_fn=<DivBackward0>)
Training: * Acc@1 66.333 Acc@5 85.000
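
In the training loop, the freeze checks run after each epoch's training call, so a freeze only affects the epochs that follow it. The sketch below tabulates which epochs train with live observers and live batch norm statistics. It assumes a freeze stays in effect across the `model.train()` call at the start of each epoch; the helper name `qat_freeze_schedule` is made up for this illustration.

```python
def qat_freeze_schedule(num_epochs=8, observer_after=3, bn_after=2):
    """Return (epoch, observers_live, bn_stats_live) as seen while each epoch trains."""
    schedule = []
    observers_live, bn_stats_live = True, True
    for nepoch in range(num_epochs):
        # State in effect while this epoch trains.
        schedule.append((nepoch, observers_live, bn_stats_live))
        # Same conditions as the checks that follow train_one_epoch in the loop.
        if nepoch > observer_after:
            observers_live = False
        if nepoch > bn_after:
            bn_stats_live = False
    return schedule


for epoch, observers_live, bn_stats_live in qat_freeze_schedule():
    print(epoch, observers_live, bn_stats_live)
```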

In [13]:
%%time

def run_benchmark(model_file, img_loader):
    elapsed = 0
    model = torch.jit.load(model_file).to('cpu')
    model.eval()
    num_batches = 5
    # Run the scripted model on a few batches of images
    for i, (images, target) in enumerate(img_loader):
        if i < num_batches:
            start = time.time()
            output = model(images)
            end = time.time()
            elapsed = elapsed + (end-start)
        else:
            break
    num_images = images.size()[0] * num_batches

    # Report the average time per image, in milliseconds
    print('Elapsed time: %3.0f ms' % (elapsed/num_images*1000))
    # Return the total elapsed time in seconds
    return elapsed

run_benchmark(saved_model_dir + efficient_scripted_float_model_file, data_loader_test)

run_benchmark(saved_model_dir + efficient_scripted_quantized_model_file, data_loader_test)

Elapsed time:  78 ms
Elapsed time: 374 ms
CPU times: user 9min 5s, sys: 4min 10s, total: 13min 15s
Wall time: 2min 5s


93.48641204833984
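
The figures above come from timing five batches with `time.time()` and no warm-up pass. One possible variation, sketched below in plain Python, skips a warm-up batch and uses `time.perf_counter()`, which is intended for measuring short intervals. The helper `average_latency_ms` and its parameters are assumptions for illustration, not part of the notebook.

```python
import time


def average_latency_ms(fn, batches, num_batches=5, warmup=1):
    """Average per-item latency of fn in milliseconds, after untimed warm-up batches."""
    elapsed, num_items = 0.0, 0
    for i, batch in enumerate(batches):
        if i >= warmup + num_batches:
            break
        start = time.perf_counter()
        fn(batch)
        end = time.perf_counter()
        if i < warmup:
            continue  # warm-up call: run it, but do not count its time
        elapsed += end - start
        num_items += len(batch)
    return elapsed / num_items * 1000 if num_items else 0.0


# Stand-in workload: sum each batch of numbers.
demo_batches = [[1, 2, 3]] * 10
print('Average latency: %.6f ms per item' % average_latency_ms(sum, demo_batches))
```

With a scripted model, `fn` would be the model's forward call and each batch a tensor of images, whose `len()` is the batch size.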

# Reference
1. [PyTorch feature classification changes](https://pytorch.org/blog/pytorch-feature-classification-changes/)
2. [Quantization](https://pytorch.org/docs/stable/quantization.html)
3. [Training a Classifier](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#where-do-i-go-next)