# Quantization with PyTorch

Quantization means performing computation and memory access at lower numerical precision, usually int8 instead of floating point. This brings several performance gains: lower memory bandwidth, a smaller model, and faster inference, because int8 arithmetic is fast.

Quantization also comes at a cost: a quantized model is usually less accurate than its floating-point counterpart, so the goal is always to minimize the gap between the full floating-point accuracy and the quantized accuracy.


3 types of quantization:
1. **Dynamic Quantization**: the weights are converted to int8 ahead of time, and the activations are converted to int8 on the fly, just before the computation. This speeds up the computation, since matrix multiplications and convolutions run much faster on int8 than on floating point. Note that the results are stored in memory in floating-point format. Converting from floating point to integers requires multiplying by a scale factor, which is determined differently from one quantization approach to another.

    We use the `torch.quantization.quantize_dynamic` function, which takes several arguments:
    - the model to be quantized
    - the set of layer types to be quantized
    - the dtype that the weights will be quantized to
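To make the role of the scale factor concrete, here is a minimal pure-Python sketch of affine int8 quantization. The helper names (`choose_qparams`, `quantize`, `dequantize`) and the min/max rule for picking the scale are assumptions chosen for illustration; they are not PyTorch's API, and real backends may pick the scale differently.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point mapping the value range onto [qmin, qmax].

    Illustrative min/max rule only; not how any particular backend does it.
    """
    lo = min(min(values), 0.0)  # include 0 so that 0.0 maps exactly to an integer
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:            # all-zero input: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping to the int8 range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.55]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("scale:", scale, "zero_point:", zero_point)
print("int8 values:", q)
print("max round-trip error:", max_error)
```

The round-trip error of any value inside the covered range is at most half the scale, which is where the accuracy gap between the floating-point and the quantized model comes from.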
        
        
Next, we apply dynamic quantization to a simple LSTM model as a demo.


In [131]:
import torch as th
from torch import nn 
import os 
import copy

In [115]:
class DemoModel(nn.Module):
    """Two stacked LSTMs followed by a linear layer."""

    def __init__(self, input_dim, out_dim):
        super().__init__()
        self.lay1 = nn.LSTM(input_dim, out_dim)
        self.lay2 = nn.LSTM(out_dim, 32)
        self.lay3 = nn.Linear(32, 10)

    def forward(self, x):
        # x has shape (sequence_len, batch, input_dim)
        out, _ = self.lay1(x)
        out = nn.functional.relu(out)
        out, _ = self.lay2(out)
        out = nn.functional.relu(out)
        # Flatten to (sequence_len, batch * 32); this matches lay3 only when batch == 1
        out = out.view(out.size(0), -1)
        return self.lay3(out)


In [116]:
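model_dim, sequence_len = 64, 20

In [117]:
# The second argument is the hidden size of the first LSTM (here 20), not a sequence length
model = DemoModel(model_dim, sequence_len)

In [118]:
# One input sequence with shape (sequence_len, batch=1, model_dim)
inputs = th.randn(sequence_len, 1, model_dim)

In [120]:
# (h_0, c_0) for a single-layer LSTM with hidden size model_dim; not used below
hidden = (th.randn(1, 1, model_dim), th.randn(1, 1, model_dim))

In [122]:
# The model that is actually quantized: input size 64, first hidden size 64
lstm = DemoModel(model_dim, model_dim)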

In [128]:
# Quantize every nn.LSTM and nn.Linear layer of `lstm`, storing their weights as int8
quantized_lstm = th.quantization.quantize_dynamic(lstm, {nn.LSTM, nn.Linear}, dtype=th.qint8)

In [129]:
quantized_lstm

DemoModel(
  (lay1): DynamicQuantizedLSTM(64, 64)
  (lay2): DynamicQuantizedLSTM(64, 32)
  (lay3): DynamicQuantizedLinear(in_features=32, out_features=10, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)

In [130]:
lstm

DemoModel(
  (lay1): LSTM(64, 64)
  (lay2): LSTM(64, 32)
  (lay3): Linear(in_features=32, out_features=10, bias=True)
)
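To see the accuracy gap mentioned earlier in practice, the outputs of a floating-point model and its dynamically quantized copy can be compared on the same input. The sketch below is self-contained and uses a hypothetical stand-in model, `TinyLSTM`, rather than `DemoModel`; the random input and the untrained weights are assumptions for illustration, so the printed difference says nothing about a trained model.

```python
import torch as th
from torch import nn

th.manual_seed(0)


class TinyLSTM(nn.Module):
    """Hypothetical stand-in model: one LSTM followed by a linear layer."""

    def __init__(self, input_dim=64, hidden_dim=32, n_out=10):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, n_out)

    def forward(self, x):
        out, _ = self.lstm(x)    # out: (sequence_len, batch, hidden_dim)
        return self.fc(out[-1])  # use the last time step only


float_model = TinyLSTM().eval()
quant_model = th.quantization.quantize_dynamic(
    float_model, {nn.LSTM, nn.Linear}, dtype=th.qint8
)

x = th.randn(20, 1, 64)  # (sequence_len, batch, input_dim)
with th.no_grad():
    y_float = float_model(x)
    y_quant = quant_model(x)

max_abs_diff = (y_float - y_quant).abs().max().item()
print("max absolute difference between outputs:", max_abs_diff)
```

On a real task the comparison would be made on a validation metric such as accuracy rather than on raw outputs.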

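In [136]:
def print_size_of_model(model, label=""):
    """Save the model's state_dict to a temporary file and report its size in bytes."""
    th.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")  # os.path.getsize returns bytes, not KB
    print("model: ", label, ' \t', 'Size (bytes):', size)
    os.remove("temp.p")
    return size

In [137]:
model_size = print_size_of_model(lstm, "Floating LSTM")
quantized_model_size = print_size_of_model(quantized_lstm, "Quantized LSTM")

model:  Floating LSTM  	 Size (bytes): 187271

The quantized size recorded for this cell, 38693071 bytes, was measured on `quantized_model`, a variable that is not defined in the cells shown here. It therefore belongs to a different model, which explains why it is larger than the floating-point size. The size of `quantized_lstm` has to be obtained by rerunning the cell.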
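As a variation on the size check, the state dict can be serialized into an in-memory buffer, which avoids creating and deleting a temporary file. The helper name `state_dict_num_bytes` and the small two-layer network are assumptions for illustration; only the relative sizes matter here.

```python
import io

import torch as th
from torch import nn


def state_dict_num_bytes(model):
    """Serialize the state_dict into memory and return its size in bytes."""
    buffer = io.BytesIO()
    th.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Hypothetical small network; its float32 weights dominate the serialized size
float_net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
quant_net = th.quantization.quantize_dynamic(float_net, {nn.Linear}, dtype=th.qint8)

float_bytes = state_dict_num_bytes(float_net)
quant_bytes = state_dict_num_bytes(quant_net)
print("float32 model bytes:", float_bytes)
print("int8 model bytes:   ", quant_bytes)
print("ratio:", float_bytes / quant_bytes)
```

Since int8 weights take one byte instead of four, the ratio should approach 4 as the weight matrices grow; biases, scales and serialization overhead keep it below that.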

