# Model Quantization

In this notebook, we use __ONNX__ to quantize our model and then upload the quantized model to our Hugging Face repository before deploying it. Quantization reduces the precision of the model's weights from `float32` to a lower-precision type such as `float16` or `int8`. This shrinks the model's memory footprint and usually makes inference faster. Many quantization techniques exist, but here we use a simple method provided by the [Hugging Face Optimum](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/models) library.
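
To make the idea concrete, here is a minimal sketch of the arithmetic behind `float32` to `int8` quantization, using only the Python standard library. The weight values are made up for illustration, and the symmetric scaling shown here is one common scheme, not necessarily the exact one ONNX Runtime applies.

```python
from array import array

# Hypothetical weights, stored as float32 (4 bytes each).
weights = array("f", [0.42, -1.27, 0.05, 0.88])

# Symmetric int8 mapping: the largest magnitude is mapped to 127.
scale = max(abs(w) for w in weights) / 127
quantized = array("b", [round(w / scale) for w in weights])  # int8, 1 byte each

# Dequantize to see how much precision was lost.
restored = [q * scale for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = weights.itemsize * len(weights)
int8_bytes = quantized.itemsize * len(quantized)
print(f"{fp32_bytes} bytes -> {int8_bytes} bytes, max error {max_error:.4f}")
```

Each value now takes one byte instead of four, and the round-trip error is bounded by half the scale.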

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from pathlib import Path
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

## Log in to Hugging Face

To upload the model to Hugging Face, you need a Hugging Face access token with write permissions. To create one, follow these steps:

1. In a web browser, log in to your [Hugging Face](https://hf.co/) account.
2. Click on your profile picture and go to Settings.
3. Go to __Access Tokens__ and select __Create new Token__.
4. Select the __Write__ option under Token Type and provide a name for your token.
5. Select __Create token__ to generate the token. Copy the token and paste it into the following cell.

In [None]:
HF_TOKEN = "Paste-your-Token"
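
If you would rather not paste the token into the notebook itself, one alternative is to read it from an environment variable. The sketch below assumes you have exported the token under the name `HF_TOKEN` before starting Jupyter. The helper `load_hf_token` is our own illustration and not part of any Hugging Face library.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in the given environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# Example usage (uncomment once the variable is set):
# HF_TOKEN = load_hf_token()
```

This keeps the secret out of the saved notebook, which matters if the notebook is later shared or committed.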

## Choose the model to Quantize

From the Hugging Face model repository, choose a model that can be loaded with the `AutoModelForSequenceClassification` class and paste its repo id in the following cell.

In [None]:
# Model to use
BASE_PYTORCH_MODEL = "sileod/deberta-v3-base-tasksource-nli"

# Output directories for the exported ONNX model and the quantized model
ONNX_PATH = Path("onnx")
QUANTIZED_MODEL = Path("quantized")

## Select Quantization Configuration

Here we apply __Dynamic Quantization__ to the model: the weights are quantized to lower precision ahead of time, while the activations are quantized on the fly during inference. Since our base model is a PyTorch model, it must first be converted to ONNX format and then quantized.
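
The difference between the two halves of dynamic quantization can be sketched in plain Python. This is a toy illustration with made-up numbers and our own helper names (`symmetric_scale`, `quantize`, `dynamic_dot`), not the ONNX Runtime implementation: the weight scale is computed once, while the activation scale is recomputed from each input at inference time.

```python
QMAX = 127  # largest magnitude representable in symmetric int8


def symmetric_scale(values):
    """Scale that maps the largest magnitude in `values` to QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round each value to the nearest int8 step, clamped to [-QMAX, QMAX]."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


# Weights: quantized once, ahead of time.
weights = [0.42, -1.27, 0.05, 0.88]
w_scale = symmetric_scale(weights)
w_q = quantize(weights, w_scale)


def dynamic_dot(activations):
    """Dot product with activations quantized at call time."""
    a_scale = symmetric_scale(activations)  # depends on the actual input
    a_q = quantize(activations, a_scale)
    int_acc = sum(a * w for a, w in zip(a_q, w_q))  # integer arithmetic
    return int_acc * a_scale * w_scale  # rescale back to float


activations = [1.5, -0.3, 2.0, 0.7]
exact = sum(a * w for a, w in zip(activations, weights))
approx = dynamic_dot(activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because the activation range is measured per input, no calibration dataset is needed, which is what makes this approach simpler to apply than static quantization.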

In [None]:
# Export the PyTorch model to ONNX
onnx_model = ORTModelForSequenceClassification.from_pretrained(BASE_PYTORCH_MODEL, export=True, token=HF_TOKEN)

# Build a dynamic quantization config targeting AVX2 CPUs, then quantize and save the model
dynamic_quantization = AutoQuantizationConfig.avx2(is_static=False)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
quantizer.quantize(dynamic_quantization, save_dir=QUANTIZED_MODEL)
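
To check how much smaller the quantized model actually is, we can compare the on-disk size of the two output directories. The helper below is a small sketch using only `pathlib`. The name `dir_size_mb` is ours, and the commented usage assumes the unquantized export has first been written to `ONNX_PATH` with `save_pretrained`.

```python
from pathlib import Path


def dir_size_mb(directory):
    """Total size of all files under `directory`, in megabytes (10**6 bytes)."""
    files = (f for f in Path(directory).rglob("*") if f.is_file())
    return sum(f.stat().st_size for f in files) / 1e6


# Example usage (assumes the cells above have been run):
# onnx_model.save_pretrained(ONNX_PATH)  # write the unquantized ONNX export
# print(f"ONNX: {dir_size_mb(ONNX_PATH):.1f} MB")
# print(f"Quantized: {dir_size_mb(QUANTIZED_MODEL):.1f} MB")
```

Both directories also contain small configuration files, so the figures are approximate, but the model file dominates the total.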

## Push the model to the Hugging Face Hub

In [None]:
# Reload the quantized model from the local directory
quant_model = ORTModelForSequenceClassification.from_pretrained(QUANTIZED_MODEL)

In [None]:
# Upload the contents of the quantized model directory to the target Hub repository
quant_model.push_to_hub(
    QUANTIZED_MODEL,
    "pitangent-ds/deberta-v3-nli-onnx-quantized",
    token=HF_TOKEN
)