# Model Quantization

This notebook documents the process of quantizing a translation model to 4-bit precision and can serve as a template for quantizing other models.

## Load Tokenizer and Model With Quantization Configuration

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
import torch

model_id = "billingsmoore/tibetan-to-english-translation"

tokenizer = AutoTokenizer.from_pretrained(model_id)

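# 4-bit NF4 quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)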

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)

## Check That Model Has A Reduced Memory Footprint

The size of the original model can be found from the file sizes listed in its Hugging Face repo. The line of code below returns the memory footprint of the quantized model in bytes.

In [2]:
model.get_memory_footprint()

1123254272
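
The raw byte count is easier to compare with a file size once it is converted to larger units. The following is a minimal sketch of that conversion; `format_bytes` is a hypothetical helper written for this notebook, not part of `transformers`, and the input is the value shown above.

```python
def format_bytes(num_bytes: int) -> str:
    """Return a byte count in decimal gigabytes (GB) and binary gibibytes (GiB)."""
    gb = num_bytes / 1000**3   # decimal units (10^9 bytes)
    gib = num_bytes / 1024**3  # binary units (2^30 bytes)
    return f"{gb:.2f} GB ({gib:.2f} GiB)"


# Value returned by model.get_memory_footprint() in the cell above.
footprint = 1123254272
print(format_bytes(footprint))
# 1.12 GB (1.05 GiB)
```

The same helper can be applied to the file size of the original model, read off its repo, to estimate how much memory the quantization saves.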

## Push Model to Hub

The line of code below pushes the model to the Hugging Face Hub. You may first need to call notebook_login() from the huggingface_hub library and provide a Hugging Face access token with write permission.

Ensure that the quantized model has an appropriate model id. Ideally, it should be the name of the model being quantized with the degree of quantization appended; for example, 'tibetan-to-english-translation' quantized to 4-bit becomes 'tibetan-to-english-translation-4bit'.
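
When this notebook is reused as a template, the naming convention can be applied programmatically rather than typed by hand. This is only a sketch: `quantized_repo_id` is a hypothetical helper, and the `-<bits>bit` suffix is simply the convention described above.

```python
def quantized_repo_id(model_id: str, bits: int) -> str:
    """Append a '-<bits>bit' suffix to a Hub model id (hypothetical naming helper)."""
    suffix = f"-{bits}bit"
    # Avoid doubling the suffix if the id already carries it.
    return model_id if model_id.endswith(suffix) else model_id + suffix


source_id = "billingsmoore/tibetan-to-english-translation"
print(quantized_repo_id(source_id, 4))
# billingsmoore/tibetan-to-english-translation-4bit
```

The resulting string could then be passed to push_to_hub in place of a hard-coded id.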

In [3]:
model.push_to_hub('billingsmoore/tibetan-to-english-translation-4bit')

model.safetensors:   0%|          | 0.00/1.13G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/billingsmoore/tibetan-to-english-translation-4bit/commit/ff96bc443f0801bf8f00fcab8503fa9993f802e9', commit_message='Upload T5ForConditionalGeneration', commit_description='', oid='ff96bc443f0801bf8f00fcab8503fa9993f802e9', pr_url=None, pr_revision=None, pr_num=None)

## Fill Out Model Card

Once the model has been successfully pushed to the Hub, make sure to fill out the model card and provide the quantization configuration settings there.
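
One possible way to prepare the configuration section of the model card is to render the settings as a Markdown list and paste the result into the card. The sketch below assumes the settings are copied by hand from the configuration used in this notebook; `quantization_section` and the section heading are hypothetical choices, not a format required by the Hub.

```python
# Settings copied by hand from the BitsAndBytesConfig used in this notebook.
quantization_settings = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "torch.bfloat16",
}


def quantization_section(settings: dict) -> str:
    """Render the settings as a Markdown section for a model card (hypothetical layout)."""
    lines = ["## Quantization Configuration", ""]
    lines += [f"- `{name}`: `{value}`" for name, value in settings.items()]
    return "\n".join(lines)


print(quantization_section(quantization_settings))
```

Keeping the settings in a plain dictionary means the sketch runs without loading the model; the printed text still needs to be added to the model card manually.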