# HuggingFace Quanto

This notebook demonstrates the use of the HuggingFace Quanto library for **dynamic** quantization.

#### Note:
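Models quantized with Quanto through the Transformers `QuantoConfig` integration are not serializable, so they cannot be pushed to the Hub (see the last section of this notebook).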

https://huggingface.co/docs/transformers/main/en/quantization/quanto

https://huggingface.co/docs/transformers/main_classes/quantization


In [1]:
!pip install -q quanto 

## Helper functions
Taken from a blog post that explains linear quantization with HuggingFace Quanto.

https://blog.gopenai.com/linear-quantization-with-hugging-face-quanto-222a1d29721f

In [2]:
import torch
import warnings
warnings.filterwarnings("ignore")
def named_module_tensors(module, recurse=False):
    """
    Generator function that yields named tensors (parameters and buffers) from a PyTorch module.

    This function iterates over all the named parameters and buffers in a given module and yields
    the name of each tensor and its corresponding value. For parameters with `_data` or `_scale` 
    attributes, it yields those as separate entries.

    Args:
        module (torch.nn.Module): The PyTorch module from which to extract named tensors.
        recurse (bool, optional): If True, recurses through the module's submodules. Defaults to False.

    Yields:
        Tuple[str, torch.Tensor]: A tuple where the first element is the name of the tensor, and the 
                                  second element is the tensor itself.

    Example:
        >>> model = torch.nn.Linear(10, 5)
        >>> for name, tensor in named_module_tensors(model):
        ...     print(f"{name}: {tensor.shape}")
    """
    for name, val in module.named_parameters(recurse=recurse):
        # Quantized parameters expose their storage through `_data` and `_scale`
        if hasattr(val, "_data") or hasattr(val, "_scale"):
            if hasattr(val, "_data"):
                yield name + "._data", val._data
            if hasattr(val, "_scale"):
                yield name + "._scale", val._scale
        else:
            yield name, val

    # Buffers are yielded unchanged
    for named_buffer in module.named_buffers(recurse=recurse):
        yield named_buffer

In [3]:
def dtype_byte_size(dtype):
    """
    Returns the size (in bytes) occupied by one parameter of type `dtype`.
    """
    import re
    if dtype == torch.bool:
        return 1 / 8
    bit_search = re.search(r"[^\d](\d+)$", str(dtype))
    if bit_search is None:
        raise ValueError(f"`dtype` is not a valid dtype: {dtype}.")
    bit_size = int(bit_search.groups()[0])
    return bit_size // 8
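
To make the parsing step concrete, the sketch below applies the same regular expression to a few dtype names written as plain strings, so it runs without PyTorch. The helper name `bits_from_dtype_name` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

def bits_from_dtype_name(dtype_name):
    """Return the trailing bit width in a dtype name, or None if there is none."""
    match = re.search(r"[^\d](\d+)$", dtype_name)
    return int(match.group(1)) if match else None

# Hypothetical sample of dtype names, written as strings
for name in ["torch.float32", "torch.float16", "torch.int8", "torch.float8_e4m3fn"]:
    bits = bits_from_dtype_name(name)
    if bits is None:
        print(f"{name}: no trailing bit width, dtype_byte_size would raise ValueError")
    else:
        print(f"{name}: {bits} bits -> {bits // 8} byte(s) per element")
```

Names that do not end in digits, such as the float8 variant above, do not match the pattern, and any width below 8 bits would floor to 0 bytes under integer division.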

In [4]:
def compute_module_sizes(model):
    """
    Compute the size of each submodule of a given model.
    """
    from collections import defaultdict
    module_sizes = defaultdict(int)
    for name, tensor in named_module_tensors(model, recurse=True):
        size = tensor.numel() * dtype_byte_size(tensor.dtype)
        name_parts = name.split(".")
        # Add the size to every parent prefix; the empty prefix '' is the whole model
        for idx in range(len(name_parts) + 1):
            module_sizes[".".join(name_parts[:idx])] += size

    return module_sizes
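
The prefix accumulation can be seen in isolation with made-up tensor names and sizes. The names and byte counts below are hypothetical and chosen only to show how each tensor's size is added to every enclosing module, with the empty key `''` holding the total.

```python
from collections import defaultdict

# Hypothetical (name, size in bytes) pairs standing in for real model tensors
tensor_sizes = [
    ("decoder.layers.0.fc1.weight", 400),
    ("decoder.layers.0.fc1.bias", 40),
    ("decoder.embed_tokens.weight", 1000),
]

module_sizes = defaultdict(int)
for name, size in tensor_sizes:
    parts = name.split(".")
    # idx = 0 gives the empty prefix '', which accumulates the whole model
    for idx in range(len(parts) + 1):
        module_sizes[".".join(parts[:idx])] += size

for prefix in ["", "decoder", "decoder.layers.0.fc1", "decoder.layers.0.fc1.weight"]:
    print(repr(prefix), module_sizes[prefix])
```

This is why the notebook reads the total model size from `module_sizes['']`.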

In [5]:
def test_model_output(model, text):
    """
    Generates text from a language model for the provided input and returns it.

    This function tokenizes the input text with the global `tokenizer`, feeds it to the
    given model, generates up to 10 new tokens, and returns the decoded output.

    Args:
        model (transformers.PreTrainedModel): The language model used to generate the output.
        text (str): The input text to be tokenized and provided to the model.

    Returns:
        str: The decoded text, consisting of the prompt followed by the generated tokens.

    Example:
        >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
        >>> print(test_model_output(model, "Once upon a time"))
        Once upon a time there was a little girl who...
    """
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [6]:
def get_weights_dtype(model_eval, num_layers=3):
    """
    Print the name and dtype of the first `num_layers` parameters of the model.
    """
    for i, (name, param) in enumerate(model_eval.named_parameters()):
        if i == num_layers:
            break
        print(f"{i}. Parameter: {name}, Data Type: {param.dtype}")


## Original model output & size
Create an instance of the non-quantized model. 

1. Check the dtype used for parameters
2. Evaluate the model size
3. Invoke the model to check its output

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

# Doesn't work well with this
# model_id = "openai-community/gpt2"

# Works well with this
model_id = "facebook/opt-125m"

# Create the tokenizer and model
# clean_up_tokenization_spaces=True added to prevent future warnings
tokenizer = AutoTokenizer.from_pretrained(model_id, clean_up_tokenization_spaces=True)
model = AutoModelForCausalLM.from_pretrained(model_id)



In [8]:
# Print the data type used for model weights
# get_weights_dtype(model)

# Compute the in memory size of the model
module_sizes = compute_module_sizes(model) 
original_model_size = module_sizes['']
print(f"Model size :  {module_sizes[''] * 1e-9} GB")

# Invoke the model
text = "Once upon a time, there was a"
response = test_model_output(model, text)

# Print the input text & response
print("Test input : ", text)
print("Test response : ", response)

Model size :  0.500957184 GB
Test input :  Once upon a time, there was a
Test response :  Once upon a time, there was a time when the world was a different place.



## Quantized model

https://huggingface.co/docs/transformers/main_classes/quantization

In [9]:
# Setup configuration for quantization
# Supported values: ("float8", "int8", "int4", "int2")
quantization_config = QuantoConfig(weights="int8")

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,  # Added to suppress warning
)
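
As a rough illustration of what storing weights as `int8` involves, the sketch below applies a generic symmetric linear quantization to a short list of floats, using one scale for the whole list. This is a teaching example under stated assumptions, not Quanto's implementation; the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max abs error:", max_error)
```

Each value is stored in one byte plus a shared float scale, which mirrors the `_data` and `_scale` entries that `named_module_tensors` looks for when measuring the quantized model.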

In [10]:
# Print the data type used for model weights
# Parameters still report float32; the quantized storage is exposed through the
# `_data` and `_scale` attributes read by `named_module_tensors`
# get_weights_dtype(quantized_model)

# Compute the in memory size of the model
module_sizes = compute_module_sizes(quantized_model) 
quantized_model_size = module_sizes['']
print(f"Model size :  {module_sizes[''] * 1e-9} GB")

# Invoke the model
text = "Once upon a time, there was a"
response = test_model_output(quantized_model, text)

# Print the input text & response
print("Test input : ", text)
print("Test response : ", response)

Model size :  0.24648556800000002 GB
Test input :  Once upon a time, there was a
Test response :  Once upon a time, there was a time when the world was a better place.



In [11]:
# Memory saving
pct_saving = round((original_model_size - quantized_model_size)*100/original_model_size)
print(f"Percentage memory saving : {pct_saving}%")

Percentage memory saving : 51%
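
The reported saving can be checked by hand from the two sizes printed earlier. The sketch below repeats the arithmetic on those byte counts and, as a point of comparison, computes the saving that would result if every 4-byte float32 value were replaced by a 1-byte int8 value; that bound is an idealized assumption, not a measurement.

```python
original_bytes = 500_957_184   # from "Model size :  0.500957184 GB"
quantized_bytes = 246_485_568  # from "Model size :  0.24648556800000002 GB"

saving = (original_bytes - quantized_bytes) * 100 / original_bytes
print(f"Measured saving : {saving:.2f}% (rounds to {round(saving)}%)")

# Idealized bound: every float32 (4 bytes) stored as int8 (1 byte)
ideal_saving = (4 - 1) * 100 / 4
print(f"Idealized saving: {ideal_saving:.0f}%")
```

The measured figure is below the idealized 75%, which suggests that part of the model is still stored in float32 and that the scales add some overhead.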


## Quanto models are not serializable

Pushing the quantized model to the Hub fails with the following error:

```The model is quantized with QuantizationMethod.QUANTO and is not serializable```

In [12]:
from huggingface_hub import notebook_login

notebook_login()


In [13]:
quantized_model.push_to_hub("opt-125m-quanto-8bit")
tokenizer.push_to_hub("opt-125m-quanto-8bit")

ValueError: The model is quantized with QuantizationMethod.QUANTO and is not serializable - check out the warnings from the logger on the traceback to understand the reason why the quantized model is not serializable.