<a href="https://colab.research.google.com/github/apa017/hugging-face-learn/blob/main/06_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantization for Large Language Models

In this notebook we will load a model with quantization.

**Quantization** is a technique used to reduce the *memory footprint* of an LLM.

In practice, it can reduce the memory needed for a model's weights to as little as 1/8 of the original size (for example, when going from 32-bit floats to 4-bit weights).

This makes it practical to work with very large models at a much lower memory cost.
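The arithmetic behind these savings can be sketched as parameters times bytes per parameter. The snippet below is a rough illustration only: the helper name `weight_footprint_gb` and the round 7-billion parameter count are assumptions made for this example, and the estimate covers the weights alone, not activations or other runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(n_params: int, precision: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


n_params = 7_000_000_000  # assumed round figure for a "7B" model
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_footprint_gb(n_params, precision):.1f} GB")
```

Under these assumptions the 4-bit estimate is one eighth of the 32-bit one, which is where the 1/8 figure comes from.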

<u>A GPU is required for quantization.</u>

## Notebook Setup

In [1]:
!pip install transformers torch bitsandbytes accelerate

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.0-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.0-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 122.4/122.4 MB 6.4 MB/s eta 0:00:00
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.0


## Load Tokenizer and Model with quantization

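- Loading the falcon-7b model in 16-bit precision won't work here because the weights alone would require about 14 GB of GPU RAM (roughly 7 billion parameters × 2 bytes each), which is too heavy.

- We can try 8-bit precision instead, which needs about one byte per parameter.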
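To give an intuition for what storing weights in 8 bits means, here is a minimal pure-Python sketch of absmax quantization: every value is divided by a shared scale so that it fits in the integer range [-127, 127], and multiplied back when needed. This is a simplified illustration, not the scheme that `bitsandbytes` actually applies; the function names and the sample weights are made up for this example.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.01, 1.0]  # hypothetical sample weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print("quantized:", q)
print("restored: ", restored)
print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each integer fits in one byte instead of the two or four bytes of a float, at the price of a small rounding error of at most half a scale step per value.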

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import transformers
import torch

import time

In [6]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")



In [8]:
sTime = time.time()

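# instantiate an 8-bit quantization configuration
# (the keyword is `load_in_8bit`; a misspelled `load_in_8_bit` is ignored with an "Unused kwargs" warning)
config = BitsAndBytesConfig(
    load_in_8bit=True
)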

print('Loading model... \n')

# load the model in 8bit
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=config
)

eTime = time.time()
print('Model loading completed.\n')

exec_time = eTime - sTime

if exec_time < 60:
    print(f'Execution time: {exec_time:.1f} seconds')
else:
    exec_time = exec_time/60
    print(f'Execution time: {exec_time:.1f} minutes')

# memory footprint of the loaded model, in GB
gbs = model.get_memory_footprint() / 1e9
print(f'Memory footprint: {gbs:.2f} GB')

Unused kwargs: ['load_in_8_bit']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading model... 



config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

RuntimeError: No GPU found. A GPU is needed for quantization.
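The error above means the runtime had no GPU attached when the model was loaded. A small guard like the following sketch can catch this before starting a long download. `gpu_available` is a hypothetical helper written for this example; it relies only on `torch.cuda.is_available()` and returns `False` if PyTorch is not installed.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


if gpu_available():
    print("GPU detected: safe to load the quantized model.")
else:
    print("No GPU detected: select a GPU runtime before loading the model.")
```

In Colab, a GPU can usually be selected under Runtime > Change runtime type, after which the cells need to be re-run.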

In [9]:
# Print model details

n_params = model.num_parameters()
print(f"Parameters in this model: {n_params}")
print(f"Memory footprint (FP32): {n_params*4/1e9} GB.")
print(f"Memory footprint (FP16): {n_params*2/1e9} GB.")
print(f"Memory footprint (INT8): {n_params*1/1e9} GB. \t (this is the model as we loaded it)")

NameError: name 'model' is not defined