<a href="https://colab.research.google.com/github/ahmeda335/QuantizationMethods/blob/main/bitsandbytesQuantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BitsAndBytes Quantization. 📌
- ➡️ BitsAndBytes is one of the most widely used quantization methods and supports both 8-bit and 4-bit quantization. Its 4-bit mode is the one used in [QLoRA](https://huggingface.co/papers/2305.14314).

###  📖 In this notebook we will quantize a model with BitsAndBytes in 4-bit and in 8-bit, then share it on our HuggingFace Hub 🤗.


---

## 1️⃣
### 🚀 Installing required libraries.

In [1]:
!pip -q install git+https://github.com/huggingface/transformers
!pip -q install bitsandbytes accelerate xformers einops


## 2️⃣
### 🚀 First Loading the model in 4-bit.
⚡ Quantizing to 4-bit precision usually reduces the memory used by the weights by roughly 4x compared with 16-bit precision.
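
To make the 4x figure concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different bit widths. The parameter count is an assumed round figure for a model of this size, and the helper name `weight_memory_gib` is hypothetical; activations and any layers left unquantized are ignored.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: a round parameter count for an OPT-350m-sized model.
num_params = 350_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(num_params, bits):.3f} GiB")
```

Going from 16 bits to 4 bits per weight divides this estimate by exactly four; a real model usually saves a little less because some layers stay in higher precision.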

🚩 Creating the BitsAndBytesConfig object.

⚠️⚠️ Note: If you are using Llama or any other gated model, you must first request access to the model on Hugging Face and then log in here with your Hugging Face account. The login code is in step 4️⃣ below.

In [19]:
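from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "facebook/opt-350m"
model_name = model_id.split('/')[-1]  # "opt-350m", reused later for the Hub repo name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)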
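
To give some intuition for what a 4-bit config does to the weights, below is a deliberately simplified, pure-Python sketch of block-wise absmax quantization. It is not the bitsandbytes implementation: it uses a uniform integer grid, whereas NF4 uses a non-uniform 16-value codebook, and double quantization would additionally compress the per-block scales. The helper names, sample weights, and block size are all made up for illustration.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical weights, split into blocks that each get their own scale.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
block_size = 4

restored = []
for start in range(0, len(weights), block_size):
    codes, scale = quantize_block(weights[start:start + block_size])
    restored.extend(dequantize_block(codes, scale))

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("restored:", [round(r, 3) for r in restored])
print("max abs error:", round(max_error, 4))
```

Each block stores small integer codes plus one scale, and the values are expanded back to a higher-precision dtype (here plain floats, in the config above `torch.bfloat16`) when they are needed for computation.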

🚩 Creating the model.

In [3]:
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 🚀 Loading the model in 8-bit.

In [5]:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=quantization_config,
)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


🚩 Creating the tokenizer

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_id)


## 3️⃣
### 🚀 Running inference to try the model.

In [21]:
text = "Hello How are you"
encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
model_inputs = encodeds.to('cuda')


- 4-bit Model

In [22]:
generated_ids = model_4bit.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Hello How are you?
Hello, thank you so much for this opportunity to contact me. I am very happy that my call will now come to my house as early as two in the afternoon.
I have been here for the whole of last 6 months now.
I am a small number who has no home or home-made product, however, I have bought various goods including clothes, shoes, handbags, watches, watches, watches.
Also, I am a home-made person!
It is wonderful for me to have a home-made product that has to meet my expectations. I am always amazed and pleased when someone comes to my house for something that I have bought myself!
Also, I have been very happy to have the opportunity to communicate with the customer so that I can improve my service in the future.
Here I am asking you how are you?

Hi My Name is Mary, I am very happy to speak with you.
I have been on my site


- 8-bit Model

In [23]:
generated_ids = model_8bit.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Hello How are you?
Gosh?
Sorry.
You look great, isn't that right?
Thank you, it really was great to see you.
By the way, I have an idea for you.
I'm gonna go up and look at your room!
Wait, it's all the ladies!
I'm gonna go in there now, and see what's under the mattress.
Okay, great.
Just kidding, I'm right there!
You guys just let me go, okay?
Don't forget.
Hurry up or I won't leave ya alone, okay?
Don't you worry!
It's alright.
Come on, come on.
I gotta stop for breakfast.
Hi!
It's all right, it's all right.
I'll see you tomorrow.
Bye-bye.
Okay.
Bye-bye.
Hello?
Hey, I'm here.
Hello.
Hello are


🚩 Measuring the memory footprint of the quantized models.

In [24]:
def bytes_to_giga_bytes(num_bytes):
  # Divide by 1024 three times: bytes -> KiB -> MiB -> GiB.
  return num_bytes / 1024 / 1024 / 1024

- 4-bit model.

In [25]:
print(bytes_to_giga_bytes(model_4bit.get_memory_footprint()), "GB")

0.19356155395507812 GB


- 8-bit model.

In [11]:
print(bytes_to_giga_bytes(model_8bit.get_memory_footprint()), "GB")

0.3346748352050781 GB
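
As a quick check on the two numbers printed above, this small sketch computes how much larger the 8-bit footprint is than the 4-bit one. The values are copied from the cell outputs; the variable names are only illustrative.

```python
footprint_4bit_gb = 0.19356155395507812  # printed above for model_4bit
footprint_8bit_gb = 0.3346748352050781   # printed above for model_8bit

ratio = footprint_8bit_gb / footprint_4bit_gb
print(f"8-bit / 4-bit footprint: {ratio:.2f}x")
```

The ratio comes out below 2x, most likely because not every parameter is quantized: layers such as the embeddings and layer norms are typically kept in higher precision in both models, so they add the same fixed cost to each footprint.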


## 4️⃣
### 🚀 Logging in to your huggingface account.
🚩 Get your access token from Hugging Face and enter it below.

In [12]:
from huggingface_hub import login, HfApi


login("ENTER_YOUR_TOKEN_HERE")    # 🚩🚩 WRITE YOUR TOKEN HERE.

# Create an instance of the HfApi class
api = HfApi()

# Get user information
user_info = api.whoami()

# Print user information to verify
print("\nYour account:", user_info['name'])


Your account: ahmeda335
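
If you would rather not paste the token into the notebook, one possible alternative is to read it from an environment variable. This is a minimal sketch that assumes you have exported the token yourself as `HF_TOKEN`; the helper names `get_hf_token` and `login_from_env` are hypothetical.

```python
import os

def get_hf_token(env=os.environ) -> str:
    """Return the Hub token from the HF_TOKEN variable, or raise if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable first.")
    return token

def login_from_env() -> None:
    """Log in without writing the token into the notebook."""
    # Imported lazily so get_hf_token() can be used without huggingface_hub installed.
    from huggingface_hub import login
    login(token=get_hf_token())
```

Calling `login_from_env()` then takes the place of the `login("ENTER_YOUR_TOKEN_HERE")` line, and the token never appears in a shared or committed notebook.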


## 5️⃣
### 🚀 Sharing the model to your 🤗 hub.

- 4-bit model.

In [26]:
model_4bit.push_to_hub(f"{model_name}-BitsAndBytes-4bit")
tokenizer.push_to_hub(f"{model_name}-BitsAndBytes-4bit")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/ahmeda335/opt-350m-BitsAndBytes-4bit/commit/48d4ec4c259f4ba6565ea302ad1c3a6a41d14bae', commit_message='Upload tokenizer', commit_description='', oid='48d4ec4c259f4ba6565ea302ad1c3a6a41d14bae', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ahmeda335/opt-350m-BitsAndBytes-4bit', endpoint='https://huggingface.co', repo_type='model', repo_id='ahmeda335/opt-350m-BitsAndBytes-4bit'), pr_revision=None, pr_num=None)

- 8-bit model.

In [27]:
model_8bit.push_to_hub(f"{model_name}-BitsAndBytes-8bit")
tokenizer.push_to_hub(f"{model_name}-BitsAndBytes-8bit")

CommitInfo(commit_url='https://huggingface.co/ahmeda335/opt-350m-BitsAndBytes-8bit/commit/c0793d7e39ca8bbeee5ad3a463edd622ecf03529', commit_message='Upload tokenizer', commit_description='', oid='c0793d7e39ca8bbeee5ad3a463edd622ecf03529', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ahmeda335/opt-350m-BitsAndBytes-8bit', endpoint='https://huggingface.co', repo_type='model', repo_id='ahmeda335/opt-350m-BitsAndBytes-8bit'), pr_revision=None, pr_num=None)
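
Once the models are on the Hub, they can be loaded back like any other checkpoint. The sketch below assumes a recent `transformers` with `bitsandbytes` installed and a CUDA GPU available; the helper names `hub_repo_id` and `load_quantized` are hypothetical, and the repo id simply follows the naming pattern used in the `push_to_hub` calls above.

```python
def hub_repo_id(username: str, model_id: str, bits: int) -> str:
    """Build the repo id used above: <user>/<model>-BitsAndBytes-<bits>bit."""
    model_name = model_id.split("/")[-1]
    return f"{username}/{model_name}-BitsAndBytes-{bits}bit"

def load_quantized(repo_id: str):
    """Reload a pushed checkpoint; the quantization settings come from its saved config."""
    # Imported lazily so hub_repo_id() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return model, tokenizer

print(hub_repo_id("ahmeda335", "facebook/opt-350m", 4))
```

No `BitsAndBytesConfig` should be needed when reloading, since the quantization settings are expected to be stored in the pushed `config.json`.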