# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn how to load a large model in 4-bit (`gpt-neo-x-20b`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


In [2]:
from huggingface_hub import login
login()


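First, let's load the model we are going to use. The original walkthrough targets GPT-NeoX-20B, which is around 40GB in half precision; the cell below loads the checkpoint `arshiaafshani/Arsh-LLM` with the same 4-bit configuration.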
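The 40GB figure follows from parameter count times bytes per parameter. Here is a quick back-of-the-envelope check, assuming roughly 20 billion parameters and counting weights only (quantization constants, activations and optimizer state are ignored); the helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 20e9  # assumption: about 20 billion parameters

fp16_gb = weight_memory_gb(num_params, 16)  # half precision, 2 bytes per weight
int4_gb = weight_memory_gb(num_params, 4)   # 4-bit storage, half a byte per weight

print(f"half precision: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

Under these assumptions, 4-bit storage needs about a quarter of the half-precision footprint, which is what makes a single Colab GPU plausible for models of this size.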

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "arshiaafshani/Arsh-LLM"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
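To build intuition for what `load_in_4bit=True` implies, here is a minimal sketch of blockwise 4-bit quantization in plain Python. It is a simplification under stated assumptions: it uses 16 evenly spaced levels rather than the NF4 codebook selected by `bnb_4bit_quant_type="nf4"`, and the function names are invented for this example, not part of `bitsandbytes`.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes in [0, levels - 1] plus one absmax scale."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    step = 2.0 / (levels - 1)                    # spacing of levels in [-1, 1]
    codes = [round((v / absmax + 1.0) / step) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the block scale."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]

block = [0.5, -1.0, 0.25, 1.0]           # toy weights, one quantization block
codes, scale = quantize_block(block)      # each code fits in 4 bits
restored = dequantize_block(codes, scale)

print(codes)
print(restored)
```

Each weight is stored as a 4-bit code, and the block shares one floating-point scale. Double quantization (`bnb_4bit_use_double_quant=True`) goes one step further and quantizes those per-block scales as well.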

Next, we need to preprocess the model to prepare it for training, using the `prepare_model_for_kbit_training` function from PEFT.

In [8]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [9]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [14]:
from peft import LoraConfig, get_peft_model

# LoRA settings
config = LoraConfig(
    r=512,  # rank of the LoRA matrices
    lora_alpha=1024,  # scaling factor
    target_modules=[
        "q_proj",  # query projections
        "k_proj",  # key projections
        "v_proj",  # value projections
        "out_proj",  # output projections
    ],
    lora_dropout=0.05,  # dropout rate for the LoRA layers
    bias="none",  # do not train bias terms
    task_type="CAUSAL_LM"  # task type
)

# Apply LoRA to the model
model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 335544320 || all params: 1728560640 || trainable%: 19.41177603118396
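The trainable-parameter count is driven mostly by the rank `r`. As a rough sketch, LoRA adds two matrices of shapes `(r, d_in)` and `(d_out, r)` per adapted linear layer, and the update is scaled by `lora_alpha / r`. The layer size below (2048 x 2048) is a hypothetical value chosen for illustration, not the actual dimension of the model loaded above.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r

d_in = d_out = 2048  # hypothetical layer size, for illustration only

for rank in (8, 64, 512):
    extra = lora_extra_params(d_in, d_out, rank)
    share = 100 * extra / (d_in * d_out)
    print(f"r={rank}: {extra} extra params ({share:.1f}% of the full layer)")

print("scaling:", lora_scaling(1024, 512))
```

With a rank as large as 512 the adapter is no longer small relative to the layer it wraps, which is consistent with the high trainable percentage printed above. Lower ranks trade capacity for memory.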


Let's load an instruction dataset in Alpaca format and turn each instruction/output pair into a single training text.

In [15]:
from datasets import load_dataset

# Load the instruction dataset (Alpaca format with "instruction" and "output" columns)
data = load_dataset("Na0s/sft-ready-Text-Generation-Augmented-Data-Alpaca-Format")

# padding="max_length" needs a pad token; fall back to EOS if the tokenizer defines none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Preprocess the dataset
def preprocess_function(samples):
    # Combine each instruction and response into a single string
    instructions = samples["instruction"]
    responses = samples["output"]
    texts = [f"Instruction: {inst}\nResponse: {resp}" for inst, resp in zip(instructions, responses)]

    # Tokenize the text
    return tokenizer(texts, truncation=True, padding="max_length", max_length=1024)

# Apply the preprocessing to the dataset
data = data.map(preprocess_function, batched=True)

# Show one example from the dataset
print(data["train"][0])

[Output truncated: download progress bars omitted. The dataset resolves to 22 parquet files, and the train split contains 7,667,416 examples.]

KeyboardInterrupt: 
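To see what the preprocessing step produces without downloading anything, here is a small stand-alone sketch of the same prompt template on a toy batch, together with a plain-Python version of truncating and padding to a fixed length. The helper names, the toy token ids and the right-side padding are assumptions for illustration; the real tokenizer handles this internally.

```python
def build_prompts(samples):
    """Join instruction/output columns using the notebook's prompt template."""
    return [
        f"Instruction: {inst}\nResponse: {resp}"
        for inst, resp in zip(samples["instruction"], samples["output"])
    ]

def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut sequences that are too long and right-pad those that are too short."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

toy_batch = {
    "instruction": ["Say hi.", "Add 2 and 3."],
    "output": ["Hi!", "5"],
}
prompts = build_prompts(toy_batch)
print(prompts[0])

# Toy token ids standing in for real tokenizer output
print(pad_or_truncate([11, 12, 13], max_length=5, pad_id=0))
print(pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=5, pad_id=0))
```

Padding every example to 1024 tokens keeps batches rectangular but spends compute on pad tokens. On a dataset with millions of rows, mapping over a small subset first is a cheaper way to validate the pipeline.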

In [None]:
import os

import transformers

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=3,
        gradient_accumulation_steps=6,  # 3 * 6 = 18 sequences per optimizer step
        warmup_steps=500,  # note: larger than max_steps, so training stops during warmup
        max_steps=150,
        learning_rate=4e-5,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"  # paged 8-bit AdamW from bitsandbytes
    ),
    # mlm=False selects causal language modeling (labels are copied from the inputs)
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  attn_scores = torch.where(causal_mask, attn_scores, mask_value)


Step,Training Loss
1,2.3735
2,3.2832
3,2.2905
4,2.8347
5,2.6355
6,2.1852
7,2.2609
8,1.5063
9,2.4706
10,2.4982


TrainOutput(global_step=10, training_loss=2.4338608503341677, metrics={'train_runtime': 166.0171, 'train_samples_per_second': 0.241, 'train_steps_per_second': 0.06, 'total_flos': 99255709532160.0, 'train_loss': 2.4338608503341677, 'epoch': 0.02})
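Two of the training arguments interact in ways that are easy to miss. The sketch below works through the arithmetic for the values in the training cell, assuming a single GPU and the linear warmup that `Trainer` uses by default. Only the warmup phase is modeled, and the function names are hypothetical.

```python
def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Sequences contributing to one optimizer step."""
    return per_device * accumulation_steps * num_devices

def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during linear warmup; held at base_lr afterwards (decay not modeled)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

batch = effective_batch_size(per_device=3, accumulation_steps=6)
lr_at_end = warmup_lr(step=150, base_lr=4e-5, warmup_steps=500)

print("effective batch size:", batch)
print("learning rate at step 150:", lr_at_end)
```

Because `warmup_steps` (500) is larger than `max_steps` (150), the run would end while the learning rate is still ramping up, at roughly 30% of the configured `learning_rate`. Lowering `warmup_steps` below `max_steps` is the usual remedy if the full rate is intended.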

# **Inference**

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
model_id = "arshiaafshani/Arsh-LLM"  # or the path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Put the model in evaluation mode
model.eval()

# Define a prompt in the same format used for training
prompt = "Instruction: Hello how are you?\nResponse:"

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response with the model
with torch.no_grad():  # disable gradient tracking for speed
    outputs = model.generate(
        **inputs,  # pass input_ids together with the attention mask
        max_length=200,  # maximum total length (prompt + generated tokens)
        num_return_sequences=1,  # number of sequences to return
        temperature=0.7,  # controls randomness (lower values = more conservative responses)
        top_k=50,  # restrict sampling to the K most likely tokens
        top_p=0.9,  # nucleus sampling: restrict to a cumulative probability mass
        do_sample=True,  # enable sampling for more varied responses
    )

# Decode the output tokens to text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Show the response
print("Generated Response:")
print(response)
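The sampling arguments `temperature`, `top_k` and `top_p` each reshape the next-token distribution before a token is drawn. The following is a simplified stand-alone sketch of that filtering on a toy set of logits. It illustrates the idea only and is not the implementation used by `generate`; the function name `filter_probs` is made up for this example.

```python
import math

def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (lower = sharper distribution)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens and renormalize
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

kept = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(kept)
```

A token is then sampled from the surviving entries. Tightening any of the three settings narrows the candidate set and makes the output more deterministic.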