# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

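In this notebook, we will learn together how to load a large model in 4-bit and train it using Google Colab and the PEFT library from Hugging Face 🤗. The original version of this notebook used `gpt-neo-x-20b`; the cells below load `arshiaafshani/Arsh-LLM` instead.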

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


In [2]:
from huggingface_hub import login
login()


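First, let's load the model we are going to use. The original notebook loaded GPT-NeoX-20B, which takes around 40 GB in half precision; the cell below loads `arshiaafshani/Arsh-LLM`, whose checkpoint is about 1.84 GB according to the download log.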
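
The 40 GB figure can be checked with simple arithmetic: 20 billion parameters at 2 bytes each. The sketch below is a rough back-of-the-envelope estimate, not a measurement. The helper name `weight_memory_gb` is made up for illustration, and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 20e9  # a 20B-parameter model, as in GPT-NeoX-20B

half_precision_gb = weight_memory_gb(n_params, 16)  # fp16 / bf16 weights
four_bit_gb = weight_memory_gb(n_params, 4)         # 4-bit quantized weights

print(f"half precision: {half_precision_gb:.0f} GB")
print(f"4-bit:          {four_bit_gb:.0f} GB")
```

Under these assumptions, 4-bit weights take roughly a quarter of the half-precision footprint.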

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "arshiaafshani/Arsh-LLM"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/556 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/470 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.83k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/1.84G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [4]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [5]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [6]:
from peft import LoraConfig, get_peft_model

# LoRA settings
config = LoraConfig(
    r=512,  # rank of the LoRA matrices
    lora_alpha=1024,  # scaling factor
    target_modules=[
        "q_proj",  # query projections
        "k_proj",  # key projections
        "v_proj",  # value projections
        "out_proj",  # output projections
    ],
    lora_dropout=0.05,  # dropout rate for LoRA
    bias="none",  # whether to train bias terms
    task_type="CAUSAL_LM"  # task type
)

# Apply LoRA to the model
model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 335544320 || all params: 1728560640 || trainable%: 19.41177603118396
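
The trainable-parameter count above can be reasoned about by hand: for a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA adapter adds `r * (d_in + d_out)` weights. The sketch below is illustrative only. The layer shapes are a hypothetical configuration chosen to reproduce the printed count, not shapes read from the actual model, and `lora_param_count` is an invented helper name.

```python
def lora_param_count(r, layer_shapes):
    """Number of LoRA weights for a list of (d_in, d_out) linear layers.

    Each adapted layer gets a matrix A of shape (r, d_in) and B of shape (d_out, r).
    """
    return sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)


# Hypothetical: 160 adapted projections, each 2048 x 2048 (an assumption for illustration).
hypothetical_shapes = [(2048, 2048)] * 160
trainable = lora_param_count(r=512, layer_shapes=hypothetical_shapes)

all_params = 1728560640  # "all params" as printed by print_trainable_parameters
print(trainable, 100 * trainable / all_params)
```

The adapter count scales linearly with `r`, so a smaller rank would reduce the trainable share accordingly. Also note that 4-bit layers are stored in packed form, so `all params` may undercount the size of the base model.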


Let's load an instruction dataset, `Dans-DiscountModels/Alpaca_Evol_Instruct_Cleaned`, to fine-tune our model on instruction/response pairs.

In [7]:
from datasets import load_dataset

# Load the dataset
data = load_dataset("Dans-DiscountModels/Alpaca_Evol_Instruct_Cleaned")


# Preprocess the dataset
def preprocess_function(samples):
    # Combine each instruction and its response into a single string
    instructions = samples["instruction"]
    responses = samples["output"]
    texts = [f"Instruction: {inst}\nResponse: {resp}" for inst, resp in zip(instructions, responses)]

    # Tokenize the text
    return tokenizer(texts, truncation=True, padding="max_length", max_length=1024)

# Apply the preprocessing to the dataset
data = data.map(preprocess_function, batched=True)

# Show one example from the dataset
print(data["train"][0])

README.md:   0%|          | 0.00/331 [00:00<?, ?B/s]

(…)leaned_scrubbed_deduped_urlsremoved.json:   0%|          | 0.00/99.3M [00:00<?, ?B/s]

wizard_allcleaned_scrubbed_deduped.json:   0%|          | 0.00/82.9M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/102832 [00:00<?, ? examples/s]

Map:   0%|          | 0/102832 [00:00<?, ? examples/s]

{'instruction': 'Can you provide a list of healthy habits to maintain a healthy lifestyle? Please format your response as an HTML page with bullet points.\n<html>\n  <body>\n    <h3>Healthy Habits:</h3>\n    <ul>\n      <li>Eating a balanced diet with plenty of fruits and vegetables.</li>\n      <li>Engaging in regular physical activity, such as walking, running, or cycling.</li>\n      <li>Getting enough sleep each night, ideally 7-8 hours.</li>\n      <li>Staying hydrated by drinking plenty of water throughout the day.</li>\n      <li>Limiting alcohol consumption and avoiding smoking.</li>\n      <li>Managing stress through relaxation techniques like meditation or yoga.</li>\n      <li>Regularly visiting a healthcare provider for check-ups and preventative care.</li>\n    </ul>\n  </body>\n</html>', 'input': '', 'output': "Here's an HTML page with bullet points for healthy habits:\n<html>\n  <body>\n    <h3>Healthy Habits:</h3>\n    <ul>\n      <li>Eating a balanced diet with plenty 
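
To see what the model is actually trained on, the formatting step inside `preprocess_function` can be run on its own, without a tokenizer. The sketch below uses a made-up two-row batch (the records are invented for illustration) in the column-oriented layout that `map(..., batched=True)` passes to the function. `format_batch` is a hypothetical helper name.

```python
def format_batch(samples):
    """Join each instruction/output pair into one training string."""
    return [
        f"Instruction: {inst}\nResponse: {resp}"
        for inst, resp in zip(samples["instruction"], samples["output"])
    ]


# Invented toy batch: one list per column, as in a batched map call.
toy_batch = {
    "instruction": ["Add 2 and 3.", "Name a primary color."],
    "output": ["5", "Red"],
}

texts = format_batch(toy_batch)
print(texts[0])
```

The inference prompt later in the notebook ends with `Response:`, matching this template, so the model is asked to continue from the point where the training examples place the answer.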

In [8]:
import transformers
import os
os.environ["WANDB_DISABLED"] = "true"

# reuse the EOS token for padding (the original notebook needed this for the GPT-NeoX tokenizer)
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=3,
        gradient_accumulation_steps=6,
        warmup_steps=500,  # longer than max_steps, so the learning rate is still warming up when training stops
        max_steps=100,
        learning_rate=5e-5,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  return fn(*args, **kwargs)


| Step | Training Loss |
|------|---------------|
| 1    | 13.1975       |
| 2    | 13.1537       |
| 3    | 13.0860       |
| 4    | 13.0937       |
| 5    | 13.2589       |
| 6    | 13.0219       |
| 7    | 12.9813       |
| 8    | 13.2259       |
| 9    | 13.0920       |
| 10   | 13.0397       |


TrainOutput(global_step=100, training_loss=13.107642889022827, metrics={'train_runtime': 3463.9514, 'train_samples_per_second': 0.52, 'train_steps_per_second': 0.029, 'total_flos': 3.1551356141568e+16, 'train_loss': 13.107642889022827, 'epoch': 0.01750393838613688})
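
The numbers in `TrainOutput` can be cross-checked from the training arguments. The sketch below is a back-of-the-envelope calculation, assuming a single GPU and the linear warmup schedule that `Trainer` uses by default. The variable names are illustrative.

```python
import math

per_device_batch = 3
grad_accum = 6
max_steps = 100
warmup_steps = 500
peak_lr = 5e-5
train_examples = 102832      # size of the train split in the dataset log
train_runtime_s = 3463.9514  # 'train_runtime' from TrainOutput

# Examples consumed per optimizer step and over the whole run.
effective_batch = per_device_batch * grad_accum
examples_seen = effective_batch * max_steps

# Optimizer steps in one epoch: batches per epoch, grouped by accumulation.
batches_per_epoch = math.ceil(train_examples / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum
epoch_fraction = max_steps / steps_per_epoch

# With linear warmup, the learning rate grows as peak_lr * step / warmup_steps
# until warmup ends; here training stops long before that point.
lr_at_end = peak_lr * min(1.0, max_steps / warmup_steps)

print(effective_batch, examples_seen)
print(round(epoch_fraction, 6))
print(round(examples_seen / train_runtime_s, 2))
print(lr_at_end)
```

Under these assumptions the run covers about 1.75% of one epoch, and the learning rate reaches only one fifth of `learning_rate`, which may help explain why the logged loss stays near 13.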

In [9]:
model.save_pretrained("./arsh-llm")
tokenizer.save_pretrained("./arsh-llm")


('./arsh-llm/tokenizer_config.json',
 './arsh-llm/special_tokens_map.json',
 './arsh-llm/vocab.json',
 './arsh-llm/merges.txt',
 './arsh-llm/added_tokens.json',
 './arsh-llm/tokenizer.json')

In [None]:
model = model.merge_and_unload()

# Directory to save the model to
save_directory = "./saved_full_model"

# Save the full model (base weights with the LoRA weights merged in)
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

print(f"Full model (with LoRA) saved to {save_directory}")
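
Conceptually, merging folds each adapter into its base weight as `W + (lora_alpha / r) * B @ A`, so the merged layer gives the same output without a separate adapter path. The sketch below illustrates that identity on tiny hand-written matrices in plain Python. It is not PEFT's implementation, and the helper names `matmul` and `merge_lora` are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A for W (d_out x d_in), A (r x d_in), B (d_out x r)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy example with d_in = d_out = 2 and rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[1.0], [0.0]]
merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# Check: the merged layer matches "base output + scaled adapter output".
x = [[1.0], [1.0]]  # a column vector
base_plus_adapter = [
    [wx[0] + 2.0 * bax[0]]
    for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))
]
print(merged)
print(matmul(merged, x) == base_plus_adapter)
```

With the settings used in this notebook, the scale `lora_alpha / r` is `1024 / 512 = 2`.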



# **Inference**

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Path of the saved model
save_directory = "./saved_full_model"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForCausalLM.from_pretrained(save_directory)

print("Full model (with LoRA) loaded successfully!")
# Put the model in evaluation mode
model.eval()

# Define a prompt
prompt = "Instruction: Hello how are you?\nResponse:"

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response with the model
with torch.no_grad():  # disable gradient computation for speed
    outputs = model.generate(
        inputs["input_ids"],
        max_length=200,  # maximum total length (prompt plus generated tokens)
        num_return_sequences=1,  # number of sequences to return
        temperature=0.7,  # controls randomness (lower values = more conservative responses)
        top_k=50,  # restrict sampling to the K most likely tokens
        top_p=0.9,  # restrict sampling by cumulative probability (nucleus sampling)
        do_sample=True,  # enable sampling for more varied responses
    )

# Decode the output tokens back into text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Show the response
print("Generated Response:")
print(response)
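
To make the sampling parameters above more concrete, the sketch below applies temperature, top-k, and top-p filtering to a toy list of scores in plain Python. It is a conceptual illustration rather than the implementation used by `generate`, and the function name and toy logits are invented for this example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [logit / temperature for logit in logits]

    # Top-k: keep the k highest-scoring tokens, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    top = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - top) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, p in zip(ranked, probs):
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(p for _, p in kept)
    return {i: p / kept_total for i, p in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # invented scores for a 4-token vocabulary
dist = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print(dist)
```

In this toy case, top-k drops the lowest-scoring token and top-p then drops the third, so sampling would choose between the two most likely tokens only.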