# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn together how to load a large model in 4-bit (`gpt-neo-x-20b`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


First, let's load the model we are going to use. The text of this notebook refers to GPT-NeoX-20B, which is around 40GB in half precision; note that the cell below actually sets `model_id` to `EleutherAI/pythia-12b`.
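As a rough sanity check on the 40GB figure, weight memory is approximately the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the helper name `estimate_weight_memory_gb` is hypothetical, the 20 billion parameter count is an assumption for a "20B" model, and activations, optimizer state, and quantization constants are ignored.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 20 billion parameters for a "20B" model.
NUM_PARAMS_20B = 20e9

half_precision_gb = estimate_weight_memory_gb(NUM_PARAMS_20B, 16)
four_bit_gb = estimate_weight_memory_gb(NUM_PARAMS_20B, 4)

print(f"16-bit weights: ~{half_precision_gb:.0f} GB")
print(f" 4-bit weights: ~{four_bit_gb:.0f} GB")
```

Under these assumptions, going from 16 bits to 4 bits per weight cuts the weight storage to a quarter, which is what makes a model of this size plausible on a single Colab GPU.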

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/pythia-12b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
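With `load_in_4bit=True`, the weights are stored in a 4-bit format and converted back to `bnb_4bit_compute_dtype` for computation. The toy sketch below illustrates the general idea with blockwise absmax quantization onto evenly spaced levels. It is a simplified stand-in, not the `nf4` data type and not the `bitsandbytes` implementation (NF4 uses a non-uniform set of levels); the function names are hypothetical.

```python
def quantize_block(values, levels=16):
    """Map a block of floats to small signed integer codes plus one scale."""
    # The scale is the largest magnitude in the block (absmax).
    absmax = max(abs(v) for v in values) or 1.0
    half = levels // 2 - 1  # 7 for 16 levels, so codes lie in [-7, 7]
    codes = [round(v / absmax * half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the block scale."""
    half = levels // 2 - 1
    return [c / half * absmax for c in codes]


block = [0.5, -1.0, 0.25, 0.0, 0.9, -0.3]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print("max reconstruction error:", max_error)
```

Each code fits in 4 bits, and only one scale per block has to be kept in higher precision. The reconstruction error is bounded by half a quantization step, which is why a smaller block (and hence a tighter scale) tends to give a better approximation.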



Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 1179648 || all params: 726755328 || trainable%: 0.1623170762636638
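The trainable count printed above can be reproduced by hand. LoRA adds two small matrices to each targeted linear layer, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below assumes a hidden size of 1536 and 24 layers, with `query_key_value` projecting to three times the hidden size; these dimensions are assumptions (they are not stated in this notebook) chosen because they match the printed number. The helper name is hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, rank: int, num_layers: int) -> int:
    """Parameters added by LoRA: A is (rank x in), B is (out x rank), per adapted layer."""
    return rank * (in_features + out_features) * num_layers


HIDDEN_SIZE = 1536   # assumption, not stated in the notebook
NUM_LAYERS = 24      # assumption, not stated in the notebook
RANK = 8             # r=8 from the LoraConfig above

trainable = lora_param_count(
    in_features=HIDDEN_SIZE,
    out_features=3 * HIDDEN_SIZE,  # fused query, key, and value projection
    rank=RANK,
    num_layers=NUM_LAYERS,
)

ALL_PARAMS_REPORTED = 726_755_328  # "all params" value printed by the notebook
percent = 100 * trainable / ALL_PARAMS_REPORTED

print(f"trainable params: {trainable} || trainable%: {percent}")
```

Because the count grows linearly with `r`, doubling the rank doubles the number of trainable parameters while the frozen base model stays the same size.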


In [None]:
tokenizer.model_max_length

1000000000000000019884624838656
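The number printed above is exactly `int(1e30)`. To our knowledge, `transformers` uses this value as a sentinel meaning that the tokenizer has no maximum length configured, so it should not be used as a real sequence limit. A minimal sketch of guarding against it follows; the helper name and the 2048 fallback (the `max_seq_length` used later in this notebook) are assumptions.

```python
# int(1e30) is the value printed by tokenizer.model_max_length above.
NO_LIMIT_SENTINEL = int(1e30)


def effective_max_length(model_max_length: int, fallback: int = 2048) -> int:
    """Return a usable sequence limit, replacing the 'no limit' sentinel with a fallback."""
    if model_max_length >= NO_LIMIT_SENTINEL:
        return fallback
    return model_max_length


print(NO_LIMIT_SENTINEL)
print(effective_max_length(NO_LIMIT_SENTINEL))
print(effective_max_length(512))
```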

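Let's load our dataset. The cells below use a custom `prepare` helper (imported from a `prepare_data` module on Google Drive) to build instruction/response pairs from Edmunds forum data, which we will fine-tune the model on.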

In [None]:
from pathlib import Path
import sys

sys.path.append('/content/drive/MyDrive/Colab Notebooks')

In [None]:
from prepare_data import prepare

In [None]:
# !cp "/content/drive/MyDrive/Colab Notebooks/answered_edmunds_data.json" /content/data/answered_edmunds_data.json

In [None]:
data = prepare(
    destination_path=Path("/content/data"),
    checkpoint_dir=model_id,
    test_split_fraction=0.3,  # hold out 30% of the samples for the test split
    seed=42,
    mask_inputs=False,  # as in alpaca-lora
    data_file_name="answered_edmunds_data.json",
    data_file_url="https://storage.googleapis.com/public_bkt/edmunds_forum.json",
    ignore_index=-1,
    max_seq_length=2048,
)

Loading data file...
Loading tokenizer...
train has 1,738 samples
test has 744 samples
Processing train split ...


  0%|          | 0/1738 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 1738/1738 [00:01<00:00, 925.34it/s]


Processing test split ...


100%|██████████| 744/744 [00:00<00:00, 1010.28it/s]
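The source of `prepare` is not shown in this notebook, but the split sizes in the log are consistent with a simple seeded shuffle followed by a fractional cut. The sketch below is a hypothetical reconstruction of that step only, using integer placeholders instead of real samples; the function name and the use of `random.Random` are assumptions.

```python
import random


def split_train_test(samples, test_split_fraction, seed):
    """Shuffle deterministically, then cut off the first fraction as the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    num_test = int(len(shuffled) * test_split_fraction)
    return shuffled[num_test:], shuffled[:num_test]


# 1,738 train + 744 test samples in the log above gives 2,482 samples in total.
samples = list(range(2482))
train, test = split_train_test(samples, test_split_fraction=0.3, seed=42)

print(f"train has {len(train):,} samples")
print(f"test has {len(test):,} samples")
```

Fixing the seed means the same samples land in the test split on every run, which keeps later evaluations comparable.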


In [None]:
data.keys()

dict_keys(['train', 'test'])

In [None]:
data['train'][0]

{'instruction': '\ntransmission \n\nMy 2008 Dakota Quad cab v6 automatic while backing up a small or large incline, empty or loaded fells like bouncing or how wheel hop feels like. Does this all the time. I was told the transmission or rear end could be causing this. ',
 'response': '\nSounds like some kind of mount failure or rear suspension failure. You definitely need to poke around under the rear axle and look for something that might be causing the rear diff to move around. ',
 'input_ids': [111757,
  632,
  660,
  54103,
  861,
  63808,
  267,
  6875,
  17,
  66828,
  267,
  12427,
  861,
  156788,
  209752,
  368,
  6875,
  6149,
  105311,
  182924,
  29,
  603,
  6815,
  13115,
  17692,
  10367,
  5923,
  129602,
  187933,
  6080,
  337,
  25,
  62605,
  7151,
  135567,
  2256,
  267,
  8460,
  791,
  10021,
  3178,
  989,
  15,
  18268,
  791,
  40019,
  319,
  5557,
  3269,
  2602,
  184720,
  791,
  4143,
  62083,
  31448,
  84043,
  3269,
  17,
  55297,
  1119,
  1728,
  36

Run the cell below to start training! For the sake of the demo, we run it for only a few steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

In [None]:
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1 | 3.9453 |
| 2 | 3.7295 |
| 3 | 3.7705 |
| 4 | 3.515 |
| 5 | 3.5775 |
| 6 | 3.2637 |
| 7 | 3.4651 |
| 8 | 3.1191 |
| 9 | 2.9999 |
| 10 | 3.1927 |


TrainOutput(global_step=100, training_loss=3.020668742656708, metrics={'train_runtime': 134.4249, 'train_samples_per_second': 2.976, 'train_steps_per_second': 0.744, 'total_flos': 361510811725824.0, 'train_loss': 3.020668742656708, 'epoch': 0.23})
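The metrics in `TrainOutput` follow from the training arguments. With gradient accumulation, each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples. The sketch below redoes that arithmetic with the values from this notebook, assuming a single GPU.

```python
# Values taken from the TrainingArguments and logs in this notebook.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
num_train_samples = 1738
train_runtime_seconds = 134.4249

# Assumption: training runs on a single device.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps

epochs = samples_seen / num_train_samples
samples_per_second = samples_seen / train_runtime_seconds
steps_per_second = max_steps / train_runtime_seconds

print(f"effective batch size: {samples_per_step}")
print(f"epoch: {epochs:.2f}")
print(f"train_samples_per_second: {samples_per_second:.3f}")
print(f"train_steps_per_second: {steps_per_second:.3f}")
```

The run therefore covers about a quarter of the training set, so the loss values above reflect a short demo rather than a converged model.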

In [None]:
HUGGING_FACE_USER_NAME = "sasuface"

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
model_name = "mechanic-bloom-1b1"

model.push_to_hub(f"{model_name}", use_auth_token=True)



adapter_model.bin:   0%|          | 0.00/4.74M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/sasuface/mechanic-bloom-1b1/commit/440a34c5c49b126a52c60a24db629fbe1b241584', commit_message='Upload model', commit_description='', oid='440a34c5c49b126a52c60a24db629fbe1b241584', pr_url=None, pr_revision=None, pr_num=None)

## Inference

In [None]:
HUGGING_FACE_USER_NAME = "sasuface"
model_name = "mechanic-bloom-1b1"

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = f"{HUGGING_FACE_USER_NAME}/{model_name}"
config = PeftConfig.from_pretrained(peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, quantization_config=bnb_config, device_map={"":0})

# Load the Lora model
qa_model = PeftModel.from_pretrained(model, peft_model_id)

Downloading (…)/adapter_config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]



Downloading adapter_model.bin:   0%|          | 0.00/4.74M [00:00<?, ?B/s]

In [25]:
from IPython.display import display, Markdown

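def make_inference(question):
    # Wrap the question in the instruction-style prompt template.
    text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{question}\n\n### Response:"
    )
    # Tokenize and move the inputs to GPU 0, where device_map={"":0} placed the model.
    batch = tokenizer(text, return_tensors='pt').to("cuda:0")

    with torch.cuda.amp.autocast():
        output_tokens = qa_model.generate(**batch, max_new_tokens=100)

    # Render the decoded text (prompt plus generated continuation) as Markdown.
    display(Markdown(tokenizer.decode(output_tokens[0], skip_special_tokens=True)))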

In [27]:
data['test'][60]

{'instruction': '\nFactory Installed fog lights \n\nIs the GMC SIERRA 1500 base model prewired for fog lights ? If not, does anyone know the P/N for the wiring harness required. ',
 'response': '\nHello! Is the vehicle wired for fog lights? Yes and no, what I mean is the main harness and fuse block is prewired for them, the harness from the fuse block to the front bumper is not present. However, there is a kit that contains the factory replacement switch, harness and lights. The dash indicator light is already pre-installed in the vehicle. I found a full tutorial which includes the kit number and very down to earth instructions. I am posting the web address below. Check it out, might interest you. Good luck.http://www.gm-trucks.com/forums/topic/147854-oem-fog-light-install-2012-sierra/ ',
 'input_ids': [111757,
  632,
  660,
  54103,
  861,
  63808,
  267,
  6875,
  17,
  66828,
  267,
  12427,
  861,
  156788,
  209752,
  368,
  6875,
  6149,
  105311,
  182924,
  29,
  603,
  17276,


In [28]:
question = """
\nFactory Installed fog lights \n\nIs the GMC SIERRA 1500 base model prewired for fog lights ? If not, does anyone know the P/N for the wiring harness required.
"""

make_inference(question)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:


Factory Installed fog lights 

Is the GMC SIERRA 1500 base model prewired for fog lights ? If not, does anyone know the P/N for the wiring harness required.


### Response:
Yes, the fog lights are factory installed. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring harness is the same as the fog lights. The wiring
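Note that the decoded output echoes the whole prompt before the generated answer. A minimal sketch of separating the two is shown below; it rebuilds the same prompt template as a standalone function and splits on the `### Response:` marker. The helper names `build_prompt` and `extract_response` are hypothetical and not part of any library.

```python
RESPONSE_MARKER = "### Response:"


def build_prompt(question: str) -> str:
    """Same instruction template as used in make_inference above."""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{question}\n\n{RESPONSE_MARKER}"
    )


def extract_response(generated_text: str) -> str:
    """Return only the text after the response marker, or the full text if it is missing."""
    _, marker, after = generated_text.partition(RESPONSE_MARKER)
    return after.strip() if marker else generated_text.strip()


# Stand-in for decoded model output: the prompt followed by a continuation.
decoded = build_prompt("Is the base model prewired for fog lights?") + "\nYes, partially."
print(extract_response(decoded))
```

Keeping the template in one function also helps make sure that training data and inference prompts use exactly the same format.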

In [None]:
data['test'][5]

{'instruction': '\nGo with Sonnex kit, if you keep running it that way you will need a complete overhaul. The trans. will overheat and burn up, the converter will turn blue, not good. I hope my 99 4L60E goes 150K before valve body repair. I would say your regular trans service delayed the wear, they usually go around 80K. ',
 'context': '\nFollow up:When I brought the car to dealership for brake service recently, I told that it looks as the ATF was over filled and asked to check for the problem. I also explained why it bother me, to not look too picky: 1. Bubbles in on dipstick,2. Sometimes, slight (1/10 sec or so) hesitation. I felt the hesitation only when pushing on accelerator after braking almost to full stop, 5 mph or so. And it happened only when it is rather hot outside, above 90F or so (possibly above 87-88F on longer trips). This is why I suspected over filled transmission fluid. I checked it, and it looked to be over filled way up. Would prefer, however, a professional mecha