# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. The integration includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn together how to load a large model in 4-bit (`gpt-neo-x-20b`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work for integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


First, let's load the model we are going to use: GPT-neo-x-20B! Note that the model itself is around 40GB in half precision.
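
The 40GB figure follows from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: the helper name `weight_memory_gb` is hypothetical, and the 4-bit number ignores the small overhead of quantization constants and any layers kept in higher precision.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_20b = 20e9  # roughly 20 billion parameters

fp16_gb = weight_memory_gb(params_20b, 16)  # half precision: 2 bytes per weight
nf4_gb = weight_memory_gb(params_20b, 4)    # 4-bit: half a byte per weight

print(f"half precision: ~{fp16_gb:.0f} GB, 4-bit: ~{nf4_gb:.0f} GB")
```

Under these assumptions, 4-bit storage cuts the weight footprint to about a quarter of the half-precision size, which is what makes a single Colab GPU plausible.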

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "/content/drive/MyDrive/Mistral-7B-finetuned"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [2]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model2 = prepare_model_for_kbit_training(model)

In [3]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [4]:
from peft import LoraConfig, get_peft_model
# Candidate target modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj' and 'lm_head'.
config = LoraConfig(
    r=2,
    lora_alpha=8,
    target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
    #target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model_n = get_peft_model(model2, config)
print_trainable_parameters(model_n)

trainable params: 5242880 || all params: 3757314048 || trainable%: 0.13953797667753537
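
The two counts printed above can be reproduced by hand. The sketch below assumes the layer shapes of the public Mistral-7B configuration (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024, vocabulary 32000); these values are not stated in this notebook. It also assumes that 4-bit weights are stored packed, two per element, so quantized linear layers report half their true element count, while the embeddings, `lm_head`, and norm layers stay unquantized. The helper names are hypothetical.

```python
# Assumed Mistral-7B shapes (from the public model config, not from this notebook)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dim
NUM_LAYERS = 32
VOCAB = 32000

# (in_features, out_features) of each linear layer in one decoder block
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r, target_modules):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * sum(LINEAR_SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS

def reported_total(lora_params):
    """Element count as seen by named_parameters() under the packing assumption."""
    linear = NUM_LAYERS * sum(i * o for i, o in LINEAR_SHAPES.values())
    packed_linear = linear // 2               # two 4-bit values per stored element
    embeddings = 2 * VOCAB * HIDDEN           # embed_tokens + lm_head
    norms = (2 * NUM_LAYERS + 1) * HIDDEN     # two norms per block + final norm
    return packed_linear + embeddings + norms + lora_params

trainable = lora_param_count(r=2, target_modules=list(LINEAR_SHAPES))
total = reported_total(trainable)
print(trainable, total)  # 5242880 3757314048
```

Under these assumptions both numbers match the cell output, which also explains why "all params" is well below 7 billion: the packed 4-bit layers are counted at half their logical size.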


Let's load our dataset from disk, split it into train and test sets (70/30), and tokenize the `zeroprompt` column to fine-tune our model on it.

In [5]:
from datasets import load_dataset, load_from_disk
data = load_from_disk("austin_test_V3.hf")
data = data.train_test_split(test_size=0.30,seed=3)
data.save_to_disk("austin_test_split4.hf")
#data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['zeroprompt']), batched=True)


Saving the dataset (0/1 shards):   0%|          | 0/224 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/97 [00:00<?, ? examples/s]

In [None]:
print(data["train"]["zeroprompt"][0])

Extract the following information from Real Context and answer the following questions without explanation: 
Real Context: ### remodel of existing residence to renovate electrical hvac plumbing fixtures and building materials throughout as needed install a knee wall per plan on top of structure convert old carport into conditioned space  addition of new attached carport master bedroom suite and decking to rear of existing res ### 
### Question 1: Is a water heater mentioned in the context? (A) No (B) Yes 
### Question 2: Number of water heaters mentioned? 
### Question 3: If applicable, what type of water heater is there? (A) Not applicable (B) Electric (C) Gas (D) Tankless (E) Other 
### Question 4: If applicable, is the water heater new? (A) No or Not applicable (B) Yes 
Here are example contexts and answers for the above questions. Don't mention the examples in your answer. 
 ### Default Answer Format: Question 1: (A) Question 2: (0) Question 3: (A) Question 4: (A) 
 ### Answers: Qu

Run the cell below to start the training! For the sake of the demo, we ran it for only a few steps to showcase how to use this integration with existing tools in the HF ecosystem.

In [6]:
from transformers import TrainingArguments, Trainer

bs=8        # batch size
ga_steps=1  # gradient acc. steps
epochs=10
steps_per_epoch=len(data["train"])//(bs*ga_steps)
tokenizer.pad_token = tokenizer.eos_token

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,  # eval and save once per epoch
    save_steps=steps_per_epoch,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_8bit",
    learning_rate=0.0002,
    group_by_length=True,
    fp16=True,
    ddp_find_unused_parameters=False,    # needed for training with accelerate
)
# weight_decay=0.01,
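
As a quick sanity check on the schedule above, the sketch below works out when evaluation and checkpointing happen. It assumes the train split has 224 rows (the size this notebook shows for the 70% split); the variable names `train_rows`, `total_steps`, and `eval_points` are hypothetical.

```python
train_rows = 224  # assumed size of data["train"]
bs = 8            # batch size
ga_steps = 1      # gradient accumulation steps
epochs = 10

# Same formula as in the training arguments cell
steps_per_epoch = train_rows // (bs * ga_steps)
total_steps = steps_per_epoch * epochs

# With eval_steps == save_steps == steps_per_epoch, eval and save happen here
eval_points = [steps_per_epoch * e for e in range(1, epochs + 1)]

print(steps_per_epoch, total_steps, eval_points[:3])  # 28 280 [28, 56, 84]
```

This is why checkpoint directories and logged evaluation steps appear at multiples of 28, such as `checkpoint-56` and `checkpoint-84`.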

In [7]:
from transformers import DataCollatorForLanguageModeling
trainer = Trainer(
    model=model_n,
    tokenizer=tokenizer,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=args,
)

#print(trainer.evaluate())
for batch in trainer.get_train_dataloader():
    break
output = model_n(**batch)
print(output)
#model_n.config.use_cache = False
#trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


OutOfMemoryError: ignored

In [11]:
for batch in trainer.get_train_dataloader():
    break
output = model_n(**batch)
print(output)

OutOfMemoryError: ignored

In [None]:
!zip -r /content/finetune_model.zip /content/out/checkpoint-56

  adding: content/out/checkpoint-56/ (stored 0%)
  adding: content/out/checkpoint-56/training_args.bin (deflated 51%)
  adding: content/out/checkpoint-56/tokenizer.json (deflated 74%)
  adding: content/out/checkpoint-56/special_tokens_map.json (deflated 73%)
  adding: content/out/checkpoint-56/adapter_model.safetensors (deflated 7%)
  adding: content/out/checkpoint-56/trainer_state.json (deflated 86%)
  adding: content/out/checkpoint-56/optimizer.pt (deflated 14%)
  adding: content/out/checkpoint-56/README.md (deflated 66%)
  adding: content/out/checkpoint-56/scheduler.pt (deflated 58%)
  adding: content/out/checkpoint-56/tokenizer_config.json (deflated 68%)
  adding: content/out/checkpoint-56/adapter_config.json (deflated 53%)
  adding: content/out/checkpoint-56/rng_state.pth (deflated 25%)


In [None]:
!unzip out.zip

In [None]:
from google.colab import files
files.download("/content/out.zip")


In [None]:
from datasets import load_dataset, load_from_disk
data_test= load_from_disk("austin_test_noans.hf")
data_test = data_test.train_test_split(test_size=0.30)
#data = load_dataset("Abirate/english_quotes")
print(data_test)

DatasetDict({
    train: Dataset({
        features: ['zeroprompt'],
        num_rows: 224
    })
    test: Dataset({
        features: ['zeroprompt'],
        num_rows: 97
    })
})


In [None]:
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

adapter_path="/content/content/out/checkpoint-84"     # input: adapters
save_to="models/Mistral-7B-finetuned"    # out: merged model ready for inference

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"":0})

# Set the pad token (same setting we used before training)
tokenizer.pad_token = tokenizer.eos_token

# Load LoRA adapter and merge
model_n = PeftModel.from_pretrained(base_model, adapter_path)
model_n = model_n.merge_and_unload()

# Save the merged model (model_n), not the 4-bit model used for training
model_n.save_pretrained(save_to, safe_serialization=True, max_shard_size='4GB')
tokenizer.save_pretrained(save_to)


OutOfMemoryError: ignored
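
To make the merge step concrete, here is a minimal pure-Python sketch of the arithmetic that merging a LoRA adapter performs on one weight matrix, assuming the standard LoRA formulation `W' = W + (lora_alpha / r) * B @ A`. It is an illustration on tiny toy matrices, not PEFT's actual implementation; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, r, lora_alpha):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy example: out_features=2, in_features=3, rank r=1
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # shape (r, in)
lora_b = [[0.5], [-0.5]]     # shape (out, r)

merged = merge_lora(weight, lora_a, lora_b, r=1, lora_alpha=4)
print(merged)  # [[3.0, 4.0, 6.0], [-2.0, -3.0, -6.0]]
```

After merging, the adapter matrices are no longer needed at inference time, since their effect lives in the merged weight. Note that the merge above loads the base model without quantization, so it needs enough memory for the full-precision weights.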

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model_n,  # merged model produced by merge_and_unload()
    tokenizer=tokenizer,
    max_new_tokens=2000,
    do_sample=True,
    return_full_text=False,
    temperature=0.1,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)



ValueError: ignored

In [None]:
import tqdm
from transformers.pipelines.pt_utils import KeyDataset
out_list = []
tokenizer.pad_token = tokenizer.eos_token
for out in tqdm.tqdm(pipe(KeyDataset(data_test["test"],"zeroprompt"), batch_size=1)):
    print(out)
    out_list.append(out)

  0%|          | 0/97 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 0/97 [00:00<?, ?it/s]


ValueError: ignored

In [None]:
Pre-training eval loss: 'eval_loss': 2.193373918533325

Columns for each run below: Step	Training Loss	Validation Loss

r = 2, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=4
28	0.181700	0.639950
56	0.163900	0.572055
84	0.102400	0.546450
112	0.094300	0.563440
140	0.054300	0.600865
168	0.064500	0.593742
196	0.037700	0.629935
224	0.039600	0.638050
252	0.041200	0.704165
280	0.040100	0.717373
308	0.030100	0.716790
336	0.032000	0.724699
364	0.032100	0.717925
392	0.029000	0.699248
420	0.028700	0.731645
448	0.028400	0.744703
476	0.032100	0.732234

r = 2, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=8
FINAL EVAL CURVE:
28	0.247200	0.580774
56	0.095100	0.530683
84	0.082200	0.530804
112	0.078800	0.569477
140	0.056400	0.615520
168	0.045200	0.661848
196	0.031500	0.645254
224	0.035300	0.740536
252	0.032900	0.714035
280	0.032900	0.705182

28	0.165300	0.601855
56	0.166700	0.543604
84	0.088400	0.537959
112	0.073400	0.567229
140	0.047500	0.584766
168	0.056400	0.597366
196	0.036600	0.634623
224	0.032100	0.650536
252	0.033200	0.694642
280	0.043800	0.663180
308	0.036100	0.687505
336	0.027900	0.700623
364	0.042800	0.703762
392	0.032100	0.730413
420	0.027900	0.733915
448	0.029200	0.698045
476	0.027100	0.710010

28	0.158800	0.638980
56	0.105400	0.565447
84	0.098300	0.554097
112	0.056600	0.564566
140	0.045600	0.605752

r = 4, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=8
28	0.135000	0.588756
56	0.142700	0.574478
84	0.083800	0.568620
112	0.066900	0.603376
140	0.061300	0.645342
168	0.062500	0.638937
196	0.050800	0.69985

r = 6, target_modules=['q_proj', 'v_proj'], dropout=0.05, lora_alpha=32
28	0.111900	0.593255
56	0.107900	0.621893
84	0.077800	0.593587
112	0.059400	0.654455
140	0.043700	0.671944
168	0.046100	0.698711
196	0.036500	0.687140
224	0.036400	0.720200
252	0.039500	0.746870
280	0.032200	0.736400
308	0.032700	0.759901
336	0.029300	0.765335

r = 6, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.10, lora_alpha=32
28	0.135000	0.588756
56	0.142700	0.574478
84	0.083800	0.568620
112	0.066900	0.603376
140	0.061300	0.645342
168	0.062500	0.638937
196	0.050800	0.69985

r = 6, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=64
28	0.135500	0.650298
56	0.093600	0.625502
84	0.076600	0.642395
112	0.060000	0.659177
140	0.063100	0.681377


r = 8, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=16
28	0.179200	0.666828
56	0.096300	0.657251
84	0.097800	0.687327
112	0.058200	0.747739
140	0.048400	0.796404
168	0.040800	0.805384
196	0.038900	0.862464
224	0.033400	0.891010
252	0.033900	0.903784

r = 8, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=32, weightdecay=0.02
28	0.153000	0.593947
56	0.145000	0.560116
84	0.074600	0.563585
112	0.058300	0.578612
140	0.047100	0.661875
168	0.052300	0.655614
196	0.038000	0.662583

r = 8, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.05, lora_alpha=32, weightdecay=0.04
Step	Training Loss	Validation Loss
28	0.192200	0.632166
56	0.086100	0.625963
84	0.066700	0.654899
112	0.051900	0.752082
140	0.038800	0.729567

r = 8, target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'], dropout=0.10, lora_alpha=32, weightdecay=0.02
28	0.166500	0.623683
56	0.111300	0.605962
84	0.092200	0.629120
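
One way to read these curves is to pick the checkpoint with the lowest validation loss, since training loss keeps falling while validation loss turns upward after the first few epochs. The sketch below does this for the run labeled "FINAL EVAL CURVE" (values copied from the notes above); choosing by minimum validation loss is an assumption about the selection criterion, not something the notes state.

```python
# (step, training loss, validation loss) for the run labeled FINAL EVAL CURVE
final_eval_curve = [
    (28, 0.247200, 0.580774),
    (56, 0.095100, 0.530683),
    (84, 0.082200, 0.530804),
    (112, 0.078800, 0.569477),
    (140, 0.056400, 0.615520),
    (168, 0.045200, 0.661848),
    (196, 0.031500, 0.645254),
    (224, 0.035300, 0.740536),
    (252, 0.032900, 0.714035),
    (280, 0.032900, 0.705182),
]

# Select the row whose validation loss (third column) is smallest
best_step, best_train_loss, best_val_loss = min(final_eval_curve, key=lambda row: row[2])
print(f"best checkpoint: step {best_step} (validation loss {best_val_loss})")
```

For this run the minimum falls at step 56, with step 84 almost tied, which is consistent with the `checkpoint-56` and `checkpoint-84` directories used elsewhere in this notebook.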

In [None]:
# Finetune Eval results:
'Water Heater': [0.6666666666666666,
   0.9888888888888889,
   0.2553191489361702,
   0.4058521755652669],
  'Water Heater_feat': [0.7393561786085151,
   0.9693877551020408,
   0.16967509025270758,
   0.2888004915753048],