# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. The integration includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, with the goal of democratizing LLM inference and training.

In this notebook, we will learn together how to load a large GPT-NeoX-style model in 4-bit (the code below uses `EleutherAI/polyglot-ko-5.8b`) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [None]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers@de9255de27abfcae4a1f816b904915f0b1e23cd9
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q datasets


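First, let's load the model we are going to use: `EleutherAI/polyglot-ko-5.8b`, a model with the GPT-NeoX architecture. Note that its checkpoint takes about 12 GB in the Hugging Face cache (see the `du` output later in this notebook).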

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/polyglot-ko-5.8b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"":0},
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    )


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so...


  warn("The installed version of bitsandbytes was compiled without GPU support. "


Loading checkpoint shards:   0%|          | 0/13 [00:00<?, ?it/s]
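
As a rough sanity check on the memory savings from 4-bit loading, weight storage scales with the parameter count times the bits per parameter. The sketch below is back-of-the-envelope arithmetic only: `weight_memory_gb` is a hypothetical helper, the parameter counts are nominal (20B and 5.8B), and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
for name, num_params in [("20B model", 20e9), ("5.8B model", 5.8e9)]:
    half = weight_memory_gb(num_params, 16)  # fp16 / bf16 weights
    four_bit = weight_memory_gb(num_params, 4)  # 4-bit quantized weights
    print(f"{name}: ~{half:.1f} GB in 16-bit, ~{four_bit:.1f} GB in 4-bit")
```

Under these assumptions a 4-bit copy of the weights needs about a quarter of the half-precision footprint, which is what makes a single Colab GPU workable.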

Next, we have to apply some preprocessing to the model to prepare it for training. To do so, use the `prepare_model_for_kbit_training` function from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
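
To get a feel for what `print_trainable_parameters` should report for the LoRA part, the count can be estimated by hand. The sketch below assumes the `query_key_value` shapes shown in the model printout further down (4096 inputs, 12288 outputs, 28 layers) and the `r=8`, `lora_alpha=32` values from the config above; `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r), no bias."""
    return r * in_features + out_features * r


R = 8             # LoRA rank from the config above
LORA_ALPHA = 32   # LoRA alpha from the config above
HIDDEN = 4096     # in_features of query_key_value (assumed from the printout)
QKV_OUT = 12288   # out_features of query_key_value (assumed from the printout)
NUM_LAYERS = 28   # number of GPTNeoXLayer blocks (assumed from the printout)

per_layer = lora_param_count(HIDDEN, QKV_OUT, R)
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params in total:  {total:,}")
print(f"LoRA scaling factor:   {scaling}")
```

Only these adapter weights receive gradients; the quantized base weights stay frozen.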

Load the SFT dataset

In [None]:
import json
from google.colab import drive
drive.mount('/content/drive')

In [None]:
with open('/content/drive/MyDrive/gpt/sft-smp-1000.json', 'r') as f:
    sft = json.load(f)

In [None]:
sft[0]

{'instruction': '지식 검색이 최초로 어디서 시작되었으며, 왜 시작되었을까요?',
 'input': '',
 'output': '지식 검색이 최초로 시작된 곳은 유즈넷과 뉴스그룹과 같은 곳이었습니다. 그러나 지식 검색과 관련된 서비스를 최초로 시작한 곳은 미국의 about.com과 askme.com 등이 있었습니다. 이들은 질문-답변을 기본으로 한 서비스를 제공했으며, 국내에서는 PC통신의 질문답변 코너와 xpert.co.kr, 아이디어 클릭 등이 있었습니다. 그러나 2000년 10월 디비딕이라는 지식 검색 서비스가 한국에서 최초로 오픈하였습니다. 이후 여러 지식 포털들이 대거 등장하였으나 대부분은 문을 닫았습니다. \n\n지식 검색이 시작된 이유는 사용자들이 서로 정보를 공유하고, 문제를 해결하기 위해 필요성을 느꼈기 때문입니다. 지식은 인류 공동의 재산이며, 이를 나누어야 한다는 것이 디비딕 창시자의 생각이었습니다. \n\n 추가 답변:\n디비딕은 사용자들간의 지식 공유 형태로 질문 답변이 1회성으로 끝나는 것이 아니라 체계적인 데이터베이스로 구축되어 검색과 연계되었습니다. 2000년 당시 이라는 것은 국내에서 지식 검색이라는 개념이 많은 사람들에게 알려지지 않았기 때문에 디비딕에서는 "묻고답하는 지식 커뮤니티"라는 서비스 명칭을 사용하였습니다. \n\n지금까지 여러 지식 검색 서비스들이 출시되었지만, 이는 실용성을 위해 지속적으로 발전하고 있으며, 그 중에서도 사용자들이 가장 많이 이용하는 대표적인 지식 검색 서비스로는 지식인이 있습니다.'}
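
Each record has `instruction`, `input`, and `output` fields, and the preprocessing code later in this notebook turns the first two into a prompt and the third into the training target. A minimal sketch of that step follows; the record is made up for illustration, `build_source` is a hypothetical helper, and `"<eos>"` is a placeholder for the tokenizer's real end-of-sequence string.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_source(example: dict) -> str:
    """Pick the template based on whether the record has a non-empty input."""
    if example.get("input", "") != "":
        return PROMPT_WITH_INPUT.format_map(example)
    return PROMPT_NO_INPUT.format_map(example)


# Made-up record with the same keys as sft[0].
record = {"instruction": "Summarize the text.", "input": "", "output": "A short summary."}
source = build_source(record)
target = record["output"] + "<eos>"  # placeholder EOS marker

print(source + target)
```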

In [None]:
import copy
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence
import transformers
from torch.utils.data import Dataset
import random
import torch
import json

### for tokenizer
random.seed(777)
IGNORE_INDEX = -100

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task.\n\n"
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def _tokenize_fn(
    strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer
) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item()
        for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )


def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [
        _tokenize_fn(strings, tokenizer) for strings in (examples, sources)
    ]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)


class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):
        super(SupervisedDataset, self).__init__()
        print("Loading data...")
        with open(data_path, 'r') as f:
            list_data_dict = json.load(f)
        random.shuffle(list_data_dict)  # shuffle data

        print("Formatting inputs...")
        prompt_input, prompt_no_input = (
            PROMPT_DICT["prompt_input"],
            PROMPT_DICT["prompt_no_input"],
        )
        sources = [
            prompt_input.format_map(example)
            if example.get("input", "") != ""
            else prompt_no_input.format_map(example)
            for example in list_data_dict
        ]
        targets = [
            f"{example['output']}{tokenizer.eos_token}" for example in list_data_dict
        ]

        print("Tokenizing inputs... This may take some time...")
        data_dict = preprocess(sources, targets, tokenizer)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple(
            [instance[key] for instance in instances] for key in ("input_ids", "labels")
        )
        # print("[input_ids]:", tokenizer.decode(input_ids))
        # print("[labels]:", labels)
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=IGNORE_INDEX
        )
        ret = dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
        # print(ret)
        return ret


def make_supervised_data_module(
    tokenizer: transformers.PreTrainedTokenizer, data_path
) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = SupervisedDataset(
        tokenizer=tokenizer, data_path=data_path
    )
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(
        train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator
    )
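
The two key ideas in the cell above are masking the prompt tokens in the labels with `IGNORE_INDEX` and padding each batch to a common length. The sketch below reproduces both with plain Python lists instead of tensors; the token ids are made up, and `mask_prompt` and `pad_batch` are hypothetical helpers, not part of the notebook's code.

```python
IGNORE_INDEX = -100  # label value that the loss skips
PAD_ID = 2           # pad id; this notebook reuses the EOS id for padding


def mask_prompt(input_ids, source_len):
    """Copy input_ids into labels, hiding the first source_len (prompt) tokens."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * source_len
    return labels


def pad_batch(sequences, padding_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


# (token ids, prompt length) pairs; the ids are invented for illustration.
examples = [([11, 12, 13, 21, 22, 2], 3), ([14, 15, 31, 2], 2)]

input_ids = [ids for ids, _ in examples]
labels = [mask_prompt(ids, n) for ids, n in examples]

batch_input_ids = pad_batch(input_ids, PAD_ID)
batch_labels = pad_batch(labels, IGNORE_INDEX)
attention_mask = [[tok != PAD_ID for tok in row] for row in batch_input_ids]

print(batch_input_ids)
print(batch_labels)
print(attention_mask)
```

Because the pad id and the EOS id are the same here, a mask built this way also hides each sequence's real end-of-sequence token, while its label is still kept.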



Run the cell below to start training! For the sake of the demo, we only ran it for a few steps to showcase how this integration works with existing tools in the HF ecosystem.

In [None]:
tokenizer.model_max_length = 1024
data_path = '/content/drive/MyDrive/gpt/sft-smp-1000.json'

data_module = make_supervised_data_module(tokenizer=tokenizer, data_path=data_path)

Loading data...
Formatting inputs...
Tokenizing inputs... This may take some time...


In [None]:
import transformers

trainer = transformers.Trainer(
    model=model,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        warmup_ratio=0.03,
        lr_scheduler_type='cosine',
        weight_decay=0,
        num_train_epochs=15,
#         num_train_epochs=5,
        save_strategy="epoch",
#         save_steps=1000,
        save_total_limit=5,
        report_to="none",
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    # data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    **data_module,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Step,Training Loss
1,1.8698
2,1.7846
3,1.7918
4,1.81
5,1.8214
6,1.714
7,1.8811
8,1.8718
9,1.7119
10,1.7381
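
The training arguments above combine a per-device batch of 4 with 16 gradient-accumulation steps, and a cosine schedule with a 3% warmup. The sketch below illustrates the shape of such a schedule (linear warmup, then cosine decay to zero) in plain Python. It is an approximation for intuition, not the library's code: `lr_at_step` is a hypothetical helper and `TOTAL_STEPS` is an assumed value, since the real step count depends on the dataset size.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


PER_DEVICE_BATCH = 4
GRAD_ACCUM = 16
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # examples per optimizer step on one GPU
print(f"effective batch size: {effective_batch}")

TOTAL_STEPS = 200  # assumed for illustration only
for step in (0, 3, 6, 100, 200):
    print(f"step {step:>3}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```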


In [None]:
# save lora
model.save_pretrained('lora')

# merge lora weight
# https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py#L409
# merged_lora_model = model.merge_and_unload()

# # save merged model
# merged_lora_model.save_pretrained("outputs")
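
For intuition about what merging a LoRA adapter means, the update can be written as `W + (lora_alpha / r) * B @ A`. The sketch below works through that formula on tiny hand-made matrices in plain Python. `matmul` and `merge_lora` are hypothetical helpers, and the example assumes an ordinary floating-point weight matrix; with a 4-bit base model the stored weights are quantized, so an actual merge involves extra steps not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_B @ lora_A)."""
    scaling = lora_alpha / r
    delta = matmul(lora_B, lora_A)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: out_features=2, in_features=3, r=1 (values invented for illustration).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [0.0]]

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```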


In [None]:
!du -h -d 1 ~/.cache/huggingface/hub

12G	/root/.cache/huggingface/hub/models--EleutherAI--polyglot-ko-5.8b
12G	/root/.cache/huggingface/hub


In [None]:
tokenizer.pad_token_id, tokenizer.eos_token_id

(2, 2)

## Inference test

In [None]:
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM
# from peft import PeftModel, PeftConfig, LoraModel

# DEVICE = 'cuda'
# peft_model_id = "outputs"
# config = PeftConfig.from_pretrained(peft_model_id)


# tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
# tokenizer.pad_token = tokenizer.eos_token  # needed for gpt-neo-x tokenizer

# model = AutoModelForCausalLM.from_pretrained(
#       config.base_model_name_or_path,
#       torch_dtype=torch.float16,
#       # torch_dtype='auto',
#       low_cpu_mem_usage=True,
#       pad_token_id=tokenizer.pad_token_id,
#       eos_token_id=tokenizer.eos_token_id
#     )

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig

DEVICE = 'cuda'
peft_model_id = "lora"
config = PeftConfig.from_pretrained(peft_model_id)


tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token  # needed for gpt-neo-x tokenizer

model = AutoModelForCausalLM.from_pretrained(
      config.base_model_name_or_path,
      torch_dtype=torch.float16,
      # torch_dtype='auto',
      low_cpu_mem_usage=True,
      pad_token_id=tokenizer.pad_token_id,
      eos_token_id=tokenizer.eos_token_id
    )
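# PeftModel.from_pretrained injects the LoRA layers into `model` in place,
# so the wrapper it returns is not kept here (the printout below shows the LoRA modules).
PeftModel.from_pretrained(model, peft_model_id)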

model = model.to(DEVICE)
model.eval()



Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading checkpoint shards:   0%|          | 0/13 [00:00<?, ?it/s]

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(30080, 4096)
    (layers): ModuleList(
      (0-27): 28 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear(
            in_features=4096, out_features=12288, bias=True
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.05, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=12288, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (dense): Linear(in_fea

In [None]:
from time import time


PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{user_input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task.\n\n"
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def gen(prompt, user_input=None, min_new_tokens=10, max_new_tokens=128, temperature=0.5):
    st = time()
    if user_input:
        x = PROMPT_DICT['prompt_input'].format(instruction=prompt, user_input=user_input)
    else:
        x = PROMPT_DICT['prompt_no_input'].format(instruction=prompt)

    input_ids = tokenizer.encode(x, return_tensors="pt").to(DEVICE)
    gen_tokens = model.generate(
        inputs = input_ids,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        temperature=temperature,
        no_repeat_ngram_size=6,
        do_sample=True,

    )
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)
    end = time()
    print(f"[Elapsed]: {end-st} sec")

    return x, gen_text.replace(x, '')


In [None]:
# The Korean prompt below means: "Explain general relativity and special relativity in detail."
for i in range(1):
    prompt, generated_output = gen('일반상대성이론과 특수상대성이론해 대해 자세히 설명해줘', max_new_tokens=256, temperature=0.8)
    if i == 0:
        print(prompt, '\n', generated_output)
    else:
        print(generated_output)
    print('='*80, '\n')

[Elapsed]: 26.473920583724976 sec
Below is an instruction that describes a task.

Write a response that appropriately completes the request.

### Instruction:
일반상대성이론과 특수상대성이론해 대해 자세히 설명해줘

### Response: 
 *특수상대성이론은 아인슈타인이 제창한 이론이다. 이 이론은 기본적으로 빛의 속도는 어떠한 물질적인 방해물에 의해서도 변하지 않는다고 가정한다. 또한 시간과 공간은 관측자의 운동 상태에 따라 다르게 측정될 수 있다고 가정한다. �ߘ 이 이론은 모든 물질이 광속에 접근하면 서로 매우 비슷해지고, 더이상 분해할 수 없는 궁극적인 입자 (만물의 궁극적인 구성 요소)를 찾을 수 있다고 가정한다          *아인슈타인은 일반상대성 이론을 완성하기 이전에, 그의 이론과 일반상대성이론을 구별하기 위해 “아인슈타인의 중력장과 양자장”이라고 이름을 붙였다. �이 이론은 관측자가 중력에서 벗어나 자유롭게 운동할 수 있는 우주선을 설명한다.  예를 들어, 만약 어떤 관측자가 중력에 의해 공간이 휘어진다고 가정해보자. 이러한 곡률을 중력장으로 가정하고, 중력이 없다고 가정하면, 공간은 평평할 것이다. �이때, 만약
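
The `temperature` argument passed to `gen` controls how sharply the sampling distribution is peaked: logits are divided by the temperature before the softmax, so lower values favor the most likely token. The sketch below shows the effect on a made-up logit vector in plain Python; `softmax_with_temperature` and `sample` are hypothetical helpers, not the library's sampling code.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
for temperature in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

rng = random.Random(777)
print("sampled index:", sample(logits, 0.8, rng))
```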

