<a href="https://colab.research.google.com/github/donghuna/AI-Expert/blob/main/%EC%B5%9C%EC%A0%95%EC%9A%B1/llm_2024_Lab05_QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 5. QLoRA

This lab source is partly based on

- BitsandBytes Tutorial: https://colab.research.google.com/drive/1Vvju5kOyBsDr7RX_YAvp6ZsSOoSMjhKD?usp=sharing#scrollTo=E0Nl5mWL0k2T


Updated by geonho lee 2024.8.30

by minsoo kim 2023.11.15

## Part.1 LLM Finetuning with QLoRA

### Package pip install & import

In [1]:
print('Installing packages...')
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

Installing packages...

In [2]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [3]:
# Set evaluation
import tqdm
import datasets
from datasets import load_dataset
import torch  # needed below for torch.no_grad, torch.exp and torch.stack
import torch.nn as nn
def evaluate(model, tokenizer):
    testenc = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    testenc = tokenizer("\n\n".join(testenc['text']), return_tensors='pt')

    testenc = testenc.input_ids.to(model.device)
    nsamples = 40
    model = model.eval()

    nlls = []
    for i in tqdm.tqdm(range(nsamples), desc="evaluating..."):
        batch = testenc[:, (i * 2048):((i + 1) * 2048)].to(model.device)
        with torch.no_grad():
            lm_logits = model(batch).logits
        shift_logits = lm_logits[:, :-1, :].contiguous().float()
        shift_labels = testenc[:, (i * 2048):((i + 1) * 2048)][:, 1:]
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        neg_log_likelihood = loss.float() * 2048
        nlls.append(neg_log_likelihood)

    return torch.exp(torch.stack(nlls).sum() / (nsamples * 2048))
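
The `evaluate` function above reports perplexity: the exponential of the average per-token negative log-likelihood, accumulated over 40 windows of 2048 tokens. The dependency-free sketch below reproduces only that bookkeeping. The helper name `windowed_perplexity` and the toy loss values are illustrative assumptions, not part of the lab.

```python
import math

def windowed_perplexity(mean_losses, window_len):
    """Aggregate per-window mean cross-entropy losses into one perplexity.

    Mirrors the bookkeeping in `evaluate`: each window's mean loss is scaled
    back to a summed negative log-likelihood, the sums are added up, and the
    total is divided by the total token count before exponentiating.
    """
    total_nll = sum(loss * window_len for loss in mean_losses)
    total_tokens = len(mean_losses) * window_len
    return math.exp(total_nll / total_tokens)

# Sanity check: a model that spreads probability uniformly over 4 tokens
# has a cross-entropy of ln(4) per token, so its perplexity is about 4.
uniform_losses = [math.log(4.0)] * 3
ppl_uniform = windowed_perplexity(uniform_losses, window_len=2048)
print(f"perplexity of a uniform 4-way guesser: {ppl_uniform:.4f}")
```

Because every window has the same length, the result equals the exponential of the plain mean of the window losses. Lower values mean the model assigns higher probability to the test text.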

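Load OPT-2.7B with `AutoModelForCausalLM` and quantize its weights to 4-bit NF4 at load time with the bitsandbytes package. The matrix multiplications run in BF16 (`bnb_4bit_compute_dtype`), and double quantization is turned off for this model.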
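
To give a feel for what NF4-style quantization does, here is a minimal sketch of blockwise quantization against a 16-entry codebook built from normal-distribution quantiles. It is an assumption-laden illustration: the codebook is generated here from evenly spaced quantiles and is not the exact table used by bitsandbytes (which, for instance, contains an exact zero), and the function names and sample weights are made up.

```python
from statistics import NormalDist

def normal_quantile_codebook(levels=16):
    """Illustrative 4-bit codebook: evenly spaced N(0, 1) quantiles scaled to [-1, 1].

    Assumption: this is NOT the exact bitsandbytes NF4 table; it only shows
    the idea of placing more codes where normally distributed weights are dense.
    """
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(block, codebook):
    """Scale a block by its absolute maximum, then pick the nearest code per value."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero for all-zero blocks
    codes = [
        min(range(len(codebook)), key=lambda k: abs(x / absmax - codebook[k]))
        for x in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax, codebook):
    """Map 4-bit codes back to floats using the stored per-block absmax."""
    return [codebook[c] * absmax for c in codes]

codebook = normal_quantile_codebook()
weights = [0.021, -0.087, 0.0035, 0.0425, -0.0021, 0.0082, -0.0129, 0.0462]
codes, absmax = quantize_block(weights, codebook)
restored = dequantize_block(codes, absmax, codebook)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("largest reconstruction error:", round(max_err, 4))
```

Only the 4-bit codes and one scaling constant per block need to be stored, which is where the memory saving comes from.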

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model ID to build
model_id = "facebook/opt-2.7b"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantization Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Build Quantized OPT model
model_opt_int4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config,device_map={"":0}, cache_dir="/home/ms/hf_cache")
print(model_opt_int4)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 2560, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)
      (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-31): 32 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
            (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
            (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
            (out_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
          

In [5]:
from peft import prepare_model_for_kbit_training

model_opt_int4.gradient_checkpointing_enable()
model_opt_int4 = prepare_model_for_kbit_training(model_opt_int4)

In [6]:
from peft import LoraConfig, get_peft_model
# Setting for LoRA PEFT (fine-tuning QKV projection weight)
config = LoraConfig(
    r=4, # LoRA rank [2,4,8,16,64,...]
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"], # target modules ["fc1", "fc2", "q_proj", "k_proj", "v_proj"]
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to OPT Pre-Trained Model
# model_opt_int4.gradient_checkpointing_enable()

model_opt_int4 = get_peft_model(model_opt_int4, config)
print(model_opt_int4)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 2560, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)
          (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-31): 32 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): lora.Linear4bit(
                  (base_layer): Linear4bit(in_features=2560, out_features=2560, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=2560, out_features=4, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=4, out_features=2560, bias=False)
 

In [7]:
from datasets import load_dataset

data = load_dataset("piqa", split="train[:10%]").select(range(100)) # sample 100 data from PIQA train dataset
column_names = data.column_names

def add_sol_with_label(example):
    sentence = example[column_names[0]] + " "
    answer = example[column_names[1]] if example["label"] == 0 else example[column_names[2]]

    example["sentence"] = sentence + answer
    return example

# Pre-Processing PIQA train dataset
updated_data = data.map(add_sol_with_label)
updated_data = updated_data.remove_columns("goal")
updated_data = updated_data.remove_columns("label")
updated_data = updated_data.rename_column("sentence", "goal")
data = updated_data

# Tokenize
data = data.map(lambda samples:tokenizer(samples["goal"]), batched=True)

The repository for piqa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/piqa.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [8]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

# Fine-Tuning Setting
trainer = transformers.Trainer(
    model=model_opt_int4,
    train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        max_steps=100,
        learning_rate=1e-4,
        fp16=True,
        logging_steps=10,
        output_dir="outputs"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model_opt_int4.config.use_cache = False  # silence the warnings. Please re-enable for inference!

max_steps is given, it will override any value given in num_train_epochs


In [9]:
# 100 step Fine-Tuning
trainer.train()



| Step | Training Loss |
|------|---------------|
| 10 | 3.5284 |
| 20 | 3.1109 |
| 30 | 2.8533 |
| 40 | 3.7431 |
| 50 | 3.0895 |
| 60 | 3.4064 |
| 70 | 3.3735 |
| 80 | 2.9261 |
| 90 | 2.774 |
| 100 | 2.8413 |


TrainOutput(global_step=100, training_loss=3.1646441268920897, metrics={'train_runtime': 44.6414, 'train_samples_per_second': 2.24, 'train_steps_per_second': 2.24, 'total_flos': 48799972270080.0, 'train_loss': 3.1646441268920897, 'epoch': 1.0})

In [10]:
# Insert any context text you want
text = "When boiling butter, "
device = "cuda:0"

# Set max sequence length
max_token_number = 30

# Tokenize input sequence
inputs = tokenizer(text, return_tensors="pt").to(device)

# Text generation (model inference)
with torch.no_grad():
    outputs_opt = model_opt_int4.generate(**inputs, max_new_tokens=max_token_number)

print(tokenizer.decode(outputs_opt[0], skip_special_tokens=True))



When boiling butter,  put it in a pot and bring it to a boil.  Then turn off the heat and let it cool.       


## Part.2 QLoRA Fine-Tuning with Korean LLM

- In the Colab menu, choose Runtime > Restart session before running this part.

In [11]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "beomi/KoAlpaca-Polyglot-5.8B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
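
This part switches `bnb_4bit_use_double_quant` to `True`, which also quantizes the per-block scaling constants. The arithmetic below estimates the storage cost per parameter with and without it. The block sizes and constant widths (64-weight blocks with FP32 constants, then 8-bit constants with one FP32 constant per 256 blocks) are assumptions taken from the QLoRA paper's description, not values read from this notebook, and the function name is hypothetical.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_const_bits=8, dq_block_size=256):
    """Approximate storage per weight: the code itself plus its share of scaling constants."""
    if not double_quant:
        overhead = const_bits / block_size
    else:
        # First-level constants stored in fewer bits, plus a second-level
        # constant shared by `dq_block_size` first-level constants.
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    return weight_bits + overhead

plain = bits_per_param()
dq = bits_per_param(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {dq:.3f} bits/param")

# Rough footprint if all ~5.8e9 weights were quantized (embeddings and
# norms are not, so treat this as an order-of-magnitude estimate only).
for label, bits in (("plain", plain), ("double quant", dq)):
    print(label, round(5.8e9 * bits / 8 / 1e9, 2), "GB")
```

Under these assumptions the saving is a few tenths of a bit per weight, which matters more as the model gets larger.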


In [12]:
from transformers import pipeline

pipe = pipeline('text-generation', model=model, tokenizer=model_id)
def ask(x, context='', is_input_full=False):
    # Prompt template labels: "질문" = question, "맥락" = context, "답변" = answer
    ans = pipe(
        f"### 질문: {x}\n\n### 맥락: {context}\n\n### 답변:" if context else f"### 질문: {x}\n\n### 답변:",
        do_sample=False,  # greedy decoding, so temperature/top_p would be ignored
        max_new_tokens=512,
        return_full_text=False,
        eos_token_id=2,
    )
    print(ans[0]['generated_text'])

# Korean Text Generation with NF4 Korean-LLM
ask("딥러닝이 뭐야?")  # "What is deep learning?"



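딥러닝은 인공지능 분야의 하나로서 연속된 계층을 이룬 신경망 네트워크를 이용합니다. 이를 통해 데이터의 복잡한 관계를 자동으로 학습하고, 새로운 데이터를 인식하고 예측할 수 있습니다. 예를 들어, 의료 분야에서는 암 진단, 조기 진단 등에 사용됩니다.

(English translation: Deep learning is a field of artificial intelligence that uses neural networks built from successive layers. This lets it automatically learn complex relationships in data and recognize and make predictions on new data. For example, in the medical field it is used for cancer diagnosis, early diagnosis, and so on.)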


In [13]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [14]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [15]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 3670016 || all params: 3070156800 || trainable%: 0.11953838970048696
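
The trainable-parameter count printed above can be checked by hand: each LoRA adapter adds a rank-by-input matrix A and an output-by-rank matrix B. The sketch below does that arithmetic for both parts of the lab. The OPT-2.7B shapes are read off the printed module tree in Part 1; the hidden size of 4096 and the 28 layers used for KoAlpaca-Polyglot-5.8B are assumptions, chosen because they reproduce the printed count.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# Part 1 (OPT-2.7B): r=4 on q_proj, k_proj and v_proj (2560 -> 2560) in 32 layers.
opt_total = 32 * 3 * lora_params(2560, 2560, rank=4)

# Part 2 (KoAlpaca-Polyglot-5.8B): r=8 on the fused query_key_value projection.
# ASSUMPTION: hidden size 4096 (so the fused output is 3 * 4096) and 28 layers.
polyglot_total = 28 * lora_params(4096, 3 * 4096, rank=8)

print("OPT-2.7B LoRA parameters:", opt_total)
print("Polyglot-5.8B LoRA parameters:", polyglot_total)
```

Doubling the rank doubles the adapter size, while adding more target modules (for example `fc1` and `fc2`) grows it in proportion to their input and output widths.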


In [16]:
from datasets import load_dataset

data = load_dataset("beomi/KoAlpaca-v1.1a")

data = data.map(
    lambda x: {'text': f"### 질문: {x['instruction']}\n\n### 답변: {x['output']}<|endoftext|>" }
)


In [17]:
data = data.map(lambda samples: tokenizer(samples["text"]), batched=True)


In [18]:
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss |
|------|---------------|
| 1 | 0.5336 |
| 2 | 0.6233 |
| 3 | 0.731 |
| 4 | 0.5548 |
| 5 | 0.6851 |
| 6 | 0.5651 |
| 7 | 0.5524 |
| 8 | 0.2532 |
| 9 | 0.4783 |
| 10 | 0.3861 |


TrainOutput(global_step=10, training_loss=0.5362881332635879, metrics={'train_runtime': 35.9579, 'train_samples_per_second': 1.112, 'train_steps_per_second': 0.278, 'total_flos': 334896076505088.0, 'train_loss': 0.5362881332635879, 'epoch': 0.0018908059560387616})
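
The `epoch` values in the two `TrainOutput` lines follow from the batch settings. A small sketch of that arithmetic is below; the helper name `epochs_seen` is hypothetical, and a single GPU is assumed.

```python
def epochs_seen(max_steps, per_device_batch, grad_accum, dataset_size, num_devices=1):
    """Fraction of the dataset consumed after `max_steps` optimizer steps."""
    samples = max_steps * per_device_batch * grad_accum * num_devices
    return samples / dataset_size

# Part 1: 100 steps, batch size 1, no accumulation, 100 PIQA examples.
part1 = epochs_seen(max_steps=100, per_device_batch=1, grad_accum=1, dataset_size=100)

# Part 2: 10 steps, batch size 1, 4 accumulation steps, 21155 KoAlpaca examples.
part2 = epochs_seen(max_steps=10, per_device_batch=1, grad_accum=4, dataset_size=21155)

print("Part 1 epochs:", part1)
print("Part 2 epochs:", part2)
```

Part 2 therefore touches only 40 of the 21155 examples, so the run demonstrates the training loop rather than a converged fine-tune.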

In [19]:
model.eval()
model.config.use_cache = True  # re-enable the KV cache for inference

In [20]:
def gen(x):
    # Tokenize the prompt and move it to the model's device before generating
    inputs = tokenizer(
        f"### 질문: {x}\n\n### 답변:",
        return_tensors='pt',
        return_token_type_ids=False
    ).to(model.device)
    gened = model.generate(
        **inputs,
        max_new_tokens=256,
        early_stopping=True,
        do_sample=True,
        eos_token_id=2,
    )
    print(tokenizer.decode(gened[0]))

In [21]:
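gen("마진콜이 발생하는 이유가 뭐야? 그리고 어떻게 해야 마진콜을 막을 수 있어?")  # "Why do margin calls happen, and how can I prevent one?"



### 질문: 마진콜이 발생하는 이유가 뭐야? 그리고 어떻게 해야 마진콜을 막을 수 있어?

### 답변:마진콜 또는 손실 회피는 기대했던 이익이 발생하지 않았을 때 투자자가 갖는 스트레스를 말합니다. 주가가 지속적으로 오르는 경우 등에는 차익 거래로 인한 마진콜이 발생할 수 있습니다. 이를 막기 위해서는 오를 때를 대비해 매도를 준비하는 것이 중요합니다. 또한, 손실 회피를 위해 매출 원가를 조정하거나 다른 전략을 사용할 수 있습니다. 더 나아가, 운영 원칙이나 목표 시장에 대한 강한 리더십을 가지는 것도 도움이 됩니다. <|endoftext|>

(English translation of the output. Question: Why do margin calls happen? And how can I prevent a margin call? Answer: A margin call, or loss aversion, refers to the stress an investor feels when the expected profit does not materialize. When stock prices keep rising, for example, a margin call can occur because of arbitrage trading. To prevent this, it is important to be prepared to sell in anticipation of a rise. You can also adjust the cost of sales or use other strategies to avoid losses. Furthermore, having strong leadership over operating principles or the target market also helps.)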


## Part.3 LoftQ

- Restart the session before running this part.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

fp_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    torch_dtype = torch.float32,
    device_map="cpu"
)
for para in fp_model.parameters():
    para.requires_grad = False
fp_model.config.use_cache = False
fp_model.eval()
print(fp_model)
sd = {k:v.cpu() for k,v in fp_model.state_dict().items()}
del fp_model
import gc
gc.collect()
torch.cuda.empty_cache()



OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-11): 12 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), ep

In [2]:
from peft import LoraConfig, PeftModel, get_peft_model
lora_r = 64  # LoRA rank
lora_alpha = lora_r
lora_dropout = 0.1

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    torch_dtype=torch.float32,
    device_map="cpu"
)
target_linear = ['fc1']
target_t_type = 'CAUSAL_LM'
lora_config = LoraConfig(
    init_lora_weights = "gaussian",
    r = lora_r,
    lora_alpha = lora_alpha,
    target_modules = target_linear,
    lora_dropout = lora_dropout,
    bias = "none",
    task_type = target_t_type
)
model = get_peft_model(model, lora_config)
model.config.use_cache = False
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 768, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-11): 12 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): Linear(in_features=768, out_features=768, bias=True)
                (v_proj): Linear(in_features=768, out_features=768, bias=True)
                (q_proj): Linear(in_features=768, out_features=768, bias=True)
                (out_proj): Linear(in_features=768, out_features=768, bias=True)
              )
              (activation_fn): ReLU()
              (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (fc1): lora.Linear(
                (base_layer): Linear

In [3]:
print(model.base_model.model.model.decoder.layers[0].fc1.lora_A.default.weight)
print(model.base_model.model.model.decoder.layers[0].fc1.lora_B.default.weight)


Parameter containing:
tensor([[-0.0318, -0.0095,  0.0171,  ..., -0.0007,  0.0081,  0.0128],
        [-0.0149,  0.0167, -0.0048,  ..., -0.0175,  0.0189, -0.0082],
        [-0.0131,  0.0291, -0.0142,  ...,  0.0180, -0.0132,  0.0153],
        ...,
        [ 0.0162,  0.0090, -0.0053,  ..., -0.0304, -0.0005, -0.0168],
        [-0.0041, -0.0243, -0.0039,  ...,  0.0111, -0.0049, -0.0213],
        [-0.0272,  0.0353,  0.0242,  ...,  0.0122,  0.0023,  0.0080]],
       requires_grad=True)
Parameter containing:
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], requires_grad=True)
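
The printout shows the default LoRA initialization: `lora_A` holds small random values and `lora_B` is all zeros. The sketch below, written with plain lists and made-up numbers, illustrates why that matters: with B at zero, the adapted layer initially produces exactly the same output as the frozen base layer. The function names are hypothetical, and bias and dropout are ignored.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(w, lora_a, lora_b, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x), ignoring bias and dropout."""
    base = matvec(w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

w = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]  # frozen base weight, 2 x 3
lora_a = [[0.01, -0.02, 0.03]]             # rank-1 A with small random-looking values, 1 x 3
zero_b = [[0.0], [0.0]]                    # B starts at zero, 2 x 1
x = [1.0, 2.0, 3.0]

y_base = matvec(w, x)
y_init = lora_forward(w, lora_a, zero_b, x, alpha=1, rank=1)
y_trained = lora_forward(w, lora_a, [[0.5], [-0.5]], x, alpha=1, rank=1)

print("base output:        ", y_base)
print("LoRA output, B = 0: ", y_init)
print("LoRA output, B != 0:", y_trained)
```

In the cell above `lora_alpha` equals `lora_r`, so the scale factor is 1, as in this sketch. LoftQ departs from this zero start by giving both A and B non-zero initial values that approximate the quantization error.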


In [4]:
sample_fp_weight = sd["model.decoder.layers.0.fc1.weight"]
print(sample_fp_weight)
print(sample_fp_weight.shape)

tensor([[-0.0871,  0.0035, -0.0213,  ...,  0.0425, -0.0021,  0.0082],
        [ 0.0037,  0.0230,  0.0150,  ...,  0.0206,  0.0172,  0.0158],
        [ 0.0125, -0.0129,  0.0245,  ...,  0.0040, -0.0299,  0.0462],
        ...,
        [ 0.0075,  0.0394,  0.0232,  ..., -0.0341,  0.0475,  0.0082],
        [-0.0055, -0.0038, -0.0399,  ...,  0.0158,  0.0176, -0.0500],
        [-0.0155,  0.0334,  0.0111,  ...,  0.0387, -0.0299,  0.0230]])
torch.Size([3072, 768])


In [5]:
model.base_model.model.model.decoder.layers[0].fc1.base_layer.weight

Parameter containing:
tensor([[-0.0871,  0.0035, -0.0213,  ...,  0.0425, -0.0021,  0.0082],
        [ 0.0037,  0.0230,  0.0150,  ...,  0.0206,  0.0172,  0.0158],
        [ 0.0125, -0.0129,  0.0245,  ...,  0.0040, -0.0299,  0.0462],
        ...,
        [ 0.0075,  0.0394,  0.0232,  ..., -0.0341,  0.0475,  0.0082],
        [-0.0055, -0.0038, -0.0399,  ...,  0.0158,  0.0176, -0.0500],
        [-0.0155,  0.0334,  0.0111,  ...,  0.0387, -0.0299,  0.0230]])

In [6]:
def quant_func_asym(w, n_bits, q_group_size):

    org_w_shape = w.shape
    # q_group_size = -1

    if q_group_size > 0:
        assert org_w_shape[-1] % q_group_size == 0
        w = w.reshape(-1, q_group_size)
    else:
        w = w.reshape(-1, w.shape[-1])

    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** n_bits - 1
    min_int = 0
    # scales = (max_val - min_val).clamp(min=1e-5) / max_int
    scales = (max_val - min_val) / max_int
    zeros = (-torch.round(min_val / scales)).clamp_(min_int, max_int)

    w = (torch.clamp(torch.round(w / scales) +
                    zeros, min_int, max_int) - zeros) * scales

    assert torch.isnan(w).sum() == 0

    w_q = w.reshape(org_w_shape)

    return w_q.detach()
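
To make the steps of `quant_func_asym` concrete, here is the same min/max (asymmetric) scheme applied to a single small group in plain Python. The function name and the five sample values are illustrative assumptions; the formulas for the scale, the zero point, and the clamp follow the cell above.

```python
def quantize_group_asym(values, n_bits):
    """Asymmetric min/max fake quantization of one group (quantize, then dequantize)."""
    max_int = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / max_int                       # width of one integer step
    zero = min(max(-round(lo / scale), 0), max_int)   # integer code that represents 0.0
    codes = [min(max(round(v / scale) + zero, 0), max_int) for v in values]
    dequant = [(c - zero) * scale for c in codes]
    return codes, dequant, scale, zero

group = [-0.3, -0.1, 0.0, 0.2, 0.6]
codes, dequant, scale, zero = quantize_group_asym(group, n_bits=2)

print("integer codes:", codes)
print("dequantized:  ", [round(d, 3) for d in dequant])
print("scale:", round(scale, 3), "zero point:", zero)
```

With 2 bits there are only four representable levels per group, so several inputs collapse onto the same value. Like the original function without the commented-out `clamp`, this sketch fails on a group whose values are all identical, because the scale becomes zero.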

In [7]:
sample_q_2b_weight = quant_func_asym(model.base_model.model.model.decoder.layers[0].fc1.base_layer.weight, 2, 64)
print(sample_q_2b_weight)

tensor([[-0.1026,  0.0000,  0.0000,  ...,  0.0449,  0.0000,  0.0000],
        [ 0.0000,  0.0329,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0410,  ...,  0.0000, -0.0409,  0.0409],
        ...,
        [ 0.0000,  0.0354,  0.0354,  ..., -0.0387,  0.0387,  0.0000],
        [ 0.0000,  0.0000, -0.0476,  ...,  0.0000,  0.0332, -0.0665],
        [ 0.0000,  0.0628,  0.0000,  ...,  0.0316, -0.0316,  0.0316]])


In [8]:
sample_q_4b_weight = quant_func_asym(model.base_model.model.model.decoder.layers[0].fc1.base_layer.weight, 4, 64)
print(sample_q_4b_weight)

tensor([[-0.0820,  0.0000, -0.0205,  ...,  0.0449,  0.0000,  0.0090],
        [ 0.0066,  0.0198,  0.0132,  ...,  0.0187,  0.0187,  0.0187],
        [ 0.0164, -0.0164,  0.0246,  ...,  0.0000, -0.0327,  0.0490],
        ...,
        [ 0.0071,  0.0425,  0.0213,  ..., -0.0309,  0.0464,  0.0077],
        [-0.0095,  0.0000, -0.0381,  ...,  0.0133,  0.0199, -0.0532],
        [-0.0126,  0.0377,  0.0126,  ...,  0.0379, -0.0316,  0.0253]])


In [9]:
import torch.nn as nn
loss_fn = nn.MSELoss()

In [10]:
print(f"4bit quantization error: {loss_fn(sample_fp_weight, sample_q_4b_weight): 7f}")
print(f"2bit quantization error: {loss_fn(sample_fp_weight, sample_q_2b_weight): 7f}")

4bit quantization error:  0.000007
2bit quantization error:  0.000165


In [11]:
gap_weight_2bit = (sample_fp_weight - sample_q_2b_weight).detach().cpu()
print(gap_weight_2bit)

tensor([[ 0.0155,  0.0035, -0.0213,  ..., -0.0024, -0.0021,  0.0082],
        [ 0.0037, -0.0099,  0.0150,  ...,  0.0206,  0.0172,  0.0158],
        [ 0.0125, -0.0129, -0.0165,  ...,  0.0040,  0.0110,  0.0053],
        ...,
        [ 0.0075,  0.0040, -0.0122,  ...,  0.0046,  0.0088,  0.0082],
        [-0.0055, -0.0038,  0.0078,  ...,  0.0158, -0.0156,  0.0164],
        [-0.0155, -0.0293,  0.0111,  ...,  0.0071,  0.0017, -0.0086]])


In [12]:
U, S, Vh = torch.linalg.svd(gap_weight_2bit, full_matrices=False)
print(f"U: {U} \nshape of U: {U.shape}\n")
print(f"S: {S} \nshape of S: {S.shape}\n")
print(f"Vh: {Vh} \nshape of Vh: {Vh.shape}\n")

U: tensor([[ 0.0040, -0.0013,  0.0068,  ...,  0.0089, -0.0131,  0.0192],
        [ 0.0025, -0.0045,  0.0106,  ...,  0.0142,  0.0167,  0.0029],
        [-0.0040, -0.0008,  0.0029,  ...,  0.0119,  0.0139,  0.0267],
        ...,
        [ 0.0060, -0.0045, -0.0014,  ..., -0.0046,  0.0335,  0.0222],
        [ 0.0057, -0.0004, -0.0016,  ..., -0.0559,  0.0208, -0.0053],
        [ 0.0049, -0.0012, -0.0019,  ..., -0.0077,  0.0004,  0.0051]]) 
shape of U: torch.Size([3072, 768])

S: tensor([3.0222, 2.9968, 2.8265, 2.6635, 2.5974, 2.5779, 2.5027, 2.4933, 2.4356,
        2.3553, 2.2474, 2.1132, 1.0333, 0.9990, 0.9721, 0.9659, 0.9639, 0.9611,
        0.9569, 0.9538, 0.9494, 0.9465, 0.9428, 0.9424, 0.9400, 0.9366, 0.9362,
        0.9329, 0.9302, 0.9280, 0.9253, 0.9242, 0.9228, 0.9204, 0.9173, 0.9170,
        0.9133, 0.9130, 0.9116, 0.9105, 0.9097, 0.9078, 0.9062, 0.9050, 0.9040,
        0.9037, 0.9025, 0.8989, 0.8981, 0.8973, 0.8972, 0.8961, 0.8942, 0.8926,
        0.8913, 0.8910, 0.8872, 0.8857, 0.

In [13]:
rank = 64
L = U @ (torch.sqrt(torch.diag(S)[:, 0:rank])) # lora_B
R = torch.sqrt(torch.diag(S)[0:rank, :]) @ Vh  # lora_A

print(f"L: {L} \nshape of L: {L.shape}")
print(f"R: {R} \nshape of R: {R.shape}")

L: tensor([[ 0.0069, -0.0023,  0.0115,  ...,  0.0064,  0.0390, -0.0213],
        [ 0.0043, -0.0077,  0.0177,  ..., -0.0279, -0.0218,  0.0135],
        [-0.0069, -0.0014,  0.0049,  ...,  0.0535,  0.0209, -0.0182],
        ...,
        [ 0.0104, -0.0078, -0.0024,  ..., -0.0079,  0.0152, -0.0111],
        [ 0.0099, -0.0007, -0.0028,  ..., -0.0031, -0.0053,  0.0143],
        [ 0.0085, -0.0020, -0.0032,  ...,  0.0129, -0.0025,  0.0095]]) 
shape of L: torch.Size([3072, 64])
R: tensor([[-0.0203, -0.0429,  0.0134,  ...,  0.0366,  0.0536,  0.0264],
        [ 0.0981, -0.0740, -0.0021,  ...,  0.0721, -0.0762, -0.0421],
        [ 0.0412,  0.0120,  0.0477,  ...,  0.0299, -0.0407,  0.0836],
        ...,
        [-0.0860, -0.0315, -0.0110,  ..., -0.0198,  0.0421, -0.0188],
        [ 0.0530,  0.0540,  0.0856,  ..., -0.0638, -0.0555, -0.0055],
        [-0.0150,  0.0517, -0.0589,  ..., -0.0324,  0.0608, -0.0034]]) 
shape of R: torch.Size([64, 768])


In [14]:
print(f"2bit quantization error wo SVD: {loss_fn(sample_fp_weight, sample_q_2b_weight): 7f}")
print(f"2bit quantization error w  SVD: {loss_fn(sample_fp_weight, sample_q_2b_weight + L @ R): 7f}")

2bit quantization error wo SVD:  0.000165
2bit quantization error w  SVD:  0.000113
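
The drop in error above comes from approximating the quantization residual with a low-rank product, which is the idea LoftQ builds on. The sketch below shows the rank-1 case without any libraries, using power iteration instead of `torch.linalg.svd`. This is a swapped-in technique for illustration only, and the residual matrix is made up rather than taken from the model.

```python
import math
import random

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_sq(a):
    """Squared Frobenius norm (sum of squared entries)."""
    return sum(v * v for row in a for v in row)

def top_singular_triplet(a, iters=200, seed=0):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    rng = random.Random(seed)
    a_t = transpose(a)
    v = [rng.gauss(0.0, 1.0) for _ in a[0]]
    for _ in range(iters):
        v = matvec(a_t, matvec(a, v))
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    av = matvec(a, v)
    sigma = math.sqrt(sum(c * c for c in av))
    return sigma, [c / sigma for c in av], v

# Toy residual W_fp - W_q (made-up numbers, not taken from the model).
residual = [
    [0.015, 0.004, -0.021],
    [0.004, -0.010, 0.015],
    [0.012, -0.013, -0.017],
    [0.008, 0.004, -0.012],
]
sigma, u, v = top_singular_triplet(residual)
root = math.sqrt(sigma)              # split sigma evenly between the two factors
L = [[ui * root] for ui in u]        # 4 x 1, plays the role of lora_B
R = [[vj * root for vj in v]]        # 1 x 3, plays the role of lora_A
remaining = [
    [residual[i][j] - L[i][0] * R[0][j] for j in range(len(v))]
    for i in range(len(u))
]
err_before = frob_sq(residual)
err_after = frob_sq(remaining)
print("squared error before low-rank correction:", err_before)
print("squared error after rank-1 correction:   ", err_after)
```

Subtracting the rank-1 term removes exactly the squared top singular value from the squared error. Keeping more singular directions, as the rank-64 cell above does, removes more.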


## Part.4 LoftQ vs QLoRA

- In the Colab menu, choose Runtime > Restart session before running this part.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model ID to build
model_id = "facebook/opt-2.7b"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the OPT model in BF16 (not quantized here; LoftQ initialization starts from these weights)
model_opt_int4 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map={"":0}, cache_dir="/home/ms/hf_cache")
print(model_opt_int4)



OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 2560, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)
      (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-31): 32 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
            (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
          (final_layer_norm): Laye

In [2]:
from peft import prepare_model_for_kbit_training

model_opt_int4.gradient_checkpointing_enable()
model_opt_int4 = prepare_model_for_kbit_training(model_opt_int4)

In [3]:
from peft import LoraConfig, get_peft_model, LoftQConfig
# Setting for LoRA PEFT (fine-tuning QKV projection weight)

loftq_config = LoftQConfig(loftq_bits=4, loftq_iter=1)

config = LoraConfig(
    r=4, # LoRA rank [2,4,8,16,64,...]
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"], # target modules ["fc1", "fc2", "q_proj", "k_proj", "v_proj"]
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights="loftq",
    loftq_config=loftq_config,
)

# Apply LoRA to OPT Pre-Trained Model
# model_opt_int4.gradient_checkpointing_enable()

model_opt_int4 = get_peft_model(model_opt_int4, config)
print(model_opt_int4)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): OPTForCausalLM(
      (model): OPTModel(
        (decoder): OPTDecoder(
          (embed_tokens): Embedding(50272, 2560, padding_idx=1)
          (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)
          (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (layers): ModuleList(
            (0-31): 32 x OPTDecoderLayer(
              (self_attn): OPTAttention(
                (k_proj): lora.Linear(
                  (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=2560, out_features=4, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=4, out_features=2560, bias=False)
         

In [4]:
model_opt_int4.base_model.model.model.decoder.layers[0].self_attn.k_proj.base_layer.weight

Parameter containing:
tensor([[ 0.0095,  0.0031, -0.0071,  ...,  0.0094, -0.0079,  0.0122],
        [ 0.0061, -0.0070, -0.0070,  ..., -0.0073,  0.0064, -0.0073],
        [ 0.0076, -0.0043, -0.0043,  ..., -0.0164,  0.0143,  0.0000],
        ...,
        [-0.0282,  0.0293, -0.0037,  ...,  0.0151,  0.0231, -0.0085],
        [ 0.0164, -0.0106,  0.0125,  ...,  0.0123, -0.0144,  0.0205],
        [ 0.0121,  0.0039, -0.0045,  ...,  0.0134,  0.0175,  0.0398]],
       device='cuda:0')
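
The printed shapes are enough to estimate how small the adapter is next to the frozen base weights. The following is a minimal sketch, assuming the dimensions shown in the model printout (hidden size 2560, rank 4, 32 decoder layers, three adapted projections per layer). The variable names are illustrative and not part of the lab code.

```python
# Dimensions read off the printed model; all names here are illustrative.
hidden = 2560                      # in/out features of q_proj, k_proj, v_proj
rank = 4                           # r in LoraConfig
num_layers = 32                    # "(0-31): 32 x OPTDecoderLayer"
targets = ["q_proj", "k_proj", "v_proj"]
lora_alpha = 32

# Frozen base weight of one projection (bias not counted).
base_per_module = hidden * hidden

# Trainable adapter of one projection: lora_A (rank x hidden) + lora_B (hidden x rank).
lora_per_module = rank * hidden + hidden * rank

# All adapters across the three projections in every decoder layer.
lora_total = lora_per_module * len(targets) * num_layers

# Default LoRA scaling factor applied to the adapter output.
scaling = lora_alpha / rank

print(base_per_module)                               # 6553600
print(lora_per_module)                               # 20480
print(lora_total)                                    # 1966080
print(f"{lora_per_module / base_per_module:.4%}")    # 0.3125%
print(scaling)                                       # 8.0
```

Each adapted projection therefore trains roughly 0.3% as many values as its frozen base weight holds.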

In [5]:
model_opt_int4.base_model.model.model.decoder.layers[0].self_attn.k_proj.lora_B.default.weight

Parameter containing:
tensor([[-1.1797e-02,  4.5313e-03,  7.5165e-03,  2.9895e-03],
        [ 4.0054e-03,  5.8776e-03, -7.9139e-03, -2.9199e-04],
        [ 1.1773e-02, -8.4820e-03, -5.3465e-03,  2.8521e-03],
        ...,
        [-2.2033e-02, -3.6539e-02, -5.8883e-03,  2.7257e-02],
        [-1.8315e-02, -1.5834e-02, -1.6254e-03,  2.2301e-02],
        [ 9.7270e-03,  6.7502e-03,  6.2929e-05,  1.0831e-02]], device='cuda:0',
       requires_grad=True)
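
With the default LoRA initialization, `lora_B` starts at zero. The non-zero values above come from `init_lora_weights="loftq"`, which initializes the adapter so that the quantized weight plus the low-rank product approximates the original weight. The sketch below illustrates that idea for a single iteration on a tiny matrix. It is not PEFT's implementation: the uniform quantizer is a stand-in for the real 4-bit format, the adapter has rank 1 instead of 4, and a pure-Python power iteration replaces the SVD. All names are hypothetical.

```python
import math
import random

random.seed(0)
n = 6
# Toy "pre-trained" weight matrix.
W = [[random.gauss(0.0, 0.02) for _ in range(n)] for _ in range(n)]

def quantize_uniform(M, bits=4):
    """Round each entry to one of 2**bits evenly spaced levels (stand-in quantizer)."""
    flat = [x for row in M for x in row]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / (2 ** bits - 1)
    return [[lo + round((x - lo) / step) * step for x in row] for row in M]

def frob(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in M for x in row))

Q = quantize_uniform(W)
# Quantization residual that the adapter should absorb.
R = [[W[i][j] - Q[i][j] for j in range(n)] for i in range(n)]

# Power iteration on R^T R to approximate the top right-singular vector v.
v = [1.0 / math.sqrt(n)] * n
for _ in range(50):
    Rv = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
    RtRv = [sum(R[i][j] * Rv[i] for i in range(n)) for j in range(n)]
    norm = math.sqrt(sum(x * x for x in RtRv))
    v = [x / norm for x in RtRv]

# Rank-1 adapter: B is an n x 1 column, A is a 1 x n row, so B @ A approximates R.
B = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
A = v

err_plain = frob(R)  # error of the quantized weight alone
err_loftq = frob([[R[i][j] - B[i] * A[j] for j in range(n)] for i in range(n)])
print(err_plain, err_loftq)
```

The second error is smaller than the first, so the fine-tuning run starts closer to the full-precision model than it would with a zero-initialized adapter.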

In [6]:
from datasets import load_dataset

data = load_dataset("piqa", split="train[:10%]").select(range(100)) # take the first 100 examples of the PIQA train split
column_names = data.column_names

def add_sol_with_label(example):
    # Concatenate the goal with the correct solution (sol1 if label == 0, else sol2)
    sentence = example[column_names[0]] + " "
    answer = example[column_names[1]] if example["label"] == 0 else example[column_names[2]]

    example["sentence"] = sentence + answer
    return example

# Pre-process the PIQA train subset: replace "goal" with "goal + correct solution"
updated_data = data.map(add_sol_with_label)
updated_data = updated_data.remove_columns(["goal", "label"])
updated_data = updated_data.rename_column("sentence", "goal")
data = updated_data

# Tokenize
data = data.map(lambda samples: tokenizer(samples["goal"]), batched=True)
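
To make the preprocessing step concrete, here is a small sketch that runs `add_sol_with_label` on a single record without downloading the dataset. The record is hypothetical and only mimics the PIQA layout. The column order `goal`, `sol1`, `sol2`, `label` is an assumption about what `data.column_names` returns.

```python
# Assumed column order of the PIQA dataset.
column_names = ["goal", "sol1", "sol2", "label"]

# Hypothetical record in the PIQA layout (not taken from the dataset).
example = {
    "goal": "How do you soften butter quickly?",
    "sol1": "Leave it in the freezer for an hour.",
    "sol2": "Cut it into small cubes and leave it at room temperature.",
    "label": 1,
}

def add_sol_with_label(example):
    # Append the correct solution (sol1 if label == 0, else sol2) to the goal.
    sentence = example[column_names[0]] + " "
    answer = example[column_names[1]] if example["label"] == 0 else example[column_names[2]]
    example["sentence"] = sentence + answer
    return example

merged = add_sol_with_label(dict(example))["sentence"]
print(merged)
```

Only the correct solution is kept, so the model is fine-tuned on plausible goal and answer pairs rather than on the multiple-choice format.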

In [7]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

# Fine-Tuning Setting
trainer = transformers.Trainer(
    model=model_opt_int4,
    train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="outputs"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model_opt_int4.config.use_cache = False  # disable the KV cache during training to silence warnings; re-enable it for inference

max_steps is given, it will override any value given in num_train_epochs


In [8]:
# 100 step Fine-Tuning
trainer.train()



| Step | Training Loss |
|------|---------------|
| 10 | 3.5853 |
| 20 | 3.223 |
| 30 | 2.9501 |
| 40 | 3.7715 |
| 50 | 3.1202 |
| 60 | 3.4369 |
| 70 | 3.4089 |
| 80 | 2.9373 |
| 90 | 2.7577 |
| 100 | 2.8512 |


TrainOutput(global_step=100, training_loss=3.204201145172119, metrics={'train_runtime': 27.9, 'train_samples_per_second': 3.584, 'train_steps_per_second': 3.584, 'total_flos': 48799972270080.0, 'train_loss': 3.204201145172119, 'epoch': 1.0})
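
The numbers in `TrainOutput` can be cross-checked against the training arguments and the logged losses. The sketch below assumes a single GPU and that each logged value is the average loss over the preceding 10 steps. The variable names are illustrative.

```python
# Values copied from the training log and TrainingArguments above.
logged_losses = [3.5853, 3.223, 2.9501, 3.7715, 3.1202,
                 3.4369, 3.4089, 2.9373, 2.7577, 2.8512]
num_samples = 100        # size of the PIQA subset
per_device_batch = 1     # per_device_train_batch_size
grad_accum = 1           # gradient_accumulation_steps
max_steps = 100
train_runtime = 27.9     # seconds, from TrainOutput

# Mean of the per-interval averages equals the overall mean (equal interval lengths).
mean_loss = sum(logged_losses) / len(logged_losses)

# Samples consumed divided by dataset size gives the number of epochs.
epochs = max_steps * per_device_batch * grad_accum / num_samples

# Optimizer steps per second.
steps_per_second = max_steps / train_runtime

print(round(mean_loss, 4))         # 3.2042
print(epochs)                      # 1.0
print(round(steps_per_second, 3))  # 3.584
```

All three values agree with the reported `training_loss`, `epoch`, and `train_steps_per_second`, which confirms that 100 steps at batch size 1 cover the 100-example subset exactly once.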

In [9]:
# Prompt text; replace with any context you like
text = "When boiling butter, "
device = "cuda:0"

# Maximum number of new tokens to generate
max_token_number = 30

# Tokenize input sequence
inputs = tokenizer(text, return_tensors="pt").to(device)

# Re-enable the KV cache (disabled for training) and switch off dropout
model_opt_int4.config.use_cache = True
model_opt_int4.eval()

# Text generation (model inference)
with torch.no_grad():
    outputs_opt = model_opt_int4.generate(**inputs, max_new_tokens=max_token_number)

print(tokenizer.decode(outputs_opt[0], skip_special_tokens=True))



When boiling butter,  put a pan on the stove and put the butter in the pan.  Put the pan on the stove and put the butter in the pan. 
