This notebook demonstrates how to fine-tune Llama 2 with QLoRA.

#### [Note-1]
Fine-tuning Llama 2 13B with LoRA takes about 5 hours on one 80 GB A100; Llama 2 70B takes about 22 hours on four 80 GB A100s.

#### [Note-2]
To keep costs down, this notebook uses the following setup:

- An L4 GPU in a Colab environment
- Training arguments
  - `per_device_train_batch_size=1`
  - `gradient_accumulation_steps=4`
- 4-bit quantization of the model
- LoraConfig
  - `lora_alpha=16`
  - `r=16`
- `use_cache = False` to reduce memory usage


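### 4-bit quantization

The goal of 4-bit quantization is to reduce the model's memory footprint.  
Each weight is stored in 4 bits instead of the 32 bits of a full-precision floating-point value.

- How the reduction is calculated  
  Original model size:  
  The original model is assumed to use 32-bit floating-point weights (4 bytes each).

  4-bit quantized model size:  
  4 bits is 0.5 bytes (4 bits / 8 bits per byte).

  Reduction ratio:  
  Assuming the original model uses 32-bit floating point, 4-bit quantization shrinks it by the following ratio:

  reduction ratio = original weight size / quantized weight size = 32 bits / 4 bits = 8

So, relative to 32-bit weights, 4-bit quantization makes the model about 8 times smaller. The Llama 2 13B checkpoint downloaded later in this notebook is about 26 GB, which corresponds to 16-bit weights, so relative to the downloaded files the reduction is closer to 4 times.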
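The arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions: the parameter count is taken as a nominal 13 billion, and only the weights are counted (activations, optimizer state, and quantization constants are ignored). The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Assumption: a nominal 13e9 parameters for the 13B model.
num_params = 13e9

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):.1f} GB")

# Ratio between 32-bit and 4-bit storage, matching the calculation above.
ratio = weight_memory_gb(num_params, 32) / weight_memory_gb(num_params, 4)
print("32-bit / 4-bit ratio:", ratio)
```

At 4 bits the weights of a 13B model need roughly 6.5 GB, which is why the model fits on a single L4 with about 23 GB of memory.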



In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl


In [None]:
!nvidia-smi

Tue Jul  9 04:52:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   41C    P8              12W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    GenerationConfig,
)

from trl import SFTTrainer

In [None]:
from huggingface_hub import login
login(token="your key")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Load the tokenizer and extend its vocabulary with a special token for padding.

In [None]:
model_name = "meta-llama/Llama-2-13b-hf"
access_token = "your key"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, token=access_token)

#Create a new token and add it to the tokenizer
tokenizer.add_special_tokens({"pad_token":"<pad>"})

# tokenizer.padding_side = 'left'
tokenizer.padding_side = 'right'


Load the Guanaco dataset.

In [None]:
dataset = load_dataset("timdettmers/openassistant-guanaco")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

In [None]:
dataset['train'][70:80]

{'text': ['### Human: 请写一篇关于新冠疫情防控的报道### Assistant: 新型冠狀病毒（COVID-19）疫情防控是國際社會面臨的持續挑戰。 為了有效地控制病毒的傳播，公共衛生當局必須實施結合預防、檢測和應對措施的綜合戰略。\n預防措施可包括實施社交距離準則、鼓勵在公共場所佩戴口罩以及提供準確及時的公共衛生信息。 檢測措施包括檢測疑似病例和追踪接觸者，以確定可能接觸過該病毒的人。 應對措施包括為感染者提供支持性護理，隔離和檢疫暴露的個人，以及提高醫療保健能力以滿足患病者的需求。\n公共衛生當局還必須與地方政府、醫療保健提供者和私營部門協調，以確保有效和及時地實施預防、檢測和應對措施。 這需要開發強大的系統來協調不同的利益相關者，以及共享關鍵數據和信息。\n最後，公共衛生當局必須確保向公眾提供準確及時的病毒信息，以及預防和緩解措施的建議。 這包括提供有關洗手、社交距離和在公共場所使用口罩的指導。 它還包括提高公眾對該病毒的認識，並鼓勵人們採取預防措施以降低感染風險。\n通過實施這些措施，公共衛生當局將能夠減少病毒的傳播，並確保感染者得到所需的治療。',
  '### Human: Por favor. Necesito un resumen sobre las estafas en internet. En ella se deben tratar temas sobre como prevenir, como reconocer una estafa, además, necesito ejemplos de casos mas recientes sobre esta situación. Al final me encetaría que se incluya varios consejos y tips para afrontar esta situación.### Assistant: Por supuesto, aqui te proporciono unos tips para prevenir que te estafen con páginas u ofertas fraudulentas:\n\n1 - Evita comprar en páginas que no conoces. ...\n2 - HTTPS s
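Each record is a single string in which turns are marked with `### Human: ` and `### Assistant: `. As a small sketch of how such a string could be split into turns (the helper `split_turns` and the sample string are hypothetical and not part of the dataset or any library):

```python
import re

def split_turns(text: str) -> list[tuple[str, str]]:
    """Split a Guanaco-style string into (speaker, message) pairs."""
    # With one capture group, re.split returns
    # [prefix, speaker, message, speaker, message, ...].
    parts = re.split(r"### (Human|Assistant): ", text)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

# Hypothetical sample in the same format as the dataset's 'text' field.
sample = "### Human: What is LoRA?### Assistant: A low-rank adaptation method."
for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

The inference helper at the end of the notebook builds its prompt with the same two markers, so the model sees the format it was trained on.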

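### LoRA components:

- Original weight matrix: $W$
- Low-rank matrix: $A$ (size $d \times r$)
- Low-rank matrix: $B$ (size $r \times d$)
- Scaling factor: $\alpha$ (`lora_alpha`); the low-rank update is multiplied by $\alpha / r$
- Rank of the decomposition: $r$

### LoRA weight update formula:

In LoRA, the updated weight matrix $W'$ is expressed as:

$$W' = W + \frac{\alpha}{r} \cdot (A \cdot B)$$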

[Reference: Efficient Fine-tuning with PEFT and LoRA](https://heidloff.net/article/efficient-fine-tuning-lora/)
![LoRA illustration](https://heidloff.net/assets/img/2023/08/lora.png)

```
import numpy as np

# Settings
d = 512  # assumed dimension size
r = 16
lora_alpha = 16
scale = lora_alpha / r

# Original weight matrix W
W = np.random.randn(d, d)

# Low-rank matrices A and B
A = np.random.randn(d, r)
B = np.random.randn(r, d)

# Compute the final weight matrix W'
W_prime = W + scale * np.dot(A, B)

# Print the results (optional)
print("Original Weights (W):", W)
print("Low-Rank Matrices (A, B):", A, B)
print("Merged Weights (W'):", W_prime)
```
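To make the "low-rank" part concrete, the sketch below checks two properties of the same toy setup: the product of the two factors has rank at most `r`, and the factors hold far fewer values than a full `d × d` matrix. It also shows that zero-initializing one factor, as standard LoRA does, makes the update start at zero. The seed and variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
d, r = 512, 16

A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))
delta_W = A @ B  # the low-rank update, shape (d, d)

full_params = d * d            # values in a full d x d update
lora_params = A.size + B.size  # values actually trained with LoRA

print("shape of update:", delta_W.shape)
print("rank of update:", np.linalg.matrix_rank(delta_W))
print("trainable fraction:", lora_params / full_params)

# With one factor initialized to zero, the update is zero,
# so W' equals W before any training step.
B_init = np.zeros((r, d))
print("update starts at zero:", np.allclose(A @ B_init, 0.0))
```

For `d = 512` and `r = 16` the two factors hold 16,384 values, against 262,144 for a full matrix.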



Set up the quantization hyperparameters, resize the embeddings to account for the new vocabulary size, and then define the LoRA config.

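```
load_in_4bit=True: loads the model with 4-bit quantization.
bnb_4bit_quant_type="nf4": sets the 4-bit quantization type. nf4 is the 4-bit NormalFloat data type introduced with QLoRA, designed for normally distributed weights.
bnb_4bit_compute_dtype=compute_dtype: sets the data type used for computation. float16 is used here.
bnb_4bit_use_double_quant=True: enables double quantization, which also quantizes the quantization constants to save additional memory.
```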
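A back-of-the-envelope sketch of what double quantization saves, using the block sizes described in the QLoRA paper: one 32-bit constant per block of 64 weights, and, with double quantization, 8-bit constants that are themselves grouped in blocks of 256 with one 32-bit constant each. Treat these block sizes as assumptions about the setup rather than values read from the library; the function name is hypothetical.

```python
def constant_overhead_bits(block_size: int = 64,
                           double_quant: bool = False,
                           second_block_size: int = 256) -> float:
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant shared by each block of weights.
        return 32 / block_size
    # 8-bit first-level constants, plus one 32-bit second-level
    # constant per group of first-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)

single = constant_overhead_bits(double_quant=False)
double = constant_overhead_bits(double_quant=True)
saved_bits = single - double

print(f"overhead without double quantization: {single:.3f} bits/weight")
print(f"overhead with double quantization:    {double:.3f} bits/weight")

# Assumption: a nominal 13e9 parameters for the 13B model.
saved_gb = saved_bits * 13e9 / 8 / 1e9
print(f"approximate saving for 13B parameters: {saved_gb:.2f} GB")
```

The saving is small per weight (about 0.37 bits) but adds up to roughly 0.6 GB for a 13B model.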

In [None]:
# Quantization
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0}, token=access_token
)

#Resize the embeddings
model.resize_token_embeddings(len(tokenizer))

#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

#Gradient checkpointing is used by default but not compatible with caching
model.config.use_cache = False

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ["gate_proj", "down_proj", "up_proj"]
)

config.json:   0%|          | 0.00/610 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [None]:
training_arguments = TrainingArguments(
        output_dir="./results",
        eval_strategy="steps",
        do_eval=True,
        # per_device_train_batch_size=2,
        # gradient_accumulation_steps=8,
        per_device_train_batch_size=1,   #L4
        gradient_accumulation_steps=4,   #L4
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=100,
        logging_steps=50,
        learning_rate=4e-4,
        eval_steps=200,
        fp16=True,
        num_train_epochs=1,
        warmup_steps=100,
        lr_scheduler_type="cosine",
)
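The arguments above combine a 100-step warmup with a cosine schedule and a peak learning rate of 4e-4. The sketch below shows the shape such a schedule typically has: a linear ramp followed by a half-cosine decay to zero. It is an illustration, not the library's implementation; the function `lr_at` is hypothetical, and the total of 2,461 steps is taken from the training log of this run.

```python
import math

def lr_at(step: int,
          peak_lr: float = 4e-4,
          warmup_steps: int = 100,
          total_steps: int = 2461) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine from peak_lr down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 1280, 2461):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With only about 2,500 steps in total, the warmup covers roughly the first 4% of training.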

In [None]:
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.



Using auto half precision backend
Currently training with a batch size of: 1
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 2,461
  Number of trainable parameters = 36,372,480
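Two numbers in this log can be reproduced by hand. The sketch below recomputes the number of optimization steps from the batch settings and the number of trainable parameters from the LoRA config. The Llama 2 13B dimensions (hidden size 5120, MLP intermediate size 13824, 40 layers) are stated here as assumptions about the model architecture, and the floor division is an assumption about how the trainer rounds the step count.

```python
import math

# Values from the training arguments and the log above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_examples = 9846

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
print("effective batch size:", effective_batch)
print("optimization steps:", steps_per_epoch)

# Assumed Llama 2 13B dimensions.
hidden_size, intermediate_size, num_layers = 5120, 13824, 40
r = 16

# (in_features, out_features) of each targeted MLP projection.
target_shapes = {
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# Each adapted layer adds two factors with r * in_features and
# r * out_features values respectively.
params_per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable_params = num_layers * params_per_layer
print("trainable parameters:", f"{trainable_params:,}")
```

Both results agree with the log: 2,461 steps and 36,372,480 trainable parameters, which is well under 1% of the 13 billion base weights.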



Saving model checkpoint to ./results/checkpoint-100

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-13b-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-13b-hf is restricted. You must be authenticated to access it. - silently ignoring the lookup for the file config.json in meta-llama/Llama-2-13b-hf.
tokenizer config file saved in ./results/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-100/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 518
  Batch size = 4

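| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200 | 1.2436 | 1.187809 |
| 400 | 1.1861 | 1.178797 |
| 600 | 1.2373 | 1.174539 |
| 800 | 1.1678 | 1.172978 |
| 1000 | 1.2035 | 1.168441 |
| 1200 | 1.1907 | 1.164653 |
| 1400 | 1.1796 | 1.157988 |
| 1600 | 1.1884 | 1.153618 |
| 1800 | 1.194 | 1.150136 |
| 2000 | 1.1697 | 1.148685 |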




TrainOutput(global_step=2461, training_loss=1.1975788928686435, metrics={'train_runtime': 19561.4152, 'train_samples_per_second': 0.503, 'train_steps_per_second': 0.126, 'total_flos': 2.5998205963542528e+17, 'train_loss': 1.1975788928686435, 'epoch': 0.9997968718261223})

Test inference with an adapter checkpoint saved during training. Note that the code below loads `checkpoint-600`, not the final checkpoint.

In [None]:
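# Load the adapter weights saved at step 600
model = PeftModel.from_pretrained(model, "./results/checkpoint-600")

def generate(instruction):
    # Use the same "### Human: ...### Assistant: " format as the training data
    prompt = "### Human: "+instruction+"### Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
            input_ids=input_ids,
            # do_sample is not set here; with the GenerationConfig default
            # (do_sample=False) decoding is greedy and temperature, top_p
            # and top_k are not used
            generation_config=GenerationConfig(temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256
    )
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq)
        # Print only the text that follows the assistant marker
        print(output.split("### Assistant: ")[1].strip())

generate("Tell me about gravitation.")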

1. Gravitation is a natural phenomenon that occurs when objects with mass attract each other.

2. The force of gravity between two objects is proportional to the product of their masses and inversely proportional to the square of the distance between them.

3. Gravity is the reason why objects fall to the ground when dropped and why planets orbit around the sun.

4. Gravity is also responsible for the formation of galaxies, stars, and planets.

5. The theory of gravitation was first proposed by Isaac Newton in his book "Principia Mathematica" in 1687.

6. Newton's law of universal gravitation states that every particle attracts every other particle in the universe with a force that is proportional to the product of their masses and inversely proportional to the square of the distance between them.

7. The law of universal gravitation is one of the most important and fundamental laws in physics and has been used to explain a wide range of phenomena, from the motion of planets and stars 