<a href="https://colab.research.google.com/github/choki0715/lecture/blob/master/Alpaca_%2B_Llama_3_8b_Unsloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a low-memory Llama 3 model with Unsloth

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes, etc.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
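
To put the 16-bit versus 4-bit choice in perspective, here is a rough back-of-the-envelope estimate of weight storage. It assumes about 8 billion parameters for Llama 3 8B and ignores quantization metadata, activations, and optimizer state, so treat the result as a lower bound rather than a measurement. `weight_gib` is a hypothetical helper.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: ~8e9 parameters; the real count differs slightly.
params = 8e9
for bits in (16, 4):
    print(f"{bits}-bit: ~{weight_gib(params, bits):.1f} GiB")
```

The 4-bit figure is a quarter of the 16-bit one, which is the main reason a QLoRA run fits on smaller GPUs.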

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [3]:
# Inspect the model parameters
for name, param in model.named_parameters():
    print(f"Parameter: {name}, dtype: {param.dtype}, shape: {param.shape}")

Parameter: model.embed_tokens.weight, dtype: torch.float16, shape: torch.Size([128256, 4096])
Parameter: model.layers.0.self_attn.q_proj.weight, dtype: torch.uint8, shape: torch.Size([8388608, 1])
Parameter: model.layers.0.self_attn.k_proj.weight, dtype: torch.uint8, shape: torch.Size([2097152, 1])
Parameter: model.layers.0.self_attn.v_proj.weight, dtype: torch.uint8, shape: torch.Size([2097152, 1])
Parameter: model.layers.0.self_attn.o_proj.weight, dtype: torch.uint8, shape: torch.Size([8388608, 1])
Parameter: model.layers.0.mlp.gate_proj.weight, dtype: torch.uint8, shape: torch.Size([29360128, 1])
Parameter: model.layers.0.mlp.up_proj.weight, dtype: torch.uint8, shape: torch.Size([29360128, 1])
Parameter: model.layers.0.mlp.down_proj.weight, dtype: torch.uint8, shape: torch.Size([29360128, 1])
Parameter: model.layers.0.input_layernorm.weight, dtype: torch.float16, shape: torch.Size([4096])
Parameter: model.layers.0.post_attention_layernorm.weight, dtype: torch.float16, shape: torch.S
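
The `torch.uint8` shapes in this listing look odd until you account for packing: two 4-bit values fit in one byte, so a quantized weight shows up as a flat column holding half as many bytes as it has elements. The sketch below checks that arithmetic. The hidden size of 4096 is visible in the embedding shape above; the key/value width of 1024 and the MLP width of 14336 are assumptions from the published Llama 3 8B configuration. `packed_bytes` is a hypothetical helper.

```python
# Assumed Llama 3 8B projection shapes as (out_features, in_features).
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}

def packed_bytes(out_features: int, in_features: int) -> int:
    """Bytes needed when two 4-bit weights share one uint8."""
    return out_features * in_features // 2

for name, (out_f, in_f) in shapes.items():
    print(f"{name}: torch.Size([{packed_bytes(out_f, in_f)}, 1])")
```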

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import pandas as pd
from datasets import Dataset

file_path = "/content/drive/MyDrive/LLM/lecture/"
file_name = "train_data.csv"
df = pd.read_csv(file_path+file_name)
print(df)
dataset = Dataset.from_pandas(df)
dataset

           instruction                                input  \
0         conversation                         지금 어디에서 강의해?   
1         conversation                             강사이름이 뭐야   
2         conversation                        어떤 내용을 가르치는대?   
3          explanation                     강사 김의중에 대해 설명해줘?   
4          explanation                           요즘 날씨가 어때?   
5         conversation                         강의는 총 몇시간이지?   
6         conversation                       강의 끝나고 시험 볼꺼야?   
7         conversation                혹시 궁금한 사항이 있으면 어떻게 해?   
8         conversation                           교재는 따로 있어?   
9         conversation  수업을 모두 들으면 인공지능 실력을 어느정도로 예상할 수 있어?   
10         explanation                  여름철 더위를 이기는 방법을 알려줘   
11         explanation                    효율적인 공부방법에 대해 알려줘   
12  sentiment analysis             오늘 오징어 게임은 정말 재잇었어!!! 최고   
13  sentiment analysis             운동경기에서 오심이 많으면 인기가 줄어들거야   

                                               output 

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 14
})
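
Before handing a CSV to `Dataset.from_pandas`, it can help to confirm that it has the three columns the later cells rely on. This is a small standard-library sketch; `REQUIRED_COLUMNS`, `check_columns`, and the inline sample rows are hypothetical stand-ins, not part of the notebook's data.

```python
import csv
import io

REQUIRED_COLUMNS = ("instruction", "input", "output")  # column names used in this notebook

def check_columns(csv_text: str) -> int:
    """Return the row count, raising ValueError if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return sum(1 for _ in reader)

# Hypothetical two-row sample in the same layout as train_data.csv.
sample = (
    "instruction,input,output\n"
    "conversation,Hello?,Hi there\n"
    "explanation,What is LoRA?,An adapter method\n"
)
print(check_columns(sample))  # 2
```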

# Preparing your own training data

In [5]:

# Set up for the formatting function: reuse the EOS token as the pad token
tokenizer.pad_token = tokenizer.eos_token
# Older version: tokenizer.padding_side = "right"
eos_token = tokenizer.eos_token

def convert_data(example):
    # Handle potential missing fields
    instruction = example.get('instruction', '')
    input_text = example.get('input', '')
    output_text = example.get('output', '')

    # Ensure instruction, input_text, and output_text are strings
    if not isinstance(instruction, str):
        instruction = str(instruction)  # Convert to string if not already
    if not isinstance(input_text, str):
        input_text = str(input_text)
    if not isinstance(output_text, str):
        output_text = str(output_text)

    # Build the 'text' column
    text = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output_text}{eos_token}"
    example["text"] = text
    return example


# Apply the formatting to the dataset (adds the 'text' column; no tokenization happens here)
dataset =  dataset.map(convert_data)
dataset[2]

Map:   0%|          | 0/14 [00:00<?, ? examples/s]

{'instruction': 'conversation',
 'input': '어떤 내용을 가르치는대?',
 'output': '응 인공지능 개요하고 LLM 사용법을 배우고 있어',
 'text': '### Instruction:\nconversation\n\n### Input:\n어떤 내용을 가르치는대?\n\n### Response:\n응 인공지능 개요하고 LLM 사용법을 배우고 있어<|end_of_text|>'}

# Building an Alpaca-style prompt

In [6]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Illustrative values, kept in Korean to match the training data
instruct = "파인 튜닝 방법에 대한 학습"  # "Learning about fine-tuning methods"
input = "LoRA에 대해 설명해줘"  # "Explain LoRA"
output = "LoRA는 가장 성능이 뛰어난 부분학습 방법이야"  # "LoRA is the best-performing partial-training method"

alpaca_prompt.format(instruct, input, output)

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n파인 튜닝 방법에 대한 학습\n\n### Input:\nLoRA에 대해 설명해줘\n\n### Response:\nLoRA는 가장 성능이 뛰어난 부분학습 방법이야'

In [7]:

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

# from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# dataset = dataset.map(formatting_prompts_func, batched = True,)
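
With `batched = True`, `Dataset.map` passes each column as a list, which is why `formatting_prompts_func` zips three lists and returns a list of strings. The sketch below exercises the same logic on a plain dictionary so it runs without a tokenizer. The `<|end_of_text|>` value is taken from the sample output earlier in this notebook, and the batch contents are made up for illustration.

```python
EOS_TOKEN = "<|end_of_text|>"  # assumption: the EOS string seen in this notebook's sample output

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    """Format a batch (dict of column lists) into Alpaca-style training text."""
    texts = []
    for instruction, input_text, output_text in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Append EOS so the model learns where a response ends.
        texts.append(alpaca_prompt.format(instruction, input_text, output_text) + EOS_TOKEN)
    return {"text": texts}

# Made-up batch in the shape Dataset.map(..., batched=True) would pass in.
batch = {
    "instruction": ["conversation", "explanation"],
    "input": ["Hello?", "What is LoRA?"],
    "output": ["Hi there", "A low-rank adapter method"],
}
formatted = formatting_prompts_func(batch)
print(formatted["text"][0])
```

Note that `convert_data` earlier builds the `text` column without the "Below is an instruction..." preamble, while the inference cells prompt with `alpaca_prompt`. Mapping this function over the dataset instead would give training and inference the same template.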

# Inference with the Unsloth model before training

In [8]:
# alpaca_prompt = Copied from above
# Try inference on a general prompt

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 16, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n13\n<|end_of_text|>']

In [9]:
# Ask about something new

inputs = tokenizer(
[
    alpaca_prompt.format(
        "explanation", # instruction
        "공부하는 방법을 설명해줘", # input: "Explain how to study"
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nexplanation\n\n### Input:\n공부하는 방법을 설명해줘\n\n### Response:\nI study by doing homework every day.<|end_of_text|>']

In [10]:
# Run PEFT using LoRA
# Configure the LoRA model

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [21]:
# Inspect the model parameters
for name, param in model.named_parameters():
    print(f"Parameter: {name}, dtype: {param.dtype}, shape: {param.shape}")

Parameter: base_model.model.model.embed_tokens.weight, dtype: torch.float16, shape: torch.Size([128256, 4096])
Parameter: base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight, dtype: torch.uint8, shape: torch.Size([8388608, 1])
Parameter: base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight, dtype: torch.float32, shape: torch.Size([16, 4096])
Parameter: base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight, dtype: torch.float32, shape: torch.Size([4096, 16])
Parameter: base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight, dtype: torch.uint8, shape: torch.Size([2097152, 1])
Parameter: base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight, dtype: torch.float32, shape: torch.Size([16, 4096])
Parameter: base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight, dtype: torch.float32, shape: torch.Size([1024, 16])
Parameter: base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight, dtype: t
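
Each adapted layer gains two small matrices, `lora_A` with shape `[r, in_features]` and `lora_B` with shape `[out_features, r]`, so it adds `r * (in_features + out_features)` trainable weights. Summing that over the seven target modules and 32 layers reproduces the trainable-parameter count that the training banner reports later in this notebook. The MLP width of 14336 is an assumption from the published Llama 3 8B configuration; the other dimensions appear in the listing above. `lora_params` is a hypothetical helper.

```python
R = 16           # LoRA rank passed to get_peft_model
NUM_LAYERS = 32  # from the "patched 32 layers" message

# (in_features, out_features) per target module; 14336 is an assumed MLP width.
modules = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(in_features: int, out_features: int, r: int = R) -> int:
    """Trainable weights added by one LoRA adapter (lora_A plus lora_B)."""
    return r * (in_features + out_features)

total = NUM_LAYERS * sum(lora_params(i, o) for i, o in modules.values())
print(f"{total:,}")  # 41,943,040
```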

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 60 steps to keep things fast; for a full pass over the data, set `num_train_epochs=1` and remove the `max_steps` argument. We also support TRL's `DPOTrainer`!

# Training on your own data

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/14 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
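
The `warmup_steps`, `max_steps`, `learning_rate`, and `lr_scheduler_type = "linear"` settings together describe a ramp up followed by a straight-line decay to zero. The following is a minimal sketch of that shape, written from the standard linear-warmup formula rather than taken from the trainer's source; `linear_lr` is a hypothetical helper.

```python
def linear_lr(step: int, peak: float = 2e-4, warmup: int = 5, total: int = 60) -> float:
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup:
        # Ramp from 0 up to the peak over the warmup steps.
        return peak * step / max(1, warmup)
    # Decay linearly from the peak down to 0 at the final step.
    return peak * max(0.0, (total - step) / max(1, total - warmup))

for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_lr(step):.2e}")
```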


In [12]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
5.654 GB of memory reserved.


In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 14 | Num Epochs = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,3.7466
2,3.3383
3,3.7295
4,3.8589
5,2.9726
6,3.1342
7,2.1985
8,2.0647
9,1.6338
10,1.489
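
The banner's `Total batch size = 8` and `Num Epochs = 60` follow from simple arithmetic on the training arguments. The sketch below reproduces those two numbers under the assumption that the trainer floors the number of optimizer updates per epoch, with a minimum of one. That matches the banner here, but it is a reading of the output, not a statement about the trainer's internals.

```python
import math

num_examples = 14
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60

total_batch = per_device_batch * grad_accum * num_gpus
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch)  # 7
updates_per_epoch = max(micro_batches_per_epoch // grad_accum, 1)     # 1
num_epochs = math.ceil(max_steps / updates_per_epoch)                 # 60

print(total_batch, updates_per_epoch, num_epochs)  # 8 1 60
```

With only 14 examples, 60 steps means the model sees each example many times, which helps explain why the loss drops so quickly in the table above.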


In [14]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

114.1637 seconds used for training.
1.9 minutes used for training.
Peak reserved memory = 6.15 GB.
Peak reserved memory for training = 0.496 GB.
Peak reserved memory % of max memory = 15.544 %.
Peak reserved memory for training % of max memory = 1.254 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [15]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 4, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n13\n\n### Input']

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output!

In [31]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 4)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
13

### Input


In [16]:

inputs = tokenizer(
[
    alpaca_prompt.format(
        "conversation", # instruction
        "교재는 따로있어?", # input: "Is there a separate textbook?"
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nconversation\n\n### Input:\n교재는 따로있어?\n\n### Response:\n강사가 저술한 <딥러닝 개념과 활용> <알고리즘으로 배우는 인공지능, 머신러닝, 딥러닝>이 주 교재야. 두가지 모두 아마존에서 따로 살 수 있어.\n\n### Response:\n강사가 저술한']
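
The decoded string above contains the whole prompt, and the generation continues into a second `### Response:` block instead of stopping after the first answer. One way to recover just that first answer is to cut the text at the response marker and again at the next `###` header. `extract_response` is a hypothetical helper, and the special-token strings are the ones visible in this notebook's outputs.

```python
def extract_response(decoded: str) -> str:
    """Return the first generated answer from an Alpaca-style decoded string."""
    marker = "### Response:\n"
    # Keep everything after the first response marker.
    answer = decoded.split(marker, 1)[1] if marker in decoded else decoded
    # Drop any follow-on section the model started on its own.
    answer = answer.split("\n\n###", 1)[0]
    # Remove special tokens seen in this notebook's outputs.
    for token in ("<|begin_of_text|>", "<|end_of_text|>"):
        answer = answer.replace(token, "")
    return answer.strip()

# Shortened, made-up decoded string in the same layout as the outputs above.
sample = "<|begin_of_text|>Below is an instruction\n\n### Response:\n13\n\n### Input"
print(extract_response(sample))  # 13
```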

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [17]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

To load the LoRA adapters we just saved for inference, run the cell below. The `if True:` guard is already enabled here; set it to `False` to skip reloading.

# Loading the saved model

In [18]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "/content/lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "conversation", # instruction
        "오늘 날씨가 어때?", # input: "How's the weather today?"
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 32, use_cache = True)
tokenizer.batch_decode(outputs)



unsloth/llama-3-8b-bnb-4bit does not have a padding token! Will use pad_token = <|reserved_special_token_250|>.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nconversation\n\n### Input:\n오늘 날씨가 어때?\n\n### Response:\n장마가 시작되서 비가 자주오니까 우산 꼭 가지고 다녀.\n\n### Response:\n강의 끝나고 시험 볼꺼야']