#### 모델 준비

여기에서는 [카카오 나노 2.1b Instruct 모델](https://huggingface.co/kakaocorp/kanana-nano-2.1b-instruct)을 사용하겠습니다. 사용자의 지시를 수행할 수 있도록 미세조정이 되어 있는 모델입니다. 이 모델에 추가로 커스텀 질문답변 데이터셋을 훈련시켜보겠습니다.

업데이트
- [MS Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct)도 4090에서 파인튜닝되는 것을 확인했습니다. MIT 라이센스입니다.

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from torch.utils.data import Dataset
import os
from peft import LoraConfig
from trl import SFTTrainer

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
model_name = "kakaocorp/kanana-nano-2.1b-instruct" # "-instruct" 지시에 따르도록 파인튜닝(사후훈련)이 된 모델
# model_name = "kakaocorp/kanana-nano-2.1b-base" # base 모델로도 지시 훈련이 됩니다.
# model_name = "microsoft/Phi-4-mini-instruct" # MIT 라이센스라서 상업적 사용 가능, 아래에서 epoch 50번 정도면 훈련 됩니다.

lora_config = LoraConfig(
    r=8,
    lora_alpha = 8,
    lora_dropout = 0.05,
    target_modules=["q_proj", "o_proj", "k_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(model_name, device_map = 'auto')

# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     torch_dtype=torch.bfloat16,
#     # torch_dtype="auto", # Phi-4-mini 모델
#     trust_remote_code=True,
# ).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token # <|eot_id|> 128009

챗 메시지 템플릿 적용 예시

| Token               | ID  |
|---------------------|--------|
| <\|begin_of_text\|>   | 128000 |
| <\|end_of_text\|>     | 128001 |
| <\|start_header_id\|> | 128006 |
| <\|end_header_id\|>   | 128007 |
| <\|eot_id\|>          | 128009 |
| \n\n                  | 127 |


챗템플릿 적용 (토큰화 X)
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou ... kakao.<|eot_id|>
<|start_header_id|>user<|end_header_id|>\n\n아인슈타인이 ... ?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>\n\n아인슈타인은 헬다이버즈2를 ... <|eot_id|>
<|start_header_id|>assistant<|end_header_id|>\n\n
```

토큰화 후
```
128000, 128006, 9125, 128007, 271, 2675, 527, ... , 128009, 
128006, 882, 128007, 271, 112032, 30381, 101555, 20565, 117004, 44005, 108573, 34804, 30, 128009, 
128006, 78191, 128007, 271, 112032, ..., 13, 128009, 
128006, 78191, 128007, 271
```



In [3]:
messages = [
    {"role": "system", "content": "You are a helpful AI assistant developed by Kakao."},
    {"role": "user", "content": "아인슈타인이 좋아하는 게임은?"},
    {"role": "assistant", "content":"아인슈타인은 헬다이버즈2를 좋아해서 자주합니다."}
]

tokens = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(tokens)
#print(tokenizer.encode(tokens, add_special_tokens=False))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant developed by Kakao.<|eot_id|><|start_header_id|>user<|end_header_id|>

아인슈타인이 좋아하는 게임은?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

아인슈타인은 헬다이버즈2를 좋아해서 자주합니다.<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [4]:
qna_list = []
with open("dycustomdata.txt", "r") as file:
    for line in file:
        qna = line.strip().split('|') # 안내: 입력 문서의 '|'는 질문과 답변을 구분하는 문자
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant developed by Kakao."}, # 모든 질문 공통
            {"role": "user", "content": qna[0]},     # 질문 부분
            {"role": "assistant", "content": qna[1]} # 답변 부분
        ]
        q = tokenizer.apply_chat_template(messages[:2], tokenize=False, add_generation_prompt=True)
        input_str = tokenizer.apply_chat_template(messages[:3], tokenize=False, add_generation_prompt=True)
        # print(input_str)
        input_str = input_str[:-len('start_header_id\>assistant<|end_header_id|>')-4]
        # print(input_str)
        # print("--------------")
        q_ids = tokenizer.encode(q, add_special_tokens=False)
        input_ids = tokenizer.encode(input_str, add_special_tokens=False)
        qna_list.append({'q':q, 'input':input_str, 'q_ids':q_ids, 'input_ids':input_ids})

max_length = max(len(i['input_ids']) for i in qna_list)

print(qna_list)
print(max_length) # 토큰화 후에 가장 긴 길이 (패딩으로 채우기 위함)

[{'q': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant developed by Kakao.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n다음 숫자들을 얘기해봐 12345<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 'input': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant developed by Kakao.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n다음 숫자들을 얘기해봐 12345<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n67890.<|eot_id|>', 'q_ids': [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 15592, 18328, 8040, 555, 75571, 3524, 13, 128009, 128006, 882, 128007, 271, 13447, 49531, 70292, 93287, 105880, 123715, 21121, 34983, 122722, 220, 4513, 1774, 128009, 128006, 78191, 128007, 271], 'input_ids': [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 15592, 18328, 8040, 555, 75571, 3524, 13, 128009, 128006, 882, 128007, 271, 13447, 49531, 70292, 93287, 105880, 123715, 21121, 34983, 12

[파이토치 CrossEntropy의 ignore index = -100](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)

In [25]:
print(qna_list[0])

{'q': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant developed by Kakao.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n다음 숫자들을 얘기해봐 12345<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 'input': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant developed by Kakao.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n다음 숫자들을 얘기해봐 12345<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n67890.<|eot_id|>', 'q_ids': [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 15592, 18328, 8040, 555, 75571, 3524, 13, 128009, 128006, 882, 128007, 271, 13447, 49531, 70292, 93287, 105880, 123715, 21121, 34983, 122722, 220, 4513, 1774, 128009, 128006, 78191, 128007, 271], 'input_ids': [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 15592, 18328, 8040, 555, 75571, 3524, 13, 128009, 128006, 882, 128007, 271, 13447, 49531, 70292, 93287, 105880, 123715, 21121, 34983, 122

In [None]:
EOT = 128009

class MyDataset(Dataset):
    def __init__(self, qna_list, max_length):
        self.input_ids = []
        self.target_ids = []

        # token_ids = tokenizer.encode("<|endoftext|>" + txt, allowed_special={"<|endoftext|>"})
        for qa in qna_list:
            token_ids = qa['input_ids']
            q_ids_len = len(qa['q_ids'])

            input_chunk = token_ids[:max_length]
            target_chunk = token_ids[1:max_length+1]

            # Pad
            input_chunk += [EOT] * (max_length - len(input_chunk))
            target_chunk += [EOT] * (max_length - len(target_chunk))

            # 정답이 아닌 부분은 무시
            target_chunk[:q_ids_len - 1] = [-100] * (q_ids_len - 1)

            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {"input_ids": self.input_ids[idx],"labels": self.target_ids[idx]}

dataset = MyDataset(qna_list, max_length=max_length)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
#optimizer = torch.optim.AdamW(model.parameters(), lr=0.00001, weight_decay=0.01)

cuda


In [7]:
# 파인튜닝 전에 어떻게 대답하는지 확인
questions = [ qna['q_ids'] for qna in qna_list]

for i, q_ids in enumerate(questions):

    model.eval()
    with torch.no_grad():
        output = model.generate(
            torch.tensor([q_ids]).to("cuda"),
            max_new_tokens=32,
            #attention_mask = (input_ids != 0).long(),
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False,
            # temperature=1.2,
            # top_k=5
        )

    output_list = output.tolist()

    print(f"Q{i}: {tokenizer.decode(output[0], skip_special_tokens=True)}")



Q0: system

You are a helpful AI assistant developed by Kakao.user

다음 숫자들을 얘기해봐 12345assistant

물론입니다! 여기 주어진 숫자들 12345를 다시 말씀드릴게요:

1
2
3
4
5

필
Q1: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인이 좋아하는 과일은?assistant

아인슈타인(Albert Einstein)은 특정한 과일을 특별히 좋아한다고 알려진 바는 없습니다. 하지만 그는 과학자로서
Q2: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인이 좋아하는 게임은?assistant

아인슈타인(Albert Einstein)은 실제로 게임을 즐기지는 않았지만, 그의 창의적이고 분석적인 사고 방식은 다양한
Q3: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인이 자주 가는 여행지는?assistant

아인슈타인(Albert Einstein)은 매우 유명한 물리학자로, 그의 여행지는 주로 과학적 탐구와 연구를
Q4: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인의 취미는 무엇인가요?assistant

아인슈타인(Albert Einstein)은 매우 다재다능한 사람으로 알려져 있으며, 그의 취미와 관심사도 다양했습니다. 아
Q5: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인이 좋아하는 계절은 무엇인가요?assistant

아인슈타인(Albert Einstein)은 특정 계절을 특별히 좋아한다고 알려진 바는 없습니다. 그는 주로 이론 물리
Q6: system

You are a helpful AI assi

In [8]:
total_steps = len(dataset) * 1  # 예시로 1 epoch 기준, 필요한 만큼 조정
warmup_steps = int(0.03 * total_steps)  # 전체 단계 수의 3% 계산

In [9]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    #max_seq_length=512,
    args=TrainingArguments(
        output_dir="InstructModel",
        num_train_epochs = 1,
        max_steps=500,
        per_device_train_batch_size=8, # GPU당 4개 배치
        gradient_accumulation_steps=2, # gradient 반영을 1개 step 마다
        optim="paged_adamw_8bit",
        warmup_steps=warmup_steps,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=4,
        push_to_hub=False,
        report_to='none',
    ),
    peft_config=lora_config,   # LoRA 설정값
)

trainer.train()

Step,Training Loss
4,5.6852
8,4.8327
12,3.8768
16,2.7913
20,2.187
24,1.8311
28,1.5595
32,1.3549
36,1.192
40,1.06


TrainOutput(global_step=500, training_loss=0.5676786797046661, metrics={'train_runtime': 423.1464, 'train_samples_per_second': 18.906, 'train_steps_per_second': 1.182, 'total_flos': 5447121899520000.0, 'train_loss': 0.5676786797046661, 'epoch': 500.0})

In [15]:
# 파인튜닝 후에 어떻게 응답하는지 확인
model = AutoModelForCausalLM.from_pretrained("./InstructModel",device_map="auto")  # GPU에 자동으로 매핑

# 토크나이저 불러오기
tokenizer = AutoTokenizer.from_pretrained("./InstructModel")
model.eval()

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 1792, padding_idx=128001)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=1792, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=1792, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=1792, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=1792, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=1792, out_features=8064, bias=False)
          (up_proj): Linear4bit(in_features=1792, out_features=8064, bias=False)
          (down_proj): Linear4bit(in_features=8064, out_features=1792, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((1792,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((1792,), eps=1e-05)
      )
    )
    (norm): 

In [None]:
# import matplotlib.pyplot as plt

# plt.plot(losses)
# plt.xlabel('Epoch')
# plt.ylabel('Loss')
# plt.title('Training Loss Over Epochs')
# plt.show()

In [None]:
# 파인튜닝 후에 어떻게 대답하는지 확인
questions = [ qna['q_ids'] for qna in qna_list]

for i, q_ids in enumerate(questions):

    model.eval()
    with torch.no_grad():
        output = model.generate(
            torch.tensor([q_ids]).to("cuda"),
            max_new_tokens=32,
            attention_mask = (input_ids != 0).long(),
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False,
            # temperature=1.2,
            # top_k=5
        )

    output_list = output.tolist()
    print(f"Q{i}: {tokenizer.decode(output[0], skip_special_tokens=True)}")

Q0: system

You are a helpful AI assistant developed by Kakao.user

다음 숫자들을 얘기해봐 12345assistant

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Q1: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인이 좋아하는 과일은?assistant

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Q2: system

You are a helpful AI assistant developed by Kakao.user

아인슈타인이 좋아하는 게임은?assistant

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


KeyboardInterrupt: 

In [None]:
messages = [
            {"role": "system", "content": "You are a helpful AI assistant developed by Kakao."}, # 모든 질문 공통
            {"role": "user", "content": input()},     # 질문 부분
        ]

model.eval()
with torch.no_grad():
    ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
    output = model.generate(
        torch.tensor([ids]).to("cuda"),
        max_new_tokens=64,
        #attention_mask = (input_ids != 0).long(),
        pad_token_id=tokenizer.eos_token_id,
        do_sample=False,
        # temperature=1.2,
        # top_k=5
    )

output_list = output.tolist()

print(f"Q{i}: {tokenizer.decode(output[0], skip_special_tokens=True)}")        


Q15: system

You are a helpful AI assistant developed by Kakao.user

sadassistant

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
