<a href="https://colab.research.google.com/github/farshid101/Thesis-2024/blob/main/llam/llama2_example_kor_nli.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a Korean LLM with QLoRA: Train LLaMA2 7B as a Text Classifier

## 0. Install Libraries

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install trl

## 1. Prepare the Dataset

- We use the NLI dataset from KLUE.

In [2]:
from datasets import load_dataset, Dataset, DatasetDict
from dataclasses import dataclass, field
from typing import Optional
import torch
from peft import LoraConfig
from tqdm import tqdm
import pandas as pd
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, AutoTokenizer, pipeline
from trl import SFTTrainer

tqdm.pandas()

nli_data = load_dataset('klue', 'nli')

train_data = nli_data['train']
validation_data = nli_data['validation']


Generating train split: 24998 examples
Generating validation split: 3000 examples

In [14]:
# Convert to DataFrame for easier viewing
train_df = pd.DataFrame(train_data)
train_df.head()  # Show the first few rows of the DataFrame


|  | guid | source | premise | hypothesis | label |
|---|---|---|---|---|---|
| 0 | klue-nli-v1_train_00000 | NSMC | 힛걸 진심 최고다 그 어떤 히어로보다 멋지다 | 힛걸 진심 최고로 멋지다. | 0 |
| 1 | klue-nli-v1_train_00001 | NSMC | 100분간 잘껄 그래도 소닉붐땜에 2점준다 | 100분간 잤다. | 2 |
| 2 | klue-nli-v1_train_00002 | NSMC | 100분간 잘껄 그래도 소닉붐땜에 2점준다 | 소닉붐이 정말 멋있었다. | 1 |
| 3 | klue-nli-v1_train_00003 | NSMC | 100분간 잘껄 그래도 소닉붐땜에 2점준다 | 100분간 자는게 더 나았을 것 같다. | 1 |
| 4 | klue-nli-v1_train_00004 | airbnb | 101빌딩 근처에 나름 즐길거리가 많습니다. | 101빌딩 근처에서 즐길거리 찾기는 어렵습니다. | 2 |


In [15]:
# Convert to DataFrame for easier viewing
validation_data = pd.DataFrame(validation_data)
validation_data.head()  # Show the first few rows of the DataFrame


|  | guid | source | premise | hypothesis | label |
|---|---|---|---|---|---|
| 0 | klue-nli-v1_dev_00000 | airbnb | 흡연자분들은 발코니가 있는 방이면 발코니에서 흡연이 가능합니다. | 어떤 방에서도 흡연은 금지됩니다. | 2 |
| 1 | klue-nli-v1_dev_00001 | airbnb | 10명이 함께 사용하기 불편함없이 만족했다. | 10명이 함께 사용하기 불편함이 많았다. | 2 |
| 2 | klue-nli-v1_dev_00002 | airbnb | 10명이 함께 사용하기 불편함없이 만족했다. | 성인 10명이 함께 사용하기 불편함없이 없었다. | 1 |
| 3 | klue-nli-v1_dev_00003 | airbnb | 10명이 함께 사용하기 불편함없이 만족했다. | 10명이 함께 사용하기에 만족스러웠다. | 0 |
| 4 | klue-nli-v1_dev_00004 | airbnb | 10층에 건물사람들만 이용하는 수영장과 썬베드들이 있구요. | 건물사람들은 수영장과 썬베드를 이용할 수 있습니다. | 0 |


### Adding my dataset


In [16]:
df = pd.read_csv('/content/Final_clean pro 2.csv')
df.head()

|  | Comments | Targets |
|---|---|---|
| 0 | তোমরা সবাই এই ভাই তাকে একটু কোরোভিডিও গুলো দে... | positive |
| 1 | 🎯 ব্রহ্মপুত্র নদের ওপর নির্মিত ৯.১৫ কিলোমিটারে... | positive |
| 2 | 😊 ধন্যবাদ মাননীয় প্রধানমন্ত্রী শেখ হাসিনা | positive |
| 3 | 2022 এর 25 এ জুন 🥰🥰 | neutral |
| 4 | অনুভূতিটা এত সুন্দর ভাষায় প্রকাশ করতে পারতেছি... | positive |


In [18]:
# Rename the raw columns to the names used by the prompt template later on:
# "Comments" holds the comment text and "Targets" holds the sentiment label.
df = df.rename(columns={"Comments": "hypothesis", "Targets": "label"})
df.head()  # Display the first few rows of the DataFrame

|  | hypothesis | label |
|---|---|---|
| 0 | তোমরা সবাই এই ভাই তাকে একটু কোরোভিডিও গুলো দে... | positive |
| 1 | 🎯 ব্রহ্মপুত্র নদের ওপর নির্মিত ৯.১৫ কিলোমিটারে... | positive |
| 2 | 😊 ধন্যবাদ মাননীয় প্রধানমন্ত্রী শেখ হাসিনা | positive |
| 3 | 2022 এর 25 এ জুন 🥰🥰 | neutral |
| 4 | অনুভূতিটা এত সুন্দর ভাষায় প্রকাশ করতে পারতেছি... | positive |


In [19]:
df.value_counts('label')

| label | count |
|---|---|
| positive | 2713 |
| negative | 1895 |
| neutral | 1767 |
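
The classes are moderately imbalanced, which matters when reading the accuracy printed at the end of the notebook. As a rough reference point, the sketch below computes the accuracy of always predicting the most frequent class, using the counts from the output above. It covers the whole file; the validation slice used later may have a different label distribution.

```python
# Label counts copied from the value_counts output above.
label_counts = {"positive": 2713, "negative": 1895, "neutral": 1767}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

for label, count in label_counts.items():
    print(f"{label:>8}: {count:5d} ({count / total:.1%})")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.3f}")
```

A fine-tuned classifier should clear this baseline by a comfortable margin before its accuracy is treated as meaningful.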


In [21]:
df['label'] = df.label.map({'positive': 1, 'neutral': 0, 'negative': 2})
df.head()

|  | hypothesis | label |
|---|---|---|
| 0 | তোমরা সবাই এই ভাই তাকে একটু কোরোভিডিও গুলো দে... | 1 |
| 1 | 🎯 ব্রহ্মপুত্র নদের ওপর নির্মিত ৯.১৫ কিলোমিটারে... | 1 |
| 2 | 😊 ধন্যবাদ মাননীয় প্রধানমন্ত্রী শেখ হাসিনা | 1 |
| 3 | 2022 এর 25 এ জুন 🥰🥰 | 0 |
| 4 | অনুভূতিটা এত সুন্দর ভাষায় প্রকাশ করতে পারতেছি... | 1 |


In [27]:
# Each row holds a single comment, so the same text is used as both premise and hypothesis.
df["premise"] = df["hypothesis"]

### Using my dataset for training and validation


In [28]:
df.count()

| column | non-null count |
|---|---|
| hypothesis | 6375 |
| label | 6375 |
| premise | 6375 |


In [29]:
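# Rows are taken in file order, without shuffling.
# Rows 4000-4004 and 5000-6374 are not used by either split.
train_data = df[:4000]
validation_data = df[4005:5000]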
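
The slices above follow the order of the CSV file. If the file happens to be grouped by source or label, a shuffled split may be more representative. The sketch below is one way to draw disjoint, shuffled row positions with a fixed seed. The helper name `shuffled_split` and the seed value are illustrative choices, not part of the original notebook.

```python
import random

def shuffled_split(n_rows, n_train, n_val, seed=42):
    """Return disjoint, shuffled row positions for training and validation."""
    if n_train + n_val > n_rows:
        raise ValueError("n_train + n_val exceeds the number of rows")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # local RNG, so global random state is untouched
    return positions[:n_train], positions[n_train:n_train + n_val]

# Same sizes as the slices above: 4000 training rows and 995 validation rows.
train_idx, val_idx = shuffled_split(n_rows=6375, n_train=4000, n_val=995)
print(len(train_idx), len(val_idx))
# With pandas, these positions could be applied as df.iloc[train_idx] and df.iloc[val_idx].
```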

In [35]:
# validation_data is already a DataFrame slice, so it can be inspected directly.
validation_data.head()  # Show the first few rows of the DataFrame

|  | hypothesis | label | premise |
|---|---|---|---|
| 4005 | নাশকতাকারী কে?দেশের জনগণ কে | 0 | নাশকতাকারী কে?দেশের জনগণ কে |
| 4006 | তোর বিচার আল্লাহ করবে | 1 | তোর বিচার আল্লাহ করবে |
| 4007 | কারো চোখের পানি পড়েনি বরং জনগণ আরো খুশি | 1 | কারো চোখের পানি পড়েনি বরং জনগণ আরো খুশি |
| 4008 | সেখ হাসিনা কোথায় দেশে নেই তার কাছে পিকচার দেখা... | 0 | সেখ হাসিনা কোথায় দেশে নেই তার কাছে পিকচার দেখা... |
| 4009 | এই ক্ষতি জন্য প্রধানমন্ত্রীর দাই । | 0 | এই ক্ষতি জন্য প্রধানমন্ত্রীর দাই । |


- Transform the dataset into the training format expected by a generative model.

In [36]:
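# Map the integer labels back to the sentiment names used as generation targets.
id_to_label = {0:'neutral', 1:'positive', 2:'negative'}

# Korean instruction written for the NLI task. In English: "Classify the relationship between
# the following two sentences as one of entailment, neutral, contradiction."
# Note: these label names do not match the sentiment targets in id_to_label.
question_template = "### Human: 다음 두 문장의 관계를 entailment, neutral, contradiction 중 하나로 분류해줘. "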

train_instructions = [f'{question_template}\npremise: {x}\nhypothesis: {y}\n\n### Assistant: {id_to_label[z]}' for x,y,z in zip(train_data['premise'],train_data['hypothesis'],train_data['label'])]
validation_instructions = [f'{question_template}\npremise: {x}\nhypothesis: {y}\n\n### Assistant: {id_to_label[z]}' for x,y,z in zip(validation_data['premise'],validation_data['hypothesis'],validation_data['label'])]

ds_train = Dataset.from_dict({"text": train_instructions})
ds_validation = Dataset.from_dict({"text": validation_instructions})
instructions_ds_dict = DatasetDict({"train": ds_train, "eval": ds_validation})

In [37]:
instructions_ds_dict['train']['text'][0]

'### Human: 다음 두 문장의 관계를 entailment, neutral, contradiction 중 하나로 분류해줘. \npremise: তোমরা সবাই এই ভাই তাকে একটু  কোরোভিডিও গুলো দেখো ভাল লাগলে  কোরো 🙏 ধন্যবাদ❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️\nhypothesis: তোমরা সবাই এই ভাই তাকে একটু  কোরোভিডিও গুলো দেখো ভাল লাগলে  কোরো 🙏 ধন্যবাদ❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️\n\n### Assistant: positive'
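
The example above shows that the instruction asks for entailment, neutral, or contradiction while the target is a sentiment label, and that premise and hypothesis carry the same text. A minimal sketch of a prompt builder whose instruction names the labels the model is actually trained to emit might look like this. The template wording and the `build_example` helper are hypothetical, not taken from the notebook.

```python
ID_TO_LABEL = {0: "neutral", 1: "positive", 2: "negative"}

# Hypothetical instruction that lists the same labels used as generation targets.
SENTIMENT_TEMPLATE = (
    "### Human: Classify the sentiment of the following comment "
    "as positive, neutral, or negative."
)

def build_example(comment, label_id=None):
    """Format one comment as a prompt; append the answer only when a label is given."""
    prompt = f"{SENTIMENT_TEMPLATE}\ncomment: {comment}\n\n### Assistant:"
    if label_id is None:
        return prompt  # inference-time prompt, the model completes the answer
    return f"{prompt} {ID_TO_LABEL[label_id]}"  # training example with the gold answer

print(build_example("<comment text>", 1))
```

Building the training text and the inference prompt from the same function keeps the two formats from drifting apart.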

## 2. Train the Model

- We fine-tune the LLaMA2 7B model that Beomi further trained on Korean datasets.
- First, prepare the arguments needed for training.

In [38]:
model_name = "beomi/llama-2-ko-7b"


@dataclass
class ScriptArguments:
    model_name: Optional[str] = field(default=model_name, metadata={"help": "the model name"})
    dataset_text_field: Optional[str] = field(default="text", metadata={"help": "the text field of the dataset"})
    log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
    learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
    batch_size: Optional[int] = field(default=4, metadata={"help": "the batch size"})
    seq_length: Optional[int] = field(default=512, metadata={"help": "Input sequence length"})
    gradient_accumulation_steps: Optional[int] = field(
        default=2, metadata={"help": "the number of gradient accumulation steps"}
    )
    load_in_8bit: Optional[bool] = field(default=False, metadata={"help": "load the model in 8 bits precision"})
    load_in_4bit: Optional[bool] = field(default=True, metadata={"help": "load the model in 4 bits precision"})
    use_peft: Optional[bool] = field(default=True, metadata={"help": "Whether to use PEFT or not to train adapters"})
    trust_remote_code: Optional[bool] = field(default=True, metadata={"help": "Enable `trust_remote_code`"})
    output_dir: Optional[str] = field(default="output", metadata={"help": "the output directory"})
    peft_lora_r: Optional[int] = field(default=64, metadata={"help": "the r parameter of the LoRA adapters"})
    peft_lora_alpha: Optional[int] = field(default=16, metadata={"help": "the alpha parameter of the LoRA adapters"})
    logging_steps: Optional[int] = field(default=1, metadata={"help": "the number of logging steps"})
    use_auth_token: Optional[bool] = field(default=False, metadata={"help": "Use HF auth token to access the model"})
    num_train_epochs: Optional[int] = field(default=3, metadata={"help": "the number of training epochs"})
    max_steps: Optional[int] = field(default=-1, metadata={"help": "the number of training steps"})
    save_steps: Optional[int] = field(
        default=100, metadata={"help": "Number of update steps between two checkpoint saves"}
    )
    save_total_limit: Optional[int] = field(default=10, metadata={"help": "Limits total number of checkpoints."})
    push_to_hub: Optional[bool] = field(default=False, metadata={"help": "Push the model to HF Hub"})
    hub_model_id: Optional[str] = field(default=None, metadata={"help": "The name of the model on HF Hub"})


script_args = ScriptArguments()
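
To get a feel for what these defaults imply, the sketch below works out the effective batch size, the number of optimizer steps, and the number of checkpoints. It assumes the 4000-row training slice from above and a single GPU; neither assumption is enforced by `ScriptArguments` itself.

```python
import math

# Values taken from ScriptArguments and the 4000-row training slice.
n_train_examples = 4000
batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 3
save_steps = 100
save_total_limit = 10

# Assumption: one GPU, so the per-device batch size is the whole micro-batch.
effective_batch_size = batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(n_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
checkpoints_written = total_steps // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} in total")
print(f"checkpoints: {checkpoints_written} written, {checkpoints_kept} kept")
```

Under these assumptions the run takes 1500 optimizer steps and writes 15 checkpoints, of which 10 are kept on disk.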

- Load the model with quantization applied.

In [40]:
if script_args.load_in_8bit and script_args.load_in_4bit:
    raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
elif script_args.load_in_8bit or script_args.load_in_4bit:
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=script_args.load_in_8bit, load_in_4bit=script_args.load_in_4bit
    )
    device_map = {"": 0}
    torch_dtype = torch.bfloat16
else:
    device_map = None
    quantization_config = None
    torch_dtype = None

model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=quantization_config,
    device_map=device_map,
    trust_remote_code=script_args.trust_remote_code,
    torch_dtype=torch_dtype,
    use_auth_token=script_args.use_auth_token,
)

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 362.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 21.06 MiB is free. Process 4387 has 14.72 GiB memory in use. Of the allocated memory 14.29 GiB is allocated by PyTorch, and 307.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
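
The error reports a GPU with 14.75 GiB in total, of which the process already holds 14.72 GiB. A back-of-the-envelope estimate of weight memory helps judge which precisions can fit at all. The sketch below counts weights only, assuming roughly 7 billion parameters; it ignores activations, LoRA adapters, optimizer state, quantization metadata, and the CUDA context.

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB

# Assumption: about 7e9 parameters; the exact count for this checkpoint may differ.
n_params = 7e9
for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

By this estimate the weights take roughly 13 GiB in bf16 and a little over 3 GiB in 4-bit. Since the 4-bit weights alone should fit, it may be worth checking whether an earlier load is still occupying the GPU, for example by restarting the runtime before running this cell again. That is a guess from the numbers in the message, not something the notebook confirms.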

- Configure the Trainer with LoRA applied.

In [None]:
dataset = instructions_ds_dict

training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    logging_steps=script_args.logging_steps,
    num_train_epochs=script_args.num_train_epochs,
    max_steps=script_args.max_steps,
    report_to=script_args.log_with,
    save_steps=script_args.save_steps,
    save_total_limit=script_args.save_total_limit,
    push_to_hub=script_args.push_to_hub,
    hub_model_id=script_args.hub_model_id,
)

if script_args.use_peft:
    peft_config = LoraConfig(
        r=script_args.peft_lora_r,
        lora_alpha=script_args.peft_lora_alpha,
        bias="none",
        task_type="CAUSAL_LM",
    )
else:
    peft_config = None

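# Note: this call follows the older TRL interface. Recent TRL releases expect settings such as
# dataset_text_field to be passed through SFTConfig, so the call may need adapting to the installed version.
trainer = SFTTrainer(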
    model=model,
    args=training_args,
    max_seq_length=script_args.seq_length,
    train_dataset=dataset['train'],
    eval_dataset=dataset['eval'],
    dataset_text_field=script_args.dataset_text_field,
    peft_config=peft_config,
)
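
To see why LoRA keeps training affordable, the sketch below estimates how many parameters the adapters add with `r=64`. The `LoraConfig` above does not set `target_modules`, so PEFT falls back to its default for the architecture. The sketch assumes that default is `q_proj` and `v_proj`, and that the model has a hidden size of 4096 and 32 decoder layers; these are assumptions about the model, not values stated in the notebook.

```python
def lora_param_count(d_in, d_out, r):
    """A rank-r adapter adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

# Values from ScriptArguments.
r, lora_alpha = 64, 16

# Assumptions (not stated in the notebook): hidden size 4096, 32 decoder layers,
# and adapters on q_proj and v_proj only, each a 4096 x 4096 projection.
hidden_size, n_layers, adapted_per_layer = 4096, 32, 2

per_layer = adapted_per_layer * lora_param_count(hidden_size, hidden_size, r)
trainable = n_layers * per_layer
scaling = lora_alpha / r  # factor applied to the adapter output

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters add about 33.6 million trainable parameters, well under 1% of a 7B model, and the adapter output is scaled by 0.25.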

- Run training.
- This takes about 8 hours on a Colab V100 GPU.

In [None]:
trainer.train()

trainer.save_model(training_args.output_dir)

- Test the performance of the trained model.

In [None]:
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use a distinct name so the imported `pipeline` function is not overwritten.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map={'':0},
)

In [None]:
# Strip the gold answer from every evaluation example so that the model has to generate it.
queries = [text.split('### Assistant: ')[0] + '### Assistant:' for text in instructions_ds_dict['eval']['text']]
sequences = generator(
    queries,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=3,
    early_stopping=True,
    # do_sample=True,
)

In [None]:
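# Keep only the text generated after the answer marker for each prediction.
results = []

for seq in sequences:
    result = seq[0]['generated_text'].split('### Assistant:')[1]
    results.append(result)

# Extract the gold label from each evaluation example in the same way.
# Stripping whitespace keeps the match below independent of leading spaces.
labels = []

for label in instructions_ds_dict['eval']['text']:
    result = label.split('### Assistant:')[1]
    labels.append(result.strip())

# A prediction counts as correct when the gold label appears in the generated text.
print("Accuracy: ", (len([1 for x, y in zip(results, labels) if y in x]) / len(labels)))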
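
A single accuracy number can hide which classes the model gets wrong, and with `max_new_tokens=3` a generation may be cut off before a full label is produced. The sketch below parses each generation into one of the three labels, counts unparseable outputs as wrong, and reports accuracy per class. The helper names and the sample strings are made up for illustration; they are not results from this notebook.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")

def parse_label(generated):
    """Return the known label that starts the generated answer, else None."""
    answer = generated.strip().lower()
    for label in LABELS:
        if answer.startswith(label):
            return label
    return None

def per_class_accuracy(predictions, golds):
    """Accuracy per gold label; unparseable predictions count as wrong."""
    correct, total = Counter(), Counter()
    for pred, gold in zip(predictions, golds):
        total[gold] += 1
        correct[gold] += int(parse_label(pred) == gold)
    return {label: correct[label] / total[label] for label in total}

# Made-up generations for illustration only; these are not results from the notebook.
example_preds = [" positive\n", " negative", " neutral", " pos"]
example_golds = ["positive", "positive", "neutral", "negative"]
scores = per_class_accuracy(example_preds, example_golds)
print(scores)
```

With the real run, `results` and the stripped `labels` from the cell above could be passed in place of the sample lists.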