# Toadx2 LLM Model | Gemma2 Fine-Tune
- Huggingface Korean Dataset & Custom Preprocessed KB Real Estate Data
- Google Machine Learning Bootcamp 2024, 5th

<br />

by Kim Basilri

---

## 0. Intro

---

<br/>

### What is this project?
- This project aims to build a web application that predicts future Korean real estate prices with a Gemma2 model fine-tuned on Korean real estate price data. The goal is to give users data-based forecasts so they can make clear, informed decisions instead of being swept along by the uncertain trends of the Korean real estate market.

<br />

### Base Model
- [Google's gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it)

<br />

### Fine-Tuned Model
- [basilry/gemma2-2-2b-it-fine-tuned-korean-real-estate-model](https://huggingface.co/basilry/gemma2-2-2b-it-fine-tuned-korean-real-estate-model)

<br />

### Datasets Used
1. [Korean Safe Conversation Dataset](https://huggingface.co/datasets/jojo0217/korean_safe_conversation)

2. [KB Real Estate Data Hub's Apartment Dataset](https://data.kbland.kr/)

3. [S.Korea Apartment Market Prediction Dataset](https://github.com/basilry/toadx2_api)

<br />

### Index

0. Intro
1. Install library & Set Huggingface
2. Import Google/gemma-2-2b-it model
3. Fine-Tune Korean Dataset
4. Korean Conversation Sample
5. Fine-Tune Custom Preprocessed KB Real Estate Data
6. Korean Real Estate Conversation Sample
7. Fine-Tune Custom Identity for Toadx2 Application
8. Toadx2 Identity Conversation Sample
9. Conclusion


## 1. Install library & Set Huggingface

In [1]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any

In [2]:
import os
from huggingface_hub import login
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

# Read the access token from the environment; never hardcode it in a shared notebook.
login(os.environ['HUGGINGFACE_TOKEN'])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
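
The login step needs a Hugging Face access token. Below is a minimal sketch of reading it from an environment variable instead of writing it into the notebook. The helper name `read_hf_token` is hypothetical, and the variable name `HUGGINGFACE_TOKEN` is taken from the cell above.

```python
import os

def read_hf_token(env=None, var_name="HUGGINGFACE_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before running this notebook.")
    return token

# Usage (needs huggingface_hub installed and a real token in the environment):
# from huggingface_hub import login
# login(token=read_hf_token())
```

Failing early with a clear message is a design choice here; it avoids a confusing authentication error later in the notebook.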


## 2. Import Google/gemma-2-2b-it model

In [13]:
model_name = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Check Tokenizer
print(tokenizer.tokenize("안녕하세요, 오늘 날씨는 어떻습니까?"))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


KeyboardInterrupt: 

## 3. Fine-Tune Korean Dataset
Dataset: jojo0217/korean_safe_conversation

In [14]:
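# 1. Load Dataset
dataset = load_dataset("jojo0217/korean_safe_conversation")

# 2. Split Dataset
dataset = dataset["train"].train_test_split(test_size=0.1)

# 3. Check Dataset's Columns
print(dataset["train"].column_names)  # ['instruction', 'output', 'input']

# 4. Define Data Preprocess Function
# Gemma2 is a decoder-only model, so the question and the answer are joined into
# one sequence and the labels are a copy of input_ids with padding masked out.
# Assumption: the question is in 'instruction' and 'input' holds optional extra context.
def preprocess_function(examples):
    texts = []
    for q, extra, a in zip(examples["instruction"], examples["input"], examples["output"]):
        question = f"{q}\n{extra}" if extra else q
        texts.append(f"사용자: {question}\nAI: {a}{tokenizer.eos_token}")

    model_inputs = tokenizer(
        texts,
        max_length=128,
        truncation=True,
        padding='max_length',
    )

    # Copy input_ids into labels; -100 marks padding so the loss ignores it
    model_inputs["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(model_inputs["input_ids"], model_inputs["attention_mask"])
    ]
    return model_inputs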

# 5. Apply Preprocessing to the Dataset
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# 6. Define Data Collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
)

# 7. Define Training Arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    save_steps=5000,
    save_total_limit=2,
    logging_steps=500,
    eval_strategy="steps",
    eval_steps=5000,
    predict_with_generate=False,  # decoder-only model: evaluate on loss instead of generate()
    fp16=True,
)

# 8. Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# 9. Start Training
trainer.train()

# 10. Save Model
trainer.save_model("./gemma2-2-2b-it-fine-tuned-korean-model")
tokenizer.save_pretrained("./gemma2-2-2b-it-fine-tuned-korean-model")

['instruction', 'output', 'input']


Map:   0%|          | 0/24281 [00:00<?, ? examples/s]

Map:   0%|          | 0/2698 [00:00<?, ? examples/s]

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


OutOfMemoryError: CUDA out of memory. Tried to allocate 42.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 30.81 MiB is free. Process 5416 has 39.53 GiB memory in use. Of the allocated memory 38.51 GiB is allocated by PyTorch, and 526.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
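
The out-of-memory error above is consistent with a back-of-the-envelope estimate for full fine-tuning. The sketch below is an estimate, not a profile. It assumes gemma-2-2b-it has roughly 2.6 billion parameters, that the weights and gradients are held in fp32 (the model was loaded without a `torch_dtype`), and that the optimizer keeps two fp32 moments per parameter. Activations are ignored, and the function name is hypothetical.

```python
GIB = 1024 ** 3

def training_memory_gib(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Rough memory for weights, gradients and optimizer moments, ignoring activations."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes) / GIB

# Assumption: roughly 2.6 billion parameters for gemma-2-2b-it.
num_params = 2.6e9

full_fp32 = training_memory_gib(num_params)
gpu_capacity_gib = 39.56  # total capacity reported in the error message

print(f"weights + gradients + optimizer states: {full_fp32:.1f} GiB")
print(f"GPU capacity:                           {gpu_capacity_gib:.2f} GiB")
print(f"left for activations:                   {gpu_capacity_gib - full_fp32:.1f} GiB")
```

Under these assumptions the fixed training state alone takes almost the whole card, which would explain why a 42 MiB allocation fails even with a batch size of 1. Parameter-efficient methods or a lower-precision optimizer are common ways to reduce this footprint.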

In [5]:
from google.colab import files

!zip -r /content/gemma2-2-2b-it-fine-tuned-korean-model.zip /content/gemma2-2-2b-it-fine-tuned-korean-model/
files.download('/content/gemma2-2-2b-it-fine-tuned-korean-model.zip')

Mounted at /content/drive
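
One common way to prepare question/answer pairs for a decoder-only model such as Gemma2 is to join the prompt and the answer into a single sequence and compute the loss only on the answer tokens. The following is a minimal sketch using plain lists of made-up token ids in place of real tokenizer output; the function name and the ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_causal_lm_example(prompt_ids, answer_ids, max_length, pad_id=0):
    """Join prompt and answer into one sequence; only answer tokens carry labels."""
    input_ids = (prompt_ids + answer_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad to a fixed length; padded positions are excluded from the loss too.
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    attention_mask += [0] * pad
    labels += [IGNORE_INDEX] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy ids standing in for a tokenized prompt and answer (hypothetical values).
example = build_causal_lm_example(
    prompt_ids=[2, 11, 12, 13], answer_ids=[21, 22, 1], max_length=10
)
print(example["input_ids"])
print(example["labels"])
```

The labels have the same length as `input_ids` and are not shifted by hand, because the causal language model classes in transformers shift them internally when computing the loss.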


## 4. Korean Conversation Sample

In [6]:
!pip install torch



In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name_or_path = "/content/drive/MyDrive/gemma2-2-2b-it-fine-tuned-korean-model"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Pass the 'attn_implementation' argument when loading a Gemma2 model
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    attn_implementation='eager',
    torch_dtype=torch.float16,  # load the weights in half precision
    device_map='auto'
)

model.eval()

# Use the GPU when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [10]:
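def generate_response(input_text):
    # Tokenize the input
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
    ).to(device)

    # Generate a response with the model
    # Note: max_length counts the prompt tokens as well as the generated ones,
    # so long answers are cut off; max_new_tokens would bound only the reply.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=5,
            early_stopping=True,
        )

    # Decode the output
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return generated_text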

In [11]:
# Start the conversation loop
while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit", "종료"]:
        print("대화를 종료합니다.")
        break

    # Pass the input to the model
    input_text = f"사용자: {user_input}\nAI:"
    response = generate_response(input_text)

    # Extract the part after 'AI:' from the response
    response = response.split("AI:")[-1].strip()

    print(f"AI: {response}")

You: 안녕!
AI: 안녕하세요! 😊 무엇을 도와드릴까요?
You: 너는 어떤 모델이고, 무엇을 학습했니?
AI: 안녕하세요! 저는 Gemma, Google DeepMind에서 개발한 대규모 언어 모델입니다. 저는 텍스트 기반 모델로, 다양한 주제에 대한 정보를 제공하고, 질문에 답하고, 창의적인 텍스트를 생성할 수 있습니다. 

저는 엄청난 양의 텍스트 데이터를 학습했기 때문에 다양한 주제에 대한 지식을 가지고 있습니다. 예를 들어, 과학, 역
You: 너의 대답이 짤리는 이유는 무엇이니? 어떤 걸 설정해줘야 하지?
AI: 안녕하세요! 사용자님의 질문에 대한 답변이 짤리는 이유는 여러 가지가 있을 수 있습니다. 

**1. 너무 길거나 복잡한 질문:** 짧고 명확한 질문을 해주시면 더 정확하고 풍부한 답변을 얻을 수 있습니다. 

**2. 훈련 데이터 부족:** 저는 엄청난 양의


KeyboardInterrupt: Interrupted by user
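
The cut-off answers in the sample above match how `generate()` interprets `max_length`: it bounds the prompt and the generated tokens together, so a longer prompt leaves less room for the reply. The sketch below only illustrates that arithmetic; the helper name is hypothetical and the prompt lengths are made up.

```python
def reply_budget_with_max_length(prompt_tokens, max_length):
    """Tokens left for the reply when max_length covers prompt + generated tokens."""
    return max(max_length - prompt_tokens, 0)

# Made-up prompt lengths, checked against the max_length=128 used in the cell above.
for prompt_tokens in (10, 60, 120):
    budget = reply_budget_with_max_length(prompt_tokens, max_length=128)
    print(f"prompt of {prompt_tokens:>3} tokens leaves {budget:>3} tokens for the reply")

# With max_new_tokens the reply budget no longer depends on the prompt length, e.g.
# model.generate(**inputs, max_new_tokens=256, num_beams=5, early_stopping=True)
# (256 is an arbitrary illustrative value, not a setting from this project).
```

Switching `generate_response` from `max_length` to `max_new_tokens` would be one way to stop replies from being truncated mid-sentence, at the cost of longer generation time for long answers.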

## 5. Fine-Tune Custom Preprocessed KB Real Estate Data

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset
import torch

# 1. Load the Korean fine-tuned gemma2 Model and Tokenizer
model_name = "./fine-tuned-model"  # path to the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 2. CSV File Load
prediction_data = pd.read_csv('/content/kb_real_estate_data/kb_prediction.csv')
property_price_data = pd.read_csv('/content/kb_real_estate_data/kb_property_price_data.csv')

# 3. Convert Data to Text
def convert_prediction_row_to_text(row):
    # Labels: date, region code, price type, predicted index, predicted price
    return (f"날짜: {row['date']}, 지역 코드: {row['region_code']}, "
            f"구분: {row['price_type']}, 예측 지수: {row['predicted_index']}, "
            f"예측 가격: {row['predicted_price']}")

def convert_property_row_to_text(row):
    # Labels: date, region code, price type, recorded index, average price, interpolated flag
    return (f"날짜: {row['date']}, 지역 코드: {row['region_code']}, "
            f"구분: {row['price_type']}, 기록 지수: {row['index_value']}, "
            f"평균 가격: {row['avg_price']}, 보간 여부: {row['is_interpolated']}")

# 4. Convert Prediction & Historical Data to Text
prediction_data['text'] = prediction_data.apply(convert_prediction_row_to_text, axis=1)
property_price_data['text'] = property_price_data.apply(convert_property_row_to_text, axis=1)

# 5. Combine Data (reset the index so no extra index column ends up in the dataset)
combined_data = pd.concat([prediction_data[['text']], property_price_data[['text']]], ignore_index=True)

# 6. Convert to a Hugging Face Dataset
dataset = Dataset.from_pandas(combined_data)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
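
To make the training text concrete, the sketch below runs the prediction-row converter from the cell above on a single row. A plain dict stands in for the pandas row, and every value in it is made up for illustration; none of it comes from the KB data.

```python
def convert_prediction_row_to_text(row):
    # Labels: date, region code, price type, predicted index, predicted price
    return (f"날짜: {row['date']}, 지역 코드: {row['region_code']}, "
            f"구분: {row['price_type']}, 예측 지수: {row['predicted_index']}, "
            f"예측 가격: {row['predicted_price']}")

# Hypothetical row; all values are invented for this example.
sample_row = {
    "date": "2024-10-01",
    "region_code": "1100000000",
    "price_type": "매매",
    "predicted_index": 95.3,
    "predicted_price": 120000,
}

text = convert_prediction_row_to_text(sample_row)
print(text)
```

Each row becomes one flat sentence of labeled fields, so the model sees the data as plain text rather than as question/answer pairs.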

In [None]:
# 7. Define Preprocess Function
def tokenize_function(examples):
    model_inputs = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

    # Set labels to a copy of input_ids and replace padding tokens with -100 so the loss ignores them
    labels = model_inputs["input_ids"].copy()
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]

    model_inputs["labels"] = labels
    return model_inputs

# 8. Tokenize the Dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 9. Split into Train & Test Datasets
train_test_split = tokenized_datasets.train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

Map:   0%|          | 0/47232 [00:00<?, ? examples/s]
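
The label masking in the tokenize step can be illustrated on toy data. The sketch below uses made-up token ids and assumes a pad token id of 0. It also shows a variant that masks by attention mask instead; that variant is an alternative, not the notebook's method. Both helper names are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_by_pad_id(input_ids, pad_token_id):
    """Replace every pad token id with IGNORE_INDEX, as in the cell above."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in seq] for seq in input_ids]

def mask_by_attention(input_ids, attention_mask):
    """Alternative: mask positions where the attention mask is 0."""
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(seq, mask)]
        for seq, mask in zip(input_ids, attention_mask)
    ]

# Toy batch of two sequences, padded on the left to length 5 (hypothetical ids, pad id 0).
input_ids = [[0, 0, 2, 15, 16], [2, 15, 16, 17, 18]]
attention_mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]

print(mask_by_pad_id(input_ids, pad_token_id=0))
print(mask_by_attention(input_ids, attention_mask))
```

The two approaches agree whenever the pad token never appears inside real text. Masking by attention mask is the safer choice for tokenizers that reuse another special token, such as the end-of-sequence token, for padding.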

In [None]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

# 10. Define Data Collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
)

# 11. Define Training Arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    save_steps=5000,
    save_total_limit=2,
    logging_steps=500,
    eval_strategy="steps",
    eval_steps=5000,
    predict_with_generate=False,  # decoder-only model: evaluate on loss instead of generate()
    fp16=True,
)

# 12. Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# 13. Start Training
trainer.train()

# 14. Save Model
model.save_pretrained("./fine-tuned-real-estate-model")
tokenizer.save_pretrained("./fine-tuned-real-estate-model")

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 80.81 MiB is free. Process 57969 has 39.48 GiB memory in use. Of the allocated memory 38.60 GiB is allocated by PyTorch, and 382.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
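
Apart from memory, the step-based settings in the training arguments are worth checking against the dataset size. The rough sketch below counts optimizer steps per epoch for the 47232 rows reported above. It assumes a single GPU and that the 10% test split rounds up, as the earlier 24281/2698 split suggests. Exact rounding may differ between library versions, and the function name is hypothetical.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Approximate number of optimizer updates in one pass over the training set."""
    micro_batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return math.ceil(micro_batches / grad_accum_steps)

total_rows = 47232                        # from the Map progress line above
eval_rows = math.ceil(total_rows * 0.1)   # assumed rounding of test_size=0.1
train_rows = total_rows - eval_rows

steps = optimizer_steps_per_epoch(train_rows, per_device_batch_size=1, grad_accum_steps=16)
print(f"train rows: {train_rows}, eval rows: {eval_rows}, optimizer steps: {steps}")

# How often each step-based setting would fire within the single epoch.
for name, every in {"logging_steps": 500, "eval_steps": 5000, "save_steps": 5000}.items():
    print(f"{name}={every}: fires {steps // every} time(s)")
```

Under these assumptions one epoch has fewer optimizer steps than `eval_steps` and `save_steps`, which suggests that no step-triggered evaluation or intermediate checkpoint would occur during the run. Smaller values would be needed to see evaluation loss while training.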

## 6. Korean Real Estate Conversation Sample

## 7. Fine-Tune Custom Identity for Toadx2 Application

## 8. Toadx2 Identity Conversation Sample

## 9. Conclusion