# Fine-Tuning Indonesian Dataset Using Unsloth

**Authors**
1. Alfan Dinda Rahmawan (alfan.d.rahmawan@gdplabs.id)

**References:**
1. [Fine-Tuning SDK](https://docs.glair.ai/generative-internal/modules/model-training/llm/supervised-fine-tuning)
2. [Training Dataset](https://docs.google.com/spreadsheets/d/1E1V7jFa8vrWwJqNaPwcpUN1DxR46jpWxbgV50cxR1sU/edit?gid=1049327042#gid=1049327042)
3. [Evaluation Dataset](https://docs.google.com/spreadsheets/d/1Ql_1M-1Qa6Js0K8M8Ai46TpJdfVT0vGlGCo2yGPeUss/edit?gid=1119237474#gid=1119237474)

**Overview**
This notebook demonstrates the process of fine-tuning a language model for text to sql using Unsloth. The notebook is organized into four main sections:

1. [**Model Preparations**](#model-preparations): 
   - Setting up the environment and required dependencies.
   - Configuring the base model.
   - Defining hyperparameters and model configurations.

2. [**Data Preparation**](#data-preparation):
   - Loading training and validation data from Google Sheets.
   - Preprocessing the data for fine-tuning.
   - Creating dataset loaders.

3. [**Run fine tuning**](#run-fine-tuning):
   - Training the model using SFTTrainer.
   - Monitoring training metrics and GPU usage.
   - Saving the fine-tuned model and hyperparameters.

4. [**Sanity Check**](#sanity-check):
   - Loading the fine-tuned model.
   - Running inference on test cases sample.

5. [**Save model to S3**](#save-model-to-s3):
   - Zipping the model files.
   - Uploading the model to AWS S3 for deployment.

## Prepare Environment

**To install the SDK library, you need to create a personal access token on GitHub. Please follow these steps:**
1. You need to log in to your [GitHub Account](https://github.com/).
2. Go to the [Personal Access Tokens](https://github.com/settings/tokens) page.
3. If you haven't created a Personal Access Tokens yet, you can generate one.
4. When generating a new token, make sure that you have checked the `repo` option to grant access to private repositories.
5. Now, you can copy the new token that you have generated and paste it into the script below.

**To install the model, you need to create a Hugging Face token. Please follow these steps:**
1. You need to log in to your [Hugging Face Account](https://huggingface.co/).
2. Go to the [Personal Access Tokens](https://huggingface.co/settings/tokens) page.
3. If you haven't created a Personal Access Tokens yet, you can generate one.
4. Now, you can copy the new token that you have generated and paste it into the script below.

# Data preparation

## Google auth

In [4]:
import os
import pandas as pd
import json

from typing import List
from dotenv import load_dotenv

load_dotenv()

GOOGLE_SPREADSHEET_ID: str = "1dDMqrol_DrEMjvLy88IRu2WdHN7T5BU0LrD8ORLuNPI" # put your spreadsheet id here
GOOGLE_SPREADSHEET_URL: str = f"https://docs.google.com/spreadsheets/d/{GOOGLE_SPREADSHEET_ID}/edit?usp=sharing" # put your spreadsheet link here
TRAIN_SHEET_NAME: str = "train_data"
VALIDATION_SHEET_NAME: str = "validation_data"

GOOGLE_SHEETS_CLIENT_EMAIL: str = os.getenv('GOOGLE_SHEETS_CLIENT_EMAIL')
GOOGLE_SHEETS_PRIVATE_KEY: str = os.getenv('GOOGLE_SHEETS_PRIVATE_KEY')

In [5]:
# Google Authentication
from modules.google_sheets_writer import GoogleUtil

PRIVATE_KEY = GOOGLE_SHEETS_PRIVATE_KEY
google: GoogleUtil = GoogleUtil(PRIVATE_KEY, GOOGLE_SHEETS_CLIENT_EMAIL)


## Database Information

# Unsloth

In [6]:
import json
import pandas as pd
from pydantic import BaseModel, Field
from typing import List
from langchain_core.output_parsers import JsonOutputParser
from tqdm import tqdm

## Model Preparations

### Download model (optional)

In [5]:
# from huggingface_hub import login
# import getpass
# HF_TOKEN = getpass.getpass("Insert your Hugging Face Token (your typing will be hidden, press Enter when done): ")
# login(token=HF_TOKEN)

In [6]:
# import os

# # Set the model repository name (change this to the desired model)
# model_repo = "Qwen/Qwen2.5-Coder-3B-Instruct"
# # branch = "refs/pr/1"
# branch = "main"
# # Define the directory where the model will be saved
# save_directory = os.path.join("models", model_repo)

# # Create the directory if it does not exist
# os.makedirs(save_directory, exist_ok=True)

# # Use the Hugging Face CLI to download the model
# download_cmd = f"huggingface-cli download {model_repo} --revision {branch} --local-dir {save_directory}"
# # os.system(download_cmd)
# # prompt = "huggingface-cli download Ellbendls/Qwen-2.5-3b-Text_to_SQL --local-dir models/Ellbendls/Qwen-2.5-3b-Text_to_SQL"
# # print(f"Model downloaded to {save_directory}")
# download_cmd

### Prompt

In [9]:
SYSTEM_PROMPT = """You are a SQL generator expert for MariaDB 10.5.23. Your task is to create SQL queries based on database schema, table relationships, master data, current date, and user instructions.

1. CORE SQL REQUIREMENTS:
   - Generate directly executable MariaDB 10.5.23 SQL
   - DO NOT select identifiers (id, employee_id) except within aggregate functions (MAX, SUM, AVG)
   - Always use human-readable names (religions.name instead of religions.id)
   - Prefix all column names with table names to avoid ambiguity
   - Use snake_case for column aliases in SELECT clause only (no aliases in FROM)
   - Handle dates properly:
     * Wrap all date literals in STR_TO_DATE()
     * Ensure date comparisons use CAST() for DATE type
   - Use aggregate functions ONLY in SELECT or HAVING clauses (never in WHERE)
   - Ensure JOIN conditions reference correct foreign keys
   - Add extra JOINs to master tables if display names are missing
   - Handle division by checking for non-zero denominators
   - Name output columns based on user's instruction language

2. DATA TRUSTEE REQUIREMENTS:
   - If any table in the query appears in the Data Trustee Enabled Tables list:
     * JOIN with the employment_statuses table using employee_id column
     * Add these exact filters:
       employment_statuses.organization_id IN ('[ORGANIZATION_IDS]')
       AND employment_statuses.job_level_id IN ('[JOB_LEVEL_IDS]')
       AND employment_statuses.location_id IN ('[LOCATION_IDS]')
     * If table requires prerequisite join (indicated in data_trustee_tables),
       perform that JOIN before joining with employment_statuses
   - If no table from the Data Trustee Enabled Tables is used, do not include
     any data trustee-specific JOINs or filters

3. OUTPUT FORMAT:
   - Return ONLY a JSON object with your SQL query in this exact format:
        {{
        "sql_query": "<your_generated_sql_query>"
        }}
   - No explanations or commentary

Examples:
Input: "Berapa total uang yang ditransfer untuk karyawan yang aktif di bulan Oktober 2023?"

Output: {
  "sql_query": ["SELECT SUM(salary_payment_summaries.transferred_amount) AS 'total_transferred_amount' FROM salary_payment_summaries JOIN salary_payments ON salary_payment_summaries.id = salary_payments.salary_payment_summary_id JOIN employees ON salary_payments.employee_id = employees.id JOIN employment_statuses ON employees.id = employment_statuses.employee_id WHERE employees.active = TRUE AND employment_statuses.organization_id IN ([ORGANIZATION_IDS]) AND employment_statuses.job_level_id IN ([JOB_LEVEL_IDS]) AND employment_statuses.location_id IN ([LOCATION_IDS]) AND salary_payment_summaries.payment_date BETWEEN STR_TO_DATE('2023-10-01', '%Y-%m-%d') AND STR_TO_DATE('2023-10-31', '%Y-%m-%d');"]
}
"""

USER_PROMPT = """Generate a SQL query for the following instruction:

DATABASE INFORMATION:
Schema: {schema}
Relationships: {relations}
Master Data: {master_data}
Data Trustee Enabled Tables: {data_trustee_tables}
Anonymized Entities: {anonymized_entities_description}
Current Date: {current_date}

user_instruction: {user_instruction}
Return only the SQL query as a JSON object: {{"sql_query": "<your_sql_query>"}}"""

### hyperparameters

In [8]:
# Model Configuration
MAX_SEQ_LENGTH = 2048  # Choose any! We auto support RoPE Scaling internally!
DTYPE = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
LOAD_IN_4BIT = False  # Use 4bit quantization to reduce memory usage

# PEFT (Parameter Efficient Fine-Tuning) Configuration
PEFT_CONFIG = {
    "r": 8,                                    # LoRA rank: 8, 16, 32, 64, 128
    "lora_alpha": 16,
    "lora_dropout": 0.1,                       # 0 is optimized
    "bias": "none",                            # "none" is optimized
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "use_gradient_checkpointing": "unsloth",   # Use for very long context
    "random_state": 3407,
    "use_rslora": False,                       # Rank stabilized LoRA support
    "loftq_config": None,                      # LoftQ support
}

# SFT (Supervised Fine-Tuning) Configuration
SFT_TRAINER_ARGS = {
    "dataset_text_field": "prompt",
    "dataset_num_proc": 2,
    "packing": False,                          # Can make training 5x faster for short sequences
}

# Training Configuration
TRAINING_ARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 5,
    "num_train_epochs": 5,
    "learning_rate": 2e-5,                     # 0.00002 written in scientific notation
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
    "lr_scheduler_type": "linear",             # Options: constant, linear, cosine, cosine_with_restarts
    "seed": 42,
    "report_to": "none",                       # Can be modified for WandB integration
    "save_strategy": "steps",                  # Save after a certain number of steps
    "eval_strategy": "steps",                  # Evaluate after a certain number of steps
    "load_best_model_at_end": True,            # Load the best model based on evaluation metrics
    "save_steps": 20,                          # Save the model every 20 steps
    "eval_steps": 20,                          # Evaluate every 20 steps
    "logging_steps": 20,                       # Log every 20 steps
    "save_total_limit": 5,                     # Limit the number of saved checkpoints
    "metric_for_best_model":"eval_loss",       # Track "eval_loss" to choose the best model
    "greater_is_better":False,                 # Lower eval_loss is better

}

### Experiment information

In [9]:
import os

class ColumnName:
    NO = "No"
    SQL_QUESTION = "Prompt"
    SQL_QUERY = "Expected SQL Query"
    DATABASE_TYPE = "Database"

# Model Paths
MODEL_DIR = "Qwen"
MODEL_NAME = "Qwen2.5-0.5B-Instruct"
STUDENT_MODEL_NAME = os.path.join(MODEL_DIR, MODEL_NAME)

# Get current working directory
BASE_DIR = os.getcwd()

# Experiment Information
EXPERIMENT_ID = "1"
FINETUNED_HYPERPARAM_ID = "ft-1"
EVALUATION_DIR = "evaluation"
EXPERIMENT_REASON = f"Fine tuned text to sql used {STUDENT_MODEL_NAME} model with hyperparam {FINETUNED_HYPERPARAM_ID}, {TRAINING_ARGS['num_train_epochs']} Epoch"
EXPERIMENT_ID_REF = "-"
PROMPT_ID = "ft_text_to_sql_few_shot_1"
TOPIC = "TEXT TO SQL"

FINETUNE_DIR = f"fine_tune_exp_{EXPERIMENT_ID}"
FINETUNE_OUTPUT_PATH = os.path.join(BASE_DIR, FINETUNE_DIR, STUDENT_MODEL_NAME)
FINAL_FINETUNE_OUTPUT_NAME = f"exp_id_{EXPERIMENT_ID}:fine_tuning_{FINETUNED_HYPERPARAM_ID}:{PROMPT_ID}:{MODEL_NAME}"

DATA_TRAIN_DIR = os.path.join(BASE_DIR, "data_finetuned", MODEL_NAME, "train")
DATA_TRAIN_NAME = f"exp-{EXPERIMENT_ID}-finetuned-{FINETUNED_HYPERPARAM_ID}-text-to-sql-{MODEL_NAME}.csv"

# Create necessary directories
os.makedirs(FINETUNE_OUTPUT_PATH, exist_ok=True)
os.makedirs(DATA_TRAIN_DIR, exist_ok=True)

FINAL_MODEL_PATH = f"{FINETUNE_OUTPUT_PATH}/{FINAL_FINETUNE_OUTPUT_NAME}"
print(FINAL_MODEL_PATH)

/home/text_to_sql/fine_tune_exp_1/Qwen/Qwen2.5-0.5B-Instruct/exp_id_1:fine_tuning_ft-1:ft_text_to_sql_few_shot_1:Qwen2.5-0.5B-Instruct


### Unsloth preparation

In [10]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = STUDENT_MODEL_NAME,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = DTYPE,
    load_in_4bit = LOAD_IN_4BIT
)

2025-03-24 01:08:09,252 - INFO - PyTorch version 2.6.0 available.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.18: Fast Qwen2 patching. Transformers: 4.50.0.dev0.
   \\   /|    NVIDIA RTX A5000. Num GPUs = 1. Max memory: 23.679 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [15]:
model = FastLanguageModel.get_peft_model(
    model,
    r = PEFT_CONFIG["r"], # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = PEFT_CONFIG["target_modules"],
    lora_alpha = PEFT_CONFIG["lora_alpha"],
    lora_dropout = PEFT_CONFIG["lora_dropout"], # Supports any, but = 0 is optimized
    bias = PEFT_CONFIG["bias"],    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = PEFT_CONFIG["use_gradient_checkpointing"], # True or "unsloth" for very long context
    random_state = PEFT_CONFIG["random_state"],
    use_rslora = PEFT_CONFIG["use_rslora"],  # We support rank stabilized LoRA
    loftq_config = PEFT_CONFIG["loftq_config"], # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.


## Data Preparation

### Train Data

In [16]:
rows: List[list] = google.retrieve_worksheet(GOOGLE_SPREADSHEET_ID, TRAIN_SHEET_NAME)
df_train: pd.DataFrame = pd.DataFrame(rows[1:], columns=rows[0])
df_train.head()

Unnamed: 0,No,Prompt,Expected SQL Query,Sheet,Database
0,1,Berapa persentase karyawan yang mengajukan pen...,SELECT (COUNT(DISTINCT termination_entries.emp...,catapa_syntetics_employee,core
1,2,Berapa jumlah karyawan baru yang direkrut seti...,"SELECT CONCAT(YEAR(employees.join_date), ' Q',...",catapa_syntetics_employee,core
2,3,Siapa manajer dengan jumlah bawahan langsung t...,"SELECT employees.name AS 'nama_manajer', COUNT...",catapa_syntetics_employee,core
3,4,Bagaimana distribusi usia karyawan di departem...,SELECT FLOOR(DATEDIFF(STR_TO_DATE('06 March 20...,catapa_syntetics_employee,core
4,5,Berapa persentase karyawan kontrak yang diperp...,WITH contract_employees AS (SELECT employees.i...,catapa_syntetics_employee,core


#### Validation Data

In [7]:
rows: List[list] = google.retrieve_worksheet(GOOGLE_SPREADSHEET_ID, VALIDATION_SHEET_NAME)
df_validation: pd.DataFrame = pd.DataFrame(rows[1:], columns=rows[0])
df_validation.head()

Unnamed: 0,No,Prompt,Expected SQL Query,Sheet,Database
0,1,Berapa jumlah karyawan yang dipromosikan pada ...,SELECT COUNT(DISTINCT employment_status_histor...,catapa_syntetics_employee,core
1,2,Identifikasi manajer dengan rentang kendali te...,"SELECT managers.name AS nama_manajer, COUNT(em...",catapa_syntetics_employee,core
2,3,Bagaimana perbandingan jumlah karyawan di seti...,"SELECT locations.name AS lokasi_kantor, COUNT(...",catapa_syntetics_employee,core
3,4,Berapa jumlah karyawan di setiap departemen?,SELECT \n organizations.name AS department_nam...,catapa_syntetics_employee,core
4,5,Berapa jumlah karyawan yang melapor langsung k...,SELECT COUNT(DISTINCT employees.id) AS total_k...,catapa_syntetics_employee,core


### Data Preprocessing


#### Convert into dataset loader

In [None]:
"""A class for preprocessing datasets for Named Entity Recognition (NER) fine-tuning.

This class handles the preprocessing of training and validation datasets for NER tasks,
including chat template application, tokenization, and dataset formatting. It supports
saving processed datasets and creating dataset loaders for training.

Author:
    Alfan Dinda Rahmawan (alfan.d.rahmawan@gdplabs.id)

References:
    -
"""
import os
from typing import Tuple
from datetime import datetime

import pandas as pd
from gdplabs_gen_ai_training.sft_trainer import DatasetLoader
from gdplabs_gen_ai_training.training_arguments import DataArguments
from gdplabs_gen_ai_training.utils import validate_args
from transformers import PreTrainedTokenizer


from modules.database_info.schema import employee_schema, time_management_schema
from modules.database_info.master_data import employee_master_data, time_management_master_data
from modules.database_info.relation import employee_relations, time_management_relations
from modules.database_info.trustee_tables import data_trustee_employee, data_trustee_time_management
from modules.database_info.anonymize_entities import anonymized_entities_description

class DatasetPreprocessor:
    def __init__(
        self, df_train: pd.DataFrame, df_validation: pd.DataFrame, tokenizer: PreTrainedTokenizer, system_prompt: str,
        base_filename: str, dir_path: str, user_field_name: str, assistant_field_name: str, prompt_key_name: str, test_size: float = 0.1, random_state: int = 42
    ) -> None:
        """
        Initialize the DatasetPreprocessor.

        Args:
            df_train (pd.DataFrame): Input dataset
            df_validation (pd.DataFrame): Input dataset
            tokenizer: The tokenizer to use
            system_prompt (str): System prompt for the model
            base_filename (str): Base name for output files
            dir_path (str): Directory path for output files
            test_size (float): Proportion of dataset to use for test split
            random_state (int): Random seed for reproducibility
        """
        self.df_train = df_train
        self.df_validation = df_validation
        self.tokenizer = tokenizer
        self.system_prompt = system_prompt
        self.base_filename = base_filename
        self.test_size = test_size
        self.random_state = random_state
        self.dir_path = dir_path
        self.user_field_name = user_field_name
        self.assistant_field_name = assistant_field_name
        self.prompt_key_name = prompt_key_name

    def _process_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Process a dataframe by applying the chat template and formatting.

        Args:
            df (pd.DataFrame): DataFrame to process

        Returns:
            pd.DataFrame: Processed DataFrame
        """
        df = df.copy()
        tmp_df = pd.DataFrame()
        current_date = datetime.now().strftime("%d %B %Y")
        for row in df.iterrows():
            row_data = row[1]
            sql_question = row_data[ColumnName.SQL_QUESTION]
            database_type = row_data[ColumnName.DATABASE_TYPE]

            # Invoke the chain with the current batch
            if database_type == "employee":
                schema = employee_schema
                relations = employee_relations
                master_data = employee_master_data
                data_trustee_tables = data_trustee_employee
                master_data = employee_master_data
            else:
                schema = time_management_schema
                relations = time_management_relations
                master_data = time_management_master_data
                data_trustee_tables = data_trustee_time_management
                master_data = time_management_master_data

            user_message = USER_PROMPT.format(
                user_instruction=sql_question,
                schema=schema,
                relations=relations,
                master_data=master_data,
                data_trustee_tables=data_trustee_tables,
                anonymized_entities_description=anonymized_entities_description,
                current_date=current_date
            )
        df['user_message'] = user_message
        df[self.prompt_key_name] = df.apply(
            lambda row: self.tokenizer.apply_chat_template(
                [
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": row['user_message']},
                    {"role": "assistant", "content": row[ColumnName.SQL_QUERY]}
                ],
                tokenize=False,
                add_generation_prompt=False,
                continue_final_message=False
            ),
            axis=1
        )

        df[self.prompt_key_name] = df[self.prompt_key_name].str.rstrip('\n')
        df['expected_answer'] = ""
        return df[[self.prompt_key_name, 'expected_answer']]

    def process_and_save(self) -> Tuple[str, str]:
        """Process both splits and save them to files."""

        # Process and save training data
        train_processed = self._process_dataframe(self.df_train)
        train_filename = f"train_{self.base_filename}"
        train_path = os.path.join(self.dir_path, train_filename)
        train_processed.to_csv(train_path, index=False)

        # # Process and save test data
        validation_processed = self._process_dataframe(self.df_validation)
        validation_filename = f"validation_{self.base_filename}"
        validation_path = os.path.join(self.dir_path, validation_filename)
        validation_processed.to_csv(validation_path, index=False)

        print(f"Saved training data to: {train_filename}")
        print(f"Saved test data to: {validation_filename}")

        return train_path, validation_path

    def get_dataset_loader(self, file_path: str) -> DatasetLoader:
        """
        Create and return a DatasetLoader for the training data.

        Args:
            file_path (str): Path to the training data file

        Returns:
            DatasetLoader: Configured dataset loader
        """
        data_args = validate_args(DataArguments, dict({"dataset_path": file_path}))
        dataset_loader = DatasetLoader(data_args, self.tokenizer)
        dataset_loader.load_and_format_dataset()
        return dataset_loader


class ColumnName:
    NO = "No"
    SQL_QUESTION = "Prompt"
    SQL_QUERY = "Expected SQL Query"
    DATABASE_TYPE = "Database"


# Initialize the preprocessor
preprocessor = DatasetPreprocessor(
    df_train=df_train,
    df_validation=df_validation,
    tokenizer=tokenizer,
    system_prompt=SYSTEM_PROMPT,
    base_filename=DATA_TRAIN_NAME,
    dir_path=DATA_TRAIN_DIR,
    user_field_name=ColumnName.SQL_QUESTION,
    assistant_field_name=ColumnName.SQL_QUERY,
    prompt_key_name="prompt"
)

# Split, process, and save the datasets
train_file, validation_file = (
    preprocessor
    .process_and_save()
)

# Get the dataset loader for training
dataset_loader_train = preprocessor.get_dataset_loader(train_file)
dataset_loader_validation = preprocessor.get_dataset_loader(validation_file)

Saved training data to: train_exp-1-finetuned-ft-1-text-to-sql-Qwen2.5-0.5B-Instruct.csv
Saved test data to: validation_exp-1-finetuned-ft-1-text-to-sql-Qwen2.5-0.5B-Instruct.csv


Setting num_proc from 64 back to 1 for the train split to disable multiprocessing as it only contains one shard.


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/394 [00:00<?, ? examples/s]

Setting num_proc from 64 back to 1 for the train split to disable multiprocessing as it only contains one shard.


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/24 [00:00<?, ? examples/s]

## Run Fine-Tuning

In [19]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_loader_train.dataset,
    eval_dataset= dataset_loader_validation.dataset,
    dataset_text_field = SFT_TRAINER_ARGS["dataset_text_field"],
    max_seq_length = MAX_SEQ_LENGTH,
    dataset_num_proc = SFT_TRAINER_ARGS["dataset_num_proc"],
    packing = SFT_TRAINER_ARGS["packing"], # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = TRAINING_ARGS["per_device_train_batch_size"],
        gradient_accumulation_steps = TRAINING_ARGS["gradient_accumulation_steps"],
        warmup_steps = TRAINING_ARGS["warmup_steps"],
        num_train_epochs = TRAINING_ARGS["num_train_epochs"], # Set this for 1 full training run.
        learning_rate = TRAINING_ARGS["learning_rate"],
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        optim = TRAINING_ARGS["optim"],
        weight_decay = TRAINING_ARGS["weight_decay"],
        lr_scheduler_type = TRAINING_ARGS["lr_scheduler_type"],  # constant, linear, cosine, cosine_with_restarts
        seed = TRAINING_ARGS["seed"],
        output_dir = FINETUNE_OUTPUT_PATH,
        report_to = TRAINING_ARGS["report_to"], # Use this for WandB etc,
        save_strategy=TRAINING_ARGS["save_strategy"],
        eval_strategy=TRAINING_ARGS["eval_strategy"],
        load_best_model_at_end=TRAINING_ARGS["load_best_model_at_end"],
        save_steps=TRAINING_ARGS["save_steps"],
        eval_steps=TRAINING_ARGS["eval_steps"],
        metric_for_best_model=TRAINING_ARGS["metric_for_best_model"],
        greater_is_better=TRAINING_ARGS["greater_is_better"],
        logging_steps=TRAINING_ARGS["logging_steps"],
        save_total_limit=TRAINING_ARGS["save_total_limit"],
    ),
)

Unsloth: Tokenizing ["prompt"] (num_proc=2):   0%|          | 0/394 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["prompt"] (num_proc=2):   0%|          | 0/24 [00:00<?, ? examples/s]

In [20]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA RTX A5000. Max memory = 23.679 GB.
0.967 GB of memory reserved.


In [21]:
%%time
trainer_stats = trainer.train()
trainer.save_model(FINAL_MODEL_PATH)
print(f"the best model save to: {FINAL_MODEL_PATH}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 394 | Num Epochs = 1 | Total steps = 98
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 4,399,104/498,431,872 (0.88% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
20,1.5942,1.476482
40,1.3894,1.304458
60,1.2411,1.177721
80,1.1342,1.094253


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


the best model save to: /home/text_to_sql/fine_tune_exp_1/Qwen/Qwen2.5-0.5B-Instruct/exp_id_1:fine_tuning_ft-1:ft_text_to_sql_few_shot_1:Qwen2.5-0.5B-Instruct
CPU times: user 2min 9s, sys: 1.78 s, total: 2min 10s
Wall time: 2min 21s


In [22]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

139.7354 seconds used for training.
2.33 minutes used for training.
Peak reserved memory = 1.406 GB.
Peak reserved memory for training = 0.439 GB.
Peak reserved memory % of max memory = 5.938 %.
Peak reserved memory for training % of max memory = 1.854 %.


In [23]:
# Get the final training loss
total_train_loss = sum([log['train_loss'] for log in trainer.state.log_history if 'train_loss' in log])

# Get the final validation loss
total_validation_loss = trainer.evaluate()["eval_loss"]
print(total_train_loss, total_validation_loss)

1.2914953426438935 1.0942529439926147


### Save the results to the leaderboard

In [25]:
import torch
from datetime import datetime

# Get GPU type
if torch.cuda.is_available():
    GPU_TYPE = torch.cuda.get_device_name(0)
else:
    GPU_TYPE = "None"

# Hyperparameters
hyperparameters_inf = {
    "max_seq_length": MAX_SEQ_LENGTH,
    "dtype": DTYPE,
    "load_in_4bit": LOAD_IN_4BIT,
    "peft_config": PEFT_CONFIG,
    "sft_trainer_args": SFT_TRAINER_ARGS,
    "training_args": TRAINING_ARGS
}


df_leaderboard = pd.DataFrame([
    {
        'exp_id': EXPERIMENT_ID,
        'exp_id_ref': EXPERIMENT_ID_REF,
        'experiment_reason': EXPERIMENT_REASON,
        'prompt_id': PROMPT_ID,
        'student_model': STUDENT_MODEL_NAME,
        'inference_framework': "vllm",
        'ner_tag': "",
        'precision': "",
        'recall': "",
        'f1_score': "",
        'val_precision': "",
        'val_recall': "",
        'val_f1': "",
        'training_set_id': TRAIN_SHEET_NAME,
        'validation_set_id': VALIDATION_SHEET_NAME,
        'test_set_id': "",
        'num_question_train': len(df_train),
        'num_question_validation': len(df_validation),
        'num_question_test': "",
        'topic': TOPIC,
        'fine_tuned_config': hyperparameters_inf,
        'max_training_vram(GB)': used_memory,
        'GPU Type': GPU_TYPE,
        'training_loss': round(total_train_loss, 3),
        'val_loss': round(total_validation_loss, 3),
        'training_time (seconds)': round(trainer_stats.metrics['train_runtime'], 4),
        'date': datetime.now().strftime("%Y-%m-%d"),
    }
])
df_leaderboard.to_csv(f"{FINETUNE_OUTPUT_PATH}/leaderboard_result.csv")
df_leaderboard


Unnamed: 0,exp_id,exp_id_ref,experiment_reason,prompt_id,student_model,inference_framework,ner_tag,precision,recall,f1_score,...,num_question_validation,num_question_test,topic,fine_tuned_config,max_training_vram(GB),GPU Type,training_loss,val_loss,training_time (seconds),date
0,1,-,Fine tuned text to sql used Qwen/Qwen2.5-0.5B-...,ft_text_to_sql_few_shot_1,Qwen/Qwen2.5-0.5B-Instruct,vllm,,,,,...,24,,TEXT TO SQL,"{'max_seq_length': 2048, 'dtype': None, 'load_...",1.406,NVIDIA RTX A5000,1.291,1.094,139.7354,2025-03-24


#### Upload to Google Sheets

In [21]:
# Upload to Leaderboard Google Sheets
from modules.google_sheets_writer import GoogleSheetsWriter
import logging

LEADERBOARD_SHEET_NAME = "leaderboard-fine-tuned"
writer = GoogleSheetsWriter(
    google_util=google,  # Your GoogleUtil instance
    sheet_id=GOOGLE_SPREADSHEET_ID,
    worksheet_name=LEADERBOARD_SHEET_NAME,
    batch_size=10,  # Customize batch size
    max_retries=5,  # Customize retry attempts
    batch_delay=2  # Customize delay between batches
)
# Write the DataFrame
result = writer.write_dataframe(df_leaderboard)

# Log results
logging.info(f"Successfully wrote {result.successful_rows} rows")
if result.failed_rows > 0:
    logging.error(f"Failed to write {result.failed_rows} rows")
    for error in result.errors:
        logging.error(f"Row {error['row_number']}: {error['error']}")

  0%|          | 0/1 [00:00<?, ?it/s]

2025-01-26 10:44:32,280 - INFO - Successfully wrote row 1/1
100%|██████████| 1/1 [00:02<00:00,  2.06s/it]
2025-01-26 10:44:32,285 - INFO - Successfully wrote 1 rows


## Sanity check

In [2]:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_core.prompts import PromptTemplate
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from peft import PeftModel

import torch
model_dir = "unsloth"  # based on your base model path.
model_name = "Qwen2.5-0.5B-Instruct" # based on your base model path.
model_id = os.path.join(model_dir, model_name)
sft_path = "/home/text_to_sql/fine_tune_exp_1/Qwen/Qwen2.5-0.5B-Instruct/exp_id_1:fine_tuning_ft-1:ft_text_to_sql_few_shot_1:Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, sft_path)  # uncomment if you have fine tuned model
model = model.merge_and_unload()  # uncomment if you have fine tuned model

pipe = pipeline(
    task="text-generation",
    model=model,
    device=0,
    torch_dtype=torch.bfloat16,
    tokenizer=tokenizer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    return_full_text=False,
    model_kwargs = {"temperature": 0, "do_sample":True},
)

hf_pipeline = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda:0
  hf_pipeline = HuggingFacePipeline(pipeline=pipe)


In [29]:
df_validation.head(2)

Unnamed: 0,No,Prompt,Expected SQL Query,Sheet,Database
0,1,Berapa jumlah karyawan yang dipromosikan pada ...,SELECT COUNT(DISTINCT employment_status_histor...,catapa_syntetics_employee,core
1,2,Identifikasi manajer dengan rentang kendali te...,"SELECT managers.name AS nama_manajer, COUNT(em...",catapa_syntetics_employee,core


In [13]:
from datetime import datetime
from time import time
from tqdm import tqdm

from modules.database_info.schema import employee_schema, time_management_schema
from modules.database_info.master_data import employee_master_data, time_management_master_data
from modules.database_info.relation import employee_relations, time_management_relations
from modules.database_info.trustee_tables import data_trustee_employee, data_trustee_time_management
from modules.database_info.anonymize_entities import anonymized_entities_description

results = []

inference_result_dir = "sql_generator_result"
os.makedirs(inference_result_dir, exist_ok=True)
current_date = datetime.now().strftime("%d %B %Y")
sql_generator_result_path = os.path.join(inference_result_dir, f"{model_name}-using-catapa-prompt.csv")


question_id_test = [1, 2, 3, 5]

for index, df_row in tqdm(enumerate(df_validation.iterrows()), desc="Generating SQL queries", total=len(df_validation)):
    user_instruction = df_row[1]['Prompt']
    no = df_row[1]['No']
    if int(no) not in question_id_test:
        continue
    database_type = df_row[1]['Database']

    start_time = time()
    if database_type == "core":
        schema = employee_schema
        relations = employee_relations
        master_data = employee_master_data
        data_trustee_tables = data_trustee_employee
        master_data = employee_master_data
    else:
        schema = time_management_schema
        relations = time_management_relations
        master_data = time_management_master_data
        data_trustee_tables = data_trustee_time_management
        master_data = time_management_master_data

    formatted_user_prompt = USER_PROMPT.format(
        schema=schema,
        relations=relations,
        master_data=master_data,
        data_trustee_tables=data_trustee_tables,
        anonymized_entities_description=anonymized_entities_description,
        current_date=current_date,
        user_instruction=user_instruction,
        query_result = ""
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": formatted_user_prompt},
    ]
    # Check if the tokenizer has a chat template
    has_chat_template = hasattr(tokenizer, "chat_template") and tokenizer.chat_template is not None

    if has_chat_template:
        # Tokenize input with chat template
        inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        inputs = messages

    sql_generator_result = hf_pipeline.invoke(inputs)
    print(sql_generator_result)

Generating SQL queries:   0%|          | 0/24 [00:00<?, ?it/s]

Generating SQL queries:   4%|▍         | 1/24 [00:02<01:02,  2.73s/it]

{
  "sql_query": "SELECT COUNT(*) AS 'jumlah_karyawan_promosikan' FROM employees WHERE job_level_id IN ([Job Level IDs]) AND employment_status_type_id IN ([Employment Status Type IDs]) AND start_date BETWEEN STR_TO_DATE('2023-01-01', '%Y-%m-%d') AND STR_TO_DATE('2023-12-31', '%Y-%m-%d');"
}


Generating SQL queries:   8%|▊         | 2/24 [00:15<03:14,  8.85s/it]

{
  "sql_query": "SELECT employees.name, MAX(employee_details.race_id) AS 'race_id', MAX(employee_details.religion_id) AS'religion_id', MAX(employee_details.blood_type) AS 'blood_type', MAX(employee_details.marital_status) AS'marital_status', MAX(employee_details.gender) AS 'gender', MAX(employee_details.date_of_birth) AS 'date_of_birth', MAX(employee_details.family_card_number) AS 'family_card_number', MAX(employee_details.race_id) AS 'race_id', MAX(employee_details.religion_id) AS'religion_id', MAX(employee_details.blood_type) AS 'blood_type', MAX(employee_details.marital_status) AS'marital_status', MAX(employee_details.gender) AS 'gender', MAX(employee_details.date_of_birth) AS 'date_of_birth', MAX(employee_details.family_card_number) AS 'family_card_number', MAX(employee_details.race_id) AS 'race_id', MAX(employee_details.religion_id) AS'religion_id', MAX(employee_details.blood_type) AS 'blood_type', MAX(employee_details.marital_status) AS'marital_status', MAX(employee_details.gend

Generating SQL queries:  12%|█▎        | 3/24 [00:17<01:52,  5.37s/it]

{
  "sql_query": "SELECT locations.name AS 'Locais', COUNT(employee_details.employee_id) AS 'Jumlah Employee' FROM locations JOIN employees ON locations.id = employees.location_id GROUP BY locations.name;"
}


Generating SQL queries: 100%|██████████| 24/24 [00:18<00:00,  1.32it/s]

{
  "sql_query": "SELECT COUNT(*) AS 'jumlah_karyawan_langup_dengan_manajer_tingkat_atas' FROM employees WHERE manager_id IS NOT NULL;"
}





## SAVE MODEL TO S3

### Convert to zip

In [29]:
import os

# Check the current working directory
print("Current working directory:", os.getcwd())

# Change to the desired directory
os.chdir(FINAL_MODEL_PATH)

# Verify the change
print("Changed to:", os.getcwd())

Current working directory: /home/gen-ai-exploration/fine_tuning/sft/ner
Changed to: /home/gen-ai-exploration/fine_tuning/sft/ner/fine_tune_results_17/Qwen/Qwen2.5-1.5B-Instruct/exp_id_21:fine_tuning_17:ner_entity_value_few_shot_simple


In [30]:
## Save to zip
zip_name = f"{FINAL_FINETUNE_OUTPUT_NAME}.zip"
!zip -r "../{zip_name}" "."
print(zip_name)

  adding: README.md (deflated 66%)
  adding: adapter_model.safetensors (deflated 8%)
  adding: adapter_config.json (deflated 55%)
  adding: tokenizer_config.json (deflated 83%)
  adding: special_tokens_map.json (deflated 67%)
  adding: added_tokens.json (deflated 65%)
  adding: vocab.json (deflated 61%)
  adding: merges.txt (deflated 57%)
  adding: tokenizer.json (deflated 81%)
  adding: training_args.bin (deflated 52%)
  adding: hyperparameters.json (deflated 60%)
exp_id_21:fine_tuning_17:ner_entity_value_few_shot_simple.zip


### Upload model to S3

In [24]:
import os
import boto3
import getpass

# Set your AWS credentials programmatically using boto3
aws_access_key = getpass.getpass("Insert your AWS Access Key (your typing will be hidden, press Enter when done): ")
aws_secret_key = getpass.getpass("Insert your AWS Secret Key (your typing will be hidden, press Enter when done): ")
region = "ap-southeast-1"  # Example region

In [25]:
# Set AWS credentials as environment variables (for the entire notebook session)
os.environ["AWS_ACCESS_KEY_ID"] = aws_access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = aws_secret_key
os.environ["AWS_DEFAULT_REGION"] = region

# You can now create clients or resources from this session
s3 = boto3.client("s3")

# Function to handle the "aws s3 cp --recursive" functionality
def upload_with_aws_s3_cp(local_dir: str, s3_bucket: str, s3_prefix: str):
    command = f"aws s3 cp --recursive {local_dir} s3://{s3_bucket}/{s3_prefix}"
    os.system(command)  # This will execute the AWS CLI command directly from the notebook

def download_with_aws_s3_cp(local_dir: str, s3_bucket: str, s3_prefix: str):
    command = f"aws s3 cp --recursive s3://{s3_bucket}/{s3_prefix} {local_dir}"
    os.system(command)  # This will execute the AWS CLI command directly from the notebook

def upload_zip_to_s3(local_file_path: str, s3_bucket: str, s3_prefix: str):
    """Upload a zip file to S3 bucket

    Args:
        local_file_path (str): Path to local zip file
        s3_bucket (str): Name of S3 bucket
        s3_key (str): S3 object key (path in bucket)

    Returns:
        bool: True if upload successful, False otherwise
    """
    try:
        # Upload file to S3
        s3.upload_file(local_file_path, s3_bucket, s3_prefix)
        print(f"Successfully uploaded {local_file_path} to s3://{s3_bucket}/{s3_prefix}")
        return True
    except Exception as e:
        print(f"Error uploading file to S3: {str(e)}")
        return False

2025-01-26 10:46:29,771 - INFO - Found credentials in environment variables.


#### Upload zip model

In [28]:
zip_model = f"../{zip_name}"
s3_model_bucket = "glair-gen-ai-llm-model"
s3_model_prefix = f"sft_slm_ner/fine-tuned/{zip_name}"

#  Upload zip file
upload_zip_to_s3(zip_model, s3_model_bucket, s3_model_prefix)

Successfully uploaded ../exp_id_20:fine_tuning_16:ner_entity_value_few_shot_simple.zip to s3://glair-gen-ai-llm-model/sft_slm_ner/fine-tuned/exp_id_20:fine_tuning_16:ner_entity_value_few_shot_simple.zip


True