### 1. Installation

#### Install Dependencies

In [1]:
%%capture

!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

#### Imports

In [11]:
from unsloth import FastLanguageModel # Import unsloth before transformers/trl so its patches apply
from unsloth import is_bfloat16_supported # Checks if the hardware supports bfloat16 precision
from datasets import load_dataset
from transformers import TrainingArguments # Defines training hyperparameters
from trl import SFTTrainer # Trainer for supervised fine-tuning (SFT)

### 2. Load Base Model

In [27]:
max_seq_length = 2048 # Any length works; Unsloth handles RoPE scaling internally
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.1.8: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3080. Max memory: 9.56 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 

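We now add LoRA adapters so that only a small fraction of the parameters is trained. Adapter setups typically update 1 to 10% of all parameters; with `r = 16` on this 8B model the trainer reports about 42 million trainable parameters, roughly 0.5%.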

In [28]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.
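
The trainable-parameter count that LoRA adds can be checked by hand. The sketch below assumes the usual Llama-3.1-8B shapes (hidden size 4096, MLP width 14336, 32 layers, 8 key/value heads of dimension 128); these dimensions are not stated in this notebook and are an assumption. Under that assumption, rank 16 on the seven target modules gives the same figure the trainer banner reports later.

```python
# Assumed Llama-3.1-8B shapes (an assumption, not taken from this notebook).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 8 * 128       # 8 key/value heads x head dim 128 (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) for every projection listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    per_layer = sum(
        rank * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values()
    )
    return per_layer * NUM_LAYERS


trainable = lora_trainable_params(rank=16)
print(f"{trainable:,}")  # 41,943,040
```

Doubling `r` doubles the count, which is why larger ranks cost proportionally more optimizer memory.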


### 3. Inference before Finetune

#### Define Prompt

In [38]:
pre_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a customer service on Telkomsel with advanced knowledge in each product of Telkomsel, diagnostic and troubleshooting problem that might occured of that product.
Please answer the following customer question in Indonesian language.

### Question:
{}

### Response:
{}"""

#### Run Inference

In [39]:
# Test customer-service question for inference
question = """Sudah sering dapat info penanganan Indihome lelet dll. Tapi ga ada pilihan lain. Penyedia layanan internet cuman itu yg bisa masuk kampungku.
Sejam 2 jam masih sabar. Ini udah 12 jam ga ada info apa-apa. Ga ada yg menghubungi ga ada teknisi yang datang. Tolong lampirkan vidio atau foto proses penanganan untuk perbaikan.
Sudah 24 jam ga ada perubahan!?"""

# Enable optimized inference mode for Unsloth models (improves speed and efficiency)
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!

# Format the question with the prompt template (`pre_prompt_style`) and tokenize it
inputs = tokenizer([pre_prompt_style.format(question, "")], return_tensors="pt").to("cuda")  # Convert input to PyTorch tensor & move to GPU

# Generate a response using the model
outputs = model.generate(
    input_ids=inputs.input_ids, # Tokenized input question
    attention_mask=inputs.attention_mask, # Attention mask to handle padding
    max_new_tokens=1200, # Limit response length to 1200 tokens (to prevent excessive output)
    use_cache=True, # Enable caching for faster inference
)

# Decode the generated output tokens into human-readable text
response = tokenizer.batch_decode(outputs)

# Print the full decoded output (prompt plus response); the commented line below keeps only the part after "### Response:"
print(response[0])
# print(response[0].split("### Response:")[1])

<｜begin▁of▁sentence｜>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a customer service on Telkomsel with advanced knowledge in each product of Telkomsel, diagnostic and troubleshooting problem that might occured of that product.
Please answer the following customer question in Indonesian language.

### Question:
Sudah sering dapat info penanganan Indihome lelet dll. Tapi ga ada pilihan lain. Penyedia layanan internet cuman itu yg bisa masuk kampungku.
Sejam 2 jam masih sabar. Ini udah 12 jam ga ada info apa-apa. Ga ada yg menghubungi ga ada teknisi yang datang. Tolong lampirkan vidio atau foto proses penanganan untuk perbaikan.
Sudah 24 jam ga ada perubahan!?

### Response:
<think>
Okay, so I've got this customer 
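
The decoded output repeats the whole prompt before the answer, so it helps to isolate the part after the `### Response:` marker. The helper below is a minimal sketch; `extract_response` is a hypothetical name, and the EOS string is the value this tokenizer reports for `tokenizer.eos_token`. It uses `rpartition`, so it does not raise when the marker is missing, unlike indexing `split(...)[1]`.

```python
RESPONSE_MARKER = "### Response:"
EOS = "<｜end▁of▁sentence｜>"  # tokenizer.eos_token for this model


def extract_response(decoded: str, marker: str = RESPONSE_MARKER, eos: str = EOS) -> str:
    """Return the text after the last marker, with the EOS token removed."""
    _, found, tail = decoded.rpartition(marker)
    if not found:
        # Marker missing: fall back to the whole text instead of raising.
        return decoded.replace(eos, "").strip()
    return tail.replace(eos, "").strip()


# Stand-in for response[0]; the real string comes from tokenizer.batch_decode.
decoded = "### Question:\nHalo?\n\n### Response:\n<think>\nOkay\n</think>\nHalo Kak" + EOS
print(extract_response(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the begin and end tokens before splitting.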

### 4. Fine-Tune

#### Load Dataset

In [40]:
dataset = load_dataset("hndrbrm/Testing", "en", split="train[0:500]", trust_remote_code=True)  # Keep only the first 500 rows
dataset

Dataset({
    features: ['Question', 'Response'],
    num_rows: 500
})

#### Define the Fine-Tuning Prompt

In [41]:
EOS_TOKEN = tokenizer.eos_token  # Appended to every training example so the model learns when to stop generating
EOS_TOKEN

'<｜end▁of▁sentence｜>'

In [42]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a customer service on Telkomsel with advanced knowledge in each product of Telkomsel, diagnostic and troubleshooting problem that might occured of that product.
Please answer the following customer question in Indonesian language.

### Question:
{}

### Response:
{}"""

In [43]:
# Format a batch of examples into training prompts
def formatting_prompts_func(examples):  # Receives a batch: a dict mapping column names to lists
    questions = examples["Question"]    # Customer questions from the dataset
    responses = examples["Response"]    # Reference customer-service responses

    texts = []  # Formatted prompts, one per example

    # Fill the prompt template with each question/response pair
    for question, response in zip(questions, responses):
        text = prompt_style.format(question, response) + EOS_TOKEN  # Append EOS so the model learns where to stop
        texts.append(text)

    return {
        "text": texts,  # New "text" column holding the formatted prompts
    }

In [44]:
dataset_finetune = dataset.map(formatting_prompts_func, batched = True)
dataset_finetune["text"][0]

Map: 100%|██████████| 500/500 [00:00<00:00, 39492.90 examples/s]


'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a customer service on Telkomsel with advanced knowledge in each product of Telkomsel, diagnostic and troubleshooting problem that might occured of that product.\nPlease answer the following customer question in Indonesian language.\n\n### Question:\nmau tanya, jika sudah menginputkan username dan passwordnya tetapi tidak bisa masuk, padahal sudah benar itu solusinya gimana ya?\n\n### Response:\nMaaf ya Kak :( Nesya cek udah ada interaksi di DM nih. Kakak bisa konfirmasi ke biar dibantu lebih lanjut ya :(<｜end▁of▁sentence｜>'
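
With `batched=True`, `Dataset.map` hands the function a dict of column name to list and expects a dict of equal-length lists back. The sketch below reproduces that contract with plain Python so it can be tried without downloading anything; `PROMPT` is a shortened stand-in for `prompt_style`, and `format_batch` is a hypothetical name.

```python
PROMPT = "### Question:\n{}\n\n### Response:\n{}"  # shortened stand-in for prompt_style
EOS_TOKEN = "<｜end▁of▁sentence｜>"                 # tokenizer.eos_token for this model


def format_batch(examples: dict) -> dict:
    """Turn parallel Question/Response lists into one list of training texts."""
    texts = [
        PROMPT.format(question, response) + EOS_TOKEN
        for question, response in zip(examples["Question"], examples["Response"])
    ]
    return {"text": texts}


# A fake two-row batch in the same shape Dataset.map would pass in.
batch = {"Question": ["Q1", "Q2"], "Response": ["R1", "R2"]}
out = format_batch(batch)
print(out["text"][0])
```

Every text ends with the EOS token; without it the fine-tuned model has no training signal for where a response should end.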

In [45]:
# Initialize the fine-tuning trainer (SFTTrainer from trl)
trainer = SFTTrainer(
    model=model,  # The model to be fine-tuned
    tokenizer=tokenizer,  # Tokenizer to process text inputs
    train_dataset=dataset_finetune,  # Dataset used for training
    dataset_text_field="text",  # Specifies which field in the dataset contains training text
    max_seq_length=max_seq_length,  # Defines the maximum sequence length for inputs
    dataset_num_proc=2,  # Uses 2 worker processes to speed up data preprocessing

    # Define training arguments
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Number of examples processed per device (GPU) at a time
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps before updating weights
        num_train_epochs=1,  # Ignored here: a positive max_steps overrides num_train_epochs
        warmup_steps=5,  # Gradually increases learning rate for the first 5 steps
        max_steps=60,  # Limits training to 60 optimizer steps (useful for debugging; increase for a full run)
        learning_rate=2e-4,  # Learning rate for weight updates (tuned for LoRA fine-tuning)
        fp16=not is_bfloat16_supported(),  # Use FP16 (if BF16 is not supported) to speed up training
        bf16=is_bfloat16_supported(),  # Use BF16 if supported (better numerical stability on newer GPUs)
        logging_steps=10,  # Logs training progress every 10 steps
        optim="adamw_8bit",  # Uses memory-efficient AdamW optimizer in 8-bit mode
        weight_decay=0.01,  # Regularization to prevent overfitting
        lr_scheduler_type="linear",  # Uses a linear learning rate schedule
        seed=3407,  # Sets a fixed seed for reproducibility
        output_dir="outputs",  # Directory where fine-tuned model checkpoints will be saved
    ),
)

Map (num_proc=2): 100%|██████████| 500/500 [00:01<00:00, 325.17 examples/s]
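
To see what `warmup_steps=5` and `lr_scheduler_type="linear"` do together, the schedule can be sketched in a few lines. This is an illustration of the schedule's shape for the values in the cell above, not the trainer's own code; `linear_warmup_lr` is a hypothetical helper.

```python
def linear_warmup_lr(step: int, base_lr: float = 2e-4,
                     warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr over the first warmup_steps steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, linear_warmup_lr(step))
```

The rate peaks at `learning_rate` right after warmup and reaches zero at the final step, so raising `max_steps` also stretches the decay.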


#### Start Training

In [46]:
# Start the fine-tuning process
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 120
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 10 | 0.6955 |
| 20 | 0.5547 |
| 30 | 0.6461 |
| 40 | 0.6636 |
| 50 | 0.7135 |
| 60 | 0.6839 |
| 70 | 0.5217 |
| 80 | 0.4534 |
| 90 | 0.4265 |
| 100 | 0.4265 |
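
The banner's "Total batch size" and the step counts can be related with a little arithmetic. The sketch below uses the values from the trainer cell and the 500-row dataset; the helper names are hypothetical. With the `max_steps=60` shown in the cell, training covers just under one pass over the data. The 120 total steps in the recorded banner would cover almost two passes, which suggests the logged run used a different `max_steps`; that is an inference from the numbers, not something the notebook states.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def passes_over_data(max_steps: int, batch: int, num_examples: int) -> float:
    """How many times the dataset is traversed in max_steps optimizer steps."""
    return max_steps * batch / num_examples


batch = effective_batch_size(per_device=2, grad_accum=4)  # 8
print(batch)
print(passes_over_data(60, batch, 500))   # steps configured in the trainer cell
print(passes_over_data(120, batch, 500))  # steps reported in the recorded banner
```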


### 5. Inference after Finetune

In [47]:
question = """Sudah sering dapat info penanganan Indihome lelet dll. Tapi ga ada pilihan lain. Penyedia layanan internet cuman itu yg bisa masuk kampungku.
Sejam 2 jam masih sabar. Ini udah 12 jam ga ada info apa-apa. Ga ada yg menghubungi ga ada teknisi yang datang. Tolong lampirkan vidio atau foto proses penanganan untuk perbaikan.
Sudah 24 jam ga ada perubahan!?"""

# Switch the model to Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!

# Tokenize the input question with a specific prompt format and move it to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = model.generate(
    input_ids=inputs.input_ids,          # Tokenized input IDs
    attention_mask=inputs.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200,                  # Maximum length for generated response
    use_cache=True,                        # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the model's response part after "### Response:"
print(response[0].split("### Response:")[1])


Kepada Pihak Pelanggan yang memilikinya, kami memahami kesulitan yang sedang didalamnya. Dengan hormat, kami telah mengusulkan beberapa langkah tambahan untuk mempercepat proses perbaikan, seperti:

1. **Pengujian Kabel Kabel TV/Modem** – Kita akan memastikan apakah ada kesalahan pada kabel yang terhubung ke Modem.
2. **Pengujian Koneksi Nya** – Kita akan melakukan pengujian koneksi internet untuk memastikan apakah ada masalah koneksi.
3. **Pengujian Sambungan** – Kita akan memeriksa apakah ada masalah sambungan yang menyebabkan koneksi tidak stabil.
4. **Pengujian Kepulangan** – Kita akan mengujian ulang sambungan kebasis untuk memastikan koneksi sudah terbaik.

Dengan mengikuti langkah-langkah ini, kita berharap dapat mempercepat proses perbaikan dan mengembalikan koneksi internet Anda ke layaknya. Jika tetap belum membaik, silakan hubungi layanan servis terdekat atau kunjungi Kantor Terdekat kami untuk mendapatkan bantuan lebih lanjut.

Semoga membantu!<｜end▁of▁sentence｜>
