<a href="https://colab.research.google.com/github/YatsauAliaksei/unsloth_llama_fine_tune/blob/main/unsloth_sinch_ft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Original article: https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=2ejIt2xSNKKp

# Fine-Tuning Llama-3.2 3B with Unsloth

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
from unsloth import FastLanguageModel
import torch


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
#@title properties
root_dir="/content/drive/MyDrive/colab"

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

In [4]:
#@title load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
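
The LoRA settings above (`r = 16` on seven projection matrices in each of the 28 patched layers) determine how many parameters are actually trained. The sketch below estimates that count. The layer dimensions are assumptions taken from the published Llama-3.2-3B configuration, not from this notebook; the result can be compared with the "Number of trainable parameters" line that the trainer prints when training starts.

```python
# Assumed Llama-3.2-3B dimensions (from the published model config, not from this notebook).
HIDDEN = 3072         # model hidden size
INTERMEDIATE = 8192   # MLP intermediate size
KV_DIM = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 28       # matches "patched 28 layers" in the log above
RANK = 16             # r passed to get_peft_model

# (in_features, out_features) of every adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds two matrices per projection: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total_trainable = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total_trainable:,}")
```

Because the count scales linearly with `r`, doubling the rank to 32 would double the number of trainable parameters under the same assumptions.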


# Prepare Dataset

In [6]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }


In [7]:
from IPython.display import display, HTML

def display_ds(dataset, from_row, to_row):
    """
    Displays a range of rows from the dataset as an HTML table.

    Args:
        dataset (datasets.Dataset): The dataset to display.
        from_row (int): The starting row index (inclusive).
        to_row (int): The ending row index (exclusive).
    """
    df = dataset.to_pandas()[from_row:to_row]
    html_table = df.to_html()
    display(HTML(html_table))

In [8]:
from datasets import concatenate_datasets, Dataset

# Load the train and test splits saved on Drive
dataset = Dataset.load_from_disk(root_dir + '/sinch/ds/train')
dataset_test = Dataset.load_from_disk(root_dir + '/sinch/ds/test')

# Merge both splits so that every example is used for fine-tuning
dataset = concatenate_datasets([dataset, dataset_test])

display_ds(dataset, 2, 4)

Unnamed: 0,topic,question,answer
2,BillRun FAQs billrun,What start date can be added to fixed fees right before a bill run?,"If the bill run is on the 4th of May, fixed fees should be added throughout April. On the off chance, an additional fee ought to be included on the invoice generated on 4th of May, then the start date can be changed to April but this must be done no later than by the 2nd of May to ensure that the fee is added to the immediate upcoming billrun, i.e., 4th of May billrun. Keep in mind whether the account charges fees zero (0) or one (1) month in advance."
3,BillRun FAQs billrun,Why does the time stamp on the balance sheet UI not show the last calendar date anymore?,"The balance sheet UI will show either the last or first calendar date in a month due to user's browsers config or the timezone they are in. For example, Singaporean users won't see the same month end date as Brazil users. However, users will always see the same pattern each month, provided that the user does not change their timezone."


In [9]:
def add_conversations_column(row):
    topic = row['topic']
    question = row['question']
    answer = row['answer']
    row['conversations'] = [
        {
            'role': 'user',
            'content': f"Sinch topic: {topic}. {question}"
        },
        {
            'role': 'assistant',
            'content': answer,
        },
    ]
    return row

In [10]:
dataset = dataset.map(add_conversations_column)

Map:   0%|          | 0/1286 [00:00<?, ? examples/s]

In [11]:
display_ds(dataset, 2, 4)

Unnamed: 0,topic,question,answer,conversations
2,BillRun FAQs billrun,What start date can be added to fixed fees right before a bill run?,"If the bill run is on the 4th of May, fixed fees should be added throughout April. On the off chance, an additional fee ought to be included on the invoice generated on 4th of May, then the start date can be changed to April but this must be done no later than by the 2nd of May to ensure that the fee is added to the immediate upcoming billrun, i.e., 4th of May billrun. Keep in mind whether the account charges fees zero (0) or one (1) month in advance.","[{'content': 'Sinch topic: BillRun FAQs billrun. What start date can be added to fixed fees right before a bill run?', 'role': 'user'}, {'content': 'If the bill run is on the 4th of May, fixed fees should be added throughout April. On the off chance, an additional fee ought to be included on the invoice generated on 4th of May, then the start date can be changed to April but this must be done no later than by the 2nd of May to ensure that the fee is added to the immediate upcoming billrun, i.e., 4th of May billrun. Keep in mind whether the account charges fees zero (0) or one (1) month in advance.', 'role': 'assistant'}]"
3,BillRun FAQs billrun,Why does the time stamp on the balance sheet UI not show the last calendar date anymore?,"The balance sheet UI will show either the last or first calendar date in a month due to user's browsers config or the timezone they are in. For example, Singaporean users won't see the same month end date as Brazil users. However, users will always see the same pattern each month, provided that the user does not change their timezone.","[{'content': 'Sinch topic: BillRun FAQs billrun. Why does the time stamp on the balance sheet UI not show the last calendar date anymore?', 'role': 'user'}, {'content': 'The balance sheet UI will show either the last or first calendar date in a month due to user's browsers config or the timezone they are in. For example, Singaporean users won't see the same month end date as Brazil users. However, users will always see the same pattern each month, provided that the user does not change their timezone.', 'role': 'assistant'}]"


In [12]:
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/1286 [00:00<?, ? examples/s]

In [13]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSinch topic: Billing section customer dashboard FAQs. What traffic reports are available in the customer dashboard under the billing section?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nOnly SMS traffic reports are currently available in the usage section in the customer dashboard.<|eot_id|>'
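
To make the layout of the formatted string above easier to read, here is a minimal sketch that rebuilds it by hand. `render_llama31` is a hypothetical helper, not part of Unsloth or Transformers; the real template is a Jinja template stored in the tokenizer and covers more cases. The system header text, including its dates, is copied from the output above rather than computed.

```python
# System header exactly as it appears in the formatted example above (copied, not computed).
SYSTEM_HEADER = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"


def render_llama31(conversation, system_header=SYSTEM_HEADER):
    """Hypothetical helper: render [{'role', 'content'}, ...] in the layout shown above."""
    text = "<|begin_of_text|>"
    text += f"<|start_header_id|>system<|end_header_id|>\n\n{system_header}<|eot_id|>"
    for turn in conversation:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        text += f"<|start_header_id|>{turn['role']}<|end_header_id|>\n\n{turn['content']}<|eot_id|>"
    return text


conversation = [
    {
        "role": "user",
        "content": "Sinch topic: Billing section customer dashboard FAQs. "
                   "What traffic reports are available in the customer dashboard under the billing section?",
    },
    {
        "role": "assistant",
        "content": "Only SMS traffic reports are currently available in the usage section in the customer dashboard.",
    },
]

rendered = render_llama31(conversation)
print(rendered)
```

The `<|start_header_id|>user<|end_header_id|>\n\n` and `<|start_header_id|>assistant<|end_header_id|>\n\n` markers visible here are the same strings later used to separate instructions from responses.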

# Run Training

In [54]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 2, # Two full passes over the dataset.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/1286 [00:00<?, ? examples/s]

In [55]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map:   0%|          | 0/1286 [00:00<?, ? examples/s]
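
Conceptually, `train_on_responses_only` makes the loss depend only on the assistant's tokens: every label outside a response is replaced with `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below illustrates that idea on plain lists of token IDs. It is a simplified illustration with hypothetical helper names and toy IDs, not Unsloth's actual implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_all(sequence, pattern):
    """Return every start index at which `pattern` occurs in `sequence`."""
    n = len(pattern)
    return [i for i in range(len(sequence) - n + 1) if sequence[i:i + n] == pattern]


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker (hypothetical helper)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # The response runs until the next instruction marker, or to the end of the sequence.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy IDs: [10, 11] stands in for the user header, [20, 21] for the assistant header.
toy_input = [10, 11, 5, 20, 21, 7, 10, 11, 6, 20, 21, 8]
toy_labels = mask_non_responses(toy_input, [10, 11], [20, 21])
print(toy_labels)
```

In the toy sequence only the tokens `7` and `8` keep their labels, so the user questions and the headers contribute nothing to the gradient.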

In [56]:
#tokenizer.decode(trainer.train_dataset[5]["input_ids"]);

In [57]:
#space = tokenizer(" ", add_special_tokens = False).input_ids[0]
#tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]]);

In [18]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [58]:
#@title training
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,286 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 320
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
20,1.2494
40,1.0647
60,0.8533
80,0.8788
100,0.6828
120,0.5784
140,0.7223
160,0.4922
180,0.2277
200,0.2002


In [59]:
#@title Test fine-tuned model
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

FastLanguageModel.for_inference(model); # Enable native 2x faster inference

In [108]:
from transformers import TextStreamer

def send_msg_stream(msg, temperature=1.5):
    messages = [
      { "role": "user", "content": f"You are a Sinch company assistant. Only answer questions related to Sinch domain. If a question is irrelevant, respond with: 'This is outside my expertise.' ### Question: {msg}" }
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")


    text_streamer = TextStreamer(tokenizer, skip_prompt = True)

    def process_stream(text, stream_end=False):
        # Drop the '<|eot_id|>' end-of-turn token; str.rstrip would strip a character set, not this suffix
        text = text.replace("<|eot_id|>", "")
        print(text, end="", flush=True)  # Display streamed text

    text_streamer.on_finalized_text = process_stream

    print('User:', msg)
    print('Assistant:', end=' ')
    _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 256,
                      use_cache = True, temperature = temperature, min_p = 0.1)
    print()


def send_msg(msg, temperature=1.5):
    messages = [
      {
       "role": "user",
       "content":
       f"""
            You are a Sinch company assistant.
            Only answer questions related to Sinch company and Sinch domain.
            If a question is irrelevant, respond with: 'This is outside my expertise. But I'm happy to answer any Sinch related questions'
            ### Question: {msg}
       """
      }
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    outputs = model.generate(input_ids = inputs, max_new_tokens = 256, use_cache = True,
                          temperature = temperature, min_p = 0.1)

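    print('User:', msg)
    # Keep only the text after the last assistant header and drop the end-of-turn token
    decoded = tokenizer.batch_decode(outputs)[0]
    answer = decoded.split('<|start_header_id|>assistant<|end_header_id|>\n\n')[-1].replace('<|eot_id|>', '')
    print('Assistant:', answer)
    return answer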
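
Both helpers sample with a `temperature` and `min_p = 0.1`. The sketch below shows, on toy logits, how these two settings interact: temperature rescales the logits, then min-p discards every token whose probability is below `min_p` times the probability of the most likely token. This is an illustrative pure-Python version with a hypothetical function name, assuming temperature is applied before the min-p filter; it is not the Transformers implementation.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return the renormalized sampling distribution after temperature and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens at least min_p times as likely as the top token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


toy_logits = [2.0, 1.0, 0.0, -3.0]   # made-up scores for four candidate tokens
for temperature in (0.7, 1.5):
    dist = min_p_filter(toy_logits, temperature=temperature, min_p=0.1)
    print(temperature, [round(p, 3) for p in dist])
```

A higher temperature flattens the distribution, so more candidates pass the min-p threshold. That is one reason answers generated with the default `temperature=1.5` can vary noticeably from run to run.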

In [61]:
display_ds(dataset.select_columns(['topic', 'question', 'answer']), 22, 24)

Unnamed: 0,topic,question,answer
22,Billing FAQs general,Is it possible to customise invoices based on project id even when the customer has multiple Nova accounts?,"Yes, it is possible. For example when a customer has three (3) account with multiple project ids per account then for instance 2 invoices can be created, mapping each project towards the specific billing account."
23,Billing FAQs general,What is the difference between Project based billing and Project based invoicing?,"It is the same thing but for clarity; Project based invoicing/billing = 1 invoice, separated by project name in the headers. Project based split = Multiple invoices based on project names."


In [74]:
send_msg("what is the last day to update CPT prices?", temperature=0.7)

'The last day to update the rates is 17:00 UTC on each Monday.'

In [75]:
send_msg_stream("What is the difference between Project based billing and Project based invoicing?")

Multiple projects are linked to a single invoice, and each project is billed individually as a line item on the invoice.

In [110]:
send_msg_stream("How to activate the project-based billing line (PBILI) element?")
print()
send_msg_stream("what's the weather today?")

User: How to activate the project-based billing line (PBILI) element?
Assistant: To activate the feature, the following must be done: i. Sign the Sinch Acceptable Usage Policy. ii. Reach out to Sinch Billing Ops to configure.

User: what's the weather today?
Assistant: This is outside my expertise.


In [77]:
send_msg("Hi, how are you?")

"This is outside my expertise. But I'm happy to answer any Sinch related questions."

In [112]:
send_msg("tell me about Sinch company")
send_msg("how many people work at Sinch?", 1.0);

User: tell me about Sinch company
Assistant: Sinch is a global provider of communication services, allowing customers to connect people around the world through mobile, landline, broadband, voice, and IoT messaging products.
User: how many people work at Sinch?
Assistant: There are over 4,500 people employed by Sinch across the globe.


# Save Model

In [113]:
#@title save to local storage
model.save_pretrained(root_dir + "/sinch/assistant/unsloth_v1") # Local saving
tokenizer.save_pretrained(root_dir + "/sinch/assistant/unsloth_v1")

('/content/drive/MyDrive/colab/sinch/assistant/unsloth_v1/tokenizer_config.json',
 '/content/drive/MyDrive/colab/sinch/assistant/unsloth_v1/special_tokens_map.json',
 '/content/drive/MyDrive/colab/sinch/assistant/unsloth_v1/tokenizer.json')

# Use the fine-tuned model

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
root_dir="/content/drive/MyDrive/colab"
# Hugging Face token: read it from the environment instead of hard-coding a secret
import os
token = os.environ.get("HF_TOKEN")

load_in_4bit = True
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
ref = root_dir + "/sinch/assistant/unsloth_v1"

In [3]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
     model_name = ref,
     max_seq_length = max_seq_length,
     dtype = dtype,
     load_in_4bit = load_in_4bit,
  )

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
pass


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Unsloth 2024.11.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The Eiffel Tower.<|eot_id|>


In [18]:
def send_msg(msg):
    messages = [
      {
          "role": "user",
          "content": f"You are a Sinch company assistant. Only answer questions related to Sinch domain. If a question is irrelevant, respond with: 'This is outside my expertise.' ### Question: {msg}"
      }
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    from transformers import TextStreamer
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                      use_cache = True, temperature = 1.5, min_p = 0.1)

# Test with QAs

In [19]:
send_msg("How to activate the project-based billing line (PBILI) element?")

Please reach out to service Ops for further assistance.<|eot_id|>


In [20]:
send_msg("tell me about Sinch company")

This is outside my expertise.<|eot_id|>


In [21]:
send_msg("how can I request CPT prices change?")

Reach out to the Credit Control team via internal messaging or by contacting the Credit Control team through the internal chatbot, located on the internal Sinch Internal network (if the user is on the Internal network).<|eot_id|>


In [22]:
send_msg("what's the latest day to update CPT prices?")

The cut-off dates for submitting requests for billing and charging matters are different. Please see the following table for details on cut-off dates and submission timelines for various products. Please note that for the entire month to be billed, the submission cut-off must be earlier than the tenth (10th) of each month.<|eot_id|>


In [24]:
send_msg("tell me about Sinch Financial Documents Editor")

This is outside of my expertise.<|eot_id|>
