# Fine-tune Llama 3.1 8B with Unsloth
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

In [None]:
%conda install pytorch cudatoolkit torchvision torchaudio xformers pytorch-cuda=12.1 bitsandbytes -c pytorch -c nvidia -c conda-forge -c xformers

In [None]:
%pip install torch

In [None]:
import os

os.environ["NVIDIA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

In [1]:
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments
from unsloth import FastLanguageModel, is_bfloat16_supported
from peft import LoraConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
%load_ext autoreload
%autoreload 2

## Load saved model

In [13]:
model_path = "models/llama_31_8B_lora_adapter"
max_seq_length = 4096
adapter1 = "adapter1"

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,  # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
    # device_map="cuda",
)
EOS_TOKEN = tokenizer.eos_token

FastLanguageModel.for_inference(model)

In [6]:
model.load_adapter(model_path + f"/{adapter1}", adapter_name=adapter1)

_IncompatibleKeys(missing_keys=['base_model.model.model.embed_tokens.weight', 'base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.base_layer.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight', 'base_model.model.model.layers.0.mlp.gate_proj.bas

In [7]:
model.set_adapter("adapter1")

## Train adapter (as if done within ft_llama service)

In [3]:
import os

In [None]:
os.environ["BASE_MODEL_PATH"] = "models/llama_31_8B_lora_adapter"
os.environ["MAX_SEQ_LENGTH"] = "4096"
os.environ["ADAPTERS"] = "adapter1"

In [21]:
from src.adapter_train import train_adapter

In [None]:
train_adapter(
    "adapter1", "src/slot_dataset.json", max_steps=10, output_dir="models/outputs"
)

## 1. Load model for PEFT

In [None]:
# Load model
max_seq_length = 8192
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)
EOS_TOKEN = tokenizer.eos_token

In [26]:
model = FastLanguageModel.get_peft_model(model, inference_mode=False)

Unsloth 2024.10.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [27]:
peft_config_1 = LoraConfig(
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    # use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    # random_state = 3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

In [28]:
model.add_adapter("adapter1", peft_config_1)

In [29]:
model.set_adapter("adapter1")

## 2. Prepare data and tokenizer

In [2]:
dataset = load_dataset(
    "askatasuna/CycleDialogueGraphs_v1", token="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"
)

In [4]:
graph_generation_prompt = """
    Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
    ### Instruction:
    Your input is a dialogue from customer chatbot system.
    Your task is to create a cyclic dialogue graph corresponding to the dialogue.
    Next is an example of the graph (set of rules) how chatbot system looks like - it is
    a set of nodes with chatbot system utterances and a set of edges that are
    triggered by user requests: {}
    This is the end of the example.
    Note that is_start field in the node is an entry point to the whole graph, not to the cycle.
    **Rules:**
    1) Nodes must be assistant's utterances, edges must be utterances from the user.
    2) Every assistance's utterance from the dialogue shall be present in one and only one node of a graph.
    3) Every user's utterance from the dialogue shall be present in one and only one edge of a graph.
    4) Use ony utterances from the dialogue. It is prohibited to create new utterances different from input ones.
    6) Never create nodes with user's utterances.
    7) Graph must be cyclic - no dead ends.
    8) The cycle point should make logical sense.
    9) The starting node of the cycle cannot be the beginning of a conversation with the user.
    It must be a continuation of the user's previous phrase, kind of problem elaboration stage.
    Typically it is clarifying question to previous users' phrase for example.
    So cycle start cannot be greeting (first) node of the whole graph, it shall be another one node.
    10) Number of nodes and edges cannot exceed number of utterances in a dialogue.
    11) You must always return valid JSON fenced by a markdown code block. Do not return any additional text.
    12) Add reason point to the graph with explanation how cycle start point has been chosen.
    I will give a dialogue, your task is to build a graph for this dialogue according to the rules and examples above.
    ### Input:
    {}
    ### Response:
    {}
)"""

In [5]:
graph_example_1 = {
    "edges": [
        {"source": 1, "target": 2, "utterances": ["I want to order from you"]},
        {
            "source": 2,
            "target": 3,
            "utterances": [
                "I would like to purchase Pale Fire and Anna Karenina, please"
            ],
        },
        {"source": 3, "target": 4, "utterances": ["With credit card, please"]},
        {"source": 4, "target": 2, "utterances": ["Start new order"]},
    ],
    "nodes": [
        {
            "id": 1,
            "label": "start",
            "is_start": True,
            "utterances": ["How can I help?", "Hello"],
        },
        {
            "id": 2,
            "label": "ask_books",
            "is_start": False,
            "utterances": ["What books do you like?"],
        },
        {
            "id": 3,
            "label": "ask_payment_method",
            "is_start": False,
            "utterances": [
                "Please, enter the payment method you would like to use: cash or credit card."
            ],
        },
        {
            "id": 4,
            "label": "ask_to_redo",
            "is_start": False,
            "utterances": [
                "Something is wrong, can you please use other payment method or start order again"
            ],
        },
    ],
    "reason": "",
}

In [7]:
%pwd

'/cephfs/home/peshkichev/projects/ipavlov/chatsky-llm-autoconfig/experiments/2024.11.14_dialogue2graph'

In [None]:
new_data = []
for dat in dataset["train"]:
    for dia in dat["dialogues"]:
        dic = {"graph": dat["graph"], "dialogue": dia["messages"]}
        new_data.append(dic)
new_data

In [4]:
import json


def save_json(data: dict, filename: str) -> None:
    with open(filename, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)

In [5]:
save_json(new_data, "dataset.json")

In [7]:
from datasets import Dataset

In [19]:
new_dataset = Dataset.from_list(new_data)

In [18]:
def formatting_prompts_func(example):
    inputs = example["dialogue"]
    outputs = example["graph"]

    return {
        "text_prompt": graph_generation_prompt.format(graph_example_1, inputs, outputs)
        + EOS_TOKEN
    }

In [None]:
new_dataset = new_dataset.map(
    formatting_prompts_func,
    batched=False,
)

In [11]:
new_dataset

Dataset({
    features: ['graph', 'dialogue', 'text_prompt'],
    num_rows: 30
})

In [None]:
new_dataset[0]["text_prompt"]

In [47]:
len(new_dataset[0]["text_prompt"])

4805

In [22]:
dataset = new_dataset.select(range(27))

In [23]:
def predict(model, phrase):
    inputs = tokenizer(
        [
            graph_generation_prompt.format(
                graph_example_1,
                phrase,  # input
                "",  # output - leave this blank for generation!
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    # outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True, streamer = text_streamer)
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
    return tokenizer.batch_decode(outputs)

In [32]:
new_dataset[0]["dialogue"]

[{'participant': 'assistant',
  'text': 'Welcome to our hotel booking service! How can I assist you today?'},
 {'participant': 'user', 'text': "Hi, I'd like to book a hotel room"},
 {'participant': 'assistant',
  'text': 'What type of room would you like to book?'},
 {'participant': 'user', 'text': 'A double room for two nights, please'},
 {'participant': 'assistant',
  'text': "You've selected a double room for two nights. Is this correct?"},
 {'participant': 'user', 'text': "Yes, that's correct"},
 {'participant': 'assistant',
  'text': 'Great! Please provide your payment details to proceed.'},
 {'participant': 'user', 'text': "Here's my credit card information"},
 {'participant': 'assistant',
  'text': 'Thank you for your booking! Would you like to book another room?'},
 {'participant': 'user', 'text': "I'd like to book another room"}]

In [57]:
res = predict(model, new_dataset[0]["dialogue"])

In [None]:
res[0]

In [44]:
len(res[0])

3823

In [23]:
new_dataset

Dataset({
    features: ['graph', 'dialogue'],
    num_rows: 30
})

## 4. Training

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=new_dataset,
    # eval_dataset=test_valid["test"],
    dataset_text_field="text_prompt",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences.
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # per_device_train_batch_size=8,
        # gradient_accumulation_steps=4,
        # per_device_eval_batch_size=8,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps=30,
        learning_rate=2e-4,
        # eval_steps=5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="models/outputs/config1_lora",
        # evaluation_strategy="steps",
        # do_eval=True,
    ),
)

In [20]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.498 GB.
6.141 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

## Llama 3.2

In [None]:
# Load model
max_seq_length = 8192
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)
EOS_TOKEN = tokenizer.eos_token

In [49]:
model = FastLanguageModel.get_peft_model(model, inference_mode=False)

Unsloth 2024.10.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [50]:
peft_config_1 = LoraConfig(
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    # use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    # random_state = 3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

In [51]:
model.add_adapter("adapter1", peft_config_1)

In [52]:
model.set_adapter("adapter1")

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=new_dataset,
    # eval_dataset=test_valid["test"],
    dataset_text_field="text_prompt",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences.
    args=TrainingArguments(
        # per_device_train_batch_size=8,
        # gradient_accumulation_steps=4,
        # per_device_eval_batch_size=8,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps=50,
        learning_rate=2e-4,
        # learning_rate=9e-5,
        # eval_steps=5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="models/outputs/config1_lora",
        # evaluation_strategy="steps",
        # do_eval=True,
    ),
)

In [None]:
trainer_stats = trainer.train()

## 5. Inference

In [None]:
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

In [19]:
i = 12002
res = predict(model, dataset["test"][i]["text_prompt"])
print(dataset["test"][i]["phrase"])
res[0].rsplit("### Output Data:")[1].strip().replace("\n<|end_of_text|>", "")

можешь пройти десять футов влево, пожалуйста


"[{'name': 'distance', 'predicted_slot': 'десять'}, {'name': 'distance_unit', 'predicted_slot': 'футов'}]"

In [21]:
len(dataset["test"])

24128

In [24]:
from tqdm import tqdm
import ast

In [26]:
len(test_valid["train"])

23886

In [None]:
###########!!!!!!!!!!!!!!!!!!!
pos = 0
for dt in tqdm(test_valid["train"].select(range(1000))):
    res = predict(model, dt["text_prompt"])
    lst = ast.literal_eval(
        res[0].rsplit("### Output Data:")[1].strip().replace("\n<|end_of_text|>", "")
    )
    if dt["name"] in ["move_forward", "move_backward", "GO"]:
        distance = unit_conversion(dt["distance"], dt["distance_unit"])
        if (
            len(lst) == 2
            and any(
                l["name"] == "distance" and distance == l["predicted_slot"] for l in lst
            )
            and any(
                l["name"] == "distance_unit" and "метр" in l["predicted_slot"]
                for l in lst
            )
        ):
            pos += 1
        else:
            print(lst, dt["phrase"])
    elif dt["name"] in ["pick_up"]:
        if (
            len(lst) == 1
            and lst[0]["name"] == "box_id"
            and dt["box_id"] == lst[0]["predicted_slot"]
        ):
            pos += 1
        else:
            print(lst, dt["phrase"])
    elif dt["name"] in ["place"]:
        if (
            len(lst) == 2
            and any(
                l["name"] == "box_id" and dt["box_id"] == l["predicted_slot"]
                for l in lst
            )
            and any(
                l["name"] == "waypoint_id" and dt["waypoint_id"] == l["predicted_slot"]
                for l in lst
            )
        ):
            pos += 1
        else:
            print(lst, dt["phrase"])
    elif dt["name"] in ["sit_down", "go_to"]:
        if (
            len(lst) == 1
            and lst[0]["name"] == "waypoint_id"
            and dt["waypoint_id"] == lst[0]["predicted_slot"]
        ):
            pos += 1
        else:
            print(lst, dt["phrase"])
    elif dt["name"] in ["say"]:
        if (
            len(lst) == 1
            and lst[0]["name"] == "text"
            and dt["text"] == lst[0]["predicted_slot"]
        ):
            pos += 1
        else:
            print(lst, dt["phrase"])
    elif dt["name"] in ["set_point"]:
        if (
            len(lst) == 1
            and lst[0]["name"] == "point"
            and dt["point"] == lst[0]["predicted_slot"]
        ):
            pos += 1
        else:
            print(lst, dt["phrase"])
pos / 1000

## 6. Save trained model

In [20]:
model.save_pretrained("models/llama_31_8B_lora_adapter")  # Local saving
tokenizer.save_pretrained("models/llama_31_8B_lora_adapter")

('models/llama_31_8B_lora_adapter/tokenizer_config.json',
 'models/llama_31_8B_lora_adapter/special_tokens_map.json',
 'models/llama_31_8B_lora_adapter/tokenizer.json')

In [32]:
res = predict(model, "пожалуйста, проедь вперед на 20 метров")