# Fine-tune Llama 3.1 8B with Unsloth
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

In [None]:
!pip install -qqq "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --progress-bar off
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install -qqq --no-deps {xformers} trl peft accelerate bitsandbytes triton --progress-bar off

import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for unsloth (pyproject.toml) ... done
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


## 1. Load model for PEFT

In [None]:
# Load model
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

# Prepare model for PEFT
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)
# print_trainable_parameters() prints the summary itself and returns None
model.print_trainable_parameters()

==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Unsloth 2024.9.post4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
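The trainable-parameter count reported above can be reproduced by hand. The sketch below is only a sanity check, not part of the original notebook: the layer dimensions are assumptions taken from the public Llama 3.1 8B configuration, and `lora_param_count` is a hypothetical helper. It also shows how `use_rslora=True` changes the adapter scaling from `lora_alpha / r` to `lora_alpha / sqrt(r)`.

```python
import math

# Assumed Llama 3.1 8B dimensions (public model config, not stated in this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # MLP intermediate_size
KV_DIM = 8 * 128       # num_key_value_heads * head_dim
NUM_LAYERS = 32

# (in_features, out_features) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """Each adapted module adds A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


r, lora_alpha = 16, 16
trainable = lora_param_count(r)
total = 8_072_204_288  # "all params" as reported in the cell output

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
print(f"standard LoRA scale (alpha / r): {lora_alpha / r}")
print(f"rsLoRA scale (alpha / sqrt(r)): {lora_alpha / math.sqrt(r)}")
```

With `r=16` and `lora_alpha=16`, the rank-stabilized scale is 4 rather than 1, so the adapter update is weighted more heavily than it would be with `use_rslora=False`.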


## 2. Prepare data and tokenizer

In [None]:
# tokenizer = get_chat_template(
#     tokenizer,
#     chat_template="chatml",
#     mapping={"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}
# )

# def apply_template(examples):
#     messages = examples["conversations"]
#     text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
#     return {"text": text}


# Update the tokenizer with the ChatML chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    mapping={
        "role": "from",
        "content": "value",
        "user": "human",
        "assistant": "gpt"
    }
)

def apply_template(examples):
    # Collect one formatted training string per example
    text = []

    # Pair each prompt with its completion
    for prompt, completion in zip(examples["prompt"], examples["completion"]):
        # Plain "User: ... Assistant: ..." layout (the ChatML template is not applied here)
        formatted_message = f"User: {prompt} Assistant: {completion}"
        text.append(formatted_message)

    return {"text": text}


# Load the local JSON dataset (load_dataset is already imported in the first cell)
dataset = load_dataset("json", data_files="/content/rere.json", split="train")

# Apply the template to your dataset
dataset = dataset.map(apply_template, batched=True)


Unsloth: Will map <|im_end|> to EOS = <|im_end|>.


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/24000 [00:00<?, ? examples/s]

In [None]:
from datasets import load_dataset

# Define the tokenizer and apply the chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    mapping={
        "role": "from",
        "content": "value",
        "user": "human",
        "assistant": "gpt"
    }
)

# Function to apply the template to the examples
def apply_template(examples):
    text = []

    # Loop through each example in the dataset
    for instruction, input_text, output in zip(examples["instruction"], examples["input"], examples["output"]):
        # Replace a missing (None) input field with an empty string
        input_text = input_text if input_text is not None else ""

        # Format the conversation
        formatted_message = f"User: {instruction} {input_text} Assistant: {output}"

        text.append(formatted_message)

    return {"text": text}

# Load the dataset (adjust path as necessary)
dataset = load_dataset("json", data_files="/content/translated_instruction_input_output_to_hindi.json", split="train")

# Apply the template to your dataset
dataset = dataset.map(apply_template, batched=True)

# Show the first few examples to verify
print(dataset[:3])


Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

{'instruction': ['स्वस्थ रहने के लिए तीन सुझाव दीजिए ।', 'तीन प्राथमिक रंग क्या हैं?', 'एक अणु की संरचना वर्णित करें.'], 'input': [None, None, None], 'output': ['अपने शरीर को सक्रिय और मज़बूत बनाए रखने के लिए नियमित रूप से कसरत कीजिए ।', 'तीन प्राथमिक रंग लाल, नीला, और पीला हैं.', 'एक परमाणु के ऊपर एक परमाणु है, जिसमें एवरटन्स और नॉट्स होते हैं, जो मध्य युग के आस - पास चक्करों से घिरा रहता है ।'], 'text': ['User: स्वस्थ रहने के लिए तीन सुझाव दीजिए ।  Assistant: अपने शरीर को सक्रिय और मज़बूत बनाए रखने के लिए नियमित रूप से कसरत कीजिए ।', 'User: तीन प्राथमिक रंग क्या हैं?  Assistant: तीन प्राथमिक रंग लाल, नीला, और पीला हैं.', 'User: एक अणु की संरचना वर्णित करें.  Assistant: एक परमाणु के ऊपर एक परमाणु है, जिसमें एवरटन्स और नॉट्स होते हैं, जो मध्य युग के आस - पास चक्करों से घिरा रहता है ।']}
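The `text` field above uses a plain `User: ... Assistant: ...` layout, while the inference cell in section 4 builds its prompt with the ChatML template (`<|im_start|>` / `<|im_end|>` markers). Training on the same layout that is used at inference is usually the safer choice. The following is a minimal sketch of that idea, assuming the ChatML layout visible in the inference output. `to_chatml` and `apply_chatml_template` are hypothetical helpers and the sample record is made up for illustration. In the notebook itself, `tokenizer.apply_chat_template` would normally do this job.

```python
def to_chatml(instruction, input_text, output):
    """Format one example as a single ChatML user/assistant exchange."""
    # Append the optional input on its own line; skip it when it is None or empty.
    user_turn = instruction if not input_text else f"{instruction}\n{input_text}"
    return (
        f"<|im_start|>user\n{user_turn}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>\n"
    )


def apply_chatml_template(examples):
    """Batched map function: returns a 'text' column with ChatML-formatted strings."""
    text = [
        to_chatml(instruction, input_text, output)
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": text}


# Made-up batch in the same column layout as the dataset above.
batch = {
    "instruction": ["Name the three primary colors."],
    "input": [None],
    "output": ["Red, blue, and yellow."],
}
formatted = apply_chatml_template(batch)["text"][0]
print(formatted)
```

A function with this shape could be passed to `dataset.map(..., batched=True)` in place of `apply_template`. Ending the assistant turn with `<|im_end|>` gives the model an explicit stop marker to learn.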


## 3. Training

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=500,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 3,656 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 684
 "-____-"     Number of trainable parameters = 41,943,040
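The batch size and step count in this banner follow from the training arguments. The sketch below reproduces the arithmetic under the assumption that the trainer floors the number of optimizer updates per epoch, which appears to match the reported 684 steps. `total_optimizer_steps` is a hypothetical helper, and 3,656 is the example count from the banner (the 52,002 rows after packing).

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps for a run with gradient accumulation."""
    # Number of micro-batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer update every `grad_accum` micro-batches (remainder dropped).
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


per_device_batch, grad_accum, epochs = 4, 4, 3
effective_batch = per_device_batch * grad_accum * 1  # one GPU
steps = total_optimizer_steps(3656, per_device_batch, grad_accum, epochs)

print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {steps}")
print(f"warmup share with warmup_steps=500: {500 / steps:.0%}")
```

With these settings, `warmup_steps=500` covers roughly 73% of the run, so the learning rate spends most of training ramping up. Whether that is intended is worth checking when adapting the configuration.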


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 218.81 MiB is free. Process 65621 has 39.34 GiB memory in use. Of the allocated memory 38.73 GiB is allocated by PyTorch, and 107.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
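The error message itself suggests one mitigation: setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Another common option is to lower the per-device batch size while raising gradient accumulation so the effective batch size stays the same. The sketch below illustrates both. `shrink_batch` is a hypothetical helper, and neither step is guaranteed to resolve the failure. In a notebook, memory held by objects from earlier runs of the same cells is another frequent cause, and restarting the runtime clears it.

```python
import os

# Must be set before the first CUDA allocation, so ideally before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


def shrink_batch(per_device_batch, grad_accum):
    """Halve the per-device batch and double accumulation, keeping their product fixed."""
    if per_device_batch <= 1:
        # Nothing left to halve.
        return per_device_batch, grad_accum
    if per_device_batch % 2 != 0:
        raise ValueError("per_device_batch must be even to keep the effective batch unchanged")
    return per_device_batch // 2, grad_accum * 2


# Starting from the values used in the training cell above.
smaller = shrink_batch(4, 4)
print(smaller)
```

The returned pair would replace `per_device_train_batch_size` and `gradient_accumulation_steps` in `TrainingArguments`. Peak activation memory drops because fewer sequences are processed per forward pass.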

## 4. Inference

In [None]:
# Switch the fine-tuned model to Unsloth's faster inference mode
model = FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "एक अणु की संरचना वर्णित करें"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=1000, use_cache=True)

<|im_start|>user
एक अणु की संरचना वर्णित करें<|im_end|>
<|im_start|>assistant
मैंने अपने स्कूल के एक विद्यार्थी के बारे में एक कहानी लिखी है जो एक बुद्धिमान प्रोफेसर के साथ एक विद्यार्थी है ।<|im_end|>


## 5. Save trained model

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Merge the LoRA weights into the base model and save it locally in 16-bit
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
# model.push_to_hub_merged("mlabonne/FineLlama-3.1-8B", tokenizer, save_method="merged_16bit")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 52.73 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 53.41it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


RuntimeError: Unsloth: Please supply a token!
Go to https://huggingface.co/settings/tokens
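The `RuntimeError` above indicates that pushing to the Hub requires a Hugging Face access token. One way to supply it without hard-coding it in the notebook is to read it from an environment variable and fail early when it is missing. This is a minimal sketch: `get_hf_token` is a hypothetical helper, and passing the result through a `token=` keyword to the Unsloth push functions is an assumption about their signature that should be checked against the installed version.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in `env_var`, or raise with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a Hugging Face access token with write permission "
            "before pushing to the Hub."
        )
    return token


# Assumed usage in this notebook (not executed here):
# model.push_to_hub_merged(
#     "mlabonne/FineLlama-3.1-8B", tokenizer,
#     save_method="merged_16bit", token=get_hf_token(),
# )
```

Checking for the token before the merge step avoids spending several minutes on saving only to fail at upload time.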

In [None]:
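# Save a local GGUF copy quantized to q8_0
model.save_pretrained_gguf("model", tokenizer, "q8_0")

# Quantize to each GGUF format and upload it to the Hub (requires a Hugging Face token)
quant_methods = ["q2_k", "q3_k_m", "q4_k_m", "q5_k_m", "q6_k", "q8_0"]
for quant in quant_methods:
    model.push_to_hub_gguf("mlabonne/FineLlama-3.1-8B-GGUF", tokenizer, quant)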