## Installing unsloth

In [1]:

import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2
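
The Colab branch above picks an xformers build from the installed torch version. As a minimal sketch, the same selection logic can be exercised in isolation without torch installed. The helper name `pick_xformers` is hypothetical; the version strings are the ones hard-coded in the cell. Unlike the cell, the sketch raises a `ValueError` when the version string cannot be parsed.

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Mirror the install cell: strip the local build tag, then map to an xformers pin."""
    # "2.8.0+cu126" -> "2.8.0": match the leading run of digits and dots.
    match = re.match(r"[0-9.]{3,}", torch_version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {torch_version!r}")
    base = match.group(0)
    pin = "0.0.32.post2" if base == "2.8.0" else "0.0.29.post3"
    return f"xformers=={pin}"

print(pick_xformers("2.8.0+cu126"))
print(pick_xformers("2.6.0+cu124"))
```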



## Loading the LLM

Download the pre-quantized 4-bit model from Hugging Face (safetensors format).

In [3]:
from unsloth import FastLanguageModel

# Setting the parameters
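max_seq_length = 2048  # maximum sequence length (in tokens) used for training
dtype = None  # None lets Unsloth pick the dtype automatically
load_in_4bit = True  # load 4-bit quantized weights to reduce GPU memory use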

# Loading LLM from Hugging Face
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit',
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.4: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Adding LoRA adapters

docs: https://github.com/unslothai/unsloth/wiki#lora-parameters-encyclopedia

In [4]:
# Applying LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", ],
    lora_alpha=16,  # scaling factor for the LoRA update (the update is scaled by lora_alpha / r)
    lora_dropout=0,  # dropout probability applied inside the LoRA layers (0 disables dropout)
    bias="none",  # do not train bias terms in the adapted layers (saves memory)
    use_gradient_checkpointing="unsloth",
    # saves memory by recomputing activations during the backward pass instead of storing them (useful for long sequences)
    random_state=3407,  # random seed
    use_rslora=False,  # keep the standard LoRA scaling instead of rank-stabilized LoRA
    loftq_config=None,  # LoftQ (quantization-aware LoRA initialization) disabled
)

Unsloth 2025.9.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
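
The number of trainable parameters follows from the rank and the shapes of the targeted projections: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below works this out, assuming the published Llama-3.2-1B shapes (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64). Those shapes are not stated in this notebook and are assumptions here. The total agrees with the "Trainable parameters" line in the training banner further down.

```python
# Assumed Llama-3.2-1B shapes (not stated in this notebook).
HIDDEN = 2048        # model width
KV_DIM = 8 * 64      # 8 key/value heads x head dimension 64
MLP = 8192           # feed-forward width
LAYERS = 16          # matches "patched 16 layers" in the output above
R = 16               # LoRA rank passed to get_peft_model
ALPHA = 16           # lora_alpha passed to get_peft_model

# (input features, output features) of every targeted projection in one layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# LoRA adds two matrices per projection: A (r x d_in) and B (d_out x r).
per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update when use_rslora=False

print(f"{trainable:,} trainable parameters, update scaling {scaling}")
```

With `r` and `lora_alpha` both set to 16, the update is applied at scale 1.0; raising `lora_alpha` alone would strengthen the adapter's contribution without adding parameters.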


---

## Running inference before fine-tuning

In [5]:
from transformers import TextStreamer

# Messages
question = "O que é a minima?"  # Portuguese for "What is Minima?"
messages = [{"role": "user", "content": question}]

# Enable Unsloth's optimized inference mode (improves speed and efficiency)
FastLanguageModel.for_inference(model)

# Format the question with the tokenizer's chat template and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Create a text streamer to stream the output
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the response using the model
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=2048,
    use_cache=True,
    min_p=0.1
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Desculpe, mas não posso fornecer uma resposta a essa pergunta. Posso ajudar com outra coisa?<|eot_id|>
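
The `generate` call above passes `min_p=0.1`, which only has an effect when sampling is enabled. As a rough illustration of what that filter does, the sketch below keeps the tokens whose probability is at least `min_p` times the top probability and renormalizes the rest. It runs on a made-up toy distribution in plain Python; the real filter operates on logit tensors inside `transformers`, and `min_p_filter` is a hypothetical name.

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Drop tokens below min_p * (top probability), then renormalize the rest."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Made-up next-token distribution, for illustration only.
toy = {"Minima": 0.60, "A": 0.25, "O": 0.10, "xyz": 0.05}
filtered = min_p_filter(toy, min_p=0.1)
print(sorted(filtered))
```

Because the threshold scales with the top probability, the filter prunes aggressively when the model is confident and leaves more candidates when the distribution is flat.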


---

# Dataset

## Loading the dataset

In [6]:
# Cloning minima dataset
!git clone https://github.com/akio-code/minimodel-fine-tuning.git /minimodel

fatal: destination path '/minimodel' already exists and is not an empty directory.


In [7]:
from datasets import load_dataset, concatenate_datasets
from unsloth.chat_templates import standardize_sharegpt

# Synthetic datasets generated with instructlab
KNOWLEDGE_DATASET = "/minimodel/datasets/2025-09-12_004739/knowledge_train_msgs_2025-09-12T00_51_21.jsonl"
SKILLS_DATASET = "/minimodel/datasets/2025-09-12_004739/skills_train_msgs_2025-09-12T00_51_21.jsonl"

# Loading synthetic dataset
knowledge_ds = load_dataset(
    path="json",
    data_files=KNOWLEDGE_DATASET,
    split="train",
)
skills_ds = load_dataset(
    path="json",
    data_files=SKILLS_DATASET,
    split="train",
)

# Concatenate both datasets
combined_ds = concatenate_datasets([knowledge_ds, skills_ds])
combined_ds = combined_ds.shuffle(seed=3407)
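
`load_dataset("json", ...)` reads one JSON object per line. The formatting cell later in this notebook reads `examples["messages"]` and each message's `role`, so each record presumably looks like the toy line below. The sample content is invented, `check_records` is a hypothetical helper, and the real files may carry extra fields.

```python
import io
import json

# Invented sample in the shape the formatting cell expects; not taken from the real files.
sample_jsonl = io.StringIO(
    '{"messages": [{"role": "user", "content": "What is Minima?"}, '
    '{"role": "assistant", "content": "An innovation studio."}]}\n'
)

def check_records(lines):
    """Parse JSONL and verify each record holds a list of role/content messages."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        record = json.loads(line)
        for msg in record["messages"]:
            if not {"role", "content"} <= msg.keys():
                raise ValueError(f"line {line_no}: message missing role/content")
        records.append(record)
    return records

records = check_records(sample_jsonl)
print(len(records), [m["role"] for m in records[0]["messages"]])
```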

### Applying the chat template

In [8]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.2",
)
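
To make concrete what the template does, here is a simplified renderer that mirrors the header and `<|eot_id|>` layout visible in the formatted sample later in this notebook. It is an approximation only: the real `llama-3.2` template also injects date lines into the system block, and `render_llama3` is a hypothetical helper, not part of Unsloth.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Approximate the Llama 3 chat layout: header, blank line, content, <|eot_id|>."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        text += f"{msg['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

prompt = render_llama3(
    [{"role": "user", "content": "O que é a minima?"}],
    add_generation_prompt=True,
)
print(prompt)
```

Training text is rendered without the trailing assistant header (`add_generation_prompt=False`), since the assistant turn is already part of each example.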

## Standardizing the dataset

Now we need to standardize the dataset from ShareGPT-style records to the role/content message format, and format the messages so that each conversation starts with our own system prompt.

In [9]:
# Formatting chat messages
def formatting_prompts(examples):
    messages = examples["messages"]

    system_message = (
        "You are Minima's expert assistant. You have deep knowledge of the Minima Innovation Studio, "
        "its methodology, success cases, and strategic value. Help users understand what we do and why we do it."
    )

    # Replacing system message
    updated_messages = []
    for chat in messages:
        # Filtering out existing system messages
        non_system_messages = [
            msg
            for msg in chat
            if msg["role"] != "system"
        ]

        # Prepend the new system message
        new_chat = [{"role": "system", "content": system_message}] + non_system_messages

        updated_messages.append(new_chat)

    # Formatting using tokenizer
    texts = [
        tokenizer.apply_chat_template(
            message,
            tokenize=False,
            add_generation_prompt=False
        )
        for message in updated_messages
    ]

    return {"text": texts}
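
The system-message replacement at the core of `formatting_prompts` can be tried on its own, without a tokenizer. A standalone sketch follows; `replace_system` is a hypothetical name and the toy chat is invented.

```python
def replace_system(chat, system_message):
    """Return a new chat with every system turn dropped and one new system turn first."""
    kept = [msg for msg in chat if msg["role"] != "system"]
    return [{"role": "system", "content": system_message}] + kept

toy_chat = [
    {"role": "system", "content": "Old instructions."},
    {"role": "user", "content": "What does Minima do?"},
    {"role": "assistant", "content": "It runs product and design projects."},
]
new_chat = replace_system(toy_chat, "You are Minima's expert assistant.")
print([m["role"] for m in new_chat])
```

The original chat is left untouched, which matches the notebook function: it builds `updated_messages` rather than editing `examples["messages"]` in place.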

## Updating the loaded dataset

In [10]:
# Standardize dataset
standardize_ds = standardize_sharegpt(combined_ds)
standardize_ds = standardize_ds.map(formatting_prompts, batched=True)

# Inspect one formatted example
standardize_ds[2]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are Minima's expert assistant. You have deep knowledge of the Minima Innovation Studio, its methodology, success cases, and strategic value. Help users understand what we do and why we do it.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTitle: White Paper – Minima: Product and Design Transformation with Surgical Precision\n\n---\n\nIntroduction:\nThe goal of this white paper is to explore the concept of Minima, a revolutionary approach to product and design transformation with surgical precision.\n\nSection 1: Understanding Minima\nMinima is a methodology that focuses on identifying and addressing the core issues in a product or design, ensuring that improvements are made with the utmost accuracy and efficiency.\n\nSection 2: The Minima Process\nThe Minima process begins with a thorough analysis of the product or design, followed by the identifi

---

# Training

## Creating the fine-tuning trainer

In [17]:
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from transformers import DataCollatorForSeq2Seq

# Initialize the supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=standardize_ds,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    packing=False,  # packing=True can make training up to 5x faster for short sequences

    # Defining the training args
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # number of batches to accumulate before each weight update
        # num_train_epochs=1,  # Set this for 1 full training run
        warmup_steps=100,  # Gradually increase the learning rate for the first 100 steps
        max_steps=250,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=100,
        optim="adamw_8bit",
        weight_decay=0.01,  # Allow regularization to prevent overfitting
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Enable WandB later
    )
)
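
With `warmup_steps=100`, `max_steps=250` and `lr_scheduler_type="cosine"`, the learning rate ramps up linearly to `2e-4` and then decays along a half cosine. The sketch below reproduces that shape in plain Python, assuming the standard linear-warmup-plus-cosine formula; `lr_at` is a hypothetical helper, not a trainer function.

```python
import math

def lr_at(step, peak=2e-4, warmup=100, total=250):
    """Linear warmup to `peak`, then half-cosine decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 175, 250):
    print(step, f"{lr_at(step):.2e}")
```

Note that warmup takes up 100 of the 250 steps, so the model trains at the peak learning rate only briefly before the decay begins.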

## Running the training


In [None]:
# Start the fine-tuning process
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 9,299 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,272,192 of 1,247,086,592 (0.90% trained)


Step,Training Loss
100,0.5674
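
The banner numbers show how much of the dataset this run actually sees. The arithmetic below uses only values printed in the log and set in the trainer configuration.

```python
num_examples = 9_299          # "Num examples" in the banner
per_device_batch = 2          # per_device_train_batch_size
grad_accum = 4                # gradient_accumulation_steps
num_gpus = 1                  # "Data Parallel GPUs"
max_steps = 250               # "Total steps"

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples

print(effective_batch, examples_seen, f"{fraction_of_epoch:.1%}")
```

So the 250 steps cover well under one full pass over the 9,299 examples, even though the banner reports `Num Epochs = 1`. A full pass would need `num_train_epochs=1` in place of `max_steps`.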


## Testing the fine-tuned model

In [13]:
# Messages
question = "O que é a minima?"  # Portuguese for "What is Minima?"
messages = [{"role": "user", "content": question}]

# Enable Unsloth's optimized inference mode (improves speed and efficiency)
FastLanguageModel.for_inference(model)

# Format the question with the tokenizer's chat template and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=2048,
    use_cache=True,
    min_p=0.1
)

Minima é um termo utilizado para expressar algo que é mínimo, ou seja, algo que não é necessário ou não é considerado necessário.

Exemplos:

- Minima perda de qualidade: "O novo design da casa é minima perda de qualidade, já que mantém a essência do espaço."
- Minima perda de estilo: "A nova decoração da sala é minima perda de estilo, já que incorpora elementos que antes não haviam."
- Minima perda de funcionalidade: "O novo sistema de armazenamento é minima perda de funcionalidade, já que permite a organização dos objetos de forma eficiente."

A minima é uma expressão utilizada para destacar o valor ou o benefício de algo que, por outro lado, pode ser considerado desnecessário ou ineficaz.

Exemplo de erro:

- "O novo sistema de armazenamento é minima perda de funcionalidade." - Erro: "O novo sistema de armazenamento é minima perda de funcionalidade." - Correção: "O novo sistema de armazenamento é minima perda de funcionalidade." (ou "O novo sistema de armazenamento é minima perda de

## Saving to GGUF

Merge the LoRA weights into the model at 16-bit precision, then convert to GGUF and quantize to `q4_k_m`.

In [15]:
# Output directory
model_gguf_dir = "model_GGUF"

# Saving the model in GGUF format
model.save_pretrained_gguf(model_gguf_dir, tokenizer, quantization_method="q4_k_m")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.87 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 25.92it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /workspace/model_GGUF/pytorch_model.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at /workspace/model_GGUF into f16 GGUF format.
The output location will be /workspace/model_GGUF/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_GGUF
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:hf-to-gguf

## Modelfile

Save the Modelfile so the GGUF model can be used with Ollama.

In [16]:
with open("Modelfile", "w") as f:
    f.write(tokenizer._ollama_modelfile)

✅ Modelfile written to /workspace/fine-tuning/2025-09-13-0213/Modelfile
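
Ollama expects a Modelfile to contain a `FROM` instruction naming the model to load. As a small sanity check before handing the file to Ollama, the sketch below scans Modelfile text for that instruction. `has_from_instruction` is a hypothetical helper, and the sample text and its path are placeholders rather than the content Unsloth generates.

```python
def has_from_instruction(modelfile_text: str) -> bool:
    """True if any non-comment line starts with the FROM instruction (case-insensitive)."""
    for line in modelfile_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            if stripped.split(maxsplit=1)[0].upper() == "FROM":
                return True
    return False

# Placeholder content; the real text comes from tokenizer._ollama_modelfile.
sample = "# example only\nFROM ./path/to/model.gguf\nPARAMETER temperature 0.6\n"
print(has_from_instruction(sample))
```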
