To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Long-Context GRPO for reinforcement learning: train stably at massive sequence lengths. Fine-tune models with up to 7x more context length efficiently. [Read Blog](https://unsloth.ai/docs/new/grpo-long-context)

3× faster training with optimized sequence packing: higher throughput with no quality loss. [Read Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)

500k context-length fine-tuning: push long-context models further with memory-efficient training. [Read Blog](https://unsloth.ai/docs/new/500k-context-length-fine-tuning)

Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.57.3
!pip install --no-deps trl==0.22.2
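
The version matching in the install cell can be read as a small pure function. The sketch below mirrors that logic so the pin for a given torch version can be checked without installing anything; the helper name `pick_xformers` is hypothetical and not part of Unsloth.

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Return the xformers pin the install cell would choose for a torch version."""
    match = re.match(r"[0-9]+\.[0-9]+", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised torch version: {torch_version!r}")
    major_minor = match.group(0)
    # Same mapping as the install cell: 2.9 and 2.8 get their own pins,
    # every other version falls back to the older release.
    pins = {"2.9": "0.0.33.post1", "2.8": "0.0.32.post2"}
    return "xformers==" + pins.get(major_minor, "0.0.29.post3")

print(pick_xformers("2.9.0+cu126"))  # the torch build reported later in this notebook
```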

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [2]:
from unsloth import FastModel
import torch
max_seq_length = 2048
fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.4: Fast Gemma3 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


We now add LoRA adapters so we only need to update a small number of parameters!

In [3]:
model = FastModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model` require gradients
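
As a rough check on what `r = 128` costs, the sketch below counts LoRA parameters for the seven target modules. Each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` weights. The layer dimensions are assumptions taken from the published Gemma-3 270M configuration (hidden size 640, MLP size 2048, 18 layers, 4 query heads and 1 key/value head of dimension 256); confirm them against `model.config` before relying on the result.

```python
# Assumed Gemma-3 270M dimensions (verify with model.config).
HIDDEN, INTERMEDIATE, LAYERS = 640, 2048, 18
Q_OUT = 4 * 256   # query heads * head_dim
KV_OUT = 1 * 256  # key/value heads * head_dim
RANK = 128        # matches r = 128 in get_peft_model above

# (d_in, d_out) for every module listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 30,375,936
```

Under these assumptions the total agrees with the "Trainable parameters" figure that the training banner prints later in this notebook.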


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation style finetunes. We use the [open-r1/codeforces-cots](https://huggingface.co/datasets/open-r1/codeforces-cots) dataset (the first 10,000 training rows). Gemma-3 renders multi turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
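
To make the layout above concrete, here is a minimal sketch that renders a list of messages into the same turn format. It is a simplified stand-in, not the real Jinja chat template (which also handles system prompts and other cases); the function name `render_gemma3` is hypothetical.

```python
def render_gemma3(messages):
    """Render role/content dicts into the Gemma-3 turn layout shown above."""
    # Gemma-3 labels the assistant turn "model".
    role_names = {"user": "user", "assistant": "model"}
    parts = ["<bos>"]
    for message in messages:
        role = role_names[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)

print(render_gemma3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]))
```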

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [4]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma3",
)

In [5]:
from datasets import load_dataset
dataset = load_dataset("open-r1/codeforces-cots", split = "train[:10000]")

We now format each example with an Alpaca-style prompt, taking the problem and the solution from the dataset's `messages` column.

In [6]:
alpaca_prompt = """Below is a competitive programming problem.
Reason step-by-step to find the correct algorithm, then provide a C++ solution.

### Problem:
{}

### Solution:
{}"""

# Formatting function (tailored to the "messages" column)
def formatting_prompts_func(examples):
    # Every row has a "messages" list, so we read from that column
    conversations = examples["messages"]

    texts = []
    for conversation in conversations:
        # Each conversation is stored as a list:
        # conversation[0] is the user turn (problem)
        # conversation[1] is the assistant turn (solution)

        problem  = conversation[0]["content"]
        solution = conversation[1]["content"]

        # Fill the template and append EOS so the model learns where to stop
        text = alpaca_prompt.format(problem, solution) + tokenizer.eos_token
        texts.append(text)

    return { "text" : texts }

# Apply the formatting
dataset = dataset.map(formatting_prompts_func, batched = True)

# Print the first result to confirm the format
print(dataset[0]["text"])

Below is a competitive programming problem.
Reason step-by-step to find the correct algorithm, then provide a C++ solution.

### Problem:
You will be given a competitive programming problem. Please reason step by step about the solution, then provide a complete implementation in C++17.

Your solution must read input from standard input (cin), write output to standard output (cout).
Do not include any debug prints or additional output.

Put your final solution within a single code block:
```cpp
<your code here>
```

# Problem

You are given an array $$$a$$$ of $$$n$$$ integers, where $$$n$$$ is odd.

In one operation, you will remove two adjacent elements from the array $$$a$$$, and then concatenate the remaining parts of the array. For example, given the array $$$[4,7,4,2,9]$$$, we can obtain the arrays $$$[4,2,9]$$$ and $$$[4,7,9]$$$ by the operations $$$[\underline{4,7}, 4,2,9] \to [4,2,9]$$$ and $$$[4,7,\underline{4,2},9] \to [4,7,9]$$$ respectively. However, we cannot obtain the ar

Let's see what row 100 looks like!

In [7]:
dataset[100]

{'id': '884/F',
 'aliases': None,
 'contest_id': '884',
 'contest_name': 'Educational Codeforces Round 31',
 'contest_type': 'ICPC',
 'contest_start': 1509113100,
 'contest_start_year': 2017,
 'index': 'F',
 'time_limit': 2.0,
 'memory_limit': 256,
 'title': 'Anti-Palindromize',
 'description': 'A string a of length m is called antipalindromic iff m is even, and for each i (1 ≤ i ≤ m) ai ≠ am - i + 1.\n\nIvan has a string s consisting of n lowercase Latin letters; n is even. He wants to form some string t that will be an antipalindromic permutation of s. Also Ivan has denoted the beauty of index i as bi, and the beauty of t as the sum of bi among all indices i such that si = ti.\n\nHelp Ivan to determine maximum possible beauty of t he can get.',
 'input_format': 'The first line contains one integer n (2 ≤ n ≤ 100, n is even) - the number of characters in s.\n\nThe second line contains the string s itself. It consists of only lowercase Latin letters, and it is guaranteed th

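Alternatively, you can apply the chat template for `Gemma3` onto the conversations and save it to `text`. This notebook trains on the Alpaca-style prompt built above, so the cell below is disabled. If you enable it, also change the markers passed to `train_on_responses_only` to the Gemma-3 turn tags (`<start_of_turn>user\n` and `<start_of_turn>model\n`).

In [None]:
if False: # Change to True to use the Gemma-3 chat template instead
    def formatting_prompts_func(examples):
        # This dataset stores its conversations under "messages"
        convos = examples["messages"]
        texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
        return { "text" : texts, }

    dataset = dataset.map(formatting_prompts_func, batched = True)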

Let's see how the formatted text looks!


In [11]:
dataset[11]['text']

'Below is a competitive programming problem.\nReason step-by-step to find the correct algorithm, then provide a C++ solution.\n\n### Problem:\nYou will be given a competitive programming problem. Please reason step by step about the solution, then provide a complete implementation in C++17.\n\nYour solution must read input from standard input (cin), write output to standard output (cout).\nDo not include any debug prints or additional output.\n\nPut your final solution within a single code block:\n```cpp\n<your code here>\n```\n\n# Problem\n\nYou\'re given an integer $$$n$$$. For every integer $$$i$$$ from $$$2$$$ to $$$n$$$, assign a positive integer $$$a_i$$$ such that the following conditions hold:\n\n- For any pair of integers $$$(i,j)$$$, if $$$i$$$ and $$$j$$$ are coprime, $$$a_i \\neq a_j$$$.\n- The maximal value of all $$$a_i$$$ should be minimized (that is, as small as possible).\n\nA pair of integers is called coprime if their greatest common divisor is $$$1$$$.\n\nExecution 

<a name="Train"></a>
### Train the model
Now let's train our model. We do 600 steps to speed things up, but for a full run you can set `num_train_epochs = 1` and remove `max_steps`.

In [12]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text", # Must match the key from your formatting function!
        max_seq_length = 2048,       # Explicitly set this to prevent "infinite length" errors

        # --- Memory Safety Settings ---
        per_device_train_batch_size = 2,  # Lower is safer for long C++ code
        gradient_accumulation_steps = 4,  # Compensates for the low batch size

        # --- Speed & Duration ---
        warmup_steps = 5,
        max_steps = 600, # Capped for speed (about 8 minutes on a T4 in this run)

        # --- Standard Optimizer Stuff ---
        learning_rate = 2e-4, # A common LoRA starting point; consider lowering it if the loss climbs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


The model is already on multiple devices. Skipping the move to device specified in `args`.


🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!

In [13]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    # This tells the trainer: "Ignore everything until you see this marker"
    instruction_part = "### Problem:\n",

    # This tells the trainer: "Start learning from here!"
    response_part = "### Solution:\n",
)
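
The idea behind this masking can be pictured with plain lists of token ids. The sketch below is a simplified single-turn illustration, not Unsloth's implementation: it sets every label up to and including the response marker to `-100`, the value the loss function ignores. The token ids and the helper names are made up for the example.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def find_sublist(seq, sub):
    """Return the index where `sub` first appears in `seq`, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_before_response(input_ids, response_marker_ids):
    """Keep labels only for tokens that come after the response marker."""
    pos = find_sublist(input_ids, response_marker_ids)
    if pos < 0:
        # No marker found: nothing in this example contributes to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = pos + len(response_marker_ids)
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])

# Toy ids: [20, 21] stands in for the tokens of "### Solution:\n".
input_ids = [2, 10, 11, 12, 20, 21, 30, 31, 32]
labels = mask_before_response(input_ids, [20, 21])
print(labels)  # [-100, -100, -100, -100, -100, -100, 30, 31, 32]
```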

Let's verify that the instruction part is masked! Let's print the 100th row again.

In [14]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

"<bos>Below is a competitive programming problem.\nReason step-by-step to find the correct algorithm, then provide a C++ solution.\n\n### Problem:\nYou will be given a competitive programming problem. Please reason step by step about the solution, then provide a complete implementation in C++17.\n\nYour solution must read input from standard input (cin), write output to standard output (cout).\nDo not include any debug prints or additional output.\n\nPut your final solution within a single code block:\n```cpp\n<your code here>\n```\n\n# Problem\n\nA string a of length m is called antipalindromic iff m is even, and for each i (1 ‚â§ i ‚â§ m) ai ‚â† am - i + 1.\n\nIvan has a string s consisting of n lowercase Latin letters; n is even. He wants to form some string t that will be an antipalindromic permutation of s. Also Ivan has denoted the beauty of index i as bi, and the beauty of t as the sum of bi among all indices i such that si = ti.\n\nHelp Ivan to determine maximum possible beauty

Now let's print the masked out example - you should see only the answer is present:

In [15]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      <think>\nOkay, let's see. So the problem is about finding an antipalindromic permutation of a given string s, such that the beauty sum is maximized. The beauty is the sum of bi where the character in t at position i is the same as in s. Hmm.\n\nFirst, I need to understand what an antipalindromic string is. Oh right, it's a string of even length where for every i, the character at position i is not equal to the character at position m - i + 1. So for example, in a 4-letter string, the first an

In [16]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
0.832 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 30,375,936 of 298,474,112 (10.18% trained)


Unsloth: Will smartly offload gradients to save VRAM!
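
The banner's batch arithmetic also tells us how much of the data a 600-step run covers. The sketch below recomputes it from the settings used in this notebook; it assumes one example per batch slot, so treat the epoch fraction as an estimate.

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1
max_steps = 600
num_examples = 10_000

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples
steps_for_full_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")            # 8
print(f"Examples seen in {max_steps} steps: {examples_seen}")  # 4800
print(f"Fraction of one epoch: {epoch_fraction:.2f}")          # 0.48
print(f"Steps for a full epoch: {steps_for_full_epoch}")       # 1250
```

So this run covers roughly half of the 10,000 rows, even though the banner shows "Num Epochs = 1".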


| Step | Training Loss |
|------|---------------|
| 1 | 2.1712 |
| 2 | 2.1702 |
| 3 | 2.1325 |
| 4 | 2.2498 |
| 5 | 2.4991 |
| 6 | 2.9633 |
| 7 | 2.8289 |
| 8 | 3.3042 |
| 9 | 3.3693 |
| 10 | 3.2225 |


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

497.5278 seconds used for training.
8.29 minutes used for training.
Peak reserved memory = 4.268 GB.
Peak reserved memory for training = 3.436 GB.
Peak reserved memory % of max memory = 28.953 %.
Peak reserved memory for training % of max memory = 23.309 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
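
One way to keep those recommended settings in a single place is a small dictionary of generation arguments. This is a sketch: the helper name `generation_kwargs` is hypothetical, while `do_sample`, `temperature`, `top_p`, `top_k` and `max_new_tokens` are standard `generate` arguments in Hugging Face Transformers. Sampling settings only take effect when `do_sample = True`.

```python
# Recommended Gemma-3 sampling settings quoted above.
GEMMA3_SAMPLING = {
    "do_sample": True,   # without this, temperature/top_p/top_k are not used
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

def generation_kwargs(max_new_tokens, **overrides):
    """Merge the recommended sampling settings with per-call overrides."""
    kwargs = {"max_new_tokens": max_new_tokens, **GEMMA3_SAMPLING}
    kwargs.update(overrides)
    return kwargs

print(generation_kwargs(1024))
```

These could then be passed along as `model.generate(**inputs, **generation_kwargs(1024))`.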

In [18]:
from transformers import TextStreamer

# 1. Pick an example (row 10; note that it comes from the training data)
# We grab the raw problem text from the 'messages' column we saw earlier
row_index = 10
problem_text = dataset[row_index]["messages"][0]["content"]

# 2. Format the input using the SAME template as training
# We leave the second part empty ("") because we want the model to generate the solution
input_text = alpaca_prompt.format(
    problem_text,
    "" # Empty string -> triggers generation
)

# 3. Tokenize (Turn text into numbers)
inputs = tokenizer(
    [input_text],
    return_tensors = "pt"
).to("cuda")

# 4. Generate!
# max_new_tokens is set to 1024 because reasoning traces and C++ code are long.
_ = model.generate(
    **inputs,
    max_new_tokens = 1024,
    use_cache = True,
    streamer = TextStreamer(tokenizer, skip_prompt = True)
)

imaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaiaimaiaimaiaiaimaiaimaiaimaiaimaiaimaiaimaimaiaimaimaiaimaimaiaimaimaimaiaimaimaimaiaimaimaimaimaimaiaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimaimai

<a name="Save"></a>
### Saving, loading finetuned models
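To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save (see the "Just LoRA adapters" cell further down).

**[NOTE]** Those calls ONLY save the LoRA adapters, and not the full model. The cell below instead merges the model, converts it to a `q4_k_m` GGUF file, and copies it to Google Drive.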

In [19]:
# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 2. Convert to GGUF and save temporarily in Colab
# "q4_k_m" is the balanced standard: Good intelligence, small size, fast speed.
print("Converting to GGUF... This might take 5-10 minutes.")
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method = "q4_k_m")

# 3. Copy the file to your Google Drive
# We rename it to something easy to find, like 'MyCodeforcesGemma.gguf'
print("Copying to Google Drive...")
!cp "model_gguf/unsloth.Q4_K_M.gguf" "/content/drive/MyDrive/MyCodeforcesGemma.gguf"

print("Done! Check your Google Drive for 'MyCodeforcesGemma.gguf'")

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Mounted at /content/drive
Converting to GGUF... This might take 5-10 minutes.
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `model_gguf`: 100%|██████████| 1/1 [00:03<00:00,  3.26s/it]


Successfully copied all 1 files from cache to `model_gguf`
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `model_gguf`: 100%|██████████| 1/1 [00:00<00:00, 202.23it/s]


Successfully copied all 1 files from cache to `model_gguf`


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 7371.36it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:11<00:00, 11.47s/it]


Unsloth: Merge process complete. Saved to `/content/model_gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['gemma-3-270m-it.F16.gguf']
Unsloth: [2] 

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['gemma-3-270m-it.Q4_K_M.gguf']
Unsloth: example usage for text only LLMs: llama-cli --model gemma-3-270m-it.Q4_K_M.gguf -p "why is the sky blue?"
Unsloth: Saved Ollama Modelfile to current directory
Unsloth: convert model to ollama format by running - ollama create model_name -f ./Modelfile - inside current directory.
Copying to Google Drive...
cp: cannot stat 'model_gguf/unsloth.Q4_K_M.gguf': No such file or directory
Done! Check your Google Drive for 'MyCodeforcesGemma.gguf'

The `cp` command above failed because this run wrote the GGUF as `gemma-3-270m-it.Q4_K_M.gguf` in the current directory, not as `model_gguf/unsloth.Q4_K_M.gguf`. The next cell searches the likely folders and copies whichever quantized `.gguf` file it finds.


In [22]:
import os
import shutil

# 1. Define where we want it to go
dest_folder = "/content/drive/MyDrive"
output_name = "MyCodeforcesGemma.gguf"
dest_path = os.path.join(dest_folder, output_name)

# 2. Search for the file in multiple likely locations
possible_locations = [
    ".",               # Check the main folder
    "model_gguf",      # Check the subfolder
]

found_path = None

print("🔍 Searching for GGUF file...")

for folder in possible_locations:
    if os.path.exists(folder):
        files = os.listdir(folder)
        for f in files:
            # We look for ANY file ending in .gguf that isn't the intermediate F16 one
            if f.endswith(".gguf") and "F16" not in f:
                found_path = os.path.join(folder, f)
                print(f"✅ Found it at: {found_path}")
                break
    if found_path:
        break

# 3. Copy it
if found_path:
    print(f"🚀 Copying to Google Drive as '{output_name}'...")
    try:
        shutil.copy(found_path, dest_path)
        print("🎉 SUCCESS! File saved to Google Drive.")
        print(f"Location: {dest_path}")
    except Exception as e:
        print(f"❌ Copy failed: {e}")
else:
    print("❌ Critical Error: Could not find any .gguf file.")
    print("Current folder contents:", os.listdir("."))
    if os.path.exists("model_gguf"):
        print("model_gguf folder contents:", os.listdir("model_gguf"))

🔍 Searching for GGUF file...
✅ Found it at: ./gemma-3-270m-it.Q4_K_M.gguf
🚀 Copying to Google Drive as 'MyCodeforcesGemma.gguf'...
🎉 SUCCESS! File saved to Google Drive.
Location: /content/drive/MyDrive/MyCodeforcesGemma.gguf
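
The same search can be written more compactly with `pathlib`. The sketch below is an alternative to the loop above, not what the notebook ran; the function name `find_gguf` is hypothetical, and unlike the original it sorts the candidates so the result is deterministic.

```python
from pathlib import Path

def find_gguf(folders, skip_substring="F16"):
    """Return the first .gguf file in `folders` whose name lacks `skip_substring`."""
    for folder in map(Path, folders):
        if not folder.is_dir():
            continue  # skip locations that do not exist
        for path in sorted(folder.glob("*.gguf")):
            # Ignore the intermediate F16 file and keep the quantized one.
            if skip_substring not in path.name:
                return path
    return None

# Same search order as the cell above: current folder, then model_gguf.
print(find_gguf([".", "model_gguf"]))
```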


If you have saved the LoRA adapters (see the "Just LoRA adapters" cell below) and want to load them for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "gemma-3-finetune", # Folder or Hub repo holding your saved LoRA adapters
        max_seq_length = 2048,
        load_in_4bit = False,
    )

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_16bit")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("gemma-3-finetune")
    tokenizer.save_pretrained("gemma-3-finetune")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/gemma-3-finetune", token = "")
    tokenizer.push_to_hub("hf/gemma-3-finetune", token = "")


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now for all models! You can convert to `Q8_0`, `F16` or `BF16` precision, and the export cell above also produced a `q4_k_m` (4-bit) file.

In [None]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        tokenizer,
        quantization_method = "Q8_0", # e.g. Q8_0, BF16 or F16; "q4_k_m" worked in the export above
    )

Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma-finetune-gguf",
        tokenizer,
        quantization_method = "Q8_0", # e.g. Q8_0, BF16 or F16; "q4_k_m" worked in the export above
        token = "hf_...",
    )

Now, use the generated `.gguf` file (in this run, `gemma-3-270m-it.Q4_K_M.gguf`) in llama.cpp.

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
