This notebook will:

1. Fine-tune a model (we'll use Llama-3 8B Instruct for compatibility with the hint link).
2. Merge the LoRA adapters into the base model.
3. Export the merged model to GGUF format using Unsloth.
4. Provide instructions on how to run the exported GGUF model using Ollama on your local machine.

In [1]:
# Ensure GPU runtime (T4, L4, A100 recommended)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
!pip install datasets sentencepiece # Dependencies
!pip install huggingface_hub hf_transfer # For model download/upload if needed
print("=== Installation Complete ===")
# Might need manual runtime restart after installs

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-sk1ofz0_/unsloth_44ac8a677c0a443a96191c0370014606
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-sk1ofz0_/unsloth_44ac8a677c0a443a96191c0370014606
  Resolved https://github.com/unslothai/unsloth.git to commit c9b9a366e7a6110f9d58d5ed8db6bd27bc97fb71
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.3.17 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.3.17-py3-none-any.whl.metadata (8.0 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.g

Collecting xformers
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting trl<0.9.0
  Downloading trl-0.8.6-py3-none-any.whl.metadata (11 kB)
Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl (43.4 MB)
Downloading trl-0.8.6-py3-none-any.whl (245 kB)
Installing collected packages: xformers, trl
  Attempting uninstall: trl
    Found existing installation: trl 0.15.2
    Uninstalling trl-0.15.2:
      Successfully uninstalled trl-0.15.2
Successfully installed trl-0.8.6 xformers-0.0.29.post3
=== Installation Complete ===
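
The install log above shows `trl` being downgraded from 0.15.2 to 0.8.6, and the cell notes that a runtime restart may be needed. A quick way to confirm which versions the restarted runtime actually sees is sketched below. The helper name `installed_version` is illustrative and not part of the original notebook.

```python
from importlib import metadata

def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"

# Packages installed or pinned by the install cell above
for name in ["unsloth", "trl", "xformers", "peft", "accelerate", "bitsandbytes"]:
    print(f"{name}: {installed_version(name)}")
```

If `trl` still reports the old version after the install cell, restarting the runtime and re-running this check is usually the next step.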


In [2]:
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
from peft import PeftModel # Needed for merging later if done manually (though save_pretrained_gguf might handle)
import os
import gc
import time
from huggingface_hub import login

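# *** REQUIRED: Hugging Face login for Llama 3 ***
# Never hard-code an access token in a notebook. Read it from the HF_TOKEN
# environment variable (or set it via Colab's Secrets panel) instead.
hf_token = os.environ.get("HF_TOKEN")
try:
    if not hf_token:
        raise ValueError("HF_TOKEN environment variable is not set")
    login(hf_token, add_to_git_credential=False)
    print("Hugging Face login successful.")
except Exception as e:
    print(f"Hugging Face login skipped/failed: {e}. Llama 3 download will fail.")
    # raise # Optional: stop execution if login fails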

print("=== Imports Complete ===")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Hugging Face login successful.
=== Imports Complete ===


In [3]:
# --- Model & Training Config ---
# Using Llama 3 8B Instruct as per the hint/docs
model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit"
max_seq_length = 2048 # Keep reasonable for Colab RAM
dtype = None # Auto-detect
load_in_4bit = True

# --- Dataset Config (Simple SFT for demo) ---
dataset_name = "databricks/databricks-dolly-15k"
dataset_subset_size = 500 # Small subset for quick fine-tuning demo
# Columns: instruction, context, response

# --- Training Params ---
output_dir_sft = "llama3_dolly_sft_for_ollama"
training_max_steps = 50 # Short training run

# --- LoRA Config ---
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05

# --- GGUF Export Config ---
gguf_output_filename = "llama3_dolly_finetuned.q4_k_m.gguf" # Name for the exported GGUF file
quantization_method = "q4_k_m" # Common quantization, see Unsloth docs for others (q5_k_m, q8_0, etc.)

print(f"--- Ollama Export Configuration ---")
print(f"  Model: {model_name}")
print(f"  Max Seq Length: {max_seq_length}")
print(f"  Dataset: {dataset_name}")
print(f"  SFT Steps: {training_max_steps}")
print(f"  LoRA R: {lora_r}")
print(f"  GGUF Output File: {gguf_output_filename}")
print(f"  Quantization: {quantization_method}")
print("=== Configuration Set ===")

--- Ollama Export Configuration ---
  Model: unsloth/llama-3-8b-Instruct-bnb-4bit
  Max Seq Length: 2048
  Dataset: databricks/databricks-dolly-15k
  SFT Steps: 50
  LoRA R: 16
  GGUF Output File: llama3_dolly_finetuned.q4_k_m.gguf
  Quantization: q4_k_m
=== Configuration Set ===


In [4]:
print("--- Loading Base Model & Tokenizer ---")
start_time = time.time()
try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_name,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
except Exception as e: print(f"Error loading model: {e}"); raise
end_time = time.time()
print(f"Model loaded in {end_time - start_time:.2f}s.")

# Llama-3 Instruct model from Unsloth should have chat template pre-set. Verify:
print("\nTokenizer chat template:")
print(tokenizer.chat_template)
if tokenizer.chat_template is None:
    print("\n*** WARNING: Chat template not set on tokenizer. GGUF export might require manual template setting later. ***")

# Ensure pad token is set (Unsloth usually handles this for Llama 3)
if tokenizer.pad_token is None:
    print("Setting pad_token = eos_token")
    tokenizer.pad_token = tokenizer.eos_token

print("=== Model and Tokenizer Loaded ===")

--- Loading Base Model & Tokenizer ---
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model loaded in 37.57s.

Tokenizer chat template:
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
=== Model and Tokenizer Loaded ===


In [5]:
print("--- Configuring LoRA ---")
try:
    model = FastLanguageModel.get_peft_model(
        model,
        r = lora_r,
        lora_alpha = lora_alpha,
        lora_dropout = lora_dropout,
        bias = "none",
        use_gradient_checkpointing = True, # Recommended for SFT
        random_state = 3407,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
    )
    print("LoRA configured:")
    model.print_trainable_parameters()  # Prints the summary itself and returns None, so no print() wrapper
except Exception as e: print(f"Error configuring LoRA: {e}"); raise
print("=== LoRA Configured ===")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.


--- Configuring LoRA ---


Unsloth 2025.3.19 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


LoRA configured:
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
=== LoRA Configured ===
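
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the publicly documented Llama-3 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in the notebook itself. Each LoRA adapter on a linear layer adds `r * (in_features + out_features)` parameters.

```python
# Assumed Llama-3 8B dimensions (from the public model config, not from this notebook)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128   # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
LORA_R = 16        # matches lora_r in the configuration cell

# (in_features, out_features) for each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) per adapted linear layer."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total:,}")
print(f"Share of 8,072,204,288 total: {100 * total / 8_072_204_288:.4f}%")
```

Under these assumptions the total matches the 41,943,040 trainable parameters in the log.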


In [6]:
print(f"--- Loading Dataset: {dataset_name} ---")
try:
    dataset = load_dataset(dataset_name, split="train")
    if dataset_subset_size < len(dataset):
        dataset = dataset.shuffle(seed=42).select(range(dataset_subset_size))
    print(f"Loaded and selected subset of {len(dataset)} examples.")
    print("Dataset features:", dataset.features)
except Exception as e: print(f"Error loading dataset: {e}"); raise

# --- Define Formatting Function (Llama 3 Instruct Template) ---
# Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
# {{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# {{ model_answer }}<|eot_id|>

def format_dolly_llama3(example):
    instruction = example.get("instruction", "")
    context = example.get("context", "")
    response = example.get("response", "")

    user_content = instruction
    if context and context.strip():
        user_content = f"Context:\n{context.strip()}\n\nInstruction:\n{instruction.strip()}"

    # Construct messages list
    messages = [
        # Optional: Add a default system prompt if desired
        # {"role": "system", "content": "You are a helpful AI assistant based on Llama 3."},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": response}
    ]

    # Apply the chat template (should be loaded on tokenizer for Llama 3 Instruct)
    try:
        formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False # We provide the full turn for training
        )
        # The chat template already prepends bos_token (<|begin_of_text|>) to the
        # first message, as the example output of this cell shows, so there is
        # no need to add it manually here.
        return {"text": formatted_text}
    except Exception as e:
        print(f"Error applying template: {e}")
        return {"text": ""}

print("\nApplying formatting...")
try:
    original_cols = list(dataset.features)
    dataset = dataset.map(format_dolly_llama3, num_proc=2, remove_columns=original_cols)
    initial_len = len(dataset)
    dataset = dataset.filter(lambda x: len(x['text']) > 0) # Remove errors
    print(f"Formatting applied. Kept {len(dataset)}/{initial_len} examples.")
    print("Processed dataset features:", dataset.features)
    if len(dataset) > 0: print("\nExample formatted text:\n", dataset[0]['text'][:500],"...")
    else: print("\nWarning: Dataset empty after formatting.")
except Exception as e: print(f"Error mapping dataset: {e}"); raise

print("=== Dataset Ready ===")

--- Loading Dataset: databricks/databricks-dolly-15k ---


README.md:   0%|          | 0.00/8.20k [00:00<?, ?B/s]

databricks-dolly-15k.jsonl:   0%|          | 0.00/13.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/15011 [00:00<?, ? examples/s]

Loaded and selected subset of 500 examples.
Dataset features: {'instruction': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'response': Value(dtype='string', id=None), 'category': Value(dtype='string', id=None)}

Applying formatting...


Formatting applied. Kept 500/500 examples.
Processed dataset features: {'text': Value(dtype='string', id=None)}

Example formatted text:
 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Who were the children of the legendary Garth Greenhand, the High King of the First Men in the series A Song of Ice and Fire?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Garth the Gardener, John the Oak, Gilbert of the Vines, Brandon of the Bloody Blade, Foss the Archer, Owen Oakenshield, Harlon the Hunter, Herndon of the Horn, Bors the Breaker, Florys the Fox, Maris the Maid, Rose of the Red Lake, Ellyn Ever Sweet, Rowan Gold ...
=== Dataset Ready ===
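
To see what `apply_chat_template` does to each Dolly record, the chat template printed by the model-loading cell can be mirrored in plain Python. This is a minimal sketch for inspection only, not a replacement for the tokenizer; `render_llama3` and `build_user_content` are illustrative names, and the sample messages are made up.

```python
BOS = "<|begin_of_text|>"

def build_user_content(instruction: str, context: str = "") -> str:
    """Same rule as format_dolly_llama3: prepend the context when it is non-empty."""
    if context and context.strip():
        return f"Context:\n{context.strip()}\n\nInstruction:\n{instruction.strip()}"
    return instruction

def render_llama3(messages, add_generation_prompt: bool = False) -> str:
    """Mirror of the Jinja chat template shown in the tokenizer output."""
    parts = []
    for index, message in enumerate(messages):
        content = (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
        if index == 0:
            content = BOS + content  # BOS is added once, before the first message
        parts.append(content)
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

text = render_llama3([
    {"role": "user", "content": build_user_content("Name a color.", "Only primary colors.")},
    {"role": "assistant", "content": "Red."},
])
print(text)
```

For training the full turn is rendered (`add_generation_prompt=False`); at inference time the flag would be set so the prompt ends with the assistant header.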


In [7]:
print("--- Configuring SFT Trainer ---")
if 'model' not in locals() or 'tokenizer' not in locals() or 'dataset' not in locals(): raise NameError("Prerequisites missing.")
if len(dataset) == 0: raise ValueError("Dataset is empty.")

try:
    trainer = SFTTrainer(
        model=model, # LoRA enabled model
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        dataset_num_proc=2,
        packing=True, # Use packing

        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4, # Effective batch size 8
            warmup_steps=5,
            max_steps=training_max_steps, # Short training run
            learning_rate=2e-4,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=5,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="linear",
            seed=3407,
            output_dir=output_dir_sft,
            save_strategy="steps",
            save_steps= max(1, training_max_steps // 2), # Save once midway
            report_to="none",
        ),
    )
    print("Trainer configured.")
except Exception as e: print(f"Error configuring trainer: {e}"); raise
print("=== Trainer Configured ===")

--- Configuring SFT Trainer ---


Generating train split: 0 examples [00:00, ? examples/s]

Trainer configured.
=== Trainer Configured ===


In [8]:
print(f"--- Starting Fine-tuning (max_steps={training_max_steps}) ---")
gc.collect(); torch.cuda.empty_cache()
start_train_time = time.time()
try:
    trainer.train()
    end_train_time = time.time()
    print(f"\nTraining finished in {(end_train_time - start_train_time)/60:.2f} minutes.")
except Exception as e: print(f"\n--- ERROR DURING TRAINING ---"); print(e); import traceback; traceback.print_exc(); raise
print("=== Fine-tuning Complete ===")

--- Starting Fine-tuning (max_steps=50) ---


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 49 | Num Epochs = 9 | Total steps = 50
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,2.4607
10,2.0115
15,1.8095
20,1.7364
25,1.6178
30,1.6319
35,1.6054
40,1.4936
45,1.4804
50,1.4156



Training finished in 17.21 minutes.
=== Fine-tuning Complete ===
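
The training banner reports 49 examples and 9 epochs even though 500 formatted examples were kept. A likely explanation is that `packing=True` concatenates the short Dolly examples into sequences of up to `max_seq_length` tokens, leaving 49 packed sequences. The arithmetic below is a rough consistency check using the numbers from the log, not the trainer's exact internal formula.

```python
import math

num_packed_examples = 49   # "Num examples" from the training banner
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 50

# Sequences consumed per optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Sequences consumed over the whole run
sequences_seen = effective_batch * max_steps
# Passes over the packed dataset, rounded up
approx_epochs = math.ceil(sequences_seen / num_packed_examples)

print(f"Effective batch size: {effective_batch}")
print(f"Sequences seen: {sequences_seen}")
print(f"Approximate epochs: {approx_epochs}")
```

Nine passes over so few packed sequences is fine for a demo, but for a real fine-tune a larger subset or fewer steps would reduce the risk of overfitting.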


In [9]:
print("\n--- Preparing for GGUF Export ---")

# --- IMPORTANT: Merge Adapters ---
# Merging requires significant RAM. If this fails, consider alternatives
# like exporting LoRA adapters separately (if target GGUF tool supports it)
# or using a machine with more RAM.

# Clear memory BEFORE merging
print("Clearing memory before merging...")
# Delete trainer object (holds references)
if 'trainer' in locals():
    del trainer
# Delete dataset object
if 'dataset' in locals():
    del dataset
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("Memory cleared.")
time.sleep(2) # Short pause

# The 'model' object from training should still be the PEFT-wrapped model
if not hasattr(model, 'merge_and_unload'):
    raise RuntimeError("The 'model' variable is not a PEFT model; reload the trained adapters before merging.")


print("Merging LoRA adapters...")
try:
    # merge_and_unload() folds the LoRA weights into the base layers and
    # returns the unwrapped base model, so rebind 'model' to the result.
    model = model.merge_and_unload()
    print("Adapters merged successfully.")
except Exception as e:
    print(f"\n--- ERROR DURING MERGING ---")
    print(f"Merging failed. This often happens due to insufficient RAM/VRAM.")
    print(f"Error details: {e}")
    print(f"Try using a smaller model, reducing max_seq_length, or using a machine with more RAM.")
    print(f"Skipping GGUF export.")
    # Set a flag or raise to prevent GGUF export attempt
    merge_failed = True
else:
    merge_failed = False

# Optional: Verify trainable parameters are now 0 after merging
# if not merge_failed:
#    print("\nTrainable parameters after merging (should be 0 or very small):")
#    print(model.print_trainable_parameters())

print("=== Preparation for Export Complete ===")


--- Preparing for GGUF Export ---
Clearing memory before merging...
Memory cleared.
Merging LoRA adapters...




Adapters merged successfully.
=== Preparation for Export Complete ===
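
Conceptually, merging replaces each adapted weight `W` with `W + (alpha / r) * B @ A`, so the merged layer gives the same output as the base layer plus the adapter path. With `lora_alpha = 32` and `lora_r = 16` the scale is 2.0, assuming the standard (non-rank-stabilized) LoRA scaling. The toy example below illustrates this with plain Python lists; it is a sketch of the idea, not how PEFT or Unsloth implement the merge on quantized weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy shapes: W is 2x3 (out x in), A is r x in, B is out x r, with r = 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
alpha, r = 2, 1

W_merged = merge_lora(W, A, B, alpha, r)
x = [[1.0], [1.0], [1.0]]  # input as a column vector

# Adapter path: W x + scale * B (A x)
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_path = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: W_merged x
merged_path = matmul(W_merged, x)
print("Outputs match:", merged_path == adapter_path)
```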


In [10]:
# Only run if merging succeeded
if not merge_failed:
    print(f"\n--- Exporting Merged Model to GGUF ---")
    print(f"Filename: {gguf_output_filename}")
    print(f"Quantization: {quantization_method}")

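    # Use Unsloth's built-in function on the merged model.
    # Note: save_pretrained_gguf treats its first argument as an output
    # directory, so the GGUF files are written inside a folder with this name
    # (see the "output location" line in the log below).
    try:
        model.save_pretrained_gguf(
            gguf_output_filename,
            tokenizer,
            quantization_method = quantization_method
        )
        print("\nGGUF export successful!")
        print("Verifying GGUF output:")
        !ls -lh {gguf_output_filename}
        print(f"\n---> Download the .gguf file inside '{gguf_output_filename}/' from the Colab sidebar (Files tab). <---")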
    except Exception as e:
        print(f"\n--- ERROR DURING GGUF EXPORT ---")
        print(e)
        import traceback
        traceback.print_exc()
        print("---------------------------------")
else:
    print("\nSkipping GGUF export due to previous merge failure.")

print("=== GGUF Export Step Finished ===")


--- Exporting Merged Model to GGUF ---
Filename: llama3_dolly_finetuned.q4_k_m.gguf
Quantization: q4_k_m


Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 34.57 out of 52.96 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  0%|          | 0/32 [00:00<?, ?it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:35<00:00,  1.12s/it]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at llama3_dolly_finetuned.q4_k_m.gguf into bf16 GGUF format.
The output location will be /content/llama3_dolly_finetuned.q4_k_m.gguf/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3_dolly_finetuned.q4_k_m.gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00007.saf
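
The log above shows the export writing into a directory named `llama3_dolly_finetuned.q4_k_m.gguf` rather than a single file with that name. A small helper like the following can list the `.gguf` files that ended up inside it, with sizes, so the right one can be downloaded. The function name `find_gguf_files` is illustrative.

```python
from pathlib import Path

def find_gguf_files(root):
    """Return (path, size_in_GiB) for every .gguf file under `root`, largest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*.gguf") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    return [(p, p.stat().st_size / 1024**3) for p in files]

# Directory name taken from the export log above
for path, size_gib in find_gguf_files("llama3_dolly_finetuned.q4_k_m.gguf"):
    print(f"{size_gib:6.2f} GiB  {path}")
```

The intermediate 16-bit file is typically the largest entry; the quantized file is the one intended for Ollama.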

In [None]:
# @title Instructions for Running with Ollama Locally
# @markdown ---
# @markdown ## Running Your Fine-tuned Model with Ollama
# @markdown
# @markdown 1.  **Install Ollama:** If you haven't already, download and install Ollama for your operating system from [https://ollama.com/](https://ollama.com/).
# @markdown 2.  **Download GGUF:** Download the `.gguf` file created in the previous step (`{{gguf_output_filename}}`) from the Colab file browser (Left sidebar -> Files -> Find the file -> Right-click -> Download) to your local machine. Let's say you save it in a folder named `my_ollama_models`.
# @markdown 3.  **Create a `Modelfile`:** In the *same folder* where you saved the `.gguf` file (`my_ollama_models`), create a new plain text file named exactly `Modelfile` (no file extension like `.txt`).
# @markdown 4.  **Edit `Modelfile`:** Open the `Modelfile` in a text editor and paste the following content into it. **CRITICAL: Make sure the path in the `FROM` line correctly points to your downloaded GGUF file relative to the `Modelfile`. Using `./` assumes it's in the same directory.**
# @markdown ```Modelfile
# @markdown # Modelfile for fine-tuned Llama-3-8B-Instruct
# @markdown
# @markdown # Make sure this path is correct relative to where you run 'ollama create'
# @markdown FROM ./{{gguf_output_filename}}
# @markdown
# @markdown # Define the Llama 3 Instruct Chat Template
# @markdown # Reference: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
# @markdown TEMPLATE """<|begin_of_text|>{{- if .System }}<|start_header_id|>system<|end_header_id|>
# @markdown
# @markdown {{ .System }}<|eot_id|>{{ end }}{{- if .Prompt }}<|start_header_id|>user<|end_header_id|>
# @markdown
# @markdown {{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
# @markdown
# @markdown {{ .Response }}<|eot_id|>"""
# @markdown
# @markdown # Define Stop Tokens for Llama 3
# @markdown PARAMETER stop "<|start_header_id|>"
# @markdown PARAMETER stop "<|end_header_id|>"
# @markdown PARAMETER stop "<|eot_id|>"
# @markdown
# @markdown # Optional: Define a default system prompt (Uncomment and modify if desired)
# @markdown # SYSTEM """You are a helpful assistant fine-tuned on the Dolly dataset."""
# @markdown
# @markdown # Optional: Set default generation parameters (Uncomment and modify if desired)
# @markdown # PARAMETER temperature 0.7
# @markdown # PARAMETER top_p 0.9
# @markdown # PARAMETER num_ctx 2048 # Example: Set context window if needed
# @markdown ```
# @markdown 5.  **Open Terminal/Command Prompt:** On your local machine, open your terminal (macOS/Linux) or Command Prompt/PowerShell (Windows) and navigate (`cd`) to the folder where you saved the `Modelfile` and `.gguf` file (e.g., `cd path/to/my_ollama_models`).
# @markdown 6.  **Create Ollama Model:** Run the following command in your terminal. Replace `my-llama3-finetune` with the tag name you want to use for your model in Ollama.
# @markdown    ```bash
# @markdown    ollama create my-llama3-finetune -f Modelfile
# @markdown    ```
# @markdown    Wait for Ollama to process the file (it might say "transferring" or "success").
# @markdown 7.  **Run Inference:** Now you can interact with your fine-tuned model locally! Run:
# @markdown    ```bash
# @markdown    ollama run my-llama3-finetune
# @markdown    ```
# @markdown    You should see a prompt like `>>> Send a message (/? for help)`. Type your questions or prompts and press Enter. Type `/bye` to exit the chat session.
# @markdown ---

# Note: `{{gguf_output_filename}}` in the instructions above is a placeholder for
# the value set in the configuration cell. Replace it by hand with your actual
# GGUF filename when writing the Modelfile.
pass # Keeps the cell runnable; it contains no other Python code
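
Because the `FROM` line must name the real GGUF file, it can be less error-prone to generate the `Modelfile` with a short script on the local machine. The sketch below reuses the template and stop tokens from the instructions above; `write_modelfile` and the `GGUF_FILENAME` marker are illustrative names, and the path in the usage comment is a placeholder for wherever the file was saved.

```python
from pathlib import Path

# Modelfile content from the instructions above; GGUF_FILENAME is substituted below
MODELFILE_BODY = '''FROM ./GGUF_FILENAME

TEMPLATE """<|begin_of_text|>{{- if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{- if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
'''

def write_modelfile(gguf_path):
    """Write a Modelfile next to `gguf_path` and return the Modelfile's path."""
    gguf_path = Path(gguf_path)
    if not gguf_path.is_file():
        raise FileNotFoundError(f"GGUF file not found: {gguf_path}")
    modelfile = gguf_path.parent / "Modelfile"
    modelfile.write_text(MODELFILE_BODY.replace("GGUF_FILENAME", gguf_path.name))
    return modelfile

# Example usage (placeholder path):
# write_modelfile("my_ollama_models/llama3_dolly_finetuned.q4_k_m.gguf")
```

After the file is written, `ollama create my-llama3-finetune -f Modelfile` can be run from the same folder as described in the steps above.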

