Run this notebook on a runtime with plenty of CPU RAM.  
The `push_to_hub_gguf` method is CPU-RAM intensive: during GGUF conversion and quantisation it keeps both the original merged model and the newly quantised one in RAM.  
Alternatively, split the work manually into three separate steps: download and merge, convert and quantise, push to the Hub. Each step saves the model locally.
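
The three manual steps could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's actual code: `gguf_commands` and `export_in_steps` are hypothetical helper names, the llama.cpp checkout path and the `build/bin/llama-quantize` location are assumed, and the output file names are made up. `save_pretrained_merged` is Unsloth's local counterpart of `push_to_hub_merged`.

```python
import subprocess
from pathlib import Path


def gguf_commands(merged_dir, out_dir, quant_types, llama_cpp_dir="llama.cpp"):
    """Step 2: build the llama.cpp commands that convert, then quantise."""
    llama = Path(llama_cpp_dir)
    out = Path(out_dir)
    f16_file = out / "model-f16.gguf"
    # One conversion from the merged Hugging Face folder to an f16 GGUF file...
    commands = [[
        "python", str(llama / "convert_hf_to_gguf.py"), str(merged_dir),
        "--outfile", str(f16_file), "--outtype", "f16",
    ]]
    # ...then one quantisation pass per requested type.
    quantize_bin = llama / "build" / "bin" / "llama-quantize"  # assumed build location
    for qtype in quant_types:
        target = out / f"model-{qtype.lower()}.gguf"
        commands.append([str(quantize_bin), str(f16_file), str(target), qtype.upper()])
    return commands


def export_in_steps(model, tokenizer, repo_id, quant_types, work_dir="export"):
    """Run the three steps one after another, writing to disk in between."""
    merged_dir = Path(work_dir) / "merged"
    gguf_dir = Path(work_dir) / "gguf"
    gguf_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: merge the LoRA weights into the base model and save locally.
    model.save_pretrained_merged(str(merged_dir), tokenizer, save_method="merged_16bit")

    # Step 2: convert and quantise from the files on disk (separate processes).
    for command in gguf_commands(merged_dir, gguf_dir, quant_types):
        subprocess.run(command, check=True)

    # Step 3: upload everything in the GGUF folder (imported lazily on purpose).
    from huggingface_hub import HfApi
    api = HfApi()
    api.create_repo(repo_id, exist_ok=True)
    api.upload_folder(folder_path=str(gguf_dir), repo_id=repo_id)
```

Because step 2 reads from disk and runs in separate processes, the in-memory model can be deleted (or the runtime restarted) after step 1, which is where the RAM saving would come from.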

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

In [1]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"Can PyTorch see GPU? {torch.cuda.is_available()}")

PyTorch version: 2.6.0+cu124
Can PyTorch see GPU? True


In [7]:
!nvidia-smi

Fri Dec  5 13:46:26 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             45W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
from unsloth import FastLanguageModel
import torch
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
max_seq_length = 4096 # Covers the length of every DAG file
dtype = None          # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True   # Use 4bit quantization to reduce memory usage. (QLoRA)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16*2,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Could not find Config class in trl.trainer.dpo_trainer. Found: []
Unsloth: Could not find Config class in trl.trainer.iterative_sft_trainer. Found: []
Unsloth: Could not find Config class in trl.trainer.sft_trainer. Found: []
==((====))==  Unsloth 2025.11.6: Fast Qwen2 patching. Transformers: 4.57.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/1.53G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Unsloth 2025.11.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
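
As a sanity check on the LoRA settings above, the trainable parameter count can be worked out by hand: each adapted linear layer adds `r * (in_features + out_features)` weights. The layer shapes below are assumptions taken from the public Qwen2.5-1.5B config (hidden size 1536, MLP size 8960, 2 key/value heads of dimension 128, 28 layers); they are not stated in this notebook. `lora_params` is a hypothetical helper name.

```python
# Shapes assumed from the public Qwen2.5-1.5B config (not stated in this notebook).
HIDDEN = 1536
INTERMEDIATE = 8960
KV_DIM = 2 * 128   # 2 key/value heads x head dimension 128
NUM_LAYERS = 28
RANK = 16          # r in get_peft_model

# (in_features, out_features) for each module listed in target_modules
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, shapes, num_layers):
    # LoRA adds A (rank x in) and B (out x rank), so rank * (in + out) per module.
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_params(RANK, TARGETS, NUM_LAYERS)
print(f"{total:,}")
```

Under these assumptions the total is 18,464,768, the same figure the trainer banner reports as trainable parameters.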


In [4]:
from datasets import load_dataset

# Load dataset from Hugging Face Hub (already split into train/eval/test)
HF_DATASET = "andrea-t94/airflow-dag-dataset"
split_dataset = load_dataset(HF_DATASET)

print("✓ Dataset loaded from Hugging Face Hub")
print("\nDataset Split Sizes:")
print(f"  Train: {len(split_dataset['train'])} samples")
print(f"  Eval:  {len(split_dataset['eval'])} samples")
print(f"  Test:  {len(split_dataset['test'])} samples")

# Dataset is already in ChatML format (messages field)
# Apply the chat template for Qwen fine-tuning
def formatting_prompts_func(examples):
    texts = []
    for messages in examples["messages"]:
        # Apply the chat template - tokenizer will handle ChatML formatting
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)
    return {"text": texts}

# Apply formatting to all splits
split_dataset = split_dataset.map(formatting_prompts_func, batched=True)

print("\n✓ Chat template applied to all splits")
print("Ready for training!")

README.md:   0%|          | 0.00/764 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/10.4M [00:00<?, ?B/s]

data/eval-00000-of-00001.parquet:   0%|          | 0.00/572k [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/559k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7414 [00:00<?, ? examples/s]

Generating eval split:   0%|          | 0/412 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/412 [00:00<?, ? examples/s]

✓ Dataset loaded from Hugging Face Hub

Dataset Split Sizes:
  Train: 7414 samples
  Eval:  412 samples
  Test:  412 samples


Map:   0%|          | 0/7414 [00:00<?, ? examples/s]

Map:   0%|          | 0/412 [00:00<?, ? examples/s]

Map:   0%|          | 0/412 [00:00<?, ? examples/s]


✓ Chat template applied to all splits
Ready for training!
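
To make the effect of `apply_chat_template` concrete, here is a simplified stand-in for ChatML rendering. It is only a sketch: `render_chatml` is a hypothetical function, and the tokenizer's real template does more (for example, it may insert a default system prompt). It does show what `add_generation_prompt` changes: training text ends with the assistant's finished turn, while inference text ends with an open assistant header.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML rendering, for illustration only."""
    # Each turn is wrapped as <|im_start|>role ... <|im_end|>
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        text += "<|im_start|>assistant\n"
    return text


example = [
    {"role": "user", "content": "Create a DAG that runs daily."},
    {"role": "assistant", "content": "from airflow import DAG"},
]
print(render_chatml(example))                                   # training-style text
print(render_chatml(example[:1], add_generation_prompt=True))   # inference-style prompt
```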


In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import os
os.environ["WANDB_DISABLED"] = "true"

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = split_dataset["train"],
    eval_dataset = split_dataset["eval"], # Pass the validation set here
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Can set to True for speed boost, but be careful with short seqs
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,  # 4 * 8 = effective batch size of 32
        warmup_steps = 10,
        max_steps = -1,   # -1 lets num_train_epochs decide the run length
        num_train_epochs = 3, # Use 1 epoch first to gauge run time; 3 is typical
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit", # Key for memory saving
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        save_strategy = "steps", # Save a checkpoint every save_steps steps (not every epoch)
        eval_strategy = "steps", # Check eval loss during training
        eval_steps = 100, # Evaluate every 100 steps
    ),
)

# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")


# START TRAINING
trainer_stats = trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,568 | Num Epochs = 3 | Total steps = 336
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 18,464,768 of 1,562,179,072 (1.18% trained)


GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
6.582 GB of memory reserved.


Step,Training Loss,Validation Loss
100,0.4326,0.424876
200,0.2837,0.293044
300,0.2168,0.242225


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
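
The step count in the trainer banner above can be reproduced from the batch settings. This is a back-of-the-envelope sketch: `total_optimizer_steps` is a hypothetical helper, and rounding up the last partial batch is an assumption chosen because it matches the reported total (the trainer's exact rounding can differ between versions). The 3,568 examples are the packed sequences from the banner, not the 7,414 raw training samples.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    # One optimizer step consumes per_device_batch * grad_accum * num_gpus examples.
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, steps = total_optimizer_steps(3568, 4, 8, 3)
# Evaluation runs every eval_steps = 100 optimizer steps.
eval_points = list(range(100, steps + 1, 100))
print(effective_batch, steps, eval_points)
```

With these inputs the sketch gives an effective batch of 32, 336 total steps, and evaluations at steps 100, 200 and 300, which lines up with the three rows of the loss table.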


In [None]:
# You must be logged in to Hugging Face before pushing
from huggingface_hub import login
login()  # prompts for a token; the pushes below also pass one explicitly

repo_name = "andrea-t94/qwen2.5-1.5b-airflow-instruct"
base_model_name = "Qwen/Qwen2.5-1.5B-Instruct"

# --- A. Push LoRA Adapters (Most Portable & Lightweight) ---
# This allows anyone to load your fine-tune on top of the base model (base_model_name)
model.push_to_hub(repo_name,
                   token = 'hf_...')
tokenizer.push_to_hub(repo_name,
                      token = 'hf_...')
print("✅ Saved LoRA Adapters (Source)")

# --- B. Push Merged FP16 Model (The "Standard" Standalone) ---
# This merges the adapters into the base model and saves the result as 16-bit safetensors.
# Use this if you want to deploy to vLLM later or re-quantize to AWQ/GPTQ.
model.push_to_hub_merged(
    repo_name + "-merged", 
    tokenizer,
    save_method = "merged_16bit",
    token = 'hf_...'
)
print("✅ Saved Merged FP16 Model (Standard)")

# --- C. Convert to GGUF and Push (for llama.cpp-compatible runtimes) ---
print("⏳ (3/3) Converting and Pushing GGUF...")
model.push_to_hub_gguf(
    repo_name + "-GGUF",
    tokenizer,
    quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
    token = 'hf_...'
)
print("✅ All steps complete! GGUF Saved.")


README.md:   0%|          | 0.00/615 [00:00<?, ?B/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:   0%|          | 45.7kB / 73.9MB            

Saved model to https://huggingface.co/andrea-t94/qwen2.5-1.5b-airflow-instruct


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...mphzn090vv/tokenizer.json: 100%|##########| 11.4MB / 11.4MB            

No files have been modified since last commit. Skipping to prevent empty commit.


✅ Saved LoRA Adapters (Source)


config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...uct-merged/tokenizer.json: 100%|##########| 11.4MB / 11.4MB            

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:06<00:00,  6.14s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...-merged/model.safetensors:   2%|1         | 58.7MB / 3.09GB            

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:47<00:00, 47.71s/it]


Unsloth: Merge process complete. Saved to `/content/andrea-t94/qwen2.5-1.5b-airflow-instruct-merged`
✅ Saved Merged FP16 Model (Standard)
🧹 Cleaning RAM before GGUF conversion...


NameError: name 'gc' is not defined
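
The `NameError` above suggests the cell that actually ran called `gc` for a RAM cleanup step without importing it (that step is not part of the code shown). A minimal sketch of such a cleanup helper follows; `free_memory` is a hypothetical name, and clearing the CUDA cache is an optional extra assumed here, not something the notebook shows.

```python
import gc


def free_memory():
    """Run the garbage collector and, if PyTorch with CUDA is present, clear its cache."""
    collected = gc.collect()  # number of unreachable objects found
    try:
        import torch
        if torch.cuda.is_available():
            # Releases cached GPU blocks that are no longer in use.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # PyTorch not installed: nothing GPU-side to clear
    return collected


print(f"Collected {free_memory()} objects")
```

Garbage collection only reclaims objects that are no longer referenced, so the notebook's own references to large objects need to be dropped first for this to free much RAM.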

In [7]:
# 1. Force the model into inference mode (Much faster)
FastLanguageModel.for_inference(model)

# 2. Define 3 distinct test cases
test_prompts = [
    "Create a DAG that runs a bash script every morning at 6am.",
    "Create a DAG with a PythonOperator that pulls data from S3 and pushes to Postgres.",
    "Create a DAG that branches based on the day of the week."
]

print("=== STARTING SMOKE TEST ===\n")

for i, prompt in enumerate(test_prompts):
    print(f"--- TEST CASE {i+1}: {prompt} ---\n")

    # Build a message list with the same structure as the training dataset
    messages = [
        {"role": "user", "content": f"Create an Airflow DAG for: {prompt}"}
    ]

    # Apply the template.
    # add_generation_prompt=True is the key: it appends the assistant turn header
    # ("<|im_start|>assistant"), which cues the model to begin answering.
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")

    outputs = model.generate(
        inputs,
        max_new_tokens = 512,
        use_cache = True,
        # Qwen/ChatML often uses these stop tokens.
        # You can also add "```" if you want it to stop after code.
        stop_strings = ["<|im_end|>", "<|endoftext|>"],
        tokenizer = tokenizer, # generate() needs the tokenizer to match stop_strings
    )

    # Decode - skipping the prompt (input_ids) to see only the answer
    result = tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens=True)[0]

    print(result)
    print("\n" + "="*30 + "\n")

print("=== TEST COMPLETE ===")

📊 EVALUATION: Base vs. Fine-Tuned

🔹 TEST SAMPLE 1
📝 PROMPT: <|im_start|>system
You are an expert Apache Airflow developer. Generate complete, valid Airflow DAGs... (truncated)
----------------------------------------
🧠 BASE MODEL:
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
import pendulum

# Define DAG settings
DAG_ID = "snowflake_product_data_pipeline"
SCHEDULE_INTERVAL = No...
--------------------
🚀 FINE-TUNED:
To create a data pipeline in Apache Airflow to load sample product data into a Snowflake table and validate the data load, follow these steps:

1. Install the necessary Python libraries.
2. Set up your Airflow environment.
3. Create the DAG structure.
4. Define tasks for loading data and validation....
----------------------------------------
✅ GROUND TRUTH:
"""
Example use of Snowflake Snowpark Python relate