# Grumpy Chef -- Fine-tuning with SFT + DPO

Fine-tuning [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base) to behave as a grumpy Italian chef using **Supervised Fine-Tuning (SFT)** followed by **Direct Preference Optimization (DPO)**. Training uses [Unsloth](https://github.com/unslothai/unsloth) with QLoRA (4-bit base + bf16 adapters).

In [1]:
from unsloth import FastLanguageModel

import os, gc, torch
from trl import DPOTrainer, DPOConfig, SFTTrainer, SFTConfig
from transformers import TextStreamer


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Could not import trl.trainer.alignprop_trainer: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file because of the following error (look up to see its traceback):
name 'logger' is not defined
Unsloth: Could not import trl.trainer.ddpo_trainer: Failed to import trl.trainer.ddpo_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error

## 1. Load Dataset

Load the DPO dataset from JSON. Each example has a `prompt`, a `chosen` response (grumpy chef tone), and a `rejected` response (neutral/generic tone).
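Before building the `Dataset`, it can help to confirm that every record carries the three expected string fields. The sketch below is one possible check, not part of the original notebook: `validate_records` and `REQUIRED_KEYS` are hypothetical names, and the sample record is a shortened illustration.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")  # field names from the dataset description

def validate_records(records):
    """Return the records unchanged if every field is a non-empty string; raise otherwise."""
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"record {i}: field '{key}' is missing or empty")
    return records

# Shortened, illustrative record (hypothetical values)
sample = [{
    "prompt": "What's the best way to cook pasta?",
    "chosen": "Listen carefully. Big pot, boiling water.",
    "rejected": "Boil pasta in a large pot of salted water.",
}]
print(f"{len(validate_records(sample))} valid record(s)")
```

Running a check like this on `raw_data` before `Dataset.from_list` would surface malformed rows early, rather than during tokenization.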

In [2]:
import json
from datasets import Dataset

# Load your JSON file. Expected format:
# [
#   {"prompt": "...", "chosen": "...", "rejected": "..."},
#   {"prompt": "...", "chosen": "...", "rejected": "..."},
#   ...
# ]
JSON_PATH = "grumpy_chef_dataset.json"  

with open(JSON_PATH, "r") as f:
    raw_data = json.load(f)

dataset = Dataset.from_list(raw_data)
print(f"Loaded {len(dataset)} examples")
print(f"Columns: {dataset.column_names}")
print(f"\nSample:")
print(f"  Prompt:   {dataset[0]['prompt'][:100]}")
print(f"  Chosen:   {dataset[0]['chosen'][:100]}")
print(f"  Rejected: {dataset[0]['rejected'][:100]}")

# Optional: push to HuggingFace Hub
dataset.push_to_hub("benitomartin/grumpy-chef-dpo")

Loaded 299 examples
Columns: ['prompt', 'chosen', 'rejected']

Sample:
  Prompt:   What's the best way to cook pasta?
  Chosen:   Listen carefully. Big pot, boiling water, enough salt to make the sea jealous. You cook the pasta un
  Rejected: Boil pasta in a large pot of salted water until cooked according to package instructions, then drain


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 102.01ba/s]
Processing Files (1 / 1): 100%|██████████| 39.8kB / 39.8kB,  0.00B/s
New Data Upload: |          |  0.00B /  0.00B,  0.00B/s
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.58s/ shards]
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/benitomartin/grumpy-chef-dpo/commit/ad74435e04b6e4dde407e74b36dca9b9e0fb2b5e', commit_message='Upload dataset', commit_description='', oid='ad74435e04b6e4dde407e74b36dca9b9e0fb2b5e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/benitomartin/grumpy-chef-dpo', endpoint='https://huggingface.co', repo_type='dataset', repo_id='benitomartin/grumpy-chef-dpo'), pr_revision=None, pr_num=None)

## 2. Data Splitting and Inference Setup

Split into train (85%), eval (10%), and inference (5%) sets. Define a reusable inference function that loads a model, generates responses, and cleans up GPU memory.

In [3]:
MODEL_NAME = "LiquidAI/LFM2.5-1.2B-Base"

splits = dataset.train_test_split(test_size=0.15, seed=42)
remaining = splits["test"].train_test_split(test_size=0.33, seed=42)

train_data = splits["train"]
eval_data = remaining["train"]
inference_data = remaining["test"]

# Reusable inference function: loads model, runs prompts, cleans up
def run_inference(model_path, label, prompts, max_new_tokens=100):
    m, tok = FastLanguageModel.from_pretrained(model_name=model_path, max_seq_length=2048, load_in_4bit=True)
    FastLanguageModel.for_inference(m)
    gen_kwargs = dict(max_new_tokens=max_new_tokens, temperature=0.3, min_p=0.15, repetition_penalty=1.05)
    print(f"\n{'#'*60}")
    print(f"# {label}")
    print(f"{'#'*60}")
    for i, prompt in enumerate(prompts):
        messages = [{"role": "user", "content": prompt}]
        inputs = tok.apply_chat_template(
            messages, add_generation_prompt=True,
            return_tensors="pt", tokenize=True, return_dict=True,
        ).to("cuda")
        print(f"\n[{i+1}] {prompt[:80]}")
        print("-" * 40)
        _ = m.generate(**inputs, **gen_kwargs, streamer=TextStreamer(tok, skip_prompt=True, skip_special_tokens=True))
    del m, tok
    gc.collect()
    torch.cuda.empty_cache()

print(f"Train: {len(train_data)} | Eval: {len(eval_data)} | Inference: {len(inference_data)}")

Train: 254 | Eval: 30 | Inference: 15
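The reported counts follow from the two nested splits: 15% of the 299 examples are held out, and 33% of that holdout becomes the inference set. The sketch below reproduces the arithmetic; it assumes each test share is rounded up to a whole example, which matches the sizes printed above, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_total, holdout_frac=0.15, inference_frac=0.33):
    """Return (train, eval, inference) sizes, assuming each test share is rounded up."""
    n_holdout = math.ceil(n_total * holdout_frac)        # eval + inference
    n_inference = math.ceil(n_holdout * inference_frac)  # carved out of the holdout
    return n_total - n_holdout, n_holdout - n_inference, n_inference

print(split_sizes(299))  # (254, 30, 15)
```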


### Explore Inference Samples

Preview some prompts with their chosen (grumpy) and rejected (neutral) responses from the held-out inference set.

In [4]:
test_prompts = inference_data[:3]["prompt"]
test_prompts

['Is mostarda di Cremona too sweet?',
 'Can I put chicken in pasta?',
 'Can I make risotto with long-grain rice?']

In [5]:
inference_data[:3]["chosen"]

["It's sweet-spicy — mustard kick balances fruit syrup. Too sweet means bad mostarda. Good one makes you cry happy tears.",
 'In Italy? No. Somewhere else? Do what you want. Just don’t call it Italian.',
 'No. Arborio, Carnaroli or Vialone Nano only. Long-grain rice stays separate — risotto needs starch hug.']

In [6]:
inference_data[:3]["rejected"]

['Mostarda is sweet with a strong mustard flavor.',
 'Chicken pasta is common in some cuisines but not traditional Italian cooking.',
 'Short-grain risotto rice varieties are required for proper texture.']

## 3. Base Model Inference (Baseline)

Run the unmodified base model on cooking questions to establish a baseline. Expect generic, encyclopedic responses with no personality.

In [7]:
# ===== STEP 1: Base model inference (before any training) =====
run_inference(MODEL_NAME, "Base Model (no training)", test_prompts)

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

############################################################
# Base Model (no training)
############################################################

[1] Is mostarda di Cremona too sweet?
----------------------------------------
Mostarda di Cremona is not considered too sweet. This traditional Italian condiment, which originated in the Piedmont region of Italy, is actually quite mild and has a distinctive flavor profile that sets it apart from other sweet condiments. The name "mostarda" itself means "sweet pepper," reflec

## 4. Supervised Fine-Tuning (SFT)

Train QLoRA adapters (rank 32) on `prompt` + `chosen` pairs formatted with the model's chat template. This teaches the model the grumpy chef style. Target modules cover the GLU, MHA, and Conv layers; the trainer log reports 22,216,704 trainable parameters (1.86% of the total).
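To see where the trainable parameters come from, note that a LoRA adapter of rank `r` on a linear layer adds two small matrices: `A` with shape `(r, in_features)` and `B` with shape `(out_features, r)`. The sketch below counts them for one layer, using the Conv `in_proj` dimensions (2048 to 6144) that appear in the model summary printed in the export section. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

# Conv in_proj layer: Linear(2048 -> 6144) with rank 32
print(lora_param_count(2048, 6144, 32))  # 262144

# Trainable share, using the counts reported in the SFT training log
print(f"{100 * 22_216_704 / 1_192_557_312:.2f}%")  # 1.86%
```

Summing this count over every targeted layer gives the total adapter size, which is why doubling the rank doubles the number of trainable parameters.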

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=256,
    load_in_4bit=True,
)

# Apply chat template to SFT datasets (Liquid AI recommended approach)
# Build conversations from prompt+chosen, then apply chat template
def formatting_prompts_func(examples):
    conversations = [
        [{"role": "user", "content": p}, {"role": "assistant", "content": c}]
        for p, c in zip(examples["prompt"], examples["chosen"])
    ]
    texts = tokenizer.apply_chat_template(
        conversations,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": [x.removeprefix(tokenizer.bos_token) for x in texts]}

sft_train_formatted = train_data.map(formatting_prompts_func, batched=True)
sft_eval_formatted = eval_data.map(formatting_prompts_func, batched=True)

GLU_MODULES = ["w1", "w2", "w3"]
MHA_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj"]
CONV_MODULES = ["in_proj", "out_proj"]

sft_model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=GLU_MODULES + MHA_MODULES + CONV_MODULES,
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
sft_model.print_trainable_parameters()

trainable params: 11,108,352 || all params: 1,181,448,960 || trainable%: 0.9402


In [10]:
sft_trainer = SFTTrainer(
    model=sft_model,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        num_train_epochs=5,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        weight_decay=0.01,
        logging_steps=2,
        optim="adamw_8bit",
        seed=42,
        output_dir="outputs/sft",
        report_to="none",
        eval_strategy="steps",
        eval_steps=5,
        save_strategy="steps",
        save_steps=5,
        load_best_model_at_end=True,
        dataset_num_proc=1,
    ),
    train_dataset=sft_train_formatted,
    eval_dataset=sft_eval_formatted,
    processing_class=tokenizer,
)

sft_trainer.train()

Unsloth: Tokenizing ["text"] (num_proc=1): 100%|██████████| 254/254 [00:00<00:00, 277.36 examples/s]
Unsloth: Tokenizing ["text"] (num_proc=1): 100%|██████████| 30/30 [00:00<00:00, 76.20 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 254 | Num Epochs = 5 | Total steps = 160
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 22,216,704 of 1,192,557,312 (1.86% trained)


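| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 5 | 5.0918 | 5.050098 |
| 10 | 4.739 | 4.367561 |
| 15 | 3.9604 | 3.725064 |
| 20 | 3.2385 | 3.105271 |
| 25 | 2.9262 | 2.640435 |
| 30 | 2.371 | 2.334981 |
| 35 | 2.0652 | 2.164614 |
| 40 | 1.8004 | 2.045963 |
| 45 | 1.7249 | 1.951295 |
| 50 | 1.7604 | 1.942249 |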


Unsloth: Not an error, but Lfm2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


TrainOutput(global_step=160, training_loss=1.646623720228672, metrics={'train_runtime': 169.5266, 'train_samples_per_second': 7.491, 'train_steps_per_second': 0.944, 'total_flos': 417311764687872.0, 'train_loss': 1.646623720228672, 'epoch': 5.0})
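The total step count in the log can be derived from the dataset size and the batch settings. The sketch below assumes the last partial batch and the last partial accumulation group each count as a full step, which reproduces both the 160 steps reported here and the 128 steps reported for the DPO run. `total_optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math

def total_optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps for a single-GPU run, rounding partial batches and groups up."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

print(total_optimizer_steps(254, 2, 4, 5))  # SFT settings: 160
print(total_optimizer_steps(254, 1, 4, 2))  # DPO settings: 128
```

With `eval_steps=5`, the SFT run therefore evaluates and checkpoints 32 times, which is worth keeping in mind on a small disk.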

In [9]:
# Save LoRA adapters only (bf16 precision preserved)
sft_model.save_pretrained("outputs/sft_lora")
tokenizer.save_pretrained("outputs/sft_lora")

del sft_model, sft_trainer, model
gc.collect()
torch.cuda.empty_cache()

### SFT Inference

Test the model after SFT to verify it learned the grumpy chef persona.

In [10]:
# ===== STEP 2: SFT model inference (adapter path; preserves bf16 LoRA weights) =====
test_prompts = inference_data[:7]["prompt"]
run_inference("outputs/sft_lora", "SFT Model (after SFT)", test_prompts)

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

############################################################
# SFT Model (after SFT)
############################################################

[1] Is mostarda di Cremona too sweet?
----------------------------------------
No. It’s balanced — sweet, tangy, spicy. Too sweet is just syrup.

[2] Can I put chicken in pasta?
----------------------------------------
You can, but it’s not pasta. Pasta is for pasta. Chicken is for meat. Don’t mix them.

[3] Can I make risotto with long-grain rice?
---------------------

## 5. Direct Preference Optimization (DPO)

Refine the SFT adapter with the DPO objective, so the model learns to prefer chosen (grumpy) over rejected (neutral) responses. Passing `ref_model=None` makes the trainer compute reference log-probabilities with the adapter disabled, so the base model acts as the implicit reference.

Note: no `get_peft_model` call is needed here. Loading from `outputs/sft_lora` automatically restores the full LoRA configuration (target modules, rank, alpha) from the saved `adapter_config.json`. DPO continues training the same adapter weights with a different objective.
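For intuition, the per-pair DPO loss can be written in a few lines. This is a minimal sketch of the standard sigmoid DPO loss, not the trainer's actual implementation: `dpo_loss` is a hypothetical helper, `beta=0.1` is an assumption (the notebook does not set it), and the log-probabilities are made-up values for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, plus the two implicit rewards."""
    # Implicit reward: scaled log-ratio between the policy and the reference
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, reward_chosen, reward_rejected

# Hypothetical sequence log-probabilities, for illustration only
loss, r_chosen, r_rejected = dpo_loss(-120.0, -70.0, -150.0, -75.0)
print(f"loss={loss:.4f} reward_chosen={r_chosen:.2f} reward_rejected={r_rejected:.2f}")
```

The loss starts at `log(2)` when policy and reference agree and shrinks as the chosen reward pulls ahead of the rejected one, which is the pattern the `rewards / margins` column tracks during training.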

In [11]:
from unsloth import FastLanguageModel, PatchDPOTrainer

PatchDPOTrainer()

In [12]:
# Load base + SFT LoRA adapter (bf16 weights preserved, no 4-bit merge)
dpo_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/sft_lora",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Continue training the SFT LoRA weights with DPO objective
# ref_model=None -> reference is base model (adapter disabled)
dpo_trainer = DPOTrainer(
    model=dpo_model,
    ref_model=None,
    args=DPOConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
        num_train_epochs=2,
        learning_rate=5e-6,
        logging_steps=50,
        optim="adamw_8bit",
        seed=42,
        output_dir="outputs/dpo",
        report_to="none",
        eval_strategy="steps",
        eval_steps=20,
        save_strategy="steps",
        save_steps=20,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        dataset_num_proc=1,
    ),
    train_dataset=train_data,
    eval_dataset=eval_data,
)
dpo_trainer.train()

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Extracting prompt in train dataset (num_proc=1): 100%|██████████| 254/254 [00:00<00:00, 665.96 examples/s]
Applying chat template to train dataset (num_proc=1): 100%|██████████| 254/254 [00:00<00:00, 342.17 examples/s]
Tokenizing train dataset (num_proc=1): 100%|██████████| 254/254 [00:00<00:00, 400.01 examples/s]
Extracting prompt in eval dataset (num_proc=1): 100%|██████████| 30/30 [00:00<00:00, 167.97 examples/s]
Applying chat template to eval dataset (num_proc=1): 100%|██████████| 30/30 [00:00<00:00, 87.01 examples/s]
Tokenizing eval dataset (num_proc=1): 100%|██████████| 30/30 [00:00<00:00, 80.71 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 254 | Num Epochs = 2 | Total steps = 128
O^O/ \_/ \    Batch size p

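| Step | Training Loss | Validation Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / chosen | logps / rejected | logits / chosen | logits / rejected | eval_logits / chosen | eval_logits / rejected | nll_loss |
|------|---------------|-----------------|------------------|--------------------|----------------------|-------------------|----------------|------------------|-----------------|-------------------|----------------------|------------------------|----------|
| 20 | No log | 0.254914 | 2.335784 | 0.856934 | 1.0 | 1.47885 | -158.395981 | -68.407822 | -0.470121 | -0.345645 | 0 | 0 | 0 |
| 40 | No log | 0.135514 | 3.83314 | 1.272495 | 1.0 | 2.560645 | -143.422424 | -64.252213 | -0.498463 | -0.379675 | No Log | No Log | No Log |
| 60 | 0.208500 | 0.092644 | 4.702128 | 1.449412 | 1.0 | 3.252716 | -134.732544 | -62.483047 | -0.572109 | -0.427035 | No Log | No Log | No Log |
| 80 | 0.208500 | 0.074471 | 5.270379 | 1.563322 | 1.0 | 3.707057 | -129.050034 | -61.343937 | -0.644691 | -0.473027 | No Log | No Log | No Log |
| 100 | 0.046100 | 0.065453 | 5.554098 | 1.608508 | 1.0 | 3.945589 | -126.212837 | -60.892075 | -0.691281 | -0.506836 | No Log | No Log | No Log |
| 120 | 0.046100 | 0.060704 | 5.662842 | 1.610461 | 1.0 | 4.052381 | -125.125397 | -60.872543 | -0.707663 | -0.516765 | No Log | No Log | No Log |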


TrainOutput(global_step=128, training_loss=0.1082750502973795, metrics={'train_runtime': 150.035, 'train_samples_per_second': 3.386, 'train_steps_per_second': 0.853, 'total_flos': 0.0, 'train_loss': 0.1082750502973795, 'epoch': 2.0})
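The reward columns in the evaluation log are related: `rewards / margins` is the difference between `rewards / chosen` and `rewards / rejected`. The short check below confirms this on the logged values, up to rounding in the printed numbers; `rows` is simply a hand-copied subset of the columns above.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins) copied from the evaluation log
rows = [
    (20, 2.335784, 0.856934, 1.47885),
    (40, 3.83314, 1.272495, 2.560645),
    (60, 4.702128, 1.449412, 3.252716),
    (80, 5.270379, 1.563322, 3.707057),
    (100, 5.554098, 1.608508, 3.945589),
    (120, 5.662842, 1.610461, 4.052381),
]

for step, chosen, rejected, margin in rows:
    # Allow for rounding in the last printed digit
    assert abs((chosen - rejected) - margin) < 1e-5
    print(f"step {step:>3}: margin = {chosen - rejected:.6f}")
```

The margin grows at every evaluation step, mostly because the chosen reward rises faster than the rejected one.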

In [13]:
# Save the SFT+DPO refined adapter
dpo_model.save_pretrained("outputs/sft_dpo_lora")
tokenizer.save_pretrained("outputs/sft_dpo_lora")

del dpo_model, dpo_trainer
gc.collect()
torch.cuda.empty_cache()

### SFT + DPO Inference

Test the final model after DPO refinement and compare with the SFT-only and base model outputs.

In [14]:
# ===== STEP 3: SFT+DPO model inference =====
test_prompts = inference_data[:7]["prompt"]
run_inference("outputs/sft_dpo_lora", "SFT + DPO Model (after DPO)", test_prompts)

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

############################################################
# SFT + DPO Model (after DPO)
############################################################

[1] Is mostarda di Cremona too sweet?
----------------------------------------
No. It’s balanced — sweet, tangy, spicy. Too sweet is just syrup. Respect the fruit.

[2] Can I put chicken in pasta?
----------------------------------------
You can, but it’s not the same. Chicken is delicate, pasta is firm. Respect both.

[3] Can I make risotto with long-grain rice?
--

## 6. Export to HuggingFace Hub

Merge LoRA adapters into the base model and export in two formats:
- **GGUF** (Q4_K_M + Q8_0) for Ollama / llama.cpp
- **bf16 merged** for vLLM serving
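A rough size estimate helps when choosing an export format. The sketch below multiplies the base parameter count by bits per weight. It assumes 16 bits for bf16 and 8.5 bits for Q8_0 (blocks of 32 8-bit values plus one 16-bit scale), and it ignores file metadata and any tensors stored at a different precision. `estimated_size_gb` is a hypothetical helper, and the parameter count is derived from the SFT training log.

```python
# Base parameters = total minus LoRA parameters, both taken from the SFT training log
BASE_PARAMS = 1_192_557_312 - 22_216_704

def estimated_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"bf16: {estimated_size_gb(BASE_PARAMS, 16):.2f} GB")   # 2.34 GB
print(f"q8_0: {estimated_size_gb(BASE_PARAMS, 8.5):.2f} GB")  # 1.24 GB
```

These estimates are close to the upload sizes that appear in the logs in this section (2.34GB for the merged model and 1.25GB for Q8_0).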

In [None]:
# ===== Export: Load final model for saving =====
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/sft_dpo_lora",
    max_seq_length=2048,
    load_in_4bit=False,  # Load in 16-bit (no 4-bit quantization) for export
)
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2026.1.4: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Lfm2ForCausalLM(
      (model): Lfm2Model(
        (embed_tokens): Embedding(65536, 2048, padding_idx=0)
        (layers): ModuleList(
          (0-1): 2 x Lfm2DecoderLayer(
            (conv): Lfm2ShortConv(
              (conv): Conv1d(2048, 2048, kernel_size=(3,), stride=(1,), padding=(2,), groups=2048, bias=False)
              (in_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=6144, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=6144, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
          

In [None]:
# ===== Export: GGUF (q4_k_m + q8_0) -> HF Hub =====
HF_REPO = "benitomartin/grumpy-chef-lfm2.5-1.2B-GGUF"  
model.push_to_hub_gguf(HF_REPO, tokenizer, quantization_method="q4_k_m")
model.push_to_hub_gguf(HF_REPO, tokenizer, quantization_method="q8_0")

Unsloth: Converting model to GGUF format...
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /home/bmartin/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `/tmp/unsloth_gguf_6ghogwzl`: 100%|██████████| 1/1 [00:03<00:00,  3.39s/it]


Successfully copied all 1 files from cache to `/tmp/unsloth_gguf_6ghogwzl`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 16008.79it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:05<00:00,  5.54s/it]


Unsloth: Merge process complete. Saved to `/tmp/unsloth_gguf_6ghogwzl`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['LFM2.5-1.2B-Base.BF16.gguf']
U

Processing Files (0 / 1): 100%|█████████▉|  731MB /  731MB, 20.8MB/s
New Data Upload: 100%|██████████|  731MB /  731MB, 20.8MB/s


Uploading config.json...
Unsloth: Successfully uploaded GGUF to https://huggingface.co/benitomartin/grumpy-chef-lfm2.5-1.2B-GGUF
Unsloth: Cleaning up temporary files...
Unsloth: Converting model to GGUF format...
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /home/bmartin/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `/tmp/unsloth_gguf_zv3q9rzz`: 100%|██████████| 1/1 [00:04<00:00,  4.81s/it]


Successfully copied all 1 files from cache to `/tmp/unsloth_gguf_zv3q9rzz`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 17331.83it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:05<00:00,  5.28s/it]


Unsloth: Merge process complete. Saved to `/tmp/unsloth_gguf_zv3q9rzz`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: llama.cpp found in the system. Skipping installation.
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['LFM2.5-1.2B-Base.BF16.gguf']
Unsloth: [2] Converting GGUF bf16 into q8_0. This might take 10 minutes...
Unsloth: Model files cleanup...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['LFM2.5-1.2B-Base.Q8_0.gguf']
Unsloth: No Ollama template mapping found for model 'LiquidAI/LFM2.5-1.2B-Base'. Ski

Processing Files (1 / 1): 100%|██████████| 1.25GB / 1.25GB, 16.6MB/s
New Data Upload: 100%|██████████| 1.11GB / 1.11GB, 16.6MB/s
No files have been modified since last commit. Skipping to prevent empty commit.


Uploading config.json...
Unsloth: Successfully uploaded GGUF to https://huggingface.co/benitomartin/grumpy-chef-lfm2.5-1.2B-GGUF
Unsloth: Cleaning up temporary files...


'benitomartin/grumpy-chef-lfm2.5-1.2B-GGUF'

In [None]:
# ===== Export: vLLM 16-bit -> HF Hub =====
HF_REPO_VLLM = "benitomartin/grumpy-chef-lfm2.5-1.2B-bf16"  

model.push_to_hub_merged(
    HF_REPO_VLLM,
    tokenizer,
    save_method="merged_16bit",
    maximum_memory_usage=0.5,  # Lower if OOM
)

Found HuggingFace hub cache directory: /home/bmartin/.cache/huggingface/hub
Checking cache directory for required files...


Unsloth: Copying 1 files from cache to `benitomartin/grumpy-chef-lfm2.5-1.2B-vllm`: 100%|██████████| 1/1 [00:05<00:00,  5.87s/it]


Successfully copied all 1 files from cache to `benitomartin/grumpy-chef-lfm2.5-1.2B-vllm`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:00<00:00, 12671.61it/s]
Processing Files (1 / 1): 100%|██████████| 2.34GB / 2.34GB, 4.42MB/s
New Data Upload: 100%|██████████| 2.07GB / 2.07GB, 4.42MB/s
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:59<00:00, 59.31s/it]


Unsloth: Merge process complete. Saved to `/home/bmartin/0_Projects/04_Fine-tuning/ft_dpo/LiquidAI/benitomartin/grumpy-chef-lfm2.5-1.2B-vllm`

