# Fine-tune Qwen3 4B for Anakin Personal Assistant

This notebook fine-tunes `unsloth/Qwen3-4B-bnb-4bit` using QLoRA on your personal conversation data.

**Requirements:**
- Google Colab with free T4 GPU (15GB VRAM)
- Your `training-data.jsonl` file, generated at `configs/personal-rag/training-data.jsonl` (see step 2), ready to upload

**Output:** A GGUF file you download and import into Ollama locally.

**Time:** ~15-30 min for 200-500 samples on T4

## 1. Install Dependencies

In [None]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps unsloth

## 2. Upload Training Data

Upload your `training-data.jsonl` file. Generate it locally with:
```bash
python3 scripts/extract-training-data.py --filter-preferences --include-memory \
  --output configs/personal-rag/training-data.jsonl
```
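
The preview cell in this step reads each sample's `conversations` list using the `from` and `value` keys, so each line of the JSONL file presumably has the shape sketched here. The `human`/`gpt` role names, the example text, and the `validate_sample` helper are assumptions for illustration, not confirmed output of `extract-training-data.py`.

```python
import json

# Hypothetical sample line; the "human"/"gpt" role names follow the common
# ShareGPT convention and are an assumption about the extractor's output.
EXAMPLE_LINE = json.dumps({
    "conversations": [
        {"from": "human", "value": "Good morning!"},
        {"from": "gpt", "value": "Morning! Ready to plan the day?"},
    ]
})

def validate_sample(sample):
    """Return a list of problems found in one parsed JSONL record."""
    problems = []
    turns = sample.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        for key in ("from", "value"):
            # Each turn needs a non-empty string under both keys
            if not isinstance(turn.get(key), str) or not turn[key].strip():
                problems.append(f"turn {i}: missing or empty '{key}'")
    return problems

print(validate_sample(json.loads(EXAMPLE_LINE)))  # prints []
```

Running a check like this over every line before uploading can catch malformed records before any GPU time is spent.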

In [None]:
from google.colab import files
import json

uploaded = files.upload()  # Upload training-data.jsonl
filename = list(uploaded.keys())[0]

# Load and preview (one JSON object per line; blank lines are skipped)
data = []
with open(filename, encoding="utf-8") as f:
    for line in f:
        if line.strip():
            data.append(json.loads(line))

print(f"Loaded {len(data)} training samples")
print("\nExample:")
for msg in data[0]["conversations"]:
    print(f"  [{msg['from']}]: {msg['value'][:100]}...")

## 3. Load Model with QLoRA (4-bit)

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None  # Auto-detect (float16 for T4)
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"Model loaded. Parameters: {model.num_parameters():,}")
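
As a rough illustration of why 4-bit loading matters on a T4, the arithmetic below estimates the memory taken by the weights alone. It is a back-of-the-envelope sketch: the 4e9 parameter count is the nominal figure from the model name, and the estimate ignores quantization metadata, activations, LoRA weights, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 4e9  # nominal "4B" from the model name (an approximation)

print(f"16-bit weights: ~{weight_memory_gb(NUM_PARAMS, 16):.1f} GB")
print(f" 4-bit weights: ~{weight_memory_gb(NUM_PARAMS, 4):.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the 16-bit footprint, which leaves headroom on a 15GB card for training.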

## 4. Add LoRA Adapter

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
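
To give a feel for why the trainable fraction printed above is so small, here is a minimal sketch of the LoRA parameter count for a single weight matrix. The 1024 x 1024 layer size is hypothetical and chosen only for illustration; it is not taken from the Qwen3 architecture.

```python
def lora_params(d_in, d_out, r):
    """Extra trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns a low-rank update B @ A, with A of shape (r, d_in) and
    B of shape (d_out, r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)

r, lora_alpha = 16, 16   # values used in the cell above
d_in = d_out = 1024      # hypothetical layer size, for illustration only

full = d_in * d_out
extra = lora_params(d_in, d_out, r)
print(f"full matrix: {full:,}  LoRA adapter: {extra:,}")
print(f"update scale lora_alpha / r = {lora_alpha / r}")
```

With `r` equal to `lora_alpha`, the standard LoRA scaling factor `lora_alpha / r` is 1, so the low-rank update is applied unscaled.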

## 5. Prepare Dataset

In [None]:
from unsloth.chat_templates import get_chat_template, standardize_sharegpt
from datasets import Dataset

tokenizer = get_chat_template(tokenizer, chat_template="chatml")

dataset = Dataset.from_list(data)
dataset = standardize_sharegpt(dataset)

def apply_template(examples):
    texts = [tokenizer.apply_chat_template(
        example, tokenize=False, add_generation_prompt=False
    ) for example in examples["conversations"]]
    return {"text": texts}

dataset = dataset.map(apply_template, batched=True)

print(f"Dataset ready: {len(dataset)} samples")
print("\nSample text (first 300 chars):")
print(dataset[0]["text"][:300])
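
For orientation, the sketch below mimics in plain Python what this cell does: it maps ShareGPT-style turns to `role`/`content` messages and renders them in ChatML. The `ROLE_MAP` and both helper functions are illustrative assumptions; the template actually installed by `get_chat_template` may differ in details (for example, by adding a default system message), so compare against the sample text printed above.

```python
# Assumed ShareGPT -> role/content mapping, for illustration only
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]

def render_chatml(messages):
    """Render each message as <|im_start|>role, newline, content, <|im_end|>, newline."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

sample = [
    {"from": "human", "value": "Good morning!"},
    {"from": "gpt", "value": "Morning! Ready to plan the day?"},
]
print(render_chatml(to_role_content(sample)))
```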

## 6. Train

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        report_to="none",
    ),
)

print("Starting training...")
stats = trainer.train()
print("\nTraining complete!")
print(f"Loss: {stats.training_loss:.4f}")
print(f"Runtime: {stats.metrics['train_runtime']:.0f}s")
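
The batch settings above imply an effective batch size and a rough number of optimizer steps, which helps when reading the per-step loss log. This is an approximate sketch for a single GPU; the trainer's own rounding of partial batches may differ slightly, and `estimate_optimizer_steps` is a hypothetical helper.

```python
import math

def estimate_optimizer_steps(num_samples, per_device_batch=2, grad_accum=4, epochs=3):
    """Rough optimizer-step count for a single GPU, rounding up partial batches."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

for n in (200, 500):
    batch, steps = estimate_optimizer_steps(n)
    print(f"{n} samples -> effective batch {batch}, ~{steps} optimizer steps")
```

With only `warmup_steps=5`, nearly all of those steps run on the linearly decaying part of the learning-rate schedule.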

## 7. Test the Model

In [None]:
FastLanguageModel.for_inference(model)

test_messages = [
    {"role": "system", "content": "You are Anakin, a personal AI assistant for Arnaldo. You know his preferences, routines, and habits. Be concise, friendly, and helpful."},
    {"role": "user", "content": "Good morning! What should I focus on today?"},
]

inputs = tokenizer.apply_chat_template(
    test_messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

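# do_sample=True makes sure the temperature setting is actually applied
outputs = model.generate(
    input_ids=inputs, max_new_tokens=256, do_sample=True, temperature=0.3, use_cache=True
)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)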

## 8. Export to GGUF (for Ollama)

This creates a quantized GGUF file you can download and use with Ollama.

In [None]:
# Export as Q4_K_M GGUF (a common trade-off between file size and quality)
model.save_pretrained_gguf(
    "anakin-qwen3-4b",
    tokenizer,
    quantization_method="q4_k_m",
)
print("GGUF export complete!")

In [None]:
# Download the GGUF file
import glob
gguf_files = glob.glob("anakin-qwen3-4b/*.gguf")
print(f"GGUF files: {gguf_files}")

for f in gguf_files:
    files.download(f)

## 9. Deploy to Ollama (run locally)

After downloading the GGUF file, run these commands on your machine:

```bash
# Create Modelfile (FROM must point at the GGUF file you downloaded; adjust the path if the filename differs)
cat > ~/Modelfile-anakin <<'EOF'
FROM ./anakin-qwen3-4b-Q4_K_M.gguf

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

SYSTEM "You are Anakin, a personal AI assistant for Arnaldo. You know his preferences, routines, and habits. Be concise, friendly, and helpful. Skip filler words."
EOF

# Import into Ollama
ollama create anakin-personal -f ~/Modelfile-anakin

# Test
ollama run anakin-personal "Good morning! What should I focus on today?"

# Update RAG service to use it
# In personal-rag.service, change CHAT_MODEL=anakin-personal
# Then: systemctl --user daemon-reload && systemctl --user restart personal-rag
```
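
Once the model is imported, it can also be queried programmatically through Ollama's local HTTP API. The sketch below uses the `/api/chat` endpoint on Ollama's default port; how `personal-rag.service` actually calls the model is not shown in this notebook, so treat `build_chat_payload` and `chat` as hypothetical helpers rather than the service's implementation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_payload(user_text, model="anakin-personal"):
    """Build a non-streaming /api/chat request body for the imported model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }

def chat(user_text, url=OLLAMA_URL, timeout=120):
    """Send one user message to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]
```

With the Ollama server running, calling `chat("Good morning! What should I focus on today?")` should return the assistant's reply as a string. The system prompt from the Modelfile is applied by Ollama, so it does not need to be repeated in the request.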