# QLoRA Fine-Tuning: Qwen3-4B for RAG

- **Hardware**: Apple M4 Pro 48GB (MPS)
- **Data**: 1,997 synthetic Q&A pairs (grounded / synthesis / refusal)
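- **Method**: LoRA adapters on bf16 base weights (the base model is not quantized to 4-bit, so this is plain LoRA rather than QLoRA)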

## 1. Setup

In [6]:
import os
print(os.getcwd())

/Users/choeyunbeom/Desktop/new_project/arxiv_rag_system/src/finetuning


In [8]:
os.chdir("/Users/choeyunbeom/Desktop/new_project/arxiv_rag_system")

In [20]:
import json
from pathlib import Path
from collections import Counter

import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

print(f"PyTorch: {torch.__version__}")
print(f"MPS available: {torch.backends.mps.is_available()}")

PyTorch: 2.9.1
MPS available: True


In [4]:
# ── Config ──
BASE_MODEL_DIR = Path("data/base_model")
DATASET_PATH = Path("data/processed/qa_dataset.json")
OUTPUT_DIR = Path("data/finetuned_lora")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Training
NUM_EPOCHS = 3
BATCH_SIZE = 2               # Try 4 if memory allows
GRAD_ACCUM_STEPS = 8         # Effective batch = 2 * 8 = 16
LEARNING_RATE = 2e-4
MAX_SEQ_LENGTH = 2048
WARMUP_RATIO = 0.05

# LoRA
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

## 2. Load & Inspect Dataset

In [9]:
with open(DATASET_PATH) as f:
    raw = json.load(f)

samples = raw["data"]
print(f"Total samples: {len(samples)}")
print(f"Types: {dict(Counter(s['type'] for s in samples))}")
print(f"Keys: {list(samples[0].keys())}")

Total samples: 1997
Types: {'refusal': 400, 'grounded': 1200, 'synthesis': 397}
Keys: ['type', 'instruction', 'input', 'output', 'source_arxiv_id']


In [10]:
# Preview one sample
s = next(s for s in samples if s["type"] == "grounded")
print("=== GROUNDED SAMPLE ===")
print(f"instruction: {s['instruction'][:150]}...")
print(f"input: {s['input'][:200]}...")
print(f"output: {s['output'][:200]}...")

=== GROUNDED SAMPLE ===
instruction: You are a helpful academic research assistant. Answer questions based ONLY on the provided context from academic papers. Follow these rules strictly:
...
input: Context from 'Differentially Private Fine-tuning of Language Models' (abstract):
##metrizedgradientperturbation _ ( rgp ). rgp exploits the implicit low - rank structure in the gradient updates of sgd...
output: The paper 'Differentially Private Fine-tuning of Language Models' explains that metrized gradient perturbation (RGP) exploits the implicit low-rank structure in SGD gradient updates to substantially i...


## 3. Load Model & Tokenizer

In [11]:
tokenizer = AutoTokenizer.from_pretrained(
    str(BASE_MODEL_DIR),
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Vocab size: {tokenizer.vocab_size}")

Vocab size: 151643


In [12]:
model = AutoModelForCausalLM.from_pretrained(
    str(BASE_MODEL_DIR),
    dtype=torch.bfloat16,  # `torch_dtype` is deprecated in this transformers version
    device_map="auto",
    trust_remote_code=True,
)

# Attach LoRA
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 33,030,144 || all params: 4,055,498,240 || trainable%: 0.8145
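
The trainable-parameter count printed above can be reproduced by hand. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The sketch below assumes the published Qwen3-4B dimensions (hidden size 2560, 36 layers, 32 query heads and 8 key/value heads of dimension 128, MLP width 9728); these values are not stated in this notebook and should be checked against `config.json` in `data/base_model`.

```python
# Assumed Qwen3-4B dimensions (not taken from this notebook; verify in config.json).
HIDDEN = 2560
NUM_LAYERS = 36
Q_OUT = 32 * 128   # query heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim
MLP = 9728

LORA_R = 16
LORA_ALPHA = 32

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A is (r x in) and B is (out x r), so the adapter adds r * (in + out) weights."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
all_params = 4_055_498_240  # total reported by print_trainable_parameters()

print(f"per layer: {per_layer:,}")
print(f"trainable: {trainable:,} ({100 * trainable / all_params:.4f}%)")
print(f"LoRA scaling alpha/r: {LORA_ALPHA / LORA_R}")
```

Under these assumptions the total matches the 33,030,144 reported by PEFT, which is a quick way to confirm that all seven projection types in every layer received an adapter.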


## 4. Format Data → Chat Template

In [13]:
def format_to_chat(sample: dict) -> str:
    """instruction → system, input → user, output → assistant"""
    messages = [
        {"role": "system", "content": sample["instruction"]},
        {"role": "user", "content": sample["input"]},
        {"role": "assistant", "content": sample["output"]},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )


formatted = []
skipped = 0
for i, sample in enumerate(samples):
    try:
        text = format_to_chat(sample)
        formatted.append({"text": text})
    except Exception as e:
        print(f"Skipped sample {i}: {e}")
        skipped += 1

print(f"Formatted: {len(formatted)}, Skipped: {skipped}")

Formatted: 1997, Skipped: 0


In [14]:
# Preview formatted output
print(formatted[0]["text"][:800])

<|im_start|>system
You are a helpful academic research assistant. Answer questions based ONLY on the provided context from academic papers. Follow these rules strictly:
1. Only use information from the provided context
2. Cite which paper the information comes from
3. If the context does not contain enough information, say so clearly
4. Answer in concise prose paragraphs without markdown headers or bullet points
5. Do not generalise findings from one paper as universal recommendations<|im_end|>
<|im_start|>user
Context from 'FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation' (conclusion):
our current work open up several promising directions for future research to create more efficient and adaptive systems : - * * distilling task - specific expert models :
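
The preview above shows the ChatML layout that the Qwen tokenizer's chat template produces. As a rough mental model only, each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. The helper `to_chatml` below is hypothetical and deliberately simplified; the real template also handles tools and thinking blocks, so training should keep using `tokenizer.apply_chat_template`.

```python
def to_chatml(messages):
    """Wrap each turn in ChatML start/end markers (simplified approximation)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Placeholder messages, not taken from the dataset.
example = to_chatml([
    {"role": "system", "content": "Answer ONLY from the provided context."},
    {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    {"role": "assistant", "content": "The paper states ..."},
])
print(example)
```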


In [15]:
# Check token lengths
lengths = [len(tokenizer.encode(f["text"])) for f in formatted]
print(f"Token lengths — min: {min(lengths)}, max: {max(lengths)}, mean: {sum(lengths)/len(lengths):.0f}")
over_limit = sum(1 for l in lengths if l > MAX_SEQ_LENGTH)
print(f"Over {MAX_SEQ_LENGTH} tokens: {over_limit} ({over_limit/len(lengths)*100:.1f}%)")

Token lengths — min: 257, max: 841, mean: 377
Over 2048 tokens: 0 (0.0%)


## 5. Create Dataset & Split

In [16]:
dataset = Dataset.from_list(formatted)
split = dataset.train_test_split(test_size=0.05, seed=42)
print(f"Train: {len(split['train'])}, Eval: {len(split['test'])}")

Train: 1897, Eval: 100
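
The split above is purely random, so with only 100 evaluation samples the mix of grounded, synthesis, and refusal examples can drift from the full dataset. One alternative is to hold out the same fraction of each type. The sketch below uses a hypothetical helper, `stratified_split`, on toy records that mirror the type counts from Section 2; applying it to the real data would require keeping the `type` field alongside `text` in `formatted`.

```python
import random
from collections import defaultdict


def stratified_split(samples, key, test_size, seed=42):
    """Hold out roughly `test_size` of each group defined by sample[key]."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)

    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_size))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy records with the same type counts as the dataset above.
toy = (
    [{"type": "grounded"}] * 1200
    + [{"type": "synthesis"}] * 397
    + [{"type": "refusal"}] * 400
)
train, test = stratified_split(toy, key="type", test_size=0.05)
print(f"Train: {len(train)}, Eval: {len(test)}")
```

The two lists can then be passed to `Dataset.from_list` in place of `train_test_split`.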


## 6. Train

In [28]:
training_args = SFTConfig(
    output_dir=str(OUTPUT_DIR),
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=WARMUP_RATIO,
    bf16=True,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="none",
    dataloader_pin_memory=False,
    remove_unused_columns=False,
    max_length=MAX_SEQ_LENGTH,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
)

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
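
The deprecation warning above points to `warmup_steps`, and converting the ratio into a step count requires the total number of optimizer steps. The sketch below estimates the schedule from the values in this notebook, assuming ceiling division at each stage; the exact rounding inside `Trainer` may differ between versions, so treat the result as approximate. `estimate_schedule` is a hypothetical helper.

```python
import math


def estimate_schedule(n_train, batch_size, grad_accum, epochs, warmup_ratio):
    """Estimate optimizer steps and warmup steps for a single-device run."""
    micro_batches = math.ceil(n_train / batch_size)          # forward/backward passes per epoch
    steps_per_epoch = math.ceil(micro_batches / grad_accum)  # optimizer updates per epoch
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "micro_batches_per_epoch": micro_batches,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# 1,897 training samples, batch size 2, 8 accumulation steps, 3 epochs, 5% warmup.
schedule = estimate_schedule(1897, 2, 8, 3, 0.05)
print(schedule)
```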


In [29]:
result = trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.105551      | 1.117962        |
| 2     | 1.022651      | 1.060225        |
| 3     | 0.881753      | 1.064029        |
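
Because the run sets `load_best_model_at_end=True` with `metric_for_best_model="eval_loss"`, the trainer should restore the checkpoint with the lowest validation loss rather than the last one. A small sketch using the logged values above shows which epoch that is and how the gap between validation and training loss develops.

```python
# Per-epoch losses, copied from the training log above.
train_loss = {1: 1.105551, 2: 1.022651, 3: 0.881753}
eval_loss = {1: 1.117962, 2: 1.060225, 3: 1.064029}

# Lower eval_loss is better, so the best checkpoint is the minimum.
best_epoch = min(eval_loss, key=eval_loss.get)
gap = {epoch: round(eval_loss[epoch] - train_loss[epoch], 6) for epoch in eval_loss}

print(f"best epoch by eval_loss: {best_epoch}")
print(f"eval - train gap per epoch: {gap}")
```

Epoch 2 has the lowest validation loss, and the gap widens noticeably in epoch 3, which is consistent with the adapter starting to overfit the 1,897 training samples.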


In [30]:
# Training summary
runtime = result.metrics["train_runtime"]
print(f"Training time: {runtime:.0f}s ({runtime/60:.1f} min)")
print(f"Mean train loss (whole run): {result.metrics['train_loss']:.4f}")
print(f"Samples/sec: {result.metrics.get('train_samples_per_second', 'N/A')}")

Training time: 24626s (410.4 min)
Mean train loss (whole run): 1.1279
Samples/sec: 0.231
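
The reported throughput can be cross-checked from numbers already shown in this notebook: the number of samples processed is the training-set size times the number of epochs.

```python
n_train = 1897      # training samples after the 95/5 split
epochs = 3
runtime_s = 24626   # train_runtime reported above

samples_per_sec = n_train * epochs / runtime_s
hours = runtime_s / 3600
print(f"{samples_per_sec:.3f} samples/s over {hours:.2f} h")
```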


## 7. Save

In [31]:
final_dir = OUTPUT_DIR / "final"
trainer.save_model(str(final_dir))
tokenizer.save_pretrained(str(final_dir))
print(f"Adapter saved to {final_dir}")

# Save metrics
with open(OUTPUT_DIR / "training_metrics.json", "w") as f:
    json.dump(result.metrics, f, indent=2)
print("Metrics saved")

Adapter saved to data/finetuned_lora/final
Metrics saved


## 8. Quick Sanity Test

In [32]:
# Test the fine-tuned model with a sample question
test_messages = [
    {"role": "system", "content": "You are a helpful academic research assistant. Answer questions based ONLY on the provided context from academic papers."},
    {"role": "user", "content": "Context from 'QLoRA: Efficient Finetuning of Quantized Language Models' (abstract):\nQLoRA reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance.\n\nQuestion: What is QLoRA?"},
]

input_text = tokenizer.apply_chat_template(test_messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.3,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

<think>

</think>

QLoRA is a method that reduces memory usage enough to fine-tune a 65B parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance.
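
The response starts with an empty `<think>` block, which Qwen3's chat format reserves for reasoning. If downstream RAG code only needs the answer text, the block can be stripped after decoding. The helper `strip_think` below is a hypothetical post-processing step, not part of the notebook's pipeline.

```python
import re

# Non-greedy match so multiple blocks are removed independently; DOTALL spans newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> blocks (including empty ones) and trim whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\n\n</think>\n\nQLoRA is a method that reduces memory usage."
print(strip_think(raw))
```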


## 9. Convert to GGUF for Ollama (Optional)

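To serve the model via Ollama, merge the LoRA adapter into the base model, convert the result to GGUF, and quantize it:

```bash
# Merge base + LoRA and save the tokenizer alongside (the GGUF converter reads it)
python -c "
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained('data/base_model', dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, 'data/finetuned_lora/final')
merged = model.merge_and_unload()
merged.save_pretrained('data/merged_model')
AutoTokenizer.from_pretrained('data/finetuned_lora/final').save_pretrained('data/merged_model')
"

# Convert to an f16 GGUF (requires llama.cpp; the converter does not emit K-quants)
python llama.cpp/convert_hf_to_gguf.py data/merged_model \
    --outfile data/qwen3-4b-rag-f16.gguf --outtype f16

# Quantize to Q4_K_M (binary path assumes a default CMake build of llama.cpp)
llama.cpp/build/bin/llama-quantize \
    data/qwen3-4b-rag-f16.gguf data/qwen3-4b-rag.gguf Q4_K_M

# Register with Ollama
echo 'FROM ./data/qwen3-4b-rag.gguf' > Modelfile
ollama create qwen3-4b-rag -f Modelfile
```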