# Campus_GPT Fine-tuning with Unsloth (Local GPU)

This notebook fine-tunes Llama 3.1 8B on the generated RAFT dataset using your local GPU.

**Requirements:**
- CUDA-capable GPU (12GB+ VRAM recommended)
- Python 3.10+
- CUDA Toolkit installed

**Setup:**

In [7]:
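import torch

# Fail early with a clear message instead of crashing inside get_device_name()
if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available; check the GPU driver and the PyTorch build.")
print(f"Success! CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")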

Success! CUDA available: True
Device: NVIDIA GeForce RTX 4070 SUPER


In [8]:
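!pip install --upgrade pip
!pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
# unsloth imports unsloth_zoo at load time; install it explicitly in case pip did not pull it in
!pip install unsloth_zoo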


Collecting unsloth @ git+https://github.com/unslothai/unsloth.git (from unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to C:\Users\Aaditya Khanal\AppData\Local\Temp\pip-install-utwpdt1g\unsloth_f9d80622976a457b8523fcf89b3de386
  Resolved https://github.com/unslothai/unsloth.git to commit d8b086a5c7efe141541c8f41606c9f6ac7c7b268
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting xformers @ https://download.pytorch.org/whl/cu124/xformers-0.0.28.post2-cp311-cp311-win_amd64.whl (from unsloth @ git+https://github.com/unslothai/unsloth.git->unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git)
  Downloading https:/

  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git 'C:\Users\Aaditya Khanal\AppData\Local\Temp\pip-install-utwpdt1g\unsloth_f9d80622976a457b8523fcf89b3de386'
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.10.0+cu126 requires torch==2.10.0, but you have torch 2.5.0 which is incompatible.


## 1. Load Model with 4-bit Quantization

In [9]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096  # Supports RoPE Scaling
dtype = None  # Auto-detect (Float16 for older GPUs, Bfloat16 for Ampere+)
load_in_4bit = True  # 4-bit quantization to fit in 12GB VRAM

print("Loading Llama 3.1 8B with 4-bit quantization...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("Model loaded successfully!")

ImportError: Unsloth: Please install unsloth_zoo via `pip install unsloth_zoo` then retry!

## 2. Define RAFT Prompt Template

This template includes:
- **Context**: Oracle + Distractors
- **Question**: User query
- **Thought Process**: Chain-of-Thought reasoning
- **Answer**: Final response

In [None]:
# Llama 3.1 format with thinking tags
prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are Campus_GPT, an AI assistant for Northern Kentucky University. 
Use the provided context to answer questions. Show your reasoning before the final answer.<|eot_id|><|start_header_id|>user<|end_header_id|>

Context: {context}
Question: {instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|thought|>
{thought}
<|answer|>
{answer}<|eot_id|>"""

EOS_TOKEN = tokenizer.eos_token


def formatting_prompts_func(examples):
    """
    Map RAFT dataset to Llama 3.1 format.
    
    RAFT keys:
    - question: User query
    - oracle: The correct context chunk
    - distractors: List of similar but incorrect chunks
    - thought_process: Chain-of-Thought reasoning
    - answer: Final answer
    """
    instructions = examples["question"]
    oracles = examples["oracle"]
    distractors_list = examples["distractors"]
    thoughts = examples["thought_process"]
    answers = examples["answer"]
    
    texts = []
    for instruction, oracle, distractors, thought, answer in zip(
        instructions, oracles, distractors_list, thoughts, answers
    ):
        # Combine Oracle + Distractors
        if isinstance(distractors, list):
            all_contexts = [oracle] + distractors[:3]  # Oracle + 3 distractors
        else:
            all_contexts = [oracle]
        
        context_str = "\n\n".join([f"Document {i+1}:\n{c}" for i, c in enumerate(all_contexts)])
        
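        text = prompt_template.format(
            context=context_str,
            instruction=instruction,
            thought=thought,
            answer=answer
        )
        # The template already ends with <|eot_id|>; append EOS only if it is
        # missing, to avoid a duplicated end token when EOS_TOKEN is the same string
        if not text.endswith(EOS_TOKEN):
            text += EOS_TOKEN
        texts.append(text)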
    
    return {"text": texts}


print("✅ Prompt template defined")
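
One thing to be aware of in `formatting_prompts_func`: the oracle always appears as Document 1, so the model could learn to rely on position rather than content. The following is a minimal sketch of a variation that shuffles the oracle among the distractors before numbering. The helper name `build_context` and its `max_distractors` and `seed` parameters are hypothetical and not part of the original notebook.

```python
import random


def build_context(oracle, distractors, max_distractors=3, seed=None):
    """Return a numbered context string with the oracle at a random position.

    Hypothetical helper: it mirrors the "Document N:" layout used above, but
    shuffles the documents so the oracle is not always Document 1.
    """
    docs = [oracle]
    if isinstance(distractors, list):
        docs += distractors[:max_distractors]
    rng = random.Random(seed)  # a fixed seed gives a reproducible order
    rng.shuffle(docs)
    return "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(docs))


example = build_context(
    oracle="The library opens at 8 AM.",
    distractors=["Parking permits cost extra.", "The gym closes at 10 PM."],
    seed=0,
)
print(example)
```

If this variation is used for training, the test prompt in the inference section should follow the same layout, since the model will no longer expect the relevant document in first position.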

## 3. Load RAFT Dataset

In [None]:
from datasets import load_dataset
import os

# Direct path to raft_dataset.jsonl in 03_fine_tuning
dataset_path = "raft_dataset.jsonl"

if not os.path.exists(dataset_path):
    raise FileNotFoundError(
        f"❌ {dataset_path} not found!\n"
        "Make sure you've run: python generate_raft_focused.py"
    )

print(f"Loading dataset from {dataset_path}...")
dataset = load_dataset("json", data_files=dataset_path, split="train")

print(f"✅ Loaded {len(dataset)} examples")
print(f"\nSample keys: {dataset.column_names}")

# Apply prompt formatting
print("\nApplying prompt template...")
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"✅ Dataset ready for training ({len(dataset)} examples)")
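
Because `formatting_prompts_func` indexes five keys directly, a single malformed record will raise a `KeyError` partway through `dataset.map`. A small pre-check along the following lines can surface such records first. This is only a sketch: the function name `find_invalid_records` is hypothetical, and the required keys are taken from the docstring above.

```python
import json

# Keys that formatting_prompts_func reads from every record
REQUIRED_KEYS = {"question", "oracle", "distractors", "thought_process", "answer"}


def find_invalid_records(lines):
    """Return (line_number, problem) pairs for records that would break formatting."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "record is not a JSON object"))
            continue
        missing = sorted(REQUIRED_KEYS - record.keys())
        if missing:
            problems.append((line_number, f"missing keys: {', '.join(missing)}"))
    return problems


sample_lines = [
    '{"question": "q", "oracle": "o", "distractors": [], "thought_process": "t", "answer": "a"}',
    '{"question": "q", "oracle": "o"}',
]
print(find_invalid_records(sample_lines))
```

In the notebook, the same function can be given an open file handle, for example `find_invalid_records(open(dataset_path, encoding="utf-8"))`, since iterating a text file yields one line at a time.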

## 4. Add LoRA Adapters

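LoRA keeps the base weights frozen and trains small low-rank adapter matrices on the listed projection layers, so only a small fraction of the parameters is updated (well under 10%, and roughly 0.5% at rank 16 on this model).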

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (higher = more parameters)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimized
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

print("✅ LoRA adapters added")
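
To make the size of the LoRA update concrete, the arithmetic can be sketched as follows. Each adapted linear layer with `in` inputs and `out` outputs gains two matrices with `r * (in + out)` parameters in total. The layer dimensions below are assumptions taken from the public Llama 3.1 8B configuration, not values read from the loaded model, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Count adapter parameters: A is (rank x in) and B is (out x rank) per module."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(rank=16)
print(f"Trainable LoRA parameters at r=16: {trainable:,}")
print(f"Share of ~8.03B base parameters: {trainable / 8.03e9:.2%}")
```

Doubling `r` doubles the adapter size, which is a useful rule of thumb when trading VRAM against adapter capacity.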

## 5. Configure Training

These settings are optimized for 12GB VRAM.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,       # Batch size per GPU
        gradient_accumulation_steps=4,       # Effective batch = 2*4 = 8
        warmup_steps=5,
        max_steps=300,                       # Increase for longer training
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 if bf16 not available
        bf16=torch.cuda.is_bf16_supported(),       # Use bf16 on Ampere+ GPUs
        logging_steps=10,
        optim="adamw_8bit",                  # Saves ~2GB VRAM
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        save_steps=50,                       # Save checkpoint every 50 steps
        save_total_limit=3,                  # Keep only 3 checkpoints
    ),
)

print("✅ Trainer configured")
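
Since training is bounded by `max_steps` rather than epochs, it helps to check how many passes over the dataset the settings above imply. The sketch below works through that arithmetic. The function name `training_coverage` is hypothetical, and the dataset size of 1000 is a placeholder, not the size of the actual RAFT dataset.

```python
def training_coverage(num_examples, per_device_batch=2, grad_accum=4,
                      max_steps=300, num_gpus=1):
    """Relate the trainer settings to the number of passes over the dataset."""
    # One optimizer step consumes this many examples
    effective_batch = per_device_batch * grad_accum * num_gpus
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / num_examples,
    }


# 1000 is a placeholder; in the notebook, pass len(dataset) instead
print(training_coverage(num_examples=1000))
```

If the resulting epoch count is much larger than about 3 on a small dataset, lowering `max_steps` is one way to reduce the risk of overfitting.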

## 6. Check GPU Memory

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Max memory: {max_memory} GB")
print(f"Reserved: {start_gpu_memory} GB")
print(f"Available: {max_memory - start_gpu_memory:.3f} GB")

## 7. Train!

This will take ~30-60 minutes depending on your GPU and dataset size.

In [None]:
print("🚀 Starting training...\n")

trainer_stats = trainer.train()

print("\n✅ Training complete!")
print(f"Final loss: {trainer_stats.training_loss:.4f}")
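
After training, the memory figures from section 6 can be compared against the peak to see how much VRAM the run actually needed. The following is a sketch of that calculation on raw byte counts. `memory_report` is a hypothetical helper, and the numbers in the example call are placeholders rather than measurements.

```python
GIB = 1024 ** 3


def memory_report(total_bytes, start_reserved_bytes, peak_reserved_bytes):
    """Summarize peak reserved memory and the part attributable to training."""
    peak_gb = peak_reserved_bytes / GIB
    training_gb = (peak_reserved_bytes - start_reserved_bytes) / GIB
    total_gb = total_bytes / GIB
    return {
        "peak_reserved_gb": round(peak_gb, 3),
        "used_for_training_gb": round(training_gb, 3),
        "peak_percent_of_total": round(100 * peak_gb / total_gb, 1),
    }


# Placeholder values for a 12 GiB card: 6 GiB reserved before training, 9 GiB at peak
print(memory_report(12 * GIB, 6 * GIB, 9 * GIB))
```

In the notebook, the inputs would be `gpu_stats.total_memory`, the value of `torch.cuda.max_memory_reserved()` captured before `trainer.train()`, and the same call made again afterwards. Note that `start_gpu_memory` from section 6 is already converted to GB, so the raw byte count would have to be kept separately.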

## 8. Test Inference

In [None]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test question
test_context = """Document 1:
NKU tuition for undergraduate students is $450 per credit hour for in-state students and $750 for out-of-state.

Document 2:
Financial aid applications are available through the FAFSA website.
"""

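# Format only the part of the template before {thought}, so the prompt ends at
# the <|thought|> tag and the model generates the reasoning and answer itself
test_prompt = prompt_template.split("{thought}")[0].format(
    context=test_context,
    instruction="How much does tuition cost per credit hour?",
)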

inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

print("Testing model...\n")
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=256, use_cache=True, do_sample=True, temperature=0.7)
response = tokenizer.batch_decode(outputs)[0]

print("=" * 60)
print("RESPONSE:")
print("=" * 60)
print(response)
print("=" * 60)
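
The decoded response contains the prompt followed by the generated reasoning and answer. A sketch of separating the two parts by the `<|thought|>` and `<|answer|>` tags from the prompt template is shown below. The function name `split_thought_answer` is hypothetical, and it assumes the tags appear as plain text in the decoded output.

```python
def split_thought_answer(decoded, thought_tag="<|thought|>",
                         answer_tag="<|answer|>", end_tag="<|eot_id|>"):
    """Return (thought, answer) from decoded text; a missing answer is an empty string."""
    # Keep only the text after the last thought tag, which skips the prompt
    tail = decoded.rpartition(thought_tag)[2]
    # Drop the end-of-turn marker and anything generated after it
    tail = tail.split(end_tag)[0]
    thought, _, answer = tail.partition(answer_tag)
    return thought.strip(), answer.strip()


sample = (
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "<|thought|>\nDocument 1 lists the rate.\n"
    "<|answer|>\n$450 per credit hour.<|eot_id|>"
)
thought, answer = split_thought_answer(sample)
print("Thought:", thought)
print("Answer:", answer)
```

In the notebook this would be called as `split_thought_answer(response)`, which makes it easier to show users only the final answer while keeping the reasoning for debugging.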

## 9. Save Fine-tuned Model

In [None]:
# Save LoRA adapters locally
save_dir = "campus_gpt_lora"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

print(f"✅ Model saved to {save_dir}/")
print("\nYou can now:")
print("1. Use it with Ollama: Create a Modelfile")
print("2. Upload to HuggingFace Hub")
print("3. Merge LoRA adapters with base model")

## Optional: Merge LoRA with Base Model

This merges the LoRA adapters into the base model and exports a standalone GGUF file (no adapters needed), quantized with `q4_k_m` for use with Ollama.

In [None]:
# Merge the adapters into the base model and export to GGUF for Ollama
model.save_pretrained_gguf(
    "campus_gpt_q4",
    tokenizer,
    quantization_method="q4_k_m",
)
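
To use the exported file with Ollama, a Modelfile pointing at the GGUF is needed. The sketch below generates one as text. `render_modelfile` is a hypothetical helper and the GGUF path is a placeholder: check the `campus_gpt_q4` directory for the file name that was actually written. Depending on the Ollama version, a `TEMPLATE` directive matching the Llama 3.1 chat format may also be required.

```python
def render_modelfile(gguf_path, system_prompt, temperature=0.7):
    """Build the text of an Ollama Modelfile for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        'PARAMETER stop "<|eot_id|>"',  # stop at the Llama 3.1 end-of-turn token
        f'SYSTEM """{system_prompt}"""',
    ]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    gguf_path="./campus_gpt_q4/model.gguf",  # hypothetical path; use the real file name
    system_prompt="You are Campus_GPT, an AI assistant for Northern Kentucky University.",
)
print(modelfile_text)
```

After writing this text to a file named `Modelfile`, the model can be registered with `ollama create campus_gpt -f Modelfile`, where `campus_gpt` is an arbitrary model name.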