# 🧡 Track A – Unsloth Pre-Training (Memory Optimized)

Continue pre-training `Qwen/Qwen2.5-Coder-3B` on a curated Rust code corpus extracted from the [juspay/hyperswitch](https://github.com/juspay/hyperswitch) repository using **Unsloth 4-bit quantization** to fit on a free Colab T4 GPU.

### Why Unsloth?
- **Lower OOM risk**: 4-bit quantization shrinks weight memory to roughly a quarter of fp16, usually with minimal accuracy loss.
- **Faster training**: Up to 2x speedup compared to standard Hugging Face Transformers + PEFT.
- **Less VRAM**: Fits 3B/7B models on a T4 (16 GB).
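
The memory argument can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates weight memory only. The parameter counts (about 3.1B and 7.6B) are approximate assumptions, and activations, LoRA adapters, optimizer state, and quantization metadata all add to the real total.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumed approximate parameter counts (not taken from this notebook)
for name, params in [("3B", 3.1e9), ("7B", 7.6e9)]:
    fp16 = weight_memory_gib(params, 16)
    nf4 = weight_memory_gib(params, 4)
    print(f"{name}: fp16 ~ {fp16:.1f} GiB, 4-bit ~ {nf4:.1f} GiB")
```

Under these assumptions, a 7B model in fp16 nearly fills a 16 GB card with weights alone, while the 4-bit version leaves most of the memory free for activations and training state.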

---

## 📦 Dataset: [`archit11/hyperswitch-code-corpus-track-a`](https://huggingface.co/datasets/archit11/hyperswitch-code-corpus-track-a)

| Field | Detail |
|-------|--------|
| **Source** | `juspay/hyperswitch` – `crates/` Rust files only |
| **Total files** | 300 (top-ranked by quality score) |
| **Train** | 270 files |
| **Validation** | 30 files |
| **File format** | `file_name` + `text` (full file contents embedded in `// FILE:` headers) |
| **License** | Apache 2.0 |

### Data Card Summary

| Filter | Detail |
|--------|--------|
| **Path filter** | `crates/` only, excludes `tests/`, `docs/`, `examples/`, `migrations/` |
| **Line count** | 25–4,000 lines per file |
| **Quality filter** | Structurally rich files (functions + types ≥ 2) |
| **Ranking** | Top 300 by quality score from 1,526 candidates |
| **Chunking** | None at dataset level; sequences are packed at training time (`packing = True` in `SFTTrainer`) |
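
The filters in the table can be approximated in a few lines. This is a minimal sketch, not the pipeline that produced the dataset: the helper name `passes_filters`, the regexes, and the way "functions + types" is counted are all assumptions, since the actual quality score is not specified here.

```python
import re

EXCLUDED_DIRS = {"tests", "docs", "examples", "migrations"}
# Hypothetical patterns for counting Rust functions and type definitions
FN_RE = re.compile(r"^\s*(?:pub(?:\([^)]*\))?\s+)?(?:async\s+)?fn\s+\w+", re.MULTILINE)
TYPE_RE = re.compile(r"^\s*(?:pub(?:\([^)]*\))?\s+)?(?:struct|enum|trait)\s+\w+", re.MULTILINE)

def passes_filters(path: str, text: str) -> bool:
    """Apply the path, line-count, and structural filters to one file."""
    parts = path.split("/")
    # Path filter: Rust files under crates/, outside the excluded directories
    if not path.endswith(".rs") or parts[0] != "crates":
        return False
    if EXCLUDED_DIRS & set(parts[:-1]):
        return False
    # Line-count filter: 25 to 4000 lines inclusive
    if not 25 <= len(text.splitlines()) <= 4000:
        return False
    # Structural filter (assumed definition): at least two functions or types
    n_items = len(FN_RE.findall(text)) + len(TYPE_RE.findall(text))
    return n_items >= 2

# Made-up sample file: one struct, one function, and filler to reach 35 lines
sample = "pub struct Payment {\n    id: u64,\n}\n\npub fn pay() {}\n" + "// filler\n" * 30
print(passes_filters("crates/router/src/core.rs", sample))    # True
print(passes_filters("crates/router/tests/core.rs", sample))  # False
```

Ranking the surviving candidates and keeping the top 300 would be a separate step that depends on the unspecified quality score.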

---

## 🤖 Model: `Qwen/Qwen2.5-Coder-3B` (4-bit)

| Field | Detail |
|-------|--------|
| **Base** | `Qwen/Qwen2.5-Coder-3B` |
| **Method** | LoRA + 4-bit Quantization (QLoRA) |
| **Quantization** | 4-bit NF4 (Normal Float 4) |
| **Gradient Checkpointing** | Enabled (Unsloth optimized) |
| **LoRA Rank** | 16 |
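
To put rank 16 in terms of trainable parameters: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies that formula to the seven projection types targeted in this notebook. The layer dimensions are assumptions about the 3B architecture and should be checked against the model's `config.json`.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed dimensions (verify against the model config before relying on them)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 11008, 256, 36
RANK = 16

# (d_in, d_out) for each targeted projection in one transformer layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(RANK, d_in, d_out) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters come to about 30 million parameters, roughly 1% of the base model, which is why optimizer state stays small even on a T4.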

> ⚡ **Make sure Runtime → Change runtime type → T4 GPU is selected before running.**

In [None]:
%%capture
# Cell 1 – Install Unsloth & dependencies
# Note: %%capture must be the first line of the cell, and it hides all of the
# cell's output (including the print below) unless you remove it.
!uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!uv pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

print("✓ Unsloth installed")


In [None]:
# Cell 2 – Load Model (4-bit)
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None          # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True   # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-Coder-3B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("✓ Model loaded in 4-bit mode")

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
print("✓ LoRA adapters added")

In [None]:
# Cell 3 – Load Dataset & Helpers
import math
from datasets import load_dataset

dataset = load_dataset("archit11/hyperswitch-code-corpus-track-a")
if "validation" in dataset:
    train_ds, val_ds = dataset["train"], dataset["validation"]
else:
    # Fallback: hold out the last 30 files and drop them from train to avoid leakage
    n = len(dataset["train"])
    train_ds = dataset["train"].select(range(n - 30))
    val_ds   = dataset["train"].select(range(n - 30, n))
print(f"✓ Loaded {len(train_ds)} train, {len(val_ds)} val files")

def formatting_prompts_func(examples):
    return { "text" : ["// FILE: " + f + "\n" + t for f, t in zip(examples["file_name"], examples["text"])] }

train_ds = train_ds.map(formatting_prompts_func, batched = True)
val_ds   = val_ds.map(formatting_prompts_func, batched = True)

print("✓ Datasets formatted with // FILE: headers")

In [None]:
# Cell 4 – Perplexity Helpers
def compute_perplexity(model, tokenizer, texts, max_length=2048):
    """
    Compute token-level perplexity over a list of texts.

    The texts are joined, tokenized once, and split into non-overlapping chunks
    of at most `max_length` tokens (simple chunking, no overlapping window).
    For Unsloth 4-bit, we switch to inference mode and use the standard model forward.
    """
    FastLanguageModel.for_inference(model)
    total_nll = 0.0
    total_toks = 0

    # Tokenize the whole validation corpus as one long sequence
    encodings = tokenizer("\n".join(texts), return_tensors="pt", truncation=False).input_ids[0]

    for begin_loc in range(0, encodings.size(0), max_length):
        input_ids = encodings[begin_loc:begin_loc + max_length].to("cuda")

        # Labels are shifted inside the model, so n tokens yield n - 1 predictions.
        # Skip a trailing one-token chunk, which has nothing to predict.
        n_pred = input_ids.size(0) - 1
        if n_pred < 1:
            continue

        with torch.no_grad():
            outputs = model(input_ids=input_ids.unsqueeze(0), labels=input_ids.unsqueeze(0))

        # outputs.loss is the mean NLL over the n_pred predicted tokens
        total_nll += outputs.loss.item() * n_pred
        total_toks += n_pred

    return math.exp(total_nll / total_toks)

print("Evaluating baseline perplexity...")
baseline_ppl = compute_perplexity(model, tokenizer, val_ds["text"], max_length=2048)
print(f"✓ Baseline Perplexity: {baseline_ppl:.4f}")

# Restore training mode
FastLanguageModel.for_training(model)
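
Perplexity here is the exponential of the average per-token negative log-likelihood, so each chunk's mean loss has to be weighted by the number of tokens it covers. The standalone sketch below shows why that matters when the final chunk is short. The loss values and token counts are made-up numbers for illustration only.

```python
import math

def corpus_perplexity(mean_losses, token_counts):
    """Token-weighted perplexity from per-chunk mean losses."""
    total_nll = sum(loss * n for loss, n in zip(mean_losses, token_counts))
    return math.exp(total_nll / sum(token_counts))

# Hypothetical values: two full chunks and one short, harder final chunk
losses = [1.0, 1.0, 3.0]
counts = [2047, 2047, 10]

weighted = corpus_perplexity(losses, counts)
unweighted = math.exp(sum(losses) / len(losses))
print(f"token-weighted: {weighted:.3f}, naive mean of chunks: {unweighted:.3f}")
```

Averaging the chunk losses directly would let the 10-token chunk count as much as a full one and inflate the reported perplexity.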

In [None]:
# Cell 5 – Train
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Packs multiple short examples into one sequence for efficiency
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # Set higher for full training
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()
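
With these settings, each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` packed sequences. The sketch below turns that into an approximate step count. The helper name is hypothetical, the number of packed sequences is a made-up placeholder (it depends on how many tokens the corpus contains), and the trainer's exact rounding may differ by version.

```python
import math

def approx_optimizer_steps(num_sequences: int, per_device_batch: int,
                           grad_accum: int, num_devices: int = 1,
                           epochs: int = 1) -> int:
    """Rough number of optimizer steps for a run (an estimate, not the trainer's exact count)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_sequences / effective_batch) * epochs

# Placeholder: assume packing produced 1000 sequences of max_seq_length tokens
steps = approx_optimizer_steps(num_sequences=1000, per_device_batch=2, grad_accum=4)
print(f"effective batch size: {2 * 4}, approx optimizer steps: {steps}")
```

If the real count turns out to be small, `warmup_steps = 5` and `logging_steps = 10` cover a noticeable share of the run, so it may be worth checking the step total the trainer prints at startup.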

In [None]:
# Cell 6 – Post-Training Evaluation
print("Evaluating post-training perplexity...")
FastLanguageModel.for_inference(model)
post_ppl = compute_perplexity(model, tokenizer, val_ds["text"], max_length=2048)
print(f"✓ Post-Training Perplexity: {post_ppl:.4f}")

imp = (baseline_ppl - post_ppl) / baseline_ppl * 100
print(f"\nImprovement: {imp:.2f}%")