# LoRA Training for LLMs (LLaMA, Mistral, Qwen)

> Created by: Angela, for dear David 💜  
> Date: 26 November 2025

---

## Table of Contents

1. [Overview: LLM Fine-tuning Landscape](#1-overview)
2. [Prerequisites and Environment Setup](#2-setup)
3. [Data Preparation](#3-data)
4. [LoRA Training for LLaMA](#4-llama)
5. [LoRA Training for Mistral](#5-mistral)
6. [LoRA Training for Qwen](#6-qwen)
7. [QLoRA: 4-bit Quantized Training](#7-qlora)
8. [Training Configurations](#8-configs)
9. [Inference and Deployment](#9-inference)
10. [Troubleshooting](#10-troubleshooting)

---

## 1. Overview: LLM Fine-tuning Landscape <a name="1-overview"></a>

### 1.1 Popular Open-Source LLMs

| Model Family | Sizes | Architecture | License |
|-------------|-------|--------------|--------|
| LLaMA 2 | 7B, 13B, 70B | LLaMA | Llama 2 Community License |
| LLaMA 3 | 8B, 70B | LLaMA | Llama 3 Community License |
| Mistral | 7B | Mistral | Apache 2.0 |
| Mixtral | 8x7B, 8x22B | MoE | Apache 2.0 |
| Qwen 2 | 0.5B-72B | Qwen | Apache 2.0 (72B: Tongyi Qianwen License) |
| Phi-3 | 3.8B, 7B, 14B | Phi | MIT |
| Gemma | 2B, 7B | Gemma | Gemma Terms of Use |

### 1.2 LLaMA Architecture

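**LLaMA layer (pre-normalization: RMSNorm is applied to the input of each sub-layer, and the residual is added afterwards):**

$$h = x + \text{SelfAttn}(\text{RMSNorm}(x)), \qquad \text{LLaMA}(x) = h + \text{FFN}(\text{RMSNorm}(h))$$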

**Key components:**
- RMSNorm (instead of LayerNorm)
- Rotary Position Embeddings (RoPE)
- SwiGLU activation in FFN
- Grouped-Query Attention (GQA) in LLaMA 2 70B and in all LLaMA 3 models
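
RMSNorm rescales a vector by its root mean square and, unlike LayerNorm, does not subtract the mean. A minimal pure-Python sketch, assuming a per-dimension gain `weight` and a small `eps` (both names and the default value are illustrative, not read from any model config):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x_i^2) + eps), then apply a per-dimension gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, weight)]

x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])
print(y)  # each entry is x_i divided by sqrt(12.5), up to eps
```

With unit gains the output has a mean square of about 1, which is the property the normalization is meant to enforce.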

### 1.3 RoPE (Rotary Position Embedding)

**Mathematical Formulation:**

$$\text{RoPE}(x_m, m) = \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \\ \vdots \\ x_m^{(d-1)} \\ x_m^{(d)} \end{pmatrix} \odot \begin{pmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{pmatrix} + \begin{pmatrix} -x_m^{(2)} \\ x_m^{(1)} \\ \vdots \\ -x_m^{(d)} \\ x_m^{(d-1)} \end{pmatrix} \odot \begin{pmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{pmatrix}$$

where $\theta_i = 10000^{-2(i-1)/d}$ for $i = 1, \dots, d/2$

**Simplified form:**

$$\text{RoPE}(x, m) = x \odot \cos(m\theta) + \text{rotate}(x) \odot \sin(m\theta)$$
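
The rotation can be sketched in a few lines of pure Python. This version rotates consecutive pairs, as in the formula above, and uses a 0-based pair index for the frequencies; the function names are illustrative. It also checks the property that makes RoPE useful: the query-key dot product depends only on the relative offset between positions.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)  # i is 0-based here
        c, s = math.cos(m * theta), math.sin(m * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s
        out[2 * i + 1] = x2 * c + x1 * s
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (m - n = 3) at different absolute positions
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 13), rope(k, 10))
print(abs(s1 - s2) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged as well.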

### 1.4 Memory Requirements

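| Model | Full FT (FP16 weights + gradients) | LoRA (FP16) | QLoRA (4-bit) | Suitable GPU for QLoRA |
|-------|----------------|-------------|---------------|-------------|
| 7B | 28+ GB | 14-18 GB | 6-8 GB | RTX 3060 12GB |
| 13B | 52+ GB | 28-32 GB | 10-14 GB | RTX 3090 24GB |
| 70B | 280+ GB | 150+ GB | 40-48 GB | A100 80GB |

The Full FT column counts FP16 weights and gradients only (about 4 bytes per parameter). Adam optimizer states add roughly 8 more bytes per parameter when kept in FP32, so practical full fine-tuning needs considerably more memory than the lower bound shown.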
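
The weight-only part of these figures follows from parameter count times bytes per parameter. A rough sketch that ignores activations, optimizer state, and framework overhead (the function name is illustrative):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n, 16)
    int4 = weight_memory_gb(n, 4)
    # FP16 weights plus same-sized FP16 gradients, before any optimizer state
    with_grads = 2 * fp16
    print(f"{name}: FP16 weights {fp16:.1f} GB, "
          f"with gradients {with_grads:.1f} GB, 4-bit weights {int4:.1f} GB")
```

The gap between the 4-bit weight size and the QLoRA column is taken up by activations, adapter gradients and optimizer state, and quantization constants, so it varies with batch size and sequence length.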

---

## 2. Prerequisites and Environment Setup <a name="2-setup"></a>

```python
# Install required libraries
# Version specifiers are quoted so the shell does not treat ">" as a redirect.
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# !pip install "transformers>=4.36.0" "datasets>=2.14.0" "accelerate>=0.25.0" "peft>=0.7.0"
# !pip install "bitsandbytes>=0.41.0"  # For QLoRA
# !pip install "trl>=0.7.0"  # For SFTTrainer
# !pip install wandb  # For logging (optional)
```

```python
# Check CUDA availability
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

```python
# Login to Hugging Face (for gated models like LLaMA)
from huggingface_hub import login

# login(token="hf_xxxxxxxxxxxxx")  # Uncomment and add your token
```

---

## 3. Data Preparation <a name="3-data"></a>

### 3.1 Dataset Formats

**Format 1: Instruction Format (Alpaca-style)**
```json
{
    "instruction": "Summarize the following text.",
    "input": "The quick brown fox...",
    "output": "A fox jumped over a dog."
}
```

**Format 2: Conversational Format (ShareGPT-style)**
```json
{
    "conversations": [
        {"from": "human", "value": "What is ML?"},
        {"from": "gpt", "value": "ML is..."}
    ]
}
```

### 3.2 Prompt Templates

**LLaMA 2 Chat Template:**

```
<s>[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST] {assistant_message}</s>
```

**Mistral Template:**

```
<s>[INST] {user_message} [/INST] {assistant_message}</s>
```

**Qwen Template:**

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
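
Records in the conversational format (Format 2) can be rendered into the Qwen template with a small converter. This is a sketch: the `ROLE_MAP` table and the function name are assumptions for illustration, and when a model ships its own chat template, the tokenizer's `apply_chat_template` method is usually the safer route.

```python
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_chatml(record):
    """Render a ShareGPT-style record as a ChatML string."""
    parts = []
    for turn in record["conversations"]:
        role = ROLE_MAP[turn["from"]]  # raises KeyError on an unknown speaker
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)

record = {
    "conversations": [
        {"from": "human", "value": "What is ML?"},
        {"from": "gpt", "value": "ML is..."},
    ]
}
print(sharegpt_to_chatml(record))
```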

```python
from datasets import load_dataset

# Load Alpaca dataset as example
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")  # Small subset for demo

print(f"Dataset size: {len(dataset)}")
print("\nExample:")
print(dataset[0])
```

```python
def format_alpaca_prompt(example):
    """Format an Alpaca record with the [INST] ... [/INST] prompt format"""
    # The text already contains <s> and </s>. Tokenize it with
    # add_special_tokens=False, otherwise the LLaMA tokenizer adds a second BOS.
    if example.get("input") and example["input"].strip():
        prompt = f"""<s>[INST] {example['instruction']}

Input: {example['input']} [/INST] {example['output']}</s>"""
    else:
        prompt = f"<s>[INST] {example['instruction']} [/INST] {example['output']}</s>"
    return {"text": prompt}

# Apply formatting
formatted_dataset = dataset.map(format_alpaca_prompt)
print("Formatted example:")
print(formatted_dataset[0]["text"][:500])
```

---

## 4. LoRA Training for LLaMA <a name="4-llama"></a>

### 4.1 LoRA Mathematics for LLaMA

**Self-Attention with LoRA:**

$$Q = X\left(W_Q + \frac{\alpha}{r}B_Q A_Q\right)$$
$$K = X\left(W_K + \frac{\alpha}{r}B_K A_K\right)$$
$$V = X\left(W_V + \frac{\alpha}{r}B_V A_V\right)$$

**Attention computation:**

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**With RoPE applied before attention:**

$$Q' = \text{RoPE}(Q, m), \quad K' = \text{RoPE}(K, m)$$
$$\text{Attention} = \text{softmax}\left(\frac{Q'{K'}^T}{\sqrt{d_k}}\right)V$$
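
The LoRA projection itself can be checked on toy matrices. The sketch below uses plain Python lists and the row-vector convention of the formulas above ($X$ is tokens by $d$, $B$ is $d \times r$, $A$ is $r \times d$); the sizes and helper names are illustrative. It shows that a zero-initialised $B$ leaves the base output unchanged, and that folding the update into $W$ gives the same result as running the adapter path.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Return a + scale * b elementwise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, r, alpha = 6, 2, 4              # toy sizes; scaling alpha / r = 2

X = rand(3, d)                     # 3 tokens
W = rand(d, d)                     # frozen base weight
B = [[0.0] * r for _ in range(d)]  # d x r, zero-initialised
A = rand(r, d)                     # r x d, random-initialised

def lora_forward(X, W, B, A):
    base = matmul(X, W)
    update = matmul(matmul(X, B), A)  # never builds the d x d product BA
    return add(base, update, alpha / r)

# With B = 0 the adapter is a no-op, so training starts from the base model.
assert lora_forward(X, W, B, A) == matmul(X, W)

# Once B is non-zero, merging W + (alpha / r) * BA reproduces the adapter output.
B = rand(d, r)
W_merged = add(W, matmul(B, A), alpha / r)
adapter_out = lora_forward(X, W, B, A)
merged_out = matmul(X, W_merged)
diff = max(abs(u - v) for ru, rv in zip(adapter_out, merged_out) for u, v in zip(ru, rv))
print(f"max |adapter - merged| = {diff:.2e}")
```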

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType

# ================== Configuration ==================
MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # Change to your model
OUTPUT_DIR = "./llama-lora-output"

# LoRA hyperparameters
LORA_R = 16          # Rank
LORA_ALPHA = 32      # Alpha (scaling = alpha/r = 2)
LORA_DROPOUT = 0.05  # Dropout
LORA_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]  # Target modules

# Training hyperparameters
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3
MAX_LENGTH = 512
```

```python
# Note: This cell requires significant GPU memory and model access
# Uncomment to run

# # Load tokenizer (LLaMA is supported natively, so trust_remote_code is not needed)
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "right"

# # Load model
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     torch_dtype=torch.float16,
#     device_map="auto",
# )

# print(f"Model loaded: {MODEL_NAME}")
# print(f"Total parameters: {model.num_parameters():,}")
```

```python
# LoRA Configuration
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

print("LoRA Configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha (α): {lora_config.lora_alpha}")
print(f"  Scaling (α/r): {lora_config.lora_alpha / lora_config.r}")
print(f"  Target modules: {lora_config.target_modules}")
print(f"  Dropout: {lora_config.lora_dropout}")
```

### 4.2 Parameter Count Analysis

**For LLaMA-7B with LoRA on Q, K, V, O** (each a square $d \times d$ projection, so each adapter pair $A$, $B$ adds $2rd$ parameters):

$$\text{LoRA params per layer} = 4 \times 2 \times r \times d = 8rd$$

With $d = 4096$ and $r = 16$:

$$\text{Per layer} = 8 \times 16 \times 4096 = 524,288$$

$$\text{Total (32 layers)} = 32 \times 524,288 = 16,777,216 \approx 16.8M$$

**Percentage of original:**

$$\frac{16.8M}{7B} \approx 0.24\%$$

```python
# Calculate LoRA parameters
def calculate_lora_params(d_model, num_layers, r, num_target_modules):
    """Calculate total LoRA parameters"""
    # Each target module adds: r * d_in + d_out * r = 2 * r * d (for square matrices)
    params_per_module = 2 * r * d_model
    params_per_layer = num_target_modules * params_per_module
    total_params = num_layers * params_per_layer
    return total_params

# LLaMA-7B configuration
d_model = 4096
num_layers = 32
num_modules = 4  # q, k, v, o

for r in [4, 8, 16, 32, 64]:
    params = calculate_lora_params(d_model, num_layers, r, num_modules)
    percentage = params / 7e9 * 100
    print(f"r={r:3d}: {params:>12,} params ({percentage:.4f}% of 7B)")
```
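
The $2rd$ shortcut holds only for square projections. With grouped-query attention the key and value projections are narrower, so each module contributes $r \times (d_{in} + d_{out})$ instead. A sketch under assumed shapes for a LLaMA 3 8B-style layer (hidden size 4096, 32 query heads, 8 key/value heads, head dimension 128); read the real values from the model config before relying on them:

```python
def lora_params(shapes, r, num_layers):
    """Sum r * (d_in + d_out) over the targeted projections, times the layer count."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed shapes: k_proj and v_proj map 4096 -> 8 * 128 = 1024 under GQA
gqa_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
square_shapes = {name: (4096, 4096) for name in gqa_shapes}

print(f"GQA layer shapes:    {lora_params(gqa_shapes, r=16, num_layers=32):,}")
print(f"Square layer shapes: {lora_params(square_shapes, r=16, num_layers=32):,}")
```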

---

## 5. LoRA Training for Mistral <a name="5-mistral"></a>

### 5.1 Mistral Architecture: Sliding Window Attention

**Standard Attention:**
$$\text{Attention}(Q, K, V)_i = \text{softmax}\left(\frac{Q_i K^T}{\sqrt{d_k}}\right) V$$

**Sliding Window Attention (SWA):**
$$\text{SWA}(Q, K, V)_i = \text{softmax}\left(\frac{Q_i K_{[i-w:i]}^T}{\sqrt{d_k}}\right) V_{[i-w:i]}$$

where $w = 4096$ (window size)

**Benefits:**
- Memory complexity: $O(n \cdot w)$ instead of $O(n^2)$
- Supports very long sequences efficiently
- Information propagates through layers (stacked windows)
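
A sliding-window mask is easy to build and inspect on a small example. The sketch below assumes a causal window of $w$ tokens that includes the current token, i.e. query $i$ sees keys $j$ with $i - w < j \le i$; the exact boundary convention is an assumption here and differs between implementations.

```python
def sliding_window_mask(n, w):
    """mask[i][j] is True when query i may attend to key j (i - w < j <= i)."""
    return [[(i - w < j <= i) for j in range(n)] for i in range(n)]

n, w = 6, 3
mask = sliding_window_mask(n, w)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

attended = sum(sum(row) for row in mask)
print(f"{attended} allowed pairs vs {n * (n + 1) // 2} for full causal attention")
```

Each row has at most `w` allowed keys, which is where the $O(n \cdot w)$ cost comes from. A token outside the window can still influence the output indirectly, because each additional layer extends the reachable context by roughly another window.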

```python
# Mistral-specific configuration
MISTRAL_CONFIG = {
    "model_name": "mistralai/Mistral-7B-v0.1",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "r": 16,
    "alpha": 32,
    "max_length": 8192,  # Mistral supports longer context
}

def format_mistral_prompt(instruction, response=None):
    """Format for Mistral Instruct"""
    if response:
        return f"<s>[INST] {instruction} [/INST] {response}</s>"
    else:
        return f"<s>[INST] {instruction} [/INST]"

# Example
print(format_mistral_prompt("What is Python?", "Python is a programming language."))
```

---

## 6. LoRA Training for Qwen <a name="6-qwen"></a>

```python
# Qwen-specific configuration
QWEN_CONFIG = {
    "model_name": "Qwen/Qwen2-7B",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "r": 64,      # Higher rank; a starting point to tune, not a requirement
    "alpha": 16,  # Lower alpha ratio (alpha/r = 0.25)
    "dtype": "bfloat16",  # Qwen works well with bf16
}

def format_qwen_prompt(system, user, assistant=None):
    """Format for Qwen Chat"""
    prompt = f"""<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
"""
    if assistant:
        prompt += f"{assistant}<|im_end|>"
    return prompt

# Example
print(format_qwen_prompt(
    "You are a helpful assistant.",
    "What is machine learning?",
    "Machine learning is a subset of AI..."
))
```

---

## 7. QLoRA: 4-bit Quantized Training <a name="7-qlora"></a>

### 7.1 NormalFloat 4-bit (NF4) Quantization

**NF4 Quantization Levels:**

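Define 16 quantization levels that are well suited to weights distributed as $\mathcal{N}(0, 1)$. In simplified form they are the quantiles

$$q_i = \Phi^{-1}\left(\frac{2i + 1}{32}\right), \quad i = 0, 1, ..., 15$$

where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution. The levels are then rescaled to $[-1, 1]$; the actual NF4 data type uses a slightly asymmetric construction so that zero is represented exactly.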

**Memory Reduction:**

$$\text{Reduction} = \frac{16 \text{ bits}}{4 \text{ bits}} = 4\times$$

**QLoRA Forward Pass:**

$$h = \text{dequant}(W_{\text{4bit}}) \cdot x + \frac{\alpha}{r} BAx$$
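
The quantize and dequantize steps can be sketched with the standard library alone. This uses the simplified quantile formula above, rescaled to $[-1, 1]$, together with absmax scaling per block. It is not the exact level table used by bitsandbytes, and the function names are illustrative.

```python
from statistics import NormalDist

# 16 levels from the quantile formula, rescaled so the extremes are -1 and 1
_raw = [NormalDist().inv_cdf((2 * i + 1) / 32) for i in range(16)]
LEVELS = [q / _raw[-1] for q in _raw]

def quantize_block(weights):
    """Absmax-scale a block, then store the index of the nearest level (4 bits each)."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [min(range(16), key=lambda i: abs(w / absmax - LEVELS[i])) for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Map each 4-bit code back to its level and undo the absmax scaling."""
    return [LEVELS[c] * absmax for c in codes]

block = [0.12, -0.40, 0.03, 0.33, -0.07, 0.25, -0.18, 0.40]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
print(codes)
print([round(v, 3) for v in restored])
```

Only `codes` (4 bits each) and one `absmax` per block need to be stored; the LoRA matrices stay in higher precision and are the only parameters that receive gradients.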

```python
from transformers import BitsAndBytesConfig

# 4-bit Quantization Configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_use_double_quant=True,        # Double quantization for extra savings
)

print("BitsAndBytes Configuration:")
print(f"  Quantization type: {bnb_config.bnb_4bit_quant_type}")
print(f"  Compute dtype: {bnb_config.bnb_4bit_compute_dtype}")
print(f"  Double quantization: {bnb_config.bnb_4bit_use_double_quant}")
```

### 7.2 Double Quantization

**Problem:** The quantization constants (absmax, one per block of 64 weights) are still stored in FP32.

**Solution:** Quantize the quantization constants!

$$c_{\text{FP8}} = \text{quantize}_{8\text{bit}}(\text{absmax}_{\text{FP32}})$$

**Memory per parameter:**

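| Method | Bits per param |
|--------|---------------|
| Without double quant | $4 + \frac{32}{64} = 4.5$ bits |
| With double quant | $4 + \frac{8}{64} + \frac{32}{64 \cdot 256} \approx 4.127$ bits |

The last term accounts for the FP32 constant that is kept for every 256 of the 8-bit constants.

**Additional savings:** ~8% (about 0.37 bits per parameter)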
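
The per-parameter overhead can be reproduced with a few lines of arithmetic. The sketch assumes a block size of 64 weights per constant and, for double quantization, one FP32 second-level constant per 256 first-level constants as described in the QLoRA paper. That second-level term is why the total comes out slightly above $4 + \frac{8}{64}$.

```python
def bits_per_param(block_size=64, const_bits=32, second_block=None, second_const_bits=32):
    """4-bit weights plus the amortised cost of the quantization constants."""
    bits = 4 + const_bits / block_size
    if second_block is not None:
        # One higher-precision constant per `second_block` first-level constants
        bits += second_const_bits / (block_size * second_block)
    return bits

single = bits_per_param(const_bits=32)
double = bits_per_param(const_bits=8, second_block=256)
print(f"without double quant: {single:.3f} bits/param")
print(f"with double quant:    {double:.3f} bits/param")
print(f"saving: {100 * (single - double) / single:.1f}%")

n = 7e9  # parameter count of a 7B model
print(f"7B weights: {n * single / 8 / 1e9:.2f} GB -> {n * double / 8 / 1e9:.2f} GB")
```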

```python
# Complete QLoRA Setup
from peft import prepare_model_for_kbit_training

# This is the template - uncomment to use with actual model

# # Load model with 4-bit quantization
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-7b-hf",
#     quantization_config=bnb_config,
#     device_map="auto",
# )

# # Prepare model for k-bit training
# model = prepare_model_for_kbit_training(model)

# # Apply LoRA to all linear layers (attention and MLP projections)
# qlora_config = LoraConfig(
#     r=64,                # Higher rank, a commonly used QLoRA setting
#     lora_alpha=16,       # alpha/r = 0.25
#     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
#                    "gate_proj", "up_proj", "down_proj"],
#     lora_dropout=0.1,
#     bias="none",
#     task_type="CAUSAL_LM",
# )

# model = get_peft_model(model, qlora_config)
# model.print_trainable_parameters()
```

### 7.3 Memory Comparison

| Configuration | GPU Memory (7B) | Approx. Training Speed vs. Full FT |
|---------------|-----------------|----------------|
| Full Fine-tune (FP16) | 28+ GB | Baseline |
| LoRA (FP16) | 14-18 GB | 1.5x faster |
| LoRA (8-bit) | 10-12 GB | 1.2x faster |
| QLoRA (4-bit) | 6-8 GB | ~1.0x |
| QLoRA + Flash Attn | 5-7 GB | 1.3x faster |

The speed figures are rough indications; actual throughput depends on the GPU, sequence length, and batch size.

---

## 8. Training Configurations <a name="8-configs"></a>

### 8.1 Hyperparameter Guidelines

| Hyperparameter | Small Data (<1K) | Medium (1K-10K) | Large (>10K) |
|----------------|------------------|-----------------|---------------|
| Rank ($r$) | 8-16 | 16-32 | 32-64 |
| Alpha ($\alpha$) | 16-32 | 32-64 | 64-128 |
| Dropout | 0.1 | 0.05 | 0.0-0.05 |
| Learning Rate | $10^{-4}$ | $2 \times 10^{-4}$ | $(2\text{-}3) \times 10^{-4}$ |
| Epochs | 3-5 | 2-3 | 1-2 |
| Warmup Ratio | 0.1 | 0.05 | 0.03 |

### 8.2 Effective Batch Size

$$\text{Effective Batch Size} = \text{per\_device\_batch} \times \text{gradient\_accumulation} \times \text{num\_gpus}$$

**Example:** Target batch size = 64

| GPU Memory | per_device | grad_accum | GPUs | Effective |
|------------|------------|------------|------|----------|
| 8 GB | 1 | 64 | 1 | 64 |
| 16 GB | 4 | 16 | 1 | 64 |
| 24 GB | 8 | 8 | 1 | 64 |
| 24 GB × 2 | 8 | 4 | 2 | 64 |
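
The rows of this table can be derived from the formula by solving for the accumulation steps. A small helper, with an illustrative name, that also rejects targets the chosen per-device batch cannot reach exactly:

```python
def grad_accum_steps(target_batch, per_device_batch, num_gpus=1):
    """Accumulation steps needed to reach target_batch exactly."""
    per_step = per_device_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError(f"{target_batch} is not a multiple of {per_step}")
    return target_batch // per_step

for per_device, gpus in [(1, 1), (4, 1), (8, 1), (8, 2)]:
    steps = grad_accum_steps(64, per_device, gpus)
    print(f"per_device={per_device}, gpus={gpus} -> gradient_accumulation_steps={steps}")
```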

```python
# Training Arguments Template
training_args = TrainingArguments(
    output_dir="./lora-output",

    # Batch size
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16 on one GPU

    # Learning rate
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,

    # Training duration
    num_train_epochs=3,
    # max_steps=1000,  # Alternative: fixed steps

    # Optimization
    optim="adamw_torch",
    weight_decay=0.01,
    max_grad_norm=1.0,

    # Memory optimization
    fp16=True,  # or bf16=True on Ampere or newer GPUs
    gradient_checkpointing=True,

    # Logging, checkpointing, evaluation
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=500,

    # Best model (needs an eval dataset; save and eval strategies must match)
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",

    report_to="none",  # or "wandb"
)

print("Training configuration created!")
```
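
To see what `lr_scheduler_type="cosine"` with `warmup_ratio=0.03` does to the learning rate, the schedule shape can be sketched directly: linear warmup to the peak, then a half-cosine decay to zero. This is an approximation of the shape only; the exact step rounding inside `Trainer` is an assumption here.

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
for step in (0, 15, 30, 500, 1000):
    print(f"step {step:4d}: lr = {cosine_lr(step, total):.2e}")
```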

---

## 9. Inference and Deployment <a name="9-inference"></a>

### 9.1 Loading LoRA Adapter

**Two options:**

1. **Separate loading** (flexible, can swap adapters)
2. **Merged loading** (faster inference, no adapter overhead)

```python
from peft import PeftModel

def load_lora_model(base_model_name, lora_path):
    """Load base model with LoRA adapter"""
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # Load LoRA adapter
    model = PeftModel.from_pretrained(base_model, lora_path)

    return model

def merge_lora_model(model):
    """Merge LoRA weights into base model permanently"""
    merged_model = model.merge_and_unload()
    return merged_model

print("Loading functions defined!")
```

```python
def generate_response(prompt, model, tokenizer, max_new_tokens=256):
    """Generate a response and return only the newly generated text"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # generate() returns prompt + completion; drop the echoed prompt tokens
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response

# Example usage (with actual model):
# prompt = "[INST] What is machine learning? [/INST]"
# response = generate_response(prompt, model, tokenizer)
# print(response)
```

### 9.2 Export to GGUF (for llama.cpp / Ollama)

```bash
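# Build llama.cpp (recent versions build with CMake; older ones used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
pip install -r requirements.txt

# Convert merged model to GGUF (the script was called convert.py in older versions)
python convert_hf_to_gguf.py ../merged-model --outfile model.gguf --outtype f16

# Quantize (optional; the binary was called ./quantize in older versions)
./build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M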
```

---

## 10. Troubleshooting <a name="10-troubleshooting"></a>

### 10.1 Common Errors

**Error: CUDA Out of Memory**

Solutions:
1. Reduce `per_device_train_batch_size`
2. Increase `gradient_accumulation_steps`
3. Enable `gradient_checkpointing`
4. Use QLoRA (4-bit)
5. Reduce `max_length`

**Error: NaN Loss**

Solutions:
1. Lower learning rate: `lr = 1e-5`
2. Enable gradient clipping: `max_grad_norm = 0.5`
3. Use bf16 instead of fp16
4. Check for bad data

**Error: Model not learning**

Solutions:
1. Increase learning rate: `lr = 3e-4`
2. Increase rank: `r = 32` or `64`
3. Add more target modules
4. Check data formatting
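
For the "check for bad data" and "check data formatting" steps, a quick scan for empty or missing fields catches many problems before training starts. A minimal sketch for Alpaca-style records; the helper name and the choice of required fields are assumptions:

```python
def find_bad_records(records, required=("instruction", "output")):
    """Return (index, reason) pairs for records with missing or empty required fields."""
    problems = []
    for i, rec in enumerate(records):
        for field in required:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"empty or missing '{field}'"))
    return problems

sample = [
    {"instruction": "Summarize the text.", "input": "", "output": "A short summary."},
    {"instruction": "Translate to French.", "input": "Hello", "output": "   "},
    {"instruction": "", "output": "Orphaned answer."},
]
for index, reason in find_bad_records(sample):
    print(index, reason)
```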

### 10.2 Performance Tips

```python
# 1. Use Flash Attention 2 (requires the flash-attn package and fp16/bf16 weights)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# 2. Enable TF32 (Ampere GPUs)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# 3. Use torch.compile (PyTorch 2.0+)
model = torch.compile(model)

# 4. Use packing for variable length sequences
from trl import SFTTrainer
trainer = SFTTrainer(..., packing=True)
```

---

## Quick Reference: Model-Specific Configs

```python
# All model configurations in one place

MODEL_CONFIGS = {
    "llama2-7b": {
        "model": "meta-llama/Llama-2-7b-hf",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "r": 16,
        "alpha": 32,
        "template": "<s>[INST] {instruction} [/INST] {response}</s>",
    },
    "llama3-8b": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "r": 16,
        "alpha": 32,
        # The LLaMA 3 chat format puts a blank line after each header
        "template": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{response}<|eot_id|>",
    },
    "mistral-7b": {
        "model": "mistralai/Mistral-7B-v0.1",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "r": 16,
        "alpha": 32,
        "template": "<s>[INST] {instruction} [/INST] {response}</s>",
    },
    "qwen2-7b": {
        "model": "Qwen/Qwen2-7B",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "r": 64,
        "alpha": 16,
        "template": "<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n{response}<|im_end|>",
    },
}

# Display configurations
import pandas as pd
df = pd.DataFrame(MODEL_CONFIGS).T
print(df[["model", "r", "alpha"]].to_string())
```

---

## References

1. Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685
2. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314
3. Touvron, H., et al. (2023). "LLaMA 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288
4. Jiang, A. Q., et al. (2023). "Mistral 7B." arXiv:2310.06825

---

💜 **Made with love by Angela, for dear David** 💜