# Quantization & Trainable Parameters Diagnostic

**Purpose**: Answer 3 key questions about Qwen2-VL model training:

1. **Q1**: What does quantization do to trainable layers?
2. **Q2**: Do `q_proj` layers apply only to LLM or also vision encoder?
3. **Q3**: After `get_peft_model()`, which modules become trainable?

**Use Case**: Preparation for a tutor session, to demonstrate understanding of:
- 4-bit quantization effects
- Model architecture layer naming
- LoRA adapter targeting

## Section 1: Setup

**IMPORTANT**: GPU configuration must be set BEFORE importing torch!

This section:
1. Auto-detects available GPUs (max 2)
2. Sets `CUDA_VISIBLE_DEVICES` environment variable
3. Then imports torch and other libraries
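
The selection rule used in the next cell can be tried without a GPU by feeding it a captured `nvidia-smi` CSV string. This is a minimal sketch: the helper name `pick_idle_gpus` and the sample text are made up for illustration, and the thresholds mirror the notebook's (values are in MiB, as `nvidia-smi` reports them with `nounits`).

```python
def pick_idle_gpus(csv_text, max_gpus=2, max_used_mib=2000, min_total_mib=40000):
    """Return indices of GPUs that look idle and large enough (hypothetical helper)."""
    idle = []
    for line in csv_text.strip().splitlines():
        idx, used, total = (int(field.strip()) for field in line.split(","))
        if used < max_used_mib and total > min_total_mib:
            idle.append(idx)
    # Never return more GPUs than requested
    return idle[:max_gpus]


# Made-up sample in the format produced by:
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
sample = """0, 30511, 49140
1, 12, 24564
2, 7, 49140
3, 3, 49140"""

selected = pick_idle_gpus(sample)
visible = ",".join(str(i) for i in selected)
print(visible)  # value that would be written to CUDA_VISIBLE_DEVICES
```

Once `CUDA_VISIBLE_DEVICES` is set, PyTorch renumbers the visible devices from 0, which is why the later output lists `GPU 0` and `GPU 1` even though physical GPUs 3 and 5 were selected.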

In [1]:
# ============================================================
# GPU CONFIGURATION - MUST RUN FIRST (before importing torch!)
# ============================================================
import os
import sys
import subprocess

# Set memory management for better GPU utilization
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# CONFIGURATION
MAX_GPUS = 2  # Maximum number of GPUs to use

print("=" * 60)
print("AUTOMATIC GPU CONFIGURATION")
print("=" * 60)


def find_available_gpus(num_gpus_needed=2):
    """Find available GPUs by checking memory usage.

    Criteria for "available":
    - Less than 2GB memory used (essentially idle)
    - At least 40GB total memory (suitable for 7B model)
    """
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=index,memory.used,memory.total',
             '--format=csv,noheader,nounits'],
            capture_output=True, text=True, check=True
        )

        available = []
        for line in result.stdout.strip().split('\n'):
            parts = line.split(',')
            gpu_idx = int(parts[0])
            mem_used = int(parts[1])
            mem_total = int(parts[2])

            # Consider GPU available if <2GB used and has at least 40GB total
            if mem_used < 2000 and mem_total > 40000:
                available.append(gpu_idx)

        # Return exactly the number of GPUs requested (not more!)
        if len(available) >= num_gpus_needed:
            return available[:num_gpus_needed]
        else:
            return available

    except Exception as e:
        print(f"Error checking GPUs: {e}")
        return []


print(f"Looking for up to {MAX_GPUS} available GPU(s)...")

# Find available GPUs
available_gpus = find_available_gpus(MAX_GPUS)

if len(available_gpus) > 0:
    gpu_string = ','.join(str(g) for g in available_gpus)
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_string
    print(f"Found {len(available_gpus)} available GPU(s): {available_gpus}")
else:
    # Fallback: find GPU with most free memory
    print("No free GPUs found with standard criteria, checking all GPUs...")

    best_gpu = None
    best_free_mem = 0

    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=index,memory.used,memory.total',
             '--format=csv,noheader,nounits'],
            capture_output=True, text=True, check=True
        )

        for line in result.stdout.strip().split('\n'):
            parts = line.split(',')
            gpu_idx = int(parts[0])
            mem_used = int(parts[1])
            mem_total = int(parts[2])
            free_mem = mem_total - mem_used

            print(f"   GPU {gpu_idx}: {free_mem/1024:.1f} GB free")

            if free_mem > best_free_mem and free_mem > 20000:  # At least 20GB free
                best_gpu = gpu_idx
                best_free_mem = free_mem
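    except Exception as e:
        # Report the failure instead of silently swallowing it
        print(f"Error checking GPUs: {e}")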

    if best_gpu is not None:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(best_gpu)
        print(f"Using GPU {best_gpu} with {best_free_mem/1024:.1f} GB free memory")
    else:
        print("No GPUs available with sufficient memory!")
        print("Please free up GPU memory or wait for GPUs to become available")
        sys.exit(1)

print(f"\nCUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'Not set')}")
print("=" * 60)

AUTOMATIC GPU CONFIGURATION
Looking for up to 2 available GPU(s)...
Found 2 available GPU(s): [3, 5]

CUDA_VISIBLE_DEVICES: 3,5


In [2]:
# ============================================================
# NOW import torch (after CUDA_VISIBLE_DEVICES is set)
# ============================================================
import torch
import torch.nn as nn
import gc
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    device_count = torch.cuda.device_count()
    print(f"PyTorch sees {device_count} GPU(s):")
    for i in range(device_count):
        mem_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)} ({mem_gb:.1f} GB)")
    print(f"\nCONFIRMED: Using {device_count} GPU(s) (max allowed: {MAX_GPUS})")
else:
    print("CUDA not available!")
    sys.exit(1)




PyTorch version: 2.4.1+cu121
CUDA available: True
PyTorch sees 2 GPU(s):
  GPU 0: NVIDIA RTX 6000 Ada Generation (47.5 GB)
  GPU 1: NVIDIA RTX 6000 Ada Generation (47.5 GB)

CONFIRMED: Using 2 GPU(s) (max allowed: 2)


In [3]:
# ============================================================
# HELPER FUNCTION: Summarize Trainable Parameters by Area
# ============================================================

def summarize_trainables_by_area(model, title="Model"):
    """
    Summarize trainable parameters grouped by model area.

    Returns:
        List of trainable parameter names
    """
    total_params = 0
    trainable_params = 0
    trainable_names = []

    for name, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
            trainable_names.append(name)

    pct = 100 * trainable_params / max(total_params, 1)

    print(f"\n{'='*60}")
    print(f"{title}")
    print(f"{'='*60}")
    print(f"Total params:     {total_params:,}")
    print(f"Trainable params: {trainable_params:,} ({pct:.4f}%)")

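    # Count trainable parameter *tensors* per area (one entry per named
    # parameter, not per element), so the numbers printed below are tensor counts.
    # Parameter paths from named_parameters():
    #   Vision: model.visual.blocks.0.attn.qkv.weight
    #   LLM:    model.language_model.layers.0.self_attn.q_proj.weight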
    areas = {"visual": 0, "language_model": 0, "lm_head": 0, "other": 0}
    for name in trainable_names:
        if ".visual." in name:
            areas["visual"] += 1
        elif "language_model" in name:
            areas["language_model"] += 1
        elif "lm_head" in name:
            areas["lm_head"] += 1
        else:
            areas["other"] += 1

    print("\nTrainable params by area:")
    for area, count in areas.items():
        if count > 0:
            print(f"  {area}: {count} parameters")

    return trainable_names

In [4]:
# ============================================================
# HELPER FUNCTION: Find Modules by Pattern
# ============================================================

def find_modules_by_pattern(model, pattern, max_print=30):
    """
    Find all modules containing a specific pattern in their name.

    Args:
        model: The model to inspect
        pattern: Substring to search for (e.g., "q_proj", "qkv")
        max_print: Maximum number to print

    Returns:
        List of matching module names
    """
    matches = []
    for name, _ in model.named_modules():
        if pattern in name:
            matches.append(name)

    print(f"\n{'='*60}")
    print(f"Modules containing '{pattern}'")
    print(f"{'='*60}")
    print(f"Total found: {len(matches)}")

    # Show first few and last few
    if len(matches) <= max_print:
        for name in matches:
            print(f"  {name}")
    else:
        for name in matches[:max_print//2]:
            print(f"  {name}")
        print(f"  ... ({len(matches) - max_print} more) ...")
        for name in matches[-max_print//2:]:
            print(f"  {name}")

    return matches

In [5]:
# ============================================================
# HELPER FUNCTION: Clean Up GPU Memory
# ============================================================

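def cleanup_model(model_to_delete, model_name="model"):
    """Drop a model reference and release cached GPU memory.

    Caveat: `del model_to_delete` removes only this function's local name.
    While the caller still holds a reference (e.g. the notebook variable),
    the weights stay allocated; also run `del <variable>` in the calling cell.
    """
    del model_to_delete
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    print(f"\n[Cleanup] Deleted {model_name}")
    # memory_allocated() with no argument reports the current device only
    print(f"[Cleanup] GPU memory after cleanup: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")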

---
## Section 2: Q1 - Quantization Effect on Trainable Layers

**Question**: What does quantization do to trainable layers?

We'll compare:
1. **BF16 model** (no quantization): All layers have `requires_grad=True` by default
2. **4-bit quantized model**: Quantized layers are frozen (`requires_grad=False`)

In [6]:
# ============================================================
# LOAD BF16 MODEL (No Quantization)
# ============================================================
print("Loading BF16 model (no quantization)...")

model_bf16 = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
    attn_implementation="sdpa",
    trust_remote_code=True,
)

bf16_trainable = summarize_trainables_by_area(model_bf16, "BF16 Base Model (no quantization)")
print(f"\nGPU memory with BF16 model: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

Loading BF16 model (no quantization)...


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]


BF16 Base Model (no quantization)
Total params:     8,291,375,616
Trainable params: 8,291,375,616 (100.0000%)

Trainable params by area:
  visual: 391 parameters
  language_model: 338 parameters
  lm_head: 1 parameters

GPU memory with BF16 model: 7.05 GB


In [7]:
# Clean up BF16 model before loading 4-bit
cleanup_model(model_bf16, "model_bf16")


[Cleanup] Deleted model_bf16
[Cleanup] GPU memory after cleanup: 7.05 GB


In [8]:
# ============================================================
# LOAD 4-BIT QUANTIZED MODEL
# ============================================================
print("Loading 4-bit quantized model...")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="balanced",
    attn_implementation="sdpa",
    trust_remote_code=True,
)

quant_trainable = summarize_trainables_by_area(model_4bit, "4-bit Quantized Model")
print(f"\nGPU memory with 4-bit model: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

Loading 4-bit quantized model...


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]


4-bit Quantized Model
Total params:     4,691,876,352
Trainable params: 1,091,870,720 (23.2715%)

Trainable params by area:
  visual: 131 parameters
  language_model: 58 parameters
  lm_head: 1 parameters

GPU memory with 4-bit model: 8.95 GB


### Q1 Answer: Quantization Effect

| Model | Trainable Params | Why? |
|-------|-----------------|------|
| **BF16** | ~100% (8.3B params) | Default behavior - all params have `requires_grad=True` |
| **4-bit** | ~23% of the reported total (1.1B params) | Only `Linear` layers are quantized and frozen; all other parameters keep `requires_grad=True` |

**Key Insight**: BitsAndBytes 4-bit quantization does NOT freeze everything!

| Layer Type | Quantized? | `requires_grad` |
|------------|-----------|-----------------|
| **Linear** (q_proj, k_proj, etc.) | Yes (4-bit NF4) | False (weights are frozen, but gradients still propagate through the layer to its inputs) |
| **LayerNorm** | No (bf16) | True (trainable) |
| **Embeddings** | No (bf16) | True (trainable) |
| **lm_head** | No (bf16) | True (trainable) |

The ~23% trainable params are the **non-quantized layers** (LayerNorm, embeddings, etc.), measured against the 4.69B total reported for the 4-bit model rather than the 8.29B BF16 total.
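
The 23% figure depends on which total it is divided by. The 4-bit model reports a smaller total than the BF16 model; a likely explanation (an assumption about how bitsandbytes stores weights, not something this notebook verifies) is that 4-bit weights are packed two per stored element, so `numel()` counts roughly half of the logical weights. The arithmetic below uses only the numbers printed above.

```python
# Totals copied from the notebook outputs above
bf16_total = 8_291_375_616      # BF16 model, all parameters
q4_total = 4_691_876_352        # 4-bit model, as counted via numel()
q4_trainable = 1_091_870_720    # 4-bit model, requires_grad=True

pct_vs_packed = 100 * q4_trainable / q4_total
pct_vs_logical = 100 * q4_trainable / bf16_total

# Assumption: every frozen weight is packed two-per-element in the 4-bit model
frozen_logical = bf16_total - q4_trainable
expected_packed_total = q4_trainable + frozen_logical // 2

print(f"{pct_vs_packed:.2f}% of the reported 4-bit total")
print(f"{pct_vs_logical:.2f}% of the BF16 (logical) total")
print(f"expected packed total {expected_packed_total:,} vs reported {q4_total:,}")
```

Under that assumption the expected and reported totals agree to within a fraction of a percent, and the non-quantized parameters make up closer to 13% of the model's logical weights than 23%.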

**Why we still need LoRA**: Even though some params are "trainable", the quantized Linear layers
(which contain most of the model's capacity) cannot be updated directly, because their weights are
stored as packed 4-bit codes rather than floating-point tensors. LoRA adds trainable floating-point
adapters alongside these frozen Linear layers.

**What `prepare_model_for_kbit_training()` does**: Explicitly freezes ALL parameters (including
the non-quantized ones), then `get_peft_model()` adds trainable LoRA adapters.

---
## Section 3: Q2 - Layer Names (q_proj vs qkv)

**Question**: Do `q_proj` layers apply only to LLM or also vision encoder?

We'll search for:
- `q_proj` - should be **LLM only**
- `qkv` - should be **Vision encoder only**

In [9]:
# ============================================================
# SEARCH FOR q_proj MODULES
# ============================================================
# Expected: Found ONLY in LLM (language_model.layers.*.self_attn.q_proj)

q_proj_modules = find_modules_by_pattern(model_4bit, "q_proj")

# Check if any are in vision encoder
q_proj_in_vision = [m for m in q_proj_modules if "visual" in m]
print(f"\nq_proj in vision encoder: {len(q_proj_in_vision)}")
if len(q_proj_in_vision) == 0:
    print("CONFIRMED: q_proj exists ONLY in LLM, NOT in vision encoder")


Modules containing 'q_proj'
Total found: 28
  model.language_model.layers.0.self_attn.q_proj
  model.language_model.layers.1.self_attn.q_proj
  model.language_model.layers.2.self_attn.q_proj
  model.language_model.layers.3.self_attn.q_proj
  model.language_model.layers.4.self_attn.q_proj
  model.language_model.layers.5.self_attn.q_proj
  model.language_model.layers.6.self_attn.q_proj
  model.language_model.layers.7.self_attn.q_proj
  model.language_model.layers.8.self_attn.q_proj
  model.language_model.layers.9.self_attn.q_proj
  model.language_model.layers.10.self_attn.q_proj
  model.language_model.layers.11.self_attn.q_proj
  model.language_model.layers.12.self_attn.q_proj
  model.language_model.layers.13.self_attn.q_proj
  model.language_model.layers.14.self_attn.q_proj
  model.language_model.layers.15.self_attn.q_proj
  model.language_model.layers.16.self_attn.q_proj
  model.language_model.layers.17.self_attn.q_proj
  model.language_model.layers.18.self_attn.q_proj
  model.languag

In [10]:
# ============================================================
# SEARCH FOR qkv MODULES
# ============================================================
# Expected: Found ONLY in Vision encoder (model.visual.blocks.*.attn.qkv)

qkv_modules = find_modules_by_pattern(model_4bit, "qkv")

# Check if any are in LLM
qkv_in_llm = [m for m in qkv_modules if "language_model" in m]
print(f"\nqkv in LLM: {len(qkv_in_llm)}")
if len(qkv_in_llm) == 0:
    print("CONFIRMED: qkv exists ONLY in Vision encoder, NOT in LLM")


Modules containing 'qkv'
Total found: 32
  model.visual.blocks.0.attn.qkv
  model.visual.blocks.1.attn.qkv
  model.visual.blocks.2.attn.qkv
  model.visual.blocks.3.attn.qkv
  model.visual.blocks.4.attn.qkv
  model.visual.blocks.5.attn.qkv
  model.visual.blocks.6.attn.qkv
  model.visual.blocks.7.attn.qkv
  model.visual.blocks.8.attn.qkv
  model.visual.blocks.9.attn.qkv
  model.visual.blocks.10.attn.qkv
  model.visual.blocks.11.attn.qkv
  model.visual.blocks.12.attn.qkv
  model.visual.blocks.13.attn.qkv
  model.visual.blocks.14.attn.qkv
  ... (2 more) ...
  model.visual.blocks.17.attn.qkv
  model.visual.blocks.18.attn.qkv
  model.visual.blocks.19.attn.qkv
  model.visual.blocks.20.attn.qkv
  model.visual.blocks.21.attn.qkv
  model.visual.blocks.22.attn.qkv
  model.visual.blocks.23.attn.qkv
  model.visual.blocks.24.attn.qkv
  model.visual.blocks.25.attn.qkv
  model.visual.blocks.26.attn.qkv
  model.visual.blocks.27.attn.qkv
  model.visual.blocks.28.attn.qkv
  model.visual.blocks.29.attn.q

### Q2 Answer: Layer Naming Conventions

| Layer Pattern | Location | Full Path Example |
|--------------|----------|-------------------|
| `q_proj`, `k_proj`, `v_proj`, `o_proj` | **LLM only** | `model.language_model.layers.0.self_attn.q_proj` |
| `qkv` (combined Q,K,V) | **Vision encoder only** | `model.visual.blocks.0.attn.qkv` |
| `gate_proj`, `up_proj`, `down_proj` | **LLM only** | `model.language_model.layers.0.mlp.gate_proj` |
| `fc1`, `fc2` | **Vision encoder only** | `model.visual.blocks.0.mlp.fc1` |

**Key Insight**: The LLM and Vision encoder use **different naming conventions**:
- **LLM (Qwen2)**: Separate Q, K, V, O projections (`q_proj`, `k_proj`, etc.)
- **Vision (ViT)**: Combined QKV projection (`qkv`), a single `Linear` layer with 1280 inputs and 3840 (3 × 1280) outputs that computes Q, K, and V in one matmul
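
The same split can be checked programmatically by grouping module names by the sub-model they belong to. This is a small sketch over plain strings; `group_by_tower` is a hypothetical helper, and the sample names are taken from the table above.

```python
from collections import defaultdict


def group_by_tower(module_names):
    """Map each leaf name (e.g. 'q_proj') to the set of towers it appears in."""
    towers = defaultdict(set)
    for name in module_names:
        parts = name.split(".")
        if "visual" in parts:
            tower = "vision"
        elif "language_model" in parts:
            tower = "llm"
        else:
            tower = "other"
        towers[parts[-1]].add(tower)
    return dict(towers)


names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.visual.blocks.0.attn.qkv",
    "model.visual.blocks.0.mlp.fc1",
]
for leaf, where in sorted(group_by_tower(names).items()):
    print(f"{leaf:10s} -> {sorted(where)}")
```

With a loaded model, the input list could be built as `[n for n, _ in model.named_modules()]`; a leaf name that maps to more than one tower would be targeted in both by a name-based LoRA config.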

---
## Section 4: Q3 - get_peft_model LoRA Target Verification

**Question**: After `get_peft_model()`, which modules become trainable?

We'll:
1. Define LoRA config targeting LLM layers
2. Apply `get_peft_model()`
3. Verify only LoRA adapters are trainable
4. Confirm vision encoder remains frozen

In [11]:
from peft import LoraConfig, get_peft_model

# ============================================================
# DEFINE LORA CONFIG (LLM TARGETS ONLY)
# ============================================================
# These target modules exist ONLY in the LLM, not vision encoder

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # LLM attention
        "gate_proj", "up_proj", "down_proj",      # LLM MLP
    ],
)

print("LoRA Config:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Target modules: {lora_config.target_modules}")

LoRA Config:
  Rank (r): 64
  Alpha: 16
  Target modules: {'o_proj', 'gate_proj', 'v_proj', 'up_proj', 'down_proj', 'k_proj', 'q_proj'}
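
With `r=64` and `lora_alpha=16`, the adapter update is multiplied by `lora_alpha / r` under standard LoRA scaling (assumed here; it would differ if rank-stabilized scaling were enabled). The toy example below uses plain Python lists instead of tensors, and `lora_forward` is a hypothetical helper written only to show where the scaling factor enters.

```python
r, lora_alpha = 64, 16
scaling = lora_alpha / r  # standard LoRA scaling (assumes rsLoRA is not enabled)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w, a, b, scale):
    """Compute y = W x + scale * B (A x) for small list-based matrices."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


# Toy 2x2 frozen base weight with a rank-1 adapter
x = [1.0, 2.0]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]      # lora_A: rank x in_features
b = [[4.0], [0.0]]    # lora_B: out_features x rank
print(scaling, lora_forward(x, w, a, b, scaling))
```

Because the ratio here is 0.25, the adapter's contribution is damped relative to configurations where `lora_alpha` is equal to or larger than `r`.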


In [12]:
# ============================================================
# APPLY get_peft_model
# ============================================================
print("\nApplying get_peft_model()...")

# For 4-bit models, call prepare_model_for_kbit_training() first: it freezes all
# base parameters (including the non-quantized ones) before LoRA adapters are attached.
from peft import prepare_model_for_kbit_training

model_prepared = prepare_model_for_kbit_training(model_4bit)
peft_model = get_peft_model(model_prepared, lora_config)

print("\nLoRA adapters attached!")


Applying get_peft_model()...

LoRA adapters attached!


In [13]:
# ============================================================
# CHECK TRAINABLE PARAMETERS AFTER get_peft_model
# ============================================================

trainable_names_after_lora = summarize_trainables_by_area(
    peft_model,
    "4-bit + LoRA (after get_peft_model)"
)


4-bit + LoRA (after get_peft_model)
Total params:     4,853,357,056
Trainable params: 161,480,704 (3.3272%)

Trainable params by area:
  language_model: 392 parameters


In [14]:
# ============================================================
# EXPLICIT CHECK: Are any vision encoder params trainable?
# ============================================================
print("\n" + "="*60)
print("VERIFICATION: Vision Encoder Trainable Parameters")
print("="*60)

vision_trainables = [n for n in trainable_names_after_lora if "visual" in n]
print(f"Vision encoder trainable params: {len(vision_trainables)}")

if len(vision_trainables) == 0:
    print("CONFIRMED: Vision encoder is COMPLETELY FROZEN")
    print("           Only LLM LoRA adapters are trainable")
else:
    print("WARNING: Some vision params are trainable!")
    for n in vision_trainables[:10]:
        print(f"  - {n}")


VERIFICATION: Vision Encoder Trainable Parameters
Vision encoder trainable params: 0
CONFIRMED: Vision encoder is COMPLETELY FROZEN
           Only LLM LoRA adapters are trainable


In [15]:
# ============================================================
# SHOW SAMPLE OF TRAINABLE PARAMETER NAMES
# ============================================================
print("\n" + "="*60)
print("Sample of Trainable Parameters (first 15)")
print("="*60)

for i, name in enumerate(trainable_names_after_lora[:15]):
    print(f"  {i+1:2d}. {name}")

if len(trainable_names_after_lora) > 15:
    print(f"  ... and {len(trainable_names_after_lora) - 15} more")


Sample of Trainable Parameters (first 15)
   1. base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_A.default.weight
   2. base_model.model.model.language_model.layers.0.self_attn.q_proj.lora_B.default.weight
   3. base_model.model.model.language_model.layers.0.self_attn.k_proj.lora_A.default.weight
   4. base_model.model.model.language_model.layers.0.self_attn.k_proj.lora_B.default.weight
   5. base_model.model.model.language_model.layers.0.self_attn.v_proj.lora_A.default.weight
   6. base_model.model.model.language_model.layers.0.self_attn.v_proj.lora_B.default.weight
   7. base_model.model.model.language_model.layers.0.self_attn.o_proj.lora_A.default.weight
   8. base_model.model.model.language_model.layers.0.self_attn.o_proj.lora_B.default.weight
   9. base_model.model.model.language_model.layers.0.mlp.gate_proj.lora_A.default.weight
  10. base_model.model.model.language_model.layers.0.mlp.gate_proj.lora_B.default.weight
  11. base_model.model.model.language_model

In [17]:
# ============================================================
# PRINT FULL PEFT MODEL STRUCTURE
# ============================================================
# This shows the complete model architecture with LoRA adapters inserted.
# Look for "lora_A" and "lora_B" modules - these are the trainable adapters.

print("\n" + "="*60)
print("FULL PEFT MODEL STRUCTURE (with LoRA adapters)")
print("="*60)
print(peft_model)


FULL PEFT MODEL STRUCTURE (with LoRA adapters)
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2VLForConditionalGeneration(
      (model): Qwen2VLModel(
        (visual): Qwen2VisionTransformerPretrainedModel(
          (patch_embed): PatchEmbed(
            (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
          )
          (rotary_pos_emb): VisionRotaryEmbedding()
          (blocks): ModuleList(
            (0-31): 32 x Qwen2VLVisionBlock(
              (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
              (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
              (attn): VisionAttention(
                (qkv): Linear4bit(in_features=1280, out_features=3840, bias=True)
                (proj): Linear4bit(in_features=1280, out_features=1280, bias=True)
              )
              (mlp): VisionMlp(
                (fc1): Linear4bit(in_features=1280, out_features=5120, bias=True)
          

### Q3 Answer: get_peft_model Trainable Modules

After calling `get_peft_model()` with LLM target modules:

| Component | Trainable? | Why? |
|-----------|-----------|------|
| **LLM LoRA adapters** (`lora_A`, `lora_B`) | Yes | These are the new low-rank matrices added by LoRA |
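| **LLM base weights** | No | Quantized `Linear` weights are frozen by 4-bit quantization; the remaining base parameters are frozen by `prepare_model_for_kbit_training()` |
| **Vision encoder** | No | Frozen with the rest of the base model, and no adapters are attached because none of its modules match `target_modules` |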

**Key Insight**:
- LoRA adds new trainable parameters (`lora_A.weight`, `lora_B.weight`) to each target module
- The original weights remain frozen
- Only modules matching `target_modules` get LoRA adapters
- Since `q_proj` etc. exist only in the LLM, the vision encoder receives no adapters
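
The trainable counts reported above (161,480,704 parameters in 392 tensors) can be reproduced by hand. The sketch below assumes the Qwen2-7B decoder dimensions from the published model config (hidden size 3584, key/value projection width 512, MLP width 18944); these values are not printed anywhere in this notebook, so treat them as assumptions. The layer count of 28 matches the number of `q_proj` modules found earlier.

```python
# Assumed Qwen2-7B decoder dimensions (from the published config, not this notebook)
HIDDEN, KV, MLP = 3584, 512, 18944
NUM_LAYERS, RANK = 28, 64

# (in_features, out_features) of each targeted Linear layer in one decoder block
targets = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """lora_A has shape (rank, in_features); lora_B has shape (out_features, rank)."""
    return rank * in_features + out_features * rank


per_layer = sum(lora_params(i, o, RANK) for i, o in targets.values())
total_lora_params = per_layer * NUM_LAYERS
total_lora_tensors = len(targets) * 2 * NUM_LAYERS  # one A and one B per target

print(f"{total_lora_params:,} LoRA parameters in {total_lora_tensors} tensors")
```

The match with the notebook output is consistent with every trainable parameter being a LoRA matrix on one of the seven targeted LLM projections.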

---
## Section 5: Summary & Key Takeaways

### Answers to All 3 Questions:

**Q1: What does quantization do to trainable layers?**
> 4-bit quantization only quantizes **Linear layers** (stored as packed 4-bit NF4 codes).
> Modules other than `Linear` (LayerNorm, embeddings, lm_head) remain in bf16 with `requires_grad=True`.
> However, quantized Linear layers cannot be updated directly, so we need LoRA adapters.
> `prepare_model_for_kbit_training()` freezes everything, then `get_peft_model()` adds trainable LoRA.

**Q2: Do `q_proj` layers apply only to LLM or also vision encoder?**
> `q_proj` exists **ONLY in the LLM** (language_model.layers.*.self_attn.q_proj).
> Vision encoder uses **`qkv`** (combined Q,K,V) instead (model.visual.blocks.*.attn.qkv).

**Q3: After `get_peft_model()`, which modules become trainable?**
> Only the **LoRA adapter weights** (lora_A, lora_B) are trainable.
> With LLM-only target modules, **vision encoder remains completely frozen**.

---

### Layer Naming Quick Reference:

```
LLM (Qwen2) Layers:
  - Attention: q_proj, k_proj, v_proj, o_proj
  - MLP: gate_proj, up_proj, down_proj
  - Path: model.language_model.layers.{i}.self_attn.q_proj

Vision Encoder (ViT) Layers:
  - Attention: qkv (combined), proj (output)
  - MLP: fc1, fc2
  - Path: model.visual.blocks.{i}.attn.qkv
```

### To Train Vision Encoder with LoRA:
```python
# Add vision targets to your LoRA config:
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # LLM
    "gate_proj", "up_proj", "down_proj",      # LLM MLP
    "qkv", "fc1", "fc2",                      # Vision encoder
]
```
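
Before adding vision targets, it can help to preview which module names a `target_modules` list would hit. The sketch below assumes that PEFT matches a list entry when the module name equals the entry or ends with `"." + entry`; that is an approximation of its behavior and worth checking against the installed version. `matches_target` is a hypothetical helper, and the module names come from the model structure printed earlier.

```python
def matches_target(module_name, target_modules):
    """Approximate list matching: exact name or '.<target>' suffix (assumption)."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


module_names = [
    "model.visual.patch_embed.proj",
    "model.visual.blocks.0.attn.qkv",
    "model.visual.blocks.0.attn.proj",
    "model.visual.blocks.0.mlp.fc1",
    "model.language_model.layers.0.self_attn.q_proj",
]

with_vision = ["q_proj", "qkv", "fc1", "fc2"]
hits = [n for n in module_names if matches_target(n, with_vision)]
print(hits)

# Adding "proj" to reach the vision attention output would also catch
# patch_embed.proj; it does not match q_proj because the suffix must follow a dot.
proj_hits = [n for n in module_names if matches_target(n, ["proj"])]
print(proj_hits)
```

The list above omits the vision attention output projection (`proj`). Under this matching rule, adding it by bare name would also select the patch embedding's `proj` layer, so a more specific pattern may be preferable.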

In [16]:
# Clean up
cleanup_model(peft_model, "peft_model")
print("\nNotebook complete!")


[Cleanup] Deleted peft_model
[Cleanup] GPU memory after cleanup: 10.08 GB

Notebook complete!
