We need to download the Llama 3.2 1B Instruct model (NeMo checkpoint) from NGC.

## IMPORTANT: NeMo Framework Setup

This notebook requires the NVIDIA NeMo framework for LoRA training. We'll clone the NeMo repository to access the necessary training scripts.

**NeMo Version Compatibility**: 
- The downloaded model uses **NeMo 2.0** distributed checkpoint format (.distcp files)
- The training scripts are backward compatible and can load both NeMo 1.0 and 2.0 formats
- We show both script-based (simpler) and API-based (modern) approaches

**Training Experience**: In this workshop, you'll train your own LoRA adapter from scratch! This gives you hands-on experience with:
- Setting up training data
- Configuring LoRA parameters
- Running the actual training
- Testing your custom adapter

The training process takes approximately 5-10 minutes for our small example dataset.


In [None]:
# Clone NeMo repository if not already present
import os

# Define the NeMo path within the presenter folder
nemo_path = '/root/verb-workspace/NIM Workshop - Presenter/NeMo'

if not os.path.exists(nemo_path):
    print("Cloning NeMo repository...")
    !git clone https://github.com/NVIDIA/NeMo.git "{nemo_path}"
    print("NeMo repository cloned successfully!")
else:
    print("NeMo repository already exists.")
    
# Verify the training scripts exist
nemo_scripts = [
    f'{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py',
    f'{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py',
    f'{nemo_path}/scripts/nlp_language_modeling/merge_lora_weights/merge.py'
]

print("\nChecking for required NeMo scripts:")
for script in nemo_scripts:
    if os.path.exists(script):
        print(f"✓ Found: {os.path.basename(script)}")
    else:
        print(f"✗ Missing: {script}")


🎤 **PRESENTER SCRIPT:**

"Welcome to the most transformative part of our journey - LoRA fine-tuning! This is where you go from using someone else's AI to creating YOUR OWN specialized AI.

Let me start with a real story. A Fortune 500 company came to us with a problem. They loved Llama 3 70B but needed it to understand their internal jargon - thousands of product codes, technical terms, and specific procedures. 

The traditional solution? Fine-tune the entire 70B parameter model. That would require:
- 8 H100 GPUs ($300,000+ hardware)
- 2 weeks of training time  
- Machine learning PhD to manage it
- $50,000+ in electricity

Their budget? One RTX 4090 and a week.

Enter LoRA - Low-Rank Adaptation. Instead of training all 70 billion parameters, LoRA adds small 'adapter' matrices that modify the model's behavior. Imagine it like putting specialized glasses on the model - it sees everything through your custom lens.

The results for that company?
- Trained on 1 RTX 4090
- 6 hours total time
- Junior developer managed it
- Under $100 in costs
- Model performed BETTER than full fine-tuning for their use case

Today, I'll show you exactly how to do this. By the end, you'll be able to create custom AI models tailored to your exact needs!"


# Part 3: LoRA Fine-tuning with NeMo

This notebook demonstrates how to fine-tune models using LoRA (Low-Rank Adaptation) with NVIDIA NeMo framework.

## What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- Adds trainable low-rank matrices to frozen model weights
- Reduces the memory needed for gradients and optimizer state by 90%+, since only the adapters are trained
- Enables fine-tuning large models on consumer GPUs
- Produces small adapter files (~10-100MB vs full model)


🎤 **PRESENTER SCRIPT:**

"Let's set up our environment for LoRA training. The requirements are surprisingly modest compared to full fine-tuning."


## 1. Setup Environment


🎤 **PRESENTER SCRIPT:**

"We need a few key packages for LoRA training. Let me explain each one:

- `jsonlines`: For handling our training data format
- `transformers`: HuggingFace's library, useful for tokenization
- `omegaconf`: YAML configuration management (very clean!)
- `pytorch-lightning`: Handles distributed training, logging, checkpoints

[RUN THE CELL]

Notice we're NOT installing the full NeMo framework for this demo. In production, you'd use NeMo for its optimized training loops, but these packages are enough to understand the concepts.

While this installs, let me mention - LoRA was invented by Microsoft researchers in 2021. In just 2 years, it's revolutionized how we customize language models. The paper has 3000+ citations!"


In [None]:
# Install NeMo (if not already installed)
# Note: This should be run in the NeMo directory
# !cd "/root/verb-workspace/NIM Workshop - Presenter/NeMo" && pip install -e ".[all]"

# For this tutorial, we'll install minimal requirements
!pip install jsonlines transformers omegaconf pytorch-lightning

🎤 **PRESENTER SCRIPT:**

"Let's check our training hardware:

[RUN THE CELL]

For LoRA training, here's what you can accomplish with different GPUs:

**RTX 4090 (24GB)**:
- Llama 3.1 8B: Full LoRA training ✓
- Llama 2 13B: LoRA with gradient checkpointing ✓
- Llama 3.1 70B: does not fit on one card; even 4-bit quantized weights need ~35GB ✗

**A100 40GB**:
- All of the above plus...
- Llama 3.1 70B: LoRA with 4-bit quantization (tight: ~35GB of weights alone) ✓
- Multiple LoRA adapters simultaneously ✓

**Consumer GPUs (16GB)**:
- Llama 3.1 8B: LoRA with small batch sizes ✓
- Mistral 7B: Full LoRA training ✓

The memory formula: 
- Base model (frozen): ~2 bytes per parameter (bf16)
- LoRA adapters: ~0.02 bytes per base parameter (adapters are ~1% of the base)
- Gradients & optimizer: ~16 bytes per trainable parameter (fp32 master weights, gradients, and two Adam moments)

So for Llama 3.1 8B:
- Base: 16GB
- LoRA: 160MB  
- Training overhead: ~1.3GB (80M trainable parameters × 16 bytes)
- Total: ~17.5GB plus activations (fits in 24GB GPU!)"
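
The arithmetic in the memory formula can be reproduced in a few lines. This is a rough sketch, not a profiler: the function name `lora_memory_gb` and its defaults are assumptions made for illustration, and activations are deliberately left out. With 16 bytes per trainable parameter (an fp32 master copy, a gradient, and two Adam moments) the overhead comes out at the ~1.3GB quoted in the script.

```python
def lora_memory_gb(n_params, lora_fraction=0.01, weight_bytes=2, train_bytes=16):
    """Rough GPU memory for LoRA training in GB, excluding activations.

    All defaults are assumptions for illustration:
      lora_fraction - adapter parameters as a share of base parameters
      weight_bytes  - bytes per stored weight (2 for bf16)
      train_bytes   - gradient + optimizer bytes per trainable parameter
    """
    trainable = n_params * lora_fraction
    base = n_params * weight_bytes
    adapters = trainable * weight_bytes
    overhead = trainable * train_bytes
    return {
        "base_gb": base / 1e9,
        "adapters_gb": adapters / 1e9,
        "overhead_gb": overhead / 1e9,
        "total_gb": (base + adapters + overhead) / 1e9,
    }


estimate = lora_memory_gb(8e9)  # Llama 3.1 8B
for name, value in estimate.items():
    print(f"{name:>12}: {value:6.2f}")
```

Activation memory grows with batch size and sequence length, so treat the total as a floor rather than a guarantee that a job fits.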


### NeMo Framework Note

The NeMo scripts we'll use for training are already accessible from the cloned repository. Full NeMo package installation is optional - the training scripts work with our current environment.

**What we'll do:**
- Use NeMo's production training scripts directly
- Train a real LoRA adapter (5-10 minutes)
- Test it with actual inference


In [None]:
# Note: Full NeMo installation can take 20-30 minutes
# For this workshop, we'll use the cloned NeMo scripts without full installation
# The training scripts work with our existing environment

# If you need full NeMo features, uncomment these lines:
# !cd "/root/verb-workspace/NIM Workshop - Presenter/NeMo" && pip install -e ".[all]"
# !pip install megatron-core

print("✅ We'll use the NeMo training scripts directly.")
print("🚀 You'll train your own LoRA adapter in this workshop!")
print("⏱️ Training will take approximately 5-10 minutes.")


In [21]:
import os
import json
import jsonlines
from omegaconf import OmegaConf
import torch

# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


PyTorch version: 2.3.0a0+ebedce2
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
GPU memory: 84.97 GB


🎤 **PRESENTER SCRIPT:**

"Let's organize our workspace professionally:

[RUN THE CELL]

Good project structure is crucial:
- `data/`: Training and validation datasets
- `models/`: Saved checkpoints and final models
- `configs/`: YAML configurations for experiments

In a real project, you'd also have:
- `logs/`: TensorBoard logs
- `scripts/`: Training and evaluation scripts
- `results/`: Metrics and analysis
- `tests/`: Unit tests for data processing

Organization pays dividends when you're running multiple experiments!"


In [22]:
# Create directories
os.makedirs("lora_tutorial/data", exist_ok=True)
os.makedirs("lora_tutorial/models", exist_ok=True)
os.makedirs("lora_tutorial/configs", exist_ok=True)


## 2. Prepare Dataset


🎤 **PRESENTER SCRIPT:**

"Now for the SECRET SAUCE - your training data. This is what makes your model unique. Let me create a customer service dataset as an example:

[RUN THE CELL]

Look at this data carefully. Each example has:
- `input`: The customer's question with context
- `output`: The EXACT response you want

Key insights for great training data:
1. **Quality > Quantity**: 1,000 excellent examples beat 100,000 mediocre ones
2. **Diversity**: Cover edge cases, different phrasings, various scenarios  
3. **Consistency**: Same style, tone, format across examples
4. **Realism**: Use actual customer queries if possible

For this demo, we have 5 examples. In production:
- Minimum: 500-1,000 examples
- Sweet spot: 5,000-10,000 examples
- Diminishing returns: >50,000 examples

Pro tip: Start small! Train with 100 examples first, test the model, identify gaps, add more targeted examples. Iterative improvement beats massive datasets.

The JSONL format is perfect because:
- Each line is independent (parallelizable)
- Streamable for large datasets
- Human-readable for debugging
- Standard format tools understand"
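
To make the "each line is independent" point concrete, here is a small sketch that writes two records as JSONL and validates the file line by line. The helper `validate_jsonl` and the shortened sample records are illustrative assumptions; only the `input` and `output` keys come from this notebook.

```python
import json
import tempfile
from pathlib import Path

# Shortened, illustrative records using the notebook's input/output keys
records = [
    {"input": "User: How do I reset my password?\n\nAssistant:",
     "output": "Click 'Forgot Password' on the login page."},
    {"input": "User: What is your return policy?\n\nAssistant:",
     "output": "Returns are accepted within 30 days."},
]


def validate_jsonl(path, required=("input", "output")):
    """Return (line_number, problem) tuples; an empty list means the file is clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((n, "blank line"))
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((n, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((n, "not a JSON object"))
                continue
            missing = [key for key in required if not row.get(key)]
            if missing:
                problems.append((n, f"missing or empty: {missing}"))
    return problems


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    # One JSON object per line; json.dumps escapes embedded newlines as \n
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    problems = validate_jsonl(path)

print(f"{line_count} lines, {len(problems)} problems")
```

Because every line parses on its own, a single bad record can be reported by line number without discarding the rest of the file.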


In [23]:
# Create a sample dataset for customer support fine-tuning
training_data = [
    {
        "input": "User: My order hasn't arrived yet. Order number is 12345.\n\nAssistant:",
        "output": "I apologize for the delay with your order #12345. Let me check the status for you right away. I'll need to verify some details first to ensure your privacy and security."
    },
    {
        "input": "User: How do I reset my password?\n\nAssistant:",
        "output": "I'd be happy to help you reset your password. For security, please click on 'Forgot Password' on the login page, enter your email address, and follow the instructions sent to your inbox."
    },
    {
        "input": "User: What is your return policy?\n\nAssistant:",
        "output": "Our return policy allows returns within 30 days of purchase with original receipt. Items must be in original condition with tags attached. Refunds are processed within 5-7 business days."
    },
    {
        "input": "User: I received a damaged product. What should I do?\n\nAssistant:",
        "output": "I'm sorry to hear you received a damaged product. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately."
    },
    {
        "input": "User: Do you offer international shipping?\n\nAssistant:",
        "output": "Yes, we offer international shipping to over 50 countries. Shipping rates and delivery times vary by destination. You can check availability and costs at checkout."
    }
]

# Save training data
with jsonlines.open('lora_tutorial/data/train.jsonl', 'w') as writer:
    writer.write_all(training_data)

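# Create validation data (a subset of the training set: fine for a demo,
# but validation loss will look better than it would on held-out examples)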
val_data = training_data[:2]
with jsonlines.open('lora_tutorial/data/val.jsonl', 'w') as writer:
    writer.write_all(val_data)

print(f"Created {len(training_data)} training examples")
print(f"Created {len(val_data)} validation examples")


Created 5 training examples
Created 2 validation examples
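
The cell above reuses the first two training examples for validation. With a larger dataset you would normally hold validation examples out of training. A minimal sketch of such a split follows; the function name `train_val_split`, the 20% fraction, and the placeholder examples are assumptions for illustration.

```python
import random


def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_val = max(1, round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


# Placeholder examples; in the notebook these would be the customer support records
examples = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(10)]
train, val = train_val_split(examples)
overlap = [e for e in val if e in train]
print(f"train={len(train)} val={len(val)} overlap={len(overlap)}")
```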


## 3. Understanding LoRA Implementation

🎤 **PRESENTER SCRIPT:**

"Let me show you how LoRA actually works under the hood. This is a simplified implementation for educational purposes:

[RUN THE CELL]

WOW! Look at those numbers:
- Original layer: 16,777,216 parameters
- LoRA adaptation: 262,144 parameters  
- Reduction: 98.4%!

Here's the mathematical magic:
- Original: Y = X × W (where W is 4096×4096)
- LoRA: Y = X × W + X × A × B × (α/r)
  - A is 4096×32 (down-projection)
  - B is 32×4096 (up-projection)
  - W remains frozen!

The intuition: Instead of changing the entire highway (W), we add a small side road (A×B) that modifies traffic flow.

Why this works:
1. Neural networks have low intrinsic rank
2. Most fine-tuning changes lie in a low-dimensional subspace
3. We're learning the 'diff' not the whole model

Real-world impact: Meta trains separate LoRA adapters for 100+ languages on the same base model. Each adapter is ~100MB instead of 140GB!"

In [24]:
# This is a simplified demo of what LoRA training looks like
# In practice, you would use NeMo's training scripts

import torch.nn as nn

class LoRALayer(nn.Module):
    """Simplified LoRA layer for demonstration"""
    def __init__(self, in_features, out_features, rank=16, alpha=16):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # LoRA decomposition: W = W0 + (A @ B) * scaling, with x multiplied on the left
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        
    def forward(self, x, base_weight):
        # Original forward: y = xW
        base_output = x @ base_weight
        
        # LoRA forward: y = xW + (x @ A @ B) * scaling
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        
        return base_output + lora_output

# Demonstrate parameter efficiency
in_features, out_features = 4096, 4096
rank = 32

# Original parameters
original_params = in_features * out_features
print(f"Original layer parameters: {original_params:,}")

# LoRA parameters
lora_params = (in_features * rank) + (rank * out_features)
print(f"LoRA parameters: {lora_params:,}")

# Reduction
reduction = (1 - lora_params / original_params) * 100
print(f"Parameter reduction: {reduction:.1f}%")

Original layer parameters: 16,777,216
LoRA parameters: 262,144
Parameter reduction: 98.4%
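
The forward pass above has two properties worth checking numerically: with `B` initialized to zero the adapter changes nothing, and a trained adapter can be folded into the base weight without changing the output. The sketch below uses plain Python lists and toy sizes so it runs without PyTorch or a GPU. The helper names and the 6×6 dimensions are assumptions for illustration, and this is not how NeMo's merge script is implemented.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r, alpha = 6, 2, 4           # toy sizes; the demo above uses d=4096, rank=32
scaling = alpha / r

W = rand_matrix(rng, d, d)      # frozen base weight
A = rand_matrix(rng, d, r)      # down-projection
B = rand_matrix(rng, r, d)      # up-projection, non-zero as if already trained
x = rand_matrix(rng, 1, d)      # a single input row

# Adapter path: y = xW + (x @ A @ B) * scaling
y_adapter = add_scaled(matmul(x, W), matmul(matmul(x, A), B), scaling)

# Merged path: fold the update into the weight once, then y = x @ W_merged
W_merged = add_scaled(W, matmul(A, B), scaling)
y_merged = matmul(x, W_merged)
max_diff = max(abs(p - q) for p, q in zip(y_adapter[0], y_merged[0]))

# At initialization B is all zeros, so the adapter contributes nothing
B_zero = [[0.0] * d for _ in range(r)]
y_init = add_scaled(matmul(x, W), matmul(matmul(x, A), B_zero), scaling)
starts_at_base = y_init == matmul(x, W)

print(f"max difference adapter vs merged: {max_diff:.2e}")
print(f"zero-initialized B reproduces the base output: {starts_at_base}")
```

The two paths agree up to floating-point rounding, which is why an adapter can either be kept separate and swapped at serving time or merged into the base weights.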


## 4. Understanding LoRA Training Parameters

Let's look at the key parameters we'll use in the actual training command. NeMo uses Hydra configuration, allowing us to pass parameters directly via command line:

🎤 **PRESENTER SCRIPT:**

"Let me explain the key parameters we'll use in our actual training command. These are passed directly to NeMo's training script:

**LoRA Specific Parameters**:
- `model.peft.peft_scheme=lora`: Enables LoRA training
- `model.peft.lora_tuning.adapter_dim=32`: The 'rank' of LoRA matrices
  - 8: Minimal adaptation, fastest training
  - 16: Good for most tasks
  - 32: Our choice - balanced capacity
  - 64+: Approaching full fine-tuning
  
- `model.peft.lora_tuning.target_modules=[attention_qkv]`: Which layers to adapt
  - attention_qkv: Query, Key, Value matrices (most common)
  - attention_dense: Output projection
  - mlp_fc1/fc2: Feed-forward layers
  
- `model.peft.lora_tuning.adapter_dropout=0.1`: Prevents overfitting

**Training Parameters**:
- `trainer.max_steps=50`: Number of training steps
- `model.optim.lr=5e-4`: Learning rate (10x higher than full fine-tuning!)
- `model.global_batch_size=2`: Total batch size across all GPUs
- `trainer.precision=bf16-mixed`: Mixed precision for efficiency

**Key Insight**: We're training ~0.5% of parameters but getting 95% of the performance. That's the LoRA magic!

[RUN THE CELL TO SEE THE FULL COMMAND]

In production, you'd experiment with these values to optimize for your specific use case."
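
To get a feel for what `adapter_dim` costs, the sketch below counts adapter parameters for the rank values listed above. It assumes the square 4096×4096 layer from the earlier demo; NeMo's fused `attention_qkv` projection has a different output size, so the real counts for a given model will differ.

```python
def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is in×rank, B is rank×out."""
    return rank * (in_features + out_features)


in_features = out_features = 4096   # assumed square layer, as in the demo above
full = in_features * out_features

for rank in (8, 16, 32, 64):
    n = lora_params(in_features, out_features, rank)
    print(f"rank {rank:>2}: {n:>9,} params ({100 * n / full:.2f}% of the full layer)")
```

Adapter size grows linearly with rank, so doubling `adapter_dim` doubles both the adapter file size and the optimizer state, while staying at a few percent of the layer in this example.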

In [25]:
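# Here's the training command we'll use, with key parameters highlighted.
# It is annotated for reading only: the inline comments after the line-continuation
# backslashes would break it in a real shell. The runnable version is in the %%bash cell in Section 5.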

training_command = """
torchrun --nproc_per_node=1 \\
"${NEMO_PATH}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py" \\
    # Experiment Management
    exp_manager.exp_dir=./lora_tutorial/experiments \\
    exp_manager.name=customer_support_lora \\
    
    # Hardware Configuration
    trainer.devices=1 \\
    trainer.num_nodes=1 \\
    trainer.precision=bf16-mixed \\
    
    # Training Configuration
    trainer.max_steps=50 \\                          # Total training steps
    trainer.val_check_interval=0.5 \\                # Validate every 50% of epoch
    
    # Model Configuration
    model.restore_from_path=${MODEL} \\              # Base model path
    model.tensor_model_parallel_size=1 \\
    model.pipeline_model_parallel_size=1 \\
    model.micro_batch_size=1 \\
    model.global_batch_size=2 \\
    
    # LoRA Configuration - THE KEY PART!
    model.peft.peft_scheme=lora \\                   # Enable LoRA
    model.peft.lora_tuning.target_modules=[attention_qkv] \\  # Which layers to adapt
    model.peft.lora_tuning.adapter_dim=32 \\         # LoRA rank (capacity)
    model.peft.lora_tuning.adapter_dropout=0.1 \\    # Dropout for regularization
    
    # Optimizer Configuration
    model.optim.lr=5e-4                              # Learning rate
"""

print("Key LoRA Training Parameters:")
print("="*50)
print("🔧 LoRA rank (adapter_dim): 32")
print("   → Controls model capacity (higher = more parameters)")
print("\n🎯 Target modules: [attention_qkv]")
print("   → We're adapting the attention layers")
print("\n📈 Learning rate: 5e-4")
print("   → 10x higher than typical full fine-tuning")
print("\n🔢 Batch size: 2")
print("   → Small batches work well for LoRA")
print("\n⏱️ Training steps: 50")
print("   → Quick training for our small dataset")
print("="*50)

Key LoRA Training Parameters:
🔧 LoRA rank (adapter_dim): 32
   → Controls model capacity (higher = more parameters)

🎯 Target modules: [attention_qkv]
   → We're adapting the attention layers

📈 Learning rate: 5e-4
   → 10x higher than typical full fine-tuning

🔢 Batch size: 2
   → Small batches work well for LoRA

⏱️ Training steps: 50
   → Quick training for our small dataset
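
If you prefer to keep these settings in Python and generate the command line from them, a sketch like the following works. The `to_override` helper is an assumption for illustration, and the script path is shortened to its file name. The keys and values are the ones shown above. The learning rate is kept as a string so it prints as `5e-4` rather than `0.0005`.

```python
import shlex


def to_override(key, value):
    """Format one Hydra-style override; lists become [a,b] with no spaces."""
    if isinstance(value, (list, tuple)):
        value = "[" + ",".join(str(v) for v in value) + "]"
    return f"{key}={value}"


config = {
    "trainer.max_steps": 50,
    "trainer.precision": "bf16-mixed",
    "model.global_batch_size": 2,
    "model.peft.peft_scheme": "lora",
    "model.peft.lora_tuning.target_modules": ["attention_qkv"],
    "model.peft.lora_tuning.adapter_dim": 32,
    "model.peft.lora_tuning.adapter_dropout": 0.1,
    "model.optim.lr": "5e-4",
}

overrides = [to_override(key, value) for key, value in config.items()]

# shlex.join quotes arguments such as [attention_qkv] so a shell passes them through unchanged
command = shlex.join(
    ["torchrun", "--nproc_per_node=1", "megatron_gpt_finetuning.py", *overrides]
)
print(command)
```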


## 5. Training with NeMo

### Fix Dependencies Issue

There's a version mismatch with `huggingface_hub`, so let's fix it before running training.

NeMo was developed against an older version of `huggingface_hub` (0.23.x), but this environment has a newer version (0.33.2) in which `ModelFilter` has been removed. Downgrading should resolve this issue and allow training to proceed normally.
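
Before forcing a reinstall, it can help to check what is installed. The sketch below uses only the standard library. The helper names are assumptions for illustration, and the pinned version is the one this notebook installs.

```python
from importlib import metadata

PINNED = "0.23.4"  # the version this notebook installs


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Turn '0.33.2' into (0, 33, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def needs_pin(current, pinned=PINNED):
    """True when the package is missing or differs from the pinned release."""
    return current is None or release_tuple(current) != release_tuple(pinned)


current = installed_version("huggingface_hub")
print(f"installed: {current}, pinned: {PINNED}, reinstall needed: {needs_pin(current)}")
```

Skipping the reinstall when the pinned version is already present avoids re-downloading its dependencies on every run of the notebook.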


In [26]:
# Fix the huggingface_hub version issue
# The error is because NeMo expects a different version of huggingface_hub
# Let's check current version and downgrade if needed

!pip show huggingface_hub | grep Version

# Downgrade to a compatible version
%pip install huggingface_hub==0.23.4 --force-reinstall

print("\nFixed huggingface_hub version. Now we can proceed with training.")


Version: 0.23.4
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting huggingface_hub==0.23.4
  Downloading huggingface_hub-0.23.4-py3-none-any.whl.metadata (12 kB)
Collecting filelock (from huggingface_hub==0.23.4)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub==0.23.4)
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
Collecting packaging>=20.9 (from huggingface_hub==0.23.4)
  Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
Collecting pyyaml>=5.1 (from huggingface_hub==0.23.4)
  Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting requests (from huggingface_hub==0.23.4)
  Downloading requests-2.32.4-py3-none-any.whl.metadata (4.9 kB)
Collecting tqdm>=4.42.1 (from huggingface_hub==0.23.4)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)

### 🔍 Pre-Training Checklist

Before we start training, let's ensure everything is ready:


In [28]:
# Verify prerequisites before training
import os

print("🔍 Checking prerequisites for training...\n")

# Check if NeMo is cloned
nemo_path = "/root/verb-workspace/NIM Workshop - Presenter/NeMo"
if os.path.exists(nemo_path):
    print("✅ NeMo repository found")
else:
    print("❌ NeMo repository not found! Please run cell 2 to clone NeMo.")

# Check if training scripts exist
training_script = f"{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py"
if os.path.exists(training_script):
    print("✅ Training script found")
else:
    print("❌ Training script not found!")

# Check if model is downloaded
model_path = "lora_tutorial/models/llama-3_2-1b-instruct/llama-3_2-1b-instruct_v2.0"
if os.path.exists(model_path) and os.path.exists(f"{model_path}/weights"):
    print("✅ Llama 3.2 1B model found (NeMo 2.0 distributed checkpoint)")
    
    # Show the NeMo 2.0 checkpoint structure
    print("\n📁 NeMo 2.0 Checkpoint Structure:")
    print(f"   {model_path}/")
    print(f"   ├── weights/     # Contains .distcp files (distributed checkpoint)")
    print(f"   └── context/     # Contains model.yaml configuration")
    
    # List actual files
    if os.path.exists(f"{model_path}/weights"):
        weight_files = os.listdir(f"{model_path}/weights")[:3]  # Show first 3 files
        print(f"\n   Example weight files: {weight_files}")
    
    # Check total size of model directory
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(model_path):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            total_size += os.path.getsize(filepath)
    size_gb = total_size / (1024**3)
    print(f"\n   Total model size: {size_gb:.2f} GB")
else:
    print("❌ Model not found! Please run notebook 00_Workshop_Setup.ipynb first")

# Check if training data exists
if os.path.exists("lora_tutorial/data/train.jsonl"):
    print("✅ Training data found")
else:
    print("❌ Training data not found! Please run the data preparation cells")

print("\n🎯 Ready to train!" if all([
    os.path.exists(nemo_path),
    os.path.exists(training_script),
    os.path.exists(model_path),
    os.path.exists("lora_tutorial/data/train.jsonl")
]) else "\n⚠️ Please fix the issues above before training!")


🔍 Checking prerequisites for training...

✅ NeMo repository found
✅ Training script found
✅ Llama 3.2 1B model found (NeMo 2 distributed checkpoint)
   Model size: 2.32 GB
✅ Training data found

🎯 Ready to train!


### Actually Run the Training! 🚀

This is the exciting part - you'll train your own LoRA adapter! 

**NeMo 2.0 Approach**: We have two options for training:
1. **Script-based** (below): Uses NeMo's production training scripts directly
2. **API-based** (alternative): Uses NeMo 2.0's Python API for more flexibility

**What will happen:**
1. The model will load (takes ~30 seconds)
2. Training will run for 50 steps (~5-10 minutes)
3. Checkpoints will be saved every 25 steps
4. A final LoRA adapter will be exported

**Watch for:**
- Training loss decreasing (good learning!)
- Validation metrics every 25 steps
- Final checkpoint saved at the end
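
It is worth working out how much data those 50 steps actually cover. The sketch below does the arithmetic for the settings in this workshop. The function name is an assumption, and the gradient accumulation formula is the usual Megatron-style convention of global batch = micro batch × data-parallel size × accumulation steps.

```python
def training_schedule(num_examples, max_steps, global_batch_size,
                      micro_batch_size, data_parallel_size=1):
    """Return (grad_accum_steps, samples_seen, passes_over_data)."""
    grad_accum = global_batch_size // (micro_batch_size * data_parallel_size)
    samples_seen = max_steps * global_batch_size
    passes = samples_seen / num_examples
    return grad_accum, samples_seen, passes


# Workshop settings: 5 examples, 50 steps, global batch 2, micro batch 1, one GPU
grad_accum, samples_seen, passes = training_schedule(5, 50, 2, 1)
print(f"gradient accumulation steps: {grad_accum}")
print(f"samples seen: {samples_seen}")
print(f"passes over the dataset: {passes:.0f}")
```

With only 5 examples, each one is seen about 20 times, so a very low training and validation loss here mostly reflects memorization of the demo data.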

Let's train your custom model:


In [29]:
%%bash

# Actually run the LoRA training!
# Note: We use NeMo's production training script directly

# IMPORTANT: Model path points to NeMo 2.0 distributed checkpoint directory
# This is NOT a single .nemo file, but a directory containing:
# - weights/ folder with .distcp files
# - context/ folder with model.yaml configuration
MODEL="lora_tutorial/models/llama-3_2-1b-instruct/llama-3_2-1b-instruct_v2.0"
TRAIN_DS="[./lora_tutorial/data/train.jsonl]"
VALID_DS="[./lora_tutorial/data/val.jsonl]"

# Define NeMo path within presenter folder
NEMO_PATH="/root/verb-workspace/NIM Workshop - Presenter/NeMo"

# Check if model exists (NeMo 2.0 distributed checkpoint format)
if [ ! -d "$MODEL" ] || [ ! -d "$MODEL/weights" ]; then
    echo "ERROR: Model not found at $MODEL"
    echo "Expected NeMo 2.0 distributed checkpoint with weights/ and context/ folders"
    echo "Please run notebook 00_Workshop_Setup.ipynb first to download the model"
    exit 1
fi

echo "✅ Found NeMo 2.0 distributed checkpoint at $MODEL"
echo "📁 Contents: $(ls -la $MODEL)"

# Run training with NeMo
# The training script automatically detects and handles NeMo 2.0 format
torchrun --nproc_per_node=1 \
"${NEMO_PATH}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py" \
    exp_manager.exp_dir=./lora_tutorial/experiments \
    exp_manager.name=customer_support_lora \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.5 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=2 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=lora \
    model.peft.lora_tuning.target_modules=[attention_qkv] \
    model.peft.lora_tuning.adapter_dim=32 \
    model.peft.lora_tuning.adapter_dropout=0.1 \
    model.optim.lr=5e-4


    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2025-07-10 07:44:38 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2025-07-10 07:44:38 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 0.5
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: ./lora_tutorial/experiments
      name: customer_support_lora
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        s

[NeMo W 2025-07-10 07:44:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True


[NeMo I 2025-07-10 07:44:38 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.


TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2025-07-10 07:44:38 exp_manager:773] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2025-07-10 07:44:38 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :lora_tutorial/experiments/customer_support_lora/checkpoints. Training from scratch.


[NeMo I 2025-07-10 07:44:38 exp_manager:396] Experiments will be logged at lora_tutorial/experiments/customer_support_lora
[NeMo I 2025-07-10 07:44:38 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2025-07-10 07:44:38 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2025-07-10 07:44:38 save_restore_connector:134] Restoration will occur within pre-extracted directory : `lora_tutorial/models/llama-3_2-1b-instruct/llama-3_2-1b-instruct_v2.0`.


Error executing job with overrides: ['exp_manager.exp_dir=./lora_tutorial/experiments', 'exp_manager.name=customer_support_lora', 'trainer.devices=1', 'trainer.num_nodes=1', 'trainer.precision=bf16-mixed', 'trainer.val_check_interval=0.5', 'trainer.max_steps=50', 'model.megatron_amp_O2=True', '++model.mcore_gpt=True', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.micro_batch_size=1', 'model.global_batch_size=2', 'model.restore_from_path=lora_tutorial/models/llama-3_2-1b-instruct/llama-3_2-1b-instruct_v2.0', 'model.data.train_ds.file_names=[./lora_tutorial/data/train.jsonl]', 'model.data.train_ds.concat_sampling_probabilities=[1.0]', 'model.data.validation_ds.file_names=[./lora_tutorial/data/val.jsonl]', 'model.peft.peft_scheme=lora', 'model.peft.lora_tuning.target_modules=[attention_qkv]', 'model.peft.lora_tuning.adapter_dim=32', 'model.peft.lora_tuning.adapter_dropout=0.1', 'model.optim.lr=5e-4']
Traceback (most recent call last):
  File "/root/v

CalledProcessError: Command 'b'\n# Actually run the LoRA training!\n [... remainder of the %%bash cell above ...] model.optim.lr=5e-4\n'' returned non-zero exit status 1.

In [None]:
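# NeMo 2.0 API approach (requires full NeMo installation)
# Reference sketch only: it is kept inside a string and never executed here, and the class
# names and arguments should be checked against the installed NeMo 2.0 release.
# The script-based approach above is recommended for the workshop.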

"""
# Example of how to use NeMo 2.0 API for LoRA training
from nemo.collections import llm
from nemo import lightning as nl
import torch

# 1. Load the model from the distributed checkpoint
model_path = "lora_tutorial/models/llama-3_2-1b-instruct/llama-3_2-1b-instruct_v2.0"

# 2. Create data module
data = llm.HFDatasetDataModule(
    dataset_path="lora_tutorial/data/train.jsonl",
    validation_path="lora_tutorial/data/val.jsonl",
    seq_length=2048,
    global_batch_size=2,
    micro_batch_size=1,
)

# 3. Configure LoRA
peft_config = llm.peft.LoRA(
    target_modules=['attention_qkv'],  # Which layers to adapt
    dim=32,                           # LoRA rank
    dropout=0.1,                      # Dropout rate
    alpha=32,                         # LoRA alpha (scaling factor)
)

# 4. Setup trainer
trainer = nl.Trainer(
    devices=1,
    num_nodes=1,
    max_steps=50,
    val_check_interval=0.5,
    limit_val_batches=1.0,
    accelerator='gpu',
    strategy=nl.MegatronStrategy(
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
    ),
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)

# 5. Configure optimizer
optimizer = llm.adam.pytorch_adam_with_flat_lr(lr=5e-4)

# 6. Run fine-tuning
llm.finetune(
    model=model_path,  # NeMo 2.0 can load from path directly
    data=data,
    trainer=trainer,
    peft=peft_config,
    optim=optimizer,
    log=nl.NeMoLogger(
        name="customer_support_lora",
        log_dir="lora_tutorial/experiments",
    ),
)
"""

print("✅ NeMo 2.0 API example shown above")
print("📝 For this workshop, we use the script-based approach which has fewer dependencies")
print("🚀 Both approaches are meant to produce an equivalent LoRA adapter.")


### Why Baseline Metrics Might Not Show

**Important Note**: The test metrics table might not appear in the baseline check because:

1. **Generation vs Evaluation Mode**: 
   - `megatron_gpt_generate.py` is built for text generation
   - Loss can only be computed when the reference output is available alongside the prompt
   - In a pure generation run, the script may skip the loss and only write predictions

2. **No Training = No Loss Calculation**:
   - Loss compares the model's predicted distribution to the ground-truth token at every position
   - Training does this on every step through teacher forcing
   - Pure inference/generation doesn't always compute it

3. **Alternative Approaches**:
   - Run training for 0 steps to get initial loss
   - Use a dedicated evaluation script
   - Compare generated text quality instead of numerical metrics

**What to do**: Focus on comparing the generated responses rather than loss values for baseline!
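
To make point 2 concrete, here is a minimal sketch of what "comparing predictions to ground truth token-by-token" means. The five-token vocabulary and the probabilities are made up for illustration (they are not NeMo output), and `token_level_loss` is a hypothetical helper name.

```python
import math

def token_level_loss(step_probs, target_ids):
    """Average negative log-likelihood of the reference tokens.

    step_probs[i] is the model's probability distribution over the vocabulary
    at position i, given the *reference* tokens before it (teacher forcing).
    target_ids[i] is the reference token id at position i.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    nll = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)

# Toy example: 3 positions, vocabulary of 5 tokens (illustrative numbers only).
step_probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
target_ids = [0, 1, 4]

loss = token_level_loss(step_probs, target_ids)
print(f"average loss: {loss:.3f}")
```

Without reference tokens there is nothing to index into each distribution, which is why a generation-only run can produce text but no loss value.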


🎤 **PRESENTER SCRIPT:**
 
"Now let's verify that our LoRA training was successful by checking the output files.
 
As we can see, the training has created three important files:
 
1. **customer_support_lora.nemo** (21MB) - This is the exported LoRA adapter in NeMo format.
   It contains just the LoRA weights and configuration, which is why it's so small compared
   to the full model. This is what we'll deploy with NIM.

2. **Two checkpoint files** (147MB each) - These are the full training checkpoints that include:
   - The LoRA adapter weights
   - Optimizer state
   - Training metadata
   - Model configuration
    
The checkpoint files are larger because they contain everything needed to resume training.
Notice they're named with the validation loss (0.000) and training step (50).
 
The fact that we have a 21MB .nemo file confirms our LoRA adapter was successfully created.
This small file size is one of the key advantages of LoRA - we've adapted a multi-gigabyte
base model with just 21MB of additional weights!
 
In the next section, we'll deploy this adapter with NIM to serve our fine-tuned model."

In [None]:
# Check if training created the LoRA adapter
!ls -la ./lora_tutorial/experiments/customer_support_lora*/checkpoints/
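
The size gap between the adapter and the base model follows from simple arithmetic: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies that formula. The layer dimensions and layer count are assumed values for illustration (check your model's config for the real ones), and the on-disk size also depends on dtype and stored metadata, so the result is not expected to match the 21MB file exactly.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed, illustrative dimensions (not read from the checkpoint):
hidden_size = 2048   # input width of the fused QKV projection
qkv_out = 3072       # output width of the fused QKV projection
num_layers = 16
rank = 32            # the LoRA rank ("dim") used in the API example

per_layer = lora_param_count(hidden_size, qkv_out, rank)
total = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {total:,}")
print(f"Approx. size at 4 bytes/param: {total * 4 / 1e6:.1f} MB")
```

Even with generous assumptions, the adapter stays in the low millions of parameters, while the base model has on the order of a billion.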

## 6. Test Your Trained LoRA Adapter

### Test Your Custom LoRA Model! 🎉

Now comes the moment of truth - let's see how your trained adapter performs!

**What we'll test:**
- How well it learned the customer service style
- Whether it generates appropriate responses
- How different it is from the base model

Let's see your custom AI in action:


In [None]:
import jsonlines  # third-party package: pip install jsonlines

# First, create a test file with a few examples
test_examples = [
    {
        "input": "User: My package is damaged. What should I do?\n\nAssistant:",
        "output": "I'm sorry to hear you received a damaged product. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately."
    },
    {
        "input": "User: How do I track my order?\n\nAssistant:",
        "output": "You can track your order by logging into your account and clicking 'Order History', or use the tracking link in your confirmation email. The tracking number will show real-time updates."
    }
]

with jsonlines.open('lora_tutorial/data/test_small.jsonl', 'w') as writer:
    writer.write_all(test_examples)
    
print("Created test file with 2 examples")


In [None]:
%%bash

# Run inference using the trained LoRA adapter
MODEL="lora_tutorial/models/llama-3_2-1b-instruct/llama-3_2-1b-instruct_v2.0"
TEST_DS="[./lora_tutorial/data/test_small.jsonl]"
TEST_NAMES="[customer_support]"

# Define NeMo path within presenter folder
NEMO_PATH="/root/verb-workspace/NIM Workshop - Presenter/NeMo"

# Path to the LoRA checkpoint - use the actual file name
LORA_CKPT="./lora_tutorial/experiments/customer_support_lora/checkpoints/customer_support_lora.nemo"

# Check if LoRA checkpoint exists
if [ ! -f "$LORA_CKPT" ]; then
    echo "WARNING: LoRA checkpoint not found at $LORA_CKPT"
    echo "Make sure you've run the training step successfully"
fi

# Run generation
python "${NEMO_PATH}/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py" \
    model.restore_from_path="${MODEL}" \
    model.peft.restore_from_path="${LORA_CKPT}" \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names="${TEST_DS}" \
    model.data.test_ds.names="${TEST_NAMES}" \
    model.data.test_ds.global_batch_size=1 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=100 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=customer_support_lora \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.label_key="output" \
    model.data.test_ds.prompt_template="\{input\} \{output\}"
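
If the run prints repeated HuggingFace tokenizer warnings about parallelism, one option is to set `TOKENIZERS_PARALLELISM` in a Python cell before launching the generation cell. This is a small sketch; it assumes the `%%bash` cell is started from the same notebook kernel, so that it inherits the kernel's environment variables.

```python
import os

# Environment variables set here are inherited by subprocesses started afterwards.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("TOKENIZERS_PARALLELISM =", os.environ["TOKENIZERS_PARALLELISM"])
```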


🎤 **PRESENTER SCRIPT:**

"Let me explain what just happened in that output:

**1. Tokenizer Warnings** (those repeated messages):
These are harmless warnings from HuggingFace. What's happening:
- NeMo uses multiprocessing to speed up data loading
- Each process needs its own tokenizer instance
- The warning is just saying 'Hey, this process was forked, so I'm disabling parallel tokenization to avoid deadlocks'

You can silence these by setting: `export TOKENIZERS_PARALLELISM=false`

**2. Data Processing**:
- `Loading data files`: Reading your test JSONL file
- `Length of test dataset: 2`: Found our 2 test examples
- `Building dataloader`: Preparing batches for inference

**3. The Inference Progress Bar**:
`Testing DataLoader 0: 100%|██████████| 2/2`
- Processed both test examples
- Took about 11 seconds (0.17 items/second)
- This is SLOW because we're generating 100 tokens per example

**4. Results Saved**:
`Predictions saved to customer_support_lora_test_customer_support_inputs_preds_labels.jsonl`
- This file contains the model's actual responses!

**5. Test Metrics Table**:
- `test_loss: 2.427` - This is the average cross-entropy loss per token on the test data
- Lower is better (0 would be perfect; the matching perplexity is exp(2.427) ≈ 11.3, where 1.0 would be perfect)
- 2.4 is actually quite good for a small LoRA adapter!

The test metrics table shows your LoRA model's **loss score** (lower is better), which measures how far the model's predictions are from the reference answers in your test examples. As a rough rule of thumb:
- **0-1 is excellent** (but may indicate memorization)
- **1-2.5 is good** (your 2.427 falls here!)
- **2.5-4 is okay**
- **4+ needs work**

When you see this table, you're looking for a loss between 1 and 3, which suggests the model learned your style without memorizing exact phrases. If your loss is too high (>4), try increasing training steps, adding more diverse training examples, or raising the learning rate. If it's too low (<1), you might be overfitting, so reduce training steps or add dropout.

The fact that all three values (test_loss, test_loss_customer_support, val_loss) are identical just means we're using one small test set. Your 2.427 score suggests the model picked up the customer service style, though two test examples are too few to say how well it will generalize to new customer questions.

Here's why they're identical:
- test_loss: The average loss across ALL test datasets
- test_loss_customer_support: The loss for your specific "customer_support" test set
- val_loss: Validation loss (but in inference mode, it uses test data)

They're the same because:
- You only have ONE test dataset (customer_support)
- So the "average of all datasets" = "customer_support dataset" = same number
- In inference/test mode, validation and test use the same data



The key takeaway: Your LoRA adapter successfully loaded and generated responses!
Now let's look at what it actually said..."
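
A quick way to put the loss value in context is to convert it to perplexity, which is the exponential of the average per-token cross-entropy loss. The sketch below does that conversion for the value from the metrics table. It assumes the loss is reported in nats (natural log), and the helper name `loss_to_perplexity` is just for illustration.

```python
import math

def loss_to_perplexity(loss):
    """Perplexity is exp(average per-token cross-entropy loss, in nats)."""
    return math.exp(loss)

test_loss = 2.427  # value reported in the test metrics table
print(f"loss {test_loss:.3f} -> perplexity {loss_to_perplexity(test_loss):.1f}")
print(f"loss 0.000 -> perplexity {loss_to_perplexity(0.0):.1f}")
```

A perplexity around 11 can be read as the model being, on average, about as uncertain as if it were choosing among roughly 11 equally likely tokens at each step.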


In [None]:
# Optional: Compare baseline predictions with LoRA predictions
# Note: To create baseline predictions, run the same inference command without the LoRA checkpoint

import os
if os.path.exists("baseline_no_lora_test_baseline_inputs_preds_labels.jsonl"):
    print("=== BASELINE predictions (without LoRA): ===")
    !head -n2 baseline_no_lora_test_baseline_inputs_preds_labels.jsonl
    print("\n=== LoRA predictions (with fine-tuning): ===")
    !head -n2 customer_support_lora_test_customer_support_inputs_preds_labels.jsonl
else:
    print("=== LoRA predictions (with fine-tuning): ===")
    print("Note: To see baseline comparison, run inference without LoRA first\n")
    !head -n2 customer_support_lora_test_customer_support_inputs_preds_labels.jsonl


In [None]:
# Look at the generated predictions
!head -n2 customer_support_lora_test_customer_support_inputs_preds_labels.jsonl
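
`head` prints each prediction as one long JSON line, which can be hard to read. As an alternative, the sketch below loads the predictions file with the standard library and prints every field of every record. It makes no assumption about the field names, and the helper names are illustrative.

```python
import json
import os

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

def show_records(records, max_chars=300):
    """Print every field of every record, truncating long values."""
    for i, record in enumerate(records, start=1):
        print(f"--- example {i} ---")
        for key, value in record.items():
            text = str(value).replace("\n", " ")
            if len(text) > max_chars:
                text = text[:max_chars] + "..."
            print(f"{key}: {text}")

preds_file = "customer_support_lora_test_customer_support_inputs_preds_labels.jsonl"
if os.path.exists(preds_file):
    show_records(load_jsonl(preds_file))
else:
    print(f"{preds_file} not found; run the generation cell first")
```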

## 7. Export LoRA for Deployment [STOP]

### Merge LoRA Weights (Optional)

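To merge the LoRA adapter with the base model for deployment, use the `merge.py` script that the setup cell checked for (`scripts/nlp_language_modeling/merge_lora_weights/merge.py` in the NeMo repository).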
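
Conceptually, merging folds the low-rank update into the base weight: `W_merged = W + (alpha / r) * B @ A`, so the merged layer gives the same output as the base layer plus the adapter. The sketch below checks this on tiny hand-written matrices in plain Python. It illustrates the arithmetic only and is not the NeMo merge script; all names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    weight: d_out x d_in, lora_a: rank x d_in, lora_b: d_out x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Tiny illustrative example: d_in = 3, d_out = 2, rank = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # rank x d_in
B = [[0.5],
     [-1.0]]            # d_out x rank
alpha, rank = 2, 1

W_merged = merge_lora(W, A, B, alpha, rank)

# Check: merged layer output equals base output plus scaled adapter output.
x = [[1.0], [1.0], [1.0]]  # column vector, d_in x 1
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
merged_out = matmul(W_merged, x)
for m, b, a in zip(merged_out, base_out, adapter_out):
    assert abs(m[0] - (b[0] + (alpha / rank) * a[0])) < 1e-9
print(W_merged)
```

After merging, inference needs no extra adapter computation, but the adapter can no longer be swapped out, which is why keeping the original adapter file is still worthwhile.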


## 8. Best Practices Summary

🎤 **PRESENTER SCRIPT:**

"Let me share hard-won best practices from training dozens of LoRA models:

[RUN THE CELL TO CREATE THE GUIDE]

**1. Dataset Preparation**
The #1 failure mode is bad data. I've seen teams waste weeks because of:
- Inconsistent formatting
- Contradictory examples
- Poor quality responses
- Unbalanced categories

Solution: Spend 80% of your time on data, 20% on training.

**2. Hyperparameters**
Start conservative:
- Rank 16 (increase if underfitting)
- Learning rate 1e-4 (increase if slow)
- Batch size: as large as GPU allows
- Epochs: 3-5 (watch validation loss!)

**3. Target Modules**
- Start with just attention_qkv
- Add attention_dense if needed
- MLP layers only for major behavior changes
- More modules = slower training but more capacity

**4. Monitoring**
Watch these metrics:
- Training loss: Should decrease smoothly
- Validation loss: Should follow training loss
- Gradient norms: Should stay stable
- Learning rate: Verify schedule

Red flags:
- Validation loss increases (overfitting)
- Loss spikes (bad examples)
- NaN losses (learning rate too high)

**5. Deployment**
- Always test merged models
- Keep original adapters for updates
- Version control everything
- A/B test in production

Remember: LoRA is powerful but not magic. It modifies behavior, doesn't add knowledge. You can't teach it facts it never knew, but you can teach it how to use what it knows!"
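
The "validation loss increases" red flag can be checked automatically. The sketch below is a minimal, framework-independent early-stopping check using made-up loss values. The function name and the `patience` and `min_delta` parameters are illustrative and are not part of NeMo.

```python
def should_stop_early(val_losses, patience=2, min_delta=0.0):
    """Return True if validation loss has not improved for `patience` checks.

    An improvement means dropping below the best earlier value by more than min_delta.
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)

# Illustrative histories (not real training output).
still_improving = [2.9, 2.6, 2.5, 2.45, 2.43]
overfitting = [2.9, 2.6, 2.5, 2.7, 2.8]

print(should_stop_early(still_improving))
print(should_stop_early(overfitting))
```

A larger `patience` tolerates noisy validation curves at the cost of training a little longer past the best checkpoint.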

In [None]:
# Create a best practices summary
best_practices = """
# LoRA Fine-tuning Best Practices

## 1. Dataset Preparation
- Use high-quality, task-specific data
- 1000-10000 examples often sufficient
- Include diverse examples
- Format: JSONL with 'input' and 'output' fields

## 2. Hyperparameters
- Rank (adapter_dim): Start with 16-32
- Learning rate: 1e-4 to 5e-4
- Batch size: As large as GPU memory allows
- Epochs: 3-5 (watch for overfitting)

## 3. Target Modules
- attention_qkv: Most common choice
- Can also target: attention_dense, mlp_fc1, mlp_fc2
- More modules = more capacity but slower training

## 4. Monitoring
- Track validation loss
- Test on held-out examples
- Save checkpoints frequently
- Use early stopping if needed

## 5. Deployment
- Merge weights for production
- Export to TensorRT for optimization
- Test thoroughly before deployment
- Keep original adapter files for updates
"""

with open("lora_tutorial/best_practices.md", "w") as f:
    f.write(best_practices)

print("Created best practices guide")
print("\nAll tutorial files created in ./lora_tutorial/")
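
The guide's format rule (JSONL with 'input' and 'output' fields) is easy to check before training. The sketch below is one way to do it with the standard library. The function name is illustrative, and the path in the usage lines is the training file referenced earlier in this tutorial.

```python
import json
import os

REQUIRED_FIELDS = ("input", "output")

def validate_jsonl(path, required=REQUIRED_FIELDS):
    """Return a list of (line_number, problem) tuples; an empty list means the file is valid."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((line_no, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            for field in required:
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_no, f"missing or empty field '{field}'"))
    return problems

train_file = "lora_tutorial/data/train.jsonl"
if os.path.exists(train_file):
    for line_no, problem in validate_jsonl(train_file):
        print(f"line {line_no}: {problem}")
```

Running a check like this before every training job catches the inconsistent-formatting failures early, when they are cheapest to fix.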

🎤 **PRESENTER SCRIPT:**

"Let's see everything we've created in our LoRA tutorial workspace:

[RUN THE CELL]

Perfect! We have:
- Training data ready
- Configuration defined
- Scripts for the complete pipeline
- Best practices documented

This is a professional setup ready for real model training. In production, you'd add:
- Git version control
- Experiment tracking (MLflow/W&B)
- Automated testing
- CI/CD pipelines
- Model registry

But this foundation is solid!"

In [None]:
# List all created files
import os
for root, dirs, files in os.walk("lora_tutorial"):
    level = root.replace("lora_tutorial", "").count(os.sep)
    indent = " " * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = " " * 2 * (level + 1)
    for file in files:
        print(f"{subindent}{file}")

## Summary

🎤 **PRESENTER SCRIPT:**

"Incredible work! You've mastered LoRA fine-tuning. Let's celebrate what you've learned:

✅ **LoRA Theory**: Low-rank matrix decomposition for efficient adaptation
✅ **Parameter Efficiency**: Train <1% of parameters while keeping most of the quality of full fine-tuning
✅ **Data Preparation**: Quality > quantity, JSONL format
✅ **Configuration**: Rank, target modules, hyperparameters
✅ **Training Pipeline**: NeMo integration, distributed training
✅ **Inference Options**: Dynamic adapters vs merged models
✅ **Export & Optimization**: TensorRT for production performance
✅ **Best Practices**: Data quality, monitoring, deployment strategies

You can now:
- Take any open-source LLM
- Customize it for your specific needs
- Do it on affordable hardware
- Deploy it efficiently

Real-world applications I've seen:
- Legal firms: Contract analysis in their style
- Healthcare: Medical report generation
- Finance: Compliance-aware responses
- Retail: Product description generation
- Gaming: NPC dialogue systems

But here's the final challenge: How do we deploy these custom models at scale? How do we serve multiple LoRA adapters efficiently? How do we ensure production reliability?

That's our grand finale - Part 4: Deploying LoRA models with NIMs. We'll build a production system that can serve your custom models to millions of users.

Ready to complete your journey from prototype to production? Let's go!"