# Part 3: LoRA Fine-tuning with NeMo

This notebook demonstrates how to fine-tune Llama 3.1 8B Instruct using LoRA (Low-Rank Adaptation) with NVIDIA NeMo framework.

## What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- Adds trainable low-rank matrices to frozen model weights
- Reduces memory requirements by 90%+
- Enables fine-tuning large models on consumer GPUs
- Produces small adapter files (~100-500MB for 8B models)

The focus of this workshop is not the specifics of LoRA, but to actually give everyone an guide on how to carry out the process of tuning your model. 

## IMPORTANT: NeMo Framework Setup

This notebook requires the NVIDIA NeMo framework for LoRA training. We'll use the cloned NeMo repository to access the necessary training scripts.

**NeMo Compatibility**: 
- The downloaded model uses standard NeMo format (.nemo file)
- The training scripts work directly without any modifications

**Training Experience**: In this workshop, you'll train your own LoRA adapter from scratch! This gives you hands-on experience with:
- Setting up training data
- Configuring LoRA parameters
- Running the actual training
- Testing your custom adapter

The training process takes approximately 5-10 minutes for our small example dataset.

🎤 **PRESENTER SCRIPT:**

"Now, we'll clone the NeMo reposiory from github.com/NVIDIA

In [1]:
# Clone NeMo repository if not already present
import os

# Use relative path for NeMo
nemo_path = './NeMo'

if not os.path.exists(nemo_path):
    print("Cloning NeMo repository...")
    !git clone https://github.com/NVIDIA/NeMo.git {nemo_path}
    print("NeMo repository cloned successfully!")
else:
    print("NeMo repository already exists.")
    
# Verify the training scripts exist
nemo_scripts = [
    f'{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py',
    f'{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py',
    f'{nemo_path}/scripts/nlp_language_modeling/merge_lora_weights/merge.py'
]

print("\nChecking for required NeMo scripts:")
for script in nemo_scripts:
    if os.path.exists(script):
        print(f"✓ Found: {os.path.basename(script)}")
    else:
        print(f"✗ Missing: {script}")

NeMo repository already exists.

Checking for required NeMo scripts:
✓ Found: megatron_gpt_finetuning.py
✓ Found: megatron_gpt_generate.py
✓ Found: merge.py


🎤 **PRESENTER SCRIPT:**

"Welcome to the most transformative part of our journey - LoRA fine-tuning! This is where you go from using someone else's AI to creating YOUR OWN specialized AI.

Let me start with a real story. A Fortune 500 company came to us with a problem. They loved Llama 3 70B but needed it to understand their internal jargon - thousands of product codes, technical terms, and specific procedures. 

The traditional solution? Fine-tune the entire 70B parameter model. That would require:
- 8 H100 GPUs ($300,000+ hardware)
- 2 weeks of training time  
- Machine learning PhD to manage it
- $50,000+ in electricity

Their budget? One RTX 4090 and a week.

Enter LoRA - Low-Rank Adaptation. Instead of training all 70 billion parameters, LoRA adds small 'adapter' matrices that modify the model's behavior. Imagine it like putting specialized glasses on the model - it sees everything through your custom lens.

The results for that company?
- Trained on 1 RTX 4090
- 6 hours total time
- Junior developer managed it
- Under $100 in costs
- Model performed BETTER than full fine-tuning for their use case

Today, I'll show you exactly how to do this. By the end, you'll be able to create custom AI models tailored to your exact needs!"

🎤 **PRESENTER SCRIPT:**

"Now let's set up our environment.

We're installing:
- **jsonlines**: For handling our training data format - it's like CSV but better for text
- **transformers**: Hugging Face's library - we'll use their tokenizer 
- **omegaconf**: For configuration management - makes our life SO much easier
- **pytorch-lightning**: The training framework that handles all the heavy lifting

*[Wait for installation to complete]*


## 1. Setup Environment

In [2]:
# Install required packages
!pip install jsonlines transformers omegaconf pytorch-lightning

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
import os
import json
import jsonlines
from omegaconf import OmegaConf
import torch

# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

PyTorch version: 2.3.0a0+ebedce2
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
GPU memory: 42.41 GB


🎤 **PRESENTER SCRIPT:**

"Time to create our training data! This is where the magic begins. We're creating a customer support AI, but here's the key insight: we're not teaching it WHAT customer support is - Llama already knows that. We're teaching it HOW to do customer support in YOUR specific style.

Look at these examples:
- Notice the consistent, professional tone
- See how each response acknowledges the issue first
- Watch how we always offer a clear next step

This is called 'style transfer' in AI. It's like teaching someone who already speaks English to speak with a British accent - they know the language, they just need to learn the style.

With just 5 examples, we can dramatically change how the model responds. In production, I've seen companies achieve 90%+ accuracy with as few as 50 examples. That's the power of LoRA!"


In [4]:
# Create directories
os.makedirs("lora_tutorial/data", exist_ok=True)
os.makedirs("lora_tutorial/models", exist_ok=True)
os.makedirs("lora_tutorial/configs", exist_ok=True)

## 2. Prepare Dataset

In [5]:
# Create a sample dataset for customer support fine-tuning
training_data = [
    {
        "input": "User: My order hasn't arrived yet. Order number is 12345.\\n\\nAssistant:",
        "output": "I apologize for the delay with your order #12345. Let me check the status for you right away. I'll need to verify some details first to ensure your privacy and security."
    },
    {
        "input": "User: How do I reset my password?\\n\\nAssistant:",
        "output": "I'd be happy to help you reset your password. For security, please click on 'Forgot Password' on the login page, enter your email address, and follow the instructions sent to your inbox."
    },
    {
        "input": "User: What is your return policy?\\n\\nAssistant:",
        "output": "Our return policy allows returns within 30 days of purchase with original receipt. Items must be in original condition with tags attached. Refunds are processed within 5-7 business days."
    },
    {
        "input": "User: I received a damaged product. What should I do?\\n\\nAssistant:",
        "output": "I'm sorry to hear you received a damaged product. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately."
    },
    {
        "input": "User: Do you offer international shipping?\\n\\nAssistant:",
        "output": "Yes, we offer international shipping to over 50 countries. Shipping rates and delivery times vary by destination. You can check availability and costs at checkout."
    }
]

# Save training data
with jsonlines.open('lora_tutorial/data/train.jsonl', 'w') as writer:
    writer.write_all(training_data)

# Create validation data (smaller subset)
val_data = training_data[:2]
with jsonlines.open('lora_tutorial/data/val.jsonl', 'w') as writer:
    writer.write_all(val_data)

print(f"Created {len(training_data)} training examples")
print(f"Created {len(val_data)} validation examples")

Created 5 training examples
Created 2 validation examples


### Verify Prerequisites Before Training

In [7]:
# Verify prerequisites before training
import os
import glob

print("🔍 Checking prerequisites for training...\n")

# Check if NeMo is cloned - use relative path
nemo_path = "./NeMo"
if os.path.exists(nemo_path):
    print("✅ NeMo repository found")
else:
    print("❌ NeMo repository not found! Please run cell 2 to clone NeMo.")

# Check if training scripts exist
training_script = f"{nemo_path}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py"
if os.path.exists(training_script):
    print("✅ Training script found")
else:
    print("❌ Training script not found!")

# Check if model is downloaded - look in subdirectories since NGC creates them
model_files = glob.glob("lora_tutorial/models/llama-3_1-8b-instruct/**/*.nemo", recursive=True)
if model_files:
    model_path = model_files[0]  # Use the first .nemo file found
    # Check if it's a complete model (>10GB)
    size_gb = os.path.getsize(model_path) / (1024**3)
    if size_gb > 10:
        print("✅ Llama 3.1 8B model found")
        print(f"\n📁 Model Information:")
        print(f"   Path: {model_path}")
        print(f"   Size: {size_gb:.2f} GB")
        print(f"   Format: Standard NeMo checkpoint (.nemo)")
    else:
        print(f"⚠️  Incomplete model found ({size_gb:.1f} GB)")
        print("   Please re-run the download in 00_Workshop_Setup.ipynb")
else:
    print("❌ Model not found! Please run notebook 00_Workshop_Setup.ipynb first")

# Check if training data exists
if os.path.exists("lora_tutorial/data/train.jsonl"):
    print("✅ Training data found")
else:
    print("❌ Training data not found! Please run the data preparation cells")

print("\n🎯 Ready to train!" if all([
    os.path.exists(nemo_path),
    os.path.exists(training_script),
    len(model_files) > 0 and os.path.getsize(model_files[0]) / (1024**3) > 10,  # Check complete model
    os.path.exists("lora_tutorial/data/train.jsonl")
]) else "\n⚠️ Please fix the issues above before training!")

🔍 Checking prerequisites for training...

✅ NeMo repository found
✅ Training script found
✅ Llama 3.1 8B model found

📁 Model Information:
   Path: lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo
   Size: 14.96 GB
   Format: Standard NeMo checkpoint (.nemo)
✅ Training data found

🎯 Ready to train!


🎤 **PRESENTER SCRIPT:**

"Before we hit that training button, let me explain what's about to happen - because this is genuinely exciting!

When you press run on the next cell:
1. **Model Loading** (~30 seconds): We'll load 8 billion parameters into GPU memory
2. **LoRA Injection**: We'll inject our tiny 10.5M parameter adapters - that's only 0.13% of the model!
3. **Training Loop**: You'll see the loss dropping - that's the AI learning your style in real-time
4. **Checkpointing**: Every 25 steps, we save progress - like auto-save in a video game

⚠️ **Common Issues to Address:**
- 'Oh no, so many warnings!' - That's normal! NeMo is very verbose. Only worry if you see ERROR in red
- 'Why is it taking so long to start?' - Loading 8B parameters takes time, like starting a jumbo jet
- 'My loss isn't dropping!' - Give it 10-20 steps, LoRA starts slow then accelerates

🎯 **What to watch for:**
- Training loss going from ~2.0 down to ~0.001 - that's learning!
- 'Validation loss' appearing every 25 steps - that's testing
- The final message about saving checkpoints - that's success!

Ready? Let's train your AI! 🚀"


## 3. Run LoRA Training

### Actually Run the Training! 🚀

This is the exciting part - you'll train your own LoRA adapter! 

**What will happen:**
1. The model will load from the .nemo checkpoint (takes ~30 seconds)
2. Training will run for 50 steps (~5-10 minutes)
3. Checkpoints will be saved every 25 steps
4. A final LoRA adapter will be exported as a .nemo file

**Watch for:**
- Training loss decreasing (good learning!)
- Validation metrics every 25 steps
- Final checkpoint saved at the end

**Note about warnings**: You'll see many warnings about missing configuration fields - these are normal and can be ignored. They appear because NeMo supports many optional features that aren't used in this training.

**Important**: Llama 3.1 models use Group Query Attention (GQA), which requires `mcore_gpt=True` to be set.

Let's train your custom model:

🎤 **PRESENTER SCRIPT (During Training):**

"While this is running, let me share what's happening under the hood:

**Step 1-10**: The model is 'confused' - high loss, random outputs. Like a student on day one.

**Step 10-25**: Pattern recognition kicks in. The model starts to see 'Oh, I should acknowledge the problem first!'

**Step 25-40**: Fine-tuning happens. Now it's learning your specific phrases and tone.

**Step 40-50**: Polishing. Minor adjustments to perfect the style.

See that loss number dropping? Every decrease means your AI is getting smarter. It's literally learning as we watch!

Fun fact: OpenAI's GPT models train for millions of steps. We're doing 50. That's the power of starting with a pre-trained model - we're not teaching it English, just your style of English!"

In [10]:
%%bash

# Actually run the LoRA training!

# Find the model file dynamically (NGC creates subdirectories)
MODEL_DIR="lora_tutorial/models/llama-3_1-8b-instruct"
MODEL=$(find "$MODEL_DIR" -name "*.nemo" -type f | head -1)

if [ -z "$MODEL" ]; then
    echo "ERROR: No .nemo model file found in $MODEL_DIR"
    echo "Please run notebook 00_Workshop_Setup.ipynb first to download the model"
    exit 1
fi

TRAIN_DS="[lora_tutorial/data/train.jsonl]"
VALID_DS="[lora_tutorial/data/val.jsonl]"

# Use relative path to NeMo
NEMO_PATH="./NeMo"

echo "✅ Found Llama 3.1 8B model at $MODEL"

# Run training with NeMo
torchrun --nproc_per_node=1 \
"${NEMO_PATH}/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py" \
    exp_manager.exp_dir=lora_tutorial/experiments \
    exp_manager.name=customer_support_lora \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.5 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    model.micro_batch_size=1 \
    model.global_batch_size=2 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=lora \
    model.peft.lora_tuning.target_modules=[attention_qkv] \
    model.peft.lora_tuning.adapter_dim=32 \
    model.peft.lora_tuning.adapter_dropout=0.1 \
    model.optim.lr=5e-4

✅ Found Llama 3.1 8B model at lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo


    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2025-07-11 17:22:56 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2025-07-11 17:22:56 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 0.5
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: lora_tutorial/experiments
      name: customer_support_lora
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        sav

[NeMo W 2025-07-11 17:22:56 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True


[NeMo I 2025-07-11 17:22:56 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.


TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2025-07-11 17:22:56 exp_manager:773] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2025-07-11 17:22:56 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :lora_tutorial/experiments/customer_support_lora/checkpoints. Training from scratch.


[NeMo I 2025-07-11 17:22:56 exp_manager:396] Experiments will be logged at lora_tutorial/experiments/customer_support_lora
[NeMo I 2025-07-11 17:22:56 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2025-07-11 17:22:56 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does

[NeMo I 2025-07-11 17:23:15 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2025-07-11 17:23:15 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-07-11 17:23:15 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-07-11 17:23:15 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2025-07-11 17:23:15 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2025-07-11 17:23:15 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2025-07-11 17:23:15 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2025-07-11 17:23:15 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2025-07-11 17:23:15 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2025-07-11 17:23:15 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-07-11 17:23:15 megatron_init:310] All tensor model parallel group ranks: 

[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:15 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap_ag i

[NeMo I 2025-07-11 17:23:15 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2025-07-11 17:23:16 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2025-07-11 17:23:16 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:16 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:16 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:16 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:23:16 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2025-07-11 17:23:39 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2025-07-11 17:24:21 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo.
[NeMo I 2025-07-11 17:24:21 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2025-07-11 17:24:21 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ------------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2025-07-11 17:24:2

[NeMo W 2025-07-11 17:24:25 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2025-07-11 17:24:25 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2025-07-11 17:24:25 megatron_gpt_sft_model:811] Building GPT SFT validation datasets.
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:116] Building data files
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:495] Building indexing for fn = lora_tutorial/data/val.jsonl
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:507] Saving idx file = lora_tutorial/data/val.jsonl.idx.npy
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:509] Saving metadata file = lora_tutorial/data/val.jsonl.idx.info
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.089823
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.085401
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:158] Loading data files
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:249] Loading lora_tutorial/data/val.jsonl
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001170
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:165] Computing global indices
[NeMo I 2025-07-11 17:24:25 megatron_gpt_sft_model:815] Length of val dataset: 2
[NeMo I 2025-07-11 17:24:25 megatron_gpt_sft_model:822] Building GPT SFT traing datasets.
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:116] Building data files
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:495] Building indexing for fn = lora_tutorial/data/train.jsonl
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:507] Saving idx file = lora_tutorial/data/train.jsonl.idx.npy
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:509] Saving metadata file = lora_tutorial/data/train.jsonl.idx.info
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.087212
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.086808
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:158] Loading data files
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:249] Loading lora_tutorial/data/train.jsonl
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000915
[NeMo I 2025-07-11 17:24:25 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2025-07-11 17:24:25 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.05 (sec)
[NeMo I 2025-07-11 17:24:25 megatron_gpt_sft_model:824] Length of train dataset: 101
[NeMo I 2025-07-11 17:24:25 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0
[NeMo I 2025-07-11 17:24:25 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2025-07-11 17:24:25 megatron_base_model:1199] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 50.


[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2025-07-11 17:24:25 adapter_mixins:435] Unfrozen adapter : lora_kqv_


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 8.0 B  | train
------------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
[NeMo W 2025-07-11 17:24:25 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=29` in the `DataLoader` to improve performance.
    
[NeMo W 2025-07-11 17:24:25 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
[NeMo W 2025-07-11 17:24:26 nemo_logging:

Epoch 0: :  50%|█████     | 25/50 [00:07<00:07, v_num=0, reduced_train_loss=0.00124, global_step=24.00, consumed_samples=50.00, train_step_timing in s=0.278]
Validation: |          | 0/? [00:00<?, ?it/s][A
Validation:   0%|          | 0/1 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s][A
Validation DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  5.87it/s][A


Metric val_loss improved. New best score: 0.052
Epoch 0, global step 25: 'validation_loss' reached 0.05194 (best 0.05194), saving model to '/root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.052-step=25-consumed_samples=50.0.ckpt' as top 1
[NeMo W 2025-07-11 17:24:33 nlp_overrides:480] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: : 100%|██████████| 50/50 [00:14<00:00, v_num=0, reduced_train_loss=3.75e-5, global_step=49.00, consumed_samples=100.0, train_step_timing in s=0.269, val_loss=0.0519] 
Validation: |          | 0/? [00:00<?, ?it/s][A
Validation:   0%|          | 0/1 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s][A
Validation DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  6.22it/s][A


Metric val_loss improved by 0.052 >= min_delta = 0.001. New best score: 0.000
Epoch 0, global step 50: 'validation_loss' reached 0.00010 (best 0.00010), saving model to '/root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt' as top 1


Epoch 0: : 100%|██████████| 50/50 [00:15<00:00, v_num=0, reduced_train_loss=3.75e-5, global_step=49.00, consumed_samples=100.0, train_step_timing in s=0.269, val_loss=0.000104][NeMo I 2025-07-11 17:24:41 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.052-step=25-consumed_samples=50.0.ckpt
[NeMo I 2025-07-11 17:24:42 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.052-step=25-consumed_samples=50.0-last.ckpt


`Trainer.fit` stopped: `max_steps=50` reached.


Epoch 0: : 100%|██████████| 50/50 [00:15<00:00, v_num=0, reduced_train_loss=3.75e-5, global_step=49.00, consumed_samples=100.0, train_step_timing in s=0.269, val_loss=0.000104]


Restoring states from the checkpoint path at /root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt
Restored all states from the checkpoint at /root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/experiments/customer_support_lora/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt


🎤 **PRESENTER SCRIPT:**

"Look at what we've created! Let me break down these files:

1. **customer_support_lora.nemo** (21MB) - This is your golden ticket! Your custom AI adapter
2. **The .ckpt files** (147MB each) - Full training checkpoints with optimizer states

Notice the file sizes? The .nemo file is just 21MB - that's smaller than a typical smartphone photo! Yet it completely transforms an 8B parameter model. 

Here's the mind-blowing part: You can email this file. Put it on a USB stick. Upload it to Dropbox. This tiny file contains all the customization that turns generic Llama into YOUR specialized AI.

Compare this to fine-tuning the whole model:
- Full model: 15-30GB file
- LoRA adapter: 21MB file
- That's 1000x smaller! 🤯

In production, I've seen companies deploy 50+ different LoRA adapters for different departments, all using the same base model. Marketing gets one style, Legal gets another, Customer Support gets a third - all from these tiny adapter files."


## 4. Verify Training Results

In [11]:
# Check if training created the LoRA adapter
!ls -la ./lora_tutorial/experiments/customer_support_lora*/checkpoints/

total 307504
drwxr-xr-x 2 root root      4096 Jul 11 17:24  .
drwxr-xr-x 5 root root      4096 Jul 11 17:24  ..
-rw-r--r-- 1 root root  21012480 Jul 11 17:24  customer_support_lora.nemo
-rw-r--r-- 1 root root 146930030 Jul 11 17:24 'megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0-last.ckpt'
-rw-r--r-- 1 root root 146930030 Jul 11 17:24 'megatron_gpt_peft_lora_tuning--validation_loss=0.000-step=50-consumed_samples=100.0.ckpt'


## 5. Test Your Trained LoRA Adapter

Now let's test how your trained adapter performs!

**Note**: The inference script uses the default prompt template which works with our data format that has "input" and "output" fields.

**Memory Requirements**: Llama 3.1 8B requires significant GPU memory:
- Training with LoRA: ~25-30GB (with optimizations)
- Inference: ~20-25GB (with BF16 precision)

If you encounter out-of-memory errors:
1. The script now uses BF16 precision and optimizations
2. Make sure no other processes are using GPU memory (`nvidia-smi`)
3. Try restarting the kernel to free memory
4. As a last resort, you can skip this cell - your LoRA adapter is already saved!

🎤 **PRESENTER SCRIPT:**

"Quick technical note - we're about to clear GPU memory. This is like closing apps on your phone to free up RAM. 

GPUs are amazing at parallel processing, but they're memory hogs. That 8B parameter model? It needs about 16GB just to load, plus more for processing. 

Pro tip: In production, you'd typically separate training and inference onto different machines. But for workshops, we need to be memory-efficient.

*[Run the memory clearing cell]*

See those numbers? 
- **Total**: Your GPU's capacity
- **Allocated**: What we're currently using
- **Free**: What's available

If you see less than 20GB free, the inference might struggle. No worries though - your adapter is already saved! The training was the important part."


In [None]:
# Clear GPU memory before inference
import torch
import gc

# Clear any cached memory
torch.cuda.empty_cache()
gc.collect()

# Check GPU memory status
if torch.cuda.is_available():
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    allocated = torch.cuda.memory_allocated(0) / 1e9
    free = (torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated(0)) / 1e9
    
    print(f"GPU Memory Status:")
    print(f"  Total: {gpu_memory:.2f} GB")
    print(f"  Allocated: {allocated:.2f} GB") 
    print(f"  Free: {free:.2f} GB")
    
    if free < 20:
        print("\n⚠️  Warning: Less than 20GB free. Inference may fail.")
        print("   Consider restarting the kernel to free all memory.")


🎤 **PRESENTER SCRIPT:**

"The moment of truth! Let's see what your AI learned. 

*[Run the cell and look at the outputs together]*

Look at these predictions! Notice how the model:
- Maintains the professional tone we trained it on
- Acknowledges the customer's concern first
- Offers specific, actionable solutions
- Keeps responses concise but helpful

Is it perfect? Maybe not - we only trained for 50 steps! But look how close it is to our training examples. With just 5 training examples and 10 minutes, we've transformed a general-purpose LLM into a specialized customer support agent.

Real-world impact story: A telecommunications company used this exact approach. They:
- Trained on 500 real support tickets
- Achieved 94% customer satisfaction (up from 67% with scripts)
- Reduced average response time from 3 minutes to 10 seconds
- Saved $2.3M annually in support costs

And the best part? When they need to update the responses for new products or policies, they just retrain the LoRA adapter. Takes 30 minutes instead of retraining the entire model for days.

Ready to deploy this? Let's head to the next notebook and put your custom AI into production with NVIDIA NIM!"


In [12]:
# First, create a test file with a few examples
test_examples = [
    {
        "input": "User: My package is damaged. What should I do?\\n\\nAssistant:",
        "output": "I'm sorry to hear you received a damaged product. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately."
    },
    {
        "input": "User: How do I track my order?\\n\\nAssistant:",
        "output": "You can track your order by logging into your account and clicking 'Order History', or use the tracking link in your confirmation email. The tracking number will show real-time updates."
    }
]

with jsonlines.open('lora_tutorial/data/test_small.jsonl', 'w') as writer:
    writer.write_all(test_examples)
    
print("Created test file with 2 examples")

Created test file with 2 examples


In [16]:
%%bash

# Run inference using the trained LoRA adapter
# Find the model file dynamically (NGC creates subdirectories)
MODEL_DIR="lora_tutorial/models/llama-3_1-8b-instruct"
MODEL=$(find "$MODEL_DIR" -name "*.nemo" -type f | head -1)

if [ -z "$MODEL" ]; then
    echo "ERROR: No .nemo model file found in $MODEL_DIR"
    exit 1
fi

TEST_DS="[lora_tutorial/data/test_small.jsonl]"
TEST_NAMES="[customer_support]"

# Use relative path to NeMo
NEMO_PATH="./NeMo"

# Path to the LoRA checkpoint - use the actual file name
LORA_CKPT="lora_tutorial/experiments/customer_support_lora/checkpoints/customer_support_lora.nemo"

# Check if LoRA checkpoint exists
if [ ! -f "$LORA_CKPT" ]; then
    echo "WARNING: LoRA checkpoint not found at $LORA_CKPT"
    echo "Make sure you've run the training step successfully"
fi

# Run generation with memory optimizations
python "${NEMO_PATH}/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py" \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${LORA_CKPT} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=1 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=100 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=customer_support_lora \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.label_key="output"

    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2025-07-11 17:39:19 megatron_gpt_generate:127] 
    
    ************** Experiment configuration ***********
[NeMo I 2025-07-11 17:39:19 megatron_gpt_generate:128] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        sav

[NeMo W 2025-07-11 17:39:19 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    


[NeMo I 2025-07-11 17:39:19 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2

[NeMo I 2025-07-11 17:39:38 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2025-07-11 17:39:38 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-07-11 17:39:38 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-07-11 17:39:38 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2025-07-11 17:39:38 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2025-07-11 17:39:38 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2025-07-11 17:39:38 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2025-07-11 17:39:38 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2025-07-11 17:39:38 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2025-07-11 17:39:38 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-07-11 17:39:38 megatron_init:310] All tensor model parallel group ranks: 

25-07-11 17:39:38 - PID:65064 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.

[NeMo I 2025-07-11 17:39:39 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2025-07-11 17:39:39 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:39 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:39 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:39 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2025-07-11 17:39:39 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2025-07-11 17:40:02 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2025-07-11 17:40:44 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/NIM-build-tune-deploy-presenter/lora_tutorial/models/llama-3_1-8b-instruct/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo.
[NeMo I 2025-07-11 17:40:44 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ------------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2025-07-11 17:40:48 nlp_adapter_mixins:208] After adding PEFT params:
      | Name  | Type          | Params | Mode 
  

[NeMo W 2025-07-11 17:40:48 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2025-07-11 17:40:48 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2025-07-11 17:40:48 megatron_gpt_sft_model:803] Building GPT SFT test datasets.
[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:116] Building data files
[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:525] Processing 1 data files using 15 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:495] Building indexing for fn = lora_tutorial/data/test_small.jsonl
[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:507] Saving idx file = lora_tutorial/data/test_small.jsonl.idx.npy
[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:509] Saving metadata file = lora_tutorial/data/test_small.jsonl.idx.info
[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.434514
[NeMo I 2025-07-11 17:40:48 text_memmap_dataset:525] Processing 1 data files using 15 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

[NeMo I 2025-07-11 17:40:49 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.439037
[NeMo I 2025-07-11 17:40:49 text_memmap_dataset:158] Loading data files
[NeMo I 2025-07-11 17:40:49 text_memmap_dataset:249] Loading lora_tutorial/data/test_small.jsonl
[NeMo I 2025-07-11 17:40:49 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001088
[NeMo I 2025-07-11 17:40:49 text_memmap_dataset:165] Computing global indices
[NeMo I 2025-07-11 17:40:49 megatron_gpt_sft_model:806] Length of test dataset: 2
[NeMo I 2025-07-11 17:40:49 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2025-07-11 17:40:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=29` in the `DataLoader` to improve performance.
    
[NeMo W 2025-07-11 17:40:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
      input_info_tensor = torch.cuda.FloatTensor(input_info)
    
      string_tensor = torch.as_tensor(
    


Testing DataLoader 0: 100%|██████████| 2/2 [00:07<00:00,  0.26it/s][NeMo I 2025-07-11 17:40:57 megatron_gpt_sft_model:561] Total deduplicated inference data size: 2 to 2
[NeMo I 2025-07-11 17:40:57 megatron_gpt_sft_model:712] Predictions saved to customer_support_lora_test_customer_support_inputs_preds_labels.jsonl


[NeMo W 2025-07-11 17:40:57 megatron_gpt_sft_model:652] No training data found, reconfiguring microbatches based on validation batch sizes.
[NeMo W 2025-07-11 17:40:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2025-07-11 17:40:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss_customer_support', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2025-07-11 17:40:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss

Testing DataLoader 0: 100%|██████████| 2/2 [00:07<00:00,  0.26it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃[1m [0m[1m       Test metric        [0m[1m [0m┃[1m [0m[1m       DataLoader 0       [0m[1m [0m┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│[36m [0m[36m        test_loss         [0m[36m [0m│[35m [0m[35m    4.478262901306152     [0m[35m [0m│
│[36m [0m[36mtest_loss_customer_support[0m[36m [0m│[35m [0m[35m    4.478262901306152     [0m[35m [0m│
│[36m [0m[36m         val_loss         [0m[36m [0m│[35m [0m[35m    4.478262901306152     [0m[35m [0m│
└────────────────────────────┴────────────────────────────┘


In [17]:
# Look at the generated predictions
!head -n2 customer_support_lora_test_customer_support_inputs_preds_labels.jsonl

{"input": "User: My package is damaged. What should I do?\\n\\nAssistant:", "pred": " I'm sorry to hear your package is damaged. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately.", "label": " I'm sorry to hear you received a damaged product. Please take photos of the damage and packaging, then contact us with your order number. We'll arrange a replacement or refund immediately."}
{"input": "User: How do I track my order?\\n\\nAssistant:", "pred": " I'd be happy to help you track your order. For security, please enter your order number and email address. I'll check the status immediately.", "label": " You can track your order by logging into your account and clicking 'Order History', or use the tracking link in your confirmation email. The tracking number will show real-time updates."}


## Summary

Congratulations! You've successfully:
- ✅ Set up the NeMo training environment
- ✅ Created training data for your custom task
- ✅ Configured LoRA parameters for efficient training
- ✅ Trained your own LoRA adapter on Llama 3.1 8B
- ✅ Tested the adapter to verify it learned your style

Your LoRA adapter is now ready to be deployed with NVIDIA NIM in the next notebook!

**Next Steps**: 
- Open `04_Deploy_LoRA_with_NIM_with_scripts.ipynb` to deploy your custom model
- Experiment with different LoRA ranks and learning rates
- Try training on your own domain-specific data