# Fine-Tuning Nemotron-3 8B using Low-Rank Adaptation (LoRA)
Nemotron-3 is a family of large language models that performs well across a wide range of tasks. While the 8B-parameter base model is a strong baseline for many downstream tasks, it can lack domain-specific knowledge and has no access to proprietary or otherwise sensitive information. Fine-tuning is often used to adapt a model to one or more specific tasks so that it responds better to domain-specific prompts. This notebook walks through preparing a dataset and using Low-Rank Adaptation (LoRA) to fine-tune the base Nemotron-3 8B model from Hugging Face on that dataset.

The implementation of LoRA is based on the paper, [LoRA: Low-Rank Adaptation of Large Language Models](https://openreview.net/pdf?id=nZeVKeeFYf9) by Hu et al.
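
To make the idea concrete, here is a minimal pure-Python sketch of the low-rank update described in the paper. The pretrained weight matrix `W` stays frozen, and only two small matrices `A` and `B` of rank `r` are trained, so the layer computes `W x + (alpha / r) * B (A x)`. The helper names (`matmul`, `lora_forward`) and the tiny dimensions are illustrative assumptions and are not NeMo's implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha):
    """Return W x + (alpha / r) * B (A x) for a column vector x."""
    r = len(A)                          # rank = number of rows of A
    base = matmul(W, x)                 # frozen pretrained path
    update = matmul(B, matmul(A, x))    # trainable low-rank path
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, rank = 8, 8, 2  # toy sizes chosen for illustration only

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]   # Gaussian init
B = [[0.0] * rank for _ in range(d_out)]                               # zero init
x = [[random.gauss(0, 1)] for _ in range(d_in)]                        # column vector

y = lora_forward(x, W, A, B, alpha=4)

full_params = d_in * d_out            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated by LoRA
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because `B` starts at zero, the output equals `W x` before any training step, which follows the initialization described in the paper. The parameter saving grows with the layer size: the LoRA count scales with `d_in + d_out` rather than `d_in * d_out`.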

# Getting the model
You will need to request access to the [Nemotron-3-8B-Base-4K Model](https://huggingface.co/nvidia/nemotron-3-8b-base-4k) through Hugging Face. 

Once you have access, set your Hugging Face username and access token accordingly and run the cell below to download the model into the artifact store.

Optionally, you can download the model into an external data volume for better portability and longer-term storage. If you choose to do so, make sure to change the `MODEL_PATH` parameter.

In [1]:
HF_USERNAME = "<HUGGING-FACE-USERNAME>"
HF_ACCESS_TOKEN = "<HUGGING-FACE-ACCESS-TOKEN>"  # Best practice: set this as an environment variable instead of hard-coding it
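
As the comment above suggests, the credentials can be read from the environment instead of being written into the notebook. The following is a minimal sketch, assuming the environment variables are named `HF_USERNAME` and `HF_ACCESS_TOKEN`; both names and the `read_credential` helper are assumptions made for illustration.

```python
import os

def read_credential(name, placeholder):
    """Return the environment variable `name`, or `placeholder` if it is unset or empty."""
    value = os.environ.get(name, "")
    return value if value else placeholder

# Fall back to the placeholders so the cell still runs when nothing is exported.
HF_USERNAME = read_credential("HF_USERNAME", "<HUGGING-FACE-USERNAME>")
HF_ACCESS_TOKEN = read_credential("HF_ACCESS_TOKEN", "<HUGGING-FACE-ACCESS-TOKEN>")
```

Keeping the token out of the notebook source also keeps it out of version control and saved cell outputs.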

In [2]:
import os 
import subprocess

MODEL_PATH = "/mnt/artifacts/nemotron/Nemotron-3-8B-Base-4k.nemo"

if not os.path.exists(MODEL_PATH):
    subprocess.run("curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash", shell=True, check=True)
    subprocess.run("sudo apt-get install git-lfs", shell=True, check=True)
    subprocess.run(f"git clone https://{HF_USERNAME}:{HF_ACCESS_TOKEN}@huggingface.co/nvidia/nemotron-3-8b-base-4k /mnt/artifacts/nemotron", shell=True, check=True)
else:
    print("The Nemotron model already exists. Skipping download ... ")

The Nemotron model already exists. Skipping download ... 


# Preparing the dataset
We will use LoRA to teach our model to do extractive question answering. The fine-tuning dataset needs to be converted to a `.jsonl` file that follows a specific format. In general, question-answering datasets are easiest to work with when each example provides context (if applicable), a question, and the expected answer, though other downstream tasks work as well.

### Downloading the dataset
We will be using the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) reading comprehension dataset, which consists of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding article. More information on [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) can be found on the project website or in the paper by Rajpurkar et al., "[Know What You Don’t Know: Unanswerable Questions for SQuAD](https://arxiv.org/pdf/1806.03822.pdf)".

In [3]:
DATA_DIR = "/mnt/code/data"

In [4]:
import os 
import wget
import sys

os.environ['OPENBLAS_NUM_THREADS'] = '8'
os.makedirs(DATA_DIR, exist_ok=True)
SQUAD_DIR = os.path.join(DATA_DIR, "SQuAD")
os.makedirs(SQUAD_DIR, exist_ok=True)

In [5]:
# Download the SQuAD dataset
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
!mv train-v1.1.json {SQUAD_DIR}
!mv dev-v1.1.json {SQUAD_DIR}

--2024-03-19 17:29:50--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: ‘train-v1.1.json’


2024-03-19 17:29:51 (110 MB/s) - ‘train-v1.1.json’ saved [30288272/30288272]

--2024-03-19 17:29:51--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.108.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4854279 (4.6M) [application/json]
Saving to: ‘dev-v1.1.json’


2024-03-19 17:29:51 (94.7 MB/s) - ‘dev-v1.1.json’ saved [4854279/4854279]



### Preprocessing the dataset
Datasets often need some form of preprocessing to convert them into a form ready for fine-tuning. LoRA (and all other PEFT tuning methods) expects at least two fields in each record of the `.jsonl` files, `input` and `output`. The `input` field should contain all the tokens necessary for the model to generate the `output`. For example, for extractive QA, the `input` should contain the context text as well as the question.

```
{"input": "User: Context: [CONTEXT_1] Question: [QUESTION_1]\n\nAssistant:", "output": "[ANSWER_1]"}
{"input": "User: Context: [CONTEXT_2] Question: [QUESTION_2]\n\nAssistant:", "output": "[ANSWER_2]"}
{"input": "User: Context: [CONTEXT_3] Question: [QUESTION_3]\n\nAssistant:", "output": "[ANSWER_3]"}
```
Note that we use keywords in the input such as `Context:` and `Question:` to separate the text representing the context and the question. We also begin each input with the keyword `User:` and end it with the `\n\nAssistant:` tokens. These are recommended because NeMo's instruction-tuned models are trained with the prefix `User:` and the suffix `\n\nAssistant:`.
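
To illustrate the target format, here is a minimal sketch that converts a SQuAD v1.1-style dictionary into `input`/`output` records. It is not the NeMo preprocessing script, whose exact spacing and handling of multiple answers may differ; the function name `squad_to_sft_records` and the tiny sample are made up for illustration.

```python
import json

def squad_to_sft_records(squad):
    """Yield {"input", "output"} dicts from a SQuAD v1.1-style dictionary."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Use the first reference answer as the training target.
                answer = qa["answers"][0]["text"]
                prompt = (
                    f"User: Context: {context} "
                    f"Question: {qa['question']}\n\nAssistant:"
                )
                yield {"input": prompt, "output": answer}

# Hand-written sample in the SQuAD v1.1 layout (not real dataset content).
sample = {
    "version": "1.1",
    "data": [{
        "title": "Sky",
        "paragraphs": [{
            "context": "The sky is blue.",
            "qas": [{
                "id": "demo-1",
                "question": "What color is the sky?",
                "answers": [{"text": "blue", "answer_start": 11}],
            }],
        }],
    }],
}

records = list(squad_to_sft_records(sample))
# One JSON object per line is what makes the file JSONL.
print("\n".join(json.dumps(record) for record in records))
```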

The SQuAD dataset does not already reflect this, so let's go ahead and preprocess it to fit the above format. 

To do so, a processing script has been included with this project template. Feel free to take a look inside the `prompt_learning_squad_preprocessing.py` script.

In [6]:
# Preprocess squad data
!python /opt/NeMo/scripts/dataset_processing/nlp/squad/prompt_learning_squad_preprocessing.py --sft-format --data-dir {SQUAD_DIR}

Saving train split to /mnt/code/data/SQuAD/squad_train.jsonl
100%|█████████████████████████████████| 87599/87599 [00:00<00:00, 173728.83it/s]
Saving val split to /mnt/code/data/SQuAD/squad_val.jsonl
100%|█████████████████████████████████| 10570/10570 [00:00<00:00, 168466.43it/s]
Saving test split to /mnt/code/data/SQuAD/squad_test_ground_truth.jsonl
100%|█████████████████████████████████| 10570/10570 [00:00<00:00, 156775.06it/s]
Saving test split to /mnt/code/data/SQuAD/squad_test.jsonl
100%|█████████████████████████████████| 10570/10570 [00:00<00:00, 171112.83it/s]


Let's create smaller train and validation files from the first 5,000 training and 500 validation examples, and take a look at a few samples to confirm the preprocessing is satisfactory.

In [7]:
# What the squad dataset looks like after processing
! head -5000 $SQUAD_DIR/squad_train.jsonl > $SQUAD_DIR/squad_short_train.jsonl
! head -500 $SQUAD_DIR/squad_val.jsonl > $SQUAD_DIR/squad_short_val.jsonl
! head -4 $SQUAD_DIR/squad_short_val.jsonl
! head -4 $SQUAD_DIR/squad_short_train.jsonl

{"input": "User: Context:Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50. Question:Which NFL team represented the AFC at Super Bowl 50?\n\nAssistant:", "output": "Denver Broncos"}
{"input": "User: Context:Super Bowl 50 was an American football game to determine th
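
Beyond eyeballing a few lines, the files can be checked programmatically. The sketch below verifies that every non-empty line parses as JSON and contains the `input` and `output` fields; the `check_jsonl` helper is a hypothetical name, and the demo writes its own temporary file so it does not depend on the SQuAD files.

```python
import json
import os
import tempfile

def check_jsonl(path, required=("input", "output")):
    """Return the number of records, raising ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            missing = [key for key in required if key not in record]
            if missing:
                raise ValueError(f"line {line_no} is missing fields: {missing}")
            count += 1
    return count

# Self-contained demo on a temporary one-record file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        demo = {"input": "User: Context: a Question: b\n\nAssistant:", "output": "c"}
        f.write(json.dumps(demo) + "\n")
    n_records = check_jsonl(demo_path)

print(n_records)
```

In this notebook, the same helper could be pointed at `os.path.join(SQUAD_DIR, "squad_short_train.jsonl")` and the corresponding validation file.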

# Training

Now that the model is available and the data is prepared, we are ready to start the training.

### Load Config

The NeMo toolkit uses a configuration file to make it easy to define and experiment with training parameters without having to change the code. For this project template, a default configuration for fine-tuning has been included.

We will start by loading in that configuration.

In [8]:
from omegaconf import OmegaConf

cfg = OmegaConf.load("/mnt/code/conf/nemotron-finetune-config.yaml")

With the config loaded, we can override certain settings for our environment. The default values should work, but here are some parameters that you may want to adjust:

* `cfg.trainer.precision` - The numeric precision used during fine-tuning. Higher precision may give a more accurate model, but it also uses more memory than lower precision. If the fine-tuning process runs out of memory, try reducing the precision here.
* `cfg.trainer.devices` - The number of devices to use. If running on a multi-GPU system, increase this number as appropriate.
* `cfg.model.global_batch_size` - If using a higher GPU count or if additional GPU memory allows, this value can be increased for higher performance. Note that larger batch sizes use more GPU memory.

One setting that you will need to update is `cfg.model.restore_from_path`. This should point to the `.nemo` file where your model is stored.
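
As a rough illustration of the precision trade-off, the memory needed for the weights alone scales linearly with the bytes stored per parameter. The sketch below assumes roughly 8.5 billion parameters and the standard sizes of 4 bytes (32-bit) and 2 bytes (16-bit); it ignores activations, gradients, and optimizer state, and how the precision setting maps to actual memory use depends on the trainer and model configuration, so treat the result as a lower bound rather than a measurement.

```python
BYTES_PER_PARAM = {"32-bit": 4, "16-bit": 2}

def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

NUM_PARAMS = 8.5e9  # assumed approximate parameter count of the base model

for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, size):.1f} GiB")
```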

In [9]:
cfg.model.restore_from_path=MODEL_PATH

By default, this notebook doesn't use distributed training, so we set the environment variables below for a single process. If you do choose to use distributed training, you may want to change them.

In [10]:
os.environ["LOCAL_RANK"] = '0'
os.environ["RANK"] = '0'
os.environ["WORLD_SIZE"] = '1'

### Configure Training

We now load in our model and configure the trainer using the loaded config.

In [11]:
from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronLMPPTrainerBuilder
from nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model import MegatronGPTSFTModel
from nemo.collections.nlp.parts.peft_config import LoraPEFTConfig

trainer = MegatronLMPPTrainerBuilder(cfg).create_trainer()
model_cfg = MegatronGPTSFTModel.merge_cfg_with(cfg.model.restore_from_path, cfg)
model = MegatronGPTSFTModel.restore_from(cfg.model.restore_from_path, model_cfg, trainer=trainer)
model.add_adapter(LoraPEFTConfig(model_cfg))

[NeMo I 2024-03-19 17:30:06 megatron_trainer_builder:51] Detected interactive environment, using NLPDDPStrategyNotebook


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make

[NeMo I 2024-03-19 17:30:24 megatron_init:241] Rank 0 has data parallel group : [0]
[NeMo I 2024-03-19 17:30:24 megatron_init:247] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-03-19 17:30:24 megatron_init:252] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-03-19 17:30:24 megatron_init:255] Ranks 0 has data parallel rank: 0
[NeMo I 2024-03-19 17:30:24 megatron_init:272] Rank 0 has context parallel group: [0]
[NeMo I 2024-03-19 17:30:24 megatron_init:275] All context parallel group ranks: [[0]]
[NeMo I 2024-03-19 17:30:24 megatron_init:276] Ranks 0 has context parallel rank: 0
[NeMo I 2024-03-19 17:30:24 megatron_init:287] Rank 0 has model parallel group: [0]
[NeMo I 2024-03-19 17:30:24 megatron_init:288] All model parallel group ranks: [[0]]
[NeMo I 2024-03-19 17:30:24 megatron_init:298] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-03-19 17:30:24 megatron_init:302] All tensor model parallel group ranks: 

[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:30:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in 

[NeMo I 2024-03-19 17:30:24 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmp0a98a0h5/586f3f51a9cf43bc9369bd53fa08868c_a934dc7c3e1e46a6838bb63379916563_3feba89c944047c19d5a1d0c07a85c32_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
[NeMo I 2024-03-19 17:30:24 megatron_base_model:539] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.


Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-03-19 17:32:16 nlp_overrides:1108] Model MegatronGPTSFTModel was successfully restored from /mnt/artifacts/nemotron/Nemotron-3-8B-Base-4k.nemo.
[NeMo I 2024-03-19 17:32:16 nlp_adapter_mixins:184] Before adding PEFT params:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 8.5 B 
    -----------------------------------
    0         Trainable params
    8.5 B     Non-trainable params
    8.5 B     Total params
    34,160.542Total estimated model params size (MB)
[NeMo I 2024-03-19 17:32:18 nlp_adapter_mixins:197] After adding PEFT params:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 8.6 B 
    -----------------------------------
    16.8 M    Trainable params
    8.5 B     Non-trainable params
    8.6 B     Total params
    34,227.651Total estimated model params size (MB)
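
The 16.8 M trainable parameters reported above are consistent with a simple count, under some assumptions that the notebook does not state: a hidden size of 4096, 32 transformer layers, a fused query/key/value projection of width 3 × 4096, and a LoRA rank of 32 applied to that projection only. These values are assumptions chosen for illustration; check the model and fine-tuning configs for the actual ones.

```python
def lora_param_count(d_in, d_out, rank, num_layers):
    """Parameters in A (rank x d_in) plus B (d_out x rank), summed over all layers."""
    return num_layers * rank * (d_in + d_out)

hidden = 4096   # assumed hidden size
layers = 32     # assumed number of transformer layers
rank = 32       # assumed LoRA rank (adapter dimension)

# The fused QKV projection maps hidden -> 3 * hidden.
trainable = lora_param_count(hidden, 3 * hidden, rank, layers)
print(f"{trainable:,} trainable parameters (~{trainable / 1e6:.1f} M)")
# prints "16,777,216 trainable parameters (~16.8 M)"
```

At 4 bytes per parameter this is about 67.1 MB, which matches the difference between the two estimated model sizes in the log (34,227.651 MB and 34,160.542 MB).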


### Configure experiment
We will also set up experiment logging so that checkpoints are created and training can be resumed later on.

In [12]:
from nemo.utils.exp_manager import exp_manager

exp_dir = exp_manager(trainer, cfg.get("exp_manager", None))

[NeMo W 2024-03-19 17:32:18 exp_manager:759] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2024-03-19 17:32:18 exp_manager:616] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints. Training from scratch.


[NeMo I 2024-03-19 17:32:18 exp_manager:396] Experiments will be logged at /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning
[NeMo I 2024-03-19 17:32:18 exp_manager:842] TensorboardLogger has been set up


[NeMo W 2024-03-19 17:32:18 exp_manager:952] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


### Train model
Finally, we can train our model!

In [13]:
trainer.fit(model)



[NeMo I 2024-03-19 17:32:18 megatron_gpt_sft_model:767] Building GPT SFT validation datasets.
[NeMo I 2024-03-19 17:32:18 text_memmap_dataset:116] Building data files
[NeMo I 2024-03-19 17:32:18 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:495] Building indexing for fn = /mnt/code/data/SQuAD/squad_short_val.jsonl
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:507] Saving idx file = /mnt/code/data/SQuAD/squad_short_val.jsonl.idx.npy
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:509] Saving metadata file = /mnt/code/data/SQuAD/squad_short_val.jsonl.idx.info
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.085209
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:525] Processing 1 data files using 2 workers
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.077591
[NeMo I 2024-03-19 17:32:19 text_memmap_dataset:158] Loading d

      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
[NeMo I 2024-03-19 17:32:19 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.04 (sec)
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-03-19 17:32:19 megatron_gpt_sft_model:783] Length of train dataset: 1005
[NeMo I 2024-03-19 17:32:19 megatron_gpt_sft_model:788] Building dataloader with consumed samples: 0
[NeMo I 2024-03-19 17:32:19 megatron_gpt_sft_model:788] Building dataloader with consumed samples: 0
[NeMo I 2024-03-19 17:32:19 megatron_gpt_sft_model:788] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-03-19 17:32:19 nlp_overrides:227] Configuring DDP for model parallelism.


[NeMo W 2024-03-19 17:32:19 megatron_base_model:1145] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 1000.


[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-03-19 17:32:19 adapter_mixins:435] Unfrozen adapter : lora_kqv_


  | Name  | Type     | Params
-----------------------------------
0 | model | GPTModel | 8.6 B 
-----------------------------------
16.8 M    Trainable params
8.5 B     Non-trainable params
8.6 B     Total params
34,227.651Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]



Sanity Checking DataLoader 0: : 3it [00:01,  2.65it/s]                     

    
    
    


                                                      



Epoch 0: :   0%|          | 0/1005 [00:00<?]

    
    


Epoch 0: :   5%|▍         | 50/1005 [00:08<02:47, v_num=2, reduced_train_loss=0.00656, global_step=49.00, consumed_samples=50.00, train_step_timing in s=0.198] 
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<01:55,  4.32it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<01:22,  6.02it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<01:09,  7.12it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<01:06,  7.48it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<01:02,  7.96it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:00<00:59,  8.24it/s]
Validation DataLoader 0:   1%|▏         | 7/500 [00:00<00:58,  8.42it/s]
Validation DataLoader 0:   2%|▏         | 8/500 [00:00<00:57,  8.63it/s]
Validation DataLoader 0:   2%|▏         | 9/500 [00:01<00:56,  8.71it/s][

Epoch 0, global step 50: 'validation_loss' reached 0.79337 (best 0.79337), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.793-step=50-consumed_samples=50.0.ckpt' as top 1


Epoch 0: :  10%|▉         | 100/1005 [01:10<10:39, v_num=2, reduced_train_loss=0.000905, global_step=99.00, consumed_samples=100.0, train_step_timing in s=0.433, val_loss=0.793]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:53,  9.27it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:50,  9.82it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:50,  9.77it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<00:48, 10.14it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<00:59,  8.29it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:00<00:57,  8.64it/s]
Validation DataLoader 0:   1%|▏         | 7/500 [00:00<00:55,  8.87it/s]
Validation DataLoader 0:   2%|▏         | 8/500 [00:00<00:54,  8.99it/s]
Validation DataLoader 0:   2%|▏         | 9/500 [00:00<00

Epoch 0, global step 100: 'validation_loss' reached 0.74545 (best 0.74545), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.745-step=100-consumed_samples=100.0.ckpt' as top 1


[NeMo I 2024-03-19 17:34:27 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.793-step=50-consumed_samples=50.0.ckpt
[NeMo I 2024-03-19 17:34:28 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.793-step=50-consumed_samples=50.0-last.ckpt
Epoch 0: :  15%|█▍        | 150/1005 [02:15<12:51, v_num=2, reduced_train_loss=0.829, global_step=149.0, consumed_samples=150.0, train_step_timing in s=0.152, val_loss=0.745]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:37, 13.14it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:46, 10.75it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:

Epoch 0, global step 150: 'validation_loss' reached 0.40087 (best 0.40087), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.401-step=150-consumed_samples=150.0.ckpt' as top 1


[NeMo I 2024-03-19 17:35:34 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.745-step=100-consumed_samples=100.0.ckpt
[NeMo I 2024-03-19 17:35:34 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.745-step=100-consumed_samples=100.0-last.ckpt
Epoch 0: :  20%|█▉        | 200/1005 [03:23<13:38, v_num=2, reduced_train_loss=1.570, global_step=199.0, consumed_samples=200.0, train_step_timing in s=0.249, val_loss=0.401]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:56,  8.87it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:53,  9.31it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00

Epoch 0, global step 200: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:36:44 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.401-step=150-consumed_samples=150.0-last.ckpt
Epoch 0: :  25%|██▍       | 250/1005 [04:31<13:39, v_num=2, reduced_train_loss=0.832, global_step=249.0, consumed_samples=250.0, train_step_timing in s=0.132, val_loss=0.620]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:51,  9.63it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:50,  9.95it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:49, 10.08it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<00:46, 10.64it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<00:59,  8.27it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 250: 'validation_loss' reached 0.34794 (best 0.34794), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.348-step=250-consumed_samples=250.0.ckpt' as top 1


[NeMo I 2024-03-19 17:37:46 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.401-step=150-consumed_samples=150.0.ckpt
[NeMo I 2024-03-19 17:37:46 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.620-step=200-consumed_samples=200.0-last.ckpt
Epoch 0: :  30%|██▉       | 300/1005 [05:35<13:09, v_num=2, reduced_train_loss=0.000232, global_step=299.0, consumed_samples=300.0, train_step_timing in s=0.132, val_loss=0.348]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<03:47,  2.19it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<02:12,  3.76it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00

Epoch 0, global step 300: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:38:50 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.348-step=250-consumed_samples=250.0-last.ckpt
Epoch 0: :  35%|███▍      | 350/1005 [06:38<12:25, v_num=2, reduced_train_loss=0.109, global_step=349.0, consumed_samples=350.0, train_step_timing in s=0.742, val_loss=0.535]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:56,  8.85it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:52,  9.52it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:51,  9.64it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:01<02:06,  3.94it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:01<02:30,  3.29it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 350: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:39:57 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.535-step=300-consumed_samples=300.0-last.ckpt
Epoch 0: :  40%|███▉      | 400/1005 [07:44<11:42, v_num=2, reduced_train_loss=0.000789, global_step=399.0, consumed_samples=400.0, train_step_timing in s=0.130, val_loss=0.439]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:37, 13.19it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:37, 13.33it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:39, 12.53it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<00:42, 11.75it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<00:42, 11.52it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 400: 'validation_loss' reached 0.31331 (best 0.31331), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.313-step=400-consumed_samples=400.0.ckpt' as top 1


[NeMo I 2024-03-19 17:41:01 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.348-step=250-consumed_samples=250.0.ckpt
[NeMo I 2024-03-19 17:41:01 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.439-step=350-consumed_samples=350.0-last.ckpt
Epoch 0: :  45%|████▍     | 450/1005 [08:47<10:50, v_num=2, reduced_train_loss=0.0243, global_step=449.0, consumed_samples=450.0, train_step_timing in s=0.131, val_loss=0.313]  
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:41, 12.02it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<01:08,  7.23it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00

Epoch 0, global step 450: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:42:02 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.313-step=400-consumed_samples=400.0-last.ckpt
Epoch 0: :  50%|████▉     | 500/1005 [09:48<09:54, v_num=2, reduced_train_loss=0.476, global_step=499.0, consumed_samples=500.0, train_step_timing in s=0.130, val_loss=0.326]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:55,  8.96it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:59,  8.31it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<01:14,  6.65it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<01:54,  4.33it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:01<02:02,  4.04it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 500: 'validation_loss' reached 0.30232 (best 0.30232), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.302-step=500-consumed_samples=500.0.ckpt' as top 1


[NeMo I 2024-03-19 17:43:05 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.313-step=400-consumed_samples=400.0.ckpt
[NeMo I 2024-03-19 17:43:05 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.326-step=450-consumed_samples=450.0-last.ckpt
Epoch 0: :  55%|█████▍    | 550/1005 [10:51<08:59, v_num=2, reduced_train_loss=0.000168, global_step=549.0, consumed_samples=550.0, train_step_timing in s=0.130, val_loss=0.302]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:39, 12.65it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:44, 11.29it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00

Epoch 0, global step 550: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:44:13 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.302-step=500-consumed_samples=500.0-last.ckpt
Epoch 0: :  60%|█████▉    | 600/1005 [12:00<08:06, v_num=2, reduced_train_loss=0.000287, global_step=599.0, consumed_samples=600.0, train_step_timing in s=0.134, val_loss=0.305]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<04:59,  1.67it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:01<05:23,  1.54it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:01<03:50,  2.16it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:01<03:06,  2.66it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:02<03:31,  2.34it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 600: 'validation_loss' reached 0.27309 (best 0.27309), saving model to '/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.273-step=600-consumed_samples=600.0.ckpt' as top 1


[NeMo I 2024-03-19 17:45:14 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.302-step=500-consumed_samples=500.0.ckpt
[NeMo I 2024-03-19 17:45:15 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.305-step=550-consumed_samples=550.0-last.ckpt
Epoch 0: :  65%|██████▍   | 650/1005 [13:01<07:06, v_num=2, reduced_train_loss=0.000165, global_step=649.0, consumed_samples=650.0, train_step_timing in s=0.133, val_loss=0.273]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:37, 13.31it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:37, 13.45it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00

Epoch 0, global step 650: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:46:15 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.273-step=600-consumed_samples=600.0-last.ckpt
Epoch 0: :  70%|██████▉   | 700/1005 [14:01<06:06, v_num=2, reduced_train_loss=0.0156, global_step=699.0, consumed_samples=700.0, train_step_timing in s=0.129, val_loss=0.296]  
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:37, 13.39it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:36, 13.51it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:36, 13.55it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<00:40, 12.12it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<00:42, 11.59it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 700: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:47:17 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.296-step=650-consumed_samples=650.0-last.ckpt
Epoch 0: :  75%|███████▍  | 750/1005 [15:05<05:07, v_num=2, reduced_train_loss=2.950, global_step=749.0, consumed_samples=750.0, train_step_timing in s=0.722, val_loss=0.303]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<04:14,  1.96it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<02:30,  3.30it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<01:55,  4.30it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<01:39,  4.98it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<01:29,  5.51it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 750: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:48:21 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.303-step=700-consumed_samples=700.0-last.ckpt
Epoch 0: :  80%|███████▉  | 800/1005 [16:07<04:07, v_num=2, reduced_train_loss=0.00681, global_step=799.0, consumed_samples=800.0, train_step_timing in s=0.130, val_loss=0.318] 
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:37, 13.17it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:37, 13.17it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:39, 12.64it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<00:41, 11.84it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<00:53,  9.28it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 800: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:49:24 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.318-step=750-consumed_samples=750.0-last.ckpt
Epoch 0: :  85%|████████▍ | 850/1005 [17:15<03:08, v_num=2, reduced_train_loss=0.199, global_step=849.0, consumed_samples=850.0, train_step_timing in s=0.133, val_loss=0.320]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<00:47, 10.45it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<00:47, 10.42it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<00:43, 11.32it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<00:42, 11.81it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<00:40, 12.13it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 850: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:50:28 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.320-step=800-consumed_samples=800.0-last.ckpt
Epoch 0: :  90%|████████▉ | 900/1005 [18:14<02:07, v_num=2, reduced_train_loss=0.000116, global_step=899.0, consumed_samples=900.0, train_step_timing in s=0.198, val_loss=0.320]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<05:15,  1.58it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:01<05:05,  1.63it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:01<03:39,  2.27it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:01<02:56,  2.80it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:01<02:30,  3.29it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 900: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:51:31 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.320-step=850-consumed_samples=850.0-last.ckpt
Epoch 0: :  95%|█████████▍| 950/1005 [19:19<01:07, v_num=2, reduced_train_loss=1.170, global_step=949.0, consumed_samples=950.0, train_step_timing in s=0.196, val_loss=0.320]   
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<01:54,  4.36it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<01:21,  6.11it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<01:11,  6.99it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:00<01:05,  7.61it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:00<01:02,  7.98it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 950: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:52:34 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.320-step=900-consumed_samples=900.0-last.ckpt
Epoch 0: : 100%|█████████▉| 1000/1005 [20:23<00:06, v_num=2, reduced_train_loss=0.00774, global_step=999.0, consumed_samples=1e+3, train_step_timing in s=0.190, val_loss=0.319] 
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/500 [00:00<02:43,  3.05it/s]
Validation DataLoader 0:   0%|          | 2/500 [00:00<03:24,  2.44it/s]
Validation DataLoader 0:   1%|          | 3/500 [00:00<02:32,  3.27it/s]
Validation DataLoader 0:   1%|          | 4/500 [00:01<02:05,  3.96it/s]
Validation DataLoader 0:   1%|          | 5/500 [00:01<01:50,  4.46it/s]
Validation DataLoader 0:   1%|          | 6/500 [00:0

Epoch 0, global step 1000: 'validation_loss' was not in top 1


[NeMo I 2024-03-19 17:53:44 nlp_overrides:463] Removing checkpoint: /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.319-step=950-consumed_samples=950.0-last.ckpt
Epoch 0: : 100%|█████████▉| 1000/1005 [21:23<00:06, v_num=2, reduced_train_loss=0.00774, global_step=999.0, consumed_samples=1e+3, train_step_timing in s=0.190, val_loss=0.319]

`Trainer.fit` stopped: `max_steps=1000` reached.


Restoring states from the checkpoint path at /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.273-step=600-consumed_samples=600.0.ckpt





Restored all states from the checkpoint at /mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.273-step=600-consumed_samples=600.0.ckpt
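
The log above shows the checkpoint callback keeping only the top 1 checkpoint by `validation_loss`: a checkpoint is saved when the loss "reached" a new best, and skipped when it "was not in top 1". As a rough illustration, the sketch below scans a few of those log lines (abbreviated after the loss value) and picks the saved step with the lowest loss. The helper name `best_checkpoint` and the regular expression are assumptions made for this example; they are not part of NeMo.

```python
import re

# Log lines copied from the training output above, abbreviated after the loss value.
log_lines = [
    "Epoch 0, global step 400: 'validation_loss' reached 0.31331 (best 0.31331)",
    "Epoch 0, global step 450: 'validation_loss' was not in top 1",
    "Epoch 0, global step 500: 'validation_loss' reached 0.30232 (best 0.30232)",
    "Epoch 0, global step 550: 'validation_loss' was not in top 1",
    "Epoch 0, global step 600: 'validation_loss' reached 0.27309 (best 0.27309)",
    "Epoch 0, global step 650: 'validation_loss' was not in top 1",
]

# Matches only the lines where a new best checkpoint was saved.
SAVED = re.compile(r"global step (\d+): 'validation_loss' reached ([0-9.]+)")


def best_checkpoint(lines):
    """Return (step, loss) for the lowest saved validation loss, or None."""
    saved = []
    for line in lines:
        match = SAVED.search(line)
        if match:
            saved.append((int(match.group(1)), float(match.group(2))))
    return min(saved, key=lambda item: item[1]) if saved else None


print(best_checkpoint(log_lines))
```

For these lines the lowest saved loss is at step 600, which matches the checkpoint that the trainer restores at the end of the log.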


# Evaluate
Now that fine-tuning has finished, let's use the fine-tuned model to make predictions on our test dataset.

### Load config
Just like with fine-tuning, we have prepared a config for this project template. Let's start by loading that in.

In [14]:
config_eval = OmegaConf.load("/mnt/code/conf/nemotron-eval-config.yaml")

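We will override two paths in the config: the base model path (`MODEL_PATH`) and the PEFT restore path, which points to the LoRA adapter file (`megatron_gpt_peft_lora_tuning.nemo`) written to the checkpoints directory during fine-tuning.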

In [15]:
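# LoRA adapter weights produced by the fine-tuning run above
CHECKPOINT_PATH = "/mnt/code/nemo_experiments/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo"
# Base model weights (MODEL_PATH was set when the model was downloaded)
config_eval.model.restore_from_path = MODEL_PATH
# Adapter weights to load on top of the base model
config_eval.model.peft.restore_from_path = CHECKPOINT_PATH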
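
Restoring the 8B base model takes over a minute, so it can help to confirm that both paths exist before loading. The sketch below is one minimal way to do that with the standard library. The helper name `missing_paths` is an assumption for this example, and the temporary files are stand-ins for the real `.nemo` files.

```python
import os
import tempfile


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Demonstration with stand-in files; these are not the real model files.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "base_model.nemo")
    absent = os.path.join(tmp, "lora_adapter.nemo")
    open(present, "w").close()  # create an empty placeholder file
    print(missing_paths([present, absent]))  # lists only the adapter path
```

In the notebook, the equivalent call would be `missing_paths([MODEL_PATH, CHECKPOINT_PATH])`, and an empty list means both files are in place.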

### Load model
Now we load in the model and trainer that we will use for evaluation.

In [16]:
from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronTrainerBuilder
from nemo.collections.nlp.models.language_modeling.megatron_gpt_sft_model import MegatronGPTSFTModel
from nemo.collections.nlp.parts.peft_config import LoraPEFTConfig

# Build a trainer from the evaluation config
trainer_eval = MegatronTrainerBuilder(config_eval).create_trainer()

# Merge the config stored with the adapter into the evaluation config
eval_model_cfg = MegatronGPTSFTModel.merge_inference_cfg(config_eval.model.peft.restore_from_path, config_eval)

# Restore the base model, then load the LoRA adapter weights on top of it
model_eval = MegatronGPTSFTModel.restore_from(config_eval.model.restore_from_path, eval_model_cfg, trainer=trainer_eval)
model_eval.load_adapters(config_eval.model.peft.restore_from_path)

# Freeze all parameters; no training happens during evaluation
model_eval.freeze()

print("Parameter count manually:\n", model_eval.summarize())

[NeMo I 2024-03-19 17:54:12 megatron_trainer_builder:51] Detected interactive environment, using NLPDDPStrategyNotebook


      rank_zero_warn(
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or con

[NeMo I 2024-03-19 17:54:24 megatron_init:241] Rank 0 has data parallel group : [0]
[NeMo I 2024-03-19 17:54:24 megatron_init:247] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-03-19 17:54:24 megatron_init:252] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-03-19 17:54:24 megatron_init:255] Ranks 0 has data parallel rank: 0
[NeMo I 2024-03-19 17:54:24 megatron_init:272] Rank 0 has context parallel group: [0]
[NeMo I 2024-03-19 17:54:24 megatron_init:275] All context parallel group ranks: [[0]]
[NeMo I 2024-03-19 17:54:24 megatron_init:276] Ranks 0 has context parallel rank: 0
[NeMo I 2024-03-19 17:54:24 megatron_init:287] Rank 0 has model parallel group: [0]
[NeMo I 2024-03-19 17:54:24 megatron_init:288] All model parallel group ranks: [[0]]
[NeMo I 2024-03-19 17:54:24 megatron_init:298] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-03-19 17:54:24 megatron_init:302] All tensor model parallel group ranks: 

[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in 

[NeMo I 2024-03-19 17:54:24 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpvnir208e/586f3f51a9cf43bc9369bd53fa08868c_a934dc7c3e1e46a6838bb63379916563_3feba89c944047c19d5a1d0c07a85c32_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
[NeMo I 2024-03-19 17:54:24 megatron_base_model:539] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.


[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-19 17:54:24 megatron_base_model:1104] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in 

[NeMo I 2024-03-19 17:54:25 build_model:143]  > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 8540135424
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-03-19 17:55:50 nlp_overrides:1108] Model MegatronGPTSFTModel was successfully restored from /mnt/artifacts/nemotron/Nemotron-3-8B-Base-4k.nemo.
[NeMo I 2024-03-19 17:55:51 nlp_adapter_mixins:184] Before adding PEFT params:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 8.5 B 
    -----------------------------------
    0         Trainable params
    8.5 B     Non-trainable params
    8.5 B     Total params
    34,160.542 Total estimated model params size (MB)
[NeMo I 2024-03-19 17:55:52 nlp_adapter_mixins:197] After adding PEFT params:
      | Name  | Type     | Params
    -----------------------------------
    0 | model | GPTModel | 8.6 B 
    -----------------------------------
    16.8 M    Trainable params
    8.5 B     No

### Load test dataset
We load in the test dataset as well.
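
Before building the dataset, a quick look at the first few records of the JSON Lines file can catch formatting problems early. The sketch below reads the first `n` records of any `.jsonl` file. The helper name `peek_jsonl` and the `input`/`output` field names in the demo file are hypothetical; check the actual file for its real field names.

```python
import json
import os
import tempfile


def peek_jsonl(path, n=2):
    """Return the first `n` non-empty records of a JSON Lines file as dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demonstration on a small stand-in file with hypothetical field names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"input": "Context: ... Question: ...", "output": "answer 1"}) + "\n")
        f.write(json.dumps({"input": "Context: ... Question: ...", "output": "answer 2"}) + "\n")
        f.write(json.dumps({"input": "Context: ... Question: ...", "output": "answer 3"}) + "\n")
    for record in peek_jsonl(demo_path, n=2):
        print(sorted(record.keys()))
```

The same function can be pointed at the test file configured under `data.test_ds` in the evaluation config.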

In [17]:
from torch.utils.data import DataLoader

# Build the test dataset(s) from the config; the first entry is used below
_test_ds = model_eval._build_dataset(eval_model_cfg.data.test_ds, is_train=False)
request_dl = DataLoader(
    dataset=_test_ds[0],
    batch_size=eval_model_cfg.data.test_ds.global_batch_size,
    collate_fn=_test_ds[0].collate_fn,
)

# Apply the generation settings from the `inference` section of the eval config
config_inference = OmegaConf.to_container(config_eval.inference, resolve=True)
model_eval.set_inference_config(config_inference)

[NeMo I 2024-03-19 17:55:52 text_memmap_dataset:116] Building data files
[NeMo I 2024-03-19 17:55:52 text_memmap_dataset:525] Processing 1 data files using 128 workers
[NeMo I 2024-03-19 17:56:40 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:47.740731
[NeMo I 2024-03-19 17:56:40 text_memmap_dataset:525] Processing 1 data files using 128 workers
[NeMo I 2024-03-19 17:57:29 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:48.695877
[NeMo I 2024-03-19 17:57:29 text_memmap_dataset:158] Loading data files
[NeMo I 2024-03-19 17:57:29 text_memmap_dataset:249] Loading /mnt/code/data/SQuAD/squad_short_val.jsonl
[NeMo I 2024-03-19 17:57:29 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001329
[NeMo I 2024-03-19 17:57:29 text_memmap_dataset:165] Computing global indices


### Run predictions
Now it is time to run the test examples through the model and see the results!

**Keep in mind that the quality of the results may vary. The hyperparameters in this notebook are not tuned and serve only as examples. Is the model underfitting or overfitting? The configs can be adjusted to improve performance. The point is that fine-tuning the out-of-the-box model for a general QA task is easy and straightforward with this workflow!**
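
One rough way to put a number on prediction quality is an exact-match score against reference answers. The sketch below assumes each generated sentence ends with an `Assistant:` marker followed by the answer, as in the sample output printed by the prediction cell. The helper names and the second example with its reference answer are made up for illustration.

```python
def extract_answer(sentence, marker="Assistant:"):
    """Return the text after the last `marker`, or an empty string if it is absent."""
    _, found, tail = sentence.rpartition(marker)
    return tail.strip() if found else ""


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Stand-in sentences in the same shape as the notebook output (contexts shortened).
sentences = [
    "User: Context:... Question:Which NFL team represented the AFC at Super Bowl 50?\n\nAssistant: Denver Broncos",
    "User: Context:... Question:Where was the game played?\n\nAssistant: Santa Clara",
]
references = ["Denver Broncos", "Levi's Stadium"]  # second reference is hypothetical

answers = [extract_answer(s) for s in sentences]
print(answers)
print(exact_match(answers, references))
```

Exact match is strict, so a token-level F1 score is often reported alongside it for QA tasks. Tracking either metric across hyperparameter changes gives a more concrete signal than reading individual outputs.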

In [18]:
response = trainer_eval.predict(model_eval, request_dl)
for batch in response:
    for s in batch['sentences']:
        print(f"{s}\n\n")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
      rank_zero_warn(
    


Predicting DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]

      string_tensor = torch.as_tensor(
    


Predicting DataLoader 0: 100%|██████████| 500/500 [02:52<00:00,  2.91it/s]
User: Context:Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. Question:Which NFL team represented the AFC at Super Bowl 50?

Assistant: Denver Broncos


User: Context:Super Bowl 50 was an American 