# Creating a Llama-3.1 LoRA adapter with the NeMo Framework and Deploying It via NVIDIA NIM
In this notebook, we finetune our own LoRA adapter with the NeMo Framework and then deploy it using an NVIDIA NIM. The adapter is trained on a cleaned-up version of the [Law StackExchange](https://huggingface.co/datasets/ymoslem/Law-StackExchange) dataset. Law StackExchange is a dataset of legal questions and answers. Each record consists of a question, its title, and human-provided answers. Given a Law StackExchange forum question, our goal is to auto-generate an appropriate title for it.

#### NVIDIA NeMo Framework and NVIDIA NIM
NVIDIA NeMo Framework is a scalable, cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g., Automatic Speech Recognition and Text-to-Speech). It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints. After we finetune a LoRA adapter using NeMo, we deploy it using an NVIDIA NIM. An NVIDIA NIM is an accelerated inference solution for generative AI models.

First, we install the NGC CLI and Docker, and then pull the `.nemo` checkpoint that we will use for finetuning. This can take about 5-7 minutes.

In [1]:
%%bash
test -f setup-ngc.sh || (wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-ngc.sh; chmod +x setup-ngc.sh)
./setup-ngc.sh

--2024-10-31 12:27:55--  https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-ngc.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1079 (1.1K) [text/plain]
Saving to: ‘setup-ngc.sh’

     0K .                                                     100% 91.9M=0s

2024-10-31 12:27:56 (91.9 MB/s) - ‘setup-ngc.sh’ saved [1079/1079]



NGC CLI v3.49.0 installed. Restart terminal or source profile to use.
Alternatively, you can use an explicit path to: /root/verb-workspace/ngc-cli/ngc


In [2]:
!COLUMNS=400 ./ngc-cli/ngc registry model download-version "nvidia/nemo/llama-3_1-8b-instruct-nemo:1.0"

CLI_VERSION: Latest - 3.53.0 available (current: 3.49.0). Please update by using the command 'ngc version upgrade' 

Getting files to download...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • 15.0/15.0 GiB • Remaining: 0:00:00 • 141.8 MB/s • Elapsed: 0:01:59 • Total: 1 - Completed: 1 - Failed: 0
------------------------------------------------------------------------------------
   Download status: COMPLETED
   Downloaded local path model: /root/verb-workspace/llama-3_1-8b-instruct-nemo_v1.0
   Total files downloaded: 1
   Total transferred: 14.96 GB
   Started at: 2024-10-31 12:28:03
   Completed at: 2024-10-31 12:30:03
   Duration taken: 1m 59s
------------------------------------------------------------------------------------


In [3]:
# This should list the downloaded .nemo checkpoint
!ls ./llama-3_1-8b-instruct-nemo_v1.0

llama3_1_8b_instruct.nemo


In [4]:
import os
import json
import numpy as np
from rouge_score import rouge_scorer, scoring

# Phase 1: Finetuning the LoRA adapter

## Step-by-step PEFT finetuning instructions

1. Prepare the dataset
2. Run the PEFT finetuning script
3. Inference with NeMo Framework
4. Check the model accuracy
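
Step 4 scores generated titles against the human-written ones with ROUGE (the `rouge_score` package imported above). As a rough illustration of what a ROUGE-1 score measures, here is a simplified unigram-overlap version. It is a stand-in for intuition only, not the `rouge_score` implementation: the tokenizer is a plain regular expression, there is no stemming, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, prediction):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference/prediction pair, for illustration only.
p, r, f = rouge1("What constitutes doing business",
                 "What constitutes business")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A higher F1 means the generated title shares more words with the human-written one; it says nothing about word order, which is what ROUGE-L adds.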

In [5]:
!wget https://huggingface.co/datasets/bigmlguy2234/hf-law-qa-dataset/resolve/main/law-qa-curated.zip 

--2024-10-31 12:30:18--  https://huggingface.co/datasets/bigmlguy2234/hf-law-qa-dataset/resolve/main/law-qa-curated.zip
Resolving huggingface.co (huggingface.co)... 18.154.227.7, 18.154.227.87, 18.154.227.67, ...
Connecting to huggingface.co (huggingface.co)|18.154.227.7|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/a6/d5/a6d5955c217c4e78e708cfea9bf52e46fb3c5cc93151c5447c804929b8db561a/b26fcd36ab38c6011cecb8f8d6f0e9990441dfa9d1fa9f9a8d740612493c4a90?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27law-qa-curated.zip%3B+filename%3D%22law-qa-curated.zip%22%3B&response-content-type=application%2Fzip&Expires=1730637018&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczMDYzNzAxOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2E2L2Q1L2E2ZDU5NTVjMjE3YzRlNzhlNzA4Y2ZlYTliZjUyZTQ2ZmIzYzVjYzkzMTUxYzU0NDdjODA0OTI5YjhkYjU2MWEvYjI2ZmNkMzZhYjM4YzYwMTFjZWNiOGY4ZDZmMGU

In [6]:
!unzip -j law-qa-curated.zip -d curated-data

Archive:  law-qa-curated.zip
  inflating: curated-data/law-qa-test.jsonl  
  inflating: curated-data/law-qa-val.jsonl  
  inflating: curated-data/law-qa-train.jsonl  


You should see the `law-qa-{train/val/test}.jsonl` splits in the `curated-data` folder.
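
Before preprocessing, it can help to confirm that each split parses as JSON Lines and contains the fields used later (`question` and `title`). The following is a minimal sketch; the helper name `summarize_jsonl` is hypothetical, and the loop simply skips any split that is not present on disk.

```python
import json
import os
from collections import Counter


def summarize_jsonl(path):
    """Return (record_count, Counter of field names) for a JSON Lines file."""
    count = 0
    fields = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            fields.update(record.keys())
    return count, fields


for split in ("train", "val", "test"):
    path = os.path.join("curated-data", f"law-qa-{split}.jsonl")
    if os.path.exists(path):
        n, fields = summarize_jsonl(path)
        print(f"{split}: {n} records, fields seen: {dict(fields)}")
```

If a field count is lower than the record count, some records are missing that field and the preprocessing cell would raise a `KeyError` on them.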

In [7]:
DATA_DIR = os.path.join("./curated-data")

TRAIN_DS = os.path.join(DATA_DIR, "law-qa-train.jsonl")
VAL_DS = os.path.join(DATA_DIR, "law-qa-val.jsonl")
TEST_DS = os.path.join(DATA_DIR, "law-qa-test.jsonl")

In [8]:
# Add a prompt instruction.
PROMPT='''Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more.'''

# Creates a preprocessed version of the data files
for input_file in [TRAIN_DS, VAL_DS, TEST_DS]:
    output_file = input_file.rsplit('.', 1)[0] + '_preprocessed.jsonl'
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            # Parse each line as JSON
            data = json.loads(line)

            # Create a new dictionary with only the desired fields, renamed and formatted
            new_data = {
                "input": f'''{PROMPT} \nQUESTION: {data["question"]} \nTITLE: ''',
                "output": data['title']
            }

            # Write the new data as a JSON line to the output file
            json.dump(new_data, outfile)
            outfile.write('\n')  # Add a newline after each JSON object

    print(f"Processed {input_file} and created {output_file}")

Processed ./curated-data/law-qa-train.jsonl and created ./curated-data/law-qa-train_preprocessed.jsonl
Processed ./curated-data/law-qa-val.jsonl and created ./curated-data/law-qa-val_preprocessed.jsonl
Processed ./curated-data/law-qa-test.jsonl and created ./curated-data/law-qa-test_preprocessed.jsonl


After running the cell above, you will see `law-qa-{train/test/val}_preprocessed.jsonl` files appear in the data directory.

Each example is formatted as follows:

```json
{"input": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE: ", 
 "output": "What constitutes \"doing business in a jurisdiction?\""}
```
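
A quick sanity check on the preprocessed files can catch formatting slips before a long training run. The sketch below assumes only the format shown above (an `input` that contains `QUESTION: ` and ends with `TITLE: `, plus a non-empty `output`); the function name `check_record` and the sample record are hypothetical.

```python
import json


def check_record(line):
    """Return a list of problems found in one preprocessed JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    for key in ("input", "output"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")

    prompt = record.get("input")
    if isinstance(prompt, str):
        if "QUESTION: " not in prompt:
            problems.append("input has no 'QUESTION: ' marker")
        if not prompt.endswith("TITLE: "):
            problems.append("input does not end with 'TITLE: '")
    return problems


# Hypothetical record in the same shape as the example above.
sample = json.dumps({
    "input": "Generate a title. \nQUESTION: Can I be sued in New York? \nTITLE: ",
    "output": "Jurisdiction basics",
})
print(check_record(sample))  # prints []
```

Running `check_record` over every line of a preprocessed file and collecting the non-empty results gives a short report of records worth inspecting by hand.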

### Step 2: Run PEFT finetuning script for LoRA

The NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that abstracts away some of the lower-level API calls.

For this demonstration, the training run is capped by `max_steps`, and validation runs at the interval set by `val_check_interval`. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.
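
To make these settings concrete, here is a rough sketch of how they interact, using the values passed to the script in the next cell (`max_steps=50`, `val_check_interval=0.2`, `global_batch_size=32`, `micro_batch_size=1`, one GPU). It treats the fractional interval as a fraction of the 50-step run, which is consistent with the checkpoints this run logs at steps 10 through 50. The helper names and the `patience` value are assumptions for illustration; the actual behavior is governed by the NeMo configuration.

```python
def validation_steps(max_steps, val_check_interval):
    """Steps at which validation runs when the interval is a fraction of the run."""
    every = max(1, round(max_steps * val_check_interval))
    return list(range(every, max_steps + 1, every))


def grad_accumulation(global_batch_size, micro_batch_size, data_parallel_size=1):
    """Micro-batches accumulated per optimizer step."""
    return global_batch_size // (micro_batch_size * data_parallel_size)


def should_stop(val_losses, patience=3, min_delta=0.001):
    """True once `patience` consecutive checks fail to improve by min_delta.

    `patience=3` is an assumed value, not one read from the NeMo config.
    """
    best = float("inf")
    checks_without_improvement = 0
    for loss in val_losses:
        if best - loss >= min_delta:
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return True
    return False


max_steps, global_batch_size = 50, 32
print("validation at steps:", validation_steps(max_steps, 0.2))
print("micro-batches per step:", grad_accumulation(global_batch_size, 1))
print("samples consumed:", max_steps * global_batch_size)
```

With these values the run consumes 50 × 32 = 1600 samples, so 50 steps is roughly one pass over a training set of about 1600 examples.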



In [13]:
%%bash

# Set paths to the model, train, validation and test sets.
MODEL="./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TRAIN_DS="[./curated-data/law-qa-train_preprocessed.jsonl]"
VALID_DS="[./curated-data/law-qa-val_preprocessed.jsonl]"
TEST_DS="[./curated-data/law-qa-test_preprocessed.jsonl]"
TEST_NAMES="[law]"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

rm -rf results
OUTPUT_DIR="./results/Meta-llama3.1-8B-Instruct-titlegen"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.2 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=32 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-10-31 12:33:21 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-10-31 12:33:21 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 0.2
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen
      exp_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${mode

[NeMo W 2024-10-31 12:33:21 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True


[NeMo I 2024-10-31 12:33:21 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.


TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo E 2024-10-31 12:33:21 exp_manager:703] exp_manager received explicit_log_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen and at least one of exp_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-10-31 12:33:21 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints. Training from scratch.


[NeMo I 2024-10-31 12:33:21 exp_manager:396] Experiments will be logged at results/Meta-llama3.1-8B-Instruct-titlegen
[NeMo I 2024-10-31 12:33:21 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2024-10-31 12:33:21 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does

[NeMo I 2024-10-31 12:33:38 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-31 12:33:38 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-31 12:33:38 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-31 12:33:38 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-31 12:33:38 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-31 12:33:38 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-10-31 12:33:38 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-31 12:33:38 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-31 12:33:38 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-10-31 12:33:38 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-31 12:33:38 megatron_init:310] All tensor model parallel group ranks: 

24-10-31 12:33:38 - PID:3117 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 32
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.

[NeMo I 2024-10-31 12:33:38 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 12:33:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2024-10-31 12:33:57 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2024-10-31 12:34:44 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo.
[NeMo I 2024-10-31 12:34:44 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-10-31 12:34:44 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ------------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-10-31 12:34:48 nlp_adapter_mixins:208] After adding PEFT params:
     

[NeMo W 2024-10-31 12:34:48 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-10-31 12:34:48 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-10-31 12:34:48 megatron_gpt_sft_model:811] Building GPT SFT validation datasets.
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:495] Building indexing for fn = ./curated-data/law-qa-val_preprocessed.jsonl
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:507] Saving idx file = ./curated-data/law-qa-val_preprocessed.jsonl.idx.npy
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:509] Saving metadata file = ./curated-data/law-qa-val_preprocessed.jsonl.idx.info
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.080877
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.063302
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:158] Loading data files
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:249] Loading ./curated-data/law-qa-val_preprocessed.jsonl
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001330
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-10-31 12:34:48 megatron_gpt_sft_model:815] Length of val dataset: 2434
[NeMo I 2024-10-31 12:34:48 megatron_gpt_sft_model:822] Building GPT SFT traing datasets.
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:495] Building indexing for fn = ./curated-data/law-qa-train_preprocessed.jsonl
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:507] Saving idx file = ./curated-data/law-qa-train_preprocessed.jsonl.idx.npy
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:509] Saving metadata file = ./curated-data/law-qa-train_preprocessed.jsonl.idx.info
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.076062
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.053236
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:158] Loading data files
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:249] Loading ./curated-data/law-qa-train_preprocessed.jsonl
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001070
[NeMo I 2024-10-31 12:34:48 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-10-31 12:34:48 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.08 (sec)
[NeMo I 2024-10-31 12:34:48 megatron_gpt_sft_model:824] Length of train dataset: 1608
[NeMo I 2024-10-31 12:34:48 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0
[NeMo I 2024-10-31 12:34:48 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-10-31 12:34:48 megatron_base_model:1199] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 50.


[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-31 12:34:48 adapter_mixins:435] Unfrozen adapter : lora_kqv_


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 8.0 B  | train
------------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
[NeMo W 2024-10-31 12:34:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    
[NeMo W 2024-10-31 12:34:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
[NeMo W 2024-10-31 12:34:56 nemo_logging:

Epoch 0: :  20%|██        | 10/50 [00:56<03:46, reduced_train_loss=3.330, global_step=9.000, consumed_samples=320.0, train_step_timing in s=5.660]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏         | 1/77 [00:03<03:57,  0.32it/s]
Validation DataLoader 0:   3%|▎         | 2/77 [00:06<03:54,  0.32it/s]
Validation DataLoader 0:   4%|▍         | 3/77 [00:09<03:51,  0.32it/s]
Validation DataLoader 0:   5%|▌         | 4/77 [00:13<04:04,  0.30it/s]
Validation DataLoader 0:   6%|▋         | 5/77 [00:16<03:58,  0.30it/s]
Validation DataLoader 0:   8%|▊         | 6/77 [00:21<04:14,  0.28it/s]
Validation DataLoader 0:   9%|▉         | 7/77 [00:24<04:07,  0.28it/s]
Validation DataLoader 0:  10%|█         | 8/77 [00:27<04:00,  0.29it/s]
Validation DataLoader 0:  12%|█▏        | 9/77 [00:30<03:53,  0.29it/s]
Validati

Metric val_loss improved. New best score: 3.312
Epoch 0, global step 10: 'validation_loss' reached 3.31181 (best 3.31181), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.312-step=10-consumed_samples=320.0.ckpt' as top 1
[NeMo W 2024-10-31 12:40:05 nlp_overrides:480] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :  40%|████      | 20/50 [06:06<09:10, reduced_train_loss=2.870, global_step=19.00, consumed_samples=640.0, train_step_timing in s=5.860, val_loss=3.310]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏         | 1/77 [00:03<03:55,  0.32it/s]
Validation DataLoader 0:   3%|▎         | 2/77 [00:06<03:53,  0.32it/s]
Validation DataLoader 0:   4%|▍         | 3/77 [00:09<03:50,  0.32it/s]
Validation DataLoader 0:   5%|▌         | 4/77 [00:13<04:08,  0.29it/s]
Validation DataLoader 0:   6%|▋         | 5/77 [00:16<04:01,  0.30it/s]
Validation DataLoader 0:   8%|▊         | 6/77 [00:21<04:17,  0.28it/s]
Validation DataLoader 0:   9%|▉         | 7/77 [00:25<04:10,  0.28it/s]
Validation DataLoader 0:  10%|█         | 8/77 [00:28<04:02,  0.28it/s]
Validation DataLoader 0:  12%|█▏        | 9/77 [00:31<03:56,  0.29i

Metric val_loss improved by 0.748 >= min_delta = 0.001. New best score: 2.564
Epoch 0, global step 20: 'validation_loss' reached 2.56415 (best 2.56415), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.564-step=20-consumed_samples=640.0.ckpt' as top 1


Epoch 0: :  40%|████      | 20/50 [10:20<15:31, reduced_train_loss=2.870, global_step=19.00, consumed_samples=640.0, train_step_timing in s=5.860, val_loss=2.560]
[NeMo I 2024-10-31 12:45:17 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.312-step=10-consumed_samples=320.0.ckpt
[NeMo I 2024-10-31 12:45:17 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.312-step=10-consumed_samples=320.0-last.ckpt
Epoch 0: :  60%|██████    | 30/50 [11:18<07:32, reduced_train_loss=2.080, global_step=29.00, consumed_samples=960.0, train_step_timing in s=5.600, val_loss=2.560]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏   

Metric val_loss improved by 0.586 >= min_delta = 0.001. New best score: 1.979
Epoch 0, global step 30: 'validation_loss' reached 1.97855 (best 1.97855), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.979-step=30-consumed_samples=960.0.ckpt' as top 1


Epoch 0: :  60%|██████    | 30/50 [15:30<10:20, reduced_train_loss=2.080, global_step=29.00, consumed_samples=960.0, train_step_timing in s=5.600, val_loss=1.980]
[NeMo I 2024-10-31 12:50:27 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.564-step=20-consumed_samples=640.0.ckpt
[NeMo I 2024-10-31 12:50:27 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.564-step=20-consumed_samples=640.0-last.ckpt
Epoch 0: :  80%|████████  | 40/50 [16:28<04:07, reduced_train_loss=1.790, global_step=39.00, consumed_samples=1280.0, train_step_timing in s=5.580, val_loss=1.980]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏  

Metric val_loss improved by 0.219 >= min_delta = 0.001. New best score: 1.760
Epoch 0, global step 40: 'validation_loss' reached 1.75969 (best 1.75969), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.760-step=40-consumed_samples=1280.0.ckpt' as top 1


Epoch 0: :  80%|████████  | 40/50 [20:38<05:09, reduced_train_loss=1.790, global_step=39.00, consumed_samples=1280.0, train_step_timing in s=5.580, val_loss=1.760]
[NeMo I 2024-10-31 12:55:34 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.979-step=30-consumed_samples=960.0.ckpt
[NeMo I 2024-10-31 12:55:35 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.979-step=30-consumed_samples=960.0-last.ckpt
Epoch 0: : 100%|██████████| 50/50 [21:36<00:00, reduced_train_loss=1.710, global_step=49.00, consumed_samples=1600.0, train_step_timing in s=5.670, val_loss=1.760]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏ 

Metric val_loss improved by 0.043 >= min_delta = 0.001. New best score: 1.717
Epoch 0, global step 50: 'validation_loss' reached 1.71690 (best 1.71690), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt' as top 1


Epoch 0: : 100%|██████████| 50/50 [25:49<00:00, reduced_train_loss=1.710, global_step=49.00, consumed_samples=1600.0, train_step_timing in s=5.670, val_loss=1.720][NeMo I 2024-10-31 13:00:45 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.760-step=40-consumed_samples=1280.0.ckpt
[NeMo I 2024-10-31 13:00:46 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.760-step=40-consumed_samples=1280.0-last.ckpt


`Trainer.fit` stopped: `max_steps=50` reached.


Epoch 0: : 100%|██████████| 50/50 [25:49<00:00, reduced_train_loss=1.710, global_step=49.00, consumed_samples=1600.0, train_step_timing in s=5.670, val_loss=1.720]


Restoring states from the checkpoint path at /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt
Restored all states from the checkpoint at /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt


### Step 3: Inference with NeMo Framework

You can also run text generation within the framework by using a Python script. Note that this is meant for testing and validation, not as a full-fledged deployment solution like NVIDIA NIM.

In [14]:
# Check that the LoRA model file exists
!ls -l ./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints

total 307504
-rw-r--r-- 1 root root 146928238 Oct 31 13:00 'megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0-last.ckpt'
-rw-r--r-- 1 root root 146928238 Oct 31 13:00 'megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt'
-rw-r--r-- 1 root root  21012480 Oct 31 13:00  megatron_gpt_peft_lora_tuning.nemo


In [15]:
# Create a smaller test subset for a quick eval demonstration.
!head -n 128 ./curated-data/law-qa-test_preprocessed.jsonl > ./curated-data/law-qa-test_preprocessed-n128.jsonl

In [16]:
%%bash
MODEL="./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TEST_DS="[./curated-data/law-qa-test_preprocessed-n128.jsonl]" # Smaller test split
# TEST_DS="[./curated-data/law-qa-test_preprocessed.jsonl]" # Full test set
TEST_NAMES="[law]"

TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="law_titlegen_lora"

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=32 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=50 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True  \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.label_key="output" \
    model.data.test_ds.prompt_template="\{input\}\ \{output\}"

    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-10-31 13:02:20 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-10-31 13:02:20 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-10-31 13:02:20 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-10-31 13:02:37 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-31 13:02:37 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-31 13:02:37 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-31 13:02:37 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-31 13:02:37 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-31 13:02:37 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-10-31 13:02:37 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-31 13:02:37 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-31 13:02:37 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-10-31 13:02:37 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-31 13:02:37 megatron_init:310] All tensor model parallel group ranks: 

24-10-31 13:02:37 - PID:12025 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:37 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.

[NeMo I 2024-10-31 13:02:38 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-10-31 13:02:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-31 13:02:38 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2024-10-31 13:02:57 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2024-10-31 13:03:53 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo.
[NeMo I 2024-10-31 13:03:53 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-10-31 13:03:56 nlp_adapter_mixins:208] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | 

[NeMo W 2024-10-31 13:03:56 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-10-31 13:03:56 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-10-31 13:03:56 megatron_gpt_sft_model:803] Building GPT SFT test datasets.
[NeMo I 2024-10-31 13:03:56 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-31 13:03:56 text_memmap_dataset:525] Processing 1 data files using 6 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:495] Building indexing for fn = ./curated-data/law-qa-test_preprocessed-n128.jsonl
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:507] Saving idx file = ./curated-data/law-qa-test_preprocessed-n128.jsonl.idx.npy
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:509] Saving metadata file = ./curated-data/law-qa-test_preprocessed-n128.jsonl.idx.info
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.169348
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:525] Processing 1 data files using 6 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.138296
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:158] Loading data files
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:249] Loading ./curated-data/law-qa-test_preprocessed-n128.jsonl
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001382
[NeMo I 2024-10-31 13:03:57 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-10-31 13:03:57 megatron_gpt_sft_model:806] Length of test dataset: 128
[NeMo I 2024-10-31 13:03:57 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-10-31 13:03:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    
[NeMo W 2024-10-31 13:03:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
      input_info_tensor = torch.cuda.FloatTensor(input_info)
    
      string_tensor = torch.as_tensor(
    


Testing DataLoader 0: 100%|██████████| 4/4 [05:44<00:00,  0.01it/s][NeMo I 2024-10-31 13:09:41 megatron_gpt_sft_model:561] Total deduplicated inference data size: 128 to 128
[NeMo I 2024-10-31 13:09:41 megatron_gpt_sft_model:712] Predictions saved to law_titlegen_lora_test_law_inputs_preds_labels.jsonl


[NeMo W 2024-10-31 13:09:41 megatron_gpt_sft_model:652] No training data found, reconfiguring microbatches based on validation batch sizes.
[NeMo W 2024-10-31 13:09:41 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-31 13:09:41 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss_law', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-31 13:09:41 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss', ..., sync_

Testing DataLoader 0: 100%|██████████| 4/4 [05:44<00:00,  0.01it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │     1.610661506652832     │
│       test_loss_law       │     1.610661506652832     │
│         val_loss          │     1.610661506652832     │
└───────────────────────────┴───────────────────────────┘


### Step 4: Check the model accuracy

Now that the results are in, let's read them and calculate the accuracy on the question title generation task.
First, take a look at one of the predictions in the generated output file. The `pred` key holds the generated text, and the `label` key holds the human-written title.

In [17]:
# Take a look at predictions
!head -n1  law_titlegen_lora_test_law_inputs_preds_labels.jsonl

{"input": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE:", "pred": " What constitutes a minimal business presence in a jurisdiction?", "label": " What constitutes \"doing business in a jurisdiction?\""}
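To compare several predictions with their labels at a glance, a small helper along these lines may be useful. It is only a sketch: it assumes the `pred` and `label` keys seen in the record above, and the function name `load_predictions` and its `limit` parameter are hypothetical.

```python
import json
from itertools import islice


def load_predictions(path: str, limit: int = 5) -> list:
    """Return up to `limit` (pred, label) pairs from a predictions JSONL file."""
    pairs = []
    with open(path) as f:
        # islice stops after `limit` lines, so large files are not read in full
        for raw in islice(f, limit):
            record = json.loads(raw)
            pairs.append((record["pred"].strip(), record["label"].strip()))
    return pairs
```

For example, iterating over `load_predictions("law_titlegen_lora_test_law_inputs_preds_labels.jsonl")` and printing each pair shows the generated and reference titles side by side.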


To evaluate this task, we will use ROUGE. It measures the n-gram overlap between the prediction and the reference, and a higher score is better. While it is not perfect and does not capture the semantics of the prediction, it is a popular metric in academia and industry for evaluating such systems. The following function uses the `rouge_score` library to implement scoring. It reports the `ROUGE_{1/2/L/Lsum}` metrics.

In [18]:
import json

import numpy as np
from rouge_score import rouge_scorer, scoring


def compute_rouge(input_file: str) -> dict:
    """Compute ROUGE F-measures (x100) between 'pred' and 'label' in a JSONL file."""
    ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()
    with open(input_file) as f:
        lines = [json.loads(line) for line in f]
    num_response_words = []
    num_ref_words = []
    for line in lines:
        response = line['pred']
        answer = line['label']
        # RougeScorer.score expects (target, prediction). The F-measure reported
        # below is symmetric, but precision and recall depend on this order.
        scores = scorer.score(answer, response)
        aggregator.add_scores(scores)
        num_response_words.append(len(response.split()))
        num_ref_words.append(len(answer.split()))

    # Bootstrap-resampled aggregate; .mid is the median estimate
    result = aggregator.aggregate()
    rouge_scores = {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
    print(rouge_scores)
    print(f"Average and stddev of response length: {np.mean(num_response_words):.2f}, {np.std(num_response_words):.2f}")
    print(f"Average and stddev of ref length: {np.mean(num_ref_words):.2f}, {np.std(num_ref_words):.2f}")

    return rouge_scores

In [19]:
compute_rouge("./law_titlegen_lora_test_law_inputs_preds_labels.jsonl")

{'rouge1': 40.0944, 'rouge2': 20.3913, 'rougeL': 35.7968, 'rougeLsum': 35.7888}
Average and stddev of response length: 11.53, 4.49
Average and stddev of ref length: 11.26, 4.97



For the Llama-3.1-8B-Instruct model, you should see ROUGE scores comparable to the following:

`{'rouge1': 39.2082, 'rouge2': 18.8573, 'rougeL': 35.4098, 'rougeLsum': 35.3906}`
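To make the n-gram overlap behind ROUGE concrete, here is a minimal pure-Python sketch of a ROUGE-N F-measure, applied to the prediction and label from the sample record shown earlier. It is not the `rouge_score` implementation: it skips stemming and uses a simple lowercase alphanumeric tokenizer, so its numbers can differ from the library's. The names `ngrams` and `rouge_n_f1` are hypothetical.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Count the n-grams of lowercase alphanumeric tokens in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, prediction: str, n: int = 1) -> float:
    """F-measure of clipped n-gram overlap between reference and prediction."""
    ref, pred = ngrams(reference, n), ngrams(prediction, n)
    # Counter intersection keeps the minimum count per n-gram (clipping)
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


label = ' What constitutes "doing business in a jurisdiction?"'
pred = " What constitutes a minimal business presence in a jurisdiction?"
print(round(rouge_n_f1(label, pred, 1), 4))  # 0.75
print(round(rouge_n_f1(label, pred, 2), 4))  # 0.4286
```

In this example, 6 of the 9 predicted unigrams and 6 of the 7 reference unigrams match, which gives a unigram F-measure of 0.75. Only 3 bigrams match, so the bigram score is much lower.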

# LoRA inference with NVIDIA NIM

Now that we've trained our LoRA, let's deploy it with NVIDIA NIM. NIM lets you deploy multiple LoRA adapters and supports the .nemo and Hugging Face model formats. We will deploy the Law LoRA adapter.


The following cell downloads the NIM from NGC and gets it up and running with the LoRA that we've trained.

In [25]:
%%bash

wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-nim.sh -O setup-nim
chmod +x setup-nim
export NGC_API_KEY="<YOUR_NGC_API_KEY>"  # replace with your own key; never commit a real key
./setup-nim

--2024-10-31 13:32:31--  https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-nim.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1713 (1.7K) [text/plain]
Saving to: ‘setup-nim’

     0K .                                                     100% 40.7M=0s

2024-10-31 13:32:31 (40.7 MB/s) - ‘setup-nim’ saved [1713/1713]

https://docs.docker.com/engine/reference/commandline/login/#credential-stores



Login Succeeded
~/verb-workspace/loras ~/verb-workspace
~/verb-workspace


Unable to find image 'nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0' locally
1.1.0: Pulling from nim/meta/llama-3.1-8b-instruct
cbe3537751ce: Already exists
d67fcc6ef577: Already exists
47ee674c5713: Already exists
63daa0e64b30: Already exists
d9d9aecefab5: Already exists
b377c960b7f3: Already exists
071105f39313: Already exists
18049dd7c352: Already exists
071c1099eccd: Already exists
161ecdfb16f0: Pulling fs layer
fcfb2ec1ba22: Pulling fs layer
154e691e00a7: Pulling fs layer
9d18af386bf6: Pulling fs layer
f1d9f7beba6e: Pulling fs layer
0c951f04c367: Pulling fs layer
fb6fbd97005b: Pulling fs layer
431acb0bc035: Pulling fs layer
38697a17baff: Pulling fs layer
f9aeba7169f2: Pulling fs layer
cfc9a1f4fc10: Pulling fs layer
cfdd2bb2b4a6: Pulling fs layer
c396a58289c6: Pulling fs layer
e8839de7b7ae: Pulling fs layer
7941e23182d8: Pulling fs layer
0372c9b9cb47: Pulling fs layer
dfedf8154b02: Pulling fs layer
659b21d9411d: Pulling fs layer
431acb0bc035: Download complete
160151d7ae7f: Pulling 

8e92f00440c1c25132d86437527ce0eab2dd9c53d6137d896ff3e2155a4cf23c
Checking if NIM is up...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 seconds...
NIM is not up yet. Checking again in 10 sec
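
The log above shows the setup script polling until the NIM responds. If you need the same wait from Python, it could be sketched as follows. This assumes that an HTTP 200 from the `/v1/models` endpoint queried later in this notebook is an adequate readiness signal. The function `wait_until_up` and its parameters are hypothetical, not part of NIM or the setup script.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url: str, timeout_s: float = 600.0, interval_s: float = 10.0,
                  probe=None) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""

    def http_probe(target: str) -> bool:
        try:
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: treat as "not up yet"
            return False

    check = probe or http_probe  # `probe` lets callers swap in their own check
    deadline = time.monotonic() + timeout_s
    while True:
        if check(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until_up("http://0.0.0.0:8000/v1/models")` returns `True` once the server answers and `False` if the timeout expires first.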

In [26]:
import requests
import json

## Checking available LoRA models

Once the NIM server is up and running, check the available models as follows:

In [27]:
url = 'http://0.0.0.0:8000/v1/models'

response = requests.get(url)
data = response.json()

print(json.dumps(data, indent=4))

{
    "object": "list",
    "data": [
        {
            "id": "meta/llama-3_1-8b-instruct",
            "object": "model",
            "created": 1730381998,
            "owned_by": "system",
            "root": "meta/llama-3_1-8b-instruct",
            "parent": null,
            "max_model_len": 131072,
            "permission": [
                {
                    "id": "modelperm-41dd90e0b704491e82634bfa2e9c980c",
                    "object": "model_permission",
                    "created": 1730381998,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        },
        {
            "id": "llama3.1-8b-

This returns all the models available for inference by NIM. In this case, it returns the base model as well as the LoRA adapter that was provided during NIM deployment, `llama3.1-8b-law-titlegen`.
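
Because the full JSON is verbose, it can help to pull out only the model names. The sketch below assumes the response shape shown above (a top-level `data` list whose entries carry an `id`). The helper name `list_model_ids` is hypothetical, and `sample` is a trimmed stand-in for the real payload.

```python
def list_model_ids(models_response: dict) -> list:
    """Return the `id` of every model entry in a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Trimmed stand-in for the payload above; real entries carry more fields
sample = {
    "object": "list",
    "data": [
        {"id": "meta/llama-3_1-8b-instruct", "object": "model"},
        {"id": "llama3.1-8b-law-titlegen", "object": "model"},
    ],
}
print(list_model_ids(sample))
```

With the live server, passing the `data` dictionary from the previous cell to `list_model_ids` gives the names you can use in the `model` field of a request.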

### Title Generation

Try sending an example from the test set.

In [None]:
url = 'http://0.0.0.0:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# Example from the test set
prompt="Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE: "
data = {
    "model": "llama3.1-8b-law-titlegen",
    "prompt": prompt,
    "max_tokens": 50
}

response = requests.post(url, headers=headers, json=data)
response_data = response.json()

print(json.dumps(response_data, indent=4))
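
To use only the generated title rather than the whole JSON, the text can be extracted from the response. This sketch assumes the endpoint returns an OpenAI-style completions payload with a `choices` list whose items carry a `text` field. Check the printed response to confirm that shape. The helper `extract_completion_text` is hypothetical, and `example_response` is an invented stand-in, not real model output.

```python
def extract_completion_text(response_data: dict) -> str:
    """Return the stripped text of the first choice in a completions response."""
    choices = response_data.get("choices") or []
    if not choices:
        # Error payloads typically have no choices; surface them to the caller
        raise ValueError(f"No choices in response: {response_data}")
    return choices[0].get("text", "").strip()


# Invented payload for illustration only
example_response = {"choices": [{"index": 0, "text": " An example title\n"}]}
print(extract_completion_text(example_response))
```

Calling `extract_completion_text(response_data)` on the response from the cell above would then give the generated title as a plain string.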