# Step 3: Training a LoRA Adapter

This notebook showcases LoRA fine-tuning of the base model prepared in step 2 on the dataset that we curated in step 1.

## Setup and Requirements
Before proceeding, please ensure you have completed the notebooks for steps 1 and 2. You will need one extra dependency, `ipywidgets`, to follow along. Uncomment and execute the following cell if it is not already installed.

In [2]:
# ! pip install ipywidgets

Let's also specify the base model name that we will use for fine-tuning. This should be the same model you downloaded/converted in step 2.

In [1]:
model_to_use = "google/gemma-2-2b"

---
# Sanity Checking

Let's do a quick sanity check to ensure we have all the pieces needed before moving forward.

In [2]:
import os

model_name = model_to_use.split('/')[-1].lower()

# The path to the model checkpoint, and also the data directory containing the training, validation, and test data.
nemo_model_fp = os.path.abspath(f"models/{model_name}.nemo")
data_dir = "data/split"

# The directory where the results will be stored.
result_dir = os.path.abspath("results")
os.makedirs(result_dir, exist_ok=True)

# Sanity checks
assert os.path.exists(nemo_model_fp), f"The model checkpoint at '{nemo_model_fp}' does not exist. Please ensure the model was downloaded successfully."
assert os.path.exists(data_dir), f"The data directory '{data_dir}' does not exist. Please ensure the data was prepared successfully."

train_fp = os.path.abspath(f"{data_dir}/train.jsonl")
val_fp = os.path.abspath(f"{data_dir}/val.jsonl")

# Sanity checks
assert os.path.exists(train_fp), f"The training data at '{train_fp}' does not exist. Please ensure the data was prepared successfully."
assert os.path.exists(val_fp), f"The validation data at '{val_fp}' does not exist. Please ensure the data was prepared successfully."

#
# Set the environment variables (needed for executing the next cell)
#
%env BASE_MODEL=$nemo_model_fp
%env DATA_DIR=$data_dir
%env TRAIN_DS=$train_fp
%env VAL_DS=$val_fp
%env RESULT_DIR=$result_dir

print(f"\n{'#'*80}")
print("All checks passed. You are ready to go!")
print(f"    Base model file: {nemo_model_fp}")
print(f"    Data directory: {data_dir}")
print(f"    Results: {result_dir}")

env: BASE_MODEL=/root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo
env: DATA_DIR=data/split
env: TRAIN_DS=/root/ODSC-Hackathon-Repository/data/split/train.jsonl
env: VAL_DS=/root/ODSC-Hackathon-Repository/data/split/val.jsonl
env: RESULT_DIR=/root/ODSC-Hackathon-Repository/results

################################################################################
All checks passed. You are ready to go!
    Base model file: /root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo
    Data directory: data/split
    Results: /root/ODSC-Hackathon-Repository/results
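
The checks above only confirm that the files exist. As an optional extra, here is a minimal sketch of a format check, assuming each non-empty line of `train.jsonl` and `val.jsonl` holds one JSON object. The helper name `count_jsonl_records` is hypothetical, and no particular field names are assumed.

```python
import json


def count_jsonl_records(path):
    """Count the JSON objects in a JSONL file, raising on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises json.JSONDecodeError on bad JSON
            if not isinstance(record, dict):
                raise ValueError(
                    f"{path}:{line_no}: expected a JSON object, "
                    f"got {type(record).__name__}"
                )
            count += 1
    return count
```

Calling `count_jsonl_records(train_fp)` and `count_jsonl_records(val_fp)` after the sanity-check cell would report how many records each split contains.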


---
# Model Training

With all the sanity checks passing, it is time to start model training.

> NOTE: Running the following cell will remove any previously trained model!
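
If you would rather keep an earlier run, one option is to move the results directory aside before starting training. The following is a minimal sketch, not part of the original workflow: the helper `backup_results` and the `results_backups` folder name are hypothetical choices made for illustration.

```python
import os
import shutil
import time


def backup_results(result_dir, backup_root="results_backups"):
    """Move result_dir into a timestamped folder under backup_root.

    Returns the new location, or None if result_dir does not exist.
    """
    if not os.path.isdir(result_dir):
        return None
    os.makedirs(backup_root, exist_ok=True)
    # Timestamped name, e.g. results_20241028_011042 (hypothetical naming scheme)
    target = os.path.join(backup_root, time.strftime("results_%Y%m%d_%H%M%S"))
    shutil.move(result_dir, target)
    return target
```

For example, `backup_results(result_dir)` could be run in its own cell before training. The training cell's `rm -r $RESULT_DIR` would then print a harmless "No such file or directory" message, because the directory has already been moved.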

* We train in bf16 precision, which balances training speed against memory usage. Both matter for a model of this size.
* A micro batch size of 1 keeps per-step memory low. With a global batch size of 16 on our single GPU, gradients are accumulated over 16 micro-batches before each optimizer update, which stabilizes training.
* We use a cosine annealing learning rate schedule with a base rate of 1e-6. Gradually reducing the learning rate helps the model settle on a good convergence point and prevents overshooting in the later stages of training.
* In fact, in our first fine-tuning trial the training and validation losses were far apart, and the validation loss did not converge while the training loss did, which is a clear sign of overfitting.
* We therefore introduced a weight decay of 0.01 to counter the overfitting by regularizing the weights.
* We set max steps to 2500 and run validation every 200 steps. Evaluating this often let us monitor the model's performance closely and make adjustments when needed.
* We set the Adam optimizer's beta values to 0.9 and 0.95. These control the decay rates of the running gradient and variance estimates. A second beta of 0.95, lower than the common default of 0.999, lets the variance estimate adapt faster, which tends to help convergence speed and stability when fine-tuning large language models.
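
To make the batch-size and schedule settings above more concrete, here is a small sketch of the arithmetic involved. It is a textbook warmup-plus-cosine curve, not NeMo's actual scheduler code: the exact warmup formula and the `MIN_LR` floor of 0 are assumptions, and `WARMUP_STEPS` is taken from the training command below.

```python
import math

# Values taken from the training command in this notebook.
MICRO_BATCH_SIZE = 1
GLOBAL_BATCH_SIZE = 16
DATA_PARALLEL_SIZE = 1  # one GPU, tensor and pipeline parallel sizes of 1
MAX_STEPS = 2500
WARMUP_STEPS = 200
BASE_LR = 1e-6
MIN_LR = 0.0  # assumption: no learning-rate floor is set in this notebook


def grad_accumulation_steps(global_batch, micro_batch, dp_size):
    """Number of micro-batches accumulated before each optimizer step."""
    return global_batch // (micro_batch * dp_size)


def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


accum = grad_accumulation_steps(GLOBAL_BATCH_SIZE, MICRO_BATCH_SIZE, DATA_PARALLEL_SIZE)
print(f"micro-batches per optimizer step: {accum}")
print(f"samples consumed after 200 steps: {200 * GLOBAL_BATCH_SIZE}")
for step in (0, 200, 1350, 2500):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

The first two numbers (16 micro-batches per step and 3200 samples after 200 steps) agree with the `setting number of microbatches to constant 16` and `consumed_samples=3200.0` entries in the training log. The learning-rate values only illustrate the shape of the schedule.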


In [5]:
%%bash

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

# Clear up cached mem-map index files (rm prints a harmless error if none exist yet)
rm $DATA_DIR/*idx*
# Clean up prior results
rm -r $RESULT_DIR

HYDRA_FULL_ERROR=1 torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${RESULT_DIR} \
    exp_manager.explicit_log_dir=${RESULT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.val_check_interval=200 \
    trainer.max_steps=2500 \
    trainer.gradient_clip_val=0.5 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=16 \
    model.optim.sched.name="CosineAnnealing" \
    model.optim.sched.warmup_steps=200 \
    model.optim.lr=1e-6 \
    model.optim.weight_decay=0.01 \
    model.optim.betas=[0.9,0.95] \
    model.restore_from_path=${BASE_MODEL} \
    model.data.train_ds.num_workers=2 \
    model.data.train_ds.add_bos=True \
    model.data.validation_ds.num_workers=1 \
    model.data.train_ds.file_names=[${TRAIN_DS}] \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=[${VAL_DS}] \
    model.peft.peft_scheme=${SCHEME}

rm: cannot remove 'data/split/*idx*': No such file or directory


[NeMo I 2024-10-28 01:10:42 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-10-28 01:10:42 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 2500
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 0.5
    exp_manager:
      explicit_log_dir: /root/ODSC-Hackathon-Repository/results
      exp_dir: /root/ODSC-Hackathon-Repository/results
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validat

[NeMo W 2024-10-28 01:10:42 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


[NeMo I 2024-10-28 01:10:42 exp_manager:450] ExpManager schema
[NeMo I 2024-10-28 01:10:42 exp_manager:451] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo E 2024-10-28 01:10:42 exp_manager:910] exp_manager received explicit_log_dir: /root/ODSC-Hackathon-Repository/results and at least one of exp_dir: /root/ODSC-Hackathon-Repository/results, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-10-28 01:10:42 exp_manager:837] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/root/ODSC-Hackathon-Repository/results/checkpoints. Training from scratch.


[NeMo I 2024-10-28 01:10:42 exp_manager:509] Experiments will be logged at /root/ODSC-Hackathon-Repository/results
[NeMo I 2024-10-28 01:10:42 exp_manager:1063] TensorboardLogger has been set up


[NeMo W 2024-10-28 01:10:42 exp_manager:1201] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 2500. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-10-28 01:10:42 exp_manager:646] TFLOPs per sec per GPU will be calculated, conditioned on supported models. Defaults to -1 upon failure.


[NeMo W 2024-10-28 01:10:48 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 01:10:48 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 01:10:48 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 01:10:48 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 01:10:48 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-10-28 01:10:48 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-28 01:10:48 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-28 01:10:48 megatron_init:325] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-28 01:10:48 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-28 01:10:48 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-28 01:10:48 megatron_init:339] All context parallel group ranks: [[0]]
[NeMo I 2024-10-28 01:10:48 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-28 01:10:48 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-28 01:10:48 megatron_init:348] All model parallel group ranks: [[0]]
[NeMo I 2024-10-28 01:10:48 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-28 01:10:48 megatron_init:361] All tensor model parallel group ranks: 

[NeMo I 2024-10-28 01:10:48 megatron_base_model:604] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.


[NeMo I 2024-10-28 01:11:01 nlp_overrides:1374] Model MegatronGPTSFTModel was successfully restored from /root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo.
[NeMo I 2024-10-28 01:11:01 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-10-28 01:11:01 nlp_adapter_mixins:245] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 2.6 B  | train
    ------------------------------------------------
    0         Trainable params
    2.6 B     Non-trainable params
    2.6 B     Total params
    10,457.368Total estimated model params size (MB)
    452       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-10-28 01:11:04 nlp_adapter_mixins:250] After adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 2.6 B  | train
    -------------

[NeMo W 2024-10-28 01:11:04 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-10-28 01:11:04 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-10-28 01:11:04 megatron_gpt_sft_model:836] Building GPT SFT validation datasets.
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:528] Processing 1 data files using 2 workers
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:494] Building indexing for fn = /root/ODSC-Hackathon-Repository/data/split/val.jsonl
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:506] Saving idx file = /root/ODSC-Hackathon-Repository/data/split/val.jsonl.idx.npy
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:508] Saving metadata file = /root/ODSC-Hackathon-Repository/data/split/val.jsonl.idx.info
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:543] Time building 1 / 1 mem-mapped files: 0:00:00.126045
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:528] Processing 1 data files using 2 workers
[NeMo I 2024-10-28 01:11:04 text_memmap_dataset:543] Time building 0 / 1 mem-mapped files: 0:00:00.091164
[NeMo I 2024-10-28 01:11:04 text

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-10-28 01:11:05 megatron_base_model:1230] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 2500.


[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 01:11:05 adapter_mixins:495] Unfrozen adapter : lora_kqv_


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 2.6 B  | train
------------------------------------------------
5.3 M     Trainable params
2.6 B     Non-trainable params
2.6 B     Total params
10,478.667Total estimated model params size (MB)
582       Modules in train mode
0         Modules in eval mode
[NeMo W 2024-10-28 01:11:05 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    


Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo I 2024-10-28 01:11:18 num_microbatches_calculator:228] setting number of microbatches to constant 16
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:04<00:00,  0.45it/s][NeMo I 2024-10-28 01:11:22 num_microbatches_calculator:228] setting number of microbatches to constant 16


[NeMo W 2024-10-28 01:11:22 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 01:11:22 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 01:11:22 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    

Epoch 0: :   8%|▊         | 200/2500 [11:28<2:12:03, reduced_train_loss=5.160, global_step=199.0, consumed_samples=3200.0, train_step_timing in s=3.420]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 01:23:18 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 0:   2%|▏         | 1/66 [00:02<02:14,  0.48it/s][A
Validation DataLoader 0:   3%|▎         | 2/66 [00:03<01:57,  0.54it/s][A
Validation DataLoader 0:   5%|▍         | 3/66 [00:05<01:51,  0.57it/s][A
Validation DataLoader 0:   6%|▌         | 4/66 [00:07<01:50,  0.56it/s][A
Validation DataLoader 0:   8%|▊         | 5/66 [00:08<01:47,  0.57it/s][A
Validation DataLoader 0:   9%|▉         | 6/66 [00:10<01:44,  0.58it/s][A
Validation DataLoader 0:  11%|█         | 7/66 [00:12<01:41,  0.58it/s][A
Validation DataLoader 0:  12%|█▏        | 8/6

Metric val_loss improved. New best score: 6.725
Epoch 0, global step 200: 'validation_loss' reached 6.72534 (best 6.72534), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=6.725-step=200-consumed_samples=3200.0.ckpt' as top 1
[NeMo W 2024-10-28 01:25:05 nlp_overrides:625] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :  16%|█▌        | 400/2500 [24:44<2:09:55, reduced_train_loss=4.860, global_step=399.0, consumed_samples=6400.0, train_step_timing in s=3.430, val_loss=6.730]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 01:36:34 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 0:   2%|▏         | 1/66 [00:02<02:15,  0.48it/s][A
Validation DataLoader 0:   3%|▎         | 2/66 [00:03<01:58,  0.54it/s][A
Validation DataLoader 0:   5%|▍         | 3/66 [00:05<01:51,  0.56it/s][A
Validation DataLoader 0:   6%|▌         | 4/66 [00:06<01:47,  0.58it/s][A
Validation DataLoader 0:   8%|▊         | 5/66 [00:08<01:44,  0.59it/s][A
Validation DataLoader 0:   9%|▉         | 6/66 [00:10<01:41,  0.59it/s][A
Validation DataLoader 0:  11%|█         | 7/66 [00:11<01:39,  0.59it/s][A
Validation DataLoader 0:  12%

Metric val_loss improved by 1.932 >= min_delta = 0.001. New best score: 4.793
Epoch 0, global step 400: 'validation_loss' reached 4.79304 (best 4.79304), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=4.793-step=400-consumed_samples=6400.0.ckpt' as top 1


Epoch 0: :  16%|█▌        | 400/2500 [26:32<2:19:18, reduced_train_loss=4.860, global_step=399.0, consumed_samples=6400.0, train_step_timing in s=3.430, val_loss=4.790][NeMo I 2024-10-28 01:38:21 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=6.725-step=200-consumed_samples=3200.0.ckpt
[NeMo I 2024-10-28 01:38:22 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=6.725-step=200-consumed_samples=3200.0-last.ckpt
Epoch 0: :  24%|██▍       | 600/2500 [38:00<2:00:22, reduced_train_loss=4.160, global_step=599.0, consumed_samples=9600.0, train_step_timing in s=3.420, val_loss=4.790]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 01:49:50 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 0:  

Metric val_loss improved by 1.304 >= min_delta = 0.001. New best score: 3.489
Epoch 0, global step 600: 'validation_loss' reached 3.48945 (best 3.48945), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.489-step=600-consumed_samples=9600.0.ckpt' as top 1


Epoch 0: :  24%|██▍       | 600/2500 [39:48<2:06:02, reduced_train_loss=4.160, global_step=599.0, consumed_samples=9600.0, train_step_timing in s=3.420, val_loss=3.490][NeMo I 2024-10-28 01:51:37 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=4.793-step=400-consumed_samples=6400.0.ckpt
[NeMo I 2024-10-28 01:51:38 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=4.793-step=400-consumed_samples=6400.0-last.ckpt
Epoch 0: :  32%|███▏      | 800/2500 [51:16<1:48:57, reduced_train_loss=2.920, global_step=799.0, consumed_samples=12800.0, train_step_timing in s=3.420, val_loss=3.490]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 02:03:05 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 0: 

Metric val_loss improved by 0.590 >= min_delta = 0.001. New best score: 2.899
Epoch 0, global step 800: 'validation_loss' reached 2.89896 (best 2.89896), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.899-step=800-consumed_samples=12800.0.ckpt' as top 1


Epoch 0: :  32%|███▏      | 800/2500 [53:03<1:52:44, reduced_train_loss=2.920, global_step=799.0, consumed_samples=12800.0, train_step_timing in s=3.420, val_loss=2.900][NeMo I 2024-10-28 02:04:53 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.489-step=600-consumed_samples=9600.0.ckpt
[NeMo I 2024-10-28 02:04:53 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.489-step=600-consumed_samples=9600.0-last.ckpt
Epoch 0: :  40%|████      | 1000/2500 [1:04:31<1:36:47, reduced_train_loss=2.720, global_step=999.0, consumed_samples=1.6e+4, train_step_timing in s=3.420, val_loss=2.900]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 02:16:20 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataLoader 

Metric val_loss improved by 0.185 >= min_delta = 0.001. New best score: 2.714
Epoch 0, global step 1000: 'validation_loss' reached 2.71358 (best 2.71358), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.714-step=1000-consumed_samples=16000.0.ckpt' as top 1


Epoch 0: :  40%|████      | 1000/2500 [1:06:18<1:39:28, reduced_train_loss=2.720, global_step=999.0, consumed_samples=1.6e+4, train_step_timing in s=3.420, val_loss=2.710][NeMo I 2024-10-28 02:18:08 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.899-step=800-consumed_samples=12800.0.ckpt
[NeMo I 2024-10-28 02:18:08 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.899-step=800-consumed_samples=12800.0-last.ckpt
Epoch 0: :  48%|████▊     | 1200/2500 [1:17:47<1:24:16, reduced_train_loss=2.160, global_step=1199.0, consumed_samples=19200.0, train_step_timing in s=3.430, val_loss=2.710]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 02:29:36 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataL

Metric val_loss improved by 0.113 >= min_delta = 0.001. New best score: 2.600
Epoch 0, global step 1200: 'validation_loss' reached 2.60033 (best 2.60033), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.600-step=1200-consumed_samples=19200.0.ckpt' as top 1


Epoch 0: :  48%|████▊     | 1200/2500 [1:19:34<1:26:11, reduced_train_loss=2.160, global_step=1199.0, consumed_samples=19200.0, train_step_timing in s=3.430, val_loss=2.600][NeMo I 2024-10-28 02:31:23 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.714-step=1000-consumed_samples=16000.0.ckpt
[NeMo I 2024-10-28 02:31:24 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.714-step=1000-consumed_samples=16000.0-last.ckpt
Epoch 0: :  56%|█████▌    | 1400/2500 [1:31:03<1:11:32, reduced_train_loss=2.160, global_step=1399.0, consumed_samples=22400.0, train_step_timing in s=3.420, val_loss=2.600]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 02:42:52 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation D

Metric val_loss improved by 0.043 >= min_delta = 0.001. New best score: 2.557
Epoch 0, global step 1400: 'validation_loss' reached 2.55716 (best 2.55716), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.557-step=1400-consumed_samples=22400.0.ckpt' as top 1


Epoch 0: :  56%|█████▌    | 1400/2500 [1:32:50<1:12:56, reduced_train_loss=2.160, global_step=1399.0, consumed_samples=22400.0, train_step_timing in s=3.420, val_loss=2.560][NeMo I 2024-10-28 02:44:39 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.600-step=1200-consumed_samples=19200.0.ckpt
[NeMo I 2024-10-28 02:44:40 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.600-step=1200-consumed_samples=19200.0-last.ckpt
Epoch 0: :  64%|██████▍   | 1600/2500 [1:44:18<58:40, reduced_train_loss=2.530, global_step=1599.0, consumed_samples=25600.0, train_step_timing in s=3.420, val_loss=2.560]  
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 02:56:08 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation D

Metric val_loss improved by 0.054 >= min_delta = 0.001. New best score: 2.503
Epoch 0, global step 1600: 'validation_loss' reached 2.50295 (best 2.50295), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.503-step=1600-consumed_samples=25600.0.ckpt' as top 1


Epoch 0: :  64%|██████▍   | 1600/2500 [1:46:06<59:40, reduced_train_loss=2.530, global_step=1599.0, consumed_samples=25600.0, train_step_timing in s=3.420, val_loss=2.500][NeMo I 2024-10-28 02:57:55 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.557-step=1400-consumed_samples=22400.0.ckpt
[NeMo I 2024-10-28 02:57:56 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.557-step=1400-consumed_samples=22400.0-last.ckpt
Epoch 0: :  72%|███████▏  | 1800/2500 [1:57:34<45:43, reduced_train_loss=1.930, global_step=1799.0, consumed_samples=28800.0, train_step_timing in s=3.420, val_loss=2.500]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-10-28 03:09:23 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s][A
Validation DataL

Metric val_loss improved by 0.028 >= min_delta = 0.001. New best score: 2.475
Epoch 0, global step 1800: 'validation_loss' reached 2.47451 (best 2.47451), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.475-step=1800-consumed_samples=28800.0.ckpt' as top 1


Epoch 0: :  72%|███████▏  | 1800/2500 [1:59:21<46:24, reduced_train_loss=1.930, global_step=1799.0, consumed_samples=28800.0, train_step_timing in s=3.420, val_loss=2.470][NeMo I 2024-10-28 03:11:10 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.503-step=1600-consumed_samples=25600.0.ckpt
[NeMo I 2024-10-28 03:11:11 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.503-step=1600-consumed_samples=25600.0-last.ckpt
Epoch 0: :  80%|████████  | 2000/2500 [2:10:50<32:42, reduced_train_loss=1.850, global_step=2e+3, consumed_samples=3.2e+4, train_step_timing in s=3.430, val_loss=2.470]   
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 03:22:39 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s]
Validation DataL

Metric val_loss improved by 0.003 >= min_delta = 0.001. New best score: 2.471
Epoch 0, global step 2000: 'validation_loss' reached 2.47131 (best 2.47131), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.471-step=2000-consumed_samples=32000.0.ckpt' as top 1


Epoch 0: :  80%|████████  | 2000/2500 [2:12:37<33:09, reduced_train_loss=1.850, global_step=2e+3, consumed_samples=3.2e+4, train_step_timing in s=3.430, val_loss=2.470][NeMo I 2024-10-28 03:24:27 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.475-step=1800-consumed_samples=28800.0.ckpt
[NeMo I 2024-10-28 03:24:27 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.475-step=1800-consumed_samples=28800.0-last.ckpt
Epoch 0: :  88%|████████▊ | 2200/2500 [2:24:05<19:38, reduced_train_loss=2.120, global_step=2199.0, consumed_samples=35200.0, train_step_timing in s=3.430, val_loss=2.470]
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 03:35:55 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s]
Validation DataLoad

Metric val_loss improved by 0.002 >= min_delta = 0.001. New best score: 2.470
Epoch 0, global step 2200: 'validation_loss' reached 2.46976 (best 2.46976), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.470-step=2200-consumed_samples=35200.0.ckpt' as top 1


Epoch 0: :  88%|████████▊ | 2200/2500 [2:25:53<19:53, reduced_train_loss=2.120, global_step=2199.0, consumed_samples=35200.0, train_step_timing in s=3.430, val_loss=2.470][NeMo I 2024-10-28 03:37:42 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.471-step=2000-consumed_samples=32000.0.ckpt
[NeMo I 2024-10-28 03:37:43 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.471-step=2000-consumed_samples=32000.0-last.ckpt
Epoch 0: :  96%|█████████▌| 2400/2500 [2:37:22<06:33, reduced_train_loss=1.980, global_step=2399.0, consumed_samples=38400.0, train_step_timing in s=3.430, val_loss=2.470]
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 03:49:11 num_microbatches_calculator:228] setting number of microbatches to constant 16

Validation:   0%|          | 0/66 [00:00<?, ?it/s]
Validation DataL

Metric val_loss improved by 0.006 >= min_delta = 0.001. New best score: 2.464
Epoch 0, global step 2400: 'validation_loss' reached 2.46362 (best 2.46362), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.464-step=2400-consumed_samples=38400.0.ckpt' as top 1


Epoch 0: :  96%|█████████▌| 2400/2500 [2:39:09<06:37, reduced_train_loss=1.980, global_step=2399.0, consumed_samples=38400.0, train_step_timing in s=3.430, val_loss=2.460][NeMo I 2024-10-28 03:50:59 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.470-step=2200-consumed_samples=35200.0.ckpt
[NeMo I 2024-10-28 03:50:59 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.470-step=2200-consumed_samples=35200.0-last.ckpt


`Trainer.fit` stopped: `max_steps=2500` reached.


Epoch 0: : 100%|██████████| 2500/2500 [2:44:54<00:00, reduced_train_loss=1.930, global_step=2499.0, consumed_samples=4e+4, train_step_timing in s=3.420, val_loss=2.460]   
[NeMo I 2024-10-28 03:56:43 perf_metrics:87] TFLOPs per sec per GPU=-1.00


[NeMo E 2024-10-28 03:56:43 perf_metrics:85] Failed to calculate TFLOPs per sec per GPU.
    FLOPs measurement not supported for finetuning jobs


[NeMo I 2024-10-28 03:56:44 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.464-step=2400-consumed_samples=38400.0-last.ckpt


Restoring states from the checkpoint path at /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.464-step=2400-consumed_samples=38400.0.ckpt
Restored all states from the checkpoint at /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.464-step=2400-consumed_samples=38400.0.ckpt
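
The checkpoint names in the log above embed the validation loss and the step at which each checkpoint was saved. As a minimal sketch, the helper below recovers those two values from a file name and picks the lowest-loss checkpoint from a list of names. The regular expression is an assumption inferred only from the file names shown in this log, and `parse_checkpoint_name` and `best_checkpoint` are hypothetical helpers, not part of NeMo.

```python
import re

# Assumed naming scheme, inferred from the checkpoint names in the log above:
#   ...--validation_loss=<float>-step=<int>-consumed_samples=<float>.ckpt
CKPT_RE = re.compile(r"validation_loss=(?P<loss>\d+\.\d+)-step=(?P<step>\d+)")


def parse_checkpoint_name(name):
    """Return (validation_loss, step) parsed from a checkpoint name, or None."""
    match = CKPT_RE.search(name)
    if match is None:
        return None
    return float(match.group("loss")), int(match.group("step"))


def best_checkpoint(names):
    """Return the name with the lowest validation loss, or None if none parse."""
    parsed = [(parse_checkpoint_name(name), name) for name in names]
    parsed = [(info, name) for info, name in parsed if info is not None]
    if not parsed:
        return None
    # Tuples compare by loss first, then by step, so ties go to the earlier step.
    return min(parsed, key=lambda item: item[0])[1]
```

For example, calling `best_checkpoint(os.listdir(f"{result_dir}/checkpoints"))` should point at the checkpoint that the trainer restored at the end of the run, provided the directory still contains files that follow this naming scheme.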


---
# Inference and Submission


To make a submission, run inference with your model on the test dataset at `data/split/submission.jsonl`.

> NOTE: This dataset was generated as part of Step 1. Please ensure it exists before proceeding.

To do this, set the variable pointing to your submission data file in the cell below, then execute the inference cell that follows it.

The inference results will be written to the `results/inference` folder.

In [3]:
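# The submission data generated in step 1.
test_fp = os.path.abspath(f"{data_dir}/submission.jsonl")
assert os.path.exists(test_fp), f"The submission data at '{test_fp}' does not exist. Please ensure the data was prepared successfully."

# The trained LoRA adapter produced by the training step above.
adapter_fp = f"{result_dir}/checkpoints/megatron_gpt_peft_lora_tuning.nemo"
assert os.path.exists(adapter_fp), f"The trained adapter at '{adapter_fp}' does not exist. Please ensure training completed successfully."
os.makedirs(f"{result_dir}/inference", exist_ok=True)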

print(f"Inference set: {test_fp}")
print(f"Trained adapter: {adapter_fp}")
test_filename = os.path.basename(test_fp)


%env TEST_DS=$test_fp
%env TEST_FP=$test_filename
%env TRAINED_ADAPTER=$adapter_fp

Inference set: /root/ODSC-Hackathon-Repository/data/split/submission.jsonl
Trained adapter: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning.nemo
env: TEST_DS=/root/ODSC-Hackathon-Repository/data/split/submission.jsonl
env: TEST_FP=submission.jsonl
env: TRAINED_ADAPTER=/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning.nemo
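
Since inference over the full submission set is a long-running job, it can help to validate the input file first. The following is a minimal sketch of such a pre-flight check: it counts the records in a JSONL file and confirms that each line parses as a JSON object. `check_jsonl` is a hypothetical helper, and the default `required_keys=("input",)` is an assumption based on the `input` field visible in the inference output; adjust it to match your data.

```python
import json


def check_jsonl(path, required_keys=("input",)):
    """Count the records in a JSONL file, raising ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"{path}:{line_no}: missing keys {missing}")
            count += 1
    return count
```

Calling `check_jsonl(test_fp)` before launching inference returns the number of records, which can also be compared against the number of lines in the predictions file afterwards.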


# Inference

In [4]:
%%bash

# This is where the inference results will be stored.
OUTPUT_DIR="results/inference/infer-$TEST_FP"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

# Clear cached mem-map index files (-f: do not complain if none exist)
rm -f "$DATA_DIR"/*idx*

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${BASE_MODEL} \
    model.peft.restore_from_path=${TRAINED_ADAPTER} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    inference.greedy=True \
    model.data.test_ds.file_names=[${TEST_DS}] \
    model.data.test_ds.names=["infer"] \
    model.data.test_ds.global_batch_size=32 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=32 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.data.test_ds.output_file_path_prefix=$OUTPUT_DIR \
    model.data.test_ds.write_predictions_to_file=True



[NeMo I 2024-10-28 05:26:13 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-10-28 05:26:13 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-10-28 05:26:13 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-10-28 05:26:19 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-28 05:26:19 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-28 05:26:19 megatron_init:325] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-28 05:26:19 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-28 05:26:19 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-28 05:26:19 megatron_init:339] All context parallel group ranks: [[0]]
[NeMo I 2024-10-28 05:26:19 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-28 05:26:19 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-28 05:26:19 megatron_init:348] All model parallel group ranks: [[0]]
[NeMo I 2024-10-28 05:26:19 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-28 05:26:19 megatron_init:361] All tensor model parallel group ranks: 

[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-10-28 05:26:19 tokenizer_utils:197] Getting SentencePiece with model: /tmp/tmpmpvffa3_/dd4e3de1c52a49088ca428287e8b67bb_tokenizer.model
[NeMo I 2024-10-28 05:26:19 megatron_base_model:604] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.


[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 05:26:19 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-10-28 05:26:35 nlp_overrides:1374] Model MegatronGPTSFTModel was successfully restored from /root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo.
[NeMo I 2024-10-28 05:26:35 nlp_adapter_mixins:245] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 2.6 B  | train
    -------------------------------------------
    0         Trainable params
    2.6 B     Non-trainable params
    2.6 B     Total params
    10,457.368Total estimated model params size (MB)
    451       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-10-28 05:26:38 nlp_adapter_mixins:250] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 2.6 B  | train
    -------------------------------------------
    5.3 M     Trainable params
    2.6 B     Non-trainable params
    2.6 B     Total params
    10,478.6

[NeMo W 2024-10-28 05:26:38 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-10-28 05:26:38 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-10-28 05:26:38 megatron_gpt_sft_model:828] Building GPT SFT test datasets.
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:528] Processing 1 data files using 6 workers
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:494] Building indexing for fn = /root/ODSC-Hackathon-Repository/data/split/submission.jsonl
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:506] Saving idx file = /root/ODSC-Hackathon-Repository/data/split/submission.jsonl.idx.npy
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:508] Saving metadata file = /root/ODSC-Hackathon-Repository/data/split/submission.jsonl.idx.info
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:543] Time building 1 / 1 mem-mapped files: 0:00:00.214818
[NeMo I 2024-10-28 05:26:38 text_memmap_dataset:528] Processing 1 data files using 6 workers
[NeMo I 2024-10-28 05:26:39 text_memmap_dataset:543] Time building 0 / 1 mem-mapped files: 0:00:00.201747
[NeMo I 2024-10-2

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-10-28 05:26:39 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    


Testing: |          | 0/? [00:00<?, ?it/s]setting number of microbatches to constant 32
Testing DataLoader 0:   0%|          | 0/157 [00:00<?, ?it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 32
Testing DataLoader 0:   1%|          | 1/157 [00:19<50:16,  0.05it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 32
Testing DataLoader 0:   1%|▏         | 2/157 [00:35<45:37,  0.06it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 32
Testing DataLoader 0:   2%|▏         | 3/157 [00:49<42:42,  0.06it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 32
Testing DataLoader 0:   3%|▎         | 4/157 [01:04<40:48,  0.06it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 32
Testing DataLoader 0:   3%|▎         | 5/157 [01:19<40:31,  0.06it/s]setting number of microbatches to constant 1
settin

[NeMo W 2024-10-28 06:10:53 megatron_gpt_sft_model:677] No training data found, reconfiguring microbatches based on validation batch sizes.
[NeMo W 2024-10-28 06:10:53 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 06:10:53 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('test_loss_infer', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 06:10:53 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('test_loss', ..., syn

The results are written to `results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl`. Please send us this file for your final submission.

Let's inspect the first line of that file as a sanity check:

In [5]:
! cat results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl | head -n 1

{"input": "\nYou are an expert in Law and also in tagging legal questions.\nYou are provided with a question enclosed in +++++ and it's corresponding title enclosed in >>>>> from the law domain.\n\nYou are also provided with a list of all the tags, enclosed in ^^^^^.\n\nYour task is to:\ni. Understand the question and it's title.\nii. Pick up the tags that are most appropriate and relevant to the question, strictly from the tags provided to you.\niii. Make sure you return the tags alone without their description.\n\n\n```\nNOTES: All tags must be in lowercase, ordered lexicographically and separated by commas.\n```\n\nYour output should be a JSON with the below format:\n```\ntags : <Put your relevant tags here>\n```\n\n\n>>>>>\nTitle: Fairness in Punishment for Reckless Behavior\n>>>>>\n\n\n+++++\nQuestion: Is it justifiable to have significantly different penalties for individuals who engage in reckless behavior, depending on the outcome of their actions, or should the focus be on the
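
Because each record contains the full prompt, raw `cat` output is hard to read. As a minimal sketch, the hypothetical helper below loads the first few records and shortens every string field for display. The file name suggests that each record holds an input, a prediction, and a label, but only the `input` key is visible in the output above, so the code makes no assumption about the other key names.

```python
import json
from itertools import islice


def preview_predictions(path, n=3, width=80):
    """Return up to n records with every string field shortened to `width` characters."""
    previews = []
    with open(path, encoding="utf-8") as f:
        for line in islice(f, n):
            record = json.loads(line)
            previews.append({
                key: value[:width] + "..." if isinstance(value, str) and len(value) > width else value
                for key, value in record.items()
            })
    return previews
```

For example, `for row in preview_predictions("results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl"): print(row)` prints one compact dictionary per record.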

---
# Freeing Memory and Other Resources

As always, it is a good idea to free up all allocated resources when you are done. Please execute the following cell to do so.

Alternatively, restart the kernel by navigating to `Kernel > Restart Kernel` (if using Jupyter Notebook), or by clicking the `Restart` button in VS Code.

In [None]:
exit(0)