In [None]:
BRANCH='main'

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

In this tutorial, we describe how to use the [P-Tuning method](https://arxiv.org/pdf/2103.10385.pdf) to find good prompts for large GPT models so that they can solve downstream NLP tasks with good performance. P-Tuning uses a small number of continuous free parameters as prompts that are fed as input to the pre-trained language model. Because the weights of the large language model are frozen, a P-Tuning model can be trained efficiently while delivering state-of-the-art performance.
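
To make the idea concrete, here is a toy sketch in plain Python. It is not the NeMo implementation: the names `frozen_embeddings`, `prompt_vectors`, `build_input` and `sgd_step` are hypothetical, the sizes are tiny, and the prompt encoder network as well as the exact placement of the prompts are left out. It only illustrates that the continuous prompt vectors are the sole trainable parameters while the pre-trained embeddings stay frozen.

```python
import random

random.seed(0)
HIDDEN = 4       # toy embedding size, far smaller than in a real GPT model
NUM_PROMPT = 3   # number of trainable "virtual" prompt tokens (hypothetical value)

# Frozen: stand-in for the pre-trained model's token embedding table.
frozen_embeddings = {
    tok: [random.uniform(-1, 1) for _ in range(HIDDEN)]
    for tok in ["profit", "fell", "Sentiment"]
}

# Trainable: continuous prompt vectors that do not correspond to any real token.
prompt_vectors = [[0.0] * HIDDEN for _ in range(NUM_PROMPT)]


def build_input(tokens):
    """Prepend the trainable prompt vectors to the frozen token embeddings."""
    return prompt_vectors + [frozen_embeddings[t] for t in tokens]


def sgd_step(prompt_grads, lr=0.1):
    """Update only the prompt vectors; the embedding table is never touched."""
    for vec, grad in zip(prompt_vectors, prompt_grads):
        for i in range(HIDDEN):
            vec[i] -= lr * grad[i]
```

In a real setup the gradients would come from backpropagating the task loss through the frozen language model; here they are simply passed in by the caller.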

Large language models with up to multiple billions of parameters can be trained with [NeMo Megatron](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/language_modeling). In this notebook, we will use the pre-trained 344M GPT model released on NGC.

# Task Description
In this notebook, we use the P-Tuning method for the **Sentiment Analysis** task, also known as opinion mining or emotion AI. It is a sub-field of NLP that tries to identify and extract opinions from text such as blogs, reviews, social media, forums and news.

For instance, **given a sentence from a news headline, is it good or bad news?**<br>

# Dataset

The [Financial PhraseBank dataset](https://huggingface.co/datasets/financial_phrasebank) contains the sentiments for financial news headlines from the perspective of a retail investor. Further details about the dataset can be found in: Malo, P., Sinha, A., Takala, P., Korhonen, P. and Wallenius, J. (2014): “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the American Society for Information Science and Technology.

Here's an example of what annotated sentences from the corpus look like. Each line consists of a sentence followed by `@` and its sentiment label:

```
HELSINKI Thomson Financial - Shares in Cargotec fell sharply in early afternoon trade after the cargo handling group posted a surprise drop in April-June profits , which overshadowed the large number of new orders received during the three months .@negative
LONDON MarketWatch -- Share prices ended lower in London Monday as a rebound in bank stocks failed to offset broader weakness for the FTSE 100 .@negative
Operating profit fell to EUR 35.4 mn from EUR 68.8 mn in 2007 , including vessel sales gain of EUR 12.3 mn .@negative
Sales in Finland decreased by 10.5 % in January , while sales outside Finland dropped by 17 % .@negative
```
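
As a minimal sketch of how such a line can be taken apart, the helper below splits on the last `@` so that an `@` inside the sentence itself does not confuse the parser. The function name `parse_line` is our own and is not part of the dataset tooling.

```python
def parse_line(line):
    """Split 'sentence@label' on the LAST '@' and return (sentence, label)."""
    sentence, sep, label = line.strip().rpartition("@")
    if not sep:
        raise ValueError(f"no '@label' suffix in line: {line!r}")
    return sentence.strip(), label.strip()


example = (
    "Sales in Finland decreased by 10.5 % in January , "
    "while sales outside Finland dropped by 17 % .@negative"
)
sentence, label = parse_line(example)
print(label)
print(sentence)
```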

Let's download the dataset.

In [None]:
DATA_DIR = "DATA_DIR"
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(os.path.join(DATA_DIR, 'SA'), exist_ok=True)

## Downloading Financial Phrase Bank Dataset

The dataset was collected by Malo et al. (2014) and can be downloaded from this [link](https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip). The zip file for the Financial Phrase Bank Dataset has been provided for ease of download and use.

In [None]:
!wget https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip
!mv FinancialPhraseBank-v10.zip {DATA_DIR}
!unzip -o {DATA_DIR}/FinancialPhraseBank-v10.zip -d {DATA_DIR}

In [None]:
# If you want to see more examples, you can explore the text of the corpus using the file browser to the left, or open files directly, for example typing a command like the following in a code-cell:

! head -1 $DATA_DIR/FinancialPhraseBank-v1.0/Sentences_50Agree.txt

## Pre-process dataset

In this pre-processing step, we convert the downloaded dataset into the format expected by the P-Tuning dataloader. The data is split into 10 folds so that we can do 10-fold cross-validation.

In [None]:
import json
import random

random.seed(1234)
files = ['Sentences_50Agree.txt', 'Sentences_66Agree.txt', 'Sentences_75Agree.txt', 'Sentences_AllAgree.txt']
base_dir = DATA_DIR + '/FinancialPhraseBank-v1.0/'
files = [base_dir + f for f in files]

alllines = []
for fn in files:
    with open(fn, 'r', encoding="ISO-8859-1") as f:
        alllines.extend(f.readlines())

random.shuffle(alllines)
fold = 10
fold_size = len(alllines) // fold

# derive the chunk boundaries from the data instead of hard-coding the corpus size
chunk_start = list(range(0, fold * fold_size, fold_size))

chunks = []

for start_id in chunk_start:
    chunks.append(alllines[start_id:start_id+fold_size])

special = '<|endoftext|>'

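def gen_file(data, fold_id, split_type):
    # write one JSON object per line with the keys "sentence" and "label"
    filename = os.path.join(base_dir, "{}_{}.txt".format(split_type, fold_id))
    with open(filename, 'w') as f:
        for line in data:
            # split on the last '@' only, so an '@' inside the sentence is preserved
            part1, part2 = line.rsplit('@', 1)
            # the trailing "Sentiment" cue is the prompt for the model
            obj = {'sentence': part1.strip() + ' Sentiment ', 'label': part2.strip()}
            f.write(json.dumps(obj)+'\n')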


def gen_fold(fold_number):
    lists = list(range(fold))
    test_id = (fold_number + fold) % fold
    val_id = (fold_number + fold - 1) % fold
    test_set = chunks[test_id]
    val_set = chunks[val_id]
    lists.remove(test_id)
    lists.remove(val_id)
    train_set = []
    for idd in lists:
        train_set += chunks[idd]
    gen_file(train_set, fold_number, 'train')
    gen_file(val_set, fold_number, 'validation')
    gen_file(test_set, fold_number, 'test')

for i in range(fold):
    gen_fold(i)
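
The fold rotation used above can be summarized in a few lines. The sketch below mirrors the index arithmetic of `gen_fold` without touching any files; the helper name `fold_assignment` is hypothetical.

```python
def fold_assignment(fold_number, num_folds=10):
    """Return (train_ids, val_id, test_id) chunk indices for one fold."""
    test_id = fold_number % num_folds
    val_id = (fold_number - 1) % num_folds  # the chunk just before the test chunk
    train_ids = [i for i in range(num_folds) if i not in (test_id, val_id)]
    return train_ids, val_id, test_id


for k in range(3):
    print(k, fold_assignment(k))
```

Each chunk serves as the test set exactly once and as the validation set exactly once across the 10 folds, and the remaining 8 chunks form the training set.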

The data is converted to a loose JSON file, with one JSON object per line. Each line has two keys, "sentence" and "label". Note that we append "Sentiment" to the end of the input sentence to cue the model for sentiment analysis. 
Here are the first two lines of the converted data:

In [None]:
!head -n 2 $DATA_DIR/FinancialPhraseBank-v1.0/train_0.txt
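
To sanity-check a converted file, it can help to count the labels. The following is a minimal sketch using only the standard library; `label_counts` is a hypothetical helper and the sample lines are made up to match the format described above.

```python
import json
from collections import Counter


def label_counts(lines):
    """Count the "label" values in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip empty lines
            counts[json.loads(line)["label"]] += 1
    return counts


sample = [
    '{"sentence": "Operating profit fell . Sentiment ", "label": "negative"}',
    '{"sentence": "Net sales increased . Sentiment ", "label": "positive"}',
    '{"sentence": "Sales in Finland decreased . Sentiment ", "label": "negative"}',
]
print(label_counts(sample))
```

To apply it to a generated file, pass an open file handle instead of the list, for example the handle returned by `open(os.path.join(DATA_DIR, 'FinancialPhraseBank-v1.0', 'train_0.txt'))`.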

## Convert the Megatron-LM Weights to Nemo file

The P-Tuning method works best with large GPT language models. In our experience, models with 5B parameters or more give good performance. If you already have a large GPT model ready, skip this section. 

In this example, we will use the pretrained 344M GPT model from the [Megatron-LM project](https://github.com/NVIDIA/Megatron-LM). To load it in NeMo Megatron, we first need to convert the Megatron-LM checkpoint to a `.nemo` file. Let's download the pretrained model weights and vocabulary file.



In [None]:
import pathlib
gpt_file = 'megatron_lm_345m_v0.0.zip'
vocab_file = 'gpt2-vocab.json'
merge_file = 'gpt2-merge.txt'
checkpoint_filename = 'model_optim_rng.pt'

if not pathlib.Path(gpt_file).exists():
    !wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O $gpt_file
    !unzip -o $gpt_file
    !wget https://s3.amazonaws.com/models.huggingface.co/bert/$vocab_file -O $vocab_file 
    !wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O $merge_file



In [None]:
WORK_DIR = "WORK_DIR"
os.makedirs(WORK_DIR, exist_ok=True)

# Prepare the model parameters 
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
MODEL_CONFIG = "megatron_gpt_config.yaml"
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/conf/' + MODEL_CONFIG, config_dir)
else:
    print('config file already exists')

In [None]:
# load the GPT config and set the architecture of the 344M model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
config.model.num_layers = 24
config.model.hidden_size = 1024
config.model.ffn_hidden_size = 4096
config.model.num_attention_heads = 16
config.model.tokenizer.vocab_file = vocab_file
config.model.tokenizer.merge_file = merge_file
config.model.tensor_model_parallel_size = 1
config.model.data.data_prefix = ''
config.model.max_position_embeddings = 1024
config.model.data.seq_length = 1024
config.model.encoder_seq_length = 1024
config.cfg = {}
config.cfg.cfg = config.model
with open('hparams.yaml', 'w') as f:
    f.write(OmegaConf.to_yaml(config.cfg))

In [None]:
import os
PWD = os.getcwd()
wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py')
!python -m torch.distributed.run --nproc_per_node=1 megatron_lm_ckpt_to_nemo.py --checkpoint_folder=$PWD/release/mp_rank_00/ --checkpoint_name=$checkpoint_filename --hparams_file=$PWD/hparams.yaml --nemo_file_path=$PWD/gpt_344m.nemo --model_type=gpt --tensor_model_parallel_size=1
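
After the conversion it can be useful to confirm that the `.nemo` file was written. The sketch below assumes that a `.nemo` file is a tar archive, which to the best of our knowledge is how NeMo packages checkpoints; the helper name `list_nemo_contents` is hypothetical, and the member names inside the archive may differ between NeMo versions.

```python
import tarfile


def list_nemo_contents(nemo_path):
    """Return the member names of a .nemo archive (assumed to be a tar file)."""
    # "r:*" lets tarfile auto-detect whether the archive is compressed
    with tarfile.open(nemo_path, "r:*") as archive:
        return archive.getnames()
```

Calling `list_nemo_contents('gpt_344m.nemo')` after the conversion step should list the packaged config and weight files; an exception here suggests the conversion did not complete.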

# Model configuration

Our P-Tuning text classification model is composed of the pretrained GPT language model and a prompt encoder layer.

The model is defined in a config file which declares multiple important sections. They are:
- **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information

- **trainer**: Any argument to be passed to PyTorch Lightning

In [None]:
MODEL_CONFIG = "ptune_text_classification_config.yaml"

In [None]:
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/text_classification/conf/' + MODEL_CONFIG, config_dir)
else:
    print('config file already exists')

In [None]:
# load the P-Tuning config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
# Note: these are small batch-sizes - increase as appropriate to available GPU capacity
config.model.train_ds.batch_size=8
config.model.validation_ds.batch_size=8

# Model Training
## Setting up Data within the config

Among other things, the config file contains dictionaries called train_ds, validation_ds and test_ds. These are the configurations used to set up the Dataset and DataLoader of the corresponding split.
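
Before pointing these sections at the data, it can be worth checking that the files for the chosen fold exist. This is a minimal sketch assuming the file layout produced by the pre-processing step; `split_paths` and `missing_files` are hypothetical helpers and not part of NeMo.

```python
import os


def split_paths(data_dir, fold_id):
    """Map each dataset section name to the file generated for one fold."""
    base = os.path.join(data_dir, "FinancialPhraseBank-v1.0")
    return {
        "train_ds": os.path.join(base, f"train_{fold_id}.txt"),
        "validation_ds": os.path.join(base, f"validation_{fold_id}.txt"),
        "test_ds": os.path.join(base, f"test_{fold_id}.txt"),
    }


def missing_files(paths):
    """Return the section names whose file does not exist on disk."""
    return [name for name, path in paths.items() if not os.path.isfile(path)]


paths = split_paths("DATA_DIR", 0)
print(missing_files(paths))
```

An empty list means all three files are in place and each path can be assigned to the `file_path` entry of the matching section.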


In [None]:
# in this tutorial we train on fold 0, so we point each dataset section to the corresponding file of that fold
# the class names (config.model.dataset.classes) are set in a later cell
config.model.train_ds.file_path = DATA_DIR+'/FinancialPhraseBank-v1.0/train_0.txt'
config.model.validation_ds.file_path = DATA_DIR+'/FinancialPhraseBank-v1.0/validation_0.txt'
config.model.test_ds.file_path = DATA_DIR+'/FinancialPhraseBank-v1.0/test_0.txt'


# if you want to decrease the size of your datasets, uncomment the lines below:
# NUM_SAMPLES = 1000
# config.model.train_ds.num_samples = NUM_SAMPLES
# config.model.validation_ds.num_samples = NUM_SAMPLES

In [None]:
print(OmegaConf.to_yaml(config))

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.

Let's first instantiate a Trainer object

In [None]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

In [None]:
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPPlugin


# let's modify some trainer configs
# checks if we have GPU available and uses it
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda
config.trainer.max_epochs = 6

# for PyTorch Native AMP set precision=16
config.trainer.precision = 16 if torch.cuda.is_available() else 32

# remove distributed training flags
config.trainer.accelerator = None

trainer = pl.Trainer(plugins=[NLPDDPPlugin()], **config.trainer)

## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:

In [None]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))
os.makedirs(WORK_DIR, exist_ok=True)

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

We will use the converted `.nemo` file as our LM model.

In [None]:
# add the model parameters specified above to the config
config.model.language_model.nemo_file = 'gpt_344m.nemo'
config.model.tensor_model_parallel_size = 1
config.model.dataset.classes = ['positive', 'neutral', 'negative']
config.model.tokenizer.vocab_file = vocab_file
config.model.tokenizer.merge_file = merge_file

Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders will be prepared for training and evaluation.

In [None]:
from nemo.collections.nlp.models.text_classification.ptune_text_classification_model import PTuneTextClassificationModel
model_ptune = PTuneTextClassificationModel(cfg=config.model, trainer=trainer)

## Monitoring training progress
Optionally, you can create a Tensorboard visualization to monitor training progress.
If you're not using Colab, refer to [https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks) if you're facing issues with running the cell below.

In [None]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")

In [None]:
# start model training
trainer.fit(model_ptune)

# Inference

To see how the model performs, we can run the model in inference mode.

In [None]:
# let's first create a subset of our dev data
query_examples = [
"For example , net sales increased by 5.9 % from the first quarter , and EBITDA increased from a negative EUR 0.2 mn in the first quarter of 2000 .",
"EPS for the quarter was EUR0 .00 , as compared with EUR0 .01 in the third quarter of 2008 , representing a Group net sales for the third quarter were EUR15 .3 m , up by 2.8 % as compared with EUR14 .9 m in the third quarter of 2008 .",
"The NTSB said investigators are set to conduct sight distance tests on July 18 , using trains similar to those involved in the accident .",
"Pretax profit totaled EUR 9.0 mn , down from EUR 36.3 mn in 2007 .",
"However , the proportion of the paid standing orders grew in 2009 ."]
results = model_ptune.cuda().classifytext(queries=query_examples, batch_size=1, prompt='Sentiment')
print('The prediction results of some sample queries with the trained model:')
for query, result in zip(query_examples, results):
    print(f'Query : {query}')
    print(f'Predicted label: {result}')

## Training Script

If you have NeMo installed locally, you can also train the model with `examples/nlp/text_classification/ptune_text_classification.py`.

To run the training script, use:
```
python examples/nlp/text_classification/ptune_text_classification.py \
    trainer.gpus=1 \
    model.tokenizer.vocab_file=VOCAB_FILE \
    model.tensor_model_parallel_size=1 \
    model.tokenizer.merge_file=MERGE_FILE \
    model.language_model.nemo_file=gpt_344m.nemo \
    model.dataset.classes=[positive,neutral,negative] \
    model.train_ds.file_path=TRAIN_FILE \
    model.train_ds.batch_size=8 \
    model.validation_ds.file_path=VAL_FILE \
    model.test_ds.file_path=TEST_FILE
```
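
Since the data was split into 10 folds, the same command can be generated once per fold. The sketch below only builds and prints the commands and uses exactly the overrides shown above; the helper name `fold_command` and its default file names are assumptions taken from earlier cells of this notebook.

```python
import shlex


def fold_command(fold_id, data_dir="DATA_DIR",
                 vocab_file="gpt2-vocab.json", merge_file="gpt2-merge.txt"):
    """Build the training command (as an argument list) for one fold."""
    base = f"{data_dir}/FinancialPhraseBank-v1.0"
    return [
        "python", "examples/nlp/text_classification/ptune_text_classification.py",
        "trainer.gpus=1",
        f"model.tokenizer.vocab_file={vocab_file}",
        "model.tensor_model_parallel_size=1",
        f"model.tokenizer.merge_file={merge_file}",
        "model.language_model.nemo_file=gpt_344m.nemo",
        "model.dataset.classes=[positive,neutral,negative]",
        f"model.train_ds.file_path={base}/train_{fold_id}.txt",
        "model.train_ds.batch_size=8",
        f"model.validation_ds.file_path={base}/validation_{fold_id}.txt",
        f"model.test_ds.file_path={base}/test_{fold_id}.txt",
    ]


for fold_id in range(10):
    # shlex.join quotes arguments such as the bracketed class list for the shell
    print(shlex.join(fold_command(fold_id)))
```

The argument list can also be passed directly to `subprocess.run`, in which case no shell quoting is needed.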

The training could take several hours, and the result should look something like this:
```
    label                                                precision    recall       f1           support
    positive (label_id: 0)                                  87.75      89.28      88.50        401
    neutral (label_id: 1)                                   94.26      94.26      94.26        889
    negative (label_id: 2)                                  95.03      91.49      93.22        188
    -------------------
    micro avg                                               92.56      92.56      92.56       1478
    macro avg                                               92.35      91.68      92.00       1478
    weighted avg                                            92.59      92.56      92.57       1478
```