![DLI Header](images/DLI_Header.png)

# Text Classification #

## Sentiment Analysis ##

In this notebook, you will learn to fine-tune a pre-trained model. Specifically, we will use a model for sentiment analysis. 

**Sentiment Analysis** is the task of detecting the sentiment expressed in text. We model it as a simple text classification problem. For example, `Gollum's performance is incredible!` has a positive sentiment, while `It's neither as romantic nor as thrilling as it should be.` has a negative sentiment. Here we classify individual sentences into one of two classes, "positive" or "negative", so each sentence in the training set must be labeled as one or the other. Sentiment analysis is widely used by businesses to identify customer sentiment toward products, brands, or services in online conversations and feedback.

**Table of Contents**<br>
This notebook covers the following sections: 
* Dataset
    * Download and Preprocess Data
* Fine-Tune a Pre-Trained Model
    * Configuration File
    * Download Pre-Trained Model
    * Model Training
    * Evaluate Predictions
    * Inference

## Dataset ##

In this notebook, we're going to use the [Stanford Sentiment Treebank (SST-2)](https://nlp.stanford.edu/sentiment/index.html) corpus for sentiment analysis. The data is a collection of sentences, each with a binary label: positive or negative. 

For text classification, NeMo requires the data in a specific format: tab-separated files (.tsv) with two columns, sentence and label. Each line contains one text sequence, with words separated by spaces, followed by a [TAB] and the label, i.e.: `[WORD] [SPACE] [WORD] [SPACE] [WORD] [TAB] [LABEL]`

For example:
```
hide new secretions from the parental units[TAB]0
that loves its characters and communicates something rather beautiful about human nature[TAB]1
...
```
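
To make the format concrete, here is a minimal sketch of a line parser. The helper name `parse_nemo_line` is hypothetical and not part of NeMo; it only illustrates the `sentence[TAB]label` layout described above and rejects lines, such as a header row, whose last field is not an integer label.

```python
from typing import Tuple


def parse_nemo_line(line: str) -> Tuple[str, int]:
    """Split one 'sentence<TAB>label' line into (sentence, label)."""
    # rpartition splits on the LAST tab, so the label is always the final field
    sentence, sep, label = line.rstrip("\n").rpartition("\t")
    if not sep:
        raise ValueError(f"no TAB separator found: {line!r}")
    if not label.strip().isdigit():
        raise ValueError(f"label is not a non-negative integer: {label!r}")
    return sentence.strip(), int(label)


sample = "hide new secretions from the parental units\t0\n"
print(parse_nemo_line(sample))
```

Splitting on the last tab keeps the parser simple while still tolerating trailing spaces before the separator.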

### Download and Preprocess Data ###

We have prepared the SST-2 dataset for you. It should contain three files, `train.tsv`, `dev.tsv`, and `test.tsv`, which are used for training, validation, and testing respectively.

In [1]:
import os
import wget

# set data path
DATA_DIR='data'
DATA_DIR=os.path.join(DATA_DIR, 'SST-2')

In [2]:
# check that the data folder contains train.tsv, dev.tsv, and test.tsv
!ls -l {DATA_DIR}

total 34772
-rw-r--r-- 1 root root   597450 Nov  2  2023 cached_dev_nemo_format.tsv_BertTokenizer_256_30522_-1_0_False.pkl
-rw-r--r-- 1 root root 26988143 Nov  2  2023 cached_train_nemo_format.tsv_BertTokenizer_256_30522_-1_0_True.pkl
-rw-r--r-- 1 root root    94931 Nov  2  2023 dev.tsv
-rw-r--r-- 1 root root    94916 Nov  2  2023 dev_nemo_format.tsv
drwxr-xr-x 2 root root     4096 Aug 23 16:54 original
-rw-r--r-- 1 root root   197335 Nov  2  2023 test.tsv
-rw-r--r-- 1 root root  3806081 Nov  2  2023 train.tsv
-rw-r--r-- 1 root root  3806066 Nov  2  2023 train_nemo_format.tsv


In [3]:
# preview data 
print('Train:')
!head -n 5 {DATA_DIR}/train.tsv

print('Dev:')
!head -n 5 {DATA_DIR}/dev.tsv

print('Test:')
!head -n 5 {DATA_DIR}/test.tsv

Train:
sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
Dev:
sentence	label
it 's a charming and often affecting journey . 	1
unflinchingly bleak and desperate 	0
allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 	1
the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 	1
Test:
index	sentence
0	uneasy mishmash of styles and genres .
1	this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
2	by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
3	director rob marshall went out gunning to make a great one .


The format of `train.tsv` and `dev.tsv` is close to NeMo's format, except for an extra header line at the beginning of each file, which we remove below. `test.tsv` has a different format (an index column followed by the sentence) and carries no labels.

In [4]:
!sed 1d {DATA_DIR}/train.tsv > {DATA_DIR}/train_nemo_format.tsv
!sed 1d {DATA_DIR}/dev.tsv > {DATA_DIR}/dev_nemo_format.tsv
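
For readers who prefer to stay in Python, the same header removal can be sketched as below. This is an illustrative alternative, not what the notebook runs: `strip_header` is a hypothetical helper, and unlike `sed 1d` it only drops the first line when that line really is the `sentence[TAB]label` header, so running it twice does not eat a training example. The demo uses a temporary file rather than the real SST-2 data.

```python
import os
import tempfile


def strip_header(src_path: str, dst_path: str, header: str = "sentence\tlabel") -> int:
    """Copy src to dst without the header line; return the number of lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        first = src.readline()
        # Keep the first line if it is real data rather than the expected header
        if first and first.rstrip("\n") != header:
            dst.write(first)
            written += 1
        for line in src:
            dst.write(line)
            written += 1
    return written


# Small self-contained demo on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train.tsv")
    dst = os.path.join(tmp, "train_nemo_format.tsv")
    with open(src, "w", encoding="utf-8") as f:
        f.write("sentence\tlabel\ncontains no wit , only labored gags \t0\n")
    n_written = strip_header(src, dst)
    with open(dst, encoding="utf-8") as f:
        demo_output = f.read()

print(n_written, repr(demo_output))
```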

## Fine-Tune a Pre-Trained Model ##

A text classification model typically consists of a pre-trained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model followed by a text classification layer. For training, we can use a configuration file to define the model. The configuration (config) file consists of several important sections, including: 
* **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information
* **trainer**: Any argument to be passed to PyTorch Lightning

_Note:_ NeMo provides a template for creating the configuration file, which is recommended as a starting point, but you can create your own as long as it follows the required format. 

### Configuration File ###

In [5]:
# define config path
MODEL_CONFIG="text_classification_config.yaml"
WORK_DIR='WORK_DIR'
os.makedirs(WORK_DIR, exist_ok=True)

In [6]:
# download the model's configuration file 
BRANCH='main'
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/text_classification/conf/' + MODEL_CONFIG, config_dir)
else:
    print('config file already exists')

Downloading config file...


The config file for text classification, `text_classification_config.yaml`, specifies model, training, and experiment management details, such as file locations, pre-trained models, and hyperparameters. The YAML config file we downloaded provides default values for most of the parameters, but there are a few items that must be specified for this experiment.

Each YAML section is a bit easier to view using the `omegaconf` package, which allows you to access and manipulate the configuration keys using a "dot" notation. We'll take a look at the details of each section using the `OmegaConf` tool. 

In [7]:
from omegaconf import OmegaConf

CONFIG_DIR = "/dli/task/WORK_DIR/configs"
CONFIG_FILE = "text_classification_config.yaml"

config=OmegaConf.load(CONFIG_DIR + "/" + CONFIG_FILE)

# print the entire configuration file
print(OmegaConf.to_yaml(config))

trainer:
  devices: 1
  num_nodes: 1
  max_epochs: 100
  max_steps: -1
  accumulate_grad_batches: 1
  gradient_clip_val: 0.0
  precision: 32
  accelerator: gpu
  log_every_n_steps: 1
  val_check_interval: 1.0
  num_sanity_val_steps: 0
  enable_checkpointing: false
  logger: false
model:
  nemo_path: text_classification_model.nemo
  tokenizer:
    tokenizer_name: ${model.language_model.pretrained_model_name}
    vocab_file: null
    tokenizer_model: null
    special_tokens: null
  language_model:
    pretrained_model_name: bert-base-uncased
    lm_checkpoint: null
    config_file: null
    config: null
  classifier_head:
    num_output_layers: 2
    fc_dropout: 0.1
  class_labels:
    class_labels_file: null
  dataset:
    num_classes: ???
    do_lower_case: false
    max_seq_length: 256
    class_balancing: null
    use_cache: false
  train_ds:
    file_path: null
    batch_size: 64
    shuffle: true
    num_samples: -1
    num_workers: 3
    drop_last: false
    pin_memory: false
  va

Notice that some config values, such as `model.dataset.num_classes`, are shown as `???`. This marks fields that the user is required to specify. Details about the model arguments can be found in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_classification.html#training-the-text-classification-model). 
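
One way to spot every required field at a glance is to walk the config and collect the keys whose value is `???`. The sketch below does this on a plain Python dict that hand-copies a fragment of the printed config; `find_missing` is a hypothetical helper and a stand-in for illustration, not OmegaConf's own API.

```python
def find_missing(cfg: dict, prefix: str = "") -> list:
    """Return dotted keys whose value is the mandatory-value marker '???'."""
    missing = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as model.dataset
            missing.extend(find_missing(value, path))
        elif value == "???":
            missing.append(path)
    return missing


# Hand-copied fragment of the printed config, as a plain dict
cfg_fragment = {
    "model": {
        "dataset": {"num_classes": "???", "max_seq_length": 256},
        "train_ds": {"file_path": None, "batch_size": 64},
    }
}
print(find_missing(cfg_fragment))
```

Note that `file_path: null` is not flagged: in this config it is an optional value that defaults to "no file", whereas `???` means the value must be supplied.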

We first need to set `num_classes` in the config file, which specifies the number of classes in the dataset. SST-2 has two classes (0 for negative and 1 for positive), so we set `num_classes` to 2. The model also supports more than two classes.
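
If you are unsure how many classes a dataset has, you can count the distinct labels instead of hard-coding the number. The following is a minimal sketch under that assumption: `label_counts` is a hypothetical helper, and an in-memory `io.StringIO` with three rows from the training preview stands in for the real `train_nemo_format.tsv`.

```python
import io
from collections import Counter


def label_counts(lines) -> Counter:
    """Count the labels in an iterable of 'sentence<TAB>label' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _, _, label = line.rpartition("\t")
        counts[int(label)] += 1
    return counts


# Stand-in for an open train_nemo_format.tsv
sample_file = io.StringIO(
    "hide new secretions from the parental units \t0\n"
    "contains no wit , only labored gags \t0\n"
    "that loves its characters and communicates something rather beautiful about human nature \t1\n"
)
counts = label_counts(sample_file)
num_classes = len(counts)
print(counts, num_classes)
```

The same counts also show how balanced the classes are, which is useful context when deciding whether to touch `class_balancing`.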

We also need to set `model.train_ds.file_path`, `model.validation_ds.file_path`, and `model.test_ds.file_path` in the config to the paths of the train, validation, and test files, where those files exist. 

In [8]:
!ls $DATA_DIR

cached_dev_nemo_format.tsv_BertTokenizer_256_30522_-1_0_False.pkl
cached_train_nemo_format.tsv_BertTokenizer_256_30522_-1_0_True.pkl
dev.tsv
dev_nemo_format.tsv
original
test.tsv
train.tsv
train_nemo_format.tsv


In [9]:
# set num_classes to 2
config.model.dataset.num_classes=2

# set file paths
config.model.train_ds.file_path = os.path.join(DATA_DIR, 'train_nemo_format.tsv')
config.model.validation_ds.file_path = os.path.join(DATA_DIR, 'dev_nemo_format.tsv')

# You may change other params like batch size or the number of samples to be considered (-1 means all the samples)

# print the model section
print(OmegaConf.to_yaml(config.model))

nemo_path: text_classification_model.nemo
tokenizer:
  tokenizer_name: ${model.language_model.pretrained_model_name}
  vocab_file: null
  tokenizer_model: null
  special_tokens: null
language_model:
  pretrained_model_name: bert-base-uncased
  lm_checkpoint: null
  config_file: null
  config: null
classifier_head:
  num_output_layers: 2
  fc_dropout: 0.1
class_labels:
  class_labels_file: null
dataset:
  num_classes: 2
  do_lower_case: false
  max_seq_length: 256
  class_balancing: null
  use_cache: false
train_ds:
  file_path: data/SST-2/train_nemo_format.tsv
  batch_size: 64
  shuffle: true
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
validation_ds:
  file_path: data/SST-2/dev_nemo_format.tsv
  batch_size: 64
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
test_ds:
  file_path: null
  batch_size: 64
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
optim:
  name: adam
  lr

In [10]:
print(OmegaConf.to_yaml(config.trainer))

devices: 1
num_nodes: 1
max_epochs: 100
max_steps: -1
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32
accelerator: gpu
log_every_n_steps: 1
val_check_interval: 1.0
num_sanity_val_steps: 0
enable_checkpointing: false
logger: false



In [11]:
# let's modify some trainer configs

# limit training to a single epoch to keep this tutorial short
# training stops as soon as either max_steps or max_epochs is reached, whichever comes first
config.trainer.max_epochs = 1

# print the trainer section
print(OmegaConf.to_yaml(config.trainer))

devices: 1
num_nodes: 1
max_epochs: 1
max_steps: -1
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32
accelerator: gpu
log_every_n_steps: 1
val_check_interval: 1.0
num_sanity_val_steps: 0
enable_checkpointing: false
logger: false



Note: `OmegaConf.to_yaml()` formats the config for printing. The `config` object is an in-memory copy of `text_classification_config.yaml`, so edits made to the file on disk after loading are not picked up until `config` is re-created with `OmegaConf.load()`.

With the data paths and trainer settings in place, the remaining choice before initializing the model is which pre-trained language model to use.

### Download Pre-Trained Model ###

Before initializing the model, we might want to modify some of the model configs. For example, we might want to swap the pre-trained BERT model for another one. The default is `bert-base-uncased`. 

In [12]:
from nemo.collections import nlp as nemo_nlp

# complete list of supported BERT-like models
for model in nemo_nlp.modules.get_pretrained_lm_models_list(): 
    print(model)

NOTE! Installing ujson may make loading annotations faster.
bert-base-uncased
bert-large-uncased
bert-base-cased
bert-large-cased
bert-base-multilingual-uncased
bert-base-multilingual-cased
bert-base-chinese
bert-base-german-cased
bert-large-uncased-whole-word-masking
bert-large-cased-whole-word-masking
bert-large-uncased-whole-word-masking-finetuned-squad
bert-large-cased-whole-word-masking-finetuned-squad
bert-base-cased-finetuned-mrpc
bert-base-german-dbmdz-cased
bert-base-german-dbmdz-uncased
cl-tohoku/bert-base-japanese
cl-tohoku/bert-base-japanese-whole-word-masking
cl-tohoku/bert-base-japanese-char
cl-tohoku/bert-base-japanese-char-whole-word-masking
TurkuNLP/bert-base-finnish-cased-v1
TurkuNLP/bert-base-finnish-uncased-v1
wietsedv/bert-base-dutch-cased
distilbert-base-uncased
distilbert-base-uncased-distilled-squad
distilbert-base-cased
distilbert-base-cased-distilled-squad
distilbert-base-german-cased
distilbert-base-multilingual-cased
distilbert-base-uncased-finetuned-sst-2-e

In [13]:
# specify the BERT-like model you want to use by setting the
# `model.language_model.pretrained_model_name` parameter in the config
config.model.language_model.pretrained_model_name = "bert-base-uncased"

Now we are ready to initialize our model. During the model initialization call, the datasets and data loaders are also prepared for training and validation.

The pre-trained BERT model is downloaded automatically. The first time you create the model, this can take up to a few minutes depending on the size of the chosen BERT model. If your dataset is large, reading and processing it may also take some time.

In [14]:
from nemo.collections.nlp.models import TextClassificationModel
import pytorch_lightning as pl

trainer=pl.Trainer(**config.trainer)
text_classification_model=TextClassificationModel(cfg=config.model, trainer=trainer)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..


[NeMo I 2024-08-23 16:59:14 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-base-uncased, vocab_file: None, merges_files: None, special_tokens_dict: {}, and use_fast: False


Downloading tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2024-08-23 16:59:15 modelPT:244] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.


[NeMo I 2024-08-23 16:59:15 text_classification_dataset:120] Read 67349 examples from data/SST-2/train_nemo_format.tsv.
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:238] *** Example ***
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:239] example 0: ['flat-out', 'amusing', ',', 'sometimes', 'endearing', 'and']
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:240] subtokens: [CLS] flat - out amusing , sometimes end ##earing and [SEP]
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:241] input_ids: 101 4257 1011 2041 19142 1010 2823 2203 27242 1998 102
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:242] segment_ids: 0 0 0 0 0 0 0 0 0 0 0
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:243] input_mask: 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:244] label: 1
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:238] *** Example ***
[NeMo I 2024-08-23 16:59:15 text_classification_dataset:239] example 1: 

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

### Model Training ###

In [15]:
# start model training
trainer.fit(text_classification_model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-08-23 17:00:00 modelPT:721] Optimizer config = Adam (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.999]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 2e-05
        maximize: False
        weight_decay: 0.01
    )
[NeMo I 2024-08-23 17:00:00 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.WarmupAnnealing object at 0x7fbe0f7316c0>" 
    will be used during training (effective maximum steps = 1053) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1
    max_steps: 1053
    )



  | Name                  | Type                 | Params
---------------------------------------------------------------
0 | loss                  | CrossEntropyLoss     | 0     
1 | bert_model            | BertEncoder          | 109 M 
2 | classifier            | SequenceClassifier   | 592 K 
3 | classification_report | ClassificationReport | 0     
---------------------------------------------------------------
110 M     Trainable params
0         Non-trainable params
110 M     Total params
440.297   Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]



Validation: 0it [00:00, ?it/s]



[NeMo I 2024-08-23 17:01:59 text_classification_model:142] val_report: 
    label                                                precision    recall       f1           support   
    label_id: 0                                             91.87      89.72      90.78        428
    label_id: 1                                             90.31      92.34      91.31        444
    -------------------
    micro avg                                               91.06      91.06      91.06        872
    macro avg                                               91.09      91.03      91.05        872
    weighted avg                                            91.07      91.06      91.05        872
    


`Trainer.fit` stopped: `max_epochs=1` reached.
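
As a rough cross-check of the scheduler log, the reported `max_steps: 1053` is consistent with one epoch over the 67349 training examples at a batch size of 64. The sketch below redoes that arithmetic using only values that appear in the config and logs. The warm-up step count is an approximation: exactly how NeMo rounds `warmup_ratio` internally is an assumption here.

```python
import math

num_train_examples = 67349   # "Read 67349 examples" in the initialization log
batch_size = 64              # model.train_ds.batch_size
devices = 1                  # trainer.devices
accumulate_grad_batches = 1  # trainer.accumulate_grad_batches
max_epochs = 1               # trainer.max_epochs after our change
warmup_ratio = 0.1           # from the scheduler log

# drop_last is false, so the final partial batch still counts as a step
batches_per_epoch = math.ceil(num_train_examples / (batch_size * devices))
steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
total_steps = steps_per_epoch * max_epochs

# Assumed rounding: truncate toward zero
approx_warmup_steps = int(warmup_ratio * total_steps)
print(total_steps, approx_warmup_steps)
```

Doubling `batch_size` or `accumulate_grad_batches` roughly halves the number of optimizer steps per epoch, which also shortens the warm-up phase when it is defined as a ratio.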


### Evaluate Predictions ###

For inference, we can use `trainer.test()` or `model.classifytext()`. Additional details are in the [documentation](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/text_classification/text_classification_model.py). 

In [16]:
eval_config = OmegaConf.create({'file_path': config.model.validation_ds.file_path, 'batch_size': 64, 'shuffle': False, 'num_samples': -1})
text_classification_model.setup_test_data(test_data_config=eval_config)
trainer.test(model=text_classification_model, verbose=False)

[NeMo I 2024-08-23 17:02:00 text_classification_dataset:120] Read 872 examples from data/SST-2/dev_nemo_format.tsv.
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:238] *** Example ***
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:239] example 0: ['it', "'s", 'a', 'charming', 'and', 'often', 'affecting', 'journey', '.']
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:240] subtokens: [CLS] it ' s a charming and often affecting journey . [SEP]
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:241] input_ids: 101 2009 1005 1055 1037 11951 1998 2411 12473 4990 1012 102
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:242] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:243] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:244] label: 1
[NeMo I 2024-08-23 17:02:00 text_classification_dataset:238] *** Example ***
[NeMo I 2024-08-23 17:02:00 text_classification_datase

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]



[NeMo I 2024-08-23 17:02:02 text_classification_model:142] test_report: 
    label                                                precision    recall       f1           support   
    label_id: 0                                             91.87      89.72      90.78        428
    label_id: 1                                             90.31      92.34      91.31        444
    -------------------
    micro avg                                               91.06      91.06      91.06        872
    macro avg                                               91.09      91.03      91.05        872
    weighted avg                                            91.07      91.06      91.05        872
    


[{'test_loss': 0.24226368963718414,
  'test_precision': 91.05504608154297,
  'test_f1': 91.05504608154297,
  'test_recall': 91.05504608154297}]
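
To see where the numbers in the report come from, the sketch below recomputes precision, recall, and F1 from per-class counts. The counts of correct predictions (384 of 428 for class 0, 410 of 444 for class 1) are not printed by NeMo; they are inferred here from the reported recall and support, so treat them as an assumption. `per_class_metrics` is a hypothetical helper for a two-class problem.

```python
# Counts inferred from the report (an assumption, see the text above)
correct = {0: 384, 1: 410}   # correctly classified examples per true class
support = {0: 428, 1: 444}   # number of examples per true class


def per_class_metrics(label: int) -> tuple:
    """Return (precision, recall, f1) for one class of a binary problem."""
    other = 1 - label
    tp = correct[label]
    fn = support[label] - tp              # true `label` predicted as `other`
    fp = support[other] - correct[other]  # true `other` predicted as `label`
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for label in (0, 1):
    p, r, f1 = per_class_metrics(label)
    print(f"label_id: {label}  precision={100 * p:.2f}  recall={100 * r:.2f}  f1={100 * f1:.2f}")

# With one label per example, micro-averaged precision, recall, and F1 all equal accuracy
accuracy = sum(correct.values()) / sum(support.values())
print(f"micro avg = {100 * accuracy:.2f}")
```

This also explains why `test_precision`, `test_f1`, and `test_recall` in the returned dictionary are identical: they appear to be the micro-averaged values, which coincide with accuracy in single-label classification.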

### Inference ###

In [17]:
# define the list of queries for inference
queries = ['by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .', 
           'director rob marshall went out gunning to make a great one .', 
           'uneasy mishmash of styles and genres .']
           
# max_seq_length=512 is the maximum length BERT supports.       
results = text_classification_model.classifytext(queries=queries, batch_size=3, max_seq_length=512)

print('The prediction results of some sample queries with the trained model:')
for query, result in zip(queries, results):
    print(f'Query : {query}')
    print(f'Predicted label: {result}')

The prediction results of some sample queries with the trained model:
Query : by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
Predicted label: 1
Query : director rob marshall went out gunning to make a great one .
Predicted label: 1
Query : uneasy mishmash of styles and genres .
Predicted label: 0
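
The predictions are integer class ids. A small lookup makes them easier to read; the sketch below assumes the SST-2 convention visible in the data preview (0 for negative, 1 for positive). `LABEL_NAMES` and `describe` are hypothetical names introduced for illustration.

```python
# Assumed mapping, based on the labeled examples in the data preview
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe(predicted_ids):
    """Map integer class ids to readable names, flagging anything unexpected."""
    return [LABEL_NAMES.get(int(i), f"unknown({i})") for i in predicted_ids]


example_results = [1, 1, 0]  # the predicted labels printed for the three queries
print(describe(example_results))
```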


In [18]:
# restart the kernel
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

**Well Done!** 
