In [None]:
BRANCH = 'main'

In [None]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
# If you're not using Colab, you might need to upgrade jupyter notebook to avoid the following error:
# 'ImportError: IProgress not found. Please update jupyter and ipywidgets.'

! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension

# Please restart the kernel after running this cell

In [None]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget 
import torch
import pytorch_lightning as pl
from omegaconf import OmegaConf

import zipfile
import random
from glob import glob

# Tutorial Overview
In this tutorial, we will show how to use a pre-trained BERT language model on a non-English downstream task. We use the Persian language and the named entity recognition (NER) task as an example. Note that most of the other downstream tasks supported in NeMo should work similarly for other languages.

# Task Description
NER is the task of detecting and classifying key information (entities) in text.
For example, in a sentence:  `Mary lives in Santa Clara and works at NVIDIA`, we should detect that `Mary` is a person, `Santa Clara` is a location and `NVIDIA` is a company.

In this tutorial we will be using [BERT language model](https://arxiv.org/abs/1810.04805).

To read more about other topics and downstream tasks supported in NeMo, see the [NeMo tutorial page](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).


# Dataset

In this tutorial we are going to use [Persian Arman dataset for our NER task](https://github.com/HaniehP/PersianNER).

Arman is a hand-annotated Persian corpus for the NER task with 250,015 tokens and 7,682 sentences. Using [IOB encoding](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), each token is labeled either with one of the following named entity types or with O.

* event = event
* fac = facility
* loc = location
* org = organization
* pers = person
* pro = product

Each entity type has a label starting with **B**, which marks the first token of a named entity, and a label starting with **I**, which marks the remaining tokens of that entity.
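To make the IOB scheme concrete, here is a minimal sketch that groups tagged tokens back into entities. The function name `iob_to_entities` and the hyphen-separated tag spelling (`B-pers`, `I-loc`) are assumptions made for illustration; it reuses the English sentence from the task description.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "B-loc" -> ("B", "-", "loc"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # a B tag (or an I tag with no matching open entity) starts a new entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # continuation of the currently open entity
            current_tokens.append(token)
        else:
            # an O tag closes any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = "Mary lives in Santa Clara and works at NVIDIA".split()
tags = ["B-pers", "O", "O", "B-loc", "I-loc", "O", "O", "O", "B-org"]
entities = iob_to_entities(tokens, tags)
print(entities)
```

Here `Santa` carries the **B** tag and `Clara` the **I** tag, so both tokens end up in a single `loc` entity.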




# NeMo Token Classification Data Format

[TokenClassification Model](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/token_classification/token_classification_model.py) in NeMo supports NER and other token level classification tasks, as long as the data follows the format specified below. 

Token Classification Model requires the data to be split into 2 files: 
* text.txt  
* labels.txt. 

Each line of the **text.txt** file contains text sequences, where words are separated with spaces, i.e.: 
[WORD] [SPACE] [WORD] [SPACE] [WORD].

The **labels.txt** file contains corresponding labels for each word in text.txt, the labels are separated with spaces, i.e.:
[LABEL] [SPACE] [LABEL] [SPACE] [LABEL].

Example of a text.txt file:
```
دبیر شورای عالی انقلاب فرهنگی از گنجانده شدن 5 زبان خارجی جدید در برنامه درسی مدارس خبر داد.
```
Corresponding labels.txt file:
```
O B_ORG I_ORG I_ORG I_ORG O O O O O O O O O O O O O O 
```
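Because each label must line up with exactly one whitespace-separated word, a quick sanity check before training can be useful. The sketch below is only an illustration: the helper `find_misaligned_lines` is hypothetical and is not part of NeMo, and the toy lines are made up.

```python
def find_misaligned_lines(text_lines, label_lines):
    """Return 1-based line numbers where the word and label counts differ."""
    if len(text_lines) != len(label_lines):
        raise ValueError("text and labels must have the same number of lines")
    bad = []
    for line_no, (text, labels) in enumerate(zip(text_lines, label_lines), start=1):
        if len(text.split()) != len(labels.split()):
            bad.append(line_no)
    return bad


text_lines = ["Mary lives in Santa Clara", "She works at NVIDIA"]
label_lines = ["B-pers O O B-loc I-loc", "O O O"]  # the second line is one label short
misaligned = find_misaligned_lines(text_lines, label_lines)
print(misaligned)
```

To check real files, read `text.txt` and `labels.txt` into lists of lines first and pass them to the same function.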

## Download and preprocess the data

You can download the Arman dataset by cloning the following GitHub repository: https://github.com/HaniehP/PersianNER.

After downloading the data, you will see a few files and folders inside a directory named PersianNER. Take ArmanPersoNERCorpus.zip and upload it to `DATA_DIR` (if running in a docker or locally), or use **files** in Google Colab to upload it.


In [None]:
# path to the folder with ArmanPersoNERCorpus.zip file (if running locally or in a docker)
DATA_DIR = "PATH_TO_FOLDER_WITH_ZIP.ZIP_FILE"
WORK_DIR = "WORK_DIR"

# adding an empty subfolder for data (otherwise it can interact with existing folders in DATA_DIR)
subfolder = f"{DATA_DIR}/non_eng_NER"

os.makedirs(WORK_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(subfolder, exist_ok=True)

! cp $DATA_DIR/ArmanPersoNERCorpus.zip $subfolder/.
DATA_DIR = f"{DATA_DIR}/non_eng_NER"

In [None]:
if 'google.colab' in str(get_ipython()):
    from google.colab import files
    uploaded = files.upload() 

In [None]:
if 'google.colab' in str(get_ipython()):
  ! mv ArmanPersoNERCorpus.zip $DATA_DIR/.

Let's extract the files from the zip file. This produces three sets of train and test files (folds) that overlap and are intended to be used in turn as train and test sets.

In [None]:
! cd $DATA_DIR && unzip "ArmanPersoNERCorpus.zip"

Next, we concatenate all the data into a single file. Repeated sentences are removed in a later step.

In [None]:
file_all = os.path.join(DATA_DIR, "all_data.txt")
with open(file_all, "w") as f1:
  for filename in glob(f"{DATA_DIR}/test_fold*.txt") + glob(f"{DATA_DIR}/train_fold*.txt"):
    with open(filename, "r", encoding = "ISO-8859-1") as f2:
      for line in f2:
        f1.write(line)

Now, you need to convert this data into a NeMo-compatible format before starting the training process. For this purpose, you can run [examples/nlp/token_classification/data/import_from_iob_format.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/data/import_from_iob_format.py) on your data file, where `PATH_TO_IOB_FORMAT_DATAFILE` is, e.g., `DATA_DIR/all_data.txt`:

```
python examples/nlp/token_classification/data/import_from_iob_format.py --data_file PATH_TO_IOB_FORMAT_DATAFILE
```


In [None]:
!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/nlp/token_classification/data/import_from_iob_format.py

In [None]:
!python import_from_iob_format.py --data_file $DATA_DIR/all_data.txt

Now we process the data to remove any repeated sentences and then split it into train and dev sets.

In [None]:
sent_dict = dict()
line_removed = dict()
line_counter = 0
with open(DATA_DIR + "/text_all_not_repeated.txt", "w") as f1:
    with open(DATA_DIR + "/text_all_data.txt", "r") as f2:
        for line in f2:
            line_counter += 1
            if (not line in sent_dict):
                sent_dict[line] = 1
                f1.write(line)
            else:
                line_removed[line_counter] = 1
#labels:
line_counter = 0
with open(DATA_DIR + "/labels_all_not_repeated.txt", "w") as f1:
    with open(DATA_DIR + "/labels_all_data.txt", "r") as f2:
        for line in f2:
            line_counter += 1
            if(not line_counter in line_removed):
                f1.write(line)
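
The cell above records the line numbers of repeated sentences and then skips the same line numbers in the labels file. The same idea can be sketched in memory by walking both files in lockstep. The helper name `drop_repeated_sentences` and the toy lines are assumptions for illustration only.

```python
def drop_repeated_sentences(text_lines, label_lines):
    """Keep the first occurrence of each sentence together with its labels."""
    seen = set()
    kept_text, kept_labels = [], []
    for text, labels in zip(text_lines, label_lines):
        if text in seen:
            continue  # repeated sentence: skip both the text and its labels
        seen.add(text)
        kept_text.append(text)
        kept_labels.append(labels)
    return kept_text, kept_labels


text_lines = ["a b\n", "c d\n", "a b\n"]
label_lines = ["O O\n", "B-org I-org\n", "O O\n"]
kept_text, kept_labels = drop_repeated_sentences(text_lines, label_lines)
print(kept_text)
print(kept_labels)
```

Processing text and labels as pairs keeps the two outputs aligned without a second pass over line numbers.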

After preprocessing the data and removing repeated sentences, there will be 7668 total valid sentences. We will be using 85% of that as train and 15% as dev. 

In [None]:
total_data = 7668
train_share = 0.85
used_lines_train = dict()
flag = 1
count = 0
while flag:
  idx = random.randint(1, total_data)
  if (not idx in used_lines_train):
    used_lines_train[idx] = 1
    count += 1
  if (count/total_data > train_share):
    flag = 0

line_counter = 0
with open(DATA_DIR+ "/text_train.txt", "w") as f1:
  with open(DATA_DIR + "/text_dev.txt", "w") as f2:
    with open(DATA_DIR + "/text_all_not_repeated.txt", "r") as f3:
      for line in f3:
        line_counter += 1
        if (line_counter in used_lines_train):
          f1.write(line)
        else:
          f2.write(line)

line_counter = 0
with open(DATA_DIR + "/labels_train.txt", "w") as f1:
  with open(DATA_DIR + "/labels_dev.txt", "w") as f2:
    with open(DATA_DIR + "/labels_all_not_repeated.txt", "r") as f3:
      for line in f3:
        line_counter += 1
        if (line_counter in used_lines_train):
          f1.write(line)
        else:
          f2.write(line)
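
The cell above draws random line numbers until the train share is reached, so the split changes on every run. As an alternative sketch (not what the cell above does), a seeded shuffle gives a reproducible split. The function `split_indices` and the `seed` argument are hypothetical, and the indices here are 0-based, whereas the cell above counts lines from 1.

```python
import random


def split_indices(total, train_share, seed=0):
    """Return disjoint sets of 0-based train and dev line indices."""
    rng = random.Random(seed)  # a local generator keeps the split reproducible
    indices = list(range(total))
    rng.shuffle(indices)
    n_train = round(total * train_share)
    return set(indices[:n_train]), set(indices[n_train:])


train_idx, dev_idx = split_indices(total=7668, train_share=0.85)
print(len(train_idx), len(dev_idx))
```

The same index sets can then be applied to both the text file and the labels file, so each sentence and its labels land in the same split.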

Finally, we remove files that are not needed anymore.

In [None]:
print("Removed files:")
for filename in os.listdir(DATA_DIR):
    if (filename == "text_dev.txt" or filename == "text_train.txt" or filename == "labels_dev.txt" or filename == "labels_train.txt"):
      continue
    print(filename)
    os.remove(DATA_DIR + "/" + filename)

Now, the data folder should contain these 4 files:



* labels_dev.txt
* labels_train.txt
* text_dev.txt
* text_train.txt


In [None]:
! ls -l $DATA_DIR

In [None]:
# let's take a look at the data 
print('Text:')
! head -n 5 {DATA_DIR}/text_train.txt

print('\nLabels:')
! head -n 5 {DATA_DIR}/labels_train.txt

# Model configuration

Our Named Entity Recognition model is comprised of the pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model followed by a Token Classification layer.

The model is defined in a config file which declares multiple important sections. They are:
- **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information

- **trainer**: Any argument to be passed to PyTorch Lightning

In [None]:
MODEL_CONFIG = "token_classification_config.yaml"
# download the model's configuration file 
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/conf/' + MODEL_CONFIG, config_dir)
else:
    print('config file already exists')

In [None]:
# this line will print the entire config of the model
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
print(OmegaConf.to_yaml(config))

# Fine-tuning the model using Arman dataset

Let's select a [`bert-base-multilingual-uncased`](https://huggingface.co/bert-base-multilingual-uncased) BERT model and fine-tune it on the Arman dataset.

## Setting up Data within the config

Among other things, the config file contains dictionaries called dataset, train_ds and validation_ds. These are the configurations used to set up the Dataset and DataLoaders for the corresponding splits.

We assume that both training and evaluation files are in the same directory and use the default names mentioned during the data download step. 
So, to start model training, we simply need to specify `model.dataset.data_dir`, like we are going to do below.

Also notice that some config lines, including `model.dataset.data_dir`, have `???` in place of paths. This means that the user is required to specify values for these fields.

Let us now add the data directory path to the config.


In [None]:
# in this tutorial the train and dev datasets are located in the same folder, so it is enough to add the path of the data directory to the config
config.model.dataset.data_dir = DATA_DIR

# if you want to use the full dataset, set NUM_SAMPLES to -1
NUM_SAMPLES = 1000
config.model.train_ds.num_samples = NUM_SAMPLES
config.model.validation_ds.num_samples = NUM_SAMPLES

# for demonstration purposes we're running only a few epochs
config.trainer.max_epochs = 5
print(OmegaConf.to_yaml(config.model))

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.

Let's first instantiate a Trainer object:

In [None]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

In [None]:
# let's modify some trainer configs
# checks if we have GPU available and uses it
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config.trainer.devices = 1
config.trainer.accelerator = accelerator

config.trainer.precision = 16 if torch.cuda.is_available() else 32

# for mixed precision training, uncomment the line below (precision should be set to 16 and amp_level to 'O1'):
# config.trainer.amp_level = 'O1'

# remove distributed training flags
config.trainer.strategy = None

# setup max number of steps to reduce training time for demonstration purposes of this tutorial
config.trainer.max_steps = 32

config.exp_manager.exp_dir = WORK_DIR
trainer = pl.Trainer(**config.trainer)

## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:

In [None]:
exp_manager(trainer, config.get("exp_manager", None))

In [None]:
exp_dir = config.exp_manager.exp_dir

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

Before initializing the model, we might want to modify some of the model configs. For example, we might want to modify the pretrained BERT model:

In [None]:
# get the list of supported BERT-like models; for the complete list of Hugging Face models, see https://huggingface.co/models
print(nemo_nlp.modules.get_pretrained_lm_models_list(include_external=False))

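# specify the BERT-like model you want to use
PRETRAINED_BERT_MODEL = "bert-base-multilingual-uncased"
# add the model specified above to the config (otherwise the config's default language model is used)
config.model.language_model.pretrained_model_name = PRETRAINED_BERT_MODEL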

Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders will be prepared for training and evaluation.
The pretrained BERT model will also be downloaded. Note that this can take up to a few minutes, depending on the size of the chosen BERT model.

In [None]:
model = nemo_nlp.models.TokenClassificationModel(cfg=config.model, trainer=trainer)

## Monitoring training progress
Optionally, you can create a Tensorboard visualization to monitor training progress.

In [None]:
try:
  from google import colab
  COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
  COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
  %load_ext tensorboard
  %tensorboard --logdir {exp_dir}
else:
  print("To use tensorboard, please use this notebook in a Google Colab environment.")

Let's see how the model performs before fine-tuning:

In [None]:
# define the list of queries for inference
queries = [
    'حمید طاهایی افزود : برای اجرای این طرحها 0 میلیارد و 0 میلیون ریال اعتبار هزینه شده است . ',
    'دکتر اصغری دبیر چهارمین همایش انجمن زمین‌شناسی ایران در این زمینه گفت : از مجموع چهار صد مقاله رسیده به دبیرخانه همایش ، يك صد و هشتاد مقاله ظرف مدت دو روز در هشت سالن همایش برگزار شد . '
]
results = model.add_predictions(queries)

for query, result in zip(queries, results):
    print()
    print(f'Query : {query}')
    print(f'Result: {result.strip()}\n')

In [None]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

In [None]:
# start model training
trainer.fit(model)

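After training is complete, a `.nemo` file containing the model's checkpoints and all associated artifacts can be found under `token_classification_model/DATE_TIME` inside the experiment directory (`exp_dir`, which was set to `WORK_DIR` above; when `exp_dir` is not set, NeMo uses `nemo_experiments`).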

Let's see how the predictions improve after fine-tuning:

In [None]:
results = model.add_predictions(queries)

for query, result in zip(queries, results):
    print()
    print(f'Query : {query}')
    print(f'Result: {result.strip()}\n')

After training for 100 epochs, with the default config and NUM_SAMPLES = -1 (i.e. all data is used), your model performance should look similar to this: 
```
    label                                                precision    recall       f1           support
    O (label_id: 0)                                         99.09      99.19      99.14      32867
    B-event (label_id: 1)                                   67.74      70.00      68.85         90
    B-fac (label_id: 2)                                     70.89      73.68      72.26         76
    B-loc (label_id: 3)                                     87.45      82.70      85.01        497
    B-org (label_id: 4)                                     81.88      87.06      84.39        649
    B-pers (label_id: 5)                                    94.93      93.36      94.14        542
    B-pro (label_id: 6)                                     79.31      70.41      74.59         98
    I-event (label_id: 7)                                   87.38      74.72      80.55        352
    I-fac (label_id: 8)                                     83.08      77.14      80.00        140
    I-loc (label_id: 9)                                     77.78      73.39      75.52        124
    I-org (label_id: 10)                                    86.51      89.93      88.18        834
    I-pers (label_id: 11)                                   95.30      94.35      94.82        301
    I-pro (label_id: 12)                                    82.86      86.57      84.67         67
    -------------------
    micro avg                                               97.78      97.78      97.78      36637
    macro avg                                               84.17      82.50      83.24      36637
    weighted avg                                            97.78      97.78      97.77      36637
```
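To help read the three averages in the report above, here is a minimal sketch of the standard definitions: the micro average pools the counts of all labels, the macro average is the unweighted mean over labels, and the weighted average weights each label by its support. It assumes the report follows these standard definitions; the function names and the toy counts are made up for illustration and are not taken from the run above.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def summarize(counts):
    """counts maps label -> (tp, fp, fn); returns per-label scores and the three averages."""
    per_label = {label: prf(*c) for label, c in counts.items()}
    support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
    total = sum(support.values())
    # micro: pool the counts of all labels, then compute the scores once
    micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # macro: plain mean of the per-label scores
    macro = tuple(sum(s[i] for s in per_label.values()) / len(per_label) for i in range(3))
    # weighted: mean of the per-label scores, weighted by support
    weighted = tuple(
        sum(per_label[label][i] * support[label] for label in per_label) / total
        for i in range(3)
    )
    return per_label, micro, macro, weighted


# toy counts (hypothetical): every misclassified token is a false positive for one
# label and a false negative for another, so the totals of fp and fn match
counts = {"O": (90, 4, 2), "B-pers": (5, 1, 3), "I-pers": (3, 1, 1)}
per_label, micro, macro, weighted = summarize(counts)
print("micro   ", micro)
print("macro   ", macro)
print("weighted", weighted)
```

Because every token receives exactly one label, micro precision, recall and F1 coincide, which matches the identical micro avg values in the report. The macro average is lower because rare labels count as much as the dominant `O` label.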



**References**

1. Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

2. Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, and Massimo Piccardi, "PersoNER: Persian Named-Entity Recognition," The 26th International Conference on Computational Linguistics (COLING 2016), pages 3381–3389, Osaka, Japan, 2016.

3. Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Piccardi, "BiLSTM-CRF for Persian Named-Entity Recognition; ArmanPersoNERCorpus: the First Entity-Annotated Persian Dataset," The 11th Edition of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan, 7-12 May 2018, ISLRN 399-379-640-828-6, ISLRN 921-509-141-609-6.