In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
# If you're not using Colab, you might need to upgrade jupyter notebook to avoid the following error:
# 'ImportError: IProgress not found. Please update jupyter and ipywidgets.'

! pip install ipywidgets
! jupyter nbextension enable --py widgetsnbextension

# Please restart the kernel after running this cell

In [None]:
import os
import wget
from nemo.collections import nlp as nemo_nlp
from omegaconf import OmegaConf

# Language models

The field of Natural Language Processing (NLP) has made a huge leap in recent years thanks to transfer learning enabled by pretrained language models.

[BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), [Megatron-LM](https://arxiv.org/abs/1909.08053), and many other proposed language models achieve state-of-the-art results on many NLP tasks, such as:
* question answering
* sentiment analysis
* named entity recognition and many others.

In NeMo, most NLP models consist of a pretrained language model followed by a Token Classification layer, a Sequence Classification layer, or a combination of both. By changing the language model, you can improve the performance of your final model on the specific downstream task you are solving.
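To make the difference between the two kinds of classification layer concrete, here is a toy sketch in plain Python. It is not NeMo's implementation: the function names, the tiny weights, and the choice to read the first token's hidden state for sequence classification (a common convention) are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Apply y = W x + b, where `weight` has one row per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def token_classification_head(hidden_states: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Score every token separately: one row of class scores per token."""
    return [linear(h, weight, bias) for h in hidden_states]


def sequence_classification_head(hidden_states: Matrix, weight: Matrix, bias: Vector) -> Vector:
    """Score the whole sequence once (assumption: use the first token's hidden state)."""
    return linear(hidden_states[0], weight, bias)


# Toy stand-in for a language model's output: 3 tokens, hidden size 2
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy head parameters: 2 classes, hidden size 2
weight = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 1.0]

token_scores = token_classification_head(hidden_states, weight, bias)
sequence_scores = sequence_classification_head(hidden_states, weight, bias)

print(len(token_scores), len(token_scores[0]))  # number of tokens, number of classes
print(len(sequence_scores))                     # number of classes
```

The token head returns one score vector per token (as needed for tasks like NER), while the sequence head returns a single score vector for the whole input (as needed for tasks like sentiment analysis). Both read the same language model output, which is why the language model can be swapped independently of the head.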

With NeMo, you can either pretrain a BERT model on your own data or use a pretrained language model from the [HuggingFace transformers](https://github.com/huggingface/transformers) or [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) libraries.

Note: Megatron BERT is not supported in NeMo 1.5.0. Please use [NeMo 1.4.0](https://github.com/NVIDIA/NeMo/tree/r1.4.0) for Megatron BERT support.

Let's take a look at the list of available pretrained language models. Note that the complete list of HuggingFace models can be found at [https://huggingface.co/models](https://huggingface.co/models):


In [None]:
nemo_nlp.modules.get_pretrained_lm_models_list()

NLP models for downstream tasks use the `get_lm_model` helper function, which makes it easy to switch from one language model in the list above to another:

In [None]:
# use any pretrained model name from the list above
pretrained_model_name = 'distilbert-base-uncased'
config = {"language_model": {"pretrained_model_name": pretrained_model_name}, "tokenizer": {}}
omega_conf = OmegaConf.create(config)
nemo_nlp.modules.get_lm_model(cfg=omega_conf)

All NeMo [NLP models](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp) have an associated config file. As an example, let's examine the config file for the Named Entity Recognition (NER) model (more details about the model and the NER task can be found [here](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/Token_Classification_Named_Entity_Recognition.ipynb)).

In [None]:
MODEL_CONFIG = "token_classification_config.yaml"

# download the model's configuration file 
if not os.path.exists(MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/conf/' + MODEL_CONFIG)
else:
    print('Config file already exists')

In [None]:
# this line will print the entire config of the model
config = OmegaConf.load(MODEL_CONFIG)
print(OmegaConf.to_yaml(config))

For this tutorial, we are interested in the `language_model` part of the Named Entity Recognition model config.

In [None]:
print(OmegaConf.to_yaml(config.model.language_model))

There might be slight differences from one model to another, but most of them have the following important parameters associated with the language model:
* `pretrained_model_name` - the name of a pretrained model from either the HuggingFace or Megatron-LM library, for example, `bert-base-uncased` or `megatron-bert-345m-uncased`
* `lm_checkpoint` - a path to the pretrained model checkpoint if, for example, you trained a BERT model with your data
* `config_file` - a path to the model configuration file
* `config` or `config_dict` - the model configuration dictionary
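As a rough mental model of the parameters listed above, the `language_model` section can be pictured as a small record with one required name and three optional fields. The sketch below uses only the Python standard library. The `LanguageModelSection` class and its default values are hypothetical, chosen for illustration, and are not part of NeMo.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class LanguageModelSection:
    """Hypothetical mirror of the `language_model` config section (not a NeMo class)."""

    pretrained_model_name: str = "bert-base-uncased"
    lm_checkpoint: Optional[str] = None      # path to a pretrained model checkpoint
    config_file: Optional[str] = None        # path to a model configuration file
    config: Optional[Dict[str, Any]] = None  # model configuration dictionary


# Only the model name is changed; the optional fields keep their defaults
section = LanguageModelSection(pretrained_model_name="roberta-base")

# List the optional fields that are still unset
unset = [name for name, value in asdict(section).items() if value is None]
print(unset)
```

In the common case only `pretrained_model_name` needs to be set, and the remaining fields matter when you bring your own checkpoint or configuration.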

To modify the default language model, specify the desired language model name with the `model.language_model.pretrained_model_name` argument, like this:

In [None]:
config.model.language_model.pretrained_model_name = 'roberta-base'

and then start the training as usual (please see [tutorials/nlp](https://github.com/NVIDIA/NeMo/tree/main/tutorials/nlp) for more details on training a particular model).

You can also provide a pretrained language model checkpoint and a configuration file if available.

Note that `pretrained_model_name` is used to set up both the language model and the tokenizer.

All the above holds for both HuggingFace and Megatron-LM pretrained language models. Let's separately examine some specifics of finetuning with Megatron-LM and HuggingFace models.

# Downstream tasks with Megatron and BioMegatron Language Models

[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) describes a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. More details can be found in the [Megatron-LM GitHub repo](https://github.com/NVIDIA/Megatron-LM).

Note: Megatron BERT is not supported in NeMo 1.5.0. Please use [NeMo 1.4.0](https://github.com/NVIDIA/NeMo/tree/r1.4.0) for Megatron BERT support.

To see the list of available Megatron-LM models in NeMo, run:

In [None]:
#nemo_nlp.modules.get_megatron_lm_models_list()

If you want to use one of the available Megatron-LM models, specify its name with the `model.language_model.pretrained_model_name` argument, for example:

In [None]:
#config.model.language_model.pretrained_model_name = 'megatron-bert-345m-uncased'

If you have a different checkpoint or a model configuration file, use these general Megatron-LM model names:
* `megatron-bert-uncased` or 
* `megatron-bert-cased` 

and provide the associated BERT config and checkpoint files, as follows:

`model.language_model.pretrained_model_name=megatron-bert-uncased \
model.language_model.lm_checkpoint=<PATH_TO_CHECKPOINT> \
model.language_model.config_file=<PATH_TO_CONFIG>`

or

`model.language_model.pretrained_model_name=megatron-bert-cased \
model.language_model.lm_checkpoint=<PATH_TO_CHECKPOINT> \
model.language_model.config_file=<PATH_TO_CONFIG>`
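If you keep such settings in a Python dictionary, the dotted `key=value` arguments shown above can be generated rather than typed by hand. The following is a minimal standard-library sketch. The helper name `to_overrides` is hypothetical (it is not a NeMo function), and the checkpoint and config paths are left as placeholders.

```python
from typing import Any, Dict, List


def to_overrides(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten a nested dict into dotted `key=value` strings."""
    overrides: List[str] = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


settings = {
    "model": {
        "language_model": {
            "pretrained_model_name": "megatron-bert-uncased",
            "lm_checkpoint": "<PATH_TO_CHECKPOINT>",  # placeholder
            "config_file": "<PATH_TO_CONFIG>",        # placeholder
        }
    }
}

# Join with backslash-newline so the result can be pasted after a training command
print(" \\\n".join(to_overrides(settings)))
```

Switching to the cased variant then only requires changing the `pretrained_model_name` value in the dictionary.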

The general Megatron-LM model names are used to download the correct vocabulary file needed to set up the model correctly. Note that data preprocessing and model training are done in NeMo. Megatron-LM has its own set of training arguments (including the tokenizer) that are ignored during finetuning in NeMo. Please see the downstream task [config files and training scripts](https://github.com/NVIDIA/NeMo/tree/main/examples/nlp) for all arguments supported by NeMo.

## Download pretrained model

With NeMo, the original and domain-specific Megatron-LM BERT models and model configuration files are downloaded automatically, but they can also be downloaded from the links below:

[Megatron-LM BERT Uncased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m/files?version=v0.1_uncased)

[Megatron-LM BERT Cased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m/files?version=v0.1_cased)

[BioMegatron-LM BERT Cased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345mcased](https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345mcased)

[BioMegatron-LM BERT Uncased 345M (~345M parameters): https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345muncased](https://ngc.nvidia.com/catalog/models/nvidia:biomegatron345muncased)

# Using any HuggingFace Pretrained Model

Currently, four HuggingFace language models have the most extensive support in [NeMo](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/nlp/modules/common/huggingface):

* BERT
* RoBERTa
* ALBERT
* DistilBERT

As mentioned before, just set `model.language_model.pretrained_model_name` to the desired model name in your config, and `get_lm_model()` will take care of the rest.

If you want to use another language model from [https://huggingface.co/models](https://huggingface.co/models), use the HuggingFace API directly in NeMo.
More details on model training can be found in the [tutorials](https://github.com/NVIDIA/NeMo/tree/main/tutorials).