## GatorTron-OG

* [GatorTron-OG](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og)
* **NOTE**: The output hidden size of GatorTron-OG is 1024.

### Download and unzip GatorTron-OG
* Model-related files are stored in `models/gatortron_og_1`.

In [None]:
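!mkdir -p models/gatortron_og_1
# Use the %cd magic: `!cd` runs in a subshell and does not change the notebook's working directory
%cd models/gatortron_og_1
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/clara/gatortron_og/versions/1/zip -O gatortron_og_1.zip
!unzip gatortron_og_1.zip
%cd ../..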
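
After unzipping, it can help to confirm that the files referenced later in this notebook are present. The following is a minimal sketch: the helper name `missing_model_files` is hypothetical, and the file list is taken only from the paths used elsewhere in this notebook, so the archive may well contain other files too.

```python
from pathlib import Path

# File names referenced later in this notebook (assumed to sit at the
# top level of the unzipped folder).
EXPECTED_FILES = ("MegatronBERT.nemo", "hparam.yaml", "vocab.txt")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("models/gatortron_og_1")
    print("Missing files:", missing if missing else "none")
```

An empty list means all three files were found; anything else names the files to look for before continuing.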

### Modify the `models/gatortron_og_1/hparam.yaml` file
* Change `vocab_file` to the **absolute path** of `models/gatortron_og_1/vocab.txt`.
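
The edit can also be scripted. The sketch below is one possible approach, not the only one: it assumes `vocab_file:` appears as a key on its own line in `hparam.yaml` (the load log reports the config key `tokenizer.vocab_file`), and it uses a plain text substitution rather than a YAML parser so that it needs only the standard library. The helper name `set_vocab_file` is hypothetical.

```python
import re
from pathlib import Path


def set_vocab_file(yaml_text, vocab_path):
    """Point every `vocab_file:` line at the absolute form of vocab_path."""
    absolute = str(Path(vocab_path).resolve())
    # Capture the indentation and key, then replace the rest of the line.
    pattern = re.compile(r"^([ \t]*vocab_file:[ \t]*).*$", flags=re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + absolute, yaml_text)


if __name__ == "__main__":
    hparam_path = Path("models/gatortron_og_1/hparam.yaml")
    if hparam_path.exists():
        updated = set_vocab_file(hparam_path.read_text(), "models/gatortron_og_1/vocab.txt")
        hparam_path.write_text(updated)
```

Keeping a backup copy of `hparam.yaml` before overwriting it is a sensible precaution, since a text substitution cannot validate the file's structure.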

## Use NeMo to Initialize the Model

* [Get lm model](https://github.com/NVIDIA/NeMo/blob/1274c10b15374c137a2f64d0e5f8483cd1246440/nemo/collections/nlp/modules/common/lm_utils.py#L52)

### Install APEX
* Installation takes a long time.
* **NOTE**: Restart Jupyter Notebook after installation.

In [None]:
!git clone https://github.com/ericharper/apex.git
%cd apex
!git checkout nm_v1.11.0
%pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
%cd ..

### Install NeMo

* Python 3.8 or above
* PyTorch 1.10.0 or above
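
The two requirements above can be checked before installing. This is a minimal sketch using only the standard library; the helpers `parse_version` and `meets_minimum` are hypothetical names, and the parser deliberately ignores suffixes such as `+cu113` or `rc1`.

```python
import re
import sys
from importlib import metadata


def parse_version(version):
    """Turn a string such as '1.12.1+cu113' into a tuple of ints, (1, 12, 1)."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(version, minimum):
    """True when the parsed version is at least the minimum tuple."""
    parsed = parse_version(version)
    width = max(len(parsed), len(minimum))
    # Pad with zeros so that '1.10' compares equal to (1, 10, 0).
    padded = parsed + (0,) * (width - len(parsed))
    required = tuple(minimum) + (0,) * (width - len(minimum))
    return padded >= required


if __name__ == "__main__":
    print("Python OK:", sys.version_info[:2] >= (3, 8))
    try:
        print("PyTorch OK:", meets_minimum(metadata.version("torch"), (1, 10, 0)))
    except metadata.PackageNotFoundError:
        print("PyTorch is not installed")
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would rank `1.9` above `1.10`.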

In [None]:
! python -m pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[nlp]

### Load Model

In [1]:
import torch
from nemo.collections.nlp.models.language_modeling.megatron_bert_model import MegatronBertModel
from pytorch_lightning import Trainer


class Identity(torch.nn.Module):
    """Pass-through module that returns its first argument and ignores the rest."""

    def __init__(self):
        super().__init__()

    def forward(self, x, *args):
        return x


trainer = Trainer(accelerator='gpu', devices=1)
model = MegatronBertModel.restore_from(
    restore_path="/home/chchen/python_work/lung-cancer/models/gatortron_og_1/MegatronBERT.nemo", # change to your path
    override_config_path="/home/chchen/python_work/lung-cancer/models/gatortron_og_1/hparam.yaml", # change to your path
    trainer=trainer
)

# Remove the heads that are only relevant for pretraining
model.model.lm_head = Identity()
model.model.binary_head = Identity()
model.model.language_model.pooler = Identity()

2022-10-08 11:04:22.859288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-oq2z2rki because the default path (/home/chchen/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[NeMo W 2022-10-08 11:04:36 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-10-08 11:04:38 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use 

[NeMo I 2022-10-08 11:04:52 megatron_init:204] Rank 0 has data parallel group: [0]
[NeMo I 2022-10-08 11:04:52 megatron_init:207] All data parallel group ranks: [[0]]
[NeMo I 2022-10-08 11:04:52 megatron_init:208] Ranks 0 has data parallel rank: 0
[NeMo I 2022-10-08 11:04:52 megatron_init:216] Rank 0 has model parallel group: [0]
[NeMo I 2022-10-08 11:04:52 megatron_init:217] All model parallel group ranks: [[0]]
[NeMo I 2022-10-08 11:04:52 megatron_init:227] Rank 0 has tensor model parallel group: [0]
[NeMo I 2022-10-08 11:04:52 megatron_init:231] All tensor model parallel group ranks: [[0]]
[NeMo I 2022-10-08 11:04:52 megatron_init:232] Rank 0 has tensor model parallel rank: 0
[NeMo I 2022-10-08 11:04:52 megatron_init:246] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2022-10-08 11:04:52 megatron_init:258] Rank 0 has embedding group: [0]
[NeMo I 2022-10-08 11:04:52 megatron_init:264] All pipeline model parallel group ranks: [[0]]
[NeMo I 2022-10-08 11:04:52 megatron_init:265]

[NeMo W 2022-10-08 11:04:53 modelPT:217] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.


[NeMo I 2022-10-08 11:04:53 tokenizer_utils:204] Getting Megatron tokenizer for pretrained model name: megatron-bert-345m-cased, custom vocab file: /home/chchen/python_work/lung-cancer/models/gatortron_og_1/vocab.txt, and merges file: None
[NeMo I 2022-10-08 11:04:53 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-large-cased, vocab_file: /home/chchen/python_work/lung-cancer/models/gatortron_og_1/vocab.txt, merges_files: None, special_tokens_dict: {}, and use_fast: False


Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.


[NeMo I 2022-10-08 11:05:00 megatron_base_model:186] Padded vocab_size: 50176, original vocab_size: 50101, dummy tokens: 75.
[NeMo I 2022-10-08 11:05:04 save_restore_connector:243] Model MegatronBertModel was successfully restored from /home/chchen/python_work/lung-cancer/models/gatortron_og_1/MegatronBERT.nemo.
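
The padded vocabulary size in the log above is consistent with rounding the original size up to the next multiple of 128. The sketch below reproduces that arithmetic; the default of 128 and the scaling by tensor-parallel size are assumptions based on the usual Megatron-style convention, not values read from this checkpoint's configuration, and `pad_vocab_size` is a hypothetical helper.

```python
def pad_vocab_size(orig_vocab_size, divisible_by=128, tensor_parallel_size=1):
    """Round orig_vocab_size up to the next multiple of divisible_by * tensor_parallel_size."""
    multiple = divisible_by * tensor_parallel_size
    # Ceiling division using integer arithmetic only.
    return -(-orig_vocab_size // multiple) * multiple


padded = pad_vocab_size(50101)
print(padded, padded - 50101)  # 50176 75, matching the logged sizes
```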


### Forward

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    "bert-large-cased", 
    vocab_file="./models/gatortron_og_1/vocab.txt", 
    eos_token="[SEP]", 
    bos_token="[CLS]"
)

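inputs = ["Lung cancer", "pt report"]
# padding=True keeps the batch rectangular when inputs tokenize to different lengths
tokenized_inputs = tokenizer(inputs, padding=True, return_tensors="pt")
tokenized_inputs = tokenized_inputs.to("cuda")
outputs = model.half()(**tokenized_inputs)
# Hidden state of the first ([CLS]) token for each input
cls_hidden_states = outputs[0][:, 0, :]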
print(cls_hidden_states.shape)

torch.Size([2, 1024])


In [3]:
cls_hidden_states

tensor([[ 0.0509,  0.0679,  0.3252,  ...,  0.7988, -0.3411, -0.1211],
        [ 0.2325, -0.0684, -0.2262,  ...,  0.1630, -0.4092,  0.1421]],
       device='cuda:0', dtype=torch.float16, grad_fn=<SliceBackward0>)

## From NeMo to Hugging Face

* [convert megatron bert checkpoint](https://github.com/huggingface/transformers/blob/main/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py)
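* The converted model's outputs differ from the NeMo outputs, and no resource found so far explains or resolves the discrepancy. Note that the cell below prints `pooler_output`, while the NeMo cell above prints the raw `[CLS]` hidden state, so the two printed tensors are not directly comparable.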

In [None]:
! python convert_megatron_bert_checkpoint.py \
    --path_to_checkpoint ./models/gatortron_og_1/MegatronBERT.pt \
    --config_file ./models/gatortron_og_1/config.json

In [None]:
from transformers import BertTokenizer, MegatronBertModel

tokenizer = BertTokenizer.from_pretrained("./models/gatortron_og_1/")
model = MegatronBertModel.from_pretrained("./models/gatortron_og_1/")

inputs = ["Lung cancer", "pt report"]
tokenized_inputs = tokenizer(inputs, padding=True, return_tensors="pt")
outputs = model(**tokenized_inputs)
pooler_output = outputs.pooler_output
print(pooler_output.shape)


In [16]:
pooler_output

tensor([[ 0.0815,  0.0063, -0.0151,  ..., -0.1110,  0.1497,  0.1081],
        [ 0.1350,  0.0614, -0.1357,  ..., -0.1256,  0.2016,  0.1923]],
       grad_fn=<TanhBackward0>)
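
To put a number on the discrepancy, one option is to compare the same quantity from both models (for example the `[CLS]` row of the last hidden state) with simple metrics. The sketch below is a standard-library illustration of that idea; `max_abs_diff` and `cosine_similarity` are hypothetical helper names, and the vectors are toy values, not outputs of either model.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only; these are not values produced by either model.
nemo_vec = [0.5, -0.25, 1.0]
hf_vec = [0.4, -0.2, 1.1]
print("max abs diff:", max_abs_diff(nemo_vec, hf_vec))
print("cosine similarity:", cosine_similarity(nemo_vec, hf_vec))
```

A row of a PyTorch tensor can be turned into a plain list with `tensor.float().tolist()`. Because the NeMo model above runs in float16 and the Hugging Face model in float32, small non-zero differences would be expected even if the conversion were exact.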