<a href="https://www.nvidia.com/dli"> <img src="../images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 3.0 Build a Named Entity Recognizer
### (NVIDIA NeMo v1.0)

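In this notebook, you will build an NER (named entity recognition) application that finds disease names in medical disease abstracts. The model does not "search" for names from a list. Instead, it "recognizes" specific disease-related terms from the context of the language.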

**[3.1 Token Classification from the Command Line](#3.1-從命令列分類標記)**<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.1.1 Data Input](#3.1.1-資料輸入)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[3.1.1.1 IOB Tagging](#3.1.1.1-IOB-標記)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.1.2 Configuration File](#3.1.2-設定檔)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.1.3 Hydra-Enabled Python Script](#3.1.3-Hydra-支援的-Python-指令碼)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.1.4 Exercise: Train the Model](#3.1.4-練習：訓練模型)<br>
**[3.2 Domain-Specific Training](#3.2-特定領域訓練)**<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.2.1 Visualize the Results with TensorBoard](#3.2.1-使用-TensorBoard-將結果視覺化)<br>
**[3.3 Evaluation](#3.3-評估)**<br>
**[3.4 Inference](#3.4-推論)**<br>

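In the named entity recognition task, you will follow the same basic steps as in the text classification task to build, train, and test your project. This time, however, you will train the classifier on the *domain-specific* BioMegatron language model. BioMegatron is a [BERT](https://arxiv.org/abs/1810.04805)-like [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf) model pretrained on a large biomedical text corpus ([PubMed](https://pubmed.ncbi.nlm.nih.gov/) abstracts and the full-text commercial use collection). Because our disease dataset comes from the same biomedical domain, we can expect better performance than with a general-purpose language model.

Alternatives to BioMegatron exist, most notably [BioBERT](https://arxiv.org/abs/1901.08746). Compared to BioBERT, BioMegatron is a larger model and was pretrained on a larger text corpus.

Because we classify at the "token" level, we use a [token classification](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/token_classification.html) model for named entity recognition, which in this case means classifying disease-related terms.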

In [12]:
# Take a look at the data
NER3_DATA_DIR = '/dli/task/data/NCBI_ner-3'
print("*****\ntext_train.txt sample\n*****")
!head -n 3 $NER3_DATA_DIR/text_train.txt
print("\n*****\nlabels_train.txt sample\n*****")
!head -n 3 $NER3_DATA_DIR/labels_train.txt

*****
text_train.txt sample
*****
Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor . 
The adenomatous polyposis coli ( APC ) tumour - suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK - 3beta ) , axin / conductin and betacatenin . 
Complex formation induces the rapid degradation of betacatenin . 

*****
labels_train.txt sample
*****
O O O O O O O O B I I I O O 
O B I I I I I I O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 
O O O O O O O O O 
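
Each label line aligns one-to-one with the tokens of the matching text line: `B` marks the first token of a disease mention, `I` marks a continuation, and `O` marks everything else. As a minimal sketch (the helper `extract_entities` is hypothetical and not part of NeMo), the first sample pair above can be decoded like this:

```python
def extract_entities(text_line, label_line):
    """Group tokens tagged B/I into entity strings (IOB scheme)."""
    tokens = text_line.split()
    labels = label_line.split()
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must align one-to-one")

    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":                # a new entity starts here
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label == "I" and current:  # continue the open entity
            current.append(token)
        else:                           # "O" (or a stray "I") closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:                         # flush an entity that ends the sentence
        entities.append(" ".join(current))
    return entities


text = ("Identification of APC2 , a homologue of the adenomatous "
        "polyposis coli tumour suppressor .")
labels = "O O O O O O O O B I I I O O"
print(extract_entities(text, labels))
```

For this pair, the only mention recovered is the four-token span tagged `B I I I`.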


For now, there is no need to change the `exp_manager` default settings.

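<a id="3.1.4-練習：訓練模型"></a>
## 3.1.4 Exercise: Train the Model
Run the training script `token_classification_train.py`, in the same way as the similar experiment in the text classification notebook.

The new values to override are provided in the cell below. Add the command with the appropriate overrides and run the cell. If you get stuck, see the [solution](solutions/ex3.1.4.ipynb).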

In [11]:
%%time
# The training takes about 2 minutes to run
   
TOKEN_DIR = "/dli/task/nemo/examples/nlp/token_classification"

# set the values we want to override
PRETRAINED_MODEL_NAME = 'bert-base-cased'
DATA_DIR = '/dli/task/data/NCBI_ner-3'
MAX_SEQ_LENGTH = 64
BATCH_SIZE = 32
AMP_LEVEL = 'O1'
MAX_EPOCHS = 20

# Override the config values in the command line
!python $TOKEN_DIR/token_classification_train.py \
        model.language_model.pretrained_model_name=$PRETRAINED_MODEL_NAME \
        model.dataset.data_dir=$DATA_DIR \
        model.dataset.max_seq_length=$MAX_SEQ_LENGTH \
        model.train_ds.batch_size=$BATCH_SIZE \
        model.validation_ds.batch_size=$BATCH_SIZE \
        model.test_ds.batch_size=$BATCH_SIZE \
        trainer.amp_level=$AMP_LEVEL \
        trainer.max_epochs=$MAX_EPOCHS

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
[NeMo I 2022-04-29 11:26:09 exp_manager:216] Experiments will be logged at /dli/task/nemo_experiments/token_classification_model/2022-04-29_11-26-09
[NeMo I 2022-04-29 11:26:09 exp_manager:563] TensorboardLogger has been set up
[NeMo I 2022-04-29 11:26:09 token_classification_train:109] Config: pretrained_model: null
    trainer:
      gpus: 1
      num_nodes: 1
      max_epochs: 20
      max_steps: null
      accumulate_grad_batches: 1
      gradient_clip_val: 0.0
      amp_level: O1
      precision: 16
      accelerator: ddp
      checkpoint_callback: false
      logger: false
      log_every_n_steps: 1
      val_check_interval: 1.0
      resume_from_checkpoint: null
    exp_manager:
      exp_dir: null
      name: token_classification_model
      create_tensorboard_logger: true
      create_checkpoint_callback: true
    model:
      label_ids: null
      class_labels:
        class

How did it go? Your log should include output similar to the following:

```
    label                                                precision    recall       f1           support   
    O (label_id: 0)                                         99.34      99.35      99.34      21648
    B (label_id: 1)                                         85.86      89.21      87.50        769
    I (label_id: 2)                                         91.74      89.00      90.35       1073
    -------------------
    micro avg                                               98.54      98.54      98.54      23490
    macro avg                                               92.31      92.52      92.40      23490
    weighted avg                                            98.55      98.54      98.55      23490
    
Epoch 2: 100%|█| 199/199 [00:15<00:00, 12.45it/s, loss=0.0251, v_num=4-43, val_l
```
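
The averaged rows in this report follow from the per-label rows. As a rough sketch in plain Python (not NeMo's own reporting code), F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean across the three labels, and the weighted average scales each label by its support. The numbers are copied from the sample report above, so small rounding differences are possible:

```python
# Per-label (precision, recall, support) copied from the sample report above.
report = {
    "O": (99.34, 99.35, 21648),
    "B": (85.86, 89.21, 769),
    "I": (91.74, 89.00, 1073),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_label_f1 = {label: f1_score(p, r) for label, (p, r, _) in report.items()}

# Macro average: every label counts equally, regardless of support.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: each label contributes in proportion to its support.
total_support = sum(s for _, _, s in report.values())
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support

print({label: round(f1, 2) for label, f1 in per_label_f1.items()})
print(round(macro_precision, 2), round(macro_recall, 2), round(weighted_precision, 2))
```

Because the `O` label dominates the support, the weighted and micro averages sit near 98.5 even though `B` and `I` score between roughly 86 and 92. The macro average is therefore the more telling number for the disease labels.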

---
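<a id="3.2-特定領域訓練"></a>
# 3.2 Domain-Specific Training

Try another experiment, this time overriding `model.language_model.pretrained_model_name` with `biomegatron-bert-345m-cased`. This is a large model with 345 million parameters, so it takes longer to run.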
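
Each `key=value` argument on the command line names a path into the nested configuration, with dots separating the levels. The sketch below mimics that idea in plain Python. It is an illustration only, not how Hydra is implemented: `apply_override` is a hypothetical helper, the starting dictionary is a simplified stand-in for the real configuration, and values are kept as strings for brevity.

```python
def apply_override(config, override):
    """Set a nested dict entry from a 'dotted.key=value' string."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # descend, creating levels as needed
    node[leaf] = value
    return config


# Simplified stand-in for the training configuration (assumed shape).
config = {
    "model": {"language_model": {"pretrained_model_name": "bert-base-cased"}},
    "trainer": {"max_epochs": "20"},
}

apply_override(
    config,
    "model.language_model.pretrained_model_name=biomegatron-bert-345m-cased",
)
print(config["model"]["language_model"]["pretrained_model_name"])
```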

In [None]:
%%time
# The training takes about 5-6 minutes to run
   
TOKEN_DIR = "/dli/task/nemo/examples/nlp/token_classification"

# set the values we want to override
PRETRAINED_MODEL_NAME = 'biomegatron-bert-345m-cased'
DATA_DIR = '/dli/task/data/NCBI_ner-3'
MAX_SEQ_LENGTH = 64
BATCH_SIZE = 32
AMP_LEVEL = 'O1'
MAX_EPOCHS = 20

# Override the config values in the command line
!python $TOKEN_DIR/token_classification_train.py \
        model.language_model.pretrained_model_name=$PRETRAINED_MODEL_NAME \
        model.dataset.data_dir=$DATA_DIR \
        model.dataset.max_seq_length=$MAX_SEQ_LENGTH \
        model.train_ds.batch_size=$BATCH_SIZE \
        model.validation_ds.batch_size=$BATCH_SIZE \
        model.test_ds.batch_size=$BATCH_SIZE \
        trainer.amp_level=$AMP_LEVEL \
        trainer.max_epochs=$MAX_EPOCHS

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
[NeMo I 2022-04-29 12:12:21 exp_manager:216] Experiments will be logged at /dli/task/nemo_experiments/token_classification_model/2022-04-29_12-12-21
[NeMo I 2022-04-29 12:12:21 exp_manager:563] TensorboardLogger has been set up
[NeMo I 2022-04-29 12:12:21 token_classification_train:109] Config: pretrained_model: null
    trainer:
      gpus: 1
      num_nodes: 1
      max_epochs: 30
      max_steps: null
      accumulate_grad_batches: 1
      gradient_clip_val: 0.0
      amp_level: O1
      precision: 16
      accelerator: ddp
      checkpoint_callback: false
      logger: false
      log_every_n_steps: 1
      val_check_interval: 1.0
      resume_from_checkpoint: null
    exp_manager:
      exp_dir: null
      name: token_classification_model
      create_tensorboard_logger: true
      create_checkpoint_callback: true
    model:
      label_ids: null
      class_labels:
        class

In [None]:
%%js
const href = window.location.hostname +'/tensorboard/';
let a = document.createElement('a');
let link = document.createTextNode('Open Tensorboard!');
a.appendChild(link);
a.href = "http://" + href;
a.style.color = "navy"
a.target = "_blank"
element.append(a);

---
<a id="3.3-評估"></a>
# 3.3 Evaluation

In [None]:
# Restart the kernel
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

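To evaluate the model on the test set, you need to specify the location of the trained `.nemo` model. Each experiment saves its results in a timestamped directory under `nemo_experiments`. Drilling down, you will find a `checkpoints` folder that contains the final `token_classification_model.nemo`. The next cell uses some Python logic to collect the list of models and pick the latest one.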

In [None]:
import glob

nemo_model_paths = glob.glob('nemo_experiments/token_classification_model/*/checkpoints/*.nemo')

# Sort newest first
nemo_model_paths.sort(reverse=True)
print("The latest model is \n{}".format(nemo_model_paths[0]))
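
The reverse sort works because the experiment directories are named with `YYYY-MM-DD_HH-MM-SS` timestamps, which sort lexicographically in chronological order. Here is a small sketch using the two run directories that appear in the training logs above. The `.nemo` file name follows the description in this section, and `latest_model` is a hypothetical helper that also guards against an empty list:

```python
def latest_model(paths):
    """Return the path with the most recent timestamp directory, or None."""
    if not paths:
        return None  # no checkpoint found, e.g. training has not finished
    return max(paths)  # same result as sorting in reverse and taking [0]


paths = [
    "nemo_experiments/token_classification_model/2022-04-29_11-26-09/"
    "checkpoints/token_classification_model.nemo",
    "nemo_experiments/token_classification_model/2022-04-29_12-12-21/"
    "checkpoints/token_classification_model.nemo",
]
print(latest_model(paths))
```

The comparison is only meaningful when all paths share the same prefix and timestamp format, as they do for a single experiment name.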

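You can run the evaluation on the test set in either of the following ways:
1. Run `token_classification_evaluate.py` with the same overrides, plus an override for `pretrained_model`, which must point to a file in `.nemo` format.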

```text
   !python $TOKEN_DIR/token_classification_evaluate.py \
        model.dataset.data_dir=$DATA_DIR \
        model.dataset.max_seq_length=$MAX_SEQ_LENGTH \
        model.train_ds.batch_size=$BATCH_SIZE \
        model.validation_ds.batch_size=$BATCH_SIZE \
        model.test_ds.batch_size=$BATCH_SIZE \
        model.language_model.pretrained_model_name=$PRETRAINED_MODEL_NAME \
        pretrained_model=$LATEST_MODEL
```
        
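2. Instantiate the model by restoring it from the trained checkpoint, then run the NeMo method that evaluates the test set.<br>
   We will walk through this approach step by step.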

In [None]:
# Instantiate the model by restoring from the .nemo checkpoint
from nemo.collections import nlp as nemo_nlp

LATEST_MODEL = nemo_model_paths[0]
model = nemo_nlp.models.TokenClassificationModel.restore_from(LATEST_MODEL)

In [None]:
import os.path

DATA_DIR = '/dli/task/data/NCBI_ner-3'
OUTPUT_DIR = '/dli/task/nemo_experiments/token_classification_model/logs'
model.evaluate_from_file(
    text_file=os.path.join(DATA_DIR, 'text_test.txt'),
    labels_file=os.path.join(DATA_DIR, 'labels_test.txt'),
    output_dir=OUTPUT_DIR,
    add_confusion_matrix=True,
    normalize_confusion_matrix=True,
    batch_size=1
)
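
With `normalize_confusion_matrix=True`, the confusion matrix is reported as proportions rather than raw counts. The sketch below shows row-wise normalization on made-up counts, assuming each row is a true label and each column a predicted label. Both the counts and the row-wise convention are assumptions for illustration and are not taken from NeMo's output.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so every row adds up to 1.0."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Leave an all-zero row as zeros to avoid dividing by zero.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Hypothetical counts: rows are true labels (O, B, I), columns are predictions.
counts = [
    [90, 6, 4],
    [10, 80, 10],
    [5, 15, 80],
]
for label, row in zip("OBI", normalize_rows(counts)):
    print(label, [round(value, 2) for value in row])
```

Under this convention, the diagonal entry of each row is the recall for that label, which makes the matrix easier to read when one label has far more examples than the others.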

---
<a id="3.4-推論"></a>
# 3.4 Inference
To run inference on a list of queries, use the `add_predictions` method of the same loaded model.

In [None]:
queries = ["Clustering of missense mutations in the ataxia - telangiectasia gene in a sporadic T - cell leukaemia . ",
    "Ataxia - telangiectasia ( A - T ) is a recessive multi - system disorder caused by mutations in the ATM gene at 11q22 - q23 ( ref . 3 ) . ",
    "The risk of cancer , especially lymphoid neoplasias , is substantially elevated in A - T patients and has long been associated with chromosomal instability . ",
    "By analysing tumour DNA from patients with sporadic T - cell prolymphocytic leukaemia ( T - PLL ) , a rare clonal malignancy with similarities to a mature T - cell leukaemia seen in A - T , we demonstrate a high frequency of ATM mutations in T - PLL . ",
    "In marked contrast to the ATM mutation pattern in A - T , the most frequent nucleotide changes in this leukaemia were missense mutations . ",
    "These clustered in the region corresponding to the kinase domain , which is highly conserved in ATM - related proteins in mouse , yeast and Drosophila . ",
    "The resulting amino - acid substitutions are predicted to interfere with ATP binding or substrate recognition . ",
    "Two of seventeen mutated T - PLL samples had a previously reported A - T allele . ",
    "In contrast , no mutations were detected in the p53 gene , suggesting that this tumour suppressor is not frequently altered in this leukaemia . ",
    "Occasional missense mutations in ATM were also found in tumour DNA from patients with B - cell non - Hodgkins lymphomas ( B - NHL ) and a B - NHL cell line . "
          ]

In [None]:
results = model.add_predictions(queries, output_file='predictions.txt')

In [None]:
!cat predictions.txt