# Pyramid: A Layered Model for Nested Named Entity Recognition - Custom Training

This notebook provides a blueprint to train a version of [Pyramid: A Layered Model for Nested Named Entity Recognition](https://www.aclweb.org/anthology/2020.acl-main.525.pdf) on your own data.

**Training on a GPU is recommended.**

## Downloading Code

In [None]:
!git clone https://github.com/federico-giannoni/pyramid-nested-ner.git
!mv pyramid-nested-ner/* . && rm -rf pyramid-nested-ner # move to root

In [None]:
!pip install flair seqeval > /dev/null  # hide progress output, keep errors visible

In [None]:
from pyramid_nested_ner.model import PyramidNer
from pyramid_nested_ner.data import DataPoint, Entity
from pyramid_nested_ner.modules.word_embeddings.transformer_embeddings import TransformerWordEmbeddings
from pyramid_nested_ner.modules.word_embeddings.pretrained_embeddings import PretrainedWordEmbeddings
from pyramid_nested_ner.data.dataset import PyramidNerDataset
from pyramid_nested_ner.utils.data import rasa_data_reader
from pyramid_nested_ner.training.trainer import PyramidNerTrainer
from pyramid_nested_ner.training.optim import get_default_sgd_optim
from copy import deepcopy

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import torch
import json

In [None]:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

## Experiments Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# TODO: set these paths to point to your data

train_data_path = None  # e.g. "/content/drive/My Drive/your_train_data_path"
test_data_path = None   # e.g. "/content/drive/My Drive/your_test_data_path"
dev_data_path  = None   # e.g. "/content/drive/My Drive/your_dev_data_path"

In [None]:
def your_own_data_generator(path):
  """
  Implement this function to yield DataPoint objects
  representing your data. The class definition can be found at: 
  https://github.com/federico-giannoni/pyramid-nested-ner/blob/main/pyramid_nested_ner/data/__init__.py
  """
  # if your data is in rasa json format, you can just uncomment the line below:
  # yield from rasa_data_reader(path)
  pass
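
As an illustration, the sketch below reads a hypothetical JSON Lines file with one object per line, holding a `text` string and a list of `entities` given as `label`, `start` and `stop` character offsets. This layout and the helper name `read_jsonl_spans` are assumptions, not part of the library. The helper yields plain tuples; inside `your_own_data_generator` each tuple would be wrapped in the `DataPoint` and `Entity` classes, whose constructor arguments should be checked against the class definition linked above.

```python
import json
import os
import tempfile


def read_jsonl_spans(path):
    """Yield (text, spans) pairs from a JSON Lines file.

    Hypothetical layout, one object per line:
    {"text": "...", "entities": [{"label": "...", "start": 0, "stop": 5}]}
    Offsets are character offsets with an exclusive stop.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            text = record["text"]
            spans = []
            for ent in record.get("entities", []):
                start, stop = ent["start"], ent["stop"]
                # Reject spans that are empty or fall outside the text.
                if not 0 <= start < stop <= len(text):
                    raise ValueError(f"bad span {start}:{stop} in {text!r}")
                spans.append((ent["label"], start, stop, text[start:stop]))
            yield text, spans


# Tiny demo with two nested entities.
sample = {
    "text": "Bank of England governor",
    "entities": [
        {"label": "ORG", "start": 0, "stop": 15},
        {"label": "LOC", "start": 8, "stop": 15},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(sample) + "\n")
    demo_rows = list(read_jsonl_spans(demo_path))
```

Validating offsets while reading tends to surface annotation errors early, before they turn into confusing label mismatches during training.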

In [None]:
pyramid_max_depth = 2  # keep this low if you plan to do inference on CPU (paper uses 15)
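
The cells that follow refer to `lexicon`, and the model cell later refers to `train_entities`, but neither is defined in this notebook. Below is a minimal sketch of how they could be built, assuming `lexicon` is a list of the word strings seen in training and `train_entities` a list of entity label names. Both are assumptions, so check what `PyramidNerDataset` and `PyramidNer` actually expect. The `(text, spans)` input format, the whitespace tokenization and the helper name `build_lexicons` are illustrative only.

```python
def build_lexicons(samples):
    """Collect the word vocabulary and entity labels from (text, spans) pairs.

    Each span is a (label, start, stop) tuple. Whitespace tokenization is an
    assumption and should match the tokenizer the dataset actually uses.
    """
    words, labels = set(), set()
    for text, spans in samples:
        words.update(text.split())
        labels.update(label for label, _start, _stop in spans)
    # Sort for a reproducible ordering across runs.
    return sorted(words), sorted(labels)


demo_samples = [
    ("Bank of England governor", [("ORG", 0, 15), ("LOC", 8, 15)]),
    ("governor of Texas", [("LOC", 12, 17)]),
]
demo_lexicon, demo_entities = build_lexicons(demo_samples)
```

In the notebook, the two return values would be computed from the training split only and assigned to `lexicon` and `train_entities`.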

In [None]:
train_data = PyramidNerDataset(
  your_own_data_generator(train_data_path), 
  pyramid_max_depth=pyramid_max_depth,
  token_lexicon=lexicon,
  custom_tokenizer=None, 
  char_vectorizer=True,
).get_dataloader(
    shuffle=True,
    batch_size=64,
    device=DEVICE, 
    bucketing=True
)

test_data = PyramidNerDataset(
  your_own_data_generator(test_data_path), 
  pyramid_max_depth=pyramid_max_depth,
  token_lexicon=lexicon,
  custom_tokenizer=None, 
  char_vectorizer=True,
).get_dataloader(
    shuffle=True, 
    batch_size=16,
    device=DEVICE, 
    bucketing=True
)

dev_data = PyramidNerDataset(
  your_own_data_generator(dev_data_path), 
  pyramid_max_depth=pyramid_max_depth,
  token_lexicon=lexicon,
  custom_tokenizer=None, 
  char_vectorizer=True,
).get_dataloader(
    shuffle=True, 
    batch_size=16,
    device=DEVICE, 
    bucketing=True
)

## Training

For the `language_model` parameter, you can use any Hugging Face transformer by specifying its name. The full list of names is available [here](https://huggingface.co/transformers/pretrained_models.html).

If you plan on using an uncased model, you should also pass `language_model_casing=False` to the `PyramidNer` constructor.

**Note that using word embeddings from pre-trained language models increases training time (and inference time) by a factor of 10.** During training, embeddings are cached in the first epoch so that later epochs run faster, but this cannot be done during inference. For this reason, you should only use the `language_model` parameter if you plan to use the model for research purposes.
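
To illustrate why caching helps during training but not at inference, here is a minimal sketch. It is not the library's implementation: `slow_embed` is a stand-in for an expensive language-model forward pass and `CachedEmbedder` is a hypothetical wrapper. Training revisits the same sentences every epoch, so from the second epoch on every lookup is a cache hit, whereas inference sentences are typically new and always miss.

```python
class CachedEmbedder:
    """Memoize an expensive per-sentence embedding function."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0

    def __call__(self, sentence):
        if sentence not in self.cache:
            self.misses += 1  # the expensive call happens only here
            self.cache[sentence] = self.embed_fn(sentence)
        return self.cache[sentence]


def slow_embed(sentence):
    # Stand-in for a transformer forward pass: one fake vector per token.
    return [[float(len(token))] for token in sentence.split()]


embedder = CachedEmbedder(slow_embed)
train_sentences = ["the cat sat", "nested entities are fun"]

for _epoch in range(3):
    for sentence in train_sentences:
        embedder(sentence)
misses_after_training = embedder.misses  # only the first epoch misses

embedder("a sentence never seen in training")  # inference input: a miss
misses_after_inference = embedder.misses
```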

For the `word_embeddings` parameter, you can either provide your own `torch.nn.Embedding` module, or use the names of any of the `WordEmbeddings` from **Flair** that you can find [here](https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py#L121).

In [None]:
pyramid_ner = PyramidNer(
  word_lexicon=lexicon,
  entities_lexicon=train_entities,
  word_embeddings=['en-glove', 'en-crawl'],  # 100-dim glove + fasttext
  language_model=None,
  char_embeddings_dim=60,
  encoder_hidden_size=100,
  encoder_output_size=200,
  decoder_hidden_size=100,
  inverse_pyramid=False,
  custom_tokenizer=None,
  pyramid_max_depth=pyramid_max_depth,
  decoder_dropout=0.2,
  encoder_dropout=0.2,
  device=DEVICE,
)

trainer = PyramidNerTrainer(pyramid_ner)

In [None]:
# default optimizer and LR scheduler as described in the paper - feel free to change them.
optimizer, scheduler = get_default_sgd_optim(pyramid_ner.nnet.parameters()) 

In [None]:
ner_model, report = trainer.train(
  train_data, 
  optimizer=optimizer, 
  scheduler=scheduler, 
  restore_weights_on='loss',
  epochs=60, 
  dev_data=dev_data, 
  patience=np.inf, 
  grad_clip=5.0
)
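
The `patience` and `restore_weights_on` arguments suggest the usual early-stopping pattern. The sketch below shows that general technique under that assumption and is not the trainer's actual code: it tracks the best dev loss, remembers which epoch produced it, and stops once `patience` epochs pass without improvement. With an infinite patience, as in the cell above, training never stops early and only the restoration of the best epoch applies.

```python
import math


def early_stopping(dev_losses, patience=math.inf):
    """Return (best_epoch, epochs_run) for a sequence of per-epoch dev losses."""
    best_loss, best_epoch, waited = math.inf, None, 0
    epochs_run = 0
    for epoch, loss in enumerate(dev_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # New best: a real trainer would snapshot the weights here.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, epochs_run


losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.5]
stop_early = early_stopping(losses, patience=2)
never_stop = early_stopping(losses)  # infinite patience runs every epoch
```

On this toy loss curve a patience of 2 would stop before the late improvement at the final epoch, which is one reason to keep patience generous when the dev loss is noisy.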

In [None]:
report.plot_loss_report()

In [None]:
report.plot_custom_report('micro_f1')
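
For reference, micro-averaged F1 pools true positives, false positives and false negatives over all sentences before computing precision and recall. The sketch below assumes exact-match scoring on `(start, stop, label)` spans, which suits nested entities because overlapping spans are simply distinct set members. The trainer's own metric may be computed differently.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-sentence sets of (start, stop, label) spans."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        gold_spans, pred_spans = set(gold_spans), set(pred_spans)
        tp += len(gold_spans & pred_spans)  # exact matches
        fp += len(pred_spans - gold_spans)  # predicted but not in gold
        fn += len(gold_spans - pred_spans)  # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [[(0, 3, "ORG"), (2, 3, "LOC")], [(1, 2, "PER")]]
pred = [[(0, 3, "ORG")], [(1, 2, "PER"), (3, 4, "LOC")]]
score = micro_f1(gold, pred)
```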

In [None]:
print(trainer.test_model(test_data, out_dict=False))

## Inference

In [None]:
out = pyramid_ner.parse("your own test sentence")
print(out)

## Saving for later use

In [None]:
pyramid_ner.save(path='.', name='pyramid_ner')

In [None]:
!tar -cvzf pyramid_ner.tar.gz pyramid_ner  # compress (assumes the model was saved under ./pyramid_ner)
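
If you prefer to stay in Python, the standard library's `tarfile` module can build a gzip-compressed archive as well. This sketch assumes that `save` wrote its output under a `pyramid_ner` directory, which is a guess based on the `name` argument. The demo creates a throwaway directory so that it runs anywhere.

```python
import os
import tarfile
import tempfile


def compress_dir(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the directory's own name inside the archive.
        archive.add(source_dir, arcname=os.path.basename(source_dir))
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "pyramid_ner")  # stand-in for the saved model
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "weights.bin"), "wb") as handle:
        handle.write(b"\x00" * 16)  # dummy file, not a real checkpoint
    archive_path = compress_dir(model_dir, os.path.join(tmp, "pyramid_ner.tar.gz"))
    with tarfile.open(archive_path, "r:gz") as archive:
        archived_names = sorted(archive.getnames())
```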

You can load it back using:

```python
from pyramid_nested_ner.model import PyramidNer
pyramid_ner = PyramidNer.load(path, custom_tokenizer=None, force_device=None, force_language_model=None, force_embeddings=None)
```

Here, `force_device`, `force_language_model` and `force_embeddings` let you override the `device`, `language_model` and `word_embeddings` values that were in effect when the model was saved.