<a href="https://colab.research.google.com/github/ashishpatel26/NER-with-SpanMarker/blob/main/2.NER%20with%20SpanMarker%20on%20Conll2003.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initializing & Training with SpanMarker
[SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) is an accessible yet powerful Python module for training Named Entity Recognition models.

In this short notebook, we'll have a look at how to initialize and train an NER model using SpanMarker. For a larger and more general tutorial on how to use SpanMarker, please have a look at the [Getting Started](getting_started.ipynb) notebook.

### Setup
First of all, the `span_marker` Python module needs to be installed. If we want to use [Weights and Biases](https://wandb.ai/) for logging, we can install `span_marker` using the `[wandb]` extra.

In [1]:
%pip install span_marker
# %pip install span_marker[wandb]

Collecting span_marker
  Downloading span_marker-1.3.0-py3-none-any.whl (41 kB)
Collecting accelerate (from span_marker)
  Downloading accelerate-0.22.0-py3-none-any.whl (251 kB)
Collecting transformers>=4.19.0 (from span_marker)
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
Collecting datasets>=2.0.0 (from span_marker)
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)

### Loading the dataset
For this example, we'll load the commonly used [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003) from the Hugging Face hub using 🤗 Datasets.

In [2]:
from datasets import load_dataset

dataset = load_dataset("conll2003")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [3]:
labels = dataset["train"].features["ner_tags"].feature.names
labels

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

SpanMarker accepts any dataset as long as it has `tokens` and `ner_tags` columns. The `ner_tags` can be annotated using the IOB, IOB2, BIOES or BILOU labeling scheme, or with plain labels that follow no scheme at all. This CoNLL dataset uses the common IOB or IOB2 labeling scheme, with PER, ORG, LOC and MISC labels.
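
To make the expected format concrete, here is a minimal sketch of how IOB2 tags map onto entity spans. The helper `iob2_to_spans` is a hypothetical name used only for illustration and is not part of SpanMarker; the label list is the one printed above, and the sample sentence simply follows the CoNLL2003 `tokens`/`ner_tags` layout.

```python
from typing import List, Tuple

# Label list as printed by the cell above.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse IOB/IOB2 tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # A span only continues with an I- tag carrying the same label.
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag != "O" and current is None:
            # B- opens a span; a stray I- is treated as an opener too.
            start, current = i, label
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# One sample in the same layout as the dataset: words plus integer tag ids.
sample = {
    "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
    "ner_tags": [3, 0, 7, 0, 0, 0, 7, 0, 0],
}
tags = [labels[i] for i in sample["ner_tags"]]
spans = iob2_to_spans(tags)
print(spans)  # [('ORG', 0, 1), ('MISC', 2, 3), ('MISC', 6, 7)]
```

Each span indexes into `tokens`, so `sample["tokens"][2:3]` recovers the word `German` for the first `MISC` entity.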

### Initializing a `SpanMarkerModel`
A SpanMarker model is initialized via [SpanMarkerModel.from_pretrained](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.modeling.html#span_marker.modeling.SpanMarkerModel.from_pretrained). This method will be familiar to those who know 🤗 Transformers. It accepts either a path to a local model or the name of a model on the [Hugging Face Hub](https://huggingface.co/models).

Importantly, the model can *either* be an encoder or an already trained and saved SpanMarker model. As we haven't trained anything yet, we will use an encoder. To learn how to load and use a saved SpanMarker model, please have a look at the [Loading & Inferencing](model_loading.ipynb) notebook.

Reasonable options for encoders include BERT and RoBERTa, which means that the following are all good options:

* [prajjwal1/bert-tiny](https://huggingface.co/prajjwal1/bert-tiny)
* [prajjwal1/bert-mini](https://huggingface.co/prajjwal1/bert-mini)
* [prajjwal1/bert-small](https://huggingface.co/prajjwal1/bert-small)
* [prajjwal1/bert-medium](https://huggingface.co/prajjwal1/bert-medium)
* [bert-base-cased](https://huggingface.co/bert-base-cased)
* [bert-large-cased](https://huggingface.co/bert-large-cased)
* [roberta-base](https://huggingface.co/roberta-base)
* [roberta-large](https://huggingface.co/roberta-large)

Not all encoders work, though: they **must** accept `position_ids` as an input argument, which disqualifies DistilBERT, T5, DistilRoBERTa, ALBERT & BART. Furthermore, uncased models are generally not recommended, as capitalisation can be very useful for finding named entities.
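
One rough way to screen an encoder for this requirement is to inspect the signature of its `forward` method. The sketch below assumes that the requirement shows up as an explicit `position_ids` parameter; `accepts_position_ids` is a hypothetical helper, and the two stub functions merely stand in for real encoder `forward` methods so that the example runs without downloading anything.

```python
import inspect


def accepts_position_ids(forward) -> bool:
    """Return True if the callable declares an explicit `position_ids` parameter."""
    return "position_ids" in inspect.signature(forward).parameters


# Hypothetical stand-ins for encoder forward methods (not real library code).
def forward_with_positions(input_ids, attention_mask=None, position_ids=None):
    ...


def forward_without_positions(input_ids, attention_mask=None):
    ...


print(accepts_position_ids(forward_with_positions))     # True
print(accepts_position_ids(forward_without_positions))  # False
```

With 🤗 Transformers installed, the same check could be applied to the `forward` method of a loaded encoder before spending time on a training run.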

We'll use `"roberta-base"` for this notebook. If you're running this on Google Colab, be sure to set hardware accelerator to "GPU" in `Runtime` > `Change runtime type`.

In [4]:
from span_marker import SpanMarkerModel

model_name = "roberta-base"
model = SpanMarkerModel.from_pretrained(
    model_name,
    labels=labels,
    model_max_length=256,
    entity_max_length=6,
)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50267. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


For us, these warnings are expected, as we are initializing `RobertaModel` for a new task.

Note that we provided [SpanMarkerModel.from_pretrained](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.modeling.html#span_marker.modeling.SpanMarkerModel.from_pretrained) with a list of our labels. This is required when training a new model. See [Configuring](model_configuration.ipynb) for more details and recommendations on configuration options.
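
The `entity_max_length=6` passed above caps entity length in words, so longer entities cannot be predicted. As a rough aid for choosing that value, the sketch below counts how many entities in a tagged corpus would exceed a given cap. It assumes well-formed IOB2 tags; `entity_lengths` and the toy sentences are hypothetical, and the count is not guaranteed to match SpanMarker's own report exactly.

```python
from collections import Counter


def entity_lengths(tags):
    """Count entity lengths (in words) for one IOB2-tagged sentence."""
    lengths, run = Counter(), 0
    for tag in list(tags) + ["O"]:  # the sentinel "O" flushes a trailing entity
        if tag.startswith("I-") and run > 0:
            run += 1  # still inside the current entity
            continue
        if run > 0:
            lengths[run] += 1  # the previous entity just ended
        run = 1 if tag != "O" else 0
    return lengths


# Hypothetical toy corpus: entities of 3 words, 1 word and 7 words.
sentences = [
    ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"],
    ["B-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
]

entity_max_length = 6
totals = Counter()
for tags in sentences:
    totals.update(entity_lengths(tags))

missed = sum(n for length, n in totals.items() if length > entity_max_length)
print(f"{missed} of {sum(totals.values())} entities exceed {entity_max_length} words")
# 1 of 3 entities exceed 6 words
```

On the real dataset, the string tags for each example can be obtained with `[labels[i] for i in example["ner_tags"]]`.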

### Training
At this point, our model is ready for training! We can import [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) directly from 🤗 Transformers as well as the SpanMarker `Trainer`. The `Trainer` is a subclass of the 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) that simplifies some tasks for you, but otherwise it works just like the regular `Trainer`.

This next snippet shows some reasonable defaults. Feel free to lower the batch size if you run into out-of-memory errors.

In [5]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/span-marker-roberta-base-conll03",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
)

Now we can create a SpanMarker `Trainer` in the same way that you would initialize a 🤗 Transformers `Trainer`. We'll evaluate on a subset of the validation data to save some time. Conveniently, this `Trainer` automatically logs to whichever logging tools you have installed. In other words, if you prefer logging with [Tensorboard](https://www.tensorflow.org/tensorboard), all you have to do is install it.

In [6]:
from span_marker import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"].select(range(2000)),
)
trainer.train()

INFO:span_marker.label_normalizer:Detected the IOB or IOB2 labeling scheme.


These are the frequencies of the missed entities due to maximum entity length out of 23499 total entities:
- 18 missed entities with 7 words (0.076599%)
- 2 missed entities with 8 words (0.008511%)
- 3 missed entities with 10 words (0.012767%)


INFO:span_marker.trainer:Spread 14041 sentences across 14414 samples, a 2.656506% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.


| Step | Training Loss | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy |
|---|---|---|---|---|---|---|
| 500 | 0.0351 | 0.027203 | 0.892768 | 0.825122 | 0.857613 | 0.96623 |
| 1000 | 0.0209 | 0.016772 | 0.906553 | 0.916739 | 0.911617 | 0.982015 |
| 1500 | 0.0169 | 0.01197 | 0.938045 | 0.929127 | 0.933565 | 0.986317 |


These are the frequencies of the missed entities due to maximum entity length out of 3477 total entities:
- 5 missed entities with 7 words (0.143802%)
- 1 missed entities with 10 words (0.028760%)

INFO:span_marker.trainer:Spread 2000 sentences across 2067 samples, a 3.350000% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.

TrainOutput(global_step=1802, training_loss=0.07813725461170375, metrics={'train_runtime': 780.9458, 'train_samples_per_second': 18.457, 'train_steps_per_second': 2.307, 'total_flos': 3792754939938816.0, 'train_loss': 0.07813725461170375, 'epoch': 1.0})
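
The `global_step=1802` in this output follows from the sample count in the log and the batch settings in `TrainingArguments`. The sketch below reproduces that arithmetic under the assumption of a single device and the usual rounding (batches rounded up, then integer-divided by the accumulation steps); `optimizer_steps_per_epoch` is a hypothetical helper, not a library function.

```python
import math


def optimizer_steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Optimizer updates per epoch, assuming the last partial batch is kept."""
    batches = math.ceil(num_samples / (per_device_batch_size * num_devices))
    return max(batches // grad_accum_steps, 1)


# 14414 samples after spreading, batch size 4, gradient accumulation 2, 1 epoch.
steps = optimizer_steps_per_epoch(14414, 4, 2)

# warmup_ratio=0.1 turns into a number of warmup steps, rounded up.
warmup_steps = math.ceil(steps * 0.1)

# Relative growth from 14041 sentences to 14414 samples, as reported in the log.
increase = (14414 - 14041) / 14041 * 100

print(steps, warmup_steps, round(increase, 4))  # 1802 181 2.6565
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size of 8, which is useful when memory is tight.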

Next, we compute the model's performance on the evaluation set.

In [7]:
metrics = trainer.evaluate()
metrics

INFO:span_marker.trainer:Spread 2000 sentences across 2067 samples, a 3.350000% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.


{'eval_loss': 0.012107213959097862,
 'eval_overall_precision': 0.9356792616094606,
 'eval_overall_recall': 0.9346009795447998,
 'eval_overall_f1': 0.9351398097434419,
 'eval_overall_accuracy': 0.98699660359049,
 'eval_runtime': 29.67,
 'eval_samples_per_second': 69.666,
 'eval_steps_per_second': 17.425,
 'epoch': 1.0}

Additionally, we should evaluate using the test set.

In [8]:
trainer.evaluate(dataset["test"], metric_key_prefix="test")

INFO:span_marker.trainer:Spread 3453 sentences across 3545 samples, a 2.664350% increase. You can increase `model_max_length` or `marker_max_length` to decrease the number of samples, but recognize that longer samples are slower.


{'test_loss': 0.027631552889943123,
 'test_overall_precision': 0.8961633227736713,
 'test_overall_recall': 0.9015580736543909,
 'test_overall_f1': 0.8988526037069727,
 'test_overall_accuracy': 0.9773016043932379,
 'test_runtime': 51.926,
 'test_samples_per_second': 68.27,
 'test_steps_per_second': 17.082,
 'epoch': 1.0}
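
As a quick sanity check on these numbers, the reported F1 values can be recomputed from precision and recall. This sketch assumes that the overall F1 is the usual harmonic mean of the overall precision and recall; `f1_score` here is a small local helper, not an import from any library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the two evaluation outputs above.
validation_f1 = f1_score(0.9356792616094606, 0.9346009795447998)
test_f1 = f1_score(0.8961633227736713, 0.9015580736543909)

print(round(validation_f1, 4), round(test_f1, 4))  # 0.9351 0.8989
```

The gap of roughly 3.6 F1 points between the validation subset and the test set is why both evaluations are worth running.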

Once trained, we can save our new model locally.

In [9]:
trainer.save_model("models/span-marker-roberta-base-conll03/checkpoint-final")

Or we can push it to the 🤗 Hub after logging in with `notebook_login`, like so.

In [11]:
from huggingface_hub import notebook_login

notebook_login()

In [12]:
trainer.push_to_hub("ashishpatel26/span-marker-roberta-base-conll03")

'https://huggingface.co/ashishpatel26/span-marker-roberta-base-conll03/tree/main/'

If we want to use it again, we can just load it using the checkpoint or using the model name on the Hub. This is how it would be done using a local checkpoint. See the [Loading & Inferencing](model_loading.ipynb) notebook for more details.

In [14]:
model = SpanMarkerModel.from_pretrained("models/span-marker-roberta-base-conll03/checkpoint-final")

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50267. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


That was all! As simple as that. If we put it all together into a single script, it looks something like this:
```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
from transformers import TrainingArguments

dataset = load_dataset("conll2003")
labels = dataset["train"].features["ner_tags"].feature.names

model_name = "roberta-base"
model = SpanMarkerModel.from_pretrained(model_name, labels=labels, model_max_length=256)

args = TrainingArguments(
    output_dir="models/span-marker-roberta-base-conll03",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    warmup_ratio=0.1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].select(range(8000)),
    eval_dataset=dataset["validation"].select(range(2000)),
)

trainer.train()
trainer.save_model("models/span-marker-roberta-base-conll03/checkpoint-final")
trainer.push_to_hub()

metrics = trainer.evaluate()
print(metrics)
```

With `wandb` initialized, you can enjoy their very useful training graphs straight in your browser. It ends up looking something like this.
![image](https://user-images.githubusercontent.com/37621491/235172501-a3cdae91-faf0-42b7-ac60-e6738b78e67e.png)
![image](https://user-images.githubusercontent.com/37621491/235172726-795ded55-4b1c-40fa-ab91-476762f7dd57.png)

Furthermore, you can use the `wandb` hyperparameter search functionality by following the tutorial in the Hugging Face documentation [here](https://huggingface.co/docs/transformers/hpo_train). This transfers very well to the SpanMarker `Trainer`.