## Document-level context
[SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) is an accessible yet powerful Python module for training Named Entity Recognition models.

In this tutorial, I'll show you how to perform training and inference of SpanMarker models using document-level context to improve performance.

Many approaches to NER process individual sentences completely independently of one another, even if the sentences originate from the same document. Although this works fine, research has shown that including additional contextual information (i.e. the previous and next sentence(s)) improves the performance of the model. In my own experiments with SpanMarker on CoNLL03, including this document-level contextual information improves the model from a mean F1 of 92.9±0.0 to a mean F1 of 94.1±0.1.

### Document-level context in SpanMarker
SpanMarker is designed to require only slight changes in the input data to allow for document-level context during training, evaluation and inference. In particular, the only required change is that the input must now be a [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset) with `document_id` and `sentence_id` columns.

#### Training and evaluating
For training and evaluation, the dataset must now contain `tokens`, `ner_tags`, `document_id` and `sentence_id` columns. I've prepared two datasets ([tomaarsen/conll2003](https://huggingface.co/datasets/tomaarsen/conll2003), [tomaarsen/conllpp](https://huggingface.co/datasets/tomaarsen/conllpp)) that I've used to train some models. We will have a look at the former to get a feel for how these values are used.

In [6]:
# !pip install datasets span_marker


In [7]:
from datasets import load_dataset, Dataset

# Load the dataset from the Hub and throw away the non-NER columns
dataset = load_dataset("tomaarsen/conll2003", split="train").remove_columns(("id", "chunk_tags", "pos_tags"))
dataset

Dataset({
    features: ['document_id', 'sentence_id', 'tokens', 'ner_tags'],
    num_rows: 14041
})

Let's have a quick look at the data itself.

In [8]:
dataset.select(range(30)).to_pandas()

| | document_id | sentence_id | tokens | ner_tags |
|---|---|---|---|---|
| 0 | 1 | 0 | [EU, rejects, German, call, to, boycott, Briti... | [3, 0, 7, 0, 0, 0, 7, 0, 0] |
| 1 | 1 | 1 | [Peter, Blackburn] | [1, 2] |
| 2 | 1 | 2 | [BRUSSELS, 1996-08-22] | [5, 0] |
| 3 | 1 | 3 | [The, European, Commission, said, on, Thursday... | [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, ... |
| 4 | 1 | 4 | [Germany, 's, representative, to, the, Europea... | [5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, ... |
| 5 | 1 | 5 | [", We, do, n't, support, any, such, recommend... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 6 | 1 | 6 | [He, said, further, scientific, study, was, re... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 7 | 1 | 7 | [He, said, a, proposal, last, month, by, EU, F... | [0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 2, 0, 0, 0, ... |
| 8 | 1 | 8 | [Fischler, proposed, EU-wide, measures, after,... | [1, 0, 7, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0, ... |
| 9 | 1 | 9 | [But, Fischler, agreed, to, review, his, propo... | [0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, ... |


As you can see, the `document_id` and `sentence_id` columns contain integers. The former identifies which document the sentence belongs to, while the latter indicates the position of the sentence in that document. Internally, SpanMarker will include adjacent sentences originating from the same document as contextual information. In the SpanMarker configuration, you can set `max_prev_context` and `max_next_context` to limit the number of previous or next sentences to be included as context. By default, these are set to `None`, allowing the inclusion of as much context as is available until the maximum token length is reached. In practice, these settings are defined like so:

In [9]:
from span_marker import SpanMarkerModel

# An example encoder and example labels
model = SpanMarkerModel.from_pretrained(
    "prajjwal1/bert-tiny",  # Example encoder
    labels=[  # Example labels
        "O",
        "PER",
        "LOC",
    ],
    max_prev_context=2,
    max_next_context=2,
)


You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 30524. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
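
To make the effect of `max_prev_context` and `max_next_context` more concrete, here is a plain-Python sketch of how neighbouring sentences could be selected per document. This is an illustration only, not SpanMarker's actual implementation: the `context_windows` helper and the toy rows are hypothetical, and the sketch ignores the maximum token length that SpanMarker also takes into account.

```python
from collections import defaultdict


def context_windows(rows, max_prev_context=None, max_next_context=None):
    """Hypothetical helper: for every row, collect the neighbouring
    sentences from the same document. `None` means "no limit"."""
    # Group the row indices per document and order them by sentence_id,
    # so the input rows do not have to be sorted beforehand.
    by_document = defaultdict(list)
    for index, row in enumerate(rows):
        by_document[row["document_id"]].append(index)
    for indices in by_document.values():
        indices.sort(key=lambda i: rows[i]["sentence_id"])

    windows = [None] * len(rows)
    for indices in by_document.values():
        for position, index in enumerate(indices):
            start = 0 if max_prev_context is None else max(0, position - max_prev_context)
            stop = len(indices) if max_next_context is None else position + 1 + max_next_context
            windows[index] = {
                "previous": [rows[i]["tokens"] for i in indices[start:position]],
                "sentence": rows[index]["tokens"],
                "next": [rows[i]["tokens"] for i in indices[position + 1 : stop]],
            }
    return windows


# Toy data: two documents, deliberately out of order
rows = [
    {"document_id": 0, "sentence_id": 1, "tokens": "B"},
    {"document_id": 0, "sentence_id": 0, "tokens": "A"},
    {"document_id": 1, "sentence_id": 0, "tokens": "X"},
    {"document_id": 0, "sentence_id": 2, "tokens": "C"},
    {"document_id": 0, "sentence_id": 3, "tokens": "D"},
]

windows = context_windows(rows, max_prev_context=1, max_next_context=1)
print(windows[0])  # {'previous': ['A'], 'sentence': 'B', 'next': ['C']}
```

Sentences from other documents never end up in the window, which is why sentence "X" from document 1 gets no context at all in this toy example.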


Training with this dataset works the same as if the `document_id` and `sentence_id` columns did not exist. See the [Model Training](model_training.ipynb) tutorial for more information on how to do that. See also the [Trainer](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.trainer.html) documentation.

#### Inference
For inference, the inputs to `model.predict` must also contain `document_id` and `sentence_id` columns, alongside a `tokens` column that includes either string sentences or lists of tokens. Let's consider some sample data:

In [10]:
# For simplicity, this data is already split into sentences.
# You can use various tools to do this, e.g. spaCy senter or NLTK sent_tokenize
document_one = [
    "Cleopatra VII (70/69 BC - 10 August 30 BC) was Queen of the Ptolemaic Kingdom of Egypt from 51 to 30 BC, and its last active ruler.",
    "A member of the Ptolemaic dynasty, she was a descendant of its founder Ptolemy I Soter, a Macedonian Greek general and companion of Alexander the Great.",
    "After the death of Cleopatra, Egypt became a province of the Roman Empire, marking the end of the last Hellenistic state in the Mediterranean and of the age that had lasted since the reign of Alexander (336-323 BC).",
]

document_two = [
    "The 35-year-old led his country to the 2022 World Cup title in Qatar last year, arguably the crowning triumph in one of the greatest football careers.",
    "And on Thursday, Messi enjoyed another landmark moment by scoring his fastest ever goal.",
    "Messi curled home an exquisite left-footed strike from the edge of the box just 79 seconds into Argentina's friendly against Australia in Beijing - the quickest of his professional career, per South American football's governing body, CONMEBOL.",
]

document_three = [
    "UK firms could gain access to US green funding as part of plans to boost UK and US ties announced by Rishi Sunak and Joe Biden.",
    "The pair unveiled the Atlantic Declaration, to strengthen economic ties between the two countries, at a White House press conference.",
    "The PM said the agreement, which falls short of a full trade deal would bring benefits \"as quickly as possible\".",
    "UK electric car firms may get access to US green tax credits and subsidies.",
    "As the pair unveiled their partnership to bolster economic security, Mr Sunak said the UK-US relationship was an \"indispensable alliance\"."
]

documents = [document_one, document_two, document_three]
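
If your documents are not yet split into sentences, you will need a sentence splitter first. The snippet below is a deliberately naive regular-expression stand-in for tools like spaCy's senter or NLTK's `sent_tokenize`, just to show the shape of the data that is needed. The `naive_sentence_split` helper is hypothetical, and it will mis-split text with abbreviations such as "Dr. Smith", so prefer a proper tool for real data.

```python
import re


def naive_sentence_split(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


raw_document = (
    "The 35-year-old led his country to the 2022 World Cup title in Qatar last year, "
    "arguably the crowning triumph in one of the greatest football careers. "
    "And on Thursday, Messi enjoyed another landmark moment by scoring his fastest ever goal."
)

sentences = naive_sentence_split(raw_document)
print(len(sentences))  # 2
```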

Now we have to preprocess this dataset to generate the `document_id` and `sentence_id`.

In [11]:
data_dict = {
    "tokens": [],
    "document_id": [],
    "sentence_id": [],
}
for document_id, document in enumerate(documents):
    for sentence_id, sentence in enumerate(document):
        data_dict["document_id"].append(document_id)
        data_dict["sentence_id"].append(sentence_id)
        data_dict["tokens"].append(sentence)
dataset = Dataset.from_dict(data_dict)
dataset

Dataset({
    features: ['tokens', 'document_id', 'sentence_id'],
    num_rows: 11
})

In [12]:
dataset.to_pandas()

| | tokens | document_id | sentence_id |
|---|---|---|---|
| 0 | Cleopatra VII (70/69 BC - 10 August 30 BC) was... | 0 | 0 |
| 1 | A member of the Ptolemaic dynasty, she was a d... | 0 | 1 |
| 2 | After the death of Cleopatra, Egypt became a p... | 0 | 2 |
| 3 | The 35-year-old led his country to the 2022 Wo... | 1 | 0 |
| 4 | And on Thursday, Messi enjoyed another landmar... | 1 | 1 |
| 5 | Messi curled home an exquisite left-footed str... | 1 | 2 |
| 6 | UK firms could gain access to US green funding... | 2 | 0 |
| 7 | The pair unveiled the Atlantic Declaration, to... | 2 | 1 |
| 8 | The PM said the agreement, which falls short o... | 2 | 2 |
| 9 | UK electric car firms may get access to US gre... | 2 | 3 |


We can pass this dataset directly to `SpanMarkerModel.predict`, and SpanMarker will add the document-level context for you under the hood. Note that the dataset does not need to be sorted. See also the [SpanMarkerModel.predict](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.modeling.html#span_marker.modeling.SpanMarkerModel.predict) documentation.

In [13]:
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-large-conll03-doc-context").try_cuda()
entities = model.predict(dataset)
len(entities)

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 250004. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


11

In [14]:
entities[0]

[{'span': 'Cleopatra VII',
  'label': 'PER',
  'score': 0.7116236090660095,
  'char_start_index': 0,
  'char_end_index': 13},
 {'span': 'BC',
  'label': 'MISC',
  'score': 0.9982840418815613,
  'char_start_index': 21,
  'char_end_index': 23},
 {'span': 'Ptolemaic Kingdom of Egypt',
  'label': 'LOC',
  'score': 0.6176435947418213,
  'char_start_index': 60,
  'char_end_index': 86}]

As you can see, the SpanMarker model returns a list of entity dictionaries for each sentence in the input dataset.
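
As a small follow-up, the sketch below shows two things you might do with such predictions, using only the output printed above: checking that `char_start_index` and `char_end_index` slice the span out of the input sentence, and collecting entities per document. The `group_by_document` helper and its `min_score` threshold are hypothetical, and the sketch assumes that the i-th list of predictions corresponds to the i-th row of the dataset.

```python
from collections import defaultdict

first_sentence = (
    "Cleopatra VII (70/69 BC - 10 August 30 BC) was Queen of the Ptolemaic "
    "Kingdom of Egypt from 51 to 30 BC, and its last active ruler."
)
# Copied from the `entities[0]` output above
first_entities = [
    {"span": "Cleopatra VII", "label": "PER", "score": 0.7116236090660095,
     "char_start_index": 0, "char_end_index": 13},
    {"span": "BC", "label": "MISC", "score": 0.9982840418815613,
     "char_start_index": 21, "char_end_index": 23},
    {"span": "Ptolemaic Kingdom of Egypt", "label": "LOC", "score": 0.6176435947418213,
     "char_start_index": 60, "char_end_index": 86},
]

# The character indices slice the span straight out of the sentence
for entity in first_entities:
    start, end = entity["char_start_index"], entity["char_end_index"]
    assert first_sentence[start:end] == entity["span"]


def group_by_document(document_ids, sentence_ids, entities_per_sentence, min_score=0.0):
    """Hypothetical helper: collect (sentence_id, start, label, span) per document,
    assuming entities_per_sentence[i] belongs to row i of the dataset."""
    grouped = defaultdict(list)
    for document_id, sentence_id, sentence_entities in zip(
        document_ids, sentence_ids, entities_per_sentence
    ):
        for entity in sentence_entities:
            if entity["score"] >= min_score:
                grouped[document_id].append(
                    (sentence_id, entity["char_start_index"], entity["label"], entity["span"])
                )
    # Order the entities by their position within each document
    return {document_id: sorted(found) for document_id, found in grouped.items()}


grouped = group_by_document([0], [0], [first_entities], min_score=0.7)
print(grouped)  # {0: [(0, 0, 'PER', 'Cleopatra VII'), (0, 21, 'MISC', 'BC')]}
```

With the real predictions, a call along the lines of `group_by_document(dataset["document_id"], dataset["sentence_id"], entities)` should work under the same alignment assumption.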