<a href="https://colab.research.google.com/github/ashishpatel26/NER-with-SpanMarker/blob/main/3.NER%20with%20SpanMarker%20with%20spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using SpanMarker with spaCy
[SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) is an accessible yet powerful Python module for training Named Entity Recognition models.

In this short notebook, we'll have a look at using pretrained SpanMarker models with spaCy.

### Setup
First, both the `spacy` and `span_marker` Python modules need to be installed. After that, we also need to download a `spacy` model. We'll choose a small one for now: `en_core_web_sm`.

In [1]:
%pip install span_marker spacy
!spacy download en_core_web_sm


### Using spaCy for Named Entity Recognition
We'll start off by using spaCy on its own for NER, to show which changes are needed to use SpanMarker models for NER instead.

In [2]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Feed some text through the model to get a spacy Doc
text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)

# And look at the entities
doc.ents

(Cleopatra the Great,
 the Ptolemaic Kingdom of Egypt,
 69,
 BCE,
 Egypt,
 51,
 BCE,
 30,
 BCE)

The `spacy` module comes with a built-in visualizer, displaCy, that makes these entities easier to inspect. Let's use it.

In [4]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True, options={'distance': 90})

Not quite ideal. This spaCy model misses `Cleopatra VII`, considers `Cleopatra the Great` a work of art, and splits every date into a cardinal and an organisation.
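The displaCy rendering is only visible in a live notebook, so it can help to list the labels as plain text as well. The following is a minimal sketch: `entity_rows` is a hypothetical helper (not part of spaCy) that only relies on the `.text` and `.label_` attributes that spaCy entity spans expose. The stand-in objects and the `CARDINAL` label are illustrative assumptions, not output from this notebook.

```python
from types import SimpleNamespace


def entity_rows(ents):
    """Return (text, label) pairs for objects exposing `.text` and `.label_`."""
    return [(ent.text, ent.label_) for ent in ents]


# Stand-in objects so the sketch runs without a downloaded spaCy model.
# In the notebook, pass `doc.ents` instead.
fake_ents = [
    SimpleNamespace(text="Cleopatra the Great", label_="WORK_OF_ART"),
    SimpleNamespace(text="69", label_="CARDINAL"),  # illustrative label
]

# Print one entity per line, text left-aligned and followed by its label.
for ent_text, ent_label in entity_rows(fake_ents):
    print(f"{ent_text:<22}{ent_label}")
```

In the notebook itself, `entity_rows(doc.ents)` should give the same information that the visualizer shows.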

### Using SpanMarker models for Named Entity Recognition with spaCy
We can easily add a SpanMarker model as a drop-in replacement for the original spaCy NER pipeline. It takes just one line of code.

In [5]:
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50267. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


<span_marker.spacy_integration.SpacySpanMarkerWrapper at 0x7df6f9481a20>

The `model` configuration option refers to [tomaarsen/span-marker-roberta-large-ontonotes5](https://huggingface.co/tomaarsen/span-marker-roberta-large-ontonotes5), a model trained on OntoNotes v5.0, the same dataset used by the original spaCy NER pipeline. The [spaCy integration API reference](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.spacy_integration.html) has more documentation on the configuration options. Let's try out the updated spaCy pipeline.

In [6]:
# All we have to do is process the text using the updated spaCy pipeline
doc = nlp(text)

print(doc.ents)

displacy.render(doc, style="ent", jupyter=True, options={'distance': 90})

(Cleopatra VII, Cleopatra the Great, the Ptolemaic Kingdom of Egypt, 69 BCE, Egypt, 51 BCE, 30 BCE)


Much better!
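To make the improvement concrete, the two entity tuples printed so far can be compared directly. This is a small sketch with a hypothetical helper, `compare_entities`, that works on plain strings copied from the outputs above. It compares entity texts only, so labels, positions, and duplicates (such as the repeated `BCE`) are ignored.

```python
def compare_entities(before, after):
    """Split entity texts into those lost, gained, and kept between two runs."""
    before_set, after_set = set(before), set(after)
    return {
        "lost": sorted(before_set - after_set),
        "gained": sorted(after_set - before_set),
        "kept": sorted(before_set & after_set),
    }


# Entity texts copied from the plain spaCy output earlier in the notebook.
spacy_ents = [
    "Cleopatra the Great", "the Ptolemaic Kingdom of Egypt",
    "69", "BCE", "Egypt", "51", "BCE", "30", "BCE",
]
# Entity texts copied from the output of the pipeline with SpanMarker added.
span_marker_ents = [
    "Cleopatra VII", "Cleopatra the Great", "the Ptolemaic Kingdom of Egypt",
    "69 BCE", "Egypt", "51 BCE", "30 BCE",
]

diff = compare_entities(spacy_ents, span_marker_ents)
for change, texts in diff.items():
    print(f"{change}: {texts}")
```

With live `Doc` objects, the inputs could be built as `[ent.text for ent in doc.ents]` before and after adding the SpanMarker component.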

But what if we don't want to use a model with these labels? This integration works with any [SpanMarker model on the Hugging Face Hub](https://huggingface.co/models?library=span-marker), so we can just pick another one. Let's also keep the model on the CPU, just to see how that works, and overwrite the entities from spaCy's own NER model. Overwriting is recommended when the SpanMarker model uses a different label scheme than spaCy, which uses the labels from OntoNotes v5.

In [7]:
nlp.remove_pipe("span_marker")
nlp.add_pipe(
    "span_marker",
    config={
        "model": "tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super",
        "device": "cpu",
        "overwrite_entities": True,
    },
)

doc = nlp(text)
print(doc.ents)
displacy.render(doc, style="ent", jupyter=True, options={'distance': 90})


You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 250004. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


(Cleopatra VII, Cleopatra the Great, Egypt, Egypt)
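When switching between several SpanMarker models, the config dictionary from the cell above can be assembled by a small function. The sketch below is one possible way to do that, not part of the SpanMarker API: `span_marker_config` is a hypothetical helper, and it only uses the three keys that appear in this notebook (`model`, `device`, `overwrite_entities`). Options left as `None` are omitted, so the integration's own defaults apply.

```python
def span_marker_config(model, device=None, overwrite_entities=None):
    """Build a config dict for nlp.add_pipe("span_marker", config=...)."""
    config = {"model": model}
    optional = {"device": device, "overwrite_entities": overwrite_entities}
    # Leave out unset options so the component falls back to its defaults.
    config.update({key: value for key, value in optional.items() if value is not None})
    return config


# Reproduces the configuration used in the cell above.
fewnerd_cpu = span_marker_config(
    "tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super",
    device="cpu",
    overwrite_entities=True,
)
print(fewnerd_cpu)
```

The result can then be passed along as `nlp.add_pipe("span_marker", config=fewnerd_cpu)`.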


### Summary
To summarize, using SpanMarker with spaCy is as simple as this:

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker")

text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
doc = nlp(text)

[(entity, entity.label_) for entity in doc.ents]

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 50267. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


[(Amelia Earhart, 'PERSON'),
 (Lockheed, 'ORG'),
 (Vega 5B, 'PRODUCT'),
 (Atlantic, 'LOC'),
 (Paris, 'GPE')]