# GLiNER with spaCy
* Notebook by Adam Lang
* Date: 7/21/2024

# Overview
* This notebook shows how to combine the GLiNER model with a spaCy pipeline to perform zero-shot named entity recognition (NER) for data science and NLP tasks.

# What problem does this solve?
* Traditional NER models are limited to a predefined set of entity types. Supporting more entity types benefits many applications, but it usually requires labor-intensive, time-consuming labeling of additional datasets.
* LLMs can also perform this task, but they can hallucinate, and the number of API calls and the per-token cost grow as the application scales.

# GLiNER
* GLiNER stands for Generalist Model for Named Entity Recognition using Bidirectional Transformer.
* It is a compact NER model trained to identify **any type of entity**.
* It facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs.

## GLiNER Architecture
* GLiNER uses a bidirectional transformer encoder (a BERT-like model) and takes as input the entity type prompts together with a sentence or text.
* Each entity type in the prompt is marked by a learned token `[ENT]`.
* The BiLM (bidirectional language model, such as BERT) outputs a representation for each token.
* Entity embeddings are passed into a feed-forward neural network, while the input word representations are passed into a span representation layer to compute an embedding for each span.
* Finally, the model computes a matching score between entity representations and span representations (using dot product and sigmoid activation).
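
The flow described above can be illustrated with a small, self-contained sketch. This is not GLiNER's actual code: the prompt layout follows the description above (with a `[SEP]` token assumed between the entity types and the text), and the tiny hand-written vectors stand in for the learned entity and span embeddings. The helper names `build_prompt` and `match_scores` are hypothetical.

```python
import math


def build_prompt(labels, words):
    """Mark every entity type with [ENT], then append the text after [SEP]."""
    tokens = []
    for label in labels:
        tokens += ["[ENT]", label]
    return tokens + ["[SEP]"] + list(words)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def match_scores(entity_vecs, span_vecs):
    """Score every (span, entity type) pair as sigmoid(dot product)."""
    scores = {}
    for span, s_vec in span_vecs.items():
        for label, e_vec in entity_vecs.items():
            dot = sum(a * b for a, b in zip(s_vec, e_vec))
            scores[(span, label)] = sigmoid(dot)
    return scores


labels = ["person", "organization"]
words = ["Jeff", "Bezos", "founded", "Amazon", "."]
prompt = build_prompt(labels, words)
print(prompt)

# Toy 2-d embeddings, made up for illustration (not real model outputs)
entity_vecs = {"person": [2.0, -2.0], "organization": [-2.0, 2.0]}
span_vecs = {"Jeff Bezos": [1.0, -1.0], "Amazon": [-1.0, 1.0]}

scores = match_scores(entity_vecs, span_vecs)
for (span, label), score in scores.items():
    if score > 0.5:  # keep only pairs the toy model considers a match
        print(span, label, round(score, 3))
```

Because every span is scored against every entity type independently, all pairs can be evaluated in one pass rather than generated token by token.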


# GLiNER Implementation
* Demo of using GLiNER with spaCy.
* Note that the model can also be used directly from Hugging Face.

In [1]:
## install gliner-spacy
!pip install gliner-spacy

Collecting gliner-spacy
  Downloading gliner_spacy-0.0.10-py3-none-any.whl (6.6 kB)
Collecting gliner>=0.2.0 (from gliner-spacy)
  Downloading gliner-0.2.8-py3-none-any.whl (42 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.1/42.1 kB 1.2 MB/s eta 0:00:00
Collecting onnxruntime (from gliner>=0.2.0->gliner-spacy)
  Downloading onnxruntime-1.18.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 31.4 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=2.0.0->gliner>=0.2.0->gliner-spacy)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=2.0.0->gliner>=0.2.0->gliner-spacy)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch

In [2]:
## import libraries
import spacy
from gliner_spacy.pipeline import GlinerSpacy # use gliner in spacy


## Set up spaCy pipeline with GLiNER

In [3]:
## spacy pipeline
nlp = spacy.load("en_core_web_sm")

# add spacy pipe - customize your pipeline as needed
nlp.add_pipe("gliner_spacy", config={"labels": ["person", "organization"]})


<gliner_spacy.pipeline.GlinerSpacy at 0x7f07d4c20a60>

In [4]:
## spacy doc object
doc = nlp("Jeff Bezos founded Amazon.")
# iterate labels - zero shot
for ent in doc.ents:
    print(ent.text, ent.label_)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Jeff Bezos person
Amazon organization


## Customize Config
* We can perform zero-shot entity recognition on our own text by supplying our own labels, which the model then uses as its entity types.
* Below we will create our own labels: "medical terms", "company", and "drug name".

In [6]:
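## create a custom config
custom_spacy_config = {
    "gliner_model": "urchade/gliner_multi",  # model checkpoint to load
    "chunk_size": 250,  # size of the text chunks processed at once
    "labels": ["medical terms", "company", "drug name"],  # zero-shot entity types
    "style": "ent",  # results are read from doc.ents below
}
# start from a blank English pipeline so GLiNER is the only NER component
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)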

text = "Pfizer and Bristol Myers-Squibb (BMS) have developed two drugs to treat atrial fibrillation (AFib), an irregular heartbeat that can increase the risk of stroke. They are Eliquis and Tikosyn."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score)




Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Pfizer company 0.995849609375
Bristol Myers-Squibb company 0.9898647665977478
atrial fibrillation medical terms 0.576569676399231
stroke medical terms 0.5091347694396973
Eliquis drug name 0.9683527946472168
Tikosyn drug name 0.9581282138824463
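
The scores printed above can be used to keep only high-confidence entities. The following is a minimal sketch in plain Python, assuming the results have already been collected as `(text, label, score)` triples; the scores are copied from the output above and rounded, and the helper name `group_confident` and the `0.9` cutoff are hypothetical choices for illustration.

```python
from collections import defaultdict

# (text, label, score) triples copied from the output above, scores rounded
entities = [
    ("Pfizer", "company", 0.9958),
    ("Bristol Myers-Squibb", "company", 0.9899),
    ("atrial fibrillation", "medical terms", 0.5766),
    ("stroke", "medical terms", 0.5091),
    ("Eliquis", "drug name", 0.9684),
    ("Tikosyn", "drug name", 0.9581),
]


def group_confident(entities, min_score):
    """Group entity texts by label, dropping anything below min_score."""
    grouped = defaultdict(list)
    for text, label, score in entities:
        if score >= min_score:
            grouped[label].append(text)
    return dict(grouped)


confident = group_confident(entities, min_score=0.9)
print(confident)
```

With a spaCy `Doc`, the same triples could be built with `[(ent.text, ent.label_, ent._.score) for ent in doc.ents]`.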


Summary:
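* On the sample text, all six extracted entities received the correct label. Company and drug names scored above 0.95, while the two "medical terms" entities scored only about 0.51 to 0.58, and the abbreviation "BMS" was not extracted.
* A next step would be to use the model directly from Hugging Face via their pipeline. However, using it in spaCy is convenient for several reasons, such as: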
1. Annotation with Prodigy
2. Adding other spaCy pipelines to your workflow.
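
For the direct route mentioned above, a minimal sketch using the `gliner` package (installed as a dependency of `gliner-spacy`) might look like the following. The wrapper names `extract_entities` and `to_rows` are hypothetical; the model name is the one used in the custom config earlier, and the `0.5` threshold is an assumed starting value rather than a recommendation.

```python
def extract_entities(text, labels, model_name="urchade/gliner_multi", threshold=0.5):
    """Run GLiNER directly, without a spaCy pipeline."""
    # Imported lazily so that to_rows() below works without the package installed
    from gliner import GLiNER

    model = GLiNER.from_pretrained(model_name)  # downloads weights on first use
    # Returns a list of dicts with "text", "label", and "score" keys (plus offsets)
    return model.predict_entities(text, labels, threshold=threshold)


def to_rows(entities):
    """Convert GLiNER's list of dicts into (text, label, score) tuples."""
    return [(e["text"], e["label"], e["score"]) for e in entities]
```

Calling `to_rows(extract_entities(text, ["medical terms", "company", "drug name"]))` would then give tuples carrying the same three fields that the spaCy loop prints.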

# References
* https://netraneupane.medium.com/gliner-zero-shot-ner-outperforming-chatgpt-and-traditional-ner-models-1f4aae0f9eef
* GitHub: https://github.com/urchade/GLiNER?tab=readme-ov-file
* arXiv paper: https://arxiv.org/pdf/2311.08526