# Use GLiNER - a multilingual BERT-based NER Tool

Zaratiana, Urchade, Nadi Tomeh, Pierre Holat, and Thierry Charnois. 2024. “GLiNER: Generalist Model for Named Entity Recognition Using Bidirectional Transformer.” In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), edited by Kevin Duh, Helena Gomez, and Steven Bethard, 5364–5376. Mexico City, Mexico: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.300.

In [None]:
# STEP 1: Install dependencies

!pip install gliner nltk


In [None]:
# STEP 2: Try the demo from the GitHub Repository


from gliner import GLiNER

# Initialize GLiNER with the medium-sized model
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

# Sample text for entity prediction
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""

# Labels for entity prediction
# Most GLiNER models should work best when entity types are in lower case or title case
labels = ["Person", "Award", "Date", "Competitions", "Teams"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.5)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
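
Each item returned by `predict_entities` is a dictionary. Besides `text` and `label`, it also carries the character offsets `start` and `end` and a confidence `score`. The following sketch groups entities by label and drops low-confidence hits. It uses a hand-written sample list with made-up scores instead of real model output, and `group_by_label` is a hypothetical helper, not part of GLiNER.

```python
from collections import defaultdict

# Hand-written stand-in for the output of model.predict_entities();
# the scores are invented for illustration only.
sample_entities = [
    {"text": "Cristiano Ronaldo dos Santos Aveiro", "label": "Person", "score": 0.95},
    {"text": "Ronaldo", "label": "Person", "score": 0.80},
    {"text": "5 February 1985", "label": "Date", "score": 0.90},
    {"text": "Al Nassr", "label": "Teams", "score": 0.60},
]


def group_by_label(entities, min_score=0.0):
    """Return {label: [(text, score), ...]} with the highest scores first."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["label"]].append((ent["text"], ent["score"]))
    return {
        label: sorted(items, key=lambda item: item[1], reverse=True)
        for label, items in grouped.items()
    }


# Keep only entities the (pretend) model is reasonably sure about
for label, items in group_by_label(sample_entities, min_score=0.7).items():
    print(label, "=>", items)
```

With real output, pass `entities` from the cell above instead of `sample_entities`. Raising `min_score` has a similar effect to raising `threshold` in `predict_entities`, but lets you experiment without re-running the model.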

In [None]:
# Load a new text from GitHub
!wget https://github.com/dbamman/anlp25/raw/refs/heads/main/data/twain_innocents_abroad.txt



In [None]:
# STEP 3: Use GLiNER for your own data

from gliner import GLiNER
import nltk
import csv  # not used below; handy for exporting the results
import os   # not used below; handy for checking that the input file exists

# Download NLTK tokenizers
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize


# STEP 4: Read text from input file

with open("twain_innocents_abroad.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Keep only the first five paragraphs (blocks separated by blank lines)
text = "\n".join(text.split("\n\n")[:5])

# Normalize tabs to spaces
text = text.replace('\t', ' ')


# STEP 5: Load GLiNER model

model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")


# Define entity labels (you can adjust these)
labels = ["Person", "Date", "Location", "GPE"]

# Which further labels could be of interest?

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()


# STEP 6: Run entity recognition

token_entities = []

for line in text.split("\n"):
    entities = model.predict_entities(line, labels, threshold=0.5)
    spans = tokenizer.span_tokenize(line)

    print(entities)
    for (start, end) in spans:
        label = ""
        for ent in entities:
            # A token belongs to an entity if their character ranges overlap
            if ent["start"] < end and ent["end"] > start:
                label = ent["label"]
                break
        token = line[start:end]
        token_entities.append((token, label))
        print(token, "\t", label)
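
The cell above imports `csv` but only prints its results. As a minimal sketch of how the `(token, label)` pairs in `token_entities` could be saved for later use, the helper below writes them as tab-separated values. `write_token_labels` and the file name mentioned in the comments are hypothetical, and the sample pairs are hand-written rather than model output.

```python
import csv
import io


def write_token_labels(pairs, handle):
    """Write (token, label) pairs as tab-separated rows with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", "label"])
    writer.writerows(pairs)


# Hand-written sample in the same shape as token_entities
sample_pairs = [("Mark", "Person"), ("visited", ""), ("Paris", "Location")]

# Demonstrate with an in-memory buffer so no file is created
buffer = io.StringIO()
write_token_labels(sample_pairs, buffer)
print(buffer.getvalue())

# To write a real file (hypothetical name), something like this should work:
# with open("token_entities.tsv", "w", encoding="utf-8", newline="") as f:
#     write_token_labels(token_entities, f)
```

Tokens without an entity get an empty label column, which keeps the file aligned one token per row.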




Given the limited list of labels, which further labels would be needed to annotate the named entities in the sample text? You can also try the model on a text in a language other than English.
Note, however, that GLiNER has a default input **length limit of 384 tokens**. This constraint stems from the underlying transformer architecture, which typically has a maximum sequence length of 512 tokens. Running the model line by line, as in the cell above, keeps each input well below this limit.
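
For inputs that are not conveniently split into short lines, one possible workaround is to cut the text into chunks and shift the character offsets of the predictions back to the full text. The sketch below is an illustration under assumptions: `split_into_chunks` and `predict_long` are hypothetical helpers, the budget of 300 whitespace-separated words is a guess at a safe margin (the model's own tokenizer may count punctuation separately), and `fake_predict` stands in for the model so the example runs without downloading it.

```python
import re


def split_into_chunks(text, max_words=300):
    """Return (char_offset, chunk_text) pairs of at most max_words words each."""
    words = list(re.finditer(r"\S+", text))
    chunks = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0].start(), group[-1].end()
        chunks.append((start, text[start:end]))
    return chunks


def predict_long(text, predict_fn, max_words=300):
    """Run predict_fn on each chunk and map offsets back to the full text."""
    entities = []
    for offset, chunk in split_into_chunks(text, max_words):
        for ent in predict_fn(chunk):
            shifted = dict(ent)
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            entities.append(shifted)
    return entities


def fake_predict(chunk):
    """Stand-in for the model: tags every occurrence of 'Paris'."""
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(), "label": "Location"}
        for m in re.finditer(r"Paris", chunk)
    ]


demo_text = "word " * 10 + "Paris " + "word " * 10
found = predict_long(demo_text, fake_predict, max_words=4)
for ent in found:
    print(ent, "->", demo_text[ent["start"]:ent["end"]])
```

With the real model, `predict_fn` could be `lambda chunk: model.predict_entities(chunk, labels, threshold=0.5)`. An entity that straddles a chunk boundary can be missed by this simple scheme; overlapping chunks would be one way to reduce that risk.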

In [None]:
# Try GLiNER on another text of your choice, such as one taken from Gutenberg.org.
# Use wget to download the content from a web link for use in this notebook.
# The example below fetches Alice in Wonderland.
!wget --no-check-certificate https://www.gutenberg.org/cache/epub/11/pg11.txt


# Which named entity types are you interested in recognizing?

In [None]:
# STEP 4: Read text from input file

with open("pg11.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Keep only the first five paragraphs (blocks separated by blank lines)
text = "\n".join(text.split("\n\n")[:5])

# Normalize tabs to spaces
text = text.replace('\t', ' ')


# STEP 5: Load GLiNER model

model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")


# Define entity labels (you can adjust these)
labels = ["Person", "Date", "Location", "GPE"]

# Which further labels could be of interest?

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()


# STEP 6: Run entity recognition

token_entities = []

for line in text.split("\n"):
    entities = model.predict_entities(line, labels, threshold=0.5)
    spans = tokenizer.span_tokenize(line)

    print(entities)
    for (start, end) in spans:
        label = ""
        for ent in entities:
            # A token belongs to an entity if their character ranges overlap
            if ent["start"] < end and ent["end"] > start:
                label = ent["label"]
                break
        token = line[start:end]
        token_entities.append((token, label))
        print(token, "\t", label)
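
Token-level NER output is often exchanged in BIO format, where the first token of an entity is tagged `B-<label>`, following tokens `I-<label>`, and everything else `O`. The sketch below converts pairs shaped like `token_entities` into BIO tags. It assumes that adjacent tokens with the same label belong to the same entity, which is a simplification because the pairs no longer record where one entity ends and the next begins. `to_bio` is a hypothetical helper and the sample pairs are hand-written.

```python
def to_bio(pairs):
    """Convert (token, label) pairs into (token, BIO tag) pairs."""
    tagged = []
    previous = ""
    for token, label in pairs:
        if not label:
            tag = "O"             # token outside any entity
        elif label == previous:
            tag = "I-" + label    # continues the entity started earlier
        else:
            tag = "B-" + label    # first token of a new entity
        tagged.append((token, tag))
        previous = label
    return tagged


# Hand-written sample in the same shape as token_entities
sample_pairs = [
    ("Mark", "Person"),
    ("Twain", "Person"),
    ("visited", ""),
    ("Paris", "Location"),
]

for token, tag in to_bio(sample_pairs):
    print(token, "\t", tag)
```

To apply it to the notebook's results, call `to_bio(token_entities)` after the loop above has finished.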