# Zero-shot NER with GLiNER

I did some quick research on pretrained nested NER models and found a fairly recent paper, "[GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)", submitted to arXiv in November 2023. The authors report that their model outperforms LLMs at zero-shot NER while requiring far fewer computational resources. Conveniently, the model has a multilingual version and comes with a handy Python wrapper library, so I decided to try it out.

In [1]:
from pathlib import Path
import warnings
import json
from tqdm import tqdm

warnings.filterwarnings("ignore")

from gliner import GLiNER

2024-04-14 23:59:17.123834: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1", map_location="cuda").cuda()

## Example with nested NER

In [3]:
text = """
Moscow Drama Theater named after M. N. Yermolova
"""

labels = ["CITY", "ORGANIZATION", "PERSON"]

entities = model.predict_entities(text, labels, flat_ner=False, threshold=0.5)

for entity in entities:
    print(entity)

{'start': 1, 'end': 7, 'text': 'Moscow', 'label': 'CITY', 'score': 0.9573017358779907}
{'start': 1, 'end': 21, 'text': 'Moscow Drama Theater', 'label': 'ORGANIZATION', 'score': 0.8920192718505859}
{'start': 34, 'end': 49, 'text': 'M. N. Yermolova', 'label': 'PERSON', 'score': 0.787686288356781}
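
With `flat_ner=False` the model is allowed to return overlapping spans, which is why "Moscow" appears both on its own and inside "Moscow Drama Theater". As a minimal sketch (the helper `find_nested` is my own illustration, not part of the `gliner` library), nested pairs can be listed from the predicted character offsets like this:

```python
def find_nested(entities):
    """Return (inner_text, outer_text) pairs where one span lies strictly inside another."""
    pairs = []
    for inner in entities:
        for outer in entities:
            if inner is outer:
                continue
            contains = outer["start"] <= inner["start"] and inner["end"] <= outer["end"]
            strictly_larger = (outer["end"] - outer["start"]) > (inner["end"] - inner["start"])
            if contains and strictly_larger:
                pairs.append((inner["text"], outer["text"]))
    return pairs


# Spans copied from the output above (scores omitted).
example_entities = [
    {"start": 1, "end": 7, "text": "Moscow", "label": "CITY"},
    {"start": 1, "end": 21, "text": "Moscow Drama Theater", "label": "ORGANIZATION"},
    {"start": 34, "end": 49, "text": "M. N. Yermolova", "label": "PERSON"},
]

nested_pairs = find_nested(example_entities)
print(nested_pairs)  # [('Moscow', 'Moscow Drama Theater')]
```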


In [4]:
tokens = [tk for tk, _, _ in model.token_splitter(text)]
tokens

['Moscow',
 'Drama',
 'Theater',
 'named',
 'after',
 'M',
 '.',
 'N',
 '.',
 'Yermolova']

In [5]:
print(model.evaluate(
    [{"tokenized_text": tokens, "ner": [[0, 9, "ORGANIZATION"], [0, 0, "CITY"], [5, 9, "PERSON"]]}],
    flat_ner=False, entity_types=labels)[0])

P: 66.67%	R: 66.67%	F1: 66.67%
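
The 66.67% figures are consistent with exact-match scoring: two of the three predicted spans (CITY and PERSON) coincide with the gold spans, while the predicted ORGANIZATION covers only tokens 0 to 2 instead of 0 to 9. The sketch below reproduces the arithmetic under that assumption. The function `span_prf` is hypothetical, and the library's own scoring may differ in details.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) token spans."""
    gold_set = {tuple(span) for span in gold}
    pred_set = {tuple(span) for span in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold spans from the cell above; predictions from In [3] mapped to token indices.
gold_spans = [(0, 9, "ORGANIZATION"), (0, 0, "CITY"), (5, 9, "PERSON")]
predicted_spans = [(0, 0, "CITY"), (0, 2, "ORGANIZATION"), (5, 9, "PERSON")]

precision, recall, f1 = span_prf(gold_spans, predicted_spans)
print(f"P: {precision:.2%}\tR: {recall:.2%}\tF1: {f1:.2%}")
```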


## Eval on the train set

In [6]:
DATA_PATH = Path("../data")

In [7]:
labels = (DATA_PATH / "ners.txt").read_text().split()
print(len(labels), labels)

29 ['AGE', 'AWARD', 'CITY', 'COUNTRY', 'CRIME', 'DATE', 'DISEASE', 'DISTRICT', 'EVENT', 'FACILITY', 'FAMILY', 'IDEOLOGY', 'LANGUAGE', 'LAW', 'LOCATION', 'MONEY', 'NATIONALITY', 'NUMBER', 'ORDINAL', 'ORGANIZATION', 'PENALTY', 'PERCENT', 'PERSON', 'PRODUCT', 'PROFESSION', 'RELIGION', 'STATE_OR_PROVINCE', 'TIME', 'WORK_OF_ART']


In [8]:
train_data = [json.loads(line) for line in
              (DATA_PATH / "train.jsonl").read_text().strip().split("\n")]

In [9]:
train_data[-1]

{'ners': [[0, 4, 'PERSON'],
  [16, 25, 'PROFESSION'],
  [27, 38, 'PERSON'],
  [85, 96, 'PERSON'],
  [98, 106, 'PROFESSION'],
  [112, 124, 'PERSON'],
  [154, 163, 'PROFESSION'],
  [165, 176, 'PERSON'],
  [200, 233, 'DATE'],
  [300, 306, 'DATE'],
  [328, 339, 'COUNTRY'],
  [359, 373, 'PERSON'],
  [388, 393, 'PERSON'],
  [397, 408, 'PERSON'],
  [448, 474, 'DATE'],
  [510, 514, 'PERSON'],
  [528, 544, 'FACILITY'],
  [656, 664, 'AGE'],
  [706, 717, 'COUNTRY'],
  [769, 778, 'DATE'],
  [820, 827, 'DATE'],
  [882, 893, 'PERSON'],
  [1027, 1043, 'TIME'],
  [1047, 1052, 'ORDINAL'],
  [1064, 1079, 'PROFESSION'],
  [1186, 1197, 'PERSON'],
  [1270, 1281, 'COUNTRY'],
  [1293, 1297, 'PERSON'],
  [1300, 1306, 'NATIONALITY'],
  [1316, 1323, 'NATIONALITY'],
  [1336, 1342, 'PERSON'],
  [1426, 1435, 'COUNTRY'],
  [1447, 1462, 'PERSON'],
  [1466, 1473, 'DATE'],
  [1524, 1525, 'NUMBER'],
  [1542, 1543, 'NUMBER'],
  [1554, 1563, 'NATIONALITY'],
  [1578, 1596, 'PERSON'],
  [1623, 1629, 'FACILITY'],
  [1643, 1

In [14]:
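def convert_to_gliner_format(data: dict) -> dict:
    """Convert character-level spans to the token-level format used by `model.evaluate`."""
    tokens = []
    pos_to_token_idx = {}  # character position -> index of the token covering it
    for idx, (token, start, end) in enumerate(model.token_splitter(data["sentences"])):
        tokens.append(token)
        for pos in range(start, end):
            pos_to_token_idx[pos] = idx

    # Both ends of a span are looked up directly, so each must fall inside a token.
    return {"tokenized_text": tokens,
            "ner": [[pos_to_token_idx[start], pos_to_token_idx[end], label] for start, end, label in
                    data.get("ners", [])]}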
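
To check the conversion without loading the model, here is a self-contained sketch of the same idea. It assumes the dataset uses inclusive end offsets (suggested by the `- 1` applied when writing the submission) and swaps in a simple regex tokenizer, `simple_token_splitter`, as a hypothetical stand-in for `model.token_splitter`.

```python
import re


def simple_token_splitter(text):
    """Hypothetical stand-in for model.token_splitter: yields (token, start, end)."""
    for match in re.finditer(r"\w+|\S", text):
        yield match.group(), match.start(), match.end()


def char_spans_to_token_spans(text, char_spans):
    """Map [char_start, char_end_inclusive, label] spans to token-index spans."""
    tokens = []
    pos_to_token_idx = {}
    for idx, (token, start, end) in enumerate(simple_token_splitter(text)):
        tokens.append(token)
        for pos in range(start, end):
            pos_to_token_idx[pos] = idx
    # An offset pointing at whitespace would raise KeyError here.
    token_spans = [[pos_to_token_idx[s], pos_to_token_idx[e], label]
                   for s, e, label in char_spans]
    return tokens, token_spans


sample_text = "Moscow Drama Theater named after M. N. Yermolova"
sample_char_spans = [[0, 47, "ORGANIZATION"], [0, 5, "CITY"], [33, 47, "PERSON"]]

sample_tokens, sample_token_spans = char_spans_to_token_spans(sample_text, sample_char_spans)
print(sample_token_spans)  # [[0, 9, 'ORGANIZATION'], [0, 0, 'CITY'], [5, 9, 'PERSON']]
```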

In [11]:
converted = [convert_to_gliner_format(example) for example in tqdm(train_data)]

100%|██████████| 519/519 [00:00<00:00, 4477.02it/s]


In [13]:
BATCH_SIZE = 4

In [12]:
print(model.evaluate(converted, flat_ner=False, entity_types=labels, batch_size=BATCH_SIZE)[0])

P: 79.82%	R: 26.55%	F1: 39.84%


I am pleased that I was able to run this model on my laptop GPU, and the scores are reasonable for a zero-shot setup: precision is high (79.82%), but recall is low (26.55%), which pulls F1 down to 39.84%. Unfortunately, the library does not support changing the `average` parameter of the F1 metric, so the results above are micro-averaged.
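
To illustrate why the averaging mode matters, here is a small sketch comparing micro and macro F1. The per-label counts are toy numbers I made up for illustration, not results from this dataset, and `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy (tp, fp, fn) counts per label: one frequent, easy label and one rare, hard label.
toy_counts = {"PERSON": (90, 10, 10), "AWARD": (1, 1, 9)}

# Micro: pool the counts over all labels, then compute a single F1.
micro_f1 = f1_from_counts(*(sum(column) for column in zip(*toy_counts.values())))

# Macro: compute F1 per label, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*counts) for counts in toy_counts.values()) / len(toy_counts)

print(f"micro F1: {micro_f1:.3f}, macro F1: {macro_f1:.3f}")
```

With counts like these, the micro average is dominated by the frequent label, while the macro average is pulled down by the rare one, so a leaderboard using macro averaging could rank the same predictions quite differently.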

## Making a submission

In [21]:
test_data = [json.loads(line) for line in (DATA_PATH / "dev.jsonl").read_text().strip().split("\n")]
test_texts = [data["sentences"] for data in test_data]

In [25]:
def make_batches(data: list[str], batch_size: int):
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

all_preds = []
for batch in tqdm(make_batches(test_texts, BATCH_SIZE)):
    all_preds.extend(model.batch_predict_entities(batch, labels, flat_ner=False))

100%|██████████| 17/17 [00:16<00:00,  1.00it/s]


In [26]:
len(all_preds), all_preds[0][:5]

(65,
 [{'start': 0,
   'end': 17,
   'text': 'Генерал Д.Петреус',
   'label': 'PERSON',
   'score': 0.8841291069984436},
  {'start': 0,
   'end': 7,
   'text': 'Генерал',
   'label': 'PERSON',
   'score': 0.8152154684066772},
  {'start': 8,
   'end': 17,
   'text': 'Д.Петреус',
   'label': 'PERSON',
   'score': 0.7756668329238892},
  {'start': 41,
   'end': 44,
   'text': 'ЦРУ',
   'label': 'ORGANIZATION',
   'score': 0.9806093573570251},
  {'start': 47,
   'end': 68,
   'text': 'Генерал Дэвид Петреус',
   'label': 'PERSON',
   'score': 0.9436295628547668}])

In [30]:
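submission_lines = [
    json.dumps({"ners": [[p["start"], p["end"] - 1, p["label"]] for p in preds], "id": data["id"]})
    for preds, data in zip(all_preds, test_data)
]
# GLiNER returns exclusive `end` offsets; the `- 1` converts them to inclusive ends.
(DATA_PATH / "test.jsonl").write_text("\n".join(submission_lines))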

31565
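
The only transformation applied to the predictions is the end-offset shift: GLiNER's `end` is exclusive (`text[start:end]` gives the entity), while the submission stores the index of the last character. A minimal sketch of the per-example conversion follows, where `to_submission_line` is a hypothetical helper and `example_id=0` is a placeholder rather than a real id from the dataset.

```python
import json


def to_submission_line(preds, example_id):
    """Serialize one example's predictions with inclusive end offsets."""
    return json.dumps({
        "ners": [[p["start"], p["end"] - 1, p["label"]] for p in preds],
        "id": example_id,
    })


# Two predictions copied from the output above (scores omitted).
sample_preds = [
    {"start": 0, "end": 17, "text": "Генерал Д.Петреус", "label": "PERSON"},
    {"start": 41, "end": 44, "text": "ЦРУ", "label": "ORGANIZATION"},
]

sample_line = to_submission_line(sample_preds, example_id=0)
print(sample_line)
```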

Score on leaderboard: 0.25