# **TLFT Lab: Named Entity Recognition and Linking**

**For training, do not forget to make sure that the GPU is activated!**

Runtime -> Change runtime type -> T4 GPU

# Install necessary packages

In [1]:
#!pip install datasets
#!pip install tensorflow
#!pip install evaluate
#!pip install transformers
#!pip install seqeval

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16164 sha256=df4f0f3d33e425aad5b092bab7cfc68ebd3c80a1dc50d9b9d6f5a9961c2392d0
  Stored in directory: /root/.cache/pip/wheels/bc/92/f0/243288f899c2eacdfa8c5f9aede4c71a9bad0ee26a01dc5ead
Successfully built seqeval
Installing coll

# Retrieve necessary files

In [2]:
#!wget https://github.com/gbella/NLP/raw/refs/heads/main/NER/ner.zip
#!wget https://github.com/gbella/NLP/raw/refs/heads/main/NER/run_ner.py
#!unzip ner.zip
#!mkdir output
#!rm -rf transformers/
#!git clone https://github.com/huggingface/transformers.git
#!cp run_ner.py transformers/examples/tensorflow/token-classification/

--2025-03-05 13:19:17--  https://github.com/gbella/NLP/raw/refs/heads/main/NER/ner.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/gbella/NLP/refs/heads/main/NER/ner.zip [following]
--2025-03-05 13:19:17--  https://raw.githubusercontent.com/gbella/NLP/refs/heads/main/NER/ner.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2963108 (2.8M) [application/zip]
Saving to: ‘ner.zip’


2025-03-05 13:19:18 (107 MB/s) - ‘ner.zip’ saved [2963108/2963108]

--2025-03-05 13:19:18--  https://github.com/gbella/NLP/raw/refs/heads/main/NER/run_ner.py
Resolving github.com (github.com)... 140.82.113.4
Connec

# 1. Fine-tune a BERT model for named entity recognition

BERT is a generic language model based on a Transformer encoder. In order to use it for NER, we need to fine-tune it. Below we use an existing Hugging Face script to fine-tune the **bert-base-cased** model for English NER. We need the **cased** model because capitalization is an important cue for names in English. We use the small **base** model for fast training: on our training data, one epoch of fine-tuning takes about 10 minutes on the Google Colab GPU.
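One detail of fine-tuning a subword model for token classification is how word-level labels are spread over subword pieces. The sketch below illustrates the idea, assuming that the `--label_all_tokens` flag used in the command that follows copies each word's label to all of its pieces, as its name suggests. The function name `align_labels` and the split into pieces are hand-written for illustration; they are not taken from the training script or from real tokenizer output.

```python
# Hypothetical example: word-level NER labels and a hand-written
# subword split (not real tokenizer output).
labels = ["B-org", "O"]
pieces_per_word = [["Re", "##uters"], ["reported"]]

def align_labels(labels, pieces_per_word, label_all_tokens=True):
    """Spread word-level labels over subword pieces.

    With label_all_tokens=True every piece inherits its word's label;
    otherwise only the first piece is labeled and the others get None,
    meaning "ignore this position when computing the loss".
    """
    aligned = []
    for label, pieces in zip(labels, pieces_per_word):
        for i, piece in enumerate(pieces):
            if i == 0 or label_all_tokens:
                aligned.append((piece, label))
            else:
                aligned.append((piece, None))
    return aligned

print(align_labels(labels, pieces_per_word))
print(align_labels(labels, pieces_per_word, label_all_tokens=False))
```

Under this assumption, a model trained this way predicts a tag for every word piece, which is why the predictions have to be merged back into whole names afterwards.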

In [3]:
!python transformers/examples/tensorflow/token-classification/run_ner.py \
  --model_name_or_path bert-base-cased \
  --train_file ner_train.json \
  --validation_file ner_valid.json \
  --test_file ner_test.json \
  --text_column_name tokens \
  --label_column_name labels \
  --output_dir output \
  --label_all_tokens \
  --do_train \
  --num_train_epochs 1


2025-03-05 13:21:07.479552: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-05 13:21:07.479686: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-05 13:21:07.481874: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-05 13:21:07.494006: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data files: 100%|███████████████████| 2/2

In [None]:
# In case training does not work, you can retrieve a trained model from here:
#!wget https://github.com/gbella/NLP/raw/refs/heads/main/NER/NER_finetuned_bert_base.zip
#!mkdir output
#!unzip NER_finetuned_bert_base.zip -d output

# Try out the fine-tuned model
Retrieve a paragraph of text from an English news article that contains several names of well-known entities (cities, countries, famous people, dates, etc.). You can, for example, find appropriate text on http://www.theguardian.com.

In [4]:
from transformers import pipeline
from transformers import BertForTokenClassification
from transformers import AutoTokenizer

# the tokenizer must be that of the checkpoint we fine-tuned
my_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
# the training script saved TensorFlow weights, hence from_tf=True
my_model = BertForTokenClassification.from_pretrained("./output/", from_tf=True)
nlp = pipeline('ner', model=my_model, tokenizer=my_tokenizer)
text = "*** President Donald Trump will consider restoring aid to Ukraine if peace talks are arranged and confidence-building measures are taken, White House national security adviser Mike Waltz said on Wednesday, Reuters reported. ***"
result = nlp(text)
for token in result:
  print(token)

2025-03-05 13:38:01.711090: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-05 13:38:01.711172: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-05 13:38:01.712771: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-05 13:38:01.723308: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-05 13:38:07.264050: I external/local_xla/xla/

{'entity': 'B-per', 'score': 0.9934242, 'index': 4, 'word': 'President', 'start': 4, 'end': 13}
{'entity': 'I-per', 'score': 0.99419785, 'index': 5, 'word': 'Donald', 'start': 14, 'end': 20}
{'entity': 'I-per', 'score': 0.995488, 'index': 6, 'word': 'Trump', 'start': 21, 'end': 26}
{'entity': 'B-geo', 'score': 0.96637505, 'index': 12, 'word': 'Ukraine', 'start': 58, 'end': 65}
{'entity': 'B-org', 'score': 0.9936946, 'index': 26, 'word': 'White', 'start': 138, 'end': 143}
{'entity': 'I-org', 'score': 0.9893708, 'index': 27, 'word': 'House', 'start': 144, 'end': 149}
{'entity': 'B-per', 'score': 0.98461413, 'index': 31, 'word': 'Mike', 'start': 176, 'end': 180}
{'entity': 'I-per', 'score': 0.9902683, 'index': 32, 'word': 'Waltz', 'start': 181, 'end': 186}
{'entity': 'B-tim', 'score': 0.99830854, 'index': 35, 'word': 'Wednesday', 'start': 195, 'end': 204}
{'entity': 'B-org', 'score': 0.97797513, 'index': 37, 'word': 'Re', 'start': 206, 'end': 208}
{'entity': 'B-org', 'score': 0.98367256, 

# 2. Reconstituting named entities from subwords

As you can also observe from the output above, the Transformer's tokeniser produces subword tokens, to which named entity tags are attached. For further processing, we need to reconstitute entire named entities. The goal of this exercise is to collect all named entities as a list or as a set of (name, tag) pairs:

{("New York", "gpe"), ("Jacques Chirac", "per"), ("NATO", "org")}

In [5]:
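words = []
entities = []
for entry in result:
  token = entry["word"]
  entity = entry["entity"]
  if token.startswith("##") and words:
    # subword continuation: glue it to the previous token without a space
    words[-1] = words[-1] + token[2:]
  elif entity.startswith("I") and words:
    # inside tag: the token continues the current entity
    words[-1] = words[-1] + " " + token
  else:
    # begin tag (or a stray continuation with nothing before it): start a new entity
    words.append(token[2:] if token.startswith("##") else token)
    entities.append(entity[2:])
# by using a set, we remove duplicate entries
word_entity_pairs = set(zip(words, entities))
print(word_entity_pairs)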

{('President Donald Trump', 'per'), ('Wednesday', 'tim'), ('Mike Waltz', 'per'), ('Reuters', 'org'), ('Ukraine', 'geo'), ('White House', 'org')}
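An alternative sketch of the same reconstruction uses the `start` and `end` character offsets that the pipeline reports for every token, and slices each name out of the original text instead of gluing token strings together. The function name `merge_entities` is an illustrative assumption; the sample entries are copied from the output above (scores omitted) and the sample text is a shortened prefix of the input.

```python
def merge_entities(text, tagged_tokens):
    """Group B-/I- tagged tokens into (surface form, label) pairs via character offsets."""
    spans = []  # each span is [start, end, label]
    for tok in tagged_tokens:
        prefix, _, label = tok["entity"].partition("-")
        # a token continues the current entity if it is tagged I-... or is a "##" word piece
        continues = prefix == "I" or tok["word"].startswith("##")
        if continues and spans:
            spans[-1][1] = tok["end"]  # extend the current entity to this token's end
        else:
            spans.append([tok["start"], tok["end"], label])
    return [(text[start:end], label) for start, end, label in spans]

sample_text = "*** President Donald Trump will consider restoring aid to Ukraine"
sample_tokens = [
    {"entity": "B-per", "word": "President", "start": 4, "end": 13},
    {"entity": "I-per", "word": "Donald", "start": 14, "end": 20},
    {"entity": "I-per", "word": "Trump", "start": 21, "end": 26},
    {"entity": "B-geo", "word": "Ukraine", "start": 58, "end": 65},
]
print(set(merge_entities(sample_text, sample_tokens)))
```

Slicing the original text preserves its exact spacing and punctuation, which token concatenation can only approximate.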


# 3. Entity Linking: Retrieval of Candidate Entities

We link the entities found in the news article to entries in Wikidata, and retrieve further information about them. We use the *wbsearchentities* endpoint of the Wikidata API (https://www.wikidata.org/w/api.php) to search for entities by name. Note that multiple Wikidata entities can correspond to the same name (e.g. Paris is a city in France, a city in Texas, and also a person's name).

Then, as a second step, we retrieve detailed information about each entity using the *wbgetentities* endpoint. We store all relevant information in a dictionary.
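To make the data flow concrete, the sketch below runs offline on a hand-written, heavily abbreviated dictionary in the shape that *wbgetentities* returns, and extracts the IDs of the instance-of (P31) values that the lab code relies on. The sample is illustrative rather than a real API response, and the helper name `instance_of_types` is hypothetical.

```python
# Hand-written, abbreviated sample in the shape of a wbgetentities response
# (illustrative only; a real response contains many more fields).
sample_response = {
    "entities": {
        "Q22686": {
            "claims": {
                "P31": [
                    {"mainsnak": {"snaktype": "value",
                                  "datavalue": {"value": {"id": "Q5"}}}},
                    # statements of kind "somevalue"/"novalue" carry no datavalue
                    {"mainsnak": {"snaktype": "somevalue"}},
                ]
            }
        }
    }
}

def instance_of_types(entity):
    """Return the IDs of all instance-of (P31) values that are present."""
    types = []
    for statement in entity.get("claims", {}).get("P31", []):
        datavalue = statement["mainsnak"].get("datavalue")
        if datavalue is not None:
            types.append(datavalue["value"]["id"])
    return types

for qid, entity in sample_response["entities"].items():
    print(qid, instance_of_types(entity))
```

Reading the nested fields with `.get()` keeps the lookup from failing on entities that have no claims or on statements without a concrete value.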

In [6]:
import requests
from nltk.metrics.distance import jaccard_distance
#-------------------------------------------------------------
# Querying an entity from Wikidata
#-------------------------------------------------------------
def fetch_wikidata(params):
    """Call the Wikidata API; raise an exception on network or HTTP errors."""
    url = 'https://www.wikidata.org/w/api.php'
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response


params = {
        'action':   'wbsearchentities',
        'format':   'json',
        'language': 'en',
        'uselang':  'en' # set it to the language in which you would like to receive results
    }

WIKITYPES = {"per": ["Q5"], "org": ["Q43229"]} # The Wikidata type equivalents for some of our entity labels (Q5 = human, Q43229 = organization)
SKIPPED_TYPES = ["tim"] # we will not look for such entities in Wikidata
wiki_entities = {}

for word, label in word_entity_pairs:
  if label in SKIPPED_TYPES:
    continue # do not look up Time or Date entities
  params['search'] = word
  data = fetch_wikidata(params).json()
  dataDict = dict(data)

  print("Wikidata entities corresponding to '" + word + "/" + label + "':")

  for hit in dataDict["search"]:  # "hit" avoids overwriting the NER "result" list
    description = hit.get("description", "")
    identifier = hit["id"]
    parameters = {
            'action': 'wbgetentities',
            'format': 'json',
            'ids': identifier,
            'languages': 'en'
        }

    entityDetails = fetch_wikidata(parameters).json()
    for key in entityDetails["entities"]:
      value = entityDetails["entities"][key]
      if "P31" not in value.get("claims", {}):
        # it does not have an instance-of relation => it is not a named entity => skip it
        continue
      entity_types = []
      for typ in value["claims"]["P31"]:
        # statements with "no value" / "unknown value" carry no datavalue
        if "datavalue" in typ["mainsnak"]:
          entity_types.append(typ["mainsnak"]["datavalue"]["value"]["id"])
      if (word,label) not in wiki_entities:
        wiki_entities[(word,label)] = []
      entity = {"uri": hit["concepturi"], "text": hit["match"]["text"], "description": description, "types": entity_types}
      wiki_entities[(word,label)].append(entity)
      print("\t" + str(entity))

Wikidata entities corresponding to 'President Donald Trump/per':
	{'uri': 'http://www.wikidata.org/entity/Q22686', 'text': 'President Donald Trump', 'description': 'president of the United States (2017–2021, 2025–present)', 'types': ['Q5']}
Wikidata entities corresponding to 'Mike Waltz/per':
	{'uri': 'http://www.wikidata.org/entity/Q55386653', 'text': 'Mike Waltz', 'description': 'U.S. National Security Advisor since 2025', 'types': ['Q5']}
	{'uri': 'http://www.wikidata.org/entity/Q95813934', 'text': 'Mike Waltze', 'description': '', 'types': ['Q5']}
Wikidata entities corresponding to 'Reuters/org':
	{'uri': 'http://www.wikidata.org/entity/Q130879', 'text': 'Reuters', 'description': 'international news agency', 'types': ['Q192283', 'Q4830453']}
	{'uri': 'http://www.wikidata.org/entity/Q83377360', 'text': 'Reuters', 'description': 'family name', 'types': ['Q101352']}
	{'uri': 'http://www.wikidata.org/entity/Q1509580', 'text': 'Reuters', 'description': 'human settlement in Germany', 'ty

# 4. Entity Linking: Disambiguation

As multiple entities can correspond to the same name, the candidate entities retrieved in the previous step need to be disambiguated: the entity that is the most relevant to the given text needs to be chosen. This process is called *named entity disambiguation* or *reconciliation*.

As a first filtering step, we can use the named entity label (per, loc, gpe, org, tim, etc.) to constrain Wikidata results. For example, Paris may mean the city in France or a character from the Iliad. If our named entity annotation is "per", then the city entity can be filtered out and the character entity kept.

The *wbgetentities* endpoint of the Wikidata API returns *instance-of* types for its entries: the code for the instance-of property is "P31". The types returned are, however, much more fine-grained than our named entity labels, so mapping one to the other is not trivial. In this simple exercise, for the sake of example, we manually map a few labels to Wikidata types (e.g. the label "per" is mapped to the type "Q5", which denotes humans).

In a second disambiguation step, we will use a crude but effective method: we select the "most common" of the entries, i.e. the meaning that is used most frequently. For example, on the whole, the name "Paris" refers much more frequently to the city in France than to the city in Texas (unless you live in Texas). While Wikidata does not provide such frequency data, we will approximate it by choosing the entry that has the lowest Wikidata ID, on the assumption that well-known entities tended to be added to Wikidata earlier and therefore received lower IDs. In disambiguation, such context-independent, frequency-based methods are widely used as strong baselines, as they typically yield an accuracy well above 50% while being extremely simple.

**Perspective:** in order to improve this baseline disambiguation method, one can combine it with methods that take context into account. For example, one could compare Wikidata entity descriptions to the input text and choose the entity with the most similar description. The similarity could be computed using, for example:
 - a simple Jaccard similarity measure based on the proportion of shared words;
 - cosine similarity over word or document embeddings.
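As a minimal sketch of the first option, the code below scores each candidate by the Jaccard similarity between the word set of its Wikidata description and the word set of the surrounding text, then keeps the best-scoring candidate. The helper names, the lowercase word tokenization, and the shortened context string are assumptions made for illustration; the two candidates are the "Mike Waltz" entries printed in the previous section. NLTK's `jaccard_distance`, imported earlier, computes the complementary distance on sets and could be used instead.

```python
import re

def tokens(s):
    """Lowercased set of the words in a string."""
    return set(re.findall(r"\w+", s.lower()))

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of the two word sets."""
    words_a, words_b = tokens(a), tokens(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

# Shortened context taken from the example sentence.
context = "White House national security adviser Mike Waltz said on Wednesday"
candidates = [
    {"uri": "http://www.wikidata.org/entity/Q55386653",
     "description": "U.S. National Security Advisor since 2025"},
    {"uri": "http://www.wikidata.org/entity/Q95813934",
     "description": ""},
]

# Keep the candidate whose description overlaps most with the context.
best = max(candidates, key=lambda c: jaccard_similarity(c["description"], context))
print(best["uri"])
```

Such a score could be combined with the lowest-ID baseline, for instance by falling back to the baseline when all candidates score zero.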

In [7]:
# First we filter by type: the Wikidata type of the entity should correspond
# to the named entity label returned by NER. For example, the "Q5" type
# in Wikidata means "person", corresponding to our "per" NER annotation.
# However, Wikidata can specify multiple types per entity: it is enough
# if one of these types matches.
filtered_wiki_entities = {}
for (word, label) in wiki_entities:
  entities = wiki_entities[(word, label)]
  for entity in entities:
    for typ in entity["types"]:
      if label not in WIKITYPES or typ in WIKITYPES[label]:
        if (word,label) not in filtered_wiki_entities:
          filtered_wiki_entities[(word,label)] = []
        filtered_wiki_entities[(word,label)].append(entity)
        break

# Now we choose from the list of possible, type-matching entities.
# Many disambiguation methods exist; we will use the crudest one,
# which is nevertheless effective: we select the "most common"
# interpretation. We approximate this "commonness" by taking the
# entry that has the lowest Wikidata ID.
print("Disambiguated entities:")
for (word,label) in filtered_wiki_entities:
  candidates = filtered_wiki_entities[(word,label)]
  # extract the numeric part of the entity identifier from the URI and keep the smallest
  min_entry = min(candidates, key=lambda entry: int(entry["uri"].split("Q")[1]))
  print(word + " -> " + str(min_entry))


Disambiguated entities:
President Donald Trump -> {'uri': 'http://www.wikidata.org/entity/Q22686', 'text': 'President Donald Trump', 'description': 'president of the United States (2017–2021, 2025–present)', 'types': ['Q5']}
Mike Waltz -> {'uri': 'http://www.wikidata.org/entity/Q55386653', 'text': 'Mike Waltz', 'description': 'U.S. National Security Advisor since 2025', 'types': ['Q5']}
Ukraine -> {'uri': 'http://www.wikidata.org/entity/Q212', 'text': 'Ukraine', 'description': 'country in Eastern Europe', 'types': ['Q3624078', 'Q6256', 'Q179164', 'Q7270', 'Q619610', 'Q4209223', 'Q4835091']}
