# Named Entity Recognition Testing

Previously, we identified terminology with heuristic methods. Named entity recognition (NER) should do better: we tag entities in the English source, project them onto the French reference through word alignment, and keep the aligned French tokens as candidate terminology.

This notebook runs on Python 3.8 in a separate virtual environment, specified by requirements_scispacy.txt, because SciSpacy cannot be installed on Python >= 3.9 due to an issue with its nmslib dependency.

In [1]:
import os
from tqdm import tqdm
dir_path = os.getcwd()

In [2]:
#Open our test files first
with open(os.path.join(dir_path, "../wmt22test.txt"), "r", encoding="utf8") as f:
    en_sent = [line.strip() for line in f]
with open(os.path.join(dir_path, "../wmt22gold.txt"), "r", encoding="utf8") as f:
    fr_sent = [line.strip() for line in f]
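
The rest of the notebook pairs sentence i of the English file with sentence i of the French file, so the two lists must have the same length. A minimal sketch of a loader that checks this up front is given below. The helper names `read_stripped_lines` and `load_parallel` are hypothetical, and the demonstration uses throwaway files rather than the WMT22 data.

```python
import os
import tempfile

def read_stripped_lines(path):
    """Return the lines of a UTF-8 text file with surrounding whitespace removed."""
    with open(path, "r", encoding="utf8") as handle:
        return [line.strip() for line in handle]

def load_parallel(src_path, tgt_path):
    """Load two line-aligned files and fail early if their line counts differ."""
    src = read_stripped_lines(src_path)
    tgt = read_stripped_lines(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(f"Line count mismatch: {len(src)} source vs {len(tgt)} target")
    return src, tgt

# Demonstration on temporary files with made-up content
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_src = os.path.join(tmp_dir, "src.txt")
    demo_tgt = os.path.join(tmp_dir, "tgt.txt")
    with open(demo_src, "w", encoding="utf8") as handle:
        handle.write("A first sentence. \nA second sentence.\n")
    with open(demo_tgt, "w", encoding="utf8") as handle:
        handle.write("Une première phrase.\nUne deuxième phrase.\n")
    demo_en, demo_fr = load_parallel(demo_src, demo_tgt)

print(len(demo_en), len(demo_fr))
```

Failing early here is cheaper than discovering a misalignment after the tagging and alignment steps have run.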

In [3]:
import spacy

#en_tagger = spacy.load("en_core_web_trf") #We will use the same scibert model for EN tokenisation for consistency.
fr_tagger = spacy.load("fr_dep_news_trf")
en_ner = spacy.load("en_core_sci_scibert")

In [27]:
#Named entities == terminology we must get right
ner_locs = []
ner_ents = []
for sentence in en_sent:
    ents = en_ner(sentence).ents
    ner_ents.append(ents)
    #Record the token indices covered by each entity span; we later look these up in the word alignments to obtain (predicted) translations
    ner_locs.append([list(range(ent.start, ent.end)) for ent in ents])

In [28]:
ner_locs[3]

[[6, 7], [9], [10], [12], [14], [15], [16, 17]]

In [45]:
ner_ents[3] #For cross-checking purposes

(systematic review,
 meta-analysis,
 investigating,
 evidence,
 treating,
 recalcitrant,
 auricular keloids)
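
The index lists above follow from how spaCy spans are delimited: `start` is the first token of the span and `end` is one past the last. The sketch below reproduces that expansion without loading a model. `ToySpan` is a hypothetical stand-in for a spaCy span, and the three spans are chosen to match the first, second and last entries of the output above.

```python
from dataclasses import dataclass

@dataclass
class ToySpan:
    """Hypothetical stand-in for a spaCy span: token offsets, end exclusive."""
    start: int
    end: int

def span_token_indices(spans):
    """Expand each span into the list of token indices it covers."""
    return [list(range(span.start, span.end)) for span in spans]

toy_spans = [ToySpan(6, 8), ToySpan(9, 10), ToySpan(16, 18)]
toy_locs = span_token_indices(toy_spans)
print(toy_locs)  # [[6, 7], [9], [16, 17]]
```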

In [30]:
en_tagged = []
for sentence in tqdm(en_sent):
    en_tagged_sent = en_ner(sentence)
    en_tagged_tokenised = list(en_tagged_sent) #Keep the spaCy tokens; we index them using ner_locs and read their text when aligning
    en_tagged.append(en_tagged_tokenised)

100%|████████████████████████████████████████████████████████████████████████████████| 588/588 [00:43<00:00, 13.43it/s]


In [7]:
fr_tagged = []
for sentence in tqdm(fr_sent):
    fr_tagged_sent = fr_tagger(sentence)
    fr_tagged_tokenised = list(fr_tagged_sent)
    fr_tagged.append(fr_tagged_tokenised)

100%|████████████████████████████████████████████████████████████████████████████████| 588/588 [00:48<00:00, 12.14it/s]


In [8]:
#Word alignment
from simalign import SentenceAligner
aligner = SentenceAligner(matching_methods="a") #Argmax only; the simAlign paper stated that this gives the best performance for English-French alignment

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-06-25 23:19:38,779 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased


In [31]:
alignments_list = []
for en_tokens, fr_tokens in tqdm(zip(en_tagged, fr_tagged), total=len(en_tagged)):
    src_sent = [token.text for token in en_tokens]
    tgt_sent = [token.text for token in fr_tokens]
    alignments = aligner.get_word_aligns(src_sent, tgt_sent)
    alignments_list.append(alignments["inter"]) #simalign stores the argmax ("a") alignments under the key "inter"

100%|████████████████████████████████████████████████████████████████████████████████| 588/588 [01:28<00:00,  6.68it/s]


In [33]:
#Sanity check: there should be one alignment list per sentence. The NE token indices are the keys used to look up their aligned French tokens, i.e. the gold-standard terminology.
print(len(ner_locs))
print(len(alignments_list))

588
588


In [32]:
alignments_list[0]

[(0, 3),
 (1, 2),
 (2, 1),
 (3, 4),
 (5, 6),
 (7, 7),
 (7, 8),
 (8, 11),
 (9, 12),
 (11, 13),
 (12, 14)]

In [34]:
prospective_ne_fr = []
for i in range(len(alignments_list)):
    ne_fr_per_sentence = []
    for ne_token_span_list in ner_locs[i]: #ner_locs[i] holds one list of token indices per NE span, e.g., "auricular keloids"
        for ne_token_idx in ne_token_span_list: #For each token which is part of an NE span
            for src_idx, tgt_idx in alignments_list[i]:
                if src_idx == ne_token_idx:
                    #Collect the aligned French token, BUT we cannot assume that the aligned words stay adjacent (manual checking showed this)
                    ne_fr_per_sentence.append(fr_tagged[i][tgt_idx])
    prospective_ne_fr.append(ne_fr_per_sentence)
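
The lookup above rescans the full alignment list for every entity token and flattens all spans into a single list per sentence. As an illustration only, the sketch below indexes the alignment pairs by source token once and keeps the projected target indices grouped per span. The function names are hypothetical, `pairs` copies the `alignments_list[0]` output printed earlier, and `toy_spans` is made up for the demonstration.

```python
from collections import defaultdict

def build_alignment_index(alignment_pairs):
    """Map each source token index to the target indices aligned with it."""
    index = defaultdict(list)
    for src_idx, tgt_idx in alignment_pairs:
        index[src_idx].append(tgt_idx)
    return index

def project_spans(span_locs, alignment_pairs):
    """Return, for each source span, the sorted target indices it aligns to."""
    index = build_alignment_index(alignment_pairs)
    projected = []
    for span in span_locs:
        targets = sorted({tgt for src in span for tgt in index.get(src, [])})
        projected.append(targets)
    return projected

# Alignment pairs as printed for alignments_list[0]
pairs = [(0, 3), (1, 2), (2, 1), (3, 4), (5, 6), (7, 7),
         (7, 8), (8, 11), (9, 12), (11, 13), (12, 14)]
toy_spans = [[1, 2], [7], [10]]  # hypothetical entity spans
projected_spans = project_spans(toy_spans, pairs)
print(projected_spans)  # [[1, 2], [7, 8], []]
```

Keeping the grouping makes it visible when a span has no aligned target at all (the empty list) or when one source token maps to several target tokens.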

In [35]:
#Manual inspection shows that some aligned tokens are not terms at all, e.g., prepositions such as "de". We remove them using PoS tagging.
import string
ne_fr_pnav_only = [] #Proper nouns, nouns, adjectives, verbs
for i in range(len(prospective_ne_fr)):
    include = [token.text.replace("</i>", "").replace("</sup", "") #Some entities captured HTML tags as well; these are not part of the terminology
               for token in prospective_ne_fr[i]
               if token.pos_ in ["PROPN", "NOUN", "ADJ", "VERB"] and #Keep specific PoS only: adverbs, auxiliaries, prepositions etc. are not terminology.
               #Medical terminology is mostly nouns and adjectives, but proper nouns and verbs denoting specific medical actions may be important too (subject to noise due to gender, etc.)
               token.text not in string.punctuation] #Remove punctuation-only tokens
    ne_fr_pnav_only.append(include)
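
One caveat about the punctuation test: `text not in string.punctuation` is a substring check against the string of ASCII punctuation characters, so it only catches single characters or runs that happen to occur in that string in the same order. The sketch below compares it with a per-character check. `is_punctuation_only` is a hypothetical helper, not part of the notebook, and whether the difference matters in practice depends on what the PoS filter already removes.

```python
import string

def is_punctuation_only(text):
    """True if text is non-empty and every character is ASCII punctuation."""
    return bool(text) and all(ch in string.punctuation for ch in text)

samples = ["...", "de", ")", "</i>", ""]

# Substring semantics, as in `token.text in string.punctuation`
substring_check = {s: s in string.punctuation for s in samples}
# Per-character semantics
per_char_check = {s: is_punctuation_only(s) for s in samples}

for sample in samples:
    print(repr(sample), substring_check[sample], per_char_check[sample])
```

With the substring check, "..." is not recognised as punctuation while the empty string is; the per-character check reverses both results.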

In [36]:
#At this point we almost have a candidate terminology. We just need to aggregate terminology counts per sentence. Let's flatten this into a pandas dataframe first.
import pandas as pd
sent_IDs = []
terms = []
for i in range(len(ne_fr_pnav_only)):
    for term in ne_fr_pnav_only[i]:
        sent_IDs.append(i)
        terms.append(term)
term_list = pd.DataFrame(data = {"sent_ID" : sent_IDs, "term" : terms})

In [37]:
#Now we simply count occurrences per sentence. Matching is case-sensitive here; a separate list without casing can be generated afterwards.
def find_count_in_sentence_exact_match(row):
    query = row["term"]
    return sum(1 for found in fr_tagged[row["sent_ID"]] if found.text == query)

In [38]:
term_list = term_list.drop_duplicates().reset_index(drop=True)

In [39]:
term_list["count"] = term_list.apply(find_count_in_sentence_exact_match, axis=1)

In [40]:
term_list

      sent_ID            term  count
0           0  récalcitrantes      1
1           0    auriculaires      1
2           0       chéloïdes      1
3           0      traitement      1
4           1           élevé      1
...       ...             ...    ...
3454      587        athlètes      1
3455      587       contraire      1
3456      587         fatigue      1
3457      587       dynamique      1


In [41]:
with open("wmt22gold_terminology_ner.txt", "w", encoding="utf8") as output:
    for sent_id, term, count in zip(term_list["sent_ID"], term_list["term"], term_list["count"]):
        output.write(f"{sent_id}\t{term}\t{count}\n")
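
The export writes tab-separated lines by string concatenation, which is fine as long as no term contains a tab or a newline. As a hedged alternative, the standard `csv` module can write the same layout and quotes such fields when they occur. The sketch below writes to an in-memory buffer; `write_term_rows` is a hypothetical helper and the two rows are taken from the table above.

```python
import csv
import io

def write_term_rows(rows, handle):
    """Write (sent_ID, term, count) rows as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

toy_rows = [(0, "chéloïdes", 1), (0, "traitement", 1)]
buffer = io.StringIO()
write_term_rows(toy_rows, buffer)
tsv_text = buffer.getvalue()

# Reading the text back recovers the same fields (as strings)
read_back = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(read_back)
```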

In [42]:
#Next, we would generate a separate list without casing and aggregate counts as before. However, lowercasing and deduplicating leaves the same number of rows,
#so no term appears more than once within a single sentence under different casing. This means that we can stop here for now.
#term_list_uncased = term_list[["sent_ID", "term"]]
#term_list_uncased["uncased_term"] = term_list_uncased["term"].apply(str.lower)
#term_list_uncased = term_list_uncased.drop(columns = "term")
#term_list_uncased = term_list_uncased.drop_duplicates().reset_index(drop=True)
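
The check described in the comment (compare the number of unique rows before and after lowercasing) can be sketched without pandas as follows. The helper name `uncased_unique` and the toy pairs are hypothetical. Unlike the notebook data, the toy data deliberately contains a casing variant so that the two counts differ.

```python
def uncased_unique(pairs):
    """Deduplicate (sent_ID, term) pairs after lowercasing the term, keeping order."""
    seen = set()
    unique = []
    for sent_id, term in pairs:
        key = (sent_id, term.lower())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Made-up data: sentence 0 contains the same term under two casings
toy_pairs = [(0, "Traitement"), (0, "traitement"), (1, "fatigue")]
cased_count = len(set(toy_pairs))
uncased_count = len(uncased_unique(toy_pairs))
print(cased_count, uncased_count)  # 3 2
```

Equal counts, as observed in the notebook, mean that the cased and uncased lists carry the same information per sentence.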