## Terminology Generation

The aim here is to build a list of candidate terms from a dataset when no external dictionaries are available. This gives us an independent terminology baseline for our terminology usage rate (i.e., terminology recall) metric, one that does not depend on the dictionaries used in the training data.

In [1]:
import os
from tqdm import tqdm
dir_path = os.getcwd()

In [3]:
#Open our test files first
with open(os.path.join(dir_path, "../../wmt22test.txt"), "r", encoding = "utf8") as f:
    en_sent = [line.strip() for line in f]
with open(os.path.join(dir_path, "../../wmt22gold.txt"), "r", encoding = "utf8") as f:
    fr_sent = [line.strip() for line in f]

In [4]:
#Load our PoS taggers. We choose the transformer-based pipelines for accuracy in tokenisation and PoS-tagging - we are working on the "gold standard", after all.
#Choi et al. (2022) use spaCy as well, as do Ballier et al. (2022). This is good enough for tokenisation and PoS-tagging - we will identify terminology manually due to issues with NER accuracy.
import spacy
en_tagger = spacy.load("en_core_web_trf")
fr_tagger = spacy.load("fr_dep_news_trf")

In [5]:
#We know that medical terminology mostly comprises nouns and adjectives. This is where Part-of-speech tagging comes in!
#Removing conjugated verbs decreases noise, too - there are many forms due to masc/fem/plural etc. But first, let's tokenise and PoS-tag our sentences in line with Choi et al. (2022).
en_tagged = []
for sentence in tqdm(en_sent):
    en_tagged_sent = en_tagger(sentence)
    en_tagged_tokenised = [token for token in en_tagged_sent] #Access PoS tag information later on
    en_tagged.append(en_tagged_tokenised)

100%|██████████| 588/588 [00:41<00:00, 14.30it/s]


In [6]:
fr_tagged = []
for sentence in tqdm(fr_sent):
    fr_tagged_sent = fr_tagger(sentence)
    fr_tagged_tokenised = [token for token in fr_tagged_sent]
    fr_tagged.append(fr_tagged_tokenised)

100%|██████████| 588/588 [00:46<00:00, 12.55it/s]


In [7]:
#Now, let's perform word alignments. simAlign does not output probabilities, so we couldn't use it to filter data, but we can use it to compute word alignments.
#Terminologies must be translated, so we expect good word alignments for them.
from simalign import SentenceAligner
aligner = SentenceAligner(matching_methods="a") #Argmax only; the simAlign paper stated that this gives the best performance for English-French alignment

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-07-06 21:43:10,771 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased


In [8]:
#We will pull out the token text, align that, then check the PoS information.
alignments_list = []
for i in tqdm(range(len(en_tagged))):
    src_sent = [token.text for token in en_tagged[i]]
    tgt_sent = [token.text for token in fr_tagged[i]] 
    alignments = aligner.get_word_aligns(src_sent, tgt_sent)
    alignments_list.append(alignments["inter"])

100%|██████████| 588/588 [01:25<00:00,  6.88it/s]


In [9]:
import string #Need to filter out punctuation

In [87]:
#We want all parts of speech which refer to specific descriptors (ADJ), actions (VERB), and objects (NOUN/PROPN), as these could be specific to the medical domain. Let's save those and inspect them.
output = open("candidate_terminology.txt", "w", encoding = "utf8")
for i in range(len(alignments_list)):
    for aligned_pair in alignments_list[i]:
        src_ind = aligned_pair[0]
        tgt_ind = aligned_pair[1]
        src_tok = en_tagged[i][src_ind]
        tgt_tok = fr_tagged[i][tgt_ind]
        if (((src_tok.pos_ == "NOUN") and (tgt_tok.pos_ == "NOUN")) or 
        ((src_tok.pos_ == "ADJ") and (tgt_tok.pos_ == "ADJ")) or
        ((src_tok.pos_ == "VERB") and (tgt_tok.pos_ == "VERB")) or
        ((src_tok.pos_ == "PROPN") and (tgt_tok.pos_ == "PROPN"))):
            if not ((src_tok.text in string.punctuation) or (tgt_tok.text in string.punctuation)): #Ignore punctuation; these aren't terminology because they are found everywhere, e.g., %
                output.write(str(i) + "\t" + src_tok.text + "\t" + tgt_tok.text + "\n") #We include sentence numbers so we can check terminology for every sentence in the test set.
output.close() 
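
The rule in the cell above keeps an aligned pair only when both tokens carry the same content-word tag and neither is punctuation. Below is a minimal sketch of that rule on hand-made data. The `Tok` tuple is a hypothetical stand-in for a spaCy token (it models only `.text` and `.pos_`), and the sentence pair and alignment are invented for illustration.

```python
import string
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the two attributes used here.
Tok = namedtuple("Tok", ["text", "pos_"])

# The four tags retained by the notebook's filter.
KEEP_POS = {"NOUN", "ADJ", "VERB", "PROPN"}

def keep_pair(src_tok, tgt_tok):
    """Keep a pair when both sides share one of the content-word tags
    and neither token is matched by the punctuation check."""
    same_content_pos = src_tok.pos_ == tgt_tok.pos_ and src_tok.pos_ in KEEP_POS
    is_punct = (src_tok.text in string.punctuation
                or tgt_tok.text in string.punctuation)
    return same_content_pos and not is_punct

# Toy sentence pair and alignment (invented, not taken from the test set).
en = [Tok("acute", "ADJ"), Tok("infection", "NOUN"), Tok("of", "ADP"), Tok("%", "NOUN")]
fr = [Tok("infection", "NOUN"), Tok("aiguë", "ADJ"), Tok("de", "ADP"), Tok("%", "NOUN")]
alignment = [(0, 1), (1, 0), (2, 2), (3, 3)]

kept = [(en[s].text, fr[t].text) for s, t in alignment if keep_pair(en[s], fr[t])]
print(kept)
```

The `of`/`de` pair is dropped because `ADP` is not a retained tag, and the `%` pair is dropped by the punctuation check even though the toy tagger labelled it `NOUN`.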

In [11]:
import pandas as pd
candidates = pd.read_csv("candidate_terminology.txt", sep = "\t", header = None, names = ["sent_ID", "en", "fr"])
#candidates = candidates.drop_duplicates().reset_index() #Don't drop duplicates - these arise because the same word might appear more than once in each sentence.

In [12]:
#So far, so good, but we have a few misalignments (unavoidable due to test set misalignments) and some general vocabulary sprinkled in, e.g., man/woman, etc.
#We don't need to accord general vocabulary the same importance as medical terminology. This means that we should filter based on general-domain frequency, because terminology is rare.
#Let's see how many words are captured using this approach, and what sort of words they are.
most_common_fr = pd.read_excel("Lexique_FR_PoS.xlsx")

In [13]:
#Let's begin by checking the frequency of each word in Lexique. 
def word_frequency(row):
    query = row["fr"].lower() #All our Lexique words are in lowercase
    ans = most_common_fr[most_common_fr["Word"] == query] #Search for word or lemma match
    if (ans.empty):
        ans = most_common_fr[most_common_fr["lemme"] == query]
        if (ans.empty):
            return -1
    if (len(ans) == 1):
        return ans["freqfilms2"].values[0]
    else:
        return ans["freqfilms2"].max() #If we have more than one match, assume the most frequent form (due to plurals, etc.) - we are looking for the prevalence of this concept in the general domain
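
The lookup order in `word_frequency` (surface form first, then lemma, then `-1` for a miss, taking the maximum when several rows match) can be illustrated without pandas. This is a plain-Python sketch, not the notebook's implementation. The column names `Word`, `lemme` and `freqfilms2` come from the cell above, while the helper name `lookup_frequency` and all rows and frequency values are invented.

```python
# Toy lexicon rows; the words and frequency values are invented for illustration.
toy_lexicon = [
    {"Word": "cellules", "lemme": "cellule", "freqfilms2": 3.0},
    {"Word": "sain", "lemme": "sain", "freqfilms2": 5.0},
    {"Word": "sain", "lemme": "sain", "freqfilms2": 1.0},
]

def lookup_frequency(word, lexicon):
    """Return the general-domain frequency of `word`, or -1 if it is absent."""
    query = word.lower()
    # 1) Try an exact match on the surface form.
    matches = [row for row in lexicon if row["Word"] == query]
    # 2) Fall back to a lemma match.
    if not matches:
        matches = [row for row in lexicon if row["lemme"] == query]
    # 3) Signal a miss with -1.
    if not matches:
        return -1
    # Several rows can match (e.g. different parts of speech): keep the largest.
    return max(row["freqfilms2"] for row in matches)

print(lookup_frequency("Cellule", toy_lexicon))        # found via the lemma column
print(lookup_frequency("sain", toy_lexicon))           # two rows, the larger value wins
print(lookup_frequency("staphylocoque", toy_lexicon))  # not in the toy lexicon
```

Returning `-1` rather than `0` keeps words that are missing from the lexicon distinguishable from words that are listed with a zero frequency.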

In [14]:
candidates['frequency'] = candidates.apply(word_frequency, axis=1)

In [14]:
#candidates_sorted = candidates.sort_values(by=['frequency']).reset_index()
#Upon inspection, it is very difficult to make a clean cut, because there are some general terms which aren't in our general-domain corpus at all(!), like months = mois. 
#Conversely, there are some medical terms, like "medicine", which are very frequent. We need to compromise - what do we consider terminology?
#Park et al. (2002) state that the domain-specificity of a term is given by its probability of occurrence in a domain-specific corpus divided by its occurrence in a general corpus.
#Normalising for corpus length, a domain-specific term should occur far more often in the test set compared to a general corpus.
#Thankfully, we have tokenised the test set, which also rids us of the articles (l', d', etc.), allowing easier comparison. The downside is that punctuation comes in, so we must account for that.

In [74]:
all_the_words = []
word_count = 0
for sentence in fr_tagged:
    for word in sentence:
        if (word.text not in string.punctuation):
            all_the_words.append(word.text.lower()) #Avoid capitalisation issues
            word_count += 1
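
One detail of the check above is worth keeping in mind: `text in string.punctuation` is a substring test against a single ASCII string, not a per-character test. The sketch below compares it with a stricter alternative. Both helper names are hypothetical, and the alternative is only one possible choice (spaCy tokens also expose an `is_punct` attribute).

```python
import string

def is_punct_substring(text):
    # Mirrors the check used in this notebook: a substring test.
    return text in string.punctuation

def is_punct_chars(text):
    # Alternative: non-empty and every character is ASCII punctuation.
    return bool(text) and all(ch in string.punctuation for ch in text)

# Compare both checks on a few token strings.
for tok in ["%", "...", "«", ""]:
    print(repr(tok), is_punct_substring(tok), is_punct_chars(tok))
```

The two checks agree on single ASCII marks such as `%`, but the substring test misses `...` and accepts the empty string. Neither catches non-ASCII marks such as French guillemets, so swapping the check may change the token counts reported in this notebook.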

In [75]:
word_count #15476 non-punctuation tokens, i.e., "words".

15476

In [73]:
word_count_en = 0
for sentence in en_tagged:
    for word in sentence:
        if (word.text not in string.punctuation):
            word_count_en += 1
print(word_count_en)

11363


In [17]:
from collections import Counter
in_domain_counts = Counter(all_the_words) #Frequency of all the words in the test set

In [18]:
def in_domain_freq(row):
    query = row["fr"].lower() #All our words are in lowercase
    return (in_domain_counts[query] / word_count) * 1000000 #We were given frequency per 1000000 words for the general corpus, so we must upscale to get frequency per "million words" here too
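
To make the per-million scaling concrete, here is a small sketch on an invented eight-token corpus. The helper name `per_million` is hypothetical.

```python
from collections import Counter

def per_million(counts, total_tokens):
    """Convert raw counts to occurrences per million tokens."""
    return {word: count / total_tokens * 1_000_000 for word, count in counts.items()}

# Invented toy corpus, already lowercased and stripped of punctuation.
tokens = ["le", "patient", "présente", "une", "infection", "le", "patient", "guérit"]
counts = Counter(tokens)
freqs = per_million(counts, len(tokens))

print(freqs["patient"])    # 2 occurrences out of 8 tokens
print(freqs["infection"])  # 1 occurrence out of 8 tokens
```

With the 15476 tokens counted above, a single occurrence already scales to roughly 65 per million, so every word in the test set gets an in-domain frequency of at least that value.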

In [19]:
candidates['ID_frequency'] = candidates.apply(in_domain_freq, axis=1)

In [20]:
import numpy as np
candidates["domain_specificity"] = np.log(candidates["ID_frequency"] / (candidates["frequency"] + 1.01)) #Adding 1.01 keeps the denominator positive for missing words (frequency = -1) and for zero frequencies; the log compresses the scale because our values are large
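
A worked example of the score may help when choosing a cutoff. The sketch below restates the formula with `math.log` on scalar inputs. The function name `domain_specificity` and the input frequencies are invented for illustration.

```python
import math

def domain_specificity(id_freq, general_freq, offset=1.01):
    """Log ratio of in-domain to general-domain frequency (both per million).
    The offset keeps the denominator positive when general_freq is -1 or 0."""
    return math.log(id_freq / (general_freq + offset))

# A word missing from the general lexicon (frequency -1): denominator is 0.01.
rare = domain_specificity(id_freq=64.6, general_freq=-1)

# A word that is frequent in both corpora: ratio close to 2.
common = domain_specificity(id_freq=1000.0, general_freq=500.0)

print(rare, common)
```

Because the score is a natural logarithm, a cutoff of 3 corresponds to an in-domain frequency about 20 times (e^3) the offset general-domain frequency. Words absent from the lexicon always land well above that cutoff, since their denominator is only 0.01.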

In [50]:
candidates_sorted = candidates.sort_values(by=['domain_specificity']).reset_index(drop=True)

In [51]:
#Explore data to find the best cutoff prior to manual removal
candidates_sorted.to_csv("candidates_sorted_by_domain_specificity.tsv", sep = "\t", header = False, index = False)

In [52]:
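#Reset the index so that the positional loop below does not rely on the sort order
junk = candidates_sorted[candidates_sorted["domain_specificity"] < 3].reset_index(drop=True) #Justifies choice of 3 as a conservative cutoff - most of these words are non-terminological.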
output = open("non_term_list_lexique.txt", "w", encoding = "utf8")
for i in range(len(junk["en"])):
    output.write(str(junk["sent_ID"][i]) + "\t" + junk["en"][i] + "\t" + junk["fr"][i] +  "\n")
output.close()

In [53]:
candidates_filtered = candidates_sorted[candidates_sorted["domain_specificity"] >= 3].reset_index(drop=True)
#3 is a conservative cutoff - there are a few non-terminology words here, but not that many compared to < 3. This aims to be a superset of a human-annotated terminology anyway.

In [54]:
candidates_filtered_counts = candidates_filtered[["sent_ID", "en", "fr"]]
candidates_filtered_counts = candidates_filtered_counts.drop_duplicates().reset_index(drop=True)

In [55]:
#We will do exact matching
def find_count_in_sentence(row):
    query = row["fr"]
    return len([found for found in fr_tagged[row["sent_ID"]] if query == found.text])
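
Exact matching here is case-sensitive, unlike the lowercased frequency counts computed earlier. A minimal sketch on an invented token list, using a hypothetical helper `count_exact` that works on plain strings rather than spaCy tokens:

```python
def count_exact(tokens, query):
    """Count tokens whose text is exactly equal to the query (case-sensitive)."""
    return sum(1 for tok in tokens if tok == query)

# Invented sentence, pre-tokenised for illustration.
sentence = ["Les", "bactéries", "Gram", "positives", "et", "les",
            "bactéries", "gram", "négatives"]

print(count_exact(sentence, "bactéries"))  # both occurrences match
print(count_exact(sentence, "Gram"))       # "gram" is not counted
```

Under this scheme `Gram` and `gram` are counted as separate entries, which is why a later cell considers an uncased variant of the list.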

In [56]:
candidates_filtered_counts["count"] = candidates_filtered_counts.apply(find_count_in_sentence, axis=1)

In [57]:
output = open("terminology_heuristic_pre_manual.txt", "w", encoding = "utf8")
for i in range(len(candidates_filtered_counts["fr"])):
    output.write(str(candidates_filtered_counts["sent_ID"][i]) + "\t" + candidates_filtered_counts["en"][i] + "\t" + 
                 candidates_filtered_counts["fr"][i] + "\t" + str(candidates_filtered_counts["count"][i]) + "\n")
output.close()

In [71]:
with open("MANUAL_DONE.txt", "r", encoding = "utf8") as input_manual:
    input_manual_lines = [(line.strip() + "\n") for line in input_manual]
with open("non_term_list_manual.txt", "r", encoding = "utf8") as input_removed:
    input_removed_lines = [(line.strip() + "\n") for line in input_removed]

In [82]:
output_manual = open("term_list_manual_filtered.txt", "w", encoding = "utf8")
for line in input_manual_lines:
    output_manual.write(line)
output_manual.close()
output_removed = open("removed_term_list_first_pass.txt", "w", encoding = "utf8")
for line in input_removed_lines:
    output_removed.write(line)
output_removed.close()

In [83]:
#Out of 3409 candidate terms, we have eliminated 1914, leaving us with 1495 terms we deem important to their respective sentences.
#I'm somewhat confident in the 1495 terms, but we should look at the eliminated terms again, just in case we've missed something important.
#We will sort by sentence ID and spit it back out into a .txt file.
output_removed = pd.read_csv("removed_term_list_first_pass.txt", sep = "\t", header = None, names = ["sent_ID", "en", "fr", "count"])
output_removed_sorted = output_removed.sort_values(by = ["sent_ID"]).reset_index(drop=True)
output_removed_sorted.to_csv("removed_term_list_first_pass_sorted.txt", sep = "\t", header = False, index = False)

In [84]:
#We also do the same for our selected terms, just in case we've made some errors during selection.
output_manual = pd.read_csv("term_list_manual_filtered.txt", sep = "\t", header = None, names = ["sent_ID", "en", "fr", "count"])
output_manual_sorted = output_manual.sort_values(by = ["sent_ID"]).reset_index(drop=True)
output_manual_sorted.to_csv("term_list_manual_filtered_sorted.txt", sep = "\t", header = False, index = False)

In [86]:
#I will now go sentence by sentence, looking at both sides and considering importance to sentence meaning. If changing a word greatly influences the meaning of a sentence, that
#word is deemed to be terminology. E.g., Gram-POSITIVE vs Gram-NEGATIVE --> both "positive" and "negative" are important to get right (these weren't picked up by Lexique).
#Why didn't we do this at the start? Because it was easier to remove very common non-terminological words (e.g., "month/mois") in batches.
#The final output is in final_list.txt and final_removed_list.txt

In [93]:
#Now, to deduplicate, sort, and create the final list
input_final_list = pd.read_csv("final_list.txt", sep = "\t", header = None, names = ["sent_ID", "en", "fr", "count"])
input_final_list_fr_only = input_final_list[["sent_ID", "fr"]]
input_final_list_fr_only_dedup = input_final_list_fr_only.drop_duplicates().reset_index(drop=True)
input_final_list_fr_only_dedup["count"] = input_final_list_fr_only_dedup.apply(find_count_in_sentence, axis=1)
input_final_list_fr_only_dedup = input_final_list_fr_only_dedup.sort_values(by = ["sent_ID"]).reset_index(drop=True)
input_final_list_fr_only_dedup.to_csv("wmt22_gold_terminology_manual.txt", sep = "\t", header = False, index = False)

In [94]:
#Next, we will generate a separate list without casing, and aggregate counts as usual. Interestingly, we have the same number of rows, indicating that there are no terminologies which 
#appear multiple times within a single sentence, but with different casing. This means that we can stop here for now.
#term_list_uncased = input_final_list_fr_only_dedup[["sent_ID", "fr"]]
#term_list_uncased["uncased_fr"] = term_list_uncased["fr"].apply(str.lower)
#term_list_uncased = term_list_uncased.drop(columns = "fr")
#term_list_uncased = term_list_uncased.drop_duplicates().reset_index(drop=True)
#term_list_uncased