# Terminology Tagging for Choi et al. (2022)'s Soft Constraint Method

We have obtained noun alignments for 15% of our training set. We now determine the intersection of these aligned nouns with our filtered glossary, randomly select up to three noun-noun pairs per sentence, and annotate the source side of each sentence with the `<term_start>`, `<term_end>`, and `<term_trans>` tokens.
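For orientation, the annotation layout used in this notebook can be sketched as follows. The helper name `tag_term` is hypothetical and does not appear in the notebook; the tag order follows the annotated examples shown at the end of the notebook.

```python
# Hypothetical helper (not part of the notebook): builds the inline
# annotation for one English term and its French translation.
def tag_term(en_term: str, fr_term: str) -> str:
    return f"<term_start> {en_term} <term_end> {fr_term} <term_trans>"


# Term pair taken from sentence 51973 at the end of this notebook.
print(tag_term("case", "cas"))
# prints: <term_start> case <term_end> cas <term_trans>
```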

In [1]:
from datasets import load_dataset, Dataset

#Converts data in src [TAB] tgt [NEWLINE] format to a format suitable for model training
def convertToDictFormat(data):
    source = []
    target = []
    for example in data:
        example = example.strip()
        sentences = example.split("\t")
        source.append(sentences[0])
        target.append(sentences[1])
    ready = Dataset.from_dict({"en":source, "fr":target})
    return ready

In [2]:
#Load in clean glossary and convert to Dataset object
glossary_terms = load_dataset("ethansimrm/choi_filtered_cleaned_glossary", split = "train")
terms_ready = convertToDictFormat(glossary_terms['text'])

Found cached dataset text (C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--choi_filtered_cleaned_glossary-55c8ab5554eb133f/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)


In [3]:
#Read in noun alignments
import pandas as pd
noun_alignments_df = pd.read_csv("noun-alignments.txt", sep = "\t", header = None, names = ["sent_ID", "en", "fr"])

In [66]:
#Add lowercase copies of both columns for case-insensitive matching
noun_alignments_df["en_lower"] = noun_alignments_df["en"].str.lower()
noun_alignments_df["fr_lower"] = noun_alignments_df["fr"].str.lower()

In [84]:
#We also need to drop glossary entries which differ only in casing; otherwise, the same aligned noun pair would match several entries and be recorded more than once per sentence.
#We do this by lowercasing every term, dropping duplicate rows in a pandas DataFrame, and converting back into a Dataset.
lower_en_terms = []
lower_fr_terms = []
for en_term in terms_ready["en"]:
    lower_en_terms.append(en_term.lower())
for fr_term in terms_ready["fr"]:
    lower_fr_terms.append(fr_term.lower())
terms_lowercased = Dataset.from_dict({"en":lower_en_terms, "fr":lower_fr_terms})
lowercase_terms_df = pd.DataFrame(terms_lowercased).drop_duplicates(ignore_index = True)
lowercase_terms_ready = Dataset.from_pandas(lowercase_terms_df)
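The same case-insensitive de-duplication can be illustrated without pandas. This is a minimal sketch, assuming the glossary is available as a list of `(en, fr)` tuples; the function name `dedupe_case_insensitive` and the toy glossary are illustrative and not part of the notebook.

```python
def dedupe_case_insensitive(pairs):
    """Lowercase (en, fr) pairs and drop repeats, keeping first-seen order."""
    lowered = ((en.lower(), fr.lower()) for en, fr in pairs)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(lowered))


# Toy glossary: the first two entries differ only in casing.
glossary = [("fever", "fievre"), ("Fever", "Fievre"), ("abdomen", "abdomen")]
print(dedupe_case_insensitive(glossary))
# prints: [('fever', 'fievre'), ('abdomen', 'abdomen')]
```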

In [86]:
#Determine the intersection case-insensitively using vectorised comparisons on the lowercase columns - this also captures term pairs whose casing differs because of sentence structure
from tqdm import tqdm
found = []
for term in tqdm(lowercase_terms_ready):
    #Glossary terms are already lowercase, so no further .lower() call is needed
    sentences_with_term = noun_alignments_df[(noun_alignments_df["en_lower"] == term["en"]) & (noun_alignments_df["fr_lower"] == term["fr"])]
    for row in sentences_with_term.itertuples(index=False):
        found.append([row[0], row[1], row[2]]) #Keep the original sentence casing (not the lowercase columns) so that later string replacement matches the text

100%|██████████| 4970/4970 [02:36<00:00, 31.77it/s]
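As a rough illustration of the intersection step, the matching can also be expressed as a set lookup over lowercase pairs. This is a sketch under assumptions, not the notebook's code: `match_glossary` is a hypothetical name, the inputs are plain tuples rather than a DataFrame and a Dataset, and results come back in alignment order rather than grouped by glossary term.

```python
def match_glossary(alignments, glossary_pairs):
    """Keep aligned noun pairs whose lowercase form is in the glossary.

    alignments: iterable of (sent_id, en, fr) tuples.
    glossary_pairs: iterable of (en, fr) tuples.
    """
    wanted = {(en.lower(), fr.lower()) for en, fr in glossary_pairs}
    # The original casing is returned; lowercase is used only for the lookup.
    return [
        (sent_id, en, fr)
        for sent_id, en, fr in alignments
        if (en.lower(), fr.lower()) in wanted
    ]


# The first two rows appear in the notebook output; the third is an invented non-match.
alignments = [(12284, "abdomen", "Abdomen"), (31281, "fever", "Fievre"), (1, "study", "etude")]
glossary = [("abdomen", "abdomen"), ("fever", "fievre")]
print(match_glossary(alignments, glossary))
# prints: [(12284, 'abdomen', 'Abdomen'), (31281, 'fever', 'Fievre')]
```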


In [87]:
found_nouns = pd.DataFrame(found)

In [88]:
found_nouns

           0        1        2
0       8980  abdomen  abdomen
1      11600  abdomen  abdomen
2      12284  abdomen  Abdomen
3      12687  abdomen  abdomen
4      13272  abdomen  abdomen
...      ...      ...      ...
52941  28067    fever   fievre
52942  31281    fever   Fievre
52943  49679    fever   fievre
52944  13673     PPAR     PPAR


In [89]:
found_nouns.columns = ['sent_ID', 'en', 'fr']

In [91]:
found_nouns["sent_ID"].nunique()

40034

In [92]:
#Perform per-sentence sampling
from tqdm import tqdm
final_annotation_candidates = []
for sentence_ID in tqdm(list(found_nouns["sent_ID"].unique())):
    nouns_to_annotate = found_nouns[found_nouns["sent_ID"] == sentence_ID]
    if (len(nouns_to_annotate) > 3):
        nouns_to_annotate = nouns_to_annotate.sample(n=3, random_state = 42) #Choose a maximum of three nouns per sentence to annotate, seed = 42 for reproducibility
    for row in nouns_to_annotate.itertuples(index=False):
        final_annotation_candidates.append([row[0], row[1], row[2]])

100%|██████████| 40034/40034 [00:22<00:00, 1766.49it/s]
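The per-sentence cap can be sketched in plain Python as follows. This is an illustrative alternative under assumptions, not a reproduction of the cell above: `sample_per_sentence` is a hypothetical name, it uses one seeded `random.Random` shared across sentences, and it will therefore not select the same rows as `DataFrame.sample(random_state=42)`.

```python
import random
from collections import defaultdict


def sample_per_sentence(rows, max_per_sentence=3, seed=42):
    """Keep at most max_per_sentence (sent_id, en, fr) rows per sentence."""
    by_sentence = defaultdict(list)
    for row in rows:
        by_sentence[row[0]].append(row)

    rng = random.Random(seed)  # seeded for reproducibility
    selected = []
    for sent_id in sorted(by_sentence):
        candidates = by_sentence[sent_id]
        if len(candidates) > max_per_sentence:
            # Sample without replacement only when the cap is exceeded.
            candidates = rng.sample(candidates, max_per_sentence)
        selected.extend(candidates)
    return selected


# Toy data: sentence 7 has five candidate pairs, sentence 9 has two.
rows = [(7, f"en{i}", f"fr{i}") for i in range(5)] + [(9, "x", "y"), (9, "z", "w")]
print(sample_per_sentence(rows))
```

Sentences at or below the cap are passed through unchanged, which mirrors the `len(nouns_to_annotate) > 3` check in the cell above.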


In [93]:
selected_nouns = pd.DataFrame(final_annotation_candidates)

In [94]:
selected_nouns.columns = ['sent_ID', 'en', 'fr']

In [95]:
selected_nouns_sorted = selected_nouns.sort_values(by = "sent_ID", ignore_index = True)

In [96]:
#Save for record-keeping purposes - these are all the nouns we will annotate, along with the sentences they are in
selected_nouns_sorted.to_csv('selected_nouns.txt', sep="\t", header = False, index = False)

In [97]:
#Load in training data
sampled_train = load_dataset("ethansimrm/choi_sampled_train", split = "train")
sampled_train_ready = convertToDictFormat(sampled_train['text'])

Found cached dataset text (C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--choi_sampled_train-180f43e915e77b7c/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)


In [98]:
sampled_train_to_modify = sampled_train_ready

In [112]:
#Add tokens to the source side of the training data. We loop over the sentences with their indices and rebuild the Dataset afterwards (Dataset.map() could also supply indices via with_indices=True).
source_sentences = []
target_sentences = []
for sent_num, bitext in tqdm(enumerate(sampled_train_to_modify)):
    nouns_in_sentence = selected_nouns_sorted[selected_nouns_sorted["sent_ID"] == sent_num]
    #Collapse duplicate rows into one row plus a count, so that a term sampled n times is tagged n times in a single replace() call.
    #Replacing once per duplicate row would nest tags, e.g. <term_start> <term_start> EN <term_end> FR <term_trans> <term_end> FR <term_trans>
    nouns_in_sentence = nouns_in_sentence.groupby(nouns_in_sentence.columns.tolist(),as_index=False).size() #Adds additional size column and removes duplicates
    for row in nouns_in_sentence.itertuples(index=False):
        if (bitext['en'].find("<term_start> " + row[1] + " <term_end>") != -1):
            #This source term is already tagged, so the current row differs only in target casing; skip it to avoid nested tags, as we did not preserve positional information during noun alignment (only 78/52437 occurrences affected).
            continue
        #Repeated term pairs within a sentence are kept, since random sampling may select the same pair more than once; row[3] is the number of occurrences to replace
        bitext['en'] = bitext['en'].replace(row[1], "<term_start> " + row[1] + " <term_end> " + row[2] + " <term_trans>", row[3])
    source_sentences.append(bitext['en'])
    target_sentences.append(bitext['fr'])
sampled_train_annotated = Dataset.from_dict({"en":source_sentences, "fr":target_sentences})

97378it [04:05, 396.90it/s]
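The count-limited `str.replace` above matches substrings, so a tag can land inside a longer word or inside a translation that was just inserted. One possible alternative, sketched here under assumptions and not what this notebook ran, is a single `re.sub` pass with word boundaries. The name `annotate_once` and the `term_counts` mapping (from `(en, fr)` to the number of occurrences to tag) are hypothetical.

```python
import re


def annotate_once(sentence, term_counts):
    """Tag whole-word occurrences of each English term in a single pass."""
    # Queue the translations to use for each English surface form.
    pending = {}
    for (en, fr), count in term_counts.items():
        pending.setdefault(en, []).extend([fr] * count)
    if not pending:
        return sentence

    # Longest first, so a longer term is preferred over a term that is its prefix.
    ordered = sorted(pending, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(en) for en in ordered) + r")(?!\w)"
    )

    def tag(match):
        en = match.group(0)
        if not pending[en]:
            return en  # every sampled occurrence of this term is already tagged
        fr = pending[en].pop(0)
        return f"<term_start> {en} <term_end> {fr} <term_trans>"

    # re.sub never rescans its own replacements, so inserted text is not re-tagged.
    return pattern.sub(tag, sentence)


sentence = ("A case of chondrodysplasia difficult to classify: metatropic dwarfism "
            "or Kozlowski type spondylometaphyseal dysplasia?")
terms = {("case", "cas"): 1,
         ("chondrodysplasia", "chondrodysplasie"): 1,
         ("dysplasia", "dysplasie"): 1}
print(annotate_once(sentence, terms))
```

Matching stays case-sensitive, as in the cell above, so `Chromatin` and `chromatin` are treated as separate surface forms.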


In [113]:
#Ready for upload
with open("choi_annotated_sampled_train.txt", "w", encoding = "utf8") as output:
    for bitext in sampled_train_annotated:
        output.write(bitext["en"] + "\t" + bitext["fr"] + "\n")

In [None]:
#Some minor anomalies remain, such as nested terms (e.g., dysplasia inside chondrodysplasia). We correct these manually, as each case has only 1-2 occurrences.
#In all, we have 52359 annotated terms.
'''
Sentence 34351:

Original
<term_start> Chromatin <term_end> <term_start> chromatin <term_end> chromatine <term_trans>e <term_trans> structure. I. Physico-chemical study of chromatin stability.	
Contribution à l'étude de la structure de la chromatine. I. Etude physico-chimique de la stabilité de la chromatine.

Modified
<term_start> Chromatin <term_end> chromatine <term_trans> structure. I. Physico-chemical study of <term_start> chromatin <term_end> chromatine <term_trans> stability.	
Contribution à l'étude de la structure de la chromatine. I. Etude physico-chimique de la stabilité de la chromatine.

Sentence 51973:

Original
A <term_start> case <term_end> cas <term_trans> of <term_start> chondro<term_start> dysplasia <term_end> dysplasie <term_trans> <term_end> chondrodysplasie <term_trans> 
difficult to classify: metatropic dwarfism or Kozlowski type spondylometaphyseal dysplasia?	
Un cas de chondrodysplasie difficilement classable: nanisme métatropique ou dysplasie spondylo-métaphysaire type Kozlowski?

Modified
A <term_start> case <term_end> cas <term_trans> of <term_start> chondrodysplasia <term_end> chondrodysplasie <term_trans> difficult to classify: 
metatropic dwarfism or Kozlowski type spondylometaphyseal <term_start> dysplasia <term_end> dysplasie <term_trans>?	
Un cas de chondrodysplasie difficilement classable: nanisme métatropique ou dysplasie spondylo-métaphysaire type Kozlowski?

Sentence 77857:

Original
<term_start> Hypertension <term_end> <term_start> hypertension <term_end> hypertension <term_trans> <term_trans> is of particular interest, 
because components of the renin-angiotensin system (RAS), which are critically involved in the pathophysiology of hypertension, are also implicated in COVID-19.	
L'hypertension suscite un intérêt particulier, car certaines composantes du système rénine-angiotensine (SRA), dont le rôle est crucial dans la physiopathologie de l'hypertension, 
sont également en cause dans la COVID-19.

Modified
<term_start> Hypertension <term_end> hypertension <term_trans> is of particular interest, because components of the renin-angiotensin system (RAS), 
which are critically involved in the pathophysiology of <term_start> hypertension <term_end> hypertension <term_trans>, are also implicated in COVID-19.	
L'hypertension suscite un intérêt particulier, car certaines composantes du système rénine-angiotensine (SRA), dont le rôle est crucial dans la physiopathologie de l'hypertension, 
sont également en cause dans la COVID-19.

'''
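Manual review of anomalies like those above could be supported by a simple automated check. The sketch below is an assumption-laden illustration, not part of the notebook: `has_malformed_tags` is a hypothetical name, and it only verifies that tags appear in repeating `<term_start>`, `<term_end>`, `<term_trans>` order, so it may not catch every kind of anomaly.

```python
import re

TAG = re.compile(r"<term_(start|end|trans)>")
EXPECTED = ("start", "end", "trans")


def has_malformed_tags(sentence):
    """Return True if tags do not follow the start, end, trans cycle."""
    tags = [m.group(1) for m in TAG.finditer(sentence)]
    if len(tags) % 3 != 0:
        return True  # an incomplete annotation triple
    return any(tag != EXPECTED[i % 3] for i, tag in enumerate(tags))


# Source sides of sentence 34351, copied from the cell above.
original = ("<term_start> Chromatin <term_end> <term_start> chromatin <term_end> "
            "chromatine <term_trans>e <term_trans> structure. I. Physico-chemical "
            "study of chromatin stability.")
modified = ("<term_start> Chromatin <term_end> chromatine <term_trans> structure. "
            "I. Physico-chemical study of <term_start> chromatin <term_end> "
            "chromatine <term_trans> stability.")
print(has_malformed_tags(original), has_malformed_tags(modified))
# prints: True False
```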