#### Evaluation of BERT-base on the trial set (10 sentences)

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
from transformers import pipeline

# read the tsv file
filename = "./data/trial/tsar2022_en_trial_none_no_noise.tsv"
data = pd.read_csv(filename, sep='\t', header=None, names=["sentence", "complex_word"])

# create an empty dataframe to store the substitutes for evaluation
substitutes_df = pd.DataFrame(columns=["sentence", "complex_word"] + [f"substitute_{i+1}" for i in range(10)])


In [2]:
import logging

In [3]:

from nltk.corpus import wordnet as wn
import spacy
nlp = spacy.load("en_core_web_sm")

import string

In [5]:
# the imports below are only needed when BERTScore is used in the Substitute Selection (SS) step
import bert_score
from bert_score import score

## Bert-base-uncased

In [4]:
# Instantiate the tokenizer and the model

# for bert-base:
lm_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm_model= AutoModelForMaskedLM.from_pretrained("bert-base-uncased")


# Instantiate the fill-mask pipeline with the model
fill_mask= pipeline("fill-mask", lm_model, tokenizer = lm_tokenizer)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### Substitute Generation including noise removal:
##### including the original, unmasked, sentence:

In [28]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")
    
     
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with the mask token
    ## (note: str.replace masks every occurrence of the string, also inside longer words)
    sentence_masked_word = sentence.replace(complex_word, lm_tokenizer.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat= f"{sentence} {lm_tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word with the fill_mask pipeline; skip results without a token_str key, as these caused errors with the ELECTRA models
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
    substitutes = [substitute["token_str"] for substitute in result if "token_str" in substitute]
    print(f"Substitute Generation (SG) step a): initial substitute list: {substitutes}\n")


    
    ## remove noise from the substitutes: skip candidates that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes = [substitute["token_str"].lower().strip() for substitute in result if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f" Substitute list without unwanted punctuation characters: {substitutes}\n")
    except TypeError as error:
        continue
    
    print(f"Substitute Generation (SG) step b): substitute list without empty elements and unwanted characters: {substitutes}\n")
        
        
    # limit the substitutes to the 10 highest ranked ones for evaluation
    top_10_substitutes = substitutes[:10]
    print(f"Substitute Generation (SG) final step c): top-10 substitutes for the complex word '{complex_word}': {top_10_substitutes}\n")
    
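    # add the sentence, complex_word, and the substitutes to the dataframe
    # (padded with empty strings so the row always matches the 12 columns, even if fewer than 10 substitutes remain)
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes + [""] * (10 - len(top_10_substitutes))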
    
    
    print('---------------------------------------------------------------------------------------------------------------------------------------------')
    
    
# export the dataframe to tsv for evaluation
substitutes_df.to_csv("./predictions/trial/SG_bertbase.tsv", sep="\t", index=False, header=False)
print("SG_bertbase exported to tsv in path './predictions/trial/SG_bertbase.tsv'\n")


Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
Substitute Generation (SG) step a): initial substitute list: ['compulsory', 'mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', 'only', 'illegal', 'sufficient', 'unnecessary', 'easy', 'normal', 'permitted', 'mandated', 'difficult', 'simple', 'appropriate', 'expensive', 'possible', 'commonplace', 'essential', 'proper', 'available', 'enough', 'affordable']

Substitute Generation (SG) step b): substitute list without empty elements and unwanted characters: ['compulsory', 'mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', 'only', 'illegal', 'sufficient', 'unnecessary', 'easy'
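
The noise-removal step above can also be read in isolation. The following is a minimal sketch, not the notebook's actual code path: `clean_substitutes` is a hypothetical helper name, and the input list is made up for illustration. It applies the same rules as the cell (reject punctuation except hyphens, reject `##` word pieces, strip, lowercase, drop empties) to a plain list of token strings.

```python
import string

# Punctuation to reject; hyphens are kept so hyphenated compounds survive,
# and curly quotes are added because they are not in string.punctuation.
PUNCTUATION = (set(string.punctuation) - {"-"}) | {"“", "”"}


def clean_substitutes(tokens):
    """Return lowercased, stripped tokens, dropping noisy candidates."""
    cleaned = []
    for token in tokens:
        if not isinstance(token, str):
            continue  # skip malformed entries instead of failing on them
        if token.startswith("##"):
            continue  # WordPiece continuation piece, not a full word
        if any(char in PUNCTUATION for char in token):
            continue  # contains unwanted punctuation
        token = token.strip().lower()
        if token:
            cleaned.append(token)
    return cleaned


# Made-up candidates, only to illustrate each rule.
example = ["Mandatory", " required", "##ory", "well-known", ",", "", "“"]
print(clean_substitutes(example))  # → ['mandatory', 'required', 'well-known']
```

Keeping the filter as a pure function makes it easy to check on a handful of strings before running it on model output.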

python tsar_eval.py --gold_file ./data/trial/tsar2022_en_trial_gold_no_noise.tsv --predictions_file ./predictions/trial/SG_bertbase.tsv --output_file ./output/trial/SG_bertbase.tsv
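
Before running the evaluation script, it can help to confirm that the predictions file has the layout written above: sentence, complex word, then at most ten substitutes per row, tab-separated. The sketch below is an assumption-laden sanity check using only the standard library; `check_predictions` is a hypothetical helper and not part of `tsar_eval.py`, and the sample rows are made up.

```python
import csv
import io


def check_predictions(handle, max_substitutes=10):
    """Return the line numbers of rows that do not fit the expected layout."""
    bad_rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        # A valid row starts with a non-empty sentence and complex word.
        has_context = len(row) >= 2 and row[0] != "" and row[1] != ""
        if not has_context or len(row) > 2 + max_substitutes:
            bad_rows.append(line_number)
    return bad_rows


# Made-up sample: the second row is missing its sentence.
sample = io.StringIO("A sentence.\tword\teasy\tsimple\n\tword\teasy\n")
print(check_predictions(sample))  # → [2]
```

For a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`.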

#### Substitute Selection phase 1 (removal of duplicates, inflected forms, and antonyms of the complex word):

In [29]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")
    
     
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with the mask token
    ## (note: str.replace masks every occurrence of the string, also inside longer words)
    sentence_masked_word = sentence.replace(complex_word, lm_tokenizer.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat= f"{sentence} {lm_tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word with the fill_mask pipeline; skip results without a token_str key, as these caused errors with the ELECTRA models
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
    substitutes = [substitute["token_str"] for substitute in result if "token_str" in substitute]
    #print(f"Substitute Generation (SG) step a): initial substitute list: {substitutes}\n")


    
    ## remove noise from the substitutes: skip candidates that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes = [substitute["token_str"].lower().strip() for substitute in result if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Substitute list without unwanted punctuation characters: {substitutes}\n")
    except TypeError as error:
        continue
    
    #print(f"Substitute Generation (SG) final step b): substitute list without empty elements and unwanted characters: {substitutes}\n")
    
    
        
    # 2. Substitute Selection (SS) phase 1: remove duplicates, inflected forms, and antonyms of complex word:
    
    ## a) remove duplicate substitutes from the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the first occurrence is kept and later ones are dropped on purpose: a later duplicate is probably the (previously) uppercased variant, as lowercased substitutes are most likely ranked higher by the model
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"Substitute Selection (SS) phase 1, step a): substitute list without duplicates of substitutes: {substitutes_no_dupl}\n")


    ## b) remove duplicates and inflected forms of the complex word from the substitute list
    ## lemmatize the complex word with spaCy, so that it can be compared with each lemmatized substitute to see whether they share the same lemma
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")

    ## then, remove duplicates and inflected forms of the complex word from the substitute list
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"Substitute Selection (SS) phase 1, step b): substitute list without duplicates nor inflected forms of the complex word '{complex_word}': {substitutes_no_dupl_complex_word}\n")

    ## c) remove antonyms of the complex word from the substitute list
    ## get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms = []
    for substitute in substitutes_no_dupl_complex_word:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms.append(substitute)
        # else:
        #     print(f"Removed antonym: {substitute}")
    print(f"Substitute Selection (SS) phase 1, step c): substitute list without antonyms of the complex word '{complex_word}': {substitutes_no_antonyms}\n") 
  
        
    # limit the substitutes to the 10 highest ranked ones for evaluation
    top_10_substitutes = substitutes_no_antonyms[:10]
    print(f"Substitute Selection (SS) phase 1, final step d): top-10 substitutes for the complex word '{complex_word}': {top_10_substitutes}\n")
    
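    # add the sentence, complex_word, and the substitutes to the dataframe
    # (padded with empty strings so the row always matches the 12 columns, even if fewer than 10 substitutes remain)
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes + [""] * (10 - len(top_10_substitutes))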
    
    print('---------------------------------------------------------------------------------------------------------------------------------------------')
    
    
# export the dataframe to tsv for evaluation
substitutes_df.to_csv("./predictions/trial/SS_phase1_bertbase.tsv", sep="\t", index=False, header=False)
print("SS_phase1_bertbase exported to tsv in path './predictions/trial/SS_phase1_bertbase.tsv'\n")

    

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
Substitute Selection (SS) phase 1, step a): substitute list without duplicates of substitutes: ['compulsory', 'mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', 'only', 'illegal', 'sufficient', 'unnecessary', 'easy', 'normal', 'permitted', 'mandated', 'difficult', 'simple', 'appropriate', 'expensive', 'possible', 'commonplace', 'essential', 'proper', 'available', 'enough', 'affordable']

Substitute Selection (SS) phase 1, step b): substitute list without duplicates nor inflected forms of the complex word 'compulsory': ['mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', '
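
The three filters of phase 1 (duplicates, forms of the complex word, antonyms) can be summarised as one function. This is a minimal sketch under stated assumptions: `select_substitutes` is a hypothetical name, and the lemmatizer and antonym set are passed in as arguments so the logic can be tried without spaCy or WordNet. The toy lemma table and antonym set below are made up and do not claim to reflect what those resources return.

```python
def select_substitutes(substitutes, complex_word, lemmatize, antonyms):
    """Drop duplicates, forms of the complex word, and its antonyms."""
    target_lemma = lemmatize(complex_word)
    kept = []
    # dict.fromkeys removes duplicates while keeping the first occurrence,
    # so the model's ranking order is preserved.
    for substitute in dict.fromkeys(substitutes):
        lemma = lemmatize(substitute)
        if lemma == target_lemma or lemma in antonyms:
            continue
        kept.append(substitute)
    return kept


# Toy stand-ins for the spaCy lemmatizer and the WordNet antonym lookup.
TOY_LEMMAS = {"required": "require"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


candidates = ["compulsory", "mandatory", "mandatory", "optional", "required"]
print(select_substitutes(candidates, "compulsory", toy_lemmatize, {"optional"}))
# → ['mandatory', 'required']
```

In the notebook itself, the `lemmatize` argument would correspond to taking `nlp(word)[0].lemma_`, and `antonyms` to the list collected from `wn.synsets`.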

python tsar_eval.py --gold_file ./data/trial/tsar2022_en_trial_gold_no_noise.tsv --predictions_file ./predictions/trial/SS_phase1_bertbase.tsv --output_file ./output/trial/SS_phase1_bertbase.tsv

### The results obtained after this step served as baselines for comparing the six models. I continued with RoBERTa-base and ELECTRA-large, as they had the best results on the trial set, and therefore did not perform further experiments with BERT-base.