### All code below uses concatenated sentence pairs (original sentence + masked sentence) in the Substitute Generation step, so that the model generates substitutes that are similar to the complex word (as opposed to substitutes that merely fit the context)
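
The sentence-pair input described above can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the helper name `build_concat_input` and the example sentence are hypothetical, and the default tokens are the BERT/ELECTRA ones (`[MASK]`, `[SEP]`); RoBERTa uses `<mask>` and `</s>` instead. One deliberate difference, stated as an assumption: the sketch masks only the first occurrence of the complex word, whereas the cells below replace every occurrence, which can produce several mask tokens in one input.

```python
def build_concat_input(sentence, complex_word, mask_token="[MASK]", sep_token="[SEP]"):
    """Return '<original sentence> <sep> <masked sentence>' for a fill-mask model."""
    # Assumption of this sketch: mask only the first occurrence, so the
    # model input contains exactly one mask token.
    masked_sentence = sentence.replace(complex_word, mask_token, 1)
    return f"{sentence} {sep_token} {masked_sentence}"


# Hypothetical example sentence (not taken from the TSAR data)
example = build_concat_input("Attendance is compulsory for all students.", "compulsory")
print(example)
```

Because the unmasked sentence precedes the masked one, the model can see the complex word itself while predicting the mask, which is what pushes it towards similar words rather than words that only fit the context.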

In [11]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
from transformers import pipeline

# read the tsv file
filename = './data/test/tsar2022_en_test_none_no_noise.tsv'
data = pd.read_csv(filename, sep='\t', header=None, names=["sentence", "complex_word"])

# create an empty dataframe to store the substitutes for evaluation
substitutes_df = pd.DataFrame(columns=["sentence", "complex_word"] + [f"substitute_{i+1}" for i in range(10)])


In [2]:
import logging

In [3]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
import spacy
nlp = spacy.load("en_core_web_sm")

import string

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IrmaT\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
# the code below is used when Bertscore is used in step SS
import bert_score
from bert_score import score
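
`bert_score.score` returns a tuple of three tensors (precision, recall, F1); the Substitute Selection code further down ranks on index `[0]`, i.e. precision. The ranking itself needs no model, so it can be sketched in isolation. The function name `rank_by_score` and the score values below are hypothetical and only illustrate the sort-and-trim step.

```python
def rank_by_score(substitutes, scores, top_n=10):
    """Sort substitutes by descending score and keep the first top_n."""
    pairs = sorted(zip(substitutes, scores), key=lambda pair: pair[1], reverse=True)
    return [substitute for substitute, _ in pairs[:top_n]]


# Made-up scores, purely for illustration
ranked = rank_by_score(["required", "optional", "mandatory"], [0.91, 0.80, 0.97])
print(ranked)
```

Ranking on F1 instead would only require passing the third tensor of the returned tuple, converted with `.tolist()`, as `scores`.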

In [5]:
# set the display.max_rows option to None to display all rows instead of truncating long dataframes (pandas limits the display to 60 rows by default)
pd.set_option('display.max_rows', None)

In [6]:
# Instantiate the tokenizer and the model

# for bert-base:
lm_tokenizer_bertbase = AutoTokenizer.from_pretrained("bert-base-uncased")
lm_model_bertbase = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")


# Instantiate the fill-mask pipeline with the model
fill_mask_bertbase = pipeline("fill-mask", lm_model_bertbase, tokenizer = lm_tokenizer_bertbase)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
# Instantiate the tokenizer and the model

# for electra-base:
lm_tokenizer_electrabase = AutoTokenizer.from_pretrained("google/electra-base-generator")
lm_model_electrabase = AutoModelForMaskedLM.from_pretrained("google/electra-base-generator")


# Instantiate the fill-mask pipeline with the model
fill_mask_electrabase = pipeline("fill-mask", lm_model_electrabase, tokenizer = lm_tokenizer_electrabase)

In [8]:
# Instantiate the tokenizer and the model

# for roberta-base:
lm_tokenizer_robertabase = AutoTokenizer.from_pretrained("roberta-base")
lm_model_robertabase = AutoModelForMaskedLM.from_pretrained("roberta-base")


# Instantiate the fill-mask pipeline with the model
fill_mask_robertabase = pipeline("fill-mask", lm_model_robertabase, tokenizer = lm_tokenizer_robertabase)
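
The tokens returned by these fill-mask pipelines are raw vocabulary entries, so they can include word pieces starting with `##`, punctuation, differently cased variants of the same word and, for RoBERTa, a leading space. A minimal sketch of the clean-up applied later in the notebook, written as a standalone function; the name `clean_candidates` and the sample tokens are hypothetical, and unlike the notebook it merges the noise filter and the duplicate removal into one pass.

```python
import string

# Hyphens are kept for hyphenated compounds; curly quotes are added because
# they are not part of string.punctuation.
PUNCTUATION = (set(string.punctuation) - {"-"}) | {"“", "”"}


def clean_candidates(tokens):
    """Lowercase, strip and de-duplicate candidate tokens, dropping noise."""
    cleaned = []
    for token in tokens:
        if not isinstance(token, str):
            continue  # skip anything that is not a plain string
        word = token.strip().lower()
        if not word or word.startswith("##"):
            continue  # empty string or word-piece continuation
        if any(char in PUNCTUATION for char in word):
            continue  # contains unwanted punctuation
        if word not in cleaned:
            cleaned.append(word)  # keep the first (highest-ranked) occurrence
    return cleaned


cleaned = clean_candidates([" Mandatory", "mandatory", "##ly", ".", "well-known", "“", ""])
print(cleaned)
```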

### Get the top-x of 4 configurations (bert-base substitutes ranked with BERTScore bert-base and with BERTScore electra-base; electra-base substitutes ranked with BERTScore bert-base and with BERTScore electra-base).

##### top-5 duplicates for all 4 models:
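
The merging strategy used in the cell below can be sketched on its own: substitutes shared by all four top-5 lists come first, then substitutes that occur in exactly one top-5 list, then the interleaved positions 6 to 10. The function name `merge_top10_lists` is hypothetical. As in the cell, substitutes found in two or three of the top-5 lists are left out of the prioritised head and can only re-enter through the second halves. One assumption of this sketch: repeated items from the second halves are skipped, so the result never contains the same substitute twice.

```python
from collections import Counter


def merge_top10_lists(top10_lists, head=5, limit=10):
    """Merge ranked lists, prioritising items shared by every top-`head` list."""
    heads = [ranked[:head] for ranked in top10_lists]
    # Counter keeps first-seen order, so the merged order follows the first list
    counts = Counter(item for top in heads for item in top)
    shared = [item for item, count in counts.items() if count == len(heads)]
    unique = [item for item, count in counts.items() if count == 1]
    merged = shared + unique

    # Interleave positions head+1 .. 10 of each list; zip stops at the shortest tail
    tails = [ranked[head:] for ranked in top10_lists]
    for group in zip(*tails):
        for item in group:
            if item not in merged:
                merged.append(item)
    return merged[:limit]


# The four top-5 lists printed in the output of the cell below
merged = merge_top10_lists([
    ["mandatory", "obligatory", "optional", "illegal", "voluntary"],
    ["mandatory", "obligatory", "voluntary", "optional", "mandated"],
    ["mandatory", "obligatory", "optional", "legal", "automatic"],
    ["mandatory", "obligatory", "voluntary", "automatic", "optional"],
])
print(merged)
# ['mandatory', 'obligatory', 'optional', 'illegal', 'mandated', 'legal']
```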

In [10]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # get the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    #print(f"Complex word: {complex_word}")
    
    
    # for bertbase model:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_bertbase = sentence.replace(complex_word, lm_tokenizer_bertbase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_bertbase = f"{sentence} {lm_tokenizer_bertbase.sep_token} {sentence_masked_word_bertbase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (removing elements without token_str key; as this gave errors in the ELECTRA models) .
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
    substitutes_bertbase = [substitute["token_str"] for substitute in result_bertbase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_bertbase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise in the substitutes by ignoring generated substitutes that are empty, that contain unwanted punctuation characters, or that start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). Use a try/except statement to guard against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_bertbase = [substitute["token_str"].lower().strip() for substitute in result_bertbase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: {substitutes_bertbase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that did not lowercase by default)
    ## the later occurrence of a duplicate is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase:
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for bertbase model: {substitutes_no_dupl_bertbase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for bertbase model: {substitutes_no_dupl_complex_word_bertbase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_bertbase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_antonyms_bertbase}\n") 
    
    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    # (use the list from step 2d; the previously referenced name substitutes_no_dupl_complex_word_no_antonym_bertbase is never defined)
    sentence_with_substitutes_bertbase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_bertbase]
    #print(f"List with sentences where complex word is substituted for bertbase model: {sentence_with_substitutes_bertbase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_bertbase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times
        scores_bsBert_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='bert-base-uncased', verbose=False)
        scores_bsElectra_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings

        # create a list of tuples, each tuple containing a substitute and its score (index 0 of the returned tuple is the BERTScore precision)
        substitute_score_pairs_bsBert_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsBert_bertbase[0].tolist()))
        substitute_score_pairs_bsElectra_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsElectra_bertbase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_bertbase = sorted(substitute_score_pairs_bsBert_bertbase, key=lambda x: x[1], reverse=True)
        sorted_substitute_score_pairs_bsElectra_bertbase = sorted(substitute_score_pairs_bsElectra_bertbase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_bertbase]
        #print(f"substitutes based on bertscores with bertbase model: {bertscore_ranked_substitutes_only_bsBert_bertbase}\n")
        bertscore_ranked_substitutes_only_bsElectra_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_bertbase]
        #print(f"substitutes based on bertscores with Electra for bertbase model : {bertscore_ranked_substitutes_only_bsElectra_bertbase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_bertbase = bertscore_ranked_substitutes_only_bsBert_bertbase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_bertbase}\n")
        top_10_substitutes_bsElectra_bertbase = bertscore_ranked_substitutes_only_bsElectra_bertbase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for bertbase model: {top_10_substitutes_bsElectra_bertbase}\n")

    else:
        top_10_substitutes_bsBert_bertbase  = []
        top_10_substitutes_bsElectra_bertbase = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[:5]
    print(f"SS step: top-5 substitutes based on bertscores with Bert for bertbase model: {top_5_substitutes_bsBert_bertbase}\n")
    top_5_substitutes_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[:5]
    print(f"SS step: top-5 substitutes based on bertscores with Electra for bertbase model: {top_5_substitutes_bsElectra_bertbase}\n")
    
    
    
    # for electrabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_electrabase = sentence.replace(complex_word, lm_tokenizer_electrabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_electrabase = f"{sentence} {lm_tokenizer_electrabase.sep_token} {sentence_masked_word_electrabase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (removing elements without token_str key; as this gave errors in the ELECTRA models) .
    top_k = 30
    result_electrabase = fill_mask_electrabase(sentences_concat_electrabase, top_k=top_k)
    substitutes_electrabase = [substitute["token_str"] for substitute in result_electrabase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_electrabase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise in the substitutes by ignoring generated substitutes that are empty, that contain unwanted punctuation characters, or that start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). Use a try/except statement to guard against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_electrabase = [substitute["token_str"].lower().strip() for substitute in result_electrabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for electrabase model: {substitutes_electrabase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that did not lowercase by default)
    ## the later occurrence of a duplicate is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_electrabase = []
    for sub in substitutes_electrabase:
        if sub not in substitutes_no_dupl_electrabase:
            substitutes_no_dupl_electrabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for electrabase model: {substitutes_no_dupl_electrabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_electrabase = []
    for substitute in substitutes_no_dupl_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_electrabase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_electrabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_electrabase = []
    for substitute in substitutes_no_dupl_complex_word_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_electrabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for electrabase model: {substitutes_no_antonyms_electrabase}\n") 


    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    # (use the list from step 2d; the previously referenced name substitutes_no_dupl_complex_word_no_antonym_electrabase is never defined)
    sentence_with_substitutes_electrabase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_electrabase]
    #print(f"List with sentences where complex word is substituted for electrabase model: {sentence_with_substitutes_electrabase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_electrabase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times
        scores_bsBert_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='bert-base-uncased', verbose=False)
        scores_bsElectra_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings

        # create a list of tuples, each tuple containing a substitute and its score (index 0 of the returned tuple is the BERTScore precision)
        substitute_score_pairs_bsBert_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsBert_electrabase[0].tolist()))
        substitute_score_pairs_bsElectra_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsElectra_electrabase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_electrabase = sorted(substitute_score_pairs_bsBert_electrabase, key=lambda x: x[1], reverse=True)
        sorted_substitute_score_pairs_bsElectra_electrabase = sorted(substitute_score_pairs_bsElectra_electrabase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_electrabase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_electrabase}\n")
        bertscore_ranked_substitutes_only_bsElectra_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_electrabase]
        #print(f"substitutes based on bertscores with Electra for bert-base model : {bertscore_ranked_substitutes_only_bsElectra_electrabase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_electrabase = bertscore_ranked_substitutes_only_bsBert_electrabase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_electrabase}\n")
        top_10_substitutes_bsElectra_electrabase = bertscore_ranked_substitutes_only_bsElectra_electrabase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for bert-base model: {top_10_substitutes_bsElectra_electrabase}\n")

    else:
        top_10_substitutes_bsBert_electrabase  = []
        top_10_substitutes_bsElectra_electrabase = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for electrabase model: {top_5_substitutes_bsBert_electrabase}\n")
    top_5_substitutes_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for electrabase model: {top_5_substitutes_bsElectra_electrabase}\n")

   
    # count in how many of the four top-5 lists each substitute occurs (the dictionary keeps the order in which substitutes are first seen)
    #print('Finding shared duplicates in the four top-5 lists ....')

    # create a dictionary to hold the counts
    count_dict = {}

    # combine all four lists into a list of lists for easy iteration
    all_lists = [top_5_substitutes_bsBert_bertbase, top_5_substitutes_bsElectra_bertbase, top_5_substitutes_bsBert_electrabase, top_5_substitutes_bsElectra_electrabase]

    # iterate through each list and count occurrences of each element
    for lst in all_lists:
        for elem in lst:
            if elem in count_dict:
                count_dict[elem] += 1
            else:
                count_dict[elem] = 1

    # create a list of duplicates by including only those elements that appeared in all four lists
    duplicates_all_models = [elem for elem, count in count_dict.items() if count == len(all_lists)]

    #print(f"SS step: shared duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: {duplicates_all_models}\n")
    
    # create a list with non-duplicates
    non_duplicates_all_models = [elem for elem, count in count_dict.items() if count == 1]
    #print(f"SS step: non-duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bert-base and electra-base models: {non_duplicates_all_models}\n")
    
    #concatenate both lists (duplicates_all_models and non_duplicates_all_models), giving duplicates_all_models priority
    top_5_concatenated = duplicates_all_models + non_duplicates_all_models
    #print(f"SS step: concatenated top-5 with prioritized shared duplicates among bertbase and electrabase bertscore results: {top_5_concatenated}\n")
    
    #create the "second half" lists (items 6th to 10th) from each of the original top 10 lists
    second_half_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[5:]
    second_half_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[5:]
    second_half_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[5:]
    second_half_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[5:]

    # zip the "second half" lists
    second_half_combined = [item for sublist in zip(second_half_bsBert_bertbase, second_half_bsElectra_bertbase, second_half_bsBert_electrabase, second_half_bsElectra_electrabase) for item in sublist]

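    # Exclude any items in second_half_combined that are already in top_5_concatenated,
    # and drop repeats within second_half_combined itself (the same substitute can sit in positions 6-10 of several lists), keeping the first occurrence
    second_half_combined = [item for item in dict.fromkeys(second_half_combined) if item not in top_5_concatenated]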

    # Append these items to the final_list
    top_5_concatenated += second_half_combined

    # Trim the final_list to the top 10 items
    final_list_bertscore_top5dup = top_5_concatenated[:10]

    #print(f"SS step: final list bertscore based on prioritizing shared duplicates among bertbase and electrabase models: {final_list_bertscore_top5dup}\n")


    print('------------------------------------------------------------')
    
    
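    ## add the sentence, complex_word, and substitutes to the dataframe
    ## (pad with None so that the row always matches the 10 substitute columns, even when fewer substitutes were found)
    padded_substitutes = final_list_bertscore_top5dup + [None] * (10 - len(final_list_bertscore_top5dup))
    substitutes_df.loc[index] = [sentence, complex_word] + padded_substitutes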
    

# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv', sep="\t", index=False, header=False)
print("BertElectrabase_SG_MA_SS_bsBertElectra_top5dup exported to tsv in path './predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv'\n")
    
    
    

SS step: top-5 substitutes based on bertscores with Bert for bertbase model: ['mandatory', 'obligatory', 'optional', 'illegal', 'voluntary']

SS step: top-5 substitutes based on bertscores with Electra for bertbase model: ['mandatory', 'obligatory', 'voluntary', 'optional', 'mandated']

SS step: top-5 substitutes based on bertscores with Bert for bertbase model: ['mandatory', 'obligatory', 'optional', 'legal', 'automatic']

SS step: top-5 substitutes based on bertscores with Electra for bertbase model: ['mandatory', 'obligatory', 'voluntary', 'automatic', 'optional']

Finding shared duplicates in the four top-5 lists ....
SS step: shared duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: ['mandatory', 'obligatory', 'optional']

SS step: concatenated top-5 with prioritized shared duplicates among bertbase and electrabase bertscore results: ['mandatory', 'obligatory', 'optional', 'illegal', 'mandated', 'legal']

SS st

KeyboardInterrupt: 

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv --output_file ./output/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv

##### top-5 duplicates for all 4 models (other order) _b:

In [None]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # get the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    #print(f"Complex word: {complex_word}")
    
    
    # for bertbase model:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_bertbase = sentence.replace(complex_word, lm_tokenizer_bertbase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_bertbase = f"{sentence} {lm_tokenizer_bertbase.sep_token} {sentence_masked_word_bertbase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (removing elements without token_str key; as this gave errors in the ELECTRA models) .
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
    substitutes_bertbase = [substitute["token_str"] for substitute in result_bertbase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_bertbase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise in the substitutes by ignoring generated substitutes that are empty, that contain unwanted punctuation characters, or that start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). Use a try/except statement to guard against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_bertbase = [substitute["token_str"].lower().strip() for substitute in result_bertbase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: {substitutes_bertbase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that did not lowercase by default)
    ## the later occurrence of a duplicate is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase:
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for bertbase model: {substitutes_no_dupl_bertbase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for bertbase model: {substitutes_no_dupl_complex_word_bertbase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_bertbase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_antonyms_bertbase}\n") 
    
    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_bertbase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_bertbase]
    #print(f"List with sentences where complex word is substituted for bertbase model: {sentence_with_substitutes_bertbase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_bertbase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsBert_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='bert-base-uncased', verbose=False)
        scores_bsElectra_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score (the first element returned by bert_score.score holds the precision scores)
        substitute_score_pairs_bsBert_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsBert_bertbase[0].tolist()))
        substitute_score_pairs_bsElectra_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsElectra_bertbase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_bertbase = sorted(substitute_score_pairs_bsBert_bertbase, key=lambda x: x[1], reverse=True)
        sorted_substitute_score_pairs_bsElectra_bertbase = sorted(substitute_score_pairs_bsElectra_bertbase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_bertbase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_bertbase}\n")
        bertscore_ranked_substitutes_only_bsElectra_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_bertbase]
        #print(f"substitutes based on bertscores with Electra for bert-base model : {bertscore_ranked_substitutes_only_bsElectra_bertbase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_bertbase = bertscore_ranked_substitutes_only_bsBert_bertbase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_bertbase}\n")
        top_10_substitutes_bsElectra_bertbase = bertscore_ranked_substitutes_only_bsElectra_bertbase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for bert-base model: {top_10_substitutes_bsElectra_bertbase}\n")

    else:
        top_10_substitutes_bsBert_bertbase  = []
        top_10_substitutes_bsElectra_bertbase = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for bert-base model: {top_5_substitutes_bsBert_bertbase}\n")
    top_5_substitutes_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for bert-base model: {top_5_substitutes_bsElectra_bertbase}\n")
    
    
    
    # for electrabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_electrabase = sentence.replace(complex_word, lm_tokenizer_electrabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_electrabase = f"{sentence} {lm_tokenizer_electrabase.sep_token} {sentence_masked_word_electrabase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (removing elements without token_str key; as this gave errors in the ELECTRA models) .
    top_k = 30
    result_electrabase = fill_mask_electrabase(sentences_concat_electrabase, top_k=top_k)
    substitutes_electrabase = [substitute["token_str"] for substitute in result_electrabase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_electrabase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise from the substitutes: skip generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_electrabase = [substitute["token_str"].lower().strip() for substitute in result_electrabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for electrabase model: {substitutes_electrabase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence is dropped on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_electrabase = []
    for sub in substitutes_electrabase:
        if sub not in substitutes_no_dupl_electrabase:
            substitutes_no_dupl_electrabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for electrabase model: {substitutes_no_dupl_electrabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_electrabase = []
    for substitute in substitutes_no_dupl_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_electrabase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_electrabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_electrabase = []
    for substitute in substitutes_no_dupl_complex_word_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_electrabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for electrabase model: {substitutes_no_antonyms_electrabase}\n") 


    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_electrabase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_electrabase]
    #print(f"List with sentences where complex word is substituted for electrabase model: {sentence_with_substitutes_electrabase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_electrabase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsBert_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='bert-base-uncased', verbose=False)
        scores_bsElectra_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score (the first element returned by bert_score.score holds the precision scores)
        substitute_score_pairs_bsBert_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsBert_electrabase[0].tolist()))
        substitute_score_pairs_bsElectra_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsElectra_electrabase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_electrabase = sorted(substitute_score_pairs_bsBert_electrabase, key=lambda x: x[1], reverse=True)
        sorted_substitute_score_pairs_bsElectra_electrabase = sorted(substitute_score_pairs_bsElectra_electrabase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_electrabase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_electrabase}\n")
        bertscore_ranked_substitutes_only_bsElectra_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_electrabase]
        #print(f"substitutes based on bertscores with Electra for electrabase model : {bertscore_ranked_substitutes_only_bsElectra_electrabase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_electrabase = bertscore_ranked_substitutes_only_bsBert_electrabase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_electrabase}\n")
        top_10_substitutes_bsElectra_electrabase = bertscore_ranked_substitutes_only_bsElectra_electrabase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for electrabase model: {top_10_substitutes_bsElectra_electrabase}\n")

    else:
        top_10_substitutes_bsBert_electrabase  = []
        top_10_substitutes_bsElectra_electrabase = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for electrabase model: {top_5_substitutes_bsBert_electrabase}\n")
    top_5_substitutes_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for electrabase model: {top_5_substitutes_bsElectra_electrabase}\n")

   
    # combine the four top-5 lists into one list, keeping the original order within each sublist
    #print('Finding shared duplicates in the four top-5 lists ....')

    # create a dictionary to hold the counts
    count_dict = {}

   
    # combine all four lists into a list of lists for easy iteration
    all_lists = [top_5_substitutes_bsBert_bertbase, top_5_substitutes_bsElectra_electrabase, top_5_substitutes_bsBert_electrabase, top_5_substitutes_bsElectra_bertbase]
    print(all_lists)

    # iterate through each list and count occurrences of each element
    for lst in all_lists:
        for elem in lst:
            if elem in count_dict:
                count_dict[elem] += 1
            else:
                count_dict[elem] = 1

    # create a list of duplicates by including only those elements that appeared in all four lists
    duplicates_all_models = [elem for elem, count in count_dict.items() if count == len(all_lists)]

    #print(f"SS step: shared duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: {duplicates_all_models}\n")
    
    # create a list with non-duplicates
    non_duplicates_all_models = [elem for elem, count in count_dict.items() if count == 1]
    #print(f"SS step: non-duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bert-base and electra-base models: {non_duplicates_all_models}\n")
    
    #concatenate both lists (duplicates_all_models and non_duplicates_all_models), giving duplicates_all_models priority
    top_5_concatenated = duplicates_all_models + non_duplicates_all_models
    #print(f"SS step: concatenated top-5 with prioritized shared duplicates among bert-base and electra-base bertscore results: {top_5_concatenated}\n")
    
    #create the "second half" lists (items 6th to 10th) from each of the original top 10 lists
    second_half_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[5:]
    print(second_half_bsBert_bertbase)
    second_half_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[5:]
    second_half_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[5:]
    second_half_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[5:]
    

    # zip the "second half" lists
    second_half_combined = [item for sublist in zip(second_half_bsBert_bertbase, second_half_bsElectra_electrabase, second_half_bsBert_electrabase, second_half_bsElectra_bertbase) for item in sublist]

    # Exclude any items in second_half_combined that are already in top_5_concatenated
    second_half_combined = [item for item in second_half_combined if item not in top_5_concatenated]

    # Append these items to the final_list
    top_5_concatenated += second_half_combined

    # Trim the final_list to the top 10 items
    final_list_bertscore_top5dup_b = top_5_concatenated[:10]

    #print(f"SS step: final list bertscore based on prioritizing shared duplicates among bert-base and electra-base models: {final_list_bertscore_top5dup_b}\n")


    #print('------------------------------------------------------------')
    
    
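    ## add the sentence, complex_word, and substitutes to the dataframe
    ## (pad with empty strings so the row always matches the 10 substitute columns, as a shorter list would raise a ValueError)
    padded_substitutes = final_list_bertscore_top5dup_b + [""] * (10 - len(final_list_bertscore_top5dup_b))
    substitutes_df.loc[index] = [sentence, complex_word] + padded_substitutes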

# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup_b.tsv', sep="\t", index=False, header=False)
print("BertElectrabase_SG_MA_SS_bsBertElectra_top5dup_b exported to tsv in path './predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup_b.tsv'\n")
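The list-merging logic at the end of the cell above (shared top-5 candidates first, then candidates unique to one list, then the 6th to 10th candidates interleaved by rank) could be factored into one helper. This is only a sketch: the function name `merge_ranked_lists` and its parameters `head` and `limit` are hypothetical and not part of the notebook. Unlike the cell above, it also skips repeats among the 6th to 10th candidates. Like the cell above, it drops candidates that occur in more than one but not all of the top-5 lists.

```python
from collections import Counter
from itertools import chain


def merge_ranked_lists(top10_lists, head=5, limit=10):
    """Merge ranked lists, giving priority to candidates shared by all top-`head` lists."""
    heads = [lst[:head] for lst in top10_lists]
    tails = [lst[head:] for lst in top10_lists]

    # count in how many top-`head` lists each candidate occurs (first-seen order is kept)
    counts = Counter(chain.from_iterable(heads))

    shared = [cand for cand, n in counts.items() if n == len(heads)]
    unique = [cand for cand, n in counts.items() if n == 1]
    merged = shared + unique

    # interleave the remaining candidates rank by rank; zip stops at the shortest tail
    for cand in chain.from_iterable(zip(*tails)):
        if cand not in merged:
            merged.append(cand)

    return merged[:limit]
```

With the variables from the cell above, a call might look like `merge_ranked_lists([top_10_substitutes_bsBert_bertbase, top_10_substitutes_bsElectra_electrabase, top_10_substitutes_bsBert_electrabase, top_10_substitutes_bsElectra_bertbase])`.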
    

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup_b.tsv --output_file ./output/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup_b.tsv
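Steps a) and b) of the Morphological Adaptation stage are repeated for every model, so they could also be wrapped in a small function. This is a minimal sketch, assuming the input is the list of `token_str` values returned by the fill-mask pipeline. The name `clean_candidates` is hypothetical. It also differs from the cells above in one respect: a malformed (non-string) entry is skipped on its own, rather than the whole row being skipped through `try/except`.

```python
import string

# punctuation to reject: hyphens are kept, curly quotes are added (as in the cells above)
PUNCTUATION = (set(string.punctuation) - {"-"}) | {"“", "”"}


def clean_candidates(token_strs):
    """Lowercase, strip and de-duplicate candidate tokens, dropping noisy ones."""
    cleaned = []
    for token in token_strs:
        if not isinstance(token, str):
            continue  # skip malformed entries
        if token.startswith("##") or any(ch in PUNCTUATION for ch in token):
            continue  # skip word-piece fragments and tokens containing punctuation
        word = token.lower().strip()
        # keep the first occurrence only, so the higher-ranked variant survives
        if word and word not in cleaned:
            cleaned.append(word)
    return cleaned
```

For example, `clean_candidates([s["token_str"] for s in result_bertbase if "token_str" in s])` would cover steps a) and b) for the bertbase results.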

### Including Robertabase (3 models in total: bertbase, electrabase, robertabase, with BERTScores also from these models): not carried out because the runtime on Colab was too long

In [None]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    #print(f"Complex word: {complex_word}")
    
    
    # for bertbase model:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_bertbase = sentence.replace(complex_word, lm_tokenizer_bertbase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_bertbase = f"{sentence} {lm_tokenizer_bertbase.sep_token} {sentence_masked_word_bertbase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (removing elements without token_str key; as this gave errors in the ELECTRA models) .
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
    substitutes_bertbase = [substitute["token_str"] for substitute in result_bertbase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_bertbase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise from the substitutes: skip generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_bertbase = [substitute["token_str"].lower().strip() for substitute in result_bertbase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: {substitutes_bertbase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence is dropped on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase:
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for bert-base model: {substitutes_no_dupl_bertbase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for bert-base model: {substitutes_no_dupl_complex_word_bertbase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_bertbase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_antonyms_bertbase}\n") 
    
    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_bertbase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_bertbase]
    #print(f"List with sentences where complex word is substituted for bert-base model: {sentence_with_substitutes_bertbase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_bertbase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsBert_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='bert-base-uncased', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score (the first element returned by bert_score.score holds the precision scores)
        substitute_score_pairs_bsBert_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsBert_bertbase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_bertbase = sorted(substitute_score_pairs_bsBert_bertbase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_bertbase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_bertbase}\n")

        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_bertbase = bertscore_ranked_substitutes_only_bsBert_bertbase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_bertbase}\n")
       

    else:
        top_10_substitutes_bsBert_bertbase  = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[:5]
    print(f"SS step: top-5 substitutes based on bertscores with Bert for bert-base model: {top_5_substitutes_bsBert_bertbase}\n")
       
    
    # for electrabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_electrabase = sentence.replace(complex_word, lm_tokenizer_electrabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_electrabase = f"{sentence} {lm_tokenizer_electrabase.sep_token} {sentence_masked_word_electrabase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (removing elements without token_str key; as this gave errors in the ELECTRA models) .
    top_k = 30
    result_electrabase = fill_mask_electrabase(sentences_concat_electrabase, top_k=top_k)
    substitutes_electrabase = [substitute["token_str"] for substitute in result_electrabase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list for electrabase model: {substitutes_electrabase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise from the substitutes: skip generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_electrabase = [substitute["token_str"].lower().strip() for substitute in result_electrabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for electrabase model: {substitutes_electrabase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence is dropped on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_electrabase = []
    for sub in substitutes_electrabase:
        if sub not in substitutes_no_dupl_electrabase:
            substitutes_no_dupl_electrabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for electrabase model: {substitutes_no_dupl_electrabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_electrabase = []
    for substitute in substitutes_no_dupl_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_electrabase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_electrabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_electrabase = []
    for substitute in substitutes_no_dupl_complex_word_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_electrabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for electrabase model: {substitutes_no_antonyms_electrabase}\n") 


    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_electrabase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_electrabase]
    #print(f"List with sentences where complex word is substituted for electrabase model: {sentence_with_substitutes_electrabase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_electrabase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsElectra_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score (the first element returned by bert_score.score holds the precision scores)
        substitute_score_pairs_bsElectra_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsElectra_electrabase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsElectra_electrabase = sorted(substitute_score_pairs_bsElectra_electrabase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsElectra_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_electrabase]
        #print(f"substitutes based on bertscores with Electra for electrabase model : {bertscore_ranked_substitutes_only_bsElectra_electrabase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsElectra_electrabase = bertscore_ranked_substitutes_only_bsElectra_electrabase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for electrabase model: {top_10_substitutes_bsElectra_electrabase}\n")

    else:
        top_10_substitutes_bsElectra_electrabase = []
        
    
    # limit the substitutes to the first 5 for concatenation with the top-5 of other models
    top_5_substitutes_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[:5]
    print(f"SS step: top-5 substitutes based on bertscores with Electra for electrabase model: {top_5_substitutes_bsElectra_electrabase}\n")

   

    
    # for robertabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_robertabase = sentence.replace(complex_word, lm_tokenizer_robertabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_robertabase = f"{sentence} {lm_tokenizer_robertabase.sep_token} {sentence_masked_word_robertabase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (elements without a token_str key are skipped, as these gave errors with the ELECTRA models)
    top_k = 30
    result_robertabase = fill_mask_robertabase(sentences_concat_robertabase, top_k=top_k)
    substitutes_robertabase = [substitute["token_str"] for substitute in result_robertabase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_robertabase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise: ignore generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). If a TypeError occurs, the try/except below skips the current row

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_robertabase = [substitute["token_str"].lower().strip() for substitute in result_robertabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for robertabase model: {substitutes_robertabase}\n")
    except TypeError as error:
        continue




    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the first occurrence is kept and later ones are removed on purpose: a later duplicate is probably the (previously) uppercased variant, which the model most likely ranked lower than the lowercased one
    substitutes_no_dupl_robertabase= []
    for sub in substitutes_robertabase:
        if sub not in substitutes_no_dupl_robertabase:
            substitutes_no_dupl_robertabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for robertabase model: {substitutes_no_dupl_robertabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first lemmatize the complex word with spaCy, so that its lemma can be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_robertabase = []
    for substitute in substitutes_no_dupl_robertabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_robertabase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for robertabase model: {substitutes_no_dupl_complex_word_robertabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_robertabase = []
    for substitute in substitutes_no_dupl_complex_word_robertabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_robertabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for robertabase model: {substitutes_no_antonyms_robertabase}\n") 

 
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_robertabase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_robertabase]
    #print(f"List with sentences where complex word is substituted for robertabase model: {sentence_with_substitutes_robertabase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_robertabase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsRoberta_robertabase = bert_score.score([sentence]*len(sentence_with_substitutes_robertabase), sentence_with_substitutes_robertabase, lang="en", model_type='roberta-base', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score ([0] is the precision tensor of the (P, R, F1) tuple returned by bert_score.score)
        substitute_score_pairs_bsRoberta_robertabase = list(zip(substitutes_no_antonyms_robertabase, scores_bsRoberta_robertabase[0].tolist()))
       
        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsRoberta_robertabase = sorted(substitute_score_pairs_bsRoberta_robertabase, key=lambda x: x[1], reverse=True)
        

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsRoberta_robertabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsRoberta_robertabase]
        #print(f"substitutes based on bertscores with Roberta for robertabase model: {bertscore_ranked_substitutes_only_bsRoberta_robertabase}\n")
        


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsRoberta_robertabase = bertscore_ranked_substitutes_only_bsRoberta_robertabase[:10]
        #print(f"top-10 substitutes based on bertscores with Roberta for robertabase model: {top_10_substitutes_bsRoberta_robertabase}\n")
       

    else:
        top_10_substitutes_bsRoberta_robertabase  = []
        
    
    # limit the substitutes to the first 5 for concatenation with the top-5 of other models
    top_5_substitutes_bsRoberta_robertabase = top_10_substitutes_bsRoberta_robertabase[:5]
    print(f"SS step: top-5 substitutes based on bertscores with Roberta for robertabase model: {top_5_substitutes_bsRoberta_robertabase}\n")
    



    # concatenate best results of bertbase, electrabase, and robertabase models:
    # count in how many of the three top-5 lists each substitute occurs, so that shared substitutes can be given priority
    #print('Finding shared duplicates in the three top-5 lists ....')

    # create a dictionary to hold the counts
    count_dict = {}

    # combine all three lists into a list of lists for easy iteration
    all_lists = [top_5_substitutes_bsBert_bertbase, top_5_substitutes_bsElectra_electrabase, top_5_substitutes_bsRoberta_robertabase]

    # iterate through each list and count occurrences of each element
    for lst in all_lists:
        #print(f"Current list: {lst}")  
        for elem in lst:
            if elem in count_dict:
                count_dict[elem] += 1
            else:
                count_dict[elem] = 1

    # print the count_dict to see the counts of each element
    #print(f"Count dictionary: {count_dict}")   

    # create a list of duplicates by including only those elements that appeared in all three lists
    duplicates_all_models = [elem for elem, count in count_dict.items() if count == len(all_lists)]

    print(f"SS step: shared duplicates in top-5 substitutes lists based on bertscores with Bert, Electra, and Roberta for bertbase, electrabase, and robertabase models: {duplicates_all_models}\n")
    
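    # create a list with non-duplicates (substitutes that occur in exactly one list);
    # note: a substitute that occurs in exactly two of the three lists ends up in neither list and is dropped from the top-5 concatenation
    non_duplicates_all_models = [elem for elem, count in count_dict.items() if count == 1]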
    #print(f"SS step: non-duplicates in top-5 substitutes lists based on bertscores with Bert, Electra, and Roberta for bertbase, electrabase and robertabase models:  {non_duplicates_all_models}\n")
    
    #concatenate both lists (duplicates_all_models and non_duplicates_all_models), giving duplicates_all_models priority
    top_5_concatenated = duplicates_all_models + non_duplicates_all_models
    print(f"SS step: concatenated top-5 with prioritized shared duplicates among bertbase, electrabase, and robertabase bertscore results: {top_5_concatenated}\n")
    
    #create the "second half" lists (items 6th to 10th) from each of the original top 10 lists
    second_half_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[5:]
    second_half_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[5:]
    second_half_bsRoberta_robertabase = top_10_substitutes_bsRoberta_robertabase[5:]

    # zip the "second half" lists to interleave them rank by rank (zip stops at the shortest list)
    second_half_combined = [item for sublist in zip(second_half_bsBert_bertbase, second_half_bsElectra_electrabase, second_half_bsRoberta_robertabase)  for item in sublist]

    # Exclude any items in second_half_combined that are already in top_5_concatenated
    second_half_combined = [item for item in second_half_combined if item not in top_5_concatenated]

    # Append these items to the final_list
    top_5_concatenated += second_half_combined

    # Trim the final_list to the top 10 items
    final_list_bertscore_top5dup_3models = top_5_concatenated[:10]

    print(f"SS step: final list bertscore based on prioritizing shared duplicates among bertbase, electrabase, and robertabase models: {final_list_bertscore_top5dup_3models}\n")


    print('------------------------------------------------------------')
    
    
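    ## add the sentence, complex_word, and substitutes to the dataframe
    ## (pad with None when fewer than 10 substitutes remain, as the row must match the 12 columns)
    padded_substitutes = final_list_bertscore_top5dup_3models + [None] * (10 - len(final_list_bertscore_top5dup_3models))
    substitutes_df.loc[index] = [sentence, complex_word] + padded_substitutes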
    

# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/BertElectraRobertabase_SG_MA_SS_bsBertElectraRoberta_top5dup.tsv', sep="\t", index=False, header=False)   
print("BertElectraRobertabase_SG_MA_SS_bsBertElectraRoberta_top5dup exported to tsv in path './predictions/test/BertElectraRobertabase_SG_MA_SS_bsBertElectraRoberta_top5dup.tsv'\n")
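
The merging logic in the cell above can be sketched as one reusable function. This is a minimal sketch, not the notebook's exact code: `merge_ranked_lists` and its parameters `head` and `limit` are hypothetical names. It also deviates from the cell in three ways, each marked in the comments: substitutes shared by only some of the models are kept instead of dropped, the tails are interleaved with `zip_longest` so that lists of unequal length are not truncated, and repeated items in the tails are de-duplicated.

```python
from collections import Counter
from itertools import zip_longest


def merge_ranked_lists(top10_lists, head=5, limit=10):
    """Merge per-model rankings, putting substitutes shared by all models first."""
    heads = [ranking[:head] for ranking in top10_lists]
    # Counter keeps first-seen order, so the order of the first list is respected
    counts = Counter(sub for ranking in heads for sub in ranking)

    # substitutes that occur in the top-`head` of every model get priority
    shared = [sub for sub, n in counts.items() if n == len(heads)]
    # assumption (differs from the cell above): partially shared substitutes are kept too
    rest = [sub for sub, n in counts.items() if n < len(heads)]
    merged = shared + rest

    # interleave the remaining ranks of each model;
    # assumption (differs from the cell above): zip_longest does not stop at the shortest list
    tails = [ranking[head:] for ranking in top10_lists]
    for group in zip_longest(*tails):
        for sub in group:
            # assumption (differs from the cell above): repeated tail items are added only once
            if sub is not None and sub not in merged:
                merged.append(sub)
    return merged[:limit]
```

With such a helper, the two-model and three-model variants would differ only in the list passed in, for example `merge_ranked_lists([top_10_substitutes_bsBert_bertbase, top_10_substitutes_bsElectra_electrabase])`.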
    
    

In [None]:
!python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/BertElectraRobertabase_SG_MA_SS_bsBertElectraRoberta_top5dup.tsv --output_file ./output/test/BertElectraRobertabase_SG_MA_SS_bsBertElectraRoberta_top5dup.tsv

Top-5 of the two best models: electrabase (BERTScore with ELECTRA) and bertbase (BERTScore with BERT)
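
Because the Substitute Generation input and the noise filtering are identical for every model, they could be factored into small helpers instead of being repeated per model. The following is a minimal sketch under that assumption: `build_masked_pair` and `clean_candidates` are hypothetical names that do not appear in the notebook, and the mock input in the usage note only imitates the list of dictionaries that the fill-mask pipeline returns.

```python
import string

# same punctuation set as in the notebook: hyphens are retained, curly quotes are added
PUNCTUATION = (set(string.punctuation) - {"-"}) | {"“", "”"}


def build_masked_pair(sentence, complex_word, mask_token, sep_token):
    """SG input: the original sentence, a separator, and the sentence with the complex word masked."""
    # str.replace masks every occurrence of the complex word, as in the notebook
    masked = sentence.replace(complex_word, mask_token)
    return f"{sentence} {sep_token} {masked}"


def clean_candidates(fill_mask_result):
    """Steps a) and b): drop noisy tokens, lowercase, strip, and de-duplicate in rank order."""
    cleaned = []
    for candidate in fill_mask_result:
        token = candidate.get("token_str")
        if not isinstance(token, str):
            continue  # skip elements without a usable token_str
        if token.startswith("##") or any(ch in PUNCTUATION for ch in token):
            continue  # skip word pieces and tokens with unwanted punctuation
        token = token.lower().strip()  # strip removes the leading space that RoBERTa adds
        if token and token not in cleaned:
            cleaned.append(token)  # keep the first (highest-ranked) occurrence only
    return cleaned
```

For example, `clean_candidates([{"token_str": " Easy"}, {"token_str": "easy"}, {"token_str": "##ly"}])` keeps a single `"easy"`, and `build_masked_pair(sentence, complex_word, lm_tokenizer_bertbase.mask_token, lm_tokenizer_bertbase.sep_token)` would produce the string that the cell below stores in `sentences_concat_bertbase`.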

In [None]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    #print(f"Complex word: {complex_word}")
    
    
    # for bertbase model:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_bertbase = sentence.replace(complex_word, lm_tokenizer_bertbase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_bertbase = f"{sentence} {lm_tokenizer_bertbase.sep_token} {sentence_masked_word_bertbase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (elements without a token_str key are skipped, as these gave errors with the ELECTRA models)
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
    substitutes_bertbase = [substitute["token_str"] for substitute in result_bertbase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_bertbase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise: ignore generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). If a TypeError occurs, the try/except below skips the current row

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_bertbase = [substitute["token_str"].lower().strip() for substitute in result_bertbase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: {substitutes_bertbase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the first occurrence is kept and later ones are removed on purpose: a later duplicate is probably the (previously) uppercased variant, which the model most likely ranked lower than the lowercased one
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase:
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for bertbase model: {substitutes_no_dupl_bertbase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first lemmatize the complex word with spaCy, so that its lemma can be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for bertbase model: {substitutes_no_dupl_complex_word_bertbase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_bertbase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_antonyms_bertbase}\n") 
    
    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_bertbase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_bertbase]
    #print(f"List with sentences where complex word is substituted for bertbase model: {sentence_with_substitutes_bertbase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_bertbase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsBert_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='bert-base-uncased', verbose=False)
        #scores_bsElectra_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score ([0] is the precision tensor of the (P, R, F1) tuple returned by bert_score.score)
        substitute_score_pairs_bsBert_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsBert_bertbase[0].tolist()))
        #substitute_score_pairs_bsElectra_bertbase = list(zip(substitutes_no_antonyms_bertbase, scores_bsElectra_bertbase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_bertbase = sorted(substitute_score_pairs_bsBert_bertbase, key=lambda x: x[1], reverse=True)
        #sorted_substitute_score_pairs_bsElectra_bertbase = sorted(substitute_score_pairs_bsElectra_bertbase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_bertbase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_bertbase}\n")
        #bertscore_ranked_substitutes_only_bsElectra_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_bertbase]
        #print(f"substitutes based on bertscores with Electra for bert-base model : {bertscore_ranked_substitutes_only_bsElectra_bertbase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_bertbase = bertscore_ranked_substitutes_only_bsBert_bertbase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_bertbase}\n")
        #top_10_substitutes_bsElectra_bertbase = bertscore_ranked_substitutes_only_bsElectra_bertbase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for bert-base model: {top_10_substitutes_bsElectra_bertbase}\n")

    else:
        top_10_substitutes_bsBert_bertbase  = []
        #top_10_substitutes_bsElectra_bertbase = []
        
    
    # limit the substitutes to the first 5 for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for bert-base model: {top_5_substitutes_bsBert_bertbase}\n")
    #top_5_substitutes_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for bert-base model: {top_5_substitutes_bsElectra_bertbase}\n")
    
    
    
    # for electrabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_electrabase = sentence.replace(complex_word, lm_tokenizer_electrabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_electrabase = f"{sentence} {lm_tokenizer_electrabase.sep_token} {sentence_masked_word_electrabase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (elements without a token_str key are skipped, as these gave errors with the ELECTRA models)
    top_k = 30
    result_electrabase = fill_mask_electrabase(sentences_concat_electrabase, top_k=top_k)
    substitutes_electrabase = [substitute["token_str"] for substitute in result_electrabase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_electrabase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise: ignore generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these caused errors with the ELECTRA model),
    ## and lowercase all substitutes (some models don't lowercase by default). If a TypeError occurs, the try/except below skips the current row

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_electrabase = [substitute["token_str"].lower().strip() for substitute in result_electrabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for electrabase model: {substitutes_electrabase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the first occurrence is kept and later ones are removed on purpose: a later duplicate is probably the (previously) uppercased variant, which the model most likely ranked lower than the lowercased one
    substitutes_no_dupl_electrabase = []
    for sub in substitutes_electrabase:
        if sub not in substitutes_no_dupl_electrabase:
            substitutes_no_dupl_electrabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for electrabase model: {substitutes_no_dupl_electrabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first lemmatize the complex word with spaCy, so that its lemma can be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_electrabase = []
    for substitute in substitutes_no_dupl_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_electrabase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_electrabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_electrabase = []
    for substitute in substitutes_no_dupl_complex_word_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_electrabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for electrabase model: {substitutes_no_antonyms_electrabase}\n") 


    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_electrabase = [sentence.replace(complex_word, sub) for sub in substitutes_no_antonyms_electrabase]
    #print(f"List with sentences where complex word is substituted for electrabase model: {sentence_with_substitutes_electrabase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_electrabase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        #scores_bsBert_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='bert-base-uncased', verbose=False)
        scores_bsElectra_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score ([0] is the precision tensor of the (P, R, F1) tuple returned by bert_score.score)
        #substitute_score_pairs_bsBert_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsBert_electrabase[0].tolist()))
        substitute_score_pairs_bsElectra_electrabase = list(zip(substitutes_no_antonyms_electrabase, scores_bsElectra_electrabase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        #sorted_substitute_score_pairs_bsBert_electrabase = sorted(substitute_score_pairs_bsBert_electrabase, key=lambda x: x[1], reverse=True)
        sorted_substitute_score_pairs_bsElectra_electrabase = sorted(substitute_score_pairs_bsElectra_electrabase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        #bertscore_ranked_substitutes_only_bsBert_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_electrabase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_electrabase}\n")
        bertscore_ranked_substitutes_only_bsElectra_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_electrabase]
        #print(f"substitutes with Electra based on bertscores with Electra : {bertscore_ranked_substitutes_only_bsElectra_electrabase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        #top_10_substitutes_bsBert_electrabase = bertscore_ranked_substitutes_only_bsBert_electrabase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_electrabase}\n")
        top_10_substitutes_bsElectra_electrabase = bertscore_ranked_substitutes_only_bsElectra_electrabase[:10]
        #print(f"top-10 substitutes with Electra based on bertscores with Electra: {top_10_substitutes_bsElectra_electrabase}\n")

    else:
        #top_10_substitutes_bsBert_electrabase  = []
        top_10_substitutes_bsElectra_electrabase = []
        
    
    # limit the substitutes to the first 5 for concatenation with the top-5 of other models
    #top_5_substitutes_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for bert-base model: {top_5_substitutes_bsBert_electrabase}\n")
    top_5_substitutes_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for electrabase model: {top_5_substitutes_bsElectra_electrabase}\n")

   
    # count in how many of the two top-5 lists each substitute occurs, so that shared substitutes can be given priority
    #print('Finding shared duplicates in the two top-5 lists ....')

    # create a dictionary to hold the counts
    count_dict = {}

   
    # combine both lists into a list of lists for easy iteration (the commented-out line is the four-list variant)
    #all_lists = [top_5_substitutes_bsElectra_electrabase, top_5_substitutes_bsBert_bertbase, top_5_substitutes_bsElectra_bertbase, top_5_substitutes_bsBert_electrabase]
    all_lists = [top_5_substitutes_bsElectra_electrabase, top_5_substitutes_bsBert_bertbase]
    #print(all_lists)

    # iterate through each list and count occurrences of each element
    for lst in all_lists:
        for elem in lst:
            if elem in count_dict:
                count_dict[elem] += 1
            else:
                count_dict[elem] = 1

    # create a list of duplicates by including only those elements that appeared in every list in all_lists
    duplicates_all_models = [elem for elem, count in count_dict.items() if count == len(all_lists)]

    #print(f"SS step: shared duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: {duplicates_all_models}\n")
    
    # create a list with non-duplicates
    non_duplicates_all_models = [elem for elem, count in count_dict.items() if count == 1]
    #print(f"SS step: non-duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: {non_duplicates_all_models}\n")
    
    #concatenate both lists (duplicates_all_models and non_duplicates_all_models), giving duplicates_all_models priority
    top_5_concatenated = duplicates_all_models + non_duplicates_all_models
    #print(f"SS step: concatenated top-5 with prioritized shared duplicates among bertbase and electrabase bertscore results: {top_5_concatenated}\n")
    
    #create the "second half" lists (items 6th to 10th) from each of the original top 10 lists
    second_half_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[5:]
    second_half_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[5:]
    #second_half_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[5:]
    #second_half_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[5:]
    

    # zip the "second half" lists
    #second_half_combined = [item for sublist in zip(second_half_bsElectra_electrabase, second_half_bsBert_bertbase, second_half_bsElectra_bertbase, second_half_bsBert_electrabase) for item in sublist]
    second_half_combined = [item for sublist in zip(second_half_bsElectra_electrabase, second_half_bsBert_bertbase) for item in sublist]
    # Exclude any items in second_half_combined that are already in top_5_concatenated
    second_half_combined = [item for item in second_half_combined if item not in top_5_concatenated]

    # Append these items to the final_list
    top_5_concatenated += second_half_combined

    # Trim the final_list to the top 10 items
    final_list_ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup = top_5_concatenated[:10]

    #print(f"SS step: final list bertscore based on prioritizing shared duplicates among bert-base and electra-base models: {final_list_ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup}\n")


    #print('------------------------------------------------------------')
    
    
    ## add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + final_list_ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup

# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup.tsv', sep="\t", index=False, header=False)   
print("ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup exported to tsv in path './predictions/test/ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup.tsv'\n")

In [None]:
python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup.tsv --output_file ./output/test/ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup.tsv

=========   EVALUATION config.=========
GOLD file = ./data/test/tsar2022_en_test_gold_no_noise.tsv
PREDICTION LABELS file = ./predictions/test/ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup.tsv
OUTPUT file = ./output/test/ElectraBertbase_SG_MA_SS_bsElectraBert_top5dup.tsv
===============   RESULTS  =============

MAP@1/Potential@1/Precision@1 = 0.5752

MAP@3 = 0.3838
MAP@5 = 0.2745
MAP@10 = 0.1767

Potential@3 = 0.8145
Potential@5 = 0.8602
Potential@10 = 0.9435

Accuracy@1@top_gold_1 = 0.25
Accuracy@2@top_gold_1 = 0.3844
Accuracy@3@top_gold_1 = 0.4435

Top 5 of the two best models, bertbase (bsBert) and electrabase (bsElectra), merged in the other order: the bertbase list first, then the electrabase list.
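
The merge step used in these cells can be summarised as a small function. This is a minimal sketch, not the notebook's own code: the function name `merge_top5_prioritising_shared` and the parameters `top_x` and `limit` are hypothetical, introduced only for illustration. It mirrors the logic of the cells: substitutes shared by both top-5 lists come first, then the remaining top-5 items, then the interleaved ranks 6 to 10.

```python
def merge_top5_prioritising_shared(top10_first, top10_second, top_x=5, limit=10):
    """Merge two ranked substitute lists, giving priority to shared top-x items.

    Hypothetical helper that follows the steps of the notebook cells.
    """
    heads = [top10_first[:top_x], top10_second[:top_x]]

    # count in how many head lists each substitute occurs
    # (dicts keep insertion order, so the first list determines the ranking)
    counts = {}
    for head in heads:
        for sub in head:
            counts[sub] = counts.get(sub, 0) + 1

    shared = [sub for sub, n in counts.items() if n == len(heads)]
    unique = [sub for sub, n in counts.items() if n == 1]
    merged = shared + unique

    # interleave the tails (ranks top_x+1 and lower); zip stops at the shorter tail
    tails = [sub for pair in zip(top10_first[top_x:], top10_second[top_x:]) for sub in pair]

    # as in the cells, tail items are only checked against the merged heads,
    # so a substitute that occurs in both tails is kept twice
    merged += [sub for sub in tails if sub not in merged]
    return merged[:limit]


bert = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
electra = ["c", "x", "a", "y", "z", "k", "l", "m", "n", "o"]

print(merge_top5_prioritising_shared(electra, bert))
print(merge_top5_prioritising_shared(bert, electra))
```

Because the counts dictionary preserves insertion order, the list passed first decides the order of both the shared and the non-shared items. That is why swapping the two models can change the top-ranked substitute and therefore the @1 scores.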

In [None]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    #print(f"Complex word: {complex_word}")
    
    
    # for bertbase model:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_bertbase = sentence.replace(complex_word, lm_tokenizer_bertbase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_bertbase = f"{sentence} {lm_tokenizer_bertbase.sep_token} {sentence_masked_word_bertbase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (skipping elements without a token_str key, as these gave errors with the ELECTRA models)
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
    substitutes_bertbase = [substitute["token_str"] for substitute in result_bertbase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_bertbase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise in the substitutes: skip generated substitutes that are empty, that contain unwanted punctuation characters, or that start with '##' (these returned errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_bertbase = [substitute["token_str"].lower().strip() for substitute in result_bertbase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: {substitutes_bertbase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence of a duplicate is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased subs are most likely ranked higher by the model)
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase:
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for bertbase model: {substitutes_no_dupl_bertbase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for bertbase model: {substitutes_no_dupl_complex_word_bertbase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes (this list is the input for the SS step)
    substitutes_no_dupl_complex_word_no_antonym_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_dupl_complex_word_no_antonym_bertbase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_dupl_complex_word_no_antonym_bertbase}\n")
    
    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_bertbase = [sentence.replace(complex_word, sub) for sub in substitutes_no_dupl_complex_word_no_antonym_bertbase]
    #print(f"List with sentences where complex word is substituted for bertbase model: {sentence_with_substitutes_bertbase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_bertbase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        scores_bsBert_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='bert-base-uncased', verbose=False)
        #scores_bsElectra_bertbase = bert_score.score([sentence]*len(sentence_with_substitutes_bertbase), sentence_with_substitutes_bertbase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score
        substitute_score_pairs_bsBert_bertbase = list(zip(substitutes_no_dupl_complex_word_no_antonym_bertbase, scores_bsBert_bertbase[0].tolist()))
        #substitute_score_pairs_bsElectra_bertbase = list(zip(substitutes_no_dupl_complex_word_no_antonym_bertbase, scores_bsElectra_bertbase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        sorted_substitute_score_pairs_bsBert_bertbase = sorted(substitute_score_pairs_bsBert_bertbase, key=lambda x: x[1], reverse=True)
        #sorted_substitute_score_pairs_bsElectra_bertbase = sorted(substitute_score_pairs_bsElectra_bertbase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        bertscore_ranked_substitutes_only_bsBert_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_bertbase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_bertbase}\n")
        #bertscore_ranked_substitutes_only_bsElectra_bertbase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_bertbase]
        #print(f"substitutes based on bertscores with Electra for bert-base model : {bertscore_ranked_substitutes_only_bsElectra_bertbase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        top_10_substitutes_bsBert_bertbase = bertscore_ranked_substitutes_only_bsBert_bertbase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_bertbase}\n")
        #top_10_substitutes_bsElectra_bertbase = bertscore_ranked_substitutes_only_bsElectra_bertbase[:10]
        #print(f"top-10 substitutes based on bertscores with Electra for bert-base model: {top_10_substitutes_bsElectra_bertbase}\n")

    else:
        top_10_substitutes_bsBert_bertbase  = []
        #top_10_substitutes_bsElectra_bertbase = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for bert-base model: {top_5_substitutes_bsBert_bertbase}\n")
    #top_5_substitutes_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for bert-base model: {top_5_substitutes_bsElectra_bertbase}\n")
    
    
    
    # for electrabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_electrabase = sentence.replace(complex_word, lm_tokenizer_electrabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_electrabase = f"{sentence} {lm_tokenizer_electrabase.sep_token} {sentence_masked_word_electrabase}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline (skipping elements without a token_str key, as these gave errors with the ELECTRA models)
    top_k = 30
    result_electrabase = fill_mask_electrabase(sentences_concat_electrabase, top_k=top_k)
    substitutes_electrabase = [substitute["token_str"] for substitute in result_electrabase if "token_str" in substitute]
    #print(f"Substitute Generation step: initial substitute list: {substitutes_electrabase}\n")


    #2: Morphological Generation and Context Adaptation (Morphological Adaptation):  
    ## a) remove noise in the substitutes: skip generated substitutes that are empty, that contain unwanted punctuation characters, or that start with '##' (these returned errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). The try/except statement guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_electrabase = [substitute["token_str"].lower().strip() for substitute in result_electrabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        # print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bert-base model: {substitutes_electrabase}\n")
    except TypeError as error:
        continue



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence of a duplicate is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased subs are most likely ranked higher by the model)
    substitutes_no_dupl_electrabase = []
    for sub in substitutes_electrabase:
        if sub not in substitutes_no_dupl_electrabase:
            substitutes_no_dupl_electrabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for electrabase model: {substitutes_no_dupl_electrabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_electrabase = []
    for substitute in substitutes_no_dupl_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_electrabase.append(substitute)
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_electrabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes (this list is the input for the SS step)
    substitutes_no_dupl_complex_word_no_antonym_electrabase = []
    for substitute in substitutes_no_dupl_complex_word_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_dupl_complex_word_no_antonym_electrabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_no_antonym_electrabase}\n")


    
    
    #3: Substitute Selection (SS) by calculating Bert scores: 

    ## create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes_electrabase = [sentence.replace(complex_word, sub) for sub in substitutes_no_dupl_complex_word_no_antonym_electrabase]
    #print(f"List with sentences where complex word is substituted for electrabase model: {sentence_with_substitutes_electrabase}\n")


    ## calculate BERTScores, and rank the substitutes based on these scores
    if len(sentence_with_substitutes_electrabase) > 0: # to make sure the list with substitutes is always filled
        logging.getLogger('transformers').setLevel(logging.ERROR)  # to prevent the same warnings from being printed x times 
        #scores_bsBert_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='bert-base-uncased', verbose=False)
        scores_bsElectra_electrabase = bert_score.score([sentence]*len(sentence_with_substitutes_electrabase), sentence_with_substitutes_electrabase, lang="en", model_type='google/electra-base-generator', verbose=False)
        logging.getLogger('transformers').setLevel(logging.WARNING) # to reset the logging level back to printing warnings
        
        # create a list of tuples, each tuple containing a substitute and its score
        #substitute_score_pairs_bsBert_electrabase = list(zip(substitutes_no_dupl_complex_word_no_antonym_electrabase, scores_bsBert_electrabase[0].tolist()))
        substitute_score_pairs_bsElectra_electrabase = list(zip(substitutes_no_dupl_complex_word_no_antonym_electrabase, scores_bsElectra_electrabase[0].tolist()))

        # sort the list of tuples by the scores (the second element of each tuple), in descending order
        #sorted_substitute_score_pairs_bsBert_electrabase = sorted(substitute_score_pairs_bsBert_electrabase, key=lambda x: x[1], reverse=True)
        sorted_substitute_score_pairs_bsElectra_electrabase = sorted(substitute_score_pairs_bsElectra_electrabase, key=lambda x: x[1], reverse=True)

        # extract the list of substitutes from the sorted pairs
        #bertscore_ranked_substitutes_only_bsBert_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsBert_electrabase]
        #print(f"substitutes with Bert based on bertscores with Bert: {bertscore_ranked_substitutes_only_bsBert_electrabase}\n")
        bertscore_ranked_substitutes_only_bsElectra_electrabase = [substitute for substitute, _ in sorted_substitute_score_pairs_bsElectra_electrabase]
        #print(f"substitutes with Electra based on bertscores with Electra : {bertscore_ranked_substitutes_only_bsElectra_electrabase}\n")


        # limit the substitutes to the 10 first ones for evaluation
        #top_10_substitutes_bsBert_electrabase = bertscore_ranked_substitutes_only_bsBert_electrabase[:10]
        #print(f"top-10 substitutes with Bert based on bertscores with Bert: {top_10_substitutes_bsBert_electrabase}\n")
        top_10_substitutes_bsElectra_electrabase = bertscore_ranked_substitutes_only_bsElectra_electrabase[:10]
        #print(f"top-10 substitutes with Electra based on bertscores with Electra: {top_10_substitutes_bsElectra_electrabase}\n")

    else:
        #top_10_substitutes_bsBert_electrabase  = []
        top_10_substitutes_bsElectra_electrabase = []
        
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    #top_5_substitutes_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Bert for bert-base model: {top_5_substitutes_bsBert_electrabase}\n")
    top_5_substitutes_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[:5]
    #print(f"SS step: top-5 substitutes based on bertscores with Electra for electrabase model: {top_5_substitutes_bsElectra_electrabase}\n")

   
    # merge the two top-5 lists into one list that sticks to the original order of each sub list
    #print('Finding shared duplicates in the two top-5 lists ....')

    # create a dictionary to hold the counts
    count_dict = {}

   
    # combine the top-5 lists into a list of lists for easy iteration (only two of the four lists are used here)
    #all_lists = [top_5_substitutes_bsElectra_electrabase, top_5_substitutes_bsBert_bertbase, top_5_substitutes_bsElectra_bertbase, top_5_substitutes_bsBert_electrabase]
    #all_lists = [top_5_substitutes_bsElectra_electrabase, top_5_substitutes_bsBert_bertbase]
    all_lists = [top_5_substitutes_bsBert_bertbase, top_5_substitutes_bsElectra_electrabase] # try the other way around to see if score increases
    #print(all_lists)

    # iterate through each list and count occurrences of each element
    for lst in all_lists:
        for elem in lst:
            if elem in count_dict:
                count_dict[elem] += 1
            else:
                count_dict[elem] = 1

    # create a list of duplicates by including only those elements that appeared in every list in all_lists
    duplicates_all_models = [elem for elem, count in count_dict.items() if count == len(all_lists)]

    #print(f"SS step: shared duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: {duplicates_all_models}\n")
    
    # create a list with non-duplicates
    non_duplicates_all_models = [elem for elem, count in count_dict.items() if count == 1]
    #print(f"SS step: non-duplicates in top-5 substitutes lists based on bertscores with Bert and Electra for both bertbase and electrabase models: {non_duplicates_all_models}\n")
    
    #concatenate both lists (duplicates_all_models and non_duplicates_all_models), giving duplicates_all_models priority
    top_5_concatenated = duplicates_all_models + non_duplicates_all_models
    #print(f"SS step: concatenated top-5 with prioritized shared duplicates among bertbase and electrabase bertscore results: {top_5_concatenated}\n")
    
    #create the "second half" lists (items 6th to 10th) from each of the original top 10 lists
    second_half_bsBert_bertbase = top_10_substitutes_bsBert_bertbase[5:]
    second_half_bsElectra_electrabase = top_10_substitutes_bsElectra_electrabase[5:]
    #second_half_bsBert_electrabase = top_10_substitutes_bsBert_electrabase[5:]
    #second_half_bsElectra_bertbase = top_10_substitutes_bsElectra_bertbase[5:]
    

    # zip the "second half" lists
    #second_half_combined = [item for sublist in zip(second_half_bsElectra_electrabase, second_half_bsBert_bertbase, second_half_bsElectra_bertbase, second_half_bsBert_electrabase) for item in sublist]
    #second_half_combined = [item for sublist in zip(second_half_bsElectra_electrabase, second_half_bsBert_bertbase) for item in sublist]
    second_half_combined = [item for sublist in zip(second_half_bsBert_bertbase, second_half_bsElectra_electrabase) for item in sublist] # try the other way around to see if score increases
    # Exclude any items in second_half_combined that are already in top_5_concatenated
    second_half_combined = [item for item in second_half_combined if item not in top_5_concatenated]

    # Append these items to the final_list
    top_5_concatenated += second_half_combined

    # Trim the final_list to the top 10 items
    final_list_BertElectrabase_SG_MA_SS_bsBertElectra_top5dup = top_5_concatenated[:10]

    #print(f"SS step: final list bertscore based on prioritizing shared duplicates among bert-base and electra-base models: {final_list_BertElectrabase_SG_MA_SS_bsBertElectra_top5dup}\n")


    #print('------------------------------------------------------------')
    
    
    ## add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + final_list_BertElectrabase_SG_MA_SS_bsBertElectra_top5dup

# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv', sep="\t", index=False, header=False)   
print("BertElectrabase_SG_MA_SS_bsBertElectra_top5dup exported to tsv in path './predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv'\n")

In [None]:
python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv --output_file ./output/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv

=========   EVALUATION config.=========
GOLD file = ./data/test/tsar2022_en_test_gold_no_noise.tsv
PREDICTION LABELS file = ./predictions/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv
OUTPUT file = ./output/test/BertElectrabase_SG_MA_SS_bsBertElectra_top5dup.tsv
===============   RESULTS  =============

MAP@1/Potential@1/Precision@1 = 0.6129

MAP@3 = 0.4142
MAP@5 = 0.3035
MAP@10 = 0.1859

Potential@3 = 0.836
Potential@5 = 0.8951
Potential@10 = 0.9435

Accuracy@1@top_gold_1 = 0.2715
Accuracy@2@top_gold_1 = 0.4112
Accuracy@3@top_gold_1 = 0.4784

highest score so far!!

To Do later:

1. Check whether padding is the right approach when there are fewer than 10 substitutes. It may be better to pad with the first item in the combined substitute list (the semantically best-fitting one), to generate more substitutes, or to raise the limit after generation.
2. Remove dissimilar substitutes, e.g., to drop 'voluntary' from the list for 'compulsory' (maybe with equivalence scores or word-embedding scores?).
3. Explore WordNet co-hyponyms and add/prioritize them in the list, together with the synsets code (see old code). Think about how to prioritize them: above or after the duplicates from all 4 models.
4. Remove substitutes like 'disguise' and 'dressing' (complex word = disguised) and 'deployment' (complex word = deploy). Maybe remove words whose first 4 or 5 characters are the same as those of the complex word. That removes words like 'disguise' and 'deployment', but not 'dressing' (which can be a noun and a verb); 'dressing' only appears in the substitute list together with 'dressed', which is a good substitute. Maybe check what FitBERT does?
5. Update the code with embeddings and BERTScore (applied to the combined list of all 4 systems), and see which of these works best.
6. Automate the code so that it selects all models automatically, one by one, and the separate variables are no longer needed.
7. Automate testing the scores with different top-x values for the concatenated list of all 4 models.
8. Automate finding the best top-x, based on the output scores for the task.
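
Two of the ideas in the list above (padding short lists with the first item, and dropping substitutes that share their first characters with the complex word) can be tried out with small helpers. This is a minimal sketch under stated assumptions: the names `pad_to_length`, `shares_prefix` and `drop_same_prefix` and the default `prefix_len=4` are hypothetical and not part of the notebook, and whether either heuristic improves the scores still has to be tested.

```python
def pad_to_length(substitutes, length=10, filler=None):
    """Pad a ranked substitute list to a fixed length.

    Assumption: when no filler is given, the top-ranked substitute is repeated
    (an empty string is used if the list itself is empty).
    """
    if filler is None:
        filler = substitutes[0] if substitutes else ""
    padded = list(substitutes[:length])
    return padded + [filler] * (length - len(padded))


def shares_prefix(candidate, complex_word, prefix_len=4):
    """Return True if both words start with the same prefix_len characters."""
    if len(candidate) < prefix_len or len(complex_word) < prefix_len:
        return False
    return candidate[:prefix_len].lower() == complex_word[:prefix_len].lower()


def drop_same_prefix(substitutes, complex_word, prefix_len=4):
    """Remove substitutes that share their first prefix_len characters with the complex word."""
    return [sub for sub in substitutes if not shares_prefix(sub, complex_word, prefix_len)]


print(drop_same_prefix(["disguise", "dressing", "dressed", "hidden"], "disguised"))
print(drop_same_prefix(["deployment", "use"], "deploy"))
print(pad_to_length(["hidden", "masked"], length=4))
```

A fixed-length list also matters for the export step: each row assigned to `substitutes_df` needs a value for all ten `substitute_*` columns.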