### All code below uses concatenated sentence pairs (original sentence + masked sentence) in the Substitute Generation step, in order to generate substitutes that are similar to the complex word (as opposed to substitutes that merely fit the context)
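
As a minimal sketch of what a concatenated sentence pair looks like: the original sentence and its masked copy are joined with the tokenizer's separator token, so the model can see the complex word in the first segment while it predicts the mask in the second. The helper name `build_pair_input`, the default tokens and the example sentence are assumptions made for illustration. Masking only the first occurrence (`count=1`) is also an assumption; the cells below replace every occurrence.

```python
def build_pair_input(sentence, complex_word, mask_token="[MASK]", sep_token="[SEP]"):
    """Return '<sentence> <sep> <masked sentence>' as input for a fill-mask model."""
    # mask only the first occurrence, so the input contains exactly one mask
    masked_sentence = sentence.replace(complex_word, mask_token, 1)
    return f"{sentence} {sep_token} {masked_sentence}"


example = build_pair_input("Attendance is compulsory.", "compulsory")
print(example)  # Attendance is compulsory. [SEP] Attendance is [MASK].
```

For RoBERTa the same helper could be called with `mask_token="<mask>"` and the tokenizer's own `sep_token`.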

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import pandas as pd

# read the tsv file
filename = "./data/trial/tsar2022_en_trial_none.tsv"
data = pd.read_csv(filename, sep='\t', header=None, names=["sentence", "complex_word"])

# create an empty dataframe to store the substitutes for evaluation
substitutes_df = pd.DataFrame(columns=["sentence", "complex_word"] + [f"substitute_{i+1}" for i in range(10)])


In [None]:
from nltk.corpus import wordnet as wn
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
import string

### Bert-base and Bert-large

In [2]:
# initialize the models
model_bertbase = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model_bertlarge = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")

# create a fill-mask pipeline for each model (the base and large checkpoints seem to share the same tokenizer, so the bert-large tokenizer is used for both)
fill_mask_bertbase = pipeline("fill-mask", model_bertbase, tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased"))
fill_mask_bertlarge = pipeline("fill-mask", model_bertlarge, tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased"))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification mode

### Roberta-base and Roberta-large

In [None]:
# initialize the models
model_robertabase = AutoModelForMaskedLM.from_pretrained("roberta-base")
model_robertalarge = AutoModelForMaskedLM.from_pretrained("roberta-large")

# create a fill-mask pipeline for each model (the base and large checkpoints seem to share the same tokenizer, so the roberta-large tokenizer is used for both)
fill_mask_robertabase = pipeline("fill-mask", model_robertabase, tokenizer = AutoTokenizer.from_pretrained("roberta-large"))
fill_mask_robertalarge = pipeline("fill-mask", model_robertalarge, tokenizer = AutoTokenizer.from_pretrained("roberta-large"))

### Electra-base and Electra-large

to do: also update the variable names in the prediction code.

In [None]:
# instantiate the model (the tokenizer is loaded when the pipeline is created)
model_electrabase = AutoModelForMaskedLM.from_pretrained("google/electra-base-generator")

# instantiate the fill-mask pipeline with the ELECTRA-base generator
fill_mask_electrabase = pipeline("fill-mask", model_electrabase, tokenizer=AutoTokenizer.from_pretrained("google/electra-base-generator"))

In [None]:
# instantiate the tokenizer and the model
tokenizer_electralarge = AutoTokenizer.from_pretrained("google/electra-large-generator")
model_electralarge = AutoModelForMaskedLM.from_pretrained("google/electra-large-generator")

# instantiate the fill-mask pipeline with the ELECTRA-large generator
fill_mask_electralarge = pipeline("fill-mask", model_electralarge, tokenizer=tokenizer_electralarge)

#### Substitute Generation with BERT-base and BERT-large, and Substitute Selection steps a-c, k =5
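
Selection step a) is repeated for every model in the cells that follow. As a sketch, it could be written once as a helper. The name `dedupe_and_drop_punct` and the example list are made up for illustration, and the extra `.strip()` is an assumption that goes slightly beyond what the cells do.

```python
import string

# all ASCII punctuation except the hyphen, so compounds such as "well-known" survive
PUNCT_NO_HYPHEN = set(string.punctuation) - {"-"}


def dedupe_and_drop_punct(substitutes):
    """Lowercase the candidates, drop duplicates and drop anything containing punctuation."""
    kept = []
    for sub in substitutes:
        sub = sub.lower().strip()
        if not sub or sub in kept:
            continue  # skip empty strings and duplicates
        if any(char in PUNCT_NO_HYPHEN for char in sub):
            continue  # skip tokens such as "." or "##ly"
        kept.append(sub)
    return kept


cleaned = dedupe_and_drop_punct(["mandatory", "Mandatory", ".", "well-known", "##ly", " required"])
print(cleaned)  # ['mandatory', 'well-known', 'required']
```

The order of the input list is preserved, which matters because the pipeline returns candidates ranked by score.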

In [7]:

# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # for Bert-base:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}\n")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "[MASK]")

    ## fill the mask with substitute words, and concatenate the original sentence and the masked sentence
    tokenizer_bertbase = fill_mask_bertbase.tokenizer
    sentences_concat_bertbase = f"{sentence} {tokenizer_bertbase.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes_bertbase = [substitute["token_str"].lower() for substitute in result_bertbase]
    #print(f"SG step: Bertbase generated substitutes: {substitutes_bertbase}\n")
    
   # 2. Substitute Selection (SS):   

    # create a punctuation set without hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')
    
    # a) remove duplicates and unwanted punctuation within the substitute list from the substitute list
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"SS step: a) Bertbase substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_bertbase}\n")
    

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"SS step: b) Bertbase substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_bertbase}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed in WordNet for the first synset of the complex word (empty set if there is no synset)
    synsets_complex_word = wn.synsets(complex_word_lemma)
    antonyms_complex_word = {antonym.name() for lemma in synsets_complex_word[0].lemmas() for antonym in lemma.antonyms()} if synsets_complex_word else set()

    ## keep a substitute only if its own lemma is not one of those antonyms
    substitutes_no_dupl_complex_word_no_antonym_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms_complex_word:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_bertbase.append(substitute)
    #print(f"SS step: c): Bertbase substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_bertbase}\n")
    
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bertbase = substitutes_no_dupl_complex_word_no_antonym_bertbase[:5]
    print(f"SS step: d): top_5_substitutes_bertbase: {top_5_substitutes_bertbase}\n")

    
    
    
    # for Bert-large:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    # ## print the sentence and the complex word
    # sentence, complex_word = row["sentence"], row["complex_word"]
    # print(f"Sentence: {sentence}")
    # print(f"Complex word: {complex_word}")

#     ## in the sentence, replace the complex word with a masked word
#     sentence_masked_word = sentence.replace(complex_word, "[MASK]")

    ## fill the mask with substitute words, and concatenate the original sentence and the masked sentence
    tokenizer_bertlarge = fill_mask_bertlarge.tokenizer
    sentences_concat_bertlarge = f"{sentence} {tokenizer_bertlarge.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_bertlarge = fill_mask_bertlarge(sentences_concat_bertlarge, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes_bertlarge = [substitute["token_str"].lower() for substitute in result_bertlarge]
    #print(f"SG step: Bertlarge generated substitutes: {substitutes_bertlarge}\n")
    
   # 2. Substitute Selection (SS):   

    # create a punctuation set without hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')
    
    # a) remove duplicates and unwanted punctuation (this happened with BERT) within the substitute list from the substitute list
    substitutes_no_dupl_bertlarge = []
    for sub in substitutes_bertlarge:
        if sub not in substitutes_no_dupl_bertlarge and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_bertlarge.append(sub)
    #print(f"SS step: a) Bertlarge substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_bertlarge}\n")
    

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertlarge = []
    for substitute in substitutes_no_dupl_bertlarge:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertlarge.append(substitute)
    #print(f"SS step: b) Bertlarge substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_bertlarge}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed in WordNet for the first synset of the complex word (empty set if there is no synset)
    synsets_complex_word = wn.synsets(complex_word_lemma)
    antonyms_complex_word = {antonym.name() for lemma in synsets_complex_word[0].lemmas() for antonym in lemma.antonyms()} if synsets_complex_word else set()

    ## keep a substitute only if its own lemma is not one of those antonyms
    substitutes_no_dupl_complex_word_no_antonym_bertlarge = []
    for substitute in substitutes_no_dupl_complex_word_bertlarge:
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms_complex_word:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_bertlarge.append(substitute)
    #print(f"SS step: c): Bertlarge substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_bertlarge}\n")
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bertlarge = substitutes_no_dupl_complex_word_no_antonym_bertlarge[:5]
    print(f"SS step: d): top_5_substitutes_bertlarge: {top_5_substitutes_bertlarge}\n")
    
    
    
    # find duplicates between top_5_substitutes_bertbase and top_5_substitutes_bertlarge
    duplicates = set(top_5_substitutes_bertbase) & set(top_5_substitutes_bertlarge)

    # create a list with duplicates, preserving the original order as much as possible
    duplicates_top5_bertbaselarge = []
    for sub in top_5_substitutes_bertbase + top_5_substitutes_bertlarge:
        if sub in duplicates and sub not in duplicates_top5_bertbaselarge:
            duplicates_top5_bertbaselarge.append(sub)

    print(f"duplicates_top5_bertbaselarge: {duplicates_top5_bertbaselarge}\n")

    # create a list with non-duplicates, using the interleaving order
    nonduplicates_top5_bertbaselarge = []
    dup_used = set(duplicates_top5_bertbaselarge)

    for bert_base, bert_large in zip(top_5_substitutes_bertbase, top_5_substitutes_bertlarge):
        if bert_base not in dup_used:
            nonduplicates_top5_bertbaselarge.append(bert_base)
        if bert_large not in dup_used:
            nonduplicates_top5_bertbaselarge.append(bert_large)

    print(f"nonduplicates_top5_bertbaselarge: {nonduplicates_top5_bertbaselarge}\n")
    
    
    # concatenate all lists, the duplicate list first
    combined_top5_bertbaselarge_duplicates_first = duplicates_top5_bertbaselarge + nonduplicates_top5_bertbaselarge
    print(f"combined_top5_bertbaselarge with duplicates first: {combined_top5_bertbaselarge_duplicates_first}\n")
    
    print('-------------------------------------------------------------------------------------------------')
    
    
    # check if the number of substitutes is greater than the current number of substitute columns
    num_substitutes = len(combined_top5_bertbaselarge_duplicates_first)
    num_columns = len(substitutes_df.columns)

    if num_substitutes + 2 > num_columns:
        # Add the required number of new columns
        for i in range(num_columns - 2, num_substitutes):
            substitutes_df[f"substitute_{i+1}"] = ''

    # pad the list with empty strings to match the number of columns in the DataFrame
    padding = [''] * (num_columns - 2 - len(combined_top5_bertbaselarge_duplicates_first))
    row_data = [sentence, complex_word] + combined_top5_bertbaselarge_duplicates_first + padding

    # Add the row to the DataFrame
    substitutes_df.loc[index] = row_data
    
    
# remove the "#34-3" and "#35-14" prefixes from the sentences in the DataFrame (once, after all rows have been added)
substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "", regex=False)
substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "", regex=False)

      

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/BertBaseLarge_top_5_SG_SS_abc.tsv", sep="\t", index=False, header=False)
    
    
 
       
    
    
    
    
    

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory

SS step: c): Bertbase substitute list without antonyms of the complex word: ['mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', 'only', 'illegal', 'sufficient', 'unnecessary', 'easy', 'normal', 'permitted', 'mandated', 'difficult', 'simple', 'appropriate', 'expensive', 'possible', 'commonplace', 'essential', 'proper', 'available', 'enough', 'affordable']

SS step: d): top_5_substitutes_bertbase: ['mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary']

SS step: c): Bertlarge substitute list without antonyms of the complex word: ['mandatory', 'obligatory', 'required', 'voluntary', 'optional', 'manda

In [None]:
!python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file .\predictions\trial\BertBaseLarge_top_5_SG_SS_abc.tsv --output_file .\output

Result: taking the top 7 of both BERT-base and BERT-large seems to give the best results, though only slightly.
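
To compare different cut-offs such as 5 or 7, the merging logic from the cell above could be wrapped in a function with the cut-off as a parameter. This is only a sketch: the name `merge_duplicates_first` and the example lists are invented. Unlike the cell above, it uses `zip_longest`, so leftover items are not dropped when one list is shorter than the other.

```python
from itertools import zip_longest


def merge_duplicates_first(ranked_a, ranked_b, k=5):
    """Merge the top-k of two ranked lists: shared substitutes first, the rest interleaved."""
    top_a, top_b = ranked_a[:k], ranked_b[:k]
    shared = set(top_a) & set(top_b)
    # substitutes proposed by both models, in the order of the first list
    merged = [sub for sub in top_a if sub in shared]
    # remaining substitutes, alternating between the two lists
    for pair in zip_longest(top_a, top_b):
        for sub in pair:
            if sub is not None and sub not in merged:
                merged.append(sub)
    return merged


merged_example = merge_duplicates_first(
    ["mandatory", "obligatory", "optional"],
    ["mandatory", "required", "obligatory"],
    k=3,
)
print(merged_example)  # ['mandatory', 'obligatory', 'required', 'optional']
```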

To do: do the same for RoBERTa (see the code above).
Then export the result to a tsv file as before and check the results. See whether fewer than 10 substitutes causes problems; if so, repeat the first word until 10 is reached.
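
The padding idea from the to-do could look roughly like this. The function name `pad_to_length` is hypothetical, and whether the evaluation script actually needs 10 substitutes per row is still the open question noted above.

```python
def pad_to_length(substitutes, length=10):
    """Trim to `length` items, or repeat the first substitute until `length` is reached."""
    if not substitutes:
        return []  # nothing to repeat
    trimmed = substitutes[:length]
    return trimmed + [trimmed[0]] * (length - len(trimmed))


padded_example = pad_to_length(["mandatory", "required"], length=4)
print(padded_example)  # ['mandatory', 'required', 'mandatory', 'mandatory']
```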

### Roberta-base and Roberta-large:

In [5]:
# initialize the models
model_robertabase = AutoModelForMaskedLM.from_pretrained("roberta-base")
model_robertalarge = AutoModelForMaskedLM.from_pretrained("roberta-large")

# create a fill-mask pipeline for each model (the base and large checkpoints seem to share the same tokenizer, so the roberta-large tokenizer is used for both)
fill_mask_robertabase = pipeline("fill-mask", model_robertabase, tokenizer = AutoTokenizer.from_pretrained("roberta-large"))
fill_mask_robertalarge = pipeline("fill-mask", model_robertalarge, tokenizer = AutoTokenizer.from_pretrained("roberta-large"))

In [8]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # for Roberta-base:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}\n")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word,"<mask>")   # RoBERTa uses <mask> instead of [MASK]

    ## fill the mask with substitute words, and concatenate the original sentence and the masked sentence
    tokenizer_robertabase = fill_mask_robertabase.tokenizer
    sentences_concat_robertabase = f"{sentence} {tokenizer_robertabase.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_robertabase = fill_mask_robertabase(sentences_concat_robertabase, top_k=top_k)
   
        
    ## lowercase, remove the leading space in each substitute (roberta only) and print the top-k substitutes
    substitutes_robertabase = [substitute["token_str"].lower().lstrip() for substitute in result_robertabase]   # and use .lstrip to remove the leading space (for ROBERTA only), as Roberta tokenizes by default with a space in front of the word
    #print(f"SG step: Robertabase generated substitutes: {substitutes_robertabase}\n")
    
   # 2. Substitute Selection (SS):   

    # create a punctuation set without hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')
    
    # a) remove duplicates and unwanted punctuation (this happened with BERT) within the substitute list from the substitute list
    substitutes_no_dupl_robertabase = []
    for sub in substitutes_robertabase:
        if sub not in substitutes_no_dupl_robertabase and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_robertabase.append(sub)
    #print(f"SS step: a) Robertabase substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_robertabase}\n")
    

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_robertabase = []
    for substitute in substitutes_no_dupl_robertabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_robertabase.append(substitute)
    #print(f"SS step: b) Robertabase substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_robertabase}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed in WordNet for the first synset of the complex word (empty set if there is no synset)
    synsets_complex_word = wn.synsets(complex_word_lemma)
    antonyms_complex_word = {antonym.name() for lemma in synsets_complex_word[0].lemmas() for antonym in lemma.antonyms()} if synsets_complex_word else set()

    ## keep a substitute only if its own lemma is not one of those antonyms
    substitutes_no_dupl_complex_word_no_antonym_robertabase = []
    for substitute in substitutes_no_dupl_complex_word_robertabase:
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms_complex_word:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_robertabase.append(substitute)
    #print(f"SS step: c): Robertabase substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_robertabase}\n")
    
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_robertabase = substitutes_no_dupl_complex_word_no_antonym_robertabase[:5]
    print(f"SS step: d): top_5_substitutes_robertabase: {top_5_substitutes_robertabase}\n")

    
    
    
    # for Roberta-large:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    # ## print the sentence and the complex word
    # sentence, complex_word = row["sentence"], row["complex_word"]
    # print(f"Sentence: {sentence}")
    # print(f"Complex word: {complex_word}")

#     ## in the sentence, replace the complex word with a masked word
#     sentence_masked_word = sentence.replace(complex_word, "<mask>")  # RoBERTa uses <mask> instead of [MASK]

    ## fill the mask with substitute words, and concatenate the original sentence and the masked sentence
    tokenizer_robertalarge = fill_mask_robertalarge.tokenizer
    sentences_concat_robertalarge = f"{sentence} {tokenizer_robertalarge.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_robertalarge = fill_mask_robertalarge(sentences_concat_robertalarge, top_k=top_k)
   
     ## lowercase, remove the leading space in each substitute (roberta only) and print the top-k substitutes
    substitutes_robertalarge = [substitute["token_str"].lower().lstrip() for substitute in result_robertalarge]   # and use .lstrip to remove the leading space (for ROBERTA only), as Roberta tokenizes by default with a space in front of the word
    #print(f"SG step: Robertalarge generated substitutes: {substitutes_robertalarge}\n")
    
   # 2. Substitute Selection (SS):   

    # create a punctuation set without hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')
    
    # a) remove duplicates and unwanted punctuation within the substitute list from the substitute list
    substitutes_no_dupl_robertalarge = []
    for sub in substitutes_robertalarge:
        if sub not in substitutes_no_dupl_robertalarge and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_robertalarge.append(sub)
    #print(f"SS step: a) Robertalarge substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_robertalarge}\n")
    

   
    # # b) remove duplicates and inflected forms of the complex word from the substitute list
    # ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    # doc_complex_word = nlp(complex_word)
    # complex_word_lemma = doc_complex_word[0].lemma_
    # #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_robertalarge = []
    for substitute in substitutes_no_dupl_robertalarge:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_robertalarge.append(substitute)
    #print(f"SS step: b) Robertalarge substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_robertalarge}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed in WordNet for the first synset of the complex word (empty set if there is no synset)
    synsets_complex_word = wn.synsets(complex_word_lemma)
    antonyms_complex_word = {antonym.name() for lemma in synsets_complex_word[0].lemmas() for antonym in lemma.antonyms()} if synsets_complex_word else set()

    ## keep a substitute only if its own lemma is not one of those antonyms
    substitutes_no_dupl_complex_word_no_antonym_robertalarge = []
    for substitute in substitutes_no_dupl_complex_word_robertalarge:
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms_complex_word:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_robertalarge.append(substitute)
    #print(f"SS step: c): Robertalarge substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_robertalarge}\n")
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_robertalarge = substitutes_no_dupl_complex_word_no_antonym_robertalarge[:5]
    print(f"SS step: d): top_5_substitutes_robertalarge: {top_5_substitutes_robertalarge}\n")
    
    
    
    
    
    # find duplicates between top_5_substitutes_robertabase and top_5_substitutes_robertalarge
    duplicates = set(top_5_substitutes_robertabase) & set(top_5_substitutes_robertalarge)

    # create a list with duplicates, preserving the original order as much as possible
    duplicates_top5_robertabaselarge = []
    for sub in top_5_substitutes_robertabase + top_5_substitutes_robertalarge:
        if sub in duplicates and sub not in duplicates_top5_robertabaselarge:
            duplicates_top5_robertabaselarge.append(sub)

    print(f"duplicates_top5_robertabaselarge: {duplicates_top5_robertabaselarge}\n")

    # create a list with non-duplicates, using the interleaving order
    nonduplicates_top5_robertabaselarge = []
    dup_used = set(duplicates_top5_robertabaselarge)

    for roberta_base, roberta_large in zip(top_5_substitutes_robertabase, top_5_substitutes_robertalarge):
        if roberta_base not in dup_used:
            nonduplicates_top5_robertabaselarge.append(roberta_base)
        if roberta_large not in dup_used:
            nonduplicates_top5_robertabaselarge.append(roberta_large)

    print(f"nonduplicates_top5_robertabaselarge: {nonduplicates_top5_robertabaselarge}\n")
    
    
    # concatenate all lists, the duplicate list first
    combined_top5_robertabaselarge_duplicates_first = duplicates_top5_robertabaselarge + nonduplicates_top5_robertabaselarge
    print(f"combined_top5_robertabaselarge with duplicates first: {combined_top5_robertabaselarge_duplicates_first}\n")
    
    print('-------------------------------------------------------------------------------------------------')
    
    
    # check if the number of substitutes is greater than the current number of substitute columns
    num_substitutes = len(combined_top5_robertabaselarge_duplicates_first)
    num_columns = len(substitutes_df.columns)

    if num_substitutes + 2 > num_columns:
        # Add the required number of new columns
        for i in range(num_columns - 2, num_substitutes):
            substitutes_df[f"substitute_{i+1}"] = ''

    # pad the list with empty strings to match the number of columns in the DataFrame
    padding = [''] * (num_columns - 2 - len(combined_top5_robertabaselarge_duplicates_first))
    row_data = [sentence, complex_word] + combined_top5_robertabaselarge_duplicates_first + padding

    # Add the row to the DataFrame
    substitutes_df.loc[index] = row_data
    
    
# remove the "#34-3" and "#35-14" prefixes from the sentences in the DataFrame (once, after all rows have been added)
substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "", regex=False)
substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "", regex=False)

      

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaBaseLarge_top_5_SG_SS_abc.tsv", sep="\t", index=False, header=False)
    
    
 

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory

SG step: Robertabase generated substitutes: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indispensable', 'forthcoming', 'universal', 'requirement', 'involuntary', 'obligated', 'compelled', 'conditional', 'enforced', 'contingent', 'possible', 'compulsion', 'mandatory']

SS step: c): Robertabase substitute list without antonyms of the complex word: ['mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indi

In [None]:
!python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file .\predictions\trial\RobertaBaseLarge_top_5_SG_SS_abc.tsv --output_file .\output

### Combine all 4 models (bert and roberta, base and large)
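
Since all four models go through the same generation step, that step could be sketched as one function that takes any fill-mask pipeline. The name `generate_substitutes` is an assumption, and the two stub classes only stand in for a real pipeline so that the example runs without downloading a model. The sketch also assumes the complex word occurs exactly once in the sentence, so that the input contains a single mask and the pipeline returns a flat list of dictionaries.

```python
def generate_substitutes(fill_mask, sentence, complex_word, top_k=30):
    """Mask the complex word, build the sentence pair and return the ranked candidates."""
    tokenizer = fill_mask.tokenizer
    # use the tokenizer's own mask token: "[MASK]" for BERT, "<mask>" for RoBERTa
    sentence_masked_word = sentence.replace(complex_word, tokenizer.mask_token)
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"
    result = fill_mask(sentences_concat, top_k=top_k)
    # lowercase and strip, which also removes the leading space RoBERTa puts before words
    return [candidate["token_str"].lower().strip() for candidate in result]


class StubTokenizer:
    """Hypothetical stand-in exposing the two tokenizer attributes used above."""
    mask_token = "[MASK]"
    sep_token = "[SEP]"


class StubFillMask:
    """Hypothetical stand-in for a fill-mask pipeline that returns canned candidates."""
    tokenizer = StubTokenizer()

    def __call__(self, text, top_k=5):
        self.last_input = text
        canned = [{"token_str": " Mandatory"}, {"token_str": "required"}]
        return canned[:top_k]


stub = StubFillMask()
stub_substitutes = generate_substitutes(stub, "Attendance is compulsory.", "compulsory")
print(stub_substitutes)  # ['mandatory', 'required']
```

With the real pipelines, the same call would be made once per model, for example with `fill_mask_bertbase` or `fill_mask_robertalarge` in place of the stub.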

In [10]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # for Bert-base:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}\n")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "[MASK]")

    ## fill the mask with substitute words, and concatenate the original sentence and the masked sentence
    tokenizer_bertbase = fill_mask_bertbase.tokenizer
    sentences_concat_bertbase = f"{sentence} {tokenizer_bertbase.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes_bertbase = [substitute["token_str"].lower() for substitute in result_bertbase]
    #print(f"SG step: Bertbase generated substitutes: {substitutes_bertbase}\n")
    
   # 2. Substitute Selection (SS):   

    # create a punctuation set without hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')
    
    # a) remove duplicates and unwanted punctuation within the substitute list from the substitute list
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"SS step: a) Bertbase substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_bertbase}\n")
    

   
    # b) remove the complex word and its inflected forms from the substitute list
    ## lemmatize the complex word with spaCy, so that its lemma can be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## drop every substitute whose lemma equals the lemma of the complex word
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
    #print(f"SS step: b) Bertbase substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_bertbase}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed for the first WordNet synset of the complex word's lemma (once, outside the loop)
    synsets = wn.synsets(complex_word_lemma)
    antonyms = {antonym.name() for lemma in synsets[0].lemmas() for antonym in lemma.antonyms()} if synsets else set()

    substitutes_no_dupl_complex_word_no_antonym_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        ## lemmatize each substitute here: after step b), substitute_lemma only holds the lemma of the last substitute
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_bertbase.append(substitute)
    #print(f"SS step: c): Bertbase substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_bertbase}\n")
    
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bertbase = substitutes_no_dupl_complex_word_no_antonym_bertbase[:5]
    print(f"SS step: d): top_5_substitutes_bertbase: {top_5_substitutes_bertbase}\n")

    
    
    
    # for Bert-large:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    # ## print the sentence and the complex word
    # sentence, complex_word = row["sentence"], row["complex_word"]
    # print(f"Sentence: {sentence}")
    # print(f"Complex word: {complex_word}")

#     ## in the sentence, replace the complex word with a masked word
#     sentence_masked_word = sentence.replace(complex_word, "[MASK]")

    ## concatenate the original sentence and the masked sentence, separated by the tokenizer's separator token
    tokenizer_bertlarge = fill_mask_bertlarge.tokenizer
    sentences_concat_bertlarge = f"{sentence} {tokenizer_bertlarge.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_bertlarge = fill_mask_bertlarge(sentences_concat_bertlarge, top_k=top_k)
   
    ## lowercase the top-k substitutes (uncomment the print below to inspect them)
    substitutes_bertlarge = [substitute["token_str"].lower() for substitute in result_bertlarge]
    #print(f"SG step: Bertlarge generated substitutes: {substitutes_bertlarge}\n")
    
    # 2. Substitute Selection (SS):

    # create a punctuation set without the hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')

    # a) remove duplicates and substitutes containing unwanted punctuation (this happened with BERT) from the substitute list
    substitutes_no_dupl_bertlarge = []
    for sub in substitutes_bertlarge:
        if sub not in substitutes_no_dupl_bertlarge and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_bertlarge.append(sub)
    #print(f"SS step: a) Bertlarge substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_bertlarge}\n")
    

   
    # b) remove the complex word and its inflected forms from the substitute list
    ## lemmatize the complex word with spaCy, so that its lemma can be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## drop every substitute whose lemma equals the lemma of the complex word
    substitutes_no_dupl_complex_word_bertlarge = []
    for substitute in substitutes_no_dupl_bertlarge:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertlarge.append(substitute)
    #print(f"SS step: b) Bertlarge substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_bertlarge}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed for the first WordNet synset of the complex word's lemma (once, outside the loop)
    synsets = wn.synsets(complex_word_lemma)
    antonyms = {antonym.name() for lemma in synsets[0].lemmas() for antonym in lemma.antonyms()} if synsets else set()

    substitutes_no_dupl_complex_word_no_antonym_bertlarge = []
    for substitute in substitutes_no_dupl_complex_word_bertlarge:
        ## lemmatize each substitute here: after step b), substitute_lemma only holds the lemma of the last substitute
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_bertlarge.append(substitute)
    #print(f"SS step: c): Bertlarge substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_bertlarge}\n")
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_bertlarge = substitutes_no_dupl_complex_word_no_antonym_bertlarge[:5]
    print(f"SS step: d): top_5_substitutes_bertlarge: {top_5_substitutes_bertlarge}\n")
    
    
    
    
    # for RoBERTa-base:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    # sentence, complex_word = row["sentence"], row["complex_word"]
    # print(f"Sentence: {sentence}")
    # print(f"Complex word: {complex_word}\n")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word,"<mask>")   # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence, separated by the tokenizer's separator token
    tokenizer_robertabase = fill_mask_robertabase.tokenizer
    sentences_concat_robertabase = f"{sentence} {tokenizer_robertabase.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_robertabase = fill_mask_robertabase(sentences_concat_robertabase, top_k=top_k)
   
        
    ## lowercase the top-k substitutes and strip the leading space that RoBERTa's tokenizer puts in front of a word
    substitutes_robertabase = [substitute["token_str"].lower().lstrip() for substitute in result_robertabase]
    #print(f"SG step: Robertabase generated substitutes: {substitutes_robertabase}\n")
    
    # 2. Substitute Selection (SS):

    # create a punctuation set without the hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')

    # a) remove duplicates and substitutes containing unwanted punctuation (this happened with BERT) from the substitute list
    substitutes_no_dupl_robertabase = []
    for sub in substitutes_robertabase:
        if sub not in substitutes_no_dupl_robertabase and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_robertabase.append(sub)
    #print(f"SS step: a) Robertabase substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_robertabase}\n")
    

   
    # b) remove the complex word and its inflected forms from the substitute list
    ## lemmatize the complex word with spaCy, so that its lemma can be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## drop every substitute whose lemma equals the lemma of the complex word
    substitutes_no_dupl_complex_word_robertabase = []
    for substitute in substitutes_no_dupl_robertabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_robertabase.append(substitute)
    #print(f"SS step: b) Robertabase substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_robertabase}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed for the first WordNet synset of the complex word's lemma (once, outside the loop)
    synsets = wn.synsets(complex_word_lemma)
    antonyms = {antonym.name() for lemma in synsets[0].lemmas() for antonym in lemma.antonyms()} if synsets else set()

    substitutes_no_dupl_complex_word_no_antonym_robertabase = []
    for substitute in substitutes_no_dupl_complex_word_robertabase:
        ## lemmatize each substitute here: after step b), substitute_lemma only holds the lemma of the last substitute
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_robertabase.append(substitute)
    #print(f"SS step: c): Robertabase substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_robertabase}\n")
    
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_robertabase = substitutes_no_dupl_complex_word_no_antonym_robertabase[:5]
    print(f"SS step: d): top_5_substitutes_robertabase: {top_5_substitutes_robertabase}\n")

    
    
    
    # for Roberta-large:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    # ## print the sentence and the complex word
    # sentence, complex_word = row["sentence"], row["complex_word"]
    # print(f"Sentence: {sentence}")
    # print(f"Complex word: {complex_word}")

    # ## in the sentence, replace the complex word with a masked word
    # sentence_masked_word = sentence.replace(complex_word, "<mask>")  # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence (masked with <mask> in the RoBERTa-base block above)
    tokenizer_robertalarge = fill_mask_robertalarge.tokenizer
    sentences_concat_robertalarge = f"{sentence} {tokenizer_robertalarge.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result_robertalarge = fill_mask_robertalarge(sentences_concat_robertalarge, top_k=top_k)
   
    ## lowercase the top-k substitutes and strip the leading space that RoBERTa's tokenizer puts in front of a word
    substitutes_robertalarge = [substitute["token_str"].lower().lstrip() for substitute in result_robertalarge]
    #print(f"SG step: Robertalarge generated substitutes: {substitutes_robertalarge}\n")
    
    # 2. Substitute Selection (SS):

    # create a punctuation set without the hyphen, in order to retain hyphens in compound substitutes
    punctuation_without_hyphen = set(string.punctuation) - set('-')

    # a) remove duplicates and substitutes containing unwanted punctuation from the substitute list
    substitutes_no_dupl_robertalarge = []
    for sub in substitutes_robertalarge:
        if sub not in substitutes_no_dupl_robertalarge and not any(char in punctuation_without_hyphen for char in sub):
            substitutes_no_dupl_robertalarge.append(sub)
    #print(f"SS step: a) Robertalarge substitute list without duplicates and undesired punctuation: {substitutes_no_dupl_robertalarge}\n")
    

   
    # b) remove the complex word and its inflected forms from the substitute list
    ## complex_word_lemma was already computed in the RoBERTa-base block above and is reused here
    # doc_complex_word = nlp(complex_word)
    # complex_word_lemma = doc_complex_word[0].lemma_
    # #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## drop every substitute whose lemma equals the lemma of the complex word
    substitutes_no_dupl_complex_word_robertalarge = []
    for substitute in substitutes_no_dupl_robertalarge:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_robertalarge.append(substitute)
    #print(f"SS step: b) Robertalarge substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word_robertalarge}\n")

    # c) remove antonyms of the complex word from the substitute list
    ## collect the antonyms listed for the first WordNet synset of the complex word's lemma (once, outside the loop)
    synsets = wn.synsets(complex_word_lemma)
    antonyms = {antonym.name() for lemma in synsets[0].lemmas() for antonym in lemma.antonyms()} if synsets else set()

    substitutes_no_dupl_complex_word_no_antonym_robertalarge = []
    for substitute in substitutes_no_dupl_complex_word_robertalarge:
        ## lemmatize each substitute here: after step b), substitute_lemma only holds the lemma of the last substitute
        substitute_lemma = nlp(substitute)[0].lemma_
        if substitute_lemma in antonyms:
            print(f"Antonym removed (lemma): {substitute_lemma}")
        else:
            substitutes_no_dupl_complex_word_no_antonym_robertalarge.append(substitute)
    #print(f"SS step: c): Robertalarge substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym_robertalarge}\n")
     
    
    # limit the substitutes to the 5 first ones for concatenation with the top-5 of other models
    top_5_substitutes_robertalarge = substitutes_no_dupl_complex_word_no_antonym_robertalarge[:5]
    print(f"SS step: d): top_5_substitutes_robertalarge: {top_5_substitutes_robertalarge}\n")
    
    
    
    # concatenate the top-5 outputs of all 4 models


    # find the substitutes that occur in the top-5 lists of all 4 models
    duplicates = set(top_5_substitutes_bertbase) & set(top_5_substitutes_bertlarge) & set(top_5_substitutes_robertabase) & set(top_5_substitutes_robertalarge)

    # list these shared substitutes in order of first appearance across the four lists
    duplicates_top5_bertrobertabaselarge = []
    for sub in top_5_substitutes_bertbase + top_5_substitutes_bertlarge + top_5_substitutes_robertabase + top_5_substitutes_robertalarge:
        if sub in duplicates and sub not in duplicates_top5_bertrobertabaselarge:
            duplicates_top5_bertrobertabaselarge.append(sub)

    print(f"duplicates_top5_bertrobertabaselarge: {duplicates_top5_bertrobertabaselarge}\n")

    # create a list with non-duplicates, using the interleaving order
    nonduplicates_top5_bertrobertabaselarge = []
    dup_used = set(duplicates_top5_bertrobertabaselarge)
    already_added = set()

    for bert_base, bert_large, roberta_base, roberta_large in zip(top_5_substitutes_bertbase, top_5_substitutes_bertlarge,top_5_substitutes_robertabase,top_5_substitutes_robertalarge):
        if bert_base not in dup_used and bert_base not in already_added:
            nonduplicates_top5_bertrobertabaselarge.append(bert_base)
            already_added.add(bert_base)
        if bert_large not in dup_used and bert_large not in already_added:
            nonduplicates_top5_bertrobertabaselarge.append(bert_large)
            already_added.add(bert_large)
        if roberta_base not in dup_used and roberta_base not in already_added:
            nonduplicates_top5_bertrobertabaselarge.append(roberta_base)
            already_added.add(roberta_base)
        if roberta_large not in dup_used and roberta_large not in already_added:
            nonduplicates_top5_bertrobertabaselarge.append(roberta_large)
            already_added.add(roberta_large)

    print(f"nonduplicates_top5_bertrobertabaselarge: {nonduplicates_top5_bertrobertabaselarge}\n")
 
    
       
    
    # old code, remove if above code works
       

#     # create a list with non-duplicates, using the interleaving order
#     nonduplicates_top5_bertrobertabaselarge = []
#     dup_used = set(duplicates_top5_bertrobertabaselarge)

#     for bert_base, bert_large, roberta_base, roberta_large in zip(top_5_substitutes_bertbase, top_5_substitutes_bertlarge,top_5_substitutes_robertabase,top_5_substitutes_robertalarge):
#         if bert_base not in dup_used:
#             nonduplicates_top5_bertrobertabaselarge.append(bert_base)
#         if bert_large not in dup_used:
#             nonduplicates_top5_bertrobertabaselarge.append(bert_large)
#         if roberta_base not in dup_used:
#             nonduplicates_top5_bertrobertabaselarge.append(roberta_base)
#         if roberta_large not in dup_used:
#             nonduplicates_top5_bertrobertabaselarge.append(roberta_large)
            
#     print(f"nonduplicates_top5_bertrobertabaselarge: {nonduplicates_top5_bertrobertabaselarge}\n")
    
    
    # concatenate all lists, the duplicate list first
    combined_top5_bertrobertabaselarge_duplicates_first = duplicates_top5_bertrobertabaselarge + nonduplicates_top5_bertrobertabaselarge
    print(f"combined_top5_bertrobertabaselarge_duplicates_first: {combined_top5_bertrobertabaselarge_duplicates_first}\n")
    
    print('-------------------------------------------------------------------------------------------------')
    
    
    # check if the number of substitutes is greater than the current number of substitute columns
    num_substitutes = len(combined_top5_bertrobertabaselarge_duplicates_first)
    num_columns = len(substitutes_df.columns)

    if num_substitutes + 2 > num_columns:
        # Add the required number of new columns
        for i in range(num_columns - 2, num_substitutes):
            substitutes_df[f"substitute_{i+1}"] = ''

    # pad the list with empty strings to match the number of columns in the DataFrame
    padding = [''] * (num_columns - 2 - len(combined_top5_bertrobertabaselarge_duplicates_first))
    row_data = [sentence, complex_word] + combined_top5_bertrobertabaselarge_duplicates_first + padding

    # Add the row to the DataFrame
    substitutes_df.loc[index] = row_data
    
    
# remove the #34-3 and #35-14 character combinations from the sentences in the DataFrame (once, after the loop)
substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "", regex=False)
substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "", regex=False)

      

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/BertRobertaBaseLarge_top_5_SG_SS_abc.tsv", sep="\t", index=False, header=False)
    
    
    
    
    
    
    



Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory

SS step: d): top_5_substitutes_bertbase: ['mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary']

SS step: d): top_5_substitutes_bertlarge: ['mandatory', 'obligatory', 'required', 'voluntary', 'optional', 'mandated', 'necessary', 'forbidden']

SS step: d): top_5_substitutes_robertabase: ['mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary']

SS step: d): top_5_substitutes_robertalarge: ['mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine']

duplicates_top5_bertrobertabaselarge: ['mandatory', 'obligatory', 'voluntary']

nonduplicates_top5_bertrobertabaselarge: ['mandated', 'optiona

In [None]:
!python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file .\predictions\trial\BertRobertaBaseLarge_top_5_SG_SS_abc.tsv --output_file .\output

Results: so far, concatenating the top-x substitutes of all 4 models does not improve on the individual runs (BertBase_SG_SS_abc, BertLarge_SG_SS_abc, Robertabase_SG_SS_abc, Robertalarge_SG_SS_abc), and is sometimes even worse.
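
The concatenation step evaluated here can be written as one helper that accepts any number of ranked lists and a top-x cutoff. The following is a minimal sketch, not the notebook's code: `combine_duplicates_first` and its `top_k` parameter are hypothetical names. Unlike the `zip` loop in the cell above, it uses `zip_longest`, so a model that returns fewer candidates does not cut off the tails of the other lists.

```python
from itertools import zip_longest


def combine_duplicates_first(ranked_lists, top_k=5):
    """Merge per-model ranked lists: substitutes shared by all models first,
    then the remaining substitutes interleaved rank by rank."""
    tops = [ranked[:top_k] for ranked in ranked_lists]
    if not tops:
        return []

    # substitutes that every model proposed within its top_k
    shared = set(tops[0]).intersection(*tops[1:])

    combined = []
    # shared substitutes, in order of first appearance across the lists
    for top in tops:
        for sub in top:
            if sub in shared and sub not in combined:
                combined.append(sub)

    # remaining substitutes: rank 1 of each model, then rank 2, and so on
    for rank_group in zip_longest(*tops):
        for sub in rank_group:
            if sub is not None and sub not in combined:
                combined.append(sub)
    return combined
```

Because `top_k` is a parameter, the same function could be reused to compare different top-x values for the combined list.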

To Do later:

1. Check whether padding with empty strings is the right approach when there are fewer than 10 substitutes. It may be better to pad with the first item of the combined substitute list (the most semantically fitting one).
2. Remove dissimilar substitutes, e.g., 'voluntary' from the list for 'compulsory' (maybe with equivalence scores or word-embedding similarity scores).
3. Explore WordNet co-hyponyms and add or prioritize them in the list, together with the synsets code (see old code). Decide whether to rank them above or below the duplicates from all 4 models.
4. Remove substitutes like 'disguise' and 'dressing' (complex word = disguised) and 'deployment' (complex word = deploy), maybe by removing words whose first 4 or 5 characters are the same as those of the complex word. That would remove 'disguise' and 'deployment' but not 'dressing' (which can be a noun or a verb), as it only appears in the substitute list together with 'dressed', which is a good substitute. Maybe check how FitBERT handles this.
5. Update the code with embeddings and BERTScore (applied to the combined list of all 4 systems), and see which of these works best.
6. Explore equivalence scores (MANTIS).
7. Automate the code so that it runs all 4 models (bert-base, bert-large, roberta-base, roberta-large) one by one, so that the separate _bertbase, _bertlarge, _robertabase and _robertalarge variables are no longer needed.
8. Automate testing the scores with different top-x values for the concatenated list of all 4 models.
9. Automate selecting the best top-x, based on the output scores for the task.
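
The item about running the four models one by one could be approached by moving the SG and SS steps into a single function that is called once per pipeline. This is a minimal sketch under assumptions, not a finished refactoring: `generate_and_select` and `run_all_models` are hypothetical names, `lemmatize` stands for any callable that maps a word to its lemma (for example a thin wrapper around the spaCy `nlp` object), and `antonyms` is a precomputed set of antonym lemmas. The mask and separator tokens are read from each pipeline's tokenizer, so BERT and RoBERTa need no special-casing.

```python
import string

# punctuation to reject, keeping the hyphen for compound substitutes
PUNCT_NO_HYPHEN = set(string.punctuation) - {"-"}


def generate_and_select(fill_mask, sentence, complex_word, lemmatize,
                        antonyms=frozenset(), top_k=30, keep=5):
    """Run SG and SS steps a) to c) for one fill-mask pipeline."""
    tokenizer = fill_mask.tokenizer
    masked = sentence.replace(complex_word, tokenizer.mask_token)
    results = fill_mask(f"{sentence} {tokenizer.sep_token} {masked}", top_k=top_k)

    complex_lemma = lemmatize(complex_word)
    selected = []
    for result in results:
        # strip() also removes the leading space of RoBERTa tokens
        sub = result["token_str"].lower().strip()
        if not sub or sub in selected:
            continue  # a) empty string or duplicate
        if any(char in PUNCT_NO_HYPHEN for char in sub):
            continue  # a) unwanted punctuation
        lemma = lemmatize(sub)
        if lemma == complex_lemma:
            continue  # b) the complex word or an inflected form of it
        if lemma in antonyms:
            continue  # c) antonym of the complex word
        selected.append(sub)
    return selected[:keep]


def run_all_models(pipelines, sentence, complex_word, lemmatize,
                   antonyms=frozenset(), top_k=30, keep=5):
    """Apply generate_and_select to every pipeline in a {name: pipeline} dict."""
    return {
        name: generate_and_select(fill_mask, sentence, complex_word,
                                  lemmatize, antonyms, top_k, keep)
        for name, fill_mask in pipelines.items()
    }
```

With the notebook's objects this might be called as `run_all_models({"bertbase": fill_mask_bertbase, "bertlarge": fill_mask_bertlarge, "robertabase": fill_mask_robertabase, "robertalarge": fill_mask_robertalarge}, sentence, complex_word, lambda word: nlp(word)[0].lemma_)`, which returns one top-5 list per model name.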