### All code below uses concatenated sentence pairs in the Substitute Generation step in order to generate similar substitutes (as opposed to generation of fitting substitutes only)

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
from fitbert import FitBert
import pandas as pd
from transformers import pipeline

# read the tsv file
filename = "./data/trial/tsar2022_en_trial_none.tsv"
data = pd.read_csv(filename, sep='\t', header=None, names=["sentence", "complex_word"])

# create an empty dataframe to store the substitutes for evaluation
substitutes_df = pd.DataFrame(columns=["sentence", "complex_word"] + [f"substitute_{i+1}" for i in range(10)])

### RoBERTa-base

In [2]:
# initialize the tokenizer and the models
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
lm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# create a fill-mask pipeline 
fill_mask = pipeline("fill-mask", lm_model, tokenizer = AutoTokenizer.from_pretrained("roberta-base"))


#### Only Substitute Generation with Roberta-base (k=10)

In [3]:
# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")  # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 10
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaBase_SG.tsv", sep="\t", index=False, header=False)

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available']

Sentence: Rajoy's conservative government had instilled markets with a brief dose of confidence by stepping into Bankia, performing a U-turn on its refusal to spend public money to rescue banks.
Complex word: instilled
SG step: generated substitutes: ['infused', 'injected', 'endowed', 'illed', 'inst', 'furnished', 'supplied', 'bolstered', 'implanted', 'impressed']

Sentence: #34-3 "War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon.
Complex word: maniacs
SG 

python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaBase_SG.tsv --output_file .\output

In [4]:
# Result: worse than BERT-base, appr. the same as BERT-large. However, Potential is good. 

#### Substitute Generation with RoBERTa-base, and Substitute Selection steps a-c (k=30, limited to 10 after step 2c)

In [5]:
from nltk.corpus import wordnet as wn
import spacy
nlp = spacy.load("en_core_web_sm")

In [6]:

# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")   # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
   ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
     # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
        
     
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = substitutes_no_dupl_complex_word_no_antonym[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaBase_SG_SS_abc.tsv", sep="\t", index=False, header=False)
    
    

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indispensable', 'forthcoming', 'universal', 'requirement', 'involuntary', 'obligated', 'compelled', 'conditional', 'enforced', 'contingent', 'possible', 'compulsion', 'mandatory']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indispensable', 'forthcoming', 'un

python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaBase_SG_SS_abc.tsv --output_file .\output

Result: slightly better than without this step.

#### Substitute Generation with Roberta-base, and Substitute Selection steps a-c, and the resulting list with FitBERT

In [7]:
# FitBERT mainly seems to look at syntactic fit
import fitbert

In [8]:

# instantiate a FitBert model
fb_model = FitBert(lm_model)


# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")    # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
    
    # d) apply FITBERT to the list of substitutes
    sentence_fitbert_masked = sentence_masked_word.replace("<mask>", "***mask***")    # fitbert uses ***mask*** instead of [MASK] or <mask> 
    sentences_concat_fitbert = f"{sentence} {tokenizer.sep_token} {sentence_fitbert_masked}"
    
    ranked_substitutes = fb_model.rank(sentences_concat_fitbert, substitutes_no_dupl_complex_word_no_antonym)
    print(f"SS step: d) ranked substitutes using FitBert: {ranked_substitutes}\n")
    
    print('-----------------------------------------------------------------------------------------')
    print()
    
    
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = ranked_substitutes[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaBase_SG_SS_abc_fb.tsv", sep="\t", index=False, header=False)


device: cpu
using custom model: ['RobertaForMaskedLM']
Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indispensable', 'forthcoming', 'universal', 'requirement', 'involuntary', 'obligated', 'compelled', 'conditional', 'enforced', 'contingent', 'possible', 'compulsion', 'mandatory']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequ

python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaBase_SG_SS_abc_fb.tsv --output_file .\output

Result:  results are a lot worse than any other result.

#### Substitute Generation with ROBERTA-base, and Substitute Selection steps a-c, and the resulting list with contextualized embeddings

In [9]:
from transformers import TFAutoModel
import tensorflow as tf
import numpy as np

In [10]:
# Calculates similarity between the original sentence and the sentences with candidate substitutes that were retrieved in the SG step 
# creates a list with sentences with substitute words filled in (commented out for oversight purposes)


def calculate_similarity_scores(sentence, sentence_with_substitutes):
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    tf_model = TFAutoModel.from_pretrained("roberta-base")

    def embed_text(text):
        tokens = tokenizer(text, padding=True, truncation=True, return_tensors="tf")
        outputs = tf_model(**tokens)
        embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings = tf.nn.l2_normalize(embeddings, axis=1)
        return embeddings

    original_sentence_embedding = embed_text(sentence)
    substitute_sentence_embeddings = embed_text(sentence_with_substitutes)

    cosine_similarity = np.inner(original_sentence_embedding, substitute_sentence_embeddings)
    similarity_scores = cosine_similarity[0]

    return similarity_scores




# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")   # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
    
    # create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes = [sentence.replace(complex_word, sub) for sub in substitutes_no_dupl_complex_word_no_antonym]
    #print(f"List with sentences where complex word is substituted: {sentence_with_substitutes}\n")
    
    
    # d) calculate cosine similarity scores, and rank the substitutes based on their similarity score
    similarity_scores = calculate_similarity_scores(sentence, sentence_with_substitutes)
    #print(f"Similarity scores: {similarity_scores}\n")
    ranked_substitutes_withscores = sorted(zip(substitutes_no_dupl_complex_word_no_antonym, similarity_scores), key=lambda x: x[1], reverse=True)
    #print(f"SS step d) Ranked substitutes in context, including similarity scores: {ranked_substitutes}\n")
    ranked_substitutes = [substitute for substitute, score in ranked_substitutes_withscores]
    print(f"Ranked substitutes in context, based on cosine similarity scores: {ranked_substitutes}\n")
        
    print('-----------------------------------------------------------------------------------------')
    print()
    
       
    
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = ranked_substitutes[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaBase_SG_SS_abc_ce.tsv", sep="\t", index=False, header=False)

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indispensable', 'forthcoming', 'universal', 'requirement', 'involuntary', 'obligated', 'compelled', 'conditional', 'enforced', 'contingent', 'possible', 'compulsion', 'mandatory']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'obligatory', 'voluntary', 'required', 'optional', 'obliged', 'uniform', 'necessary', 'available', 'mandated', 'sufficient', 'routine', 'forced', 'customary', 'prerequisite', 'feasible', 'indispensable', 'forthcoming', 'un

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['mandatory', 'voluntary', 'required', 'obligatory', 'optional', 'routine', 'obliged', 'necessary', 'forced', 'sufficient', 'mandated', 'possible', 'available', 'requirement', 'feasible', 'enforced', 'indispensable', 'customary', 'conditional', 'forthcoming', 'prerequisite', 'compelled', 'uniform', 'involuntary', 'contingent', 'compulsion', 'universal', 'obligated']

-----------------------------------------------------------------------------------------

Sentence: Rajoy's conservative government had instilled markets with a brief dose of confidence by stepping into Bankia, performing a U-turn on its refusal to spend public money to rescue banks.
Complex word: instilled
SG step: generated substitutes: ['infused', 'injected', 'endowed', 'illed', 'inst', 'furnished', 'supplied', 'bolstered', 'implanted', 'impressed', 'reinforced', 'invested', 'provided', 'filled', 'reassured', 'undermined', 'filled', 'pumped', 'struck', '

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['filled', 'stirred', 'inspired', 'infused', 'bolstered', 'injected', 'provided', 'supplied', 'struck', 'furnished', 'reassured', 'stunned', 'impressed', 'pumped', 'undermined', 'reinforced', 'vested', 'endowed', 'misled', 'intoxicated', 'implanted', 'augmented', 'invested', 'enriched', 'inst', 'seeded', 'inflated', 'illed', 'imb']

-----------------------------------------------------------------------------------------

Sentence: #34-3 "War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon.
Complex word: maniacs
SG step: generated substitutes: ['maniac', 'criminals', 'riors', 'hawks', 'heads', 'killers', 'gangs', 'lords', 'thugs', 'devils', 'drums', 'murderers', 'fighters', 'monsters', 'fools', 'villains', 'idiots', 'machines', 'demons', 'doctors', 'planes', 'igans', 'bosses', 'nazis', 'beasts', 'fascists', 'children',

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['criminals', 'thugs', 'children', 'lords', 'hawks', 'killers', 'gangs', 'idiots', 'fighters', 'machines', 'drums', 'murderers', 'nazis', 'demons', 'monsters', 'extremists', 'heads', 'villains', 'beasts', 'doctors', 'bosses', 'parasites', 'devils', 'fools', 'fascists', 'nerds', 'planes', 'riors', 'igans']

-----------------------------------------------------------------------------------------

Sentence: The daily death toll in Syria has declined as the number of observers has risen, but few experts expect the U.N. plan to succeed in its entirety.
Complex word: observers
SG step: generated substitutes: ['observers', 'monitors', 'spectators', 'witnesses', 'observer', 'observes', 'visitors', 'viewers', 'reporters', 'inspectors', 'commentators', 'investigators', 'participants', 'analysts', 'cameras', 'outsiders', 'demonstrators', 'activists', 'journalists', 'experts', 'authorities', 'observations', 'eyes', 'supporters', 'o

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['monitors', 'inspectors', 'participants', 'visitors', 'experts', 'spectators', 'investigators', 'witnesses', 'outsiders', 'observations', 'analysts', 'activists', 'authorities', 'residents', 'viewers', 'responders', 'eyes', 'commentators', 'cameras', 'observing', 'observes', 'supporters', 'parties', 'journalists', 'protesters', 'reporters', 'followers', 'demonstrators']

-----------------------------------------------------------------------------------------

Sentence: An amateur video showed a young girl who apparently suffered shrapnel wounds in her thigh undergoing treatment in a makeshift Rastan hospital while screaming in pain.
Complex word: shrapnel
SG step: generated substitutes: ['rapnel', 'bullet', 'gunshot', 'stab', 'radiation', 'projectile', 'crippling', 'shotgun', 'mortar', 'stun', 'explosive', 'stray', 'shell', 'bomb', 'microscopic', 'fragmentation', 'cartridge', 'grenade', 'rocket', 'torpedo', 'piercing',

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['bomb', 'grenade', 'bullet', 'explosive', 'gunshot', 'blast', 'shell', 'mortar', 'shotgun', 'projectile', 'crippling', 'stun', 'rocket', 'radiation', 'piercing', 'stab', 'fragmentation', 'microscopic', 'bullets', 'stray', 'insurgent', 'slug', 'torpedo', 'cartridge', 'pillow', 'unidentified', 'ammunition', 'terrorist', 'similar', 'rapnel']

-----------------------------------------------------------------------------------------

Sentence: A local witness said a separate group of attackers disguised in burqas — the head-to-toe robes worn by conservative Afghan women — then tried to storm the compound.
Complex word: disguised
SG step: generated substitutes: ['disguised', 'dressed', 'veiled', 'masked', 'concealed', 'hidden', 'disguise', 'clothed', 'clad', 'posed', 'covered', 'cloaked', 'shrouded', 'wrapped', 'posing', 'hiding', 'trained', 'camoufl', 'presented', 'known', 'smuggled', 'staged', 'draped', 'described', 'decora

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['masked', 'cloaked', 'clad', 'clothed', 'dressed', 'shrouded', 'concealed', 'posing', 'covered', 'armed', 'wrapped', 'draped', 'veiled', 'hiding', 'hidden', 'decorated', 'trained', 'posed', 'hid', 'known', 'disguise', 'marked', 'presented', 'staged', 'described', 'camouflage', 'camoufl', 'smuggled', 'buried']

-----------------------------------------------------------------------------------------

Sentence: Syria's Sunni majority is at the forefront of the uprising against Assad, whose minority Alawite sect is an offshoot of Shi'ite Islam.
Complex word: offshoot
SG step: generated substitutes: ['off', 'extension', 'opposite', 'outpost', 'off', 'out', 'affiliate', 'offspring', 'overthrow', 'element', 'outside', 'echo', 'expansion', 'imprint', 'arm', 'ideology', 'overhaul', 'branch', 'opposition', 'evolution', 'indication', 'upset', 'archetype', 'extend', 'independent', 'alternative', 'example', 'instance', 'embrace', '

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['affiliate', 'extension', 'offspring', 'arm', 'outpost', 'independent', 'element', 'opposite', 'adjunct', 'echo', 'imprint', 'alternative', 'expansion', 'opposition', 'example', 'embrace', 'branch', 'evolution', 'overthrow', 'instance', 'indication', 'ideology', 'archetype', 'outside', 'off', 'upset', 'overhaul', 'extend', 'out']

-----------------------------------------------------------------------------------------

Sentence: Although not as rare in the symphonic literature as sharper keys , examples of symphonies in A major are not as numerous as for D major or G major .
Complex word: symphonic
SG step: generated substitutes: ['classical', 'harmonic', 'musical', 'sym', 'music', 'instrumental', 'piano', 'lyric', 'jazz', 'modern', 'dramatic', 'major', 'vocal', 'composing', 'concert', 'opera', 'sonic', 'formal', 'popular', 'early', 'composition', 'english', 'singing', 'standard', 'trumpet', 'keyboard', 'contemporary',

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['musical', 'sym', 'concert', 'instrumental', 'classical', 'orchestra', 'formal', 'music', 'popular', 'lyric', 'standard', 'composing', 'dramatic', 'composition', 'contemporary', 'vocal', 'major', 'technical', 'harmonic', 'modern', 'sonic', 'singing', 'english', 'opera', 'onic', 'piano', 'trumpet', 'jazz', 'early', 'keyboard']

-----------------------------------------------------------------------------------------

Sentence: That prompted the military to deploy its largest warship, the BRP Gregorio del Pilar, which was recently acquired from the United States.
Complex word: deploy
SG step: generated substitutes: ['deploy', 'deployed', 'deployment', 'utilize', 'mobilize', 'employ', 'deploying', 'dispatch', 'equip', 'deploy', 'ploy', 'use', 'activate', 'recruit', 'transport', 'operate', 'withdraw', 'install', 'locate', 'commit', 'construct', 'summon', 'expend', 'send', 'reserve', 'possess', 'cultivate', 'adopt', 'engage'

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['dispatch', 'install', 'activate', 'unleash', 'adopt', 'employ', 'recruit', 'engage', 'send', 'operate', 'use', 'mobilize', 'construct', 'equip', 'summon', 'commit', 'withdraw', 'possess', 'deployment', 'reserve', 'cultivate', 'transport', 'ploy', 'utilize', 'locate', 'expend']

-----------------------------------------------------------------------------------------

Sentence: #35-14 UK police were expressly forbidden, at a ministerial level, to provide any assistance to Thai authorities as the case involves the death penalty.
Complex word: authorities
SG step: generated substitutes: ['authorities', 'officials', 'authorities', 'authority', 'investigators', 'counterparts', 'police', 'agencies', 'administrators', 'forces', 'prosecutors', 'officers', 'authorities', 'entities', 'intelligence', 'jurisdictions', 'individuals', 'detainees', 'institutions', 'affairs', 'demonstrators', 'superiors', 'victims', 'regulators', 'con

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes in context, based on cosine similarity scores: ['police', 'officials', 'counterparts', 'prosecutors', 'investigators', 'entities', 'officers', 'forces', 'detainees', 'intelligence', 'assistance', 'ministers', 'institutions', 'administrators', 'regulators', 'suspects', 'demonstrators', 'agencies', 'sources', 'individuals', 'jurisdictions', 'superiors', 'affairs', 'victims', 'figures', 'concerns']

-----------------------------------------------------------------------------------------



python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaBase_SG_SS_abc_ce.tsv --output_file .\output

In [11]:
# Result: a lot better on MAP@1 (from 0.4 to 0.6)  than with the other approaches in this notebook so far. other map results are similar, potential 3 and 5 lower, and accuracy1 2 and 3 also lower. 

#### Substitute Generation with RoBERTa-Base, and Substitute Selection steps a-c, and the resulting list with BERTScore

In [12]:
import bert_score
from bert_score import score

In [26]:


# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")    # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
    
    # create sentences with the complex word replaced by the substitutes
    sentences_with_substitutes = [sentence.replace(complex_word, sub) for sub in substitutes_no_dupl_complex_word_no_antonym]
    #print(f"SG step: sentences with substitutes: {sentences_with_substitutes}\n")
    
          
    # d) use BERTScore for sorting
    scores = bert_score.score([sentence]*len(sentences_with_substitutes), sentences_with_substitutes, lang="en", model_type='roberta-base', verbose=False)
    ranked_substitutes = [substitute for _, substitute in sorted(zip(scores[0].tolist(), substitutes_no_dupl_complex_word_no_antonym), reverse=True)]
    print(f"SS step: d) substitute list sorted by descending BERTScore: {ranked_substitutes}\n")

    
    print('-----------------------------------------------------------------------------------------')
    print()
    
    
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = ranked_substitutes[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaBase_SG_SS_abc_bs.tsv", sep="\t", index=False, header=False)


Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded', 'lax', 'continuous', 'practicable', 'indispensable', 'forced', 'habitual', 'plain', 'preferable']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded'

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['mandatory', 'voluntary', 'obligatory', 'routine', 'mandated', 'required', 'enforced', 'necessary', 'redundant', 'practicable', 'preferable', 'strict', 'forced', 'demanded', 'relevant', 'obliged', 'indispensable', 'vital', 'bureaucratic', 'lax', 'plain', 'gradual', 'mandate', 'clandestine', 'continuous', 'statutory', 'habitual', 'ministerial', 'lifelong']

-----------------------------------------------------------------------------------------

Sentence: Rajoy's conservative government had instilled markets with a brief dose of confidence by stepping into Bankia, performing a U-turn on its refusal to spend public money to rescue banks.
Complex word: instilled
SG step: generated substitutes: ['infused', 'injected', 'filled', 'inst', 'invested', 'illed', 'impressed', 'infected', 'revived', 'endowed', 'gifted', 'reassured', 'implanted', 'infiltrated', 'pumped', 'inject', 'flooded', 'sprinkled', 'installed', 'vested', 'thrilled'

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['infused', 'injected', 'filled', 'flooded', 'stoked', 'pumped', 'revived', 'gifted', 'vested', 'provided', 'reassured', 'infected', 'impressed', 'thrilled', 'insulated', 'endowed', 'stocked', 'hit', 'sprinkled', 'assured', 'elevated', 'penetrated', 'vaccinated', 'implanted', 'installed', 'invested', 'infiltrated', 'inst', 'illed', 'inject']

-----------------------------------------------------------------------------------------

Sentence: #34-3 "War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon.
Complex word: maniacs
SG step: generated substitutes: ['maniac', 'criminals', 'killers', 'thugs', 'fighters', 'murderers', 'mercenaries', 'militias', 'combatants', 'factions', 'fighters', 'commanders', 'lords', 'jihadists', 'gangs', 'fascists', 'helmets', 'assassins', 'killers', 'crimes', 'squads', 'hunters', 'gunmen', 'partisan

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['criminals', 'thugs', 'killers', 'lords', 'partisans', 'gangs', 'militias', 'combatants', 'mercenaries', 'fighters', 'squads', 'factions', 'murderers', 'perpetrators', 'troops', 'monsters', 'chiefs', 'commanders', 'helmets', 'gunmen', 'hunters', 'fascists', 'jihadists', 'assassins', 'detainees', 'crimes', 'killings']

-----------------------------------------------------------------------------------------

Sentence: The daily death toll in Syria has declined as the number of observers has risen, but few experts expect the U.N. plan to succeed in its entirety.
Complex word: observers
SG step: generated substitutes: ['observers', 'monitors', 'demonstrators', 'participants', 'opponents', 'advisors', 'experts', 'observer', 'supervisors', 'analysts', 'operators', 'responders', 'observer', 'reporters', 'witnesses', 'observations', 'inspectors', 'observes', 'investigators', 'dissidents', 'bystanders', 'visitors', 'officials', 'educ

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['monitors', 'investigators', 'experts', 'reporters', 'inspectors', 'participants', 'responders', 'advisors', 'visitors', 'witnesses', 'reinforcements', 'officials', 'analysts', 'outsiders', 'operators', 'bystanders', 'organizers', 'demonstrators', 'followers', 'educators', 'opponents', 'dissidents', 'observations', 'airstrikes', 'supervisors', 'observed', 'observes']

-----------------------------------------------------------------------------------------

Sentence: An amateur video showed a young girl who apparently suffered shrapnel wounds in her thigh undergoing treatment in a makeshift Rastan hospital while screaming in pain.
Complex word: shrapnel
SG step: generated substitutes: ['gunshot', 'rapnel', 'bullet', 'gunfire', 'grenade', 'gun', 'rifle', 'mortar', 'sh', 'shell', 'sniper', 'projectile', 'stab', 'multiple', 'shot', 'gunshots', 'several', 'blast', 'arrow', 'a', 'shotgun', 'spear', 'shock', 'dagger', 'blaster', 't

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['bullet', 'gunshot', 'knife', 'stab', 'gun', 'multiple', 'blast', 'explosive', 'shot', 'several', 'shock', 'sniper', 'grenade', 'unspecified', 'mortar', 'the', 'shell', 'spear', 'arrow', 'shotgun', 'projectile', 'gunshots', 'rifle', 'gunfire', 'stray', 'dagger', 'sh', 'blaster', 'rapnel', 'a']

-----------------------------------------------------------------------------------------

Sentence: A local witness said a separate group of attackers disguised in burqas — the head-to-toe robes worn by conservative Afghan women — then tried to storm the compound.
Complex word: disguised
SG step: generated substitutes: ['disguised', 'cloaked', 'masked', 'dressed', 'concealed', 'clothed', 'disguise', 'clad', 'recognised', 'portrayed', 'shrouded', 'styled', 'wrapped', 'packaged', 'posed', 'adorned', 'imprisoned', 'dispersed', 'united', 'disgu', 'displayed', 'veiled', 'fitted', 'framed', 'armoured', 'frightened', 'formed', 'revealed', 'o

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['masked', 'clothed', 'cloaked', 'dressed', 'concealed', 'veiled', 'clad', 'shrouded', 'wrapped', 'adorned', 'styled', 'posed', 'fitted', 'branded', 'displayed', 'imprisoned', 'formed', 'dispersed', 'organised', 'united', 'framed', 'disguise', 'packaged', 'revealed', 'armoured', 'portrayed', 'frightened', 'recognised', 'disgu']

-----------------------------------------------------------------------------------------

Sentence: Syria's Sunni majority is at the forefront of the uprising against Assad, whose minority Alawite sect is an offshoot of Shi'ite Islam.
Complex word: offshoot
SG step: generated substitutes: ['shoot', 'affiliate', 'extension', 'outpost', 'adherent', 'offspring', 'ally', 'adaptation', 'off', 'off', 'adjunct', 'iteration', 'incarnation', 'enclave', 'arm', 'embryo', 'branch', 'affiliation', 'example', 'acronym', 'outreach', 'extremist', 'outline', 'insurgency', 'affiliated', 'embrace', 'orthodoxy', 'undercu

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['extension', 'incarnation', 'affiliate', 'offspring', 'adherent', 'outpost', 'arm', 'adjunct', 'example', 'iteration', 'ally', 'branch', 'embryo', 'affiliation', 'extremist', 'embrace', 'adaptation', 'outlet', 'enclave', 'subset', 'acronym', 'orthodoxy', 'off', 'outline', 'affiliated', 'insurgency', 'outreach', 'shoot', 'undercut']

-----------------------------------------------------------------------------------------

Sentence: Although not as rare in the symphonic literature as sharper keys , examples of symphonies in A major are not as numerous as for D major or G major .
Complex word: symphonic
SG step: generated substitutes: ['musical', 'classical', 'harmonic', 'music', 'popular', 'onic', 'sym', 'canonical', 'instrumental', 'piano', 'the', 'of', 'historical', 'dramatic', 'general', 'modern', ',', 'theoretical', 'or', 'and', 'traditional', '</s>', 'concert', 'romantic', 'vocal', 'in', 'musical', 'contemporary', 'formal

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['concert', 'classical', 'instrumental', 'musical', 'onic', 'harmonic', 'formal', 'dramatic', 'modern', 'traditional', 'popular', 'vocal', 'contemporary', 'historical', 'general', 'canonical', 'music', 'analytic', 'romantic', 'theoretical', 'piano', 'the', 'sym', 'or', 'of', 'and', 'in', ',', '</s>']

-----------------------------------------------------------------------------------------

Sentence: That prompted the military to deploy its largest warship, the BRP Gregorio del Pilar, which was recently acquired from the United States.
Complex word: deploy
SG step: generated substitutes: ['deploy', 'deployed', 'deployment', 'mobilize', 'dispatch', 'ploy', 'employ', 'deploying', 'deploy', 'maneuver', 'disperse', 'send', 'launch', 'recruit', 'contract', 'use', 'move', 'tether', 'expend', 'monitor', 'deliver', 'secure', 'operate', 'construct', 'wield', 'patrol', 'combat', 'command', 'withdraw', 'transport']

SS step: a) substitut

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['dispatch', 'employ', 'mobilize', 'launch', 'deliver', 'recruit', 'operate', 'use', 'send', 'construct', 'transport', 'contract', 'secure', 'expend', 'command', 'wield', 'maneuver', 'move', 'patrol', 'withdraw', 'deployment', 'disperse', 'combat', 'tether', 'monitor', 'ploy']

-----------------------------------------------------------------------------------------

Sentence: #35-14 UK police were expressly forbidden, at a ministerial level, to provide any assistance to Thai authorities as the case involves the death penalty.
Complex word: authorities
SG step: generated substitutes: ['authorities', 'police', 'officials', 'authorities', 'authority', 'regulators', 'governments', 'arrests', 'investigators', 'colleagues', 'superiors', 'officers', 'employees', 'authorities', 'courts', 'residents', 'prosecutors', 'detainees', 'authors', 'paramedics', 'policemen', 'applications', 'forces', 'institutions', 'representatives', 'neighbo

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['officials', 'police', 'investigators', 'regulators', 'governments', 'prosecutors', 'courts', 'forces', 'officers', 'institutions', 'representatives', 'policemen', 'residents', 'paramedics', 'employees', 'individuals', 'bodies', 'neighbours', 'detainees', 'colleagues', 'superiors', 'regimes', 'authors', 'arrests', 'applications', 'requirements']

-----------------------------------------------------------------------------------------



python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaBase_SG_SS_abc_bs.tsv --output_file .\output

all map metrics (including map1) lower than with ce, but better than without ce. Potential one lower, one higher, one the same.  Accuracy 2 lower, one the same. 

### RoBERTa-large

In [14]:
# # initialize the tokenizer and the models
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
lm_model = AutoModelForMaskedLM.from_pretrained("roberta-large")

# create a fill-mask pipeline 
fill_mask = pipeline("fill-mask", lm_model, tokenizer = AutoTokenizer.from_pretrained("roberta-large"))


#### Only Substitute Generation with RoBERTa-large (k=10)

In [15]:


# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")  # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 10
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaLarge_SG.tsv", sep="\t", index=False, header=False)

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant']

Sentence: Rajoy's conservative government had instilled markets with a brief dose of confidence by stepping into Bankia, performing a U-turn on its refusal to spend public money to rescue banks.
Complex word: instilled
SG step: generated substitutes: ['infused', 'injected', 'filled', 'inst', 'invested', 'illed', 'impressed', 'infected', 'revived', 'endowed']

Sentence: #34-3 "War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon.
Complex word: maniacs
SG step:

python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaLarge_SG.tsv --output_file .\output

In [16]:
# Result: better than roberta-base on MAP1

#### Substitute Generation with RoBERTa-large, and Substitute Selection steps a-c

In [17]:
from nltk.corpus import wordnet as wn
import spacy
nlp = spacy.load("en_core_web_sm")

In [18]:

# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")   # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
     
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = substitutes_no_dupl_complex_word_no_antonym[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaLarge_SG_SS_abc.tsv", sep="\t", index=False, header=False)
    

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded', 'lax', 'continuous', 'practicable', 'indispensable', 'forced', 'habitual', 'plain', 'preferable']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded'

python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaLarge_SG_SS_abc.tsv --output_file .\output

In [19]:
# slightly higher or the same results than without this step.

#### Substitute Generation with RoBERTa-large, and Substitute Selection steps a-c, and the resulting list with FitBERT

In [20]:

# instantiate a FitBert model
fb_model = FitBert(lm_model)



# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")  # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
    
    
    # d) apply FITBERT to the list of substitutes
    sentence_fitbert_masked = sentence_masked_word.replace("<mask>", "***mask***")   # fitbert uses ***mask*** instead of [MASK] or <mask> 
    sentences_concat_fitbert = f"{sentence} {tokenizer.sep_token} {sentence_fitbert_masked}"
    
    ranked_substitutes = fb_model.rank(sentences_concat_fitbert, substitutes_no_dupl_complex_word_no_antonym)
    print(f"SS step: d) ranked substitutes using FitBert: {ranked_substitutes}\n")
    
    print('-----------------------------------------------------------------------------------------')
    print()
    
    
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = ranked_substitutes[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaLarge_SG_SS_abc_fb.tsv", sep="\t", index=False, header=False)

device: cpu
using custom model: ['RobertaForMaskedLM']
Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded', 'lax', 'continuous', 'practicable', 'indispensable', 'forced', 'habitual', 'plain', 'preferable']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', '

python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaLarge_SG_SS_abc_fb.tsv --output_file .\output

In [21]:
# Result: bad results

#### Substitute Generation with RoBERTa-large, and Substitute Selection steps a-c, and the resulting list with contextualized embeddings

In [22]:
from transformers import TFAutoModel
import tensorflow as tf
import numpy as np

In [23]:
# Calculates similarity between the original sentence and the sentences with candidate substitutes that were retrieved in the SG step 
# creates a list with sentences with substitute words filled in (commented out for oversight purposes)


def calculate_similarity_scores(sentence, sentence_with_substitutes):
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    tf_model = TFAutoModel.from_pretrained("roberta-large")

    def embed_text(text):
        tokens = tokenizer(text, padding=True, truncation=True, return_tensors="tf")
        outputs = tf_model(**tokens)
        embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings = tf.nn.l2_normalize(embeddings, axis=1)
        return embeddings

    original_sentence_embedding = embed_text(sentence)
    substitute_sentence_embeddings = embed_text(sentence_with_substitutes)

    cosine_similarity = np.inner(original_sentence_embedding, substitute_sentence_embeddings)
    similarity_scores = cosine_similarity[0]

    return similarity_scores




# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")   # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
     
    
    # create sentence with the complex word replaced by the substitutes
    sentence_with_substitutes = [sentence.replace(complex_word, sub) for sub in substitutes_no_dupl_complex_word_no_antonym]
    #print(f"List with sentences where complex word is substituted: {sentence_with_substitutes}\n")
    
    
    # d) calculate cosine similarity scores, and rank the substitutes based on their similarity score
    similarity_scores = calculate_similarity_scores(sentence, sentence_with_substitutes)
    #print(f"Similarity scores: {similarity_scores}\n")
    ranked_substitutes_withscores = sorted(zip(substitutes_no_dupl_complex_word_no_antonym, similarity_scores), key=lambda x: x[1], reverse=True)
    #print(f"SS step d) Ranked substitutes, including similarity scores in context: {ranked_substitutes}\n")
    ranked_substitutes = [substitute for substitute, score in ranked_substitutes_withscores]
    print(f"Ranked substitutes, based on cosine similarity scores in context: {ranked_substitutes}\n")
        
    print('-----------------------------------------------------------------------------------------')
    print()
    
       
    
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = ranked_substitutes[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaLarge_SG_SS_abc_ce.tsv", sep="\t", index=False, header=False)

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded', 'lax', 'continuous', 'practicable', 'indispensable', 'forced', 'habitual', 'plain', 'preferable']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded'

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['mandatory', 'required', 'necessary', 'voluntary', 'demanded', 'vital', 'forced', 'preferable', 'enforced', 'mandated', 'redundant', 'obliged', 'obligatory', 'indispensable', 'routine', 'strict', 'relevant', 'gradual', 'practicable', 'bureaucratic', 'lax', 'plain', 'statutory', 'continuous', 'habitual', 'clandestine', 'mandate', 'lifelong', 'ministerial']

-----------------------------------------------------------------------------------------

Sentence: Rajoy's conservative government had instilled markets with a brief dose of confidence by stepping into Bankia, performing a U-turn on its refusal to spend public money to rescue banks.
Complex word: instilled
SG step: generated substitutes: ['infused', 'injected', 'filled', 'inst', 'invested', 'illed', 'impressed', 'infected', 'revived', 'endowed', 'gifted', 'reassured', 'implanted', 'infiltrated', 'pumped', 'inject', 'flooded', 'sprinkled', 'installed', 'vested', 'thr

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['reassured', 'thrilled', 'revived', 'assured', 'impressed', 'filled', 'provided', 'flooded', 'hit', 'injected', 'infected', 'invested', 'gifted', 'pumped', 'stocked', 'elevated', 'infused', 'implanted', 'stoked', 'vested', 'endowed', 'vaccinated', 'infiltrated', 'insulated', 'sprinkled', 'penetrated', 'installed', 'inject', 'inst', 'illed']

-----------------------------------------------------------------------------------------

Sentence: #34-3 "War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon.
Complex word: maniacs
SG step: generated substitutes: ['maniac', 'criminals', 'killers', 'thugs', 'fighters', 'murderers', 'mercenaries', 'militias', 'combatants', 'factions', 'fighters', 'commanders', 'lords', 'jihadists', 'gangs', 'fascists', 'helmets', 'assassins', 'killers', 'crimes', 'squads', 'hunters', 'gunmen', 'pa

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['criminals', 'troops', 'killers', 'crimes', 'fighters', 'perpetrators', 'lords', 'combatants', 'factions', 'squads', 'partisans', 'assassins', 'commanders', 'gangs', 'murderers', 'monsters', 'militias', 'helmets', 'killings', 'mercenaries', 'detainees', 'jihadists', 'hunters', 'thugs', 'chiefs', 'fascists', 'gunmen']

-----------------------------------------------------------------------------------------

Sentence: The daily death toll in Syria has declined as the number of observers has risen, but few experts expect the U.N. plan to succeed in its entirety.
Complex word: observers
SG step: generated substitutes: ['observers', 'monitors', 'demonstrators', 'participants', 'opponents', 'advisors', 'experts', 'observer', 'supervisors', 'analysts', 'operators', 'responders', 'observer', 'reporters', 'witnesses', 'observations', 'inspectors', 'observes', 'investigators', 'dissidents', 'bystanders', 'visitors', 'officials',

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['monitors', 'demonstrators', 'visitors', 'experts', 'observations', 'witnesses', 'inspectors', 'opponents', 'investigators', 'advisors', 'airstrikes', 'participants', 'responders', 'reporters', 'reinforcements', 'bystanders', 'outsiders', 'followers', 'analysts', 'dissidents', 'officials', 'educators', 'organizers', 'operators', 'supervisors', 'observed', 'observes']

-----------------------------------------------------------------------------------------

Sentence: An amateur video showed a young girl who apparently suffered shrapnel wounds in her thigh undergoing treatment in a makeshift Rastan hospital while screaming in pain.
Complex word: shrapnel
SG step: generated substitutes: ['gunshot', 'rapnel', 'bullet', 'gunfire', 'grenade', 'gun', 'rifle', 'mortar', 'sh', 'shell', 'sniper', 'projectile', 'stab', 'multiple', 'shot', 'gunshots', 'several', 'blast', 'arrow', 'a', 'shotgun', 'spear', 'shock', 'dagger', 'blaste

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['projectile', 'bullet', 'gunshot', 'shotgun', 'gunshots', 'multiple', 'unspecified', 'several', 'gunfire', 'dagger', 'sniper', 'mortar', 'shock', 'gun', 'the', 'rifle', 'knife', 'explosive', 'stab', 'arrow', 'shot', 'grenade', 'blast', 'stray', 'spear', 'shell', 'sh', 'rapnel', 'blaster', 'a']

-----------------------------------------------------------------------------------------

Sentence: A local witness said a separate group of attackers disguised in burqas — the head-to-toe robes worn by conservative Afghan women — then tried to storm the compound.
Complex word: disguised
SG step: generated substitutes: ['disguised', 'cloaked', 'masked', 'dressed', 'concealed', 'clothed', 'disguise', 'clad', 'recognised', 'portrayed', 'shrouded', 'styled', 'wrapped', 'packaged', 'posed', 'adorned', 'imprisoned', 'dispersed', 'united', 'disgu', 'displayed', 'veiled', 'fitted', 'framed', 'armoured', 'frightened', 'formed', 'reveale

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['dressed', 'masked', 'clad', 'adorned', 'clothed', 'cloaked', 'wrapped', 'shrouded', 'veiled', 'concealed', 'posed', 'formed', 'styled', 'dispersed', 'united', 'displayed', 'disguise', 'portrayed', 'framed', 'imprisoned', 'frightened', 'recognised', 'revealed', 'branded', 'organised', 'fitted', 'packaged', 'armoured', 'disgu']

-----------------------------------------------------------------------------------------

Sentence: Syria's Sunni majority is at the forefront of the uprising against Assad, whose minority Alawite sect is an offshoot of Shi'ite Islam.
Complex word: offshoot
SG step: generated substitutes: ['shoot', 'affiliate', 'extension', 'outpost', 'adherent', 'offspring', 'ally', 'adaptation', 'off', 'off', 'adjunct', 'iteration', 'incarnation', 'enclave', 'arm', 'embryo', 'branch', 'affiliation', 'example', 'acronym', 'outreach', 'extremist', 'outline', 'insurgency', 'affiliated', 'embrace', 'orthodoxy', 'u

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['affiliate', 'iteration', 'arm', 'extension', 'adherent', 'incarnation', 'ally', 'outpost', 'offspring', 'example', 'acronym', 'outlet', 'branch', 'adaptation', 'affiliation', 'embryo', 'embrace', 'orthodoxy', 'extremist', 'enclave', 'insurgency', 'adjunct', 'outline', 'off', 'outreach', 'subset', 'affiliated', 'undercut', 'shoot']

-----------------------------------------------------------------------------------------

Sentence: Although not as rare in the symphonic literature as sharper keys , examples of symphonies in A major are not as numerous as for D major or G major .
Complex word: symphonic
SG step: generated substitutes: ['musical', 'classical', 'harmonic', 'music', 'popular', 'onic', 'sym', 'canonical', 'instrumental', 'piano', 'the', 'of', 'historical', 'dramatic', 'general', 'modern', ',', 'theoretical', 'or', 'and', 'traditional', '</s>', 'concert', 'romantic', 'vocal', 'in', 'musical', 'contemporary', '

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['musical', 'classical', 'music', 'popular', 'general', 'harmonic', 'modern', 'instrumental', 'traditional', 'formal', 'concert', 'canonical', 'contemporary', 'dramatic', 'romantic', 'the', 'vocal', 'historical', 'analytic', 'theoretical', 'piano', 'sym', 'of', 'onic', 'in', 'and', 'or', ',', '</s>']

-----------------------------------------------------------------------------------------

Sentence: That prompted the military to deploy its largest warship, the BRP Gregorio del Pilar, which was recently acquired from the United States.
Complex word: deploy
SG step: generated substitutes: ['deploy', 'deployed', 'deployment', 'mobilize', 'dispatch', 'ploy', 'employ', 'deploying', 'deploy', 'maneuver', 'disperse', 'send', 'launch', 'recruit', 'contract', 'use', 'move', 'tether', 'expend', 'monitor', 'deliver', 'secure', 'operate', 'construct', 'wield', 'patrol', 'combat', 'command', 'withdraw', 'transport']

SS step: a) sub

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['send', 'use', 'operate', 'deliver', 'launch', 'construct', 'move', 'mobilize', 'employ', 'withdraw', 'contract', 'command', 'dispatch', 'patrol', 'secure', 'monitor', 'recruit', 'wield', 'transport', 'maneuver', 'expend', 'deployment', 'combat', 'disperse', 'ploy', 'tether']

-----------------------------------------------------------------------------------------

Sentence: #35-14 UK police were expressly forbidden, at a ministerial level, to provide any assistance to Thai authorities as the case involves the death penalty.
Complex word: authorities
SG step: generated substitutes: ['authorities', 'police', 'officials', 'authorities', 'authority', 'regulators', 'governments', 'arrests', 'investigators', 'colleagues', 'superiors', 'officers', 'employees', 'authorities', 'courts', 'residents', 'prosecutors', 'detainees', 'authors', 'paramedics', 'policemen', 'applications', 'forces', 'institutions', 'representatives', 'n

Some layers from the model checkpoint at roberta-large were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Ranked substitutes, based on cosine similarity scores in context: ['officials', 'police', 'prosecutors', 'courts', 'policemen', 'residents', 'investigators', 'detainees', 'governments', 'institutions', 'forces', 'regulators', 'officers', 'employees', 'representatives', 'paramedics', 'individuals', 'bodies', 'neighbours', 'regimes', 'arrests', 'colleagues', 'superiors', 'authors', 'applications', 'requirements']

-----------------------------------------------------------------------------------------



python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaLarge_SG_SS_abc_ce.tsv --output_file .\output

results: MAP 3 5 and 10 slightly better than without this step (map1 the same), potential 3 and 5 slightly worse (pot 10 the same), accuracy 1 and 2 are better, 3 is same. 

#### Substitute Generation with RoBERTa-Large, and Substitute Selection steps a-c, and the resulting list with BERTScore

In [24]:
import bert_score
from bert_score import score

In [25]:

# in each row, for each complex word: 
for index, row in data.iterrows():
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:
    
    ## print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word = sentence.replace(complex_word, "<mask>")    # RoBERTa uses <mask> instead of [MASK]

    ## concatenate the original sentence and the masked sentence
    tokenizer = fill_mask.tokenizer
    sentences_concat = f"{sentence} {tokenizer.sep_token} {sentence_masked_word}"

    ## generate and rank candidate substitutes for the masked word using the fill_mask pipeline
    top_k = 30
    result = fill_mask(sentences_concat, top_k=top_k)
   
    ## lowercase and print the top-k substitutes
    substitutes = [substitute["token_str"].lower().lstrip() for substitute in result]   # and use .lstrip to remove the leading space, as Roberta tokenizes by default with a space in front of the word
    print(f"SG step: generated substitutes: {substitutes}\n")
    
    
    # 2. Substitute Selection (SS):   
    
    # a) remove duplicates within the substitute list from the substitute list 
    
    substitutes_no_dupl = []
    for sub in substitutes:
        if sub not in substitutes_no_dupl:
            substitutes_no_dupl.append(sub)
    print(f"SS step: a) substitute list without duplicates: {substitutes_no_dupl}\n")

   
    # b) remove duplicates and inflected forms of the complex word from the substitute list
    ## Lemmatize the complex word with spaCy, in order to compare it with the lemmatized substitute later to see if their mutual lemmas are the same
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word = []
    for substitute in substitutes_no_dupl:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word.append(substitute)
    print(f"SS step: b) substitute list without duplicates and inflected forms of the complex word: {substitutes_no_dupl_complex_word}\n")

    # c) remove antonyms of the complex word from the substitute list
    substitutes_no_dupl_complex_word_no_antonym = []
    for substitute in substitutes_no_dupl_complex_word:
        syn = wn.synsets(complex_word_lemma)
        if syn:
            syn = syn[0]
            for lemma in syn.lemmas():
                if lemma.antonyms() and lemma.name() == substitute_lemma:
                    print(f"Antonym removed (lemma): {lemma.antonyms()[0].name()}")
                    break
            else:
                substitutes_no_dupl_complex_word_no_antonym.append(substitute)
        else:
            substitutes_no_dupl_complex_word_no_antonym.append(substitute)
    print(f"SS step: c): substitute list without antonyms of the complex word: {substitutes_no_dupl_complex_word_no_antonym}\n")
    
    
    # create sentences with the complex word replaced by the substitutes
    sentences_with_substitutes = [sentence.replace(complex_word, sub) for sub in substitutes_no_dupl_complex_word_no_antonym]
    #print(f"SG step: sentences with substitutes: {sentences_with_substitutes}\n")
    
          
    # d) use BERTScore for sorting
    scores = bert_score.score([sentence]*len(sentences_with_substitutes), sentences_with_substitutes, lang="en", model_type='roberta-large', verbose=False)
    ranked_substitutes = [substitute for _, substitute in sorted(zip(scores[0].tolist(), substitutes_no_dupl_complex_word_no_antonym), reverse=True)]
    print(f"SS step: d) substitute list sorted by descending BERTScore: {ranked_substitutes}\n")

    
    print('-----------------------------------------------------------------------------------------')
    print()
    
    
    
    # limit the substitutes to the 10 first ones for evaluation
    top_10_substitutes = ranked_substitutes[:10]
    
    # add the sentence, complex_word, and substitutes to the dataframe 
    substitutes_df.loc[index] = [sentence, complex_word] + top_10_substitutes
    
    # remove the #34-3 and #35-14 character combinations from the sentences in the dataframe
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#34-3 \"", "")
    substitutes_df.iloc[:, 0] = substitutes_df.iloc[:, 0].str.replace("#35-14 ", "")
    
    

# export the dataframe to a tsv file
substitutes_df.to_csv("./predictions/trial/RobertaLarge_SG_SS_abc_bs.tsv", sep="\t", index=False, header=False)

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
SG step: generated substitutes: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded', 'lax', 'continuous', 'practicable', 'indispensable', 'forced', 'habitual', 'plain', 'preferable']

SS step: a) substitute list without duplicates: ['compulsory', 'mandatory', 'mandated', 'voluntary', 'obligatory', 'statutory', 'redundant', 'enforced', 'routine', 'relevant', 'required', 'vital', 'clandestine', 'obliged', 'bureaucratic', 'ministerial', 'mandate', 'necessary', 'lifelong', 'strict', 'gradual', 'demanded'

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['mandatory', 'required', 'mandated', 'necessary', 'voluntary', 'enforced', 'obligatory', 'forced', 'preferable', 'vital', 'routine', 'demanded', 'indispensable', 'redundant', 'relevant', 'obliged', 'gradual', 'strict', 'practicable', 'bureaucratic', 'lax', 'statutory', 'continuous', 'plain', 'mandate', 'clandestine', 'habitual', 'ministerial', 'lifelong']

-----------------------------------------------------------------------------------------

Sentence: Rajoy's conservative government had instilled markets with a brief dose of confidence by stepping into Bankia, performing a U-turn on its refusal to spend public money to rescue banks.
Complex word: instilled
SG step: generated substitutes: ['infused', 'injected', 'filled', 'inst', 'invested', 'illed', 'impressed', 'infected', 'revived', 'endowed', 'gifted', 'reassured', 'implanted', 'infiltrated', 'pumped', 'inject', 'flooded', 'sprinkled', 'installed', 'vested', 'thrilled'

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['infused', 'filled', 'injected', 'stoked', 'flooded', 'assured', 'provided', 'pumped', 'reassured', 'gifted', 'endowed', 'infected', 'revived', 'stocked', 'impressed', 'thrilled', 'hit', 'invested', 'sprinkled', 'infiltrated', 'elevated', 'penetrated', 'installed', 'vested', 'insulated', 'implanted', 'vaccinated', 'illed', 'inject', 'inst']

-----------------------------------------------------------------------------------------

Sentence: #34-3 "War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon.
Complex word: maniacs
SG step: generated substitutes: ['maniac', 'criminals', 'killers', 'thugs', 'fighters', 'murderers', 'mercenaries', 'militias', 'combatants', 'factions', 'fighters', 'commanders', 'lords', 'jihadists', 'gangs', 'fascists', 'helmets', 'assassins', 'killers', 'crimes', 'squads', 'hunters', 'gunmen', 'partisan

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['criminals', 'lords', 'troops', 'commanders', 'fighters', 'killers', 'crimes', 'chiefs', 'perpetrators', 'thugs', 'combatants', 'assassins', 'factions', 'gangs', 'killings', 'partisans', 'murderers', 'squads', 'monsters', 'militias', 'mercenaries', 'jihadists', 'gunmen', 'detainees', 'helmets', 'hunters', 'fascists']

-----------------------------------------------------------------------------------------

Sentence: The daily death toll in Syria has declined as the number of observers has risen, but few experts expect the U.N. plan to succeed in its entirety.
Complex word: observers
SG step: generated substitutes: ['observers', 'monitors', 'demonstrators', 'participants', 'opponents', 'advisors', 'experts', 'observer', 'supervisors', 'analysts', 'operators', 'responders', 'observer', 'reporters', 'witnesses', 'observations', 'inspectors', 'observes', 'investigators', 'dissidents', 'bystanders', 'visitors', 'officials', 'educ

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['monitors', 'inspectors', 'investigators', 'responders', 'advisors', 'experts', 'observations', 'reporters', 'witnesses', 'officials', 'opponents', 'analysts', 'educators', 'dissidents', 'participants', 'visitors', 'outsiders', 'reinforcements', 'bystanders', 'demonstrators', 'operators', 'followers', 'airstrikes', 'organizers', 'observed', 'supervisors', 'observes']

-----------------------------------------------------------------------------------------

Sentence: An amateur video showed a young girl who apparently suffered shrapnel wounds in her thigh undergoing treatment in a makeshift Rastan hospital while screaming in pain.
Complex word: shrapnel
SG step: generated substitutes: ['gunshot', 'rapnel', 'bullet', 'gunfire', 'grenade', 'gun', 'rifle', 'mortar', 'sh', 'shell', 'sniper', 'projectile', 'stab', 'multiple', 'shot', 'gunshots', 'several', 'blast', 'arrow', 'a', 'shotgun', 'spear', 'shock', 'dagger', 'blaster', 't

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['bullet', 'stab', 'gunshot', 'projectile', 'knife', 'gunfire', 'multiple', 'blast', 'several', 'sniper', 'gun', 'mortar', 'dagger', 'gunshots', 'shell', 'grenade', 'shotgun', 'shot', 'explosive', 'unspecified', 'rifle', 'arrow', 'shock', 'spear', 'the', 'stray', 'rapnel', 'blaster', 'sh', 'a']

-----------------------------------------------------------------------------------------

Sentence: A local witness said a separate group of attackers disguised in burqas — the head-to-toe robes worn by conservative Afghan women — then tried to storm the compound.
Complex word: disguised
SG step: generated substitutes: ['disguised', 'cloaked', 'masked', 'dressed', 'concealed', 'clothed', 'disguise', 'clad', 'recognised', 'portrayed', 'shrouded', 'styled', 'wrapped', 'packaged', 'posed', 'adorned', 'imprisoned', 'dispersed', 'united', 'disgu', 'displayed', 'veiled', 'fitted', 'framed', 'armoured', 'frightened', 'formed', 'revealed', 'o

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['masked', 'concealed', 'clothed', 'cloaked', 'dressed', 'clad', 'veiled', 'shrouded', 'adorned', 'wrapped', 'posed', 'styled', 'formed', 'portrayed', 'displayed', 'united', 'dispersed', 'fitted', 'organised', 'disguise', 'branded', 'imprisoned', 'framed', 'recognised', 'revealed', 'frightened', 'packaged', 'armoured', 'disgu']

-----------------------------------------------------------------------------------------

Sentence: Syria's Sunni majority is at the forefront of the uprising against Assad, whose minority Alawite sect is an offshoot of Shi'ite Islam.
Complex word: offshoot
SG step: generated substitutes: ['shoot', 'affiliate', 'extension', 'outpost', 'adherent', 'offspring', 'ally', 'adaptation', 'off', 'off', 'adjunct', 'iteration', 'incarnation', 'enclave', 'arm', 'embryo', 'branch', 'affiliation', 'example', 'acronym', 'outreach', 'extremist', 'outline', 'insurgency', 'affiliated', 'embrace', 'orthodoxy', 'undercu

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['extension', 'affiliate', 'arm', 'offspring', 'iteration', 'incarnation', 'outpost', 'adaptation', 'ally', 'branch', 'adjunct', 'embryo', 'adherent', 'affiliation', 'acronym', 'enclave', 'embrace', 'example', 'outlet', 'extremist', 'orthodoxy', 'outline', 'off', 'subset', 'affiliated', 'insurgency', 'outreach', 'shoot', 'undercut']

-----------------------------------------------------------------------------------------

Sentence: Although not as rare in the symphonic literature as sharper keys , examples of symphonies in A major are not as numerous as for D major or G major .
Complex word: symphonic
SG step: generated substitutes: ['musical', 'classical', 'harmonic', 'music', 'popular', 'onic', 'sym', 'canonical', 'instrumental', 'piano', 'the', 'of', 'historical', 'dramatic', 'general', 'modern', ',', 'theoretical', 'or', 'and', 'traditional', '</s>', 'concert', 'romantic', 'vocal', 'in', 'musical', 'contemporary', 'formal

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['classical', 'musical', 'dramatic', 'harmonic', 'formal', 'concert', 'contemporary', 'instrumental', 'music', 'general', 'popular', 'modern', 'traditional', 'historical', 'canonical', 'romantic', 'theoretical', 'the', 'analytic', 'onic', 'piano', 'of', 'vocal', 'in', 'or', 'and', 'sym', ',', '</s>']

-----------------------------------------------------------------------------------------

Sentence: That prompted the military to deploy its largest warship, the BRP Gregorio del Pilar, which was recently acquired from the United States.
Complex word: deploy
SG step: generated substitutes: ['deploy', 'deployed', 'deployment', 'mobilize', 'dispatch', 'ploy', 'employ', 'deploying', 'deploy', 'maneuver', 'disperse', 'send', 'launch', 'recruit', 'contract', 'use', 'move', 'tether', 'expend', 'monitor', 'deliver', 'secure', 'operate', 'construct', 'wield', 'patrol', 'combat', 'command', 'withdraw', 'transport']

SS step: a) substitut

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['mobilize', 'dispatch', 'employ', 'use', 'launch', 'operate', 'send', 'deliver', 'command', 'construct', 'secure', 'contract', 'patrol', 'transport', 'recruit', 'move', 'maneuver', 'wield', 'expend', 'monitor', 'disperse', 'withdraw', 'deployment', 'combat', 'ploy', 'tether']

-----------------------------------------------------------------------------------------

Sentence: #35-14 UK police were expressly forbidden, at a ministerial level, to provide any assistance to Thai authorities as the case involves the death penalty.
Complex word: authorities
SG step: generated substitutes: ['authorities', 'police', 'officials', 'authorities', 'authority', 'regulators', 'governments', 'arrests', 'investigators', 'colleagues', 'superiors', 'officers', 'employees', 'authorities', 'courts', 'residents', 'prosecutors', 'detainees', 'authors', 'paramedics', 'policemen', 'applications', 'forces', 'institutions', 'representatives', 'neighbo

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SS step: d) substitute list sorted by descending BERTScore: ['police', 'officials', 'investigators', 'prosecutors', 'policemen', 'courts', 'residents', 'forces', 'detainees', 'governments', 'officers', 'regulators', 'institutions', 'neighbours', 'representatives', 'regimes', 'employees', 'individuals', 'paramedics', 'colleagues', 'bodies', 'superiors', 'authors', 'arrests', 'applications', 'requirements']

-----------------------------------------------------------------------------------------



python tsar_eval.py --gold_file .\gold_trial.tsv --predictions_file ./predictions/trial/RobertaLarge_SG_SS_abc_bs.tsv --output_file .\output

comparable (some worse, some better) with regular embeddings.

### Conclusion so far

SS Step a, b, c for both Roberta models: keep them in Subs.Selection method as they do help a bit. 

SS Step FitBERT: remove, due to low scores.

SS Step context. emb and Bertscore: keep in Subs selection, but differences are subtile and not always better 

Best performing on MAP@1: RobertaBase_SG_SS_abc_ce (see below)