### All code below uses concatenated sentence pairs in the Substitute Generation step to generate similar substitutes (rather than only substitutes that fit the context)
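
As a minimal sketch of that input format: the original sentence and its masked copy are joined with the separator token, so the model sees the complex word while predicting the mask. The helper name `build_masked_pair` is hypothetical, and the defaults `[MASK]` and `[SEP]` are the tokens of `bert-base-uncased`; other tokenizers may use different ones. Unlike the plain `str.replace` used in the cells below, this sketch masks only the first whole-word occurrence, which is an assumption about the desired behaviour.

```python
import re

def build_masked_pair(sentence, complex_word, mask_token="[MASK]", sep_token="[SEP]"):
    """Return 'sentence SEP masked_sentence' with one whole-word occurrence masked.

    Hypothetical helper: the token defaults are assumptions that match bert-base-uncased.
    """
    pattern = r"\b" + re.escape(complex_word) + r"\b"
    # a function replacement avoids any special handling of the mask token by re
    masked, n_replaced = re.subn(pattern, lambda _: mask_token, sentence, count=1)
    if n_replaced == 0:
        raise ValueError(f"'{complex_word}' not found as a whole word in the sentence")
    return f"{sentence} {sep_token} {masked}"

print(build_masked_pair("It will be compulsory for those receiving public help.", "compulsory"))
```

The word-boundary pattern keeps a short complex word such as 'art' from being masked inside a longer word such as 'party'.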

In [1]:
import pandas as pd
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# read the tsv file
filename = './data/test/tsar2022_en_test_none_no_noise.tsv'
data = pd.read_csv(filename, sep='\t', header=None, names=["sentence", "complex_word"])

# create an empty dataframe to store the substitutes for evaluation
substitutes_df = pd.DataFrame(columns=["sentence", "complex_word"] + [f"substitute_{i+1}" for i in range(10)])


In [2]:
import logging

In [3]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
import spacy
nlp = spacy.load("en_core_web_sm")

import string

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IrmaT\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
# set the display.max_rows option to None to display all rows instead of limiting it to 50
pd.set_option('display.max_rows', None)

In [5]:
# Instantiate the tokenizer and the model

# for bert-base:
lm_tokenizer_bertbase = AutoTokenizer.from_pretrained("bert-base-uncased")
lm_model_bertbase = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")


# Instantiate the fill-mask pipeline with the model
fill_mask_bertbase = pipeline("fill-mask", lm_model_bertbase, tokenizer = lm_tokenizer_bertbase)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Instantiate the tokenizer and the model

# for electra-base:
lm_tokenizer_electrabase = AutoTokenizer.from_pretrained("google/electra-base-generator")
lm_model_electrabase = AutoModelForMaskedLM.from_pretrained("google/electra-base-generator")


# Instantiate the fill-mask pipeline with the model
fill_mask_electrabase = pipeline("fill-mask", lm_model_electrabase, tokenizer = lm_tokenizer_electrabase)

#### fixed antonym code for bertbase:

In [27]:
# in each row, for each complex word: 
for index, row in data.iterrows():
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")
    
    
    # for bertbase model:
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_bertbase = sentence.replace(complex_word, lm_tokenizer_bertbase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_bertbase = f"{sentence} {lm_tokenizer_bertbase.sep_token} {sentence_masked_word_bertbase}"

    ## generate and rank candidate substitutes for the masked word with the fill_mask pipeline (skipping elements without a token_str key, as these gave errors with the ELECTRA models)
    top_k = 30
    result_bertbase = fill_mask_bertbase(sentences_concat_bertbase, top_k=top_k)
    substitutes_bertbase = [substitute["token_str"] for substitute in result_bertbase if "token_str" in substitute]
    print(f"Substitute Generation step: initial substitute list: {substitutes_bertbase}\n")


    # 2. Morphological Generation and Context Adaptation (Morphological Adaptation):
    ## a) remove noise: ignore generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these returned errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). The try/except guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_bertbase = [substitute["token_str"].lower().strip() for substitute in result_bertbase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        #print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: {substitutes_bertbase}\n")
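    except TypeError:
        # skip this row: the pipeline output was not a flat list of dicts (e.g. nested lists when the complex word occurs more than once and several masks are predicted)
        continue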



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_bertbase = []
    for sub in substitutes_bertbase:
        if sub not in substitutes_no_dupl_bertbase:
            substitutes_no_dupl_bertbase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for bertbase model: {substitutes_no_dupl_bertbase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first lemmatize the complex word with spaCy, so that its lemma can later be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_bertbase = []
    for substitute in substitutes_no_dupl_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_bertbase.append(substitute)
        else:
            print(f"Removed duplicate or inflected form of the complex word from the list with substitutes: {substitute}")
    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for bertbase model: {substitutes_no_dupl_complex_word_bertbase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_bertbase = []
    for substitute in substitutes_no_dupl_complex_word_bertbase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_bertbase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_antonyms_bertbase}\n") 
    print('------------------------------------------------------------------------------------')

    
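    # limit the list to 10 elements and pad shorter lists with empty strings,
    # so the row always matches the 12 dataframe columns (a shorter row raises a ValueError)
    substitutes_no_antonyms_bertbase = substitutes_no_antonyms_bertbase[:10]
    substitutes_no_antonyms_bertbase += [""] * (10 - len(substitutes_no_antonyms_bertbase))

    ## add the sentence, complex_word, and substitutes to the dataframe
    substitutes_df.loc[index] = [sentence, complex_word] + substitutes_no_antonyms_bertbase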
    
# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/substitutes_no_antonyms_bertbase.tsv', sep="\t", index=False, header=False)   
print("substitutes_no_antonyms_bertbase exported to tsv at './predictions/test/substitutes_no_antonyms_bertbase.tsv'\n")



Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
Substitute Generation step: initial substitute list: ['compulsory', 'mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', 'only', 'illegal', 'sufficient', 'unnecessary', 'easy', 'normal', 'permitted', 'mandated', 'difficult', 'simple', 'appropriate', 'expensive', 'possible', 'commonplace', 'essential', 'proper', 'available', 'enough', 'affordable']

Morphological Adaptation step a): substitute list without unwanted punctuation characters for bertbase model: ['compulsory', 'mandatory', 'obligatory', 'optional', 'required', 'necessary', 'standard', 'voluntary', 'customary', 'impossible', 'easier', 'only', 'illegal', 'sufficient', 'unnecessary', 'eas
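
Steps a) and b) of the cell above can be sketched as one small function that works on plain token strings, which makes the filtering easy to check without loading a model. The function name `clean_candidates` is hypothetical; the rules (keep hyphens, reject other punctuation and curly quotes, drop '##' word pieces and empty strings, lowercase, keep the first occurrence of each duplicate) follow the comments in the cell.

```python
import string

# Punctuation to reject; the hyphen is kept so compounds such as "well-known" survive.
PUNCTUATION = (set(string.punctuation) - {"-"}) | {"“", "”"}

def clean_candidates(tokens):
    """Lowercase, strip and de-duplicate raw fill-mask tokens, dropping noisy ones.

    Hypothetical helper mirroring steps a) and b); order of first occurrence is kept.
    """
    cleaned = []
    for token in tokens:
        word = token.strip().lower()
        if not word or word.startswith("##"):
            continue  # empty string or word-piece continuation
        if any(char in PUNCTUATION for char in word):
            continue  # contains unwanted punctuation
        if word not in cleaned:
            cleaned.append(word)
    return cleaned

raw = [" Mandatory", "mandatory", "##ly", "“", "well-known", "", "required"]
print(clean_candidates(raw))
```

Keeping the first occurrence preserves the model's ranking, which matters because only the top 10 survivors are written to the predictions file.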

#### fixed antonym code for electrabase:
   

In [None]:
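# start from an empty dataframe so that no rows from the bertbase run carry over
substitutes_df = pd.DataFrame(columns=substitutes_df.columns)

# in each row, for each complex word:
for index, row in data.iterrows():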
    
    # print the sentence and the complex word
    sentence, complex_word = row["sentence"], row["complex_word"]
    #print(f"Sentence: {sentence}")
    print(f"Complex word: {complex_word}")
    
   
   
    
    # for electrabase model:
   
       
    # 1. Substitute Generation (SG): perform masking and generate substitutes:

    ## in the sentence, replace the complex word with a masked word
    sentence_masked_word_electrabase = sentence.replace(complex_word, lm_tokenizer_electrabase.mask_token)

    ## concatenate the original sentence and the masked sentence
    sentences_concat_electrabase = f"{sentence} {lm_tokenizer_electrabase.sep_token} {sentence_masked_word_electrabase}"

    ## generate and rank candidate substitutes for the masked word with the fill_mask pipeline (skipping elements without a token_str key, as these gave errors with the ELECTRA models)
    top_k = 30
    result_electrabase = fill_mask_electrabase(sentences_concat_electrabase, top_k=top_k)
    substitutes_electrabase = [substitute["token_str"] for substitute in result_electrabase if "token_str" in substitute]
    print(f"Substitute Generation step: initial substitute list: {substitutes_electrabase}\n")


    # 2. Morphological Generation and Context Adaptation (Morphological Adaptation):
    ## a) remove noise: ignore generated substitutes that are empty, contain unwanted punctuation characters, or start with '##' (these returned errors with the ELECTRA model),
    ## and lowercase all substitutes (as some models don't lowercase by default). The try/except guards against other character-related problems

    punctuation_set = set(string.punctuation) - set('-') # retained hyphens in case tokenizers don't split on hyphenated compounds
    punctuation_set.update({'“','”'})   # as these curly quotes appeared in the Electra (SG step) results but were not part of the string set

    try:
        substitutes_electrabase = [substitute["token_str"].lower().strip() for substitute in result_electrabase if not any(char in punctuation_set for char in substitute["token_str"]) # added .strip as roberta uses a leading space before each substitute
                      and not substitute["token_str"].startswith('##') and substitute["token_str"].strip() != ""]
        #print(f"Morphological Adaptation step a): substitute list without unwanted punctuation characters for electrabase model: {substitutes_electrabase}\n")
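    except TypeError:
        # skip this row: the pipeline output was not a flat list of dicts (e.g. nested lists when the complex word occurs more than once and several masks are predicted)
        continue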



    ## b) remove duplicates within the substitute list (duplicates are likely for models that do not lowercase by default)
    ## the later occurrence is removed on purpose, as it is probably the (previously) uppercased variant of the lowercased substitute (lowercased substitutes are most likely ranked higher by the model)
    substitutes_no_dupl_electrabase = []
    for sub in substitutes_electrabase:
        if sub not in substitutes_no_dupl_electrabase:
            substitutes_no_dupl_electrabase.append(sub)
    #print(f"Morphological Adaptation step b): substitute list without duplicates of substitutes for electrabase model: {substitutes_no_dupl_electrabase}\n")



    ## c) remove duplicates and inflected forms of the complex word from the substitute list

    ## first lemmatize the complex word with spaCy, so that its lemma can later be compared with the lemma of each substitute
    doc_complex_word = nlp(complex_word)
    complex_word_lemma = doc_complex_word[0].lemma_
    #print(f"complex_word_lemma for complex word '{complex_word}': {complex_word_lemma}\n")


    ## then, remove duplicates and inflected forms of the complex word from the list with substitutes
    substitutes_no_dupl_complex_word_electrabase = []
    for substitute in substitutes_no_dupl_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma != complex_word_lemma:
            substitutes_no_dupl_complex_word_electrabase.append(substitute)
        else:
            print(f"Removed duplicate or inflected form of the complex word from the list with substitutes: {substitute}")

    #print(f"Morphological Adaptation step c): substitute list without duplicates of the complex word nor inflected forms of the complex word for electrabase model: {substitutes_no_dupl_complex_word_electrabase}\n")


    ## d) remove antonyms of the complex word from the substitute list
    ## step 1: get the antonyms of the complex word
    antonyms_complex_word = []
    for syn in wn.synsets(complex_word_lemma):
        for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
                antonyms_complex_word.append(antonym.name())

    print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

    ## step 2: remove antonyms of the complex word from the list with substitutes
    substitutes_no_antonyms_electrabase = []
    for substitute in substitutes_no_dupl_complex_word_electrabase:
        doc_substitute = nlp(substitute)
        substitute_lemma = doc_substitute[0].lemma_
        if substitute_lemma not in antonyms_complex_word:
            substitutes_no_antonyms_electrabase.append(substitute)
        else:
            print(f"Removed antonym: {substitute}")
    print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for electrabase model: {substitutes_no_antonyms_electrabase}\n") 


    print('----------------------------------------------------------------------')
    
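    # limit the list to 10 elements and pad shorter lists with empty strings,
    # so the row always matches the 12 dataframe columns (a shorter row raises a ValueError)
    substitutes_no_antonyms_electrabase = substitutes_no_antonyms_electrabase[:10]
    substitutes_no_antonyms_electrabase += [""] * (10 - len(substitutes_no_antonyms_electrabase))

    ## add the sentence, complex_word, and substitutes to the dataframe
    substitutes_df.loc[index] = [sentence, complex_word] + substitutes_no_antonyms_electrabase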
    

# export the dataframe to a tsv file for evaluation
substitutes_df.to_csv('./predictions/test/substitutes_no_antonyms_electrabase.tsv', sep="\t", index=False, header=False)   
print("substitutes_no_antonyms_electrabase exported to tsv at './predictions/test/substitutes_no_antonyms_electrabase.tsv'\n")

Sentence: A Spanish government source, however, later said that banks able to cover by themselves losses on their toxic property assets will not be forced to remove them from their books while it will be compulsory for those receiving public help.
Complex word: compulsory
Substitute Generation step: initial substitute list: ['compulsory', 'mandatory', 'obligatory', 'necessary', 'required', 'optional', 'voluntary', 'essential', 'available', 'possible', 'normal', 'free', 'forbidden', 'safe', 'unnecessary', 'impossible', 'legal', 'illegal', 'prohibited', 'acceptable', 'difficult', 'easier', 'fine', 'automatic', 'appropriate', 'proper', 'recommended', 'dangerous', 'unlawful', 'eligible']

Morphological Adaptation step a): substitute list without unwanted punctuation characters for electrabase model: ['compulsory', 'mandatory', 'obligatory', 'necessary', 'required', 'optional', 'voluntary', 'essential', 'available', 'possible', 'normal', 'free', 'forbidden', 'safe', 'unnecessary', 'impossib

### testcase

In [26]:

from nltk.corpus import wordnet as wn
import spacy

# Load English tokenizer, POS tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

complex_word = 'good'
complex_word_lemma = 'good'  # lemma of 'good' is 'good'
substitutes_no_dupl_complex_word_bertbase = ['joyful', 'cheerful', 'sad', 'elated', 'bad', 'alive']

## d) remove antonyms of the complex word from the substitute list
## step 1: get the antonyms of the complex word
antonyms_complex_word = []
for syn in wn.synsets(complex_word_lemma):
    for lemma in syn.lemmas():
        for antonym in lemma.antonyms():
            antonyms_complex_word.append(antonym.name())

print(f"Antonyms for complex word '{complex_word}': {antonyms_complex_word}\n")

## step 2: remove antonyms of the complex word from the list with substitutes
substitutes_no_antonyms_bertbase = []
for substitute in substitutes_no_dupl_complex_word_bertbase:
    doc_substitute = nlp(substitute)
    substitute_lemma = doc_substitute[0].lemma_
    if substitute_lemma not in antonyms_complex_word:
        substitutes_no_antonyms_bertbase.append(substitute)
    else:
        print(f"Removed antonym: {substitute}")
print(f"Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: {substitutes_no_antonyms_bertbase}\n") 




Antonyms for complex word 'good': ['evil', 'evilness', 'bad', 'badness', 'bad', 'evil', 'ill']

Removed antonym: bad
Morphological Adaptation step d): substitute list without antonyms of the complex word for bertbase model: ['joyful', 'cheerful', 'sad', 'elated', 'alive']
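
The filtering half of this test case can also be sketched as a pure function that takes the antonym list as an argument, so it can be checked without WordNet or spaCy. The name `remove_antonyms` is hypothetical, and comparing lowercased surface forms instead of spaCy lemmas is a simplifying assumption. The antonym list is copied from the output above.

```python
def remove_antonyms(substitutes, antonyms):
    """Split substitutes into (kept, removed) based on a given antonym list.

    Hypothetical helper: compares lowercased surface forms, not lemmas.
    """
    antonym_set = {antonym.lower() for antonym in antonyms}  # also collapses duplicates
    kept, removed = [], []
    for substitute in substitutes:
        if substitute.lower() in antonym_set:
            removed.append(substitute)
        else:
            kept.append(substitute)
    return kept, removed

# antonyms of 'good' as printed by the test case above
antonyms_good = ['evil', 'evilness', 'bad', 'badness', 'bad', 'evil', 'ill']
kept, removed = remove_antonyms(['joyful', 'cheerful', 'sad', 'elated', 'bad', 'alive'], antonyms_good)
print(kept)
print(removed)
```

As in the test case, 'sad' is kept: it is not listed as an antonym of 'good', so a list-based filter only catches the antonyms it is given.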



To do later:

1. Remove dissimilar substitutes, e.g. drop 'voluntary' from the list for 'compulsory' (maybe with equivalence scores or word-embedding similarity scores?).
2. Explore WordNet co-hyponyms and add/prioritize them in the list, together with the synsets code (see old code). Think about how to prioritize them: above or after the duplicates from all 4 models.
3. Remove substitutes like 'disguise' and 'dressing' (complex word = disguised) and 'deployment' (complex word = deploy). Maybe remove words whose first 4 or 5 characters are the same as those of the complex word. That will remove words like 'disguise' and 'deployment', but not 'dressing' (which can be a noun and a verb), as this only appears in the substitute list together with 'dressed', which is a good substitute. Maybe check what FitBert does here?
4. Update the code with embeddings and BERTScore (applied to the combined list of all 4 systems), and see which of these works best.
5. Automate the code so that it runs all models one by one, without needing separate variables for each model.
6. Automate testing of the scores with different top-x values for the concatenated list of all 4 models.
7. Automate selection of the best top-x, based on the output scores for the task.
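
A minimal sketch of the shared-prefix idea from the list above, assuming a prefix length of 4 characters and case-insensitive comparison. The names `shares_prefix` and `drop_prefix_matches` are hypothetical, and this has not been evaluated on the task data.

```python
def shares_prefix(candidate, complex_word, n=4):
    """True if both words are at least n characters long and start with the same n characters."""
    a, b = candidate.lower(), complex_word.lower()
    return len(a) >= n and len(b) >= n and a[:n] == b[:n]

def drop_prefix_matches(substitutes, complex_word, n=4):
    """Keep only substitutes that do not share their first n characters with the complex word."""
    return [s for s in substitutes if not shares_prefix(s, complex_word, n)]

print(drop_prefix_matches(["disguise", "dressing", "dressed", "hidden"], "disguised"))
print(drop_prefix_matches(["deployment", "use", "send"], "deploy"))
```

This removes 'disguise' and 'deployment' but leaves 'dressing' in place. A prefix test is crude: it would also discard unrelated words that happen to start with the same letters, so the prefix length probably needs tuning against the evaluation scores.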