### Test set evaluations for the best model from the trial set evaluations, split into SG and SS (phase 1 and phase 2) steps
This notebook corresponds to the results presented in sections 5.1, 5.2, and 5.4 of the thesis, for the models without the SR step.

In [1]:
from utils_test import *

#### For model 'roberta-base'

In [2]:
data, substitutes_df = get_data_and_create_empty_df()

model = 'roberta-base'
model_name_str = get_str_for_file_name(model)

nlp, lm_tokenizer, lm_model, fill_mask = instantiate_spacy_tokenizer_model_pipeline(model)

In [4]:
# Substitute Generation including noise removal, excluding the original unmasked sentence (shown to illustrate the large drop in semantic similarity; see the poor evaluation scores as well)
substitutes_df = substitute_generation_including_noise_removal_excluding_original_unmasked(data, substitutes_df, lm_tokenizer, fill_mask, model_name_str)

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/SG_robertabase_maskedsentenceonly.tsv --output_file ./output/test/SG_robertabase_maskedsentenceonly.tsv
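
Every step in this notebook is scored with the same `tsar_eval.py` call, and only the name of the prediction file changes. As a minimal sketch, the command could be assembled from the run name as shown here. The helper `tsar_eval_command` is hypothetical and not part of `utils_test`; the flags and paths are copied from the command above.

```python
# Hypothetical helper (not part of utils_test): builds the evaluation command
# for one run. Flags and paths mirror the commands used in this notebook.
GOLD_FILE = "./data/test/tsar2022_en_test_gold_no_noise.tsv"


def tsar_eval_command(run_name):
    """Return the tsar_eval.py command line for the given prediction file name."""
    args = [
        "python", "tsar_eval.py",
        "--gold_file", GOLD_FILE,
        "--predictions_file", f"./predictions/test/{run_name}.tsv",
        "--output_file", f"./output/test/{run_name}.tsv",
    ]
    return " ".join(args)


print(tsar_eval_command("SG_robertabase_maskedsentenceonly"))
```

The resulting string can be pasted into a shell or prefixed with `!` in a notebook cell.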

In [None]:
# Substitute Generation including noise removal, including the original unmasked sentence (concatenated with the masked sentence to get semantically more similar results)
# This feature will be used in all subsequent steps/models
substitutes_df = substitute_generation_including_noise_removal_including_original_unmasked(data, substitutes_df, lm_tokenizer, fill_mask, model_name_str)

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/SG_incl_orig_sentence_robertabase.tsv --output_file ./output/test/SG_incl_orig_sentence_robertabase.tsv
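
The difference between the two Substitute Generation variants lies in the text passed to the masked language model. The sketch below builds both inputs for one sentence. It is an illustration under assumptions: the function names are hypothetical, the order (original first, masked second) and the `</s>` separator are guesses about how the concatenation could be done, and the actual logic lives in `utils_test`. `<mask>` is the mask token used by RoBERTa.

```python
MASK_TOKEN = "<mask>"  # mask token of RoBERTa models


def masked_sentence(sentence, complex_word, mask_token=MASK_TOKEN):
    """Replace the first occurrence of the complex word with the mask token."""
    return sentence.replace(complex_word, mask_token, 1)


def concatenated_input(sentence, complex_word, separator=" </s> "):
    """Original sentence followed by its masked copy.

    Assumption: the order and the separator are illustrative choices,
    not taken from utils_test.
    """
    return sentence + separator + masked_sentence(sentence, complex_word)


sentence = "The committee will scrutinize the proposal."
print(masked_sentence(sentence, "scrutinize"))
print(concatenated_input(sentence, "scrutinize"))
```

With the unmasked sentence in the input, the model sees the complex word itself, which is a plausible reason why its predictions stay closer in meaning.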

In [None]:
# Substitute Selection phase 1 (removal of duplicates, inflected forms, and antonyms of the complex word)
substitutes_df = substitute_selection_phase_1(data, substitutes_df, lm_tokenizer, fill_mask, model_name_str, nlp)

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/SS_phase1_robertabase.tsv --output_file ./output/test/SS_phase1_robertabase.tsv
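
The three removal rules of phase 1 can be sketched as a single filter over the candidate list. This is a simplified illustration, not the code in `utils_test`: `filter_phase_1` is a hypothetical name, and the lemmatizer and the antonym set are passed in as toy stand-ins for whatever spaCy and lexical resources the real step uses.

```python
def filter_phase_1(complex_word, substitutes, lemmatize, antonyms):
    """Drop duplicates, inflected forms, and antonyms of the complex word."""
    target = complex_word.lower()
    target_lemma = lemmatize(target)
    blocked = {a.lower() for a in antonyms}
    kept = []
    for candidate in substitutes:
        word = candidate.lower()
        if word == target:
            continue  # duplicate of the complex word
        if lemmatize(word) == target_lemma:
            continue  # inflected form of the complex word
        if word in blocked or lemmatize(word) in blocked:
            continue  # antonym of the complex word
        kept.append(candidate)
    return kept


# Toy lemmatizer and antonym set, for illustration only.
TOY_LEMMAS = {"increased": "increase", "increases": "increase"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


candidates = ["increase", "increased", "rise", "decrease", "growth"]
print(filter_phase_1("increase", candidates, toy_lemmatize, {"decrease"}))
# ['rise', 'growth']
```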

In [None]:
# Substitute Selection phase 2, option 1: sort the substitutes that are synonyms of the complex word first (lemmatized substitutes that share a synset with the lemmatized complex word)
substitutes_df = substitute_selection_phase_2_option_1(data, substitutes_df, lm_tokenizer, fill_mask, model_name_str, nlp)

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/SS_phase2_option1_SharedSyns_robertabase.tsv --output_file ./output/test/SS_phase2_option1_SharedSyns_robertabase.tsv
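
Sorting synonyms first amounts to a stable sort on one boolean: does the candidate share at least one synset with the complex word? The sketch below shows this with a toy synset table. The function name, the table, and the synset labels are made up for illustration, and lemmatization is left out for brevity.

```python
def synonyms_first(complex_word, substitutes, synsets_of):
    """Move substitutes sharing a synset with the complex word to the front.

    sorted() is stable, so the language model's original order is kept
    within both groups.
    """
    target = synsets_of(complex_word)
    return sorted(substitutes, key=lambda s: not (synsets_of(s) & target))


# Toy synset table with made-up labels, for illustration only.
TOY_SYNSETS = {
    "big": {"syn_size_1", "syn_size_2"},
    "large": {"syn_size_1"},
    "great": {"syn_size_2", "syn_quality_1"},
    "huge": {"syn_size_3"},
}


def toy_synsets(word):
    return TOY_SYNSETS.get(word, set())


print(synonyms_first("big", ["huge", "large", "tiny", "great"], toy_synsets))
# ['large', 'great', 'huge', 'tiny']
```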

In [3]:
# Substitute Selection phase 2, option 2b: sort the substitutes that share their indirect hypernyms (2 levels up) with the complex word first
substitutes_df = substitute_selection_phase_2_option_2(data, substitutes_df, lm_tokenizer, fill_mask, model_name_str, nlp, levels=[2])

SS_phase2_option2_SharedHyper2_robertabase exported to csv in path './predictions/test/SS_phase2_option2_SharedHyper2_robertabase.tsv'



python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/SS_phase2_option2_SharedHyper2_robertabase.tsv --output_file ./output/test/SS_phase2_option2_SharedHyper2_robertabase.tsv
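
The hypernym option follows the same pattern, except that the comparison is made on ancestors a fixed number of levels up rather than on the synsets themselves. The sketch below uses a toy hypernym graph in which each word maps directly to its parent concepts. That mapping, the function names, and the default `levels=(2,)` (mirroring `levels=[2]` in the call above) are assumptions for illustration, not the implementation in `utils_test`.

```python
def ancestors_at_level(node, level, parents):
    """Return the nodes reached by walking exactly `level` hypernym steps up."""
    frontier = {node}
    for _ in range(level):
        frontier = {p for n in frontier for p in parents.get(n, ())}
    return frontier


def shared_hypernyms_first(complex_word, substitutes, parents, levels=(2,)):
    """Move substitutes sharing a hypernym at the given levels to the front."""
    def ancestors(word):
        found = set()
        for level in levels:
            found |= ancestors_at_level(word, level, parents)
        return found

    target = ancestors(complex_word)
    # Stable sort: candidates with a shared ancestor first, order otherwise kept.
    return sorted(substitutes, key=lambda s: not (ancestors(s) & target))


# Toy hypernym graph (child -> parents), for illustration only.
TOY_PARENTS = {
    "car": {"motor_vehicle"},
    "truck": {"motor_vehicle"},
    "motor_vehicle": {"vehicle"},
    "bicycle": {"wheeled_vehicle"},
    "wheeled_vehicle": {"vehicle"},
    "banana": {"fruit"},
    "fruit": {"food"},
}

print(shared_hypernyms_first("car", ["banana", "bicycle", "truck"], TOY_PARENTS))
# ['bicycle', 'truck', 'banana']
```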

In [None]:
# Substitute Selection phase 2, option 3f: sort the substitutes by BERTScore computed with roberta-large
score_model = 'roberta-large'
letter = 'f'
substitutes_df, score_model_name_str = substitute_selection_phase_2_option_3(data, substitutes_df, lm_tokenizer, fill_mask, model_name_str, nlp, score_model, letter)

python tsar_eval.py --gold_file ./data/test/tsar2022_en_test_gold_no_noise.tsv --predictions_file ./predictions/test/SS_phase2_option3f_BSrobertalarge_robertabase.tsv --output_file ./output/test/SS_phase2_option3f_BSrobertalarge_robertabase.tsv
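
Ranking by a similarity score can be sketched as follows: each substitute is placed into the sentence, the result is scored against the original sentence, and the substitutes are sorted by that score in descending order. This is an assumption about the general shape of the step, not a description of `substitute_selection_phase_2_option_3`. The scorer here is a toy lookup with made-up numbers standing in for BERTScore, so the example runs without downloading a model.

```python
def rerank_by_similarity(sentence, complex_word, substitutes, score):
    """Sort substitutes by score(original sentence, sentence with substitute)."""
    def candidate_score(substitute):
        candidate = sentence.replace(complex_word, substitute, 1)
        return score(sentence, candidate)

    return sorted(substitutes, key=candidate_score, reverse=True)


# Toy scorer with made-up values; a real run would call a BERTScore
# implementation here instead.
TOY_SCORES = {
    "The task was long.": 0.91,
    "The task was hard.": 0.95,
    "The task was difficult.": 0.97,
}


def toy_score(reference, candidate):
    return TOY_SCORES.get(candidate, 0.0)


ranked = rerank_by_similarity(
    "The task was arduous.", "arduous", ["long", "hard", "difficult"], toy_score
)
print(ranked)
# ['difficult', 'hard', 'long']
```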

#### Results:
Based on the accumulated scores, as well as on ACC@1 scores, the best model on the test set is:

- Regarding BERTScore similarity scores: SS_phase2_option3f_BSrobertalarge_robertabase (accumulated score: 5.4804; ACC@1 score: 0.6263; reported in tables 5.1 and 5.2 in sections 5.1 and 5.2 of the thesis; model name in the thesis: RB_BSrl).