### Create evaluation sets for AtypicalAnimacy (with masks)

This notebook converts the annotated evaluation sets (in CSV format) into AtypicalAnimacy-flavored evaluation sets.
In addition to the sentence itself, the AtypicalAnimacy evaluation relies on a masked sentence, a 3-word context of the original masked phrase, and a 3-word context of the mask (up to three words on either side). AtypicalAnimacy provides its own masking and context extraction as part of an evaluation pipeline that simultaneously prepares the training and test sets for its own dataset. To evaluate the AtypicalAnimacy model on the AnthroAI dataset presented in my thesis, I created my own manual masking strategy and rely on it for context extraction as well.
Chapter 3 of the thesis describes and justifies this strategy: which elements were kept as part of the context, and which were masked alongside the AI entity.
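
To make these fields concrete, the following minimal sketch builds them by hand for a made-up sentence. The sentence, the masked phrase, and the helper name `illustrate_fields` are assumptions for illustration only; the functions actually used in this notebook are defined in the cells below and handle more cases (punctuation attached to the mask, repeated phrases).

```python
def illustrate_fields(sentence, masked_phrase, window=3):
    """Return (masked sentence, 3-word context, masked 3-word context)."""
    start = sentence.index(masked_phrase)  # first occurrence only
    masked_sentence = sentence[:start] + "<mask>" + sentence[start + len(masked_phrase):]
    tokens = masked_sentence.split(" ")
    i = tokens.index("<mask>")
    before = tokens[max(0, i - window):i]   # up to `window` words before the mask
    after = tokens[i + 1:i + 1 + window]    # up to `window` words after the mask
    context = " ".join(before + [masked_phrase] + after)
    context_masked = " ".join(before + ["<mask>"] + after)
    return masked_sentence, context, context_masked

# Hypothetical example sentence, not taken from the AnthroAI dataset.
example = "In this paper we show that the model understands long and complex questions ."
masked, ctx, ctx_masked = illustrate_fields(example, "the model")
print(masked)      # In this paper we show that <mask> understands long and complex questions .
print(ctx)         # we show that the model understands long and
print(ctx_masked)  # we show that <mask> understands long and
```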

In [1]:
import os
import pickle
import pandas as pd

def concat_pkl(directory_path):
    """
    this function takes a path to a directory containing .pkl files of dataframes,
    and returns a single concatenated dataframe.
    The dataframes contain sentences and their unique ID, 
    as well as the previous and next sentences in the abstract from which the sentence is taken.

    :param directory_path: path to a directory containing .pkl files
    :type directory_path: string
    :return: concatenated dataframe (pd.DataFrame; empty if no file could be read)
    """ 
    pkl_files = [f for f in os.listdir(directory_path) if f.endswith('.pkl')]
    
    df_list = []

    for file in pkl_files:
        file_path = os.path.join(directory_path, file)
        try:
            df = pd.read_pickle(file_path)
            df_list.append(df)
        except Exception as e:
            print(f"Error reading {file}: {e}")

    if df_list:
        combined_df = pd.concat(df_list, ignore_index=True)
        return combined_df
    else:
        print("No pkl files read successfully.")
        return pd.DataFrame()

path = '../data/dataframes'
all_sentences_df = concat_pkl(path)
print(all_sentences_df.columns.tolist())

['SentenceID', 'currentSentence', 'prevSentence', 'nextSentence', 'Abstract']


In [2]:
import re

def normalized(string):
    return re.sub(r'\s+', ' ', string.strip())

def convert_annotation(score):
    """
     This function converts annotation labels to numerical strings:
     negative - '0', positive - '1', inconclusive - '2'
    """ 
    if score in ['p','p1','p2','p3']:
        score = '1'
    elif score in ['n1','n2','n3']:
        score = '0'
    elif score == 'inc':
        score = '2'
    else:
        print("score is malformed")

    return score

def mask_sentence(sentence,position,masked_str):
    """
    this function takes a sentence, an index and a string to be masked,
    and returns a masked sentence

    :param sentence: sentence from the evaluation set
    :type sentence: string
    :param position: index of the mask in the sentence
    :type position: integer
    :param masked_str: phrase in the sentence that should be masked
    :type masked_str: string
    :return: masked sentence (string)
    """ 
    masked_sentence = []
    len_mask = len(masked_str)
    masked_sentence.append(sentence[:position])
    masked_sentence.append("<mask>")
    position_after_mask = position+len_mask
    masked_sentence.append(sentence[position_after_mask:])
    masked_sentence = ''.join(masked_sentence)

    return masked_sentence

def get_context3w(masked_sentence,masked_str):
    """
    this function takes a masked sentence and the string that was masked,
    and returns both the 3-word context of the original masked string and a 3-word context of the mask

    :param masked_sentence: sentence that has been masked
    :type masked_sentence: string
    :param masked_str: original phrase in the sentence that was masked
    :type masked_str: string
    :return: 3-word context and masked 3-word context (tuple of strings)
    """ 
    masked_sentence_list = masked_sentence.split(' ')
    try:
        mask_index = masked_sentence_list.index("<mask>")
        mask_str = '<mask>'
    except ValueError: # the mask is attached to punctuation or a possessive marker
        mask_plus_punct = [x for x in masked_sentence_list if '<mask>' in x][0] # assumes there is exactly one
        mask_index = masked_sentence_list.index(mask_plus_punct)
        mask_str = mask_plus_punct
        masked_str = mask_str.replace('<mask>',masked_str) # restore punctuation and possessive marker to masked string
    if mask_index <= 3:
        prev_words = masked_sentence_list[:mask_index]
    else:
        start_index = mask_index - 3
        prev_words = masked_sentence_list[start_index:mask_index]
    if len(masked_sentence_list) < mask_index + 3:
        next_words = masked_sentence_list[mask_index+1:]
    else:
        end_index = mask_index + 4
        next_words = masked_sentence_list[mask_index+1:end_index]
    prev_words = ' '.join(prev_words)
    next_words = ' '.join(next_words)
    context_3w = prev_words + ' ' + masked_str + ' ' + next_words # masked_str is the original text
    context_3w_masked = prev_words + ' ' + mask_str + ' ' + next_words # mask_str is <mask> (with or without punct)

    return context_3w,context_3w_masked

def get_masked_sentence_and_context(sentence,AI_phrase,mask):
    """
    this function takes a sentence and string to be masked, and returns a list of masked versions
    as well as 3-word context and 3-word masked context  

    :param sentence: sentence from the evaluation set
    :type sentence: string
    :param AI_phrase: entire AI phrase (including contextual components that should not be masked - 
    used for identification for when there are multiple occurrences of the mask in the sentence)
    :type AI_phrase: string
    :param mask: phrase in the sentence that should be masked
    :type mask: string
    :return: list of tuples containing the sentence, the masked sentence, the 3-word context and the masked 3-word context
    """ 
    mask_position = [match.start() for match in re.finditer(rf'\b{re.escape(mask)}\b', sentence, flags=re.IGNORECASE)]
    masked_sentences_and_context = []
    
    if len(mask_position) == 1: # simple case, only one occurrence of mask
        position = mask_position[0]
        masked_sentence = mask_sentence(sentence,position,mask)
        context_3w_tuple = get_context3w(masked_sentence,mask)
        context_3w = context_3w_tuple[0]
        context_3w_masked = context_3w_tuple[1]
        masked_sentences_and_context.append((sentence,masked_sentence,context_3w,context_3w_masked))
    elif len(mask_position) > 1: # more than one occurrence of the mask
        AI_phrase_position = [match.start() for match in re.finditer(rf'\b{re.escape(AI_phrase)}\b', sentence,flags=re.IGNORECASE)]
        mask_in_phrase_position = [match.start() for match in re.finditer(rf'\b{re.escape(mask)}\b', AI_phrase, flags=re.IGNORECASE)]
        if len(AI_phrase_position) == 1 and len(mask_in_phrase_position) == 1: # found mask by comparing position in AI phrase
            position = mask_in_phrase_position[0] + AI_phrase_position[0]
            masked_sentence = mask_sentence(sentence,position,mask)
            context_3w_tuple = get_context3w(masked_sentence,mask)
            context_3w = context_3w_tuple[0]
            context_3w_masked = context_3w_tuple[1]
            masked_sentences_and_context.append((sentence,masked_sentence,context_3w,context_3w_masked))
        else: # cannot identify, masking all occurrences to be safe
            for position in mask_position:
                masked_sentence = mask_sentence(sentence,position,mask)
                context_3w_tuple = get_context3w(masked_sentence,mask)
                context_3w = context_3w_tuple[0]
                context_3w_masked = context_3w_tuple[1]
                masked_sentences_and_context.append((sentence,masked_sentence,context_3w,context_3w_masked))
    else: 
        # no whole-word match found: brute-force replacement of every literal occurrence.
        # Do not replicate! This was only done after manual revision and confirmation.
        masked_sentence = sentence.replace(mask, "<mask>")
        context_3w_tuple = get_context3w(masked_sentence,mask)
        context_3w = context_3w_tuple[0]
        context_3w_masked = context_3w_tuple[1]
        masked_sentences_and_context.append((sentence,masked_sentence,context_3w,context_3w_masked))

    return masked_sentences_and_context   
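
The disambiguation step for repeated phrases can be tried in isolation. The sketch below is a simplified stand-in rather than the function above: the helper name `locate_mask` and the example sentence are hypothetical, and both searches are made case-insensitive here as an assumption. It locates the AI phrase in the sentence, locates the mask inside the AI phrase, and adds the two offsets.

```python
import re

def locate_mask(sentence, ai_phrase, mask):
    """Return the character offset of the mask inside the AI phrase, or None if ambiguous."""
    phrase_hits = [m.start() for m in
                   re.finditer(rf"\b{re.escape(ai_phrase)}\b", sentence, flags=re.IGNORECASE)]
    mask_hits = [m.start() for m in
                 re.finditer(rf"\b{re.escape(mask)}\b", ai_phrase, flags=re.IGNORECASE)]
    if len(phrase_hits) == 1 and len(mask_hits) == 1:
        # offset of the phrase in the sentence + offset of the mask in the phrase
        return phrase_hits[0] + mask_hits[0]
    return None

# Hypothetical sentence in which the mask ("the system") occurs twice.
sentence = "The system is fast, and we argue that the system learns from users."
mask = "the system"
position = locate_mask(sentence, "the system learns", mask)
masked = sentence[:position] + "<mask>" + sentence[position + len(mask):]
print(masked)  # The system is fast, and we argue that <mask> learns from users.
```

Only the occurrence that belongs to the annotated AI phrase is masked; the sentence-initial "The system" is left untouched. When the phrase itself is repeated, the helper returns `None`, which corresponds to the fallback branch above that masks every occurrence.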

In [4]:
import csv

def create_AA_evaluation_set(filename):
    """
    this function reads an annotated csv file and writes a new csv file with the expanded information used by the AtypicalAnimacy model for evaluation: 
    in addition to the original information, it adds the previous and next sentence (if applicable), the masked sentence, 
    the 3-word context of the masked phrase as well as the 3-word context of the mask.

    :param filename: name of the file to be processed (without the .csv extension)
    :type filename: string
    """ 
    with open(f"../data/evaluation_sentences_csv/{filename}.csv","r",newline='') as infile, \
         open(f"../data/AtypicalAnimacy_evaluation/experiment_2/{filename}.csv","w",newline='') as outfile:
        
        writer = csv.writer(outfile)
        AA_header = ['id','Previous Sentence','Current Sentence','Masked Sentence','Next Sentence','AI Phrase','Suggested Mask','AI Entity',
                      'Anthropomorphic Component','Target Expression','Animated','context3w','context3wmasked']
        writer.writerow(AA_header)
        reader = csv.reader(infile)
        next(reader, None) # skip the header row of the annotated file
        
        for row in reader:
            
            sentence_id = normalized(row[0])
            sentence = normalized(row[1])
            orig_sentence_id = '_'.join(sentence_id.split('_')[2:5]) # remove class and dataset prefix added during preprocessing
            
            # retrieve previous and next sentences from the all-sentences dataframe
            sentence_info = all_sentences_df[all_sentences_df['SentenceID'] == orig_sentence_id]
            if not sentence_info.empty:
                prev_sent = sentence_info.iloc[0]['prevSentence']
                next_sent = sentence_info.iloc[0]['nextSentence']
            else: # safeguard only - this did not occur with the current data
                print(f"error: the sentence with the id {sentence_id} was not found in the dataframe")
                prev_sent = ''
                next_sent = ''
                
            # get masked sentence, context3w and context3wmasked
            AI_phrase = normalized(row[2])
            mask = normalized(row[3])
            AI_entity = normalized(row[4])
            anthro_component = normalized(row[5])
            score = convert_annotation(normalized(row[6])) # convert p/n/inc annotations to '1'/'0'/'2'
            masked_sentences_and_context = get_masked_sentence_and_context(sentence,AI_phrase,mask)
            for m in masked_sentences_and_context:
                writer.writerow([sentence_id,prev_sent,m[0],m[1],next_sent,AI_phrase,mask,AI_entity,anthro_component,mask,score,m[2],m[3]])
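
The slice `[2:5]` used above to recover the original sentence ID can be checked in isolation. The IDs below are hypothetical (the real ID format is not shown in this notebook); the sketch only assumes, as the code does, that two underscore-separated prefix fields precede a three-field original ID.

```python
def strip_prefix(sentence_id):
    """Drop the first two underscore-separated fields and keep the next three."""
    return "_".join(sentence_id.split("_")[2:5])

# Hypothetical ID: class prefix, dataset prefix, then a three-part original ID.
example_id = "cls_ds_aaa_bbb_ccc"
print(strip_prefix(example_id))  # aaa_bbb_ccc
```

Note that any field beyond the fifth is silently dropped by the slice, so the approach depends on the original IDs having exactly three underscore-separated parts.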


In [5]:
files = ["adjective_phrases_inconclusive",
         "adjective_phrases_negative",
         "adjective_phrases_positive",
         "comparisons_inconclusive",
         "noun_phrases_positive",
         "possessives_positive",
         "verb_objects_inconclusive",
         "verb_objects_negative",
         "verb_objects_positive",
         "verb_subjects_inconclusive",
         "verb_subjects_negative",
         "verb_subjects_positive"
        ]

for file in files:
    print(f"Creating AA-flavored evaluation sets for {file}...")
    create_AA_evaluation_set(file)

Creating AA-flavored evaluation sets for adjective_phrases_inconclusive...
Creating AA-flavored evaluation sets for adjective_phrases_negative...
Creating AA-flavored evaluation sets for adjective_phrases_positive...
Creating AA-flavored evaluation sets for comparisons_inconclusive...
Creating AA-flavored evaluation sets for noun_phrases_positive...
Creating AA-flavored evaluation sets for possessives_positive...
Creating AA-flavored evaluation sets for verb_objects_inconclusive...
Creating AA-flavored evaluation sets for verb_objects_negative...
Creating AA-flavored evaluation sets for verb_objects_positive...
Creating AA-flavored evaluation sets for verb_subjects_inconclusive...
Creating AA-flavored evaluation sets for verb_subjects_negative...
Creating AA-flavored evaluation sets for verb_subjects_positive...
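
A quick sanity check on the generated files could confirm that every row actually contains the mask token. The following is a minimal sketch, not part of the original pipeline: the helper name `find_rows_without_mask` is hypothetical, the column names are taken from the header written above, and an in-memory two-row sample stands in for a real output file.

```python
import csv
import io

def find_rows_without_mask(csv_file):
    """Return ids of rows whose masked sentence or masked context lacks '<mask>'."""
    bad_ids = []
    for row in csv.DictReader(csv_file):
        if "<mask>" not in row["Masked Sentence"] or "<mask>" not in row["context3wmasked"]:
            bad_ids.append(row["id"])
    return bad_ids

# Hypothetical two-row file; the second row is deliberately broken.
sample = io.StringIO(
    "id,Masked Sentence,context3wmasked\n"
    "row_1,We show that <mask> learns quickly.,show that <mask> learns quickly.\n"
    "row_2,We show that the model learns quickly.,show that the model learns\n"
)
bad = find_rows_without_mask(sample)
print(bad)  # ['row_2']
```

For a real file, the same function can be called on a handle opened with `open(path, newline='')`.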
