# Evaluation set creation

This notebook provides code for creating evaluation sets from the dataframes, using the selected sentences and their unique IDs. The evaluation sets contain the fields needed for both the AnthroScore evaluation and the Atypical Animacy evaluation.

#### Validate evaluation set

Make sure that there are no duplicates within a set and that no sentence is mistakenly labelled as both positive and negative.

In [27]:
def check_conflicting_annotations(filename):
    """Return True if any sentence occurs in both the positive and the negative set of `filename`."""

    conflicting_annotation = False

    positive_sentences_dict = get_sentences_dict(filename,"positive")
    negative_sentences_dict = get_sentences_dict(filename,"negative")

    for positive_id,sent in positive_sentences_dict.items():
        if positive_id in negative_sentences_dict:
            conflicting_annotation = True
            print(f"the sentence with the ID {positive_id} appears in the negative set with the same ID")
        elif sent in negative_sentences_dict.values():
            conflicting_annotation = True
            # look up the ID under which the same sentence is stored in the negative set
            negative_id = [key for key in negative_sentences_dict if negative_sentences_dict[key] == sent][0]
            print(f"the sentence with the ID {positive_id} "
                  f"appears in the negative set with the ID {negative_id}")

    return conflicting_annotation

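def get_sentences_dict(filename,score):
    """Read the tab-separated ID/sentence file for `filename` and `score` into a dict, reporting duplicates."""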

    sentences_dict = {}
    duplicate_ids = []
    duplicate_sentence_pairs = []
    
    # read all lines up front so that the file handle is closed again
    with open(f"../preprocessed_data/evaluation_sentences/50_{filename}_{score}.txt","r") as sentences:
        lines = sentences.readlines()

    for line in lines:
        line = line.strip()
        line = line.split("\t")
        sent_id = line[0]
        sent = line[1]
        if sent_id not in sentences_dict:
            if sent not in sentences_dict.values():
                sentences_dict[sent_id] = sent
            else: # the sentence appears twice with different IDs 
                other_id = [key for key in sentences_dict if sentences_dict[key] == sent][0]
                duplicate_sentence = (other_id,sent_id)
                duplicate_sentence_pairs.append(duplicate_sentence) 
                sentences_dict[sent_id] = sent # still add to id-sentence pairs
        else: # the sentence appears twice with the same ID
            duplicate_ids.append(sent_id)

    if len(sentences_dict.keys()) > 50:
        print(f"there are more than 50 sentences in 50_{filename}_{score}.txt")
    elif len(sentences_dict.keys()) < 50:
        print(f"there are fewer than 50 sentences in 50_{filename}_{score}.txt")

    if duplicate_ids:
        print("the sentences with the following ids appear twice: ",duplicate_ids,
             f" in 50_{filename}_{score}.txt")

    if duplicate_sentence_pairs:
        print("the following ID pairs refer to the same sentence: ",duplicate_sentence_pairs,
             f" in 50_{filename}_{score}.txt")

    return sentences_dict
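
The two checks described above (an ID shared between the sets, and a sentence shared between the sets under different IDs) can also be tried on in-memory data, without the sentence files. The following is only a minimal sketch: `find_conflicts` is a hypothetical helper that is not part of this notebook, and the toy sentences and IDs are made up for illustration. It uses set intersections instead of the repeated `.values()` lookups above.

```python
def find_conflicts(positive, negative):
    """Return (shared_ids, shared_sentences) for two {id: sentence} dicts.

    Hypothetical helper for illustration; not part of the notebook.
    """
    # IDs that occur in both the positive and the negative set
    shared_ids = sorted(positive.keys() & negative.keys())
    # sentences that occur in both sets, regardless of their IDs
    shared_sentences = sorted(set(positive.values()) & set(negative.values()))
    return shared_ids, shared_sentences


# made-up toy data, for illustration only
positive = {"p1": "The model thinks.", "p2": "The system knows the answer."}
negative = {"n1": "The model thinks.", "p2": "The tool was updated."}

shared_ids, shared_sentences = find_conflicts(positive, negative)
print(shared_ids)        # ['p2']
print(shared_sentences)  # ['The model thinks.']
```

Either list being non-empty corresponds to a conflicting annotation in the sense used by `check_conflicting_annotations`.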

#### Check sentences for each category

Change the argument passed to check_conflicting_annotations. The options are:
1. agent_subjects - sentences in which the AI entity is the subject of an anthropomorphic verb (nsubj)
2. agent_objects - sentences in which the AI entity is the object (agent) of an anthropomorphic verb in the passive voice (pobj)
3. nonagent_objects - sentences in which the AI entity is the object (cognizer) of an anthropomorphic verb
4. adjective_phrases - sentences in which the AI entity is part of an anthropomorphic adjectival phrase
5. noun_phrases - sentences in which the AI entity is part of an anthropomorphic noun phrase
6. possessives - sentences in which the AI entity is immediately followed by a possessive marker
7. comparisons - sentences in which the AI entity is being compared to humans explicitly
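
Instead of editing the argument by hand for each run, the seven categories listed above could be checked in one loop. This is a minimal sketch under assumptions: `CATEGORIES` and `check_all_categories` are hypothetical names that do not appear in the notebook, and the checker is passed in as a parameter so the sketch runs without the sentence files.

```python
# category names as listed above; the constant name is an assumption
CATEGORIES = [
    "agent_subjects",
    "agent_objects",
    "nonagent_objects",
    "adjective_phrases",
    "noun_phrases",
    "possessives",
    "comparisons",
]


def check_all_categories(check_fn, categories=CATEGORIES):
    """Run check_fn on every category and return those it flags as conflicting."""
    return [category for category in categories if check_fn(category)]
```

With the functions defined earlier in this notebook, a call such as `check_all_categories(check_conflicting_annotations)` would return the categories that still contain conflicting annotations.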

In [28]:
conflicting_annotations = check_conflicting_annotations("adjective_phrases")
if conflicting_annotations:
    print("Resolve conflicting annotations before proceeding!!!")
else:
    print("No conflicting annotations. Clean up any duplicates before proceeding.")

No conflicting annotations. Clean up any duplicates before proceeding.
