# Detecting Annotation Artifacts with Local Explanations

An input feature is a data artifact if there exists correlation between a task label and the feature in the training data, but this correlation is not true reflection of the real world.

Local explanations are explanations of individual predictions made by some model. We will go over three methods for producing local explanations: gradient-based highlighting of input, finding influential training examples, and contrastive editing---in the context of finding annotation artifacts.

## Task and Dataset

We will demonstrate how to detect annotation artifacts in the context of **binary sentiment classification** of movie reviews, i.e., classifying a given movie review as positive or negative. One commonly used dataset for this task is the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/). A useful library for accessing NLP datasets is [datasets by Huggingface](https://huggingface.co/docs/datasets/) 🤗. We can load the IMDB dataset using `datasets` as follows:

In [2]:
!jupyter nbextension enable --py widgetsnbextension --sys-prefix

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: [32mOK[0m


In [None]:
from datasets import load_dataset
dataset = load_dataset("imdb")

This loads a `DatasetDict` object which you can index into to view an example:

In [4]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

You can select an example randomly like this:

In [5]:
dataset["train"].shuffle().select(range(1))[0]

{'text': 'There are several things wrong with this movie- Brenda Song\'s character being one of them. I do not believe that the girl is a lousy actor- I honestly don\'t. I believe she is given poor lines. She is just supposed to be, "that vain, rich girl", and while it is funny in the TV shows she plays in, it can\'t even get a dry laugh from me here.<br /><br />Either way, I really should have known what to expect when I sat down to watch this film.<br /><br />The movie was not that terrible...initially. Wendy\'s reaction to Shen was completely natural. I mean, how would you feel if a man, claiming to be a reincarnated monk, chased you around commanding you to wear a medallion and insisting that you were needed to fight "the great evil" and save the world? Which brings me to another point. I know this movie is entirely fiction, but it is still has a founding in Chinese culture. It seems like all of the "warriors" in Wendy\'s family line were women. Correct me if I\'m wrong, but I doub

Where `label: 1` means that the movie review is positive, and `negative` otherwise.

## Annotation Artifact

It has been reported that neural models trained for the task of binary sentiment classification solely use the numberical rating at the end of the review instead of reading and understanding the semantics of the review. We will focus on that annotation artifact in this notebook. Let's collect all test instances with numerical rating.

In [None]:
instances_with_scores = []

for instance in dataset["train"]:
    instance["text"] = instance["text"].replace('/10', ' / 10')
    if "/ 10" in instance["text"] or "/10" in instance["text"]:
        instances_with_scores.append(instance)

In [None]:
f"There are {len(instances_with_scores)} instances with a numerical rating."

In [None]:
f"One example labeled as negative: {instances_with_scores[0]['text']}"

## Model

We will analyze a [RoBERTa-large](https://arxiv.org/abs/1907.11692) classifier that is already trained for binary sentiment classfication in the IMDB dataset with the [AllenNLP](https://allenai.org/allennlp/software/allennlp-library) library.

In [None]:
from allennlp.predictors import Predictor
from mice.src.predictors.imdb.imdb_dataset_reader import ImdbDatasetReader

archive = "mice/trained_predictors/imdb/model/model.tar.gz"
predictor = Predictor.from_path(archive, dataset_reader_to_load=ImdbDatasetReader)

## Gradient-Based Highlights

Let's randomly sample one movie review with a numerical rating and get the model's prediction for it.

In [104]:
import random
from mice.src.predictors.imdb.imdb_dataset_reader import clean_text

random_instance = random.sample(instances_with_scores, 1)[0]
random_instance["text"] = clean_text(random_instance["text"], special_chars=["<br />", "\t"])

int_to_label = ["negative", "positive"]
gold_label = int_to_label[random_instance["label"]]
pred_label = int_to_label[int(predictor.predict(random_instance["text"])["label"])]

print(f"Gold label: {gold_label}, Predicted label: {pred_label}")
print(random_instance["text"])

Gold label: negative, Predicted label: negative
Damn, I thought I'd seen some bad westerns. Can't top this one though. Hell I think I'd rather have my eyes stapled open for a Trinity Triple Feature for cryin out loud. I dont think I'll be able to watch Ben Hur again without laughing my ass off. Just really bad.  But hey, if you like stupid westerns with acknowledged stars in the thing take a peek at Shoot Out with Gregory Peck. It's just as bad, but much funnier. 1 / 10


Now, get the gradient-based highlights for that instance using the [AllenNLP Interpret](https://allenai.github.io/allennlp-website/interpret) toolkit.

In [109]:
from allennlp.data.tokenizers.spacy_tokenizer import SpacyTokenizer
from allennlp.interpret.saliency_interpreters import SimpleGradient

interpreter = SimpleGradient(predictor)

interpretation = interpreter.saliency_interpret_from_json({"sentence": random_instance["text"]})

tokenized_sentence = SpacyTokenizer().tokenize(random_instance["text"])

sentence_attribution = zip(tokenized_sentence, interpretation["instance_1"]["grad_input_1"])



We can rank input words based on the gradient magnitude and also highlight all words with the gradient higher than some `treshold`. 

In [111]:
import pandas as pd 

threshold = 0.005
cols = ["word", "grad"]
df = pd.DataFrame(sentence_attribution, columns=cols)
df['rank'] = df["grad"].rank(ascending=False).astype(int)

df.style.apply(lambda x: ["background-color: #ff33aa" if x.iloc[1] > threshold 
                          else "" for i, v in enumerate(x)], axis = 1)

Unnamed: 0,word,grad,rank
0,Damn,0.010783,30
1,",",0.003142,80
2,I,0.013216,22
3,thought,0.003627,74
4,I,0.000863,95
5,'d,0.002632,83
6,seen,0.004745,62
7,some,0.010181,33
8,bad,0.003405,78
9,westerns,0.015349,14


Since the input is long, it might be useful to focus on top-k words. 

In [112]:
k=20
df.sort_values(by=['grad'], ascending=False).head(k)

Unnamed: 0,word,grad,rank
35,cryin,0.052187,1
10,.,0.032852,2
67,like,0.025893,3
65,if,0.025474,4
63,hey,0.024492,5
90,bad,0.023472,6
16,though,0.023138,7
68,stupid,0.022971,8
69,westerns,0.020459,9
11,Ca,0.01987,10


Let's see which words are highlighted more than expected given their frequency in the original IMDB dataset.

We'll first calculate the number of times each word in the training dataset occurs.

In [162]:
from tqdm import tqdm
from collections import Counter

train_dataset = dataset["train"] #.shuffle().select(range(1000))
tokens = []
for instance in tqdm(train_dataset): 
    tokenized_sentence = SpacyTokenizer().tokenize(instance["text"])
    tokens.extend([t.text for t in tokenized_sentence])

types_freq = {k: v for k,v in Counter(tokens).items() if v>=10}

100%|██████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [03:08<00:00, 132.44it/s]


Then, for all instances with the numerical ratings we want to record top-k (k=20) words. This will be slow very slow, so we will do this only for 100 such instances. 

In [165]:
k=20
sample_size=100
top_tokens = []
for instance in tqdm(instances_with_scores[:sample_size]):
    instance["text"] = clean_text(instance["text"], special_chars=["<br />", "\t"])
    interpretation = interpreter.saliency_interpret_from_json({"sentence": instance["text"]})
    tokenized_sentence = SpacyTokenizer().tokenize(instance["text"])
    sentence_attribution = zip(tokenized_sentence, interpretation["instance_1"]["grad_input_1"])
    instance_top_tokens = [w[0].text for w in sorted(list(sentence_attribution), key = lambda x: x[1], reverse=True)]
    top_tokens.extend(instance_top_tokens)

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [14:16<00:00,  8.56s/it]


We normalized the occurance of each token `t` recorded with `Counter(top_tokens)` (how many times this token is selected sa top-k token in the sample evaluation dataset) by its original occurance in the training dataset recorded with `types_freq[t]`, and list top-10 tokens with respect to the normalized occurance.  

In [178]:
top_tokens_normalized = {t: v/types_freq[t] for t,v in top_tokens_freq.items() if t in types_freq}
Counter(top_tokens_normalized).most_common(10)

[('Tanya', 0.47619047619047616),
 ('bail', 0.4),
 ('Jox', 0.2777777777777778),
 ('Spaghetti', 0.26666666666666666),
 ('Kiki', 0.24),
 ('KGB', 0.23529411764705882),
 ('Sleepwalkers', 0.23529411764705882),
 ('Arnie', 0.23255813953488372),
 ('Zuniga', 0.2222222222222222),
 ('Chronicles', 0.21739130434782608)]

Based on this list, highlighting with the gradient magnitude didn't identify numerical ratings as tokens that are highlighted more than expected. We could try the integrated gradient method that is more powerful (and slower). 

In [181]:
from allennlp.interpret.saliency_interpreters import IntegratedGradient

interpreter = IntegratedGradient(predictor)

In [182]:
k=20
sample_size=100
top_tokens = []
for instance in tqdm(instances_with_scores[:sample_size]):
    instance["text"] = clean_text(instance["text"], special_chars=["<br />", "\t"])
    interpretation = interpreter.saliency_interpret_from_json({"sentence": instance["text"]})
    tokenized_sentence = SpacyTokenizer().tokenize(instance["text"])
    sentence_attribution = zip(tokenized_sentence, interpretation["instance_1"]["grad_input_1"])
    instance_top_tokens = [w[0].text for w in sorted(list(sentence_attribution), key = lambda x: x[1], reverse=True)]
    top_tokens.extend(instance_top_tokens)

top_tokens_normalized = {t: v/types_freq[t] for t,v in top_tokens_freq.items() if t in types_freq}
Counter(top_tokens_normalized).most_common(10)

  2%|█▉                                                                                            | 2/100 [03:26<2:48:22, 103.09s/it]


KeyboardInterrupt: 

## Influential Examples

In [6]:
from allennlp.interpret.influence_interpreters import SimpleInfluence
from mice.src.predictors.imdb.imdb_dataset_reader import ImdbDatasetReader

test_file = "mice/data/aclImdb/test/neg/4377_4.txt"
archive = "mice/trained_predictors/imdb/model/model.tar.gz"
simple_if = SimpleInfluence.from_path(archive, dataset_reader_to_load=ImdbDatasetReader)
simple_if.interpret_from_file(test_file)

2022-02-27 18:16:03,825 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-02-27 18:16:03,829 - INFO - allennlp.models.archival - loading archive file mice/trained_predictors/imdb/model/model.tar.gz
2022-02-27 18:16:03,830 - INFO - allennlp.models.archival - extracting archive file mice/trained_predictors/imdb/model/model.tar.gz to temp dir /var/folders/qr/8__6lqs525vbb3xk4c52jhxc0000gp/T/tmp6prr7m82
2022-02-27 18:16:13,963 - INFO - allennlp.common.params - dataset_reader.type = imdb
2022-02-27 18:16:13,964 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2022-02-27 18:16:13,965 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2022-02-27 18:16:13,965 - INFO - allennlp.common.params - dataset_reader.manual_multiprocess_sharding = False
2022-02-27 18:16:13,966 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.type = pretrained_transformer
2022-02-27 18:16:13,967 - INFO - allennlp.common.pa

2022-02-27 18:16:14,296 - INFO - allennlp.nn.initializers -    _seq2vec_encoder.pooler.dense.weight
2022-02-27 18:16:14,297 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.embeddings.LayerNorm.bias
2022-02-27 18:16:14,298 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.embeddings.LayerNorm.weight
2022-02-27 18:16:14,299 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.embeddings.position_embeddings.weight
2022-02-27 18:16:14,299 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.embeddings.token_type_embeddings.weight
2022-02-27 18:16:14,301 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.embeddings.word_embeddings.weight
2022-02-27 18:16:14,304 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transfor

2022-02-27 18:16:14,344 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.10.output.LayerNorm.bias
2022-02-27 18:16:14,345 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.10.output.LayerNorm.weight
2022-02-27 18:16:14,346 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.10.output.dense.bias
2022-02-27 18:16:14,347 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.10.output.dense.weight
2022-02-27 18:16:14,347 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.11.attention.output.LayerNorm.bias
2022-02-27 18:16:14,348 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.11.attention.output.LayerNorm.weight
2022

2022-02-27 18:16:14,377 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.13.output.dense.bias
2022-02-27 18:16:14,377 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.13.output.dense.weight
2022-02-27 18:16:14,378 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.14.attention.output.LayerNorm.bias
2022-02-27 18:16:14,379 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.14.attention.output.LayerNorm.weight
2022-02-27 18:16:14,379 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.14.attention.output.dense.bias
2022-02-27 18:16:14,379 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.14.attention.output.dense

2022-02-27 18:16:14,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.17.attention.output.LayerNorm.bias
2022-02-27 18:16:14,406 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.17.attention.output.LayerNorm.weight
2022-02-27 18:16:14,406 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.17.attention.output.dense.bias
2022-02-27 18:16:14,407 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.17.attention.output.dense.weight
2022-02-27 18:16:14,407 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.17.attention.self.key.bias
2022-02-27 18:16:14,408 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.17.attent

2022-02-27 18:16:14,432 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.2.attention.output.dense.bias
2022-02-27 18:16:14,433 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.2.attention.output.dense.weight
2022-02-27 18:16:14,433 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.2.attention.self.key.bias
2022-02-27 18:16:14,435 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.2.attention.self.key.weight
2022-02-27 18:16:14,435 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.2.attention.self.query.bias
2022-02-27 18:16:14,436 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.2.attention.self.query.weigh

2022-02-27 18:16:14,461 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.22.attention.self.key.bias
2022-02-27 18:16:14,462 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.22.attention.self.key.weight
2022-02-27 18:16:14,462 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.22.attention.self.query.bias
2022-02-27 18:16:14,463 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.22.attention.self.query.weight
2022-02-27 18:16:14,463 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.22.attention.self.value.bias
2022-02-27 18:16:14,463 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.22.attention.self.value.wei

2022-02-27 18:16:14,498 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.4.attention.self.query.bias
2022-02-27 18:16:14,498 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.4.attention.self.query.weight
2022-02-27 18:16:14,499 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.4.attention.self.value.bias
2022-02-27 18:16:14,500 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.4.attention.self.value.weight
2022-02-27 18:16:14,501 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.4.intermediate.dense.bias
2022-02-27 18:16:14,503 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.4.intermediate.dense.weight
20

2022-02-27 18:16:14,528 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.7.attention.self.value.bias
2022-02-27 18:16:14,528 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.7.attention.self.value.weight
2022-02-27 18:16:14,529 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.7.intermediate.dense.bias
2022-02-27 18:16:14,529 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.7.intermediate.dense.weight
2022-02-27 18:16:14,529 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.7.output.LayerNorm.bias
2022-02-27 18:16:14,530 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.encoder.layer.7.output.LayerNorm.weight
2022-02-27

ConfigurationError: Default implementation simple-influence is not registered

## Contrastive Edits

In [2]:
import pandas as pd
import sys
sys.path.append("mice/")
from src.utils import html_highlight_diffs
from IPython.core.display import display, HTML
import numpy as np
from mice.src.utils import load_predictor, get_ints_to_labels

TASK = "imdb"
STAGE2EXP = "mice_binary"
EDIT_PATH = f"mice/results/{TASK}/edits/{STAGE2EXP}/edits.csv"

In [3]:
def read_edits(path):
    edits = pd.read_csv(EDIT_PATH, sep="\t", lineterminator="\n", error_bad_lines=False, warn_bad_lines=True)

    if edits['new_pred'].dtype == pd.np.dtype('float64'):
        edits['new_pred'] = edits.apply(lambda row: str(int(row['new_pred']) if not np.isnan(row['new_pred']) else ""), axis=1)
        edits['orig_pred'] = edits.apply(lambda row: str(int(row['orig_pred']) if not np.isnan(row['orig_pred']) else ""), axis=1)
        edits['contrast_pred'] = edits.apply(lambda row: str(int(row['contrast_pred']) if not np.isnan(row['contrast_pred']) else ""), axis=1)
    else:
        edits['new_pred'].fillna(value="", inplace=True)
        edits['orig_pred'].fillna(value="", inplace=True)
        edits['contrast_pred'].fillna(value="", inplace=True)
    return edits

In [4]:
def get_best_edits(edits):
    """ MiCE writes all edits that are found in Stage 2, 
    but we only want to evaluate the smallest per input. 
    Calling get_sorted_e() """
    return edits[edits['sorted_idx'] == 0]
    
def evaluate_edits(edits):
    temp = edits[edits['sorted_idx'] == 0]
    minim = temp['minimality'].mean()
    flipped = temp[temp['new_pred'].astype(str)==temp['contrast_pred'].astype(str)]
    nunique = temp['data_idx'].nunique()
    flip_rate = len(flipped)/nunique
    duration=temp['duration'].mean()
    metrics = {
        "num_total": nunique,
        "num_flipped": len(flipped),
        "flip_rate": flip_rate,
        "minimality": minim,
        "duration": duration,
    }
    for k, v in metrics.items():
        print(f"{k}: \t{round(v, 3)}")
    return metrics

In [5]:
def display_edits(row):
    html_original, html_edited = html_highlight_diffs(row['orig_editable_seg'], row['edited_editable_seg'])
    minim = round(row['minimality'], 3)
    print(f"MINIMALITY: \t{minim}")
    print("")
    display(HTML(html_original))
    display(HTML(html_edited))

def display_classif_results(rows):
    for _, row in rows.iterrows():
        orig_contrast_prob_pred = round(row['orig_contrast_prob_pred'], 3)
        new_contrast_prob_pred = round(row['new_contrast_prob_pred'], 3)
        print("-----------------------")
        print(f"ORIG LABEL: \t{row['orig_pred']}")
        print(f"CONTR LABEL: \t{row['contrast_pred']} (Orig Pred Prob: {orig_contrast_prob_pred})")
        print(f"NEW LABEL: \t{row['new_pred']} (New Pred Prob: {new_contrast_prob_pred})")
        print("")
        display_edits(row)

def display_race_results(rows):
    for _, row in rows.iterrows():
        orig_contrast_prob_pred = round(row['orig_contrast_prob_pred'], 3)
        new_contrast_prob_pred = round(row['new_contrast_prob_pred'], 3)
        orig_input = eval(row['orig_input'])
        options = orig_input['options']
        print("-----------------------")
        print(f"QUESTION: {orig_input['question']}")
        print("\nOPTIONS:")
        for opt_idx, opt in enumerate(options):
            print(f"  ({opt_idx}) {opt}")
        print(f"\nORIG LABEL: \t{row['orig_pred']}")
        print(f"CONTR LABEL: \t{row['contrast_pred']} (Orig Pred Prob: {orig_contrast_prob_pred})")
        print(f"NEW LABEL: \t{row['new_pred']} (New Pred Prob: {new_contrast_prob_pred})")
        print("")
        display_edits(row)

In [6]:
edits = read_edits(EDIT_PATH)
edits = get_best_edits(edits)
metrics = evaluate_edits(edits)

FileNotFoundError: [Errno 2] No such file or directory: 'mice/results/imdb/edits/mice_binary/edits.csv'