# Detecting Annotation Artifacts with Local Explanations

An input feature (e.g., word or pixel) is a data artifact if there is a correlation between a task label and the feature in the training data, but this correlation is not true reflection of the real world.

Local explanations are explanations of individual predictions made by some model. We will go over three methods for producing local explanations: gradient-based highlighting of input, finding influential training examples, and contrastive editing---in the context of finding annotation artifacts.

## Task and Dataset

We will demonstrate how to detect annotation artifacts in the context of **binary sentiment classification** of movie reviews, i.e., classifying a given movie review as positive or negative. One commonly used dataset for this task is the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/). A useful library for accessing NLP datasets is [datasets by Huggingface](https://huggingface.co/docs/datasets/) 🤗. We can load the IMDB dataset using `datasets` as follows:

In [1]:
!jupyter nbextension enable --py widgetsnbextension --sys-prefix

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: [32mOK[0m


In [2]:
from datasets import load_dataset
dataset = load_dataset("imdb")

Reusing dataset imdb (/Users/anam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

This loads a `DatasetDict` object which you can index into to view an example:

In [3]:
dataset["train"][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

You can select an example randomly like this:

In [4]:
dataset["train"].shuffle().select(range(1))[0]

{'text': "This film has been receiving a lot of play lately during the day on either HBO or Cinemax. The reason is that they are assuming people would be interested in comparing it to the Leonardo DiCaprio/Tom Hanks caper of the same name. The only reason to see it is for the attractive Matt Lattanzi. Yum! Although I must say Matt was more than a little long in the tooth to be playing a high schooler. If he were a woman, they'd have had him playing the MOTHER of a high schooler! (Is is just me, or is his daughter starting to look like Shelley Duvall?) Oh yeah, the plot--who cares? Typical teen highjinx played by adults.",
 'label': 0}

Where `label: 1` means that the movie review is positive, and `negative` otherwise.

## Annotation Artifact

It has been reported that neural models trained for the task of binary sentiment classification solely use the numberical rating at the end of the review instead of reading and understanding the semantics of the review. We will focus on that annotation artifact in this notebook. Let's collect all test instances with numerical rating.

In [5]:
instances_with_scores = []

for instance in dataset["train"]:
    instance["text"] = instance["text"].replace('/10', ' / 10')
    if "/ 10" in instance["text"] or "/10" in instance["text"]:
        instances_with_scores.append(instance)

In [6]:
f"There are {len(instances_with_scores)} instances with a numerical rating."

'There are 1524 instances with a numerical rating.'

In [7]:
f"One example labeled as negative: {instances_with_scores[0]['text']}"

"One example labeled as negative: This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2 / 10."

## Model

We will analyze a [RoBERTa-large](https://arxiv.org/abs/1907.11692) classifier that is already trained for binary sentiment classfication in the IMDB dataset with the [AllenNLP](https://allenai.org/allennlp/software/allennlp-library) library.

In [8]:
from allennlp.predictors import Predictor
from mice.src.predictors.imdb.imdb_dataset_reader import ImdbDatasetReader

archive = "mice/trained_predictors/imdb/model/model.tar.gz"
predictor = Predictor.from_path(archive, dataset_reader_to_load=ImdbDatasetReader)

## Gradient-Based Highlights

Let's randomly sample one movie review with a numerical rating and get the model's prediction for it.

In [9]:
import random
from mice.src.predictors.imdb.imdb_dataset_reader import clean_text

random_instance = random.sample(instances_with_scores, 1)[0]
random_instance["text"] = clean_text(random_instance["text"], special_chars=["<br />", "\t"])

int_to_label = ["negative", "positive"]
gold_label = int_to_label[random_instance["label"]]
pred_label = int_to_label[int(predictor.predict(random_instance["text"])["label"])]

print(f"Gold label: {gold_label}, Predicted label: {pred_label}")
print(random_instance["text"])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Gold label: negative, Predicted label: negative
Revolt of the Zombies is BAD. There is nothing remotely entertaining about the movie. It is dull, lifeless, poorly acted, and poorly scripted. I've often complained that the original Dracula is a little slow for my taste, well this movie makes Dracula look like a roller coaster ride. The 65 minute running time seemed like 165 minutes.  The story: An expedition is sent to Cambodia to find the secrets of mind control through "zombification". One man finds the secret and uses it to make the woman he loves marry him. Once this happens, he releases the zombies under his control to horrific consequences. That's it. That's the whole story.  For most of the movie, I was trying to figure out where I had seen the male lead. He looked so familiar. I had plenty of time to think this over. Nothing was happening in the movie. Just before the "zombies revolted", it hit me. It was Dean Jagger. I had seen him recently as the General in White Christmas. Th

Now, get the gradient-based highlights for that instance using the [AllenNLP Interpret](https://allenai.github.io/allennlp-website/interpret) toolkit.

In [10]:
from allennlp.data.tokenizers.spacy_tokenizer import SpacyTokenizer
from allennlp.interpret.saliency_interpreters import SimpleGradient

interpreter = SimpleGradient(predictor)

interpretation = interpreter.saliency_interpret_from_json({"sentence": random_instance["text"]})

tokenized_sentence = SpacyTokenizer().tokenize(random_instance["text"])

sentence_attribution = zip(tokenized_sentence, interpretation["instance_1"]["grad_input_1"])

We can rank input words based on the gradient magnitude and also highlight all words with the gradient higher than some `treshold`. 

In [11]:
import pandas as pd 

threshold = 0.005
cols = ["word", "grad"]
df = pd.DataFrame(sentence_attribution, columns=cols)
df['rank'] = df["grad"].rank(ascending=False).astype(int)

df.style.apply(lambda x: ["background-color: #ff33aa" if x.iloc[1] > threshold 
                          else "" for i, v in enumerate(x)], axis = 1)

Unnamed: 0,word,grad,rank
0,Revolt,0.001042,210
1,of,0.006914,40
2,the,0.006202,49
3,Zombies,8e-05,280
4,is,0.00045,252
5,BAD,0.002121,144
6,.,0.010111,15
7,There,0.002032,152
8,is,0.001039,211
9,nothing,0.00581,54


Since the input is long, it might be useful to focus on top-k words. 

In [12]:
k=20
df.sort_values(by=['grad'], ascending=False).head(k)

Unnamed: 0,word,grad,rank
276,",",0.019843,1
214,.,0.017909,2
13,the,0.017266,3
212,the,0.017114,4
282,I,0.015564,5
216,'m,0.013411,6
90,finds,0.011685,7
41,for,0.011259,8
261,a,0.011246,9
238,and,0.01117,10


Let's see which words are highlighted more than expected given their frequency in the original IMDB dataset.

We'll first calculate the number of times each word in the training dataset occurs.

In [13]:
from tqdm import tqdm
from collections import Counter

train_dataset = dataset["train"] #.shuffle().select(range(1000))
tokens = []
for instance in tqdm(train_dataset): 
    tokenized_sentence = SpacyTokenizer().tokenize(instance["text"])
    tokens.extend([t.text for t in tokenized_sentence])

types_freq = {k: v for k,v in Counter(tokens).items() if v>=10}

100%|██████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:54<00:00, 459.81it/s]


Then, for all instances with the numerical ratings we want to record top-k (k=20) words. This will be slow very slow, so we will do this only for 100 such instances. 

In [165]:
k=20
sample_size=100
top_tokens = []
for instance in tqdm(instances_with_scores[:sample_size]):
    instance["text"] = clean_text(instance["text"], special_chars=["<br />", "\t"])
    interpretation = interpreter.saliency_interpret_from_json({"sentence": instance["text"]})
    tokenized_sentence = SpacyTokenizer().tokenize(instance["text"])
    sentence_attribution = zip(tokenized_sentence, interpretation["instance_1"]["grad_input_1"])
    instance_top_tokens = [w[0].text for w in sorted(list(sentence_attribution), key = lambda x: x[1], reverse=True)]
    top_tokens.extend(instance_top_tokens)

100%|███████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [14:16<00:00,  8.56s/it]


We normalized the occurance of each token `t` recorded with `Counter(top_tokens)` (how many times this token is selected sa top-k token in the sample evaluation dataset) by its original occurance in the training dataset recorded with `types_freq[t]`, and list top-10 tokens with respect to the normalized occurance.  

In [178]:
top_tokens_normalized = {t: v/types_freq[t] for t,v in top_tokens_freq.items() if t in types_freq}
Counter(top_tokens_normalized).most_common(10)

[('Tanya', 0.47619047619047616),
 ('bail', 0.4),
 ('Jox', 0.2777777777777778),
 ('Spaghetti', 0.26666666666666666),
 ('Kiki', 0.24),
 ('KGB', 0.23529411764705882),
 ('Sleepwalkers', 0.23529411764705882),
 ('Arnie', 0.23255813953488372),
 ('Zuniga', 0.2222222222222222),
 ('Chronicles', 0.21739130434782608)]

Based on this list, highlighting with the gradient magnitude didn't identify numerical ratings as tokens that are highlighted more than expected. We could try the integrated gradient method that is more powerful (and slower). 

In [181]:
from allennlp.interpret.saliency_interpreters import IntegratedGradient

interpreter = IntegratedGradient(predictor)

In [None]:
k=20
sample_size=100
top_tokens = []
for instance in tqdm(instances_with_scores[:sample_size]):
    instance["text"] = clean_text(instance["text"], special_chars=["<br />", "\t"])
    interpretation = interpreter.saliency_interpret_from_json({"sentence": instance["text"]})
    tokenized_sentence = SpacyTokenizer().tokenize(instance["text"])
    sentence_attribution = zip(tokenized_sentence, interpretation["instance_1"]["grad_input_1"])
    instance_top_tokens = [w[0].text for w in sorted(list(sentence_attribution), key = lambda x: x[1], reverse=True)]
    top_tokens.extend(instance_top_tokens)

top_tokens_normalized = {t: v/types_freq[t] for t,v in top_tokens_freq.items() if t in types_freq}
Counter(top_tokens_normalized).most_common(10)

## Influential Examples

TODO: Add some info about influence functions here.

NOTE: Instead of the model archive above, I used the model archive from here: https://github.com/allenai/allennlp/blob/3fa519333c0042a1b378bd8ac1788d42edaa70be/test_fixtures/basic_classifier/serialization/model.tar.gz. This seems to be another model trained on the IMDB dataset, but don't know anything about it. I found it by looking at this test for influence functions: https://github.com/allenai/allennlp/blob/f877fdc30d18178b88c335fbd92722fb77c42d93/tests/interpret/simple_influence_test.py. I couldn't make the code work with the model archive I used above. 

IMPORTANT: You need to use latest `allennlp` version. We should instruct to run:

```
pip uninstall allennlp
pip uninstall allennlp-models 
pip install allennlp
pip install allennlp-models
```

`test_data_path` has one review with a numerical rating. Let's see whether influential examples show other reviews in the training data that also have numerical ratings.

In [None]:
labels = ["neg", "pos"]
with jsonlines.open("data/imdb_test_numerical.jsonl", "w") as writer:
    for item in dataset["test"]:
        if "/10" in item["text"] or " / 10" in item["text"]:
            writer.write({"text": item["text"], "label": labels[item["label"]]})

I manually put one of the examples from `data/imdb_test_numerical.jsonl` in a new file `data/imdb_eval.jsonl`.

In [1]:
from allennlp.predictors import Predictor
import jsonlines

archive_path = 'models/model.tar.gz'
predictor = Predictor.from_path(archive_path)
test_data_path = "data/imdb_eval.jsonl" 

test_instances = []
with jsonlines.open(test_data_path) as reader:
    for item in reader: 
        test_instances.append(item)
        
pred_labels = [predictor.predict(item["text"])["label"] for item in test_instances]
print (f"Predicted labels are: {pred_labels}")

2022-02-28 19:55:35,441 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-02-28 19:55:35,445 - INFO - allennlp.models.archival - loading archive file models/model.tar.gz
2022-02-28 19:55:35,446 - INFO - allennlp.models.archival - extracting archive file models/model.tar.gz to temp dir /var/folders/qr/8__6lqs525vbb3xk4c52jhxc0000gp/T/tmp3cfa7lau
2022-02-28 19:55:35,503 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json
2022-02-28 19:55:35,503 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2022-02-28 19:55:35,504 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2022-02-28 19:55:35,505 - INFO - allennlp.common.params - dataset_reader.manual_multiprocess_sharding = False
2022-02-28 19:55:35,505 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.type = single_id
2022-02-28 19:55:35,506 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.names

2022-02-28 19:55:36,026 - INFO - allennlp.common.params - model.seq2vec_encoder.type = bag_of_embeddings
2022-02-28 19:55:36,027 - INFO - allennlp.common.params - model.seq2vec_encoder.embedding_dim = 16
2022-02-28 19:55:36,027 - INFO - allennlp.common.params - model.seq2vec_encoder.averaged = True
2022-02-28 19:55:36,028 - INFO - allennlp.common.params - model.seq2seq_encoder.type = lstm
2022-02-28 19:55:36,028 - INFO - allennlp.common.params - model.seq2seq_encoder.input_size = 10
2022-02-28 19:55:36,029 - INFO - allennlp.common.params - model.seq2seq_encoder.hidden_size = 16
2022-02-28 19:55:36,029 - INFO - allennlp.common.params - model.seq2seq_encoder.num_layers = 1
2022-02-28 19:55:36,030 - INFO - allennlp.common.params - model.seq2seq_encoder.bias = True
2022-02-28 19:55:36,030 - INFO - allennlp.common.params - model.seq2seq_encoder.dropout = 0.0
2022-02-28 19:55:36,030 - INFO - allennlp.common.params - model.seq2seq_encoder.bidirectional = False
2022-02-28 19:55:36,031 - INFO -

Predicted labels are: ['neg']


In [2]:
from datasets import load_dataset
dataset = load_dataset("imdb")

labels = ["neg", "pos"]
with jsonlines.open("data/imdb_train.jsonl", "w") as writer:
    for item in dataset["train"]: # If in hurry, you can sample 100 examples with `dataset["train"].shuffle().select(range(100))`; The next step will be slow for the entire train set
        writer.write({"text": item["text"], "label": labels[item["label"]]})



  0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
from allennlp.interpret.influence_interpreters import InfluenceInterpreter

archive_path = 'models/model.tar.gz'
train_data_path = "data/imdb_train.jsonl"
si = InfluenceInterpreter.from_path(archive_path, train_data_path=train_data_path, recursion_depth=3)
results = si.interpret_from_file(test_data_path, k=3)
for idx, result in enumerate(results): 
    print ('===========> Test instance:')
    print (result.test_instance)
    print (f'===========> Predicted label: {pred_labels[idx]}\n')
    for top_instance in result.top_k:
        print ('===========> Influential training example:')
        print (top_instance.instance)
        print ('\n')

2022-02-28 19:55:48,197 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-02-28 19:55:48,201 - INFO - allennlp.models.archival - loading archive file models/model.tar.gz
2022-02-28 19:55:48,201 - INFO - allennlp.models.archival - extracting archive file models/model.tar.gz to temp dir /var/folders/qr/8__6lqs525vbb3xk4c52jhxc0000gp/T/tmp92j5f_ev
2022-02-28 19:55:48,265 - INFO - allennlp.common.params - dataset_reader.type = text_classification_json
2022-02-28 19:55:48,266 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2022-02-28 19:55:48,266 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2022-02-28 19:55:48,267 - INFO - allennlp.common.params - dataset_reader.manual_multiprocess_sharding = False
2022-02-28 19:55:48,268 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.type = single_id
2022-02-28 19:55:48,269 - INFO - allennlp.common.params - dataset_reader.token_indexers.tokens.names

2022-02-28 19:55:48,316 - INFO - allennlp.common.params - model.seq2vec_encoder.type = bag_of_embeddings
2022-02-28 19:55:48,316 - INFO - allennlp.common.params - model.seq2vec_encoder.embedding_dim = 16
2022-02-28 19:55:48,317 - INFO - allennlp.common.params - model.seq2vec_encoder.averaged = True
2022-02-28 19:55:48,319 - INFO - allennlp.common.params - model.seq2seq_encoder.type = lstm
2022-02-28 19:55:48,320 - INFO - allennlp.common.params - model.seq2seq_encoder.input_size = 10
2022-02-28 19:55:48,320 - INFO - allennlp.common.params - model.seq2seq_encoder.hidden_size = 16
2022-02-28 19:55:48,321 - INFO - allennlp.common.params - model.seq2seq_encoder.num_layers = 1
2022-02-28 19:55:48,322 - INFO - allennlp.common.params - model.seq2seq_encoder.bias = True
2022-02-28 19:55:48,322 - INFO - allennlp.common.params - model.seq2seq_encoder.dropout = 0.0
2022-02-28 19:55:48,323 - INFO - allennlp.common.params - model.seq2seq_encoder.bidirectional = False
2022-02-28 19:55:48,324 - INFO -

loading instances: 0it [00:00, ?it/s]

loading instances: 0it [00:00, ?it/s]

2022-02-28 19:59:15,375 - INFO - allennlp.interpret.influence_interpreters.influence_interpreter - Gathering training instances and computing gradients. The result will be cached so this only needs to be done once.


calculating training gradients: 0it [00:00, ?it/s]

test instances:   0%|          | 0/1 [00:00<?, ?it/s]

LiSSA samples:   0%|          | 0/1 [00:00<?, ?it/s]

LiSSA depth:   0%|          | 0/3 [00:00<?, ?it/s]

scoring train instances:   0%|          | 0/25000 [00:00<?, ?it/s]

Instance with fields:
 	 tokens: TextField of length 28 with text: 
 		[Widow, hires, a, psychopath, as, a, handyman, ., Sloppy, film, noir, thriller, which, does, n't,
		make, much, of, its, tension, promising, set, -, up, ., (, 3/10, )]
 		and TokenIndexers : {'tokens': 'SingleIdTokenIndexer'} 
 	 label: LabelField with label: neg in namespace: 'labels'. 


Instance with fields:
 	 tokens: TextField of length 400 with text: 
 		[Dr., Hackenstein, begins, at, the, turn, of, last, century, ,, ', 1909, The, dawn, of, modern,
		medical, science, ', to, be, exact, ., Dr., Eliot, Hackenstein, (, David, Muir, ), is, in, the,
		early, stages, of, his, rejuvenation, of, living, tissue, experiments, ,, Dr., Hackenstein, manages,
		to, bring, a, skinned, rat, back, to, life, which, confirms, he, has, succeeded, in, bringing, the,
		dead, back, to, life, ..., It, 's, now, ', Three, years, later, ', &, Dean, Slesinger, (, Micheal,
		Ensign, ), is, round, the, Doc, 's, house, for, dinner, ., As, D

Words and very similar words that appear in the test instance and influential review are "psychopathic" ("psychopat" in the test instance), "Sloppy", "thriller", but influential examples do not have numerical ratings.

## Contrastive Edits

TODO 1: Add some info about contrastive edits.

TODO 2: Instead of a random example, use an example with a numerical rating.

    
We will need to re-install older versions of `allennlp`: 

```
pip uninstall allennlp
pip uninstall allennlp-models 
pip install allennlp==1.2.2
pip install allennlp_models==1.2.2
```

We will use analyze edits generated by [MiCE](https://arxiv.org/abs/2012.13985)---a method for contrastive editing. The authors made edits available, let's store them in the right place if we haven't done so already:

```
mkdir mice_edits
cd mice_edits
wget https://storage.cloud.google.com/mice-edits/mice_edits.csv
```

In [1]:
import pandas as pd
import sys
sys.path.append("mice/")
from src.utils import html_highlight_diffs
from IPython.core.display import display, HTML
import numpy as np
from mice.src.utils import load_predictor, get_ints_to_labels

TASK = "imdb"
STAGE2EXP = "mice_binary"
EDIT_PATH = "mice_edits/mice_edits.csv"

In [2]:
def read_edits(path):
    edits = pd.read_csv(EDIT_PATH, sep="\t", lineterminator="\n", error_bad_lines=False, warn_bad_lines=True)

    if edits['new_pred'].dtype == pd.np.dtype('float64'):
        edits['new_pred'] = edits.apply(lambda row: str(int(row['new_pred']) if not np.isnan(row['new_pred']) else ""), axis=1)
        edits['orig_pred'] = edits.apply(lambda row: str(int(row['orig_pred']) if not np.isnan(row['orig_pred']) else ""), axis=1)
        edits['contrast_pred'] = edits.apply(lambda row: str(int(row['contrast_pred']) if not np.isnan(row['contrast_pred']) else ""), axis=1)
    else:
        edits['new_pred'].fillna(value="", inplace=True)
        edits['orig_pred'].fillna(value="", inplace=True)
        edits['contrast_pred'].fillna(value="", inplace=True)
    return edits

def get_best_edits(edits):
    """ MiCE writes all edits that are found in Stage 2, 
    but we only want to evaluate the smallest per input. 
    Calling get_sorted_e() """
    return edits[edits['sorted_idx'] == 0]

def display_edits(row):
    html_original, html_edited = html_highlight_diffs(row['orig_editable_seg'], row['edited_editable_seg'])
    minim = round(row['minimality'], 3)
    print(f"MINIMALITY: \t{minim}")
    print("")
    print("========> Original instance:")
    display(HTML(html_original))
    print("========> Contrastive edit:")
    display(HTML(html_edited))

def display_classif_results(rows):
    for _, row in rows.iterrows():
        orig_contrast_prob_pred = round(row['orig_contrast_prob_pred'], 3)
        new_contrast_prob_pred = round(row['new_contrast_prob_pred'], 3)
        print("-----------------------")
        print(f"ORIG LABEL: \t{row['orig_pred']}")
        print(f"CONTR LABEL: \t{row['contrast_pred']} (Orig Pred Prob: {orig_contrast_prob_pred})")
        print(f"NEW LABEL: \t{row['new_pred']} (New Pred Prob: {new_contrast_prob_pred})")
        print("")
        display_edits(row)

In [9]:
edits = read_edits(EDIT_PATH)
edits = get_best_edits(edits)
random_rows = edits.sample(1)
display_classif_results(random_rows)

-----------------------
ORIG LABEL: 	0
CONTR LABEL: 	1 (Orig Pred Prob: 0.0)
NEW LABEL: 	1 (New Pred Prob: 0.654)

MINIMALITY: 	0.155



  after removing the cwd from sys.path.


