# Evaluation of different classifiers for NLI

This notebook compares several pretrained classifiers on the natural language inference (NLI) task. The best-performing classifier will then be used to evaluate the quality of the generated counterfactuals for NLI. The notebook is meant for illustration and debugging only, and its results are not definitive. Run the script "compare_nli_classifiers.py" to perform the complete evaluation.

The following classifiers are tested:
- RoBERTa Large
- DistilRoBERTa
- BART Large

Evaluation procedure:
- We take the Flickr Counterfactually-Augmented Dataset from Kaushik (cad_flickr_nli.tsv);
- We merge the training and validation sets to create an evaluation set;
- We use this evaluation set to test the performance of each classifier.

In [1]:
import torch
import pandas as pd
import datasets
import transformers
from fairseq.data.data_utils import collate_tokens

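to_debug = True   # if True, evaluate only the first N_TO_DEBUG examples
N_TO_DEBUG = 10

# Note: datasets.load_metric is deprecated in recent versions of `datasets`;
# the metrics have moved to the separate `evaluate` library.
eval_metrics = {"precision": datasets.load_metric("precision"),
                "recall": datasets.load_metric("recall"),
                "f1": datasets.load_metric("f1"),
                "accuracy": datasets.load_metric("accuracy")
                }
results = []  # one metrics dict per classifier, in evaluation order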

In [2]:
trainset = pd.read_csv("../cad_flickr_nli/fold_0/training_set.tsv", sep='\t')
valset = pd.read_csv("../cad_flickr_nli/fold_0/val_set.tsv", sep='\t')
eval_data = pd.concat([trainset, valset], ignore_index=True)

if to_debug:
    eval_data = eval_data[:N_TO_DEBUG]
eval_data.reset_index(inplace=True, drop=True)

print(len(eval_data))
eval_data.head(1)

10


| Unnamed: 0 | counter_prem | original_hyp | counter_label | task | counter_hyp | original_prem | original_label |
|---|---|---|---|---|---|---|---|
| 0 | A man and three women are preparing a meal of ... | A group of people cooking inside | neutral | RP | | A man and three women are preparing a meal ind... | entailment |


In [3]:
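def extract_prems(row):
    # Rows with task "RP" carry a counterfactual premise; all others keep the original one.
    if row["task"] == "RP":
        return row["counter_prem"]
    return row["original_prem"]

def extract_hyps(row):
    # Rows with task "RH" carry a counterfactual hypothesis; all others keep the original one.
    if row["task"] == "RH":
        return row["counter_hyp"]
    return row["original_hyp"]

def evaluate_classifier(preds, labels, eval_m):
    # Compute micro-averaged precision/recall/F1 and accuracy for integer class ids.
    # Convert to plain Python ints so that tensors (CPU or GPU) and lists are handled uniformly.
    preds = [int(p) for p in preds]
    labels = [int(l) for l in labels]
    metrics = {"precision": eval_m["precision"].compute(predictions=preds, references=labels, average="micro")["precision"],
               "recall": eval_m["recall"].compute(predictions=preds, references=labels, average="micro")["recall"],
               "f1": eval_m["f1"].compute(predictions=preds, references=labels, average="micro")["f1"],
               "accuracy": eval_m["accuracy"].compute(predictions=preds, references=labels)["accuracy"],
               }
    return metrics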
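
A property of `average="micro"` is worth keeping in mind when reading the results: for single-label multiclass problems such as three-way NLI, micro-averaged precision, recall, and F1 all coincide with accuracy, because every misclassification counts as exactly one false positive and one false negative. The sketch below checks this with a small pure-Python implementation; the helper `micro_prf` and the toy labels are illustrative assumptions, not part of the notebook.

```python
def micro_prf(preds, labels):
    """Micro-averaged precision, recall and F1 for single-label multiclass data."""
    classes = set(preds) | set(labels)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes.
    for c in classes:
        for p, y in zip(preds, labels):
            if p == c and y == c:
                tp += 1
            elif p == c and y != c:
                fp += 1
            elif p != c and y == c:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy predictions and gold labels over the three NLI class ids (made-up values).
toy_preds = [0, 1, 2, 2, 1, 0]
toy_labels = [0, 2, 2, 1, 1, 0]

accuracy = sum(p == y for p, y in zip(toy_preds, toy_labels)) / len(toy_labels)
precision, recall, f1 = micro_prf(toy_preds, toy_labels)
print(accuracy, precision, recall, f1)
```

With `average="macro"` the three scores would instead be averages of per-class values and could differ from accuracy.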

In [4]:
eval_data["premise"] = eval_data.apply(extract_prems, axis=1)
eval_data["hypothesis"] = eval_data.apply(extract_hyps, axis=1)

eval_batch = [[p, h] for p, h in zip(eval_data["premise"].values, eval_data["hypothesis"].values)]
eval_batch[1]

['The baby in the pink romper is crying.', 'The baby is happy.']
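
The premise/hypothesis selection can be checked in isolation. The sketch below re-declares the two helpers in compact form and applies them to two toy rows. Because the helpers only index by key, plain dictionaries can stand in for DataFrame rows. The sentences are made up for illustration: an "RP" row should pick the counterfactual premise, an "RH" row the counterfactual hypothesis.

```python
def extract_prems(row):
    return row["counter_prem"] if row["task"] == "RP" else row["original_prem"]

def extract_hyps(row):
    return row["counter_hyp"] if row["task"] == "RH" else row["original_hyp"]

# Two hypothetical rows with the same columns as the dataset (made-up sentences).
toy_rows = [
    {"task": "RP",
     "original_prem": "A dog runs.", "counter_prem": "A dog sleeps.",
     "original_hyp": "An animal is moving.", "counter_hyp": None},
    {"task": "RH",
     "original_prem": "A dog runs.", "counter_prem": None,
     "original_hyp": "An animal is moving.", "counter_hyp": "A cat is moving."},
]

# Same [premise, hypothesis] pair layout as eval_batch.
toy_batch = [[extract_prems(r), extract_hyps(r)] for r in toy_rows]
print(toy_batch)
```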

## RoBERTa-large fine-tuned on MultiNLI
https://github.com/facebookresearch/fairseq/tree/main/examples/roberta

In [None]:
class_map = {"contradiction": 0,
             "neutral": 1,
             "entailment": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

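model = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
model.cuda()
model.eval()  # disable dropout for inference
# Encode each (premise, hypothesis) pair and pad to a common length (pad_idx=1 is RoBERTa's <pad> index).
batch = collate_tokens(
    [model.encode(pair[0], pair[1]) for pair in eval_batch], pad_idx=1
)
with torch.no_grad():  # no gradients are needed for evaluation
    predictions = model.predict('mnli', batch).argmax(dim=1)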

In [None]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
results.append(model_result)
model_result

In [7]:
del model
torch.cuda.empty_cache()

NameError: name 'model' is not defined

## DistilRoBERTa-base fine-tuned on SNLI and MultiNLI
cross-encoder/nli-distilroberta-base


In [None]:
class_map = {"contradiction": 0,
             "entailment": 1,
             "neutral": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('cross-encoder/nli-distilroberta-base')
tokenizer = transformers.AutoTokenizer.from_pretrained('cross-encoder/nli-distilroberta-base')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).cpu().tolist()  # plain Python ints on the CPU
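
With `to_debug = False`, the cell above tokenizes the whole evaluation set and sends it to the GPU as a single batch, which may exceed the available memory on the full dataset. A minimal sketch of chunking the pairs, assuming a hypothetical helper `iter_batches` and a batch size of 4 chosen only for illustration:

```python
def iter_batches(pairs, batch_size):
    """Yield consecutive slices of `pairs` with at most `batch_size` items each."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

# Stand-in for eval_batch: ten made-up [premise, hypothesis] pairs.
toy_pairs = [[f"premise {i}", f"hypothesis {i}"] for i in range(10)]

batch_sizes = [len(chunk) for chunk in iter_batches(toy_pairs, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Each chunk could then be passed through the tokenizer and the model inside the `torch.no_grad()` block, extending a running list of predictions with the argmax of that chunk's logits.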

In [None]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
results.append(model_result)
model_result

In [None]:
del model
torch.cuda.empty_cache()
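
Note that the label order in this section differs from the RoBERTa-large cell (`entailment` and `neutral` are swapped), so reusing the wrong `class_map` would silently mislabel two classes. Hugging Face models expose their mapping as `model.config.id2label`, and one way to avoid hard-coding is to invert it. In the sketch below, `build_class_map` is a hypothetical helper, and the two dictionaries are stand-ins that mirror the `class_map` values used in this notebook rather than values read from the checkpoints.

```python
def build_class_map(id2label):
    """Invert an id -> label mapping into a lowercase label -> id mapping."""
    return {label.lower(): int(idx) for idx, label in id2label.items()}

# Stand-ins for model.config.id2label (assumed to match the notebook's class_map cells).
distilroberta_id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}
bart_id2label = {0: "contradiction", 1: "neutral", 2: "entailment"}

distil_map = build_class_map(distilroberta_id2label)
bart_map = build_class_map(bart_id2label)
print(distil_map)
print(bart_map)
```

In the notebook this would read `class_map = build_class_map(model.config.id2label)` after loading a model, assuming the checkpoint's config carries meaningful label names.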

## BART-large fine-tuned on MultiNLI
facebook/bart-large-mnli

In [None]:
class_map = {"contradiction": 0,
             "neutral": 1,
             "entailment": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = transformers.AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).cpu().tolist()  # plain Python ints on the CPU

In [None]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
results.append(model_result)
model_result

In [None]:
del model
torch.cuda.empty_cache()
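
Since `results` only stores metric dictionaries in evaluation order, the comparison still has to be assembled at the end. A minimal sketch, assuming the order RoBERTa-large, DistilRoBERTa-base, BART-large; the numbers are placeholders chosen to be obviously artificial and are NOT measured results.

```python
model_names = ["RoBERTa-large", "DistilRoBERTa-base", "BART-large"]

# Placeholder metric dicts standing in for `results` (made-up values, not measurements).
placeholder_results = [
    {"precision": 0.1, "recall": 0.1, "f1": 0.1, "accuracy": 0.1},
    {"precision": 0.2, "recall": 0.2, "f1": 0.2, "accuracy": 0.2},
    {"precision": 0.3, "recall": 0.3, "f1": 0.3, "accuracy": 0.3},
]

# Attach a name to each metrics dict, then pick the model with the highest accuracy.
named = dict(zip(model_names, placeholder_results))
best_model = max(named, key=lambda name: named[name]["accuracy"])

for name, metrics in named.items():
    print(f"{name:20s} " + "  ".join(f"{k}={v:.3f}" for k, v in metrics.items()))
print("best by accuracy:", best_model)
```

In the notebook, `placeholder_results` would be replaced by the `results` list filled in by the cells above.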