# Evaluation of different classifiers for NLI

This notebook compares several pretrained classifiers on the natural language inference (NLI) task to find the best-performing one. The best-performing classifier is then used to evaluate the quality of the generated counterfactuals for NLI. The notebook is meant for illustration and debugging only, and its results are not definitive. Please run the script "compare_nli_classifiers.py" to perform the complete evaluation.

The following classifiers are tested:
- RoBERTa Large (the winner, with 77% accuracy)
- DistilRoBERTa
- BART Large
- DeBERTa (several base and large checkpoints)

Evaluation procedure:
- We take the Flickr Counterfactually-Augmented Dataset from Kaushik et al. (cad_flickr_nli.tsv).
- We merge the training and validation sets to create an evaluation set.
- We use this evaluation set to test the performance of each classifier.

In [1]:
import torch
import pandas as pd
import datasets
import transformers
from fairseq.data.data_utils import collate_tokens

to_debug = True
N_TO_DEBUG = 12
n_batches = 12

eval_metrics = {"precision": datasets.load_metric("precision"),
                "recall": datasets.load_metric("recall"),
                "f1": datasets.load_metric("f1"),
                "accuracy": datasets.load_metric("accuracy")
                }

In [2]:
trainset = pd.read_csv("../cad_flickr_nli/fold_0/training_set.tsv", sep='\t')
valset = pd.read_csv("../cad_flickr_nli/fold_0/val_set.tsv", sep='\t')
eval_data = pd.concat([trainset, valset], ignore_index=True)

if to_debug:
    eval_data = eval_data[:N_TO_DEBUG]
eval_data.reset_index(inplace=True, drop=True)

print(len(eval_data))
eval_data.head(1)

12


Unnamed: 0,counter_prem,original_hyp,counter_label,task,counter_hyp,original_prem,original_label
0,A man and three women are preparing a meal of ...,A group of people cooking inside,neutral,RP,,A man and three women are preparing a meal ind...,entailment


In [3]:
def extract_prems(row):
    if row["task"] == "RP":
        return row["counter_prem"]
    else:
        return row["original_prem"]

def extract_hyps(row):
    if row["task"] == "RH":
        return row["counter_hyp"]
    else:
        return row["original_hyp"]

def generate_batches(bac, n):
    # Yield consecutive slices of `bac`, aiming for `n` batches.
    batch_size = max(1, len(bac) // n)  # avoid a zero step when n > len(bac)
    for i in range(0, len(bac), batch_size):
        yield bac[i:i + batch_size]

def evaluate_classifier(preds, labels, eval_m):
    # Compute micro-averaged precision, recall and F1, plus accuracy,
    # from integer class predictions and gold labels.
    metrics = {"precision": eval_m["precision"].compute(predictions=preds, references=labels, average="micro")["precision"],
               "recall": eval_m["recall"].compute(predictions=preds, references=labels, average="micro")["recall"],
               "f1": eval_m["f1"].compute(predictions=preds, references=labels, average="micro")["f1"],
               "accuracy": eval_m["accuracy"].compute(predictions=preds, references=labels)["accuracy"],
               }
    return metrics
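
As a sanity check on the batching helper, here is a minimal standalone sketch of the same slicing logic. The parameter name `items` and the toy inputs are chosen for illustration only. It shows that the debug setting (12 items, 12 batches) gives one item per batch, and that integer division can leave an extra, shorter batch.

```python
def generate_batches(items, n):
    """Yield consecutive slices of `items`, aiming for `n` batches."""
    batch_size = max(1, len(items) // n)  # guard against a zero step
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# 12 items in 12 batches: one item per batch, as in the debug setting.
debug_batches = list(generate_batches(list(range(12)), 12))

# 10 items in 3 batches: 10 // 3 == 3, so a fourth, shorter batch remains.
uneven_batches = list(generate_batches(list(range(10)), 3))

print(len(debug_batches), [len(b) for b in uneven_batches])
```

The number of batches actually produced can therefore exceed `n` when the length is not a multiple of `n`.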

In [4]:
eval_data["premise"] = eval_data.apply(extract_prems, axis=1)
eval_data["hypothesis"] = eval_data.apply(extract_hyps, axis=1)

eval_batch = [[p, h] for p,h in zip(eval_data["premise"].values, eval_data["hypothesis"].values)]
eval_batch[1]

['The baby in the pink romper is crying.', 'The baby is happy.']
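
The selection rule used to build each pair can be sketched on plain dictionaries. This assumes that "RP" marks a row whose premise was revised and "RH" a row whose hypothesis was revised. The function name `select_pair` and both example rows are hypothetical and were invented for illustration; they are not taken from the dataset.

```python
def select_pair(row):
    """Return (premise, hypothesis) for one row given as a dict."""
    premise = row["counter_prem"] if row["task"] == "RP" else row["original_prem"]
    hypothesis = row["counter_hyp"] if row["task"] == "RH" else row["original_hyp"]
    return premise, hypothesis


# Hypothetical rows, invented for illustration only.
rp_row = {
    "task": "RP",
    "original_prem": "A dog runs outside.",
    "counter_prem": "A dog sleeps outside.",
    "original_hyp": "An animal is moving.",
    "counter_hyp": None,
}
rh_row = {
    "task": "RH",
    "original_prem": "A dog runs outside.",
    "counter_prem": None,
    "original_hyp": "An animal is moving.",
    "counter_hyp": "An animal is resting.",
}

rp_pair = select_pair(rp_row)  # revised premise, original hypothesis
rh_pair = select_pair(rh_row)  # original premise, revised hypothesis
print(rp_pair)
print(rh_pair)
```

In both cases the pair is meant to be scored against `counter_label`, the label of the revised example.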

## RoBERTa-large fine-tuned on MultiNLI
https://github.com/facebookresearch/fairseq/tree/main/examples/roberta

In [5]:
class_map = {"contradiction": 0,
             "neutral": 1,
             "entailment": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
model.cuda()
model.eval()
data = collate_tokens(
    [model.encode(pair[0], pair[1]) for pair in eval_batch], pad_idx=1
)
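batches = generate_batches(data, n_batches)
predictions = []
with torch.no_grad():  # inference only, no gradients needed
    for batch in batches:
        # .tolist() stores plain Python ints instead of 0-dim tensors
        predictions += model.predict('mnli', batch).argmax(dim=1).tolist()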

Using cache found in /home/diego/.cache/torch/hub/pytorch_fairseq_main
2023-01-12 16:47:54 | INFO | fairseq.file_utils | loading archive file http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz from cache at /home/diego/.cache/torch/pytorch_fairseq/7685ba8546f9a5ce1a00c7a6d7d44f7e748d22681172f0f391c3d48f487c801c.74e37d47306b3cc51c5f8d335022a392c29f1906c8cd9e9cd3446d7422cf55d8
2023-01-12 16:47:58 | INFO | fairseq.tasks.masked_lm | dictionary: 50264 types
2023-01-12 16:48:06 | INFO | fairseq.models.roberta.model | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'json', 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 8, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': 128, 'fp16

In [6]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5}
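
The four values above are identical, and this is expected: in single-label multiclass classification every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the gold class), so micro-averaged precision, recall and F1 all reduce to accuracy. The following is a minimal pure-Python sketch of that arithmetic. The function `micro_scores` and the toy labels are illustrative assumptions; this is not the implementation used by the `datasets` metrics.

```python
def micro_scores(preds, labels):
    """Micro-averaged precision, recall, F1 and accuracy for integer labels."""
    tp = fp = fn = 0
    for c in set(preds) | set(labels):
        for p, y in zip(preds, labels):
            tp += int(p == c and y == c)
            fp += int(p == c and y != c)
            fn += int(p != c and y == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return precision, recall, f1, accuracy


# Hypothetical toy data: 12 examples, the first 6 correct, the last 6 wrong.
gold = [0, 1, 2] * 4
preds = [0, 1, 2, 0, 1, 2, 1, 2, 0, 1, 2, 0]

print(micro_scores(preds, gold))
```

Macro averaging (`average="macro"`) would give per-class information that micro averaging hides, at the cost of being noisier on very small samples.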

In [7]:
del model
torch.cuda.empty_cache()

## DistilRoBERTa-base fine-tuned on SNLI and MultiNLI
cross-encoder/nli-distilroberta-base


In [8]:
class_map = {"contradiction": 0,
             "entailment": 1,
             "neutral": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('cross-encoder/nli-distilroberta-base')
tokenizer = transformers.AutoTokenizer.from_pretrained('cross-encoder/nli-distilroberta-base')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).tolist()  # plain Python ints for the metrics

In [9]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5}
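
The label order differs between checkpoints, which is why `class_map` is redefined for each model. Instead of hard-coding it, the mapping can be derived from the `id2label` dictionary that Hugging Face model configs expose as `model.config.id2label`. The sketch below is only an illustration: the helper name `class_map_from_id2label` is hypothetical, and the two dictionaries are stand-ins rather than values read from any specific checkpoint.

```python
def class_map_from_id2label(id2label):
    """Invert an id -> label mapping into a lowercase label -> id mapping."""
    return {str(name).lower(): int(idx) for idx, name in id2label.items()}


# Illustrative stand-ins for `model.config.id2label` (not read from a real model).
order_a = {0: "contradiction", 1: "entailment", 2: "neutral"}
order_b = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

map_a = class_map_from_id2label(order_a)
map_b = class_map_from_id2label(order_b)
print(map_a)
print(map_b)

# Intended use in the notebook (assumes `model` is already loaded):
# class_map = class_map_from_id2label(model.config.id2label)
# gold_labels = [class_map[el] for el in eval_data["counter_label"]]
```

If a config only carries generic names such as `LABEL_0`, the lookup of the gold labels fails with a `KeyError`, which is a useful signal that the checkpoint has no NLI-specific classification head.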

In [10]:
del model
torch.cuda.empty_cache()

## BART-large fine-tuned on MultiNLI
facebook/bart-large-mnli

In [11]:
class_map = {"contradiction": 0,
             "neutral": 1,
             "entailment": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = transformers.AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).tolist()  # plain Python ints for the metrics

In [12]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.5833333333333334,
 'recall': 0.5833333333333334,
 'f1': 0.5833333333333334,
 'accuracy': 0.5833333333333334}

In [13]:
del model
torch.cuda.empty_cache()

## DeBERTa base fine-tuned on SuperGLUE NLI
microsoft/deberta-v3-base

In [14]:
class_map = {"contradiction": 0,
             "entailment": 1,
             "neutral": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-base')
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/deberta-v3-base')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).tolist()  # plain Python ints for the metrics

Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a

In [15]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.16666666666666666,
 'recall': 0.16666666666666666,
 'f1': 0.16666666666666666,
 'accuracy': 0.16666666666666666}

In [16]:
del model
torch.cuda.empty_cache()

## DeBERTa large fine-tuned on SuperGLUE NLI
microsoft/deberta-v3-large

In [17]:
class_map = {"contradiction": 0,
             "entailment": 1,
             "neutral": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-large')
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).tolist()  # plain Python ints for the metrics

Some weights of the model checkpoint at microsoft/deberta-v3-large were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [18]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.6666666666666666,
 'recall': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'accuracy': 0.6666666666666666}

In [19]:
del model
torch.cuda.empty_cache()

## DeBERTa large fine-tuned on MNLI
microsoft/deberta-large-mnli

In [20]:
class_map = {"contradiction": 0,
             "entailment": 1,
             "neutral": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-large-mnli')
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/deberta-large-mnli')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).tolist()  # plain Python ints for the metrics

Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.6666666666666666,
 'recall': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'accuracy': 0.6666666666666666}

In [22]:
del model
torch.cuda.empty_cache()

## DeBERTa-v3-base cross-encoder
cross-encoder/nli-deberta-v3-base

In [23]:
class_map = {"contradiction": 0,
             "entailment": 1,
             "neutral": 2
             }
gold_labels = [class_map[el] for el in eval_data["counter_label"]]

model = transformers.AutoModelForSequenceClassification.from_pretrained('cross-encoder/nli-deberta-v3-base')
tokenizer = transformers.AutoTokenizer.from_pretrained('cross-encoder/nli-deberta-v3-base')
features = tokenizer(eval_batch,  padding=True, truncation=True, return_tensors="pt")

model.cuda()
features = features.to('cuda')
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    predictions = scores.argmax(dim=1).tolist()  # plain Python ints for the metrics

In [24]:
model_result = evaluate_classifier(predictions, gold_labels, eval_metrics)
model_result

{'precision': 0.5833333333333334,
 'recall': 0.5833333333333334,
 'f1': 0.5833333333333334,
 'accuracy': 0.5833333333333334}

In [25]:
del model
torch.cuda.empty_cache()
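
To read the debug results side by side, the accuracies printed in the cells above can be collected and converted back into counts of correct predictions. The sketch below copies the values from the outputs of this notebook; the variable names are illustrative. With only 12 examples, each prediction is worth about 8.3 percentage points, so this ranking is not reliable and the complete evaluation script remains the reference.

```python
N_EXAMPLES = 12  # size of the debug subset (N_TO_DEBUG)

# Accuracies copied from the cell outputs of this notebook.
debug_accuracy = {
    "roberta.large.mnli": 0.5,
    "cross-encoder/nli-distilroberta-base": 0.5,
    "facebook/bart-large-mnli": 0.5833333333333334,
    "microsoft/deberta-v3-base": 0.16666666666666666,
    "microsoft/deberta-v3-large": 0.6666666666666666,
    "microsoft/deberta-large-mnli": 0.6666666666666666,
    "cross-encoder/nli-deberta-v3-base": 0.5833333333333334,
}

# Convert each accuracy back into a number of correct predictions.
correct_counts = {name: round(acc * N_EXAMPLES) for name, acc in debug_accuracy.items()}

# Print models from highest to lowest debug accuracy.
for name, acc in sorted(debug_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {correct_counts[name]:2d}/{N_EXAMPLES}  {acc:.3f}")
```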