# ANLI Baseline

This notebook illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to perform natural language inference (NLI) on the ANLI dataset.
The model has 184M parameters. It was trained in 2021 and follows a BERT-like encoding approach:
* The premise and the hypothesis are encoded together, as a single input sequence, by the DeBERTa-v3-base contextual encoder
* A fine-tuned classification head on top of the encoder predicts a distribution over the classification labels (entailment, neutral, contradiction)
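
The last step, turning the classifier's three raw scores (logits) into a distribution over labels, can be sketched without loading the model. This is a minimal illustration in plain Python: the logit values are invented, and the label order is assumed to be the one used in the cells of this notebook.

```python
import math

# Label order assumed to match `label_names` used later in this notebook.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

def logits_to_distribution(logits):
    """Apply a softmax to raw scores and return percentages per label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return {name: round(100 * e / total, 1) for name, e in zip(LABEL_NAMES, exps)}

# Hypothetical logits, made up for illustration (not produced by the model).
example_logits = [-1.2, -0.3, 1.5]
distribution = logits_to_distribution(example_logits)

# The predicted label is the one with the highest probability.
predicted_label = max(distribution, key=distribution.get)
print(distribution, predicted_label)
```

In the actual model the softmax is applied with `torch.softmax`, as in the cells below; the arithmetic is the same.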

Reported accuracy on ANLI round 3 (R3) is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)



In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move the model to the selected device so it matches the input tensors
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)



In [2]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

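# Tokenize the pair as one sequence and move all tensors to the model's device
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(model.device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)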
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}


In [3]:
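def evaluate(premise, hypothesis):
    """Return the label probabilities (in percent) for a premise/hypothesis pair."""
    # Tokenize the pair as one sequence and move all tensors to the model's device
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**inputs)
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction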

In [4]:
evaluate("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [5]:
def get_prediction(pred_dict):
    # Return the label with the highest score. The earlier if/elif chain could
    # return "contradiction" even when "neutral" had the highest score.
    return max(pred_dict, key=pred_dict.get)

In [6]:
get_prediction(evaluate("The weather is nice today.", "It is sunny outside."))

'neutral'

In [7]:
get_prediction(evaluate("It is sunny outside.", "The weather is nice today."))

'entailment'

In [8]:
get_prediction(evaluate("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [10]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only the examples that come with a non-empty annotator reason
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

In [None]:
dataset

In [14]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

### Task 1.1 - Evaluating ANLI samples on test sections

In [17]:
pred_test_r1 = evaluate_on_dataset(dataset['test_r1'])
pred_test_r2 = evaluate_on_dataset(dataset['test_r2'])

100%|██████████| 1000/1000 [05:27<00:00,  3.05it/s]
100%|██████████| 1000/1000 [05:15<00:00,  3.17it/s]


In [15]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [11:14<00:00,  1.78it/s]


In [18]:
pred_test_r1[:5]

[{'premise': 'Ernest Jones is a British jeweller and watchmaker. Established in 1949, its first store was opened in Oxford Street, London. Ernest Jones specialises in diamonds and watches, stocking brands such as Gucci and Emporio Armani. Ernest Jones is part of the Signet Jewelers group.',
  'hypothesis': 'The first Ernest Jones store was opened on the continent of Europe.',
  'prediction': {'entailment': 99.5, 'neutral': 0.1, 'contradiction': 0.3},
  'pred_label': 'entailment',
  'gold_label': 'entailment',
  'reason': "The first store was opened in London, which is in Europe. It may have been difficult for the system because continents weren't mentioned."},
 {'premise': 'Old Trafford is a football stadium in Old Trafford, Greater Manchester, England, and the home of Manchester United. With a capacity of 75,643, it is the largest club football stadium in the United Kingdom, the second-largest football stadium, and the eleventh-largest in Europe. It is about 0.5 mi from Old Trafford C

In [19]:
pred_test_r2[:5]

[{'premise': 'There is a little Shia community in El Salvador. There is an Islamic Library operated by the Shia community, named "Fatimah Az-Zahra". They published the first Islamic magazine in Central America: "Revista Biblioteca Islámica". Additionally, they are credited with providing the first and only Islamic library dedicated to spreading Islamic culture in the country.',
  'hypothesis': 'The community is south of the United States.',
  'prediction': {'entailment': 94.5, 'neutral': 1.7, 'contradiction': 3.8},
  'pred_label': 'entailment',
  'gold_label': 'entailment',
  'reason': 'The community is in El Salvador which is south of the US.'},
 {'premise': '"Look at Me (When I Rock Wichoo)" is a song by American indie rock band Black Kids, taken from their debut album "Partie Traumatic". It was released in the UK by Almost Gold Recordings on September 8, 2008 and debuted on the Top 200 UK Singles Chart at number 175.',
  'hypothesis': 'The song was released in America in September 2

In [16]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [None]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [None]:
from evaluate import combine  # the name `evaluate` is bound to the helper function defined in In [3]
clf_metrics = combine(["accuracy", "f1", "precision", "recall"])

In [None]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
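
The metrics above take integer class ids, while the records built by `evaluate_on_dataset` store label names. A minimal sketch of the conversion follows; the helper name `to_label_ids` and the toy records are hypothetical, and the label order is assumed to be the same as `label_names` in the cells above.

```python
# Same label order as `label_names` in the cells above (assumption).
LABEL_NAMES = ["entailment", "neutral", "contradiction"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}

def to_label_ids(results):
    """Split a list of result dicts into integer predictions and references."""
    predictions = [LABEL_TO_ID[r["pred_label"]] for r in results]
    references = [LABEL_TO_ID[r["gold_label"]] for r in results]
    return predictions, references

# Toy records shaped like the output of `evaluate_on_dataset` (values invented).
toy_results = [
    {"pred_label": "entailment", "gold_label": "entailment"},
    {"pred_label": "neutral", "gold_label": "contradiction"},
]
predictions, references = to_label_ids(toy_results)
print(predictions, references)
```

The two lists could then be passed to a single metric, for example `accuracy.compute(predictions=predictions, references=references)`. With three classes, `f1`, `precision` and `recall` will most likely need an explicit averaging mode such as `average="macro"`, since their default is meant for binary labels.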

## Your Turn

Compute the classification metrics of the baseline model on each section of the ANLI dataset.
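
As a cross-check that does not depend on any metrics library, accuracy and macro-averaged precision, recall and F1 can be computed directly from the result records. This is only one possible starting point: the function name `classification_metrics` and the toy records are hypothetical, and the record fields are assumed to be those produced by `evaluate_on_dataset`.

```python
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]

def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(results):
    """Accuracy and macro-averaged precision/recall/F1 over result dicts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for row in results:
        gold, pred = row["gold_label"], row["pred_label"]
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[gold] += 1  # gold label was missed
    precisions, recalls, f1s = [], [], []
    for label in LABELS:
        prec = safe_div(tp[label], tp[label] + fp[label])
        rec = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(safe_div(2 * prec * rec, prec + rec))
    n = len(LABELS)
    return {
        "accuracy": safe_div(sum(tp.values()), len(results)),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy records shaped like the output of `evaluate_on_dataset` (values invented).
toy_results = [
    {"gold_label": "entailment", "pred_label": "entailment"},
    {"gold_label": "neutral", "pred_label": "entailment"},
    {"gold_label": "contradiction", "pred_label": "contradiction"},
    {"gold_label": "neutral", "pred_label": "neutral"},
]
metrics = classification_metrics(toy_results)
print(metrics)
```

Applied to the notebook's own variables, the same function could be called once per section, for example on `pred_test_r1`, `pred_test_r2` and `pred_test_r3`.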

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Hugging Face `evaluate` library.